genomic-consultant/docs/phase1_implementation_plan.md
2025-11-28 11:52:04 +08:00

Phase 1 Implementation Plan (Genomic Foundation)

Scope: deliver a working trio-based variant pipeline, annotated genomic store, query APIs, initial ACMG evidence tagging, and reporting/logging scaffolding. This assumes local execution with Python 3.11+.

Objectives

  • Trio BAM → joint VCF with QC artifacts (Auto).
  • Annotate variants with population frequency, ClinVar, consequence/prediction (Auto).
  • Provide queryable interfaces for gene/region lookups with filters (Auto).
  • Disease/phenotype → gene panel lookup and filtered variant outputs (Auto+Review).
  • Auto-tag subset of ACMG criteria; human-only final classification (Auto+Review).
  • Produce machine-readable run logs with versions, configs, and overrides.

Work Breakdown

  1. Data & references

    • Reference genome: GRCh38 (primary) with option for GRCh37; pin version hash.
    • Resource bundles: known sites for BQSR (if using GATK), gnomAD/ClinVar versions for annotation.
    • Test fixtures: small trio BAM/CRAM subset or GIAB trio downsampled for CI-like checks.
  2. Variant calling pipeline (wrapper, Auto)

    • Tooling: GATK HaplotypeCaller → gVCF; GenotypeGVCFs for joint calls. Alt: DeepVariant + joint genotyper (parameterized).
    • Steps:
      • Validate inputs (file presence, reference match).
      • Optional QC: coverage, duplicates, on-target.
      • Generate per-sample gVCFs; combine them (CombineGVCFs or GenomicsDBImport, since GenotypeGVCFs takes a single input); joint-genotype to a trio VCF.
    • Outputs: joint VCF + index; QC summary JSON/TSV; log with tool versions and params.
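The GATK path above can be sketched as a command builder. This is a minimal sketch, not the project's wrapper: the function names, the output file names, and the use of CombineGVCFs (rather than GenomicsDBImport) are assumptions made for illustration. The commands are only assembled here, not executed, so they can be logged before running.

```python
from pathlib import Path


def validate_inputs(reference, bams):
    """Raise FileNotFoundError if the reference or any BAM is missing."""
    for path in [reference, *bams.values()]:
        if not Path(path).is_file():
            raise FileNotFoundError(f"missing input: {path}")


def build_trio_calling_commands(reference, bams, out_dir):
    """Return GATK command lines (argument lists) for a trio.

    `bams` maps sample ID -> BAM path. File names under `out_dir`
    are hypothetical and chosen only for this sketch.
    """
    out = Path(out_dir)
    commands = []
    gvcfs = []
    # One HaplotypeCaller run per sample, emitting a gVCF.
    for sample_id, bam in bams.items():
        gvcf = out / f"{sample_id}.g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", str(reference), "-I", str(bam),
            "-O", str(gvcf), "-ERC", "GVCF",
        ])
    # GenotypeGVCFs takes a single input, so combine the gVCFs first.
    combined = out / "trio.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", str(reference)]
    for gvcf in gvcfs:
        combine += ["-V", str(gvcf)]
    combine += ["-O", str(combined)]
    commands.append(combine)
    # Joint genotyping to the trio VCF.
    commands.append([
        "gatk", "GenotypeGVCFs", "-R", str(reference),
        "-V", str(combined), "-O", str(out / "trio.vcf.gz"),
    ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) after validate_inputs succeeds, and written verbatim into the run log as the parameters used.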
  3. Annotation pipeline (Auto)

    • Tooling: Ensembl VEP with gnomAD frequencies and ClinVar significance (from the VEP cache or custom annotation files), plus CADD and SpliceAI plugins where available; alt path: ANNOVAR.
    • Steps:
      • Normalize variants (bcftools norm) if needed.
      • Annotate with gene, transcript, protein change; population AF; ClinVar significance; consequence; predictions (SIFT/PolyPhen/CADD); inheritance flags.
      • Enable SpliceAI/CADD plugins when installed; the CLI accepts extra flags/plugins so that SpliceAI/CADD fields are embedded in the output.
    • Outputs: annotated VCF; flattened TSV/Parquet for faster querying; manifest of DB versions used.
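The flattening step for the TSV/Parquet output can be sketched as follows, assuming the annotated VCF carries VEP's CSQ INFO field, whose header description lists the pipe-separated field order after "Format:". The helper names are hypothetical, and the set of fields present depends on the VEP options and plugins used.

```python
import re


def csq_fields(header_line):
    """Extract the CSQ field names from the VEP INFO header line."""
    match = re.search(r'Format: ([^">]+)', header_line)
    if match is None:
        raise ValueError("no CSQ 'Format:' description found")
    return match.group(1).strip().split("|")


def flatten_csq(info, fields):
    """Yield one dict per transcript-level annotation in an INFO string."""
    for entry in info.split(";"):
        if not entry.startswith("CSQ="):
            continue
        # Annotations are comma-separated; values within one are pipe-separated.
        for record in entry[len("CSQ="):].split(","):
            yield dict(zip(fields, record.split("|")))
```

Each yielded dict is one row of the flattened table; empty strings mark annotations that VEP left blank.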
  4. Genomic store + query API (Auto)

    • Early option: tabix-indexed VCF with Python wrapper.
    • Functions (Python module genomic_store):
      • get_variants_by_gene(individual_id, genes, filters)
        Filters: AF thresholds, consequence categories, ClinVar significance, inheritance pattern.
      • get_variants_by_region(individual_id, chrom, start, end, filters)
      • list_genes_with_variants(individual_id, filters) (optional).
    • Filters defined in a FilterConfig dataclass; serializable for logging.
    • Future option: import to SQLite/Postgres via Arrow/Parquet for richer predicates.
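One possible shape for FilterConfig and the filter logic is sketched below. The field names, the variant dict keys (gene, af, consequence, clinvar, inheritance), and the helper filter_variants_by_gene are assumptions for illustration; the helper works on already-loaded rows rather than on a tabix-indexed VCF.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FilterConfig:
    max_af: Optional[float] = None            # keep variants with AF <= max_af
    consequences: Tuple[str, ...] = ()        # empty tuple means "any"
    clinvar_significance: Tuple[str, ...] = ()
    inheritance: Optional[str] = None

    def to_json(self):
        """Stable serialization, suitable for embedding in a run log."""
        return json.dumps(asdict(self), sort_keys=True)


def passes(variant, cfg):
    """Return True if a variant dict satisfies every active filter."""
    af = variant.get("af")
    # Assumption: a missing AF means "not observed", so it passes the AF cap.
    if cfg.max_af is not None and af is not None and af > cfg.max_af:
        return False
    if cfg.consequences and variant.get("consequence") not in cfg.consequences:
        return False
    if (cfg.clinvar_significance
            and variant.get("clinvar") not in cfg.clinvar_significance):
        return False
    if cfg.inheritance and variant.get("inheritance") != cfg.inheritance:
        return False
    return True


def filter_variants_by_gene(rows, genes, filters):
    """Hypothetical in-memory stand-in for get_variants_by_gene."""
    wanted = set(genes)
    return [v for v in rows if v.get("gene") in wanted and passes(v, filters)]
```

Keeping the dataclass frozen and its fields as tuples makes the config hashable and its JSON form stable, which is what the logging requirement needs.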
  5. Disease/phenotype → gene panel (Auto+Review)

    • Data: HPO/OMIM/GenCC lookup or curated JSON panels.
    • Function: get_gene_panel(disease_or_hpo_id, version=None) returning gene list + provenance.
    • Phenotype resolver: curated JSON mapping (e.g., phenotype_to_genes.example.json) as placeholder until upstream data is wired; allow dynamic panel synthesis by phenotype ID in CLI; support merging multiple sources into one mapping.
    • Flow: resolve panel → call genomic store → apply simple ranking (low AF and high ClinVar pathogenicity rank first).
    • Manual review points: panel curation, rank threshold tuning.
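The resolver and ranking flow might look like the sketch below. The JSON schema (phenotype ID mapped to a gene list and a source label), the helper names, the ClinVar label spellings, and the placeholder IDs are all assumptions; the real phenotype_to_genes.example.json may be shaped differently.

```python
def merge_mappings(*sources):
    """Merge several {phenotype_id: {"genes": [...], "source": str}} mappings."""
    merged = {}
    for mapping in sources:
        for phenotype_id, entry in mapping.items():
            slot = merged.setdefault(phenotype_id, {"genes": [], "sources": []})
            for gene in entry["genes"]:
                if gene not in slot["genes"]:      # de-duplicate, keep order
                    slot["genes"].append(gene)
            if entry["source"] not in slot["sources"]:
                slot["sources"].append(entry["source"])
    return merged


def resolve_panel(mapping, phenotype_id):
    """Return the gene list plus provenance for one phenotype ID."""
    entry = mapping.get(phenotype_id)
    if entry is None:
        raise KeyError(f"no panel for {phenotype_id}")
    return {
        "id": phenotype_id,
        "genes": list(entry["genes"]),
        "provenance": list(entry["sources"]),
    }


# Lower rank sorts first; unknown labels fall to the end.
CLINVAR_RANK = {"Pathogenic": 0, "Likely_pathogenic": 1, "Uncertain_significance": 2}


def rank_variants(variants):
    """Sort by ClinVar pathogenicity first, then by ascending AF."""
    return sorted(
        variants,
        key=lambda v: (CLINVAR_RANK.get(v.get("clinvar"), 3), v.get("af") or 0.0),
    )
```

The ordering of the two sort keys is a design choice left open by the plan; it is one of the things the manual rank-threshold tuning would revisit.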
  6. ACMG evidence tagging (subset, Auto+Review)

    • Criteria to auto-evaluate initially: PVS1 (LoF variant in a gene on the LoF-sensitive list), PM2 (absent/rare in population), BA1/BS1 (common frequency), possibly BP7 (synonymous, no splicing impact).
    • Config: YAML/JSON with thresholds (AF cutoffs, LoF gene list, transcript precedence).
    • Output schema per variant: {variant_id, evidence: [{tag, strength, rationale}, ...], suggested_class}, where suggested_class is computed purely from auto tags; the final class is left blank for the human reviewer.
    • Logging: capture rule version and reasons for each fired rule.
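A rule-based tagger for this subset could be sketched as below. Every threshold is a placeholder to be set in the YAML/JSON config, not a recommended cutoff. The consequence terms, the variant dict keys, the final_class field name, and the deliberately partial combination logic in suggest_class are assumptions for illustration.

```python
from dataclasses import dataclass

LOF_CONSEQUENCES = {
    "stop_gained", "frameshift_variant",
    "splice_donor_variant", "splice_acceptor_variant",
}


@dataclass(frozen=True)
class AcmgConfig:
    rule_version: str = "example-0"
    ba1_af: float = 0.05     # placeholder cutoff
    bs1_af: float = 0.01     # placeholder cutoff
    pm2_af: float = 0.0001   # placeholder cutoff
    lof_genes: frozenset = frozenset()


def suggest_class(tags):
    """Partial combination logic covering only the auto-tagged subset."""
    tags = set(tags)
    if "BA1" in tags:
        return "Benign"
    if tags & {"PVS1", "PM2"} and tags & {"BS1", "BP7"}:
        return "Uncertain significance"   # conflicting evidence
    if {"BS1", "BP7"} <= tags:
        return "Likely benign"
    if {"PVS1", "PM2"} <= tags:
        return "Likely pathogenic"
    return "Uncertain significance"


def tag_variant(variant, cfg):
    evidence = []

    def fire(tag, strength, rationale):
        evidence.append({"tag": tag, "strength": strength, "rationale": rationale})

    af, csq = variant.get("af"), variant.get("consequence")
    if csq in LOF_CONSEQUENCES and variant.get("gene") in cfg.lof_genes:
        fire("PVS1", "very_strong", f"{csq} in LoF-sensitive gene")
    if af is not None and af >= cfg.ba1_af:
        fire("BA1", "stand_alone", f"AF {af} >= {cfg.ba1_af}")
    elif af is not None and af >= cfg.bs1_af:
        fire("BS1", "strong", f"AF {af} >= {cfg.bs1_af}")
    elif af is None or af <= cfg.pm2_af:
        fire("PM2", "moderate", "absent or rare in population data")
    if csq == "synonymous_variant" and not variant.get("splice_impact", False):
        fire("BP7", "supporting", "synonymous with no predicted splice impact")
    return {
        "variant_id": variant["variant_id"],
        "rule_version": cfg.rule_version,
        "evidence": evidence,
        "suggested_class": suggest_class(e["tag"] for e in evidence),
        "final_class": None,   # human-only; never set by automation
    }
```

Each fired rule carries its own rationale string, so the rule version and reasons can be copied into the log without re-deriving them.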
  7. Run logging and versioning

    • Every pipeline run emits run_log.json containing:
      • Inputs (sample IDs, file paths, reference build).
      • Tool versions and parameters; DB versions; config hashes (panel/ACMG configs).
      • Automation level per step; manual overrides (who/when/why).
      • Derived artifacts paths and checksums.
    • Embed run_log reference in reports.
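A run_log.json writer could be as small as the sketch below. The key names and function signatures are assumptions; only the file name run_log.json and the listed contents come from the plan.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Checksum a file in chunks so large VCFs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_log(inputs, tool_versions, config_paths, artifacts, steps,
                  overrides=()):
    """Assemble the run log as a plain dict (key names are hypothetical)."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "inputs": inputs,                      # sample IDs, paths, reference build
        "tool_versions": tool_versions,        # includes DB versions
        "config_hashes": {str(p): sha256_of(p) for p in config_paths},
        "steps": steps,                        # each with its automation level
        "manual_overrides": list(overrides),   # who / when / why
        "artifacts": {str(p): sha256_of(p) for p in artifacts},
    }


def write_run_log(log, out_dir):
    path = Path(out_dir) / "run_log.json"
    path.write_text(json.dumps(log, indent=2, sort_keys=True))
    return path
```

Reports can then reference the returned path, optionally together with the log's own checksum, to tie a report to exactly one run.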
  8. Report template (minimal)

    • Input: disease/gene panel name, variant query results, ACMG evidence tags.
    • Output: Markdown + JSON summary with sections: context, methods, variants table, limitations.
    • Mark human-only decisions clearly.
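The Markdown side of the report can be sketched with plain string assembly. The section wording, column names, the PENDING REVIEW marker, and the input dict keys are assumptions; the JSON summary could be produced from the same inputs with json.dumps.

```python
def render_report(panel_name, variants, run_log_ref):
    """Render a minimal Markdown report; human-only fields are labeled."""
    lines = [
        f"# Variant report: {panel_name}",
        "",
        "## Context",
        f"Panel: {panel_name}. Run log: {run_log_ref}",
        "",
        "## Methods",
        "Automated calling, annotation, and evidence tagging; "
        "see the run log for tool and database versions.",
        "",
        "## Variants",
        "| Variant | Gene | Evidence (auto) | Suggested class (auto) "
        "| Final class (human only) |",
        "| --- | --- | --- | --- | --- |",
    ]
    for v in variants:
        tags = ", ".join(e["tag"] for e in v["evidence"])
        # An empty final class is shown explicitly rather than left blank.
        final = v.get("final_class") or "PENDING REVIEW"
        lines.append(
            f"| {v['variant_id']} | {v['gene']} | {tags} "
            f"| {v['suggested_class']} | {final} |"
        )
    lines += [
        "",
        "## Limitations",
        "Suggested classes are derived from auto tags only and are not "
        "a final classification.",
    ]
    return "\n".join(lines) + "\n"
```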

Milestones

  • M1: Repo scaffolding + configs; tiny trio test data wired; the make pipeline target runs through variant calling + annotation on the fixture.
  • M2: Genomic store wrapper with gene/region queries; filter config; basic CLI/Notebook demo.
  • M3: Panel lookup + ranked variant listing; ACMG auto tags on outputs; run_log generation.
  • M4: Minimal report generator + acceptance of human-reviewed classifications.

Validation Strategy

  • Unit tests for filter logic, panel resolution, ACMG tagger decisions (synthetic variants).
  • Integration test on small fixture trio to ensure call → annotate → query path works.
  • Determinism checks: hash configs and verify that outputs are stable across runs given the same inputs.
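The determinism check can be sketched as a canonical config hash plus a comparison of per-artifact checksums between two runs. The function names are hypothetical. One caveat that may apply: VCF headers can carry run-specific lines such as command lines or dates, so checksums might need to cover record lines only.

```python
import hashlib
import json


def config_hash(config):
    """Hash a config dict via canonical JSON so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_outputs(run_a, run_b):
    """Return sorted artifact names whose checksums differ between two runs.

    Each argument maps artifact name -> checksum; an artifact present in
    only one run also counts as a difference.
    """
    names = set(run_a) | set(run_b)
    return sorted(n for n in names if run_a.get(n) != run_b.get(n))
```

A run pair is considered stable when the config hashes match and diff_outputs returns an empty list.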