Initial commit

2025-11-28 11:52:04 +08:00
commit f74dc351f7
51 changed files with 2402 additions and 0 deletions
--- a/docs/phase1_implementation_plan.md
+++ b/docs/phase1_implementation_plan.md
@@ -0,0 +1,80 @@
+# Phase 1 Implementation Plan (Genomic Foundation)
+
+Scope: deliver a working trio-based variant pipeline, annotated genomic store, query APIs, initial ACMG evidence tagging, and reporting/logging scaffolding. This assumes local execution with Python 3.11+.
+
+## Objectives
+- Trio BAM → joint VCF with QC artifacts (`Auto`).
+- Annotate variants with population frequency, ClinVar, consequence/prediction (`Auto`).
+- Provide queryable interfaces for gene/region lookups with filters (`Auto`).
+- Disease/phenotype → gene panel lookup and filtered variant outputs (`Auto+Review`).
+- Auto-tag a subset of ACMG criteria; final classification stays human-only (`Auto+Review`).
+- Produce machine-readable run logs with versions, configs, and overrides.
+
+## Work Breakdown
+1) **Data & references**
+   - Reference genome: GRCh38 (primary) with option for GRCh37; pin version hash.
+   - Resource bundles: known sites for BQSR (if using GATK), gnomAD/ClinVar versions for annotation.
+   - Test fixtures: small trio BAM/CRAM subset or GIAB trio downsampled for CI-like checks.
+
+2) **Variant calling pipeline (wrapper, `Auto`)**
+   - Tooling: GATK HaplotypeCaller → gVCF; GenotypeGVCFs for joint calls. Alt: DeepVariant + joint genotyper (parameterized).
+   - Steps:
+     - Validate inputs (file presence, reference match).
+     - Optional QC: coverage, duplicate rate, on-target rate.
+     - Generate per-sample gVCF; joint genotyping to trio VCF.
+   - Outputs: joint VCF + index; QC summary JSON/TSV; log with tool versions and params.
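+
+The GATK path above could be wrapped roughly as follows. This is a minimal sketch, not the planned implementation: the function names are hypothetical, and the `CombineGVCFs` step is an assumption (the plan only names HaplotypeCaller and GenotypeGVCFs; GATK also offers GenomicsDBImport for merging gVCFs). The functions only build argument lists, so they can be logged before being passed to `subprocess.run`.
+
+```python
+from pathlib import Path
+
+
+def missing_inputs(paths: list[Path]) -> list[Path]:
+    """Return the input paths that do not exist as regular files."""
+    return [p for p in paths if not Path(p).is_file()]
+
+
+def haplotypecaller_cmd(reference: Path, bam: Path, out_gvcf: Path) -> list[str]:
+    """Build the per-sample gVCF command (GATK4 HaplotypeCaller in GVCF mode)."""
+    return [
+        "gatk", "HaplotypeCaller",
+        "-R", str(reference),
+        "-I", str(bam),
+        "-O", str(out_gvcf),
+        "-ERC", "GVCF",
+    ]
+
+
+def joint_genotyping_cmds(
+    reference: Path, gvcfs: list[Path], combined: Path, out_vcf: Path
+) -> list[list[str]]:
+    """Build the two commands that turn per-sample gVCFs into one trio VCF.
+
+    Assumption: gVCFs are merged with CombineGVCFs before GenotypeGVCFs.
+    """
+    combine = ["gatk", "CombineGVCFs", "-R", str(reference)]
+    for gvcf in gvcfs:
+        combine += ["-V", str(gvcf)]
+    combine += ["-O", str(combined)]
+    genotype = [
+        "gatk", "GenotypeGVCFs",
+        "-R", str(reference),
+        "-V", str(combined),
+        "-O", str(out_vcf),
+    ]
+    return [combine, genotype]
+```
+
+Returning plain lists keeps the exact parameters available for the run log without re-parsing a shell string.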
+
+3) **Annotation pipeline (`Auto`)**
+   - Tooling: Ensembl VEP with plugins for gnomAD, ClinVar, CADD, SpliceAI where available; alt path: ANNOVAR.
+   - Steps:
+     - Normalize variants (bcftools norm) if needed.
+     - Annotate with gene, transcript, protein change; population AF; ClinVar significance; consequence; predictions (SIFT/PolyPhen/CADD); inheritance flags.
+     - Include SpliceAI/CADD plugins if installed; CLI accepts extra flags/plugins to embed SpliceAI/CADD fields.
+   - Outputs: annotated VCF; flattened TSV/Parquet for faster querying; manifest of DB versions used.
+
+4) **Genomic store + query API (`Auto`)**
+   - Early option: tabix-indexed VCF with Python wrapper.
+   - Functions (Python module `genomic_store`):
+     - `get_variants_by_gene(individual_id, genes, filters)`  
+       Filters: AF thresholds, consequence categories, ClinVar significance, inheritance pattern.
+     - `get_variants_by_region(individual_id, chrom, start, end, filters)`
+     - `list_genes_with_variants(individual_id, filters)` (optional).
+   - Filters are defined in a `FilterConfig` dataclass that is serializable for logging.
+   - Future option: import to SQLite/Postgres via Arrow/Parquet for richer predicates.
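+
+One possible shape for `FilterConfig` is sketched below. The field names, the flat-dict variant record (`af`, `consequence`, `clinvar`, `inheritance`), and the `matches` helper are assumptions for illustration; the plan only fixes that the filters live in a dataclass and can be serialized.
+
+```python
+import json
+from dataclasses import asdict, dataclass
+
+
+@dataclass(frozen=True)
+class FilterConfig:
+    """Hypothetical filter set; empty or None fields mean 'do not filter'."""
+
+    max_af: float | None = None
+    consequences: tuple[str, ...] = ()
+    clinvar_significance: tuple[str, ...] = ()
+    inheritance: str | None = None
+
+    def to_json(self) -> str:
+        # Sorted keys give a stable string that can be logged or hashed.
+        return json.dumps(asdict(self), sort_keys=True)
+
+    def matches(self, variant: dict) -> bool:
+        """Return True if a flat variant record passes every active filter."""
+        af = variant.get("af")
+        # Assumption: a missing AF means "not observed" and passes the cutoff.
+        if self.max_af is not None and af is not None and af > self.max_af:
+            return False
+        if self.consequences and variant.get("consequence") not in self.consequences:
+            return False
+        if (
+            self.clinvar_significance
+            and variant.get("clinvar") not in self.clinvar_significance
+        ):
+            return False
+        if self.inheritance and variant.get("inheritance") != self.inheritance:
+            return False
+        return True
+```
+
+A frozen dataclass is used here so that a config cannot change between the moment it is logged and the moment it is applied.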
+
+5) **Disease/phenotype → gene panel (`Auto+Review`)**
+   - Data: HPO/OMIM/GenCC lookup or curated JSON panels.
+   - Function: `get_gene_panel(disease_or_hpo_id, version=None)` returning gene list + provenance.
+   - Phenotype resolver: curated JSON mapping (e.g., `phenotype_to_genes.example.json`) as placeholder until upstream data is wired; allow dynamic panel synthesis by phenotype ID in CLI; support merging multiple sources into one mapping.
+   - Flow: resolve panel → call genomic store → apply simple ranking (variants with low AF and high ClinVar pathogenicity rank first).
+   - Manual review points: panel curation, rank threshold tuning.
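+
+The "simple ranking" step could look like the sketch below. The ClinVar ordering table, the record keys (`clinvar`, `af`), and the choice to treat unknown significance like a VUS are assumptions that would fall under the manual review points above.
+
+```python
+# Hypothetical ordering: lower number ranks earlier in the output.
+CLINVAR_RANK = {
+    "Pathogenic": 0,
+    "Likely_pathogenic": 1,
+    "Uncertain_significance": 2,
+    "Likely_benign": 3,
+    "Benign": 4,
+}
+
+
+def rank_variants(variants: list[dict]) -> list[dict]:
+    """Sort by ClinVar tier first, then by ascending allele frequency."""
+
+    def sort_key(variant: dict) -> tuple[int, float]:
+        # Missing or unrecognized significance is ranked alongside VUS.
+        tier = CLINVAR_RANK.get(variant.get("clinvar"), 2)
+        af = variant.get("af")
+        # A missing AF is treated as 0.0, i.e. rarer than any observed variant.
+        return (tier, af if af is not None else 0.0)
+
+    return sorted(variants, key=sort_key)
+```
+
+Because `sorted` is stable, variants that tie on both keys keep the order returned by the genomic store.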
+
+6) **ACMG evidence tagging (subset, `Auto+Review`)**
+   - Criteria to auto-evaluate initially: PVS1 (LoF in LoF-sensitive gene list), PM2 (absent/rare in population), BA1/BS1 (common frequency), and possibly BP7 (synonymous, no splicing impact).
+   - Config: YAML/JSON with thresholds (AF cutoffs, LoF gene list, transcript precedence).
+   - Output schema per variant: `{variant_id, evidence: [{tag, strength, rationale}], suggested_class}`, with the suggested class computed purely from auto tags; the final class is left blank for the human reviewer.
+   - Logging: capture rule version and reasons for each fired rule.
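+
+As an illustration of the frequency-based criteria only, a tagger might be sketched as follows. The cutoff values are placeholders, not recommendations: BS1 in particular depends on the disorder, and the real thresholds would come from the YAML/JSON config. The class and function names are hypothetical, and `suggested_class` is deliberately not computed here.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class FrequencyThresholds:
+    """Placeholder AF cutoffs; real values belong in the ACMG config."""
+
+    ba1_af: float = 0.05    # above this: BA1 (stand-alone benign)
+    bs1_af: float = 0.01    # above this: BS1 (strong benign), disorder-specific
+    pm2_af: float = 0.0001  # at or below this, or absent: PM2 (moderate)
+
+
+def tag_frequency_evidence(
+    variant_id: str,
+    af: float | None,
+    thresholds: FrequencyThresholds = FrequencyThresholds(),
+) -> dict:
+    """Return the frequency-derived evidence tags for one variant."""
+    evidence = []
+    if af is None:
+        evidence.append({"tag": "PM2", "strength": "moderate",
+                         "rationale": "absent from population data"})
+    elif af <= thresholds.pm2_af:
+        evidence.append({"tag": "PM2", "strength": "moderate",
+                         "rationale": f"AF {af} <= PM2 cutoff {thresholds.pm2_af}"})
+    elif af > thresholds.ba1_af:
+        evidence.append({"tag": "BA1", "strength": "stand-alone",
+                         "rationale": f"AF {af} > BA1 cutoff {thresholds.ba1_af}"})
+    elif af > thresholds.bs1_af:
+        evidence.append({"tag": "BS1", "strength": "strong",
+                         "rationale": f"AF {af} > BS1 cutoff {thresholds.bs1_af}"})
+    return {"variant_id": variant_id, "evidence": evidence}
+```
+
+Each fired rule carries its own rationale string, which is the piece the logging requirement asks to capture alongside the rule version.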
+
+7) **Run logging and versioning**
+   - Every pipeline run emits `run_log.json` containing:
+     - Inputs (sample IDs, file paths, reference build).
+     - Tool versions and parameters; DB versions; config hashes (panel/ACMG configs).
+     - Automation level per step; manual overrides (`who/when/why`).
+     - Derived artifact paths and checksums.
+   - Embed a reference to `run_log.json` in reports.
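+
+A `run_log.json` payload along these lines would cover the fields listed above. The key names, the `build_run_log` helper, and the choice of SHA-256 over canonical JSON for config hashes are assumptions for illustration.
+
+```python
+import hashlib
+import json
+from datetime import datetime, timezone
+
+
+def sha256_of_json(obj) -> str:
+    """Hash a JSON-serializable object independently of dict key order."""
+    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
+    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
+
+
+def build_run_log(
+    sample_ids: list[str],
+    reference_build: str,
+    tool_versions: dict[str, str],
+    configs: dict[str, dict],
+    overrides: list[dict] | None = None,
+) -> dict:
+    """Assemble the run-log dictionary; the caller writes it to run_log.json."""
+    return {
+        "created_at": datetime.now(timezone.utc).isoformat(),
+        "inputs": {"sample_ids": list(sample_ids), "reference_build": reference_build},
+        "tool_versions": dict(tool_versions),
+        # One hash per named config (e.g. panel config, ACMG config).
+        "config_hashes": {name: sha256_of_json(cfg) for name, cfg in configs.items()},
+        # Each override is expected to carry who/when/why keys.
+        "manual_overrides": [dict(o) for o in (overrides or [])],
+    }
+```
+
+Hashing the canonical JSON form rather than the file bytes means that reordering keys or reformatting a config does not change its hash, while any change to a value does.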
+
+8) **Report template (minimal)**
+   - Input: disease/gene panel name, variant query results, ACMG evidence tags.
+   - Output: Markdown + JSON summary with sections: context, methods, variants table, limitations.
+   - Mark human-only decisions clearly.
+
+## Milestones
+- **M1**: Repo scaffolding + configs; tiny trio test data wired; `make pipeline` runs through variant calling + annotation on fixture.
+- **M2**: Genomic store wrapper with gene/region queries; filter config; basic CLI/Notebook demo.
+- **M3**: Panel lookup + ranked variant listing; ACMG auto tags on outputs; run_log generation.
+- **M4**: Minimal report generator + acceptance of human-reviewed classifications.
+
+## Validation Strategy
+- Unit tests for filter logic, panel resolution, ACMG tagger decisions (synthetic variants).
+- Integration test on small fixture trio to ensure call → annotate → query path works.
+- Determinism checks: hash configs and verify that outputs are stable across runs given the same inputs.