Initial commit

2025-11-28 11:52:04 +08:00
commit f74dc351f7
51 changed files with 2402 additions and 0 deletions
--- a/docs/system_architecture.md
+++ b/docs/system_architecture.md
@@ -0,0 +1,69 @@
+# System Architecture Blueprint (v0.1)
+
+This document turns `genomic_decision_support_system_spec_v0.1.md` into a buildable architecture and phased roadmap. Automation levels follow `Auto / Auto+Review / Human-only`.
+
+## High-Level Views
+- **Core layers**: (1) sequencing ingest → variant calling/annotation (`Auto`), (2) genomic query layer (`Auto`), (3) rule engines (ACMG, PGx, DDI, supplements; mixed automation), (4) orchestration/LLM and report generation (`Auto` tool calls, `Auto+Review` outputs).
+- **Data custody**: all PHI/genomic artifacts remain local; external calls require de-identification or private models.
+- **Traceability**: every run records tool versions, database snapshots, configs, and manual overrides in machine-readable logs.
+
+### End-to-end flow
+```
+BAM (proband + parents)
+  ↓ Variant Calling (gVCF) [Auto]
+Joint Genotyper → joint VCF
+  ↓ Annotation (VEP/ANNOVAR + ClinVar/gnomAD etc.) [Auto]
+  ↓ Genomic Store (VCF+tabix or SQL) + Query API [Auto]
+  ↓
+  ├─ Disease/Phenotype → Gene Panel lookup [Auto+Review]
+  │    └─ Panel variants with basic ranking (freq, ClinVar) [Auto+Review]
+  ├─ ACMG evidence tagging subset (PVS1, PM2, BA1, BS1…) [Auto+Review]
+  ├─ PGx genotype→phenotype and recommendation rules [Auto → Auto+Review]
+  ├─ DDI rule evaluation [Auto]
+  └─ Supplement/Herb normalization + interaction rules [Auto+Review → Human-only]
+  ↓
+LLM/Orchestrator routes user questions to tools, produces JSON + Markdown drafts [Auto tools, Auto+Review narratives]
+```
+
+## Phase Roadmap (build-first view)
+- **Phase 1 – Genomic foundation**  
+  - Deliverables: trio joint VCF + annotation; query functions (`get_variants_by_gene`, `get_variants_by_region`); disease→gene panel lookup; partial ACMG evidence tagging.  
+  - Data stores: tabix-backed VCF wrapper initially; optional SQLite/Postgres import later.  
+  - Interfaces: Python CLI/SDK first; machine-readable run logs with versions and automation levels.
+- **Phase 2 – PGx & DDI**  
+  - Drug vocabulary normalization (ATC/RxNorm).  
+  - PGx engine: star-allele calling or rule-based genotype→phenotype; guideline-mapped advice with review gates.  
+  - DDI engine: rule base with severity tiers; combine with PGx outputs.
+- **Phase 3 – Supplements & Herbs**  
+  - Name/ingredient normalization; herb formula expansion.  
+  - Rule tables for CYP/transporters, coagulation, CNS effects.  
+  - Evidence grading and conservative messaging; human-only final clinical language.
+- **Phase 4 – LLM Interface & Reports**  
+  - Tool-calling schema for queries listed above.  
+  - JSON + Markdown report templates with traceability to rules, data versions, and overrides.
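The Phase 1 query functions could be prototyped before any tabix or SQL backend exists. The sketch below is an in-memory stand-in, not the project's implementation: the `Variant` fields, the `InMemoryVariantStore` class, and the `max_af` / `consequences` filter parameters are all assumptions made for illustration; only the two function names come from the roadmap.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Variant:
    # Hypothetical record layout; a real store would derive this from the annotated VCF.
    chrom: str
    pos: int                           # 1-based position
    ref: str
    alt: str
    gene: str
    consequence: str
    gnomad_af: Optional[float] = None  # None when the variant is absent from gnomAD
    clinvar: Optional[str] = None


class InMemoryVariantStore:
    """Placeholder backend exposing the two query functions named in Phase 1."""

    def __init__(self, variants: Iterable[Variant]):
        self._variants = list(variants)

    def get_variants_by_gene(self, gene: str, max_af: Optional[float] = None,
                             consequences: Optional[Set[str]] = None) -> List[Variant]:
        hits = (v for v in self._variants if v.gene == gene)
        return self._apply_filters(hits, max_af, consequences)

    def get_variants_by_region(self, chrom: str, start: int, end: int,
                               max_af: Optional[float] = None,
                               consequences: Optional[Set[str]] = None) -> List[Variant]:
        # Region bounds are treated as 1-based and inclusive (an assumption).
        hits = (v for v in self._variants
                if v.chrom == chrom and start <= v.pos <= end)
        return self._apply_filters(hits, max_af, consequences)

    @staticmethod
    def _apply_filters(variants, max_af, consequences) -> List[Variant]:
        kept = []
        for v in variants:
            # Variants with no recorded frequency pass the frequency filter.
            if max_af is not None and v.gnomad_af is not None and v.gnomad_af > max_af:
                continue
            if consequences is not None and v.consequence not in consequences:
                continue
            kept.append(v)
        return kept
```

Keeping the filters in one helper means a later tabix or SQL backend only has to replace the two lookup generators while callers keep the same signatures.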
+
+## Module Boundaries
+- **Variant Calling Pipeline** (`Auto`): wrapper around GATK or DeepVariant + joint genotyper; pluggable reference genome; QC summaries.  
+- **Annotation Pipeline** (`Auto`): VEP/ANNOVAR with pinned database versions (gnomAD, ClinVar, transcript set); emits annotated VCF + flat table.  
+- **Genomic Query Layer** (`Auto`): abstraction over tabix or SQL; minimal APIs: `get_variants_by_gene`, `get_variants_by_region`, filters (freq, consequence, clinvar).  
+- **Disease/Phenotype to Panel** (`Auto+Review`): HPO/OMIM lookups or curated panels; panel versioned; feeds queries.  
+- **Phenotype Resolver** (`Auto+Review`): JSON/DB mapping of phenotype/HPO IDs to gene lists as a placeholder until upstream sources are integrated; can synthesize panels dynamically and merge multiple sources.  
+- **ACMG Evidence Tagger** (`Auto+Review`): auto-evaluable criteria only; config-driven thresholds; human-only final classification.  
+- **PGx Engine** (`Auto → Auto+Review`): star-allele calling where possible; guideline rules (CPIC/DPWG) with conservative defaults; flag items needing review.  
+- **DDI Engine** (`Auto`): rule tables keyed by normalized drug IDs; outputs severity and rationale.  
+- **Supplements/Herbs** (`Auto+Review → Human-only`): ingredient extraction + mapping; interaction rules; human sign-off for clinical language.  
+- **Orchestrator/LLM** (`Auto tools, Auto+Review outputs`): intent parsing, tool sequencing, safety guardrails, report drafting.
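As one illustration of the "auto-evaluable criteria only; config-driven thresholds" idea for the ACMG Evidence Tagger, the sketch below tags frequency-based criteria (BA1, BS1, PM2) from a population allele frequency. The `FrequencyThresholds` class, the function name, and the numeric cutoffs are assumptions for illustration; in the design above the real values would come from a versioned config, and the returned tags are suggestions for review, not a classification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FrequencyThresholds:
    # Placeholder values only; real cutoffs belong in a versioned, reviewed config.
    ba1_min_af: float = 0.05
    bs1_min_af: float = 0.01
    pm2_max_af: float = 0.0001


def tag_frequency_evidence(
    allele_freq: Optional[float],
    thresholds: FrequencyThresholds = FrequencyThresholds(),
) -> List[str]:
    """Return suggested ACMG frequency tags for one variant.

    allele_freq is None when the variant is absent from the population
    database; this sketch treats that as a PM2 candidate.
    """
    if allele_freq is None:
        return ["PM2"]
    if allele_freq > thresholds.ba1_min_af:
        return ["BA1"]
    if allele_freq > thresholds.bs1_min_af:
        return ["BS1"]
    if allele_freq <= thresholds.pm2_max_af:
        return ["PM2"]
    return []  # frequency is uninformative under these thresholds
```

For example, `tag_frequency_evidence(0.2)` returns `["BA1"]` under the placeholder thresholds, while a frequency between the PM2 and BS1 cutoffs yields no tag. Absence from a database can also reflect poor coverage, which is one reason the tagger's output stays in `Auto+Review`.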
+
+## Observability and Versioning
+- Every pipeline run writes a JSON log: tool versions, reference genome, DB versions, config hashes, automation level per step, manual overrides (who/when/why).
+- Reports embed references to those logs so outputs remain reproducible.
+- Configs (ACMG thresholds, gene panels, PGx rules) are versioned artifacts stored alongside code.
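A run log of the kind described above might be assembled as follows. This is a minimal sketch: the key names, the `build_run_log` helper, and the choice of SHA-256 over canonical JSON for config hashes are assumptions, not a schema defined by this document.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List, Sequence


def config_hash(config: dict) -> str:
    """Hash a config so that key order does not change the digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_run_log(
    tool_versions: Dict[str, str],
    reference_genome: str,
    db_versions: Dict[str, str],
    configs: Dict[str, dict],
    steps: List[dict],                 # e.g. {"name": ..., "automation_level": ...}
    manual_overrides: Sequence[dict] = (),  # e.g. {"who": ..., "when": ..., "why": ...}
) -> dict:
    """Collect the fields listed above into one JSON-serializable dict."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "tool_versions": tool_versions,
        "reference_genome": reference_genome,
        "db_versions": db_versions,
        "config_hashes": {name: config_hash(cfg) for name, cfg in configs.items()},
        "steps": steps,
        "manual_overrides": list(manual_overrides),
    }
```

Hashing a canonical serialization rather than the raw file bytes means reformatting a config does not look like a change, while any edit to a threshold does.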
+
+## Security/Privacy Notes
+- Default to local processing; if external LLMs are used, strip identifiers and avoid full VCF uploads.
+- Secrets are kept out of the repo; use environment variables or local config files (excluded by `.gitignore`).
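The "strip identifiers" step before any external LLM call could start as a key-based filter like the one below. The field names in `IDENTIFIER_FIELDS` and the `strip_identifiers` helper are hypothetical, and dropping keys is only a first line of defense: free-text fields and the genomic data itself can still identify a person, which is why the note above also avoids full VCF uploads.

```python
from typing import Any

# Hypothetical identifier keys; the real list depends on the payload schema.
IDENTIFIER_FIELDS = frozenset({"patient_name", "mrn", "date_of_birth", "sample_id"})


def strip_identifiers(payload: Any) -> Any:
    """Return a copy of payload with identifier keys removed at every nesting level."""
    if isinstance(payload, dict):
        return {
            key: strip_identifiers(value)
            for key, value in payload.items()
            if key not in IDENTIFIER_FIELDS
        }
    if isinstance(payload, list):
        return [strip_identifiers(item) for item in payload]
    return payload  # scalars pass through unchanged
```

The function returns a new structure instead of mutating its input, so the local copy that retains identifiers stays intact for the run log.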
+
+## Initial Tech Bets (to be validated)
+- Language/runtime: Python 3.11+ for pipelines, rules, and orchestration stubs.
+- Bio stack candidates: GATK or DeepVariant; VEP; tabix for early querying; SQLAlchemy + SQLite/Postgres when scaling.  
+- Infra: containerized runners for pipelines; Makefiles or a workflow engine (Nextflow/Snakemake) later if needed.