genomic-consultant/docs/system_architecture.md
2025-11-28 11:52:04 +08:00

# System Architecture Blueprint (v0.1)
This document turns `genomic_decision_support_system_spec_v0.1.md` into a buildable architecture and phased roadmap. Automation levels follow `Auto / Auto+Review / Human-only`.
## High-Level Views
- **Core layers**: (1) sequencing ingest → variant calling/annotation (`Auto`), (2) genomic query layer (`Auto`), (3) rule engines (ACMG, PGx, DDI, supplements; mixed automation), (4) orchestration/LLM and report generation (`Auto` tool calls, `Auto+Review` outputs).
- **Data custody**: all PHI/genomic artifacts remain local; external calls require de-identification or private models.
- **Traceability**: every run records tool versions, database snapshots, configs, and manual overrides in machine-readable logs.
### End-to-end flow
```
BAM (proband + parents)
  ↓ Variant Calling (gVCF) [Auto]
Joint Genotyper → joint VCF
  ↓ Annotation (VEP/ANNOVAR + ClinVar/gnomAD etc.) [Auto]
  ↓ Genomic Store (VCF+tabix or SQL) + Query API [Auto]
      ├─ Disease/Phenotype → Gene Panel lookup [Auto+Review]
      │   └─ Panel variants with basic ranking (freq, ClinVar) [Auto+Review]
      ├─ ACMG evidence tagging subset (PVS1, PM2, BA1, BS1…) [Auto+Review]
      ├─ PGx genotype→phenotype and recommendation rules [Auto → Auto+Review]
      ├─ DDI rule evaluation [Auto]
      └─ Supplement/Herb normalization + interaction rules [Auto+Review → Human-only]
LLM/Orchestrator routes user questions to tools, produces JSON + Markdown drafts [Auto tools, Auto+Review narratives]
```
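
The flow above could be encoded as data so that run logs and review gates read automation levels from one place. The following is a minimal sketch, not the project's actual model: the names `Automation`, `Stage`, `STAGES`, and `needs_human` are hypothetical, and the orchestrator step is left out because its tools and narratives carry different levels.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Automation(Enum):
    AUTO = "Auto"
    AUTO_REVIEW = "Auto+Review"
    HUMAN_ONLY = "Human-only"


@dataclass(frozen=True)
class Stage:
    name: str
    start: Automation                    # level at which the step begins
    end: Optional[Automation] = None     # set only for steps that escalate, e.g. "Auto → Auto+Review"

    @property
    def strictest(self) -> Automation:
        """Level that governs sign-off: the escalated level when one is given."""
        return self.end or self.start


# Stage names are illustrative; the levels are copied from the flow diagram.
STAGES = [
    Stage("variant_calling", Automation.AUTO),
    Stage("annotation", Automation.AUTO),
    Stage("genomic_store_query", Automation.AUTO),
    Stage("panel_lookup", Automation.AUTO_REVIEW),
    Stage("acmg_evidence_tagging", Automation.AUTO_REVIEW),
    Stage("pgx_rules", Automation.AUTO, Automation.AUTO_REVIEW),
    Stage("ddi_rules", Automation.AUTO),
    Stage("supplement_herb_rules", Automation.AUTO_REVIEW, Automation.HUMAN_ONLY),
]


def needs_human(stages):
    """Return the names of stages whose strictest level is not fully automatic."""
    return [s.name for s in stages if s.strictest is not Automation.AUTO]
```

Calling `needs_human(STAGES)` lists the steps that should block on a reviewer before their output reaches a report.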
## Phase Roadmap (build-first view)
- **Phase 1: Genomic foundation**
  - Deliverables: trio joint VCF + annotation; query functions (`get_variants_by_gene/region`); disease→gene panel lookup; partial ACMG evidence tagging.
  - Data stores: tabix-backed VCF wrapper initially; optional SQLite/Postgres import later.
  - Interfaces: Python CLI/SDK first; machine-readable run logs with versions and automation levels.
- **Phase 2: PGx & DDI**
  - Drug vocabulary normalization (ATC/RxNorm).
  - PGx engine: star-allele calling or rule-based genotype→phenotype; guideline-mapped advice with review gates.
  - DDI engine: rule base with severity tiers; combine with PGx outputs.
- **Phase 3: Supplements & Herbs**
  - Name/ingredient normalization; herb formula expansion.
  - Rule tables for CYP/transporters, coagulation, CNS effects.
  - Evidence grading and conservative messaging; human-only final clinical language.
- **Phase 4: LLM Interface & Reports**
  - Tool-calling schema for queries listed above.
  - JSON + Markdown report templates with traceability to rules, data versions, and overrides.
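
To make the Phase 1 query functions concrete, here is a small in-memory sketch of `get_variants_by_gene` and `get_variants_by_region` with frequency, consequence, and ClinVar filters. It is an assumption-laden illustration: the `Variant` fields, the keyword arguments, and the `DEMO` records are invented for this example, and a real implementation would read from the tabix-backed VCF or SQL store rather than a Python list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int                      # 1-based position
    ref: str
    alt: str
    gene: str
    consequence: str
    gnomad_af: Optional[float]    # None when the variant is absent from the frequency source
    clinvar: Optional[str]        # None when there is no ClinVar record


def _passes(v, max_af, consequences, clinvar):
    """Apply the optional filters; a filter set to None is skipped."""
    if max_af is not None and v.gnomad_af is not None and v.gnomad_af > max_af:
        return False
    if consequences is not None and v.consequence not in consequences:
        return False
    if clinvar is not None and v.clinvar not in clinvar:
        return False
    return True


def get_variants_by_gene(variants, gene, *, max_af=None, consequences=None, clinvar=None):
    return [v for v in variants
            if v.gene == gene and _passes(v, max_af, consequences, clinvar)]


def get_variants_by_region(variants, chrom, start, end, *,
                           max_af=None, consequences=None, clinvar=None):
    """Region bounds are treated as 1-based and inclusive in this sketch."""
    return [v for v in variants
            if v.chrom == chrom and start <= v.pos <= end
            and _passes(v, max_af, consequences, clinvar)]


# Made-up records with placeholder gene names, for demonstration only.
DEMO = [
    Variant("1", 100, "A", "G", "GENE_A", "missense_variant", 0.0001, "Pathogenic"),
    Variant("1", 250, "C", "T", "GENE_A", "synonymous_variant", 0.12, "Benign"),
    Variant("2", 500, "G", "A", "GENE_B", "stop_gained", None, None),
]

rare_in_gene_a = get_variants_by_gene(DEMO, "GENE_A", max_af=0.01)
```

Variants with no frequency record pass the `max_af` filter here, on the assumption that absence from the database should not hide a variant from review; that choice would need to be confirmed against the spec.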
## Module Boundaries
- **Variant Calling Pipeline** (`Auto`): wrapper around GATK or DeepVariant + joint genotyper; pluggable reference genome; QC summaries.
- **Annotation Pipeline** (`Auto`): VEP/ANNOVAR with pinned database versions (gnomAD, ClinVar, transcript set); emits annotated VCF + flat table.
- **Genomic Query Layer** (`Auto`): abstraction over tabix or SQL; minimal APIs: `get_variants_by_gene`, `get_variants_by_region`, filters (freq, consequence, clinvar).
- **Disease/Phenotype to Panel** (`Auto+Review`): HPO/OMIM lookups or curated panels; panel versioned; feeds queries.
- **Phenotype Resolver** (`Auto+Review`): JSON/DB mapping of phenotype/HPO IDs to gene lists as a placeholder until upstream sources are integrated; can synthesize panels dynamically and merge multiple sources.
- **ACMG Evidence Tagger** (`Auto+Review`): auto-evaluable criteria only; config-driven thresholds; human-only final classification.
- **PGx Engine** (`Auto → Auto+Review`): star-allele calling where possible; guideline rules (CPIC/DPWG) with conservative defaults; flag items needing review.
- **DDI Engine** (`Auto`): rule tables keyed by normalized drug IDs; outputs severity and rationale.
- **Supplements/Herbs** (`Auto+Review → Human-only`): ingredient extraction + mapping; interaction rules; human sign-off for clinical language.
- **Orchestrator/LLM** (`Auto tools, Auto+Review outputs`): intent parsing, tool sequencing, safety guardrails, report drafting.
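
As an illustration of the ACMG Evidence Tagger boundary (auto-evaluable criteria, config-driven thresholds, no automated final classification), the sketch below tags only the frequency-based criteria named in the flow diagram. Everything here is hypothetical: the function name, the config keys, and especially the threshold values, which are placeholders and would normally be tuned per disorder and reviewed.

```python
from typing import Optional

# Hypothetical config; the numbers are placeholders, not recommended cutoffs.
ACMG_CONFIG = {
    "BA1_min_af": 0.05,     # at or above this frequency: stand-alone benign evidence
    "BS1_min_af": 0.01,     # at or above this frequency: strong benign evidence
    "PM2_max_af": 0.0001,   # at or below this frequency (or absent): moderate pathogenic evidence
}


def tag_frequency_evidence(af: Optional[float], config=ACMG_CONFIG) -> dict:
    """Tag frequency-based evidence codes for one variant.

    `af` is the population allele frequency, or None when the variant is
    absent from the frequency source (treated as PM2 in this sketch).
    """
    tags = []
    if af is None or af <= config["PM2_max_af"]:
        tags.append("PM2")
    elif af >= config["BA1_min_af"]:
        tags.append("BA1")
    elif af >= config["BS1_min_af"]:
        tags.append("BS1")
    return {
        "tags": tags,
        "automation": "Auto+Review",
        # The tagger never sets a classification; that stays human-only.
        "final_classification": None,
        "thresholds_used": dict(config),
    }
```

Returning the thresholds alongside the tags keeps each result traceable to the config version that produced it.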
## Observability and Versioning
- Every pipeline run writes a JSON log: tool versions, reference genome, DB versions, config hashes, automation level per step, manual overrides (who/when/why).
- Reports embed references to those logs so outputs remain reproducible.
- Configs (ACMG thresholds, gene panels, PGx rules) are versioned artifacts stored alongside code.
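
A run log with the fields listed above might be assembled as follows. This is a sketch under assumptions: the key names, the `build_run_log` and `config_hash` helpers, and the choice of SHA-256 over canonical JSON are illustrative and not taken from the spec.

```python
import hashlib
import json
from datetime import datetime, timezone


def config_hash(config: dict) -> str:
    """Hash a config deterministically: key order does not affect the digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_run_log(tool_versions, reference_genome, db_versions, configs, steps, overrides=()):
    """Collect everything needed to reproduce a run into one JSON-serializable dict."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool_versions": tool_versions,
        "reference_genome": reference_genome,
        "db_versions": db_versions,
        "config_hashes": {name: config_hash(cfg) for name, cfg in configs.items()},
        "steps": steps,                        # automation level per step
        "manual_overrides": list(overrides),   # who / when / why
    }


# Placeholder values only; real runs would fill in actual versions.
log = build_run_log(
    tool_versions={"variant_caller": "x.y.z", "annotator": "x.y.z"},
    reference_genome="<reference build>",
    db_versions={"clinvar": "<snapshot>", "gnomad": "<release>"},
    configs={"acmg_thresholds": {"BA1_min_af": 0.05}},
    steps=[{"name": "annotation", "automation": "Auto"}],
    overrides=[{"who": "<reviewer>", "when": "<timestamp>", "why": "<reason>"}],
)
serialized = json.dumps(log, indent=2)
```

A report can then embed the log's file path or hash so a reader can trace any statement back to the exact inputs.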
## Security/Privacy Notes
- Default to local processing; if external LLMs are used, strip identifiers and avoid full VCF uploads.
- Keep secrets out of the repo; rely on environment variables or local config files (excluded by `.gitignore`).
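
The two notes above could look like this in code. It is a sketch only: the field names in `IDENTIFIER_FIELDS`, the `LLM_API_KEY` variable name, and both helper functions are hypothetical, and dropping named fields is just one part of de-identification, since genomic data can itself be identifying.

```python
import os

# Hypothetical identifier keys; the real list depends on the data model.
IDENTIFIER_FIELDS = {"patient_name", "date_of_birth", "mrn", "sample_id", "vcf_path"}


def deidentify(value):
    """Return a copy of a payload with identifier fields removed at any nesting depth."""
    if isinstance(value, dict):
        return {k: deidentify(v) for k, v in value.items() if k not in IDENTIFIER_FIELDS}
    if isinstance(value, list):
        return [deidentify(v) for v in value]
    return value


def get_llm_api_key(var_name: str = "LLM_API_KEY") -> str:
    """Read the key from the environment so it never lives in the repo."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set; refusing to call an external model")
    return key


payload = {
    "patient_name": "<name>",
    "question": "Summarize panel findings",
    "findings": [{"gene": "GENE_A", "sample_id": "<id>", "tags": ["PM2"]}],
}
safe_payload = deidentify(payload)
```

Only `safe_payload` would be passed to an external model; the original `payload` stays local.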
## Initial Tech Bets (to be validated)
- Language/runtime: Python 3.11+ for pipelines, rules, and orchestration stubs.
- Bio stack candidates: GATK or DeepVariant; VEP; tabix for early querying; SQLAlchemy + SQLite/Postgres when scaling.
- Infra: containerized runners for pipelines; Makefiles first, with a workflow engine (Nextflow/Snakemake) later if needed.