genomic-consultant/docs/system_architecture.md
2025-11-28 11:52:04 +08:00

System Architecture Blueprint (v0.1)

This document turns genomic_decision_support_system_spec_v0.1.md into a buildable architecture and a phased roadmap. Every step is labeled with one of three automation levels: Auto, Auto+Review, or Human-only.

High-Level Views

  • Core layers: (1) sequencing ingest → variant calling/annotation (Auto), (2) genomic query layer (Auto), (3) rule engines (ACMG, PGx, DDI, supplements; mixed automation), (4) orchestration/LLM and report generation (Auto tool calls, Auto+Review outputs).
  • Data custody: all PHI/genomic artifacts remain local; external calls require de-identification or private models.
  • Traceability: every run records tool versions, database snapshots, configs, and manual overrides in machine-readable logs.

End-to-end flow

BAM (proband + parents)
  ↓ Variant Calling (gVCF) [Auto]
Joint Genotyper → joint VCF
  ↓ Annotation (VEP/ANNOVAR + ClinVar/gnomAD etc.) [Auto]
  ↓ Genomic Store (VCF+tabix or SQL) + Query API [Auto]
  ↓
  ├─ Disease/Phenotype → Gene Panel lookup [Auto+Review]
  │    └─ Panel variants with basic ranking (freq, ClinVar) [Auto+Review]
  ├─ ACMG evidence tagging subset (PVS1, PM2, BA1, BS1…) [Auto+Review]
  ├─ PGx genotype→phenotype and recommendation rules [Auto → Auto+Review]
  ├─ DDI rule evaluation [Auto]
  └─ Supplement/Herb normalization + interaction rules [Auto+Review → Human-only]
  ↓
LLM/Orchestrator routes user questions to tools, produces JSON + Markdown drafts [Auto tools, Auto+Review narratives]
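
The automation labels in the flow above could be carried as data so that later stages (logging, report gating) can read them instead of hard-coding them. A minimal sketch: the stage names are paraphrased from the diagram, `Stage` and `stages_needing_human` are hypothetical helpers, and steps shown with a range (e.g. Auto → Auto+Review) are recorded at their stricter level, which is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Automation(Enum):
    AUTO = "Auto"
    AUTO_REVIEW = "Auto+Review"
    HUMAN_ONLY = "Human-only"


@dataclass(frozen=True)
class Stage:
    name: str
    level: Automation


# Hypothetical stage names paraphrasing the flow diagram.
# Ranged labels are stored at their stricter end (an assumption).
PIPELINE = [
    Stage("variant_calling", Automation.AUTO),
    Stage("annotation", Automation.AUTO),
    Stage("genomic_store_query", Automation.AUTO),
    Stage("panel_lookup", Automation.AUTO_REVIEW),
    Stage("acmg_evidence_tagging", Automation.AUTO_REVIEW),
    Stage("pgx_rules", Automation.AUTO_REVIEW),
    Stage("ddi_rules", Automation.AUTO),
    Stage("supplement_herb_rules", Automation.HUMAN_ONLY),
    Stage("report_narrative", Automation.AUTO_REVIEW),
]


def stages_needing_human(pipeline):
    """Return the names of stages whose level is anything other than Auto."""
    return [s.name for s in pipeline if s.level is not Automation.AUTO]
```

Calling `stages_needing_human(PIPELINE)` lists the steps a reviewer has to touch before a report can leave the system.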

Phase Roadmap (build-first view)

  • Phase 1: Genomic foundation
    • Deliverables: trio joint VCF + annotation; query functions (get_variants_by_gene/region); disease→gene panel lookup; partial ACMG evidence tagging.
    • Data stores: tabix-backed VCF wrapper initially; optional SQLite/Postgres import later.
    • Interfaces: Python CLI/SDK first; machine-readable run logs with versions and automation levels.
  • Phase 2: PGx & DDI
    • Drug vocabulary normalization (ATC/RxNorm).
    • PGx engine: star-allele calling or rule-based genotype→phenotype; guideline-mapped advice with review gates.
    • DDI engine: rule base with severity tiers; combine with PGx outputs.
  • Phase 3: Supplements & Herbs
    • Name/ingredient normalization; herb formula expansion.
    • Rule tables for CYP/transporters, coagulation, CNS effects.
    • Evidence grading and conservative messaging; human-only final clinical language.
  • Phase 4: LLM Interface & Reports
    • Tool-calling schema for queries listed above.
    • JSON + Markdown report templates with traceability to rules, data versions, and overrides.
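
The Phase 1 query functions could start from an interface like the one below. This is a sketch, not the planned implementation: it uses an in-memory list as a stand-in for the tabix or SQL backend, and the `Variant` fields, the `InMemoryStore` class, and the filter parameter names (`max_af`, `consequences`, `clinvar`) are all assumptions. Only `get_variants_by_gene` and `get_variants_by_region` come from the roadmap.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int                    # 1-based position
    ref: str
    alt: str
    gene: str
    consequence: str            # e.g. "missense_variant"
    gnomad_af: Optional[float]  # None when absent from the frequency DB
    clinvar: Optional[str]      # e.g. "Pathogenic"; None when unlisted


class InMemoryStore:
    """Hypothetical stand-in for the tabix/SQL backend."""

    def __init__(self, variants: Iterable[Variant]):
        self._variants = list(variants)

    def get_variants_by_gene(self, gene, max_af=None, consequences=None, clinvar=None):
        hits = (v for v in self._variants if v.gene == gene)
        return self._apply_filters(hits, max_af, consequences, clinvar)

    def get_variants_by_region(self, chrom, start, end,
                               max_af=None, consequences=None, clinvar=None):
        # start and end are inclusive, 1-based.
        hits = (v for v in self._variants
                if v.chrom == chrom and start <= v.pos <= end)
        return self._apply_filters(hits, max_af, consequences, clinvar)

    @staticmethod
    def _apply_filters(variants, max_af, consequences, clinvar):
        out = []
        for v in variants:
            # Variants with no frequency record are treated as rare and kept.
            if max_af is not None and v.gnomad_af is not None and v.gnomad_af > max_af:
                continue
            if consequences is not None and v.consequence not in consequences:
                continue
            if clinvar is not None and v.clinvar not in clinvar:
                continue
            out.append(v)
        return out
```

Keeping the filters in one private helper means a later SQL backend can push them into a `WHERE` clause without changing the two public signatures.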

Module Boundaries

  • Variant Calling Pipeline (Auto): wrapper around GATK or DeepVariant + joint genotyper; pluggable reference genome; QC summaries.
  • Annotation Pipeline (Auto): VEP/ANNOVAR with pinned database versions (gnomAD, ClinVar, transcript set); emits annotated VCF + flat table.
  • Genomic Query Layer (Auto): abstraction over tabix or SQL; minimal APIs: get_variants_by_gene, get_variants_by_region, filters (freq, consequence, clinvar).
  • Disease/Phenotype to Panel (Auto+Review): HPO/OMIM lookups or curated panels; panel versioned; feeds queries.
  • Phenotype Resolver (Auto+Review): JSON/DB mapping of phenotype/HPO IDs to gene lists as a placeholder until upstream sources are integrated; can synthesize panels dynamically and merge multiple sources.
  • ACMG Evidence Tagger (Auto+Review): auto-evaluable criteria only; config-driven thresholds; human-only final classification.
  • PGx Engine (Auto → Auto+Review): star-allele calling where possible; guideline rules (CPIC/DPWG) with conservative defaults; flag items needing review.
  • DDI Engine (Auto): rule tables keyed by normalized drug IDs; outputs severity and rationale.
  • Supplements/Herbs (Auto+Review → Human-only): ingredient extraction + mapping; interaction rules; human sign-off for clinical language.
  • Orchestrator/LLM (Auto tools, Auto+Review outputs): intent parsing, tool sequencing, safety guardrails, report drafting.
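
As an illustration of the ACMG Evidence Tagger boundary, the sketch below tags a few frequency- and consequence-based criteria from a config and leaves the final classification empty for a human. The config keys, the threshold values, and the `tag_evidence` function are hypothetical; real cut-offs are disease- and panel-specific, and the PVS1 check here only flags candidates because it omits the gene-level loss-of-function mechanism check.

```python
from typing import Optional

# Hypothetical, illustrative thresholds; a real config would be a
# versioned artifact tuned per disease and panel.
ACMG_CONFIG = {
    "BA1_min_af": 0.05,
    "BS1_min_af": 0.01,
    "PM2_max_af": 0.0001,
    "PVS1_consequences": {
        "stop_gained",
        "frameshift_variant",
        "splice_donor_variant",
        "splice_acceptor_variant",
    },
}


def tag_evidence(gnomad_af: Optional[float], consequence: str, config=ACMG_CONFIG):
    """Return auto-evaluable evidence tags; never a final classification."""
    tags = []
    af = gnomad_af
    if af is not None and af >= config["BA1_min_af"]:
        tags.append("BA1")
    elif af is not None and af >= config["BS1_min_af"]:
        tags.append("BS1")
    elif af is None or af <= config["PM2_max_af"]:
        # Absent from the frequency DB, or below the rarity threshold.
        tags.append("PM2")
    if consequence in config["PVS1_consequences"]:
        # Candidate only: the gene-level LoF mechanism is not checked here.
        tags.append("PVS1")
    return {
        "tags": tags,
        "automation": "Auto+Review",
        "classification": None,  # left for a human reviewer
    }
```

For example, `tag_evidence(None, "stop_gained")` yields the tags `PM2` and `PVS1` with `classification` still unset.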

Observability and Versioning

  • Every pipeline run writes a JSON log: tool versions, reference genome, DB versions, config hashes, automation level per step, manual overrides (who/when/why).
  • Reports embed references to those logs so outputs remain reproducible.
  • Configs (ACMG thresholds, gene panels, PGx rules) are versioned artifacts stored alongside code.
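
One possible shape for the per-run JSON log is sketched below. The field names and the `build_run_log` and `config_hash` helpers are assumptions chosen to mirror the items listed above; the hash is SHA-256 over a canonical JSON dump so that key order does not change it.

```python
import hashlib
import json
from datetime import datetime, timezone


def config_hash(config: dict) -> str:
    """SHA-256 of a canonical (sorted-key, compact) JSON dump of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_run_log(tool_versions, reference_genome, db_versions,
                  configs, steps, overrides=()):
    """Assemble a JSON-serializable run record (hypothetical field names)."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "tool_versions": tool_versions,
        "reference_genome": reference_genome,
        "db_versions": db_versions,
        "config_hashes": {name: config_hash(cfg) for name, cfg in configs.items()},
        "steps": steps,                       # [{"name": ..., "automation": ...}]
        "manual_overrides": list(overrides),  # [{"who": ..., "when": ..., "why": ...}]
    }
```

A run would call `json.dumps(build_run_log(...), indent=2)` and write the result next to its outputs; reports can then cite the log file rather than repeating its contents.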

Security/Privacy Notes

  • Default to local processing; if external LLMs are used, strip identifiers and avoid full VCF uploads.
  • Secrets kept out of repo; rely on environment variables or local config files (excluded by .gitignore).
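
A rough sketch of both notes follows: an allowlist that limits what leaves the machine, and a key read from the environment. The field names, the `max_variants` cap, the `LLM_API_KEY` variable name, and both function names are hypothetical. Field allowlisting is only one part of de-identification, and the free-text question would need its own scrubbing, which is not shown.

```python
import os

# Hypothetical allowlist: only these keys are ever copied into an external payload.
ALLOWED_FIELDS = ("gene", "consequence", "clinvar", "evidence_tags")


def build_external_payload(question, variants, allowed=ALLOWED_FIELDS, max_variants=20):
    """Copy allowlisted fields only, and refuse bulk (VCF-sized) uploads."""
    if len(variants) > max_variants:
        raise ValueError("refusing to send more than max_variants records externally")
    return {
        "question": question,  # assumed to be scrubbed by the caller
        "variants": [{k: v[k] for k in allowed if k in v} for v in variants],
    }


def load_api_key(env_var="LLM_API_KEY"):
    """Read the secret from the environment instead of the repository."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key
```

An allowlist fails closed: a field added to the variant records later is not sent externally until someone explicitly adds it to `ALLOWED_FIELDS`.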

Initial Tech Bets (to be validated)

  • Language/runtime: Python 3.11+ for pipelines, rules, and orchestration stubs.
  • Bio stack candidates: GATK or DeepVariant; VEP; tabix for early querying; SQLAlchemy + SQLite/Postgres when scaling.
  • Infra: containerized runners for pipelines; Makefiles or a workflow engine (Nextflow/Snakemake) later if needed.