Initial commit
101
docs/phase1_howto.md
Normal file
@@ -0,0 +1,101 @@
# Phase 1 How-To: BAM → VCF → Annotate → Panel Report

This document explains how to use the existing CLI to go from BAM files to a report, what each option means, and how to skip steps. It assumes GATK, VEP, and bcftools/tabix are installed, and that this project has been installed with `pip install -e .`.

## Pipeline Overview
1) Trio variant calling → joint VCF
2) VEP annotation → annotated VCF + flat TSV
3) Panel/phenotype lookup + ACMG tags → Markdown/JSON report

Use `phase1-run` to run everything in one command (call/annotate can be skipped), or run the steps separately with `run-call` / `run-annotate` / `panel-report`.

## Step-by-Step Execution

### 1) Variant calling (GATK)
```bash
genomic-consultant run-call \
  --sample proband:/path/proband.bam \
  --sample father:/path/father.bam \
  --sample mother:/path/mother.bam \
  --reference /refs/GRCh38.fa \
  --workdir /tmp/trio \
  --prefix trio \
  --log /tmp/trio/run_call_log.json \
  --probe-tools
```
- `--sample`: `sample_id:/path/to.bam`; may be repeated.
- `--reference`: reference sequence.
- `--workdir`: location for outputs and intermediate files.
- `--prefix`: prefix for output file names.
- `--log`: path to the run log (JSON).
- Outputs: joint VCF (`/tmp/trio/trio.joint.vcf.gz`) and a run log (including the commands and return codes).
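
To make the repeatable `sample_id:/path/to.bam` format concrete, here is a minimal sketch of how such an option could be parsed with `argparse`. The helper names `parse_sample` and `build_parser` are hypothetical and are not taken from this project's CLI code.

```python
import argparse
from pathlib import Path


def parse_sample(spec: str) -> tuple[str, Path]:
    """Split 'sample_id:/path/to.bam' into (sample_id, Path)."""
    # partition() splits at the first colon only, so colons in the path survive.
    sample_id, sep, path = spec.partition(":")
    if not sep or not sample_id or not path:
        raise argparse.ArgumentTypeError(
            f"expected sample_id:/path/to.bam, got {spec!r}"
        )
    return sample_id, Path(path)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run-call-sketch")
    # action="append" lets --sample be given once per trio member.
    parser.add_argument("--sample", type=parse_sample, action="append", required=True)
    parser.add_argument("--reference", type=Path, required=True)
    return parser


args = build_parser().parse_args([
    "--sample", "proband:/path/proband.bam",
    "--sample", "father:/path/father.bam",
    "--reference", "/refs/GRCh38.fa",
])
samples = dict(args.sample)  # sample_id -> BAM path
```

Collecting the pairs into a dict keeps the sample IDs available for later steps such as `--individual-id`.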

### 2) Annotation (VEP + bcftools)
```bash
genomic-consultant run-annotate \
  --vcf /tmp/trio/trio.joint.vcf.gz \
  --workdir /tmp/trio/annot \
  --prefix trio \
  --reference /refs/GRCh38.fa \
  --plugin 'SpliceAI,snv=/path/spliceai.snv.vcf.gz,indel=/path/spliceai.indel.vcf.gz' \
  --plugin 'CADD,/path/whole_genome_SNVs.tsv.gz,/path/InDels.tsv.gz' \
  --extra-flag "--cache --offline" \
  --log /tmp/trio/annot/run_annot_log.json \
  --probe-tools
```
- `--plugin`: VEP plugin specification; may be repeated (SpliceAI and CADD are shown here).
- `--extra-flag`: extra flags passed through to VEP (e.g. cache/offline).
- Outputs: annotated VCF (`trio.vep.vcf.gz`), flat TSV (`trio.vep.tsv`), and a run log.
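
As an illustration of how `--plugin` and `--extra-flag` might translate into a VEP invocation, the sketch below assembles an argument list. The function `build_vep_command` is hypothetical; only the VEP options themselves (`--input_file`, `--output_file`, `--vcf`, `--compress_output`, `--fasta`, `--plugin`) are real, and the project's wrapper may build its command differently.

```python
import shlex


def build_vep_command(vcf, out_vcf, reference, plugins=(), extra_flags=()):
    """Return a VEP argv list (hypothetical helper, not the project's code)."""
    cmd = [
        "vep",
        "--input_file", str(vcf),
        "--output_file", str(out_vcf),
        "--vcf",                        # write VCF output with a CSQ INFO field
        "--compress_output", "bgzip",   # bgzip so the result can be tabix-indexed
        "--fasta", str(reference),
    ]
    for spec in plugins:
        # Each plugin spec is passed through unchanged, one --plugin per spec.
        cmd += ["--plugin", spec]
    for flag in extra_flags:
        # "--cache --offline" arrives as one string; split it into separate args.
        cmd += shlex.split(flag)
    return cmd


cmd = build_vep_command(
    "/tmp/trio/trio.joint.vcf.gz",
    "/tmp/trio/annot/trio.vep.vcf.gz",
    "/refs/GRCh38.fa",
    plugins=[
        "SpliceAI,snv=/path/spliceai.snv.vcf.gz,indel=/path/spliceai.indel.vcf.gz",
        "CADD,/path/whole_genome_SNVs.tsv.gz,/path/InDels.tsv.gz",
    ],
    extra_flags=["--cache --offline"],
)
```

Building a list rather than a shell string avoids quoting problems when the command is later run with `subprocess.run(cmd)`.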

### 3) Panel/Phenotype report
Using a panel file:
```bash
genomic-consultant panel-report \
  --tsv /tmp/trio/annot/trio.vep.tsv \
  --panel configs/panel.example.json \
  --acmg-config configs/acmg_config.example.yaml \
  --individual-id proband \
  --max-af 0.05 \
  --format markdown \
  --log /tmp/trio/panel_log.json
```
Using a phenotype ID that is resolved directly into a panel:
```bash
genomic-consultant panel-report \
  --tsv /tmp/trio/annot/trio.vep.tsv \
  --phenotype-id HP:0000365 \
  --phenotype-mapping configs/phenotype_to_genes.hpo_seed.json \
  --acmg-config configs/acmg_config.example.yaml \
  --individual-id proband \
  --max-af 0.05 \
  --format markdown
```
- `--max-af`: upper bound on allele frequency used for filtering.
- `--format`: `markdown` or `json`.
- Outputs: the report text plus a run log (which records hashes of the panel/ACMG configs).
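
The effect of a panel combined with `--max-af` can be sketched as a filter over the flat TSV. The column names `variant_id`, `gene`, and `gnomad_af` and the rows below are assumptions for illustration; the real TSV schema is not shown in this document. The sketch also assumes a variant with no frequency value is kept rather than dropped.

```python
import csv
import io

# Placeholder rows; identifiers and gene symbols are not real data.
SAMPLE_TSV = (
    "variant_id\tgene\tgnomad_af\n"
    "var1\tGENE_A\t0.001\n"
    "var2\tGENE_A\t0.2\n"
    "var3\tGENE_B\t0.0001\n"
    "var4\tGENE_A\t.\n"
)


def filter_panel_variants(tsv_text, panel_genes, max_af):
    """Keep rows whose gene is in the panel and whose AF is <= max_af."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        if row["gene"] not in panel_genes:
            continue
        af_text = row["gnomad_af"]
        # Treat "" or "." as "no frequency available" (assumed convention).
        af = float(af_text) if af_text not in ("", ".") else None
        if af is not None and af > max_af:
            continue
        kept.append(row)
    return kept


kept = filter_panel_variants(SAMPLE_TSV, panel_genes={"GENE_A"}, max_af=0.05)
kept_ids = [row["variant_id"] for row in kept]
```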

## One-Command Mode (call/annotate can be skipped)
If you already have the joint VCF/TSV, you can skip the first two steps:
```bash
genomic-consultant phase1-run \
  --workdir /tmp/trio \
  --prefix trio \
  --tsv /tmp/trio/annot/trio.vep.tsv \
  --skip-call --skip-annotate \
  --panel configs/panel.example.json \
  --acmg-config configs/acmg_config.example.yaml \
  --max-af 0.05 \
  --format markdown \
  --log-dir /tmp/trio/runtime
```
To actually run VEP, remove `--skip-annotate` and provide `--plugins/--extra-flag`; to run calling, remove `--skip-call` and provide `--sample`/`--reference`.

## Main Outputs
- Joint VCF (calling results)
- Annotated VCF + flat TSV (with gene/consequence/ClinVar/AF/SpliceAI/CADD and other fields)
- Run logs (JSON, with commands, return codes, and config hashes; written to `--log` or `--log-dir`)
- Panel report (Markdown or JSON) with automatic ACMG tags; requires manual review
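
## Notes
- Calling and annotation depend on external tools and their resources (reference sequence, VEP cache, SpliceAI/CADD data).
- Without BAMs or those resources, you can demo the report with the sample TSV: `phase1-run --tsv sample_data/example_annotated.tsv --skip-call --skip-annotate ...`.
- Security: `.gitignore` already excludes large genomic files; running in a controlled local environment is recommended.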
80
docs/phase1_implementation_plan.md
Normal file
@@ -0,0 +1,80 @@
# Phase 1 Implementation Plan (Genomic Foundation)

Scope: deliver a working trio-based variant pipeline, annotated genomic store, query APIs, initial ACMG evidence tagging, and reporting/logging scaffolding. This assumes local execution with Python 3.11+.

## Objectives
- Trio BAM → joint VCF with QC artifacts (`Auto`).
- Annotate variants with population frequency, ClinVar, consequence/prediction (`Auto`).
- Provide queryable interfaces for gene/region lookups with filters (`Auto`).
- Disease/phenotype → gene panel lookup and filtered variant outputs (`Auto+Review`).
- Auto-tag subset of ACMG criteria; human-only final classification (`Auto+Review`).
- Produce machine-readable run logs with versions, configs, and overrides.

## Work Breakdown
1) **Data & references**
   - Reference genome: GRCh38 (primary) with option for GRCh37; pin version hash.
   - Resource bundles: known sites for BQSR (if using GATK), gnomAD/ClinVar versions for annotation.
   - Test fixtures: small trio BAM/CRAM subset or GIAB trio downsampled for CI-like checks.

2) **Variant calling pipeline (wrapper, `Auto`)**
   - Tooling: GATK HaplotypeCaller → gVCF; GenotypeGVCFs for joint calls. Alt: DeepVariant + joint genotyper (parameterized).
   - Steps:
     - Validate inputs (file presence, reference match).
     - Optional QC: coverage, duplicates, on-target.
     - Generate per-sample gVCF; joint genotyping to trio VCF.
   - Outputs: joint VCF + index; QC summary JSON/TSV; log with tool versions and params.
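
The gVCF-then-joint-genotyping steps above can be sketched as a command plan. The helper `build_trio_call_commands` and the intermediate file names are hypothetical; the GATK4 tool names and arguments (`HaplotypeCaller -ERC GVCF`, `CombineGVCFs`, `GenotypeGVCFs`) are real, but the planned wrapper may combine gVCFs differently (for example with GenomicsDBImport).

```python
def build_trio_call_commands(samples, reference, workdir, prefix):
    """Plan GATK4 commands for per-sample gVCFs followed by joint genotyping.

    samples maps sample_id -> BAM path. Nothing is executed here; each
    command is a list suitable for subprocess.run().
    """
    commands = []
    gvcfs = []
    for sample_id, bam in samples.items():
        gvcf = f"{workdir}/{sample_id}.g.vcf.gz"
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", reference, "-I", bam, "-O", gvcf,
            "-ERC", "GVCF",  # emit reference confidence so samples can be joined
        ])
        gvcfs.append(gvcf)

    combined = f"{workdir}/{prefix}.combined.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", reference, "-O", combined]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    commands.append(combine)

    commands.append([
        "gatk", "GenotypeGVCFs",
        "-R", reference, "-V", combined,
        "-O", f"{workdir}/{prefix}.joint.vcf.gz",
    ])
    return commands


commands = build_trio_call_commands(
    {
        "proband": "/path/proband.bam",
        "father": "/path/father.bam",
        "mother": "/path/mother.bam",
    },
    reference="/refs/GRCh38.fa",
    workdir="/tmp/trio",
    prefix="trio",
)
```

Returning the plan as data makes it straightforward to record every command and its return code in the run log.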

3) **Annotation pipeline (`Auto`)**
   - Tooling: Ensembl VEP with plugins for gnomAD, ClinVar, CADD, SpliceAI where available; alt path: ANNOVAR.
   - Steps:
     - Normalize variants (bcftools norm) if needed.
     - Annotate with gene, transcript, protein change; population AF; ClinVar significance; consequence; predictions (SIFT/PolyPhen/CADD); inheritance flags.
     - Include SpliceAI/CADD plugins if installed; CLI accepts extra flags/plugins to embed SpliceAI/CADD fields.
   - Outputs: annotated VCF; flattened TSV/Parquet for faster querying; manifest of DB versions used.
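
One way to produce the flattened table is to expand VEP's `CSQ` INFO field, whose field order is declared after `Format: ` in the VCF header. The sketch below shows that expansion on placeholder values; the function names are hypothetical and the real flattener may select or rename columns.

```python
def csq_field_names(header_line: str) -> list[str]:
    """Read the CSQ field order from the VEP INFO header line."""
    fmt = header_line.split("Format: ", 1)[1]
    # Drop the closing quote and angle bracket of the header line.
    return fmt.rstrip('">').strip().split("|")


def flatten_csq(csq_value: str, names: list[str]) -> list[dict[str, str]]:
    """Turn one CSQ value into one dict per transcript-level annotation."""
    rows = []
    for entry in csq_value.split(","):       # entries are comma-separated
        values = entry.split("|")            # fields are pipe-separated
        rows.append(dict(zip(names, values)))
    return rows


HEADER = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">'
)
# Placeholder annotation values, not real data.
CSQ = "A|missense_variant|MODERATE|GENE_A,A|intron_variant|MODIFIER|GENE_A"

names = csq_field_names(HEADER)
rows = flatten_csq(CSQ, names)
```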

4) **Genomic store + query API (`Auto`)**
   - Early option: tabix-indexed VCF with Python wrapper.
   - Functions (Python module `genomic_store`):
     - `get_variants_by_gene(individual_id, genes, filters)`
       Filters: AF thresholds, consequence categories, ClinVar significance, inheritance pattern.
     - `get_variants_by_region(individual_id, chrom, start, end, filters)`
     - `list_genes_with_variants(individual_id, filters)` (optional).
   - Filters defined in a `FilterConfig` dataclass; serializable for logging.
   - Future option: import to SQLite/Postgres via Arrow/Parquet for richer predicates.

5) **Disease/phenotype → gene panel (`Auto+Review`)**
   - Data: HPO/OMIM/GenCC lookup or curated JSON panels.
   - Function: `get_gene_panel(disease_or_hpo_id, version=None)` returning gene list + provenance.
   - Phenotype resolver: curated JSON mapping (e.g., `phenotype_to_genes.example.json`) as placeholder until upstream data is wired; allow dynamic panel synthesis by phenotype ID in CLI; support merging multiple sources into one mapping.
   - Flow: resolve panel → call genomic store → apply simple ranking (AF low, ClinVar pathogenicity high).
   - Manual review points: panel curation, rank threshold tuning.
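
The resolve-then-rank flow could look like the sketch below. The mapping contents (gene symbols, source, version) are placeholders rather than curated data, and the ordering rule, ClinVar severity first and then ascending AF, is one assumed reading of "AF low, ClinVar pathogenicity high".

```python
# Placeholder mapping; gene symbols and version are illustrative only.
PHENOTYPE_TO_GENES = {
    "HP:0000365": {"genes": ["GENE_A", "GENE_B"], "source": "curated-json", "version": "0.1"},
}

# Higher rank sorts first. Unknown or missing labels are treated as uncertain.
CLINVAR_RANK = {
    "Pathogenic": 4,
    "Likely_pathogenic": 3,
    "Uncertain_significance": 2,
    "Likely_benign": 1,
    "Benign": 0,
}


def get_gene_panel(disease_or_hpo_id, version=None):
    entry = PHENOTYPE_TO_GENES.get(disease_or_hpo_id)
    if entry is None:
        raise KeyError(f"no panel for {disease_or_hpo_id}")
    if version is not None and entry["version"] != version:
        raise ValueError(f"panel version {version} not available")
    return {
        "genes": list(entry["genes"]),
        "provenance": {
            "id": disease_or_hpo_id,
            "source": entry["source"],
            "version": entry["version"],
        },
    }


def rank_variants(variants):
    def key(variant):
        severity = CLINVAR_RANK.get(variant.get("clinvar"), 2)
        af = variant.get("af")
        # Missing AF is treated as 0.0, i.e. as rare (assumed convention).
        return (-severity, af if af is not None else 0.0)

    return sorted(variants, key=key)


panel = get_gene_panel("HP:0000365")
ranked = rank_variants([
    {"variant_id": "v1", "clinvar": "Benign", "af": 0.0001},
    {"variant_id": "v2", "clinvar": "Pathogenic", "af": 0.001},
    {"variant_id": "v3", "clinvar": "Pathogenic", "af": 0.00001},
])
```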

6) **ACMG evidence tagging (subset, `Auto+Review`)**
   - Criteria to auto-evaluate initially: PVS1 (LoF in LoF-sensitive gene list), PM2 (absent/rare in population), BA1/BS1 (common frequency), possible BP7 (synonymous, no splicing impact).
   - Config: YAML/JSON with thresholds (AF cutoffs, LoF gene list, transcript precedence).
   - Output schema per variant: `{variant_id, evidence: [tag, strength, rationale], suggested_class}` with suggested class computed purely from auto tags; final class left blank for human.
   - Logging: capture rule version and reasons for each fired rule.
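
A rough sketch of a config-driven tagger for PVS1, PM2, BA1, and BS1 is given below. The threshold values, the config keys, and the reduced class-combining logic are all assumptions; only two combinations are handled (BA1 alone suggests Benign, PVS1 plus PM2 suggests Likely pathogenic) and everything else falls back to Uncertain significance. The final class is left as `None` for a human reviewer.

```python
# Assumed example thresholds; real values belong in the ACMG config file.
CONFIG = {
    "ba1_af": 0.05,
    "bs1_af": 0.01,
    "pm2_af": 0.0001,
    "lof_genes": {"GENE_A"},
}

LOF_CONSEQUENCES = {
    "stop_gained",
    "frameshift_variant",
    "splice_acceptor_variant",
    "splice_donor_variant",
}


def tag_variant(variant, config):
    evidence = []
    af = variant.get("af")

    # Frequency criteria are mutually exclusive, checked from common to rare.
    if af is not None and af >= config["ba1_af"]:
        evidence.append({"tag": "BA1", "strength": "stand-alone",
                         "rationale": f"AF {af} >= {config['ba1_af']}"})
    elif af is not None and af >= config["bs1_af"]:
        evidence.append({"tag": "BS1", "strength": "strong",
                         "rationale": f"AF {af} >= {config['bs1_af']}"})
    elif af is None or af <= config["pm2_af"]:
        evidence.append({"tag": "PM2", "strength": "moderate",
                         "rationale": "absent or rare in population data"})

    if (variant.get("consequence") in LOF_CONSEQUENCES
            and variant.get("gene") in config["lof_genes"]):
        evidence.append({"tag": "PVS1", "strength": "very strong",
                         "rationale": "LoF consequence in LoF-sensitive gene"})

    tags = {item["tag"] for item in evidence}
    if "BA1" in tags:
        suggested = "Benign"
    elif {"PVS1", "PM2"} <= tags:
        suggested = "Likely pathogenic"
    else:
        suggested = "Uncertain significance"

    return {
        "variant_id": variant["variant_id"],
        "evidence": evidence,
        "suggested_class": suggested,
        "final_class": None,  # human-only decision
    }


result = tag_variant(
    {"variant_id": "var1", "gene": "GENE_A", "consequence": "stop_gained", "af": None},
    CONFIG,
)
```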

7) **Run logging and versioning**
   - Every pipeline run emits `run_log.json` containing:
     - Inputs (sample IDs, file paths, reference build).
     - Tool versions and parameters; DB versions; config hashes (panel/ACMG configs).
     - Automation level per step; manual overrides (`who/when/why`).
     - Derived artifact paths and checksums.
   - Embed run_log reference in reports.
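
The following sketch shows one possible shape for `run_log.json`, with SHA-256 hashes of the config files. The key names in the log and the helper names are assumptions; the actual schema is not specified here.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large inputs are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_log(inputs, steps, config_paths):
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,
        "steps": steps,  # each step carries its command, returncode, automation level
        "config_hashes": {str(p): sha256_file(Path(p)) for p in config_paths},
        "manual_overrides": [],  # entries would record who/when/why
    }


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "acmg_config.yaml"
    config.write_bytes(b"ba1_af: 0.05\n")
    run_log = build_run_log(
        inputs={"samples": ["proband"], "reference_build": "GRCh38"},
        steps=[{
            "name": "panel-report",
            "automation": "Auto+Review",
            "command": ["genomic-consultant", "panel-report"],
            "returncode": 0,
        }],
        config_paths=[config],
    )
    log_text = json.dumps(run_log, indent=2, sort_keys=True)
    (Path(tmp) / "run_log.json").write_text(log_text, encoding="utf-8")
```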

8) **Report template (minimal)**
   - Input: disease/gene panel name, variant query results, ACMG evidence tags.
   - Output: Markdown + JSON summary with sections: context, methods, variants table, limitations.
   - Mark human-only decisions clearly.
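
A minimal Markdown renderer with the four sections named above might look like this. The function name, the wording of each section, and the variant dict keys are assumptions; the point is that the human-only column stays visibly separate from the automatic suggestion.

```python
def render_report(panel_name, variants, run_log_path):
    """Render a minimal Markdown report (hypothetical layout)."""
    lines = [
        f"# Panel report: {panel_name}",
        "",
        "## Context",
        "",
        f"Panel: {panel_name}",
        "",
        "## Methods",
        "",
        f"Run log: `{run_log_path}`",
        "",
        "## Variants",
        "",
        "| Variant | Gene | Suggested class (auto) | Final class (human-only) |",
        "| --- | --- | --- | --- |",
    ]
    for v in variants:
        final = v.get("final_class") or "pending review"
        lines.append(
            f"| {v['variant_id']} | {v['gene']} | {v['suggested_class']} | {final} |"
        )
    lines += [
        "",
        "## Limitations",
        "",
        "ACMG tags are generated automatically; final classification is human-only.",
    ]
    return "\n".join(lines)


report = render_report(
    "example-panel",
    [{"variant_id": "var1", "gene": "GENE_A",
      "suggested_class": "Uncertain significance", "final_class": None}],
    run_log_path="/tmp/trio/panel_log.json",
)
```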

## Milestones
- **M1**: Repo scaffolding + configs; tiny trio test data wired; `make pipeline` runs through variant calling + annotation on fixture.
- **M2**: Genomic store wrapper with gene/region queries; filter config; basic CLI/Notebook demo.
- **M3**: Panel lookup + ranked variant listing; ACMG auto tags on outputs; run_log generation.
- **M4**: Minimal report generator + acceptance of human-reviewed classifications.

## Validation Strategy
- Unit tests for filter logic, panel resolution, ACMG tagger decisions (synthetic variants).
- Integration test on small fixture trio to ensure call → annotate → query path works.
- Determinism checks: hash configs and verify outputs stable across runs given same inputs.
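
The determinism check can be sketched as hashing a canonical JSON form of each run's output and comparing digests. The helper names are hypothetical, and the sketch assumes outputs are JSON-serializable.

```python
import hashlib
import json


def stable_digest(obj) -> str:
    """Hash a canonical JSON form so dict key order does not affect the digest."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_deterministic(run, inputs, repeats=2) -> bool:
    """Call run(inputs) several times and report whether all digests match."""
    digests = {stable_digest(run(inputs)) for _ in range(repeats)}
    return len(digests) == 1


same = stable_digest({"a": 1, "b": 2}) == stable_digest({"b": 2, "a": 1})
stable = check_deterministic(lambda values: sorted(values), [3, 1, 2])
```

Fields that legitimately vary between runs, such as timestamps, would need to be excluded before hashing.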
69
docs/system_architecture.md
Normal file
@@ -0,0 +1,69 @@
# System Architecture Blueprint (v0.1)

This document turns `genomic_decision_support_system_spec_v0.1.md` into a buildable architecture and phased roadmap. Automation levels follow `Auto / Auto+Review / Human-only`.

## High-Level Views
- **Core layers**: (1) sequencing ingest → variant calling/annotation (`Auto`), (2) genomic query layer (`Auto`), (3) rule engines (ACMG, PGx, DDI, supplements; mixed automation), (4) orchestration/LLM and report generation (`Auto` tool calls, `Auto+Review` outputs).
- **Data custody**: all PHI/genomic artifacts remain local; external calls require de-identification or private models.
- **Traceability**: every run records tool versions, database snapshots, configs, and manual overrides in machine-readable logs.

### End-to-end flow
```
BAM (proband + parents)
  ↓ Variant Calling (gVCF) [Auto]
Joint Genotyper → joint VCF
  ↓ Annotation (VEP/ANNOVAR + ClinVar/gnomAD etc.) [Auto]
  ↓ Genomic Store (VCF+tabix or SQL) + Query API [Auto]
  ↓
  ├─ Disease/Phenotype → Gene Panel lookup [Auto+Review]
  │    └─ Panel variants with basic ranking (freq, ClinVar) [Auto+Review]
  ├─ ACMG evidence tagging subset (PVS1, PM2, BA1, BS1…) [Auto+Review]
  ├─ PGx genotype→phenotype and recommendation rules [Auto → Auto+Review]
  ├─ DDI rule evaluation [Auto]
  └─ Supplement/Herb normalization + interaction rules [Auto+Review → Human-only]
  ↓
LLM/Orchestrator routes user questions to tools, produces JSON + Markdown drafts [Auto tools, Auto+Review narratives]
```

## Phase Roadmap (build-first view)
- **Phase 1 – Genomic foundation**
  - Deliverables: trio joint VCF + annotation; query functions (`get_variants_by_gene/region`); disease→gene panel lookup; partial ACMG evidence tagging.
  - Data stores: tabix-backed VCF wrapper initially; optional SQLite/Postgres import later.
  - Interfaces: Python CLI/SDK first; machine-readable run logs with versions and automation levels.
- **Phase 2 – PGx & DDI**
  - Drug vocabulary normalization (ATC/RxNorm).
  - PGx engine: star-allele calling or rule-based genotype→phenotype; guideline-mapped advice with review gates.
  - DDI engine: rule base with severity tiers; combine with PGx outputs.
- **Phase 3 – Supplements & Herbs**
  - Name/ingredient normalization; herb formula expansion.
  - Rule tables for CYP/transporters, coagulation, CNS effects.
  - Evidence grading and conservative messaging; human-only final clinical language.
- **Phase 4 – LLM Interface & Reports**
  - Tool-calling schema for queries listed above.
  - JSON + Markdown report templates with traceability to rules, data versions, and overrides.

## Module Boundaries
- **Variant Calling Pipeline** (`Auto`): wrapper around GATK or DeepVariant + joint genotyper; pluggable reference genome; QC summaries.
- **Annotation Pipeline** (`Auto`): VEP/ANNOVAR with pinned database versions (gnomAD, ClinVar, transcript set); emits annotated VCF + flat table.
- **Genomic Query Layer** (`Auto`): abstraction over tabix or SQL; minimal APIs: `get_variants_by_gene`, `get_variants_by_region`, filters (freq, consequence, clinvar).
- **Disease/Phenotype to Panel** (`Auto+Review`): HPO/OMIM lookups or curated panels; panel versioned; feeds queries.
- **Phenotype Resolver** (`Auto+Review`): JSON/DB mapping of phenotype/HPO IDs to gene lists as a placeholder until upstream sources are integrated; can synthesize panels dynamically and merge multiple sources.
- **ACMG Evidence Tagger** (`Auto+Review`): auto-evaluable criteria only; config-driven thresholds; human-only final classification.
- **PGx Engine** (`Auto → Auto+Review`): star-allele calling where possible; guideline rules (CPIC/DPWG) with conservative defaults; flag items needing review.
- **DDI Engine** (`Auto`): rule tables keyed by normalized drug IDs; outputs severity and rationale.
- **Supplements/Herbs** (`Auto+Review → Human-only`): ingredient extraction + mapping; interaction rules; human sign-off for clinical language.
- **Orchestrator/LLM** (`Auto tools, Auto+Review outputs`): intent parsing, tool sequencing, safety guardrails, report drafting.

## Observability and Versioning
- Every pipeline run writes a JSON log: tool versions, reference genome, DB versions, config hashes, automation level per step, manual overrides (who/when/why).
- Reports embed references to those logs so outputs remain reproducible.
- Configs (ACMG thresholds, gene panels, PGx rules) are versioned artifacts stored alongside code.

## Security/Privacy Notes
- Default to local processing; if external LLMs are used, strip identifiers and avoid full VCF uploads.
- Secrets kept out of repo; rely on environment variables or local config files (excluded by `.gitignore`).
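
One way to strip identifiers before any external LLM call is an allow-list: only fields explicitly marked as shareable leave the local environment. The field names below are assumptions, and whether these fields are acceptable to share is a policy decision this sketch does not settle.

```python
# Assumed allow-list; anything not named here is dropped.
ALLOWED_FIELDS = ("gene", "consequence", "clinvar", "af")


def deidentify(variants):
    """Keep only allow-listed fields from each variant record."""
    return [
        {key: variant[key] for key in ALLOWED_FIELDS if key in variant}
        for variant in variants
    ]


payload = deidentify([
    {
        "individual_id": "proband",
        "sample_path": "/path/proband.bam",
        "variant_id": "var1",
        "gene": "GENE_A",
        "consequence": "missense_variant",
        "af": 0.0002,
    }
])
```

An allow-list fails closed: a newly added field stays local until someone decides it may be shared.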

## Initial Tech Bets (to be validated)
- Language/runtime: Python 3.11+ for pipelines, rules, and orchestration stubs.
- Bio stack candidates: GATK or DeepVariant; VEP; tabix for early querying; SQLAlchemy + SQLite/Postgres when scaling.
- Infra: containerized runners for pipelines; makefiles or workflow engine (Nextflow/Snakemake) later if needed.