Refactor: Replace scaffolding with working analysis scripts

- Add trio_analysis.py for trio-based variant analysis with de novo detection
- Add clinvar_acmg_annotate.py for ClinVar/ACMG annotation
- Add gwas_comprehensive.py with 201 SNPs across 18 categories
- Add pharmgkb_full_analysis.py for pharmacogenomics analysis
- Add gwas_trait_lookup.py for basic GWAS trait lookup
- Add pharmacogenomics.py for basic PGx analysis
- Remove unused scaffolding code (src/, configs/, docs/, tests/)
- Update README.md with new documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-01 22:36:02 +08:00
parent f74dc351f7
commit d13d58df8b
56 changed files with 2608 additions and 2347 deletions
--- a/README.md
+++ b/README.md
@@ -1,112 +1,113 @@
 # Genomic Consultant

-Early design for a personal genomic risk and drug–interaction decision support system. Specs are sourced from `genomic_decision_support_system_spec_v0.1.md`.
+A practical toolkit for analyzing trio WES (whole-exome sequencing) data, covering ClinVar/ACMG annotation, GWAS trait analysis, and pharmacogenomics.

-## Vision (per spec)
- Phase 1: trio variant calling, annotation, queryable genomic DB, initial ACMG evidence tagging.
- Phase 2: pharmacogenomics genotype-to-phenotype mapping plus drug–drug interaction checks.
- Phase 3: supplement/herb normalization and interaction risk layering.
- Phase 4: LLM-driven query orchestration and report generation.
+## Analysis Scripts

-## Repository Layout
- `docs/` — system architecture notes, phase plans, data models (work in progress).
- `configs/` — example ACMG config and gene panel JSON.
- `configs/phenotype_to_genes.example.json` — placeholder phenotype/HPO → gene mappings.
- `configs/phenotype_to_genes.hpo_seed.json` — seed HPO mappings (replace with full HPO/GenCC derived panels).
- `sample_data/` — tiny annotated TSV for demo.
- `src/genomic_consultant/` — Python scaffolding (pipelines, store, panel lookup, ACMG tagging, reporting).
- `genomic_decision_support_system_spec_v0.1.md` — original requirements draft.
+### 1. Trio Analysis (`trio_analysis.py`)
+Trio-based variant analysis with de novo detection, compound-heterozygote detection, and inheritance pattern annotation.

-## Contributing/next steps
-1. Finalize Phase 1 tech selection (variant caller, annotation stack, reference/DB versions).
-2. Stand up the Phase 1 pipelines and minimal query API surface.
-3. Add ACMG evidence tagging config and human-review logging.
-4. Layer in PGx/DDI and supplement modules per later phases.
-
-Data safety: keep genomic/clinical data local; the `.gitignore` blocks common genomic outputs by default.
-
-## Quickstart (CLI scaffolding)
-```
-pip install -e .
-
-# 1) Show trio calling plan (commands only; not executed)
-genomic-consultant plan-call \
-  --sample proband:/data/proband.bam \
-  --sample father:/data/father.bam \
-  --sample mother:/data/mother.bam \
-  --reference /refs/GRCh38.fa \
-  --workdir /tmp/trio
-
-# 1b) Execute calling plan (requires GATK installed) and emit run log
-genomic-consultant run-call \
-  --sample proband:/data/proband.bam \
-  --sample father:/data/father.bam \
-  --sample mother:/data/mother.bam \
-  --reference /refs/GRCh38.fa \
-  --workdir /tmp/trio \
-  --log /tmp/trio/run_call_log.json \
-  --probe-tools
-
-# 2) Show annotation plan for a joint VCF
-genomic-consultant plan-annotate \
-  --vcf /tmp/trio/trio.joint.vcf.gz \
-  --workdir /tmp/trio/annot \
-  --prefix trio \
-  --reference /refs/GRCh38.fa
-
-# 2b) Execute annotation plan (requires VEP, bcftools) with run log
-genomic-consultant run-annotate \
-  --vcf /tmp/trio/trio.joint.vcf.gz \
-  --workdir /tmp/trio/annot \
-  --prefix trio \
-  --reference /refs/GRCh38.fa \
-  --log /tmp/trio/annot/run_annot_log.json \
-  --probe-tools
-
-# 3) Demo panel report using sample data (panel file)
-genomic-consultant panel-report \
-  --tsv sample_data/example_annotated.tsv \
-  --panel configs/panel.example.json \
-  --acmg-config configs/acmg_config.example.yaml \
-  --individual-id demo \
-  --format markdown \
-  --log /tmp/panel_log.json
-
-# 3b) Demo panel report using phenotype mapping (HPO)
-genomic-consultant panel-report \
-  --tsv sample_data/example_annotated.tsv \
-  --phenotype-id HP:0000365 \
-  --phenotype-mapping configs/phenotype_to_genes.hpo_seed.json \
-  --acmg-config configs/acmg_config.example.yaml \
-  --individual-id demo \
-  --format markdown
-
-# 3c) Merge multiple phenotype→gene mappings into one
-genomic-consultant build-phenotype-mapping \
-  --output configs/phenotype_to_genes.merged.json \
-  configs/phenotype_to_genes.example.json configs/phenotype_to_genes.hpo_seed.json
-
-# 4) End-to-end Phase 1 pipeline (optionally skip call/annotate; use sample TSV)
-genomic-consultant phase1-run \
-  --tsv sample_data/example_annotated.tsv \
-  --skip-call --skip-annotate \
-  --panel configs/panel.example.json \
-  --acmg-config configs/acmg_config.example.yaml \
-  --workdir runtime \
-  --prefix demo
-
-# Run tests
-pytest
+```bash
+python trio_analysis.py <vcf_path> <output_dir>
 ```

-### Optional Parquet-backed store
-Install pandas to enable Parquet ingestion:
-```
-pip install -e .[store]
+### 2. ClinVar/ACMG Annotation (`clinvar_acmg_annotate.py`)
+Annotates variants with ClinVar clinical significance and generates ACMG-style evidence tags.
+
+```bash
+python clinvar_acmg_annotate.py <vcf_path> <output_path> [sample_idx]
 ```

-### Notes on VEP plugins (SpliceAI/CADD)
- The annotation plan already queries `SpliceAI` and `CADD_PHRED` fields; ensure your VEP run includes plugins/flags that produce them, e.g.:
-  - `--plugin SpliceAI,snv=/path/to/spliceai.snv.vcf.gz,indel=/path/to/spliceai.indel.vcf.gz`
-  - `--plugin CADD,/path/to/whole_genome_SNVs.tsv.gz,/path/to/InDels.tsv.gz`
- Pass these via `--plugin` and/or `--extra-flag` on `run-annotate` / `plan-annotate` to embed fields into the TSV.
+### 3. GWAS Comprehensive Analysis (`gwas_comprehensive.py`)
+Comprehensive GWAS trait analysis with 201 curated SNPs across 18 categories:
+- Gout / Uric acid metabolism
+- Kidney disease
+- Hearing loss
+- Autoimmune diseases
+- Cancer risk
+- Blood clotting / Thrombosis
+- Thyroid disorders
+- Bone health / Osteoporosis
+- Liver disease (NAFLD)
+- Migraine
+- Longevity / Aging
+- Sleep
+- Skin conditions
+- Cardiovascular disease
+- Metabolic disorders
+- Eye conditions
+- Neuropsychiatric
+- Other traits
+
+```bash
+python gwas_comprehensive.py <vcf_path> <output_path> [sample_idx]
+```
+
+### 4. PharmGKB Full Analysis (`pharmgkb_full_analysis.py`)
+Comprehensive pharmacogenomics analysis using the PharmGKB clinical annotations database.
+
+```bash
+python pharmgkb_full_analysis.py <vcf_path> <output_path> [sample_idx]
+```
+
+### 5. GWAS Trait Lookup (`gwas_trait_lookup.py`)
+Original curated GWAS trait lookup (smaller SNP set).
+
+```bash
+python gwas_trait_lookup.py <vcf_path> <output_path> [sample_idx]
+```
+
+### 6. Basic Pharmacogenomics (`pharmacogenomics.py`)
+Basic pharmacogenomics analysis with common drug-gene interactions.
+
+## Prerequisites
+
+- Python 3.8+
+- conda environment with bioinformatics tools:
+  ```bash
+  conda create -n genomics python=3.10
+  conda activate genomics
+  conda install -c bioconda bcftools snpeff gatk4
+  ```
+
+## Reference Databases Required
+
+- **ClinVar**: VCF from NCBI
+- **PharmGKB**: Clinical annotations TSV
+- **dbSNP**: For rsID annotation
+- **Reference genome**: GRCh37/hg19
+
+## Data Directory Structure
+
+```
+/Volumes/NV2/
+├── genomics_analysis/
+│   └── vcf/
+│       ├── trio_joint.vcf.gz          # Joint-called VCF
+│       ├── trio_joint.rsid.vcf.gz     # With rsID annotations
+│       └── trio_joint.snpeff.vcf      # With SnpEff annotations
+└── genomics_reference/
+    ├── clinvar/
+    ├── pharmgkb/
+    ├── dbsnp/
+    └── gwas_catalog/
+```
+
+## Sample Index Mapping
+
+For trio VCF files:
+- Index 0: Mother
+- Index 1: Father
+- Index 2: Proband
+
+## Output Reports
+
+Each script generates detailed reports including:
+- Summary statistics
+- Risk variant identification
+- Family comparison (for trio data)
+- Clinical annotations and recommendations
+
+## License
+
+Private use only.
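
To make the README's sample index mapping and the de novo detection mentioned for `trio_analysis.py` concrete, here is a minimal sketch. It is not taken from the script itself: the helper names `parse_gt` and `is_de_novo_candidate` are hypothetical, and the rule used (the proband carries a non-reference allele that neither parent carries) is an assumed simplification that ignores read depth, genotype quality, and sex chromosomes. The only detail drawn from the README is the sample order (0 = mother, 1 = father, 2 = proband).

```python
from typing import Optional, Sequence, Tuple

# Sample order as documented in the README: 0 = mother, 1 = father, 2 = proband.
MOTHER_IDX, FATHER_IDX, PROBAND_IDX = 0, 1, 2

Genotype = Optional[Tuple[int, int]]


def parse_gt(gt: str) -> Genotype:
    """Parse a diploid VCF GT string such as '0/1' or '1|0'.

    Returns None for missing ('./.') or non-diploid calls.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return int(alleles[0]), int(alleles[1])


def is_de_novo_candidate(gts: Sequence[str]) -> bool:
    """Flag a site where the proband has a non-reference allele absent from both parents.

    `gts` holds the GT strings for one variant, in README sample order.
    Hypothetical helper: the real script may apply additional filters.
    """
    mother = parse_gt(gts[MOTHER_IDX])
    father = parse_gt(gts[FATHER_IDX])
    proband = parse_gt(gts[PROBAND_IDX])
    if mother is None or father is None or proband is None:
        return False  # cannot decide without all three genotypes
    parental_alleles = set(mother) | set(father)
    return any(a != 0 and a not in parental_alleles for a in proband)
```

For example, `is_de_novo_candidate(["0/0", "0/0", "0/1"])` flags the site, while a site where the mother is also `0/1` is treated as inherited. Returning `False` on any missing genotype is a conservative choice for this sketch; a real pipeline would likely report such sites separately.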