Refactor: Replace scaffolding with working analysis scripts

- Add trio_analysis.py for trio-based variant analysis with de novo detection
- Add clinvar_acmg_annotate.py for ClinVar/ACMG annotation
- Add gwas_comprehensive.py with 201 SNPs across 18 categories
- Add pharmgkb_full_analysis.py for pharmacogenomics analysis
- Add gwas_trait_lookup.py for basic GWAS trait lookup
- Add pharmacogenomics.py for basic PGx analysis
- Remove unused scaffolding code (src/, configs/, docs/, tests/)
- Update README.md with new documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-01 22:36:02 +08:00
parent f74dc351f7
commit d13d58df8b
56 changed files with 2608 additions and 2347 deletions

207
README.md

@@ -1,112 +1,113 @@
# Genomic Consultant
Early design for a personal genomic risk and drug-interaction decision support system. Specs are sourced from `genomic_decision_support_system_spec_v0.1.md`.
A practical toolkit for analyzing trio WES (whole exome sequencing) data, including ClinVar/ACMG annotation, GWAS trait analysis, and pharmacogenomics.
## Vision (per spec)
- Phase 1: trio variant calling, annotation, queryable genomic DB, initial ACMG evidence tagging.
- Phase 2: pharmacogenomics genotype-to-phenotype mapping plus drug-drug interaction checks.
- Phase 3: supplement/herb normalization and interaction risk layering.
- Phase 4: LLM-driven query orchestration and report generation.
## Analysis Scripts
## Repository Layout
- `docs/` — system architecture notes, phase plans, data models (work in progress).
- `configs/` — example ACMG config and gene panel JSON.
- `configs/phenotype_to_genes.example.json` — placeholder phenotype/HPO → gene mappings.
- `configs/phenotype_to_genes.hpo_seed.json` — seed HPO mappings (replace with full HPO/GenCC derived panels).
- `sample_data/` — tiny annotated TSV for demo.
- `src/genomic_consultant/` — Python scaffolding (pipelines, store, panel lookup, ACMG tagging, reporting).
- `genomic_decision_support_system_spec_v0.1.md` — original requirements draft.
### 1. Trio Analysis (`trio_analysis.py`)
Comprehensive trio-based variant analysis with de novo detection, compound heterozygosity, and inheritance pattern annotation.
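As a rough illustration of what de novo detection means for a trio, the sketch below flags a site as a candidate when the proband carries a non-reference allele and both parents are homozygous reference. It is not taken from `trio_analysis.py`; the function names are hypothetical, and a real pipeline would presumably also filter on depth, genotype quality, and allele balance.

```python
from typing import List


def alleles(gt: str) -> List[str]:
    """Split a VCF GT string such as '0/1' or '0|1' into allele strings."""
    return gt.replace("|", "/").split("/")


def is_candidate_de_novo(proband_gt: str, father_gt: str, mother_gt: str) -> bool:
    """Hypothetical check: proband has an ALT allele, both parents are hom-ref."""
    child = alleles(proband_gt)
    parents = alleles(father_gt) + alleles(mother_gt)
    # Any missing call ('.') makes the site uninformative, so skip it.
    if "." in child or "." in parents:
        return False
    child_has_alt = any(a != "0" for a in child)
    parents_hom_ref = all(a == "0" for a in parents)
    return child_has_alt and parents_hom_ref
```

Missing genotypes are treated as uninformative and never flagged; that is a conservative assumption made for this sketch only.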
## Contributing/next steps
1. Finalize Phase 1 tech selection (variant caller, annotation stack, reference/DB versions).
2. Stand up the Phase 1 pipelines and minimal query API surface.
3. Add ACMG evidence tagging config and human-review logging.
4. Layer in PGx/DDI and supplement modules per later phases.
Data safety: keep genomic/clinical data local; the `.gitignore` blocks common genomic outputs by default.
## Quickstart (CLI scaffolding)
```
pip install -e .
# 1) Show trio calling plan (commands only; not executed)
genomic-consultant plan-call \
--sample proband:/data/proband.bam \
--sample father:/data/father.bam \
--sample mother:/data/mother.bam \
--reference /refs/GRCh38.fa \
--workdir /tmp/trio
# 1b) Execute calling plan (requires GATK installed) and emit run log
genomic-consultant run-call \
--sample proband:/data/proband.bam \
--sample father:/data/father.bam \
--sample mother:/data/mother.bam \
--reference /refs/GRCh38.fa \
--workdir /tmp/trio \
--log /tmp/trio/run_call_log.json \
--probe-tools
# 2) Show annotation plan for a joint VCF
genomic-consultant plan-annotate \
--vcf /tmp/trio/trio.joint.vcf.gz \
--workdir /tmp/trio/annot \
--prefix trio \
--reference /refs/GRCh38.fa
# 2b) Execute annotation plan (requires VEP, bcftools) with run log
genomic-consultant run-annotate \
--vcf /tmp/trio/trio.joint.vcf.gz \
--workdir /tmp/trio/annot \
--prefix trio \
--reference /refs/GRCh38.fa \
--log /tmp/trio/annot/run_annot_log.json \
--probe-tools
# 3) Demo panel report using sample data (panel file)
genomic-consultant panel-report \
--tsv sample_data/example_annotated.tsv \
--panel configs/panel.example.json \
--acmg-config configs/acmg_config.example.yaml \
--individual-id demo \
--format markdown \
--log /tmp/panel_log.json
# 3b) Demo panel report using phenotype mapping (HPO)
genomic-consultant panel-report \
--tsv sample_data/example_annotated.tsv \
--phenotype-id HP:0000365 \
--phenotype-mapping configs/phenotype_to_genes.hpo_seed.json \
--acmg-config configs/acmg_config.example.yaml \
--individual-id demo \
--format markdown
# 3c) Merge multiple phenotype→gene mappings into one
genomic-consultant build-phenotype-mapping \
--output configs/phenotype_to_genes.merged.json \
configs/phenotype_to_genes.example.json configs/phenotype_to_genes.hpo_seed.json
# 4) End-to-end Phase 1 pipeline (optionally skip call/annotate; use sample TSV)
genomic-consultant phase1-run \
--tsv sample_data/example_annotated.tsv \
--skip-call --skip-annotate \
--panel configs/panel.example.json \
--acmg-config configs/acmg_config.example.yaml \
--workdir runtime \
--prefix demo
# Run tests
pytest
```bash
python trio_analysis.py <vcf_path> <output_dir>
```
### Optional Parquet-backed store
Install pandas to enable Parquet ingestion:
```
pip install -e .[store]
### 2. ClinVar/ACMG Annotation (`clinvar_acmg_annotate.py`)
Annotates variants with ClinVar clinical significance and generates ACMG-style evidence tags.
```bash
python clinvar_acmg_annotate.py <vcf_path> <output_path> [sample_idx]
```
### Notes on VEP plugins (SpliceAI/CADD)
- The annotation plan already queries `SpliceAI` and `CADD_PHRED` fields; ensure your VEP run includes plugins/flags that produce them, e.g.:
- `--plugin SpliceAI,snv=/path/to/spliceai.snv.vcf.gz,indel=/path/to/spliceai.indel.vcf.gz`
- `--plugin CADD,/path/to/whole_genome_SNVs.tsv.gz,/path/to/InDels.tsv.gz`
- Pass these via `--plugin` and/or `--extra-flag` on `run-annotate` / `plan-annotate` to embed fields into the TSV.
### 3. GWAS Comprehensive Analysis (`gwas_comprehensive.py`)
Comprehensive GWAS trait analysis with 201 curated SNPs across 18 categories:
- Gout / Uric acid metabolism
- Kidney disease
- Hearing loss
- Autoimmune diseases
- Cancer risk
- Blood clotting / Thrombosis
- Thyroid disorders
- Bone health / Osteoporosis
- Liver disease (NAFLD)
- Migraine
- Longevity / Aging
- Sleep
- Skin conditions
- Cardiovascular disease
- Metabolic disorders
- Eye conditions
- Neuropsychiatric
- Other traits
```bash
python gwas_comprehensive.py <vcf_path> <output_path> [sample_idx]
```
### 4. PharmGKB Full Analysis (`pharmgkb_full_analysis.py`)
Comprehensive pharmacogenomics analysis using the PharmGKB clinical annotations database.
```bash
python pharmgkb_full_analysis.py <vcf_path> <output_path> [sample_idx]
```
### 5. GWAS Trait Lookup (`gwas_trait_lookup.py`)
Original curated GWAS trait lookup (smaller SNP set).
```bash
python gwas_trait_lookup.py <vcf_path> <output_path> [sample_idx]
```
### 6. Basic Pharmacogenomics (`pharmacogenomics.py`)
Basic pharmacogenomics analysis with common drug-gene interactions.
## Prerequisites
- Python 3.8+
- conda environment with bioinformatics tools:
```bash
conda create -n genomics python=3.10
conda activate genomics
conda install -c bioconda bcftools snpeff gatk4
```
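Before running the scripts, it may help to confirm that the command-line tools are on `PATH`. The following is a minimal sketch using only the standard library; the executable names in `DEFAULT_TOOLS` (in particular `snpEff` and `gatk`) are assumptions about what the conda packages install and may need adjusting.

```python
import shutil
from typing import Iterable, List

# Assumed executable names; adjust if your environment differs.
DEFAULT_TOOLS = ("bcftools", "snpEff", "gatk")


def missing_tools(names: Iterable[str] = DEFAULT_TOOLS) -> List[str]:
    """Return the subset of tool names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All tools found.")
```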
## Reference Databases Required
- **ClinVar**: VCF from NCBI
- **PharmGKB**: Clinical annotations TSV
- **dbSNP**: For rsID annotation
- **GRCh37/hg19 reference genome**
## Data Directory Structure
```
/Volumes/NV2/
├── genomics_analysis/
│ └── vcf/
│ ├── trio_joint.vcf.gz # Joint-called VCF
│ ├── trio_joint.rsid.vcf.gz # With rsID annotations
│ └── trio_joint.snpeff.vcf # With SnpEff annotations
└── genomics_reference/
├── clinvar/
├── pharmgkb/
├── dbsnp/
└── gwas_catalog/
```
## Sample Index Mapping
For trio VCF files:
- Index 0: Mother
- Index 1: Father
- Index 2: Proband
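The mapping above can be turned into a small lookup that resolves a `sample_idx` argument to the sample name in the VCF header. This is a sketch under assumptions: `TRIO_ROLES` and `sample_for_index` are hypothetical names, not part of the scripts. The only fact it relies on is that sample columns in a VCF `#CHROM` line follow the nine fixed columns (`CHROM` through `FORMAT`).

```python
from typing import Tuple

# Hypothetical constant mirroring the index order listed above.
TRIO_ROLES = {0: "mother", 1: "father", 2: "proband"}


def sample_for_index(header_line: str, sample_idx: int) -> Tuple[str, str]:
    """Return (sample_name, role) for a 0-based sample index.

    header_line is the tab-separated '#CHROM' line of a VCF file.
    """
    columns = header_line.rstrip("\n").split("\t")
    samples = columns[9:]  # sample columns start after FORMAT
    if sample_idx not in TRIO_ROLES or sample_idx >= len(samples):
        raise IndexError(f"sample_idx {sample_idx} is not valid for this trio VCF")
    return samples[sample_idx], TRIO_ROLES[sample_idx]
```

Checking the returned sample name against the expected family member is a cheap way to catch a joint VCF whose columns are in a different order.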
## Output Reports
Each script generates detailed reports including:
- Summary statistics
- Risk variant identification
- Family comparison (for trio data)
- Clinical annotations and recommendations
## License
Private use only.