Refactor: Replace scaffolding with working analysis scripts
- Add trio_analysis.py for trio-based variant analysis with de novo detection
- Add clinvar_acmg_annotate.py for ClinVar/ACMG annotation
- Add gwas_comprehensive.py with 201 SNPs across 18 categories
- Add pharmgkb_full_analysis.py for pharmacogenomics analysis
- Add gwas_trait_lookup.py for basic GWAS trait lookup
- Add pharmacogenomics.py for basic PGx analysis
- Remove unused scaffolding code (src/, configs/, docs/, tests/)
- Update README.md with new documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
207
README.md
@@ -1,112 +1,113 @@
# Genomic Consultant

Early design for a personal genomic risk and drug–interaction decision support system. Specs are sourced from `genomic_decision_support_system_spec_v0.1.md`.
A practical genomics analysis toolkit for Trio WES (Whole Exome Sequencing) data analysis, including ClinVar/ACMG annotation, GWAS trait analysis, and pharmacogenomics.

## Vision (per spec)
- Phase 1: trio variant calling, annotation, queryable genomic DB, initial ACMG evidence tagging.
- Phase 2: pharmacogenomics genotype-to-phenotype mapping plus drug–drug interaction checks.
- Phase 3: supplement/herb normalization and interaction risk layering.
- Phase 4: LLM-driven query orchestration and report generation.

## Analysis Scripts

## Repository Layout
- `docs/` — system architecture notes, phase plans, data models (work in progress).
- `configs/` — example ACMG config and gene panel JSON.
- `configs/phenotype_to_genes.example.json` — placeholder phenotype/HPO → gene mappings.
- `configs/phenotype_to_genes.hpo_seed.json` — seed HPO mappings (replace with full HPO/GenCC derived panels).
- `sample_data/` — tiny annotated TSV for demo.
- `src/genomic_consultant/` — Python scaffolding (pipelines, store, panel lookup, ACMG tagging, reporting).
- `genomic_decision_support_system_spec_v0.1.md` — original requirements draft.

### 1. Trio Analysis (`trio_analysis.py`)
Comprehensive trio-based variant analysis with de novo detection, compound heterozygosity, and inheritance pattern annotation.

## Contributing/next steps
1. Finalize Phase 1 tech selection (variant caller, annotation stack, reference/DB versions).
2. Stand up the Phase 1 pipelines and minimal query API surface.
3. Add ACMG evidence tagging config and human-review logging.
4. Layer in PGx/DDI and supplement modules per later phases.

Data safety: keep genomic/clinical data local; the `.gitignore` blocks common genomic outputs by default.

## Quickstart (CLI scaffolding)
```
pip install -e .

# 1) Show trio calling plan (commands only; not executed)
genomic-consultant plan-call \
  --sample proband:/data/proband.bam \
  --sample father:/data/father.bam \
  --sample mother:/data/mother.bam \
  --reference /refs/GRCh38.fa \
  --workdir /tmp/trio

# 1b) Execute calling plan (requires GATK installed) and emit run log
genomic-consultant run-call \
  --sample proband:/data/proband.bam \
  --sample father:/data/father.bam \
  --sample mother:/data/mother.bam \
  --reference /refs/GRCh38.fa \
  --workdir /tmp/trio \
  --log /tmp/trio/run_call_log.json \
  --probe-tools

# 2) Show annotation plan for a joint VCF
genomic-consultant plan-annotate \
  --vcf /tmp/trio/trio.joint.vcf.gz \
  --workdir /tmp/trio/annot \
  --prefix trio \
  --reference /refs/GRCh38.fa

# 2b) Execute annotation plan (requires VEP, bcftools) with run log
genomic-consultant run-annotate \
  --vcf /tmp/trio/trio.joint.vcf.gz \
  --workdir /tmp/trio/annot \
  --prefix trio \
  --reference /refs/GRCh38.fa \
  --log /tmp/trio/annot/run_annot_log.json \
  --probe-tools

# 3) Demo panel report using sample data (panel file)
genomic-consultant panel-report \
  --tsv sample_data/example_annotated.tsv \
  --panel configs/panel.example.json \
  --acmg-config configs/acmg_config.example.yaml \
  --individual-id demo \
  --format markdown \
  --log /tmp/panel_log.json

# 3b) Demo panel report using phenotype mapping (HPO)
genomic-consultant panel-report \
  --tsv sample_data/example_annotated.tsv \
  --phenotype-id HP:0000365 \
  --phenotype-mapping configs/phenotype_to_genes.hpo_seed.json \
  --acmg-config configs/acmg_config.example.yaml \
  --individual-id demo \
  --format markdown

# 3c) Merge multiple phenotype→gene mappings into one
genomic-consultant build-phenotype-mapping \
  --output configs/phenotype_to_genes.merged.json \
  configs/phenotype_to_genes.example.json configs/phenotype_to_genes.hpo_seed.json

# 4) End-to-end Phase 1 pipeline (optionally skip call/annotate; use sample TSV)
genomic-consultant phase1-run \
  --tsv sample_data/example_annotated.tsv \
  --skip-call --skip-annotate \
  --panel configs/panel.example.json \
  --acmg-config configs/acmg_config.example.yaml \
  --workdir runtime \
  --prefix demo

# Run tests
pytest
```

```bash
python trio_analysis.py <vcf_path> <output_dir>
```
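
The de novo detection mentioned for `trio_analysis.py` can be illustrated with a minimal sketch. This is not the script's actual logic; it assumes genotypes are available as VCF `GT` strings, and the helper names `parse_gt` and `is_de_novo_candidate` are hypothetical.

```python
def parse_gt(gt: str) -> list:
    """Split a VCF GT string such as '0/1' or '0|1' into allele tokens."""
    return gt.replace("|", "/").split("/")


def is_de_novo_candidate(proband_gt: str, father_gt: str, mother_gt: str) -> bool:
    """Return True when the proband carries a non-reference allele seen in neither parent.

    A missing call ('.') in any family member makes the site uninformative,
    so the function returns False in that case.
    """
    proband = parse_gt(proband_gt)
    father = parse_gt(father_gt)
    mother = parse_gt(mother_gt)
    if "." in proband + father + mother:
        return False
    parental_alleles = set(father) | set(mother)
    # Any non-reference proband allele absent from both parents is a candidate.
    return any(a != "0" and a not in parental_alleles for a in proband)


if __name__ == "__main__":
    print(is_de_novo_candidate("0/1", "0/0", "0/0"))
    print(is_de_novo_candidate("0/1", "0/1", "0/0"))
```

A genotype-only check like this produces many false positives on real data, so a practical pipeline would likely also filter on read depth and genotype quality before reporting a candidate.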

### Optional Parquet-backed store
Install pandas to enable Parquet ingestion:
```
pip install -e .[store]
```

### 2. ClinVar/ACMG Annotation (`clinvar_acmg_annotate.py`)
Annotates variants with ClinVar clinical significance and generates ACMG-style evidence tags.

```bash
python clinvar_acmg_annotate.py <vcf_path> <output_path> [sample_idx]
```
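
As a rough illustration of reading ClinVar clinical significance, the sketch below parses a VCF `INFO` string and extracts the `CLNSIG` field that ClinVar VCFs carry. It is an assumption-laden example, not the logic of `clinvar_acmg_annotate.py`: the function names are hypothetical, the separators handled (`/`, `|`, `,`) are assumed, and no ACMG evidence tags are derived here.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO column ('KEY=VALUE;FLAG;...') into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        # Flags without '=' are stored as True.
        fields[key] = value if sep else True
    return fields


def clinvar_significance(info: str) -> list:
    """Return the CLNSIG terms from an INFO string, lowercased."""
    raw = parse_info(info).get("CLNSIG", "")
    if raw is True or not raw:
        return []
    # Assumption: multiple terms are joined with '/', '|' or ','.
    for sep in ("|", ","):
        raw = raw.replace(sep, "/")
    return [term.strip("_ ").lower() for term in raw.split("/") if term]


def is_reported_pathogenic(info: str) -> bool:
    """True if any CLNSIG term is 'pathogenic' or 'likely_pathogenic'."""
    terms = clinvar_significance(info)
    return any(t in ("pathogenic", "likely_pathogenic") for t in terms)


if __name__ == "__main__":
    example = "ALLELEID=123;CLNSIG=Pathogenic/Likely_pathogenic;GENEINFO=BRCA1:672"
    print(clinvar_significance(example), is_reported_pathogenic(example))
```

The `ALLELEID` and `GENEINFO` values in the usage line are made-up illustration data, not real ClinVar records.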

### Notes on VEP plugins (SpliceAI/CADD)
- The annotation plan already queries `SpliceAI` and `CADD_PHRED` fields; ensure your VEP run includes plugins/flags that produce them, e.g.:
  - `--plugin SpliceAI,snv=/path/to/spliceai.snv.vcf.gz,indel=/path/to/spliceai.indel.vcf.gz`
  - `--plugin CADD,/path/to/whole_genome_SNVs.tsv.gz,/path/to/InDels.tsv.gz`
- Pass these via `--plugin` and/or `--extra-flag` on `run-annotate` / `plan-annotate` to embed fields into the TSV.

### 3. GWAS Comprehensive Analysis (`gwas_comprehensive.py`)
Comprehensive GWAS trait analysis with 201 curated SNPs across 18 categories:
- Gout / Uric acid metabolism
- Kidney disease
- Hearing loss
- Autoimmune diseases
- Cancer risk
- Blood clotting / Thrombosis
- Thyroid disorders
- Bone health / Osteoporosis
- Liver disease (NAFLD)
- Migraine
- Longevity / Aging
- Sleep
- Skin conditions
- Cardiovascular disease
- Metabolic disorders
- Eye conditions
- Neuropsychiatric
- Other traits

```bash
python gwas_comprehensive.py <vcf_path> <output_path> [sample_idx]
```
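
A curated SNP lookup of this kind usually comes down to counting how many copies of a risk allele a sample carries at each site. The sketch below shows that step under stated assumptions: the function name `risk_allele_count` is hypothetical, the inputs are taken straight from a VCF record (`GT`, `REF`, `ALT`), and the script's real SNP table and scoring are not reproduced.

```python
from typing import List, Optional


def risk_allele_count(gt: str, ref: str, alts: List[str], risk_allele: str) -> Optional[int]:
    """Count copies of risk_allele in a genotype.

    gt          -- VCF GT string such as '0/1' or '1|1'
    ref, alts   -- REF allele and list of ALT alleles from the same record
    risk_allele -- the allele reported as the risk/effect allele
    Returns None when any allele call is missing ('.').
    """
    alleles = [ref] + alts  # index 0 is REF, 1.. are ALT alleles
    count = 0
    for token in gt.replace("|", "/").split("/"):
        if token == ".":
            return None
        if alleles[int(token)] == risk_allele:
            count += 1
    return count


if __name__ == "__main__":
    # Heterozygous A/G sample where G is the risk allele.
    print(risk_allele_count("0/1", "A", ["G"], "G"))
```

Note that the risk allele may be the reference allele, which is why the count is made against the decoded alleles and not against the ALT index alone.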

### 4. PharmGKB Full Analysis (`pharmgkb_full_analysis.py`)
Comprehensive pharmacogenomics analysis using the PharmGKB clinical annotations database.

```bash
python pharmgkb_full_analysis.py <vcf_path> <output_path> [sample_idx]
```

### 5. GWAS Trait Lookup (`gwas_trait_lookup.py`)
Original curated GWAS trait lookup (smaller SNP set).

```bash
python gwas_trait_lookup.py <vcf_path> <output_path> [sample_idx]
```

### 6. Basic Pharmacogenomics (`pharmacogenomics.py`)
Basic pharmacogenomics analysis with common drug-gene interactions.

## Prerequisites

- Python 3.8+
- conda environment with bioinformatics tools:
  ```bash
  conda create -n genomics python=3.10
  conda activate genomics
  conda install -c bioconda bcftools snpeff gatk4
  ```
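
Before running the scripts, it can help to confirm that the command-line tools are on `PATH`. The following is a small sketch using only the standard library; the executable names (`bcftools`, `snpEff`, `gatk`) are assumptions about what the conda packages above install, and `missing_tools` is a hypothetical helper.

```python
import shutil
from typing import Iterable, List

# Assumed executable names for the bioconda packages bcftools, snpeff and gatk4.
REQUIRED_TOOLS = ["bcftools", "snpEff", "gatk"]


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```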

## Reference Databases Required

- **ClinVar**: VCF from NCBI
- **PharmGKB**: Clinical annotations TSV
- **dbSNP**: For rsID annotation
- **GRCh37/hg19 reference genome**

## Data Directory Structure

```
/Volumes/NV2/
├── genomics_analysis/
│   └── vcf/
│       ├── trio_joint.vcf.gz          # Joint-called VCF
│       ├── trio_joint.rsid.vcf.gz     # With rsID annotations
│       └── trio_joint.snpeff.vcf      # With SnpEff annotations
└── genomics_reference/
    ├── clinvar/
    ├── pharmgkb/
    ├── dbsnp/
    └── gwas_catalog/
```

## Sample Index Mapping

For trio VCF files:
- Index 0: Mother
- Index 1: Father
- Index 2: Proband
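
The index mapping above can be tied to actual sample names by reading the `#CHROM` header line of the VCF. This is a minimal sketch that assumes `sample_idx` refers to the order of the sample columns in that header; `TRIO_ROLES` and `sample_columns` are hypothetical names.

```python
# Role for each sample index, as listed in this README.
TRIO_ROLES = {0: "mother", 1: "father", 2: "proband"}


def sample_columns(header_line: str) -> dict:
    """Map each trio role to the sample name in a '#CHROM' VCF header line."""
    columns = header_line.rstrip("\n").split("\t")
    samples = columns[9:]  # sample columns follow the nine fixed VCF columns
    if len(samples) < len(TRIO_ROLES):
        raise ValueError("expected at least three sample columns for a trio VCF")
    return {role: samples[idx] for idx, role in TRIO_ROLES.items()}


if __name__ == "__main__":
    header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
    print(sample_columns(header))
```

Checking the header this way guards against a joint VCF whose sample order differs from the mapping assumed here.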

## Output Reports

Each script generates detailed reports including:
- Summary statistics
- Risk variant identification
- Family comparison (for trio data)
- Clinical annotations and recommendations

## License

Private use only.