Refactor: Replace scaffolding with working analysis scripts

- Add trio_analysis.py for trio-based variant analysis with de novo detection
- Add clinvar_acmg_annotate.py for ClinVar/ACMG annotation
- Add gwas_comprehensive.py with 201 SNPs across 18 categories
- Add pharmgkb_full_analysis.py for pharmacogenomics analysis
- Add gwas_trait_lookup.py for basic GWAS trait lookup
- Add pharmacogenomics.py for basic PGx analysis
- Remove unused scaffolding code (src/, configs/, docs/, tests/)
- Update README.md with new documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-01 22:36:02 +08:00
parent f74dc351f7
commit d13d58df8b
56 changed files with 2608 additions and 2347 deletions
--- a/14
+++ b/14
@@ -1,14 +0,0 @@
 PYTHON := .venv/bin/python
 PIP := .venv/bin/pip
 .PHONY: venv install test
 venv:
 	python3 -m venv .venv
 	$(PIP) install -e .
 install: venv
 	$(PIP) install -e .[store]
 test:
 	$(PYTHON) -m pytest
--- a/README.md
+++ b/README.md
@@ -1,112 +1,113 @@
 # Genomic Consultant
-Early design for a personal genomic risk and drug–interaction decision support system. Specs are sourced from `genomic_decision_support_system_spec_v0.1.md`.
+A practical genomics toolkit for analyzing trio WES (whole-exome sequencing) data, including ClinVar/ACMG annotation, GWAS trait analysis, and pharmacogenomics.
-## Vision (per spec)
+## Analysis Scripts
 - Phase 1: trio variant calling, annotation, queryable genomic DB, initial ACMG evidence tagging.
 - Phase 2: pharmacogenomics genotype-to-phenotype mapping plus drug–drug interaction checks.
 - Phase 3: supplement/herb normalization and interaction risk layering.
 - Phase 4: LLM-driven query orchestration and report generation.
-## Repository Layout
+### 1. Trio Analysis (`trio_analysis.py`)
- `docs/` — system architecture notes, phase plans, data models (work in progress).
+Comprehensive trio-based variant analysis with de novo detection, compound heterozygosity, and inheritance pattern annotation.
 - `configs/` — example ACMG config and gene panel JSON.
 - `configs/phenotype_to_genes.example.json` — placeholder phenotype/HPO → gene mappings.
 - `configs/phenotype_to_genes.hpo_seed.json` — seed HPO mappings (replace with full HPO/GenCC derived panels).
 - `sample_data/` — tiny annotated TSV for demo.
 - `src/genomic_consultant/` — Python scaffolding (pipelines, store, panel lookup, ACMG tagging, reporting).
 - `genomic_decision_support_system_spec_v0.1.md` — original requirements draft.
-## Contributing/next steps
+```bash
-1. Finalize Phase 1 tech selection (variant caller, annotation stack, reference/DB versions).
+python trio_analysis.py <vcf_path> <output_dir>
 2. Stand up the Phase 1 pipelines and minimal query API surface.
 3. Add ACMG evidence tagging config and human-review logging.
 4. Layer in PGx/DDI and supplement modules per later phases.
 Data safety: keep genomic/clinical data local; the `.gitignore` blocks common genomic outputs by default.
 ## Quickstart (CLI scaffolding)
 ```
 pip install -e .
 # 1) Show trio calling plan (commands only; not executed)
 genomic-consultant plan-call \
  --sample proband:/data/proband.bam \
  --sample father:/data/father.bam \
  --sample mother:/data/mother.bam \
  --reference /refs/GRCh38.fa \
  --workdir /tmp/trio
 # 1b) Execute calling plan (requires GATK installed) and emit run log
 genomic-consultant run-call \
  --sample proband:/data/proband.bam \
  --sample father:/data/father.bam \
  --sample mother:/data/mother.bam \
  --reference /refs/GRCh38.fa \
  --workdir /tmp/trio \
  --log /tmp/trio/run_call_log.json \
  --probe-tools
 # 2) Show annotation plan for a joint VCF
 genomic-consultant plan-annotate \
  --vcf /tmp/trio/trio.joint.vcf.gz \
  --workdir /tmp/trio/annot \
  --prefix trio \
  --reference /refs/GRCh38.fa
 # 2b) Execute annotation plan (requires VEP, bcftools) with run log
 genomic-consultant run-annotate \
  --vcf /tmp/trio/trio.joint.vcf.gz \
  --workdir /tmp/trio/annot \
  --prefix trio \
  --reference /refs/GRCh38.fa \
  --log /tmp/trio/annot/run_annot_log.json \
  --probe-tools
 # 3) Demo panel report using sample data (panel file)
 genomic-consultant panel-report \
  --tsv sample_data/example_annotated.tsv \
  --panel configs/panel.example.json \
  --acmg-config configs/acmg_config.example.yaml \
  --individual-id demo \
  --format markdown \
  --log /tmp/panel_log.json
 # 3b) Demo panel report using phenotype mapping (HPO)
 genomic-consultant panel-report \
  --tsv sample_data/example_annotated.tsv \
  --phenotype-id HP:0000365 \
  --phenotype-mapping configs/phenotype_to_genes.hpo_seed.json \
  --acmg-config configs/acmg_config.example.yaml \
  --individual-id demo \
  --format markdown
 # 3c) Merge multiple phenotype→gene mappings into one
 genomic-consultant build-phenotype-mapping \
  --output configs/phenotype_to_genes.merged.json \
  configs/phenotype_to_genes.example.json configs/phenotype_to_genes.hpo_seed.json
 # 4) End-to-end Phase 1 pipeline (optionally skip call/annotate; use sample TSV)
 genomic-consultant phase1-run \
  --tsv sample_data/example_annotated.tsv \
  --skip-call --skip-annotate \
  --panel configs/panel.example.json \
  --acmg-config configs/acmg_config.example.yaml \
  --workdir runtime \
  --prefix demo
 # Run tests
 pytest
 ```
-### Optional Parquet-backed store
+### 2. ClinVar/ACMG Annotation (`clinvar_acmg_annotate.py`)
-Install pandas to enable Parquet ingestion:
+Annotates variants with ClinVar clinical significance and generates ACMG-style evidence tags.
-```
+
-pip install -e .[store]
+```bash
 python clinvar_acmg_annotate.py <snpeff_vcf> <clinvar_vcf> <output_path> [proband_idx] [father_idx] [mother_idx]
 ```
-### Notes on VEP plugins (SpliceAI/CADD)
+### 3. GWAS Comprehensive Analysis (`gwas_comprehensive.py`)
- The annotation plan already queries `SpliceAI` and `CADD_PHRED` fields; ensure your VEP run includes plugins/flags that produce them, e.g.:
+Comprehensive GWAS trait analysis with 201 curated SNPs across 18 categories:
-  - `--plugin SpliceAI,snv=/path/to/spliceai.snv.vcf.gz,indel=/path/to/spliceai.indel.vcf.gz`
+- Gout / Uric acid metabolism
-  - `--plugin CADD,/path/to/whole_genome_SNVs.tsv.gz,/path/to/InDels.tsv.gz`
+- Kidney disease
- Pass these via `--plugin` and/or `--extra-flag` on `run-annotate` / `plan-annotate` to embed fields into the TSV.
+- Hearing loss
 - Autoimmune diseases
 - Cancer risk
 - Blood clotting / Thrombosis
 - Thyroid disorders
 - Bone health / Osteoporosis
 - Liver disease (NAFLD)
 - Migraine
 - Longevity / Aging
 - Sleep
 - Skin conditions
 - Cardiovascular disease
 - Metabolic disorders
 - Eye conditions
 - Neuropsychiatric
 - Other traits
 ```bash
 python gwas_comprehensive.py <vcf_path> <output_path> [sample_idx]
 ```
 ### 4. PharmGKB Full Analysis (`pharmgkb_full_analysis.py`)
 Comprehensive pharmacogenomics analysis using the PharmGKB clinical annotations database.
 ```bash
 python pharmgkb_full_analysis.py <vcf_path> <output_path> [sample_idx]
 ```
 ### 5. GWAS Trait Lookup (`gwas_trait_lookup.py`)
 Original curated GWAS trait lookup (smaller SNP set).
 ```bash
 python gwas_trait_lookup.py <vcf_path> <output_path> [sample_idx]
 ```
 ### 6. Basic Pharmacogenomics (`pharmacogenomics.py`)
 Basic pharmacogenomics analysis with common drug-gene interactions.
 ## Prerequisites
 - Python 3.8+
 - conda environment with bioinformatics tools:
  ```bash
  conda create -n genomics python=3.10
  conda activate genomics
  conda install -c bioconda bcftools snpeff gatk4
  ```
 ## Reference Databases Required
 - **ClinVar**: VCF from NCBI
 - **PharmGKB**: Clinical annotations TSV
 - **dbSNP**: For rsID annotation
 - **GRCh37/hg19 reference genome**
 ## Data Directory Structure
 ```
 /Volumes/NV2/
 ├── genomics_analysis/
 │   └── vcf/
 │       ├── trio_joint.vcf.gz          # Joint-called VCF
 │       ├── trio_joint.rsid.vcf.gz     # With rsID annotations
 │       └── trio_joint.snpeff.vcf      # With SnpEff annotations
 └── genomics_reference/
    ├── clinvar/
    ├── pharmgkb/
    ├── dbsnp/
    └── gwas_catalog/
 ```
 ## Sample Index Mapping
 For trio VCF files:
 - Index 0: Mother
 - Index 1: Father
 - Index 2: Proband
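The index-to-role mapping depends on the column order in the VCF header, so it can help to list the sample names before passing an index to the scripts. The following is a minimal sketch, assuming a standard `#CHROM` header line; `list_vcf_samples` is a hypothetical helper and is not part of the scripts in this repository.

```python
import gzip


def list_vcf_samples(vcf_path):
    """Return sample names from the #CHROM header line, in column order."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 0-8 are fixed VCF fields; samples start at column 9.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached data lines without finding a header
    return []
```

Calling `list_vcf_samples` on the joint VCF and enumerating the result shows which index corresponds to which sample ID.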
 ## Output Reports
 Each script generates detailed reports including:
 - Summary statistics
 - Risk variant identification
 - Family comparison (for trio data)
 - Clinical annotations and recommendations
 ## License
 Private use only.
--- a/clinvar_acmg_annotate.py
+++ b/clinvar_acmg_annotate.py
@@ -0,0 +1,448 @@
 #!/usr/bin/env python3
 """
 ClinVar Annotation and ACMG Classification Script
 Integrates ClinVar lookup with ACMG auto-classification for trio analysis.
 """
 import gzip
 import re
 import sys
 from collections import defaultdict
 from dataclasses import dataclass, field
 from typing import Dict, List, Optional, Set, Tuple
 from pathlib import Path
 # Optional: import scaffolding modules from src/ if present; the built-in ACMGClassifier below is used either way
 sys.path.insert(0, str(Path(__file__).parent / "src"))
 try:
    from genomic_consultant.acmg.tagger import ACMGConfig, tag_variant, _is_lof
    from genomic_consultant.utils.models import Variant, EvidenceTag, SuggestedClassification
    HAS_PROJECT_MODULES = True
 except ImportError:
    HAS_PROJECT_MODULES = False
    print("Warning: Project modules not found, using built-in ACMG classification")
@dataclass
 class ClinVarEntry:
    """ClinVar database entry"""
    chrom: str
    pos: int
    ref: str
    alt: str
    clnsig: str  # Clinical significance
    clndn: str   # Disease name
    clnrevstat: str  # Review status
    clnvc: str   # Variant type
    af: Optional[float] = None
@dataclass
 class AnnotatedVariant:
    """Variant with all annotations"""
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: Optional[str] = None
    effect: Optional[str] = None
    impact: Optional[str] = None
    genotypes: Dict[str, str] = field(default_factory=dict)
    clinvar_sig: Optional[str] = None
    clinvar_disease: Optional[str] = None
    clinvar_review: Optional[str] = None
    acmg_class: Optional[str] = None
    acmg_evidence: List[str] = field(default_factory=list)
    inheritance_pattern: Optional[str] = None  # de_novo, compound_het, hom_rec, etc.
    @property
    def variant_id(self) -> str:
        return f"{self.chrom}-{self.pos}-{self.ref}-{self.alt}"
 def load_clinvar_vcf(clinvar_path: str) -> Dict[str, ClinVarEntry]:
    """Load ClinVar VCF into a lookup dictionary"""
    print(f"Loading ClinVar database from {clinvar_path}...")
    clinvar_db = {}
    open_func = gzip.open if clinvar_path.endswith('.gz') else open
    mode = 'rt' if clinvar_path.endswith('.gz') else 'r'
    count = 0
    with open_func(clinvar_path, mode) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 8:
                continue
            chrom, pos, _, ref, alt, _, _, info = parts[:8]
            # Parse INFO field
            info_dict = {}
            for item in info.split(';'):
                if '=' in item:
                    k, v = item.split('=', 1)
                    info_dict[k] = v
            clnsig = info_dict.get('CLNSIG', '')
            clndn = info_dict.get('CLNDN', '')
            clnrevstat = info_dict.get('CLNREVSTAT', '')
            clnvc = info_dict.get('CLNVC', '')
            # Handle multiple alts
            for a in alt.split(','):
                key = f"{chrom}-{pos}-{ref}-{a}"
                clinvar_db[key] = ClinVarEntry(
                    chrom=chrom,
                    pos=int(pos),
                    ref=ref,
                    alt=a,
                    clnsig=clnsig,
                    clndn=clndn,
                    clnrevstat=clnrevstat,
                    clnvc=clnvc
                )
                count += 1
    print(f"Loaded {count} ClinVar entries")
    return clinvar_db
 def parse_snpeff_annotation(info: str) -> Dict:
    """Parse SnpEff ANN field"""
    result = {
        'gene': None,
        'effect': None,
        'impact': None,
        'hgvs_c': None,
        'hgvs_p': None,
    }
    ann_match = re.search(r'ANN=([^;]+)', info)
    if not ann_match:
        return result
    ann_field = ann_match.group(1)
    annotations = ann_field.split(',')
    if annotations:
        parts = annotations[0].split('|')
        if len(parts) >= 4:
            result['effect'] = parts[1] if len(parts) > 1 else None
            result['impact'] = parts[2] if len(parts) > 2 else None
            result['gene'] = parts[3] if len(parts) > 3 else None
            if len(parts) > 9:
                result['hgvs_c'] = parts[9]
            if len(parts) > 10:
                result['hgvs_p'] = parts[10]
    return result
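 def get_genotype_class(gt: str) -> str:
    """Classify a GT string as MISSING, HOM_REF, HOM_ALT, or HET"""
    if gt in ['./.', '.|.', '.']:
        return 'MISSING'
    alleles = re.split('[/|]', gt)
    if all(a == '0' for a in alleles):
        return 'HOM_REF'
    # HOM_ALT requires identical non-reference alleles; a call like '1/2' is heterozygous
    elif '0' not in alleles and '.' not in alleles and len(set(alleles)) == 1:
        return 'HOM_ALT'
    else:
        return 'HET'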
 class ACMGClassifier:
    """ACMG variant classifier"""
    def __init__(self, lof_genes: Optional[Set[str]] = None):
        self.lof_genes = lof_genes or {
            'BRCA1', 'BRCA2', 'TP53', 'PTEN', 'MLH1', 'MSH2', 'MSH6', 'PMS2',
            'APC', 'MEN1', 'RB1', 'VHL', 'WT1', 'NF1', 'NF2', 'TSC1', 'TSC2'
        }
        self.ba1_af = 0.05
        self.bs1_af = 0.01
        self.pm2_af = 0.0005
    def classify(self, variant: AnnotatedVariant, is_de_novo: bool = False) -> Tuple[str, List[str]]:
        """Apply ACMG classification rules"""
        evidence = []
        # ClinVar evidence (conflicting interpretations are skipped, matching _determine_class)
        if variant.clinvar_sig and 'conflicting' not in variant.clinvar_sig.lower():
            sig_lower = variant.clinvar_sig.lower()
            if 'pathogenic' in sig_lower and 'likely' not in sig_lower:
                evidence.append("PP5: ClinVar pathogenic")
            elif 'likely_pathogenic' in sig_lower:
                evidence.append("PP5: ClinVar likely pathogenic")
            elif 'benign' in sig_lower and 'likely' not in sig_lower:
                evidence.append("BP6: ClinVar benign")
            elif 'likely_benign' in sig_lower:
                evidence.append("BP6: ClinVar likely benign")
        # Loss of function in LoF-sensitive gene (PVS1)
        if variant.effect and variant.gene:
            lof_keywords = ['frameshift', 'stop_gained', 'splice_acceptor', 'splice_donor', 'start_lost']
            if any(k in variant.effect.lower() for k in lof_keywords):
                if variant.gene.upper() in self.lof_genes:
                    evidence.append("PVS1: Null variant in LoF-sensitive gene")
                else:
                    evidence.append("PVS1_moderate: Null variant (gene not confirmed LoF-sensitive)")
        # De novo (PS2)
        if is_de_novo:
            evidence.append("PS2: De novo variant")
        # Impact-based evidence
        if variant.impact == 'HIGH':
            evidence.append("PM4: Protein length change (HIGH impact)")
        elif variant.impact == 'MODERATE':
            if variant.effect and 'missense' in variant.effect.lower():
                evidence.append("PP3: Computational evidence (missense)")
        # Determine final classification
        classification = self._determine_class(evidence, variant.clinvar_sig)
        return classification, evidence
    def _determine_class(self, evidence: List[str], clinvar_sig: Optional[str]) -> str:
        """Determine ACMG class based on evidence"""
        evidence_str = ' '.join(evidence)
        # ClinVar takes precedence if high confidence
        if clinvar_sig:
            sig_lower = clinvar_sig.lower()
            if 'pathogenic' in sig_lower and 'conflicting' not in sig_lower:
                if 'likely' in sig_lower:
                    return 'Likely Pathogenic'
                return 'Pathogenic'
            elif 'benign' in sig_lower and 'conflicting' not in sig_lower:
                if 'likely' in sig_lower:
                    return 'Likely Benign'
                return 'Benign'
        # Rule-based classification
        has_pvs1 = 'PVS1:' in evidence_str
        has_ps2 = 'PS2:' in evidence_str
        has_pm4 = 'PM4:' in evidence_str
        has_pp = 'PP' in evidence_str
        has_bp = 'BP' in evidence_str
        if has_pvs1 and has_ps2:
            return 'Pathogenic'
        elif has_pvs1 or (has_ps2 and has_pm4):
            return 'Likely Pathogenic'
        elif has_bp and not has_pp and not has_pvs1:
            return 'Likely Benign'
        else:
            return 'VUS'
 def analyze_trio_with_clinvar(
    snpeff_vcf: str,
    clinvar_path: str,
    output_path: str,
    proband_idx: int = 0,
    father_idx: int = 1,
    mother_idx: int = 2
 ):
    """Main analysis function"""
    # Load ClinVar
    clinvar_db = load_clinvar_vcf(clinvar_path)
    # Initialize classifier
    classifier = ACMGClassifier()
    # Parse VCF and annotate
    print(f"Processing {snpeff_vcf}...")
    samples = []
    results = []
    pathogenic_variants = []
    open_func = gzip.open if snpeff_vcf.endswith('.gz') else open
    mode = 'rt' if snpeff_vcf.endswith('.gz') else 'r'
    with open_func(snpeff_vcf, mode) as f:
        for line in f:
            if line.startswith('##'):
                continue
            elif line.startswith('#CHROM'):
                parts = line.strip().split('\t')
                samples = parts[9:]
                continue
            parts = line.strip().split('\t')
            if len(parts) < 10:
                continue
            chrom, pos, _, ref, alt, qual, filt, info, fmt = parts[:9]
            gt_fields = parts[9:]
            # Parse genotypes
            fmt_parts = fmt.split(':')
            gt_idx = fmt_parts.index('GT') if 'GT' in fmt_parts else 0
            genotypes = {}
            for i, sample in enumerate(samples):
                gt_data = gt_fields[i].split(':')
                genotypes[sample] = gt_data[gt_idx] if gt_idx < len(gt_data) else './.'
            # Parse SnpEff annotation
            ann = parse_snpeff_annotation(info)
            # Only process variants in proband
            proband = samples[proband_idx] if proband_idx < len(samples) else samples[0]
            proband_gt = get_genotype_class(genotypes.get(proband, './.'))
            if proband_gt == 'HOM_REF' or proband_gt == 'MISSING':
                continue
            # Check inheritance pattern
            father = samples[father_idx] if father_idx < len(samples) else samples[1]
            mother = samples[mother_idx] if mother_idx < len(samples) else samples[2]
            father_gt = get_genotype_class(genotypes.get(father, './.'))
            mother_gt = get_genotype_class(genotypes.get(mother, './.'))
            is_de_novo = (proband_gt in ['HET', 'HOM_ALT'] and
                         father_gt == 'HOM_REF' and mother_gt == 'HOM_REF')
            is_hom_rec = (proband_gt == 'HOM_ALT' and
                         father_gt == 'HET' and mother_gt == 'HET')
            inheritance = None
            if is_de_novo:
                inheritance = 'de_novo'
            elif is_hom_rec:
                inheritance = 'homozygous_recessive'
            elif proband_gt == 'HET':
                if father_gt in ['HET', 'HOM_ALT'] and mother_gt == 'HOM_REF':
                    inheritance = 'paternal'
                elif mother_gt in ['HET', 'HOM_ALT'] and father_gt == 'HOM_REF':
                    inheritance = 'maternal'
            # Lookup ClinVar
            for a in alt.split(','):
                var_key = f"{chrom}-{pos}-{ref}-{a}"
                clinvar_entry = clinvar_db.get(var_key)
                variant = AnnotatedVariant(
                    chrom=chrom,
                    pos=int(pos),
                    ref=ref,
                    alt=a,
                    gene=ann['gene'],
                    effect=ann['effect'],
                    impact=ann['impact'],
                    genotypes=genotypes,
                    inheritance_pattern=inheritance
                )
                if clinvar_entry:
                    variant.clinvar_sig = clinvar_entry.clnsig
                    variant.clinvar_disease = clinvar_entry.clndn
                    variant.clinvar_review = clinvar_entry.clnrevstat
                # ACMG classification
                acmg_class, evidence = classifier.classify(variant, is_de_novo)
                variant.acmg_class = acmg_class
                variant.acmg_evidence = evidence
                # Filter for clinically relevant variants
                if (variant.clinvar_sig and 'pathogenic' in variant.clinvar_sig.lower()) or \
                   acmg_class in ['Pathogenic', 'Likely Pathogenic'] or \
                   (is_de_novo and ann['impact'] in ['HIGH', 'MODERATE']):
                    pathogenic_variants.append(variant)
                results.append(variant)
    # Generate report
    print(f"Writing report to {output_path}...")
    with open(output_path, 'w') as f:
        f.write("# ClinVar & ACMG Classification Report\n")
        f.write(f"# Input: {snpeff_vcf}\n")
        f.write(f"# ClinVar: {clinvar_path}\n")
        f.write(f"# Samples: {', '.join(samples)}\n")
        f.write(f"# Total variants processed: {len(results)}\n\n")
        f.write("## CLINICALLY RELEVANT VARIANTS\n\n")
        f.write("CHROM\tPOS\tREF\tALT\tGENE\tEFFECT\tIMPACT\tINHERITANCE\tCLINVAR_SIG\tCLINVAR_DISEASE\tACMG_CLASS\tACMG_EVIDENCE\n")
        for v in sorted(pathogenic_variants, key=lambda x: (x.acmg_class != 'Pathogenic',
                                                            x.acmg_class != 'Likely Pathogenic',
                                                            x.chrom, x.pos)):
            f.write(f"{v.chrom}\t{v.pos}\t{v.ref}\t{v.alt}\t")
            f.write(f"{v.gene or 'N/A'}\t{v.effect or 'N/A'}\t{v.impact or 'N/A'}\t")
            f.write(f"{v.inheritance_pattern or 'N/A'}\t")
            f.write(f"{v.clinvar_sig or 'N/A'}\t")
            f.write(f"{v.clinvar_disease or 'N/A'}\t")
            f.write(f"{v.acmg_class}\t")
            f.write(f"{'; '.join(v.acmg_evidence)}\n")
        # Summary statistics
        f.write("\n## SUMMARY\n")
        f.write(f"Total variants in proband: {len(results)}\n")
        f.write(f"Clinically relevant variants: {len(pathogenic_variants)}\n")
        # Count by ACMG class
        acmg_counts = defaultdict(int)
        for v in pathogenic_variants:
            acmg_counts[v.acmg_class] += 1
        f.write("\nBy ACMG Classification:\n")
        for cls in ['Pathogenic', 'Likely Pathogenic', 'VUS', 'Likely Benign', 'Benign']:
            if cls in acmg_counts:
                f.write(f"  {cls}: {acmg_counts[cls]}\n")
        # Count by inheritance
        inh_counts = defaultdict(int)
        for v in pathogenic_variants:
            inh_counts[v.inheritance_pattern or 'unknown'] += 1
        f.write("\nBy Inheritance Pattern:\n")
        for inh, count in sorted(inh_counts.items()):
            f.write(f"  {inh}: {count}\n")
        # ClinVar matches
        clinvar_match = sum(1 for v in pathogenic_variants if v.clinvar_sig)
        f.write(f"\nVariants with ClinVar annotation: {clinvar_match}\n")
    print("\nAnalysis complete!")
    print(f"Clinically relevant variants: {len(pathogenic_variants)}")
    print(f"Report saved to: {output_path}")
    # Print top candidates
    print("\n=== TOP PATHOGENIC CANDIDATES ===\n")
    top_variants = [v for v in pathogenic_variants if v.acmg_class in ['Pathogenic', 'Likely Pathogenic']][:20]
    for v in top_variants:
        print(f"{v.chrom}:{v.pos} {v.ref}>{v.alt}")
        print(f"  Gene: {v.gene} | Effect: {v.effect}")
        print(f"  Inheritance: {v.inheritance_pattern}")
        print(f"  ClinVar: {v.clinvar_sig or 'Not found'}")
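        if v.clinvar_disease:
            # Only add an ellipsis when the disease name is actually truncated
            suffix = '...' if len(v.clinvar_disease) > 80 else ''
            print(f"  Disease: {v.clinvar_disease[:80]}{suffix}")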
        print(f"  ACMG: {v.acmg_class}")
        print(f"  Evidence: {'; '.join(v.acmg_evidence)}")
        print()
 if __name__ == '__main__':
    snpeff_vcf = sys.argv[1] if len(sys.argv) > 1 else '/Volumes/NV2/genomics_analysis/vcf/trio_joint.snpeff.vcf'
    clinvar_path = sys.argv[2] if len(sys.argv) > 2 else '/Volumes/NV2/genomics_reference/clinvar/clinvar_GRCh37.vcf.gz'
    output_path = sys.argv[3] if len(sys.argv) > 3 else '/Volumes/NV2/genomics_analysis/clinvar_acmg_report.txt'
    # VCF sample order: NV0066-08_S33 (idx 0), NV0066-09_S34 (idx 1), NV0066-10_S35 (idx 2)
    # Correct mapping: S35 = proband (II-3), S33 = parent, S34 = parent
    proband_idx = int(sys.argv[4]) if len(sys.argv) > 4 else 2  # S35 is proband
    father_idx = int(sys.argv[5]) if len(sys.argv) > 5 else 0   # S33
    mother_idx = int(sys.argv[6]) if len(sys.argv) > 6 else 1   # S34
    analyze_trio_with_clinvar(snpeff_vcf, clinvar_path, output_path, proband_idx, father_idx, mother_idx)
--- a/configs/acmg_config.example.yaml
+++ b/configs/acmg_config.example.yaml
@@ -1,12 +0,0 @@
 # ACMG evidence thresholds (example, adjust per lab policy)
 ba1_af: 0.05   # Stand-alone benign if AF >= this
 bs1_af: 0.01   # Strong benign if AF >= this (and not meeting BA1)
 pm2_af: 0.0005 # Moderate pathogenic if AF <= this
 bp7_splice_ai_max: 0.1 # Supporting benign if synonymous and predicted low splice impact
 # Genes considered loss-of-function sensitive for PVS1 auto-tagging
 lof_genes:
  - BRCA1
  - BRCA2
  - TTN
  - CFTR
--- a/configs/panel.example.json
+++ b/configs/panel.example.json
@@ -1,10 +0,0 @@
 {
  "name": "Hearing loss core",
  "version": "0.1",
  "source": "curated",
  "last_updated": "2024-06-01",
  "genes": ["GJB2", "SLC26A4", "MITF", "OTOF"],
  "metadata": {
    "notes": "Example panel for demo; replace with curated panel and provenance."
  }
 }
--- a/configs/phenotype_to_genes.example.json
+++ b/configs/phenotype_to_genes.example.json
@@ -1,11 +0,0 @@
 {
  "version": "0.1",
  "source": "example-curated",
  "phenotype_to_genes": {
    "HP:0000365": ["GJB2", "SLC26A4", "OTOF"],
    "HP:0000510": ["MITF", "SOX10"]
  },
  "metadata": {
    "notes": "Placeholder mapping; replace with curated HPO/OMIM/GenCC panels."
  }
 }
--- a/configs/phenotype_to_genes.hpo_seed.json
+++ b/configs/phenotype_to_genes.hpo_seed.json
@@ -1,13 +0,0 @@
 {
  "version": "2024-11-01",
  "source": "HPO-curated-seed",
  "phenotype_to_genes": {
    "HP:0000365": ["GJB2", "SLC26A4", "OTOF", "TECTA"],
    "HP:0001250": ["SCN1A", "KCNQ2", "STXBP1"],
    "HP:0001631": ["MYH7", "TNNT2", "MYBPC3"],
    "HP:0001156": ["COL1A1", "COL1A2", "PLOD2"]
  },
  "metadata": {
    "notes": "Seed mapping using common phenotype IDs; replace with full HPO-derived panels."
  }
 }
--- a/docs/phase1_howto.md
+++ b/docs/phase1_howto.md
@@ -1,101 +0,0 @@
 # Phase 1 How-To: BAM → VCF → Annotate → Panel Report
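 This document explains how to use the existing CLI to go from BAM files to report output, what each option means, and how to skip steps. It assumes GATK, VEP, and bcftools/tabix are installed and that this project has been installed with `pip install -e .`.
 ## Workflow overview
 1) Trio variant calling → joint VCF  
 2) VEP annotation → annotated VCF + flat TSV  
 3) Panel/phenotype query + ACMG tags → Markdown/JSON report  
 Use `phase1-run` to run everything in one command (call/annotate can be skipped), or run the steps separately with `run-call` / `run-annotate` / `panel-report`.
 ## Step-by-step execution
 ### 1) Variant calling (GATK)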
 ```bash
 genomic-consultant run-call \
  --sample proband:/path/proband.bam \
  --sample father:/path/father.bam \
  --sample mother:/path/mother.bam \
  --reference /refs/GRCh38.fa \
  --workdir /tmp/trio \
  --prefix trio \
  --log /tmp/trio/run_call_log.json \
  --probe-tools
 ```
 - `--sample`: `sample_id:/path/to.bam`; may be repeated.
 - `--reference`: reference sequence.
 - `--workdir`: location for outputs and intermediate files.
 - `--prefix`: output filename prefix.
 - `--log`: path to the run log (JSON).
 - Output: joint VCF (`/tmp/trio/trio.joint.vcf.gz`) and run log (with commands and return codes).
 ### 2) Annotation (VEP + bcftools)
 ```bash
 genomic-consultant run-annotate \
  --vcf /tmp/trio/trio.joint.vcf.gz \
  --workdir /tmp/trio/annot \
  --prefix trio \
  --reference /refs/GRCh38.fa \
  --plugin 'SpliceAI,snv=/path/spliceai.snv.vcf.gz,indel=/path/spliceai.indel.vcf.gz' \
  --plugin 'CADD,/path/whole_genome_SNVs.tsv.gz,/path/InDels.tsv.gz' \
  --extra-flag "--cache --offline" \
  --log /tmp/trio/annot/run_annot_log.json \
  --probe-tools
 ```
 - `--plugin`: VEP plugin specification; may be repeated (SpliceAI/CADD shown as examples).
 - `--extra-flag`: extra flags passed through to VEP (e.g. cache/offline).
 - Output: annotated VCF (`trio.vep.vcf.gz`), flat TSV (`trio.vep.tsv`), and run log.
 ### 3) Panel/phenotype report
 Using a panel file:
 ```bash
 genomic-consultant panel-report \
  --tsv /tmp/trio/annot/trio.vep.tsv \
  --panel configs/panel.example.json \
  --acmg-config configs/acmg_config.example.yaml \
  --individual-id proband \
  --max-af 0.05 \
  --format markdown \
  --log /tmp/trio/panel_log.json
 ```
 Using a panel derived directly from a phenotype:
 ```bash
 genomic-consultant panel-report \
  --tsv /tmp/trio/annot/trio.vep.tsv \
  --phenotype-id HP:0000365 \
  --phenotype-mapping configs/phenotype_to_genes.hpo_seed.json \
  --acmg-config configs/acmg_config.example.yaml \
  --individual-id proband \
  --max-af 0.05 \
  --format markdown
 ```
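 - `--max-af`: upper limit for allele-frequency filtering.
 - `--format`: `markdown` or `json`.
 - Output: report text + run log (records hashes of the panel/ACMG configs).
 ## One-command mode (call/annotate can be skipped)
 If you already have the joint VCF/TSV, you can skip the first two steps: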
 ```bash
 genomic-consultant phase1-run \
  --workdir /tmp/trio \
  --prefix trio \
  --tsv /tmp/trio/annot/trio.vep.tsv \
  --skip-call --skip-annotate \
  --panel configs/panel.example.json \
  --acmg-config configs/acmg_config.example.yaml \
  --max-af 0.05 \
  --format markdown \
  --log-dir /tmp/trio/runtime
 ```
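 To actually run VEP, remove `--skip-annotate` and supply `--plugins/--extra-flag`; to run calling, remove `--skip-call` and supply `--sample`/`--reference`.
 ## Main outputs
 - Joint VCF (calling results)
 - Annotated VCF + flat TSV (with gene/consequence/ClinVar/AF/SpliceAI/CADD fields, among others)
 - Run logs (JSON, with commands, return codes, and config hashes; written to `--log` or `--log-dir`)
 - Panel report (Markdown or JSON) with automatic ACMG tags; requires manual review
 ## Notes
 - Call/annotate depend on external tools and matching resources (reference sequence, VEP cache, SpliceAI/CADD data).
 - Without BAMs/resources, use the sample TSV to demo the report: `phase1-run --tsv sample_data/example_annotated.tsv --skip-call --skip-annotate ...`.
 - Safety: `.gitignore` already excludes large genomic files; running in a local, controlled environment is recommended.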
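The run logs described above record commands, return codes, and config hashes. The scaffolding's actual log format is not shown in this document, so the following is only a rough sketch of the idea, assuming a SHA-256 over the config file bytes. The function names and the JSON field names (`command`, `returncode`, `config_sha256`) are hypothetical.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def file_sha256(path):
    """Hash a config file so a report can be tied to the exact settings used."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_and_log(command, config_path, log_path):
    """Run a command and write a JSON run log (field names are illustrative)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    record = {
        "command": command,
        "returncode": completed.returncode,
        "config_sha256": file_sha256(config_path),
    }
    Path(log_path).write_text(json.dumps(record, indent=2))
    return record
```

Storing the hash rather than the config path means a later edit to the config file is detectable when a report is re-examined.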
--- a/docs/phase1_implementation_plan.md
+++ b/docs/phase1_implementation_plan.md
@@ -1,80 +0,0 @@
 # Phase 1 Implementation Plan (Genomic Foundation)
 Scope: deliver a working trio-based variant pipeline, annotated genomic store, query APIs, initial ACMG evidence tagging, and reporting/logging scaffolding. This assumes local execution with Python 3.11+.
 ## Objectives
 - Trio BAM → joint VCF with QC artifacts (`Auto`).
 - Annotate variants with population frequency, ClinVar, consequence/prediction (`Auto`).
 - Provide queryable interfaces for gene/region lookups with filters (`Auto`).
 - Disease/phenotype → gene panel lookup and filtered variant outputs (`Auto+Review`).
 - Auto-tag subset of ACMG criteria; human-only final classification (`Auto+Review`).
 - Produce machine-readable run logs with versions, configs, and overrides.
 ## Work Breakdown
 1) **Data & references**
   - Reference genome: GRCh38 (primary) with option for GRCh37; pin version hash.
   - Resource bundles: known sites for BQSR (if using GATK), gnomAD/ClinVar versions for annotation.
   - Test fixtures: small trio BAM/CRAM subset or GIAB trio downsampled for CI-like checks.
 2) **Variant calling pipeline (wrapper, `Auto`)**
   - Tooling: GATK HaplotypeCaller → gVCF; GenotypeGVCFs for joint calls. Alt: DeepVariant + joint genotyper (parameterized).
   - Steps:
     - Validate inputs (file presence, reference match).
     - Optional QC: coverage, duplicates, on-target.
     - Generate per-sample gVCF; joint genotyping to trio VCF.
   - Outputs: joint VCF + index; QC summary JSON/TSV; log with tool versions and params.
 3) **Annotation pipeline (`Auto`)**
   - Tooling: Ensembl VEP with plugins for gnomAD, ClinVar, CADD, SpliceAI where available; alt path: ANNOVAR.
   - Steps:
     - Normalize variants (bcftools norm) if needed.
     - Annotate with gene, transcript, protein change; population AF; ClinVar significance; consequence; predictions (SIFT/PolyPhen/CADD); inheritance flags.
     - Include SpliceAI/CADD plugins if installed; CLI accepts extra flags/plugins to embed SpliceAI/CADD fields.
   - Outputs: annotated VCF; flattened TSV/Parquet for faster querying; manifest of DB versions used.
 4) **Genomic store + query API (`Auto`)**
   - Early option: tabix-indexed VCF with Python wrapper.
   - Functions (Python module `genomic_store`):
     - `get_variants_by_gene(individual_id, genes, filters)`  
       Filters: AF thresholds, consequence categories, ClinVar significance, inheritance pattern.
     - `get_variants_by_region(individual_id, chrom, start, end, filters)`
     - `list_genes_with_variants(individual_id, filters)` (optional).
   - Filters are defined in a `FilterConfig` dataclass that is serializable for logging.
   - Future option: import to SQLite/Postgres via Arrow/Parquet for richer predicates.
 5) **Disease/phenotype → gene panel (`Auto+Review`)**
   - Data: HPO/OMIM/GenCC lookup or curated JSON panels.
   - Function: `get_gene_panel(disease_or_hpo_id, version=None)` returning gene list + provenance.
   - Phenotype resolver: curated JSON mapping (e.g., `phenotype_to_genes.example.json`) as placeholder until upstream data is wired; allow dynamic panel synthesis by phenotype ID in CLI; support merging multiple sources into one mapping.
   - Flow: resolve panel → call genomic store → apply simple ranking (low AF and high ClinVar pathogenicity rank first).
   - Manual review points: panel curation, rank threshold tuning.
 6) **ACMG evidence tagging (subset, `Auto+Review`)**
   - Criteria to auto-evaluate initially: PVS1 (LoF in LoF-sensitive gene list), PM2 (absent/rare in population), BA1/BS1 (common frequency), possible BP7 (synonymous, no splicing impact).
   - Config: YAML/JSON with thresholds (AF cutoffs, LoF gene list, transcript precedence).
   - Output schema per variant: `{variant_id, evidence: [tag, strength, rationale], suggested_class}` with suggested class computed purely from auto tags; final class left blank for human.
   - Logging: capture rule version and reasons for each fired rule.
 7) **Run logging and versioning**
   - Every pipeline run emits `run_log.json` containing:
     - Inputs (sample IDs, file paths, reference build).
     - Tool versions and parameters; DB versions; config hashes (panel/ACMG configs).
     - Automation level per step; manual overrides (`who/when/why`).
     - Derived artifacts paths and checksums.
   - Embed run_log reference in reports.
 8) **Report template (minimal)**
   - Input: disease/gene panel name, variant query results, ACMG evidence tags.
   - Output: Markdown + JSON summary with sections: context, methods, variants table, limitations.
   - Mark human-only decisions clearly.
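The report template in step 8 could be sketched roughly as follows. This is only an illustration under stated assumptions: `render_report`, the record keys (`id`, `gene`, `af`, `clinvar`, `evidence`, `final_class`), and the panel name are hypothetical and do not come from the plan.

```python
import json


def render_report(panel: str, variants: list, run_log_path: str) -> tuple:
    """Return (markdown, json_summary) for one panel query.

    Hypothetical sketch: section names follow the plan (context, methods,
    variants table, limitations); everything else is an assumption.
    """
    header = "| Variant | Gene | AF | ClinVar | Evidence tags | Final class (human-only) |"
    separator = "|---|---|---|---|---|---|"
    rows = [
        f"| {v['id']} | {v['gene']} | {v.get('af', 'NA')} | {v.get('clinvar', 'NA')} | "
        f"{', '.join(v.get('evidence', [])) or 'none'} | "
        f"{v.get('final_class') or 'PENDING REVIEW'} |"
        for v in variants
    ]
    markdown = "\n".join([
        f"# Panel report: {panel}",
        "## Context",
        f"Panel: {panel}",
        "## Methods",
        f"Run log: {run_log_path}",
        "## Variants",
        header,
        separator,
        *rows,
        "## Limitations",
        "Evidence tags are automated suggestions; final classification is human-only.",
    ])
    summary = {"panel": panel, "run_log": run_log_path, "variants": variants}
    return markdown, json.dumps(summary, indent=2, sort_keys=True)


demo_variants = [
    {"id": "var-001", "gene": "GENE_A", "af": 0.0001, "clinvar": "Pathogenic",
     "evidence": ["PM2"], "final_class": None},
]
report_md, report_json = render_report("hearing_loss_demo", demo_variants, "run_log.json")
print(report_md)
```

Leaving the final class as `PENDING REVIEW` until a reviewer fills it in is one way to keep the human-only decision visible in the output.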
 ## Milestones
 - **M1**: Repo scaffolding + configs; tiny trio test data wired; `make pipeline` runs through variant calling + annotation on fixture.
 - **M2**: Genomic store wrapper with gene/region queries; filter config; basic CLI/Notebook demo.
 - **M3**: Panel lookup + ranked variant listing; ACMG auto tags on outputs; run_log generation.
 - **M4**: Minimal report generator + acceptance of human-reviewed classifications.
 ## Validation Strategy
 - Unit tests for filter logic, panel resolution, ACMG tagger decisions (synthetic variants).
 - Integration test on small fixture trio to ensure call → annotate → query path works.
 - Determinism checks: hash configs and verify outputs stable across runs given same inputs.
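A minimal sketch of the determinism check, assuming configs and outputs are JSON-serializable dictionaries. The helper names `stable_hash` and `is_deterministic`, and the toy `rank_variants` step, are hypothetical.

```python
import hashlib
import json


def stable_hash(payload) -> str:
    """Hash a JSON-serializable object independently of dict key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_deterministic(run, config, repeats: int = 2) -> bool:
    """Run the same step several times and compare output hashes."""
    digests = {stable_hash(run(config)) for _ in range(repeats)}
    return len(digests) == 1


def rank_variants(config: dict) -> dict:
    # Toy stand-in for a pipeline step: keep variants at or below the AF cutoff.
    variants = [("v1", 0.2), ("v2", 0.0001), ("v3", 0.001)]
    kept = sorted(name for name, af in variants if af <= config["max_af"])
    return {"config_hash": stable_hash(config), "kept": kept}


config_a = {"max_af": 0.01, "panel": "demo"}
config_b = {"panel": "demo", "max_af": 0.01}  # same content, different key order

same_hash = stable_hash(config_a) == stable_hash(config_b)
stable_output = is_deterministic(rank_variants, config_a)
print(same_hash, stable_output)
```

Canonical JSON (sorted keys, fixed separators) is what makes the hash insensitive to key order; a real pipeline would also need to pin tool and database versions for outputs to stay stable.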
--- a/docs/system_architecture.md
+++ b/docs/system_architecture.md
@@ -1,69 +0,0 @@
 # System Architecture Blueprint (v0.1)
 This document turns `genomic_decision_support_system_spec_v0.1.md` into a buildable architecture and phased roadmap. Automation levels follow `Auto / Auto+Review / Human-only`.
 ## High-Level Views
 - **Core layers**: (1) sequencing ingest → variant calling/annotation (`Auto`), (2) genomic query layer (`Auto`), (3) rule engines (ACMG, PGx, DDI, supplements; mixed automation), (4) orchestration/LLM and report generation (`Auto` tool calls, `Auto+Review` outputs).
 - **Data custody**: all PHI/genomic artifacts remain local; external calls require de-identification or private models.
 - **Traceability**: every run records tool versions, database snapshots, configs, and manual overrides in machine-readable logs.
 ### End-to-end flow
 ```
 BAM (proband + parents)
  ↓ Variant Calling (gVCF) [Auto]
 Joint Genotyper → joint VCF
  ↓ Annotation (VEP/ANNOVAR + ClinVar/gnomAD etc.) [Auto]
  ↓ Genomic Store (VCF+tabix or SQL) + Query API [Auto]
  ↓
  ├─ Disease/Phenotype → Gene Panel lookup [Auto+Review]
  │    └─ Panel variants with basic ranking (freq, ClinVar) [Auto+Review]
  ├─ ACMG evidence tagging subset (PVS1, PM2, BA1, BS1…) [Auto+Review]
  ├─ PGx genotype→phenotype and recommendation rules [Auto → Auto+Review]
  ├─ DDI rule evaluation [Auto]
  └─ Supplement/Herb normalization + interaction rules [Auto+Review → Human-only]
  ↓
 LLM/Orchestrator routes user questions to tools, produces JSON + Markdown drafts [Auto tools, Auto+Review narratives]
 ```
 ## Phase Roadmap (build-first view)
 - **Phase 1 – Genomic foundation**  
  - Deliverables: trio joint VCF + annotation; query functions (`get_variants_by_gene/region`); disease→gene panel lookup; partial ACMG evidence tagging.  
  - Data stores: tabix-backed VCF wrapper initially; optional SQLite/Postgres import later.  
  - Interfaces: Python CLI/SDK first; machine-readable run logs with versions and automation levels.
 - **Phase 2 – PGx & DDI**  
  - Drug vocabulary normalization (ATC/RxNorm).  
  - PGx engine: star-allele calling or rule-based genotype→phenotype; guideline-mapped advice with review gates.  
  - DDI engine: rule base with severity tiers; combine with PGx outputs.
 - **Phase 3 – Supplements & Herbs**  
  - Name/ingredient normalization; herb formula expansion.  
  - Rule tables for CYP/transporters, coagulation, CNS effects.  
  - Evidence grading and conservative messaging; human-only final clinical language.
 - **Phase 4 – LLM Interface & Reports**  
  - Tool-calling schema for queries listed above.  
  - JSON + Markdown report templates with traceability to rules, data versions, and overrides.
 ## Module Boundaries
 - **Variant Calling Pipeline** (`Auto`): wrapper around GATK or DeepVariant + joint genotyper; pluggable reference genome; QC summaries.  
 - **Annotation Pipeline** (`Auto`): VEP/ANNOVAR with pinned database versions (gnomAD, ClinVar, transcript set); emits annotated VCF + flat table.  
 - **Genomic Query Layer** (`Auto`): abstraction over tabix or SQL; minimal APIs: `get_variants_by_gene`, `get_variants_by_region`, filters (freq, consequence, clinvar).  
 - **Disease/Phenotype to Panel** (`Auto+Review`): HPO/OMIM lookups or curated panels; panel versioned; feeds queries.  
 - **Phenotype Resolver** (`Auto+Review`): JSON/DB mapping of phenotype/HPO IDs to gene lists as a placeholder until upstream sources are integrated; can synthesize panels dynamically and merge multiple sources.  
 - **ACMG Evidence Tagger** (`Auto+Review`): auto-evaluable criteria only; config-driven thresholds; human-only final classification.  
 - **PGx Engine** (`Auto → Auto+Review`): star-allele calling where possible; guideline rules (CPIC/DPWG) with conservative defaults; flag items needing review.  
 - **DDI Engine** (`Auto`): rule tables keyed by normalized drug IDs; outputs severity and rationale.  
 - **Supplements/Herbs** (`Auto+Review → Human-only`): ingredient extraction + mapping; interaction rules; human sign-off for clinical language.  
 - **Orchestrator/LLM** (`Auto tools, Auto+Review outputs`): intent parsing, tool sequencing, safety guardrails, report drafting.
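As an illustration of the DDI Engine boundary (rule tables keyed by normalized drug IDs, returning severity and rationale), here is a small sketch. The drug IDs, severities, and rationales are placeholders invented for the example and are not clinical data; `evaluate_ddi` is a hypothetical name.

```python
from itertools import combinations

# Hypothetical rule table keyed by an unordered pair of normalized drug IDs.
DDI_RULES = {
    frozenset({"DRUG_A", "DRUG_B"}): ("major", "placeholder: additive effect on the same pathway"),
    frozenset({"DRUG_A", "DRUG_C"}): ("moderate", "placeholder: DRUG_C alters DRUG_A exposure"),
}
SEVERITY_ORDER = {"major": 0, "moderate": 1, "minor": 2}


def evaluate_ddi(drug_ids) -> list:
    """Check every unordered pair of drugs against the rule table."""
    hits = []
    for first, second in combinations(sorted(set(drug_ids)), 2):
        rule = DDI_RULES.get(frozenset({first, second}))
        if rule is not None:
            severity, rationale = rule
            hits.append({"pair": (first, second), "severity": severity, "rationale": rationale})
    # Most severe interactions first.
    return sorted(hits, key=lambda hit: SEVERITY_ORDER[hit["severity"]])


ddi_hits = evaluate_ddi(["DRUG_C", "DRUG_A", "DRUG_B", "DRUG_A"])
for hit in ddi_hits:
    print(hit["pair"], hit["severity"], hit["rationale"])
```

Using `frozenset` keys means the lookup does not depend on the order in which the two drugs are listed.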
 ## Observability and Versioning
 - Every pipeline run writes a JSON log: tool versions, reference genome, DB versions, config hashes, automation level per step, manual overrides (who/when/why).
 - Reports embed references to those logs so outputs remain reproducible.
 - Configs (ACMG thresholds, gene panels, PGx rules) are versioned artifacts stored alongside code.
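The JSON run log described above might look something like the sketch below. Field names, the `build_run_log` helper, and the demo file are assumptions made for illustration; the source only lists what the log should contain.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Checksum a file in chunks so large VCFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_log(inputs, tool_versions, steps, artifacts, overrides=()):
    """Assemble a machine-readable log (hypothetical field names)."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,
        "tool_versions": tool_versions,
        "steps": steps,                   # each step carries its automation level
        "overrides": list(overrides),     # who / when / why entries
        "artifacts": {str(path): sha256_file(path) for path in artifacts},
    }


with tempfile.TemporaryDirectory() as tmp:
    vcf = Path(tmp) / "trio.annotated.vcf"
    vcf.write_text("##fileformat=VCFv4.2\n")  # stand-in artifact for the demo
    run_log = build_run_log(
        inputs={"samples": ["proband", "father", "mother"], "reference": "GRCh38"},
        tool_versions={"annotator": "unknown"},
        steps=[{"name": "annotation", "automation": "Auto"}],
        artifacts=[vcf],
    )
    log_path = Path(tmp) / "run_log.json"
    log_path.write_text(json.dumps(run_log, indent=2, sort_keys=True))
    reloaded = json.loads(log_path.read_text())

print(sorted(reloaded))
```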
 ## Security/Privacy Notes
 - Default to local processing; if external LLMs are used, strip identifiers and avoid full VCF uploads.
 - Secrets kept out of repo; rely on environment variables or local config files (excluded by `.gitignore`).
 ## Initial Tech Bets (to be validated)
 - Language/runtime: Python 3.11+ for pipelines, rules, and orchestration stubs.
 - Bio stack candidates: GATK or DeepVariant; VEP; tabix for early querying; SQLAlchemy + SQLite/Postgres when scaling.  
 - Infra: containerized runners for pipelines; makefiles or workflow engine (Nextflow/Snakemake) later if needed.
--- a/genomic_decision_support_system_spec_v0.1.md
+++ b/genomic_decision_support_system_spec_v0.1.md
@@ -1,356 +0,0 @@
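 # Personal Genomic Risk and Drug Interaction Decision Support System – System Specification
 - Version: v0.1-draft  
 - Author: Gbanyan + assistant planning  
 - Status: draft; expected to iterate alongside implementation  
 - Purpose: to be read by LLMs (e.g., Claude, Codex) and developers as the baseline specification for system design and implementation.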
 ---
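 ## 0. System Goals and Scope
 ### 0.1 Goals
 Build a personalized, genome-driven decision support system with these core functions:
 1. **Starting from exome sequencing data (BAM) of the individual and both parents**, produce a queryable variant database.
 2. **For a given disease, symptom, or phenotype**, automatically look up related genes and known pathogenic variants, then search the individual's data for matching variants.
 3. Following **guidelines such as ACMG/AMP** and public databases, provide **machine-assisted variant interpretation with defined human intervention points**.
 4. Further integrate:
   - Pharmacogenomics (PGx)
   - Drug–drug interactions (DDI)
   - Potential interactions and risks of ingredients in dietary supplements, traditional Chinese medicine, and similar products
 5. Provide a natural-language Q&A interface so the user can ask directly:
   - "Do I have a genetic risk for XXX?"
   - "Given the drugs and supplements I currently take, are there any interactions I should watch for?"
 > System positioning: a **personal decision support tool / research tool**, not a formal medical diagnostic system.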
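 ### 0.2 Core Design Philosophy
 - **Phased development**: first solidify the genomic foundation and variant interpretation, then expand to PGx and interactions.
 - **Explicit human–machine division of labor**: tag every module as `Auto / Auto+Review / Human-only`.
 - **Traceable and auditable**: every conclusion can be traced to the rules used, the database versions, and any manual overrides.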
 ---
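 ## 1. Development Phases and Overall Architecture
 ### 1.1 Phase Overview
 | Phase | Name | Core deliverables | Primary target |
 |------|------|----------|----------|
 | Phase 1 | Genomic foundation and variant interpretation basics | Individual + trio VCF, annotation, disease/gene queries | Monogenic disease risk |
 | Phase 2 | Pharmacogenomics and DDI | Gene–drug mapping, drug–drug interaction analysis | Precision medication recommendations |
 | Phase 3 | Supplement and traditional Chinese medicine interactions | Ingredient normalization and interaction risk tiers | Safety net across medications and supplements |
 | Phase 4 | NLP/LLM Q&A interface | Natural-language Q&A, report generation | General users / clinical conversations |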
 ### 1.2 High-Level Architecture Diagram (Mermaid)
 ```mermaid
 flowchart TD
 subgraph P1["Phase 1: Genomic foundation"]
  A1["BAM (proband + parents)"] --> A2[Variant Calling Pipeline]
  A2 --> A3["Joint VCF (Trio)"]
  A3 --> A4["Variant Annotation (ClinVar, gnomAD, VEP...)"]
  A4 --> A5["Genomic DB / Query API"]
 end
 subgraph P2["Phase 2: PGx & DDI"]
  B1[Medication list] --> B2[PGx Engine]
  A5 --> B2
  B1 --> B3[DDI Engine]
 end
 subgraph P3["Phase 3: Supplements & traditional Chinese medicine"]
  C1["Supplement / Chinese medicine list"] --> C2[Ingredient normalization]
  C2 --> C3[Ingredient interaction engine]
  B1 --> C3
  A5 --> C3
 end
 subgraph P4["Phase 4: Q&A and reports"]
  D1["Frontend UI / CLI / API Client"] --> D2[LLM Orchestrator]
  D2 -->|Disease/symptom queries| A5
  D2 -->|Medication/ingredient queries| B2
  D2 --> B3
  D2 --> C3
  A5 --> D3[Report generator]
  B2 --> D3
  B3 --> D3
  C3 --> D3
 end
 ```
 ---
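 ## 2. General Design: Human–Machine Division-of-Labor Tags
 Every module must be tagged with an automation level:
 - `Auto`: steps that can run fully automatically (e.g., variant calling, basic annotation).
 - `Auto+Review`: the system produces a suggestion first, which then requires human review or conditional acceptance (e.g., partial ACMG evidence scoring).
 - `Human-only`: final medical judgment, wording, and management recommendations must be decided by a human (e.g., final Pathogenic classification, clinical management recommendations).
 Every analysis must generate a **machine-readable log** recording:
 - Modules used and their versions
 - Automation level of each step
 - Where manual overrides occurred (person, time, reason)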
 ---
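 ## 3. Phase 1: Genomic Foundation and Variant Interpretation Basics
 ### 3.1 Functional Requirements
 1. **Input**
   - Exome sequencing BAM files for the individual and both parents.
 2. **Output**
   - High-quality joint VCF (including the trio)
   - Annotation for each variant:
     - Gene, transcript, protein change
     - Population frequency (gnomAD, etc.)
     - ClinVar annotation
     - Functional predictions (SIFT/PolyPhen/CADD, etc.)
   - Variant filtering results for a specific disease/gene list.
 3. **External services**
   - Exposed via API / function interfaces:
     - Given a gene list → return the individual's variants in those genes
     - Support the disease name/HPO → gene → variant query flow (may be called step by step initially)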
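 ### 3.2 Module Design
 #### 3.2.1 Variant Calling Pipeline
 - **Input**: BAM (individual + parents)
 - **Output**: per-sample gVCF → joint VCF
 - **Candidate tools**:
  - GATK (HaplotypeCaller + GenotypeGVCFs)
  - or DeepVariant + joint genotyper
 - **Automation level**: `Auto`
 - **Requirements**:
  - Basic QC (coverage, duplicate rate, on-target rate)
  - Version tagging support (e.g., reference genome version)
 #### 3.2.2 Annotation Pipeline
 - **Input**: joint VCF
 - **Output**: annotated VCF / variant table
 - **Candidate tools**:
  - VEP, ANNOVAR, or similar tools
 - **Databases**:
  - ClinVar
  - gnomAD
  - Gene function and transcript databases
 - **Automation level**: `Auto`
 #### 3.2.3 Genomic DB / Query API
 - **Purpose**: provide efficient queries as the base for downstream modules (disease risk, PGx, etc.).
 - **Form**:
  - Option A: VCF + tabix, operated through wrapper functions
  - Option B: import into SQLite / PostgreSQL / a dedicated genomic DB
 - **Key queries**:
  - `get_variants_by_gene(individual_id, gene_list, filters)`
  - `get_variants_by_region(individual_id, chr, start, end, filters)`
 - **Automation level**: `Auto`
 #### 3.2.4 Disease/Phenotype → Gene → Variant Flow
 - Initially this can be split into three steps:
  1. Use an external knowledge base or a manual panel: disease/phenotype → gene list
  2. Query the individual's variants through the Genomic DB
  3. Apply simple rules (frequency, ClinVar annotation) for initial ranking
 - **Automation level**: `Auto+Review`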
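One way the `filters` argument of the key queries could be represented is a small serializable dataclass. The sketch below is an assumption for illustration: the field names, the flattened record keys (`af`, `consequence`, `clinvar`), and the `passes` helper are hypothetical, and a missing AF is treated as "not observed in the population database".

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class FilterConfig:
    """Hypothetical filter settings; empty tuples mean 'no restriction'."""
    max_af: Optional[float] = None   # keep variants with AF <= max_af
    consequences: tuple = ()         # allowed consequence terms
    clinvar: tuple = ()              # allowed ClinVar significance labels

    def to_json(self) -> str:
        # Sorted keys give a stable string suitable for a run log.
        return json.dumps(asdict(self), sort_keys=True)


def passes(variant: dict, cfg: FilterConfig) -> bool:
    """Return True if a flattened variant record satisfies every active filter."""
    af = variant.get("af")
    if cfg.max_af is not None and af is not None and af > cfg.max_af:
        return False
    if cfg.consequences and variant.get("consequence") not in cfg.consequences:
        return False
    if cfg.clinvar and variant.get("clinvar") not in cfg.clinvar:
        return False
    return True


demo_variants = [
    {"id": "v1", "af": 0.0001, "consequence": "missense_variant", "clinvar": "Pathogenic"},
    {"id": "v2", "af": 0.2, "consequence": "missense_variant", "clinvar": "Benign"},
    {"id": "v3", "af": None, "consequence": "synonymous_variant", "clinvar": None},
]
cfg = FilterConfig(max_af=0.01, consequences=("missense_variant", "stop_gained"))
kept_ids = [v["id"] for v in demo_variants if passes(v, cfg)]
print(kept_ids)
print(cfg.to_json())
```

Making the dataclass frozen keeps a logged filter from being mutated after the query has run.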
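 ### 3.3 ACMG Rule Implementation (Initial Version)
 - **Scope**: implement only the subset of evidence that a machine can evaluate automatically (e.g., PVS1, PM2, BA1, BS1).
 - **Output**:
  - A list of evidence tags and a suggested classification for each variant (e.g., `suggested_class = "VUS"`)
 - **Human intervention points**:
  - Final variant classification (Pathogenic / Likely pathogenic / VUS / Likely benign / Benign) → `Human-only`
  - Rule thresholds (e.g., frequency cutoffs) managed in config files → `Auto+Review`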
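A deliberately simplified sketch of automated evidence tagging for the four criteria named above. The frequency cutoffs, the LoF gene set (`GENE_A`), the record keys, and the reduced combining logic are all assumptions for illustration, not the project's rules; the final class is always left empty for a human reviewer.

```python
# Placeholder cutoffs; real values belong in the versioned, reviewed config file.
THRESHOLDS = {"ba1_af": 0.05, "bs1_af": 0.01, "pm2_af": 0.0001}
LOF_CONSEQUENCES = {
    "stop_gained", "frameshift_variant",
    "splice_donor_variant", "splice_acceptor_variant",
}


def tag_variant(variant: dict, lof_genes: set, thresholds: dict = THRESHOLDS) -> dict:
    """Attach auto-evaluable evidence tags and a suggested (not final) class."""
    evidence = []
    af = variant.get("af")  # None is treated as absent from the population database
    if af is not None and af >= thresholds["ba1_af"]:
        evidence.append({"tag": "BA1", "strength": "stand-alone",
                         "rationale": f"AF {af} >= {thresholds['ba1_af']}"})
    elif af is not None and af >= thresholds["bs1_af"]:
        evidence.append({"tag": "BS1", "strength": "strong",
                         "rationale": f"AF {af} >= {thresholds['bs1_af']}"})
    elif af is None or af <= thresholds["pm2_af"]:
        evidence.append({"tag": "PM2", "strength": "moderate",
                         "rationale": "absent or rare in population data"})
    if variant.get("consequence") in LOF_CONSEQUENCES and variant.get("gene") in lof_genes:
        evidence.append({"tag": "PVS1", "strength": "very strong",
                         "rationale": "LoF consequence in a LoF-sensitive gene"})

    # Reduced combining logic covering only the tags produced above.
    tags = {item["tag"] for item in evidence}
    if "BA1" in tags:
        suggested = "Benign"
    elif {"PVS1", "PM2"} <= tags:
        suggested = "Likely pathogenic"
    else:
        suggested = "VUS"
    return {"variant_id": variant["id"], "evidence": evidence,
            "suggested_class": suggested, "final_class": None}


lof_genes = {"GENE_A"}  # hypothetical LoF-sensitive gene list
rare_lof = tag_variant({"id": "v1", "af": None, "consequence": "stop_gained", "gene": "GENE_A"}, lof_genes)
common = tag_variant({"id": "v2", "af": 0.2, "consequence": "missense_variant", "gene": "GENE_B"}, lof_genes)
print(rare_lof["suggested_class"], common["suggested_class"])
```

Each tag carries its own rationale string so that a reviewer can see why a rule fired before accepting or overriding the suggestion.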
 ---
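 ## 4. Phase 2: Pharmacogenomics (PGx) and Drug–Drug Interactions (DDI)
 ### 4.1 Functional Requirements
 1. Accept the user's current medication list (prescription and over-the-counter drugs).
 2. Using genomic data, determine PGx-relevant genotypes (e.g., CYP2D6, CYP2C9, HLA).
 3. Based on guidelines such as CPIC / DPWG, provide:
   - Indication-related risks (e.g., HLA-B*58:01 and allopurinol)
   - Dose adjustment / drug substitution suggestions (decision-support level only)
 4. Compute basic drug–drug interactions (DDI), for example:
   - Additive CYP inhibition / induction
   - Additive QT prolongation
   - Additive bleeding risk
 ### 4.2 Module Design
 #### 4.2.1 Medication Data Normalization
 - Use ATC / RxNorm / custom IDs.
 - **Automation level**: `Auto`
 #### 4.2.2 PGx Engine
 - **Input**: individual variants (Phase 1 DB), medication list
 - **Output**: PGx assessment for each drug (genotype → phenotype → recommendation)
 - **Databases**:
  - CPIC guidelines
  - PharmGKB association data
 - **Automation level**:
  - genotype → phenotype: `Auto`
  - phenotype → clinical recommendation: `Auto+Review`
 #### 4.2.3 DDI Engine
 - **Input**: medication list
 - **Output**: list of known DDIs with severity grading
 - **Data sources**: public or commercial DDI databases (subject to availability)
 - **Automation level**: `Auto`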
 ---
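 ## 5. Phase 3: Supplement and Traditional Chinese Medicine Interaction Module
 ### 5.1 Functional Requirements
 1. Accept the user's supplement and traditional Chinese medicine usage data.
 2. Resolve names into:
   - Standardized active ingredients (e.g., EPA/DHA mg, Vit D IU, ginkgo leaf extract mg)
   - Chinese herb names (e.g., Huangqi (黃耆), Danggui (當歸), Chuanxiong (川芎)…)
 3. Assess:
   - Interaction risks between ingredients and drugs or genes
   - Additive effects among ingredients (e.g., anticoagulation, CNS depression)
 4. Report by evidence level:
   - High-priority alerts (relatively strong clinical evidence)
   - General reminders (animal studies / case reports, etc.)
   - Insufficient data; can only flag the uncertainty
 ### 5.2 Module Design
 #### 5.2.1 Ingredient Normalization Engine
 - **Input**: product names / prescriptions entered by the user
 - **Output**:
  - Standardized ingredient list
  - Estimated dose range (when exact data is unavailable)
 - **Data**:
  - Table of common supplement ingredients
  - Mapping of Chinese herbal formulas to constituent herbs
 - **Automation level**: `Auto+Review`
 #### 5.2.2 Ingredient Interaction Engine
 - **Input**: ingredient list, medication list, genomic data
 - **Output**: interaction list with risk tiers
 - **Logic**:
  - Ingredient effects on CYP / P-gp / OATP, etc.
  - Ingredient effects on coagulation, blood pressure, the central nervous system, and other systems
 - **Automation level**:
  - Rule inference: `Auto`
  - Wording of final clinical recommendations: `Human-only`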
 ---
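 ## 6. Phase 4: NLP/LLM Q&A Interface and Report Generation
 ### 6.1 Functional Requirements
 1. Support natural-language questions from users about:
   - Disease/symptom-related risk
   - Medication safety
   - Risks of combining supplements and traditional Chinese medicine
 2. The LLM is responsible for:
   - Question parsing → structured queries (disease, HPO, drugs, ingredients, etc.)
   - Coordinating calls to the underlying APIs (Phase 1–3)
   - Consolidating results and generating report drafts
 3. Report formats:
   - Machine-readable JSON (for post-processing)
   - Human-readable Markdown / PDF reports
 ### 6.2 Orchestration Design
 - An "LLM + Tool/Function Calling" pattern can be used:
  - Tools include:
    - `query_variants_by_gene`
    - `query_disease_gene_panel`
    - `run_pgx_analysis`
    - `run_ddi_analysis`
    - `run_supplement_herb_interaction`
 - The LLM is mainly responsible for:
  - Intent recognition and decomposition
  - Planning the order of tool calls
  - Explaining results and adjusting wording (must follow safety and conservative principles)
 - **Automation level**:
  - Tool calls: `Auto`
  - Clinically sensitive conclusions: `Auto+Review` / `Human-only` (depending on the scenario)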
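The tool-calling pattern above could be wired up along these lines. This is a sketch under assumptions: the tool names come from the list above, but the stub bodies, the JSON call shape (`tool`, `arguments`), and the `dispatch` helper are hypothetical and not tied to any particular LLM vendor's API.

```python
import json


def query_variants_by_gene(genes):
    # Stub standing in for the Phase 1 query layer.
    return {"genes": genes, "variants": []}


def run_ddi_analysis(drugs):
    # Stub standing in for the Phase 2 DDI engine.
    return {"drugs": drugs, "interactions": []}


# Registry: tool name -> (callable, automation level of the call itself).
TOOLS = {
    "query_variants_by_gene": (query_variants_by_gene, "Auto"),
    "run_ddi_analysis": (run_ddi_analysis, "Auto"),
}


def dispatch(call_json: str) -> dict:
    """Route one structured tool call produced by the LLM to its implementation."""
    call = json.loads(call_json)
    name = call["tool"]
    if name not in TOOLS:
        return {"tool": name, "error": "unknown tool"}
    func, automation = TOOLS[name]
    result = func(**call.get("arguments", {}))
    # Narrative built from this result would still need Auto+Review / Human-only gating.
    return {"tool": name, "automation": automation, "result": result}


ok = dispatch('{"tool": "query_variants_by_gene", "arguments": {"genes": ["GENE_A"]}}')
unknown = dispatch('{"tool": "delete_everything", "arguments": {}}')
print(ok)
print(unknown)
```

Restricting the model to an explicit registry means an unrecognized tool name is rejected rather than executed.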
 ---
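 ## 7. Security, Privacy, and Version Management
 ### 7.1 Data Security and Privacy
 - All genomic data, medication lists, and reports:
  - Are stored locally or in a controlled environment
  - If interaction with external services (e.g., cloud LLMs) is needed, either:
    - De-identify the data (remove personal information)
    - Or switch to a local/private LLM
 ### 7.2 Version Management
 - Version-control the following:
  - Reference genome version
  - Variant calling pipeline version
  - Database versions (ClinVar, gnomAD, etc.)
  - ACMG rule config version
  - Gene panel / PGx rule versions
 - Every analysis report must record the versions used, to support tracing and reruns.
 ### 7.3 Manual Intervention Records
 - Every manual override or review must record:
  - Variant ID / analysis item
  - Original automated suggestion
  - Manually adjusted result
  - Rationale and references (if any)
  - Reviewer and time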
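The override record listed above maps naturally onto a small dataclass. The sketch below is illustrative only: `OverrideRecord`, `append_override`, the field names, and the example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class OverrideRecord:
    """One manual override or review entry (hypothetical field names)."""
    item_id: str              # variant ID or analysis item
    auto_suggestion: str      # what the automated step proposed
    manual_result: str        # what the reviewer decided
    rationale: str
    reviewer: str
    references: list = field(default_factory=list)
    reviewed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_override(run_log: dict, record: OverrideRecord) -> dict:
    """Store the record in the machine-readable log under 'overrides'."""
    run_log.setdefault("overrides", []).append(asdict(record))
    return run_log


log = {"run_id": "demo-run"}
append_override(log, OverrideRecord(
    item_id="var-001",
    auto_suggestion="VUS",
    manual_result="Likely benign",
    rationale="placeholder rationale for the example",
    reviewer="reviewer-1",
))
print(json.dumps(log, indent=2))
```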
 ---
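 ## 8. Future Extensions (Optional)
 - Integrate a polygenic risk score (PRS) module
 - Integrate longitudinal data (lab results, symptom logs) for dynamic risk tracking
 - Build deeper expert-curated knowledge bases for specific disease areas
 - Integrate with wearables and other health data sources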
 ---
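 ## 9. Suggested Implementation Path for Phase 1 (Actionable TODO)
 1. **Plan the Phase 1 technology choices**
   - Choose a variant caller (e.g., GATK) and a reference genome version
   - Choose an annotation tool (e.g., VEP or ANNOVAR)
 2. **Build the basic pipeline**
   - BAM → gVCF → joint VCF (trio)
   - Add basic QC reports
 3. **Build a simple Genomic Query interface**
   - Start with CLI/Notebook functions (e.g., a Python library)
 4. **Pick the disease area you care about most**
   - Build the first gene panel (e.g., vision/hearing related)
   - Implement panel-based queries and variant list output
 5. **Write the first report template**
   - Input: disease name + gene panel + query results
   - Output: simple Markdown report (variant table + limitations)
 6. **Gradually add ACMG automated evidence tagging and the manual review workflow**
 This specification is expected to be updated continuously during implementation; treat it as the v0.1 starting point.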
--- a/gwas_comprehensive.py
+++ b/gwas_comprehensive.py
@@ -0,0 +1,590 @@
 #!/usr/bin/env python3
 """
 Comprehensive GWAS Trait Analysis Script
 Expanded version with 200+ clinically relevant trait-associated SNPs
 """
 import gzip
 import sys
 import re
 from collections import defaultdict
 from typing import Dict, List, Tuple
 # ============================================================================
 # COMPREHENSIVE TRAIT-ASSOCIATED SNPs DATABASE
 # Format: rsid -> (chrom, pos, risk_allele, trait, effect, category)
 # ============================================================================
 TRAIT_SNPS = {
    # ========================================================================
    # GOUT / URIC ACID METABOLISM (newly added)
    # ========================================================================
    "rs2231142": ("4", 89052323, "T", "Gout / Hyperuricemia", "risk", "Gout"),
    "rs16890979": ("4", 9922166, "T", "Serum uric acid levels", "higher", "Gout"),
    "rs734553": ("4", 9920485, "G", "Gout", "risk", "Gout"),
    "rs1014290": ("4", 10001861, "A", "Serum uric acid levels", "higher", "Gout"),
    "rs505802": ("11", 64357072, "C", "Serum uric acid levels", "higher", "Gout"),
    "rs3775948": ("4", 9999007, "G", "Gout", "risk", "Gout"),
    "rs12498742": ("4", 9993806, "A", "Serum uric acid levels", "higher", "Gout"),
    "rs675209": ("4", 89011046, "T", "Gout", "risk", "Gout"),
    "rs1165151": ("11", 64352047, "T", "Serum uric acid levels", "higher", "Gout"),
    "rs478607": ("17", 19459563, "A", "Serum uric acid levels", "higher", "Gout"),
    # ========================================================================
    # KIDNEY DISEASE (newly added)
    # ========================================================================
    "rs4293393": ("16", 20364808, "T", "Chronic kidney disease", "risk", "Kidney"),
    "rs12917707": ("16", 20369861, "G", "Chronic kidney disease", "protective", "Kidney"),
    "rs11959928": ("5", 39394747, "A", "eGFR decline", "risk", "Kidney"),
    "rs1260326": ("2", 27730940, "T", "Chronic kidney disease", "risk", "Kidney"),
    "rs13329952": ("16", 20393103, "C", "Chronic kidney disease", "risk", "Kidney"),
    "rs267734": ("1", 150950830, "C", "Chronic kidney disease", "risk", "Kidney"),
    # ========================================================================
    # HEARING LOSS (relevant to families with Usher syndrome)
    # ========================================================================
    "rs7598759": ("2", 70439175, "A", "Age-related hearing loss", "risk", "Hearing"),
    "rs161927": ("5", 88228027, "G", "Hearing impairment", "risk", "Hearing"),
    "rs10497394": ("2", 70477374, "T", "Hearing loss", "risk", "Hearing"),
    "rs3752752": ("7", 129608155, "C", "Noise-induced hearing loss", "risk", "Hearing"),
    "rs7294": ("4", 6303557, "G", "Hearing loss", "risk", "Hearing"),
    # ========================================================================
    # AUTOIMMUNE DISEASES (newly added)
    # ========================================================================
    # Rheumatoid Arthritis
    "rs6679677": ("1", 114179091, "A", "Rheumatoid arthritis", "risk", "Autoimmune"),
    "rs2476601": ("1", 114377568, "A", "Rheumatoid arthritis / Autoimmune", "risk", "Autoimmune"),
    "rs3087243": ("2", 204447164, "G", "Rheumatoid arthritis", "protective", "Autoimmune"),
    "rs4810485": ("20", 44747947, "T", "Rheumatoid arthritis", "risk", "Autoimmune"),
    # Systemic Lupus Erythematosus (SLE)
    "rs1143679": ("16", 31276811, "A", "Systemic lupus erythematosus", "risk", "Autoimmune"),
    "rs7574865": ("2", 191099907, "T", "Systemic lupus erythematosus", "risk", "Autoimmune"),
    "rs2187668": ("6", 32605884, "T", "Systemic lupus erythematosus", "risk", "Autoimmune"),
    # Multiple Sclerosis
    "rs3135388": ("6", 32439887, "A", "Multiple sclerosis", "risk", "Autoimmune"),
    "rs6897932": ("5", 35910332, "C", "Multiple sclerosis", "risk", "Autoimmune"),
    "rs4648356": ("1", 101256530, "C", "Multiple sclerosis", "risk", "Autoimmune"),
    # Inflammatory Bowel Disease
    "rs2241880": ("16", 50756540, "G", "Crohn's disease / IBD", "risk", "Autoimmune"),
    "rs11209026": ("1", 67705958, "A", "Crohn's disease / IBD", "protective", "Autoimmune"),
    "rs10883365": ("10", 64426914, "G", "Ulcerative colitis", "risk", "Autoimmune"),
    "rs2066847": ("16", 50745926, "C", "Crohn's disease", "risk", "Autoimmune"),
    # Type 1 Diabetes
    "rs2292239": ("12", 56482804, "T", "Type 1 diabetes", "risk", "Autoimmune"),
    "rs3129889": ("6", 32609440, "G", "Type 1 diabetes", "risk", "Autoimmune"),
    "rs689": ("11", 2182224, "T", "Type 1 diabetes", "risk", "Autoimmune"),
    # Celiac Disease
    "rs2395182": ("6", 32713854, "T", "Celiac disease", "risk", "Autoimmune"),
    "rs7775228": ("6", 32665438, "C", "Celiac disease", "risk", "Autoimmune"),
    # Hashimoto's Thyroiditis / Graves' Disease
    "rs179247": ("2", 204733986, "A", "Autoimmune thyroid disease", "risk", "Autoimmune"),
    "rs1980422": ("6", 90957406, "C", "Autoimmune thyroid disease", "risk", "Autoimmune"),
    # ========================================================================
    # CANCER RISK (newly added)
    # ========================================================================
    # Breast Cancer
    "rs2981582": ("10", 123337335, "A", "Breast cancer (FGFR2)", "risk", "Cancer"),
    "rs13281615": ("8", 128355618, "G", "Breast cancer", "risk", "Cancer"),
    "rs889312": ("5", 56067641, "C", "Breast cancer (MAP3K1)", "risk", "Cancer"),
    "rs3817198": ("11", 1909006, "C", "Breast cancer (LSP1)", "risk", "Cancer"),
    "rs13387042": ("2", 217905832, "A", "Breast cancer", "risk", "Cancer"),
    # Prostate Cancer
    "rs1447295": ("8", 128554220, "A", "Prostate cancer", "risk", "Cancer"),
    "rs16901979": ("8", 128320346, "A", "Prostate cancer", "risk", "Cancer"),
    "rs6983267": ("8", 128413305, "G", "Prostate cancer / Colorectal cancer", "risk", "Cancer"),
    "rs10993994": ("10", 51549496, "T", "Prostate cancer (MSMB)", "risk", "Cancer"),
    "rs7679673": ("4", 106061534, "C", "Prostate cancer", "risk", "Cancer"),
    # Colorectal Cancer
    "rs4939827": ("18", 46453463, "T", "Colorectal cancer (SMAD7)", "risk", "Cancer"),
    "rs6983267_crc": ("8", 128413305, "G", "Colorectal cancer", "risk", "Cancer"),  # synthetic key: same locus as rs6983267 above, suffixed to keep dict keys unique
    "rs4779584": ("15", 32994756, "T", "Colorectal cancer", "risk", "Cancer"),
    "rs10795668": ("10", 8701219, "G", "Colorectal cancer", "protective", "Cancer"),
    # Lung Cancer
    "rs8034191": ("15", 78894339, "C", "Lung cancer", "risk", "Cancer"),
    "rs1051730": ("15", 78882925, "A", "Lung cancer / Nicotine dependence", "risk", "Cancer"),
    "rs2736100": ("5", 1286516, "C", "Lung cancer (TERT)", "risk", "Cancer"),
    # Melanoma
    "rs910873": ("20", 32665748, "C", "Melanoma", "risk", "Cancer"),
    "rs1801516": ("11", 108175462, "A", "Melanoma (ATM)", "risk", "Cancer"),
    "rs16953002": ("12", 89328335, "A", "Melanoma", "risk", "Cancer"),
    # Thyroid Cancer
    "rs965513": ("9", 100556109, "A", "Thyroid cancer", "risk", "Cancer"),
    "rs944289": ("14", 36649246, "T", "Thyroid cancer", "risk", "Cancer"),
    # Bladder Cancer
    "rs710521": ("3", 189643526, "A", "Bladder cancer", "risk", "Cancer"),
    "rs9642880": ("8", 128787253, "T", "Bladder cancer", "risk", "Cancer"),
    # ========================================================================
    # BLOOD CLOTTING / THROMBOSIS (newly added)
    # ========================================================================
    "rs6025": ("1", 169519049, "T", "Factor V Leiden / DVT risk", "risk", "Thrombosis"),
    "rs1799963": ("11", 46761055, "A", "Prothrombin G20210A / DVT risk", "risk", "Thrombosis"),
    "rs8176719": ("9", 136131322, "C", "Blood type O (protective for VTE)", "protective", "Thrombosis"),
    "rs505922": ("9", 136149229, "C", "Venous thromboembolism", "risk", "Thrombosis"),
    "rs2066865": ("4", 155525276, "G", "Fibrinogen levels / DVT", "risk", "Thrombosis"),
    # ========================================================================
    # THYROID DISORDERS (newly added)
    # ========================================================================
    "rs1991517": ("8", 133020441, "C", "Hypothyroidism", "risk", "Thyroid"),
    "rs925489": ("2", 218283107, "T", "TSH levels", "higher", "Thyroid"),
    "rs10499559": ("6", 166474536, "T", "Hypothyroidism", "risk", "Thyroid"),
    "rs7850258": ("9", 4126287, "G", "Thyroid function", "altered", "Thyroid"),
    # ========================================================================
    # OSTEOPOROSIS / BONE HEALTH (newly added)
    # ========================================================================
    "rs3736228": ("11", 68179081, "T", "Osteoporosis / Low BMD", "risk", "Bone"),
    "rs4988235": ("2", 136608646, "G", "Lactose intolerance (affects Ca)", "risk", "Bone"),
    "rs2282679": ("4", 72608383, "C", "Vitamin D deficiency", "risk", "Bone"),
    "rs1800012": ("17", 48275363, "T", "Osteoporosis (COL1A1)", "risk", "Bone"),
    "rs2062377": ("8", 119964052, "A", "Bone mineral density", "lower", "Bone"),
    "rs4355801": ("8", 119963145, "G", "Bone mineral density", "higher", "Bone"),
    # ========================================================================
    # LIVER DISEASE (newly added)
    # ========================================================================
    "rs738409": ("22", 44324727, "G", "NAFLD / Fatty liver (PNPLA3)", "risk", "Liver"),
    "rs58542926": ("19", 19379549, "T", "NAFLD / Liver fibrosis (TM6SF2)", "risk", "Liver"),
    "rs2228603": ("19", 11350488, "T", "NAFLD", "risk", "Liver"),
    "rs12979860": ("19", 39248147, "C", "Hepatitis C clearance", "favorable", "Liver"),
    # ========================================================================
    # MIGRAINE / HEADACHE (newly added)
    # ========================================================================
    "rs2651899": ("1", 10796866, "C", "Migraine", "risk", "Migraine"),
    "rs10166942": ("2", 234824778, "T", "Migraine", "risk", "Migraine"),
    "rs11172113": ("12", 57527283, "C", "Migraine (LRP1)", "risk", "Migraine"),
    "rs1835740": ("8", 87521374, "A", "Migraine", "risk", "Migraine"),
    # ========================================================================
    # LONGEVITY / AGING (newly added)
    # ========================================================================
    "rs2802292": ("6", 157192662, "G", "Longevity (FOXO3)", "protective", "Longevity"),
    "rs1042522": ("17", 7579472, "C", "Longevity (TP53)", "altered", "Longevity"),
    "rs4420638": ("19", 45422946, "A", "Longevity / Cardiovascular", "risk", "Longevity"),
    # ========================================================================
    # SLEEP / CIRCADIAN (existing + expanded)
    # ========================================================================
    "rs113851554": ("2", 66799986, "T", "Insomnia", "risk", "Sleep"),
    "rs12927162": ("16", 68856985, "A", "Sleep duration", "shorter", "Sleep"),
    "rs1823125": ("1", 205713532, "G", "Chronotype (morning person)", "morning", "Sleep"),
    "rs10493596": ("1", 215803417, "T", "Insomnia", "risk", "Sleep"),
    "rs3104997": ("6", 27424938, "C", "Sleep duration", "shorter", "Sleep"),
    "rs73598374": ("4", 94847526, "A", "Insomnia", "risk", "Sleep"),
    "rs2302729": ("5", 35857091, "G", "Insomnia", "risk", "Sleep"),
    "rs12936231": ("17", 44282378, "C", "Restless legs syndrome", "risk", "Sleep"),
    "rs3923809": ("6", 38642286, "A", "Restless legs syndrome (BTBD9)", "risk", "Sleep"),
    # ========================================================================
    # SKIN CONDITIONS (existing + expanded)
    # ========================================================================
    "rs1800629": ("6", 31543031, "A", "Psoriasis", "risk", "Skin"),
    "rs20541": ("5", 131995964, "A", "Atopic dermatitis", "risk", "Skin"),
    "rs2066808": ("6", 31540784, "A", "Psoriasis", "risk", "Skin"),
    "rs3093662": ("6", 31574339, "G", "Psoriasis", "risk", "Skin"),
    "rs10484554": ("6", 31271836, "A", "Psoriasis", "risk", "Skin"),
    "rs1295686": ("5", 131996447, "A", "Atopic dermatitis", "risk", "Skin"),
    "rs2227956": ("6", 31783279, "T", "Psoriasis", "risk", "Skin"),
    "rs6906021": ("6", 32051991, "C", "Atopic dermatitis", "risk", "Skin"),
    "rs12203592": ("6", 396321, "T", "Skin pigmentation / Freckling", "risk", "Skin"),
    "rs1805007": ("16", 89986117, "T", "Red hair / Fair skin (MC1R)", "risk", "Skin"),
    "rs1805008": ("16", 89986144, "T", "Red hair / Fair skin (MC1R)", "risk", "Skin"),
    # ========================================================================
    # CARDIOVASCULAR (existing + greatly expanded)
    # ========================================================================
    "rs10757274": ("9", 22096055, "G", "Coronary artery disease", "risk", "Cardiovascular"),
    "rs1333049": ("9", 22125503, "C", "Coronary artery disease", "risk", "Cardiovascular"),
    "rs4665058": ("2", 43845437, "C", "Coronary artery disease", "risk", "Cardiovascular"),
    "rs17465637": ("1", 222823529, "A", "Coronary artery disease", "risk", "Cardiovascular"),
    "rs6725887": ("2", 203828796, "C", "Coronary artery disease", "risk", "Cardiovascular"),
    # Hypertension
    "rs699": ("1", 230845794, "G", "Hypertension (AGT)", "risk", "Cardiovascular"),
    "rs5186": ("3", 148459988, "C", "Hypertension (AGTR1)", "risk", "Cardiovascular"),
    "rs4961": ("4", 2906707, "T", "Hypertension / Salt sensitivity", "risk", "Cardiovascular"),
    "rs1799998": ("8", 142876043, "T", "Hypertension (CYP11B2)", "risk", "Cardiovascular"),
    # Atrial Fibrillation
    "rs2200733": ("4", 111718106, "T", "Atrial fibrillation", "risk", "Cardiovascular"),
    "rs10033464": ("4", 111714418, "T", "Atrial fibrillation", "risk", "Cardiovascular"),
    "rs6843082": ("4", 111712344, "G", "Atrial fibrillation (PITX2)", "risk", "Cardiovascular"),
    # Heart Failure
    "rs1739843": ("15", 75086042, "T", "Heart failure", "risk", "Cardiovascular"),
    # Stroke
    "rs11833579": ("12", 115553310, "A", "Ischemic stroke", "risk", "Cardiovascular"),
    "rs12425791": ("12", 115557677, "A", "Stroke (NINJ2)", "risk", "Cardiovascular"),
    # Lipids
    "rs1801177": ("8", 19813529, "A", "LDL cholesterol (LPL)", "higher", "Cardiovascular"),
    "rs12740374": ("1", 109822166, "G", "LDL cholesterol (CELSR2)", "lower", "Cardiovascular"),
    "rs3764261": ("16", 56993324, "A", "HDL cholesterol (CETP)", "higher", "Cardiovascular"),
    "rs1800588": ("15", 58723675, "T", "HDL cholesterol (LIPC)", "higher", "Cardiovascular"),
    "rs328": ("8", 19819724, "G", "Triglycerides (LPL)", "lower", "Cardiovascular"),
    "rs662799": ("11", 116663707, "G", "Triglycerides (APOA5)", "higher", "Cardiovascular"),
    # ========================================================================
    # TYPE 2 DIABETES / METABOLIC (existing + expanded)
    # ========================================================================
    "rs7903146": ("10", 114758349, "T", "Type 2 diabetes (TCF7L2)", "risk", "Metabolic"),
    "rs12255372": ("10", 114808902, "T", "Type 2 diabetes (TCF7L2)", "risk", "Metabolic"),
    "rs1801282": ("3", 12393125, "C", "Type 2 diabetes (PPARG)", "risk", "Metabolic"),
    "rs5219": ("11", 17409572, "T", "Type 2 diabetes (KCNJ11)", "risk", "Metabolic"),
    "rs13266634": ("8", 118184783, "C", "Type 2 diabetes (SLC30A8)", "risk", "Metabolic"),
    "rs7754840": ("6", 20679709, "C", "Type 2 diabetes (CDKAL1)", "risk", "Metabolic"),
    "rs10811661": ("9", 22134095, "T", "Type 2 diabetes (CDKN2A/B)", "risk", "Metabolic"),
    "rs864745": ("7", 28196413, "T", "Type 2 diabetes (JAZF1)", "risk", "Metabolic"),
    "rs4402960": ("3", 185511687, "T", "Type 2 diabetes (IGF2BP2)", "risk", "Metabolic"),
    # Obesity/BMI
    "rs9939609": ("16", 53820527, "A", "Obesity (FTO)", "risk", "Metabolic"),
    "rs17782313": ("18", 57851097, "C", "Obesity (MC4R)", "risk", "Metabolic"),
    "rs6548238": ("2", 634905, "C", "BMI", "higher", "Metabolic"),
    "rs10938397": ("4", 45186139, "G", "BMI (GNPDA2)", "higher", "Metabolic"),
    "rs571312": ("18", 57839769, "A", "BMI (MC4R)", "higher", "Metabolic"),
    "rs10767664": ("11", 27682562, "A", "BMI (BDNF)", "higher", "Metabolic"),
    # ========================================================================
    # EYE CONDITIONS (existing + expanded)
    # ========================================================================
    "rs10490924": ("10", 124214448, "T", "Age-related macular degeneration (ARMS2)", "risk", "Eye"),
    "rs1061170": ("1", 196659237, "C", "Age-related macular degeneration (CFH)", "risk", "Eye"),
    "rs9621532": ("22", 38477587, "C", "Myopia", "risk", "Eye"),
    "rs10034228": ("4", 81951543, "A", "Myopia", "risk", "Eye"),
    "rs1048661": ("1", 165655423, "C", "Glaucoma (LOXL1)", "risk", "Eye"),
    "rs4656461": ("1", 165653012, "G", "Glaucoma (LOXL1)", "risk", "Eye"),
    "rs2165241": ("15", 93600556, "T", "Glaucoma", "risk", "Eye"),
    "rs3753841": ("1", 196704632, "C", "Age-related macular degeneration", "risk", "Eye"),
    # ========================================================================
    # NEUROPSYCHIATRIC (original)
    # ========================================================================
    # Alzheimer's Disease
    "rs429358": ("19", 45411941, "C", "Alzheimer's disease (APOE e4)", "risk", "Neuropsychiatric"),
    "rs7412": ("19", 45412079, "T", "Alzheimer's disease (APOE e2)", "protective", "Neuropsychiatric"),
    "rs3865444": ("19", 51727962, "C", "Alzheimer's disease (CD33)", "risk", "Neuropsychiatric"),
    "rs744373": ("2", 127892810, "G", "Alzheimer's disease (BIN1)", "risk", "Neuropsychiatric"),
    "rs3851179": ("11", 85868640, "T", "Alzheimer's disease (PICALM)", "protective", "Neuropsychiatric"),
    "rs670139": ("11", 59939307, "G", "Alzheimer's disease (MS4A)", "risk", "Neuropsychiatric"),
    "rs9349407": ("6", 47487762, "C", "Alzheimer's disease (CD2AP)", "risk", "Neuropsychiatric"),
    "rs11136000": ("8", 27468503, "C", "Alzheimer's disease (CLU)", "protective", "Neuropsychiatric"),
    "rs3764650": ("19", 1063443, "G", "Alzheimer's disease (ABCA7)", "risk", "Neuropsychiatric"),
    "rs3818361": ("1", 207692049, "A", "Alzheimer's disease (CR1)", "risk", "Neuropsychiatric"),
    # Parkinson's Disease (new)
    "rs356220": ("4", 90626111, "T", "Parkinson's disease (SNCA)", "risk", "Neuropsychiatric"),
    "rs11931074": ("4", 90674917, "G", "Parkinson's disease (SNCA)", "risk", "Neuropsychiatric"),
    "rs34637584": ("12", 40734202, "A", "Parkinson's disease (LRRK2)", "risk", "Neuropsychiatric"),
    "rs34311866": ("4", 951947, "C", "Parkinson's disease (TMEM175)", "risk", "Neuropsychiatric"),
    # Depression
    "rs1545843": ("1", 72761657, "A", "Major depression (NEGR1)", "risk", "Neuropsychiatric"),
    "rs7973260": ("12", 118364392, "A", "Major depression (KSR2)", "risk", "Neuropsychiatric"),
    "rs10514299": ("5", 87992715, "T", "Major depression (TMEM161B)", "risk", "Neuropsychiatric"),
    "rs2422321": ("15", 88945878, "G", "Major depression (NTRK3)", "risk", "Neuropsychiatric"),
    "rs301806": ("1", 8477981, "A", "Major depression (RERE)", "risk", "Neuropsychiatric"),
    "rs1432639": ("3", 117115304, "G", "Major depression (LSAMP)", "risk", "Neuropsychiatric"),
    "rs9530139": ("13", 53645407, "G", "Major depression", "risk", "Neuropsychiatric"),
    "rs4543289": ("10", 106610839, "T", "Major depression (SORCS3)", "risk", "Neuropsychiatric"),
    # Anxiety
    "rs1709393": ("1", 34774088, "A", "Anxiety disorder", "risk", "Neuropsychiatric"),
    "rs7688285": ("4", 123372626, "A", "Anxiety disorder", "risk", "Neuropsychiatric"),
    # Bipolar
    "rs4765913": ("12", 2345295, "A", "Bipolar disorder (CACNA1C)", "risk", "Neuropsychiatric"),
    "rs10994336": ("10", 64649959, "T", "Bipolar disorder (ANK3)", "risk", "Neuropsychiatric"),
    "rs9804190": ("11", 79077426, "C", "Bipolar disorder (ODZ4)", "risk", "Neuropsychiatric"),
    # Schizophrenia
    "rs1625579": ("8", 130635575, "T", "Schizophrenia (MIR137)", "risk", "Neuropsychiatric"),
    "rs2007044": ("6", 28626894, "G", "Schizophrenia (HIST1H2BJ)", "risk", "Neuropsychiatric"),
    "rs6932590": ("6", 27243984, "T", "Schizophrenia", "risk", "Neuropsychiatric"),
    # ADHD (new)
    "rs1412005": ("16", 73099702, "T", "ADHD", "risk", "Neuropsychiatric"),
    "rs11210892": ("1", 44185231, "A", "ADHD", "risk", "Neuropsychiatric"),
    # ========================================================================
    # OTHER TRAITS (original + expanded)
    # ========================================================================
    # Caffeine
    "rs762551": ("15", 75041917, "C", "Caffeine metabolism (slow)", "slow", "Other"),
    "rs2472297": ("15", 75027880, "T", "Caffeine consumption", "higher", "Other"),
    # Alcohol
    "rs671": ("12", 112241766, "A", "Alcohol flush reaction (ALDH2)", "risk", "Other"),
    "rs1229984": ("4", 100239319, "T", "Alcohol metabolism (ADH1B)", "fast", "Other"),
    # Lactose
    "rs4988235_lct": ("2", 136608646, "G", "Lactose intolerance (LCT)", "risk", "Other"),
    # Vitamin D
    "rs12785878": ("11", 71167449, "T", "Vitamin D levels (lower)", "lower", "Other"),
    # Hair
    "rs2180439": ("20", 22162468, "T", "Male pattern baldness", "risk", "Other"),
    "rs1160312": ("X", 67052952, "A", "Male pattern baldness (AR)", "risk", "Other"),
    "rs6625163": ("X", 67177092, "A", "Male pattern baldness", "risk", "Other"),
    # Muscle performance (new)
    "rs1815739": ("11", 66560624, "T", "Sprint/Power athlete (ACTN3)", "power", "Other"),
    # Bitter taste (new)
    "rs713598": ("7", 141972804, "C", "Bitter taste sensitivity (PTC)", "taster", "Other"),
    "rs1726866": ("7", 141972905, "T", "Bitter taste sensitivity", "taster", "Other"),
    # Cilantro aversion (new)
    "rs72921001": ("11", 6889648, "A", "Cilantro aversion", "aversion", "Other"),
 }
 # Category display order and descriptions
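 CATEGORIES = {
    "Gout": "Gout / uric acid metabolism",
    "Kidney": "Kidney disease",
    "Hearing": "Hearing loss",
    "Autoimmune": "Autoimmune disease",
    "Cancer": "Cancer risk",
    "Thrombosis": "Thrombosis / coagulation",
    "Thyroid": "Thyroid disease",
    "Bone": "Osteoporosis / bone health",
    "Liver": "Liver disease",
    "Migraine": "Migraine",
    "Longevity": "Longevity / aging",
    "Sleep": "Sleep",
    "Skin": "Skin",
    "Cardiovascular": "Cardiovascular disease",
    "Metabolic": "Metabolic disease",
    "Eye": "Eye disease",
    "Neuropsychiatric": "Neuropsychiatric disorders",
    "Other": "Other traits",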
 }
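 def get_genotype_class(gt: str) -> str:
    """Classify a VCF GT string as MISSING, HOM_REF, HOM_ALT, or HET."""
    if gt in ['./.', '.|.', '.']:
        return 'MISSING'
    alleles = re.split('[/|]', gt)
    # HOM_ALT requires every allele to be the same non-reference allele,
    # so a call such as 1/2 (two different ALT alleles) is reported as HET.
    if all(a == '0' for a in alleles):
        return 'HOM_REF'
    elif '.' not in alleles and len(set(alleles)) == 1:
        return 'HOM_ALT'
    else:
        return 'HET'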
 def parse_vcf_for_traits(vcf_path: str, sample_idx: int = 2) -> Tuple[Dict, List]:
    """Parse VCF and look for trait-associated SNPs"""
    print(f"Scanning VCF for {len(TRAIT_SNPS)} trait-associated variants...")
    # Build position lookup
    pos_to_snp = {}
    for rsid, (chrom, pos, risk_allele, trait, effect, category) in TRAIT_SNPS.items():
        key = f"{chrom}-{pos}"
        if key not in pos_to_snp:
            pos_to_snp[key] = []
        pos_to_snp[key].append((rsid, risk_allele, trait, effect, category))
    found_variants = {}
    samples = []
    open_func = gzip.open if vcf_path.endswith('.gz') else open
    mode = 'rt' if vcf_path.endswith('.gz') else 'r'
    with open_func(vcf_path, mode) as f:
        for line in f:
            if line.startswith('##'):
                continue
            elif line.startswith('#CHROM'):
                parts = line.strip().split('\t')
                samples = parts[9:]
                continue
            parts = line.strip().split('\t')
            if len(parts) < 10:
                continue
            chrom, pos, rsid_vcf, ref, alt, qual, filt, info, fmt = parts[:9]
            gt_fields = parts[9:]
            # Check if this position has a known trait SNP
            key = f"{chrom}-{pos}"
            if key not in pos_to_snp:
                continue
            # Get sample genotype
            fmt_parts = fmt.split(':')
            gt_idx = fmt_parts.index('GT') if 'GT' in fmt_parts else 0
            if sample_idx < len(gt_fields):
                gt_data = gt_fields[sample_idx].split(':')
                gt = gt_data[gt_idx] if gt_idx < len(gt_data) else './.'
            else:
                gt = './.'
            gt_class = get_genotype_class(gt)
            alleles = [ref] + alt.split(',')
            # Process each SNP at this position
            for rsid, risk_allele, trait, effect, category in pos_to_snp[key]:
                # Check if risk allele is present
                has_risk = False
                risk_copies = 0
                if gt_class != 'MISSING':
                    gt_alleles = re.split('[/|]', gt)
                    for a in gt_alleles:
                        if a.isdigit():
                            allele_idx = int(a)
                            if allele_idx < len(alleles) and alleles[allele_idx] == risk_allele:
                                has_risk = True
                                risk_copies += 1
                found_variants[rsid] = {
                    'rsid': rsid,
                    'chrom': chrom,
                    'pos': pos,
                    'ref': ref,
                    'alt': alt,
                    'genotype': gt,
                    'genotype_class': gt_class,
                    'risk_allele': risk_allele,
                    'trait': trait,
                    'effect': effect,
                    'category': category,
                    'has_risk_allele': has_risk,
                    'risk_copies': risk_copies
                }
    return found_variants, samples
 def generate_report(found_variants: Dict, output_path: str, sample_name: str):
    """Generate comprehensive trait analysis report"""
    # Group by category
    by_category = defaultdict(list)
    for rsid, var in found_variants.items():
        by_category[var['category']].append(var)
    with open(output_path, 'w') as f:
        f.write("=" * 80 + "\n")
        f.write("COMPREHENSIVE GWAS TRAIT ANALYSIS REPORT\n")
        f.write(f"Sample: {sample_name}\n")
        f.write(f"Total SNPs analyzed: {len(TRAIT_SNPS)}\n")
        f.write(f"SNPs found in data: {len(found_variants)}\n")
        f.write("=" * 80 + "\n\n")
        # Summary statistics
        total_risk = sum(1 for v in found_variants.values() if v['has_risk_allele'])
        f.write(f"OVERALL SUMMARY: {total_risk} risk variants found\n\n")
        # Category summary
        f.write("=" * 80 + "\n")
        f.write("SUMMARY BY CATEGORY\n")
        f.write("=" * 80 + "\n\n")
        for cat_key in CATEGORIES.keys():
            if cat_key in by_category:
                variants = by_category[cat_key]
                risk_count = sum(1 for v in variants if v['has_risk_allele'])
                cat_name = CATEGORIES[cat_key]
                f.write(f"{cat_name}: {risk_count}/{len(variants)} risk variants\n")
        # Detailed results by category
        f.write("\n" + "=" * 80 + "\n")
        f.write("DETAILED RESULTS BY CATEGORY\n")
        f.write("=" * 80 + "\n")
        for cat_key in CATEGORIES.keys():
            if cat_key not in by_category:
                continue
            variants = by_category[cat_key]
            cat_name = CATEGORIES[cat_key]
            risk_count = sum(1 for v in variants if v['has_risk_allele'])
            f.write(f"\n\n## {cat_name} ({risk_count}/{len(variants)} risk)\n")
            f.write("-" * 60 + "\n")
            # Sort: risk variants first
            sorted_vars = sorted(variants, key=lambda x: (not x['has_risk_allele'], x['trait']))
            for v in sorted_vars:
                status = "⚠️ RISK" if v['has_risk_allele'] else "✓ OK"
                copies = f"({v['risk_copies']} copies)" if v['has_risk_allele'] else ""
                f.write(f"\n{v['trait']}: {v['rsid']} [{status}] {copies}\n")
                f.write(f"  Genotype: {v['genotype']} | Risk allele: {v['risk_allele']} | Effect: {v['effect']}\n")
        # Full variant table
        f.write("\n\n" + "=" * 80 + "\n")
        f.write("COMPLETE VARIANT TABLE\n")
        f.write("=" * 80 + "\n\n")
        f.write("RSID\tCHROM\tPOS\tGENOTYPE\tRISK_ALLELE\tHAS_RISK\tCOPIES\tTRAIT\tCATEGORY\tEFFECT\n")
        for rsid, var in sorted(found_variants.items(), key=lambda x: (x[1]['category'], x[1]['trait'])):
            f.write(f"{var['rsid']}\t{var['chrom']}\t{var['pos']}\t{var['genotype']}\t")
            f.write(f"{var['risk_allele']}\t{var['has_risk_allele']}\t{var['risk_copies']}\t")
            f.write(f"{var['trait']}\t{var['category']}\t{var['effect']}\n")
    print(f"Report saved to: {output_path}")
 def main():
    vcf_path = sys.argv[1] if len(sys.argv) > 1 else '/Volumes/NV2/genomics_analysis/vcf/trio_joint.rsid.vcf.gz'
    output_path = sys.argv[2] if len(sys.argv) > 2 else '/Volumes/NV2/genomics_analysis/gwas_comprehensive_report.txt'
    sample_idx = int(sys.argv[3]) if len(sys.argv) > 3 else 2
    print("=" * 60)
    print("COMPREHENSIVE GWAS TRAIT ANALYSIS")
    print("=" * 60)
    print(f"VCF: {vcf_path}")
    print(f"Sample index: {sample_idx}")
    print(f"Total trait SNPs in database: {len(TRAIT_SNPS)}")
    print()
    found_variants, samples = parse_vcf_for_traits(vcf_path, sample_idx)
    sample_name = samples[sample_idx] if sample_idx < len(samples) else f"Sample_{sample_idx}"
    print(f"Analyzing sample: {sample_name}")
    print(f"\nFound {len(found_variants)} trait-associated variants in VCF")
    # Quick summary by category
    by_category = defaultdict(list)
    for rsid, var in found_variants.items():
        by_category[var['category']].append(var)
    print("\n" + "=" * 60)
    print("QUICK SUMMARY BY CATEGORY")
    print("=" * 60)
    for cat_key in CATEGORIES.keys():
        if cat_key in by_category:
            variants = by_category[cat_key]
            risk_count = sum(1 for v in variants if v['has_risk_allele'])
            cat_name = CATEGORIES[cat_key]
            marker = "⚠️ " if risk_count > 0 else "  "
            print(f"{marker}{cat_name}: {risk_count}/{len(variants)} risk variants")
    generate_report(found_variants, output_path, sample_name)
    # Print high-risk findings
    print("\n" + "=" * 60)
    print("HIGH-PRIORITY FINDINGS (2+ copies of risk allele)")
    print("=" * 60)
    high_risk = [v for v in found_variants.values() if v['risk_copies'] >= 2]
    if high_risk:
        for v in sorted(high_risk, key=lambda x: x['category']):
            print(f"\n{v['trait']} ({v['rsid']})")
            print(f"  Category: {CATEGORIES[v['category']]}")
            print(f"  Genotype: {v['genotype']} ({v['risk_copies']} copies of risk allele {v['risk_allele']})")
    else:
        print("\nNo variants with 2 copies of risk allele found.")
 if __name__ == '__main__':
    main()
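The lookup in `parse_vcf_for_traits` keys on the literal `CHROM` string and compares the risk allele against REF/ALT exactly as written in the VCF. The sketch below isolates that counting step as a standalone, testable helper. It is not part of the script: `normalize_chrom` and `count_risk_copies` are hypothetical names, and stripping a `chr` prefix is an added assumption for VCFs that use `chr16` rather than `16`.

```python
import re


def normalize_chrom(chrom: str) -> str:
    """Strip a leading 'chr' so 'chr16' and '16' give the same lookup key.

    Hypothetical helper; the original script uses CHROM unchanged.
    """
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def count_risk_copies(gt: str, ref: str, alt: str, risk_allele: str) -> int:
    """Count how many alleles in a GT string equal the risk allele.

    GT indices refer to [REF, ALT1, ALT2, ...]; missing alleles ('.')
    and out-of-range indices are skipped rather than counted.
    """
    alleles = [ref] + alt.split(",")
    copies = 0
    for a in re.split(r"[/|]", gt):
        if a.isdigit() and int(a) < len(alleles) and alleles[int(a)] == risk_allele:
            copies += 1
    return copies


if __name__ == "__main__":
    # Heterozygous carrier of a T risk allele at a C>T site
    print(count_risk_copies("0/1", "C", "T", "T"))
    # The risk allele can also be the reference allele
    print(count_risk_copies("0/0", "T", "C", "T"))
```

Because the comparison is a plain string match, it assumes the risk alleles in `TRAIT_SNPS` are on the same strand and genome build as the VCF.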
--- a/gwas_trait_lookup.py
+++ b/gwas_trait_lookup.py
@@ -0,0 +1,365 @@
 #!/usr/bin/env python3
 """
 GWAS Trait Lookup Script
 Searches for trait-associated variants in VCF data using GWAS Catalog data.
 """
 import gzip
 import sys
 import os
 from collections import defaultdict
 from dataclasses import dataclass
 from typing import Dict, List, Optional, Set, Tuple
@dataclass
 class GWASAssociation:
    """GWAS association entry"""
    rsid: str
    chrom: str
    pos: int
    risk_allele: str
    trait: str
    p_value: float
    odds_ratio: Optional[float]
    beta: Optional[float]
    pubmed_id: str
    study: str
@dataclass
 class TraitResult:
    """Result for a specific trait"""
    trait: str
    variants_found: int
    risk_variants: int
    protective_variants: int
    details: List[dict]
 # Common trait-associated SNPs (curated from GWAS Catalog)
 # Format: rsid -> (chrom, pos, risk_allele, trait, effect_direction)
 TRAIT_SNPS = {
    # Sleep quality / Insomnia
    "rs113851554": ("2", 66799986, "T", "Insomnia", "risk"),
    "rs12927162": ("16", 68856985, "A", "Sleep duration", "shorter"),
    "rs1823125": ("1", 205713532, "G", "Chronotype (morning person)", "morning"),
    "rs10493596": ("1", 215803417, "T", "Insomnia", "risk"),
    "rs3104997": ("6", 27424938, "C", "Sleep duration", "shorter"),
    "rs73598374": ("4", 94847526, "A", "Insomnia", "risk"),
    "rs2302729": ("5", 35857091, "G", "Insomnia", "risk"),
    # Skin conditions
    "rs1800629": ("6", 31543031, "A", "Psoriasis", "risk"),
    "rs20541": ("5", 131995964, "A", "Atopic dermatitis", "risk"),
    "rs2066808": ("6", 31540784, "A", "Psoriasis", "risk"),
    "rs3093662": ("6", 31574339, "G", "Psoriasis", "risk"),
    "rs10484554": ("6", 31271836, "A", "Psoriasis", "risk"),
    "rs1295686": ("5", 131996447, "A", "Atopic dermatitis", "risk"),
    "rs2227956": ("6", 31783279, "T", "Psoriasis", "risk"),
    "rs6906021": ("6", 32051991, "C", "Atopic dermatitis", "risk"),
    "rs661313": ("1", 152285861, "G", "Ichthyosis vulgaris (FLG)", "risk"),
    "rs2065958": ("11", 35300406, "C", "Atopic dermatitis", "risk"),
    # Cardiovascular
    "rs10757274": ("9", 22096055, "G", "Coronary artery disease", "risk"),
    "rs1333049": ("9", 22125503, "C", "Coronary artery disease", "risk"),
    "rs4665058": ("2", 43845437, "C", "Coronary artery disease", "risk"),
    "rs17465637": ("1", 222823529, "A", "Coronary artery disease", "risk"),
    "rs6725887": ("2", 203828796, "C", "Coronary artery disease", "risk"),
    # Type 2 Diabetes
    "rs7903146": ("10", 114758349, "T", "Type 2 diabetes", "risk"),
    "rs12255372": ("10", 114808902, "T", "Type 2 diabetes", "risk"),
    "rs1801282": ("3", 12393125, "C", "Type 2 diabetes", "risk"),
    "rs5219": ("11", 17409572, "T", "Type 2 diabetes", "risk"),
    "rs13266634": ("8", 118184783, "C", "Type 2 diabetes", "risk"),
    # Obesity/BMI
    "rs9939609": ("16", 53820527, "A", "Obesity (FTO)", "risk"),
    "rs17782313": ("18", 57851097, "C", "Obesity (MC4R)", "risk"),
    "rs6548238": ("2", 634905, "C", "BMI", "higher"),
    "rs10938397": ("4", 45186139, "G", "BMI", "higher"),
    # Hair loss / Baldness
    "rs2180439": ("20", 22162468, "T", "Male pattern baldness", "risk"),
    "rs1160312": ("X", 67052952, "A", "Male pattern baldness", "risk"),
    "rs6625163": ("X", 67177092, "A", "Male pattern baldness", "risk"),
    # Eye conditions (relevant to Usher)
    "rs10490924": ("10", 124214448, "T", "Age-related macular degeneration", "risk"),
    "rs1061170": ("1", 196659237, "C", "Age-related macular degeneration", "risk"),
    "rs9621532": ("22", 38477587, "C", "Myopia", "risk"),
    # Caffeine metabolism
    "rs762551": ("15", 75041917, "C", "Caffeine metabolism (slow)", "slow"),
    "rs2472297": ("15", 75027880, "T", "Caffeine consumption", "higher"),
    # Alcohol metabolism
    "rs671": ("12", 112241766, "A", "Alcohol flush reaction", "risk"),
    "rs1229984": ("4", 100239319, "T", "Alcohol metabolism (fast)", "fast"),
    # Lactose intolerance
    "rs4988235": ("2", 136608646, "G", "Lactose intolerance", "risk"),
    # Vitamin D
    "rs2282679": ("4", 72608383, "C", "Vitamin D deficiency", "risk"),
    "rs12785878": ("11", 71167449, "T", "Vitamin D levels (lower)", "lower"),
    # Alzheimer's Disease / Dementia
    "rs429358": ("19", 45411941, "C", "Alzheimer's disease (APOE e4)", "risk"),  # APOE e4
    "rs7412": ("19", 45412079, "T", "Alzheimer's disease (APOE e2)", "protective"),  # APOE e2
    "rs3865444": ("19", 51727962, "C", "Alzheimer's disease (CD33)", "risk"),
    "rs744373": ("2", 127892810, "G", "Alzheimer's disease (BIN1)", "risk"),
    "rs3851179": ("11", 85868640, "T", "Alzheimer's disease (PICALM)", "protective"),
    "rs670139": ("11", 59939307, "G", "Alzheimer's disease (MS4A)", "risk"),
    "rs9349407": ("6", 47487762, "C", "Alzheimer's disease (CD2AP)", "risk"),
    "rs11136000": ("8", 27468503, "C", "Alzheimer's disease (CLU)", "protective"),
    "rs3764650": ("19", 1063443, "G", "Alzheimer's disease (ABCA7)", "risk"),
    "rs3818361": ("1", 207692049, "A", "Alzheimer's disease (CR1)", "risk"),
    # Depression / Major Depressive Disorder
    "rs1545843": ("1", 72761657, "A", "Major depression (NEGR1)", "risk"),
    "rs7973260": ("12", 118364392, "A", "Major depression (KSR2)", "risk"),
    "rs10514299": ("5", 87992715, "T", "Major depression (TMEM161B)", "risk"),
    "rs2422321": ("15", 88945878, "G", "Major depression (NTRK3)", "risk"),
    "rs301806": ("1", 8477981, "A", "Major depression (RERE)", "risk"),
    "rs1432639": ("3", 117115304, "G", "Major depression (LSAMP)", "risk"),
    "rs9530139": ("13", 53645407, "G", "Major depression", "risk"),
    "rs4543289": ("10", 106610839, "T", "Major depression (SORCS3)", "risk"),
    # Anxiety
    "rs1709393": ("1", 34774088, "A", "Anxiety disorder", "risk"),
    "rs7688285": ("4", 123372626, "A", "Anxiety disorder", "risk"),
    # Bipolar disorder
    "rs4765913": ("12", 2345295, "A", "Bipolar disorder (CACNA1C)", "risk"),
    "rs10994336": ("10", 64649959, "T", "Bipolar disorder (ANK3)", "risk"),
    "rs9804190": ("11", 79077426, "C", "Bipolar disorder (ODZ4)", "risk"),
    # Schizophrenia
    "rs1625579": ("8", 130635575, "T", "Schizophrenia (MIR137)", "risk"),
    "rs2007044": ("6", 28626894, "G", "Schizophrenia (HIST1H2BJ)", "risk"),
    "rs6932590": ("6", 27243984, "T", "Schizophrenia", "risk"),
 }
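 def get_genotype_class(gt: str) -> str:
    """Classify a VCF GT string as MISSING, HOM_REF, HOM_ALT, or HET."""
    if gt in ['./.', '.|.', '.']:
        return 'MISSING'
    import re
    alleles = re.split('[/|]', gt)
    # HOM_ALT requires every allele to be the same non-reference allele,
    # so a call such as 1/2 (two different ALT alleles) is reported as HET.
    if all(a == '0' for a in alleles):
        return 'HOM_REF'
    elif '.' not in alleles and len(set(alleles)) == 1:
        return 'HOM_ALT'
    else:
        return 'HET'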
 def parse_vcf_for_traits(vcf_path: str, proband_idx: int = 2) -> Tuple[Dict[str, dict], List[str]]:
    """Parse VCF and look for trait-associated SNPs"""
    print("Scanning VCF for trait-associated variants...")
    # Build position lookup
    pos_to_snp = {}
    for rsid, (chrom, pos, risk_allele, trait, effect) in TRAIT_SNPS.items():
        key = f"{chrom}-{pos}"
        pos_to_snp[key] = (rsid, risk_allele, trait, effect)
    found_variants = {}
    samples = []
    open_func = gzip.open if vcf_path.endswith('.gz') else open
    mode = 'rt' if vcf_path.endswith('.gz') else 'r'
    with open_func(vcf_path, mode) as f:
        for line in f:
            if line.startswith('##'):
                continue
            elif line.startswith('#CHROM'):
                parts = line.strip().split('\t')
                samples = parts[9:]
                continue
            parts = line.strip().split('\t')
            if len(parts) < 10:
                continue
            chrom, pos, rsid_vcf, ref, alt, qual, filt, info, fmt = parts[:9]
            gt_fields = parts[9:]
            # Check if this position has a known trait SNP
            key = f"{chrom}-{pos}"
            if key not in pos_to_snp:
                continue
            rsid, risk_allele, trait, effect = pos_to_snp[key]
            # Get proband genotype
            fmt_parts = fmt.split(':')
            gt_idx = fmt_parts.index('GT') if 'GT' in fmt_parts else 0
            if proband_idx < len(gt_fields):
                gt_data = gt_fields[proband_idx].split(':')
                gt = gt_data[gt_idx] if gt_idx < len(gt_data) else './.'
            else:
                gt = './.'
            gt_class = get_genotype_class(gt)
            # Determine risk
            alleles = [ref] + alt.split(',')
            # Check if risk allele is present
            has_risk = False
            risk_copies = 0
            if gt_class != 'MISSING':
                import re
                gt_alleles = re.split('[/|]', gt)
                for a in gt_alleles:
                    if a.isdigit():
                        allele_idx = int(a)
                        if allele_idx < len(alleles) and alleles[allele_idx] == risk_allele:
                            has_risk = True
                            risk_copies += 1
            found_variants[rsid] = {
                'rsid': rsid,
                'chrom': chrom,
                'pos': pos,
                'ref': ref,
                'alt': alt,
                'genotype': gt,
                'genotype_class': gt_class,
                'risk_allele': risk_allele,
                'trait': trait,
                'effect': effect,
                'has_risk_allele': has_risk,
                'risk_copies': risk_copies
            }
    return found_variants, samples
 def generate_trait_report(found_variants: Dict, output_path: str):
    """Generate trait analysis report"""
    # Group by trait
    traits = defaultdict(list)
    for rsid, var in found_variants.items():
        traits[var['trait']].append(var)
    with open(output_path, 'w') as f:
        f.write("# GWAS Trait Analysis Report\n")
        f.write("# Based on curated GWAS Catalog associations\n\n")
        f.write("=" * 80 + "\n")
        f.write("SUMMARY BY TRAIT CATEGORY\n")
        f.write("=" * 80 + "\n\n")
        # Categorize traits
        categories = {
            "Sleep": ["Insomnia", "Sleep duration", "Chronotype (morning person)"],
            "Skin": ["Psoriasis", "Atopic dermatitis", "Ichthyosis vulgaris (FLG)"],
            "Cardiovascular": ["Coronary artery disease"],
            "Metabolic": ["Type 2 diabetes", "Obesity (FTO)", "Obesity (MC4R)", "BMI"],
            "Eye": ["Age-related macular degeneration", "Myopia"],
            "Neuropsychiatric": [
                "Alzheimer's disease (APOE e4)", "Alzheimer's disease (APOE e2)",
                "Alzheimer's disease (CD33)", "Alzheimer's disease (BIN1)",
                "Alzheimer's disease (PICALM)", "Alzheimer's disease (MS4A)",
                "Alzheimer's disease (CD2AP)", "Alzheimer's disease (CLU)",
                "Alzheimer's disease (ABCA7)", "Alzheimer's disease (CR1)",
                "Major depression (NEGR1)", "Major depression (KSR2)",
                "Major depression (TMEM161B)", "Major depression (NTRK3)",
                "Major depression (RERE)", "Major depression (LSAMP)",
                "Major depression", "Major depression (SORCS3)",
                "Anxiety disorder",
                "Bipolar disorder (CACNA1C)", "Bipolar disorder (ANK3)", "Bipolar disorder (ODZ4)",
                "Schizophrenia (MIR137)", "Schizophrenia (HIST1H2BJ)", "Schizophrenia"
            ],
            "Other": ["Caffeine metabolism (slow)", "Caffeine consumption", "Alcohol flush reaction",
                     "Alcohol metabolism (fast)", "Lactose intolerance", "Vitamin D deficiency",
                     "Vitamin D levels (lower)", "Male pattern baldness"]
        }
        for category, trait_list in categories.items():
            f.write(f"\n## {category}\n")
            f.write("-" * 40 + "\n")
            category_risk = 0
            category_total = 0
            for trait in trait_list:
                if trait in traits:
                    variants = traits[trait]
                    risk_count = sum(1 for v in variants if v['has_risk_allele'])
                    category_risk += risk_count
                    category_total += len(variants)
                    for v in variants:
                        status = "RISK" if v['has_risk_allele'] else "OK"
                        copies = f"({v['risk_copies']} copies)" if v['has_risk_allele'] else ""
                        f.write(f"  {v['trait']}: {v['rsid']} [{status}] {copies}\n")
                        f.write(f"    Genotype: {v['genotype']} | Risk allele: {v['risk_allele']}\n")
            if category_total > 0:
                f.write(f"\n  Category summary: {category_risk}/{category_total} risk variants found\n")
        # Detailed results
        f.write("\n" + "=" * 80 + "\n")
        f.write("DETAILED RESULTS\n")
        f.write("=" * 80 + "\n\n")
        f.write("RSID\tCHROM\tPOS\tGENOTYPE\tRISK_ALLELE\tHAS_RISK\tCOPIES\tTRAIT\tEFFECT\n")
        for rsid, var in sorted(found_variants.items(), key=lambda x: (x[1]['trait'], x[0])):
            f.write(f"{var['rsid']}\t{var['chrom']}\t{var['pos']}\t{var['genotype']}\t")
            f.write(f"{var['risk_allele']}\t{var['has_risk_allele']}\t{var['risk_copies']}\t")
            f.write(f"{var['trait']}\t{var['effect']}\n")
    print(f"Report saved to: {output_path}")
 def main():
    vcf_path = sys.argv[1] if len(sys.argv) > 1 else '/Volumes/NV2/genomics_analysis/vcf/trio_joint.snpeff.vcf'
    output_path = sys.argv[2] if len(sys.argv) > 2 else '/Volumes/NV2/genomics_analysis/gwas_trait_report.txt'
    proband_idx = int(sys.argv[3]) if len(sys.argv) > 3 else 2
    print("GWAS Trait Analysis")
    print(f"VCF: {vcf_path}")
    print(f"Proband index: {proband_idx}")
    print(f"Searching for {len(TRAIT_SNPS)} trait-associated SNPs...\n")
    found_variants, samples = parse_vcf_for_traits(vcf_path, proband_idx)
    print(f"\nFound {len(found_variants)} trait-associated variants in VCF")
    # Quick summary
    risk_count = sum(1 for v in found_variants.values() if v['has_risk_allele'])
    print(f"Variants with risk allele: {risk_count}")
    generate_trait_report(found_variants, output_path)
    # Print summary to console
    print("\n" + "=" * 60)
    print("QUICK SUMMARY")
    print("=" * 60)
    traits = defaultdict(list)
    for rsid, var in found_variants.items():
        traits[var['trait']].append(var)
    for trait in sorted(traits.keys()):
        variants = traits[trait]
        risk_vars = [v for v in variants if v['has_risk_allele']]
        if risk_vars:
            print(f"\n{trait}:")
            for v in risk_vars:
                print(f"  {v['rsid']}: {v['genotype']} (risk allele: {v['risk_allele']}, copies: {v['risk_copies']})")
 if __name__ == '__main__':
    main()
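Both GWAS scripts default the sample index to 2, which assumes the proband is the third sample column in the VCF. A minimal sketch of resolving the sample by name from the `#CHROM` header follows. `resolve_sample_index` is a hypothetical helper, not something the scripts define, and accepting either a name or a zero-based index is an assumption about how the third command-line argument could be interpreted.

```python
from typing import List


def resolve_sample_index(samples: List[str], selector: str) -> int:
    """Return the sample column index for `selector`.

    `samples` is the list of names from the #CHROM header line (columns 10+).
    `selector` may be a sample name or a zero-based integer index as a string.
    Raises ValueError when the selector matches neither.
    """
    if selector in samples:
        return samples.index(selector)
    if selector.isdigit() and int(selector) < len(samples):
        return int(selector)
    raise ValueError(f"Sample {selector!r} not found; available: {samples}")


if __name__ == "__main__":
    header_samples = ["father", "mother", "proband"]
    print(resolve_sample_index(header_samples, "proband"))
    print(resolve_sample_index(header_samples, "0"))
```

Failing loudly on an unknown sample avoids the current behaviour, where an out-of-range index silently yields `./.` for every variant.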
--- a/pharmacogenomics.py
+++ b/pharmacogenomics.py
@@ -0,0 +1,376 @@
 #!/usr/bin/env python3
 """
 Pharmacogenomics Analysis Script
 Analyzes drug-gene interactions based on PharmGKB and CPIC guidelines.
 """
 import gzip
 import sys
 import re
 from collections import defaultdict
 from dataclasses import dataclass
 from typing import Dict, List, Optional
 # Key pharmacogenomic variants (curated from PharmGKB/CPIC)
 # Format: rsid -> (chrom, pos, gene, drug_class, effect, clinical_recommendation)
 PHARMGKB_VARIANTS = {
    # CYP2D6 - Codeine, Tramadol, Tamoxifen, many antidepressants
    "rs3892097": ("22", 42526694, "CYP2D6", "Codeine/Tramadol/Antidepressants", 
                  "*4 allele - Poor metabolizer", 
                  "Reduced efficacy of codeine (no conversion to morphine); Consider alternative analgesics"),
    "rs1065852": ("22", 42525772, "CYP2D6", "Codeine/Tramadol/Antidepressants",
                  "*10 allele - Reduced function",
                  "Intermediate metabolizer; May need dose adjustment"),
    "rs16947": ("22", 42523943, "CYP2D6", "Codeine/Tramadol",
               "*2 allele - Normal function", 
               "Normal metabolism"),
    # CYP2C19 - Clopidogrel, PPIs, some antidepressants
    "rs4244285": ("10", 96541616, "CYP2C19", "Clopidogrel/PPIs/Antidepressants",
                  "*2 allele - Loss of function",
                  "Poor metabolizer; Clopidogrel may have reduced efficacy; Consider prasugrel or ticagrelor"),
    "rs4986893": ("10", 96540410, "CYP2C19", "Clopidogrel/PPIs",
                  "*3 allele - Loss of function",
                  "Poor metabolizer; Reduced clopidogrel activation"),
    "rs12248560": ("10", 96522463, "CYP2C19", "Clopidogrel/PPIs",
                   "*17 allele - Increased function",
                   "Rapid/ultra-rapid metabolizer; May need higher PPI doses"),
    # CYP2C9 - Warfarin, NSAIDs, Phenytoin
    "rs1799853": ("10", 96702047, "CYP2C9", "Warfarin/NSAIDs/Phenytoin",
                  "*2 allele - Reduced function",
                  "Slower warfarin metabolism; Lower dose may be needed"),
    "rs1057910": ("10", 96741053, "CYP2C9", "Warfarin/NSAIDs/Phenytoin",
                  "*3 allele - Reduced function",
                  "Significantly slower warfarin metabolism; Require ~50% lower dose"),
    # VKORC1 - Warfarin sensitivity
    "rs9923231": ("16", 31107689, "VKORC1", "Warfarin",
                  "-1639G>A - Warfarin sensitivity",
                  "A allele: Increased sensitivity, need lower warfarin dose"),
    # CYP3A4/CYP3A5 - Many drugs (statins, immunosuppressants, etc.)
    "rs776746": ("7", 99270539, "CYP3A5", "Tacrolimus/Cyclosporine/Statins",
                 "*3 allele - Non-expressor",
                 "Most common; Normal tacrolimus dosing"),
    "rs2740574": ("7", 99382096, "CYP3A4", "Statins/Many drugs",
                  "*1B allele",
                  "May affect drug metabolism"),
    # SLCO1B1 - Statin-induced myopathy
    "rs4149056": ("12", 21331549, "SLCO1B1", "Simvastatin/Statins",
                  "*5 allele - Reduced function",
                  "C allele: Increased risk of statin myopathy; Consider lower dose or alternative statin"),
    # TPMT - Thiopurines (Azathioprine, 6-MP)
    "rs1800460": ("6", 18130918, "TPMT", "Azathioprine/6-Mercaptopurine",
                  "*3B allele - Reduced function",
                  "Intermediate/Poor metabolizer; High risk of myelosuppression; Reduce dose"),
    "rs1142345": ("6", 18130725, "TPMT", "Azathioprine/6-Mercaptopurine",
                  "*3C allele - Reduced function",
                  "Intermediate/Poor metabolizer; High risk of myelosuppression; Reduce dose"),
     # DPYD - Fluoropyrimidines (5-FU, Capecitabine)
     "rs3918290": ("1", 97915614, "DPYD", "5-Fluorouracil/Capecitabine",
                   "*2A allele - No function",
                   "CRITICAL: DPD deficiency (complete only if homozygous); Severe toxicity risk; Avoid or strongly reduce dose"),
    "rs55886062": ("1", 98205966, "DPYD", "5-Fluorouracil/Capecitabine",
                   "*13 allele - No function",
                   "CRITICAL: DPD deficiency; Contraindicated"),
    "rs67376798": ("1", 97981395, "DPYD", "5-Fluorouracil/Capecitabine",
                   "D949V - Reduced function",
                   "Intermediate metabolizer; Consider dose reduction"),
    # UGT1A1 - Irinotecan
    "rs8175347": ("2", 234668879, "UGT1A1", "Irinotecan",
                  "*28 allele (TA repeat)",
                  "7/7 genotype: Reduced glucuronidation; Increased toxicity risk; Consider dose reduction"),
    # HLA-B*57:01 - Abacavir hypersensitivity
    "rs2395029": ("6", 31431780, "HLA-B", "Abacavir (HIV)",
                  "HLA-B*57:01 tag SNP",
                  "CRITICAL: If positive, abacavir contraindicated - hypersensitivity reaction risk"),
    # HLA-B*15:02 - Carbamazepine/Phenytoin (SJS/TEN)
    "rs144012689": ("6", 31356867, "HLA-B", "Carbamazepine/Phenytoin",
                    "HLA-B*15:02 tag SNP",
                    "CRITICAL: If positive in Asian ancestry, carbamazepine contraindicated - SJS/TEN risk"),
    # HLA-A*31:01 - Carbamazepine
    "rs1061235": ("6", 29912280, "HLA-A", "Carbamazepine",
                  "HLA-A*31:01 tag SNP",
                  "If positive, increased carbamazepine hypersensitivity risk"),
    # F5 - Oral contraceptives, HRT (Factor V Leiden)
    "rs6025": ("1", 169519049, "F5", "Oral Contraceptives/HRT",
               "Factor V Leiden",
               "CRITICAL: Increased thrombosis risk; Oral contraceptives relatively contraindicated"),
    # F2 - Oral contraceptives (Prothrombin)
    "rs1799963": ("11", 46761055, "F2", "Oral Contraceptives/HRT",
                  "Prothrombin G20210A",
                  "Increased thrombosis risk; Caution with oral contraceptives"),
    # MTHFR - Methotrexate, Folate metabolism
    "rs1801133": ("1", 11856378, "MTHFR", "Methotrexate/Folate",
                  "C677T - Reduced function",
                  "T/T genotype: Reduced MTHFR activity; May need folate supplementation with methotrexate"),
    "rs1801131": ("1", 11854476, "MTHFR", "Methotrexate/Folate",
                  "A1298C",
                  "May affect folate metabolism"),
     # OPRM1 - Opioid response
     # Position on GRCh37, matching the other entries (154039662 is the GRCh38 coordinate)
     "rs1799971": ("6", 154360797, "OPRM1", "Opioids (Morphine, etc.)",
                  "A118G",
                  "G allele: May need higher opioid doses for pain relief"),
    # COMT - Pain medications, ADHD drugs
    "rs4680": ("22", 19951271, "COMT", "Pain medications/ADHD drugs",
               "Val158Met",
               "Met/Met: Lower COMT activity; May affect pain perception and stimulant response"),
    # IFNL3 (IL28B) - Hepatitis C treatment
    "rs12979860": ("19", 39738787, "IFNL3", "Hepatitis C treatment (Interferon)",
                   "IL28B genotype",
                   "C/C genotype: Better response to interferon-based HCV treatment"),
    # NAT2 - Isoniazid, Hydralazine
    "rs1801280": ("8", 18257854, "NAT2", "Isoniazid/Hydralazine/Sulfonamides",
                  "*5 allele - Slow acetylator",
                  "Slow acetylator; Increased isoniazid toxicity risk; Monitor for peripheral neuropathy"),
    "rs1799930": ("8", 18258103, "NAT2", "Isoniazid/Hydralazine",
                  "*6 allele - Slow acetylator",
                  "Slow acetylator; May need dose adjustment"),
    # G6PD - Primaquine, Dapsone, Sulfonamides
    "rs1050828": ("X", 153764217, "G6PD", "Primaquine/Dapsone/Sulfonamides",
                  "G6PD A- variant",
                  "CRITICAL: G6PD deficiency; Avoid oxidant drugs - hemolysis risk"),
    # CYP2B6 - Efavirenz
    "rs3745274": ("19", 41512841, "CYP2B6", "Efavirenz (HIV)",
                  "*6 allele - Reduced function",
                  "T/T genotype: Slow metabolizer; Consider lower efavirenz dose; CNS side effects more likely"),
 }
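 def get_genotype_class(gt: str) -> str:
     """Classify a VCF GT string as MISSING, HOM_REF, HET, or HOM_ALT."""
     alleles = re.split('[/|]', gt)
     # Ignore uncalled alleles ('.'); a genotype with no called allele is missing
     called = [a for a in alleles if a != '.']
     if not called:
         return 'MISSING'
     if all(a == '0' for a in called):
         return 'HOM_REF'
     # HOM_ALT requires every allele to be the same ALT; 1/2 and ./1 are HET
     if len(called) == len(alleles) and len(set(called)) == 1:
         return 'HOM_ALT'
     return 'HET'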
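Multi-allelic calls such as `1/2` and half-missing calls such as `./1` are easy to misclassify when a genotype is reduced to a single label. The following is a minimal sketch that gives each case its own label; the function name `classify_gt` and the extra `HET_ALT` and `PARTIAL` labels are assumptions for illustration and are not part of this commit.

```python
import re


def classify_gt(gt: str) -> str:
    """Classify a VCF GT string (hypothetical helper, not from the commit)."""
    alleles = re.split(r'[/|]', gt)
    called = [a for a in alleles if a != '.']
    if not called:
        return 'MISSING'      # './.', '.|.' or '.'
    if len(called) < len(alleles):
        return 'PARTIAL'      # half-call such as './1'
    if all(a == '0' for a in called):
        return 'HOM_REF'
    if len(set(called)) == 1:
        return 'HOM_ALT'      # same ALT on every haplotype, e.g. '1/1'
    if '0' in called:
        return 'HET'          # one REF and one ALT, e.g. '0|1'
    return 'HET_ALT'          # two different ALT alleles, e.g. '1/2'
```

Any caller that computes `has_variant` from these labels would need to decide explicitly how `PARTIAL` and `HET_ALT` are counted.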
 def analyze_pharmacogenomics(vcf_path: str, proband_idx: int = 2) -> tuple:
     """Analyze VCF for pharmacogenomic variants; returns (results, sample_names)"""
    print("Scanning for pharmacogenomic variants...")
    # Build position lookup
    pos_to_variant = {}
    for rsid, data in PHARMGKB_VARIANTS.items():
        chrom, pos, gene, drug, effect, recommendation = data
        key = f"{chrom}-{pos}"
        pos_to_variant[key] = {
            'rsid': rsid,
            'gene': gene,
            'drug': drug,
            'effect': effect,
            'recommendation': recommendation
        }
    results = {}
    samples = []
    open_func = gzip.open if vcf_path.endswith('.gz') else open
    mode = 'rt' if vcf_path.endswith('.gz') else 'r'
    with open_func(vcf_path, mode) as f:
        for line in f:
            if line.startswith('##'):
                continue
            elif line.startswith('#CHROM'):
                parts = line.strip().split('\t')
                samples = parts[9:]
                continue
            parts = line.strip().split('\t')
            if len(parts) < 10:
                continue
            chrom, pos, rsid_vcf, ref, alt, qual, filt, info, fmt = parts[:9]
            gt_fields = parts[9:]
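             # Normalize contig names such as "chr22" to "22" so they match the table keys
             key = f"{chrom[3:] if chrom.startswith('chr') else chrom}-{pos}"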
            if key not in pos_to_variant:
                continue
            variant_info = pos_to_variant[key]
             # Get proband genotype; a record without a GT key is treated as missing
             fmt_parts = fmt.split(':')
             gt = './.'
             if 'GT' in fmt_parts and proband_idx < len(gt_fields):
                 gt_idx = fmt_parts.index('GT')
                 gt_data = gt_fields[proband_idx].split(':')
                 if gt_idx < len(gt_data):
                     gt = gt_data[gt_idx]
            gt_class = get_genotype_class(gt)
            # Determine alleles
            alleles = [ref] + alt.split(',')
            gt_alleles_str = []
            if gt_class != 'MISSING':
                gt_indices = re.split('[/|]', gt)
                for idx in gt_indices:
                    if idx.isdigit() and int(idx) < len(alleles):
                        gt_alleles_str.append(alleles[int(idx)])
            results[variant_info['rsid']] = {
                **variant_info,
                'chrom': chrom,
                'pos': pos,
                'ref': ref,
                'alt': alt,
                'genotype': gt,
                'genotype_class': gt_class,
                'alleles': '/'.join(gt_alleles_str) if gt_alleles_str else 'N/A',
                'has_variant': gt_class in ['HET', 'HOM_ALT']
            }
    return results, samples
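A position-based lookup only matches when the VCF and the variant table agree on contig naming and on the reference build. The sketch below normalizes contig names before building the lookup key; it assumes the only naming differences are a `chr` prefix and `M` versus `MT`. `normalize_chrom` and `make_key` are hypothetical names, and the sketch does not attempt to convert coordinates between builds.

```python
def normalize_chrom(chrom: str) -> str:
    """Map 'chr22' -> '22' and 'chrM' -> 'MT' (assumed naming conventions)."""
    name = chrom[3:] if chrom.lower().startswith('chr') else chrom
    return 'MT' if name == 'M' else name


def make_key(chrom: str, pos) -> str:
    """Build the 'chrom-pos' key used for position lookups."""
    return f"{normalize_chrom(chrom)}-{int(pos)}"


# Both spellings of the same locus produce the same key.
print(make_key('chr22', '19951271'))
print(make_key('22', 19951271))
```

Coordinates still differ between GRCh37 and GRCh38, so the table and the VCF have to be on the same build for any key to match.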
 def generate_pgx_report(results: Dict, output_path: str):
    """Generate pharmacogenomics report"""
    # Categorize by drug class
    drug_classes = defaultdict(list)
    for rsid, data in results.items():
        drug_classes[data['drug']].append(data)
    # Identify actionable results
    critical = []
    actionable = []
    informational = []
    for rsid, data in results.items():
        if data['has_variant']:
            if 'CRITICAL' in data['recommendation']:
                critical.append(data)
            elif any(word in data['recommendation'].lower() for word in ['reduce', 'consider', 'lower', 'avoid', 'contraindicated']):
                actionable.append(data)
            else:
                informational.append(data)
     # The report contains non-ASCII symbols, so write it as UTF-8 on every platform
     with open(output_path, 'w', encoding='utf-8') as f:
        f.write("# Pharmacogenomics Analysis Report\n")
        f.write("# Based on PharmGKB and CPIC Guidelines\n\n")
        # Critical findings first
        if critical:
            f.write("=" * 80 + "\n")
            f.write("⚠️  CRITICAL FINDINGS - Immediate Clinical Relevance\n")
            f.write("=" * 80 + "\n\n")
            for data in critical:
                f.write(f"GENE: {data['gene']} ({data['rsid']})\n")
                f.write(f"  Drug(s): {data['drug']}\n")
                f.write(f"  Genotype: {data['alleles']} ({data['genotype_class']})\n")
                f.write(f"  Effect: {data['effect']}\n")
                f.write(f"  ⚠️  {data['recommendation']}\n\n")
        # Actionable findings
        if actionable:
            f.write("=" * 80 + "\n")
            f.write("📋 ACTIONABLE FINDINGS - May Require Dose Adjustment\n")
            f.write("=" * 80 + "\n\n")
            for data in actionable:
                f.write(f"GENE: {data['gene']} ({data['rsid']})\n")
                f.write(f"  Drug(s): {data['drug']}\n")
                f.write(f"  Genotype: {data['alleles']} ({data['genotype_class']})\n")
                f.write(f"  Effect: {data['effect']}\n")
                f.write(f"  Recommendation: {data['recommendation']}\n\n")
        # Summary by drug class
        f.write("=" * 80 + "\n")
        f.write("SUMMARY BY DRUG CLASS\n")
        f.write("=" * 80 + "\n\n")
        for drug_class in sorted(drug_classes.keys()):
            variants = drug_classes[drug_class]
            has_risk = any(v['has_variant'] for v in variants)
            status = "⚠️ VARIANT DETECTED" if has_risk else "✓ Normal"
            f.write(f"\n## {drug_class}\n")
            f.write(f"Status: {status}\n")
            for v in variants:
                marker = "→" if v['has_variant'] else " "
                f.write(f"  {marker} {v['gene']} ({v['rsid']}): {v['alleles']} - {v['genotype_class']}\n")
        # Detailed table
        f.write("\n" + "=" * 80 + "\n")
        f.write("DETAILED RESULTS\n")
        f.write("=" * 80 + "\n\n")
        f.write("RSID\tGENE\tGENOTYPE\tALLELES\tHAS_VARIANT\tDRUG\tEFFECT\n")
        for rsid in sorted(results.keys()):
            data = results[rsid]
            f.write(f"{rsid}\t{data['gene']}\t{data['genotype']}\t{data['alleles']}\t")
            f.write(f"{data['has_variant']}\t{data['drug']}\t{data['effect']}\n")
    print(f"Report saved to: {output_path}")
    return critical, actionable
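The report tiers each finding by scanning the recommendation text for keywords. A minimal sketch of that rule as a standalone function is shown here so it can be tested in isolation; `tier_finding` and `ACTION_WORDS` are hypothetical names, while the keyword list is the one used in `generate_pgx_report`.

```python
ACTION_WORDS = ('reduce', 'consider', 'lower', 'avoid', 'contraindicated')


def tier_finding(has_variant: bool, recommendation: str) -> str:
    """Return 'critical', 'actionable', 'informational' or 'none'."""
    if not has_variant:
        return 'none'
    if 'CRITICAL' in recommendation:
        return 'critical'
    text = recommendation.lower()
    # Substring match: 'lower' also matches inside 'slower'
    if any(word in text for word in ACTION_WORDS):
        return 'actionable'
    return 'informational'


print(tier_finding(True, "Intermediate metabolizer; Consider dose reduction"))
print(tier_finding(True, "Intermediate metabolizer; May need dose adjustment"))
```

The second call lands in the informational tier because none of the keywords appear, which suggests that keyword matching may be too coarse for wording such as "dose adjustment"; an explicit tier field per table entry would be one alternative.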
 def main():
    vcf_path = sys.argv[1] if len(sys.argv) > 1 else '/Volumes/NV2/genomics_analysis/vcf/trio_joint.snpeff.vcf'
    output_path = sys.argv[2] if len(sys.argv) > 2 else '/Volumes/NV2/genomics_analysis/pharmacogenomics_report.txt'
    proband_idx = int(sys.argv[3]) if len(sys.argv) > 3 else 2
    print("=" * 60)
    print("PHARMACOGENOMICS ANALYSIS")
    print("=" * 60)
    print(f"VCF: {vcf_path}")
    print(f"Searching for {len(PHARMGKB_VARIANTS)} pharmacogenomic variants...\n")
    results, samples = analyze_pharmacogenomics(vcf_path, proband_idx)
    print(f"Found {len(results)} pharmacogenomic variants in VCF")
    critical, actionable = generate_pgx_report(results, output_path)
    # Console summary
    print("\n" + "=" * 60)
    print("QUICK SUMMARY")
    print("=" * 60)
    variants_with_effect = [r for r in results.values() if r['has_variant']]
    print(f"\nVariants detected: {len(variants_with_effect)}/{len(results)}")
    if critical:
        print("\n⚠️  CRITICAL FINDINGS:")
        for c in critical:
            print(f"  - {c['gene']}: {c['drug']}")
            print(f"    {c['recommendation']}")
    if actionable:
        print("\n📋 ACTIONABLE FINDINGS:")
        for a in actionable:
            print(f"  - {a['gene']} ({a['rsid']}): {a['drug']}")
            print(f"    Genotype: {a['alleles']}")
            print(f"    {a['recommendation']}")
    if not critical and not actionable:
        print("\n✓ No critical or actionable pharmacogenomic variants detected")
 if __name__ == '__main__':
    main()
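`main()` reads its three positional arguments from `sys.argv` and falls back to machine-specific default paths. A possible alternative, sketched here with `argparse`, makes the inputs explicit and provides `--help` output. The option name `--proband-index` is an assumption; the default of 2 is taken from the script.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments (hypothetical replacement for sys.argv access)."""
    parser = argparse.ArgumentParser(
        description="Scan a trio VCF for pharmacogenomic variants")
    parser.add_argument("vcf", help="input VCF (.vcf or .vcf.gz)")
    parser.add_argument("output", help="path of the text report to write")
    parser.add_argument("--proband-index", type=int, default=2,
                        help="0-based column index of the proband sample")
    return parser.parse_args(argv)


args = parse_args(["trio.vcf", "report.txt", "--proband-index", "0"])
print(args.vcf, args.output, args.proband_index)
```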
--- a/pharmgkb_full_analysis.py
+++ b/pharmgkb_full_analysis.py
@@ -0,0 +1,349 @@
 #!/usr/bin/env python3
 """
 Comprehensive PharmGKB Analysis Script
 Uses full PharmGKB clinical annotations database for pharmacogenomics analysis.
 """
 import gzip
 import sys
 import os
 import re
 from collections import defaultdict
 from typing import Dict, List, Set, Tuple
 # PharmGKB database paths
 PHARMGKB_DIR = "/Volumes/NV2/genomics_reference/pharmgkb"
 ANNOTATIONS_FILE = f"{PHARMGKB_DIR}/clinical_annotations.tsv"
 ALLELES_FILE = f"{PHARMGKB_DIR}/clinical_ann_alleles.tsv"
 def load_pharmgkb_annotations() -> Tuple[Dict, Dict]:
    """Load PharmGKB clinical annotations and allele information"""
    # Load main annotations
    annotations = {}
    print(f"Loading PharmGKB annotations from {ANNOTATIONS_FILE}...")
    with open(ANNOTATIONS_FILE, 'r') as f:
         f.readline()  # skip the header row
         for line in f:
             # Strip only the line ending so trailing empty columns are kept
             parts = line.rstrip('\r\n').split('\t')
            if len(parts) < 11:
                continue
            ann_id = parts[0]
            variant = parts[1]  # rsid or haplotype
            gene = parts[2]
            evidence_level = parts[3]
            phenotype_category = parts[7] if len(parts) > 7 else ""
            drugs = parts[10] if len(parts) > 10 else ""
            phenotypes = parts[11] if len(parts) > 11 else ""
            # Only process rs variants (SNPs)
            if variant.startswith('rs'):
                rsid = variant
                if rsid not in annotations:
                    annotations[rsid] = []
                annotations[rsid].append({
                    'ann_id': ann_id,
                    'gene': gene,
                    'evidence_level': evidence_level,
                    'phenotype_category': phenotype_category,
                    'drugs': drugs,
                    'phenotypes': phenotypes
                })
    # Load allele-specific information
    allele_info = {}
    print(f"Loading allele information from {ALLELES_FILE}...")
    with open(ALLELES_FILE, 'r') as f:
         f.readline()  # skip the header row
         for line in f:
             # Strip only the line ending so trailing empty columns are kept
             parts = line.rstrip('\r\n').split('\t')
            if len(parts) < 3:
                continue
            ann_id = parts[0]
            genotype = parts[1]
            annotation_text = parts[2] if len(parts) > 2 else ""
            allele_function = parts[3] if len(parts) > 3 else ""
            if ann_id not in allele_info:
                allele_info[ann_id] = {}
            allele_info[ann_id][genotype] = {
                'text': annotation_text,
                'function': allele_function
            }
    print(f"Loaded {len(annotations)} unique variants with annotations")
    return annotations, allele_info
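The loader addresses columns by fixed index, which breaks silently if the TSV layout changes. A minimal sketch of reading by header name with `csv.DictReader` follows. The column names are assumptions about the file header and should be checked against the actual download; `load_annotations` is a hypothetical name and the rows in the toy input are made up.

```python
import csv
import io
from collections import defaultdict

# Assumed header names; verify against the real clinical_annotations.tsv.
COL_VARIANT = "Variant/Haplotypes"
COL_GENE = "Gene"
COL_LEVEL = "Level of Evidence"
COL_DRUGS = "Drug(s)"


def load_annotations(handle):
    """Group annotation rows by rsID, skipping haplotype (non-rs) rows."""
    by_rsid = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        variant = (row.get(COL_VARIANT) or "").strip()
        if not variant.startswith("rs"):
            continue
        by_rsid[variant].append({
            "gene": row.get(COL_GENE) or "",
            "evidence_level": row.get(COL_LEVEL) or "",
            "drugs": row.get(COL_DRUGS) or "",
        })
    return dict(by_rsid)


# Toy input with invented identifiers, for illustration only.
toy = io.StringIO(
    "Clinical Annotation ID\tVariant/Haplotypes\tGene\tLevel of Evidence\tDrug(s)\n"
    "X1\trsEXAMPLE1\tGENE1\t1A\tdrugA\n"
    "X2\tGENE1*2\tGENE1\t3\tdrugB\n"
)
annotations_by_rsid = load_annotations(toy)
print(sorted(annotations_by_rsid))
```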
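 def get_genotype_class(gt: str) -> str:
     """Classify a VCF GT string as MISSING, HOM_REF, HET, or HOM_ALT."""
     alleles = re.split('[/|]', gt)
     # Ignore uncalled alleles ('.'); a genotype with no called allele is missing
     called = [a for a in alleles if a != '.']
     if not called:
         return 'MISSING'
     if all(a == '0' for a in called):
         return 'HOM_REF'
     # HOM_ALT requires every allele to be the same ALT; 1/2 and ./1 are HET
     if len(called) == len(alleles) and len(set(called)) == 1:
         return 'HOM_ALT'
     return 'HET'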
 def get_genotype_string(gt: str, ref: str, alt: str) -> str:
    """Convert numeric genotype to allele string"""
    if gt in ['./.', '.|.', '.']:
        return 'N/A'
    alleles = [ref] + alt.split(',')
    gt_alleles = re.split('[/|]', gt)
    result = []
    for a in gt_alleles:
        if a.isdigit():
            idx = int(a)
            if idx < len(alleles):
                result.append(alleles[idx])
            else:
                result.append('?')
        else:
            result.append('?')
    return '/'.join(result)
 def parse_vcf_for_pharmgkb(vcf_path: str, sample_idx: int, annotations: Dict) -> Tuple[Dict, List[str]]:
     """Parse VCF and look for PharmGKB variants; returns (found_variants, sample_names)"""
    print(f"Scanning VCF for {len(annotations)} PharmGKB variants...")
    found_variants = {}
    samples = []
    # Build rsid lookup from VCF
    open_func = gzip.open if vcf_path.endswith('.gz') else open
    mode = 'rt' if vcf_path.endswith('.gz') else 'r'
    with open_func(vcf_path, mode) as f:
        for line in f:
            if line.startswith('##'):
                continue
            elif line.startswith('#CHROM'):
                parts = line.strip().split('\t')
                samples = parts[9:]
                print(f"Found {len(samples)} samples, analyzing index {sample_idx}: {samples[sample_idx] if sample_idx < len(samples) else 'N/A'}")
                continue
            parts = line.strip().split('\t')
            if len(parts) < 10:
                continue
            chrom, pos, rsid_vcf, ref, alt, qual, filt, info, fmt = parts[:9]
            gt_fields = parts[9:]
            # Check if this rsid has PharmGKB annotation
            if rsid_vcf not in annotations:
                continue
             # Get sample genotype; a record without a GT key is treated as missing
             fmt_parts = fmt.split(':')
             gt = './.'
             if 'GT' in fmt_parts and sample_idx < len(gt_fields):
                 gt_idx = fmt_parts.index('GT')
                 gt_data = gt_fields[sample_idx].split(':')
                 if gt_idx < len(gt_data):
                     gt = gt_data[gt_idx]
            gt_class = get_genotype_class(gt)
            gt_string = get_genotype_string(gt, ref, alt)
            found_variants[rsid_vcf] = {
                'rsid': rsid_vcf,
                'chrom': chrom,
                'pos': pos,
                'ref': ref,
                'alt': alt,
                'genotype': gt,
                'genotype_class': gt_class,
                'genotype_string': gt_string,
                'annotations': annotations[rsid_vcf]
            }
    return found_variants, samples
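Extracting the genotype from a sample column depends on the position of `GT` in the FORMAT field. A minimal sketch of that step as a standalone helper is given below; `extract_gt` is a hypothetical name. It returns `./.` whenever `GT` is absent or the sample column is truncated, rather than reading whatever happens to be in the first subfield.

```python
def extract_gt(fmt: str, sample_field: str) -> str:
    """Return the GT value of one sample column, or './.' if unavailable."""
    keys = fmt.split(':')
    if 'GT' not in keys:
        return './.'
    values = sample_field.split(':')
    index = keys.index('GT')
    # Trailing subfields may be dropped in a VCF, so guard the index
    return values[index] if index < len(values) else './.'


print(extract_gt('GT:AD:DP', '0/1:10,12:22'))
print(extract_gt('AD:DP', '10,12:22'))
```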
 def generate_comprehensive_report(found_variants: Dict, allele_info: Dict,
                                   output_path: str, sample_name: str):
    """Generate comprehensive pharmacogenomics report"""
    # Categorize by evidence level and drug class
    by_evidence = defaultdict(list)
    by_category = defaultdict(list)
    for rsid, var in found_variants.items():
        for ann in var['annotations']:
            level = ann['evidence_level']
            category = ann['phenotype_category']
            by_evidence[level].append((rsid, var, ann))
            if category:
                by_category[category].append((rsid, var, ann))
     with open(output_path, 'w', encoding='utf-8') as f:
        f.write("=" * 80 + "\n")
        f.write("COMPREHENSIVE PHARMACOGENOMICS REPORT\n")
        f.write("Based on PharmGKB Clinical Annotations Database\n")
        f.write("=" * 80 + "\n\n")
        f.write(f"Sample: {sample_name}\n")
        f.write(f"Total variants with PharmGKB annotations: {len(found_variants)}\n\n")
        # Summary statistics
        f.write("=" * 80 + "\n")
        f.write("SUMMARY BY EVIDENCE LEVEL\n")
        f.write("=" * 80 + "\n\n")
         f.write("Level 1A: Variant-specific prescribing guidance in a clinical guideline or FDA label\n")
         f.write("Level 1B: High level of evidence, no variant-specific prescribing guidance\n")
         f.write("Level 2A: Moderate evidence, variant in a Tier 1 Very Important Pharmacogene\n")
         f.write("Level 2B: Moderate evidence\n")
         f.write("Level 3: Low evidence\n")
         f.write("Level 4: Evidence does not support an association\n\n")
        for level in ['1A', '1B', '2A', '2B', '3', '4']:
            count = len(by_evidence.get(level, []))
            f.write(f"  Level {level}: {count} annotations\n")
        # High evidence findings (1A, 1B)
        f.write("\n" + "=" * 80 + "\n")
         f.write("HIGH EVIDENCE FINDINGS (Level 1A/1B)\n")
        f.write("=" * 80 + "\n\n")
        high_evidence = by_evidence.get('1A', []) + by_evidence.get('1B', [])
        if high_evidence:
            for rsid, var, ann in sorted(high_evidence, key=lambda x: x[2]['gene']):
                gt_string = var['genotype_string']
                f.write(f"GENE: {ann['gene']} ({rsid})\n")
                f.write(f"  Genotype: {gt_string} ({var['genotype_class']})\n")
                f.write(f"  Drug(s): {ann['drugs']}\n")
                f.write(f"  Category: {ann['phenotype_category']}\n")
                f.write(f"  Evidence Level: {ann['evidence_level']}\n")
                # Get allele-specific annotation
                ann_id = ann['ann_id']
                if ann_id in allele_info:
                    # Try to match genotype
                    for geno, info in allele_info[ann_id].items():
                        if gt_string.replace('/', '') == geno.replace('/', '') or \
                           gt_string == geno or \
                           set(gt_string.split('/')) == set(geno):
                            if info['text']:
                                f.write(f"  Clinical Annotation: {info['text'][:500]}...\n" if len(info['text']) > 500 else f"  Clinical Annotation: {info['text']}\n")
                            if info['function']:
                                f.write(f"  Allele Function: {info['function']}\n")
                            break
                f.write("\n")
        else:
            f.write("  No high-evidence findings.\n\n")
        # Moderate evidence findings (2A, 2B)
        f.write("=" * 80 + "\n")
        f.write("MODERATE EVIDENCE FINDINGS (Level 2A/2B)\n")
        f.write("=" * 80 + "\n\n")
        moderate_evidence = by_evidence.get('2A', []) + by_evidence.get('2B', [])
        if moderate_evidence:
            for rsid, var, ann in sorted(moderate_evidence, key=lambda x: x[2]['gene'])[:50]:  # Limit to top 50
                gt_string = var['genotype_string']
                f.write(f"GENE: {ann['gene']} ({rsid})\n")
                f.write(f"  Genotype: {gt_string}\n")
                f.write(f"  Drug(s): {ann['drugs']}\n")
                f.write(f"  Category: {ann['phenotype_category']}\n")
                f.write(f"  Level: {ann['evidence_level']}\n\n")
            if len(moderate_evidence) > 50:
                f.write(f"  ... and {len(moderate_evidence) - 50} more moderate evidence findings\n\n")
        else:
            f.write("  No moderate-evidence findings.\n\n")
        # Summary by phenotype category
        f.write("=" * 80 + "\n")
        f.write("SUMMARY BY PHENOTYPE CATEGORY\n")
        f.write("=" * 80 + "\n\n")
        for category in sorted(by_category.keys()):
            items = by_category[category]
            f.write(f"\n## {category}: {len(items)} annotations\n")
            f.write("-" * 40 + "\n")
            # Show high-evidence items for each category
            high_in_cat = [x for x in items if x[2]['evidence_level'] in ['1A', '1B', '2A']]
            for rsid, var, ann in high_in_cat[:5]:
                f.write(f"  {ann['gene']} ({rsid}): {ann['drugs'][:50]}...\n" if len(ann['drugs']) > 50 else f"  {ann['gene']} ({rsid}): {ann['drugs']}\n")
        # Full detailed list
        f.write("\n" + "=" * 80 + "\n")
        f.write("COMPLETE VARIANT LIST\n")
        f.write("=" * 80 + "\n\n")
        f.write("RSID\tGENE\tGENOTYPE\tLEVEL\tCATEGORY\tDRUGS\n")
        for rsid, var in sorted(found_variants.items()):
            for ann in var['annotations']:
                drugs_short = ann['drugs'][:30] + "..." if len(ann['drugs']) > 30 else ann['drugs']
                f.write(f"{rsid}\t{ann['gene']}\t{var['genotype_string']}\t{ann['evidence_level']}\t{ann['phenotype_category']}\t{drugs_short}\n")
    print(f"Report saved to: {output_path}")
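Matching a called genotype such as `A/G` against an annotation key such as `AG` or `GA` is simpler when both sides are reduced to the same canonical form first. The sketch below sorts the alleles into a tuple. `genotype_key` is a hypothetical name, and the sketch assumes single-base alleles whenever the string has no separator, so indel or star-allele keys would need separate handling; strand differences are not addressed.

```python
import re


def genotype_key(genotype: str) -> tuple:
    """Canonical, order-independent form: 'A/G', 'G|A' and 'GA' -> ('A', 'G')."""
    if '/' in genotype or '|' in genotype:
        alleles = re.split(r'[/|]', genotype)
    else:
        # No separator: assume one character per allele (SNVs only)
        alleles = list(genotype)
    return tuple(sorted(alleles))


print(genotype_key('A/G') == genotype_key('GA'))
print(genotype_key('A/A') == genotype_key('AG'))
```

Unlike a comparison of allele sets, the sorted tuple keeps zygosity, so `A/A` never matches a key with more than two alleles listed.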
 def main():
    vcf_path = sys.argv[1] if len(sys.argv) > 1 else '/Volumes/NV2/genomics_analysis/vcf/trio_joint.snpeff.vcf'
    output_path = sys.argv[2] if len(sys.argv) > 2 else '/Volumes/NV2/genomics_analysis/pharmgkb_full_report.txt'
    sample_idx = int(sys.argv[3]) if len(sys.argv) > 3 else 2
    print("=" * 60)
    print("COMPREHENSIVE PHARMGKB ANALYSIS")
    print("=" * 60)
    print(f"VCF: {vcf_path}")
    print(f"Sample index: {sample_idx}")
    print()
    # Load PharmGKB database
    annotations, allele_info = load_pharmgkb_annotations()
    # Parse VCF
    found_variants, samples = parse_vcf_for_pharmgkb(vcf_path, sample_idx, annotations)
    sample_name = samples[sample_idx] if sample_idx < len(samples) else f"Sample_{sample_idx}"
    print(f"\nFound {len(found_variants)} variants with PharmGKB annotations")
    # Count by evidence level
    level_counts = defaultdict(int)
    for rsid, var in found_variants.items():
        for ann in var['annotations']:
            level_counts[ann['evidence_level']] += 1
    print("\nAnnotations by evidence level:")
    for level in ['1A', '1B', '2A', '2B', '3', '4']:
        print(f"  Level {level}: {level_counts.get(level, 0)}")
    # Generate report
    generate_comprehensive_report(found_variants, allele_info, output_path, sample_name)
    # Print high-evidence findings to console
    print("\n" + "=" * 60)
    print("HIGH EVIDENCE FINDINGS (Level 1A/1B)")
    print("=" * 60)
    for rsid, var in found_variants.items():
        for ann in var['annotations']:
            if ann['evidence_level'] in ['1A', '1B']:
                print(f"\n{ann['gene']} ({rsid})")
                print(f"  Genotype: {var['genotype_string']}")
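                 # Append an ellipsis only when the drug list is actually truncated
                 drugs = ann['drugs']
                 print(f"  Drug(s): {drugs[:80] + '...' if len(drugs) > 80 else drugs}")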
                print(f"  Level: {ann['evidence_level']}")
 if __name__ == '__main__':
    main()
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,31 +0,0 @@
 [build-system]
 requires = ["setuptools>=61", "wheel"]
 build-backend = "setuptools.build_meta"
 [project]
 name = "genomic-consultant"
 version = "0.1.0"
 description = "Personal genomic risk and drug–interaction decision support scaffolding"
 readme = "README.md"
 requires-python = ">=3.11"
 dependencies = [
  "pyyaml>=6",
 ]
 [project.scripts]
 genomic-consultant = "genomic_consultant.cli:main"
 [project.optional-dependencies]
 dev = [
  "pytest",
  "ruff",
 ]
 store = [
  "pandas>=2",
 ]
 [tool.setuptools]
 package-dir = {"" = "src"}
 [tool.setuptools.packages.find]
 where = ["src"]
--- a/sample_data/example_annotated.tsv
+++ b/sample_data/example_annotated.tsv
@@ -1,3 +0,0 @@
 #CHROM	POS	REF	ALT	SYMBOL	Consequence	Protein_position	PolyPhen	SIFT	CLIN_SIG	AF	gnomAD_AF	SpliceAI	CADD_PHRED
 1	123456	A	T	GJB2	missense_variant	p.Val37Ile	benign	tolerated	Benign	0.012	0.012	0.02	5.1
 2	234567	G	C	OTOF	stop_gained	p.*	probably_damaging	deleterious	Uncertain_significance	0.0001	0.0001	0.6	28.5
--- a/src/genomic_consultant.egg-info/PKG-INFO
+++ b/src/genomic_consultant.egg-info/PKG-INFO
@@ -1,63 +0,0 @@
 Metadata-Version: 2.4
 Name: genomic-consultant
 Version: 0.1.0
 Summary: Personal genomic risk and drug–interaction decision support scaffolding
 Requires-Python: >=3.11
 Description-Content-Type: text/markdown
 Requires-Dist: pyyaml>=6
 Provides-Extra: dev
 Requires-Dist: pytest; extra == "dev"
 Requires-Dist: ruff; extra == "dev"
 # Genomic Consultant
 Early design for a personal genomic risk and drug–interaction decision support system. Specs are sourced from `genomic_decision_support_system_spec_v0.1.md`.
 ## Vision (per spec)
 - Phase 1: trio variant calling, annotation, queryable genomic DB, initial ACMG evidence tagging.
 - Phase 2: pharmacogenomics genotype-to-phenotype mapping plus drug–drug interaction checks.
 - Phase 3: supplement/herb normalization and interaction risk layering.
 - Phase 4: LLM-driven query orchestration and report generation.
 ## Repository Layout
 - `docs/` — system architecture notes, phase plans, data models (work in progress).
 - `configs/` — example ACMG config and gene panel JSON.
 - `sample_data/` — tiny annotated TSV for demo.
 - `src/genomic_consultant/` — Python scaffolding (pipelines, store, panel lookup, ACMG tagging, reporting).
 - `genomic_decision_support_system_spec_v0.1.md` — original requirements draft.
 ## Contributing/next steps
 1. Finalize Phase 1 tech selection (variant caller, annotation stack, reference/DB versions).
 2. Stand up the Phase 1 pipelines and minimal query API surface.
 3. Add ACMG evidence tagging config and human-review logging.
 4. Layer in PGx/DDI and supplement modules per later phases.
 Data safety: keep genomic/clinical data local; the `.gitignore` blocks common genomic outputs by default.
 ## Quickstart (CLI scaffolding)
 ```
 pip install -e .
 # 1) Show trio calling plan (commands only; not executed)
 genomic-consultant plan-call \
  --sample proband:/data/proband.bam \
  --sample father:/data/father.bam \
  --sample mother:/data/mother.bam \
  --reference /refs/GRCh38.fa \
  --workdir /tmp/trio
 # 2) Show annotation plan for a joint VCF
 genomic-consultant plan-annotate \
  --vcf /tmp/trio/trio.joint.vcf.gz \
  --workdir /tmp/trio/annot \
  --prefix trio \
  --reference /refs/GRCh38.fa
 # 3) Demo panel report using sample data
 genomic-consultant panel-report \
  --tsv sample_data/example_annotated.tsv \
  --panel configs/panel.example.json \
  --acmg-config configs/acmg_config.example.yaml \
  --individual-id demo \
  --format markdown
 ```
--- a/src/genomic_consultant.egg-info/SOURCES.txt
+++ b/src/genomic_consultant.egg-info/SOURCES.txt
@@ -1,27 +0,0 @@
 README.md
 pyproject.toml
 src/genomic_consultant/__init__.py
 src/genomic_consultant/cli.py
 src/genomic_consultant.egg-info/PKG-INFO
 src/genomic_consultant.egg-info/SOURCES.txt
 src/genomic_consultant.egg-info/dependency_links.txt
 src/genomic_consultant.egg-info/entry_points.txt
 src/genomic_consultant.egg-info/requires.txt
 src/genomic_consultant.egg-info/top_level.txt
 src/genomic_consultant/acmg/__init__.py
 src/genomic_consultant/acmg/tagger.py
 src/genomic_consultant/audit/__init__.py
 src/genomic_consultant/audit/run_log.py
 src/genomic_consultant/orchestration/__init__.py
 src/genomic_consultant/orchestration/workflows.py
 src/genomic_consultant/panels/__init__.py
 src/genomic_consultant/panels/panels.py
 src/genomic_consultant/pipelines/__init__.py
 src/genomic_consultant/pipelines/annotation.py
 src/genomic_consultant/pipelines/variant_calling.py
 src/genomic_consultant/reporting/__init__.py
 src/genomic_consultant/reporting/report.py
 src/genomic_consultant/store/__init__.py
 src/genomic_consultant/store/query.py
 src/genomic_consultant/utils/__init__.py
 src/genomic_consultant/utils/models.py
--- a/src/genomic_consultant.egg-info/dependency_links.txt
+++ b/src/genomic_consultant.egg-info/dependency_links.txt
@@ -1 +0,0 @@
--- a/src/genomic_consultant.egg-info/entry_points.txt
+++ b/src/genomic_consultant.egg-info/entry_points.txt
@@ -1,2 +0,0 @@
 [console_scripts]
 genomic-consultant = genomic_consultant.cli:main
--- a/src/genomic_consultant.egg-info/requires.txt
+++ b/src/genomic_consultant.egg-info/requires.txt
@@ -1,5 +0,0 @@
 pyyaml>=6
 [dev]
 pytest
 ruff
--- a/src/genomic_consultant.egg-info/top_level.txt
+++ b/src/genomic_consultant.egg-info/top_level.txt
@@ -1 +0,0 @@
 genomic_consultant
--- a/src/genomic_consultant/__init__.py
+++ b/src/genomic_consultant/__init__.py
@@ -1,4 +0,0 @@
 """Genomic Consultant: genomic decision support scaffolding."""
 __all__ = ["__version__"]
 __version__ = "0.1.0"
--- a/src/genomic_consultant/acmg/__init__.py
+++ b/src/genomic_consultant/acmg/__init__.py
@@ -1 +0,0 @@
--- a/src/genomic_consultant/acmg/tagger.py
+++ b/src/genomic_consultant/acmg/tagger.py
@@ -1,91 +0,0 @@
 from __future__ import annotations
 from dataclasses import dataclass
 from pathlib import Path
 from typing import List, Set
 import yaml
 from genomic_consultant.utils.models import EvidenceTag, SuggestedClassification, Variant
 @dataclass
 class ACMGConfig:
    ba1_af: float = 0.05
    bs1_af: float = 0.01
    pm2_af: float = 0.0005
    lof_genes: Set[str] | None = None
    bp7_splice_ai_max: float = 0.1
 def load_acmg_config(path: Path) -> ACMGConfig:
    data = yaml.safe_load(Path(path).read_text())
    return ACMGConfig(
        ba1_af=data.get("ba1_af", 0.05),
        bs1_af=data.get("bs1_af", 0.01),
        pm2_af=data.get("pm2_af", 0.0005),
        lof_genes=set(data.get("lof_genes", [])) if data.get("lof_genes") else set(),
        bp7_splice_ai_max=data.get("bp7_splice_ai_max", 0.1),
    )
 def tag_variant(variant: Variant, config: ACMGConfig) -> SuggestedClassification:
    evidence: List[EvidenceTag] = []
    af = variant.allele_frequency
    if af is not None:
        if af >= config.ba1_af:
            evidence.append(EvidenceTag(tag="BA1", strength="Stand-alone", rationale=f"AF {af} >= {config.ba1_af}"))
        elif af >= config.bs1_af:
            evidence.append(EvidenceTag(tag="BS1", strength="Strong", rationale=f"AF {af} >= {config.bs1_af}"))
        elif af <= config.pm2_af:
            evidence.append(EvidenceTag(tag="PM2", strength="Moderate", rationale=f"AF {af} <= {config.pm2_af}"))
    if _is_lof(variant) and variant.gene and variant.gene in config.lof_genes:
        evidence.append(
            EvidenceTag(tag="PVS1", strength="Very strong", rationale="Predicted LoF in LoF-sensitive gene")
        )
    splice_ai = _get_float(variant.annotations.get("splice_ai_delta_score"))
    if _is_synonymous(variant) and (splice_ai is None or splice_ai <= config.bp7_splice_ai_max):
        evidence.append(
            EvidenceTag(
                tag="BP7",
                strength="Supporting",
                rationale=f"Synonymous with low predicted splice impact (spliceAI {splice_ai})",
            )
        )
    suggested = _suggest_class(evidence)
    return SuggestedClassification(suggested_class=suggested, evidence=evidence)
 def _is_lof(variant: Variant) -> bool:
    consequence = (variant.consequence or "").lower()
    lof_keywords = ["frameshift", "stop_gained", "splice_acceptor", "splice_donor", "start_lost"]
    return any(k in consequence for k in lof_keywords)
 def _suggest_class(evidence: List[EvidenceTag]) -> str:
    tags = {e.tag for e in evidence}
    if "BA1" in tags:
        return "Benign"
    if "BS1" in tags and "PM2" not in tags and "PVS1" not in tags:
        return "Likely benign"
    if "PVS1" in tags and "PM2" in tags:
        return "Likely pathogenic"
    return "VUS"
 def _is_synonymous(variant: Variant) -> bool:
    consequence = (variant.consequence or "").lower()
    return "synonymous_variant" in consequence
 def _get_float(value: str | float | None) -> float | None:
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None
--- a/src/genomic_consultant/audit/__init__.py
+++ b/src/genomic_consultant/audit/__init__.py
@@ -1 +0,0 @@
--- a/src/genomic_consultant/audit/run_log.py
+++ b/src/genomic_consultant/audit/run_log.py
@@ -1,30 +0,0 @@
 from __future__ import annotations
 import json
 from dataclasses import asdict, is_dataclass
 from datetime import datetime
 from pathlib import Path
 from typing import Any
 from genomic_consultant.utils.models import RunLog
 def _serialize(obj: Any) -> Any:
    if isinstance(obj, datetime):
        return obj.isoformat()
    if is_dataclass(obj):
        return {k: _serialize(v) for k, v in asdict(obj).items()}
    if isinstance(obj, dict):
        return {k: _serialize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [_serialize(v) for v in obj]
    return obj
 def write_run_log(run_log: RunLog, path: str | Path) -> Path:
    """Persist a RunLog to JSON."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = _serialize(run_log)
    path.write_text(json.dumps(payload, indent=2))
    return path
--- a/src/genomic_consultant/cli.py
+++ b/src/genomic_consultant/cli.py
@@ -1,273 +0,0 @@
 from __future__ import annotations
 import argparse
 import sys
 from datetime import datetime
 from pathlib import Path
 from typing import Dict, List
 from genomic_consultant.acmg.tagger import ACMGConfig, load_acmg_config
 from genomic_consultant.orchestration.workflows import run_panel_variant_review
 from genomic_consultant.panels.aggregate import merge_mappings
 from genomic_consultant.panels.panels import load_panel
 from genomic_consultant.panels.resolver import PhenotypeGeneResolver
 from genomic_consultant.pipelines.annotation import build_vep_plan
 from genomic_consultant.pipelines.variant_calling import build_gatk_trio_plan
 from genomic_consultant.pipelines.runner import execute_plan
 from genomic_consultant.reporting.report import panel_report_json, panel_report_markdown
 from genomic_consultant.orchestration.phase1_pipeline import run_phase1_pipeline
 from genomic_consultant.store.query import GenomicStore
 from genomic_consultant.utils.hashing import sha256sum
 from genomic_consultant.utils.models import FilterConfig
 from genomic_consultant.utils.tooling import probe_tool_versions
 def parse_samples(sample_args: List[str]) -> Dict[str, Path]:
    samples: Dict[str, Path] = {}
    for arg in sample_args:
        if ":" not in arg:
            raise ValueError(f"Sample must be sample_id:/path/to.bam, got {arg}")
        sample_id, bam_path = arg.split(":", 1)
        samples[sample_id] = Path(bam_path)
    return samples
 def main(argv: List[str] | None = None) -> int:
    parser = argparse.ArgumentParser(prog="genomic-consultant", description="Genomic decision support scaffolding")
    sub = parser.add_subparsers(dest="command", required=True)
    # Variant calling plan
    call = sub.add_parser("plan-call", help="Build variant calling command plan (GATK trio).")
    call.add_argument("--sample", action="append", required=True, help="sample_id:/path/to.bam (repeatable)")
    call.add_argument("--reference", required=True, help="Path to reference FASTA")
    call.add_argument("--workdir", required=True, help="Working directory for outputs")
    call.add_argument("--prefix", default="trio", help="Output prefix for joint VCF")
    run_call = sub.add_parser("run-call", help="Execute variant calling plan (GATK trio).")
    run_call.add_argument("--sample", action="append", required=True, help="sample_id:/path/to.bam (repeatable)")
    run_call.add_argument("--reference", required=True, help="Path to reference FASTA")
    run_call.add_argument("--workdir", required=True, help="Working directory for outputs")
    run_call.add_argument("--prefix", default="trio", help="Output prefix for joint VCF")
    run_call.add_argument("--log", required=False, help="Path to write run log JSON")
    run_call.add_argument("--probe-tools", action="store_true", help="Attempt to record tool versions")
    # Annotation plan
    ann = sub.add_parser("plan-annotate", help="Build annotation command plan (VEP).")
    ann.add_argument("--vcf", required=True, help="Path to joint VCF")
    ann.add_argument("--workdir", required=True, help="Working directory for annotated outputs")
    ann.add_argument("--prefix", default="annotated", help="Output prefix")
    ann.add_argument("--reference", required=False, help="Reference FASTA (optional)")
    ann.add_argument("--plugin", action="append", help="VEP plugin spec, repeatable", default=[])
    run_ann = sub.add_parser("run-annotate", help="Execute annotation plan (VEP).")
    run_ann.add_argument("--vcf", required=True, help="Path to joint VCF")
    run_ann.add_argument("--workdir", required=True, help="Working directory for annotated outputs")
    run_ann.add_argument("--prefix", default="annotated", help="Output prefix")
    run_ann.add_argument("--reference", required=False, help="Reference FASTA (optional)")
    run_ann.add_argument("--plugin", action="append", help="VEP plugin spec, repeatable", default=[])
    run_ann.add_argument("--extra-flag", action="append", help="Extra flags appended to VEP command", default=[])
    run_ann.add_argument("--log", required=False, help="Path to write run log JSON")
    run_ann.add_argument("--probe-tools", action="store_true", help="Attempt to record tool versions")
    # Panel report
    panel = sub.add_parser("panel-report", help="Run panel query + ACMG tagging and emit report.")
    panel.add_argument("--tsv", required=True, help="Flattened annotated TSV")
    panel.add_argument("--panel", required=False, help="Panel JSON file")
    panel.add_argument("--phenotype-id", required=False, help="Phenotype/HPO ID to resolve to a panel")
    panel.add_argument("--phenotype-mapping", required=False, help="Phenotype→genes mapping JSON")
    panel.add_argument("--acmg-config", required=True, help="ACMG config YAML")
    panel.add_argument("--individual-id", required=True, help="Individual identifier")
    panel.add_argument("--max-af", type=float, default=None, help="Max allele frequency filter")
    panel.add_argument("--format", choices=["markdown", "json"], default="markdown", help="Output format")
    panel.add_argument("--log", required=False, help="Path to write run log JSON for the analysis")
    panel.add_argument("--phenotype-panel", required=False, help="Phenotype→genes mapping JSON (optional)")
    phase1 = sub.add_parser("phase1-run", help="End-to-end Phase 1 pipeline (call→annotate→panel).")
    phase1.add_argument("--sample", action="append", help="sample_id:/path/to.bam (repeatable)")
    phase1.add_argument("--reference", required=False, help="Path to reference FASTA")
    phase1.add_argument("--workdir", required=True, help="Working directory for outputs")
    phase1.add_argument("--prefix", default="trio", help="Output prefix")
    phase1.add_argument("--plugins", action="append", default=[], help="VEP plugin specs")
    phase1.add_argument("--extra-flag", action="append", default=[], help="Extra flags for VEP")
    phase1.add_argument("--joint-vcf", required=False, help="Existing joint VCF (skip calling)")
    phase1.add_argument("--tsv", required=False, help="Existing annotated TSV (skip annotation)")
    phase1.add_argument("--skip-call", action="store_true", help="Skip variant calling step")
    phase1.add_argument("--skip-annotate", action="store_true", help="Skip annotation step")
    phase1.add_argument("--panel", required=False, help="Panel JSON file")
    phase1.add_argument("--phenotype-id", required=False, help="Phenotype/HPO ID to resolve to a panel")
    phase1.add_argument("--phenotype-mapping", required=False, help="Phenotype→genes mapping JSON")
    phase1.add_argument("--acmg-config", required=True, help="ACMG config YAML")
    phase1.add_argument("--max-af", type=float, default=None, help="Max allele frequency filter")
    phase1.add_argument("--format", choices=["markdown", "json"], default="markdown", help="Report format")
    phase1.add_argument("--log-dir", required=False, help="Directory to write run logs")
    build_map = sub.add_parser("build-phenotype-mapping", help="Merge phenotype→gene mapping JSON files.")
    build_map.add_argument("--output", required=True, help="Output JSON path")
    build_map.add_argument("inputs", nargs="+", help="Input mapping JSON files")
    args = parser.parse_args(argv)
    if args.command == "plan-call":
        samples = parse_samples(args.sample)
        plan = build_gatk_trio_plan(
            samples=samples,
            reference_fasta=Path(args.reference),
            workdir=Path(args.workdir),
            output_prefix=args.prefix,
        )
        print("# Variant calling command plan")
        for cmd in plan.commands:
            print(cmd)
        return 0
    if args.command == "run-call":
        samples = parse_samples(args.sample)
        plan = build_gatk_trio_plan(
            samples=samples,
            reference_fasta=Path(args.reference),
            workdir=Path(args.workdir),
            output_prefix=args.prefix,
        )
        tool_versions = probe_tool_versions({"gatk": "gatk"}) if args.probe_tools else {}
        run_log = execute_plan(
            plan, automation_level="Auto", log_path=Path(args.log) if args.log else None
        )
        run_log.tool_versions.update(tool_versions)
        if args.log:
            from genomic_consultant.audit.run_log import write_run_log
            write_run_log(run_log, Path(args.log))
        print(f"Run finished with {len(run_log.outputs.get('command_results', []))} steps. Log ID: {run_log.run_id}")
        if args.log:
            print(f"Run log written to {args.log}")
        return 0
    if args.command == "plan-annotate":
        plan = build_vep_plan(
            vcf_path=Path(args.vcf),
            workdir=Path(args.workdir),
            reference_fasta=Path(args.reference) if args.reference else None,
            output_prefix=args.prefix,
            plugins=args.plugin,
        )
        print("# Annotation command plan")
        for cmd in plan.commands:
            print(cmd)
        return 0
    if args.command == "run-annotate":
        plan = build_vep_plan(
            vcf_path=Path(args.vcf),
            workdir=Path(args.workdir),
            reference_fasta=Path(args.reference) if args.reference else None,
            output_prefix=args.prefix,
            plugins=args.plugin,
            extra_flags=args.extra_flag,
        )
        tool_versions = probe_tool_versions({"vep": "vep", "bcftools": "bcftools", "tabix": "tabix"}) if args.probe_tools else {}
        run_log = execute_plan(
            plan, automation_level="Auto", log_path=Path(args.log) if args.log else None
        )
        run_log.tool_versions.update(tool_versions)
        if args.log:
            from genomic_consultant.audit.run_log import write_run_log
            write_run_log(run_log, Path(args.log))
        print(f"Run finished with {len(run_log.outputs.get('command_results', []))} steps. Log ID: {run_log.run_id}")
        if args.log:
            print(f"Run log written to {args.log}")
        return 0
    if args.command == "panel-report":
        if not args.panel and not (args.phenotype_id and args.phenotype_mapping):
            raise SystemExit("Provide either --panel or (--phenotype-id and --phenotype-mapping).")
        store = GenomicStore.from_tsv(Path(args.tsv))
        if args.panel:
            panel_obj = load_panel(Path(args.panel))
            panel_config_hash = sha256sum(Path(args.panel))
        else:
            resolver = PhenotypeGeneResolver.from_json(Path(args.phenotype_mapping))
            panel_obj = resolver.build_panel(args.phenotype_id)
            if panel_obj is None:
                raise SystemExit(f"No genes found for phenotype {args.phenotype_id}")
            panel_config_hash = sha256sum(Path(args.phenotype_mapping))
        acmg_config = load_acmg_config(Path(args.acmg_config))
        filters = FilterConfig(max_af=args.max_af)
        result = run_panel_variant_review(
            individual_id=args.individual_id,
            panel=panel_obj,
            store=store,
            acmg_config=acmg_config,
            filters=filters,
        )
        output = panel_report_json(result) if args.format == "json" else panel_report_markdown(result)
        print(output)
        if args.log:
            from genomic_consultant.audit.run_log import write_run_log
            from genomic_consultant.utils.models import RunLog
            run_log = RunLog(
                run_id=f"panel-{args.individual_id}",
                started_at=datetime.utcnow(),
                inputs={
                    "tsv": str(args.tsv),
                    "panel": str(args.panel) if args.panel else f"phenotype:{args.phenotype_id}",
                    "acmg_config": str(args.acmg_config),
                },
                parameters={
                    "max_af": args.max_af,
                    "format": args.format,
                    "phenotype_id": args.phenotype_id,
                },
                tool_versions={},
                database_versions={},
                config_hashes={
                    "panel": panel_config_hash,
                    "acmg_config": sha256sum(Path(args.acmg_config)),
                },
                automation_levels={"panel_report": "Auto+Review"},
                overrides=[],
                outputs={"report": output},
                notes=None,
            )
            write_run_log(run_log, Path(args.log))
            print(f"Analysis log written to {args.log}")
        return 0
    if args.command == "build-phenotype-mapping":
        merge_mappings(inputs=[Path(p) for p in args.inputs], output=Path(args.output))
        print(f"Merged mapping written to {args.output}")
        return 0
    if args.command == "phase1-run":
        samples = parse_samples(args.sample) if args.sample else None
        log_dir = Path(args.log_dir) if args.log_dir else None
        artifacts = run_phase1_pipeline(
            samples=samples,
            reference_fasta=Path(args.reference) if args.reference else None,
            workdir=Path(args.workdir),
            output_prefix=args.prefix,
            acmg_config_path=Path(args.acmg_config),
            max_af=args.max_af,
            panel_path=Path(args.panel) if args.panel else None,
            phenotype_id=args.phenotype_id,
            phenotype_mapping=Path(args.phenotype_mapping) if args.phenotype_mapping else None,
            report_format=args.format,
            plugins=args.plugins,
            extra_flags=args.extra_flag,
            existing_joint_vcf=Path(args.joint_vcf) if args.joint_vcf else None,
            existing_tsv=Path(args.tsv) if args.tsv else None,
            skip_call=args.skip_call,
            skip_annotate=args.skip_annotate,
            log_dir=log_dir,
        )
        print(artifacts.panel_report)
        return 0
    return 1
 if __name__ == "__main__":
    sys.exit(main())
--- a/src/genomic_consultant/orchestration/__init__.py
+++ b/src/genomic_consultant/orchestration/__init__.py
@@ -1 +0,0 @@
--- a/src/genomic_consultant/orchestration/phase1_pipeline.py
+++ b/src/genomic_consultant/orchestration/phase1_pipeline.py
@@ -1,146 +0,0 @@
 from __future__ import annotations
 import uuid
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Dict, List, Optional
 from datetime import datetime, timezone
 from genomic_consultant.acmg.tagger import ACMGConfig, load_acmg_config
 from genomic_consultant.audit.run_log import write_run_log
 from genomic_consultant.orchestration.workflows import run_panel_variant_review
 from genomic_consultant.panels.panels import load_panel
 from genomic_consultant.panels.resolver import PhenotypeGeneResolver
 from genomic_consultant.pipelines.annotation import AnnotationPlan, build_vep_plan
 from genomic_consultant.pipelines.runner import execute_plan
 from genomic_consultant.pipelines.variant_calling import VariantCallingPlan, build_gatk_trio_plan
 from genomic_consultant.reporting.report import panel_report_json, panel_report_markdown
 from genomic_consultant.store.query import GenomicStore
 from genomic_consultant.utils.hashing import sha256sum
 from genomic_consultant.utils.models import FilterConfig, RunLog
 @dataclass
 class Phase1Artifacts:
    call_log: Optional[RunLog]
    annotate_log: Optional[RunLog]
    panel_report: str
    panel_report_format: str
    panel_result_log: RunLog
    tsv_path: Path
    joint_vcf: Optional[Path]
 def run_phase1_pipeline(
    samples: Optional[Dict[str, Path]],
    reference_fasta: Optional[Path],
    workdir: Path,
    output_prefix: str,
    acmg_config_path: Path,
    max_af: Optional[float],
    panel_path: Optional[Path],
    phenotype_id: Optional[str],
    phenotype_mapping: Optional[Path],
    report_format: str = "markdown",
    plugins: Optional[List[str]] = None,
    extra_flags: Optional[List[str]] = None,
    existing_joint_vcf: Optional[Path] = None,
    existing_tsv: Optional[Path] = None,
    skip_call: bool = False,
    skip_annotate: bool = False,
    log_dir: Optional[Path] = None,
 ) -> Phase1Artifacts:
    """
    Orchestrate Phase 1: optional call -> annotate -> panel report.
    Allows skipping call/annotate when precomputed artifacts are supplied.
    """
    log_dir = log_dir or workdir / "runtime"
    log_dir.mkdir(parents=True, exist_ok=True)
    call_log = None
    annotate_log = None
    # Variant calling
    joint_vcf: Optional[Path] = existing_joint_vcf
    if not skip_call:
        if not samples or not reference_fasta:
            raise ValueError("samples and reference_fasta are required unless skip_call is True")
        call_plan: VariantCallingPlan = build_gatk_trio_plan(
            samples=samples, reference_fasta=reference_fasta, workdir=workdir, output_prefix=output_prefix
        )
        call_log_path = log_dir / f"{output_prefix}_call_runlog.json"
        call_log = execute_plan(call_plan, automation_level="Auto", log_path=call_log_path)
        joint_vcf = call_plan.joint_vcf
    elif joint_vcf is None:
        joint_vcf = existing_joint_vcf
    # Annotation
    tsv_path: Optional[Path] = existing_tsv
    if not skip_annotate:
        if joint_vcf is None:
            raise ValueError("joint VCF must be provided (via call step or existing_joint_vcf)")
        ann_plan: AnnotationPlan = build_vep_plan(
            vcf_path=joint_vcf,
            workdir=workdir,
            reference_fasta=reference_fasta,
            output_prefix=output_prefix,
            plugins=plugins or [],
            extra_flags=extra_flags or [],
        )
        ann_log_path = log_dir / f"{output_prefix}_annotate_runlog.json"
        annotate_log = execute_plan(ann_plan, automation_level="Auto", log_path=ann_log_path)
        tsv_path = ann_plan.flat_table
    if tsv_path is None:
        raise ValueError("No TSV available; provide existing_tsv or run annotation.")
    # Panel selection
    if panel_path:
        panel_obj = load_panel(panel_path)
        panel_hash = sha256sum(panel_path)
    elif phenotype_id and phenotype_mapping:
        resolver = PhenotypeGeneResolver.from_json(phenotype_mapping)
        panel_obj = resolver.build_panel(phenotype_id)
        if panel_obj is None:
            raise ValueError(f"No genes found for phenotype {phenotype_id}")
        panel_hash = sha256sum(phenotype_mapping)
    else:
        raise ValueError("Provide panel_path or (phenotype_id and phenotype_mapping).")
    acmg_config = load_acmg_config(acmg_config_path)
    store = GenomicStore.from_tsv(tsv_path)
    filters = FilterConfig(max_af=max_af)
    panel_result = run_panel_variant_review(
        individual_id=output_prefix, panel=panel_obj, store=store, acmg_config=acmg_config, filters=filters
    )
    report = panel_report_markdown(panel_result) if report_format == "markdown" else panel_report_json(panel_result)
    panel_log = RunLog(
        run_id=f"phase1-panel-{uuid.uuid4()}",
        started_at=datetime.now(timezone.utc),
        inputs={
            "tsv": str(tsv_path),
            "panel": str(panel_path) if panel_path else f"phenotype:{phenotype_id}",
            "acmg_config": str(acmg_config_path),
        },
        parameters={"max_af": max_af, "report_format": report_format, "phenotype_id": phenotype_id},
        tool_versions={},
        database_versions={},
        config_hashes={"panel": panel_hash, "acmg_config": sha256sum(acmg_config_path)},
        automation_levels={"panel_report": "Auto+Review"},
        overrides=[],
        outputs={"report": report},
        notes=None,
    )
    panel_log_path = log_dir / f"{output_prefix}_panel_runlog.json"
    write_run_log(panel_log, panel_log_path)
    return Phase1Artifacts(
        call_log=call_log,
        annotate_log=annotate_log,
        panel_report=report,
        panel_report_format=report_format,
        panel_result_log=panel_log,
        tsv_path=tsv_path,
        joint_vcf=joint_vcf,
    )
--- a/src/genomic_consultant/orchestration/workflows.py
+++ b/src/genomic_consultant/orchestration/workflows.py
@@ -1,35 +0,0 @@
 from __future__ import annotations
 from dataclasses import dataclass
 from typing import List
 from genomic_consultant.acmg.tagger import ACMGConfig, tag_variant
 from genomic_consultant.panels.panels import GenePanel
 from genomic_consultant.store.query import GenomicStore
 from genomic_consultant.utils.models import FilterConfig, SuggestedClassification, Variant
 @dataclass
 class PanelVariantResult:
    variant: Variant
    acmg: SuggestedClassification
 @dataclass
 class PanelAnalysisResult:
    individual_id: str
    panel: GenePanel
    variants: List[PanelVariantResult]
 def run_panel_variant_review(
    individual_id: str,
    panel: GenePanel,
    store: GenomicStore,
    acmg_config: ACMGConfig,
    filters: FilterConfig | None = None,
 ) -> PanelAnalysisResult:
    """Query variants for a panel and attach ACMG evidence suggestions."""
    variants = store.get_variants_by_gene(individual_id=individual_id, genes=panel.genes, filters=filters)
    enriched = [PanelVariantResult(variant=v, acmg=tag_variant(v, acmg_config)) for v in variants]
    return PanelAnalysisResult(individual_id=individual_id, panel=panel, variants=enriched)
--- a/src/genomic_consultant/panels/__init__.py
+++ b/src/genomic_consultant/panels/__init__.py
@@ -1 +0,0 @@
--- a/src/genomic_consultant/panels/aggregate.py
+++ b/src/genomic_consultant/panels/aggregate.py
@@ -1,31 +0,0 @@
 from __future__ import annotations
 import json
 from pathlib import Path
 from typing import Dict, Iterable, List, Set
 def merge_mappings(inputs: Iterable[Path], output: Path, version: str = "merged", sources: List[str] | None = None) -> Path:
    """
    Merge multiple phenotype→gene mapping JSON files into one.
    Input schema: {"phenotype_to_genes": {"HP:xxxx": ["GENE1", ...]}, "version": "...", "source": "..."}
    """
    merged: Dict[str, Set[str]] = {}
    source_list: List[str] = sources or []
    for path in inputs:
        data = json.loads(Path(path).read_text())
        phenos = data.get("phenotype_to_genes", {})
        for pid, genes in phenos.items():
            merged.setdefault(pid, set()).update(genes)
        src_label = data.get("source") or path.name
        source_list.append(src_label)
    out = {
        "version": version,
        "source": ",".join(source_list),
        "phenotype_to_genes": {pid: sorted(list(genes)) for pid, genes in merged.items()},
        "metadata": {"merged_from": [str(p) for p in inputs]},
    }
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(json.dumps(out, indent=2))
    return output
--- a/src/genomic_consultant/panels/panels.py
+++ b/src/genomic_consultant/panels/panels.py
@@ -1,38 +0,0 @@
 from __future__ import annotations
 import json
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Dict, Optional
 from genomic_consultant.utils.models import GenePanel
 @dataclass
 class PanelRepository:
    """Loads curated gene panels stored as JSON files."""
    panels: Dict[str, GenePanel]
    @classmethod
    def from_directory(cls, path: Path) -> "PanelRepository":
        panels: Dict[str, GenePanel] = {}
        for json_file in Path(path).glob("*.json"):
            panel = load_panel(json_file)
            panels[panel.name] = panel
        return cls(panels=panels)
    def get(self, name: str) -> Optional[GenePanel]:
        return self.panels.get(name)
 def load_panel(path: Path) -> GenePanel:
    data = json.loads(Path(path).read_text())
    return GenePanel(
        name=data["name"],
        genes=data["genes"],
        source=data.get("source", "unknown"),
        version=data.get("version", "unknown"),
        last_updated=data.get("last_updated", ""),
        metadata=data.get("metadata", {}),
    )
--- a/src/genomic_consultant/panels/resolver.py
+++ b/src/genomic_consultant/panels/resolver.py
@@ -1,42 +0,0 @@
 from __future__ import annotations
 import json
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Dict, List, Optional
 from genomic_consultant.utils.models import GenePanel
 @dataclass
 class PhenotypeGeneResolver:
    """Resolves phenotype/HPO terms to gene lists using a curated mapping file."""
    mapping: Dict[str, List[str]]
    version: str
    source: str
    @classmethod
    def from_json(cls, path: Path) -> "PhenotypeGeneResolver":
        data = json.loads(Path(path).read_text())
        mapping = data.get("phenotype_to_genes", {})
        version = data.get("version", "unknown")
        source = data.get("source", "unknown")
        return cls(mapping=mapping, version=version, source=source)
    def resolve(self, phenotype_id: str) -> Optional[List[str]]:
        """Return gene list for a phenotype/HPO ID if present."""
        return self.mapping.get(phenotype_id)
    def build_panel(self, phenotype_id: str) -> Optional[GenePanel]:
        genes = self.resolve(phenotype_id)
        if not genes:
            return None
        return GenePanel(
            name=f"Phenotype:{phenotype_id}",
            genes=genes,
            source=self.source,
            version=self.version,
            last_updated="",
            metadata={"phenotype_id": phenotype_id},
        )
--- a/src/genomic_consultant/pipelines/__init__.py
+++ b/src/genomic_consultant/pipelines/__init__.py
@@ -1 +0,0 @@
--- a/src/genomic_consultant/pipelines/annotation.py
+++ b/src/genomic_consultant/pipelines/annotation.py
@@ -1,72 +0,0 @@
 from __future__ import annotations
 from dataclasses import dataclass
 from pathlib import Path
 from typing import List, Sequence
 @dataclass
 class AnnotationPlan:
    """Command plan for annotating a VCF with VEP (or similar)."""
    annotated_vcf: Path
    flat_table: Path
    commands: List[str]
    workdir: Path
 def build_vep_plan(
    vcf_path: Path,
    workdir: Path,
    reference_fasta: Path | None = None,
    output_prefix: str = "annotated",
    plugins: Sequence[str] | None = None,
    extra_flags: Sequence[str] | None = None,
 ) -> AnnotationPlan:
    """
    Build shell commands for running VEP on a VCF. Produces compressed VCF and a flattened TSV.
    This is a plan only; execution is left to a runner.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    annotated_vcf = workdir / f"{output_prefix}.vep.vcf.gz"
    flat_table = workdir / f"{output_prefix}.vep.tsv"
    plugin_arg = ""
    if plugins:
        plugin_arg = " ".join(f"--plugin {p}" for p in plugins)
    extra_arg = " ".join(extra_flags) if extra_flags else ""
    ref_arg = f"--fasta {reference_fasta}" if reference_fasta else ""
    commands: List[str] = [
        (
            "vep "
            f"-i {vcf_path} "
            f"-o {annotated_vcf} "
            "--vcf --compress_output bgzip "
            "--symbol --canonical "
            "--af --af_gnomad "
            "--polyphen b --sift b "
            "--everything "
            f"{plugin_arg} "
            f"{extra_arg} "
            f"{ref_arg}"
        ),
        f"tabix -p vcf {annotated_vcf}",
        (
            "bcftools query "
            f"-f '%CHROM\\t%POS\\t%REF\\t%ALT\\t%SYMBOL\\t%Consequence\\t%Protein_position\\t"
            f"%PolyPhen\\t%SIFT\\t%CLIN_SIG\\t%AF\\t%gnomAD_AF\\t%SpliceAI\\t%CADD_PHRED\\n' "
            f"{annotated_vcf} > {flat_table}"
        ),
    ]
    return AnnotationPlan(
        annotated_vcf=annotated_vcf,
        flat_table=flat_table,
        commands=commands,
        workdir=workdir,
    )
--- a/src/genomic_consultant/pipelines/runner.py
+++ b/src/genomic_consultant/pipelines/runner.py
@@ -1,62 +0,0 @@
 from __future__ import annotations
 import subprocess
 import uuid
 from datetime import datetime
 from pathlib import Path
 from typing import Any, Dict, List, Protocol
 from genomic_consultant.audit.run_log import write_run_log
 from genomic_consultant.utils.models import RunLog
 class HasCommands(Protocol):
    commands: List[str]
    workdir: Path
 def execute_plan(plan: HasCommands, automation_level: str, log_path: Path | None = None) -> RunLog:
    """
    Execute shell commands defined in a plan sequentially. Captures stdout/stderr and exit codes.
    Suitable for VariantCallingPlan or AnnotationPlan.
    """
    run_id = str(uuid.uuid4())
    started_at = datetime.utcnow()
    outputs: Dict[str, Any] = {}
    # The plan object may expose outputs such as joint_vcf/annotated_vcf we can introspect.
    for attr in ("joint_vcf", "per_sample_gvcf", "annotated_vcf", "flat_table"):
        if hasattr(plan, attr):
            outputs[attr] = getattr(plan, attr)
    results: List[Dict[str, Any]] = []
    for idx, cmd in enumerate(plan.commands):
        proc = subprocess.run(cmd, shell=True, cwd=plan.workdir, capture_output=True, text=True)
        results.append(
            {
                "step": idx,
                "command": cmd,
                "returncode": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr,
            }
        )
        if proc.returncode != 0:
            break
    run_log = RunLog(
        run_id=run_id,
        started_at=started_at,
        inputs={"workdir": str(plan.workdir)},
        parameters={},
        tool_versions={},  # left for caller to fill if known
        database_versions={},
        config_hashes={},
        automation_levels={"pipeline": automation_level},
        overrides=[],
        outputs=outputs | {"command_results": results},
        notes=None,
    )
    if log_path:
        write_run_log(run_log, log_path)
    return run_log
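The sequential, stop-on-first-failure behaviour described in the `execute_plan` docstring can be sketched in isolation as follows. This is a minimal illustration rather than the removed module itself: `run_commands` is a hypothetical helper, and the example commands assume a POSIX shell.

```python
import subprocess
from typing import Any, Dict, List


def run_commands(commands: List[str]) -> List[Dict[str, Any]]:
    """Run shell commands in order and stop after the first non-zero exit code."""
    results: List[Dict[str, Any]] = []
    for idx, cmd in enumerate(commands):
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results.append(
            {"step": idx, "command": cmd, "returncode": proc.returncode, "stdout": proc.stdout}
        )
        if proc.returncode != 0:
            break  # later steps usually depend on earlier outputs, so stop here
    return results


# Hypothetical plan: the second step fails, so the third never runs.
results = run_commands(["echo first", "exit 3", "echo never"])
```

Recording the failing step instead of raising lets the caller write a complete run log even for a partial run.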
--- a/src/genomic_consultant/pipelines/variant_calling.py
+++ b/src/genomic_consultant/pipelines/variant_calling.py
@@ -1,65 +0,0 @@
 from __future__ import annotations
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Dict, List, Sequence
@dataclass
 class VariantCallingPlan:
    """Command plan for running a trio variant calling pipeline."""
    per_sample_gvcf: Dict[str, Path]
    joint_vcf: Path
    commands: List[str]
    workdir: Path
 def build_gatk_trio_plan(
    samples: Dict[str, Path],
    reference_fasta: Path,
    workdir: Path,
    output_prefix: str = "trio",
    intervals: Sequence[Path] | None = None,
 ) -> VariantCallingPlan:
    """
    Build shell commands for a GATK-based trio pipeline (HaplotypeCaller gVCF + joint genotyping).
    Does not execute commands; returns a plan for orchestration layers to run.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    per_sample_gvcf: Dict[str, Path] = {}
    commands: List[str] = []
    interval_args = ""
    if intervals:
        interval_args = " ".join(f"-L {i}" for i in intervals)
    for sample_id, bam_path in samples.items():
        gvcf_path = workdir / f"{sample_id}.g.vcf.gz"
        per_sample_gvcf[sample_id] = gvcf_path
        cmd = (
            "gatk HaplotypeCaller "
            f"-R {reference_fasta} "
            f"-I {bam_path} "
            f"-O {gvcf_path} "
            "-ERC GVCF "
            f"{interval_args}"
        )
        commands.append(cmd)
    joint_vcf = workdir / f"{output_prefix}.joint.vcf.gz"
    joint_cmd = (
        "gatk GenotypeGVCFs "
        f"-R {reference_fasta} "
        + " ".join(f"--variant {p}" for p in per_sample_gvcf.values())
        + f" -O {joint_vcf}"
    )
    commands.append(joint_cmd)
    return VariantCallingPlan(
        per_sample_gvcf=per_sample_gvcf,
        joint_vcf=joint_vcf,
        commands=commands,
        workdir=workdir,
    )
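The plan builders above interpolate paths directly into shell strings, so a path containing spaces or shell metacharacters would break the command once it is run with `shell=True`. One possible way to harden this, sketched under the assumption of Python 3.8+ (for `shlex.join`), is shown below; `haplotypecaller_command` is a hypothetical helper, not part of the removed module.

```python
import shlex
from pathlib import Path


def haplotypecaller_command(reference: Path, bam: Path, out: Path) -> str:
    """Build a shell-safe GATK HaplotypeCaller command string."""
    args = [
        "gatk", "HaplotypeCaller",
        "-R", str(reference),
        "-I", str(bam),
        "-O", str(out),
        "-ERC", "GVCF",
    ]
    # shlex.join quotes each argument only where the shell requires it.
    return shlex.join(args)


cmd = haplotypecaller_command(Path("ref.fa"), Path("my sample.bam"), Path("out.g.vcf.gz"))
```

Splitting the result with `shlex.split` recovers the original argument list, which makes the quoting easy to verify.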
--- a/src/genomic_consultant/reporting/init.py
+++ b/src/genomic_consultant/reporting/init.py
@@ -1 +0,0 @@
--- a/src/genomic_consultant/reporting/report.py
+++ b/src/genomic_consultant/reporting/report.py
@@ -1,75 +0,0 @@
 from __future__ import annotations
 import json
 from datetime import datetime, timezone
 from typing import List
 from genomic_consultant.orchestration.workflows import PanelAnalysisResult, PanelVariantResult
 def panel_report_markdown(result: PanelAnalysisResult) -> str:
    lines: List[str] = []
    lines.append(f"# Panel Report: {result.panel.name}")
    lines.append("")
    lines.append(f"- Individual: `{result.individual_id}`")
    lines.append(f"- Panel version: `{result.panel.version}` (source: {result.panel.source})")
    lines.append(f"- Generated: {datetime.now(timezone.utc).isoformat()}")
    lines.append("")
    lines.append("## Variants")
    if not result.variants:
        lines.append("No variants found for this panel with current filters.")
        return "\n".join(lines)
    header = "| Variant | Consequence | ClinVar | AF | ACMG suggestion | Evidence |"
    lines.append(header)
    lines.append("|---|---|---|---|---|---|")
    for pv in result.variants:
        v = pv.variant
        ev_summary = "; ".join(f"{e.tag}({e.strength})" for e in pv.acmg.evidence) or "None"
        lines.append(
            "|"
            + f"{v.id} ({v.gene or 'NA'})"
            + f"|{v.consequence or 'NA'}"
            + f"|{v.clinvar_significance or 'NA'}"
            + f"|{v.allele_frequency if v.allele_frequency is not None else 'NA'}"
            + f"|{pv.acmg.suggested_class}"
            + f"|{ev_summary}"
            + "|"
        )
    lines.append("")
    lines.append("> Note: ACMG suggestions here are auto-generated and require human review for clinical decisions.")
    return "\n".join(lines)
 def panel_report_json(result: PanelAnalysisResult) -> str:
    payload = {
        "individual_id": result.individual_id,
        "panel": {
            "name": result.panel.name,
            "version": result.panel.version,
            "source": result.panel.source,
        },
        "generated": datetime.now(timezone.utc).isoformat(),
        "variants": [
            {
                "id": pv.variant.id,
                "gene": pv.variant.gene,
                "consequence": pv.variant.consequence,
                "clinvar_significance": pv.variant.clinvar_significance,
                "allele_frequency": pv.variant.allele_frequency,
                "acmg": {
                    "suggested_class": pv.acmg.suggested_class,
                    "evidence": [
                        {"tag": e.tag, "strength": e.strength, "rationale": e.rationale}
                        for e in pv.acmg.evidence
                    ],
                    "human_classification": pv.acmg.human_classification,
                    "human_reviewer": pv.acmg.human_reviewer,
                    "human_notes": pv.acmg.human_notes,
                },
            }
            for pv in result.variants
        ],
        "disclaimer": "Auto-generated suggestions; human review required.",
    }
    return json.dumps(payload, indent=2)
--- a/src/genomic_consultant/store/init.py
+++ b/src/genomic_consultant/store/init.py
@@ -1 +0,0 @@
--- a/src/genomic_consultant/store/parquet_store.py
+++ b/src/genomic_consultant/store/parquet_store.py
@@ -1,83 +0,0 @@
 from __future__ import annotations
 from dataclasses import dataclass
 from pathlib import Path
 from typing import List, Sequence
 from genomic_consultant.store.query import _matches_any
 from genomic_consultant.utils.models import FilterConfig, Variant
 try:
    import pandas as pd
 except ImportError as exc:  # pragma: no cover - import guard
    raise ImportError("ParquetGenomicStore requires pandas. Install with `pip install pandas`.") from exc
@dataclass
 class ParquetGenomicStore:
    """Parquet-backed store for larger datasets (optional dependency: pandas)."""
    df: "pd.DataFrame"
    @classmethod
    def from_parquet(cls, path: Path) -> "ParquetGenomicStore":
        df = pd.read_parquet(path)
        return cls(df=df)
    def get_variants_by_gene(
        self, individual_id: str, genes: Sequence[str], filters: FilterConfig | None = None
    ) -> List[Variant]:
        filters = filters or FilterConfig()
        subset = self.df[self.df["SYMBOL"].str.upper().isin([g.upper() for g in genes])]
        return self._apply_filters(subset, filters)
    def get_variants_by_region(
        self, individual_id: str, chrom: str, start: int, end: int, filters: FilterConfig | None = None
    ) -> List[Variant]:
        filters = filters or FilterConfig()
        subset = self.df[(self.df["CHROM"] == chrom) & (self.df["POS"] >= start) & (self.df["POS"] <= end)]
        return self._apply_filters(subset, filters)
    def _apply_filters(self, df: "pd.DataFrame", filters: FilterConfig) -> List[Variant]:
        mask = pd.Series(True, index=df.index)
        af = df["AF"]
        if "gnomAD_AF" in df.columns:  # df.get() returns None for a missing column, which fillna rejects
            af = af.fillna(df["gnomAD_AF"])
        af = af.fillna(0)
        if filters.max_af is not None:
            mask &= af <= filters.max_af
        if filters.min_af is not None:
            mask &= af >= filters.min_af
        if filters.clinvar_significance:
            mask &= df["CLIN_SIG"].str.lower().isin([s.lower() for s in filters.clinvar_significance])
        if filters.consequence_includes:
            mask &= df["Consequence"].str.lower().apply(
                lambda v: _matches_any(v, [c.lower() for c in filters.consequence_includes])
            )
        if filters.consequence_excludes:
            mask &= ~df["Consequence"].str.lower().apply(
                lambda v: _matches_any(v, [c.lower() for c in filters.consequence_excludes])
            )
        filtered = df[mask]
        variants: List[Variant] = []
        for _, row in filtered.iterrows():
            variants.append(
                Variant(
                    chrom=row["CHROM"],
                    pos=int(row["POS"]),
                    ref=row["REF"],
                    alt=row["ALT"],
                    gene=row.get("SYMBOL"),
                    consequence=row.get("Consequence"),
                    protein_change=row.get("Protein_position"),
                    clinvar_significance=row.get("CLIN_SIG"),
                    allele_frequency=_maybe_float(row.get("AF")) or _maybe_float(row.get("gnomAD_AF")),
                    annotations={"gnomad_af": _maybe_float(row.get("gnomAD_AF"))},
                )
            )
        return variants
 def _maybe_float(value) -> float | None:
    try:
        if value is None:
            return None
        return float(value)
    except (TypeError, ValueError):
        return None
--- a/src/genomic_consultant/store/query.py
+++ b/src/genomic_consultant/store/query.py
@@ -1,109 +0,0 @@
 from __future__ import annotations
 import csv
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Dict, Iterable, List, Sequence
 from genomic_consultant.utils.models import FilterConfig, Variant
@dataclass
 class GenomicStore:
    """Lightweight wrapper around annotated variants."""
    variants: List[Variant]
    @classmethod
    def from_tsv(cls, path: Path) -> "GenomicStore":
        """
        Load variants from a flattened TSV generated by the annotation plan.
        Expected columns (flexible, missing columns are tolerated):
        CHROM POS REF ALT SYMBOL Consequence Protein_position PolyPhen SIFT CLIN_SIG AF gnomAD_AF SpliceAI CADD_PHRED
        """
        variants: List[Variant] = []
        with Path(path).open() as fh:
            reader = csv.DictReader(fh, delimiter="\t")
            for row in reader:
                if not row:
                    continue
                variants.append(_row_to_variant(row))
        return cls(variants=variants)
    def get_variants_by_gene(
        self, individual_id: str, genes: Sequence[str], filters: FilterConfig | None = None
    ) -> List[Variant]:
        filters = filters or FilterConfig()
        gene_set = {g.upper() for g in genes}
        return self._apply_filters((v for v in self.variants if (v.gene or "").upper() in gene_set), filters)
    def get_variants_by_region(
        self, individual_id: str, chrom: str, start: int, end: int, filters: FilterConfig | None = None
    ) -> List[Variant]:
        filters = filters or FilterConfig()
        return self._apply_filters(
            (v for v in self.variants if v.chrom == chrom and start <= v.pos <= end),
            filters,
        )
    def _apply_filters(self, variants: Iterable[Variant], filters: FilterConfig) -> List[Variant]:
        out: List[Variant] = []
        for v in variants:
            if filters.max_af is not None and v.allele_frequency is not None and v.allele_frequency > filters.max_af:
                continue
            if filters.min_af is not None and v.allele_frequency is not None and v.allele_frequency < filters.min_af:
                continue
            if filters.clinvar_significance and (v.clinvar_significance or "").lower() not in {
                sig.lower() for sig in filters.clinvar_significance
            }:
                continue
            if filters.consequence_includes and not _matches_any(v.consequence, filters.consequence_includes):
                continue
            if filters.consequence_excludes and _matches_any(v.consequence, filters.consequence_excludes):
                continue
            out.append(v)
        return out
 def _matches_any(value: str | None, patterns: Sequence[str]) -> bool:
    if value is None:
        return False
    v = value.lower()
    return any(pat.lower() in v for pat in patterns)
 def _parse_float(val: str | None) -> float | None:
    if val in (None, "", "."):
        return None
    try:
        return float(val)
    except ValueError:
        return None
 def _row_to_variant(row: Dict[str, str]) -> Variant:
    chrom = row.get("CHROM") or row.get("#CHROM")
    pos = int(row["POS"])
    af = _parse_float(row.get("AF"))
    gnomad_af = _parse_float(row.get("gnomAD_AF"))
    splice_ai = _parse_float(row.get("SpliceAI"))
    cadd = _parse_float(row.get("CADD_PHRED"))
    return Variant(
        chrom=chrom,
        pos=pos,
        ref=row.get("REF"),
        alt=row.get("ALT"),
        gene=row.get("SYMBOL") or None,
        consequence=row.get("Consequence") or None,
        protein_change=row.get("Protein_position") or None,
        clinvar_significance=row.get("CLIN_SIG") or None,
        allele_frequency=af if af is not None else gnomad_af,
        annotations={
            "polyphen": row.get("PolyPhen"),
            "sift": row.get("SIFT"),
            "gnomad_af": gnomad_af,
            "splice_ai_delta_score": splice_ai,
            "cadd_phred": cadd,
        },
    )
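The tolerance for missing columns and `.` placeholders described in the `from_tsv` docstring can be illustrated with a small, self-contained sketch. The TSV content and the `parse_float` name are made up for illustration; the fallback from `AF` to `gnomAD_AF` mirrors `_row_to_variant` above.

```python
import csv
import io


def parse_float(val):
    """Return a float, or None for missing values such as '' or '.'."""
    if val in (None, "", "."):
        return None
    try:
        return float(val)
    except ValueError:
        return None


# Hypothetical flattened TSV: AF is a '.' placeholder and CADD_PHRED is absent.
tsv = "CHROM\tPOS\tREF\tALT\tSYMBOL\tAF\tgnomAD_AF\n1\t100\tA\tT\tGENE1\t.\t0.0002\n"
row = next(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

af = parse_float(row.get("AF"))
gnomad_af = parse_float(row.get("gnomAD_AF"))
allele_frequency = af if af is not None else gnomad_af  # fall back to gnomAD
cadd = parse_float(row.get("CADD_PHRED"))  # column not present in this file
```

Using `is not None` rather than `or` for the fallback keeps a genuine allele frequency of `0.0` from being replaced by the gnomAD value.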
--- a/src/genomic_consultant/utils/init.py
+++ b/src/genomic_consultant/utils/init.py
@@ -1 +0,0 @@
--- a/src/genomic_consultant/utils/hashing.py
+++ b/src/genomic_consultant/utils/hashing.py
@@ -1,15 +0,0 @@
 from __future__ import annotations
 import hashlib
 from pathlib import Path
 from typing import Optional
 def sha256sum(path: Path) -> Optional[str]:
    if not Path(path).exists():
        return None
    h = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
--- a/src/genomic_consultant/utils/models.py
+++ b/src/genomic_consultant/utils/models.py
@@ -1,81 +0,0 @@
 from __future__ import annotations
 from dataclasses import dataclass, field
 from datetime import datetime
 from typing import Any, Dict, List, Optional
@dataclass
 class Variant:
    """Lightweight variant representation used by the query layer."""
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: Optional[str] = None
    transcript: Optional[str] = None
    consequence: Optional[str] = None
    protein_change: Optional[str] = None
    clinvar_significance: Optional[str] = None
    allele_frequency: Optional[float] = None
    annotations: Dict[str, Any] = field(default_factory=dict)
    @property
    def id(self) -> str:
        return f"{self.chrom}-{self.pos}-{self.ref}-{self.alt}"
@dataclass
 class FilterConfig:
    """Common filters for variant queries."""
    max_af: Optional[float] = None
    min_af: Optional[float] = None
    clinvar_significance: Optional[List[str]] = None
    consequence_includes: Optional[List[str]] = None
    consequence_excludes: Optional[List[str]] = None
    inheritances: Optional[List[str]] = None
@dataclass
 class EvidenceTag:
    tag: str
    strength: str
    rationale: str
@dataclass
 class SuggestedClassification:
    suggested_class: str
    evidence: List[EvidenceTag] = field(default_factory=list)
    human_classification: Optional[str] = None
    human_reviewer: Optional[str] = None
    human_notes: Optional[str] = None
@dataclass
 class GenePanel:
    name: str
    genes: List[str]
    source: str
    version: str
    last_updated: str
    metadata: Dict[str, Any] = field(default_factory=dict)
@dataclass
 class RunLog:
    """Machine-readable record of a pipeline or analysis run."""
    run_id: str
    started_at: datetime
    inputs: Dict[str, Any]
    parameters: Dict[str, Any]
    tool_versions: Dict[str, str]
    database_versions: Dict[str, str]
    config_hashes: Dict[str, str]
    automation_levels: Dict[str, str]
    overrides: List[Dict[str, Any]] = field(default_factory=list)
    outputs: Dict[str, Any] = field(default_factory=dict)
    notes: Optional[str] = None
--- a/src/genomic_consultant/utils/tooling.py
+++ b/src/genomic_consultant/utils/tooling.py
@@ -1,22 +0,0 @@
 from __future__ import annotations
 import subprocess
 from typing import Dict
 def probe_tool_versions(commands: Dict[str, str]) -> Dict[str, str]:
    """
    Attempt to get tool versions by running the provided commands with --version.
    Returns best-effort outputs; missing tools are skipped.
    """
    results: Dict[str, str] = {}
    for name, cmd in commands.items():
        try:
            proc = subprocess.run(f"{cmd} --version", shell=True, capture_output=True, text=True, timeout=10)
            if proc.returncode == 0 and proc.stdout:
                results[name] = proc.stdout.strip().splitlines()[0]
            elif proc.returncode == 0 and proc.stderr:
                results[name] = proc.stderr.strip().splitlines()[0]
        except Exception:
            continue
    return results
--- a/tests/test_acmg.py
+++ b/tests/test_acmg.py
@@ -1,41 +0,0 @@
 from genomic_consultant.acmg.tagger import ACMGConfig, tag_variant
 from genomic_consultant.utils.models import Variant
 def test_ba1_trumps():
    cfg = ACMGConfig(ba1_af=0.05, bs1_af=0.01, pm2_af=0.0005, lof_genes=set())
    v = Variant(chrom="1", pos=1, ref="A", alt="T", allele_frequency=0.2)
    result = tag_variant(v, cfg)
    assert result.suggested_class == "Benign"
    assert any(e.tag == "BA1" for e in result.evidence)
 def test_pvs1_pm2_likely_pathogenic():
    cfg = ACMGConfig(lof_genes={"GENE1"}, pm2_af=0.0005, ba1_af=0.05, bs1_af=0.01)
    v = Variant(
        chrom="1",
        pos=1,
        ref="A",
        alt="T",
        gene="GENE1",
        consequence="stop_gained",
        allele_frequency=0.0001,
    )
    result = tag_variant(v, cfg)
    assert result.suggested_class == "Likely pathogenic"
    tags = {e.tag for e in result.evidence}
    assert {"PVS1", "PM2"} <= tags
 def test_bp7_supporting():
    cfg = ACMGConfig(bp7_splice_ai_max=0.1)
    v = Variant(
        chrom="1",
        pos=1,
        ref="A",
        alt="T",
        consequence="synonymous_variant",
        annotations={"splice_ai_delta_score": 0.05},
    )
    result = tag_variant(v, cfg)
    assert any(e.tag == "BP7" for e in result.evidence)
--- a/tests/test_aggregate.py
+++ b/tests/test_aggregate.py
@@ -1,16 +0,0 @@
 from pathlib import Path
 from genomic_consultant.panels.aggregate import merge_mappings
 import json
 def test_merge_mappings(tmp_path: Path):
    a = tmp_path / "a.json"
    b = tmp_path / "b.json"
    a.write_text('{"source":"A","phenotype_to_genes":{"HP:1":["G1","G2"]}}')
    b.write_text('{"source":"B","phenotype_to_genes":{"HP:1":["G2","G3"],"HP:2":["G4"]}}')
    out = tmp_path / "out.json"
    merge_mappings([a, b], out)
    data = json.loads(out.read_text())
    assert sorted(data["phenotype_to_genes"]["HP:1"]) == ["G1", "G2", "G3"]
    assert data["phenotype_to_genes"]["HP:2"] == ["G4"]
--- a/tests/test_phase1_pipeline.py
+++ b/tests/test_phase1_pipeline.py
@@ -1,27 +0,0 @@
 from pathlib import Path
 from genomic_consultant.orchestration.phase1_pipeline import run_phase1_pipeline
 def test_phase1_pipeline_with_existing_tsv():
    root = Path(__file__).resolve().parents[1]
    tsv = root / "sample_data/example_annotated.tsv"
    panel = root / "configs/panel.example.json"
    acmg = root / "configs/acmg_config.example.yaml"
    result = run_phase1_pipeline(
        samples=None,
        reference_fasta=None,
        workdir=root / "runtime",
        output_prefix="demo",
        acmg_config_path=acmg,
        max_af=0.05,
        panel_path=panel,
        phenotype_id=None,
        phenotype_mapping=None,
        report_format="markdown",
        existing_tsv=tsv,
        skip_call=True,
        skip_annotate=True,
    )
    assert "Panel Report" in result.panel_report
    assert result.tsv_path == tsv
--- a/tests/test_resolver.py
+++ b/tests/test_resolver.py
@@ -1,18 +0,0 @@
 import json
 from pathlib import Path
 from genomic_consultant.panels.resolver import PhenotypeGeneResolver
 def test_resolver_build_panel(tmp_path: Path):
    data = {
        "version": "test",
        "source": "example",
        "phenotype_to_genes": {"HP:0001": ["GENE1", "GENE2"]},
    }
    path = tmp_path / "map.json"
    path.write_text(json.dumps(data))
    resolver = PhenotypeGeneResolver.from_json(path)
    panel = resolver.build_panel("HP:0001")
    assert panel is not None
    assert panel.genes == ["GENE1", "GENE2"]
    assert "phenotype_id" in panel.metadata
--- a/tests/test_store.py
+++ b/tests/test_store.py
@@ -1,28 +0,0 @@
 from genomic_consultant.store.query import GenomicStore
 from genomic_consultant.utils.models import FilterConfig, Variant
 def test_filter_by_gene_and_af():
    store = GenomicStore(
        variants=[
            Variant(chrom="1", pos=1, ref="A", alt="T", gene="GENE1", allele_frequency=0.02),
            Variant(chrom="1", pos=2, ref="G", alt="C", gene="GENE2", allele_frequency=0.0001),
        ]
    )
    res = store.get_variants_by_gene("ind", ["GENE2"], filters=FilterConfig(max_af=0.001))
    assert len(res) == 1
    assert res[0].gene == "GENE2"
 def test_consequence_include_exclude():
    store = GenomicStore(
        variants=[
            Variant(chrom="1", pos=1, ref="A", alt="T", gene="GENE1", consequence="missense_variant"),
            Variant(chrom="1", pos=2, ref="G", alt="C", gene="GENE1", consequence="synonymous_variant"),
        ]
    )
    res = store.get_variants_by_gene(
        "ind", ["GENE1"], filters=FilterConfig(consequence_includes=["missense"], consequence_excludes=["synonymous"])
    )
    assert len(res) == 1
    assert res[0].consequence == "missense_variant"
--- a/tests/test_store_tsv.py
+++ b/tests/test_store_tsv.py
@@ -1,33 +0,0 @@
 from pathlib import Path
 from genomic_consultant.store.query import GenomicStore
 def test_from_tsv_with_extra_columns(tmp_path: Path):
    content = "\t".join(
        [
            "#CHROM",
            "POS",
            "REF",
            "ALT",
            "SYMBOL",
            "Consequence",
            "Protein_position",
            "PolyPhen",
            "SIFT",
            "CLIN_SIG",
            "AF",
            "gnomAD_AF",
            "SpliceAI",
            "CADD_PHRED",
        ]
    ) + "\n"
    content += "1\t100\tA\tT\tGENE1\tmissense_variant\tp.X\t.\t.\tPathogenic\t0.0001\t0.0002\t0.05\t20.1\n"
    path = tmp_path / "v.tsv"
    path.write_text(content)
    store = GenomicStore.from_tsv(path)
    assert len(store.variants) == 1
    v = store.variants[0]
    assert v.annotations["splice_ai_delta_score"] == 0.05
    assert v.annotations["cadd_phred"] == 20.1
--- a/trio_analysis.py
+++ b/trio_analysis.py
@@ -0,0 +1,376 @@
 #!/usr/bin/env python3
 """
 Trio WES Analysis Script
 Analyzes trio VCF for de novo mutations, compound heterozygous variants,
 and potential pathogenic variants.
 """
 import gzip
 import re
 from collections import defaultdict
 from dataclasses import dataclass
 from typing import List, Dict, Optional, Tuple
 import json
@dataclass
 class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float
    filter_status: str
    info: str
    genotypes: Dict[str, str]  # sample -> genotype
    annotation: Optional[str] = None
    gene: Optional[str] = None
    effect: Optional[str] = None
    impact: Optional[str] = None
 def parse_genotype(gt_field: str) -> Tuple[str, int, int]:
    """Parse genotype field, return (gt_string, ref_count, alt_count)"""
    parts = gt_field.split(':')
    gt = parts[0]
    if gt in ['./.', '.|.', '.']:
        return gt, 0, 0
    alleles = re.split('[/|]', gt)
    ref_count = sum(1 for a in alleles if a == '0')
    alt_count = sum(1 for a in alleles if a != '0' and a != '.')
    return gt, ref_count, alt_count
 def get_genotype_class(gt: str) -> str:
    """Classify genotype as HOM_REF, HET, HOM_ALT, or MISSING"""
    if gt in ['./.', '.|.', '.']:
        return 'MISSING'
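    alleles = re.split('[/|]', gt)
    if '.' in alleles:
        # Partially missing calls such as 0/. cannot be classified reliably
        return 'MISSING'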
    if all(a == '0' for a in alleles):
        return 'HOM_REF'
    elif all(a != '0' and a != '.' for a in alleles):
        return 'HOM_ALT'
    else:
        return 'HET'
 def parse_snpeff_annotation(info: str) -> Dict:
    """Parse SnpEff ANN field"""
    result = {
        'gene': None,
        'effect': None,
        'impact': None,
        'hgvs_c': None,
        'hgvs_p': None,
    }
    ann_match = re.search(r'ANN=([^;]+)', info)
    if not ann_match:
        return result
    ann_field = ann_match.group(1)
    annotations = ann_field.split(',')
    if annotations:
        # Take the first (most severe) annotation
        parts = annotations[0].split('|')
        if len(parts) >= 4:
            result['effect'] = parts[1]
            result['impact'] = parts[2]
            result['gene'] = parts[3]
            if len(parts) > 9:
                result['hgvs_c'] = parts[9]
            if len(parts) > 10:
                result['hgvs_p'] = parts[10]
    return result
 def parse_vcf(vcf_path: str) -> Tuple[List[str], List[Variant]]:
    """Parse VCF file and return sample names and variants"""
    samples = []
    variants = []
    open_func = gzip.open if vcf_path.endswith('.gz') else open
    mode = 'rt' if vcf_path.endswith('.gz') else 'r'
    with open_func(vcf_path, mode) as f:
        for line in f:
            if line.startswith('##'):
                continue
            elif line.startswith('#CHROM'):
                parts = line.strip().split('\t')
                samples = parts[9:]
                continue
            parts = line.strip().split('\t')
            if len(parts) < 10:
                continue
            chrom, pos, _, ref, alt, qual, filt, info, fmt = parts[:9]
            gt_fields = parts[9:]
            # Parse genotypes
            genotypes = {}
            fmt_fields = fmt.split(':')
            gt_idx = fmt_fields.index('GT') if 'GT' in fmt_fields else 0
            for i, sample in enumerate(samples):
                gt_parts = gt_fields[i].split(':')
                genotypes[sample] = gt_parts[gt_idx] if gt_idx < len(gt_parts) else './.'
            # Parse annotation
            ann = parse_snpeff_annotation(info)
            try:
                qual_val = float(qual) if qual != '.' else 0
            except ValueError:
                qual_val = 0
            variant = Variant(
                chrom=chrom,
                pos=int(pos),
                ref=ref,
                alt=alt,
                qual=qual_val,
                filter_status=filt,
                info=info,
                genotypes=genotypes,
                annotation=info,
                gene=ann['gene'],
                effect=ann['effect'],
                impact=ann['impact']
            )
            variants.append(variant)
    return samples, variants
 def identify_de_novo(variants: List[Variant], proband: str, father: str, mother: str) -> List[Variant]:
    """Identify de novo variants: present in proband but absent in both parents"""
    de_novo = []
    for v in variants:
        if proband not in v.genotypes or father not in v.genotypes or mother not in v.genotypes:
            continue
        proband_gt = get_genotype_class(v.genotypes[proband])
        father_gt = get_genotype_class(v.genotypes[father])
        mother_gt = get_genotype_class(v.genotypes[mother])
        # De novo: proband has variant, both parents are HOM_REF
        if proband_gt in ['HET', 'HOM_ALT'] and father_gt == 'HOM_REF' and mother_gt == 'HOM_REF':
            de_novo.append(v)
    return de_novo
 def identify_compound_het(variants: List[Variant], proband: str, father: str, mother: str) -> Dict[str, List[Variant]]:
    """Identify compound heterozygous variants in genes"""
    gene_variants = defaultdict(list)
    # Group HET variants by gene
    for v in variants:
        if not v.gene:
            continue
        if proband not in v.genotypes:
            continue
        proband_gt = get_genotype_class(v.genotypes[proband])
        if proband_gt != 'HET':
            continue
        gene_variants[v.gene].append(v)
    # Find compound het (>1 HET variant in same gene, inherited from different parents)
    compound_het = {}
    for gene, vars_list in gene_variants.items():
        if len(vars_list) < 2:
            continue
        maternal_inherited = []
        paternal_inherited = []
        for v in vars_list:
            if father not in v.genotypes or mother not in v.genotypes:
                continue
            father_gt = get_genotype_class(v.genotypes[father])
            mother_gt = get_genotype_class(v.genotypes[mother])
            if father_gt in ['HET', 'HOM_ALT'] and mother_gt == 'HOM_REF':
                paternal_inherited.append(v)
            elif mother_gt in ['HET', 'HOM_ALT'] and father_gt == 'HOM_REF':
                maternal_inherited.append(v)
        if maternal_inherited and paternal_inherited:
            compound_het[gene] = maternal_inherited + paternal_inherited
    return compound_het
 def identify_homozygous_recessive(variants: List[Variant], proband: str, father: str, mother: str) -> List[Variant]:
    """Identify homozygous recessive variants: HOM_ALT in proband, both parents HET"""
    hom_rec = []
    for v in variants:
        if proband not in v.genotypes or father not in v.genotypes or mother not in v.genotypes:
            continue
        proband_gt = get_genotype_class(v.genotypes[proband])
        father_gt = get_genotype_class(v.genotypes[father])
        mother_gt = get_genotype_class(v.genotypes[mother])
        # Homozygous recessive: proband HOM_ALT, both parents HET
        if proband_gt == 'HOM_ALT' and father_gt == 'HET' and mother_gt == 'HET':
            hom_rec.append(v)
    return hom_rec
 def filter_by_impact(variants: List[Variant], impacts: Optional[List[str]] = None) -> List[Variant]:
    """Filter variants by impact level (defaults to HIGH and MODERATE)"""
    if impacts is None:
        impacts = ['HIGH', 'MODERATE']
    return [v for v in variants if v.impact in impacts]
 def generate_report(vcf_path: str, output_path: str):
    """Generate trio analysis report"""
    print(f"Parsing VCF: {vcf_path}")
    samples, variants = parse_vcf(vcf_path)
    print(f"Found {len(samples)} samples: {samples}")
    print(f"Total variants: {len(variants)}")
    # Identify sample roles based on file naming convention
    # Expected: I-1 (father), I-2 (mother), II-3 (proband)
    proband = None
    father = None
    mother = None
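    for s in samples:
        s_upper = s.upper()
        if 'II-3' in s_upper or 'PROBAND' in s_upper:
            proband = s
        # Exclude generation-II IDs so that e.g. 'II-1' is not mistaken for 'I-1'
        elif 'I-1' in s_upper and 'II-1' not in s_upper:
            father = s
        elif 'I-2' in s_upper and 'II-2' not in s_upper:
            mother = s
    if not all([proband, father, mother]):
        # Fallback: assume order is proband, father, mother
        if len(samples) >= 3:
            print("WARNING: sample names do not follow the I-1/I-2/II-3 convention; "
                  "assuming VCF column order is proband, father, mother")
            proband = samples[0]
            father = samples[1]
            mother = samples[2]
        else:
            print("ERROR: Could not identify trio samples")
            return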
    print(f"\nTrio identified:")
    print(f"  Proband: {proband}")
    print(f"  Father:  {father}")
    print(f"  Mother:  {mother}")
    # Analysis
    print("\n" + "="*80)
    print("TRIO ANALYSIS RESULTS")
    print("="*80)
    # De novo variants
    de_novo = identify_de_novo(variants, proband, father, mother)
    de_novo_high = filter_by_impact(de_novo, ['HIGH', 'MODERATE'])
    print(f"\n1. DE NOVO VARIANTS")
    print(f"   Total de novo: {len(de_novo)}")
    print(f"   HIGH/MODERATE impact: {len(de_novo_high)}")
    # Compound heterozygous
    compound_het = identify_compound_het(variants, proband, father, mother)
    print(f"\n2. COMPOUND HETEROZYGOUS GENES")
    print(f"   Genes with compound het: {len(compound_het)}")
    # Homozygous recessive
    hom_rec = identify_homozygous_recessive(variants, proband, father, mother)
    hom_rec_high = filter_by_impact(hom_rec, ['HIGH', 'MODERATE'])
    print(f"\n3. HOMOZYGOUS RECESSIVE VARIANTS")
    print(f"   Total: {len(hom_rec)}")
    print(f"   HIGH/MODERATE impact: {len(hom_rec_high)}")
    # Generate detailed report
    with open(output_path, 'w') as f:
        f.write("# Trio WES Analysis Report\n")
        f.write(f"# Generated from: {vcf_path}\n")
        f.write(f"# Samples: Proband={proband}, Father={father}, Mother={mother}\n")
        f.write(f"# Total variants analyzed: {len(variants)}\n\n")
        # De novo HIGH/MODERATE impact
        f.write("## DE NOVO VARIANTS (HIGH/MODERATE IMPACT)\n")
        f.write("CHROM\tPOS\tREF\tALT\tGENE\tEFFECT\tIMPACT\tPROBAND_GT\tFATHER_GT\tMOTHER_GT\n")
        for v in sorted(de_novo_high, key=lambda x: (x.chrom, x.pos)):
            f.write(f"{v.chrom}\t{v.pos}\t{v.ref}\t{v.alt}\t{v.gene or 'N/A'}\t")
            f.write(f"{v.effect or 'N/A'}\t{v.impact or 'N/A'}\t")
            f.write(f"{v.genotypes.get(proband, './.')}\t")
            f.write(f"{v.genotypes.get(father, './.')}\t")
            f.write(f"{v.genotypes.get(mother, './.')}\n")
        # Compound heterozygous
        f.write("\n## COMPOUND HETEROZYGOUS GENES\n")
        for gene, vars_list in sorted(compound_het.items()):
            high_impact = [v for v in vars_list if v.impact in ['HIGH', 'MODERATE']]
            if high_impact:
                f.write(f"\n### {gene} ({len(vars_list)} variants, {len(high_impact)} HIGH/MODERATE)\n")
                f.write("CHROM\tPOS\tREF\tALT\tEFFECT\tIMPACT\tPROBAND_GT\tFATHER_GT\tMOTHER_GT\n")
                for v in sorted(high_impact, key=lambda x: x.pos):
                    f.write(f"{v.chrom}\t{v.pos}\t{v.ref}\t{v.alt}\t")
                    f.write(f"{v.effect or 'N/A'}\t{v.impact or 'N/A'}\t")
                    f.write(f"{v.genotypes.get(proband, './.')}\t")
                    f.write(f"{v.genotypes.get(father, './.')}\t")
                    f.write(f"{v.genotypes.get(mother, './.')}\n")
        # Homozygous recessive HIGH/MODERATE
        f.write("\n## HOMOZYGOUS RECESSIVE VARIANTS (HIGH/MODERATE IMPACT)\n")
        f.write("CHROM\tPOS\tREF\tALT\tGENE\tEFFECT\tIMPACT\tPROBAND_GT\tFATHER_GT\tMOTHER_GT\n")
        for v in sorted(hom_rec_high, key=lambda x: (x.chrom, x.pos)):
            f.write(f"{v.chrom}\t{v.pos}\t{v.ref}\t{v.alt}\t{v.gene or 'N/A'}\t")
            f.write(f"{v.effect or 'N/A'}\t{v.impact or 'N/A'}\t")
            f.write(f"{v.genotypes.get(proband, './.')}\t")
            f.write(f"{v.genotypes.get(father, './.')}\t")
            f.write(f"{v.genotypes.get(mother, './.')}\n")
        # Summary statistics
        f.write("\n## SUMMARY STATISTICS\n")
        f.write(f"Total variants: {len(variants)}\n")
        f.write(f"De novo variants: {len(de_novo)}\n")
        f.write(f"De novo HIGH/MODERATE: {len(de_novo_high)}\n")
        f.write(f"Compound het genes: {len(compound_het)}\n")
        f.write(f"Homozygous recessive: {len(hom_rec)}\n")
        f.write(f"Homozygous recessive HIGH/MODERATE: {len(hom_rec_high)}\n")
    print(f"\nReport saved to: {output_path}")
    # Also print top candidates
    print("\n" + "="*80)
    print("TOP CANDIDATE VARIANTS")
    print("="*80)
    print("\n--- De Novo HIGH Impact ---")
    de_novo_high_only = [v for v in de_novo if v.impact == 'HIGH']
    for v in de_novo_high_only[:10]:
        print(f"  {v.chrom}:{v.pos} {v.ref}>{v.alt} | {v.gene} | {v.effect}")
    print("\n--- Compound Het Genes (with HIGH impact) ---")
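    # Keep only genes with at least one HIGH-impact variant, then take the first 10,
    # so that genes without HIGH variants do not use up the 10 slots
    high_genes = [(gene, vars_list) for gene, vars_list in compound_het.items()
                  if any(v.impact == 'HIGH' for v in vars_list)]
    for gene, vars_list in high_genes[:10]:
        high_count = sum(1 for v in vars_list if v.impact == 'HIGH')
        print(f"  {gene}: {len(vars_list)} variants ({high_count} HIGH)")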
    print("\n--- Homozygous Recessive HIGH Impact ---")
    hom_rec_high_only = [v for v in hom_rec if v.impact == 'HIGH']
    for v in hom_rec_high_only[:10]:
        print(f"  {v.chrom}:{v.pos} {v.ref}>{v.alt} | {v.gene} | {v.effect}")


if __name__ == '__main__':
    import sys
    vcf_path = sys.argv[1] if len(sys.argv) > 1 else '/Volumes/NV2/genomics_analysis/vcf/trio_joint.snpeff.vcf'
    output_path = sys.argv[2] if len(sys.argv) > 2 else '/Volumes/NV2/genomics_analysis/trio_analysis_report.txt'
    generate_report(vcf_path, output_path)
@@ -1,2 +0,0 @@
 [console_scripts]
 genomic-consultant = genomic_consultant.cli:main