---
phase: 03-core-evidence-layers
plan: 05
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/evidence/animal_models/__init__.py
  - src/usher_pipeline/evidence/animal_models/models.py
  - src/usher_pipeline/evidence/animal_models/fetch.py
  - src/usher_pipeline/evidence/animal_models/transform.py
  - src/usher_pipeline/evidence/animal_models/load.py
  - tests/test_animal_models.py
  - tests/test_animal_models_integration.py
  - src/usher_pipeline/cli/evidence_cmd.py
autonomous: true
must_haves:
  truths:
    - "Pipeline retrieves gene knockout/perturbation phenotypes from MGI, ZFIN, and IMPC"
    - "Phenotypes are filtered for relevance to sensory function, balance, vision, hearing, and cilia morphology"
    - "Ortholog mapping uses established databases with confidence scoring, handling one-to-many mappings"
  artifacts:
    - path: "src/usher_pipeline/evidence/animal_models/fetch.py"
      provides: "MGI, ZFIN, and IMPC phenotype data retrieval"
      exports: ["fetch_mgi_phenotypes", "fetch_zfin_phenotypes", "fetch_impc_phenotypes", "fetch_ortholog_mapping"]
    - path: "src/usher_pipeline/evidence/animal_models/transform.py"
      provides: "Phenotype relevance filtering and scoring"
      exports: ["filter_sensory_phenotypes", "score_animal_evidence", "process_animal_model_evidence"]
    - path: "src/usher_pipeline/evidence/animal_models/load.py"
      provides: "DuckDB persistence for animal model evidence"
      exports: ["load_to_duckdb"]
    - path: "tests/test_animal_models.py"
      provides: "Unit tests for phenotype filtering and ortholog handling"
  key_links:
    - from: "src/usher_pipeline/evidence/animal_models/fetch.py"
      to: "MGI/IMPC bulk data and ZFIN API"
      via: "httpx with tenacity retry for bulk downloads"
      pattern: "informatics\\.jax\\.org|mousephenotype\\.org|zfin\\.org"
    - from: "src/usher_pipeline/evidence/animal_models/fetch.py"
      to: "HCOP ortholog database"
      via: "HCOP bulk download from HGNC"
      pattern: "genenames\\.org.*hcop"
    - from: "src/usher_pipeline/evidence/animal_models/transform.py"
      to: "src/usher_pipeline/evidence/animal_models/fetch.py"
      via: "filters phenotypes and scores animal model evidence"
      pattern: "filter_sensory_phenotypes|score_animal_evidence"
    - from: "src/usher_pipeline/evidence/animal_models/load.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "store.save_dataframe"
      pattern: "save_dataframe.*animal_model_phenotypes"
---

Implement the Animal Model Phenotypes evidence layer (ANIM-01/02/03): retrieve knockout/perturbation phenotypes from MGI (mouse), ZFIN (zebrafish), and IMPC, map orthologs with confidence scoring, filter for sensory/cilia relevance, and score.

Purpose: Animal model phenotypes provide functional evidence -- genes whose orthologs cause sensory, balance, hearing, or cilia defects in mouse/zebrafish are strong candidates. Ortholog confidence prevents false positives from paralog mis-mapping.

Output: animal_model_phenotypes DuckDB table with per-gene ortholog mapping, phenotype summaries, sensory relevance flags, and normalized animal model score.

@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md

@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py

Task 1: Create animal model evidence data model, fetch (orthologs + phenotypes), and transform

src/usher_pipeline/evidence/animal_models/__init__.py
src/usher_pipeline/evidence/animal_models/models.py
src/usher_pipeline/evidence/animal_models/fetch.py
src/usher_pipeline/evidence/animal_models/transform.py

Create the animal model evidence layer following the established fetch->transform pattern.
**models.py**: Define AnimalModelRecord pydantic model with fields:

- gene_id (str)
- gene_symbol (str)
- mouse_ortholog (str|None)
- mouse_ortholog_confidence (str|None -- "HIGH"/"MEDIUM"/"LOW" based on HCOP source count)
- zebrafish_ortholog (str|None)
- zebrafish_ortholog_confidence (str|None)
- has_mouse_phenotype (bool|None)
- has_zebrafish_phenotype (bool|None)
- has_impc_phenotype (bool|None)
- sensory_phenotype_count (int|None -- number of sensory-relevant phenotypes)
- phenotype_categories (str|None -- semicolon-separated list of matched MP/ZP terms)
- animal_model_score_normalized (float|None -- 0-1 composite)

Define ANIMAL_TABLE_NAME = "animal_model_phenotypes". Define SENSORY_MP_KEYWORDS (keywords matched against Mammalian Phenotype ontology term names): ["hearing", "deaf", "vestibular", "balance", "retina", "photoreceptor", "vision", "blind", "cochlea", "stereocilia", "cilia", "cilium", "flagellum", "situs inversus", "laterality", "hydrocephalus", "kidney cyst", "polycystic"]. Define SENSORY_ZP_KEYWORDS similarly for zebrafish phenotype terms.

**fetch.py**: Four functions:

1. `fetch_ortholog_mapping(gene_ids: list[str]) -> pl.DataFrame` -- Download HCOP ortholog data from HGNC: https://ftp.ebi.ac.uk/pub/databases/genenames/hcop/human_mouse_hcop_fifteen_column.txt.gz and the human_zebrafish equivalent. Parse with polars. Extract columns: human gene ID, ortholog ID, ortholog symbol, support count (number of databases agreeing). Assign confidence: HIGH (8+ sources), MEDIUM (4-7), LOW (1-3). For one-to-many mappings: keep the ortholog with the highest support count. Flag genes with multiple orthologs (ortholog_count column). Return DataFrame with gene_id, mouse_ortholog, mouse_ortholog_confidence, zebrafish_ortholog, zebrafish_ortholog_confidence.

2. `fetch_mgi_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Download the MGI gene-phenotype report from the MGI downloads area: https://www.informatics.jax.org/downloads/reports/MGI_GenePheno.rpt (or HMD_HumanPhenotype.rpt for human orthologs). Parse the tab-separated report. Extract: mouse gene ID, allele symbol, MP term name, MP term ID. Return DataFrame with mouse gene and phenotype terms. Use httpx streaming download with retry.

3. `fetch_zfin_phenotypes(zebrafish_gene_ids: list[str]) -> pl.DataFrame` -- Download ZFIN phenotype data from: https://zfin.org/downloads/phenoGeneCleanData_fish.txt. Parse tab-separated. Extract: zebrafish gene, phenotype terms. Return DataFrame. Use httpx streaming download with retry.

4. `fetch_impc_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Query the IMPC Solr API: https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=marker_symbol:{gene}&rows=1000. Process in batches (50 genes at a time with rate limiting). Extract: gene symbol, MP term, p-value, effect size. Return DataFrame. Use httpx with tenacity retry, ratelimit (5 req/sec, conservative). If the IMPC API is unreliable, fall back to bulk download from https://www.mousephenotype.org/data/release.

**transform.py**: Three functions:

1. `filter_sensory_phenotypes(phenotype_df: pl.DataFrame, keywords: list[str]) -> pl.DataFrame` -- Filter phenotype terms for sensory/cilia relevance using keyword matching (case-insensitive substring). Return only rows where the MP/ZP term matches any of the supplied keywords (pass SENSORY_MP_KEYWORDS for mouse/IMPC terms and SENSORY_ZP_KEYWORDS for zebrafish terms). Count matches per gene as sensory_phenotype_count. Concatenate matched terms as the phenotype_categories string.

2. `score_animal_evidence(df: pl.DataFrame) -> pl.DataFrame` -- Compute animal_model_score_normalized. Scoring formula:
   - base_score = 0 if no phenotypes.
   - For each organism with sensory phenotypes: mouse +0.4 (weighted by ortholog confidence: HIGH=1.0, MEDIUM=0.7, LOW=0.4), zebrafish +0.3 (same confidence weighting), IMPC +0.3 (independent confirmation bonus).
   - Clamp to [0, 1].
   - Multiply by log2(sensory_phenotype_count + 1) / log2(max_count + 1) to reward more phenotypes.
   - NULL if no ortholog mapping exists.

3. `process_animal_model_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch orthologs -> fetch MGI -> fetch ZFIN -> fetch IMPC -> join on orthologs -> filter sensory -> score -> collect.

Follow established patterns: NULL preservation (no ortholog = NULL, not zero), structlog logging. Handle one-to-many orthologs: take best confidence, aggregate phenotypes across all orthologs for that human gene.

`cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.animal_models import fetch_ortholog_mapping, fetch_mgi_phenotypes, fetch_zfin_phenotypes, fetch_impc_phenotypes, filter_sensory_phenotypes, score_animal_evidence, process_animal_model_evidence; print('imports OK')"`

Animal model fetch retrieves orthologs from HCOP and phenotypes from MGI, ZFIN, IMPC. Transform filters for sensory/cilia relevance and scores with ortholog confidence weighting. All functions importable.

Task 2: Create animal model DuckDB loader, CLI command, and tests

src/usher_pipeline/evidence/animal_models/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_animal_models.py
tests/test_animal_models_integration.py

**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "animal_model_phenotypes" table. Record provenance: genes with mouse orthologs, genes with zebrafish orthologs, genes with sensory phenotypes, ortholog confidence distribution, mean sensory phenotype count. Create `query_sensory_phenotype_genes(store, min_score=0.3) -> pl.DataFrame` helper.

**evidence_cmd.py**: Add `animal-models` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('animal_model_phenotypes')), --force flag, load gene universe for gene_ids, call process_animal_model_evidence, load to DuckDB, save provenance sidecar to data/animal_models/phenotypes.provenance.json. Display summary: ortholog coverage, sensory phenotype counts by organism, top scoring genes.
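The provenance statistics listed for load.py could be computed along the following lines. This is a minimal sketch in plain Python rather than polars, so it runs without dependencies; the helper name `summarize_animal_evidence`, the output key names, and the choice to report only the mouse confidence distribution are assumptions, not part of the existing pipeline.

```python
from collections import Counter
from statistics import mean


def summarize_animal_evidence(rows: list[dict]) -> dict:
    """Hypothetical helper: provenance statistics for animal model evidence rows.

    Each row is a dict keyed by the AnimalModelRecord field names.
    None values are treated as "unknown" and are excluded, never counted as zero.
    """
    # Only genes with a known sensory phenotype count contribute to the mean.
    counts = [
        r["sensory_phenotype_count"]
        for r in rows
        if r.get("sensory_phenotype_count") is not None
    ]
    # Distribution of mouse ortholog confidence tiers (HIGH/MEDIUM/LOW).
    confidence = Counter(
        r["mouse_ortholog_confidence"]
        for r in rows
        if r.get("mouse_ortholog_confidence") is not None
    )
    return {
        "genes_with_mouse_ortholog": sum(
            r.get("mouse_ortholog") is not None for r in rows
        ),
        "genes_with_zebrafish_ortholog": sum(
            r.get("zebrafish_ortholog") is not None for r in rows
        ),
        "genes_with_sensory_phenotypes": sum(c > 0 for c in counts),
        "mouse_confidence_distribution": dict(confidence),
        "mean_sensory_phenotype_count": mean(counts) if counts else None,
    }
```

Keeping the summary as a pure function over rows makes it easy to test with synthetic data before wiring it into the real provenance object.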
**tests/test_animal_models.py**: Unit tests with synthetic data. Mock httpx for all downloads. Test cases:

- test_ortholog_confidence_high: 8+ supporting sources -> HIGH
- test_ortholog_confidence_low: 1-3 sources -> LOW
- test_one_to_many_best_selected: Multiple mouse orthologs -> highest confidence kept
- test_sensory_keyword_match: "hearing loss" phenotype matches SENSORY_MP_KEYWORDS
- test_non_sensory_filtered: "increased body weight" phenotype filtered out
- test_score_with_confidence_weighting: HIGH confidence ortholog scores higher than LOW
- test_score_null_no_ortholog: Gene without ortholog -> NULL score
- test_multi_organism_bonus: Phenotypes in both mouse and zebrafish -> higher score
- test_phenotype_count_scaling: More sensory phenotypes -> higher score (diminishing returns via log)
- test_impc_integration: IMPC phenotypes contribute to score

**tests/test_animal_models_integration.py**: Integration tests. Mock HCOP download, MGI/ZFIN/IMPC responses. Test full pipeline, checkpoint-restart, provenance. Synthetic phenotype report fixtures.

`cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v`

All animal model unit and integration tests pass. CLI `evidence animal-models` command registered. DuckDB stores animal_model_phenotypes table with ortholog mappings, phenotype summaries, and confidence-weighted scores. Checkpoint-restart works.
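The transform rules exercised by the unit tests above (keyword filtering and confidence-weighted scoring) can be sketched per gene in plain Python. This is an illustration only: the real functions operate on polars DataFrames, and `matched_keywords` and `score_gene` are hypothetical names. Three readings of the plan are assumed here and flagged in the comments: `max_count` is the largest sensory_phenotype_count across all scored genes, the IMPC bonus is not confidence-weighted, and "no ortholog mapping" means neither a mouse nor a zebrafish ortholog.

```python
import math
from typing import Optional

SENSORY_MP_KEYWORDS = [
    "hearing", "deaf", "vestibular", "balance", "retina", "photoreceptor",
    "vision", "blind", "cochlea", "stereocilia", "cilia", "cilium",
    "flagellum", "situs inversus", "laterality", "hydrocephalus",
    "kidney cyst", "polycystic",
]
CONFIDENCE_WEIGHT = {"HIGH": 1.0, "MEDIUM": 0.7, "LOW": 0.4}


def matched_keywords(term: str, keywords: list[str]) -> list[str]:
    """Case-insensitive substring match of a phenotype term against keywords."""
    lowered = term.lower()
    return [k for k in keywords if k in lowered]


def score_gene(
    *,
    mouse_conf: Optional[str],
    zebrafish_conf: Optional[str],
    mouse_sensory: bool,
    zebrafish_sensory: bool,
    impc_sensory: bool,
    sensory_count: int,
    max_count: int,
) -> Optional[float]:
    """Hypothetical per-gene version of the scoring formula."""
    # Assumption: NULL only when neither organism has an ortholog mapping.
    if mouse_conf is None and zebrafish_conf is None:
        return None
    base = 0.0
    if mouse_sensory and mouse_conf is not None:
        base += 0.4 * CONFIDENCE_WEIGHT[mouse_conf]
    if zebrafish_sensory and zebrafish_conf is not None:
        base += 0.3 * CONFIDENCE_WEIGHT[zebrafish_conf]
    if impc_sensory:
        base += 0.3  # Assumption: independent bonus, not confidence-weighted.
    base = min(max(base, 0.0), 1.0)
    if max_count <= 0:
        return 0.0  # No sensory phenotypes anywhere; avoids dividing by log2(1) = 0.
    # Assumption: max_count is the largest count across all scored genes.
    return base * math.log2(sensory_count + 1) / math.log2(max_count + 1)
```

Because substring matching is used, a term such as "abnormal stereocilia morphology" matches both "stereocilia" and "cilia"; whether that should count once or twice toward sensory_phenotype_count is a design choice the plan leaves open.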
- `python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.animal_models import *"` -- all exports importable
- `usher-pipeline evidence animal-models --help` -- CLI help displays
- DuckDB animal_model_phenotypes table has columns: gene_id, gene_symbol, mouse_ortholog, mouse_ortholog_confidence, sensory_phenotype_count, phenotype_categories, animal_model_score_normalized

- ANIM-01: Phenotypes retrieved from MGI (mouse), ZFIN (zebrafish), and IMPC via bulk downloads and API
- ANIM-02: Phenotypes filtered for sensory/balance/vision/hearing/cilia relevance via keyword matching
- ANIM-03: Ortholog mapping via HCOP with confidence scoring (HIGH/MEDIUM/LOW), one-to-many handled by taking best confidence
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure

After completion, create `.planning/phases/03-core-evidence-layers/03-05-SUMMARY.md`
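
For reference, the ANIM-03 rule (HCOP support count mapped to a confidence tier, with the best-supported ortholog kept for one-to-many mappings) might look like the sketch below. It is plain Python over synthetic tuples, not the polars implementation; `confidence_tier` and `best_orthologs` are hypothetical names, the gene IDs in the usage example are placeholders, and breaking ties by input order is an assumption the plan does not specify.

```python
from typing import Iterable, Optional


def confidence_tier(support_count: int) -> Optional[str]:
    """Map the number of agreeing HCOP sources to HIGH/MEDIUM/LOW."""
    if support_count >= 8:
        return "HIGH"
    if support_count >= 4:
        return "MEDIUM"
    if support_count >= 1:
        return "LOW"
    return None  # No supporting source: treat as no ortholog.


def best_orthologs(
    candidates: Iterable[tuple[str, str, int]],
) -> dict[str, dict]:
    """Pick one ortholog per human gene from (gene_id, ortholog, support_count)."""
    grouped: dict[str, list[tuple[int, str]]] = {}
    for gene_id, ortholog, support in candidates:
        grouped.setdefault(gene_id, []).append((support, ortholog))

    result: dict[str, dict] = {}
    for gene_id, hits in grouped.items():
        # max() returns the first maximal item, so ties keep input order (assumption).
        support, ortholog = max(hits, key=lambda hit: hit[0])
        result[gene_id] = {
            "ortholog": ortholog,
            "confidence": confidence_tier(support),
            "ortholog_count": len(hits),  # Flags one-to-many mappings.
        }
    return result


# Usage with synthetic data: GENE_A has two candidate orthologs.
mapping = best_orthologs([
    ("GENE_A", "orth_a1", 10),
    ("GENE_A", "orth_a2", 3),
    ("GENE_B", "orth_b1", 5),
])
```

Keeping `ortholog_count` alongside the selected ortholog preserves the information that a gene had competing candidates, which downstream steps can use when aggregating phenotypes.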