2026-02-11 18:46:28 +08:00

---
phase: 03-core-evidence-layers
plan: 05
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/evidence/animal_models/__init__.py
  - src/usher_pipeline/evidence/animal_models/models.py
  - src/usher_pipeline/evidence/animal_models/fetch.py
  - src/usher_pipeline/evidence/animal_models/transform.py
  - src/usher_pipeline/evidence/animal_models/load.py
  - tests/test_animal_models.py
  - tests/test_animal_models_integration.py
  - src/usher_pipeline/cli/evidence_cmd.py
autonomous: true
must_haves:
  truths:
    - "Pipeline retrieves gene knockout/perturbation phenotypes from MGI, ZFIN, and IMPC"
    - "Phenotypes are filtered for relevance to sensory function, balance, vision, hearing, and cilia morphology"
    - "Ortholog mapping uses established databases with confidence scoring, handling one-to-many mappings"
  artifacts:
    - path: "src/usher_pipeline/evidence/animal_models/fetch.py"
      provides: "MGI, ZFIN, and IMPC phenotype data retrieval"
      exports:
        - fetch_mgi_phenotypes
        - fetch_zfin_phenotypes
        - fetch_impc_phenotypes
        - fetch_ortholog_mapping
    - path: "src/usher_pipeline/evidence/animal_models/transform.py"
      provides: "Phenotype relevance filtering and scoring"
      exports:
        - filter_sensory_phenotypes
        - score_animal_evidence
        - process_animal_model_evidence
    - path: "src/usher_pipeline/evidence/animal_models/load.py"
      provides: "DuckDB persistence for animal model evidence"
      exports:
        - load_to_duckdb
    - path: "tests/test_animal_models.py"
      provides: "Unit tests for phenotype filtering and ortholog handling"
  key_links:
    - from: "src/usher_pipeline/evidence/animal_models/fetch.py"
      to: "MGI/IMPC bulk data and ZFIN API"
      via: "httpx with tenacity retry for bulk downloads"
      pattern: "informatics.jax.org|mousephenotype.org|zfin.org"
    - from: "src/usher_pipeline/evidence/animal_models/fetch.py"
      to: "HCOP ortholog database"
      via: "HCOP bulk download from HGNC"
      pattern: "genenames.org.*hcop"
    - from: "src/usher_pipeline/evidence/animal_models/transform.py"
      to: "src/usher_pipeline/evidence/animal_models/fetch.py"
      via: "filters phenotypes and scores animal model evidence"
      pattern: "filter_sensory_phenotypes|score_animal_evidence"
    - from: "src/usher_pipeline/evidence/animal_models/load.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "store.save_dataframe"
      pattern: "save_dataframe.*animal_model_phenotypes"
---

Implement the Animal Model Phenotypes evidence layer (ANIM-01/02/03): retrieve knockout/perturbation phenotypes from MGI (mouse), ZFIN (zebrafish), and IMPC, map orthologs with confidence scoring, filter for sensory/cilia relevance, and score.

Purpose: Animal model phenotypes provide functional evidence -- genes whose orthologs cause sensory, balance, hearing, or cilia defects in mouse/zebrafish are strong candidates. Ortholog confidence prevents false positives from paralog mis-mapping.

Output: animal_model_phenotypes DuckDB table with per-gene ortholog mapping, phenotype summaries, sensory relevance flags, and normalized animal model score.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py

Task 1: Create animal model evidence data model, fetch (orthologs + phenotypes), and transform

Files: src/usher_pipeline/evidence/animal_models/__init__.py, src/usher_pipeline/evidence/animal_models/models.py, src/usher_pipeline/evidence/animal_models/fetch.py, src/usher_pipeline/evidence/animal_models/transform.py

Create the animal model evidence layer following the established fetch->transform pattern.

**models.py**: Define AnimalModelRecord pydantic model with fields: gene_id (str), gene_symbol (str), mouse_ortholog (str|None), mouse_ortholog_confidence (str|None -- "HIGH"/"MEDIUM"/"LOW" based on HCOP source count), zebrafish_ortholog (str|None), zebrafish_ortholog_confidence (str|None), has_mouse_phenotype (bool|None), has_zebrafish_phenotype (bool|None), has_impc_phenotype (bool|None), sensory_phenotype_count (int|None -- number of sensory-relevant phenotypes), phenotype_categories (str|None -- semicolon-separated list of matched MP/ZP terms), animal_model_score_normalized (float|None -- 0-1 composite). Define ANIMAL_TABLE_NAME = "animal_model_phenotypes". Define SENSORY_MP_KEYWORDS (Mammalian Phenotype ontology terms): ["hearing", "deaf", "vestibular", "balance", "retina", "photoreceptor", "vision", "blind", "cochlea", "stereocilia", "cilia", "cilium", "flagellum", "situs inversus", "laterality", "hydrocephalus", "kidney cyst", "polycystic"]. Define SENSORY_ZP_KEYWORDS similarly for zebrafish phenotype terms.

**fetch.py**: Four functions:
1. `fetch_ortholog_mapping(gene_ids: list[str]) -> pl.DataFrame` -- Download HCOP ortholog data from HGNC: https://ftp.ebi.ac.uk/pub/databases/genenames/hcop/human_mouse_hcop_fifteen_column.txt.gz and human_zebrafish equivalent. Parse with polars. Extract columns: human gene ID, ortholog ID, ortholog symbol, support count (number of databases agreeing). Assign confidence: HIGH (8+ sources), MEDIUM (4-7), LOW (1-3). For one-to-many mappings: keep the ortholog with highest support count. Flag genes with multiple orthologs (ortholog_count column). Return DataFrame with gene_id, mouse_ortholog, mouse_ortholog_confidence, zebrafish_ortholog, zebrafish_ortholog_confidence.
2. `fetch_mgi_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Download MGI gene-phenotype report from the MGI reports download area: https://www.informatics.jax.org/downloads/reports/MGI_GenePheno.rpt (or HMD_HumanPhenotype.rpt for human orthologs). Parse tab-separated report. Extract: mouse gene ID, allele symbol, MP term name, MP term ID. Return DataFrame with mouse gene and phenotype terms. Use httpx streaming download with retry.
3. `fetch_zfin_phenotypes(zebrafish_gene_ids: list[str]) -> pl.DataFrame` -- Download ZFIN phenotype data from: https://zfin.org/downloads/phenoGeneCleanData_fish.txt. Parse tab-separated. Extract: zebrafish gene, phenotype terms. Return DataFrame. Use httpx streaming download with retry.
4. `fetch_impc_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Query IMPC SOLR API: https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=marker_symbol:{gene}&rows=1000. Process in batches (50 genes at a time with rate limiting). Extract: gene symbol, MP term, p-value, effect size. Return DataFrame. Use httpx with tenacity retry, ratelimit (5 req/sec conservative). If IMPC API is unreliable, fall back to bulk download from https://www.mousephenotype.org/data/release.
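
The confidence tiers and the one-to-many rule from item 1 can be sketched in plain Python. This is only an illustration, not the polars implementation the plan calls for: the row layout (`gene_id`, `ortholog`, `support_count`) and both helper names are hypothetical, and it assumes the support count has already been derived from the HCOP download.

```python
from __future__ import annotations


def confidence_from_support(support_count: int) -> str | None:
    """Map a support count to the plan's tiers: HIGH 8+, MEDIUM 4-7, LOW 1-3."""
    if support_count >= 8:
        return "HIGH"
    if support_count >= 4:
        return "MEDIUM"
    if support_count >= 1:
        return "LOW"
    return None  # assumption: zero supporting sources is treated as unmapped


def best_orthologs(rows: list[dict]) -> dict[str, dict]:
    """Keep the best-supported ortholog per human gene and count all candidates.

    Each row is a hypothetical dict with keys gene_id, ortholog, support_count.
    """
    best: dict[str, dict] = {}
    counts: dict[str, int] = {}
    for row in rows:
        gene = row["gene_id"]
        counts[gene] = counts.get(gene, 0) + 1
        current = best.get(gene)
        if current is None or row["support_count"] > current["support_count"]:
            best[gene] = row
    return {
        gene: {
            "ortholog": row["ortholog"],
            "confidence": confidence_from_support(row["support_count"]),
            "ortholog_count": counts[gene],  # flags one-to-many mappings
        }
        for gene, row in best.items()
    }


# Synthetic rows: gene A has two candidate orthologs, gene B has one.
rows = [
    {"gene_id": "ENSG_A", "ortholog": "GeneA1", "support_count": 10},
    {"gene_id": "ENSG_A", "ortholog": "GeneA2", "support_count": 3},
    {"gene_id": "ENSG_B", "ortholog": "GeneB1", "support_count": 5},
]
result = best_orthologs(rows)
```

Ties on support count keep the first row seen; the plan does not specify a tie-break, so that choice is an assumption.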

**transform.py**: Three functions:
1. `filter_sensory_phenotypes(phenotype_df: pl.DataFrame, keywords: list[str]) -> pl.DataFrame` -- Filter phenotype terms for sensory/cilia relevance using keyword matching (case-insensitive substring). Return only rows where MP/ZP term matches any keyword in SENSORY_MP_KEYWORDS or SENSORY_ZP_KEYWORDS. Count matches per gene as sensory_phenotype_count. Concatenate matched terms as phenotype_categories string.
2. `score_animal_evidence(df: pl.DataFrame) -> pl.DataFrame` -- Compute animal_model_score_normalized. Scoring formula: base_score = 0 if no phenotypes. For each organism with sensory phenotypes: mouse +0.4 (weighted by ortholog confidence: HIGH=1.0, MEDIUM=0.7, LOW=0.4), zebrafish +0.3 (same confidence weighting), IMPC +0.3 (independent confirmation bonus). Clamp to [0, 1]. Multiply by log2(sensory_phenotype_count + 1) / log2(max_count + 1) to reward more phenotypes. NULL if no ortholog mapping exists.
3. `process_animal_model_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch orthologs -> fetch MGI -> fetch ZFIN -> fetch IMPC -> join on orthologs -> filter sensory -> score -> collect.
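
As a worked reading of the scoring formula in item 2, the sketch below computes the score for a single gene in plain Python. Several points are assumptions where the plan is not explicit: the IMPC bonus is not weighted by ortholog confidence, a gene with an ortholog but no sensory phenotypes scores 0.0 rather than NULL, and `max_count` is the largest `sensory_phenotype_count` in the gene set. The function name and argument layout are hypothetical.

```python
from __future__ import annotations

import math

CONFIDENCE_WEIGHT = {"HIGH": 1.0, "MEDIUM": 0.7, "LOW": 0.4}


def animal_model_score(
    mouse_conf: str | None,
    zebrafish_conf: str | None,
    mouse_sensory: bool,
    zebrafish_sensory: bool,
    impc_sensory: bool,
    sensory_count: int,
    max_count: int,
) -> float | None:
    # NULL preservation: no ortholog in either organism means no score at all.
    if mouse_conf is None and zebrafish_conf is None:
        return None
    base = 0.0
    if mouse_sensory and mouse_conf is not None:
        base += 0.4 * CONFIDENCE_WEIGHT[mouse_conf]
    if zebrafish_sensory and zebrafish_conf is not None:
        base += 0.3 * CONFIDENCE_WEIGHT[zebrafish_conf]
    if impc_sensory:
        base += 0.3  # assumption: flat bonus, not confidence-weighted
    base = min(max(base, 0.0), 1.0)  # clamp to [0, 1]
    if sensory_count <= 0 or max_count <= 0:
        return 0.0  # ortholog exists but nothing sensory was observed
    # Log scaling gives diminishing returns for additional phenotypes.
    scale = math.log2(sensory_count + 1) / math.log2(max_count + 1)
    return base * scale


# All three sources agree, high-confidence orthologs, maximum phenotype count.
full = animal_model_score("HIGH", "HIGH", True, True, True, 7, 7)
# Only a low-confidence mouse ortholog carries sensory phenotypes.
low = animal_model_score("LOW", None, True, False, False, 7, 7)
# One phenotype out of a maximum of seven: scale is log2(2) / log2(8) = 1/3.
single = animal_model_score("HIGH", None, True, False, False, 1, 7)
```

With these weights the maximum base score is 0.4 + 0.3 + 0.3 = 1.0, so the clamp only matters if the weights are later changed.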

Follow established patterns: NULL preservation (no ortholog = NULL, not zero), structlog logging. Handle one-to-many orthologs: take best confidence, aggregate phenotypes across all orthologs for that human gene.
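
The keyword filter and the aggregation across orthologs might look like the following plain-Python sketch. It uses a small subset of the keyword list, counts distinct matched terms per human gene, and returns `None` values when there is no ortholog; the distinct-term counting, the `None` for an empty category string, and the function names are assumptions rather than part of the plan.

```python
from __future__ import annotations

# Subset of SENSORY_MP_KEYWORDS from the plan, for illustration only.
SENSORY_KEYWORDS = ["hearing", "deaf", "vestibular", "retina", "cilia"]


def is_sensory(term: str, keywords: list[str]) -> bool:
    """Case-insensitive substring match against any keyword."""
    lowered = term.lower()
    return any(keyword in lowered for keyword in keywords)


def summarize_gene(
    ortholog_terms: dict[str, list[str]] | None, keywords: list[str]
) -> dict:
    """Aggregate sensory terms across all orthologs of one human gene.

    ortholog_terms maps each ortholog to its phenotype term names;
    None means the gene has no ortholog mapping.
    """
    if ortholog_terms is None:
        # NULL preservation: no ortholog is unknown, not zero.
        return {"sensory_phenotype_count": None, "phenotype_categories": None}
    matched = sorted(
        {
            term
            for terms in ortholog_terms.values()
            for term in terms
            if is_sensory(term, keywords)
        }
    )
    return {
        "sensory_phenotype_count": len(matched),
        "phenotype_categories": ";".join(matched) or None,
    }


summary = summarize_gene(
    {
        "GeneA1": ["hearing loss", "increased body weight"],
        "GeneA2": ["abnormal retina morphology", "hearing loss"],
    },
    SENSORY_KEYWORDS,
)
```

One caveat of substring matching is worth keeping in mind when choosing keywords: a keyword such as "balance" also matches any term containing "imbalance".
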
Verify: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.animal_models import fetch_ortholog_mapping, fetch_mgi_phenotypes, fetch_zfin_phenotypes, fetch_impc_phenotypes, filter_sensory_phenotypes, score_animal_evidence, process_animal_model_evidence; print('imports OK')"`

Done: Animal model fetch retrieves orthologs from HCOP and phenotypes from MGI, ZFIN, IMPC. Transform filters for sensory/cilia relevance and scores with ortholog confidence weighting. All functions importable.

Task 2: Create animal model DuckDB loader, CLI command, and tests

Files: src/usher_pipeline/evidence/animal_models/load.py, src/usher_pipeline/cli/evidence_cmd.py, tests/test_animal_models.py, tests/test_animal_models_integration.py

**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "animal_model_phenotypes" table. Record provenance: genes with mouse orthologs, genes with zebrafish orthologs, genes with sensory phenotypes, ortholog confidence distribution, mean sensory phenotype count. Create `query_sensory_phenotype_genes(store, min_score=0.3) -> pl.DataFrame` helper.

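
The provenance statistics listed for load.py could be computed along these lines. This is a plain-Python sketch over a list of record dicts, not the project's provenance API: the function name and the output key names are hypothetical, and only the mouse confidence distribution is shown.

```python
from __future__ import annotations

from collections import Counter


def provenance_stats(records: list[dict]) -> dict:
    """Summarize ortholog coverage and sensory phenotype counts."""
    mouse = [r for r in records if r["mouse_ortholog"] is not None]
    zebrafish = [r for r in records if r["zebrafish_ortholog"] is not None]
    # Skip NULL counts so unmapped genes do not drag the mean toward zero.
    counts = [
        r["sensory_phenotype_count"]
        for r in records
        if r["sensory_phenotype_count"] is not None
    ]
    return {
        "genes_with_mouse_ortholog": len(mouse),
        "genes_with_zebrafish_ortholog": len(zebrafish),
        "genes_with_sensory_phenotypes": sum(1 for c in counts if c > 0),
        "mouse_confidence_distribution": dict(
            Counter(r["mouse_ortholog_confidence"] for r in mouse)
        ),
        "mean_sensory_phenotype_count": sum(counts) / len(counts) if counts else None,
    }


# Synthetic records: one well-mapped gene, one mouse-only gene, one unmapped gene.
records = [
    {"mouse_ortholog": "GeneA1", "mouse_ortholog_confidence": "HIGH",
     "zebrafish_ortholog": "genea", "sensory_phenotype_count": 3},
    {"mouse_ortholog": "GeneB1", "mouse_ortholog_confidence": "LOW",
     "zebrafish_ortholog": None, "sensory_phenotype_count": 0},
    {"mouse_ortholog": None, "mouse_ortholog_confidence": None,
     "zebrafish_ortholog": None, "sensory_phenotype_count": None},
]
stats = provenance_stats(records)
```
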
**evidence_cmd.py**: Add `animal-models` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('animal_model_phenotypes')), --force flag, load gene universe for gene_ids, call process_animal_model_evidence, load to DuckDB, save provenance sidecar to data/animal_models/phenotypes.provenance.json. Display summary: ortholog coverage, sensory phenotype counts by organism, top scoring genes.
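
The checkpoint and `--force` behaviour of the subcommand reduces to a small control flow, sketched here without the CLI framework. `InMemoryStore` is a hypothetical stand-in for the project's DuckDB store; the method names `has_checkpoint` and `save_dataframe` come from the plan, but their real signatures are assumed.

```python
from __future__ import annotations

from typing import Callable

TABLE = "animal_model_phenotypes"


class InMemoryStore:
    """Hypothetical stand-in for the DuckDB store, for illustration only."""

    def __init__(self) -> None:
        self.tables: dict[str, list[dict]] = {}

    def has_checkpoint(self, table: str) -> bool:
        return table in self.tables

    def save_dataframe(self, rows: list[dict], table: str) -> None:
        self.tables[table] = rows


def run_animal_models(
    store: InMemoryStore,
    gene_ids: list[str],
    process: Callable[[list[str]], list[dict]],
    force: bool = False,
) -> str:
    # Skip the expensive fetch when a checkpoint exists, unless forced.
    if store.has_checkpoint(TABLE) and not force:
        return "skipped"
    rows = process(gene_ids)
    store.save_dataframe(rows, TABLE)
    return "loaded"


calls: list[list[str]] = []


def fake_process(gene_ids: list[str]) -> list[dict]:
    """Records each invocation instead of downloading anything."""
    calls.append(list(gene_ids))
    return [{"gene_id": gene_id} for gene_id in gene_ids]


store = InMemoryStore()
outcomes = [
    run_animal_models(store, ["G1", "G2"], fake_process),              # first run
    run_animal_models(store, ["G1", "G2"], fake_process),              # checkpoint hit
    run_animal_models(store, ["G1", "G2"], fake_process, force=True),  # forced rerun
]
```

The same stub pattern is one way to exercise checkpoint-restart in the integration tests without touching the network.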

**tests/test_animal_models.py**: Unit tests with synthetic data. Mock httpx for all downloads. Test cases:
- test_ortholog_confidence_high: 8+ supporting sources -> HIGH
- test_ortholog_confidence_low: 1-3 sources -> LOW
- test_one_to_many_best_selected: Multiple mouse orthologs -> highest confidence kept
- test_sensory_keyword_match: "hearing loss" phenotype matches SENSORY_MP_KEYWORDS
- test_non_sensory_filtered: "increased body weight" phenotype filtered out
- test_score_with_confidence_weighting: HIGH confidence ortholog scores higher than LOW
- test_score_null_no_ortholog: Gene without ortholog -> NULL score
- test_multi_organism_bonus: Phenotypes in both mouse and zebrafish -> higher score
- test_phenotype_count_scaling: More sensory phenotypes -> higher score (diminishing returns via log)
- test_impc_integration: IMPC phenotypes contribute to score

**tests/test_animal_models_integration.py**: Integration tests. Mock HCOP download, MGI/ZFIN/IMPC responses. Test full pipeline, checkpoint-restart, provenance. Synthetic phenotype report fixtures.
Verify: `cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v`

Done: All animal model unit and integration tests pass. CLI `evidence animal-models` command registered. DuckDB stores animal_model_phenotypes table with ortholog mappings, phenotype summaries, and confidence-weighted scores. Checkpoint-restart works.

Verification:
- `python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.animal_models import *"` -- all exports importable
- `usher-pipeline evidence animal-models --help` -- CLI help displays
- DuckDB animal_model_phenotypes table has columns: gene_id, gene_symbol, mouse_ortholog, mouse_ortholog_confidence, sensory_phenotype_count, phenotype_categories, animal_model_score_normalized

<success_criteria>

- ANIM-01: Phenotypes retrieved from MGI (mouse), ZFIN (zebrafish), and IMPC via bulk downloads and API
- ANIM-02: Phenotypes filtered for sensory/balance/vision/hearing/cilia relevance via keyword matching
- ANIM-03: Ortholog mapping via HCOP with confidence scoring (HIGH/MEDIUM/LOW), one-to-many handled by taking best confidence
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure

</success_criteria>

After completion, create `.planning/phases/03-core-evidence-layers/03-05-SUMMARY.md`