| phase | plan | type | wave | depends_on | files_modified | autonomous | must_haves |
|---|---|---|---|---|---|---|---|
| 03-core-evidence-layers | 05 | execute | 1 | | | true | |
Purpose: Animal model phenotypes provide functional evidence -- genes whose orthologs cause sensory, balance, hearing, or cilia defects in mouse/zebrafish are strong candidates. Ortholog confidence prevents false positives from paralog mis-mapping.
Output: animal_model_phenotypes DuckDB table with per-gene ortholog mapping, phenotype summaries, sensory relevance flags, and normalized animal model score.
<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
Task 1: Create animal model evidence data model, fetch (orthologs + phenotypes), and transform
src/usher_pipeline/evidence/animal_models/__init__.py
src/usher_pipeline/evidence/animal_models/models.py
src/usher_pipeline/evidence/animal_models/fetch.py
src/usher_pipeline/evidence/animal_models/transform.py
Create the animal model evidence layer following the established fetch->transform pattern.
**models.py**: Define AnimalModelRecord pydantic model with fields: gene_id (str), gene_symbol (str), mouse_ortholog (str|None), mouse_ortholog_confidence (str|None -- "HIGH"/"MEDIUM"/"LOW" based on HCOP source count), zebrafish_ortholog (str|None), zebrafish_ortholog_confidence (str|None), has_mouse_phenotype (bool|None), has_zebrafish_phenotype (bool|None), has_impc_phenotype (bool|None), sensory_phenotype_count (int|None -- number of sensory-relevant phenotypes), phenotype_categories (str|None -- semicolon-separated list of matched MP/ZP terms), animal_model_score_normalized (float|None -- 0-1 composite). Define ANIMAL_TABLE_NAME = "animal_model_phenotypes". Define SENSORY_MP_KEYWORDS (Mammalian Phenotype ontology terms): ["hearing", "deaf", "vestibular", "balance", "retina", "photoreceptor", "vision", "blind", "cochlea", "stereocilia", "cilia", "cilium", "flagellum", "situs inversus", "laterality", "hydrocephalus", "kidney cyst", "polycystic"]. Define SENSORY_ZP_KEYWORDS similarly for zebrafish phenotype terms.
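The keyword list is meant for case-insensitive substring matching (as the transform step describes), so it helps to see how it behaves on real-looking term names. The following is a minimal plain-Python sketch, not the planned polars implementation; the helper name `matched_keywords` is hypothetical.

```python
# Keyword list copied from the plan text above.
SENSORY_MP_KEYWORDS = [
    "hearing", "deaf", "vestibular", "balance", "retina", "photoreceptor",
    "vision", "blind", "cochlea", "stereocilia", "cilia", "cilium",
    "flagellum", "situs inversus", "laterality", "hydrocephalus",
    "kidney cyst", "polycystic",
]


def matched_keywords(term: str, keywords: list[str]) -> list[str]:
    """Return every keyword that occurs as a substring of the term.

    Hypothetical helper: matching is case-insensitive and keeps the
    keyword list order.
    """
    lowered = term.lower()
    return [kw for kw in keywords if kw in lowered]


if __name__ == "__main__":
    for term in (
        "Hearing loss",
        "increased body weight",
        "abnormal cochlear hair cell stereocilia morphology",
    ):
        print(term, "->", matched_keywords(term, SENSORY_MP_KEYWORDS))
```

Note that one term can hit several keywords ("stereocilia" also contains "cilia"), so sensory_phenotype_count is probably best counted per matched term rather than per matched keyword; the plan does not specify which.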
**fetch.py**: Four functions:
1. `fetch_ortholog_mapping(gene_ids: list[str]) -> pl.DataFrame` -- Download HCOP ortholog data from HGNC: https://ftp.ebi.ac.uk/pub/databases/genenames/hcop/human_mouse_hcop_fifteen_column.txt.gz and human_zebrafish equivalent. Parse with polars. Extract columns: human gene ID, ortholog ID, ortholog symbol, support count (number of databases agreeing). Assign confidence: HIGH (8+ sources), MEDIUM (4-7), LOW (1-3). For one-to-many mappings: keep the ortholog with highest support count. Flag genes with multiple orthologs (ortholog_count column). Return DataFrame with gene_id, mouse_ortholog, mouse_ortholog_confidence, zebrafish_ortholog, zebrafish_ortholog_confidence.
2. `fetch_mgi_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Download MGI gene-phenotype report from MGI FTP: https://www.informatics.jax.org/downloads/reports/MGI_GenePheno.rpt (or HMD_HumanPhenotype.rpt for human orthologs). Parse tab-separated report. Extract: mouse gene ID, allele symbol, MP term name, MP term ID. Return DataFrame with mouse gene and phenotype terms. Use httpx streaming download with retry.
3. `fetch_zfin_phenotypes(zebrafish_gene_ids: list[str]) -> pl.DataFrame` -- Download ZFIN phenotype data from: https://zfin.org/downloads/phenoGeneCleanData_fish.txt. Parse tab-separated. Extract: zebrafish gene, phenotype terms. Return DataFrame. Use httpx streaming download with retry.
4. `fetch_impc_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Query IMPC SOLR API: https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=marker_symbol:{gene}&rows=1000. Process in batches (50 genes at a time with rate limiting). Extract: gene symbol, MP term, p-value, effect size. Return DataFrame. Use httpx with tenacity retry, ratelimit (5 req/sec conservative). If IMPC API is unreliable, fall back to bulk download from https://www.mousephenotype.org/data/release.
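The confidence tiers and the one-to-many rule from `fetch_ortholog_mapping` can be sketched without any I/O. This is an illustrative plain-Python version under stated assumptions: the input is already reduced to (human gene ID, ortholog symbol, support count) tuples, the function names `confidence_tier` and `best_orthologs` are hypothetical, and the example rows are made up. How the support count is derived from the HCOP file is not shown.

```python
from collections import defaultdict


def confidence_tier(support_count: int) -> str:
    """Map an HCOP support count to the plan's tiers.

    HIGH is 8+ sources, MEDIUM is 4-7, LOW is 1-3. Counts below 1 are
    not expected; they fall through to LOW here.
    """
    if support_count >= 8:
        return "HIGH"
    if support_count >= 4:
        return "MEDIUM"
    return "LOW"


def best_orthologs(rows):
    """Keep the best-supported ortholog per human gene.

    rows: iterable of (human_gene_id, ortholog_symbol, support_count).
    Ties keep the first candidate seen. ortholog_count records how many
    candidates existed, so one-to-many genes stay flagged.
    """
    grouped = defaultdict(list)
    for gene_id, ortholog, support in rows:
        grouped[gene_id].append((ortholog, support))

    result = {}
    for gene_id, candidates in grouped.items():
        ortholog, support = max(candidates, key=lambda c: c[1])
        result[gene_id] = {
            "ortholog": ortholog,
            "confidence": confidence_tier(support),
            "ortholog_count": len(candidates),
        }
    return result


if __name__ == "__main__":
    # Made-up rows for illustration only.
    example = [("G1", "Myo7a", 10), ("G1", "Myo7b", 3), ("G2", "Ush2a", 5)]
    print(best_orthologs(example))
```

In the polars implementation the same result would likely come from sorting by support count and keeping the first row per gene, with a separate group count for ortholog_count.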
**transform.py**: Three functions:
1. `filter_sensory_phenotypes(phenotype_df: pl.DataFrame, keywords: list[str]) -> pl.DataFrame` -- Filter phenotype terms for sensory/cilia relevance using keyword matching (case-insensitive substring). Return only rows where MP/ZP term matches any keyword in SENSORY_MP_KEYWORDS or SENSORY_ZP_KEYWORDS. Count matches per gene as sensory_phenotype_count. Concatenate matched terms as phenotype_categories string.
2. `score_animal_evidence(df: pl.DataFrame) -> pl.DataFrame` -- Compute animal_model_score_normalized. Scoring formula: base_score = 0 if no phenotypes. For each organism with sensory phenotypes: mouse +0.4 (weighted by ortholog confidence: HIGH=1.0, MEDIUM=0.7, LOW=0.4), zebrafish +0.3 (same confidence weighting), IMPC +0.3 (independent confirmation bonus). Clamp to [0, 1]. Multiply by log2(sensory_phenotype_count + 1) / log2(max_count + 1) to reward more phenotypes. NULL if no ortholog mapping exists.
3. `process_animal_model_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch orthologs -> fetch MGI -> fetch ZFIN -> fetch IMPC -> join on orthologs -> filter sensory -> score -> collect.
Follow established patterns: NULL preservation (no ortholog = NULL, not zero), structlog logging. Handle one-to-many orthologs: take best confidence, aggregate phenotypes across all orthologs for that human gene.
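The scoring formula reads more easily as a single-gene function. The sketch below is one plain-Python interpretation, not the planned polars code. It assumes the IMPC bonus is not confidence-weighted, that max_count is the largest sensory_phenotype_count across all genes, and that a gene with an ortholog but no sensory phenotypes scores 0.0 rather than NULL. The name `score_gene` is hypothetical.

```python
import math
from typing import Optional

# Confidence weights taken from the plan text.
CONFIDENCE_WEIGHT = {"HIGH": 1.0, "MEDIUM": 0.7, "LOW": 0.4}


def score_gene(
    mouse_conf: Optional[str],
    zebrafish_conf: Optional[str],
    has_mouse: bool,
    has_zebrafish: bool,
    has_impc: bool,
    sensory_count: int,
    max_count: int,
) -> Optional[float]:
    """Return the normalized score, or None when no ortholog is mapped."""
    # NULL preservation: no ortholog in either organism means "unknown".
    if mouse_conf is None and zebrafish_conf is None:
        return None

    base = 0.0
    if has_mouse and mouse_conf is not None:
        base += 0.4 * CONFIDENCE_WEIGHT[mouse_conf]
    if has_zebrafish and zebrafish_conf is not None:
        base += 0.3 * CONFIDENCE_WEIGHT[zebrafish_conf]
    if has_impc:
        base += 0.3  # assumption: bonus applied without confidence weighting
    base = min(max(base, 0.0), 1.0)  # clamp to [0, 1]

    # Guard against log2(1) == 0 in the denominator.
    if sensory_count <= 0 or max_count <= 0:
        return 0.0
    return base * math.log2(sensory_count + 1) / math.log2(max_count + 1)


if __name__ == "__main__":
    print(score_gene("HIGH", None, True, False, False, 1, 3))
    print(score_gene("LOW", None, True, False, False, 1, 3))
    print(score_gene(None, None, False, False, False, 0, 3))
```

Because the log factor is a ratio against the dataset-wide maximum, scores shift when the gene universe changes; recording max_count in provenance would make individual scores reproducible.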
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.animal_models import fetch_ortholog_mapping, fetch_mgi_phenotypes, fetch_zfin_phenotypes, fetch_impc_phenotypes, filter_sensory_phenotypes, score_animal_evidence, process_animal_model_evidence; print('imports OK')"
Animal model fetch retrieves orthologs from HCOP and phenotypes from MGI, ZFIN, IMPC. Transform filters for sensory/cilia relevance and scores with ortholog confidence weighting. All functions importable.
Task 2: Create animal model DuckDB loader, CLI command, and tests
src/usher_pipeline/evidence/animal_models/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_animal_models.py
tests/test_animal_models_integration.py
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "animal_model_phenotypes" table. Record provenance: genes with mouse orthologs, genes with zebrafish orthologs, genes with sensory phenotypes, ortholog confidence distribution, mean sensory phenotype count. Create `query_sensory_phenotype_genes(store, min_score=0.3) -> pl.DataFrame` helper.
**evidence_cmd.py**: Add `animal-models` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('animal_model_phenotypes')), --force flag, load gene universe for gene_ids, call process_animal_model_evidence, load to DuckDB, save provenance sidecar to data/animal_models/phenotypes.provenance.json. Display summary: ortholog coverage, sensory phenotype counts by organism, top scoring genes.
**tests/test_animal_models.py**: Unit tests with synthetic data. Mock httpx for all downloads. Test cases:
- test_ortholog_confidence_high: 8+ supporting sources -> HIGH
- test_ortholog_confidence_low: 1-3 sources -> LOW
- test_one_to_many_best_selected: Multiple mouse orthologs -> highest confidence kept
- test_sensory_keyword_match: "hearing loss" phenotype matches SENSORY_MP_KEYWORDS
- test_non_sensory_filtered: "increased body weight" phenotype filtered out
- test_score_with_confidence_weighting: HIGH confidence ortholog scores higher than LOW
- test_score_null_no_ortholog: Gene without ortholog -> NULL score
- test_multi_organism_bonus: Phenotypes in both mouse and zebrafish -> higher score
- test_phenotype_count_scaling: More sensory phenotypes -> higher score (diminishing returns via log)
- test_impc_integration: IMPC phenotypes contribute to score
**tests/test_animal_models_integration.py**: Integration tests. Mock HCOP download, MGI/ZFIN/IMPC responses. Test full pipeline, checkpoint-restart, provenance. Synthetic phenotype report fixtures.
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v
All animal model unit and integration tests pass. CLI `evidence animal-models` command registered. DuckDB stores animal_model_phenotypes table with ortholog mappings, phenotype summaries, and confidence-weighted scores. Checkpoint-restart works.
- `python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.animal_models import *"` -- all exports importable
- `usher-pipeline evidence animal-models --help` -- CLI help displays
- DuckDB animal_model_phenotypes table has columns: gene_id, gene_symbol, mouse_ortholog, mouse_ortholog_confidence, sensory_phenotype_count, phenotype_categories, animal_model_score_normalized
<success_criteria>
- ANIM-01: Phenotypes retrieved from MGI (mouse), ZFIN (zebrafish), and IMPC via bulk downloads and API
- ANIM-02: Phenotypes filtered for sensory/balance/vision/hearing/cilia relevance via keyword matching
- ANIM-03: Ortholog mapping via HCOP with confidence scoring (HIGH/MEDIUM/LOW), one-to-many handled by taking best confidence
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
</success_criteria>