usher-exploring/.planning/phases/03-core-evidence-layers/03-05-PLAN.md at 71c4e8f736ddfe1a273513fcc20fb54b095ae7df


gbanyan 0d252da348 docs(03): create phase plan

2026-02-11 18:46:28 +08:00


- phase: `03-core-evidence-layers`
- plan:
- type: `execute`
- wave:
- depends_on:
- files_modified:
  - `src/usher_pipeline/evidence/animal_models/__init__.py`
  - `src/usher_pipeline/evidence/animal_models/models.py`
  - `src/usher_pipeline/evidence/animal_models/fetch.py`
  - `src/usher_pipeline/evidence/animal_models/transform.py`
  - `src/usher_pipeline/evidence/animal_models/load.py`
  - `tests/test_animal_models.py`
  - `tests/test_animal_models_integration.py`
  - `src/usher_pipeline/cli/evidence_cmd.py`
- autonomous: `true`
- must_haves:
  - truths:
    - Pipeline retrieves gene knockout/perturbation phenotypes from MGI, ZFIN, and IMPC
    - Phenotypes are filtered for relevance to sensory function, balance, vision, hearing, and cilia morphology
    - Ortholog mapping uses established databases with confidence scoring, handling one-to-many mappings
  - artifacts:
    - path: `src/usher_pipeline/evidence/animal_models/fetch.py`
      - provides: MGI, ZFIN, and IMPC phenotype data retrieval
      - exports: `fetch_mgi_phenotypes`, `fetch_zfin_phenotypes`, `fetch_impc_phenotypes`, `fetch_ortholog_mapping`
    - path: `src/usher_pipeline/evidence/animal_models/transform.py`
      - provides: Phenotype relevance filtering and scoring
      - exports: `filter_sensory_phenotypes`, `score_animal_evidence`, `process_animal_model_evidence`
    - path: `src/usher_pipeline/evidence/animal_models/load.py`
      - provides: DuckDB persistence for animal model evidence
      - exports: `load_to_duckdb`
    - path: `tests/test_animal_models.py`
      - provides: Unit tests for phenotype filtering and ortholog handling
  - key_links:
    - from: `src/usher_pipeline/evidence/animal_models/fetch.py`
      - to: MGI/IMPC bulk data and ZFIN API
      - via: httpx with tenacity retry for bulk downloads
      - pattern: `informatics.jax.org\|mousephenotype.org\|zfin.org`
    - from: `src/usher_pipeline/evidence/animal_models/fetch.py`
      - to: HCOP ortholog database
      - via: HCOP bulk download from HGNC
      - pattern: `genenames.org.*hcop`
    - from: `src/usher_pipeline/evidence/animal_models/transform.py`
      - to: `src/usher_pipeline/evidence/animal_models/fetch.py`
      - via: filters phenotypes and scores animal model evidence
      - pattern: `filter_sensory_phenotypes\|score_animal_evidence`
    - from: `src/usher_pipeline/evidence/animal_models/load.py`
      - to: `src/usher_pipeline/persistence/duckdb_store.py`
      - via: store.save_dataframe
      - pattern: `save_dataframe.*animal_model_phenotypes`

Implement the Animal Model Phenotypes evidence layer (ANIM-01/02/03): retrieve knockout/perturbation phenotypes from MGI (mouse), ZFIN (zebrafish), and IMPC, map orthologs with confidence scoring, filter for sensory/cilia relevance, and score.

Purpose: Animal model phenotypes provide functional evidence -- genes whose orthologs cause sensory, balance, hearing, or cilia defects in mouse/zebrafish are strong candidates. Ortholog confidence prevents false positives from paralog mis-mapping.

Output: animal_model_phenotypes DuckDB table with per-gene ortholog mapping, phenotype summaries, sensory relevance flags, and normalized animal model score.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/03-core-evidence-layers/03-RESEARCH.md @.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md @.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md @src/usher_pipeline/evidence/gnomad/fetch.py @src/usher_pipeline/evidence/gnomad/load.py @src/usher_pipeline/cli/evidence_cmd.py @src/usher_pipeline/persistence/duckdb_store.py

**Task 1: Create animal model evidence data model, fetch (orthologs + phenotypes), and transform**

`src/usher_pipeline/evidence/animal_models/__init__.py` `src/usher_pipeline/evidence/animal_models/models.py` `src/usher_pipeline/evidence/animal_models/fetch.py` `src/usher_pipeline/evidence/animal_models/transform.py`

Create the animal model evidence layer following the established fetch->transform pattern.

**models.py**: Define AnimalModelRecord pydantic model with fields:
- gene_id (str)
- gene_symbol (str)
- mouse_ortholog (str|None)
- mouse_ortholog_confidence (str|None -- "HIGH"/"MEDIUM"/"LOW" based on HCOP source count)
- zebrafish_ortholog (str|None)
- zebrafish_ortholog_confidence (str|None)
- has_mouse_phenotype (bool|None)
- has_zebrafish_phenotype (bool|None)
- has_impc_phenotype (bool|None)
- sensory_phenotype_count (int|None -- number of sensory-relevant phenotypes)
- phenotype_categories (str|None -- semicolon-separated list of matched MP/ZP terms)
- animal_model_score_normalized (float|None -- 0-1 composite)

Define ANIMAL_TABLE_NAME = "animal_model_phenotypes". Define SENSORY_MP_KEYWORDS (Mammalian Phenotype ontology terms): ["hearing", "deaf", "vestibular", "balance", "retina", "photoreceptor", "vision", "blind", "cochlea", "stereocilia", "cilia", "cilium", "flagellum", "situs inversus", "laterality", "hydrocephalus", "kidney cyst", "polycystic"]. Define SENSORY_ZP_KEYWORDS similarly for zebrafish phenotype terms.
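As a rough illustration of how the keyword list might be used, the sketch below applies a case-insensitive substring match to a phenotype term name. The keyword list is copied from the definition above; the helper names `matched_keywords` and `is_sensory_term` are hypothetical and not part of the planned API. It uses only the standard library rather than polars, so treat it as a sketch of the matching rule, not the planned implementation.

```python
# Keyword list copied from the plan's SENSORY_MP_KEYWORDS definition.
SENSORY_MP_KEYWORDS = [
    "hearing", "deaf", "vestibular", "balance", "retina", "photoreceptor",
    "vision", "blind", "cochlea", "stereocilia", "cilia", "cilium",
    "flagellum", "situs inversus", "laterality", "hydrocephalus",
    "kidney cyst", "polycystic",
]


def matched_keywords(term_name: str, keywords: list[str]) -> list[str]:
    """Return the keywords found in term_name, in keyword-list order.

    Hypothetical helper: case-insensitive substring match, as the plan
    describes for filter_sensory_phenotypes.
    """
    lowered = term_name.lower()
    return [kw for kw in keywords if kw in lowered]


def is_sensory_term(term_name: str, keywords: list[str] = SENSORY_MP_KEYWORDS) -> bool:
    """True if at least one keyword occurs in the term name."""
    return bool(matched_keywords(term_name, keywords))


if __name__ == "__main__":
    for term in ["Hearing loss", "increased body weight", "abnormal cell division"]:
        print(term, "->", matched_keywords(term, SENSORY_MP_KEYWORDS))
```

Substring matching is permissive: `cilia` also matches inside `stereocilia`, and `vision` is a substring of `division`, so a term such as "abnormal cell division" would pass the filter. Whether that is acceptable, or whether word-boundary matching is needed, is a design decision the plan leaves open.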

**fetch.py**: Four functions:
1. `fetch_ortholog_mapping(gene_ids: list[str]) -> pl.DataFrame` -- Download HCOP ortholog data from HGNC: https://ftp.ebi.ac.uk/pub/databases/genenames/hcop/human_mouse_hcop_fifteen_column.txt.gz and human_zebrafish equivalent. Parse with polars. Extract columns: human gene ID, ortholog ID, ortholog symbol, support count (number of databases agreeing). Assign confidence: HIGH (8+ sources), MEDIUM (4-7), LOW (1-3). For one-to-many mappings: keep the ortholog with highest support count. Flag genes with multiple orthologs (ortholog_count column). Return DataFrame with gene_id, mouse_ortholog, mouse_ortholog_confidence, zebrafish_ortholog, zebrafish_ortholog_confidence.
2. `fetch_mgi_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Download MGI gene-phenotype report from MGI FTP: https://www.informatics.jax.org/downloads/reports/MGI_GenePheno.rpt (or HMD_HumanPhenotype.rpt for human orthologs). Parse tab-separated report. Extract: mouse gene ID, allele symbol, MP term name, MP term ID. Return DataFrame with mouse gene and phenotype terms. Use httpx streaming download with retry.
3. `fetch_zfin_phenotypes(zebrafish_gene_ids: list[str]) -> pl.DataFrame` -- Download ZFIN phenotype data from: https://zfin.org/downloads/phenoGeneCleanData_fish.txt. Parse tab-separated. Extract: zebrafish gene, phenotype terms. Return DataFrame. Use httpx streaming download with retry.
4. `fetch_impc_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Query IMPC SOLR API: https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=marker_symbol:{gene}&rows=1000. Process in batches (50 genes at a time with rate limiting). Extract: gene symbol, MP term, p-value, effect size. Return DataFrame. Use httpx with tenacity retry, ratelimit (5 req/sec conservative). If IMPC API is unreliable, fall back to bulk download from https://www.mousephenotype.org/data/release.
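The confidence thresholds and one-to-many rule described for `fetch_ortholog_mapping` can be sketched in plain Python as follows. The `OrthologCall` type, the function names, and the tie-breaking rule (first call seen wins) are assumptions made for illustration; the plan itself calls for a polars implementation over the HCOP download.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OrthologCall:
    """One human-to-ortholog assertion (hypothetical stand-in for an HCOP row)."""
    human_gene_id: str
    ortholog_symbol: str
    support_count: int  # number of databases agreeing on this pairing


def confidence_from_support(support_count: int) -> Optional[str]:
    """Map support count to the plan's tiers: HIGH 8+, MEDIUM 4-7, LOW 1-3."""
    if support_count >= 8:
        return "HIGH"
    if support_count >= 4:
        return "MEDIUM"
    if support_count >= 1:
        return "LOW"
    return None  # no supporting source: leave confidence NULL


def pick_best_orthologs(calls: list[OrthologCall]) -> dict[str, dict]:
    """Keep the highest-support ortholog per human gene and count alternatives."""
    best: dict[str, OrthologCall] = {}
    counts: dict[str, int] = {}
    for call in calls:
        counts[call.human_gene_id] = counts.get(call.human_gene_id, 0) + 1
        current = best.get(call.human_gene_id)
        # Strict ">" means the first call seen wins a tie (an assumption).
        if current is None or call.support_count > current.support_count:
            best[call.human_gene_id] = call
    return {
        gene_id: {
            "ortholog": call.ortholog_symbol,
            "confidence": confidence_from_support(call.support_count),
            "ortholog_count": counts[gene_id],
        }
        for gene_id, call in best.items()
    }
```

Genes that never appear in the input are simply absent from the result, which a later left join would turn into NULL ortholog columns rather than zeros.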

**transform.py**: Three functions:
1. `filter_sensory_phenotypes(phenotype_df: pl.DataFrame, keywords: list[str]) -> pl.DataFrame` -- Filter phenotype terms for sensory/cilia relevance using keyword matching (case-insensitive substring). Return only rows where MP/ZP term matches any keyword in SENSORY_MP_KEYWORDS or SENSORY_ZP_KEYWORDS. Count matches per gene as sensory_phenotype_count. Concatenate matched terms as phenotype_categories string.
2. `score_animal_evidence(df: pl.DataFrame) -> pl.DataFrame` -- Compute animal_model_score_normalized. Scoring formula: base_score = 0 if no phenotypes. For each organism with sensory phenotypes: mouse +0.4 (weighted by ortholog confidence: HIGH=1.0, MEDIUM=0.7, LOW=0.4), zebrafish +0.3 (same confidence weighting), IMPC +0.3 (independent confirmation bonus). Clamp to [0, 1]. Multiply by log2(sensory_phenotype_count + 1) / log2(max_count + 1) to reward more phenotypes. NULL if no ortholog mapping exists.
3. `process_animal_model_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch orthologs -> fetch MGI -> fetch ZFIN -> fetch IMPC -> join on orthologs -> filter sensory -> score -> collect.
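One possible reading of the scoring formula in `score_animal_evidence` is sketched below for a single gene. Several points are assumptions, not statements from the plan: the IMPC bonus is applied unweighted, `max_count` is taken to be the largest `sensory_phenotype_count` across all genes, and a gene with an ortholog but no sensory phenotypes scores 0.0 rather than NULL. The function name and signature are hypothetical.

```python
import math
from typing import Optional

# Confidence weights as listed in the plan.
CONFIDENCE_WEIGHT = {"HIGH": 1.0, "MEDIUM": 0.7, "LOW": 0.4}


def animal_model_score(
    mouse_confidence: Optional[str],
    zebrafish_confidence: Optional[str],
    has_mouse_sensory: bool,
    has_zebrafish_sensory: bool,
    has_impc_sensory: bool,
    sensory_phenotype_count: int,
    max_count: int,
) -> Optional[float]:
    """Per-gene score in [0, 1], or None when no ortholog mapping exists."""
    if mouse_confidence is None and zebrafish_confidence is None:
        return None  # NULL preservation: unmapped is not the same as zero

    base = 0.0
    if has_mouse_sensory and mouse_confidence is not None:
        base += 0.4 * CONFIDENCE_WEIGHT[mouse_confidence]
    if has_zebrafish_sensory and zebrafish_confidence is not None:
        base += 0.3 * CONFIDENCE_WEIGHT[zebrafish_confidence]
    if has_impc_sensory:
        base += 0.3  # assumption: bonus is not confidence-weighted
    base = min(max(base, 0.0), 1.0)

    # Guard against log2(1) == 0 in the denominator when max_count is 0.
    if max_count <= 0 or sensory_phenotype_count <= 0:
        return 0.0
    scale = math.log2(sensory_phenotype_count + 1) / math.log2(max_count + 1)
    return base * scale
```

As a worked case under these assumptions: a gene with a HIGH mouse ortholog, a MEDIUM zebrafish ortholog, an IMPC hit, and 3 sensory phenotypes when the maximum is 7 gets base = 0.4 + 0.3 × 0.7 + 0.3 = 0.91 and scale = log2(4) / log2(8) = 2/3, for a score of about 0.607.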

Follow established patterns: NULL preservation (no ortholog = NULL, not zero), structlog logging. Handle one-to-many orthologs: take best confidence, aggregate phenotypes across all orthologs for that human gene.
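The aggregation and NULL-preservation rules above might look like the following in plain Python. The function name, the input shapes, and the choice to count unique terms (rather than rows) are assumptions for illustration only.

```python
from typing import Optional


def aggregate_sensory_terms(
    orthologs_by_gene: dict[str, list[str]],
    sensory_terms_by_ortholog: dict[str, list[str]],
) -> dict[str, dict[str, Optional[object]]]:
    """Pool sensory terms across all orthologs of each human gene.

    Genes with no ortholog keep None (NULL) for both fields; genes with an
    ortholog but no sensory terms get a count of 0.
    """
    summary: dict[str, dict[str, Optional[object]]] = {}
    for gene_id, orthologs in orthologs_by_gene.items():
        if not orthologs:
            summary[gene_id] = {
                "sensory_phenotype_count": None,
                "phenotype_categories": None,
            }
            continue
        terms: set[str] = set()
        for ortholog in orthologs:
            terms.update(sensory_terms_by_ortholog.get(ortholog, []))
        ordered = sorted(terms)  # deterministic order for the joined string
        summary[gene_id] = {
            "sensory_phenotype_count": len(ordered),
            "phenotype_categories": ";".join(ordered) if ordered else None,
        }
    return summary
```

Keeping "no ortholog" (None) distinct from "ortholog with no sensory phenotype" (0) lets downstream scoring treat missing data differently from negative evidence.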

`cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.animal_models import fetch_ortholog_mapping, fetch_mgi_phenotypes, fetch_zfin_phenotypes, fetch_impc_phenotypes, filter_sensory_phenotypes, score_animal_evidence, process_animal_model_evidence; print('imports OK')"`

Animal model fetch retrieves orthologs from HCOP and phenotypes from MGI, ZFIN, IMPC. Transform filters for sensory/cilia relevance and scores with ortholog confidence weighting. All functions importable.

**Task 2: Create animal model DuckDB loader, CLI command, and tests**

`src/usher_pipeline/evidence/animal_models/load.py` `src/usher_pipeline/cli/evidence_cmd.py` `tests/test_animal_models.py` `tests/test_animal_models_integration.py`

**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "animal_model_phenotypes" table. Record provenance: genes with mouse orthologs, genes with zebrafish orthologs, genes with sensory phenotypes, ortholog confidence distribution, mean sensory phenotype count. Create `query_sensory_phenotype_genes(store, min_score=0.3) -> pl.DataFrame` helper.
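The provenance figures listed for load.py could be derived from the scored records before they are written. The sketch below works on a list of plain dicts using the column names from the data model; the function name and the output key names are hypothetical, and it reports only the mouse confidence distribution for brevity. The actual loader would operate on a polars DataFrame and the project's provenance object, neither of which is shown here.

```python
from collections import Counter
from statistics import mean


def summarize_for_provenance(records: list[dict]) -> dict:
    """Compute summary statistics of the kind the plan asks load.py to record."""
    confidences = Counter(
        r["mouse_ortholog_confidence"]
        for r in records
        if r.get("mouse_ortholog_confidence") is not None
    )
    # NULL counts (no ortholog) are excluded from the mean rather than
    # treated as zero.
    counts = [
        r["sensory_phenotype_count"]
        for r in records
        if r.get("sensory_phenotype_count") is not None
    ]
    return {
        "genes_with_mouse_ortholog": sum(
            1 for r in records if r.get("mouse_ortholog") is not None
        ),
        "genes_with_zebrafish_ortholog": sum(
            1 for r in records if r.get("zebrafish_ortholog") is not None
        ),
        "genes_with_sensory_phenotypes": sum(1 for c in counts if c > 0),
        "mouse_ortholog_confidence_distribution": dict(confidences),
        "mean_sensory_phenotype_count": mean(counts) if counts else None,
    }
```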

**evidence_cmd.py**: Add `animal-models` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('animal_model_phenotypes')), --force flag, load gene universe for gene_ids, call process_animal_model_evidence, load to DuckDB, save provenance sidecar to data/animal_models/phenotypes.provenance.json. Display summary: ortholog coverage, sensory phenotype counts by organism, top scoring genes.

**tests/test_animal_models.py**: Unit tests with synthetic data. Mock httpx for all downloads. Test cases:
- test_ortholog_confidence_high: 8+ supporting sources -> HIGH
- test_ortholog_confidence_low: 1-3 sources -> LOW
- test_one_to_many_best_selected: Multiple mouse orthologs -> highest confidence kept
- test_sensory_keyword_match: "hearing loss" phenotype matches SENSORY_MP_KEYWORDS
- test_non_sensory_filtered: "increased body weight" phenotype filtered out
- test_score_with_confidence_weighting: HIGH confidence ortholog scores higher than LOW
- test_score_null_no_ortholog: Gene without ortholog -> NULL score
- test_multi_organism_bonus: Phenotypes in both mouse and zebrafish -> higher score
- test_phenotype_count_scaling: More sensory phenotypes -> higher score (diminishing returns via log)
- test_impc_integration: IMPC phenotypes contribute to score

**tests/test_animal_models_integration.py**: Integration tests. Mock HCOP download, MGI/ZFIN/IMPC responses. Test full pipeline, checkpoint-restart, provenance. Synthetic phenotype report fixtures.

`cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v`

All animal model unit and integration tests pass. CLI `evidence animal-models` command registered. DuckDB stores animal_model_phenotypes table with ortholog mappings, phenotype summaries, and confidence-weighted scores. Checkpoint-restart works.

- `python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.animal_models import *"` -- all exports importable
- `usher-pipeline evidence animal-models --help` -- CLI help displays
- DuckDB animal_model_phenotypes table has columns: gene_id, gene_symbol, mouse_ortholog, mouse_ortholog_confidence, sensory_phenotype_count, phenotype_categories, animal_model_score_normalized

<success_criteria>

- ANIM-01: Phenotypes retrieved from MGI (mouse), ZFIN (zebrafish), and IMPC via bulk downloads and API
- ANIM-02: Phenotypes filtered for sensory/balance/vision/hearing/cilia relevance via keyword matching
- ANIM-03: Ortholog mapping via HCOP with confidence scoring (HIGH/MEDIUM/LOW), one-to-many handled by taking best confidence
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure

</success_criteria>

After completion, create `.planning/phases/03-core-evidence-layers/03-05-SUMMARY.md`
