docs(03): create phase plan

2026-02-11 18:46:28 +08:00
parent 3354cfe006
commit 0d252da348
7 changed files with 1022 additions and 3 deletions


@@ -65,9 +65,15 @@ Plans:
4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions
5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring
6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring
**Plans**: 6 plans
Plans:
- [ ] 03-01-PLAN.md -- Gene annotation completeness (GO terms, UniProt scores, pathway membership, tier classification)
- [ ] 03-02-PLAN.md -- Tissue expression (HPA, GTEx, CellxGene with Tau specificity and enrichment scoring)
- [ ] 03-03-PLAN.md -- Protein sequence/structure features (UniProt/InterPro domains, cilia motifs, normalization)
- [ ] 03-04-PLAN.md -- Subcellular localization (HPA subcellular, cilia proteomics, evidence type distinction)
- [ ] 03-05-PLAN.md -- Animal model phenotypes (MGI, ZFIN, IMPC with HCOP ortholog mapping)
- [ ] 03-06-PLAN.md -- Literature evidence (PubMed queries, evidence tier classification, quality-weighted scoring)
### Phase 4: Scoring & Integration ### Phase 4: Scoring & Integration
**Goal**: Multi-evidence weighted scoring with known gene validation **Goal**: Multi-evidence weighted scoring with known gene validation
@@ -120,7 +126,7 @@ Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6
|-------|----------------|--------|-----------|
| 1. Data Infrastructure | 4/4 | ✓ Complete | 2026-02-11 |
| 2. Prototype Evidence Layer | 2/2 | ✓ Complete | 2026-02-11 |
| 3. Core Evidence Layers | 0/6 | In Progress | - |
| 4. Scoring & Integration | 0/TBD | Not started | - |
| 5. Output & CLI | 0/TBD | Not started | - |
| 6. Validation | 0/TBD | Not started | - |


@@ -0,0 +1,167 @@
---
phase: 03-core-evidence-layers
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/evidence/annotation/__init__.py
- src/usher_pipeline/evidence/annotation/models.py
- src/usher_pipeline/evidence/annotation/fetch.py
- src/usher_pipeline/evidence/annotation/transform.py
- src/usher_pipeline/evidence/annotation/load.py
- tests/test_annotation.py
- tests/test_annotation_integration.py
- src/usher_pipeline/cli/evidence_cmd.py
- pyproject.toml
autonomous: true
must_haves:
truths:
- "Pipeline quantifies functional annotation depth per gene using GO term count, UniProt annotation score, and pathway membership"
- "Genes are classified into annotation tiers (well-annotated, partially-annotated, poorly-annotated) based on composite annotation metrics"
- "Annotation completeness score is normalized to 0-1 scale and stored as an evidence layer in DuckDB"
artifacts:
- path: "src/usher_pipeline/evidence/annotation/fetch.py"
provides: "GO term count and UniProt annotation score retrieval per gene"
exports: ["fetch_go_annotations", "fetch_uniprot_scores"]
- path: "src/usher_pipeline/evidence/annotation/transform.py"
provides: "Annotation tier classification and 0-1 normalization"
exports: ["classify_annotation_tier", "normalize_annotation_score", "process_annotation_evidence"]
- path: "src/usher_pipeline/evidence/annotation/load.py"
provides: "DuckDB persistence for annotation evidence"
exports: ["load_to_duckdb"]
- path: "tests/test_annotation.py"
provides: "Unit tests for annotation scoring, tiering, NULL handling"
key_links:
- from: "src/usher_pipeline/evidence/annotation/fetch.py"
to: "mygene.info API"
via: "mygene library batch query"
pattern: "mg\\.querymany.*fields.*go"
- from: "src/usher_pipeline/evidence/annotation/transform.py"
to: "src/usher_pipeline/evidence/annotation/fetch.py"
via: "processes fetched GO/UniProt data into scores"
pattern: "classify_annotation_tier"
- from: "src/usher_pipeline/evidence/annotation/load.py"
to: "src/usher_pipeline/persistence/duckdb_store.py"
via: "store.save_dataframe"
pattern: "save_dataframe.*annotation_completeness"
---
<objective>
Implement the Gene Annotation Completeness evidence layer (ANNOT-01/02/03): retrieve GO term counts and UniProt annotation scores per gene, classify genes into annotation tiers, normalize to 0-1 composite score, and persist to DuckDB.
Purpose: Annotation depth quantifies how well-studied each gene is -- poorly-annotated genes that nonetheless carry other evidence are prime under-studied cilia/Usher candidates.
Output: annotation_completeness DuckDB table with per-gene GO counts, UniProt scores, pathway flags, annotation tier, and normalized composite score.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/models.py
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create annotation evidence data model, fetch, and transform modules</name>
<files>
src/usher_pipeline/evidence/annotation/__init__.py
src/usher_pipeline/evidence/annotation/models.py
src/usher_pipeline/evidence/annotation/fetch.py
src/usher_pipeline/evidence/annotation/transform.py
pyproject.toml
</files>
<action>
Create the annotation evidence layer following the established gnomAD pattern (fetch->transform->load).
**models.py**: Define AnnotationRecord pydantic model with fields: gene_id (str), gene_symbol (str), go_term_count (int|None), go_biological_process_count (int|None), go_molecular_function_count (int|None), go_cellular_component_count (int|None), uniprot_annotation_score (int|None -- UniProt annotation score 1-5), has_pathway_membership (bool|None -- present in any KEGG/Reactome pathway), annotation_tier (str -- "well_annotated", "partially_annotated", "poorly_annotated"), annotation_score_normalized (float|None -- 0-1 composite). Define ANNOTATION_TABLE_NAME = "annotation_completeness".
**fetch.py**: Two fetch functions:
1. `fetch_go_annotations(gene_ids: list[str]) -> pl.DataFrame` -- Use the mygene library (already a dependency) to batch-query GO annotations. Call `mg.querymany(gene_ids, scopes='ensembl.gene', fields='go,pathway.kegg,pathway.reactome,symbol', species='human')`. For each gene, count GO terms per ontology category (BP, MF, CC). Return DataFrame with gene_id, gene_symbol, go_term_count, go_biological_process_count, go_molecular_function_count, go_cellular_component_count, has_pathway_membership. Handle genes with no GO annotations as NULL (not zero). Process in batches of 1000 to avoid mygene timeouts.
2. `fetch_uniprot_scores(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- Use the UniProt REST API (httpx with tenacity retry) to batch-query annotation scores. The UniProt annotation score is available in the REST API JSON under `.annotationScore`. Query batches of 100 accessions via the UniProt search endpoint: `https://rest.uniprot.org/uniprotkb/search?query=accession:P12345+OR+accession:P67890&fields=accession,annotation_score`. Use ratelimit decorator (200 req/sec). Return DataFrame with gene_id, uniprot_annotation_score. NULL for genes without UniProt mapping.
Add `ratelimit` to pyproject.toml dependencies if not already present.
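As a shape reference, a minimal sketch of the batched mygene GO counting, assuming the usual `querymany` response layout (a single GO term arrives as a dict, multiple as a list); treat the parsing details as assumptions to confirm against real responses:
```python
import mygene
import polars as pl

def fetch_go_annotations(gene_ids: list[str]) -> pl.DataFrame:
    """Sketch: count GO terms per ontology via mygene.info batch queries."""
    mg = mygene.MyGeneInfo()
    rows = []
    for start in range(0, len(gene_ids), 1000):  # batch to avoid mygene timeouts
        hits = mg.querymany(
            gene_ids[start:start + 1000],
            scopes="ensembl.gene",
            fields="go,pathway.kegg,pathway.reactome,symbol",
            species="human",
        )
        for hit in hits:
            if hit.get("notfound"):
                rows.append({"gene_id": hit["query"], "gene_symbol": None,
                             "go_term_count": None,
                             "go_biological_process_count": None,
                             "go_molecular_function_count": None,
                             "go_cellular_component_count": None,
                             "has_pathway_membership": None})
                continue
            go = hit.get("go") or {}
            # mygene returns a dict for a single GO term, a list for several
            counts = {cat: (len(terms) if isinstance(terms, list) else 1)
                      for cat, terms in go.items() if cat in ("BP", "MF", "CC")}
            rows.append({
                "gene_id": hit["query"],
                "gene_symbol": hit.get("symbol"),
                "go_term_count": sum(counts.values()) or None,  # NULL, not 0, when unannotated
                "go_biological_process_count": counts.get("BP"),
                "go_molecular_function_count": counts.get("MF"),
                "go_cellular_component_count": counts.get("CC"),
                # absence of the pathway field treated as False here; adjust if NULL is preferred
                "has_pathway_membership": bool(hit.get("pathway")),
            })
    return pl.DataFrame(rows)
```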
**transform.py**: Three functions:
1. `classify_annotation_tier(df: pl.DataFrame) -> pl.DataFrame` -- Add annotation_tier column: "well_annotated" (go_term_count >= 20 AND uniprot_annotation_score >= 4), "partially_annotated" (go_term_count >= 5 OR uniprot_annotation_score >= 3), "poorly_annotated" (everything else including NULLs). NULL GO counts treated as zero for tier classification (conservative -- assume unannotated).
2. `normalize_annotation_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute composite annotation score. Formula: weighted average of (a) log2(go_term_count + 1) normalized by max across dataset, (b) uniprot_annotation_score / 5.0, (c) has_pathway_membership as 0/1. Weights: 0.5 GO, 0.3 UniProt, 0.2 Pathway. Result clamped to 0-1. NULL if ALL three inputs are NULL.
3. `process_annotation_evidence(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- End-to-end pipeline: fetch GO -> fetch UniProt -> join -> classify tier -> normalize -> collect.
Follow established patterns: NULL preservation (unknown != zero), structlog logging, lazy polars evaluation where possible.
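A minimal polars sketch of the tier thresholds and composite weighting above (column names follow models.py; filling missing components with 0 inside the weighted sum is an assumption, not a settled design decision):
```python
import polars as pl

def classify_annotation_tier(df: pl.DataFrame) -> pl.DataFrame:
    # NULL counts conservatively treated as unannotated for tiering only.
    go = pl.col("go_term_count").fill_null(0)
    uni = pl.col("uniprot_annotation_score").fill_null(0)
    return df.with_columns(
        pl.when((go >= 20) & (uni >= 4)).then(pl.lit("well_annotated"))
        .when((go >= 5) | (uni >= 3)).then(pl.lit("partially_annotated"))
        .otherwise(pl.lit("poorly_annotated"))
        .alias("annotation_tier")
    )

def normalize_annotation_score(df: pl.DataFrame) -> pl.DataFrame:
    go_log = (pl.col("go_term_count") + 1).log(base=2)
    go_norm = go_log / go_log.max()                      # normalized by dataset max
    uni_norm = pl.col("uniprot_annotation_score") / 5.0
    path_norm = pl.col("has_pathway_membership").cast(pl.Float64)
    all_null = (pl.col("go_term_count").is_null()
                & pl.col("uniprot_annotation_score").is_null()
                & pl.col("has_pathway_membership").is_null())
    # Missing components contribute 0 to the weighted sum (assumption).
    composite = (0.5 * go_norm.fill_null(0.0)
                 + 0.3 * uni_norm.fill_null(0.0)
                 + 0.2 * path_norm.fill_null(0.0)).clip(0.0, 1.0)
    return df.with_columns(
        pl.when(all_null).then(None).otherwise(composite)
        .alias("annotation_score_normalized")
    )
```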
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.annotation import fetch_go_annotations, fetch_uniprot_scores, classify_annotation_tier, normalize_annotation_score, process_annotation_evidence; print('imports OK')"
</verify>
<done>
Annotation fetch module retrieves GO terms via mygene and UniProt scores via REST API. Transform module classifies genes into 3 tiers and normalizes composite score to 0-1 scale. All functions importable and follow established NULL preservation patterns.
</done>
</task>
<task type="auto">
<name>Task 2: Create annotation DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/annotation/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_annotation.py
tests/test_annotation_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern exactly. Create `load_to_duckdb(df, store, provenance, description)` that saves to "annotation_completeness" table with CREATE OR REPLACE. Record provenance with tier distribution counts (well/partial/poor), NULL annotation counts, mean/median annotation scores. Create `query_poorly_annotated(store, max_score=0.3) -> pl.DataFrame` helper to find under-studied genes.
**evidence_cmd.py**: Add `annotation` subcommand to existing evidence command group. Follow gnomad command pattern: checkpoint check (has_checkpoint('annotation_completeness')), --force flag, load gene universe from DuckDB gene_universe table to get gene_ids and uniprot mappings, call process_annotation_evidence, load to DuckDB, save provenance sidecar to data/annotation/completeness.provenance.json. Display summary with tier distribution counts.
**tests/test_annotation.py**: Unit tests with synthetic data (no external API calls). Mock mygene.querymany to return synthetic GO data. Test cases:
- test_go_count_extraction: Correct GO term counting by category
- test_null_go_handling: Genes with no GO data get NULL counts
- test_tier_classification_well_annotated: High GO + high UniProt = well_annotated
- test_tier_classification_poorly_annotated: Low/NULL GO + low UniProt = poorly_annotated
- test_tier_classification_partial: Medium annotations = partially_annotated
- test_normalization_bounds: Score always in [0, 1] range
- test_normalization_null_preservation: All-NULL inputs produce NULL score
- test_normalization_with_pathway: Pathway membership contributes to score
- test_composite_weighting: Verify 0.5/0.3/0.2 weight distribution
**tests/test_annotation_integration.py**: Integration tests following gnomad pattern. Mock mygene and UniProt API. Test full pipeline: fetch -> transform -> load -> query. Test checkpoint-restart, provenance recording, idempotent loading. Use synthetic fixtures.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v
</verify>
<done>
All annotation unit and integration tests pass. CLI `evidence annotation` command registered and functional. DuckDB stores annotation_completeness table with tier classification. Checkpoint-restart works. Provenance tracked.
</done>
</task>
</tasks>
<verification>
- `python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.annotation import *"` -- all exports importable
- `usher-pipeline evidence annotation --help` -- CLI help displays with correct options
- DuckDB annotation_completeness table has columns: gene_id, gene_symbol, go_term_count, uniprot_annotation_score, has_pathway_membership, annotation_tier, annotation_score_normalized
</verification>
<success_criteria>
- ANNOT-01: GO term count and UniProt annotation score retrieved per gene with NULL for missing data
- ANNOT-02: Genes classified into well/partially/poorly annotated tiers based on composite metrics
- ANNOT-03: Normalized 0-1 annotation score stored in DuckDB annotation_completeness table
- Pattern compliance: fetch->transform->load->CLI->tests matching gnomAD evidence layer structure
</success_criteria>
<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-01-SUMMARY.md`
</output>


@@ -0,0 +1,169 @@
---
phase: 03-core-evidence-layers
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/evidence/expression/__init__.py
- src/usher_pipeline/evidence/expression/models.py
- src/usher_pipeline/evidence/expression/fetch.py
- src/usher_pipeline/evidence/expression/transform.py
- src/usher_pipeline/evidence/expression/load.py
- tests/test_expression.py
- tests/test_expression_integration.py
- src/usher_pipeline/cli/evidence_cmd.py
- pyproject.toml
autonomous: true
must_haves:
truths:
- "Pipeline retrieves tissue-level expression data from HPA and GTEx for retina, inner ear, and cilia-rich tissues"
- "Pipeline retrieves scRNA-seq data from CellxGene for photoreceptor and hair cell subpopulations"
- "Expression data is converted to tissue specificity metrics (Tau index) across data sources"
- "Expression score reflects enrichment in Usher-relevant tissues relative to global expression"
artifacts:
- path: "src/usher_pipeline/evidence/expression/fetch.py"
provides: "HPA, GTEx, and CellxGene expression data retrieval"
exports: ["fetch_hpa_expression", "fetch_gtex_expression", "fetch_cellxgene_expression"]
- path: "src/usher_pipeline/evidence/expression/transform.py"
provides: "Tau specificity index calculation and expression scoring"
exports: ["calculate_tau_specificity", "compute_expression_score", "process_expression_evidence"]
- path: "src/usher_pipeline/evidence/expression/load.py"
provides: "DuckDB persistence for expression evidence"
exports: ["load_to_duckdb"]
- path: "tests/test_expression.py"
provides: "Unit tests for expression scoring and Tau calculation"
key_links:
- from: "src/usher_pipeline/evidence/expression/fetch.py"
to: "HPA API and GTEx API"
via: "httpx with tenacity retry"
pattern: "proteinatlas\\.org|gtexportal\\.org"
- from: "src/usher_pipeline/evidence/expression/transform.py"
to: "src/usher_pipeline/evidence/expression/fetch.py"
via: "processes HPA/GTEx/CellxGene data into specificity scores"
pattern: "calculate_tau_specificity"
- from: "src/usher_pipeline/evidence/expression/load.py"
to: "src/usher_pipeline/persistence/duckdb_store.py"
via: "store.save_dataframe"
pattern: "save_dataframe.*tissue_expression"
---
<objective>
Implement the Tissue Expression evidence layer (EXPR-01/02/03/04): retrieve expression data from HPA, GTEx, and CellxGene for Usher-relevant tissues, compute tissue specificity (Tau index), score enrichment in target tissues, and persist to DuckDB.
Purpose: Expression in retina, inner ear, and cilia-rich tissues is strong evidence for potential cilia/Usher involvement. Genes specifically expressed in these tissues (high Tau) are higher-priority candidates.
Output: tissue_expression DuckDB table with per-gene expression values across target tissues, Tau specificity index, and normalized expression score.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create expression evidence data model, fetch, and transform modules</name>
<files>
src/usher_pipeline/evidence/expression/__init__.py
src/usher_pipeline/evidence/expression/models.py
src/usher_pipeline/evidence/expression/fetch.py
src/usher_pipeline/evidence/expression/transform.py
pyproject.toml
</files>
<action>
Create the expression evidence layer following the established gnomAD fetch->transform pattern.
**models.py**: Define ExpressionRecord pydantic model with fields: gene_id (str), gene_symbol (str), hpa_retina_tpm (float|None), hpa_cerebellum_tpm (float|None -- proxy for cilia-rich tissue), gtex_retina_tpm (float|None -- "Eye - Retina" from GTEx), gtex_brain_cerebellum_tpm (float|None), cellxgene_photoreceptor_expr (float|None), cellxgene_hair_cell_expr (float|None), tau_specificity (float|None -- Tau index 0=ubiquitous, 1=tissue-specific), usher_tissue_enrichment (float|None -- relative enrichment in target tissues), expression_score_normalized (float|None -- composite 0-1). Define EXPRESSION_TABLE_NAME = "tissue_expression". Define TARGET_TISSUES dict mapping tissue keys to API-specific identifiers for HPA, GTEx.
**fetch.py**: Three fetch functions:
1. `fetch_hpa_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download HPA normal tissue data TSV from https://www.proteinatlas.org/download/normal_tissue.tsv.zip (bulk download is more efficient than per-gene API for 20K genes). Filter to target tissues: "retina", "cerebellum", "testis" (cilia-rich), "fallopian tube" (ciliated epithelium). Extract TPM/expression levels. Use httpx streaming download with tenacity retry (same pattern as gnomAD). Parse with polars scan_csv. Return DataFrame with gene_id, tissue columns. NULL for genes not in HPA.
2. `fetch_gtex_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download GTEx median gene expression from GTEx Portal bulk data (V10) or query API in batches. Target tissues: "Eye - Retina" (not available in all GTEx versions -- handle NULL), "Brain - Cerebellum", "Testis", "Fallopian Tube". Use httpx with tenacity retry and ratelimit (conservative 5 req/sec for GTEx API). If bulk download available, prefer that. Return DataFrame with gene_id, gtex tissue columns. NULL for unavailable tissue/gene combinations.
3. `fetch_cellxgene_expression(gene_ids: list[str]) -> pl.DataFrame` -- Use cellxgene_census library to query scRNA-seq data. Query for cell types: "photoreceptor cell", "retinal rod cell", "retinal cone cell", "hair cell" in tissues "retina", "inner ear", "cochlea". Compute mean expression per gene per cell type. If cellxgene_census not available (optional dependency), log warning and return DataFrame with all NULLs. Process in gene batches (100 at a time) to control memory. Return DataFrame with gene_id, cellxgene columns.
Add `cellxgene_census` to pyproject.toml optional dependencies under [project.optional-dependencies] as `expression = ["cellxgene-census>=1.19"]` (heavy dependency, keep optional).
**transform.py**: Three functions:
1. `calculate_tau_specificity(df: pl.DataFrame, tissue_columns: list[str]) -> pl.DataFrame` -- Implement Tau index: Tau = sum(1 - xi/xmax) / (n-1). If any tissue value is NULL, tau is NULL (insufficient data for reliable specificity). Add tau_specificity column.
2. `compute_expression_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute usher_tissue_enrichment: ratio of mean expression in Usher-relevant tissues (retina, inner ear proxies) to mean expression across all tissues. Higher ratio = more enriched. Normalize to 0-1. Compute expression_score_normalized as weighted combination: 0.4 * usher_tissue_enrichment_normalized + 0.3 * tau_specificity + 0.3 * max_target_tissue_rank (rank of max expression in target tissues / total genes). NULL if all expression data is NULL.
3. `process_expression_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch HPA -> fetch GTEx -> fetch CellxGene -> merge on gene_id -> compute Tau -> compute score -> collect.
Follow established patterns: NULL preservation, structlog logging, lazy polars evaluation. NOTE: GTEx lacks inner ear/cochlea tissue -- handle as NULL, do not fabricate. CellxGene is the primary source for inner ear data.
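A worked sketch of the Tau calculation under the NULL rule above; the guard for all-zero rows is an added assumption:
```python
import polars as pl

def calculate_tau_specificity(df: pl.DataFrame, tissue_columns: list[str]) -> pl.DataFrame:
    # Tau = sum_i(1 - x_i / x_max) / (n - 1): 0 = ubiquitous, 1 = tissue-specific.
    n = len(tissue_columns)
    x_max = pl.max_horizontal([pl.col(c) for c in tissue_columns])
    tau = pl.sum_horizontal([1 - pl.col(c) / x_max for c in tissue_columns]) / (n - 1)
    any_null = pl.any_horizontal([pl.col(c).is_null() for c in tissue_columns])
    return df.with_columns(
        pl.when(any_null | (x_max <= 0)).then(None)  # NULL inputs or all-zero rows -> NULL
        .otherwise(tau)
        .alias("tau_specificity")
    )
```
Equal expression everywhere makes every term 1 - 1 = 0 (Tau = 0); expression confined to one tissue leaves n - 1 terms equal to 1 (Tau = 1).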
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.expression import fetch_hpa_expression, fetch_gtex_expression, calculate_tau_specificity, compute_expression_score, process_expression_evidence; print('imports OK')"
</verify>
<done>
Expression fetch module retrieves data from HPA (bulk TSV), GTEx (API/bulk), and CellxGene (census library). Transform module computes Tau specificity index and Usher tissue enrichment score normalized to 0-1. NULL preserved for missing tissues/genes.
</done>
</task>
<task type="auto">
<name>Task 2: Create expression DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/expression/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_expression.py
tests/test_expression_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "tissue_expression" table. Record provenance with: genes with retina expression, genes with inner ear expression, mean Tau, expression score distribution stats. Create `query_tissue_enriched(store, min_enrichment=2.0) -> pl.DataFrame` helper for genes enriched in Usher tissues.
**evidence_cmd.py**: Add `expression` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('tissue_expression')), --force flag, --skip-cellxgene flag (for running without optional CellxGene dependency), load gene universe for gene_ids, call process_expression_evidence, load to DuckDB, save provenance sidecar to data/expression/tissue.provenance.json. Display summary: tissue coverage counts, mean Tau, top enriched genes preview.
**tests/test_expression.py**: Unit tests with synthetic data, NO external API calls. Mock httpx responses for HPA/GTEx, mock cellxgene_census. Test cases:
- test_tau_calculation_ubiquitous: Equal expression across tissues -> Tau near 0
- test_tau_calculation_specific: Expression in one tissue only -> Tau near 1
- test_tau_null_handling: NULL tissue values -> NULL Tau
- test_enrichment_score_high: High retina expression relative to global -> high enrichment
- test_enrichment_score_low: No target tissue expression -> low enrichment
- test_expression_score_normalization: Composite score in [0, 1]
- test_null_preservation_all_sources: Gene with no data from any source -> NULL score
- test_hpa_parsing: Correct extraction of tissue-level TPM from HPA TSV format
- test_gtex_missing_tissue: NULL for tissues not in GTEx (inner ear)
**tests/test_expression_integration.py**: Integration tests with mocked downloads. Test full pipeline, checkpoint-restart, provenance. Synthetic HPA TSV fixture, mocked GTEx responses.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_expression.py tests/test_expression_integration.py -v
</verify>
<done>
All expression unit and integration tests pass. CLI `evidence expression` command registered with --skip-cellxgene option. DuckDB stores tissue_expression table with Tau and enrichment scores. Checkpoint-restart works.
</done>
</task>
</tasks>
<verification>
- `python -m pytest tests/test_expression.py tests/test_expression_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.expression import *"` -- all exports importable
- `usher-pipeline evidence expression --help` -- CLI help displays
- DuckDB tissue_expression table has columns: gene_id, gene_symbol, hpa_retina_tpm, gtex_retina_tpm, cellxgene_photoreceptor_expr, tau_specificity, usher_tissue_enrichment, expression_score_normalized
</verification>
<success_criteria>
- EXPR-01: HPA and GTEx tissue expression retrieved for retina, inner ear proxies, and cilia-rich tissues
- EXPR-02: CellxGene scRNA-seq data retrieved for photoreceptor and hair cell types (optional dependency graceful fallback)
- EXPR-03: Tau specificity index computed across data sources with NULL handling
- EXPR-04: Expression score reflects enrichment in Usher-relevant tissues with normalized 0-1 composite
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
</success_criteria>
<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-02-SUMMARY.md`
</output>


@@ -0,0 +1,169 @@
---
phase: 03-core-evidence-layers
plan: 03
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/evidence/protein/__init__.py
- src/usher_pipeline/evidence/protein/models.py
- src/usher_pipeline/evidence/protein/fetch.py
- src/usher_pipeline/evidence/protein/transform.py
- src/usher_pipeline/evidence/protein/load.py
- tests/test_protein.py
- tests/test_protein_integration.py
- src/usher_pipeline/cli/evidence_cmd.py
autonomous: true
must_haves:
truths:
- "Pipeline extracts protein length, domain composition, and domain count from UniProt/InterPro per gene"
- "Pipeline identifies coiled-coil regions, scaffold/adaptor domains, and transmembrane domains"
- "Pipeline checks for known cilia-associated motifs and domain annotations without presupposing conclusions"
- "Protein features are encoded as binary and continuous features normalized to 0-1 scale"
artifacts:
- path: "src/usher_pipeline/evidence/protein/fetch.py"
provides: "UniProt protein features and InterPro domain retrieval"
exports: ["fetch_uniprot_features", "fetch_interpro_domains"]
- path: "src/usher_pipeline/evidence/protein/transform.py"
provides: "Feature extraction, cilia motif detection, and normalization"
exports: ["extract_protein_features", "detect_cilia_motifs", "normalize_protein_features", "process_protein_evidence"]
- path: "src/usher_pipeline/evidence/protein/load.py"
provides: "DuckDB persistence for protein features"
exports: ["load_to_duckdb"]
- path: "tests/test_protein.py"
provides: "Unit tests for feature extraction and motif detection"
key_links:
- from: "src/usher_pipeline/evidence/protein/fetch.py"
to: "UniProt REST API"
via: "httpx batch queries with ratelimit"
pattern: "rest\\.uniprot\\.org"
- from: "src/usher_pipeline/evidence/protein/fetch.py"
to: "InterPro REST API"
via: "httpx with tenacity retry"
pattern: "interpro.*api"
- from: "src/usher_pipeline/evidence/protein/transform.py"
to: "src/usher_pipeline/evidence/protein/fetch.py"
via: "processes raw features into scored evidence"
pattern: "detect_cilia_motifs|extract_protein_features"
- from: "src/usher_pipeline/evidence/protein/load.py"
to: "src/usher_pipeline/persistence/duckdb_store.py"
via: "store.save_dataframe"
pattern: "save_dataframe.*protein_features"
---
<objective>
Implement the Protein Sequence and Structure Features evidence layer (PROT-01/02/03/04): extract protein length, domains, coiled-coils, transmembrane regions, and cilia-associated motifs from UniProt/InterPro, encode as binary and continuous features normalized to 0-1.
Purpose: Usher proteins share structural motifs (coiled-coils, scaffold domains, transmembrane regions). Identifying these features in unstudied genes provides structural evidence for potential cilia/Usher involvement.
Output: protein_features DuckDB table with per-gene protein length, domain count, coiled-coil flag, transmembrane count, cilia motif flags, and normalized composite protein score.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create protein features data model, fetch, and transform modules</name>
<files>
src/usher_pipeline/evidence/protein/__init__.py
src/usher_pipeline/evidence/protein/models.py
src/usher_pipeline/evidence/protein/fetch.py
src/usher_pipeline/evidence/protein/transform.py
</files>
<action>
Create the protein features evidence layer following the established gnomAD fetch->transform pattern.
**models.py**: Define ProteinFeatureRecord pydantic model with fields: gene_id (str), gene_symbol (str), uniprot_id (str|None), protein_length (int|None), domain_count (int|None), coiled_coil (bool|None -- has coiled-coil region), coiled_coil_count (int|None), transmembrane_count (int|None), scaffold_adaptor_domain (bool|None -- has PDZ, SH3, ankyrin, WD40, or similar scaffold domains), has_cilia_domain (bool|None -- has IFT, BBSome, ciliary targeting, or transition zone domain), has_sensory_domain (bool|None -- has stereocilia, photoreceptor, or sensory-associated domain), protein_score_normalized (float|None -- 0-1 composite). Define PROTEIN_TABLE_NAME = "protein_features". Define CILIA_DOMAIN_KEYWORDS as list: ["IFT", "intraflagellar", "BBSome", "ciliary", "cilia", "basal body", "centrosome", "transition zone", "axoneme"]. Define SCAFFOLD_DOMAIN_TYPES as list: ["PDZ", "SH3", "Ankyrin", "WD40", "Coiled coil", "SAM", "FERM", "Harmonin"].
**fetch.py**: Two fetch functions:
1. `fetch_uniprot_features(uniprot_ids: list[str]) -> pl.DataFrame` -- Query UniProt REST API in batches of 100 accessions. Use search endpoint with fields parameter: `accession,length,ft_domain,ft_coiled,ft_transmem,annotation_score`. Parse JSON response to extract: protein_length, list of domain names, coiled-coil region count, transmembrane region count. Use httpx with tenacity retry (5 attempts, exponential backoff) and ratelimit (200 req/sec). Return DataFrame with uniprot_id and extracted features. NULL for accessions not found.
2. `fetch_interpro_domains(uniprot_ids: list[str]) -> pl.DataFrame` -- Query InterPro REST API for domain annotations per protein. Endpoint: `https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/{accession}`. Use conservative rate limiting (10 req/sec as recommended in RESEARCH.md). Extract domain names, InterPro IDs. Return DataFrame with uniprot_id, domain_names list, interpro_ids list. This supplements UniProt with more detailed domain classification. For >10K proteins, consider InterPro bulk download as fallback (log warning if API too slow).
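A hedged sketch of one batched UniProt search call with the retry/rate-limit wiring described above; JSON parsing is omitted because the exact response field names should be confirmed against the live API:
```python
import httpx
from ratelimit import limits, sleep_and_retry
from tenacity import retry, stop_after_attempt, wait_exponential

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"

@sleep_and_retry
@limits(calls=200, period=1)  # stay under the assumed 200 req/sec budget
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, max=30))
def fetch_feature_batch(accessions: list[str]) -> list[dict]:
    """Fetch one batch (<=100 accessions) of protein features from UniProt."""
    query = " OR ".join(f"accession:{acc}" for acc in accessions)
    response = httpx.get(
        UNIPROT_SEARCH_URL,
        params={
            "query": query,
            "fields": "accession,length,ft_domain,ft_coiled,ft_transmem,annotation_score",
            "format": "json",
            "size": len(accessions),
        },
        timeout=30.0,
    )
    response.raise_for_status()
    return response.json().get("results", [])
```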
**transform.py**: Four functions:
1. `extract_protein_features(uniprot_df: pl.DataFrame, interpro_df: pl.DataFrame) -> pl.DataFrame` -- Join UniProt and InterPro data on uniprot_id. Compute domain_count from combined sources (deduplicate). Set coiled_coil boolean from UniProt ft_coiled count > 0. Set transmembrane_count from UniProt ft_transmem.
2. `detect_cilia_motifs(df: pl.DataFrame) -> pl.DataFrame` -- Scan domain names for CILIA_DOMAIN_KEYWORDS (case-insensitive substring match). Set has_cilia_domain = True if any domain matches. Scan for SCAFFOLD_DOMAIN_TYPES. Set scaffold_adaptor_domain = True if match found. Set has_sensory_domain based on keywords: ["stereocilia", "photoreceptor", "usher", "harmonin", "cadherin 23", "protocadherin"]. Important: this is pattern matching on domain annotations, NOT presupposing cilia involvement -- it flags structural features that happen to be associated with cilia biology.
3. `normalize_protein_features(df: pl.DataFrame) -> pl.DataFrame` -- Normalize continuous features: protein_length via log-transform rank percentile (0-1), domain_count via rank percentile (0-1), transmembrane_count capped at 20 then /20. Binary features stay as 0/1. Composite protein_score_normalized = 0.15 * length_rank + 0.20 * domain_rank + 0.20 * coiled_coil + 0.20 * transmembrane_normalized + 0.15 * has_cilia_domain + 0.10 * scaffold_adaptor_domain. NULL if no UniProt entry exists for gene.
4. `process_protein_evidence(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- End-to-end: map gene_ids to uniprot_ids -> fetch UniProt -> fetch InterPro -> extract -> detect motifs -> normalize -> collect.
Follow established patterns: NULL preservation, structlog logging. Note: do NOT use external tools (CoCoNat, Phobius) for coiled-coil/TM prediction -- use UniProt annotations only (already curated). Research mentioned these tools but they add complexity with marginal value at this stage.
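A minimal sketch of the keyword flagging -- pure substring matching over domain annotation strings, per the note above; `domain_names` as a list column is an assumption carried over from the fetch step:
```python
import polars as pl

CILIA_DOMAIN_KEYWORDS = ["IFT", "intraflagellar", "BBSome", "ciliary", "cilia",
                         "basal body", "centrosome", "transition zone", "axoneme"]

def _any_keyword(domain_names, keywords: list[str]) -> bool | None:
    if domain_names is None:
        return None  # no UniProt/InterPro entry: unknown, not False
    # polars passes each list cell as a Series; iterate to plain strings
    lowered = [name.lower() for name in domain_names]
    return any(kw.lower() in name for name in lowered for kw in keywords)

def detect_cilia_motifs(df: pl.DataFrame) -> pl.DataFrame:
    return df.with_columns(
        pl.col("domain_names")
        .map_elements(lambda ns: _any_keyword(ns, CILIA_DOMAIN_KEYWORDS),
                      return_dtype=pl.Boolean)
        .alias("has_cilia_domain")
    )
```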
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.protein import fetch_uniprot_features, fetch_interpro_domains, extract_protein_features, detect_cilia_motifs, normalize_protein_features, process_protein_evidence; print('imports OK')"
</verify>
<done>
Protein fetch module retrieves features from UniProt REST API and InterPro API. Transform module extracts domain features, detects cilia-associated motifs via keyword matching, and normalizes to 0-1 composite. All functions importable.
</done>
</task>
<task type="auto">
<name>Task 2: Create protein DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/protein/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_protein.py
tests/test_protein_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "protein_features" table. Record provenance: gene count, genes with cilia domains, genes with scaffold domains, genes with coiled-coils, genes with TM domains, mean domain count, NULL UniProt count. Create `query_cilia_candidates(store) -> pl.DataFrame` helper querying genes with has_cilia_domain=True OR (coiled_coil=True AND scaffold_adaptor_domain=True).
**evidence_cmd.py**: Add `protein` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('protein_features')), --force flag, load gene universe for gene_ids and uniprot mappings, call process_protein_evidence, load to DuckDB, save provenance sidecar to data/protein/features.provenance.json. Display summary: genes with UniProt data, cilia domain count, scaffold domain count, coiled-coil count.
**tests/test_protein.py**: Unit tests with synthetic data, mock httpx for UniProt/InterPro. Test cases:
- test_uniprot_feature_extraction: Correct parsing of length, domain, coiled-coil, TM from UniProt JSON
- test_cilia_motif_detection_positive: Domain name containing "IFT" -> has_cilia_domain=True
- test_cilia_motif_detection_negative: Standard domain (e.g., "Kinase") -> has_cilia_domain=False
- test_scaffold_detection: PDZ domain -> scaffold_adaptor_domain=True
- test_null_uniprot: Gene without UniProt entry -> all features NULL
- test_normalization_bounds: All features in [0, 1]
- test_composite_score_cilia_gene: Gene with cilia domains scores higher
- test_composite_score_null_handling: NULL UniProt -> NULL composite
- test_domain_keyword_case_insensitive: "intraflagellar" matches case-insensitively
**tests/test_protein_integration.py**: Integration tests. Mock UniProt and InterPro API responses. Test full pipeline, checkpoint-restart, provenance recording. Synthetic UniProt JSON fixtures with realistic domain structures.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_protein.py tests/test_protein_integration.py -v
</verify>
<done>
All protein unit and integration tests pass. CLI `evidence protein` command registered. DuckDB stores protein_features table with domain counts, motif flags, and normalized score. Checkpoint-restart works.
</done>
</task>
</tasks>
<verification>
- `python -m pytest tests/test_protein.py tests/test_protein_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.protein import *"` -- all exports importable
- `usher-pipeline evidence protein --help` -- CLI help displays
- DuckDB protein_features table has columns: gene_id, gene_symbol, protein_length, domain_count, coiled_coil, transmembrane_count, has_cilia_domain, scaffold_adaptor_domain, protein_score_normalized
</verification>
<success_criteria>
- PROT-01: Protein length, domain composition, domain count extracted from UniProt/InterPro per gene
- PROT-02: Coiled-coil, scaffold/adaptor, and transmembrane domains identified
- PROT-03: Cilia-associated motifs detected via domain keyword matching without presupposing conclusions
- PROT-04: Binary and continuous protein features normalized to 0-1 composite score
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
</success_criteria>
<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-03-SUMMARY.md`
</output>


@@ -0,0 +1,162 @@
---
phase: 03-core-evidence-layers
plan: 04
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/evidence/localization/__init__.py
- src/usher_pipeline/evidence/localization/models.py
- src/usher_pipeline/evidence/localization/fetch.py
- src/usher_pipeline/evidence/localization/transform.py
- src/usher_pipeline/evidence/localization/load.py
- tests/test_localization.py
- tests/test_localization_integration.py
- src/usher_pipeline/cli/evidence_cmd.py
autonomous: true
must_haves:
truths:
- "Pipeline integrates protein localization data from HPA subcellular and published centrosome/cilium proteomics"
- "Localization evidence distinguishes experimental from computational predictions"
- "Localization score reflects proximity to cilia-related compartments"
artifacts:
- path: "src/usher_pipeline/evidence/localization/fetch.py"
provides: "HPA subcellular and proteomics localization data retrieval"
exports: ["fetch_hpa_subcellular", "fetch_cilia_proteomics"]
- path: "src/usher_pipeline/evidence/localization/transform.py"
provides: "Localization scoring with evidence type distinction"
exports: ["score_localization", "classify_evidence_type", "process_localization_evidence"]
- path: "src/usher_pipeline/evidence/localization/load.py"
provides: "DuckDB persistence for localization evidence"
exports: ["load_to_duckdb"]
- path: "tests/test_localization.py"
provides: "Unit tests for localization scoring and evidence classification"
key_links:
- from: "src/usher_pipeline/evidence/localization/fetch.py"
to: "HPA subcellular data"
via: "httpx bulk download of subcellular_location.tsv"
pattern: "proteinatlas\\.org.*subcellular"
- from: "src/usher_pipeline/evidence/localization/transform.py"
to: "src/usher_pipeline/evidence/localization/fetch.py"
via: "processes HPA and proteomics data into localization scores"
pattern: "score_localization|classify_evidence_type"
- from: "src/usher_pipeline/evidence/localization/load.py"
to: "src/usher_pipeline/persistence/duckdb_store.py"
via: "store.save_dataframe"
pattern: "save_dataframe.*subcellular_localization"
---
<objective>
Implement the Subcellular Localization evidence layer (LOCA-01/02/03): retrieve HPA subcellular localization and curated cilium/centrosome proteomics data, distinguish experimental from computational evidence, score proximity to cilia-related compartments.
Purpose: Direct localization to cilia, centrosome, basal body, or transition zone is strong evidence for cilia involvement. Distinguishing experimental from computational predictions prevents overweighting predictions.
Output: subcellular_localization DuckDB table with per-gene compartment assignments, evidence type (experimental/predicted), cilia proximity score, and normalized localization score.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create localization evidence data model, fetch, and transform modules</name>
<files>
src/usher_pipeline/evidence/localization/__init__.py
src/usher_pipeline/evidence/localization/models.py
src/usher_pipeline/evidence/localization/fetch.py
src/usher_pipeline/evidence/localization/transform.py
</files>
<action>
Create the localization evidence layer following the established gnomAD fetch->transform pattern.
**models.py**: Define LocalizationRecord pydantic model with fields: gene_id (str), gene_symbol (str), hpa_main_location (str|None -- semicolon-separated HPA locations), hpa_reliability (str|None -- "Enhanced", "Supported", "Approved", "Uncertain"), hpa_evidence_type (str|None -- "experimental" or "predicted" based on reliability), in_cilia_proteomics (bool|None -- found in published cilium proteomics datasets), in_centrosome_proteomics (bool|None -- found in centrosome proteomics), compartment_cilia (bool|None), compartment_centrosome (bool|None), compartment_basal_body (bool|None), compartment_transition_zone (bool|None), compartment_stereocilia (bool|None), evidence_type (str -- "experimental", "computational", "both", "none"), cilia_proximity_score (float|None -- 0-1 based on compartment relevance), localization_score_normalized (float|None -- 0-1 composite). Define LOCALIZATION_TABLE_NAME = "subcellular_localization". Define CILIA_COMPARTMENTS = ["Cilia", "Cilium", "Centrosome", "Centriole", "Basal body", "Microtubule organizing center"]. Define CILIA_ADJACENT_COMPARTMENTS = ["Cytoskeleton", "Microtubules", "Cell Junctions", "Focal adhesion sites"].
**fetch.py**: Two fetch functions:
1. `fetch_hpa_subcellular(gene_ids: list[str]) -> pl.DataFrame` -- Download HPA subcellular location data TSV from https://www.proteinatlas.org/download/subcellular_location.tsv.zip (bulk download). Parse with polars. Extract columns: Gene name, Reliability, Main location, Additional location, Extracellular location. Map gene symbols to gene_ids using gene universe. Filter to input gene_ids. Return DataFrame with gene_id, hpa_main_location (string of all locations), hpa_reliability. Use httpx streaming with tenacity retry.
2. `fetch_cilia_proteomics(gene_ids: list[str]) -> pl.DataFrame` -- Create a curated reference set of known cilium and centrosome proteomics datasets. Use gene lists from published studies embedded as Python data (CiliaCarta gene set, Centrosome-DB genes). These are static reference lists that can be hardcoded or loaded from a small bundled CSV (data/reference/cilia_proteomics_genes.csv, data/reference/centrosome_proteomics_genes.csv). Cross-reference gene_ids against these lists. Return DataFrame with gene_id, in_cilia_proteomics (bool), in_centrosome_proteomics (bool). For genes not in either set, both are False (not NULL -- absence from proteomics is informative).
**transform.py**: Three functions:
1. `classify_evidence_type(df: pl.DataFrame) -> pl.DataFrame` -- Based on HPA reliability: "Enhanced"/"Supported" -> "experimental", "Approved"/"Uncertain" -> "computational". If gene in proteomics dataset -> "experimental" (overrides HPA if HPA says computational). If both -> "both". If neither HPA nor proteomics -> "none". Add evidence_type column.
2. `score_localization(df: pl.DataFrame) -> pl.DataFrame` -- Parse hpa_main_location string. Check each location against CILIA_COMPARTMENTS (direct match = 1.0 weight) and CILIA_ADJACENT_COMPARTMENTS (adjacent = 0.5 weight). Set compartment booleans. Compute cilia_proximity_score: 1.0 if any direct cilia compartment, 0.5 if adjacent only, 0.3 if in proteomics but no HPA cilia location, 0.0 if none. Multiply by evidence weight: experimental=1.0, computational=0.6. Compute localization_score_normalized: cilia_proximity_score weighted by evidence type. NULL if no localization data at all (gene not in HPA and not in proteomics).
3. `process_localization_evidence(gene_ids: list[str], gene_symbol_map: pl.DataFrame) -> pl.DataFrame` -- End-to-end: fetch HPA -> fetch proteomics -> merge -> classify evidence -> score -> collect.
Follow established patterns: NULL preservation, structlog logging. NOTE: Absence from proteomics datasets is informative (= not detected), so use False not NULL for proteomics columns. Absence from HPA is unknown (= not tested), so use NULL for HPA columns.
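A sketch of the evidence-type ladder, written to match the unit-test expectations in Task 2 (proteomics membership plus any HPA entry yields "both"):
```python
import polars as pl

def classify_evidence_type(df: pl.DataFrame) -> pl.DataFrame:
    in_proteomics = (pl.col("in_cilia_proteomics").fill_null(False)
                     | pl.col("in_centrosome_proteomics").fill_null(False))
    has_hpa = pl.col("hpa_reliability").is_not_null()
    hpa_experimental = pl.col("hpa_reliability").is_in(["Enhanced", "Supported"])
    return df.with_columns(
        pl.when(in_proteomics & has_hpa).then(pl.lit("both"))
        .when(in_proteomics | hpa_experimental).then(pl.lit("experimental"))
        .when(has_hpa).then(pl.lit("computational"))
        .otherwise(pl.lit("none"))
        .alias("evidence_type")
    )
```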
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.localization import fetch_hpa_subcellular, fetch_cilia_proteomics, classify_evidence_type, score_localization, process_localization_evidence; print('imports OK')"
</verify>
<done>
Localization fetch retrieves HPA subcellular data and cross-references proteomics gene lists. Transform classifies evidence type and scores cilia proximity with experimental/computational weighting. All functions importable.
</done>
</task>
<task type="auto">
<name>Task 2: Create localization DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/localization/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_localization.py
tests/test_localization_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "subcellular_localization" table. Record provenance: genes in cilia compartment, genes in centrosome, experimental vs computational counts, mean localization score. Create `query_cilia_localized(store) -> pl.DataFrame` helper for genes with cilia_proximity_score > 0.5.
**evidence_cmd.py**: Add `localization` subcommand to evidence command group. Follow gnomad pattern: checkpoint check, --force flag, load gene universe for gene_ids and symbol mapping, call process_localization_evidence, load to DuckDB, save provenance sidecar to data/localization/subcellular.provenance.json. Display summary: HPA coverage, cilia-localized gene count, evidence type distribution.
**tests/test_localization.py**: Unit tests with synthetic data. Mock httpx for HPA download. Test cases:
- test_hpa_location_parsing: Correct extraction of locations from semicolon-separated string
- test_cilia_compartment_detection: "Centrosome" in location -> compartment_centrosome=True
- test_adjacent_compartment_scoring: "Cytoskeleton" only -> cilia_proximity_score=0.5
- test_evidence_type_experimental: HPA Enhanced reliability -> experimental
- test_evidence_type_computational: HPA Uncertain reliability -> computational
- test_proteomics_override: Gene in proteomics but HPA uncertain -> evidence_type="both"
- test_null_handling_no_hpa: Gene not in HPA -> HPA columns NULL
- test_proteomics_absence_is_false: Gene not in proteomics -> in_cilia_proteomics=False (not NULL)
- test_score_normalization: Localization score in [0, 1]
- test_evidence_weight_applied: Experimental evidence scores higher than computational for same compartment
**tests/test_localization_integration.py**: Integration tests. Mock HPA download, synthetic proteomics reference. Test full pipeline, checkpoint-restart, provenance.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_localization.py tests/test_localization_integration.py -v
</verify>
<done>
All localization unit and integration tests pass. CLI `evidence localization` command registered. DuckDB stores subcellular_localization table with compartment flags, evidence types, and cilia proximity score. Checkpoint-restart works.
</done>
</task>
</tasks>
<verification>
- `python -m pytest tests/test_localization.py tests/test_localization_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.localization import *"` -- all exports importable
- `usher-pipeline evidence localization --help` -- CLI help displays
- DuckDB subcellular_localization table has columns: gene_id, gene_symbol, hpa_main_location, evidence_type, compartment_cilia, cilia_proximity_score, localization_score_normalized
</verification>
<success_criteria>
- LOCA-01: HPA subcellular and cilium/centrosome proteomics data integrated
- LOCA-02: Evidence distinguished as experimental vs computational based on HPA reliability and proteomics source
- LOCA-03: Localization score reflects cilia compartment proximity with evidence-type weighting
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
</success_criteria>
<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-04-SUMMARY.md`
</output>


@@ -0,0 +1,168 @@
---
phase: 03-core-evidence-layers
plan: 05
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/evidence/animal_models/__init__.py
- src/usher_pipeline/evidence/animal_models/models.py
- src/usher_pipeline/evidence/animal_models/fetch.py
- src/usher_pipeline/evidence/animal_models/transform.py
- src/usher_pipeline/evidence/animal_models/load.py
- tests/test_animal_models.py
- tests/test_animal_models_integration.py
- src/usher_pipeline/cli/evidence_cmd.py
autonomous: true
must_haves:
truths:
- "Pipeline retrieves gene knockout/perturbation phenotypes from MGI, ZFIN, and IMPC"
- "Phenotypes are filtered for relevance to sensory function, balance, vision, hearing, and cilia morphology"
- "Ortholog mapping uses established databases with confidence scoring, handling one-to-many mappings"
artifacts:
- path: "src/usher_pipeline/evidence/animal_models/fetch.py"
provides: "MGI, ZFIN, and IMPC phenotype data retrieval"
exports: ["fetch_mgi_phenotypes", "fetch_zfin_phenotypes", "fetch_impc_phenotypes", "fetch_ortholog_mapping"]
- path: "src/usher_pipeline/evidence/animal_models/transform.py"
provides: "Phenotype relevance filtering and scoring"
exports: ["filter_sensory_phenotypes", "score_animal_evidence", "process_animal_model_evidence"]
- path: "src/usher_pipeline/evidence/animal_models/load.py"
provides: "DuckDB persistence for animal model evidence"
exports: ["load_to_duckdb"]
- path: "tests/test_animal_models.py"
provides: "Unit tests for phenotype filtering and ortholog handling"
key_links:
- from: "src/usher_pipeline/evidence/animal_models/fetch.py"
to: "MGI/IMPC bulk data and ZFIN API"
via: "httpx with tenacity retry for bulk downloads"
pattern: "informatics\\.jax\\.org|mousephenotype\\.org|zfin\\.org"
- from: "src/usher_pipeline/evidence/animal_models/fetch.py"
to: "HCOP ortholog database"
via: "HCOP bulk download from HGNC"
pattern: "genenames\\.org.*hcop"
- from: "src/usher_pipeline/evidence/animal_models/transform.py"
to: "src/usher_pipeline/evidence/animal_models/fetch.py"
via: "filters phenotypes and scores animal model evidence"
pattern: "filter_sensory_phenotypes|score_animal_evidence"
- from: "src/usher_pipeline/evidence/animal_models/load.py"
to: "src/usher_pipeline/persistence/duckdb_store.py"
via: "store.save_dataframe"
pattern: "save_dataframe.*animal_model_phenotypes"
---
<objective>
Implement the Animal Model Phenotypes evidence layer (ANIM-01/02/03): retrieve knockout/perturbation phenotypes from MGI (mouse), ZFIN (zebrafish), and IMPC, map orthologs with confidence scoring, filter for sensory/cilia relevance, and score.
Purpose: Animal model phenotypes provide functional evidence -- genes whose orthologs cause sensory, balance, hearing, or cilia defects in mouse/zebrafish are strong candidates. Ortholog confidence prevents false positives from paralog mis-mapping.
Output: animal_model_phenotypes DuckDB table with per-gene ortholog mapping, phenotype summaries, sensory relevance flags, and normalized animal model score.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create animal model evidence data model, fetch (orthologs + phenotypes), and transform</name>
<files>
src/usher_pipeline/evidence/animal_models/__init__.py
src/usher_pipeline/evidence/animal_models/models.py
src/usher_pipeline/evidence/animal_models/fetch.py
src/usher_pipeline/evidence/animal_models/transform.py
</files>
<action>
Create the animal model evidence layer following the established fetch->transform pattern.
**models.py**: Define AnimalModelRecord pydantic model with fields: gene_id (str), gene_symbol (str), mouse_ortholog (str|None), mouse_ortholog_confidence (str|None -- "HIGH"/"MEDIUM"/"LOW" based on HCOP source count), zebrafish_ortholog (str|None), zebrafish_ortholog_confidence (str|None), has_mouse_phenotype (bool|None), has_zebrafish_phenotype (bool|None), has_impc_phenotype (bool|None), sensory_phenotype_count (int|None -- number of sensory-relevant phenotypes), phenotype_categories (str|None -- semicolon-separated list of matched MP/ZP terms), animal_model_score_normalized (float|None -- 0-1 composite). Define ANIMAL_TABLE_NAME = "animal_model_phenotypes". Define SENSORY_MP_KEYWORDS (Mammalian Phenotype ontology terms): ["hearing", "deaf", "vestibular", "balance", "retina", "photoreceptor", "vision", "blind", "cochlea", "stereocilia", "cilia", "cilium", "flagellum", "situs inversus", "laterality", "hydrocephalus", "kidney cyst", "polycystic"]. Define SENSORY_ZP_KEYWORDS similarly for zebrafish phenotype terms.
**fetch.py**: Four functions:
1. `fetch_ortholog_mapping(gene_ids: list[str]) -> pl.DataFrame` -- Download HCOP ortholog data from HGNC: https://ftp.ebi.ac.uk/pub/databases/genenames/hcop/human_mouse_hcop_fifteen_column.txt.gz and human_zebrafish equivalent. Parse with polars. Extract columns: human gene ID, ortholog ID, ortholog symbol, support count (number of databases agreeing). Assign confidence: HIGH (8+ sources), MEDIUM (4-7), LOW (1-3). For one-to-many mappings: keep the ortholog with highest support count. Flag genes with multiple orthologs (ortholog_count column). Return DataFrame with gene_id, mouse_ortholog, mouse_ortholog_confidence, zebrafish_ortholog, zebrafish_ortholog_confidence.
2. `fetch_mgi_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Download MGI gene-phenotype report from MGI FTP: https://www.informatics.jax.org/downloads/reports/MGI_GenePheno.rpt (or HMD_HumanPhenotype.rpt for human orthologs). Parse tab-separated report. Extract: mouse gene ID, allele symbol, MP term name, MP term ID. Return DataFrame with mouse gene and phenotype terms. Use httpx streaming download with retry.
3. `fetch_zfin_phenotypes(zebrafish_gene_ids: list[str]) -> pl.DataFrame` -- Download ZFIN phenotype data from: https://zfin.org/downloads/phenoGeneCleanData_fish.txt. Parse tab-separated. Extract: zebrafish gene, phenotype terms. Return DataFrame. Use httpx streaming download with retry.
4. `fetch_impc_phenotypes(mouse_gene_ids: list[str]) -> pl.DataFrame` -- Query IMPC SOLR API: https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=marker_symbol:{gene}&rows=1000. Process in batches (50 genes at a time with rate limiting). Extract: gene symbol, MP term, p-value, effect size. Return DataFrame. Use httpx with tenacity retry, ratelimit (5 req/sec conservative). If IMPC API is unreliable, fall back to bulk download from https://www.mousephenotype.org/data/release.
**transform.py**: Three functions:
1. `filter_sensory_phenotypes(phenotype_df: pl.DataFrame, keywords: list[str]) -> pl.DataFrame` -- Filter phenotype terms for sensory/cilia relevance using keyword matching (case-insensitive substring). Return only rows where MP/ZP term matches any keyword in SENSORY_MP_KEYWORDS or SENSORY_ZP_KEYWORDS. Count matches per gene as sensory_phenotype_count. Concatenate matched terms as phenotype_categories string.
2. `score_animal_evidence(df: pl.DataFrame) -> pl.DataFrame` -- Compute animal_model_score_normalized. Scoring formula: base_score = 0 if no phenotypes. For each organism with sensory phenotypes: mouse +0.4 (weighted by ortholog confidence: HIGH=1.0, MEDIUM=0.7, LOW=0.4), zebrafish +0.3 (same confidence weighting), IMPC +0.3 (independent confirmation bonus). Clamp to [0, 1]. Multiply by log2(sensory_phenotype_count + 1) / log2(max_count + 1) to reward additional phenotypes with diminishing returns (see the sketch at the end of this action). NULL if no ortholog mapping exists.
3. `process_animal_model_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch orthologs -> fetch MGI -> fetch ZFIN -> fetch IMPC -> join on orthologs -> filter sensory -> score -> collect.
Follow established patterns: NULL preservation (no ortholog = NULL, not zero), structlog logging. Handle one-to-many orthologs: take best confidence, aggregate phenotypes across all orthologs for that human gene.
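A sketch of both transform steps, assuming the has_* flags have already been restricted to sensory phenotypes and that the phenotype term column is named `phenotype_term` (both are assumptions; the weights mirror the formula above):

```python
import math

import polars as pl

def filter_sensory_phenotypes(phenotype_df: pl.DataFrame, keywords: list[str]) -> pl.DataFrame:
    # keywords are plain words/phrases, so a case-insensitive alternation is safe
    pattern = "(?i)" + "|".join(keywords)
    return phenotype_df.filter(pl.col("phenotype_term").str.contains(pattern))

def _confidence_weight(col: str) -> pl.Expr:
    return (
        pl.when(pl.col(col) == "HIGH").then(1.0)
        .when(pl.col(col) == "MEDIUM").then(0.7)
        .when(pl.col(col) == "LOW").then(0.4)
        .otherwise(0.0)
    )

def score_animal_evidence(df: pl.DataFrame) -> pl.DataFrame:
    max_count = df.get_column("sensory_phenotype_count").max() or 1  # guard empty/all-null
    base = (
        0.4 * _confidence_weight("mouse_ortholog_confidence")
        * pl.col("has_mouse_phenotype").cast(pl.Float64).fill_null(0.0)
        + 0.3 * _confidence_weight("zebrafish_ortholog_confidence")
        * pl.col("has_zebrafish_phenotype").cast(pl.Float64).fill_null(0.0)
        + 0.3 * pl.col("has_impc_phenotype").cast(pl.Float64).fill_null(0.0)
    ).clip(0.0, 1.0)
    # diminishing returns for additional sensory phenotypes
    scale = (pl.col("sensory_phenotype_count").fill_null(0) + 1).log(2) / math.log2(max_count + 1)
    score = (
        pl.when(pl.col("mouse_ortholog").is_null() & pl.col("zebrafish_ortholog").is_null())
        .then(pl.lit(None))  # no ortholog mapping -> NULL, not zero
        .otherwise(base * scale)
    )
    return df.with_columns(score.alias("animal_model_score_normalized"))
```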
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.animal_models import fetch_ortholog_mapping, fetch_mgi_phenotypes, fetch_zfin_phenotypes, fetch_impc_phenotypes, filter_sensory_phenotypes, score_animal_evidence, process_animal_model_evidence; print('imports OK')"
</verify>
<done>
Animal model fetch retrieves orthologs from HCOP and phenotypes from MGI, ZFIN, IMPC. Transform filters for sensory/cilia relevance and scores with ortholog confidence weighting. All functions importable.
</done>
</task>
<task type="auto">
<name>Task 2: Create animal model DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/animal_models/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_animal_models.py
tests/test_animal_models_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "animal_model_phenotypes" table. Record provenance: genes with mouse orthologs, genes with zebrafish orthologs, genes with sensory phenotypes, ortholog confidence distribution, mean sensory phenotype count. Create `query_sensory_phenotype_genes(store, min_score=0.3) -> pl.DataFrame` helper.
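A sketch of the helper, assuming the store exposes a DuckDB connection the way the gnomAD loader's store does (the `store.connection` accessor is an assumption):

```python
import polars as pl

def query_sensory_phenotype_genes(store, min_score: float = 0.3) -> pl.DataFrame:
    """Genes whose confidence-weighted animal-model score meets the threshold."""
    sql = """
        SELECT gene_id, gene_symbol, mouse_ortholog, zebrafish_ortholog,
               sensory_phenotype_count, animal_model_score_normalized
        FROM animal_model_phenotypes
        WHERE animal_model_score_normalized >= ?
        ORDER BY animal_model_score_normalized DESC
    """
    return store.connection.execute(sql, [min_score]).pl()  # .pl() is DuckDB's polars export
```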
**evidence_cmd.py**: Add `animal-models` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('animal_model_phenotypes')), --force flag, load gene universe for gene_ids, call process_animal_model_evidence, load to DuckDB, save provenance sidecar to data/animal_models/phenotypes.provenance.json. Display summary: ortholog coverage, sensory phenotype counts by organism, top scoring genes.
**tests/test_animal_models.py**: Unit tests with synthetic data. Mock httpx for all downloads. Test cases:
- test_ortholog_confidence_high: 8+ supporting sources -> HIGH
- test_ortholog_confidence_low: 1-3 sources -> LOW
- test_one_to_many_best_selected: Multiple mouse orthologs -> highest confidence kept
- test_sensory_keyword_match: "hearing loss" phenotype matches SENSORY_MP_KEYWORDS
- test_non_sensory_filtered: "increased body weight" phenotype filtered out
- test_score_with_confidence_weighting: HIGH confidence ortholog scores higher than LOW
- test_score_null_no_ortholog: Gene without ortholog -> NULL score
- test_multi_organism_bonus: Phenotypes in both mouse and zebrafish -> higher score
- test_phenotype_count_scaling: More sensory phenotypes -> higher score (diminishing returns via log)
- test_impc_integration: IMPC phenotypes contribute to score
**tests/test_animal_models_integration.py**: Integration tests. Mock HCOP download, MGI/ZFIN/IMPC responses. Test full pipeline, checkpoint-restart, provenance. Synthetic phenotype report fixtures.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v
</verify>
<done>
All animal model unit and integration tests pass. CLI `evidence animal-models` command registered. DuckDB stores animal_model_phenotypes table with ortholog mappings, phenotype summaries, and confidence-weighted scores. Checkpoint-restart works.
</done>
</task>
</tasks>
<verification>
- `python -m pytest tests/test_animal_models.py tests/test_animal_models_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.animal_models import *"` -- all exports importable
- `usher-pipeline evidence animal-models --help` -- CLI help displays
- DuckDB animal_model_phenotypes table has columns: gene_id, gene_symbol, mouse_ortholog, mouse_ortholog_confidence, sensory_phenotype_count, phenotype_categories, animal_model_score_normalized
</verification>
<success_criteria>
- ANIM-01: Phenotypes retrieved from MGI (mouse), ZFIN (zebrafish), and IMPC via bulk downloads and API
- ANIM-02: Phenotypes filtered for sensory/balance/vision/hearing/cilia relevance via keyword matching
- ANIM-03: Ortholog mapping via HCOP with confidence scoring (HIGH/MEDIUM/LOW), one-to-many handled by taking best confidence
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
</success_criteria>
<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-05-SUMMARY.md`
</output>


@@ -0,0 +1,178 @@
---
phase: 03-core-evidence-layers
plan: 06
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/evidence/literature/__init__.py
- src/usher_pipeline/evidence/literature/models.py
- src/usher_pipeline/evidence/literature/fetch.py
- src/usher_pipeline/evidence/literature/transform.py
- src/usher_pipeline/evidence/literature/load.py
- tests/test_literature.py
- tests/test_literature_integration.py
- src/usher_pipeline/cli/evidence_cmd.py
- pyproject.toml
autonomous: true
user_setup:
- service: ncbi
why: "PubMed E-utilities rate limit increase (3/sec -> 10/sec)"
env_vars:
- name: NCBI_API_KEY
source: "https://www.ncbi.nlm.nih.gov/account/settings/ -> API Key Management -> Create"
- name: NCBI_EMAIL
source: "Your email address (required by NCBI E-utilities)"
dashboard_config: []
must_haves:
truths:
- "Pipeline performs systematic PubMed queries per candidate gene for cilia, sensory, cytoskeleton, and cell polarity contexts"
- "Literature evidence distinguishes direct experimental evidence from incidental mentions and HTS hits"
- "Literature score reflects evidence quality not just publication count to mitigate well-studied gene bias"
artifacts:
- path: "src/usher_pipeline/evidence/literature/fetch.py"
provides: "PubMed query and publication retrieval per gene"
exports: ["query_pubmed_gene", "fetch_literature_evidence"]
- path: "src/usher_pipeline/evidence/literature/transform.py"
provides: "Evidence tier classification and quality-weighted scoring"
exports: ["classify_evidence_tier", "compute_literature_score", "process_literature_evidence"]
- path: "src/usher_pipeline/evidence/literature/load.py"
provides: "DuckDB persistence for literature evidence"
exports: ["load_to_duckdb"]
- path: "tests/test_literature.py"
provides: "Unit tests for evidence classification and quality scoring"
key_links:
- from: "src/usher_pipeline/evidence/literature/fetch.py"
to: "NCBI PubMed E-utilities"
via: "Biopython Entrez with ratelimit"
pattern: "Entrez\\.esearch|Entrez\\.efetch"
- from: "src/usher_pipeline/evidence/literature/transform.py"
to: "src/usher_pipeline/evidence/literature/fetch.py"
via: "classifies and scores fetched publication data"
pattern: "classify_evidence_tier|compute_literature_score"
- from: "src/usher_pipeline/evidence/literature/load.py"
to: "src/usher_pipeline/persistence/duckdb_store.py"
via: "store.save_dataframe"
pattern: "save_dataframe.*literature_evidence"
---
<objective>
Implement the Literature Evidence layer (LITE-01/02/03): perform systematic PubMed queries per gene for cilia/sensory contexts, classify evidence into quality tiers (direct experimental, functional mention, HTS hit, incidental), and compute a quality-weighted literature score that avoids well-studied-gene bias.
Purpose: Literature evidence confirms or suggests gene involvement in cilia/sensory pathways. Quality weighting prevents TP53-like genes (~100K publications) from drowning out novel candidates that have only five high-quality, on-topic papers.
Output: literature_evidence DuckDB table with per-gene PubMed hit counts by context, evidence tier classification, and quality-weighted normalized literature score.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create literature evidence data model, PubMed fetch, and scoring transform</name>
<files>
src/usher_pipeline/evidence/literature/__init__.py
src/usher_pipeline/evidence/literature/models.py
src/usher_pipeline/evidence/literature/fetch.py
src/usher_pipeline/evidence/literature/transform.py
pyproject.toml
</files>
<action>
Create the literature evidence layer following the established fetch->transform pattern.
**models.py**: Define LiteratureRecord pydantic model with fields: gene_id (str), gene_symbol (str), total_pubmed_count (int|None -- total publications mentioning gene), cilia_context_count (int|None), sensory_context_count (int|None), cytoskeleton_context_count (int|None), cell_polarity_context_count (int|None), direct_experimental_count (int|None -- publications with knockout/mutation/knockdown language), hts_screen_count (int|None -- publications from high-throughput screens), evidence_tier (str -- "direct_experimental", "functional_mention", "hts_hit", "incidental", "none"), literature_score_normalized (float|None -- 0-1 quality-weighted). Define LITERATURE_TABLE_NAME = "literature_evidence". Define SEARCH_CONTEXTS as dict mapping context names to PubMed search terms: {"cilia": "(cilia OR cilium OR ciliary OR flagellum OR intraflagellar)", "sensory": "(retina OR cochlea OR hair cell OR photoreceptor OR vestibular OR hearing OR usher syndrome)", "cytoskeleton": "(cytoskeleton OR actin OR microtubule OR motor protein)", "cell_polarity": "(cell polarity OR planar cell polarity OR apicobasal OR tight junction)"}. Define DIRECT_EVIDENCE_TERMS = "(knockout OR knockdown OR mutation OR CRISPR OR siRNA OR morpholino OR null allele)". Quote multi-word phrases (e.g. "hair cell", "usher syndrome", "null allele") when constructing queries so PubMed searches them as phrases instead of ANDing the individual words.
**fetch.py**: Two functions:
1. `query_pubmed_gene(gene_symbol: str, contexts: dict[str, str], email: str, api_key: str|None = None) -> dict` -- Use Biopython Entrez.esearch to query PubMed. For each context, construct query: `({gene_symbol}[Gene Name]) AND {context_terms}`. Also query with DIRECT_EVIDENCE_TERMS appended for the direct experimental count. Query for HTS: `({gene_symbol}[Gene Name]) AND (screen[Title/Abstract] OR proteomics[Title/Abstract] OR transcriptomics[Title/Abstract])`. Get the total count for the gene (no context filter). Use ratelimit decorator: 3 req/sec without API key, 10 req/sec with key (see the sketch after this list). Return dict with counts per context plus direct_experimental and hts counts. Set Entrez.email and Entrez.api_key from parameters.
2. `fetch_literature_evidence(gene_symbols: list[str], email: str, api_key: str|None = None) -> pl.DataFrame` -- Process all genes through query_pubmed_gene. Batch processing with progress logging (log every 100 genes). Build DataFrame with all count columns. For genes that fail (API errors), set counts to NULL (not zero). This function will be slow (~20K genes * ~6 queries each = ~120K queries, at 10/sec = ~3.3 hours with API key, ~11 hours without). Log estimated time at start. Support resumption by accepting partial results DataFrame.
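A minimal sketch of the rate-limited count query, assuming the ratelimit package's decorators and the query form specified above (retmax=0 retrieves only the hit count; the static 3/sec decorator shown here does not switch to 10/sec when a key is present -- a real implementation would handle that at runtime):

```python
from Bio import Entrez
from ratelimit import limits, sleep_and_retry

@sleep_and_retry
@limits(calls=3, period=1)  # no-key default; 10 req/sec is allowed with NCBI_API_KEY
def _esearch_count(query: str) -> int:
    """Return the PubMed hit count for a query without downloading records."""
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])

def query_pubmed_gene(gene_symbol: str, contexts: dict[str, str],
                      email: str, api_key: str | None = None) -> dict:
    Entrez.email = email  # required by NCBI E-utilities
    if api_key:
        Entrez.api_key = api_key
    counts: dict[str, int] = {
        "total_pubmed_count": _esearch_count(f"({gene_symbol}[Gene Name])"),
    }
    for name, terms in contexts.items():
        counts[f"{name}_context_count"] = _esearch_count(
            f"({gene_symbol}[Gene Name]) AND {terms}"
        )
    return counts
```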
Add `biopython>=1.84` to pyproject.toml dependencies (if not already present).
**transform.py**: Three functions:
1. `classify_evidence_tier(df: pl.DataFrame) -> pl.DataFrame` -- Classify each gene: "direct_experimental" if direct_experimental_count >= 1 AND cilia/sensory context count >= 1, "functional_mention" if cilia/sensory context count >= 1 AND total count >= 3, "hts_hit" if hts_screen_count >= 1 AND cilia/sensory context count >= 1, "incidental" if total_pubmed_count >= 1 but no context matches, "none" if total_pubmed_count == 0 or NULL. Evaluate conditions in this priority order and assign the first match, so genes with direct evidence are never downgraded to a lower tier. Add evidence_tier column.
2. `compute_literature_score(df: pl.DataFrame) -> pl.DataFrame` -- Quality-weighted scoring that avoids well-studied-gene bias. Formula: context_score = weighted sum of context counts (cilia * 2.0 + sensory * 2.0 + cytoskeleton * 1.0 + cell_polarity * 1.0), weighting the most relevant contexts highest. evidence_quality_weight = {"direct_experimental": 1.0, "functional_mention": 0.6, "hts_hit": 0.3, "incidental": 0.1, "none": 0.0}. raw_score = context_score * evidence_quality_weight. CRITICAL bias mitigation: divide by log2(total_pubmed_count + 1) to penalize genes whose cilia mentions are a tiny fraction of a huge literature. Final normalization: rank-percentile across all genes, scaled to [0, 1] (see the sketch at the end of this action). NULL if total_pubmed_count is NULL.
3. `process_literature_evidence(gene_ids: list[str], gene_symbol_map: pl.DataFrame, email: str, api_key: str|None = None) -> pl.DataFrame` -- End-to-end: map gene_ids to symbols -> fetch literature -> classify tier -> compute score -> collect.
Follow established patterns: NULL preservation, structlog logging. IMPORTANT: PubMed queries are slow. Design for checkpoint-restart (save partial results every 500 genes). Log progress extensively.
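A sketch of both transform steps under the tier order and weights above (column handling of NULLs is an interpretation: failed queries stay NULL in the final score, but are treated as zero inside intermediate arithmetic):

```python
import polars as pl

TIER_WEIGHT = {"direct_experimental": 1.0, "functional_mention": 0.6,
               "hts_hit": 0.3, "incidental": 0.1, "none": 0.0}

def classify_evidence_tier(df: pl.DataFrame) -> pl.DataFrame:
    core = (pl.col("cilia_context_count").fill_null(0)
            + pl.col("sensory_context_count").fill_null(0))
    tier = (
        pl.when(pl.col("total_pubmed_count").fill_null(0) == 0)
        .then(pl.lit("none"))  # zero or NULL publications
        .when((pl.col("direct_experimental_count").fill_null(0) >= 1) & (core >= 1))
        .then(pl.lit("direct_experimental"))
        .when((core >= 1) & (pl.col("total_pubmed_count") >= 3))
        .then(pl.lit("functional_mention"))
        .when((pl.col("hts_screen_count").fill_null(0) >= 1) & (core >= 1))
        .then(pl.lit("hts_hit"))
        .otherwise(pl.lit("incidental"))
    )
    return df.with_columns(tier.alias("evidence_tier"))

def compute_literature_score(df: pl.DataFrame) -> pl.DataFrame:
    context_score = (
        2.0 * pl.col("cilia_context_count").fill_null(0)
        + 2.0 * pl.col("sensory_context_count").fill_null(0)
        + 1.0 * pl.col("cytoskeleton_context_count").fill_null(0)
        + 1.0 * pl.col("cell_polarity_context_count").fill_null(0)
    )
    quality = pl.col("evidence_tier").replace_strict(TIER_WEIGHT, return_dtype=pl.Float64)
    # bias mitigation: divide by log2(total + 1); floor at 1 so total == 0 cannot divide by zero
    denom = (pl.col("total_pubmed_count").fill_null(0) + 1).log(2).clip(1.0)
    raw = context_score * quality / denom
    score = (
        pl.when(pl.col("total_pubmed_count").is_null())
        .then(pl.lit(None))                         # failed query -> NULL, not zero
        .otherwise(raw.rank("average") / pl.len())  # rank-percentile -> [0, 1]
    )
    return df.with_columns(score.alias("literature_score_normalized"))
```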
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.literature import query_pubmed_gene, fetch_literature_evidence, classify_evidence_tier, compute_literature_score, process_literature_evidence; print('imports OK')"
</verify>
<done>
Literature fetch queries PubMed via Biopython Entrez with rate limiting and context-specific search terms. Transform classifies evidence tiers and computes quality-weighted score with bias mitigation via total publication normalization. All functions importable.
</done>
</task>
<task type="auto">
<name>Task 2: Create literature DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/literature/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_literature.py
tests/test_literature_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "literature_evidence" table. Record provenance: genes queried, genes with direct evidence, tier distribution, mean literature score, total PubMed queries made. Create `query_literature_supported(store, min_tier="functional_mention") -> pl.DataFrame` helper returning genes with evidence tier at or above specified tier.
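The "at or above" semantics need an explicit tier ordering; a sketch (same `store.connection` assumption as the other loaders):

```python
import polars as pl

TIER_RANK = {"none": 0, "incidental": 1, "hts_hit": 2,
             "functional_mention": 3, "direct_experimental": 4}

def query_literature_supported(store, min_tier: str = "functional_mention") -> pl.DataFrame:
    """Genes whose evidence tier ranks at or above min_tier."""
    allowed = [t for t, rank in TIER_RANK.items() if rank >= TIER_RANK[min_tier]]
    placeholders = ", ".join("?" for _ in allowed)
    sql = (
        "SELECT * FROM literature_evidence "
        f"WHERE evidence_tier IN ({placeholders}) "
        "ORDER BY literature_score_normalized DESC"
    )
    return store.connection.execute(sql, allowed).pl()
```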
**evidence_cmd.py**: Add `literature` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('literature_evidence')), --force flag, --email (required, for NCBI), --api-key (optional, for higher rate limit), --batch-size (default 500, for partial checkpoint saves), load gene universe for gene symbols, call process_literature_evidence, load to DuckDB, save provenance sidecar to data/literature/pubmed.provenance.json. Display summary: genes queried, evidence tier distribution, estimated remaining time. IMPORTANT: This command will take hours. Log progress clearly. Save partial checkpoints to DuckDB every batch-size genes.
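A sketch of the resumable batch loop the command would drive; `fetch_batch`, `save_partial`, and `load_partial` are hypothetical hooks for the Entrez calls and the DuckDB checkpoint plumbing:

```python
import structlog

log = structlog.get_logger()

def fetch_in_batches(gene_symbols: list[str], batch_size: int,
                     fetch_batch, save_partial, load_partial) -> None:
    """Resume from previously saved symbols, then fetch and checkpoint every batch."""
    done = load_partial()  # set of gene symbols already persisted; empty on first run
    pending = [g for g in gene_symbols if g not in done]
    log.info("literature_fetch_start", total=len(gene_symbols), remaining=len(pending))
    for start in range(0, len(pending), batch_size):
        batch = pending[start : start + batch_size]
        save_partial(fetch_batch(batch))  # checkpoint: partial results survive a crash
        log.info("literature_batch_saved",
                 completed=start + len(batch),
                 remaining=len(pending) - start - len(batch))
```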
**tests/test_literature.py**: Unit tests with synthetic data. Mock Biopython Entrez.esearch. Test cases:
- test_direct_experimental_classification: Gene with knockout paper in cilia context -> "direct_experimental"
- test_functional_mention_classification: Gene with cilia context but no knockout -> "functional_mention"
- test_hts_hit_classification: Gene from proteomics screen in cilia context -> "hts_hit"
- test_incidental_classification: Gene with publications but no cilia/sensory context -> "incidental"
- test_no_evidence: Gene with zero publications -> "none"
- test_bias_mitigation: TP53-like gene (100K total, 5 cilia) scores LOWER than novel gene (10 total, 5 cilia)
- test_quality_weighting: Direct experimental evidence scores higher than incidental mention
- test_null_preservation: Failed PubMed query -> NULL counts, not zero
- test_context_weighting: Cilia/sensory contexts weighted higher than cytoskeleton
- test_score_normalization: Final score in [0, 1]
**tests/test_literature_integration.py**: Integration tests. Mock Entrez.esearch/efetch responses. Test full pipeline with small gene set, checkpoint-restart (verify partial results preserved), provenance recording. Synthetic PubMed response fixtures.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_literature.py tests/test_literature_integration.py -v
</verify>
<done>
All literature unit and integration tests pass. CLI `evidence literature` command registered with --email and --api-key options. DuckDB stores literature_evidence table with context counts, evidence tiers, and quality-weighted scores. Bias mitigation validated: novel genes with focused evidence score higher than well-studied genes with incidental mentions. Checkpoint-restart works for long-running PubMed queries.
</done>
</task>
</tasks>
<verification>
- `python -m pytest tests/test_literature.py tests/test_literature_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.literature import *"` -- all exports importable
- `usher-pipeline evidence literature --help` -- CLI help displays with --email and --api-key options
- DuckDB literature_evidence table has columns: gene_id, gene_symbol, total_pubmed_count, cilia_context_count, sensory_context_count, direct_experimental_count, evidence_tier, literature_score_normalized
- Bias test: Gene with 10 total/5 cilia publications scores higher than gene with 100K total/5 cilia publications
</verification>
<success_criteria>
- LITE-01: Systematic PubMed queries for each gene across cilia, sensory, cytoskeleton, and cell polarity contexts
- LITE-02: Evidence classified into tiers: direct_experimental, functional_mention, hts_hit, incidental, none
- LITE-03: Quality-weighted score with bias mitigation via total publication normalization -- evidence quality matters more than count
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
- User setup: NCBI_API_KEY and NCBI_EMAIL documented for rate limit increase
</success_criteria>
<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-06-SUMMARY.md`
</output>