---
phase: 03-core-evidence-layers
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/evidence/annotation/__init__.py
- src/usher_pipeline/evidence/annotation/models.py
- src/usher_pipeline/evidence/annotation/fetch.py
- src/usher_pipeline/evidence/annotation/transform.py
- src/usher_pipeline/evidence/annotation/load.py
- tests/test_annotation.py
- tests/test_annotation_integration.py
- src/usher_pipeline/cli/evidence_cmd.py
- pyproject.toml
autonomous: true
must_haves:
  truths:
    - "Pipeline quantifies functional annotation depth per gene using GO term count, UniProt annotation score, and pathway membership"
    - "Genes are classified into annotation tiers (well-annotated, partially-annotated, poorly-annotated) based on composite annotation metrics"
    - "Annotation completeness score is normalized to 0-1 scale and stored as an evidence layer in DuckDB"
  artifacts:
    - path: "src/usher_pipeline/evidence/annotation/fetch.py"
      provides: "GO term count and UniProt annotation score retrieval per gene"
      exports: ["fetch_go_annotations", "fetch_uniprot_scores"]
    - path: "src/usher_pipeline/evidence/annotation/transform.py"
      provides: "Annotation tier classification and 0-1 normalization"
      exports: ["classify_annotation_tier", "normalize_annotation_score", "process_annotation_evidence"]
    - path: "src/usher_pipeline/evidence/annotation/load.py"
      provides: "DuckDB persistence for annotation evidence"
      exports: ["load_to_duckdb"]
    - path: "tests/test_annotation.py"
      provides: "Unit tests for annotation scoring, tiering, NULL handling"
  key_links:
    - from: "src/usher_pipeline/evidence/annotation/fetch.py"
      to: "mygene.info API"
      via: "mygene library batch query"
      pattern: "mg\\.querymany.*fields.*go"
    - from: "src/usher_pipeline/evidence/annotation/transform.py"
      to: "src/usher_pipeline/evidence/annotation/fetch.py"
      via: "processes fetched GO/UniProt data into scores"
      pattern: "classify_annotation_tier"
    - from: "src/usher_pipeline/evidence/annotation/load.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "store.save_dataframe"
      pattern: "save_dataframe.*annotation_completeness"
---
Implement the Gene Annotation Completeness evidence layer (ANNOT-01/02/03): retrieve GO term counts and UniProt annotation scores per gene, classify genes into annotation tiers, normalize to a 0-1 composite score, and persist to DuckDB.
Purpose: Annotation depth quantifies how well-studied each gene is -- poorly-annotated genes with other evidence are prime candidates for under-studied cilia/Usher genes.
Output: annotation_completeness DuckDB table with per-gene GO counts, UniProt scores, pathway flags, annotation tier, and normalized composite score.
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/models.py
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
Task 1: Create annotation evidence data model, fetch, and transform modules
src/usher_pipeline/evidence/annotation/__init__.py
src/usher_pipeline/evidence/annotation/models.py
src/usher_pipeline/evidence/annotation/fetch.py
src/usher_pipeline/evidence/annotation/transform.py
pyproject.toml
Create the annotation evidence layer following the established gnomAD pattern (fetch->transform->load).
**models.py**: Define AnnotationRecord pydantic model with fields: gene_id (str), gene_symbol (str), go_term_count (int|None), go_biological_process_count (int|None), go_molecular_function_count (int|None), go_cellular_component_count (int|None), uniprot_annotation_score (int|None -- UniProt annotation score 1-5), has_pathway_membership (bool|None -- present in any KEGG/Reactome pathway), annotation_tier (str -- "well_annotated", "partially_annotated", "poorly_annotated"), annotation_score_normalized (float|None -- 0-1 composite). Define ANNOTATION_TABLE_NAME = "annotation_completeness".
**fetch.py**: Two fetch functions:
1. `fetch_go_annotations(gene_ids: list[str]) -> pl.DataFrame` -- Use the mygene library (already a dependency) to batch-query GO annotations from mygene.info. Call `mg.querymany(gene_ids, scopes='ensembl.gene', fields='go,pathway.kegg,pathway.reactome,symbol', species='human')`. For each gene, count GO terms per ontology (BP, MF, CC) and in total. Return a DataFrame, as the signature declares (call `.lazy()` downstream where lazy evaluation is wanted), with gene_id, gene_symbol, go_term_count, go_biological_process_count, go_molecular_function_count, go_cellular_component_count, has_pathway_membership. Represent genes with no GO annotations as NULL (not zero). Process in batches of 1000 to avoid mygene timeouts.
2. `fetch_uniprot_scores(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- Use the UniProt REST API (httpx with tenacity retry) to batch-query annotation scores. The score is available in the REST API JSON under `.annotationScore`. Query batches of 100 accessions using the UniProt search endpoint: `https://rest.uniprot.org/uniprotkb/search?query=accession:P12345+OR+accession:P67890&fields=accession,annotation_score`. Use a ratelimit decorator (200 req/sec). Return a DataFrame, as the signature declares, with gene_id, uniprot_annotation_score. NULL for genes without a UniProt mapping.
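The per-gene GO counting described in item 1 can be sketched in plain Python. This is a minimal sketch, not the module's implementation: `summarize_go_hit` and `_as_list` are hypothetical names, and the hit shape (a `go` dict keyed by `BP`/`MF`/`CC` whose values are either a list or a single dict, plus `notfound` on unmatched queries) is an assumption to verify against real mygene output. Counting distinct GO ids (rather than raw entries) and reporting `False` pathway membership for a matched gene with no pathway field are also assumptions.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """Normalize a field that may be absent, a single dict, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarize_go_hit(hit: dict) -> dict:
    """Turn one mygene-style hit into a row of annotation counts (hypothetical helper)."""
    row = {
        "gene_id": hit.get("query"),
        "gene_symbol": hit.get("symbol"),
        "go_term_count": None,
        "go_biological_process_count": None,
        "go_molecular_function_count": None,
        "go_cellular_component_count": None,
        "has_pathway_membership": None,
    }
    if hit.get("notfound"):
        # Unmatched gene: every field stays NULL (unknown != zero).
        return row

    go = hit.get("go")
    if go:
        # Assumption: count distinct GO ids per ontology, so the same term
        # listed with several evidence codes is counted once.
        bp, mf, cc = (
            len({term.get("id") for term in _as_list(go.get(key))})
            for key in ("BP", "MF", "CC")
        )
        row.update(
            go_term_count=bp + mf + cc,
            go_biological_process_count=bp,
            go_molecular_function_count=mf,
            go_cellular_component_count=cc,
        )

    # Assumption: a matched gene without a pathway field is a known non-member.
    pathway = hit.get("pathway") or {}
    row["has_pathway_membership"] = bool(pathway.get("kegg") or pathway.get("reactome"))
    return row
```

A list of such rows can be passed to `pl.DataFrame(rows)` to build the frame the function returns.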
Add `ratelimit` to pyproject.toml dependencies if not already present.
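The UniProt batching can be sketched with the standard library alone. This is an illustration under assumptions, not the fetch module itself: `build_score_query_urls` and `parse_scores` are hypothetical names, and the `format`, `size`, and `primaryAccession` names reflect the UniProt REST API as commonly documented and should be checked before use. The `size` parameter is set explicitly because the search endpoint pages its results, so a 100-accession query may otherwise come back truncated.

```python
from urllib.parse import urlencode

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_score_query_urls(accessions, batch_size=100):
    """Build one search URL per batch of accessions (hypothetical helper)."""
    urls = []
    for batch in chunked(accessions, batch_size):
        params = {
            "query": " OR ".join(f"accession:{acc}" for acc in batch),
            "fields": "accession,annotation_score",
            "format": "json",    # assumption: request JSON explicitly
            "size": len(batch),  # assumption: raise page size to cover the batch
        }
        urls.append(f"{UNIPROT_SEARCH_URL}?{urlencode(params)}")
    return urls


def parse_scores(payload):
    """Map accession -> annotationScore from one decoded response body.

    Assumes entries sit under "results" and carry "primaryAccession".
    """
    return {
        entry["primaryAccession"]: entry.get("annotationScore")
        for entry in payload.get("results", [])
    }
```

Each URL would then be fetched with httpx inside the retry and rate-limit wrappers, and accessions absent from the parsed mapping stay NULL after the join back to gene_id.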
**transform.py**: Three functions:
1. `classify_annotation_tier(df: pl.DataFrame) -> pl.DataFrame` -- Add annotation_tier column: "well_annotated" (go_term_count >= 20 AND uniprot_annotation_score >= 4), "partially_annotated" (go_term_count >= 5 OR uniprot_annotation_score >= 3), "poorly_annotated" (everything else including NULLs). Evaluate the tiers in this order; the first match wins, since every well-annotated gene also satisfies the partial condition. NULL GO counts are treated as zero for tier classification only (conservative -- assume unannotated); the stored count columns stay NULL.
2. `normalize_annotation_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute composite annotation score. Formula: weighted average of (a) log2(go_term_count + 1) normalized by max across dataset, (b) uniprot_annotation_score / 5.0, (c) has_pathway_membership as 0/1. Weights: 0.5 GO, 0.3 UniProt, 0.2 Pathway. Result clamped to 0-1. NULL if ALL three inputs are NULL.
3. `process_annotation_evidence(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- End-to-end pipeline: fetch GO -> fetch UniProt -> join -> classify tier -> normalize -> collect.
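The tier thresholds can be written as a plain-Python reference, which is handy for deriving expected values in unit tests. This is a sketch only: `annotation_tier` is a hypothetical name, and treating a NULL UniProt score as failing both thresholds is an assumption consistent with "everything else including NULLs".

```python
from typing import Optional


def annotation_tier(go_term_count: Optional[int], uniprot_score: Optional[int]) -> str:
    """Reference tier rule for one gene (hypothetical helper)."""
    # NULL GO counts are treated as zero for classification only.
    go = go_term_count if go_term_count is not None else 0
    # Assumption: a NULL UniProt score fails both thresholds.
    score = uniprot_score if uniprot_score is not None else 0

    # Order matters: the strictest tier is checked first.
    if go >= 20 and score >= 4:
        return "well_annotated"
    if go >= 5 or score >= 3:
        return "partially_annotated"
    return "poorly_annotated"
```

The polars version would express the same order with a chained `pl.when(...).then(...).when(...).then(...).otherwise(...)` after filling NULLs on temporary expressions rather than on the stored columns.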
Follow established patterns: NULL preservation (unknown != zero), structlog logging, lazy polars evaluation where possible.
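The composite formula can be checked against a small plain-Python reference. This is a sketch under stated assumptions, not the polars implementation: `composite_scores` is a hypothetical name, and the handling of partially missing inputs is an assumption the plan does not specify. Here a missing component contributes 0 with no re-weighting, and only a gene missing all three inputs gets a NULL score.

```python
import math

W_GO, W_UNIPROT, W_PATHWAY = 0.5, 0.3, 0.2


def composite_scores(rows):
    """Score a dataset of (go_term_count, uniprot_score, has_pathway) tuples.

    Any element may be None. Returns one float in [0, 1] or None per row.
    """
    # The GO component is normalized by the dataset-wide maximum of log2(count + 1).
    log_counts = [math.log2(go + 1) for go, _, _ in rows if go is not None]
    max_log = max(log_counts, default=0.0)

    scores = []
    for go, uniprot, pathway in rows:
        if go is None and uniprot is None and pathway is None:
            scores.append(None)  # all inputs unknown: score stays NULL
            continue
        # Assumption: a missing component contributes 0 (weights are not rescaled).
        go_part = math.log2(go + 1) / max_log if go is not None and max_log > 0 else 0.0
        uniprot_part = uniprot / 5.0 if uniprot is not None else 0.0
        pathway_part = 1.0 if pathway else 0.0
        raw = W_GO * go_part + W_UNIPROT * uniprot_part + W_PATHWAY * pathway_part
        scores.append(min(1.0, max(0.0, raw)))  # clamp to 0-1
    return scores
```

For example, with a dataset maximum of 31 GO terms (log2(32) = 5), a gene with 31 terms, UniProt score 5, and pathway membership scores 0.5 + 0.3 + 0.2 = 1.0, while a gene with only a UniProt score of 4 scores 0.3 * 0.8 = 0.24 under the zero-contribution assumption.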
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.annotation import fetch_go_annotations, fetch_uniprot_scores, classify_annotation_tier, normalize_annotation_score, process_annotation_evidence; print('imports OK')"
Annotation fetch module retrieves GO terms via mygene and UniProt scores via REST API. Transform module classifies genes into 3 tiers and normalizes composite score to 0-1 scale. All functions importable and follow established NULL preservation patterns.
Task 2: Create annotation DuckDB loader, CLI command, and tests
src/usher_pipeline/evidence/annotation/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_annotation.py
tests/test_annotation_integration.py
**load.py**: Follow gnomad/load.py pattern exactly. Create `load_to_duckdb(df, store, provenance, description)` that saves to "annotation_completeness" table with CREATE OR REPLACE. Record provenance with tier distribution counts (well/partial/poor), NULL annotation counts, mean/median annotation scores. Create `query_poorly_annotated(store, max_score=0.3) -> pl.DataFrame` helper to find under-studied genes.
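The provenance statistics named above can be sketched as a small summary function. This is an illustration only: `summarize_for_provenance` and the key names in the returned dict are hypothetical, and the real loader should mirror whatever shape gnomad/load.py records.

```python
from collections import Counter
from statistics import mean, median


def summarize_for_provenance(rows):
    """Summarize annotation rows (list of dicts) for a provenance record."""
    tiers = Counter(row["annotation_tier"] for row in rows)
    # NULL scores are excluded from mean/median and counted separately.
    scores = [
        row["annotation_score_normalized"]
        for row in rows
        if row["annotation_score_normalized"] is not None
    ]
    return {
        "total_genes": len(rows),
        "well_annotated_count": tiers.get("well_annotated", 0),
        "partially_annotated_count": tiers.get("partially_annotated", 0),
        "poorly_annotated_count": tiers.get("poorly_annotated", 0),
        "null_go_count": sum(row["go_term_count"] is None for row in rows),
        "null_uniprot_count": sum(row["uniprot_annotation_score"] is None for row in rows),
        "null_score_count": len(rows) - len(scores),
        "mean_annotation_score": mean(scores) if scores else None,
        "median_annotation_score": median(scores) if scores else None,
    }
```

Returning None for the mean and median of an empty score list keeps the record valid when every gene lacks annotation data.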
**evidence_cmd.py**: Add `annotation` subcommand to existing evidence command group. Follow gnomad command pattern: checkpoint check (has_checkpoint('annotation_completeness')), --force flag, load gene universe from DuckDB gene_universe table to get gene_ids and uniprot mappings, call process_annotation_evidence, load to DuckDB, save provenance sidecar to data/annotation/completeness.provenance.json. Display summary with tier distribution counts.
**tests/test_annotation.py**: Unit tests with synthetic data (no external API calls). Mock mygene.querymany to return synthetic GO data. Test cases:
- test_go_count_extraction: Correct GO term counting by category
- test_null_go_handling: Genes with no GO data get NULL counts
- test_tier_classification_well_annotated: High GO + high UniProt = well_annotated
- test_tier_classification_poorly_annotated: Low/NULL GO + low UniProt = poorly_annotated
- test_tier_classification_partial: Medium annotations = partially_annotated
- test_normalization_bounds: Score always in [0, 1] range
- test_normalization_null_preservation: All-NULL inputs produce NULL score
- test_normalization_with_pathway: Pathway membership contributes to score
- test_composite_weighting: Verify 0.5/0.3/0.2 weight distribution
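One way to keep these unit tests free of network calls is to inject a mocked client and assert on how it was called. The sketch below is hypothetical in its names (`fetch_go_hits`, `make_fake_client`) and assumes the fetch code accepts a client object; the real tests may instead patch the mygene client where fetch.py creates it.

```python
from unittest.mock import MagicMock

GO_FIELDS = "go,pathway.kegg,pathway.reactome,symbol"


def fetch_go_hits(client, gene_ids, batch_size=1000):
    """Query a mygene-like client in batches and collect the raw hits."""
    hits = []
    for start in range(0, len(gene_ids), batch_size):
        batch = gene_ids[start:start + batch_size]
        hits.extend(
            client.querymany(batch, scopes="ensembl.gene", fields=GO_FIELDS, species="human")
        )
    return hits


def make_fake_client():
    """Build a mock whose querymany returns one synthetic hit per queried id."""
    def fake_querymany(ids, **kwargs):
        # Ids ending in "X" simulate genes that mygene cannot match.
        return [
            {"query": gid, "notfound": True}
            if gid.endswith("X")
            else {"query": gid, "symbol": "SYM", "go": {"BP": [{"id": "GO:0000001"}]}}
            for gid in ids
        ]

    client = MagicMock()
    client.querymany.side_effect = fake_querymany
    return client
```

A test can then call `fetch_go_hits(make_fake_client(), ids)` and check both the returned hits and `client.querymany.call_count`, which verifies the 1000-gene batching without contacting mygene.info.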
**tests/test_annotation_integration.py**: Integration tests following gnomad pattern. Mock mygene and UniProt API. Test full pipeline: fetch -> transform -> load -> query. Test checkpoint-restart, provenance recording, idempotent loading. Use synthetic fixtures.
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v
All annotation unit and integration tests pass. CLI `evidence annotation` command registered and functional. DuckDB stores annotation_completeness table with tier classification. Checkpoint-restart works. Provenance tracked.
- `python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.annotation import *"` -- all exports importable
- `usher-pipeline evidence annotation --help` -- CLI help displays with correct options
- DuckDB annotation_completeness table has columns: gene_id, gene_symbol, go_term_count, uniprot_annotation_score, has_pathway_membership, annotation_tier, annotation_score_normalized
- ANNOT-01: GO term count and UniProt annotation score retrieved per gene with NULL for missing data
- ANNOT-02: Genes classified into well/partially/poorly annotated tiers based on composite metrics
- ANNOT-03: Normalized 0-1 annotation score stored in DuckDB annotation_completeness table
- Pattern compliance: fetch->transform->load->CLI->tests matching gnomAD evidence layer structure