--- phase: 03-core-evidence-layers plan: 01 subsystem: evidence-layers tags: - annotation-completeness - go-terms - uniprot-scores - tier-classification - evidence-layer dependency_graph: requires: - gene-universe (DuckDB) - mygene.info API - UniProt REST API provides: - annotation_completeness (DuckDB table) - annotation tier classification (well/partial/poor) - composite annotation scores (0-1 normalized) affects: - scoring-pipeline (future: annotation weight = 0.15) tech_stack: added: - mygene library (GO term retrieval) - UniProt REST API client (annotation scores) patterns: - fetch->transform->load pattern (established in gnomAD) - NULL preservation (unknown != zero) - lazy polars evaluation - checkpoint-restart - composite weighted scoring key_files: created: - src/usher_pipeline/evidence/annotation/__init__.py - src/usher_pipeline/evidence/annotation/models.py - src/usher_pipeline/evidence/annotation/fetch.py - src/usher_pipeline/evidence/annotation/transform.py - src/usher_pipeline/evidence/annotation/load.py - tests/test_annotation.py - tests/test_annotation_integration.py modified: - src/usher_pipeline/cli/evidence_cmd.py decisions: - key: annotation-tier-thresholds decision: "Well-annotated: GO >= 20 AND UniProt >= 4; Partially: GO >= 5 OR UniProt >= 3; Poorly: everything else" rationale: "Thresholds based on exploratory analysis of GO term distribution; AND for well-annotated ensures high confidence; OR for partial catches genes with strong evidence in either dimension" - key: composite-weighting decision: "GO 50%, UniProt 30%, Pathway 20%" rationale: "GO terms are most comprehensive annotation source (hence 50%); UniProt scores are curated quality indicator (30%); Pathway membership is binary signal (20%)" - key: null-handling decision: "NULL GO counts treated as zero for tier classification (conservative), but preserved as NULL in data" rationale: "Conservative assumption for tiering: unknown annotation assumed to be poor until proven otherwise; but preserve NULL in data to distinguish from measured zero" - key: batch-sizes decision: "mygene batch=1000, UniProt batch=100" rationale: "mygene supports large batches efficiently; UniProt API more restrictive" metrics: duration_seconds: 434 duration_minutes: 7.2 tasks_completed: 2 files_created: 7 files_modified: 1 tests_added: 15 test_pass_rate: 100% lines_of_code: ~1800 completed_date: 2026-02-11 validation: must_haves: "3/3 PASS" requirements: "4/4 PASS" tests: "15/15 PASS" --- # Phase 03 Plan 01: Annotation Completeness Evidence Layer Summary **One-liner:** GO term counts (mygene.info) and UniProt annotation scores (REST API) combined into 3-tier classification (well/partial/poor) with 0-1 normalized composite scoring, stored in DuckDB with full provenance tracking. ## What Was Built Successfully implemented the gene annotation completeness evidence layer (ANNOT-01/02/03) following the established fetch->transform->load pattern from gnomAD: ### Data Model (models.py) - `AnnotationRecord` pydantic model with comprehensive annotation metrics: - GO term counts by ontology (BP, MF, CC) and total - UniProt annotation score (1-5 scale) - Pathway membership (KEGG/Reactome presence) - Annotation tier classification (3 categories) - Normalized composite score (0-1 range) - NULL preservation throughout: unknown annotation ≠ zero annotation ### Fetch Module (fetch.py) - `fetch_go_annotations`: Batch queries mygene.info for GO terms and pathway data - Processes 1000 genes/batch to avoid API timeout - Extracts counts by GO ontology category - Handles NULL values for genes with no GO annotations - `fetch_uniprot_scores`: Batch queries UniProt REST API for annotation scores - Processes 100 accessions/batch with tenacity retry - Joins scores back to gene IDs via UniProt mapping - Returns NULL for genes without mapping or score ### Transform Module (transform.py) - `classify_annotation_tier`: 3-tier classification system - Well-annotated: GO >= 20 AND UniProt >= 4 - Partially-annotated: GO >= 5 OR UniProt >= 3 - Poorly-annotated: everything else (including NULLs) - Conservative approach: NULL GO counts treated as zero for tier assignment - `normalize_annotation_score`: Composite 0-1 score - GO component (50%): log2(count+1) normalized by dataset max - UniProt component (30%): score/5.0 - Pathway component (20%): boolean as 0/1 - NULL if all three inputs are NULL - `process_annotation_evidence`: End-to-end pipeline composing all operations ### Load Module (load.py) - `load_to_duckdb`: Idempotent DuckDB storage with provenance - CREATE OR REPLACE annotation_completeness table - Records tier distribution, NULL counts, mean/median scores - Full provenance tracking with summary statistics - `query_poorly_annotated`: Helper to find under-studied genes - Filters by annotation score threshold (default: <= 0.3) - Sorted by score (lowest first) - Useful for identifying candidate genes with low annotation depth ### CLI Integration (evidence_cmd.py) - `evidence annotation` subcommand added to CLI - Checkpoint-restart pattern (skips if table exists) - --force flag for reprocessing - Loads gene universe from DuckDB (gene IDs + UniProt mapping) - Displays tier distribution summary - Saves provenance sidecar to data/annotation/completeness.provenance.json ### Tests **Unit tests (test_annotation.py) - 9 tests:** - GO term counting by category - NULL GO handling (preserved, not converted to zero) - Tier classification (well/partial/poor with correct thresholds) - Score normalization bounds ([0, 1] clamping) - NULL preservation (all-NULL inputs → NULL score) - Pathway membership contribution - Composite weighting verification (0.5/0.3/0.2) **Integration tests (test_annotation_integration.py) - 6 tests:** - Full pipeline with mocked APIs (mygene + UniProt) - DuckDB load idempotency (CREATE OR REPLACE) - Checkpoint-restart functionality - Provenance metadata recording - Query helper (poorly_annotated filter) - NULL handling throughout pipeline All 15 tests pass (100% pass rate). ## Deviations from Plan None - plan executed exactly as written. All tasks completed without modifications: - Task 1: Data model, fetch, and transform modules created as specified - Task 2: Load module, CLI command, and tests added per plan requirements No bugs found, no blocking issues encountered, no architectural changes needed. ## Technical Decisions 1. **Annotation Tier Thresholds**: Well >= (20 GO AND 4 UniProt), Partial >= (5 GO OR 3 UniProt), Poor = rest - Based on exploratory analysis showing GO term distribution clusters around these values - AND for well-annotated ensures high confidence in both dimensions - OR for partial catches genes with strong evidence in either GO or UniProt 2. **Composite Score Weighting**: GO 50%, UniProt 30%, Pathway 20% - GO terms are most comprehensive annotation source (thousands of terms available) - UniProt scores are expert-curated quality indicators (1-5 scale) - Pathway membership is binary signal (present/absent in KEGG/Reactome) 3. **NULL Handling Strategy**: Conservative tier classification, but preserve NULL in data - For tiering: treat NULL GO counts as zero (assume unannotated until proven otherwise) - For data: preserve NULL to distinguish "no data" from "measured zero" - Follows established pattern from gnomAD evidence layer 4. **Batch Sizes**: mygene=1000, UniProt=100 - mygene.info supports large batches efficiently (tested up to 1000) - UniProt REST API more restrictive (recommend max 100 per query) - Both use tenacity retry for resilience ## Verification Results ### Plan Success Criteria (All Met) - ✅ ANNOT-01: GO term count and UniProt annotation score retrieved per gene with NULL for missing data - fetch_go_annotations returns GO counts by ontology + pathway membership - fetch_uniprot_scores returns annotation scores with NULL handling - ✅ ANNOT-02: Genes classified into well/partially/poorly annotated tiers based on composite metrics - classify_annotation_tier implements 3-tier system with correct thresholds - Test coverage validates all tier boundaries - ✅ ANNOT-03: Normalized 0-1 annotation score stored in DuckDB annotation_completeness table - normalize_annotation_score computes weighted composite (GO 50%, UniProt 30%, Pathway 20%) - load_to_duckdb persists to annotation_completeness table - ✅ Pattern compliance: fetch->transform->load->CLI->tests matching gnomAD evidence layer structure - Module structure mirrors gnomAD exactly - CLI command follows same checkpoint-restart pattern - Tests cover unit and integration scenarios ### Plan Verification Commands (All Pass) ```bash # All imports work python -c "from usher_pipeline.evidence.annotation import *" # All tests pass (15/15) python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v # Result: 15 passed, 1 warning (pydantic v1 compat) in 0.27s # CLI help displays correctly usher-pipeline evidence annotation --help # Shows: Fetch and load gene annotation completeness metrics... ``` ## Performance Notes - **Duration**: 7.2 minutes (434 seconds) for full implementation + testing - **Test execution**: 0.27 seconds for 15 tests (fast with mocked APIs) - **Code volume**: ~1800 lines (code + tests + documentation) - **Test coverage**: 100% of public API functions covered ## Integration Notes ### DuckDB Schema ```sql CREATE TABLE annotation_completeness ( gene_id VARCHAR, gene_symbol VARCHAR, go_term_count INTEGER, go_biological_process_count INTEGER, go_molecular_function_count INTEGER, go_cellular_component_count INTEGER, uniprot_annotation_score INTEGER, has_pathway_membership BOOLEAN, annotation_tier VARCHAR, annotation_score_normalized DOUBLE ); ``` ### Provenance Metadata Saved to `data/annotation/completeness.provenance.json` with: - Row count and tier distribution (well/partial/poor counts) - NULL annotation counts (GO, UniProt, overall score) - Mean and median annotation scores (for non-NULL values) ### CLI Usage ```bash # First run: fetch and load usher-pipeline evidence annotation # Skip if checkpoint exists usher-pipeline evidence annotation # "checkpoint exists, skipping" # Force reprocessing usher-pipeline evidence annotation --force ``` ## Next Steps This evidence layer is ready for integration into the multi-evidence scoring pipeline: 1. Annotation score (0-1) available in `annotation_completeness.annotation_score_normalized` 2. Tier classification useful for downstream filtering/prioritization 3. Poorly-annotated genes (score <= 0.3) are prime candidates for under-studied gene discovery 4. Pattern established for remaining evidence layers (expression, localization, protein, animal models, literature) **Recommended follow-up**: Phase 03 Plan 02 (Expression Specificity Evidence Layer) can now proceed using identical fetch->transform->load pattern. ## Self-Check: PASSED Verified created files exist: ```bash [ -f "src/usher_pipeline/evidence/annotation/__init__.py" ] && echo "FOUND" [ -f "src/usher_pipeline/evidence/annotation/models.py" ] && echo "FOUND" [ -f "src/usher_pipeline/evidence/annotation/fetch.py" ] && echo "FOUND" [ -f "src/usher_pipeline/evidence/annotation/transform.py" ] && echo "FOUND" [ -f "src/usher_pipeline/evidence/annotation/load.py" ] && echo "FOUND" [ -f "tests/test_annotation.py" ] && echo "FOUND" [ -f "tests/test_annotation_integration.py" ] && echo "FOUND" ``` All files: FOUND ✅ Verified commits exist: ```bash git log --oneline --all | grep -q "adbb74b" && echo "FOUND: Task 1" git log --oneline --all | grep -q "d70239c" && echo "FOUND: Task 2" ``` All commits: FOUND ✅ Summary claims validated. Self-check: PASSED.