| phase | plan | type | wave | depends_on | files_modified | autonomous | must_haves |
|---|---|---|---|---|---|---|---|
| 03-core-evidence-layers | 01 | execute | 1 | | | true | |
Purpose: Annotation depth quantifies how well-studied each gene is -- poorly-annotated genes with other evidence are prime candidates for under-studied cilia/Usher genes.

Output: annotation_completeness DuckDB table with per-gene GO counts, UniProt scores, pathway flags, annotation tier, and normalized composite score.
<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/models.py
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py

Task 1: Create annotation evidence data model, fetch, and transform modules
src/usher_pipeline/evidence/annotation/__init__.py
src/usher_pipeline/evidence/annotation/models.py
src/usher_pipeline/evidence/annotation/fetch.py
src/usher_pipeline/evidence/annotation/transform.py
pyproject.toml

Create the annotation evidence layer following the established gnomAD pattern (fetch->transform->load).

**models.py**: Define AnnotationRecord pydantic model with fields: gene_id (str), gene_symbol (str), go_term_count (int|None), go_biological_process_count (int|None), go_molecular_function_count (int|None), go_cellular_component_count (int|None), uniprot_annotation_score (int|None -- UniProt annotation score 1-5), has_pathway_membership (bool|None -- present in any KEGG/Reactome pathway), annotation_tier (str -- "well_annotated", "partially_annotated", "poorly_annotated"), annotation_score_normalized (float|None -- 0-1 composite). Define ANNOTATION_TABLE_NAME = "annotation_completeness".
**fetch.py**: Two fetch functions:
1. `fetch_go_annotations(gene_ids: list[str]) -> pl.LazyFrame` -- Use the `mygene` library (the MyGene.info client, already a dependency) to batch-query GO annotations. Call `mg.querymany(gene_ids, scopes='ensembl.gene', fields='go,pathway.kegg,pathway.reactome,symbol', species='human')`. For each gene, count GO terms per ontology (BP, MF, CC) and in total. Return a LazyFrame with gene_id, gene_symbol, go_term_count, go_biological_process_count, go_molecular_function_count, go_cellular_component_count, has_pathway_membership. Represent genes with no GO annotations as NULL (not zero). Process in batches of 1000 to avoid mygene timeouts.
2. `fetch_uniprot_scores(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.LazyFrame` -- Use the UniProt REST API (httpx with tenacity retry) to batch-query annotation scores. The score is available in the REST API JSON under `.annotationScore`. Query in batches of 100 accessions using the UniProt search endpoint: `https://rest.uniprot.org/uniprotkb/search?query=accession:P12345+OR+accession:P67890&fields=accession,annotation_score`. Apply the ratelimit decorator (200 req/sec). Return a LazyFrame with gene_id, uniprot_annotation_score. Use NULL for genes without a UniProt mapping.
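The batched UniProt query described in item 2 could be assembled roughly as follows. This is a minimal sketch using only the standard library; the helper names `chunked`, `build_search_params`, and `build_search_url` are hypothetical, and the `format` and `size` parameters are assumptions added for illustration (the plan itself only specifies `query` and `fields`). The actual HTTP call, retry, and rate limiting are left out.

```python
from urllib.parse import urlencode

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"
BATCH_SIZE = 100  # accessions per request, as stated in the plan


def chunked(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_search_params(accessions: list[str]) -> dict[str, str]:
    """Build query parameters for one batch of UniProt accessions."""
    query = " OR ".join(f"accession:{acc}" for acc in accessions)
    return {
        "query": query,
        "fields": "accession,annotation_score",
        "format": "json",                  # assumption: request JSON explicitly
        "size": str(len(accessions)),      # assumption: one page per batch
    }


def build_search_url(accessions: list[str]) -> str:
    """Return the full search URL for one batch (spaces are encoded as '+')."""
    return f"{UNIPROT_SEARCH_URL}?{urlencode(build_search_params(accessions))}"


if __name__ == "__main__":
    accessions = ["P12345", "P67890"]
    for batch in chunked(accessions, BATCH_SIZE):
        print(build_search_url(batch))
```

Passing the parameter dict from `build_search_params` directly to the HTTP client (for example as `params=` in httpx) would avoid hand-building the URL; the URL form is shown only to match the example in the task description.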
Add `ratelimit` to pyproject.toml dependencies if not already present.
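For the GO side of fetch.py, the per-ontology counting could look like the sketch below. It assumes each mygene hit carries a `go` dict keyed by `BP`, `MF`, and `CC`, where a category holds either a list of term dicts or a single dict when there is only one term. Counting unique GO IDs (rather than raw entries) is a design assumption, as is the helper name `count_go_terms`; the sample hit is synthetic.

```python
from typing import Any, Optional

# Map mygene GO category keys to the column names used in this plan.
COLUMN_BY_CATEGORY = {
    "BP": "go_biological_process_count",
    "MF": "go_molecular_function_count",
    "CC": "go_cellular_component_count",
}


def _as_list(value: Any) -> list:
    """Normalize a category value: None -> [], single dict -> [dict]."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def count_go_terms(hit: dict[str, Any]) -> dict[str, Optional[int]]:
    """Count GO terms per ontology for one hit; all NULL if no GO data."""
    go = hit.get("go")
    if not go:
        # Unknown is not zero: keep every count as None.
        nulls: dict[str, Optional[int]] = {col: None for col in COLUMN_BY_CATEGORY.values()}
        nulls["go_term_count"] = None
        return nulls

    counts: dict[str, Optional[int]] = {}
    for category, column in COLUMN_BY_CATEGORY.items():
        # Assumption: count unique GO IDs so repeated evidence codes do not inflate the count.
        ids = {t["id"] for t in _as_list(go.get(category)) if isinstance(t, dict) and "id" in t}
        counts[column] = len(ids)
    counts["go_term_count"] = sum(c for c in counts.values() if c is not None)
    return counts


if __name__ == "__main__":
    synthetic_hit = {
        "query": "ENSG_EXAMPLE",
        "go": {
            "BP": [{"id": "GO:0007605"}, {"id": "GO:0007605"}, {"id": "GO:0060271"}],
            "MF": {"id": "GO:0005515"},  # single term arrives as a dict
        },
    }
    print(count_go_terms(synthetic_hit))
```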
**transform.py**: Three functions:
1. `classify_annotation_tier(df: pl.DataFrame) -> pl.DataFrame` -- Add annotation_tier column: "well_annotated" (go_term_count >= 20 AND uniprot_annotation_score >= 4), "partially_annotated" (go_term_count >= 5 OR uniprot_annotation_score >= 3), "poorly_annotated" (everything else including NULLs). NULL GO counts treated as zero for tier classification (conservative -- assume unannotated).
2. `normalize_annotation_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute composite annotation score. Formula: weighted average of (a) log2(go_term_count + 1) normalized by max across dataset, (b) uniprot_annotation_score / 5.0, (c) has_pathway_membership as 0/1. Weights: 0.5 GO, 0.3 UniProt, 0.2 Pathway. Result clamped to 0-1. NULL if ALL three inputs are NULL.
3. `process_annotation_evidence(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- End-to-end pipeline: fetch GO -> fetch UniProt -> join -> classify tier -> normalize -> collect.
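The tier rules in item 1 can be stated as a small scalar function, which is useful as a reference when writing the polars expression and its tests. This is a sketch in plain Python rather than polars; the name `classify_tier` is hypothetical, and treating a NULL UniProt score as failing both thresholds is an assumption drawn from "everything else including NULLs".

```python
from typing import Optional


def classify_tier(go_term_count: Optional[int], uniprot_score: Optional[int]) -> str:
    """Return the annotation tier for one gene, checking tiers from best to worst."""
    # NULL GO counts are treated as zero for classification (conservative).
    go = go_term_count if go_term_count is not None else 0
    # Assumption: a NULL UniProt score never satisfies a threshold.
    score = uniprot_score if uniprot_score is not None else 0

    if go >= 20 and score >= 4:
        return "well_annotated"
    if go >= 5 or score >= 3:
        return "partially_annotated"
    return "poorly_annotated"


if __name__ == "__main__":
    print(classify_tier(25, 5))      # well_annotated
    print(classify_tier(25, None))   # partially_annotated (GO alone is not enough for the top tier)
    print(classify_tier(None, None)) # poorly_annotated
```

Because the well-annotated rule uses AND while the partial rule uses OR, a gene with many GO terms but no UniProt score lands in the middle tier, which the order of the checks makes explicit.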
Follow established patterns: NULL preservation (unknown != zero), structlog logging, lazy polars evaluation where possible.
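The composite score from `normalize_annotation_score` can be checked against a plain-Python reference like the one below. The plan only specifies the all-NULL case, so this sketch makes one loud assumption: when some but not all inputs are NULL, the missing components contribute 0 and the weights are not renormalized. The function name `composite_scores` and the row-dict layout are hypothetical.

```python
import math
from typing import Optional

W_GO, W_UNIPROT, W_PATHWAY = 0.5, 0.3, 0.2  # weights from the plan


def composite_scores(rows: list[dict]) -> list[Optional[float]]:
    """Compute the 0-1 composite annotation score for each row."""
    # Normalize the GO component by the dataset-wide maximum of log2(count + 1).
    logs = [math.log2(r["go_term_count"] + 1) for r in rows if r["go_term_count"] is not None]
    max_log = max(logs, default=0.0)

    scores: list[Optional[float]] = []
    for r in rows:
        go = r["go_term_count"]
        uniprot = r["uniprot_annotation_score"]
        pathway = r["has_pathway_membership"]

        if go is None and uniprot is None and pathway is None:
            scores.append(None)  # NULL only when all three inputs are NULL
            continue

        # Assumption: a missing component contributes 0 to the weighted sum.
        go_part = math.log2(go + 1) / max_log if go is not None and max_log > 0 else 0.0
        uniprot_part = uniprot / 5.0 if uniprot is not None else 0.0
        pathway_part = 1.0 if pathway else 0.0

        raw = W_GO * go_part + W_UNIPROT * uniprot_part + W_PATHWAY * pathway_part
        scores.append(min(1.0, max(0.0, raw)))  # clamp to [0, 1]
    return scores


if __name__ == "__main__":
    rows = [
        {"go_term_count": 31, "uniprot_annotation_score": 5, "has_pathway_membership": True},
        {"go_term_count": 0, "uniprot_annotation_score": None, "has_pathway_membership": False},
        {"go_term_count": None, "uniprot_annotation_score": None, "has_pathway_membership": None},
    ]
    print(composite_scores(rows))
```

Renormalizing the weights over the available components is an equally defensible choice; whichever is picked should be pinned down by `test_composite_weighting`.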
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.annotation import fetch_go_annotations, fetch_uniprot_scores, classify_annotation_tier, normalize_annotation_score, process_annotation_evidence; print('imports OK')"
Annotation fetch module retrieves GO terms via mygene and UniProt scores via REST API. Transform module classifies genes into 3 tiers and normalizes composite score to 0-1 scale. All functions importable and follow established NULL preservation patterns.
Task 2: Create annotation DuckDB loader, CLI command, and tests
src/usher_pipeline/evidence/annotation/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_annotation.py
tests/test_annotation_integration.py
**load.py**: Follow gnomad/load.py pattern exactly. Create `load_to_duckdb(df, store, provenance, description)` that saves to "annotation_completeness" table with CREATE OR REPLACE. Record provenance with tier distribution counts (well/partial/poor), NULL annotation counts, mean/median annotation scores. Create `query_poorly_annotated(store, max_score=0.3) -> pl.DataFrame` helper to find under-studied genes.
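The provenance statistics that load.py records (tier distribution, NULL counts, mean/median score) could be derived as in this sketch. It works on a list of row dicts with the standard library so it runs without DuckDB or polars; the function name `summarize_for_provenance` and the output key names are hypothetical, and the real loader would compute the same values from the DataFrame.

```python
from collections import Counter
from statistics import mean, median
from typing import Any


def summarize_for_provenance(rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize tier distribution, NULL counts, and score statistics."""
    tiers = Counter(r["annotation_tier"] for r in rows)
    scores = [
        r["annotation_score_normalized"]
        for r in rows
        if r["annotation_score_normalized"] is not None
    ]
    return {
        "total_genes": len(rows),
        "well_annotated_count": tiers.get("well_annotated", 0),
        "partially_annotated_count": tiers.get("partially_annotated", 0),
        "poorly_annotated_count": tiers.get("poorly_annotated", 0),
        "null_go_count": sum(r["go_term_count"] is None for r in rows),
        "null_uniprot_count": sum(r["uniprot_annotation_score"] is None for r in rows),
        "null_score_count": len(rows) - len(scores),
        # Statistics are NULL when no gene has a score, rather than raising.
        "mean_annotation_score": mean(scores) if scores else None,
        "median_annotation_score": median(scores) if scores else None,
    }


if __name__ == "__main__":
    synthetic_rows = [
        {"annotation_tier": "well_annotated", "go_term_count": 30,
         "uniprot_annotation_score": 5, "annotation_score_normalized": 0.9},
        {"annotation_tier": "poorly_annotated", "go_term_count": None,
         "uniprot_annotation_score": None, "annotation_score_normalized": None},
    ]
    print(summarize_for_provenance(synthetic_rows))
```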
**evidence_cmd.py**: Add `annotation` subcommand to existing evidence command group. Follow gnomad command pattern: checkpoint check (has_checkpoint('annotation_completeness')), --force flag, load gene universe from DuckDB gene_universe table to get gene_ids and uniprot mappings, call process_annotation_evidence, load to DuckDB, save provenance sidecar to data/annotation/completeness.provenance.json. Display summary with tier distribution counts.
**tests/test_annotation.py**: Unit tests with synthetic data (no external API calls). Mock mygene.querymany to return synthetic GO data. Test cases:
- test_go_count_extraction: Correct GO term counting by category
- test_null_go_handling: Genes with no GO data get NULL counts
- test_tier_classification_well_annotated: High GO + high UniProt = well_annotated
- test_tier_classification_poorly_annotated: Low/NULL GO + low UniProt = poorly_annotated
- test_tier_classification_partial: Medium annotations = partially_annotated
- test_normalization_bounds: Score always in [0, 1] range
- test_normalization_null_preservation: All-NULL inputs produce NULL score
- test_normalization_with_pathway: Pathway membership contributes to score
- test_composite_weighting: Verify 0.5/0.3/0.2 weight distribution
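One way to keep these unit tests free of network calls is to hand the code under test a mocked client. The sketch below shows the idea for `test_null_go_handling` using `unittest.mock.MagicMock`; the function `go_total_by_gene` is a hypothetical stand-in for the real fetch logic, and the hits are synthetic, shaped like mygene results (missing genes come back with `notfound: True`).

```python
from typing import Any, Optional
from unittest.mock import MagicMock


def go_total_by_gene(client: Any, gene_ids: list[str]) -> dict[str, Optional[int]]:
    """Hypothetical stand-in: total GO entries per queried gene, None if unknown."""
    hits = client.querymany(
        gene_ids,
        scopes="ensembl.gene",
        fields="go,pathway.kegg,pathway.reactome,symbol",
        species="human",
    )
    totals: dict[str, Optional[int]] = {}
    for hit in hits:
        go = hit.get("go")
        if hit.get("notfound") or not go:
            totals[hit["query"]] = None  # no GO data stays NULL, not zero
        else:
            # A category holds a list of terms, or a single dict for one term.
            totals[hit["query"]] = sum(len(v) if isinstance(v, list) else 1 for v in go.values())
    return totals


def test_null_go_handling() -> None:
    client = MagicMock()
    client.querymany.return_value = [
        {"query": "ENSG_A", "go": {"BP": [{"id": "GO:1"}, {"id": "GO:2"}], "CC": {"id": "GO:3"}}},
        {"query": "ENSG_B", "symbol": "GENEB"},        # found, but no GO field
        {"query": "ENSG_MISSING", "notfound": True},   # not found at all
    ]

    totals = go_total_by_gene(client, ["ENSG_A", "ENSG_B", "ENSG_MISSING"])

    assert totals == {"ENSG_A": 3, "ENSG_B": None, "ENSG_MISSING": None}
    client.querymany.assert_called_once()


if __name__ == "__main__":
    test_null_go_handling()
    print("ok")
```

If the real `fetch_go_annotations` constructs its own client internally, the same effect can be had by patching the client at its import location instead of passing it in.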
**tests/test_annotation_integration.py**: Integration tests following gnomad pattern. Mock mygene and UniProt API. Test full pipeline: fetch -> transform -> load -> query. Test checkpoint-restart, provenance recording, idempotent loading. Use synthetic fixtures.
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v
All annotation unit and integration tests pass. CLI `evidence annotation` command registered and functional. DuckDB stores annotation_completeness table with tier classification. Checkpoint-restart works. Provenance tracked.
- `python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.annotation import *"` -- all exports importable
- `usher-pipeline evidence annotation --help` -- CLI help displays with correct options
- DuckDB annotation_completeness table includes columns: gene_id, gene_symbol, go_term_count, uniprot_annotation_score, has_pathway_membership, annotation_tier, annotation_score_normalized
<success_criteria>
- ANNOT-01: GO term count and UniProt annotation score retrieved per gene with NULL for missing data
- ANNOT-02: Genes classified into well/partially/poorly annotated tiers based on composite metrics
- ANNOT-03: Normalized 0-1 annotation score stored in DuckDB annotation_completeness table
- Pattern compliance: fetch->transform->load->CLI->tests matching gnomAD evidence layer structure </success_criteria>