docs(03): create phase plan

2026-02-11 18:46:28 +08:00
parent 3354cfe006
commit 0d252da348
7 changed files with 1022 additions and 3 deletions
--- a/.planning/phases/03-core-evidence-layers/03-01-PLAN.md
+++ b/.planning/phases/03-core-evidence-layers/03-01-PLAN.md
@@ -0,0 +1,167 @@
+---
+phase: 03-core-evidence-layers
+plan: 01
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+  - src/usher_pipeline/evidence/annotation/__init__.py
+  - src/usher_pipeline/evidence/annotation/models.py
+  - src/usher_pipeline/evidence/annotation/fetch.py
+  - src/usher_pipeline/evidence/annotation/transform.py
+  - src/usher_pipeline/evidence/annotation/load.py
+  - tests/test_annotation.py
+  - tests/test_annotation_integration.py
+  - src/usher_pipeline/cli/evidence_cmd.py
+  - pyproject.toml
+autonomous: true
+
+must_haves:
+  truths:
+    - "Pipeline quantifies functional annotation depth per gene using GO term count, UniProt annotation score, and pathway membership"
+    - "Genes are classified into annotation tiers (well-annotated, partially-annotated, poorly-annotated) based on composite annotation metrics"
+    - "Annotation completeness score is normalized to 0-1 scale and stored as an evidence layer in DuckDB"
+  artifacts:
+    - path: "src/usher_pipeline/evidence/annotation/fetch.py"
+      provides: "GO term count and UniProt annotation score retrieval per gene"
+      exports: ["fetch_go_annotations", "fetch_uniprot_scores"]
+    - path: "src/usher_pipeline/evidence/annotation/transform.py"
+      provides: "Annotation tier classification and 0-1 normalization"
+      exports: ["classify_annotation_tier", "normalize_annotation_score", "process_annotation_evidence"]
+    - path: "src/usher_pipeline/evidence/annotation/load.py"
+      provides: "DuckDB persistence for annotation evidence"
+      exports: ["load_to_duckdb"]
+    - path: "tests/test_annotation.py"
+      provides: "Unit tests for annotation scoring, tiering, NULL handling"
+  key_links:
+    - from: "src/usher_pipeline/evidence/annotation/fetch.py"
+      to: "mygene.info API"
+      via: "mygene library batch query"
+      pattern: "mg\\.querymany.*fields.*go"
+    - from: "src/usher_pipeline/evidence/annotation/transform.py"
+      to: "src/usher_pipeline/evidence/annotation/fetch.py"
+      via: "processes fetched GO/UniProt data into scores"
+      pattern: "classify_annotation_tier"
+    - from: "src/usher_pipeline/evidence/annotation/load.py"
+      to: "src/usher_pipeline/persistence/duckdb_store.py"
+      via: "store.save_dataframe"
+      pattern: "save_dataframe.*annotation_completeness"
+---
+
+<objective>
+Implement the Gene Annotation Completeness evidence layer (ANNOT-01/02/03): retrieve GO term counts and UniProt annotation scores per gene, classify genes into annotation tiers, normalize to 0-1 composite score, and persist to DuckDB.
+
+Purpose: Annotation depth quantifies how well-studied each gene is -- poorly-annotated genes with other evidence are prime candidates for under-studied cilia/Usher genes.
+Output: annotation_completeness DuckDB table with per-gene GO counts, UniProt scores, pathway flags, annotation tier, and normalized composite score.
+</objective>
+
+<execution_context>
+@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
+@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
+@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
+@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
+@src/usher_pipeline/evidence/gnomad/models.py
+@src/usher_pipeline/evidence/gnomad/fetch.py
+@src/usher_pipeline/evidence/gnomad/transform.py
+@src/usher_pipeline/evidence/gnomad/load.py
+@src/usher_pipeline/cli/evidence_cmd.py
+@src/usher_pipeline/persistence/duckdb_store.py
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Task 1: Create annotation evidence data model, fetch, and transform modules</name>
+  <files>
+    src/usher_pipeline/evidence/annotation/__init__.py
+    src/usher_pipeline/evidence/annotation/models.py
+    src/usher_pipeline/evidence/annotation/fetch.py
+    src/usher_pipeline/evidence/annotation/transform.py
+    pyproject.toml
+  </files>
+  <action>
+    Create the annotation evidence layer following the established gnomAD pattern (fetch->transform->load).
+
+    **models.py**: Define AnnotationRecord pydantic model with fields: gene_id (str), gene_symbol (str), go_term_count (int|None), go_biological_process_count (int|None), go_molecular_function_count (int|None), go_cellular_component_count (int|None), uniprot_annotation_score (int|None -- UniProt annotation score 1-5), has_pathway_membership (bool|None -- present in any KEGG/Reactome pathway), annotation_tier (str -- "well_annotated", "partially_annotated", "poorly_annotated"), annotation_score_normalized (float|None -- 0-1 composite). Define ANNOTATION_TABLE_NAME = "annotation_completeness".
+
+    **fetch.py**: Two fetch functions:
+    1. `fetch_go_annotations(gene_ids: list[str]) -> pl.LazyFrame` -- Use the mygene library (already a dependency) to batch-query GO annotations from mygene.info. Call `mg.querymany(gene_ids, scopes='ensembl.gene', fields='go,pathway.kegg,pathway.reactome,symbol', species='human')`. For each gene, count GO terms per ontology (BP, MF, CC) and in total. Return a LazyFrame with gene_id, gene_symbol, go_term_count, go_biological_process_count, go_molecular_function_count, go_cellular_component_count, has_pathway_membership. Represent genes with no GO annotations as NULL (not zero). Process in batches of 1000 to avoid mygene timeouts.
+    2. `fetch_uniprot_scores(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.LazyFrame` -- Use the UniProt REST API (httpx with tenacity retry) to batch-query annotation scores, which the JSON response exposes under `.annotationScore`. Query batches of 100 accessions through the UniProt search endpoint: `https://rest.uniprot.org/uniprotkb/search?query=accession:P12345+OR+accession:P67890&fields=accession,annotation_score`. Apply the ratelimit decorator (200 req/sec). Return a LazyFrame with gene_id, uniprot_annotation_score; NULL for genes without a UniProt mapping.
+
+    Add `ratelimit` to pyproject.toml dependencies if not already present.
+
+    **transform.py**: Three functions:
+    1. `classify_annotation_tier(df: pl.DataFrame) -> pl.DataFrame` -- Add annotation_tier column, evaluating the rules in order so the first match wins: "well_annotated" (go_term_count >= 20 AND uniprot_annotation_score >= 4), "partially_annotated" (go_term_count >= 5 OR uniprot_annotation_score >= 3), "poorly_annotated" (everything else, including NULLs). NULL GO counts are treated as zero for tier classification only (conservative: assume unannotated); the stored go_term_count stays NULL.
+    2. `normalize_annotation_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute the composite annotation score as a weighted average of (a) log2(go_term_count + 1) divided by its maximum across the dataset, (b) uniprot_annotation_score / 5.0, (c) has_pathway_membership as 0/1. Weights: 0.5 GO, 0.3 UniProt, 0.2 pathway. Clamp the result to 0-1. The score is NULL if ALL three inputs are NULL.
+    3. `process_annotation_evidence(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- End-to-end pipeline: fetch GO -> fetch UniProt -> join -> classify tier -> normalize -> collect.
+
+    Follow established patterns: NULL preservation (unknown != zero), structlog logging, lazy polars evaluation where possible.
+  </action>
+  <verify>
+    cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.annotation import fetch_go_annotations, fetch_uniprot_scores, classify_annotation_tier, normalize_annotation_score, process_annotation_evidence; print('imports OK')"
+  </verify>
+  <done>
+    Annotation fetch module retrieves GO terms via mygene and UniProt scores via REST API. Transform module classifies genes into 3 tiers and normalizes composite score to 0-1 scale. All functions importable and follow established NULL preservation patterns.
+  </done>
+</task>
+
+<task type="auto">
+  <name>Task 2: Create annotation DuckDB loader, CLI command, and tests</name>
+  <files>
+    src/usher_pipeline/evidence/annotation/load.py
+    src/usher_pipeline/cli/evidence_cmd.py
+    tests/test_annotation.py
+    tests/test_annotation_integration.py
+  </files>
+  <action>
+    **load.py**: Follow gnomad/load.py pattern exactly. Create `load_to_duckdb(df, store, provenance, description)` that saves to "annotation_completeness" table with CREATE OR REPLACE. Record provenance with tier distribution counts (well/partial/poor), NULL annotation counts, mean/median annotation scores. Create `query_poorly_annotated(store, max_score=0.3) -> pl.DataFrame` helper to find under-studied genes.
+
+    **evidence_cmd.py**: Add `annotation` subcommand to existing evidence command group. Follow gnomad command pattern: checkpoint check (has_checkpoint('annotation_completeness')), --force flag, load gene universe from DuckDB gene_universe table to get gene_ids and uniprot mappings, call process_annotation_evidence, load to DuckDB, save provenance sidecar to data/annotation/completeness.provenance.json. Display summary with tier distribution counts.
+
+    **tests/test_annotation.py**: Unit tests with synthetic data (no external API calls). Mock mygene.querymany to return synthetic GO data. Test cases:
+    - test_go_count_extraction: Correct GO term counting by category
+    - test_null_go_handling: Genes with no GO data get NULL counts
+    - test_tier_classification_well_annotated: High GO + high UniProt = well_annotated
+    - test_tier_classification_poorly_annotated: Low/NULL GO + low UniProt = poorly_annotated
+    - test_tier_classification_partial: Medium annotations = partially_annotated
+    - test_normalization_bounds: Score always in [0, 1] range
+    - test_normalization_null_preservation: All-NULL inputs produce NULL score
+    - test_normalization_with_pathway: Pathway membership contributes to score
+    - test_composite_weighting: Verify 0.5/0.3/0.2 weight distribution
+
+    **tests/test_annotation_integration.py**: Integration tests following gnomad pattern. Mock mygene and UniProt API. Test full pipeline: fetch -> transform -> load -> query. Test checkpoint-restart, provenance recording, idempotent loading. Use synthetic fixtures.
+  </action>
+  <verify>
+    cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v
+  </verify>
+  <done>
+    All annotation unit and integration tests pass. CLI `evidence annotation` command registered and functional. DuckDB stores annotation_completeness table with tier classification. Checkpoint-restart works. Provenance tracked.
+  </done>
+</task>
+
+</tasks>
+
+<verification>
+- `python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v` -- all tests pass
+- `python -c "from usher_pipeline.evidence.annotation import *"` -- all exports importable
+- `usher-pipeline evidence annotation --help` -- CLI help displays with correct options
+- DuckDB annotation_completeness table includes columns: gene_id, gene_symbol, go_term_count, uniprot_annotation_score, has_pathway_membership, annotation_tier, annotation_score_normalized
+</verification>
+
+<success_criteria>
+- ANNOT-01: GO term count and UniProt annotation score retrieved per gene with NULL for missing data
+- ANNOT-02: Genes classified into well/partially/poorly annotated tiers based on composite metrics
+- ANNOT-03: Normalized 0-1 annotation score stored in DuckDB annotation_completeness table
+- Pattern compliance: fetch->transform->load->CLI->tests matching gnomAD evidence layer structure
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/03-core-evidence-layers/03-01-SUMMARY.md`
+</output>
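
The scoring rules in Task 1 of this plan can be sketched per gene in plain Python. This is only an illustration of the arithmetic, not the planned polars implementation. The function names `count_go_terms`, `classify_tier`, and `composite_score` are hypothetical. Two behaviours are assumptions the plan does not state: a NULL UniProt score is treated as zero during tiering, and a missing component contributes zero to the composite when at least one other input is present. The handling of a single GO term arriving as a dict rather than a list reflects how mygene.info responses are assumed to be shaped.

```python
import math
from typing import Optional

# Weights from the plan: 0.5 GO, 0.3 UniProt, 0.2 pathway.
W_GO, W_UNIPROT, W_PATHWAY = 0.5, 0.3, 0.2


def count_go_terms(hit: dict) -> Optional[dict]:
    """Count GO terms per ontology in one mygene-style hit; None if no GO data."""
    go = hit.get("go")
    if not go:
        return None  # NULL, not zero: missing annotation means unknown
    counts = {}
    for key in ("BP", "MF", "CC"):
        terms = go.get(key, [])
        if isinstance(terms, dict):  # assumed: a lone term is a dict, not a list
            terms = [terms]
        counts[key] = len(terms)
    counts["total"] = sum(counts.values())
    return counts


def classify_tier(go_term_count: Optional[int], uniprot_score: Optional[int]) -> str:
    """Apply the tier rules in order; the first match wins."""
    go = go_term_count or 0   # plan: NULL GO counts as zero for tiering
    uni = uniprot_score or 0  # assumption: NULL UniProt also counts as zero
    if go >= 20 and uni >= 4:
        return "well_annotated"
    if go >= 5 or uni >= 3:
        return "partially_annotated"
    return "poorly_annotated"


def composite_score(
    go_term_count: Optional[int],
    uniprot_score: Optional[int],
    has_pathway: Optional[bool],
    max_log_go: float,
) -> Optional[float]:
    """Weighted 0-1 score; max_log_go is the dataset maximum of log2(count + 1)."""
    if go_term_count is None and uniprot_score is None and has_pathway is None:
        return None  # plan: NULL only when all three inputs are NULL
    go_part = 0.0
    if go_term_count is not None and max_log_go > 0:
        go_part = math.log2(go_term_count + 1) / max_log_go
    uni_part = (uniprot_score or 0) / 5.0  # assumption: missing contributes zero
    path_part = 1.0 if has_pathway else 0.0
    score = W_GO * go_part + W_UNIPROT * uni_part + W_PATHWAY * path_part
    return min(1.0, max(0.0, score))  # clamp to [0, 1]
```

Tiering and scoring are kept as separate functions so that treating NULL as zero stays confined to tier classification, while the stored counts and the all-NULL composite remain NULL, in line with the plan's NULL preservation pattern.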