docs(03): create phase plan
167
.planning/phases/03-core-evidence-layers/03-01-PLAN.md
Normal file
@@ -0,0 +1,167 @@
---
phase: 03-core-evidence-layers
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/evidence/annotation/__init__.py
  - src/usher_pipeline/evidence/annotation/models.py
  - src/usher_pipeline/evidence/annotation/fetch.py
  - src/usher_pipeline/evidence/annotation/transform.py
  - src/usher_pipeline/evidence/annotation/load.py
  - tests/test_annotation.py
  - tests/test_annotation_integration.py
  - src/usher_pipeline/cli/evidence_cmd.py
  - pyproject.toml
autonomous: true

must_haves:
  truths:
    - "Pipeline quantifies functional annotation depth per gene using GO term count, UniProt annotation score, and pathway membership"
    - "Genes are classified into annotation tiers (well-annotated, partially-annotated, poorly-annotated) based on composite annotation metrics"
    - "Annotation completeness score is normalized to 0-1 scale and stored as an evidence layer in DuckDB"
  artifacts:
    - path: "src/usher_pipeline/evidence/annotation/fetch.py"
      provides: "GO term count and UniProt annotation score retrieval per gene"
      exports: ["fetch_go_annotations", "fetch_uniprot_scores"]
    - path: "src/usher_pipeline/evidence/annotation/transform.py"
      provides: "Annotation tier classification and 0-1 normalization"
      exports: ["classify_annotation_tier", "normalize_annotation_score", "process_annotation_evidence"]
    - path: "src/usher_pipeline/evidence/annotation/load.py"
      provides: "DuckDB persistence for annotation evidence"
      exports: ["load_to_duckdb"]
    - path: "tests/test_annotation.py"
      provides: "Unit tests for annotation scoring, tiering, NULL handling"
  key_links:
    - from: "src/usher_pipeline/evidence/annotation/fetch.py"
      to: "mygene.info API"
      via: "mygene library batch query"
      pattern: "mg\\.querymany.*fields.*go"
    - from: "src/usher_pipeline/evidence/annotation/transform.py"
      to: "src/usher_pipeline/evidence/annotation/fetch.py"
      via: "processes fetched GO/UniProt data into scores"
      pattern: "classify_annotation_tier"
    - from: "src/usher_pipeline/evidence/annotation/load.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "store.save_dataframe"
      pattern: "save_dataframe.*annotation_completeness"
---

<objective>
Implement the Gene Annotation Completeness evidence layer (ANNOT-01/02/03): retrieve GO term counts and UniProt annotation scores per gene, classify genes into annotation tiers, normalize to a 0-1 composite score, and persist to DuckDB.

Purpose: Annotation depth quantifies how well-studied each gene is -- poorly-annotated genes with other evidence are prime candidates for under-studied cilia/Usher genes.
Output: annotation_completeness DuckDB table with per-gene GO counts, UniProt scores, pathway flags, annotation tier, and normalized composite score.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/models.py
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>

<tasks>

<task type="auto">
<name>Task 1: Create annotation evidence data model, fetch, and transform modules</name>
<files>
src/usher_pipeline/evidence/annotation/__init__.py
src/usher_pipeline/evidence/annotation/models.py
src/usher_pipeline/evidence/annotation/fetch.py
src/usher_pipeline/evidence/annotation/transform.py
pyproject.toml
</files>
<action>
Create the annotation evidence layer following the established gnomAD pattern (fetch->transform->load).

**models.py**: Define AnnotationRecord pydantic model with fields: gene_id (str), gene_symbol (str), go_term_count (int|None), go_biological_process_count (int|None), go_molecular_function_count (int|None), go_cellular_component_count (int|None), uniprot_annotation_score (int|None -- UniProt annotation score 1-5), has_pathway_membership (bool|None -- present in any KEGG/Reactome pathway), annotation_tier (str -- "well_annotated", "partially_annotated", "poorly_annotated"), annotation_score_normalized (float|None -- 0-1 composite). Define ANNOTATION_TABLE_NAME = "annotation_completeness".

**fetch.py**: Two fetch functions:
1. `fetch_go_annotations(gene_ids: list[str]) -> pl.DataFrame` -- Use the mygene library (already a dependency) to batch-query GO annotations from mygene.info. Call `mg.querymany(gene_ids, scopes='ensembl.gene', fields='go,pathway.kegg,pathway.reactome,symbol', species='human')`. For each gene, count GO terms per ontology (BP, MF, CC). Return a DataFrame with gene_id, gene_symbol, go_term_count, go_biological_process_count, go_molecular_function_count, go_cellular_component_count, has_pathway_membership. Record genes with no GO annotations as NULL (not zero). Process in batches of 1000 to avoid mygene timeouts.
2. `fetch_uniprot_scores(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- Use the UniProt REST API (httpx with tenacity retry) to batch-query annotation scores. The score is available in the REST API JSON under `.annotationScore`. Query batches of 100 accessions using the UniProt search endpoint: `https://rest.uniprot.org/uniprotkb/search?query=accession:P12345+OR+accession:P67890&fields=accession,annotation_score`. Use the ratelimit decorator (200 req/sec). Return a DataFrame with gene_id, uniprot_annotation_score. NULL for genes without a UniProt mapping.
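
As a rough illustration of the counting step in item 1, the sketch below summarizes a single mygene hit in plain Python. It assumes each hit carries `go` as a dict keyed by `BP`/`MF`/`CC` whose values are either one term dict or a list of them, and that unmatched queries come back with `notfound: True`. The helper names (`summarize_hit`, `_as_list`, `_distinct_terms`), the choice to count distinct GO ids, and mapping a found gene without a `pathway` field to `False` are assumptions made for the example, not decisions recorded in this plan.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """mygene may return a single dict instead of a one-element list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _distinct_terms(go: dict, category: str) -> int:
    """Count distinct GO ids in one ontology (assumption: duplicates are the same term)."""
    return len({term["id"] for term in _as_list(go.get(category))})


def summarize_hit(hit: dict) -> dict:
    """Turn one mygene hit into the per-gene columns named in the plan."""
    found = not hit.get("notfound", False)
    go = hit.get("go")
    if go:
        bp = _distinct_terms(go, "BP")
        mf = _distinct_terms(go, "MF")
        cc = _distinct_terms(go, "CC")
        total = bp + mf + cc
    else:
        # No GO data at all: keep NULL (None), never zero.
        bp = mf = cc = total = None

    pathway = hit.get("pathway") or {}
    # Assumption: a found gene without pathway data is False, an unmatched gene is NULL.
    has_pathway = bool(pathway.get("kegg") or pathway.get("reactome")) if found else None

    return {
        "gene_id": hit.get("query"),
        "gene_symbol": hit.get("symbol"),
        "go_term_count": total,
        "go_biological_process_count": bp,
        "go_molecular_function_count": mf,
        "go_cellular_component_count": cc,
        "has_pathway_membership": has_pathway,
    }
```

In the real module the list of such dicts would be turned into a polars frame; that step is left out here to keep the sketch dependency-free.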

Add `ratelimit` to pyproject.toml dependencies if not already present.
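
The UniProt batching in item 2 of fetch.py could be sketched as below, using only the standard library and making no network calls. The helper names (`batched`, `build_search_url`, `extract_scores`) are hypothetical. The `format=json` and `size` parameters are an assumption added for the example: the search endpoint pages its results, so a 100-accession batch would likely be truncated unless the page size is raised. The parser assumes each result entry exposes `primaryAccession` alongside `annotationScore`.

```python
from typing import Iterator, Optional
from urllib.parse import urlencode

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_search_url(accessions: list[str]) -> str:
    """Build one search URL covering a batch of accessions."""
    query = " OR ".join(f"accession:{acc}" for acc in accessions)
    params = {
        "query": query,
        "fields": "accession,annotation_score",
        "format": "json",
        # Assumption: request as many rows as accessions so the batch fits on one page.
        "size": len(accessions),
    }
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


def extract_scores(payload: dict) -> dict[str, Optional[float]]:
    """Map accession -> annotation score from one decoded JSON response."""
    return {
        entry["primaryAccession"]: entry.get("annotationScore")
        for entry in payload.get("results", [])
    }
```

Retry (tenacity) and rate limiting (ratelimit) would wrap the HTTP call that fetches each URL; they are omitted so the sketch stays self-contained.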

**transform.py**: Three functions:
1. `classify_annotation_tier(df: pl.DataFrame) -> pl.DataFrame` -- Add annotation_tier column: "well_annotated" (go_term_count >= 20 AND uniprot_annotation_score >= 4), "partially_annotated" (go_term_count >= 5 OR uniprot_annotation_score >= 3), "poorly_annotated" (everything else including NULLs). NULL GO counts treated as zero for tier classification (conservative -- assume unannotated).
2. `normalize_annotation_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute composite annotation score. Formula: weighted average of (a) log2(go_term_count + 1) normalized by max across dataset, (b) uniprot_annotation_score / 5.0, (c) has_pathway_membership as 0/1. Weights: 0.5 GO, 0.3 UniProt, 0.2 Pathway. Result clamped to 0-1. NULL if ALL three inputs are NULL.
3. `process_annotation_evidence(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- End-to-end pipeline: fetch GO -> fetch UniProt -> join -> classify tier -> normalize -> collect.
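
The tier rules in item 1 reduce to a small decision function. The sketch below applies them to one gene in plain Python rather than as a polars expression; `tier_for_gene` is a hypothetical name, and treating a NULL UniProt score as zero (like the NULL GO count) is an assumption read from "everything else including NULLs".

```python
from typing import Optional


def tier_for_gene(go_term_count: Optional[int], uniprot_score: Optional[int]) -> str:
    """Classify one gene using the thresholds stated in the plan."""
    # NULLs are treated as zero here: conservative, assume unannotated.
    go = go_term_count if go_term_count is not None else 0
    score = uniprot_score if uniprot_score is not None else 0

    if go >= 20 and score >= 4:
        return "well_annotated"
    if go >= 5 or score >= 3:
        return "partially_annotated"
    return "poorly_annotated"
```

Note that the order of the checks matters: a gene with 25 GO terms but a UniProt score of 2 fails the first rule and falls through to "partially_annotated".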

Follow established patterns: NULL preservation (unknown != zero), structlog logging, lazy polars evaluation where possible.
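
For the composite formula in `normalize_annotation_score`, a per-gene sketch in plain Python might look like the following. The plan only says the result is NULL when all three inputs are NULL; how a partially missing gene is scored is not specified, so letting a missing component contribute 0 is an assumption of this example. `composite_score` and `max_go_term_count` are hypothetical names.

```python
import math
from typing import Optional

WEIGHT_GO, WEIGHT_UNIPROT, WEIGHT_PATHWAY = 0.5, 0.3, 0.2


def composite_score(
    go_term_count: Optional[int],
    uniprot_score: Optional[float],
    has_pathway: Optional[bool],
    max_go_term_count: int,
) -> Optional[float]:
    """Weighted 0-1 annotation score for one gene; None when nothing is known."""
    if go_term_count is None and uniprot_score is None and has_pathway is None:
        return None  # NULL preservation: unknown is not zero

    # (a) log2(count + 1), scaled by the same transform of the dataset maximum.
    max_log = math.log2(max_go_term_count + 1)
    go_part = 0.0
    if go_term_count is not None and max_log > 0:
        go_part = math.log2(go_term_count + 1) / max_log

    # (b) UniProt score on its 1-5 scale; (c) pathway membership as 0/1.
    # Assumption: a missing component contributes 0 to the weighted sum.
    uniprot_part = uniprot_score / 5.0 if uniprot_score is not None else 0.0
    pathway_part = 1.0 if has_pathway else 0.0

    score = (
        WEIGHT_GO * go_part
        + WEIGHT_UNIPROT * uniprot_part
        + WEIGHT_PATHWAY * pathway_part
    )
    return min(1.0, max(0.0, score))
```

An alternative design would renormalize the weights over the components that are present; that changes how sparse genes rank against each other, so the choice is worth recording in the provenance.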
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.annotation import fetch_go_annotations, fetch_uniprot_scores, classify_annotation_tier, normalize_annotation_score, process_annotation_evidence; print('imports OK')"
</verify>
<done>
Annotation fetch module retrieves GO terms via mygene and UniProt scores via REST API. Transform module classifies genes into 3 tiers and normalizes composite score to 0-1 scale. All functions importable and follow established NULL preservation patterns.
</done>
</task>

<task type="auto">
<name>Task 2: Create annotation DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/annotation/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_annotation.py
tests/test_annotation_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern exactly. Create `load_to_duckdb(df, store, provenance, description)` that saves to "annotation_completeness" table with CREATE OR REPLACE. Record provenance with tier distribution counts (well/partial/poor), NULL annotation counts, mean/median annotation scores. Create `query_poorly_annotated(store, max_score=0.3) -> pl.DataFrame` helper to find under-studied genes.
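
The provenance statistics listed for load.py could be computed along these lines. This is a dependency-free sketch over a list of row dicts, not the DuckDB or polars code the loader would actually use; `summarize_annotation` and the keys of the returned dict are hypothetical, and "NULL annotation counts" is interpreted here as per-column NULL tallies.

```python
from collections import Counter
from statistics import mean, median


def summarize_annotation(rows: list[dict]) -> dict:
    """Collect tier counts, NULL counts, and score statistics for provenance."""
    tiers = Counter(row["annotation_tier"] for row in rows)
    scores = [
        row["annotation_score_normalized"]
        for row in rows
        if row["annotation_score_normalized"] is not None
    ]
    return {
        "well_annotated_count": tiers.get("well_annotated", 0),
        "partially_annotated_count": tiers.get("partially_annotated", 0),
        "poorly_annotated_count": tiers.get("poorly_annotated", 0),
        "null_go_count": sum(row["go_term_count"] is None for row in rows),
        "null_uniprot_count": sum(
            row["uniprot_annotation_score"] is None for row in rows
        ),
        "null_score_count": len(rows) - len(scores),
        # Statistics are taken over non-NULL scores only.
        "mean_annotation_score": mean(scores) if scores else None,
        "median_annotation_score": median(scores) if scores else None,
    }
```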

**evidence_cmd.py**: Add `annotation` subcommand to existing evidence command group. Follow gnomad command pattern: checkpoint check (has_checkpoint('annotation_completeness')), --force flag, load gene universe from DuckDB gene_universe table to get gene_ids and uniprot mappings, call process_annotation_evidence, load to DuckDB, save provenance sidecar to data/annotation/completeness.provenance.json. Display summary with tier distribution counts.

**tests/test_annotation.py**: Unit tests with synthetic data (no external API calls). Mock mygene.querymany to return synthetic GO data. Test cases:
- test_go_count_extraction: Correct GO term counting by category
- test_null_go_handling: Genes with no GO data get NULL counts
- test_tier_classification_well_annotated: High GO + high UniProt = well_annotated
- test_tier_classification_poorly_annotated: Low/NULL GO + low UniProt = poorly_annotated
- test_tier_classification_partial: Medium annotations = partially_annotated
- test_normalization_bounds: Score always in [0, 1] range
- test_normalization_null_preservation: All-NULL inputs produce NULL score
- test_normalization_with_pathway: Pathway membership contributes to score
- test_composite_weighting: Verify 0.5/0.3/0.2 weight distribution
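
One way to keep these unit tests offline is to hand the code under test a mocked client, as in the sketch below. `fetch_go_term_counts` is a hypothetical, simplified stand-in for the real fetch function (it returns only a total per gene), and the gene ids and GO ids are synthetic placeholders.

```python
from unittest.mock import MagicMock


def fetch_go_term_counts(client, gene_ids):
    """Hypothetical stand-in for the function under test: total GO terms per gene."""
    hits = client.querymany(
        gene_ids, scopes="ensembl.gene", fields="go", species="human"
    )
    counts = {}
    for hit in hits:
        go = hit.get("go")
        if not go:
            counts[hit["query"]] = None  # no GO data -> NULL, not zero
            continue
        total = 0
        for terms in go.values():
            # A single term arrives as a dict, several as a list of dicts.
            total += len(terms) if isinstance(terms, list) else 1
        counts[hit["query"]] = total
    return counts


def test_null_go_handling():
    client = MagicMock()
    client.querymany.return_value = [
        {
            "query": "ENSG_A",
            "go": {
                "BP": [{"id": "GO:0000001"}, {"id": "GO:0000002"}],
                "CC": {"id": "GO:0000003"},
            },
        },
        {"query": "ENSG_B", "notfound": True},
    ]

    counts = fetch_go_term_counts(client, ["ENSG_A", "ENSG_B"])

    assert counts == {"ENSG_A": 3, "ENSG_B": None}
    client.querymany.assert_called_once()
```

If the real module constructs its own client internally, `unittest.mock.patch` on that constructor would serve the same purpose.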

**tests/test_annotation_integration.py**: Integration tests following gnomad pattern. Mock mygene and UniProt API. Test full pipeline: fetch -> transform -> load -> query. Test checkpoint-restart, provenance recording, idempotent loading. Use synthetic fixtures.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v
</verify>
<done>
All annotation unit and integration tests pass. CLI `evidence annotation` command registered and functional. DuckDB stores annotation_completeness table with tier classification. Checkpoint-restart works. Provenance tracked.
</done>
</task>

</tasks>

<verification>
- `python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.annotation import *"` -- all exports importable
- `usher-pipeline evidence annotation --help` -- CLI help displays with correct options
- DuckDB annotation_completeness table has columns: gene_id, gene_symbol, go_term_count, uniprot_annotation_score, has_pathway_membership, annotation_tier, annotation_score_normalized
</verification>

<success_criteria>
- ANNOT-01: GO term count and UniProt annotation score retrieved per gene with NULL for missing data
- ANNOT-02: Genes classified into well/partially/poorly annotated tiers based on composite metrics
- ANNOT-03: Normalized 0-1 annotation score stored in DuckDB annotation_completeness table
- Pattern compliance: fetch->transform->load->CLI->tests matching gnomAD evidence layer structure
</success_criteria>

<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-01-SUMMARY.md`
</output>