usher-exploring/.planning/phases/03-core-evidence-layers/03-01-PLAN.md at cfe4b830e6eca3d10f78c30409dbeccfb4546b55

gbanyan 0d252da348 docs(03): create phase plan

2026-02-11 18:46:28 +08:00

- phase: 03-core-evidence-layers
- plan:
- type: execute
- wave:
- depends_on:
- files_modified:
  - `src/usher_pipeline/evidence/annotation/__init__.py`
  - `src/usher_pipeline/evidence/annotation/models.py`
  - `src/usher_pipeline/evidence/annotation/fetch.py`
  - `src/usher_pipeline/evidence/annotation/transform.py`
  - `src/usher_pipeline/evidence/annotation/load.py`
  - `tests/test_annotation.py`
  - `tests/test_annotation_integration.py`
  - `src/usher_pipeline/cli/evidence_cmd.py`
  - `pyproject.toml`
- autonomous: true
- must_haves: truths, artifacts, key_links

**truths**:
- Pipeline quantifies functional annotation depth per gene using GO term count, UniProt annotation score, and pathway membership
- Genes are classified into annotation tiers (well-annotated, partially-annotated, poorly-annotated) based on composite annotation metrics
- Annotation completeness score is normalized to 0-1 scale and stored as an evidence layer in DuckDB

**artifacts**:

| path | provides | exports |
|---|---|---|
| `src/usher_pipeline/evidence/annotation/fetch.py` | GO term count and UniProt annotation score retrieval per gene | `fetch_go_annotations`, `fetch_uniprot_scores` |
| `src/usher_pipeline/evidence/annotation/transform.py` | Annotation tier classification and 0-1 normalization | `classify_annotation_tier`, `normalize_annotation_score`, `process_annotation_evidence` |
| `src/usher_pipeline/evidence/annotation/load.py` | DuckDB persistence for annotation evidence | `load_to_duckdb` |
| `tests/test_annotation.py` | Unit tests for annotation scoring, tiering, NULL handling | |

**key_links**:

| from | to | via | pattern |
|---|---|---|---|
| `src/usher_pipeline/evidence/annotation/fetch.py` | mygene.info API | mygene library batch query | `mg.querymany.fields.go` |
| `src/usher_pipeline/evidence/annotation/transform.py` | `src/usher_pipeline/evidence/annotation/fetch.py` | processes fetched GO/UniProt data into scores | `classify_annotation_tier` |
| `src/usher_pipeline/evidence/annotation/load.py` | `src/usher_pipeline/persistence/duckdb_store.py` | store.save_dataframe | `save_dataframe.*annotation_completeness` |

Implement the Gene Annotation Completeness evidence layer (ANNOT-01/02/03): retrieve GO term counts and UniProt annotation scores per gene, classify genes into annotation tiers, normalize to 0-1 composite score, and persist to DuckDB.

Purpose: Annotation depth quantifies how well-studied each gene is. Poorly annotated genes that are still supported by other evidence layers are prime candidates for under-studied cilia/Usher genes.

Output: `annotation_completeness` DuckDB table with per-gene GO counts, UniProt scores, pathway flags, annotation tier, and normalized composite score.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/03-core-evidence-layers/03-RESEARCH.md @.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md @.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md @src/usher_pipeline/evidence/gnomad/models.py @src/usher_pipeline/evidence/gnomad/fetch.py @src/usher_pipeline/evidence/gnomad/transform.py @src/usher_pipeline/evidence/gnomad/load.py @src/usher_pipeline/cli/evidence_cmd.py @src/usher_pipeline/persistence/duckdb_store.py

**Task 1: Create annotation evidence data model, fetch, and transform modules**

Files: `src/usher_pipeline/evidence/annotation/__init__.py`, `src/usher_pipeline/evidence/annotation/models.py`, `src/usher_pipeline/evidence/annotation/fetch.py`, `src/usher_pipeline/evidence/annotation/transform.py`, `pyproject.toml`

Create the annotation evidence layer following the established gnomAD pattern (fetch->transform->load).

**models.py**: Define AnnotationRecord pydantic model with fields: gene_id (str), gene_symbol (str), go_term_count (int|None), go_biological_process_count (int|None), go_molecular_function_count (int|None), go_cellular_component_count (int|None), uniprot_annotation_score (int|None -- UniProt annotation score 1-5), has_pathway_membership (bool|None -- present in any KEGG/Reactome pathway), annotation_tier (str -- "well_annotated", "partially_annotated", "poorly_annotated"), annotation_score_normalized (float|None -- 0-1 composite). Define ANNOTATION_TABLE_NAME = "annotation_completeness".

**fetch.py**: Two fetch functions:
1. `fetch_go_annotations(gene_ids: list[str]) -> pl.DataFrame` -- Use the mygene library (already a dependency) to batch-query GO annotations. Call `mg.querymany(gene_ids, scopes='ensembl.gene', fields='go,pathway.kegg,pathway.reactome,symbol', species='human')`. For each gene, count GO terms per ontology category (BP, MF, CC). Return a DataFrame with gene_id, gene_symbol, go_term_count, go_biological_process_count, go_molecular_function_count, go_cellular_component_count, has_pathway_membership. Represent genes with no GO annotations as NULL (not zero). Process in batches of 1000 to avoid mygene timeouts.
2. `fetch_uniprot_scores(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- Use the UniProt REST API (httpx with tenacity retry) to batch-query annotation scores. The score is available in the REST API JSON under `.annotationScore`. Query batches of 100 accessions using the UniProt search endpoint: `https://rest.uniprot.org/uniprotkb/search?query=accession:P12345+OR+accession:P67890&fields=accession,annotation_score`. Use the ratelimit decorator (200 req/sec). Return a DataFrame with gene_id, uniprot_annotation_score. NULL for genes without a UniProt mapping.
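The per-gene GO counting in item 1 can be sketched in plain Python before any DataFrame is built. This is a minimal illustration, not the planned implementation: the helper name `count_go_terms` is hypothetical, and the hit layout (a `go` dict keyed by `BP`/`MF`/`CC` whose values are either a list of terms or a single term dict, plus a `notfound` flag for unmatched queries) is an assumption about the mygene response. Counting unique GO ids per category, and treating a matched gene without a `pathway` field as `False`, are also assumptions.

```python
from typing import Any

# Map mygene GO category keys to the column names used in this plan.
GO_COLUMNS = {
    "BP": "go_biological_process_count",
    "MF": "go_molecular_function_count",
    "CC": "go_cellular_component_count",
}


def count_go_terms(hit: dict[str, Any]) -> dict[str, Any]:
    """Turn one mygene hit into a row; unknown values stay None (NULL)."""
    row: dict[str, Any] = {
        "gene_id": hit.get("query"),
        "gene_symbol": hit.get("symbol"),
        "go_term_count": None,
        "has_pathway_membership": None,
    }
    row.update({column: None for column in GO_COLUMNS.values()})

    # Assumption: unmatched queries come back flagged with "notfound".
    if hit.get("notfound"):
        return row

    # Assumption: a matched gene without pathway data is a known "no".
    pathway = hit.get("pathway") or {}
    row["has_pathway_membership"] = bool(
        pathway.get("kegg") or pathway.get("reactome")
    )

    go = hit.get("go")
    if not go:
        return row  # no GO data: counts remain None, not zero

    total = 0
    for category, column in GO_COLUMNS.items():
        terms = go.get(category, [])
        if isinstance(terms, dict):  # a single term may arrive as a bare dict
            terms = [terms]
        unique_ids = {term.get("id") for term in terms}
        row[column] = len(unique_ids)
        total += len(unique_ids)
    row["go_term_count"] = total
    return row
```

Rows produced this way can be passed to `pl.DataFrame(rows)`, where `None` values become NULLs.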

Add `ratelimit` to pyproject.toml dependencies if not already present.
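The batched UniProt query from item 2 above can be sketched with the standard library alone, leaving the HTTP call, retry, and rate limiting to httpx, tenacity, and ratelimit as the plan specifies. The helper names (`chunked`, `build_score_query`, `parse_scores`) are hypothetical. The `format` and `size` query parameters and the `primaryAccession` key in the JSON results are assumptions to confirm against the UniProt REST documentation; only `.annotationScore` is stated in this plan.

```python
from collections.abc import Iterator
from typing import Any, Optional
from urllib.parse import urlencode

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_score_query(accessions: list[str]) -> str:
    """Build one search URL that covers a batch of accessions."""
    query = " OR ".join(f"accession:{acc}" for acc in accessions)
    params = {
        "query": query,
        "fields": "accession,annotation_score",
        "format": "json",          # assumption: explicit JSON output
        "size": len(accessions),   # assumption: one page per batch
    }
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


def parse_scores(payload: dict[str, Any]) -> dict[str, Optional[int]]:
    """Map accession -> annotation score; a missing score stays None."""
    scores: dict[str, Optional[int]] = {}
    for record in payload.get("results", []):
        score = record.get("annotationScore")
        scores[record["primaryAccession"]] = (
            int(score) if score is not None else None
        )
    return scores
```

A caller would iterate `chunked(accessions, 100)`, request each URL, and merge the parsed dicts. Accessions absent from the merged result keep a NULL score.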

**transform.py**: Three functions:
1. `classify_annotation_tier(df: pl.DataFrame) -> pl.DataFrame` -- Add annotation_tier column: "well_annotated" (go_term_count >= 20 AND uniprot_annotation_score >= 4), "partially_annotated" (go_term_count >= 5 OR uniprot_annotation_score >= 3), "poorly_annotated" (everything else including NULLs). NULL GO counts treated as zero for tier classification (conservative -- assume unannotated).
2. `normalize_annotation_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute composite annotation score. Formula: weighted average of (a) log2(go_term_count + 1) normalized by max across dataset, (b) uniprot_annotation_score / 5.0, (c) has_pathway_membership as 0/1. Weights: 0.5 GO, 0.3 UniProt, 0.2 Pathway. Result clamped to 0-1. NULL if ALL three inputs are NULL.
3. `process_annotation_evidence(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- End-to-end pipeline: fetch GO -> fetch UniProt -> join -> classify tier -> normalize -> collect.
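The tier rules in item 1 above reduce to a small per-gene decision, sketched here in plain Python rather than as the polars expression the plan calls for. The function name `tier_for_gene` is hypothetical. Treating a NULL UniProt score as zero, in the same way as a NULL GO count, is an assumption drawn from the "everything else including NULLs" wording.

```python
from typing import Optional


def tier_for_gene(
    go_term_count: Optional[int],
    uniprot_annotation_score: Optional[int],
) -> str:
    """Classify one gene using the thresholds stated in the plan."""
    # NULL counts are treated as zero for tiering only (conservative).
    go = go_term_count if go_term_count is not None else 0
    # Assumption: a NULL UniProt score is handled the same way.
    uniprot = (
        uniprot_annotation_score
        if uniprot_annotation_score is not None
        else 0
    )

    if go >= 20 and uniprot >= 4:
        return "well_annotated"
    if go >= 5 or uniprot >= 3:
        return "partially_annotated"
    return "poorly_annotated"
```

The order of the checks matters: the stricter AND rule runs first, so a gene meeting both thresholds is never downgraded by the looser OR rule.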

Follow established patterns: NULL preservation (unknown != zero), structlog logging, lazy polars evaluation where possible.
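The composite score and its NULL rule can be sketched per gene as follows. The function name `composite_score` is hypothetical. Two points are assumptions, because the plan only defines the all-NULL case: a component that is individually NULL contributes 0.0 to the weighted sum, and the GO component is normalized by log2 of the dataset maximum plus one, passed in as `max_go_count`.

```python
import math
from typing import Optional

# Weights stated in the plan: 0.5 GO, 0.3 UniProt, 0.2 pathway.
W_GO, W_UNIPROT, W_PATHWAY = 0.5, 0.3, 0.2


def composite_score(
    go_term_count: Optional[int],
    uniprot_annotation_score: Optional[int],
    has_pathway_membership: Optional[bool],
    max_go_count: int,
) -> Optional[float]:
    """Weighted 0-1 annotation score; None only when every input is None."""
    inputs = (go_term_count, uniprot_annotation_score, has_pathway_membership)
    if all(value is None for value in inputs):
        return None  # unknown stays unknown

    # Assumption: an individually missing component contributes 0.0.
    go_part = 0.0
    if go_term_count is not None and max_go_count > 0:
        go_part = math.log2(go_term_count + 1) / math.log2(max_go_count + 1)

    uniprot_part = 0.0
    if uniprot_annotation_score is not None:
        uniprot_part = uniprot_annotation_score / 5.0

    pathway_part = 1.0 if has_pathway_membership else 0.0

    score = W_GO * go_part + W_UNIPROT * uniprot_part + W_PATHWAY * pathway_part
    return min(1.0, max(0.0, score))  # clamp to [0, 1]
```

Under these assumptions a gene with a known GO count of zero and nothing else scores 0.0, while a gene with no data at all scores NULL, which keeps "measured as empty" separate from "not measured".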

Verify: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.annotation import fetch_go_annotations, fetch_uniprot_scores, classify_annotation_tier, normalize_annotation_score, process_annotation_evidence; print('imports OK')"`

Done: Annotation fetch module retrieves GO terms via mygene and UniProt scores via REST API. Transform module classifies genes into 3 tiers and normalizes composite score to 0-1 scale. All functions importable and follow established NULL preservation patterns.

**Task 2: Create annotation DuckDB loader, CLI command, and tests**

Files: `src/usher_pipeline/evidence/annotation/load.py`, `src/usher_pipeline/cli/evidence_cmd.py`, `tests/test_annotation.py`, `tests/test_annotation_integration.py`

**load.py**: Follow gnomad/load.py pattern exactly. Create `load_to_duckdb(df, store, provenance, description)` that saves to "annotation_completeness" table with CREATE OR REPLACE. Record provenance with tier distribution counts (well/partial/poor), NULL annotation counts, mean/median annotation scores. Create `query_poorly_annotated(store, max_score=0.3) -> pl.DataFrame` helper to find under-studied genes.
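The provenance statistics listed for load.py (tier distribution, NULL counts, mean and median score) can be sketched over plain row dicts. The function name `summarize_annotation` and the keys of the returned dict are hypothetical; the real loader would compute the same figures from the polars DataFrame and hand them to the project's provenance object, whose interface is not shown in this plan.

```python
from collections import Counter
from statistics import mean, median
from typing import Any


def summarize_annotation(rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Collect the summary figures that the plan asks to record."""
    tiers = Counter(row["annotation_tier"] for row in rows)
    # NULL scores are excluded from the averages and counted separately.
    scores = [
        row["annotation_score_normalized"]
        for row in rows
        if row["annotation_score_normalized"] is not None
    ]
    return {
        "well_annotated_count": tiers.get("well_annotated", 0),
        "partially_annotated_count": tiers.get("partially_annotated", 0),
        "poorly_annotated_count": tiers.get("poorly_annotated", 0),
        "null_score_count": len(rows) - len(scores),
        "mean_score": mean(scores) if scores else None,
        "median_score": median(scores) if scores else None,
    }
```

Leaving NULL scores out of the mean and median, instead of counting them as zero, keeps the summary consistent with the NULL preservation rule used elsewhere in the layer.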

**evidence_cmd.py**: Add `annotation` subcommand to existing evidence command group. Follow gnomad command pattern: checkpoint check (has_checkpoint('annotation_completeness')), --force flag, load gene universe from DuckDB gene_universe table to get gene_ids and uniprot mappings, call process_annotation_evidence, load to DuckDB, save provenance sidecar to data/annotation/completeness.provenance.json. Display summary with tier distribution counts.

**tests/test_annotation.py**: Unit tests with synthetic data (no external API calls). Mock mygene.querymany to return synthetic GO data. Test cases:
- test_go_count_extraction: Correct GO term counting by category
- test_null_go_handling: Genes with no GO data get NULL counts
- test_tier_classification_well_annotated: High GO + high UniProt = well_annotated
- test_tier_classification_poorly_annotated: Low/NULL GO + low UniProt = poorly_annotated
- test_tier_classification_partial: Medium annotations = partially_annotated
- test_normalization_bounds: Score always in [0, 1] range
- test_normalization_null_preservation: All-NULL inputs produce NULL score
- test_normalization_with_pathway: Pathway membership contributes to score
- test_composite_weighting: Verify 0.5/0.3/0.2 weight distribution
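One way to keep these unit tests off the network is to pass the mygene client in as a parameter and replace it with a `unittest.mock.Mock`. The sketch below illustrates that pattern for the NULL-handling and batching cases. The function `fetch_go_counts`, its signature, and the synthetic gene IDs are hypothetical, and the hit layout is the same assumed mygene shape as elsewhere in this plan. The real tests may instead patch `mygene.MyGeneInfo` inside the fetch module.

```python
from typing import Any, Optional
from unittest.mock import Mock


def fetch_go_counts(
    client: Any, gene_ids: list[str], batch_size: int = 1000
) -> list[dict[str, Optional[Any]]]:
    """Query GO data in batches; genes without GO data get None."""
    rows = []
    for start in range(0, len(gene_ids), batch_size):
        batch = gene_ids[start:start + batch_size]
        hits = client.querymany(
            batch, scopes="ensembl.gene", fields="go", species="human"
        )
        for hit in hits:
            go = hit.get("go")
            count = None
            if go:
                # A category holds a list of terms or a single term dict.
                count = sum(
                    len(terms) if isinstance(terms, list) else 1
                    for terms in go.values()
                )
            rows.append({"gene_id": hit["query"], "go_term_count": count})
    return rows


def test_null_go_handling() -> None:
    client = Mock()
    client.querymany.return_value = [
        {"query": "ENSG_SYNTHETIC_A",
         "go": {"BP": [{"id": "GO:0000001"}, {"id": "GO:0000002"}]}},
        {"query": "ENSG_SYNTHETIC_B", "notfound": True},
    ]
    rows = fetch_go_counts(client, ["ENSG_SYNTHETIC_A", "ENSG_SYNTHETIC_B"])
    assert rows[0]["go_term_count"] == 2
    assert rows[1]["go_term_count"] is None  # NULL, not zero
    client.querymany.assert_called_once()


def test_batches_of_1000() -> None:
    client = Mock()
    client.querymany.return_value = []
    fetch_go_counts(client, [f"ENSG_SYNTHETIC_{i}" for i in range(2500)])
    assert client.querymany.call_count == 3
```

Injecting the client keeps the test independent of how the fetch module constructs it, at the cost of one extra parameter on the function.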

**tests/test_annotation_integration.py**: Integration tests following gnomad pattern. Mock mygene and UniProt API. Test full pipeline: fetch -> transform -> load -> query. Test checkpoint-restart, provenance recording, idempotent loading. Use synthetic fixtures.

Verify: `cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v`

Done: All annotation unit and integration tests pass. CLI `evidence annotation` command registered and functional. DuckDB stores annotation_completeness table with tier classification. Checkpoint-restart works. Provenance tracked.

Verification:
- `python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.annotation import *"` -- all exports importable
- `usher-pipeline evidence annotation --help` -- CLI help displays with correct options
- DuckDB annotation_completeness table has columns: gene_id, gene_symbol, go_term_count, uniprot_annotation_score, has_pathway_membership, annotation_tier, annotation_score_normalized

<success_criteria>

- ANNOT-01: GO term count and UniProt annotation score retrieved per gene with NULL for missing data
- ANNOT-02: Genes classified into well/partially/poorly annotated tiers based on composite metrics
- ANNOT-03: Normalized 0-1 annotation score stored in DuckDB annotation_completeness table
- Pattern compliance: fetch->transform->load->CLI->tests matching gnomAD evidence layer structure

</success_criteria>

After completion, create `.planning/phases/03-core-evidence-layers/03-01-SUMMARY.md`
