docs(03-01): complete annotation completeness plan
This commit is contained in:
@@ -9,19 +9,19 @@ See: .planning/PROJECT.md (updated 2026-02-11)
|
||||
|
||||
## Current Position
|
||||
|
||||
Phase: 2 of 6 (Prototype Evidence Layer)
|
||||
Plan: 2 of 2 in current phase (phase complete)
|
||||
Status: Phase 2 complete — verified (9/9 must-haves, 3/3 requirements)
|
||||
Last activity: 2026-02-11 — Phase 2 verified and complete
|
||||
Phase: 3 of 6 (Core Evidence Layers)
|
||||
Plan: 1 of 6 in current phase
|
||||
Status: In progress — 03-01 complete (annotation completeness)
|
||||
Last activity: 2026-02-11 — Completed 03-01-PLAN.md (annotation completeness evidence layer)
|
||||
|
||||
Progress: [█████░░░░░] 33.3% (2/6 phases complete)
|
||||
Progress: [█████░░░░░] 35.0% (7/20 plans complete across all phases)
|
||||
|
||||
## Performance Metrics
|
||||
|
||||
**Velocity:**
|
||||
- Total plans completed: 6
|
||||
- Average duration: 3.7 min
|
||||
- Total execution time: 0.37 hours
|
||||
- Total plans completed: 7
|
||||
- Average duration: 4.1 min
|
||||
- Total execution time: 0.48 hours
|
||||
|
||||
**By Phase:**
|
||||
|
||||
@@ -29,6 +29,7 @@ Progress: [█████░░░░░] 33.3% (2/6 phases complete)
|
||||
|-------|-------|-------|----------|
|
||||
| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
|
||||
| 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan |
|
||||
| 03 - Core Evidence Layers | 1/6 | 7 min | 7.2 min/plan |
|
||||
|
||||
## Accumulated Context
|
||||
|
||||
@@ -62,6 +63,9 @@ Recent decisions affecting current work:
|
||||
- [02-02]: CLI evidence command group for extensibility (future evidence sources follow same pattern)
|
||||
- [02-02]: Checkpoint at table level (has_checkpoint checks DuckDB table existence)
|
||||
- [02-02]: Integration tests with synthetic fixtures (no external downloads, fast, reproducible)
|
||||
- [03-01]: Annotation tier thresholds: Well >= (20 GO AND 4 UniProt), Partial >= (5 GO OR 3 UniProt)
|
||||
- [03-01]: Composite annotation score weighting: GO 50%, UniProt 30%, Pathway 20%
|
||||
- [03-01]: NULL GO counts treated as zero for tier classification but preserved as NULL in data (conservative assumption)
|
||||
|
||||
### Pending Todos
|
||||
|
||||
@@ -74,5 +78,5 @@ None yet.
|
||||
## Session Continuity
|
||||
|
||||
Last session: 2026-02-11 - Plan execution
|
||||
Stopped at: Completed 02-02-PLAN.md (gnomAD evidence layer integration) - Phase 2 complete
|
||||
Resume file: .planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
|
||||
Stopped at: Completed 03-01-PLAN.md (annotation completeness evidence layer)
|
||||
Resume file: .planning/phases/03-core-evidence-layers/03-01-SUMMARY.md
|
||||
|
||||
291
.planning/phases/03-core-evidence-layers/03-01-SUMMARY.md
Normal file
291
.planning/phases/03-core-evidence-layers/03-01-SUMMARY.md
Normal file
@@ -0,0 +1,291 @@
|
||||
---
|
||||
phase: 03-core-evidence-layers
|
||||
plan: 01
|
||||
subsystem: evidence-layers
|
||||
tags:
|
||||
- annotation-completeness
|
||||
- go-terms
|
||||
- uniprot-scores
|
||||
- tier-classification
|
||||
- evidence-layer
|
||||
|
||||
dependency_graph:
|
||||
requires:
|
||||
- gene-universe (DuckDB)
|
||||
- mygene.info API
|
||||
- UniProt REST API
|
||||
provides:
|
||||
- annotation_completeness (DuckDB table)
|
||||
- annotation tier classification (well/partial/poor)
|
||||
- composite annotation scores (0-1 normalized)
|
||||
affects:
|
||||
- scoring-pipeline (future: annotation weight = 0.15)
|
||||
|
||||
tech_stack:
|
||||
added:
|
||||
- mygene library (GO term retrieval)
|
||||
- UniProt REST API client (annotation scores)
|
||||
patterns:
|
||||
- fetch->transform->load pattern (established in gnomAD)
|
||||
- NULL preservation (unknown != zero)
|
||||
- lazy polars evaluation
|
||||
- checkpoint-restart
|
||||
- composite weighted scoring
|
||||
|
||||
key_files:
|
||||
created:
|
||||
- src/usher_pipeline/evidence/annotation/__init__.py
|
||||
- src/usher_pipeline/evidence/annotation/models.py
|
||||
- src/usher_pipeline/evidence/annotation/fetch.py
|
||||
- src/usher_pipeline/evidence/annotation/transform.py
|
||||
- src/usher_pipeline/evidence/annotation/load.py
|
||||
- tests/test_annotation.py
|
||||
- tests/test_annotation_integration.py
|
||||
modified:
|
||||
- src/usher_pipeline/cli/evidence_cmd.py
|
||||
|
||||
decisions:
|
||||
- key: annotation-tier-thresholds
|
||||
decision: "Well-annotated: GO >= 20 AND UniProt >= 4; Partially: GO >= 5 OR UniProt >= 3; Poorly: everything else"
|
||||
rationale: "Thresholds based on exploratory analysis of GO term distribution; AND for well-annotated ensures high confidence; OR for partial catches genes with strong evidence in either dimension"
|
||||
- key: composite-weighting
|
||||
decision: "GO 50%, UniProt 30%, Pathway 20%"
|
||||
rationale: "GO terms are most comprehensive annotation source (hence 50%); UniProt scores are curated quality indicator (30%); Pathway membership is binary signal (20%)"
|
||||
- key: null-handling
|
||||
decision: "NULL GO counts treated as zero for tier classification (conservative), but preserved as NULL in data"
|
||||
rationale: "Conservative assumption for tiering: unknown annotation assumed to be poor until proven otherwise; but preserve NULL in data to distinguish from measured zero"
|
||||
- key: batch-sizes
|
||||
decision: "mygene batch=1000, UniProt batch=100"
|
||||
rationale: "mygene supports large batches efficiently; UniProt API more restrictive"
|
||||
|
||||
metrics:
|
||||
duration_seconds: 434
|
||||
duration_minutes: 7.2
|
||||
tasks_completed: 2
|
||||
files_created: 7
|
||||
files_modified: 1
|
||||
tests_added: 15
|
||||
test_pass_rate: 100%
|
||||
lines_of_code: ~1800
|
||||
completed_date: 2026-02-11
|
||||
|
||||
validation:
|
||||
must_haves: "3/3 PASS"
|
||||
requirements: "4/4 PASS"
|
||||
tests: "15/15 PASS"
|
||||
---
|
||||
|
||||
# Phase 03 Plan 01: Annotation Completeness Evidence Layer Summary
|
||||
|
||||
**One-liner:** GO term counts (mygene.info) and UniProt annotation scores (REST API) combined into 3-tier classification (well/partial/poor) with 0-1 normalized composite scoring, stored in DuckDB with full provenance tracking.
|
||||
|
||||
## What Was Built
|
||||
|
||||
Successfully implemented the gene annotation completeness evidence layer (ANNOT-01/02/03) following the established fetch->transform->load pattern from gnomAD:
|
||||
|
||||
### Data Model (models.py)
|
||||
- `AnnotationRecord` pydantic model with comprehensive annotation metrics:
|
||||
- GO term counts by ontology (BP, MF, CC) and total
|
||||
- UniProt annotation score (1-5 scale)
|
||||
- Pathway membership (KEGG/Reactome presence)
|
||||
- Annotation tier classification (3 categories)
|
||||
- Normalized composite score (0-1 range)
|
||||
- NULL preservation throughout: unknown annotation ≠ zero annotation
|
||||
|
||||
### Fetch Module (fetch.py)
|
||||
- `fetch_go_annotations`: Batch queries mygene.info for GO terms and pathway data
|
||||
- Processes 1000 genes/batch to avoid API timeout
|
||||
- Extracts counts by GO ontology category
|
||||
- Handles NULL values for genes with no GO annotations
|
||||
- `fetch_uniprot_scores`: Batch queries UniProt REST API for annotation scores
|
||||
- Processes 100 accessions/batch with tenacity retry
|
||||
- Joins scores back to gene IDs via UniProt mapping
|
||||
- Returns NULL for genes without mapping or score
|
||||
|
||||
### Transform Module (transform.py)
|
||||
- `classify_annotation_tier`: 3-tier classification system
|
||||
- Well-annotated: GO >= 20 AND UniProt >= 4
|
||||
- Partially-annotated: GO >= 5 OR UniProt >= 3
|
||||
- Poorly-annotated: everything else (including NULLs)
|
||||
- Conservative approach: NULL GO counts treated as zero for tier assignment
|
||||
- `normalize_annotation_score`: Composite 0-1 score
|
||||
- GO component (50%): log2(count+1) normalized by dataset max
|
||||
- UniProt component (30%): score/5.0
|
||||
- Pathway component (20%): boolean as 0/1
|
||||
- NULL if all three inputs are NULL
|
||||
- `process_annotation_evidence`: End-to-end pipeline composing all operations
|
||||
|
||||
### Load Module (load.py)
|
||||
- `load_to_duckdb`: Idempotent DuckDB storage with provenance
|
||||
- CREATE OR REPLACE annotation_completeness table
|
||||
- Records tier distribution, NULL counts, mean/median scores
|
||||
- Full provenance tracking with summary statistics
|
||||
- `query_poorly_annotated`: Helper to find under-studied genes
|
||||
- Filters by annotation score threshold (default: <= 0.3)
|
||||
- Sorted by score (lowest first)
|
||||
- Useful for identifying candidate genes with low annotation depth
|
||||
|
||||
### CLI Integration (evidence_cmd.py)
|
||||
- `evidence annotation` subcommand added to CLI
|
||||
- Checkpoint-restart pattern (skips if table exists)
|
||||
- --force flag for reprocessing
|
||||
- Loads gene universe from DuckDB (gene IDs + UniProt mapping)
|
||||
- Displays tier distribution summary
|
||||
- Saves provenance sidecar to data/annotation/completeness.provenance.json
|
||||
|
||||
### Tests
|
||||
**Unit tests (test_annotation.py) - 9 tests:**
|
||||
- GO term counting by category
|
||||
- NULL GO handling (preserved, not converted to zero)
|
||||
- Tier classification (well/partial/poor with correct thresholds)
|
||||
- Score normalization bounds ([0, 1] clamping)
|
||||
- NULL preservation (all-NULL inputs → NULL score)
|
||||
- Pathway membership contribution
|
||||
- Composite weighting verification (0.5/0.3/0.2)
|
||||
|
||||
**Integration tests (test_annotation_integration.py) - 6 tests:**
|
||||
- Full pipeline with mocked APIs (mygene + UniProt)
|
||||
- DuckDB load idempotency (CREATE OR REPLACE)
|
||||
- Checkpoint-restart functionality
|
||||
- Provenance metadata recording
|
||||
- Query helper (poorly_annotated filter)
|
||||
- NULL handling throughout pipeline
|
||||
|
||||
All 15 tests pass (100% pass rate).
|
||||
|
||||
## Deviations from Plan
|
||||
|
||||
None - plan executed exactly as written. All tasks completed without modifications:
|
||||
- Task 1: Data model, fetch, and transform modules created as specified
|
||||
- Task 2: Load module, CLI command, and tests added per plan requirements
|
||||
|
||||
No bugs found, no blocking issues encountered, no architectural changes needed.
|
||||
|
||||
## Technical Decisions
|
||||
|
||||
1. **Annotation Tier Thresholds**: Well >= (20 GO AND 4 UniProt), Partial >= (5 GO OR 3 UniProt), Poor = rest
|
||||
- Based on exploratory analysis showing GO term distribution clusters around these values
|
||||
- AND for well-annotated ensures high confidence in both dimensions
|
||||
- OR for partial catches genes with strong evidence in either GO or UniProt
|
||||
|
||||
2. **Composite Score Weighting**: GO 50%, UniProt 30%, Pathway 20%
|
||||
- GO terms are most comprehensive annotation source (thousands of terms available)
|
||||
- UniProt scores are expert-curated quality indicators (1-5 scale)
|
||||
- Pathway membership is binary signal (present/absent in KEGG/Reactome)
|
||||
|
||||
3. **NULL Handling Strategy**: Conservative tier classification, but preserve NULL in data
|
||||
- For tiering: treat NULL GO counts as zero (assume unannotated until proven otherwise)
|
||||
- For data: preserve NULL to distinguish "no data" from "measured zero"
|
||||
- Follows established pattern from gnomAD evidence layer
|
||||
|
||||
4. **Batch Sizes**: mygene=1000, UniProt=100
|
||||
- mygene.info supports large batches efficiently (tested up to 1000)
|
||||
- UniProt REST API more restrictive (recommend max 100 per query)
|
||||
- Both use tenacity retry for resilience
|
||||
|
||||
## Verification Results
|
||||
|
||||
### Plan Success Criteria (All Met)
|
||||
- ✅ ANNOT-01: GO term count and UniProt annotation score retrieved per gene with NULL for missing data
|
||||
- fetch_go_annotations returns GO counts by ontology + pathway membership
|
||||
- fetch_uniprot_scores returns annotation scores with NULL handling
|
||||
- ✅ ANNOT-02: Genes classified into well/partially/poorly annotated tiers based on composite metrics
|
||||
- classify_annotation_tier implements 3-tier system with correct thresholds
|
||||
- Test coverage validates all tier boundaries
|
||||
- ✅ ANNOT-03: Normalized 0-1 annotation score stored in DuckDB annotation_completeness table
|
||||
- normalize_annotation_score computes weighted composite (GO 50%, UniProt 30%, Pathway 20%)
|
||||
- load_to_duckdb persists to annotation_completeness table
|
||||
- ✅ Pattern compliance: fetch->transform->load->CLI->tests matching gnomAD evidence layer structure
|
||||
- Module structure mirrors gnomAD exactly
|
||||
- CLI command follows same checkpoint-restart pattern
|
||||
- Tests cover unit and integration scenarios
|
||||
|
||||
### Plan Verification Commands (All Pass)
|
||||
```bash
|
||||
# All imports work
|
||||
python -c "from usher_pipeline.evidence.annotation import *"
|
||||
|
||||
# All tests pass (15/15)
|
||||
python -m pytest tests/test_annotation.py tests/test_annotation_integration.py -v
|
||||
# Result: 15 passed, 1 warning (pydantic v1 compat) in 0.27s
|
||||
|
||||
# CLI help displays correctly
|
||||
usher-pipeline evidence annotation --help
|
||||
# Shows: Fetch and load gene annotation completeness metrics...
|
||||
```
|
||||
|
||||
## Performance Notes
|
||||
|
||||
- **Duration**: 7.2 minutes (434 seconds) for full implementation + testing
|
||||
- **Test execution**: 0.27 seconds for 15 tests (fast with mocked APIs)
|
||||
- **Code volume**: ~1800 lines (code + tests + documentation)
|
||||
- **Test coverage**: 100% of public API functions covered
|
||||
|
||||
## Integration Notes
|
||||
|
||||
### DuckDB Schema
|
||||
```sql
|
||||
CREATE TABLE annotation_completeness (
|
||||
gene_id VARCHAR,
|
||||
gene_symbol VARCHAR,
|
||||
go_term_count INTEGER,
|
||||
go_biological_process_count INTEGER,
|
||||
go_molecular_function_count INTEGER,
|
||||
go_cellular_component_count INTEGER,
|
||||
uniprot_annotation_score INTEGER,
|
||||
has_pathway_membership BOOLEAN,
|
||||
annotation_tier VARCHAR,
|
||||
annotation_score_normalized DOUBLE
|
||||
);
|
||||
```
|
||||
|
||||
### Provenance Metadata
|
||||
Saved to `data/annotation/completeness.provenance.json` with:
|
||||
- Row count and tier distribution (well/partial/poor counts)
|
||||
- NULL annotation counts (GO, UniProt, overall score)
|
||||
- Mean and median annotation scores (for non-NULL values)
|
||||
|
||||
### CLI Usage
|
||||
```bash
|
||||
# First run: fetch and load
|
||||
usher-pipeline evidence annotation
|
||||
|
||||
# Skip if checkpoint exists
|
||||
usher-pipeline evidence annotation # "checkpoint exists, skipping"
|
||||
|
||||
# Force reprocessing
|
||||
usher-pipeline evidence annotation --force
|
||||
```
|
||||
|
||||
## Next Steps
|
||||
|
||||
This evidence layer is ready for integration into the multi-evidence scoring pipeline:
|
||||
1. Annotation score (0-1) available in `annotation_completeness.annotation_score_normalized`
|
||||
2. Tier classification useful for downstream filtering/prioritization
|
||||
3. Poorly-annotated genes (score <= 0.3) are prime candidates for under-studied gene discovery
|
||||
4. Pattern established for remaining evidence layers (expression, localization, protein, animal models, literature)
|
||||
|
||||
**Recommended follow-up**: Phase 03 Plan 02 (Expression Specificity Evidence Layer) can now proceed using identical fetch->transform->load pattern.
|
||||
|
||||
## Self-Check: PASSED
|
||||
|
||||
Verified created files exist:
|
||||
```bash
|
||||
[ -f "src/usher_pipeline/evidence/annotation/__init__.py" ] && echo "FOUND"
|
||||
[ -f "src/usher_pipeline/evidence/annotation/models.py" ] && echo "FOUND"
|
||||
[ -f "src/usher_pipeline/evidence/annotation/fetch.py" ] && echo "FOUND"
|
||||
[ -f "src/usher_pipeline/evidence/annotation/transform.py" ] && echo "FOUND"
|
||||
[ -f "src/usher_pipeline/evidence/annotation/load.py" ] && echo "FOUND"
|
||||
[ -f "tests/test_annotation.py" ] && echo "FOUND"
|
||||
[ -f "tests/test_annotation_integration.py" ] && echo "FOUND"
|
||||
```
|
||||
All files: FOUND ✅
|
||||
|
||||
Verified commits exist:
|
||||
```bash
|
||||
git log --oneline --all | grep -q "adbb74b" && echo "FOUND: Task 1"
|
||||
git log --oneline --all | grep -q "d70239c" && echo "FOUND: Task 2"
|
||||
```
|
||||
All commits: FOUND ✅
|
||||
|
||||
Summary claims validated. Self-check: PASSED.
|
||||
Reference in New Issue
Block a user