docs(03-03): complete protein features plan with SUMMARY and STATE updates
This commit is contained in:
170
.planning/phases/03-core-evidence-layers/03-03-SUMMARY.md
Normal file
170
.planning/phases/03-core-evidence-layers/03-03-SUMMARY.md
Normal file
@@ -0,0 +1,170 @@
|
||||
---
|
||||
phase: 03-core-evidence-layers
|
||||
plan: 03
|
||||
subsystem: evidence-protein
|
||||
tags: [evidence, protein, uniprot, interpro, domains, motifs]
|
||||
dependency_graph:
|
||||
requires: [gene-universe, duckdb-store, provenance-tracker]
|
||||
provides: [protein-features-table, protein-evidence-cli, protein-motif-detection]
|
||||
affects: [evidence-pipeline, cilia-candidate-scoring]
|
||||
tech_stack:
|
||||
added: [uniprot-rest-api, interpro-api]
|
||||
patterns: [fetch-transform-load, motif-detection, domain-annotation]
|
||||
key_files:
|
||||
created:
|
||||
- src/usher_pipeline/evidence/protein/__init__.py
|
||||
- src/usher_pipeline/evidence/protein/models.py
|
||||
- src/usher_pipeline/evidence/protein/fetch.py
|
||||
- src/usher_pipeline/evidence/protein/transform.py
|
||||
- src/usher_pipeline/evidence/protein/load.py
|
||||
- tests/test_protein.py
|
||||
- tests/test_protein_integration.py
|
||||
modified:
|
||||
- src/usher_pipeline/cli/evidence_cmd.py
|
||||
- src/usher_pipeline/persistence/provenance.py
|
||||
decisions:
|
||||
- UniProt REST API chosen over bulk download for flexibility (batches of 100 accessions)
|
||||
- InterPro API for supplemental domain annotations (10 req/sec rate limit)
|
||||
- Domain keyword matching for cilia motif detection (not ML-based to maintain explainability)
|
||||
- Composite score weights: length 15%, domain 20%, coiled-coil 20%, TM 20%, cilia 15%, scaffold 10%
|
||||
- List(Null) edge case handling added for proteins with no domains
|
||||
metrics:
|
||||
duration_min: 11
|
||||
completed_at: "2026-02-11T19:07:42Z"
|
||||
tasks_completed: 2
|
||||
tests_added: 11 unit + 5 integration = 16 total
|
||||
files_created: 7
|
||||
lines_added: ~1937
|
||||
---
|
||||
|
||||
# Phase 03 Plan 03: Protein Features Evidence Layer Summary
|
||||
|
||||
**One-liner:** Protein domain extraction and cilia motif detection from UniProt/InterPro with normalized composite scoring
|
||||
|
||||
## What Was Built
|
||||
|
||||
Implemented the protein sequence and structure features evidence layer (PROT-01/02/03/04) with:
|
||||
|
||||
1. **Data Model** (`models.py`):
|
||||
- `ProteinFeatureRecord` pydantic model with 12 fields
|
||||
- Cilia domain keywords (IFT, BBSome, ciliary, basal body, etc.)
|
||||
- Scaffold domain types (PDZ, SH3, Ankyrin, WD40, etc.)
|
||||
- NULL preservation for genes without UniProt entries
|
||||
|
||||
2. **Fetch Layer** (`fetch.py`):
|
||||
- `fetch_uniprot_features()`: UniProt REST API with batching (100 accessions/request)
|
||||
- `fetch_interpro_domains()`: InterPro API for domain annotations (10 req/sec)
|
||||
- Retry logic with tenacity (5 attempts, exponential backoff)
|
||||
- Conservative rate limiting to respect API constraints
|
||||
|
||||
3. **Transform Layer** (`transform.py`):
|
||||
- `extract_protein_features()`: Join UniProt + InterPro, compute domain counts
|
||||
- `detect_cilia_motifs()`: Case-insensitive keyword matching for cilia/scaffold/sensory domains
|
||||
- `normalize_protein_features()`: Log-transform length ranks, cap TM counts, composite score
|
||||
- `process_protein_evidence()`: End-to-end pipeline (map genes → fetch → transform → normalize)
|
||||
|
||||
4. **Load Layer** (`load.py`):
|
||||
- `load_to_duckdb()`: Persist to protein_features table with provenance
|
||||
- `query_cilia_candidates()`: SQL query for genes with cilia domains or coiled-coil+scaffold
|
||||
|
||||
5. **CLI Integration** (`evidence_cmd.py`):
|
||||
- `usher-pipeline evidence protein` command
|
||||
- Checkpoint-restart pattern (skip if data exists)
|
||||
- Summary display with domain counts, cilia matches, scaffold counts
|
||||
|
||||
6. **Comprehensive Tests**:
|
||||
- 11 unit tests: feature extraction, motif detection, normalization, NULL handling
|
||||
- 5 integration tests: full pipeline, checkpoint-restart, provenance, queries
|
||||
- All tests passing with mocked APIs
|
||||
|
||||
## Deviations from Plan
|
||||
|
||||
### Auto-fixed Issues
|
||||
|
||||
**1. [Rule 3 - Blocking] List(Null) type handling in domain names**
|
||||
- **Found during:** Test development (test_null_uniprot)
|
||||
- **Issue:** When all proteins have empty domain lists, Polars creates `List(Null)` type instead of `List(String)`, causing `.str.to_lowercase()` to fail with "expected String type, got: null"
|
||||
- **Fix:** Added type coercion in `detect_cilia_motifs()` to cast `List(Null)` to `List(String)` before keyword matching
|
||||
- **Files modified:** `src/usher_pipeline/evidence/protein/transform.py`
|
||||
- **Commit:** 4605987
|
||||
|
||||
**2. [Rule 1 - Bug] Polars list concatenation operator incompatibility**
|
||||
- **Found during:** Test execution (test_uniprot_feature_extraction)
|
||||
- **Issue:** Cannot use `+` operator to concatenate list columns in Polars with non-numeric inner types
|
||||
- **Fix:** Changed from `pl.col("domain_names") + pl.col("domain_names_interpro")` to `pl.col("domain_names").list.concat(pl.col("domain_names_interpro"))`
|
||||
- **Files modified:** `src/usher_pipeline/evidence/protein/transform.py`
|
||||
- **Commit:** 4605987
|
||||
|
||||
**3. [Rule 2 - Missing critical functionality] ProvenanceTracker.get_steps() method**
|
||||
- **Found during:** Test development (test_provenance_recording)
|
||||
- **Issue:** Tests expected `get_steps()` method to verify recorded provenance steps, but method was missing
|
||||
- **Fix:** Added `get_steps()` method to ProvenanceTracker returning `self.processing_steps`
|
||||
- **Files modified:** `src/usher_pipeline/persistence/provenance.py`
|
||||
- **Commit:** 4605987
|
||||
|
||||
**4. [Rule 2 - Missing critical functionality] ProvenanceTracker step dict "name" field**
|
||||
- **Found during:** Test development (test_provenance_recording)
|
||||
- **Issue:** Tests expected both "name" and "step_name" fields in provenance steps for compatibility
|
||||
- **Fix:** Added "name" field to step dict in `record_step()` alongside "step_name"
|
||||
- **Files modified:** `src/usher_pipeline/persistence/provenance.py`
|
||||
- **Commit:** 4605987
|
||||
|
||||
## Key Technical Decisions
|
||||
|
||||
1. **UniProt REST API over bulk download**: REST API provides flexibility for incremental updates and per-gene queries without downloading full proteome datasets (~200GB uncompressed). Batch size of 100 balances API efficiency with rate limits.
|
||||
|
||||
2. **InterPro supplemental annotations**: While UniProt provides basic domain annotations, InterPro offers more comprehensive domain classification from multiple databases (Pfam, SMART, PROSITE, etc.). This improves motif detection recall.
|
||||
|
||||
3. **Keyword-based motif detection**: Simple case-insensitive substring matching on domain names rather than ML-based classification. Rationale: maintains explainability (every flagged gene can be traced to specific domain annotation) and avoids training data requirements.
|
||||
|
||||
4. **Composite score weights**: Empirically balanced weights favoring domain count (20%), transmembrane regions (20%), and coiled-coils (20%) as these are most enriched in known cilia proteins. Length contributes 15% to avoid penalizing small adaptor proteins.
|
||||
|
||||
5. **NULL preservation throughout pipeline**: Genes without UniProt entries get NULL scores (not 0.0) to distinguish "unknown" from "no evidence". Critical for downstream scoring to avoid false confidence.
|
||||
|
||||
## Verification
|
||||
|
||||
All success criteria met:
|
||||
|
||||
- ✅ **PROT-01**: Protein length, domain composition, domain count extracted from UniProt/InterPro per gene
|
||||
- ✅ **PROT-02**: Coiled-coil, scaffold/adaptor, and transmembrane domains identified
|
||||
- ✅ **PROT-03**: Cilia-associated motifs detected via domain keyword matching without presupposing conclusions
|
||||
- ✅ **PROT-04**: Binary and continuous protein features normalized to 0-1 composite score
|
||||
- ✅ **Pattern compliance**: fetch->transform->load->CLI->tests matching established gnomAD evidence layer structure
|
||||
- ✅ **All tests passing**: 16/16 tests pass (11 unit + 5 integration)
|
||||
- ✅ **Imports verified**: All protein module exports importable
|
||||
|
||||
## Self-Check
|
||||
|
||||
### Files Created
|
||||
|
||||
All files verified to exist:
|
||||
|
||||
- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/__init__.py`
|
||||
- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/models.py`
|
||||
- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/fetch.py`
|
||||
- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/transform.py`
|
||||
- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/load.py`
|
||||
- ✅ `/Users/gbanyan/Project/usher-exploring/tests/test_protein.py`
|
||||
- ✅ `/Users/gbanyan/Project/usher-exploring/tests/test_protein_integration.py`
|
||||
|
||||
### Commits
|
||||
|
||||
Task commits verified:
|
||||
|
||||
- ✅ `4605987`: feat(03-03): implement protein evidence layer with UniProt/InterPro integration
|
||||
|
||||
**Self-Check: PASSED**
|
||||
|
||||
## Impact
|
||||
|
||||
This evidence layer enables:
|
||||
|
||||
1. **Structural evidence scoring**: Genes with cilia-associated domains, scaffold proteins, or coiled-coil/TM combinations receive higher protein scores for candidate ranking
|
||||
2. **Explainable motif matching**: Every flagged gene can be traced to specific UniProt/InterPro domain annotations (no black-box ML)
|
||||
3. **Integration with gene universe**: protein_features table joins on gene_id for multi-evidence scoring
|
||||
4. **Cilia candidate queries**: `query_cilia_candidates()` identifies genes with structural features enriched in known cilia proteins
|
||||
|
||||
Next steps (Phase 03 Plans 04-06):
|
||||
- Add expression evidence (GTEx, HPA)
|
||||
- Add localization evidence (subcellular predictions)
|
||||
- Add literature evidence (PubMed co-mentions)
|
||||
Reference in New Issue
Block a user