| phase | plan | subsystem | tags | dependency_graph | tech_stack | key_files | decisions | metrics |
|---|---|---|---|---|---|---|---|---|
| 03-core-evidence-layers | 03 | evidence-protein | | | | | | |
Phase 03 Plan 03: Protein Features Evidence Layer Summary
One-liner: Protein domain extraction and cilia motif detection from UniProt/InterPro with normalized composite scoring
What Was Built
Implemented the protein sequence and structure features evidence layer (PROT-01/02/03/04) with:
- Data Model (models.py):
  - ProteinFeatureRecord pydantic model with 12 fields
  - Cilia domain keywords (IFT, BBSome, ciliary, basal body, etc.)
  - Scaffold domain types (PDZ, SH3, Ankyrin, WD40, etc.)
  - NULL preservation for genes without UniProt entries
- Fetch Layer (fetch.py):
  - fetch_uniprot_features(): UniProt REST API with batching (100 accessions/request)
  - fetch_interpro_domains(): InterPro API for domain annotations (10 req/sec)
  - Retry logic with tenacity (5 attempts, exponential backoff)
  - Conservative rate limiting to respect API constraints
- Transform Layer (transform.py):
  - extract_protein_features(): Join UniProt + InterPro, compute domain counts
  - detect_cilia_motifs(): Case-insensitive keyword matching for cilia/scaffold/sensory domains
  - normalize_protein_features(): Log-transform length ranks, cap TM counts, composite score
  - process_protein_evidence(): End-to-end pipeline (map genes → fetch → transform → normalize)
- Load Layer (load.py):
  - load_to_duckdb(): Persist to protein_features table with provenance
  - query_cilia_candidates(): SQL query for genes with cilia domains or coiled-coil+scaffold
- CLI Integration (evidence_cmd.py):
  - usher-pipeline evidence protein command
  - Checkpoint-restart pattern (skip if data exists)
  - Summary display with domain counts, cilia matches, scaffold counts
- Comprehensive Tests:
  - 11 unit tests: feature extraction, motif detection, normalization, NULL handling
  - 5 integration tests: full pipeline, checkpoint-restart, provenance, queries
  - All tests passing with mocked APIs
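The fetch layer's batching and retry behaviour can be illustrated with a minimal, standard-library sketch. It is not the code in fetch.py: the summary says the real implementation uses tenacity, whereas this sketch swaps in a hand-rolled loop so it runs without dependencies. The helper names `batched` and `retry_with_backoff`, the one-second base delay, and the injectable `sleep` parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[str], size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items (100 per request in the summary)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,  # assumed value; the summary does not state the delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

With 250 accessions, `batched` yields slices of 100, 100 and 50. Passing `sleep` as a parameter lets tests replace the real delay with a no-op, which is in the same spirit as the mocked-API tests listed above.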
Deviations from Plan
Auto-fixed Issues
1. [Rule 3 - Blocking] List(Null) type handling in domain names
- Found during: Test development (test_null_uniprot)
- Issue: When all proteins have empty domain lists, Polars creates List(Null) type instead of List(String), causing .str.to_lowercase() to fail with "expected String type, got: null"
- Fix: Added type coercion in detect_cilia_motifs() to cast List(Null) to List(String) before keyword matching
- Files modified: src/usher_pipeline/evidence/protein/transform.py
- Commit: 4605987
2. [Rule 1 - Bug] Polars list concatenation operator incompatibility
- Found during: Test execution (test_uniprot_feature_extraction)
- Issue: Cannot use + operator to concatenate list columns in Polars with non-numeric inner types
- Fix: Changed from pl.col("domain_names") + pl.col("domain_names_interpro") to pl.col("domain_names").list.concat(pl.col("domain_names_interpro"))
- Files modified: src/usher_pipeline/evidence/protein/transform.py
- Commit: 4605987
3. [Rule 2 - Missing critical functionality] ProvenanceTracker.get_steps() method
- Found during: Test development (test_provenance_recording)
- Issue: Tests expected get_steps() method to verify recorded provenance steps, but method was missing
- Fix: Added get_steps() method to ProvenanceTracker returning self.processing_steps
- Files modified: src/usher_pipeline/persistence/provenance.py
- Commit: 4605987
4. [Rule 2 - Missing critical functionality] ProvenanceTracker step dict "name" field
- Found during: Test development (test_provenance_recording)
- Issue: Tests expected both "name" and "step_name" fields in provenance steps for compatibility
- Fix: Added "name" field to step dict in record_step() alongside "step_name"
- Files modified: src/usher_pipeline/persistence/provenance.py
- Commit: 4605987
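The two ProvenanceTracker fixes (items 3 and 4) can be pictured with a small stand-in class. Only the names quoted above (`ProvenanceTracker`, `record_step()`, `get_steps()`, `processing_steps`, and the "name" / "step_name" keys) come from this summary. The constructor, the `details` argument and the timestamp field are assumptions, and the real class in provenance.py may look quite different.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any


class ProvenanceTracker:
    """Illustrative stand-in, not the class from provenance.py."""

    def __init__(self) -> None:
        self.processing_steps: list[dict[str, Any]] = []

    def record_step(self, step_name: str, details: dict[str, Any] | None = None) -> None:
        step = {
            "name": step_name,       # alias added for compatibility (item 4)
            "step_name": step_name,  # key that existed before the fix
            "details": details or {},                             # hypothetical field
            "timestamp": datetime.now(timezone.utc).isoformat(),  # hypothetical field
        }
        self.processing_steps.append(step)

    def get_steps(self) -> list[dict[str, Any]]:
        """Return the recorded steps (item 3); this hands back the list itself, not a copy."""
        return self.processing_steps
```

Returning `self.processing_steps` directly matches the fix as described. A caller that mutates the returned list therefore mutates the tracker's state, which is worth keeping in mind when writing assertions against it.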
Key Technical Decisions
- UniProt REST API over bulk download: REST API provides flexibility for incremental updates and per-gene queries without downloading full proteome datasets (~200GB uncompressed). Batch size of 100 balances API efficiency with rate limits.
- InterPro supplemental annotations: While UniProt provides basic domain annotations, InterPro offers more comprehensive domain classification from multiple databases (Pfam, SMART, PROSITE, etc.). This improves motif detection recall.
- Keyword-based motif detection: Simple case-insensitive substring matching on domain names rather than ML-based classification. Rationale: maintains explainability (every flagged gene can be traced to specific domain annotation) and avoids training data requirements.
- Composite score weights: Empirically balanced weights favoring domain count (20%), transmembrane regions (20%), and coiled-coils (20%) as these are most enriched in known cilia proteins. Length contributes 15% to avoid penalizing small adaptor proteins.
- NULL preservation throughout pipeline: Genes without UniProt entries get NULL scores (not 0.0) to distinguish "unknown" from "no evidence". Critical for downstream scoring to avoid false confidence.
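The last two decisions, weighted composite scoring and NULL preservation, interact, so a rough sketch may help. The four weights below are the ones stated above; they sum to 0.75, and this summary does not say how the remaining 0.25 is allocated. The sketch therefore renormalizes over the stated weights only, which is an assumption and may differ from transform.py. The feature key names are also hypothetical, and every feature is assumed to be already scaled to the 0-1 range.

```python
from typing import Mapping, Optional

# Weights quoted in the summary. The remaining 0.25 is unspecified there,
# so it is left out here and the score is renormalized (an assumption).
STATED_WEIGHTS = {
    "domain_count_norm": 0.20,   # hypothetical key names throughout
    "transmembrane_norm": 0.20,
    "coiled_coil": 0.20,
    "length_norm": 0.15,
}


def composite_score(
    features: Optional[Mapping[str, float]],
    weights: Mapping[str, float] = STATED_WEIGHTS,
) -> Optional[float]:
    """Weighted mean of 0-1 features; None in means None out."""
    if features is None:
        return None  # gene has no UniProt entry: "unknown", not 0.0
    total_weight = sum(weights.values())
    weighted = sum(w * features[name] for name, w in weights.items())
    return weighted / total_weight
```

Returning `None` rather than `0.0` keeps a gene with no UniProt entry distinguishable from a gene whose features were all measured and found to be zero, which is the distinction the NULL-preservation decision is protecting.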
Verification
All success criteria met:
- ✅ PROT-01: Protein length, domain composition, domain count extracted from UniProt/InterPro per gene
- ✅ PROT-02: Coiled-coil, scaffold/adaptor, and transmembrane domains identified
- ✅ PROT-03: Cilia-associated motifs detected via domain keyword matching without presupposing conclusions
- ✅ PROT-04: Binary and continuous protein features normalized to 0-1 composite score
- ✅ Pattern compliance: fetch->transform->load->CLI->tests matching established gnomAD evidence layer structure
- ✅ All tests passing: 16/16 tests pass (11 unit + 5 integration)
- ✅ Imports verified: All protein module exports importable
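PROT-02 and PROT-03 both rest on case-insensitive keyword matching over domain names. A plain-Python sketch of that idea follows. The keyword tuples contain only the examples named earlier in this summary (the real lists in models.py are longer, per the "etc."), and the function name `match_keywords` and its return shape are assumptions. The actual implementation operates on Polars list columns rather than Python lists.

```python
from typing import Optional, Sequence

# Example keywords taken from the summary; the full lists live in models.py.
CILIA_KEYWORDS = ("ift", "bbsome", "ciliary", "basal body")
SCAFFOLD_KEYWORDS = ("pdz", "sh3", "ankyrin", "wd40")


def match_keywords(
    domain_names: Optional[Sequence[str]],
    keywords: Sequence[str],
) -> Optional[list[tuple[str, str]]]:
    """Return (domain name, matched keyword) pairs, or None when annotations are missing."""
    if domain_names is None:
        return None  # no UniProt entry: keep "unknown" distinct from "no match"
    hits = []
    for domain in domain_names:
        lowered = domain.lower()
        for keyword in keywords:
            if keyword in lowered:  # case-insensitive substring test
                hits.append((domain, keyword))
    return hits
```

Returning the matched domain alongside the keyword is what makes each flag traceable to a specific annotation. Substring matching is permissive, though: a short keyword such as "ift" also matches inside unrelated words, so a production keyword list may need word boundaries or longer terms.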
Self-Check
Files Created
All files verified to exist:
- ✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/__init__.py
- ✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/models.py
- ✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/fetch.py
- ✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/transform.py
- ✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/load.py
- ✅ /Users/gbanyan/Project/usher-exploring/tests/test_protein.py
- ✅ /Users/gbanyan/Project/usher-exploring/tests/test_protein_integration.py
Commits
Task commits verified:
- ✅ 4605987: feat(03-03): implement protein evidence layer with UniProt/InterPro integration
Self-Check: PASSED
Impact
This evidence layer enables:
- Structural evidence scoring: Genes with cilia-associated domains, scaffold proteins, or coiled-coil/TM combinations receive higher protein scores for candidate ranking
- Explainable motif matching: Every flagged gene can be traced to specific UniProt/InterPro domain annotations (no black-box ML)
- Integration with gene universe: protein_features table joins on gene_id for multi-evidence scoring
- Cilia candidate queries: query_cilia_candidates() identifies genes with structural features enriched in known cilia proteins
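The candidate query described above (cilia domains, or coiled-coil plus scaffold) can be sketched as SQL. To stay dependency-free, this sketch uses the standard-library sqlite3 module in place of DuckDB. The column names, the sample rows and the function name `query_cilia_candidates_sketch` are all hypothetical; only the table name protein_features and the gene_id join key come from this summary.

```python
import sqlite3

# In-memory SQLite stands in for the DuckDB store used by the pipeline.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE protein_features (
        gene_id TEXT PRIMARY KEY,
        has_cilia_domain INTEGER,         -- hypothetical column names
        coiled_coil INTEGER,
        scaffold_adaptor_domain INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO protein_features VALUES (?, ?, ?, ?)",
    [
        ("GENE_A", 1, 0, 0),            # cilia domain only
        ("GENE_B", 0, 1, 1),            # coiled-coil plus scaffold
        ("GENE_C", 0, 1, 0),            # coiled-coil alone: not a candidate
        ("GENE_D", None, None, None),   # no UniProt entry: NULLs preserved
    ],
)


def query_cilia_candidates_sketch(connection: sqlite3.Connection) -> list[str]:
    """Return gene_ids with a cilia domain, or with coiled-coil and scaffold together."""
    sql = """
        SELECT gene_id
        FROM protein_features
        WHERE has_cilia_domain = 1
           OR (coiled_coil = 1 AND scaffold_adaptor_domain = 1)
        ORDER BY gene_id
    """
    return [row[0] for row in connection.execute(sql)]
```

Because SQL comparisons against NULL evaluate to NULL rather than true, the gene with no UniProt entry drops out of the result without needing an explicit IS NOT NULL filter.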
Next steps (Phase 03 Plans 04-06):
- Add expression evidence (GTEx, HPA)
- Add localization evidence (subcellular predictions)
- Add literature evidence (PubMed co-mentions)