gbanyan/usher-exploring

Fork 0

Files

gbanyan cfe4b830e6 docs(03-03): complete protein features plan with SUMMARY and STATE updates

2026-02-11 19:10:03 +08:00

8.9 KiB

Raw Blame History

phase, plan, subsystem, tags, dependency_graph, tech_stack, key_files, decisions, metrics

phase

plan

subsystem

Phase 03 Plan 03: Protein Features Evidence Layer Summary

One-liner: Protein domain extraction and cilia motif detection from UniProt/InterPro with normalized composite scoring

What Was Built

Implemented the protein sequence and structure features evidence layer (PROT-01/02/03/04) with:

Data Model (models.py):
- ProteinFeatureRecord pydantic model with 12 fields
- Cilia domain keywords (IFT, BBSome, ciliary, basal body, etc.)
- Scaffold domain types (PDZ, SH3, Ankyrin, WD40, etc.)
- NULL preservation for genes without UniProt entries
Fetch Layer (fetch.py):
- fetch_uniprot_features(): UniProt REST API with batching (100 accessions/request)
- fetch_interpro_domains(): InterPro API for domain annotations (10 req/sec)
- Retry logic with tenacity (5 attempts, exponential backoff)
- Conservative rate limiting to respect API constraints
Transform Layer (transform.py):
- extract_protein_features(): Join UniProt + InterPro, compute domain counts
- detect_cilia_motifs(): Case-insensitive keyword matching for cilia/scaffold/sensory domains
- normalize_protein_features(): Log-transform length ranks, cap TM counts, composite score
- process_protein_evidence(): End-to-end pipeline (map genes → fetch → transform → normalize)
Load Layer (load.py):
- load_to_duckdb(): Persist to protein_features table with provenance
- query_cilia_candidates(): SQL query for genes with cilia domains or coiled-coil+scaffold
CLI Integration (evidence_cmd.py):
- usher-pipeline evidence protein command
- Checkpoint-restart pattern (skip if data exists)
- Summary display with domain counts, cilia matches, scaffold counts
Comprehensive Tests:
- 11 unit tests: feature extraction, motif detection, normalization, NULL handling
- 5 integration tests: full pipeline, checkpoint-restart, provenance, queries
- All tests passing with mocked APIs

Deviations from Plan

Auto-fixed Issues

1. [Rule 3 - Blocking] List(Null) type handling in domain names

Found during: Test development (test_null_uniprot)
Issue: When all proteins have empty domain lists, Polars creates List(Null) type instead of List(String), causing .str.to_lowercase() to fail with "expected String type, got: null"
Fix: Added type coercion in detect_cilia_motifs() to cast List(Null) to List(String) before keyword matching
Files modified: src/usher_pipeline/evidence/protein/transform.py
Commit: 4605987

2. [Rule 1 - Bug] Polars list concatenation operator incompatibility

Found during: Test execution (test_uniprot_feature_extraction)
Issue: Cannot use + operator to concatenate list columns in Polars with non-numeric inner types
Fix: Changed from pl.col("domain_names") + pl.col("domain_names_interpro") to pl.col("domain_names").list.concat(pl.col("domain_names_interpro"))
Files modified: src/usher_pipeline/evidence/protein/transform.py
Commit: 4605987

3. [Rule 2 - Missing critical functionality] ProvenanceTracker.get_steps() method

Found during: Test development (test_provenance_recording)
Issue: Tests expected get_steps() method to verify recorded provenance steps, but method was missing
Fix: Added get_steps() method to ProvenanceTracker returning self.processing_steps
Files modified: src/usher_pipeline/persistence/provenance.py
Commit: 4605987

4. [Rule 2 - Missing critical functionality] ProvenanceTracker step dict "name" field

Found during: Test development (test_provenance_recording)
Issue: Tests expected both "name" and "step_name" fields in provenance steps for compatibility
Fix: Added "name" field to step dict in record_step() alongside "step_name"
Files modified: src/usher_pipeline/persistence/provenance.py
Commit: 4605987

Key Technical Decisions

UniProt REST API over bulk download: REST API provides flexibility for incremental updates and per-gene queries without downloading full proteome datasets (~200GB uncompressed). Batch size of 100 balances API efficiency with rate limits.
InterPro supplemental annotations: While UniProt provides basic domain annotations, InterPro offers more comprehensive domain classification from multiple databases (Pfam, SMART, PROSITE, etc.). This improves motif detection recall.
Keyword-based motif detection: Simple case-insensitive substring matching on domain names rather than ML-based classification. Rationale: maintains explainability (every flagged gene can be traced to specific domain annotation) and avoids training data requirements.
Composite score weights: Empirically balanced weights favoring domain count (20%), transmembrane regions (20%), and coiled-coils (20%) as these are most enriched in known cilia proteins. Length contributes 15% to avoid penalizing small adaptor proteins.
NULL preservation throughout pipeline: Genes without UniProt entries get NULL scores (not 0.0) to distinguish "unknown" from "no evidence". Critical for downstream scoring to avoid false confidence.

Verification

All success criteria met:

✅ PROT-01: Protein length, domain composition, domain count extracted from UniProt/InterPro per gene
✅ PROT-02: Coiled-coil, scaffold/adaptor, and transmembrane domains identified
✅ PROT-03: Cilia-associated motifs detected via domain keyword matching without presupposing conclusions
✅ PROT-04: Binary and continuous protein features normalized to 0-1 composite score
✅ Pattern compliance: fetch->transform->load->CLI->tests matching established gnomAD evidence layer structure
✅ All tests passing: 16/16 tests pass (11 unit + 5 integration)
✅ Imports verified: All protein module exports importable

Self-Check

Files Created

All files verified to exist:

✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/__init__.py
✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/models.py
✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/fetch.py
✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/transform.py
✅ /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/load.py
✅ /Users/gbanyan/Project/usher-exploring/tests/test_protein.py
✅ /Users/gbanyan/Project/usher-exploring/tests/test_protein_integration.py

Commits

Task commits verified:

✅ 4605987: feat(03-03): implement protein evidence layer with UniProt/InterPro integration

Self-Check: PASSED

Impact

This evidence layer enables:

Structural evidence scoring: Genes with cilia-associated domains, scaffold proteins, or coiled-coil/TM combinations receive higher protein scores for candidate ranking
Explainable motif matching: Every flagged gene can be traced to specific UniProt/InterPro domain annotations (no black-box ML)
Integration with gene universe: protein_features table joins on gene_id for multi-evidence scoring
Cilia candidate queries: query_cilia_candidates() identifies genes with structural features enriched in known cilia proteins

Next steps (Phase 03 Plans 04-06):

Add expression evidence (GTEx, HPA)
Add localization evidence (subcellular predictions)
Add literature evidence (PubMed co-mentions)

8.9 KiB Raw Blame History