usher-exploring/.planning/phases/03-core-evidence-layers/03-03-SUMMARY.md


phase: 03-core-evidence-layers
plan: 03
subsystem: evidence-protein
tags: evidence, protein, uniprot, interpro, domains, motifs
dependency_graph:
  requires: gene-universe, duckdb-store, provenance-tracker
  provides: protein-features-table, protein-evidence-cli, protein-motif-detection
  affects: evidence-pipeline, cilia-candidate-scoring
tech_stack:
  added: uniprot-rest-api, interpro-api
  patterns: fetch-transform-load, motif-detection, domain-annotation
key_files:
  created:
    • src/usher_pipeline/evidence/protein/__init__.py
    • src/usher_pipeline/evidence/protein/models.py
    • src/usher_pipeline/evidence/protein/fetch.py
    • src/usher_pipeline/evidence/protein/transform.py
    • src/usher_pipeline/evidence/protein/load.py
    • tests/test_protein.py
    • tests/test_protein_integration.py
  modified:
    • src/usher_pipeline/cli/evidence_cmd.py
    • src/usher_pipeline/persistence/provenance.py
decisions:
    • UniProt REST API chosen over bulk download for flexibility (batches of 100 accessions)
    • InterPro API for supplemental domain annotations (10 req/sec rate limit)
    • Domain keyword matching for cilia motif detection (not ML-based to maintain explainability)
    • Composite score weights: length 15%, domain 20%, coiled-coil 20%, TM 20%, cilia 15%, scaffold 10%
    • List(Null) edge case handling added for proteins with no domains
metrics:
  duration_min: 11
  completed_at: 2026-02-11T19:07:42Z
  tasks_completed: 2
  tests_added: 11 unit + 5 integration = 16 total
  files_created: 7
  lines_added: ~1937

Phase 03 Plan 03: Protein Features Evidence Layer Summary

One-liner: Protein domain extraction and cilia motif detection from UniProt/InterPro with normalized composite scoring

What Was Built

Implemented the protein sequence and structure features evidence layer (PROT-01/02/03/04) with:

  1. Data Model (models.py):

    • ProteinFeatureRecord pydantic model with 12 fields
    • Cilia domain keywords (IFT, BBSome, ciliary, basal body, etc.)
    • Scaffold domain types (PDZ, SH3, Ankyrin, WD40, etc.)
    • NULL preservation for genes without UniProt entries
  2. Fetch Layer (fetch.py):

    • fetch_uniprot_features(): UniProt REST API with batching (100 accessions/request)
    • fetch_interpro_domains(): InterPro API for domain annotations (10 req/sec)
    • Retry logic with tenacity (5 attempts, exponential backoff)
    • Conservative rate limiting to respect API constraints
  3. Transform Layer (transform.py):

    • extract_protein_features(): Join UniProt + InterPro, compute domain counts
    • detect_cilia_motifs(): Case-insensitive keyword matching for cilia/scaffold/sensory domains
    • normalize_protein_features(): Log-transform length ranks, cap TM counts, and compute the composite score
    • process_protein_evidence(): End-to-end pipeline (map genes → fetch → transform → normalize)
  4. Load Layer (load.py):

    • load_to_duckdb(): Persist to protein_features table with provenance
    • query_cilia_candidates(): SQL query for genes with cilia domains or coiled-coil+scaffold
  5. CLI Integration (evidence_cmd.py):

    • usher-pipeline evidence protein command
    • Checkpoint-restart pattern (skip if data exists)
    • Summary display with domain counts, cilia matches, scaffold counts
  6. Comprehensive Tests:

    • 11 unit tests: feature extraction, motif detection, normalization, NULL handling
    • 5 integration tests: full pipeline, checkpoint-restart, provenance, queries
    • All tests passing with mocked APIs
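
The batching and retry behaviour described for the fetch layer can be sketched roughly as follows. This is a minimal illustration, not the code in fetch.py: the helper names `batched` and `with_backoff` are hypothetical, the plain retry loop is a standard-library stand-in for tenacity, and the 1-second base delay is an assumption. Only the batch size (100) and the attempt count (5) come from this summary.

```python
import time
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 100   # accessions per request, as stated in this summary
MAX_ATTEMPTS = 5   # retry attempts, as stated in this summary


def batched(accessions: Sequence[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` accessions."""
    for start in range(0, len(accessions), size):
        yield list(accessions[start:start + size])


def with_backoff(
    call: Callable[[], dict],
    attempts: int = MAX_ATTEMPTS,
    base_delay: float = 1.0,  # assumed value, not taken from the source
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry `call` on ConnectionError, doubling the delay after each failure.

    Stand-in for the tenacity decorator used in the real fetch layer.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting, in the same spirit as the mocked-API tests listed above.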

Deviations from Plan

Auto-fixed Issues

1. [Rule 3 - Blocking] List(Null) type handling in domain names

  • Found during: Test development (test_null_uniprot)
  • Issue: When all proteins have empty domain lists, Polars creates List(Null) type instead of List(String), causing .str.to_lowercase() to fail with "expected String type, got: null"
  • Fix: Added type coercion in detect_cilia_motifs() to cast List(Null) to List(String) before keyword matching
  • Files modified: src/usher_pipeline/evidence/protein/transform.py
  • Commit: 4605987

2. [Rule 1 - Bug] Polars list concatenation operator incompatibility

  • Found during: Test execution (test_uniprot_feature_extraction)
  • Issue: Cannot use + operator to concatenate list columns in Polars with non-numeric inner types
  • Fix: Changed from pl.col("domain_names") + pl.col("domain_names_interpro") to pl.col("domain_names").list.concat(pl.col("domain_names_interpro"))
  • Files modified: src/usher_pipeline/evidence/protein/transform.py
  • Commit: 4605987

3. [Rule 2 - Missing critical functionality] ProvenanceTracker.get_steps() method

  • Found during: Test development (test_provenance_recording)
  • Issue: Tests expected get_steps() method to verify recorded provenance steps, but method was missing
  • Fix: Added get_steps() method to ProvenanceTracker returning self.processing_steps
  • Files modified: src/usher_pipeline/persistence/provenance.py
  • Commit: 4605987

4. [Rule 2 - Missing critical functionality] ProvenanceTracker step dict "name" field

  • Found during: Test development (test_provenance_recording)
  • Issue: Tests expected both "name" and "step_name" fields in provenance steps for compatibility
  • Fix: Added "name" field to step dict in record_step() alongside "step_name"
  • Files modified: src/usher_pipeline/persistence/provenance.py
  • Commit: 4605987
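
Taken together, fixes 3 and 4 amount to something like the sketch below. This is a stripped-down stand-in for ProvenanceTracker: the `record_step()` signature and the `details` field are assumptions, and the real class carries more state. Only `get_steps()`, `processing_steps`, and the paired "name"/"step_name" keys come from this summary.

```python
from typing import Any, Optional


class ProvenanceTracker:
    """Toy version of the tracker, limited to the two fixes above."""

    def __init__(self) -> None:
        self.processing_steps: list[dict[str, Any]] = []

    def record_step(self, step_name: str, details: Optional[dict[str, Any]] = None) -> None:
        """Append one step; both key spellings are written for compatibility."""
        self.processing_steps.append(
            {
                "name": step_name,       # added in fix 4
                "step_name": step_name,  # pre-existing key
                "details": details or {},  # hypothetical field
            }
        )

    def get_steps(self) -> list[dict[str, Any]]:
        """Return the recorded steps (fix 3)."""
        return self.processing_steps
```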

Key Technical Decisions

  1. UniProt REST API over bulk download: REST API provides flexibility for incremental updates and per-gene queries without downloading full proteome datasets (~200GB uncompressed). Batch size of 100 balances API efficiency with rate limits.

  2. InterPro supplemental annotations: While UniProt provides basic domain annotations, InterPro offers more comprehensive domain classification from multiple databases (Pfam, SMART, PROSITE, etc.). This improves motif detection recall.

  3. Keyword-based motif detection: Simple case-insensitive substring matching on domain names rather than ML-based classification. Rationale: maintains explainability (every flagged gene can be traced to a specific domain annotation) and avoids training data requirements.

  4. Composite score weights: Empirically balanced weights favoring domain count (20%), transmembrane regions (20%), and coiled-coils (20%), as these are most enriched in known cilia proteins. Cilia motif matches contribute 15% and scaffold domains 10%. Length contributes 15% to avoid penalizing small adaptor proteins.

  5. NULL preservation throughout pipeline: Genes without UniProt entries get NULL scores (not 0.0) to distinguish "unknown" from "no evidence". Critical for downstream scoring to avoid false confidence.
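
Decisions 4 and 5 combine into a weighted sum that stays NULL for unknown genes. The sketch below assumes each feature has already been scaled to the 0-1 range and that a missing UniProt entry is represented as `None`; the function name `composite_score` and the feature key names are hypothetical. The weights are the ones listed in this summary.

```python
from typing import Mapping, Optional

# Weights as listed in this summary; they sum to 1.0.
WEIGHTS = {
    "length": 0.15,
    "domain": 0.20,
    "coiled_coil": 0.20,
    "transmembrane": 0.20,
    "cilia": 0.15,
    "scaffold": 0.10,
}


def composite_score(features: Optional[Mapping[str, float]]) -> Optional[float]:
    """Weighted sum of features already scaled to [0, 1].

    Returns None when the gene has no UniProt entry (features is None),
    so "unknown" is never reported as 0.0.
    """
    if features is None:
        return None
    return sum(weight * features[name] for name, weight in WEIGHTS.items())
```

A gene with a UniProt entry but no notable features scores 0.0, while a gene with no entry scores `None`, which is the distinction decision 5 is meant to protect.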

Verification

All success criteria met:

  • PROT-01: Protein length, domain composition, domain count extracted from UniProt/InterPro per gene
  • PROT-02: Coiled-coil, scaffold/adaptor, and transmembrane domains identified
  • PROT-03: Cilia-associated motifs detected via domain keyword matching without presupposing conclusions
  • PROT-04: Binary and continuous protein features normalized to 0-1 composite score
  • Pattern compliance: fetch->transform->load->CLI->tests matching established gnomAD evidence layer structure
  • All tests passing: 16/16 tests pass (11 unit + 5 integration)
  • Imports verified: All protein module exports importable

Self-Check

Files Created

All files verified to exist:

  • /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/__init__.py
  • /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/models.py
  • /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/fetch.py
  • /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/transform.py
  • /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/load.py
  • /Users/gbanyan/Project/usher-exploring/tests/test_protein.py
  • /Users/gbanyan/Project/usher-exploring/tests/test_protein_integration.py

Commits

Task commits verified:

  • 4605987: feat(03-03): implement protein evidence layer with UniProt/InterPro integration

Self-Check: PASSED

Impact

This evidence layer enables:

  1. Structural evidence scoring: Genes with cilia-associated domains, scaffold proteins, or coiled-coil/TM combinations receive higher protein scores for candidate ranking
  2. Explainable motif matching: Every flagged gene can be traced to specific UniProt/InterPro domain annotations (no black-box ML)
  3. Integration with gene universe: protein_features table joins on gene_id for multi-evidence scoring
  4. Cilia candidate queries: query_cilia_candidates() identifies genes with structural features enriched in known cilia proteins
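
The traceability claim in item 2 can be illustrated with a plain-Python sketch that returns the matching domain names alongside each keyword. The keyword tuples are only the subset named in this summary (the lists in models.py are longer), the function name `match_keywords` is hypothetical, and the domain strings in the usage line are invented examples.

```python
from typing import Iterable

# Subsets of the keyword lists named in this summary.
CILIA_KEYWORDS = ("ift", "bbsome", "ciliary", "basal body")
SCAFFOLD_KEYWORDS = ("pdz", "sh3", "ankyrin", "wd40")


def match_keywords(domain_names: Iterable[str], keywords: Iterable[str]) -> dict[str, list[str]]:
    """Map each keyword to the domain names that contain it.

    Matching is a case-insensitive substring test, so every hit can be
    traced back to the exact annotation that triggered it.
    """
    keywords = tuple(keywords)
    hits: dict[str, list[str]] = {}
    for name in domain_names:
        lowered = name.lower()
        for keyword in keywords:
            if keyword in lowered:
                hits.setdefault(keyword, []).append(name)
    return hits


evidence = match_keywords(["IFT-B complex subunit", "PDZ domain"], CILIA_KEYWORDS)
```

One trade-off of plain substring matching is that a short keyword such as "ift" also matches inside unrelated words, so the returned domain names are worth keeping for manual review.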

Next steps (Phase 03 Plans 04-06):

  • Add expression evidence (GTEx, HPA)
  • Add localization evidence (subcellular predictions)
  • Add literature evidence (PubMed co-mentions)