From cfe4b830e6eca3d10f78c30409dbeccfb4546b55 Mon Sep 17 00:00:00 2001
From: gbanyan <gbanyan.huang@gmail.com>
Date: Wed, 11 Feb 2026 19:10:03 +0800
Subject: [PATCH] docs(03-03): complete protein features plan with SUMMARY and
 STATE updates

---
 .planning/STATE.md                            |  29 +--
 .../03-core-evidence-layers/03-03-SUMMARY.md  | 170 ++++++++++++++++++
 2 files changed, 188 insertions(+), 11 deletions(-)
 create mode 100644 .planning/phases/03-core-evidence-layers/03-03-SUMMARY.md

diff --git a/.planning/STATE.md b/.planning/STATE.md
index 4b1436b..25023be 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -10,18 +10,18 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 ## Current Position
 
 Phase: 3 of 6 (Core Evidence Layers)
-Plan: 4 of 6 in current phase
-Status: In progress — 03-04 complete (subcellular localization)
-Last activity: 2026-02-11 — Completed 03-04-PLAN.md (Subcellular Localization evidence layer)
+Plan: 3 of 6 in current phase (03-03 complete)
+Status: In progress — 03-03 complete (protein features)
+Last activity: 2026-02-11 — Completed 03-03-PLAN.md (Protein Features evidence layer)
 
-Progress: [█████░░░░░] 40.0% (8/20 plans complete across all phases)
+Progress: [█████░░░░░] 45.0% (9/20 plans complete across all phases)
 
 ## Performance Metrics
 
 **Velocity:**
-- Total plans completed: 8
-- Average duration: 4.7 min
-- Total execution time: 0.63 hours
+- Total plans completed: 9
+- Average duration: 5.2 min
+- Total execution time: 0.78 hours
 
 **By Phase:**
 
@@ -29,8 +29,10 @@ Progress: [█████░░░░░] 40.0% (8/20 plans complete across all
 |-------|-------|-------|----------|
 | 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
 | 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan |
-| 03 - Core Evidence Layers | 2/6 | 16 min | 8.0 min/plan |
-| Phase 03 P05 | 10 | 2 tasks | 8 files |
+| 03 - Core Evidence Layers | 3/6 | 27 min | 9.0 min/plan |
+| Phase 03 P03 | 11 min | 2 tasks | 7 files |
+| Phase 03 P04 | 8 min | 2 tasks | 8 files |
+| Phase 03 P05 | 10 min | 2 tasks | 8 files |
 
 ## Accumulated Context
 
@@ -67,6 +69,11 @@ Recent decisions affecting current work:
 - [03-01]: Annotation tier thresholds: Well >= (20 GO AND 4 UniProt), Partial >= (5 GO OR 3 UniProt)
 - [03-01]: Composite annotation score weighting: GO 50%, UniProt 30%, Pathway 20%
 - [03-01]: NULL GO counts treated as zero for tier classification but preserved as NULL in data (conservative assumption)
+- [03-03]: UniProt REST API with batching (100 accessions) over bulk download for flexibility
+- [03-03]: InterPro API for supplemental domain annotations (10 req/sec rate limit)
+- [03-03]: Keyword-based cilia motif detection over ML for explainability (IFT, BBSome, ciliary, etc.)
+- [03-03]: Composite protein score weights: length 15%, domain 20%, coiled-coil 20%, TM 20%, cilia 15%, scaffold 10%
+- [03-03]: List(Null) edge case handling for proteins with no domains (cast to List(String))
 - [03-04]: Evidence type terminology standardized to computational (not predicted) for consistency with bioinformatics convention
 - [03-04]: Proteomics absence stored as False (informative negative) vs HPA absence as NULL (unknown/not tested)
 - [03-04]: Curated proteomics reference gene sets (CiliaCarta, Centrosome-DB) embedded as Python constants for simpler deployment
@@ -85,5 +92,5 @@ None yet.
 ## Session Continuity
 
 Last session: 2026-02-11 - Plan execution
-Stopped at: Completed 03-04-PLAN.md (Subcellular Localization evidence layer)
-Resume file: .planning/phases/03-core-evidence-layers/03-04-SUMMARY.md
+Stopped at: Completed 03-03-PLAN.md (Protein Features evidence layer)
+Resume file: .planning/phases/03-core-evidence-layers/03-03-SUMMARY.md
diff --git a/.planning/phases/03-core-evidence-layers/03-03-SUMMARY.md b/.planning/phases/03-core-evidence-layers/03-03-SUMMARY.md
new file mode 100644
index 0000000..51852f4
--- /dev/null
+++ b/.planning/phases/03-core-evidence-layers/03-03-SUMMARY.md
@@ -0,0 +1,170 @@
+---
+phase: 03-core-evidence-layers
+plan: 03
+subsystem: evidence-protein
+tags: [evidence, protein, uniprot, interpro, domains, motifs]
+dependency_graph:
+  requires: [gene-universe, duckdb-store, provenance-tracker]
+  provides: [protein-features-table, protein-evidence-cli, protein-motif-detection]
+  affects: [evidence-pipeline, cilia-candidate-scoring]
+tech_stack:
+  added: [uniprot-rest-api, interpro-api]
+  patterns: [fetch-transform-load, motif-detection, domain-annotation]
+key_files:
+  created:
+    - src/usher_pipeline/evidence/protein/__init__.py
+    - src/usher_pipeline/evidence/protein/models.py
+    - src/usher_pipeline/evidence/protein/fetch.py
+    - src/usher_pipeline/evidence/protein/transform.py
+    - src/usher_pipeline/evidence/protein/load.py
+    - tests/test_protein.py
+    - tests/test_protein_integration.py
+  modified:
+    - src/usher_pipeline/cli/evidence_cmd.py
+    - src/usher_pipeline/persistence/provenance.py
+decisions:
+  - UniProt REST API chosen over bulk download for flexibility (batches of 100 accessions)
+  - InterPro API for supplemental domain annotations (10 req/sec rate limit)
+  - Domain keyword matching for cilia motif detection (not ML-based to maintain explainability)
+  - Composite score weights: length 15%, domain 20%, coiled-coil 20%, TM 20%, cilia 15%, scaffold 10%
+  - List(Null) edge case handling added for proteins with no domains
+metrics:
+  duration_min: 11
+  completed_at: "2026-02-11T19:07:42Z"
+  tasks_completed: 2
+  tests_added: 11 unit + 5 integration = 16 total
+  files_created: 7
+  lines_added: ~1937
+---
+
+# Phase 03 Plan 03: Protein Features Evidence Layer Summary
+
+**One-liner:** Protein domain extraction and cilia motif detection from UniProt/InterPro with normalized composite scoring
+
+## What Was Built
+
+Implemented the protein sequence and structure features evidence layer (PROT-01/02/03/04) with:
+
+1. **Data Model** (`models.py`):
+   - `ProteinFeatureRecord` pydantic model with 12 fields
+   - Cilia domain keywords (IFT, BBSome, ciliary, basal body, etc.)
+   - Scaffold domain types (PDZ, SH3, Ankyrin, WD40, etc.)
+   - NULL preservation for genes without UniProt entries
+
+2. **Fetch Layer** (`fetch.py`):
+   - `fetch_uniprot_features()`: UniProt REST API with batching (100 accessions/request)
+   - `fetch_interpro_domains()`: InterPro API for domain annotations (10 req/sec)
+   - Retry logic with tenacity (5 attempts, exponential backoff)
+   - Conservative rate limiting to respect API constraints
+
+3. **Transform Layer** (`transform.py`):
+   - `extract_protein_features()`: Join UniProt + InterPro, compute domain counts
+   - `detect_cilia_motifs()`: Case-insensitive keyword matching for cilia/scaffold/sensory domains
+   - `normalize_protein_features()`: Log-transform length ranks, cap TM counts, composite score
+   - `process_protein_evidence()`: End-to-end pipeline (map genes → fetch → transform → normalize)
+
+4. **Load Layer** (`load.py`):
+   - `load_to_duckdb()`: Persist to protein_features table with provenance
+   - `query_cilia_candidates()`: SQL query for genes with cilia domains or coiled-coil+scaffold
+
+5. **CLI Integration** (`evidence_cmd.py`):
+   - `usher-pipeline evidence protein` command
+   - Checkpoint-restart pattern (skip if data exists)
+   - Summary display with domain counts, cilia matches, scaffold counts
+
+6. **Comprehensive Tests**:
+   - 11 unit tests: feature extraction, motif detection, normalization, NULL handling
+   - 5 integration tests: full pipeline, checkpoint-restart, provenance, queries
+   - All tests passing with mocked APIs
+
+## Deviations from Plan
+
+### Auto-fixed Issues
+
+**1. [Rule 3 - Blocking] List(Null) type handling in domain names**
+- **Found during:** Test development (test_null_uniprot)
+- **Issue:** When all proteins have empty domain lists, Polars creates `List(Null)` type instead of `List(String)`, causing `.str.to_lowercase()` to fail with "expected String type, got: null"
+- **Fix:** Added type coercion in `detect_cilia_motifs()` to cast `List(Null)` to `List(String)` before keyword matching
+- **Files modified:** `src/usher_pipeline/evidence/protein/transform.py`
+- **Commit:** 4605987
+
+**2. [Rule 1 - Bug] Polars list concatenation operator incompatibility**
+- **Found during:** Test execution (test_uniprot_feature_extraction)
+- **Issue:** Cannot use `+` operator to concatenate list columns in Polars with non-numeric inner types
+- **Fix:** Changed from `pl.col("domain_names") + pl.col("domain_names_interpro")` to `pl.col("domain_names").list.concat(pl.col("domain_names_interpro"))`
+- **Files modified:** `src/usher_pipeline/evidence/protein/transform.py`
+- **Commit:** 4605987
+
+**3. [Rule 2 - Missing critical functionality] ProvenanceTracker.get_steps() method**
+- **Found during:** Test development (test_provenance_recording)
+- **Issue:** Tests expected `get_steps()` method to verify recorded provenance steps, but method was missing
+- **Fix:** Added `get_steps()` method to ProvenanceTracker returning `self.processing_steps`
+- **Files modified:** `src/usher_pipeline/persistence/provenance.py`
+- **Commit:** 4605987
+
+**4. [Rule 2 - Missing critical functionality] ProvenanceTracker step dict "name" field**
+- **Found during:** Test development (test_provenance_recording)
+- **Issue:** Tests expected both "name" and "step_name" fields in provenance steps for compatibility
+- **Fix:** Added "name" field to step dict in `record_step()` alongside "step_name"
+- **Files modified:** `src/usher_pipeline/persistence/provenance.py`
+- **Commit:** 4605987
+
+## Key Technical Decisions
+
+1. **UniProt REST API over bulk download**: REST API provides flexibility for incremental updates and per-gene queries without downloading full proteome datasets (~200GB uncompressed). Batch size of 100 balances API efficiency with rate limits.
+
+2. **InterPro supplemental annotations**: While UniProt provides basic domain annotations, InterPro offers more comprehensive domain classification from multiple databases (Pfam, SMART, PROSITE, etc.). This improves motif detection recall.
+
+3. **Keyword-based motif detection**: Simple case-insensitive substring matching on domain names rather than ML-based classification. Rationale: maintains explainability (every flagged gene can be traced to specific domain annotation) and avoids training data requirements.
+
+4. **Composite score weights**: Empirically balanced weights favoring domain count (20%), transmembrane regions (20%), and coiled-coils (20%) as these are most enriched in known cilia proteins. Length contributes 15% to avoid penalizing small adaptor proteins.
+
+5. **NULL preservation throughout pipeline**: Genes without UniProt entries get NULL scores (not 0.0) to distinguish "unknown" from "no evidence". Critical for downstream scoring to avoid false confidence.
+
+## Verification
+
+All success criteria met:
+
+- ✅ **PROT-01**: Protein length, domain composition, domain count extracted from UniProt/InterPro per gene
+- ✅ **PROT-02**: Coiled-coil, scaffold/adaptor, and transmembrane domains identified
+- ✅ **PROT-03**: Cilia-associated motifs detected via domain keyword matching without presupposing conclusions
+- ✅ **PROT-04**: Binary and continuous protein features normalized to 0-1 composite score
+- ✅ **Pattern compliance**: fetch->transform->load->CLI->tests matching established gnomAD evidence layer structure
+- ✅ **All tests passing**: 16/16 tests pass (11 unit + 5 integration)
+- ✅ **Imports verified**: All protein module exports importable
+
+## Self-Check
+
+### Files Created
+
+All files verified to exist:
+
+- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/__init__.py`
+- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/models.py`
+- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/fetch.py`
+- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/transform.py`
+- ✅ `/Users/gbanyan/Project/usher-exploring/src/usher_pipeline/evidence/protein/load.py`
+- ✅ `/Users/gbanyan/Project/usher-exploring/tests/test_protein.py`
+- ✅ `/Users/gbanyan/Project/usher-exploring/tests/test_protein_integration.py`
+
+### Commits
+
+Task commits verified:
+
+- ✅ `4605987`: feat(03-03): implement protein evidence layer with UniProt/InterPro integration
+
+**Self-Check: PASSED**
+
+## Impact
+
+This evidence layer enables:
+
+1. **Structural evidence scoring**: Genes with cilia-associated domains, scaffold proteins, or coiled-coil/TM combinations receive higher protein scores for candidate ranking
+2. **Explainable motif matching**: Every flagged gene can be traced to specific UniProt/InterPro domain annotations (no black-box ML)
+3. **Integration with gene universe**: protein_features table joins on gene_id for multi-evidence scoring
+4. **Cilia candidate queries**: `query_cilia_candidates()` identifies genes with structural features enriched in known cilia proteins
+
+Next steps (Phase 03 Plans 04-06):
+- Add expression evidence (GTEx, HPA)
+- Add localization evidence (subcellular predictions)
+- Add literature evidence (PubMed co-mentions)