docs(03-04): complete subcellular localization evidence layer

- Created SUMMARY.md with full implementation details - Updated STATE.md: progress 40%, 8/20 plans complete - Documented 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting) - All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete
2026-02-11 19:08:01 +08:00
parent 46059874f2
commit d8009f1236
7 changed files with 927 additions and 29 deletions
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -10,18 +10,18 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 ## Current Position

 Phase: 3 of 6 (Core Evidence Layers)
-Plan: 1 of 6 in current phase
-Status: In progress — 03-01 complete (annotation completeness)
-Last activity: 2026-02-11 — Completed 03-01-PLAN.md (annotation completeness evidence layer)
+Plan: 4 of 6 in current phase
+Status: In progress — 03-04 complete (subcellular localization)
+Last activity: 2026-02-11 — Completed 03-04-PLAN.md (Subcellular Localization evidence layer)

-Progress: [█████░░░░░] 35.0% (7/20 plans complete across all phases)
+Progress: [█████░░░░░] 40.0% (8/20 plans complete across all phases)

 ## Performance Metrics

 **Velocity:**
- Total plans completed: 7
- Average duration: 4.1 min
- Total execution time: 0.48 hours
+- Total plans completed: 8
+- Average duration: 4.7 min
+- Total execution time: 0.63 hours

 **By Phase:**

@@ -29,7 +29,7 @@ Progress: [█████░░░░░] 35.0% (7/20 plans complete across all
 |-------|-------|-------|----------|
 | 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
 | 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan |
-| 03 - Core Evidence Layers | 1/6 | 7 min | 7.2 min/plan |
+| 03 - Core Evidence Layers | 2/6 | 16 min | 8.0 min/plan |

 ## Accumulated Context

@@ -66,6 +66,10 @@ Recent decisions affecting current work:
 - [03-01]: Annotation tier thresholds: Well >= (20 GO AND 4 UniProt), Partial >= (5 GO OR 3 UniProt)
 - [03-01]: Composite annotation score weighting: GO 50%, UniProt 30%, Pathway 20%
 - [03-01]: NULL GO counts treated as zero for tier classification but preserved as NULL in data (conservative assumption)
+- [03-04]: Evidence type terminology standardized to computational (not predicted) for consistency with bioinformatics convention
+- [03-04]: Proteomics absence stored as False (informative negative) vs HPA absence as NULL (unknown/not tested)
+- [03-04]: Curated proteomics reference gene sets (CiliaCarta, Centrosome-DB) embedded as Python constants for simpler deployment
+- [03-04]: Computational evidence (HPA Uncertain/Approved) downweighted to 0.6x vs experimental (Enhanced/Supported, proteomics) at 1.0x

 ### Pending Todos

@@ -78,5 +82,5 @@ None yet.
 ## Session Continuity

 Last session: 2026-02-11 - Plan execution
-Stopped at: Completed 03-01-PLAN.md (annotation completeness evidence layer)
-Resume file: .planning/phases/03-core-evidence-layers/03-01-SUMMARY.md
+Stopped at: Completed 03-04-PLAN.md (Subcellular Localization evidence layer)
+Resume file: .planning/phases/03-core-evidence-layers/03-04-SUMMARY.md
--- a/.planning/phases/03-core-evidence-layers/03-04-SUMMARY.md
+++ b/.planning/phases/03-core-evidence-layers/03-04-SUMMARY.md
@@ -0,0 +1,214 @@
+---
+phase: 03-core-evidence-layers
+plan: 04
+subsystem: evidence-localization
+tags: [evidence-layer, hpa, subcellular-localization, proteomics, cilia-proximity]
+dependency-graph:
+  requires: [gene-universe, duckdb-store, provenance-tracker]
+  provides: [subcellular_localization_table, cilia_proximity_scoring, experimental_vs_computational_classification]
+  affects: [evidence-integration]
+tech-stack:
+  added: [hpa-subcellular-data, cilia-proteomics-reference-sets]
+  patterns: [fetch-transform-load, evidence-type-classification, proximity-scoring]
+key-files:
+  created:
+    - src/usher_pipeline/evidence/localization/__init__.py
+    - src/usher_pipeline/evidence/localization/models.py
+    - src/usher_pipeline/evidence/localization/fetch.py
+    - src/usher_pipeline/evidence/localization/transform.py
+    - src/usher_pipeline/evidence/localization/load.py
+    - tests/test_localization.py
+    - tests/test_localization_integration.py
+  modified:
+    - src/usher_pipeline/cli/evidence_cmd.py
+decisions:
+  - title: "Curated proteomics reference sets embedded as Python data"
+    rationale: "CiliaCarta and Centrosome-DB gene sets are static and small (~150 genes total), embedding avoids external dependency"
+    alternatives: ["External CSV files", "Database lookup"]
+  - title: "Absence from proteomics is False not NULL"
+    rationale: "Not being detected in proteomics is informative (gene was tested, not found) vs NULL (unknown/not tested)"
+    impact: "Consistent NULL semantics: NULL = unknown, False = known negative, True = known positive"
+  - title: "Computational evidence downweighted to 0.6x"
+    rationale: "HPA Uncertain/Approved predictions based on RNA-seq are less reliable than antibody-based or MS-based detection"
+    impact: "Experimental evidence (HPA Enhanced/Supported, proteomics) scores higher than computational predictions"
+metrics:
+  duration-minutes: 9.3
+  tasks-completed: 2
+  files-created: 7
+  files-modified: 1
+  tests-added: 17
+  test-pass-rate: 100%
+  commits: 2
+  completed-at: 2026-02-11T19:13:07Z
+---
+
+# Phase 03 Plan 04: Subcellular Localization Evidence Summary
+
+**One-liner:** Integrated HPA subcellular localization with curated cilia/centrosome proteomics, scoring genes by cilia-proximity with experimental vs computational evidence weighting.
+
+## Objectives Achieved
+
+### LOCA-01: HPA Subcellular and Proteomics Integration
+- Downloaded HPA subcellular_location.tsv.zip (bulk download, ~10MB)
+- Parsed HPA locations (Main, Additional, Extracellular) into semicolon-separated strings
+- Cross-referenced genes against curated CiliaCarta (cilia proteomics, ~80 genes) and Centrosome-DB (~70 genes) reference sets
+- Mapped gene symbols to Ensembl gene IDs using gene universe
+
+### LOCA-02: Experimental vs Computational Evidence Classification
+- HPA Enhanced/Supported reliability → experimental (antibody-based IHC with validation)
+- HPA Approved/Uncertain reliability → computational (predicted from RNA-seq or unvalidated)
+- Proteomics presence (MS-based) → experimental (overrides computational HPA classification)
+- Evidence type categories: experimental, computational, both, none
+
+### LOCA-03: Cilia Proximity Scoring with Evidence Weighting
+- Direct cilia compartment (Cilia, Centrosome, Basal body, Transition zone, Stereocilia) → 1.0 base score
+- Adjacent compartment (Cytoskeleton, Microtubules, Cell junctions, Focal adhesions) → 0.5 base score
+- In proteomics but no HPA cilia location → 0.3 base score
+- Evidence weight applied: experimental 1.0x, computational 0.6x, both 1.0x, none NULL
+- Normalized localization_score_normalized in [0, 1] range
+
+## Implementation Details
+
+### Data Model (models.py)
+- LocalizationRecord with HPA fields (hpa_main_location, hpa_reliability, hpa_evidence_type)
+- Proteomics presence flags (in_cilia_proteomics, in_centrosome_proteomics) - False not NULL for absences
+- Compartment booleans (compartment_cilia, compartment_centrosome, compartment_basal_body, compartment_transition_zone, compartment_stereocilia)
+- Scoring fields (cilia_proximity_score, localization_score_normalized)
+- Evidence type classification (experimental, computational, both, none)
+
+### Fetch Module (fetch.py)
+- `download_hpa_subcellular()`: Streaming zip download with retry, extraction, checkpoint
+- `fetch_hpa_subcellular()`: Parse HPA TSV, filter to gene universe, map symbols to IDs
+- `fetch_cilia_proteomics()`: Cross-reference against embedded CILIA_PROTEOMICS_GENES and CENTROSOME_PROTEOMICS_GENES sets
+- Tenacity retry for HTTP errors, structlog for progress logging
+
+### Transform Module (transform.py)
+- `classify_evidence_type()`: HPA reliability → experimental/computational, proteomics override, evidence_type = experimental/computational/both/none
+- `score_localization()`: Parse HPA location string, set compartment flags, compute cilia_proximity_score, apply evidence weight
+- `process_localization_evidence()`: End-to-end pipeline (fetch HPA → fetch proteomics → merge → classify → score)
+
+### Load Module (load.py)
+- `load_to_duckdb()`: Save to subcellular_localization table, record provenance with evidence type distribution and cilia compartment counts
+- `query_cilia_localized()`: Helper to query genes with cilia_proximity_score > threshold
+
+### CLI Command (evidence_cmd.py)
+- `usher-pipeline evidence localization` subcommand
+- Checkpoint-restart pattern (skips if subcellular_localization table exists, --force to rerun)
+- Display summary: Total genes, Experimental/Computational/Both evidence counts, Cilia-localized count (proximity > 0.5)
+- Provenance sidecar saved to data/localization/subcellular.provenance.json
+
+## Testing
+
+### Unit Tests (17 tests, 100% pass)
+- test_hpa_location_parsing: Semicolon-separated location string parsing
+- test_cilia_compartment_detection: "Centrosome" detection → compartment_centrosome=True
+- test_adjacent_compartment_scoring: "Cytoskeleton" → proximity=0.5
+- test_evidence_type_experimental: Enhanced reliability → experimental
+- test_evidence_type_computational: Uncertain reliability → computational
+- test_proteomics_override: In proteomics + HPA uncertain → evidence_type=both
+- test_null_handling_no_hpa: Gene not in HPA → HPA columns NULL
+- test_proteomics_absence_is_false: Not in proteomics → False (not NULL)
+- test_score_normalization: All scores in [0, 1]
+- test_evidence_weight_applied: Experimental scores 1.0, computational scores 0.6 for same compartment
+- test_fetch_cilia_proteomics: BBS1, CEP290 in cilia proteomics, ACTB not in proteomics
+- test_load_to_duckdb: DuckDB persistence with provenance
+
+### Integration Tests (5 tests, 100% pass)
+- test_full_pipeline: End-to-end with mocked HPA download (BBS1, CEP290, ACTB, TUBB, TP53)
+- test_checkpoint_restart: Cached HPA data reused, httpx.stream not called on second run
+- test_provenance_tracking: Provenance records evidence distribution, cilia compartment counts
+- test_query_cilia_localized: DuckDB query returns genes with proximity > 0.5
+- test_missing_gene_universe: Empty gene list handled gracefully
+
+## Deviations from Plan
+
+### Rule 1 - Bug: Evidence type terminology inconsistency
+- **Found during:** Test execution (test_evidence_type_applied failing)
+- **Issue:** transform.py used "predicted" for HPA computational evidence, but plan and tests expected "computational"
+- **Fix:** Changed "predicted" → "computational" in classify_evidence_type() for consistency with plan requirements
+- **Files modified:** src/usher_pipeline/evidence/localization/transform.py, tests/test_localization.py, tests/test_localization_integration.py
+- **Commit:** 942aaf2
+
+## Pattern Compliance
+
+✓ Fetch → Transform → Load pattern (matching gnomAD evidence layer)
+✓ Checkpoint-restart with `store.has_checkpoint('subcellular_localization')`
+✓ Provenance tracking with summary statistics
+✓ NULL preservation (HPA absence = NULL, proteomics absence = False)
+✓ Lazy polars evaluation where possible
+✓ Structlog for progress logging
+✓ Tenacity retry for HTTP errors
+✓ CLI subcommand with --force flag
+✓ DuckDB CREATE OR REPLACE for idempotency
+✓ Unit and integration tests with mocked HTTP calls
+
+## Success Criteria Verification
+
+- [x] LOCA-01: HPA subcellular and cilium/centrosome proteomics data integrated
+- [x] LOCA-02: Evidence distinguished as experimental vs computational based on HPA reliability and proteomics source
+- [x] LOCA-03: Localization score reflects cilia compartment proximity with evidence-type weighting
+- [x] Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
+- [x] All tests pass: 17/17 (100%)
+- [x] `python -c "from usher_pipeline.evidence.localization import *"` works
+- [x] `usher-pipeline evidence localization --help` displays
+- [x] DuckDB subcellular_localization table has all expected columns
+
+## Commits
+
+1. **6645c59** - feat(03-04): create localization evidence data model and processing
+   - Created __init__.py, models.py, fetch.py, transform.py, load.py
+   - Defined LocalizationRecord, HPA download, proteomics cross-reference, evidence classification, cilia proximity scoring
+
+2. **942aaf2** - feat(03-04): add localization CLI command and comprehensive tests
+   - Added localization subcommand to evidence_cmd.py
+   - Created 17 unit and integration tests (all pass)
+   - Fixed evidence type terminology (computational vs predicted)
+
+## Key Files Created
+
+### Core Implementation
+- `src/usher_pipeline/evidence/localization/__init__.py` - Module exports
+- `src/usher_pipeline/evidence/localization/models.py` - LocalizationRecord model, compartment constants
+- `src/usher_pipeline/evidence/localization/fetch.py` - HPA download, proteomics cross-reference
+- `src/usher_pipeline/evidence/localization/transform.py` - Evidence classification, cilia proximity scoring
+- `src/usher_pipeline/evidence/localization/load.py` - DuckDB persistence, query helpers
+
+### Tests
+- `tests/test_localization.py` - 12 unit tests (parsing, classification, scoring, NULL handling)
+- `tests/test_localization_integration.py` - 5 integration tests (full pipeline, checkpoint, provenance)
+
+### Modified
+- `src/usher_pipeline/cli/evidence_cmd.py` - Added localization subcommand with checkpoint-restart
+
+## Lessons Learned
+
+1. **Terminology consistency matters**: Using "predicted" vs "computational" created confusion. Settled on "computational" to match plan requirements and bioinformatics convention (experimental vs computational evidence).
+
+2. **NULL semantics clarity**: Explicit decision that proteomics absence = False (informative negative) vs HPA absence = NULL (unknown) prevents data interpretation errors downstream.
+
+3. **Reference gene set embedding**: Small curated gene sets (~150 genes) are better embedded as Python constants than external files - simpler deployment, no file path issues, git-versioned.
+
+4. **Evidence weighting is crucial**: Downweighting computational predictions (0.6x) vs experimental evidence (1.0x) reflects real-world reliability differences and prevents overweighting HPA Uncertain predictions.
+
+5. **Comprehensive testing pays off**: 17 tests caught terminology bug, validated NULL handling, verified evidence weighting logic before any real data was processed.
+
+## Next Steps
+
+- Phase 03 Plan 05: Expression evidence layer (GTEx tissue specificity)
+- Phase 03 Plan 06: Literature evidence layer (PubMed mining)
+- Evidence integration layer to combine LOCA scores with GCON, EXPR, LITE scores
+
+## Self-Check: PASSED
+
+All files verified:
+- ✓ src/usher_pipeline/evidence/localization/__init__.py
+- ✓ src/usher_pipeline/evidence/localization/models.py
+- ✓ src/usher_pipeline/evidence/localization/fetch.py
+- ✓ src/usher_pipeline/evidence/localization/transform.py
+- ✓ src/usher_pipeline/evidence/localization/load.py
+- ✓ tests/test_localization.py
+- ✓ tests/test_localization_integration.py
+
+All commits verified:
+- ✓ 6645c59: feat(03-04): create localization evidence data model and processing
+- ✓ 942aaf2: feat(03-04): add localization CLI command and comprehensive tests