usher-exploring/.planning/phases/03-core-evidence-layers/03-04-SUMMARY.md
gbanyan d8009f1236 docs(03-04): complete subcellular localization evidence layer
- Created SUMMARY.md with full implementation details
- Updated STATE.md: progress 40%, 8/20 plans complete
- Documented 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting)
- All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete
2026-02-11 19:08:01 +08:00

phase: 03-core-evidence-layers
plan: 04
subsystem: evidence-localization
tags: evidence-layer, hpa, subcellular-localization, proteomics, cilia-proximity
dependency-graph:
  requires: gene-universe, duckdb-store, provenance-tracker
  provides: subcellular_localization_table, cilia_proximity_scoring, experimental_vs_computational_classification
  affects: evidence-integration
tech-stack:
  added: hpa-subcellular-data, cilia-proteomics-reference-sets
  patterns: fetch-transform-load, evidence-type-classification, proximity-scoring
key-files:
  created:
    src/usher_pipeline/evidence/localization/__init__.py
    src/usher_pipeline/evidence/localization/models.py
    src/usher_pipeline/evidence/localization/fetch.py
    src/usher_pipeline/evidence/localization/transform.py
    src/usher_pipeline/evidence/localization/load.py
    tests/test_localization.py
    tests/test_localization_integration.py
  modified:
    src/usher_pipeline/cli/evidence_cmd.py
decisions:
  title: Curated proteomics reference sets embedded as Python data
  rationale: CiliaCarta and Centrosome-DB gene sets are static and small (~150 genes total), embedding avoids external dependency
  alternatives: External CSV files; Database lookup

  title: Absence from proteomics is False not NULL
  rationale: Not being detected in proteomics is informative (gene was tested, not found) vs NULL (unknown/not tested)
  impact: Consistent NULL semantics: NULL = unknown, False = known negative, True = known positive

  title: Computational evidence downweighted to 0.6x
  rationale: HPA Uncertain/Approved predictions based on RNA-seq are less reliable than antibody-based or MS-based detection
  impact: Experimental evidence (HPA Enhanced/Supported, proteomics) scores higher than computational predictions
metrics:
  duration-minutes: 9.3
  tasks-completed: 2
  files-created: 7
  files-modified: 1
  tests-added: 17
  test-pass-rate: 100%
  commits: 2
  completed-at: 2026-02-11T19:13:07Z

Phase 03 Plan 04: Subcellular Localization Evidence Summary

One-liner: Integrated HPA subcellular localization with curated cilia/centrosome proteomics, scoring genes by cilia-proximity with experimental vs computational evidence weighting.

Objectives Achieved

LOCA-01: HPA Subcellular and Proteomics Integration

  • Downloaded HPA subcellular_location.tsv.zip (bulk download, ~10MB)
  • Parsed HPA locations (Main, Additional, Extracellular) into semicolon-separated strings
  • Cross-referenced genes against curated CiliaCarta (cilia proteomics, ~80 genes) and Centrosome-DB (~70 genes) reference sets
  • Mapped gene symbols to Ensembl gene IDs using gene universe
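
The location-parsing step above can be sketched as follows. This is a minimal illustration, not the code in fetch.py: the function names are hypothetical, and it assumes each HPA location field is itself a semicolon-separated list of terms.

```python
from typing import Optional


def combine_hpa_locations(
    main: Optional[str],
    additional: Optional[str],
    extracellular: Optional[str],
) -> Optional[str]:
    """Merge the three HPA location fields into one ';'-separated string."""
    terms = []
    for field in (main, additional, extracellular):
        if not field:
            continue  # missing or empty column contributes nothing
        for term in field.split(";"):
            term = term.strip()
            if term and term not in terms:  # drop blanks and duplicates
                terms.append(term)
    # Return None (NULL) rather than "" when HPA reports no location at all
    return ";".join(terms) if terms else None


def parse_locations(combined: Optional[str]) -> list:
    """Split the combined string back into individual location terms."""
    return combined.split(";") if combined else []
```

Returning None for a gene with no reported location keeps the "NULL = unknown" convention used elsewhere in this layer.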

LOCA-02: Experimental vs Computational Evidence Classification

  • HPA Enhanced/Supported reliability → experimental (antibody-based IHC with validation)
  • HPA Approved/Uncertain reliability → computational (predicted from RNA-seq or unvalidated)
  • Proteomics presence (MS-based) → experimental (overrides computational HPA classification)
  • Evidence type categories: experimental, computational, both, none
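
The classification rules above can be sketched as a small pure function. The name and signature below are hypothetical and need not match classify_evidence_type() in transform.py. The "both" outcome for proteomics presence combined with an Approved/Uncertain HPA entry follows the behaviour described by test_proteomics_override.

```python
from typing import Optional

# Reliability tiers as grouped in this summary
EXPERIMENTAL_RELIABILITY = {"Enhanced", "Supported"}
COMPUTATIONAL_RELIABILITY = {"Approved", "Uncertain"}


def classify_evidence_sketch(
    hpa_reliability: Optional[str],
    in_cilia_proteomics: bool,
    in_centrosome_proteomics: bool,
) -> str:
    """Return 'experimental', 'computational', 'both', or 'none'."""
    has_experimental = (
        hpa_reliability in EXPERIMENTAL_RELIABILITY
        or in_cilia_proteomics
        or in_centrosome_proteomics
    )
    has_computational = hpa_reliability in COMPUTATIONAL_RELIABILITY
    if has_experimental and has_computational:
        return "both"
    if has_experimental:
        return "experimental"
    if has_computational:
        return "computational"
    return "none"  # no HPA entry and absent from both proteomics sets
```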

LOCA-03: Cilia Proximity Scoring with Evidence Weighting

  • Direct cilia compartment (Cilia, Centrosome, Basal body, Transition zone, Stereocilia) → 1.0 base score
  • Adjacent compartment (Cytoskeleton, Microtubules, Cell junctions, Focal adhesions) → 0.5 base score
  • In proteomics but no HPA cilia location → 0.3 base score
  • Evidence weight applied: experimental 1.0x, computational 0.6x, both 1.0x, none NULL
  • Normalized localization_score_normalized in [0, 1] range
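
A worked sketch of the scoring rules above, using the base scores and weights listed. The function name is hypothetical. Two behaviours are assumptions not stated in this summary: a gene with evidence but no matching compartment and no proteomics hit gets a base score of 0.0, and location terms are matched by exact string.

```python
from typing import Optional

DIRECT = {"Cilia", "Centrosome", "Basal body", "Transition zone", "Stereocilia"}
ADJACENT = {"Cytoskeleton", "Microtubules", "Cell junctions", "Focal adhesions"}
EVIDENCE_WEIGHT = {"experimental": 1.0, "computational": 0.6, "both": 1.0}


def cilia_proximity_sketch(
    hpa_location: Optional[str],
    in_proteomics: bool,
    evidence_type: str,
) -> Optional[float]:
    """Base score by compartment, multiplied by the evidence weight."""
    weight = EVIDENCE_WEIGHT.get(evidence_type)
    if weight is None:
        return None  # evidence_type "none" -> NULL score
    terms = {t.strip() for t in (hpa_location or "").split(";") if t.strip()}
    if terms & DIRECT:
        base = 1.0
    elif terms & ADJACENT:
        base = 0.5
    elif in_proteomics:
        base = 0.3
    else:
        base = 0.0  # assumption: evidence exists but nothing cilia-related
    return base * weight
```

Because the largest base score and the largest weight are both 1.0, the product already lies in [0, 1], so under these assumptions no further rescaling is needed to normalize it.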

Implementation Details

Data Model (models.py)

  • LocalizationRecord with HPA fields (hpa_main_location, hpa_reliability, hpa_evidence_type)
  • Proteomics presence flags (in_cilia_proteomics, in_centrosome_proteomics) - False not NULL for absences
  • Compartment booleans (compartment_cilia, compartment_centrosome, compartment_basal_body, compartment_transition_zone, compartment_stereocilia)
  • Scoring fields (cilia_proximity_score, localization_score_normalized)
  • Evidence type classification (experimental, computational, both, none)
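
The record shape above can be illustrated with a plain dataclass. This is a sketch only: the real LocalizationRecord in models.py may use a different base class, and the gene_id and gene_symbol field names, the defaults, and the Optional typing of the compartment flags are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalizationRecordSketch:
    gene_id: str        # assumed field name for the Ensembl gene ID
    gene_symbol: str    # assumed field name
    # HPA fields stay None (NULL) when the gene is absent from HPA
    hpa_main_location: Optional[str] = None
    hpa_reliability: Optional[str] = None
    hpa_evidence_type: Optional[str] = None
    # Proteomics flags default to False: absence is a known negative
    in_cilia_proteomics: bool = False
    in_centrosome_proteomics: bool = False
    # Compartment flags derived from the HPA location string
    compartment_cilia: Optional[bool] = None
    compartment_centrosome: Optional[bool] = None
    compartment_basal_body: Optional[bool] = None
    compartment_transition_zone: Optional[bool] = None
    compartment_stereocilia: Optional[bool] = None
    # Scores are None when there is no evidence to score
    cilia_proximity_score: Optional[float] = None
    localization_score_normalized: Optional[float] = None
    evidence_type: str = "none"
```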

Fetch Module (fetch.py)

  • download_hpa_subcellular(): Streaming zip download with retry, extraction, checkpoint
  • fetch_hpa_subcellular(): Parse HPA TSV, filter to gene universe, map symbols to IDs
  • fetch_cilia_proteomics(): Cross-reference against embedded CILIA_PROTEOMICS_GENES and CENTROSOME_PROTEOMICS_GENES sets
  • Tenacity retry for HTTP errors, structlog for progress logging
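
The proteomics cross-reference can be sketched as set membership against embedded constants. The two sets below are tiny illustrative stand-ins, not the curated CILIA_PROTEOMICS_GENES and CENTROSOME_PROTEOMICS_GENES sets, and the function name is hypothetical.

```python
# Illustrative placeholders only; the real sets hold ~80 and ~70 genes
CILIA_EXAMPLE_GENES = frozenset({"BBS1", "CEP290"})
CENTROSOME_EXAMPLE_GENES = frozenset({"CEP290"})


def cross_reference_proteomics(gene_symbols):
    """Flag each symbol against both reference sets.

    Absence is recorded as False (known negative), never None.
    """
    return [
        {
            "gene_symbol": symbol,
            "in_cilia_proteomics": symbol in CILIA_EXAMPLE_GENES,
            "in_centrosome_proteomics": symbol in CENTROSOME_EXAMPLE_GENES,
        }
        for symbol in gene_symbols
    ]
```

Every input symbol yields a row, so downstream joins never introduce NULLs into the proteomics columns.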

Transform Module (transform.py)

  • classify_evidence_type(): HPA reliability → experimental/computational, proteomics override, evidence_type = experimental/computational/both/none
  • score_localization(): Parse HPA location string, set compartment flags, compute cilia_proximity_score, apply evidence weight
  • process_localization_evidence(): End-to-end pipeline (fetch HPA → fetch proteomics → merge → classify → score)

Load Module (load.py)

  • load_to_duckdb(): Save to subcellular_localization table, record provenance with evidence type distribution and cilia compartment counts
  • query_cilia_localized(): Helper to query genes with cilia_proximity_score > threshold
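
The load-and-query behaviour can be illustrated with the standard-library sqlite3 module standing in for DuckDB. This is a substitute chosen so the sketch runs without extra dependencies. It approximates DuckDB's CREATE OR REPLACE with DROP TABLE IF EXISTS, uses a reduced two-column table, and the function names and example scores are hypothetical.

```python
import sqlite3

TABLE = "subcellular_localization"


def load_rows_sketch(conn, rows):
    """Replace the table wholesale so repeated loads are idempotent."""
    conn.execute(f"DROP TABLE IF EXISTS {TABLE}")
    conn.execute(
        f"CREATE TABLE {TABLE} (gene_symbol TEXT, cilia_proximity_score REAL)"
    )
    conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?)", rows)


def query_cilia_localized_sketch(conn, threshold=0.5):
    """Return symbols whose score is strictly above the threshold.

    Rows with a NULL score never satisfy the comparison, so genes
    without evidence are excluded rather than treated as zero.
    """
    cur = conn.execute(
        f"SELECT gene_symbol FROM {TABLE} "
        "WHERE cilia_proximity_score > ? ORDER BY gene_symbol",
        (threshold,),
    )
    return [row[0] for row in cur.fetchall()]
```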

CLI Command (evidence_cmd.py)

  • usher-pipeline evidence localization subcommand
  • Checkpoint-restart pattern (skips if subcellular_localization table exists, --force to rerun)
  • Display summary: Total genes, Experimental/Computational/Both evidence counts, Cilia-localized count (proximity > 0.5)
  • Provenance sidecar saved to data/localization/subcellular.provenance.json
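
The checkpoint-restart behaviour can be sketched independently of the CLI framework. The in-memory store and the run function below are hypothetical stand-ins. Only the has_checkpoint() method name and the --force semantics come from this summary.

```python
TABLE = "subcellular_localization"


class InMemoryStoreSketch:
    """Stand-in for the pipeline store; keeps tables in a dict."""

    def __init__(self):
        self._tables = {}

    def has_checkpoint(self, name):
        return name in self._tables

    def save(self, name, rows):
        self._tables[name] = rows


def run_localization_sketch(store, compute, force=False):
    """Skip the expensive step when a checkpoint exists, unless forced."""
    if store.has_checkpoint(TABLE) and not force:
        return "skipped"
    store.save(TABLE, compute())
    return "computed"
```

A second call with the same store returns "skipped" without invoking compute, which mirrors what test_checkpoint_restart checks for the HPA download.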

Testing

Unit Tests (12 tests, 100% pass)

  • test_hpa_location_parsing: Semicolon-separated location string parsing
  • test_cilia_compartment_detection: "Centrosome" detection → compartment_centrosome=True
  • test_adjacent_compartment_scoring: "Cytoskeleton" → proximity=0.5
  • test_evidence_type_experimental: Enhanced reliability → experimental
  • test_evidence_type_computational: Uncertain reliability → computational
  • test_proteomics_override: In proteomics + HPA uncertain → evidence_type=both
  • test_null_handling_no_hpa: Gene not in HPA → HPA columns NULL
  • test_proteomics_absence_is_false: Not in proteomics → False (not NULL)
  • test_score_normalization: All scores in [0, 1]
  • test_evidence_weight_applied: Experimental scores 1.0, computational scores 0.6 for same compartment
  • test_fetch_cilia_proteomics: BBS1, CEP290 in cilia proteomics, ACTB not in proteomics
  • test_load_to_duckdb: DuckDB persistence with provenance

Integration Tests (5 tests, 100% pass)

  • test_full_pipeline: End-to-end with mocked HPA download (BBS1, CEP290, ACTB, TUBB, TP53)
  • test_checkpoint_restart: Cached HPA data reused, httpx.stream not called on second run
  • test_provenance_tracking: Provenance records evidence distribution, cilia compartment counts
  • test_query_cilia_localized: DuckDB query returns genes with proximity > 0.5
  • test_missing_gene_universe: Empty gene list handled gracefully

Deviations from Plan

Rule 1 - Bug: Evidence type terminology inconsistency

  • Found during: Test execution (test_evidence_type_applied failing)
  • Issue: transform.py used "predicted" for HPA computational evidence, but plan and tests expected "computational"
  • Fix: Changed "predicted" → "computational" in classify_evidence_type() for consistency with plan requirements
  • Files modified: src/usher_pipeline/evidence/localization/transform.py, tests/test_localization.py, tests/test_localization_integration.py
  • Commit: 942aaf2

Pattern Compliance

  • ✓ Fetch → Transform → Load pattern (matching gnomAD evidence layer)
  • ✓ Checkpoint-restart with store.has_checkpoint('subcellular_localization')
  • ✓ Provenance tracking with summary statistics
  • ✓ NULL preservation (HPA absence = NULL, proteomics absence = False)
  • ✓ Lazy polars evaluation where possible
  • ✓ Structlog for progress logging
  • ✓ Tenacity retry for HTTP errors
  • ✓ CLI subcommand with --force flag
  • ✓ DuckDB CREATE OR REPLACE for idempotency
  • ✓ Unit and integration tests with mocked HTTP calls

Success Criteria Verification

  • LOCA-01: HPA subcellular and cilium/centrosome proteomics data integrated
  • LOCA-02: Evidence distinguished as experimental vs computational based on HPA reliability and proteomics source
  • LOCA-03: Localization score reflects cilia compartment proximity with evidence-type weighting
  • Pattern compliance: fetch → transform → load → CLI → tests matching evidence layer structure
  • All tests pass: 17/17 (100%)
  • python -c "from usher_pipeline.evidence.localization import *" works
  • usher-pipeline evidence localization --help displays
  • DuckDB subcellular_localization table has all expected columns

Commits

  1. 6645c59 - feat(03-04): create localization evidence data model and processing

    • Created __init__.py, models.py, fetch.py, transform.py, load.py
    • Defined LocalizationRecord, HPA download, proteomics cross-reference, evidence classification, cilia proximity scoring
  2. 942aaf2 - feat(03-04): add localization CLI command and comprehensive tests

    • Added localization subcommand to evidence_cmd.py
    • Created 17 unit and integration tests (all pass)
    • Fixed evidence type terminology (computational vs predicted)

Key Files Created

Core Implementation

  • src/usher_pipeline/evidence/localization/__init__.py - Module exports
  • src/usher_pipeline/evidence/localization/models.py - LocalizationRecord model, compartment constants
  • src/usher_pipeline/evidence/localization/fetch.py - HPA download, proteomics cross-reference
  • src/usher_pipeline/evidence/localization/transform.py - Evidence classification, cilia proximity scoring
  • src/usher_pipeline/evidence/localization/load.py - DuckDB persistence, query helpers

Tests

  • tests/test_localization.py - 12 unit tests (parsing, classification, scoring, NULL handling)
  • tests/test_localization_integration.py - 5 integration tests (full pipeline, checkpoint, provenance)

Modified

  • src/usher_pipeline/cli/evidence_cmd.py - Added localization subcommand with checkpoint-restart

Lessons Learned

  1. Terminology consistency matters: Using "predicted" vs "computational" created confusion. Settled on "computational" to match plan requirements and bioinformatics convention (experimental vs computational evidence).

  2. NULL semantics clarity: Explicit decision that proteomics absence = False (informative negative) vs HPA absence = NULL (unknown) prevents data interpretation errors downstream.

  3. Reference gene set embedding: Small curated gene sets (~150 genes) are better embedded as Python constants than external files - simpler deployment, no file path issues, git-versioned.

  4. Evidence weighting is crucial: Downweighting computational predictions (0.6x) vs experimental evidence (1.0x) reflects real-world reliability differences and prevents overweighting HPA Uncertain predictions.

  5. Comprehensive testing pays off: The 17 tests caught the terminology bug, validated NULL handling, and verified the evidence weighting logic before any real data was processed.

Next Steps

  • Phase 03 Plan 05: Expression evidence layer (GTEx tissue specificity)
  • Phase 03 Plan 06: Literature evidence layer (PubMed mining)
  • Evidence integration layer to combine LOCA scores with GCON, EXPR, LITE scores

Self-Check: PASSED

All files verified:

  • ✓ src/usher_pipeline/evidence/localization/__init__.py
  • ✓ src/usher_pipeline/evidence/localization/models.py
  • ✓ src/usher_pipeline/evidence/localization/fetch.py
  • ✓ src/usher_pipeline/evidence/localization/transform.py
  • ✓ src/usher_pipeline/evidence/localization/load.py
  • ✓ tests/test_localization.py
  • ✓ tests/test_localization_integration.py

All commits verified:

  • 6645c59: feat(03-04): create localization evidence data model and processing
  • 942aaf2: feat(03-04): add localization CLI command and comprehensive tests