Files

gbanyan d8009f1236 docs(03-04): complete subcellular localization evidence layer

- Created SUMMARY.md with full implementation details
- Updated STATE.md: progress 40%, 8/20 plans complete
- Documented 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting)
- All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete

2026-02-11 19:08:01 +08:00

12 KiB

Raw Blame History

phase, plan, subsystem, tags, dependency-graph, tech-stack, key-files, decisions, metrics

phase

plan

subsystem

Phase 03 Plan 04: Subcellular Localization Evidence Summary

One-liner: Integrated HPA subcellular localization with curated cilia/centrosome proteomics, scoring genes by cilia-proximity with experimental vs computational evidence weighting.

Objectives Achieved

LOCA-01: HPA Subcellular and Proteomics Integration

Downloaded HPA subcellular_location.tsv.zip (bulk download, ~10MB)
Parsed HPA locations (Main, Additional, Extracellular) into semicolon-separated strings
Cross-referenced genes against curated CiliaCarta (cilia proteomics, ~80 genes) and Centrosome-DB (~70 genes) reference sets
Mapped gene symbols to Ensembl gene IDs using gene universe

LOCA-02: Experimental vs Computational Evidence Classification

HPA Enhanced/Supported reliability → experimental (antibody-based IHC with validation)
HPA Approved/Uncertain reliability → computational (predicted from RNA-seq or unvalidated)
Proteomics presence (MS-based) → experimental (overrides computational HPA classification)
Evidence type categories: experimental, computational, both, none

LOCA-03: Cilia Proximity Scoring with Evidence Weighting

Direct cilia compartment (Cilia, Centrosome, Basal body, Transition zone, Stereocilia) → 1.0 base score
Adjacent compartment (Cytoskeleton, Microtubules, Cell junctions, Focal adhesions) → 0.5 base score
In proteomics but no HPA cilia location → 0.3 base score
Evidence weight applied: experimental 1.0x, computational 0.6x, both 1.0x, none NULL
Normalized localization_score_normalized in [0, 1] range

Implementation Details

Data Model (models.py)

LocalizationRecord with HPA fields (hpa_main_location, hpa_reliability, hpa_evidence_type)
Proteomics presence flags (in_cilia_proteomics, in_centrosome_proteomics) - False not NULL for absences
Compartment booleans (compartment_cilia, compartment_centrosome, compartment_basal_body, compartment_transition_zone, compartment_stereocilia)
Scoring fields (cilia_proximity_score, localization_score_normalized)
Evidence type classification (experimental, computational, both, none)

Fetch Module (fetch.py)

download_hpa_subcellular(): Streaming zip download with retry, extraction, checkpoint
fetch_hpa_subcellular(): Parse HPA TSV, filter to gene universe, map symbols to IDs
fetch_cilia_proteomics(): Cross-reference against embedded CILIA_PROTEOMICS_GENES and CENTROSOME_PROTEOMICS_GENES sets
Tenacity retry for HTTP errors, structlog for progress logging

Transform Module (transform.py)

classify_evidence_type(): HPA reliability → experimental/computational, proteomics override, evidence_type = experimental/computational/both/none
score_localization(): Parse HPA location string, set compartment flags, compute cilia_proximity_score, apply evidence weight
process_localization_evidence(): End-to-end pipeline (fetch HPA → fetch proteomics → merge → classify → score)

Load Module (load.py)

load_to_duckdb(): Save to subcellular_localization table, record provenance with evidence type distribution and cilia compartment counts
query_cilia_localized(): Helper to query genes with cilia_proximity_score > threshold

CLI Command (evidence_cmd.py)

usher-pipeline evidence localization subcommand
Checkpoint-restart pattern (skips if subcellular_localization table exists, --force to rerun)
Display summary: Total genes, Experimental/Computational/Both evidence counts, Cilia-localized count (proximity > 0.5)
Provenance sidecar saved to data/localization/subcellular.provenance.json

Testing

Unit Tests (17 tests, 100% pass)

test_hpa_location_parsing: Semicolon-separated location string parsing
test_cilia_compartment_detection: "Centrosome" detection → compartment_centrosome=True
test_adjacent_compartment_scoring: "Cytoskeleton" → proximity=0.5
test_evidence_type_experimental: Enhanced reliability → experimental
test_evidence_type_computational: Uncertain reliability → computational
test_proteomics_override: In proteomics + HPA uncertain → evidence_type=both
test_null_handling_no_hpa: Gene not in HPA → HPA columns NULL
test_proteomics_absence_is_false: Not in proteomics → False (not NULL)
test_score_normalization: All scores in [0, 1]
test_evidence_weight_applied: Experimental scores 1.0, computational scores 0.6 for same compartment
test_fetch_cilia_proteomics: BBS1, CEP290 in cilia proteomics, ACTB not in proteomics
test_load_to_duckdb: DuckDB persistence with provenance

Integration Tests (5 tests, 100% pass)

test_full_pipeline: End-to-end with mocked HPA download (BBS1, CEP290, ACTB, TUBB, TP53)
test_checkpoint_restart: Cached HPA data reused, httpx.stream not called on second run
test_provenance_tracking: Provenance records evidence distribution, cilia compartment counts
test_query_cilia_localized: DuckDB query returns genes with proximity > 0.5
test_missing_gene_universe: Empty gene list handled gracefully

Deviations from Plan

Rule 1 - Bug: Evidence type terminology inconsistency

Found during: Test execution (test_evidence_type_applied failing)
Issue: transform.py used "predicted" for HPA computational evidence, but plan and tests expected "computational"
Fix: Changed "predicted" → "computational" in classify_evidence_type() for consistency with plan requirements
Files modified: src/usher_pipeline/evidence/localization/transform.py, tests/test_localization.py, tests/test_localization_integration.py
Commit: 942aaf2

Pattern Compliance

✓ Fetch → Transform → Load pattern (matching gnomAD evidence layer) ✓ Checkpoint-restart with store.has_checkpoint('subcellular_localization') ✓ Provenance tracking with summary statistics ✓ NULL preservation (HPA absence = NULL, proteomics absence = False) ✓ Lazy polars evaluation where possible ✓ Structlog for progress logging ✓ Tenacity retry for HTTP errors ✓ CLI subcommand with --force flag ✓ DuckDB CREATE OR REPLACE for idempotency ✓ Unit and integration tests with mocked HTTP calls

Success Criteria Verification

LOCA-01: HPA subcellular and cilium/centrosome proteomics data integrated
LOCA-02: Evidence distinguished as experimental vs computational based on HPA reliability and proteomics source
LOCA-03: Localization score reflects cilia compartment proximity with evidence-type weighting
Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
All tests pass: 17/17 (100%)
python -c "from usher_pipeline.evidence.localization import *" works
usher-pipeline evidence localization --help displays
DuckDB subcellular_localization table has all expected columns

Commits

6645c59 - feat(03-04): create localization evidence data model and processing
- Created init.py, models.py, fetch.py, transform.py, load.py
- Defined LocalizationRecord, HPA download, proteomics cross-reference, evidence classification, cilia proximity scoring
942aaf2 - feat(03-04): add localization CLI command and comprehensive tests
- Added localization subcommand to evidence_cmd.py
- Created 17 unit and integration tests (all pass)
- Fixed evidence type terminology (computational vs predicted)

Key Files Created

Core Implementation

src/usher_pipeline/evidence/localization/__init__.py - Module exports
src/usher_pipeline/evidence/localization/models.py - LocalizationRecord model, compartment constants
src/usher_pipeline/evidence/localization/fetch.py - HPA download, proteomics cross-reference
src/usher_pipeline/evidence/localization/transform.py - Evidence classification, cilia proximity scoring
src/usher_pipeline/evidence/localization/load.py - DuckDB persistence, query helpers

Tests

tests/test_localization.py - 12 unit tests (parsing, classification, scoring, NULL handling)
tests/test_localization_integration.py - 5 integration tests (full pipeline, checkpoint, provenance)

Modified

src/usher_pipeline/cli/evidence_cmd.py - Added localization subcommand with checkpoint-restart

Lessons Learned

Terminology consistency matters: Using "predicted" vs "computational" created confusion. Settled on "computational" to match plan requirements and bioinformatics convention (experimental vs computational evidence).
NULL semantics clarity: Explicit decision that proteomics absence = False (informative negative) vs HPA absence = NULL (unknown) prevents data interpretation errors downstream.
Reference gene set embedding: Small curated gene sets (~150 genes) are better embedded as Python constants than external files - simpler deployment, no file path issues, git-versioned.
Evidence weighting is crucial: Downweighting computational predictions (0.6x) vs experimental evidence (1.0x) reflects real-world reliability differences and prevents overweighting HPA Uncertain predictions.
Comprehensive testing pays off: 17 tests caught terminology bug, validated NULL handling, verified evidence weighting logic before any real data was processed.

Next Steps

Phase 03 Plan 05: Expression evidence layer (GTEx tissue specificity)
Phase 03 Plan 06: Literature evidence layer (PubMed mining)
Evidence integration layer to combine LOCA scores with GCON, EXPR, LITE scores

Self-Check: PASSED

All files verified:

✓ src/usher_pipeline/evidence/localization/init.py
✓ src/usher_pipeline/evidence/localization/models.py
✓ src/usher_pipeline/evidence/localization/fetch.py
✓ src/usher_pipeline/evidence/localization/transform.py
✓ src/usher_pipeline/evidence/localization/load.py
✓ tests/test_localization.py
✓ tests/test_localization_integration.py

All commits verified:

✓ 6645c59: feat(03-04): create localization evidence data model and processing
✓ 942aaf2: feat(03-04): add localization CLI command and comprehensive tests

12 KiB Raw Blame History