docs(03): create phase plan
.planning/phases/03-core-evidence-layers/03-04-PLAN.md
@@ -0,0 +1,162 @@
---
phase: 03-core-evidence-layers
plan: 04
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/evidence/localization/__init__.py
  - src/usher_pipeline/evidence/localization/models.py
  - src/usher_pipeline/evidence/localization/fetch.py
  - src/usher_pipeline/evidence/localization/transform.py
  - src/usher_pipeline/evidence/localization/load.py
  - tests/test_localization.py
  - tests/test_localization_integration.py
  - src/usher_pipeline/cli/evidence_cmd.py
autonomous: true

must_haves:
  truths:
    - "Pipeline integrates protein localization data from HPA subcellular and published centrosome/cilium proteomics"
    - "Localization evidence distinguishes experimental from computational predictions"
    - "Localization score reflects proximity to cilia-related compartments"
  artifacts:
    - path: "src/usher_pipeline/evidence/localization/fetch.py"
      provides: "HPA subcellular and proteomics localization data retrieval"
      exports: ["fetch_hpa_subcellular", "fetch_cilia_proteomics"]
    - path: "src/usher_pipeline/evidence/localization/transform.py"
      provides: "Localization scoring with evidence type distinction"
      exports: ["score_localization", "classify_evidence_type", "process_localization_evidence"]
    - path: "src/usher_pipeline/evidence/localization/load.py"
      provides: "DuckDB persistence for localization evidence"
      exports: ["load_to_duckdb"]
    - path: "tests/test_localization.py"
      provides: "Unit tests for localization scoring and evidence classification"
  key_links:
    - from: "src/usher_pipeline/evidence/localization/fetch.py"
      to: "HPA subcellular data"
      via: "httpx bulk download of subcellular_location.tsv"
      pattern: "proteinatlas\\.org.*subcellular"
    - from: "src/usher_pipeline/evidence/localization/transform.py"
      to: "src/usher_pipeline/evidence/localization/fetch.py"
      via: "processes HPA and proteomics data into localization scores"
      pattern: "score_localization|classify_evidence_type"
    - from: "src/usher_pipeline/evidence/localization/load.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "store.save_dataframe"
      pattern: "save_dataframe.*subcellular_localization"
---

<objective>
Implement the Subcellular Localization evidence layer (LOCA-01/02/03): retrieve HPA subcellular localization and curated cilium/centrosome proteomics data, distinguish experimental from computational evidence, score proximity to cilia-related compartments.

Purpose: Direct localization to cilia, centrosome, basal body, or transition zone is strong evidence for cilia involvement. Distinguishing experimental from computational predictions prevents overweighting predictions.
Output: subcellular_localization DuckDB table with per-gene compartment assignments, evidence type (experimental/predicted), cilia proximity score, and normalized localization score.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>

<tasks>

<task type="auto">
<name>Task 1: Create localization evidence data model, fetch, and transform modules</name>
<files>
src/usher_pipeline/evidence/localization/__init__.py
src/usher_pipeline/evidence/localization/models.py
src/usher_pipeline/evidence/localization/fetch.py
src/usher_pipeline/evidence/localization/transform.py
</files>
<action>
Create the localization evidence layer following the established gnomAD fetch->transform pattern.

**models.py**: Define LocalizationRecord pydantic model with fields:
- gene_id (str)
- gene_symbol (str)
- hpa_main_location (str|None -- semicolon-separated HPA locations)
- hpa_reliability (str|None -- "Enhanced", "Supported", "Approved", "Uncertain")
- hpa_evidence_type (str|None -- "experimental" or "predicted" based on reliability)
- in_cilia_proteomics (bool|None -- found in published cilium proteomics datasets)
- in_centrosome_proteomics (bool|None -- found in centrosome proteomics)
- compartment_cilia (bool|None)
- compartment_centrosome (bool|None)
- compartment_basal_body (bool|None)
- compartment_transition_zone (bool|None)
- compartment_stereocilia (bool|None)
- evidence_type (str -- "experimental", "computational", "both", "none")
- cilia_proximity_score (float|None -- 0-1 based on compartment relevance)
- localization_score_normalized (float|None -- 0-1 composite)

Define LOCALIZATION_TABLE_NAME = "subcellular_localization". Define CILIA_COMPARTMENTS = ["Cilia", "Cilium", "Centrosome", "Centriole", "Basal body", "Microtubule organizing center"]. Define CILIA_ADJACENT_COMPARTMENTS = ["Cytoskeleton", "Microtubules", "Cell Junctions", "Focal adhesion sites"].

**fetch.py**: Two fetch functions:
1. `fetch_hpa_subcellular(gene_ids: list[str]) -> pl.DataFrame` -- Download HPA subcellular location data TSV from https://www.proteinatlas.org/download/subcellular_location.tsv.zip (bulk download). Parse with polars. Extract columns: Gene name, Reliability, Main location, Additional location, Extracellular location. Map gene symbols to gene_ids using gene universe. Filter to input gene_ids. Return DataFrame with gene_id, hpa_main_location (string of all locations), hpa_reliability. Use httpx streaming with tenacity retry.
2. `fetch_cilia_proteomics(gene_ids: list[str]) -> pl.DataFrame` -- Create a curated reference set of known cilium and centrosome proteomics datasets. Use gene lists from published studies embedded as Python data (CiliaCarta gene set, Centrosome-DB genes). These are static reference lists that can be hardcoded or loaded from a small bundled CSV (data/reference/cilia_proteomics_genes.csv, data/reference/centrosome_proteomics_genes.csv). Cross-reference gene_ids against these lists. Return DataFrame with gene_id, in_cilia_proteomics (bool), in_centrosome_proteomics (bool). For genes not in either set, both are False (not NULL -- absence from proteomics is informative).
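
The parsing half of `fetch_hpa_subcellular` might look roughly like the sketch below. It deliberately uses only the standard library, whereas the plan calls for polars and httpx, so treat it as an illustration of the row shape rather than the implementation. The helper name `parse_hpa_zip` and the choice to join the three location columns with `;` are assumptions; the column names are the ones listed above.

```python
import csv
import io
import zipfile

# Location columns named in the plan; joined into one ';'-separated string (assumption).
LOCATION_COLUMNS = ["Main location", "Additional location", "Extracellular location"]


def parse_hpa_zip(zip_bytes: bytes) -> list[dict]:
    """Parse the first TSV member of an HPA subcellular zip into row dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        member = archive.namelist()[0]
        text = archive.read(member).decode("utf-8")

    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter="\t"):
        # Keep only non-empty location fields for this gene.
        locations = [raw[col] for col in LOCATION_COLUMNS if raw.get(col)]
        rows.append(
            {
                "gene_symbol": raw["Gene name"],
                # None (NULL) when HPA lists no location at all for the gene.
                "hpa_main_location": ";".join(locations) if locations else None,
                "hpa_reliability": raw.get("Reliability") or None,
            }
        )
    return rows
```

Mapping `gene_symbol` to `gene_id` and filtering to the input gene_ids would follow as a separate step.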

**transform.py**: Three functions:
1. `classify_evidence_type(df: pl.DataFrame) -> pl.DataFrame` -- Based on HPA reliability: "Enhanced"/"Supported" -> "experimental", "Approved"/"Uncertain" -> "computational". If gene in proteomics dataset -> "experimental" (overrides HPA if HPA says computational). If both -> "both". If neither HPA nor proteomics -> "none". Add evidence_type column.
2. `score_localization(df: pl.DataFrame) -> pl.DataFrame` -- Parse hpa_main_location string. Check each location against CILIA_COMPARTMENTS (direct match = 1.0 weight) and CILIA_ADJACENT_COMPARTMENTS (adjacent = 0.5 weight). Set compartment booleans. Compute cilia_proximity_score: 1.0 if any direct cilia compartment, 0.5 if adjacent only, 0.3 if in proteomics but no HPA cilia location, 0.0 if none. Compute localization_score_normalized as cilia_proximity_score multiplied by the evidence weight (experimental=1.0, computational=0.6). NULL if no localization data at all (gene not in HPA and not in proteomics).
3. `process_localization_evidence(gene_ids: list[str], gene_symbol_map: pl.DataFrame) -> pl.DataFrame` -- End-to-end: fetch HPA -> fetch proteomics -> merge -> classify evidence -> score -> collect.
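
The classification and scoring rules above can be sketched per gene in plain Python. This is a sketch under stated assumptions, not the polars implementation the plan asks for:

- `"both"` is read as "proteomics hit plus a computational-grade HPA entry", matching the `test_proteomics_override` case.
- `"both"` is given the experimental weight of 1.0, because the plan does not state one.
- Locations are matched case-insensitively against whole compartment names.

The function names are hypothetical.

```python
from typing import Optional

CILIA_COMPARTMENTS = ["Cilia", "Cilium", "Centrosome", "Centriole",
                      "Basal body", "Microtubule organizing center"]
CILIA_ADJACENT_COMPARTMENTS = ["Cytoskeleton", "Microtubules",
                               "Cell Junctions", "Focal adhesion sites"]

EXPERIMENTAL_RELIABILITY = {"Enhanced", "Supported"}
COMPUTATIONAL_RELIABILITY = {"Approved", "Uncertain"}
# Assumption: "both" is weighted like experimental evidence.
EVIDENCE_WEIGHT = {"experimental": 1.0, "both": 1.0, "computational": 0.6}


def classify_evidence(reliability: Optional[str], in_proteomics: bool) -> str:
    """Return 'experimental', 'computational', 'both', or 'none' for one gene."""
    if in_proteomics and reliability in COMPUTATIONAL_RELIABILITY:
        return "both"
    if in_proteomics or reliability in EXPERIMENTAL_RELIABILITY:
        return "experimental"
    if reliability in COMPUTATIONAL_RELIABILITY:
        return "computational"
    return "none"


def proximity_score(locations: Optional[str], in_proteomics: bool) -> Optional[float]:
    """Unweighted cilia proximity; None when the gene has no localization data."""
    if locations is None and not in_proteomics:
        return None
    parts = [p.strip().lower() for p in (locations or "").split(";") if p.strip()]
    direct = {c.lower() for c in CILIA_COMPARTMENTS}
    adjacent = {c.lower() for c in CILIA_ADJACENT_COMPARTMENTS}
    if any(p in direct for p in parts):
        return 1.0
    if any(p in adjacent for p in parts):
        return 0.5
    if in_proteomics:
        return 0.3
    return 0.0


def normalized_score(locations: Optional[str], reliability: Optional[str],
                     in_proteomics: bool) -> Optional[float]:
    """Proximity multiplied by the evidence-type weight; None propagates."""
    proximity = proximity_score(locations, in_proteomics)
    if proximity is None:
        return None
    evidence = classify_evidence(reliability, in_proteomics)
    return proximity * EVIDENCE_WEIGHT.get(evidence, 0.0)
```

Under these assumptions a centrosome hit with "Enhanced" reliability scores 1.0, while the same hit with "Uncertain" reliability scores 0.6.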

Follow established patterns: NULL preservation, structlog logging. NOTE: Absence from proteomics datasets is informative (= not detected), so use False not NULL for proteomics columns. Absence from HPA is unknown (= not tested), so use NULL for HPA columns.
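
The NULL-versus-False convention in the note above can be illustrated with a small merge sketch. It is plain Python rather than a polars join, and the helper name `merge_localization` and the input shapes (a list of HPA row dicts plus two sets of gene IDs) are assumptions made for illustration.

```python
def merge_localization(gene_ids, hpa_rows, cilia_genes, centrosome_genes):
    """Left-join HPA rows onto gene_ids and add proteomics membership flags."""
    hpa_by_gene = {row["gene_id"]: row for row in hpa_rows}
    merged = []
    for gene_id in gene_ids:
        hpa = hpa_by_gene.get(gene_id)  # None when the gene is absent from HPA
        merged.append(
            {
                "gene_id": gene_id,
                # Not in HPA means "not tested", so the HPA columns stay None (NULL).
                "hpa_main_location": hpa["hpa_main_location"] if hpa else None,
                "hpa_reliability": hpa["hpa_reliability"] if hpa else None,
                # Not in proteomics means "not detected", so these are False, never None.
                "in_cilia_proteomics": gene_id in cilia_genes,
                "in_centrosome_proteomics": gene_id in centrosome_genes,
            }
        )
    return merged
```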
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.localization import fetch_hpa_subcellular, fetch_cilia_proteomics, classify_evidence_type, score_localization, process_localization_evidence; print('imports OK')"
</verify>
<done>
Localization fetch retrieves HPA subcellular data and cross-references proteomics gene lists. Transform classifies evidence type and scores cilia proximity with experimental/computational weighting. All functions importable.
</done>
</task>

<task type="auto">
<name>Task 2: Create localization DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/localization/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_localization.py
tests/test_localization_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "subcellular_localization" table. Record provenance: genes in cilia compartment, genes in centrosome, experimental vs computational counts, mean localization score. Create `query_cilia_localized(store) -> pl.DataFrame` helper for genes with cilia_proximity_score > 0.5.
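
The provenance statistics listed above could be computed along the lines of the sketch below. It works on plain row dicts rather than a polars DataFrame, and the helper name `summarize_localization` and the keys of the returned dict are hypothetical; the actual provenance schema should follow gnomad/load.py.

```python
from statistics import mean


def summarize_localization(rows):
    """Compute summary counts for the provenance record (key names are illustrative)."""
    # NULL scores are excluded from the mean instead of being counted as 0.
    scores = [
        r["localization_score_normalized"]
        for r in rows
        if r["localization_score_normalized"] is not None
    ]
    return {
        "cilia_compartment_genes": sum(1 for r in rows if r.get("compartment_cilia")),
        "centrosome_genes": sum(1 for r in rows if r.get("compartment_centrosome")),
        "experimental_count": sum(1 for r in rows if r["evidence_type"] == "experimental"),
        "computational_count": sum(1 for r in rows if r["evidence_type"] == "computational"),
        "mean_localization_score": mean(scores) if scores else None,
    }
```

Leaving NULL scores out of the mean keeps untested genes from dragging the average toward zero, in line with the NULL preservation pattern.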

**evidence_cmd.py**: Add `localization` subcommand to evidence command group. Follow gnomad pattern: checkpoint check, --force flag, load gene universe for gene_ids and symbol mapping, call process_localization_evidence, load to DuckDB, save provenance sidecar to data/localization/subcellular.provenance.json. Display summary: HPA coverage, cilia-localized gene count, evidence type distribution.

**tests/test_localization.py**: Unit tests with synthetic data. Mock httpx for HPA download. Test cases:
- test_hpa_location_parsing: Correct extraction of locations from semicolon-separated string
- test_cilia_compartment_detection: "Centrosome" in location -> compartment_centrosome=True
- test_adjacent_compartment_scoring: "Cytoskeleton" only -> cilia_proximity_score=0.5
- test_evidence_type_experimental: HPA Enhanced reliability -> experimental
- test_evidence_type_computational: HPA Uncertain reliability -> computational
- test_proteomics_override: Gene in proteomics but HPA uncertain -> evidence_type="both"
- test_null_handling_no_hpa: Gene not in HPA -> HPA columns NULL
- test_proteomics_absence_is_false: Gene not in proteomics -> in_cilia_proteomics=False (not NULL)
- test_score_normalization: Localization score in [0, 1]
- test_evidence_weight_applied: Experimental evidence scores higher than computational for same compartment

**tests/test_localization_integration.py**: Integration tests. Mock HPA download, synthetic proteomics reference. Test full pipeline, checkpoint-restart, provenance.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_localization.py tests/test_localization_integration.py -v
</verify>
<done>
All localization unit and integration tests pass. CLI `evidence localization` command registered. DuckDB stores subcellular_localization table with compartment flags, evidence types, and cilia proximity score. Checkpoint-restart works.
</done>
</task>

</tasks>

<verification>
- `python -m pytest tests/test_localization.py tests/test_localization_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.localization import *"` -- all exports importable
- `usher-pipeline evidence localization --help` -- CLI help displays
- DuckDB subcellular_localization table has columns: gene_id, gene_symbol, hpa_main_location, evidence_type, compartment_cilia, cilia_proximity_score, localization_score_normalized
</verification>

<success_criteria>
- LOCA-01: HPA subcellular and cilium/centrosome proteomics data integrated
- LOCA-02: Evidence distinguished as experimental vs computational based on HPA reliability and proteomics source
- LOCA-03: Localization score reflects cilia compartment proximity with evidence-type weighting
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
</success_criteria>

<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-04-SUMMARY.md`
</output>