usher-exploring/.planning/phases/03-core-evidence-layers/03-02-PLAN.md at 0cd2f7c9dd7327bb337fa842b9c29d01299dd1d2

gbanyan 0d252da348 docs(03): create phase plan

2026-02-11 18:46:28 +08:00

---
phase: 03-core-evidence-layers
plan:
type: execute
wave:
depends_on:
files_modified:
  - src/usher_pipeline/evidence/expression/__init__.py
  - src/usher_pipeline/evidence/expression/models.py
  - src/usher_pipeline/evidence/expression/fetch.py
  - src/usher_pipeline/evidence/expression/transform.py
  - src/usher_pipeline/evidence/expression/load.py
  - tests/test_expression.py
  - tests/test_expression_integration.py
  - src/usher_pipeline/cli/evidence_cmd.py
  - pyproject.toml
autonomous: true
must_haves:
  truths:
    - 'Pipeline retrieves tissue-level expression data from HPA and GTEx for retina, inner ear, and cilia-rich tissues'
    - 'Pipeline retrieves scRNA-seq data from CellxGene for photoreceptor and hair cell subpopulations'
    - 'Expression data is converted to tissue specificity metrics (Tau index) across data sources'
    - 'Expression score reflects enrichment in Usher-relevant tissues relative to global expression'
  artifacts:
    - path: 'src/usher_pipeline/evidence/expression/fetch.py'
      provides: 'HPA, GTEx, and CellxGene expression data retrieval'
      exports:
        - fetch_hpa_expression
        - fetch_gtex_expression
        - fetch_cellxgene_expression
    - path: 'src/usher_pipeline/evidence/expression/transform.py'
      provides: 'Tau specificity index calculation and expression scoring'
      exports:
        - calculate_tau_specificity
        - compute_expression_score
        - process_expression_evidence
    - path: 'src/usher_pipeline/evidence/expression/load.py'
      provides: 'DuckDB persistence for expression evidence'
      exports:
        - load_to_duckdb
    - path: 'tests/test_expression.py'
      provides: 'Unit tests for expression scoring and Tau calculation'
  key_links:
    - from: 'src/usher_pipeline/evidence/expression/fetch.py'
      to: 'HPA API and GTEx API'
      via: 'httpx with tenacity retry'
      pattern: 'proteinatlas.org\|gtexportal.org'
    - from: 'src/usher_pipeline/evidence/expression/transform.py'
      to: 'src/usher_pipeline/evidence/expression/fetch.py'
      via: 'processes HPA/GTEx/CellxGene data into specificity scores'
      pattern: 'calculate_tau_specificity'
    - from: 'src/usher_pipeline/evidence/expression/load.py'
      to: 'src/usher_pipeline/persistence/duckdb_store.py'
      via: 'store.save_dataframe'
      pattern: 'save_dataframe.*tissue_expression'
---

Implement the Tissue Expression evidence layer (EXPR-01/02/03/04): retrieve expression data from HPA, GTEx, and CellxGene for Usher-relevant tissues, compute tissue specificity (Tau index), score enrichment in target tissues, and persist to DuckDB.

Purpose: Expression in retina, inner ear, and cilia-rich tissues is strong evidence for potential cilia/Usher involvement. Genes specifically expressed in these tissues (high Tau) are higher-priority candidates.

Output: tissue_expression DuckDB table with per-gene expression values across target tissues, Tau specificity index, and normalized expression score.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py

Task 1: Create expression evidence data model, fetch, and transform modules

Files: src/usher_pipeline/evidence/expression/__init__.py, src/usher_pipeline/evidence/expression/models.py, src/usher_pipeline/evidence/expression/fetch.py, src/usher_pipeline/evidence/expression/transform.py, pyproject.toml

Create the expression evidence layer following the established gnomAD fetch->transform pattern.

**models.py**: Define ExpressionRecord pydantic model with fields:
- gene_id (str)
- gene_symbol (str)
- hpa_retina_tpm (float|None)
- hpa_cerebellum_tpm (float|None -- proxy for cilia-rich tissue)
- gtex_retina_tpm (float|None -- "Eye - Retina" from GTEx)
- gtex_brain_cerebellum_tpm (float|None)
- cellxgene_photoreceptor_expr (float|None)
- cellxgene_hair_cell_expr (float|None)
- tau_specificity (float|None -- Tau index, 0=ubiquitous, 1=tissue-specific)
- usher_tissue_enrichment (float|None -- relative enrichment in target tissues)
- expression_score_normalized (float|None -- composite 0-1)

Define EXPRESSION_TABLE_NAME = "tissue_expression". Define TARGET_TISSUES as a dict mapping tissue keys to the source-specific identifiers used by HPA and GTEx.
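One possible shape for TARGET_TISSUES is sketched below. The tissue keys, the nested `hpa`/`gtex` layout, and the `source_tissue_names` helper are assumptions made for illustration; the identifier strings are the tissue names quoted elsewhere in this plan, and the identifiers the real APIs expect may differ.

```python
from typing import Dict, Optional

# Hypothetical layout: tissue key -> identifier per source.
# The strings are the tissue names quoted in this plan, not verified API IDs.
TARGET_TISSUES: Dict[str, Dict[str, Optional[str]]] = {
    "retina": {"hpa": "retina", "gtex": "Eye - Retina"},
    "cerebellum": {"hpa": "cerebellum", "gtex": "Brain - Cerebellum"},
    "testis": {"hpa": "testis", "gtex": "Testis"},
    "fallopian_tube": {"hpa": "fallopian tube", "gtex": "Fallopian Tube"},
    # The plan lists no HPA or GTEx tissue for inner ear; None marks the gap
    # instead of substituting a made-up identifier.
    "inner_ear": {"hpa": None, "gtex": None},
}


def source_tissue_names(source: str) -> Dict[str, str]:
    """Return tissue key -> identifier for one source, skipping tissues it lacks."""
    return {
        key: ids[source]
        for key, ids in TARGET_TISSUES.items()
        if ids.get(source) is not None
    }
```

Keeping the missing inner ear entry explicit makes it harder for a fetch function to silently treat an absent tissue as zero expression.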

**fetch.py**: Three fetch functions:
1. `fetch_hpa_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download the HPA normal tissue data TSV from https://www.proteinatlas.org/download/normal_tissue.tsv.zip (a bulk download is more efficient than per-gene API calls for 20K genes). Filter to target tissues: "retina", "cerebellum", "testis" (cilia-rich), "fallopian tube" (ciliated epithelium). Extract TPM/expression levels; note that normal_tissue.tsv reports categorical IHC protein levels, so numeric TPM values may have to come from HPA's RNA consensus download (rna_tissue_consensus.tsv.zip) instead. Use httpx streaming download with tenacity retry (same pattern as gnomAD). Parse with polars scan_csv. Return a DataFrame with gene_id and one column per tissue. NULL for genes not in HPA.
2. `fetch_gtex_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download GTEx median gene expression from the GTEx Portal bulk data (V10), or query the API in batches. Target tissues: "Eye - Retina" (not available in all GTEx versions -- handle as NULL), "Brain - Cerebellum", "Testis", "Fallopian Tube". Use httpx with tenacity retry and ratelimit (conservative 5 req/sec for the GTEx API). If a bulk download is available, prefer it. Return a DataFrame with gene_id and GTEx tissue columns. NULL for unavailable tissue/gene combinations.
3. `fetch_cellxgene_expression(gene_ids: list[str]) -> pl.DataFrame` -- Use the cellxgene_census library to query scRNA-seq data. Query for cell types "photoreceptor cell", "retinal rod cell", "retinal cone cell", and "hair cell" in tissues "retina", "inner ear", and "cochlea". Compute mean expression per gene per cell type. If cellxgene_census is not available (optional dependency), log a warning and return a DataFrame with all NULLs. Process in gene batches (100 at a time) to control memory. Return a DataFrame with gene_id and cellxgene columns.
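The batching and optional-dependency fallback described for `fetch_cellxgene_expression` could look roughly like the sketch below. It uses only the standard library: plain dict rows stand in for a polars DataFrame, `logging` stands in for structlog, and `query_batch` is a hypothetical callable representing whatever Census query the real module performs. None of these names come from the codebase.

```python
import importlib.util
import logging
from typing import Callable, Iterator, List, Optional

logger = logging.getLogger(__name__)  # stand-in for the project's structlog logger

CELLXGENE_COLUMNS = ("cellxgene_photoreceptor_expr", "cellxgene_hair_cell_expr")


def iter_batches(gene_ids: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of gene_ids, each at most batch_size long."""
    for start in range(0, len(gene_ids), batch_size):
        yield gene_ids[start:start + batch_size]


def census_available() -> bool:
    """True if the optional cellxgene_census package can be imported."""
    return importlib.util.find_spec("cellxgene_census") is not None


def null_cellxgene_rows(gene_ids: List[str]) -> List[dict]:
    """One row per gene with every CellxGene column set to None (NULL)."""
    return [{"gene_id": g, **{c: None for c in CELLXGENE_COLUMNS}} for g in gene_ids]


def fetch_cellxgene_rows(
    gene_ids: List[str],
    query_batch: Callable[[List[str]], List[dict]],  # hypothetical Census query
    available: Optional[bool] = None,
    batch_size: int = 100,
) -> List[dict]:
    """Query in batches, or fall back to all-NULL rows if the library is missing."""
    if available is None:
        available = census_available()
    if not available:
        logger.warning("cellxgene_census not installed; returning NULL columns")
        return null_cellxgene_rows(gene_ids)
    rows: List[dict] = []
    for batch in iter_batches(gene_ids, batch_size):
        rows.extend(query_batch(batch))
    return rows
```

Passing `available` explicitly keeps the fallback path testable on machines where the optional dependency happens to be installed.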

Add `cellxgene_census` to pyproject.toml optional dependencies under [project.optional-dependencies] as `expression = ["cellxgene-census>=1.19"]` (heavy dependency, keep optional).

**transform.py**: Three functions:
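1. `calculate_tau_specificity(df: pl.DataFrame, tissue_columns: list[str]) -> pl.DataFrame` -- Implement the Tau index: Tau = sum(1 - x_i/x_max) / (n - 1), where x_i is the expression in tissue i, x_max is the maximum expression across tissues, and n is the number of tissues. If any tissue value is NULL, Tau is NULL (insufficient data for reliable specificity). Add a tau_specificity column.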
2. `compute_expression_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute usher_tissue_enrichment: ratio of mean expression in Usher-relevant tissues (retina, inner ear proxies) to mean expression across all tissues. Higher ratio = more enriched. Normalize to 0-1. Compute expression_score_normalized as weighted combination: 0.4 * usher_tissue_enrichment_normalized + 0.3 * tau_specificity + 0.3 * max_target_tissue_rank (rank of max expression in target tissues / total genes). NULL if all expression data is NULL.
3. `process_expression_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch HPA -> fetch GTEx -> fetch CellxGene -> merge on gene_id -> compute Tau -> compute score -> collect.
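As a worked illustration of the Tau formula in item 1, here is a minimal per-gene sketch in plain Python. The planned function operates on a polars DataFrame, so this shows only the arithmetic. Returning None when all values are zero or when fewer than two tissues are supplied is an assumption, since the plan does not say how those cases should be handled.

```python
from typing import Optional, Sequence


def tau_specificity(values: Sequence[Optional[float]]) -> Optional[float]:
    """Tau = sum(1 - x_i / x_max) / (n - 1) over n tissue expression values."""
    # Any NULL tissue value makes Tau NULL, as the plan specifies.
    if any(v is None for v in values):
        return None
    n = len(values)
    # Assumption: Tau is undefined for fewer than two tissues or no expression.
    if n < 2:
        return None
    x_max = max(values)
    if x_max <= 0:
        return None
    return sum(1.0 - v / x_max for v in values) / (n - 1)


# Equal expression everywhere: every term is 0, so Tau is 0 (ubiquitous).
ubiquitous = tau_specificity([5.0, 5.0, 5.0, 5.0])
# Expression in a single tissue: three terms of 1 over n - 1 = 3, so Tau is 1.
specific = tau_specificity([10.0, 0.0, 0.0, 0.0])
```

An intermediate profile such as `[8.0, 4.0, 0.0]` gives (0 + 0.5 + 1) / 2 = 0.75.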

Follow established patterns: NULL preservation, structlog logging, lazy polars evaluation. NOTE: GTEx lacks inner ear/cochlea tissue -- handle as NULL, do not fabricate. CellxGene is the primary source for inner ear data.
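To make the NULL-preservation rule concrete for the scoring step, the sketch below computes the enrichment ratio and the 0.4/0.3/0.3 composite for a single gene in plain Python. Two choices here are assumptions rather than part of the plan: the means skip NULL tissues, and the composite is NULL whenever any of its three components is NULL. The second rule is stricter than the plan's "NULL if all expression data is NULL".

```python
from typing import Optional, Sequence

# Weights from the plan: enrichment, Tau, max target tissue rank.
WEIGHTS = (0.4, 0.3, 0.3)


def mean_ignoring_null(values: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the non-NULL values, or None if every value is NULL."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def usher_tissue_enrichment(
    target: Sequence[Optional[float]],
    all_tissues: Sequence[Optional[float]],
) -> Optional[float]:
    """Ratio of mean target-tissue expression to mean expression overall."""
    target_mean = mean_ignoring_null(target)
    global_mean = mean_ignoring_null(all_tissues)
    if target_mean is None or global_mean is None or global_mean == 0:
        return None  # never fabricate a ratio from missing data
    return target_mean / global_mean


def composite_score(
    enrichment_normalized: Optional[float],
    tau: Optional[float],
    max_target_rank: Optional[float],
) -> Optional[float]:
    """Weighted sum of three components, each already scaled to 0-1."""
    parts = (enrichment_normalized, tau, max_target_rank)
    if any(p is None for p in parts):  # assumption: any NULL component -> NULL
        return None
    return sum(w * p for w, p in zip(WEIGHTS, parts))
```

Because the weights sum to 1 and each component lies in 0-1, the composite also stays in 0-1 without a further normalization pass.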

Verify: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.expression import fetch_hpa_expression, fetch_gtex_expression, calculate_tau_specificity, compute_expression_score, process_expression_evidence; print('imports OK')"`

Done: Expression fetch module retrieves data from HPA (bulk TSV), GTEx (API/bulk), and CellxGene (census library). Transform module computes Tau specificity index and Usher tissue enrichment score normalized to 0-1. NULL preserved for missing tissues/genes.

Task 2: Create expression DuckDB loader, CLI command, and tests

Files: src/usher_pipeline/evidence/expression/load.py, src/usher_pipeline/cli/evidence_cmd.py, tests/test_expression.py, tests/test_expression_integration.py

**load.py**: Follow the gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to the "tissue_expression" table. Record provenance with: genes with retina expression, genes with inner ear expression, mean Tau, and expression score distribution stats. Create a `query_tissue_enriched(store, min_enrichment=2.0) -> pl.DataFrame` helper for genes enriched in Usher tissues.
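The provenance statistics listed for load.py could be gathered along the lines of the sketch below, which works on plain dict rows rather than a polars DataFrame. Which columns count as "retina" and "inner ear" evidence, the `summarize_expression` name, and the output keys are all assumptions for illustration. The actual provenance API in the codebase is not shown in this plan.

```python
from statistics import mean
from typing import List, Optional


def summarize_expression(rows: List[dict]) -> dict:
    """Summary statistics for a provenance record, computed over non-NULL values."""

    def count_with_any(*columns: str) -> int:
        # A gene counts if at least one of the named columns is non-NULL.
        return sum(1 for r in rows if any(r.get(c) is not None for c in columns))

    def non_null(column: str) -> List[float]:
        return [r[column] for r in rows if r.get(column) is not None]

    taus = non_null("tau_specificity")
    scores = non_null("expression_score_normalized")
    score_mean: Optional[float] = mean(scores) if scores else None
    return {
        # Assumption: retina evidence may come from either HPA or GTEx.
        "genes_with_retina_expression": count_with_any("hpa_retina_tpm", "gtex_retina_tpm"),
        # Assumption: hair cell expression is the inner ear signal.
        "genes_with_inner_ear_expression": count_with_any("cellxgene_hair_cell_expr"),
        "mean_tau": mean(taus) if taus else None,
        "score_min": min(scores) if scores else None,
        "score_max": max(scores) if scores else None,
        "score_mean": score_mean,
    }
```

Genes with no data contribute to none of the counts or means, so an all-NULL row cannot pull a statistic toward zero.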

**evidence_cmd.py**: Add `expression` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('tissue_expression')), --force flag, --skip-cellxgene flag (for running without optional CellxGene dependency), load gene universe for gene_ids, call process_expression_evidence, load to DuckDB, save provenance sidecar to data/expression/tissue.provenance.json. Display summary: tissue coverage counts, mean Tau, top enriched genes preview.

**tests/test_expression.py**: Unit tests with synthetic data, NO external API calls. Mock httpx responses for HPA/GTEx, mock cellxgene_census. Test cases:
- test_tau_calculation_ubiquitous: Equal expression across tissues -> Tau near 0
- test_tau_calculation_specific: Expression in one tissue only -> Tau near 1
- test_tau_null_handling: NULL tissue values -> NULL Tau
- test_enrichment_score_high: High retina expression relative to global -> high enrichment
- test_enrichment_score_low: No target tissue expression -> low enrichment
- test_expression_score_normalization: Composite score in [0, 1]
- test_null_preservation_all_sources: Gene with no data from any source -> NULL score
- test_hpa_parsing: Correct extraction of tissue-level TPM from HPA TSV format
- test_gtex_missing_tissue: NULL for tissues not in GTEx (inner ear)

**tests/test_expression_integration.py**: Integration tests with mocked downloads. Test full pipeline, checkpoint-restart, provenance. Synthetic HPA TSV fixture, mocked GTEx responses.
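A synthetic fixture and parsing test in the spirit of `test_hpa_parsing` might look like the sketch below. It assumes a long-format TSV with `Gene`, `Gene name`, `Tissue`, and `nTPM` columns, which may not match the file the fetch module ends up using. The gene IDs are made up, and `parse_hpa_tsv` is a hypothetical stdlib stand-in for the polars-based parser.

```python
import csv
import io
from typing import List, Sequence

# Synthetic fixture: made-up gene IDs and values, assumed column layout.
SYNTHETIC_HPA_TSV = (
    "Gene\tGene name\tTissue\tnTPM\n"
    "ENSG00000000001\tGENEA\tretina\t42.5\n"
    "ENSG00000000001\tGENEA\tcerebellum\t3.0\n"
    "ENSG00000000001\tGENEA\tliver\t7.0\n"
    "ENSG00000000002\tGENEB\tcerebellum\t1.5\n"
)

TARGETS = ("retina", "cerebellum")


def parse_hpa_tsv(
    text: str, gene_ids: List[str], tissues: Sequence[str] = TARGETS
) -> List[dict]:
    """Pivot long-format rows into one row per requested gene, NULL where absent."""
    wide = {
        g: {"gene_id": g, **{f"hpa_{t}_tpm": None for t in tissues}}
        for g in gene_ids
    }
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        gene, tissue = row["Gene"], row["Tissue"]
        if gene in wide and tissue in tissues:  # drop non-target tissues
            wide[gene][f"hpa_{tissue}_tpm"] = float(row["nTPM"])
    return [wide[g] for g in gene_ids]


def test_hpa_parsing() -> None:
    rows = parse_hpa_tsv(
        SYNTHETIC_HPA_TSV,
        ["ENSG00000000001", "ENSG00000000002", "ENSG00000000003"],
    )
    assert rows[0]["hpa_retina_tpm"] == 42.5
    assert rows[1]["hpa_retina_tpm"] is None  # tissue missing for this gene
    assert rows[2]["hpa_cerebellum_tpm"] is None  # gene absent from the fixture
```

The third gene checks the "NULL for genes not in HPA" requirement without any network access.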

Verify: `cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_expression.py tests/test_expression_integration.py -v`

Done: All expression unit and integration tests pass. CLI `evidence expression` command registered with --skip-cellxgene option. DuckDB stores tissue_expression table with Tau and enrichment scores. Checkpoint-restart works.

Verification:
- `python -m pytest tests/test_expression.py tests/test_expression_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.expression import *"` -- all exports importable
- `usher-pipeline evidence expression --help` -- CLI help displays
- DuckDB tissue_expression table has columns: gene_id, gene_symbol, hpa_retina_tpm, gtex_retina_tpm, cellxgene_photoreceptor_expr, tau_specificity, usher_tissue_enrichment, expression_score_normalized

<success_criteria>

- EXPR-01: HPA and GTEx tissue expression retrieved for retina, inner ear proxies, and cilia-rich tissues
- EXPR-02: CellxGene scRNA-seq data retrieved for photoreceptor and hair cell types (optional dependency graceful fallback)
- EXPR-03: Tau specificity index computed across data sources with NULL handling
- EXPR-04: Expression score reflects enrichment in Usher-relevant tissues with normalized 0-1 composite
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure

</success_criteria>

After completion, create `.planning/phases/03-core-evidence-layers/03-02-SUMMARY.md`
