---
phase: 03-core-evidence-layers
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/evidence/expression/__init__.py
- src/usher_pipeline/evidence/expression/models.py
- src/usher_pipeline/evidence/expression/fetch.py
- src/usher_pipeline/evidence/expression/transform.py
- src/usher_pipeline/evidence/expression/load.py
- tests/test_expression.py
- tests/test_expression_integration.py
- src/usher_pipeline/cli/evidence_cmd.py
- pyproject.toml
autonomous: true
must_haves:
  truths:
    - "Pipeline retrieves tissue-level expression data from HPA and GTEx for retina, inner ear, and cilia-rich tissues"
    - "Pipeline retrieves scRNA-seq data from CellxGene for photoreceptor and hair cell subpopulations"
    - "Expression data is converted to tissue specificity metrics (Tau index) across data sources"
    - "Expression score reflects enrichment in Usher-relevant tissues relative to global expression"
  artifacts:
    - path: "src/usher_pipeline/evidence/expression/fetch.py"
      provides: "HPA, GTEx, and CellxGene expression data retrieval"
      exports: ["fetch_hpa_expression", "fetch_gtex_expression", "fetch_cellxgene_expression"]
    - path: "src/usher_pipeline/evidence/expression/transform.py"
      provides: "Tau specificity index calculation and expression scoring"
      exports: ["calculate_tau_specificity", "compute_expression_score", "process_expression_evidence"]
    - path: "src/usher_pipeline/evidence/expression/load.py"
      provides: "DuckDB persistence for expression evidence"
      exports: ["load_to_duckdb"]
    - path: "tests/test_expression.py"
      provides: "Unit tests for expression scoring and Tau calculation"
  key_links:
    - from: "src/usher_pipeline/evidence/expression/fetch.py"
      to: "HPA API and GTEx API"
      via: "httpx with tenacity retry"
      pattern: "proteinatlas\\.org|gtexportal\\.org"
    - from: "src/usher_pipeline/evidence/expression/transform.py"
      to: "src/usher_pipeline/evidence/expression/fetch.py"
      via: "processes HPA/GTEx/CellxGene data into specificity scores"
      pattern: "calculate_tau_specificity"
    - from: "src/usher_pipeline/evidence/expression/load.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "store.save_dataframe"
      pattern: "save_dataframe.*tissue_expression"
---
Implement the Tissue Expression evidence layer (EXPR-01/02/03/04): retrieve expression data from HPA, GTEx, and CellxGene for Usher-relevant tissues, compute tissue specificity (Tau index), score enrichment in target tissues, and persist to DuckDB.
Purpose: Expression in retina, inner ear, and cilia-rich tissues is strong evidence for potential cilia/Usher involvement. Genes specifically expressed in these tissues (high Tau) are higher-priority candidates.
Output: tissue_expression DuckDB table with per-gene expression values across target tissues, Tau specificity index, and normalized expression score.
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
Task 1: Create expression evidence data model, fetch, and transform modules
src/usher_pipeline/evidence/expression/__init__.py
src/usher_pipeline/evidence/expression/models.py
src/usher_pipeline/evidence/expression/fetch.py
src/usher_pipeline/evidence/expression/transform.py
pyproject.toml
Create the expression evidence layer following the established gnomAD fetch->transform pattern.
**models.py**: Define an ExpressionRecord pydantic model with these fields:
- gene_id (str)
- gene_symbol (str)
- hpa_retina_tpm (float|None)
- hpa_cerebellum_tpm (float|None -- proxy for cilia-rich tissue)
- gtex_retina_tpm (float|None -- "Eye - Retina" from GTEx)
- gtex_brain_cerebellum_tpm (float|None)
- cellxgene_photoreceptor_expr (float|None)
- cellxgene_hair_cell_expr (float|None)
- tau_specificity (float|None -- Tau index, 0=ubiquitous, 1=tissue-specific)
- usher_tissue_enrichment (float|None -- relative enrichment in target tissues)
- expression_score_normalized (float|None -- composite 0-1)

Define EXPRESSION_TABLE_NAME = "tissue_expression". Define a TARGET_TISSUES dict mapping tissue keys to the API-specific identifiers used by HPA and GTEx.
**fetch.py**: Three fetch functions:
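1. `fetch_hpa_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download HPA normal tissue data TSV from https://www.proteinatlas.org/download/normal_tissue.tsv.zip (bulk download is more efficient than per-gene API calls for ~20K genes). Note: normal_tissue.tsv reports immunohistochemistry-based protein levels (Not detected/Low/Medium/High) rather than TPM; if numeric TPM values are needed for the `hpa_*_tpm` fields, confirm whether HPA's RNA consensus tissue download (rna_tissue_consensus.tsv, nTPM) is the appropriate file before implementing. Filter to target tissues: "retina", "cerebellum", "testis" (cilia-rich), "fallopian tube" (ciliated epithelium). Extract TPM/expression levels. Use httpx streaming download with tenacity retry (same pattern as gnomAD). Parse the extracted TSV with polars scan_csv (tab separator). Return DataFrame with gene_id and one column per tissue. NULL for genes not in HPA.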
2. `fetch_gtex_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download GTEx median gene expression from GTEx Portal bulk data (V10) or query API in batches. Target tissues: "Eye - Retina" (not available in all GTEx versions -- handle NULL), "Brain - Cerebellum", "Testis", "Fallopian Tube". Use httpx with tenacity retry and ratelimit (conservative 5 req/sec for GTEx API). If bulk download available, prefer that. Return DataFrame with gene_id, gtex tissue columns. NULL for unavailable tissue/gene combinations.
3. `fetch_cellxgene_expression(gene_ids: list[str]) -> pl.DataFrame` -- Use cellxgene_census library to query scRNA-seq data. Query for cell types: "photoreceptor cell", "retinal rod cell", "retinal cone cell", "hair cell" in tissues "retina", "inner ear", "cochlea". Compute mean expression per gene per cell type. If cellxgene_census not available (optional dependency), log warning and return DataFrame with all NULLs. Process in gene batches (100 at a time) to control memory. Return DataFrame with gene_id, cellxgene columns.
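The HPA step in item 1 amounts to filtering a long-format TSV to the target tissues and pivoting it to one row per gene. The following is a minimal standard-library sketch of that reshaping, not the intended polars implementation. The column names `Gene`, `Tissue`, and `nTPM`, the constant `HPA_TARGET_TISSUES`, and the helper `pivot_hpa_rows` are all illustrative assumptions and should be checked against the actual HPA file header.

```python
import csv
import io

# Tissue names come from the plan; the tuple itself is a hypothetical helper.
HPA_TARGET_TISSUES = ("retina", "cerebellum", "testis", "fallopian tube")


def pivot_hpa_rows(tsv_text, gene_ids, value_column="nTPM"):
    """Pivot long-format HPA rows into one dict per requested gene.

    Genes or tissues absent from the TSV keep None (NULL), never 0.
    Column names "Gene" and "Tissue" are assumptions about the file layout.
    """
    wide = {g: {t: None for t in HPA_TARGET_TISSUES} for g in gene_ids}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        gene, tissue = row["Gene"], row["Tissue"].lower()
        if gene in wide and tissue in HPA_TARGET_TISSUES:
            wide[gene][tissue] = float(row[value_column])
    return [{"gene_id": g, **tissues} for g, tissues in wide.items()]


# Synthetic fixture: gene IDs and values are made up for illustration.
sample = (
    "Gene\tGene name\tTissue\tnTPM\n"
    "ENSG00000001\tGENE_A\tretina\t42.5\n"
    "ENSG00000001\tGENE_A\tliver\t3.0\n"
    "ENSG00000002\tGENE_B\tcerebellum\t7.25\n"
)
rows = pivot_hpa_rows(sample, ["ENSG00000001", "ENSG00000002", "ENSG00000003"])
```

Seeding the result with every requested gene before reading the file is what guarantees a NULL row for genes HPA does not cover, which is the behavior `test_hpa_parsing` would likely want to pin down.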
Add `cellxgene_census` to pyproject.toml optional dependencies under [project.optional-dependencies] as `expression = ["cellxgene-census>=1.19"]` (heavy dependency, keep optional).
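The optional-dependency fallback and the 100-gene batching can be sketched as follows. This is a hedged illustration only: `fetch_cellxgene_or_nulls`, `batched`, and the `query_batch` callable are hypothetical names, the real Census query is deliberately not shown, and the standard `logging` module stands in for structlog.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)

# Column names taken from the ExpressionRecord fields in the plan.
CELLXGENE_COLUMNS = ("cellxgene_photoreceptor_expr", "cellxgene_hair_cell_expr")


def batched(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_cellxgene_or_nulls(gene_ids, query_batch, module_name="cellxgene_census"):
    """Return one row per gene, or all-NULL rows if the optional package is missing.

    `query_batch` is a hypothetical callable that takes a list of gene IDs and
    returns a list of row dicts; it stands in for the actual Census query.
    """
    if importlib.util.find_spec(module_name) is None:
        logger.warning("%s not installed; returning NULL expression columns", module_name)
        return [{"gene_id": g, **{c: None for c in CELLXGENE_COLUMNS}} for g in gene_ids]
    rows = []
    for batch in batched(gene_ids):  # 100 genes at a time to bound memory
        rows.extend(query_batch(batch))
    return rows
```

Checking availability with `importlib.util.find_spec` avoids importing the heavy package just to discover it is absent, and the same all-NULL path could plausibly back the `--skip-cellxgene` flag.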
**transform.py**: Three functions:
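1. `calculate_tau_specificity(df: pl.DataFrame, tissue_columns: list[str]) -> pl.DataFrame` -- Implement the Tau index: Tau = sum(1 - xi/xmax) / (n-1), where xi is the gene's expression in tissue i, xmax is its maximum across the n tissues, and the sum runs over all n tissues. If any tissue value is NULL, tau is NULL (insufficient data for reliable specificity). Tau is undefined when xmax = 0 (no expression in any tissue); handle that case explicitly instead of dividing by zero (NULL is consistent with the NULL-preservation pattern). Add tau_specificity column.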
2. `compute_expression_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute usher_tissue_enrichment: ratio of mean expression in Usher-relevant tissues (retina, inner ear proxies) to mean expression across all tissues. Higher ratio = more enriched. Normalize to 0-1. Compute expression_score_normalized as weighted combination: 0.4 * usher_tissue_enrichment_normalized + 0.3 * tau_specificity + 0.3 * max_target_tissue_rank (rank of max expression in target tissues / total genes). NULL if all expression data is NULL.
3. `process_expression_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch HPA -> fetch GTEx -> fetch CellxGene -> merge on gene_id -> compute Tau -> compute score -> collect.
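The Tau formula from item 1 can be checked per gene with a few lines of plain Python. This is a sketch of the arithmetic only, not the polars column expression the module would use; the function name `tau_specificity` and the choice to return None when xmax is 0 or fewer than two tissues are supplied are assumptions.

```python
def tau_specificity(values):
    """Tau = sum(1 - x_i / x_max) / (n - 1) over n tissue values.

    Returns None if any value is None (the plan's NULL rule), if fewer than
    two tissues are given, or if x_max is 0 (assumption: undefined -> NULL).
    """
    if len(values) < 2 or any(v is None for v in values):
        return None
    x_max = max(values)
    if x_max <= 0:
        return None
    return sum(1 - v / x_max for v in values) / (len(values) - 1)


ubiquitous = tau_specificity([5.0, 5.0, 5.0, 5.0])   # equal everywhere -> 0.0
specific = tau_specificity([0.0, 0.0, 10.0, 0.0])    # one tissue only -> 1.0
partial = tau_specificity([10.0, 5.0, 0.0])          # (0 + 0.5 + 1) / 2 -> 0.75
missing = tau_specificity([10.0, None, 3.0])         # NULL input -> None
```

The first, second, and fourth cases correspond to `test_tau_calculation_ubiquitous`, `test_tau_calculation_specific`, and `test_tau_null_handling`.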
Follow established patterns: NULL preservation, structlog logging, lazy polars evaluation. NOTE: GTEx lacks inner ear/cochlea tissue -- handle as NULL, do not fabricate. CellxGene is the primary source for inner ear data.
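The enrichment ratio and the 0.4/0.3/0.3 weighting described for `compute_expression_score` can be sketched per gene as below. The helper names are hypothetical, the inputs to `composite_score` are assumed to be already scaled to 0-1, and returning None when any single component is missing is an assumption: the plan only specifies NULL when all expression data is NULL, so partial-NULL handling still needs a decision.

```python
# Weights as stated in the plan; they sum to 1.0.
WEIGHTS = {"enrichment": 0.4, "tau": 0.3, "rank": 0.3}


def mean_ignoring_nulls(values):
    """Mean of the non-None values, or None if nothing is present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def enrichment_ratio(target_values, all_values):
    """Mean target-tissue expression divided by mean expression over all tissues."""
    target_mean = mean_ignoring_nulls(target_values)
    global_mean = mean_ignoring_nulls(all_values)
    if target_mean is None or not global_mean:  # None or 0 -> ratio undefined
        return None
    return target_mean / global_mean


def composite_score(enrichment_norm, tau, rank_fraction):
    """Weighted sum of three components, each assumed to lie in [0, 1].

    Assumption: any missing component yields None rather than re-weighting.
    """
    parts = {"enrichment": enrichment_norm, "tau": tau, "rank": rank_fraction}
    if any(v is None for v in parts.values()):
        return None
    return sum(WEIGHTS[k] * v for k, v in parts.items())


# Retina 20, other tissues 5 and 5, one target tissue missing: 20 / 10 = 2.0
ratio = enrichment_ratio([20.0, None], [20.0, None, 5.0, 5.0])
```

Because each component is bounded by 0 and 1 and the weights sum to 1.0, the composite stays within 0-1, which is the property `test_expression_score_normalization` is meant to check.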
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.expression import fetch_hpa_expression, fetch_gtex_expression, calculate_tau_specificity, compute_expression_score, process_expression_evidence; print('imports OK')"
Expression fetch module retrieves data from HPA (bulk TSV), GTEx (API/bulk), and CellxGene (census library). Transform module computes Tau specificity index and Usher tissue enrichment score normalized to 0-1. NULL preserved for missing tissues/genes.
Task 2: Create expression DuckDB loader, CLI command, and tests
src/usher_pipeline/evidence/expression/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_expression.py
tests/test_expression_integration.py
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "tissue_expression" table. Record provenance with: genes with retina expression, genes with inner ear expression, mean Tau, expression score distribution stats. Create `query_tissue_enriched(store, min_enrichment=2.0) -> pl.DataFrame` helper for genes enriched in Usher tissues.
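The provenance statistics listed for load.py could be gathered along these lines. This is a sketch over plain row dicts rather than a polars DataFrame; `summarize_expression` and the output key names are hypothetical, and counting `cellxgene_hair_cell_expr` as "inner ear expression" is an assumption based on CellxGene being the primary inner ear source.

```python
import statistics


def summarize_expression(rows):
    """Summary stats for provenance; NULLs (None) are excluded, not counted as 0."""

    def present(column):
        return [r[column] for r in rows if r.get(column) is not None]

    taus = present("tau_specificity")
    scores = present("expression_score_normalized")
    # A gene counts as having retina expression if either source reports a value.
    retina = {
        r["gene_id"]
        for r in rows
        if r.get("hpa_retina_tpm") is not None or r.get("gtex_retina_tpm") is not None
    }
    return {
        "total_genes": len(rows),
        "genes_with_retina_expression": len(retina),
        "genes_with_inner_ear_expression": len(present("cellxgene_hair_cell_expr")),
        "mean_tau": statistics.fmean(taus) if taus else None,
        "score_min": min(scores) if scores else None,
        "score_median": statistics.median(scores) if scores else None,
        "score_max": max(scores) if scores else None,
    }
```

Excluding NULLs from the means and counts keeps the summary consistent with the NULL-preservation rule: a gene with no data lowers coverage, not the average.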
**evidence_cmd.py**: Add `expression` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('tissue_expression')), --force flag, --skip-cellxgene flag (for running without optional CellxGene dependency), load gene universe for gene_ids, call process_expression_evidence, load to DuckDB, save provenance sidecar to data/expression/tissue.provenance.json. Display summary: tissue coverage counts, mean Tau, top enriched genes preview.
**tests/test_expression.py**: Unit tests with synthetic data, NO external API calls. Mock httpx responses for HPA/GTEx, mock cellxgene_census. Test cases:
- test_tau_calculation_ubiquitous: Equal expression across tissues -> Tau near 0
- test_tau_calculation_specific: Expression in one tissue only -> Tau near 1
- test_tau_null_handling: NULL tissue values -> NULL Tau
- test_enrichment_score_high: High retina expression relative to global -> high enrichment
- test_enrichment_score_low: No target tissue expression -> low enrichment
- test_expression_score_normalization: Composite score in [0, 1]
- test_null_preservation_all_sources: Gene with no data from any source -> NULL score
- test_hpa_parsing: Correct extraction of tissue-level TPM from HPA TSV format
- test_gtex_missing_tissue: NULL for tissues not in GTEx (inner ear)
**tests/test_expression_integration.py**: Integration tests with mocked downloads. Test full pipeline, checkpoint-restart, provenance. Synthetic HPA TSV fixture, mocked GTEx responses.
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_expression.py tests/test_expression_integration.py -v
All expression unit and integration tests pass. CLI `evidence expression` command registered with --skip-cellxgene option. DuckDB stores tissue_expression table with Tau and enrichment scores. Checkpoint-restart works.
- `python -m pytest tests/test_expression.py tests/test_expression_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.expression import *"` -- all exports importable
- `usher-pipeline evidence expression --help` -- CLI help displays
- DuckDB tissue_expression table includes the columns: gene_id, gene_symbol, hpa_retina_tpm, gtex_retina_tpm, cellxgene_photoreceptor_expr, tau_specificity, usher_tissue_enrichment, expression_score_normalized
- EXPR-01: HPA and GTEx tissue expression retrieved for retina, inner ear proxies, and cilia-rich tissues
- EXPR-02: CellxGene scRNA-seq data retrieved for photoreceptor and hair cell types (optional dependency graceful fallback)
- EXPR-03: Tau specificity index computed across data sources with NULL handling
- EXPR-04: Expression score reflects enrichment in Usher-relevant tissues with normalized 0-1 composite
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure