---
phase: 03-core-evidence-layers
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/evidence/expression/__init__.py
  - src/usher_pipeline/evidence/expression/models.py
  - src/usher_pipeline/evidence/expression/fetch.py
  - src/usher_pipeline/evidence/expression/transform.py
  - src/usher_pipeline/evidence/expression/load.py
  - tests/test_expression.py
  - tests/test_expression_integration.py
  - src/usher_pipeline/cli/evidence_cmd.py
  - pyproject.toml
autonomous: true

must_haves:
  truths:
    - "Pipeline retrieves tissue-level expression data from HPA and GTEx for retina, inner ear, and cilia-rich tissues"
    - "Pipeline retrieves scRNA-seq data from CellxGene for photoreceptor and hair cell subpopulations"
    - "Expression data is converted to tissue specificity metrics (Tau index) across data sources"
    - "Expression score reflects enrichment in Usher-relevant tissues relative to global expression"
  artifacts:
    - path: "src/usher_pipeline/evidence/expression/fetch.py"
      provides: "HPA, GTEx, and CellxGene expression data retrieval"
      exports: ["fetch_hpa_expression", "fetch_gtex_expression", "fetch_cellxgene_expression"]
    - path: "src/usher_pipeline/evidence/expression/transform.py"
      provides: "Tau specificity index calculation and expression scoring"
      exports: ["calculate_tau_specificity", "compute_expression_score", "process_expression_evidence"]
    - path: "src/usher_pipeline/evidence/expression/load.py"
      provides: "DuckDB persistence for expression evidence"
      exports: ["load_to_duckdb"]
    - path: "tests/test_expression.py"
      provides: "Unit tests for expression scoring and Tau calculation"
  key_links:
    - from: "src/usher_pipeline/evidence/expression/fetch.py"
      to: "HPA API and GTEx API"
      via: "httpx with tenacity retry"
      pattern: "proteinatlas\\.org|gtexportal\\.org"
    - from: "src/usher_pipeline/evidence/expression/transform.py"
      to: "src/usher_pipeline/evidence/expression/fetch.py"
      via: "processes HPA/GTEx/CellxGene data into specificity scores"
      pattern: "calculate_tau_specificity"
    - from: "src/usher_pipeline/evidence/expression/load.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "store.save_dataframe"
      pattern: "save_dataframe.*tissue_expression"
---

Implement the Tissue Expression evidence layer (EXPR-01/02/03/04): retrieve expression data from HPA, GTEx, and CellxGene for Usher-relevant tissues, compute tissue specificity (Tau index), score enrichment in target tissues, and persist to DuckDB.

Purpose: Expression in retina, inner ear, and cilia-rich tissues is strong evidence for potential cilia/Usher involvement. Genes specifically expressed in these tissues (high Tau) are higher-priority candidates.

Output: tissue_expression DuckDB table with per-gene expression values across target tissues, Tau specificity index, and normalized expression score.

@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py

Task 1: Create expression evidence data model, fetch, and transform modules

- src/usher_pipeline/evidence/expression/__init__.py
- src/usher_pipeline/evidence/expression/models.py
- src/usher_pipeline/evidence/expression/fetch.py
- src/usher_pipeline/evidence/expression/transform.py
- pyproject.toml

Create the expression evidence layer following the established gnomAD fetch->transform pattern.

**models.py**: Define ExpressionRecord pydantic model with fields:

- gene_id (str)
- gene_symbol (str)
- hpa_retina_tpm (float|None)
- hpa_cerebellum_tpm (float|None -- proxy for cilia-rich tissue)
- gtex_retina_tpm (float|None -- "Eye - Retina" from GTEx)
- gtex_brain_cerebellum_tpm (float|None)
- cellxgene_photoreceptor_expr (float|None)
- cellxgene_hair_cell_expr (float|None)
- tau_specificity (float|None -- Tau index 0=ubiquitous, 1=tissue-specific)
- usher_tissue_enrichment (float|None -- relative enrichment in target tissues)
- expression_score_normalized (float|None -- composite 0-1)

Define EXPRESSION_TABLE_NAME = "tissue_expression". Define TARGET_TISSUES dict mapping tissue keys to API-specific identifiers for HPA, GTEx.

**fetch.py**: Three fetch functions:

1. `fetch_hpa_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download HPA normal tissue data TSV from https://www.proteinatlas.org/download/normal_tissue.tsv.zip (bulk download is more efficient than per-gene API for 20K genes). Filter to target tissues: "retina", "cerebellum", "testis" (cilia-rich), "fallopian tube" (ciliated epithelium). Extract TPM/expression levels. Use httpx streaming download with tenacity retry (same pattern as gnomAD). Parse with polars scan_csv. Return DataFrame with gene_id, tissue columns. NULL for genes not in HPA.
2. `fetch_gtex_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download GTEx median gene expression from GTEx Portal bulk data (V10) or query API in batches. Target tissues: "Eye - Retina" (not available in all GTEx versions -- handle NULL), "Brain - Cerebellum", "Testis", "Fallopian Tube". Use httpx with tenacity retry and ratelimit (conservative 5 req/sec for GTEx API). If bulk download available, prefer that. Return DataFrame with gene_id, gtex tissue columns. NULL for unavailable tissue/gene combinations.
3. `fetch_cellxgene_expression(gene_ids: list[str]) -> pl.DataFrame` -- Use cellxgene_census library to query scRNA-seq data. Query for cell types: "photoreceptor cell", "retinal rod cell", "retinal cone cell", "hair cell" in tissues "retina", "inner ear", "cochlea". Compute mean expression per gene per cell type. If cellxgene_census not available (optional dependency), log warning and return DataFrame with all NULLs. Process in gene batches (100 at a time) to control memory. Return DataFrame with gene_id, cellxgene columns.

Add `cellxgene_census` to pyproject.toml optional dependencies under [project.optional-dependencies] as `expression = ["cellxgene-census>=1.19"]` (heavy dependency, keep optional).

**transform.py**: Three functions:

1. `calculate_tau_specificity(df: pl.DataFrame, tissue_columns: list[str]) -> pl.DataFrame` -- Implement Tau index: Tau = sum(1 - xi/xmax) / (n-1). If any tissue value is NULL, tau is NULL (insufficient data for reliable specificity). Add tau_specificity column.
2. `compute_expression_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute usher_tissue_enrichment: ratio of mean expression in Usher-relevant tissues (retina, inner ear proxies) to mean expression across all tissues. Higher ratio = more enriched. Normalize to 0-1. Compute expression_score_normalized as weighted combination: 0.4 * usher_tissue_enrichment_normalized + 0.3 * tau_specificity + 0.3 * max_target_tissue_rank (rank of max expression in target tissues / total genes). NULL if all expression data is NULL.
3. `process_expression_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch HPA -> fetch GTEx -> fetch CellxGene -> merge on gene_id -> compute Tau -> compute score -> collect.

Follow established patterns: NULL preservation, structlog logging, lazy polars evaluation.

NOTE: GTEx lacks inner ear/cochlea tissue -- handle as NULL, do not fabricate. CellxGene is the primary source for inner ear data.

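
The Tau formula given for `calculate_tau_specificity` can be illustrated on a single gene's tissue values. This is a minimal pure-Python sketch, not the polars implementation the plan calls for. The helper name `tau_specificity` is hypothetical, and returning NULL when the maximum expression is zero is an assumption the plan does not state.

```python
from typing import Optional, Sequence


def tau_specificity(values: Sequence[Optional[float]]) -> Optional[float]:
    """Tau index for one gene: sum(1 - x_i / x_max) / (n - 1).

    Hypothetical helper for illustration; the planned function operates
    on a polars DataFrame instead of a plain sequence.
    """
    # Any missing tissue value makes Tau undefined (NULL), as the plan requires.
    # Fewer than two tissues would make the (n - 1) denominator zero.
    if len(values) < 2 or any(v is None for v in values):
        return None
    x_max = max(values)
    # Assumption (not in the plan): a gene with no expression in any tissue
    # has no meaningful specificity, so return NULL instead of dividing by zero.
    if x_max <= 0:
        return None
    return sum(1.0 - v / x_max for v in values) / (len(values) - 1)


ubiquitous = tau_specificity([5.0, 5.0, 5.0, 5.0])   # equal expression everywhere
specific = tau_specificity([0.0, 0.0, 0.0, 12.0])    # expressed in one tissue only
partial = tau_specificity([10.0, 5.0, 0.0])          # intermediate specificity
missing = tau_specificity([3.0, None, 1.0])          # one tissue value is NULL
```

Under these assumptions, equal expression gives Tau 0, single-tissue expression gives Tau 1, and any NULL input propagates to a NULL Tau.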
`cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.expression import fetch_hpa_expression, fetch_gtex_expression, calculate_tau_specificity, compute_expression_score, process_expression_evidence; print('imports OK')"`

Expression fetch module retrieves data from HPA (bulk TSV), GTEx (API/bulk), and CellxGene (census library). Transform module computes Tau specificity index and Usher tissue enrichment score normalized to 0-1. NULL preserved for missing tissues/genes.

Task 2: Create expression DuckDB loader, CLI command, and tests

- src/usher_pipeline/evidence/expression/load.py
- src/usher_pipeline/cli/evidence_cmd.py
- tests/test_expression.py
- tests/test_expression_integration.py

**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "tissue_expression" table. Record provenance with: genes with retina expression, genes with inner ear expression, mean Tau, expression score distribution stats. Create `query_tissue_enriched(store, min_enrichment=2.0) -> pl.DataFrame` helper for genes enriched in Usher tissues.

**evidence_cmd.py**: Add `expression` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('tissue_expression')), --force flag, --skip-cellxgene flag (for running without optional CellxGene dependency), load gene universe for gene_ids, call process_expression_evidence, load to DuckDB, save provenance sidecar to data/expression/tissue.provenance.json. Display summary: tissue coverage counts, mean Tau, top enriched genes preview.

**tests/test_expression.py**: Unit tests with synthetic data, NO external API calls. Mock httpx responses for HPA/GTEx, mock cellxgene_census. Test cases:

- test_tau_calculation_ubiquitous: Equal expression across tissues -> Tau near 0
- test_tau_calculation_specific: Expression in one tissue only -> Tau near 1
- test_tau_null_handling: NULL tissue values -> NULL Tau
- test_enrichment_score_high: High retina expression relative to global -> high enrichment
- test_enrichment_score_low: No target tissue expression -> low enrichment
- test_expression_score_normalization: Composite score in [0, 1]
- test_null_preservation_all_sources: Gene with no data from any source -> NULL score
- test_hpa_parsing: Correct extraction of tissue-level TPM from HPA TSV format
- test_gtex_missing_tissue: NULL for tissues not in GTEx (inner ear)

**tests/test_expression_integration.py**: Integration tests with mocked downloads. Test full pipeline, checkpoint-restart, provenance. Synthetic HPA TSV fixture, mocked GTEx responses.

`cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_expression.py tests/test_expression_integration.py -v`

All expression unit and integration tests pass. CLI `evidence expression` command registered with --skip-cellxgene option. DuckDB stores tissue_expression table with Tau and enrichment scores. Checkpoint-restart works.

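
The normalization and NULL-preservation test cases listed for `tests/test_expression.py` could be exercised against a scoring helper along these lines. This is a sketch only: the function name `composite_expression_score` is hypothetical, the 0.4/0.3/0.3 weights come from the plan, and renormalizing the weights over whichever components are present is one possible reading of "NULL if all expression data is NULL", not a behavior the plan specifies.

```python
from typing import Optional

# Weights from the plan: enrichment 0.4, Tau 0.3, target-tissue rank 0.3.
WEIGHTS = {"enrichment": 0.4, "tau": 0.3, "rank": 0.3}


def composite_expression_score(
    enrichment_norm: Optional[float],
    tau: Optional[float],
    target_rank: Optional[float],
) -> Optional[float]:
    """Weighted 0-1 composite of already-normalized components (hypothetical helper)."""
    components = {"enrichment": enrichment_norm, "tau": tau, "rank": target_rank}
    present = {name: value for name, value in components.items() if value is not None}
    # No data from any source: preserve NULL rather than inventing a score.
    if not present:
        return None
    for name, value in present.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1], got {value}")
    # Assumption: renormalize weights over the components that are available,
    # so a partially covered gene is still scored on the same 0-1 scale.
    total_weight = sum(WEIGHTS[name] for name in present)
    score = sum(WEIGHTS[name] * value for name, value in present.items()) / total_weight
    # Clamp to guard against floating-point drift just outside the interval.
    return min(1.0, max(0.0, score))


no_data = composite_expression_score(None, None, None)
full = composite_expression_score(0.8, 0.6, 0.9)
partial = composite_expression_score(0.5, None, 0.5)
```

A stricter alternative would return NULL whenever any component is missing; which policy fits depends on how often Tau is NULL in practice, given that GTEx has no inner ear tissue.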
- `python -m pytest tests/test_expression.py tests/test_expression_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.expression import *"` -- all exports importable
- `usher-pipeline evidence expression --help` -- CLI help displays
- DuckDB tissue_expression table has columns: gene_id, gene_symbol, hpa_retina_tpm, gtex_retina_tpm, cellxgene_photoreceptor_expr, tau_specificity, usher_tissue_enrichment, expression_score_normalized

- EXPR-01: HPA and GTEx tissue expression retrieved for retina, inner ear proxies, and cilia-rich tissues
- EXPR-02: CellxGene scRNA-seq data retrieved for photoreceptor and hair cell types (optional dependency graceful fallback)
- EXPR-03: Tau specificity index computed across data sources with NULL handling
- EXPR-04: Expression score reflects enrichment in Usher-relevant tissues with normalized 0-1 composite
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure

After completion, create `.planning/phases/03-core-evidence-layers/03-02-SUMMARY.md`