| phase | plan | type | wave | depends_on | files_modified | autonomous | must_haves |
|---|---|---|---|---|---|---|---|
| 03-core-evidence-layers | 02 | execute | 1 | | | true | |
Purpose: Expression in retina, inner ear, and cilia-rich tissues is strong evidence for potential cilia/Usher involvement. Genes specifically expressed in these tissues (high Tau) are higher-priority candidates.

Output: tissue_expression DuckDB table with per-gene expression values across target tissues, Tau specificity index, and normalized expression score.
<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py

Task 1: Create expression evidence data model, fetch, and transform modules
src/usher_pipeline/evidence/expression/__init__.py
src/usher_pipeline/evidence/expression/models.py
src/usher_pipeline/evidence/expression/fetch.py
src/usher_pipeline/evidence/expression/transform.py
pyproject.toml

Create the expression evidence layer following the established gnomAD fetch->transform pattern.

**models.py**: Define ExpressionRecord pydantic model with fields:
- gene_id (str)
- gene_symbol (str)
- hpa_retina_tpm (float|None)
- hpa_cerebellum_tpm (float|None -- proxy for cilia-rich tissue)
- gtex_retina_tpm (float|None -- "Eye - Retina" from GTEx)
- gtex_brain_cerebellum_tpm (float|None)
- cellxgene_photoreceptor_expr (float|None)
- cellxgene_hair_cell_expr (float|None)
- tau_specificity (float|None -- Tau index 0=ubiquitous, 1=tissue-specific)
- usher_tissue_enrichment (float|None -- relative enrichment in target tissues)
- expression_score_normalized (float|None -- composite 0-1)

Define EXPRESSION_TABLE_NAME = "tissue_expression". Define TARGET_TISSUES dict mapping tissue keys to API-specific identifiers for HPA, GTEx.
**fetch.py**: Three fetch functions:
1. `fetch_hpa_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download HPA normal tissue data TSV from https://www.proteinatlas.org/download/normal_tissue.tsv.zip (bulk download is more efficient than per-gene API for 20K genes). Filter to target tissues: "retina", "cerebellum", "testis" (cilia-rich), "fallopian tube" (ciliated epithelium). Extract TPM/expression levels. Use httpx streaming download with tenacity retry (same pattern as gnomAD). Parse with polars scan_csv. Return DataFrame with gene_id, tissue columns. NULL for genes not in HPA.
2. `fetch_gtex_expression(gene_ids: list[str]) -> pl.DataFrame` -- Download GTEx median gene expression from GTEx Portal bulk data (V10) or query API in batches. Target tissues: "Eye - Retina" (not available in all GTEx versions -- handle NULL), "Brain - Cerebellum", "Testis", "Fallopian Tube". Use httpx with tenacity retry and ratelimit (conservative 5 req/sec for GTEx API). If bulk download available, prefer that. Return DataFrame with gene_id, gtex tissue columns. NULL for unavailable tissue/gene combinations.
3. `fetch_cellxgene_expression(gene_ids: list[str]) -> pl.DataFrame` -- Use cellxgene_census library to query scRNA-seq data. Query for cell types: "photoreceptor cell", "retinal rod cell", "retinal cone cell", "hair cell" in tissues "retina", "inner ear", "cochlea". Compute mean expression per gene per cell type. If cellxgene_census not available (optional dependency), log warning and return DataFrame with all NULLs. Process in gene batches (100 at a time) to control memory. Return DataFrame with gene_id, cellxgene columns.
Add `cellxgene_census` to pyproject.toml optional dependencies under [project.optional-dependencies] as `expression = ["cellxgene-census>=1.19"]` (heavy dependency, keep optional).
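The optional-dependency fallback and 100-gene batching described for `fetch_cellxgene_expression` could look roughly like the sketch below. It is a sketch under stated assumptions: it uses plain lists of dicts instead of a polars DataFrame and stdlib `logging` instead of structlog so that it runs without extra packages, the helper names `census_available`, `iter_batches`, and `null_cellxgene_rows` are hypothetical, and the actual Census query is deliberately left out.

```python
import importlib.util
import logging
from typing import Iterator

logger = logging.getLogger(__name__)

# Column names taken from the ExpressionRecord fields in this plan.
CELLXGENE_COLUMNS = ("cellxgene_photoreceptor_expr", "cellxgene_hair_cell_expr")


def census_available() -> bool:
    """True if the optional cellxgene_census package can be imported."""
    return importlib.util.find_spec("cellxgene_census") is not None


def iter_batches(gene_ids: list[str], size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` gene IDs."""
    for start in range(0, len(gene_ids), size):
        yield gene_ids[start:start + size]


def null_cellxgene_rows(gene_ids: list[str]) -> list[dict]:
    """Fallback rows: one per gene, every CellxGene column left as None."""
    return [
        {"gene_id": gene_id, **{col: None for col in CELLXGENE_COLUMNS}}
        for gene_id in gene_ids
    ]


if __name__ == "__main__":
    genes = [f"ENSG{i:011d}" for i in range(250)]  # synthetic IDs
    if not census_available():
        logger.warning("cellxgene_census not installed; returning NULL columns")
        rows = null_cellxgene_rows(genes)
    else:
        for batch in iter_batches(genes):
            pass  # the real per-batch Census query would go here
```

Keeping the availability check separate from the query makes the `--skip-cellxgene` path and the missing-package path share the same NULL-filled result.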
**transform.py**: Three functions:
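1. `calculate_tau_specificity(df: pl.DataFrame, tissue_columns: list[str]) -> pl.DataFrame` -- Implement Tau index: Tau = sum(1 - xi/xmax) / (n-1), where xi is the expression in tissue i, xmax is the maximum expression across the n tissues in tissue_columns, and the sum runs over all n tissues. If any tissue value is NULL, Tau is NULL (insufficient data for reliable specificity). Add tau_specificity column.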
2. `compute_expression_score(df: pl.DataFrame) -> pl.DataFrame` -- Compute usher_tissue_enrichment: ratio of mean expression in Usher-relevant tissues (retina, inner ear proxies) to mean expression across all tissues. Higher ratio = more enriched. Normalize to 0-1. Compute expression_score_normalized as weighted combination: 0.4 * usher_tissue_enrichment_normalized + 0.3 * tau_specificity + 0.3 * max_target_tissue_rank (rank of max expression in target tissues / total genes). NULL if all expression data is NULL.
3. `process_expression_evidence(gene_ids: list[str]) -> pl.DataFrame` -- End-to-end: fetch HPA -> fetch GTEx -> fetch CellxGene -> merge on gene_id -> compute Tau -> compute score -> collect.
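The Tau arithmetic in `calculate_tau_specificity` can be checked on a single gene with a few lines of plain Python. This is a minimal sketch of the formula only, not the polars implementation: the function name `tau_for_gene` is hypothetical, and returning None when fewer than two tissues are given or when the maximum is zero is an assumption, since the plan does not cover those cases.

```python
from typing import Optional, Sequence


def tau_for_gene(values: Sequence[Optional[float]]) -> Optional[float]:
    """Tau = sum(1 - x_i / x_max) / (n - 1) over n tissues.

    Returns None if any tissue value is missing (the plan's NULL rule).
    ASSUMPTION: also returns None for n < 2 or x_max <= 0, where the
    formula is undefined.
    """
    if any(v is None for v in values):
        return None
    n = len(values)
    if n < 2:
        return None
    x_max = max(values)
    if x_max <= 0:
        return None
    return sum(1 - v / x_max for v in values) / (n - 1)


if __name__ == "__main__":
    print(tau_for_gene([5.0, 5.0, 5.0, 5.0]))  # ubiquitous -> 0.0
    print(tau_for_gene([0.0, 0.0, 8.0, 0.0]))  # single tissue -> 1.0
    print(tau_for_gene([1.0, None, 3.0]))      # missing tissue -> None
```

The three calls mirror the ubiquitous, tissue-specific, and NULL cases that the unit tests in Task 2 are meant to cover.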
Follow established patterns: NULL preservation, structlog logging, lazy polars evaluation. NOTE: GTEx lacks inner ear/cochlea tissue -- handle as NULL, do not fabricate. CellxGene is the primary source for inner ear data.
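The enrichment ratio and weighted composite from `compute_expression_score` can be sketched per gene as follows. Several choices here are assumptions rather than part of the plan: the function names are hypothetical, means ignore missing tissues, and when only some of the three components are missing the remaining weights are rescaled to sum to 1 (the plan only states that the score is NULL when all expression data is NULL). The 0-1 normalization of the enrichment ratio is not shown because the plan does not say how it is done.

```python
from typing import Optional, Sequence

# Weights from the plan: enrichment, Tau, target-tissue rank.
WEIGHTS = {"enrichment": 0.4, "tau": 0.3, "rank": 0.3}


def _mean(values: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the non-missing values, or None if nothing is present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def usher_tissue_enrichment(
    target: Sequence[Optional[float]],
    all_tissues: Sequence[Optional[float]],
) -> Optional[float]:
    """Mean expression in target tissues divided by mean across all tissues."""
    target_mean = _mean(target)
    global_mean = _mean(all_tissues)
    if target_mean is None or global_mean is None or global_mean == 0:
        return None  # ASSUMPTION: an undefined ratio is treated as NULL
    return target_mean / global_mean


def expression_score(
    enrichment_norm: Optional[float],
    tau: Optional[float],
    rank_fraction: Optional[float],
) -> Optional[float]:
    """Weighted composite in [0, 1]; None when every component is None.

    ASSUMPTION: weights of missing components are dropped and the rest
    rescaled, so partially covered genes still receive a score.
    """
    parts = {"enrichment": enrichment_norm, "tau": tau, "rank": rank_fraction}
    present = {k: v for k, v in parts.items() if v is not None}
    if not present:
        return None
    total_weight = sum(WEIGHTS[k] for k in present)
    return sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight
```

Rescaling matters here because GTEx has no inner ear tissue, so Tau will often be NULL under the any-NULL rule; without some rule for partial data, most genes would end up with no score at all.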
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.expression import fetch_hpa_expression, fetch_gtex_expression, calculate_tau_specificity, compute_expression_score, process_expression_evidence; print('imports OK')"
Expression fetch module retrieves data from HPA (bulk TSV), GTEx (API/bulk), and CellxGene (census library). Transform module computes Tau specificity index and Usher tissue enrichment score normalized to 0-1. NULL preserved for missing tissues/genes.
Task 2: Create expression DuckDB loader, CLI command, and tests
src/usher_pipeline/evidence/expression/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_expression.py
tests/test_expression_integration.py
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "tissue_expression" table. Record provenance with: genes with retina expression, genes with inner ear expression, mean Tau, expression score distribution stats. Create `query_tissue_enriched(store, min_enrichment=2.0) -> pl.DataFrame` helper for genes enriched in Usher tissues.
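The provenance statistics and the enrichment filter can be illustrated without DuckDB on a list of row dicts. This is a sketch only: `summarize_expression` and `filter_tissue_enriched` are hypothetical names, the summary keys are illustrative, and counting a gene as "with expression" whenever the value is non-NULL is an assumption. The real `query_tissue_enriched` would run against the store instead.

```python
from statistics import mean


def summarize_expression(rows: list[dict]) -> dict:
    """Summary stats of the kind the plan asks to record in provenance."""

    def non_null(col: str) -> list[float]:
        return [r[col] for r in rows if r.get(col) is not None]

    tau = non_null("tau_specificity")
    scores = non_null("expression_score_normalized")
    return {
        "genes_total": len(rows),
        # ASSUMPTION: any non-NULL retina value counts as retina expression.
        "genes_with_retina_expression": sum(
            1
            for r in rows
            if r.get("hpa_retina_tpm") is not None
            or r.get("gtex_retina_tpm") is not None
        ),
        "genes_with_inner_ear_expression": len(non_null("cellxgene_hair_cell_expr")),
        "mean_tau": mean(tau) if tau else None,
        "score_min": min(scores) if scores else None,
        "score_max": max(scores) if scores else None,
        "score_mean": mean(scores) if scores else None,
    }


def filter_tissue_enriched(rows: list[dict], min_enrichment: float = 2.0) -> list[dict]:
    """Rows whose usher_tissue_enrichment is present and >= min_enrichment."""
    return [
        r
        for r in rows
        if r.get("usher_tissue_enrichment") is not None
        and r["usher_tissue_enrichment"] >= min_enrichment
    ]
```

Genes with a NULL enrichment are excluded from the filter rather than treated as zero, in line with the NULL-preservation rule.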
**evidence_cmd.py**: Add `expression` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('tissue_expression')), --force flag, --skip-cellxgene flag (for running without optional CellxGene dependency), load gene universe for gene_ids, call process_expression_evidence, load to DuckDB, save provenance sidecar to data/expression/tissue.provenance.json. Display summary: tissue coverage counts, mean Tau, top enriched genes preview.
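The flag layout and checkpoint decision for the `expression` subcommand can be sketched with stdlib `argparse`. This is a stand-in: the CLI framework used by the existing `evidence_cmd.py` is not stated here, so the parser wiring, `build_parser`, and `should_run` are assumptions for illustration, and only the command name and the two flags come from the plan.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for `usher-pipeline evidence expression [--force] [--skip-cellxgene]`."""
    parser = argparse.ArgumentParser(prog="usher-pipeline")
    groups = parser.add_subparsers(dest="group", required=True)
    evidence = groups.add_parser("evidence", help="Evidence layer commands")
    layers = evidence.add_subparsers(dest="layer", required=True)
    expression = layers.add_parser(
        "expression", help="Fetch and load tissue expression evidence"
    )
    expression.add_argument(
        "--force", action="store_true",
        help="Re-run even if a tissue_expression checkpoint exists",
    )
    expression.add_argument(
        "--skip-cellxgene", action="store_true",
        help="Skip the optional CellxGene source",
    )
    return parser


def should_run(has_checkpoint: bool, force: bool) -> bool:
    """Run when forced, or when no checkpoint has been written yet."""
    return force or not has_checkpoint


if __name__ == "__main__":
    args = build_parser().parse_args(["evidence", "expression", "--skip-cellxgene"])
    print(args.force, args.skip_cellxgene)
```

In the real command, `has_checkpoint` would come from `has_checkpoint('tissue_expression')` on the store.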
**tests/test_expression.py**: Unit tests with synthetic data, NO external API calls. Mock httpx responses for HPA/GTEx, mock cellxgene_census. Test cases:
- test_tau_calculation_ubiquitous: Equal expression across tissues -> Tau near 0
- test_tau_calculation_specific: Expression in one tissue only -> Tau near 1
- test_tau_null_handling: NULL tissue values -> NULL Tau
- test_enrichment_score_high: High retina expression relative to global -> high enrichment
- test_enrichment_score_low: No target tissue expression -> low enrichment
- test_expression_score_normalization: Composite score in [0, 1]
- test_null_preservation_all_sources: Gene with no data from any source -> NULL score
- test_hpa_parsing: Correct extraction of tissue-level TPM from HPA TSV format
- test_gtex_missing_tissue: NULL for tissues not in GTEx (inner ear)
**tests/test_expression_integration.py**: Integration tests with mocked downloads. Test full pipeline, checkpoint-restart, provenance. Synthetic HPA TSV fixture, mocked GTEx responses.
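One way to keep these tests free of network calls is to pass the download step in as a callable and replace it with a mock that returns a synthetic fixture. The sketch below shows that pattern under stated assumptions: the three-column layout (`gene_id`, `tissue`, `tpm`) is a made-up fixture and not the real HPA file header, and `parse_tissue_tsv` and `fetch_expression` are hypothetical stand-ins for the pipeline's fetch code.

```python
import csv
import io
from unittest import mock

# Synthetic long-format fixture; NOT the actual HPA column layout.
SYNTHETIC_TSV = (
    "gene_id\ttissue\ttpm\n"
    "ENSG_A\tretina\t12.5\n"
    "ENSG_A\tliver\t0.0\n"
    "ENSG_B\tcerebellum\t3.0\n"
)


def parse_tissue_tsv(text: str, target_tissues: set[str]) -> dict[str, dict[str, float]]:
    """Pivot long rows to {gene_id: {tissue: tpm}}, keeping target tissues only."""
    wide: dict[str, dict[str, float]] = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if row["tissue"] in target_tissues:
            wide.setdefault(row["gene_id"], {})[row["tissue"]] = float(row["tpm"])
    return wide


def fetch_expression(download, target_tissues: set[str]) -> dict[str, dict[str, float]]:
    """`download` is any zero-argument callable returning TSV text."""
    return parse_tissue_tsv(download(), target_tissues)


def test_parsing_without_network() -> None:
    fake_download = mock.Mock(return_value=SYNTHETIC_TSV)
    wide = fetch_expression(fake_download, {"retina", "cerebellum"})
    fake_download.assert_called_once_with()
    assert wide == {"ENSG_A": {"retina": 12.5}, "ENSG_B": {"cerebellum": 3.0}}
    # A tissue with no row stays missing instead of being filled with 0.
    assert wide["ENSG_B"].get("retina") is None
```

The same fixture can back both the unit test for parsing and the integration test for the full pipeline, since neither needs the real download.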
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_expression.py tests/test_expression_integration.py -v
All expression unit and integration tests pass. CLI `evidence expression` command registered with --skip-cellxgene option. DuckDB stores tissue_expression table with Tau and enrichment scores. Checkpoint-restart works.
- `python -m pytest tests/test_expression.py tests/test_expression_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.expression import *"` -- all exports importable
- `usher-pipeline evidence expression --help` -- CLI help displays
- DuckDB tissue_expression table has columns: gene_id, gene_symbol, hpa_retina_tpm, gtex_retina_tpm, cellxgene_photoreceptor_expr, tau_specificity, usher_tissue_enrichment, expression_score_normalized
<success_criteria>
- EXPR-01: HPA and GTEx tissue expression retrieved for retina, inner ear proxies, and cilia-rich tissues
- EXPR-02: CellxGene scRNA-seq data retrieved for photoreceptor and hair cell types (optional dependency graceful fallback)
- EXPR-03: Tau specificity index computed across data sources with NULL handling
- EXPR-04: Expression score reflects enrichment in Usher-relevant tissues with normalized 0-1 composite
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure </success_criteria>