docs(02): create phase plan

.planning/phases/02-prototype-evidence-layer/02-01-PLAN.md (new file, 226 lines)
@@ -0,0 +1,226 @@
---
phase: 02-prototype-evidence-layer
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/evidence/__init__.py
  - src/usher_pipeline/evidence/gnomad/__init__.py
  - src/usher_pipeline/evidence/gnomad/models.py
  - src/usher_pipeline/evidence/gnomad/fetch.py
  - src/usher_pipeline/evidence/gnomad/transform.py
  - tests/test_gnomad.py
  - pyproject.toml
autonomous: true

must_haves:
  truths:
    - "gnomAD constraint TSV downloads with retry and streams to disk without loading entirely into memory"
    - "Coverage quality filter flags genes with mean depth <30x or <90% CDS covered as 'incomplete_coverage' rather than dropping them"
    - "Missing constraint data (pLI, LOEUF) is preserved as NULL, not zero"
    - "Quality flag column distinguishes 'measured' from 'incomplete_coverage' and 'no_data' genes"
    - "LOEUF scores are normalized to 0-1 range with inversion (high score = more constrained)"
  artifacts:
    - path: "src/usher_pipeline/evidence/gnomad/models.py"
      provides: "Pydantic model for gnomAD constraint record"
      contains: "class ConstraintRecord"
    - path: "src/usher_pipeline/evidence/gnomad/fetch.py"
      provides: "gnomAD constraint file download with retry"
      contains: "def download_constraint_metrics"
    - path: "src/usher_pipeline/evidence/gnomad/transform.py"
      provides: "Coverage filter, NULL handling, normalization"
      contains: "def filter_by_coverage"
    - path: "tests/test_gnomad.py"
      provides: "Unit tests for gnomAD fetch and transform"
      contains: "test_"
  key_links:
    - from: "src/usher_pipeline/evidence/gnomad/fetch.py"
      to: "httpx"
      via: "streaming download with tenacity retry"
      pattern: "httpx\\.stream|@retry"
    - from: "src/usher_pipeline/evidence/gnomad/transform.py"
      to: "polars"
      via: "lazy scan with null handling and coverage filter"
      pattern: "pl\\.scan_csv|pl\\.col.*is_null|quality_flag"
    - from: "src/usher_pipeline/evidence/gnomad/transform.py"
      to: "src/usher_pipeline/evidence/gnomad/models.py"
      via: "uses ConstraintRecord or column names for validation"
      pattern: "ConstraintRecord|loeuf_normalized"
---

<objective>
Implement gnomAD constraint data retrieval, quality filtering, and normalization -- the core data pipeline for the prototype evidence layer.

Purpose: This plan establishes the evidence layer pattern (fetch -> filter -> normalize) that all future evidence layers in Phase 3 will follow. Getting this right with gnomAD sets the template.

Output: Downloadable gnomAD constraint data, coverage-filtered and normalized, ready for DuckDB storage.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-prototype-evidence-layer/02-RESEARCH.md
@src/usher_pipeline/config/schema.py
@src/usher_pipeline/api_clients/base.py
@src/usher_pipeline/persistence/duckdb_store.py
@pyproject.toml
@config/default.yaml
</context>

<tasks>

<task type="auto">
<name>Task 1: Create gnomAD data model and download module</name>
<files>
src/usher_pipeline/evidence/__init__.py
src/usher_pipeline/evidence/gnomad/__init__.py
src/usher_pipeline/evidence/gnomad/models.py
src/usher_pipeline/evidence/gnomad/fetch.py
pyproject.toml
</files>
<action>
Create the evidence layer package structure and gnomAD-specific modules.

1. Create `src/usher_pipeline/evidence/__init__.py` -- empty package init for evidence layers.

2. Create `src/usher_pipeline/evidence/gnomad/__init__.py` -- exports from models, fetch, transform.

3. Create `src/usher_pipeline/evidence/gnomad/models.py` (a sketch follows this step):
   - Define `ConstraintRecord` as a Pydantic BaseModel with fields:
     - `gene_id: str` (Ensembl gene ID, e.g. ENSG00000...)
     - `gene_symbol: str`
     - `transcript: str` (canonical transcript ID)
     - `pli: float | None` (NULL if no estimate)
     - `loeuf: float | None` (NULL if no estimate -- CRITICAL: not 0.0)
     - `loeuf_upper: float | None` (upper bound of the LOEUF CI)
     - `mean_depth: float | None` (mean exome depth)
     - `cds_covered_pct: float | None` (fraction of CDS bases with adequate coverage)
     - `quality_flag: str` -- "measured", "incomplete_coverage", or "no_data"
     - `loeuf_normalized: float | None` -- normalized score (filled by transform)
   - Define a `GNOMAD_CONSTRAINT_URL` constant -- use the gnomAD v4.1 constraint metrics download URL: `https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/constraint/gnomad.v4.1.constraint_metrics.tsv`. If that exact URL is wrong, the code should accept a configurable URL parameter with this as the default. The file may be .tsv.bgz (bgzip compressed) -- handle both .tsv and .tsv.bgz extensions.
   - Define column name constants mapping gnomAD TSV columns to our field names. The gnomAD v4.0+ constraint file uses columns such as `gene`, `transcript`, `mane_select`, `lof.oe_ci.upper` (LOEUF), `lof.pLI`, and `mean_proportion_covered` (CDS coverage). Inspect the actual header at download time and log the column names if they differ from the expected set. Research note: column names vary between v2.1.1 and v4.x -- the code MUST handle this by looking for known column name variants.
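
   A minimal sketch of `models.py` consistent with the field list above, assuming Pydantic v2 and Python 3.10+ union syntax; the `COLUMN_VARIANTS` mapping and the `quality_flag` default are illustrative choices, not fixed by this plan:

   ```python
   from pydantic import BaseModel

   GNOMAD_CONSTRAINT_URL = (
       "https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/"
       "constraint/gnomad.v4.1.constraint_metrics.tsv"
   )

   # Known column-name variants per gnomAD release; verified against the real
   # header at parse time.
   COLUMN_VARIANTS: dict[str, list[str]] = {
       "gene_symbol": ["gene"],
       "transcript": ["transcript", "mane_select"],
       "pli": ["lof.pLI", "pLI"],
       "loeuf": ["lof.oe_ci.upper", "oe_lof_upper"],
       "cds_covered_pct": ["mean_proportion_covered"],
   }


   class ConstraintRecord(BaseModel):
       gene_id: str                        # Ensembl gene ID, e.g. ENSG00000...
       gene_symbol: str
       transcript: str                     # canonical transcript ID
       pli: float | None = None            # None when gnomAD has no estimate
       loeuf: float | None = None          # None (never 0.0) when missing
       loeuf_upper: float | None = None
       mean_depth: float | None = None
       cds_covered_pct: float | None = None
       quality_flag: str = "no_data"       # "measured" | "incomplete_coverage" | "no_data"
       loeuf_normalized: float | None = None  # filled in by the transform step
   ```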

4. Create `src/usher_pipeline/evidence/gnomad/fetch.py` (a sketch of the download function follows this step):
   - `download_constraint_metrics(output_path: Path, url: str = GNOMAD_CONSTRAINT_URL, force: bool = False) -> Path`:
     - If `output_path` exists and `force=False`, return early (checkpoint pattern).
     - Use `httpx` with streaming to download to disk (NOT into memory -- the file can be ~50-100 MB).
     - Wrap with `@retry` from tenacity: 5 attempts, exponential backoff (min=4s, max=60s), retrying on httpx.HTTPStatusError, httpx.ConnectError, and httpx.TimeoutException.
     - Log download progress with structlog (use `structlog.get_logger()`).
     - If the file is .bgz compressed, decompress after download using gzip (bgzip is gzip-compatible).
     - Return the path to the final TSV file.
   - `parse_constraint_tsv(tsv_path: Path) -> pl.LazyFrame`:
     - Use `pl.scan_csv(tsv_path, separator="\t", null_values=["NA", "", "."])` for lazy evaluation.
     - Select and rename the relevant columns to match our ConstraintRecord field names.
     - Handle column name variants between gnomAD versions (v2.1.1 uses `oe_lof_upper`, v4.x may use `lof.oe_ci.upper` -- check the actual file and map accordingly).
     - Log detected column names and gnomAD version info.
     - Return the LazyFrame (do NOT call .collect() -- leave it lazy for downstream filtering).
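
   A minimal sketch of the download half, assuming the retry parameters above; the `.part` temporary file and the decompress-by-extension check are illustrative details, and progress logging is reduced to start/finish events:

   ```python
   import gzip
   import shutil
   from pathlib import Path

   import httpx
   import structlog
   from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

   from usher_pipeline.evidence.gnomad.models import GNOMAD_CONSTRAINT_URL

   log = structlog.get_logger()


   @retry(
       stop=stop_after_attempt(5),
       wait=wait_exponential(min=4, max=60),
       retry=retry_if_exception_type(
           (httpx.HTTPStatusError, httpx.ConnectError, httpx.TimeoutException)
       ),
   )
   def download_constraint_metrics(
       output_path: Path, url: str = GNOMAD_CONSTRAINT_URL, force: bool = False
   ) -> Path:
       if output_path.exists() and not force:
           log.info("gnomad_download_skipped", path=str(output_path))
           return output_path

       tmp_path = output_path.with_suffix(output_path.suffix + ".part")
       with httpx.stream("GET", url, follow_redirects=True, timeout=60.0) as response:
           response.raise_for_status()
           with tmp_path.open("wb") as fh:
               for chunk in response.iter_bytes():  # stream to disk, never into memory
                   fh.write(chunk)

       if url.endswith((".bgz", ".gz")):
           # bgzip is gzip-compatible, so plain gzip decompression works
           with gzip.open(tmp_path, "rb") as src, output_path.open("wb") as dst:
               shutil.copyfileobj(src, dst)
           tmp_path.unlink()
       else:
           tmp_path.replace(output_path)

       log.info("gnomad_download_complete", path=str(output_path))
       return output_path
   ```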

5. Update `pyproject.toml` -- add `httpx>=0.28` and `structlog>=25.0` to the dependencies. Keep all existing dependencies unchanged.
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/pip install -e ".[dev]"` -- succeeds with httpx and structlog installed.
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/python -c "from usher_pipeline.evidence.gnomad.models import ConstraintRecord, GNOMAD_CONSTRAINT_URL; print(GNOMAD_CONSTRAINT_URL)"` -- prints the URL.
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/python -c "from usher_pipeline.evidence.gnomad.fetch import download_constraint_metrics, parse_constraint_tsv; print('imports OK')"` -- prints "imports OK".
</verify>
<done>
gnomAD evidence module exists with the ConstraintRecord model, a streaming download function with retry, and a lazy TSV parser. httpx and structlog are installed as dependencies.
</done>
</task>

<task type="auto">
<name>Task 2: Create coverage filter, normalization, and unit tests</name>
<files>
src/usher_pipeline/evidence/gnomad/transform.py
tests/test_gnomad.py
</files>
<action>
Create the transform module and comprehensive tests for the entire gnomAD evidence layer.

1. Create `src/usher_pipeline/evidence/gnomad/transform.py` (a sketch follows this step):
   - `filter_by_coverage(lf: pl.LazyFrame, min_depth: float = 30.0, min_cds_pct: float = 0.9) -> pl.LazyFrame`:
     - Does NOT drop rows -- adds a `quality_flag` column:
       - "measured" if mean_depth >= min_depth AND cds_covered_pct >= min_cds_pct AND loeuf is not null
       - "incomplete_coverage" if the coverage thresholds are not met but the gene exists
       - "no_data" if loeuf AND pli are both null
     - Preserves ALL genes (important for downstream use: "unknown" != "zero").
     - Returns a LazyFrame with the quality_flag column added.
   - `normalize_scores(lf: pl.LazyFrame) -> pl.LazyFrame`:
     - Compute min-max normalization of LOEUF scores over genes with quality_flag == "measured".
     - INVERT the scale: lower LOEUF = more constrained = HIGHER normalized score.
     - Formula: `loeuf_normalized = (loeuf_max - loeuf) / (loeuf_max - loeuf_min)` for measured genes.
     - Genes with quality_flag != "measured" get `loeuf_normalized = NULL` (not 0.0).
     - Returns a LazyFrame with the loeuf_normalized column added.
   - `process_gnomad_constraint(tsv_path: Path, min_depth: float = 30.0, min_cds_pct: float = 0.9) -> pl.DataFrame`:
     - Convenience function composing: parse_constraint_tsv -> filter_by_coverage -> normalize_scores -> .collect()
     - Returns a materialized DataFrame ready for DuckDB storage.
     - Logs summary stats: total genes, measured count, incomplete count, no_data count.
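
   A minimal sketch of the two core transforms as polars lazy expressions, assuming the renamed column names from the model above; handling of the degenerate case where loeuf_min equals loeuf_max is left out:

   ```python
   import polars as pl


   def filter_by_coverage(
       lf: pl.LazyFrame, min_depth: float = 30.0, min_cds_pct: float = 0.9
   ) -> pl.LazyFrame:
       no_data = pl.col("loeuf").is_null() & pl.col("pli").is_null()
       measured = (
           (pl.col("mean_depth") >= min_depth)
           & (pl.col("cds_covered_pct") >= min_cds_pct)
           & pl.col("loeuf").is_not_null()
       )
       # Every gene keeps its row; only the flag differs.
       return lf.with_columns(
           pl.when(no_data).then(pl.lit("no_data"))
           .when(measured).then(pl.lit("measured"))
           .otherwise(pl.lit("incomplete_coverage"))
           .alias("quality_flag")
       )


   def normalize_scores(lf: pl.LazyFrame) -> pl.LazyFrame:
       measured = pl.col("quality_flag") == "measured"
       loeuf_min = pl.col("loeuf").filter(measured).min()
       loeuf_max = pl.col("loeuf").filter(measured).max()
       # Inverted min-max: loeuf_min maps to 1.0, loeuf_max to 0.0;
       # non-measured genes stay NULL rather than being coerced to 0.0.
       return lf.with_columns(
           pl.when(measured)
           .then((loeuf_max - pl.col("loeuf")) / (loeuf_max - loeuf_min))
           .otherwise(None)
           .alias("loeuf_normalized")
       )
   ```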

2. Create `tests/test_gnomad.py` with tests using synthetic data (NO real gnomAD download in tests); a sketch of the fixture idea and two tests follows this step.

   Create a pytest fixture `sample_constraint_tsv(tmp_path)` that writes a small TSV file with ~10-15 rows covering the edge cases:
   - Normal genes with good coverage (measured)
   - Genes with low depth (<30x) (incomplete_coverage)
   - Genes with low CDS coverage (<90%) (incomplete_coverage)
   - Genes with NULL loeuf/pli (no_data)
   - Genes with extreme LOEUF values (0.0, 2.5) for the normalization bounds
   - A gene with LOEUF = 0.0 (most constrained -- should normalize to 1.0)

   Tests (at least 10):
   - `test_parse_constraint_tsv_returns_lazyframe`: parse returns a LazyFrame with the expected columns
   - `test_parse_constraint_tsv_null_handling`: NA/empty values become polars null, not zero
   - `test_filter_by_coverage_measured`: well-covered genes get quality_flag="measured"
   - `test_filter_by_coverage_incomplete`: low depth/CDS genes get quality_flag="incomplete_coverage"
   - `test_filter_by_coverage_no_data`: NULL loeuf+pli genes get quality_flag="no_data"
   - `test_filter_preserves_all_genes`: row count before == row count after (no genes dropped)
   - `test_normalize_scores_range`: all non-null normalized scores are in [0, 1]
   - `test_normalize_scores_inversion`: lower LOEUF -> higher normalized score
   - `test_normalize_scores_null_preserved`: NULL loeuf stays NULL after normalization
   - `test_normalize_scores_incomplete_stays_null`: incomplete_coverage genes get a NULL normalized score
   - `test_process_gnomad_constraint_end_to_end`: full pipeline returns a DataFrame with all expected columns
   - `test_constraint_record_model_validation`: ConstraintRecord validates correct input and rejects bad types
   - `test_download_skips_if_exists`: download_constraint_metrics returns early if the file exists and force=False (mock httpx)

   Use `polars.testing.assert_frame_equal` where appropriate.
   Mock httpx in download tests (do NOT make real HTTP requests).
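
   A sketch of two of the tests above; for brevity it builds a LazyFrame in memory instead of writing a TSV to tmp_path, so the column names assume parse_constraint_tsv has already renamed the raw gnomAD header:

   ```python
   import polars as pl
   import pytest

   from usher_pipeline.evidence.gnomad.transform import filter_by_coverage, normalize_scores


   @pytest.fixture
   def sample_lf() -> pl.LazyFrame:
       # Minimal synthetic data: two measured genes, one low-depth, one with no data.
       return pl.LazyFrame(
           {
               "gene_symbol": ["GOOD1", "GOOD2", "LOWDEPTH", "NODATA"],
               "pli": [0.99, 0.01, 0.5, None],
               "loeuf": [0.1, 1.8, 0.4, None],
               "mean_depth": [45.0, 60.0, 12.0, 50.0],
               "cds_covered_pct": [0.99, 0.95, 0.98, 0.97],
           }
       )


   def test_filter_preserves_all_genes(sample_lf):
       out = filter_by_coverage(sample_lf).collect()
       assert out.height == 4  # no rows dropped, only flagged


   def test_normalize_scores_inversion(sample_lf):
       out = normalize_scores(filter_by_coverage(sample_lf)).collect()
       measured = out.filter(pl.col("quality_flag") == "measured").sort("loeuf")
       # Lowest LOEUF (most constrained) gets 1.0, highest gets 0.0.
       assert measured["loeuf_normalized"].to_list() == pytest.approx([1.0, 0.0])
   ```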
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/pytest tests/test_gnomad.py -v` -- all tests pass.
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/pytest tests/ -v` -- full test suite passes (existing + new).
</verify>
<done>
Transform module implements coverage filtering (preserving all genes with quality flags), LOEUF normalization with inversion, and NULL preservation. 10+ tests pass covering all edge cases, including null handling, normalization bounds, and download-skip behavior.
</done>
</task>

</tasks>

<verification>
1. All new modules import without error
2. All tests pass: `pytest tests/test_gnomad.py -v`
3. Full test suite still passes: `pytest tests/ -v`
4. NULL values are preserved through the pipeline (not converted to 0)
5. Normalized LOEUF scores are in the [0, 1] range with correct inversion
6. Quality flags correctly categorize genes by coverage status
</verification>

<success_criteria>
- gnomAD evidence module exists at src/usher_pipeline/evidence/gnomad/ with models, fetch, transform
- Download function uses httpx streaming with tenacity retry (not the requests library)
- Coverage filter adds quality_flag without dropping any genes
- Normalization inverts LOEUF (lower LOEUF = higher score) and preserves NULLs
- 10+ unit tests pass covering edge cases
- httpx and structlog added to pyproject.toml dependencies
</success_criteria>

<output>
After completion, create `.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md`
</output>

.planning/phases/02-prototype-evidence-layer/02-02-PLAN.md (new file, 206 lines)
@@ -0,0 +1,206 @@
---
phase: 02-prototype-evidence-layer
plan: 02
type: execute
wave: 2
depends_on: ["02-01"]
files_modified:
  - src/usher_pipeline/evidence/gnomad/load.py
  - src/usher_pipeline/cli/evidence_cmd.py
  - src/usher_pipeline/cli/main.py
  - tests/test_gnomad_integration.py
autonomous: true

must_haves:
  truths:
    - "gnomAD constraint data is written to DuckDB gnomad_constraint table with all columns including quality_flag and loeuf_normalized"
    - "Checkpoint-restart works: re-running the command skips download and processing if DuckDB table exists"
    - "Provenance sidecar JSON is saved alongside the gnomAD data recording source URL, version, processing steps"
    - "CLI command 'usher-pipeline evidence gnomad' runs the full fetch-transform-load pipeline"
    - "GCON-03 is addressed: constraint evidence is stored as weak signal, not direct cilia involvement"
  artifacts:
    - path: "src/usher_pipeline/evidence/gnomad/load.py"
      provides: "DuckDB persistence for gnomAD constraint data"
      contains: "def load_to_duckdb"
    - path: "src/usher_pipeline/cli/evidence_cmd.py"
      provides: "CLI evidence subcommand group with gnomad command"
      contains: "def gnomad"
    - path: "tests/test_gnomad_integration.py"
      provides: "Integration tests for gnomAD evidence layer"
      contains: "test_"
  key_links:
    - from: "src/usher_pipeline/evidence/gnomad/load.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "saves constraint DataFrame to DuckDB via PipelineStore"
      pattern: "PipelineStore|save_dataframe"
    - from: "src/usher_pipeline/evidence/gnomad/load.py"
      to: "src/usher_pipeline/persistence/provenance.py"
      via: "records provenance metadata for gnomAD processing"
      pattern: "ProvenanceTracker|record_step"
    - from: "src/usher_pipeline/cli/evidence_cmd.py"
      to: "src/usher_pipeline/evidence/gnomad/fetch.py"
      via: "orchestrates download-transform-load pipeline"
      pattern: "download_constraint_metrics|process_gnomad_constraint|load_to_duckdb"
    - from: "src/usher_pipeline/cli/main.py"
      to: "src/usher_pipeline/cli/evidence_cmd.py"
      via: "registers evidence command group"
      pattern: "cli\\.add_command|evidence"
---

<objective>
Wire the gnomAD evidence layer into DuckDB persistence and the CLI, completing the full fetch-transform-load pipeline with checkpoint-restart and provenance tracking.

Purpose: This plan completes the prototype evidence layer by connecting the data retrieval/transformation (Plan 01) to storage and the CLI, proving the end-to-end pattern works for all future evidence layers.

Output: A working CLI command that downloads, filters, normalizes, and persists gnomAD constraint data with full provenance.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-prototype-evidence-layer/02-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@src/usher_pipeline/persistence/duckdb_store.py
@src/usher_pipeline/persistence/provenance.py
@src/usher_pipeline/cli/main.py
@src/usher_pipeline/cli/setup_cmd.py
@config/default.yaml
</context>

<tasks>

<task type="auto">
<name>Task 1: Create DuckDB loader and CLI evidence command</name>
<files>
src/usher_pipeline/evidence/gnomad/load.py
src/usher_pipeline/evidence/gnomad/__init__.py
src/usher_pipeline/cli/evidence_cmd.py
src/usher_pipeline/cli/main.py
</files>
<action>
1. Create `src/usher_pipeline/evidence/gnomad/load.py` (a sketch follows this step):
   - `load_to_duckdb(df: pl.DataFrame, store: PipelineStore, provenance: ProvenanceTracker, description: str = "") -> None`:
     - Save the DataFrame to a DuckDB table named `gnomad_constraint` using `store.save_dataframe()` with `replace=True` (idempotent -- CREATE OR REPLACE).
     - Record a provenance step "load_gnomad_constraint" with details including row_count, measured_count (quality_flag == "measured"), incomplete_count, no_data_count, and null_loeuf_count.
     - Log summary stats with structlog.
   - `query_constrained_genes(store: PipelineStore, loeuf_threshold: float = 0.6) -> pl.DataFrame`:
     - Convenience query: SELECT genes from gnomad_constraint WHERE loeuf < threshold AND quality_flag = 'measured'.
     - Returns a polars DataFrame sorted by loeuf ascending (most constrained first).
     - This demonstrates DuckDB query capability and validates the GCON-03 interpretation: constrained genes are "important but under-studied" signals, not direct cilia evidence.
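
   A sketch of `load.py`; only the function names and the table name are fixed by this plan -- the exact `save_dataframe`, `record_step`, and query-helper signatures on PipelineStore/ProvenanceTracker are assumptions to be checked against the Phase 1 code:

   ```python
   import polars as pl
   import structlog

   from usher_pipeline.persistence.duckdb_store import PipelineStore
   from usher_pipeline.persistence.provenance import ProvenanceTracker

   log = structlog.get_logger()

   TABLE = "gnomad_constraint"


   def load_to_duckdb(
       df: pl.DataFrame,
       store: PipelineStore,
       provenance: ProvenanceTracker,
       description: str = "",
   ) -> None:
       # Assumed signature: save_dataframe(table_name, df, replace=...) with
       # CREATE OR REPLACE semantics so re-runs are idempotent.
       store.save_dataframe(TABLE, df, replace=True)
       details = {
           "row_count": df.height,
           "measured_count": df.filter(pl.col("quality_flag") == "measured").height,
           "incomplete_count": df.filter(pl.col("quality_flag") == "incomplete_coverage").height,
           "no_data_count": df.filter(pl.col("quality_flag") == "no_data").height,
           "null_loeuf_count": df.filter(pl.col("loeuf").is_null()).height,
           "description": description,
       }
       provenance.record_step("load_gnomad_constraint", details=details)  # assumed signature
       log.info("gnomad_constraint_loaded", **details)


   def query_constrained_genes(store: PipelineStore, loeuf_threshold: float = 0.6) -> pl.DataFrame:
       # GCON-03: constraint is a weak "important but under-studied" signal,
       # not direct cilia evidence.
       sql = (
           f"SELECT * FROM {TABLE} "
           f"WHERE loeuf < {loeuf_threshold} AND quality_flag = 'measured' "
           "ORDER BY loeuf ASC"
       )
       return store.query(sql)  # assumed helper that returns a polars DataFrame
   ```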

2. Update `src/usher_pipeline/evidence/gnomad/__init__.py`:
   - Add exports for load_to_duckdb, query_constrained_genes, and all models/fetch/transform exports.

3. Create `src/usher_pipeline/cli/evidence_cmd.py` (a sketch of the command group and its registration follows this list):
   - Create a Click command group `evidence` with the help text: "Fetch and process evidence layer data."
   - Create a subcommand `gnomad`:
     - Options: `--force` (re-download and reprocess), `--url` (override download URL), `--min-depth` (default 30.0), `--min-cds-pct` (default 0.9)
     - Full pipeline orchestration:
       a. Load config from the Click context (same pattern as setup_cmd.py)
       b. Create PipelineStore and ProvenanceTracker from the config
       c. Check checkpoint: `store.has_checkpoint('gnomad_constraint')` -- if it exists and --force is not set, print a summary and return
       d. Download the gnomAD constraint TSV to the `config.data_dir / "gnomad"` directory
       e. Process with `process_gnomad_constraint()` from the transform module
       f. Load to DuckDB with `load_to_duckdb()`
       g. Save the provenance sidecar to `config.data_dir / "gnomad" / "constraint.provenance.json"`
       h. Print a summary: total genes, measured/incomplete/no_data counts, DuckDB path
     - Use colored Click output (green for success, yellow for checkpoint skip, red for errors)
     - Wrap in try/finally for store.close() cleanup
     - Log all steps with structlog

4. Update `src/usher_pipeline/cli/main.py`:
   - Import the evidence command group: `from usher_pipeline.cli.evidence_cmd import evidence`
   - Register it with: `cli.add_command(evidence)`
   - This enables: `usher-pipeline evidence gnomad [--force] [--url URL]`
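
   A sketch of the command group and its registration; `build_runtime()` is a hypothetical placeholder for however setup_cmd.py loads the config and constructs PipelineStore/ProvenanceTracker, and the provenance sidecar write and error colouring are elided:

   ```python
   import click

   from usher_pipeline.evidence.gnomad.fetch import download_constraint_metrics
   from usher_pipeline.evidence.gnomad.load import load_to_duckdb
   from usher_pipeline.evidence.gnomad.models import GNOMAD_CONSTRAINT_URL
   from usher_pipeline.evidence.gnomad.transform import process_gnomad_constraint


   @click.group()
   def evidence() -> None:
       """Fetch and process evidence layer data."""


   @evidence.command()
   @click.option("--force", is_flag=True, help="Re-download and reprocess.")
   @click.option("--url", default=GNOMAD_CONSTRAINT_URL, help="Override the download URL.")
   @click.option("--min-depth", default=30.0, show_default=True)
   @click.option("--min-cds-pct", default=0.9, show_default=True)
   @click.pass_context
   def gnomad(ctx, force, url, min_depth, min_cds_pct):
       """Run the gnomAD fetch -> transform -> load pipeline."""
       # Hypothetical helper mirroring the setup_cmd.py pattern for config/store wiring.
       config, store, provenance = build_runtime(ctx)
       try:
           if store.has_checkpoint("gnomad_constraint") and not force:
               click.secho("gnomad_constraint already loaded -- skipping", fg="yellow")
               return
           gnomad_dir = config.data_dir / "gnomad"
           gnomad_dir.mkdir(parents=True, exist_ok=True)
           tsv = download_constraint_metrics(gnomad_dir / "constraint.tsv", url=url, force=force)
           df = process_gnomad_constraint(tsv, min_depth=min_depth, min_cds_pct=min_cds_pct)
           load_to_duckdb(df, store, provenance)
           click.secho(f"Loaded {df.height} genes into gnomad_constraint", fg="green")
       finally:
           store.close()


   # In src/usher_pipeline/cli/main.py the group is registered with:
   #   from usher_pipeline.cli.evidence_cmd import evidence
   #   cli.add_command(evidence)
   ```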
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/python -c "from usher_pipeline.evidence.gnomad.load import load_to_duckdb, query_constrained_genes; print('load imports OK')"` -- prints "load imports OK".
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/usher-pipeline evidence --help` -- shows the gnomad subcommand.
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/usher-pipeline evidence gnomad --help` -- shows the --force, --url, --min-depth, and --min-cds-pct options.
</verify>
<done>
The DuckDB loader persists gnomAD data with provenance. The CLI `evidence gnomad` command is available with checkpoint-restart, force-rerun, and configurable options.
</done>
</task>

<task type="auto">
<name>Task 2: Create integration tests for full gnomAD pipeline</name>
<files>
tests/test_gnomad_integration.py
</files>
<action>
Create integration tests verifying the full gnomAD evidence layer pipeline end-to-end.

Create `tests/test_gnomad_integration.py` with the following (a sketch of the CLI tests with a mocked download follows this list):

1. Fixtures:
   - `test_config(tmp_path)`: creates a temp config YAML pointing to tmp_path for data_dir, cache_dir, and duckdb_path. Uses the same pattern as tests/test_integration.py.
   - `sample_tsv(tmp_path)`: creates a synthetic gnomAD constraint TSV with ~15 rows covering:
     - 5 well-covered genes with measured LOEUF/pLI (values varying from 0.1 to 2.0)
     - 3 genes with low depth (<30x)
     - 3 genes with low CDS coverage (<90%)
     - 2 genes with NULL LOEUF and pLI
     - 2 genes at the normalization bounds (LOEUF=0.0, LOEUF=3.0)

2. Integration tests (at least 8):
   - `test_full_pipeline_to_duckdb`: process_gnomad_constraint -> load_to_duckdb -> verify the DuckDB table has all rows, the correct columns, and quality_flags
   - `test_checkpoint_restart_skips_processing`: load data, verify has_checkpoint returns True, simulate the re-run check
   - `test_provenance_recorded`: after load_to_duckdb, verify provenance has the step "load_gnomad_constraint" with the expected details
   - `test_provenance_sidecar_created`: verify the .provenance.json file is written with correct metadata
   - `test_query_constrained_genes_filters_correctly`: load data, then query_constrained_genes with threshold=0.6 returns only measured genes below the threshold
   - `test_null_loeuf_not_in_constrained_results`: genes with NULL LOEUF are excluded from constrained-gene queries
   - `test_duckdb_schema_has_quality_flag`: verify the gnomad_constraint table has a quality_flag column with non-null values
   - `test_normalized_scores_in_duckdb`: load and verify loeuf_normalized values are in [0,1] for measured genes and NULL for others
   - `test_cli_evidence_gnomad_help`: use Click's CliRunner to invoke `evidence gnomad --help` and verify the output
   - `test_cli_evidence_gnomad_with_mock`: use CliRunner + a mocked download to test that the CLI runs without error (mock the actual download, provide the synthetic TSV)

   Use `polars.testing.assert_frame_equal` where appropriate.
   All tests use tmp_path -- no real gnomAD downloads.
   Mock httpx downloads in the CLI test -- provide the synthetic TSV file instead.

3. Run the full test suite to ensure no regressions:
   `pytest tests/ -v` -- all tests pass (existing 49-50 + new ~18-20).
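
   A sketch of the two CLI tests with a mocked download; the monkeypatch target and the `obj=` wiring of the Click context are assumptions that depend on how main.py and setup_cmd.py pass the config:

   ```python
   from click.testing import CliRunner

   from usher_pipeline.cli.evidence_cmd import evidence


   def test_cli_evidence_gnomad_help():
       result = CliRunner().invoke(evidence, ["gnomad", "--help"])
       assert result.exit_code == 0
       assert "--min-depth" in result.output


   def test_cli_evidence_gnomad_with_mock(monkeypatch, sample_tsv, test_config):
       # Swap the real download for a stub that returns the synthetic TSV (no network).
       monkeypatch.setattr(
           "usher_pipeline.cli.evidence_cmd.download_constraint_metrics",
           lambda output_path, **kwargs: sample_tsv,
       )
       # Assumption: the loaded config is passed via ctx.obj as in setup_cmd.py.
       result = CliRunner().invoke(evidence, ["gnomad", "--force"], obj={"config": test_config})
       assert result.exit_code == 0, result.output
   ```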
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/pytest tests/test_gnomad_integration.py -v` -- all integration tests pass.
Run: `cd /Users/gbanyan/Project/usher-exploring && .venv/bin/pytest tests/ -v` -- the full suite passes with no regressions.
</verify>
<done>
Integration tests prove: gnomAD data flows from TSV through transform to DuckDB with correct quality flags, normalized scores, NULL handling, checkpoint-restart, and provenance. The CLI command works end-to-end with mocked downloads. The full test suite passes.
</done>
</task>

</tasks>

<verification>
1. CLI command `usher-pipeline evidence gnomad --help` works
2. Integration tests verify the full pipeline: fetch -> transform -> DuckDB with provenance
3. Checkpoint-restart: re-running skips if the gnomad_constraint table exists
4. DuckDB table has the correct schema: gene_id, gene_symbol, pli, loeuf, quality_flag, loeuf_normalized
5. Provenance sidecar JSON captures source URL, version, and processing steps
6. Full test suite passes: `pytest tests/ -v`
7. Requirements satisfied:
   - GCON-01: pLI and LOEUF retrieved and stored per gene
   - GCON-02: coverage quality filter with quality flags
   - GCON-03: constraint treated as a weak signal (query_constrained_genes is informational, not cilia-direct)
</verification>

<success_criteria>
- DuckDB gnomad_constraint table stores all genes with quality_flag and loeuf_normalized columns
- NULL loeuf values remain NULL in DuckDB (not converted to 0)
- Checkpoint-restart works: a second run detects the existing table and skips
- Provenance JSON records the source URL, gnomAD version, and processing step details
- CLI `usher-pipeline evidence gnomad` orchestrates the full pipeline
- 8+ integration tests pass covering the end-to-end pipeline, checkpoint, provenance, and CLI
- Full test suite passes with no regressions from Phase 1
</success_criteria>

<output>
After completion, create `.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md`
</output>