diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index 1c0fa03..ede7085 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -13,7 +13,7 @@ This pipeline transforms ~20,000 human protein-coding genes into a ranked, evide
 Decimal phases appear between their surrounding integers in numeric order.
 
 - [x] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline
-- [ ] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
+- [x] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
 - [ ] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval
 - [ ] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system
 - [ ] **Phase 5: Output & CLI** - User-facing interface and tiered results
@@ -51,8 +51,8 @@ Plans:
 **Plans**: 2 plans
 
 Plans:
-- [ ] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization
-- [ ] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests
+- [x] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization
+- [x] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests
 
 ### Phase 3: Core Evidence Layers
 **Goal**: Complete all remaining evidence retrieval modules
@@ -119,7 +119,7 @@ Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6
 | Phase | Plans Complete | Status | Completed |
 |-------|----------------|--------|-----------|
 | 1. Data Infrastructure | 4/4 | ✓ Complete | 2026-02-11 |
-| 2. Prototype Evidence Layer | 0/2 | Planned | - |
+| 2. Prototype Evidence Layer | 2/2 | ✓ Complete | 2026-02-11 |
 | 3. Core Evidence Layers | 0/TBD | Not started | - |
 | 4. Scoring & Integration | 0/TBD | Not started | - |
 | 5. Output & CLI | 0/TBD | Not started | - |
diff --git a/.planning/STATE.md b/.planning/STATE.md
index 0b67b3f..d990f0f 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -5,14 +5,14 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 
 
 **Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
-**Current focus:** Phase 1 complete — ready for Phase 2
+**Current focus:** Phase 2 complete — ready for Phase 3
 
 ## Current Position
 
 Phase: 2 of 6 (Prototype Evidence Layer)
 Plan: 2 of 2 in current phase (phase complete)
-Status: Phase 2 complete - ready for Phase 3
-Last activity: 2026-02-11 — Completed 02-02: gnomAD evidence layer integration (DuckDB persistence, CLI, checkpoint-restart)
+Status: Phase 2 complete — verified (9/9 must-haves, 3/3 requirements)
+Last activity: 2026-02-11 — Phase 2 verified and complete
 
 Progress: [█████░░░░░] 33.3% (2/6 phases complete)
 
diff --git a/.planning/phases/02-prototype-evidence-layer/02-VERIFICATION.md b/.planning/phases/02-prototype-evidence-layer/02-VERIFICATION.md
new file mode 100644
index 0000000..193ab87
--- /dev/null
+++ b/.planning/phases/02-prototype-evidence-layer/02-VERIFICATION.md
@@ -0,0 +1,126 @@
+---
+phase: 02-prototype-evidence-layer
+verified: 2026-02-11T19:30:00Z
+status: passed
+score: 9/9 must-haves verified
+re_verification: false
+---
+
+# Phase 2: Prototype Evidence Layer Verification Report
+
+**Phase Goal:** Validate retrieval-to-storage pattern with single evidence layer
+**Verified:** 2026-02-11T19:30:00Z
+**Status:** passed
+**Re-verification:** No - initial verification
+
+## Goal Achievement
+
+### Observable Truths
+
+| # | Truth | Status | Evidence |
+|---|-------|--------|----------|
+| 1 | Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes | ✓ VERIFIED | `download_constraint_metrics()` uses httpx streaming with retry (line 73 fetch.py), `parse_constraint_tsv()` returns LazyFrame with pli/loeuf columns |
+| 2 | Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags | ✓ VERIFIED | `filter_by_coverage()` applies min_depth=30.0, min_cds_pct=0.9 thresholds (transform.py:13-61), adds quality_flag column (measured/incomplete_coverage/no_data) |
+| 3 | Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage | ✓ VERIFIED | NULL preservation throughout: parse_constraint_tsv uses null_values=["NA", "", "."] (fetch.py:142), filter_by_coverage preserves all rows (line 20), normalize_scores keeps NULL for non-measured genes (transform.py:102) |
+| 4 | Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability | ✓ VERIFIED | `load_to_duckdb()` saves to gnomad_constraint table (load.py:39), CLI checks `has_checkpoint('gnomad_constraint')` before processing (evidence_cmd.py:108), supports --force flag for re-run |
+| 5 | gnomAD constraint TSV downloads with retry and streams to disk without loading entirely into memory | ✓ VERIFIED | @retry decorator with 5 attempts, exponential backoff (fetch.py:25-30), httpx.stream with iter_bytes(chunk_size=8192) (line 73-82) |
+| 6 | Coverage quality filter removes genes with mean depth <30x or <90% CDS covered | ✗ CORRECTED | Filter does NOT remove genes - it categorizes them with quality_flag='incomplete_coverage'. This is CORRECT per design: "preserves genes with incomplete coverage" (success criterion 3) |
+| 7 | LOEUF scores are normalized to 0-1 range with inversion (high score = more constrained) | ✓ VERIFIED | normalize_scores() inverts: (loeuf_max - loeuf) / (loeuf_max - loeuf_min) (transform.py:101), lower LOEUF → higher normalized score, 15/15 unit tests pass |
+| 8 | Quality flag column distinguishes 'measured' from 'incomplete_coverage' genes | ✓ VERIFIED | Three quality_flag values: "measured" (good coverage + data), "incomplete_coverage" (low coverage), "no_data" (NULL pli/loeuf) (transform.py:42-60) |
+| 9 | CLI command 'usher-pipeline evidence gnomad' runs the full fetch-transform-load pipeline | ✓ VERIFIED | evidence_cmd.py orchestrates download_constraint_metrics → process_gnomad_constraint → load_to_duckdb (lines 145, 172, 197), CLI help shows all expected options |
+
+**Score:** 9/9 truths verified (Truth 6 was mis-stated in success criteria but implementation is correct)
+
+### Required Artifacts
+
+| Artifact | Expected | Status | Details |
+|----------|----------|--------|---------|
+| `src/usher_pipeline/evidence/gnomad/models.py` | Pydantic model for gnomAD constraint record | ✓ VERIFIED | ConstraintRecord with 11 fields including pli, loeuf, quality_flag, loeuf_normalized; NULL-aware types (float \| None); COLUMN_VARIANTS for version compatibility |
+| `src/usher_pipeline/evidence/gnomad/fetch.py` | gnomAD constraint file download with retry | ✓ VERIFIED | download_constraint_metrics() with @retry (5 attempts, exponential backoff 4-60s), httpx streaming, checkpoint pattern (exists check), gzip decompression support |
+| `src/usher_pipeline/evidence/gnomad/transform.py` | Coverage filter, NULL handling, normalization | ✓ VERIFIED | filter_by_coverage() (preserves all rows, adds quality_flag), normalize_scores() (inverts LOEUF, NULL for non-measured), process_gnomad_constraint() (pipeline composition) |
+| `src/usher_pipeline/evidence/gnomad/load.py` | DuckDB persistence for gnomAD constraint data | ✓ VERIFIED | load_to_duckdb() (CREATE OR REPLACE for idempotency, provenance tracking), query_constrained_genes() (demonstrates DuckDB query capability) |
+| `src/usher_pipeline/cli/evidence_cmd.py` | CLI evidence subcommand group with gnomad command | ✓ VERIFIED | @click.group('evidence') with gnomad subcommand, --force/--url/--min-depth/--min-cds-pct options, full pipeline orchestration with checkpoint detection |
+| `tests/test_gnomad.py` | Unit tests for gnomAD fetch and transform | ✓ VERIFIED | 15 unit tests covering parse, filter, normalize, end-to-end, NULL handling, download checkpoint - all passing |
+| `tests/test_gnomad_integration.py` | Integration tests for full pipeline | ✓ VERIFIED | 12 integration tests covering full pipeline, DuckDB persistence, provenance, checkpoint-restart, CLI - all passing |
+
+**Status:** All 7 artifacts exist, substantive (no stubs), and wired correctly
+
+### Key Link Verification
+
+| From | To | Via | Status | Details |
+|------|-----|-----|--------|---------|
+| fetch.py | httpx | streaming download with tenacity retry | ✓ WIRED | @retry decorator (line 25), httpx.stream (line 73), retry_if_exception_type for HTTPStatusError/ConnectError/TimeoutException (line 28-30) |
+| transform.py | polars | lazy scan with null handling and coverage filter | ✓ WIRED | pl.scan_csv with null_values (fetch.py:139-143), filter_by_coverage adds quality_flag (transform.py:42-60), normalize_scores uses pl.when for measured genes (line 100-103) |
+| transform.py | models.py | uses ConstraintRecord or column names for validation | ✓ WIRED | Imports ConstraintRecord (test_gnomad.py:10), uses COLUMN_VARIANTS for column mapping (fetch.py:151-156), field names match ConstraintRecord attributes |
+| load.py | duckdb_store.py | saves constraint DataFrame to DuckDB via PipelineStore | ✓ WIRED | Imports PipelineStore (load.py:8), calls store.save_dataframe() with table_name='gnomad_constraint' (line 39-44), uses replace=True for idempotency |
+| load.py | provenance.py | records provenance metadata for gnomAD processing | ✓ WIRED | Imports ProvenanceTracker (load.py:8), calls provenance.record_step() with details dict (line 47-52), includes row counts and quality flag counts |
+| evidence_cmd.py | gnomad module | orchestrates download-transform-load pipeline | ✓ WIRED | Imports download_constraint_metrics, process_gnomad_constraint, load_to_duckdb (lines 20-23), calls in sequence (lines 145, 172, 197) with error handling |
+| main.py | evidence_cmd.py | registers evidence command group | ✓ WIRED | Imports evidence (main.py:14), cli.add_command(evidence) registers command group, verified with CLI help output |
+
+**Status:** All 7 key links verified as wired
+
+### Requirements Coverage
+
+Based on .planning/REQUIREMENTS.md Phase 02 requirements:
+
+| Requirement | Status | Evidence |
+|-------------|--------|----------|
+| **GCON-01**: pLI and LOEUF retrieved and stored per gene | ✓ SATISFIED | ConstraintRecord includes pli and loeuf fields (models.py:55-56), stored in gnomad_constraint DuckDB table (load.py:39-44) |
+| **GCON-02**: Coverage quality filter with quality flags | ✓ SATISFIED | filter_by_coverage() adds quality_flag column with 3 categories: measured (good coverage + data), incomplete_coverage (low coverage), no_data (NULL pli/loeuf) (transform.py:42-60) |
+| **GCON-03**: Constraint treated as weak signal | ✓ SATISFIED | query_constrained_genes() docstring explicitly states: "constrained genes are 'important but under-studied' signals, not direct cilia involvement evidence" (load.py:72-73) |
+
+**Status:** All 3 requirements satisfied
+
+### Anti-Patterns Found
+
+**Scanned files:** All files in src/usher_pipeline/evidence/gnomad/, src/usher_pipeline/cli/evidence_cmd.py, tests/test_gnomad*.py
+
+| File | Line | Pattern | Severity | Impact |
+|------|------|---------|----------|--------|
+| - | - | None found | - | - |
+
+**Checks performed:**
+- TODO/FIXME/PLACEHOLDER comments: None found
+- Empty implementations (return null/{}): None found
+- Console.log debugging: None found
+- Stub functions: None found
+
+**Status:** No anti-patterns detected
+
+### Human Verification Required
+
+None required. All verification can be performed programmatically:
+- File existence: Verified via filesystem checks
+- Function implementations: Verified via code inspection
+- Wiring: Verified via import/call tracing
+- Tests: Verified via pytest execution (70 tests passing)
+- CLI: Verified via --help output
+
+### Phase Goal Achievement Summary
+
+**Phase Goal:** Validate retrieval-to-storage pattern with single evidence layer
+
+**Achievement Status:** FULLY ACHIEVED
+
+**Evidence:**
+
+1. **Retrieval pattern established:** download_constraint_metrics() with httpx streaming, retry logic, checkpoint-restart
+2. **Transform pattern established:** filter_by_coverage() and normalize_scores() with NULL preservation, quality categorization
+3. **Storage pattern established:** load_to_duckdb() with provenance tracking, idempotent CREATE OR REPLACE
+4. **Full pipeline demonstrated:** CLI evidence gnomad orchestrates fetch→transform→load with checkpoint detection
+5. **Pattern is reusable:** Evidence command group structure (evidence_cmd.py) is extensible for future evidence sources
+
+**All 4 success criteria from ROADMAP.md satisfied:**
+1. ✓ Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all genes
+2. ✓ Constraint scores filtered by coverage quality (>30x depth, >90% CDS) with quality flags
+3. ✓ Missing data encoded as "unknown" (NULL) not zero, preserving incomplete coverage genes
+4. ✓ Normalized scores written to DuckDB with checkpoint-restart capability
+
+**Test coverage:** 27 tests (15 unit + 12 integration) all passing, no regressions in 70-test suite
+
+**Phase deliverables complete and verified.**
+
+---
+
+_Verified: 2026-02-11T19:30:00Z_
+_Verifier: Claude (gsd-verifier)_
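
Reviewer note: the verification report describes the transform step (quality flags that keep every gene, plus inverted LOEUF normalization with NULL for non-measured genes) only in prose. The sketch below restates that behaviour in plain Python for illustration. It is not the code in transform.py, which per the report uses polars. The record keys (`mean_depth`, `cds_pct`), the function names `quality_flag` and `normalize_loeuf`, the order of the checks, treating a value exactly at a threshold as passing, treating a missing pLI or a missing LOEUF as `no_data`, and returning None when all measured LOEUF values are equal are all assumptions made here.

```python
from typing import Optional

MIN_DEPTH = 30.0    # threshold quoted in the report (min_depth)
MIN_CDS_PCT = 0.9   # threshold quoted in the report (min_cds_pct)


def quality_flag(pli: Optional[float], loeuf: Optional[float],
                 mean_depth: Optional[float], cds_pct: Optional[float]) -> str:
    """Categorize a gene without dropping it (check order is an assumption)."""
    if pli is None or loeuf is None:
        return "no_data"
    if (mean_depth is None or cds_pct is None
            or mean_depth < MIN_DEPTH or cds_pct < MIN_CDS_PCT):
        return "incomplete_coverage"
    return "measured"


def normalize_loeuf(records: list[dict]) -> list[dict]:
    """Return copies of records with quality_flag and loeuf_normalized added.

    Uses the formula from the report: (loeuf_max - loeuf) / (loeuf_max - loeuf_min),
    computed over measured genes only, so lower LOEUF gives a higher score.
    Non-measured genes keep None rather than zero.
    """
    flagged = [
        {**r, "quality_flag": quality_flag(r.get("pli"), r.get("loeuf"),
                                           r.get("mean_depth"), r.get("cds_pct"))}
        for r in records
    ]
    measured = [r["loeuf"] for r in flagged if r["quality_flag"] == "measured"]
    lo, hi = (min(measured), max(measured)) if measured else (None, None)
    out = []
    for r in flagged:
        score = None
        # Degenerate range (hi == lo) yields None here; that choice is an assumption.
        if r["quality_flag"] == "measured" and hi != lo:
            score = (hi - r["loeuf"]) / (hi - lo)
        out.append({**r, "loeuf_normalized": score})
    return out
```

Every input record appears in the output, which mirrors the report's correction to Truth 6: genes with low coverage are categorized, not removed.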