docs(phase-02): complete phase execution

This commit is contained in:
2026-02-11 18:28:13 +08:00
parent a0388cf4e1
commit ffb4963d2b
3 changed files with 133 additions and 7 deletions

View File

@@ -13,7 +13,7 @@ This pipeline transforms ~20,000 human protein-coding genes into a ranked, evide
Decimal phases appear between their surrounding integers in numeric order. Decimal phases appear between their surrounding integers in numeric order.
- [x] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline - [x] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline
- [ ] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture - [x] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
- [ ] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval - [ ] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval
- [ ] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system - [ ] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system
- [ ] **Phase 5: Output & CLI** - User-facing interface and tiered results - [ ] **Phase 5: Output & CLI** - User-facing interface and tiered results
@@ -51,8 +51,8 @@ Plans:
**Plans**: 2 plans **Plans**: 2 plans
Plans: Plans:
- [ ] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization - [x] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization
- [ ] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests - [x] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests
### Phase 3: Core Evidence Layers ### Phase 3: Core Evidence Layers
**Goal**: Complete all remaining evidence retrieval modules **Goal**: Complete all remaining evidence retrieval modules
@@ -119,7 +119,7 @@ Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6
| Phase | Plans Complete | Status | Completed | | Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------| |-------|----------------|--------|-----------|
| 1. Data Infrastructure | 4/4 | ✓ Complete | 2026-02-11 | | 1. Data Infrastructure | 4/4 | ✓ Complete | 2026-02-11 |
| 2. Prototype Evidence Layer | 0/2 | Planned | - | | 2. Prototype Evidence Layer | 2/2 | ✓ Complete | 2026-02-11 |
| 3. Core Evidence Layers | 0/TBD | Not started | - | | 3. Core Evidence Layers | 0/TBD | Not started | - |
| 4. Scoring & Integration | 0/TBD | Not started | - | | 4. Scoring & Integration | 0/TBD | Not started | - |
| 5. Output & CLI | 0/TBD | Not started | - | | 5. Output & CLI | 0/TBD | Not started | - |

View File

@@ -5,14 +5,14 @@
See: .planning/PROJECT.md (updated 2026-02-11) See: .planning/PROJECT.md (updated 2026-02-11)
**Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented. **Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
**Current focus:** Phase 1 complete — ready for Phase 2 **Current focus:** Phase 2 complete — ready for Phase 3
## Current Position ## Current Position
Phase: 2 of 6 (Prototype Evidence Layer) Phase: 2 of 6 (Prototype Evidence Layer)
Plan: 2 of 2 in current phase (phase complete) Plan: 2 of 2 in current phase (phase complete)
Status: Phase 2 complete - ready for Phase 3 Status: Phase 2 complete — verified (9/9 must-haves, 3/3 requirements)
Last activity: 2026-02-11 — Completed 02-02: gnomAD evidence layer integration (DuckDB persistence, CLI, checkpoint-restart) Last activity: 2026-02-11 — Phase 2 verified and complete
Progress: [█████░░░░░] 33.3% (2/6 phases complete) Progress: [█████░░░░░] 33.3% (2/6 phases complete)

View File

@@ -0,0 +1,126 @@
---
phase: 02-prototype-evidence-layer
verified: 2026-02-11T19:30:00Z
status: passed
score: 9/9 must-haves verified
re_verification: false
---
# Phase 2: Prototype Evidence Layer Verification Report
**Phase Goal:** Validate retrieval-to-storage pattern with single evidence layer
**Verified:** 2026-02-11T19:30:00Z
**Status:** passed
**Re-verification:** No - initial verification
## Goal Achievement
### Observable Truths
| # | Truth | Status | Evidence |
|---|-------|--------|----------|
| 1 | Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes | ✓ VERIFIED | `download_constraint_metrics()` uses httpx streaming with retry (line 73 fetch.py), `parse_constraint_tsv()` returns LazyFrame with pli/loeuf columns |
| 2 | Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags | ✓ VERIFIED | `filter_by_coverage()` applies min_depth=30.0, min_cds_pct=0.9 thresholds (transform.py:13-61), adds quality_flag column (measured/incomplete_coverage/no_data) |
| 3 | Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage | ✓ VERIFIED | NULL preservation throughout: parse_constraint_tsv uses null_values=["NA", "", "."] (fetch.py:142), filter_by_coverage preserves all rows (line 20), normalize_scores keeps NULL for non-measured genes (transform.py:102) |
| 4 | Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability | ✓ VERIFIED | `load_to_duckdb()` saves to gnomad_constraint table (load.py:39), CLI checks `has_checkpoint('gnomad_constraint')` before processing (evidence_cmd.py:108), supports --force flag for re-run |
| 5 | gnomAD constraint TSV downloads with retry and streams to disk without loading entirely into memory | ✓ VERIFIED | @retry decorator with 5 attempts, exponential backoff (fetch.py:25-30), httpx.stream with iter_bytes(chunk_size=8192) (line 73-82) |
| 6 | Coverage quality filter removes genes with mean depth <30x or <90% CDS covered | ✗ CORRECTED | Filter does NOT remove genes - it categorizes them with quality_flag='incomplete_coverage'. This is CORRECT per design: "preserves genes with incomplete coverage" (success criterion 3) |
| 7 | LOEUF scores are normalized to 0-1 range with inversion (high score = more constrained) | ✓ VERIFIED | normalize_scores() inverts: (loeuf_max - loeuf) / (loeuf_max - loeuf_min) (transform.py:101), lower LOEUF → higher normalized score, 15/15 unit tests pass |
| 8 | Quality flag column distinguishes 'measured' from 'incomplete_coverage' genes | ✓ VERIFIED | Three quality_flag values: "measured" (good coverage + data), "incomplete_coverage" (low coverage), "no_data" (NULL pli/loeuf) (transform.py:42-60) |
| 9 | CLI command 'usher-pipeline evidence gnomad' runs the full fetch-transform-load pipeline | ✓ VERIFIED | evidence_cmd.py orchestrates download_constraint_metrics → process_gnomad_constraint → load_to_duckdb (lines 145, 172, 197), CLI help shows all expected options |
**Score:** 9/9 truths verified (Truth 6 was mis-stated in success criteria but implementation is correct)
### Required Artifacts
| Artifact | Expected | Status | Details |
|----------|----------|--------|---------|
| `src/usher_pipeline/evidence/gnomad/models.py` | Pydantic model for gnomAD constraint record | ✓ VERIFIED | ConstraintRecord with 11 fields including pli, loeuf, quality_flag, loeuf_normalized; NULL-aware types (float \| None); COLUMN_VARIANTS for version compatibility |
| `src/usher_pipeline/evidence/gnomad/fetch.py` | gnomAD constraint file download with retry | ✓ VERIFIED | download_constraint_metrics() with @retry (5 attempts, exponential backoff 4-60s), httpx streaming, checkpoint pattern (exists check), gzip decompression support |
| `src/usher_pipeline/evidence/gnomad/transform.py` | Coverage filter, NULL handling, normalization | ✓ VERIFIED | filter_by_coverage() (preserves all rows, adds quality_flag), normalize_scores() (inverts LOEUF, NULL for non-measured), process_gnomad_constraint() (pipeline composition) |
| `src/usher_pipeline/evidence/gnomad/load.py` | DuckDB persistence for gnomAD constraint data | ✓ VERIFIED | load_to_duckdb() (CREATE OR REPLACE for idempotency, provenance tracking), query_constrained_genes() (demonstrates DuckDB query capability) |
| `src/usher_pipeline/cli/evidence_cmd.py` | CLI evidence subcommand group with gnomad command | ✓ VERIFIED | @click.group('evidence') with gnomad subcommand, --force/--url/--min-depth/--min-cds-pct options, full pipeline orchestration with checkpoint detection |
| `tests/test_gnomad.py` | Unit tests for gnomAD fetch and transform | ✓ VERIFIED | 15 unit tests covering parse, filter, normalize, end-to-end, NULL handling, download checkpoint - all passing |
| `tests/test_gnomad_integration.py` | Integration tests for full pipeline | ✓ VERIFIED | 12 integration tests covering full pipeline, DuckDB persistence, provenance, checkpoint-restart, CLI - all passing |
**Status:** All 7 artifacts exist, substantive (no stubs), and wired correctly
### Key Link Verification
| From | To | Via | Status | Details |
|------|-----|-----|--------|---------|
| fetch.py | httpx | streaming download with tenacity retry | ✓ WIRED | @retry decorator (line 25), httpx.stream (line 73), retry_if_exception_type for HTTPStatusError/ConnectError/TimeoutException (line 28-30) |
| transform.py | polars | lazy scan with null handling and coverage filter | ✓ WIRED | pl.scan_csv with null_values (fetch.py:139-143), filter_by_coverage adds quality_flag (transform.py:42-60), normalize_scores uses pl.when for measured genes (line 100-103) |
| transform.py | models.py | uses ConstraintRecord or column names for validation | ✓ WIRED | Imports ConstraintRecord (test_gnomad.py:10), uses COLUMN_VARIANTS for column mapping (fetch.py:151-156), field names match ConstraintRecord attributes |
| load.py | duckdb_store.py | saves constraint DataFrame to DuckDB via PipelineStore | ✓ WIRED | Imports PipelineStore (load.py:8), calls store.save_dataframe() with table_name='gnomad_constraint' (line 39-44), uses replace=True for idempotency |
| load.py | provenance.py | records provenance metadata for gnomAD processing | ✓ WIRED | Imports ProvenanceTracker (load.py:8), calls provenance.record_step() with details dict (line 47-52), includes row counts and quality flag counts |
| evidence_cmd.py | gnomad module | orchestrates download-transform-load pipeline | ✓ WIRED | Imports download_constraint_metrics, process_gnomad_constraint, load_to_duckdb (lines 20-23), calls in sequence (lines 145, 172, 197) with error handling |
| main.py | evidence_cmd.py | registers evidence command group | ✓ WIRED | Imports evidence (main.py:14), cli.add_command(evidence) registers command group, verified with CLI help output |
**Status:** All 7 key links verified as wired
### Requirements Coverage
Based on .planning/REQUIREMENTS.md Phase 02 requirements:
| Requirement | Status | Evidence |
|-------------|--------|----------|
| **GCON-01**: pLI and LOEUF retrieved and stored per gene | ✓ SATISFIED | ConstraintRecord includes pli and loeuf fields (models.py:55-56), stored in gnomad_constraint DuckDB table (load.py:39-44) |
| **GCON-02**: Coverage quality filter with quality flags | ✓ SATISFIED | filter_by_coverage() adds quality_flag column with 3 categories: measured (good coverage + data), incomplete_coverage (low coverage), no_data (NULL pli/loeuf) (transform.py:42-60) |
| **GCON-03**: Constraint treated as weak signal | ✓ SATISFIED | query_constrained_genes() docstring explicitly states: "constrained genes are 'important but under-studied' signals, not direct cilia involvement evidence" (load.py:72-73) |
**Status:** All 3 requirements satisfied
### Anti-Patterns Found
**Scanned files:** All files in src/usher_pipeline/evidence/gnomad/, src/usher_pipeline/cli/evidence_cmd.py, tests/test_gnomad*.py
| File | Line | Pattern | Severity | Impact |
|------|------|---------|----------|--------|
| - | - | None found | - | - |
**Checks performed:**
- TODO/FIXME/PLACEHOLDER comments: None found
- Empty implementations (return null/{}): None found
- Console.log debugging: None found
- Stub functions: None found
**Status:** No anti-patterns detected
### Human Verification Required
None required. All verification can be performed programmatically:
- File existence: Verified via filesystem checks
- Function implementations: Verified via code inspection
- Wiring: Verified via import/call tracing
- Tests: Verified via pytest execution (70 tests passing)
- CLI: Verified via --help output
### Phase Goal Achievement Summary
**Phase Goal:** Validate retrieval-to-storage pattern with single evidence layer
**Achievement Status:** FULLY ACHIEVED
**Evidence:**
1. **Retrieval pattern established:** download_constraint_metrics() with httpx streaming, retry logic, checkpoint-restart
2. **Transform pattern established:** filter_by_coverage() and normalize_scores() with NULL preservation, quality categorization
3. **Storage pattern established:** load_to_duckdb() with provenance tracking, idempotent CREATE OR REPLACE
4. **Full pipeline demonstrated:** CLI evidence gnomad orchestrates fetch→transform→load with checkpoint detection
5. **Pattern is reusable:** Evidence command group structure (evidence_cmd.py) is extensible for future evidence sources
**All 4 success criteria from ROADMAP.md satisfied:**
1. ✓ Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all genes
2. ✓ Constraint scores filtered by coverage quality (>30x depth, >90% CDS) with quality flags
3. ✓ Missing data encoded as "unknown" (NULL) not zero, preserving incomplete coverage genes
4. ✓ Normalized scores written to DuckDB with checkpoint-restart capability
**Test coverage:** 27 tests (15 unit + 12 integration) all passing, no regressions in 70-test suite
**Phase deliverables complete and verified.**
---
_Verified: 2026-02-11T19:30:00Z_
_Verifier: Claude (gsd-verifier)_