From c97592d62903c690dfcc8581f98ce805934fce4e Mon Sep 17 00:00:00 2001
From: gbanyan
Date: Thu, 12 Feb 2026 04:53:05 +0800
Subject: [PATCH] docs(06-03): complete comprehensive validation report & CLI plan

- Created 06-03-SUMMARY.md documenting plan execution
- Updated STATE.md:
  - Current Position: Phase 6 COMPLETE (21/21 plans)
  - Performance Metrics: Phase 06 3/3 plans, 10 min total, 3.3 min/plan avg
  - Added decisions for comprehensive validation report and weight tuning recommendations
  - Session Continuity: Stopped at 06-03-PLAN.md completion

Co-Authored-By: Claude Opus 4.6
---
 .planning/STATE.md | 20 +-
 .../phases/06-validation/06-03-SUMMARY.md | 331 ++++++++++++++++++
 2 files changed, 343 insertions(+), 8 deletions(-)
 create mode 100644 .planning/phases/06-validation/06-03-SUMMARY.md

diff --git a/.planning/STATE.md b/.planning/STATE.md
index 9479f51..50aee82 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -10,16 +10,16 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 ## Current Position

 Phase: 6 of 6 (Validation)
-Plan: 3 of 3 in current phase (plans 06-01, 06-02 complete)
-Status: Phase 6 in progress — plans 06-01 and 06-02 complete
-Last activity: 2026-02-12 — Completed 06-02: Sensitivity Analysis Module
+Plan: 3 of 3 in current phase (all plans complete)
+Status: Phase 6 COMPLETE — all validation plans complete
+Last activity: 2026-02-12 — Completed 06-03: Comprehensive Validation Report & CLI

-Progress: [██████████] 100.0% (20/20 plans complete across all phases)
+Progress: [██████████] 100.0% (21/21 plans complete across all phases)

 ## Performance Metrics

 **Velocity:**
-- Total plans completed: 20
+- Total plans completed: 21
 - Average duration: 4.6 min
 - Total execution time: 1.6 hours

@@ -32,7 +32,7 @@ Progress: [██████████] 100.0% (20/20 plans complete across a
 | 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min/plan |
 | 04 - Scoring Integration | 3/3 | 10 min | 3.3 min/plan |
 | 05 - Output & CLI | 3/3 | 12 min | 4.0 min/plan |
-| 06 - Validation | 2/3 | 5 min | 2.5 min/plan |
+| 06 - Validation | 3/3 | 10 min | 3.3 min/plan |

 **Recent Plan Details:**
 | Plan | Duration | Tasks | Files |
@@ -45,6 +45,7 @@ Progress: [██████████] 100.0% (20/20 plans complete across a
 | Phase 05 P03 | 3 min | 2 tasks | 3 files |
 | Phase 06 P01 | 2 min | 2 tasks | 3 files |
 | Phase 06 P02 | 3 min | 2 tasks | 2 files |
+| Phase 06 P03 | 5 min | 2 tasks | 5 files |

 ## Accumulated Context

@@ -141,6 +142,9 @@ Recent decisions affecting current work:
 - [06-02]: Top-N default 100 genes for ranking comparison (relevant for candidate prioritization)
 - [06-02]: Minimum overlap 10 genes required for Spearman correlation (avoids meaningless correlations)
 - [06-02]: Per-layer sensitivity tracking (most_sensitive_layer and most_robust_layer computed from mean rho)
+- [06-03]: Comprehensive validation report combines positive, negative, and sensitivity prongs in single Markdown document
+- [06-03]: Weight tuning recommendations include critical circular validation warnings (post-validation tuning invalidates controls)
+- [06-03]: CLI validate command provides --skip-sensitivity flag for faster iteration during development

 ### Pending Todos

@@ -153,5 +157,5 @@ None yet.
 ## Session Continuity

 Last session: 2026-02-12 - Phase 6 execution
-Stopped at: Completed 06-02-PLAN.md (Sensitivity Analysis Module)
-Resume file: .planning/phases/06-validation/06-02-SUMMARY.md
+Stopped at: Completed 06-03-PLAN.md (Comprehensive Validation Report & CLI)
+Resume file: .planning/phases/06-validation/06-03-SUMMARY.md
diff --git a/.planning/phases/06-validation/06-03-SUMMARY.md b/.planning/phases/06-validation/06-03-SUMMARY.md
new file mode 100644
index 0000000..50f2297
--- /dev/null
+++ b/.planning/phases/06-validation/06-03-SUMMARY.md
@@ -0,0 +1,331 @@
+---
+phase: 06-validation
+plan: 03
+subsystem: validation
+tags: [comprehensive-validation, cli-validate, validation-report, weight-tuning, unit-tests]
+dependency_graph:
+  requires: [06-01-negative-controls-recall, 06-02-sensitivity-analysis]
+  provides: [comprehensive-validation-report, cli-validate-command, validation-tests]
+  affects: [validation-workflow-completion]
+tech_stack:
+  added: []
+  patterns: [comprehensive-validation-pipeline, weight-tuning-recommendations, cli-orchestration]
+key_files:
+  created:
+    - src/usher_pipeline/scoring/validation_report.py
+    - src/usher_pipeline/cli/validate_cmd.py
+    - tests/test_validation.py
+  modified:
+    - src/usher_pipeline/cli/main.py
+    - src/usher_pipeline/scoring/__init__.py
+decisions:
+  - "Comprehensive validation report combines positive, negative, and sensitivity prongs in single Markdown document"
+  - "Weight tuning recommendations are guidance-only with critical circular validation warnings"
+  - "CLI validate command follows score_cmd.py pattern with --force, --skip-sensitivity, --output-dir, --top-n options"
+  - "Tests use synthetic DuckDB data with designed score patterns (known genes high, housekeeping low)"
+  - "Validation report saved to {data_dir}/validation/validation_report.md by default"
+metrics:
+  duration_minutes: 5
+  completed_date: 2026-02-12
+  tasks_completed: 2
+  files_created: 3
+  files_modified: 2
+  commits: 2
+---
+
+# Phase 6 Plan 03: Comprehensive Validation Report & CLI Summary
+
+CLI validate command orchestrating full validation pipeline (positive controls, negative controls, sensitivity analysis) with comprehensive Markdown report and weight tuning recommendations.
+
+## Tasks Completed
+
+### Task 1: Create comprehensive validation report and CLI validate command
+**Status:** Complete
+**Commit:** 10f19f8
+**Files:** src/usher_pipeline/scoring/validation_report.py, src/usher_pipeline/cli/validate_cmd.py, src/usher_pipeline/cli/main.py, src/usher_pipeline/scoring/__init__.py
+
+**Created validation_report.py** with comprehensive report generation:
+
+- **generate_comprehensive_validation_report()**: Multi-section Markdown report combining all three validation prongs
+  - Section 1: Positive Control Validation (median percentile, recall@k table, per-source breakdown, pass/fail)
+  - Section 2: Negative Control Validation (median percentile, top quartile count, in-HIGH-tier count, pass/fail)
+  - Section 3: Sensitivity Analysis (Spearman rho table by layer × delta, stability verdict, most/least sensitive layers)
+  - Section 4: Overall Validation Summary (all-pass/partial-fail/fail verdict with interpretation)
+  - Section 5: Weight Tuning Recommendations (targeted suggestions based on validation results)
+
+- **recommend_weight_tuning()**: Analyzes validation results and provides weight adjustment guidance
+  - All validations pass → "Current weights are validated. No tuning recommended."
+  - Positive controls fail → Suggest increasing weights for layers where known genes score highly
+  - Negative controls fail → Suggest examining layers boosting housekeeping genes, reducing generic layer weights
+  - Sensitivity unstable → Identify most sensitive layer, suggest reducing its weight
+  - **CRITICAL WARNING:** Documents circular validation risk (post-validation tuning invalidates controls)
+  - Provides best practices: independent validation set, document rationale, prefer a priori weights
+
+- **save_validation_report()**: Persists report to file with parent directory creation
+
+**Created validate_cmd.py** CLI command following score_cmd.py pattern:
+
+- Click command `validate` with options:
+  - `--force`: Re-run even if validation checkpoint exists
+  - `--skip-sensitivity`: Skip sensitivity analysis for faster iteration
+  - `--output-dir`: Custom output directory (default: {data_dir}/validation)
+  - `--top-n`: Top N genes for sensitivity analysis (default: 100)
+
+- Pipeline steps:
+  1. Load configuration and initialize store
+  2. Check scored_genes checkpoint exists (error if not - must run `score` first)
+  3. Run positive control validation (validate_positive_controls_extended)
+  4. Run negative control validation (validate_negative_controls)
+  5. Run sensitivity analysis (unless --skip-sensitivity) - run_sensitivity_analysis + summarize_sensitivity
+  6. Generate comprehensive validation report (generate_comprehensive_validation_report)
+  7. Save report to output_dir/validation_report.md and provenance sidecar
+
+- Styled output with click.echo patterns (green for success, yellow for warnings, red for errors, bold for step headers)
+- Provenance tracking: record_step for each validation phase with metrics
+- Final summary: displays overall pass/fail, recall@10%, housekeeping median percentile, sensitivity stability
+
+**Updated main.py:**
+- Imported validate from validate_cmd
+- Added `cli.add_command(validate)` following existing pattern
+
+**Updated scoring.__init__.py:**
+- Added validation_report imports: generate_comprehensive_validation_report, recommend_weight_tuning, save_validation_report
+- Added all 3 functions to __all__ exports
+
+**Verification:** Both verification commands passed:
+- `python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')"` → OK
+- `python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands"` → OK
+
+### Task 2: Create unit tests for all validation modules
+**Status:** Complete
+**Commit:** 5879ae9
+**Files:** tests/test_validation.py
+
+**Created test_validation.py** with 13 comprehensive tests using synthetic DuckDB data:
+
+**Test helper:**
+- **create_synthetic_scored_db()**: Creates DuckDB with gene_universe (20 genes) and scored_genes table
+  - Designed scores ensure known cilia genes (MYO7A, IFT88, BBS1) get high scores (0.85-0.92)
+  - Housekeeping genes (GAPDH, ACTB, RPL13A) get low scores (0.12-0.20)
+  - Filler genes get mid-range scores (0.35-0.58)
+  - Includes all 6 layer scores and quality_flag
+  - Creates known_genes table with 3 genes (1 OMIM, 2 SYSCILIA)
+
+**Tests for negative controls (4 tests):**
+1. **test_compile_housekeeping_genes_structure**: Verifies compile_housekeeping_genes() returns DataFrame with 13 genes, correct columns (gene_symbol, source, confidence), all confidence=HIGH, all source=literature_validated
+
+2. **test_compile_housekeeping_genes_known_genes_present**: Asserts GAPDH, ACTB, RPL13A, TBP are in gene_symbol column
+
+3. **test_validate_negative_controls_with_synthetic_data**: Uses synthetic DB where housekeeping genes score low, asserts validation_passed=True, median_percentile < 0.5
+
+4. **test_validate_negative_controls_inverted_logic**: Creates DB where housekeeping genes score HIGH (artificial scenario), asserts validation_passed=False
+
+**Tests for recall@k (1 test):**
+5. **test_compute_recall_at_k**: Uses synthetic DB, asserts recall@k returns dict with recalls_absolute and recalls_percentage keys. With 3 known genes in dataset (out of 38 total from compile_known_genes), recall@100 = 3/38 = 0.0789
+
+**Tests for weight perturbation (3 tests):**
+6. **test_perturb_weight_renormalizes**: Perturbs gnomad by +0.10, asserts weights still sum to 1.0 within 1e-6 tolerance
+
+7. **test_perturb_weight_large_negative**: Perturbs by -0.25 (more than weight value), asserts weight >= 0.0 (clamped) and sum = 1.0
+
+8. **test_perturb_weight_invalid_layer**: Asserts perturb_weight with layer="nonexistent" raises ValueError
+
+**Tests for validation report (5 tests):**
+9. **test_generate_comprehensive_validation_report_format**: Passes mock metrics dicts, asserts report contains expected sections ("Positive Control", "Negative Control", "Sensitivity Analysis", "Weight Tuning")
+
+10. **test_recommend_weight_tuning_all_pass**: Passes metrics indicating all validations pass, asserts response contains "No tuning recommended"
+
+11. **test_recommend_weight_tuning_positive_fail**: Passes metrics with positive controls failed, asserts response contains "Known Gene Ranking Issue" or "Positive Control"
+
+12. **test_recommend_weight_tuning_negative_fail**: Passes metrics with negative controls failed, asserts response contains "Housekeeping" or "Negative Control"
+
+13. **test_recommend_weight_tuning_sensitivity_fail**: Passes metrics with sensitivity unstable, asserts response contains "Sensitivity" or "gnomad"
+
+**Verification:** All 13 tests passed:
+```
+tests/test_validation.py::test_compile_housekeeping_genes_structure PASSED
+tests/test_validation.py::test_compile_housekeeping_genes_known_genes_present PASSED
+tests/test_validation.py::test_validate_negative_controls_with_synthetic_data PASSED
+tests/test_validation.py::test_validate_negative_controls_inverted_logic PASSED
+tests/test_validation.py::test_compute_recall_at_k PASSED
+tests/test_validation.py::test_perturb_weight_renormalizes PASSED
+tests/test_validation.py::test_perturb_weight_large_negative PASSED
+tests/test_validation.py::test_perturb_weight_invalid_layer PASSED
+tests/test_validation.py::test_generate_comprehensive_validation_report_format PASSED
+tests/test_validation.py::test_recommend_weight_tuning_all_pass PASSED
+tests/test_validation.py::test_recommend_weight_tuning_positive_fail PASSED
+tests/test_validation.py::test_recommend_weight_tuning_negative_fail PASSED
+tests/test_validation.py::test_recommend_weight_tuning_sensitivity_fail PASSED
+======================== 13 passed, 1 warning in 0.79s =========================
+```
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Verification Results
+
+All verification checks passed:
+
+1. `python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')"` → OK
+2. `python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands"` → OK
+3. `python -m pytest tests/test_validation.py -v` → 13 passed, 0 failed
+
+## Success Criteria
+
+- [x] CLI `validate` command runs positive + negative + sensitivity validations and generates comprehensive report
+- [x] Validation report includes all three prongs with pass/fail verdicts and weight tuning recommendations
+- [x] Unit tests cover negative controls, recall@k, perturbation, and report generation
+- [x] All tests pass with synthetic data
+- [x] validate command registered in main CLI
+
+## Key Files
+
+### Created
+- **src/usher_pipeline/scoring/validation_report.py** (410 lines)
+  - Comprehensive validation report generation combining all three validation prongs
+  - Exports: generate_comprehensive_validation_report, recommend_weight_tuning, save_validation_report
+
+- **src/usher_pipeline/cli/validate_cmd.py** (408 lines)
+  - CLI validate command orchestrating full validation pipeline
+  - Exports: validate (Click command)
+
+- **tests/test_validation.py** (478 lines)
+  - Unit tests for negative controls, recall@k, sensitivity, and validation report
+  - 13 tests with synthetic DuckDB fixture
+
+### Modified
+- **src/usher_pipeline/cli/main.py** (+2 lines)
+  - Added validate command import and registration
+
+- **src/usher_pipeline/scoring/__init__.py** (+7 lines)
+  - Added validation_report module exports
+
+## Integration Points
+
+**Depends on:**
+- Phase 06-01: Negative control validation (validate_negative_controls) and positive control validation (validate_positive_controls_extended, compute_recall_at_k)
+- Phase 06-02: Sensitivity analysis (run_sensitivity_analysis, summarize_sensitivity)
+- Phase 04-02: scored_genes checkpoint (validation requires scoring to be complete)
+
+**Provides:**
+- Comprehensive validation report combining all three validation prongs
+- CLI `validate` command for user-facing validation workflow
+- Unit test coverage for all validation modules
+
+**Affects:**
+- Phase 6 completion: This is the final plan in validation phase
+- User workflow: Provides `usher-pipeline validate` command for validation step
+
+## Technical Notes
+
+**Comprehensive Validation Report Design:**
+
+The report combines three complementary validation approaches:
+1. **Positive controls** (Plan 06-01): Known genes should rank high → validates sensitivity
+2. **Negative controls** (Plan 06-01): Housekeeping genes should rank low → validates specificity
+3. **Sensitivity analysis** (Plan 06-02): Rankings stable under perturbations → validates robustness
+
+If all three pass: scoring system is sensitive, specific, and robust.
+
+**Overall Validation Verdict Logic:**
+- **All pass** → "ALL VALIDATIONS PASSED ✓" (scientifically defensible)
+- **Pos + Neg pass, Sensitivity fail** → "PARTIAL PASS (Sensitivity Unstable)" (directionally correct but may need weight tuning)
+- **Pos pass, Neg fail** → "PARTIAL PASS (Specificity Issue)" (sensitive but not specific)
+- **Pos fail** → "VALIDATION FAILED ✗" (fundamental scoring issues)
+
+**Weight Tuning Recommendations Philosophy:**
+
+Recommendations are **guidance**, not automatic actions. They suggest:
+- Which layers to adjust (increase/decrease weights)
+- Why adjustments are needed (based on validation failures)
+- How to interpret failures (specificity vs sensitivity vs stability)
+
+**CRITICAL WARNING** included in all recommendations:
+- Post-validation tuning introduces **circular validation risk**
+- If weights are tuned based on validation results, those same controls cannot validate the tuned weights
+- Best practices: independent validation set, document rationale, prefer a priori weights
+
+This prevents the pitfall identified in 06-RESEARCH.md: "Tuning weights to maximize known gene recall, then using known gene recall as validation."
+
+**CLI validate Command Design:**
+
+Follows established pattern from score_cmd.py:
+1. Click command with options (--force, --skip-sensitivity, --output-dir, --top-n)
+2. Step-by-step pipeline with styled output (bold headers, colored status)
+3. Checkpoint-restart support (skips if validation_report.md exists unless --force)
+4. Provenance tracking for all steps (record_step for each validation phase)
+5. Final summary with overall status and key metrics
+6. Error handling with sys.exit(1) on failures
+
+**Test Design Philosophy:**
+
+All tests use **synthetic DuckDB data** with **designed score patterns**:
+- Known genes get high scores (0.85-0.92) → positive controls should pass
+- Housekeeping genes get low scores (0.12-0.20) → negative controls should pass
+- Deterministic scores enable precise assertions
+
+Tests cover:
+- **Happy path**: Validations pass with expected data
+- **Inverted logic**: Validations fail when data is wrong (housekeeping genes high)
+- **Edge cases**: Large negative perturbations, invalid layer names
+- **Format verification**: Report contains expected sections
+- **Recommendation logic**: Different tuning suggestions for different failure modes
+
+**Usage Pattern:**
+
+```bash
+# Full validation pipeline
+usher-pipeline validate
+
+# Skip sensitivity analysis (faster iteration)
+usher-pipeline validate --skip-sensitivity
+
+# Custom output directory
+usher-pipeline validate --output-dir results/validation
+
+# More genes for sensitivity (default 100)
+usher-pipeline validate --top-n 200
+
+# Force re-run
+usher-pipeline validate --force
+```
+
+**Expected Workflow:**
+
+1. User runs `usher-pipeline score` (Phase 04-03)
+2. User runs `usher-pipeline validate` (this plan)
+3. User reviews validation report at {data_dir}/validation/validation_report.md
+4. If all pass: proceed to candidate prioritization
+5. If failures: review weight tuning recommendations, adjust weights with biological justification, re-run
+
+**Phase 6 Completion:**
+
+This plan completes Phase 6 (Validation). All three plans executed:
+- 06-01: Negative controls and recall@k (2 min)
+- 06-02: Sensitivity analysis (3 min)
+- 06-03: Comprehensive validation report and CLI (5 min)
+
+Total Phase 6 duration: 10 minutes across 3 plans.
+
+## Self-Check: PASSED
+
+**Created files verified:**
+- [x] src/usher_pipeline/scoring/validation_report.py exists (410 lines)
+- [x] src/usher_pipeline/cli/validate_cmd.py exists (408 lines)
+- [x] tests/test_validation.py exists (478 lines)
+
+**Modified files verified:**
+- [x] src/usher_pipeline/cli/main.py updated with validate import and registration
+- [x] src/usher_pipeline/scoring/__init__.py updated with validation_report exports
+
+**Commits verified:**
+- [x] 10f19f8: Task 1 commit exists (comprehensive validation report and CLI validate command)
+- [x] 5879ae9: Task 2 commit exists (unit tests for all validation modules)
+
+**Functionality verified:**
+- [x] validate command imports correctly (Command name: validate, OK)
+- [x] validate registered in CLI (CLI commands includes 'validate', OK)
+- [x] All 13 tests pass (pytest reports 13 passed, 0 failed)
+
+All claims in summary verified against actual implementation.
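
Reviewer note (not part of the patch): the weight perturbation behavior that tests 6 to 8 in the summary check (renormalization to a sum of 1.0, clamping at zero, and rejecting unknown layers) could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the code in the sensitivity module: the function body, the starting weights, and every layer name other than `gnomad` are hypothetical, and the real `perturb_weight` signature may differ.

```python
from typing import Dict

# Hypothetical starting weights for six layers. Only "gnomad" is named in the
# summary; the other layer names and all numeric values are placeholders.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "gnomad": 0.20,
    "layer_b": 0.16,
    "layer_c": 0.16,
    "layer_d": 0.16,
    "layer_e": 0.16,
    "layer_f": 0.16,
}


def perturb_weight(weights: Dict[str, float], layer: str, delta: float) -> Dict[str, float]:
    """Return a copy of `weights` with `layer` shifted by `delta`, then renormalized.

    Assumed behavior, inferred from the test descriptions:
    - an unknown layer name raises ValueError
    - the perturbed weight is clamped at 0.0 so it never goes negative
    - all weights are rescaled so they sum to 1.0 again
    """
    if layer not in weights:
        raise ValueError(f"Unknown layer: {layer!r}")

    perturbed = dict(weights)  # do not mutate the caller's dict
    perturbed[layer] = max(0.0, perturbed[layer] + delta)

    total = sum(perturbed.values())
    if total <= 0.0:
        raise ValueError("All weights are zero after perturbation; cannot renormalize")

    return {name: value / total for name, value in perturbed.items()}


if __name__ == "__main__":
    increased = perturb_weight(EXAMPLE_WEIGHTS, "gnomad", 0.10)
    clamped = perturb_weight(EXAMPLE_WEIGHTS, "gnomad", -0.25)
    print("sum after +0.10:", sum(increased.values()))
    print("gnomad after -0.25:", clamped["gnomad"])
```

Clamping before renormalizing keeps every weight non-negative, and dividing by the new total pushes the change in one layer proportionally onto the others, which is what lets a single-layer perturbation be compared against the baseline ranking.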