---
phase: 06-validation
plan: 03
subsystem: validation
tags: [comprehensive-validation, cli-validate, validation-report, weight-tuning, unit-tests]
dependency_graph:
  requires: [06-01-negative-controls-recall, 06-02-sensitivity-analysis]
  provides: [comprehensive-validation-report, cli-validate-command, validation-tests]
  affects: [validation-workflow-completion]
tech_stack:
  added: []
  patterns: [comprehensive-validation-pipeline, weight-tuning-recommendations, cli-orchestration]
key_files:
  created:
    - src/usher_pipeline/scoring/validation_report.py
    - src/usher_pipeline/cli/validate_cmd.py
    - tests/test_validation.py
  modified:
    - src/usher_pipeline/cli/main.py
    - src/usher_pipeline/scoring/__init__.py
decisions:
  - "Comprehensive validation report combines positive, negative, and sensitivity prongs in single Markdown document"
  - "Weight tuning recommendations are guidance-only with critical circular validation warnings"
  - "CLI validate command follows score_cmd.py pattern with --force, --skip-sensitivity, --output-dir, --top-n options"
  - "Tests use synthetic DuckDB data with designed score patterns (known genes high, housekeeping low)"
  - "Validation report saved to {data_dir}/validation/validation_report.md by default"
metrics:
  duration_minutes: 5
  completed_date: 2026-02-12
  tasks_completed: 2
  files_created: 3
  files_modified: 2
  commits: 2
---
# Phase 6 Plan 03: Comprehensive Validation Report & CLI Summary
CLI `validate` command orchestrating the full validation pipeline (positive controls, negative controls, sensitivity analysis), producing a comprehensive Markdown report with weight tuning recommendations.
## Tasks Completed
### Task 1: Create comprehensive validation report and CLI validate command
**Status:** Complete
**Commit:** 10f19f8
**Files:** src/usher_pipeline/scoring/validation_report.py, src/usher_pipeline/cli/validate_cmd.py, src/usher_pipeline/cli/main.py, src/usher_pipeline/scoring/__init__.py
**Created validation_report.py** with comprehensive report generation:
- **generate_comprehensive_validation_report()**: Multi-section Markdown report combining all three validation prongs (a structural sketch follows this list)
  - Section 1: Positive Control Validation (median percentile, recall@k table, per-source breakdown, pass/fail)
  - Section 2: Negative Control Validation (median percentile, top quartile count, in-HIGH-tier count, pass/fail)
  - Section 3: Sensitivity Analysis (Spearman rho table by layer × delta, stability verdict, most/least sensitive layers)
  - Section 4: Overall Validation Summary (all-pass/partial-fail/fail verdict with interpretation)
  - Section 5: Weight Tuning Recommendations (targeted suggestions based on validation results)
- **recommend_weight_tuning()**: Analyzes validation results and provides weight adjustment guidance
  - All validations pass → "Current weights are validated. No tuning recommended."
  - Positive controls fail → suggest increasing weights for layers where known genes score highly
  - Negative controls fail → suggest examining layers that boost housekeeping genes and reducing generic layer weights
  - Sensitivity unstable → identify the most sensitive layer and suggest reducing its weight
  - **CRITICAL WARNING:** documents the circular validation risk (post-validation tuning invalidates the controls)
  - Provides best practices: independent validation set, documented rationale, preference for a priori weights
- **save_validation_report()**: Persists the report to file, creating parent directories as needed
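As a rough illustration of the report's shape, here is a minimal, self-contained sketch of a multi-section Markdown builder and a save helper. It is not the project's validation_report.py; the metric keys (`median_percentile`, `validation_passed`, `stable`) and section wording are assumptions drawn from this summary.
```python
# Illustrative sketch only, not the actual validation_report.py implementation.
from pathlib import Path


def build_report(positive: dict, negative: dict, sensitivity: dict | None) -> str:
    """Assemble validation sections into one Markdown document (illustrative)."""
    lines = ["# Comprehensive Validation Report", ""]
    lines += [
        "## 1. Positive Control Validation",
        f"- Median percentile of known genes: {positive['median_percentile']:.3f}",
        f"- Verdict: {'PASS' if positive['validation_passed'] else 'FAIL'}",
        "",
        "## 2. Negative Control Validation",
        f"- Median percentile of housekeeping genes: {negative['median_percentile']:.3f}",
        f"- Verdict: {'PASS' if negative['validation_passed'] else 'FAIL'}",
        "",
    ]
    if sensitivity is not None:
        lines += [
            "## 3. Sensitivity Analysis",
            f"- Rankings stable under perturbation: {sensitivity['stable']}",
            "",
        ]
    return "\n".join(lines)


def save_report(report_md: str, path: Path) -> None:
    """Write the report, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(report_md, encoding="utf-8")
```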
**Created validate_cmd.py** CLI command following the score_cmd.py pattern:
- Click command `validate` with options (see the skeleton sketch after this list):
  - `--force`: re-run even if a validation checkpoint exists
  - `--skip-sensitivity`: skip sensitivity analysis for faster iteration
  - `--output-dir`: custom output directory (default: {data_dir}/validation)
  - `--top-n`: top N genes for sensitivity analysis (default: 100)
- Pipeline steps:
  1. Load configuration and initialize the store
  2. Check that the scored_genes checkpoint exists (error if not; `score` must be run first)
  3. Run positive control validation (validate_positive_controls_extended)
  4. Run negative control validation (validate_negative_controls)
  5. Run sensitivity analysis unless --skip-sensitivity (run_sensitivity_analysis + summarize_sensitivity)
  6. Generate the comprehensive validation report (generate_comprehensive_validation_report)
  7. Save the report to output_dir/validation_report.md with a provenance sidecar
- Styled output with click.echo patterns (green for success, yellow for warnings, red for errors, bold for step headers)
- Provenance tracking: record_step for each validation phase with metrics
- Final summary: displays overall pass/fail, recall@10%, housekeeping median percentile, and sensitivity stability
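A hedged skeleton of such a Click command is shown below. The option names come from this summary; the callback body, helper function, and messages are placeholders rather than the project's actual validate_cmd.py.
```python
# Hedged skeleton only; the helper below is a stub, not the real checkpoint lookup.
import sys

import click


def scored_checkpoint_exists() -> bool:
    """Stub standing in for the real scored_genes checkpoint check."""
    return True


@click.command(name="validate")
@click.option("--force", is_flag=True, help="Re-run even if a validation checkpoint exists.")
@click.option("--skip-sensitivity", is_flag=True, help="Skip sensitivity analysis for faster iteration.")
@click.option("--output-dir", type=click.Path(), default=None, help="Custom output directory.")
@click.option("--top-n", type=int, default=100, show_default=True, help="Top N genes for sensitivity analysis.")
def validate(force: bool, skip_sensitivity: bool, output_dir: str | None, top_n: int) -> None:
    """Run positive, negative, and (optionally) sensitivity validation."""
    click.secho("Step 1: checking scored_genes checkpoint", bold=True)
    if not scored_checkpoint_exists():
        click.secho("scored_genes checkpoint missing; run `score` first", fg="red")
        sys.exit(1)
    # ... run the three validation prongs and write the report ...
    click.secho("Validation complete", fg="green")
```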
**Updated main.py:**
- Imported validate from validate_cmd
- Added `cli.add_command(validate)` following existing pattern
**Updated scoring.__init__.py:**
- Added validation_report imports: generate_comprehensive_validation_report, recommend_weight_tuning, save_validation_report
- Added all 3 functions to __all__ exports
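The registration itself follows the usual Click pattern; a simplified sketch is shown below, assuming the package layout named in this summary (the real main.py already defines the `cli` group and other commands).
```python
# Simplified registration pattern sketch; not the full main.py.
import click

from usher_pipeline.cli.validate_cmd import validate  # new import


@click.group()
def cli() -> None:
    """usher-pipeline entry point (reduced for illustration)."""


cli.add_command(validate)  # registers `usher-pipeline validate`
```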
**Verification:** Both verification commands passed:
- `python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')"` → OK
- `python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands"` → OK
### Task 2: Create unit tests for all validation modules
**Status:** Complete
**Commit:** 5879ae9
**Files:** tests/test_validation.py
**Created test_validation.py** with 13 tests using synthetic DuckDB data:
**Test helper:**
- **create_synthetic_scored_db()** (sketched just below): creates a DuckDB database with gene_universe (20 genes) and scored_genes tables
  - Designed scores ensure known cilia genes (MYO7A, IFT88, BBS1) get high scores (0.85-0.92)
  - Housekeeping genes (GAPDH, ACTB, RPL13A) get low scores (0.12-0.20)
  - Filler genes get mid-range scores (0.35-0.58)
  - Includes all 6 layer scores and quality_flag
  - Creates a known_genes table with 3 genes (1 OMIM, 2 SYSCILIA)
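The idea behind the fixture is deterministic, designed scores. The sketch below captures that idea in a self-contained form; table and column names beyond those mentioned above are hypothetical, and this is not the actual create_synthetic_scored_db() helper.
```python
# Illustrative fixture sketch: known genes high, housekeeping low, fillers mid.
import duckdb


def make_synthetic_scored_db(db_path: str) -> None:
    con = duckdb.connect(db_path)
    con.execute(
        "CREATE TABLE scored_genes (gene_symbol VARCHAR, composite_score DOUBLE)"
    )
    rows = [
        # Known cilia genes: designed to score high (positive controls pass).
        ("MYO7A", 0.92), ("IFT88", 0.88), ("BBS1", 0.85),
        # Housekeeping genes: designed to score low (negative controls pass).
        ("GAPDH", 0.12), ("ACTB", 0.15), ("RPL13A", 0.20),
        # Filler genes in the mid range.
        ("FILLER01", 0.58), ("FILLER02", 0.47), ("FILLER03", 0.35),
    ]
    con.executemany("INSERT INTO scored_genes VALUES (?, ?)", rows)
    con.close()
```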
**Tests for negative controls (4 tests):**
1. **test_compile_housekeeping_genes_structure**: Verifies compile_housekeeping_genes() returns DataFrame with 13 genes, correct columns (gene_symbol, source, confidence), all confidence=HIGH, all source=literature_validated
2. **test_compile_housekeeping_genes_known_genes_present**: Asserts GAPDH, ACTB, RPL13A, TBP are in gene_symbol column
3. **test_validate_negative_controls_with_synthetic_data**: Uses synthetic DB where housekeeping genes score low, asserts validation_passed=True, median_percentile < 0.5
4. **test_validate_negative_controls_inverted_logic**: Creates DB where housekeeping genes score HIGH (artificial scenario), asserts validation_passed=False
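What these four tests assert can be summarized by the check below: housekeeping genes should sit low in the overall score distribution. This is a self-contained sketch; the 0.5 median and top-quartile thresholds are illustrative assumptions, not necessarily the project's configured criteria.
```python
# Sketch of the negative-control pass/fail logic the tests above exercise.
import numpy as np


def negative_control_check(all_scores: np.ndarray, housekeeping_scores: np.ndarray) -> dict:
    # Percentile rank of each housekeeping gene within the full distribution.
    percentiles = np.array([(all_scores < s).mean() for s in housekeeping_scores])
    median_percentile = float(np.median(percentiles))
    top_quartile_count = int((percentiles >= 0.75).sum())
    return {
        "median_percentile": median_percentile,
        "top_quartile_count": top_quartile_count,
        "validation_passed": median_percentile < 0.5 and top_quartile_count == 0,
    }
```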
**Tests for recall@k (1 test):**
5. **test_compute_recall_at_k**: Uses synthetic DB, asserts recall@k returns dict with recalls_absolute and recalls_percentage keys. With 3 known genes in dataset (out of 38 total from compile_known_genes), recall@100 = 3/38 = 0.0789
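The arithmetic in that test follows from computing recall against all known genes (38 from compile_known_genes): with only 3 of them present in the 20-gene synthetic dataset, the best achievable recall@100 is 3/38 ≈ 0.0789. A minimal sketch of that computation:
```python
# Recall@k against the full known-gene set (illustrative helper).
def recall_at_k(ranked_genes: list[str], known_genes: set[str], k: int) -> float:
    hits = sum(1 for gene in ranked_genes[:k] if gene in known_genes)
    return hits / len(known_genes)
```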
**Tests for weight perturbation (3 tests):**
6. **test_perturb_weight_renormalizes**: Perturbs gnomad by +0.10, asserts weights still sum to 1.0 within 1e-6 tolerance
7. **test_perturb_weight_large_negative**: Perturbs by -0.25 (more than weight value), asserts weight >= 0.0 (clamped) and sum = 1.0
8. **test_perturb_weight_invalid_layer**: Asserts perturb_weight with layer="nonexistent" raises ValueError
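These three tests pin down the perturb-and-renormalize behaviour: add a delta to one layer's weight, clamp at zero, then rescale so all weights sum to 1.0. The sketch below is self-contained and illustrative, not the project's actual perturb_weight() implementation.
```python
# Perturb one layer weight, clamp at zero, and renormalize to sum to 1.0.
def perturb_weight(weights: dict[str, float], layer: str, delta: float) -> dict[str, float]:
    if layer not in weights:
        raise ValueError(f"Unknown layer: {layer!r}")
    perturbed = dict(weights)
    perturbed[layer] = max(0.0, perturbed[layer] + delta)  # clamp at zero
    total = sum(perturbed.values())
    return {name: value / total for name, value in perturbed.items()}
```
For example, perturbing {"gnomad": 0.3, "expression": 0.7} by +0.10 on gnomad gives 0.4/1.1 and 0.7/1.1, which still sum to 1.0.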
**Tests for validation report (5 tests):**
9. **test_generate_comprehensive_validation_report_format**: Passes mock metrics dicts, asserts report contains expected sections ("Positive Control", "Negative Control", "Sensitivity Analysis", "Weight Tuning")
10. **test_recommend_weight_tuning_all_pass**: Passes metrics indicating all validations pass, asserts response contains "No tuning recommended"
11. **test_recommend_weight_tuning_positive_fail**: Passes metrics with positive controls failed, asserts response contains "Known Gene Ranking Issue" or "Positive Control"
12. **test_recommend_weight_tuning_negative_fail**: Passes metrics with negative controls failed, asserts response contains "Housekeeping" or "Negative Control"
13. **test_recommend_weight_tuning_sensitivity_fail**: Passes metrics with sensitivity unstable, asserts response contains "Sensitivity" or "gnomad"
**Verification:** All 13 tests passed:
```
tests/test_validation.py::test_compile_housekeeping_genes_structure PASSED
tests/test_validation.py::test_compile_housekeeping_genes_known_genes_present PASSED
tests/test_validation.py::test_validate_negative_controls_with_synthetic_data PASSED
tests/test_validation.py::test_validate_negative_controls_inverted_logic PASSED
tests/test_validation.py::test_compute_recall_at_k PASSED
tests/test_validation.py::test_perturb_weight_renormalizes PASSED
tests/test_validation.py::test_perturb_weight_large_negative PASSED
tests/test_validation.py::test_perturb_weight_invalid_layer PASSED
tests/test_validation.py::test_generate_comprehensive_validation_report_format PASSED
tests/test_validation.py::test_recommend_weight_tuning_all_pass PASSED
tests/test_validation.py::test_recommend_weight_tuning_positive_fail PASSED
tests/test_validation.py::test_recommend_weight_tuning_negative_fail PASSED
tests/test_validation.py::test_recommend_weight_tuning_sensitivity_fail PASSED
======================== 13 passed, 1 warning in 0.79s =========================
```
## Deviations from Plan
None; the plan executed exactly as written.
## Verification Results
All verification checks passed:
1. `python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')"` → OK
2. `python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands"` → OK
3. `python -m pytest tests/test_validation.py -v` → 13 passed, 0 failed
## Success Criteria
- [x] CLI `validate` command runs positive + negative + sensitivity validations and generates comprehensive report
- [x] Validation report includes all three prongs with pass/fail verdicts and weight tuning recommendations
- [x] Unit tests cover negative controls, recall@k, perturbation, and report generation
- [x] All tests pass with synthetic data
- [x] validate command registered in main CLI
## Key Files
### Created
- **src/usher_pipeline/scoring/validation_report.py** (410 lines)
  - Comprehensive validation report generation combining all three validation prongs
  - Exports: generate_comprehensive_validation_report, recommend_weight_tuning, save_validation_report
- **src/usher_pipeline/cli/validate_cmd.py** (408 lines)
  - CLI validate command orchestrating full validation pipeline
  - Exports: validate (Click command)
- **tests/test_validation.py** (478 lines)
  - Unit tests for negative controls, recall@k, sensitivity, and validation report
  - 13 tests with synthetic DuckDB fixture
### Modified
- **src/usher_pipeline/cli/main.py** (+2 lines)
  - Added validate command import and registration
- **src/usher_pipeline/scoring/__init__.py** (+7 lines)
  - Added validation_report module exports
## Integration Points
**Depends on:**
- Phase 06-01: Negative control validation (validate_negative_controls) and positive control validation (validate_positive_controls_extended, compute_recall_at_k)
- Phase 06-02: Sensitivity analysis (run_sensitivity_analysis, summarize_sensitivity)
- Phase 04-02: scored_genes checkpoint (validation requires scoring to be complete)
**Provides:**
- Comprehensive validation report combining all three validation prongs
- CLI `validate` command for user-facing validation workflow
- Unit test coverage for all validation modules
**Affects:**
- Phase 6 completion: This is the final plan in validation phase
- User workflow: Provides `usher-pipeline validate` command for validation step
## Technical Notes
**Comprehensive Validation Report Design:**
The report combines three complementary validation approaches:
1. **Positive controls** (Plan 06-01): Known genes should rank high → validates sensitivity
2. **Negative controls** (Plan 06-01): Housekeeping genes should rank low → validates specificity
3. **Sensitivity analysis** (Plan 06-02): Rankings stable under perturbations → validates robustness
If all three pass, the scoring system is sensitive, specific, and robust.
**Overall Validation Verdict Logic:**
- **All pass** → "ALL VALIDATIONS PASSED ✓" (scientifically defensible)
- **Pos + Neg pass, Sensitivity fail** → "PARTIAL PASS (Sensitivity Unstable)" (directionally correct but may need weight tuning)
- **Pos pass, Neg fail** → "PARTIAL PASS (Specificity Issue)" (sensitive but not specific)
- **Pos fail** → "VALIDATION FAILED ✗" (fundamental scoring issues)
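A minimal sketch of that verdict precedence (labels follow this summary; the real report module may phrase them differently):
```python
# Verdict precedence: positive-control failure dominates, then specificity, then stability.
def overall_verdict(pos_pass: bool, neg_pass: bool, sens_stable: bool) -> str:
    if not pos_pass:
        return "VALIDATION FAILED ✗"
    if not neg_pass:
        return "PARTIAL PASS (Specificity Issue)"
    if not sens_stable:
        return "PARTIAL PASS (Sensitivity Unstable)"
    return "ALL VALIDATIONS PASSED ✓"
```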
**Weight Tuning Recommendations Philosophy:**
Recommendations are **guidance**, not automatic actions. They suggest:
- Which layers to adjust (increase/decrease weights)
- Why adjustments are needed (based on validation failures)
- How to interpret failures (specificity vs sensitivity vs stability)
**CRITICAL WARNING** included in all recommendations:
- Post-validation tuning introduces **circular validation risk**
- If weights are tuned based on validation results, those same controls cannot validate the tuned weights
- Best practices: independent validation set, document rationale, prefer a priori weights
This prevents the pitfall identified in 06-RESEARCH.md: "Tuning weights to maximize known gene recall, then using known gene recall as validation."
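A hedged sketch of the recommendation branching described above follows; the message text and the most_sensitive_layer parameter are illustrative and not the actual recommend_weight_tuning() output.
```python
# Guidance-only recommendations; the circularity caveat is always appended.
CIRCULAR_VALIDATION_WARNING = (
    "WARNING: tuning weights against these controls and then re-validating on "
    "the same controls is circular; hold out an independent validation set."
)


def tuning_recommendations(
    pos_pass: bool, neg_pass: bool, sens_stable: bool, most_sensitive_layer: str
) -> list[str]:
    notes: list[str] = []
    if pos_pass and neg_pass and sens_stable:
        notes.append("Current weights are validated. No tuning recommended.")
    if not pos_pass:
        notes.append("Increase weights for layers where known genes score highly.")
    if not neg_pass:
        notes.append("Examine layers boosting housekeeping genes; reduce generic layer weights.")
    if not sens_stable:
        notes.append(
            f"Rankings are most sensitive to '{most_sensitive_layer}'; consider reducing its weight."
        )
    notes.append(CIRCULAR_VALIDATION_WARNING)
    return notes
```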
**CLI validate Command Design:**
Follows established pattern from score_cmd.py:
1. Click command with options (--force, --skip-sensitivity, --output-dir, --top-n)
2. Step-by-step pipeline with styled output (bold headers, colored status)
3. Checkpoint-restart support (skips if validation_report.md exists unless --force)
4. Provenance tracking for all steps (record_step for each validation phase)
5. Final summary with overall status and key metrics
6. Error handling with sys.exit(1) on failures
**Test Design Philosophy:**
All tests use **synthetic DuckDB data** with **designed score patterns**:
- Known genes get high scores (0.85-0.92) → positive controls should pass
- Housekeeping genes get low scores (0.12-0.20) → negative controls should pass
- Deterministic scores enable precise assertions
Tests cover:
- **Happy path**: Validations pass with expected data
- **Inverted logic**: Validations fail when data is wrong (housekeeping genes high)
- **Edge cases**: Large negative perturbations, invalid layer names
- **Format verification**: Report contains expected sections
- **Recommendation logic**: Different tuning suggestions for different failure modes
**Usage Pattern:**
```bash
# Full validation pipeline
usher-pipeline validate
# Skip sensitivity analysis (faster iteration)
usher-pipeline validate --skip-sensitivity
# Custom output directory
usher-pipeline validate --output-dir results/validation
# More genes for sensitivity (default 100)
usher-pipeline validate --top-n 200
# Force re-run
usher-pipeline validate --force
```
**Expected Workflow:**
1. User runs `usher-pipeline score` (Phase 04-03)
2. User runs `usher-pipeline validate` (this plan)
3. User reviews validation report at {data_dir}/validation/validation_report.md
4. If all pass: proceed to candidate prioritization
5. If failures: review weight tuning recommendations, adjust weights with biological justification, re-run
**Phase 6 Completion:**
This plan completes Phase 6 (Validation). All three plans executed:
- 06-01: Negative controls and recall@k (2 min)
- 06-02: Sensitivity analysis (3 min)
- 06-03: Comprehensive validation report and CLI (5 min)
Total Phase 6 duration: 10 minutes across 3 plans.
## Self-Check: PASSED
**Created files verified:**
- [x] src/usher_pipeline/scoring/validation_report.py exists (410 lines)
- [x] src/usher_pipeline/cli/validate_cmd.py exists (408 lines)
- [x] tests/test_validation.py exists (478 lines)
**Modified files verified:**
- [x] src/usher_pipeline/cli/main.py updated with validate import and registration
- [x] src/usher_pipeline/scoring/__init__.py updated with validation_report exports
**Commits verified:**
- [x] 10f19f8: Task 1 commit exists (comprehensive validation report and CLI validate command)
- [x] 5879ae9: Task 2 commit exists (unit tests for all validation modules)
**Functionality verified:**
- [x] validate command imports correctly (Command name: validate, OK)
- [x] validate registered in CLI (CLI commands includes 'validate', OK)
- [x] All 13 tests pass (pytest reports 13 passed, 0 failed)
All claims in summary verified against actual implementation.