--- phase: 06-validation plan: 03 subsystem: validation tags: [comprehensive-validation, cli-validate, validation-report, weight-tuning, unit-tests] dependency_graph: requires: [06-01-negative-controls-recall, 06-02-sensitivity-analysis] provides: [comprehensive-validation-report, cli-validate-command, validation-tests] affects: [validation-workflow-completion] tech_stack: added: [] patterns: [comprehensive-validation-pipeline, weight-tuning-recommendations, cli-orchestration] key_files: created: - src/usher_pipeline/scoring/validation_report.py - src/usher_pipeline/cli/validate_cmd.py - tests/test_validation.py modified: - src/usher_pipeline/cli/main.py - src/usher_pipeline/scoring/__init__.py decisions: - "Comprehensive validation report combines positive, negative, and sensitivity prongs in single Markdown document" - "Weight tuning recommendations are guidance-only with critical circular validation warnings" - "CLI validate command follows score_cmd.py pattern with --force, --skip-sensitivity, --output-dir, --top-n options" - "Tests use synthetic DuckDB data with designed score patterns (known genes high, housekeeping low)" - "Validation report saved to {data_dir}/validation/validation_report.md by default" metrics: duration_minutes: 5 completed_date: 2026-02-12 tasks_completed: 2 files_created: 3 files_modified: 2 commits: 2 --- # Phase 6 Plan 03: Comprehensive Validation Report & CLI Summary CLI validate command orchestrating full validation pipeline (positive controls, negative controls, sensitivity analysis) with comprehensive Markdown report and weight tuning recommendations. ## Tasks Completed ### Task 1: Create comprehensive validation report and CLI validate command **Status:** Complete **Commit:** 10f19f8 **Files:** src/usher_pipeline/scoring/validation_report.py, src/usher_pipeline/cli/validate_cmd.py, src/usher_pipeline/cli/main.py, src/usher_pipeline/scoring/__init__.py **Created validation_report.py** with comprehensive report generation: - **generate_comprehensive_validation_report()**: Multi-section Markdown report combining all three validation prongs - Section 1: Positive Control Validation (median percentile, recall@k table, per-source breakdown, pass/fail) - Section 2: Negative Control Validation (median percentile, top quartile count, in-HIGH-tier count, pass/fail) - Section 3: Sensitivity Analysis (Spearman rho table by layer × delta, stability verdict, most/least sensitive layers) - Section 4: Overall Validation Summary (all-pass/partial-fail/fail verdict with interpretation) - Section 5: Weight Tuning Recommendations (targeted suggestions based on validation results) - **recommend_weight_tuning()**: Analyzes validation results and provides weight adjustment guidance - All validations pass → "Current weights are validated. No tuning recommended." - Positive controls fail → Suggest increasing weights for layers where known genes score highly - Negative controls fail → Suggest examining layers boosting housekeeping genes, reducing generic layer weights - Sensitivity unstable → Identify most sensitive layer, suggest reducing its weight - **CRITICAL WARNING:** Documents circular validation risk (post-validation tuning invalidates controls) - Provides best practices: independent validation set, document rationale, prefer a priori weights - **save_validation_report()**: Persists report to file with parent directory creation **Created validate_cmd.py** CLI command following score_cmd.py pattern: - Click command `validate` with options: - `--force`: Re-run even if validation checkpoint exists - `--skip-sensitivity`: Skip sensitivity analysis for faster iteration - `--output-dir`: Custom output directory (default: {data_dir}/validation) - `--top-n`: Top N genes for sensitivity analysis (default: 100) - Pipeline steps: 1. Load configuration and initialize store 2. Check scored_genes checkpoint exists (error if not - must run `score` first) 3. Run positive control validation (validate_positive_controls_extended) 4. Run negative control validation (validate_negative_controls) 5. Run sensitivity analysis (unless --skip-sensitivity) - run_sensitivity_analysis + summarize_sensitivity 6. Generate comprehensive validation report (generate_comprehensive_validation_report) 7. Save report to output_dir/validation_report.md and provenance sidecar - Styled output with click.echo patterns (green for success, yellow for warnings, red for errors, bold for step headers) - Provenance tracking: record_step for each validation phase with metrics - Final summary: displays overall pass/fail, recall@10%, housekeeping median percentile, sensitivity stability **Updated main.py:** - Imported validate from validate_cmd - Added `cli.add_command(validate)` following existing pattern **Updated scoring.__init__.py:** - Added validation_report imports: generate_comprehensive_validation_report, recommend_weight_tuning, save_validation_report - Added all 3 functions to __all__ exports **Verification:** Both verification commands passed: - `python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')"` → OK - `python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands"` → OK ### Task 2: Create unit tests for all validation modules **Status:** Complete **Commit:** 5879ae9 **Files:** tests/test_validation.py **Created test_validation.py** with 13 comprehensive tests using synthetic DuckDB data: **Test helper:** - **create_synthetic_scored_db()**: Creates DuckDB with gene_universe (20 genes) and scored_genes table - Designed scores ensure known cilia genes (MYO7A, IFT88, BBS1) get high scores (0.85-0.92) - Housekeeping genes (GAPDH, ACTB, RPL13A) get low scores (0.12-0.20) - Filler genes get mid-range scores (0.35-0.58) - Includes all 6 layer scores and quality_flag - Creates known_genes table with 3 genes (1 OMIM, 2 SYSCILIA) **Tests for negative controls (4 tests):** 1. **test_compile_housekeeping_genes_structure**: Verifies compile_housekeeping_genes() returns DataFrame with 13 genes, correct columns (gene_symbol, source, confidence), all confidence=HIGH, all source=literature_validated 2. **test_compile_housekeeping_genes_known_genes_present**: Asserts GAPDH, ACTB, RPL13A, TBP are in gene_symbol column 3. **test_validate_negative_controls_with_synthetic_data**: Uses synthetic DB where housekeeping genes score low, asserts validation_passed=True, median_percentile < 0.5 4. **test_validate_negative_controls_inverted_logic**: Creates DB where housekeeping genes score HIGH (artificial scenario), asserts validation_passed=False **Tests for recall@k (1 test):** 5. **test_compute_recall_at_k**: Uses synthetic DB, asserts recall@k returns dict with recalls_absolute and recalls_percentage keys. With 3 known genes in dataset (out of 38 total from compile_known_genes), recall@100 = 3/38 = 0.0789 **Tests for weight perturbation (3 tests):** 6. **test_perturb_weight_renormalizes**: Perturbs gnomad by +0.10, asserts weights still sum to 1.0 within 1e-6 tolerance 7. **test_perturb_weight_large_negative**: Perturbs by -0.25 (more than weight value), asserts weight >= 0.0 (clamped) and sum = 1.0 8. **test_perturb_weight_invalid_layer**: Asserts perturb_weight with layer="nonexistent" raises ValueError **Tests for validation report (5 tests):** 9. **test_generate_comprehensive_validation_report_format**: Passes mock metrics dicts, asserts report contains expected sections ("Positive Control", "Negative Control", "Sensitivity Analysis", "Weight Tuning") 10. **test_recommend_weight_tuning_all_pass**: Passes metrics indicating all validations pass, asserts response contains "No tuning recommended" 11. **test_recommend_weight_tuning_positive_fail**: Passes metrics with positive controls failed, asserts response contains "Known Gene Ranking Issue" or "Positive Control" 12. **test_recommend_weight_tuning_negative_fail**: Passes metrics with negative controls failed, asserts response contains "Housekeeping" or "Negative Control" 13. **test_recommend_weight_tuning_sensitivity_fail**: Passes metrics with sensitivity unstable, asserts response contains "Sensitivity" or "gnomad" **Verification:** All 13 tests passed: ``` tests/test_validation.py::test_compile_housekeeping_genes_structure PASSED tests/test_validation.py::test_compile_housekeeping_genes_known_genes_present PASSED tests/test_validation.py::test_validate_negative_controls_with_synthetic_data PASSED tests/test_validation.py::test_validate_negative_controls_inverted_logic PASSED tests/test_validation.py::test_compute_recall_at_k PASSED tests/test_validation.py::test_perturb_weight_renormalizes PASSED tests/test_validation.py::test_perturb_weight_large_negative PASSED tests/test_validation.py::test_perturb_weight_invalid_layer PASSED tests/test_validation.py::test_generate_comprehensive_validation_report_format PASSED tests/test_validation.py::test_recommend_weight_tuning_all_pass PASSED tests/test_validation.py::test_recommend_weight_tuning_positive_fail PASSED tests/test_validation.py::test_recommend_weight_tuning_negative_fail PASSED tests/test_validation.py::test_recommend_weight_tuning_sensitivity_fail PASSED ======================== 13 passed, 1 warning in 0.79s ========================= ``` ## Deviations from Plan None - plan executed exactly as written. ## Verification Results All verification checks passed: 1. `python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')"` → OK 2. `python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands"` → OK 3. `python -m pytest tests/test_validation.py -v` → 13 passed, 0 failed ## Success Criteria - [x] CLI `validate` command runs positive + negative + sensitivity validations and generates comprehensive report - [x] Validation report includes all three prongs with pass/fail verdicts and weight tuning recommendations - [x] Unit tests cover negative controls, recall@k, perturbation, and report generation - [x] All tests pass with synthetic data - [x] validate command registered in main CLI ## Key Files ### Created - **src/usher_pipeline/scoring/validation_report.py** (410 lines) - Comprehensive validation report generation combining all three validation prongs - Exports: generate_comprehensive_validation_report, recommend_weight_tuning, save_validation_report - **src/usher_pipeline/cli/validate_cmd.py** (408 lines) - CLI validate command orchestrating full validation pipeline - Exports: validate (Click command) - **tests/test_validation.py** (478 lines) - Unit tests for negative controls, recall@k, sensitivity, and validation report - 13 tests with synthetic DuckDB fixture ### Modified - **src/usher_pipeline/cli/main.py** (+2 lines) - Added validate command import and registration - **src/usher_pipeline/scoring/__init__.py** (+7 lines) - Added validation_report module exports ## Integration Points **Depends on:** - Phase 06-01: Negative control validation (validate_negative_controls) and positive control validation (validate_positive_controls_extended, compute_recall_at_k) - Phase 06-02: Sensitivity analysis (run_sensitivity_analysis, summarize_sensitivity) - Phase 04-02: scored_genes checkpoint (validation requires scoring to be complete) **Provides:** - Comprehensive validation report combining all three validation prongs - CLI `validate` command for user-facing validation workflow - Unit test coverage for all validation modules **Affects:** - Phase 6 completion: This is the final plan in validation phase - User workflow: Provides `usher-pipeline validate` command for validation step ## Technical Notes **Comprehensive Validation Report Design:** The report combines three complementary validation approaches: 1. **Positive controls** (Plan 06-01): Known genes should rank high → validates sensitivity 2. **Negative controls** (Plan 06-01): Housekeeping genes should rank low → validates specificity 3. **Sensitivity analysis** (Plan 06-02): Rankings stable under perturbations → validates robustness If all three pass: scoring system is sensitive, specific, and robust. **Overall Validation Verdict Logic:** - **All pass** → "ALL VALIDATIONS PASSED ✓" (scientifically defensible) - **Pos + Neg pass, Sensitivity fail** → "PARTIAL PASS (Sensitivity Unstable)" (directionally correct but may need weight tuning) - **Pos pass, Neg fail** → "PARTIAL PASS (Specificity Issue)" (sensitive but not specific) - **Pos fail** → "VALIDATION FAILED ✗" (fundamental scoring issues) **Weight Tuning Recommendations Philosophy:** Recommendations are **guidance**, not automatic actions. They suggest: - Which layers to adjust (increase/decrease weights) - Why adjustments are needed (based on validation failures) - How to interpret failures (specificity vs sensitivity vs stability) **CRITICAL WARNING** included in all recommendations: - Post-validation tuning introduces **circular validation risk** - If weights are tuned based on validation results, those same controls cannot validate the tuned weights - Best practices: independent validation set, document rationale, prefer a priori weights This prevents the pitfall identified in 06-RESEARCH.md: "Tuning weights to maximize known gene recall, then using known gene recall as validation." **CLI validate Command Design:** Follows established pattern from score_cmd.py: 1. Click command with options (--force, --skip-sensitivity, --output-dir, --top-n) 2. Step-by-step pipeline with styled output (bold headers, colored status) 3. Checkpoint-restart support (skips if validation_report.md exists unless --force) 4. Provenance tracking for all steps (record_step for each validation phase) 5. Final summary with overall status and key metrics 6. Error handling with sys.exit(1) on failures **Test Design Philosophy:** All tests use **synthetic DuckDB data** with **designed score patterns**: - Known genes get high scores (0.85-0.92) → positive controls should pass - Housekeeping genes get low scores (0.12-0.20) → negative controls should pass - Deterministic scores enable precise assertions Tests cover: - **Happy path**: Validations pass with expected data - **Inverted logic**: Validations fail when data is wrong (housekeeping genes high) - **Edge cases**: Large negative perturbations, invalid layer names - **Format verification**: Report contains expected sections - **Recommendation logic**: Different tuning suggestions for different failure modes **Usage Pattern:** ```bash # Full validation pipeline usher-pipeline validate # Skip sensitivity analysis (faster iteration) usher-pipeline validate --skip-sensitivity # Custom output directory usher-pipeline validate --output-dir results/validation # More genes for sensitivity (default 100) usher-pipeline validate --top-n 200 # Force re-run usher-pipeline validate --force ``` **Expected Workflow:** 1. User runs `usher-pipeline score` (Phase 04-03) 2. User runs `usher-pipeline validate` (this plan) 3. User reviews validation report at {data_dir}/validation/validation_report.md 4. If all pass: proceed to candidate prioritization 5. If failures: review weight tuning recommendations, adjust weights with biological justification, re-run **Phase 6 Completion:** This plan completes Phase 6 (Validation). All three plans executed: - 06-01: Negative controls and recall@k (2 min) - 06-02: Sensitivity analysis (3 min) - 06-03: Comprehensive validation report and CLI (5 min) Total Phase 6 duration: 10 minutes across 3 plans. ## Self-Check: PASSED **Created files verified:** - [x] src/usher_pipeline/scoring/validation_report.py exists (410 lines) - [x] src/usher_pipeline/cli/validate_cmd.py exists (408 lines) - [x] tests/test_validation.py exists (478 lines) **Modified files verified:** - [x] src/usher_pipeline/cli/main.py updated with validate import and registration - [x] src/usher_pipeline/scoring/__init__.py updated with validation_report exports **Commits verified:** - [x] 10f19f8: Task 1 commit exists (comprehensive validation report and CLI validate command) - [x] 5879ae9: Task 2 commit exists (unit tests for all validation modules) **Functionality verified:** - [x] validate command imports correctly (Command name: validate, OK) - [x] validate registered in CLI (CLI commands includes 'validate', OK) - [x] All 13 tests pass (pytest reports 13 passed, 0 failed) All claims in summary verified against actual implementation.