Files
gbanyan c97592d629 docs(06-03): complete comprehensive validation report & CLI plan
- Created 06-03-SUMMARY.md documenting plan execution
- Updated STATE.md:
  - Current Position: Phase 6 COMPLETE (21/21 plans)
  - Performance Metrics: Phase 06 3/3 plans, 10 min total, 3.3 min/plan avg
  - Added decisions for comprehensive validation report and weight tuning recommendations
  - Session Continuity: Stopped at 06-03-PLAN.md completion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 04:53:05 +08:00

16 KiB
Raw Permalink Blame History

phase, plan, subsystem, tags, dependency_graph, tech_stack, key_files, decisions, metrics
phase plan subsystem tags dependency_graph tech_stack key_files decisions metrics
06-validation 03 validation
comprehensive-validation
cli-validate
validation-report
weight-tuning
unit-tests
requires provides affects
06-01-negative-controls-recall
06-02-sensitivity-analysis
comprehensive-validation-report
cli-validate-command
validation-tests
validation-workflow-completion
added patterns
comprehensive-validation-pipeline
weight-tuning-recommendations
cli-orchestration
created modified
src/usher_pipeline/scoring/validation_report.py
src/usher_pipeline/cli/validate_cmd.py
tests/test_validation.py
src/usher_pipeline/cli/main.py
src/usher_pipeline/scoring/__init__.py
Comprehensive validation report combines positive, negative, and sensitivity prongs in single Markdown document
Weight tuning recommendations are guidance-only with critical circular validation warnings
CLI validate command follows score_cmd.py pattern with --force, --skip-sensitivity, --output-dir, --top-n options
Tests use synthetic DuckDB data with designed score patterns (known genes high, housekeeping low)
Validation report saved to {data_dir}/validation/validation_report.md by default
duration_minutes completed_date tasks_completed files_created files_modified commits
5 2026-02-12 2 3 2 2

Phase 6 Plan 03: Comprehensive Validation Report & CLI Summary

CLI validate command orchestrating full validation pipeline (positive controls, negative controls, sensitivity analysis) with comprehensive Markdown report and weight tuning recommendations.

Tasks Completed

Task 1: Create comprehensive validation report and CLI validate command

Status: Complete Commit: 10f19f8 Files: src/usher_pipeline/scoring/validation_report.py, src/usher_pipeline/cli/validate_cmd.py, src/usher_pipeline/cli/main.py, src/usher_pipeline/scoring/init.py

Created validation_report.py with comprehensive report generation:

  • generate_comprehensive_validation_report(): Multi-section Markdown report combining all three validation prongs

    • Section 1: Positive Control Validation (median percentile, recall@k table, per-source breakdown, pass/fail)
    • Section 2: Negative Control Validation (median percentile, top quartile count, in-HIGH-tier count, pass/fail)
    • Section 3: Sensitivity Analysis (Spearman rho table by layer × delta, stability verdict, most/least sensitive layers)
    • Section 4: Overall Validation Summary (all-pass/partial-fail/fail verdict with interpretation)
    • Section 5: Weight Tuning Recommendations (targeted suggestions based on validation results)
  • recommend_weight_tuning(): Analyzes validation results and provides weight adjustment guidance

    • All validations pass → "Current weights are validated. No tuning recommended."
    • Positive controls fail → Suggest increasing weights for layers where known genes score highly
    • Negative controls fail → Suggest examining layers boosting housekeeping genes, reducing generic layer weights
    • Sensitivity unstable → Identify most sensitive layer, suggest reducing its weight
    • CRITICAL WARNING: Documents circular validation risk (post-validation tuning invalidates controls)
    • Provides best practices: independent validation set, document rationale, prefer a priori weights
  • save_validation_report(): Persists report to file with parent directory creation

Created validate_cmd.py CLI command following score_cmd.py pattern:

  • Click command validate with options:

    • --force: Re-run even if validation checkpoint exists
    • --skip-sensitivity: Skip sensitivity analysis for faster iteration
    • --output-dir: Custom output directory (default: {data_dir}/validation)
    • --top-n: Top N genes for sensitivity analysis (default: 100)
  • Pipeline steps:

    1. Load configuration and initialize store
    2. Check scored_genes checkpoint exists (error if not - must run score first)
    3. Run positive control validation (validate_positive_controls_extended)
    4. Run negative control validation (validate_negative_controls)
    5. Run sensitivity analysis (unless --skip-sensitivity) - run_sensitivity_analysis + summarize_sensitivity
    6. Generate comprehensive validation report (generate_comprehensive_validation_report)
    7. Save report to output_dir/validation_report.md and provenance sidecar
  • Styled output with click.echo patterns (green for success, yellow for warnings, red for errors, bold for step headers)

  • Provenance tracking: record_step for each validation phase with metrics

  • Final summary: displays overall pass/fail, recall@10%, housekeeping median percentile, sensitivity stability

Updated main.py:

  • Imported validate from validate_cmd
  • Added cli.add_command(validate) following existing pattern

Updated scoring.init.py:

  • Added validation_report imports: generate_comprehensive_validation_report, recommend_weight_tuning, save_validation_report
  • Added all 3 functions to all exports

Verification: Both verification commands passed:

  • python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')" → OK
  • python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands" → OK

Task 2: Create unit tests for all validation modules

Status: Complete Commit: 5879ae9 Files: tests/test_validation.py

Created test_validation.py with 13 comprehensive tests using synthetic DuckDB data:

Test helper:

  • create_synthetic_scored_db(): Creates DuckDB with gene_universe (20 genes) and scored_genes table
    • Designed scores ensure known cilia genes (MYO7A, IFT88, BBS1) get high scores (0.85-0.92)
    • Housekeeping genes (GAPDH, ACTB, RPL13A) get low scores (0.12-0.20)
    • Filler genes get mid-range scores (0.35-0.58)
    • Includes all 6 layer scores and quality_flag
    • Creates known_genes table with 3 genes (1 OMIM, 2 SYSCILIA)

Tests for negative controls (4 tests):

  1. test_compile_housekeeping_genes_structure: Verifies compile_housekeeping_genes() returns DataFrame with 13 genes, correct columns (gene_symbol, source, confidence), all confidence=HIGH, all source=literature_validated

  2. test_compile_housekeeping_genes_known_genes_present: Asserts GAPDH, ACTB, RPL13A, TBP are in gene_symbol column

  3. test_validate_negative_controls_with_synthetic_data: Uses synthetic DB where housekeeping genes score low, asserts validation_passed=True, median_percentile < 0.5

  4. test_validate_negative_controls_inverted_logic: Creates DB where housekeeping genes score HIGH (artificial scenario), asserts validation_passed=False

Tests for recall@k (1 test): 5. test_compute_recall_at_k: Uses synthetic DB, asserts recall@k returns dict with recalls_absolute and recalls_percentage keys. With 3 known genes in dataset (out of 38 total from compile_known_genes), recall@100 = 3/38 = 0.0789

Tests for weight perturbation (3 tests): 6. test_perturb_weight_renormalizes: Perturbs gnomad by +0.10, asserts weights still sum to 1.0 within 1e-6 tolerance

  1. test_perturb_weight_large_negative: Perturbs by -0.25 (more than weight value), asserts weight >= 0.0 (clamped) and sum = 1.0

  2. test_perturb_weight_invalid_layer: Asserts perturb_weight with layer="nonexistent" raises ValueError

Tests for validation report (5 tests): 9. test_generate_comprehensive_validation_report_format: Passes mock metrics dicts, asserts report contains expected sections ("Positive Control", "Negative Control", "Sensitivity Analysis", "Weight Tuning")

  1. test_recommend_weight_tuning_all_pass: Passes metrics indicating all validations pass, asserts response contains "No tuning recommended"

  2. test_recommend_weight_tuning_positive_fail: Passes metrics with positive controls failed, asserts response contains "Known Gene Ranking Issue" or "Positive Control"

  3. test_recommend_weight_tuning_negative_fail: Passes metrics with negative controls failed, asserts response contains "Housekeeping" or "Negative Control"

  4. test_recommend_weight_tuning_sensitivity_fail: Passes metrics with sensitivity unstable, asserts response contains "Sensitivity" or "gnomad"

Verification: All 13 tests passed:

tests/test_validation.py::test_compile_housekeeping_genes_structure PASSED
tests/test_validation.py::test_compile_housekeeping_genes_known_genes_present PASSED
tests/test_validation.py::test_validate_negative_controls_with_synthetic_data PASSED
tests/test_validation.py::test_validate_negative_controls_inverted_logic PASSED
tests/test_validation.py::test_compute_recall_at_k PASSED
tests/test_validation.py::test_perturb_weight_renormalizes PASSED
tests/test_validation.py::test_perturb_weight_large_negative PASSED
tests/test_validation.py::test_perturb_weight_invalid_layer PASSED
tests/test_validation.py::test_generate_comprehensive_validation_report_format PASSED
tests/test_validation.py::test_recommend_weight_tuning_all_pass PASSED
tests/test_validation.py::test_recommend_weight_tuning_positive_fail PASSED
tests/test_validation.py::test_recommend_weight_tuning_negative_fail PASSED
tests/test_validation.py::test_recommend_weight_tuning_sensitivity_fail PASSED
======================== 13 passed, 1 warning in 0.79s =========================

Deviations from Plan

None - plan executed exactly as written.

Verification Results

All verification checks passed:

  1. python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')" → OK
  2. python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands" → OK
  3. python -m pytest tests/test_validation.py -v → 13 passed, 0 failed

Success Criteria

  • CLI validate command runs positive + negative + sensitivity validations and generates comprehensive report
  • Validation report includes all three prongs with pass/fail verdicts and weight tuning recommendations
  • Unit tests cover negative controls, recall@k, perturbation, and report generation
  • All tests pass with synthetic data
  • validate command registered in main CLI

Key Files

Created

  • src/usher_pipeline/scoring/validation_report.py (410 lines)

    • Comprehensive validation report generation combining all three validation prongs
    • Exports: generate_comprehensive_validation_report, recommend_weight_tuning, save_validation_report
  • src/usher_pipeline/cli/validate_cmd.py (408 lines)

    • CLI validate command orchestrating full validation pipeline
    • Exports: validate (Click command)
  • tests/test_validation.py (478 lines)

    • Unit tests for negative controls, recall@k, sensitivity, and validation report
    • 13 tests with synthetic DuckDB fixture

Modified

  • src/usher_pipeline/cli/main.py (+2 lines)

    • Added validate command import and registration
  • src/usher_pipeline/scoring/init.py (+7 lines)

    • Added validation_report module exports

Integration Points

Depends on:

  • Phase 06-01: Negative control validation (validate_negative_controls) and positive control validation (validate_positive_controls_extended, compute_recall_at_k)
  • Phase 06-02: Sensitivity analysis (run_sensitivity_analysis, summarize_sensitivity)
  • Phase 04-02: scored_genes checkpoint (validation requires scoring to be complete)

Provides:

  • Comprehensive validation report combining all three validation prongs
  • CLI validate command for user-facing validation workflow
  • Unit test coverage for all validation modules

Affects:

  • Phase 6 completion: This is the final plan in validation phase
  • User workflow: Provides usher-pipeline validate command for validation step

Technical Notes

Comprehensive Validation Report Design:

The report combines three complementary validation approaches:

  1. Positive controls (Plan 06-01): Known genes should rank high → validates sensitivity
  2. Negative controls (Plan 06-01): Housekeeping genes should rank low → validates specificity
  3. Sensitivity analysis (Plan 06-02): Rankings stable under perturbations → validates robustness

If all three pass: scoring system is sensitive, specific, and robust.

Overall Validation Verdict Logic:

  • All pass → "ALL VALIDATIONS PASSED ✓" (scientifically defensible)
  • Pos + Neg pass, Sensitivity fail → "PARTIAL PASS (Sensitivity Unstable)" (directionally correct but may need weight tuning)
  • Pos pass, Neg fail → "PARTIAL PASS (Specificity Issue)" (sensitive but not specific)
  • Pos fail → "VALIDATION FAILED ✗" (fundamental scoring issues)

Weight Tuning Recommendations Philosophy:

Recommendations are guidance, not automatic actions. They suggest:

  • Which layers to adjust (increase/decrease weights)
  • Why adjustments are needed (based on validation failures)
  • How to interpret failures (specificity vs sensitivity vs stability)

CRITICAL WARNING included in all recommendations:

  • Post-validation tuning introduces circular validation risk
  • If weights are tuned based on validation results, those same controls cannot validate the tuned weights
  • Best practices: independent validation set, document rationale, prefer a priori weights

This prevents the pitfall identified in 06-RESEARCH.md: "Tuning weights to maximize known gene recall, then using known gene recall as validation."

CLI validate Command Design:

Follows established pattern from score_cmd.py:

  1. Click command with options (--force, --skip-sensitivity, --output-dir, --top-n)
  2. Step-by-step pipeline with styled output (bold headers, colored status)
  3. Checkpoint-restart support (skips if validation_report.md exists unless --force)
  4. Provenance tracking for all steps (record_step for each validation phase)
  5. Final summary with overall status and key metrics
  6. Error handling with sys.exit(1) on failures

Test Design Philosophy:

All tests use synthetic DuckDB data with designed score patterns:

  • Known genes get high scores (0.85-0.92) → positive controls should pass
  • Housekeeping genes get low scores (0.12-0.20) → negative controls should pass
  • Deterministic scores enable precise assertions

Tests cover:

  • Happy path: Validations pass with expected data
  • Inverted logic: Validations fail when data is wrong (housekeeping genes high)
  • Edge cases: Large negative perturbations, invalid layer names
  • Format verification: Report contains expected sections
  • Recommendation logic: Different tuning suggestions for different failure modes

Usage Pattern:

# Full validation pipeline
usher-pipeline validate

# Skip sensitivity analysis (faster iteration)
usher-pipeline validate --skip-sensitivity

# Custom output directory
usher-pipeline validate --output-dir results/validation

# More genes for sensitivity (default 100)
usher-pipeline validate --top-n 200

# Force re-run
usher-pipeline validate --force

Expected Workflow:

  1. User runs usher-pipeline score (Phase 04-03)
  2. User runs usher-pipeline validate (this plan)
  3. User reviews validation report at {data_dir}/validation/validation_report.md
  4. If all pass: proceed to candidate prioritization
  5. If failures: review weight tuning recommendations, adjust weights with biological justification, re-run

Phase 6 Completion:

This plan completes Phase 6 (Validation). All three plans executed:

  • 06-01: Negative controls and recall@k (2 min)
  • 06-02: Sensitivity analysis (3 min)
  • 06-03: Comprehensive validation report and CLI (5 min)

Total Phase 6 duration: 10 minutes across 3 plans.

Self-Check: PASSED

Created files verified:

  • src/usher_pipeline/scoring/validation_report.py exists (410 lines)
  • src/usher_pipeline/cli/validate_cmd.py exists (408 lines)
  • tests/test_validation.py exists (478 lines)

Modified files verified:

  • src/usher_pipeline/cli/main.py updated with validate import and registration
  • src/usher_pipeline/scoring/init.py updated with validation_report exports

Commits verified:

  • 10f19f8: Task 1 commit exists (comprehensive validation report and CLI validate command)
  • 5879ae9: Task 2 commit exists (unit tests for all validation modules)

Functionality verified:

  • validate command imports correctly (Command name: validate, OK)
  • validate registered in CLI (CLI commands includes 'validate', OK)
  • All 13 tests pass (pytest reports 13 passed, 0 failed)

All claims in summary verified against actual implementation.