docs(06): create phase plan

2026-02-12 04:33:17 +08:00
parent ca2b715d8e
commit 844295c681
4 changed files with 570 additions and 3 deletions

View File

@@ -119,9 +119,12 @@ Plans:
2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
4. Final scoring weights are tuned based on validation metrics and documented with rationale
**Plans**: 3 plans
Plans:
- [ ] 06-01-PLAN.md -- Negative control validation (housekeeping genes) and enhanced positive control metrics (recall@k)
- [ ] 06-02-PLAN.md -- Sensitivity analysis (weight perturbation sweeps with Spearman rank correlation)
- [ ] 06-03-PLAN.md -- Comprehensive validation report, CLI validate command, and unit tests
## Progress
@@ -135,4 +138,4 @@ Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6
| 3. Core Evidence Layers | 6/6 | Complete | 2026-02-11 |
| 4. Scoring & Integration | 3/3 | Complete | 2026-02-11 |
| 5. Output & CLI | 3/3 | Complete | 2026-02-12 |
| 6. Validation | 0/3 | Not started | - |

View File

@@ -0,0 +1,158 @@
---
phase: 06-validation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/scoring/negative_controls.py
- src/usher_pipeline/scoring/validation.py
- src/usher_pipeline/scoring/__init__.py
autonomous: true
must_haves:
truths:
- "Housekeeping genes are compiled as a curated negative control set with source provenance"
- "Negative control validation shows housekeeping genes rank below 50th percentile median"
- "Positive control validation reports recall@k metrics at k=10%, 20%, top-100"
- "Known genes achieve >70% recall in top 10% of scored candidates"
artifacts:
- path: "src/usher_pipeline/scoring/negative_controls.py"
provides: "Housekeeping gene compilation and negative control validation"
exports: ["HOUSEKEEPING_GENES_CORE", "compile_housekeeping_genes", "validate_negative_controls"]
- path: "src/usher_pipeline/scoring/validation.py"
provides: "Enhanced positive control validation with recall@k and per-source breakdown"
exports: ["validate_known_gene_ranking", "compute_recall_at_k", "generate_validation_report"]
- path: "src/usher_pipeline/scoring/__init__.py"
provides: "Updated exports including negative control functions"
contains: "validate_negative_controls"
key_links:
- from: "src/usher_pipeline/scoring/negative_controls.py"
to: "DuckDB scored_genes table"
via: "PERCENT_RANK window function query"
pattern: "PERCENT_RANK.*ORDER BY composite_score"
- from: "src/usher_pipeline/scoring/validation.py"
to: "src/usher_pipeline/scoring/known_genes.py"
via: "compile_known_genes import"
pattern: "from usher_pipeline.scoring.known_genes import"
---
<objective>
Implement negative control validation with housekeeping genes and enhance positive control validation with recall@k metrics.
Purpose: Negative controls ensure the scoring system does not indiscriminately rank all genes high (complementing the existing positive control validation). Enhanced positive control metrics (recall@k) provide the specific ">70% in top 10%" measurement required by success criteria.
Output: Two modules -- negative_controls.py (new) and enhanced validation.py (updated) -- ready for integration into the comprehensive validation report (Plan 03).
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/06-validation/06-RESEARCH.md
@src/usher_pipeline/scoring/validation.py
@src/usher_pipeline/scoring/known_genes.py
@src/usher_pipeline/scoring/quality_control.py
@src/usher_pipeline/scoring/__init__.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create negative control validation module with housekeeping genes</name>
<files>src/usher_pipeline/scoring/negative_controls.py</files>
<action>
Create `src/usher_pipeline/scoring/negative_controls.py` with:
1. **HOUSEKEEPING_GENES_CORE** frozenset constant containing 13 curated housekeeping genes:
RPL13A, RPL32, RPLP0, GAPDH, ACTB, B2M, HPRT1, TBP, SDHA, PGK1, PPIA, UBC, YWHAZ.
Include inline comments grouping by function (ribosomal, metabolic, transcription/reference).
2. **compile_housekeeping_genes() -> pl.DataFrame** function returning DataFrame with columns:
- gene_symbol (str)
- source (str): "literature_validated" for all
- confidence (str): "HIGH" for all
Follow the exact same pattern as `compile_known_genes()` in known_genes.py.
3. **validate_negative_controls(store: PipelineStore, percentile_threshold: float = 0.50) -> dict** function:
- Register housekeeping genes as temporary DuckDB table `_housekeeping_genes`
- Use the same PERCENT_RANK window function pattern as `validate_known_gene_ranking()` in validation.py
- Query: join ranked_genes CTE with _housekeeping_genes on gene_symbol
- INVERTED validation logic: `validation_passed = median_percentile < percentile_threshold`
- Return dict with keys: total_expected, total_in_dataset, median_percentile, top_quartile_count, in_high_tier_count, validation_passed, housekeeping_gene_details (top 20 by percentile ASC)
- Clean up temp table after query
- Use structlog logger with info/warning levels matching validation.py patterns
4. **generate_negative_control_report(metrics: dict) -> str** function:
- Follow the exact formatting pattern from generate_validation_report() in validation.py
- Show a gene table with Score and Percentile column headers
- Include interpretation text for pass/fail
Use structlog, polars, duckdb imports matching existing scoring module patterns. Import PipelineStore from usher_pipeline.persistence.duckdb_store.
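For reference, a minimal sketch of steps 1-3 above. The functional grouping comments and the exact SQL wording are illustrative; the authoritative patterns are `compile_known_genes()` and `validate_known_gene_ranking()`, which are not reproduced here.
```python
import polars as pl

# Curated negative controls (13 genes; grouping comments are illustrative).
HOUSEKEEPING_GENES_CORE: frozenset[str] = frozenset({
    # Ribosomal
    "RPL13A", "RPL32", "RPLP0",
    # Metabolic
    "GAPDH", "SDHA", "PGK1",
    # Transcription / common reference
    "ACTB", "B2M", "HPRT1", "TBP", "PPIA", "UBC", "YWHAZ",
})


def compile_housekeeping_genes() -> pl.DataFrame:
    """Return the negative control set in the same shape as compile_known_genes()."""
    genes = sorted(HOUSEKEEPING_GENES_CORE)
    return pl.DataFrame({
        "gene_symbol": genes,
        "source": ["literature_validated"] * len(genes),
        "confidence": ["HIGH"] * len(genes),
    })


# Sketch of the PERCENT_RANK join used by validate_negative_controls();
# scored_genes, composite_score, and the temp table name come from this plan.
NEGATIVE_CONTROL_QUERY = """
WITH ranked_genes AS (
    SELECT gene_symbol,
           composite_score,
           PERCENT_RANK() OVER (ORDER BY composite_score) AS percentile
    FROM scored_genes
    WHERE composite_score IS NOT NULL
)
SELECT r.gene_symbol, r.composite_score, r.percentile
FROM ranked_genes AS r
JOIN _housekeeping_genes AS h ON r.gene_symbol = h.gene_symbol
ORDER BY r.percentile ASC
LIMIT 20
"""
```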
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.scoring.negative_controls import HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report; df = compile_housekeeping_genes(); print(f'Housekeeping genes: {df.height}'); assert df.height == 13; assert set(df.columns) == {'gene_symbol', 'source', 'confidence'}; print('OK')"` exits 0
</verify>
<done>
negative_controls.py exists with 13 curated housekeeping genes, compile function returns correct DataFrame structure, validate function uses PERCENT_RANK with inverted threshold logic, report function generates human-readable output.
</done>
</task>
<task type="auto">
<name>Task 2: Enhance positive control validation with recall@k metrics</name>
<files>src/usher_pipeline/scoring/validation.py, src/usher_pipeline/scoring/__init__.py</files>
<action>
**In validation.py**, add the following functions (do NOT modify existing functions, only ADD):
1. **compute_recall_at_k(store: PipelineStore, k_values: list[int] | None = None) -> dict** function:
- Default k_values: [100, 500, 1000, 2000] (absolute counts)
- Also compute recall at percentage thresholds: top 5%, 10%, 20% of scored genes
- Query scored_genes ordered by composite_score DESC (WHERE NOT NULL)
- For each k: count how many known genes (from compile_known_genes, deduplicated on gene_symbol) appear in top-k
- Recall@k = found_in_top_k / total_known_unique
- Return dict with: recalls_absolute (dict mapping k -> recall float), recalls_percentage (dict mapping pct_string -> recall float), total_known_unique (int), total_scored (int)
- Use structlog for logging results
2. **validate_positive_controls_extended(store: PipelineStore, percentile_threshold: float = 0.75) -> dict** function:
- Call existing validate_known_gene_ranking(store, percentile_threshold) to get base metrics
- Call compute_recall_at_k(store) to get recall metrics
- Add per-source breakdown: compute median percentile separately for "omim_usher" and "syscilia_scgs_v2" genes
- Per-source query: same PERCENT_RANK CTE but filter JOIN by source
- Return dict combining base metrics + recall_at_k + per_source_breakdown (dict mapping source -> {median_percentile, count, top_quartile_count})
- This is the "full" positive control validation for Phase 6
**In __init__.py**, add exports for compute_recall_at_k and validate_positive_controls_extended, and also add imports/exports for the negative_controls module: HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report.
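A hedged sketch of compute_recall_at_k as described in step 1. Treating `store.conn` as a raw DuckDB connection is an assumption about PipelineStore; adapt to its actual query helper.
```python
import polars as pl

from usher_pipeline.scoring.known_genes import compile_known_genes


def compute_recall_at_k(store, k_values: list[int] | None = None) -> dict:
    """Recall@k over scored_genes ordered by composite_score (sketch)."""
    k_values = k_values or [100, 500, 1000, 2000]
    scored = store.conn.execute(
        "SELECT gene_symbol FROM scored_genes "
        "WHERE composite_score IS NOT NULL ORDER BY composite_score DESC"
    ).pl()
    known = set(compile_known_genes()["gene_symbol"].unique().to_list())
    ranked = scored["gene_symbol"].to_list()
    total_scored = len(ranked)

    def recall(k: int) -> float:
        # Fraction of the deduplicated known-gene set found in the top k.
        return len(known & set(ranked[:k])) / len(known) if known else 0.0

    recalls_absolute = {k: recall(k) for k in k_values}
    recalls_percentage = {
        f"top_{int(pct * 100)}pct": recall(max(1, int(total_scored * pct)))
        for pct in (0.05, 0.10, 0.20)
    }
    return {
        "recalls_absolute": recalls_absolute,
        "recalls_percentage": recalls_percentage,
        "total_known_unique": len(known),
        "total_scored": total_scored,
    }
```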
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.scoring import compute_recall_at_k, validate_positive_controls_extended, HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report; print('All imports OK')"` exits 0
</verify>
<done>
validation.py has compute_recall_at_k and validate_positive_controls_extended functions. __init__.py exports all new functions from both negative_controls.py and updated validation.py. Recall@k computes at both absolute and percentage thresholds. Per-source breakdown separates OMIM from SYSCILIA metrics.
</done>
</task>
</tasks>
<verification>
- `python -c "from usher_pipeline.scoring.negative_controls import HOUSEKEEPING_GENES_CORE; assert len(HOUSEKEEPING_GENES_CORE) == 13"` -- housekeeping genes compiled
- `python -c "from usher_pipeline.scoring import validate_negative_controls, compute_recall_at_k, validate_positive_controls_extended"` -- all functions importable
- `python -c "from usher_pipeline.scoring.negative_controls import compile_housekeeping_genes; df = compile_housekeeping_genes(); assert 'gene_symbol' in df.columns and 'source' in df.columns"` -- DataFrame structure correct
</verification>
<success_criteria>
- negative_controls.py creates housekeeping gene set and validates they rank low (inverted threshold)
- validation.py compute_recall_at_k measures recall at multiple k values including percentage-based thresholds
- validate_positive_controls_extended combines percentile + recall + per-source metrics
- All new functions exported from scoring.__init__
</success_criteria>
<output>
After completion, create `.planning/phases/06-validation/06-01-SUMMARY.md`
</output>

View File

@@ -0,0 +1,195 @@
---
phase: 06-validation
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/scoring/sensitivity.py
autonomous: true
must_haves:
truths:
- "Sensitivity analysis perturbs each weight by +-5% and +-10% and measures rank stability"
- "Spearman rank correlation is computed for top-100 genes between baseline and perturbed configurations"
- "Weight perturbation renormalizes remaining weights to maintain sum=1.0 constraint"
- "Rank stability assessment classifies each perturbation as stable (rho>=0.85) or unstable"
artifacts:
- path: "src/usher_pipeline/scoring/sensitivity.py"
provides: "Parameter sweep sensitivity analysis with Spearman correlation"
exports: ["perturb_weight", "run_sensitivity_analysis", "summarize_sensitivity"]
key_links:
- from: "src/usher_pipeline/scoring/sensitivity.py"
to: "src/usher_pipeline/scoring/integration.py"
via: "compute_composite_scores import"
pattern: "from usher_pipeline.scoring.integration import compute_composite_scores"
- from: "src/usher_pipeline/scoring/sensitivity.py"
to: "scipy.stats"
via: "spearmanr import"
pattern: "from scipy.stats import spearmanr"
- from: "src/usher_pipeline/scoring/sensitivity.py"
to: "src/usher_pipeline/config/schema.py"
via: "ScoringWeights import"
pattern: "from usher_pipeline.config.schema import ScoringWeights"
---
<objective>
Implement sensitivity analysis module for parameter sweep validation of scoring weights.
Purpose: Demonstrates that top candidate rankings are robust to reasonable weight perturbations (+-5-10%), satisfying success criterion 3 (rank stability). This is the core of the "are our results defensible?" validation.
Output: sensitivity.py module with weight perturbation, Spearman correlation analysis, and stability classification.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/06-validation/06-RESEARCH.md
@src/usher_pipeline/scoring/integration.py
@src/usher_pipeline/config/schema.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create sensitivity analysis module with weight perturbation and rank correlation</name>
<files>src/usher_pipeline/scoring/sensitivity.py</files>
<action>
Create `src/usher_pipeline/scoring/sensitivity.py` with:
1. **EVIDENCE_LAYERS** list constant: ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]
2. **DEFAULT_DELTAS** list constant: [-0.10, -0.05, 0.05, 0.10]
3. **STABILITY_THRESHOLD** float constant: 0.85 (Spearman rho threshold for "stable")
4. **perturb_weight(baseline: ScoringWeights, layer: str, delta: float) -> ScoringWeights** function:
- Get baseline weights as dict via baseline.model_dump()
- Apply perturbation: w_dict[layer] = max(0.0, min(1.0, w_dict[layer] + delta))
- Renormalize ALL weights so they sum to 1.0: divide each by total
- Return new ScoringWeights instance
- Raise ValueError if layer not in EVIDENCE_LAYERS
5. **run_sensitivity_analysis(store: PipelineStore, baseline_weights: ScoringWeights, deltas: list[float] | None = None, top_n: int = 100) -> dict** function:
- Default deltas to DEFAULT_DELTAS if None
- Compute baseline scores via compute_composite_scores(store, baseline_weights)
- Sort by composite_score DESC, take top_n genes as baseline ranking
- For each layer in EVIDENCE_LAYERS, for each delta in deltas:
- Create perturbed weights via perturb_weight()
- Compute perturbed scores via compute_composite_scores(store, perturbed_weights)
- Sort by composite_score DESC, take top_n genes
- Inner join baseline and perturbed on gene_symbol to get paired scores
- If fewer than 10 overlapping genes, log warning and record rho=None
- Otherwise compute spearmanr() on paired composite_score columns
- Record: layer, delta, perturbed_weights (as dict), spearman_rho, spearman_pval, overlap_count (how many of top_n genes appear in both), top_n
- Return dict with keys: baseline_weights (dict), results (list of per-perturbation dicts), top_n, total_perturbations
- Use structlog for progress logging (log each perturbation result)
IMPORTANT: The compute_composite_scores function re-queries the DB each time. This is by design -- different weights produce different composite_score values from the same underlying evidence layer scores.
For the Spearman correlation, join baseline_top_n and perturbed_top_n DataFrames on gene_symbol (inner join). Use the composite_score from each as the paired values. This measures whether the relative ordering of shared top genes is preserved.
6. **summarize_sensitivity(analysis_result: dict) -> dict** function:
- From the results list, compute:
- min_rho, max_rho, mean_rho across all perturbations (excluding None values)
- count of stable perturbations (rho >= STABILITY_THRESHOLD)
- count of unstable perturbations (rho < STABILITY_THRESHOLD)
- most_sensitive_layer: layer with lowest mean rho across its perturbations
- most_robust_layer: layer with highest mean rho across its perturbations
- overall_stable: bool = all non-None rhos >= STABILITY_THRESHOLD
- Return dict with: min_rho, max_rho, mean_rho, stable_count, unstable_count, total_perturbations, overall_stable, most_sensitive_layer, most_robust_layer
7. **generate_sensitivity_report(analysis_result: dict, summary: dict) -> str** function:
- Follow the formatting pattern from generate_validation_report() in validation.py
- Show table: Layer | Delta | Spearman rho | p-value | Stable?
- Show summary: overall stability verdict, most/least sensitive layers
- Include interpretation text
Use structlog, polars, scipy.stats.spearmanr imports. Import compute_composite_scores from usher_pipeline.scoring.integration, ScoringWeights from usher_pipeline.config.schema, PipelineStore from usher_pipeline.persistence.duckdb_store.
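As a reference for the two central pieces above -- the renormalizing perturbation and the Spearman pairing -- here is a minimal sketch. It assumes ScoringWeights is a Pydantic v2 model whose fields are exactly the six layer weights, and that the top-N frames carry gene_symbol and composite_score columns.
```python
import polars as pl
from scipy.stats import spearmanr

from usher_pipeline.config.schema import ScoringWeights

EVIDENCE_LAYERS = ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]


def perturb_weight(baseline: ScoringWeights, layer: str, delta: float) -> ScoringWeights:
    """Perturb one layer weight, clamp to [0, 1], then renormalize to sum=1.0."""
    if layer not in EVIDENCE_LAYERS:
        raise ValueError(f"Unknown evidence layer: {layer}")
    weights = baseline.model_dump()  # assumes only the six weight fields
    weights[layer] = max(0.0, min(1.0, weights[layer] + delta))
    total = sum(weights.values())
    return ScoringWeights(**{name: value / total for name, value in weights.items()})


def _rank_correlation(baseline_top: pl.DataFrame, perturbed_top: pl.DataFrame) -> tuple[float | None, int]:
    """Pair the two top-N frames on gene_symbol and compute Spearman rho."""
    paired = baseline_top.join(perturbed_top, on="gene_symbol", how="inner", suffix="_perturbed")
    if paired.height < 10:
        return None, paired.height
    rho, _pval = spearmanr(
        paired["composite_score"].to_numpy(),
        paired["composite_score_perturbed"].to_numpy(),
    )
    return float(rho), paired.height
```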
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.sensitivity import perturb_weight, EVIDENCE_LAYERS, STABILITY_THRESHOLD, DEFAULT_DELTAS
from usher_pipeline.config.schema import ScoringWeights
# Test weight perturbation
w = ScoringWeights()
p = perturb_weight(w, 'gnomad', 0.10)
p.validate_sum() # Must not raise
print(f'Original gnomad: {w.gnomad}, Perturbed: {p.gnomad:.4f}')
assert p.gnomad > w.gnomad, 'Perturbed weight should be higher'
# Test renormalization
total = p.gnomad + p.expression + p.annotation + p.localization + p.animal_model + p.literature
assert abs(total - 1.0) < 1e-6, f'Weights must sum to 1.0, got {total}'
# Test edge: perturb to near-zero
p_low = perturb_weight(w, 'gnomad', -0.25)
p_low.validate_sum()
assert p_low.gnomad >= 0.0, 'Weight must not go negative'
print('All perturb_weight tests passed')
"` exits 0
</verify>
<done>
sensitivity.py exists with perturb_weight (renormalizing), run_sensitivity_analysis (computing Spearman rho for top-N genes across all layer/delta combinations), summarize_sensitivity (stability classification), and generate_sensitivity_report (formatted output). Weights always renormalize to sum=1.0 after perturbation.
</done>
</task>
<task type="auto">
<name>Task 2: Export sensitivity module from scoring package</name>
<files>src/usher_pipeline/scoring/__init__.py</files>
<action>
Update `src/usher_pipeline/scoring/__init__.py` to add imports and exports for the sensitivity module:
Add imports:
```python
from usher_pipeline.scoring.sensitivity import (
perturb_weight,
run_sensitivity_analysis,
summarize_sensitivity,
generate_sensitivity_report,
EVIDENCE_LAYERS,
STABILITY_THRESHOLD,
)
```
Add to __all__ list: "perturb_weight", "run_sensitivity_analysis", "summarize_sensitivity", "generate_sensitivity_report", "EVIDENCE_LAYERS", "STABILITY_THRESHOLD"
NOTE: Plan 01 may have already updated __init__.py to add negative_controls exports. If so, ADD the sensitivity imports alongside those -- do not remove them. Read the file first to check current state.
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.scoring import perturb_weight, run_sensitivity_analysis, summarize_sensitivity, generate_sensitivity_report, EVIDENCE_LAYERS, STABILITY_THRESHOLD; print(f'EVIDENCE_LAYERS: {EVIDENCE_LAYERS}'); print(f'STABILITY_THRESHOLD: {STABILITY_THRESHOLD}'); print('All sensitivity exports OK')"` exits 0
</verify>
<done>
All sensitivity analysis functions and constants are importable from usher_pipeline.scoring. Existing exports from negative_controls (Plan 01) are preserved.
</done>
</task>
</tasks>
<verification>
- `python -c "from usher_pipeline.scoring.sensitivity import perturb_weight; from usher_pipeline.config.schema import ScoringWeights; w = ScoringWeights(); p = perturb_weight(w, 'gnomad', 0.05); p.validate_sum(); print('OK')"` -- weight perturbation works and renormalizes
- `python -c "from usher_pipeline.scoring import run_sensitivity_analysis, summarize_sensitivity, generate_sensitivity_report"` -- all exports available
- `python -c "from usher_pipeline.scoring.sensitivity import STABILITY_THRESHOLD; assert STABILITY_THRESHOLD == 0.85"` -- threshold configured
</verification>
<success_criteria>
- perturb_weight correctly perturbs one layer and renormalizes to sum=1.0
- run_sensitivity_analysis computes Spearman rho for all layer x delta combinations
- summarize_sensitivity classifies perturbations as stable/unstable
- generate_sensitivity_report produces human-readable output
- All functions exported from scoring package
</success_criteria>
<output>
After completion, create `.planning/phases/06-validation/06-02-SUMMARY.md`
</output>

View File

@@ -0,0 +1,211 @@
---
phase: 06-validation
plan: 03
type: execute
wave: 2
depends_on: ["06-01", "06-02"]
files_modified:
- src/usher_pipeline/scoring/validation_report.py
- src/usher_pipeline/cli/validate_cmd.py
- src/usher_pipeline/cli/main.py
- tests/test_validation.py
autonomous: true
must_haves:
truths:
- "CLI validate command runs positive controls, negative controls, and sensitivity analysis in sequence"
- "Comprehensive validation report documents all three validation prongs with pass/fail verdicts"
- "Weight tuning recommendations are generated based on validation results with documented rationale"
- "Tests verify negative control logic, recall@k computation, weight perturbation, and report generation"
artifacts:
- path: "src/usher_pipeline/scoring/validation_report.py"
provides: "Comprehensive validation report combining all three validation prongs"
exports: ["generate_comprehensive_validation_report", "recommend_weight_tuning"]
- path: "src/usher_pipeline/cli/validate_cmd.py"
provides: "CLI validate command orchestrating full validation pipeline"
exports: ["validate"]
- path: "tests/test_validation.py"
provides: "Unit tests for negative controls, recall@k, sensitivity, and validation report"
key_links:
- from: "src/usher_pipeline/cli/validate_cmd.py"
to: "src/usher_pipeline/scoring/negative_controls.py"
via: "validate_negative_controls import"
pattern: "from usher_pipeline.scoring import validate_negative_controls"
- from: "src/usher_pipeline/cli/validate_cmd.py"
to: "src/usher_pipeline/scoring/sensitivity.py"
via: "run_sensitivity_analysis import"
pattern: "from usher_pipeline.scoring import run_sensitivity_analysis"
- from: "src/usher_pipeline/cli/validate_cmd.py"
to: "src/usher_pipeline/scoring/validation.py"
via: "validate_positive_controls_extended import"
pattern: "from usher_pipeline.scoring import validate_positive_controls_extended"
- from: "src/usher_pipeline/cli/main.py"
to: "src/usher_pipeline/cli/validate_cmd.py"
via: "Click group add_command"
pattern: "cli.add_command.*validate"
---
<objective>
Create comprehensive validation report generator, CLI validate command, and unit tests for all Phase 6 validation modules.
Purpose: This plan wires together the positive control, negative control, and sensitivity analysis modules (from Plans 01 and 02) into a single CLI command and comprehensive report. Tests ensure correctness with synthetic data. This completes Phase 6 by providing the user-facing validation workflow.
Output: validation_report.py, validate_cmd.py (CLI), updated main.py, and test_validation.py.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/06-validation/06-RESEARCH.md
# Need SUMMARYs from Plans 01 and 02 for what was actually built
@.planning/phases/06-validation/06-01-SUMMARY.md
@.planning/phases/06-validation/06-02-SUMMARY.md
@src/usher_pipeline/scoring/__init__.py
@src/usher_pipeline/scoring/negative_controls.py
@src/usher_pipeline/scoring/sensitivity.py
@src/usher_pipeline/scoring/validation.py
@src/usher_pipeline/cli/score_cmd.py
@src/usher_pipeline/cli/main.py
@tests/test_scoring.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create comprehensive validation report and CLI validate command</name>
<files>src/usher_pipeline/scoring/validation_report.py, src/usher_pipeline/cli/validate_cmd.py, src/usher_pipeline/cli/main.py</files>
<action>
**Create `src/usher_pipeline/scoring/validation_report.py`:**
1. **generate_comprehensive_validation_report(positive_metrics: dict, negative_metrics: dict, sensitivity_result: dict, sensitivity_summary: dict) -> str** function:
- Generate a multi-section Markdown report combining all three validation prongs
- Section 1: "Positive Control Validation" -- median percentile, recall@k table, per-source breakdown, pass/fail
- Section 2: "Negative Control Validation" -- median percentile, top quartile count, in-HIGH-tier count, pass/fail
- Section 3: "Sensitivity Analysis" -- Spearman rho table (layer x delta), overall stability verdict, most/least sensitive layers
- Section 4: "Overall Validation Summary" -- all-pass/partial-fail/fail verdict
- Section 5: "Weight Tuning Recommendations" -- call recommend_weight_tuning()
- Return the full Markdown string
2. **recommend_weight_tuning(positive_metrics: dict, negative_metrics: dict, sensitivity_summary: dict) -> str** function:
- Analyze validation results and suggest weight adjustments
- If positive controls pass AND negative controls pass AND sensitivity stable: "Current weights are validated. No tuning recommended."
- If positive controls fail: suggest increasing weights for layers where known genes score highly
- If negative controls fail (housekeeping genes ranking too high): suggest examining which layers boost housekeeping genes
- If sensitivity unstable: identify most sensitive layer and suggest reducing its weight
- Document rationale for each recommendation
- CRITICAL: Note that any tuning is "post-validation" and flag the circular-validation risk identified as a pitfall in 06-RESEARCH.md
- Return formatted recommendation text
3. **save_validation_report(report_text: str, output_path: Path) -> None**: Write report to file
**Create `src/usher_pipeline/cli/validate_cmd.py`:**
Follow the established CLI pattern from score_cmd.py (config load, store init, checkpoint, steps, summary, cleanup):
1. Click command `validate` with options:
- `--force`: Re-run even if validation checkpoint exists
- `--skip-sensitivity`: Skip sensitivity analysis (faster iteration)
- `--output-dir`: Output directory for validation report (default: {data_dir}/validation)
- `--top-n`: Top N genes for sensitivity analysis (default: 100)
2. Pipeline steps:
- Step 1: Load configuration and initialize store
- Step 2: Check scored_genes checkpoint exists (error if not -- must run `score` first)
- Step 3: Run positive control validation (validate_positive_controls_extended)
- Step 4: Run negative control validation (validate_negative_controls)
- Step 5: Run sensitivity analysis (unless --skip-sensitivity) -- run_sensitivity_analysis + summarize_sensitivity
- Step 6: Generate comprehensive validation report (generate_comprehensive_validation_report)
- Step 7: Save report to output_dir/validation_report.md and provenance sidecar
3. Use click.echo with styled output matching score_cmd.py patterns (green for success, yellow for warnings, red for errors, bold for step headers)
4. Provenance tracking: record_step for each validation phase with metrics
5. Final summary: display overall pass/fail, recall@top-10%, housekeeping median percentile, sensitivity stability
**Update `src/usher_pipeline/cli/main.py`:**
- Import validate from validate_cmd
- Add: `cli.add_command(validate)`
- Follow the existing pattern used for score and report commands
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')"` exits 0 AND `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.cli.main import cli; commands = list(cli.commands.keys()); print(f'CLI commands: {commands}'); assert 'validate' in commands; print('OK')"` exits 0
</verify>
<done>
validation_report.py generates comprehensive multi-section Markdown report with weight tuning recommendations. validate_cmd.py provides CLI command running all three validation prongs. main.py registers validate as a CLI subcommand. All follow established patterns from score_cmd.py.
</done>
</task>
<task type="auto">
<name>Task 2: Create unit tests for all validation modules</name>
<files>tests/test_validation.py</files>
<action>
Create `tests/test_validation.py` with comprehensive tests using synthetic DuckDB data. Follow the test pattern from tests/test_scoring.py (use tmp_path fixtures, create in-memory DuckDB with synthetic data).
**Test helper: create_synthetic_scored_db(tmp_path)** function:
- Create DuckDB with a gene_universe of 20 genes: the named control genes below plus GENE001-style filler genes
- Create scored_genes table with composite_score and all 6 layer scores
- Design scores so that:
- MYO7A, IFT88, BBS1 (known cilia genes) get high scores (0.8-0.95)
- GAPDH, ACTB, RPL13A (housekeeping genes) get low scores (0.1-0.3)
- Other genes get mid-range scores (0.3-0.6)
- This ensures positive controls rank high and negative controls rank low in tests
**Tests to include:**
1. **test_compile_housekeeping_genes_structure**: Verify compile_housekeeping_genes() returns DataFrame with 13 genes, correct columns (gene_symbol, source, confidence), all confidence=HIGH, all source=literature_validated
2. **test_compile_housekeeping_genes_known_genes_present**: Assert GAPDH, ACTB, RPL13A, TBP are in the gene_symbol column
3. **test_validate_negative_controls_with_synthetic_data**: Use synthetic DB where housekeeping genes score low. Assert validation_passed=True, median_percentile < 0.5
4. **test_validate_negative_controls_inverted_logic**: Create a DB where housekeeping genes score HIGH (artificial scenario). Assert validation_passed=False
5. **test_compute_recall_at_k**: Use synthetic DB. Assert recall@k returns dict with recalls_absolute and recalls_percentage keys. With 3 known genes in top 5 of 20, recall@5 should be high (>0.5)
6. **test_perturb_weight_renormalizes**: Perturb gnomad by +0.10, assert weights still sum to 1.0. Perturb by -0.25 (more than weight value), assert weight >= 0.0 and sum = 1.0
7. **test_perturb_weight_invalid_layer**: perturb_weight with layer="nonexistent" should raise ValueError
8. **test_generate_comprehensive_validation_report_format**: Pass mock metrics dicts, assert report contains expected sections ("Positive Control", "Negative Control", "Sensitivity Analysis", "Weight Tuning")
9. **test_recommend_weight_tuning_all_pass**: Pass metrics indicating all validations pass. Assert response contains "No tuning recommended" or similar
All tests should use tmp_path for DuckDB isolation. Import from usher_pipeline.scoring (not internal modules directly where possible). Use PipelineStore with direct conn assignment pattern from test_scoring.py.
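A few representative tests as a sketch; these assume the exports and the ScoringWeights field layout specified in Plans 01 and 02.
```python
import pytest

from usher_pipeline.config.schema import ScoringWeights
from usher_pipeline.scoring import compile_housekeeping_genes, perturb_weight


def test_compile_housekeeping_genes_structure():
    df = compile_housekeeping_genes()
    assert df.height == 13
    assert set(df.columns) == {"gene_symbol", "source", "confidence"}
    assert df["confidence"].unique().to_list() == ["HIGH"]


def test_perturb_weight_renormalizes():
    perturbed = perturb_weight(ScoringWeights(), "gnomad", 0.10)
    # Assumes model_dump() contains only the six layer weight fields.
    total = sum(perturbed.model_dump().values())
    assert abs(total - 1.0) < 1e-6


def test_perturb_weight_invalid_layer():
    with pytest.raises(ValueError):
        perturb_weight(ScoringWeights(), "nonexistent", 0.05)
```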
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_validation.py -v --tb=short` -- all tests pass
</verify>
<done>
test_validation.py contains 9+ tests covering negative controls, recall@k, weight perturbation, sensitivity analysis, and report generation. All tests pass using synthetic DuckDB data with designed score patterns ensuring known genes rank high and housekeeping genes rank low.
</done>
</task>
</tasks>
<verification>
- `python -m pytest tests/test_validation.py -v` -- all validation tests pass
- `python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands"` -- CLI command registered
- `python -c "from usher_pipeline.scoring.validation_report import generate_comprehensive_validation_report, recommend_weight_tuning"` -- report functions importable
- `usher-pipeline validate --help` displays usage information with all options
</verification>
<success_criteria>
- CLI `validate` command runs positive + negative + sensitivity validations and generates comprehensive report
- Validation report includes all three prongs with pass/fail verdicts and weight tuning recommendations
- Unit tests cover negative controls, recall@k, perturbation, and report generation
- All tests pass with synthetic data
- validate command registered in main CLI
</success_criteria>
<output>
After completion, create `.planning/phases/06-validation/06-03-SUMMARY.md`
</output>