docs(06): create phase plan

2026-02-12 04:33:17 +08:00
parent ca2b715d8e
commit 844295c681
4 changed files with 570 additions and 3 deletions

View File

@@ -119,9 +119,12 @@ Plans:
2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
4. Final scoring weights are tuned based on validation metrics and documented with rationale
**Plans**: 3 plans
Plans:
- [ ] 06-01-PLAN.md -- Negative control validation (housekeeping genes) and enhanced positive control metrics (recall@k)
- [ ] 06-02-PLAN.md -- Sensitivity analysis (weight perturbation sweeps with Spearman rank correlation)
- [ ] 06-03-PLAN.md -- Comprehensive validation report, CLI validate command, and unit tests
## Progress
@@ -135,4 +138,4 @@ Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6
| 3. Core Evidence Layers | 6/6 | Complete | 2026-02-11 |
| 4. Scoring & Integration | 3/3 | Complete | 2026-02-11 |
| 5. Output & CLI | 3/3 | Complete | 2026-02-12 |
| 6. Validation | 0/3 | Not started | - |

View File

@@ -0,0 +1,158 @@
---
phase: 06-validation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/scoring/negative_controls.py
- src/usher_pipeline/scoring/validation.py
- src/usher_pipeline/scoring/__init__.py
autonomous: true
must_haves:
truths:
- "Housekeeping genes are compiled as a curated negative control set with source provenance"
- "Negative control validation shows housekeeping genes rank below 50th percentile median"
- "Positive control validation reports recall@k metrics at k=10%, 20%, top-100"
- "Known genes achieve >70% recall in top 10% of scored candidates"
artifacts:
- path: "src/usher_pipeline/scoring/negative_controls.py"
provides: "Housekeeping gene compilation and negative control validation"
exports: ["HOUSEKEEPING_GENES_CORE", "compile_housekeeping_genes", "validate_negative_controls"]
- path: "src/usher_pipeline/scoring/validation.py"
provides: "Enhanced positive control validation with recall@k and per-source breakdown"
exports: ["validate_known_gene_ranking", "compute_recall_at_k", "generate_validation_report"]
- path: "src/usher_pipeline/scoring/__init__.py"
provides: "Updated exports including negative control functions"
contains: "validate_negative_controls"
key_links:
- from: "src/usher_pipeline/scoring/negative_controls.py"
to: "DuckDB scored_genes table"
via: "PERCENT_RANK window function query"
pattern: "PERCENT_RANK.*ORDER BY composite_score"
- from: "src/usher_pipeline/scoring/validation.py"
to: "src/usher_pipeline/scoring/known_genes.py"
via: "compile_known_genes import"
pattern: "from usher_pipeline.scoring.known_genes import"
---
<objective>
Implement negative control validation with housekeeping genes and enhance positive control validation with recall@k metrics.
Purpose: Negative controls ensure the scoring system does not indiscriminately rank all genes high (complementing the existing positive control validation). Enhanced positive control metrics (recall@k) provide the specific ">70% in top 10%" measurement required by success criteria.
Output: Two modules -- negative_controls.py (new) and enhanced validation.py (updated) -- ready for integration into the comprehensive validation report (Plan 03).
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/06-validation/06-RESEARCH.md
@src/usher_pipeline/scoring/validation.py
@src/usher_pipeline/scoring/known_genes.py
@src/usher_pipeline/scoring/quality_control.py
@src/usher_pipeline/scoring/__init__.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create negative control validation module with housekeeping genes</name>
<files>src/usher_pipeline/scoring/negative_controls.py</files>
<action>
Create `src/usher_pipeline/scoring/negative_controls.py` with:
1. **HOUSEKEEPING_GENES_CORE** frozenset constant containing 13 curated housekeeping genes:
RPL13A, RPL32, RPLP0, GAPDH, ACTB, B2M, HPRT1, TBP, SDHA, PGK1, PPIA, UBC, YWHAZ.
Include inline comments grouping by function (ribosomal, metabolic, transcription/reference).
2. **compile_housekeeping_genes() -> pl.DataFrame** function returning DataFrame with columns:
- gene_symbol (str)
- source (str): "literature_validated" for all
- confidence (str): "HIGH" for all
Follow the exact same pattern as `compile_known_genes()` in known_genes.py.
3. **validate_negative_controls(store: PipelineStore, percentile_threshold: float = 0.50) -> dict** function:
- Register housekeeping genes as temporary DuckDB table `_housekeeping_genes`
- Use the same PERCENT_RANK window function pattern as `validate_known_gene_ranking()` in validation.py
- Query: join ranked_genes CTE with _housekeeping_genes on gene_symbol
- INVERTED validation logic: `validation_passed = median_percentile < percentile_threshold`
- Return dict with keys: total_expected, total_in_dataset, median_percentile, top_quartile_count, in_high_tier_count, validation_passed, housekeeping_gene_details (top 20 by percentile ASC)
- Clean up temp table after query
- Use structlog logger with info/warning levels matching validation.py patterns
4. **generate_negative_control_report(metrics: dict) -> str** function:
- Follow the exact formatting pattern from generate_validation_report() in validation.py
- Show a gene table with Score and Percentile column headers
- Include interpretation text for pass/fail
Use structlog, polars, duckdb imports matching existing scoring module patterns. Import PipelineStore from usher_pipeline.persistence.duckdb_store.
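For reference, a minimal sketch of steps 1-3 above. The functional grouping comments and the exact SQL wording are illustrative; the authoritative patterns are `compile_known_genes()` and `validate_known_gene_ranking()`, which are not reproduced here.
```python
import polars as pl

# Curated negative controls (13 genes; grouping comments are illustrative).
HOUSEKEEPING_GENES_CORE: frozenset[str] = frozenset({
    # Ribosomal
    "RPL13A", "RPL32", "RPLP0",
    # Metabolic
    "GAPDH", "SDHA", "PGK1",
    # Transcription / common reference
    "ACTB", "B2M", "HPRT1", "TBP", "PPIA", "UBC", "YWHAZ",
})


def compile_housekeeping_genes() -> pl.DataFrame:
    """Return the negative control set in the same shape as compile_known_genes()."""
    genes = sorted(HOUSEKEEPING_GENES_CORE)
    return pl.DataFrame({
        "gene_symbol": genes,
        "source": ["literature_validated"] * len(genes),
        "confidence": ["HIGH"] * len(genes),
    })


# Sketch of the PERCENT_RANK join used by validate_negative_controls();
# scored_genes, composite_score, and the temp table name come from this plan.
NEGATIVE_CONTROL_QUERY = """
WITH ranked_genes AS (
    SELECT gene_symbol,
           composite_score,
           PERCENT_RANK() OVER (ORDER BY composite_score) AS percentile
    FROM scored_genes
    WHERE composite_score IS NOT NULL
)
SELECT r.gene_symbol, r.composite_score, r.percentile
FROM ranked_genes AS r
JOIN _housekeeping_genes AS h ON r.gene_symbol = h.gene_symbol
ORDER BY r.percentile ASC
LIMIT 20
"""
```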
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.scoring.negative_controls import HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report; df = compile_housekeeping_genes(); print(f'Housekeeping genes: {df.height}'); assert df.height == 13; assert set(df.columns) == {'gene_symbol', 'source', 'confidence'}; print('OK')"` exits 0
</verify>
<done>
negative_controls.py exists with 13 curated housekeeping genes, compile function returns correct DataFrame structure, validate function uses PERCENT_RANK with inverted threshold logic, report function generates human-readable output.
</done>
</task>
<task type="auto">
<name>Task 2: Enhance positive control validation with recall@k metrics</name>
<files>src/usher_pipeline/scoring/validation.py, src/usher_pipeline/scoring/__init__.py</files>
<action>
**In validation.py**, add the following functions (do NOT modify existing functions, only ADD):
1. **compute_recall_at_k(store: PipelineStore, k_values: list[int] | None = None) -> dict** function:
- Default k_values: [100, 500, 1000, 2000] (absolute counts)
- Also compute recall at percentage thresholds: top 5%, 10%, 20% of scored genes
- Query scored_genes ordered by composite_score DESC (WHERE NOT NULL)
- For each k: count how many known genes (from compile_known_genes, deduplicated on gene_symbol) appear in top-k
- Recall@k = found_in_top_k / total_known_unique
- Return dict with: recalls_absolute (dict mapping k -> recall float), recalls_percentage (dict mapping pct_string -> recall float), total_known_unique (int), total_scored (int)
- Use structlog for logging results
2. **validate_positive_controls_extended(store: PipelineStore, percentile_threshold: float = 0.75) -> dict** function:
- Call existing validate_known_gene_ranking(store, percentile_threshold) to get base metrics
- Call compute_recall_at_k(store) to get recall metrics
- Add per-source breakdown: compute median percentile separately for "omim_usher" and "syscilia_scgs_v2" genes
- Per-source query: same PERCENT_RANK CTE but filter JOIN by source
- Return dict combining base metrics + recall_at_k + per_source_breakdown (dict mapping source -> {median_percentile, count, top_quartile_count})
- This is the "full" positive control validation for Phase 6
**In __init__.py**, add exports for compute_recall_at_k and validate_positive_controls_extended, and also add imports/exports for the negative_controls module: HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report.
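A hedged sketch of compute_recall_at_k as described in step 1. Treating `store.conn` as a raw DuckDB connection is an assumption about PipelineStore; adapt to its actual query helper.
```python
import polars as pl

from usher_pipeline.scoring.known_genes import compile_known_genes


def compute_recall_at_k(store, k_values: list[int] | None = None) -> dict:
    """Recall@k over scored_genes ordered by composite_score (sketch)."""
    k_values = k_values or [100, 500, 1000, 2000]
    scored = store.conn.execute(
        "SELECT gene_symbol FROM scored_genes "
        "WHERE composite_score IS NOT NULL ORDER BY composite_score DESC"
    ).pl()
    known = set(compile_known_genes()["gene_symbol"].unique().to_list())
    ranked = scored["gene_symbol"].to_list()
    total_scored = len(ranked)

    def recall(k: int) -> float:
        # Fraction of the deduplicated known-gene set found in the top k.
        return len(known & set(ranked[:k])) / len(known) if known else 0.0

    recalls_absolute = {k: recall(k) for k in k_values}
    recalls_percentage = {
        f"top_{int(pct * 100)}pct": recall(max(1, int(total_scored * pct)))
        for pct in (0.05, 0.10, 0.20)
    }
    return {
        "recalls_absolute": recalls_absolute,
        "recalls_percentage": recalls_percentage,
        "total_known_unique": len(known),
        "total_scored": total_scored,
    }
```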
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.scoring import compute_recall_at_k, validate_positive_controls_extended, HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report; print('All imports OK')"` exits 0
</verify>
<done>
validation.py has compute_recall_at_k and validate_positive_controls_extended functions. __init__.py exports all new functions from both negative_controls.py and updated validation.py. Recall@k computes at both absolute and percentage thresholds. Per-source breakdown separates OMIM from SYSCILIA metrics.
</done>
</task>
</tasks>
<verification>
- `python -c "from usher_pipeline.scoring.negative_controls import HOUSEKEEPING_GENES_CORE; assert len(HOUSEKEEPING_GENES_CORE) == 13"` -- housekeeping genes compiled
- `python -c "from usher_pipeline.scoring import validate_negative_controls, compute_recall_at_k, validate_positive_controls_extended"` -- all functions importable
- `python -c "from usher_pipeline.scoring.negative_controls import compile_housekeeping_genes; df = compile_housekeeping_genes(); assert 'gene_symbol' in df.columns and 'source' in df.columns"` -- DataFrame structure correct
</verification>
<success_criteria>
- negative_controls.py creates housekeeping gene set and validates they rank low (inverted threshold)
- validation.py compute_recall_at_k measures recall at multiple k values including percentage-based thresholds
- validate_positive_controls_extended combines percentile + recall + per-source metrics
- All new functions exported from scoring.__init__
</success_criteria>
<output>
After completion, create `.planning/phases/06-validation/06-01-SUMMARY.md`
</output>

View File

@@ -0,0 +1,195 @@
---
phase: 06-validation
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/scoring/sensitivity.py
autonomous: true
must_haves:
truths:
- "Sensitivity analysis perturbs each weight by +-5% and +-10% and measures rank stability"
- "Spearman rank correlation is computed for top-100 genes between baseline and perturbed configurations"
- "Weight perturbation renormalizes remaining weights to maintain sum=1.0 constraint"
- "Rank stability assessment classifies each perturbation as stable (rho>=0.85) or unstable"
artifacts:
- path: "src/usher_pipeline/scoring/sensitivity.py"
provides: "Parameter sweep sensitivity analysis with Spearman correlation"
exports: ["perturb_weight", "run_sensitivity_analysis", "summarize_sensitivity"]
key_links:
- from: "src/usher_pipeline/scoring/sensitivity.py"
to: "src/usher_pipeline/scoring/integration.py"
via: "compute_composite_scores import"
pattern: "from usher_pipeline.scoring.integration import compute_composite_scores"
- from: "src/usher_pipeline/scoring/sensitivity.py"
to: "scipy.stats"
via: "spearmanr import"
pattern: "from scipy.stats import spearmanr"
- from: "src/usher_pipeline/scoring/sensitivity.py"
to: "src/usher_pipeline/config/schema.py"
via: "ScoringWeights import"
pattern: "from usher_pipeline.config.schema import ScoringWeights"
---
<objective>
Implement sensitivity analysis module for parameter sweep validation of scoring weights.
Purpose: Demonstrates that top candidate rankings are robust to reasonable weight perturbations (+-5-10%), satisfying success criterion 3 (rank stability). This is the core of the "are our results defensible?" validation.
Output: sensitivity.py module with weight perturbation, Spearman correlation analysis, and stability classification.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/06-validation/06-RESEARCH.md
@src/usher_pipeline/scoring/integration.py
@src/usher_pipeline/config/schema.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create sensitivity analysis module with weight perturbation and rank correlation</name>
<files>src/usher_pipeline/scoring/sensitivity.py</files>
<action>
Create `src/usher_pipeline/scoring/sensitivity.py` with:
1. **EVIDENCE_LAYERS** list constant: ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]
2. **DEFAULT_DELTAS** list constant: [-0.10, -0.05, 0.05, 0.10]
3. **STABILITY_THRESHOLD** float constant: 0.85 (Spearman rho threshold for "stable")
4. **perturb_weight(baseline: ScoringWeights, layer: str, delta: float) -> ScoringWeights** function:
- Get baseline weights as dict via baseline.model_dump()
- Apply perturbation: w_dict[layer] = max(0.0, min(1.0, w_dict[layer] + delta))
- Renormalize ALL weights so they sum to 1.0: divide each by total
- Return new ScoringWeights instance
- Raise ValueError if layer not in EVIDENCE_LAYERS
5. **run_sensitivity_analysis(store: PipelineStore, baseline_weights: ScoringWeights, deltas: list[float] | None = None, top_n: int = 100) -> dict** function:
- Default deltas to DEFAULT_DELTAS if None
- Compute baseline scores via compute_composite_scores(store, baseline_weights)
- Sort by composite_score DESC, take top_n genes as baseline ranking
- For each layer in EVIDENCE_LAYERS, for each delta in deltas:
- Create perturbed weights via perturb_weight()
- Compute perturbed scores via compute_composite_scores(store, perturbed_weights)
- Sort by composite_score DESC, take top_n genes
- Inner join baseline and perturbed on gene_symbol to get paired scores
- If fewer than 10 overlapping genes, log warning and record rho=None
- Otherwise compute spearmanr() on paired composite_score columns
- Record: layer, delta, perturbed_weights (as dict), spearman_rho, spearman_pval, overlap_count (how many of top_n genes appear in both), top_n
- Return dict with keys: baseline_weights (dict), results (list of per-perturbation dicts), top_n, total_perturbations
- Use structlog for progress logging (log each perturbation result)
IMPORTANT: The compute_composite_scores function re-queries the DB each time. This is by design -- different weights produce different composite_score values from the same underlying evidence layer scores.
For the Spearman correlation, join baseline_top_n and perturbed_top_n DataFrames on gene_symbol (inner join). Use the composite_score from each as the paired values. This measures whether the relative ordering of shared top genes is preserved.
6. **summarize_sensitivity(analysis_result: dict) -> dict** function:
- From the results list, compute:
- min_rho, max_rho, mean_rho across all perturbations (excluding None values)
- count of stable perturbations (rho >= STABILITY_THRESHOLD)
- count of unstable perturbations (rho < STABILITY_THRESHOLD)
- most_sensitive_layer: layer with lowest mean rho across its perturbations
- most_robust_layer: layer with highest mean rho across its perturbations
- overall_stable: bool = all non-None rhos >= STABILITY_THRESHOLD
- Return dict with: min_rho, max_rho, mean_rho, stable_count, unstable_count, total_perturbations, overall_stable, most_sensitive_layer, most_robust_layer
7. **generate_sensitivity_report(analysis_result: dict, summary: dict) -> str** function:
- Follow the formatting pattern from generate_validation_report() in validation.py
- Show table: Layer | Delta | Spearman rho | p-value | Stable?
- Show summary: overall stability verdict, most/least sensitive layers
- Include interpretation text
Use structlog, polars, scipy.stats.spearmanr imports. Import compute_composite_scores from usher_pipeline.scoring.integration, ScoringWeights from usher_pipeline.config.schema, PipelineStore from usher_pipeline.persistence.duckdb_store.
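As a reference for the two central pieces above -- the renormalizing perturbation and the Spearman pairing -- here is a minimal sketch. It assumes ScoringWeights is a Pydantic v2 model whose fields are exactly the six layer weights, and that the top-N frames carry gene_symbol and composite_score columns.
```python
import polars as pl
from scipy.stats import spearmanr

from usher_pipeline.config.schema import ScoringWeights

EVIDENCE_LAYERS = ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]


def perturb_weight(baseline: ScoringWeights, layer: str, delta: float) -> ScoringWeights:
    """Perturb one layer weight, clamp to [0, 1], then renormalize to sum=1.0."""
    if layer not in EVIDENCE_LAYERS:
        raise ValueError(f"Unknown evidence layer: {layer}")
    weights = baseline.model_dump()  # assumes only the six weight fields
    weights[layer] = max(0.0, min(1.0, weights[layer] + delta))
    total = sum(weights.values())
    return ScoringWeights(**{name: value / total for name, value in weights.items()})


def _rank_correlation(baseline_top: pl.DataFrame, perturbed_top: pl.DataFrame) -> tuple[float | None, int]:
    """Pair the two top-N frames on gene_symbol and compute Spearman rho."""
    paired = baseline_top.join(perturbed_top, on="gene_symbol", how="inner", suffix="_perturbed")
    if paired.height < 10:
        return None, paired.height
    rho, _pval = spearmanr(
        paired["composite_score"].to_numpy(),
        paired["composite_score_perturbed"].to_numpy(),
    )
    return float(rho), paired.height
```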
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.sensitivity import perturb_weight, EVIDENCE_LAYERS, STABILITY_THRESHOLD, DEFAULT_DELTAS
from usher_pipeline.config.schema import ScoringWeights
# Test weight perturbation
w = ScoringWeights()
p = perturb_weight(w, 'gnomad', 0.10)
p.validate_sum() # Must not raise
print(f'Original gnomad: {w.gnomad}, Perturbed: {p.gnomad:.4f}')
assert p.gnomad > w.gnomad, 'Perturbed weight should be higher'
# Test renormalization
total = p.gnomad + p.expression + p.annotation + p.localization + p.animal_model + p.literature
assert abs(total - 1.0) < 1e-6, f'Weights must sum to 1.0, got {total}'
# Test edge: perturb to near-zero
p_low = perturb_weight(w, 'gnomad', -0.25)
p_low.validate_sum()
assert p_low.gnomad >= 0.0, 'Weight must not go negative'
print('All perturb_weight tests passed')
"` exits 0
</verify>
<done>
sensitivity.py exists with perturb_weight (renormalizing), run_sensitivity_analysis (computing Spearman rho for top-N genes across all layer/delta combinations), summarize_sensitivity (stability classification), and generate_sensitivity_report (formatted output). Weights always renormalize to sum=1.0 after perturbation.
</done>
</task>
<task type="auto">
<name>Task 2: Export sensitivity module from scoring package</name>
<files>src/usher_pipeline/scoring/__init__.py</files>
<action>
Update `src/usher_pipeline/scoring/__init__.py` to add imports and exports for the sensitivity module:
Add imports:
```python
from usher_pipeline.scoring.sensitivity import (
perturb_weight,
run_sensitivity_analysis,
summarize_sensitivity,
generate_sensitivity_report,
EVIDENCE_LAYERS,
STABILITY_THRESHOLD,
)
```
Add to __all__ list: "perturb_weight", "run_sensitivity_analysis", "summarize_sensitivity", "generate_sensitivity_report", "EVIDENCE_LAYERS", "STABILITY_THRESHOLD"
NOTE: Plan 01 may have already updated __init__.py to add negative_controls exports. If so, ADD the sensitivity imports alongside those -- do not remove them. Read the file first to check current state.
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.scoring import perturb_weight, run_sensitivity_analysis, summarize_sensitivity, generate_sensitivity_report, EVIDENCE_LAYERS, STABILITY_THRESHOLD; print(f'EVIDENCE_LAYERS: {EVIDENCE_LAYERS}'); print(f'STABILITY_THRESHOLD: {STABILITY_THRESHOLD}'); print('All sensitivity exports OK')"` exits 0
</verify>
<done>
All sensitivity analysis functions and constants are importable from usher_pipeline.scoring. Existing exports from negative_controls (Plan 01) are preserved.
</done>
</task>
</tasks>
<verification>
- `python -c "from usher_pipeline.scoring.sensitivity import perturb_weight; from usher_pipeline.config.schema import ScoringWeights; w = ScoringWeights(); p = perturb_weight(w, 'gnomad', 0.05); p.validate_sum(); print('OK')"` -- weight perturbation works and renormalizes
- `python -c "from usher_pipeline.scoring import run_sensitivity_analysis, summarize_sensitivity, generate_sensitivity_report"` -- all exports available
- `python -c "from usher_pipeline.scoring.sensitivity import STABILITY_THRESHOLD; assert STABILITY_THRESHOLD == 0.85"` -- threshold configured
</verification>
<success_criteria>
- perturb_weight correctly perturbs one layer and renormalizes to sum=1.0
- run_sensitivity_analysis computes Spearman rho for all layer x delta combinations
- summarize_sensitivity classifies perturbations as stable/unstable
- generate_sensitivity_report produces human-readable output
- All functions exported from scoring package
</success_criteria>
<output>
After completion, create `.planning/phases/06-validation/06-02-SUMMARY.md`
</output>

View File

@@ -0,0 +1,211 @@
---
phase: 06-validation
plan: 03
type: execute
wave: 2
depends_on: ["06-01", "06-02"]
files_modified:
- src/usher_pipeline/scoring/validation_report.py
- src/usher_pipeline/cli/validate_cmd.py
- src/usher_pipeline/cli/main.py
- tests/test_validation.py
autonomous: true
must_haves:
truths:
- "CLI validate command runs positive controls, negative controls, and sensitivity analysis in sequence"
- "Comprehensive validation report documents all three validation prongs with pass/fail verdicts"
- "Weight tuning recommendations are generated based on validation results with documented rationale"
- "Tests verify negative control logic, recall@k computation, weight perturbation, and report generation"
artifacts:
- path: "src/usher_pipeline/scoring/validation_report.py"
provides: "Comprehensive validation report combining all three validation prongs"
exports: ["generate_comprehensive_validation_report", "recommend_weight_tuning"]
- path: "src/usher_pipeline/cli/validate_cmd.py"
provides: "CLI validate command orchestrating full validation pipeline"
exports: ["validate"]
- path: "tests/test_validation.py"
provides: "Unit tests for negative controls, recall@k, sensitivity, and validation report"
key_links:
- from: "src/usher_pipeline/cli/validate_cmd.py"
to: "src/usher_pipeline/scoring/negative_controls.py"
via: "validate_negative_controls import"
pattern: "from usher_pipeline.scoring import validate_negative_controls"
- from: "src/usher_pipeline/cli/validate_cmd.py"
to: "src/usher_pipeline/scoring/sensitivity.py"
via: "run_sensitivity_analysis import"
pattern: "from usher_pipeline.scoring import run_sensitivity_analysis"
- from: "src/usher_pipeline/cli/validate_cmd.py"
to: "src/usher_pipeline/scoring/validation.py"
via: "validate_positive_controls_extended import"
pattern: "from usher_pipeline.scoring import validate_positive_controls_extended"
- from: "src/usher_pipeline/cli/main.py"
to: "src/usher_pipeline/cli/validate_cmd.py"
via: "Click group add_command"
pattern: "cli.add_command.*validate"
---
<objective>
Create comprehensive validation report generator, CLI validate command, and unit tests for all Phase 6 validation modules.
Purpose: This plan wires together the positive control, negative control, and sensitivity analysis modules (from Plans 01 and 02) into a single CLI command and comprehensive report. Tests ensure correctness with synthetic data. This completes Phase 6 by providing the user-facing validation workflow.
Output: validation_report.py, validate_cmd.py (CLI), updated main.py, and test_validation.py.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/06-validation/06-RESEARCH.md
# Need SUMMARYs from Plans 01 and 02 for what was actually built
@.planning/phases/06-validation/06-01-SUMMARY.md
@.planning/phases/06-validation/06-02-SUMMARY.md
@src/usher_pipeline/scoring/__init__.py
@src/usher_pipeline/scoring/negative_controls.py
@src/usher_pipeline/scoring/sensitivity.py
@src/usher_pipeline/scoring/validation.py
@src/usher_pipeline/cli/score_cmd.py
@src/usher_pipeline/cli/main.py
@tests/test_scoring.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create comprehensive validation report and CLI validate command</name>
<files>src/usher_pipeline/scoring/validation_report.py, src/usher_pipeline/cli/validate_cmd.py, src/usher_pipeline/cli/main.py</files>
<action>
**Create `src/usher_pipeline/scoring/validation_report.py`:**
1. **generate_comprehensive_validation_report(positive_metrics: dict, negative_metrics: dict, sensitivity_result: dict, sensitivity_summary: dict) -> str** function:
- Generate a multi-section Markdown report combining all three validation prongs
- Section 1: "Positive Control Validation" -- median percentile, recall@k table, per-source breakdown, pass/fail
- Section 2: "Negative Control Validation" -- median percentile, top quartile count, in-HIGH-tier count, pass/fail
- Section 3: "Sensitivity Analysis" -- Spearman rho table (layer x delta), overall stability verdict, most/least sensitive layers
- Section 4: "Overall Validation Summary" -- all-pass/partial-fail/fail verdict
- Section 5: "Weight Tuning Recommendations" -- call recommend_weight_tuning()
- Return the full Markdown string
2. **recommend_weight_tuning(positive_metrics: dict, negative_metrics: dict, sensitivity_summary: dict) -> str** function:
- Analyze validation results and suggest weight adjustments
- If positive controls pass AND negative controls pass AND sensitivity stable: "Current weights are validated. No tuning recommended."
- If positive controls fail: suggest increasing weights for layers where known genes score highly
- If negative controls fail (housekeeping genes ranking too high): suggest examining which layers boost housekeeping genes
- If sensitivity unstable: identify most sensitive layer and suggest reducing its weight
- Document rationale for each recommendation
- CRITICAL: Note that any tuning is "post-validation" and flag the circular-validation risk identified as a pitfall in 06-RESEARCH.md
- Return formatted recommendation text
3. **save_validation_report(report_text: str, output_path: Path) -> None**: Write report to file
**Create `src/usher_pipeline/cli/validate_cmd.py`:**
Follow the established CLI pattern from score_cmd.py (config load, store init, checkpoint, steps, summary, cleanup):
1. Click command `validate` with options:
- `--force`: Re-run even if validation checkpoint exists
- `--skip-sensitivity`: Skip sensitivity analysis (faster iteration)
- `--output-dir`: Output directory for validation report (default: {data_dir}/validation)
- `--top-n`: Top N genes for sensitivity analysis (default: 100)
2. Pipeline steps:
- Step 1: Load configuration and initialize store
- Step 2: Check scored_genes checkpoint exists (error if not -- must run `score` first)
- Step 3: Run positive control validation (validate_positive_controls_extended)
- Step 4: Run negative control validation (validate_negative_controls)
- Step 5: Run sensitivity analysis (unless --skip-sensitivity) -- run_sensitivity_analysis + summarize_sensitivity
- Step 6: Generate comprehensive validation report (generate_comprehensive_validation_report)
- Step 7: Save report to output_dir/validation_report.md and provenance sidecar
3. Use click.echo with styled output matching score_cmd.py patterns (green for success, yellow for warnings, red for errors, bold for step headers)
4. Provenance tracking: record_step for each validation phase with metrics
5. Final summary: display overall pass/fail, recall@top-10%, housekeeping median percentile, sensitivity stability
**Update `src/usher_pipeline/cli/main.py`:**
- Import validate from validate_cmd
- Add: `cli.add_command(validate)`
- Follow the existing pattern used for score and report commands
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.cli.validate_cmd import validate; print(f'Command name: {validate.name}'); print('OK')"` exits 0 AND `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.cli.main import cli; commands = list(cli.commands.keys()); print(f'CLI commands: {commands}'); assert 'validate' in commands; print('OK')"` exits 0
</verify>
<done>
validation_report.py generates comprehensive multi-section Markdown report with weight tuning recommendations. validate_cmd.py provides CLI command running all three validation prongs. main.py registers validate as a CLI subcommand. All follow established patterns from score_cmd.py.
</done>
</task>
<task type="auto">
<name>Task 2: Create unit tests for all validation modules</name>
<files>tests/test_validation.py</files>
<action>
Create `tests/test_validation.py` with comprehensive tests using synthetic DuckDB data. Follow the test pattern from tests/test_scoring.py (use tmp_path fixtures, create in-memory DuckDB with synthetic data).
**Test helper: create_synthetic_scored_db(tmp_path)** function:
- Create DuckDB with a gene_universe of 20 genes: the named control genes below plus GENE001-style filler genes
- Create scored_genes table with composite_score and all 6 layer scores
- Design scores so that:
- MYO7A, IFT88, BBS1 (known cilia genes) get high scores (0.8-0.95)
- GAPDH, ACTB, RPL13A (housekeeping genes) get low scores (0.1-0.3)
- Other genes get mid-range scores (0.3-0.6)
- This ensures positive controls rank high and negative controls rank low in tests
**Tests to include:**
1. **test_compile_housekeeping_genes_structure**: Verify compile_housekeeping_genes() returns DataFrame with 13 genes, correct columns (gene_symbol, source, confidence), all confidence=HIGH, all source=literature_validated
2. **test_compile_housekeeping_genes_known_genes_present**: Assert GAPDH, ACTB, RPL13A, TBP are in the gene_symbol column
3. **test_validate_negative_controls_with_synthetic_data**: Use synthetic DB where housekeeping genes score low. Assert validation_passed=True, median_percentile < 0.5
4. **test_validate_negative_controls_inverted_logic**: Create a DB where housekeeping genes score HIGH (artificial scenario). Assert validation_passed=False
5. **test_compute_recall_at_k**: Use synthetic DB. Assert recall@k returns dict with recalls_absolute and recalls_percentage keys. With 3 known genes in top 5 of 20, recall@5 should be high (>0.5)
6. **test_perturb_weight_renormalizes**: Perturb gnomad by +0.10, assert weights still sum to 1.0. Perturb by -0.25 (more than weight value), assert weight >= 0.0 and sum = 1.0
7. **test_perturb_weight_invalid_layer**: perturb_weight with layer="nonexistent" should raise ValueError
8. **test_generate_comprehensive_validation_report_format**: Pass mock metrics dicts, assert report contains expected sections ("Positive Control", "Negative Control", "Sensitivity Analysis", "Weight Tuning")
9. **test_recommend_weight_tuning_all_pass**: Pass metrics indicating all validations pass. Assert response contains "No tuning recommended" or similar
All tests should use tmp_path for DuckDB isolation. Import from usher_pipeline.scoring (not internal modules directly where possible). Use PipelineStore with direct conn assignment pattern from test_scoring.py.
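A few representative tests as a sketch; these assume the exports and the ScoringWeights field layout specified in Plans 01 and 02.
```python
import pytest

from usher_pipeline.config.schema import ScoringWeights
from usher_pipeline.scoring import compile_housekeeping_genes, perturb_weight


def test_compile_housekeeping_genes_structure():
    df = compile_housekeeping_genes()
    assert df.height == 13
    assert set(df.columns) == {"gene_symbol", "source", "confidence"}
    assert df["confidence"].unique().to_list() == ["HIGH"]


def test_perturb_weight_renormalizes():
    perturbed = perturb_weight(ScoringWeights(), "gnomad", 0.10)
    # Assumes model_dump() contains only the six layer weight fields.
    total = sum(perturbed.model_dump().values())
    assert abs(total - 1.0) < 1e-6


def test_perturb_weight_invalid_layer():
    with pytest.raises(ValueError):
        perturb_weight(ScoringWeights(), "nonexistent", 0.05)
```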
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_validation.py -v --tb=short` -- all tests pass
</verify>
<done>
test_validation.py contains 9+ tests covering negative controls, recall@k, weight perturbation, sensitivity analysis, and report generation. All tests pass using synthetic DuckDB data with designed score patterns ensuring known genes rank high and housekeeping genes rank low.
</done>
</task>
</tasks>
<verification>
- `python -m pytest tests/test_validation.py -v` -- all validation tests pass
- `python -c "from usher_pipeline.cli.main import cli; assert 'validate' in cli.commands"` -- CLI command registered
- `python -c "from usher_pipeline.scoring.validation_report import generate_comprehensive_validation_report, recommend_weight_tuning"` -- report functions importable
- `usher-pipeline validate --help` displays usage information with all options
</verification>
<success_criteria>
- CLI `validate` command runs positive + negative + sensitivity validations and generates comprehensive report
- Validation report includes all three prongs with pass/fail verdicts and weight tuning recommendations
- Unit tests cover negative controls, recall@k, perturbation, and report generation
- All tests pass with synthetic data
- validate command registered in main CLI
</success_criteria>
<output>
After completion, create `.planning/phases/06-validation/06-03-SUMMARY.md`
</output>