- Create SUMMARY.md with implementation details and verification results
- Update STATE.md: progress 100% (20/20 plans), plan 06-02 complete
- Record decisions: perturbation deltas, stability threshold, renormalization
- All tasks completed with 2 commits in 3 minutes
| phase | plan | subsystem | tags | dependency_graph | tech_stack | key_files | decisions | metrics |
|---|---|---|---|---|---|---|---|---|
| 06-validation | 02 | validation | | | | | | |
# Phase 6 Plan 02: Sensitivity Analysis Module Summary

**One-liner:** Parameter-sweep sensitivity analysis with Spearman rank correlation for scoring-weight robustness validation (±5-10% perturbations, rho >= 0.85 stability threshold)
## Implementation

### Task 1: Create sensitivity analysis module with weight perturbation and rank correlation

**Commit:** `a7589d9`

**What was built:**
- Created `src/usher_pipeline/scoring/sensitivity.py` with:
  - Constants:
    - `EVIDENCE_LAYERS`: list of the 6 evidence layer names (gnomad, expression, annotation, localization, animal_model, literature)
    - `DEFAULT_DELTAS`: `[-0.10, -0.05, 0.05, 0.10]` for ±5% and ±10% perturbations
    - `STABILITY_THRESHOLD`: `0.85` (Spearman rho threshold for "stable" classification)
  - `perturb_weight(baseline, layer, delta)`:
    - Perturbs one weight by the delta amount
    - Clamps the perturbed weight to [0.0, 1.0]
    - Renormalizes ALL weights so they sum to 1.0
    - Returns a new ScoringWeights instance
    - Validates the layer name (raises ValueError if invalid)
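A minimal sketch of this clamp-and-renormalize logic, assuming weights are held in a plain dict (the real ScoringWeights is a config schema class with named fields, so this is illustrative only):

```python
def perturb_weight(baseline: dict, layer: str, delta: float) -> dict:
    """Perturb one layer's weight, clamp it to [0, 1], renormalize all to sum 1.0."""
    if layer not in baseline:
        raise ValueError(f"Unknown layer: {layer!r}")
    perturbed = dict(baseline)
    # Clamp the single perturbed weight into [0.0, 1.0]
    perturbed[layer] = min(1.0, max(0.0, baseline[layer] + delta))
    # Renormalize ALL weights so the new vector sums to 1.0
    total = sum(perturbed.values())
    return {k: v / total for k, v in perturbed.items()}

# Assumed baseline values; only the 6 layer names come from the module
base = {"gnomad": 0.2, "expression": 0.2, "annotation": 0.15,
        "localization": 0.15, "animal_model": 0.15, "literature": 0.15}
bumped = perturb_weight(base, "gnomad", 0.10)  # gnomad -> 0.3 / 1.1 ~ 0.2727
```

Renormalizing by the new total (rather than adjusting only the other weights) keeps every perturbed vector a valid weight set in one step.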
  - `run_sensitivity_analysis(store, baseline_weights, deltas, top_n)`:
    - Computes baseline composite scores and gets the top-N genes
    - For each layer × delta combination:
      - Creates perturbed weights via `perturb_weight()`
      - Recomputes composite scores with the perturbed weights
      - Gets the top-N genes from the perturbed scores
      - Inner-joins baseline and perturbed top-N on gene_symbol
      - Computes Spearman rank correlation on the composite_score of overlapping genes
      - Records: layer, delta, perturbed_weights, spearman_rho, spearman_pval, overlap_count
    - Returns a dict with baseline_weights, results list, top_n, total_perturbations
    - Logs each perturbation result with structlog
    - Handles insufficient overlap (< 10 genes) by setting rho=None and logging a warning
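In miniature, the sweep described above can be sketched end to end. Everything here is a stand-in: `score_fn` replaces the pipeline's DB-backed composite-score query, weights are a plain dict rather than a ScoringWeights instance, and `_spearman` is a tie-free pure-Python substitute for `scipy.stats.spearmanr`:

```python
from itertools import product

EVIDENCE_LAYERS = ["gnomad", "expression", "annotation",
                   "localization", "animal_model", "literature"]
DEFAULT_DELTAS = [-0.10, -0.05, 0.05, 0.10]
MIN_OVERLAP = 10  # below this, rho is recorded as None

def _perturb(weights, layer, delta):
    # Clamp the perturbed weight to [0, 1], then renormalize all to sum 1.0
    w = dict(weights)
    w[layer] = min(1.0, max(0.0, w[layer] + delta))
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

def _spearman(pairs):
    # Tie-free rank correlation: Pearson computed on ranks (toy substitute)
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    x = ranks([p[0] for p in pairs])
    y = ranks([p[1] for p in pairs])
    n = len(pairs)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def run_sensitivity_analysis(score_fn, baseline, deltas=DEFAULT_DELTAS, top_n=100):
    base = score_fn(baseline)
    base_top = dict(sorted(base.items(), key=lambda kv: -kv[1])[:top_n])
    results = []
    for layer, delta in product(EVIDENCE_LAYERS, deltas):
        scores = score_fn(_perturb(baseline, layer, delta))
        top = dict(sorted(scores.items(), key=lambda kv: -kv[1])[:top_n])
        shared = base_top.keys() & top.keys()  # inner join on gene symbol
        pairs = [(base_top[g], top[g]) for g in shared]
        rho = _spearman(pairs) if len(pairs) >= MIN_OVERLAP else None
        results.append({"layer": layer, "delta": delta,
                        "spearman_rho": rho, "overlap_count": len(pairs)})
    return {"baseline_weights": baseline, "results": results,
            "top_n": top_n, "total_perturbations": len(results)}
```

With 6 layers and 4 deltas this produces 24 perturbation records per run.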
  - `summarize_sensitivity(analysis_result)`:
    - Computes global statistics: min_rho, max_rho, mean_rho (excluding None)
    - Counts stable (rho >= STABILITY_THRESHOLD) and unstable perturbations
    - Determines overall_stable: all non-None rhos >= threshold
    - Computes per-layer mean rho
    - Identifies the most_sensitive_layer (lowest mean rho) and most_robust_layer (highest mean rho)
    - Returns a summary dict with stability classification
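A sketch of the summarization step, assuming the per-perturbation result dicts described above (a simplified stand-in for the real function; it assumes at least one non-None rho):

```python
STABILITY_THRESHOLD = 0.85

def summarize_sensitivity(analysis):
    """Condense per-perturbation rhos into global and per-layer statistics."""
    rhos = [r["spearman_rho"] for r in analysis["results"]
            if r["spearman_rho"] is not None]
    stable = sum(1 for rho in rhos if rho >= STABILITY_THRESHOLD)
    # Per-layer mean rho, used to find the most/least sensitive layers
    by_layer = {}
    for r in analysis["results"]:
        if r["spearman_rho"] is not None:
            by_layer.setdefault(r["layer"], []).append(r["spearman_rho"])
    layer_means = {l: sum(v) / len(v) for l, v in by_layer.items()}
    return {
        "min_rho": min(rhos), "max_rho": max(rhos),
        "mean_rho": sum(rhos) / len(rhos),
        "stable_count": stable, "unstable_count": len(rhos) - stable,
        "overall_stable": stable == len(rhos),
        "most_sensitive_layer": min(layer_means, key=layer_means.get),
        "most_robust_layer": max(layer_means, key=layer_means.get),
    }
```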
  - `generate_sensitivity_report(analysis_result, summary)`:
    - Follows the formatting pattern from validation.py's `generate_validation_report()`
    - Shows status: "STABLE ✓" or "UNSTABLE ✗"
    - Summary section with total/stable/unstable counts, mean rho, and range
    - Interpretation text explaining the stability verdict
    - Most sensitive/robust layer identification
    - Table with columns: Layer | Delta | Spearman rho | p-value | Overlap | Stable?
    - Uses ✓/✗ marks for per-perturbation stability
**Key implementation details:**
- Weight renormalization: after perturbing one weight, divides all weights by the new total to maintain sum=1.0
- `compute_composite_scores` re-queries the DB each time (by design: different weights produce different scores)
- Spearman correlation measures whether the relative ordering of shared top genes is preserved
- Uses `scipy.stats.spearmanr` for the correlation computation
- The inner join ensures only genes present in both top-N lists are compared
- `structlog` for progress logging (one log per perturbation)
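The inner-join-then-correlate step can be shown with toy data; the gene symbols and scores below are made up for illustration:

```python
# Toy top-N score maps; values are illustrative only
baseline_top = {"MYO7A": 0.92, "USH2A": 0.88, "CDH23": 0.85, "PCDH15": 0.80}
perturbed_top = {"MYO7A": 0.90, "CDH23": 0.87, "USH2A": 0.86, "GPR98": 0.75}

# "Inner join" on gene symbol: keep only genes present in both top-N lists
shared = sorted(set(baseline_top) & set(perturbed_top))
pairs = [(baseline_top[g], perturbed_top[g]) for g in shared]
print(shared)  # ['CDH23', 'MYO7A', 'USH2A']
# scipy.stats.spearmanr would then be applied to the two paired score columns
```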
**Verification result: PASSED**
- Weight perturbation works correctly (gnomad increased from 0.2 to 0.2727 with +0.10 delta)
- Renormalization maintains sum=1.0 (verified within 1e-6 tolerance)
- Edge case handling: perturb to near-zero (-0.25) clamps to 0.0 and renormalizes correctly
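The quoted renormalization arithmetic can be checked by hand; only gnomad = 0.2 is stated above, so the other weights below are assumed values chosen to make the baseline sum to 1.0:

```python
# Assumed baseline weights (only gnomad = 0.2 appears in the verification above)
weights = {"gnomad": 0.2, "expression": 0.2, "annotation": 0.15,
           "localization": 0.15, "animal_model": 0.15, "literature": 0.15}

weights["gnomad"] += 0.10            # 0.2 -> 0.3; the weights now sum to 1.1
total = sum(weights.values())
renormalized = {k: v / total for k, v in weights.items()}

print(round(renormalized["gnomad"], 4))  # 0.2727, matching the verified value
assert abs(sum(renormalized.values()) - 1.0) < 1e-6
```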
### Task 2: Export sensitivity module from scoring package

**Commit:** `0084a67`

**What was built:**
- Updated `src/usher_pipeline/scoring/__init__.py`:
  - Added imports from the sensitivity module:
    - Functions: `perturb_weight`, `run_sensitivity_analysis`, `summarize_sensitivity`, `generate_sensitivity_report`
    - Constants: `EVIDENCE_LAYERS`, `STABILITY_THRESHOLD`
  - Added all 6 sensitivity exports to the `__all__` list
  - Preserved the existing negative_controls exports from Plan 06-01
**Key implementation details:**
- Followed established pattern from existing scoring module exports
- Added alongside negative_controls imports (Plan 01 already executed)
- All sensitivity functions now importable from usher_pipeline.scoring
**Verification result: PASSED**
- All sensitivity exports available:
  `from usher_pipeline.scoring import perturb_weight, run_sensitivity_analysis, summarize_sensitivity, generate_sensitivity_report, EVIDENCE_LAYERS, STABILITY_THRESHOLD`
- Constants verified: `EVIDENCE_LAYERS` has 6 layers, `STABILITY_THRESHOLD` = 0.85
## Deviations from Plan
None - plan executed exactly as written.
## Success Criteria

All success criteria met:
- `perturb_weight` correctly perturbs one layer and renormalizes to sum=1.0
- `run_sensitivity_analysis` computes Spearman rho for all layer × delta combinations
- `summarize_sensitivity` classifies perturbations as stable/unstable
- `generate_sensitivity_report` produces human-readable output
- All functions exported from the scoring package
## Verification

Verification commands executed:

- Weight perturbation and renormalization:

  ```bash
  python -c "
  from usher_pipeline.scoring.sensitivity import perturb_weight
  from usher_pipeline.config.schema import ScoringWeights
  w = ScoringWeights()
  p = perturb_weight(w, 'gnomad', 0.05)
  p.validate_sum()
  print('OK')
  "
  ```

  Result: PASSED - `validate_sum()` did not raise

- All exports available:

  ```bash
  python -c "from usher_pipeline.scoring import run_sensitivity_analysis, summarize_sensitivity, generate_sensitivity_report"
  ```

  Result: PASSED - all imports successful

- Threshold configured:

  ```bash
  python -c "from usher_pipeline.scoring.sensitivity import STABILITY_THRESHOLD; assert STABILITY_THRESHOLD == 0.85"
  ```

  Result: PASSED - threshold correctly set to 0.85
## Self-Check

Verifying all claimed artifacts exist:

Created files:
- `src/usher_pipeline/scoring/sensitivity.py` - EXISTS

Modified files:
- `src/usher_pipeline/scoring/__init__.py` - EXISTS

Commits:
- `a7589d9` - EXISTS (feat: implement sensitivity analysis module)
- `0084a67` - EXISTS (feat: export sensitivity module from scoring package)
Self-Check: PASSED
All files, commits, and functionality verified.
## Notes

Integration with the broader validation workflow:

The sensitivity analysis module complements the positive and negative control validation:
- Positive controls (Plan 06-01): Validate that known genes rank highly
- Negative controls (Plan 06-01): Validate that housekeeping genes rank low
- Sensitivity analysis (Plan 06-02): Validate that rankings are stable under weight perturbations
This combination provides three-pronged validation:
- Known genes rank high (scoring system captures known biology)
- Housekeeping genes rank low (scoring system discriminates against generic genes)
- Rankings stable under perturbations (results defensible, not arbitrary)
Key design choices:

- **Renormalization strategy:** after perturbing one weight, renormalizes ALL weights to maintain the sum=1.0 constraint. This ensures perturbed weights are always valid ScoringWeights instances.
- **Spearman vs Pearson:** uses Spearman rank correlation (not Pearson) because we care about ordinal ranking preservation, not a linear relationship between scores. More appropriate for rank stability assessment.
- **Top-N comparison:** compares the top-100 genes (by default) because:
  - It is relevant for the candidate prioritization use case
  - It reduces the computational burden vs a whole-genome comparison
  - It focuses on high-scoring genes, where rank changes matter most
- **Overlap threshold:** requires >= 10 overlapping genes for Spearman correlation to avoid meaningless correlations from tiny samples. Records rho=None if the overlap is insufficient.
- **Stability threshold:** 0.85 chosen as the "stable" cutoff based on common practice in rank stability studies. Allows some rank shuffling (15%) while ensuring the overall ordering is preserved.
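The Spearman-vs-Pearson choice can be seen on toy data where a perturbation preserves the ordering of scores but shifts their magnitudes; the rank correlation below is a tie-free pure-Python stand-in for `scipy.stats.spearmanr`, and the score values are invented for illustration:

```python
def pearson(x, y):
    # Plain Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    # Spearman = Pearson computed on ranks (tie-free toy version)
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

baseline_scores = [0.9, 0.8, 0.7, 0.6, 0.5]        # evenly spaced
perturbed_scores = [0.95, 0.70, 0.40, 0.20, 0.10]  # same ordering, new magnitudes

print(spearman(baseline_scores, perturbed_scores))       # 1.0: ordering preserved
print(pearson(baseline_scores, perturbed_scores) < 1.0)  # True: not perfectly linear
```

Spearman reports perfect stability here because the ranking is unchanged, while Pearson is dragged below 1.0 by the magnitude shift it is not meant to ignore.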
Usage pattern:

```python
from usher_pipeline.persistence.duckdb_store import PipelineStore
from usher_pipeline.config.schema import ScoringWeights
from usher_pipeline.scoring import (
    run_sensitivity_analysis,
    summarize_sensitivity,
    generate_sensitivity_report,
)

# Initialize
store = PipelineStore(db_path)
baseline_weights = ScoringWeights()  # or load from config

# Run sensitivity analysis
analysis = run_sensitivity_analysis(
    store,
    baseline_weights,
    deltas=[-0.10, -0.05, 0.05, 0.10],
    top_n=100,
)

# Summarize results
summary = summarize_sensitivity(analysis)

# Generate report
report = generate_sensitivity_report(analysis, summary)
print(report)

# Check overall stability
if summary["overall_stable"]:
    print("Results are robust to weight perturbations!")
else:
    print(f"Warning: {summary['unstable_count']} perturbations unstable")
    print(f"Most sensitive layer: {summary['most_sensitive_layer']}")
```
Performance considerations:
- Runs 6 layers × 4 deltas = 24 perturbations by default
- Each perturbation requires a full composite score computation (DB query)
- For 20K genes, expect ~1-2 minutes total runtime
- Perturbations could be parallelized if performance becomes an issue
Future enhancements:
Potential extensions not in current plan:
- Bootstrapping for confidence intervals on Spearman rho
- Visualization: heatmap of stability by layer × delta
- Sensitivity to multiple simultaneous weight changes (2D/3D sweeps)
- Automatic weight tuning based on stability landscape