usher-exploring/.planning/phases/06-validation/06-01-SUMMARY.md
gbanyan a2d6e97acf docs(06-01): complete negative controls and recall@k validation plan
Summary:
- Created negative_controls.py with 13 housekeeping genes
- Added recall@k metrics (absolute and percentage thresholds)
- Added per-source breakdown for OMIM vs SYSCILIA
- Updated STATE.md with Phase 6 progress and decisions

Plan Summary: .planning/phases/06-validation/06-01-SUMMARY.md
2026-02-12 04:43:59 +08:00


phase: 06-validation
plan: 01
subsystem: validation
tags: negative-controls, recall-metrics, housekeeping-genes, positive-controls
dependency_graph:
  • requires: 04-02-scoring-qc, 04-01-known-genes
  • provides: negative-control-validation, recall-at-k-metrics, extended-positive-validation
  • affects: 06-03-comprehensive-validation-report
tech_stack (added / patterns): inverted-threshold-validation, recall-at-k, per-source-breakdown
key_files:
  • created: src/usher_pipeline/scoring/negative_controls.py
  • modified: src/usher_pipeline/scoring/validation.py, src/usher_pipeline/scoring/__init__.py
decisions:
  • Housekeeping genes as negative controls: 13 literature-validated genes (Eisenberg & Levanon 2013)
  • Inverted threshold logic for negative controls: median percentile < 50% is success
  • Recall@k at both absolute (100, 500, 1000, 2000) and percentage (5%, 10%, 20%) thresholds
  • Per-source breakdown separates OMIM Usher from SYSCILIA SCGS v2 for granular analysis
metrics:
  • duration_minutes: 2
  • completed_date: 2026-02-12
  • tasks_completed: 2
  • files_created: 1
  • files_modified: 2
  • commits: 2

Phase 6 Plan 01: Negative Controls & Recall@k Validation Summary

This plan adds negative control validation using housekeeping genes and extends positive control validation with recall@k metrics. Together the two give complementary checks on the scoring (negative + positive controls) with granular metrics.

Tasks Completed

Task 1: Create negative control validation module with housekeeping genes

Status: Complete
Commit: e488ff2
Files: src/usher_pipeline/scoring/negative_controls.py

Created negative_controls.py implementing housekeeping gene-based negative control validation:

  • HOUSEKEEPING_GENES_CORE frozenset with 13 curated genes (RPL13A, RPL32, RPLP0, GAPDH, ACTB, B2M, HPRT1, TBP, SDHA, PGK1, PPIA, UBC, YWHAZ)
  • Grouped by function: ribosomal proteins, metabolic enzymes, transcription factors, protein folding
  • Source: Eisenberg & Levanon (2013) "Human housekeeping genes, revisited" Trends in Genetics

compile_housekeeping_genes(): Returns DataFrame with gene_symbol, source ("literature_validated"), confidence ("HIGH") - matches compile_known_genes() pattern from known_genes.py

validate_negative_controls():

  • Uses PERCENT_RANK window function (same pattern as positive control validation)
  • INVERTED threshold logic: median_percentile < 0.50 = success (negative controls should rank LOW)
  • Returns metrics: total_expected, total_in_dataset, median_percentile, top_quartile_count, in_high_tier_count, validation_passed, housekeeping_gene_details
  • Creates temporary _housekeeping_genes table for join, cleans up after query
  • Tracks both top quartile presence (should be minimal) and high-tier score count (>= 0.70)
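The flow described above (temporary join table, PERCENT_RANK window, inverted threshold) can be sketched roughly as follows. This is a minimal illustration, not the code in negative_controls.py: it assumes an SQLite connection from the standard library, a scored_genes table with gene_symbol and composite_score columns, and a hypothetical function name validate_negative_controls_sketch. The demo rows and scores are made up.

```python
import sqlite3
from statistics import median


def validate_negative_controls_sketch(conn, housekeeping, threshold=0.50):
    """Hypothetical sketch: pass when housekeeping genes rank LOW overall."""
    genes = sorted(set(housekeeping))
    # Temporary table used only for the join; dropped again in `finally`.
    conn.execute("CREATE TEMP TABLE _housekeeping_genes (gene_symbol TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO _housekeeping_genes VALUES (?)", [(g,) for g in genes])
    try:
        rows = conn.execute(
            """
            WITH ranked AS (
                SELECT gene_symbol,
                       composite_score,
                       PERCENT_RANK() OVER (ORDER BY composite_score) AS pct
                FROM scored_genes
            )
            SELECT r.gene_symbol, r.composite_score, r.pct
            FROM ranked AS r
            JOIN _housekeeping_genes AS h ON r.gene_symbol = h.gene_symbol
            ORDER BY r.pct
            """
        ).fetchall()
    finally:
        conn.execute("DROP TABLE _housekeeping_genes")

    percentiles = [pct for _, _, pct in rows]
    median_pct = median(percentiles) if percentiles else None
    return {
        "total_expected": len(genes),
        "total_in_dataset": len(rows),
        "median_percentile": median_pct,
        # Negative controls in the top quartile should be rare.
        "top_quartile_count": sum(1 for p in percentiles if p >= 0.75),
        # High-tier cutoff of 0.70 taken from the summary above.
        "in_high_tier_count": sum(1 for _, score, _ in rows if score >= 0.70),
        # Inverted logic: a LOW median percentile is the success condition.
        "validation_passed": median_pct is not None and median_pct < threshold,
        "housekeeping_gene_details": rows,
    }


# Made-up demo data: seven placeholder genes plus three housekeeping genes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scored_genes (gene_symbol TEXT, composite_score REAL)")
demo_scores = [0.90, 0.80, 0.75, 0.60, 0.50, 0.40, 0.35]
demo_rows = [(f"GENE_{i}", s) for i, s in enumerate(demo_scores)]
demo_rows += [("GAPDH", 0.10), ("ACTB", 0.20), ("B2M", 0.30)]
conn.executemany("INSERT INTO scored_genes VALUES (?, ?)", demo_rows)

result = validate_negative_controls_sketch(conn, ["GAPDH", "ACTB", "B2M"])
print(result["validation_passed"], result["median_percentile"])
```

Dropping the temporary table in a finally clause keeps the cleanup step even when the query fails, which matches the "cleans up after query" behaviour noted above.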

generate_negative_control_report(): Human-readable output following validation.py patterns, shows lowest-ranked genes (best outcome for negative controls)

Task 2: Enhance positive control validation with recall@k metrics

Status: Complete
Commit: 0f615c0
Files: src/usher_pipeline/scoring/validation.py, src/usher_pipeline/scoring/__init__.py

Enhanced validation.py with recall@k functions:

compute_recall_at_k():

  • Computes recall at absolute thresholds: top-100, top-500, top-1000, top-2000
  • Computes recall at percentage thresholds: top 5%, 10%, 20% of scored genes
  • Deduplicates known genes on gene_symbol (genes in both OMIM + SYSCILIA count once)
  • Recall@k = (known genes in top-k) / total_known_unique
  • Provides the ">70% recall in top 10%" metric required by success criteria
  • Returns: recalls_absolute, recalls_percentage, total_known_unique, total_scored
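As a rough illustration of the recall@k definition above, the sketch below works on a plain Python list of gene symbols already sorted from highest to lowest score. The function name compute_recall_at_k_sketch, the list-based input, and the rounding rule for percentage thresholds (round, with a minimum of 1) are assumptions made for this example; the actual function may query the database and round differently.

```python
def compute_recall_at_k_sketch(
    ranked_genes,
    known_genes,
    absolute_ks=(100, 500, 1000, 2000),
    percentages=(0.05, 0.10, 0.20),
):
    """Hypothetical sketch of recall@k over a best-first ranked gene list."""
    # Deduplicate so a gene listed by both sources counts once.
    known = set(known_genes)
    total_scored = len(ranked_genes)

    def recall(k):
        if not known:
            return 0.0
        top_k = set(ranked_genes[:k])
        # Recall@k = (known genes in top-k) / total_known_unique
        return len(known & top_k) / len(known)

    recalls_absolute = {k: recall(k) for k in absolute_ks}
    # Assumed rounding rule: k = round(p * N), never below 1.
    recalls_percentage = {
        p: recall(max(1, round(p * total_scored))) for p in percentages
    }
    return {
        "recalls_absolute": recalls_absolute,
        "recalls_percentage": recalls_percentage,
        "total_known_unique": len(known),
        "total_scored": total_scored,
    }


# Made-up demo: 20 ranked placeholder genes; G0 appears twice in the known list
# to mimic a gene present in both sources.
ranked = [f"G{i}" for i in range(20)]
known = ["G0", "G1", "G5", "G0"]
recall_result = compute_recall_at_k_sketch(
    ranked, known, absolute_ks=(2, 10), percentages=(0.10, 0.50)
)
print(recall_result)
```

In this sketch, known genes missing from the ranked list still count in the denominator, so they lower recall rather than being silently dropped.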

validate_positive_controls_extended():

  • Combines base percentile validation (validate_known_gene_ranking) with recall@k metrics
  • Adds per-source breakdown: separate median percentile for "omim_usher" vs "syscilia_scgs_v2"
  • Per-source uses same PERCENT_RANK CTE pattern but filters JOIN by source
  • Allows detecting if one gene set validates better than the other (e.g., disease genes vs ciliary genes)
  • Returns: all base metrics + recall_at_k dict + per_source_breakdown dict
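The per-source breakdown can be pictured with a small in-memory sketch. It replaces the SQL CTE with a pure-Python percent rank, and the names percent_rank and per_source_breakdown_sketch, the input shapes, and the toy scores are all hypothetical; only the source labels "omim_usher" and "syscilia_scgs_v2" come from the summary above.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import median


def percent_rank(scores):
    """Map gene -> percentile in [0, 1], mirroring SQL PERCENT_RANK (ascending)."""
    values = sorted(scores.values())
    n = len(values)
    if n < 2:
        return {gene: 0.0 for gene in scores}
    # bisect_left counts strictly lower scores, so tied genes share a rank.
    return {gene: bisect_left(values, s) / (n - 1) for gene, s in scores.items()}


def per_source_breakdown_sketch(scores, known_rows):
    """Hypothetical sketch: median percentile of known genes, split by source."""
    percentiles = percent_rank(scores)
    by_source = defaultdict(list)
    for gene, source in known_rows:
        if gene in percentiles:  # known genes absent from the scores are skipped
            by_source[source].append(percentiles[gene])
    return {
        source: {"count": len(vals), "median_percentile": median(vals)}
        for source, vals in by_source.items()
    }


# Toy data with placeholder gene names and made-up scores.
toy_scores = {"GENE_A": 0.9, "GENE_B": 0.8, "GENE_C": 0.7, "GENE_D": 0.4, "GENE_E": 0.1}
toy_known = [
    ("GENE_A", "omim_usher"),
    ("GENE_B", "omim_usher"),
    ("GENE_C", "syscilia_scgs_v2"),
    ("GENE_D", "syscilia_scgs_v2"),
]
breakdown = per_source_breakdown_sketch(toy_scores, toy_known)
print(breakdown)
```

A large gap between the two medians, as in this toy data, is the kind of signal the breakdown is meant to surface.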

Updated __init__.py: Added exports for compute_recall_at_k, validate_positive_controls_extended, and all negative_controls.py functions (HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report)

Deviations from Plan

None - plan executed exactly as written.

Verification Results

All verification checks passed:

  1. from usher_pipeline.scoring.negative_controls import HOUSEKEEPING_GENES_CORE; assert len(HOUSEKEEPING_GENES_CORE) == 13 - OK
  2. from usher_pipeline.scoring import validate_negative_controls, compute_recall_at_k, validate_positive_controls_extended - All imports OK
  3. df = compile_housekeeping_genes(); assert 'gene_symbol' in df.columns and 'source' in df.columns - DataFrame structure correct

Success Criteria

  • negative_controls.py creates housekeeping gene set and validates they rank low (inverted threshold)
  • validation.py compute_recall_at_k measures recall at multiple k values including percentage-based thresholds
  • validate_positive_controls_extended combines percentile + recall + per-source metrics
  • All new functions exported from scoring/__init__.py

Key Files

Created

  • src/usher_pipeline/scoring/negative_controls.py (287 lines)
    • Housekeeping gene compilation and negative control validation
    • Exports: HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report

Modified

  • src/usher_pipeline/scoring/validation.py (+183 lines)

    • Added compute_recall_at_k() for recall@k metrics
    • Added validate_positive_controls_extended() for comprehensive validation
  • src/usher_pipeline/scoring/__init__.py (+8 exports)

    • Added negative_controls module exports
    • Added new validation functions: compute_recall_at_k, validate_positive_controls_extended

Integration Points

Depends on:

  • Phase 04-01: Known genes compilation (OMIM Usher + SYSCILIA SCGS v2)
  • Phase 04-02: scored_genes table with composite_score and PERCENT_RANK validation pattern

Provides:

  • Negative control validation (housekeeping genes should rank low)
  • Recall@k metrics (what % of known genes in top-k candidates)
  • Per-source breakdown (separate OMIM vs SYSCILIA analysis)

Affects:

  • Phase 06-03: Comprehensive validation report will integrate both positive and negative control results

Technical Notes

Negative Control Design:

  • Housekeeping genes (ubiquitous, essential, not cilia-specific) serve as negative controls
  • Inverted threshold logic: LOW percentiles are GOOD (confirms scoring specificity)
  • Complements positive controls: known genes should rank HIGH, housekeeping genes should rank LOW
  • If both validations pass: scoring system is both sensitive (catches true positives) and specific (excludes non-ciliary genes)

Recall@k Metrics:

  • Provides specific measurement for ">70% in top 10%" success criterion
  • Absolute thresholds useful for fixed candidate list sizes (e.g., "top 100 for experimental follow-up")
  • Percentage thresholds adapt to total scored gene count (dataset-size independent)
  • Deduplication ensures genes in both OMIM + SYSCILIA count once (avoids double-counting)

Per-Source Breakdown:

  • Disease genes (OMIM Usher) vs core ciliary genes (SYSCILIA SCGS v2) may have different evidence profiles
  • Usher genes may score higher on expression (retina, inner ear specific)
  • SYSCILIA genes may score higher on protein structure (IFT, BBSome domains)
  • Separate metrics detect if one set validates poorly (suggests evidence layer imbalance)

Self-Check: PASSED

Created files verified:

  • src/usher_pipeline/scoring/negative_controls.py exists and is importable

Commits verified:

  • e488ff2: Task 1 commit exists (negative control validation module)
  • 0f615c0: Task 2 commit exists (recall@k and extended validation)

Functionality verified:

  • All imports successful from usher_pipeline.scoring
  • HOUSEKEEPING_GENES_CORE has 13 genes
  • compile_housekeeping_genes() returns correct DataFrame structure
  • All functions callable (no import errors)

All claims in summary verified against actual implementation.