
gbanyan a2d6e97acf docs(06-01): complete negative controls and recall@k validation plan

Summary:
- Created negative_controls.py with 13 housekeeping genes
- Added recall@k metrics (absolute and percentage thresholds)
- Added per-source breakdown for OMIM vs SYSCILIA
- Updated STATE.md with Phase 6 progress and decisions

Plan Summary: .planning/phases/06-validation/06-01-SUMMARY.md

2026-02-12 04:43:59 +08:00



Phase 6 Plan 01: Negative Controls & Recall@k Validation Summary

Adds negative control validation using housekeeping genes and extends positive control validation with recall@k metrics. Together the two give complementary checks on the scoring: known genes should rank high, housekeeping genes should rank low.

Tasks Completed

Task 1: Create negative control validation module with housekeeping genes

Status: Complete
Commit: e488ff2
Files: src/usher_pipeline/scoring/negative_controls.py

Created negative_controls.py implementing housekeeping gene-based negative control validation:

HOUSEKEEPING_GENES_CORE frozenset with 13 curated genes (RPL13A, RPL32, RPLP0, GAPDH, ACTB, B2M, HPRT1, TBP, SDHA, PGK1, PPIA, UBC, YWHAZ)
Grouped by function: ribosomal proteins, metabolic enzymes, transcription factors, protein folding
Source: Eisenberg & Levanon (2013) "Human housekeeping genes, revisited" Trends in Genetics

compile_housekeeping_genes(): Returns DataFrame with gene_symbol, source ("literature_validated"), confidence ("HIGH") - matches compile_known_genes() pattern from known_genes.py
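A minimal sketch of the constant and the compile step described above. The 13 symbols and the two literal column values come from this summary; everything else is an assumption. In particular, compile_housekeeping_records is a hypothetical stand-in that returns plain rows rather than a DataFrame, so the example runs without third-party dependencies.

```python
# The 13 curated housekeeping genes listed in this summary.
HOUSEKEEPING_GENES_CORE = frozenset({
    "RPL13A", "RPL32", "RPLP0",
    "GAPDH", "ACTB", "B2M", "HPRT1", "TBP",
    "SDHA", "PGK1", "PPIA", "UBC", "YWHAZ",
})


def compile_housekeeping_records(genes=HOUSEKEEPING_GENES_CORE):
    """Hypothetical helper: one row per gene, sorted for deterministic output.

    The actual compile_housekeeping_genes() returns a DataFrame with the
    same three columns; a list of dicts is used here only for illustration.
    """
    return [
        {
            "gene_symbol": symbol,
            "source": "literature_validated",
            "confidence": "HIGH",
        }
        for symbol in sorted(genes)
    ]


rows = compile_housekeeping_records()
print(len(rows), rows[0])
```

Rows in this shape can be handed to most DataFrame constructors unchanged, which is one way to keep the column layout aligned with compile_known_genes().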

validate_negative_controls():

Uses PERCENT_RANK window function (same pattern as positive control validation)
INVERTED threshold logic: median_percentile < 0.50 = success (negative controls should rank LOW)
Returns metrics: total_expected, total_in_dataset, median_percentile, top_quartile_count, in_high_tier_count, validation_passed, housekeeping_gene_details
Creates temporary _housekeeping_genes table for join, cleans up after query
Tracks both top quartile presence (should be minimal) and high-tier score count (>= 0.70)
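The flow above can be sketched as follows. This is not the module's code: it uses the standard-library sqlite3 module as a stand-in for the pipeline's database, assumes a scored_genes table with gene_symbol and composite_score columns, and assumes "top quartile" means a percentile of 0.75 or higher. The function name and the toy gene names are hypothetical.

```python
import sqlite3
from statistics import median

QUERY = """
WITH ranked AS (
    SELECT gene_symbol,
           composite_score,
           PERCENT_RANK() OVER (ORDER BY composite_score) AS percentile
    FROM scored_genes
    WHERE composite_score IS NOT NULL
)
SELECT r.gene_symbol, r.composite_score, r.percentile
FROM ranked AS r
JOIN temp._housekeeping_genes AS h ON h.gene_symbol = r.gene_symbol
ORDER BY r.percentile
"""


def validate_negative_controls_sketch(conn, control_genes, threshold=0.50):
    """Rank every scored gene, then check that the controls sit low."""
    conn.execute("CREATE TEMP TABLE _housekeeping_genes (gene_symbol TEXT PRIMARY KEY)")
    try:
        conn.executemany(
            "INSERT INTO temp._housekeeping_genes VALUES (?)",
            [(g,) for g in sorted(control_genes)],
        )
        rows = conn.execute(QUERY).fetchall()
    finally:
        # Clean up the temporary join table even if the query fails.
        conn.execute("DROP TABLE IF EXISTS temp._housekeeping_genes")

    percentiles = [pct for _, _, pct in rows]
    med = median(percentiles) if percentiles else None
    return {
        "total_expected": len(control_genes),
        "total_in_dataset": len(rows),
        "median_percentile": med,
        "top_quartile_count": sum(p >= 0.75 for p in percentiles),
        "in_high_tier_count": sum(score >= 0.70 for _, score, _ in rows),
        # Inverted logic: a LOW median percentile is the passing outcome.
        "validation_passed": med is not None and med < threshold,
        "housekeeping_gene_details": rows,
    }


# Toy data: ten synthetic genes with scores 0.0 .. 0.9.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scored_genes (gene_symbol TEXT, composite_score REAL)")
conn.executemany(
    "INSERT INTO scored_genes VALUES (?, ?)",
    [(f"GENE_{i}", i / 10) for i in range(10)],
)
controls = {"GENE_0", "GENE_1", "GENE_8", "ABSENT"}
metrics = validate_negative_controls_sketch(conn, controls)
print(metrics["median_percentile"], metrics["validation_passed"])
```

Ranking is done over all scored genes before the join, so each control's percentile reflects its position in the full dataset rather than among the controls alone.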

generate_negative_control_report(): Human-readable output following validation.py patterns, shows lowest-ranked genes (best outcome for negative controls)

Task 2: Enhance positive control validation with recall@k metrics

Status: Complete
Commit: 0f615c0
Files: src/usher_pipeline/scoring/validation.py, src/usher_pipeline/scoring/__init__.py

Enhanced validation.py with recall@k functions:

compute_recall_at_k():

Computes recall at absolute thresholds: top-100, top-500, top-1000, top-2000
Computes recall at percentage thresholds: top 5%, 10%, 20% of scored genes
Deduplicates known genes on gene_symbol (genes in both OMIM + SYSCILIA count once)
Recall@k = (known genes in top-k) / total_known_unique
Provides the ">70% recall in top 10%" metric required by success criteria
Returns: recalls_absolute, recalls_percentage, total_known_unique, total_scored
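As an illustration of the recall@k definition above, here is a small pure-Python sketch. It assumes the scored genes are already available as a list ordered best first. The function name, the dictionary key format, and the floor rounding used for percentage cutoffs are assumptions, not details taken from validation.py.

```python
def compute_recall_at_k_sketch(
    ranked_genes,
    known_genes,
    absolute_ks=(100, 500, 1000, 2000),
    percentages=(5, 10, 20),
):
    """Recall@k = (known genes in top-k) / total_known_unique."""
    # A gene listed by both OMIM and SYSCILIA must count once.
    known_unique = set(known_genes)
    total_scored = len(ranked_genes)

    def recall(k):
        if not known_unique:
            return 0.0
        hits = known_unique.intersection(ranked_genes[:k])
        return len(hits) / len(known_unique)

    recalls_absolute = {f"recall@{k}": recall(k) for k in absolute_ks}
    recalls_percentage = {
        # Assumption: floor the cutoff, but always keep at least one gene.
        f"recall@{pct}%": recall(max(1, total_scored * pct // 100))
        for pct in percentages
    }
    return {
        "recalls_absolute": recalls_absolute,
        "recalls_percentage": recalls_percentage,
        "total_known_unique": len(known_unique),
        "total_scored": total_scored,
    }


# Toy ranking of 100 synthetic genes, best first; G3 appears in two sources.
ranked = [f"G{i}" for i in range(100)]
known = ["G0", "G3", "G3", "G7", "G50"]
result = compute_recall_at_k_sketch(ranked, known, absolute_ks=(5, 10))
print(result["recalls_absolute"], result["recalls_percentage"])
```

Because the denominator is the full deduplicated known set, a known gene that is missing from the scored data counts as a miss rather than being silently dropped.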

validate_positive_controls_extended():

Combines base percentile validation (validate_known_gene_ranking) with recall@k metrics
Adds per-source breakdown: separate median percentile for "omim_usher" vs "syscilia_scgs_v2"
Per-source uses same PERCENT_RANK CTE pattern but filters JOIN by source
Allows detecting if one gene set validates better than the other (e.g., disease genes vs ciliary genes)
Returns: all base metrics + recall_at_k dict + per_source_breakdown dict
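The per-source step might look roughly like the sketch below. It again uses sqlite3 as a stand-in database and assumes a known_genes table with gene_symbol and source columns alongside scored_genes; the table layout, function name, and returned keys are hypothetical. Only the two source labels are taken from this summary.

```python
import sqlite3
from statistics import median

PER_SOURCE_SQL = """
WITH ranked AS (
    SELECT gene_symbol,
           PERCENT_RANK() OVER (ORDER BY composite_score) AS percentile
    FROM scored_genes
    WHERE composite_score IS NOT NULL
)
SELECT r.percentile
FROM ranked AS r
JOIN known_genes AS k
  ON k.gene_symbol = r.gene_symbol
 AND k.source = ?
"""


def per_source_breakdown_sketch(conn, sources=("omim_usher", "syscilia_scgs_v2")):
    """Median percentile per known-gene source, using one shared ranking CTE."""
    breakdown = {}
    for source in sources:
        percentiles = [row[0] for row in conn.execute(PER_SOURCE_SQL, (source,))]
        breakdown[source] = {
            "count": len(percentiles),
            "median_percentile": median(percentiles) if percentiles else None,
        }
    return breakdown


# Toy data: eleven synthetic genes, so percentiles are exactly i / 10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scored_genes (gene_symbol TEXT, composite_score REAL)")
conn.execute("CREATE TABLE known_genes (gene_symbol TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO scored_genes VALUES (?, ?)",
    [(f"GENE_{i}", i / 10) for i in range(11)],
)
conn.executemany(
    "INSERT INTO known_genes VALUES (?, ?)",
    [
        ("GENE_10", "omim_usher"),
        ("GENE_8", "omim_usher"),
        ("GENE_8", "syscilia_scgs_v2"),
        ("GENE_6", "syscilia_scgs_v2"),
        ("GENE_4", "syscilia_scgs_v2"),
    ],
)
breakdown = per_source_breakdown_sketch(conn)
print(breakdown)
```

A gene present in both sources contributes to both medians here, which differs from the recall@k calculation, where it counts once.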

Updated __init__.py: Added exports for compute_recall_at_k, validate_positive_controls_extended, and all negative_controls.py exports (HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report)

Deviations from Plan

None - plan executed exactly as written.

Verification Results

All verification checks passed:

from usher_pipeline.scoring.negative_controls import HOUSEKEEPING_GENES_CORE; assert len(HOUSEKEEPING_GENES_CORE) == 13 - OK
from usher_pipeline.scoring import validate_negative_controls, compute_recall_at_k, validate_positive_controls_extended - All imports OK
df = compile_housekeeping_genes(); assert 'gene_symbol' in df.columns and 'source' in df.columns - DataFrame structure correct

Success Criteria

negative_controls.py creates housekeeping gene set and validates they rank low (inverted threshold)
validation.py compute_recall_at_k measures recall at multiple k values including percentage-based thresholds
validate_positive_controls_extended combines percentile + recall + per-source metrics
All new functions exported from scoring/__init__.py

Key Files

Created

src/usher_pipeline/scoring/negative_controls.py (287 lines)
- Housekeeping gene compilation and negative control validation
- Exports: HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report

Modified

src/usher_pipeline/scoring/validation.py (+183 lines)
- Added compute_recall_at_k() for recall@k metrics
- Added validate_positive_controls_extended() for comprehensive validation
src/usher_pipeline/scoring/__init__.py (+8 exports)
- Added negative_controls module exports
- Added new validation functions: compute_recall_at_k, validate_positive_controls_extended

Integration Points

Depends on:

Phase 04-01: Known genes compilation (OMIM Usher + SYSCILIA SCGS v2)
Phase 04-02: scored_genes table with composite_score and PERCENT_RANK validation pattern

Provides:

Negative control validation (housekeeping genes should rank low)
Recall@k metrics (what percentage of known genes fall within the top-k candidates)
Per-source breakdown (separate OMIM vs SYSCILIA analysis)

Affects:

Phase 06-03: Comprehensive validation report will integrate both positive and negative control results

Technical Notes

Negative Control Design:

Housekeeping genes (ubiquitous, essential, not cilia-specific) serve as negative controls
Inverted threshold logic: LOW percentiles are GOOD (confirms scoring specificity)
Complements positive controls: known genes should rank HIGH, housekeeping genes should rank LOW
If both validations pass, the scoring system is both sensitive (catches true positives) and specific (excludes non-ciliary genes)

Recall@k Metrics:

Provides specific measurement for ">70% in top 10%" success criterion
Absolute thresholds useful for fixed candidate list sizes (e.g., "top 100 for experimental follow-up")
Percentage thresholds adapt to total scored gene count (dataset-size independent)
Deduplication ensures genes in both OMIM + SYSCILIA count once (avoids double-counting)

Per-Source Breakdown:

Disease genes (OMIM Usher) vs core ciliary genes (SYSCILIA SCGS v2) may have different evidence profiles
Usher genes may score higher on expression (retina, inner ear specific)
SYSCILIA genes may score higher on protein structure (IFT, BBSome domains)
Separate metrics detect if one set validates poorly (suggests evidence layer imbalance)

Self-Check: PASSED

Created files verified:

src/usher_pipeline/scoring/negative_controls.py exists and is importable

Commits verified:

e488ff2: Task 1 commit exists (negative control validation module)
0f615c0: Task 2 commit exists (recall@k and extended validation)

Functionality verified:

All imports successful from usher_pipeline.scoring
HOUSEKEEPING_GENES_CORE has 13 genes
compile_housekeeping_genes() returns correct DataFrame structure
All functions callable (no import errors)

All claims in summary verified against actual implementation.
