Summary:
- Created negative_controls.py with 13 housekeeping genes
- Added recall@k metrics (absolute and percentage thresholds)
- Added per-source breakdown for OMIM vs SYSCILIA
- Updated STATE.md with Phase 6 progress and decisions
Plan Summary: .planning/phases/06-validation/06-01-SUMMARY.md
| phase | plan | subsystem | tags | dependency_graph | tech_stack | key_files | decisions | metrics |
|---|---|---|---|---|---|---|---|---|
| 06-validation | 01 | validation | | | | | | |
Phase 6 Plan 01: Negative Controls & Recall@k Validation Summary
This plan adds negative control validation based on housekeeping genes and extends positive control validation with recall@k metrics. Together, the two approaches check the scoring from both directions (negative and positive controls) with more granular metrics.
Tasks Completed
Task 1: Create negative control validation module with housekeeping genes
Status: Complete
Commit: e488ff2
Files: src/usher_pipeline/scoring/negative_controls.py
Created negative_controls.py implementing housekeeping gene-based negative control validation:
- HOUSEKEEPING_GENES_CORE frozenset with 13 curated genes (RPL13A, RPL32, RPLP0, GAPDH, ACTB, B2M, HPRT1, TBP, SDHA, PGK1, PPIA, UBC, YWHAZ)
- Grouped by function: ribosomal proteins, metabolic enzymes, transcription factors, protein folding
- Source: Eisenberg & Levanon (2013) "Human housekeeping genes, revisited" Trends in Genetics
compile_housekeeping_genes(): Returns DataFrame with gene_symbol, source ("literature_validated"), confidence ("HIGH") - matches compile_known_genes() pattern from known_genes.py
validate_negative_controls():
- Uses PERCENT_RANK window function (same pattern as positive control validation)
- INVERTED threshold logic: median_percentile < 0.50 = success (negative controls should rank LOW)
- Returns metrics: total_expected, total_in_dataset, median_percentile, top_quartile_count, in_high_tier_count, validation_passed, housekeeping_gene_details
- Creates temporary _housekeeping_genes table for join, cleans up after query
- Tracks both top quartile presence (should be minimal) and high-tier score count (>= 0.70)
generate_negative_control_report(): Human-readable output following validation.py patterns, shows lowest-ranked genes (best outcome for negative controls)
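The inverted-threshold check described above can be sketched as follows. This is a minimal illustration, not the module's actual code: it uses Python's built-in sqlite3 (SQLite 3.25 or newer, for window functions) as a stand-in because the pipeline's database engine is not stated here, the function name validate_negative_controls_sketch is hypothetical, the top-quartile cut-off of percentile >= 0.75 is an assumption, and the toy scores are invented for the example.

```python
import sqlite3
from statistics import median


def validate_negative_controls_sketch(conn, housekeeping_genes):
    """Rank housekeeping genes within scored_genes; LOW ranks mean success."""
    genes = sorted(set(housekeeping_genes))
    conn.execute(
        "CREATE TEMP TABLE _housekeeping_genes (gene_symbol TEXT PRIMARY KEY)"
    )
    try:
        conn.executemany(
            "INSERT INTO _housekeeping_genes (gene_symbol) VALUES (?)",
            [(g,) for g in genes],
        )
        rows = conn.execute(
            """
            WITH ranked AS (
                -- Percentile is computed over ALL scored genes before the join.
                SELECT gene_symbol,
                       composite_score,
                       PERCENT_RANK() OVER (ORDER BY composite_score) AS percentile
                FROM scored_genes
            )
            SELECT r.gene_symbol, r.composite_score, r.percentile
            FROM ranked AS r
            JOIN _housekeeping_genes AS h ON h.gene_symbol = r.gene_symbol
            ORDER BY r.percentile
            """
        ).fetchall()
    finally:
        # Clean up the temporary join table even if the query fails.
        conn.execute("DROP TABLE IF EXISTS _housekeeping_genes")

    percentiles = [p for _, _, p in rows]
    med = median(percentiles) if percentiles else None
    return {
        "total_expected": len(genes),
        "total_in_dataset": len(rows),
        "median_percentile": med,
        # Assumption: "top quartile" means percentile >= 0.75.
        "top_quartile_count": sum(p >= 0.75 for p in percentiles),
        "in_high_tier_count": sum(s >= 0.70 for _, s, _ in rows),
        # Inverted logic: negative controls pass when they rank LOW.
        "validation_passed": med is not None and med < 0.50,
        "housekeeping_gene_details": rows,
    }


# Toy data (invented scores, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scored_genes (gene_symbol TEXT, composite_score REAL)")
conn.executemany(
    "INSERT INTO scored_genes VALUES (?, ?)",
    [("USH2A", 0.95), ("MYO7A", 0.90), ("CDH23", 0.85), ("IFT88", 0.80),
     ("BBS1", 0.50), ("TBP", 0.30), ("GAPDH", 0.20), ("ACTB", 0.10)],
)
result = validate_negative_controls_sketch(conn, {"GAPDH", "ACTB", "TBP", "UBC"})
print(result["median_percentile"], result["validation_passed"])
```

Computing PERCENT_RANK inside the CTE before joining matters: ranking only the joined rows would spread the housekeeping genes across the whole 0 to 1 range regardless of where they sit among all scored genes.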
Task 2: Enhance positive control validation with recall@k metrics
Status: Complete
Commit: 0f615c0
Files: src/usher_pipeline/scoring/validation.py, src/usher_pipeline/scoring/__init__.py
Enhanced validation.py with recall@k functions:
compute_recall_at_k():
- Computes recall at absolute thresholds: top-100, top-500, top-1000, top-2000
- Computes recall at percentage thresholds: top 5%, 10%, 20% of scored genes
- Deduplicates known genes on gene_symbol (genes in both OMIM + SYSCILIA count once)
- Recall@k = (known genes in top-k) / total_known_unique
- Provides the ">70% recall in top 10%" metric required by success criteria
- Returns: recalls_absolute, recalls_percentage, total_known_unique, total_scored
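The recall@k computation listed above can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the code in validation.py: the function name compute_recall_at_k_sketch is hypothetical, the input is assumed to be a list of gene symbols already sorted by descending score plus (gene_symbol, source) records, and percentage thresholds are assumed to round down to a whole number of genes (with a minimum of one).

```python
def compute_recall_at_k_sketch(
    ranked_symbols,
    known_records,
    absolute_ks=(100, 500, 1000, 2000),
    percent_ks=(5, 10, 20),
):
    """Recall@k = (known genes in top-k) / total_known_unique."""
    # Deduplicate on gene_symbol so a gene listed by two sources counts once.
    known_unique = {symbol for symbol, _source in known_records}
    total_scored = len(ranked_symbols)

    def recall(k):
        if not known_unique:
            return 0.0
        top_k = set(ranked_symbols[:k])
        return len(known_unique & top_k) / len(known_unique)

    recalls_absolute = {k: recall(k) for k in absolute_ks}
    # Assumption: a percentage threshold covers floor(total * pct / 100) genes,
    # and never fewer than one.
    recalls_percentage = {
        pct: recall(max(1, total_scored * pct // 100)) for pct in percent_ks
    }
    return {
        "recalls_absolute": recalls_absolute,
        "recalls_percentage": recalls_percentage,
        "total_known_unique": len(known_unique),
        "total_scored": total_scored,
    }


# Toy ranking, best score first (invented for illustration).
ranked = ["MYO7A", "USH2A", "G3", "IFT88", "G5", "CDH23", "G7", "G8", "G9", "G10"]
known = [
    ("MYO7A", "omim_usher"),
    ("USH2A", "omim_usher"),
    ("CDH23", "omim_usher"),
    ("IFT88", "syscilia_scgs_v2"),
    ("MYO7A", "syscilia_scgs_v2"),  # duplicate symbol, counted once
]
recall_result = compute_recall_at_k_sketch(
    ranked, known, absolute_ks=(2, 5), percent_ks=(10, 20, 50)
)
print(recall_result)
```

Using integer percentages keeps the top-k size free of floating-point rounding surprises when the gene count is multiplied by a fraction.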
validate_positive_controls_extended():
- Combines base percentile validation (validate_known_gene_ranking) with recall@k metrics
- Adds per-source breakdown: separate median percentile for "omim_usher" vs "syscilia_scgs_v2"
- Per-source uses same PERCENT_RANK CTE pattern but filters JOIN by source
- Allows detecting if one gene set validates better than the other (e.g., disease genes vs ciliary genes)
- Returns: all base metrics + recall_at_k dict + per_source_breakdown dict
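The per-source breakdown can be illustrated with the same PERCENT_RANK CTE and a source filter on the join. As before, this is a sketch under assumptions: sqlite3 (SQLite 3.25 or newer) stands in for the pipeline's database, the table name known_genes and the function name per_source_breakdown_sketch are hypothetical, and the scores are invented.

```python
import sqlite3
from statistics import median


def per_source_breakdown_sketch(conn, sources=("omim_usher", "syscilia_scgs_v2")):
    """Median percentile of known genes, computed separately per source."""
    breakdown = {}
    for source in sources:
        rows = conn.execute(
            """
            WITH ranked AS (
                -- Rank against every scored gene, not just this source's genes.
                SELECT gene_symbol,
                       PERCENT_RANK() OVER (ORDER BY composite_score) AS percentile
                FROM scored_genes
            )
            SELECT r.percentile
            FROM ranked AS r
            JOIN known_genes AS k
              ON k.gene_symbol = r.gene_symbol AND k.source = ?
            """,
            (source,),
        ).fetchall()
        percentiles = [p for (p,) in rows]
        breakdown[source] = {
            "count": len(percentiles),
            "median_percentile": median(percentiles) if percentiles else None,
        }
    return breakdown


# Toy data (invented scores, for illustration only).
src_conn = sqlite3.connect(":memory:")
src_conn.execute("CREATE TABLE scored_genes (gene_symbol TEXT, composite_score REAL)")
src_conn.execute("CREATE TABLE known_genes (gene_symbol TEXT, source TEXT)")
src_conn.executemany(
    "INSERT INTO scored_genes VALUES (?, ?)",
    [("USH2A", 0.95), ("MYO7A", 0.90), ("CDH23", 0.85), ("IFT88", 0.80),
     ("BBS1", 0.50), ("TBP", 0.30), ("GAPDH", 0.20), ("ACTB", 0.10)],
)
src_conn.executemany(
    "INSERT INTO known_genes VALUES (?, ?)",
    [("USH2A", "omim_usher"), ("MYO7A", "omim_usher"), ("CDH23", "omim_usher"),
     ("IFT88", "syscilia_scgs_v2"), ("BBS1", "syscilia_scgs_v2")],
)
breakdown = per_source_breakdown_sketch(src_conn)
print(breakdown)
```

In this toy data the OMIM set ranks clearly higher than the SYSCILIA set, which is the kind of gap the separate medians are meant to expose.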
Updated __init__.py: Added exports for compute_recall_at_k, validate_positive_controls_extended, and all negative_controls.py functions (HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report)
Deviations from Plan
None - plan executed exactly as written.
Verification Results
All verification checks passed:
- from usher_pipeline.scoring.negative_controls import HOUSEKEEPING_GENES_CORE; assert len(HOUSEKEEPING_GENES_CORE) == 13 - OK
- from usher_pipeline.scoring import validate_negative_controls, compute_recall_at_k, validate_positive_controls_extended - All imports OK
- df = compile_housekeeping_genes(); assert 'gene_symbol' in df.columns and 'source' in df.columns - DataFrame structure correct
Success Criteria
- negative_controls.py creates housekeeping gene set and validates they rank low (inverted threshold)
- validation.py compute_recall_at_k measures recall at multiple k values including percentage-based thresholds
- validate_positive_controls_extended combines percentile + recall + per-source metrics
- All new functions exported from scoring.__init__
Key Files
Created
- src/usher_pipeline/scoring/negative_controls.py (287 lines)
- Housekeeping gene compilation and negative control validation
- Exports: HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report
Modified
- src/usher_pipeline/scoring/validation.py (+183 lines)
- Added compute_recall_at_k() for recall@k metrics
- Added validate_positive_controls_extended() for comprehensive validation
- src/usher_pipeline/scoring/__init__.py (+8 exports)
- Added negative_controls module exports
- Added new validation functions: compute_recall_at_k, validate_positive_controls_extended
Integration Points
Depends on:
- Phase 04-01: Known genes compilation (OMIM Usher + SYSCILIA SCGS v2)
- Phase 04-02: scored_genes table with composite_score and PERCENT_RANK validation pattern
Provides:
- Negative control validation (housekeeping genes should rank low)
- Recall@k metrics (the fraction of known genes found among the top-k candidates)
- Per-source breakdown (separate OMIM vs SYSCILIA analysis)
Affects:
- Phase 06-03: Comprehensive validation report will integrate both positive and negative control results
Technical Notes
Negative Control Design:
- Housekeeping genes (ubiquitous, essential, not cilia-specific) serve as negative controls
- Inverted threshold logic: LOW percentiles are GOOD (confirms scoring specificity)
- Complements positive controls: known genes should rank HIGH, housekeeping genes should rank LOW
- If both validations pass: scoring system is both sensitive (catches true positives) and specific (excludes non-ciliary genes)
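The way the two validations combine can be sketched as a small decision helper. The thresholds come from this summary (more than 70% recall in the top 10% for positive controls, median percentile below 0.50 for negative controls); the function name and the example inputs are hypothetical and are not results from the pipeline.

```python
def scoring_is_sensitive_and_specific(recall_top_10pct, negative_median_percentile):
    """Combine positive-control recall with the inverted negative-control check."""
    # Sensitive: known genes are recovered near the top of the ranking.
    sensitive = recall_top_10pct > 0.70
    # Specific: housekeeping genes sit in the lower half of the ranking.
    specific = negative_median_percentile < 0.50
    return {
        "sensitive": sensitive,
        "specific": specific,
        "both_pass": sensitive and specific,
    }


# Hypothetical inputs, chosen only to show both outcomes.
good = scoring_is_sensitive_and_specific(0.82, 0.31)
leaky = scoring_is_sensitive_and_specific(0.82, 0.64)
print(good["both_pass"], leaky["both_pass"])
```

A result like the second one would indicate that the scoring finds known genes but also ranks non-ciliary housekeeping genes too high.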
Recall@k Metrics:
- Provides specific measurement for ">70% in top 10%" success criterion
- Absolute thresholds useful for fixed candidate list sizes (e.g., "top 100 for experimental follow-up")
- Percentage thresholds adapt to total scored gene count (dataset-size independent)
- Deduplication ensures genes in both OMIM + SYSCILIA count once (avoids double-counting)
Per-Source Breakdown:
- Disease genes (OMIM Usher) vs core ciliary genes (SYSCILIA SCGS v2) may have different evidence profiles
- Usher genes may score higher on expression (retina, inner ear specific)
- SYSCILIA genes may score higher on protein structure (IFT, BBSome domains)
- Separate metrics detect if one set validates poorly (suggests evidence layer imbalance)
Self-Check: PASSED
Created files verified:
- src/usher_pipeline/scoring/negative_controls.py exists and is importable
Commits verified:
- e488ff2: Task 1 commit exists (negative control validation module)
- 0f615c0: Task 2 commit exists (recall@k and extended validation)
Functionality verified:
- All imports successful from usher_pipeline.scoring
- HOUSEKEEPING_GENES_CORE has 13 genes
- compile_housekeeping_genes() returns correct DataFrame structure
- All functions callable (no import errors)
All claims in summary verified against actual implementation.