---
phase: 06-validation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/scoring/negative_controls.py
  - src/usher_pipeline/scoring/validation.py
  - src/usher_pipeline/scoring/__init__.py
autonomous: true

must_haves:
  truths:
    - "Housekeeping genes are compiled as a curated negative control set with source provenance"
    - "Negative control validation shows housekeeping genes rank below 50th percentile median"
    - "Positive control validation reports recall@k metrics at k=10%, 20%, top-100"
    - "Known genes achieve >70% recall in top 10% of scored candidates"
  artifacts:
    - path: "src/usher_pipeline/scoring/negative_controls.py"
      provides: "Housekeeping gene compilation and negative control validation"
      exports: ["HOUSEKEEPING_GENES_CORE", "compile_housekeeping_genes", "validate_negative_controls"]
    - path: "src/usher_pipeline/scoring/validation.py"
      provides: "Enhanced positive control validation with recall@k and per-source breakdown"
      exports: ["validate_known_gene_ranking", "compute_recall_at_k", "generate_validation_report"]
    - path: "src/usher_pipeline/scoring/__init__.py"
      provides: "Updated exports including negative control functions"
      contains: "validate_negative_controls"
  key_links:
    - from: "src/usher_pipeline/scoring/negative_controls.py"
      to: "DuckDB scored_genes table"
      via: "PERCENT_RANK window function query"
      pattern: "PERCENT_RANK.*ORDER BY composite_score"
    - from: "src/usher_pipeline/scoring/validation.py"
      to: "src/usher_pipeline/scoring/known_genes.py"
      via: "compile_known_genes import"
      pattern: "from usher_pipeline.scoring.known_genes import"
---

<objective>
Implement negative control validation with housekeeping genes and enhance positive control validation with recall@k metrics.

Purpose: Negative controls ensure the scoring system does not indiscriminately rank all genes high (complementing the existing positive control validation). Enhanced positive control metrics (recall@k) provide the specific ">70% in top 10%" measurement required by success criteria.

Output: Two modules -- negative_controls.py (new) and enhanced validation.py (updated) -- ready for integration into the comprehensive validation report (Plan 03).
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/06-validation/06-RESEARCH.md

@src/usher_pipeline/scoring/validation.py
@src/usher_pipeline/scoring/known_genes.py
@src/usher_pipeline/scoring/quality_control.py
@src/usher_pipeline/scoring/__init__.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>

<tasks>

<task type="auto">
<name>Task 1: Create negative control validation module with housekeeping genes</name>
<files>src/usher_pipeline/scoring/negative_controls.py</files>
<action>
Create `src/usher_pipeline/scoring/negative_controls.py` with:

1. **HOUSEKEEPING_GENES_CORE** frozenset constant containing 13 curated housekeeping genes:
   RPL13A, RPL32, RPLP0, GAPDH, ACTB, B2M, HPRT1, TBP, SDHA, PGK1, PPIA, UBC, YWHAZ.
   Include inline comments grouping by function (ribosomal, metabolic, transcription/reference).

2. **compile_housekeeping_genes() -> pl.DataFrame** function returning a DataFrame with columns:
   - gene_symbol (str)
   - source (str): "literature_validated" for all rows
   - confidence (str): "HIGH" for all rows
   Follow the exact same pattern as `compile_known_genes()` in known_genes.py.

3. **validate_negative_controls(store: PipelineStore, percentile_threshold: float = 0.50) -> dict** function:
   - Register housekeeping genes as temporary DuckDB table `_housekeeping_genes`
   - Use the same PERCENT_RANK window function pattern as `validate_known_gene_ranking()` in validation.py
   - Query: join ranked_genes CTE with _housekeeping_genes on gene_symbol
   - INVERTED validation logic: `validation_passed = median_percentile < percentile_threshold`
   - Return dict with keys: total_expected, total_in_dataset, median_percentile, top_quartile_count, in_high_tier_count, validation_passed, housekeeping_gene_details (top 20 by percentile ASC)
   - Clean up temp table after query
   - Use structlog logger with info/warning levels matching validation.py patterns

4. **generate_negative_control_report(metrics: dict) -> str** function:
   - Follow the exact formatting pattern from generate_validation_report() in validation.py
   - Show the gene table with Score and Percentile headers
   - Include interpretation text for pass/fail

Use structlog, polars, duckdb imports matching existing scoring module patterns. Import PipelineStore from usher_pipeline.persistence.duckdb_store.
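
The inverted threshold check in step 3 can be sketched as follows. This is a minimal illustration, not the module itself: it uses the standard-library `sqlite3` as a stand-in for DuckDB so it runs without dependencies (both accept the `PERCENT_RANK() OVER (ORDER BY ...)` syntax), it takes a raw connection instead of a `PipelineStore`, and the function name `validate_negative_controls_sketch`, the three-way gene grouping, the 0.75 top-quartile cutoff, and the toy scores are all assumptions. `in_high_tier_count` is omitted because the tier definition is not stated here.

```python
import sqlite3
import statistics

# The 13 genes listed in the plan. The assignment to groups is an assumption.
HOUSEKEEPING_GENES_CORE = frozenset({
    "RPL13A", "RPL32", "RPLP0",                    # ribosomal
    "GAPDH", "PGK1", "SDHA", "HPRT1",              # metabolic
    "TBP", "ACTB", "B2M", "PPIA", "UBC", "YWHAZ",  # transcription/reference
})

# Ascending order: the highest composite_score gets percentile 1.0.
RANK_QUERY = """
WITH ranked_genes AS (
    SELECT gene_symbol,
           composite_score,
           PERCENT_RANK() OVER (ORDER BY composite_score) AS percentile_rank
    FROM scored_genes
    WHERE composite_score IS NOT NULL
)
SELECT r.gene_symbol, r.composite_score, r.percentile_rank
FROM ranked_genes r
JOIN _housekeeping_genes h ON r.gene_symbol = h.gene_symbol
ORDER BY r.percentile_rank ASC
"""


def validate_negative_controls_sketch(conn, percentile_threshold=0.50):
    """Pass when the median housekeeping percentile is BELOW the threshold."""
    conn.execute("CREATE TEMP TABLE _housekeeping_genes (gene_symbol TEXT)")
    conn.executemany(
        "INSERT INTO _housekeeping_genes VALUES (?)",
        [(gene,) for gene in sorted(HOUSEKEEPING_GENES_CORE)],
    )
    try:
        rows = conn.execute(RANK_QUERY).fetchall()
    finally:
        # Clean up the temp table even if the query fails.
        conn.execute("DROP TABLE _housekeeping_genes")

    percentiles = [pct for _, _, pct in rows]
    median = statistics.median(percentiles) if percentiles else None
    return {
        "total_expected": len(HOUSEKEEPING_GENES_CORE),
        "total_in_dataset": len(rows),
        "median_percentile": median,
        "top_quartile_count": sum(pct >= 0.75 for pct in percentiles),
        # Inverted relative to positive controls: low ranks are the good outcome.
        "validation_passed": median is not None and median < percentile_threshold,
        "housekeeping_gene_details": rows[:20],
    }


# Toy data: housekeeping genes score low, 87 made-up candidates score higher.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scored_genes (gene_symbol TEXT, composite_score REAL)")
toy = [(g, i * 0.01) for i, g in enumerate(sorted(HOUSEKEEPING_GENES_CORE))]
toy += [(f"CANDIDATE_{j}", 0.5 + j * 0.005) for j in range(87)]
conn.executemany("INSERT INTO scored_genes VALUES (?, ?)", toy)

metrics = validate_negative_controls_sketch(conn)
print(metrics["median_percentile"], metrics["validation_passed"])
```

If no housekeeping gene is present in `scored_genes`, this sketch reports `median_percentile` as `None` and fails validation rather than passing vacuously; whether the real module should behave that way is a design choice left open here.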
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.scoring.negative_controls import HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report; df = compile_housekeeping_genes(); print(f'Housekeeping genes: {df.height}'); assert df.height == 13; assert set(df.columns) == {'gene_symbol', 'source', 'confidence'}; print('OK')"` exits 0
</verify>
<done>
negative_controls.py exists with 13 curated housekeeping genes, compile function returns correct DataFrame structure, validate function uses PERCENT_RANK with inverted threshold logic, report function generates human-readable output.
</done>
</task>

<task type="auto">
<name>Task 2: Enhance positive control validation with recall@k metrics</name>
<files>src/usher_pipeline/scoring/validation.py, src/usher_pipeline/scoring/__init__.py</files>
<action>
**In validation.py**, add the following functions (do NOT modify existing functions, only ADD):

1. **compute_recall_at_k(store: PipelineStore, k_values: list[int] | None = None) -> dict** function:
   - Default k_values: [100, 500, 1000, 2000] (absolute counts)
   - Also compute recall at percentage thresholds: top 5%, 10%, 20% of scored genes
   - Query scored_genes ordered by composite_score DESC (WHERE NOT NULL)
   - For each k: count how many known genes (from compile_known_genes, deduplicated on gene_symbol) appear in top-k
   - Recall@k = found_in_top_k / total_known_unique
   - Return dict with: recalls_absolute (dict mapping k -> recall float), recalls_percentage (dict mapping pct_string -> recall float), total_known_unique (int), total_scored (int)
   - Use structlog for logging results

2. **validate_positive_controls_extended(store: PipelineStore, percentile_threshold: float = 0.75) -> dict** function:
   - Call existing validate_known_gene_ranking(store, percentile_threshold) to get base metrics
   - Call compute_recall_at_k(store) to get recall metrics
   - Add per-source breakdown: compute median percentile separately for "omim_usher" and "syscilia_scgs_v2" genes
   - Per-source query: same PERCENT_RANK CTE but filter JOIN by source
   - Return dict combining base metrics + recall_at_k + per_source_breakdown (dict mapping source -> {median_percentile, count, top_quartile_count})
   - This is the "full" positive control validation for Phase 6
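
The recall@k arithmetic from item 1 can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `compute_recall_at_k_sketch` is hypothetical, it takes an already-sorted list of gene symbols instead of querying `scored_genes` through a `PipelineStore`, the percentage cutoffs round up with `math.ceil`, the `"10%"` key format is a guess at `pct_string`, and the gene names are placeholders.

```python
import math


def compute_recall_at_k_sketch(ranked_symbols, known_symbols,
                               k_values=None, pct_values=(5, 10, 20)):
    """Recall of known genes within the top-k of a ranking.

    ranked_symbols: gene symbols sorted by composite_score DESC,
                    with NULL-scored genes already removed.
    known_symbols:  positive-control symbols, possibly with duplicates.
    """
    if k_values is None:
        k_values = [100, 500, 1000, 2000]

    known = set(known_symbols)          # deduplicate on gene_symbol
    total_scored = len(ranked_symbols)

    def recall(k):
        if not known:
            return 0.0
        found_in_top_k = len(known & set(ranked_symbols[:k]))
        # Denominator is all unique known genes, including unscored ones.
        return found_in_top_k / len(known)

    return {
        "recalls_absolute": {k: recall(k) for k in k_values},
        "recalls_percentage": {
            f"{pct}%": recall(math.ceil(total_scored * pct / 100))
            for pct in pct_values
        },
        "total_known_unique": len(known),
        "total_scored": total_scored,
    }


# Toy ranking of 1000 placeholder genes; G0 is the top-scored gene.
ranked = [f"G{i}" for i in range(1000)]
known = ["G3", "G42", "G42", "G97", "G450"]   # one duplicate on purpose

result = compute_recall_at_k_sketch(ranked, known)
meets_target = result["recalls_percentage"]["10%"] > 0.70
print(result["recalls_percentage"], meets_target)
```

Because the denominator counts every unique known gene, a known gene missing from `scored_genes` lowers recall at every k, which matches the `found_in_top_k / total_known_unique` definition above.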

**In __init__.py**, add exports for: compute_recall_at_k, validate_positive_controls_extended, and also add imports/exports for negative_controls module: HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report.
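
For the per-source breakdown described in `validate_positive_controls_extended`, the aggregation step could look roughly like the sketch below. It assumes the PERCENT_RANK query has already produced `(gene_symbol, source, percentile)` rows; the function name `per_source_breakdown_sketch`, the 0.75 top-quartile cutoff, and the placeholder genes and percentiles are illustrative assumptions, not values from the pipeline.

```python
import statistics
from collections import defaultdict


def per_source_breakdown_sketch(rows, top_quartile_cutoff=0.75):
    """Group percentile ranks by source and summarise each group.

    rows: iterable of (gene_symbol, source, percentile) tuples.
    """
    by_source = defaultdict(list)
    for _gene_symbol, source, percentile in rows:
        by_source[source].append(percentile)

    return {
        source: {
            "median_percentile": statistics.median(values),
            "count": len(values),
            "top_quartile_count": sum(v >= top_quartile_cutoff for v in values),
        }
        for source, values in by_source.items()
    }


# Placeholder rows; real rows would come from the ranked_genes CTE
# joined to the known-gene table and filtered by source.
toy_rows = [
    ("GENE_A", "omim_usher", 0.99),
    ("GENE_B", "omim_usher", 0.91),
    ("GENE_C", "omim_usher", 0.70),
    ("GENE_D", "syscilia_scgs_v2", 0.75),
    ("GENE_E", "syscilia_scgs_v2", 0.50),
]

breakdown = per_source_breakdown_sketch(toy_rows)
print(breakdown)
```

Doing the grouping in Python keeps a single ranking query; running one filtered query per source, as the plan describes, yields the same numbers as long as the percentile is computed over all scored genes before filtering.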
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.scoring import compute_recall_at_k, validate_positive_controls_extended, HOUSEKEEPING_GENES_CORE, compile_housekeeping_genes, validate_negative_controls, generate_negative_control_report; print('All imports OK')"` exits 0
</verify>
<done>
validation.py has compute_recall_at_k and validate_positive_controls_extended functions. __init__.py exports all new functions from both negative_controls.py and updated validation.py. Recall@k computes at both absolute and percentage thresholds. Per-source breakdown separates OMIM from SYSCILIA metrics.
</done>
</task>

</tasks>

<verification>
- `python -c "from usher_pipeline.scoring.negative_controls import HOUSEKEEPING_GENES_CORE; assert len(HOUSEKEEPING_GENES_CORE) == 13"` -- housekeeping genes compiled
- `python -c "from usher_pipeline.scoring import validate_negative_controls, compute_recall_at_k, validate_positive_controls_extended"` -- all functions importable
- `python -c "from usher_pipeline.scoring.negative_controls import compile_housekeeping_genes; df = compile_housekeeping_genes(); assert 'gene_symbol' in df.columns and 'source' in df.columns"` -- DataFrame structure correct
</verification>

<success_criteria>
- negative_controls.py creates housekeeping gene set and validates they rank low (inverted threshold)
- validation.py compute_recall_at_k measures recall at multiple k values including percentage-based thresholds
- validate_positive_controls_extended combines percentile + recall + per-source metrics
- All new functions exported from scoring.__init__
</success_criteria>

<output>
After completion, create `.planning/phases/06-validation/06-01-SUMMARY.md`
</output>