docs(04): create phase plan for scoring and integration
This commit is contained in:
244
.planning/phases/04-scoring-integration/04-02-PLAN.md
Normal file
244
.planning/phases/04-scoring-integration/04-02-PLAN.md
Normal file
@@ -0,0 +1,244 @@
|
||||
---
|
||||
phase: 04-scoring-integration
|
||||
plan: 02
|
||||
type: execute
|
||||
wave: 2
|
||||
depends_on: ["04-01"]
|
||||
files_modified:
|
||||
- src/usher_pipeline/scoring/quality_control.py
|
||||
- src/usher_pipeline/scoring/validation.py
|
||||
- src/usher_pipeline/scoring/__init__.py
|
||||
autonomous: true
|
||||
|
||||
must_haves:
|
||||
truths:
|
||||
- "Quality control detects missing data rates per evidence layer and flags layers above threshold"
|
||||
- "Score distribution anomalies (no variation, out-of-range values) are detected per layer"
|
||||
- "Outlier genes are identified using MAD-based robust detection per evidence layer"
|
||||
- "Known cilia/Usher genes rank in top quartile of composite scores before exclusion"
|
||||
artifacts:
|
||||
- path: "src/usher_pipeline/scoring/quality_control.py"
|
||||
provides: "QC checks for missing data, distributions, outliers"
|
||||
contains: "median_absolute_deviation"
|
||||
- path: "src/usher_pipeline/scoring/validation.py"
|
||||
provides: "Positive control validation against known gene ranking"
|
||||
contains: "PERCENT_RANK"
|
||||
key_links:
|
||||
- from: "src/usher_pipeline/scoring/quality_control.py"
|
||||
to: "DuckDB scored_genes table"
|
||||
via: "store.conn.execute SQL queries"
|
||||
pattern: "scored_genes"
|
||||
- from: "src/usher_pipeline/scoring/validation.py"
|
||||
to: "src/usher_pipeline/scoring/known_genes.py"
|
||||
via: "compile_known_genes import"
|
||||
pattern: "compile_known_genes"
|
||||
- from: "src/usher_pipeline/scoring/validation.py"
|
||||
to: "DuckDB scored_genes table"
|
||||
via: "PERCENT_RANK window function"
|
||||
pattern: "PERCENT_RANK.*ORDER BY.*composite_score"
|
||||
---
|
||||
|
||||
<objective>
|
||||
Implement quality control checks and positive control validation for the scoring system.
|
||||
|
||||
Purpose: QC catches upstream failures (high missing data rates, distribution anomalies, normalization bugs, outliers) before results are trusted. Positive control validation confirms known cilia/Usher genes rank highly, proving the scoring logic works correctly. Both are essential for scoring system credibility.
|
||||
|
||||
Output: `src/usher_pipeline/scoring/quality_control.py` and `src/usher_pipeline/scoring/validation.py`.
|
||||
</objective>
|
||||
|
||||
<execution_context>
|
||||
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
|
||||
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
|
||||
</execution_context>
|
||||
|
||||
<context>
|
||||
@.planning/PROJECT.md
|
||||
@.planning/ROADMAP.md
|
||||
@.planning/STATE.md
|
||||
@.planning/phases/04-scoring-integration/04-RESEARCH.md
|
||||
@.planning/phases/04-scoring-integration/04-01-SUMMARY.md
|
||||
@src/usher_pipeline/scoring/integration.py
|
||||
@src/usher_pipeline/scoring/known_genes.py
|
||||
@src/usher_pipeline/persistence/duckdb_store.py
|
||||
</context>
|
||||
|
||||
<tasks>
|
||||
|
||||
<task type="auto">
|
||||
<name>Task 1: Quality control checks for scoring results</name>
|
||||
<files>
|
||||
src/usher_pipeline/scoring/quality_control.py
|
||||
</files>
|
||||
<action>
|
||||
1. Create `src/usher_pipeline/scoring/quality_control.py`:
|
||||
|
||||
Import: numpy, polars, structlog, PipelineStore from persistence. Import scipy.stats for MAD-based outlier detection (add `scipy>=1.14` to project dependencies if not present -- check pyproject.toml first).
|
||||
|
||||
Define constants:
|
||||
- `MISSING_RATE_WARN = 0.5` (50% missing = warning)
|
||||
- `MISSING_RATE_ERROR = 0.8` (80% missing = error)
|
||||
- `MIN_STD_THRESHOLD = 0.01` (std < 0.01 = no variation warning)
|
||||
- `OUTLIER_MAD_THRESHOLD = 3.0` (>3 MAD from median = outlier)
|
||||
- `EVIDENCE_LAYERS` = list of 6 layer names: ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]
|
||||
- `SCORE_COLUMNS` = dict mapping layer name to column name: {"gnomad": "gnomad_score", "expression": "expression_score", "annotation": "annotation_score", "localization": "localization_score", "animal_model": "animal_model_score", "literature": "literature_score"}
|
||||
|
||||
Create function `compute_missing_data_rates(store: PipelineStore) -> dict`:
|
||||
- Query `scored_genes` table to compute fraction of NULL values per score column.
|
||||
- Use a single SQL query: `SELECT COUNT(*) AS total, AVG(CASE WHEN col IS NULL THEN 1.0 ELSE 0.0 END) AS col_missing ...` for all 6 columns.
|
||||
- Return dict with keys: "rates" (dict[str, float] per layer), "warnings" (list of str), "errors" (list of str).
|
||||
- Classify: rate > MISSING_RATE_ERROR -> add to errors; rate > MISSING_RATE_WARN -> add to warnings.
|
||||
- Log each layer's rate with appropriate level (error/warning/info).
|
||||
|
||||
Create function `compute_distribution_stats(store: PipelineStore) -> dict`:
|
||||
- For each evidence layer, query non-NULL scores from `scored_genes`.
|
||||
- Compute: mean, median, std, min, max using numpy on the extracted array.
|
||||
- Detect anomalies:
|
||||
- std < MIN_STD_THRESHOLD -> warning "no variation"
|
||||
- min < 0.0 or max > 1.0 -> error "out of range"
|
||||
- Return dict with keys: "distributions" (dict per layer with stats), "warnings", "errors".
|
||||
|
||||
Create function `detect_outliers(store: PipelineStore) -> dict`:
|
||||
- For each evidence layer, query gene_symbol + score from `scored_genes` where score IS NOT NULL.
|
||||
- Compute MAD: `median(|scores - median(scores)|)`.
|
||||
- If MAD == 0, skip (no variation). Otherwise, flag genes where `|score - median| > OUTLIER_MAD_THRESHOLD * MAD`.
|
||||
- Return dict with keys per layer: "count" (int), "example_genes" (list of up to 5 gene symbols).
|
||||
- Log outlier counts per layer.
|
||||
|
||||
Create function `run_qc_checks(store: PipelineStore) -> dict`:
|
||||
- Orchestrates all three checks above.
|
||||
- Returns combined dict with keys: "missing_data", "distributions", "outliers", "composite_stats", "warnings" (combined list), "errors" (combined list), "passed" (bool: True if no errors).
|
||||
- Also compute composite score distribution: mean, median, std, percentiles (p10, p25, p50, p75, p90) from `scored_genes` WHERE composite_score IS NOT NULL.
|
||||
- Log final QC summary: total warnings, total errors, pass/fail.
|
||||
|
||||
Note: Use numpy for statistics (scipy.stats.median_abs_deviation can be used for MAD, or compute manually with np.median(np.abs(x - np.median(x)))). Keep scipy import conditional or direct -- either approach is fine as long as outlier detection works.
|
||||
</action>
|
||||
<verify>
|
||||
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
|
||||
from usher_pipeline.scoring.quality_control import (
|
||||
compute_missing_data_rates,
|
||||
compute_distribution_stats,
|
||||
detect_outliers,
|
||||
run_qc_checks,
|
||||
MISSING_RATE_WARN,
|
||||
MISSING_RATE_ERROR,
|
||||
OUTLIER_MAD_THRESHOLD,
|
||||
EVIDENCE_LAYERS,
|
||||
SCORE_COLUMNS,
|
||||
)
|
||||
import inspect
|
||||
print('Constants:')
|
||||
print(f' MISSING_RATE_WARN={MISSING_RATE_WARN}')
|
||||
print(f' MISSING_RATE_ERROR={MISSING_RATE_ERROR}')
|
||||
print(f' OUTLIER_MAD_THRESHOLD={OUTLIER_MAD_THRESHOLD}')
|
||||
print(f' EVIDENCE_LAYERS={EVIDENCE_LAYERS}')
|
||||
print(f' SCORE_COLUMNS has {len(SCORE_COLUMNS)} entries')
|
||||
assert len(EVIDENCE_LAYERS) == 6
|
||||
assert len(SCORE_COLUMNS) == 6
|
||||
# Verify function signatures
|
||||
for fn in [compute_missing_data_rates, compute_distribution_stats, detect_outliers, run_qc_checks]:
|
||||
sig = inspect.signature(fn)
|
||||
assert 'store' in sig.parameters, f'{fn.__name__} missing store param'
|
||||
print('All QC functions verified')
|
||||
"`
|
||||
</verify>
|
||||
<done>
|
||||
- compute_missing_data_rates() queries NULL rates per layer from scored_genes with warning/error classification
|
||||
- compute_distribution_stats() computes mean/median/std/min/max per layer and detects anomalies (no variation, out of range)
|
||||
- detect_outliers() uses MAD-based detection (>3 MAD from median) per layer
|
||||
- run_qc_checks() orchestrates all checks and returns combined results with pass/fail status
|
||||
</done>
|
||||
</task>
|
||||
|
||||
<task type="auto">
|
||||
<name>Task 2: Positive control validation against known gene rankings</name>
|
||||
<files>
|
||||
src/usher_pipeline/scoring/validation.py
|
||||
src/usher_pipeline/scoring/__init__.py
|
||||
</files>
|
||||
<action>
|
||||
1. Create `src/usher_pipeline/scoring/validation.py`:
|
||||
|
||||
Import: duckdb, polars, structlog, PipelineStore, compile_known_genes from scoring.known_genes.
|
||||
|
||||
Create function `validate_known_gene_ranking(store: PipelineStore, percentile_threshold: float = 0.75) -> dict`:
|
||||
- Call `compile_known_genes()` to get known gene DataFrame.
|
||||
- Register the known genes DataFrame as a temporary DuckDB table: `store.conn.execute("CREATE OR REPLACE TEMP TABLE _known_genes AS SELECT * FROM known_df")` (where known_df is the polars DataFrame variable).
|
||||
- Execute a DuckDB SQL query that:
|
||||
a. Computes PERCENT_RANK() OVER (ORDER BY composite_score) for ALL genes in scored_genes (not just known genes).
|
||||
b. INNER JOINs with `_known_genes` on gene_symbol to get percentile ranks for known genes only.
|
||||
c. Filters WHERE composite_score IS NOT NULL.
|
||||
d. Returns gene_symbol, composite_score, source, percentile_rank.
|
||||
- Compute validation metrics from the result:
|
||||
- `total_known_in_dataset`: count of known genes found in scored_genes
|
||||
- `total_known_expected`: count of unique gene_symbols in known gene list
|
||||
- `median_percentile`: median of percentile_rank values
|
||||
- `top_quartile_count`: count where percentile_rank >= 0.75
|
||||
- `top_quartile_fraction`: top_quartile_count / total_known_in_dataset
|
||||
- `validation_passed`: median_percentile >= percentile_threshold
|
||||
- `known_gene_details`: list of dicts with gene_symbol, composite_score, percentile_rank, source (sorted by percentile_rank descending, limited to top 20)
|
||||
- If no known genes found in scored_genes, return validation_passed=False with reason="no_known_genes_found".
|
||||
- Log validation results: passed/failed, median_percentile, top_quartile stats.
|
||||
- Drop the temp table after query: `store.conn.execute("DROP TABLE IF EXISTS _known_genes")`.
|
||||
- Return the metrics dict.
|
||||
|
||||
Create function `generate_validation_report(metrics: dict) -> str`:
|
||||
- Produce a human-readable text report summarizing validation results.
|
||||
- Include: pass/fail status, median percentile, top quartile count/fraction, table of top-ranked known genes.
|
||||
- Format percentiles as percentages (e.g., "87.3%").
|
||||
- Return the report as a string.
|
||||
|
||||
2. Update `src/usher_pipeline/scoring/__init__.py` to also export: `run_qc_checks`, `validate_known_gene_ranking`, `generate_validation_report`.
|
||||
</action>
|
||||
<verify>
|
||||
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
|
||||
from usher_pipeline.scoring.validation import validate_known_gene_ranking, generate_validation_report
|
||||
from usher_pipeline.scoring import (
|
||||
compile_known_genes,
|
||||
load_known_genes_to_duckdb,
|
||||
join_evidence_layers,
|
||||
compute_composite_scores,
|
||||
persist_scored_genes,
|
||||
run_qc_checks,
|
||||
validate_known_gene_ranking,
|
||||
generate_validation_report,
|
||||
)
|
||||
import inspect
|
||||
sig = inspect.signature(validate_known_gene_ranking)
|
||||
assert 'store' in sig.parameters
|
||||
assert 'percentile_threshold' in sig.parameters
|
||||
src = inspect.getsource(validate_known_gene_ranking)
|
||||
assert 'PERCENT_RANK' in src, 'Missing PERCENT_RANK window function'
|
||||
assert 'compile_known_genes' in src, 'Missing known genes integration'
|
||||
print('Validation function verified')
|
||||
print('All scoring module exports verified')
|
||||
"`
|
||||
</verify>
|
||||
<done>
|
||||
- validate_known_gene_ranking() computes percentile ranks of known genes in scored_genes using PERCENT_RANK window function
|
||||
- Returns metrics dict with median_percentile, top_quartile_count/fraction, validation_passed boolean
|
||||
- generate_validation_report() produces human-readable text summary
|
||||
- All scoring module functions exported from __init__.py
|
||||
</done>
|
||||
</task>
|
||||
|
||||
</tasks>
|
||||
|
||||
<verification>
|
||||
- Quality control detects missing data rates and classifies as warning (>50%) or error (>80%)
|
||||
- Distribution stats computed per layer with anomaly detection (no variation, out-of-range)
|
||||
- MAD-based outlier detection flags genes >3 MAD from median per layer
|
||||
- Known gene validation uses PERCENT_RANK before exclusion (not after filtering)
|
||||
- All functions importable from usher_pipeline.scoring
|
||||
</verification>
|
||||
|
||||
<success_criteria>
|
||||
- run_qc_checks() returns structured dict with missing_data, distributions, outliers, composite_stats, pass/fail
|
||||
- validate_known_gene_ranking() computes percentile ranks for known cilia/Usher genes
|
||||
- Validation checks median percentile against threshold (default 0.75 = top quartile)
|
||||
- generate_validation_report() produces readable text output
|
||||
- scipy used for MAD computation (or equivalent numpy manual calculation)
|
||||
</success_criteria>
|
||||
|
||||
<output>
|
||||
After completion, create `.planning/phases/04-scoring-integration/04-02-SUMMARY.md`
|
||||
</output>
|
||||
Reference in New Issue
Block a user