docs(04): create phase plan for scoring and integration
213
.planning/phases/04-scoring-integration/04-01-PLAN.md
Normal file
@@ -0,0 +1,213 @@
---
phase: 04-scoring-integration
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/scoring/__init__.py
  - src/usher_pipeline/scoring/known_genes.py
  - src/usher_pipeline/scoring/integration.py
  - src/usher_pipeline/config/schema.py
autonomous: true

must_haves:
  truths:
    - "Known cilia/Usher genes from SYSCILIA and OMIM are compiled into a reusable gene set"
    - "ScoringWeights validates that all weights sum to 1.0 and rejects invalid configs"
    - "Multi-evidence scoring joins all 6 evidence tables and computes weighted average of available evidence only"
    - "Genes with missing evidence layers receive NULL (not zero) for those layers"
  artifacts:
    - path: "src/usher_pipeline/scoring/__init__.py"
      provides: "Scoring module package"
      exports: ["compile_known_genes", "compute_composite_scores", "join_evidence_layers"]
    - path: "src/usher_pipeline/scoring/known_genes.py"
      provides: "Known cilia/Usher gene compilation"
      contains: "OMIM_USHER_GENES"
    - path: "src/usher_pipeline/scoring/integration.py"
      provides: "Multi-evidence weighted scoring with NULL preservation"
      contains: "COALESCE"
    - path: "src/usher_pipeline/config/schema.py"
      provides: "ScoringWeights with validate_sum method"
      contains: "validate_sum"
  key_links:
    - from: "src/usher_pipeline/scoring/integration.py"
      to: "DuckDB evidence tables"
      via: "LEFT JOIN on gene_id"
      pattern: "LEFT JOIN.*ON.*gene_id"
    - from: "src/usher_pipeline/scoring/integration.py"
      to: "src/usher_pipeline/config/schema.py"
      via: "ScoringWeights parameter"
      pattern: "ScoringWeights"
---

<objective>
Compile known cilia/Usher gene set and implement multi-evidence weighted scoring integration.

Purpose: Establishes the foundation for Phase 4 -- the known gene list (for exclusion and positive control validation) and the core scoring engine that joins all 6 evidence tables with configurable weights and NULL-preserving weighted averages.

Output: `src/usher_pipeline/scoring/` module with known_genes.py and integration.py; updated config/schema.py with weight sum validation.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-scoring-integration/04-RESEARCH.md
@src/usher_pipeline/config/schema.py
@src/usher_pipeline/persistence/duckdb_store.py
@src/usher_pipeline/evidence/gnomad/load.py
</context>

<tasks>

<task type="auto">
<name>Task 1: Known gene compilation and ScoringWeights validation</name>
<files>
src/usher_pipeline/scoring/__init__.py
src/usher_pipeline/scoring/known_genes.py
src/usher_pipeline/config/schema.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/__init__.py` with exports for the module.

2. Create `src/usher_pipeline/scoring/known_genes.py`:
   - Define `OMIM_USHER_GENES` as a frozenset of 10 known Usher syndrome gene symbols: MYO7A, USH1C, CDH23, PCDH15, USH1G (SANS), CIB2, USH2A, ADGRV1 (GPR98), WHRN, CLRN1. Include a brief docstring noting these are OMIM Usher syndrome entries.
   - Define `SYSCILIA_SCGS_V2_CORE` as a frozenset of well-known ciliary genes that serve as positive controls. Include at minimum: IFT88, IFT140, IFT172, BBS1, BBS2, BBS4, BBS5, BBS7, BBS9, BBS10, RPGRIP1L, CEP290, ARL13B, INPP5E, TMEM67, CC2D2A, NPHP1, NPHP3, NPHP4, RPGR, CEP164, OFD1, MKS1, TCTN1, TCTN2, TMEM216, TMEM231, TMEM138. This is a curated subset (~30 genes) of the full SCGS v2 list (686 genes). Add a docstring noting the full list can be downloaded from the SCGS v2 publication supplementary data (DOI: 10.1091/mbc.E21-05-0226) and loaded via a future `fetch_scgs_v2()` function.
   - Create function `compile_known_genes() -> pl.DataFrame` that returns a polars DataFrame with columns: `gene_symbol` (str), `source` (str: "omim_usher" or "syscilia_scgs_v2"), `confidence` (str: "HIGH"). Combines both gene sets. De-duplicates on gene_symbol (if any gene appears in both lists, keep both source entries as separate rows).
   - Create function `load_known_genes_to_duckdb(store: PipelineStore) -> int` that calls `compile_known_genes()`, saves to DuckDB table `known_cilia_genes` using `store.save_dataframe()`, and returns the count of unique gene symbols.

3. Update `src/usher_pipeline/config/schema.py`:
   - Add a `validate_sum(self) -> None` method to the `ScoringWeights` class that sums all 6 weight fields and raises `ValueError` if the absolute difference from 1.0 exceeds 1e-6. Message format: `f"Scoring weights must sum to 1.0, got {total:.6f}"`.
   - Do NOT change any existing field defaults or field definitions -- only add the method.
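As a sketch, the gene-set compilation and weight validation described above might look like the following. The gene lists are abbreviated, the dataclass form and equal default weights are assumptions (the real `ScoringWeights` lives in config/schema.py with its own defaults), and the real `compile_known_genes()` returns a polars DataFrame rather than a list of dicts:

```python
from dataclasses import dataclass

# Abbreviated sketch of known_genes.py -- full sets per the plan above.
OMIM_USHER_GENES = frozenset({
    "MYO7A", "USH1C", "CDH23", "PCDH15", "USH1G",
    "CIB2", "USH2A", "ADGRV1", "WHRN", "CLRN1",
})
SYSCILIA_SCGS_V2_CORE = frozenset({"IFT88", "BBS1", "CEP290"})  # abbreviated

def compile_known_genes() -> list[dict]:
    """Combine both gene sets with source labels; the real version
    wraps these rows in a polars DataFrame."""
    return (
        [{"gene_symbol": g, "source": "omim_usher", "confidence": "HIGH"}
         for g in sorted(OMIM_USHER_GENES)]
        + [{"gene_symbol": g, "source": "syscilia_scgs_v2", "confidence": "HIGH"}
           for g in sorted(SYSCILIA_SCGS_V2_CORE)]
    )

@dataclass
class ScoringWeights:
    # Field names follow the 6 evidence layers; equal defaults are illustrative.
    gnomad: float = 1 / 6
    expression: float = 1 / 6
    annotation: float = 1 / 6
    localization: float = 1 / 6
    animal_model: float = 1 / 6
    literature: float = 1 / 6

    def validate_sum(self) -> None:
        # Reject configs whose weights do not sum to 1.0 (within 1e-6).
        total = (self.gnomad + self.expression + self.annotation
                 + self.localization + self.animal_model + self.literature)
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"Scoring weights must sum to 1.0, got {total:.6f}")
```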
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.known_genes import compile_known_genes, OMIM_USHER_GENES, SYSCILIA_SCGS_V2_CORE
from usher_pipeline.config.schema import ScoringWeights
df = compile_known_genes()
print(f'Known genes: {df.height} rows, {df[\"gene_symbol\"].n_unique()} unique symbols')
assert df.height >= 38, f'Expected >= 38 rows, got {df.height}'
assert 'MYO7A' in df['gene_symbol'].to_list()
assert 'IFT88' in df['gene_symbol'].to_list()
w = ScoringWeights()
w.validate_sum()  # Should pass with defaults
print('ScoringWeights.validate_sum() passed with defaults')
try:
    w2 = ScoringWeights(gnomad=0.5)
    w2.validate_sum()
    print('ERROR: Should have raised ValueError')
except ValueError as e:
    print(f'Correctly rejected invalid weights: {e}')
print('All checks passed')
"`
</verify>
<done>
- OMIM_USHER_GENES contains exactly 10 known Usher syndrome genes
- SYSCILIA_SCGS_V2_CORE contains >= 25 core ciliary genes
- compile_known_genes() returns DataFrame with gene_symbol, source, confidence columns
- ScoringWeights.validate_sum() passes with defaults, raises ValueError when weights do not sum to 1.0
</done>
</task>

<task type="auto">
<name>Task 2: Multi-evidence weighted scoring integration</name>
<files>
src/usher_pipeline/scoring/integration.py
src/usher_pipeline/scoring/__init__.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/integration.py`:

Import: duckdb, polars, structlog, ScoringWeights from config.schema, PipelineStore from persistence.

Create function `join_evidence_layers(store: PipelineStore) -> pl.DataFrame`:
- Execute a DuckDB SQL query using `store.conn` (direct DuckDB connection) that LEFT JOINs `gene_universe` with all 6 evidence tables on `gene_id`:
  - `gnomad_constraint` -> `loeuf_normalized` AS `gnomad_score`
  - `tissue_expression` -> `expression_score_normalized` AS `expression_score`
  - `annotation_completeness` -> `annotation_score_normalized` AS `annotation_score`
  - `subcellular_localization` -> `localization_score_normalized` AS `localization_score`
  - `animal_model_phenotypes` -> `animal_model_score_normalized` AS `animal_model_score`
  - `literature_evidence` -> `literature_score_normalized` AS `literature_score`
- Compute `evidence_count` as the count of non-NULL scores (sum of CASE WHEN ... IS NOT NULL THEN 1 ELSE 0 END for all 6 layers).
- Select `gene_id`, `gene_symbol` from `gene_universe`, plus all 6 aliased scores and `evidence_count`.
- Return result as polars DataFrame via `.pl()`.
- Log the total gene count, mean evidence_count, and per-layer NULL rates using structlog.

Create function `compute_composite_scores(store: PipelineStore, weights: ScoringWeights) -> pl.DataFrame`:
- Call `weights.validate_sum()` first to assert valid weights.
- Execute a DuckDB SQL query that:
  a. Uses the same join as `join_evidence_layers` (or call it as a CTE / subquery).
  b. Computes `available_weight` = sum of weights for non-NULL layers (using CASE WHEN ... IS NOT NULL THEN weight_value ELSE 0 END for each layer).
  c. Computes `weighted_sum` = sum of COALESCE(score * weight, 0) for each layer.
  d. Computes `composite_score` = CASE WHEN available_weight > 0 THEN weighted_sum / available_weight ELSE NULL END.
  e. Computes `quality_flag`:
     - `evidence_count >= 4` -> 'sufficient_evidence'
     - `evidence_count >= 2` -> 'moderate_evidence'
     - `evidence_count >= 1` -> 'sparse_evidence'
     - ELSE 'no_evidence'
  f. Includes all individual layer scores for explainability.
  g. Includes per-layer contribution columns: `gnomad_contribution` = gnomad_score * gnomad_weight (NULL if score is NULL), etc.
  h. Orders by composite_score DESC NULLS LAST.
- Return as polars DataFrame.
- Log summary stats: total genes, genes with composite score, mean/median composite score, quality flag distribution.

Create function `persist_scored_genes(store: PipelineStore, scored_df: pl.DataFrame, weights: ScoringWeights) -> None`:
- Save `scored_df` to DuckDB table `scored_genes` via `store.save_dataframe()` with replace=True.
- Description: "Multi-evidence weighted composite scores with per-layer contributions".
- Log the row count and quality flag distribution.

2. Update `src/usher_pipeline/scoring/__init__.py` to export: `compile_known_genes`, `load_known_genes_to_duckdb`, `join_evidence_layers`, `compute_composite_scores`, `persist_scored_genes`.
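The NULL-preserving weighted average in steps (b)-(d) can be demonstrated on a toy two-layer table. This is a sketch only: sqlite3 stands in for DuckDB (the COALESCE/CASE pattern is the same), and the table name, two layers, and 0.5 weights are illustrative:

```python
import sqlite3

# Toy "joined evidence" table: one gene per NULL pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE joined (gene_id TEXT, gnomad_score REAL, expression_score REAL);
    INSERT INTO joined VALUES
        ('G1', 0.8, 0.6),   -- both layers present
        ('G2', 0.8, NULL),  -- one layer missing: only gnomad contributes
        ('G3', NULL, NULL); -- no evidence: composite must be NULL, not 0
""")
rows = conn.execute("""
    SELECT gene_id,
           -- weights are renormalized over available layers only
           CASE WHEN available_weight > 0
                THEN weighted_sum / available_weight
                ELSE NULL END AS composite_score
    FROM (
        SELECT gene_id,
               COALESCE(gnomad_score * 0.5, 0)
                 + COALESCE(expression_score * 0.5, 0) AS weighted_sum,
               (CASE WHEN gnomad_score IS NOT NULL THEN 0.5 ELSE 0 END)
                 + (CASE WHEN expression_score IS NOT NULL THEN 0.5 ELSE 0 END)
                   AS available_weight
        FROM joined
    ) AS t
""").fetchall()
```

G2 shows why NULLs must not be coerced to zero: with only gnomad present, its composite equals its gnomad score (0.8) instead of being dragged down to 0.4.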
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.integration import join_evidence_layers, compute_composite_scores, persist_scored_genes
from usher_pipeline.config.schema import ScoringWeights
import inspect
# Verify function signatures exist and have correct params
sig_join = inspect.signature(join_evidence_layers)
assert 'store' in sig_join.parameters
sig_score = inspect.signature(compute_composite_scores)
assert 'store' in sig_score.parameters
assert 'weights' in sig_score.parameters
sig_persist = inspect.signature(persist_scored_genes)
assert 'store' in sig_persist.parameters
print('All function signatures verified')
print('Source contains COALESCE:', 'COALESCE' in inspect.getsource(compute_composite_scores))
print('Source contains LEFT JOIN:', 'LEFT JOIN' in inspect.getsource(join_evidence_layers))
"`
</verify>
<done>
- join_evidence_layers() LEFT JOINs gene_universe with all 6 evidence tables on gene_id, returns DataFrame with gene_id, gene_symbol, 6 score columns, evidence_count
- compute_composite_scores() computes weighted average of available evidence only (weighted_sum / available_weight), with quality_flag and per-layer contributions
- NULL scores are not replaced with zero in the weighted average -- only available evidence contributes
- persist_scored_genes() saves scored_genes table to DuckDB
</done>
</task>

</tasks>

<verification>
- `src/usher_pipeline/scoring/` module exists with `__init__.py`, `known_genes.py`, `integration.py`
- Known gene set includes 10 OMIM Usher genes and 25+ SYSCILIA core ciliary genes
- ScoringWeights.validate_sum() enforces weight sum constraint
- Integration SQL uses LEFT JOINs preserving NULLs and COALESCE for weighted scoring
- No evidence layer with NULL score contributes to composite (weighted_sum / available_weight pattern)
</verification>

<success_criteria>
- compile_known_genes() returns polars DataFrame with >= 38 rows of known cilia/Usher genes
- compute_composite_scores() produces composite_score using weighted average of available evidence
- Genes with 0 evidence layers get composite_score = NULL (not 0)
- ScoringWeights with defaults passes validate_sum(); invalid weights raise ValueError
- All functions importable from usher_pipeline.scoring
</success_criteria>

<output>
After completion, create `.planning/phases/04-scoring-integration/04-01-SUMMARY.md`
</output>
244
.planning/phases/04-scoring-integration/04-02-PLAN.md
Normal file
@@ -0,0 +1,244 @@
---
phase: 04-scoring-integration
plan: 02
type: execute
wave: 2
depends_on: ["04-01"]
files_modified:
  - src/usher_pipeline/scoring/quality_control.py
  - src/usher_pipeline/scoring/validation.py
  - src/usher_pipeline/scoring/__init__.py
autonomous: true

must_haves:
  truths:
    - "Quality control detects missing data rates per evidence layer and flags layers above threshold"
    - "Score distribution anomalies (no variation, out-of-range values) are detected per layer"
    - "Outlier genes are identified using MAD-based robust detection per evidence layer"
    - "Known cilia/Usher genes rank in top quartile of composite scores before exclusion"
  artifacts:
    - path: "src/usher_pipeline/scoring/quality_control.py"
      provides: "QC checks for missing data, distributions, outliers"
      contains: "median_absolute_deviation"
    - path: "src/usher_pipeline/scoring/validation.py"
      provides: "Positive control validation against known gene ranking"
      contains: "PERCENT_RANK"
  key_links:
    - from: "src/usher_pipeline/scoring/quality_control.py"
      to: "DuckDB scored_genes table"
      via: "store.conn.execute SQL queries"
      pattern: "scored_genes"
    - from: "src/usher_pipeline/scoring/validation.py"
      to: "src/usher_pipeline/scoring/known_genes.py"
      via: "compile_known_genes import"
      pattern: "compile_known_genes"
    - from: "src/usher_pipeline/scoring/validation.py"
      to: "DuckDB scored_genes table"
      via: "PERCENT_RANK window function"
      pattern: "PERCENT_RANK.*ORDER BY.*composite_score"
---

<objective>
Implement quality control checks and positive control validation for the scoring system.

Purpose: QC catches upstream failures (high missing data rates, distribution anomalies, normalization bugs, outliers) before results are trusted. Positive control validation confirms known cilia/Usher genes rank highly, proving the scoring logic works correctly. Both are essential for scoring system credibility.

Output: `src/usher_pipeline/scoring/quality_control.py` and `src/usher_pipeline/scoring/validation.py`.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-scoring-integration/04-RESEARCH.md
@.planning/phases/04-scoring-integration/04-01-SUMMARY.md
@src/usher_pipeline/scoring/integration.py
@src/usher_pipeline/scoring/known_genes.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>

<tasks>

<task type="auto">
<name>Task 1: Quality control checks for scoring results</name>
<files>
src/usher_pipeline/scoring/quality_control.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/quality_control.py`:

Import: numpy, polars, structlog, PipelineStore from persistence. Import scipy.stats for MAD-based outlier detection (add `scipy>=1.14` to project dependencies if not present -- check pyproject.toml first).

Define constants:
- `MISSING_RATE_WARN = 0.5` (50% missing = warning)
- `MISSING_RATE_ERROR = 0.8` (80% missing = error)
- `MIN_STD_THRESHOLD = 0.01` (std < 0.01 = no variation warning)
- `OUTLIER_MAD_THRESHOLD = 3.0` (>3 MAD from median = outlier)
- `EVIDENCE_LAYERS` = list of 6 layer names: ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]
- `SCORE_COLUMNS` = dict mapping layer name to column name: {"gnomad": "gnomad_score", "expression": "expression_score", "annotation": "annotation_score", "localization": "localization_score", "animal_model": "animal_model_score", "literature": "literature_score"}

Create function `compute_missing_data_rates(store: PipelineStore) -> dict`:
- Query `scored_genes` table to compute fraction of NULL values per score column.
- Use a single SQL query: `SELECT COUNT(*) AS total, AVG(CASE WHEN col IS NULL THEN 1.0 ELSE 0.0 END) AS col_missing ...` for all 6 columns.
- Return dict with keys: "rates" (dict[str, float] per layer), "warnings" (list of str), "errors" (list of str).
- Classify: rate > MISSING_RATE_ERROR -> add to errors; rate > MISSING_RATE_WARN -> add to warnings.
- Log each layer's rate with appropriate level (error/warning/info).
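The classification step above can be sketched as a small pure function over already-queried rates. The message wording here is illustrative, not the required log format:

```python
MISSING_RATE_WARN = 0.5   # >50% missing -> warning
MISSING_RATE_ERROR = 0.8  # >80% missing -> error

def classify_missing_rates(rates: dict[str, float]) -> dict:
    """Bucket per-layer NULL rates into warnings and errors."""
    warnings, errors = [], []
    for layer, rate in rates.items():
        if rate > MISSING_RATE_ERROR:
            errors.append(f"{layer}: {rate:.0%} missing exceeds error threshold")
        elif rate > MISSING_RATE_WARN:
            warnings.append(f"{layer}: {rate:.0%} missing exceeds warning threshold")
    return {"rates": rates, "warnings": warnings, "errors": errors}
```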

Create function `compute_distribution_stats(store: PipelineStore) -> dict`:
- For each evidence layer, query non-NULL scores from `scored_genes`.
- Compute: mean, median, std, min, max using numpy on the extracted array.
- Detect anomalies:
  - std < MIN_STD_THRESHOLD -> warning "no variation"
  - min < 0.0 or max > 1.0 -> error "out of range"
- Return dict with keys: "distributions" (dict per layer with stats), "warnings", "errors".

Create function `detect_outliers(store: PipelineStore) -> dict`:
- For each evidence layer, query gene_symbol + score from `scored_genes` where score IS NOT NULL.
- Compute MAD: `median(|scores - median(scores)|)`.
- If MAD == 0, skip (no variation). Otherwise, flag genes where `|score - median| > OUTLIER_MAD_THRESHOLD * MAD`.
- Return dict with keys per layer: "count" (int), "example_genes" (list of up to 5 gene symbols).
- Log outlier counts per layer.

Create function `run_qc_checks(store: PipelineStore) -> dict`:
- Orchestrates all three checks above.
- Returns combined dict with keys: "missing_data", "distributions", "outliers", "composite_stats", "warnings" (combined list), "errors" (combined list), "passed" (bool: True if no errors).
- Also compute composite score distribution: mean, median, std, percentiles (p10, p25, p50, p75, p90) from `scored_genes` WHERE composite_score IS NOT NULL.
- Log final QC summary: total warnings, total errors, pass/fail.

Note: Use numpy for statistics (scipy.stats.median_abs_deviation can be used for MAD, or compute manually with np.median(np.abs(x - np.median(x)))). Keep scipy import conditional or direct -- either approach is fine as long as outlier detection works.
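The MAD rule in `detect_outliers` reduces to a few lines; here is a stdlib-only sketch (the real implementation would use numpy/scipy per the note above):

```python
from statistics import median

OUTLIER_MAD_THRESHOLD = 3.0

def mad_outliers(scores: dict[str, float]) -> list[str]:
    """Flag gene symbols whose score lies > 3 MAD from the layer median."""
    values = list(scores.values())
    med = median(values)
    # MAD = median absolute deviation from the median (robust to outliers)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # no variation in this layer: skip it
        return []
    return [gene for gene, v in scores.items()
            if abs(v - med) > OUTLIER_MAD_THRESHOLD * mad]
```

Unlike mean/std-based z-scores, the median and MAD are barely moved by the outlier itself, which is why the robust version is preferred here.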
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.quality_control import (
    compute_missing_data_rates,
    compute_distribution_stats,
    detect_outliers,
    run_qc_checks,
    MISSING_RATE_WARN,
    MISSING_RATE_ERROR,
    OUTLIER_MAD_THRESHOLD,
    EVIDENCE_LAYERS,
    SCORE_COLUMNS,
)
import inspect
print('Constants:')
print(f'  MISSING_RATE_WARN={MISSING_RATE_WARN}')
print(f'  MISSING_RATE_ERROR={MISSING_RATE_ERROR}')
print(f'  OUTLIER_MAD_THRESHOLD={OUTLIER_MAD_THRESHOLD}')
print(f'  EVIDENCE_LAYERS={EVIDENCE_LAYERS}')
print(f'  SCORE_COLUMNS has {len(SCORE_COLUMNS)} entries')
assert len(EVIDENCE_LAYERS) == 6
assert len(SCORE_COLUMNS) == 6
# Verify function signatures
for fn in [compute_missing_data_rates, compute_distribution_stats, detect_outliers, run_qc_checks]:
    sig = inspect.signature(fn)
    assert 'store' in sig.parameters, f'{fn.__name__} missing store param'
print('All QC functions verified')
"`
</verify>
<done>
- compute_missing_data_rates() queries NULL rates per layer from scored_genes with warning/error classification
- compute_distribution_stats() computes mean/median/std/min/max per layer and detects anomalies (no variation, out of range)
- detect_outliers() uses MAD-based detection (>3 MAD from median) per layer
- run_qc_checks() orchestrates all checks and returns combined results with pass/fail status
</done>
</task>

<task type="auto">
<name>Task 2: Positive control validation against known gene rankings</name>
<files>
src/usher_pipeline/scoring/validation.py
src/usher_pipeline/scoring/__init__.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/validation.py`:

Import: duckdb, polars, structlog, PipelineStore, compile_known_genes from scoring.known_genes.

Create function `validate_known_gene_ranking(store: PipelineStore, percentile_threshold: float = 0.75) -> dict`:
- Call `compile_known_genes()` to get known gene DataFrame.
- Register the known genes DataFrame as a temporary DuckDB table: `store.conn.execute("CREATE OR REPLACE TEMP TABLE _known_genes AS SELECT * FROM known_df")` (where known_df is the polars DataFrame variable).
- Execute a DuckDB SQL query that:
  a. Computes PERCENT_RANK() OVER (ORDER BY composite_score) for ALL genes in scored_genes (not just known genes).
  b. INNER JOINs with `_known_genes` on gene_symbol to get percentile ranks for known genes only.
  c. Filters WHERE composite_score IS NOT NULL.
  d. Returns gene_symbol, composite_score, source, percentile_rank.
- Compute validation metrics from the result:
  - `total_known_in_dataset`: count of known genes found in scored_genes
  - `total_known_expected`: count of unique gene_symbols in known gene list
  - `median_percentile`: median of percentile_rank values
  - `top_quartile_count`: count where percentile_rank >= 0.75
  - `top_quartile_fraction`: top_quartile_count / total_known_in_dataset
  - `validation_passed`: median_percentile >= percentile_threshold
  - `known_gene_details`: list of dicts with gene_symbol, composite_score, percentile_rank, source (sorted by percentile_rank descending, limited to top 20)
- If no known genes found in scored_genes, return validation_passed=False with reason="no_known_genes_found".
- Log validation results: passed/failed, median_percentile, top_quartile stats.
- Drop the temp table after query: `store.conn.execute("DROP TABLE IF EXISTS _known_genes")`.
- Return the metrics dict.

Create function `generate_validation_report(metrics: dict) -> str`:
- Produce a human-readable text report summarizing validation results.
- Include: pass/fail status, median percentile, top quartile count/fraction, table of top-ranked known genes.
- Format percentiles as percentages (e.g., "87.3%").
- Return the report as a string.

2. Update `src/usher_pipeline/scoring/__init__.py` to also export: `run_qc_checks`, `validate_known_gene_ranking`, `generate_validation_report`.
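The ranking query in steps (a)-(b) can be demonstrated on synthetic data. This sketch uses sqlite3 as a stand-in for DuckDB (both support PERCENT_RANK, where rank runs from 0.0 for the lowest score to 1.0 for the highest); the key point is that the window function runs over ALL genes before the join restricts to known genes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scored_genes (gene_symbol TEXT, composite_score REAL);
    INSERT INTO scored_genes VALUES
        ('MYO7A', 0.9), ('GENEA', 0.2), ('GENEB', 0.4),
        ('GENEC', 0.6), ('GENED', 0.8);
    CREATE TABLE _known_genes (gene_symbol TEXT);
    INSERT INTO _known_genes VALUES ('MYO7A');
""")
rows = conn.execute("""
    WITH ranked AS (
        -- rank computed over the full gene universe, not just known genes
        SELECT gene_symbol,
               PERCENT_RANK() OVER (ORDER BY composite_score) AS percentile_rank
        FROM scored_genes
        WHERE composite_score IS NOT NULL
    )
    SELECT r.gene_symbol, r.percentile_rank
    FROM ranked r
    INNER JOIN _known_genes k ON r.gene_symbol = k.gene_symbol
""").fetchall()
```

Joining before ranking would instead rank known genes only against each other, which is exactly the bug the "before exclusion" truth guards against.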
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.validation import validate_known_gene_ranking, generate_validation_report
from usher_pipeline.scoring import (
    compile_known_genes,
    load_known_genes_to_duckdb,
    join_evidence_layers,
    compute_composite_scores,
    persist_scored_genes,
    run_qc_checks,
    validate_known_gene_ranking,
    generate_validation_report,
)
import inspect
sig = inspect.signature(validate_known_gene_ranking)
assert 'store' in sig.parameters
assert 'percentile_threshold' in sig.parameters
src = inspect.getsource(validate_known_gene_ranking)
assert 'PERCENT_RANK' in src, 'Missing PERCENT_RANK window function'
assert 'compile_known_genes' in src, 'Missing known genes integration'
print('Validation function verified')
print('All scoring module exports verified')
"`
</verify>
<done>
- validate_known_gene_ranking() computes percentile ranks of known genes in scored_genes using PERCENT_RANK window function
- Returns metrics dict with median_percentile, top_quartile_count/fraction, validation_passed boolean
- generate_validation_report() produces human-readable text summary
- All scoring module functions exported from __init__.py
</done>
</task>

</tasks>

<verification>
- Quality control detects missing data rates and classifies as warning (>50%) or error (>80%)
- Distribution stats computed per layer with anomaly detection (no variation, out-of-range)
- MAD-based outlier detection flags genes >3 MAD from median per layer
- Known gene validation uses PERCENT_RANK before exclusion (not after filtering)
- All functions importable from usher_pipeline.scoring
</verification>

<success_criteria>
- run_qc_checks() returns structured dict with missing_data, distributions, outliers, composite_stats, pass/fail
- validate_known_gene_ranking() computes percentile ranks for known cilia/Usher genes
- Validation checks median percentile against threshold (default 0.75 = top quartile)
- generate_validation_report() produces readable text output
- scipy used for MAD computation (or equivalent numpy manual calculation)
</success_criteria>

<output>
After completion, create `.planning/phases/04-scoring-integration/04-02-SUMMARY.md`
</output>
254
.planning/phases/04-scoring-integration/04-03-PLAN.md
Normal file
@@ -0,0 +1,254 @@
---
phase: 04-scoring-integration
plan: 03
type: execute
wave: 3
depends_on: ["04-01", "04-02"]
files_modified:
  - src/usher_pipeline/cli/score_cmd.py
  - src/usher_pipeline/cli/main.py
  - tests/test_scoring.py
  - tests/test_scoring_integration.py
autonomous: true

must_haves:
  truths:
    - "CLI 'usher-pipeline score' command orchestrates full scoring pipeline with checkpoint-restart"
    - "Scoring pipeline can be run end-to-end on synthetic test data"
    - "Unit tests verify NULL preservation, weight validation, and known gene compilation"
    - "Integration test verifies full scoring pipeline with synthetic evidence data"
  artifacts:
    - path: "src/usher_pipeline/cli/score_cmd.py"
      provides: "CLI command for scoring pipeline orchestration"
      contains: "click.command"
    - path: "tests/test_scoring.py"
      provides: "Unit tests for scoring module"
      contains: "test_compile_known_genes"
    - path: "tests/test_scoring_integration.py"
      provides: "Integration tests for full scoring pipeline"
      contains: "test_scoring_pipeline"
  key_links:
    - from: "src/usher_pipeline/cli/score_cmd.py"
      to: "src/usher_pipeline/scoring/"
      via: "imports integration, known_genes, quality_control, validation"
      pattern: "from usher_pipeline.scoring import"
    - from: "src/usher_pipeline/cli/main.py"
      to: "src/usher_pipeline/cli/score_cmd.py"
      via: "cli.add_command(score)"
      pattern: "add_command.*score"
    - from: "tests/test_scoring_integration.py"
      to: "src/usher_pipeline/scoring/integration.py"
      via: "synthetic DuckDB data -> compute_composite_scores"
      pattern: "compute_composite_scores"
---

<objective>
Create CLI score command and comprehensive tests for the scoring module.

Purpose: The CLI command provides the user-facing interface for running the scoring pipeline (integrating all evidence, running QC, validating against known genes). Tests ensure correctness of NULL handling, weight validation, and end-to-end scoring with synthetic data.

Output: `src/usher_pipeline/cli/score_cmd.py`, `tests/test_scoring.py`, `tests/test_scoring_integration.py`.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-scoring-integration/04-RESEARCH.md
@.planning/phases/04-scoring-integration/04-01-SUMMARY.md
@.planning/phases/04-scoring-integration/04-02-SUMMARY.md
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/cli/main.py
@src/usher_pipeline/scoring/__init__.py
</context>

<tasks>

<task type="auto">
<name>Task 1: CLI score command with checkpoint-restart</name>
<files>
src/usher_pipeline/cli/score_cmd.py
src/usher_pipeline/cli/main.py
</files>
<action>
1. Create `src/usher_pipeline/cli/score_cmd.py` following the established pattern from `evidence_cmd.py`:

Import: click, structlog, sys, Path, load_config, PipelineStore, ProvenanceTracker, and from scoring module: load_known_genes_to_duckdb, compute_composite_scores, persist_scored_genes, run_qc_checks, validate_known_gene_ranking, generate_validation_report. Import ScoringWeights from config.schema.

Create Click command `score` (not a group -- single command):
- Options:
  - `--force` (is_flag): Re-run scoring even if scored_genes checkpoint exists
  - `--skip-qc` (is_flag): Skip quality control checks (for faster iteration)
  - `--skip-validation` (is_flag): Skip known gene validation
- Uses `@click.pass_context` to get config_path from `ctx.obj['config_path']`

Implementation flow (follows evidence_cmd.py pattern):
a. Load config, initialize store and provenance
b. Check checkpoint: `store.has_checkpoint('scored_genes')` -- if exists and not --force, show summary and return
c. Load and validate scoring weights: `config.scoring`, call `validate_sum()`
d. Step 1 - Load known genes: call `load_known_genes_to_duckdb(store)`, display count
e. Step 2 - Compute composite scores: call `compute_composite_scores(store, config.scoring)`, display summary (total genes, mean score, quality flag distribution)
f. Step 3 - Persist scores: call `persist_scored_genes(store, scored_df, config.scoring)`
g. Step 4 (unless --skip-qc) - Run QC: call `run_qc_checks(store)`, display warnings/errors, log missing data rates
h. Step 5 (unless --skip-validation) - Validate: call `validate_known_gene_ranking(store)`, display results with `generate_validation_report()`
i. Save provenance sidecar to `data_dir/scoring/scoring.provenance.json`
j. Display final summary: total scored genes, mean composite score, quality flag counts, QC pass/fail, validation pass/fail

Use Click styling consistent with evidence_cmd.py: `click.style("=== Title ===", bold=True)`, green for success, yellow for warnings, red for errors.

Error handling: wrap each step in try/except, display error with click.style(fg='red'), sys.exit(1). Always close store in finally block.

2. Update `src/usher_pipeline/cli/main.py`:
- Import score command from score_cmd.py
- Add it to the CLI group: `cli.add_command(score)` (same pattern as evidence command)
- The score command should be a top-level command (not nested under evidence), since it's a different pipeline phase
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.cli.score_cmd import score
import click.testing
runner = click.testing.CliRunner()
result = runner.invoke(score, ['--help'])
print(result.output)
assert result.exit_code == 0
assert '--force' in result.output
assert '--skip-qc' in result.output
assert '--skip-validation' in result.output
print('CLI score command --help works')
" && python -c "
from usher_pipeline.cli.main import cli
import click.testing
runner = click.testing.CliRunner()
result = runner.invoke(cli, ['--help'])
print(result.output)
assert 'score' in result.output, 'score command not registered in main CLI'
print('Score command registered in main CLI')
"`
</verify>
<done>
- `usher-pipeline score` command exists with --force, --skip-qc, --skip-validation options
- Score command registered in the main CLI group
- Follows the established pattern: config load -> checkpoint check -> process -> persist -> provenance
- Orchestrates the full pipeline: known genes -> scoring -> QC -> validation
</done>
</task>

<task type="auto">
<name>Task 2: Unit and integration tests for scoring module</name>
<files>
tests/test_scoring.py
tests/test_scoring_integration.py
</files>
<action>
1. Create `tests/test_scoring.py` with unit tests:

Import: pytest, polars, all functions from the scoring module, and ScoringWeights.

Test class or functions:

a. `test_compile_known_genes_returns_expected_structure`:
- Call compile_known_genes()
- Assert it returns a polars DataFrame with columns: gene_symbol, source, confidence
- Assert height >= 38 (10 Usher + 28+ SYSCILIA)
- Assert "MYO7A" is in the gene_symbol values
- Assert "IFT88" is in the gene_symbol values
- Assert all confidence values == "HIGH"
- Assert sources include both "omim_usher" and "syscilia_scgs_v2"

b. `test_compile_known_genes_no_duplicates_within_source`:
- Verify there are no duplicate gene_symbol values within the same source
- (A gene CAN appear in both sources as separate rows)

c. `test_scoring_weights_validate_sum_defaults`:
- ScoringWeights() with defaults should pass validate_sum()

d. `test_scoring_weights_validate_sum_custom_valid`:
- ScoringWeights with custom weights summing to 1.0 should pass

e. `test_scoring_weights_validate_sum_invalid`:
- ScoringWeights(gnomad=0.5) sums to 1.35 -> validate_sum() raises ValueError

f. `test_scoring_weights_validate_sum_close_to_one`:
- Weights that sum to 0.999999 (within 1e-6) should pass
- Weights that sum to 0.99 should fail

g. `test_null_preservation_in_composite`:
- Create a synthetic PipelineStore (in-memory DuckDB: `duckdb.connect(':memory:')`)
- Create a minimal gene_universe table with 3 genes
- Create a gnomad_constraint table with scores for genes 1 and 2 (gene 3 has no entry)
- Create annotation_completeness with a score for gene 1 only
- Create empty or partial entries for the other evidence tables (create them with no rows or only partial rows)
- Call join_evidence_layers and verify gene 3 has a NULL gnomad_score and a NULL annotation_score
- Call compute_composite_scores and verify that gene 3, with zero evidence layers, has composite_score = NULL
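The weight-validation behavior exercised by tests c-f can be sketched as below. This is an illustrative sketch, not the pipeline's actual class: the field names mirror the six evidence layers, the real class may be a pydantic model rather than a dataclass, and all defaults are assumptions except `gnomad=0.15`, which the 0.5 -> 1.35 case in test e implies.

```python
import math
from dataclasses import dataclass, fields


@dataclass
class ScoringWeights:
    # Illustrative defaults summing to 1.0; only gnomad=0.15 is implied
    # by the plan's 0.5 -> 1.35 test case.
    gnomad: float = 0.15
    tissue_expression: float = 0.20
    annotation: float = 0.15
    localization: float = 0.15
    animal_model: float = 0.15
    literature: float = 0.20

    def validate_sum(self, tol: float = 1e-6) -> None:
        """Raise ValueError unless the weights sum to 1.0 within tol."""
        total = sum(getattr(self, f.name) for f in fields(self))
        if not math.isclose(total, 1.0, abs_tol=tol):
            raise ValueError(f"Scoring weights sum to {total:.6f}, expected 1.0")
```

The 1e-6 absolute tolerance accepts 0.999999 while rejecting 0.99, matching test f.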
2. Create `tests/test_scoring_integration.py` with integration tests:

a. `test_scoring_pipeline_end_to_end`:
- Create an in-memory PipelineStore (wrap duckdb.connect(':memory:') in a PipelineStore-like interface, OR create a tmp file with pytest's tmp_path)
- Create synthetic data for all 7 tables (gene_universe + 6 evidence):
  - gene_universe: 20 genes (gene_001 through gene_020) with gene_symbols
  - Include some known genes (MYO7A, IFT88, CDH23) in the gene universe as genes 18-20
  - gnomad_constraint: 15 genes with loeuf_normalized scores, 5 NULL
  - tissue_expression: 12 genes with expression_score_normalized, 8 NULL
  - annotation_completeness: 18 genes with annotation_score_normalized
  - subcellular_localization: 10 genes with localization_score_normalized
  - animal_model_phenotypes: 8 genes with animal_model_score_normalized
  - literature_evidence: 14 genes with literature_score_normalized
- Give the known genes (MYO7A, IFT88, CDH23) HIGH scores in multiple layers (0.8-0.95) to ensure they rank highly
- Run compute_composite_scores with the default ScoringWeights
- Assert: all 20 genes are present in the result
- Assert: composite_score is not NULL for genes with at least 1 evidence layer
- Assert: evidence_count values are correct (count of non-NULL scores)
- Assert: quality_flag values are correct based on evidence_count
- Assert: known genes (MYO7A, IFT88, CDH23) have high composite scores (among the top 5)

b. `test_qc_detects_missing_data`:
- Create a scored_genes table where one layer is 90% NULL
- Run run_qc_checks
- Assert that the layer appears in errors (>80% missing)

c. `test_validation_passes_with_known_genes_ranked_highly`:
- Use the scored_genes from the end-to-end test (known genes scored highly)
- Run validate_known_gene_ranking
- Assert validation_passed is True

Use the `tmp_path` fixture for DuckDB file-based stores. Use `PipelineStore(tmp_path / "test.duckdb")` for store creation. Follow existing test patterns from tests/test_gnomad_integration.py.
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_scoring.py tests/test_scoring_integration.py -v --tb=short 2>&1 | tail -30`
</verify>
<done>
- test_scoring.py: 7+ unit tests covering known genes, weight validation, and NULL preservation
- test_scoring_integration.py: 3+ integration tests covering the end-to-end pipeline with synthetic data
- All tests pass with `pytest tests/test_scoring.py tests/test_scoring_integration.py`
- Tests verify NULL preservation (genes with no evidence get a NULL composite score)
- Tests verify that known genes rank highly when given high scores
</done>
</task>

</tasks>

<verification>
- `usher-pipeline score --help` shows the available options
- Score command registered in the main CLI
- Unit tests pass: known genes, weight validation, NULL handling
- Integration tests pass: end-to-end scoring with synthetic data, QC detection, validation
- All tests runnable with `pytest tests/test_scoring*.py`
</verification>

<success_criteria>
- CLI score command orchestrates: known genes -> composite scoring -> QC -> validation
- Checkpoint-restart: skips if the scored_genes table exists (unless --force)
- pytest tests/test_scoring.py passes all unit tests
- pytest tests/test_scoring_integration.py passes all integration tests
- Tests use synthetic data (no external API calls; fast and reproducible)
</success_criteria>

<output>
After completion, create `.planning/phases/04-scoring-integration/04-03-SUMMARY.md`
</output>