docs(04): create phase plan for scoring and integration

This commit is contained in:
2026-02-11 20:31:55 +08:00
parent 32988c631f
commit a52724aff4
4 changed files with 717 additions and 3 deletions


@@ -85,9 +85,12 @@ Plans:
3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers
4. Known cilia/Usher genes rank highly before exclusion, validating that the scoring system works
5. Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
**Plans**: 3 plans
Plans:
- [ ] 04-01-PLAN.md -- Known gene compilation, weight validation, and multi-evidence scoring integration
- [ ] 04-02-PLAN.md -- Quality control checks and positive control validation
- [ ] 04-03-PLAN.md -- CLI score command and unit/integration tests
### Phase 5: Output & CLI
**Goal**: User-facing interface and structured tiered output
@@ -127,6 +130,6 @@ Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6
| 1. Data Infrastructure | 4/4 | ✓ Complete | 2026-02-11 |
| 2. Prototype Evidence Layer | 2/2 | ✓ Complete | 2026-02-11 |
| 3. Core Evidence Layers | 6/6 | ✓ Complete | 2026-02-11 |
| 4. Scoring & Integration | 0/3 | In progress | - |
| 5. Output & CLI | 0/TBD | Not started | - |
| 6. Validation | 0/TBD | Not started | - |


@@ -0,0 +1,213 @@
---
phase: 04-scoring-integration
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/scoring/__init__.py
- src/usher_pipeline/scoring/known_genes.py
- src/usher_pipeline/scoring/integration.py
- src/usher_pipeline/config/schema.py
autonomous: true
must_haves:
truths:
- "Known cilia/Usher genes from SYSCILIA and OMIM are compiled into a reusable gene set"
- "ScoringWeights validates that all weights sum to 1.0 and rejects invalid configs"
- "Multi-evidence scoring joins all 6 evidence tables and computes weighted average of available evidence only"
- "Genes with missing evidence layers receive NULL (not zero) for those layers"
artifacts:
- path: "src/usher_pipeline/scoring/__init__.py"
provides: "Scoring module package"
exports: ["compile_known_genes", "compute_composite_scores", "join_evidence_layers"]
- path: "src/usher_pipeline/scoring/known_genes.py"
provides: "Known cilia/Usher gene compilation"
contains: "OMIM_USHER_GENES"
- path: "src/usher_pipeline/scoring/integration.py"
provides: "Multi-evidence weighted scoring with NULL preservation"
contains: "COALESCE"
- path: "src/usher_pipeline/config/schema.py"
provides: "ScoringWeights with validate_sum method"
contains: "validate_sum"
key_links:
- from: "src/usher_pipeline/scoring/integration.py"
to: "DuckDB evidence tables"
via: "LEFT JOIN on gene_id"
pattern: "LEFT JOIN.*ON.*gene_id"
- from: "src/usher_pipeline/scoring/integration.py"
to: "src/usher_pipeline/config/schema.py"
via: "ScoringWeights parameter"
pattern: "ScoringWeights"
---
<objective>
Compile known cilia/Usher gene set and implement multi-evidence weighted scoring integration.
Purpose: Establishes the foundation for Phase 4 -- the known gene list (for exclusion and positive control validation) and the core scoring engine that joins all 6 evidence tables with configurable weights and NULL-preserving weighted averages.
Output: `src/usher_pipeline/scoring/` module with known_genes.py and integration.py; updated config/schema.py with weight sum validation.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-scoring-integration/04-RESEARCH.md
@src/usher_pipeline/config/schema.py
@src/usher_pipeline/persistence/duckdb_store.py
@src/usher_pipeline/evidence/gnomad/load.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Known gene compilation and ScoringWeights validation</name>
<files>
src/usher_pipeline/scoring/__init__.py
src/usher_pipeline/scoring/known_genes.py
src/usher_pipeline/config/schema.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/__init__.py` with exports for the module.
2. Create `src/usher_pipeline/scoring/known_genes.py`:
- Define `OMIM_USHER_GENES` as a frozenset of 10 known Usher syndrome gene symbols: MYO7A, USH1C, CDH23, PCDH15, USH1G (SANS), CIB2, USH2A, ADGRV1 (GPR98), WHRN, CLRN1. Include a brief docstring noting these are OMIM Usher syndrome entries.
- Define `SYSCILIA_SCGS_V2_CORE` as a frozenset of well-known ciliary genes that serve as positive controls. Include at minimum: IFT88, IFT140, IFT172, BBS1, BBS2, BBS4, BBS5, BBS7, BBS9, BBS10, RPGRIP1L, CEP290, ARL13B, INPP5E, TMEM67, CC2D2A, NPHP1, NPHP3, NPHP4, RPGR, CEP164, OFD1, MKS1, TCTN1, TCTN2, TMEM216, TMEM231, TMEM138. This is a curated subset (~30 genes) of the full SCGS v2 list (686 genes). Add a docstring noting the full list can be downloaded from the SCGS v2 publication supplementary data (DOI: 10.1091/mbc.E21-05-0226) and loaded via a future `fetch_scgs_v2()` function.
- Create function `compile_known_genes() -> pl.DataFrame` that returns a polars DataFrame with columns: `gene_symbol` (str), `source` (str: "omim_usher" or "syscilia_scgs_v2"), `confidence` (str: "HIGH"). Combines both gene sets. De-duplicates on gene_symbol (if any gene appears in both lists, keep both source entries as separate rows).
- Create function `load_known_genes_to_duckdb(store: PipelineStore) -> int` that calls `compile_known_genes()`, saves to DuckDB table `known_cilia_genes` using `store.save_dataframe()`, and returns the count of unique gene symbols.
3. Update `src/usher_pipeline/config/schema.py`:
- Add a `validate_sum(self) -> None` method to the `ScoringWeights` class that sums all 6 weight fields and raises `ValueError` if the absolute difference from 1.0 exceeds 1e-6. Message format: `f"Scoring weights must sum to 1.0, got {total:.6f}"`.
- Do NOT change any existing field defaults or field definitions -- only add the method.
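As an illustration of the `validate_sum` contract above, here is a minimal sketch assuming `ScoringWeights` is a plain dataclass with hypothetical default weights (the real class lives in `config/schema.py`, may be a pydantic model, and defines its own defaults):

```python
from dataclasses import dataclass, fields

@dataclass
class ScoringWeights:
    # Hypothetical defaults for illustration only; the real schema defines its own.
    gnomad: float = 0.15
    expression: float = 0.20
    annotation: float = 0.15
    localization: float = 0.15
    animal_model: float = 0.20
    literature: float = 0.15

    def validate_sum(self) -> None:
        # Sum all six weight fields; tolerate floating-point error up to 1e-6.
        total = sum(getattr(self, f.name) for f in fields(self))
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"Scoring weights must sum to 1.0, got {total:.6f}")
```

With these illustrative defaults, `ScoringWeights(gnomad=0.5)` sums to 1.35 and is rejected, matching the behavior the verify step checks.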
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.known_genes import compile_known_genes, OMIM_USHER_GENES, SYSCILIA_SCGS_V2_CORE
from usher_pipeline.config.schema import ScoringWeights
df = compile_known_genes()
print(f'Known genes: {df.height} rows, {df[\"gene_symbol\"].n_unique()} unique symbols')
assert df.height >= 38, f'Expected >= 38 rows, got {df.height}'
assert 'MYO7A' in df['gene_symbol'].to_list()
assert 'IFT88' in df['gene_symbol'].to_list()
w = ScoringWeights()
w.validate_sum() # Should pass with defaults
print('ScoringWeights.validate_sum() passed with defaults')
try:
w2 = ScoringWeights(gnomad=0.5)
w2.validate_sum()
print('ERROR: Should have raised ValueError')
except ValueError as e:
print(f'Correctly rejected invalid weights: {e}')
print('All checks passed')
"`
</verify>
<done>
- OMIM_USHER_GENES contains exactly 10 known Usher syndrome genes
- SYSCILIA_SCGS_V2_CORE contains >= 25 core ciliary genes
- compile_known_genes() returns DataFrame with gene_symbol, source, confidence columns
- ScoringWeights.validate_sum() passes with defaults, raises ValueError when weights do not sum to 1.0
</done>
</task>
<task type="auto">
<name>Task 2: Multi-evidence weighted scoring integration</name>
<files>
src/usher_pipeline/scoring/integration.py
src/usher_pipeline/scoring/__init__.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/integration.py`:
Import: duckdb, polars, structlog, ScoringWeights from config.schema, PipelineStore from persistence.
Create function `join_evidence_layers(store: PipelineStore) -> pl.DataFrame`:
- Execute a DuckDB SQL query using `store.conn` (direct DuckDB connection) that LEFT JOINs `gene_universe` with all 6 evidence tables on `gene_id`:
- `gnomad_constraint` -> `loeuf_normalized` AS `gnomad_score`
- `tissue_expression` -> `expression_score_normalized` AS `expression_score`
- `annotation_completeness` -> `annotation_score_normalized` AS `annotation_score`
- `subcellular_localization` -> `localization_score_normalized` AS `localization_score`
- `animal_model_phenotypes` -> `animal_model_score_normalized` AS `animal_model_score`
- `literature_evidence` -> `literature_score_normalized` AS `literature_score`
- Compute `evidence_count` as the count of non-NULL scores (sum of CASE WHEN ... IS NOT NULL THEN 1 ELSE 0 END for all 6 layers).
- Select `gene_id`, `gene_symbol` from `gene_universe`, plus all 6 aliased scores and `evidence_count`.
- Return result as polars DataFrame via `.pl()`.
- Log the total gene count, mean evidence_count, and per-layer NULL rates using structlog.
Create function `compute_composite_scores(store: PipelineStore, weights: ScoringWeights) -> pl.DataFrame`:
- Call `weights.validate_sum()` first to assert valid weights.
- Execute a DuckDB SQL query that:
a. Uses the same join as `join_evidence_layers` (or call it as a CTE / subquery).
b. Computes `available_weight` = sum of weights for non-NULL layers (using CASE WHEN ... IS NOT NULL THEN weight_value ELSE 0 END for each layer).
c. Computes `weighted_sum` = sum of COALESCE(score * weight, 0) for each layer.
d. Computes `composite_score` = CASE WHEN available_weight > 0 THEN weighted_sum / available_weight ELSE NULL END.
e. Computes `quality_flag`:
- `evidence_count >= 4` -> 'sufficient_evidence'
- `evidence_count >= 2` -> 'moderate_evidence'
- `evidence_count >= 1` -> 'sparse_evidence'
- ELSE 'no_evidence'
f. Includes all individual layer scores for explainability.
g. Includes per-layer contribution columns: `gnomad_contribution` = gnomad_score * gnomad_weight (NULL if score is NULL), etc.
h. Orders by composite_score DESC NULLS LAST.
- Return as polars DataFrame.
- Log summary stats: total genes, genes with composite score, mean/median composite score, quality flag distribution.
Create function `persist_scored_genes(store: PipelineStore, scored_df: pl.DataFrame, weights: ScoringWeights) -> None`:
- Save `scored_df` to DuckDB table `scored_genes` via `store.save_dataframe()` with replace=True.
- Description: "Multi-evidence weighted composite scores with per-layer contributions".
- Log the row count and quality flag distribution.
2. Update `src/usher_pipeline/scoring/__init__.py` to export: `compile_known_genes`, `load_known_genes_to_duckdb`, `join_evidence_layers`, `compute_composite_scores`, `persist_scored_genes`.
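The NULL-preserving arithmetic in steps (b)-(d) can be demonstrated outside DuckDB. The sketch below uses the stdlib `sqlite3` as a stand-in (the CASE/COALESCE semantics are the same for this query), with two made-up evidence layers at weight 0.5 each:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE evidence (gene_id TEXT, gnomad_score REAL, expression_score REAL);
    INSERT INTO evidence VALUES
        ('G1', 0.8, 0.6),   -- both layers present
        ('G2', 0.4, NULL),  -- expression layer missing
        ('G3', NULL, NULL); -- no evidence at all
""")
# Weighted average over available evidence only: a NULL layer contributes
# neither to the numerator (weighted_sum) nor the denominator (available_weight).
rows = conn.execute("""
    SELECT gene_id,
           CASE WHEN available_weight > 0
                THEN weighted_sum / available_weight
                ELSE NULL END AS composite_score
    FROM (
        SELECT gene_id,
               COALESCE(gnomad_score * 0.5, 0)
                 + COALESCE(expression_score * 0.5, 0) AS weighted_sum,
               (CASE WHEN gnomad_score IS NOT NULL THEN 0.5 ELSE 0 END)
                 + (CASE WHEN expression_score IS NOT NULL THEN 0.5 ELSE 0 END)
                 AS available_weight
        FROM evidence
    )
""").fetchall()
print(rows)  # G1 ~ 0.7, G2 ~ 0.4 (only gnomad counts), G3 -> None
```

G2 illustrates the key property: its composite is the gnomad score alone (renormalized), not dragged toward zero by the missing layer, and G3 stays NULL rather than 0.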
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.integration import join_evidence_layers, compute_composite_scores, persist_scored_genes
from usher_pipeline.config.schema import ScoringWeights
import inspect
# Verify function signatures exist and have correct params
sig_join = inspect.signature(join_evidence_layers)
assert 'store' in sig_join.parameters
sig_score = inspect.signature(compute_composite_scores)
assert 'store' in sig_score.parameters
assert 'weights' in sig_score.parameters
sig_persist = inspect.signature(persist_scored_genes)
assert 'store' in sig_persist.parameters
print('All function signatures verified')
print('Source contains COALESCE:', 'COALESCE' in inspect.getsource(compute_composite_scores))
print('Source contains LEFT JOIN:', 'LEFT JOIN' in inspect.getsource(join_evidence_layers))
"`
</verify>
<done>
- join_evidence_layers() LEFT JOINs gene_universe with all 6 evidence tables on gene_id, returns DataFrame with gene_id, gene_symbol, 6 score columns, evidence_count
- compute_composite_scores() computes weighted average of available evidence only (weighted_sum / available_weight), with quality_flag and per-layer contributions
- NULL scores are not replaced with zero in the weighted average -- only available evidence contributes
- persist_scored_genes() saves scored_genes table to DuckDB
</done>
</task>
</tasks>
<verification>
- `src/usher_pipeline/scoring/` module exists with `__init__.py`, `known_genes.py`, `integration.py`
- Known gene set includes 10 OMIM Usher genes and 25+ SYSCILIA core ciliary genes
- ScoringWeights.validate_sum() enforces weight sum constraint
- Integration SQL uses LEFT JOINs preserving NULLs and COALESCE for weighted scoring
- No evidence layer with NULL score contributes to composite (weighted_sum / available_weight pattern)
</verification>
<success_criteria>
- compile_known_genes() returns polars DataFrame with >= 38 rows of known cilia/Usher genes
- compute_composite_scores() produces composite_score using weighted average of available evidence
- Genes with 0 evidence layers get composite_score = NULL (not 0)
- ScoringWeights with defaults passes validate_sum(); invalid weights raise ValueError
- All functions importable from usher_pipeline.scoring
</success_criteria>
<output>
After completion, create `.planning/phases/04-scoring-integration/04-01-SUMMARY.md`
</output>



@@ -0,0 +1,244 @@
---
phase: 04-scoring-integration
plan: 02
type: execute
wave: 2
depends_on: ["04-01"]
files_modified:
- src/usher_pipeline/scoring/quality_control.py
- src/usher_pipeline/scoring/validation.py
- src/usher_pipeline/scoring/__init__.py
autonomous: true
must_haves:
truths:
- "Quality control detects missing data rates per evidence layer and flags layers above threshold"
- "Score distribution anomalies (no variation, out-of-range values) are detected per layer"
- "Outlier genes are identified using MAD-based robust detection per evidence layer"
- "Known cilia/Usher genes rank in top quartile of composite scores before exclusion"
artifacts:
- path: "src/usher_pipeline/scoring/quality_control.py"
provides: "QC checks for missing data, distributions, outliers"
contains: "median_absolute_deviation"
- path: "src/usher_pipeline/scoring/validation.py"
provides: "Positive control validation against known gene ranking"
contains: "PERCENT_RANK"
key_links:
- from: "src/usher_pipeline/scoring/quality_control.py"
to: "DuckDB scored_genes table"
via: "store.conn.execute SQL queries"
pattern: "scored_genes"
- from: "src/usher_pipeline/scoring/validation.py"
to: "src/usher_pipeline/scoring/known_genes.py"
via: "compile_known_genes import"
pattern: "compile_known_genes"
- from: "src/usher_pipeline/scoring/validation.py"
to: "DuckDB scored_genes table"
via: "PERCENT_RANK window function"
pattern: "PERCENT_RANK.*ORDER BY.*composite_score"
---
<objective>
Implement quality control checks and positive control validation for the scoring system.
Purpose: QC catches upstream failures (high missing data rates, distribution anomalies, normalization bugs, outliers) before results are trusted. Positive control validation confirms known cilia/Usher genes rank highly, proving the scoring logic works correctly. Both are essential for scoring system credibility.
Output: `src/usher_pipeline/scoring/quality_control.py` and `src/usher_pipeline/scoring/validation.py`.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-scoring-integration/04-RESEARCH.md
@.planning/phases/04-scoring-integration/04-01-SUMMARY.md
@src/usher_pipeline/scoring/integration.py
@src/usher_pipeline/scoring/known_genes.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Quality control checks for scoring results</name>
<files>
src/usher_pipeline/scoring/quality_control.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/quality_control.py`:
Import: numpy, polars, structlog, PipelineStore from persistence. Import scipy.stats for MAD-based outlier detection (add `scipy>=1.14` to project dependencies if not present -- check pyproject.toml first).
Define constants:
- `MISSING_RATE_WARN = 0.5` (50% missing = warning)
- `MISSING_RATE_ERROR = 0.8` (80% missing = error)
- `MIN_STD_THRESHOLD = 0.01` (std < 0.01 = no variation warning)
- `OUTLIER_MAD_THRESHOLD = 3.0` (>3 MAD from median = outlier)
- `EVIDENCE_LAYERS` = list of 6 layer names: ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]
- `SCORE_COLUMNS` = dict mapping layer name to column name: {"gnomad": "gnomad_score", "expression": "expression_score", "annotation": "annotation_score", "localization": "localization_score", "animal_model": "animal_model_score", "literature": "literature_score"}
Create function `compute_missing_data_rates(store: PipelineStore) -> dict`:
- Query `scored_genes` table to compute fraction of NULL values per score column.
- Use a single SQL query: `SELECT COUNT(*) AS total, AVG(CASE WHEN col IS NULL THEN 1.0 ELSE 0.0 END) AS col_missing ...` for all 6 columns.
- Return dict with keys: "rates" (dict[str, float] per layer), "warnings" (list of str), "errors" (list of str).
- Classify: rate > MISSING_RATE_ERROR -> add to errors; rate > MISSING_RATE_WARN -> add to warnings.
- Log each layer's rate with appropriate level (error/warning/info).
Create function `compute_distribution_stats(store: PipelineStore) -> dict`:
- For each evidence layer, query non-NULL scores from `scored_genes`.
- Compute: mean, median, std, min, max using numpy on the extracted array.
- Detect anomalies:
- std < MIN_STD_THRESHOLD -> warning "no variation"
- min < 0.0 or max > 1.0 -> error "out of range"
- Return dict with keys: "distributions" (dict per layer with stats), "warnings", "errors".
Create function `detect_outliers(store: PipelineStore) -> dict`:
- For each evidence layer, query gene_symbol + score from `scored_genes` where score IS NOT NULL.
- Compute MAD: `median(|scores - median(scores)|)`.
- If MAD == 0, skip (no variation). Otherwise, flag genes where `|score - median| > OUTLIER_MAD_THRESHOLD * MAD`.
- Return dict with keys per layer: "count" (int), "example_genes" (list of up to 5 gene symbols).
- Log outlier counts per layer.
Create function `run_qc_checks(store: PipelineStore) -> dict`:
- Orchestrates all three checks above.
- Returns combined dict with keys: "missing_data", "distributions", "outliers", "composite_stats", "warnings" (combined list), "errors" (combined list), "passed" (bool: True if no errors).
- Also compute composite score distribution: mean, median, std, percentiles (p10, p25, p50, p75, p90) from `scored_genes` WHERE composite_score IS NOT NULL.
- Log final QC summary: total warnings, total errors, pass/fail.
Note: Use numpy for statistics (scipy.stats.median_abs_deviation can be used for MAD, or compute manually with np.median(np.abs(x - np.median(x)))). Keep scipy import conditional or direct -- either approach is fine as long as outlier detection works.
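The MAD rule above, sketched in pure Python (the plan allows `scipy.stats.median_abs_deviation` or a manual computation; `statistics.median` is used here to stay dependency-free, and the gene scores are made up):

```python
from statistics import median

OUTLIER_MAD_THRESHOLD = 3.0

def mad_outliers(scores: dict[str, float]) -> list[str]:
    # MAD = median(|x - median(x)|); flag genes beyond 3 MAD from the median.
    values = list(scores.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # no variation in this layer -- skip outlier detection
        return []
    return [g for g, v in scores.items()
            if abs(v - med) > OUTLIER_MAD_THRESHOLD * mad]

print(mad_outliers({"A": 0.50, "B": 0.51, "C": 0.49, "D": 0.50, "E": 0.95}))  # ['E']
```

MAD-based detection is preferred over mean/std here because the median and MAD are themselves robust to the outliers being hunted.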
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.quality_control import (
compute_missing_data_rates,
compute_distribution_stats,
detect_outliers,
run_qc_checks,
MISSING_RATE_WARN,
MISSING_RATE_ERROR,
OUTLIER_MAD_THRESHOLD,
EVIDENCE_LAYERS,
SCORE_COLUMNS,
)
import inspect
print('Constants:')
print(f' MISSING_RATE_WARN={MISSING_RATE_WARN}')
print(f' MISSING_RATE_ERROR={MISSING_RATE_ERROR}')
print(f' OUTLIER_MAD_THRESHOLD={OUTLIER_MAD_THRESHOLD}')
print(f' EVIDENCE_LAYERS={EVIDENCE_LAYERS}')
print(f' SCORE_COLUMNS has {len(SCORE_COLUMNS)} entries')
assert len(EVIDENCE_LAYERS) == 6
assert len(SCORE_COLUMNS) == 6
# Verify function signatures
for fn in [compute_missing_data_rates, compute_distribution_stats, detect_outliers, run_qc_checks]:
sig = inspect.signature(fn)
assert 'store' in sig.parameters, f'{fn.__name__} missing store param'
print('All QC functions verified')
"`
</verify>
<done>
- compute_missing_data_rates() queries NULL rates per layer from scored_genes with warning/error classification
- compute_distribution_stats() computes mean/median/std/min/max per layer and detects anomalies (no variation, out of range)
- detect_outliers() uses MAD-based detection (>3 MAD from median) per layer
- run_qc_checks() orchestrates all checks and returns combined results with pass/fail status
</done>
</task>
<task type="auto">
<name>Task 2: Positive control validation against known gene rankings</name>
<files>
src/usher_pipeline/scoring/validation.py
src/usher_pipeline/scoring/__init__.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/validation.py`:
Import: duckdb, polars, structlog, PipelineStore, compile_known_genes from scoring.known_genes.
Create function `validate_known_gene_ranking(store: PipelineStore, percentile_threshold: float = 0.75) -> dict`:
- Call `compile_known_genes()` to get known gene DataFrame.
- Register the known genes DataFrame as a temporary DuckDB table: `store.conn.execute("CREATE OR REPLACE TEMP TABLE _known_genes AS SELECT * FROM known_df")` (where known_df is the polars DataFrame variable).
- Execute a DuckDB SQL query that:
a. Computes PERCENT_RANK() OVER (ORDER BY composite_score) for ALL genes in scored_genes (not just known genes).
b. INNER JOINs with `_known_genes` on gene_symbol to get percentile ranks for known genes only.
c. Filters WHERE composite_score IS NOT NULL.
d. Returns gene_symbol, composite_score, source, percentile_rank.
- Compute validation metrics from the result:
- `total_known_in_dataset`: count of known genes found in scored_genes
- `total_known_expected`: count of unique gene_symbols in known gene list
- `median_percentile`: median of percentile_rank values
- `top_quartile_count`: count where percentile_rank >= 0.75
- `top_quartile_fraction`: top_quartile_count / total_known_in_dataset
- `validation_passed`: median_percentile >= percentile_threshold
- `known_gene_details`: list of dicts with gene_symbol, composite_score, percentile_rank, source (sorted by percentile_rank descending, limited to top 20)
- If no known genes found in scored_genes, return validation_passed=False with reason="no_known_genes_found".
- Log validation results: passed/failed, median_percentile, top_quartile stats.
- Drop the temp table after query: `store.conn.execute("DROP TABLE IF EXISTS _known_genes")`.
- Return the metrics dict.
Create function `generate_validation_report(metrics: dict) -> str`:
- Produce a human-readable text report summarizing validation results.
- Include: pass/fail status, median percentile, top quartile count/fraction, table of top-ranked known genes.
- Format percentiles as percentages (e.g., "87.3%").
- Return the report as a string.
2. Update `src/usher_pipeline/scoring/__init__.py` to also export: `run_qc_checks`, `validate_known_gene_ranking`, `generate_validation_report`.
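For intuition, `PERCENT_RANK()` assigns `(rank - 1) / (n - 1)` over the ascending score order. A pure-Python sketch of the validation metric (ignoring the tie handling SQL window functions provide; all scores and non-Usher gene names are made up):

```python
from statistics import median

def percent_ranks(scores: dict[str, float]) -> dict[str, float]:
    # PERCENT_RANK semantics: (rank - 1) / (n - 1), ascending by composite score.
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    return {g: i / (n - 1) if n > 1 else 0.0 for i, g in enumerate(ordered)}

# Known genes should cluster near the top of the full ranking.
all_scores = {"MYO7A": 0.92, "USH2A": 0.88, "GENE_X": 0.10,
              "GENE_Y": 0.35, "CDH23": 0.81}
ranks = percent_ranks(all_scores)
known = ["MYO7A", "USH2A", "CDH23"]
median_percentile = median(ranks[g] for g in known)
validation_passed = median_percentile >= 0.75  # default percentile_threshold
```

Note the ranks are computed over ALL scored genes first and only then joined to the known set, exactly as step (a)/(b) above requires.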
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.validation import validate_known_gene_ranking, generate_validation_report
from usher_pipeline.scoring import (
compile_known_genes,
load_known_genes_to_duckdb,
join_evidence_layers,
compute_composite_scores,
persist_scored_genes,
run_qc_checks,
validate_known_gene_ranking,
generate_validation_report,
)
import inspect
sig = inspect.signature(validate_known_gene_ranking)
assert 'store' in sig.parameters
assert 'percentile_threshold' in sig.parameters
src = inspect.getsource(validate_known_gene_ranking)
assert 'PERCENT_RANK' in src, 'Missing PERCENT_RANK window function'
assert 'compile_known_genes' in src, 'Missing known genes integration'
print('Validation function verified')
print('All scoring module exports verified')
"`
</verify>
<done>
- validate_known_gene_ranking() computes percentile ranks of known genes in scored_genes using PERCENT_RANK window function
- Returns metrics dict with median_percentile, top_quartile_count/fraction, validation_passed boolean
- generate_validation_report() produces human-readable text summary
- All scoring module functions exported from __init__.py
</done>
</task>
</tasks>
<verification>
- Quality control detects missing data rates and classifies as warning (>50%) or error (>80%)
- Distribution stats computed per layer with anomaly detection (no variation, out-of-range)
- MAD-based outlier detection flags genes >3 MAD from median per layer
- Known gene validation uses PERCENT_RANK before exclusion (not after filtering)
- All functions importable from usher_pipeline.scoring
</verification>
<success_criteria>
- run_qc_checks() returns structured dict with missing_data, distributions, outliers, composite_stats, pass/fail
- validate_known_gene_ranking() computes percentile ranks for known cilia/Usher genes
- Validation checks median percentile against threshold (default 0.75 = top quartile)
- generate_validation_report() produces readable text output
- scipy used for MAD computation (or equivalent numpy manual calculation)
</success_criteria>
<output>
After completion, create `.planning/phases/04-scoring-integration/04-02-SUMMARY.md`
</output>


@@ -0,0 +1,254 @@
---
phase: 04-scoring-integration
plan: 03
type: execute
wave: 3
depends_on: ["04-01", "04-02"]
files_modified:
- src/usher_pipeline/cli/score_cmd.py
- src/usher_pipeline/cli/main.py
- tests/test_scoring.py
- tests/test_scoring_integration.py
autonomous: true
must_haves:
truths:
- "CLI 'usher-pipeline score' command orchestrates full scoring pipeline with checkpoint-restart"
- "Scoring pipeline can be run end-to-end on synthetic test data"
- "Unit tests verify NULL preservation, weight validation, and known gene compilation"
- "Integration test verifies full scoring pipeline with synthetic evidence data"
artifacts:
- path: "src/usher_pipeline/cli/score_cmd.py"
provides: "CLI command for scoring pipeline orchestration"
contains: "click.command"
- path: "tests/test_scoring.py"
provides: "Unit tests for scoring module"
contains: "test_compile_known_genes"
- path: "tests/test_scoring_integration.py"
provides: "Integration tests for full scoring pipeline"
contains: "test_scoring_pipeline"
key_links:
- from: "src/usher_pipeline/cli/score_cmd.py"
to: "src/usher_pipeline/scoring/"
via: "imports integration, known_genes, quality_control, validation"
pattern: "from usher_pipeline.scoring import"
- from: "src/usher_pipeline/cli/main.py"
to: "src/usher_pipeline/cli/score_cmd.py"
via: "cli.add_command(score)"
pattern: "add_command.*score"
- from: "tests/test_scoring_integration.py"
to: "src/usher_pipeline/scoring/integration.py"
via: "synthetic DuckDB data -> compute_composite_scores"
pattern: "compute_composite_scores"
---
<objective>
Create CLI score command and comprehensive tests for the scoring module.
Purpose: The CLI command provides the user-facing interface for running the scoring pipeline (integrating all evidence, running QC, validating against known genes). Tests ensure correctness of NULL handling, weight validation, and end-to-end scoring with synthetic data.
Output: `src/usher_pipeline/cli/score_cmd.py`, `tests/test_scoring.py`, `tests/test_scoring_integration.py`.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-scoring-integration/04-RESEARCH.md
@.planning/phases/04-scoring-integration/04-01-SUMMARY.md
@.planning/phases/04-scoring-integration/04-02-SUMMARY.md
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/cli/main.py
@src/usher_pipeline/scoring/__init__.py
</context>
<tasks>
<task type="auto">
<name>Task 1: CLI score command with checkpoint-restart</name>
<files>
src/usher_pipeline/cli/score_cmd.py
src/usher_pipeline/cli/main.py
</files>
<action>
1. Create `src/usher_pipeline/cli/score_cmd.py` following the established pattern from `evidence_cmd.py`:
Import: click, structlog, sys, Path, load_config, PipelineStore, ProvenanceTracker, and from scoring module: load_known_genes_to_duckdb, compute_composite_scores, persist_scored_genes, run_qc_checks, validate_known_gene_ranking, generate_validation_report. Import ScoringWeights from config.schema.
Create Click command `score` (not a group -- single command):
- Options:
- `--force` (is_flag): Re-run scoring even if scored_genes checkpoint exists
- `--skip-qc` (is_flag): Skip quality control checks (for faster iteration)
- `--skip-validation` (is_flag): Skip known gene validation
- Uses `@click.pass_context` to get config_path from `ctx.obj['config_path']`
Implementation flow (follows evidence_cmd.py pattern):
a. Load config, initialize store and provenance
b. Check checkpoint: `store.has_checkpoint('scored_genes')` -- if exists and not --force, show summary and return
c. Load and validate scoring weights: `config.scoring`, call `validate_sum()`
d. Step 1 - Load known genes: call `load_known_genes_to_duckdb(store)`, display count
e. Step 2 - Compute composite scores: call `compute_composite_scores(store, config.scoring)`, display summary (total genes, mean score, quality flag distribution)
f. Step 3 - Persist scores: call `persist_scored_genes(store, scored_df, config.scoring)`
g. Step 4 (unless --skip-qc) - Run QC: call `run_qc_checks(store)`, display warnings/errors, log missing data rates
h. Step 5 (unless --skip-validation) - Validate: call `validate_known_gene_ranking(store)`, display results with `generate_validation_report()`
i. Save provenance sidecar to `data_dir/scoring/scoring.provenance.json`
j. Display final summary: total scored genes, mean composite score, quality flag counts, QC pass/fail, validation pass/fail
Use Click styling consistent with evidence_cmd.py: `click.style("=== Title ===", bold=True)`, green for success, yellow for warnings, red for errors.
Error handling: wrap each step in try/except, display error with click.style(fg='red'), sys.exit(1). Always close store in finally block.
2. Update `src/usher_pipeline/cli/main.py`:
- Import score command from score_cmd.py
- Add it to the CLI group: `cli.add_command(score)` (same pattern as evidence command)
- The score command should be a top-level command (not nested under evidence), since it's a different pipeline phase
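The control flow in (a)-(j) reduces to a checkpoint-guarded sequence with a guaranteed store close. A Click-free sketch with stub objects (all names hypothetical; the real command wires in the actual scoring functions):

```python
class StubStore:
    """Stand-in for PipelineStore, just enough to exercise the flow."""
    def __init__(self, has_scores: bool = False):
        self.has_scores = has_scores
        self.closed = False
    def has_checkpoint(self, name: str) -> bool:
        return self.has_scores
    def close(self) -> None:
        self.closed = True

class StubWeights:
    def validate_sum(self) -> None:  # would raise ValueError on bad weights
        pass

def run_score(store, weights, force=False, skip_qc=False, skip_validation=False):
    # Checkpoint guard -> weight validation -> ordered steps; store always closed.
    try:
        if store.has_checkpoint("scored_genes") and not force:
            return ["checkpoint_hit"]  # real command shows a summary and returns
        weights.validate_sum()  # fail fast before any work
        steps = ["known_genes", "composite_scores", "persist"]
        if not skip_qc:
            steps.append("qc")
        if not skip_validation:
            steps.append("validation")
        return steps
    finally:
        store.close()
```

The `finally` block mirrors the plan's requirement that the store is closed on both success and error paths.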
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.cli.score_cmd import score
import click.testing
runner = click.testing.CliRunner()
result = runner.invoke(score, ['--help'])
print(result.output)
assert result.exit_code == 0
assert '--force' in result.output
assert '--skip-qc' in result.output
assert '--skip-validation' in result.output
print('CLI score command --help works')
" && python -c "
from usher_pipeline.cli.main import cli
import click.testing
runner = click.testing.CliRunner()
result = runner.invoke(cli, ['--help'])
print(result.output)
assert 'score' in result.output, 'score command not registered in main CLI'
print('Score command registered in main CLI')
"`
</verify>
<done>
- `usher-pipeline score` command exists with --force, --skip-qc, --skip-validation options
- Score command registered in main CLI group
- Follows established pattern: config load -> checkpoint check -> process -> persist -> provenance
- Orchestrates full pipeline: known genes -> scoring -> QC -> validation
</done>
</task>
<task type="auto">
<name>Task 2: Unit and integration tests for scoring module</name>
<files>
tests/test_scoring.py
tests/test_scoring_integration.py
</files>
<action>
1. Create `tests/test_scoring.py` with unit tests:
Import: pytest, polars, from scoring module all functions, ScoringWeights.
Test class or functions:
a. `test_compile_known_genes_returns_expected_structure`:
- Call compile_known_genes()
- Assert returns polars DataFrame with columns: gene_symbol, source, confidence
- Assert height >= 38 (10 Usher + 28+ SYSCILIA)
- Assert "MYO7A" in gene_symbol values
- Assert "IFT88" in gene_symbol values
- Assert all confidence values == "HIGH"
- Assert sources include both "omim_usher" and "syscilia_scgs_v2"
b. `test_compile_known_genes_no_duplicates_within_source`:
- Verify no duplicate gene_symbol within the same source
- (A gene CAN appear in both sources as separate rows)
c. `test_scoring_weights_validate_sum_defaults`:
- ScoringWeights() with defaults should pass validate_sum()
d. `test_scoring_weights_validate_sum_custom_valid`:
- ScoringWeights with custom weights summing to 1.0 should pass
e. `test_scoring_weights_validate_sum_invalid`:
- ScoringWeights(gnomad=0.5) sums to 1.35 -> validate_sum() raises ValueError
f. `test_scoring_weights_validate_sum_close_to_one`:
- Weights that sum to 0.999999 (within 1e-6) should pass
- Weights that sum to 0.99 should fail
g. `test_null_preservation_in_composite`:
- Create a synthetic PipelineStore (in-memory DuckDB: `duckdb.connect(':memory:')`)
- Create a minimal gene_universe table with 3 genes
- Create gnomad_constraint table with scores for genes 1 and 2 (gene 3 has no entry)
- Create annotation_completeness with scores for gene 1 only
- Create the remaining evidence tables with no rows (or rows for only a subset of genes)
- Call join_evidence_layers and verify gene 3 has NULL gnomad_score and NULL annotation_score
- Call compute_composite_scores and verify gene 3 with zero evidence layers has composite_score = NULL
2. Create `tests/test_scoring_integration.py` with integration tests:
a. `test_scoring_pipeline_end_to_end`:
- Create an in-memory PipelineStore (wrap duckdb.connect(':memory:') in a PipelineStore-like interface), or create a file-based store under the pytest tmp_path fixture
- Create synthetic tables for all 7 tables (gene_universe + 6 evidence):
- gene_universe: 20 genes (gene_001 through gene_020) with gene_symbols
- Include some known genes (MYO7A, IFT88, CDH23) in the gene universe as genes 18-20
- gnomad_constraint: 15 genes with loeuf_normalized scores, 5 NULL
- tissue_expression: 12 genes with expression_score_normalized, 8 NULL
- annotation_completeness: 18 genes with annotation_score_normalized
- subcellular_localization: 10 genes with localization_score_normalized
- animal_model_phenotypes: 8 genes with animal_model_score_normalized
- literature_evidence: 14 genes with literature_score_normalized
- Give known genes (MYO7A, IFT88, CDH23) HIGH scores in multiple layers (0.8-0.95) to ensure they rank highly
- Run compute_composite_scores with default ScoringWeights
- Assert: all 20 genes present in result
- Assert: composite_score is not NULL for genes with at least 1 evidence layer
- Assert: evidence_count values are correct (count of non-NULL scores)
- Assert: quality_flag values are correct based on evidence_count
- Assert: known genes (MYO7A, IFT88, CDH23) have high composite scores (among top 5)
b. `test_qc_detects_missing_data`:
- Create scored_genes table where one layer is 90% NULL
- Run run_qc_checks
- Assert that layer appears in errors (>80% missing)
c. `test_validation_passes_with_known_genes_ranked_highly`:
- Use scored_genes from end-to-end test (known genes scored highly)
- Run validate_known_gene_ranking
- Assert validation_passed is True
Use `tmp_path` fixture for DuckDB file-based stores. Use `PipelineStore(tmp_path / "test.duckdb")` for store creation. Follow existing test patterns from tests/test_gnomad_integration.py.
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_scoring.py tests/test_scoring_integration.py -v --tb=short 2>&1 | tail -30`
</verify>
<done>
- test_scoring.py: 7+ unit tests covering known genes, weight validation, NULL preservation
- test_scoring_integration.py: 3+ integration tests covering end-to-end pipeline with synthetic data
- All tests pass with `pytest tests/test_scoring.py tests/test_scoring_integration.py`
- Tests verify NULL preservation (genes with no evidence get NULL composite score)
- Tests verify known genes rank highly when given high scores
</done>
</task>
</tasks>
<verification>
- `usher-pipeline score --help` shows available options
- Score command registered in main CLI
- Unit tests pass: known genes, weight validation, NULL handling
- Integration tests pass: end-to-end scoring with synthetic data, QC detection, validation
- All tests runnable with `pytest tests/test_scoring*.py`
</verification>
<success_criteria>
- CLI score command orchestrates: known genes -> composite scoring -> QC -> validation
- Checkpoint-restart: skips if scored_genes table exists (unless --force)
- pytest tests/test_scoring.py passes all unit tests
- pytest tests/test_scoring_integration.py passes all integration tests
- Tests use synthetic data (no external API calls, fast, reproducible)
</success_criteria>
<output>
After completion, create `.planning/phases/04-scoring-integration/04-03-SUMMARY.md`
</output>