docs(04): create phase plan for scoring and integration
213
.planning/phases/04-scoring-integration/04-01-PLAN.md
Normal file
@@ -0,0 +1,213 @@
---
phase: 04-scoring-integration
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/scoring/__init__.py
  - src/usher_pipeline/scoring/known_genes.py
  - src/usher_pipeline/scoring/integration.py
  - src/usher_pipeline/config/schema.py
autonomous: true

must_haves:
  truths:
    - "Known cilia/Usher genes from SYSCILIA and OMIM are compiled into a reusable gene set"
    - "ScoringWeights validates that all weights sum to 1.0 and rejects invalid configs"
    - "Multi-evidence scoring joins all 6 evidence tables and computes weighted average of available evidence only"
    - "Genes with missing evidence layers receive NULL (not zero) for those layers"
  artifacts:
    - path: "src/usher_pipeline/scoring/__init__.py"
      provides: "Scoring module package"
      exports: ["compile_known_genes", "compute_composite_scores", "join_evidence_layers"]
    - path: "src/usher_pipeline/scoring/known_genes.py"
      provides: "Known cilia/Usher gene compilation"
      contains: "OMIM_USHER_GENES"
    - path: "src/usher_pipeline/scoring/integration.py"
      provides: "Multi-evidence weighted scoring with NULL preservation"
      contains: "COALESCE"
    - path: "src/usher_pipeline/config/schema.py"
      provides: "ScoringWeights with validate_sum method"
      contains: "validate_sum"
  key_links:
    - from: "src/usher_pipeline/scoring/integration.py"
      to: "DuckDB evidence tables"
      via: "LEFT JOIN on gene_id"
      pattern: "LEFT JOIN.*ON.*gene_id"
    - from: "src/usher_pipeline/scoring/integration.py"
      to: "src/usher_pipeline/config/schema.py"
      via: "ScoringWeights parameter"
      pattern: "ScoringWeights"
---

<objective>
Compile known cilia/Usher gene set and implement multi-evidence weighted scoring integration.

Purpose: Establishes the foundation for Phase 4 -- the known gene list (for exclusion and positive control validation) and the core scoring engine that joins all 6 evidence tables with configurable weights and NULL-preserving weighted averages.

Output: `src/usher_pipeline/scoring/` module with known_genes.py and integration.py; updated config/schema.py with weight sum validation.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-scoring-integration/04-RESEARCH.md
@src/usher_pipeline/config/schema.py
@src/usher_pipeline/persistence/duckdb_store.py
@src/usher_pipeline/evidence/gnomad/load.py
</context>

<tasks>

<task type="auto">
<name>Task 1: Known gene compilation and ScoringWeights validation</name>
<files>
src/usher_pipeline/scoring/__init__.py
src/usher_pipeline/scoring/known_genes.py
src/usher_pipeline/config/schema.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/__init__.py` with exports for the module.

2. Create `src/usher_pipeline/scoring/known_genes.py`:
   - Define `OMIM_USHER_GENES` as a frozenset of 10 known Usher syndrome gene symbols: MYO7A, USH1C, CDH23, PCDH15, USH1G (alias SANS), CIB2, USH2A, ADGRV1 (alias GPR98), WHRN, CLRN1. Store only the 10 primary symbols; the aliases in parentheses are for reference. Include a brief docstring noting these are OMIM Usher syndrome entries.
   - Define `SYSCILIA_SCGS_V2_CORE` as a frozenset of well-known ciliary genes that serve as positive controls. Include at minimum: IFT88, IFT140, IFT172, BBS1, BBS2, BBS4, BBS5, BBS7, BBS9, BBS10, RPGRIP1L, CEP290, ARL13B, INPP5E, TMEM67, CC2D2A, NPHP1, NPHP3, NPHP4, RPGR, CEP164, OFD1, MKS1, TCTN1, TCTN2, TMEM216, TMEM231, TMEM138. This is a curated subset (~30 genes) of the full SCGS v2 list (686 genes). Add a docstring noting the full list can be downloaded from the SCGS v2 publication supplementary data (DOI: 10.1091/mbc.E21-05-0226) and loaded via a future `fetch_scgs_v2()` function.
   - Create function `compile_known_genes() -> pl.DataFrame` that returns a polars DataFrame with columns: `gene_symbol` (str), `source` (str: "omim_usher" or "syscilia_scgs_v2"), `confidence` (str: "HIGH"). Combine both gene sets and de-duplicate on the (`gene_symbol`, `source`) pair, so a gene that appears in both lists keeps one row per source.
   - Create function `load_known_genes_to_duckdb(store: PipelineStore) -> int` that calls `compile_known_genes()`, saves to DuckDB table `known_cilia_genes` using `store.save_dataframe()`, and returns the count of unique gene symbols.

3. Update `src/usher_pipeline/config/schema.py`:
   - Add a `validate_sum(self) -> None` method to the `ScoringWeights` class that sums all 6 weight fields and raises `ValueError` if the absolute difference from 1.0 exceeds 1e-6. Message format: `f"Scoring weights must sum to 1.0, got {total:.6f}"`.
   - Do NOT change any existing field defaults or field definitions -- only add the method.
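
The steps in this action can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the project's implementation: `compile_known_gene_rows` is a hypothetical stand-in that returns plain dict rows instead of the polars DataFrame the plan asks for, and the `ScoringWeights` dataclass below uses assumed field names and placeholder defaults (only `gnomad` is named in this plan; the real class and its defaults live in `config/schema.py` and must not be changed).

```python
from dataclasses import dataclass, fields

# Primary gene symbols copied from the plan text; aliases (SANS, GPR98) are omitted.
OMIM_USHER_GENES = frozenset({
    "MYO7A", "USH1C", "CDH23", "PCDH15", "USH1G",
    "CIB2", "USH2A", "ADGRV1", "WHRN", "CLRN1",
})

SYSCILIA_SCGS_V2_CORE = frozenset({
    "IFT88", "IFT140", "IFT172", "BBS1", "BBS2", "BBS4", "BBS5", "BBS7",
    "BBS9", "BBS10", "RPGRIP1L", "CEP290", "ARL13B", "INPP5E", "TMEM67",
    "CC2D2A", "NPHP1", "NPHP3", "NPHP4", "RPGR", "CEP164", "OFD1", "MKS1",
    "TCTN1", "TCTN2", "TMEM216", "TMEM231", "TMEM138",
})


def compile_known_gene_rows() -> list[dict[str, str]]:
    """Return one row per (gene_symbol, source) pair.

    Hypothetical stand-in for compile_known_genes(); the real function
    is expected to wrap rows like these in a polars DataFrame.
    """
    rows = []
    for source, genes in (
        ("omim_usher", OMIM_USHER_GENES),
        ("syscilia_scgs_v2", SYSCILIA_SCGS_V2_CORE),
    ):
        for symbol in sorted(genes):
            rows.append(
                {"gene_symbol": symbol, "source": source, "confidence": "HIGH"}
            )
    return rows


@dataclass
class ScoringWeights:
    # ASSUMPTION: field names other than `gnomad` and all default values
    # are placeholders chosen so that the six weights sum to 1.0.
    gnomad: float = 0.20
    expression: float = 0.20
    annotation: float = 0.15
    localization: float = 0.15
    animal_model: float = 0.15
    literature: float = 0.15

    def validate_sum(self) -> None:
        """Raise ValueError unless the six weights sum to 1.0 within 1e-6."""
        total = sum(getattr(self, f.name) for f in fields(self))
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"Scoring weights must sum to 1.0, got {total:.6f}")
```

With the 10 Usher symbols and the 28 ciliary symbols listed above, the row list has 38 entries, which matches the `>= 38` check in the verify step. Iterating over the dataclass fields rather than naming each weight keeps `validate_sum` correct if a layer is added later.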
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.known_genes import compile_known_genes, OMIM_USHER_GENES, SYSCILIA_SCGS_V2_CORE
from usher_pipeline.config.schema import ScoringWeights
df = compile_known_genes()
print(f'Known genes: {df.height} rows, {df[\"gene_symbol\"].n_unique()} unique symbols')
assert df.height >= 38, f'Expected >= 38 rows, got {df.height}'
assert 'MYO7A' in df['gene_symbol'].to_list()
assert 'IFT88' in df['gene_symbol'].to_list()
w = ScoringWeights()
w.validate_sum()  # Should pass with defaults
print('ScoringWeights.validate_sum() passed with defaults')
try:
    w2 = ScoringWeights(gnomad=0.5)
    w2.validate_sum()
    print('ERROR: Should have raised ValueError')
except ValueError as e:
    print(f'Correctly rejected invalid weights: {e}')
print('All checks passed')
"`
</verify>
<done>
- OMIM_USHER_GENES contains exactly 10 known Usher syndrome genes
- SYSCILIA_SCGS_V2_CORE contains >= 25 core ciliary genes
- compile_known_genes() returns DataFrame with gene_symbol, source, confidence columns
- ScoringWeights.validate_sum() passes with defaults, raises ValueError when weights do not sum to 1.0
</done>
</task>

<task type="auto">
<name>Task 2: Multi-evidence weighted scoring integration</name>
<files>
src/usher_pipeline/scoring/integration.py
src/usher_pipeline/scoring/__init__.py
</files>
<action>
1. Create `src/usher_pipeline/scoring/integration.py`:

   Import: duckdb, polars, structlog, ScoringWeights from config.schema, PipelineStore from persistence.

   Create function `join_evidence_layers(store: PipelineStore) -> pl.DataFrame`:
   - Execute a DuckDB SQL query using `store.conn` (direct DuckDB connection) that LEFT JOINs `gene_universe` with all 6 evidence tables on `gene_id`:
     - `gnomad_constraint` -> `loeuf_normalized` AS `gnomad_score`
     - `tissue_expression` -> `expression_score_normalized` AS `expression_score`
     - `annotation_completeness` -> `annotation_score_normalized` AS `annotation_score`
     - `subcellular_localization` -> `localization_score_normalized` AS `localization_score`
     - `animal_model_phenotypes` -> `animal_model_score_normalized` AS `animal_model_score`
     - `literature_evidence` -> `literature_score_normalized` AS `literature_score`
   - Compute `evidence_count` as the count of non-NULL scores (sum of CASE WHEN ... IS NOT NULL THEN 1 ELSE 0 END for all 6 layers).
   - Select `gene_id`, `gene_symbol` from `gene_universe`, plus all 6 aliased scores and `evidence_count`.
   - Return result as polars DataFrame via `.pl()`.
   - Log the total gene count, mean evidence_count, and per-layer NULL rates using structlog.

   Create function `compute_composite_scores(store: PipelineStore, weights: ScoringWeights) -> pl.DataFrame`:
   - Call `weights.validate_sum()` first to assert valid weights.
   - Execute a DuckDB SQL query that:
     a. Uses the same join as `join_evidence_layers` (or call it as a CTE / subquery).
     b. Computes `available_weight` = sum of weights for non-NULL layers (using CASE WHEN ... IS NOT NULL THEN weight_value ELSE 0 END for each layer).
     c. Computes `weighted_sum` = sum of COALESCE(score * weight, 0) for each layer.
     d. Computes `composite_score` = CASE WHEN available_weight > 0 THEN weighted_sum / available_weight ELSE NULL END.
     e. Computes `quality_flag`:
        - `evidence_count >= 4` -> 'sufficient_evidence'
        - `evidence_count >= 2` -> 'moderate_evidence'
        - `evidence_count >= 1` -> 'sparse_evidence'
        - ELSE 'no_evidence'
     f. Includes all individual layer scores for explainability.
     g. Includes per-layer contribution columns: `gnomad_contribution` = gnomad_score * gnomad_weight (NULL if score is NULL), etc.
     h. Orders by composite_score DESC NULLS LAST.
   - Return as polars DataFrame.
   - Log summary stats: total genes, genes with composite score, mean/median composite score, quality flag distribution.

   Create function `persist_scored_genes(store: PipelineStore, scored_df: pl.DataFrame, weights: ScoringWeights) -> None`:
   - Save `scored_df` to DuckDB table `scored_genes` via `store.save_dataframe()` with replace=True.
   - Description: "Multi-evidence weighted composite scores with per-layer contributions".
   - Log the row count and quality flag distribution.

2. Update `src/usher_pipeline/scoring/__init__.py` to export: `compile_known_genes`, `load_known_genes_to_duckdb`, `join_evidence_layers`, `compute_composite_scores`, `persist_scored_genes`.
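
The join and scoring query described in this action might look roughly like the sketch below. It is an illustration under several assumptions, not the planned implementation: it uses the stdlib `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies, it covers only three of the six layers, and the demo rows, gene IDs, weight values and the helper names `build_demo_db` and `composite_scores` are all hypothetical. Table and column names follow the plan.

```python
import sqlite3

# Three of the six layers, for brevity. Weights are bound as named parameters.
SCORING_SQL = """
WITH joined AS (
    SELECT g.gene_id,
           g.gene_symbol,
           c.loeuf_normalized            AS gnomad_score,
           e.expression_score_normalized AS expression_score,
           l.literature_score_normalized AS literature_score
    FROM gene_universe AS g
    LEFT JOIN gnomad_constraint   AS c ON c.gene_id = g.gene_id
    LEFT JOIN tissue_expression   AS e ON e.gene_id = g.gene_id
    LEFT JOIN literature_evidence AS l ON l.gene_id = g.gene_id
),
weighted AS (
    SELECT joined.*,
           (CASE WHEN gnomad_score     IS NOT NULL THEN 1 ELSE 0 END
          + CASE WHEN expression_score IS NOT NULL THEN 1 ELSE 0 END
          + CASE WHEN literature_score IS NOT NULL THEN 1 ELSE 0 END) AS evidence_count,
           (CASE WHEN gnomad_score     IS NOT NULL THEN :gnomad     ELSE 0 END
          + CASE WHEN expression_score IS NOT NULL THEN :expression ELSE 0 END
          + CASE WHEN literature_score IS NOT NULL THEN :literature ELSE 0 END) AS available_weight,
           (COALESCE(gnomad_score     * :gnomad, 0)
          + COALESCE(expression_score * :expression, 0)
          + COALESCE(literature_score * :literature, 0)) AS weighted_sum
    FROM joined
)
SELECT gene_id,
       gene_symbol,
       evidence_count,
       CASE WHEN available_weight > 0
            THEN weighted_sum / available_weight
            ELSE NULL END AS composite_score,
       CASE WHEN evidence_count >= 4 THEN 'sufficient_evidence'
            WHEN evidence_count >= 2 THEN 'moderate_evidence'
            WHEN evidence_count >= 1 THEN 'sparse_evidence'
            ELSE 'no_evidence' END AS quality_flag
FROM weighted
ORDER BY gene_id
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up evidence rows."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript("""
        CREATE TABLE gene_universe (gene_id TEXT PRIMARY KEY, gene_symbol TEXT);
        CREATE TABLE gnomad_constraint (gene_id TEXT, loeuf_normalized REAL);
        CREATE TABLE tissue_expression (gene_id TEXT, expression_score_normalized REAL);
        CREATE TABLE literature_evidence (gene_id TEXT, literature_score_normalized REAL);
        INSERT INTO gene_universe VALUES ('g1', 'GENE_A'), ('g2', 'GENE_B'), ('g3', 'GENE_C');
        INSERT INTO gnomad_constraint VALUES ('g1', 0.8), ('g2', 0.4);
        INSERT INTO tissue_expression VALUES ('g1', 0.6);
        INSERT INTO literature_evidence VALUES ('g1', 1.0), ('g2', 0.9);
    """)
    return conn


def composite_scores(conn: sqlite3.Connection, weights: dict) -> list:
    """Run the scoring query with the given per-layer weights."""
    return conn.execute(SCORING_SQL, weights).fetchall()
```

Calling `composite_scores(build_demo_db(), {"gnomad": 0.5, "expression": 0.3, "literature": 0.2})` gives `g2` a score of 0.38 / 0.7 because its missing expression layer is excluded from the denominator, and gives `g3` a NULL composite with `no_evidence`. The sketch orders by `gene_id` and does not exercise DuckDB's `.pl()` conversion or `NULLS LAST` ordering.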
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.integration import join_evidence_layers, compute_composite_scores, persist_scored_genes
from usher_pipeline.config.schema import ScoringWeights
import inspect
# Verify function signatures exist and have correct params
sig_join = inspect.signature(join_evidence_layers)
assert 'store' in sig_join.parameters
sig_score = inspect.signature(compute_composite_scores)
assert 'store' in sig_score.parameters
assert 'weights' in sig_score.parameters
sig_persist = inspect.signature(persist_scored_genes)
assert 'store' in sig_persist.parameters
print('All function signatures verified')
print('Source contains COALESCE:', 'COALESCE' in inspect.getsource(compute_composite_scores))
print('Source contains LEFT JOIN:', 'LEFT JOIN' in inspect.getsource(join_evidence_layers))
"`
</verify>
<done>
- join_evidence_layers() LEFT JOINs gene_universe with all 6 evidence tables on gene_id, returns DataFrame with gene_id, gene_symbol, 6 score columns, evidence_count
- compute_composite_scores() computes weighted average of available evidence only (weighted_sum / available_weight), with quality_flag and per-layer contributions
- NULL scores are not replaced with zero in the weighted average -- only available evidence contributes
- persist_scored_genes() saves scored_genes table to DuckDB
</done>
</task>

</tasks>

<verification>
- `src/usher_pipeline/scoring/` module exists with `__init__.py`, `known_genes.py`, `integration.py`
- Known gene set includes 10 OMIM Usher genes and 25+ SYSCILIA core ciliary genes
- ScoringWeights.validate_sum() enforces weight sum constraint
- Integration SQL uses LEFT JOINs preserving NULLs and COALESCE for weighted scoring
- No evidence layer with NULL score contributes to composite (weighted_sum / available_weight pattern)
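
The difference between excluding NULL layers and treating them as zero can be checked with a small worked example. The sketch below is plain Python rather than the planned SQL; the function names, the layer keys and the weight values are assumptions made for illustration only.

```python
from typing import Optional


def composite(scores: dict, weights: dict) -> Optional[float]:
    """Weighted average over layers that have a score (None means NULL)."""
    present = [name for name in weights if scores.get(name) is not None]
    available_weight = sum(weights[name] for name in present)
    if available_weight <= 0:
        return None  # no evidence at all: composite stays NULL, not 0
    weighted_sum = sum(scores[name] * weights[name] for name in present)
    return weighted_sum / available_weight


def composite_zero_filled(scores: dict, weights: dict) -> float:
    """Anti-pattern for comparison: missing layers are counted as 0."""
    return sum((scores.get(name) or 0.0) * w for name, w in weights.items())


# Placeholder weights that sum to 1.0 (not the project's defaults).
WEIGHTS = {
    "gnomad": 0.20, "expression": 0.20, "annotation": 0.15,
    "localization": 0.15, "animal_model": 0.15, "literature": 0.15,
}

# A gene with evidence in only two of the six layers.
SPARSE_GENE = {"gnomad": 0.9, "literature": 0.7}
```

For `SPARSE_GENE`, `composite` returns 0.285 / 0.35 (about 0.81), while `composite_zero_filled` returns 0.285. Zero-filling would penalize a poorly studied gene for missing data, which is the behaviour this plan is meant to avoid.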
</verification>

<success_criteria>
- compile_known_genes() returns polars DataFrame with >= 38 rows of known cilia/Usher genes
- compute_composite_scores() produces composite_score using weighted average of available evidence
- Genes with 0 evidence layers get composite_score = NULL (not 0)
- ScoringWeights with defaults passes validate_sum(); invalid weights raise ValueError
- All functions importable from usher_pipeline.scoring
</success_criteria>

<output>
After completion, create `.planning/phases/04-scoring-integration/04-01-SUMMARY.md`
</output>