gbanyan/usher-exploring

gbanyan a52724aff4 docs(04): create phase plan for scoring and integration

2026-02-11 20:31:55 +08:00

---
phase: 04-scoring-integration
plan:
type: execute
wave:
depends_on:
files_modified:
  - src/usher_pipeline/scoring/__init__.py
  - src/usher_pipeline/scoring/known_genes.py
  - src/usher_pipeline/scoring/integration.py
  - src/usher_pipeline/config/schema.py
autonomous: true
must_haves:
  truths:
    - "Known cilia/Usher genes from SYSCILIA and OMIM are compiled into a reusable gene set"
    - "ScoringWeights validates that all weights sum to 1.0 and rejects invalid configs"
    - "Multi-evidence scoring joins all 6 evidence tables and computes weighted average of available evidence only"
    - "Genes with missing evidence layers receive NULL (not zero) for those layers"
  artifacts:
    - path: "src/usher_pipeline/scoring/__init__.py"
      provides: "Scoring module package"
      exports:
        - compile_known_genes
        - compute_composite_scores
        - join_evidence_layers
    - path: "src/usher_pipeline/scoring/known_genes.py"
      provides: "Known cilia/Usher gene compilation"
      contains: "OMIM_USHER_GENES"
    - path: "src/usher_pipeline/scoring/integration.py"
      provides: "Multi-evidence weighted scoring with NULL preservation"
      contains: "COALESCE"
    - path: "src/usher_pipeline/config/schema.py"
      provides: "ScoringWeights with validate_sum method"
      contains: "validate_sum"
  key_links:
    - from: "src/usher_pipeline/scoring/integration.py"
      to: "DuckDB evidence tables"
      via: "LEFT JOIN on gene_id"
      pattern: "LEFT JOIN.ON.gene_id"
    - from: "src/usher_pipeline/scoring/integration.py"
      to: "src/usher_pipeline/config/schema.py"
      via: "ScoringWeights parameter"
      pattern: "ScoringWeights"
---

Compile known cilia/Usher gene set and implement multi-evidence weighted scoring integration.

Purpose: Establishes the foundation for Phase 4 -- the known gene list (for exclusion and positive control validation) and the core scoring engine that joins all 6 evidence tables with configurable weights and NULL-preserving weighted averages.

Output: src/usher_pipeline/scoring/ module with known_genes.py and integration.py; updated config/schema.py with weight sum validation.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/04-scoring-integration/04-RESEARCH.md
@src/usher_pipeline/config/schema.py
@src/usher_pipeline/persistence/duckdb_store.py
@src/usher_pipeline/evidence/gnomad/load.py

Task 1: Known gene compilation and ScoringWeights validation

Files: src/usher_pipeline/scoring/__init__.py, src/usher_pipeline/scoring/known_genes.py, src/usher_pipeline/config/schema.py

1. Create `src/usher_pipeline/scoring/__init__.py` with exports for the module.

Create src/usher_pipeline/scoring/known_genes.py:
- Define OMIM_USHER_GENES as a frozenset of 10 known Usher syndrome gene symbols: MYO7A, USH1C, CDH23, PCDH15, USH1G (SANS), CIB2, USH2A, ADGRV1 (GPR98), WHRN, CLRN1. Include a brief docstring noting these are OMIM Usher syndrome entries.
- Define SYSCILIA_SCGS_V2_CORE as a frozenset of well-known ciliary genes that serve as positive controls. Include at minimum: IFT88, IFT140, IFT172, BBS1, BBS2, BBS4, BBS5, BBS7, BBS9, BBS10, RPGRIP1L, CEP290, ARL13B, INPP5E, TMEM67, CC2D2A, NPHP1, NPHP3, NPHP4, RPGR, CEP164, OFD1, MKS1, TCTN1, TCTN2, TMEM216, TMEM231, TMEM138. This is a curated subset (~30 genes) of the full SCGS v2 list (686 genes). Add a docstring noting the full list can be downloaded from the SCGS v2 publication supplementary data (DOI: 10.1091/mbc.E21-05-0226) and loaded via a future fetch_scgs_v2() function.
- Create function compile_known_genes() -> pl.DataFrame that returns a polars DataFrame with columns: gene_symbol (str), source (str: "omim_usher" or "syscilia_scgs_v2"), confidence (str: "HIGH"). It combines both gene sets and de-duplicates on the (gene_symbol, source) pair, so a gene that appears in both lists keeps one row per source.
- Create function load_known_genes_to_duckdb(store: PipelineStore) -> int that calls compile_known_genes(), saves to DuckDB table known_cilia_genes using store.save_dataframe(), and returns the count of unique gene symbols.
Update src/usher_pipeline/config/schema.py:
- Add a validate_sum(self) -> None method to the ScoringWeights class that sums all 6 weight fields and raises ValueError if the absolute difference from 1.0 exceeds 1e-6. Message format: f"Scoring weights must sum to 1.0, got {total:.6f}".
- Do NOT change any existing field defaults or field definitions -- only add the method.

Run:

cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.known_genes import compile_known_genes, OMIM_USHER_GENES, SYSCILIA_SCGS_V2_CORE
from usher_pipeline.config.schema import ScoringWeights

df = compile_known_genes()
print(f'Known genes: {df.height} rows, {df[\"gene_symbol\"].n_unique()} unique symbols')
assert df.height >= 38, f'Expected >= 38 rows, got {df.height}'
assert 'MYO7A' in df['gene_symbol'].to_list()
assert 'IFT88' in df['gene_symbol'].to_list()

w = ScoringWeights()
w.validate_sum()  # Should pass with defaults
print('ScoringWeights.validate_sum() passed with defaults')

try:
    w2 = ScoringWeights(gnomad=0.5)
    w2.validate_sum()
    print('ERROR: Should have raised ValueError')
except ValueError as e:
    print(f'Correctly rejected invalid weights: {e}')

print('All checks passed')
"

- OMIM_USHER_GENES contains exactly 10 known Usher syndrome genes
- SYSCILIA_SCGS_V2_CORE contains >= 25 core ciliary genes
- compile_known_genes() returns DataFrame with gene_symbol, source, confidence columns
- ScoringWeights.validate_sum() passes with defaults, raises ValueError when weights do not sum to 1.0
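
The two pieces of Task 1 can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the module itself. It builds plain dict rows instead of a polars DataFrame (the rows could be passed to `pl.DataFrame(rows)`), it abbreviates SYSCILIA_SCGS_V2_CORE to four symbols, and it uses a dataclass as a stand-in for ScoringWeights because the real base class in config/schema.py is not shown here. Only the `gnomad` field name comes from the plan. The other five field names, all default values, and the helper name `compile_known_gene_rows` are hypothetical.

```python
from dataclasses import dataclass

OMIM_USHER_GENES = frozenset({
    "MYO7A", "USH1C", "CDH23", "PCDH15", "USH1G",
    "CIB2", "USH2A", "ADGRV1", "WHRN", "CLRN1",
})

# Abbreviated for the sketch; the plan lists 28 symbols for this set.
SYSCILIA_SCGS_V2_CORE = frozenset({"IFT88", "IFT140", "BBS1", "CEP290"})


def compile_known_gene_rows() -> list[dict[str, str]]:
    """Return one row per (gene_symbol, source) pair, sorted for stable output."""
    rows = [
        {"gene_symbol": g, "source": "omim_usher", "confidence": "HIGH"}
        for g in sorted(OMIM_USHER_GENES)
    ]
    rows += [
        {"gene_symbol": g, "source": "syscilia_scgs_v2", "confidence": "HIGH"}
        for g in sorted(SYSCILIA_SCGS_V2_CORE)
    ]
    return rows


@dataclass
class ScoringWeights:
    # Hypothetical field names (except gnomad) and hypothetical defaults.
    gnomad: float = 0.20
    expression: float = 0.20
    annotation: float = 0.15
    localization: float = 0.15
    animal_model: float = 0.15
    literature: float = 0.15

    def validate_sum(self) -> None:
        """Raise ValueError if the six weights do not sum to 1.0 within 1e-6."""
        total = (
            self.gnomad + self.expression + self.annotation
            + self.localization + self.animal_model + self.literature
        )
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"Scoring weights must sum to 1.0, got {total:.6f}")


rows = compile_known_gene_rows()
ScoringWeights().validate_sum()  # passes with the sketch's defaults
```

Comparing against a tolerance rather than testing `total == 1.0` avoids spurious failures from floating-point rounding when the weights are values such as 0.15.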

Task 2: Multi-evidence weighted scoring integration

Files: src/usher_pipeline/scoring/integration.py, src/usher_pipeline/scoring/__init__.py

1. Create `src/usher_pipeline/scoring/integration.py`:

Import: duckdb, polars, structlog, ScoringWeights from config.schema, PipelineStore from persistence.

Create function join_evidence_layers(store: PipelineStore) -> pl.DataFrame:

Execute a DuckDB SQL query using store.conn (direct DuckDB connection) that LEFT JOINs gene_universe with all 6 evidence tables on gene_id:
- gnomad_constraint -> loeuf_normalized AS gnomad_score
- tissue_expression -> expression_score_normalized AS expression_score
- annotation_completeness -> annotation_score_normalized AS annotation_score
- subcellular_localization -> localization_score_normalized AS localization_score
- animal_model_phenotypes -> animal_model_score_normalized AS animal_model_score
- literature_evidence -> literature_score_normalized AS literature_score
Compute evidence_count as the count of non-NULL scores (sum of CASE WHEN ... IS NOT NULL THEN 1 ELSE 0 END for all 6 layers).
Select gene_id, gene_symbol from gene_universe, plus all 6 aliased scores and evidence_count.
Return result as polars DataFrame via .pl().
Log the total gene count, mean evidence_count, and per-layer NULL rates using structlog.
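
The shape of the join described above can be sketched on a toy database. This sketch uses the standard library's sqlite3 as a stand-in for DuckDB so that it runs without extra dependencies, and it joins only two of the six evidence tables. The gene IDs, symbols, and scores are made up for illustration. The table and column names follow the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene_universe (gene_id TEXT PRIMARY KEY, gene_symbol TEXT);
CREATE TABLE gnomad_constraint (gene_id TEXT, loeuf_normalized REAL);
CREATE TABLE literature_evidence (gene_id TEXT, literature_score_normalized REAL);
-- Made-up rows: G1 has both layers, G2 has one, G3 has none.
INSERT INTO gene_universe VALUES ('G1', 'GENE_A'), ('G2', 'GENE_B'), ('G3', 'GENE_C');
INSERT INTO gnomad_constraint VALUES ('G1', 0.9), ('G2', 0.4);
INSERT INTO literature_evidence VALUES ('G1', 0.7);
""")

JOIN_SQL = """
SELECT
    g.gene_id,
    g.gene_symbol,
    c.loeuf_normalized AS gnomad_score,
    l.literature_score_normalized AS literature_score,
    -- Count the layers that actually have a score for this gene.
    (CASE WHEN c.loeuf_normalized IS NOT NULL THEN 1 ELSE 0 END
     + CASE WHEN l.literature_score_normalized IS NOT NULL THEN 1 ELSE 0 END
    ) AS evidence_count
FROM gene_universe AS g
LEFT JOIN gnomad_constraint AS c ON g.gene_id = c.gene_id
LEFT JOIN literature_evidence AS l ON g.gene_id = l.gene_id
ORDER BY g.gene_id
"""

rows = conn.execute(JOIN_SQL).fetchall()
for row in rows:
    print(row)
```

Because every evidence table is LEFT JOINed onto gene_universe, a gene with no evidence at all (G3 here) still appears in the result, with NULL scores and an evidence_count of 0.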

Create function compute_composite_scores(store: PipelineStore, weights: ScoringWeights) -> pl.DataFrame:

Call weights.validate_sum() first to assert valid weights.
Execute a DuckDB SQL query that:
a. Uses the same join as join_evidence_layers (or call it as a CTE / subquery).
b. Computes available_weight = sum of weights for non-NULL layers (using CASE WHEN ... IS NOT NULL THEN weight_value ELSE 0 END for each layer).
c. Computes weighted_sum = sum of COALESCE(score * weight, 0) for each layer.
d. Computes composite_score = CASE WHEN available_weight > 0 THEN weighted_sum / available_weight ELSE NULL END.
e. Computes quality_flag:
   - evidence_count >= 4 -> 'sufficient_evidence'
   - evidence_count >= 2 -> 'moderate_evidence'
   - evidence_count >= 1 -> 'sparse_evidence'
   - ELSE 'no_evidence'
f. Includes all individual layer scores for explainability.
g. Includes per-layer contribution columns: gnomad_contribution = gnomad_score * gnomad_weight (NULL if score is NULL), etc.
h. Orders by composite_score DESC NULLS LAST.
Return as polars DataFrame.
Log summary stats: total genes, genes with composite score, mean/median composite score, quality flag distribution.
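
The weighted_sum / available_weight pattern can be worked through on a small example. As a sketch under stated assumptions, this uses sqlite3 in place of DuckDB, two layers instead of six, a pre-joined toy table named `joined`, and hypothetical weights of 0.6 and 0.4. The sort uses `composite_score IS NULL, composite_score DESC` so that it does not depend on `NULLS LAST` support in the installed SQLite version.

```python
import sqlite3

# Hypothetical weights for a two-layer toy example; they sum to 1.0.
weights = {"w_gnomad": 0.6, "w_lit": 0.4}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE joined (gene_id TEXT, gnomad_score REAL, literature_score REAL);
INSERT INTO joined VALUES ('G1', 0.9, 0.7), ('G2', 0.4, NULL), ('G3', NULL, NULL);
""")

SCORE_SQL = """
WITH parts AS (
    SELECT
        gene_id,
        -- Contributions stay NULL when the underlying score is NULL.
        gnomad_score * :w_gnomad AS gnomad_contribution,
        literature_score * :w_lit AS literature_contribution,
        CASE WHEN gnomad_score IS NOT NULL THEN :w_gnomad ELSE 0 END
          + CASE WHEN literature_score IS NOT NULL THEN :w_lit ELSE 0 END
          AS available_weight,
        CASE WHEN gnomad_score IS NOT NULL THEN 1 ELSE 0 END
          + CASE WHEN literature_score IS NOT NULL THEN 1 ELSE 0 END
          AS evidence_count
    FROM joined
),
scored AS (
    SELECT
        gene_id,
        CASE WHEN available_weight > 0
             THEN (COALESCE(gnomad_contribution, 0)
                   + COALESCE(literature_contribution, 0)) / available_weight
             ELSE NULL END AS composite_score,
        CASE WHEN evidence_count >= 4 THEN 'sufficient_evidence'
             WHEN evidence_count >= 2 THEN 'moderate_evidence'
             WHEN evidence_count >= 1 THEN 'sparse_evidence'
             ELSE 'no_evidence' END AS quality_flag
    FROM parts
)
SELECT gene_id, composite_score, quality_flag
FROM scored
ORDER BY composite_score IS NULL, composite_score DESC
"""

scored = conn.execute(SCORE_SQL, weights).fetchall()
for row in scored:
    print(row)
```

In this toy data, G2 has only a gnomAD score of 0.4, so its composite is (0.4 * 0.6) / 0.6, which is approximately 0.4. Treating the missing literature score as zero would instead give about 0.24, penalizing the gene for missing data rather than for weak evidence. G3 has no evidence, so its composite stays NULL and it sorts last.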

Create function persist_scored_genes(store: PipelineStore, scored_df: pl.DataFrame, weights: ScoringWeights) -> None:

Save scored_df to DuckDB table scored_genes via store.save_dataframe() with replace=True.
Description: "Multi-evidence weighted composite scores with per-layer contributions".
Log the row count and quality flag distribution.

Update src/usher_pipeline/scoring/__init__.py to export: compile_known_genes, load_known_genes_to_duckdb, join_evidence_layers, compute_composite_scores, persist_scored_genes.

Run:

`cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.scoring.integration import join_evidence_layers, compute_composite_scores, persist_scored_genes
from usher_pipeline.config.schema import ScoringWeights
import inspect

# Verify function signatures exist and have correct params
sig_join = inspect.signature(join_evidence_layers)
assert 'store' in sig_join.parameters
sig_score = inspect.signature(compute_composite_scores)
assert 'store' in sig_score.parameters
assert 'weights' in sig_score.parameters
sig_persist = inspect.signature(persist_scored_genes)
assert 'store' in sig_persist.parameters
print('All function signatures verified')
print('Source contains COALESCE:', 'COALESCE' in inspect.getsource(compute_composite_scores))
print('Source contains LEFT JOIN:', 'LEFT JOIN' in inspect.getsource(join_evidence_layers))
"`

- join_evidence_layers() LEFT JOINs gene_universe with all 6 evidence tables on gene_id, returns DataFrame with gene_id, gene_symbol, 6 score columns, evidence_count
- compute_composite_scores() computes weighted average of available evidence only (weighted_sum / available_weight), with quality_flag and per-layer contributions
- NULL scores are not replaced with zero in the weighted average -- only available evidence contributes
- persist_scored_genes() saves scored_genes table to DuckDB

- `src/usher_pipeline/scoring/` module exists with `__init__.py`, `known_genes.py`, `integration.py`
- Known gene set includes 10 OMIM Usher genes and 25+ SYSCILIA core ciliary genes
- ScoringWeights.validate_sum() enforces weight sum constraint
- Integration SQL uses LEFT JOINs preserving NULLs and COALESCE for weighted scoring
- No evidence layer with NULL score contributes to composite (weighted_sum / available_weight pattern)

<success_criteria>

- compile_known_genes() returns polars DataFrame with >= 38 rows of known cilia/Usher genes
- compute_composite_scores() produces composite_score using weighted average of available evidence
- Genes with 0 evidence layers get composite_score = NULL (not 0)
- ScoringWeights with defaults passes validate_sum(); invalid weights raise ValueError
- All functions importable from usher_pipeline.scoring

</success_criteria>

After completion, create `.planning/phases/04-scoring-integration/04-01-SUMMARY.md`
