docs(04-01): complete known gene compilation and weighted scoring plan

- Known genes: 38 (10 OMIM Usher + 28 SYSCILIA SCGS v2 core) - ScoringWeights.validate_sum() enforcing weight sum = 1.0 - NULL-preserving weighted average (weighted_sum / available_weight) - Quality flags based on evidence_count thresholds - Per-layer contributions for explainability - 2 tasks, 4 files, 4 min duration
2026-02-11 20:44:09 +08:00
parent f441e8c1ad
commit 71c4e8f736
2 changed files with 166 additions and 11 deletions
--- a/.planning/phases/04-scoring-integration/04-01-SUMMARY.md
+++ b/.planning/phases/04-scoring-integration/04-01-SUMMARY.md
@@ -0,0 +1,144 @@
+---
+phase: 04-scoring-integration
+plan: 01
+subsystem: scoring
+tags: [scoring, known-genes, weighted-average, duckdb, polars, multi-evidence]
+
+# Dependency graph
+requires:
+  - phase: 01-data-infrastructure
+    provides: PipelineStore, DuckDB persistence, gene_universe table
+  - phase: 02-prototype-evidence-layer
+    provides: gnomad_constraint table with loeuf_normalized
+  - phase: 03-core-evidence-layers
+    provides: tissue_expression, annotation_completeness, subcellular_localization, animal_model_phenotypes, literature_evidence tables
+provides:
+  - Known cilia/Usher gene set compilation (OMIM + SYSCILIA SCGS v2)
+  - ScoringWeights validation enforcing sum constraint
+  - Multi-evidence weighted scoring with NULL-preserving weighted average
+  - join_evidence_layers() - LEFT JOIN all 6 evidence tables
+  - compute_composite_scores() - weighted_sum / available_weight pattern
+  - persist_scored_genes() - DuckDB persistence with quality flags
+affects: [04-02, 04-03, ranking, filtering, validation]
+
+# Tech tracking
+tech-stack:
+  added: []
+  patterns:
+    - NULL-preserving weighted average (weighted_sum / available_weight)
+    - Evidence quality flags (sufficient/moderate/sparse/no_evidence)
+    - Per-layer contribution tracking for explainability
+    - Known gene compilation from multiple sources
+
+key-files:
+  created:
+    - src/usher_pipeline/scoring/__init__.py
+    - src/usher_pipeline/scoring/known_genes.py
+    - src/usher_pipeline/scoring/integration.py
+  modified:
+    - src/usher_pipeline/config/schema.py
+
+key-decisions:
+  - "OMIM Usher genes: 10 genes as disease positive controls"
+  - "SYSCILIA SCGS v2 core: ~28 genes as ciliary positive controls (subset of 686 full list)"
+  - "Known genes preserve multi-source provenance (duplicate gene_symbols with different sources)"
+  - "ScoringWeights validation: sum must equal 1.0 ± 1e-6 tolerance"
+  - "NULL-preserving weighted average: available_weight = sum of weights for non-NULL layers only"
+  - "Quality flags based on evidence_count: >=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence"
+  - "Per-layer contributions computed as score * weight (NULL if score is NULL) for explainability"
+
+patterns-established:
+  - "NULL-preserving scoring pattern: COALESCE for weighted_sum, but available_weight excludes NULL layers"
+  - "LEFT JOIN all evidence tables to gene_universe preserving all genes"
+  - "Quality flag classification based on evidence layer count"
+  - "Contribution tracking for transparency (each layer's impact on composite score)"
+
+# Metrics
+duration: 4min
+completed: 2026-02-11
+---
+
+# Phase 04 Plan 01: Known Gene Compilation and Multi-Evidence Scoring Summary
+
+**NULL-preserving weighted scoring engine joining 6 evidence layers with configurable weights, plus OMIM/SYSCILIA known gene compilation for validation**
+
+## Performance
+
+- **Duration:** 4 minutes (228 seconds)
+- **Started:** 2026-02-11T12:38:05Z
+- **Completed:** 2026-02-11T12:42:13Z
+- **Tasks:** 2
+- **Files modified:** 4
+
+## Accomplishments
+- Compiled 38 known cilia/Usher genes from OMIM (10 genes) and SYSCILIA SCGS v2 core (28 genes)
+- Implemented ScoringWeights.validate_sum() enforcing weight sum constraint (1.0 ± 1e-6)
+- Created join_evidence_layers() LEFT JOINing all 6 evidence tables preserving NULLs
+- Built compute_composite_scores() with NULL-preserving weighted average (weighted_sum / available_weight)
+- Added quality flag classification (sufficient/moderate/sparse/no_evidence) based on evidence count
+- Included per-layer contribution columns for explainability
+
+## Task Commits
+
+Each task was committed atomically:
+
+1. **Task 1: Known gene compilation and ScoringWeights validation** - `0cd2f7c` (feat)
+2. **Task 2: Multi-evidence weighted scoring integration** - `f441e8c` (feat)
+
+## Files Created/Modified
+- `src/usher_pipeline/scoring/__init__.py` - Scoring module exports
+- `src/usher_pipeline/scoring/known_genes.py` - OMIM_USHER_GENES (10), SYSCILIA_SCGS_V2_CORE (28), compile_known_genes()
+- `src/usher_pipeline/scoring/integration.py` - join_evidence_layers(), compute_composite_scores(), persist_scored_genes()
+- `src/usher_pipeline/config/schema.py` - Added ScoringWeights.validate_sum() method
+
+## Decisions Made
+
+1. **Known gene curation:** Limited SYSCILIA SCGS v2 to ~28 core genes (subset of 686 full list) for initial positive control validation. Future enhancement can add fetch_scgs_v2() to download complete list from publication supplementary data.
+
+2. **Multi-source provenance:** compile_known_genes() does NOT de-duplicate gene_symbols across sources. A gene appearing in both OMIM and SYSCILIA will have two rows (one per source). This preserves provenance for validation and analysis.
+
+3. **NULL-preserving weighted average:** Implemented weighted_sum / available_weight pattern where available_weight = sum of weights for non-NULL layers only. Genes with 0 evidence layers receive NULL composite_score (not 0), preserving semantic distinction between "no evidence" and "weak evidence".
+
+4. **Quality flags:** Classification based on evidence_count thresholds (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence) to guide downstream filtering and prioritization.
+
+5. **Explainability:** Per-layer contribution columns (score * weight) enable tracing which evidence layers drove a gene's composite score. Critical for manual review and trust.
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Issues Encountered
+
+None. Both verification tests passed on first attempt.
+
+## User Setup Required
+
+None - no external service configuration required.
+
+## Next Phase Readiness
+
+Ready for Phase 04 Plan 02 (ranked candidate list generation):
+- Known gene set compiled and ready for exclusion filtering
+- Composite scoring engine functional with NULL preservation
+- Quality flags available for filtering
+- Per-layer contributions available for ranking criteria
+
+No blockers. Next plan can implement:
+- Exclusion of known genes
+- Ranking by composite score
+- Quality flag filtering
+- Top-N candidate selection
+
+## Self-Check: PASSED
+
+All claimed files and commits verified:
+- src/usher_pipeline/scoring/__init__.py - FOUND
+- src/usher_pipeline/scoring/known_genes.py - FOUND
+- src/usher_pipeline/scoring/integration.py - FOUND
+- Commit 0cd2f7c (Task 1) - FOUND
+- Commit f441e8c (Task 2) - FOUND
+
+---
+*Phase: 04-scoring-integration*
+*Plan: 01*
+*Completed: 2026-02-11*