From 71c4e8f736ddfe1a273513fcc20fb54b095ae7df Mon Sep 17 00:00:00 2001
From: gbanyan
Date: Wed, 11 Feb 2026 20:44:09 +0800
Subject: [PATCH] docs(04-01): complete known gene compilation and weighted scoring plan

- Known genes: 38 (10 OMIM Usher + 28 SYSCILIA SCGS v2 core)
- ScoringWeights.validate_sum() enforcing weight sum = 1.0
- NULL-preserving weighted average (weighted_sum / available_weight)
- Quality flags based on evidence_count thresholds
- Per-layer contributions for explainability
- 2 tasks, 4 files, 4 min duration
---
 .planning/STATE.md                            |  33 ++--
 .../04-scoring-integration/04-01-SUMMARY.md   | 144 ++++++++++++++++++
 2 files changed, 166 insertions(+), 11 deletions(-)
 create mode 100644 .planning/phases/04-scoring-integration/04-01-SUMMARY.md

diff --git a/.planning/STATE.md b/.planning/STATE.md
index 947ef01..0f3bc67 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -5,23 +5,23 @@
 See: .planning/PROJECT.md (updated 2026-02-11)
 
 **Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
-**Current focus:** Phase 3 complete — ready for Phase 4
+**Current focus:** Phase 4 in progress — Scoring and Integration
 
 ## Current Position
 
-Phase: 3 of 6 (Core Evidence Layers)
-Plan: 6 of 6 in current phase (phase complete)
-Status: Phase 3 complete — verified (6/6 success criteria, 20/20 requirements)
-Last activity: 2026-02-11 — Phase 3 verified and complete
+Phase: 4 of 6 (Scoring and Integration)
+Plan: 1 of 3 in current phase (in progress)
+Status: Plan 04-01 complete — known gene compilation and weighted scoring integration
+Last activity: 2026-02-11 — Completed 04-01-PLAN.md
 
-Progress: [██████░░░░] 60.0% (12/20 plans complete across all phases)
+Progress: [██████░░░░] 65.0% (13/20 plans complete across all phases)
 
 ## Performance Metrics
 
 **Velocity:**
-- Total plans completed: 12
-- Average duration: 5.6 min
-- Total execution time: 1.1 hours
+- Total plans completed: 13
+- Average duration: 5.5 min
+- Total execution time: 1.2 hours
 
 **By Phase:**
 
@@ -30,11 +30,17 @@ Progress: [██████░░░░] 60.0% (12/20 plans complete across al
 | 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
 | 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan |
 | 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min/plan |
+| 04 - Scoring Integration | 1/3 | 4 min | 4.0 min/plan |
+
+**Recent Plan Details:**
+| Plan | Duration | Tasks | Files |
+|------|----------|-------|-------|
 | Phase 03 P02 | 12 min | 2 tasks | 9 files |
 | Phase 03 P03 | 11 min | 2 tasks | 7 files |
 | Phase 03 P04 | 8 min | 2 tasks | 8 files |
 | Phase 03 P05 | 10 min | 2 tasks | 8 files |
 | Phase 03 P06 | 13 min | 2 tasks | 10 files |
+| Phase 04 P01 | 4 min | 2 tasks | 4 files |
 
 ## Accumulated Context
 
@@ -92,6 +98,11 @@ Recent decisions affecting current work:
 - [03-06]: Quality-weighted scoring uses log2 normalization to mitigate well-studied gene bias (prevents TP53-like dominance)
 - [03-06]: Context weights cilia/sensory=2.0, cytoskeleton/polarity=1.0 for primary target prioritization
 - [03-06]: Rate limiting via decorator pattern (3 req/sec default, 10 req/sec with NCBI API key)
+- [04-01]: OMIM Usher genes (10) and SYSCILIA SCGS v2 core (28) as known gene positive controls
+- [04-01]: NULL-preserving weighted average: weighted_sum / available_weight (only non-NULL layers contribute)
+- [04-01]: Quality flags based on evidence_count (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence)
+- [04-01]: Per-layer contribution tracking (score * weight) for explainability
+- [04-01]: ScoringWeights validation enforcing sum = 1.0 ± 1e-6 tolerance
 
 ### Pending Todos
 
@@ -104,5 +115,5 @@ None yet.
 ## Session Continuity
 
 Last session: 2026-02-11 - Plan execution
-Stopped at: Completed 03-06-PLAN.md (Literature Evidence layer) - Phase 3 complete
-Resume file: .planning/phases/03-core-evidence-layers/03-06-SUMMARY.md
+Stopped at: Completed 04-01-PLAN.md (Known gene compilation and weighted scoring)
+Resume file: .planning/phases/04-scoring-integration/04-01-SUMMARY.md
diff --git a/.planning/phases/04-scoring-integration/04-01-SUMMARY.md b/.planning/phases/04-scoring-integration/04-01-SUMMARY.md
new file mode 100644
index 0000000..a1b5183
--- /dev/null
+++ b/.planning/phases/04-scoring-integration/04-01-SUMMARY.md
@@ -0,0 +1,144 @@
+---
+phase: 04-scoring-integration
+plan: 01
+subsystem: scoring
+tags: [scoring, known-genes, weighted-average, duckdb, polars, multi-evidence]
+
+# Dependency graph
+requires:
+  - phase: 01-data-infrastructure
+    provides: PipelineStore, DuckDB persistence, gene_universe table
+  - phase: 02-prototype-evidence-layer
+    provides: gnomad_constraint table with loeuf_normalized
+  - phase: 03-core-evidence-layers
+    provides: tissue_expression, annotation_completeness, subcellular_localization, animal_model_phenotypes, literature_evidence tables
+provides:
+  - Known cilia/Usher gene set compilation (OMIM + SYSCILIA SCGS v2)
+  - ScoringWeights validation enforcing sum constraint
+  - Multi-evidence weighted scoring with NULL-preserving weighted average
+  - join_evidence_layers() - LEFT JOIN all 6 evidence tables
+  - compute_composite_scores() - weighted_sum / available_weight pattern
+  - persist_scored_genes() - DuckDB persistence with quality flags
+affects: [04-02, 04-03, ranking, filtering, validation]
+
+# Tech tracking
+tech-stack:
+  added: []
+  patterns:
+    - NULL-preserving weighted average (weighted_sum / available_weight)
+    - Evidence quality flags (sufficient/moderate/sparse/no_evidence)
+    - Per-layer contribution tracking for explainability
+    - Known gene compilation from multiple sources
+
+key-files:
+  created:
+    - src/usher_pipeline/scoring/__init__.py
+    - src/usher_pipeline/scoring/known_genes.py
+    - src/usher_pipeline/scoring/integration.py
+  modified:
+    - src/usher_pipeline/config/schema.py
+
+key-decisions:
+  - "OMIM Usher genes: 10 genes as disease positive controls"
+  - "SYSCILIA SCGS v2 core: ~28 genes as ciliary positive controls (subset of 686 full list)"
+  - "Known genes preserve multi-source provenance (duplicate gene_symbols with different sources)"
+  - "ScoringWeights validation: sum must equal 1.0 ± 1e-6 tolerance"
+  - "NULL-preserving weighted average: available_weight = sum of weights for non-NULL layers only"
+  - "Quality flags based on evidence_count: >=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence"
+  - "Per-layer contributions computed as score * weight (NULL if score is NULL) for explainability"
+
+patterns-established:
+  - "NULL-preserving scoring pattern: COALESCE for weighted_sum, but available_weight excludes NULL layers"
+  - "LEFT JOIN all evidence tables to gene_universe preserving all genes"
+  - "Quality flag classification based on evidence layer count"
+  - "Contribution tracking for transparency (each layer's impact on composite score)"
+
+# Metrics
+duration: 4min
+completed: 2026-02-11
+---
+
+# Phase 04 Plan 01: Known Gene Compilation and Multi-Evidence Scoring Summary
+
+**NULL-preserving weighted scoring engine joining 6 evidence layers with configurable weights, plus OMIM/SYSCILIA known gene compilation for validation**
+
+## Performance
+
+- **Duration:** 4 minutes (228 seconds)
+- **Started:** 2026-02-11T12:38:05Z
+- **Completed:** 2026-02-11T12:42:13Z
+- **Tasks:** 2
+- **Files modified:** 4
+
+## Accomplishments
+- Compiled 38 known cilia/Usher genes from OMIM (10 genes) and SYSCILIA SCGS v2 core (28 genes)
+- Implemented ScoringWeights.validate_sum() enforcing weight sum constraint (1.0 ± 1e-6)
+- Created join_evidence_layers() LEFT JOINing all 6 evidence tables preserving NULLs
+- Built compute_composite_scores() with NULL-preserving weighted average (weighted_sum / available_weight)
+- Added quality flag classification (sufficient/moderate/sparse/no_evidence) based on evidence count
+- Included per-layer contribution columns for explainability
+
+## Task Commits
+
+Each task was committed atomically:
+
+1. **Task 1: Known gene compilation and ScoringWeights validation** - `0cd2f7c` (feat)
+2. **Task 2: Multi-evidence weighted scoring integration** - `f441e8c` (feat)
+
+## Files Created/Modified
+- `src/usher_pipeline/scoring/__init__.py` - Scoring module exports
+- `src/usher_pipeline/scoring/known_genes.py` - OMIM_USHER_GENES (10), SYSCILIA_SCGS_V2_CORE (28), compile_known_genes()
+- `src/usher_pipeline/scoring/integration.py` - join_evidence_layers(), compute_composite_scores(), persist_scored_genes()
+- `src/usher_pipeline/config/schema.py` - Added ScoringWeights.validate_sum() method
+
+## Decisions Made
+
+1. **Known gene curation:** Limited SYSCILIA SCGS v2 to ~28 core genes (subset of 686 full list) for initial positive control validation. Future enhancement can add fetch_scgs_v2() to download complete list from publication supplementary data.
+
+2. **Multi-source provenance:** compile_known_genes() does NOT de-duplicate gene_symbols across sources. A gene appearing in both OMIM and SYSCILIA will have two rows (one per source). This preserves provenance for validation and analysis.
+
+3. **NULL-preserving weighted average:** Implemented weighted_sum / available_weight pattern where available_weight = sum of weights for non-NULL layers only. Genes with 0 evidence layers receive NULL composite_score (not 0), preserving semantic distinction between "no evidence" and "weak evidence".
+
+4. **Quality flags:** Classification based on evidence_count thresholds (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence) to guide downstream filtering and prioritization.
+
+5. **Explainability:** Per-layer contribution columns (score * weight) enable tracing which evidence layers drove a gene's composite score. Critical for manual review and trust.
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Issues Encountered
+
+None. Both verification tests passed on first attempt.
+
+## User Setup Required
+
+None - no external service configuration required.
+
+## Next Phase Readiness
+
+Ready for Phase 04 Plan 02 (ranked candidate list generation):
+- Known gene set compiled and ready for exclusion filtering
+- Composite scoring engine functional with NULL preservation
+- Quality flags available for filtering
+- Per-layer contributions available for ranking criteria
+
+No blockers. Next plan can implement:
+- Exclusion of known genes
+- Ranking by composite score
+- Quality flag filtering
+- Top-N candidate selection
+
+## Self-Check: PASSED
+
+All claimed files and commits verified:
+- src/usher_pipeline/scoring/__init__.py - FOUND
+- src/usher_pipeline/scoring/known_genes.py - FOUND
+- src/usher_pipeline/scoring/integration.py - FOUND
+- Commit 0cd2f7c (Task 1) - FOUND
+- Commit f441e8c (Task 2) - FOUND
+
+---
+*Phase: 04-scoring-integration*
+*Plan: 01*
+*Completed: 2026-02-11*
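
Review note on the scoring decisions recorded in 04-01-SUMMARY.md: the summary describes the weight-sum check, the NULL-preserving weighted average, the quality flags, and the per-layer contributions only in prose. The sketch below illustrates how those four rules could fit together for a single gene. It is not the code in `integration.py`, which reportedly works on joined DuckDB tables. The layer names, the weight values, and the `score_gene()` and `quality_flag()` helpers are hypothetical and chosen only for illustration; only `ScoringWeights.validate_sum()`, the 1e-6 tolerance, and the flag thresholds come from the summary.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass(frozen=True)
class ScoringWeights:
    # Hypothetical layer names and weights; the summary does not list them.
    gnomad: float = 0.20
    expression: float = 0.20
    annotation: float = 0.15
    localization: float = 0.15
    animal_model: float = 0.15
    literature: float = 0.15

    def validate_sum(self, tol: float = 1e-6) -> None:
        """Raise ValueError unless the weights sum to 1.0 within tol."""
        total = sum(getattr(self, f.name) for f in fields(self))
        if abs(total - 1.0) > tol:
            raise ValueError(f"weights sum to {total}, expected 1.0 +/- {tol}")


def quality_flag(evidence_count: int) -> str:
    """Thresholds as stated in the summary: >=4, >=2, >=1, 0."""
    if evidence_count >= 4:
        return "sufficient"
    if evidence_count >= 2:
        return "moderate"
    if evidence_count >= 1:
        return "sparse"
    return "no_evidence"


def score_gene(scores: Dict[str, Optional[float]], weights: ScoringWeights) -> dict:
    """Score one gene; a missing or None layer stands in for a SQL NULL."""
    weights.validate_sum()
    contributions: Dict[str, Optional[float]] = {}
    weighted_sum = 0.0
    available_weight = 0.0
    for f in fields(weights):
        weight = getattr(weights, f.name)
        score = scores.get(f.name)
        if score is None:
            # NULL layer: no contribution, and its weight is not counted.
            contributions[f.name] = None
            continue
        contributions[f.name] = score * weight
        weighted_sum += score * weight
        available_weight += weight
    evidence_count = sum(c is not None for c in contributions.values())
    # No usable evidence yields None rather than 0.0, keeping
    # "no evidence" distinct from "weak evidence".
    composite = weighted_sum / available_weight if available_weight > 0 else None
    return {
        "composite_score": composite,
        "evidence_count": evidence_count,
        "quality_flag": quality_flag(evidence_count),
        "contributions": contributions,
    }


example = score_gene({"gnomad": 0.8, "expression": 0.6}, ScoringWeights())
print(example["composite_score"], example["quality_flag"])
```

With these assumed weights, a gene that has only the two layers above is scored as (0.8 × 0.20 + 0.6 × 0.20) / 0.40 = 0.7 and flagged `moderate`. Dividing by 1.0 instead of `available_weight` would give 0.28 and penalize the gene for evidence that is missing rather than weak, which is the behaviour the summary says it wants to avoid.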