docs(04-01): complete known gene compilation and weighted scoring plan
- Known genes: 38 (10 OMIM Usher + 28 SYSCILIA SCGS v2 core)
- ScoringWeights.validate_sum() enforcing weight sum = 1.0
- NULL-preserving weighted average (weighted_sum / available_weight)
- Quality flags based on evidence_count thresholds
- Per-layer contributions for explainability
- 2 tasks, 4 files, 4 min duration
@@ -5,23 +5,23 @@
 See: .planning/PROJECT.md (updated 2026-02-11)
 
 **Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
-**Current focus:** Phase 3 complete — ready for Phase 4
+**Current focus:** Phase 4 in progress — Scoring and Integration
 
 ## Current Position
 
-Phase: 3 of 6 (Core Evidence Layers)
-Plan: 6 of 6 in current phase (phase complete)
-Status: Phase 3 complete — verified (6/6 success criteria, 20/20 requirements)
-Last activity: 2026-02-11 — Phase 3 verified and complete
+Phase: 4 of 6 (Scoring and Integration)
+Plan: 1 of 3 in current phase (in progress)
+Status: Plan 04-01 complete — known gene compilation and weighted scoring integration
+Last activity: 2026-02-11 — Completed 04-01-PLAN.md
 
-Progress: [██████░░░░] 60.0% (12/20 plans complete across all phases)
+Progress: [██████░░░░] 65.0% (13/20 plans complete across all phases)
 
 ## Performance Metrics
 
 **Velocity:**
-- Total plans completed: 12
-- Average duration: 5.6 min
-- Total execution time: 1.1 hours
+- Total plans completed: 13
+- Average duration: 5.5 min
+- Total execution time: 1.2 hours
 
 **By Phase:**
 
@@ -30,11 +30,17 @@ Progress: [██████░░░░] 60.0% (12/20 plans complete across al
 | 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
 | 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan |
 | 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min/plan |
+| 04 - Scoring Integration | 1/3 | 4 min | 4.0 min/plan |
+
+**Recent Plan Details:**
+| Plan | Duration | Tasks | Files |
+|------|----------|-------|-------|
 | Phase 03 P02 | 12 min | 2 tasks | 9 files |
 | Phase 03 P03 | 11 min | 2 tasks | 7 files |
 | Phase 03 P04 | 8 min | 2 tasks | 8 files |
 | Phase 03 P05 | 10 min | 2 tasks | 8 files |
 | Phase 03 P06 | 13 min | 2 tasks | 10 files |
+| Phase 04 P01 | 4 min | 2 tasks | 4 files |
 
 ## Accumulated Context
 
@@ -92,6 +98,11 @@ Recent decisions affecting current work:
 - [03-06]: Quality-weighted scoring uses log2 normalization to mitigate well-studied gene bias (prevents TP53-like dominance)
 - [03-06]: Context weights cilia/sensory=2.0, cytoskeleton/polarity=1.0 for primary target prioritization
 - [03-06]: Rate limiting via decorator pattern (3 req/sec default, 10 req/sec with NCBI API key)
+- [04-01]: OMIM Usher genes (10) and SYSCILIA SCGS v2 core (28) as known gene positive controls
+- [04-01]: NULL-preserving weighted average: weighted_sum / available_weight (only non-NULL layers contribute)
+- [04-01]: Quality flags based on evidence_count (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence)
+- [04-01]: Per-layer contribution tracking (score * weight) for explainability
+- [04-01]: ScoringWeights validation enforcing sum = 1.0 ± 1e-6 tolerance
 
 ### Pending Todos
 
@@ -104,5 +115,5 @@ None yet.
 ## Session Continuity
 
 Last session: 2026-02-11 - Plan execution
-Stopped at: Completed 03-06-PLAN.md (Literature Evidence layer) - Phase 3 complete
-Resume file: .planning/phases/03-core-evidence-layers/03-06-SUMMARY.md
+Stopped at: Completed 04-01-PLAN.md (Known gene compilation and weighted scoring)
+Resume file: .planning/phases/04-scoring-integration/04-01-SUMMARY.md
144
.planning/phases/04-scoring-integration/04-01-SUMMARY.md
Normal file
@@ -0,0 +1,144 @@
---
phase: 04-scoring-integration
plan: 01
subsystem: scoring
tags: [scoring, known-genes, weighted-average, duckdb, polars, multi-evidence]

# Dependency graph
requires:
  - phase: 01-data-infrastructure
    provides: PipelineStore, DuckDB persistence, gene_universe table
  - phase: 02-prototype-evidence-layer
    provides: gnomad_constraint table with loeuf_normalized
  - phase: 03-core-evidence-layers
    provides: tissue_expression, annotation_completeness, subcellular_localization, animal_model_phenotypes, literature_evidence tables
provides:
  - Known cilia/Usher gene set compilation (OMIM + SYSCILIA SCGS v2)
  - ScoringWeights validation enforcing sum constraint
  - Multi-evidence weighted scoring with NULL-preserving weighted average
  - join_evidence_layers() - LEFT JOIN all 6 evidence tables
  - compute_composite_scores() - weighted_sum / available_weight pattern
  - persist_scored_genes() - DuckDB persistence with quality flags
affects: [04-02, 04-03, ranking, filtering, validation]

# Tech tracking
tech-stack:
  added: []
  patterns:
    - NULL-preserving weighted average (weighted_sum / available_weight)
    - Evidence quality flags (sufficient/moderate/sparse/no_evidence)
    - Per-layer contribution tracking for explainability
    - Known gene compilation from multiple sources

key-files:
  created:
    - src/usher_pipeline/scoring/__init__.py
    - src/usher_pipeline/scoring/known_genes.py
    - src/usher_pipeline/scoring/integration.py
  modified:
    - src/usher_pipeline/config/schema.py

key-decisions:
  - "OMIM Usher genes: 10 genes as disease positive controls"
  - "SYSCILIA SCGS v2 core: ~28 genes as ciliary positive controls (subset of 686 full list)"
  - "Known genes preserve multi-source provenance (duplicate gene_symbols with different sources)"
  - "ScoringWeights validation: sum must equal 1.0 ± 1e-6 tolerance"
  - "NULL-preserving weighted average: available_weight = sum of weights for non-NULL layers only"
  - "Quality flags based on evidence_count: >=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence"
  - "Per-layer contributions computed as score * weight (NULL if score is NULL) for explainability"

patterns-established:
  - "NULL-preserving scoring pattern: COALESCE for weighted_sum, but available_weight excludes NULL layers"
  - "LEFT JOIN all evidence tables to gene_universe preserving all genes"
  - "Quality flag classification based on evidence layer count"
  - "Contribution tracking for transparency (each layer's impact on composite score)"

# Metrics
duration: 4min
completed: 2026-02-11
---

# Phase 04 Plan 01: Known Gene Compilation and Multi-Evidence Scoring Summary

**NULL-preserving weighted scoring engine joining 6 evidence layers with configurable weights, plus OMIM/SYSCILIA known gene compilation for validation**

## Performance

- **Duration:** 4 minutes (228 seconds)
- **Started:** 2026-02-11T12:38:05Z
- **Completed:** 2026-02-11T12:42:13Z
- **Tasks:** 2
- **Files modified:** 4

## Accomplishments
- Compiled 38 known cilia/Usher genes from OMIM (10 genes) and SYSCILIA SCGS v2 core (28 genes)
- Implemented ScoringWeights.validate_sum() enforcing weight sum constraint (1.0 ± 1e-6)
- Created join_evidence_layers() LEFT JOINing all 6 evidence tables preserving NULLs
- Built compute_composite_scores() with NULL-preserving weighted average (weighted_sum / available_weight)
- Added quality flag classification (sufficient/moderate/sparse/no_evidence) based on evidence count
- Included per-layer contribution columns for explainability
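
The weight-sum constraint listed above can be sketched roughly as follows. This is an illustration only, not the code in `src/usher_pipeline/config/schema.py`: the six field names and their default values are hypothetical, and the real `ScoringWeights` may be built on a different model class. Only the `validate_sum()` name and the 1.0 ± 1e-6 tolerance come from this summary.

```python
import math
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ScoringWeights:
    # Hypothetical field names and defaults, one per evidence layer.
    gnomad: float = 0.20
    expression: float = 0.20
    annotation: float = 0.15
    localization: float = 0.15
    animal_model: float = 0.15
    literature: float = 0.15

    def validate_sum(self, tolerance: float = 1e-6) -> None:
        """Raise ValueError unless the weights sum to 1.0 within the tolerance."""
        # fsum avoids accumulating floating-point error across the six terms.
        total = math.fsum(getattr(self, f.name) for f in fields(self))
        if abs(total - 1.0) > tolerance:
            raise ValueError(f"Scoring weights must sum to 1.0, got {total:.6f}")


# The defaults pass; a skewed configuration is rejected.
ScoringWeights().validate_sum()
```

Checking against a tolerance rather than exact equality matters because weights such as 0.15 are not exactly representable as binary floats.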

## Task Commits

Each task was committed atomically:

1. **Task 1: Known gene compilation and ScoringWeights validation** - `0cd2f7c` (feat)
2. **Task 2: Multi-evidence weighted scoring integration** - `f441e8c` (feat)

## Files Created/Modified
- `src/usher_pipeline/scoring/__init__.py` - Scoring module exports
- `src/usher_pipeline/scoring/known_genes.py` - OMIM_USHER_GENES (10), SYSCILIA_SCGS_V2_CORE (28), compile_known_genes()
- `src/usher_pipeline/scoring/integration.py` - join_evidence_layers(), compute_composite_scores(), persist_scored_genes()
- `src/usher_pipeline/config/schema.py` - Added ScoringWeights.validate_sum() method
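
As a rough picture of what `compile_known_genes()` in `known_genes.py` produces, the sketch below builds one row per gene and source without de-duplicating symbols. The gene symbols are placeholders, not the curated lists, and the source labels and the list-of-dicts return type are assumptions (the real function may well return a DataFrame).

```python
from typing import Dict, List

# Placeholder symbols only; the real lists hold 10 and 28 curated genes.
OMIM_USHER_GENES = ["GENE_A", "GENE_B"]
SYSCILIA_SCGS_V2_CORE = ["GENE_B", "GENE_C"]


def compile_known_genes() -> List[Dict[str, str]]:
    """Return one row per (gene_symbol, source) pair, keeping cross-source duplicates."""
    rows = [{"gene_symbol": g, "source": "omim_usher"} for g in OMIM_USHER_GENES]
    rows += [{"gene_symbol": g, "source": "syscilia_scgs_v2"} for g in SYSCILIA_SCGS_V2_CORE]
    return rows


known = compile_known_genes()
# Row count and unique-symbol count differ whenever a gene appears in both sources.
unique_symbols = {row["gene_symbol"] for row in known}
```

Downstream code that needs a plain exclusion set can take the unique symbols, while validation code can still see which source each gene came from.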

## Decisions Made

1. **Known gene curation:** Limited SYSCILIA SCGS v2 to ~28 core genes (subset of 686 full list) for initial positive control validation. Future enhancement can add fetch_scgs_v2() to download complete list from publication supplementary data.

2. **Multi-source provenance:** compile_known_genes() does NOT de-duplicate gene_symbols across sources. A gene appearing in both OMIM and SYSCILIA will have two rows (one per source). This preserves provenance for validation and analysis.

3. **NULL-preserving weighted average:** Implemented weighted_sum / available_weight pattern where available_weight = sum of weights for non-NULL layers only. Genes with 0 evidence layers receive NULL composite_score (not 0), preserving semantic distinction between "no evidence" and "weak evidence".

4. **Quality flags:** Classification based on evidence_count thresholds (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence) to guide downstream filtering and prioritization.

5. **Explainability:** Per-layer contribution columns (score * weight) enable tracing which evidence layers drove a gene's composite score. Critical for manual review and trust.
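
Decisions 3 to 5 can be illustrated for a single gene with plain Python. This is a sketch of the arithmetic only: the summary describes the real implementation as operating on joined DuckDB tables, and the function names, layer keys, and example weights below are hypothetical, with `None` standing in for SQL NULL.

```python
from typing import Dict, Mapping, Optional


def classify_quality(evidence_count: int) -> str:
    """Map the number of non-NULL evidence layers to a quality flag."""
    if evidence_count >= 4:
        return "sufficient"
    if evidence_count >= 2:
        return "moderate"
    if evidence_count >= 1:
        return "sparse"
    return "no_evidence"


def score_gene(
    scores: Mapping[str, Optional[float]], weights: Mapping[str, float]
) -> Dict[str, object]:
    """NULL-preserving weighted average with per-layer contributions."""
    contributions: Dict[str, Optional[float]] = {}
    weighted_sum = 0.0
    available_weight = 0.0
    evidence_count = 0
    for layer, weight in weights.items():
        score = scores.get(layer)
        if score is None:
            contributions[layer] = None  # missing layer stays NULL, not 0
            continue
        contributions[layer] = score * weight
        weighted_sum += score * weight
        available_weight += weight  # only non-NULL layers add weight
        evidence_count += 1
    composite = weighted_sum / available_weight if available_weight > 0 else None
    return {
        "composite_score": composite,
        "evidence_count": evidence_count,
        "quality_flag": classify_quality(evidence_count),
        "contributions": contributions,
    }


# Hypothetical three-layer example with one missing layer.
example_weights = {"layer_a": 0.5, "layer_b": 0.25, "layer_c": 0.25}
result = score_gene({"layer_a": 0.8, "layer_b": None, "layer_c": 0.4}, example_weights)
empty = score_gene({}, example_weights)
```

Dividing by `available_weight` rather than the full weight total means a gene is not penalized for layers that were never measured, and a gene with no layers at all keeps a `None` score instead of being ranked as if it scored zero.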

## Deviations from Plan

None - plan executed exactly as written.

## Issues Encountered

None. Both verification tests passed on first attempt.

## User Setup Required

None - no external service configuration required.

## Next Phase Readiness

Ready for Phase 04 Plan 02 (ranked candidate list generation):
- Known gene set compiled and ready for exclusion filtering
- Composite scoring engine functional with NULL preservation
- Quality flags available for filtering
- Per-layer contributions available for ranking criteria

No blockers. Next plan can implement:
- Exclusion of known genes
- Ranking by composite score
- Quality flag filtering
- Top-N candidate selection
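
One possible shape for those four steps is sketched below. Nothing here is taken from Plan 04-02, which has not been written at this point: the function name, its parameters, and the row layout are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def rank_candidates(
    scored: List[Dict[str, object]],
    known_symbols: Set[str],
    allowed_flags: Set[str],
    top_n: int,
) -> List[Dict[str, object]]:
    """Drop known genes, unscored genes, and disallowed flags, then rank by score."""
    eligible = [
        row
        for row in scored
        if row["gene_symbol"] not in known_symbols
        and row["composite_score"] is not None
        and row["quality_flag"] in allowed_flags
    ]
    # Highest composite score first.
    eligible.sort(key=lambda row: row["composite_score"], reverse=True)
    return eligible[:top_n]


# Placeholder rows; symbols and scores are made up for the example.
rows = [
    {"gene_symbol": "GENE_A", "composite_score": 0.9, "quality_flag": "sufficient"},
    {"gene_symbol": "GENE_D", "composite_score": 0.7, "quality_flag": "moderate"},
    {"gene_symbol": "GENE_E", "composite_score": None, "quality_flag": "no_evidence"},
    {"gene_symbol": "GENE_F", "composite_score": 0.8, "quality_flag": "sparse"},
    {"gene_symbol": "GENE_G", "composite_score": 0.6, "quality_flag": "sufficient"},
]
top = rank_candidates(rows, {"GENE_A"}, {"sufficient", "moderate"}, top_n=2)
```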

## Self-Check: PASSED

All claimed files and commits verified:
- src/usher_pipeline/scoring/__init__.py - FOUND
- src/usher_pipeline/scoring/known_genes.py - FOUND
- src/usher_pipeline/scoring/integration.py - FOUND
- Commit 0cd2f7c (Task 1) - FOUND
- Commit f441e8c (Task 2) - FOUND

---
*Phase: 04-scoring-integration*
*Plan: 01*
*Completed: 2026-02-11*