Files
gbanyan 71c4e8f736 docs(04-01): complete known gene compilation and weighted scoring plan
- Known genes: 38 (10 OMIM Usher + 28 SYSCILIA SCGS v2 core)
- ScoringWeights.validate_sum() enforcing weight sum = 1.0
- NULL-preserving weighted average (weighted_sum / available_weight)
- Quality flags based on evidence_count thresholds
- Per-layer contributions for explainability
- 2 tasks, 4 files, 4 min duration
2026-02-11 20:44:09 +08:00

6.2 KiB

phase, plan, subsystem, tags, requires, provides, affects, tech-stack, key-files, key-decisions, patterns-established, duration, completed
phase plan subsystem tags requires provides affects tech-stack key-files key-decisions patterns-established duration completed
04-scoring-integration 01 scoring
scoring
known-genes
weighted-average
duckdb
polars
multi-evidence
phase provides
01-data-infrastructure PipelineStore, DuckDB persistence, gene_universe table
phase provides
02-prototype-evidence-layer gnomad_constraint table with loeuf_normalized
phase provides
03-core-evidence-layers tissue_expression, annotation_completeness, subcellular_localization, animal_model_phenotypes, literature_evidence tables
Known cilia/Usher gene set compilation (OMIM + SYSCILIA SCGS v2)
ScoringWeights validation enforcing sum constraint
Multi-evidence weighted scoring with NULL-preserving weighted average
join_evidence_layers() - LEFT JOIN all 6 evidence tables
compute_composite_scores() - weighted_sum / available_weight pattern
persist_scored_genes() - DuckDB persistence with quality flags
04-02
04-03
ranking
filtering
validation
added patterns
NULL-preserving weighted average (weighted_sum / available_weight)
Evidence quality flags (sufficient/moderate/sparse/no_evidence)
Per-layer contribution tracking for explainability
Known gene compilation from multiple sources
created modified
src/usher_pipeline/scoring/__init__.py
src/usher_pipeline/scoring/known_genes.py
src/usher_pipeline/scoring/integration.py
src/usher_pipeline/config/schema.py
OMIM Usher genes: 10 genes as disease positive controls
SYSCILIA SCGS v2 core: ~28 genes as ciliary positive controls (subset of 686 full list)
Known genes preserve multi-source provenance (duplicate gene_symbols with different sources)
ScoringWeights validation: sum must equal 1.0 ± 1e-6 tolerance
NULL-preserving weighted average: available_weight = sum of weights for non-NULL layers only
Quality flags based on evidence_count: >=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence
Per-layer contributions computed as score * weight (NULL if score is NULL) for explainability
NULL-preserving scoring pattern: COALESCE for weighted_sum, but available_weight excludes NULL layers
LEFT JOIN all evidence tables to gene_universe preserving all genes
Quality flag classification based on evidence layer count
Contribution tracking for transparency (each layer's impact on composite score)
4min 2026-02-11

Phase 04 Plan 01: Known Gene Compilation and Multi-Evidence Scoring Summary

NULL-preserving weighted scoring engine joining 6 evidence layers with configurable weights, plus OMIM/SYSCILIA known gene compilation for validation

Performance

  • Duration: 4 minutes (228 seconds)
  • Started: 2026-02-11T12:38:05Z
  • Completed: 2026-02-11T12:42:13Z
  • Tasks: 2
  • Files modified: 4

Accomplishments

  • Compiled 38 known cilia/Usher genes from OMIM (10 genes) and SYSCILIA SCGS v2 core (28 genes)
  • Implemented ScoringWeights.validate_sum() enforcing weight sum constraint (1.0 ± 1e-6)
  • Created join_evidence_layers() LEFT JOINing all 6 evidence tables preserving NULLs
  • Built compute_composite_scores() with NULL-preserving weighted average (weighted_sum / available_weight)
  • Added quality flag classification (sufficient/moderate/sparse/no_evidence) based on evidence count
  • Included per-layer contribution columns for explainability

Task Commits

Each task was committed atomically:

  1. Task 1: Known gene compilation and ScoringWeights validation - 0cd2f7c (feat)
  2. Task 2: Multi-evidence weighted scoring integration - f441e8c (feat)

Files Created/Modified

  • src/usher_pipeline/scoring/__init__.py - Scoring module exports
  • src/usher_pipeline/scoring/known_genes.py - OMIM_USHER_GENES (10), SYSCILIA_SCGS_V2_CORE (28), compile_known_genes()
  • src/usher_pipeline/scoring/integration.py - join_evidence_layers(), compute_composite_scores(), persist_scored_genes()
  • src/usher_pipeline/config/schema.py - Added ScoringWeights.validate_sum() method

Decisions Made

  1. Known gene curation: Limited SYSCILIA SCGS v2 to ~28 core genes (subset of 686 full list) for initial positive control validation. Future enhancement can add fetch_scgs_v2() to download complete list from publication supplementary data.

  2. Multi-source provenance: compile_known_genes() does NOT de-duplicate gene_symbols across sources. A gene appearing in both OMIM and SYSCILIA will have two rows (one per source). This preserves provenance for validation and analysis.

  3. NULL-preserving weighted average: Implemented weighted_sum / available_weight pattern where available_weight = sum of weights for non-NULL layers only. Genes with 0 evidence layers receive NULL composite_score (not 0), preserving semantic distinction between "no evidence" and "weak evidence".

  4. Quality flags: Classification based on evidence_count thresholds (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence) to guide downstream filtering and prioritization.

  5. Explainability: Per-layer contribution columns (score * weight) enable tracing which evidence layers drove a gene's composite score. Critical for manual review and trust.

Deviations from Plan

None - plan executed exactly as written.

Issues Encountered

None. Both verification tests passed on first attempt.

User Setup Required

None - no external service configuration required.

Next Phase Readiness

Ready for Phase 04 Plan 02 (ranked candidate list generation):

  • Known gene set compiled and ready for exclusion filtering
  • Composite scoring engine functional with NULL preservation
  • Quality flags available for filtering
  • Per-layer contributions available for ranking criteria

No blockers. Next plan can implement:

  • Exclusion of known genes
  • Ranking by composite score
  • Quality flag filtering
  • Top-N candidate selection

Self-Check: PASSED

All claimed files and commits verified:

  • src/usher_pipeline/scoring/init.py - FOUND
  • src/usher_pipeline/scoring/known_genes.py - FOUND
  • src/usher_pipeline/scoring/integration.py - FOUND
  • Commit 0cd2f7c (Task 1) - FOUND
  • Commit f441e8c (Task 2) - FOUND

Phase: 04-scoring-integration Plan: 01 Completed: 2026-02-11