Roadmap: Usher Cilia Candidate Gene Discovery Pipeline
Overview
This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates. The work progresses from foundational data infrastructure, through seven independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), to multi-evidence scoring with transparent weights and tiered output generation. Each phase delivers testable capabilities that compound toward a fully traceable, reproducible gene prioritization system.
Phases
Phase Numbering:
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
Decimal phases appear between their surrounding integers in numeric order.
- Phase 1: Data Infrastructure - Foundation for reproducible, modular pipeline
- Phase 2: Prototype Evidence Layer - Validate retrieval-to-storage architecture
- Phase 3: Core Evidence Layers - Parallel multi-source data retrieval
- Phase 4: Scoring & Integration - Multi-evidence weighted scoring system
- Phase 5: Output & CLI - User-facing interface and tiered results
- Phase 6: Validation - Benchmark scoring against known genes
Phase Details
Phase 1: Data Infrastructure
Goal: Establish reproducible data foundation and gene ID mapping utilities
Depends on: Nothing (first phase)
Requirements: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07
Success Criteria (what must be TRUE):
- Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions
- Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs
- API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching
- DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading
- Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash
Plans: 4 plans
Plans:
- 01-01-PLAN.md -- Project scaffold, config system, and base API client
- 01-02-PLAN.md -- Gene ID mapping with validation gates
- 01-03-PLAN.md -- DuckDB persistence and provenance tracking
- 01-04-PLAN.md -- CLI integration and end-to-end wiring
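The provenance criterion above could be met along these lines. This is a minimal sketch using only the standard library, not the project's actual implementation: the function names `config_hash` and `build_provenance` and the record's field names are assumptions made for illustration. The config is serialized with sorted keys so that two configs with the same content hash identically regardless of key order.

```python
import hashlib
import json
from datetime import datetime, timezone


def config_hash(config: dict) -> str:
    """Return a SHA-256 hex digest of a config dict, independent of key order."""
    # Canonical JSON: sorted keys and fixed separators give a stable byte string.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_provenance(config: dict, pipeline_version: str, source_versions: dict) -> dict:
    """Assemble a provenance record to attach to a pipeline output (hypothetical schema)."""
    return {
        "pipeline_version": pipeline_version,
        "data_source_versions": dict(source_versions),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "config_hash": config_hash(config),
    }
```

Hashing the canonical form rather than the raw YAML text means that reordering keys or changing whitespace in the config file does not change the hash, while any change to a parameter value does.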
Phase 2: Prototype Evidence Layer
Goal: Validate retrieval-to-storage pattern with single evidence layer
Depends on: Phase 1
Requirements: GCON-01, GCON-02, GCON-03
Success Criteria (what must be TRUE):
- Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes
- Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags
- Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage
- Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability
Plans: 2 plans
Plans:
- 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization
- 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests
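The coverage filter and the "unknown rather than zero" rule from the success criteria can be illustrated as follows. This is a sketch under assumptions: the `ConstraintRecord` fields and the flag names (`measured`, `low_coverage`, `unknown`) are hypothetical, and only the two thresholds come from the criteria above.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MEAN_DEPTH = 30.0   # success criterion: mean depth > 30x
MIN_CDS_COVERED = 0.90  # success criterion: > 90% of CDS covered


@dataclass
class ConstraintRecord:
    """One gene's constraint data; None marks a value that was not available."""
    gene_id: str
    loeuf: Optional[float]
    mean_depth: Optional[float]
    cds_covered: Optional[float]  # fraction in [0, 1]


def quality_flag(rec: ConstraintRecord) -> str:
    """Classify a record as measured, low_coverage, or unknown (hypothetical labels)."""
    if rec.loeuf is None or rec.mean_depth is None or rec.cds_covered is None:
        return "unknown"
    if rec.mean_depth > MIN_MEAN_DEPTH and rec.cds_covered > MIN_CDS_COVERED:
        return "measured"
    return "low_coverage"


def usable_loeuf(rec: ConstraintRecord) -> Optional[float]:
    """Return LOEUF only when coverage is adequate; otherwise None, never 0.0."""
    return rec.loeuf if quality_flag(rec) == "measured" else None
```

Returning `None` instead of `0.0` keeps a poorly covered gene in the table with a flag, so that downstream scoring can treat it as missing evidence rather than as evidence of strong constraint.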
Phase 3: Core Evidence Layers
Goal: Complete all remaining evidence retrieval modules
Depends on: Phase 2
Requirements: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03
Success Criteria (what must be TRUE):
- Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification
- Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics
- Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features
- Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions
- Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring
- Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring
Plans: 6 plans
Plans:
- 03-01-PLAN.md -- Gene annotation completeness (GO terms, UniProt scores, pathway membership, tier classification)
- 03-02-PLAN.md -- Tissue expression (HPA, GTEx, CellxGene with Tau specificity and enrichment scoring)
- 03-03-PLAN.md -- Protein sequence/structure features (UniProt/InterPro domains, cilia motifs, normalization)
- 03-04-PLAN.md -- Subcellular localization (HPA subcellular, cilia proteomics, evidence type distinction)
- 03-05-PLAN.md -- Animal model phenotypes (MGI, ZFIN, IMPC with HCOP ortholog mapping)
- 03-06-PLAN.md -- Literature evidence (PubMed queries, evidence tier classification, quality-weighted scoring)
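Plan 03-02 mentions Tau specificity. As an illustration, the standard Tau tissue-specificity index can be computed as below. This is a sketch, assuming non-negative expression values (for example TPM) with one value per tissue; the function name and the choice to return `None` for uninformative inputs are assumptions, not the pipeline's confirmed behavior.

```python
from typing import Optional, Sequence


def tau_specificity(expression: Sequence[float]) -> Optional[float]:
    """Tau index: 0.0 for uniform expression, 1.0 for expression in a single tissue.

    Returns None when Tau is undefined (fewer than two tissues, or no expression),
    so the caller can record the gene as "unknown" instead of scoring it as zero.
    """
    n = len(expression)
    if n < 2:
        return None
    x_max = max(expression)
    if x_max <= 0:
        return None
    # Sum of (1 - x_i / x_max) over tissues, normalized by (n - 1).
    return sum(1.0 - x / x_max for x in expression) / (n - 1)
```

A gene expressed equally in every tissue gets Tau 0.0, and a gene detected in only one tissue gets Tau 1.0, which is why the index is useful for flagging retina- or inner-ear-restricted expression.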
Phase 4: Scoring & Integration
Goal: Multi-evidence weighted scoring with known gene validation
Depends on: Phase 3
Requirements: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05
Success Criteria (what must be TRUE):
- Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled both as an exclusion set and as positive controls
- Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene
- Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers
- Known cilia/Usher genes rank highly before exclusion, validating that the scoring system works
- Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
Plans: 3 plans
Plans:
- 04-01-PLAN.md -- Known gene compilation, weight validation, and multi-evidence scoring integration
- 04-02-PLAN.md -- Quality control checks and positive control validation
- 04-03-PLAN.md -- CLI score command and unit/integration tests
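One way to satisfy the "unknown rather than penalized" criterion is a weighted average taken only over layers that have a score. The sketch below is one possible reading, not the confirmed scoring rule: the function name, the layer keys, and the renormalization by available weight are all assumptions.

```python
from typing import Mapping, Optional


def composite_score(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted mean of per-layer scores, skipping layers whose score is None.

    Dividing by the weight of the layers actually present means a missing layer
    neither adds to nor subtracts from the composite. Returns None when no
    weighted layer has evidence.
    """
    known = {
        layer: score
        for layer, score in scores.items()
        if score is not None and layer in weights
    }
    total_weight = sum(weights[layer] for layer in known)
    if total_weight <= 0:
        return None
    return sum(weights[layer] * score for layer, score in known.items()) / total_weight
```

Because renormalization can inflate the composite of a gene supported by a single layer, a design like this would normally be paired with a count of contributing layers, which fits the Phase 5 criterion that tiers depend on evidence breadth as well as score.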
Phase 5: Output & CLI
Goal: User-facing interface and structured tiered output
Depends on: Phase 4
Requirements: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05
Success Criteria (what must be TRUE):
- Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
- Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps
- Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools
- Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown
- Unified CLI provides subcommands for running layers, integration, and reporting with progress logging
- Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics
Plans: TBD
Plans: (to be created during plan-phase)
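Since the plans for this phase are not yet written, the following is only a sketch of how tiering on composite score plus evidence breadth might look. Every threshold here is a hypothetical placeholder, as are the function name and the extra `unscored` label; the roadmap specifies only that the tiers are high, medium, and low.

```python
from typing import Optional

# Hypothetical thresholds: (minimum composite score, minimum supporting layers).
HIGH = (0.7, 4)
MEDIUM = (0.5, 2)


def assign_tier(score: Optional[float], n_supporting_layers: int) -> str:
    """Assign a confidence tier from composite score and evidence breadth."""
    if score is None:
        return "unscored"  # no evidence in any layer; kept separate from "low"
    if score >= HIGH[0] and n_supporting_layers >= HIGH[1]:
        return "high"
    if score >= MEDIUM[0] and n_supporting_layers >= MEDIUM[1]:
        return "medium"
    return "low"
```

Requiring a minimum number of supporting layers at each tier prevents a gene with one strong layer and nothing else from reaching the high-confidence tier on score alone.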
Phase 6: Validation
Goal: Benchmark scoring system against positive and negative controls
Depends on: Phase 5
Requirements: (No new requirements - validates existing system)
Success Criteria (what must be TRUE):
- Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates)
- Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
- Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
- Final scoring weights are tuned based on validation metrics and documented with rationale
Plans: TBD
Plans: (to be created during plan-phase)
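The positive-control criterion (>70% of known genes in the top 10% of candidates) amounts to a recall-at-cutoff check. The sketch below shows one way to compute it; the function name is an assumption, and so is the choice to count only known genes that appear in the ranking, which the roadmap does not specify.

```python
import math
from typing import Iterable, Sequence


def recall_at_top_fraction(
    ranked_genes: Sequence[str],
    known_genes: Iterable[str],
    fraction: float = 0.10,
) -> float:
    """Fraction of known genes found in the top `fraction` of a best-first ranking.

    Only known genes present in the ranking are counted in the denominator
    (an assumption); genes absent from the ranking are ignored.
    """
    known = set(known_genes) & set(ranked_genes)
    if not known:
        raise ValueError("no known genes are present in the ranking")
    # Round the cutoff up and keep at least one gene in the top set.
    cutoff = max(1, math.ceil(len(ranked_genes) * fraction))
    top = set(ranked_genes[:cutoff])
    return len(known & top) / len(known)
```

The validation gate is then a comparison such as `recall_at_top_fraction(ranking, positives) > 0.70`, run on the ranking produced before known genes are excluded.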
Progress
Execution Order: Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6
| Phase | Plans Complete | Status | Completed |
|---|---|---|---|
| 1. Data Infrastructure | 4/4 | ✓ Complete | 2026-02-11 |
| 2. Prototype Evidence Layer | 2/2 | ✓ Complete | 2026-02-11 |
| 3. Core Evidence Layers | 6/6 | ✓ Complete | 2026-02-11 |
| 4. Scoring & Integration | 0/3 | In progress | - |
| 5. Output & CLI | 0/TBD | Not started | - |
| 6. Validation | 0/TBD | Not started | - |