docs: create roadmap (6 phases)

2026-02-11 15:47:36 +08:00
parent 0fb1a9581f
commit f80f384a61
3 changed files with 225 additions and 43 deletions


@@ -122,52 +122,52 @@ Which phases cover which requirements. Updated during roadmap creation.
| Requirement | Phase | Status | | Requirement | Phase | Status |
|-------------|-------|--------| |-------------|-------|--------|
| INFRA-01 | | Pending | | INFRA-01 | Phase 1 | Pending |
| INFRA-02 | | Pending | | INFRA-02 | Phase 1 | Pending |
| INFRA-03 | | Pending | | INFRA-03 | Phase 1 | Pending |
| INFRA-04 | | Pending | | INFRA-04 | Phase 1 | Pending |
| INFRA-05 | | Pending | | INFRA-05 | Phase 1 | Pending |
| INFRA-06 | | Pending | | INFRA-06 | Phase 1 | Pending |
| INFRA-07 | | Pending | | INFRA-07 | Phase 1 | Pending |
| ANNOT-01 | | Pending | | ANNOT-01 | Phase 3 | Pending |
| ANNOT-02 | | Pending | | ANNOT-02 | Phase 3 | Pending |
| ANNOT-03 | | Pending | | ANNOT-03 | Phase 3 | Pending |
| EXPR-01 | | Pending | | EXPR-01 | Phase 3 | Pending |
| EXPR-02 | | Pending | | EXPR-02 | Phase 3 | Pending |
| EXPR-03 | | Pending | | EXPR-03 | Phase 3 | Pending |
| EXPR-04 | | Pending | | EXPR-04 | Phase 3 | Pending |
| PROT-01 | | Pending | | PROT-01 | Phase 3 | Pending |
| PROT-02 | | Pending | | PROT-02 | Phase 3 | Pending |
| PROT-03 | | Pending | | PROT-03 | Phase 3 | Pending |
| PROT-04 | | Pending | | PROT-04 | Phase 3 | Pending |
| LOCA-01 | | Pending | | LOCA-01 | Phase 3 | Pending |
| LOCA-02 | | Pending | | LOCA-02 | Phase 3 | Pending |
| LOCA-03 | | Pending | | LOCA-03 | Phase 3 | Pending |
| GCON-01 | | Pending | | GCON-01 | Phase 2 | Pending |
| GCON-02 | | Pending | | GCON-02 | Phase 2 | Pending |
| GCON-03 | | Pending | | GCON-03 | Phase 2 | Pending |
| ANIM-01 | | Pending | | ANIM-01 | Phase 3 | Pending |
| ANIM-02 | | Pending | | ANIM-02 | Phase 3 | Pending |
| ANIM-03 | | Pending | | ANIM-03 | Phase 3 | Pending |
| LITE-01 | | Pending | | LITE-01 | Phase 3 | Pending |
| LITE-02 | | Pending | | LITE-02 | Phase 3 | Pending |
| LITE-03 | | Pending | | LITE-03 | Phase 3 | Pending |
| SCOR-01 | | Pending | | SCOR-01 | Phase 4 | Pending |
| SCOR-02 | | Pending | | SCOR-02 | Phase 4 | Pending |
| SCOR-03 | | Pending | | SCOR-03 | Phase 4 | Pending |
| SCOR-04 | | Pending | | SCOR-04 | Phase 4 | Pending |
| SCOR-05 | | Pending | | SCOR-05 | Phase 4 | Pending |
| OUTP-01 | | Pending | | OUTP-01 | Phase 5 | Pending |
| OUTP-02 | | Pending | | OUTP-02 | Phase 5 | Pending |
| OUTP-03 | | Pending | | OUTP-03 | Phase 5 | Pending |
| OUTP-04 | | Pending | | OUTP-04 | Phase 5 | Pending |
| OUTP-05 | | Pending | | OUTP-05 | Phase 5 | Pending |
**Coverage:** **Coverage:**
- v1 requirements: 40 total - v1 requirements: 40 total
- Mapped to phases: 0 - Mapped to phases: 40
- Unmapped: 40 ⚠️ - Unmapped: 0
--- ---
*Requirements defined: 2026-02-11* *Requirements defined: 2026-02-11*
*Last updated: 2026-02-11 after initial definition* *Last updated: 2026-02-11 after roadmap creation*

.planning/ROADMAP.md Normal file

@@ -0,0 +1,120 @@
# Roadmap: Usher Cilia Candidate Gene Discovery Pipeline
## Overview
This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates. Work progresses from foundational data infrastructure, through seven independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), to multi-evidence scoring with transparent weights and tiered output generation. Each phase delivers testable capabilities that build toward a fully traceable, reproducible gene prioritization system.
## Phases
**Phase Numbering:**
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
Decimal phases appear between their surrounding integers in numeric order.
- [ ] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline
- [ ] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
- [ ] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval
- [ ] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system
- [ ] **Phase 5: Output & CLI** - User-facing interface and tiered results
- [ ] **Phase 6: Validation** - Benchmark scoring against known genes
## Phase Details
### Phase 1: Data Infrastructure
**Goal**: Establish reproducible data foundation and gene ID mapping utilities
**Depends on**: Nothing (first phase)
**Requirements**: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07
**Success Criteria** (what must be TRUE):
1. Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions
2. Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs
3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching
4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading
5. Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash
**Plans**: TBD
Plans: (to be created during plan-phase)
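
Criterion 5 above can be illustrated with a minimal sketch of a provenance record, assuming the config is a plain dictionary. The function names, field names, and example values are hypothetical and are not part of the roadmap; the hash is computed over a canonical JSON form so that key order does not change it.

```python
import hashlib
import json
from datetime import datetime, timezone


def config_hash(config: dict) -> str:
    """Hash a config dict deterministically (sorted keys, compact separators)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_provenance(config: dict, pipeline_version: str, source_versions: dict) -> dict:
    """Assemble the metadata block that would accompany a pipeline output."""
    return {
        "pipeline_version": pipeline_version,
        "data_source_versions": dict(source_versions),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "config_hash": config_hash(config),
    }


# Placeholder values, for illustration only.
example_config = {"constraint": {"min_depth": 30}, "scoring": {"weights": {"expression": 0.2}}}
provenance = build_provenance(example_config, "0.0.0-example", {"gnomad": "placeholder-release"})
```

Because the hash depends only on the canonical form, two runs with the same parameters should report the same `config_hash` even if the YAML keys are reordered.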
### Phase 2: Prototype Evidence Layer
**Goal**: Validate retrieval-to-storage pattern with single evidence layer
**Depends on**: Phase 1
**Requirements**: GCON-01, GCON-02, GCON-03
**Success Criteria** (what must be TRUE):
1. Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes
2. Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags
3. Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage
4. Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability
**Plans**: TBD
Plans: (to be created during plan-phase)
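
A minimal sketch of criteria 2 and 3 above, assuming constraint values arrive as optional floats. The record type, the flag names (`measured`, `low_coverage`, `unknown`), and the example gene IDs are hypothetical; only the two thresholds come from the criteria.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MEAN_DEPTH = 30.0    # from the criterion above: mean depth >30x
MIN_CDS_COVERED = 0.90   # from the criterion above: >90% CDS covered


@dataclass
class ConstraintRecord:
    gene_id: str
    loeuf: Optional[float]        # None means "unknown", never 0.0
    mean_depth: Optional[float]
    cds_covered: Optional[float]  # fraction in [0, 1]


def quality_flag(rec: ConstraintRecord) -> str:
    """Label a record without dropping it, so incomplete genes stay in the table."""
    if rec.loeuf is None or rec.mean_depth is None or rec.cds_covered is None:
        return "unknown"
    if rec.mean_depth > MIN_MEAN_DEPTH and rec.cds_covered > MIN_CDS_COVERED:
        return "measured"
    return "low_coverage"


# Hypothetical records, for illustration only.
records = [
    ConstraintRecord("GENE_A", 0.25, 45.0, 0.98),
    ConstraintRecord("GENE_B", 0.80, 12.0, 0.95),
    ConstraintRecord("GENE_C", None, None, None),
]
flags = {rec.gene_id: quality_flag(rec) for rec in records}
```

Keeping `None` distinct from `0.0` matters here because a LOEUF of zero would read as extreme constraint, which is the opposite of "no data".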
### Phase 3: Core Evidence Layers
**Goal**: Complete all remaining evidence retrieval modules
**Depends on**: Phase 2
**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03
**Success Criteria** (what must be TRUE):
1. Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification
2. Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics
3. Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features
4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions
5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring
6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring
**Plans**: TBD
Plans: (to be created during plan-phase)
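
The roadmap does not name a specificity metric for criterion 2. As one possible choice, the sketch below computes the tau tissue-specificity index, which ranges from 0 (uniform expression) to 1 (expression in a single tissue). The use of tau, the function name, and the tissue labels are assumptions made for illustration.

```python
from typing import Mapping, Optional


def tau_specificity(expression: Mapping[str, float]) -> Optional[float]:
    """Tau index: sum(1 - x_i / x_max) / (n - 1) over n tissues.

    Returns None (unknown) when fewer than two tissues are available or
    when the gene is not expressed in any of them.
    """
    values = list(expression.values())
    if len(values) < 2:
        return None
    peak = max(values)
    if peak <= 0:
        return None
    return sum(1 - v / peak for v in values) / (len(values) - 1)


# Hypothetical expression values, for illustration only.
retina_specific = tau_specificity({"retina": 10.0, "liver": 0.0, "lung": 0.0})
broadly_expressed = tau_specificity({"retina": 5.0, "liver": 5.0, "lung": 5.0})
```

Returning `None` instead of `0.0` for unexpressed genes keeps the "unknown rather than zero" convention consistent across layers.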
### Phase 4: Scoring & Integration
**Goal**: Multi-evidence weighted scoring with known gene validation
**Depends on**: Phase 3
**Requirements**: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05
**Success Criteria** (what must be TRUE):
1. Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled as exclusion set and positive controls
2. Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene
3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers
4. Known cilia/Usher genes rank highly before exclusion, validating that the scoring system works
5. Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
**Plans**: TBD
Plans: (to be created during plan-phase)
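
Criteria 2 and 3 above can be sketched as a weighted average that skips unknown layers and renormalizes over the layers that have data. Renormalization is one possible way to avoid penalizing missing evidence, not a decision recorded in the roadmap; the function name, layer names, and weights are hypothetical.

```python
from typing import Mapping, Optional


def composite_score(
    layer_scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted mean over layers with known scores; None if nothing is known."""
    known = {
        layer: score
        for layer, score in layer_scores.items()
        if score is not None and weights.get(layer, 0.0) > 0.0
    }
    total_weight = sum(weights[layer] for layer in known)
    if total_weight == 0.0:
        return None
    return sum(weights[layer] * score for layer, score in known.items()) / total_weight


# Hypothetical weights and scores, for illustration only.
weights = {"expression": 0.5, "constraint": 0.5}
partial = composite_score({"expression": 0.8, "constraint": None}, weights)
complete = composite_score({"expression": 0.0, "constraint": 1.0}, weights)
no_data = composite_score({"expression": None, "constraint": None}, weights)
```

Note that a genuine score of `0.0` still counts as evidence and lowers the composite, while `None` simply drops out of the average.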
### Phase 5: Output & CLI
**Goal**: User-facing interface and structured tiered output
**Depends on**: Phase 4
**Requirements**: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05
**Success Criteria** (what must be TRUE):
1. Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
2. Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps
3. Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools
4. Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown
5. Unified CLI provides subcommands for running layers, integration, and reporting with progress logging
6. Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics
**Plans**: TBD
Plans: (to be created during plan-phase)
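
Criterion 1 above combines composite score with evidence breadth. A minimal sketch of such a rule follows; the thresholds and the function name are hypothetical placeholders, since the roadmap does not specify cutoffs.

```python
from typing import Optional

# Hypothetical thresholds: (minimum composite score, minimum supporting layers).
HIGH = (0.7, 3)
MEDIUM = (0.4, 2)


def assign_tier(score: Optional[float], supporting_layers: int) -> Optional[str]:
    """Map a composite score and evidence breadth to a confidence tier.

    Genes without a composite score receive no tier at all.
    """
    if score is None:
        return None
    if score >= HIGH[0] and supporting_layers >= HIGH[1]:
        return "high"
    if score >= MEDIUM[0] and supporting_layers >= MEDIUM[1]:
        return "medium"
    return "low"


# Hypothetical candidates, for illustration only.
tiers = [assign_tier(0.9, 4), assign_tier(0.9, 1), assign_tier(0.5, 2), assign_tier(None, 0)]
```

Requiring breadth as well as score means a gene with one very strong layer and nothing else does not reach the high-confidence tier under these placeholder cutoffs.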
### Phase 6: Validation
**Goal**: Benchmark scoring system against positive and negative controls
**Depends on**: Phase 5
**Requirements**: (No new requirements - validates existing system)
**Success Criteria** (what must be TRUE):
1. Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates)
2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
4. Final scoring weights are tuned based on validation metrics and documented with rationale
**Plans**: TBD
Plans: (to be created during plan-phase)
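
The recall target in criterion 1 above can be checked with a small helper. This is a sketch that assumes a best-first ranked list and a set of known genes; the function name and example gene labels are hypothetical.

```python
import math
from typing import Iterable, Sequence


def recall_at_top_fraction(
    ranked_genes: Sequence[str],
    known_genes: Iterable[str],
    fraction: float = 0.10,
) -> float:
    """Share of known genes that fall within the top `fraction` of the ranking."""
    known = set(known_genes)
    if not known:
        raise ValueError("known_genes must not be empty")
    if not ranked_genes:
        return 0.0
    # Small epsilon guards against float error pushing ceil() one rank too far.
    cutoff = max(1, math.ceil(len(ranked_genes) * fraction - 1e-9))
    top = set(ranked_genes[:cutoff])
    return len(top & known) / len(known)


# Hypothetical ranking of 20 genes, best first; three are "known" positives.
ranking = [f"g{i}" for i in range(20)]
recall = recall_at_top_fraction(ranking, {"g0", "g1", "g15"})
meets_target = recall > 0.70
```

Here the top 10% is two genes, so two of the three known genes are recovered and the illustrative ranking would fall short of the >70% target.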
## Progress
**Execution Order:**
Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Data Infrastructure | 0/TBD | Not started | - |
| 2. Prototype Evidence Layer | 0/TBD | Not started | - |
| 3. Core Evidence Layers | 0/TBD | Not started | - |
| 4. Scoring & Integration | 0/TBD | Not started | - |
| 5. Output & CLI | 0/TBD | Not started | - |
| 6. Validation | 0/TBD | Not started | - |

.planning/STATE.md Normal file

@@ -0,0 +1,62 @@
# Project State
## Project Reference
See: .planning/PROJECT.md (updated 2026-02-11)
**Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
**Current focus:** Phase 1 - Data Infrastructure
## Current Position
Phase: 1 of 6 (Data Infrastructure)
Plan: 0 of TBD in current phase
Status: Ready to plan
Last activity: 2026-02-11 — Roadmap created with 6 phases covering all 40 v1 requirements
Progress: [░░░░░░░░░░] 0%
## Performance Metrics
**Velocity:**
- Total plans completed: 0
- Average duration: - min
- Total execution time: 0.0 hours
**By Phase:**
| Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------|
| - | - | - | - |
**Recent Trend:**
- Last 5 plans: None yet
- Trend: No data
*Updated after each plan completion*
## Accumulated Context
### Decisions
Decisions are logged in PROJECT.md Key Decisions table.
Recent decisions affecting current work:
- Python over R/Bioconductor for rich data integration ecosystem
- Weighted rule-based scoring over ML for explainability
- Public data only for reproducibility
- Modular CLI scripts for flexibility during development
### Pending Todos
None yet.
### Blockers/Concerns
None yet.
## Session Continuity
Last session: 2026-02-11 - Roadmap creation
Stopped at: Roadmap and STATE files initialized, ready to plan Phase 1
Resume file: None