docs: create roadmap (6 phases)
@@ -122,52 +122,52 @@ Which phases cover which requirements. Updated during roadmap creation.
 | Requirement | Phase | Status |
 |-------------|-------|--------|
-| INFRA-01 | — | Pending |
+| INFRA-01 | Phase 1 | Pending |
-| INFRA-02 | — | Pending |
+| INFRA-02 | Phase 1 | Pending |
-| INFRA-03 | — | Pending |
+| INFRA-03 | Phase 1 | Pending |
-| INFRA-04 | — | Pending |
+| INFRA-04 | Phase 1 | Pending |
-| INFRA-05 | — | Pending |
+| INFRA-05 | Phase 1 | Pending |
-| INFRA-06 | — | Pending |
+| INFRA-06 | Phase 1 | Pending |
-| INFRA-07 | — | Pending |
+| INFRA-07 | Phase 1 | Pending |
-| ANNOT-01 | — | Pending |
+| ANNOT-01 | Phase 3 | Pending |
-| ANNOT-02 | — | Pending |
+| ANNOT-02 | Phase 3 | Pending |
-| ANNOT-03 | — | Pending |
+| ANNOT-03 | Phase 3 | Pending |
-| EXPR-01 | — | Pending |
+| EXPR-01 | Phase 3 | Pending |
-| EXPR-02 | — | Pending |
+| EXPR-02 | Phase 3 | Pending |
-| EXPR-03 | — | Pending |
+| EXPR-03 | Phase 3 | Pending |
-| EXPR-04 | — | Pending |
+| EXPR-04 | Phase 3 | Pending |
-| PROT-01 | — | Pending |
+| PROT-01 | Phase 3 | Pending |
-| PROT-02 | — | Pending |
+| PROT-02 | Phase 3 | Pending |
-| PROT-03 | — | Pending |
+| PROT-03 | Phase 3 | Pending |
-| PROT-04 | — | Pending |
+| PROT-04 | Phase 3 | Pending |
-| LOCA-01 | — | Pending |
+| LOCA-01 | Phase 3 | Pending |
-| LOCA-02 | — | Pending |
+| LOCA-02 | Phase 3 | Pending |
-| LOCA-03 | — | Pending |
+| LOCA-03 | Phase 3 | Pending |
-| GCON-01 | — | Pending |
+| GCON-01 | Phase 2 | Pending |
-| GCON-02 | — | Pending |
+| GCON-02 | Phase 2 | Pending |
-| GCON-03 | — | Pending |
+| GCON-03 | Phase 2 | Pending |
-| ANIM-01 | — | Pending |
+| ANIM-01 | Phase 3 | Pending |
-| ANIM-02 | — | Pending |
+| ANIM-02 | Phase 3 | Pending |
-| ANIM-03 | — | Pending |
+| ANIM-03 | Phase 3 | Pending |
-| LITE-01 | — | Pending |
+| LITE-01 | Phase 3 | Pending |
-| LITE-02 | — | Pending |
+| LITE-02 | Phase 3 | Pending |
-| LITE-03 | — | Pending |
+| LITE-03 | Phase 3 | Pending |
-| SCOR-01 | — | Pending |
+| SCOR-01 | Phase 4 | Pending |
-| SCOR-02 | — | Pending |
+| SCOR-02 | Phase 4 | Pending |
-| SCOR-03 | — | Pending |
+| SCOR-03 | Phase 4 | Pending |
-| SCOR-04 | — | Pending |
+| SCOR-04 | Phase 4 | Pending |
-| SCOR-05 | — | Pending |
+| SCOR-05 | Phase 4 | Pending |
-| OUTP-01 | — | Pending |
+| OUTP-01 | Phase 5 | Pending |
-| OUTP-02 | — | Pending |
+| OUTP-02 | Phase 5 | Pending |
-| OUTP-03 | — | Pending |
+| OUTP-03 | Phase 5 | Pending |
-| OUTP-04 | — | Pending |
+| OUTP-04 | Phase 5 | Pending |
-| OUTP-05 | — | Pending |
+| OUTP-05 | Phase 5 | Pending |

 **Coverage:**
 - v1 requirements: 40 total
-- Mapped to phases: 0
+- Mapped to phases: 40
-- Unmapped: 40 ⚠️
+- Unmapped: 0

 ---
 *Requirements defined: 2026-02-11*
-*Last updated: 2026-02-11 after initial definition*
+*Last updated: 2026-02-11 after roadmap creation*
120
.planning/ROADMAP.md
Normal file
@@ -0,0 +1,120 @@
# Roadmap: Usher Cilia Candidate Gene Discovery Pipeline

## Overview

This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates. The journey progresses from foundational data infrastructure through seven independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), multi-evidence scoring with transparent weights, and tiered output generation. Each phase delivers testable capabilities that compound toward a fully traceable, reproducible gene prioritization system.

## Phases

**Phase Numbering:**
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)

Decimal phases appear between their surrounding integers in numeric order.

- [ ] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline
- [ ] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
- [ ] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval
- [ ] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system
- [ ] **Phase 5: Output & CLI** - User-facing interface and tiered results
- [ ] **Phase 6: Validation** - Benchmark scoring against known genes

## Phase Details

### Phase 1: Data Infrastructure
**Goal**: Establish reproducible data foundation and gene ID mapping utilities
**Depends on**: Nothing (first phase)
**Requirements**: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07
**Success Criteria** (what must be TRUE):
1. Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions
2. Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs
3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching
4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading
5. Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash
**Plans**: TBD

Plans: (to be created during plan-phase)
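
As an illustration of success criterion 5, the sketch below shows one way a provenance record with a config hash could be assembled. It uses only the standard library. The function names, the `PIPELINE_VERSION` constant, and the source version strings are hypothetical placeholders, not decisions recorded in this roadmap.

```python
import hashlib
import json
from datetime import datetime, timezone

PIPELINE_VERSION = "0.1.0"  # hypothetical version string


def config_hash(config: dict) -> str:
    """Return a SHA-256 digest of the config that does not depend on key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_provenance(config: dict, source_versions: dict) -> dict:
    """Collect the four provenance fields named in the success criterion."""
    return {
        "pipeline_version": PIPELINE_VERSION,
        "data_source_versions": dict(source_versions),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "config_hash": config_hash(config),
    }


# Two configs with the same content but different key order hash identically.
config = {"weights": {"expression": 0.2, "constraint": 0.1}, "min_depth": 30}
reordered = {"min_depth": 30, "weights": {"constraint": 0.1, "expression": 0.2}}

# Placeholder release labels; real values would come from each data source.
provenance = build_provenance(config, {"source_a": "release-A", "source_b": "release-B"})
```

Serializing with sorted keys before hashing means that cosmetic reordering of the YAML file would not change the hash, so two runs with the same effective parameters remain comparable.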

### Phase 2: Prototype Evidence Layer
**Goal**: Validate retrieval-to-storage pattern with single evidence layer
**Depends on**: Phase 1
**Requirements**: GCON-01, GCON-02, GCON-03
**Success Criteria** (what must be TRUE):
1. Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes
2. Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags
3. Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage
4. Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability
**Plans**: TBD

Plans: (to be created during plan-phase)
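
Criteria 2 and 3 could be sketched roughly as follows. The record type, the flag names, and the LOEUF normalization (clipping at an assumed cap and inverting, since lower LOEUF indicates stronger constraint) are illustrative assumptions. Only the two coverage thresholds come from the criteria above.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MEAN_DEPTH = 30.0   # "mean depth >30x" from criterion 2
MIN_CDS_COVERED = 0.90  # ">90% CDS covered" from criterion 2


@dataclass
class ConstraintRecord:
    """Hypothetical per-gene record; None marks a value the source did not provide."""
    gene_id: str
    loeuf: Optional[float]
    mean_depth: Optional[float]
    cds_covered: Optional[float]


def quality_flag(rec: ConstraintRecord) -> str:
    """Classify a record as 'unknown', 'measured', or 'low_coverage'."""
    if rec.loeuf is None or rec.mean_depth is None or rec.cds_covered is None:
        return "unknown"
    if rec.mean_depth > MIN_MEAN_DEPTH and rec.cds_covered > MIN_CDS_COVERED:
        return "measured"
    return "low_coverage"


def normalized_score(rec: ConstraintRecord, loeuf_cap: float = 2.0) -> Optional[float]:
    """Map LOEUF to [0, 1] with higher meaning more constrained; None if not usable."""
    if quality_flag(rec) != "measured":
        return None  # never coerce missing or low-quality data to zero
    clipped = min(max(rec.loeuf, 0.0), loeuf_cap)
    return 1.0 - clipped / loeuf_cap


good = ConstraintRecord("GENE_A", loeuf=0.5, mean_depth=45.0, cds_covered=0.97)
missing = ConstraintRecord("GENE_B", loeuf=None, mean_depth=None, cds_covered=None)
shallow = ConstraintRecord("GENE_C", loeuf=0.5, mean_depth=12.0, cds_covered=0.97)
```

Returning `None` instead of `0.0` keeps a gene with incomplete coverage in the table without implying that it is unconstrained.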

### Phase 3: Core Evidence Layers
**Goal**: Complete all remaining evidence retrieval modules
**Depends on**: Phase 2
**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03
**Success Criteria** (what must be TRUE):
1. Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification
2. Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics
3. Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features
4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions
5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring
6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring
**Plans**: TBD

Plans: (to be created during plan-phase)
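
The roadmap does not say which "normalized specificity metric" criterion 2 refers to. One widely used candidate is the tau tissue-specificity index, sketched here purely as an assumption. The function name and the example tissue values are hypothetical, and inputs are assumed to be non-negative expression levels.

```python
from typing import Dict, Optional


def tau_specificity(expression: Dict[str, float]) -> Optional[float]:
    """Tau index: 0 for uniform expression, 1 for expression in a single tissue.

    Returns None when the metric is undefined (fewer than two tissues, or no
    detectable expression) so the gene is treated as unknown, not as zero.
    """
    values = list(expression.values())
    if len(values) < 2:
        return None
    peak = max(values)
    if peak <= 0:
        return None
    return sum(1.0 - v / peak for v in values) / (len(values) - 1)


# Hypothetical expression levels keyed by tissue name.
single_tissue = {"retina": 10.0, "liver": 0.0, "lung": 0.0}
uniform = {"retina": 5.0, "liver": 5.0, "lung": 5.0}
not_detected = {"retina": 0.0, "liver": 0.0}
```

A gene expressed only in retina would score 1.0 under this definition, a uniformly expressed gene 0.0, and a gene with no detectable expression would remain unscored.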

### Phase 4: Scoring & Integration
**Goal**: Multi-evidence weighted scoring with known gene validation
**Depends on**: Phase 3
**Requirements**: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05
**Success Criteria** (what must be TRUE):
1. Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled as exclusion set and positive controls
2. Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene
3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers
4. Known cilia/Usher genes rank highly before exclusion, validating that scoring system works
5. Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
**Plans**: TBD

Plans: (to be created during plan-phase)
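
One possible reading of criteria 2 and 3 is a weighted mean computed only over the layers that actually have a score, so an unknown layer neither adds to nor subtracts from the composite. The sketch below assumes per-layer scores already normalized to [0, 1]. The layer names, weights, and function name are hypothetical.

```python
from typing import Dict, Optional, Tuple


def composite_score(
    layer_scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Tuple[Optional[float], int]:
    """Return (weighted mean over known layers, number of layers used).

    Layers whose score is None are skipped and the remaining weights are
    renormalized. If no layer is usable the composite itself is None.
    """
    known = {
        name: score
        for name, score in layer_scores.items()
        if score is not None and weights.get(name, 0.0) > 0.0
    }
    total_weight = sum(weights[name] for name in known)
    if total_weight == 0:
        return None, 0
    weighted_sum = sum(weights[name] * score for name, score in known.items())
    return weighted_sum / total_weight, len(known)


# Hypothetical weights and one gene with an unknown constraint layer.
weights = {"expression": 0.5, "constraint": 0.25, "literature": 0.25}
gene_scores = {"expression": 0.8, "constraint": None, "literature": 0.4}
score, layers_used = composite_score(gene_scores, weights)
```

Returning the number of contributing layers alongside the score lets later steps weigh evidence breadth separately, since a high composite from a single layer is weaker support than the same value from five.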

### Phase 5: Output & CLI
**Goal**: User-facing interface and structured tiered output
**Depends on**: Phase 4
**Requirements**: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05
**Success Criteria** (what must be TRUE):
1. Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
2. Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps
3. Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools
4. Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown
5. Unified CLI provides subcommands for running layers, integration, and reporting with progress logging
6. Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics
**Plans**: TBD

Plans: (to be created during plan-phase)
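
Criterion 1 combines two signals, composite score and evidence breadth. A minimal sketch of such a rule follows. All four thresholds are made-up placeholders that would need to be set during planning or tuned in Phase 6, and the function name is hypothetical.

```python
from typing import Optional

# Placeholder thresholds, not values chosen by the roadmap.
HIGH_SCORE, MEDIUM_SCORE = 0.7, 0.4
HIGH_LAYERS, MEDIUM_LAYERS = 4, 2


def assign_tier(score: Optional[float], layers_with_evidence: int) -> Optional[str]:
    """Return 'high', 'medium', or 'low'; None when the gene has no composite score."""
    if score is None:
        return None
    if score >= HIGH_SCORE and layers_with_evidence >= HIGH_LAYERS:
        return "high"
    if score >= MEDIUM_SCORE and layers_with_evidence >= MEDIUM_LAYERS:
        return "medium"
    return "low"
```

Requiring both conditions for a tier means a gene with a strong score from only one layer falls to "low" under these placeholder values, which matches the intent of weighting evidence breadth alongside the score.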

### Phase 6: Validation
**Goal**: Benchmark scoring system against positive and negative controls
**Depends on**: Phase 5
**Requirements**: (No new requirements - validates existing system)
**Success Criteria** (what must be TRUE):
1. Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates)
2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
4. Final scoring weights are tuned based on validation metrics and documented with rationale
**Plans**: TBD

Plans: (to be created during plan-phase)
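
The recall target in criterion 1 can be checked with a few lines of code. The sketch below assumes the ranking is a list of gene IDs sorted from best to worst and computed before the known genes are excluded. The function name and the toy identifiers are hypothetical.

```python
from typing import Sequence, Set


def recall_in_top_percent(
    ranked_genes: Sequence[str],
    positives: Set[str],
    top_percent: int = 10,
) -> float:
    """Fraction of positive-control genes that fall in the top N percent of the ranking."""
    if not ranked_genes or not positives:
        raise ValueError("ranking and positive set must both be non-empty")
    # Integer ceiling division avoids floating-point surprises at the cutoff.
    cutoff = max(1, (len(ranked_genes) * top_percent + 99) // 100)
    top = set(ranked_genes[:cutoff])
    return len(positives & top) / len(positives)


# Toy ranking of 100 placeholder genes with four positive controls.
ranking = [f"G{i}" for i in range(100)]
known = {"G0", "G3", "G9", "G50"}
recall = recall_in_top_percent(ranking, known)
```

In this toy case three of the four controls sit in the top ten positions, so recall is 0.75 and the >70% target would be met.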

## Progress

**Execution Order:**
Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6

| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Data Infrastructure | 0/TBD | Not started | - |
| 2. Prototype Evidence Layer | 0/TBD | Not started | - |
| 3. Core Evidence Layers | 0/TBD | Not started | - |
| 4. Scoring & Integration | 0/TBD | Not started | - |
| 5. Output & CLI | 0/TBD | Not started | - |
| 6. Validation | 0/TBD | Not started | - |
62
.planning/STATE.md
Normal file
@@ -0,0 +1,62 @@
# Project State

## Project Reference

See: .planning/PROJECT.md (updated 2026-02-11)

**Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
**Current focus:** Phase 1 - Data Infrastructure

## Current Position

Phase: 1 of 6 (Data Infrastructure)
Plan: 0 of TBD in current phase
Status: Ready to plan
Last activity: 2026-02-11 — Roadmap created with 6 phases covering all 40 v1 requirements

Progress: [░░░░░░░░░░] 0%

## Performance Metrics

**Velocity:**
- Total plans completed: 0
- Average duration: - min
- Total execution time: 0.0 hours

**By Phase:**

| Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------|
| - | - | - | - |

**Recent Trend:**
- Last 5 plans: None yet
- Trend: No data

*Updated after each plan completion*

## Accumulated Context

### Decisions

Decisions are logged in PROJECT.md Key Decisions table.
Recent decisions affecting current work:

- Python over R/Bioconductor for rich data integration ecosystem
- Weighted rule-based scoring over ML for explainability
- Public data only for reproducibility
- Modular CLI scripts for flexibility during development

### Pending Todos

None yet.

### Blockers/Concerns

None yet.

## Session Continuity

Last session: 2026-02-11 - Roadmap creation
Stopped at: Roadmap and STATE files initialized, ready to plan Phase 1
Resume file: None