diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md index 27a8ad1..aa29c09 100644 --- a/.planning/REQUIREMENTS.md +++ b/.planning/REQUIREMENTS.md @@ -122,52 +122,52 @@ Which phases cover which requirements. Updated during roadmap creation. | Requirement | Phase | Status | |-------------|-------|--------| -| INFRA-01 | — | Pending | -| INFRA-02 | — | Pending | -| INFRA-03 | — | Pending | -| INFRA-04 | — | Pending | -| INFRA-05 | — | Pending | -| INFRA-06 | — | Pending | -| INFRA-07 | — | Pending | -| ANNOT-01 | — | Pending | -| ANNOT-02 | — | Pending | -| ANNOT-03 | — | Pending | -| EXPR-01 | — | Pending | -| EXPR-02 | — | Pending | -| EXPR-03 | — | Pending | -| EXPR-04 | — | Pending | -| PROT-01 | — | Pending | -| PROT-02 | — | Pending | -| PROT-03 | — | Pending | -| PROT-04 | — | Pending | -| LOCA-01 | — | Pending | -| LOCA-02 | — | Pending | -| LOCA-03 | — | Pending | -| GCON-01 | — | Pending | -| GCON-02 | — | Pending | -| GCON-03 | — | Pending | -| ANIM-01 | — | Pending | -| ANIM-02 | — | Pending | -| ANIM-03 | — | Pending | -| LITE-01 | — | Pending | -| LITE-02 | — | Pending | -| LITE-03 | — | Pending | -| SCOR-01 | — | Pending | -| SCOR-02 | — | Pending | -| SCOR-03 | — | Pending | -| SCOR-04 | — | Pending | -| SCOR-05 | — | Pending | -| OUTP-01 | — | Pending | -| OUTP-02 | — | Pending | -| OUTP-03 | — | Pending | -| OUTP-04 | — | Pending | -| OUTP-05 | — | Pending | +| INFRA-01 | Phase 1 | Pending | +| INFRA-02 | Phase 1 | Pending | +| INFRA-03 | Phase 1 | Pending | +| INFRA-04 | Phase 1 | Pending | +| INFRA-05 | Phase 1 | Pending | +| INFRA-06 | Phase 1 | Pending | +| INFRA-07 | Phase 1 | Pending | +| ANNOT-01 | Phase 3 | Pending | +| ANNOT-02 | Phase 3 | Pending | +| ANNOT-03 | Phase 3 | Pending | +| EXPR-01 | Phase 3 | Pending | +| EXPR-02 | Phase 3 | Pending | +| EXPR-03 | Phase 3 | Pending | +| EXPR-04 | Phase 3 | Pending | +| PROT-01 | Phase 3 | Pending | +| PROT-02 | Phase 3 | Pending | +| PROT-03 | Phase 3 | Pending | +| PROT-04 | 
Phase 3 | Pending | +| LOCA-01 | Phase 3 | Pending | +| LOCA-02 | Phase 3 | Pending | +| LOCA-03 | Phase 3 | Pending | +| GCON-01 | Phase 2 | Pending | +| GCON-02 | Phase 2 | Pending | +| GCON-03 | Phase 2 | Pending | +| ANIM-01 | Phase 3 | Pending | +| ANIM-02 | Phase 3 | Pending | +| ANIM-03 | Phase 3 | Pending | +| LITE-01 | Phase 3 | Pending | +| LITE-02 | Phase 3 | Pending | +| LITE-03 | Phase 3 | Pending | +| SCOR-01 | Phase 4 | Pending | +| SCOR-02 | Phase 4 | Pending | +| SCOR-03 | Phase 4 | Pending | +| SCOR-04 | Phase 4 | Pending | +| SCOR-05 | Phase 4 | Pending | +| OUTP-01 | Phase 5 | Pending | +| OUTP-02 | Phase 5 | Pending | +| OUTP-03 | Phase 5 | Pending | +| OUTP-04 | Phase 5 | Pending | +| OUTP-05 | Phase 5 | Pending | **Coverage:** - v1 requirements: 40 total -- Mapped to phases: 0 -- Unmapped: 40 ⚠️ +- Mapped to phases: 40 +- Unmapped: 0 --- *Requirements defined: 2026-02-11* -*Last updated: 2026-02-11 after initial definition* +*Last updated: 2026-02-11 after roadmap creation* diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md new file mode 100644 index 0000000..63cad14 --- /dev/null +++ b/.planning/ROADMAP.md @@ -0,0 +1,120 @@ +# Roadmap: Usher Cilia Candidate Gene Discovery Pipeline + +## Overview + +This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates. The journey progresses from foundational data infrastructure through seven independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), multi-evidence scoring with transparent weights, and tiered output generation. Each phase delivers testable capabilities that compound toward a fully traceable, reproducible gene prioritization system. 
+ +## Phases + +**Phase Numbering:** +- Integer phases (1, 2, 3): Planned milestone work +- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED) + +Decimal phases appear between their surrounding integers in numeric order. + +- [ ] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline +- [ ] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture +- [ ] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval +- [ ] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system +- [ ] **Phase 5: Output & CLI** - User-facing interface and tiered results +- [ ] **Phase 6: Validation** - Benchmark scoring against known genes + +## Phase Details + +### Phase 1: Data Infrastructure +**Goal**: Establish reproducible data foundation and gene ID mapping utilities +**Depends on**: Nothing (first phase) +**Requirements**: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07 +**Success Criteria** (what must be TRUE): + 1. Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions + 2. Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs + 3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching + 4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading + 5. Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash +**Plans**: TBD + +Plans: (to be created during plan-phase) + +### Phase 2: Prototype Evidence Layer +**Goal**: Validate retrieval-to-storage pattern with single evidence layer +**Depends on**: Phase 1 +**Requirements**: GCON-01, GCON-02, GCON-03 +**Success Criteria** (what must be TRUE): + 1. Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes + 2. 
Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags + 3. Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage + 4. Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability +**Plans**: TBD + +Plans: (to be created during plan-phase) + +### Phase 3: Core Evidence Layers +**Goal**: Complete all remaining evidence retrieval modules +**Depends on**: Phase 2 +**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03 +**Success Criteria** (what must be TRUE): + 1. Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification + 2. Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics + 3. Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features + 4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions + 5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring + 6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring +**Plans**: TBD + +Plans: (to be created during plan-phase) + +### Phase 4: Scoring & Integration +**Goal**: Multi-evidence weighted scoring with known gene validation +**Depends on**: Phase 3 +**Requirements**: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05 +**Success Criteria** (what must be TRUE): + 1. Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled as exclusion set and positive controls + 2. 
Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene + 3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers + 4. Known cilia/Usher genes rank highly before exclusion, validating that scoring system works + 5. Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer +**Plans**: TBD + +Plans: (to be created during plan-phase) + +### Phase 5: Output & CLI +**Goal**: User-facing interface and structured tiered output +**Depends on**: Phase 4 +**Requirements**: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05 +**Success Criteria** (what must be TRUE): + 1. Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth + 2. Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps + 3. Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools + 4. Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown + 5. Unified CLI provides subcommands for running layers, integration, and reporting with progress logging + 6. Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics +**Plans**: TBD + +Plans: (to be created during plan-phase) + +### Phase 6: Validation +**Goal**: Benchmark scoring system against positive and negative controls +**Depends on**: Phase 5 +**Requirements**: (No new requirements - validates existing system) +**Success Criteria** (what must be TRUE): + 1. Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates) + 2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier) + 3. 
Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates + 4. Final scoring weights are tuned based on validation metrics and documented with rationale +**Plans**: TBD + +Plans: (to be created during plan-phase) + +## Progress + +**Execution Order:** +Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6 + +| Phase | Plans Complete | Status | Completed | +|-------|----------------|--------|-----------| +| 1. Data Infrastructure | 0/TBD | Not started | - | +| 2. Prototype Evidence Layer | 0/TBD | Not started | - | +| 3. Core Evidence Layers | 0/TBD | Not started | - | +| 4. Scoring & Integration | 0/TBD | Not started | - | +| 5. Output & CLI | 0/TBD | Not started | - | +| 6. Validation | 0/TBD | Not started | - | diff --git a/.planning/STATE.md b/.planning/STATE.md new file mode 100644 index 0000000..e5955aa --- /dev/null +++ b/.planning/STATE.md @@ -0,0 +1,62 @@ +# Project State + +## Project Reference + +See: .planning/PROJECT.md (updated 2026-02-11) + +**Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented. +**Current focus:** Phase 1 - Data Infrastructure + +## Current Position + +Phase: 1 of 6 (Data Infrastructure) +Plan: 0 of TBD in current phase +Status: Ready to plan +Last activity: 2026-02-11 — Roadmap created with 6 phases covering all 40 v1 requirements + +Progress: [░░░░░░░░░░] 0% + +## Performance Metrics + +**Velocity:** +- Total plans completed: 0 +- Average duration: - min +- Total execution time: 0.0 hours + +**By Phase:** + +| Phase | Plans | Total | Avg/Plan | +|-------|-------|-------|----------| +| - | - | - | - | + +**Recent Trend:** +- Last 5 plans: None yet +- Trend: No data + +*Updated after each plan completion* + +## Accumulated Context + +### Decisions + +Decisions are logged in PROJECT.md Key Decisions table. 
+Recent decisions affecting current work: + +- Python over R/Bioconductor for rich data integration ecosystem +- Weighted rule-based scoring over ML for explainability +- Public data only for reproducibility +- Modular CLI scripts for flexibility during development + +### Pending Todos + +None yet. + +### Blockers/Concerns + +None yet. + +## Session Continuity + +Last session: 2026-02-11 - Roadmap creation +Stopped at: Roadmap and STATE files initialized, ready to plan Phase 1 +Resume file: None