From a2ef2125baaa2e2454a08a9b53f91fd149b6f8f7 Mon Sep 17 00:00:00 2001 From: gbanyan Date: Thu, 12 Feb 2026 21:31:43 +0800 Subject: [PATCH] chore: complete v1.0 MVP milestone Archive v1.0 milestone: 6 phases, 21 plans, 40/40 requirements. Reorganize ROADMAP.md, evolve PROJECT.md, archive requirements. Co-Authored-By: Claude Opus 4.6 --- .planning/MILESTONES.md | 27 ++++ .planning/PROJECT.md | 99 +++++++----- .planning/ROADMAP.md | 149 +++-------------- .planning/STATE.md | 150 +++--------------- .../{ => milestones}/v1.0-MILESTONE-AUDIT.md | 0 .../v1.0-REQUIREMENTS.md} | 9 ++ .planning/milestones/v1.0-ROADMAP.md | 141 ++++++++++++++++ 7 files changed, 275 insertions(+), 300 deletions(-) create mode 100644 .planning/MILESTONES.md rename .planning/{ => milestones}/v1.0-MILESTONE-AUDIT.md (100%) rename .planning/{REQUIREMENTS.md => milestones/v1.0-REQUIREMENTS.md} (98%) create mode 100644 .planning/milestones/v1.0-ROADMAP.md diff --git a/.planning/MILESTONES.md b/.planning/MILESTONES.md new file mode 100644 index 0000000..288550a --- /dev/null +++ b/.planning/MILESTONES.md @@ -0,0 +1,27 @@ +# Milestones + +## v1.0 MVP (Shipped: 2026-02-12) + +**Phases completed:** 6 phases, 21 plans +**Lines of code:** 21,183 Python (src + tests) +**Files:** 164 files +**Timeline:** 2026-02-11 → 2026-02-12 + +**Delivered:** Reproducible bioinformatics pipeline that screens ~20,000 human protein-coding genes across 6 evidence layers to identify under-studied cilia/Usher syndrome candidate genes, with transparent weighted scoring, tiered output, and comprehensive validation. + +**Key accomplishments:** +1. Reproducible data foundation with Ensembl gene universe, validated HGNC/UniProt mapping, Pydantic config, DuckDB checkpoint-restart, and provenance tracking +2. Evidence integration across 6 layers (gnomAD constraint, tissue expression, gene annotation, protein features, subcellular localization, animal models) plus PubMed literature scanning +3.
Transparent weighted scoring with NULL-preserving composite scores, configurable per-layer weights, and quality control (missing data rates, distribution anomalies, MAD outliers) +4. Tiered candidate output (high/medium/low confidence) with dual-format export (TSV+Parquet), visualizations, and reproducibility reports +5. Comprehensive validation: positive controls (recall@k), negative controls (13 housekeeping genes), sensitivity analysis (weight perturbation with Spearman rank correlation) +6. Unified CLI with 5 subcommands (setup, evidence, score, report, validate) and consistent checkpoint-restart pattern + +**v2 requirements delivered early:** +- Sensitivity analysis with parameter sweep (ASCR-03) +- Negative control validation with housekeeping genes (AOUT-02) + +**Archive:** [v1.0-ROADMAP.md](milestones/v1.0-ROADMAP.md) | [v1.0-REQUIREMENTS.md](milestones/v1.0-REQUIREMENTS.md) | [v1.0-MILESTONE-AUDIT.md](milestones/v1.0-MILESTONE-AUDIT.md) + +--- + diff --git a/.planning/PROJECT.md b/.planning/PROJECT.md index a49d516..fa5505c 100644 --- a/.planning/PROJECT.md +++ b/.planning/PROJECT.md @@ -2,33 +2,52 @@ ## What This Is -A reproducible, explainable bioinformatics pipeline that systematically screens all human protein-coding genes (~20,000) to identify under-studied candidates likely involved in cilia/sensory cilia pathways — particularly those relevant to Usher syndrome. The pipeline integrates 6+ evidence layers, scores genes via weighted rule-based integration, and outputs a tiered candidate list for downstream protein interaction network and structural prediction analyses. +A reproducible bioinformatics pipeline that screens all ~20,000 human protein-coding genes across 6 evidence layers to identify under-studied candidates likely involved in cilia/sensory cilia pathways relevant to Usher syndrome. 
It integrates genetic constraint, tissue expression, gene annotation, protein features, subcellular localization, animal model phenotypes, and literature evidence into a transparent weighted scoring system that produces tiered candidate lists. ## Core Value Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented. +## Current State + +**Shipped:** v1.0 MVP (2026-02-12) +**Codebase:** 21,183 lines Python across 164 files +**Tech stack:** Python, Click CLI, DuckDB, Polars, Pydantic, matplotlib/seaborn, scipy, structlog + +**What works:** +- `usher-pipeline setup` — fetches gene universe from Ensembl with HGNC/UniProt mapping +- `usher-pipeline evidence ` — 7 evidence subcommands (6 evidence layers plus literature) with checkpoint-restart +- `usher-pipeline score` — multi-evidence weighted scoring with QC and positive control validation +- `usher-pipeline report` — tiered output (TSV+Parquet), visualizations, reproducibility report +- `usher-pipeline validate` — positive/negative control validation, sensitivity analysis + +**Known issues:** +- cellxgene-census version conflict blocks some test execution +- PubMed literature pipeline takes 3-11 hours for full gene universe (mitigated by checkpoint-restart) + ## Requirements ### Validated -(None yet — ship to validate) +- ✓ Modular Python pipeline with independent, composable CLI scripts per evidence layer — v1.0 +- ✓ Gene universe: all human protein-coding genes (Ensembl/HGNC aligned) — v1.0 +- ✓ Evidence Layer 1: Gene annotation completeness (GO/UniProt) — v1.0 +- ✓ Evidence Layer 2: Tissue-specific expression (HPA, GTEx, CellxGene) — v1.0 +- ✓ Evidence Layer 3: Protein sequence/structure features (UniProt/InterPro) — v1.0 +- ✓ Evidence Layer 4: Subcellular localization (HPA, cilia proteomics) — v1.0 +- ✓ Evidence Layer 5: Genetic constraint (gnomAD pLI, LOEUF) — v1.0 +- ✓ Evidence Layer 6:
Animal model phenotypes (MGI, ZFIN, IMPC) — v1.0 +- ✓ Systematic literature scanning per candidate — v1.0 +- ✓ Known cilia/Usher gene set compiled as exclusion set and positive controls — v1.0 +- ✓ Weighted rule-based multi-evidence integration scoring — v1.0 +- ✓ Tiered output with per-gene evidence summaries and gap documentation — v1.0 +- ✓ Output format compatible with downstream analyses — v1.0 +- ✓ Sensitivity analysis with parameter sweep (originally v2, delivered early) — v1.0 +- ✓ Negative control validation with housekeeping genes (originally v2, delivered early) — v1.0 ### Active -- [ ] Modular Python pipeline with independent, composable CLI scripts per evidence layer -- [ ] Gene universe: all human protein-coding genes (Ensembl/HGNC aligned), excluding pseudogenes and transcripts lacking protein-level evidence -- [ ] Evidence Layer 1: Gene annotation completeness (GO/UniProt functional annotation depth) -- [ ] Evidence Layer 2: Tissue-specific expression (retina, inner ear/hair cells, cilia-rich tissues) from public atlases (HPA, GTEx, CellxGene published scRNA-seq) -- [ ] Evidence Layer 3: Protein sequence/structure features (length, domain composition, coiled-coil, scaffold/adaptor domains, cilia-associated motifs) -- [ ] Evidence Layer 4: Subcellular localization evidence (centrosome, basal body, cilium, stereocilia) from high-throughput proteomics datasets -- [ ] Evidence Layer 5: Human genetic constraint (loss-of-function tolerance from gnomAD, selection pressure indicators) -- [ ] Evidence Layer 6: Animal model phenotypes (sensory, balance, vision, cilia phenotypes from model organism databases) -- [ ] Systematic literature scanning per candidate (distinguishing direct experimental evidence, incidental mentions, high-throughput hits) -- [ ] Known cilia/Usher gene set compiled from public sources (CiliaCarta, SYSCILIA gold standard, OMIM Usher genes) as exclusion set and positive controls -- [ ] Weighted rule-based multi-evidence integration 
scoring with transparent weights -- [ ] Tiered output (high/medium/low confidence) with per-gene evidence summaries and data gap documentation -- [ ] Output format compatible with downstream PPI network analysis (STRING/BioGRID), structural prediction (AlphaFold-Multimer), and additional analyses +(None — define with `/gsd:new-milestone`) ### Out of Scope @@ -37,42 +56,44 @@ Produce a high-confidence, multi-evidence-backed ranked list of under-studied ci - Downstream PPI network or structural prediction analyses — this pipeline produces the input candidate list - Wet-lab validation — computational discovery pipeline only - Real-time data updates — pipeline runs against versioned snapshots of source databases +- Real-time web dashboard — static reports + CLI sufficient for research tool +- GUI for parameter tuning — research pipelines need reproducible CLI execution +- Variant-level analysis — gene-level discovery scope; use Exomiser/LIRICAL for variant work +- LLM-based automated literature scanning — manual/programmatic PubMed queries sufficient +- Bayesian evidence weight optimization — requires larger training set; manual tuning sufficient ## Context -Usher syndrome is the most common genetic cause of combined deafness and blindness. While several causal genes (USH1B/MYO7A, USH1C, USH2A, etc.) are known, the full molecular network — particularly scaffold, adaptor, and regulatory proteins connecting Usher complexes to cilia machinery — remains incompletely characterized. Many genes with cilia-relevant features lack functional annotation, creating a discovery opportunity. +Usher syndrome is the most common genetic cause of combined deafness and blindness. While several causal genes (USH1B/MYO7A, USH1C, USH2A, etc.) are known, the full molecular network — particularly scaffold, adaptor, and regulatory proteins connecting Usher complexes to cilia machinery — remains incompletely characterized. 
-The pipeline targets this gap: genes that have cilia-suggestive evidence across multiple layers but haven't been studied in the Usher/sensory cilia context. By operationalizing "under-studied" (limited GO annotation, sparse mechanistic literature, not in canonical cilia gene lists) and cross-referencing with expression, structural, localization, genetic, and phenotypic evidence, the pipeline surfaces candidates that would otherwise remain invisible. +The pipeline targets this gap: genes that have cilia-suggestive evidence across multiple layers but haven't been studied in the Usher/sensory cilia context. -Key public data sources: -- **Gene annotation:** Ensembl, HGNC, UniProt, Gene Ontology -- **Expression:** Human Protein Atlas, GTEx, CellxGene (published retina/cochlea scRNA-seq datasets) -- **Protein features:** UniProt domains, InterPro, Pfam -- **Localization:** Human Protein Atlas subcellular, OpenCell, published centrosome/cilium proteomics -- **Genetic constraint:** gnomAD (pLI, LOEUF scores) -- **Animal models:** MGI (mouse), ZFIN (zebrafish), IMPC -- **Known gene sets:** CiliaCarta, SYSCILIA gold standard, OMIM (Usher-related entries) -- **Literature:** PubMed/NCBI for systematic text scanning +Key public data sources: Ensembl, HGNC, UniProt, Gene Ontology, Human Protein Atlas, GTEx, CellxGene, InterPro, gnomAD, MGI, ZFIN, IMPC, CiliaCarta, SYSCILIA, OMIM, PubMed. 
## Constraints -- **Language**: Python — all pipeline modules written in Python -- **Architecture**: Modular CLI scripts — each evidence layer is an independent module, composable via standard input/output -- **Data**: Public sources only — no proprietary or access-restricted datasets -- **Compute**: Local workstation with NVIDIA 4090 GPU — GPU available if needed for large-scale computations -- **Scoring**: Weighted rule-based — fully transparent, no black-box models -- **Reproducibility**: Versioned data snapshots, pinned dependencies, documented parameters +- **Language**: Python +- **Architecture**: Modular CLI (Click) with DuckDB persistence and Polars DataFrames +- **Data**: Public sources only +- **Scoring**: Weighted rule-based with transparent weights +- **Reproducibility**: Versioned data snapshots, provenance tracking, checkpoint-restart ## Key Decisions | Decision | Rationale | Outcome | |----------|-----------|---------| -| Python over R/Bioconductor | User preference; rich ecosystem for data integration (pandas, scanpy, biopython) | — Pending | -| Weighted rule-based scoring over ML | Explainability is paramount; every gene's score must be traceable to specific evidence | — Pending | -| Public data only | Reproducibility — anyone can re-run the pipeline with the same inputs | — Pending | -| Modular CLI scripts over workflow manager | Flexibility for iterative development; each layer can be run/debugged independently | — Pending | -| Known gene exclusion via CiliaCarta/SYSCILIA/OMIM | Standard community-curated lists; used as both exclusion set and positive controls for validation | — Pending | -| Tiered output over fixed cutoff | Allows flexible downstream use — high-confidence for focused follow-up, medium/low for broader network analysis | — Pending | +| Python over R/Bioconductor | Rich ecosystem for data integration (polars, biopython) | ✓ Good | +| Weighted rule-based scoring over ML | Explainability paramount; every score traceable to evidence | 
✓ Good | +| Public data only | Reproducibility — anyone can re-run with same inputs | ✓ Good | +| Modular CLI scripts over workflow manager | Flexibility for iterative development; independent debugging | ✓ Good | +| DuckDB over SQLite | Native polars integration, better analytics queries | ✓ Good | +| NULL preservation (unknown ≠ zero) | Avoids penalizing genes with missing evidence | ✓ Good | +| Polars over pandas | Better performance with lazy evaluation, null handling | ✓ Good | +| LOEUF inversion (lower = more constrained = higher score) | Intuitive direction for scoring integration | ✓ Good | +| Log2 normalization for literature bias | Prevents well-studied gene dominance (TP53 problem) | ✓ Good | +| Housekeeping genes as negative controls | Literature-validated set (Eisenberg & Levanon 2013) | ✓ Good | +| Spearman rho ≥ 0.85 stability threshold | Based on rank stability literature for robustness testing | ✓ Good | +| Configurable tier thresholds | Allows flexible downstream use by confidence level | ✓ Good | --- -*Last updated: 2026-02-11 after initialization* +*Last updated: 2026-02-12 after v1.0 milestone* diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index e208681..6bdcb64 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -1,141 +1,30 @@ # Roadmap: Usher Cilia Candidate Gene Discovery Pipeline -## Overview +## Milestones -This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates. The journey progresses from foundational data infrastructure through six independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), multi-evidence scoring with transparent weights, and tiered output generation. Each phase delivers testable capabilities that compound toward a fully traceable, reproducible gene prioritization system. 
+- **v1.0 MVP** — Phases 1-6 (shipped 2026-02-12) | [Archive](milestones/v1.0-ROADMAP.md) ## Phases -**Phase Numbering:** -- Integer phases (1, 2, 3): Planned milestone work -- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED) +
+v1.0 MVP (Phases 1-6) — SHIPPED 2026-02-12 -Decimal phases appear between their surrounding integers in numeric order. +- [x] Phase 1: Data Infrastructure (4/4 plans) — completed 2026-02-11 +- [x] Phase 2: Prototype Evidence Layer (2/2 plans) — completed 2026-02-11 +- [x] Phase 3: Core Evidence Layers (6/6 plans) — completed 2026-02-11 +- [x] Phase 4: Scoring & Integration (3/3 plans) — completed 2026-02-11 +- [x] Phase 5: Output & CLI (3/3 plans) — completed 2026-02-12 +- [x] Phase 6: Validation (3/3 plans) — completed 2026-02-12 -- [x] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline -- [x] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture -- [x] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval -- [x] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system -- [x] **Phase 5: Output & CLI** - User-facing interface and tiered results -- [x] **Phase 6: Validation** - Benchmark scoring against known genes - -## Phase Details - -### Phase 1: Data Infrastructure -**Goal**: Establish reproducible data foundation and gene ID mapping utilities -**Depends on**: Nothing (first phase) -**Requirements**: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07 -**Success Criteria** (what must be TRUE): - 1. Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions - 2. Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs - 3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching - 4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading - 5. 
Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash -**Plans**: 4 plans - -Plans: -- [x] 01-01-PLAN.md -- Project scaffold, config system, and base API client -- [x] 01-02-PLAN.md -- Gene ID mapping with validation gates -- [x] 01-03-PLAN.md -- DuckDB persistence and provenance tracking -- [x] 01-04-PLAN.md -- CLI integration and end-to-end wiring - -### Phase 2: Prototype Evidence Layer -**Goal**: Validate retrieval-to-storage pattern with single evidence layer -**Depends on**: Phase 1 -**Requirements**: GCON-01, GCON-02, GCON-03 -**Success Criteria** (what must be TRUE): - 1. Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes - 2. Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags - 3. Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage - 4. Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability -**Plans**: 2 plans - -Plans: -- [x] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization -- [x] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests - -### Phase 3: Core Evidence Layers -**Goal**: Complete all remaining evidence retrieval modules -**Depends on**: Phase 2 -**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03 -**Success Criteria** (what must be TRUE): - 1. Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification - 2. Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics - 3. 
Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features - 4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions - 5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring - 6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring -**Plans**: 6 plans - -Plans: -- [x] 03-01-PLAN.md -- Gene annotation completeness (GO terms, UniProt scores, pathway membership, tier classification) -- [x] 03-02-PLAN.md -- Tissue expression (HPA, GTEx, CellxGene with Tau specificity and enrichment scoring) -- [x] 03-03-PLAN.md -- Protein sequence/structure features (UniProt/InterPro domains, cilia motifs, normalization) -- [x] 03-04-PLAN.md -- Subcellular localization (HPA subcellular, cilia proteomics, evidence type distinction) -- [x] 03-05-PLAN.md -- Animal model phenotypes (MGI, ZFIN, IMPC with HCOP ortholog mapping) -- [x] 03-06-PLAN.md -- Literature evidence (PubMed queries, evidence tier classification, quality-weighted scoring) - -### Phase 4: Scoring & Integration -**Goal**: Multi-evidence weighted scoring with known gene validation -**Depends on**: Phase 3 -**Requirements**: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05 -**Success Criteria** (what must be TRUE): - 1. Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled as exclusion set and positive controls - 2. Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene - 3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers - 4. Known cilia/Usher genes rank highly before exclusion, validating that scoring system works - 5. 
Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer -**Plans**: 3 plans - -Plans: -- [x] 04-01-PLAN.md -- Known gene compilation, weight validation, and multi-evidence scoring integration -- [x] 04-02-PLAN.md -- Quality control checks and positive control validation -- [x] 04-03-PLAN.md -- CLI score command and unit/integration tests - -### Phase 5: Output & CLI -**Goal**: User-facing interface and structured tiered output -**Depends on**: Phase 4 -**Requirements**: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05 -**Success Criteria** (what must be TRUE): - 1. Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth - 2. Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps - 3. Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools - 4. Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown - 5. Unified CLI provides subcommands for running layers, integration, and reporting with progress logging - 6. Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics -**Plans**: 3 plans - -Plans: -- [x] 05-01-PLAN.md -- Tiered candidate output with evidence summary and dual-format writer (TSV+Parquet) -- [x] 05-02-PLAN.md -- Visualizations (score distribution, layer contributions, tier breakdown) and reproducibility report -- [x] 05-03-PLAN.md -- CLI report command wiring all output modules with integration tests - -### Phase 6: Validation -**Goal**: Benchmark scoring system against positive and negative controls -**Depends on**: Phase 5 -**Requirements**: (No new requirements - validates existing system) -**Success Criteria** (what must be TRUE): - 1. 
Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates) - 2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier) - 3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates - 4. Final scoring weights are tuned based on validation metrics and documented with rationale -**Plans**: 3 plans - -Plans: -- [x] 06-01-PLAN.md -- Negative control validation (housekeeping genes) and enhanced positive control metrics (recall@k) -- [x] 06-02-PLAN.md -- Sensitivity analysis (weight perturbation sweeps with Spearman rank correlation) -- [x] 06-03-PLAN.md -- Comprehensive validation report, CLI validate command, and unit tests +
## Progress -**Execution Order:** -Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6 - -| Phase | Plans Complete | Status | Completed | -|-------|----------------|--------|-----------| -| 1. Data Infrastructure | 4/4 | Complete | 2026-02-11 | -| 2. Prototype Evidence Layer | 2/2 | Complete | 2026-02-11 | -| 3. Core Evidence Layers | 6/6 | Complete | 2026-02-11 | -| 4. Scoring & Integration | 3/3 | Complete | 2026-02-11 | -| 5. Output & CLI | 3/3 | Complete | 2026-02-12 | -| 6. Validation | 3/3 | Complete | 2026-02-12 | +| Phase | Milestone | Plans Complete | Status | Completed | +|-------|-----------|----------------|--------|-----------| +| 1. Data Infrastructure | v1.0 | 4/4 | Complete | 2026-02-11 | +| 2. Prototype Evidence Layer | v1.0 | 2/2 | Complete | 2026-02-11 | +| 3. Core Evidence Layers | v1.0 | 6/6 | Complete | 2026-02-11 | +| 4. Scoring & Integration | v1.0 | 3/3 | Complete | 2026-02-11 | +| 5. Output & CLI | v1.0 | 3/3 | Complete | 2026-02-12 | +| 6. Validation | v1.0 | 3/3 | Complete | 2026-02-12 | diff --git a/.planning/STATE.md b/.planning/STATE.md index 672a827..a937926 100644 --- a/.planning/STATE.md +++ b/.planning/STATE.md @@ -2,160 +2,48 @@ ## Project Reference -See: .planning/PROJECT.md (updated 2026-02-11) +See: .planning/PROJECT.md (updated 2026-02-12) **Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented. 
-**Current focus:** Phase 6 complete — ALL PHASES COMPLETE — milestone ready +**Current focus:** v1.0 MVP shipped — planning next milestone ## Current Position -Phase: 6 of 6 (Validation) -Plan: 3 of 3 in current phase (all plans complete) -Status: Phase 6 COMPLETE — verified (4/4 success criteria passed) -Last activity: 2026-02-12 — Phase 6 verified and complete, all phases done - -Progress: [██████████] 100.0% (21/21 plans complete across all phases) +Milestone: v1.0 MVP — SHIPPED 2026-02-12 +Status: All 6 phases complete, 21/21 plans, audited and archived ## Performance Metrics -**Velocity:** +**v1.0 Velocity:** - Total plans completed: 21 -- Average duration: 4.6 min -- Total execution time: 1.6 hours - -**By Phase:** +- Average duration: 4.6 min/plan +- Total execution time: ~1.6 hours +- Lines of code: 21,183 Python | Phase | Plans | Total | Avg/Plan | |-------|-------|-------|----------| -| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan | -| 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan | -| 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min/plan | -| 04 - Scoring Integration | 3/3 | 10 min | 3.3 min/plan | -| 05 - Output & CLI | 3/3 | 12 min | 4.0 min/plan | -| 06 - Validation | 3/3 | 10 min | 3.3 min/plan | - -**Recent Plan Details:** -| Plan | Duration | Tasks | Files | -|------|----------|-------|-------| -| Phase 04 P01 | 4 min | 2 tasks | 4 files | -| Phase 04 P02 | 3 min | 2 tasks | 4 files | -| Phase 04 P03 | 3 min | 2 tasks | 4 files | -| Phase 05 P01 | 4 min | 2 tasks | 5 files | -| Phase 05 P02 | 5 min | 2 tasks | 6 files | -| Phase 05 P03 | 3 min | 2 tasks | 3 files | -| Phase 06 P01 | 2 min | 2 tasks | 3 files | -| Phase 06 P02 | 3 min | 2 tasks | 2 files | -| Phase 06 P03 | 5 min | 2 tasks | 5 files | +| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min | +| 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min | +| 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min | +| 04 - Scoring Integration | 3/3 | 10 min | 3.3 min | 
+| 05 - Output & CLI | 3/3 | 12 min | 4.0 min | +| 06 - Validation | 3/3 | 10 min | 3.3 min | ## Accumulated Context ### Decisions -Decisions are logged in PROJECT.md Key Decisions table. -Recent decisions affecting current work: - -- Python over R/Bioconductor for rich data integration ecosystem -- Weighted rule-based scoring over ML for explainability -- Public data only for reproducibility -- Modular CLI scripts for flexibility during development -- Virtual environment required for dependency isolation (01-01: PEP 668 externally-managed Python) -- Auto-creation of directories on config load (01-01: data_dir, cache_dir field validators) -- [01-02]: Warn on gene count outside 19k-22k range but don't fail (allows for Ensembl version variations) -- [01-02]: HGNC success rate is primary validation gate (UniProt mapping tracked but not used for pass/fail) -- [01-02]: Take first UniProt accession when multiple exist (simplifies data model) -- [01-02]: Mock mygene in tests (avoids rate limits, ensures reproducibility) -- [01-03]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics) -- [01-03]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern) -- [01-04]: Click for CLI framework (standard Python CLI library with excellent UX) -- [01-04]: Setup command uses checkpoint-restart pattern (gene universe fetch can take minutes) -- [01-04]: Mock mygene in integration tests (avoids external API dependency, reproducible) -- [02-01]: httpx over requests for streaming downloads (async-native, cleaner API) -- [02-01]: structlog for structured logging (JSON-formatted, context-aware) -- [02-01]: LOEUF normalization with inversion (lower LOEUF = more constrained = higher 0-1 score) -- [02-01]: Quality flags instead of filtering (preserve all genes with measured/incomplete_coverage/no_data categorization) -- [02-01]: NULL preservation pattern (unknown constraint != zero constraint, must not be 
conflated) -- [02-01]: Lazy polars evaluation (LazyFrame until final collect() for query optimization) -- [02-02]: load_to_duckdb uses CREATE OR REPLACE for idempotency (safe to re-run) -- [02-02]: CLI evidence command group for extensibility (future evidence sources follow same pattern) -- [02-02]: Checkpoint at table level (has_checkpoint checks DuckDB table existence) -- [02-02]: Integration tests with synthetic fixtures (no external downloads, fast, reproducible) -- [03-01]: Annotation tier thresholds: Well >= (20 GO AND 4 UniProt), Partial >= (5 GO OR 3 UniProt) -- [03-01]: Composite annotation score weighting: GO 50%, UniProt 30%, Pathway 20% -- [03-01]: NULL GO counts treated as zero for tier classification but preserved as NULL in data (conservative assumption) -- [03-03]: UniProt REST API with batching (100 accessions) over bulk download for flexibility -- [03-03]: InterPro API for supplemental domain annotations (10 req/sec rate limit) -- [03-03]: Keyword-based cilia motif detection over ML for explainability (IFT, BBSome, ciliary, etc.) 
-- [03-03]: Composite protein score weights: length 15%, domain 20%, coiled-coil 20%, TM 20%, cilia 15%, scaffold 10% -- [03-03]: List(Null) edge case handling for proteins with no domains (cast to List(String)) -- [03-04]: Evidence type terminology standardized to computational (not predicted) for consistency with bioinformatics convention -- [03-04]: Proteomics absence stored as False (informative negative) vs HPA absence as NULL (unknown/not tested) -- [03-04]: Curated proteomics reference gene sets (CiliaCarta, Centrosome-DB) embedded as Python constants for simpler deployment -- [03-04]: Computational evidence (HPA Uncertain/Approved) downweighted to 0.6x vs experimental (Enhanced/Supported, proteomics) at 1.0x -- [Phase 03-05]: Ortholog confidence based on HCOP support count (HIGH: 8+, MEDIUM: 4-7, LOW: 1-3) -- [Phase 03-05]: NULL score for genes without orthologs (preserves NULL pattern) -- [03-02]: HPA bulk TSV download over per-gene API (efficient for 20K genes) -- [03-02]: GTEx retina/fallopian tube may be NULL (not in all versions) -- [03-02]: CellxGene optional dependency with --skip-cellxgene flag (large install) -- [03-02]: Tau specificity requires complete tissue data (any NULL -> NULL Tau) -- [03-02]: Expression score composite: 40% enrichment + 30% Tau + 30% target rank -- [03-02]: Inner ear data primarily from CellxGene scRNA-seq (not HPA/GTEx bulk) -- [03-06]: HTS hits prioritized over functional mentions in evidence tier hierarchy (direct > HTS > functional > incidental) -- [03-06]: Quality-weighted scoring uses log2 normalization to mitigate well-studied gene bias (prevents TP53-like dominance) -- [03-06]: Context weights cilia/sensory=2.0, cytoskeleton/polarity=1.0 for primary target prioritization -- [03-06]: Rate limiting via decorator pattern (3 req/sec default, 10 req/sec with NCBI API key) -- [04-01]: OMIM Usher genes (10) and SYSCILIA SCGS v2 core (28) as known gene positive controls -- [04-01]: NULL-preserving weighted average: 
weighted_sum / available_weight (only non-NULL layers contribute)
-- [04-01]: Quality flags based on evidence_count (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence)
-- [04-01]: Per-layer contribution tracking (score * weight) for explainability
-- [04-01]: ScoringWeights validation enforcing sum = 1.0 ± 1e-6 tolerance
-- [04-02]: scipy MAD-based outlier detection (>3 MAD threshold) for robust anomaly detection
-- [04-02]: Missing data thresholds: 50% warn, 80% error for graduated QC feedback
-- [04-02]: PERCENT_RANK validation computed before known gene exclusion (validates scoring system)
-- [04-02]: Top quartile validation criterion (median percentile >= 0.75 for known genes)
-- [04-03]: Score command follows evidence_cmd.py pattern for consistency
-- [04-03]: Separate --skip-qc and --skip-validation flags for flexible iteration
-- [04-03]: Tests use tmp_path fixtures for isolated DuckDB instances
-- [04-03]: Synthetic test data designed to ensure known genes rank highly (0.8-0.95 scores across all layers)
-- [05-01]: Configurable tier thresholds (HIGH: score>=0.7 and evidence>=3, MEDIUM: score>=0.4 and evidence>=2, LOW: score>=0.2)
-- [05-01]: EXCLUDED genes filtered out (below LOW threshold or NULL composite_score)
-- [05-01]: Deterministic sorting (composite_score DESC, gene_id ASC) for reproducible output
-- [05-01]: Dual-format TSV+Parquet with identical data for downstream tool compatibility
-- [05-01]: YAML provenance sidecar includes statistics (tier counts) and column metadata
-- [05-01]: Fixed deprecated pl.count() -> pl.len() usage for polars 0.20.5+ compatibility
-- [05-02]: matplotlib Agg backend for headless/CLI safety (non-interactive visualization)
-- [05-02]: 300 DPI for publication-quality plots
-- [05-02]: Tier color scheme: GREEN/ORANGE/RED for HIGH/MEDIUM/LOW (consistent across all plots)
-- [05-02]: Graceful degradation (individual plot failures don't block batch generation)
-- [05-02]: Dual-format reproducibility reports (JSON
machine-readable + Markdown human-readable)
-- [05-02]: Optional validation metrics in reproducibility reports (report generates whether or not validation provided)
-- [05-03]: Report command follows established CLI pattern (config load, store init, checkpoint, steps, summary, cleanup)
-- [05-03]: Configurable tier thresholds via CLI flags (--high-threshold, --medium-threshold, --low-threshold, --min-evidence-high, --min-evidence-medium)
-- [05-03]: Skip flags for flexible iteration (--skip-viz, --skip-report) allow faster output generation
-- [05-03]: Graceful degradation for visualization and reproducibility report failures (warnings, not errors)
-- [06-01]: Housekeeping genes as negative controls (13 literature-validated genes from Eisenberg & Levanon 2013)
-- [06-01]: Inverted threshold logic for negative controls (median percentile < 50% = success)
-- [06-01]: Recall@k at both absolute (100, 500, 1000, 2000) and percentage (5%, 10%, 20%) thresholds
-- [06-01]: Per-source breakdown separates OMIM Usher from SYSCILIA SCGS v2 for granular validation analysis
-- [06-02]: Perturbation deltas ±5% and ±10% (DEFAULT_DELTAS) for reasonable weight variations
-- [06-02]: Stability threshold Spearman rho >= 0.85 (STABILITY_THRESHOLD) based on rank stability literature
-- [06-02]: Renormalization maintains sum=1.0 after perturbation (weight constraint enforcement)
-- [06-02]: Top-N default 100 genes for ranking comparison (relevant for candidate prioritization)
-- [06-02]: Minimum overlap 10 genes required for Spearman correlation (avoids meaningless correlations)
-- [06-02]: Per-layer sensitivity tracking (most_sensitive_layer and most_robust_layer computed from mean rho)
-- [06-03]: Comprehensive validation report combines positive, negative, and sensitivity prongs in single Markdown document
-- [06-03]: Weight tuning recommendations include critical circular validation warnings (post-validation tuning invalidates controls)
-- [06-03]: CLI validate command provides
--skip-sensitivity flag for faster iteration during development
+All v1.0 decisions documented in PROJECT.md Key Decisions table.
 ### Pending Todos
-None yet.
+None.
 ### Blockers/Concerns
-None yet.
+None.
 ## Session Continuity
-Last session: 2026-02-12 - Phase 6 execution and verification
-Stopped at: All 6 phases complete — milestone ready for completion
-Resume file: .planning/phases/06-validation/06-VERIFICATION.md
+Last session: 2026-02-12 — v1.0 milestone completed and archived
+Next action: /gsd:new-milestone for v1.1 or v2.0
diff --git a/.planning/v1.0-MILESTONE-AUDIT.md b/.planning/milestones/v1.0-MILESTONE-AUDIT.md
similarity index 100%
rename from .planning/v1.0-MILESTONE-AUDIT.md
rename to .planning/milestones/v1.0-MILESTONE-AUDIT.md
diff --git a/.planning/REQUIREMENTS.md b/.planning/milestones/v1.0-REQUIREMENTS.md
similarity index 98%
rename from .planning/REQUIREMENTS.md
rename to .planning/milestones/v1.0-REQUIREMENTS.md
index aa29c09..6912021 100644
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/milestones/v1.0-REQUIREMENTS.md
@@ -1,3 +1,12 @@
+# Requirements Archive: v1.0 MVP
+
+**Archived:** 2026-02-12
+**Status:** SHIPPED
+
+For current requirements, see `.planning/REQUIREMENTS.md`.
+
+---
+
 # Requirements: Usher Cilia Candidate Gene Discovery Pipeline
 **Defined:** 2026-02-11
diff --git a/.planning/milestones/v1.0-ROADMAP.md b/.planning/milestones/v1.0-ROADMAP.md
new file mode 100644
index 0000000..e208681
--- /dev/null
+++ b/.planning/milestones/v1.0-ROADMAP.md
@@ -0,0 +1,141 @@
+# Roadmap: Usher Cilia Candidate Gene Discovery Pipeline
+
+## Overview
+
+This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates.
The journey progresses from foundational data infrastructure through six independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), multi-evidence scoring with transparent weights, and tiered output generation. Each phase delivers testable capabilities that compound toward a fully traceable, reproducible gene prioritization system.
+
+## Phases
+
+**Phase Numbering:**
+- Integer phases (1, 2, 3): Planned milestone work
+- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
+
+Decimal phases appear between their surrounding integers in numeric order.
+
+- [x] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline
+- [x] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
+- [x] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval
+- [x] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system
+- [x] **Phase 5: Output & CLI** - User-facing interface and tiered results
+- [x] **Phase 6: Validation** - Benchmark scoring against known genes
+
+## Phase Details
+
+### Phase 1: Data Infrastructure
+**Goal**: Establish reproducible data foundation and gene ID mapping utilities
+**Depends on**: Nothing (first phase)
+**Requirements**: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07
+**Success Criteria** (what must be TRUE):
+ 1. Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions
+ 2. Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs
+ 3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching
+ 4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading
+ 5.
Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash
+**Plans**: 4 plans
+
+Plans:
+- [x] 01-01-PLAN.md -- Project scaffold, config system, and base API client
+- [x] 01-02-PLAN.md -- Gene ID mapping with validation gates
+- [x] 01-03-PLAN.md -- DuckDB persistence and provenance tracking
+- [x] 01-04-PLAN.md -- CLI integration and end-to-end wiring
+
+### Phase 2: Prototype Evidence Layer
+**Goal**: Validate retrieval-to-storage pattern with single evidence layer
+**Depends on**: Phase 1
+**Requirements**: GCON-01, GCON-02, GCON-03
+**Success Criteria** (what must be TRUE):
+ 1. Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes
+ 2. Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags
+ 3. Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage
+ 4. Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability
+**Plans**: 2 plans
+
+Plans:
+- [x] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization
+- [x] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests
+
+### Phase 3: Core Evidence Layers
+**Goal**: Complete all remaining evidence retrieval modules
+**Depends on**: Phase 2
+**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03
+**Success Criteria** (what must be TRUE):
+ 1. Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification
+ 2. Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics
+ 3.
Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features
+ 4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions
+ 5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring
+ 6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring
+**Plans**: 6 plans
+
+Plans:
+- [x] 03-01-PLAN.md -- Gene annotation completeness (GO terms, UniProt scores, pathway membership, tier classification)
+- [x] 03-02-PLAN.md -- Tissue expression (HPA, GTEx, CellxGene with Tau specificity and enrichment scoring)
+- [x] 03-03-PLAN.md -- Protein sequence/structure features (UniProt/InterPro domains, cilia motifs, normalization)
+- [x] 03-04-PLAN.md -- Subcellular localization (HPA subcellular, cilia proteomics, evidence type distinction)
+- [x] 03-05-PLAN.md -- Animal model phenotypes (MGI, ZFIN, IMPC with HCOP ortholog mapping)
+- [x] 03-06-PLAN.md -- Literature evidence (PubMed queries, evidence tier classification, quality-weighted scoring)
+
+### Phase 4: Scoring & Integration
+**Goal**: Multi-evidence weighted scoring with known gene validation
+**Depends on**: Phase 3
+**Requirements**: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05
+**Success Criteria** (what must be TRUE):
+ 1. Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled as exclusion set and positive controls
+ 2. Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene
+ 3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers
+ 4. Known cilia/Usher genes rank highly before exclusion, validating that scoring system works
+ 5.
Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
+**Plans**: 3 plans
+
+Plans:
+- [x] 04-01-PLAN.md -- Known gene compilation, weight validation, and multi-evidence scoring integration
+- [x] 04-02-PLAN.md -- Quality control checks and positive control validation
+- [x] 04-03-PLAN.md -- CLI score command and unit/integration tests
+
+### Phase 5: Output & CLI
+**Goal**: User-facing interface and structured tiered output
+**Depends on**: Phase 4
+**Requirements**: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05
+**Success Criteria** (what must be TRUE):
+ 1. Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
+ 2. Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps
+ 3. Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools
+ 4. Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown
+ 5. Unified CLI provides subcommands for running layers, integration, and reporting with progress logging
+ 6. Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics
+**Plans**: 3 plans
+
+Plans:
+- [x] 05-01-PLAN.md -- Tiered candidate output with evidence summary and dual-format writer (TSV+Parquet)
+- [x] 05-02-PLAN.md -- Visualizations (score distribution, layer contributions, tier breakdown) and reproducibility report
+- [x] 05-03-PLAN.md -- CLI report command wiring all output modules with integration tests
+
+### Phase 6: Validation
+**Goal**: Benchmark scoring system against positive and negative controls
+**Depends on**: Phase 5
+**Requirements**: (No new requirements - validates existing system)
+**Success Criteria** (what must be TRUE):
+ 1.
Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates)
+ 2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
+ 3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
+ 4. Final scoring weights are tuned based on validation metrics and documented with rationale
+**Plans**: 3 plans
+
+Plans:
+- [x] 06-01-PLAN.md -- Negative control validation (housekeeping genes) and enhanced positive control metrics (recall@k)
+- [x] 06-02-PLAN.md -- Sensitivity analysis (weight perturbation sweeps with Spearman rank correlation)
+- [x] 06-03-PLAN.md -- Comprehensive validation report, CLI validate command, and unit tests
+
+## Progress
+
+**Execution Order:**
+Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6
+
+| Phase | Plans Complete | Status | Completed |
+|-------|----------------|--------|-----------|
+| 1. Data Infrastructure | 4/4 | Complete | 2026-02-11 |
+| 2. Prototype Evidence Layer | 2/2 | Complete | 2026-02-11 |
+| 3. Core Evidence Layers | 6/6 | Complete | 2026-02-11 |
+| 4. Scoring & Integration | 3/3 | Complete | 2026-02-11 |
+| 5. Output & CLI | 3/3 | Complete | 2026-02-12 |
+| 6. Validation | 3/3 | Complete | 2026-02-12 |