chore: complete v1.0 MVP milestone

Archive v1.0 milestone: 6 phases, 21 plans, 40/40 requirements.
Reorganize ROADMAP.md, evolve PROJECT.md, archive requirements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 21:31:43 +08:00
parent c284804493
commit a2ef2125ba
7 changed files with 275 additions and 300 deletions

.planning/MILESTONES.md Normal file

@@ -0,0 +1,27 @@
# Milestones
## v1.0 MVP (Shipped: 2026-02-12)
**Phases completed:** 6 phases, 21 plans
**Lines of code:** 21,183 Python (src + tests)
**Files:** 164 files
**Timeline:** 2026-02-11 → 2026-02-12
**Delivered:** Reproducible bioinformatics pipeline that screens ~20,000 human protein-coding genes across 6 evidence layers to identify under-studied cilia/Usher syndrome candidate genes, with transparent weighted scoring, tiered output, and comprehensive validation.
**Key accomplishments:**
1. Reproducible data foundation with Ensembl gene universe, validated HGNC/UniProt mapping, Pydantic config, DuckDB checkpoint-restart, and provenance tracking
2. 6-layer evidence integration plus literature scanning: gnomAD constraint, tissue expression, gene annotation, protein features, subcellular localization, and animal models, with PubMed literature as a seventh evidence source
3. Transparent weighted scoring with NULL-preserving composite scores, configurable per-layer weights, and quality control (missing data rates, distribution anomalies, MAD outliers)
4. Tiered candidate output (high/medium/low confidence) with dual-format export (TSV+Parquet), visualizations, and reproducibility reports
5. Comprehensive validation: positive controls (recall@k), negative controls (13 housekeeping genes), sensitivity analysis (weight perturbation with Spearman rank correlation)
6. Unified CLI with 5 subcommands (setup, evidence, score, report, validate) and consistent checkpoint-restart pattern
**v2 requirements delivered early:**
- Sensitivity analysis with parameter sweep (ASCR-03)
- Negative control validation with housekeeping genes (AOUT-02)
**Archive:** [v1.0-ROADMAP.md](milestones/v1.0-ROADMAP.md) | [v1.0-REQUIREMENTS.md](milestones/v1.0-REQUIREMENTS.md) | [v1.0-MILESTONE-AUDIT.md](milestones/v1.0-MILESTONE-AUDIT.md)
---
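
The recall@k metric named under positive-control validation can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name `recall_at_k`, the placeholder gene IDs, and the toy ranking are assumptions, and the ranking is assumed to be taken before known genes are excluded.

```python
def recall_at_k(ranked_genes, positives, k):
    """Fraction of positive-control genes that appear in the top k of a ranking."""
    positives = set(positives)
    if not positives:
        raise ValueError("positive control set is empty")
    top_k = set(ranked_genes[:k])
    return len(top_k & positives) / len(positives)


# Hypothetical ranking (best score first) and control set, for illustration only.
ranked = ["GENE_A", "MYO7A", "GENE_B", "USH2A", "GENE_C",
          "GENE_D", "USH1C", "GENE_E", "GENE_F", "GENE_G"]
known = {"MYO7A", "USH2A", "USH1C"}

recall_top3 = recall_at_k(ranked, known, k=3)  # only MYO7A is in the top 3
recall_top5 = recall_at_k(ranked, known, k=5)  # MYO7A and USH2A are in the top 5
```

Reporting recall at several values of k, rather than a single cutoff, shows how quickly the known genes accumulate toward the top of the list.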


@@ -2,33 +2,52 @@
## What This Is ## What This Is
A reproducible, explainable bioinformatics pipeline that systematically screens all human protein-coding genes (~20,000) to identify under-studied candidates likely involved in cilia/sensory cilia pathways — particularly those relevant to Usher syndrome. The pipeline integrates 6+ evidence layers, scores genes via weighted rule-based integration, and outputs a tiered candidate list for downstream protein interaction network and structural prediction analyses. A reproducible bioinformatics pipeline that screens all ~20,000 human protein-coding genes across 6 evidence layers to identify under-studied candidates likely involved in cilia/sensory cilia pathways relevant to Usher syndrome. Integrates genetic constraint, tissue expression, gene annotation, protein features, subcellular localization, animal model phenotypes, and literature evidence into a transparent weighted scoring system producing tiered candidate lists.
## Core Value ## Core Value
Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented. Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
## Current State
**Shipped:** v1.0 MVP (2026-02-12)
**Codebase:** 21,183 lines Python across 164 files
**Tech stack:** Python, Click CLI, DuckDB, Polars, Pydantic, matplotlib/seaborn, scipy, structlog
**What works:**
- `usher-pipeline setup` — fetches gene universe from Ensembl with HGNC/UniProt mapping
- `usher-pipeline evidence <layer>` — 7 evidence layer subcommands with checkpoint-restart
- `usher-pipeline score` — multi-evidence weighted scoring with QC and positive control validation
- `usher-pipeline report` — tiered output (TSV+Parquet), visualizations, reproducibility report
- `usher-pipeline validate` — positive/negative control validation, sensitivity analysis
**Known issues:**
- cellxgene-census version conflict blocks some test execution
- PubMed literature pipeline takes 3-11 hours for full gene universe (mitigated by checkpoint-restart)
## Requirements ## Requirements
### Validated ### Validated
(None yet — ship to validate) - ✓ Modular Python pipeline with independent, composable CLI scripts per evidence layer — v1.0
- ✓ Gene universe: all human protein-coding genes (Ensembl/HGNC aligned) — v1.0
- ✓ Evidence Layer 1: Gene annotation completeness (GO/UniProt) — v1.0
- ✓ Evidence Layer 2: Tissue-specific expression (HPA, GTEx, CellxGene) — v1.0
- ✓ Evidence Layer 3: Protein sequence/structure features (UniProt/InterPro) — v1.0
- ✓ Evidence Layer 4: Subcellular localization (HPA, cilia proteomics) — v1.0
- ✓ Evidence Layer 5: Genetic constraint (gnomAD pLI, LOEUF) — v1.0
- ✓ Evidence Layer 6: Animal model phenotypes (MGI, ZFIN, IMPC) — v1.0
- ✓ Systematic literature scanning per candidate — v1.0
- ✓ Known cilia/Usher gene set compiled as exclusion set and positive controls — v1.0
- ✓ Weighted rule-based multi-evidence integration scoring — v1.0
- ✓ Tiered output with per-gene evidence summaries and gap documentation — v1.0
- ✓ Output format compatible with downstream analyses — v1.0
- ✓ Sensitivity analysis with parameter sweep (originally v2, delivered early) — v1.0
- ✓ Negative control validation with housekeeping genes (originally v2, delivered early) — v1.0
### Active ### Active
- [ ] Modular Python pipeline with independent, composable CLI scripts per evidence layer (None — define with `/gsd:new-milestone`)
- [ ] Gene universe: all human protein-coding genes (Ensembl/HGNC aligned), excluding pseudogenes and transcripts lacking protein-level evidence
- [ ] Evidence Layer 1: Gene annotation completeness (GO/UniProt functional annotation depth)
- [ ] Evidence Layer 2: Tissue-specific expression (retina, inner ear/hair cells, cilia-rich tissues) from public atlases (HPA, GTEx, CellxGene published scRNA-seq)
- [ ] Evidence Layer 3: Protein sequence/structure features (length, domain composition, coiled-coil, scaffold/adaptor domains, cilia-associated motifs)
- [ ] Evidence Layer 4: Subcellular localization evidence (centrosome, basal body, cilium, stereocilia) from high-throughput proteomics datasets
- [ ] Evidence Layer 5: Human genetic constraint (loss-of-function tolerance from gnomAD, selection pressure indicators)
- [ ] Evidence Layer 6: Animal model phenotypes (sensory, balance, vision, cilia phenotypes from model organism databases)
- [ ] Systematic literature scanning per candidate (distinguishing direct experimental evidence, incidental mentions, high-throughput hits)
- [ ] Known cilia/Usher gene set compiled from public sources (CiliaCarta, SYSCILIA gold standard, OMIM Usher genes) as exclusion set and positive controls
- [ ] Weighted rule-based multi-evidence integration scoring with transparent weights
- [ ] Tiered output (high/medium/low confidence) with per-gene evidence summaries and data gap documentation
- [ ] Output format compatible with downstream PPI network analysis (STRING/BioGRID), structural prediction (AlphaFold-Multimer), and additional analyses
### Out of Scope ### Out of Scope
@@ -37,42 +56,44 @@ Produce a high-confidence, multi-evidence-backed ranked list of under-studied ci
- Downstream PPI network or structural prediction analyses — this pipeline produces the input candidate list - Downstream PPI network or structural prediction analyses — this pipeline produces the input candidate list
- Wet-lab validation — computational discovery pipeline only - Wet-lab validation — computational discovery pipeline only
- Real-time data updates — pipeline runs against versioned snapshots of source databases - Real-time data updates — pipeline runs against versioned snapshots of source databases
- Real-time web dashboard — static reports + CLI sufficient for research tool
- GUI for parameter tuning — research pipelines need reproducible CLI execution
- Variant-level analysis — gene-level discovery scope; use Exomiser/LIRICAL for variant work
- LLM-based automated literature scanning — manual/programmatic PubMed queries sufficient
- Bayesian evidence weight optimization — requires larger training set; manual tuning sufficient
## Context ## Context
Usher syndrome is the most common genetic cause of combined deafness and blindness. While several causal genes (USH1B/MYO7A, USH1C, USH2A, etc.) are known, the full molecular network — particularly scaffold, adaptor, and regulatory proteins connecting Usher complexes to cilia machinery — remains incompletely characterized. Many genes with cilia-relevant features lack functional annotation, creating a discovery opportunity. Usher syndrome is the most common genetic cause of combined deafness and blindness. While several causal genes (USH1B/MYO7A, USH1C, USH2A, etc.) are known, the full molecular network — particularly scaffold, adaptor, and regulatory proteins connecting Usher complexes to cilia machinery — remains incompletely characterized.
The pipeline targets this gap: genes that have cilia-suggestive evidence across multiple layers but haven't been studied in the Usher/sensory cilia context. By operationalizing "under-studied" (limited GO annotation, sparse mechanistic literature, not in canonical cilia gene lists) and cross-referencing with expression, structural, localization, genetic, and phenotypic evidence, the pipeline surfaces candidates that would otherwise remain invisible. The pipeline targets this gap: genes that have cilia-suggestive evidence across multiple layers but haven't been studied in the Usher/sensory cilia context.
Key public data sources: Key public data sources: Ensembl, HGNC, UniProt, Gene Ontology, Human Protein Atlas, GTEx, CellxGene, InterPro, gnomAD, MGI, ZFIN, IMPC, CiliaCarta, SYSCILIA, OMIM, PubMed.
- **Gene annotation:** Ensembl, HGNC, UniProt, Gene Ontology
- **Expression:** Human Protein Atlas, GTEx, CellxGene (published retina/cochlea scRNA-seq datasets)
- **Protein features:** UniProt domains, InterPro, Pfam
- **Localization:** Human Protein Atlas subcellular, OpenCell, published centrosome/cilium proteomics
- **Genetic constraint:** gnomAD (pLI, LOEUF scores)
- **Animal models:** MGI (mouse), ZFIN (zebrafish), IMPC
- **Known gene sets:** CiliaCarta, SYSCILIA gold standard, OMIM (Usher-related entries)
- **Literature:** PubMed/NCBI for systematic text scanning
## Constraints ## Constraints
- **Language**: Python — all pipeline modules written in Python - **Language**: Python
- **Architecture**: Modular CLI scripts — each evidence layer is an independent module, composable via standard input/output - **Architecture**: Modular CLI (Click) with DuckDB persistence and Polars DataFrames
- **Data**: Public sources only — no proprietary or access-restricted datasets - **Data**: Public sources only
- **Compute**: Local workstation with NVIDIA 4090 GPU — GPU available if needed for large-scale computations - **Scoring**: Weighted rule-based with transparent weights
- **Scoring**: Weighted rule-based — fully transparent, no black-box models - **Reproducibility**: Versioned data snapshots, provenance tracking, checkpoint-restart
- **Reproducibility**: Versioned data snapshots, pinned dependencies, documented parameters
## Key Decisions ## Key Decisions
| Decision | Rationale | Outcome | | Decision | Rationale | Outcome |
|----------|-----------|---------| |----------|-----------|---------|
| Python over R/Bioconductor | User preference; rich ecosystem for data integration (pandas, scanpy, biopython) | — Pending | | Python over R/Bioconductor | Rich ecosystem for data integration (polars, biopython) | ✓ Good |
| Weighted rule-based scoring over ML | Explainability is paramount; every gene's score must be traceable to specific evidence | — Pending | | Weighted rule-based scoring over ML | Explainability paramount; every score traceable to evidence | ✓ Good |
| Public data only | Reproducibility — anyone can re-run the pipeline with the same inputs | — Pending | | Public data only | Reproducibility — anyone can re-run with same inputs | ✓ Good |
| Modular CLI scripts over workflow manager | Flexibility for iterative development; each layer can be run/debugged independently | — Pending | | Modular CLI scripts over workflow manager | Flexibility for iterative development; independent debugging | ✓ Good |
| Known gene exclusion via CiliaCarta/SYSCILIA/OMIM | Standard community-curated lists; used as both exclusion set and positive controls for validation | — Pending | | DuckDB over SQLite | Native polars integration, better analytics queries | ✓ Good |
| Tiered output over fixed cutoff | Allows flexible downstream use — high-confidence for focused follow-up, medium/low for broader network analysis | — Pending | | NULL preservation (unknown ≠ zero) | Avoids penalizing genes with missing evidence | ✓ Good |
| Polars over pandas | Better performance with lazy evaluation, null handling | ✓ Good |
| LOEUF inversion (lower = more constrained = higher score) | Intuitive direction for scoring integration | ✓ Good |
| Log2 normalization for literature bias | Prevents well-studied gene dominance (TP53 problem) | ✓ Good |
| Housekeeping genes as negative controls | Literature-validated set (Eisenberg & Levanon 2013) | ✓ Good |
| Spearman rho ≥ 0.85 stability threshold | Based on rank stability literature for robustness testing | ✓ Good |
| Configurable tier thresholds | Allows flexible downstream use by confidence level | ✓ Good |
--- ---
*Last updated: 2026-02-11 after initialization* *Last updated: 2026-02-12 after v1.0 milestone*
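
Two of the decisions in the table, LOEUF inversion and log2 normalization of literature counts, can be illustrated with a short sketch. The exact formulas are not given here, so the min-max bounds, the `log2(1 + n)` form, and both function names are assumptions made for illustration.

```python
import math


def invert_loeuf(loeuf, lo, hi):
    """Map LOEUF onto 0-1 so that lower LOEUF (more constrained) scores higher.

    None is returned unchanged so that unknown constraint is never read as zero.
    The min-max bounds lo and hi are assumed, not taken from the pipeline.
    """
    if loeuf is None:
        return None
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(loeuf, lo), hi)
    return 1.0 - (clipped - lo) / (hi - lo)


def log2_literature_score(pub_count, max_count):
    """Compress publication counts with log2 so heavily studied genes do not dominate."""
    if pub_count is None:
        return None
    if max_count <= 0:
        raise ValueError("max_count must be positive")
    return math.log2(1 + pub_count) / math.log2(1 + max_count)


constrained = invert_loeuf(0.0, lo=0.0, hi=2.0)    # most constrained maps to 1.0
tolerant = invert_loeuf(2.0, lo=0.0, hi=2.0)       # least constrained maps to 0.0
unknown = invert_loeuf(None, lo=0.0, hi=2.0)       # stays None
mid_lit = log2_literature_score(31, max_count=1023)  # log2(32) / log2(1024)
```

With log scaling, a gene with 31 publications lands halfway up the scale even when the most-studied gene has 1,023, which is the effect the "TP53 problem" rationale describes.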


@@ -1,141 +1,30 @@
# Roadmap: Usher Cilia Candidate Gene Discovery Pipeline # Roadmap: Usher Cilia Candidate Gene Discovery Pipeline
## Overview ## Milestones
This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates. The journey progresses from foundational data infrastructure through six independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), multi-evidence scoring with transparent weights, and tiered output generation. Each phase delivers testable capabilities that compound toward a fully traceable, reproducible gene prioritization system. - **v1.0 MVP** — Phases 1-6 (shipped 2026-02-12) | [Archive](milestones/v1.0-ROADMAP.md)
## Phases ## Phases
**Phase Numbering:** <details>
- Integer phases (1, 2, 3): Planned milestone work <summary>v1.0 MVP (Phases 1-6) — SHIPPED 2026-02-12</summary>
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
Decimal phases appear between their surrounding integers in numeric order. - [x] Phase 1: Data Infrastructure (4/4 plans) — completed 2026-02-11
- [x] Phase 2: Prototype Evidence Layer (2/2 plans) — completed 2026-02-11
- [x] Phase 3: Core Evidence Layers (6/6 plans) — completed 2026-02-11
- [x] Phase 4: Scoring & Integration (3/3 plans) — completed 2026-02-11
- [x] Phase 5: Output & CLI (3/3 plans) — completed 2026-02-12
- [x] Phase 6: Validation (3/3 plans) — completed 2026-02-12
- [x] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline </details>
- [x] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
- [x] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval
- [x] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system
- [x] **Phase 5: Output & CLI** - User-facing interface and tiered results
- [x] **Phase 6: Validation** - Benchmark scoring against known genes
## Phase Details
### Phase 1: Data Infrastructure
**Goal**: Establish reproducible data foundation and gene ID mapping utilities
**Depends on**: Nothing (first phase)
**Requirements**: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07
**Success Criteria** (what must be TRUE):
1. Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions
2. Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs
3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching
4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading
5. Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash
**Plans**: 4 plans
Plans:
- [x] 01-01-PLAN.md -- Project scaffold, config system, and base API client
- [x] 01-02-PLAN.md -- Gene ID mapping with validation gates
- [x] 01-03-PLAN.md -- DuckDB persistence and provenance tracking
- [x] 01-04-PLAN.md -- CLI integration and end-to-end wiring
### Phase 2: Prototype Evidence Layer
**Goal**: Validate retrieval-to-storage pattern with single evidence layer
**Depends on**: Phase 1
**Requirements**: GCON-01, GCON-02, GCON-03
**Success Criteria** (what must be TRUE):
1. Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes
2. Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags
3. Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage
4. Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability
**Plans**: 2 plans
Plans:
- [x] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization
- [x] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests
### Phase 3: Core Evidence Layers
**Goal**: Complete all remaining evidence retrieval modules
**Depends on**: Phase 2
**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03
**Success Criteria** (what must be TRUE):
1. Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification
2. Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics
3. Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features
4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions
5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring
6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring
**Plans**: 6 plans
Plans:
- [x] 03-01-PLAN.md -- Gene annotation completeness (GO terms, UniProt scores, pathway membership, tier classification)
- [x] 03-02-PLAN.md -- Tissue expression (HPA, GTEx, CellxGene with Tau specificity and enrichment scoring)
- [x] 03-03-PLAN.md -- Protein sequence/structure features (UniProt/InterPro domains, cilia motifs, normalization)
- [x] 03-04-PLAN.md -- Subcellular localization (HPA subcellular, cilia proteomics, evidence type distinction)
- [x] 03-05-PLAN.md -- Animal model phenotypes (MGI, ZFIN, IMPC with HCOP ortholog mapping)
- [x] 03-06-PLAN.md -- Literature evidence (PubMed queries, evidence tier classification, quality-weighted scoring)
### Phase 4: Scoring & Integration
**Goal**: Multi-evidence weighted scoring with known gene validation
**Depends on**: Phase 3
**Requirements**: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05
**Success Criteria** (what must be TRUE):
1. Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled as exclusion set and positive controls
2. Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene
3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers
4. Known cilia/Usher genes rank highly before exclusion, validating that scoring system works
5. Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
**Plans**: 3 plans
Plans:
- [x] 04-01-PLAN.md -- Known gene compilation, weight validation, and multi-evidence scoring integration
- [x] 04-02-PLAN.md -- Quality control checks and positive control validation
- [x] 04-03-PLAN.md -- CLI score command and unit/integration tests
### Phase 5: Output & CLI
**Goal**: User-facing interface and structured tiered output
**Depends on**: Phase 4
**Requirements**: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05
**Success Criteria** (what must be TRUE):
1. Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
2. Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps
3. Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools
4. Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown
5. Unified CLI provides subcommands for running layers, integration, and reporting with progress logging
6. Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics
**Plans**: 3 plans
Plans:
- [x] 05-01-PLAN.md -- Tiered candidate output with evidence summary and dual-format writer (TSV+Parquet)
- [x] 05-02-PLAN.md -- Visualizations (score distribution, layer contributions, tier breakdown) and reproducibility report
- [x] 05-03-PLAN.md -- CLI report command wiring all output modules with integration tests
### Phase 6: Validation
**Goal**: Benchmark scoring system against positive and negative controls
**Depends on**: Phase 5
**Requirements**: (No new requirements - validates existing system)
**Success Criteria** (what must be TRUE):
1. Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates)
2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
4. Final scoring weights are tuned based on validation metrics and documented with rationale
**Plans**: 3 plans
Plans:
- [x] 06-01-PLAN.md -- Negative control validation (housekeeping genes) and enhanced positive control metrics (recall@k)
- [x] 06-02-PLAN.md -- Sensitivity analysis (weight perturbation sweeps with Spearman rank correlation)
- [x] 06-03-PLAN.md -- Comprehensive validation report, CLI validate command, and unit tests
## Progress ## Progress
**Execution Order:** | Phase | Milestone | Plans Complete | Status | Completed |
Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6 |-------|-----------|----------------|--------|-----------|
| 1. Data Infrastructure | v1.0 | 4/4 | Complete | 2026-02-11 |
| Phase | Plans Complete | Status | Completed | | 2. Prototype Evidence Layer | v1.0 | 2/2 | Complete | 2026-02-11 |
|-------|----------------|--------|-----------| | 3. Core Evidence Layers | v1.0 | 6/6 | Complete | 2026-02-11 |
| 1. Data Infrastructure | 4/4 | Complete | 2026-02-11 | | 4. Scoring & Integration | v1.0 | 3/3 | Complete | 2026-02-11 |
| 2. Prototype Evidence Layer | 2/2 | Complete | 2026-02-11 | | 5. Output & CLI | v1.0 | 3/3 | Complete | 2026-02-12 |
| 3. Core Evidence Layers | 6/6 | Complete | 2026-02-11 | | 6. Validation | v1.0 | 3/3 | Complete | 2026-02-12 |
| 4. Scoring & Integration | 3/3 | Complete | 2026-02-11 |
| 5. Output & CLI | 3/3 | Complete | 2026-02-12 |
| 6. Validation | 3/3 | Complete | 2026-02-12 |


@@ -2,160 +2,48 @@
## Project Reference ## Project Reference
See: .planning/PROJECT.md (updated 2026-02-11) See: .planning/PROJECT.md (updated 2026-02-12)
**Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented. **Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
**Current focus:** Phase 6 complete — ALL PHASES COMPLETE — milestone ready **Current focus:** v1.0 MVP shipped — planning next milestone
## Current Position ## Current Position
Phase: 6 of 6 (Validation) Milestone: v1.0 MVP — SHIPPED 2026-02-12
Plan: 3 of 3 in current phase (all plans complete) Status: All 6 phases complete, 21/21 plans, audited and archived
Status: Phase 6 COMPLETE — verified (4/4 success criteria passed)
Last activity: 2026-02-12 — Phase 6 verified and complete, all phases done
Progress: [██████████] 100.0% (21/21 plans complete across all phases)
## Performance Metrics ## Performance Metrics
**Velocity:** **v1.0 Velocity:**
- Total plans completed: 21 - Total plans completed: 21
- Average duration: 4.6 min - Average duration: 4.6 min/plan
- Total execution time: 1.6 hours - Total execution time: ~1.6 hours
- Lines of code: 21,183 Python
**By Phase:**
| Phase | Plans | Total | Avg/Plan | | Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------| |-------|-------|-------|----------|
| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan | | 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min |
| 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan | | 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min |
| 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min/plan | | 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min |
| 04 - Scoring & Integration | 3/3 | 10 min | 3.3 min/plan | | 04 - Scoring & Integration | 3/3 | 10 min | 3.3 min |
| 05 - Output & CLI | 3/3 | 12 min | 4.0 min/plan | | 05 - Output & CLI | 3/3 | 12 min | 4.0 min |
| 06 - Validation | 3/3 | 10 min | 3.3 min/plan | | 06 - Validation | 3/3 | 10 min | 3.3 min |
**Recent Plan Details:**
| Plan | Duration | Tasks | Files |
|------|----------|-------|-------|
| Phase 04 P01 | 4 min | 2 tasks | 4 files |
| Phase 04 P02 | 3 min | 2 tasks | 4 files |
| Phase 04 P03 | 3 min | 2 tasks | 4 files |
| Phase 05 P01 | 4 min | 2 tasks | 5 files |
| Phase 05 P02 | 5 min | 2 tasks | 6 files |
| Phase 05 P03 | 3 min | 2 tasks | 3 files |
| Phase 06 P01 | 2 min | 2 tasks | 3 files |
| Phase 06 P02 | 3 min | 2 tasks | 2 files |
| Phase 06 P03 | 5 min | 2 tasks | 5 files |
## Accumulated Context ## Accumulated Context
### Decisions ### Decisions
Decisions are logged in PROJECT.md Key Decisions table. All v1.0 decisions documented in PROJECT.md Key Decisions table.
Recent decisions affecting current work:
- Python over R/Bioconductor for rich data integration ecosystem
- Weighted rule-based scoring over ML for explainability
- Public data only for reproducibility
- Modular CLI scripts for flexibility during development
- Virtual environment required for dependency isolation (01-01: PEP 668 externally-managed Python)
- Auto-creation of directories on config load (01-01: data_dir, cache_dir field validators)
- [01-02]: Warn on gene count outside 19k-22k range but don't fail (allows for Ensembl version variations)
- [01-02]: HGNC success rate is primary validation gate (UniProt mapping tracked but not used for pass/fail)
- [01-02]: Take first UniProt accession when multiple exist (simplifies data model)
- [01-02]: Mock mygene in tests (avoids rate limits, ensures reproducibility)
- [01-03]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics)
- [01-03]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern)
- [01-04]: Click for CLI framework (standard Python CLI library with excellent UX)
- [01-04]: Setup command uses checkpoint-restart pattern (gene universe fetch can take minutes)
- [01-04]: Mock mygene in integration tests (avoids external API dependency, reproducible)
- [02-01]: httpx over requests for streaming downloads (async-native, cleaner API)
- [02-01]: structlog for structured logging (JSON-formatted, context-aware)
- [02-01]: LOEUF normalization with inversion (lower LOEUF = more constrained = higher 0-1 score)
- [02-01]: Quality flags instead of filtering (preserve all genes with measured/incomplete_coverage/no_data categorization)
- [02-01]: NULL preservation pattern (unknown constraint != zero constraint, must not be conflated)
- [02-01]: Lazy polars evaluation (LazyFrame until final collect() for query optimization)
- [02-02]: load_to_duckdb uses CREATE OR REPLACE for idempotency (safe to re-run)
- [02-02]: CLI evidence command group for extensibility (future evidence sources follow same pattern)
- [02-02]: Checkpoint at table level (has_checkpoint checks DuckDB table existence)
- [02-02]: Integration tests with synthetic fixtures (no external downloads, fast, reproducible)
- [03-01]: Annotation tier thresholds: Well >= (20 GO AND 4 UniProt), Partial >= (5 GO OR 3 UniProt)
- [03-01]: Composite annotation score weighting: GO 50%, UniProt 30%, Pathway 20%
- [03-01]: NULL GO counts treated as zero for tier classification but preserved as NULL in data (conservative assumption)
- [03-03]: UniProt REST API with batching (100 accessions) over bulk download for flexibility
- [03-03]: InterPro API for supplemental domain annotations (10 req/sec rate limit)
- [03-03]: Keyword-based cilia motif detection over ML for explainability (IFT, BBSome, ciliary, etc.)
- [03-03]: Composite protein score weights: length 15%, domain 20%, coiled-coil 20%, TM 20%, cilia 15%, scaffold 10%
- [03-03]: List(Null) edge case handling for proteins with no domains (cast to List(String))
- [03-04]: Evidence type terminology standardized to computational (not predicted) for consistency with bioinformatics convention
- [03-04]: Proteomics absence stored as False (informative negative) vs HPA absence as NULL (unknown/not tested)
- [03-04]: Curated proteomics reference gene sets (CiliaCarta, Centrosome-DB) embedded as Python constants for simpler deployment
- [03-04]: Computational evidence (HPA Uncertain/Approved) downweighted to 0.6x vs experimental (Enhanced/Supported, proteomics) at 1.0x
- [Phase 03-05]: Ortholog confidence based on HCOP support count (HIGH: 8+, MEDIUM: 4-7, LOW: 1-3)
- [Phase 03-05]: NULL score for genes without orthologs (preserves NULL pattern)
- [03-02]: HPA bulk TSV download over per-gene API (efficient for 20K genes)
- [03-02]: GTEx retina/fallopian tube may be NULL (not in all versions)
- [03-02]: CellxGene optional dependency with --skip-cellxgene flag (large install)
- [03-02]: Tau specificity requires complete tissue data (any NULL -> NULL Tau)
- [03-02]: Expression score composite: 40% enrichment + 30% Tau + 30% target rank
- [03-02]: Inner ear data primarily from CellxGene scRNA-seq (not HPA/GTEx bulk)
- [03-06]: HTS hits prioritized over functional mentions in evidence tier hierarchy (direct > HTS > functional > incidental)
- [03-06]: Quality-weighted scoring uses log2 normalization to mitigate well-studied gene bias (prevents TP53-like dominance)
- [03-06]: Context weights cilia/sensory=2.0, cytoskeleton/polarity=1.0 for primary target prioritization
- [03-06]: Rate limiting via decorator pattern (3 req/sec default, 10 req/sec with NCBI API key)
- [04-01]: OMIM Usher genes (10) and SYSCILIA SCGS v2 core (28) as known gene positive controls
- [04-01]: NULL-preserving weighted average: weighted_sum / available_weight (only non-NULL layers contribute)
- [04-01]: Quality flags based on evidence_count (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence)
- [04-01]: Per-layer contribution tracking (score * weight) for explainability
- [04-01]: ScoringWeights validation enforcing sum = 1.0 ± 1e-6 tolerance
- [04-02]: scipy MAD-based outlier detection (>3 MAD threshold) for robust anomaly detection
- [04-02]: Missing data thresholds: 50% warn, 80% error for graduated QC feedback
- [04-02]: PERCENT_RANK validation computed before known gene exclusion (validates scoring system)
- [04-02]: Top quartile validation criterion (median percentile >= 0.75 for known genes)
- [04-03]: Score command follows evidence_cmd.py pattern for consistency
- [04-03]: Separate --skip-qc and --skip-validation flags for flexible iteration
- [04-03]: Tests use tmp_path fixtures for isolated DuckDB instances
- [04-03]: Synthetic test data designed to ensure known genes rank highly (0.8-0.95 scores across all layers)
- [05-01]: Configurable tier thresholds (HIGH: score>=0.7 and evidence>=3, MEDIUM: score>=0.4 and evidence>=2, LOW: score>=0.2)
- [05-01]: EXCLUDED genes filtered out (below LOW threshold or NULL composite_score)
- [05-01]: Deterministic sorting (composite_score DESC, gene_id ASC) for reproducible output
- [05-01]: Dual-format TSV+Parquet with identical data for downstream tool compatibility
- [05-01]: YAML provenance sidecar includes statistics (tier counts) and column metadata
- [05-01]: Fixed deprecated pl.count() -> pl.len() usage for polars 0.20.5+ compatibility
- [05-02]: matplotlib Agg backend for headless/CLI safety (non-interactive visualization)
- [05-02]: 300 DPI for publication-quality plots
- [05-02]: Tier color scheme: GREEN/ORANGE/RED for HIGH/MEDIUM/LOW (consistent across all plots)
- [05-02]: Graceful degradation (individual plot failures don't block batch generation)
- [05-02]: Dual-format reproducibility reports (JSON machine-readable + Markdown human-readable)
- [05-02]: Optional validation metrics in reproducibility reports (report generates whether or not validation provided)
- [05-03]: Report command follows established CLI pattern (config load, store init, checkpoint, steps, summary, cleanup)
- [05-03]: Configurable tier thresholds via CLI flags (--high-threshold, --medium-threshold, --low-threshold, --min-evidence-high, --min-evidence-medium)
- [05-03]: Skip flags for flexible iteration (--skip-viz, --skip-report) allow faster output generation
- [05-03]: Graceful degradation for visualization and reproducibility report failures (warnings, not errors)
- [06-01]: Housekeeping genes as negative controls (13 literature-validated genes from Eisenberg & Levanon 2013)
- [06-01]: Inverted threshold logic for negative controls (median percentile < 50% = success)
- [06-01]: Recall@k at both absolute (100, 500, 1000, 2000) and percentage (5%, 10%, 20%) thresholds
- [06-01]: Per-source breakdown separates OMIM Usher from SYSCILIA SCGS v2 for granular validation analysis
- [06-02]: Perturbation deltas ±5% and ±10% (DEFAULT_DELTAS) for reasonable weight variations
- [06-02]: Stability threshold Spearman rho >= 0.85 (STABILITY_THRESHOLD) based on rank stability literature
- [06-02]: Renormalization maintains sum=1.0 after perturbation (weight constraint enforcement)
- [06-02]: Top-N default 100 genes for ranking comparison (relevant for candidate prioritization)
- [06-02]: Minimum overlap 10 genes required for Spearman correlation (avoids meaningless correlations)
- [06-02]: Per-layer sensitivity tracking (most_sensitive_layer and most_robust_layer computed from mean rho)
- [06-03]: Comprehensive validation report combines positive, negative, and sensitivity prongs in single Markdown document
- [06-03]: Weight tuning recommendations include critical circular validation warnings (post-validation tuning invalidates controls)
- [06-03]: CLI validate command provides --skip-sensitivity flag for faster iteration during development
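The NULL-preserving weighted average and quality flags recorded under [04-01] can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline's implementation: the function names, the layer names, and the example weights are hypothetical, and only the `weighted_sum / available_weight` rule and the evidence-count cut-offs come from the decisions above.

```python
from typing import Mapping, Optional


def composite_score(
    layer_scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted average over non-NULL layers: weighted_sum / available_weight."""
    weighted_sum = 0.0
    available_weight = 0.0
    for layer, weight in weights.items():
        score = layer_scores.get(layer)
        if score is None:
            # A NULL layer adds nothing to the numerator or the denominator,
            # so the gene is not penalized for missing evidence.
            continue
        weighted_sum += score * weight
        available_weight += weight
    if available_weight == 0.0:
        return None  # no layer has data: the composite stays NULL
    return weighted_sum / available_weight


def quality_flag(evidence_count: int) -> str:
    """Map the number of non-NULL layers to the flags listed under [04-01]."""
    if evidence_count >= 4:
        return "sufficient"
    if evidence_count >= 2:
        return "moderate"
    if evidence_count >= 1:
        return "sparse"
    return "no_evidence"


# Hypothetical layer names and weights, for illustration only.
weights = {"constraint": 0.5, "expression": 0.3, "literature": 0.2}
scores = {"constraint": 0.8, "expression": None, "literature": 0.4}
print(composite_score(scores, weights), quality_flag(2))
```

Dividing by the weight that is actually available keeps the composite on the same 0 to 1 scale regardless of how many layers reported data, which is why the evidence count is tracked separately as a quality flag.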
### Pending Todos
None.
### Blockers/Concerns
None.
## Session Continuity
Last session: 2026-02-12 - v1.0 milestone completed and archived
Next action: /gsd:new-milestone for v1.1 or v2.0

@@ -1,3 +1,12 @@
# Requirements Archive: v1.0 MVP
**Archived:** 2026-02-12
**Status:** SHIPPED
For current requirements, see `.planning/REQUIREMENTS.md`.
---
# Requirements: Usher Cilia Candidate Gene Discovery Pipeline
**Defined:** 2026-02-11

@@ -0,0 +1,141 @@
# Roadmap: Usher Cilia Candidate Gene Discovery Pipeline
## Overview
This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates. The journey progresses from foundational data infrastructure through six independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), multi-evidence scoring with transparent weights, and tiered output generation. Each phase delivers testable capabilities that compound toward a fully traceable, reproducible gene prioritization system.
## Phases
**Phase Numbering:**
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
Decimal phases appear between their surrounding integers in numeric order.
- [x] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline
- [x] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
- [x] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval
- [x] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system
- [x] **Phase 5: Output & CLI** - User-facing interface and tiered results
- [x] **Phase 6: Validation** - Benchmark scoring against known genes
## Phase Details
### Phase 1: Data Infrastructure
**Goal**: Establish reproducible data foundation and gene ID mapping utilities
**Depends on**: Nothing (first phase)
**Requirements**: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07
**Success Criteria** (what must be TRUE):
1. Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions
2. Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs
3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching
4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading
5. Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash
**Plans**: 4 plans
Plans:
- [x] 01-01-PLAN.md -- Project scaffold, config system, and base API client
- [x] 01-02-PLAN.md -- Gene ID mapping with validation gates
- [x] 01-03-PLAN.md -- DuckDB persistence and provenance tracking
- [x] 01-04-PLAN.md -- CLI integration and end-to-end wiring
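A minimal sketch of the provenance metadata named in success criterion 5, using only the standard library. The function names and dictionary keys are assumptions chosen for illustration; the pipeline's actual provenance schema is not shown in this roadmap.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Mapping


def config_hash(config: Mapping[str, Any]) -> str:
    """Hash a config deterministically: key order must not change the digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_provenance(
    config: Mapping[str, Any],
    pipeline_version: str,
    source_versions: Mapping[str, str],
) -> Dict[str, Any]:
    """Collect the four provenance fields listed in the success criteria."""
    return {
        "pipeline_version": pipeline_version,
        "data_source_versions": dict(source_versions),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "config_hash": config_hash(config),
    }


# Placeholder values, not real pipeline settings or data releases.
record = build_provenance(
    {"weights": {"layer_a": 0.6, "layer_b": 0.4}},
    pipeline_version="0.0.0-example",
    source_versions={"example_source": "release-x"},
)
print(record["config_hash"][:12])
```

Serializing with sorted keys before hashing means two configs with the same content but different key order produce the same hash, so the hash can be compared across runs.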
### Phase 2: Prototype Evidence Layer
**Goal**: Validate retrieval-to-storage pattern with single evidence layer
**Depends on**: Phase 1
**Requirements**: GCON-01, GCON-02, GCON-03
**Success Criteria** (what must be TRUE):
1. Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes
2. Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags
3. Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage
4. Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability
**Plans**: 2 plans
Plans:
- [x] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization
- [x] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests
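The coverage filter and "unknown" encoding from success criteria 2 and 3 might look like the sketch below. The record fields and flag names are hypothetical, and treating a low-coverage score as unusable (`None`) is an assumption; only the thresholds (mean depth >30x, >90% of CDS covered) come from the criteria above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConstraintRecord:
    gene_id: str
    loeuf: Optional[float]             # None when gnomAD reports no value
    mean_depth: Optional[float]        # mean sequencing depth over the CDS
    cds_covered_frac: Optional[float]  # fraction of CDS covered, 0.0 to 1.0


def coverage_flag(
    rec: ConstraintRecord,
    min_depth: float = 30.0,
    min_cds_frac: float = 0.90,
) -> str:
    """Classify a record as 'measured', 'low_coverage', or 'unknown'."""
    if rec.loeuf is None or rec.mean_depth is None or rec.cds_covered_frac is None:
        return "unknown"  # missing data is kept as unknown, never coerced to zero
    if rec.mean_depth > min_depth and rec.cds_covered_frac > min_cds_frac:
        return "measured"
    return "low_coverage"


def usable_loeuf(rec: ConstraintRecord) -> Optional[float]:
    """Return LOEUF only for well-covered genes (assumption: otherwise None)."""
    return rec.loeuf if coverage_flag(rec) == "measured" else None


good = ConstraintRecord("ENSG_EXAMPLE_1", loeuf=0.35, mean_depth=42.0, cds_covered_frac=0.97)
gap = ConstraintRecord("ENSG_EXAMPLE_2", loeuf=None, mean_depth=None, cds_covered_frac=None)
print(coverage_flag(good), coverage_flag(gap))
```

Keeping the gene row with a flag, rather than dropping it, is what lets later layers still contribute evidence for genes with incomplete gnomAD coverage.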
### Phase 3: Core Evidence Layers
**Goal**: Complete all remaining evidence retrieval modules
**Depends on**: Phase 2
**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03
**Success Criteria** (what must be TRUE):
1. Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification
2. Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics
3. Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features
4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions
5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring
6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring
**Plans**: 6 plans
Plans:
- [x] 03-01-PLAN.md -- Gene annotation completeness (GO terms, UniProt scores, pathway membership, tier classification)
- [x] 03-02-PLAN.md -- Tissue expression (HPA, GTEx, CellxGene with Tau specificity and enrichment scoring)
- [x] 03-03-PLAN.md -- Protein sequence/structure features (UniProt/InterPro domains, cilia motifs, normalization)
- [x] 03-04-PLAN.md -- Subcellular localization (HPA subcellular, cilia proteomics, evidence type distinction)
- [x] 03-05-PLAN.md -- Animal model phenotypes (MGI, ZFIN, IMPC with HCOP ortholog mapping)
- [x] 03-06-PLAN.md -- Literature evidence (PubMed queries, evidence tier classification, quality-weighted scoring)
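As an illustration of the Tau specificity metric used in the expression layer, the sketch below applies the standard Tau index, tau = sum(1 - x_i / x_max) / (n - 1), and returns `None` when any tissue value is missing, as noted in decision [03-02]. Whether the pipeline log-transforms expression values first is not stated here, so this version works on the values as given; returning `None` for all-zero expression is also an assumption.

```python
from typing import Optional, Sequence


def tau_specificity(expression: Sequence[Optional[float]]) -> Optional[float]:
    """Tau index: 0.0 for uniform expression, 1.0 for single-tissue expression."""
    if len(expression) < 2 or any(value is None for value in expression):
        return None  # incomplete tissue data gives a NULL Tau
    x_max = max(expression)
    if x_max <= 0:
        return None  # no detectable expression: Tau is undefined (assumption)
    n = len(expression)
    return sum(1.0 - value / x_max for value in expression) / (n - 1)


print(tau_specificity([10.0, 0.0, 0.0, 0.0]))   # expressed in one tissue only
print(tau_specificity([5.0, 5.0, 5.0]))         # uniform across tissues
print(tau_specificity([10.0, None, 0.0]))       # one tissue missing
```

Propagating `None` instead of computing Tau over the available tissues avoids inflating specificity for genes that simply were not measured in some tissues.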
### Phase 4: Scoring & Integration
**Goal**: Multi-evidence weighted scoring with known gene validation
**Depends on**: Phase 3
**Requirements**: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05
**Success Criteria** (what must be TRUE):
1. Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled as exclusion set and positive controls
2. Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene
3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers
4. Known cilia/Usher genes rank highly before exclusion, validating that scoring system works
5. Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
**Plans**: 3 plans
Plans:
- [x] 04-01-PLAN.md -- Known gene compilation, weight validation, and multi-evidence scoring integration
- [x] 04-02-PLAN.md -- Quality control checks and positive control validation
- [x] 04-03-PLAN.md -- CLI score command and unit/integration tests
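The outlier check in the quality-control step can be sketched with a median absolute deviation (MAD) rule. The project decisions mention scipy for this; the version below uses only the standard library so it stays self-contained, and it assumes an unscaled MAD, a per-layer list of scores, and that NULL scores are skipped. The function name is hypothetical.

```python
from statistics import median
from typing import List, Optional, Sequence


def mad_outliers(
    scores: Sequence[Optional[float]],
    threshold: float = 3.0,
) -> List[int]:
    """Return indices of scores lying more than `threshold` MADs from the median."""
    values = [s for s in scores if s is not None]
    if not values:
        return []
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread at all: nothing can be called an outlier
    return [
        i
        for i, s in enumerate(scores)
        if s is not None and abs(s - center) > threshold * mad
    ]


layer_scores = [0.50, 0.52, 0.48, 0.51, 0.49, None, 0.99]
print(mad_outliers(layer_scores))
```

The median and MAD are barely moved by a handful of extreme values, so this check is less sensitive to the outliers it is trying to find than a mean and standard deviation rule would be.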
### Phase 5: Output & CLI
**Goal**: User-facing interface and structured tiered output
**Depends on**: Phase 4
**Requirements**: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05
**Success Criteria** (what must be TRUE):
1. Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
2. Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps
3. Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools
4. Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown
5. Unified CLI provides subcommands for running layers, integration, and reporting with progress logging
6. Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics
**Plans**: 3 plans
Plans:
- [x] 05-01-PLAN.md -- Tiered candidate output with evidence summary and dual-format writer (TSV+Parquet)
- [x] 05-02-PLAN.md -- Visualizations (score distribution, layer contributions, tier breakdown) and reproducibility report
- [x] 05-03-PLAN.md -- CLI report command wiring all output modules with integration tests
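Tier assignment and deterministic ordering could be sketched as below. The default thresholds mirror those recorded in the [05-01] project decisions (HIGH: score>=0.7 and evidence>=3, MEDIUM: score>=0.4 and evidence>=2, LOW: score>=0.2). The function names, the tuple layout, and the fall-through behaviour (a high-scoring gene with too little evidence drops to the next tier it qualifies for) are assumptions.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[float], int]  # (gene_id, composite_score, evidence_count)


def assign_tier(
    score: Optional[float],
    evidence_count: int,
    high: float = 0.7,
    medium: float = 0.4,
    low: float = 0.2,
    min_evidence_high: int = 3,
    min_evidence_medium: int = 2,
) -> str:
    """Classify one gene as HIGH, MEDIUM, LOW, or EXCLUDED."""
    if score is None:
        return "EXCLUDED"  # NULL composite scores are never tiered
    if score >= high and evidence_count >= min_evidence_high:
        return "HIGH"
    if score >= medium and evidence_count >= min_evidence_medium:
        return "MEDIUM"
    if score >= low:
        return "LOW"
    return "EXCLUDED"


def rank_candidates(rows: List[Row]) -> List[Tuple[str, str]]:
    """Drop EXCLUDED genes, then sort by composite_score DESC, gene_id ASC."""
    kept = [
        (gene_id, score, assign_tier(score, evidence))
        for gene_id, score, evidence in rows
        if assign_tier(score, evidence) != "EXCLUDED"
    ]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [(gene_id, tier) for gene_id, _, tier in kept]


# Placeholder gene IDs and scores, for illustration only.
rows = [("G3", 0.5, 2), ("G1", 0.8, 4), ("G2", 0.8, 3), ("G4", None, 0), ("G5", 0.1, 1)]
print(rank_candidates(rows))
```

Breaking score ties on `gene_id` is what makes the output order reproducible across runs, since the sort no longer depends on input order.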
### Phase 6: Validation
**Goal**: Benchmark scoring system against positive and negative controls
**Depends on**: Phase 5
**Requirements**: (No new requirements - validates existing system)
**Success Criteria** (what must be TRUE):
1. Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates)
2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
4. Final scoring weights are tuned based on validation metrics and documented with rationale
**Plans**: 3 plans
Plans:
- [x] 06-01-PLAN.md -- Negative control validation (housekeeping genes) and enhanced positive control metrics (recall@k)
- [x] 06-02-PLAN.md -- Sensitivity analysis (weight perturbation sweeps with Spearman rank correlation)
- [x] 06-03-PLAN.md -- Comprehensive validation report, CLI validate command, and unit tests
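Two of the validation building blocks, recall@k for positive controls and weight perturbation with renormalization for sensitivity analysis, can be sketched as follows. The function names are hypothetical. The sketch also assumes that recall is computed over the full known-gene set and that a delta is added to a single layer's weight before all weights are rescaled to sum to 1.0; the roadmap does not say whether deltas are additive or multiplicative.

```python
from typing import Dict, Iterable, Mapping, Sequence


def recall_at_k(ranked_gene_ids: Sequence[str], known_genes: Iterable[str], k: int) -> float:
    """Fraction of known genes that appear among the top-k ranked genes."""
    known = set(known_genes)
    if not known:
        raise ValueError("known_genes must not be empty")
    return len(set(ranked_gene_ids[:k]) & known) / len(known)


def perturb_weights(weights: Mapping[str, float], layer: str, delta: float) -> Dict[str, float]:
    """Shift one layer's weight by `delta`, then renormalize so the sum is 1.0."""
    perturbed = dict(weights)
    perturbed[layer] = max(0.0, perturbed[layer] + delta)  # weights stay non-negative
    total = sum(perturbed.values())
    if total <= 0:
        raise ValueError("perturbed weights sum to zero")
    return {name: value / total for name, value in perturbed.items()}


# Placeholder ranking, control set, and weights, for illustration only.
ranking = ["g1", "g2", "g3", "g4"]
controls = {"g2", "g4", "g9"}
print(recall_at_k(ranking, controls, k=2))

base = {"layer_a": 0.5, "layer_b": 0.3, "layer_c": 0.2}
for delta in (-0.10, -0.05, 0.05, 0.10):
    print(delta, perturb_weights(base, "layer_a", delta))
```

Each perturbed weight set would then be used to re-score and re-rank the genes, and the new ranking compared with the baseline using Spearman rank correlation.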
## Progress
**Execution Order:**
Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Data Infrastructure | 4/4 | Complete | 2026-02-11 |
| 2. Prototype Evidence Layer | 2/2 | Complete | 2026-02-11 |
| 3. Core Evidence Layers | 6/6 | Complete | 2026-02-11 |
| 4. Scoring & Integration | 3/3 | Complete | 2026-02-11 |
| 5. Output & CLI | 3/3 | Complete | 2026-02-12 |
| 6. Validation | 3/3 | Complete | 2026-02-12 |