chore: complete v1.0 MVP milestone
Archive v1.0 milestone: 6 phases, 21 plans, 40/40 requirements. Reorganize ROADMAP.md, evolve PROJECT.md, archive requirements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
27
.planning/MILESTONES.md
Normal file
@@ -0,0 +1,27 @@
# Milestones

## v1.0 MVP (Shipped: 2026-02-12)

**Phases completed:** 6 phases, 21 plans
**Lines of code:** 21,183 Python (src + tests)
**Files:** 164 files
**Timeline:** 2026-02-11 → 2026-02-12

**Delivered:** Reproducible bioinformatics pipeline that screens ~20,000 human protein-coding genes across 6 evidence layers to identify under-studied cilia/Usher syndrome candidate genes, with transparent weighted scoring, tiered output, and comprehensive validation.

**Key accomplishments:**
1. Reproducible data foundation with Ensembl gene universe, validated HGNC/UniProt mapping, Pydantic config, DuckDB checkpoint-restart, and provenance tracking
2. Evidence integration across 6 layers (gnomAD constraint, tissue expression, gene annotation, protein features, subcellular localization, animal models) plus PubMed literature scanning
3. Transparent weighted scoring with NULL-preserving composite scores, configurable per-layer weights, and quality control (missing data rates, distribution anomalies, MAD outliers)
4. Tiered candidate output (high/medium/low confidence) with dual-format export (TSV+Parquet), visualizations, and reproducibility reports
5. Comprehensive validation: positive controls (recall@k), negative controls (13 housekeeping genes), sensitivity analysis (weight perturbation with Spearman rank correlation)
6. Unified CLI with 5 subcommands (setup, evidence, score, report, validate) and consistent checkpoint-restart pattern
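
The NULL-preserving composite score in item 3 can be illustrated with a minimal sketch. The layer names, the weights, and the choice to renormalize over the layers that have data are assumptions made for illustration; the pipeline's actual scoring code is not shown in this commit.

```python
from typing import Mapping, Optional

# Hypothetical layer names and weights, for illustration only.
EXAMPLE_WEIGHTS = {"constraint": 0.2, "expression": 0.3, "localization": 0.5}


def composite_score(
    layer_scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted mean over the layers that actually have a score.

    A layer whose score is None is skipped rather than counted as 0.0, and
    the remaining weights are renormalized (an assumption of this sketch).
    """
    present = {
        name: score
        for name, score in layer_scores.items()
        if score is not None and name in weights
    }
    total_weight = sum(weights[name] for name in present)
    if total_weight == 0:
        return None  # no usable evidence: the composite stays unknown
    weighted_sum = sum(weights[name] * score for name, score in present.items())
    return weighted_sum / total_weight


# A gene with no localization data is scored on the two layers it has.
gene = {"constraint": 1.0, "expression": 0.5, "localization": None}
print(composite_score(gene, EXAMPLE_WEIGHTS))
```

Returning `None` when every layer is missing keeps "unknown" distinct from "scored zero", which is the property the NULL-preservation decision is meant to protect.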

**v2 requirements delivered early:**
- Sensitivity analysis with parameter sweep (ASCR-03)
- Negative control validation with housekeeping genes (AOUT-02)
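
A weight-perturbation sweep with Spearman rank correlation could look roughly like the sketch below. It is pure Python and assumes complete (non-NULL) layer scores; the layer names, gene values, and the perturbation size are hypothetical, and the 0.85 threshold is the stability threshold recorded in the PROJECT.md Key Decisions table.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(a: List[float], b: List[float]) -> float:
    """Spearman correlation computed as Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined when one ranking is constant
    return cov / (var_a * var_b) ** 0.5


def perturb(weights: Dict[str, float], layer: str, delta: float) -> Dict[str, float]:
    """Shift one layer's weight by delta, then renormalize to sum to 1."""
    bumped = dict(weights)
    bumped[layer] = max(0.0, bumped[layer] + delta)
    total = sum(bumped.values())
    return {name: w / total for name, w in bumped.items()}


def scores(genes: List[Dict[str, float]], weights: Dict[str, float]) -> List[float]:
    return [sum(weights[name] * gene[name] for name in weights) for gene in genes]


# Hypothetical inputs: three layers, four genes.
weights = {"constraint": 0.4, "expression": 0.4, "literature": 0.2}
genes = [
    {"constraint": 0.9, "expression": 0.8, "literature": 0.1},
    {"constraint": 0.2, "expression": 0.3, "literature": 0.9},
    {"constraint": 0.6, "expression": 0.5, "literature": 0.4},
    {"constraint": 0.1, "expression": 0.2, "literature": 0.2},
]
baseline = scores(genes, weights)
for layer in weights:
    rho = spearman_rho(baseline, scores(genes, perturb(weights, layer, 0.1)))
    print(layer, round(rho, 3), "stable" if rho >= 0.85 else "unstable")
```

In practice `scipy.stats.spearmanr` would replace the hand-written correlation; it is spelled out here only to keep the sketch dependency-free.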

**Archive:** [v1.0-ROADMAP.md](milestones/v1.0-ROADMAP.md) | [v1.0-REQUIREMENTS.md](milestones/v1.0-REQUIREMENTS.md) | [v1.0-MILESTONE-AUDIT.md](milestones/v1.0-MILESTONE-AUDIT.md)

---

@@ -2,33 +2,52 @@
|
|||||||
|
|
||||||
## What This Is
|
## What This Is
|
||||||
|
|
||||||
A reproducible, explainable bioinformatics pipeline that systematically screens all human protein-coding genes (~20,000) to identify under-studied candidates likely involved in cilia/sensory cilia pathways — particularly those relevant to Usher syndrome. The pipeline integrates 6+ evidence layers, scores genes via weighted rule-based integration, and outputs a tiered candidate list for downstream protein interaction network and structural prediction analyses.
|
A reproducible bioinformatics pipeline that screens all ~20,000 human protein-coding genes across 6 evidence layers, plus a literature scan, to identify under-studied candidates likely involved in cilia/sensory cilia pathways relevant to Usher syndrome. It integrates genetic constraint, tissue expression, gene annotation, protein features, subcellular localization, animal model phenotypes, and literature evidence into a transparent weighted scoring system that produces tiered candidate lists.
|
||||||
|
|
||||||
## Core Value
|
## Core Value
|
||||||
|
|
||||||
Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
|
Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
|
||||||
|
|
||||||
|
## Current State
|
||||||
|
|
||||||
|
**Shipped:** v1.0 MVP (2026-02-12)
|
||||||
|
**Codebase:** 21,183 lines Python across 164 files
|
||||||
|
**Tech stack:** Python, Click CLI, DuckDB, Polars, Pydantic, matplotlib/seaborn, scipy, structlog
|
||||||
|
|
||||||
|
**What works:**
|
||||||
|
- `usher-pipeline setup` — fetches gene universe from Ensembl with HGNC/UniProt mapping
|
||||||
|
- `usher-pipeline evidence <layer>` — 7 evidence layer subcommands with checkpoint-restart
|
||||||
|
- `usher-pipeline score` — multi-evidence weighted scoring with QC and positive control validation
|
||||||
|
- `usher-pipeline report` — tiered output (TSV+Parquet), visualizations, reproducibility report
|
||||||
|
- `usher-pipeline validate` — positive/negative control validation, sensitivity analysis
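
The positive-control check behind `validate` relies on recall@k. A minimal sketch of that metric, with hypothetical gene identifiers and no claim to match the pipeline's own function names:

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_genes: Sequence[str], positives: Iterable[str], k: int) -> float:
    """Fraction of known positive-control genes found in the top k ranks.

    Positives that are absent from the ranking count as misses.
    """
    positive_set = set(positives)
    if not positive_set:
        raise ValueError("need at least one positive-control gene")
    top_k = set(ranked_genes[:k])
    return len(top_k & positive_set) / len(positive_set)


# Hypothetical ranking (best first) and positive-control set.
ranking = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", "G10"]
known = {"G1", "G2", "G9"}
print(recall_at_k(ranking, known, k=3))
```

For a criterion phrased as "recall in the top 10% of candidates", `k` would be set to one tenth of the ranked list length.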
|
||||||
|
|
||||||
|
**Known issues:**
|
||||||
|
- cellxgene-census version conflict blocks some test execution
|
||||||
|
- PubMed literature pipeline takes 3-11 hours for full gene universe (mitigated by checkpoint-restart)
|
||||||
|
|
||||||
## Requirements
|
## Requirements
|
||||||
|
|
||||||
### Validated
|
### Validated
|
||||||
|
|
||||||
(None yet — ship to validate)
|
- ✓ Modular Python pipeline with independent, composable CLI scripts per evidence layer — v1.0
|
||||||
|
- ✓ Gene universe: all human protein-coding genes (Ensembl/HGNC aligned) — v1.0
|
||||||
|
- ✓ Evidence Layer 1: Gene annotation completeness (GO/UniProt) — v1.0
|
||||||
|
- ✓ Evidence Layer 2: Tissue-specific expression (HPA, GTEx, CellxGene) — v1.0
|
||||||
|
- ✓ Evidence Layer 3: Protein sequence/structure features (UniProt/InterPro) — v1.0
|
||||||
|
- ✓ Evidence Layer 4: Subcellular localization (HPA, cilia proteomics) — v1.0
|
||||||
|
- ✓ Evidence Layer 5: Genetic constraint (gnomAD pLI, LOEUF) — v1.0
|
||||||
|
- ✓ Evidence Layer 6: Animal model phenotypes (MGI, ZFIN, IMPC) — v1.0
|
||||||
|
- ✓ Systematic literature scanning per candidate — v1.0
|
||||||
|
- ✓ Known cilia/Usher gene set compiled as exclusion set and positive controls — v1.0
|
||||||
|
- ✓ Weighted rule-based multi-evidence integration scoring — v1.0
|
||||||
|
- ✓ Tiered output with per-gene evidence summaries and gap documentation — v1.0
|
||||||
|
- ✓ Output format compatible with downstream analyses — v1.0
|
||||||
|
- ✓ Sensitivity analysis with parameter sweep (originally v2, delivered early) — v1.0
|
||||||
|
- ✓ Negative control validation with housekeeping genes (originally v2, delivered early) — v1.0
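
For Evidence Layer 1, the decisions log for plan 03-01 records tier thresholds (Well: at least 20 GO terms AND a UniProt score of at least 4; Partial: at least 5 GO terms OR a UniProt score of at least 3) and a 50/30/20 composite weighting. The sketch below applies those numbers; the tier label `"poor"`, the function names, the treatment of a missing UniProt score as zero, and the assumption that the composite inputs are already scaled to 0-1 are all hypothetical.

```python
from typing import Optional


def annotation_tier(go_count: Optional[int], uniprot_score: Optional[int]) -> str:
    """Classify annotation depth using the thresholds from plan 03-01.

    A missing GO count is treated as zero for classification only, as the
    decision log describes; doing the same for UniProt is an assumption.
    """
    go = 0 if go_count is None else go_count
    uniprot = 0 if uniprot_score is None else uniprot_score
    if go >= 20 and uniprot >= 4:
        return "well"
    if go >= 5 or uniprot >= 3:
        return "partial"
    return "poor"  # hypothetical label for the lowest tier


def annotation_composite(go_norm: float, uniprot_norm: float, pathway_norm: float) -> float:
    """Composite annotation score: GO 50%, UniProt 30%, pathway 20%.

    Inputs are assumed to be normalized to the 0-1 range beforehand.
    """
    return 0.5 * go_norm + 0.3 * uniprot_norm + 0.2 * pathway_norm


print(annotation_tier(25, 5), annotation_tier(None, 3), annotation_tier(2, 1))
print(annotation_composite(0.4, 1.0, 0.0))
```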
|
||||||
|
|
||||||
### Active
|
### Active
|
||||||
|
|
||||||
- [ ] Modular Python pipeline with independent, composable CLI scripts per evidence layer
|
(None — define with `/gsd:new-milestone`)
|
||||||
- [ ] Gene universe: all human protein-coding genes (Ensembl/HGNC aligned), excluding pseudogenes and transcripts lacking protein-level evidence
|
|
||||||
- [ ] Evidence Layer 1: Gene annotation completeness (GO/UniProt functional annotation depth)
|
|
||||||
- [ ] Evidence Layer 2: Tissue-specific expression (retina, inner ear/hair cells, cilia-rich tissues) from public atlases (HPA, GTEx, CellxGene published scRNA-seq)
|
|
||||||
- [ ] Evidence Layer 3: Protein sequence/structure features (length, domain composition, coiled-coil, scaffold/adaptor domains, cilia-associated motifs)
|
|
||||||
- [ ] Evidence Layer 4: Subcellular localization evidence (centrosome, basal body, cilium, stereocilia) from high-throughput proteomics datasets
|
|
||||||
- [ ] Evidence Layer 5: Human genetic constraint (loss-of-function tolerance from gnomAD, selection pressure indicators)
|
|
||||||
- [ ] Evidence Layer 6: Animal model phenotypes (sensory, balance, vision, cilia phenotypes from model organism databases)
|
|
||||||
- [ ] Systematic literature scanning per candidate (distinguishing direct experimental evidence, incidental mentions, high-throughput hits)
|
|
||||||
- [ ] Known cilia/Usher gene set compiled from public sources (CiliaCarta, SYSCILIA gold standard, OMIM Usher genes) as exclusion set and positive controls
|
|
||||||
- [ ] Weighted rule-based multi-evidence integration scoring with transparent weights
|
|
||||||
- [ ] Tiered output (high/medium/low confidence) with per-gene evidence summaries and data gap documentation
|
|
||||||
- [ ] Output format compatible with downstream PPI network analysis (STRING/BioGRID), structural prediction (AlphaFold-Multimer), and additional analyses
|
|
||||||
|
|
||||||
### Out of Scope
|
### Out of Scope
|
||||||
|
|
||||||
@@ -37,42 +56,44 @@ Produce a high-confidence, multi-evidence-backed ranked list of under-studied ci
|
|||||||
- Downstream PPI network or structural prediction analyses — this pipeline produces the input candidate list
|
- Downstream PPI network or structural prediction analyses — this pipeline produces the input candidate list
|
||||||
- Wet-lab validation — computational discovery pipeline only
|
- Wet-lab validation — computational discovery pipeline only
|
||||||
- Real-time data updates — pipeline runs against versioned snapshots of source databases
|
- Real-time data updates — pipeline runs against versioned snapshots of source databases
|
||||||
|
- Real-time web dashboard — static reports + CLI sufficient for research tool
|
||||||
|
- GUI for parameter tuning — research pipelines need reproducible CLI execution
|
||||||
|
- Variant-level analysis — gene-level discovery scope; use Exomiser/LIRICAL for variant work
|
||||||
|
- LLM-based automated literature scanning — manual/programmatic PubMed queries sufficient
|
||||||
|
- Bayesian evidence weight optimization — requires larger training set; manual tuning sufficient
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
Usher syndrome is the most common genetic cause of combined deafness and blindness. While several causal genes (USH1B/MYO7A, USH1C, USH2A, etc.) are known, the full molecular network — particularly scaffold, adaptor, and regulatory proteins connecting Usher complexes to cilia machinery — remains incompletely characterized. Many genes with cilia-relevant features lack functional annotation, creating a discovery opportunity.
|
Usher syndrome is the most common genetic cause of combined deafness and blindness. While several causal genes (USH1B/MYO7A, USH1C, USH2A, etc.) are known, the full molecular network — particularly scaffold, adaptor, and regulatory proteins connecting Usher complexes to cilia machinery — remains incompletely characterized.
|
||||||
|
|
||||||
The pipeline targets this gap: genes that have cilia-suggestive evidence across multiple layers but haven't been studied in the Usher/sensory cilia context. By operationalizing "under-studied" (limited GO annotation, sparse mechanistic literature, not in canonical cilia gene lists) and cross-referencing with expression, structural, localization, genetic, and phenotypic evidence, the pipeline surfaces candidates that would otherwise remain invisible.
|
The pipeline targets this gap: genes that have cilia-suggestive evidence across multiple layers but haven't been studied in the Usher/sensory cilia context.
|
||||||
|
|
||||||
Key public data sources:
|
Key public data sources: Ensembl, HGNC, UniProt, Gene Ontology, Human Protein Atlas, GTEx, CellxGene, InterPro, gnomAD, MGI, ZFIN, IMPC, CiliaCarta, SYSCILIA, OMIM, PubMed.
|
||||||
- **Gene annotation:** Ensembl, HGNC, UniProt, Gene Ontology
|
|
||||||
- **Expression:** Human Protein Atlas, GTEx, CellxGene (published retina/cochlea scRNA-seq datasets)
|
|
||||||
- **Protein features:** UniProt domains, InterPro, Pfam
|
|
||||||
- **Localization:** Human Protein Atlas subcellular, OpenCell, published centrosome/cilium proteomics
|
|
||||||
- **Genetic constraint:** gnomAD (pLI, LOEUF scores)
|
|
||||||
- **Animal models:** MGI (mouse), ZFIN (zebrafish), IMPC
|
|
||||||
- **Known gene sets:** CiliaCarta, SYSCILIA gold standard, OMIM (Usher-related entries)
|
|
||||||
- **Literature:** PubMed/NCBI for systematic text scanning
|
|
||||||
|
|
||||||
## Constraints
|
## Constraints
|
||||||
|
|
||||||
- **Language**: Python — all pipeline modules written in Python
|
- **Language**: Python
|
||||||
- **Architecture**: Modular CLI scripts — each evidence layer is an independent module, composable via standard input/output
|
- **Architecture**: Modular CLI (Click) with DuckDB persistence and Polars DataFrames
|
||||||
- **Data**: Public sources only — no proprietary or access-restricted datasets
|
- **Data**: Public sources only
|
||||||
- **Compute**: Local workstation with NVIDIA 4090 GPU — GPU available if needed for large-scale computations
|
- **Scoring**: Weighted rule-based with transparent weights
|
||||||
- **Scoring**: Weighted rule-based — fully transparent, no black-box models
|
- **Reproducibility**: Versioned data snapshots, provenance tracking, checkpoint-restart
|
||||||
- **Reproducibility**: Versioned data snapshots, pinned dependencies, documented parameters
|
|
||||||
|
|
||||||
## Key Decisions
|
## Key Decisions
|
||||||
|
|
||||||
| Decision | Rationale | Outcome |
|
| Decision | Rationale | Outcome |
|
||||||
|----------|-----------|---------|
|
|----------|-----------|---------|
|
||||||
| Python over R/Bioconductor | User preference; rich ecosystem for data integration (pandas, scanpy, biopython) | — Pending |
|
| Python over R/Bioconductor | Rich ecosystem for data integration (polars, biopython) | ✓ Good |
|
||||||
| Weighted rule-based scoring over ML | Explainability is paramount; every gene's score must be traceable to specific evidence | — Pending |
|
| Weighted rule-based scoring over ML | Explainability paramount; every score traceable to evidence | ✓ Good |
|
||||||
| Public data only | Reproducibility — anyone can re-run the pipeline with the same inputs | — Pending |
|
| Public data only | Reproducibility — anyone can re-run with same inputs | ✓ Good |
|
||||||
| Modular CLI scripts over workflow manager | Flexibility for iterative development; each layer can be run/debugged independently | — Pending |
|
| Modular CLI scripts over workflow manager | Flexibility for iterative development; independent debugging | ✓ Good |
|
||||||
| Known gene exclusion via CiliaCarta/SYSCILIA/OMIM | Standard community-curated lists; used as both exclusion set and positive controls for validation | — Pending |
|
| DuckDB over SQLite | Native polars integration, better analytics queries | ✓ Good |
|
||||||
| Tiered output over fixed cutoff | Allows flexible downstream use — high-confidence for focused follow-up, medium/low for broader network analysis | — Pending |
|
| NULL preservation (unknown ≠ zero) | Avoids penalizing genes with missing evidence | ✓ Good |
|
||||||
|
| Polars over pandas | Better performance with lazy evaluation, null handling | ✓ Good |
|
||||||
|
| LOEUF inversion (lower = more constrained = higher score) | Intuitive direction for scoring integration | ✓ Good |
|
||||||
|
| Log2 normalization for literature bias | Prevents well-studied gene dominance (TP53 problem) | ✓ Good |
|
||||||
|
| Housekeeping genes as negative controls | Literature-validated set (Eisenberg & Levanon 2013) | ✓ Good |
|
||||||
|
| Spearman rho ≥ 0.85 stability threshold | Based on rank stability literature for robustness testing | ✓ Good |
|
||||||
|
| Configurable tier thresholds | Allows flexible downstream use by confidence level | ✓ Good |
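
Two of the decisions in the table, LOEUF inversion and log2 normalization of literature counts, are simple transforms. The sketch below shows one plausible form of each; min-max scaling for LOEUF and `log2(count + 1)` for publication counts are assumptions, since the table records only the direction and intent of each transform.

```python
import math
from typing import List, Optional


def loeuf_scores(values: List[Optional[float]]) -> List[Optional[float]]:
    """Min-max scale LOEUF and invert it: lower LOEUF gives a higher 0-1 score.

    Missing values stay None so that unknown constraint is not read as zero.
    """
    known = [v for v in values if v is not None]
    if not known:
        return [None] * len(values)
    low, high = min(known), max(known)
    span = high - low
    if span == 0:
        # Degenerate case: all known values identical; midpoint is arbitrary.
        return [None if v is None else 0.5 for v in values]
    return [None if v is None else (high - v) / span for v in values]


def log2_publication_signal(publication_count: int) -> float:
    """Compress publication counts so heavily studied genes do not dominate."""
    return math.log2(publication_count + 1)


print(loeuf_scores([0.2, 1.0, None, 0.6]))
print(log2_publication_signal(10), log2_publication_signal(10_000))
```

Under this form a gene with 10,000 papers scores roughly four times a gene with 10 papers, rather than a thousand times.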
|
||||||
|
|
||||||
---
|
---
|
||||||
*Last updated: 2026-02-11 after initialization*
|
*Last updated: 2026-02-12 after v1.0 milestone*
|
||||||
|
|||||||
@@ -1,141 +1,30 @@
|
|||||||
# Roadmap: Usher Cilia Candidate Gene Discovery Pipeline
|
# Roadmap: Usher Cilia Candidate Gene Discovery Pipeline
|
||||||
|
|
||||||
## Overview
|
## Milestones
|
||||||
|
|
||||||
This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates. The journey progresses from foundational data infrastructure through six independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), multi-evidence scoring with transparent weights, and tiered output generation. Each phase delivers testable capabilities that compound toward a fully traceable, reproducible gene prioritization system.
|
- **v1.0 MVP** — Phases 1-6 (shipped 2026-02-12) | [Archive](milestones/v1.0-ROADMAP.md)
|
||||||
|
|
||||||
## Phases
|
## Phases
|
||||||
|
|
||||||
**Phase Numbering:**
|
<details>
|
||||||
- Integer phases (1, 2, 3): Planned milestone work
|
<summary>v1.0 MVP (Phases 1-6) — SHIPPED 2026-02-12</summary>
|
||||||
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
|
|
||||||
|
|
||||||
Decimal phases appear between their surrounding integers in numeric order.
|
- [x] Phase 1: Data Infrastructure (4/4 plans) — completed 2026-02-11
|
||||||
|
- [x] Phase 2: Prototype Evidence Layer (2/2 plans) — completed 2026-02-11
|
||||||
|
- [x] Phase 3: Core Evidence Layers (6/6 plans) — completed 2026-02-11
|
||||||
|
- [x] Phase 4: Scoring & Integration (3/3 plans) — completed 2026-02-11
|
||||||
|
- [x] Phase 5: Output & CLI (3/3 plans) — completed 2026-02-12
|
||||||
|
- [x] Phase 6: Validation (3/3 plans) — completed 2026-02-12
|
||||||
|
|
||||||
- [x] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline
|
</details>
|
||||||
- [x] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
|
|
||||||
- [x] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval
|
|
||||||
- [x] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system
|
|
||||||
- [x] **Phase 5: Output & CLI** - User-facing interface and tiered results
|
|
||||||
- [x] **Phase 6: Validation** - Benchmark scoring against known genes
|
|
||||||
|
|
||||||
## Phase Details
|
|
||||||
|
|
||||||
### Phase 1: Data Infrastructure
|
|
||||||
**Goal**: Establish reproducible data foundation and gene ID mapping utilities
|
|
||||||
**Depends on**: Nothing (first phase)
|
|
||||||
**Requirements**: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07
|
|
||||||
**Success Criteria** (what must be TRUE):
|
|
||||||
1. Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions
|
|
||||||
2. Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs
|
|
||||||
3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching
|
|
||||||
4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading
|
|
||||||
5. Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash
|
|
||||||
**Plans**: 4 plans
|
|
||||||
|
|
||||||
Plans:
|
|
||||||
- [x] 01-01-PLAN.md -- Project scaffold, config system, and base API client
|
|
||||||
- [x] 01-02-PLAN.md -- Gene ID mapping with validation gates
|
|
||||||
- [x] 01-03-PLAN.md -- DuckDB persistence and provenance tracking
|
|
||||||
- [x] 01-04-PLAN.md -- CLI integration and end-to-end wiring
|
|
||||||
|
|
||||||
### Phase 2: Prototype Evidence Layer
|
|
||||||
**Goal**: Validate retrieval-to-storage pattern with single evidence layer
|
|
||||||
**Depends on**: Phase 1
|
|
||||||
**Requirements**: GCON-01, GCON-02, GCON-03
|
|
||||||
**Success Criteria** (what must be TRUE):
|
|
||||||
1. Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes
|
|
||||||
2. Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags
|
|
||||||
3. Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage
|
|
||||||
4. Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability
|
|
||||||
**Plans**: 2 plans
|
|
||||||
|
|
||||||
Plans:
|
|
||||||
- [x] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization
|
|
||||||
- [x] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests
|
|
||||||
|
|
||||||
### Phase 3: Core Evidence Layers
|
|
||||||
**Goal**: Complete all remaining evidence retrieval modules
|
|
||||||
**Depends on**: Phase 2
|
|
||||||
**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03
|
|
||||||
**Success Criteria** (what must be TRUE):
|
|
||||||
1. Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification
|
|
||||||
2. Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics
|
|
||||||
3. Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features
|
|
||||||
4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions
|
|
||||||
5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring
|
|
||||||
6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring
|
|
||||||
**Plans**: 6 plans
|
|
||||||
|
|
||||||
Plans:
|
|
||||||
- [x] 03-01-PLAN.md -- Gene annotation completeness (GO terms, UniProt scores, pathway membership, tier classification)
|
|
||||||
- [x] 03-02-PLAN.md -- Tissue expression (HPA, GTEx, CellxGene with Tau specificity and enrichment scoring)
|
|
||||||
- [x] 03-03-PLAN.md -- Protein sequence/structure features (UniProt/InterPro domains, cilia motifs, normalization)
|
|
||||||
- [x] 03-04-PLAN.md -- Subcellular localization (HPA subcellular, cilia proteomics, evidence type distinction)
|
|
||||||
- [x] 03-05-PLAN.md -- Animal model phenotypes (MGI, ZFIN, IMPC with HCOP ortholog mapping)
|
|
||||||
- [x] 03-06-PLAN.md -- Literature evidence (PubMed queries, evidence tier classification, quality-weighted scoring)
|
|
||||||
|
|
||||||
### Phase 4: Scoring & Integration
|
|
||||||
**Goal**: Multi-evidence weighted scoring with known gene validation
|
|
||||||
**Depends on**: Phase 3
|
|
||||||
**Requirements**: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05
|
|
||||||
**Success Criteria** (what must be TRUE):
|
|
||||||
1. Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled as exclusion set and positive controls
|
|
||||||
2. Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene
|
|
||||||
3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers
|
|
||||||
4. Known cilia/Usher genes rank highly before exclusion, validating that scoring system works
|
|
||||||
5. Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
|
|
||||||
**Plans**: 3 plans
|
|
||||||
|
|
||||||
Plans:
|
|
||||||
- [x] 04-01-PLAN.md -- Known gene compilation, weight validation, and multi-evidence scoring integration
|
|
||||||
- [x] 04-02-PLAN.md -- Quality control checks and positive control validation
|
|
||||||
- [x] 04-03-PLAN.md -- CLI score command and unit/integration tests
|
|
||||||
|
|
||||||
### Phase 5: Output & CLI
|
|
||||||
**Goal**: User-facing interface and structured tiered output
|
|
||||||
**Depends on**: Phase 4
|
|
||||||
**Requirements**: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05
|
|
||||||
**Success Criteria** (what must be TRUE):
|
|
||||||
1. Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
|
|
||||||
2. Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps
|
|
||||||
3. Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools
|
|
||||||
4. Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown
|
|
||||||
5. Unified CLI provides subcommands for running layers, integration, and reporting with progress logging
|
|
||||||
6. Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics
|
|
||||||
**Plans**: 3 plans
|
|
||||||
|
|
||||||
Plans:
|
|
||||||
- [x] 05-01-PLAN.md -- Tiered candidate output with evidence summary and dual-format writer (TSV+Parquet)
|
|
||||||
- [x] 05-02-PLAN.md -- Visualizations (score distribution, layer contributions, tier breakdown) and reproducibility report
|
|
||||||
- [x] 05-03-PLAN.md -- CLI report command wiring all output modules with integration tests
|
|
||||||
|
|
||||||
### Phase 6: Validation
|
|
||||||
**Goal**: Benchmark scoring system against positive and negative controls
|
|
||||||
**Depends on**: Phase 5
|
|
||||||
**Requirements**: (No new requirements - validates existing system)
|
|
||||||
**Success Criteria** (what must be TRUE):
|
|
||||||
1. Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates)
|
|
||||||
2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
|
|
||||||
3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
|
|
||||||
4. Final scoring weights are tuned based on validation metrics and documented with rationale
|
|
||||||
**Plans**: 3 plans
|
|
||||||
|
|
||||||
Plans:
|
|
||||||
- [x] 06-01-PLAN.md -- Negative control validation (housekeeping genes) and enhanced positive control metrics (recall@k)
|
|
||||||
- [x] 06-02-PLAN.md -- Sensitivity analysis (weight perturbation sweeps with Spearman rank correlation)
|
|
||||||
- [x] 06-03-PLAN.md -- Comprehensive validation report, CLI validate command, and unit tests
|
|
||||||
|
|
||||||
## Progress
|
## Progress
|
||||||
|
|
||||||
**Execution Order:**
|
| Phase | Milestone | Plans Complete | Status | Completed |
|
||||||
Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6
|
|-------|-----------|----------------|--------|-----------|
|
||||||
|
| 1. Data Infrastructure | v1.0 | 4/4 | Complete | 2026-02-11 |
|
||||||
| Phase | Plans Complete | Status | Completed |
|
| 2. Prototype Evidence Layer | v1.0 | 2/2 | Complete | 2026-02-11 |
|
||||||
|-------|----------------|--------|-----------|
|
| 3. Core Evidence Layers | v1.0 | 6/6 | Complete | 2026-02-11 |
|
||||||
| 1. Data Infrastructure | 4/4 | Complete | 2026-02-11 |
|
| 4. Scoring & Integration | v1.0 | 3/3 | Complete | 2026-02-11 |
|
||||||
| 2. Prototype Evidence Layer | 2/2 | Complete | 2026-02-11 |
|
| 5. Output & CLI | v1.0 | 3/3 | Complete | 2026-02-12 |
|
||||||
| 3. Core Evidence Layers | 6/6 | Complete | 2026-02-11 |
|
| 6. Validation | v1.0 | 3/3 | Complete | 2026-02-12 |
|
||||||
| 4. Scoring & Integration | 3/3 | Complete | 2026-02-11 |
|
|
||||||
| 5. Output & CLI | 3/3 | Complete | 2026-02-12 |
|
|
||||||
| 6. Validation | 3/3 | Complete | 2026-02-12 |
|
|
||||||
|
|||||||
@@ -2,160 +2,48 @@
|
|||||||
|
|
||||||
## Project Reference
|
## Project Reference
|
||||||
|
|
||||||
See: .planning/PROJECT.md (updated 2026-02-11)
|
See: .planning/PROJECT.md (updated 2026-02-12)
|
||||||
|
|
||||||
**Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
|
**Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
|
||||||
**Current focus:** Phase 6 complete — ALL PHASES COMPLETE — milestone ready
|
**Current focus:** v1.0 MVP shipped — planning next milestone
|
||||||
|
|
||||||
## Current Position
|
## Current Position
|
||||||
|
|
||||||
Phase: 6 of 6 (Validation)
|
Milestone: v1.0 MVP — SHIPPED 2026-02-12
|
||||||
Plan: 3 of 3 in current phase (all plans complete)
|
Status: All 6 phases complete, 21/21 plans, audited and archived
|
||||||
Status: Phase 6 COMPLETE — verified (4/4 success criteria passed)
|
|
||||||
Last activity: 2026-02-12 — Phase 6 verified and complete, all phases done
|
|
||||||
|
|
||||||
Progress: [██████████] 100.0% (21/21 plans complete across all phases)
|
|
||||||
|
|
||||||
## Performance Metrics
|
## Performance Metrics
|
||||||
|
|
||||||
**Velocity:**
|
**v1.0 Velocity:**
|
||||||
- Total plans completed: 21
|
- Total plans completed: 21
|
||||||
- Average duration: 4.6 min
|
- Average duration: 4.6 min/plan
|
||||||
- Total execution time: 1.6 hours
|
- Total execution time: ~1.6 hours
|
||||||
|
- Lines of code: 21,183 Python
|
||||||
**By Phase:**
|
|
||||||
|
|
||||||
| Phase | Plans | Total | Avg/Plan |
|
| Phase | Plans | Total | Avg/Plan |
|
||||||
|-------|-------|-------|----------|
|
|-------|-------|-------|----------|
|
||||||
| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
|
| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min |
|
||||||
| 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan |
|
| 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min |
|
||||||
| 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min/plan |
|
| 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min |
|
||||||
| 04 - Scoring Integration | 3/3 | 10 min | 3.3 min/plan |
|
| 04 - Scoring Integration | 3/3 | 10 min | 3.3 min |
|
||||||
| 05 - Output & CLI | 3/3 | 12 min | 4.0 min/plan |
|
| 05 - Output & CLI | 3/3 | 12 min | 4.0 min |
|
||||||
| 06 - Validation | 3/3 | 10 min | 3.3 min/plan |
|
| 06 - Validation | 3/3 | 10 min | 3.3 min |
|
||||||
|
|
||||||
**Recent Plan Details:**
|
|
||||||
| Plan | Duration | Tasks | Files |
|
|
||||||
|------|----------|-------|-------|
|
|
||||||
| Phase 04 P01 | 4 min | 2 tasks | 4 files |
|
|
||||||
| Phase 04 P02 | 3 min | 2 tasks | 4 files |
|
|
||||||
| Phase 04 P03 | 3 min | 2 tasks | 4 files |
|
|
||||||
| Phase 05 P01 | 4 min | 2 tasks | 5 files |
|
|
||||||
| Phase 05 P02 | 5 min | 2 tasks | 6 files |
|
|
||||||
| Phase 05 P03 | 3 min | 2 tasks | 3 files |
|
|
||||||
| Phase 06 P01 | 2 min | 2 tasks | 3 files |
|
|
||||||
| Phase 06 P02 | 3 min | 2 tasks | 2 files |
|
|
||||||
| Phase 06 P03 | 5 min | 2 tasks | 5 files |
|
|
||||||
|
|
||||||
## Accumulated Context
|
## Accumulated Context
|
||||||
|
|
||||||
### Decisions
|
### Decisions
|
||||||
|
|
||||||
Decisions are logged in PROJECT.md Key Decisions table.
|
All v1.0 decisions documented in PROJECT.md Key Decisions table.
|
||||||
Recent decisions affecting current work:
|
|
||||||
|
|
||||||
- Python over R/Bioconductor for rich data integration ecosystem
|
|
||||||
- Weighted rule-based scoring over ML for explainability
|
|
||||||
- Public data only for reproducibility
|
|
||||||
- Modular CLI scripts for flexibility during development
|
|
||||||
- Virtual environment required for dependency isolation (01-01: PEP 668 externally-managed Python)
|
|
||||||
- Auto-creation of directories on config load (01-01: data_dir, cache_dir field validators)
|
|
||||||
- [01-02]: Warn on gene count outside 19k-22k range but don't fail (allows for Ensembl version variations)
|
|
||||||
- [01-02]: HGNC success rate is primary validation gate (UniProt mapping tracked but not used for pass/fail)
|
|
||||||
- [01-02]: Take first UniProt accession when multiple exist (simplifies data model)
|
|
||||||
- [01-02]: Mock mygene in tests (avoids rate limits, ensures reproducibility)
|
|
||||||
- [01-03]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics)
|
|
||||||
- [01-03]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern)
|
|
||||||
- [01-04]: Click for CLI framework (standard Python CLI library with excellent UX)
|
|
||||||
- [01-04]: Setup command uses checkpoint-restart pattern (gene universe fetch can take minutes)
|
|
||||||
- [01-04]: Mock mygene in integration tests (avoids external API dependency, reproducible)
|
|
||||||
- [02-01]: httpx over requests for streaming downloads (async-native, cleaner API)
|
|
||||||
- [02-01]: structlog for structured logging (JSON-formatted, context-aware)
|
|
||||||
- [02-01]: LOEUF normalization with inversion (lower LOEUF = more constrained = higher 0-1 score)
|
|
||||||
- [02-01]: Quality flags instead of filtering (preserve all genes with measured/incomplete_coverage/no_data categorization)
|
|
||||||
- [02-01]: NULL preservation pattern (unknown constraint != zero constraint, must not be conflated)
|
|
||||||
- [02-01]: Lazy polars evaluation (LazyFrame until final collect() for query optimization)
|
|
||||||
- [02-02]: load_to_duckdb uses CREATE OR REPLACE for idempotency (safe to re-run)
|
|
||||||
- [02-02]: CLI evidence command group for extensibility (future evidence sources follow same pattern)
|
|
||||||
- [02-02]: Checkpoint at table level (has_checkpoint checks DuckDB table existence)
|
|
||||||
- [02-02]: Integration tests with synthetic fixtures (no external downloads, fast, reproducible)
|
|
||||||
- [03-01]: Annotation tier thresholds: Well >= (20 GO AND 4 UniProt), Partial >= (5 GO OR 3 UniProt)
|
|
||||||
- [03-01]: Composite annotation score weighting: GO 50%, UniProt 30%, Pathway 20%
|
|
||||||
- [03-01]: NULL GO counts treated as zero for tier classification but preserved as NULL in data (conservative assumption)
|
|
||||||
- [03-03]: UniProt REST API with batching (100 accessions) over bulk download for flexibility
|
|
||||||
- [03-03]: InterPro API for supplemental domain annotations (10 req/sec rate limit)
|
|
||||||
- [03-03]: Keyword-based cilia motif detection over ML for explainability (IFT, BBSome, ciliary, etc.)
|
|
||||||
- [03-03]: Composite protein score weights: length 15%, domain 20%, coiled-coil 20%, TM 20%, cilia 15%, scaffold 10%
|
|
||||||
- [03-03]: List(Null) edge case handling for proteins with no domains (cast to List(String))
|
|
||||||
- [03-04]: Evidence type terminology standardized to computational (not predicted) for consistency with bioinformatics convention
|
|
||||||
- [03-04]: Proteomics absence stored as False (informative negative) vs HPA absence as NULL (unknown/not tested)
|
|
||||||
- [03-04]: Curated proteomics reference gene sets (CiliaCarta, Centrosome-DB) embedded as Python constants for simpler deployment
|
|
||||||
- [03-04]: Computational evidence (HPA Uncertain/Approved) downweighted to 0.6x vs experimental (Enhanced/Supported, proteomics) at 1.0x
|
|
||||||
- [Phase 03-05]: Ortholog confidence based on HCOP support count (HIGH: 8+, MEDIUM: 4-7, LOW: 1-3)
|
|
||||||
- [Phase 03-05]: NULL score for genes without orthologs (preserves NULL pattern)
|
|
||||||
- [03-02]: HPA bulk TSV download over per-gene API (efficient for 20K genes)
|
|
||||||
- [03-02]: GTEx retina/fallopian tube may be NULL (not in all versions)
|
|
||||||
- [03-02]: CellxGene optional dependency with --skip-cellxgene flag (large install)
|
|
||||||
- [03-02]: Tau specificity requires complete tissue data (any NULL -> NULL Tau)
|
|
||||||
- [03-02]: Expression score composite: 40% enrichment + 30% Tau + 30% target rank
|
|
||||||
- [03-02]: Inner ear data primarily from CellxGene scRNA-seq (not HPA/GTEx bulk)
|
|
||||||
- [03-06]: HTS hits prioritized over functional mentions in evidence tier hierarchy (direct > HTS > functional > incidental)
|
|
||||||
- [03-06]: Quality-weighted scoring uses log2 normalization to mitigate well-studied gene bias (prevents TP53-like dominance)
|
|
||||||
- [03-06]: Context weights cilia/sensory=2.0, cytoskeleton/polarity=1.0 for primary target prioritization
|
|
||||||
- [03-06]: Rate limiting via decorator pattern (3 req/sec default, 10 req/sec with NCBI API key)
|
|
||||||
- [04-01]: OMIM Usher genes (10) and SYSCILIA SCGS v2 core (28) as known gene positive controls
|
|
||||||
- [04-01]: NULL-preserving weighted average: weighted_sum / available_weight (only non-NULL layers contribute)
|
|
||||||
- [04-01]: Quality flags based on evidence_count (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence)
|
|
||||||
- [04-01]: Per-layer contribution tracking (score * weight) for explainability
|
|
||||||
- [04-01]: ScoringWeights validation enforcing sum = 1.0 ± 1e-6 tolerance
|
|
||||||
- [04-02]: scipy MAD-based outlier detection (>3 MAD threshold) for robust anomaly detection
|
|
||||||
- [04-02]: Missing data thresholds: 50% warn, 80% error for graduated QC feedback
|
|
||||||
- [04-02]: PERCENT_RANK validation computed before known gene exclusion (validates scoring system)
|
|
||||||
- [04-02]: Top quartile validation criterion (median percentile >= 0.75 for known genes)
|
|
||||||
- [04-03]: Score command follows evidence_cmd.py pattern for consistency
|
|
||||||
- [04-03]: Separate --skip-qc and --skip-validation flags for flexible iteration
|
|
||||||
- [04-03]: Tests use tmp_path fixtures for isolated DuckDB instances
|
|
||||||
- [04-03]: Synthetic test data designed to ensure known genes rank highly (0.8-0.95 scores across all layers)
|
|
||||||
- [05-01]: Configurable tier thresholds (HIGH: score>=0.7 and evidence>=3, MEDIUM: score>=0.4 and evidence>=2, LOW: score>=0.2)
|
|
||||||
- [05-01]: EXCLUDED genes filtered out (below LOW threshold or NULL composite_score)
|
|
||||||
- [05-01]: Deterministic sorting (composite_score DESC, gene_id ASC) for reproducible output
|
|
||||||
- [05-01]: Dual-format TSV+Parquet with identical data for downstream tool compatibility
|
|
||||||
- [05-01]: YAML provenance sidecar includes statistics (tier counts) and column metadata
|
|
||||||
- [05-01]: Fixed deprecated pl.count() -> pl.len() usage for polars 0.20.5+ compatibility
|
|
||||||
- [05-02]: matplotlib Agg backend for headless/CLI safety (non-interactive visualization)
|
|
||||||
- [05-02]: 300 DPI for publication-quality plots
|
|
||||||
- [05-02]: Tier color scheme: GREEN/ORANGE/RED for HIGH/MEDIUM/LOW (consistent across all plots)
|
|
||||||
- [05-02]: Graceful degradation (individual plot failures don't block batch generation)
|
|
||||||
- [05-02]: Dual-format reproducibility reports (JSON machine-readable + Markdown human-readable)
|
|
||||||
- [05-02]: Optional validation metrics in reproducibility reports (report generates whether or not validation provided)
|
|
||||||
- [05-03]: Report command follows established CLI pattern (config load, store init, checkpoint, steps, summary, cleanup)
|
|
||||||
- [05-03]: Configurable tier thresholds via CLI flags (--high-threshold, --medium-threshold, --low-threshold, --min-evidence-high, --min-evidence-medium)
|
|
||||||
- [05-03]: Skip flags for flexible iteration (--skip-viz, --skip-report) allow faster output generation
|
|
||||||
- [05-03]: Graceful degradation for visualization and reproducibility report failures (warnings, not errors)
|
|
||||||
- [06-01]: Housekeeping genes as negative controls (13 literature-validated genes from Eisenberg & Levanon 2013)
|
|
||||||
- [06-01]: Inverted threshold logic for negative controls (median percentile < 50% = success)
|
|
||||||
- [06-01]: Recall@k at both absolute (100, 500, 1000, 2000) and percentage (5%, 10%, 20%) thresholds
|
|
||||||
- [06-01]: Per-source breakdown separates OMIM Usher from SYSCILIA SCGS v2 for granular validation analysis
|
|
||||||
- [06-02]: Perturbation deltas ±5% and ±10% (DEFAULT_DELTAS) for reasonable weight variations
|
|
||||||
- [06-02]: Stability threshold Spearman rho >= 0.85 (STABILITY_THRESHOLD) based on rank stability literature
|
|
||||||
- [06-02]: Renormalization maintains sum=1.0 after perturbation (weight constraint enforcement)
|
|
||||||
- [06-02]: Top-N default 100 genes for ranking comparison (relevant for candidate prioritization)
|
|
||||||
- [06-02]: Minimum overlap 10 genes required for Spearman correlation (avoids meaningless correlations)
|
|
||||||
- [06-02]: Per-layer sensitivity tracking (most_sensitive_layer and most_robust_layer computed from mean rho)
|
|
||||||
- [06-03]: Comprehensive validation report combines positive, negative, and sensitivity prongs in single Markdown document
|
|
||||||
- [06-03]: Weight tuning recommendations include critical circular validation warnings (post-validation tuning invalidates controls)
|
|
||||||
- [06-03]: CLI validate command provides --skip-sensitivity flag for faster iteration during development
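
The [06-02] decisions in the list above (perturb one weight, renormalize so the weights still sum to 1.0, then compare rankings with Spearman rho) could be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: the layer names and base weights are hypothetical, the deltas are read as additive shifts (the list does not say whether they are additive or relative), and `spearman_rho` uses the textbook tie-free formula rather than whatever library call the project uses.

```python
# Hypothetical layer names and weights; the real configuration is not shown here.
BASE_WEIGHTS = {"constraint": 0.25, "expression": 0.25, "localization": 0.5}
DEFAULT_DELTAS = (-0.10, -0.05, 0.05, 0.10)  # assumed to be additive shifts
STABILITY_THRESHOLD = 0.85  # rho at or above this counts as stable


def perturb_weights(weights, layer, delta):
    """Shift one layer's weight by delta, clamp at zero, renormalize to sum to 1.0."""
    shifted = dict(weights)
    shifted[layer] = max(0.0, shifted[layer] + delta)
    total = sum(shifted.values())
    if total == 0:
        raise ValueError("all weights are zero after perturbation")
    return {name: value / total for name, value in shifted.items()}


def spearman_rho(ranking_a, ranking_b):
    """Spearman correlation for two tie-free rankings of the same set of genes."""
    if len(set(ranking_a)) != len(ranking_a) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same genes without repeats")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two genes to correlate")
    position_b = {gene: i for i, gene in enumerate(ranking_b)}
    d_squared = sum((i - position_b[gene]) ** 2 for i, gene in enumerate(ranking_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    for delta in DEFAULT_DELTAS:
        print(delta, perturb_weights(BASE_WEIGHTS, "constraint", delta))
```

Renormalizing after the shift keeps the other layers' weights in the same proportion to one another, so each run isolates the effect of the single perturbed layer.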
|
|
||||||
|
|
||||||
### Pending Todos
|
### Pending Todos
|
||||||
|
|
||||||
None yet.
|
None.
|
||||||
|
|
||||||
### Blockers/Concerns
|
### Blockers/Concerns
|
||||||
|
|
||||||
None yet.
|
None.
|
||||||
|
|
||||||
## Session Continuity
|
## Session Continuity
|
||||||
|
|
||||||
Last session: 2026-02-12 - Phase 6 execution and verification
|
Last session: 2026-02-12 — v1.0 milestone completed and archived
|
||||||
Stopped at: All 6 phases complete — milestone ready for completion
|
Next action: /gsd:new-milestone for v1.1 or v2.0
|
||||||
Resume file: .planning/phases/06-validation/06-VERIFICATION.md
|
|
||||||
|
|||||||
@@ -1,3 +1,12 @@
|
|||||||
|
# Requirements Archive: v1.0 MVP
|
||||||
|
|
||||||
|
**Archived:** 2026-02-12
|
||||||
|
**Status:** SHIPPED
|
||||||
|
|
||||||
|
For current requirements, see `.planning/REQUIREMENTS.md`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
# Requirements: Usher Cilia Candidate Gene Discovery Pipeline
|
# Requirements: Usher Cilia Candidate Gene Discovery Pipeline
|
||||||
|
|
||||||
**Defined:** 2026-02-11
|
**Defined:** 2026-02-11
|
||||||
141
.planning/milestones/v1.0-ROADMAP.md
Normal file
@@ -0,0 +1,141 @@
|
|||||||
|
# Roadmap: Usher Cilia Candidate Gene Discovery Pipeline
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
This pipeline transforms ~20,000 human protein-coding genes into a ranked, evidence-backed list of under-studied cilia/Usher candidates. The journey progresses from foundational data infrastructure through six independent evidence layers (annotation, expression, protein features, localization, genetic constraint, animal models, literature), multi-evidence scoring with transparent weights, and tiered output generation. Each phase delivers testable capabilities that compound toward a fully traceable, reproducible gene prioritization system.
|
||||||
|
|
||||||
|
## Phases
|
||||||
|
|
||||||
|
**Phase Numbering:**
|
||||||
|
- Integer phases (1, 2, 3): Planned milestone work
|
||||||
|
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
|
||||||
|
|
||||||
|
Decimal phases appear between their surrounding integers in numeric order.
|
||||||
|
|
||||||
|
- [x] **Phase 1: Data Infrastructure** - Foundation for reproducible, modular pipeline
|
||||||
|
- [x] **Phase 2: Prototype Evidence Layer** - Validate retrieval-to-storage architecture
|
||||||
|
- [x] **Phase 3: Core Evidence Layers** - Parallel multi-source data retrieval
|
||||||
|
- [x] **Phase 4: Scoring & Integration** - Multi-evidence weighted scoring system
|
||||||
|
- [x] **Phase 5: Output & CLI** - User-facing interface and tiered results
|
||||||
|
- [x] **Phase 6: Validation** - Benchmark scoring against known genes
|
||||||
|
|
||||||
|
## Phase Details
|
||||||
|
|
||||||
|
### Phase 1: Data Infrastructure
|
||||||
|
**Goal**: Establish reproducible data foundation and gene ID mapping utilities
|
||||||
|
**Depends on**: Nothing (first phase)
|
||||||
|
**Requirements**: INFRA-01, INFRA-02, INFRA-03, INFRA-04, INFRA-05, INFRA-06, INFRA-07
|
||||||
|
**Success Criteria** (what must be TRUE):
|
||||||
|
1. Pipeline uses Ensembl gene IDs as primary keys throughout with validated mapping to HGNC symbols and UniProt accessions
|
||||||
|
2. Configuration system loads YAML parameters with Pydantic validation and rejects invalid configs
|
||||||
|
3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching
|
||||||
|
4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading
|
||||||
|
5. Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash
|
||||||
|
**Plans**: 4 plans
|
||||||
|
|
||||||
|
Plans:
|
||||||
|
- [x] 01-01-PLAN.md -- Project scaffold, config system, and base API client
|
||||||
|
- [x] 01-02-PLAN.md -- Gene ID mapping with validation gates
|
||||||
|
- [x] 01-03-PLAN.md -- DuckDB persistence and provenance tracking
|
||||||
|
- [x] 01-04-PLAN.md -- CLI integration and end-to-end wiring
|
||||||
|
|
||||||
|
### Phase 2: Prototype Evidence Layer
|
||||||
|
**Goal**: Validate retrieval-to-storage pattern with single evidence layer
|
||||||
|
**Depends on**: Phase 1
|
||||||
|
**Requirements**: GCON-01, GCON-02, GCON-03
|
||||||
|
**Success Criteria** (what must be TRUE):
|
||||||
|
1. Pipeline retrieves gnomAD constraint metrics (pLI, LOEUF) for all human protein-coding genes
|
||||||
|
2. Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) and stored with quality flags
|
||||||
|
3. Missing data is encoded as "unknown" rather than zero, preserving genes with incomplete coverage
|
||||||
|
4. Prototype layer writes normalized scores to DuckDB and demonstrates checkpoint restart capability
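
As a rough illustration of criteria 3 and 4, the sketch below normalizes LOEUF to a 0-1 score with inversion (lower LOEUF means more constrained, so a higher score) while leaving unknown values as `None`. Min-max scaling and the function name are assumptions made for this example; the plan only states that scores are normalized and that missing data must not be turned into zero.

```python
def normalize_loeuf(values):
    """Map LOEUF values to 0-1 scores, inverted so low LOEUF scores high.

    None entries (no usable constraint data) stay None instead of becoming 0.
    Min-max scaling is an assumption of this sketch.
    """
    measured = [v for v in values if v is not None]
    if not measured:
        return [None] * len(values)
    lowest, highest = min(measured), max(measured)
    if highest == lowest:
        # Degenerate case: every measured gene has the same LOEUF.
        # Returning 0.5 is an arbitrary choice for this sketch.
        return [None if v is None else 0.5 for v in values]
    span = highest - lowest
    return [None if v is None else (highest - v) / span for v in values]


if __name__ == "__main__":
    print(normalize_loeuf([0.2, None, 1.0, 0.6]))
```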
|
||||||
|
**Plans**: 2 plans
|
||||||
|
|
||||||
|
Plans:
|
||||||
|
- [x] 02-01-PLAN.md -- gnomAD data model, download, coverage filter, and normalization
|
||||||
|
- [x] 02-02-PLAN.md -- DuckDB persistence, CLI evidence command, and integration tests
|
||||||
|
|
||||||
|
### Phase 3: Core Evidence Layers
|
||||||
|
**Goal**: Complete all remaining evidence retrieval modules
|
||||||
|
**Depends on**: Phase 2
|
||||||
|
**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, EXPR-01, EXPR-02, EXPR-03, EXPR-04, PROT-01, PROT-02, PROT-03, PROT-04, LOCA-01, LOCA-02, LOCA-03, ANIM-01, ANIM-02, ANIM-03, LITE-01, LITE-02, LITE-03
|
||||||
|
**Success Criteria** (what must be TRUE):
|
||||||
|
1. Pipeline quantifies annotation depth per gene using GO term count, UniProt score, and pathway membership with tier classification
|
||||||
|
2. Expression data from HPA, GTEx, and CellxGene is retrieved for retina, inner ear, and cilia-rich tissues with normalized specificity metrics
|
||||||
|
3. Protein features (length, domains, coiled-coils, cilia motifs, transmembrane regions) are extracted from UniProt/InterPro as normalized features
|
||||||
|
4. Localization evidence from HPA and proteomics datasets distinguishes experimental from computational predictions
|
||||||
|
5. Animal model phenotypes from MGI, ZFIN, and IMPC are filtered for sensory/cilia relevance with ortholog confidence scoring
|
||||||
|
6. Literature evidence from PubMed distinguishes direct experimental evidence from incidental mentions with quality-weighted scoring
|
||||||
|
**Plans**: 6 plans
|
||||||
|
|
||||||
|
Plans:
|
||||||
|
- [x] 03-01-PLAN.md -- Gene annotation completeness (GO terms, UniProt scores, pathway membership, tier classification)
|
||||||
|
- [x] 03-02-PLAN.md -- Tissue expression (HPA, GTEx, CellxGene with Tau specificity and enrichment scoring)
|
||||||
|
- [x] 03-03-PLAN.md -- Protein sequence/structure features (UniProt/InterPro domains, cilia motifs, normalization)
|
||||||
|
- [x] 03-04-PLAN.md -- Subcellular localization (HPA subcellular, cilia proteomics, evidence type distinction)
|
||||||
|
- [x] 03-05-PLAN.md -- Animal model phenotypes (MGI, ZFIN, IMPC with HCOP ortholog mapping)
|
||||||
|
- [x] 03-06-PLAN.md -- Literature evidence (PubMed queries, evidence tier classification, quality-weighted scoring)
|
||||||
|
|
||||||
|
### Phase 4: Scoring & Integration
|
||||||
|
**Goal**: Multi-evidence weighted scoring with known gene validation
|
||||||
|
**Depends on**: Phase 3
|
||||||
|
**Requirements**: SCOR-01, SCOR-02, SCOR-03, SCOR-04, SCOR-05
|
||||||
|
**Success Criteria** (what must be TRUE):
|
||||||
|
1. Known cilia/Usher genes from CiliaCarta, SYSCILIA, and OMIM are compiled as exclusion set and positive controls
|
||||||
|
2. Weighted rule-based scoring integrates all evidence layers with configurable per-layer weights producing composite score per gene
|
||||||
|
3. Scoring handles missing data explicitly with "unknown" status rather than penalizing genes lacking evidence in specific layers
|
||||||
|
4. Known cilia/Usher genes rank highly before exclusion, validating that scoring system works
|
||||||
|
5. Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
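
Criteria 2 and 3 can be read together as a NULL-preserving weighted average: the weighted sum is divided by the weight of the layers that actually have data, so a gene is not penalized for a missing layer. The sketch below assumes hypothetical layer names and weights; the evidence-count cut-offs (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence) are the ones recorded in the 04-01 decisions.

```python
# Hypothetical per-layer weights for illustration; they must sum to 1.0.
WEIGHTS = {
    "constraint": 0.2, "expression": 0.2, "annotation": 0.15,
    "localization": 0.15, "animal_model": 0.15, "literature": 0.15,
}
assert abs(sum(WEIGHTS.values()) - 1.0) <= 1e-6


def composite_score(layer_scores, weights):
    """Weighted average over non-None layers; None if no layer has data."""
    available = {k: v for k, v in layer_scores.items() if v is not None}
    available_weight = sum(weights[k] for k in available)
    if available_weight == 0:
        return None
    weighted_sum = sum(v * weights[k] for k, v in available.items())
    return weighted_sum / available_weight


def quality_flag(layer_scores):
    """Classify a gene by how many layers contributed evidence."""
    evidence_count = sum(v is not None for v in layer_scores.values())
    if evidence_count >= 4:
        return "sufficient"
    if evidence_count >= 2:
        return "moderate"
    if evidence_count >= 1:
        return "sparse"
    return "no_evidence"


if __name__ == "__main__":
    gene = {"constraint": 0.9, "expression": None, "annotation": 0.6,
            "localization": None, "animal_model": None, "literature": None}
    print(composite_score(gene, WEIGHTS), quality_flag(gene))
```

Dividing by `available_weight` rather than the full weight total is what keeps an unmeasured layer from acting like a score of zero.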
|
||||||
|
**Plans**: 3 plans
|
||||||
|
|
||||||
|
Plans:
|
||||||
|
- [x] 04-01-PLAN.md -- Known gene compilation, weight validation, and multi-evidence scoring integration
|
||||||
|
- [x] 04-02-PLAN.md -- Quality control checks and positive control validation
|
||||||
|
- [x] 04-03-PLAN.md -- CLI score command and unit/integration tests
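
The quality control checks in plan 04-02 (missing data rates and MAD-based outliers) might look something like the sketch below. It uses only the standard library as a stand-in for the scipy-based implementation mentioned in the project decisions, uses the raw (unscaled) MAD, and treats the 50% and 80% thresholds as inclusive; all three choices are assumptions of this example.

```python
from statistics import median


def missing_data_status(missing_rate, warn_at=0.5, error_at=0.8):
    """Graduated QC feedback for the fraction of genes lacking a layer score."""
    if missing_rate >= error_at:
        return "error"
    if missing_rate >= warn_at:
        return "warn"
    return "ok"


def mad_outliers(scores, threshold=3.0):
    """Return gene IDs whose score is more than `threshold` MADs from the median.

    `scores` maps gene ID to a score or None; None values are ignored.
    """
    values = [v for v in scores.values() if v is not None]
    if len(values) < 2:
        return []
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return sorted(
        gene for gene, v in scores.items()
        if v is not None and abs(v - center) / mad > threshold
    )


if __name__ == "__main__":
    layer = {"g1": 0.50, "g2": 0.52, "g3": 0.48, "g4": 0.51,
             "g5": 0.49, "g6": 0.99, "g7": None}
    missing_rate = sum(v is None for v in layer.values()) / len(layer)
    print(missing_data_status(missing_rate), mad_outliers(layer))
```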
|
||||||
|
|
||||||
|
### Phase 5: Output & CLI
|
||||||
|
**Goal**: User-facing interface and structured tiered output
|
||||||
|
**Depends on**: Phase 4
|
||||||
|
**Requirements**: OUTP-01, OUTP-02, OUTP-03, OUTP-04, OUTP-05
|
||||||
|
**Success Criteria** (what must be TRUE):
|
||||||
|
1. Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
|
||||||
|
2. Each candidate includes multi-dimensional evidence summary showing which layers support it and which have gaps
|
||||||
|
3. Output is available in TSV and Parquet formats compatible with downstream PPI and structural prediction tools
|
||||||
|
4. Pipeline generates visualizations: score distribution, evidence layer contribution, tier breakdown
|
||||||
|
5. Unified CLI provides subcommands for running layers, integration, and reporting with progress logging
|
||||||
|
6. Reproducibility report documents all parameters, data versions, gene counts at filtering steps, and validation metrics
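
Criterion 1 combines composite score with evidence breadth. A minimal sketch of that tiering, using the default thresholds recorded in the 05-01 decisions (HIGH: score >= 0.7 and evidence >= 3, MEDIUM: score >= 0.4 and evidence >= 2, LOW: score >= 0.2) and the deterministic sort order (score descending, gene ID ascending), could look like this. The function names and the tuple layout are hypothetical.

```python
def assign_tier(score, evidence_count):
    """Classify one gene; None scores and scores below the LOW cut-off are excluded."""
    if score is None:
        return "EXCLUDED"
    if score >= 0.7 and evidence_count >= 3:
        return "HIGH"
    if score >= 0.4 and evidence_count >= 2:
        return "MEDIUM"
    if score >= 0.2:
        return "LOW"
    return "EXCLUDED"


def tiered_candidates(genes):
    """genes: iterable of (gene_id, composite_score, evidence_count) tuples.

    Returns (gene_id, score, tier) rows without EXCLUDED genes, sorted by
    score descending and then gene ID ascending for reproducible output.
    """
    rows = []
    for gene_id, score, evidence_count in genes:
        tier = assign_tier(score, evidence_count)
        if tier != "EXCLUDED":
            rows.append((gene_id, score, tier))
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


if __name__ == "__main__":
    demo = [("ENSG_B", 0.8, 4), ("ENSG_A", 0.8, 4), ("ENSG_C", 0.75, 1),
            ("ENSG_D", None, 0), ("ENSG_E", 0.1, 5)]
    for row in tiered_candidates(demo):
        print(row)
```

Note how a gene with a high score but a single supporting layer falls through to LOW: the evidence-breadth requirement applies to HIGH and MEDIUM only.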
|
||||||
|
**Plans**: 3 plans
|
||||||
|
|
||||||
|
Plans:
|
||||||
|
- [x] 05-01-PLAN.md -- Tiered candidate output with evidence summary and dual-format writer (TSV+Parquet)
|
||||||
|
- [x] 05-02-PLAN.md -- Visualizations (score distribution, layer contributions, tier breakdown) and reproducibility report
|
||||||
|
- [x] 05-03-PLAN.md -- CLI report command wiring all output modules with integration tests
|
||||||
|
|
||||||
|
### Phase 6: Validation
|
||||||
|
**Goal**: Benchmark scoring system against positive and negative controls
|
||||||
|
**Depends on**: Phase 5
|
||||||
|
**Requirements**: (No new requirements - validates existing system)
|
||||||
|
**Success Criteria** (what must be TRUE):
|
||||||
|
1. Positive control validation shows known cilia/Usher genes achieve high recall (>70% in top 10% of candidates)
|
||||||
|
2. Negative control validation shows housekeeping genes are deprioritized (low scores, excluded from high-confidence tier)
|
||||||
|
3. Sensitivity analysis across parameter sweeps demonstrates rank stability for top candidates
|
||||||
|
4. Final scoring weights are tuned based on validation metrics and documented with rationale
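
Criterion 1 is a recall@k check. The sketch below shows one way to compute it at both an absolute cut-off and a percentage cut-off; the function names are hypothetical, the ranking is assumed to be ordered best-first, and rounding the percentage cut-off up to a whole gene is an assumption of this example.

```python
import math


def recall_at_k(ranked_gene_ids, known_genes, k):
    """Fraction of known genes that appear in the top k of the ranking."""
    if not known_genes:
        raise ValueError("known_genes must not be empty")
    top_k = set(ranked_gene_ids[:k])
    return len(top_k & set(known_genes)) / len(set(known_genes))


def recall_at_fraction(ranked_gene_ids, known_genes, fraction):
    """Recall within the top `fraction` of the ranking (e.g. 0.10 for top 10%)."""
    k = math.ceil(fraction * len(ranked_gene_ids))
    return recall_at_k(ranked_gene_ids, known_genes, k)


if __name__ == "__main__":
    ranking = [f"g{i}" for i in range(1, 11)]  # g1 is the top-ranked gene
    known = {"g1", "g3", "g9"}
    print(recall_at_k(ranking, known, 3))
    print(recall_at_fraction(ranking, known, 0.10))
```

With these helpers, the first criterion would correspond to `recall_at_fraction(ranking, known, 0.10) > 0.70` on a ranking computed before known genes are excluded.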
|
||||||
|
**Plans**: 3 plans
|
||||||
|
|
||||||
|
Plans:
|
||||||
|
- [x] 06-01-PLAN.md -- Negative control validation (housekeeping genes) and enhanced positive control metrics (recall@k)
|
||||||
|
- [x] 06-02-PLAN.md -- Sensitivity analysis (weight perturbation sweeps with Spearman rank correlation)
|
||||||
|
- [x] 06-03-PLAN.md -- Comprehensive validation report, CLI validate command, and unit tests
|
||||||
|
|
||||||
|
## Progress
|
||||||
|
|
||||||
|
**Execution Order:**
|
||||||
|
Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6

| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Data Infrastructure | 4/4 | Complete | 2026-02-11 |
| 2. Prototype Evidence Layer | 2/2 | Complete | 2026-02-11 |
| 3. Core Evidence Layers | 6/6 | Complete | 2026-02-11 |
| 4. Scoring & Integration | 3/3 | Complete | 2026-02-11 |
| 5. Output & CLI | 3/3 | Complete | 2026-02-12 |
| 6. Validation | 3/3 | Complete | 2026-02-12 |