# Requirements Archive: v1.0 MVP

**Archived:** 2026-02-12
**Status:** SHIPPED

For current requirements, see `.planning/REQUIREMENTS.md`.

---

# Requirements: Usher Cilia Candidate Gene Discovery Pipeline

**Defined:** 2026-02-11
**Core Value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable: every gene's inclusion is explainable by specific evidence, and every gap is documented.

## v1 Requirements

Requirements for initial release. Each maps to roadmap phases.

### Data Infrastructure

- [ ] **INFRA-01**: Pipeline defines gene universe as all human protein-coding genes from Ensembl, excluding pseudogenes and transcripts lacking protein-level evidence
- [ ] **INFRA-02**: Pipeline uses Ensembl gene IDs as primary keys throughout, with validated mapping to HGNC symbols and UniProt accessions
- [ ] **INFRA-03**: Gene ID mapping includes validation gates that report percentage successfully mapped and flag unmapped genes for review
- [ ] **INFRA-04**: Pipeline retrieves data from external APIs (gnomAD, GTEx, HPA, UniProt, PubMed, model organism DBs) with rate limiting, retry logic, and persistent disk caching
- [ ] **INFRA-05**: All pipeline parameters (weights, thresholds, data source versions) are configurable via YAML config files with Pydantic validation
- [ ] **INFRA-06**: Every output includes provenance metadata: pipeline version, data source versions, download timestamps, config hash, and processing steps
- [ ] **INFRA-07**: Intermediate results are persisted to disk (Parquet/DuckDB) enabling restart-from-checkpoint without re-downloading data

### Evidence Layer 1: Gene Annotation Completeness

- [ ] **ANNOT-01**: Pipeline quantifies functional annotation depth per gene using GO term count, UniProt annotation score, and pathway membership
- [ ] **ANNOT-02**: Genes are classified into annotation tiers (well-annotated, partially-annotated, poorly-annotated) based on composite annotation metrics
- [ ] **ANNOT-03**: Annotation completeness score is normalized to 0-1 scale and stored as an evidence layer feature

### Evidence Layer 2: Tissue Expression

- [ ] **EXPR-01**: Pipeline retrieves tissue-level expression data from Human Protein Atlas and GTEx for retina, inner ear, and cilia-rich tissues
- [ ] **EXPR-02**: Pipeline retrieves published scRNA-seq data from CellxGene for photoreceptor subtypes and auditory hair cell subpopulations
- [ ] **EXPR-03**: Expression data is converted to comparable specificity metrics (e.g., tissue specificity index or relative rank) across data sources
- [ ] **EXPR-04**: Expression evidence score reflects enrichment in Usher-relevant tissues relative to global expression

### Evidence Layer 3: Protein Sequence & Structure Features

- [ ] **PROT-01**: Pipeline extracts protein length, domain composition, and domain count from UniProt/InterPro per gene
- [ ] **PROT-02**: Pipeline identifies presence of coiled-coil regions, scaffold/adaptor-type domains, and transmembrane domains
- [ ] **PROT-03**: Pipeline checks for known cilia-associated or sensory structure-associated motifs without presupposing conclusions
- [ ] **PROT-04**: Protein features are encoded as binary and continuous features normalized to 0-1 scale

### Evidence Layer 4: Subcellular Localization

- [ ] **LOCA-01**: Pipeline integrates high-throughput protein localization data from Human Protein Atlas subcellular and published centrosome/cilium proteomics
- [ ] **LOCA-02**: Localization evidence distinguishes direct experimental evidence from computational predictions
- [ ] **LOCA-03**: Localization score reflects proximity to cilia-related compartments (centrosome, basal body, cilium, stereocilia, transition zone)

### Evidence Layer 5: Genetic Constraint

- [ ] **GCON-01**: Pipeline retrieves loss-of-function tolerance metrics (pLI, LOEUF) from gnomAD per gene
- [ ] **GCON-02**: Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) to avoid unreliable estimates
- [ ] **GCON-03**: Constraint evidence is interpreted as weak signal for "important but under-studied" rather than direct cilia involvement

### Evidence Layer 6: Animal Model Phenotypes

- [ ] **ANIM-01**: Pipeline retrieves gene knockout/perturbation phenotypes from MGI (mouse), ZFIN (zebrafish), and IMPC
- [ ] **ANIM-02**: Phenotypes are filtered for relevance to sensory function, balance, vision, hearing, and cilia morphology
- [ ] **ANIM-03**: Ortholog mapping uses established databases with confidence scoring, handling one-to-many mappings explicitly

### Literature Evidence

- [ ] **LITE-01**: Pipeline performs systematic PubMed queries per candidate gene for mentions in cilia, sensory organ, cytoskeleton, and cell polarity contexts
- [ ] **LITE-02**: Literature evidence distinguishes direct experimental evidence, incidental mentions, and high-throughput screen hits as qualitative tiers
- [ ] **LITE-03**: Literature score reflects evidence quality (not just publication count) to mitigate well-studied gene bias

### Scoring & Integration

- [ ] **SCOR-01**: Pipeline compiles known cilia/Usher gene set from CiliaCarta, SYSCILIA gold standard, and OMIM Usher entries as exclusion set and positive controls
- [ ] **SCOR-02**: Multi-evidence integration uses weighted rule-based scoring with configurable per-layer weights, producing a composite score per gene
- [ ] **SCOR-03**: Scoring handles missing data explicitly: genes lacking evidence in a layer receive "unknown" status rather than zero score
- [ ] **SCOR-04**: Known cilia/Usher genes are used as positive controls (should rank highly before exclusion) to validate scoring system
- [ ] **SCOR-05**: Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer

### Output & Reporting

- [ ] **OUTP-01**: Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
- [ ] **OUTP-02**: Each candidate gene includes a multi-dimensional evidence summary showing which layers support it and which have gaps
- [ ] **OUTP-03**: Output is in structured machine-readable format (TSV and Parquet) compatible with downstream PPI and structural prediction tools
- [ ] **OUTP-04**: Pipeline generates basic visualizations: score distribution, evidence layer contribution plots, tier breakdown
- [ ] **OUTP-05**: Pipeline produces reproducibility report documenting all parameters, data versions, gene counts at each filtering step, and validation metrics

## v2 Requirements

Deferred to future release. Tracked but not in current roadmap.

### Advanced Scoring

- **ASCR-01**: Explainable scoring with SHAP-style per-gene evidence breakdown showing why each gene ranks where it does
- **ASCR-02**: Systematic under-annotation bias correction that downweights literature-heavy features for under-studied candidates
- **ASCR-03**: Sensitivity analysis with parameter sweep across weight configurations and rank stability metrics
- **ASCR-04**: Evidence conflict detection flagging genes with contradictory evidence patterns

### Advanced Output

- **AOUT-01**: Interactive HTML report with browsable results, sortable tables, and linked evidence sources
- **AOUT-02**: Negative control validation testing against housekeeping genes to assess specificity

### Extended Evidence

- **AEVD-01**: Cross-species homology confidence scoring using DIOPT with explicit ortholog quality thresholds
- **AEVD-02**: Cilia-specific knowledgebase integration (CilioGenics, CiliaMiner) as additional evidence layer
- **AEVD-03**: Incremental update capability to re-run with new data without full recomputation

## Out of Scope

Explicitly excluded. Documented to prevent scope creep.


| Feature | Reason |
|---------|--------|
| Real-time web dashboard | Overkill for research tool; static reports + CLI sufficient |
| GUI for parameter tuning | Research pipelines need reproducible CLI execution |
| Variant-level analysis | Out of scope for gene-level discovery; use Exomiser/LIRICAL for variant work |
| Custom alignment/variant calling | Well-solved problem; focus on gene prioritization logic |
| ML-based scoring model | Small positive control set insufficient for robust ML; rule-based more transparent |
| LLM-based automated literature scanning | High complexity, cost, and uncertainty; manual/programmatic PubMed queries sufficient |
| Bayesian evidence weight optimization | Requires larger training set; manual weight tuning sufficient for v1 |
| Private/proprietary datasets | Public data only for reproducibility |
| Downstream PPI network analysis | This pipeline produces input candidate list; PPI is separate |
| AlphaFold structural predictions | Downstream analysis, not part of discovery pipeline |

## Traceability

Which phases cover which requirements. Updated during roadmap creation.


| Requirement | Phase | Status |
|-------------|-------|--------|
| INFRA-01 | Phase 1 | Pending |
| INFRA-02 | Phase 1 | Pending |
| INFRA-03 | Phase 1 | Pending |
| INFRA-04 | Phase 1 | Pending |
| INFRA-05 | Phase 1 | Pending |
| INFRA-06 | Phase 1 | Pending |
| INFRA-07 | Phase 1 | Pending |
| ANNOT-01 | Phase 3 | Pending |
| ANNOT-02 | Phase 3 | Pending |
| ANNOT-03 | Phase 3 | Pending |
| EXPR-01 | Phase 3 | Pending |
| EXPR-02 | Phase 3 | Pending |
| EXPR-03 | Phase 3 | Pending |
| EXPR-04 | Phase 3 | Pending |
| PROT-01 | Phase 3 | Pending |
| PROT-02 | Phase 3 | Pending |
| PROT-03 | Phase 3 | Pending |
| PROT-04 | Phase 3 | Pending |
| LOCA-01 | Phase 3 | Pending |
| LOCA-02 | Phase 3 | Pending |
| LOCA-03 | Phase 3 | Pending |
| GCON-01 | Phase 2 | Pending |
| GCON-02 | Phase 2 | Pending |
| GCON-03 | Phase 2 | Pending |
| ANIM-01 | Phase 3 | Pending |
| ANIM-02 | Phase 3 | Pending |
| ANIM-03 | Phase 3 | Pending |
| LITE-01 | Phase 3 | Pending |
| LITE-02 | Phase 3 | Pending |
| LITE-03 | Phase 3 | Pending |
| SCOR-01 | Phase 4 | Pending |
| SCOR-02 | Phase 4 | Pending |
| SCOR-03 | Phase 4 | Pending |
| SCOR-04 | Phase 4 | Pending |
| SCOR-05 | Phase 4 | Pending |
| OUTP-01 | Phase 5 | Pending |
| OUTP-02 | Phase 5 | Pending |
| OUTP-03 | Phase 5 | Pending |
| OUTP-04 | Phase 5 | Pending |
| OUTP-05 | Phase 5 | Pending |

**Coverage:**
- v1 requirements: 40 total
- Mapped to phases: 40
- Unmapped: 0

---

*Requirements defined: 2026-02-11*
*Last updated: 2026-02-11 after roadmap creation*
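
A minimal sketch of how SCOR-02 (weighted per-layer scoring) and SCOR-03 ("unknown" rather than zero for missing layers) could fit together. It assumes per-layer scores are already normalized to 0-1 and that an unknown layer is represented as `None`. The function name `composite_score`, the layer names, and the choice to renormalize over the weights of known layers are illustrative assumptions; the requirements do not prescribe them.

```python
from typing import Mapping, Optional, Tuple


def composite_score(
    layer_scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Tuple[Optional[float], int]:
    """Weighted mean over evidence layers with known scores.

    Returns (score, n_known_layers). A layer whose score is None is treated
    as "unknown" and is left out of both the numerator and the denominator,
    so it is not penalized as if it had scored zero (assumed reading of SCOR-03).
    """
    # Keep only layers that have a score and a configured weight.
    known = {
        layer: score
        for layer, score in layer_scores.items()
        if score is not None and layer in weights
    }
    total_weight = sum(weights[layer] for layer in known)
    if total_weight == 0:
        # No usable evidence: the composite itself is unknown.
        return None, len(known)
    weighted_sum = sum(weights[layer] * score for layer, score in known.items())
    return weighted_sum / total_weight, len(known)


# Hypothetical weights and scores, for illustration only.
example_weights = {"expression": 0.5, "localization": 0.25, "constraint": 0.25}
example_gene = {"expression": 0.8, "localization": None, "constraint": 0.4}
score, n_known = composite_score(example_gene, example_weights)
```

Returning the count of known layers alongside the score keeps evidence breadth available for tiering (OUTP-01), since a renormalized score alone hides how many layers actually contributed.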