usher-exploring/.planning/milestones/v1.0-REQUIREMENTS.md
gbanyan a2ef2125ba chore: complete v1.0 MVP milestone
Archive v1.0 milestone: 6 phases, 21 plans, 40/40 requirements.
Reorganize ROADMAP.md, evolve PROJECT.md, archive requirements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 21:31:43 +08:00

Requirements Archive: v1.0 MVP

Archived: 2026-02-12
Status: SHIPPED

For current requirements, see .planning/REQUIREMENTS.md.


Requirements: Usher Cilia Candidate Gene Discovery Pipeline

Defined: 2026-02-11
Core Value: Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable: every gene's inclusion is explainable by specific evidence, and every gap is documented.

v1 Requirements

Requirements for initial release. Each maps to roadmap phases.

Data Infrastructure

  • INFRA-01: Pipeline defines gene universe as all human protein-coding genes from Ensembl, excluding pseudogenes and transcripts lacking protein-level evidence
  • INFRA-02: Pipeline uses Ensembl gene IDs as primary keys throughout, with validated mapping to HGNC symbols and UniProt accessions
  • INFRA-03: Gene ID mapping includes validation gates that report percentage successfully mapped and flag unmapped genes for review
  • INFRA-04: Pipeline retrieves data from external APIs (gnomAD, GTEx, HPA, UniProt, PubMed, model organism DBs) with rate limiting, retry logic, and persistent disk caching
  • INFRA-05: All pipeline parameters (weights, thresholds, data source versions) are configurable via YAML config files with Pydantic validation
  • INFRA-06: Every output includes provenance metadata: pipeline version, data source versions, download timestamps, config hash, and processing steps
  • INFRA-07: Intermediate results are persisted to disk (Parquet/DuckDB) enabling restart-from-checkpoint without re-downloading data
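
As an illustration of the validation gate in INFRA-03, the following is a minimal sketch rather than the pipeline's actual code. The names `MappingReport` and `validate_mapping` and the 90% default threshold are assumptions made for this example; the requirement itself only asks that the mapped percentage be reported and unmapped genes be flagged.

```python
from dataclasses import dataclass, field


@dataclass
class MappingReport:
    """Summary of one ID-mapping run (hypothetical structure)."""
    total: int
    mapped: int
    unmapped: list = field(default_factory=list)

    @property
    def percent_mapped(self) -> float:
        # Avoid division by zero for an empty gene universe.
        return 100.0 * self.mapped / self.total if self.total else 0.0


def validate_mapping(gene_ids, mapping, min_percent=90.0):
    """Report mapping coverage and fail the gate if it is too low.

    gene_ids: iterable of Ensembl gene IDs (primary keys).
    mapping: dict from Ensembl gene ID to HGNC symbol (or UniProt accession).
    min_percent: assumed threshold, not taken from the requirements.
    """
    gene_ids = list(gene_ids)
    unmapped = [g for g in gene_ids if not mapping.get(g)]
    report = MappingReport(
        total=len(gene_ids),
        mapped=len(gene_ids) - len(unmapped),
        unmapped=unmapped,
    )
    if report.percent_mapped < min_percent:
        raise ValueError(
            f"only {report.percent_mapped:.1f}% of genes mapped; "
            f"{len(unmapped)} flagged for review"
        )
    return report


# Placeholder IDs, not real Ensembl identifiers.
report = validate_mapping(
    ["GENE_1", "GENE_2", "GENE_3", "GENE_4"],
    {"GENE_1": "SYM1", "GENE_2": "SYM2", "GENE_3": "SYM3"},
    min_percent=50.0,
)
print(report.percent_mapped, report.unmapped)
```

Raising on a failed gate stops downstream steps from silently running on a badly mapped gene universe; a softer variant could log a warning instead.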

Evidence Layer 1: Gene Annotation Completeness

  • ANNOT-01: Pipeline quantifies functional annotation depth per gene using GO term count, UniProt annotation score, and pathway membership
  • ANNOT-02: Genes are classified into annotation tiers (well-annotated, partially-annotated, poorly-annotated) based on composite annotation metrics
  • ANNOT-03: Annotation completeness score is normalized to 0-1 scale and stored as an evidence layer feature
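
One possible reading of ANNOT-01 to ANNOT-03 is sketched below: each metric is min-max normalized, the three are averaged into a composite, and the composite is bucketed into tiers. The equal weighting and the tier cut-offs (0.66 and 0.33) are assumptions for illustration only; the requirements do not specify them.

```python
def min_max_normalize(values):
    """Scale a dict of gene -> raw metric onto the 0-1 range."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All genes identical on this metric: no information, map to 0.
        return {gene: 0.0 for gene in values}
    return {gene: (v - lo) / (hi - lo) for gene, v in values.items()}


def composite_annotation_score(go_counts, uniprot_scores, pathway_counts):
    """Average the three normalized metrics (equal weights assumed)."""
    parts = [min_max_normalize(m) for m in (go_counts, uniprot_scores, pathway_counts)]
    return {gene: sum(p[gene] for p in parts) / len(parts) for gene in go_counts}


def annotation_tier(score, well=0.66, partial=0.33):
    """Bucket a 0-1 score into the three tiers named in ANNOT-02."""
    if score >= well:
        return "well-annotated"
    if score >= partial:
        return "partially-annotated"
    return "poorly-annotated"


# Toy values for three placeholder genes.
scores = composite_annotation_score(
    go_counts={"A": 0, "B": 10, "C": 20},
    uniprot_scores={"A": 1, "B": 3, "C": 5},
    pathway_counts={"A": 0, "B": 2, "C": 4},
)
tiers = {gene: annotation_tier(s) for gene, s in scores.items()}
print(scores, tiers)
```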

Evidence Layer 2: Tissue Expression

  • EXPR-01: Pipeline retrieves tissue-level expression data from Human Protein Atlas and GTEx for retina, inner ear, and cilia-rich tissues
  • EXPR-02: Pipeline retrieves published scRNA-seq data from CellxGene for photoreceptor subtypes and auditory hair cell subpopulations
  • EXPR-03: Expression data is converted to comparable specificity metrics (e.g., tissue specificity index or relative rank) across data sources
  • EXPR-04: Expression evidence score reflects enrichment in Usher-relevant tissues relative to global expression
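
EXPR-03 names a tissue specificity index as one option. A common choice is the tau index, computed per gene as the sum over tissues of (1 - x_i / x_max), divided by (n - 1). The sketch below assumes non-negative expression values that are already on a comparable scale; whether the pipeline uses tau, and on which transformation of the data, is not stated in the requirements.

```python
def tau_specificity(expression):
    """Tau tissue-specificity index for one gene.

    expression: dict of tissue -> non-negative expression value.
    Returns 0.0 for uniform expression and 1.0 when the gene is
    expressed in exactly one tissue.
    """
    values = list(expression.values())
    n = len(values)
    x_max = max(values) if values else 0.0
    if n < 2 or x_max <= 0:
        # Undefined for a single tissue or an unexpressed gene.
        return 0.0
    return sum(1 - v / x_max for v in values) / (n - 1)


# Toy profiles with placeholder values.
specific = tau_specificity({"retina": 10.0, "liver": 0.0, "lung": 0.0})
uniform = tau_specificity({"retina": 5.0, "liver": 5.0, "lung": 5.0})
print(specific, uniform)
```

Because tau is a ratio within each gene's own profile, it can be computed separately for HPA and GTEx and then compared, which is the kind of cross-source comparability EXPR-03 asks for.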

Evidence Layer 3: Protein Sequence & Structure Features

  • PROT-01: Pipeline extracts protein length, domain composition, and domain count from UniProt/InterPro per gene
  • PROT-02: Pipeline identifies presence of coiled-coil regions, scaffold/adaptor-type domains, and transmembrane domains
  • PROT-03: Pipeline checks for known cilia-associated or sensory structure-associated motifs without presupposing conclusions
  • PROT-04: Protein features are encoded as binary and continuous features normalized to 0-1 scale

Evidence Layer 4: Subcellular Localization

  • LOCA-01: Pipeline integrates high-throughput protein localization data from the Human Protein Atlas subcellular dataset and from published centrosome/cilium proteomics studies
  • LOCA-02: Localization evidence distinguishes direct experimental evidence from computational predictions
  • LOCA-03: Localization score reflects proximity to cilia-related compartments (centrosome, basal body, cilium, stereocilia, transition zone)
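
A minimal sketch of how LOCA-02 and LOCA-03 could combine: each annotation carries a compartment and an evidence type, and the score is the strongest evidence found in a cilia-related compartment. The evidence weights (1.0 for experimental, 0.5 for predicted) and the function name are assumptions for illustration; only the compartment list comes from LOCA-03.

```python
# Compartments listed in LOCA-03.
CILIA_COMPARTMENTS = {
    "centrosome", "basal body", "cilium", "stereocilia", "transition zone",
}

# Hypothetical weights: experimental evidence counts more than predictions.
EVIDENCE_WEIGHT = {"experimental": 1.0, "predicted": 0.5}


def localization_score(annotations):
    """Score one gene from (compartment, evidence_type) pairs.

    Returns None when the gene has no localization data at all, so that
    "unknown" stays distinguishable from "localized elsewhere" (0.0).
    """
    annotations = list(annotations)
    if not annotations:
        return None
    return max(
        (EVIDENCE_WEIGHT[evidence]
         for compartment, evidence in annotations
         if compartment.lower() in CILIA_COMPARTMENTS),
        default=0.0,
    )


print(localization_score([("Cilium", "predicted"), ("nucleus", "experimental")]))
```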

Evidence Layer 5: Genetic Constraint

  • GCON-01: Pipeline retrieves loss-of-function tolerance metrics (pLI, LOEUF) from gnomAD per gene
  • GCON-02: Constraint scores are filtered by coverage quality (mean depth >30x, >90% CDS covered) to avoid unreliable estimates
  • GCON-03: Constraint evidence is interpreted as weak signal for "important but under-studied" rather than direct cilia involvement
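
The coverage filter in GCON-02 can be sketched as follows. The thresholds (mean depth above 30x, more than 90% of the CDS covered) come from the requirement; the record layout and field names are hypothetical and do not correspond to actual gnomAD column names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConstraintRecord:
    """Per-gene constraint data (field names are illustrative)."""
    gene_id: str
    loeuf: float
    mean_depth: float
    cds_covered_fraction: float


def reliable_loeuf(record: ConstraintRecord,
                   min_depth: float = 30.0,
                   min_cds_fraction: float = 0.90) -> Optional[float]:
    """Return LOEUF only when coverage passes both GCON-02 thresholds.

    A gene failing the filter yields None ("unknown") rather than a
    constraint value that would be unreliable.
    """
    if record.mean_depth > min_depth and record.cds_covered_fraction > min_cds_fraction:
        return record.loeuf
    return None


# Placeholder records with made-up values.
well_covered = ConstraintRecord("GENE_1", loeuf=0.3, mean_depth=35.0, cds_covered_fraction=0.95)
low_depth = ConstraintRecord("GENE_2", loeuf=0.2, mean_depth=25.0, cds_covered_fraction=0.95)
print(reliable_loeuf(well_covered), reliable_loeuf(low_depth))
```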

Evidence Layer 6: Animal Model Phenotypes

  • ANIM-01: Pipeline retrieves gene knockout/perturbation phenotypes from MGI (mouse), ZFIN (zebrafish), and IMPC
  • ANIM-02: Phenotypes are filtered for relevance to sensory function, balance, vision, hearing, and cilia morphology
  • ANIM-03: Ortholog mapping uses established databases with confidence scoring, handling one-to-many mappings explicitly
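
To make the one-to-many handling in ANIM-03 concrete, the sketch below groups ortholog pairs by human gene, drops low-confidence pairs, and flags genes that map to more than one model-organism gene. The tuple layout, the 0.5 confidence cut-off, and the function name are assumptions; the requirement does not name a specific ortholog database or score scale.

```python
from collections import defaultdict


def group_orthologs(pairs, min_confidence=0.5):
    """Group (human_gene, model_gene, confidence) tuples by human gene.

    Pairs below min_confidence (an assumed cut-off) are discarded.
    Each surviving human gene gets its orthologs sorted by confidence
    and an explicit one_to_many flag.
    """
    grouped = defaultdict(list)
    for human_gene, model_gene, confidence in pairs:
        if confidence >= min_confidence:
            grouped[human_gene].append((model_gene, confidence))

    result = {}
    for human_gene, hits in grouped.items():
        hits.sort(key=lambda hit: hit[1], reverse=True)
        result[human_gene] = {"orthologs": hits, "one_to_many": len(hits) > 1}
    return result


# Placeholder gene names and confidences.
orthologs = group_orthologs([
    ("H1", "m1", 0.9),
    ("H1", "m2", 0.6),
    ("H2", "m3", 0.3),
])
print(orthologs)
```

Keeping every ortholog with its confidence, instead of silently picking one, leaves the choice of how to aggregate phenotypes across paralogs to a later, documented step.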

Literature Evidence

  • LITE-01: Pipeline performs systematic PubMed queries per candidate gene for mentions in cilia, sensory organ, cytoskeleton, and cell polarity contexts
  • LITE-02: Literature evidence distinguishes direct experimental evidence, incidental mentions, and high-throughput screen hits as qualitative tiers
  • LITE-03: Literature score reflects evidence quality (not just publication count) to mitigate well-studied gene bias
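
One way to make the literature score reflect quality rather than publication count (LITE-03) is to score a gene by its best evidence tier and ignore how many papers fall in each tier. The sketch below does that; the tier keys, the numeric weights, and the relative ordering of high-throughput hits versus incidental mentions are assumptions made for this example.

```python
# Hypothetical weights for the three qualitative tiers in LITE-02.
TIER_WEIGHT = {
    "direct_experimental": 1.0,
    "high_throughput_hit": 0.6,
    "incidental_mention": 0.2,
}


def literature_score(tier_counts):
    """Score a gene by the strongest evidence tier that is present.

    tier_counts: dict of tier name -> number of publications.
    Counts only decide whether a tier is present, so a heavily studied
    gene with many incidental mentions does not outrank a gene with a
    single direct experiment.
    """
    present = [TIER_WEIGHT[tier] for tier, n in tier_counts.items() if n > 0]
    return max(present, default=0.0)


print(literature_score({"incidental_mention": 500}))
print(literature_score({"direct_experimental": 1, "incidental_mention": 3}))
```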

Scoring & Integration

  • SCOR-01: Pipeline compiles known cilia/Usher gene set from CiliaCarta, SYSCILIA gold standard, and OMIM Usher entries as exclusion set and positive controls
  • SCOR-02: Multi-evidence integration uses weighted rule-based scoring with configurable per-layer weights, producing a composite score per gene
  • SCOR-03: Scoring handles missing data explicitly — genes lacking evidence in a layer receive "unknown" status rather than zero score
  • SCOR-04: Known cilia/Usher genes are used as positive controls (should rank highly before exclusion) to validate scoring system
  • SCOR-05: Quality control checks detect missing data rates, score distribution anomalies, and outliers per evidence layer
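
SCOR-02 and SCOR-03 together imply a weighted score in which a missing layer is not counted as zero. A minimal sketch under one possible interpretation: unknown layers are represented as `None` and the weights are renormalized over the layers that do have data. The renormalization rule, the function name, and the example weights are assumptions; the requirements only state that missing layers receive "unknown" status.

```python
def composite_score(layer_scores, weights):
    """Weighted mean over evidence layers with known scores.

    layer_scores: dict of layer -> score in [0, 1], or None if unknown.
    weights: dict of layer -> configurable per-layer weight.
    Returns (score, n_known_layers); score is None when every layer is
    unknown, so the gene is reported as unscored rather than as zero.
    """
    known = {layer: s for layer, s in layer_scores.items() if s is not None}
    total_weight = sum(weights[layer] for layer in known)
    if total_weight == 0:
        return None, 0
    score = sum(weights[layer] * s for layer, s in known.items()) / total_weight
    return score, len(known)


# Hypothetical weights and scores for one gene.
weights = {"expression": 2.0, "localization": 1.0, "constraint": 1.0}
score, n_known = composite_score(
    {"expression": 1.0, "localization": None, "constraint": 0.5},
    weights,
)
print(score, n_known)
```

Returning the number of known layers alongside the score keeps evidence breadth visible, since a high score built on a single layer is weaker than the same score supported by several.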

Output & Reporting

  • OUTP-01: Pipeline produces tiered candidate list (high/medium/low confidence) based on composite score and evidence breadth
  • OUTP-02: Each candidate gene includes a multi-dimensional evidence summary showing which layers support it and which have gaps
  • OUTP-03: Output is in structured machine-readable format (TSV and Parquet) compatible with downstream PPI and structural prediction tools
  • OUTP-04: Pipeline generates basic visualizations: score distribution, evidence layer contribution plots, tier breakdown
  • OUTP-05: Pipeline produces reproducibility report documenting all parameters, data versions, gene counts at each filtering step, and validation metrics
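
OUTP-01 ties the confidence tier to both the composite score and evidence breadth. The sketch below shows one way to combine the two; the cut-offs (score 0.7 with 4 supporting layers for high, 0.4 with 2 layers for medium) are placeholder values chosen for illustration and are not specified by the requirements.

```python
def assign_tier(score, n_supporting_layers, high=(0.7, 4), medium=(0.4, 2)):
    """Assign a confidence tier from score and evidence breadth.

    high and medium are (min_score, min_layers) pairs; both conditions
    must hold, so a high score backed by a single layer cannot reach
    the high tier.
    """
    if score >= high[0] and n_supporting_layers >= high[1]:
        return "high"
    if score >= medium[0] and n_supporting_layers >= medium[1]:
        return "medium"
    return "low"


print(assign_tier(0.8, 5), assign_tier(0.8, 1), assign_tier(0.5, 3))
```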

v2 Requirements

Deferred to future release. Tracked but not in current roadmap.

Advanced Scoring

  • ASCR-01: Explainable scoring with SHAP-style per-gene evidence breakdown showing why each gene ranks where it does
  • ASCR-02: Systematic under-annotation bias correction that downweights literature-heavy features for under-studied candidates
  • ASCR-03: Sensitivity analysis with parameter sweep across weight configurations and rank stability metrics
  • ASCR-04: Evidence conflict detection flagging genes with contradictory evidence patterns

Advanced Output

  • AOUT-01: Interactive HTML report with browsable results, sortable tables, and linked evidence sources
  • AOUT-02: Negative control validation testing against housekeeping genes to assess specificity

Extended Evidence

  • AEVD-01: Cross-species homology confidence scoring using DIOPT with explicit ortholog quality thresholds
  • AEVD-02: Cilia-specific knowledgebase integration (CilioGenics, CiliaMiner) as additional evidence layer
  • AEVD-03: Incremental update capability to re-run with new data without full recomputation

Out of Scope

Explicitly excluded. Documented to prevent scope creep.

| Feature | Reason |
|---------|--------|
| Real-time web dashboard | Overkill for research tool; static reports + CLI sufficient |
| GUI for parameter tuning | Research pipelines need reproducible CLI execution |
| Variant-level analysis | Out of scope for gene-level discovery; use Exomiser/LIRICAL for variant work |
| Custom alignment/variant calling | Well-solved problem; focus on gene prioritization logic |
| ML-based scoring model | Small positive control set insufficient for robust ML; rule-based more transparent |
| LLM-based automated literature scanning | High complexity, cost, and uncertainty; manual/programmatic PubMed queries sufficient |
| Bayesian evidence weight optimization | Requires larger training set; manual weight tuning sufficient for v1 |
| Private/proprietary datasets | Public data only for reproducibility |
| Downstream PPI network analysis | This pipeline produces input candidate list; PPI is separate |
| AlphaFold structural predictions | Downstream analysis, not part of discovery pipeline |

Traceability

Which phases cover which requirements. Updated during roadmap creation.

| Requirement | Phase | Status |
|-------------|-------|--------|
| INFRA-01 | Phase 1 | Pending |
| INFRA-02 | Phase 1 | Pending |
| INFRA-03 | Phase 1 | Pending |
| INFRA-04 | Phase 1 | Pending |
| INFRA-05 | Phase 1 | Pending |
| INFRA-06 | Phase 1 | Pending |
| INFRA-07 | Phase 1 | Pending |
| ANNOT-01 | Phase 3 | Pending |
| ANNOT-02 | Phase 3 | Pending |
| ANNOT-03 | Phase 3 | Pending |
| EXPR-01 | Phase 3 | Pending |
| EXPR-02 | Phase 3 | Pending |
| EXPR-03 | Phase 3 | Pending |
| EXPR-04 | Phase 3 | Pending |
| PROT-01 | Phase 3 | Pending |
| PROT-02 | Phase 3 | Pending |
| PROT-03 | Phase 3 | Pending |
| PROT-04 | Phase 3 | Pending |
| LOCA-01 | Phase 3 | Pending |
| LOCA-02 | Phase 3 | Pending |
| LOCA-03 | Phase 3 | Pending |
| GCON-01 | Phase 2 | Pending |
| GCON-02 | Phase 2 | Pending |
| GCON-03 | Phase 2 | Pending |
| ANIM-01 | Phase 3 | Pending |
| ANIM-02 | Phase 3 | Pending |
| ANIM-03 | Phase 3 | Pending |
| LITE-01 | Phase 3 | Pending |
| LITE-02 | Phase 3 | Pending |
| LITE-03 | Phase 3 | Pending |
| SCOR-01 | Phase 4 | Pending |
| SCOR-02 | Phase 4 | Pending |
| SCOR-03 | Phase 4 | Pending |
| SCOR-04 | Phase 4 | Pending |
| SCOR-05 | Phase 4 | Pending |
| OUTP-01 | Phase 5 | Pending |
| OUTP-02 | Phase 5 | Pending |
| OUTP-03 | Phase 5 | Pending |
| OUTP-04 | Phase 5 | Pending |
| OUTP-05 | Phase 5 | Pending |

Coverage:

  • v1 requirements: 40 total
  • Mapped to phases: 40
  • Unmapped: 0

Requirements defined: 2026-02-11
Last updated: 2026-02-11 after roadmap creation