# Feature Landscape

**Domain:** Gene Candidate Discovery and Prioritization for Rare Disease / Ciliopathy Research
**Researched:** 2026-02-11
**Confidence:** MEDIUM — based on WebSearch findings verified across multiple sources; Context7 not available for specialized bioinformatics tools; recommendations synthesized from peer-reviewed publications and established tool documentation

## Table Stakes

Features users expect; a pipeline missing these is not credible.

| Feature | Why Expected | Complexity | Notes |
|---|---|---|---|
| Multi-evidence scoring | Standard in modern gene prioritization; single-source approaches are insufficient for rare disease | Medium | Requires weighting scheme and score normalization across evidence types (see the sketch after this table) |
| Reproducibility documentation | Required for publication and scientific validity; FDA/NIH standards emphasize reproducible pipelines | Low-Medium | Parameter logging, version tracking, seed control, execution environment capture |
| Data provenance tracking | W3C PROV standard; required to trace analysis steps and validate results | Medium | Track all data transformations, source versions, timestamps, intermediate results |
| Known gene validation | Benchmarking against established disease genes is standard practice; without it, no confidence in results | Low-Medium | Positive control set, recall metrics at various rank cutoffs (e.g., recall@10, recall@50) |
| Quality control checks | Standard for NGS and bioinformatics pipelines; catch data issues early | Low-Medium | Missing-data detection, outlier identification, distribution checks, data completeness metrics |
| Structured output format | Machine-readable outputs enable downstream analysis and integration | Low | CSV/TSV for tabular data, JSON for metadata, standard column naming |
| Basic visualization | Visual inspection of scores, distributions, and rankings is expected | Medium | Score distribution plots, rank visualization, evidence contribution plots |
| Literature evidence | Gene function annotation is incomplete; literature mining is standard for discovery pipelines | High | PubMed/literature search integration, manual or automated; alternatively, curated disease-gene databases |
| HPO/phenotype integration | Standard for rare disease gene prioritization since tools like Exomiser and LIRICAL | Medium | Human Phenotype Ontology term matching; phenotype similarity scoring if applicable |
| API-based data retrieval | Manual downloads don't scale; automated retrieval from gnomAD, UniProt, GTEx, etc. is expected | Medium | Rate limiting, error handling, caching, retry logic |
| Batch processing | Single-gene analysis doesn't scale to 20K genes | Low-Medium | Parallel execution, progress tracking, resume-from-checkpoint |
| Parameter configuration | Hard-coded parameters prevent adaptation; config files are standard | Low | YAML/JSON config, CLI arguments, validation |
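
As a concrete illustration of the weighting-scheme requirement, here is a minimal sketch of one way the multi-layer combination could work, assuming each evidence layer has already been fetched into a per-gene table. The weight values and names (`EVIDENCE_WEIGHTS`, `combine_scores`) are illustrative placeholders, not the pipeline's committed scheme:

```python
import pandas as pd

# Hypothetical layer weights; real values would live in the YAML config and
# be tuned against positive controls (see Multi-Modal Evidence Weighting).
EVIDENCE_WEIGHTS = {
    "annotation": 0.15,
    "expression": 0.20,
    "sequence": 0.15,
    "localization": 0.20,
    "constraint": 0.15,
    "phenotype": 0.15,
}

def combine_scores(evidence: pd.DataFrame) -> pd.Series:
    """Combine per-layer evidence into one composite score per gene.

    `evidence` has one row per gene and one column per evidence layer.
    Each layer is z-score normalized so layers on different scales
    (e.g., TPM vs. pLI) contribute comparably before weighting.
    Missing values are skipped rather than treated as zero evidence.
    """
    z = (evidence - evidence.mean()) / evidence.std(ddof=0)
    weights = pd.Series(EVIDENCE_WEIGHTS).reindex(z.columns)
    return (z * weights).sum(axis=1).rename("composite_score")
```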

## Differentiators

Features that set the pipeline apart. Not expected, but valued.

| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| Explainable scoring | SHAP/attribution methods show WHY genes rank high; critical for discovery (not just diagnosis) | High | SHAP-style contribution analysis, per-gene evidence breakdown, visual explanations (see the sketch after this table) |
| Systematic under-annotation bias handling | Novel: most tools favor well-studied genes; correcting publication bias is a research advantage | High | Annotation completeness score as an evidence layer; downweight literature-heavy features for under-studied candidates |
| Cilia-specific knowledgebase integration | Leverage CilioGenics, CiliaMiner, and ciliopathy databases for domain-focused scoring | Medium | Custom evidence layer; API/download from specialized databases |
| Sensitivity analysis | Systematic parameter tuning is rare in discovery pipelines; demonstrates robustness | Medium-High | Grid-search or DoE-based parameter sweep; rank stability metrics across configs |
| Tiered output with rationale | Not just a ranked list but groups by confidence/evidence type; aids hypothesis generation | Low-Medium | Tier classification logic (e.g., high/medium/low confidence), evidence summary per tier |
| Multi-modal evidence weighting | Naive Bayesian integration or optimized weights outperform equal weighting | Medium | Weight optimization using known positive controls, cross-validation |
| Negative control validation | Test against known non-disease genes to assess specificity; rare in discovery pipelines | Low-Medium | Negative gene set (e.g., housekeeping genes), precision metrics |
| Evidence conflict detection | Flag genes with contradictory evidence (e.g., high expression but low constraint) | Medium | Rule-based or correlation-based conflict identification |
| Interactive HTML report | Modern tools (e.g., MultiQC-style) provide browsable results; better than static CSV | Medium | HTML generation, embedded plots, sortable tables, linked evidence |
| Incremental update capability | Re-run with new data sources without full recomputation | Medium-High | Modular pipeline, cached intermediate results, dependency tracking |
| Cross-species homology scoring | Animal model phenotypes are critical for ciliopathy; ortholog-based evidence integration | Medium | DIOPT/OrthoDB integration, phenotype transfer from model organisms |
| Automated literature scanning with LLM | Emerging: LLM-based RAG for literature evidence extraction and validation | High | LLM API integration, prompt engineering, faithfulness checks, cost management |
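
If the composite score is the linear weighted sum sketched under Table Stakes, the per-gene evidence breakdown is exact without full SHAP machinery, since a linear model's contributions are simply weight times normalized score. A minimal sketch under that assumption (function name illustrative); a non-linear scoring variant would need a proper SHAP explainer instead:

```python
import pandas as pd

def explain_gene(evidence_z: pd.DataFrame, weights: pd.Series, gene: str) -> pd.Series:
    """Per-layer contributions to one gene's composite score.

    For a linear weighted sum the attribution is exact: each layer
    contributes weight * normalized score, and the contributions add
    up to the composite score (the same decomposition SHAP recovers
    for linear models).
    """
    contributions = evidence_z.loc[gene] * weights
    return contributions.sort_values(ascending=False)
```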

## Anti-Features

Features to explicitly NOT build.

| Anti-Feature | Why Avoid | What to Do Instead |
|---|---|---|
| Real-time web dashboard | Overkill for a research tool; adds deployment complexity and security concerns | Static HTML reports + CLI; Jupyter notebooks for interactive exploration if needed |
| GUI for parameter tuning | Research pipelines need reproducible command-line execution; GUIs hinder automation | YAML config files + CLI (see the config sketch after this table); document parameter rationale in config comments |
| Variant-level analysis | Out of scope for gene-level discovery; conflates discovery with diagnostic prioritization | Focus on gene-level evidence aggregation; refer users to Exomiser/LIRICAL for variant work |
| Custom alignment/variant calling | Well-solved problem; reinventing the wheel wastes time | Use standard BAM/VCF inputs from established pipelines; focus on gene prioritization logic |
| Social features (sharing, comments) | Research tool, not a collaboration platform | File-based outputs (shareable via Git/email); documentation in README |
| Real-time database sync | Bioinformatics data versions change slowly; real-time sync is unnecessary | Versioned data snapshots with documented download dates; update quarterly or as needed |
| One-click install for all dependencies | Bioinformatics tools have complex dependencies; a false promise | Conda environment.yml or Docker container; document setup steps clearly |
| Machine learning model training | Small positive control set is insufficient for robust ML; rule-based scoring is more transparent | Weighted scoring with manually tuned/optimized weights; reserve ML for a future with larger training data |
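
To make the config-file recommendation concrete, a hedged sketch of what a YAML config and its load-time validation might look like; the keys and values shown are illustrative assumptions, not a fixed schema:

```python
import yaml  # PyYAML

# Illustrative config shape only; the real schema would mirror the
# pipeline's actual parameters and data sources.
EXAMPLE_CONFIG = """\
evidence_weights:
  annotation: 0.15
  expression: 0.20
  sequence: 0.15
  localization: 0.20
  constraint: 0.15
  phenotype: 0.15
data_sources:
  gnomad_version: "4.1"
  gtex_release: "v8"
thresholds:
  high_confidence_percentile: 99
"""

def load_config(path: str) -> dict:
    """Load the YAML config and fail fast on an invalid weighting scheme."""
    with open(path) as fh:
        cfg = yaml.safe_load(fh)
    total = sum(cfg["evidence_weights"].values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"evidence_weights must sum to 1.0, got {total}")
    return cfg
```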

## Feature Dependencies

```
Parameter Configuration
    └──requires──> Quality Control Checks
                      └──requires──> Data Provenance Tracking

Multi-Evidence Scoring
    ├──requires──> API-Based Data Retrieval
    ├──requires──> Literature Evidence
    └──requires──> Structured Output Format

Explainable Scoring
    └──requires──> Multi-Evidence Scoring
                      └──enhances──> Interactive HTML Report

Known Gene Validation
    └──requires──> Multi-Evidence Scoring
                      └──enables──> Sensitivity Analysis

Tiered Output with Rationale
    └──requires──> Explainable Scoring
                      └──enhances──> Interactive HTML Report

Batch Processing
    └──requires──> Parameter Configuration
                      └──enables──> Sensitivity Analysis

Negative Control Validation
    └──requires──> Known Gene Validation (uses similar metrics)

Evidence Conflict Detection
    └──requires──> Multi-Evidence Scoring

Incremental Update Capability
    └──requires──> Data Provenance Tracking
                      └──requires──> Batch Processing

Cross-Species Homology Scoring
    └──requires──> API-Based Data Retrieval

Automated Literature Scanning with LLM
    └──requires──> Literature Evidence
                      └──conflicts──> Manual curation (choose one approach per evidence layer)
```

### Dependency Notes

- Parameter Configuration → QC Checks → Data Provenance: the foundation stack; parameters must be logged, QC applied, and provenance tracked before any analysis
- Multi-Evidence Scoring requires API-Based Data Retrieval: can't score without data; retrieval must be robust and cached (see the retrieval sketch after this list)
- Explainable Scoring requires Multi-Evidence Scoring: can't explain scores that don't exist; explanations decompose the composite score
- Known Gene Validation enables Sensitivity Analysis: positive controls provide ground truth for parameter tuning
- Automated Literature Scanning conflicts with Manual Curation: choose one approach per evidence layer to avoid redundancy and conflicting evidence
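
A sketch of the "robust and cached" retrieval pattern: responses are cached on disk keyed by URL, and transient failures are retried with exponential backoff. The cache directory, timeout, and retry counts are illustrative assumptions:

```python
import hashlib
import json
import time
from pathlib import Path

import requests

CACHE_DIR = Path("cache")  # illustrative location

def fetch_json(url: str, max_retries: int = 3, backoff: float = 2.0) -> dict:
    """GET a JSON endpoint with on-disk caching and exponential-backoff retries.

    The cached response doubles as a provenance record: re-runs read the
    exact bytes used in the original analysis instead of hitting the API.
    """
    CACHE_DIR.mkdir(exist_ok=True)
    cached = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".json")
    if cached.exists():
        return json.loads(cached.read_text())
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            cached.write_text(resp.text)
            return resp.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff ** attempt)  # 1s, 2s, 4s, ...
```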

## MVP Recommendation

### Launch With (v1)

Minimum viable pipeline for ciliopathy gene discovery.

- Multi-evidence scoring (6 layers: annotation, expression, sequence, localization, constraint, phenotype)
- API-based data retrieval with caching (gnomAD, GTEx, HPA, UniProt, model organism DBs)
- Known gene validation (Usher syndrome genes and known ciliopathy genes as positive controls; see the recall@k sketch below)
- Reproducibility documentation (parameter logging, versions, timestamps)
- Data provenance tracking (source file versions, processing steps, intermediate results)
- Structured output format (CSV with ranked genes, evidence scores per column)
- Quality control checks (missing-data detection, outlier flagging, distribution checks)
- Batch processing (parallel execution across 20K genes)
- Parameter configuration (YAML config for weights, thresholds, data sources)
- Tiered output with rationale (High/Medium/Low confidence tiers, evidence summary)
- Basic visualization (score distributions, top-candidate plots)

Rationale: these features enable credible, reproducible gene prioritization at scale. Without validation, QC, and provenance tracking, results are untrustworthy; without tiered output and per-tier rationale, generating hypotheses from 20K genes is overwhelming.
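
Once a ranking exists, positive-control validation is a few lines. A minimal sketch of the recall@k metric named above; the gene symbols in the usage note are well-known Usher syndrome genes used purely for illustration, not a curated control set:

```python
def recall_at_k(ranked_genes: list[str], positives: set[str], k: int) -> float:
    """Fraction of the positive-control genes recovered in the top k ranks."""
    hits = sum(1 for gene in ranked_genes[:k] if gene in positives)
    return hits / len(positives)

# Usage sketch; the control set here is illustrative, not curated.
usher_genes = {"MYO7A", "USH2A", "CDH23", "PCDH15", "CLRN1"}
# recall_at_k(ranked_genes, usher_genes, 50)  -> recall@50
```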

### Add After Validation (v1.x)

Features to add once the core pipeline is validated against known ciliopathy genes.

- Interactive HTML report (browsable results with sortable tables, linked evidence) — Trigger: after v1 produces a validated candidate list; when sharing results with collaborators
- Explainable scoring (per-gene evidence contribution breakdown, SHAP-style attribution) — Trigger: after identifying novel candidates; when reviewers ask "why is this gene ranked high?"
- Negative control validation (test against housekeeping genes to assess specificity) — Trigger: after positive control validation succeeds; to quantify the false positive rate
- Evidence conflict detection (flag genes with contradictory evidence patterns) — Trigger: after observing unexpected high-ranking genes; to catch data quality issues
- Sensitivity analysis (systematic parameter sweep, rank stability metrics; see the sketch after this list) — Trigger: when preparing for publication; to demonstrate robustness
- Cross-species homology scoring (zebrafish/mouse phenotype evidence integration) — Trigger: if animal model evidence proves valuable in the initial analysis
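
One plausible form of the rank-stability metric mentioned for sensitivity analysis: Spearman correlation between the baseline ranking and a ranking produced under perturbed parameters. A sketch, assuming composite scores as pandas Series (function name illustrative):

```python
import pandas as pd
from scipy.stats import spearmanr

def rank_stability(baseline: pd.Series, perturbed: pd.Series) -> float:
    """Spearman rank correlation between baseline and perturbed-parameter
    composite scores; values near 1.0 mean the gene ordering is robust
    to that parameter change.
    """
    aligned = pd.concat([baseline, perturbed], axis=1, join="inner").dropna()
    rho, _ = spearmanr(aligned.iloc[:, 0], aligned.iloc[:, 1])
    return float(rho)
```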

### Future Consideration (v2+)

Features to defer until the core pipeline is published and validated.

- Automated literature scanning with LLM (RAG-based PubMed evidence extraction) — Why defer: high complexity, cost, and uncertainty; manual curation is sufficient for initial discovery
- Incremental update capability (re-run with new data without full recomputation) — Why defer: overkill for a one-time discovery project; valuable if the pipeline becomes an ongoing surveillance tool
- Multi-modal evidence weighting optimization (Bayesian integration, cross-validation) — Why defer: requires a larger training set of known genes; manual weight tuning is sufficient initially
- Systematic under-annotation bias handling — Why defer: novel research question; defer until initial discovery validates the approach
- Cilia-specific knowledgebase integration (CilioGenics, CiliaMiner) — Why defer: nice-to-have; primary evidence layers are sufficient for initial analysis; add if reviewers request it

## Feature Prioritization Matrix

| Feature | User Value | Implementation Cost | Priority |
|---|---|---|---|
| Multi-evidence scoring | HIGH | MEDIUM | P1 |
| Known gene validation | HIGH | LOW-MEDIUM | P1 |
| Reproducibility documentation | HIGH | LOW-MEDIUM | P1 |
| Data provenance tracking | HIGH | MEDIUM | P1 |
| Structured output format | HIGH | LOW | P1 |
| Quality control checks | HIGH | LOW-MEDIUM | P1 |
| API-based data retrieval | HIGH | MEDIUM | P1 |
| Batch processing | HIGH | LOW-MEDIUM | P1 |
| Parameter configuration | HIGH | LOW | P1 |
| Tiered output with rationale | HIGH | LOW-MEDIUM | P1 |
| Basic visualization | HIGH | MEDIUM | P1 |
| Explainable scoring | HIGH | HIGH | P2 |
| Interactive HTML report | MEDIUM | MEDIUM | P2 |
| Sensitivity analysis | MEDIUM | MEDIUM-HIGH | P2 |
| Evidence conflict detection | MEDIUM | MEDIUM | P2 |
| Negative control validation | MEDIUM | LOW-MEDIUM | P2 |
| Cross-species homology | MEDIUM | MEDIUM | P2 |
| Incremental updates | LOW | MEDIUM-HIGH | P3 |
| LLM literature scanning | LOW | HIGH | P3 |
| Multi-modal weight optimization | MEDIUM | MEDIUM | P3 |
| Under-annotation bias correction | MEDIUM | HIGH | P3 |
| Cilia knowledgebase integration | LOW-MEDIUM | MEDIUM | P3 |

Priority key:

- P1: must have for a credible v1 pipeline
- P2: should have; add after v1 validation
- P3: nice to have; future consideration

## Competitor Feature Analysis

| Feature | Exomiser | LIRICAL | CilioGenics Tool | Usher Pipeline Approach |
|---|---|---|---|---|
| Multi-evidence scoring | Yes (variant + pheno combined score) | Yes (LR-based) | Yes (ML predictions) | Yes (6-layer weighted scoring) |
| Phenotype integration | HPO-based (strong) | HPO-based (strong) | Not primary | HPO-compatible but not required (sensory cilia focus) |
| Known gene validation | Benchmarked on Mendelian diseases | Benchmarked on Mendelian diseases | Validated on known cilia genes | Validate on Usher + known ciliopathy genes |
| Explainable scoring | Limited | Post-test probability (interpretable) | ML black box | SHAP-style per-gene evidence breakdown (planned, P2) |
| Variant-level analysis | Primary focus | Primary focus | No (gene-level only) | No (out of scope; gene-level discovery) |
| Literature evidence | Automated (limited) | Limited | Text mining used in training | Manual curation in v1; LLM automation planned (P3) |
| Tiered output | Yes (rank-ordered) | Yes (post-test probability tiers) | Yes (confidence scores) | Yes (High/Medium/Low tiers + rationale) |
| Under-annotation bias | Not addressed | Not addressed | Not addressed | Explicitly addressed (novel) |
| Domain-specific focus | Mendelian disease diagnosis | Mendelian disease diagnosis | Cilia biology | Usher syndrome / ciliopathy discovery |
| Reproducibility | Config files, versions logged | Config files, versions logged | Not emphasized | Extensive provenance tracking |

Key differentiators for the Usher pipeline:

1. Discovery vs. diagnosis: Exomiser/LIRICAL prioritize variants in patient genomes (diagnosis); the Usher pipeline screens all genes for under-studied candidates (discovery)
2. Under-annotation bias handling: explicitly score annotation completeness and de-bias toward under-studied genes (see the sketch after this list)
3. Explainable scoring for hypothesis generation: per-gene evidence breakdown to guide experimental follow-up, not just rank genes
4. Sensory cilia focus: retina/cochlea expression, ciliopathy phenotypes, and subcellular localization evidence tailored to Usher biology
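
One possible shape of the de-biasing correction, assuming an annotation-completeness score per gene scaled to [0, 1]; this is a sketch of the idea, not the pipeline's committed method:

```python
import pandas as pd

def debiased_literature(lit_z: pd.Series, completeness: pd.Series) -> pd.Series:
    """Shrink literature-derived evidence toward 0 for under-annotated genes.

    A sparse publication record then reads as 'no evidence' rather than
    'negative evidence', so under-studied genes are not penalized for
    never having been looked at. `completeness` is assumed in [0, 1];
    genes with no completeness score conservatively contribute nothing
    from the literature layer.
    """
    return lit_z * completeness.reindex(lit_z.index).fillna(0.0)
```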
