docs: complete project research

2026-02-11 14:52:06 +08:00
parent c0abe8bc6c
commit bb7bfaedab
5 changed files with 2133 additions and 0 deletions

@@ -0,0 +1,259 @@
# Feature Landscape
**Domain:** Gene Candidate Discovery and Prioritization for Rare Disease / Ciliopathy Research
**Researched:** 2026-02-11
**Confidence:** MEDIUM
## Table Stakes
Features users expect. Missing = pipeline is not credible.
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Multi-evidence scoring | Standard in modern gene prioritization; single-source approaches insufficient for rare disease | Medium | Requires weighting scheme, score normalization across evidence types |
| Reproducibility documentation | Required for publication and scientific validity; FDA/NIH standards emphasize reproducible pipelines | Low-Medium | Parameter logging, version tracking, seed control, execution environment capture |
| Data provenance tracking | W3C PROV standard; required to trace analysis steps and validate results | Medium | Track all data transformations, source versions, timestamps, intermediate results |
| Known gene validation | Benchmarking against established disease genes is standard practice; without this, no confidence in results | Low-Medium | Positive control set, recall metrics at various rank cutoffs (e.g., recall@10, recall@50) |
| Quality control checks | Standard for NGS and bioinformatics pipelines; catch data issues early | Low-Medium | Missing data detection, outlier identification, distribution checks, data completeness metrics |
| Structured output format | Machine-readable outputs enable downstream analysis and integration | Low | CSV/TSV for tabular data, JSON for metadata, standard column naming |
| Basic visualization | Visual inspection of scores, distributions, and rankings is expected | Medium | Score distribution plots, rank visualization, evidence contribution plots |
| Literature evidence | Gene function annotation incomplete; literature mining standard for discovery pipelines | High | PubMed/literature search integration, manual or automated; alternatively curated disease-gene databases |
| HPO/phenotype integration | Standard in rare disease gene prioritization; established by tools such as Exomiser and LIRICAL | Medium | Human Phenotype Ontology term matching, phenotype similarity scoring if applicable |
| API-based data retrieval | Manual downloads don't scale; automated retrieval from gnomAD, UniProt, GTEx, etc. is expected | Medium | Rate limiting, error handling, caching, retry logic |
| Batch processing | Single-gene analysis doesn't scale to 20K genes | Low-Medium | Parallel execution, progress tracking, resume-from-checkpoint |
| Parameter configuration | Hard-coded parameters prevent adaptation; config files standard | Low | YAML/JSON config, CLI arguments, validation |
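
The multi-evidence scoring row above calls for a weighting scheme and score normalization across evidence types. A minimal sketch of one way to do that follows, assuming min-max normalization per layer and a weighted mean over the layers that have data. The layer names, gene names, values, and weights are invented for illustration and are not taken from the pipeline.

```python
from typing import Dict, Optional

Layer = Dict[str, Optional[float]]  # gene symbol -> raw value (None = no data)


def min_max_normalize(values: Layer) -> Layer:
    """Rescale one evidence layer to [0, 1]; missing values stay None."""
    present = [v for v in values.values() if v is not None]
    if not present:
        return dict(values)
    lo, hi = min(present), max(present)
    if hi == lo:
        # A constant layer carries no ranking information; use a neutral 0.5.
        return {g: (None if v is None else 0.5) for g, v in values.items()}
    return {g: (None if v is None else (v - lo) / (hi - lo))
            for g, v in values.items()}


def composite_score(gene: str, layers: Dict[str, Layer],
                    weights: Dict[str, float]) -> Optional[float]:
    """Weighted mean over the layers that have data for this gene."""
    used = [(layers[name][gene], w) for name, w in weights.items()
            if layers.get(name, {}).get(gene) is not None]
    total_weight = sum(w for _, w in used)
    if total_weight == 0:
        return None  # no evidence at all: leave unscored rather than score as 0
    return sum(v * w for v, w in used) / total_weight


# Toy inputs (hypothetical genes, values, and weights).
raw = {
    "expression": {"GENE_A": 10.0, "GENE_B": 0.0, "GENE_C": None},
    "constraint": {"GENE_A": 0.2, "GENE_B": 0.8, "GENE_C": 0.5},
}
weights = {"expression": 2.0, "constraint": 1.0}

normalized = {name: min_max_normalize(layer) for name, layer in raw.items()}
scores = {g: composite_score(g, normalized, weights)
          for g in ("GENE_A", "GENE_B", "GENE_C")}
```

Renormalizing the weights over the available layers means a gene is not penalized for a layer with no data. That is a design choice, not a requirement: treating missing data as zero would instead push poorly annotated genes down the ranking.
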
## Differentiators
Features that set the product apart. Not expected, but valued.
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Explainable scoring | SHAP/attribution methods show WHY genes rank high; critical for discovery (not just diagnosis) | High | SHAP-style contribution analysis, per-gene evidence breakdown, visual explanations |
| Systematic under-annotation bias handling | Novel: most tools favor well-studied genes; correcting publication bias is research advantage | High | Annotation completeness score as evidence layer; downweight literature-heavy features for under-studied candidates |
| Cilia-specific knowledgebase integration | Leverage CilioGenics, CiliaMiner, ciliopathy databases for domain-focused scoring | Medium | Custom evidence layer; API/download from specialized databases |
| Sensitivity analysis | Systematic parameter tuning rare in discovery pipelines; shows robustness | Medium-High | Grid search or DoE-based parameter sweep; rank stability metrics across configs |
| Tiered output with rationale | Not just ranked list but grouped by confidence/evidence type; aids hypothesis generation | Low-Medium | Tier classification logic (e.g., high/medium/low confidence), evidence summary per tier |
| Multi-modal evidence weighting | Naive Bayesian integration or optimized weights can outperform equal weighting when enough positive controls are available | Medium | Weight optimization using known positive controls, cross-validation |
| Negative control validation | Test against known non-disease genes to assess specificity; rare in discovery pipelines | Low-Medium | Negative gene set (e.g., housekeeping genes), precision metrics |
| Evidence conflict detection | Flag genes with contradictory evidence (e.g., high expression but low constraint) | Medium | Rule-based or correlation-based conflict identification |
| Interactive HTML report | Modern tools (e.g., MultiQC-style) provide browsable results; better than static CSV | Medium | HTML generation, embedded plots, sortable tables, linked evidence |
| Incremental update capability | Re-run with new data sources without full recomputation | Medium-High | Modular pipeline, cached intermediate results, dependency tracking |
| Cross-species homology scoring | Animal model phenotypes critical for ciliopathy; ortholog-based evidence integration | Medium | DIOPT/OrthoDB integration, phenotype transfer from model organisms |
| Automated literature scanning with LLM | Emerging: LLM-based RAG for literature evidence extraction and validation | High | LLM API integration, prompt engineering, faithfulness checks, cost management |
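
The explainable scoring row above asks for a per-gene evidence breakdown. For a composite score that is a weighted mean of layer scores, an exact additive decomposition is available without any attribution library. The sketch below shows that simpler decomposition; it is not SHAP itself, and the layer names, scores, and weights are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def evidence_contributions(layer_scores: Dict[str, Optional[float]],
                           weights: Dict[str, float]) -> Dict[str, float]:
    """Split a weighted-mean score into one additive term per layer.

    Layers with no data (None) are skipped, and the remaining weights are
    renormalized, so the returned terms sum to the composite score.
    """
    used = {name: w for name, w in weights.items()
            if layer_scores.get(name) is not None}
    total_weight = sum(used.values())
    if total_weight == 0:
        return {}
    return {name: layer_scores[name] * w / total_weight
            for name, w in used.items()}


def ranked_breakdown(contributions: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order layers from largest to smallest contribution."""
    return sorted(contributions.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical normalized layer scores for one gene.
gene_scores = {"expression": 0.9, "localization": 0.4, "constraint": None}
weights = {"expression": 2.0, "localization": 1.0, "constraint": 1.0}

contributions = evidence_contributions(gene_scores, weights)
breakdown = ranked_breakdown(contributions)
composite = sum(contributions.values())
```

Because the terms sum to the composite score, the breakdown can be shown directly as a stacked bar or a per-gene table column.
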
## Anti-Features
Features to explicitly NOT build.
| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| Real-time web dashboard | Overkill for research tool; adds deployment complexity, security concerns | Static HTML reports + CLI; Jupyter notebooks for interactive exploration if needed |
| GUI for parameter tuning | Research pipelines need reproducible command-line execution; GUIs hinder automation | YAML config files + CLI; document parameter rationale in config comments |
| Variant-level analysis | Out of scope for gene-level discovery; conflates discovery with diagnostic prioritization | Focus on gene-level evidence aggregation; refer users to Exomiser/LIRICAL for variant work |
| Custom alignment/variant calling | Well-solved problem; reinventing the wheel wastes time | Use standard BAM/VCF inputs from established pipelines; focus on gene prioritization logic |
| Social features (sharing, comments) | Research tool, not collaboration platform | File-based outputs (shareable via Git/email); documentation in README |
| Real-time database sync | Bioinformatics data versions change slowly; real-time sync unnecessary | Versioned data snapshots with documented download dates; update quarterly or as needed |
| One-click install for all dependencies | Bioinformatics tools have complex dependencies; false promise | Conda environment.yml or Docker container; document setup steps clearly |
| Machine learning model training | Small positive control set insufficient for robust ML; rule-based more transparent | Weighted scoring with manually tuned/optimized weights; reserve ML for future with larger training data |
## Feature Dependencies
```
Parameter Configuration
    └──requires──> Quality Control Checks
                       └──requires──> Data Provenance Tracking

Multi-Evidence Scoring
    ├──requires──> API-Based Data Retrieval
    ├──requires──> Literature Evidence
    └──requires──> Structured Output Format

Explainable Scoring
    ├──requires──> Multi-Evidence Scoring
    └──enhances──> Interactive HTML Report

Known Gene Validation
    ├──requires──> Multi-Evidence Scoring
    └──enables──> Sensitivity Analysis

Tiered Output with Rationale
    ├──requires──> Explainable Scoring
    └──enhances──> Interactive HTML Report

Batch Processing
    ├──requires──> Parameter Configuration
    └──enables──> Sensitivity Analysis

Negative Control Validation
    └──requires──> Known Gene Validation (uses similar metrics)

Evidence Conflict Detection
    └──requires──> Multi-Evidence Scoring

Incremental Update Capability
    ├──requires──> Data Provenance Tracking
    └──requires──> Batch Processing

Cross-Species Homology Scoring
    └──requires──> API-Based Data Retrieval

Automated Literature Scanning with LLM
    ├──requires──> Literature Evidence
    └──conflicts──> Manual curation (choose one approach per evidence layer)
```
### Dependency Notes
- **Parameter Configuration → QC Checks → Data Provenance:** Foundation stack; parameters must be logged, QC applied, and tracked before any analysis
- **Multi-Evidence Scoring requires API-Based Data Retrieval:** Can't score without data; retrieval must be robust and cached
- **Explainable Scoring requires Multi-Evidence Scoring:** Can't explain scores that don't exist; explanations decompose composite scores
- **Known Gene Validation enables Sensitivity Analysis:** Positive controls provide ground truth for parameter tuning
- **Automated Literature Scanning conflicts with Manual Curation:** Choose one approach per evidence layer to avoid redundancy and conflicting evidence
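
The `requires` edges listed in this section can be turned into a build order with a topological sort. The sketch below uses the standard-library `graphlib` module (Python 3.9+). It encodes only the `requires` relations, and it reads the first entry as a chain (Parameter Configuration requires QC checks, which require provenance tracking), as the dependency notes do. The `enhances`, `enables`, and `conflicts` relations are left out because they do not constrain ordering.

```python
from graphlib import TopologicalSorter

# feature -> set of features it requires (these must be built first)
REQUIRES = {
    "Quality Control Checks": {"Data Provenance Tracking"},
    "Parameter Configuration": {"Quality Control Checks"},
    "Multi-Evidence Scoring": {
        "API-Based Data Retrieval",
        "Literature Evidence",
        "Structured Output Format",
    },
    "Explainable Scoring": {"Multi-Evidence Scoring"},
    "Known Gene Validation": {"Multi-Evidence Scoring"},
    "Tiered Output with Rationale": {"Explainable Scoring"},
    "Batch Processing": {"Parameter Configuration"},
    "Negative Control Validation": {"Known Gene Validation"},
    "Evidence Conflict Detection": {"Multi-Evidence Scoring"},
    "Incremental Update Capability": {
        "Data Provenance Tracking",
        "Batch Processing",
    },
    "Cross-Species Homology Scoring": {"API-Based Data Retrieval"},
    "Automated Literature Scanning with LLM": {"Literature Evidence"},
}

# static_order() yields every feature after all of its requirements;
# it raises graphlib.CycleError if the requirements are circular.
build_order = list(TopologicalSorter(REQUIRES).static_order())

for position, feature in enumerate(build_order, start=1):
    print(f"{position:2d}. {feature}")
```

Several orders are valid for the same graph, so the exact sequence printed is one of many. Read this way, Tiered Output with Rationale sorts after Explainable Scoring, so a tiering step that ships earlier would presumably need to rely on a simpler evidence summary.
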
## MVP Recommendation
### Launch With (v1)
Minimum viable pipeline for ciliopathy gene discovery.
- [x] Multi-evidence scoring (6 layers: annotation, expression, sequence, localization, constraint, phenotype)
- [x] API-based data retrieval with caching (gnomAD, GTEx, HPA, UniProt, Model Organism DBs)
- [x] Known gene validation (Usher syndrome genes, known ciliopathy genes as positive controls)
- [x] Reproducibility documentation (parameter logging, versions, timestamps)
- [x] Data provenance tracking (source file versions, processing steps, intermediate results)
- [x] Structured output format (CSV with ranked genes, evidence scores per column)
- [x] Quality control checks (missing data detection, outlier flagging, distribution checks)
- [x] Batch processing (parallel execution across 20K genes)
- [x] Parameter configuration (YAML config for weights, thresholds, data sources)
- [x] Tiered output with rationale (High/Medium/Low confidence tiers, evidence summary)
- [x] Basic visualization (score distributions, top candidate plots)
**Rationale:** These features enable credible, reproducible gene prioritization at scale. Without validation, explainability, and QC, results are untrustworthy. Without tiered output, generating hypotheses from 20K genes is overwhelming.
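
Known gene validation in the list above rests on recall at rank cutoffs. A minimal sketch of recall@k follows. The positive controls shown are established Usher syndrome genes, but the ranking and the `GENE_*` placeholders are invented for illustration.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_genes: Sequence[str], positives: Iterable[str], k: int) -> float:
    """Fraction of positive-control genes found in the top k of the ranking.

    Positives that are missing from the ranking entirely still count in the
    denominator, so they lower recall instead of being silently ignored.
    """
    positive_set = set(positives)
    if not positive_set:
        raise ValueError("positive control set is empty")
    if k < 0:
        raise ValueError("k must be non-negative")
    hits = positive_set.intersection(ranked_genes[:k])
    return len(hits) / len(positive_set)


# Toy ranking, best candidate first (hypothetical order).
ranking = ["MYO7A", "GENE_A", "USH2A", "GENE_B",
           "GENE_C", "CDH23", "GENE_D", "PCDH15"]
positive_controls = {"MYO7A", "USH2A", "CDH23", "PCDH15"}

recall_3 = recall_at_k(ranking, positive_controls, 3)
recall_6 = recall_at_k(ranking, positive_controls, 6)
recall_8 = recall_at_k(ranking, positive_controls, 8)
```

In a real run the cutoffs would be larger (for example the recall@10 and recall@50 mentioned under Table Stakes), and the positive controls should be held out from any weight tuning to avoid circular validation.
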
### Add After Validation (v1.x)
Features to add once core pipeline is validated against known ciliopathy genes.
- [ ] Interactive HTML report (browsable results with sortable tables, linked evidence) — **Trigger:** After v1 produces validated candidate list; when sharing results with collaborators
- [ ] Explainable scoring (per-gene evidence contribution breakdown, SHAP-style attribution) — **Trigger:** After identifying novel candidates; when reviewers ask "why is this gene ranked high?"
- [ ] Negative control validation (test against housekeeping genes to assess specificity) — **Trigger:** After positive control validation succeeds; to quantify false positive rate
- [ ] Evidence conflict detection (flag genes with contradictory evidence patterns) — **Trigger:** After observing unexpected high-ranking genes; to catch data quality issues
- [ ] Sensitivity analysis (systematic parameter sweep, rank stability metrics) — **Trigger:** When preparing for publication; to demonstrate robustness
- [ ] Cross-species homology scoring (zebrafish/mouse phenotype evidence integration) — **Trigger:** If animal model evidence proves valuable in initial analysis
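
The sensitivity analysis item above mentions a parameter sweep with rank stability metrics. One simple stability metric is the Jaccard overlap of the top-k genes between weight configurations, averaged over all pairs of configurations. The sketch below assumes a plain weighted-sum score; the genes, layer scores, and weight grid are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def rank_genes(scores: Dict[str, Dict[str, float]],
               weights: Dict[str, float]) -> List[str]:
    """Rank genes by weighted sum, best first; ties are broken by gene name."""
    totals = {gene: sum(weights[layer] * value for layer, value in layers.items())
              for gene, layers in scores.items()}
    return sorted(totals, key=lambda gene: (-totals[gene], gene))


def top_k_jaccard(rank_a: Sequence[str], rank_b: Sequence[str], k: int) -> float:
    """Jaccard overlap of the top-k sets of two rankings (1.0 = identical sets)."""
    top_a, top_b = set(rank_a[:k]), set(rank_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


def mean_pairwise_stability(rankings: Sequence[Sequence[str]], k: int) -> float:
    """Average top-k Jaccard overlap across all pairs of rankings."""
    pairs = list(combinations(rankings, 2))
    if not pairs:
        raise ValueError("need at least two rankings to compare")
    return sum(top_k_jaccard(a, b, k) for a, b in pairs) / len(pairs)


# Hypothetical normalized layer scores and a small weight grid.
scores = {
    "G1": {"expression": 0.9, "constraint": 0.2},
    "G2": {"expression": 0.5, "constraint": 0.5},
    "G3": {"expression": 0.1, "constraint": 0.8},
    "G4": {"expression": 0.2, "constraint": 0.1},
}
weight_grid = [
    {"expression": 1.0, "constraint": 1.0},
    {"expression": 2.0, "constraint": 1.0},
    {"expression": 1.0, "constraint": 2.0},
]

rankings = [rank_genes(scores, w) for w in weight_grid]
stability = mean_pairwise_stability(rankings, k=2)
```

A value near 1.0 indicates that the top of the list barely moves across configurations. A low value flags candidates whose rank depends on a particular weight choice and that deserve a closer look before follow-up.
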
### Future Consideration (v2+)
Features to defer until core pipeline is published and validated.
- [ ] Automated literature scanning with LLM (RAG-based PubMed evidence extraction) — **Why defer:** High complexity, cost, and uncertainty; manual curation sufficient for initial discovery
- [ ] Incremental update capability (re-run with new data without full recomputation) — **Why defer:** Overkill for one-time discovery project; valuable if pipeline becomes ongoing surveillance tool
- [ ] Multi-modal evidence weighting optimization (Bayesian integration, cross-validation) — **Why defer:** Requires larger training set of known genes; manual weight tuning sufficient initially
- [ ] Systematic under-annotation bias handling — **Why defer:** Novel research question; defer until after initial discovery validates approach
- [ ] Cilia-specific knowledgebase integration (CilioGenics, CiliaMiner) — **Why defer:** Nice-to-have; primary evidence layers sufficient for initial analysis; add if reviewers request
## Feature Prioritization Matrix
| Feature | User Value | Implementation Cost | Priority |
|---------|------------|---------------------|----------|
| Multi-evidence scoring | HIGH | MEDIUM | P1 |
| Known gene validation | HIGH | LOW-MEDIUM | P1 |
| Reproducibility documentation | HIGH | LOW-MEDIUM | P1 |
| Data provenance tracking | HIGH | MEDIUM | P1 |
| Structured output format | HIGH | LOW | P1 |
| Quality control checks | HIGH | LOW-MEDIUM | P1 |
| API-based data retrieval | HIGH | MEDIUM | P1 |
| Batch processing | HIGH | LOW-MEDIUM | P1 |
| Parameter configuration | HIGH | LOW | P1 |
| Tiered output with rationale | HIGH | LOW-MEDIUM | P1 |
| Basic visualization | HIGH | MEDIUM | P1 |
| Explainable scoring | HIGH | HIGH | P2 |
| Interactive HTML report | MEDIUM | MEDIUM | P2 |
| Sensitivity analysis | MEDIUM | MEDIUM-HIGH | P2 |
| Evidence conflict detection | MEDIUM | MEDIUM | P2 |
| Negative control validation | MEDIUM | LOW-MEDIUM | P2 |
| Cross-species homology | MEDIUM | MEDIUM | P2 |
| Incremental updates | LOW | MEDIUM-HIGH | P3 |
| LLM literature scanning | LOW | HIGH | P3 |
| Multi-modal weight optimization | MEDIUM | MEDIUM | P3 |
| Under-annotation bias correction | MEDIUM | HIGH | P3 |
| Cilia knowledgebase integration | LOW-MEDIUM | MEDIUM | P3 |
**Priority key:**
- P1: Must have for credible v1 pipeline
- P2: Should have; add after v1 validation
- P3: Nice to have; future consideration
## Competitor Feature Analysis
| Feature | Exomiser | LIRICAL | CilioGenics Tool | Usher Pipeline Approach |
|---------|----------|---------|-------------------|------------------------|
| Multi-evidence scoring | Yes (variant+pheno combined score) | Yes (LR-based) | Yes (ML predictions) | Yes (6-layer weighted scoring) |
| Phenotype integration | HPO-based (strong) | HPO-based (strong) | Not primary | HPO-compatible but not required (sensory cilia focus) |
| Known gene validation | Benchmarked on Mendelian diseases | Benchmarked on Mendelian diseases | Validated on known cilia genes | Validate on Usher + known ciliopathy genes |
| Explainable scoring | Limited | Post-test probability (interpretable) | ML black box | SHAP-style per-gene evidence breakdown (planned P2) |
| Variant-level analysis | Primary focus | Primary focus | No (gene-level only) | No (out of scope; gene-level discovery) |
| Literature evidence | Automated (limited) | Limited | Text mining used in training | Manual/automated (planned P3) |
| Tiered output | Yes (rank-ordered) | Yes (post-test prob tiers) | Yes (confidence scores) | Yes (High/Medium/Low tiers + rationale) |
| Under-annotation bias | Not addressed | Not addressed | Not addressed | Explicitly addressed (novel) |
| Domain-specific focus | Mendelian disease diagnosis | Mendelian disease diagnosis | Cilia biology | Usher syndrome / ciliopathy discovery |
| Reproducibility | Config files, versions logged | Config files, versions logged | Not emphasized | Extensive provenance tracking |
**Key Differentiators for Usher Pipeline:**
1. **Discovery vs. Diagnosis:** Exomiser/LIRICAL prioritize variants in patient genomes (diagnosis); Usher pipeline screens all genes for under-studied candidates (discovery)
2. **Under-annotation bias handling:** Explicitly score annotation completeness and correct the ranking bias against under-studied genes
3. **Explainable scoring for hypothesis generation:** Per-gene evidence breakdown to guide experimental follow-up, not just rank genes
4. **Sensory cilia focus:** Retina/cochlea expression, ciliopathy phenotypes, subcellular localization evidence tailored to Usher biology
## Sources
### Tool Features and Benchmarking
- [Survey and improvement strategies for gene prioritization with LLMs](https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf148/8172498) (2025)
- [Clinical and Cross-Domain Validation of LLM-Guided Gene Prioritization](https://www.biorxiv.org/content/10.64898/2026.01.22.701191v1) (2026)
- [Automating candidate gene prioritization with LLMs](https://academic.oup.com/bioinformatics/article/41/10/btaf541/8280402) (2025)
- [Explicable prioritization with rule-based and ML algorithms](https://pmc.ncbi.nlm.nih.gov/articles/PMC10956189/) (2024)
- [Phenotype-driven approaches to enhance variant prioritization](https://pmc.ncbi.nlm.nih.gov/articles/PMC9288531/) (2022)
- [Evaluation of phenotype-driven gene prioritization methods](https://pmc.ncbi.nlm.nih.gov/articles/PMC9487604/) (2022)
- [Add LIRICAL and Exomiser scores to seqr (GitHub issue)](https://github.com/broadinstitute/seqr/issues/2742) (2021)
- [Large-scale benchmark of gene prioritization methods](https://www.nature.com/articles/srep46598) (2017)
### Reproducibility and Standards
- [Rare disease gene discovery in 100K Genomes Project](https://www.nature.com/articles/s41586-025-08623-w) (2025)
- [Standards for validating NGS bioinformatics pipelines (AMP/CAP)](https://www.sciencedirect.com/science/article/pii/S1525157817303732) (2018)
- [Genomics pipelines and data integration challenges](https://pmc.ncbi.nlm.nih.gov/articles/PMC5580401/) (2017)
- [Experiences with workflows for bioinformatics](https://link.springer.com/article/10.1186/s13062-015-0071-8) (2015)
### Parameter Tuning and Sensitivity Analysis
- [Algorithm sensitivity analysis for tissue segmentation pipelines](https://academic.oup.com/bioinformatics/article/33/7/1064/2843894) (2017)
- [doepipeline: systematic approach to optimizing workflows](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3091-z) (2019)
### Multi-Evidence Integration
- [Multi-dimensional evidence-based candidate gene prioritization](https://pmc.ncbi.nlm.nih.gov/articles/PMC2752609/) (2009)
- [Extensive analysis of disease-gene associations with network integration](https://pmc.ncbi.nlm.nih.gov/articles/PMC4070077/) (2014)
- [Survey of gene prioritization tools for Mendelian diseases](https://pmc.ncbi.nlm.nih.gov/articles/PMC7074139/) (2020)
### Explainability and Interpretability
- [Explainable deep learning for cancer target prioritization](https://arxiv.org/html/2511.12463) (2025)
- [Interpretable machine learning for genomics](https://pmc.ncbi.nlm.nih.gov/articles/PMC8527313/) (2021)
- [SeqOne's DiagAI Score with xAI](https://www.seqone.com/news-insights/seqone-diagai-explainable-ai) (recent)
- [Spectrum of explainable and interpretable ML for genomics](https://wires.onlinelibrary.wiley.com/doi/10.1002/wics.1617) (2023)
### Cilia Biology and Ciliopathy Tools
- [Prioritization tool for cilia-associated genes](https://pmc.ncbi.nlm.nih.gov/articles/PMC11512102/) (2024)
- [CilioGenics: integrated method for predicting ciliary genes](https://academic.oup.com/nar/article/52/14/8127/7710917) (2024)
- [CiliaMiner: integrated database for ciliopathy genes](https://pmc.ncbi.nlm.nih.gov/articles/PMC10403755/) (2023)
- [Systems-biology approach to ciliopathy disorders](https://genomemedicine.biomedcentral.com/articles/10.1186/gm275) (2011)
### Visualization and Reporting
- [Computational pipeline for functional gene discovery](https://www.nature.com/articles/s41598-021-03041-0) (2021)
- [JWES: pipeline for gene-variant discovery and annotation](https://pmc.ncbi.nlm.nih.gov/articles/PMC8409305/) (2021)
### Data Quality and Provenance
- [Data quality assurance in bioinformatics](https://ranchobiosciences.com/2025/04/bioinformatics-and-quality-assurance-data/) (2025)
- [Bioinformatics pipeline for data quality control](https://www.meegle.com/en_us/topics/bioinformatics-pipeline/bioinformatics-pipeline-for-data-quality-control) (recent)
- [Bionitio: demonstrating best practices for bioinformatics CLI](https://academic.oup.com/gigascience/article/8/9/giz109/5572530) (2019)
---
*Feature research for: Bioinformatics Gene Candidate Discovery and Prioritization (Rare Disease / Ciliopathy)*
*Researched: 2026-02-11*
*Confidence: MEDIUM — based on WebSearch findings verified across multiple sources; Context7 not available for specialized bioinformatics tools; recommendations synthesized from peer-reviewed publications and established tool documentation*