docs: complete project research

# Project Research Summary
**Project:** Bioinformatics Cilia/Usher Gene Discovery Pipeline
**Domain:** Gene Candidate Discovery and Prioritization for Rare Disease / Ciliopathy Research
**Researched:** 2026-02-11
**Confidence:** MEDIUM-HIGH
## Executive Summary
This project requires a multi-evidence bioinformatics pipeline to prioritize under-studied gene candidates for ciliopathy and Usher syndrome research. The recommended approach is a **staged, modular Python pipeline** using independent evidence retrieval layers (annotation, expression, protein features, localization, genetic constraint, phenotypes) that feed into weighted scoring and tiered output generation. Python 3.12+ with Polars for data processing, DuckDB for intermediate storage, and Typer for CLI orchestration forms a modern bioinformatics stack well suited to ~20K-gene analysis on local workstations.
The critical architectural pattern is **independent evidence layers with staged persistence**: each layer retrieves, normalizes, and caches data separately before a final integration step combines scores. This enables restartability, parallel execution, and modular testing—essential for pipelines with external API dependencies and long runtimes. Configuration-driven behavior with YAML configs and Pydantic validation ensures reproducibility and enables parameter tuning without code changes.
The dominant risks are **gene ID mapping inconsistencies** (causing silent data loss across databases) and **literature bias amplification** (over-prioritizing well-studied genes at the expense of novel candidates). Both require explicit mitigation: standardized Ensembl gene IDs throughout, mapping validation gates, and heavily weighted publication-independent scoring layers. Success depends on validation against positive controls (known Usher/cilia genes) and negative controls (non-cilia housekeeping genes) to prove the scoring system works before running at scale.
## Key Findings
### Recommended Stack
Python 3.12+ provides the best balance of library support and stability. **Polars** (1.38+) outperforms Pandas-based tooling for genomic operations (6-38x in polars-bio benchmarks against bioframe) with better memory efficiency for 20K-gene datasets. **DuckDB** enables out-of-core analytical queries on Parquet-cached intermediate results, solving the memory-pressure problem when integrating 6 evidence layers. **Typer** offers modern, type-hint-based CLI design with auto-generated help, cleaner than argparse for modular pipeline scripts.
**Core technologies:**
- **Python 3.12**: Industry standard for bioinformatics with extensive ecosystem (Biopython, scanpy, gget)
- **Polars 1.38+**: DataFrame processing 6-38x faster than the Pandas-based bioframe in genomics benchmarks, native streaming for large datasets
- **DuckDB**: Columnar analytics database for intermediate storage; 10-100x faster than SQLite for aggregations, queries Parquet files without full import (sketched after this list)
- **Typer 0.21+**: CLI framework with type-hint-based interface, auto-generates help documentation
- **gget 0.30+**: Unified API for multi-source gene annotation (Ensembl, UniProt, NCBI) — primary data access layer
- **Pydantic 2.12+**: Type-safe configuration and data validation; 1.5-1.75x faster than v1, prevents config errors
- **structlog**: Structured JSON logging with correlation IDs to trace genes through pipeline
- **diskcache**: Persistent API response caching across reruns, critical for avoiding API rate limits
- **uv**: Rust-based package manager 10-100x faster than pip, replaces pip/pip-tools/poetry
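A minimal sketch of how these pieces are meant to fit together — DuckDB querying a hypothetical Parquet cache in place and handing results to Polars; the file paths and column names are illustrative, not fixed schema:

```python
import duckdb
import polars as pl

# Hypothetical cache layout: each evidence layer writes normalized results as Parquet.
con = duckdb.connect("pipeline.duckdb")

# DuckDB reads Parquet in place -- no full import step.
constraint = con.execute(
    """
    SELECT gene_id, pli, loeuf
    FROM read_parquet('cache/constraint/*.parquet')
    WHERE loeuf IS NOT NULL
    """
).pl()  # hand off to Polars as a DataFrame

# Downstream transforms stay in Polars (threshold is illustrative).
flagged = constraint.with_columns(
    (pl.col("loeuf") < 0.35).alias("highly_constrained")
)
```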
**Avoid:**
- Pandas alone (too slow, memory-inefficient)
- Python 3.9 or older (missing library support)
- Workflow orchestrators (Snakemake/Nextflow) for the initial local-workstation pipeline (overkill; adds complexity without benefit)
- Poetry for new projects (slower than uv, less standard)
### Expected Features
**Must have (table stakes):**
- **Multi-evidence scoring**: 6 layers (annotation, expression, protein features, localization, constraint, phenotypes) with weighted integration — standard in modern gene prioritization
- **Reproducibility documentation**: Parameter logging, version tracking, seed control — required for publication
- **Data provenance tracking**: W3C PROV standard; track all transformations, source versions, timestamps
- **Known gene validation**: Benchmark against established ciliopathy genes (CiliaCarta, SYSCILIA, Usher genes) with recall metrics
- **Quality control checks**: Missing data detection, outlier identification, distribution checks
- **API-based data retrieval**: Automated queries to gnomAD, GTEx, HPA, UniProt with rate limiting, caching, retry logic (see the sketch after this list)
- **Batch processing**: Parallel execution across 20K genes with progress tracking and resume-from-checkpoint
- **Parameter configuration**: YAML config for weights, thresholds, data sources
- **Tiered output with rationale**: High/Medium/Low confidence tiers with evidence summaries per tier
- **Basic visualization**: Score distributions, rank plots, evidence contribution charts
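A minimal sketch of the retrieval pattern above — persistent caching via diskcache plus exponential backoff on rate limits. The URL, parameters, and helper name are illustrative, not any specific database's API:

```python
import time
import requests
from diskcache import Cache

cache = Cache("cache/api")  # persists across pipeline reruns

def fetch_with_retry(url: str, params: dict, max_retries: int = 5) -> dict:
    """GET with persistent caching and exponential backoff (illustrative endpoint)."""
    key = (url, tuple(sorted(params.items())))
    if key in cache:
        return cache[key]  # cache hit: no network call, no rate-limit pressure
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code == 429:  # rate-limited: back off and retry
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        cache[key] = resp.json()
        return cache[key]
    raise RuntimeError(f"Exhausted retries for {url}")
```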
**Should have (competitive differentiators):**
- **Explainable scoring**: SHAP-style per-gene evidence breakdown showing WHY genes rank high — critical for discovery vs. diagnosis (see the sketch after this list)
- **Systematic under-annotation bias handling**: Correct publication bias by downweighting literature-heavy features for under-studied candidates — novel research advantage
- **Sensitivity analysis**: Systematic parameter sweep with rank stability metrics to demonstrate robustness
- **Evidence conflict detection**: Flag genes with contradictory evidence patterns (e.g., high expression but low constraint)
- **Interactive HTML report**: Browsable results with sortable tables, linked evidence sources
- **Cross-species homology scoring**: Zebrafish/mouse phenotype evidence integration via ortholog mapping
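For a linear weighted-sum score, SHAP-style breakdowns are exact and cheap: each layer's contribution is its weight times its normalized score (the Shapley value of a linear model against a zero baseline), so no ML explainer is required. A minimal sketch with hypothetical layer names and weights:

```python
# Hypothetical per-layer weights; real values come from the YAML config.
WEIGHTS = {"expression": 0.20, "constraint": 0.15, "localization": 0.20,
           "annotation": 0.15, "protein": 0.15, "phenotype": 0.15}

def explain_score(normalized: dict[str, float]) -> list[tuple[str, float]]:
    """Per-layer contribution to a weighted-sum score, largest first."""
    contributions = [(layer, WEIGHTS[layer] * value)
                     for layer, value in normalized.items()]
    return sorted(contributions, key=lambda kv: kv[1], reverse=True)

# e.g. explain_score({"expression": 0.9, "constraint": 0.4, "localization": 0.7,
#                     "annotation": 0.2, "protein": 0.5, "phenotype": 0.1})
```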
**Defer (v2+):**
- **Automated literature scanning with LLM**: RAG-based PubMed evidence extraction — high complexity, cost, uncertainty
- **Incremental update capability**: Re-run with new data without full recomputation — overkill for one-time discovery
- **Multi-modal evidence weighting optimization**: Bayesian integration, cross-validation — requires larger training set
- **Cilia-specific knowledgebase integration**: CilioGenics, CiliaMiner — nice-to-have, primary layers sufficient initially
### Architecture Approach
Multi-evidence gene prioritization pipelines follow a **layered architecture** with independent data retrieval modules feeding normalization, scoring, and validation layers. The standard pattern is a **staged pipeline** where each evidence layer operates independently, writes intermediate results to disk (Parquet cache + DuckDB tables), then a final integration component performs SQL joins and weighted scoring.
**Major components:**
1. **CLI Orchestration Layer** (Typer) — Entry point with subcommands (run-layer, integrate, report), global config management, pipeline orchestration (skeleton sketched after this list)
2. **Data Retrieval Layer** (6 modules) — Independent retrievers for annotation, expression, protein features, localization, constraint, phenotypes; API clients with caching and retry logic
3. **Normalization/Transform Layer** — Per-layer parsers converting raw formats to standardized schemas; gene ID mapping to Ensembl; score normalization to 0-1 scale
4. **Data Storage Layer** — Raw cache (Parquet), intermediate results (DuckDB), final output (TSV/Parquet); enables restartability and out-of-core analytics
5. **Integration/Scoring Layer** — Multi-evidence scoring via SQL joins in DuckDB; weighted aggregation; known gene filtering; confidence tier assignment
6. **Reporting Layer** — Per-gene evidence summaries; provenance metadata generation; tiered candidate lists
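A minimal Typer skeleton for component 1, assuming the subcommand names above; the options and command bodies are placeholders, not a final interface:

```python
from pathlib import Path

import typer

app = typer.Typer(help="Cilia/Usher gene discovery pipeline")

@app.command("run-layer")
def run_layer(layer: str, config: Path = Path("config.yaml"),
              force_refresh: bool = False) -> None:
    """Run one evidence layer end-to-end (retrieve -> normalize -> store)."""
    typer.echo(f"Running layer {layer!r} with config {config}")

@app.command()
def integrate(config: Path = Path("config.yaml")) -> None:
    """Join all evidence layers in DuckDB and compute weighted scores."""
    ...

@app.command()
def report(out_dir: Path = Path("results")) -> None:
    """Emit tiered candidate lists and per-gene evidence summaries."""
    ...

if __name__ == "__main__":
    app()
```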
**Key architectural patterns:**
- **Independent Evidence Layers**: No cross-layer dependencies; enables parallel execution and isolated testing
- **Staged Data Persistence**: Each stage writes to disk before proceeding; enables restart-from-checkpoint and debugging
- **Configuration-Driven Behavior**: YAML configs for weights/thresholds; code reads config, never hardcodes parameters (sketched below)
- **Provenance Tracking**: Every output includes metadata (pipeline version, data source versions, timestamps, config hash)
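A minimal sketch of the configuration-driven pattern, assuming a hypothetical config schema; the weight-sum check illustrates the kind of validation Pydantic provides:

```python
import yaml
from pydantic import BaseModel, field_validator

class ScoringConfig(BaseModel):
    weights: dict[str, float]      # per-layer weights, e.g. {"expression": 0.2, ...}
    loeuf_threshold: float = 0.35  # hypothetical constraint cutoff

    @field_validator("weights")
    @classmethod
    def weights_sum_to_one(cls, v: dict[str, float]) -> dict[str, float]:
        # Fail at load time rather than producing silently mis-scaled scores.
        if abs(sum(v.values()) - 1.0) > 1e-6:
            raise ValueError("layer weights must sum to 1.0")
        return v

def load_config(path: str) -> ScoringConfig:
    with open(path) as fh:
        return ScoringConfig.model_validate(yaml.safe_load(fh))
```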
### Critical Pitfalls
1. **Gene ID Mapping Inconsistency Cascade** — Over 51% of Ensembl IDs can fail symbol conversion; 8% discordance between Swiss-Prot and Ensembl. One-to-many mappings break naive merges, causing silent data loss. **Avoid by:** Using Ensembl gene IDs as primary keys throughout; version-locking annotation builds; implementing mapping validation gates that report % successfully mapped (sketched after this list); manually reviewing unmapped high-priority genes.
2. **Literature Bias Amplification in Weighted Scoring** — Well-studied genes dominate scores because GO annotations, pathway coverage, interaction networks, and PubMed mentions all correlate with research attention rather than biological relevance. Systematically deprioritizes novel candidates. **Avoid by:** Decoupling publication metrics from functional evidence; normalizing scores by baseline publication count; heavily weighting sequence-based features independent of literature; requiring evidence diversity across multiple layers; validating against under-studied positive controls.
3. **Missing Data Handled as "Negative Evidence"** — Genes lacking data in a layer (no GTEx expression, no IMPC phenotype) are treated as "low score" rather than "unknown," systematically penalizing under-measured genes. **Avoid by:** Explicit three-state encoding (present/absent/unknown); layer-specific score normalization that doesn't penalize missing data; imputation using ortholog evidence with confidence weighting; coverage-aware constraint metrics (only use gnomAD when coverage >30x).
4. **Batch Effects Misinterpreted as Biological Signal in scRNA-seq** — Integration of multi-atlas scRNA-seq data (retina, inner ear, nasal epithelium) can erase true biological variation or create false cell-type signals. Only 27% of integration outputs perform better than unintegrated data. **Avoid by:** Validating integration quality using known marker genes; comparing multiple methods (Harmony, Seurat v5, scVI); using positive controls (known cilia genes should show expected cell-type enrichment); stratifying by sequencing technology before cross-platform integration.
5. **Reproducibility Theater Without Computational Environment Control** — Scripts are version-controlled but results drift due to uncontrolled dependencies (package versions, database snapshots, API response changes, random seeds). **Avoid by:** Pinning all dependencies with exact versions; containerizing environment (Docker/Singularity); snapshotting external databases with version tags; checksumming all downloaded data; setting random seeds globally; logging provenance metadata in all outputs.
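As a concrete example of the validation gate from pitfall 1, a minimal Polars sketch; the column names, QC output path, and 95% threshold are illustrative:

```python
import polars as pl

def mapping_gate(mapped: pl.DataFrame, min_rate: float = 0.95) -> pl.DataFrame:
    """Report mapping coverage and fail loudly instead of silently dropping genes."""
    total = mapped.height
    unmapped = mapped.filter(pl.col("ensembl_id").is_null())
    rate = 1 - unmapped.height / total
    print(f"mapped {rate:.1%} of {total} input genes")
    if rate < min_rate:
        # Dump unmapped symbols for manual review of high-priority genes.
        unmapped.write_csv("qc/unmapped_genes.csv")
        raise ValueError(f"mapping rate {rate:.1%} below gate ({min_rate:.0%})")
    return mapped.drop_nulls("ensembl_id")
```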
## Implications for Roadmap
Based on research, the suggested phase structure follows **dependency-driven ordering**: infrastructure first, a single-layer prototype to validate the architecture, parallel evidence layer development, integration/scoring, then reporting/polish.
### Phase 1: Data Infrastructure & Configuration
**Rationale:** All downstream components depend on config system, gene ID mapping utilities, and data storage patterns. Critical to establish before any evidence retrieval.
**Delivers:** Project skeleton, YAML config loading with Pydantic validation, DuckDB schema setup, gene ID mapping utility, API caching infrastructure.
**Addresses:**
- Reproducibility documentation (table stakes)
- Parameter configuration (table stakes)
- Data provenance tracking (table stakes)
**Avoids:**
- Gene ID mapping inconsistency (Pitfall #1) via standardized Ensembl IDs and validation gates
- Reproducibility failure (Pitfall #8) via containerization and version pinning
**Research flag:** SKIP deeper research — standard Python packaging patterns, well-documented.
### Phase 2: Single Evidence Layer Prototype
**Rationale:** Validate the retrieval → normalization → storage pattern before scaling to 6 layers. Identifies architectural issues early with low cost.
**Delivers:** One complete evidence layer (recommended start: genetic constraint, since gnomAD pLI/LOEUF is the simplest API), exercising the end-to-end flow from API to DuckDB storage.
**Addresses:**
- API-based data retrieval (table stakes)
- Quality control checks (table stakes)
**Avoids:**
- Missing data as negative evidence (Pitfall #3) by designing explicit unknown-state handling from the start (sketched below)
- Constraint metrics misinterpreted (Pitfall #6) via coverage-aware filtering
**Research flag:** MAY NEED `/gsd:research-phase` — gnomAD API usage patterns and coverage thresholds need validation.
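A minimal sketch of the explicit three-state encoding referenced under "Avoids" above; the enum values and the coverage column are illustrative:

```python
from enum import Enum

import polars as pl

class Evidence(Enum):
    PRESENT = "present"   # measured, supportive
    ABSENT = "absent"     # measured, not supportive
    UNKNOWN = "unknown"   # not measured -- must not count as negative

def layer_status(df: pl.DataFrame, score_col: str, covered_col: str) -> pl.DataFrame:
    """Tag each gene's evidence state for one layer; UNKNOWN genes are later
    excluded from the layer's weight at integration time, not scored as 0."""
    return df.with_columns(
        pl.when(~pl.col(covered_col)).then(pl.lit(Evidence.UNKNOWN.value))
          .when(pl.col(score_col) > 0).then(pl.lit(Evidence.PRESENT.value))
          .otherwise(pl.lit(Evidence.ABSENT.value))
          .alias("evidence_state")
    )
```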
### Phase 3: Remaining Evidence Layers (Parallel Work)
**Rationale:** Layers are independent by design and can be built in parallel. There are no inter-dependencies between the annotation, expression, protein features, localization, and phenotype modules.
**Delivers:** 5 additional evidence layers replicating the Phase 2 pattern (annotation, expression, protein features, localization, phenotypes + literature scan).
**Addresses:**
- Multi-evidence scoring foundation (table stakes) — requires all 6 layers operational
**Avoids:**
- Ortholog function over-assumed (Pitfall #5) via confidence scoring and phenotype relevance filtering
- API rate limits (Performance trap) via batch queries, exponential backoff, API keys
**Research flag:** NEEDS `/gsd:research-phase` for specialized APIs — CellxGene, IMPC, MGI/ZFIN integration patterns are niche and poorly documented.
### Phase 4: Integration & Scoring System
**Rationale:** Requires all evidence layers complete. Core scientific logic. Most complex component requiring validation against positive/negative controls.
**Delivers:** Multi-evidence scoring via SQL joins in DuckDB (sketched below), weighted aggregation, known gene exclusion filter (CiliaCarta/SYSCILIA/OMIM), confidence tier assignment (high/medium/low).
**Addresses:**
- Multi-evidence scoring (table stakes) — weighted integration of all layers
- Known gene validation (table stakes) — benchmarking against established ciliopathy genes
- Tiered output with rationale (table stakes) — confidence classification
**Avoids:**
- Literature bias amplification (Pitfall #2) via publication-independent scoring layers and diversity requirements
- Missing data as negative evidence (Pitfall #3) via layer-aware scoring that preserves genes with partial data
**Research flag:** MAY NEED `/gsd:research-phase` — weight optimization strategies and validation metrics need domain-specific tuning.
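A minimal sketch of the integration query, showing three of the six layers for brevity. A gene's score is renormalized over the layers it actually has data for, so a missing layer stays "unknown" rather than dragging the score to zero; table names, column names, and weights are hypothetical:

```python
import duckdb

con = duckdb.connect("pipeline.duckdb")

scored = con.execute(
    """
    WITH joined AS (
        SELECT g.gene_id,
               e.score AS expr, c.score AS constr, l.score AS loc
        FROM genes g
        LEFT JOIN expression_layer   e USING (gene_id)
        LEFT JOIN constraint_layer   c USING (gene_id)
        LEFT JOIN localization_layer l USING (gene_id)
    )
    SELECT gene_id,
           -- numerator: weighted sum over present layers;
           -- denominator: total weight of present layers (NULL-safe)
           (0.4 * COALESCE(expr, 0) + 0.3 * COALESCE(constr, 0) + 0.3 * COALESCE(loc, 0))
           / NULLIF(0.4 * (expr IS NOT NULL)::INT
                  + 0.3 * (constr IS NOT NULL)::INT
                  + 0.3 * (loc IS NOT NULL)::INT, 0) AS weighted_score
    FROM joined
    ORDER BY weighted_score DESC NULLS LAST
    """
).pl()
```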
### Phase 5: Reporting & Provenance
**Rationale:** Depends on integration layer producing scored results. Presentation layer, less critical for initial validation.
**Delivers:** Per-gene evidence summaries, provenance metadata generation (versions, timestamps, config hash — sketched below), tiered candidate lists (TSV + Parquet), basic visualizations (score distributions, top candidates).
**Addresses:**
- Structured output format (table stakes)
- Basic visualization (table stakes)
- Data provenance tracking (completion)
**Research flag:** SKIP deeper research — standard output formatting and metadata patterns.
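A minimal sketch of the provenance block attached to each output; the fields mirror the list above and the values are illustrative:

```python
import hashlib
import platform
import subprocess
from datetime import datetime, timezone

def provenance(config_path: str, source_versions: dict[str, str]) -> dict:
    """Metadata embedded alongside every output file (illustrative fields)."""
    config_hash = hashlib.sha256(open(config_path, "rb").read()).hexdigest()
    return {
        "pipeline_version": subprocess.check_output(
            ["git", "describe", "--always", "--dirty"], text=True).strip(),
        "config_hash": config_hash,
        "source_versions": source_versions,  # e.g. {"gnomad": "v4.1", ...}
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
    }

# e.g. json.dump(provenance("config.yaml", {...}), open("results/provenance.json", "w"))
```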
### Phase 6: CLI Orchestration & End-to-End Testing
**Rationale:** Integrates all components. User-facing interface. Validate full pipeline on test gene sets before production run.
**Delivers:** Unified Typer CLI app with subcommands (run-layer, integrate, report), dependency checking, progress logging with Rich, force-refresh flags, partial reruns.
**Addresses:**
- Batch processing (table stakes) — end-to-end orchestration with progress tracking
**Research flag:** SKIP deeper research — Typer CLI patterns well-documented.
### Phase 7: Validation & Weight Tuning
**Rationale:** Once the pipeline runs end-to-end, systematically validate against positive controls (known Usher/cilia genes) and negative controls (housekeeping genes). Iterate on scoring weights.
**Delivers:** Validation metrics (recall@10, recall@50 for positive controls; precision for negative controls — see the sketch below), sensitivity analysis across parameter sweeps, finalized scoring weights.
**Addresses:**
- Known gene validation (completion) — quantitative benchmarking
- Sensitivity analysis (differentiator) — robustness demonstration
**Research flag:** MAY NEED `/gsd:research-phase` — validation metrics for gene prioritization pipelines need domain research.
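A minimal sketch of the recall@k metric named above, over a ranked candidate list and a curated positive-control set:

```python
def recall_at_k(ranked_genes: list[str], positives: set[str], k: int) -> float:
    """Fraction of positive-control genes recovered in the top k of the ranking."""
    hits = sum(1 for g in ranked_genes[:k] if g in positives)
    return hits / len(positives)

# e.g. recall_at_k(ranking, usher_controls, k=50) for recall@50
```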
### Phase 8 (v1.x): Explainability & Advanced Reporting
**Rationale:** After v1 produces a validated candidate list. Adds interpretability for hypothesis generation and publication.
**Delivers:** SHAP-style per-gene evidence breakdown, interactive HTML report with sortable tables, evidence conflict detection, negative control validation.
**Addresses:**
- Explainable scoring (differentiator)
- Interactive HTML report (differentiator)
- Evidence conflict detection (differentiator)
**Research flag:** NEEDS `/gsd:research-phase` — SHAP for non-ML scoring systems and HTML report generation for bioinformatics need investigation.
### Phase Ordering Rationale
- **Sequential dependencies:** Phase 1 → Phase 2 → Phase 4 (infrastructure → prototype → integration) form the critical path; each depends on prior completion.
- **Parallelizable work:** Phase 3's five remaining evidence layers can be built concurrently by different developers or sequentially with rapid iteration.
- **Defer polish until validation:** Phases 5-6 (reporting, CLI) can be minimal initially; focus on scientific correctness (Phase 4) first.
- **Validate before extending:** Phase 7 (validation) must succeed before adding Phase 8 (advanced features); prevents building on broken foundation.
**Architecture-driven grouping:** Phases align with architectural layers (CLI orchestration, data retrieval, normalization, storage, integration, reporting) rather than arbitrary feature sets. Enables modular testing and isolated debugging.
**Pitfall avoidance:** Early phases establish mitigation strategies:
- Phase 1 prevents gene ID mapping issues (Pitfall #1) and reproducibility failures (Pitfall #8)
- Phase 2 prototype validates missing data handling (Pitfall #3)
- Phase 4 integration layer addresses literature bias (Pitfall #2)
- Validation throughout prevents silent failures from propagating
### Research Flags
Phases likely needing deeper research during planning:
- **Phase 2 (Constraint Metrics):** gnomAD API usage patterns, coverage thresholds, transcript selection — MEDIUM priority research
- **Phase 3 (Specialized APIs):** CellxGene census API, IMPC phenotype API, MGI/ZFIN bulk download workflows — HIGH priority research (niche, sparse documentation)
- **Phase 4 (Weight Optimization):** Validation metrics for gene prioritization, weight tuning strategies, positive control benchmark datasets — MEDIUM priority research
- **Phase 7 (Validation Metrics):** Recall/precision thresholds for rare disease discovery, statistical significance testing for constraint enrichment — LOW priority research (standard metrics available)
- **Phase 8 (SHAP for Rule-Based Systems):** Explainability methods for non-ML weighted scoring, HTML report generation best practices — MEDIUM priority research
Phases with standard patterns (skip research-phase):
- **Phase 1 (Infrastructure):** Python packaging, YAML config loading, DuckDB schema design — well-documented, no novelty
- **Phase 5 (Reporting):** Output formatting, CSV/Parquet generation, basic matplotlib/seaborn visualizations — standard patterns
- **Phase 6 (CLI):** Typer CLI design, Rich progress bars, logging configuration — mature libraries with excellent docs
## Confidence Assessment
| Area | Confidence | Notes |
|------|------------|-------|
| Stack | HIGH | All core technologies verified via official PyPI pages, Context7 library coverage. Version compatibility matrix validated. Alternative comparisons (Polars vs Pandas, uv vs Poetry) based on benchmarks and community adoption trends. |
| Features | MEDIUM | Feature expectations derived from 20+ peer-reviewed publications on gene prioritization tools (Exomiser, LIRICAL, CilioGenics) and bioinformatics best practices. No Context7 coverage for specialized domain tools; relied on WebSearch + academic literature. MVP recommendations synthesized from multiple sources. |
| Architecture | MEDIUM-HIGH | Architectural patterns validated across multiple bioinformatics pipeline implementations (PHA4GE best practices, nf-core patterns, academic publications). DuckDB vs SQLite comparison based on performance benchmarks. Staged pipeline pattern is industry standard. Confidence reduced slightly due to limited Context7 coverage for workflow orchestrators. |
| Pitfalls | HIGH | 40+ authoritative sources covering gene ID mapping failures (Bioconductor, UniProt docs), literature bias (Nature Scientific Reports), scRNA-seq batch effects (Nature Methods benchmarks), reproducibility failures (PLOS Computational Biology). Pitfall scenarios validated across multiple independent publications. Warning signs derived from "Common Bioinformatics Mistakes" community resources. |
**Overall confidence:** MEDIUM-HIGH
Research is well-grounded in established practices and peer-reviewed literature. Stack recommendations are conservative (mature technologies, not bleeding-edge). Architecture patterns are proven in production bioinformatics pipelines. Pitfalls are validated with quantitative data (e.g., "51% Ensembl ID conversion failure rate" from Bioconductor, "27% integration performance" from Nature Methods benchmarking).
Confidence reduced from HIGH due to:
1. Limited Context7 coverage for specialized bioinformatics tools (gget, scanpy, gnomad-toolbox)
2. Niche domain (ciliopathy research) has smaller community and fewer standardized tools than general genomics
3. Some API integration patterns (CellxGene, IMPC) rely on recent documentation that may be incomplete
### Gaps to Address
**During planning/Phase 2:**
- **gnomAD constraint metric reliability thresholds:** Research identified coverage-dependent issues but exact cutoffs (mean depth >30x, >90% CDS covered) need validation during implementation. Test on known ciliopathy genes to calibrate.
- **Transcript selection for tissue-specific genes:** SHANK2 example shows dramatic pLI differences between canonical and brain-specific transcripts. Need strategy for selecting appropriate transcript for retina/inner ear genes. Validate against GTEx isoform expression data.
**During planning/Phase 3:**
- **CellxGene API performance and quotas:** Census API is recent (2024-2025); rate limits and data transfer costs unclear. May need to pivot to bulk h5ad downloads if API proves impractical for 20K gene queries.
- **MGI/ZFIN/IMPC bulk download formats:** No official Python APIs; relying on bulk downloads. Need to validate download URLs are stable and file formats are parseable. Build sample parsing scripts during research-phase.
**During planning/Phase 4:**
- **Scoring weight initialization:** Literature provides ranges (expression 15-25%, constraint 10-20%) but optimal weights are dataset-dependent. Plan for iterative tuning using positive controls rather than assuming initial weights are final.
- **Under-annotation bias correction strategy:** Novel research direction; no established methods found. May need to defer to v2 if initial attempts don't improve results. Test correlation(score, pubmed_count) as success metric.
**During planning/Phase 7:**
- **Positive control gene set curation:** Need curated list of ~50-100 established ciliopathy genes with high confidence for validation. CiliaCarta (>600 genes) too broad; SYSCILIA Gold Standard (~200 genes) better. Manual curation required.
- **Negative control gene set:** Need ~50-100 high-confidence non-cilia genes (housekeeping, muscle-specific, liver-specific) for specificity testing. No standard negative control sets found in literature.
**Validation during implementation:**
- **scRNA-seq integration method selection:** Harmony vs Seurat v5 vs scVI performance is dataset-dependent. Research recommends comparing multiple methods, but optimal choice won't be known until testing on actual retina/inner ear atlases.
- **Ortholog confidence score thresholds:** DIOPT provides scores 0-15; research suggests prioritizing high-confidence orthologs but doesn't specify cutoff. Test recall vs precision tradeoff during Phase 3.
## Sources
### Primary (HIGH confidence)
- **STACK.md sources (official PyPI and Context7):**
- Polars 1.38.1, gget 0.30.2, Biopython 1.86, Typer 0.21.2, Pydantic 2.12.5 — PyPI verified Feb 2026
- polars-bio performance benchmarks (Oxford Academic, Dec 2025): 6-38x speedup over bioframe
- uv package manager (multiple sources): 10-100x faster than pip
- **FEATURES.md sources (peer-reviewed publications):**
- "Survey and improvement strategies for gene prioritization with LLMs" (Bioinformatics Advances, 2026)
- "Rare disease gene discovery in 100K Genomes Project" (Nature, 2025)
- "CilioGenics: integrated method for predicting ciliary genes" (NAR, 2024)
- "Standards for validating NGS bioinformatics pipelines" (AMP/CAP, 2018)
- **ARCHITECTURE.md sources (bioinformatics best practices):**
- PHA4GE Pipeline Best Practices (GitHub, community-validated)
- "Bioinformatics Pipeline Architecture Best Practices" (multiple implementations)
- DuckDB vs SQLite benchmarks (Bridge Informatics, DataCamp): 10-100x faster for analytics
- "Pipeline Provenance for Reproducibility" (arXiv, 2024)
- **PITFALLS.md sources (authoritative quantitative data):**
- Gene ID mapping failures: "How to map all Ensembl IDs to Gene Symbols" (Bioconductor Support, 51% NA rate)
- Literature bias: "Gene annotation bias impedes biomedical research" (Scientific Reports, 2018)
- scRNA-seq batch effects: "Benchmarking atlas-level data integration" (Nature Methods, 2021): 27% performance
- Ortholog conservation: "Functional and evolutionary implications of gene orthology" (Nature Reviews Genetics)
- gnomAD constraint: "No preferential mode of inheritance for highly constrained genes" (PMC, 2022)
### Secondary (MEDIUM confidence)
- **FEATURES.md:**
- Exomiser/LIRICAL feature comparisons (tool documentation + GitHub issues)
- Multi-evidence integration methods (academic publications, 2009-2024)
- Explainability methods for genomics (WIREs Data Mining, 2023)
- **ARCHITECTURE.md:**
- Workflow manager comparisons (Nextflow vs Snakemake, 2025 analysis)
- Python CLI tool comparisons (argparse vs Click vs Typer, multiple blogs)
- Gene annotation pipeline architectures (NCBI, AnnotaPipeline)
- **PITFALLS.md:**
- API rate limit documentation (NCBI E-utilities, UniProt)
- Tissue specificity scoring methods (TransTEx, Bioinformatics 2024)
- Literature mining tools (GLAD4U, PubTator 3.0)
### Tertiary (LOW confidence — needs validation)
- AlphaFold on NVIDIA RTX 4090 (GitHub issues): 24 GB VRAM is below the recommended 32 GB; may work with configuration changes
- MGI/ZFIN/IMPC Python API availability (web search): No official libraries found, bulk downloads required
- CellxGene Census API quotas and rate limits (recent docs, 2024-2025): Documentation incomplete
---
*Research completed: 2026-02-11*
*Ready for roadmap: yes*