Project Research Summary
Project: Bioinformatics Cilia/Usher Gene Discovery Pipeline
Domain: Gene Candidate Discovery and Prioritization for Rare Disease / Ciliopathy Research
Researched: 2026-02-11
Confidence: MEDIUM-HIGH
Executive Summary
This project requires a multi-evidence bioinformatics pipeline to prioritize under-studied gene candidates for ciliopathy and Usher syndrome research. The recommended approach is a staged, modular Python pipeline using independent evidence retrieval layers (annotation, expression, protein features, localization, genetic constraint, phenotypes) that feed into weighted scoring and tiered output generation. Python 3.12+ with Polars for data processing, DuckDB for intermediate storage, and Typer for CLI orchestration represents the modern bioinformatics stack optimized for ~20K gene analysis on local workstations.
The critical architectural pattern is independent evidence layers with staged persistence: each layer retrieves, normalizes, and caches data separately before a final integration step combines scores. This enables restartability, parallel execution, and modular testing—essential for pipelines with external API dependencies and long runtimes. Configuration-driven behavior with YAML configs and Pydantic validation ensures reproducibility and enables parameter tuning without code changes.
The dominant risks are gene ID mapping inconsistencies (causing silent data loss across databases) and literature bias amplification (over-prioritizing well-studied genes at the expense of novel candidates). Both require explicit mitigation: standardized Ensembl gene IDs throughout, mapping validation gates, and heavy weighting of publication-independent scoring layers. Success depends on validation against positive controls (known Usher/cilia genes) and negative controls (non-cilia housekeeping genes) to prove the scoring system works before running at scale.
Key Findings
Recommended Stack
Python 3.12+ provides the best balance of library support and stability. Polars (1.38+) outperforms Pandas by 6-38x for genomic operations with better memory efficiency for 20K gene datasets. DuckDB enables out-of-core analytical queries on Parquet-cached intermediate results, solving the memory pressure problem when integrating 6 evidence layers. Typer offers modern, type-hint-based CLI design with auto-generated help, cleaner than argparse for modular pipeline scripts.
Core technologies:
- Python 3.12: Industry standard for bioinformatics with extensive ecosystem (Biopython, scanpy, gget)
- Polars 1.38+: DataFrame processing 6-38x faster than Pandas, native streaming for large datasets
- DuckDB: Columnar analytics database for intermediate storage; 10-100x faster than SQLite for aggregations, queries Parquet files without full import
- Typer 0.21+: CLI framework with type-hint-based interface, auto-generates help documentation
- gget 0.30+: Unified API for multi-source gene annotation (Ensembl, UniProt, NCBI) — primary data access layer (see the caching sketch after this list)
- Pydantic 2.12+: Type-safe configuration and data validation; 1.5-1.75x faster than v1, prevents config errors
- structlog: Structured JSON logging with correlation IDs to trace genes through pipeline
- diskcache: Persistent API response caching across reruns, critical for avoiding API rate limits
- uv: Rust-based package manager 10-100x faster than pip, replaces pip/pip-tools/poetry
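As a minimal sketch of how the retrieval and caching pieces could compose — the cache directory and the USH2A Ensembl ID are illustrative, and `gget.info`'s return shape should be verified against the pinned version:

```python
import diskcache
import gget  # unified annotation client (Ensembl, UniProt, NCBI)

cache = diskcache.Cache("cache/annotation")  # hypothetical cache directory

def fetch_annotation(ensembl_ids: list[str]):
    """Fetch annotation once per ID set; serve repeats from the on-disk cache."""
    key = "|".join(sorted(ensembl_ids))
    if key not in cache:
        # gget.info accepts a list of Ensembl IDs and returns a DataFrame
        cache[key] = gget.info(ensembl_ids)
    return cache[key]

usher = fetch_annotation(["ENSG00000042781"])  # USH2A; verify ID against the pinned build
```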
Avoid:
- Pandas alone (too slow, memory-inefficient)
- Python 3.9 or older (missing library support)
- Workflow orchestrators (Snakemake/Nextflow) for initial local workstation pipeline (overkill; adds complexity without benefit)
- Poetry for new projects (slower than uv, less standard)
Expected Features
Must have (table stakes):
- Multi-evidence scoring: 6 layers (annotation, expression, protein features, localization, constraint, phenotypes) with weighted integration — standard in modern gene prioritization
- Reproducibility documentation: Parameter logging, version tracking, seed control — required for publication
- Data provenance tracking: W3C PROV standard; track all transformations, source versions, timestamps
- Known gene validation: Benchmark against established ciliopathy genes (CiliaCarta, SYSCILIA, Usher genes) with recall metrics
- Quality control checks: Missing data detection, outlier identification, distribution checks
- API-based data retrieval: Automated queries to gnomAD, GTEx, HPA, UniProt with rate limiting, caching, retry logic
- Batch processing: Parallel execution across 20K genes with progress tracking and resume-from-checkpoint
- Parameter configuration: YAML config for weights, thresholds, data sources (example after this list)
- Tiered output with rationale: High/Medium/Low confidence tiers with evidence summaries per tier
- Basic visualization: Score distributions, rank plots, evidence contribution charts
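A minimal illustration of what such a config might contain; every key, weight, and threshold here is a hypothetical placeholder, not a recommended schema:

```yaml
# config.yaml — hypothetical weights, thresholds, and source pins
weights:
  annotation: 0.15
  expression: 0.20
  protein_features: 0.15
  localization: 0.15
  constraint: 0.15
  phenotypes: 0.20
thresholds:
  gnomad_min_coverage: 30   # mean depth cutoff, to be calibrated in Phase 2
  tier_high: 0.75           # combined-score cutoffs for confidence tiers
  tier_medium: 0.50
data_sources:
  ensembl_release: 111      # version-locked annotation build
```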
Should have (competitive differentiators):
- Explainable scoring: SHAP-style per-gene evidence breakdown showing WHY genes rank high — critical for discovery vs. diagnosis
- Systematic under-annotation bias handling: Correct publication bias by downweighting literature-heavy features for under-studied candidates — novel research advantage
- Sensitivity analysis: Systematic parameter sweep with rank stability metrics to demonstrate robustness
- Evidence conflict detection: Flag genes with contradictory evidence patterns (e.g., high expression but low constraint)
- Interactive HTML report: Browsable results with sortable tables, linked evidence sources
- Cross-species homology scoring: Zebrafish/mouse phenotype evidence integration via ortholog mapping
Defer (v2+):
- Automated literature scanning with LLM: RAG-based PubMed evidence extraction — high complexity, cost, uncertainty
- Incremental update capability: Re-run with new data without full recomputation — overkill for one-time discovery
- Multi-modal evidence weighting optimization: Bayesian integration, cross-validation — requires larger training set
- Cilia-specific knowledgebase integration: CilioGenics, CiliaMiner — nice-to-have, primary layers sufficient initially
Architecture Approach
Multi-evidence gene prioritization pipelines follow a layered architecture with independent data retrieval modules feeding normalization, scoring, and validation layers. The standard pattern is a staged pipeline where each evidence layer operates independently, writes intermediate results to disk (Parquet cache + DuckDB tables), then a final integration component performs SQL joins and weighted scoring.
Major components:
- CLI Orchestration Layer (Typer) — Entry point with subcommands (run-layer, integrate, report), global config management, pipeline orchestration
- Data Retrieval Layer (6 modules) — Independent retrievers for annotation, expression, protein features, localization, constraint, phenotypes; API clients with caching and retry logic
- Normalization/Transform Layer — Per-layer parsers converting raw formats to standardized schemas; gene ID mapping to Ensembl; score normalization to 0-1 scale
- Data Storage Layer — Raw cache (Parquet), intermediate results (DuckDB), final output (TSV/Parquet); enables restartability and out-of-core analytics (see the staged-persistence sketch after this list)
- Integration/Scoring Layer — Multi-evidence scoring via SQL joins in DuckDB; weighted aggregation; known gene filtering; confidence tier assignment
- Reporting Layer — Per-gene evidence summaries; provenance metadata generation; tiered candidate lists
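A sketch of the staged-persistence handoff under assumed paths and table layouts (nothing here is a fixed schema):

```python
from pathlib import Path

import duckdb
import polars as pl

Path("intermediate").mkdir(exist_ok=True)

# Stage 1: an evidence layer writes its normalized scores to the Parquet cache
layer = pl.DataFrame({"ensembl_id": ["ENSG00000042781"], "score": [0.82]})
layer.write_parquet("intermediate/constraint.parquet")

# Stage 2: DuckDB queries the Parquet file directly, with no import step;
# reruns resume from whatever stage outputs already exist on disk
con = duckdb.connect()
top = con.execute(
    "SELECT * FROM 'intermediate/constraint.parquet' ORDER BY score DESC"
).pl()  # hand the result back as a Polars DataFrame
```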
Key architectural patterns:
- Independent Evidence Layers: No cross-layer dependencies; enables parallel execution and isolated testing
- Staged Data Persistence: Each stage writes to disk before proceeding; enables restart-from-checkpoint and debugging
- Configuration-Driven Behavior: YAML configs for weights/thresholds; code reads config, never hardcodes parameters (a loading sketch follows this list)
- Provenance Tracking: Every output includes metadata (pipeline version, data source versions, timestamps, config hash)
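A minimal sketch of config loading under these patterns, assuming PyYAML as the parser; the model fields mirror the hypothetical config shown earlier:

```python
import hashlib
from pathlib import Path

import yaml  # PyYAML, assumed as the YAML parser
from pydantic import BaseModel, Field

class Weights(BaseModel):
    annotation: float = Field(ge=0, le=1)
    expression: float = Field(ge=0, le=1)
    constraint: float = Field(ge=0, le=1)
    # ... remaining layers elided for brevity

class PipelineConfig(BaseModel):
    weights: Weights
    thresholds: dict[str, float]

def load_config(path: Path) -> tuple[PipelineConfig, str]:
    """Validate the YAML config and return it with a provenance hash."""
    raw = path.read_bytes()
    config = PipelineConfig.model_validate(yaml.safe_load(raw))  # Pydantic v2
    config_hash = hashlib.sha256(raw).hexdigest()[:12]  # stamp into all outputs
    return config, config_hash
```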
Critical Pitfalls
- Gene ID Mapping Inconsistency Cascade — Over 51% of Ensembl IDs can fail symbol conversion, and there is roughly 8% discordance between Swiss-Prot and Ensembl. One-to-many mappings break naive merges, causing silent data loss. Avoid by: using Ensembl gene IDs as primary keys throughout; version-locking annotation builds; implementing mapping validation gates that report the percentage successfully mapped (sketched after this list); manually reviewing unmapped high-priority genes.
- Literature Bias Amplification in Weighted Scoring — Well-studied genes dominate scores because GO annotations, pathway coverage, interaction networks, and PubMed mentions all correlate with research attention rather than biological relevance, systematically deprioritizing novel candidates. Avoid by: decoupling publication metrics from functional evidence; normalizing scores by baseline publication count; heavily weighting sequence-based features independent of literature; requiring evidence diversity across multiple layers; validating against under-studied positive controls.
- Missing Data Handled as "Negative Evidence" — Genes lacking data in a layer (no GTEx expression, no IMPC phenotype) are treated as "low score" rather than "unknown," systematically penalizing under-measured genes. Avoid by: explicit three-state encoding (present/absent/unknown); layer-specific score normalization that doesn't penalize missing data (sketched after this list); imputation using ortholog evidence with confidence weighting; coverage-aware constraint metrics (only use gnomAD when coverage >30x).
- Batch Effects Misinterpreted as Biological Signal in scRNA-seq — Integration of multi-atlas scRNA-seq data (retina, inner ear, nasal epithelium) can erase true biological variation or create false cell-type signals; only 27% of integration outputs perform better than unintegrated data. Avoid by: validating integration quality using known marker genes; comparing multiple methods (Harmony, Seurat v5, scVI); using positive controls (known cilia genes should show expected cell-type enrichment); stratifying by sequencing technology before cross-platform integration.
- Reproducibility Theater Without Computational Environment Control — Scripts are version-controlled but results drift due to uncontrolled dependencies (package versions, database snapshots, API response changes, random seeds). Avoid by: pinning all dependencies with exact versions; containerizing the environment (Docker/Singularity); snapshotting external databases with version tags; checksumming all downloaded data (sketched after this list); setting random seeds globally; logging provenance metadata in all outputs.
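For the mapping validation gate, a sketch in Polars; the 95% threshold and column names are placeholders:

```python
import polars as pl

def mapping_gate(genes: pl.DataFrame, mapping: pl.DataFrame,
                 min_mapped: float = 0.95) -> pl.DataFrame:
    """Join symbols to Ensembl IDs and fail loudly instead of losing rows."""
    merged = genes.join(mapping, on="symbol", how="left")
    frac = merged["ensembl_id"].is_not_null().mean()
    if frac < min_mapped:
        missing = merged.filter(pl.col("ensembl_id").is_null())["symbol"].to_list()
        raise ValueError(f"Only {frac:.1%} of genes mapped; review first: {missing[:10]}")
    # One-to-many mappings silently duplicate rows in naive merges — detect them
    n_dup = merged["symbol"].is_duplicated().sum()
    if n_dup:
        raise ValueError(f"{n_dup} rows affected by one-to-many mappings; resolve first")
    return merged
```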
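For the missing-data pitfall, a sketch of layer-aware scoring that averages over measured layers only (toy columns and values):

```python
import polars as pl

# null = "unknown" (gene not measured in that layer), never scored as 0
df = pl.DataFrame({
    "ensembl_id": ["ENSG_A", "ENSG_B"],
    "expression": [0.9, None],   # B absent from GTEx
    "constraint": [0.7, 0.8],
    "phenotype":  [None, 0.6],   # A has no IMPC data
})

layers = ["expression", "constraint", "phenotype"]
n_measured = pl.sum_horizontal([pl.col(c).is_not_null().cast(pl.Int8) for c in layers])
total = pl.sum_horizontal([pl.col(c).fill_null(0.0) for c in layers])
scored = df.with_columns(
    (total / n_measured).alias("score"),  # mean over layers with data only
    n_measured.alias("n_layers"),         # evidence coverage, reported separately
)
```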
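And for environment control, a sketch of the checksum-and-provenance step (paths and the version string are placeholders):

```python
import hashlib
import json
import random
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path

random.seed(42)  # set once globally (likewise for numpy and friends)

def sha256sum(path: Path) -> str:
    """Checksum downloaded snapshots so silent upstream changes are caught."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

provenance = {
    "pipeline_version": "0.1.0",  # pull from package metadata in practice
    "polars_version": metadata.version("polars"),
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "inputs": {p.name: sha256sum(p) for p in Path("data").glob("*.tsv")},
}
Path("out/provenance.json").write_text(json.dumps(provenance, indent=2))
```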
Implications for Roadmap
Based on research, suggested phase structure follows dependency-driven ordering: infrastructure first, single-layer prototype to validate architecture, parallel evidence layer development, integration/scoring, then reporting/polish.
Phase 1: Data Infrastructure & Configuration
Rationale: All downstream components depend on the config system, gene ID mapping utilities, and data storage patterns. Critical to establish before any evidence retrieval.
Delivers: Project skeleton, YAML config loading with Pydantic validation, DuckDB schema setup, gene ID mapping utility, API caching infrastructure.
Addresses:
- Reproducibility documentation (table stakes)
- Parameter configuration (table stakes)
- Data provenance tracking (table stakes)
Avoids:
- Gene ID mapping inconsistency (Pitfall #1) via standardized Ensembl IDs and validation gates
- Reproducibility failure (Pitfall #8) via containerization and version pinning
Research flag: SKIP deeper research — standard Python packaging patterns, well-documented.
Phase 2: Single Evidence Layer Prototype
Rationale: Validate the retrieval → normalization → storage pattern before scaling to 6 layers; identifies architectural issues early at low cost.
Delivers: One complete evidence layer (recommend starting with genetic constraint: gnomAD pLI/LOEUF as the simplest API; a query sketch follows this phase), end-to-end flow from API to DuckDB storage.
Addresses:
- API-based data retrieval (table stakes)
- Quality control checks (table stakes)
Avoids:
- Missing data as negative evidence (Pitfall #3) by designing explicit unknown-state handling from start
- Constraint metrics misinterpreted (Pitfall #6) via coverage-aware filtering
Research flag: MAY NEED /gsd:research-phase — gnomAD API usage patterns and coverage thresholds need validation.
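A sketch of the kind of query involved, against gnomAD's public GraphQL endpoint; the field names (`gnomad_constraint`, `pli`, `oe_lof_upper` for LOEUF) match the schema as last checked but should be re-verified during research-phase, and the gene symbol is illustrative:

```python
import requests

GNOMAD_API = "https://gnomad.broadinstitute.org/api"

QUERY = """
query Constraint($symbol: String!) {
  gene(gene_symbol: $symbol, reference_genome: GRCh38) {
    gnomad_constraint { pli oe_lof oe_lof_upper }
  }
}
"""

resp = requests.post(
    GNOMAD_API,
    json={"query": QUERY, "variables": {"symbol": "MYO7A"}},  # Usher 1B gene
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["gene"]["gnomad_constraint"])
```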
Phase 3: Remaining Evidence Layers (Parallel Work)
Rationale: Layers are independent by design and can be built in parallel; there are no inter-dependencies between the annotation, expression, protein features, localization, and phenotypes modules.
Delivers: 5 additional evidence layers replicating the Phase 2 pattern (annotation, expression, protein features, localization, phenotypes), plus a literature scan.
Addresses:
- Multi-evidence scoring foundation (table stakes) — requires all 6 layers operational
Avoids:
- Ortholog function over-assumed (Pitfall #5) via confidence scoring and phenotype relevance filtering
- API rate limits (Performance trap) via batch queries, exponential backoff, API keys
Research flag: NEEDS /gsd:research-phase for specialized APIs — CellxGene, IMPC, MGI/ZFIN integration patterns are niche and poorly documented.
Phase 4: Integration & Scoring System
Rationale: Requires all evidence layers complete. Core scientific logic; the most complex component, requiring validation against positive/negative controls.
Delivers: Multi-evidence scoring via SQL joins in DuckDB (sketched after this phase), weighted aggregation, known gene exclusion filter (CiliaCarta/SYSCILIA/OMIM), confidence tier assignment (high/medium/low).
Addresses:
- Multi-evidence scoring (table stakes) — weighted integration of all layers
- Known gene validation (table stakes) — benchmarking against established ciliopathy genes
- Tiered output with rationale (table stakes) — confidence classification
Avoids:
- Literature bias amplification (Pitfall #2) via publication-independent scoring layers and diversity requirements
- Missing data as negative evidence (Pitfall #3) via layer-aware scoring that preserves genes with partial data
Research flag: MAY NEED /gsd:research-phase — weight optimization strategies and validation metrics need domain-specific tuning.
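A sketch of the integration join with layer-aware renormalization; table names and the two weights shown are placeholders (real weights come from the validated config), and only two layers are joined for brevity:

```python
import duckdb

con = duckdb.connect("pipeline.duckdb")  # hypothetical path; one table per layer
con.execute("""
    CREATE OR REPLACE TABLE candidates AS
    SELECT g.ensembl_id,
           -- weighted sum renormalized over layers that have data, so missing
           -- evidence stays "unknown" instead of acting as a zero penalty
           (COALESCE(0.20 * e.score, 0) + COALESCE(0.15 * c.score, 0))
           / NULLIF((CASE WHEN e.score IS NOT NULL THEN 0.20 ELSE 0.0 END)
                  + (CASE WHEN c.score IS NOT NULL THEN 0.15 ELSE 0.0 END), 0.0)
             AS combined_score
    FROM genes g
    LEFT JOIN expression_scores e USING (ensembl_id)
    LEFT JOIN constraint_scores c USING (ensembl_id)
    -- exclude established genes so output stays discovery-oriented
    WHERE g.ensembl_id NOT IN (SELECT ensembl_id FROM known_cilia_genes)
""")
```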
Phase 5: Reporting & Provenance
Rationale: Depends on the integration layer producing scored results; a presentation layer, less critical for initial validation.
Delivers: Per-gene evidence summaries, provenance metadata generation (versions, timestamps, config hash), tiered candidate lists (TSV + Parquet), basic visualizations (score distributions, top candidates).
Addresses:
- Structured output format (table stakes)
- Basic visualization (table stakes)
- Data provenance tracking (completion)
Research flag: SKIP deeper research — standard output formatting and metadata patterns.
Phase 6: CLI Orchestration & End-to-End Testing
Rationale: Integrates all components into the user-facing interface; validate the full pipeline on test gene sets before the production run.
Delivers: Unified Typer CLI app with subcommands (run-layer, integrate, report — sketched after this phase), dependency checking, progress logging with Rich, force-refresh flags, partial reruns.
Addresses:
- Batch processing (table stakes) — end-to-end orchestration with progress tracking
Research flag: SKIP deeper research — Typer CLI patterns well-documented.
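A minimal Typer sketch of the subcommand layout (command bodies are stubs; Typer converts `run_layer` to the `run-layer` subcommand automatically):

```python
import typer

app = typer.Typer(help="Cilia/Usher gene candidate discovery pipeline")

@app.command()
def run_layer(layer: str, force_refresh: bool = False):
    """Run one evidence layer; --force-refresh bypasses cached API responses."""
    typer.echo(f"running layer={layer} force_refresh={force_refresh}")

@app.command()
def integrate(config: str = "config.yaml"):
    """Join layer outputs in DuckDB and compute weighted scores."""
    typer.echo(f"integrating with {config}")

@app.command()
def report(out_dir: str = "out"):
    """Emit tiered candidate lists with provenance metadata."""
    typer.echo(f"writing report to {out_dir}")

if __name__ == "__main__":
    app()
```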
Phase 7: Validation & Weight Tuning
Rationale: After the pipeline is operational end-to-end, systematically validate against positive controls (known Usher/cilia genes) and negative controls (housekeeping genes), then iterate on scoring weights.
Delivers: Validation metrics (recall@10, recall@50 for positive controls; precision for negative controls — a metric sketch follows this phase), sensitivity analysis across parameter sweeps, finalized scoring weights.
Addresses:
- Known gene validation (completion) — quantitative benchmarking
- Sensitivity analysis (differentiator) — robustness demonstration
Research flag: MAY NEED /gsd:research-phase — validation metrics for gene prioritization pipelines need domain research.
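A sketch of the positive-control metric; gene IDs are dummies, and the curated list would come from the SYSCILIA subset discussed under Gaps:

```python
def recall_at_k(ranked: list[str], positives: set[str], k: int) -> float:
    """Fraction of positive-control genes recovered in the top-k candidates."""
    return len(set(ranked[:k]) & positives) / len(positives)

ranked = ["ENSG_X", "ENSG_Y", "ENSG_Z"]   # pipeline output, best first
positives = {"ENSG_Y", "ENSG_Q"}          # curated known cilia/Usher genes
print(recall_at_k(ranked, positives, k=10))  # recall@10 = 0.5 here
```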
Phase 8 (v1.x): Explainability & Advanced Reporting
Rationale: After v1 produces a validated candidate list; adds interpretability for hypothesis generation and publication.
Delivers: SHAP-style per-gene evidence breakdown, interactive HTML report with sortable tables, evidence conflict detection, negative control validation.
Addresses:
- Explainable scoring (differentiator)
- Interactive HTML report (differentiator)
- Evidence conflict detection (differentiator)
Research flag: NEEDS /gsd:research-phase — SHAP for non-ML scoring systems and HTML report generation for bioinformatics need investigation.
Phase Ordering Rationale
- Sequential dependencies: Phase 1 → Phase 2 → Phase 4 (infrastructure → prototype → integration) form the critical path; each depends on prior completion.
- Parallelizable work: Phase 3 (6 evidence layers) can be built concurrently by different developers or sequentially with rapid iteration.
- Defer polish until validation: Phases 5-6 (reporting, CLI) can be minimal initially; focus on scientific correctness (Phase 4) first.
- Validate before extending: Phase 7 (validation) must succeed before adding Phase 8 (advanced features); prevents building on broken foundation.
Architecture-driven grouping: Phases align with architectural layers (CLI orchestration, data retrieval, normalization, storage, integration, reporting) rather than arbitrary feature sets. Enables modular testing and isolated debugging.
Pitfall avoidance: Early phases establish mitigation strategies:
- Phase 1 prevents gene ID mapping issues (Pitfall #1) and reproducibility failures (Pitfall #8)
- Phase 2 prototype validates missing data handling (Pitfall #3)
- Phase 4 integration layer addresses literature bias (Pitfall #2)
- Validation throughout prevents silent failures from propagating
Research Flags
Phases likely needing deeper research during planning:
- Phase 2 (Constraint Metrics): gnomAD API usage patterns, coverage thresholds, transcript selection — MEDIUM priority research
- Phase 3 (Specialized APIs): CellxGene census API, IMPC phenotype API, MGI/ZFIN bulk download workflows — HIGH priority research (niche, sparse documentation)
- Phase 4 (Weight Optimization): Validation metrics for gene prioritization, weight tuning strategies, positive control benchmark datasets — MEDIUM priority research
- Phase 7 (Validation Metrics): Recall/precision thresholds for rare disease discovery, statistical significance testing for constraint enrichment — LOW priority research (standard metrics available)
- Phase 8 (SHAP for Rule-Based Systems): Explainability methods for non-ML weighted scoring, HTML report generation best practices — MEDIUM priority research
Phases with standard patterns (skip research-phase):
- Phase 1 (Infrastructure): Python packaging, YAML config loading, DuckDB schema design — well-documented, no novelty
- Phase 5 (Reporting): Output formatting, CSV/Parquet generation, basic matplotlib/seaborn visualizations — standard patterns
- Phase 6 (CLI): Typer CLI design, Rich progress bars, logging configuration — mature libraries with excellent docs
Confidence Assessment
| Area | Confidence | Notes |
|---|---|---|
| Stack | HIGH | All core technologies verified via official PyPI pages, Context7 library coverage. Version compatibility matrix validated. Alternative comparisons (Polars vs Pandas, uv vs Poetry) based on benchmarks and community adoption trends. |
| Features | MEDIUM | Feature expectations derived from 20+ peer-reviewed publications on gene prioritization tools (Exomiser, LIRICAL, CilioGenics) and bioinformatics best practices. No Context7 coverage for specialized domain tools; relied on WebSearch + academic literature. MVP recommendations synthesized from multiple sources. |
| Architecture | MEDIUM-HIGH | Architectural patterns validated across multiple bioinformatics pipeline implementations (PHA4GE best practices, nf-core patterns, academic publications). DuckDB vs SQLite comparison based on performance benchmarks. Staged pipeline pattern is industry standard. Confidence reduced slightly due to limited Context7 coverage for workflow orchestrators. |
| Pitfalls | HIGH | 40+ authoritative sources covering gene ID mapping failures (Bioconductor, UniProt docs), literature bias (Nature Scientific Reports), scRNA-seq batch effects (Nature Methods benchmarks), reproducibility failures (PLOS Computational Biology). Pitfall scenarios validated across multiple independent publications. Warning signs derived from "Common Bioinformatics Mistakes" community resources. |
Overall confidence: MEDIUM-HIGH
Research is well-grounded in established practices and peer-reviewed literature. Stack recommendations are conservative (mature technologies, not bleeding-edge). Architecture patterns are proven in production bioinformatics pipelines. Pitfalls are validated with quantitative data (e.g., "51% Ensembl ID conversion failure rate" from Bioconductor, "27% integration performance" from Nature Methods benchmarking).
Confidence reduced from HIGH due to:
- Limited Context7 coverage for specialized bioinformatics tools (gget, scanpy, gnomad-toolbox)
- Niche domain (ciliopathy research) has smaller community and fewer standardized tools than general genomics
- Some API integration patterns (CellxGene, IMPC) rely on recent documentation that may be incomplete
Gaps to Address
During planning/Phase 2:
- gnomAD constraint metric reliability thresholds: Research identified coverage-dependent issues but exact cutoffs (mean depth >30x, >90% CDS covered) need validation during implementation. Test on known ciliopathy genes to calibrate.
- Transcript selection for tissue-specific genes: SHANK2 example shows dramatic pLI differences between canonical and brain-specific transcripts. Need strategy for selecting appropriate transcript for retina/inner ear genes. Validate against GTEx isoform expression data.
During planning/Phase 3:
- CellxGene API performance and quotas: Census API is recent (2024-2025); rate limits and data transfer costs unclear. May need to pivot to bulk h5ad downloads if API proves impractical for 20K gene queries.
- MGI/ZFIN/IMPC bulk download formats: No official Python APIs; relying on bulk downloads. Need to validate download URLs are stable and file formats are parseable. Build sample parsing scripts during research-phase.
During planning/Phase 4:
- Scoring weight initialization: Literature provides ranges (expression 15-25%, constraint 10-20%) but optimal weights are dataset-dependent. Plan for iterative tuning using positive controls rather than assuming initial weights are final.
- Under-annotation bias correction strategy: Novel research direction; no established methods found. May need to defer to v2 if initial attempts don't improve results. Test correlation(score, pubmed_count) as a success metric (a quick check is sketched below).
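A quick sketch of that check, assuming the candidate table carries a `pubmed_count` column (column names and the output path are placeholders):

```python
import polars as pl
from scipy.stats import spearmanr

df = pl.read_parquet("out/candidates.parquet")  # hypothetical output path
rho, p = spearmanr(df["combined_score"].to_numpy(), df["pubmed_count"].to_numpy())
print(f"Spearman rho={rho:.2f} (p={p:.1e}); high rho flags literature-bias leakage")
```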
During planning/Phase 7:
- Positive control gene set curation: Need curated list of ~50-100 established ciliopathy genes with high confidence for validation. CiliaCarta (>600 genes) too broad; SYSCILIA Gold Standard (~200 genes) better. Manual curation required.
- Negative control gene set: Need ~50-100 high-confidence non-cilia genes (housekeeping, muscle-specific, liver-specific) for specificity testing. No standard negative control sets found in literature.
Validation during implementation:
- scRNA-seq integration method selection: Harmony vs Seurat v5 vs scVI performance is dataset-dependent. Research recommends comparing multiple methods, but optimal choice won't be known until testing on actual retina/inner ear atlases.
- Ortholog confidence score thresholds: DIOPT provides scores 0-15; research suggests prioritizing high-confidence orthologs but doesn't specify cutoff. Test recall vs precision tradeoff during Phase 3.
Sources
Primary (HIGH confidence)
- STACK.md sources (official PyPI and Context7):
- Polars 1.38.1, gget 0.30.2, Biopython 1.86, Typer 0.21.2, Pydantic 2.12.5 — PyPI verified Feb 2026
- polars-bio performance benchmarks (Oxford Academic, Dec 2025): 6-38x speedup over bioframe
- uv package manager (multiple sources): 10-100x faster than pip
- FEATURES.md sources (peer-reviewed publications):
- "Survey and improvement strategies for gene prioritization with LLMs" (Bioinformatics Advances, 2026)
- "Rare disease gene discovery in 100K Genomes Project" (Nature, 2025)
- "CilioGenics: integrated method for predicting ciliary genes" (NAR, 2024)
- "Standards for validating NGS bioinformatics pipelines" (AMP/CAP, 2018)
- ARCHITECTURE.md sources (bioinformatics best practices):
- PHA4GE Pipeline Best Practices (GitHub, community-validated)
- "Bioinformatics Pipeline Architecture Best Practices" (multiple implementations)
- DuckDB vs SQLite benchmarks (Bridge Informatics, DataCamp): 10-100x faster for analytics
- "Pipeline Provenance for Reproducibility" (arXiv, 2024)
- PITFALLS.md sources (authoritative quantitative data):
- Gene ID mapping failures: "How to map all Ensembl IDs to Gene Symbols" (Bioconductor Support, 51% NA rate)
- Literature bias: "Gene annotation bias impedes biomedical research" (Scientific Reports, 2018)
- scRNA-seq batch effects: "Benchmarking atlas-level data integration" (Nature Methods, 2021): 27% performance
- Ortholog conservation: "Functional and evolutionary implications of gene orthology" (Nature Reviews Genetics)
- gnomAD constraint: "No preferential mode of inheritance for highly constrained genes" (PMC, 2022)
Secondary (MEDIUM confidence)
- FEATURES.md:
- Exomiser/LIRICAL feature comparisons (tool documentation + GitHub issues)
- Multi-evidence integration methods (academic publications, 2009-2024)
- Explainability methods for genomics (WIREs Data Mining, 2023)
- ARCHITECTURE.md:
- Workflow manager comparisons (Nextflow vs Snakemake, 2025 analysis)
- Python CLI tool comparisons (argparse vs Click vs Typer, multiple blogs)
- Gene annotation pipeline architectures (NCBI, AnnotaPipeline)
- PITFALLS.md:
- API rate limit documentation (NCBI E-utilities, UniProt)
- Tissue specificity scoring methods (TransTEx, Bioinformatics 2024)
- Literature mining tools (GLAD4U, PubTator 3.0)
Tertiary (LOW confidence — needs validation)
- AlphaFold NVIDIA 4090 compatibility (GitHub issues): 24GB VRAM below 32GB recommended, may work with configuration
- MGI/ZFIN/IMPC Python API availability (web search): No official libraries found, bulk downloads required
- CellxGene Census API quotas and rate limits (recent docs, 2024-2025): Documentation incomplete
Research completed: 2026-02-11
Ready for roadmap: yes