gbanyan/usher-exploring


gbanyan bb7bfaedab docs: complete project research

2026-02-11 14:52:06 +08:00


Technology Stack

Project: Bioinformatics Cilia/Usher Gene Discovery Pipeline
Researched: 2026-02-11
Confidence: HIGH

Recommended Stack

Core Framework

Technology	Version	Purpose	Why
Python	3.12+	Pipeline runtime	Industry standard for bioinformatics, extensive ecosystem, type hints, best library support. Avoid 3.10-3.11 (older), 3.14 (too new for some libraries).
Polars	1.38+	DataFrame processing	6-38x faster than Pandas-based bioframe on genomic interval operations (polars-bio benchmarks), native Rust backend, streaming for larger-than-RAM datasets, better memory efficiency for ~20K gene analysis.
Typer	0.21+	CLI framework	Modern type-hint based CLI, auto-generates help docs, built on Click (battle-tested), cleaner than argparse for modular scripts.
Pydantic	2.12+	Data validation & config	Type-safe configuration, 1.5-1.75x faster than v1, validates gene scores/weights, prevents config errors that waste compute time.

Data Access Libraries

Library	Version	Purpose	When to Use
gget	0.30+	Multi-source gene annotation	PRIMARY: Unified API for Ensembl, UniProt, NCBI. Fetches gene metadata, sequences, GO terms. Official Context7 support, actively maintained.
pyensembl	Latest	Ensembl GTF/FASTA access	FALLBACK: If gget insufficient. Downloads/caches Ensembl data locally. More control but less convenient than gget.
Biopython	1.86+	Sequence analysis, format parsing	ALWAYS: Parse FASTA/GenBank, sequence manipulation, EntrezId conversion. De facto standard, mandatory for bioinformatics. Requires Python 3.10+.
requests	2.32+	HTTP API calls	ALWAYS: Session-based API calls to REST endpoints (InterPro, STRING, gnomAD). Connection pooling, retry logic, timeout control.
metapub	0.6.4+	PubMed literature mining	PRIMARY for PubMed: Abstracts via eutils, DOI finding, formatted citations. Production-tested at bioinformatics facilities. Python 3.8+.
gnomad-toolbox	0.0.1+	gnomAD constraint metrics	PRIMARY: Official Broad Institute tool (Jan 2025), loads/filters constraint data. Requires Python 3.9+. Note: Very new, validate stability.

scRNA-seq & Expression Data

Library	Version	Purpose	When to Use
scanpy	1.12+	scRNA-seq analysis	For processing single-cell cilia expression. Clustering, QC, cell type annotation. Requires Python 3.12+.
anndata	0.11+	scRNA-seq data structures	DEPENDENCY: Required by scanpy. Stores expression matrices.
pyGTEx	Git latest	GTEx tissue expression	For GTEx API queries. Install from GitHub: pip install git+https://github.com/w-gao/pyGTEx.git
tspex	Latest	Tissue specificity scoring	For calculating tissue-specific expression scores from GTEx data.
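
As an illustration of what a tissue specificity score measures, the sketch below computes the tau index, one of the metrics tspex offers, in plain Python. It does not call tspex itself; the function name `tau_index` and the expression values are made up for the example.

```python
from typing import Sequence


def tau_index(expression: Sequence[float]) -> float:
    """Tau tissue-specificity index: 0 = ubiquitous, 1 = single-tissue."""
    n = len(expression)
    if n < 2:
        raise ValueError("tau needs expression values for at least two tissues")
    if any(x < 0 for x in expression):
        raise ValueError("expression values must be non-negative")
    x_max = max(expression)
    if x_max == 0:
        return 0.0  # not expressed anywhere; treated here as non-specific
    # Average, over tissues, of how far each tissue falls below the maximum.
    return sum(1 - x / x_max for x in expression) / (n - 1)


# Hypothetical expression values across four tissues (illustrative only).
broad = tau_index([10.0, 10.0, 10.0, 10.0])    # same level everywhere
specific = tau_index([0.0, 0.0, 40.0, 0.0])    # expressed in one tissue
```

A gene expressed at the same level everywhere scores 0, and a gene expressed in a single tissue scores 1, so higher values point toward tissue-restricted candidates.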

Data Validation & Quality

Library	Version	Purpose	When to Use
pandera	0.29+	DataFrame schema validation	ALWAYS: Validate gene data schemas, catch data quality issues early. 1.5-1.75x faster with Pydantic v2. Supports Polars/Pandas. Python 3.10+.
pydantic-settings	2.12+	YAML/TOML config loading	ALWAYS: Load pipeline configs with validation. Built-in TOML/YAML support (Nov 2025). Prevents config typos from wasting GPU time.

Workflow & CLI

Library	Version	Purpose	When to Use
rich	14.3+	CLI progress bars, formatting	ALWAYS: Progress tracking for 20K gene processing, formatted tables, error highlighting. Production/Stable. Python 3.8+.
structlog	Latest	Structured logging	ALWAYS: JSON logs for pipeline debugging, correlation IDs for tracking genes through pipeline, contextvars for thread safety.
click	Latest	CLI subcommands	IF NEEDED: Fallback if Typer insufficient (Typer built on Click).

Performance & Caching

Library	Version	Purpose	When to Use
diskcache	Latest	Disk-based API response caching	ALWAYS: Cache API responses (Ensembl, UniProt, gnomAD) to avoid re-fetching during reruns. Persistent across runs.
joblib	Latest	NumPy/ML model caching	For AlphaFold-Multimer result caching, large array serialization. Disk-based, survives restarts.
httpx	0.28+	Async HTTP client (OPTIONAL)	ONLY IF async needed: HTTP/2 support, faster concurrent API calls. Otherwise use requests (simpler).

Testing & Development

Tool	Version	Purpose	Notes
pytest	Latest	Testing framework	Industry standard. Fixtures for test data, parametrization for edge cases.
pytest-cov	Latest	Coverage reporting	Ensure gene scoring logic tested.
mypy	Latest	Static type checking	Catch type errors pre-runtime. Essential with Pydantic/Typer type hints.
ruff	Latest	Linting & formatting	10-100x faster than pylint/black. Rust-based, configurable.

Package Management

Tool	Version	Purpose	Notes
uv	Latest	Dependency manager	10-100x faster than pip, replaces pip/pip-tools/poetry/pyenv. Rust-based (Astral team). Strict PEP compliance. Use for new projects.
pyproject.toml	-	Dependency specification	Standard Python packaging. Works with uv/pip.

Installation

# Install uv (package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create project with Python 3.12
uv init
uv python install 3.12
uv python pin 3.12

# Core dependencies
uv add polars==1.38.1
uv add typer==0.21.2
uv add pydantic==2.12.5
uv add pydantic-settings==2.12.0

# Data access
uv add gget==0.30.2
uv add biopython==1.86
uv add requests==2.32.5
uv add metapub==0.6.4
uv add gnomad-toolbox==0.0.1

# scRNA-seq & expression
uv add scanpy==1.12
uv add anndata==0.11.4
uv add "pyGTEx @ git+https://github.com/w-gao/pyGTEx.git"

# Validation & quality
uv add pandera==0.29.0

# CLI & workflow
uv add rich==14.3.2
uv add structlog

# Performance & caching
uv add diskcache
uv add joblib

# Dev dependencies
uv add --dev pytest
uv add --dev pytest-cov
uv add --dev mypy
uv add --dev ruff

Alternatives Considered

Category	Recommended	Alternative	Why Not Alternative
DataFrame	Polars	Pandas	Pandas 6-38x slower for genomic interval operations (bioframe vs polars-bio benchmarks). Polars has streaming for >RAM datasets.
CLI	Typer	argparse	argparse verbose, no auto-docs. Typer uses type hints, generates help automatically.
CLI	Typer	Click	Click requires decorators for everything. Typer cleaner with type hints. (Typer built on Click internally.)
HTTP	requests	httpx	httpx adds async/HTTP/2 complexity. Requests simpler for synchronous API calls. Use httpx only if async needed.
Package Mgr	uv	Poetry	Poetry slower (not Rust-based), separate config file. uv 10-100x faster, uses standard pyproject.toml, replaces more tools.
Package Mgr	uv	pip + pip-tools	uv replaces both, adds reproducible lock files, faster resolution.
Config	pydantic-settings	Hydra	Hydra overkill for pipeline config. Better for ML experiments with many hyperparameter sweeps. pydantic-settings simpler.
Logging	structlog	stdlib logging	stdlib logging requires custom filters for structured output. structlog built for JSON logs from start.
Workflow	Modular scripts	Snakemake	Snakemake overkill for local pipeline. Adds complexity, HPC features not needed. Use if scaling to cluster later.
Workflow	Modular scripts	Nextflow	Nextflow for cloud/enterprise. Groovy syntax steep learning curve. Not needed for local NVIDIA 4090 workstation.

What NOT to Use

Avoid	Why	Use Instead
Pandas (alone)	6-38x slower than Polars for genomic interval operations (bioframe vs polars-bio benchmarks). Less memory-efficient for the ~20K gene analysis.	Polars with polars-bio extension
Python 3.11 or older	scanpy 1.12 requires 3.12+; Biopython 1.86 and pandera 0.29 already require 3.10+. Older versions also lack newer type hint features.	Python 3.12 (stable, well-supported)
Python 3.14	Too new. Some libraries not tested yet. Risk of compatibility issues.	Python 3.12 or 3.13
Pydantic v1	1.5-1.75x slower than v2. V2 major rewrite with better validation.	Pydantic 2.12+
argparse for complex CLI	Verbose, manual help text, no type validation.	Typer (type-hint based)
Poetry (for new projects)	Slower dependency resolution, separate config file, fewer tools replaced.	uv (10-100x faster, standard pyproject.toml)
Manual API retries	Error-prone, inconsistent timeout handling.	requests.Session with retry adapters + timeout
print() debugging	No structured logs, hard to trace gene through pipeline.	structlog with correlation IDs
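
The "requests.Session with retry adapters + timeout" recommendation could look roughly like the sketch below. The retry count, backoff factor, status codes, and timeout values are illustrative assumptions, and `build_session` / `get_json` are hypothetical helper names. `allowed_methods` assumes urllib3 1.26 or newer.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

DEFAULT_TIMEOUT = (5, 30)  # (connect, read) seconds; illustrative values


def build_session(total_retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Return a Session that retries transient failures with backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,
        status_forcelist=(429, 500, 502, 503, 504),  # rate limits and server errors
        allowed_methods=("GET", "HEAD"),             # only retry idempotent calls
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def get_json(session: requests.Session, url: str, **params: str) -> dict:
    """GET a JSON document, always passing an explicit timeout."""
    response = session.get(url, params=params, timeout=DEFAULT_TIMEOUT)
    response.raise_for_status()
    return response.json()


session = build_session()
```

requests has no session-wide default timeout, so the helper passes one on every call rather than relying on callers to remember it.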

Stack Patterns by Use Case

Local workstation pipeline (current requirement):

Use Polars + modular Typer CLI scripts
Avoid Snakemake/Nextflow (overkill)
Use diskcache for API responses (persistent across runs)
Use rich for progress tracking (20K genes takes time)

If scaling to HPC cluster later:

Add Snakemake for workflow orchestration
Keep modular scripts (Snakemake calls them)
Use job scheduler integration (SLURM/PBS)

If adding async API calls:

Replace requests with httpx
Use asyncio event loop
Batch API calls (e.g., 100 genes at once to UniProt)
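
The batching idea can be sketched with the standard library alone. Here `fake_fetch_batch` is a placeholder for a real httpx.AsyncClient request, and the batch size and concurrency limit are assumed values, not tuned ones.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def fetch_all(
    gene_ids: Sequence[str],
    fetch_batch: Callable[[Sequence[str]], Awaitable[dict]],
    batch_size: int = 100,
    max_concurrent: int = 4,
) -> dict:
    """Call `fetch_batch` once per batch, limiting concurrent requests."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(batch: Sequence[str]) -> dict:
        async with semaphore:
            return await fetch_batch(batch)

    parts = await asyncio.gather(
        *(guarded(batch) for batch in chunked(gene_ids, batch_size))
    )
    merged: dict = {}
    for part in parts:
        merged.update(part)
    return merged


async def fake_fetch_batch(batch: Sequence[str]) -> dict:
    """Placeholder for a real HTTP call; returns dummy records."""
    await asyncio.sleep(0)
    return {gene_id: {"batch_len": len(batch)} for gene_id in batch}


genes = [f"GENE{i}" for i in range(250)]
annotations = asyncio.run(fetch_all(genes, fake_fetch_batch))
```

The semaphore keeps the number of in-flight requests bounded, which matters for the rate-limited services listed under Data Source APIs.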

For GPU-accelerated steps (AlphaFold-Multimer downstream):

NVIDIA 4090 has 24GB VRAM (below 32GB recommended minimum for AlphaFold2 NIM)
May work for smaller predictions with appropriate configurations
Use joblib to cache results (avoid re-running expensive predictions)

Version Compatibility

Package	Requires Python	Notes
Biopython 1.86	>= 3.10	Blocks Python 3.9
scanpy 1.12	>= 3.12	Blocks Python 3.10, 3.11
Polars 1.38	>= 3.10	Recommended 3.12+
Typer 0.21	>= 3.9	Works with 3.9-3.14
Pydantic 2.12	>= 3.9	Requires v2 for performance
pandera 0.29	>= 3.10	Supports Polars + Pandas
metapub 0.6.4	>= 3.8	Older requirement, works with 3.12
gnomad-toolbox 0.0.1	>= 3.9	Very new (Jan 2025), monitor stability

Minimum Python: 3.12 (required by scanpy)
Recommended Python: 3.12 (stable, well-supported by all libraries)
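
The minimum follows from taking the highest per-package requirement in the table. A small sketch of that reasoning, with the table values copied into a dictionary (`check_python` is a hypothetical startup guard, not part of any listed library):

```python
import sys

# Minimum Python per package, copied from the table above.
MIN_PYTHON = {
    "biopython 1.86": (3, 10),
    "scanpy 1.12": (3, 12),
    "polars 1.38": (3, 10),
    "typer 0.21": (3, 9),
    "pydantic 2.12": (3, 9),
    "pandera 0.29": (3, 10),
    "metapub 0.6.4": (3, 8),
    "gnomad-toolbox 0.0.1": (3, 9),
}

# The stack as a whole needs the highest of the individual minimums.
binding_package, required = max(MIN_PYTHON.items(), key=lambda item: item[1])


def check_python(version=sys.version_info[:2]) -> bool:
    """Return True if the given interpreter version satisfies every package."""
    return tuple(version) >= required
```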

Data Source APIs & Formats

Source	Access Method	Library	Format	Notes
Ensembl	REST API	gget, pyensembl	JSON, GTF	Rate limits apply. Cache with diskcache.
UniProt	REST API	gget, requests	JSON, XML	Batch queries supported.
NCBI/PubMed	Entrez eutils	metapub, Biopython.Entrez	XML	Free but rate limited (3 req/sec without key).
gnomAD	Downloads + API	gnomad-toolbox	Hail Tables, TSV	Large files. Use constraint metrics API.
InterPro	REST API	requests	JSON	Protein domains/motifs. Free, rate limited.
GTEx	Portal + API	pyGTEx	TSV, JSON	Expression data. Portal downloads for bulk.
HPA	Downloads	requests	TSV	Tissue/cell expression. Download bulk files.
STRING	API + downloads	requests	TSV	PPI networks. API for queries, downloads for bulk.
BioGRID	Downloads	requests	TSV	PPI data. Download releases, parse locally.
MGI/ZFIN/IMPC	Web + downloads	requests	Various	No official Python API. Web scraping or bulk downloads.
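
For the rate-limited sources, a simple client-side throttle is one way to stay under limits such as the 3 requests/second noted for keyless Entrez access. The class below is a single-threaded sketch with hypothetical names. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable


class RateLimiter:
    """Space calls at least 1/rate seconds apart (simple, single-threaded)."""

    def __init__(
        self,
        rate_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._last_call: float | None = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self._last_call + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last_call = now + delay
        return delay


# Limit taken from the NCBI/PubMed row above (no API key).
ncbi_limiter = RateLimiter(rate_per_sec=3)
```

Calling `ncbi_limiter.wait()` immediately before each request is enough for a sequential pipeline; a concurrent client would need a lock or an async variant.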

Configuration Strategy

Use Pydantic models for type-safe configuration:

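# config.py
from pydantic import BaseModel, Field
from pydantic_settings import (
    BaseSettings, SettingsConfigDict,
    TomlConfigSettingsSource, YamlConfigSettingsSource,
)

class ScoringWeights(BaseModel):
    # Each weight must lie in [0, 1]
    tissue_expression: float = Field(ge=0, le=1)
    protein_domains: float = Field(ge=0, le=1)
    genetic_constraint: float = Field(ge=0, le=1)
    literature_score: float = Field(ge=0, le=1)

class PipelineConfig(BaseSettings):
    model_config = SettingsConfigDict(
        toml_file='pipeline_config.toml',
        yaml_file='pipeline_config.yaml'
    )

    weights: ScoringWeights
    api_cache_dir: str = ".cache/api_responses"
    output_tiers: list[str] = ["high", "medium", "low"]

    @classmethod
    def settings_customise_sources(cls, settings_cls, init_settings, env_settings,
                                   dotenv_settings, file_secret_settings):
        # toml_file/yaml_file are ignored unless these sources are registered.
        # YAML loading needs PyYAML (pip install "pydantic-settings[yaml]").
        return (
            init_settings, env_settings,
            TomlConfigSettingsSource(settings_cls),
            YamlConfigSettingsSource(settings_cls),
        )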

Load from TOML/YAML, validated at startup. Prevents runtime config errors.

Logging Strategy

Use structlog for structured JSON logging:

import structlog

# Configure once at startup
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()

# In pipeline, bind gene context
logger = logger.bind(gene_id="ENSG00000123456", gene_symbol="USH2A")
logger.info("scoring_complete", score=0.85, tier="high")

Enables filtering/searching logs by gene, tracing genes through pipeline.

Sources

HIGH CONFIDENCE (Official Docs + Context7):

Polars 1.38.1 PyPI - Version verified Feb 2026
gget 0.30.2 PyPI - Version verified Feb 2026
Biopython 1.86 PyPI - Version verified Oct 2025
Typer 0.21.2 PyPI - Version verified Feb 2026
Pydantic 2.12.5 PyPI - Version verified Nov 2025
pandera 0.29.0 PyPI - Version verified Jan 2026
requests 2.32.5 PyPI - Version verified Aug 2025
httpx 0.28.1 PyPI - Version verified Dec 2024
rich 14.3.2 PyPI - Version verified Feb 2026
scanpy 1.12 PyPI - Version verified Jan 2026
metapub 0.6.4 PyPI - Version verified Aug 2025
gnomad-toolbox 0.0.1 PyPI - Version verified Jan 2025

MEDIUM CONFIDENCE (WebSearch + Official Sources):

polars-bio performance benchmarks - Oxford Academic, Dec 2025
gget genomic database querying - PMC publication
STRING database 12.5 update - NAR 2025
InterPro 2025 update - NAR 2025
uv package manager overview - Multiple sources confirm 10-100x speedup
Pydantic-settings TOML/YAML support - Nov 2025 release notes
Python logging best practices 2025 - Industry practices
Nextflow vs Snakemake 2025 comparison - Workflow framework analysis

MEDIUM-LOW CONFIDENCE (Limited verification):

AlphaFold NVIDIA 4090 compatibility - GitHub issues suggest challenges, official docs specify 80GB GPUs
MGI/ZFIN/IMPC Python API - No official Python libraries found, ZFIN and MGI provide web interfaces and bulk downloads

Stack research for: Bioinformatics Cilia/Usher Gene Discovery Pipeline
Researched: 2026-02-11
Confidence: HIGH for core stack, MEDIUM for workflow alternatives, MEDIUM-LOW for animal model APIs
