Phase 1: Data Infrastructure - Research
Researched: 2026-02-11
Domain: Python bioinformatics data pipelines, gene ID mapping, API clients, configuration management, provenance tracking
Confidence: MEDIUM-HIGH
Summary
This phase establishes the foundational data infrastructure for a reproducible gene essentiality scoring pipeline. The core technical challenge is building a robust system for gene ID mapping, external API integration with rate limiting and caching, configuration management with validation, and data persistence enabling checkpoint-restart capabilities.
Python ecosystem strengths: The bioinformatics Python ecosystem has mature libraries for gene ID mapping (mygene), API retry/caching (tenacity, requests-cache), data validation (Pydantic v2), and analytical data storage (DuckDB, Parquet). These tools are actively maintained with 2026 releases and well-documented patterns for scientific pipelines.
Key architectural decisions:
- Use mygene (MyGene.info API) for gene ID mapping - it supports batch queries across Ensembl, HGNC, and UniProt with species filtering
- Use requests-cache + tenacity for API clients - persistent SQLite cache with exponential backoff retry
- Use Pydantic v2 + pydantic-yaml for configuration - strong validation with clear error messages
- Use DuckDB for intermediate data persistence - file-based database with native Parquet support and checkpoint capabilities
- Use pathlib.Path consistently for file operations - cross-platform, modern Python standard
Primary recommendation: Build modular CLI scripts using click framework, separate concerns (config loading, API clients, gene mapping, persistence), and emphasize validation gates at each step (report mapping success rates, validate API responses, check data completeness). Avoid building custom solutions for ID mapping, retry logic, or data serialization - use established libraries.
Standard Stack
Core
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| mygene | 3.1.0+ | Gene ID mapping (Ensembl ↔ HGNC ↔ UniProt) | Official MyGene.info client; handles batch queries, species filtering, automatic pagination |
| requests | 2.32.0+ | HTTP client foundation | Universal Python HTTP library, basis for requests-cache |
| requests-cache | 1.3.0+ | Persistent HTTP caching | SQLite backend, TTL support, transparent caching, saves API quota |
| tenacity | 9.0.0+ | Retry logic with exponential backoff | Declarative retry strategies, handles rate limits (429), jitter support |
| Pydantic | 2.12.5+ | Data validation and settings | Type-safe validation, clear error messages, v2 has a Rust-based core |
| pydantic-yaml | 1.4.0+ | YAML ↔ Pydantic integration | Load/dump Pydantic models from YAML, validation on load |
| DuckDB | 1.2.0+ | Analytical database for intermediate data | File-based persistence, native Parquet support, fast SQL queries |
| click | 8.3.0+ | CLI framework | Decorator-based, excellent help generation, subcommand support |
| pathlib | stdlib | Path handling | Cross-platform, object-oriented, modern standard (Python 3.4+) |
Supporting
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| pyensembl | 2.3.13+ | Ensembl GTF/FASTA local database | If you need to filter protein-coding genes locally or access transcript details |
| bioservices | 1.12.1+ | UniProt/other bio API clients | Direct UniProt queries (though mygene covers most use cases) |
| polars | 1.20.0+ | Fast DataFrame operations | Large-scale data transformations (alternative to pandas) |
| PyArrow | 18.0.0+ | Parquet read/write, zero-copy interop | Writing Parquet files, Arrow integration |
| hashlib | stdlib | Hash generation for provenance | Create config hashes (SHA-256), file integrity checks |
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| mygene | Manual API calls to Ensembl/UniProt | Custom code is a maintenance burden; mygene handles pagination, errors, retries |
| requests-cache | Manual file-based cache | Reinventing TTL logic and serialization; requests-cache is battle-tested |
| Pydantic v2 | attrs or dataclass | attrs is faster, but Pydantic has a better validation ecosystem for complex rules |
| click | argparse (stdlib) | argparse is stdlib, but click has better DX for complex CLIs with subcommands |
| DuckDB | Pure Parquet files | DuckDB adds SQL query capability and simpler checkpoint logic on the same Parquet backend |
Installation:
pip install mygene requests requests-cache tenacity pydantic pydantic-yaml duckdb click polars pyarrow
For local Ensembl database (optional):
pip install pyensembl
pyensembl install --release 112 --species homo_sapiens
Architecture Patterns
Recommended Project Structure
src/
├── config/ # Configuration schemas and loaders
│ ├── schema.py # Pydantic models for pipeline config
│ └── loader.py # YAML → validated config
├── gene_mapping/ # Gene ID mapping utilities
│ ├── universe.py # Define gene universe (protein-coding, Ensembl filtering)
│ ├── mapper.py # mygene wrapper, batch mapping
│ └── validator.py # Mapping validation gates, reporting
├── api_clients/ # External API clients
│ ├── base.py # Shared retry/cache setup
│ ├── gnomad.py # gnomAD client
│ ├── gtex.py # GTEx client
│ ├── hpa.py # Human Protein Atlas client
│ └── ... # Other API clients
├── persistence/ # Data persistence layer
│ ├── duckdb_store.py # DuckDB connection, table management
│ └── provenance.py # Provenance metadata tracking
└── cli/ # CLI entry points
├── setup.py # Setup gene universe, validate config
├── fetch.py # Fetch external data
└── ... # Other pipeline steps
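The cli/ entry points can stay thin: load the validated config once, then delegate to the library modules. A minimal sketch of a click-based entry point, assuming the load_config helper from Pattern 1 below (file, command, and option names are illustrative):
# cli/main.py (illustrative sketch; command and option names are placeholders)
from pathlib import Path

import click

from config.loader import load_config


@click.group()
@click.option("--config", "config_path", required=True,
              type=click.Path(exists=True, path_type=Path),
              help="Path to the pipeline YAML config")
@click.pass_context
def cli(ctx, config_path: Path):
    """Gene essentiality pipeline."""
    # Validate the config once and share it with subcommands via the context
    ctx.obj = load_config(config_path)


@cli.command()
@click.pass_obj
def setup(config):
    """Build the gene universe and validate the configuration."""
    click.echo(f"Ensembl release: {config.versions.ensembl_release}")
    # ... delegate to gene_mapping.universe


@cli.command()
@click.pass_obj
def fetch(config):
    """Fetch external data sources (gnomAD, GTEx, HPA, ...)."""
    # ... delegate to api_clients


if __name__ == "__main__":
    cli()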
Pattern 1: Configuration with Pydantic + YAML
What: Define the configuration schema as Pydantic models and load it from YAML with validation
When to use: All pipeline parameters (weights, thresholds, data source versions, API keys)
Example:
# config/schema.py
from pydantic import BaseModel, Field, field_validator
from pathlib import Path
class DataSourceVersions(BaseModel):
ensembl_release: int = Field(..., ge=100, description="Ensembl release number")
gnomad_version: str = Field("v4.1", description="gnomAD version")
gtex_version: str = Field("v8", description="GTEx data version")
class PipelineConfig(BaseModel):
data_dir: Path
cache_dir: Path
duckdb_path: Path
versions: DataSourceVersions
api_rate_limit: int = Field(10, description="Max requests per second")
@field_validator('duckdb_path')
def ensure_parent_exists(cls, v: Path) -> Path:
v.parent.mkdir(parents=True, exist_ok=True)
return v
# config/loader.py
from pydantic_yaml import parse_yaml_raw_as
from pathlib import Path
def load_config(config_path: Path) -> PipelineConfig:
"""Load and validate YAML config."""
yaml_content = config_path.read_text()
config = parse_yaml_raw_as(PipelineConfig, yaml_content)
return config
Source: Pydantic Documentation, pydantic-yaml PyPI
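For reference, a minimal sketch of a YAML document this schema would accept, parsed end-to-end (field names follow the PipelineConfig sketch above; the values are placeholders):
# Illustrative config.yaml contents for the PipelineConfig schema above
from pydantic_yaml import parse_yaml_raw_as

from config.schema import PipelineConfig

example_yaml = """
data_dir: data
cache_dir: data/cache
duckdb_path: data/pipeline.duckdb
versions:
  ensembl_release: 112
  gnomad_version: v4.1
  gtex_version: v8
api_rate_limit: 5
"""

config = parse_yaml_raw_as(PipelineConfig, example_yaml)
print(config.versions.ensembl_release)  # 112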
Pattern 2: Gene ID Mapping with Validation Gates
What: Use mygene for batch ID mapping, report the success rate, and flag unmapped genes
When to use: Converting between Ensembl, HGNC, and UniProt IDs
Example:
# gene_mapping/mapper.py
import mygene
from typing import List, Dict, Tuple
def batch_map_ensembl_to_hgnc(
ensembl_ids: List[str],
species: int = 9606 # Human
) -> Tuple[Dict[str, str], List[str]]:
"""
Map Ensembl gene IDs to HGNC symbols.
Returns: (successful_mappings, unmapped_ids)
"""
mg = mygene.MyGeneInfo()
# Query with scopes for Ensembl gene IDs
results = mg.querymany(
ensembl_ids,
scopes='ensembl.gene',
fields='symbol',
species=species,
returnall=True
)
successful = {}
unmapped = []
for query_result in results['out']:
ensembl_id = query_result['query']
if 'symbol' in query_result and 'notfound' not in query_result:
successful[ensembl_id] = query_result['symbol']
else:
unmapped.append(ensembl_id)
# Validation gate: report mapping success rate
total = len(ensembl_ids)
success_rate = len(successful) / total * 100
print(f"Mapped {len(successful)}/{total} genes ({success_rate:.1f}%)")
if success_rate < 90:
print(f"WARNING: Low mapping success rate. Unmapped genes: {unmapped[:10]}...")
return successful, unmapped
Source: MyGene.py Documentation
Pattern 3: API Client with Retry and Cache
What: Combine requests-cache for persistent caching with tenacity for retry logic
When to use: All external API calls (gnomAD, GTEx, HPA, UniProt, PubMed)
Example:
# api_clients/base.py
import requests_cache
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type
)
from requests.exceptions import HTTPError, Timeout
from pathlib import Path
class CachedAPIClient:
"""Base class for API clients with caching and retry."""
def __init__(self, cache_dir: Path, rate_limit: int = 10):
self.session = requests_cache.CachedSession(
cache_name=str(cache_dir / 'api_cache'),
backend='sqlite',
expire_after=86400, # 24 hours default TTL
)
self.rate_limit = rate_limit
@retry(
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=1, min=2, max=60),
retry=retry_if_exception_type((HTTPError, Timeout)),
reraise=True
)
def get(self, url: str, **kwargs):
"""GET request with retry and cache."""
response = self.session.get(url, timeout=30, **kwargs)
response.raise_for_status() # Raise HTTPError for 4xx/5xx
return response
# api_clients/gtex.py
class GTExClient(CachedAPIClient):
"""GTEx Portal API client."""
BASE_URL = "https://gtexportal.org/api/v2"
def get_gene_expression(self, gene_symbol: str):
"""Fetch gene expression data across tissues."""
url = f"{self.BASE_URL}/expression/geneExpression"
params = {'geneId': gene_symbol, 'datasetId': 'gtex_v8'}
response = self.get(url, params=params)
return response.json()
Sources: requests-cache Documentation, Tenacity Documentation
Pattern 4: DuckDB for Checkpoint-Restart
What: Store intermediate results in a DuckDB file so the pipeline can resume from a checkpoint
When to use: After fetching expensive API data, after transformations
Example:
# persistence/duckdb_store.py
import duckdb
from pathlib import Path
from typing import Optional
import pandas as pd
class PipelineStore:
"""DuckDB-based storage for pipeline intermediate results."""
def __init__(self, db_path: Path):
self.db_path = db_path
self.conn = duckdb.connect(str(db_path))
def save_dataframe(self, df: pd.DataFrame, table_name: str, replace: bool = False):
"""Save DataFrame to DuckDB table."""
        # Create an empty table with df's schema if it doesn't exist yet
        self.conn.execute(f"CREATE TABLE IF NOT EXISTS {table_name} AS SELECT * FROM df WHERE FALSE")
        if replace:
            self.conn.execute(f"DELETE FROM {table_name}")
        self.conn.execute(f"INSERT INTO {table_name} SELECT * FROM df")
def load_dataframe(self, table_name: str) -> Optional[pd.DataFrame]:
"""Load DataFrame from DuckDB table."""
try:
return self.conn.execute(f"SELECT * FROM {table_name}").df()
except Exception:
return None
def export_parquet(self, table_name: str, output_path: Path):
"""Export table to Parquet file."""
self.conn.execute(f"COPY {table_name} TO '{output_path}' (FORMAT PARQUET)")
def has_checkpoint(self, checkpoint_name: str) -> bool:
"""Check if checkpoint exists."""
result = self.conn.execute(
"SELECT COUNT(*) FROM information_schema.tables WHERE table_name = ?",
[checkpoint_name]
).fetchone()
return result[0] > 0
Source: DuckDB Python API
Pattern 5: Provenance Metadata Tracking
What: Attach metadata to every output (versions, timestamps, config hash)
When to use: All pipeline outputs (gene lists, intermediate data, final results)
Example:
# persistence/provenance.py
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Any
def compute_config_hash(config: Dict[str, Any]) -> str:
"""Compute SHA-256 hash of configuration."""
config_json = json.dumps(config, sort_keys=True, default=str)
return hashlib.sha256(config_json.encode()).hexdigest()
def create_provenance_metadata(
config: Dict[str, Any],
pipeline_version: str,
data_sources: Dict[str, str]
) -> Dict[str, Any]:
"""Create provenance metadata for output."""
return {
'pipeline_version': pipeline_version,
'data_source_versions': data_sources,
'config_hash': compute_config_hash(config),
        'timestamp': datetime.now(timezone.utc).isoformat(),
'processing_steps': [] # Populated during pipeline execution
}
def save_with_provenance(
data: Any,
output_path: Path,
metadata: Dict[str, Any]
):
"""Save data with provenance metadata sidecar."""
# Save main data
# ... (format-specific save logic)
# Save metadata sidecar
metadata_path = output_path.with_suffix('.provenance.json')
metadata_path.write_text(json.dumps(metadata, indent=2))
Source: Python hashlib documentation
Anti-Patterns to Avoid
- Global state: Don't use global variables for configuration or connections; pass explicitly or use dependency injection
- String-based paths: Don't use os.path with string concatenation; use pathlib.Path with the / operator
- Bare try-except: Don't catch all exceptions silently; catch specific exceptions, log them, and re-raise or handle
- Manual CSV parsing: Don't parse CSV with string splitting; use pandas, polars, or the csv module
- Hardcoded credentials: Don't embed API keys in code; use environment variables or a separate secrets file (see the sketch below)
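For the credentials point above, a minimal sketch of pulling an API key from the environment rather than the source tree (the variable name is a placeholder):
# Read an API key from the environment instead of hard-coding it
# (PIPELINE_API_KEY is a placeholder name)
import os

api_key = os.environ.get("PIPELINE_API_KEY")
if api_key is None:
    raise RuntimeError("PIPELINE_API_KEY is not set; export it before running the pipeline")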
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Gene ID mapping | Custom scrapers for Ensembl/UniProt | mygene library | Handles pagination, rate limits, species mapping, multiple ID types; battle-tested |
| HTTP caching | File-based cache with pickle | requests-cache | Handles cache expiration, concurrent access, multiple backends, HTTP semantics |
| Retry logic | Manual loop with sleep | tenacity | Exponential backoff, jitter, retry conditions, logging; prevents thundering herd |
| Data validation | Manual type checks | Pydantic | Automatic coercion, nested validation, clear error messages, JSON schema generation |
| Parquet I/O | Custom serialization | pyarrow or polars | Compression, schema evolution, partitioning, column pruning optimizations |
| CLI parsing | Manual sys.argv parsing | click | Help generation, type conversion, subcommands, testing utilities |
Key insight: Bioinformatics pipelines have deceptively complex edge cases (API rate limits, pagination, ID mapping ambiguity, data versioning). Established libraries have solved these through community testing. Custom solutions will hit the same edge cases but without the benefit of community fixes.
Common Pitfalls
Pitfall 1: Not Filtering Pseudogenes from Gene Universe
What goes wrong: Including pseudogenes in gene universe inflates counts, breaks downstream ID mapping (pseudogenes often lack UniProt mappings)
Why it happens: Ensembl GTF includes all biotypes; filtering requires explicit gene_biotype == "protein_coding" check
How to avoid:
- Use pyensembl or parse the Ensembl GTF to filter by biotype (see the sketch after this pitfall)
- Explicitly exclude pseudogenes: biotype NOT IN ('processed_pseudogene', 'unprocessed_pseudogene', 'pseudogene')
- Validate that the gene count matches the expected number of human protein-coding genes (~20,000)
Warning signs: Gene universe > 25,000 genes (indicates non-coding genes included)
Source: Ensembl Gene Biotypes, Biostars: Exclude pseudogenes
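A minimal sketch of the biotype filter using pyensembl, assuming release 112 has been installed locally via pyensembl install; the Gene attribute names (biotype, gene_id) should be verified against the installed pyensembl version:
# Build a protein-coding gene universe from a local Ensembl release
from pyensembl import EnsemblRelease

ensembl = EnsemblRelease(112)  # must match the release pinned in the config

protein_coding = [g for g in ensembl.genes() if g.biotype == "protein_coding"]
gene_ids = sorted(g.gene_id for g in protein_coding)

# Validation gate: human protein-coding genes should land near ~20,000
print(f"Protein-coding genes: {len(gene_ids)}")
if not 18_000 <= len(gene_ids) <= 22_000:
    raise ValueError("Unexpected gene universe size; check biotype filtering")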
Pitfall 2: Low Gene ID Mapping Success Rate Without Validation
What goes wrong: 20%+ genes fail to map between ID systems; pipeline proceeds with incomplete data, results are misleading
Why it happens: Different databases use different gene versions, some genes lack mappings, retired IDs persist in old datasets
How to avoid:
- Implement validation gates: report mapping success rate, fail if < 90%
- Save unmapped genes to file for manual review
- Use mygene's returnall=True to distinguish "not found" from "ambiguous"
- Use a consistent Ensembl release across all steps (a minimal validation-gate sketch follows this pitfall)
Warning signs: Silent data loss, gene count drops unexpectedly between steps, missing expected genes in results
Source: Biostars: Gene ID mapping challenges, Cancer Dependency Map: Gene annotation best practices
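Building on batch_map_ensembl_to_hgnc from Pattern 2, a minimal sketch of a hard gate that saves unmapped IDs for review and stops the run below a threshold (the 90% threshold and report path are illustrative; ensembl_ids is assumed to hold the gene universe):
# Hard validation gate after ID mapping
from pathlib import Path

mapped, unmapped = batch_map_ensembl_to_hgnc(ensembl_ids)

report = Path("reports") / "unmapped_genes.txt"
report.parent.mkdir(parents=True, exist_ok=True)
report.write_text("\n".join(unmapped))

success_rate = len(mapped) / len(ensembl_ids) if ensembl_ids else 0.0
if success_rate < 0.90:  # illustrative threshold; set per project policy
    raise RuntimeError(f"Only {success_rate:.1%} of genes mapped; review {report}")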
Pitfall 3: API Rate Limiting Without Backoff
What goes wrong: API returns 429 (Too Many Requests), script crashes or gets IP banned
Why it happens: Scientific APIs (gnomAD, GTEx) enforce rate limits; naive sequential requests hit limits quickly
How to avoid:
- Use tenacity with exponential backoff and jitter (see the sketch after this pitfall)
- Check API documentation for rate limits and implement conservative delays
- Use requests-cache to avoid re-fetching the same data
- Batch requests where the API supports it (e.g., mygene querymany)
Warning signs: Frequent 429 errors, script hangs, API blocks your IP
Source: API Error Handling & Retry Strategies: Python Guide 2026, gnomAD: Blocked when using API
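A minimal sketch combining client-side throttling with tenacity's backoff so 429 responses are retried instead of crashing the run; the one-request-per-second default is an assumption to tune against each API's documented limits:
# Conservative client-side throttling plus retry on HTTP errors (including 429)
import time

import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


class PoliteGetter:
    """Spaces requests out and retries on HTTP errors such as 429."""

    def __init__(self, min_interval: float = 1.0):  # assumption: ~1 req/sec, tune per API
        self.min_interval = min_interval
        self._last_call = 0.0

    @retry(
        stop=stop_after_attempt(5),
        wait=wait_exponential(multiplier=2, min=2, max=120),
        retry=retry_if_exception_type(requests.exceptions.HTTPError),
        reraise=True,
    )
    def get(self, url: str, **kwargs) -> requests.Response:
        # Client-side throttling: never send faster than min_interval
        pause = self.min_interval - (time.monotonic() - self._last_call)
        if pause > 0:
            time.sleep(pause)
        self._last_call = time.monotonic()
        response = requests.get(url, timeout=30, **kwargs)
        response.raise_for_status()  # 429/5xx raise HTTPError and trigger a retry
        return response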
Pitfall 4: No Provenance Metadata = Unreproducible Results
What goes wrong: Cannot reproduce results 6 months later; data source versions unknown, config parameters lost
Why it happens: Scientific data sources update frequently (Ensembl releases biannually), config tweaks during development are forgotten
How to avoid:
- Embed provenance metadata in every output: data source versions, timestamps, config hash, pipeline version
- Save config file alongside outputs (or hash reference)
- Use semantic versioning for pipeline scripts
- Document data download dates
Warning signs: "Which Ensembl release did we use?", "What were the parameter values?", inability to reproduce old results
Source: FAIR data pipeline: provenance-driven data management
Pitfall 5: Hard-Coded File Paths and No Cross-Platform Support
What goes wrong: Pipeline breaks on different OS (Windows vs Linux), paths don't exist on collaborator's machine
Why it happens: Using string concatenation for paths ("/home/user/data/" + filename), hardcoding absolute paths
How to avoid:
- Use pathlib.Path consistently: Path("data") / "genes.csv"
- Make all paths configurable via the YAML config
- Use relative paths from the project root or a config-specified base directory
- Check path existence and create directories with Path.mkdir(parents=True, exist_ok=True)
Warning signs: FileNotFoundError on different machines, mix of / and \ in path strings
Source: pathlib best practices 2026
Pitfall 6: Ignoring Data Versioning and API Changes
What goes wrong: API query syntax changes (e.g., UniProt 2022 column name changes), old code breaks
Why it happens: External APIs evolve; breaking changes in data schema or query parameters
How to avoid:
- Pin data source versions in config (Ensembl release, gnomAD version)
- Add API version checks or try-except with fallback for schema changes
- Monitor API changelog announcements
- Test with latest API versions periodically
Warning signs: Suddenly failing API calls after working for months, schema validation errors, missing expected fields
Source: bioservices UniProt API changes, Biostars: UniProt API programming
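One way to catch upstream schema changes early is to validate responses against a small Pydantic model so a renamed or missing field fails loudly at the API boundary; a minimal sketch (the field names below are placeholders, not a documented schema):
# Validate an API response shape so upstream schema changes fail loudly
# (field names below are placeholders, not a real GTEx/HPA schema)
from pydantic import BaseModel, ValidationError


class ExpressionRecord(BaseModel):
    gencodeId: str
    tissueSiteDetailId: str
    median: float


def parse_expression(payload: dict) -> list[ExpressionRecord]:
    try:
        return [ExpressionRecord(**row) for row in payload.get("data", [])]
    except ValidationError as exc:
        # A renamed or removed field surfaces here instead of as a silent KeyError downstream
        raise RuntimeError(f"Unexpected API response schema: {exc}") from exc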
Pitfall 7: No Checkpoint-Restart = Re-download Everything on Failure
What goes wrong: Network error at hour 3 of 4-hour data fetch; must restart from beginning, waste time and API quota
Why it happens: Pipeline doesn't persist intermediate results; all-or-nothing execution model
How to avoid:
- Save intermediate results to DuckDB after each major step
- Check for checkpoints before expensive operations: if not store.has_checkpoint('gnomad_data'): fetch_gnomad() (see the sketch after this pitfall)
- Use requests-cache to avoid re-downloading API data
- Design idempotent steps (safe to re-run)
Warning signs: Frequent restarts from scratch, frustration during development, wasted API quota
Source: DuckDB Python documentation
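A minimal sketch of the checkpoint guard using the PipelineStore class from Pattern 4 (the table name and fetch_gnomad_constraints are placeholders):
# Skip an expensive fetch when its checkpoint already exists
from pathlib import Path

from persistence.duckdb_store import PipelineStore

store = PipelineStore(Path("data/pipeline.duckdb"))

if store.has_checkpoint("gnomad_constraints"):
    gnomad_df = store.load_dataframe("gnomad_constraints")
else:
    gnomad_df = fetch_gnomad_constraints()  # placeholder for the expensive API step
    store.save_dataframe(gnomad_df, "gnomad_constraints", replace=True)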
Code Examples
Verified patterns from official sources:
Loading YAML Config with Pydantic Validation
# Source: https://pypi.org/project/pydantic-yaml/
from pydantic import BaseModel, Field
from pydantic_yaml import parse_yaml_raw_as
from pathlib import Path
class Config(BaseModel):
ensembl_release: int = Field(..., ge=100)
cache_dir: Path
api_rate_limit: int = 10
config_yaml = Path("config.yaml").read_text()
config = parse_yaml_raw_as(Config, config_yaml)
# Raises ValidationError if invalid
Batch Gene ID Mapping with mygene
# Source: https://docs.mygene.info/projects/mygene-py/en/latest/
import mygene
mg = mygene.MyGeneInfo()
# Map Ensembl IDs to HGNC symbols and UniProt accessions
results = mg.querymany(
['ENSG00000139618', 'ENSG00000141510'],
scopes='ensembl.gene',
fields='symbol,uniprot',
species=9606, # Human
returnall=True
)
for hit in results['out']:
print(f"{hit['query']} -> {hit.get('symbol')} (UniProt: {hit.get('uniprot', {}).get('Swiss-Prot')})")
Setting Up requests-cache with SQLite Backend
# Source: https://requests-cache.readthedocs.io/
import requests_cache
session = requests_cache.CachedSession(
cache_name='api_cache',
backend='sqlite',
expire_after=86400, # 24 hours
)
# First request hits API and caches
response = session.get('https://api.example.com/data')
# Subsequent requests return cached response (sub-millisecond)
response = session.get('https://api.example.com/data')
print(f"From cache: {response.from_cache}")
Retry with Exponential Backoff using tenacity
# Source: https://tenacity.readthedocs.io/
from tenacity import retry, stop_after_attempt, wait_exponential
import requests
@retry(
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=1, min=4, max=60)
)
def fetch_api_data(url):
response = requests.get(url, timeout=30)
response.raise_for_status()
return response.json()
# Automatically retries with exponentially increasing delays, bounded between 4s and 60s, for up to 5 attempts
data = fetch_api_data('https://api.example.com/genes')
DuckDB: Persist DataFrame and Export Parquet
# Source: https://duckdb.org/docs/stable/clients/python/overview
import duckdb
import pandas as pd
# Connect to file-based database (creates if not exists)
conn = duckdb.connect('pipeline.duckdb')
# Save DataFrame to table
df = pd.DataFrame({'gene': ['BRCA1', 'TP53'], 'score': [0.95, 0.88]})
conn.execute("CREATE TABLE IF NOT EXISTS gene_scores AS SELECT * FROM df")
# Query later
result = conn.execute("SELECT * FROM gene_scores WHERE score > 0.9").df()
# Export to Parquet
conn.execute("COPY gene_scores TO 'output/gene_scores.parquet' (FORMAT PARQUET)")
Computing Config Hash for Provenance
# Source: https://docs.python.org/3/library/hashlib.html
import hashlib
import json
def compute_config_hash(config_dict):
"""Compute SHA-256 hash of config for provenance tracking."""
config_json = json.dumps(config_dict, sort_keys=True, default=str)
hash_digest = hashlib.sha256(config_json.encode()).hexdigest()
return hash_digest
config = {'ensembl_release': 112, 'weights': {'gnomad': 0.3}}
config_hash = compute_config_hash(config)
print(f"Config hash: {config_hash[:16]}...") # Config hash: a3f8b2c1d4e5f6a7...
Using pathlib for Cross-Platform Paths
# Source: https://docs.python.org/3/library/pathlib.html
from pathlib import Path
# Define base directory
data_dir = Path("data")
# Build paths with / operator (cross-platform)
gene_file = data_dir / "genes" / "ensembl_genes.csv"
# Create parent directories if needed
gene_file.parent.mkdir(parents=True, exist_ok=True)
# Read/write with convenience methods
gene_file.write_text("ENSG00000139618,BRCA1\n")  # Writes a string (creates the file)
content = gene_file.read_text()  # Reads the entire file
# Check existence
if gene_file.exists():
print(f"Found: {gene_file.resolve()}") # Absolute path
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| Pandas for all data | DuckDB for analytical queries, Polars for transformations | 2023-2025 | 10-24x speedup for large datasets; DuckDB SQL queries on Parquet without full load |
| Manual API retry loops | tenacity library | 2020+ | Declarative retry strategies, jitter prevents thundering herd, better error handling |
| Dataclasses for validation | Pydantic v2 | 2023 (v2 release) | Rust-based validation core = faster; richer validation rules, automatic JSON schema |
| os.path string manipulation | pathlib.Path | Python 3.4+ (now standard) | Cross-platform by default, object-oriented, more readable code |
| UniProt REST API column names | New API (2022 changes) | June 2022 | Breaking change: tab → tsv, taxonomy → taxonomy_id; affects bioservices code |
| requests without caching | requests-cache | Adopted in scientific pipelines 2020+ | Persistent SQLite cache, respects HTTP headers, saves API quota and time |
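The first row above refers to DuckDB querying Parquet in place; a minimal illustration (the file path is a placeholder):
# Run SQL directly over a Parquet file without loading it into a DataFrame first
import duckdb

top_genes = duckdb.sql(
    "SELECT gene, score FROM 'output/gene_scores.parquet' WHERE score > 0.9"
).df()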
Deprecated/outdated:
- pyEntrezId: Converts IDs to Entrez but limited scope; use mygene instead (supports more ID types, better maintained)
- gnomad_python_api: Package marked deprecated, GraphQL API changed; use direct GraphQL queries or official gnomAD Python tools
- biomaRt (R): Still used in R pipelines but Python: use pybiomart for BioMart queries (if needed; mygene often sufficient)
- argparse for complex CLIs: Still valid but click or typer have better DX for multi-command CLIs
Open Questions
1. GTEx API v2 Python Client Maturity
- What we know: GTEx API v2 exists with improved documentation; the community tool pyGTEx exists but may be outdated
- What's unclear: Is pyGTEx maintained for API v2? Should we use direct API calls?
- Recommendation: Start with direct API calls using the base CachedAPIClient pattern; verify current GTEx API v2 endpoint behavior during implementation
2. Human Protein Atlas API Details
- What we know: HPA exposes XML API accepting Ensembl IDs; MCP interface mentioned but details sparse
- What's unclear: Rate limits, batch query support, data freshness, Python client recommendations
- Recommendation: Use direct API calls with requests-cache; test rate limits empirically during development
3. gnomAD API Access Pattern
- What we know: gnomAD has GraphQL API; some Python wrappers deprecated; rate limiting confirmed (blocked after ~10 queries)
- What's unclear: Current recommended Python access method, official rate limits, batch query best practices
- Recommendation: Use direct GraphQL queries with aggressive caching and conservative rate limiting (1 req/sec start); verify official docs during implementation
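A hedged sketch of the direct-GraphQL approach; the endpoint URL and especially the query fields are assumptions to verify against the current gnomAD schema before relying on them:
# Direct GraphQL query to gnomAD (field names are unverified placeholders)
import requests

GNOMAD_API = "https://gnomad.broadinstitute.org/api"  # verify against current gnomAD docs

query = """
query GeneConstraint($symbol: String!) {
  gene(gene_symbol: $symbol, reference_genome: GRCh38) {
    gene_id
    gnomad_constraint { pli oe_lof }  # verify field names against the live schema
  }
}
"""

response = requests.post(
    GNOMAD_API,
    json={"query": query, "variables": {"symbol": "BRCA1"}},
    timeout=30,
)
response.raise_for_status()
payload = response.json()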
4. Ensembl vs MyGene.info Data Freshness
- What we know: mygene provides convenient batch mapping; Ensembl is canonical source; data sync lag possible
- What's unclear: How quickly does MyGene.info sync with new Ensembl releases? Acceptable lag for this pipeline?
- Recommendation: Pin Ensembl release in config; validate mygene results against expected gene count; consider hybrid approach (mygene for bulk mapping, direct Ensembl fallback for failures)
Sources
Primary (HIGH confidence)
- MyGene.py Documentation - Gene ID mapping API, batch queries, species filtering
- DuckDB Python API - Persistence, Parquet integration, SQL queries
- requests-cache Documentation - HTTP caching, SQLite backend, TTL configuration
- Tenacity Documentation - Retry strategies, exponential backoff, rate limit handling
- Pydantic Documentation - BaseModel, validation, v2 features
- Python pathlib Documentation - Path handling, cross-platform patterns
- Python hashlib Documentation - SHA-256 hashing for config provenance
Secondary (MEDIUM confidence)
- pydantic-yaml PyPI - YAML integration with Pydantic
- GTEx Portal API Documentation - GTEx API v2 endpoints
- bioservices Documentation - UniProt API access via Python
- PyEnsembl GitHub - Local Ensembl database, gene filtering
- Click Documentation - CLI framework comparison, best practices
- API Error Handling Guide 2026 - Retry patterns, backoff strategies
- DuckDB vs Polars vs PyArrow comparison - Performance benchmarks for data pipelines
Tertiary (LOW confidence - needs verification)
- pyGTEx GitHub - Community GTEx client (maintenance status unclear)
- gnomAD Python API (deprecated) - Marked deprecated, may have outdated examples
- HPA MCP Interface - Mentioned in search but implementation details sparse
- FAIR Data Pipeline (academic paper) - Provenance concepts, may need adaptation to practical implementation
Metadata
Confidence breakdown:
- Standard stack: HIGH - All libraries verified via official docs, actively maintained with 2026 releases, well-documented
- Architecture: MEDIUM-HIGH - Patterns verified from official docs, some scientific pipeline specifics extrapolated from best practices
- Pitfalls: MEDIUM - Validated from community forums (Biostars), official documentation warnings, and known bioinformatics challenges
- API client specifics (GTEx, HPA, gnomAD): MEDIUM-LOW - Official APIs exist but Python client recommendations need runtime verification
Research date: 2026-02-11
Valid until: ~60 days (March 2026) - Gene ID mapping and data validation patterns are stable; API endpoints may change, verify before implementation