Architecture Research

Domain: Bioinformatics Gene Candidate Discovery Pipeline
Researched: 2026-02-11
Confidence: MEDIUM-HIGH

Standard Architecture

Multi-evidence gene prioritization pipelines in bioinformatics follow a layered architecture with independent data retrieval modules feeding into normalization, feature extraction, scoring, and validation layers. The standard pattern is a staged pipeline where each evidence layer operates independently, writes intermediate results, then feeds into a final integration/scoring component.

System Overview

┌─────────────────────────────────────────────────────────────────┐
│                     CLI ORCHESTRATION LAYER                      │
│  (Typer/Click: main pipeline script + per-layer subcommands)    │
├─────────────────────────────────────────────────────────────────┤
│                   DATA RETRIEVAL LAYER (6 modules)               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │ GO/UniProt│  │ Tissue   │  │ Protein  │  │ Subcell  │        │
│  │ Annot.   │  │ Expr.    │  │ Features │  │ Localiz. │        │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘        │
│       │             │             │             │               │
│  ┌────┴─────┐  ┌───┴──────┐                                     │
│  │ Genetic  │  │ Animal   │  + Literature Scan (PubMed API)    │
│  │ Constrt. │  │ Models   │                                     │
│  └────┬─────┘  └────┬─────┘                                     │
├───────┴──────────────┴──────────────────────────────────────────┤
│                  NORMALIZATION/TRANSFORM LAYER                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Per-layer parsers: Raw → Standardized schema            │   │
│  │  Gene ID mapping (all to Ensembl/HGNC)                   │   │
│  │  Score normalization (0-1 scale per evidence type)       │   │
│  └──────────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────┤
│                     DATA STORAGE LAYER                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │ Raw Cache    │  │ Intermediate │  │ Final Output │          │
│  │ (Parquet)    │  │ (DuckDB)     │  │ (TSV/Parquet)│          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
├─────────────────────────────────────────────────────────────────┤
│                 INTEGRATION/SCORING LAYER                        │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Multi-evidence scoring: weighted rule-based integration │   │
│  │  Known gene exclusion filter                             │   │
│  │  Confidence tier assignment (high/medium/low)            │   │
│  └──────────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────┤
│                      REPORTING LAYER                             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Per-gene evidence summary generation                    │   │
│  │  Ranked candidate lists by confidence tier              │   │
│  │  Provenance tracking (data versions, tool versions)     │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Component Responsibilities

| Component | Responsibility | Typical Implementation |
| --- | --- | --- |
| CLI Orchestrator | Entry point, subcommand routing, global config management | Typer (modern, type-hinted) or Click (mature, widely used) |
| Data Retrievers | Query external APIs/databases, cache raw results, handle rate limits/retries | Per-source Python modules with API clients (requests, biopython) |
| Normalizers | Parse raw formats, map gene IDs to standard vocab, scale scores 0-1 | Pandas/Polars for tabular data, custom parsers for specialized formats |
| Data Storage | Persist intermediate results, enable restartability, support fast queries | DuckDB (analytical queries on intermediate data), Parquet (compressed columnar cache) |
| Integrator | Combine evidence scores, apply weights, filter known genes, assign confidence tiers | Custom Python logic, potentially scikit-learn for rule-based models |
| Reporter | Generate human-readable outputs, track provenance, create evidence summaries | Pandas/Polars for output formatting, Jinja2 for report templates |

Recommended Project Structure

usher-gene-discovery/
├── config/
│   ├── pipeline.yaml          # Main pipeline config (weights, thresholds, data sources)
│   ├── sources.yaml            # API endpoints, file paths, version info
│   └── known_genes.txt         # CiliaCarta/SYSCILIA/OMIM exclusion list
├── data/
│   ├── raw/                    # Cached raw API responses (Parquet, JSON)
│   │   ├── uniprot/
│   │   ├── gtex/
│   │   ├── gnomad/
│   │   └── pubmed/
│   ├── intermediate/           # Normalized per-layer results (DuckDB tables)
│   │   └── evidence.duckdb
│   └── output/                 # Final results
│       ├── candidates_high.tsv
│       ├── candidates_medium.tsv
│       └── evidence_summary.parquet
├── src/
│   ├── pipeline/
│   │   ├── __init__.py
│   │   ├── cli.py              # Main Typer CLI app
│   │   ├── config.py           # Config loading/validation (pydantic)
│   │   └── utils.py            # Shared utilities (gene ID mapping, etc.)
│   ├── retrievers/             # Data retrieval modules (6 evidence layers)
│   │   ├── __init__.py
│   │   ├── annotation.py       # GO/UniProt completeness
│   │   ├── expression.py       # HPA/GTEx/CellxGene tissue expression
│   │   ├── protein_features.py # Domain/motif/coiled-coil analysis
│   │   ├── localization.py     # Subcellular localization
│   │   ├── constraint.py       # gnomAD pLI/LOEUF
│   │   ├── phenotypes.py       # MGI/ZFIN/IMPC animal models
│   │   └── literature.py       # PubMed API scanning
│   ├── normalizers/            # Per-source parsers/normalizers
│   │   ├── __init__.py
│   │   └── [mirrors retrievers structure]
│   ├── integration/
│   │   ├── __init__.py
│   │   ├── scoring.py          # Multi-evidence scoring logic
│   │   ├── filtering.py        # Known gene exclusion
│   │   └── tiering.py          # Confidence tier assignment
│   └── reporting/
│       ├── __init__.py
│       ├── summaries.py        # Per-gene evidence summaries
│       └── provenance.py       # Version tracking, data lineage
├── tests/
│   ├── test_retrievers/
│   ├── test_normalizers/
│   ├── test_integration/
│   └── fixtures/               # Small test datasets
├── scripts/
│   └── validate_outputs.py     # Manual validation/QC scripts
├── pyproject.toml              # Dependencies, tool config
├── README.md
└── .env.example                # API keys template

Structure Rationale

  • config/: YAML for pipeline configuration (human-editable, version-controllable, standard in bioinformatics). TOML was considered, but YAML dominates bioinformatics workflow tooling (Snakemake, nf-core, CWL).
  • data/ hierarchy: Separates raw (cacheable, immutable) from intermediate (queryable) from output (deliverable). Enables pipeline restartability and debugging.
  • src/pipeline/ vs src/retrievers/: CLI orchestration separated from business logic. Retrievers/normalizers mirror structure for clarity.
  • src/integration/: Scoring logic isolated from data retrieval. Critical for testing/validation and future refinement of scoring weights.
  • tests/ mirrors src/: Standard Python pattern. Fixtures for small validation datasets.

Architectural Patterns

Pattern 1: Independent Evidence Layers (Modular Retrieval)

What: Each evidence layer (annotation, expression, protein features, etc.) is an independent module with its own retriever, normalizer, and output schema. Modules do NOT communicate directly—only via the shared data storage layer.

When to use: Always in multi-evidence pipelines. Critical for your 6-layer architecture.

Trade-offs:

  • Pro: Layers can be developed/tested/run independently. Failures in one layer don't cascade. Easy to add/remove evidence sources.
  • Pro: Parallelizable execution (run all 6 retrievers concurrently).
  • Con: Requires discipline to maintain consistent gene ID mapping and schema standards across layers.

Example:

# src/retrievers/annotation.py
from typing import List
import polars as pl

class AnnotationRetriever:
    def __init__(self, config):
        self.config = config

    def retrieve(self, gene_ids: List[str]) -> pl.DataFrame:
        """Fetch GO/UniProt completeness for genes."""
        # Query UniProt API (_query_uniprot and _cache_raw are private helpers, omitted here)
        raw_data = self._query_uniprot(gene_ids)
        # Cache raw response under data/raw/uniprot/
        self._cache_raw(raw_data, "uniprot")
        return raw_data

    def normalize(self, raw_data: pl.DataFrame) -> pl.DataFrame:
        """Normalize to standard schema."""
        return raw_data.select([
            pl.col("gene_id"),
            pl.col("annotation_score").cast(pl.Float32),  # 0-1 scale
            pl.col("go_term_count"),
            pl.col("data_source").cast(pl.Utf8)
        ])

Pattern 2: Staged Data Persistence (Cache-First Pipeline)

What: Each pipeline stage writes output to disk/database before proceeding. Use Parquet for raw caches (immutable, compressed), DuckDB for intermediate results (queryable), TSV/Parquet for final outputs (portable).

When to use: Always in long-running bioinformatics pipelines with external API dependencies. Essential for reproducibility.

Trade-offs:

  • Pro: Pipeline restartability (skip completed stages). Debugging (inspect intermediate outputs). Provenance (timestamp, version, source).
  • Pro: DuckDB enables fast analytical queries on intermediate data without full in-memory loads.
  • Con: Disk I/O overhead (minor for your ~20K gene scale). Storage requirements (mitigated by Parquet compression).

Example:

# src/pipeline/cli.py
import typer
import duckdb
from pathlib import Path

app = typer.Typer()

@app.command()
def run_layer(layer: str, force: bool = False):
    """Run single evidence layer retrieval + normalization."""
    cache_path = Path(f"data/raw/{layer}/results.parquet")

    # Check cache unless force refresh
    if cache_path.exists() and not force:
        typer.echo(f"Using cached {layer} data from {cache_path}")
        return

    # Retrieve + normalize (get_retriever and gene_list are assumed to come from utils/config)
    retriever = get_retriever(layer)
    raw_data = retriever.retrieve(gene_list)
    normalized = retriever.normalize(raw_data)

    # Persist to Parquet + DuckDB (quote the table name: "constraint" is a reserved SQL keyword)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    normalized.write_parquet(cache_path)
    conn = duckdb.connect("data/intermediate/evidence.duckdb")
    conn.execute(f"""CREATE OR REPLACE TABLE "{layer}" AS SELECT * FROM '{cache_path}'""")
    conn.close()

@app.command()
def integrate(output: Path = Path("data/output/candidates_high.tsv")):
    """Integrate all evidence layers and generate candidates."""
    conn = duckdb.connect("data/intermediate/evidence.duckdb")

    # SQL-based multi-evidence integration
    query = """
    SELECT
        a.gene_id,
        a.annotation_score * 0.15 +
        e.expression_score * 0.20 +
        p.protein_score * 0.15 +
        l.localization_score * 0.25 +
        c.constraint_score * 0.15 +
        ph.phenotype_score * 0.10 AS total_score
    FROM annotation a
    JOIN expression e ON a.gene_id = e.gene_id
    JOIN protein_features p ON a.gene_id = p.gene_id
    JOIN localization l ON a.gene_id = l.gene_id
    JOIN "constraint" c ON a.gene_id = c.gene_id  -- quoted: CONSTRAINT is a reserved keyword
    JOIN phenotypes ph ON a.gene_id = ph.gene_id
    WHERE total_score >= 0.7  -- High confidence threshold
    ORDER BY total_score DESC
    """

    result = conn.execute(query).pl()
    result.write_csv(output, separator="\t")

Pattern 3: Configuration-Driven Behavior (Declarative Pipeline)

What: All scoring weights, thresholds, data source URLs, API keys, etc. defined in YAML config files. Code reads config, doesn't hardcode parameters.

When to use: Always for reproducible research pipelines. Allows parameter tuning without code changes.

Trade-offs:

  • Pro: Configuration versioned in git → reproducibility. Easy to A/B test scoring schemes. Non-programmers can adjust weights.
  • Con: Requires config validation layer (use pydantic). Risk of config drift if not validated.

Example:

# config/pipeline.yaml
scoring:
  weights:
    annotation: 0.15
    expression: 0.20
    protein_features: 0.15
    localization: 0.25
    constraint: 0.15
    phenotypes: 0.10

  thresholds:
    high_confidence: 0.7
    medium_confidence: 0.5
    low_confidence: 0.3

data_sources:
  uniprot:
    base_url: "https://rest.uniprot.org/uniprotkb/search"
    version: "2026_01"
  gtex:
    file_path: "/data/external/GTEx_v8_median_tpm.tsv"
    version: "v8"

known_genes:
  exclusion_lists:
    - "config/ciliacarta_genes.txt"
    - "config/syscilia_genes.txt"
    - "config/omim_usher_genes.txt"

# src/pipeline/config.py
from pydantic import BaseModel, Field
from typing import Dict
import yaml

class ScoringConfig(BaseModel):
    weights: Dict[str, float] = Field(..., description="Evidence layer weights")
    thresholds: Dict[str, float]

class PipelineConfig(BaseModel):
    scoring: ScoringConfig
    data_sources: Dict
    known_genes: Dict

def load_config(path: str = "config/pipeline.yaml") -> PipelineConfig:
    with open(path) as f:
        config_dict = yaml.safe_load(f)
    return PipelineConfig(**config_dict)
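
A minimal usage sketch (attribute names follow the pydantic models above; the weight-sum check is an illustrative extra, not part of the original design):

# Load and validate the config once at pipeline start-up
config = load_config("config/pipeline.yaml")

weights = config.scoring.weights                      # e.g. {"annotation": 0.15, ...}
high_cutoff = config.scoring.thresholds["high_confidence"]

# Optional sanity check: evidence layer weights should sum to 1.0 before scoring
assert abs(sum(weights.values()) - 1.0) < 1e-6, "Evidence layer weights must sum to 1.0"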

Pattern 4: Provenance Tracking (Reproducibility First)

What: Every output file/table includes metadata: pipeline version, data source versions, timestamp, config hash. Enables exact reproduction.

When to use: Always in research pipelines. Critical for publication and validation.

Trade-offs:

  • Pro: Full audit trail. Can reproduce results from 6 months ago. Meets FAIR data principles.
  • Con: Requires discipline to capture versions (use git tags, API version headers, file checksums).

Example:

# src/reporting/provenance.py
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime
from pathlib import Path

def generate_provenance() -> dict:
    """Generate provenance metadata for pipeline run."""
    return {
        "pipeline_version": get_git_version(),
        "run_timestamp": datetime.utcnow().isoformat(),
        "config_hash": hash_config("config/pipeline.yaml"),
        "data_sources": {
            "uniprot": {"version": "2026_01", "retrieved": "2026-02-10"},
            "gtex": {"version": "v8", "file_modified": "2023-11-15"},
            # ... per-source metadata
        },
        "dependencies": get_package_versions(),
        "runtime_env": {
            "python": sys.version,
            "platform": platform.platform()
        }
    }

def get_git_version() -> str:
    """Get current git commit hash."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"]
    ).decode().strip()

def hash_config(path: str) -> str:
    """SHA256 hash of config file."""
    content = Path(path).read_bytes()
    return hashlib.sha256(content).hexdigest()[:8]
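
A short usage sketch: write the provenance dict as a JSON sidecar next to each output file (the sidecar naming convention is an assumption, not specified above):

def write_provenance_sidecar(output_path: Path) -> Path:
    """Write <output>.provenance.json alongside the given output file."""
    sidecar = output_path.with_suffix(output_path.suffix + ".provenance.json")
    sidecar.write_text(json.dumps(generate_provenance(), indent=2))
    return sidecar

# write_provenance_sidecar(Path("data/output/candidates_high.tsv"))
# -> data/output/candidates_high.tsv.provenance.json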

Data Flow

Pipeline Execution Flow

┌──────────────────────────────────────────────────────────────┐
│  1. INITIALIZATION                                           │
│     Load config/pipeline.yaml → Validate with pydantic       │
│     Load gene list (~20K human protein-coding genes)         │
│     Check data/raw/ cache status                             │
└────────────────────┬─────────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────────┐
│  2. PARALLEL RETRIEVAL (6 evidence layers + literature)      │
│     For each layer:                                          │
│       - Query external API/database                          │
│       - Cache raw response → data/raw/{layer}/               │
│       - Normalize to standard schema                         │
│       - Write to DuckDB table                                │
│                                                              │
│     [annotation] [expression] [protein] [localization]      │
│     [constraint] [phenotypes] [literature]                  │
│                                                              │
│     All layers independent, no inter-dependencies            │
└────────────────────┬─────────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────────┐
│  3. INTEGRATION                                              │
│     SQL join on gene_id across all DuckDB tables            │
│     Apply weighted scoring formula from config               │
│     Filter out known genes (CiliaCarta/SYSCILIA/OMIM)       │
│     Assign confidence tiers (high/medium/low)                │
└────────────────────┬─────────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────────┐
│  4. REPORTING                                                │
│     Generate per-gene evidence summaries                     │
│     Write tiered candidate lists (TSV)                       │
│     Attach provenance metadata                               │
│     Optionally: visualizations, statistics                   │
└──────────────────────────────────────────────────────────────┘
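
Step 2 is embarrassingly parallel because the layers share no dependencies. A minimal orchestration sketch using a thread pool (get_retriever is the same assumed factory as in the CLI example; persistence follows Pattern 2):

from concurrent.futures import ThreadPoolExecutor, as_completed

LAYERS = ["annotation", "expression", "protein_features",
          "localization", "constraint", "phenotypes", "literature"]

def run_one_layer(layer: str, gene_list: list[str]) -> str:
    """Retrieve, normalize, and cache a single evidence layer."""
    retriever = get_retriever(layer)                       # assumed factory
    normalized = retriever.normalize(retriever.retrieve(gene_list))
    normalized.write_parquet(f"data/raw/{layer}/results.parquet")
    return layer

def run_all_layers(gene_list: list[str], max_workers: int = 4) -> None:
    # Threads fit here: the work is I/O-bound (HTTP requests), not CPU-bound.
    # Register the Parquet files in DuckDB afterwards, sequentially, to avoid concurrent writers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one_layer, layer, gene_list): layer for layer in LAYERS}
        for fut in as_completed(futures):
            print(f"finished layer: {fut.result()}")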

Data Schema Evolution

RAW DATA (heterogeneous formats)
  ├── UniProt XML/JSON
  ├── GTEx TSV matrices
  ├── gnomAD VCF/TSV
  └── PubMed XML
      ↓ NORMALIZATION
INTERMEDIATE DATA (standardized schema in DuckDB)
  ├── Table: annotation
  │   ├── gene_id (TEXT, primary key)
  │   ├── annotation_score (FLOAT 0-1)
  │   ├── go_term_count (INT)
  │   └── data_source (TEXT)
  ├── Table: expression
  │   ├── gene_id (TEXT)
  │   ├── expression_score (FLOAT 0-1)
  │   ├── tissue_specificity (FLOAT)
  │   └── ciliated_tissue_enrichment (BOOLEAN)
  └── [similar schema for other 5 layers]
      ↓ INTEGRATION
FINAL OUTPUT (candidate lists)
  ├── candidates_high.tsv
  │   ├── gene_id
  │   ├── total_score (weighted sum)
  │   ├── confidence_tier (high/medium/low)
  │   └── [per-layer scores for transparency]
  └── evidence_summary.parquet
      └── [detailed per-gene evidence provenance]

Key Data Flow Principles

  1. Unidirectional flow: Raw → Intermediate → Output (no backflow, no circular dependencies)
  2. Gene ID as universal key: All tables join on standardized gene_id (Ensembl or HGNC)
  3. Score normalization: All evidence scores scaled 0-1 for comparability (a minimal scaling sketch follows this list)
  4. Immutable raw cache: Never modify data/raw/ after creation (versioned, timestamped)
  5. DuckDB as integration hub: Analytical queries across layers without full in-memory loads
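
For principle 3, a minimal scaling sketch in Polars (min-max scaling is one option; rank-based scaling is more robust to outliers and is an equally valid choice):

import polars as pl

def min_max_scale(df: pl.DataFrame, col: str, out_col: str) -> pl.DataFrame:
    """Scale a raw evidence metric into the 0-1 range shared by all layers."""
    lo, hi = df[col].min(), df[col].max()
    if hi == lo:  # degenerate case: constant column
        return df.with_columns(pl.lit(0.5).alias(out_col))
    return df.with_columns(((pl.col(col) - lo) / (hi - lo)).alias(out_col))

# Example: LOEUF is "lower = more constrained", so invert after scaling
# constraint = min_max_scale(constraint, "loeuf", "loeuf_scaled").with_columns(
#     (1 - pl.col("loeuf_scaled")).alias("constraint_score"))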

Build Order and Dependencies

Given the component dependencies and validation requirements, build in this sequence:

Phase 1: Infrastructure (Week 1)

Goal: Establish project skeleton, config management, data storage patterns

Components:

  1. Project structure (pyproject.toml, src/pipeline/, config/, data/ hierarchy)
  2. Config loading/validation (config.py with pydantic models)
  3. DuckDB schema setup (create tables for 6 evidence layers; see the sketch at the end of this phase)
  4. Utility functions (gene ID mapping, caching helpers)

Why first: All downstream components depend on config system and data storage infrastructure.

Validation: Load sample config, create DuckDB tables, test gene ID mapping utility.
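
A minimal schema-bootstrap sketch for Phase 1 (the per-layer score column name mirrors the intermediate schema shown above; the extra columns are illustrative):

import duckdb

LAYERS = ["annotation", "expression", "protein_features",
          "localization", "constraint", "phenotypes", "literature"]

def init_schema(db_path: str = "data/intermediate/evidence.duckdb") -> None:
    conn = duckdb.connect(db_path)
    for layer in LAYERS:
        # gene_id is the universal join key; table names are quoted because
        # "constraint" is a reserved SQL keyword.
        conn.execute(f"""
            CREATE TABLE IF NOT EXISTS "{layer}" (
                gene_id       TEXT PRIMARY KEY,
                {layer}_score FLOAT,
                data_source   TEXT,
                retrieved_at  TIMESTAMP
            )
        """)
    conn.close()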

Phase 2: Single Evidence Layer Prototype (Week 2)

Goal: Prove the retrieval → normalization → storage pattern with one layer

Components:

  1. Choose simplest layer (e.g., genetic constraint: gnomAD pLI/LOEUF)
  2. Build retriever (API query, rate limiting, error handling; a retry/backoff sketch follows this phase)
  3. Build normalizer (parse response, map gene IDs, scale scores)
  4. Integrate with DuckDB storage

Why second: Validates architecture before scaling to 6 layers. Identifies issues early.

Validation: Run on 100 test genes, inspect Parquet cache + DuckDB table, verify schema.
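
A retry/backoff sketch for the retriever's API access (the endpoint and response shape are generic placeholders, not a specific gnomAD API):

import time
import requests

def fetch_with_retry(url: str, params: dict, max_retries: int = 3,
                     backoff_s: float = 2.0, timeout_s: float = 30.0) -> dict:
    """GET a JSON endpoint with exponential backoff on rate limits and transient errors."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, params=params, timeout=timeout_s)
            if resp.status_code == 429:  # rate limited: honour Retry-After when present
                wait = float(resp.headers.get("Retry-After", backoff_s * (2 ** attempt)))
                time.sleep(wait)
                continue
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError(f"Exhausted retries for {url}")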

Phase 3: Remaining Evidence Layers (Weeks 3-5)

Goal: Replicate Phase 2 pattern for all 6 layers + literature scan

Components:

  1. Annotation (GO/UniProt completeness)
  2. Expression (HPA/GTEx/CellxGene)
  3. Protein features (domains/motifs/coiled-coil)
  4. Localization (subcellular)
  5. Phenotypes (MGI/ZFIN/IMPC)
  6. Literature (PubMed API)

Why third: Independent layers can be built in parallel (or sequentially). No inter-dependencies.

Validation: Per-layer validation on test gene sets. Check schema consistency across layers.

Phase 4: Integration Layer (Week 6)

Goal: Combine evidence scores, apply weights, filter known genes

Components:

  1. Multi-evidence scoring logic (SQL joins in DuckDB)
  2. Known gene filter (CiliaCarta/SYSCILIA/OMIM exclusion)
  3. Confidence tier assignment (a filtering/tiering sketch follows this phase)
  4. Initial output generation (TSV candidate lists)

Why fourth: Requires all evidence layers complete. Core scientific logic.

Validation: Compare integrated scores against manual calculations for 10 test genes.
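
A filtering/tiering sketch (assumes one gene ID per line in the exclusion files and a Polars DataFrame with gene_id and total_score columns; thresholds come from config/pipeline.yaml):

import polars as pl
from pathlib import Path

def load_exclusion_set(paths: list[str]) -> set[str]:
    """Union of known-gene IDs from the CiliaCarta/SYSCILIA/OMIM exclusion lists."""
    genes: set[str] = set()
    for p in paths:
        genes |= {line.strip() for line in Path(p).read_text().splitlines() if line.strip()}
    return genes

def filter_and_tier(scored: pl.DataFrame, known: set[str],
                    high: float = 0.7, medium: float = 0.5) -> pl.DataFrame:
    """Drop known genes, then assign confidence tiers from the configured thresholds."""
    return (
        scored
        .filter(~pl.col("gene_id").is_in(list(known)))
        .with_columns(
            pl.when(pl.col("total_score") >= high).then(pl.lit("high"))
              .when(pl.col("total_score") >= medium).then(pl.lit("medium"))
              .otherwise(pl.lit("low"))
              .alias("confidence_tier")
        )
    )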

Phase 5: Reporting + Provenance (Week 7)

Goal: Generate publication-ready outputs with full audit trail

Components:

  1. Per-gene evidence summaries
  2. Provenance metadata generation
  3. Output formatting (TSV + Parquet)
  4. Optional: visualizations, statistics

Why fifth: Depends on integration layer. Presentation layer.

Validation: Manually review summaries for 20 high-confidence candidates. Verify provenance completeness.

Phase 6: CLI + Orchestration (Week 8)

Goal: Unified command-line interface for end-to-end pipeline

Components:

  1. Typer CLI app with subcommands (run-layer, integrate, report)
  2. Pipeline orchestration (dependency checking, progress logging)
  3. Force-refresh flags, partial reruns

Why last: Integrates all components. User-facing interface.

Validation: Run full pipeline end-to-end on 1000 test genes. Benchmark runtime.

Component Dependency Graph

┌──────────────┐
│ Config +     │ ← PHASE 1 (foundation)
│ Storage      │
└───────┬──────┘
        ↓
┌───────┴──────────────────────────────────┐
│ Single Layer Prototype                   │ ← PHASE 2 (proof of concept)
│ (Retriever + Normalizer + DuckDB)        │
└───────┬──────────────────────────────────┘
        ↓
┌───────┴──────────────────────────────────┐
│ 6 Evidence Layers (independent modules)  │ ← PHASE 3 (parallel work)
│ + Literature Scan                        │
└───────┬──────────────────────────────────┘
        ↓
┌───────┴──────────────────────────────────┐
│ Integration Layer                        │ ← PHASE 4 (core logic)
│ (Scoring + Filtering + Tiering)          │
└───────┬──────────────────────────────────┘
        ↓
┌───────┴──────────────────────────────────┐
│ Reporting + Provenance                   │ ← PHASE 5 (presentation)
└───────┬──────────────────────────────────┘
        ↓
┌───────┴──────────────────────────────────┐
│ CLI Orchestration                        │ ← PHASE 6 (interface)
└──────────────────────────────────────────┘

  • Critical path: Phase 1 → Phase 2 → Phase 4 (infrastructure + prototype + integration)
  • Parallelizable: Phase 3 (the 6 evidence layers can be built by different team members simultaneously)
  • Deferrable: Phase 5 (reporting) and Phase 6 (CLI polish) can be minimal initially

Anti-Patterns to Avoid

Anti-Pattern 1: Hardcoded API Keys and Endpoints

What people do: Embed API keys directly in code (API_KEY = "abc123") or hardcode URLs.

Why it's wrong: Security risk (keys in git history), inflexibility (can't switch endpoints without code changes), breaks reproducibility (different users need different configs).

Do this instead: Use .env files for secrets (never committed), config files for endpoints. Load with python-dotenv or pydantic-settings.

# WRONG
def query_uniprot():
    response = requests.get("https://rest.uniprot.org/...", headers={"API-Key": "secret123"})

# RIGHT
import requests
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    uniprot_api_key: str
    uniprot_base_url: str

    class Config:
        env_file = ".env"

settings = Settings()
response = requests.get(settings.uniprot_base_url, headers={"API-Key": settings.uniprot_api_key})

Anti-Pattern 2: Monolithic All-in-One Script

What people do: Single 3000-line Python script that retrieves all data, normalizes, scores, and reports in one execution.

Why it's wrong: Impossible to restart from failure point (API rate limit at layer 5 → restart all 6 layers). Hard to test (can't unit test individual layers). No intermediate validation. Debugging nightmare.

Do this instead: Modular architecture with staged persistence (this entire ARCHITECTURE.md). Each layer is independently runnable. CLI orchestrates but doesn't enforce monolithic execution.

Anti-Pattern 3: In-Memory-Only Data Processing

What people do: Load all 20K genes × 6 evidence layers into Pandas DataFrames, perform joins in memory, never write intermediate results.

Why it's wrong: Memory pressure (especially with expression matrices). No restartability. No audit trail. Limits scalability (what if you scale to 40K genes?).

Do this instead: Use DuckDB for intermediate storage. Write Parquet caches. DuckDB queries Parquet files directly (out-of-core analytics). Memory-efficient even at larger scales.
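
A small sketch of the out-of-core style (the Parquet path follows the data/raw/{layer}/ convention used above):

import duckdb

# Query the Parquet cache directly; DuckDB streams from disk rather than
# materializing 20K genes x 6 layers in a single in-memory DataFrame.
conn = duckdb.connect()
top_expressed = conn.execute("""
    SELECT gene_id, expression_score
    FROM 'data/raw/expression/results.parquet'
    WHERE expression_score >= 0.8
    ORDER BY expression_score DESC
""").pl()   # only the small filtered result is handed to Polars
conn.close()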

Anti-Pattern 4: Ignoring Gene ID Ambiguity

What people do: Assume gene symbols are unique/stable. Join datasets on raw gene symbols from different sources.

Why it's wrong: Gene symbols are ambiguous (HLA-A vs HLAA), change over time, vary by organism. UniProt uses one ID system, GTEx another, gnomAD another. Mismatches cause silent data loss.

Do this instead: Standardize ALL gene identifiers to a single system (Ensembl or HGNC) during normalization. Build a gene ID mapper utility. Validate mappings (flag unmapped genes). Include original IDs in provenance.

# src/pipeline/utils.py
import logging
from typing import Dict, List

import mygene

def standardize_gene_ids(gene_list: List[str], target_id: str = "ensembl") -> Dict[str, str]:
    """Map heterogeneous gene IDs to standard Ensembl IDs."""
    mg = mygene.MyGeneInfo()
    results = mg.querymany(gene_list, scopes="symbol,entrezgene,uniprot",
                           fields="ensembl.gene", species="human")

    mapping = {}
    for result in results:
        if "ensembl" in result:
            mapping[result["query"]] = result["ensembl"]["gene"]
        else:
            # Log unmapped genes for manual review
            logging.warning(f"Could not map gene ID: {result['query']}")

    return mapping

Anti-Pattern 5: No Versioning or Provenance

What people do: Run pipeline, generate candidates.tsv, no record of config used or data source versions. Rerun 3 months later, get different results, can't explain why.

Why it's wrong: Irreproducible research. Can't debug score changes. Can't publish without provenance. Violates FAIR principles.

Do this instead: Implement Pattern 4 (Provenance Tracking). Every output has metadata footer/sidecar. Version config files in git. Tag releases. Include data source versions, tool versions, timestamps in all outputs.

Integration Points

External Services and APIs

| Service | Integration Pattern | Rate Limits / Notes |
| --- | --- | --- |
| UniProt REST API | HTTP GET with query params, JSON response | 1000 requests/hour. Use batch queries (gene_ids=A,B,C). Cache aggressively. |
| GTEx Portal | Download static TSV files, local queries | No API. Download median TPM matrices (~3 GB). Update quarterly. |
| gnomAD | Query via GraphQL API or download VCF | API: 10 requests/sec. Consider downloading constraint metrics TSV (500 MB). |
| Human Protein Atlas | Download bulk TSV files | No real-time API. Download tissue expression TSV. |
| CellxGene | Download h5ad files or use API | Census API available. Large files (>10 GB). Consider pre-filtering cell types. |
| PubMed E-utilities | Entrez Programming Utilities (E-utils) | 3 requests/sec without API key, 10/sec with key. Use batch queries. |
| MGI/ZFIN/IMPC | Download phenotype association files | Bulk downloads (TSV/XML). Monthly updates. |
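
The UniProt row above recommends batch queries; a minimal batching sketch against the rest.uniprot.org search endpoint (query syntax and field names are illustrative — verify against current UniProt documentation before relying on them):

import requests

UNIPROT_URL = "https://rest.uniprot.org/uniprotkb/search"   # as in config/pipeline.yaml

def query_uniprot_batch(gene_symbols: list[str], batch_size: int = 50) -> list[dict]:
    """Combine gene symbols into OR-queries to stay well under the rate limit."""
    records: list[dict] = []
    for i in range(0, len(gene_symbols), batch_size):
        batch = gene_symbols[i:i + batch_size]
        query = "(" + " OR ".join(f"gene_exact:{g}" for g in batch) + ") AND organism_id:9606"
        resp = requests.get(UNIPROT_URL, params={
            "query": query,
            "format": "json",
            "fields": "accession,gene_names,annotation_score",
            "size": 500,
        }, timeout=60)
        resp.raise_for_status()
        records.extend(resp.json().get("results", []))
    return records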

Internal Module Boundaries

| Boundary | Communication | Notes |
| --- | --- | --- |
| CLI ↔ Retrievers | Function calls with config objects | CLI passes validated config, receives DataFrames/file paths |
| Retrievers ↔ Normalizers | DataFrame handoff (Polars/Pandas) | Retrievers return raw DataFrames, normalizers return standardized schema |
| Normalizers ↔ Storage | File writes (Parquet) + DuckDB inserts | Normalizers write directly, no abstraction layer needed |
| Storage ↔ Integration | SQL queries (DuckDB connection) | Integration layer reads via SQL joins, no Python data structures |
| Integration ↔ Reporting | DataFrame handoff | Integration returns scored DataFrame, reporting formats for output |

Data Format Interchange

External APIs → JSON/XML/TSV (heterogeneous)
      ↓
Retrievers → Polars DataFrame (in-memory, typed)
      ↓
Normalizers → Polars DataFrame (standardized schema)
      ↓
Storage Layer → Parquet (disk cache) + DuckDB (queryable)
      ↓
Integration → DuckDB Result (SQL query output)
      ↓
Reporting → TSV/Parquet (final outputs)

Why Polars over Pandas? Faster, more memory-efficient, better type system, native lazy evaluation. Polars adoption has grown rapidly in modern bioinformatics Python pipelines (2024-2026).

Why DuckDB over SQLite? Columnar storage (better for analytics), vectorized execution (10-100x faster for aggregations), native Parquet support, can query files without import.

Scaling Considerations

Current Scale (20K genes)

Architecture: Single-machine Python pipeline with local DuckDB. No distributed computing needed.

Bottlenecks:

  • API rate limits (PubMed: 10 req/sec, UniProt: 1000 req/hour) → solved by caching + batch queries
  • Expression matrix size (GTEx + CellxGene: ~10 GB) → DuckDB handles via out-of-core queries

Runtime estimate:

  • Data retrieval (with caching): 1-4 hours (depending on API latency)
  • Normalization + integration: <15 minutes (DuckDB vectorized queries)
  • Total: ~2-5 hours for full pipeline run (initial). <30 min for reruns with cache.

Medium Scale (50K genes, e.g., include non-coding)

Architecture adjustments: Same architecture. Increase DuckDB memory limit. Consider Parquet partitioning by chromosome.

Bottlenecks: Expression matrices grow to ~25 GB. Still manageable with 32 GB RAM + DuckDB out-of-core.

Large Scale (500K genes, e.g., cross-species)

Architecture adjustments: Consider workflow orchestrators (Nextflow/Snakemake) for parallelization. Distribute retrieval across multiple workers. Switch from local DuckDB to distributed query engine (DuckDB over S3, or Apache Spark).

When to consider: Only if scaling beyond single organism or adding many more evidence layers.

GPU Utilization Note

Your NVIDIA 4090 GPU is available but not needed for this pipeline. Use cases for GPU:

  • Deep learning-based gene prioritization (not rule-based scoring)
  • Protein structure prediction (AlphaFold, not in scope)
  • Large-scale sequence alignment (not in scope)

Rule-based multi-evidence scoring is CPU-bound (DuckDB is CPU-optimized). GPU would be idle.

Validation and Testing Strategy

Testing Pyramid

┌─────────────────────────┐
│  Integration Tests      │  ← Full pipeline on 100 test genes
│  (slow, comprehensive)  │
├─────────────────────────┤
│  Component Tests        │  ← Per-layer retrieval + normalization
│  (medium, focused)      │     Mock API responses
├─────────────────────────┤
│  Unit Tests             │  ← Utility functions, scoring logic
│  (fast, narrow)         │     Gene ID mapping, score calculations
└─────────────────────────┘

Validation Datasets

  1. Positive controls: Known ciliopathy genes (USH2A, MYO7A, CDH23) → should score HIGH
  2. Negative controls: Housekeeping genes (GAPDH, ACTB) → should score LOW
  3. Novel candidates: Manually validate 10 high-scoring unknowns via literature review
  4. Benchmark datasets: If available, compare to published cilia gene sets (CiliaCarta as gold standard)

Quality Gates

  • Schema validation: All DuckDB tables match expected schema (pydantic models)
  • Completeness checks: All 20K genes have entries in all 6 evidence tables (flag missing data)
  • Score sanity checks: No scores <0 or >1, distribution checks (not all genes score 0.5); a pytest sketch follows this list
  • Provenance validation: Every output has complete metadata (no missing versions/timestamps)
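
A minimal pytest sketch for the score-range and completeness gates (table and score column names follow the intermediate schema; adjust to the real layer tables):

# tests/test_integration/test_quality_gates.py
import duckdb
import pytest

LAYERS = ["annotation", "expression", "protein_features",
          "localization", "constraint", "phenotypes"]

@pytest.fixture(scope="module")
def conn():
    con = duckdb.connect("data/intermediate/evidence.duckdb", read_only=True)
    yield con
    con.close()

@pytest.mark.parametrize("layer", LAYERS)
def test_layer_scores_in_unit_interval(conn, layer):
    """No normalized evidence score may fall outside the 0-1 range."""
    n_bad = conn.execute(
        f'SELECT COUNT(*) FROM "{layer}" WHERE {layer}_score < 0 OR {layer}_score > 1'
    ).fetchone()[0]
    assert n_bad == 0

@pytest.mark.parametrize("layer", LAYERS)
def test_layer_table_is_populated(conn, layer):
    """Completeness check: every layer table should cover the gene universe."""
    n = conn.execute(f'SELECT COUNT(*) FROM "{layer}"').fetchone()[0]
    assert n > 0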

Technology Choices Summary

| Layer | Technology | Reasoning |
| --- | --- | --- |
| CLI Framework | Typer (or Click) | Type-hinted, auto-generated help, modern Python. Click if maximum maturity is needed. |
| Config Format | YAML | Dominant in bioinformatics. Human-readable. Pydantic validation. |
| Data Processing | Polars | Faster and more memory-efficient than Pandas. Native lazy evaluation. Modern standard. |
| Intermediate Storage | DuckDB | Columnar analytics, out-of-core queries, native Parquet support. 10-100x faster than SQLite for aggregations. |
| Cache Format | Parquet | Compressed columnar format. Standard for big data. DuckDB-native. |
| Final Output | TSV + Parquet | TSV for human readability and Excel compatibility. Parquet for programmatic access. |
| API Client | requests + biopython | Standard for REST APIs. Biopython for Entrez/PubMed utilities. |
| Gene ID Mapping | mygene.info API | Comprehensive cross-reference database. MyGeneInfo Python client. |


Architecture research for: Usher Cilia Gene Discovery Pipeline
Researched: 2026-02-11