docs: complete project research

2026-02-11 14:52:06 +08:00
parent c0abe8bc6c
commit bb7bfaedab
5 changed files with 2133 additions and 0 deletions


@@ -0,0 +1,812 @@
# Architecture Research
**Domain:** Bioinformatics Gene Candidate Discovery Pipeline
**Researched:** 2026-02-11
**Confidence:** MEDIUM-HIGH
## Standard Architecture
Multi-evidence gene prioritization pipelines in bioinformatics follow a **layered architecture** with independent data retrieval modules feeding into normalization, feature extraction, scoring, and validation layers. The standard pattern is a **staged pipeline** where each evidence layer operates independently, writes intermediate results, then feeds into a final integration/scoring component.
### System Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ CLI ORCHESTRATION LAYER │
│ (Typer/Click: main pipeline script + per-layer subcommands) │
├─────────────────────────────────────────────────────────────────┤
│ DATA RETRIEVAL LAYER (6 modules) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ GO/UniProt│ │ Tissue │ │ Protein │ │ Subcell │ │
│ │ Annot. │ │ Expr. │ │ Features │ │ Localiz. │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ ┌────┴─────┐ ┌───┴──────┐ │
│ │ Genetic │ │ Animal │ + Literature Scan (PubMed API) │
│ │ Constrt. │ │ Models │ │
│ └────┬─────┘ └────┬─────┘ │
├───────┴──────────────┴──────────────────────────────────────────┤
│ NORMALIZATION/TRANSFORM LAYER │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Per-layer parsers: Raw → Standardized schema │ │
│ │ Gene ID mapping (all to Ensembl/HGNC) │ │
│ │ Score normalization (0-1 scale per evidence type) │ │
│ └──────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ DATA STORAGE LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Raw Cache │ │ Intermediate │ │ Final Output │ │
│ │ (Parquet) │ │ (DuckDB) │ │ (TSV/Parquet)│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ INTEGRATION/SCORING LAYER │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Multi-evidence scoring: weighted rule-based integration │ │
│ │ Known gene exclusion filter │ │
│ │ Confidence tier assignment (high/medium/low) │ │
│ └──────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ REPORTING LAYER │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Per-gene evidence summary generation │ │
│ │ Ranked candidate lists by confidence tier │ │
│ │ Provenance tracking (data versions, tool versions) │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
### Component Responsibilities
| Component | Responsibility | Typical Implementation |
|-----------|----------------|------------------------|
| **CLI Orchestrator** | Entry point, subcommand routing, global config management | Typer (modern, type-hinted) or Click (mature, widely used) |
| **Data Retrievers** | Query external APIs/databases, cache raw results, handle rate limits/retries | Per-source Python modules with API clients (requests, biopython) |
| **Normalizers** | Parse raw formats, map gene IDs to standard vocab, scale scores 0-1 | Pandas/Polars for tabular data, custom parsers for specialized formats |
| **Data Storage** | Persist intermediate results, enable restartability, support fast queries | DuckDB (analytical queries on intermediate data), Parquet (compressed columnar cache) |
| **Integrator** | Combine evidence scores, apply weights, filter known genes, assign confidence tiers | Custom Python logic, potentially scikit-learn for rule-based models |
| **Reporter** | Generate human-readable outputs, track provenance, create evidence summaries | Pandas/Polars for output formatting, Jinja2 for report templates |
## Recommended Project Structure
```
usher-gene-discovery/
├── config/
│ ├── pipeline.yaml # Main pipeline config (weights, thresholds, data sources)
│ ├── sources.yaml # API endpoints, file paths, version info
│ └── known_genes.txt # CiliaCarta/SYSCILIA/OMIM exclusion list
├── data/
│ ├── raw/ # Cached raw API responses (Parquet, JSON)
│ │ ├── uniprot/
│ │ ├── gtex/
│ │ ├── gnomad/
│ │ └── pubmed/
│ ├── intermediate/ # Normalized per-layer results (DuckDB tables)
│ │ └── evidence.duckdb
│ └── output/ # Final results
│ ├── candidates_high.tsv
│ ├── candidates_medium.tsv
│ └── evidence_summary.parquet
├── src/
│ ├── pipeline/
│ │ ├── __init__.py
│ │ ├── cli.py # Main Typer CLI app
│ │ ├── config.py # Config loading/validation (pydantic)
│ │ └── utils.py # Shared utilities (gene ID mapping, etc.)
│ ├── retrievers/ # Data retrieval modules (6 evidence layers)
│ │ ├── __init__.py
│ │ ├── annotation.py # GO/UniProt completeness
│ │ ├── expression.py # HPA/GTEx/CellxGene tissue expression
│ │ ├── protein_features.py # Domain/motif/coiled-coil analysis
│ │ ├── localization.py # Subcellular localization
│ │ ├── constraint.py # gnomAD pLI/LOEUF
│ │ ├── phenotypes.py # MGI/ZFIN/IMPC animal models
│ │ └── literature.py # PubMed API scanning
│ ├── normalizers/ # Per-source parsers/normalizers
│ │ ├── __init__.py
│ │ └── [mirrors retrievers structure]
│ ├── integration/
│ │ ├── __init__.py
│ │ ├── scoring.py # Multi-evidence scoring logic
│ │ ├── filtering.py # Known gene exclusion
│ │ └── tiering.py # Confidence tier assignment
│ └── reporting/
│ ├── __init__.py
│ ├── summaries.py # Per-gene evidence summaries
│ └── provenance.py # Version tracking, data lineage
├── tests/
│ ├── test_retrievers/
│ ├── test_normalizers/
│ ├── test_integration/
│ └── fixtures/ # Small test datasets
├── scripts/
│ └── validate_outputs.py # Manual validation/QC scripts
├── pyproject.toml # Dependencies, tool config
├── README.md
└── .env.example # API keys template
```
### Structure Rationale
- **config/:** YAML for pipeline configuration (human-editable, version-controllable, standard in bioinformatics). TOML considered but YAML dominates bioinformatics tooling (Nextflow, Snakemake, nf-core).
- **data/ hierarchy:** Separates raw (cacheable, immutable) from intermediate (queryable) from output (deliverable). Enables pipeline restartability and debugging.
- **src/pipeline/ vs src/retrievers/:** CLI orchestration separated from business logic. Retrievers/normalizers mirror structure for clarity.
- **src/integration/:** Scoring logic isolated from data retrieval. Critical for testing/validation and future refinement of scoring weights.
- **tests/ mirrors src/:** Standard Python pattern. Fixtures for small validation datasets.
## Architectural Patterns
### Pattern 1: Independent Evidence Layers (Modular Retrieval)
**What:** Each evidence layer (annotation, expression, protein features, etc.) is an independent module with its own retriever, normalizer, and output schema. Modules do NOT communicate directly—only via the shared data storage layer.
**When to use:** Always in multi-evidence pipelines. Critical for your 6-layer architecture.
**Trade-offs:**
- **Pro:** Layers can be developed/tested/run independently. Failures in one layer don't cascade. Easy to add/remove evidence sources.
- **Pro:** Parallelizable execution (run all 6 retrievers concurrently).
- **Con:** Requires discipline to maintain consistent gene ID mapping and schema standards across layers.
**Example:**
```python
# src/retrievers/annotation.py
from typing import List
import polars as pl
class AnnotationRetriever:
def __init__(self, config):
self.config = config
def retrieve(self, gene_ids: List[str]) -> pl.DataFrame:
"""Fetch GO/UniProt completeness for genes."""
# Query UniProt API
raw_data = self._query_uniprot(gene_ids)
# Cache raw response
self._cache_raw(raw_data, "uniprot")
return raw_data
def normalize(self, raw_data: pl.DataFrame) -> pl.DataFrame:
"""Normalize to standard schema."""
return raw_data.select([
pl.col("gene_id"),
pl.col("annotation_score").cast(pl.Float32), # 0-1 scale
pl.col("go_term_count"),
pl.col("data_source").cast(pl.Utf8)
])
```
### Pattern 2: Staged Data Persistence (Cache-First Pipeline)
**What:** Each pipeline stage writes output to disk/database before proceeding. Use Parquet for raw caches (immutable, compressed), DuckDB for intermediate results (queryable), TSV/Parquet for final outputs (portable).
**When to use:** Always in long-running bioinformatics pipelines with external API dependencies. Essential for reproducibility.
**Trade-offs:**
- **Pro:** Pipeline restartability (skip completed stages). Debugging (inspect intermediate outputs). Provenance (timestamp, version, source).
- **Pro:** DuckDB enables fast analytical queries on intermediate data without full in-memory loads.
- **Con:** Disk I/O overhead (minor for your ~20K gene scale). Storage requirements (mitigated by Parquet compression).
**Example:**
```python
# src/pipeline/cli.py
import typer
import duckdb
from pathlib import Path
app = typer.Typer()
@app.command()
def run_layer(layer: str, force: bool = False):
"""Run single evidence layer retrieval + normalization."""
cache_path = Path(f"data/raw/{layer}/results.parquet")
# Check cache unless force refresh
if cache_path.exists() and not force:
typer.echo(f"Using cached {layer} data from {cache_path}")
return
    # Retrieve + normalize (get_retriever() and gene_list are pipeline helpers defined
    # elsewhere in src/pipeline/; shown here only to keep the example short)
    retriever = get_retriever(layer)
    raw_data = retriever.retrieve(gene_list)
normalized = retriever.normalize(raw_data)
# Persist to Parquet + DuckDB
normalized.write_parquet(cache_path)
conn = duckdb.connect("data/intermediate/evidence.duckdb")
conn.execute(f"CREATE OR REPLACE TABLE {layer} AS SELECT * FROM '{cache_path}'")
conn.close()
@app.command()
def integrate(output: Path = Path("data/output/candidates_high.tsv")):
"""Integrate all evidence layers and generate candidates."""
conn = duckdb.connect("data/intermediate/evidence.duckdb")
# SQL-based multi-evidence integration
query = """
SELECT
a.gene_id,
a.annotation_score * 0.15 +
e.expression_score * 0.20 +
p.protein_score * 0.15 +
l.localization_score * 0.25 +
c.constraint_score * 0.15 +
ph.phenotype_score * 0.10 AS total_score
FROM annotation a
JOIN expression e ON a.gene_id = e.gene_id
JOIN protein_features p ON a.gene_id = p.gene_id
JOIN localization l ON a.gene_id = l.gene_id
    JOIN "constraint" c ON a.gene_id = c.gene_id  -- quoted: CONSTRAINT is a reserved SQL keyword
    JOIN phenotypes ph ON a.gene_id = ph.gene_id
    WHERE total_score >= 0.7  -- high-confidence threshold (DuckDB allows reusing the SELECT alias here)
ORDER BY total_score DESC
"""
result = conn.execute(query).pl()
result.write_csv(output, separator="\t")
```
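The `integrate` command above applies the weighted sum but leaves out the known-gene exclusion and tier assignment that this pattern also calls for. A minimal sketch of those two steps in Polars, assuming the exclusion lists and thresholds from `config/` (file names, column names, and thresholds are illustrative):
```python
# src/integration/filtering.py / tiering.py — sketch; paths and thresholds are illustrative
from pathlib import Path

import polars as pl

def load_known_genes(paths: list[Path]) -> set[str]:
    """Union of gene IDs from the exclusion lists (one ID per line)."""
    known: set[str] = set()
    for p in paths:
        known |= {line.strip() for line in p.read_text().splitlines() if line.strip()}
    return known

def filter_and_tier(scored: pl.DataFrame, known: set[str],
                    high: float = 0.7, medium: float = 0.5) -> pl.DataFrame:
    """Drop known genes, then map total_score onto confidence tiers."""
    return (
        scored
        .filter(~pl.col("gene_id").is_in(list(known)))
        .with_columns(
            pl.when(pl.col("total_score") >= high).then(pl.lit("high"))
            .when(pl.col("total_score") >= medium).then(pl.lit("medium"))
            .otherwise(pl.lit("low"))
            .alias("confidence_tier")
        )
    )
```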
### Pattern 3: Configuration-Driven Behavior (Declarative Pipeline)
**What:** All scoring weights, thresholds, data source URLs, API keys, etc. defined in YAML config files. Code reads config, doesn't hardcode parameters.
**When to use:** Always for reproducible research pipelines. Allows parameter tuning without code changes.
**Trade-offs:**
- **Pro:** Configuration versioned in git → reproducibility. Easy to A/B test scoring schemes. Non-programmers can adjust weights.
- **Con:** Requires config validation layer (use pydantic). Risk of config drift if not validated.
**Example:**
```yaml
# config/pipeline.yaml
scoring:
weights:
annotation: 0.15
expression: 0.20
protein_features: 0.15
localization: 0.25
constraint: 0.15
phenotypes: 0.10
thresholds:
high_confidence: 0.7
medium_confidence: 0.5
low_confidence: 0.3
data_sources:
uniprot:
base_url: "https://rest.uniprot.org/uniprotkb/search"
version: "2026_01"
gtex:
file_path: "/data/external/GTEx_v8_median_tpm.tsv"
version: "v8"
known_genes:
exclusion_lists:
- "config/ciliacarta_genes.txt"
- "config/syscilia_genes.txt"
- "config/omim_usher_genes.txt"
```
```python
# src/pipeline/config.py
from pydantic import BaseModel, Field
from typing import Dict
import yaml
class ScoringConfig(BaseModel):
weights: Dict[str, float] = Field(..., description="Evidence layer weights")
thresholds: Dict[str, float]
class PipelineConfig(BaseModel):
scoring: ScoringConfig
data_sources: Dict
known_genes: Dict
def load_config(path: str = "config/pipeline.yaml") -> PipelineConfig:
with open(path) as f:
config_dict = yaml.safe_load(f)
return PipelineConfig(**config_dict)
```
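A short usage sketch connecting this pattern back to Pattern 2: rather than hardcoding weights in the SQL string, the weighted-sum expression can be assembled from the validated config. It uses `load_config` from the block above; the alias and column maps mirror the `integrate` example and are illustrative:
```python
config = load_config()

# Map each evidence layer to its table alias and score column (must match the SQL JOINs)
aliases = {"annotation": "a", "expression": "e", "protein_features": "p",
           "localization": "l", "constraint": "c", "phenotypes": "ph"}
score_cols = {"annotation": "annotation_score", "expression": "expression_score",
              "protein_features": "protein_score", "localization": "localization_score",
              "constraint": "constraint_score", "phenotypes": "phenotype_score"}

weighted_sum = " + ".join(
    f"{aliases[layer]}.{score_cols[layer]} * {weight}"
    for layer, weight in config.scoring.weights.items()
)
# e.g. "a.annotation_score * 0.15 + e.expression_score * 0.2 + ..." — interpolate into the SELECT
```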
### Pattern 4: Provenance Tracking (Reproducibility First)
**What:** Every output file/table includes metadata: pipeline version, data source versions, timestamp, config hash. Enables exact reproduction.
**When to use:** Always in research pipelines. Critical for publication and validation.
**Trade-offs:**
- **Pro:** Full audit trail. Can reproduce results from 6 months ago. Meets FAIR data principles.
- **Con:** Requires discipline to capture versions (use git tags, API version headers, file checksums).
**Example:**
```python
# src/reporting/provenance.py
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime
from pathlib import Path
def generate_provenance() -> dict:
"""Generate provenance metadata for pipeline run."""
return {
"pipeline_version": get_git_version(),
"run_timestamp": datetime.utcnow().isoformat(),
"config_hash": hash_config("config/pipeline.yaml"),
"data_sources": {
"uniprot": {"version": "2026_01", "retrieved": "2026-02-10"},
"gtex": {"version": "v8", "file_modified": "2023-11-15"},
# ... per-source metadata
},
"dependencies": get_package_versions(),
"runtime_env": {
"python": sys.version,
"platform": platform.platform()
}
}
def get_git_version() -> str:
"""Get current git commit hash."""
return subprocess.check_output(
["git", "rev-parse", "HEAD"]
).decode().strip()
def hash_config(path: str) -> str:
"""SHA256 hash of config file."""
content = Path(path).read_bytes()
return hashlib.sha256(content).hexdigest()[:8]
```
## Data Flow
### Pipeline Execution Flow
```
┌──────────────────────────────────────────────────────────────┐
│ 1. INITIALIZATION │
│ Load config/pipeline.yaml → Validate with pydantic │
│ Load gene list (~20K human protein-coding genes) │
│ Check data/raw/ cache status │
└────────────────────┬─────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 2. PARALLEL RETRIEVAL (6 evidence layers + literature) │
│ For each layer: │
│ - Query external API/database │
│ - Cache raw response → data/raw/{layer}/ │
│ - Normalize to standard schema │
│ - Write to DuckDB table │
│ │
│ [annotation] [expression] [protein] [localization] │
│ [constraint] [phenotypes] [literature] │
│ │
│ All layers independent, no inter-dependencies │
└────────────────────┬─────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 3. INTEGRATION │
│ SQL join on gene_id across all DuckDB tables │
│ Apply weighted scoring formula from config │
│ Filter out known genes (CiliaCarta/SYSCILIA/OMIM) │
│ Assign confidence tiers (high/medium/low) │
└────────────────────┬─────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 4. REPORTING │
│ Generate per-gene evidence summaries │
│ Write tiered candidate lists (TSV) │
│ Attach provenance metadata │
│ Optionally: visualizations, statistics │
└──────────────────────────────────────────────────────────────┘
```
### Data Schema Evolution
```
RAW DATA (heterogeneous formats)
├── UniProt XML/JSON
├── GTEx TSV matrices
├── gnomAD VCF/TSV
└── PubMed XML
↓ NORMALIZATION
INTERMEDIATE DATA (standardized schema in DuckDB)
├── Table: annotation
│ ├── gene_id (TEXT, primary key)
│ ├── annotation_score (FLOAT 0-1)
│ ├── go_term_count (INT)
│ └── data_source (TEXT)
├── Table: expression
│ ├── gene_id (TEXT)
│ ├── expression_score (FLOAT 0-1)
│ ├── tissue_specificity (FLOAT)
│ └── ciliated_tissue_enrichment (BOOLEAN)
└── [similar schema for other 5 layers]
↓ INTEGRATION
FINAL OUTPUT (candidate lists)
├── candidates_high.tsv
│ ├── gene_id
│ ├── total_score (weighted sum)
│ ├── confidence_tier (high/medium/low)
│ └── [per-layer scores for transparency]
└── evidence_summary.parquet
└── [detailed per-gene evidence provenance]
```
### Key Data Flow Principles
1. **Unidirectional flow:** Raw → Intermediate → Output (no backflow, no circular dependencies)
2. **Gene ID as universal key:** All tables join on standardized gene_id (Ensembl or HGNC)
3. **Score normalization:** All evidence scores scaled 0-1 for comparability (see the sketch after this list)
4. **Immutable raw cache:** Never modify data/raw/ after creation (versioned, timestamped)
5. **DuckDB as integration hub:** Analytical queries across layers without full in-memory loads
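Principle 3 glosses over how a raw metric becomes a 0-1 score. A minimal min-max scaling sketch in Polars (column names are illustrative; some layers may instead use fixed biological cutoffs or rank-based scaling):
```python
import polars as pl

def min_max_normalize(df: pl.DataFrame, raw_col: str, out_col: str) -> pl.DataFrame:
    """Rescale a raw evidence metric into the 0-1 range used by the integrator."""
    lo, hi = df[raw_col].min(), df[raw_col].max()
    if lo == hi:  # constant column: avoid division by zero
        return df.with_columns(pl.lit(0.5).alias(out_col))
    return df.with_columns(((pl.col(raw_col) - lo) / (hi - lo)).alias(out_col))

# Example: turn raw GO term counts into an annotation_score
# df = min_max_normalize(df, raw_col="go_term_count", out_col="annotation_score")
```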
## Build Order and Dependencies
### Recommended Build Order (Phases)
Given the component dependencies and validation requirements, build in this sequence:
#### Phase 1: Infrastructure (Week 1)
**Goal:** Establish project skeleton, config management, data storage patterns
**Components:**
1. Project structure (`pyproject.toml`, `src/pipeline/`, `config/`, `data/` hierarchy)
2. Config loading/validation (`config.py` with pydantic models)
3. DuckDB schema setup (create tables for 6 evidence layers)
4. Utility functions (gene ID mapping, caching helpers)
**Why first:** All downstream components depend on config system and data storage infrastructure.
**Validation:** Load sample config, create DuckDB tables, test gene ID mapping utility.
#### Phase 2: Single Evidence Layer Prototype (Week 2)
**Goal:** Prove the retrieval → normalization → storage pattern with one layer
**Components:**
1. Choose simplest layer (e.g., genetic constraint: gnomAD pLI/LOEUF)
2. Build retriever (API query, rate limiting, error handling)
3. Build normalizer (parse response, map gene IDs, scale scores)
4. Integrate with DuckDB storage
**Why second:** Validates architecture before scaling to 6 layers. Identifies issues early.
**Validation:** Run on 100 test genes, inspect Parquet cache + DuckDB table, verify schema.
#### Phase 3: Remaining Evidence Layers (Weeks 3-5)
**Goal:** Replicate Phase 2 pattern for all 6 layers + literature scan
**Components:**
1. Annotation (GO/UniProt completeness)
2. Expression (HPA/GTEx/CellxGene)
3. Protein features (domains/motifs/coiled-coil)
4. Localization (subcellular)
5. Phenotypes (MGI/ZFIN/IMPC)
6. Literature (PubMed API)
**Why third:** Independent layers can be built in parallel (or sequentially). No inter-dependencies.
**Validation:** Per-layer validation on test gene sets. Check schema consistency across layers.
#### Phase 4: Integration Layer (Week 6)
**Goal:** Combine evidence scores, apply weights, filter known genes
**Components:**
1. Multi-evidence scoring logic (SQL joins in DuckDB)
2. Known gene filter (CiliaCarta/SYSCILIA/OMIM exclusion)
3. Confidence tier assignment
4. Initial output generation (TSV candidate lists)
**Why fourth:** Requires all evidence layers complete. Core scientific logic.
**Validation:** Compare integrated scores against manual calculations for 10 test genes.
#### Phase 5: Reporting + Provenance (Week 7)
**Goal:** Generate publication-ready outputs with full audit trail
**Components:**
1. Per-gene evidence summaries
2. Provenance metadata generation
3. Output formatting (TSV + Parquet)
4. Optional: visualizations, statistics
**Why fifth:** Depends on integration layer. Presentation layer.
**Validation:** Manually review summaries for 20 high-confidence candidates. Verify provenance completeness.
#### Phase 6: CLI + Orchestration (Week 8)
**Goal:** Unified command-line interface for end-to-end pipeline
**Components:**
1. Typer CLI app with subcommands (run-layer, integrate, report)
2. Pipeline orchestration (dependency checking, progress logging)
3. Force-refresh flags, partial reruns
**Why last:** Integrates all components. User-facing interface.
**Validation:** Run full pipeline end-to-end on 1000 test genes. Benchmark runtime.
### Component Dependency Graph
```
┌──────────────┐
│ Config + │ ← PHASE 1 (foundation)
│ Storage │
└───────┬──────┘
┌───────┴──────────────────────────────────┐
│ Single Layer Prototype │ ← PHASE 2 (proof of concept)
│ (Retriever + Normalizer + DuckDB) │
└───────┬──────────────────────────────────┘
┌───────┴──────────────────────────────────┐
│ 6 Evidence Layers (independent modules) │ ← PHASE 3 (parallel work)
│ + Literature Scan │
└───────┬──────────────────────────────────┘
┌───────┴──────────────────────────────────┐
│ Integration Layer │ ← PHASE 4 (core logic)
│ (Scoring + Filtering + Tiering) │
└───────┬──────────────────────────────────┘
┌───────┴──────────────────────────────────┐
│ Reporting + Provenance │ ← PHASE 5 (presentation)
└───────┬──────────────────────────────────┘
┌───────┴──────────────────────────────────┐
│ CLI Orchestration │ ← PHASE 6 (interface)
└──────────────────────────────────────────┘
```
**Critical path:** Phase 1 → Phase 2 → Phase 4 (infrastructure + prototype + integration)
**Parallelizable:** Phase 3 (6 evidence layers can be built by different team members simultaneously)
**Deferrable:** Phase 5 (reporting) and Phase 6 (CLI polish) can be minimal initially
## Anti-Patterns to Avoid
### Anti-Pattern 1: Hardcoded API Keys and Endpoints
**What people do:** Embed API keys directly in code (`API_KEY = "abc123"`) or hardcode URLs.
**Why it's wrong:** Security risk (keys in git history), inflexibility (can't switch endpoints without code changes), breaks reproducibility (different users need different configs).
**Do this instead:** Use `.env` files for secrets (never committed), config files for endpoints. Load with `python-dotenv` or `pydantic-settings`.
```python
# WRONG
def query_uniprot():
response = requests.get("https://rest.uniprot.org/...", headers={"API-Key": "secret123"})
# RIGHT
import requests
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")
    uniprot_api_key: str
    uniprot_base_url: str

settings = Settings()
response = requests.get(settings.uniprot_base_url, headers={"API-Key": settings.uniprot_api_key})
```
### Anti-Pattern 2: Monolithic All-in-One Script
**What people do:** Single 3000-line Python script that retrieves all data, normalizes, scores, and reports in one execution.
**Why it's wrong:** Impossible to restart from failure point (API rate limit at layer 5 → restart all 6 layers). Hard to test (can't unit test individual layers). No intermediate validation. Debugging nightmare.
**Do this instead:** Modular architecture with staged persistence (this entire ARCHITECTURE.md). Each layer is independently runnable. CLI orchestrates but doesn't enforce monolithic execution.
### Anti-Pattern 3: In-Memory-Only Data Processing
**What people do:** Load all 20K genes × 6 evidence layers into Pandas DataFrames, perform joins in memory, never write intermediate results.
**Why it's wrong:** Memory pressure (especially with expression matrices). No restartability. No audit trail. Limits scalability (what if you scale to 40K genes?).
**Do this instead:** Use DuckDB for intermediate storage. Write Parquet caches. DuckDB queries Parquet files directly (out-of-core analytics). Memory-efficient even at larger scales.
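A minimal sketch of that out-of-core pattern: DuckDB joins the Parquet caches directly on disk, so the full gene-by-evidence matrix never has to live in a DataFrame (paths are illustrative):
```python
import duckdb

con = duckdb.connect()  # in-memory connection; the data itself stays on disk
top = con.execute("""
    SELECT a.gene_id, a.annotation_score, e.expression_score
    FROM 'data/raw/annotation/results.parquet' a
    JOIN 'data/raw/expression/results.parquet' e USING (gene_id)
    ORDER BY e.expression_score DESC
    LIMIT 100
""").pl()  # only the 100-row result is materialized as a Polars DataFrame
```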
### Anti-Pattern 4: Ignoring Gene ID Ambiguity
**What people do:** Assume gene symbols are unique/stable. Join datasets on raw gene symbols from different sources.
**Why it's wrong:** Gene symbols are ambiguous (HLA-A vs HLAA), change over time, vary by organism. UniProt uses one ID system, GTEx another, gnomAD another. Mismatches cause silent data loss.
**Do this instead:** Standardize ALL gene identifiers to a single system (Ensembl or HGNC) during normalization. Build a gene ID mapper utility. Validate mappings (flag unmapped genes). Include original IDs in provenance.
```python
# src/pipeline/utils.py
import logging
from typing import Dict, List

import mygene

def standardize_gene_ids(gene_list: List[str], target_id: str = "ensembl") -> Dict[str, str]:
    """Map heterogeneous gene IDs (symbols, Entrez, UniProt) to standard Ensembl gene IDs."""
    mg = mygene.MyGeneInfo()
    results = mg.querymany(gene_list, scopes="symbol,entrezgene,uniprot",
                           fields="ensembl.gene", species="human")
    mapping = {}
    for result in results:
        if "ensembl" in result:
            ensembl = result["ensembl"]
            # mygene returns a list when one query maps to several Ensembl genes
            if isinstance(ensembl, list):
                ensembl = ensembl[0]
            mapping[result["query"]] = ensembl["gene"]
        else:
            # Log unmapped genes for manual review
            logging.warning(f"Could not map gene ID: {result['query']}")
    return mapping
```
### Anti-Pattern 5: No Versioning or Provenance
**What people do:** Run pipeline, generate `candidates.tsv`, no record of config used or data source versions. Rerun 3 months later, get different results, can't explain why.
**Why it's wrong:** Irreproducible research. Can't debug score changes. Can't publish without provenance. Violates FAIR principles.
**Do this instead:** Implement Pattern 4 (Provenance Tracking). Every output has metadata footer/sidecar. Version config files in git. Tag releases. Include data source versions, tool versions, timestamps in all outputs.
## Integration Points
### External Services and APIs
| Service | Integration Pattern | Rate Limits / Notes |
|---------|---------------------|---------------------|
| **UniProt REST API** | HTTP GET with query params, JSON response | 1000 requests/hour. Use batch queries (`gene_ids=A,B,C`). Cache aggressively; see the retrieval sketch after this table. |
| **GTEx Portal** | Download static TSV files, local queries | No API. Download median TPM matrices (~3 GB). Update quarterly. |
| **gnomAD** | Query via GraphQL API or download VCF | API: 10 requests/sec. Consider downloading constraint metrics TSV (500 MB). |
| **Human Protein Atlas** | Download bulk TSV files | No real-time API. Download tissue expression TSV. |
| **CellxGene** | Download h5ad files or use API | Census API available. Large files (>10 GB). Consider pre-filtering cell types. |
| **PubMed E-utilities** | Entrez Programming Utilities (E-utils) | 3 requests/sec without API key, 10/sec with key. Use batch queries. |
| **MGI/ZFIN/IMPC** | Download phenotype association files | Bulk downloads (TSV/XML). Monthly updates. |
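The retrieval pattern implied by this table — batch the queries, pace the requests, cache raw responses before parsing — is the same for every HTTP source. A minimal, source-agnostic sketch; `fetch_batch`, the batch size, and the pause are placeholders to be tuned per service:
```python
# src/retrievers/base.py — sketch of batched retrieval with pacing and raw caching;
# fetch_batch is a per-source callable (e.g. a UniProt or E-utilities query function)
import json
import time
from pathlib import Path
from typing import Callable, Iterable, List

def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    for i in range(0, len(items), size):
        yield items[i:i + size]

def retrieve_in_batches(gene_ids: List[str],
                        fetch_batch: Callable[[List[str]], dict],
                        cache_dir: Path,
                        batch_size: int = 200,
                        pause_s: float = 0.5) -> None:
    """Query one chunk at a time, cache each raw response, and pace requests."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for i, batch in enumerate(chunked(gene_ids, batch_size)):
        out = cache_dir / f"batch_{i:05d}.json"
        if out.exists():
            continue  # restartable: batches already on disk are never re-fetched
        out.write_text(json.dumps(fetch_batch(batch)))
        time.sleep(pause_s)  # stay well under the per-service rate limit
```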
### Internal Module Boundaries
| Boundary | Communication | Notes |
|----------|---------------|-------|
| **CLI ↔ Retrievers** | Function calls with config objects | CLI passes validated config, receives DataFrames/file paths |
| **Retrievers ↔ Normalizers** | DataFrame handoff (Polars/Pandas) | Retrievers return raw DataFrames, normalizers return standardized schema |
| **Normalizers ↔ Storage** | File writes (Parquet) + DuckDB inserts | Normalizers write directly, no abstraction layer needed |
| **Storage ↔ Integration** | SQL queries (DuckDB connection) | Integration layer reads via SQL joins, no Python data structures |
| **Integration ↔ Reporting** | DataFrame handoff | Integration returns scored DataFrame, reporting formats for output |
### Data Format Interchange
```
External APIs → JSON/XML/TSV (heterogeneous)
Retrievers → Polars DataFrame (in-memory, typed)
Normalizers → Polars DataFrame (standardized schema)
Storage Layer → Parquet (disk cache) + DuckDB (queryable)
Integration → DuckDB Result (SQL query output)
Reporting → TSV/Parquet (final outputs)
```
**Why Polars over Pandas?** Faster, more memory-efficient, better type system, native lazy evaluation. Polars is becoming standard in modern bioinformatics Python pipelines (2024-2026).
**Why DuckDB over SQLite?** Columnar storage (better for analytics), vectorized execution (10-100x faster for aggregations), native Parquet support, can query files without import.
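To make the lazy-evaluation point concrete, a small Polars sketch (path illustrative): the scan only builds a query plan, and reading happens at `collect()` with filters and column selection pushed down.
```python
import polars as pl

high_expr = (
    pl.scan_parquet("data/raw/expression/results.parquet")  # lazy: nothing is read yet
    .filter(pl.col("expression_score") >= 0.8)
    .select(["gene_id", "expression_score"])
    .collect()  # pushdown: only the needed columns and row groups are read
)
```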
## Scaling Considerations
### Current Scale (20K genes)
**Architecture:** Single-machine Python pipeline with local DuckDB. No distributed computing needed.
**Bottlenecks:**
- API rate limits (PubMed: 10 req/sec, UniProt: 1000 req/hour) → solved by caching + batch queries
- Expression matrix size (GTEx + CellxGene: ~10 GB) → DuckDB handles via out-of-core queries
**Runtime estimate:**
- Data retrieval (with caching): 1-4 hours (depending on API latency)
- Normalization + integration: <15 minutes (DuckDB vectorized queries)
- **Total:** ~2-5 hours for full pipeline run (initial). <30 min for reruns with cache.
### Medium Scale (50K genes, e.g., include non-coding)
**Architecture adjustments:** Same architecture. Increase DuckDB memory limit. Consider Parquet partitioning by chromosome.
**Bottlenecks:** Expression matrices grow to ~25 GB. Still manageable with 32 GB RAM + DuckDB out-of-core.
### Large Scale (500K genes, e.g., cross-species)
**Architecture adjustments:** Consider workflow orchestrators (Nextflow/Snakemake) for parallelization. Distribute retrieval across multiple workers. Switch from local DuckDB to distributed query engine (DuckDB over S3, or Apache Spark).
**When to consider:** Only if scaling beyond single organism or adding many more evidence layers.
### GPU Utilization Note
Your NVIDIA 4090 GPU is available but **not needed** for this pipeline. Use cases for GPU:
- Deep learning-based gene prioritization (not rule-based scoring)
- Protein structure prediction (AlphaFold, not in scope)
- Large-scale sequence alignment (not in scope)
Rule-based multi-evidence scoring is CPU-bound (DuckDB is CPU-optimized). GPU would be idle.
## Validation and Testing Strategy
### Testing Pyramid
```
┌─────────────────────────┐
│ Integration Tests │ ← Full pipeline on 100 test genes
│ (slow, comprehensive) │
├─────────────────────────┤
│ Component Tests │ ← Per-layer retrieval + normalization
│ (medium, focused) │ Mock API responses
├─────────────────────────┤
│ Unit Tests │ ← Utility functions, scoring logic
│ (fast, narrow) │ Gene ID mapping, score calculations
└─────────────────────────┘
```
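The middle tier ("mock API responses") can stay lightweight. A sketch of a component test that feeds a canned raw DataFrame through the normalizer from Pattern 1, with no network access (import path and gene ID are illustrative and assume `src/` is importable, e.g. via an editable install):
```python
# tests/test_retrievers/test_annotation.py — sketch; assumes the Pattern 1 class above
import polars as pl
from retrievers.annotation import AnnotationRetriever

def test_normalize_produces_standard_schema():
    raw = pl.DataFrame({
        "gene_id": ["ENSG00000042781"],   # illustrative Ensembl ID
        "annotation_score": [0.8],
        "go_term_count": [42],
        "data_source": ["uniprot"],
    })
    normalized = AnnotationRetriever(config=None).normalize(raw)
    assert normalized.columns == ["gene_id", "annotation_score", "go_term_count", "data_source"]
    assert 0.0 <= normalized["annotation_score"][0] <= 1.0
```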
### Validation Datasets
1. **Positive controls:** Known ciliopathy genes (USH2A, MYO7A, CDH23) → should score HIGH (see the recall sketch after this list)
2. **Negative controls:** Housekeeping genes (GAPDH, ACTB) → should score LOW
3. **Novel candidates:** Manually validate 10 high-scoring unknowns via literature review
4. **Benchmark datasets:** If available, compare to published cilia gene sets (CiliaCarta as gold standard)
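A minimal sketch of turning controls 1 and 2 into numbers, assuming a result DataFrame already sorted by score with a `gene_id` column (the control sets shown are illustrative and would use the pipeline's standardized IDs in practice):
```python
import polars as pl

def recall_at_k(ranked: pl.DataFrame, controls: set[str], k: int) -> float:
    """Fraction of control genes recovered in the top-k ranked candidates."""
    top_k = set(ranked.head(k)["gene_id"].to_list())
    return len(top_k & controls) / len(controls)

# Positive controls should surface near the top; negative controls should not
# recall_at_k(ranked, positive_controls, k=50)   # expect close to 1.0
# recall_at_k(ranked, negative_controls, k=50)   # expect close to 0.0
```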
### Quality Gates
- **Schema validation:** All DuckDB tables match expected schema (pydantic models)
- **Completeness checks:** All 20K genes have entries in all 6 evidence tables (flag missing data)
- **Score sanity checks:** No scores <0 or >1, distribution checks (not all genes score 0.5); see the sketch after this list
- **Provenance validation:** Every output has complete metadata (no missing versions/timestamps)
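Two of these gates can be expressed directly as DuckDB assertions over the evidence tables (a sketch; table and column names follow the intermediate schema above):
```python
import duckdb

con = duckdb.connect("data/intermediate/evidence.duckdb")

# Score sanity: no evidence score may fall outside [0, 1]
bad = con.execute(
    "SELECT COUNT(*) FROM annotation WHERE annotation_score < 0 OR annotation_score > 1"
).fetchone()[0]
assert bad == 0, f"{bad} annotation rows have out-of-range scores"

# Completeness: every gene in annotation should also appear in expression
missing = con.execute("""
    SELECT COUNT(*) FROM annotation a
    LEFT JOIN expression e USING (gene_id)
    WHERE e.gene_id IS NULL
""").fetchone()[0]
print(f"{missing} genes lack expression evidence (review before integration)")
```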
## Technology Choices Summary
| Layer | Technology | Reasoning |
|-------|------------|-----------|
| **CLI Framework** | Typer (or Click) | Type-hinted, auto-generated help, modern Python. Click if need maximum maturity. |
| **Config Format** | YAML | Dominant in bioinformatics. Human-readable. Pydantic validation. |
| **Data Processing** | Polars | Faster + more memory-efficient than Pandas. Native lazy eval. Modern standard. |
| **Intermediate Storage** | DuckDB | Columnar analytics, out-of-core queries, native Parquet support. 10-100x faster than SQLite for aggregations. |
| **Cache Format** | Parquet | Compressed columnar format. Standard for big data. DuckDB-native. |
| **Final Output** | TSV + Parquet | TSV for human readability + Excel compatibility. Parquet for programmatic access. |
| **API Client** | requests + biopython | Standard for REST APIs. Biopython for Entrez/PubMed utilities. |
| **Gene ID Mapping** | mygene.info API | Comprehensive cross-reference database. MyGeneInfo Python client. |
## Sources
**Architecture Patterns:**
- [Bioinformatics Pipeline Architecture Best Practices](https://www.meegle.com/en_us/topics/bioinformatics-pipeline/bioinformatics-pipeline-architecture)
- [Developing and reusing bioinformatics pipelines using scientific workflow systems](https://www.sciencedirect.com/science/article/pii/S2001037023001010)
- [PHA4GE Pipeline Best Practices](https://github.com/pha4ge/public-health-pipeline-best-practices/blob/main/docs/pipeline-best-practices.md)
**Multi-Evidence Integration:**
- [Multi-omics data integration methods](https://academic.oup.com/bib/article/26/4/bbaf355/8220754)
- [FAS: multi-layered feature architectures for protein similarity](https://academic.oup.com/bioinformatics/article/39/5/btad226/7135831)
**Workflow Tools and Patterns:**
- [A lightweight, flow-based toolkit for bioinformatics pipelines](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-61)
- [Using prototyping to choose a workflow management system](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008622)
**Data Storage:**
- [DuckDB vs SQLite for bioinformatics](https://bridgeinformatics.com/from-pandas-to-spark-and-back-again-why-bioinformatics-should-know-duckdb/)
- [Modern Data Formats for Big Bioinformatics](https://www.researchgate.net/publication/316682490_Modern_Data_Formats_for_Big_Bioinformatics_Data_Analytics)
- [DuckDB vs SQLite: Complete Comparison](https://www.datacamp.com/blog/duckdb-vs-sqlite-complete-database-comparison)
**Reproducibility and Provenance:**
- [Genomics pipelines and data integration challenges](https://pmc.ncbi.nlm.nih.gov/articles/PMC5580401/)
- [Pipeline Provenance for Reproducibility](https://arxiv.org/abs/2404.14378)
- [Open Targets provenance metadata](https://blog.opentargets.org/provenance-metadata/)
**Python CLI Tools:**
- [Comparing argparse, Click, and Typer](https://codecut.ai/comparing-python-command-line-interface-tools-argparse-click-and-typer/)
- [Building Python CLI Tools Guide](https://inventivehq.com/blog/python-cli-tools-guide)
**Gene Annotation Pipelines:**
- [NCBI Eukaryotic Genome Annotation Pipeline](https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/process/)
- [AnnotaPipeline: integrated annotation tool](https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2022.1020100/full)
**Literature Mining:**
- [NCBI Text Mining APIs](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/)
- [PubTator 3.0: AI-powered literature resource](https://academic.oup.com/nar/article/52/W1/W540/7640526)
**Validation:**
- [Standards for Validating NGS Bioinformatics Pipelines](https://www.sciencedirect.com/science/article/pii/S1525157817303732)
- [NFTest: automated Nextflow pipeline testing](https://pubmed.ncbi.nlm.nih.gov/38341660/)
---
*Architecture research for: Usher Cilia Gene Discovery Pipeline*
*Researched: 2026-02-11*


@@ -0,0 +1,259 @@
# Feature Landscape
**Domain:** Gene Candidate Discovery and Prioritization for Rare Disease / Ciliopathy Research
**Researched:** 2026-02-11
**Confidence:** MEDIUM
## Table Stakes
Features users expect. Missing = pipeline is not credible.
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Multi-evidence scoring | Standard in modern gene prioritization; single-source approaches insufficient for rare disease | Medium | Requires weighting scheme, score normalization across evidence types |
| Reproducibility documentation | Required for publication and scientific validity; FDA/NIH standards emphasize reproducible pipelines | Low-Medium | Parameter logging, version tracking, seed control, execution environment capture |
| Data provenance tracking | W3C PROV standard; required to trace analysis steps and validate results | Medium | Track all data transformations, source versions, timestamps, intermediate results |
| Known gene validation | Benchmarking against established disease genes is standard practice; without this, no confidence in results | Low-Medium | Positive control set, recall metrics at various rank cutoffs (e.g., recall@10, recall@50) |
| Quality control checks | Standard for NGS and bioinformatics pipelines; catch data issues early | Low-Medium | Missing data detection, outlier identification, distribution checks, data completeness metrics |
| Structured output format | Machine-readable outputs enable downstream analysis and integration | Low | CSV/TSV for tabular data, JSON for metadata, standard column naming |
| Basic visualization | Visual inspection of scores, distributions, and rankings is expected | Medium | Score distribution plots, rank visualization, evidence contribution plots |
| Literature evidence | Gene function annotation incomplete; literature mining standard for discovery pipelines | High | PubMed/literature search integration, manual or automated; alternatively curated disease-gene databases |
| HPO/phenotype integration | Standard for rare disease gene prioritization, established by tools like Exomiser and LIRICAL | Medium | Human Phenotype Ontology term matching, phenotype similarity scoring if applicable |
| API-based data retrieval | Manual downloads don't scale; automated retrieval from gnomAD, UniProt, GTEx, etc. is expected | Medium | Rate limiting, error handling, caching, retry logic |
| Batch processing | Single-gene analysis doesn't scale to 20K genes | Low-Medium | Parallel execution, progress tracking, resume-from-checkpoint |
| Parameter configuration | Hard-coded parameters prevent adaptation; config files standard | Low | YAML/JSON config, CLI arguments, validation |
## Differentiators
Features that set product apart. Not expected, but valued.
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Explainable scoring | SHAP/attribution methods show WHY genes rank high; critical for discovery (not just diagnosis) | High | SHAP-style contribution analysis, per-gene evidence breakdown, visual explanations |
| Systematic under-annotation bias handling | Novel: most tools favor well-studied genes; correcting publication bias is research advantage | High | Annotation completeness score as evidence layer; downweight literature-heavy features for under-studied candidates |
| Cilia-specific knowledgebase integration | Leverage CilioGenics, CiliaMiner, ciliopathy databases for domain-focused scoring | Medium | Custom evidence layer; API/download from specialized databases |
| Sensitivity analysis | Systematic parameter tuning rare in discovery pipelines; shows robustness | Medium-High | Grid search or DoE-based parameter sweep; rank stability metrics across configs |
| Tiered output with rationale | Not just ranked list but grouped by confidence/evidence type; aids hypothesis generation | Low-Medium | Tier classification logic (e.g., high/medium/low confidence), evidence summary per tier |
| Multi-modal evidence weighting | Naive Bayesian integration or optimized weights outperform equal weighting | Medium | Weight optimization using known positive controls, cross-validation |
| Negative control validation | Test against known non-disease genes to assess specificity; rare in discovery pipelines | Low-Medium | Negative gene set (e.g., housekeeping genes), precision metrics |
| Evidence conflict detection | Flag genes with contradictory evidence (e.g., high expression but low constraint) | Medium | Rule-based or correlation-based conflict identification |
| Interactive HTML report | Modern tools (e.g., MultiQC-style) provide browsable results; better than static CSV | Medium | HTML generation, embedded plots, sortable tables, linked evidence |
| Incremental update capability | Re-run with new data sources without full recomputation | Medium-High | Modular pipeline, cached intermediate results, dependency tracking |
| Cross-species homology scoring | Animal model phenotypes critical for ciliopathy; ortholog-based evidence integration | Medium | DIOPT/OrthoDB integration, phenotype transfer from model organisms |
| Automated literature scanning with LLM | Emerging: LLM-based RAG for literature evidence extraction and validation | High | LLM API integration, prompt engineering, faithfulness checks, cost management |
## Anti-Features
Features to explicitly NOT build.
| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| Real-time web dashboard | Overkill for research tool; adds deployment complexity, security concerns | Static HTML reports + CLI; Jupyter notebooks for interactive exploration if needed |
| GUI for parameter tuning | Research pipelines need reproducible command-line execution; GUIs hinder automation | YAML config files + CLI; document parameter rationale in config comments |
| Variant-level analysis | Out of scope for gene-level discovery; conflates discovery with diagnostic prioritization | Focus on gene-level evidence aggregation; refer users to Exomiser/LIRICAL for variant work |
| Custom alignment/variant calling | Well-solved problem; reinventing the wheel wastes time | Use standard BAM/VCF inputs from established pipelines; focus on gene prioritization logic |
| Social features (sharing, comments) | Research tool, not collaboration platform | File-based outputs (shareable via Git/email); documentation in README |
| Real-time database sync | Bioinformatics data versions change slowly; real-time sync unnecessary | Versioned data snapshots with documented download dates; update quarterly or as needed |
| One-click install for all dependencies | Bioinformatics tools have complex dependencies; false promise | Conda environment.yml or Docker container; document setup steps clearly |
| Machine learning model training | Small positive control set insufficient for robust ML; rule-based more transparent | Weighted scoring with manually tuned/optimized weights; reserve ML for future with larger training data |
## Feature Dependencies
```
Parameter Configuration
└──requires──> Quality Control Checks
└──requires──> Data Provenance Tracking
Multi-Evidence Scoring
├──requires──> API-Based Data Retrieval
├──requires──> Literature Evidence
└──requires──> Structured Output Format
Explainable Scoring
└──requires──> Multi-Evidence Scoring
└──enhances──> Interactive HTML Report
Known Gene Validation
└──requires──> Multi-Evidence Scoring
└──enables──> Sensitivity Analysis
Tiered Output with Rationale
└──requires──> Explainable Scoring
└──enhances──> Interactive HTML Report
Batch Processing
└──requires──> Parameter Configuration
└──enables──> Sensitivity Analysis
Negative Control Validation
└──requires──> Known Gene Validation (uses similar metrics)
Evidence Conflict Detection
└──requires──> Multi-Evidence Scoring
Incremental Update Capability
└──requires──> Data Provenance Tracking
└──requires──> Batch Processing
Cross-Species Homology Scoring
└──requires──> API-Based Data Retrieval
Automated Literature Scanning with LLM
└──requires──> Literature Evidence
└──conflicts──> Manual curation (choose one approach per evidence layer)
```
### Dependency Notes
- **Parameter Configuration → QC Checks → Data Provenance:** Foundation stack; parameters must be logged, QC applied, and tracked before any analysis
- **Multi-Evidence Scoring requires API-Based Data Retrieval:** Can't score without data; retrieval must be robust and cached
- **Explainable Scoring requires Multi-Evidence Scoring:** Can't explain scores that don't exist; explanations decompose composite scores
- **Known Gene Validation enables Sensitivity Analysis:** Positive controls provide ground truth for parameter tuning
- **Automated Literature Scanning conflicts with Manual Curation:** Choose one approach per evidence layer to avoid redundancy and conflicting evidence
## MVP Recommendation
### Launch With (v1)
Minimum viable pipeline for ciliopathy gene discovery.
- [x] Multi-evidence scoring (6 layers: annotation, expression, sequence, localization, constraint, phenotype)
- [x] API-based data retrieval with caching (gnomAD, GTEx, HPA, UniProt, Model Organism DBs)
- [x] Known gene validation (Usher syndrome genes, known ciliopathy genes as positive controls)
- [x] Reproducibility documentation (parameter logging, versions, timestamps)
- [x] Data provenance tracking (source file versions, processing steps, intermediate results)
- [x] Structured output format (CSV with ranked genes, evidence scores per column)
- [x] Quality control checks (missing data detection, outlier flagging, distribution checks)
- [x] Batch processing (parallel execution across 20K genes)
- [x] Parameter configuration (YAML config for weights, thresholds, data sources)
- [x] Tiered output with rationale (High/Medium/Low confidence tiers, evidence summary)
- [x] Basic visualization (score distributions, top candidate plots)
**Rationale:** These features enable credible, reproducible gene prioritization at scale. Without validation, explainability, and QC, results are untrustworthy. Without tiered output, generating hypotheses from 20K genes is overwhelming.
### Add After Validation (v1.x)
Features to add once core pipeline is validated against known ciliopathy genes.
- [ ] Interactive HTML report (browsable results with sortable tables, linked evidence) — **Trigger:** After v1 produces validated candidate list; when sharing results with collaborators
- [ ] Explainable scoring (per-gene evidence contribution breakdown, SHAP-style attribution) — **Trigger:** After identifying novel candidates; when reviewers ask "why is this gene ranked high?"
- [ ] Negative control validation (test against housekeeping genes to assess specificity) — **Trigger:** After positive control validation succeeds; to quantify false positive rate
- [ ] Evidence conflict detection (flag genes with contradictory evidence patterns) — **Trigger:** After observing unexpected high-ranking genes; to catch data quality issues
- [ ] Sensitivity analysis (systematic parameter sweep, rank stability metrics) — **Trigger:** When preparing for publication; to demonstrate robustness
- [ ] Cross-species homology scoring (zebrafish/mouse phenotype evidence integration) — **Trigger:** If animal model evidence proves valuable in initial analysis
### Future Consideration (v2+)
Features to defer until core pipeline is published and validated.
- [ ] Automated literature scanning with LLM (RAG-based PubMed evidence extraction) — **Why defer:** High complexity, cost, and uncertainty; manual curation sufficient for initial discovery
- [ ] Incremental update capability (re-run with new data without full recomputation) — **Why defer:** Overkill for one-time discovery project; valuable if pipeline becomes ongoing surveillance tool
- [ ] Multi-modal evidence weighting optimization (Bayesian integration, cross-validation) — **Why defer:** Requires larger training set of known genes; manual weight tuning sufficient initially
- [ ] Systematic under-annotation bias handling — **Why defer:** Novel research question; defer until after initial discovery validates approach
- [ ] Cilia-specific knowledgebase integration (CilioGenics, CiliaMiner) — **Why defer:** Nice-to-have; primary evidence layers sufficient for initial analysis; add if reviewers request
## Feature Prioritization Matrix
| Feature | User Value | Implementation Cost | Priority |
|---------|------------|---------------------|----------|
| Multi-evidence scoring | HIGH | MEDIUM | P1 |
| Known gene validation | HIGH | LOW-MEDIUM | P1 |
| Reproducibility documentation | HIGH | LOW-MEDIUM | P1 |
| Data provenance tracking | HIGH | MEDIUM | P1 |
| Structured output format | HIGH | LOW | P1 |
| Quality control checks | HIGH | LOW-MEDIUM | P1 |
| API-based data retrieval | HIGH | MEDIUM | P1 |
| Batch processing | HIGH | LOW-MEDIUM | P1 |
| Parameter configuration | HIGH | LOW | P1 |
| Tiered output with rationale | HIGH | LOW-MEDIUM | P1 |
| Basic visualization | HIGH | MEDIUM | P1 |
| Explainable scoring | HIGH | HIGH | P2 |
| Interactive HTML report | MEDIUM | MEDIUM | P2 |
| Sensitivity analysis | MEDIUM | MEDIUM-HIGH | P2 |
| Evidence conflict detection | MEDIUM | MEDIUM | P2 |
| Negative control validation | MEDIUM | LOW-MEDIUM | P2 |
| Cross-species homology | MEDIUM | MEDIUM | P2 |
| Incremental updates | LOW | MEDIUM-HIGH | P3 |
| LLM literature scanning | LOW | HIGH | P3 |
| Multi-modal weight optimization | MEDIUM | MEDIUM | P3 |
| Under-annotation bias correction | MEDIUM | HIGH | P3 |
| Cilia knowledgebase integration | LOW-MEDIUM | MEDIUM | P3 |
**Priority key:**
- P1: Must have for credible v1 pipeline
- P2: Should have; add after v1 validation
- P3: Nice to have; future consideration
## Competitor Feature Analysis
| Feature | Exomiser | LIRICAL | CilioGenics Tool | Usher Pipeline Approach |
|---------|----------|---------|-------------------|------------------------|
| Multi-evidence scoring | Yes (variant+pheno combined score) | Yes (LR-based) | Yes (ML predictions) | Yes (6-layer weighted scoring) |
| Phenotype integration | HPO-based (strong) | HPO-based (strong) | Not primary | HPO-compatible but not required (sensory cilia focus) |
| Known gene validation | Benchmarked on Mendelian diseases | Benchmarked on Mendelian diseases | Validated on known cilia genes | Validate on Usher + known ciliopathy genes |
| Explainable scoring | Limited | Post-test probability (interpretable) | ML black box | SHAP-style per-gene evidence breakdown (planned P2) |
| Variant-level analysis | Primary focus | Primary focus | No (gene-level only) | No (out of scope; gene-level discovery) |
| Literature evidence | Automated (limited) | Limited | Text mining used in training | Manual/automated (planned P3) |
| Tiered output | Yes (rank-ordered) | Yes (post-test prob tiers) | Yes (confidence scores) | Yes (High/Medium/Low tiers + rationale) |
| Under-annotation bias | Not addressed | Not addressed | Not addressed | Explicitly addressed (novel) |
| Domain-specific focus | Mendelian disease diagnosis | Mendelian disease diagnosis | Cilia biology | Usher syndrome / ciliopathy discovery |
| Reproducibility | Config files, versions logged | Config files, versions logged | Not emphasized | Extensive provenance tracking |
**Key Differentiators for Usher Pipeline:**
1. **Discovery vs. Diagnosis:** Exomiser/LIRICAL prioritize variants in patient genomes (diagnosis); Usher pipeline screens all genes for under-studied candidates (discovery)
2. **Under-annotation bias handling:** Explicitly score annotation completeness and de-bias toward under-studied genes
3. **Explainable scoring for hypothesis generation:** Per-gene evidence breakdown to guide experimental follow-up, not just rank genes
4. **Sensory cilia focus:** Retina/cochlea expression, ciliopathy phenotypes, subcellular localization evidence tailored to Usher biology
## Sources
### Tool Features and Benchmarking
- [Survey and improvement strategies for gene prioritization with LLMs](https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf148/8172498) (2026)
- [Clinical and Cross-Domain Validation of LLM-Guided Gene Prioritization](https://www.biorxiv.org/content/10.64898/2026.01.22.701191v1) (2026)
- [Automating candidate gene prioritization with LLMs](https://academic.oup.com/bioinformatics/article/41/10/btaf541/8280402) (2025)
- [Explicable prioritization with rule-based and ML algorithms](https://pmc.ncbi.nlm.nih.gov/articles/PMC10956189/) (2024)
- [Phenotype-driven approaches to enhance variant prioritization](https://pmc.ncbi.nlm.nih.gov/articles/PMC9288531/) (2022)
- [Evaluation of phenotype-driven gene prioritization methods](https://pmc.ncbi.nlm.nih.gov/articles/PMC9487604/) (2022)
- [Add LIRICAL and Exomiser scores to seqr (GitHub issue)](https://github.com/broadinstitute/seqr/issues/2742) (2021)
- [Large-scale benchmark of gene prioritization methods](https://www.nature.com/articles/srep46598) (2017)
### Reproducibility and Standards
- [Rare disease gene discovery in 100K Genomes Project](https://www.nature.com/articles/s41586-025-08623-w) (2025)
- [Standards for validating NGS bioinformatics pipelines (AMP/CAP)](https://www.sciencedirect.com/science/article/pii/S1525157817303732) (2018)
- [Genomics pipelines and data integration challenges](https://pmc.ncbi.nlm.nih.gov/articles/PMC5580401/) (2017)
- [Experiences with workflows for bioinformatics](https://link.springer.com/article/10.1186/s13062-015-0071-8) (2015)
### Parameter Tuning and Sensitivity Analysis
- [Algorithm sensitivity analysis for tissue segmentation pipelines](https://academic.oup.com/bioinformatics/article/33/7/1064/2843894) (2017)
- [doepipeline: systematic approach to optimizing workflows](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3091-z) (2019)
### Multi-Evidence Integration
- [Multi-dimensional evidence-based candidate gene prioritization](https://pmc.ncbi.nlm.nih.gov/articles/PMC2752609/) (2009)
- [Extensive analysis of disease-gene associations with network integration](https://pmc.ncbi.nlm.nih.gov/articles/PMC4070077/) (2014)
- [Survey of gene prioritization tools for Mendelian diseases](https://pmc.ncbi.nlm.nih.gov/articles/PMC7074139/) (2020)
### Explainability and Interpretability
- [Explainable deep learning for cancer target prioritization](https://arxiv.org/html/2511.12463) (2024)
- [Interpretable machine learning for genomics](https://pmc.ncbi.nlm.nih.gov/articles/PMC8527313/) (2021)
- [SeqOne's DiagAI Score with xAI](https://www.seqone.com/news-insights/seqone-diagai-explainable-ai) (recent)
- [Spectrum of explainable and interpretable ML for genomics](https://wires.onlinelibrary.wiley.com/doi/10.1002/wics.1617) (2023)
### Cilia Biology and Ciliopathy Tools
- [Prioritization tool for cilia-associated genes](https://pmc.ncbi.nlm.nih.gov/articles/PMC11512102/) (2024)
- [CilioGenics: integrated method for predicting ciliary genes](https://academic.oup.com/nar/article/52/14/8127/7710917) (2024)
- [CiliaMiner: integrated database for ciliopathy genes](https://pmc.ncbi.nlm.nih.gov/articles/PMC10403755/) (2023)
- [Systems-biology approach to ciliopathy disorders](https://genomemedicine.biomedcentral.com/articles/10.1186/gm275) (2011)
### Visualization and Reporting
- [Computational pipeline for functional gene discovery](https://www.nature.com/articles/s41598-021-03041-0) (2021)
- [JWES: pipeline for gene-variant discovery and annotation](https://pmc.ncbi.nlm.nih.gov/articles/PMC8409305/) (2021)
### Data Quality and Provenance
- [Data quality assurance in bioinformatics](https://ranchobiosciences.com/2025/04/bioinformatics-and-quality-assurance-data/) (2025)
- [Bioinformatics pipeline for data quality control](https://www.meegle.com/en_us/topics/bioinformatics-pipeline/bioinformatics-pipeline-for-data-quality-control) (recent)
- [Bionitio: demonstrating best practices for bioinformatics CLI](https://academic.oup.com/gigascience/article/8/9/giz109/5572530) (2019)
---
*Feature research for: Bioinformatics Gene Candidate Discovery and Prioritization (Rare Disease / Ciliopathy)*
*Researched: 2026-02-11*
*Confidence: MEDIUM — based on WebSearch findings verified across multiple sources; Context7 not available for specialized bioinformatics tools; recommendations synthesized from peer-reviewed publications and established tool documentation*


@@ -0,0 +1,468 @@
# Domain Pitfalls: Bioinformatics Gene Prioritization for Cilia/Usher
**Domain:** Bioinformatics gene candidate discovery pipeline
**Researched:** 2026-02-11
**Confidence:** HIGH
## Critical Pitfalls
### Pitfall 1: Gene ID Mapping Inconsistency Cascade
**What goes wrong:**
Gene identifiers fail to map correctly across databases (Ensembl, HGNC, UniProt, Entrez), resulting in data loss or incorrect merges. Over 51% of Ensembl IDs can return NA when converting to gene symbols, and approximately 8% discordance exists between Swiss-Prot and Ensembl. Multiple Ensembl IDs can map to a single Entrez ID, creating one-to-many relationships that break naively designed merge operations.
**Why it happens:**
- UniProt requires 100% sequence identity with no insertions/deletions for mapping to Ensembl, causing ~5% of well-annotated Swiss-Prot proteins to remain unmapped
- Gene symbols "shuffle around more haphazardly" than stable IDs, with the same symbol assigned to different genes across annotation builds
- Mucin-like proteins (e.g., MUC2, MUC19) with variable repeat regions have mismatches between curated sequences and reference genomes
- Different annotation build versions used across source databases
**How to avoid:**
1. **Use stable IDs as primary keys:** Store Ensembl gene IDs as the authoritative identifier; treat gene symbols as display-only metadata
2. **Version-lock annotation builds:** Document and freeze the Ensembl/GENCODE release used for all conversions; ensure GTEx expression data uses the same build
3. **Implement mapping validation:** After each ID conversion step, report the % successfully mapped and manually inspect unmapped high-priority genes
4. **One-to-many resolution strategy:** For multiple Ensembl→Entrez mappings, aggregate scores using max() or create separate records with explicit disambiguation (see the sketch after this list)
5. **Bypass problematic conversions:** Where possible, retrieve data directly using the native ID system of each source database rather than converting first
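A minimal sketch of the max() aggregation from point 4, shown for the common case where several source records collapse onto one Ensembl gene ID (column names are illustrative):
```python
import polars as pl

def collapse_to_ensembl(scores: pl.DataFrame) -> pl.DataFrame:
    """Resolve several source records mapping to one Ensembl gene by keeping the max score,
    recording how many records were collapsed so the choice stays visible in provenance."""
    return (
        scores.group_by("ensembl_gene_id")
        .agg(
            pl.col("score").max().alias("score"),
            pl.col("source_id").n_unique().alias("n_source_records_collapsed"),
        )
    )
```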
**Warning signs:**
- Sudden drop in gene counts after merging datasets
- Known positive control genes (established cilia/Usher genes) missing from integrated data
- Genes appearing multiple times with different scores after merge
- Different gene counts between pilot analysis and production run after database updates
**Phase to address:**
**Phase 1 (Data Infrastructure):** Establish ID mapping validation framework with explicit logging of conversion success rates. Create curated mapping tables with manual review of ambiguous cases.
---
### Pitfall 2: Literature Bias Amplification in Weighted Scoring
**What goes wrong:**
Well-studied genes dominate prioritization scores because multiple evidence layers correlate with publication volume rather than biological relevance. PubMed co-mentions, Gene Ontology annotation depth, pathway database coverage, and interaction network degree all increase with research attention, creating a reinforcing feedback loop that systematically deprioritizes novel candidates. In gene prioritization benchmarks, "the number of Gene Ontology annotations for a gene was significantly correlated with published disease-gene associations," reflecting research bias rather than actual biological importance.
**Why it happens:**
- Gene Ontology annotations are "created by curation of the scientific literature and typically only contain functional annotations for genes with published experimental data"
- Interaction networks derived from literature mining or yeast two-hybrid screens are biased toward well-studied proteins
- Pathway databases (KEGG, Reactome) provide richer annotations for canonical pathways vs. emerging biology
- Constraint metrics (gnomAD pLI/LOEUF) perform better for well-covered genes but may be unreliable for genes with low coverage
- Human genes "are much more likely to have been published on in the last 12 years if they are in clusters that were already well known"
**How to avoid:**
1. **Decouple publication metrics from functional evidence:** Score PubMed mentions separately from mechanistic evidence (protein domains, expression patterns, orthologs)
2. **Normalize for baseline research attention:** Divide pathway/GO scores by total publication count to create a "novelty-adjusted" functional score (see the sketch after this list)
3. **Use sequence-based features heavily:** Prioritize evidence types independent of literature (protein domains, tissue expression, evolutionary constraint, ortholog phenotypes)
4. **Set evidence diversity requirements:** Require candidates to score above threshold in at least N different evidence categories, preventing single-layer dominance
5. **Explicit "under-studied" bonus:** Add a scoring component that rewards genes with low PubMed counts but high biological plausibility from other layers
6. **Validate scoring against a "dark genome" test set:** Ensure low-publication genes with strong experimental validation (from model organisms) score appropriately
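The correlation check from the warning signs below, plus a log-damped variant of the normalization in item 2, might look roughly like this; the `scores` DataFrame, its column names, the 0.5 threshold, and the use of log damping instead of raw division are all assumptions.
```python
import numpy as np
import pandas as pd

def publication_bias_check(scores: pd.DataFrame, max_rho: float = 0.5) -> float:
    # Spearman rank correlation between the final score and publication volume
    rho = scores["final_score"].corr(scores["pubmed_count"], method="spearman")
    if rho > max_rho:
        print(f"WARNING: final score tracks PubMed count (rho = {rho:.2f})")
    return rho

def novelty_adjust(go_score: pd.Series, pubmed_count: pd.Series) -> pd.Series:
    # Damp literature-driven layers by log publication count; the +1 keeps
    # zero-publication genes undamped. Raw division (as in item 2) is harsher.
    return go_score / (1 + np.log1p(pubmed_count))
```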
**Warning signs:**
- Top 100 candidates are all genes with >500 PubMed citations
- Known disease genes with <50 publications rank outside the top 1,000 candidates
- Correlation coefficient >0.7 between final score and PubMed count
- When you add a new evidence layer, the top-ranked genes don't change
- Positive controls (known cilia genes) score lower than expected based on functional relevance
**Phase to address:**
**Phase 3 (Scoring System Design):** Implement multi-layer weighted scoring with explicit publication bias correction. Test scoring system on a stratified validation set including both well-studied and under-studied genes.
---
### Pitfall 3: Missing Data Handled as "Negative Evidence"
**What goes wrong:**
Genes lacking data in a particular evidence layer (e.g., no GTEx expression, no scRNA-seq detection, no IMPC phenotype) are treated as having "low expression" or "no phenotype" rather than "unknown." This systematically penalizes genes simply because they haven't been measured in the right contexts. For example, a gene expressed in rare retinal cell subtypes may be absent from bulk GTEx data and appear in only 1-2 cells per scRNA-seq atlas, leading to false classification as "not expressed in relevant tissues."
**Why it happens:**
- Default pandas merge behavior drops unmatched records or fills with NaN
- Scoring functions that sum across layers implicitly assign 0 to missing layers
- GTEx bulk tissue lacks resolution for rare cell types (e.g., photoreceptor subtypes, vestibular hair cells)
- scRNA-seq atlases have dropout and may not capture low-abundance transcripts
- Model organism phenotypes exist only for ~40% of human genes
- gnomAD constraint metrics are unreliable for genes with low coverage regions
**How to avoid:**
1. **Explicit missing data encoding:** Use a three-state system: "present" (data exists and positive), "absent" (data exists and negative), "unknown" (no data available)
2. **Layer-specific score normalization:** Compute scores only across layers with data; do not penalize genes for missing layers (see the sketch after this list)
3. **Imputation with biological priors:** For genes with orthologs but no direct human data, propagate evidence from mouse/zebrafish with confidence weighting
4. **Coverage-aware constraint metrics:** Only use gnomAD pLI/LOEUF when coverage is adequate (mean depth >30x and >90% of coding sequence covered)
5. **Aggregate at appropriate resolution:** For scRNA-seq, aggregate to cell-type level rather than individual cells to handle dropout
6. **Document data availability per gene:** Create a metadata field tracking which evidence layers are available for each gene to enable layer-aware filtering
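A minimal sketch of items 1-2, assuming each evidence layer has already been normalized to [0, 1] and that NaN means "unknown" rather than zero; layer names, column names, and the simple weighted mean are placeholders for the real scoring design.
```python
import numpy as np
import pandas as pd

LAYERS = ["expression", "constraint", "domains", "ortholog", "localization"]

def integrate(scores: pd.DataFrame, weights: dict[str, float]) -> pd.DataFrame:
    w = pd.Series(weights)[LAYERS]
    available = scores[LAYERS].notna()                  # which layers have data, per gene
    # Re-normalize weights per gene over the available layers only, so a
    # missing layer neither contributes 0 nor silently shrinks the score.
    weight_sum = (available * w).sum(axis=1)
    weighted = (scores[LAYERS].fillna(0) * w).sum(axis=1)
    out = scores.copy()
    out["combined_score"] = np.where(weight_sum > 0, weighted / weight_sum, np.nan)
    out["n_layers_available"] = available.sum(axis=1)   # record data coverage per gene
    return out
```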
**Warning signs:**
- Genes with partial data coverage systematically rank lower than genes with complete data
- Number of scored genes drops dramatically when adding a new evidence layer (should be union, not intersection)
- High-confidence candidates from preliminary analysis disappear after integrating additional databases
- Known Usher genes with tissue-specific expression patterns score poorly
**Phase to address:**
**Phase 2 (Data Integration):** Design merge strategy that preserves all genes and explicitly tracks data availability per evidence layer. Implement imputation strategy for ortholog-based evidence propagation.
---
### Pitfall 4: Batch Effects Misinterpreted as Biological Signal in scRNA-seq Integration
**What goes wrong:**
When integrating scRNA-seq data from multiple atlases (e.g., retinal atlas, inner ear atlas, nasal epithelium datasets), technical batch effects between studies are mistakenly interpreted as tissue-specific expression patterns. Methods that over-correct for batch effects can erase true biological variation, while under-correction leads to false cell-type-specific signals. Recent benchmarking shows "only 27% of integration outputs performed better than the best unintegrated data," and methods like LIGER, BBKNN, and Seurat v3 "tended to favor removal of batch effects over conservation of biological variation."
**Why it happens:**
- Different scRNA-seq platforms (10X Chromium vs. Smart-seq2 vs. single-nuclei) have systematically different detection profiles
- Batch effects arise from "cell isolation protocols, library preparation technology, and sequencing platforms"
- Integration algorithms make tradeoffs between removing technical variation vs. preserving biological differences
- "Increasing KullbackLeibler divergence regularization does not improve integration and adversarial learning removes biological signals"
- Tissue-of-origin effects (primary tissue vs. organoid vs. cell culture) can be confounded with true cell-type identity
**How to avoid:**
1. **Validate integration quality:** After batch correction, verify that known marker genes still show expected cell-type-specific patterns
2. **Compare multiple integration methods:** Run Harmony, Seurat v5, and scVI in parallel; select based on preservation of positive control markers
3. **Use positive/negative control genes:** Ensure known cilia genes (IFT88, BBS1) show expected enrichment in ciliated cells post-integration (see the sketch after this list)
4. **Stratify by sequencing technology:** Analyze 10X datasets separately from Smart-seq2 before attempting cross-platform integration
5. **Prefer within-study comparisons:** When possible, compare cell types within the same study rather than integrating across studies
6. **Document integration parameters:** Record all batch correction hyperparameters (k-neighbors, PCA dimensions, integration strength) to enable sensitivity analysis
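A small post-integration check in the spirit of item 3, assuming an AnnData object with a `cell_type` column in `.obs` and gene symbols as `var_names`; the marker list and column name are illustrative.
```python
import scanpy as sc
from anndata import AnnData

def check_cilia_markers(adata: AnnData, markers=("IFT88", "BBS1")) -> None:
    present = [g for g in markers if g in adata.var_names]
    # Mean expression per annotated cell type; ciliated and photoreceptor
    # populations should remain clearly enriched if biology survived correction.
    df = sc.get.obs_df(adata, keys=present + ["cell_type"])
    print(df.groupby("cell_type")[present].mean().round(2))
    # Optional visual check: sc.pl.dotplot(adata, present, groupby="cell_type")
```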
**Warning signs:**
- Cell types from different studies don't overlap in UMAP space after integration
- Marker genes lose cell-type-specificity after batch correction
- Technical replicates from the same study cluster separately after integration
- Integration method produces >50% improvement in batch-mixing metrics but known biological markers disappear
**Phase to address:**
**Phase 4 (scRNA-seq Processing):** Implement multi-method integration pipeline with positive control validation. Create cell-type-specific expression profiles only after validating preservation of known markers.
---
### Pitfall 5: Ortholog Function Conservation Over-Assumed
**What goes wrong:**
Phenotypes from mouse knockouts (MGI) or zebrafish morphants (ZFIN) are naively transferred to human genes as if function were perfectly conserved, ignoring cases where orthologs have "dramatically different functions." This is especially problematic for gene families with lineage-specific duplications or losses. The assumption that "single-copy orthologs are more reliable for functional annotation" is contradicted by evidence showing "multi-copy genes are equally or more likely to provide accurate functional information."
**Why it happens:**
- Automated orthology pipelines use sequence similarity thresholds without functional validation
- "Bidirectional best hit (BBH) assumption can be false because genes may be each other's highest-ranking matches due to differential gene loss"
- Domain recombination creates false orthology assignments where proteins share some but not all domains
- Zebrafish genome duplication means many human genes have two zebrafish co-orthologs with subfunctionalized roles
- "There is no universally applicable, unequivocal definition of conserved function"
- Cilia gene functions may diverge between motile (zebrafish) and non-motile/sensory (mammalian) cilia contexts
**How to avoid:**
1. **Confidence-weight ortholog evidence:** Use orthology confidence scores from databases (e.g., DIOPT scores, OMA groups) rather than binary ortholog/non-ortholog
2. **Require phenotype relevance:** Only count ortholog phenotypes that match the target biology (ciliary defects, sensory organ abnormalities, not generic lethality)
3. **Handle one-to-many orthologs explicitly:** For human genes with multiple zebrafish co-orthologs, aggregate phenotypes using OR logic (any co-ortholog with phenotype = positive evidence), as sketched after this list
4. **Validate with synteny:** Prioritize orthologs supported by both sequence similarity and conserved genomic context
5. **Species-appropriate expectations:** Expect stronger conservation for cilia structure genes (IFT machinery) vs. sensory signaling cascades (tissue-specific)
6. **Cross-validate with multiple species:** Genes with convergent phenotypes across mouse, zebrafish, AND Drosophila are more reliable than single-species evidence
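Item 3's OR logic, sketched over a long-form table of ortholog calls with a DIOPT-style confidence column and a precomputed boolean for cilia-relevant phenotypes; all column names and the confidence threshold are placeholders.
```python
import pandas as pd

def aggregate_ortholog_evidence(ortho: pd.DataFrame, min_confidence: int = 3) -> pd.DataFrame:
    confident = ortho[ortho["diopt_score"] >= min_confidence]
    return (
        confident.groupby("human_gene")
        .agg(
            any_cilia_phenotype=("cilia_phenotype", "any"),  # OR logic across co-orthologs
            best_confidence=("diopt_score", "max"),
            n_co_orthologs=("model_gene", "nunique"),
        )
        .reset_index()
    )
```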
**Warning signs:**
- Ortholog evidence contradicts human genetic data (e.g., mouse knockout viable but human LoF variants cause disease)
- Large fraction of human genes map to paralogs in model organism rather than true orthologs
- Zebrafish co-orthologs have opposing phenotypes (one causes ciliopathy, other is wildtype)
- Synteny breaks detected for claimed ortholog pairs
**Phase to address:**
**Phase 2 (Data Integration - Ortholog Module):** Implement orthology confidence scoring and phenotype relevance filtering. Create manual curation workflow for ambiguous high-scoring candidates.
---
### Pitfall 6: Constraint Metrics (gnomAD pLI/LOEUF) Misinterpreted
**What goes wrong:**
gnomAD constraint scores are used as disease gene predictors without understanding their limitations. Researchers assume high pLI (>0.9) or low LOEUF (<0.35) automatically indicates dominant disease genes, but "even the most highly constrained genes are not necessarily autosomal dominant." Transcript selection errors cause dramatic score changes—SHANK2 has pLI=0 in the canonical transcript but pLI=1 in the brain-specific transcript due to differential exon usage. Low-coverage genes have unreliable constraint metrics but are not flagged.
**Why it happens:**
- pLI is dichotomous (>0.9 or <0.1), ignoring genes in the intermediate range
- LOEUF thresholds changed between gnomAD v2 (<0.35) and v4 (<0.6), causing confusion
- "There is no brain expression from over half the exons in the SHANK2 canonical transcript where most of the protein truncating variants are found"
- Genes with low sequencing coverage or high GC content have unreliable observed/expected ratios
- Recessive disease genes can have low constraint (heterozygous LoF is benign)
- Haploinsufficient genes vs. dominant-negative mechanisms are not distinguished
**How to avoid:**
1. **Use LOEUF continuously, not dichotomously:** Score genes on the LOEUF scale rather than applying hard cutoffs; this gives partial credit across the distribution (see the sketch after this list)
2. **Verify transcript selection:** For ciliopathy candidates, ensure the scored transcript includes exons expressed in retina/inner ear using GTEx isoform data
3. **Check coverage metrics:** Only use pLI/LOEUF when mean coverage >30x and >90% of CDS is covered in gnomAD
4. **Adjust expectations for inheritance pattern:** High constraint supports haploinsufficiency; moderate constraint is compatible with recessive inheritance (relevant for Usher syndrome)
5. **Cross-validate with ClinVar:** Check whether existing pathogenic variants in the gene match the constraint prediction
6. **Incorporate missense constraint:** Use gnomAD missense Z-scores alongside LoF metrics; some genes tolerate LoF but are missense-constrained
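A sketch combining items 1 and 3: treat LOEUF as a continuous score, but return "unknown" rather than zero when gnomAD coverage is inadequate. Column names and the 2.0 scaling ceiling are assumptions to be tuned against the actual constraint export.
```python
import numpy as np
import pandas as pd

def constraint_score(df: pd.DataFrame) -> pd.Series:
    adequate = (df["mean_depth"] > 30) & (df["cds_covered_frac"] > 0.9)
    # Lower LOEUF (more constrained) -> higher score, spread continuously over [0, 1]
    scaled = 1 - df["loeuf"].clip(lower=0, upper=2.0) / 2.0
    return scaled.where(adequate, other=np.nan)  # low coverage -> "unknown", not zero
```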
**Warning signs:**
- Candidate genes have low gnomAD coverage but constraint scores are used anyway
- All top candidates have pLI >0.9, despite Usher syndrome being recessive
- Constraint scores change dramatically when switching from canonical to tissue-specific transcript
- Genes with established recessive inheritance pattern are filtered out due to low constraint
**Phase to address:**
**Phase 3 (Scoring System Design - Constraint Module):** Implement coverage-aware constraint scoring with transcript validation. Create inheritance-pattern-adjusted weighting (lower weight for dominant constraint when prioritizing recessive disease genes).
---
### Pitfall 7: "Unknown Function" Operationally Undefined Creates Circular Logic
**What goes wrong:**
The pipeline aims to discover "under-studied" or "unknown function" genes but lacks a rigorous operational definition. If "unknown function" means "few PubMed papers," the literature bias pitfall is worsened. If it means "no GO annotations," it excludes genes with partial functional knowledge that might be excellent candidates. Approximately "40% of proteins in eukaryotic genomes are proteins of unknown function," but defining unknown is complex—"when a group of related sequences contains one or more members of known function, the similarity approach assigns all to the known space, whereas empirical approach distinguishes between characterized and uncharacterized candidates."
**Why it happens:**
- GO term coverage is publication-biased; "52-79% of bacterial proteomes can be functionally annotated based on homology searches" but eukaryotes have more uncharacterized proteins
- Partial functional knowledge exists on a spectrum (domain predictions, general pathway assignments, no mechanistic details)
- "Research bias, as measured by publication volume, was an important factor influencing genome annotation completeness"
- Negative selection (excluding known cilia genes) requires defining "known," which is database- and date-dependent
**How to avoid:**
1. **Multi-tier functional classification:** (a code sketch follows this list)
- Tier 1: Direct experimental evidence of cilia/Usher involvement (exclude from discovery)
- Tier 2: Strong functional prediction (domains, orthologs) but no direct evidence (high-priority targets)
- Tier 3: Minimal functional annotation (under-studied candidates)
- Tier 4: No annotation beyond gene symbol (true unknowns)
2. **Explicit positive control exclusion list:** Maintain a curated list of ~200 known cilia genes and established Usher genes to exclude; version-control this list
3. **Publication-independent functional metrics:** Define "unknown" using absence of experimental GO evidence codes (EXP, IDA, IPI, IMP), not total publication count
4. **Pathway coverage threshold:** Consider genes "known" if they appear in >3 canonical cilia pathways (IFT, transition zone, basal body assembly)
5. **Temporal versioning:** Tag genes as "unknown as of [date]" to allow retrospective validation when new discoveries are published
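The tiering in item 1 can be made operational with a very small function; the inputs below (a versioned exclusion set, per-gene GO annotations, and a precomputed "functional prediction" flag) are assumptions about upstream modules rather than an established schema.
```python
def functional_tier(gene_id: str,
                    go_terms: set[str],              # any GO annotation for the gene
                    known_cilia_genes: set[str],     # versioned exclusion list (item 2)
                    has_functional_prediction: bool  # domains / confident orthologs
                    ) -> int:
    """Return tier 1-4 as defined in item 1 (1 = exclude from discovery, 4 = true unknown)."""
    if gene_id in known_cilia_genes:
        return 1
    if has_functional_prediction:
        return 2
    if go_terms:
        return 3
    return 4
```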
**Warning signs:**
- Definition of "unknown" changes between pilot and production runs
- Candidates include genes with extensive literature on cilia-related functions
- Positive control genes leak into discovery set due to incomplete exclusion list
- Different team members have conflicting intuitions about whether a gene is "known enough" to exclude
**Phase to address:**
**Phase 1 (Data Infrastructure - Annotation Module):** Create explicit functional classification schema with clear inclusion/exclusion criteria. Build curated positive control list with literature provenance.
---
### Pitfall 8: Reproducibility Theater Without Computational Environment Control
**What goes wrong:**
Pipeline scripts are version-controlled and documented, creating an illusion of reproducibility, but results cannot be reproduced due to uncontrolled dependencies. Python package versions (pandas, numpy, scikit-learn), database snapshots (Ensembl release, gnomAD version, GTEx v8 vs v9), and API response changes cause silent result drift. "Workflow managers were developed in response to challenges with data complexity and reproducibility" but are often not adopted until after initial analyses are complete and results have diverged.
**Why it happens:**
- `pip install package` without pinning versions installs latest release, which may change behavior
- Ensembl/NCBI/UniProt databases update continuously; API calls return different data over time
- Downloaded files are not checksummed; corruption or incomplete downloads go undetected
- Reference genome versions (GRCh37 vs GRCh38) are mixed across data sources
- RAM/disk caching causes results to differ between first run and subsequent runs
- Random seeds not set for stochastic algorithms (UMAP, t-SNE, subsampling)
**How to avoid:**
1. **Pin all dependencies:** Use `requirements.txt` with exact versions (`pandas==2.0.3` not `pandas>=2.0`) or `conda env export`
2. **Containerize the environment:** Build Docker/Singularity container with frozen versions; run all analyses inside container
3. **Snapshot external databases:** Download full database dumps (Ensembl, GTEx) with version tags; do not rely on live API queries for production runs
4. **Checksum all downloaded data:** Compute and store MD5/SHA256 hashes for every downloaded file; verify on load (see the sketch after this list)
5. **Version-control intermediate outputs:** Store preprocessed data files (post-QC, post-normalization) with version tags to enable restart from checkpoints
6. **Set random seeds globally:** Fix numpy/torch/random seeds at the start of every script to ensure stochastic steps are reproducible
7. **Log provenance metadata:** Embed database versions, software versions, and run parameters in output files using JSON headers or HDF5 attributes
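Items 4 and 7 can be covered with the standard library alone; this sketch checksums input files and writes a provenance sidecar next to each output (field names and the sidecar naming convention are assumptions).
```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path

def sha256sum(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_provenance(output: Path, inputs: dict[str, Path], db_versions: dict[str, str]) -> None:
    meta = {
        "created": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "database_versions": db_versions,   # e.g. {"ensembl": "111", "gnomad": "v4.1"}
        "input_checksums": {name: sha256sum(p) for name, p in inputs.items()},
    }
    sidecar = output.with_name(output.name + ".prov.json")
    sidecar.write_text(json.dumps(meta, indent=2))
```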
**Warning signs:**
- Results change when re-running the same script on a different machine
- Collaborator cannot reproduce your gene rankings despite using "the same code"
- Adding a new analysis step changes results from previous steps (should be impossible if truly modular)
- You cannot explain why the gene count changed between last month's and this month's runs
**Phase to address:**
**Phase 1 (Data Infrastructure):** Establish containerized environment with pinned dependencies and database version snapshots. Implement checksumming and provenance logging from the start.
---
## Technical Debt Patterns
Shortcuts that seem reasonable but create long-term problems.
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Hard-code database API URLs in scripts | Quick to write | Breaks when APIs change; no version control | Never—use config file |
| Convert all IDs to gene symbols early | Simpler to read outputs | Loses ability to trace back to source; mapping errors propagate | Only for final display, never internal processing |
| Filter to protein-coding genes only, drop non-coding | Reduces dataset size | Misses lncRNAs with cilia-regulatory roles | Acceptable for MVP if explicitly documented as limitation |
| Use default merge (inner join) in pandas | Fewer rows to process | Silent data loss when IDs don't match | Never—always left join with explicit logging |
| Skip validation on positive control genes | Faster iteration | No way to detect when scoring system breaks | Never—positive controls are mandatory |
| Download data via API calls in main pipeline | No manual download step | Irreproducible; results change as databases update | Only during exploration phase, never production |
| Store intermediate data as CSV | Easy to inspect manually | Loss of data types (ints become floats); no metadata storage | Acceptable for small tables <10K rows |
| Use `pip install --upgrade` to fix bugs | Gets latest fixes | Introduces breaking changes unpredictably | Never—pin versions and upgrade explicitly |
| Aggregate to gene-level immediately from scRNA-seq | Simpler analysis | Loses cell-type resolution; can't detect subtype-specific expression | Only for bulk comparison, not prioritization |
---
## Integration Gotchas
Common mistakes when connecting to external services.
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Ensembl BioMart | Selecting canonical transcript only; miss tissue-specific isoforms | Query all transcripts, then filter by retina/inner ear expression using GTEx |
| gnomAD API | Not checking coverage per gene before using constraint scores | Filter to genes with mean_depth >30x and >90% CDS covered |
| GTEx Portal | Using TPM directly without accounting for sample size per tissue | Normalize by sample count and use median TPM across replicates |
| CellxGene API | Downloading full h5ad files serially (slow, memory-intensive) | Use CellxGene's API to fetch only cell-type-aggregated counts for genes of interest |
| UniProt REST API | Converting gene lists one-by-one in a loop | Use batch endpoint with POST requests (up to 100K IDs per request) |
| PubMed E-utilities | Sending requests without API key (3 req/sec limit) | Register NCBI API key for 10 req/sec; still implement exponential backoff |
| MGI/ZFIN batch queries | Assuming 1:1 human-mouse orthology | Handle one-to-many mappings explicitly (use highest-confidence ortholog or aggregate) |
| IMPC phenotype API | Taking all phenotypes as equally informative | Filter to cilia-relevant phenotypes (MP:0003935 cilium; HP:0000508 retinal degeneration) |
| String-DB | Assuming all edges are physical interactions | Filter by interaction type (text-mining vs experimental) and confidence score >0.7 |
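Several rows above boil down to the same plumbing: one shared `requests` session with retries, exponential backoff, and explicit timeouts. A minimal sketch follows; the Ensembl lookup URL is only an example endpoint, and the retry settings are placeholders.
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries: int = 5, backoff: float = 1.0) -> requests.Session:
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,                      # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # rate limits and transient server errors
        allowed_methods=["GET", "POST"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

session = make_session()
resp = session.get(
    "https://rest.ensembl.org/lookup/symbol/homo_sapiens/IFT88",  # example endpoint
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
```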
---
## Performance Traps
Patterns that work at small scale but fail as usage grows.
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Loading entire scRNA-seq h5ad into memory | Script crashes with MemoryError | Use backed mode (`anndata.read_h5ad(file, backed='r')`) to stream from disk | >5GB file or <32GB RAM |
| Nested loops for gene-gene comparisons | Script runs for hours | Vectorize with numpy/pandas; use scipy.spatial.distance for pairwise | >1K genes |
| Re-downloading data on every run | Slow iteration, API rate limits | Cache downloaded files locally with checksums; only re-download if missing | Always implement caching |
| Storing all intermediate results in memory | Cannot debug failed runs | Write intermediate outputs to disk after each major step | >50K genes or complex pipeline |
| Single-threaded processing | Slow on large gene sets | Parallelize with joblib/multiprocessing for embarrassingly parallel tasks | >10K genes or >1hr runtime |
| Not indexing database tables | Slow queries on merged datasets | Create indexes on ID columns (Ensembl ID, HGNC symbol) before joins | >20K genes or >5 tables |
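For the first trap in the table, anndata's backed mode streams a large `.h5ad` from disk and materializes only the genes of interest; the file path and gene list are illustrative.
```python
import anndata as ad

adata = ad.read_h5ad("retina_atlas.h5ad", backed="r")   # stays on disk, not in RAM
genes = [g for g in ("IFT88", "BBS1", "USH2A") if g in adata.var_names]
subset = adata[:, genes].to_memory()                     # load only the selected columns
print(subset.shape)
```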
---
## Phase-Specific Warnings
| Phase Topic | Likely Pitfall | Mitigation |
|-------------|---------------|------------|
| Data download (Phase 1) | API rate limits hit during batch download | Implement exponential backoff and retry logic; use NCBI API keys |
| ID mapping (Phase 1) | Multiple IDs map to the same symbol | Create explicit disambiguation strategy; log all many-to-one mappings |
| scRNA-seq integration (Phase 4) | Batch effects erase biological signal | Validate integration by checking known marker genes before/after |
| Scoring system (Phase 3) | Literature bias dominates scores | Normalize by publication count; validate on under-studied positive controls |
| Validation (Phase 5) | Positive controls not scoring as expected | Debug scoring system before declaring it ready; iterate on weights |
| Reproducibility (All phases) | Results differ between runs | Pin dependency versions and database snapshots from Phase 1 |
| Missing data handling (Phase 2) | Genes with incomplete data rank artificially low | Implement layer-aware scoring that doesn't penalize missing data |
| Ortholog phenotypes (Phase 2) | Mouse/zebrafish phenotypes over-interpreted | Require phenotype relevance filtering (cilia-related only) |
---
## "Looks Done But Isn't" Checklist
Things that appear complete but are missing critical pieces.
- [ ] **ID mapping module:** Often missing validation step to ensure known genes are successfully mapped—verify 100% of positive control genes map successfully
- [ ] **Scoring system:** Often missing publication bias correction—verify correlation between final score and PubMed count is <0.5
- [ ] **scRNA-seq integration:** Often missing positive control validation—verify known cilia genes show expected cell-type enrichment post-integration
- [ ] **Constraint metrics:** Often missing coverage check—verify genes have adequate gnomAD coverage before using pLI/LOEUF
- [ ] **Ortholog evidence:** Often missing confidence scoring—verify orthology confidence scores are used, not just binary ortholog calls
- [ ] **Data provenance:** Often missing version logging—verify every output file records database versions and software versions used
- [ ] **Missing data handling:** Often missing explicit "unknown" state—verify merge strategy preserves genes with partial data
- [ ] **Reproducibility:** Often missing checksums on downloaded data—verify MD5 hashes are computed and stored for all external data files
- [ ] **Positive controls:** Often missing negative control validation—verify negative controls (non-cilia genes) score appropriately low
- [ ] **API error handling:** Often missing retry logic—verify exponential backoff is implemented for all external API calls
---
## Recovery Strategies
When pitfalls occur despite prevention, how to recover.
| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| ID mapping errors detected late | LOW | Re-run mapping module with corrected conversion tables; propagate fixes forward |
| Literature bias discovered after scoring | MEDIUM | Add publication normalization to scoring function; re-compute all scores |
| Batch effects in scRNA-seq integration | MEDIUM | Try alternative integration method (switch from Seurat to Harmony); re-validate |
| Missing data treated as negative | HIGH | Redesign merge strategy to preserve all genes; re-run entire integration pipeline |
| Reproducibility failure (cannot re-run) | HIGH | Containerize environment; snapshot databases; document and re-run from scratch |
| Positive controls score poorly | MEDIUM | Debug scoring function weights; validate on stratified test set; adjust weights iteratively |
| Ortholog function over-assumed | LOW | Add orthology confidence scores; re-filter to high-confidence orthologs only |
| Constraint metrics misinterpreted | LOW | Add coverage check; use LOEUF continuously instead of dichotomous; re-score |
| "Unknown function" poorly defined | MEDIUM | Create explicit functional tiers; rebuild positive control exclusion list; re-classify |
---
## Pitfall-to-Phase Mapping
How roadmap phases should address these pitfalls.
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| Gene ID mapping inconsistency | Phase 1: Data Infrastructure | 100% of positive controls successfully mapped across all databases |
| Literature bias amplification | Phase 3: Scoring System | Correlation(final_score, pubmed_count) < 0.5; under-studied positive controls rank in top 10% |
| Missing data as negative evidence | Phase 2: Data Integration | Gene count preserved after merges (should be union, not intersection); explicit "unknown" state in data model |
| scRNA-seq batch effects | Phase 4: scRNA-seq Processing | Known cilia markers show expected cell-type specificity post-integration |
| Ortholog function over-assumed | Phase 2: Ortholog Module | Orthology confidence scores used; phenotype relevance filtering applied |
| Constraint metrics misinterpreted | Phase 3: Constraint Module | Coverage-aware filtering; LOEUF used continuously; transcript validation performed |
| "Unknown function" undefined | Phase 1: Annotation Module | Explicit functional tiers defined; positive control exclusion list version-controlled |
| Reproducibility failure | Phase 1: Environment Setup | Containerized; dependencies pinned; databases snapshotted; results bit-identical on re-run |
| API rate limits hit | Phase 1: Data Download | Retry logic with exponential backoff; batch queries used; NCBI API key registered |
---
## Sources
### Gene ID Mapping
- [How to map all Ensembl IDs to Gene Symbols (Bioconductor)](https://support.bioconductor.org/p/87454/)
- [UniProt genomic mapping for deciphering functional effects of missense variants - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC6563471/)
- [Comparing HGNC Symbol Mappings by 3 Different Databases](https://www.ffli.dev/posts/2021-07-11-comparing-hgnc-symbol-mappings-by-3-different-databases/)
### Literature Bias and Gene Annotation
- [Gene annotation bias impedes biomedical research | Scientific Reports](https://www.nature.com/articles/s41598-018-19333-x)
- [Annotating Genes of Known and Unknown Function by Large-Scale Coexpression Analysis - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC2330292/)
- [An assessment of genome annotation coverage across the bacterial tree of life - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7200070/)
- [Functional unknomics: Systematic screening of conserved genes of unknown function | PLOS Biology](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002222)
### Multi-Database Integration and Missing Data
- [Missing data in multi-omics integration: Recent advances through artificial intelligence - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC9949722/)
- [Handling missing rows in multi-omics data integration: Multiple imputation in multiple factor analysis framework | BMC Bioinformatics](https://link.springer.com/article/10.1186/s12859-016-1273-5)
- [A technical review of multi-omics data integration methods: from classical statistical to deep generative approaches - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC12315550/)
### Weighted Scoring Systems
- [IW-Scoring: an Integrative Weighted Scoring framework for annotating and prioritizing genetic variations in the noncoding genome - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC5934661/)
- [A large-scale benchmark of gene prioritization methods | Scientific Reports](https://www.nature.com/articles/srep46598)
### Gene Prioritization and LLM Biases
- [Survey and improvement strategies for gene prioritization with large language models | Bioinformatics Advances](https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf148/8172498)
- [Automating candidate gene prioritization with large language models | Bioinformatics](https://academic.oup.com/bioinformatics/article/41/10/btaf541/8280402)
- [What Are The Most Common Stupid Mistakes In Bioinformatics?](https://www.biostars.org/p/7126/)
### scRNA-seq Integration and Batch Effects
- [Integrating single-cell RNA-seq datasets with substantial batch effects | BMC Genomics](https://link.springer.com/article/10.1186/s12864-025-12126-3)
- [Batch correction methods used in single-cell RNA sequencing analyses are often poorly calibrated - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC12315870/)
- [Benchmarking atlas-level data integration in single-cell genomics | Nature Methods](https://www.nature.com/articles/s41592-021-01336-8)
- [Benchmarking cross-species single-cell RNA-seq data integration methods | Nucleic Acids Research](https://academic.oup.com/nar/article/53/1/gkae1316/7945393)
### Rare Disease Gene Discovery and Validation
- [Rare disease gene association discovery in the 100,000 Genomes Project | Nature](https://www.nature.com/articles/s41586-025-08623-w)
- [ClinPrior: an algorithm for diagnosis and novel gene discovery by network-based prioritization | Genome Medicine](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-023-01214-2)
- [Strategies to Uplift Novel Mendelian Gene Discovery for Improved Clinical Outcomes | Frontiers](https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.674295/full)
### Reproducible Bioinformatics Pipelines
- [The five pillars of computational reproducibility: bioinformatics and beyond - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10591307/)
- [Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers | Nature Methods](https://www.nature.com/articles/s41592-021-01254-9)
- [Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10030817/)
- [Investigating reproducibility and tracking provenance A genomic workflow case study | BMC Bioinformatics](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1747-0)
### Cilia Gene Discovery and Ciliopathy Prioritization
- [A prioritization tool for cilia-associated genes and their in vivo resources unveils new avenues for ciliopathy research - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC11512102/)
- [CilioGenics: an integrated method and database for predicting novel ciliary genes | bioRxiv](https://www.biorxiv.org/content/10.1101/2023.03.31.535034v1.full)
### Usher Syndrome Gene Discovery
- [Usher Syndrome: Genetics and Molecular Links of Hearing Loss | Frontiers](https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2020.565216/full)
- [The genetic and phenotypic landscapes of Usher syndrome | Human Genetics](https://link.springer.com/article/10.1007/s00439-022-02448-7)
- [Usher Syndrome: Genetics of a Human Ciliopathy - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC8268283/)
- [Usher Syndrome: Genetics of a Human Ciliopathy | MDPI](https://www.mdpi.com/1422-0067/22/13/6723)
### gnomAD Constraint Metrics
- [gnomAD v4.0 Gene Constraint | gnomAD browser](https://gnomad.broadinstitute.org/news/2024-03-gnomad-v4-0-gene-constraint/)
- [Gene constraint and genotype-phenotype correlations in neurodevelopmental disorders - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10340126/)
- [No preferential mode of inheritance for highly constrained genes - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC8898387/)
- [Variant interpretation using population databases: Lessons from gnomAD - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC9160216/)
### API Rate Limits and Batch Download
- [How to avoid NCBI API rate limits? | LifeSciencesHub](https://www.lifescienceshub.ai/guides/how-to-avoid-ncbi-api-rate-limits-step-by-step-guide)
- [Batch retrieval & ID mapping | UniProt](https://www.ebi.ac.uk/training/online/courses/uniprot-exploring-protein-sequence-and-functional-info/how-to-use-uniprot-tools-clone/batch-retrieval-id-mapping/)
- [UniProt website API documentation](https://www.uniprot.org/api-documentation/:definition)
- [The UniProt website API: facilitating programmatic access to protein knowledge - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC12230682/)
### Ortholog Inference and Limitations
- [Functional and evolutionary implications of gene orthology | Nature Reviews Genetics](https://www.nature.com/articles/nrg3456)
- [Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals | PLOS Computational Biology](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002073)
- [Standardized benchmarking in the quest for orthologs | Nature Methods](https://www.nature.com/articles/nmeth.3830)
- [Getting Started in Gene Orthology and Functional Analysis - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC2845645/)
### Tissue Specificity and Cell Type Deconvolution
- [Cellular deconvolution of GTEx tissues powers discovery of disease and cell-type associated regulatory variants | Nature Communications](https://www.nature.com/articles/s41467-020-14561-0)
- [TransTEx: novel tissue-specificity scoring method | Bioinformatics](https://academic.oup.com/bioinformatics/article/40/8/btae475/7731001)
- [Comprehensive evaluation of deconvolution methods for human brain gene expression | Nature Communications](https://www.nature.com/articles/s41467-022-28655-4)
### PubMed Literature Mining
- [Text Mining Genotype-Phenotype Relationships from Biomedical Literature | PLOS Computational Biology](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005017)
- [GLAD4U: deriving and prioritizing gene lists from PubMed literature | BMC Genomics](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-S8-S20)
- [Mining experimental evidence of molecular function claims from the literature - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC3041023/)
---
*Pitfalls research for: Bioinformatics Gene Candidate Discovery Pipeline for Cilia/Usher Syndrome*
*Researched: 2026-02-11*
*Confidence: HIGH (validated with 40+ authoritative sources across all risk domains)*

.planning/research/STACK.md Normal file
View File

@@ -0,0 +1,293 @@
# Technology Stack
**Project:** Bioinformatics Cilia/Usher Gene Discovery Pipeline
**Researched:** 2026-02-11
**Confidence:** HIGH
## Recommended Stack
### Core Framework
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| Python | 3.12+ | Pipeline runtime | Industry standard for bioinformatics, extensive ecosystem, type hints, best library support. Avoid 3.10-3.11 (key libraries such as scanpy 1.12 require 3.12) and 3.14 (too new for some libraries). |
| Polars | 1.38+ | DataFrame processing | 6-38x faster than Pandas for genomic operations (via polars-bio), native Rust backend, streaming for large datasets, better memory efficiency for ~20K gene analysis. |
| Typer | 0.21+ | CLI framework | Modern type-hint based CLI, auto-generates help docs, built on Click (battle-tested), cleaner than argparse for modular scripts. |
| Pydantic | 2.12+ | Data validation & config | Type-safe configuration, 1.5-1.75x faster than v1, validates gene scores/weights, prevents config errors that waste compute time. |
### Data Access Libraries
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| gget | 0.30+ | Multi-source gene annotation | PRIMARY: Unified API for Ensembl, UniProt, NCBI. Fetches gene metadata, sequences, GO terms. Official Context7 support, actively maintained. |
| pyensembl | Latest | Ensembl GTF/FASTA access | FALLBACK: If gget insufficient. Downloads/caches Ensembl data locally. More control but less convenient than gget. |
| Biopython | 1.86+ | Sequence analysis, format parsing | ALWAYS: Parse FASTA/GenBank, sequence manipulation, Entrez ID conversion. De facto standard, mandatory for bioinformatics. Requires Python 3.10+. |
| requests | 2.32+ | HTTP API calls | ALWAYS: Session-based API calls to REST endpoints (InterPro, STRING, gnomAD). Connection pooling, retry logic, timeout control. |
| metapub | 0.6.4+ | PubMed literature mining | PRIMARY for PubMed: Abstracts via eutils, DOI finding, formatted citations. Production-tested at bioinformatics facilities. Python 3.8+. |
| gnomad-toolbox | 0.0.1+ | gnomAD constraint metrics | PRIMARY: Official Broad Institute tool (Jan 2025), loads/filters constraint data. Requires Python 3.9+. Note: Very new, validate stability. |
### scRNA-seq & Expression Data
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| scanpy | 1.12+ | scRNA-seq analysis | For processing single-cell cilia expression. Clustering, QC, cell type annotation. Requires Python 3.12+. |
| anndata | 0.11+ | scRNA-seq data structures | DEPENDENCY: Required by scanpy. Stores expression matrices. |
| pyGTEx | Git latest | GTEx tissue expression | For GTEx API queries. Install from GitHub: pip install git+https://github.com/w-gao/pyGTEx.git |
| tspex | Latest | Tissue specificity scoring | For calculating tissue-specific expression scores from GTEx data. |
### Data Validation & Quality
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| pandera | 0.29+ | DataFrame schema validation | ALWAYS: Validate gene data schemas, catch data quality issues early. 1.5-1.75x faster with Pydantic v2. Supports Polars/Pandas. Python 3.10+. |
| pydantic-settings | 2.12+ | YAML/TOML config loading | ALWAYS: Load pipeline configs with validation. Built-in TOML/YAML support (Nov 2025). Prevents config typos from wasting GPU time. |
### Workflow & CLI
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| rich | 14.3+ | CLI progress bars, formatting | ALWAYS: Progress tracking for 20K gene processing, formatted tables, error highlighting. Production/Stable. Python 3.8+. |
| structlog | Latest | Structured logging | ALWAYS: JSON logs for pipeline debugging, correlation IDs for tracking genes through pipeline, contextvars for thread safety. |
| click | Latest | CLI subcommands | IF NEEDED: Fallback if Typer insufficient (Typer built on Click). |
### Performance & Caching
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| diskcache | Latest | Disk-based API response caching | ALWAYS: Cache API responses (Ensembl, UniProt, gnomAD) to avoid re-fetching during reruns. Persistent across runs. |
| joblib | Latest | NumPy/ML model caching | For AlphaFold-Multimer result caching, large array serialization. Disk-based, survives restarts. |
| httpx | 0.28+ | Async HTTP client (OPTIONAL) | ONLY IF async needed: HTTP/2 support, faster concurrent API calls. Otherwise use requests (simpler). |
### Testing & Development
| Tool | Version | Purpose | Notes |
|------|---------|---------|-------|
| pytest | Latest | Testing framework | Industry standard. Fixtures for test data, parametrization for edge cases. |
| pytest-cov | Latest | Coverage reporting | Ensure gene scoring logic tested. |
| mypy | Latest | Static type checking | Catch type errors pre-runtime. Essential with Pydantic/Typer type hints. |
| ruff | Latest | Linting & formatting | 10-100x faster than pylint/black. Rust-based, configurable. |
### Package Management
| Tool | Version | Purpose | Notes |
|------|---------|---------|-------|
| uv | Latest | Dependency manager | 10-100x faster than pip, replaces pip/pip-tools/poetry/pyenv. Rust-based (Astral team). Strict PEP compliance. Use for new project. |
| pyproject.toml | - | Dependency specification | Standard Python packaging. Works with uv/pip. |
## Installation
```bash
# Install uv (package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create project with Python 3.12
uv init
uv python install 3.12
uv python pin 3.12
# Core dependencies
uv add polars==1.38.1
uv add typer==0.21.2
uv add pydantic==2.12.5
uv add pydantic-settings==2.12.0
# Data access
uv add gget==0.30.2
uv add biopython==1.86
uv add requests==2.32.5
uv add metapub==0.6.4
uv add gnomad-toolbox==0.0.1
# scRNA-seq & expression
uv add scanpy==1.12
uv add anndata==0.11.4
uv add "pyGTEx @ git+https://github.com/w-gao/pyGTEx.git"
# Validation & quality
uv add pandera==0.29.0
# CLI & workflow
uv add rich==14.3.2
uv add structlog
# Performance & caching
uv add diskcache
uv add joblib
# Dev dependencies
uv add --dev pytest
uv add --dev pytest-cov
uv add --dev mypy
uv add --dev ruff
```
## Alternatives Considered
| Category | Recommended | Alternative | Why Not Alternative |
|----------|-------------|-------------|---------------------|
| DataFrame | Polars | Pandas | Pandas 6-38x slower for genomic interval operations (bioframe vs polars-bio benchmarks). Polars has streaming for >RAM datasets. |
| CLI | Typer | argparse | argparse verbose, no auto-docs. Typer uses type hints, generates help automatically. |
| CLI | Typer | Click | Click requires decorators for everything. Typer cleaner with type hints. (Typer built on Click internally.) |
| HTTP | requests | httpx | httpx adds async/HTTP/2 complexity. Requests simpler for synchronous API calls. Use httpx only if async needed. |
| Package Mgr | uv | Poetry | Poetry slower (not Rust-based), separate config file. uv 10-100x faster, uses standard pyproject.toml, replaces more tools. |
| Package Mgr | uv | pip + pip-tools | uv replaces both, adds reproducible lock files, faster resolution. |
| Config | pydantic-settings | Hydra | Hydra overkill for pipeline config. Better for ML experiments with many hyperparameter sweeps. pydantic-settings simpler. |
| Logging | structlog | stdlib logging | stdlib logging requires custom filters for structured output. structlog built for JSON logs from start. |
| Workflow | Modular scripts | Snakemake | Snakemake overkill for local pipeline. Adds complexity, HPC features not needed. Use if scaling to cluster later. |
| Workflow | Modular scripts | Nextflow | Nextflow for cloud/enterprise. Groovy syntax steep learning curve. Not needed for local NVIDIA 4090 workstation. |
## What NOT to Use
| Avoid | Why | Use Instead |
|-------|-----|-------------|
| Pandas (alone) | 6-38x slower than Polars for genomic operations. Memory inefficient for 20K genes. | Polars with polars-bio extension |
| Python 3.9 or older | Many modern libraries require 3.10+ (Biopython 1.86, scanpy 1.12, pandera 0.29). Missing type hint features. | Python 3.12 (stable, well-supported) |
| Python 3.14 | Too new. Some libraries not tested yet. Risk of compatibility issues. | Python 3.12 or 3.13 |
| Pydantic v1 | 1.5-1.75x slower than v2. V2 major rewrite with better validation. | Pydantic 2.12+ |
| argparse for complex CLI | Verbose, manual help text, no type validation. | Typer (type-hint based) |
| Poetry (for new projects) | Slower dependency resolution, separate config file, fewer tools replaced. | uv (10-100x faster, standard pyproject.toml) |
| Manual API retries | Error-prone, inconsistent timeout handling. | requests.Session with retry adapters + timeout |
| print() debugging | No structured logs, hard to trace gene through pipeline. | structlog with correlation IDs |
## Stack Patterns by Use Case
**Local workstation pipeline (current requirement):**
- Use Polars + modular Typer CLI scripts
- Avoid Snakemake/Nextflow (overkill)
- Use diskcache for API responses (persistent across runs; see the sketch after this list)
- Use rich for progress tracking (20K genes takes time)
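A minimal sketch of the diskcache pattern referenced above, assuming JSON-returning REST endpoints; the cache directory matches `api_cache_dir` in the configuration example below, and the week-long TTL is an arbitrary placeholder.
```python
import diskcache
import requests

cache = diskcache.Cache(".cache/api_responses")

def cached_get_json(url: str, ttl_seconds: int = 7 * 24 * 3600) -> dict:
    hit = cache.get(url)
    if hit is not None:
        return hit
    resp = requests.get(url, headers={"Content-Type": "application/json"}, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    cache.set(url, data, expire=ttl_seconds)  # persists across pipeline reruns
    return data
```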
**If scaling to HPC cluster later:**
- Add Snakemake for workflow orchestration
- Keep modular scripts (Snakemake calls them)
- Use job scheduler integration (SLURM/PBS)
**If adding async API calls:**
- Replace requests with httpx
- Use asyncio event loop
- Batch API calls (e.g., 100 genes at once to UniProt)
**For GPU-accelerated steps (AlphaFold-Multimer downstream):**
- NVIDIA 4090 has 24GB VRAM (below 32GB recommended minimum for AlphaFold2 NIM)
- May work for smaller predictions with appropriate configurations
- Use joblib to cache results (avoid re-running expensive predictions)
## Version Compatibility
| Package | Requires Python | Notes |
|---------|-----------------|-------|
| Biopython 1.86 | >= 3.10 | Blocks Python 3.9 |
| scanpy 1.12 | >= 3.12 | Blocks Python 3.10, 3.11 |
| Polars 1.38 | >= 3.10 | Recommended 3.12+ |
| Typer 0.21 | >= 3.9 | Works with 3.9-3.14 |
| Pydantic 2.12 | >= 3.9 | Requires v2 for performance |
| pandera 0.29 | >= 3.10 | Supports Polars + Pandas |
| metapub 0.6.4 | >= 3.8 | Older requirement, works with 3.12 |
| gnomad-toolbox 0.0.1 | >= 3.9 | Very new (Jan 2025), monitor stability |
**Minimum Python:** 3.12 (required by scanpy)
**Recommended Python:** 3.12 (stable, well-supported by all libraries)
## Data Source APIs & Formats
| Source | Access Method | Library | Format | Notes |
|--------|---------------|---------|--------|-------|
| Ensembl | REST API | gget, pyensembl | JSON, GTF | Rate limits apply. Cache with diskcache. |
| UniProt | REST API | gget, requests | JSON, XML | Batch queries supported. |
| NCBI/PubMed | Entrez eutils | metapub, Biopython.Entrez | XML | Free but rate limited (3 req/sec without key). |
| gnomAD | Downloads + API | gnomad-toolbox | Hail Tables, TSV | Large files. Use constraint metrics API. |
| InterPro | REST API | requests | JSON | Protein domains/motifs. Free, rate limited. |
| GTEx | Portal + API | pyGTEx | TSV, JSON | Expression data. Portal downloads for bulk. |
| HPA | Downloads | requests | TSV | Tissue/cell expression. Download bulk files. |
| STRING | API + downloads | requests | TSV | PPI networks. API for queries, downloads for bulk. |
| BioGRID | Downloads | requests | TSV | PPI data. Download releases, parse locally. |
| MGI/ZFIN/IMPC | Web + downloads | requests | Various | No official Python API. Web scraping or bulk downloads. |
## Configuration Strategy
Use Pydantic models for type-safe configuration:
```python
# config.py
from pydantic import BaseModel, Field
from pydantic_settings import (
    BaseSettings, SettingsConfigDict,
    TomlConfigSettingsSource, YamlConfigSettingsSource,
)

class ScoringWeights(BaseModel):
    tissue_expression: float = Field(ge=0, le=1)
    protein_domains: float = Field(ge=0, le=1)
    genetic_constraint: float = Field(ge=0, le=1)
    literature_score: float = Field(ge=0, le=1)

class PipelineConfig(BaseSettings):
    model_config = SettingsConfigDict(
        toml_file="pipeline_config.toml",
        yaml_file="pipeline_config.yaml",   # YAML source requires PyYAML
    )
    weights: ScoringWeights
    api_cache_dir: str = ".cache/api_responses"
    output_tiers: list[str] = ["high", "medium", "low"]

    # toml_file/yaml_file are only read once their sources are registered
    @classmethod
    def settings_customise_sources(
        cls, settings_cls, init_settings, env_settings, dotenv_settings, file_secret_settings
    ):
        return (init_settings, TomlConfigSettingsSource(settings_cls),
                YamlConfigSettingsSource(settings_cls), env_settings)
```
Load from TOML/YAML, validated at startup. Prevents runtime config errors.
## Logging Strategy
Use structlog for structured JSON logging:
```python
import structlog

# Configure once at startup
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()

# In pipeline, bind gene context
logger = logger.bind(gene_id="ENSG00000123456", gene_symbol="USH2A")
logger.info("scoring_complete", score=0.85, tier="high")
```
Enables filtering/searching logs by gene, tracing genes through pipeline.
## Sources
**HIGH CONFIDENCE (Official Docs + Context7):**
- [Polars 1.38.1 PyPI](https://pypi.org/project/polars/) - Version verified Feb 2026
- [gget 0.30.2 PyPI](https://pypi.org/project/gget/) - Version verified Feb 2026
- [Biopython 1.86 PyPI](https://pypi.org/project/biopython/) - Version verified Oct 2025
- [Typer 0.21.2 PyPI](https://pypi.org/project/typer/) - Version verified Feb 2026
- [Pydantic 2.12.5 PyPI](https://pypi.org/project/pydantic/) - Version verified Nov 2025
- [pandera 0.29.0 PyPI](https://pypi.org/project/pandera/) - Version verified Jan 2026
- [requests 2.32.5 PyPI](https://pypi.org/project/requests/) - Version verified Aug 2025
- [httpx 0.28.1 PyPI](https://pypi.org/project/httpx/) - Version verified Dec 2024
- [rich 14.3.2 PyPI](https://pypi.org/project/rich/) - Version verified Feb 2026
- [scanpy 1.12 PyPI](https://pypi.org/project/scanpy/) - Version verified Jan 2026
- [metapub 0.6.4 PyPI](https://pypi.org/project/metapub/) - Version verified Aug 2025
- [gnomad-toolbox 0.0.1 PyPI](https://pypi.org/project/gnomad-toolbox/) - Version verified Jan 2025
**MEDIUM CONFIDENCE (WebSearch + Official Sources):**
- [polars-bio performance benchmarks](https://academic.oup.com/bioinformatics/article/41/12/btaf640/8362264) - Oxford Academic, Dec 2025
- [gget genomic database querying](https://pmc.ncbi.nlm.nih.gov/articles/PMC9835474/) - PMC publication
- [STRING database 12.5 update](https://academic.oup.com/nar/article/53/D1/D730/7903368) - NAR 2025
- [InterPro 2025 update](https://academic.oup.com/nar/article/53/D1/D444/7905301) - NAR 2025
- [uv package manager overview](https://www.analyticsvidhya.com/blog/2025/08/uv-python-package-manager/) - Multiple sources confirm 10-100x speedup
- [Pydantic-settings TOML/YAML support](https://levelup.gitconnected.com/pydantic-settings-2025-a-clean-way-to-handle-configs-f1c432030085) - Nov 2025 release notes
- [Python logging best practices 2025](https://signoz.io/guides/python-logging-best-practices/) - Industry practices
- [Nextflow vs Snakemake 2025 comparison](https://www.tracer.cloud/resources/bioinformatics-pipeline-frameworks-2025) - Workflow framework analysis
**MEDIUM-LOW CONFIDENCE (Limited verification):**
- AlphaFold NVIDIA 4090 compatibility - [GitHub issues](https://github.com/kalininalab/alphafold_non_docker/issues/77) suggest challenges, official docs specify 80GB GPUs
- MGI/ZFIN/IMPC Python API - No official Python libraries found, [ZFIN](https://zfin.org/) and [MGI](https://www.informatics.jax.org/) provide web interfaces and bulk downloads
---
*Stack research for: Bioinformatics Cilia/Usher Gene Discovery Pipeline*
*Researched: 2026-02-11*
*Confidence: HIGH for core stack, MEDIUM for workflow alternatives, MEDIUM-LOW for animal model APIs*

View File

@@ -0,0 +1,301 @@
# Project Research Summary
**Project:** Bioinformatics Cilia/Usher Gene Discovery Pipeline
**Domain:** Gene Candidate Discovery and Prioritization for Rare Disease / Ciliopathy Research
**Researched:** 2026-02-11
**Confidence:** MEDIUM-HIGH
## Executive Summary
This project requires a multi-evidence bioinformatics pipeline to prioritize under-studied gene candidates for ciliopathy and Usher syndrome research. The recommended approach is a **staged, modular Python pipeline** using independent evidence retrieval layers (annotation, expression, protein features, localization, genetic constraint, phenotypes) that feed into weighted scoring and tiered output generation. Python 3.12+ with Polars for data processing, DuckDB for intermediate storage, and Typer for CLI orchestration represents the modern bioinformatics stack optimized for ~20K gene analysis on local workstations.
The critical architectural pattern is **independent evidence layers with staged persistence**: each layer retrieves, normalizes, and caches data separately before a final integration step combines scores. This enables restartability, parallel execution, and modular testing—essential for pipelines with external API dependencies and long runtimes. Configuration-driven behavior with YAML configs and Pydantic validation ensures reproducibility and enables parameter tuning without code changes.
The dominant risks are **gene ID mapping inconsistencies** (causing silent data loss across databases) and **literature bias amplification** (over-prioritizing well-studied genes at the expense of novel candidates). Both require explicit mitigation: standardized Ensembl gene IDs throughout, mapping validation gates, and publication-independent scoring layers weighted heavily. Success depends on validation against positive controls (known Usher/cilia genes) and negative controls (non-cilia housekeeping genes) to prove the scoring system works before running at scale.
## Key Findings
### Recommended Stack
Python 3.12+ provides the best balance of library support and stability. **Polars** (1.38+) outperforms Pandas by 6-38x for genomic operations with better memory efficiency for 20K gene datasets. **DuckDB** enables out-of-core analytical queries on Parquet-cached intermediate results, solving the memory pressure problem when integrating 6 evidence layers. **Typer** offers modern, type-hint-based CLI design with auto-generated help, cleaner than argparse for modular pipeline scripts.
**Core technologies:**
- **Python 3.12**: Industry standard for bioinformatics with extensive ecosystem (Biopython, scanpy, gget)
- **Polars 1.38+**: DataFrame processing 6-38x faster than Pandas, native streaming for large datasets
- **DuckDB**: Columnar analytics database for intermediate storage; 10-100x faster than SQLite for aggregations, queries Parquet files without full import
- **Typer 0.21+**: CLI framework with type-hint-based interface, auto-generates help documentation
- **gget 0.30+**: Unified API for multi-source gene annotation (Ensembl, UniProt, NCBI) — primary data access layer
- **Pydantic 2.12+**: Type-safe configuration and data validation; 1.5-1.75x faster than v1, prevents config errors
- **structlog**: Structured JSON logging with correlation IDs to trace genes through pipeline
- **diskcache**: Persistent API response caching across reruns, critical for avoiding API rate limits
- **uv**: Rust-based package manager 10-100x faster than pip, replaces pip/pip-tools/poetry
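As a rough illustration of how these pieces compose, the sketch below shows DuckDB scanning a Parquet-cached layer result out-of-core and handing it to Polars for per-layer 0-1 normalization. File paths, column names, and the LOEUF inversion are illustrative assumptions, not a fixed schema.

```python
# Minimal sketch: DuckDB querying a Parquet cache without a full import, then
# Polars for normalization. Paths and column names are placeholders.
import duckdb
import polars as pl

con = duckdb.connect("pipeline.duckdb")

# DuckDB scans the Parquet file out-of-core; no load step required.
constraint = con.execute(
    """
    SELECT gene_id, loeuf
    FROM read_parquet('cache/constraint/gnomad_constraint.parquet')
    WHERE loeuf IS NOT NULL
    """
).pl()  # hand the result to Polars

# Per-layer 0-1 scaling; LOEUF is inverted so lower LOEUF (more constrained) scores higher.
normalized = constraint.with_columns(
    ((pl.col("loeuf").max() - pl.col("loeuf"))
     / (pl.col("loeuf").max() - pl.col("loeuf").min())).alias("constraint_score")
)
```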
**Avoid:**
- Pandas alone (too slow, memory-inefficient)
- Python 3.9 or older (missing library support)
- Workflow orchestrators (Snakemake/Nextflow) for initial local workstation pipeline (overkill; adds complexity without benefit)
- Poetry for new projects (slower than uv, less standard)
### Expected Features
**Must have (table stakes):**
- **Multi-evidence scoring**: 6 layers (annotation, expression, protein features, localization, constraint, phenotypes) with weighted integration — standard in modern gene prioritization
- **Reproducibility documentation**: Parameter logging, version tracking, seed control — required for publication
- **Data provenance tracking**: W3C PROV standard; track all transformations, source versions, timestamps
- **Known gene validation**: Benchmark against established ciliopathy genes (CiliaCarta, SYSCILIA, Usher genes) with recall metrics
- **Quality control checks**: Missing data detection, outlier identification, distribution checks
- **API-based data retrieval**: Automated queries to gnomAD, GTEx, HPA, UniProt with rate limiting, caching, and retry logic (see the retrieval sketch after this list)
- **Batch processing**: Parallel execution across 20K genes with progress tracking and resume-from-checkpoint
- **Parameter configuration**: YAML config for weights, thresholds, data sources
- **Tiered output with rationale**: High/Medium/Low confidence tiers with evidence summaries per tier
- **Basic visualization**: Score distributions, rank plots, evidence contribution charts
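A minimal sketch of the retrieval pattern flagged above, assuming a generic JSON POST endpoint; the URL, payload shape, cache directory, and backoff policy are illustrative rather than any specific API's documented contract.

```python
# Minimal sketch: cached, rate-limit-aware API retrieval with exponential backoff.
import json
import time
import requests
import diskcache

cache = diskcache.Cache("cache/api_responses")  # persistent across reruns

def fetch_json(url: str, payload: dict, max_retries: int = 5) -> dict:
    key = url + "::" + json.dumps(payload, sort_keys=True)
    if key in cache:
        return cache[key]  # reuse the cached response; avoids repeat API hits
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, timeout=30)
        if resp.status_code == 429:        # rate limited by the server
            time.sleep(2 ** attempt)       # exponential backoff, then retry
            continue
        resp.raise_for_status()
        data = resp.json()
        cache[key] = data
        return data
    raise RuntimeError(f"Exceeded {max_retries} retries for {url}")
```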
**Should have (competitive differentiators):**
- **Explainable scoring**: SHAP-style per-gene evidence breakdown showing WHY genes rank high — critical for discovery vs. diagnosis
- **Systematic under-annotation bias handling**: Correct publication bias by downweighting literature-heavy features for under-studied candidates — novel research advantage
- **Sensitivity analysis**: Systematic parameter sweep with rank stability metrics to demonstrate robustness
- **Evidence conflict detection**: Flag genes with contradictory evidence patterns (e.g., high expression but low constraint)
- **Interactive HTML report**: Browsable results with sortable tables, linked evidence sources
- **Cross-species homology scoring**: Zebrafish/mouse phenotype evidence integration via ortholog mapping
**Defer (v2+):**
- **Automated literature scanning with LLM**: RAG-based PubMed evidence extraction — high complexity, cost, uncertainty
- **Incremental update capability**: Re-run with new data without full recomputation — overkill for one-time discovery
- **Multi-modal evidence weighting optimization**: Bayesian integration, cross-validation — requires larger training set
- **Cilia-specific knowledgebase integration**: CilioGenics, CiliaMiner — nice-to-have, primary layers sufficient initially
### Architecture Approach
Multi-evidence gene prioritization pipelines follow a **layered architecture** with independent data retrieval modules feeding normalization, scoring, and validation layers. The standard pattern is a **staged pipeline** where each evidence layer operates independently, writes intermediate results to disk (Parquet cache + DuckDB tables), then a final integration component performs SQL joins and weighted scoring.
**Major components:**
1. **CLI Orchestration Layer** (Typer) — Entry point with subcommands (run-layer, integrate, report), global config management, pipeline orchestration
2. **Data Retrieval Layer** (6 modules) — Independent retrievers for annotation, expression, protein features, localization, constraint, phenotypes; API clients with caching and retry logic
3. **Normalization/Transform Layer** — Per-layer parsers converting raw formats to standardized schemas; gene ID mapping to Ensembl; score normalization to 0-1 scale
4. **Data Storage Layer** — Raw cache (Parquet), intermediate results (DuckDB), final output (TSV/Parquet); enables restartability and out-of-core analytics
5. **Integration/Scoring Layer** — Multi-evidence scoring via SQL joins in DuckDB; weighted aggregation; known gene filtering; confidence tier assignment
6. **Reporting Layer** — Per-gene evidence summaries; provenance metadata generation; tiered candidate lists
**Key architectural patterns:**
- **Independent Evidence Layers**: No cross-layer dependencies; enables parallel execution and isolated testing
- **Staged Data Persistence**: Each stage writes to disk before proceeding; enables restart-from-checkpoint and debugging
- **Configuration-Driven Behavior**: YAML configs for weights/thresholds; code reads config, never hardcodes parameters
- **Provenance Tracking**: Every output includes metadata (pipeline version, data source versions, timestamps, config hash)
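To make the provenance pattern concrete, a minimal sketch follows; the field names, version constant, and example data-source versions are assumptions, not a fixed metadata schema.

```python
# Minimal sketch of a provenance block attached to every output table.
import hashlib
import json
import datetime

PIPELINE_VERSION = "0.1.0"  # illustrative; in practice read from package metadata

def provenance(config: dict, data_versions: dict[str, str]) -> dict:
    """Metadata embedded alongside every output (pipeline version, config hash, timestamps)."""
    config_blob = json.dumps(config, sort_keys=True).encode()
    return {
        "pipeline_version": PIPELINE_VERSION,
        "config_sha256": hashlib.sha256(config_blob).hexdigest(),
        "data_source_versions": data_versions,  # e.g. {"gnomad": "v4.1", "gtex": "v10"}
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```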
### Critical Pitfalls
1. **Gene ID Mapping Inconsistency Cascade** — Over 51% of Ensembl IDs can fail symbol conversion; 8% discordance between Swiss-Prot and Ensembl. One-to-many mappings break naive merges, causing silent data loss. **Avoid by:** Using Ensembl gene IDs as primary keys throughout; version-locking annotation builds; implementing mapping validation gates that report % successfully mapped (see the sketch after this list); manually reviewing unmapped high-priority genes.
2. **Literature Bias Amplification in Weighted Scoring** — Well-studied genes dominate scores because GO annotations, pathway coverage, interaction networks, and PubMed mentions all correlate with research attention rather than biological relevance. Systematically deprioritizes novel candidates. **Avoid by:** Decoupling publication metrics from functional evidence; normalizing scores by baseline publication count; heavily weighting sequence-based features independent of literature; requiring evidence diversity across multiple layers; validating against under-studied positive controls.
3. **Missing Data Handled as "Negative Evidence"** — Genes lacking data in a layer (no GTEx expression, no IMPC phenotype) are treated as "low score" rather than "unknown," systematically penalizing under-measured genes. **Avoid by:** Explicit three-state encoding (present/absent/unknown); layer-specific score normalization that doesn't penalize missing data; imputation using ortholog evidence with confidence weighting; coverage-aware constraint metrics (only use gnomAD when coverage >30x).
4. **Batch Effects Misinterpreted as Biological Signal in scRNA-seq** — Integration of multi-atlas scRNA-seq data (retina, inner ear, nasal epithelium) can erase true biological variation or create false cell-type signals. Only 27% of integration outputs perform better than unintegrated data. **Avoid by:** Validating integration quality using known marker genes; comparing multiple methods (Harmony, Seurat v5, scVI); using positive controls (known cilia genes should show expected cell-type enrichment); stratifying by sequencing technology before cross-platform integration.
5. **Reproducibility Theater Without Computational Environment Control** — Scripts are version-controlled but results drift due to uncontrolled dependencies (package versions, database snapshots, API response changes, random seeds). **Avoid by:** Pinning all dependencies with exact versions; containerizing environment (Docker/Singularity); snapshotting external databases with version tags; checksumming all downloaded data; setting random seeds globally; logging provenance metadata in all outputs.
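A sketch of the mapping validation gate from Pitfall #1, assuming a Polars frame with a `gene_symbol` column and a symbol-to-Ensembl lookup table; the 95% threshold is an illustrative default, not a researched cutoff.

```python
# Minimal sketch of a gene ID mapping validation gate (Pitfall #1).
import polars as pl

def apply_mapping_gate(df: pl.DataFrame, mapping: pl.DataFrame,
                       min_mapped_fraction: float = 0.95) -> pl.DataFrame:
    """Left-join symbols to Ensembl IDs and fail loudly if too many genes are lost."""
    # Note: a one-to-many mapping will duplicate rows here; review/deduplicate before scoring.
    merged = df.join(mapping, on="gene_symbol", how="left")
    mapped = merged.filter(pl.col("ensembl_gene_id").is_not_null())
    frac = mapped.height / max(df.height, 1)
    if frac < min_mapped_fraction:
        unmapped = merged.filter(pl.col("ensembl_gene_id").is_null())["gene_symbol"].to_list()
        raise ValueError(
            f"Only {frac:.1%} of genes mapped to Ensembl IDs; "
            f"first unmapped symbols: {unmapped[:10]}"
        )
    return mapped
```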
## Implications for Roadmap
Based on research, suggested phase structure follows **dependency-driven ordering**: infrastructure first, single-layer prototype to validate architecture, parallel evidence layer development, integration/scoring, then reporting/polish.
### Phase 1: Data Infrastructure & Configuration
**Rationale:** All downstream components depend on config system, gene ID mapping utilities, and data storage patterns. Critical to establish before any evidence retrieval.
**Delivers:** Project skeleton, YAML config loading with Pydantic validation, DuckDB schema setup, gene ID mapping utility, API caching infrastructure.
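A minimal sketch of the config-loading piece, assuming PyYAML for parsing and Pydantic v2 models; field names, weight bounds, and the pinned-version check are illustrative.

```python
# Minimal sketch: YAML config loading with Pydantic v2 validation.
from pathlib import Path
import yaml
from pydantic import BaseModel, Field, field_validator

class ScoringWeights(BaseModel):
    expression: float = Field(ge=0, le=1)
    constraint: float = Field(ge=0, le=1)
    annotation: float = Field(ge=0, le=1)

class PipelineConfig(BaseModel):
    ensembl_release: int          # version-locked annotation build
    gnomad_version: str
    weights: ScoringWeights

    @field_validator("gnomad_version")
    @classmethod
    def non_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("gnomad_version must be pinned, not empty")
        return v

def load_config(path: Path) -> PipelineConfig:
    return PipelineConfig.model_validate(yaml.safe_load(path.read_text()))

# Usage: cfg = load_config(Path("config.yaml"))
```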
**Addresses:**
- Reproducibility documentation (table stakes)
- Parameter configuration (table stakes)
- Data provenance tracking (table stakes)
**Avoids:**
- Gene ID mapping inconsistency (Pitfall #1) via standardized Ensembl IDs and validation gates
- Reproducibility failure (Pitfall #8) via containerization and version pinning
**Research flag:** SKIP deeper research — standard Python packaging patterns, well-documented.
### Phase 2: Single Evidence Layer Prototype
**Rationale:** Validate the retrieval → normalization → storage pattern before scaling to 6 layers. Identifies architectural issues early with low cost.
**Delivers:** One complete evidence layer (recommend starting with genetic constraint: gnomAD pLI/LOEUF as simplest API), end-to-end flow from API to DuckDB storage.
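To show the unknown-state handling this prototype should establish (see the Avoids list below), a minimal Polars sketch follows; column names and the 30x coverage cutoff come from the research notes and remain assumptions to validate.

```python
# Minimal sketch of explicit three-state evidence encoding for the constraint layer,
# so missing gnomAD data is never scored as negative evidence. Assumes one row per
# gene in the analysis universe (left-joined against the gnomAD constraint table).
import polars as pl

def encode_constraint(df: pl.DataFrame) -> pl.DataFrame:
    return df.with_columns(
        pl.when(pl.col("mean_coverage") < 30)
          .then(pl.lit("unknown"))          # under-covered: do not penalize
          .when(pl.col("loeuf").is_null())
          .then(pl.lit("absent"))           # covered but no constraint estimate
          .otherwise(pl.lit("present"))
          .alias("constraint_evidence_state")
    )
```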
**Addresses:**
- API-based data retrieval (table stakes)
- Quality control checks (table stakes)
**Avoids:**
- Missing data as negative evidence (Pitfall #3) by designing explicit unknown-state handling from start
- Constraint metrics misinterpreted (Pitfall #6) via coverage-aware filtering
**Research flag:** MAY NEED `/gsd:research-phase` — gnomAD API usage patterns and coverage thresholds need validation.
### Phase 3: Remaining Evidence Layers (Parallel Work)
**Rationale:** Layers are independent by design and can be built in parallel; there are no interdependencies between the annotation, expression, protein features, localization, and phenotype modules.
**Delivers:** Five additional evidence layers replicating the Phase 2 pattern (annotation, expression, protein features, localization, phenotypes), plus the literature scan.
**Addresses:**
- Multi-evidence scoring foundation (table stakes) — requires all 6 layers operational
**Avoids:**
- Ortholog function over-assumed (Pitfall #5) via confidence scoring and phenotype relevance filtering
- API rate limits (Performance trap) via batch queries, exponential backoff, API keys
**Research flag:** NEEDS `/gsd:research-phase` for specialized APIs — CellxGene, IMPC, MGI/ZFIN integration patterns are niche and poorly documented.
### Phase 4: Integration & Scoring System
**Rationale:** Requires all evidence layers complete. Core scientific logic. Most complex component requiring validation against positive/negative controls.
**Delivers:** Multi-evidence scoring via SQL joins in DuckDB, weighted aggregation, known gene exclusion filter (CiliaCarta/SYSCILIA/OMIM), confidence tier assignment (high/medium/low).
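A sketch of what the integration query might look like in DuckDB; table names, weights, tier cutoffs, and the renormalization over available layers are illustrative choices, not finalized scoring rules.

```python
# Minimal sketch: SQL joins across per-layer tables, weighted aggregation with
# weights renormalized over the layers actually present (so missing data is not
# silently scored as zero), known-gene exclusion, and tier assignment.
import duckdb

con = duckdb.connect("pipeline.duckdb")
scored = con.execute(
    """
    WITH combined AS (
        SELECT g.gene_id,
               (0.25 * COALESCE(e.expression_score, 0)
              + 0.20 * COALESCE(c.constraint_score, 0)
              + 0.15 * COALESCE(a.annotation_score, 0))
               / NULLIF(0.25 * (e.expression_score IS NOT NULL)::INT
                      + 0.20 * (c.constraint_score IS NOT NULL)::INT
                      + 0.15 * (a.annotation_score IS NOT NULL)::INT, 0)
               AS combined_score
        FROM genes g
        LEFT JOIN expression e USING (gene_id)
        LEFT JOIN constraint_metrics c USING (gene_id)
        LEFT JOIN annotation a USING (gene_id)
        WHERE g.gene_id NOT IN (SELECT gene_id FROM known_cilia_genes)  -- exclusion filter
    )
    SELECT gene_id, combined_score,
           -- combined_score is NULL when no layer has data; handle separately in practice
           CASE WHEN combined_score >= 0.7 THEN 'high'
                WHEN combined_score >= 0.4 THEN 'medium'
                ELSE 'low' END AS confidence_tier
    FROM combined
    ORDER BY combined_score DESC
    """
).pl()
```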
**Addresses:**
- Multi-evidence scoring (table stakes) — weighted integration of all layers
- Known gene validation (table stakes) — benchmarking against established ciliopathy genes
- Tiered output with rationale (table stakes) — confidence classification
**Avoids:**
- Literature bias amplification (Pitfall #2) via publication-independent scoring layers and diversity requirements
- Missing data as negative evidence (Pitfall #3) via layer-aware scoring that preserves genes with partial data
**Research flag:** MAY NEED `/gsd:research-phase` — weight optimization strategies and validation metrics need domain-specific tuning.
### Phase 5: Reporting & Provenance
**Rationale:** Depends on integration layer producing scored results. Presentation layer, less critical for initial validation.
**Delivers:** Per-gene evidence summaries, provenance metadata generation (versions, timestamps, config hash), tiered candidate lists (TSV + Parquet), basic visualizations (score distributions, top candidates).
**Addresses:**
- Structured output format (table stakes)
- Basic visualization (table stakes)
- Data provenance tracking (completion)
**Research flag:** SKIP deeper research — standard output formatting and metadata patterns.
### Phase 6: CLI Orchestration & End-to-End Testing
**Rationale:** Integrates all components. User-facing interface. Validates the full pipeline on test gene sets before the production run.
**Delivers:** Unified Typer CLI app with subcommands (run-layer, integrate, report), dependency checking, progress logging with Rich, force-refresh flags, partial reruns.
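A minimal Typer sketch of the subcommand layout named above; option names and command bodies are placeholders.

```python
# Minimal sketch of the CLI skeleton: one Typer app, one command per pipeline stage.
import typer

app = typer.Typer(help="Cilia/Usher gene discovery pipeline")

@app.command("run-layer")
def run_layer(layer: str, force_refresh: bool = typer.Option(False, "--force-refresh")):
    """Retrieve, normalize, and cache a single evidence layer."""
    typer.echo(f"Running evidence layer: {layer} (force_refresh={force_refresh})")

@app.command()
def integrate(config: str = "config.yaml"):
    """Join all cached layers in DuckDB and compute weighted scores."""
    typer.echo(f"Integrating layers using {config}")

@app.command()
def report(out_dir: str = "results/"):
    """Write tiered candidate lists and per-gene evidence summaries."""
    typer.echo(f"Writing report to {out_dir}")

if __name__ == "__main__":
    app()
```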
**Addresses:**
- Batch processing (table stakes) — end-to-end orchestration with progress tracking
**Research flag:** SKIP deeper research — Typer CLI patterns well-documented.
### Phase 7: Validation & Weight Tuning
**Rationale:** Once the pipeline is operational end-to-end, systematically validate against positive controls (known Usher/cilia genes) and negative controls (housekeeping genes), then iterate on scoring weights.
**Delivers:** Validation metrics (recall@10, recall@50 for positive controls; precision for negative controls), sensitivity analysis across parameter sweeps, finalized scoring weights.
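A small sketch of the recall@k metric against a positive-control set; gene IDs below are toy values.

```python
# Minimal sketch of recall@k for validation against known ciliopathy genes.
def recall_at_k(ranked_genes: list[str], positives: set[str], k: int) -> float:
    """Fraction of positive-control genes recovered in the top-k ranked candidates."""
    top_k = set(ranked_genes[:k])
    return len(top_k & positives) / len(positives) if positives else 0.0

# Toy example: 2 of 3 positive controls appear in the top 4
ranked = ["ENSG0001", "ENSG0002", "ENSG0003", "ENSG0004", "ENSG0005"]
positives = {"ENSG0002", "ENSG0004", "ENSG0099"}
print(recall_at_k(ranked, positives, k=4))  # 0.67
```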
**Addresses:**
- Known gene validation (completion) — quantitative benchmarking
- Sensitivity analysis (differentiator) — robustness demonstration
**Research flag:** MAY NEED `/gsd:research-phase` — validation metrics for gene prioritization pipelines need domain research.
### Phase 8 (v1.x): Explainability & Advanced Reporting
**Rationale:** After v1 produces a validated candidate list. Adds interpretability for hypothesis generation and publication.
**Delivers:** SHAP-style per-gene evidence breakdown, interactive HTML report with sortable tables, evidence conflict detection, negative control validation.
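For a linear weighted score, a per-gene evidence breakdown can be computed directly as weight times normalized layer score. The sketch below is simpler than true SHAP (it reports raw contributions rather than deviations from a baseline); the layer names and weights are illustrative.

```python
# Minimal sketch of a per-gene evidence breakdown for a weighted linear score.
def evidence_breakdown(layer_scores: dict[str, float],
                       weights: dict[str, float]) -> list[tuple[str, float]]:
    contributions = {
        layer: weights[layer] * layer_scores.get(layer, 0.0)
        for layer in weights
    }
    # Sort layers by how much they pushed the gene up the ranking.
    return sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)

# Example: a gene with strong expression and constraint evidence, weak annotation
print(evidence_breakdown(
    {"expression": 0.9, "constraint": 0.8, "annotation": 0.1},
    {"expression": 0.25, "constraint": 0.20, "annotation": 0.15},
))
```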
**Addresses:**
- Explainable scoring (differentiator)
- Interactive HTML report (differentiator)
- Evidence conflict detection (differentiator)
**Research flag:** NEEDS `/gsd:research-phase` — SHAP for non-ML scoring systems and HTML report generation for bioinformatics need investigation.
### Phase Ordering Rationale
- **Sequential dependencies:** Phase 1 → Phase 2 → Phase 4 (infrastructure → prototype → integration) form the critical path; each depends on prior completion.
- **Parallelizable work:** Phase 3 (the remaining evidence layers) can be built concurrently by different developers or sequentially with rapid iteration.
- **Defer polish until validation:** Phases 5-6 (reporting, CLI) can be minimal initially; focus on scientific correctness (Phase 4) first.
- **Validate before extending:** Phase 7 (validation) must succeed before adding Phase 8 (advanced features); prevents building on broken foundation.
**Architecture-driven grouping:** Phases align with architectural layers (CLI orchestration, data retrieval, normalization, storage, integration, reporting) rather than arbitrary feature sets. Enables modular testing and isolated debugging.
**Pitfall avoidance:** Early phases establish mitigation strategies:
- Phase 1 prevents gene ID mapping issues (Pitfall #1) and reproducibility failures (Pitfall #8)
- Phase 2 prototype validates missing data handling (Pitfall #3)
- Phase 4 integration layer addresses literature bias (Pitfall #2)
- Validation throughout prevents silent failures from propagating
### Research Flags
Phases likely needing deeper research during planning:
- **Phase 2 (Constraint Metrics):** gnomAD API usage patterns, coverage thresholds, transcript selection — MEDIUM priority research
- **Phase 3 (Specialized APIs):** CellxGene census API, IMPC phenotype API, MGI/ZFIN bulk download workflows — HIGH priority research (niche, sparse documentation)
- **Phase 4 (Weight Optimization):** Validation metrics for gene prioritization, weight tuning strategies, positive control benchmark datasets — MEDIUM priority research
- **Phase 7 (Validation Metrics):** Recall/precision thresholds for rare disease discovery, statistical significance testing for constraint enrichment — LOW priority research (standard metrics available)
- **Phase 8 (SHAP for Rule-Based Systems):** Explainability methods for non-ML weighted scoring, HTML report generation best practices — MEDIUM priority research
Phases with standard patterns (skip research-phase):
- **Phase 1 (Infrastructure):** Python packaging, YAML config loading, DuckDB schema design — well-documented, no novelty
- **Phase 5 (Reporting):** Output formatting, CSV/Parquet generation, basic matplotlib/seaborn visualizations — standard patterns
- **Phase 6 (CLI):** Typer CLI design, Rich progress bars, logging configuration — mature libraries with excellent docs
## Confidence Assessment
| Area | Confidence | Notes |
|------|------------|-------|
| Stack | HIGH | All core technologies verified via official PyPI pages, Context7 library coverage. Version compatibility matrix validated. Alternative comparisons (Polars vs Pandas, uv vs Poetry) based on benchmarks and community adoption trends. |
| Features | MEDIUM | Feature expectations derived from 20+ peer-reviewed publications on gene prioritization tools (Exomiser, LIRICAL, CilioGenics) and bioinformatics best practices. No Context7 coverage for specialized domain tools; relied on WebSearch + academic literature. MVP recommendations synthesized from multiple sources. |
| Architecture | MEDIUM-HIGH | Architectural patterns validated across multiple bioinformatics pipeline implementations (PHA4GE best practices, nf-core patterns, academic publications). DuckDB vs SQLite comparison based on performance benchmarks. Staged pipeline pattern is industry standard. Confidence reduced slightly due to limited Context7 coverage for workflow orchestrators. |
| Pitfalls | HIGH | 40+ authoritative sources covering gene ID mapping failures (Bioconductor, UniProt docs), literature bias (Nature Scientific Reports), scRNA-seq batch effects (Nature Methods benchmarks), reproducibility failures (PLOS Computational Biology). Pitfall scenarios validated across multiple independent publications. Warning signs derived from "Common Bioinformatics Mistakes" community resources. |
**Overall confidence:** MEDIUM-HIGH
Research is well-grounded in established practices and peer-reviewed literature. Stack recommendations are conservative (mature technologies, not bleeding-edge). Architecture patterns are proven in production bioinformatics pipelines. Pitfalls are validated with quantitative data (e.g., "51% Ensembl ID conversion failure rate" from Bioconductor, "27% integration performance" from Nature Methods benchmarking).
Confidence reduced from HIGH due to:
1. Limited Context7 coverage for specialized bioinformatics tools (gget, scanpy, gnomad-toolbox)
2. Niche domain (ciliopathy research) has smaller community and fewer standardized tools than general genomics
3. Some API integration patterns (CellxGene, IMPC) rely on recent documentation that may be incomplete
### Gaps to Address
**During planning/Phase 2:**
- **gnomAD constraint metric reliability thresholds:** Research identified coverage-dependent issues but exact cutoffs (mean depth >30x, >90% CDS covered) need validation during implementation. Test on known ciliopathy genes to calibrate.
- **Transcript selection for tissue-specific genes:** SHANK2 example shows dramatic pLI differences between canonical and brain-specific transcripts. Need strategy for selecting appropriate transcript for retina/inner ear genes. Validate against GTEx isoform expression data.
**During planning/Phase 3:**
- **CellxGene API performance and quotas:** Census API is recent (2024-2025); rate limits and data transfer costs unclear. May need to pivot to bulk h5ad downloads if API proves impractical for 20K gene queries.
- **MGI/ZFIN/IMPC bulk download formats:** No official Python APIs; relying on bulk downloads. Need to validate download URLs are stable and file formats are parseable. Build sample parsing scripts during research-phase.
**During planning/Phase 4:**
- **Scoring weight initialization:** Literature provides ranges (expression 15-25%, constraint 10-20%) but optimal weights are dataset-dependent. Plan for iterative tuning using positive controls rather than assuming initial weights are final.
- **Under-annotation bias correction strategy:** Novel research direction; no established methods found. May need to defer to v2 if initial attempts don't improve results. Test correlation(score, pubmed_count) as success metric.
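A small sketch of that correlation check, assuming SciPy and a Polars frame holding final scores and PubMed counts; the 0.3 warning threshold is an arbitrary placeholder, not a researched cutoff.

```python
# Minimal sketch: rank correlation between final score and publication count as a
# literature-bias success metric (lower |rho| suggests less bias amplification).
import polars as pl
from scipy.stats import spearmanr

def literature_bias_check(scores: pl.DataFrame, max_rho: float = 0.3) -> float:
    rho, _ = spearmanr(scores["final_score"].to_numpy(),
                       scores["pubmed_count"].to_numpy())
    if abs(rho) > max_rho:
        print(f"WARNING: score correlates with publication count (rho={rho:.2f})")
    return rho
```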
**During planning/Phase 7:**
- **Positive control gene set curation:** Need curated list of ~50-100 established ciliopathy genes with high confidence for validation. CiliaCarta (>600 genes) too broad; SYSCILIA Gold Standard (~200 genes) better. Manual curation required.
- **Negative control gene set:** Need ~50-100 high-confidence non-cilia genes (housekeeping, muscle-specific, liver-specific) for specificity testing. No standard negative control sets found in literature.
**Validation during implementation:**
- **scRNA-seq integration method selection:** Harmony vs Seurat v5 vs scVI performance is dataset-dependent. Research recommends comparing multiple methods, but optimal choice won't be known until testing on actual retina/inner ear atlases.
- **Ortholog confidence score thresholds:** DIOPT provides scores 0-15; research suggests prioritizing high-confidence orthologs but doesn't specify cutoff. Test recall vs precision tradeoff during Phase 3.
## Sources
### Primary (HIGH confidence)
- **STACK.md sources (official PyPI and Context7):**
- Polars 1.38.1, gget 0.30.2, Biopython 1.86, Typer 0.21.2, Pydantic 2.12.5 — PyPI verified Feb 2026
- polars-bio performance benchmarks (Oxford Academic, Dec 2025): 6-38x speedup over bioframe
- uv package manager (multiple sources): 10-100x faster than pip
- **FEATURES.md sources (peer-reviewed publications):**
- "Survey and improvement strategies for gene prioritization with LLMs" (Bioinformatics Advances, 2026)
- "Rare disease gene discovery in 100K Genomes Project" (Nature, 2025)
- "CilioGenics: integrated method for predicting ciliary genes" (NAR, 2024)
- "Standards for validating NGS bioinformatics pipelines" (AMP/CAP, 2018)
- **ARCHITECTURE.md sources (bioinformatics best practices):**
- PHA4GE Pipeline Best Practices (GitHub, community-validated)
- "Bioinformatics Pipeline Architecture Best Practices" (multiple implementations)
- DuckDB vs SQLite benchmarks (Bridge Informatics, DataCamp): 10-100x faster for analytics
- "Pipeline Provenance for Reproducibility" (arXiv, 2024)
- **PITFALLS.md sources (authoritative quantitative data):**
- Gene ID mapping failures: "How to map all Ensembl IDs to Gene Symbols" (Bioconductor Support, 51% NA rate)
- Literature bias: "Gene annotation bias impedes biomedical research" (Scientific Reports, 2018)
- scRNA-seq batch effects: "Benchmarking atlas-level data integration" (Nature Methods, 2021): 27% performance
- Ortholog conservation: "Functional and evolutionary implications of gene orthology" (Nature Reviews Genetics)
- gnomAD constraint: "No preferential mode of inheritance for highly constrained genes" (PMC, 2022)
### Secondary (MEDIUM confidence)
- **FEATURES.md:**
- Exomiser/LIRICAL feature comparisons (tool documentation + GitHub issues)
- Multi-evidence integration methods (academic publications, 2009-2024)
- Explainability methods for genomics (WIREs Data Mining, 2023)
- **ARCHITECTURE.md:**
- Workflow manager comparisons (Nextflow vs Snakemake, 2025 analysis)
- Python CLI tool comparisons (argparse vs Click vs Typer, multiple blogs)
- Gene annotation pipeline architectures (NCBI, AnnotaPipeline)
- **PITFALLS.md:**
- API rate limit documentation (NCBI E-utilities, UniProt)
- Tissue specificity scoring methods (TransTEx, Bioinformatics 2024)
- Literature mining tools (GLAD4U, PubTator 3.0)
### Tertiary (LOW confidence — needs validation)
- AlphaFold NVIDIA 4090 compatibility (GitHub issues): 24GB VRAM is below the recommended 32GB; may work with additional configuration
- MGI/ZFIN/IMPC Python API availability (web search): No official libraries found, bulk downloads required
- CellxGene Census API quotas and rate limits (recent docs, 2024-2025): Documentation incomplete
---
*Research completed: 2026-02-11*
*Ready for roadmap: yes*