
Domain Pitfalls: Bioinformatics Gene Prioritization for Cilia/Usher

Domain: Bioinformatics gene candidate discovery pipeline
Researched: 2026-02-11
Confidence: HIGH

Critical Pitfalls

Pitfall 1: Gene ID Mapping Inconsistency Cascade

What goes wrong: Gene identifiers fail to map correctly across databases (Ensembl, HGNC, UniProt, Entrez), resulting in data loss or incorrect merges. Over 51% of Ensembl IDs can return NA when converting to gene symbols, and approximately 8% discordance exists between Swiss-Prot and Ensembl. Multiple Ensembl IDs can map to a single Entrez ID, creating one-to-many relationships that break naively designed merge operations.

Why it happens:

  • UniProt requires 100% sequence identity with no insertions/deletions for mapping to Ensembl, causing ~5% of well-annotated Swiss-Prot proteins to remain unmapped
  • Gene symbols "shuffle around more haphazardly" than stable IDs, with the same symbol assigned to different genes across annotation builds
  • Mucin-like proteins (e.g., MUC2, MUC19) with variable repeat regions have mismatches between curated sequences and reference genomes
  • Different annotation build versions used across source databases

How to avoid:

  1. Use stable IDs as primary keys: Store Ensembl gene IDs as the authoritative identifier; treat gene symbols as display-only metadata
  2. Version-lock annotation builds: Document and freeze the Ensembl/GENCODE release used for all conversions; ensure GTEx expression data uses the same build
  3. Implement mapping validation: After each ID conversion step, report the % successfully mapped and manually inspect unmapped high-priority genes (see the sketch after this list)
  4. One-to-many resolution strategy: For multiple Ensembl→Entrez mappings, aggregate scores using max() or create separate records with explicit disambiguation
  5. Bypass problematic conversions: Where possible, retrieve data directly using the native ID system of each source database rather than converting first
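
A minimal sketch of the validation step in point 3, assuming pandas; the column names ('ensembl_id', 'entrez_id'), the function name, and the control IDs are illustrative placeholders rather than the pipeline's actual schema.

```python
import pandas as pd

# Placeholder IDs; replace with the project's curated positive-control list.
POSITIVE_CONTROLS = {"ENSG00000042781", "ENSG00000006611"}

def map_and_validate(genes: pd.DataFrame, mapping: pd.DataFrame) -> pd.DataFrame:
    """Left-join Ensembl IDs to Entrez IDs and report mapping quality.

    `genes` needs an 'ensembl_id' column; `mapping` needs 'ensembl_id' and
    'entrez_id' columns, one row per known mapping.
    """
    merged = genes.merge(mapping, on="ensembl_id", how="left")

    # Report the conversion success rate instead of silently dropping rows.
    n_unmapped = merged["entrez_id"].isna().sum()
    print(f"Unmapped Ensembl IDs: {n_unmapped}/{len(genes)} "
          f"({100 * n_unmapped / len(genes):.1f}%)")

    # Flag one-to-many Ensembl -> Entrez mappings for explicit disambiguation.
    n_multi = merged.loc[merged.duplicated("ensembl_id", keep=False), "ensembl_id"].nunique()
    print(f"Ensembl IDs with multiple Entrez mappings: {n_multi}")

    # Positive controls must always map; fail loudly if any are lost.
    mapped_ids = set(merged.loc[merged["entrez_id"].notna(), "ensembl_id"])
    missing = POSITIVE_CONTROLS - mapped_ids
    assert not missing, f"Positive control genes failed to map: {missing}"
    return merged
```

Using a left join keeps unmapped genes visible as NaN rows instead of dropping them, which is also the behavior assumed by the missing-data handling in Pitfall 3.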

Warning signs:

  • Sudden drop in gene counts after merging datasets
  • Known positive control genes (established cilia/Usher genes) missing from integrated data
  • Genes appearing multiple times with different scores after merge
  • Different gene counts between pilot analysis and production run after database updates

Phase to address: Phase 1 (Data Infrastructure): Establish ID mapping validation framework with explicit logging of conversion success rates. Create curated mapping tables with manual review of ambiguous cases.


Pitfall 2: Literature Bias Amplification in Weighted Scoring

What goes wrong: Well-studied genes dominate prioritization scores because multiple evidence layers correlate with publication volume rather than biological relevance. PubMed co-mentions, Gene Ontology annotation depth, pathway database coverage, and interaction network degree all increase with research attention, creating a reinforcing feedback loop that systematically deprioritizes novel candidates. In gene prioritization benchmarks, "the number of Gene Ontology annotations for a gene was significantly correlated with published disease-gene associations," reflecting research bias rather than actual biological importance.

Why it happens:

  • Gene Ontology annotations are "created by curation of the scientific literature and typically only contain functional annotations for genes with published experimental data"
  • Interaction networks derived from literature mining or yeast two-hybrid screens are biased toward well-studied proteins
  • Pathway databases (KEGG, Reactome) provide richer annotations for canonical pathways vs. emerging biology
  • Constraint metrics (gnomAD pLI/LOEUF) perform better for well-covered genes but may be unreliable for genes with low coverage
  • Human genes "are much more likely to have been published on in the last 12 years if they are in clusters that were already well known"

How to avoid:

  1. Decouple publication metrics from functional evidence: Score PubMed mentions separately from mechanistic evidence (protein domains, expression patterns, orthologs)
  2. Normalize for baseline research attention: Divide pathway/GO scores by total publication count to create a "novelty-adjusted" functional score, as sketched after this list
  3. Use sequence-based features heavily: Prioritize evidence types independent of literature (protein domains, tissue expression, evolutionary constraint, ortholog phenotypes)
  4. Set evidence diversity requirements: Require candidates to score above threshold in at least N different evidence categories, preventing single-layer dominance
  5. Explicit "under-studied" bonus: Add a scoring component that rewards genes with low PubMed counts but high biological plausibility from other layers
  6. Validate scoring against a "dark genome" test set: Ensure low-publication genes with strong experimental validation (from model organisms) score appropriately
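
A minimal sketch of point 2 and of the correlation warning sign listed below, assuming pandas and scipy; the column names ('final_score', 'go_pathway_score', 'pubmed_count') and the log-based attention term are illustrative choices.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def publication_bias_report(scores: pd.DataFrame) -> pd.DataFrame:
    """Add a novelty-adjusted functional score and flag score/publication correlation."""
    df = scores.copy()

    # Normalize annotation-derived scores by baseline research attention (point 2 above).
    attention = 1.0 + np.log10(df["pubmed_count"].clip(lower=0) + 1)
    df["novelty_adjusted_score"] = df["go_pathway_score"] / attention

    # Warning sign: high correlation between the final score and publication volume.
    rho, _ = spearmanr(df["final_score"], df["pubmed_count"])
    print(f"Spearman(final_score, pubmed_count) = {rho:.2f}")
    if rho > 0.5:
        print("WARNING: final score tracks publication volume; revisit layer weights.")
    return df
```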

Warning signs:

  • Top 100 candidates are all genes with >500 PubMed citations
  • Known disease genes with <50 publications rank outside the top 1,000 candidates
  • Correlation coefficient >0.7 between final score and PubMed count
  • When you add a new evidence layer, the top-ranked genes don't change
  • Positive controls (known cilia genes) score lower than expected based on functional relevance

Phase to address: Phase 3 (Scoring System Design): Implement multi-layer weighted scoring with explicit publication bias correction. Test scoring system on a stratified validation set including both well-studied and under-studied genes.


Pitfall 3: Missing Data Handled as "Negative Evidence"

What goes wrong: Genes lacking data in a particular evidence layer (e.g., no GTEx expression, no scRNA-seq detection, no IMPC phenotype) are treated as having "low expression" or "no phenotype" rather than "unknown." This systematically penalizes genes simply because they haven't been measured in the right contexts. For example, a gene expressed in rare retinal cell subtypes may be absent from bulk GTEx data and appear in only 1-2 cells per scRNA-seq atlas, leading to false classification as "not expressed in relevant tissues."

Why it happens:

  • Default pandas merge behavior drops unmatched records or fills with NaN
  • Scoring functions that sum across layers implicitly assign 0 to missing layers
  • GTEx bulk tissue lacks resolution for rare cell types (e.g., photoreceptor subtypes, vestibular hair cells)
  • scRNA-seq atlases have dropout and may not capture low-abundance transcripts
  • Model organism phenotypes exist only for ~40% of human genes
  • gnomAD constraint metrics are unreliable for genes with low coverage regions

How to avoid:

  1. Explicit missing data encoding: Use a three-state system: "present" (data exists and positive), "absent" (data exists and negative), "unknown" (no data available)
  2. Layer-specific score normalization: Compute scores only across layers with data; do not penalize genes for missing layers (see the sketch after this list)
  3. Imputation with biological priors: For genes with orthologs but no direct human data, propagate evidence from mouse/zebrafish with confidence weighting
  4. Coverage-aware constraint metrics: Only use gnomAD pLI/LOEUF when coverage is adequate (mean depth >30x and >90% of coding sequence covered)
  5. Aggregate at appropriate resolution: For scRNA-seq, aggregate to cell-type level rather than individual cells to handle dropout
  6. Document data availability per gene: Create a metadata field tracking which evidence layers are available for each gene to enable layer-aware filtering
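
A minimal sketch of points 1, 2, and 6, assuming pandas; the layer names are illustrative and per-layer evidence is assumed to be pre-scaled to [0, 1], with NaN meaning "unknown".

```python
import pandas as pd

# Illustrative evidence layers; NaN in a layer means "unknown", never zero evidence.
LAYERS = ["gtex_retina", "scrnaseq_ciliated", "impc_phenotype", "constraint"]

def layer_aware_score(evidence: pd.DataFrame) -> pd.DataFrame:
    """Average evidence only over layers with data, so missing layers do not penalize a gene."""
    df = evidence.copy()
    df["score"] = df[LAYERS].mean(axis=1, skipna=True)         # mean over available layers only
    df["n_layers_available"] = df[LAYERS].notna().sum(axis=1)  # data availability per gene
    return df
```

Because the mean is taken over available layers only, a gene measured in two layers competes on equal footing with one measured in four, which is the behavior the Phase 2 merge strategy should preserve.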

Warning signs:

  • Genes with partial data coverage systematically rank lower than genes with complete data
  • Number of scored genes drops dramatically when adding a new evidence layer (should be union, not intersection)
  • High-confidence candidates from preliminary analysis disappear after integrating additional databases
  • Known Usher genes with tissue-specific expression patterns score poorly

Phase to address: Phase 2 (Data Integration): Design merge strategy that preserves all genes and explicitly tracks data availability per evidence layer. Implement imputation strategy for ortholog-based evidence propagation.


Pitfall 4: Batch Effects Misinterpreted as Biological Signal in scRNA-seq Integration

What goes wrong: When integrating scRNA-seq data from multiple atlases (e.g., retinal atlas, inner ear atlas, nasal epithelium datasets), technical batch effects between studies are mistakenly interpreted as tissue-specific expression patterns. Methods that over-correct for batch effects can erase true biological variation, while under-correction leads to false cell-type-specific signals. Recent benchmarking shows "only 27% of integration outputs performed better than the best unintegrated data," and methods like LIGER, BBKNN, and Seurat v3 "tended to favor removal of batch effects over conservation of biological variation."

Why it happens:

  • Different scRNA-seq platforms (10X Chromium vs. Smart-seq2 vs. single-nuclei) have systematically different detection profiles
  • Batch effects arise from "cell isolation protocols, library preparation technology, and sequencing platforms"
  • Integration algorithms make tradeoffs between removing technical variation vs. preserving biological differences
  • "Increasing Kullback-Leibler divergence regularization does not improve integration and adversarial learning removes biological signals"
  • Tissue-of-origin effects (primary tissue vs. organoid vs. cell culture) can be confounded with true cell-type identity

How to avoid:

  1. Validate integration quality: After batch correction, verify that known marker genes still show expected cell-type-specific patterns (a sketch follows this list)
  2. Compare multiple integration methods: Run Harmony, Seurat v5, and scVI in parallel; select based on preservation of positive control markers
  3. Use positive/negative control genes: Ensure known cilia genes (IFT88, BBS1) show expected enrichment in ciliated cells post-integration
  4. Stratify by sequencing technology: Analyze 10X datasets separately from Smart-seq2 before attempting cross-platform integration
  5. Prefer within-study comparisons: When possible, compare cell types within the same study rather than integrating across studies
  6. Document integration parameters: Record all batch correction hyperparameters (k-neighbors, PCA dimensions, integration strength) to enable sensitivity analysis
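
A minimal sketch of the checks in points 1 and 3, assuming an AnnData object with corrected expression values in .X and a cell-type annotation column; the marker list and the 'cell_type' key are illustrative.

```python
import numpy as np

# Illustrative marker -> expected-cell-type map; extend with the project's curated controls.
MARKERS = {"IFT88": "ciliated", "BBS1": "ciliated"}

def check_marker_preservation(adata, celltype_key="cell_type"):
    """Verify known cilia markers remain enriched in the expected cell type post-integration."""
    labels = np.asarray(adata.obs[celltype_key].astype(str))
    for gene, expected in MARKERS.items():
        if gene not in adata.var_names:
            print(f"{gene}: absent from the integrated object (unknown, not negative evidence)")
            continue
        x = adata[:, gene].X
        expr = np.asarray(x.todense()).ravel() if hasattr(x, "todense") else np.asarray(x).ravel()
        means = {ct: expr[labels == ct].mean() for ct in np.unique(labels)}
        top = max(means, key=means.get)
        flag = "OK" if top == expected else "WARNING: marker enrichment lost after integration"
        print(f"{gene}: highest mean expression in '{top}' (expected '{expected}') -> {flag}")
```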

Warning signs:

  • Cell types from different studies don't overlap in UMAP space after integration
  • Marker genes lose cell-type-specificity after batch correction
  • Technical replicates from the same study cluster separately after integration
  • Integration method produces >50% improvement in batch-mixing metrics but known biological markers disappear

Phase to address: Phase 4 (scRNA-seq Processing): Implement multi-method integration pipeline with positive control validation. Create cell-type-specific expression profiles only after validating preservation of known markers.


Pitfall 5: Ortholog Function Conservation Over-Assumed

What goes wrong: Phenotypes from mouse knockouts (MGI) or zebrafish morphants (ZFIN) are naively transferred to human genes as if function were perfectly conserved, ignoring cases where orthologs have "dramatically different functions." This is especially problematic for gene families with lineage-specific duplications or losses. The assumption that "single-copy orthologs are more reliable for functional annotation" is contradicted by evidence showing "multi-copy genes are equally or more likely to provide accurate functional information."

Why it happens:

  • Automated orthology pipelines use sequence similarity thresholds without functional validation
  • "Bidirectional best hit (BBH) assumption can be false because genes may be each other's highest-ranking matches due to differential gene loss"
  • Domain recombination creates false orthology assignments where proteins share some but not all domains
  • Zebrafish genome duplication means many human genes have two zebrafish co-orthologs with subfunctionalized roles
  • "There is no universally applicable, unequivocal definition of conserved function"
  • Cilia gene functions may diverge between motile (zebrafish) and non-motile/sensory (mammalian) cilia contexts

How to avoid:

  1. Confidence-weight ortholog evidence: Use orthology confidence scores from databases (e.g., DIOPT scores, OMA groups) rather than binary ortholog/non-ortholog calls (see the sketch after this list)
  2. Require phenotype relevance: Only count ortholog phenotypes that match the target biology (ciliary defects, sensory organ abnormalities, not generic lethality)
  3. Handle one-to-many orthologs explicitly: For human genes with multiple zebrafish co-orthologs, aggregate phenotypes using OR logic (any co-ortholog with phenotype = positive evidence)
  4. Validate with synteny: Prioritize orthologs supported by both sequence similarity and conserved genomic context
  5. Species-appropriate expectations: Expect stronger conservation for cilia structure genes (IFT machinery) vs. sensory signaling cascades (tissue-specific)
  6. Cross-validate with multiple species: Genes with convergent phenotypes across mouse, zebrafish, AND Drosophila are more reliable than single-species evidence
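
A minimal sketch of points 1-3, assuming pandas; the column names, the 0-1 confidence scale, and the phenotype term list are illustrative stand-ins for a curated MP/ZP ontology term set.

```python
import pandas as pd

# Illustrative relevance filter; in practice use curated MP/ZP ontology term IDs.
CILIA_RELEVANT_PHENOTYPES = {"abnormal cilium morphology", "retinal degeneration", "abnormal hair cell"}

def ortholog_evidence(orthologs: pd.DataFrame) -> pd.Series:
    """Confidence-weighted, relevance-filtered ortholog phenotype evidence per human gene.

    Expects columns 'human_gene', 'confidence' (e.g. a DIOPT-style score scaled to 0-1),
    and 'phenotype'; one row per ortholog-phenotype pair.
    """
    relevant = orthologs[orthologs["phenotype"].isin(CILIA_RELEVANT_PHENOTYPES)]
    # OR logic across co-orthologs: the best relevant, confidence-weighted hit counts.
    return relevant.groupby("human_gene")["confidence"].max().rename("ortholog_score")
```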

Warning signs:

  • Ortholog evidence contradicts human genetic data (e.g., mouse knockout viable but human LoF variants cause disease)
  • Large fraction of human genes map to paralogs in model organism rather than true orthologs
  • Zebrafish co-orthologs have opposing phenotypes (one causes ciliopathy, other is wildtype)
  • Synteny breaks detected for claimed ortholog pairs

Phase to address: Phase 2 (Data Integration - Ortholog Module): Implement orthology confidence scoring and phenotype relevance filtering. Create manual curation workflow for ambiguous high-scoring candidates.


Pitfall 6: Constraint Metrics (gnomAD pLI/LOEUF) Misinterpreted

What goes wrong: gnomAD constraint scores are used as disease gene predictors without understanding their limitations. Researchers assume high pLI (>0.9) or low LOEUF (<0.35) automatically indicates dominant disease genes, but "even the most highly constrained genes are not necessarily autosomal dominant." Transcript selection errors cause dramatic score changes—SHANK2 has pLI=0 in the canonical transcript but pLI=1 in the brain-specific transcript due to differential exon usage. Low-coverage genes have unreliable constraint metrics but are not flagged.

Why it happens:

  • pLI is designed to be interpreted dichotomously (>0.9 intolerant, <0.1 tolerant), so hard cutoffs ignore genes in the intermediate range
  • LOEUF thresholds changed between gnomAD v2 (<0.35) and v4 (<0.6), causing confusion
  • "There is no brain expression from over half the exons in the SHANK2 canonical transcript where most of the protein truncating variants are found"
  • Genes with low sequencing coverage or high GC content have unreliable observed/expected ratios
  • Recessive disease genes can have low constraint (heterozygous LoF is benign)
  • Haploinsufficient genes vs. dominant-negative mechanisms are not distinguished

How to avoid:

  1. Use LOEUF continuously, not dichotomously: Score genes on the LOEUF scale rather than applying hard cutoffs, giving partial credit across the distribution (see the sketch after this list)
  2. Verify transcript selection: For ciliopathy candidates, ensure the scored transcript includes exons expressed in retina/inner ear using GTEx isoform data
  3. Check coverage metrics: Only use pLI/LOEUF when mean coverage >30x and >90% of CDS is covered in gnomAD
  4. Adjust expectations for inheritance pattern: High constraint supports haploinsufficiency; moderate constraint is compatible with recessive inheritance (relevant for Usher syndrome)
  5. Cross-validate with ClinVar: Check whether existing pathogenic variants in the gene match the constraint prediction
  6. Incorporate missense constraint: Use gnomAD missense Z-scores alongside LoF metrics; some genes tolerate LoF but are missense-constrained
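
A minimal sketch of points 1 and 3, assuming pandas; the column names and the linear rescaling of LOEUF onto 0-1 are illustrative choices, not a gnomAD convention.

```python
import numpy as np
import pandas as pd

def constraint_score(constraint: pd.DataFrame) -> pd.Series:
    """Continuous, coverage-aware LOEUF score; NaN (unknown) where coverage is inadequate.

    Expects columns 'loeuf', 'mean_depth', and 'frac_cds_covered'.
    """
    adequate = (constraint["mean_depth"] > 30) & (constraint["frac_cds_covered"] > 0.9)
    # Lower LOEUF = more constrained; rescale so LOEUF 0 -> 1.0 and LOEUF >= 2 -> 0.0.
    score = 1.0 - constraint["loeuf"].clip(0, 2) / 2.0
    return score.where(adequate, np.nan).rename("constraint_score")
```

Returning NaN rather than 0 for low-coverage genes lets the layer-aware scoring from Pitfall 3 treat them as "unknown" instead of penalizing them.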

Warning signs:

  • Candidate genes have low gnomAD coverage but constraint scores are used anyway
  • All top candidates have pLI >0.9, despite Usher syndrome being recessive
  • Constraint scores change dramatically when switching from canonical to tissue-specific transcript
  • Genes with established recessive inheritance pattern are filtered out due to low constraint

Phase to address: Phase 3 (Scoring System Design - Constraint Module): Implement coverage-aware constraint scoring with transcript validation. Create inheritance-pattern-adjusted weighting (lower weight for dominant constraint when prioritizing recessive disease genes).


Pitfall 7: "Unknown Function" Operationally Undefined Creates Circular Logic

What goes wrong: The pipeline aims to discover "under-studied" or "unknown function" genes but lacks a rigorous operational definition. If "unknown function" means "few PubMed papers," the literature bias pitfall is worsened. If it means "no GO annotations," it excludes genes with partial functional knowledge that might be excellent candidates. Approximately "40% of proteins in eukaryotic genomes are proteins of unknown function," but defining unknown is complex—"when a group of related sequences contains one or more members of known function, the similarity approach assigns all to the known space, whereas empirical approach distinguishes between characterized and uncharacterized candidates."

Why it happens:

  • GO term coverage is publication-biased; "52-79% of bacterial proteomes can be functionally annotated based on homology searches" but eukaryotes have more uncharacterized proteins
  • Partial functional knowledge exists on a spectrum (domain predictions, general pathway assignments, no mechanistic details)
  • "Research bias, as measured by publication volume, was an important factor influencing genome annotation completeness"
  • Negative selection (excluding known cilia genes) requires defining "known," which is database- and date-dependent

How to avoid:

  1. Multi-tier functional classification (sketched in code after this list):
    • Tier 1: Direct experimental evidence of cilia/Usher involvement (exclude from discovery)
    • Tier 2: Strong functional prediction (domains, orthologs) but no direct evidence (high-priority targets)
    • Tier 3: Minimal functional annotation (under-studied candidates)
    • Tier 4: No annotation beyond gene symbol (true unknowns)
  2. Explicit positive control exclusion list: Maintain a curated list of ~200 known cilia genes and established Usher genes to exclude; version-control this list
  3. Publication-independent functional metrics: Define "unknown" using absence of experimental GO evidence codes (EXP, IDA, IPI, IMP), not total publication count
  4. Pathway coverage threshold: Consider genes "known" if they appear in >3 canonical cilia pathways (IFT, transition zone, basal body assembly)
  5. Temporal versioning: Tag genes as "unknown as of [date]" to allow retrospective validation when new discoveries are published
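
One possible operationalization of the tiers in point 1, combined with the evidence-code rule in point 3, assuming pandas; the column names and exact tier boundaries are illustrative.

```python
import pandas as pd

# Experimental GO evidence codes (point 3 above): "unknown" means none of these, not "few papers".
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def classify_tier(row: pd.Series, known_cilia_genes: set) -> int:
    """Assign a functional-knowledge tier from annotation evidence rather than publication counts."""
    if row["gene_id"] in known_cilia_genes:
        return 1  # Tier 1: direct cilia/Usher evidence -> exclude from discovery
    if row["n_protein_domains"] > 0 or row["has_ortholog_phenotype"]:
        return 2  # Tier 2: strong functional prediction but no direct evidence -> high priority
    if set(row["go_evidence_codes"]) & EXPERIMENTAL_CODES:
        return 3  # Tier 3: minimal experimental annotation -> under-studied candidate
    return 4      # Tier 4: nothing beyond the gene symbol -> true unknown
```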

Warning signs:

  • Definition of "unknown" changes between pilot and production runs
  • Candidates include genes with extensive literature on cilia-related functions
  • Positive control genes leak into discovery set due to incomplete exclusion list
  • Different team members have conflicting intuitions about whether a gene is "known enough" to exclude

Phase to address: Phase 1 (Data Infrastructure - Annotation Module): Create explicit functional classification schema with clear inclusion/exclusion criteria. Build curated positive control list with literature provenance.


Pitfall 8: Reproducibility Theater Without Computational Environment Control

What goes wrong: Pipeline scripts are version-controlled and documented, creating an illusion of reproducibility, but results cannot be reproduced due to uncontrolled dependencies. Python package versions (pandas, numpy, scikit-learn), database snapshots (Ensembl release, gnomAD version, GTEx v8 vs v9), and API response changes cause silent result drift. "Workflow managers were developed in response to challenges with data complexity and reproducibility" but are often not adopted until after initial analyses are complete and results have diverged.

Why it happens:

  • pip install package without pinning versions installs latest release, which may change behavior
  • Ensembl/NCBI/UniProt databases update continuously; API calls return different data over time
  • Downloaded files are not checksummed; corruption or incomplete downloads go undetected
  • Reference genome versions (GRCh37 vs GRCh38) are mixed across data sources
  • RAM/disk caching causes results to differ between first run and subsequent runs
  • Random seeds not set for stochastic algorithms (UMAP, t-SNE, subsampling)

How to avoid:

  1. Pin all dependencies: Use requirements.txt with exact versions (pandas==2.0.3 not pandas>=2.0) or conda env export
  2. Containerize the environment: Build Docker/Singularity container with frozen versions; run all analyses inside container
  3. Snapshot external databases: Download full database dumps (Ensembl, GTEx) with version tags; do not rely on live API queries for production runs
  4. Checksum all downloaded data: Compute and store MD5/SHA256 hashes for every downloaded file; verify on load
  5. Version-control intermediate outputs: Store preprocessed data files (post-QC, post-normalization) with version tags to enable restart from checkpoints
  6. Set random seeds globally: Fix numpy/torch/random seeds at the start of every script to ensure stochastic steps are reproducible
  7. Log provenance metadata: Embed database versions, software versions, and run parameters in output files using JSON headers or HDF5 attributes (see the sketch below)
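
A minimal sketch of points 6 and 7, assuming pandas with a parquet writer (pyarrow) installed; the metadata fields, file layout, and sidecar naming are illustrative.

```python
import hashlib
import json
import random
import sys
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
import pandas as pd

def set_seeds(seed: int = 42) -> None:
    """Fix seeds for stochastic steps; pass the same seed to UMAP/t-SNE as random_state."""
    random.seed(seed)
    np.random.seed(seed)

def write_with_provenance(df: pd.DataFrame, path: str, db_versions: dict) -> None:
    """Write a table plus a sidecar JSON recording versions, inputs, and a content hash."""
    df.to_parquet(path)  # parquet preserves dtypes, unlike CSV
    provenance = {
        "created": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "pandas": pd.__version__,
        "numpy": np.__version__,
        "database_versions": db_versions,  # e.g. {"ensembl": "110", "gnomad": "v4", "gtex": "v8"}
        "sha256": hashlib.sha256(Path(path).read_bytes()).hexdigest(),
    }
    Path(path + ".provenance.json").write_text(json.dumps(provenance, indent=2))
```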

Warning signs:

  • Results change when re-running the same script on a different machine
  • Collaborator cannot reproduce your gene rankings despite using "the same code"
  • Adding a new analysis step changes results from previous steps (should be impossible if truly modular)
  • You cannot explain why the gene count changed between last month's and this month's runs

Phase to address: Phase 1 (Data Infrastructure): Establish containerized environment with pinned dependencies and database version snapshots. Implement checksumming and provenance logging from the start.


Technical Debt Patterns

Shortcuts that seem reasonable but create long-term problems.

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
| --- | --- | --- | --- |
| Hard-code database API URLs in scripts | Quick to write | Breaks when APIs change; no version control | Never: use a config file |
| Convert all IDs to gene symbols early | Simpler to read outputs | Loses ability to trace back to source; mapping errors propagate | Only for final display, never internal processing |
| Filter to protein-coding genes only, drop non-coding | Reduces dataset size | Misses lncRNAs with cilia-regulatory roles | Acceptable for MVP if explicitly documented as limitation |
| Use default merge (inner join) in pandas | Fewer rows to process | Silent data loss when IDs don't match | Never: always left join with explicit logging |
| Skip validation on positive control genes | Faster iteration | No way to detect when scoring system breaks | Never: positive controls are mandatory |
| Download data via API calls in main pipeline | No manual download step | Irreproducible; results change as databases update | Only during exploration phase, never production |
| Store intermediate data as CSV | Easy to inspect manually | Loss of data types (ints become floats); no metadata storage | Acceptable for small tables <10K rows |
| Use pip install --upgrade to fix bugs | Gets latest fixes | Introduces breaking changes unpredictably | Never: pin versions and upgrade explicitly |
| Aggregate to gene-level immediately from scRNA-seq | Simpler analysis | Loses cell-type resolution; can't detect subtype-specific expression | Only for bulk comparison, not prioritization |

Integration Gotchas

Common mistakes when connecting to external services.

| Integration | Common Mistake | Correct Approach |
| --- | --- | --- |
| Ensembl BioMart | Selecting canonical transcript only; miss tissue-specific isoforms | Query all transcripts, then filter by retina/inner ear expression using GTEx |
| gnomAD API | Not checking coverage per gene before using constraint scores | Filter to genes with mean_depth >30x and >90% CDS covered |
| GTEx Portal | Using TPM directly without accounting for sample size per tissue | Normalize by sample count and use median TPM across replicates |
| CellxGene API | Downloading full h5ad files serially (slow, memory-intensive) | Use CellxGene's API to fetch only cell-type-aggregated counts for genes of interest |
| UniProt REST API | Converting gene lists one-by-one in a loop | Use batch endpoint with POST requests (up to 100K IDs per request) |
| PubMed E-utilities | Sending requests without API key (3 req/sec limit) | Register NCBI API key for 10 req/sec; still implement exponential backoff |
| MGI/ZFIN batch queries | Assuming 1:1 human-mouse orthology | Handle one-to-many mappings explicitly (use highest-confidence ortholog or aggregate) |
| IMPC phenotype API | Taking all phenotypes as equally informative | Filter to cilia-relevant phenotypes (MP:0003935 cilium; HP:0000508 retinal degeneration) |
| String-DB | Assuming all edges are physical interactions | Filter by interaction type (text-mining vs experimental) and confidence score >0.7 |
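
A minimal sketch of the retry pattern recommended in the PubMed E-utilities row, assuming the requests library; the retried status codes and the starting delay are illustrative choices.

```python
import time
import requests

def get_with_backoff(url: str, params: dict, max_retries: int = 5, timeout: int = 30):
    """GET with exponential backoff on rate-limit or transient server errors."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url, params=params, timeout=timeout)
        if resp.status_code == 200:
            return resp
        if resp.status_code in (429, 500, 502, 503):  # rate-limited or transient failure
            time.sleep(delay)
            delay *= 2                                 # exponential backoff
            continue
        resp.raise_for_status()                        # anything else is a hard failure
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")
```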

Performance Traps

Patterns that work at small scale but fail as usage grows.

| Trap | Symptoms | Prevention | When It Breaks |
| --- | --- | --- | --- |
| Loading entire scRNA-seq h5ad into memory | Script crashes with MemoryError | Use backed mode (anndata.read_h5ad(file, backed='r')) to stream from disk | >5GB file or <32GB RAM |
| Nested loops for gene-gene comparisons | Script runs for hours | Vectorize with numpy/pandas; use scipy.spatial.distance for pairwise | >1K genes |
| Re-downloading data on every run | Slow iteration, API rate limits | Cache downloaded files locally with checksums; only re-download if missing | Always implement caching |
| Storing all intermediate results in memory | Cannot debug failed runs | Write intermediate outputs to disk after each major step | >50K genes or complex pipeline |
| Single-threaded processing | Slow on large gene sets | Parallelize with joblib/multiprocessing for embarrassingly parallel tasks | >10K genes or >1hr runtime |
| Not indexing database tables | Slow queries on merged datasets | Create indexes on ID columns (Ensembl ID, HGNC symbol) before joins | >20K genes or >5 tables |
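
A minimal sketch of the checksum-backed download cache from the "Re-downloading data on every run" row, assuming the requests library; the sidecar .sha256 convention and the in-memory download are illustrative (large files would be streamed).

```python
import hashlib
from pathlib import Path
from typing import Optional

import requests

def fetch_cached(url: str, dest: Path, expected_sha256: Optional[str] = None) -> Path:
    """Download once, verify the checksum, and reuse the local copy on later runs."""
    if dest.exists():
        return dest  # cached copy; its hash was verified at download time
    data = requests.get(url, timeout=300).content
    digest = hashlib.sha256(data).hexdigest()
    if expected_sha256 and digest != expected_sha256:
        raise ValueError(f"Checksum mismatch for {url}: got {digest}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    (dest.parent / (dest.name + ".sha256")).write_text(digest)  # store hash for provenance
    return dest
```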

Phase-Specific Warnings

| Phase / Topic | Likely Pitfall | Mitigation |
| --- | --- | --- |
| Data download (Phase 1) | API rate limits hit during batch download | Implement exponential backoff and retry logic; use NCBI API keys |
| ID mapping (Phase 1) | Multiple IDs map to the same symbol | Create explicit disambiguation strategy; log all many-to-one mappings |
| scRNA-seq integration (Phase 4) | Batch effects erase biological signal | Validate integration by checking known marker genes before/after |
| Scoring system (Phase 3) | Literature bias dominates scores | Normalize by publication count; validate on under-studied positive controls |
| Validation (Phase 5) | Positive controls not scoring as expected | Debug scoring system before declaring it ready; iterate on weights |
| Reproducibility (All phases) | Results differ between runs | Pin dependency versions and database snapshots from Phase 1 |
| Missing data handling (Phase 2) | Genes with incomplete data rank artificially low | Implement layer-aware scoring that doesn't penalize missing data |
| Ortholog phenotypes (Phase 2) | Mouse/zebrafish phenotypes over-interpreted | Require phenotype relevance filtering (cilia-related only) |

"Looks Done But Isn't" Checklist

Things that appear complete but are missing critical pieces.

  • ID mapping module: Often missing validation step to ensure known genes are successfully mapped—verify 100% of positive control genes map successfully
  • Scoring system: Often missing publication bias correction—verify correlation between final score and PubMed count is <0.5
  • scRNA-seq integration: Often missing positive control validation—verify known cilia genes show expected cell-type enrichment post-integration
  • Constraint metrics: Often missing coverage check—verify genes have adequate gnomAD coverage before using pLI/LOEUF
  • Ortholog evidence: Often missing confidence scoring—verify orthology confidence scores are used, not just binary ortholog calls
  • Data provenance: Often missing version logging—verify every output file records database versions and software versions used
  • Missing data handling: Often missing explicit "unknown" state—verify merge strategy preserves genes with partial data
  • Reproducibility: Often missing checksums on downloaded data—verify MD5 hashes are computed and stored for all external data files
  • Positive controls: Often missing negative control validation—verify negative controls (non-cilia genes) score appropriately low
  • API error handling: Often missing retry logic—verify exponential backoff is implemented for all external API calls
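
A minimal sketch of the control check from the "Positive controls" bullet above, assuming pandas and scipy; the column names, control sets, and significance threshold are illustrative.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def validate_controls(scores: pd.DataFrame, positives: set, negatives: set) -> None:
    """Check that known cilia genes outrank non-cilia negative controls in the final ranking.

    Expects columns 'gene_id' and 'final_score'.
    """
    pos = scores.loc[scores["gene_id"].isin(positives), "final_score"]
    neg = scores.loc[scores["gene_id"].isin(negatives), "final_score"]
    stat, p = mannwhitneyu(pos, neg, alternative="greater")
    print(f"Positive vs negative controls: U={stat:.0f}, p={p:.2e}")
    print(f"Median positive score: {pos.median():.3f}; median negative score: {neg.median():.3f}")
    if p > 0.05:
        print("WARNING: positive controls do not outrank negatives; revisit layer weights.")
```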

Recovery Strategies

When pitfalls occur despite prevention, how to recover.

| Pitfall | Recovery Cost | Recovery Steps |
| --- | --- | --- |
| ID mapping errors detected late | LOW | Re-run mapping module with corrected conversion tables; propagate fixes forward |
| Literature bias discovered after scoring | MEDIUM | Add publication normalization to scoring function; re-compute all scores |
| Batch effects in scRNA-seq integration | MEDIUM | Try alternative integration method (switch from Seurat to Harmony); re-validate |
| Missing data treated as negative | HIGH | Redesign merge strategy to preserve all genes; re-run entire integration pipeline |
| Reproducibility failure (cannot re-run) | HIGH | Containerize environment; snapshot databases; document and re-run from scratch |
| Positive controls score poorly | MEDIUM | Debug scoring function weights; validate on stratified test set; adjust weights iteratively |
| Ortholog function over-assumed | LOW | Add orthology confidence scores; re-filter to high-confidence orthologs only |
| Constraint metrics misinterpreted | LOW | Add coverage check; use LOEUF continuously instead of dichotomously; re-score |
| "Unknown function" poorly defined | MEDIUM | Create explicit functional tiers; rebuild positive control exclusion list; re-classify |

Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

| Pitfall | Prevention Phase | Verification |
| --- | --- | --- |
| Gene ID mapping inconsistency | Phase 1: Data Infrastructure | 100% of positive controls successfully mapped across all databases |
| Literature bias amplification | Phase 3: Scoring System | Correlation(final_score, pubmed_count) < 0.5; under-studied positive controls rank in top 10% |
| Missing data as negative evidence | Phase 2: Data Integration | Gene count preserved after merges (should be union, not intersection); explicit "unknown" state in data model |
| scRNA-seq batch effects | Phase 4: scRNA-seq Processing | Known cilia markers show expected cell-type specificity post-integration |
| Ortholog function over-assumed | Phase 2: Ortholog Module | Orthology confidence scores used; phenotype relevance filtering applied |
| Constraint metrics misinterpreted | Phase 3: Constraint Module | Coverage-aware filtering; LOEUF used continuously; transcript validation performed |
| "Unknown function" undefined | Phase 1: Annotation Module | Explicit functional tiers defined; positive control exclusion list version-controlled |
| Reproducibility failure | Phase 1: Environment Setup | Containerized; dependencies pinned; databases snapshotted; results bit-identical on re-run |
| API rate limits hit | Phase 1: Data Download | Retry logic with exponential backoff; batch queries used; NCBI API key registered |


Pitfalls research for: Bioinformatics Gene Candidate Discovery Pipeline for Cilia/Usher Syndrome
Researched: 2026-02-11
Confidence: HIGH (validated with 40+ authoritative sources across all risk domains)