
Domain Pitfalls: Bioinformatics Gene Prioritization for Cilia/Usher

Domain: Bioinformatics gene candidate discovery pipeline
Researched: 2026-02-11
Confidence: HIGH

Critical Pitfalls

Pitfall 1: Gene ID Mapping Inconsistency Cascade

What goes wrong: Gene identifiers fail to map correctly across databases (Ensembl, HGNC, UniProt, Entrez), resulting in data loss or incorrect merges. Over 51% of Ensembl IDs can return NA when converting to gene symbols, and approximately 8% discordance exists between Swiss-Prot and Ensembl. Multiple Ensembl IDs can map to a single Entrez ID, creating one-to-many relationships that break naively designed merge operations.

Why it happens:

  • UniProt requires 100% sequence identity with no insertions/deletions for mapping to Ensembl, causing ~5% of well-annotated Swiss-Prot proteins to remain unmapped
  • Gene symbols "shuffle around more haphazardly" than stable IDs, with the same symbol assigned to different genes across annotation builds
  • Mucin-like proteins (e.g., MUC2, MUC19) with variable repeat regions have mismatches between curated sequences and reference genomes
  • Different annotation build versions used across source databases

How to avoid:

  1. Use stable IDs as primary keys: Store Ensembl gene IDs as the authoritative identifier; treat gene symbols as display-only metadata
  2. Version-lock annotation builds: Document and freeze the Ensembl/GENCODE release used for all conversions; ensure GTEx expression data uses the same build
  3. Implement mapping validation: After each ID conversion step, report the % successfully mapped and manually inspect unmapped high-priority genes (see the sketch after this list)
  4. One-to-many resolution strategy: For multiple Ensembl→Entrez mappings, aggregate scores using max() or create separate records with explicit disambiguation
  5. Bypass problematic conversions: Where possible, retrieve data directly using the native ID system of each source database rather than converting first
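
A minimal sketch of the validation step in point 3, assuming pandas; the column names ('ensembl_id', 'entrez_id'), the function name, and the control IDs are illustrative placeholders rather than the pipeline's actual schema.

```python
import pandas as pd

# Placeholder IDs; replace with the project's curated positive-control list.
POSITIVE_CONTROLS = {"ENSG00000042781", "ENSG00000006611"}

def map_and_validate(genes: pd.DataFrame, mapping: pd.DataFrame) -> pd.DataFrame:
    """Left-join Ensembl IDs to Entrez IDs and report mapping quality.

    `genes` needs an 'ensembl_id' column; `mapping` needs 'ensembl_id' and
    'entrez_id' columns, one row per known mapping.
    """
    merged = genes.merge(mapping, on="ensembl_id", how="left")

    # Report the conversion success rate instead of silently dropping rows.
    n_unmapped = merged["entrez_id"].isna().sum()
    print(f"Unmapped Ensembl IDs: {n_unmapped}/{len(genes)} "
          f"({100 * n_unmapped / len(genes):.1f}%)")

    # Flag one-to-many Ensembl -> Entrez mappings for explicit disambiguation.
    n_multi = merged.loc[merged.duplicated("ensembl_id", keep=False), "ensembl_id"].nunique()
    print(f"Ensembl IDs with multiple Entrez mappings: {n_multi}")

    # Positive controls must always map; fail loudly if any are lost.
    mapped_ids = set(merged.loc[merged["entrez_id"].notna(), "ensembl_id"])
    missing = POSITIVE_CONTROLS - mapped_ids
    assert not missing, f"Positive control genes failed to map: {missing}"
    return merged
```

Using a left join keeps unmapped genes visible as NaN rows instead of dropping them, which is also the behavior assumed by the missing-data handling in Pitfall 3.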

Warning signs:

  • Sudden drop in gene counts after merging datasets
  • Known positive control genes (established cilia/Usher genes) missing from integrated data
  • Genes appearing multiple times with different scores after merge
  • Different gene counts between pilot analysis and production run after database updates

Phase to address: Phase 1 (Data Infrastructure): Establish ID mapping validation framework with explicit logging of conversion success rates. Create curated mapping tables with manual review of ambiguous cases.


Pitfall 2: Literature Bias Amplification in Weighted Scoring

What goes wrong: Well-studied genes dominate prioritization scores because multiple evidence layers correlate with publication volume rather than biological relevance. PubMed co-mentions, Gene Ontology annotation depth, pathway database coverage, and interaction network degree all increase with research attention, creating a reinforcing feedback loop that systematically deprioritizes novel candidates. In gene prioritization benchmarks, "the number of Gene Ontology annotations for a gene was significantly correlated with published disease-gene associations," reflecting research bias rather than actual biological importance.

Why it happens:

  • Gene Ontology annotations are "created by curation of the scientific literature and typically only contain functional annotations for genes with published experimental data"
  • Interaction networks derived from literature mining or yeast two-hybrid screens are biased toward well-studied proteins
  • Pathway databases (KEGG, Reactome) provide richer annotations for canonical pathways vs. emerging biology
  • Constraint metrics (gnomAD pLI/LOEUF) perform better for well-covered genes but may be unreliable for genes with low coverage
  • Human genes "are much more likely to have been published on in the last 12 years if they are in clusters that were already well known"

How to avoid:

  1. Decouple publication metrics from functional evidence: Score PubMed mentions separately from mechanistic evidence (protein domains, expression patterns, orthologs)
  2. Normalize for baseline research attention: Divide pathway/GO scores by total publication count to create a "novelty-adjusted" functional score, as sketched after this list
  3. Use sequence-based features heavily: Prioritize evidence types independent of literature (protein domains, tissue expression, evolutionary constraint, ortholog phenotypes)
  4. Set evidence diversity requirements: Require candidates to score above threshold in at least N different evidence categories, preventing single-layer dominance
  5. Explicit "under-studied" bonus: Add a scoring component that rewards genes with low PubMed counts but high biological plausibility from other layers
  6. Validate scoring against a "dark genome" test set: Ensure low-publication genes with strong experimental validation (from model organisms) score appropriately
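
A minimal sketch of point 2 and of the correlation warning sign listed below, assuming pandas and scipy; the column names ('final_score', 'go_pathway_score', 'pubmed_count') and the log-based attention term are illustrative choices.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def publication_bias_report(scores: pd.DataFrame) -> pd.DataFrame:
    """Add a novelty-adjusted functional score and flag score/publication correlation."""
    df = scores.copy()

    # Normalize annotation-derived scores by baseline research attention (point 2 above).
    attention = 1.0 + np.log10(df["pubmed_count"].clip(lower=0) + 1)
    df["novelty_adjusted_score"] = df["go_pathway_score"] / attention

    # Warning sign: high correlation between the final score and publication volume.
    rho, _ = spearmanr(df["final_score"], df["pubmed_count"])
    print(f"Spearman(final_score, pubmed_count) = {rho:.2f}")
    if rho > 0.5:
        print("WARNING: final score tracks publication volume; revisit layer weights.")
    return df
```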

Warning signs:

  • Top 100 candidates are all genes with >500 PubMed citations
  • Known disease genes with <50 publications rank outside the top 1,000 candidates
  • Correlation coefficient >0.7 between final score and PubMed count
  • When you add a new evidence layer, the top-ranked genes don't change
  • Positive controls (known cilia genes) score lower than expected based on functional relevance

Phase to address: Phase 3 (Scoring System Design): Implement multi-layer weighted scoring with explicit publication bias correction. Test scoring system on a stratified validation set including both well-studied and under-studied genes.


Pitfall 3: Missing Data Handled as "Negative Evidence"

What goes wrong: Genes lacking data in a particular evidence layer (e.g., no GTEx expression, no scRNA-seq detection, no IMPC phenotype) are treated as having "low expression" or "no phenotype" rather than "unknown." This systematically penalizes genes simply because they haven't been measured in the right contexts. For example, a gene expressed in rare retinal cell subtypes may be absent from bulk GTEx data and appear in only 1-2 cells per scRNA-seq atlas, leading to false classification as "not expressed in relevant tissues."

Why it happens:

  • Default pandas merge behavior drops unmatched records or fills with NaN
  • Scoring functions that sum across layers implicitly assign 0 to missing layers
  • GTEx bulk tissue lacks resolution for rare cell types (e.g., photoreceptor subtypes, vestibular hair cells)
  • scRNA-seq atlases have dropout and may not capture low-abundance transcripts
  • Model organism phenotypes exist only for ~40% of human genes
  • gnomAD constraint metrics are unreliable for genes with low coverage regions

How to avoid:

  1. Explicit missing data encoding: Use a three-state system: "present" (data exists and positive), "absent" (data exists and negative), "unknown" (no data available)
  2. Layer-specific score normalization: Compute scores only across layers with data; do not penalize genes for missing layers (see the sketch after this list)
  3. Imputation with biological priors: For genes with orthologs but no direct human data, propagate evidence from mouse/zebrafish with confidence weighting
  4. Coverage-aware constraint metrics: Only use gnomAD pLI/LOEUF when coverage is adequate (mean depth >30x and >90% of coding sequence covered)
  5. Aggregate at appropriate resolution: For scRNA-seq, aggregate to cell-type level rather than individual cells to handle dropout
  6. Document data availability per gene: Create a metadata field tracking which evidence layers are available for each gene to enable layer-aware filtering
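
A minimal sketch of points 1, 2, and 6, assuming pandas; the layer names are illustrative and per-layer evidence is assumed to be pre-scaled to [0, 1], with NaN meaning "unknown".

```python
import pandas as pd

# Illustrative evidence layers; NaN in a layer means "unknown", never zero evidence.
LAYERS = ["gtex_retina", "scrnaseq_ciliated", "impc_phenotype", "constraint"]

def layer_aware_score(evidence: pd.DataFrame) -> pd.DataFrame:
    """Average evidence only over layers with data, so missing layers do not penalize a gene."""
    df = evidence.copy()
    df["score"] = df[LAYERS].mean(axis=1, skipna=True)         # mean over available layers only
    df["n_layers_available"] = df[LAYERS].notna().sum(axis=1)  # data availability per gene
    return df
```

Because the mean is taken over available layers only, a gene measured in two layers competes on equal footing with one measured in four, which is the behavior the Phase 2 merge strategy should preserve.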

Warning signs:

  • Genes with partial data coverage systematically rank lower than genes with complete data
  • Number of scored genes drops dramatically when adding a new evidence layer (should be union, not intersection)
  • High-confidence candidates from preliminary analysis disappear after integrating additional databases
  • Known Usher genes with tissue-specific expression patterns score poorly

Phase to address: Phase 2 (Data Integration): Design merge strategy that preserves all genes and explicitly tracks data availability per evidence layer. Implement imputation strategy for ortholog-based evidence propagation.


Pitfall 4: Batch Effects Misinterpreted as Biological Signal in scRNA-seq Integration

What goes wrong: When integrating scRNA-seq data from multiple atlases (e.g., retinal atlas, inner ear atlas, nasal epithelium datasets), technical batch effects between studies are mistakenly interpreted as tissue-specific expression patterns. Methods that over-correct for batch effects can erase true biological variation, while under-correction leads to false cell-type-specific signals. Recent benchmarking shows "only 27% of integration outputs performed better than the best unintegrated data," and methods like LIGER, BBKNN, and Seurat v3 "tended to favor removal of batch effects over conservation of biological variation."

Why it happens:

  • Different scRNA-seq platforms (10X Chromium vs. Smart-seq2 vs. single-nuclei) have systematically different detection profiles
  • Batch effects arise from "cell isolation protocols, library preparation technology, and sequencing platforms"
  • Integration algorithms make tradeoffs between removing technical variation vs. preserving biological differences
  • "Increasing Kullback-Leibler divergence regularization does not improve integration and adversarial learning removes biological signals"
  • Tissue-of-origin effects (primary tissue vs. organoid vs. cell culture) can be confounded with true cell-type identity

How to avoid:

  1. Validate integration quality: After batch correction, verify that known marker genes still show expected cell-type-specific patterns (a sketch follows this list)
  2. Compare multiple integration methods: Run Harmony, Seurat v5, and scVI in parallel; select based on preservation of positive control markers
  3. Use positive/negative control genes: Ensure known cilia genes (IFT88, BBS1) show expected enrichment in ciliated cells post-integration
  4. Stratify by sequencing technology: Analyze 10X datasets separately from Smart-seq2 before attempting cross-platform integration
  5. Prefer within-study comparisons: When possible, compare cell types within the same study rather than integrating across studies
  6. Document integration parameters: Record all batch correction hyperparameters (k-neighbors, PCA dimensions, integration strength) to enable sensitivity analysis
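
A minimal sketch of the checks in points 1 and 3, assuming an AnnData object with corrected expression values in .X and a cell-type annotation column; the marker list and the 'cell_type' key are illustrative.

```python
import numpy as np

# Illustrative marker -> expected-cell-type map; extend with the project's curated controls.
MARKERS = {"IFT88": "ciliated", "BBS1": "ciliated"}

def check_marker_preservation(adata, celltype_key="cell_type"):
    """Verify known cilia markers remain enriched in the expected cell type post-integration."""
    labels = np.asarray(adata.obs[celltype_key].astype(str))
    for gene, expected in MARKERS.items():
        if gene not in adata.var_names:
            print(f"{gene}: absent from the integrated object (unknown, not negative evidence)")
            continue
        x = adata[:, gene].X
        expr = np.asarray(x.todense()).ravel() if hasattr(x, "todense") else np.asarray(x).ravel()
        means = {ct: expr[labels == ct].mean() for ct in np.unique(labels)}
        top = max(means, key=means.get)
        flag = "OK" if top == expected else "WARNING: marker enrichment lost after integration"
        print(f"{gene}: highest mean expression in '{top}' (expected '{expected}') -> {flag}")
```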

Warning signs:

  • Cell types from different studies don't overlap in UMAP space after integration
  • Marker genes lose cell-type-specificity after batch correction
  • Technical replicates from the same study cluster separately after integration
  • Integration method produces >50% improvement in batch-mixing metrics but known biological markers disappear

Phase to address: Phase 4 (scRNA-seq Processing): Implement multi-method integration pipeline with positive control validation. Create cell-type-specific expression profiles only after validating preservation of known markers.


Pitfall 5: Ortholog Function Conservation Over-Assumed

What goes wrong: Phenotypes from mouse knockouts (MGI) or zebrafish morphants (ZFIN) are naively transferred to human genes as if function were perfectly conserved, ignoring cases where orthologs have "dramatically different functions." This is especially problematic for gene families with lineage-specific duplications or losses. The assumption that "single-copy orthologs are more reliable for functional annotation" is contradicted by evidence showing "multi-copy genes are equally or more likely to provide accurate functional information."

Why it happens:

  • Automated orthology pipelines use sequence similarity thresholds without functional validation
  • "Bidirectional best hit (BBH) assumption can be false because genes may be each other's highest-ranking matches due to differential gene loss"
  • Domain recombination creates false orthology assignments where proteins share some but not all domains
  • Zebrafish genome duplication means many human genes have two zebrafish co-orthologs with subfunctionalized roles
  • "There is no universally applicable, unequivocal definition of conserved function"
  • Cilia gene functions may diverge between motile (zebrafish) and non-motile/sensory (mammalian) cilia contexts

How to avoid:

  1. Confidence-weight ortholog evidence: Use orthology confidence scores from databases (e.g., DIOPT scores, OMA groups) rather than binary ortholog/non-ortholog calls (see the sketch after this list)
  2. Require phenotype relevance: Only count ortholog phenotypes that match the target biology (ciliary defects, sensory organ abnormalities, not generic lethality)
  3. Handle one-to-many orthologs explicitly: For human genes with multiple zebrafish co-orthologs, aggregate phenotypes using OR logic (any co-ortholog with phenotype = positive evidence)
  4. Validate with synteny: Prioritize orthologs supported by both sequence similarity and conserved genomic context
  5. Species-appropriate expectations: Expect stronger conservation for cilia structure genes (IFT machinery) vs. sensory signaling cascades (tissue-specific)
  6. Cross-validate with multiple species: Genes with convergent phenotypes across mouse, zebrafish, AND Drosophila are more reliable than single-species evidence
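
A minimal sketch of points 1-3, assuming pandas; the column names, the 0-1 confidence scale, and the phenotype term list are illustrative stand-ins for a curated MP/ZP ontology term set.

```python
import pandas as pd

# Illustrative relevance filter; in practice use curated MP/ZP ontology term IDs.
CILIA_RELEVANT_PHENOTYPES = {"abnormal cilium morphology", "retinal degeneration", "abnormal hair cell"}

def ortholog_evidence(orthologs: pd.DataFrame) -> pd.Series:
    """Confidence-weighted, relevance-filtered ortholog phenotype evidence per human gene.

    Expects columns 'human_gene', 'confidence' (e.g. a DIOPT-style score scaled to 0-1),
    and 'phenotype'; one row per ortholog-phenotype pair.
    """
    relevant = orthologs[orthologs["phenotype"].isin(CILIA_RELEVANT_PHENOTYPES)]
    # OR logic across co-orthologs: the best relevant, confidence-weighted hit counts.
    return relevant.groupby("human_gene")["confidence"].max().rename("ortholog_score")
```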

Warning signs:

  • Ortholog evidence contradicts human genetic data (e.g., mouse knockout viable but human LoF variants cause disease)
  • Large fraction of human genes map to paralogs in model organism rather than true orthologs
  • Zebrafish co-orthologs have opposing phenotypes (one causes ciliopathy, other is wildtype)
  • Synteny breaks detected for claimed ortholog pairs

Phase to address: Phase 2 (Data Integration - Ortholog Module): Implement orthology confidence scoring and phenotype relevance filtering. Create manual curation workflow for ambiguous high-scoring candidates.


Pitfall 6: Constraint Metrics (gnomAD pLI/LOEUF) Misinterpreted

What goes wrong: gnomAD constraint scores are used as disease gene predictors without understanding their limitations. Researchers assume high pLI (>0.9) or low LOEUF (<0.35) automatically indicates dominant disease genes, but "even the most highly constrained genes are not necessarily autosomal dominant." Transcript selection errors cause dramatic score changes—SHANK2 has pLI=0 in the canonical transcript but pLI=1 in the brain-specific transcript due to differential exon usage. Low-coverage genes have unreliable constraint metrics but are not flagged.

Why it happens:

  • pLI is designed to be interpreted dichotomously (>0.9 intolerant, <0.1 tolerant), so hard cutoffs ignore genes in the intermediate range
  • LOEUF thresholds changed between gnomAD v2 (<0.35) and v4 (<0.6), causing confusion
  • "There is no brain expression from over half the exons in the SHANK2 canonical transcript where most of the protein truncating variants are found"
  • Genes with low sequencing coverage or high GC content have unreliable observed/expected ratios
  • Recessive disease genes can have low constraint (heterozygous LoF is benign)
  • Haploinsufficient genes vs. dominant-negative mechanisms are not distinguished

How to avoid:

  1. Use LOEUF continuously, not dichotomously: Score genes on the LOEUF scale rather than applying hard cutoffs, giving partial credit across the distribution (see the sketch after this list)
  2. Verify transcript selection: For ciliopathy candidates, ensure the scored transcript includes exons expressed in retina/inner ear using GTEx isoform data
  3. Check coverage metrics: Only use pLI/LOEUF when mean coverage >30x and >90% of CDS is covered in gnomAD
  4. Adjust expectations for inheritance pattern: High constraint supports haploinsufficiency; moderate constraint is compatible with recessive inheritance (relevant for Usher syndrome)
  5. Cross-validate with ClinVar: Check whether existing pathogenic variants in the gene match the constraint prediction
  6. Incorporate missense constraint: Use gnomAD missense Z-scores alongside LoF metrics; some genes tolerate LoF but are missense-constrained
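
A minimal sketch of points 1 and 3, assuming pandas; the column names and the linear rescaling of LOEUF onto 0-1 are illustrative choices, not a gnomAD convention.

```python
import numpy as np
import pandas as pd

def constraint_score(constraint: pd.DataFrame) -> pd.Series:
    """Continuous, coverage-aware LOEUF score; NaN (unknown) where coverage is inadequate.

    Expects columns 'loeuf', 'mean_depth', and 'frac_cds_covered'.
    """
    adequate = (constraint["mean_depth"] > 30) & (constraint["frac_cds_covered"] > 0.9)
    # Lower LOEUF = more constrained; rescale so LOEUF 0 -> 1.0 and LOEUF >= 2 -> 0.0.
    score = 1.0 - constraint["loeuf"].clip(0, 2) / 2.0
    return score.where(adequate, np.nan).rename("constraint_score")
```

Returning NaN rather than 0 for low-coverage genes lets the layer-aware scoring from Pitfall 3 treat them as "unknown" instead of penalizing them.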

Warning signs:

  • Candidate genes have low gnomAD coverage but constraint scores are used anyway
  • All top candidates have pLI >0.9, despite Usher syndrome being recessive
  • Constraint scores change dramatically when switching from canonical to tissue-specific transcript
  • Genes with established recessive inheritance pattern are filtered out due to low constraint

Phase to address: Phase 3 (Scoring System Design - Constraint Module): Implement coverage-aware constraint scoring with transcript validation. Create inheritance-pattern-adjusted weighting (lower weight for dominant constraint when prioritizing recessive disease genes).


Pitfall 7: "Unknown Function" Operationally Undefined Creates Circular Logic

What goes wrong: The pipeline aims to discover "under-studied" or "unknown function" genes but lacks a rigorous operational definition. If "unknown function" means "few PubMed papers," the literature bias pitfall is worsened. If it means "no GO annotations," it excludes genes with partial functional knowledge that might be excellent candidates. Approximately "40% of proteins in eukaryotic genomes are proteins of unknown function," but defining unknown is complex—"when a group of related sequences contains one or more members of known function, the similarity approach assigns all to the known space, whereas empirical approach distinguishes between characterized and uncharacterized candidates."

Why it happens:

  • GO term coverage is publication-biased; "52-79% of bacterial proteomes can be functionally annotated based on homology searches" but eukaryotes have more uncharacterized proteins
  • Partial functional knowledge exists on a spectrum (domain predictions, general pathway assignments, no mechanistic details)
  • "Research bias, as measured by publication volume, was an important factor influencing genome annotation completeness"
  • Negative selection (excluding known cilia genes) requires defining "known," which is database- and date-dependent

How to avoid:

  1. Multi-tier functional classification (sketched in code after this list):
    • Tier 1: Direct experimental evidence of cilia/Usher involvement (exclude from discovery)
    • Tier 2: Strong functional prediction (domains, orthologs) but no direct evidence (high-priority targets)
    • Tier 3: Minimal functional annotation (under-studied candidates)
    • Tier 4: No annotation beyond gene symbol (true unknowns)
  2. Explicit positive control exclusion list: Maintain a curated list of ~200 known cilia genes and established Usher genes to exclude; version-control this list
  3. Publication-independent functional metrics: Define "unknown" using absence of experimental GO evidence codes (EXP, IDA, IPI, IMP), not total publication count
  4. Pathway coverage threshold: Consider genes "known" if they appear in >3 canonical cilia pathways (IFT, transition zone, basal body assembly)
  5. Temporal versioning: Tag genes as "unknown as of [date]" to allow retrospective validation when new discoveries are published
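
One possible operationalization of the tiers in point 1, combined with the evidence-code rule in point 3, assuming pandas; the column names and exact tier boundaries are illustrative.

```python
import pandas as pd

# Experimental GO evidence codes (point 3 above): "unknown" means none of these, not "few papers".
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def classify_tier(row: pd.Series, known_cilia_genes: set) -> int:
    """Assign a functional-knowledge tier from annotation evidence rather than publication counts."""
    if row["gene_id"] in known_cilia_genes:
        return 1  # Tier 1: direct cilia/Usher evidence -> exclude from discovery
    if row["n_protein_domains"] > 0 or row["has_ortholog_phenotype"]:
        return 2  # Tier 2: strong functional prediction but no direct evidence -> high priority
    if set(row["go_evidence_codes"]) & EXPERIMENTAL_CODES:
        return 3  # Tier 3: minimal experimental annotation -> under-studied candidate
    return 4      # Tier 4: nothing beyond the gene symbol -> true unknown
```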

Warning signs:

  • Definition of "unknown" changes between pilot and production runs
  • Candidates include genes with extensive literature on cilia-related functions
  • Positive control genes leak into discovery set due to incomplete exclusion list
  • Different team members have conflicting intuitions about whether a gene is "known enough" to exclude

Phase to address: Phase 1 (Data Infrastructure - Annotation Module): Create explicit functional classification schema with clear inclusion/exclusion criteria. Build curated positive control list with literature provenance.


Pitfall 8: Reproducibility Theater Without Computational Environment Control

What goes wrong: Pipeline scripts are version-controlled and documented, creating an illusion of reproducibility, but results cannot be reproduced due to uncontrolled dependencies. Python package versions (pandas, numpy, scikit-learn), database snapshots (Ensembl release, gnomAD version, GTEx v8 vs v9), and API response changes cause silent result drift. "Workflow managers were developed in response to challenges with data complexity and reproducibility" but are often not adopted until after initial analyses are complete and results have diverged.

Why it happens:

  • pip install package without pinning versions installs latest release, which may change behavior
  • Ensembl/NCBI/UniProt databases update continuously; API calls return different data over time
  • Downloaded files are not checksummed; corruption or incomplete downloads go undetected
  • Reference genome versions (GRCh37 vs GRCh38) are mixed across data sources
  • RAM/disk caching causes results to differ between first run and subsequent runs
  • Random seeds not set for stochastic algorithms (UMAP, t-SNE, subsampling)

How to avoid:

  1. Pin all dependencies: Use requirements.txt with exact versions (pandas==2.0.3 not pandas>=2.0) or conda env export
  2. Containerize the environment: Build Docker/Singularity container with frozen versions; run all analyses inside container
  3. Snapshot external databases: Download full database dumps (Ensembl, GTEx) with version tags; do not rely on live API queries for production runs
  4. Checksum all downloaded data: Compute and store MD5/SHA256 hashes for every downloaded file; verify on load
  5. Version-control intermediate outputs: Store preprocessed data files (post-QC, post-normalization) with version tags to enable restart from checkpoints
  6. Set random seeds globally: Fix numpy/torch/random seeds at the start of every script to ensure stochastic steps are reproducible
  7. Log provenance metadata: Embed database versions, software versions, and run parameters in output files using JSON headers or HDF5 attributes (see the sketch below)
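
A minimal sketch of points 6 and 7, assuming pandas with a parquet writer (pyarrow) installed; the metadata fields, file layout, and sidecar naming are illustrative.

```python
import hashlib
import json
import random
import sys
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
import pandas as pd

def set_seeds(seed: int = 42) -> None:
    """Fix seeds for stochastic steps; pass the same seed to UMAP/t-SNE as random_state."""
    random.seed(seed)
    np.random.seed(seed)

def write_with_provenance(df: pd.DataFrame, path: str, db_versions: dict) -> None:
    """Write a table plus a sidecar JSON recording versions, inputs, and a content hash."""
    df.to_parquet(path)  # parquet preserves dtypes, unlike CSV
    provenance = {
        "created": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "pandas": pd.__version__,
        "numpy": np.__version__,
        "database_versions": db_versions,  # e.g. {"ensembl": "110", "gnomad": "v4", "gtex": "v8"}
        "sha256": hashlib.sha256(Path(path).read_bytes()).hexdigest(),
    }
    Path(path + ".provenance.json").write_text(json.dumps(provenance, indent=2))
```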

Warning signs:

  • Results change when re-running the same script on a different machine
  • Collaborator cannot reproduce your gene rankings despite using "the same code"
  • Adding a new analysis step changes results from previous steps (should be impossible if truly modular)
  • You cannot explain why the gene count changed between last month's and this month's runs

Phase to address: Phase 1 (Data Infrastructure): Establish containerized environment with pinned dependencies and database version snapshots. Implement checksumming and provenance logging from the start.


Technical Debt Patterns

Shortcuts that seem reasonable but create long-term problems.

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
| --- | --- | --- | --- |
| Hard-code database API URLs in scripts | Quick to write | Breaks when APIs change; no version control | Never: use a config file |
| Convert all IDs to gene symbols early | Simpler to read outputs | Loses ability to trace back to source; mapping errors propagate | Only for final display, never internal processing |
| Filter to protein-coding genes only, drop non-coding | Reduces dataset size | Misses lncRNAs with cilia-regulatory roles | Acceptable for MVP if explicitly documented as limitation |
| Use default merge (inner join) in pandas | Fewer rows to process | Silent data loss when IDs don't match | Never: always left join with explicit logging |
| Skip validation on positive control genes | Faster iteration | No way to detect when scoring system breaks | Never: positive controls are mandatory |
| Download data via API calls in main pipeline | No manual download step | Irreproducible; results change as databases update | Only during exploration phase, never production |
| Store intermediate data as CSV | Easy to inspect manually | Loss of data types (ints become floats); no metadata storage | Acceptable for small tables <10K rows |
| Use pip install --upgrade to fix bugs | Gets latest fixes | Introduces breaking changes unpredictably | Never: pin versions and upgrade explicitly |
| Aggregate to gene-level immediately from scRNA-seq | Simpler analysis | Loses cell-type resolution; can't detect subtype-specific expression | Only for bulk comparison, not prioritization |

Integration Gotchas

Common mistakes when connecting to external services.

| Integration | Common Mistake | Correct Approach |
| --- | --- | --- |
| Ensembl BioMart | Selecting canonical transcript only; miss tissue-specific isoforms | Query all transcripts, then filter by retina/inner ear expression using GTEx |
| gnomAD API | Not checking coverage per gene before using constraint scores | Filter to genes with mean_depth >30x and >90% CDS covered |
| GTEx Portal | Using TPM directly without accounting for sample size per tissue | Normalize by sample count and use median TPM across replicates |
| CellxGene API | Downloading full h5ad files serially (slow, memory-intensive) | Use CellxGene's API to fetch only cell-type-aggregated counts for genes of interest |
| UniProt REST API | Converting gene lists one-by-one in a loop | Use batch endpoint with POST requests (up to 100K IDs per request) |
| PubMed E-utilities | Sending requests without API key (3 req/sec limit) | Register NCBI API key for 10 req/sec; still implement exponential backoff |
| MGI/ZFIN batch queries | Assuming 1:1 human-mouse orthology | Handle one-to-many mappings explicitly (use highest-confidence ortholog or aggregate) |
| IMPC phenotype API | Taking all phenotypes as equally informative | Filter to cilia-relevant phenotypes (MP:0003935 cilium; HP:0000508 retinal degeneration) |
| String-DB | Assuming all edges are physical interactions | Filter by interaction type (text-mining vs experimental) and confidence score >0.7 |
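
A minimal sketch of the retry pattern recommended in the PubMed E-utilities row, assuming the requests library; the retried status codes and the starting delay are illustrative choices.

```python
import time
import requests

def get_with_backoff(url: str, params: dict, max_retries: int = 5, timeout: int = 30):
    """GET with exponential backoff on rate-limit or transient server errors."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url, params=params, timeout=timeout)
        if resp.status_code == 200:
            return resp
        if resp.status_code in (429, 500, 502, 503):  # rate-limited or transient failure
            time.sleep(delay)
            delay *= 2                                 # exponential backoff
            continue
        resp.raise_for_status()                        # anything else is a hard failure
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")
```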

Performance Traps

Patterns that work at small scale but fail as usage grows.

| Trap | Symptoms | Prevention | When It Breaks |
| --- | --- | --- | --- |
| Loading entire scRNA-seq h5ad into memory | Script crashes with MemoryError | Use backed mode (anndata.read_h5ad(file, backed='r')) to stream from disk | >5GB file or <32GB RAM |
| Nested loops for gene-gene comparisons | Script runs for hours | Vectorize with numpy/pandas; use scipy.spatial.distance for pairwise | >1K genes |
| Re-downloading data on every run | Slow iteration, API rate limits | Cache downloaded files locally with checksums; only re-download if missing | Always implement caching |
| Storing all intermediate results in memory | Cannot debug failed runs | Write intermediate outputs to disk after each major step | >50K genes or complex pipeline |
| Single-threaded processing | Slow on large gene sets | Parallelize with joblib/multiprocessing for embarrassingly parallel tasks | >10K genes or >1hr runtime |
| Not indexing database tables | Slow queries on merged datasets | Create indexes on ID columns (Ensembl ID, HGNC symbol) before joins | >20K genes or >5 tables |
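
A minimal sketch of the checksum-backed download cache from the "Re-downloading data on every run" row, assuming the requests library; the sidecar .sha256 convention and the in-memory download are illustrative (large files would be streamed).

```python
import hashlib
from pathlib import Path
from typing import Optional

import requests

def fetch_cached(url: str, dest: Path, expected_sha256: Optional[str] = None) -> Path:
    """Download once, verify the checksum, and reuse the local copy on later runs."""
    if dest.exists():
        return dest  # cached copy; its hash was verified at download time
    data = requests.get(url, timeout=300).content
    digest = hashlib.sha256(data).hexdigest()
    if expected_sha256 and digest != expected_sha256:
        raise ValueError(f"Checksum mismatch for {url}: got {digest}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    (dest.parent / (dest.name + ".sha256")).write_text(digest)  # store hash for provenance
    return dest
```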

Phase-Specific Warnings

| Phase / Topic | Likely Pitfall | Mitigation |
| --- | --- | --- |
| Data download (Phase 1) | API rate limits hit during batch download | Implement exponential backoff and retry logic; use NCBI API keys |
| ID mapping (Phase 1) | Multiple IDs map to the same symbol | Create explicit disambiguation strategy; log all many-to-one mappings |
| scRNA-seq integration (Phase 4) | Batch effects erase biological signal | Validate integration by checking known marker genes before/after |
| Scoring system (Phase 3) | Literature bias dominates scores | Normalize by publication count; validate on under-studied positive controls |
| Validation (Phase 5) | Positive controls not scoring as expected | Debug scoring system before declaring it ready; iterate on weights |
| Reproducibility (All phases) | Results differ between runs | Pin dependency versions and database snapshots from Phase 1 |
| Missing data handling (Phase 2) | Genes with incomplete data rank artificially low | Implement layer-aware scoring that doesn't penalize missing data |
| Ortholog phenotypes (Phase 2) | Mouse/zebrafish phenotypes over-interpreted | Require phenotype relevance filtering (cilia-related only) |

"Looks Done But Isn't" Checklist

Things that appear complete but are missing critical pieces.

  • ID mapping module: Often missing validation step to ensure known genes are successfully mapped—verify 100% of positive control genes map successfully
  • Scoring system: Often missing publication bias correction—verify correlation between final score and PubMed count is <0.5
  • scRNA-seq integration: Often missing positive control validation—verify known cilia genes show expected cell-type enrichment post-integration
  • Constraint metrics: Often missing coverage check—verify genes have adequate gnomAD coverage before using pLI/LOEUF
  • Ortholog evidence: Often missing confidence scoring—verify orthology confidence scores are used, not just binary ortholog calls
  • Data provenance: Often missing version logging—verify every output file records database versions and software versions used
  • Missing data handling: Often missing explicit "unknown" state—verify merge strategy preserves genes with partial data
  • Reproducibility: Often missing checksums on downloaded data—verify MD5 hashes are computed and stored for all external data files
  • Positive controls: Often missing negative control validation—verify negative controls (non-cilia genes) score appropriately low
  • API error handling: Often missing retry logic—verify exponential backoff is implemented for all external API calls
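
A minimal sketch of the control check from the "Positive controls" bullet above, assuming pandas and scipy; the column names, control sets, and significance threshold are illustrative.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def validate_controls(scores: pd.DataFrame, positives: set, negatives: set) -> None:
    """Check that known cilia genes outrank non-cilia negative controls in the final ranking.

    Expects columns 'gene_id' and 'final_score'.
    """
    pos = scores.loc[scores["gene_id"].isin(positives), "final_score"]
    neg = scores.loc[scores["gene_id"].isin(negatives), "final_score"]
    stat, p = mannwhitneyu(pos, neg, alternative="greater")
    print(f"Positive vs negative controls: U={stat:.0f}, p={p:.2e}")
    print(f"Median positive score: {pos.median():.3f}; median negative score: {neg.median():.3f}")
    if p > 0.05:
        print("WARNING: positive controls do not outrank negatives; revisit layer weights.")
```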

Recovery Strategies

When pitfalls occur despite prevention, how to recover.

| Pitfall | Recovery Cost | Recovery Steps |
| --- | --- | --- |
| ID mapping errors detected late | LOW | Re-run mapping module with corrected conversion tables; propagate fixes forward |
| Literature bias discovered after scoring | MEDIUM | Add publication normalization to scoring function; re-compute all scores |
| Batch effects in scRNA-seq integration | MEDIUM | Try alternative integration method (switch from Seurat to Harmony); re-validate |
| Missing data treated as negative | HIGH | Redesign merge strategy to preserve all genes; re-run entire integration pipeline |
| Reproducibility failure (cannot re-run) | HIGH | Containerize environment; snapshot databases; document and re-run from scratch |
| Positive controls score poorly | MEDIUM | Debug scoring function weights; validate on stratified test set; adjust weights iteratively |
| Ortholog function over-assumed | LOW | Add orthology confidence scores; re-filter to high-confidence orthologs only |
| Constraint metrics misinterpreted | LOW | Add coverage check; use LOEUF continuously instead of dichotomously; re-score |
| "Unknown function" poorly defined | MEDIUM | Create explicit functional tiers; rebuild positive control exclusion list; re-classify |

Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

| Pitfall | Prevention Phase | Verification |
| --- | --- | --- |
| Gene ID mapping inconsistency | Phase 1: Data Infrastructure | 100% of positive controls successfully mapped across all databases |
| Literature bias amplification | Phase 3: Scoring System | Correlation(final_score, pubmed_count) < 0.5; under-studied positive controls rank in top 10% |
| Missing data as negative evidence | Phase 2: Data Integration | Gene count preserved after merges (should be union, not intersection); explicit "unknown" state in data model |
| scRNA-seq batch effects | Phase 4: scRNA-seq Processing | Known cilia markers show expected cell-type specificity post-integration |
| Ortholog function over-assumed | Phase 2: Ortholog Module | Orthology confidence scores used; phenotype relevance filtering applied |
| Constraint metrics misinterpreted | Phase 3: Constraint Module | Coverage-aware filtering; LOEUF used continuously; transcript validation performed |
| "Unknown function" undefined | Phase 1: Annotation Module | Explicit functional tiers defined; positive control exclusion list version-controlled |
| Reproducibility failure | Phase 1: Environment Setup | Containerized; dependencies pinned; databases snapshotted; results bit-identical on re-run |
| API rate limits hit | Phase 1: Data Download | Retry logic with exponential backoff; batch queries used; NCBI API key registered |


Pitfalls research for: Bioinformatics Gene Candidate Discovery Pipeline for Cilia/Usher Syndrome
Researched: 2026-02-11
Confidence: HIGH (validated with 40+ authoritative sources across all risk domains)