From 674a9ae8452f9153e65b4038b7277b0ff08d5451 Mon Sep 17 00:00:00 2001 From: gbanyan Date: Fri, 13 Feb 2026 04:50:02 +0800 Subject: [PATCH] docs: add project README (zh/en) and CLAUDE.md Researcher-facing README explaining pipeline rationale, six evidence layers with scientific basis, scoring methodology, and step-by-step execution guide for CLI newcomers. CLAUDE.md provides development context for future Claude Code sessions. Co-Authored-By: Claude Opus 4.6 --- CLAUDE.md | 94 +++++++++ README.en.md | 548 +++++++++++++++++++++++++++++++++++++++++++++++++++ README.md | 546 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 1188 insertions(+) create mode 100644 CLAUDE.md create mode 100644 README.en.md create mode 100644 README.md diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..d419434 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,94 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Project Overview + +Bioinformatics pipeline for discovering under-studied candidate genes related to Usher syndrome and ciliopathies. Screens ~22,600 human protein-coding genes across 6 evidence layers, producing weighted composite scores and tiered candidate lists. + +## Commands + +```bash +# Install (editable, with dev deps) +pip install -e ".[dev]" + +# Run full pipeline (sequential steps) +usher-pipeline setup # Fetch gene universe via mygene +usher-pipeline evidence gnomad # gnomAD constraint metrics +usher-pipeline evidence annotation # GO/InterPro/pathway annotations +usher-pipeline evidence expression # HPA + GTEx + CellxGene tissue expression +usher-pipeline evidence localization # HPA subcellular + cilia proteomics +usher-pipeline evidence animal-models # HCOP orthologs + MGI/ZFIN/IMPC phenotypes +usher-pipeline evidence literature --email USER@EMAIL # PubMed via NCBI E-utilities +usher-pipeline score # Weighted composite scoring +usher-pipeline report # Generate TSV/Parquet + visualizations +usher-pipeline validate # Validate known Usher genes rank highly + +# Tests +pytest # All tests +pytest tests/test_gnomad.py # Single test file +pytest tests/test_gnomad.py::test_name # Single test +pytest -k "not integration" # Skip integration tests (which hit APIs) +``` + +## Architecture + +### Data Flow +``` +mygene API → gene_universe table (DuckDB) + ↓ +6 evidence layers (each: fetch → transform → load to DuckDB) + ↓ +scoring/integration.py: LEFT JOIN all layers → weighted composite + ↓ +output/: TSV, Parquet, visualizations, reproducibility report +``` + +### Key Design Decisions + +- **DuckDB** for persistence (`data/pipeline.duckdb`). Single-writer — no concurrent access from multiple processes. +- **Polars** for data manipulation (LazyFrame for fetch, DataFrame for transforms requiring horizontal ops). +- **NULL preservation**: Missing evidence ≠ zero score. LEFT JOINs preserve NULLs; scoring weights only applied to non-NULL layers (`evidence_count` tracks coverage). +- **Idempotent loads**: Each evidence layer uses `CREATE OR REPLACE TABLE`. +- **Checkpoint-restart**: Literature layer supports resuming via existing progress in DuckDB. + +### Source Layout + +``` +src/usher_pipeline/ +├── cli/ # Click commands (setup, evidence, score, report, validate) +├── config/ # YAML config loading + schema (ScoringWeights, DataVersions) +├── persistence/ # PipelineStore (DuckDB wrapper), ProvenanceTracker +├── gene_mapping/ # Gene universe fetch (mygene) + validation +├── evidence/ # 6 evidence layers, each with: +│ ├── {layer}/fetch.py # Download/API calls +│ ├── {layer}/transform.py # Data processing +│ ├── {layer}/load.py # DuckDB persistence +│ └── {layer}/models.py # Pydantic models + constants +├── scoring/ # Composite scoring, validation, sensitivity analysis +└── output/ # Report generation, visualizations +``` + +### DuckDB Tables + +| Table | Source | Score Column | +|-------|--------|-------------| +| `gene_universe` | mygene | — | +| `gnomad_constraint` | gnomAD v4.1 | `loeuf_normalized` | +| `tissue_expression` | HPA v23 + GTEx v8 | `expression_score_normalized` | +| `gene_annotation` | GO/InterPro/Reactome | `annotation_score_normalized` | +| `subcellular_localization` | HPA + proteomics | `localization_score_normalized` | +| `animal_model_phenotypes` | MGI/ZFIN/IMPC via HCOP | `animal_model_score_normalized` | +| `literature_evidence` | PubMed (NCBI E-utils) | `literature_score_normalized` | + +### Scoring Weights (config/default.yaml) + +gnomAD: 0.20, Expression: 0.20, Annotation: 0.15, Localization: 0.15, Animal Model: 0.15, Literature: 0.15 + +### Known Limitations + +- **gnomAD gene_id alignment**: gnomAD uses transcript-level IDs; join to gene_universe may produce NaN scores for some genes. +- **GTEx v8 lacks retina tissue**: "Eye - Retina" not available; retina expression comes only from HPA. +- **HPA expression merge gap**: HPA uses gene_symbol while pipeline keys on gene_id; the join in `expression/transform.py` may miss genes without symbol mapping. +- **Literature layer is slow**: ~8 genes/minute via NCBI E-utilities; full run takes ~46 hours for 22K genes. Use `--api-key` for 10 req/s (vs 3 req/s default). +- **HPA URLs pinned to v23**: Using `v23.proteinatlas.org` because latest version changed download paths. diff --git a/README.en.md b/README.en.md new file mode 100644 index 0000000..435110d --- /dev/null +++ b/README.en.md @@ -0,0 +1,548 @@ +# Usher Cilia Candidate Gene Discovery Pipeline + +A reproducible bioinformatics pipeline for systematic screening of candidate genes associated with Usher syndrome and ciliopathies. + +This pipeline evaluates approximately 22,600 human protein-coding genes across six independent evidence layers, producing weighted composite scores and tiered candidate gene lists for downstream experimental validation. + +--- + +## Table of Contents + +- [Research Background](#research-background) +- [Pipeline Overview](#pipeline-overview) +- [Installation and Setup](#installation-and-setup) +- [Running the Pipeline](#running-the-pipeline) +- [The Six Evidence Layers in Detail](#the-six-evidence-layers-in-detail) + - [1. gnomAD Constraint Analysis](#1-gnomad-constraint-analysis) + - [2. Gene Functional Annotation](#2-gene-functional-annotation) + - [3. Tissue Expression Specificity](#3-tissue-expression-specificity) + - [4. Subcellular Localization](#4-subcellular-localization) + - [5. Animal Model Phenotypes](#5-animal-model-phenotypes) + - [6. Literature Mining](#6-literature-mining) +- [Composite Scoring and Tiering](#composite-scoring-and-tiering) +- [Validation](#validation) +- [Output Files](#output-files) +- [Configuration](#configuration) +- [Known Limitations](#known-limitations) +- [Data Sources and References](#data-sources-and-references) + +--- + +## Research Background + +**Usher syndrome** is the most common cause of hereditary deaf-blindness, characterized by sensorineural hearing loss and retinitis pigmentosa. Approximately 10 causative genes have been identified to date (e.g., MYO7A, USH2A, CDH23), yet a subset of clinically diagnosed patients carry no pathogenic variants in known genes — suggesting that additional causative or modifier genes remain to be discovered. + +Usher proteins play critical roles in cilia and cilia-associated structures, particularly the connecting cilium of retinal photoreceptors and the stereocilia of inner ear hair cells. This pipeline therefore adopts **ciliary biology** as its central framework, integrating multi-dimensional genomic and functional data to systematically identify overlooked candidate genes. + +### Core Design Principles + +- **Missing data ≠ zero score**: If a gene lacks data in a given evidence layer (NULL), it is not penalized. Only layers with available data contribute to the weighted average. This prevents systematic bias against understudied genes. +- **Orthogonal multi-evidence validation**: Six evidence layers assess cilia/Usher relevance from independent angles, reducing the impact of any single data source bias. +- **Full reproducibility**: All data versions, parameters, and analysis steps are recorded with provenance tracking. + +--- + +## Pipeline Overview + +``` +┌──────────────────────────────────────────────────────┐ +│ Step 1: Build Gene Universe │ +│ Retrieve ~22,600 human protein-coding genes │ +│ via mygene API │ +└────────────────────────┬─────────────────────────────┘ + ▼ +┌──────────────────────────────────────────────────────┐ +│ Step 2: Six Evidence Layers (independent, any order)│ +│ │ +│ ┌────────────┐ ┌────────────┐ ┌─────────────────┐ │ +│ │ gnomAD │ │ Functional │ │ Tissue │ │ +│ │ Constraint │ │ Annotation │ │ Expression │ │ +│ └────────────┘ └────────────┘ └─────────────────┘ │ +│ ┌────────────┐ ┌────────────┐ ┌─────────────────┐ │ +│ │ Subcellular│ │ Animal │ │ Literature │ │ +│ │ Localiz. │ │ Models │ │ Mining │ │ +│ └────────────┘ └────────────┘ └─────────────────┘ │ +└────────────────────────┬─────────────────────────────┘ + ▼ +┌──────────────────────────────────────────────────────┐ +│ Step 3: Composite Weighted Scoring + Tiering │ +│ NULL-aware weighted average → HIGH/MEDIUM/LOW │ +└────────────────────────┬─────────────────────────────┘ + ▼ +┌──────────────────────────────────────────────────────┐ +│ Step 4: Validation + Report Generation │ +│ Known Usher/cilia gene ranking check │ +│ → TSV + Parquet + Visualizations │ +└──────────────────────────────────────────────────────┘ +``` + +--- + +## Installation and Setup + +### System Requirements + +- Python 3.11 or later +- Approximately 5 GB disk space (for caching downloaded databases) +- Internet connection (required on first run to download external data) + +### Installation + +Open a terminal, navigate to the project directory, and run: + +```bash +# 1. Create a virtual environment (recommended) +python -m venv .venv +source .venv/bin/activate # macOS / Linux +# .venv\Scripts\activate # Windows + +# 2. Install the pipeline +pip install -e ".[dev]" +``` + +To verify the installation: + +```bash +usher-pipeline info +``` + +If you see a version number and configuration summary, the installation was successful. + +### NCBI API Key (required for literature mining) + +The literature mining layer queries PubMed via the NCBI E-utilities API. An email address is required, and an API key is recommended to increase query throughput (3 requests/second without key, 10 requests/second with key). + +To obtain an API key: register an NCBI account at https://www.ncbi.nlm.nih.gov/account/ and generate a key in your account settings. + +--- + +## Running the Pipeline + +Below is the complete execution workflow. Each step is an independent command, run in order. + +> **Note**: Each command prints summary statistics upon completion. If a step fails, fix the issue and re-run the same command — all steps are idempotent (re-running produces no duplicate data). + +### Step 1: Build Gene Universe + +```bash +usher-pipeline setup +``` + +This step will: +- Query all human protein-coding genes (~22,600) via the mygene API +- Establish mappings between Ensembl Gene ID, HGNC Symbol, and UniProt Accession +- Store results in a local DuckDB database (`data/pipeline.duckdb`) + +### Step 2: Run the Six Evidence Layers + +The six layers can be run in any order, as they are independent of each other. However, run them **one at a time** — do not run two simultaneously (DuckDB supports only a single writer). + +```bash +# 2a. gnomAD constraint metrics +usher-pipeline evidence gnomad + +# 2b. Gene functional annotation +usher-pipeline evidence annotation + +# 2c. Tissue expression specificity +usher-pipeline evidence expression + +# 2d. Subcellular localization +usher-pipeline evidence localization + +# 2e. Animal model phenotypes +usher-pipeline evidence animal-models + +# 2f. Literature mining (email required; API key recommended) +usher-pipeline evidence literature --email your@email.com --api-key YOUR_KEY +``` + +> **Note**: The literature mining layer queries PubMed records for all ~22,600 genes, which is rate-limited (~8 genes/minute). A full run may take a considerable amount of time. This layer supports checkpoint-restart — if interrupted, simply re-run the same command to resume from where it left off. + +### Step 3: Composite Scoring + +```bash +usher-pipeline score +``` + +This step will: +- LEFT JOIN all six evidence layer scores with the gene universe +- Compute a NULL-aware weighted composite score +- Validate rankings against known Usher/cilia genes (positive controls) +- Classify genes into HIGH / MEDIUM / LOW confidence tiers + +### Step 4: Generate Report + +```bash +usher-pipeline report +``` + +Output files are written to the `data/report/` directory, including: +- `candidates.tsv` — Candidate gene list (tab-separated; can be opened in Excel) +- `candidates.parquet` — Same data in high-performance binary format (for R/Python analysis) +- `score_distribution.png` — Score distribution histogram +- `evidence_coverage.png` — Coverage chart for each evidence layer +- `tier_distribution.png` — Confidence tier pie chart +- `reproducibility.md` — Full provenance record (parameters, data versions, software environment) + +### Optional: Validation Report + +```bash +usher-pipeline validate +``` + +Verifies that known Usher genes and SYSCILIA cilia genes rank in the top 25%, confirming the scoring system's effectiveness. + +--- + +## The Six Evidence Layers in Detail + +### 1. gnomAD Constraint Analysis + +**Biological question**: Does this gene show strong selective constraint against loss-of-function (LoF) variants in human populations? + +**Data source**: gnomAD v4.1 constraint metrics + +**Scientific rationale**: +If a gene is essential for normal physiological function (e.g., sensory function), loss-of-function variants should be rarely observed in the human population, because carriers of such variants would have reduced fitness. The gnomAD database quantifies this by comparing the observed number of LoF variants against the expected number. + +**Key metrics**: +- **LOEUF** (Loss-of-function Observed/Expected Upper bound Fraction): Lower values indicate stronger constraint +- **pLI** (Probability of LoF Intolerance): Higher values indicate greater intolerance to LoF variants + +**Scoring**: +LOEUF is inverted and normalized to a 0–1 range: + +``` +loeuf_normalized = (LOEUF_max − LOEUF) / (LOEUF_max − LOEUF_min) +``` + +Higher score = stronger constraint = greater functional importance. + +**Quality control**: Requires mean sequencing depth ≥30x and CDS coverage ≥90%. Genes below these thresholds are flagged as `incomplete_coverage`. + +--- + +### 2. Gene Functional Annotation + +**Biological question**: How well-characterized is this gene in terms of functional annotations? + +**Data sources**: +- Gene Ontology (GO) — biological process, molecular function, cellular component +- UniProt — protein annotation quality score (0–5) +- KEGG / Reactome — metabolic and signaling pathway membership + +**Scientific rationale**: +The completeness of functional annotation reflects how thoroughly a gene has been studied. Importantly, this layer is **inversely correlated with novelty** — genes with fewer annotations may represent undiscovered biology. To account for this, annotation carries only 15% weight in the composite score, allowing novel candidates with evidence from other layers to still rank highly. + +**Scoring**: + +``` +annotation_score = 0.5 × GO_component + 0.3 × UniProt_component + 0.2 × pathway_component + +where: + GO_component = log₂(GO_term_count + 1) / log₂(max_GO_count + 1) + UniProt_component = uniprot_score / 5.0 + pathway_component = 1 if any pathway annotation exists, else 0 +``` + +--- + +### 3. Tissue Expression Specificity + +**Biological question**: Is this gene preferentially expressed in Usher-relevant tissues (retina, inner ear)? + +**Data sources**: +- **HPA** (Human Protein Atlas) v23 — tissue-level RNA expression (TPM) +- **GTEx** v8 — bulk RNA-seq across 54 human tissues +- **CellxGene** — single-cell RNA-seq (photoreceptor and hair cell populations) + +**Scientific rationale**: +Usher syndrome affects retinal photoreceptors and cochlear hair cells. Genes highly enriched in these tissues are likely to have specialized functions there, making them candidates of interest. + +**Key metrics**: + +1. **Tau tissue specificity index (τ)**: Measures how uniformly a gene is expressed across tissues + - τ = 0: Ubiquitous expression (housekeeping gene) + - τ = 1: Highly tissue-specific + - Formula: `τ = Σ(1 − xᵢ/x_max) / (n − 1)` + +2. **Usher tissue enrichment**: Mean expression in target tissues (retina, cerebellum, photoreceptors, hair cells) divided by mean expression across all tissues. A ratio > 1 indicates enrichment in target tissues. + +**Scoring**: + +``` +expression_score = 0.4 × enrichment_percentile + 0.3 × τ_specificity + 0.3 × max_target_tissue_percentile +``` + +Percentile normalization is used to prevent absolute TPM values from dominating due to sequencing depth variation. + +--- + +### 4. Subcellular Localization + +**Biological question**: Does this protein localize to cilia, centrosomes, or basal bodies — structures implicated in Usher pathology? + +**Data sources**: +- **HPA subcellular location data** — immunofluorescence microscopy +- **Cilia proteomics** — published mass spectrometry-based cilium proteome datasets (CiliaCarta, etc.) +- **Centrosome proteomics** — published centrosome/basal body proteome datasets + +**Scientific rationale**: +Known Usher proteins (MYO7A, USH1C, CDH23, etc.) are ciliary or periciliary proteins. Localization to cilia, basal bodies, or centrosomes constitutes **strong mechanistic evidence** for involvement in ciliopathy pathways. + +**Scoring**: + +| Localization | Base Score | +|-------------|-----------| +| Cilia, centrosome, basal body, transition zone, stereocilia | 1.0 | +| Cytoskeleton, microtubules, cell junctions | 0.5 | +| Found in proteomics datasets only | 0.3 | +| Has localization data but no cilia association | 0.0 | +| No localization data available | NULL | + +**Evidence weighting**: +- Experimental evidence (HPA Enhanced/Supported reliability; proteomics): × 1.0 +- Computational prediction (HPA Approved/Uncertain reliability): × 0.6 + +``` +localization_score = cilia_proximity_base_score × evidence_type_weight +``` + +--- + +### 5. Animal Model Phenotypes + +**Biological question**: When orthologs of this gene are disrupted in mouse or zebrafish, do sensory or ciliary phenotypes result? + +**Data sources**: +- **HCOP** (HUGO Gene Nomenclature Committee Ortholog Predictions) — human–mouse/human–zebrafish ortholog mapping +- **MGI** (Mouse Genome Informatics) — mouse knockout phenotypes (MP ontology) +- **ZFIN** (Zebrafish Information Network) — zebrafish mutant phenotypes +- **IMPC** (International Mouse Phenotyping Consortium) — systematic knockout screens + +**Scientific rationale**: +Conservation of sensory phenotypes across species provides **functional validation**. Hearing and balance defects in mouse, and lateral line defects in zebrafish, recapitulate the pathological features of human Usher syndrome. + +**Phenotype keyword filtering**: +- Mouse: hearing, vision, retina, photoreceptor, cochlea, stereocilia, cilia, vestibular, balance +- Zebrafish: hearing, ear, otic, otolith, lateral line, hair cell, retina, vision, eye + +**Scoring**: + +``` +animal_model_score = + 0.4 × mouse_score × mouse_confidence + + 0.3 × zebrafish_score × zebrafish_confidence + + 0.3 × impc_bonus + +Confidence weights: HIGH = 1.0, MEDIUM = 0.7, LOW = 0.4 +Phenotype count scaling: log₂(sensory_phenotype_count + 1) / log₂(max_count + 1) +``` + +Logarithmic scaling prevents genes with excessive phenotype annotations from dominating rankings (diminishing returns). + +--- + +### 6. Literature Mining + +**Biological question**: Is this gene mentioned in the scientific literature in a cilia or sensory context, and what is the quality of that evidence? + +**Data source**: NCBI PubMed (via E-utilities API) + +**Scientific rationale**: +Literature evidence reflects accumulated scientific knowledge, but suffers from **study bias** — well-known genes (e.g., TP53, BRCA1) have massive publication counts regardless of cilia relevance. The normalization strategy in this layer specifically corrects for this bias. + +**Evidence quality tiers**: + +| Tier | Description | Weight | +|------|-------------|--------| +| direct_experimental | Gene knockout/mutation in cilia/sensory context | 1.0 | +| functional_mention | Cilia/sensory context with ≥3 publications | 0.6 | +| hts_hit | High-throughput screen hit in cilia/sensory context | 0.3 | +| incidental | Publications exist but no cilia/sensory context | 0.1 | +| none | No publications found | 0.0 | + +**Study bias correction**: + +``` +raw_score = (context_score × quality_weight) / log₂(total_pubmed_count + 1) +literature_score = percentile_rank(raw_score) +``` + +The division by `log₂(total_pubmed_count)` is critical: a gene with 5 cilia-related papers out of 50 total will score higher than a gene with 5 cilia-related papers out of 100,000 total. This encourages the discovery of overlooked genes with suggestive evidence. + +--- + +## Composite Scoring and Tiering + +### Weighted Composite Score + +The six evidence layers are combined into a single composite score using configurable weights: + +| Evidence Layer | Default Weight | Biological Significance | +|----------------|---------------|------------------------| +| gnomAD Constraint | 20% | Functional essentiality of the gene | +| Tissue Expression | 20% | Specific expression in target tissues | +| Functional Annotation | 15% | Known functional characteristics | +| Subcellular Localization | 15% | Proximity to ciliary structures | +| Animal Models | 15% | Cross-species functional validation | +| Literature Mining | 15% | Cilia/sensory association in literature | + +**NULL-aware weighted average**: + +``` +composite_score = Σ(scoreᵢ × weightᵢ) / Σ(weightᵢ) + ── only non-NULL layers contribute ── +``` + +Example: A gene with data in only gnomAD (0.8), expression (0.6), and localization (0.9): + +``` +composite_score = (0.8 × 0.20 + 0.6 × 0.20 + 0.9 × 0.15) / (0.20 + 0.20 + 0.15) + = 0.415 / 0.55 + = 0.755 +``` + +### Confidence Tiers + +| Tier | Criteria | Interpretation | +|------|----------|----------------| +| **HIGH** | Composite score ≥ 0.7 and ≥ 3 layers with data | High-priority candidate; recommended for experimental validation | +| **MEDIUM** | Composite score ≥ 0.4 and ≥ 2 layers with data | Moderate evidence; warrants further literature investigation | +| **LOW** | Composite score ≥ 0.2 | Weak evidence; additional data needed | +| EXCLUDED | Below LOW threshold | Excluded from candidate list | + +### Quality Flags + +| Flag | Criteria | Description | +|------|----------|-------------| +| sufficient_evidence | ≥ 4 layers with scores | Adequate data coverage | +| moderate_evidence | ≥ 2 layers with scores | Partial coverage | +| sparse_evidence | ≥ 1 layer with score | Sparse data | +| no_evidence | 0 layers with scores | No data available | + +--- + +## Validation + +The pipeline includes built-in positive control validation to confirm the effectiveness of the scoring system. + +### Positive Control Gene Sets + +1. **OMIM Usher genes** (10 genes): MYO7A, USH1C, CDH23, PCDH15, USH1G, CIB2, USH2A, ADGRV1, WHRN, CLRN1 +2. **SYSCILIA SCGS v2 core ciliary genes** (28 genes): IFT88, IFT140, BBS1, CEP290, RPGR, and others + +### Validation Criteria + +- Median percentile rank of known genes should be ≥ 75% (top quartile) +- Top 10% of candidates should contain > 70% of known genes (Recall@10%) + +If validation fails, it indicates potential issues with weight configuration or data quality that require investigation and adjustment. + +--- + +## Output Files + +All results are stored in the `data/report/` directory: + +| File | Description | How to Use | +|------|-------------|-----------| +| `candidates.tsv` | Candidate gene list (tab-separated) | Open in Excel or any text editor | +| `candidates.parquet` | Same data in high-performance binary format | R: `arrow::read_parquet()`; Python: `polars.read_parquet()` | +| `score_distribution.png` | Score distribution histogram | Quick overview of overall distribution | +| `evidence_coverage.png` | Per-layer coverage chart | Identify which layers have best/worst coverage | +| `tier_distribution.png` | HIGH/MEDIUM/LOW proportion pie chart | Quick summary | +| `reproducibility.md` | Full provenance record (parameters, data versions, software) | Methods section / reproducibility | + +### Column Descriptions for candidates.tsv + +**Core columns**: +- `gene_id` — Ensembl Gene ID (e.g., ENSG00000154229) +- `gene_symbol` — HGNC gene symbol (e.g., MYO7A) +- `composite_score` — Composite score (0–1) +- `confidence_tier` — Confidence tier (HIGH / MEDIUM / LOW) +- `evidence_count` — Number of evidence layers with non-NULL scores (0–6) + +**Per-layer scores** (0–1; NULL indicates no data available): +- `gnomad_score` — Constraint score +- `expression_score` — Expression specificity score +- `annotation_score` — Functional annotation score +- `localization_score` — Subcellular localization score +- `animal_model_score` — Animal model phenotype score +- `literature_score` — Literature evidence score + +**Auxiliary columns**: +- `supporting_layers` — Layers with scores (e.g., `gnomad,expression,localization`) +- `evidence_gaps` — Layers with missing data +- `quality_flag` — Quality flag + +--- + +## Configuration + +The pipeline configuration file is located at `config/default.yaml`: + +```yaml +# Data versions +versions: + ensembl_release: 113 + gnomad_version: v4.1 + gtex_version: v8 + hpa_version: "23.0" + +# Scoring weights (adjustable; must sum to 1.0) +scoring: + gnomad: 0.20 + expression: 0.20 + annotation: 0.15 + localization: 0.15 + animal_model: 0.15 + literature: 0.15 + +# API settings +api: + rate_limit_per_second: 5 + max_retries: 5 + timeout_seconds: 30 +``` + +To adjust weights (e.g., to place greater emphasis on tissue expression), modify the `scoring` section and re-run `usher-pipeline score` followed by `usher-pipeline report`. + +--- + +## Known Limitations + +| Limitation | Description | Impact | +|-----------|-------------|--------| +| GTEx v8 lacks retina tissue | "Eye - Retina" is not available in GTEx v8 | Retina expression data comes from HPA only | +| HPA gene symbol mismatch | HPA uses gene symbols while the pipeline keys on gene IDs | Some genes may lack HPA expression data | +| gnomAD transcript-level IDs | gnomAD uses transcript IDs rather than gene-level IDs | Some genes may produce NaN on JOIN | +| Literature mining speed | NCBI API rate limits; ~8 genes/minute | Full run requires considerable time; use API key for faster throughput | +| Single-writer constraint | DuckDB does not support concurrent writers | Do not run two pipeline steps simultaneously | +| Limited inner ear data | Bulk transcriptome data for human cochlear tissue is scarce | Cerebellum (cilia-rich) and CellxGene hair cell data used as proxies | + +--- + +## Data Sources and References + +This pipeline integrates the following public databases and tools: + +- **gnomAD** v4.1 — Karczewski et al. (2020) *Nature* 581:434–443 +- **Human Protein Atlas** v23 — Uhlén et al. (2015) *Science* 347:1260419 +- **GTEx** v8 — GTEx Consortium (2020) *Science* 369:1318–1330 +- **CellxGene** — Chan Zuckerberg Initiative single-cell atlas +- **Gene Ontology** — Gene Ontology Consortium (2021) *Nucleic Acids Res* 49:D325–D334 +- **UniProt** — UniProt Consortium (2023) *Nucleic Acids Res* 51:D523–D531 +- **MGI** — Mouse Genome Informatics, The Jackson Laboratory +- **ZFIN** — Zebrafish Information Network +- **IMPC** — International Mouse Phenotyping Consortium +- **SYSCILIA SCGS v2** — van Dam et al. (2021) *Mol Biol Cell* 32:br6 +- **mygene.info** — Xin et al. (2016) *Genome Biol* 17:91 +- **NCBI E-utilities** — PubMed literature search API + +--- + +## License + +MIT License diff --git a/README.md b/README.md new file mode 100644 index 0000000..cda4778 --- /dev/null +++ b/README.md @@ -0,0 +1,546 @@ +# Usher Cilia Candidate Gene Discovery Pipeline + +一套可重現的生物資訊分析管線,用於系統性篩選與 Usher 症候群及纖毛病變(ciliopathies)相關的候選基因。 + +本管線對人類約 22,600 個蛋白質編碼基因,透過六個獨立的證據層面進行評分與排序,最終產出分層候選基因清單,供後續實驗驗證參考。 + +--- + +## 目錄 + +- [研究背景](#研究背景) +- [管線總覽](#管線總覽) +- [安裝與環境設定](#安裝與環境設定) +- [執行管線](#執行管線) +- [六大證據層面詳解](#六大證據層面詳解) + - [1. gnomAD 約束性分析](#1-gnomad-約束性分析) + - [2. 基因功能註釋](#2-基因功能註釋) + - [3. 組織表達特異性](#3-組織表達特異性) + - [4. 亞細胞定位](#4-亞細胞定位) + - [5. 動物模型表型](#5-動物模型表型) + - [6. 文獻探勘](#6-文獻探勘) +- [綜合評分與分層](#綜合評分與分層) +- [驗證機制](#驗證機制) +- [產出檔案說明](#產出檔案說明) +- [設定檔調整](#設定檔調整) +- [已知限制](#已知限制) +- [引用資料庫](#引用資料庫) + +--- + +## 研究背景 + +**Usher 症候群**是最常見的遺傳性聾盲症候群,患者同時出現感音神經性聽力損失與視網膜色素變性。目前已知的致病基因(如 MYO7A、USH2A、CDH23 等)約 10 個,但臨床上仍有部分患者無法找到已知基因的致病變異,暗示可能存在尚未被發現的致病或修飾基因。 + +Usher 蛋白在纖毛(cilia)與纖毛相關結構中扮演關鍵角色,特別是視網膜光受器細胞的連接纖毛(connecting cilium)及內耳毛細胞的靜纖毛(stereocilia)。因此,本管線以**纖毛生物學**為核心,整合多面向的基因體與功能體學資料,系統性地搜尋可能被忽略的候選基因。 + +### 核心設計理念 + +- **缺失資料 ≠ 零分**:若某基因在特定證據層無資料(NULL),不會被扣分,僅以有資料的層面計算加權平均。這避免了對研究不足基因的系統性懲罰。 +- **多面向正交驗證**:六個證據層面分別從不同角度衡量基因與纖毛/Usher 的關聯性,降低單一資料來源偏差的影響。 +- **可重現性**:所有資料版本、參數設定、分析步驟均有完整記錄。 + +--- + +## 管線總覽 + +``` +┌─────────────────────────────────────────────────────┐ +│ Step 1: 建立基因宇宙 (Gene Universe) │ +│ 透過 mygene API 取得 ~22,600 個人類蛋白質編碼基因 │ +└────────────────────────┬────────────────────────────┘ + ▼ +┌─────────────────────────────────────────────────────┐ +│ Step 2: 六大證據層面 (並行可獨立執行) │ +│ │ +│ ┌───────────┐ ┌───────────┐ ┌────────────────┐ │ +│ │ gnomAD │ │ 功能註釋 │ │ 組織表達特異性 │ │ +│ │ 約束性 │ │ │ │ │ │ +│ └───────────┘ └───────────┘ └────────────────┘ │ +│ ┌───────────┐ ┌───────────┐ ┌────────────────┐ │ +│ │ 亞細胞定位 │ │ 動物模型 │ │ 文獻探勘 │ │ +│ │ │ │ 表型 │ │ │ │ +│ └───────────┘ └───────────┘ └────────────────┘ │ +└────────────────────────┬────────────────────────────┘ + ▼ +┌─────────────────────────────────────────────────────┐ +│ Step 3: 綜合加權評分 + 信心分層 │ +│ NULL-aware weighted average → HIGH/MEDIUM/LOW │ +└────────────────────────┬────────────────────────────┘ + ▼ +┌─────────────────────────────────────────────────────┐ +│ Step 4: 驗證 + 報告產出 │ +│ 已知 Usher/cilia 基因排名驗證 → TSV + Parquet + 圖表 │ +└─────────────────────────────────────────────────────┘ +``` + +--- + +## 安裝與環境設定 + +### 系統需求 + +- Python 3.11 以上 +- 約 5 GB 磁碟空間(用於快取下載的資料庫檔案) +- 網路連線(首次執行需下載外部資料) + +### 安裝步驟 + +打開終端機(Terminal),進入專案資料夾後執行: + +```bash +# 1. 建立虛擬環境(建議) +python -m venv .venv +source .venv/bin/activate # macOS / Linux +# .venv\Scripts\activate # Windows + +# 2. 安裝管線 +pip install -e ".[dev]" +``` + +安裝完成後,可以用以下指令確認是否成功: + +```bash +usher-pipeline info +``` + +如果看到版本號與設定摘要,表示安裝成功。 + +### NCBI API Key(文獻探勘層需要) + +文獻探勘層使用 NCBI PubMed E-utilities API。此 API 需要提供電子郵件,並建議申請 API Key 以提高查詢速度(3 次/秒 → 10 次/秒)。 + +申請方式:前往 https://www.ncbi.nlm.nih.gov/account/ 註冊 NCBI 帳號,在帳號設定中產生 API Key。 + +--- + +## 執行管線 + +以下是完整的執行流程。每一步都是獨立的指令,按順序執行即可。 + +> **提示**:每個指令執行完畢後會顯示摘要統計。如果中途失敗,修正問題後重新執行同一步即可(所有步驟皆為冪等操作,重複執行不會產生重複資料)。 + +### Step 1:建立基因宇宙 + +```bash +usher-pipeline setup +``` + +這一步會: +- 透過 mygene API 查詢所有人類蛋白質編碼基因(約 22,600 個) +- 建立 Ensembl Gene ID ↔ HGNC Symbol ↔ UniProt Accession 的對應關係 +- 結果儲存至本地 DuckDB 資料庫(`data/pipeline.duckdb`) + +### Step 2:執行六大證據層面 + +六個層面可以按任意順序執行,彼此之間互相獨立。但請**逐一執行**,不要同時開兩個(DuckDB 僅支援單一寫入者)。 + +```bash +# 2a. gnomAD 約束性指標 +usher-pipeline evidence gnomad + +# 2b. 基因功能註釋 +usher-pipeline evidence annotation + +# 2c. 組織表達特異性 +usher-pipeline evidence expression + +# 2d. 亞細胞定位 +usher-pipeline evidence localization + +# 2e. 動物模型表型 +usher-pipeline evidence animal-models + +# 2f. 文獻探勘(需要 email,建議提供 API key) +usher-pipeline evidence literature --email your@email.com --api-key YOUR_KEY +``` + +> **注意**:文獻探勘層需要查詢 22,600 個基因的 PubMed 記錄,速度較慢(約 8 基因/分鐘),完整執行可能需要較長時間。此層支援中斷後續跑(checkpoint-restart),如果中途斷線,重新執行同一指令即可從上次進度繼續。 + +### Step 3:綜合評分 + +```bash +usher-pipeline score +``` + +此步驟會: +- 以 LEFT JOIN 合併六個證據層的分數 +- 計算 NULL-aware 加權綜合分數 +- 對已知 Usher/纖毛基因進行排名驗證(正控制) +- 將基因分為 HIGH / MEDIUM / LOW 三個信心層級 + +### Step 4:產出報告 + +```bash +usher-pipeline report +``` + +產出檔案會放在 `data/report/` 目錄,包含: +- `candidates.tsv` — 候選基因清單(可用 Excel 開啟) +- `candidates.parquet` — 同上,但為高效能二進位格式(適合 R / Python 後續分析) +- `score_distribution.png` — 分數分佈圖 +- `evidence_coverage.png` — 各證據層覆蓋率圖 +- `tier_distribution.png` — 信心層級分佈圓餅圖 +- `reproducibility.md` — 可重現性報告(含所有參數與資料版本) + +### 選用:驗證報告 + +```bash +usher-pipeline validate +``` + +驗證已知 Usher 基因與 SYSCILIA 纖毛基因是否排名在前 25%,確認評分系統的有效性。 + +--- + +## 六大證據層面詳解 + +### 1. gnomAD 約束性分析 + +**生物學問題**:此基因是否在人類族群中顯示強烈的功能喪失(Loss-of-Function)選擇約束? + +**資料來源**:gnomAD v4.1 約束性指標 + +**科學依據**: +如果一個基因對正常生理功能是必要的(例如感覺功能),那麼在人類族群中應該很少觀察到該基因的功能喪失型變異(LoF variants),因為帶有這些變異的個體會有降低的適應性。gnomAD 資料庫透過比較「觀察到的 LoF 變異數量」與「預期的 LoF 變異數量」,計算出約束性指標。 + +**關鍵指標**: +- **LOEUF**(Loss-of-function Observed/Expected Upper bound Fraction):越低代表約束越強 +- **pLI**(Probability of LoF Intolerance):越高代表越不容忍 LoF 變異 + +**評分方式**: +將 LOEUF 值反轉並正規化至 0–1 範圍: + +``` +loeuf_normalized = (LOEUF_max − LOEUF) / (LOEUF_max − LOEUF_min) +``` + +分數越高 = 約束性越強 = 該基因對正常功能越重要。 + +**品質控管**:要求平均定序深度 ≥30x 且 CDS 涵蓋率 ≥90%,低於此閾值的基因標記為 `incomplete_coverage`。 + +--- + +### 2. 基因功能註釋 + +**生物學問題**:此基因的功能註釋完整程度如何? + +**資料來源**: +- Gene Ontology(GO)— 生物過程、分子功能、細胞組分 +- UniProt — 蛋白質註釋品質分數(0–5) +- KEGG / Reactome — 代謝與訊號傳遞路徑 + +**科學依據**: +功能註釋的完整程度反映了一個基因被研究的深度。注意:此層面與「新穎性」呈**負相關**——註釋越少的基因可能代表尚未被發現的生物學。因此在綜合評分中,此層僅佔 15% 權重,讓缺乏註釋但有其他證據支持的新穎候選基因仍能獲得高排名。 + +**評分方式**: + +``` +annotation_score = 0.5 × GO 組分 + 0.3 × UniProt 組分 + 0.2 × Pathway 組分 + +其中: + GO 組分 = log₂(GO term 數量 + 1) / log₂(最大 GO term 數量 + 1) + UniProt 組分 = UniProt 分數 / 5.0 + Pathway 組分 = 若有任何 pathway 註釋則為 1,否則為 0 +``` + +--- + +### 3. 組織表達特異性 + +**生物學問題**:此基因是否特異性地表達在 Usher 症候群相關的組織(視網膜、內耳)? + +**資料來源**: +- **HPA**(Human Protein Atlas)v23 — 組織層級 RNA 表達量(TPM) +- **GTEx** v8 — 54 個人體組織的大量 RNA 定序資料 +- **CellxGene** — 單細胞 RNA 定序(光受器細胞、毛細胞群體) + +**科學依據**: +Usher 症候群影響視網膜光受器細胞與耳蝸毛細胞。若一個基因在這些組織中高度富集表達,暗示其在這些組織具有特化功能,是值得關注的候選基因。 + +**關鍵指標**: + +1. **Tau 組織特異性指數**(τ):衡量一個基因在各組織間的表達均勻程度 + - τ = 0:遍在表達(housekeeping gene) + - τ = 1:高度組織特異性 + - 公式:`τ = Σ(1 − xᵢ/x_max) / (n − 1)` + +2. **Usher 組織富集度**:目標組織(視網膜、小腦、光受器、毛細胞)的平均表達量,除以所有組織的整體平均表達量。比值 > 1 代表在目標組織富集。 + +**評分方式**: + +``` +expression_score = 0.4 × 富集度百分位數 + 0.3 × τ 特異性 + 0.3 × 目標組織最大表達百分位數 +``` + +使用百分位數正規化,避免絕對 TPM 值因定序深度差異而產生偏差。 + +--- + +### 4. 亞細胞定位 + +**生物學問題**:此蛋白質是否定位在纖毛、中心體或基體等與 Usher 病理機轉相關的亞細胞結構? + +**資料來源**: +- **HPA 亞細胞定位資料** — 免疫螢光顯微鏡 +- **纖毛蛋白質體學** — 已發表的纖毛質譜資料集(CiliaCarta 等) +- **中心體蛋白質體學** — 已發表的中心體質譜資料集 + +**科學依據**: +已知的 Usher 蛋白(MYO7A、USH1C、CDH23 等)都是纖毛或纖毛周邊蛋白。蛋白質定位在纖毛/基體/中心體,是參與纖毛病變途徑的**強力機制性證據**。 + +**評分方式**: + +| 定位區域 | 基礎分數 | +|---------|---------| +| 纖毛、中心體、基體、轉換區、靜纖毛 | 1.0 | +| 細胞骨架、微管、細胞連接 | 0.5 | +| 僅出現在蛋白質體學資料集 | 0.3 | +| 有定位資料但與纖毛無關 | 0.0 | +| 無定位資料 | NULL | + +**證據權重**: +- 實驗證據(HPA Enhanced/Supported 可靠度、蛋白質體學):× 1.0 +- 計算預測(HPA Approved/Uncertain 可靠度):× 0.6 + +``` +localization_score = 纖毛接近度基礎分數 × 證據類型權重 +``` + +--- + +### 5. 動物模型表型 + +**生物學問題**:在小鼠或斑馬魚中,此基因的直系同源物(ortholog)被破壞時,是否會產生感覺或纖毛相關的表型? + +**資料來源**: +- **HCOP**(HUGO Gene Nomenclature Committee Ortholog Predictions)— 人-鼠/人-魚直系同源物對應 +- **MGI**(Mouse Genome Informatics)— 小鼠基因剔除表型(MP ontology) +- **ZFIN**(Zebrafish Information Network)— 斑馬魚突變表型 +- **IMPC**(International Mouse Phenotyping Consortium)— 系統性基因剔除篩選 + +**科學依據**: +跨物種的感覺表型保守性提供了**功能驗證**。小鼠的聽力/平衡缺陷及斑馬魚的側線系統缺陷,都能重現人類 Usher 症候群的病理特徵。 + +**表型關鍵字篩選**: +- 小鼠:hearing、vision、retina、photoreceptor、cochlea、stereocilia、cilia、vestibular、balance +- 斑馬魚:hearing、ear、otic、otolith、lateral line、hair cell、retina、vision、eye + +**評分方式**: + +``` +animal_model_score = + 0.4 × 小鼠分數 × 小鼠信心度 + + 0.3 × 斑馬魚分數 × 斑馬魚信心度 + + 0.3 × IMPC 額外加分 + +信心度權重:HIGH = 1.0, MEDIUM = 0.7, LOW = 0.4 +表型數量縮放:log₂(感覺表型數量 + 1) / log₂(最大數量 + 1) +``` + +使用對數縮放防止註釋數量過多的基因主導排名(報酬遞減效應)。 + +--- + +### 6. 文獻探勘 + +**生物學問題**:此基因在科學文獻中是否被提及與纖毛或感覺功能相關?文獻證據的品質如何? + +**資料來源**:NCBI PubMed(透過 E-utilities API) + +**科學依據**: +文獻證據反映了累積的科學知識,但存在**研究偏差**(study bias)——明星基因(如 TP53、BRCA1)擁有大量文獻,但這不代表它們與纖毛相關。本層的正規化策略特別針對這個問題進行校正。 + +**證據品質分級**: + +| 等級 | 說明 | 權重 | +|------|------|------| +| direct_experimental | 基因剔除/突變 + 纖毛/感覺語境 | 1.0 | +| functional_mention | 纖毛/感覺語境 + ≥3 篇文獻 | 0.6 | +| hts_hit | 高通量篩選命中 + 纖毛/感覺語境 | 0.3 | +| incidental | 有文獻但無纖毛/感覺語境 | 0.1 | +| none | 無文獻 | 0.0 | + +**研究偏差校正**: + +``` +raw_score = (語境分數 × 品質權重) / log₂(PubMed 總文獻數 + 1) +literature_score = percentile_rank(raw_score) +``` + +除以 `log₂(總文獻數)` 的關鍵意義:一個基因若在 50 篇文獻中有 5 篇提及纖毛,會比在 100,000 篇文獻中有 5 篇提及纖毛得到更高的分數。這鼓勵發現被忽略但有線索的基因。 + +--- + +## 綜合評分與分層 + +### 加權綜合分數 + +六個證據層以可調整的權重合併為單一綜合分數: + +| 證據層 | 預設權重 | 生物學意義 | +|--------|---------|-----------| +| gnomAD 約束性 | 20% | 基因的功能必要性 | +| 組織表達 | 20% | 在目標組織的特異性表達 | +| 功能註釋 | 15% | 已知的功能特徵 | +| 亞細胞定位 | 15% | 纖毛結構的接近程度 | +| 動物模型 | 15% | 跨物種的功能驗證 | +| 文獻探勘 | 15% | 文獻中的纖毛/感覺關聯 | + +**NULL-aware 加權平均**: + +``` +composite_score = Σ(scoreᵢ × weightᵢ) / Σ(weightᵢ) + ─── 僅計算非 NULL 的層面 ─── +``` + +例如:某基因只有 gnomAD(0.8)、表達(0.6)和定位(0.9)三層有資料: + +``` +composite_score = (0.8×0.20 + 0.6×0.20 + 0.9×0.15) / (0.20 + 0.20 + 0.15) + = 0.415 / 0.55 + = 0.755 +``` + +### 信心分層 + +| 層級 | 條件 | 意義 | +|------|------|------| +| **HIGH** | 綜合分數 ≥ 0.7 且 ≥ 3 層有資料 | 高優先候選基因,建議進入實驗驗證 | +| **MEDIUM** | 綜合分數 ≥ 0.4 且 ≥ 2 層有資料 | 中等證據,值得進一步文獻調查 | +| **LOW** | 綜合分數 ≥ 0.2 | 證據薄弱,需要更多資料 | +| EXCLUDED | 低於以上條件 | 排除,不列入候選清單 | + +### 品質旗標 + +| 旗標 | 條件 | 說明 | +|------|------|------| +| sufficient_evidence | ≥ 4 層有分數 | 資料涵蓋充分 | +| moderate_evidence | ≥ 2 層有分數 | 部分涵蓋 | +| sparse_evidence | ≥ 1 層有分數 | 資料稀疏 | +| no_evidence | 0 層有分數 | 完全無資料 | + +--- + +## 驗證機制 + +管線內建正控制驗證,確認評分系統的有效性: + +### 正控制基因集 + +1. **OMIM Usher 基因**(10 個):MYO7A、USH1C、CDH23、PCDH15、USH1G、CIB2、USH2A、ADGRV1、WHRN、CLRN1 +2. **SYSCILIA SCGS v2 核心纖毛基因**(28 個):IFT88、IFT140、BBS1、CEP290、RPGR 等 + +### 驗證標準 + +- 已知基因的中位百分位排名應 ≥ 75%(前四分之一) +- 前 10% 候選基因應包含 > 70% 的已知基因(Recall@10%) + +如果驗證未通過,表示權重設定或資料品質可能有問題,需要檢視並調整。 + +--- + +## 產出檔案說明 + +所有結果儲存在 `data/report/` 目錄: + +| 檔案 | 說明 | 適用對象 | +|------|------|---------| +| `candidates.tsv` | 候選基因清單(Tab 分隔) | 用 Excel 或文字編輯器開啟 | +| `candidates.parquet` | 同上,高效能二進位格式 | R(`arrow::read_parquet`)或 Python(`polars.read_parquet`) | +| `score_distribution.png` | 分數分佈直方圖 | 快速了解整體分佈 | +| `evidence_coverage.png` | 各證據層覆蓋率圖 | 了解哪些層資料最完整 | +| `tier_distribution.png` | HIGH/MEDIUM/LOW 比例圓餅圖 | 快速概覽 | +| `reproducibility.md` | 完整的參數、資料版本與軟體環境記錄 | 論文方法學 / 重現分析 | + +### candidates.tsv 欄位說明 + +**核心欄位**: +- `gene_id` — Ensembl 基因 ID(如 ENSG00000154229) +- `gene_symbol` — HGNC 基因符號(如 MYO7A) +- `composite_score` — 綜合分數(0–1) +- `confidence_tier` — 信心層級(HIGH / MEDIUM / LOW) +- `evidence_count` — 有非 NULL 分數的證據層數量(0–6) + +**各層分數**(0–1,NULL 表示無資料): +- `gnomad_score` — 約束性分數 +- `expression_score` — 表達特異性分數 +- `annotation_score` — 功能註釋分數 +- `localization_score` — 亞細胞定位分數 +- `animal_model_score` — 動物模型表型分數 +- `literature_score` — 文獻證據分數 + +**輔助欄位**: +- `supporting_layers` — 有分數的層面清單(如 `gnomad,expression,localization`) +- `evidence_gaps` — 缺少資料的層面清單 +- `quality_flag` — 品質旗標 + +--- + +## 設定檔調整 + +管線設定檔位於 `config/default.yaml`: + +```yaml +# 資料版本 +versions: + ensembl_release: 113 + gnomad_version: v4.1 + gtex_version: v8 + hpa_version: "23.0" + +# 評分權重(可依研究需求調整,總和必須為 1.0) +scoring: + gnomad: 0.20 + expression: 0.20 + annotation: 0.15 + localization: 0.15 + animal_model: 0.15 + literature: 0.15 + +# API 設定 +api: + rate_limit_per_second: 5 + max_retries: 5 + timeout_seconds: 30 +``` + +若要調整權重(例如更重視組織表達),修改 `scoring` 區塊後重新執行 `usher-pipeline score` 與 `usher-pipeline report` 即可。 + +--- + +## 已知限制 + +| 限制 | 說明 | 影響 | +|------|------|------| +| GTEx v8 無視網膜組織 | GTEx v8 不包含 "Eye - Retina" 組織 | 視網膜表達資料僅來自 HPA | +| HPA 基因符號比對落差 | HPA 使用 gene symbol,管線以 gene ID 為主鍵 | 部分基因的 HPA 表達資料可能遺失 | +| gnomAD 轉錄本層級 ID | gnomAD 使用轉錄本 ID,非基因層級 | 部分基因 JOIN 時可能產生 NaN | +| 文獻探勘速度 | NCBI API 限制,~8 基因/分鐘 | 完整執行需要較長時間;可使用 API key 加速 | +| 單一寫入限制 | DuckDB 不支援多行程同時寫入 | 請勿同時執行兩個管線步驟 | +| 內耳資料稀缺 | 人類耳蝸組織的大量轉錄體資料有限 | 以小腦(含纖毛)及 CellxGene 毛細胞資料作為替代 | + +--- + +## 引用資料庫 + +本管線整合以下公開資料庫與工具: + +- **gnomAD** v4.1 — Karczewski et al. (2020) *Nature* 581:434-443 +- **Human Protein Atlas** v23 — Uhlén et al. (2015) *Science* 347:1260419 +- **GTEx** v8 — GTEx Consortium (2020) *Science* 369:1318-1330 +- **CellxGene** — Chan Zuckerberg Initiative single-cell atlas +- **Gene Ontology** — Gene Ontology Consortium (2021) *Nucleic Acids Res* 49:D325-D334 +- **UniProt** — UniProt Consortium (2023) *Nucleic Acids Res* 51:D523-D531 +- **MGI** — Mouse Genome Informatics, The Jackson Laboratory +- **ZFIN** — Zebrafish Information Network +- **IMPC** — International Mouse Phenotyping Consortium +- **SYSCILIA SCGS v2** — van Dam et al. (2021) *Mol Biol Cell* 32:br6 +- **mygene.info** — Xin et al. (2016) *Genome Biol* 17:91 +- **NCBI E-utilities** — PubMed literature search API + +--- + +## 授權 + +MIT License