docs: update gitignore to track report results, add README cross-links

- Revised .gitignore to ignore raw data/cache but track data/report/
  (candidates TSV/Parquet, plots, reproducibility metadata)
- Added zh↔en cross-links between README.md and README.en.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 05:57:27 +08:00
parent 674a9ae845
commit dc36730cb4
11 changed files with 18283 additions and 10 deletions

.gitignore (37 changed lines)

@@ -1,11 +1,24 @@
-# Data files
-data/
+# === Raw data & intermediate files ===
+data/cache/
+data/gnomad/
+data/annotation/
+data/expression/
+data/localization/
+data/animal_models/
+data/scoring/
 
-# DuckDB
+# DuckDB database (large, regeneratable)
 *.duckdb
 *.duckdb.wal
 
-# Python
+# Provenance JSON (regenerated on each run)
+*.provenance.json
+*.provenance.provenance.json
+
+# === Keep report results ===
+# data/report/ is NOT ignored — candidates.tsv, parquet, plots are tracked
+
+# === Python ===
 __pycache__/
 *.pyc
 *.pyo
@@ -18,23 +31,27 @@ dist/
 build/
 .eggs/
 
-# Testing
+# === Testing ===
 .pytest_cache/
 .coverage
 htmlcov/
 .tox/
 
-# IDE
+# === IDE ===
 .vscode/
 .idea/
 *.swp
 *.swo
 *~
 
-# Provenance files (not in data/)
-/*.provenance.json
-
-# Virtual environment
+# === Virtual environment ===
 .venv/
 venv/
 env/
+
+# === OS ===
+.DS_Store
+Thumbs.db
+
+# === Planning (internal) ===
+.planning/
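
Reviewer note: the effect of the new data rules can be sketched in a few lines of Python. This is a simplified approximation that handles only the two pattern shapes used above (root-anchored directory rules such as `data/cache/`, and slash-free globs such as `*.duckdb`). It is not a full gitignore implementation, and the example paths other than `data/report/candidates.tsv` are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the new .gitignore (data and provenance rules only).
PATTERNS = [
    "data/cache/", "data/gnomad/", "data/annotation/", "data/expression/",
    "data/localization/", "data/animal_models/", "data/scoring/",
    "*.duckdb", "*.duckdb.wal",
    "*.provenance.json", "*.provenance.provenance.json",
]


def is_ignored(path: str, patterns=PATTERNS) -> bool:
    """Simplified check; real gitignore matching has more rules than this."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory rule with an inner slash: anchored at the repository
            # root, so it covers everything below that directory.
            prefix = tuple(pattern.rstrip("/").split("/"))
            if len(parts) > len(prefix) and parts[: len(prefix)] == prefix:
                return True
        elif fnmatchcase(parts[-1], pattern):
            # Slash-free glob: compared against the file name at any depth.
            return True
    return False


# Hypothetical paths, except data/report/candidates.tsv from this commit.
for example in ("data/cache/download.bin", "data/report/candidates.tsv", "pipeline.duckdb"):
    print(example, is_ignored(example))
```

In a real checkout, `git check-ignore -v <path>` is the authoritative way to see which rule applies to a path.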


@@ -1,5 +1,7 @@
 # Usher Cilia Candidate Gene Discovery Pipeline
+
+> **[中文版 (README.md)](README.md)**
 
 A reproducible bioinformatics pipeline for systematic screening of candidate genes associated with Usher syndrome and ciliopathies.
 
 This pipeline evaluates approximately 22,600 human protein-coding genes across six independent evidence layers, producing weighted composite scores and tiered candidate gene lists for downstream experimental validation.


@@ -1,5 +1,7 @@
 # Usher Cilia Candidate Gene Discovery Pipeline
+
+> **[English version (README.en.md)](README.en.md)**
 
 一套可重現的生物資訊分析管線,用於系統性篩選與 Usher 症候群及纖毛病變ciliopathies相關的候選基因。
 
 本管線對人類約 22,600 個蛋白質編碼基因,透過六個獨立的證據層面進行評分與排序,最終產出分層候選基因清單,供後續實驗驗證參考。

Binary file not shown.


@@ -0,0 +1,33 @@
generated_at: '2026-02-12T18:42:59.932245+00:00'
output_files:
- candidates.tsv
- candidates.parquet
statistics:
total_candidates: 18116
high_count: 0
medium_count: 2151
low_count: 15965
column_count: 22
column_names:
- gene_id
- gene_symbol
- gnomad_score
- expression_score
- annotation_score
- localization_score
- animal_model_score
- literature_score
- evidence_count
- available_weight
- weighted_sum
- composite_score
- quality_flag
- gnomad_contribution
- expression_contribution
- annotation_contribution
- localization_contribution
- animal_model_contribution
- literature_contribution
- confidence_tier
- supporting_layers
- evidence_gaps
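
Reviewer note: the metadata above lists 22 column names, and `candidates.tsv` is reported with 18117 added lines, which would be consistent with one header row plus 18116 candidate rows. A minimal sketch of how that could be checked, assuming the TSV starts with a header row whose names match `column_names` (the helper names here are hypothetical and not part of the pipeline):

```python
import csv
import io

# Column names copied from the metadata file in this commit.
EXPECTED_COLUMNS = [
    "gene_id", "gene_symbol", "gnomad_score", "expression_score",
    "annotation_score", "localization_score", "animal_model_score",
    "literature_score", "evidence_count", "available_weight",
    "weighted_sum", "composite_score", "quality_flag",
    "gnomad_contribution", "expression_contribution",
    "annotation_contribution", "localization_contribution",
    "animal_model_contribution", "literature_contribution",
    "confidence_tier", "supporting_layers", "evidence_gaps",
]


def check_tsv(tsv_text: str, expected_columns=EXPECTED_COLUMNS):
    """Return (header_ok, data_row_count) for tab-separated text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    data_rows = sum(1 for _ in reader)
    return header == expected_columns, data_rows


# Tiny synthetic stand-in for the real file: a header plus two empty rows.
sample = "\t".join(EXPECTED_COLUMNS) + "\n" + ("\t" * 21 + "\n") * 2
print(len(EXPECTED_COLUMNS), check_tsv(sample))
```

Against the real file, one would pass the contents of `data/report/candidates.tsv` and compare the row count with `total_candidates`.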

data/report/candidates.tsv (new file, 18117 added lines)

File diff suppressed because it is too large.

Binary file not shown (image, 112 KiB).

Binary file not shown (image, 80 KiB).

Binary file not shown (image, 88 KiB).


@@ -0,0 +1,57 @@
{
  "run_id": "5f00f9da-e548-4a58-b1b3-028d05c94d32",
  "timestamp": "2026-02-12T18:43:00.223842+00:00",
  "pipeline_version": "0.1.0",
  "parameters": {
    "gnomad": 0.2,
    "expression": 0.2,
    "annotation": 0.15,
    "localization": 0.15,
    "animal_model": 0.15,
    "literature": 0.15
  },
  "data_versions": {
    "ensembl_release": 113,
    "gnomad_version": "v4.1",
    "gtex_version": "v8",
    "hpa_version": "23.0"
  },
  "software_environment": {
    "python": "3.14.3",
    "polars": "1.38.1",
    "duckdb": "1.4.4"
  },
  "filtering_steps": [
    {
      "step_name": "load_scored_genes",
      "input_count": 0,
      "output_count": 0,
      "criteria": ""
    },
    {
      "step_name": "apply_tier_classification",
      "input_count": 0,
      "output_count": 0,
      "criteria": ""
    },
    {
      "step_name": "write_candidate_output",
      "input_count": 0,
      "output_count": 0,
      "criteria": ""
    },
    {
      "step_name": "generate_visualizations",
      "input_count": 0,
      "output_count": 0,
      "criteria": ""
    }
  ],
  "validation_metrics": {},
  "tier_statistics": {
    "total": 18116,
    "high": 0,
    "medium": 2151,
    "low": 15965
  }
}
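
Reviewer note: two internal consistency properties of this provenance record are easy to check mechanically: the six scoring weights sum to 1.0, and the tier counts add up to the total (0 + 2151 + 15965 = 18116). A minimal sketch, with the relevant values copied from the JSON above; the function names are hypothetical and not part of the pipeline:

```python
import json
import math

# Relevant subset of the provenance JSON, copied from this commit.
PROVENANCE = json.loads("""
{
  "parameters": {
    "gnomad": 0.2, "expression": 0.2, "annotation": 0.15,
    "localization": 0.15, "animal_model": 0.15, "literature": 0.15
  },
  "tier_statistics": {"total": 18116, "high": 0, "medium": 2151, "low": 15965}
}
""")


def weights_sum_to_one(parameters: dict) -> bool:
    # Compare with a tolerance because the weights are binary floats.
    return math.isclose(sum(parameters.values()), 1.0, abs_tol=1e-9)


def tiers_sum_to_total(tiers: dict) -> bool:
    return tiers["high"] + tiers["medium"] + tiers["low"] == tiers["total"]


print(weights_sum_to_one(PROVENANCE["parameters"]))
print(tiers_sum_to_total(PROVENANCE["tier_statistics"]))
```

The same pattern could be extended to flag `filtering_steps` entries whose `input_count` and `output_count` are both 0, as all four are in this run.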


@@ -0,0 +1,45 @@
# Pipeline Reproducibility Report

**Run ID:** `5f00f9da-e548-4a58-b1b3-028d05c94d32`
**Timestamp:** 2026-02-12T18:43:00.223842+00:00
**Pipeline Version:** 0.1.0

## Parameters

**Scoring Weights:**

- gnomAD: 0.20
- Expression: 0.20
- Annotation: 0.15
- Localization: 0.15
- Animal Model: 0.15
- Literature: 0.15

## Data Versions

- **ensembl_release:** 113
- **gnomad_version:** v4.1
- **gtex_version:** v8
- **hpa_version:** 23.0

## Software Environment

- **python:** 3.14.3
- **polars:** 1.38.1
- **duckdb:** 1.4.4

## Filtering Steps

| Step | Input Count | Output Count | Criteria |
|------|-------------|--------------|----------|
| load_scored_genes | 0 | 0 | |
| apply_tier_classification | 0 | 0 | |
| write_candidate_output | 0 | 0 | |
| generate_visualizations | 0 | 0 | |

## Tier Statistics

- **Total Candidates:** 18116
- **HIGH:** 0
- **MEDIUM:** 2151
- **LOW:** 15965