docs: update gitignore to track report results, add README cross-links

- Revised .gitignore to ignore raw data/cache but track data/report/ (candidates TSV/Parquet, plots, reproducibility metadata)
- Added zh↔en cross-links between README.md and README.en.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 05:57:27 +08:00
parent 674a9ae845
commit dc36730cb4
11 changed files with 18283 additions and 10 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,11 +1,24 @@
-# Data files
+# === Raw data & intermediate files ===
-data/
+data/cache/
 data/gnomad/
 data/annotation/
 data/expression/
 data/localization/
 data/animal_models/
 data/scoring/
-# DuckDB
+# DuckDB database (large, regeneratable)
 *.duckdb
 *.duckdb.wal
-# Python
+# Provenance JSON (regenerated on each run)
 *.provenance.json
 *.provenance.provenance.json
 # === Keep report results ===
 # data/report/ is NOT ignored — candidates.tsv, parquet, plots are tracked
 # === Python ===
 __pycache__/
 *.pyc
 *.pyo
@@ -18,23 +31,27 @@ dist/
 build/
 .eggs/
-# Testing
+# === Testing ===
 .pytest_cache/
 .coverage
 htmlcov/
 .tox/
-# IDE
+# === IDE ===
 .vscode/
 .idea/
 *.swp
 *.swo
 *~
-# Provenance files (not in data/)
+# === Virtual environment ===
 /*.provenance.json
 # Virtual environment
 .venv/
 venv/
 env/
 # === OS ===
 .DS_Store
 Thumbs.db
 # === Planning (internal) ===
 .planning/
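
The revised ignore rules are meant to leave everything under data/report/ tracked. The sketch below is one rough way to sanity-check that intent in plain Python. It is a simplified approximation of gitignore matching that handles only the root-anchored directory patterns and basename globs copied from this hunk; the helper name `is_ignored` and the `REPORT_FILES` list are illustrative and not part of the repository. For authoritative results, `git check-ignore -v <path>` is the tool to use.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Directory patterns copied from the .gitignore hunk (anchored at the repo root).
IGNORED_DIRS = [
    "data/cache/", "data/gnomad/", "data/annotation/", "data/expression/",
    "data/localization/", "data/animal_models/", "data/scoring/",
]
# Slash-free patterns from the same hunk; gitignore matches these against the basename.
IGNORED_BASENAMES = ["*.duckdb", "*.duckdb.wal", "*.provenance.json"]

# Files added under data/report/ by this commit.
REPORT_FILES = [
    "data/report/candidates.parquet",
    "data/report/candidates.provenance.yaml",
    "data/report/candidates.tsv",
    "data/report/plots/layer_contributions.png",
    "data/report/plots/score_distribution.png",
    "data/report/plots/tier_breakdown.png",
    "data/report/reproducibility.json",
    "data/report/reproducibility.md",
]


def is_ignored(path: str) -> bool:
    """Approximate gitignore matching for the two pattern kinds listed above."""
    if any(path.startswith(prefix) for prefix in IGNORED_DIRS):
        return True
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in IGNORED_BASENAMES)


if __name__ == "__main__":
    for path in REPORT_FILES:
        print(f"{path}: {'ignored' if is_ignored(path) else 'tracked'}")
```

One consequence worth noting: because `*.provenance.json` has no slash, it applies at every directory level, so a JSON provenance file placed under data/report/ would still be ignored. The provenance file tracked in this commit uses a `.yaml` extension and is therefore unaffected.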
--- a/README.en.md
+++ b/README.en.md
@@ -1,5 +1,7 @@
 # Usher Cilia Candidate Gene Discovery Pipeline
 > **[中文版 (README.md)](README.md)**
 A reproducible bioinformatics pipeline for systematic screening of candidate genes associated with Usher syndrome and ciliopathies.
 This pipeline evaluates approximately 22,600 human protein-coding genes across six independent evidence layers, producing weighted composite scores and tiered candidate gene lists for downstream experimental validation.
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
 # Usher Cilia Candidate Gene Discovery Pipeline
 > **[English version (README.en.md)](README.en.md)**
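 A reproducible bioinformatics analysis pipeline for systematically screening candidate genes associated with Usher syndrome and ciliopathies.
 This pipeline scores and ranks approximately 22,600 human protein-coding genes across six independent evidence layers, producing a tiered candidate gene list for reference in subsequent experimental validation.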
--- a/data/report/candidates.parquet
+++ b/data/report/candidates.parquet
--- a/data/report/candidates.provenance.yaml
+++ b/data/report/candidates.provenance.yaml
@@ -0,0 +1,33 @@
 generated_at: '2026-02-12T18:42:59.932245+00:00'
 output_files:
 - candidates.tsv
 - candidates.parquet
 statistics:
  total_candidates: 18116
  high_count: 0
  medium_count: 2151
  low_count: 15965
 column_count: 22
 column_names:
 - gene_id
 - gene_symbol
 - gnomad_score
 - expression_score
 - annotation_score
 - localization_score
 - animal_model_score
 - literature_score
 - evidence_count
 - available_weight
 - weighted_sum
 - composite_score
 - quality_flag
 - gnomad_contribution
 - expression_contribution
 - annotation_contribution
 - localization_contribution
 - animal_model_contribution
 - literature_contribution
 - confidence_tier
 - supporting_layers
 - evidence_gaps
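
The column names above (`evidence_count`, `available_weight`, `weighted_sum`, `composite_score`) suggest a weighted average taken over only the layers that have data for a gene. The following is a minimal sketch of that idea, not the pipeline's confirmed implementation. It assumes that missing layers are represented as `None`, that `composite_score = weighted_sum / available_weight`, and that the weights are those recorded in data/report/reproducibility.json. The function name `composite` is hypothetical.

```python
from typing import Optional

# Layer weights as recorded in data/report/reproducibility.json.
WEIGHTS = {
    "gnomad": 0.20,
    "expression": 0.20,
    "annotation": 0.15,
    "localization": 0.15,
    "animal_model": 0.15,
    "literature": 0.15,
}


def composite(scores: dict[str, Optional[float]]) -> dict:
    """Combine per-layer scores, skipping layers with no data (assumed formula)."""
    present = {layer: s for layer, s in scores.items() if s is not None}
    available_weight = sum(WEIGHTS[layer] for layer in present)
    weighted_sum = sum(WEIGHTS[layer] * s for layer, s in present.items())
    # Renormalise by the weight actually available so sparse genes are not
    # penalised merely for missing layers (an assumption, see lead-in).
    score = weighted_sum / available_weight if available_weight > 0 else None
    return {
        "evidence_count": len(present),
        "available_weight": available_weight,
        "weighted_sum": weighted_sum,
        "composite_score": score,
    }


if __name__ == "__main__":
    example = {
        "gnomad": 0.8, "expression": 0.6, "annotation": None,
        "localization": None, "animal_model": None, "literature": None,
    }
    print(composite(example))
```

Under this reading, a gene with evidence in only two layers can still reach a high composite score, which is presumably why `evidence_count` and `quality_flag` are carried alongside it.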
--- a/data/report/candidates.tsv
+++ b/data/report/candidates.tsv
--- a/data/report/plots/layer_contributions.png
+++ b/data/report/plots/layer_contributions.png
--- a/data/report/plots/score_distribution.png
+++ b/data/report/plots/score_distribution.png
--- a/data/report/plots/tier_breakdown.png
+++ b/data/report/plots/tier_breakdown.png
--- a/data/report/reproducibility.json
+++ b/data/report/reproducibility.json
@@ -0,0 +1,57 @@
 {
  "run_id": "5f00f9da-e548-4a58-b1b3-028d05c94d32",
  "timestamp": "2026-02-12T18:43:00.223842+00:00",
  "pipeline_version": "0.1.0",
  "parameters": {
    "gnomad": 0.2,
    "expression": 0.2,
    "annotation": 0.15,
    "localization": 0.15,
    "animal_model": 0.15,
    "literature": 0.15
  },
  "data_versions": {
    "ensembl_release": 113,
    "gnomad_version": "v4.1",
    "gtex_version": "v8",
    "hpa_version": "23.0"
  },
  "software_environment": {
    "python": "3.14.3",
    "polars": "1.38.1",
    "duckdb": "1.4.4"
  },
  "filtering_steps": [
    {
      "step_name": "load_scored_genes",
      "input_count": 0,
      "output_count": 0,
      "criteria": ""
    },
    {
      "step_name": "apply_tier_classification",
      "input_count": 0,
      "output_count": 0,
      "criteria": ""
    },
    {
      "step_name": "write_candidate_output",
      "input_count": 0,
      "output_count": 0,
      "criteria": ""
    },
    {
      "step_name": "generate_visualizations",
      "input_count": 0,
      "output_count": 0,
      "criteria": ""
    }
  ],
  "validation_metrics": {},
  "tier_statistics": {
    "total": 18116,
    "high": 0,
    "medium": 2151,
    "low": 15965
  }
 }
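
Since this file is now tracked, a few internal consistency checks can be run against it. The sketch below embeds the relevant values from the JSON above rather than reading the file, and checks three things: that the scoring weights sum to 1, that the tier counts add up to the total, and whether any filtering step records zero input and output counts even though candidates were produced. The function name `check_report` and the choice of checks are illustrative assumptions, not part of the pipeline.

```python
import math

# Values copied from data/report/reproducibility.json in this commit.
REPORT = {
    "parameters": {
        "gnomad": 0.2, "expression": 0.2, "annotation": 0.15,
        "localization": 0.15, "animal_model": 0.15, "literature": 0.15,
    },
    "filtering_steps": [
        {"step_name": "load_scored_genes", "input_count": 0, "output_count": 0, "criteria": ""},
        {"step_name": "apply_tier_classification", "input_count": 0, "output_count": 0, "criteria": ""},
        {"step_name": "write_candidate_output", "input_count": 0, "output_count": 0, "criteria": ""},
        {"step_name": "generate_visualizations", "input_count": 0, "output_count": 0, "criteria": ""},
    ],
    "tier_statistics": {"total": 18116, "high": 0, "medium": 2151, "low": 15965},
}


def check_report(report: dict) -> list[str]:
    """Return human-readable descriptions of any inconsistencies found."""
    issues = []
    weight_sum = sum(report["parameters"].values())
    if not math.isclose(weight_sum, 1.0, abs_tol=1e-9):
        issues.append(f"weights sum to {weight_sum}, expected 1.0")
    tiers = report["tier_statistics"]
    if tiers["high"] + tiers["medium"] + tiers["low"] != tiers["total"]:
        issues.append("tier counts do not add up to total")
    for step in report["filtering_steps"]:
        # Zero counts alongside a non-zero candidate total may mean the
        # step counters were never populated.
        if tiers["total"] > 0 and step["input_count"] == 0 and step["output_count"] == 0:
            issues.append(f"step {step['step_name']!r} records zero input and output counts")
    return issues


if __name__ == "__main__":
    for issue in check_report(REPORT):
        print(issue)
```

With the values in this commit, the weight and tier checks pass, and all four filtering steps are reported as having zero counts.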
--- a/data/report/reproducibility.md
+++ b/data/report/reproducibility.md
@@ -0,0 +1,45 @@
 # Pipeline Reproducibility Report
 **Run ID:** `5f00f9da-e548-4a58-b1b3-028d05c94d32`
 **Timestamp:** 2026-02-12T18:43:00.223842+00:00
 **Pipeline Version:** 0.1.0
 ## Parameters
 **Scoring Weights:**
 - gnomAD: 0.20
 - Expression: 0.20
 - Annotation: 0.15
 - Localization: 0.15
 - Animal Model: 0.15
 - Literature: 0.15
 ## Data Versions
 - **ensembl_release:** 113
 - **gnomad_version:** v4.1
 - **gtex_version:** v8
 - **hpa_version:** 23.0
 ## Software Environment
 - **python:** 3.14.3
 - **polars:** 1.38.1
 - **duckdb:** 1.4.4
 ## Filtering Steps
 | Step | Input Count | Output Count | Criteria |
 |------|-------------|--------------|----------|
 | load_scored_genes | 0 | 0 |  |
 | apply_tier_classification | 0 | 0 |  |
 | write_candidate_output | 0 | 0 |  |
 | generate_visualizations | 0 | 0 |  |
 ## Tier Statistics
 - **Total Candidates:** 18116
 - **HIGH:** 0
 - **MEDIUM:** 2151
 - **LOW:** 15965
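
The Markdown report mirrors reproducibility.json field for field, so it can be regenerated from the JSON rather than edited by hand. As a small illustration, the sketch below renders the "Filtering Steps" table from a list of step records. The function name `render_steps_table` is hypothetical, and the pipeline's actual renderer may differ.

```python
def render_steps_table(steps: list[dict]) -> str:
    """Render filtering-step records as a Markdown table."""
    lines = [
        "| Step | Input Count | Output Count | Criteria |",
        "|------|-------------|--------------|----------|",
    ]
    for step in steps:
        lines.append(
            f"| {step['step_name']} | {step['input_count']} "
            f"| {step['output_count']} | {step['criteria']} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Records copied from data/report/reproducibility.json in this commit.
    steps = [
        {"step_name": "load_scored_genes", "input_count": 0, "output_count": 0, "criteria": ""},
        {"step_name": "apply_tier_classification", "input_count": 0, "output_count": 0, "criteria": ""},
    ]
    print(render_steps_table(steps))
```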