gene_universe contains 1,539 gene_symbols that map to multiple Ensembl IDs (3,033 excess IDs). Non-canonical IDs lack data in some evidence tables, so weighted_sum/available_weight inflates their composite scores.

Fix: after the scoring SQL, keep one row per gene_symbol, choosing the row with the highest evidence_count (tiebreak by composite_score DESC). 22,604 → 19,555 genes.

Results: HIGH 82→4; all top 20 now have 5-6 layers with expression data. Validation PASSED (CDH23 98.3rd percentile, median known gene 83.3%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
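The deduplication rule described in this commit message can be sketched roughly as follows. This is a minimal, dependency-free Python illustration, not the pipeline's actual SQL step. The function name `dedupe_by_symbol`, the `gene_id` field, and the example IDs and values are hypothetical. Only `gene_symbol`, `evidence_count`, and `composite_score` come from the message.

```python
def dedupe_by_symbol(rows):
    """Keep one row per gene_symbol.

    Preference order: highest evidence_count, then highest composite_score.
    On an exact tie, the row seen first is kept.
    """
    best = {}
    for row in rows:
        rank = (row["evidence_count"], row["composite_score"])
        current = best.get(row["gene_symbol"])
        if current is None or rank > (current["evidence_count"], current["composite_score"]):
            best[row["gene_symbol"]] = row
    return list(best.values())


# Hypothetical rows: GENE1 has a sparse duplicate with an inflated score.
rows = [
    {"gene_id": "ENSG_HYPOTHETICAL_A", "gene_symbol": "GENE1", "evidence_count": 6, "composite_score": 0.61},
    {"gene_id": "ENSG_HYPOTHETICAL_B", "gene_symbol": "GENE1", "evidence_count": 2, "composite_score": 0.90},
    {"gene_id": "ENSG_HYPOTHETICAL_C", "gene_symbol": "GENE2", "evidence_count": 4, "composite_score": 0.40},
    {"gene_id": "ENSG_HYPOTHETICAL_D", "gene_symbol": "GENE2", "evidence_count": 4, "composite_score": 0.55},
]

kept = dedupe_by_symbol(rows)
for row in kept:
    print(row["gene_symbol"], row["gene_id"], row["evidence_count"], row["composite_score"])
```

Ranking by evidence_count first means the sparse, high-scoring duplicate of GENE1 loses to the row with more evidence layers, which is the behavior the fix is after. The score only decides between rows with equal evidence.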
Pipeline Reproducibility Report
Run ID: f5587e39-163b-418c-8ac2-593b47323f34
Timestamp: 2026-02-16T01:59:18.969516+00:00
Pipeline Version: 0.1.0
Parameters
Scoring Weights:
- gnomAD: 0.20
- Expression: 0.20
- Annotation: 0.15
- Localization: 0.15
- Animal Model: 0.15
- Literature: 0.15
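To show how these weights could combine into a composite score, here is a small Python sketch. It assumes the score is `weighted_sum / available_weight` computed over only the layers that have data, which is how the commit message describes it. The pipeline's real scoring lives in SQL and may differ in detail. The layer key names, the `composite_score` function, and the example values are hypothetical.

```python
# Weights as listed in the report; key names are illustrative assumptions.
WEIGHTS = {
    "gnomad": 0.20,
    "expression": 0.20,
    "annotation": 0.15,
    "localization": 0.15,
    "animal_model": 0.15,
    "literature": 0.15,
}


def composite_score(layer_scores):
    """Weighted mean over the layers that have a score (None means missing)."""
    available = {k: v for k, v in layer_scores.items() if v is not None}
    available_weight = sum(WEIGHTS[k] for k in available)
    if available_weight == 0:
        return None  # no evidence at all
    weighted_sum = sum(WEIGHTS[k] * v for k, v in available.items())
    return weighted_sum / available_weight


# Hypothetical gene with all six layers present.
full = {
    "gnomad": 0.5, "expression": 0.4, "annotation": 0.6,
    "localization": 0.5, "animal_model": 0.3, "literature": 0.9,
}
# Hypothetical duplicate ID with only its two strongest layers present.
sparse = {
    "gnomad": None, "expression": None, "annotation": 0.8,
    "localization": None, "animal_model": None, "literature": 0.9,
}

full_score = composite_score(full)
sparse_score = composite_score(sparse)
print(round(full_score, 3), round(sparse_score, 3))
```

Because the denominator shrinks with the number of missing layers, a row with only a few strong layers can outscore a row with complete evidence. That is the inflation the deduplication step guards against.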
Data Versions
- ensembl_release: 113
- gnomad_version: v4.1
- gtex_version: v8
- hpa_version: 23.0
Software Environment
- python: 3.14.3
- polars: 1.38.1
- duckdb: 1.4.4
Filtering Steps
| Step | Input Count | Output Count | Criteria |
|---|---|---|---|
| load_scored_genes | 0 | 0 | |
| apply_tier_classification | 0 | 0 | |
| write_candidate_output | 0 | 0 | |
| generate_visualizations | 0 | 0 | |
Tier Statistics
- Total Candidates: 18243
- HIGH: 4
- MEDIUM: 8051
- LOW: 10188