- Gene universe definition with mygene protein-coding gene retrieval
- Batch Ensembl→HGNC+UniProt mapping with edge case handling
- Validation gates with configurable success rate thresholds
- 15 comprehensive tests with mocked API responses
| phase | plan | subsystem |
|---|---|---|
| 01-data-infrastructure | 02 | foundation |
Phase 01 Plan 02: Gene ID Mapping and Validation Summary
Gene universe definition (19k-22k protein-coding genes via mygene) with batch Ensembl→HGNC+UniProt mapping and configurable validation gates (success rate thresholds, unmapped gene reports)
Performance
- Duration: 4 minutes
- Started: 2026-02-11T08:29:05Z
- Completed: 2026-02-11T08:33:54Z
- Tasks: 2
- Files created: 5
Accomplishments
- Gene universe retrieval - `fetch_protein_coding_genes()` queries mygene for human protein-coding genes, validates that the count falls in the 19k-22k range, and returns sorted ENSG IDs
- Batch ID mapping - `GeneMapper` converts Ensembl IDs to HGNC symbols and UniProt Swiss-Prot accessions via mygene batch queries, with edge case handling (notfound, missing keys, nested structures, uniprot lists)
- Validation gates - `MappingValidator` enforces configurable success rate thresholds with pass/warn/fail logic and produces unmapped gene reports for manual review
- Comprehensive testing - 15 tests with mocked mygene responses covering successful mapping, unmapped genes, uniprot lists, batching, validation thresholds, and universe validation
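The edge cases listed above can be illustrated with a minimal sketch. It assumes hits shaped like the dictionaries returned by mygene's `querymany` (a `query` key, an optional `notfound` flag, and a nested `uniprot` field whose `Swiss-Prot` value may be a string or a list). The helper names `first_swissprot` and `parse_hit` and the sample IDs are hypothetical and are not taken from `mapper.py`.

```python
from typing import Any, Optional


def first_swissprot(uniprot_field: Any) -> Optional[str]:
    """Return one Swiss-Prot accession, or None if the field has none."""
    if not isinstance(uniprot_field, dict):
        # Missing key or unexpected structure: treat as unmapped.
        return None
    accession = uniprot_field.get("Swiss-Prot")
    if isinstance(accession, list):
        # Several Swiss-Prot entries: keep the first one.
        return accession[0] if accession else None
    return accession if isinstance(accession, str) else None


def parse_hit(hit: dict) -> dict:
    """Flatten one mygene-style hit into a simple record (hypothetical shape)."""
    ensembl_id = hit.get("query")
    if hit.get("notfound"):
        return {"ensembl_id": ensembl_id, "hgnc_symbol": None, "uniprot": None}
    return {
        "ensembl_id": ensembl_id,
        "hgnc_symbol": hit.get("symbol"),
        "uniprot": first_swissprot(hit.get("uniprot")),
    }


# Invented sample hits, one per edge case.
hits = [
    {"query": "ENSG00000000001", "symbol": "GENE1", "uniprot": {"Swiss-Prot": "P00001"}},
    {"query": "ENSG00000000002", "symbol": "GENE2", "uniprot": {"Swiss-Prot": ["P00002", "P00003"]}},
    {"query": "ENSG00000000003", "symbol": "GENE3"},
    {"query": "ENSG00000000004", "notfound": True},
]
records = [parse_hit(h) for h in hits]
```

Keeping the parsing in a pure function like this means each edge case can be tested without touching the network.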
Task Commits
Each task was committed atomically:
- Task 2: Create mapping validation gates with tests - `0200395` (feat)
Note: Task 1 files (`universe.py`, `mapper.py`, `__init__.py`) were already created in a prior execution (commit d51141f from plan 01-03). Plan 01-02 was executed retroactively to document and test these components, adding the missing validator and comprehensive tests.
Files Created/Modified
Created:
- `src/usher_pipeline/gene_mapping/__init__.py` - Module exports for GeneMapper, MappingResult, MappingReport, MappingValidator, ValidationResult, validate_gene_universe, fetch_protein_coding_genes
- `src/usher_pipeline/gene_mapping/universe.py` - Gene universe definition with mygene query, ENSG filtering, and count validation
- `src/usher_pipeline/gene_mapping/mapper.py` - Batch ID mapper with MappingResult/MappingReport dataclasses and edge case handling
- `src/usher_pipeline/gene_mapping/validator.py` - MappingValidator class and validate_gene_universe function with configurable thresholds
- `tests/test_gene_mapping.py` - 15 tests covering all mapping and validation functionality with mocked mygene API
Decisions Made
- Gene count validation warns but doesn't fail - Allows for Ensembl version variations while still flagging anomalies (rationale: 19k-22k is the expected range, but the exact count varies by release)
- HGNC success rate is the primary validation gate - UniProt mapping is tracked but not used for pass/fail decisions (rationale: HGNC symbols are more stable and universal than UniProt accessions)
- Take the first UniProt accession when multiple exist - Some genes have multiple Swiss-Prot entries; we take the first (rationale: simplifies the data model, and the first entry is typically primary)
- Mock mygene in tests - All tests use mocked API responses (rationale: avoids rate limits, ensures reproducibility, speeds up test execution)
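The first two decisions can be sketched as follows. This is an illustration only: the function names, the `GateResult` dataclass, and the 0.90 / 0.80 thresholds are assumptions chosen for the example, not the values or API used in `validator.py`.

```python
import warnings
from dataclasses import dataclass


@dataclass
class GateResult:
    status: str          # "pass", "warn", or "fail"
    hgnc_rate: float     # drives the gate
    uniprot_rate: float  # reported only, never affects status


def evaluate_gate(total: int, hgnc_mapped: int, uniprot_mapped: int,
                  pass_rate: float = 0.90, warn_rate: float = 0.80) -> GateResult:
    """Gate on the HGNC success rate; thresholds here are hypothetical defaults."""
    if total <= 0:
        raise ValueError("total must be positive")
    hgnc_rate = hgnc_mapped / total
    uniprot_rate = uniprot_mapped / total
    if hgnc_rate >= pass_rate:
        status = "pass"
    elif hgnc_rate >= warn_rate:
        status = "warn"
    else:
        status = "fail"
    return GateResult(status, hgnc_rate, uniprot_rate)


def check_gene_count(count: int, low: int = 19_000, high: int = 22_000) -> bool:
    """Warn, but do not raise, when the gene count is outside the expected range."""
    in_range = low <= count <= high
    if not in_range:
        warnings.warn(f"gene count {count} outside expected range {low}-{high}")
    return in_range
```

For example, `evaluate_gate(100, 95, 10).status` evaluates to `"pass"` even though the UniProt rate is low, because only the HGNC rate is gated.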
Deviations from Plan
None - plan executed exactly as written.
Note: The existence of Task 1 files prior to this plan execution is not a deviation from this plan - it indicates out-of-order execution. This summary documents the complete functionality as implemented, including adding the validator and tests that were missing.
Issues Encountered
None. All tests passed on first run with mocked API responses.
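A mocked-client test of this kind might look like the sketch below. `map_in_batches` is a hypothetical stand-in for the project's batching logic, not the actual `GeneMapper` code; the only real API assumed is that the client exposes mygene's `querymany(qterms, scopes=..., fields=..., species=...)` method.

```python
from unittest.mock import MagicMock


def map_in_batches(client, ensembl_ids, batch_size=1000):
    """Query the client in fixed-size batches and collect all hits."""
    hits = []
    for start in range(0, len(ensembl_ids), batch_size):
        batch = ensembl_ids[start:start + batch_size]
        hits.extend(
            client.querymany(
                batch,
                scopes="ensembl.gene",
                fields="symbol,uniprot",
                species="human",
            )
        )
    return hits


def fake_querymany(ids, **kwargs):
    # Canned response: one hit per requested ID, with an invented symbol.
    return [{"query": gene_id, "symbol": f"SYM{n}"} for n, gene_id in enumerate(ids)]


# The mock replaces the network client, so no request leaves the machine.
mock_client = MagicMock()
mock_client.querymany.side_effect = fake_querymany

ids = [f"ENSG{n:011d}" for n in range(5)]
hits = map_in_batches(mock_client, ids, batch_size=2)
```

With five IDs and a batch size of two, the mock records three calls, which is the kind of batching behavior a test can assert on without hitting rate limits.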
Next Phase Readiness
Ready for downstream phases:
- Gene universe can be fetched and validated
- Batch ID mapping handles the tested edge cases (notfound, missing keys, nested structures, lists)
- Validation gates enforce data quality thresholds
- All components fully tested with mocked API
Dependencies for evidence layers:
- Evidence layer modules will use `fetch_protein_coding_genes()` to get the gene universe
- Evidence layer modules will use `GeneMapper.map_ensembl_ids()` to convert between ID systems
- Evidence layer modules will use `MappingValidator.validate()` to enforce data quality gates
Self-Check: PASSED
Files verified:
FOUND: src/usher_pipeline/gene_mapping/__init__.py
FOUND: src/usher_pipeline/gene_mapping/universe.py
FOUND: src/usher_pipeline/gene_mapping/mapper.py
FOUND: src/usher_pipeline/gene_mapping/validator.py
FOUND: tests/test_gene_mapping.py
Commits verified:
FOUND: 0200395 (Task 2)
All files and commits exist as documented.
Phase: 01-data-infrastructure Plan: 02 Completed: 2026-02-11