usher-exploring/.planning/phases/01-data-infrastructure/01-02-PLAN.md

---
phase: 01-data-infrastructure
plan: 02
type: execute
wave: 2
depends_on: ["01-01"]
files_modified:
  - src/usher_pipeline/gene_mapping/__init__.py
  - src/usher_pipeline/gene_mapping/universe.py
  - src/usher_pipeline/gene_mapping/mapper.py
  - src/usher_pipeline/gene_mapping/validator.py
  - tests/test_gene_mapping.py
autonomous: true

must_haves:
  truths:
    - "Gene universe contains only human protein-coding genes from Ensembl, excluding pseudogenes"
    - "Gene IDs map between Ensembl, HGNC symbols, and UniProt accessions via mygene batch queries"
    - "Mapping validation reports percentage successfully mapped and flags unmapped genes for review"
    - "Validation gate fails pipeline if mapping success rate drops below configurable threshold"
  artifacts:
    - path: "src/usher_pipeline/gene_mapping/universe.py"
      provides: "Gene universe definition and Ensembl protein-coding gene retrieval"
      contains: "protein_coding"
    - path: "src/usher_pipeline/gene_mapping/mapper.py"
      provides: "Batch gene ID mapping via mygene"
      contains: "class GeneMapper"
    - path: "src/usher_pipeline/gene_mapping/validator.py"
      provides: "Mapping validation gates with success rate reporting"
      contains: "class MappingValidator"
  key_links:
    - from: "src/usher_pipeline/gene_mapping/mapper.py"
      to: "mygene"
      via: "MyGeneInfo client for batch ID queries"
      pattern: "mygene\\.MyGeneInfo"
    - from: "src/usher_pipeline/gene_mapping/validator.py"
      to: "src/usher_pipeline/gene_mapping/mapper.py"
      via: "validates mapping results from GeneMapper"
      pattern: "MappingResult"
    - from: "src/usher_pipeline/gene_mapping/universe.py"
      to: "src/usher_pipeline/config/schema.py"
      via: "reads Ensembl release from PipelineConfig"
      pattern: "ensembl_release"
---

<objective>
Build the gene ID mapping system that defines the gene universe (human protein-coding genes from Ensembl) and provides validated mapping between Ensembl gene IDs, HGNC symbols, and UniProt accessions.

Purpose: Gene ID mapping is the foundation of the entire pipeline. Every downstream evidence layer retrieves data using gene identifiers, so reliable mapping between ID systems (Ensembl, HGNC, UniProt) with validation gates is critical. The gene universe definition (INFRA-01) ensures we work with the correct set of ~20,000 protein-coding genes.

Output: Gene mapping module with universe definition, batch mapper, and validation gates.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-data-infrastructure/01-RESEARCH.md
@.planning/phases/01-data-infrastructure/01-01-SUMMARY.md
</context>

<tasks>

<task type="auto">
  <name>Task 1: Define gene universe and build batch ID mapper</name>
  <files>
    src/usher_pipeline/gene_mapping/__init__.py
    src/usher_pipeline/gene_mapping/universe.py
    src/usher_pipeline/gene_mapping/mapper.py
  </files>
  <action>
    1. Create `src/usher_pipeline/gene_mapping/universe.py`:
       - `fetch_protein_coding_genes(ensembl_release: int = 113) -> list[str]`: Use mygene.MyGeneInfo() to query all human protein-coding genes. Query with `mg.query('type_of_gene:protein-coding', species=9606, fields='ensembl.gene,symbol,name', fetch_all=True)`. Extract Ensembl gene IDs (ENSG format); the `ensembl` field may be a single dict or a list of dicts when a gene maps to several Ensembl records, so handle both. Keep only IDs that start with "ENSG". mygene has no option for pinning an Ensembl release, so treat `ensembl_release` as provenance metadata (e.g. for logging), not as a query filter.
       - Validate that the gene count falls in the expected range of 19,000-22,000. Log a warning if it falls outside this range (may indicate pseudogene contamination or missing data).
       - Return deduplicated sorted list of Ensembl gene IDs.
       - Add type alias `GeneUniverse = list[str]` for clarity.

    2. Create `src/usher_pipeline/gene_mapping/mapper.py`:
       - Define `MappingResult` dataclass: ensembl_id (str), hgnc_symbol (str | None), uniprot_accession (str | None), mapping_source (str = "mygene").
       - Define `MappingReport` dataclass: total_genes (int), mapped_hgnc (int), mapped_uniprot (int), unmapped_ids (list[str]), success_rate_hgnc (float), success_rate_uniprot (float).
       - `class GeneMapper`:
         - Constructor takes batch_size (int, default 1000) for chunked queries.
         - `map_ensembl_ids(ensembl_ids: list[str]) -> tuple[list[MappingResult], MappingReport]`: Use mygene.MyGeneInfo().querymany() with scopes='ensembl.gene', fields='symbol,uniprot.Swiss-Prot', species=9606, returnall=True. Process in batches of batch_size. For each result, extract symbol and UniProt Swiss-Prot accession (handle both string and list cases for uniprot — take first if list). Build MappingResult for each gene. Compute MappingReport with success rates.
         - Handle edge cases: 'notfound' in result, missing 'symbol' key, uniprot being nested dict vs string, duplicate query results (take first non-null).

    3. Create `src/usher_pipeline/gene_mapping/__init__.py` exporting GeneMapper, MappingResult, MappingReport, fetch_protein_coding_genes.
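
    The hit-parsing edge cases above can be sketched as follows. This is a minimal illustration, not the required implementation: the helper names `extract_ensembl_gene_ids` and `parse_mapping_hit` are hypothetical, and the sketch works on plain dicts shaped like mygene hits, so it runs without network access.

```python
from __future__ import annotations

from typing import Any, Iterable


def extract_ensembl_gene_ids(hits: Iterable[dict[str, Any]]) -> list[str]:
    """Collect unique ENSG IDs from query hits, sorted for determinism.

    Hypothetical helper for universe.py.
    """
    ids: set[str] = set()
    for hit in hits:
        ensembl = hit.get("ensembl")
        if ensembl is None:
            continue
        # One Ensembl record arrives as a dict, several as a list of dicts.
        entries = ensembl if isinstance(ensembl, list) else [ensembl]
        for entry in entries:
            gene_id = entry.get("gene") if isinstance(entry, dict) else None
            if isinstance(gene_id, str) and gene_id.startswith("ENSG"):
                ids.add(gene_id)
    return sorted(ids)


def parse_mapping_hit(hit: dict[str, Any]) -> tuple[str | None, str | None]:
    """Return (hgnc_symbol, swiss_prot_accession) for one querymany hit.

    Hypothetical helper for mapper.py.
    """
    if hit.get("notfound"):
        return None, None
    symbol = hit.get("symbol")
    uniprot = hit.get("uniprot")
    # 'uniprot' is usually a dict keyed by 'Swiss-Prot'; tolerate a bare string.
    accession = uniprot.get("Swiss-Prot") if isinstance(uniprot, dict) else uniprot
    if isinstance(accession, list):
        # Several accessions: take the first, as specified above.
        accession = accession[0] if accession else None
    if not isinstance(accession, str):
        accession = None
    return symbol, accession
```

    Keeping the parsing in pure functions like these lets the tests in Task 2 exercise every edge case with literal dicts, leaving only the `querymany` call itself to be mocked.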
  </action>
  <verify>
    cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.gene_mapping.mapper import GeneMapper, MappingResult, MappingReport
from usher_pipeline.gene_mapping.universe import fetch_protein_coding_genes
# Verify imports work and classes are defined
print('MappingResult fields:', [f.name for f in MappingResult.__dataclass_fields__.values()] if hasattr(MappingResult, '__dataclass_fields__') else 'OK')
print('GeneMapper has map_ensembl_ids:', hasattr(GeneMapper, 'map_ensembl_ids'))
"
  </verify>
  <done>
    - universe.py fetches protein-coding genes via mygene and validates count in 19k-22k range
    - mapper.py provides batch mapping from Ensembl to HGNC+UniProt with MappingResult/MappingReport dataclasses
    - All classes import cleanly
  </done>
</task>

<task type="auto">
  <name>Task 2: Create mapping validation gates with tests</name>
  <files>
    src/usher_pipeline/gene_mapping/validator.py
    tests/test_gene_mapping.py
  </files>
  <action>
    1. Create `src/usher_pipeline/gene_mapping/validator.py`:
       - `class MappingValidator`:
         - Constructor takes min_success_rate (float, default 0.90), warn_threshold (float, default 0.95).
         - `validate(report: MappingReport) -> ValidationResult`: Returns a ValidationResult (defined below). Fails if the HGNC success rate is below min_success_rate. Passes with a warning if the HGNC success rate is at least min_success_rate but below warn_threshold.
         - `save_unmapped_report(report: MappingReport, output_path: Path)`: Writes unmapped gene IDs to a text file for manual review, one per line, with a header comment showing the timestamp and total count.
       - `class ValidationResult`: dataclass with passed (bool), messages (list[str]), hgnc_rate (float), uniprot_rate (float), unmapped_summary (dict). Give the rate and summary fields defaults so `validate_gene_universe` can return the same type.
       - `validate_gene_universe(genes: list[str]) -> ValidationResult`: Checks gene count in range, all IDs start with ENSG, no duplicates. Returns ValidationResult.

    2. Create `tests/test_gene_mapping.py` with pytest tests. Use mocked mygene responses (do NOT call real API in tests):
       - test_mapping_result_creation: create MappingResult with all fields, verify attributes
       - test_mapper_handles_successful_mapping: mock mygene.querymany to return successful results for 3 genes (with symbol and uniprot), verify MappingResult list and MappingReport success rates = 1.0
       - test_mapper_handles_unmapped_genes: mock mygene.querymany to return 'notfound' for 1 of 3 genes, verify unmapped_ids contains the missing gene and the success rate is approximately 0.667 (2 of 3; compare with pytest.approx)
       - test_mapper_handles_uniprot_list: mock result where uniprot.Swiss-Prot is a list of accessions, verify first is taken
       - test_validator_passes_high_rate: MappingReport with 95% success rate passes with min_success_rate=0.90
       - test_validator_fails_low_rate: MappingReport with 80% success rate fails with min_success_rate=0.90
       - test_validator_warns_medium_rate: MappingReport with 92% passes but includes warning
       - test_validate_gene_universe_valid: list of 20000 ENSG IDs passes
       - test_validate_gene_universe_invalid_count: list of 50000 genes fails (too many, likely includes non-coding)
       - test_save_unmapped_report: verify unmapped genes written to file (use tmp_path)

    Mock mygene using unittest.mock.patch on the mygene.MyGeneInfo class. Create fixture with realistic mygene response format:
    ```python
    MOCK_MYGENE_RESPONSE = {
        'out': [
            {'query': 'ENSG00000139618', 'symbol': 'BRCA2', 'uniprot': {'Swiss-Prot': 'P51587'}},
            {'query': 'ENSG00000141510', 'symbol': 'TP53', 'uniprot': {'Swiss-Prot': 'P04637'}},
            {'query': 'ENSG00000000000', 'notfound': True}
        ],
        'dup': [],
        'missing': ['ENSG00000000000']
    }
    ```
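
    The threshold logic for the validation gate can be sketched as below. This is a hedged illustration under stated assumptions: `MappingReport` is redefined locally so the snippet is self-contained, the `ValidationResult` field set is the union of the fields named in this plan, and the `unmapped_summary` keys (`count`, `examples`) and message wording are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MappingReport:
    """Local stand-in with the MappingReport fields listed in Task 1."""
    total_genes: int
    mapped_hgnc: int
    mapped_uniprot: int
    unmapped_ids: list[str]
    success_rate_hgnc: float
    success_rate_uniprot: float


@dataclass
class ValidationResult:
    # Assumption: defaults let validate_gene_universe reuse this type.
    passed: bool
    messages: list[str] = field(default_factory=list)
    hgnc_rate: float = 0.0
    uniprot_rate: float = 0.0
    unmapped_summary: dict = field(default_factory=dict)


class MappingValidator:
    def __init__(self, min_success_rate: float = 0.90, warn_threshold: float = 0.95) -> None:
        self.min_success_rate = min_success_rate
        self.warn_threshold = warn_threshold

    def validate(self, report: MappingReport) -> ValidationResult:
        rate = report.success_rate_hgnc
        messages = [f"HGNC mapped {report.mapped_hgnc}/{report.total_genes} ({rate:.1%})"]
        passed = rate >= self.min_success_rate
        if not passed:
            messages.append(f"FAIL: HGNC rate {rate:.1%} is below minimum {self.min_success_rate:.1%}")
        elif rate < self.warn_threshold:
            # Passing, but close enough to the gate to deserve attention.
            messages.append(f"WARN: HGNC rate {rate:.1%} is below warn threshold {self.warn_threshold:.1%}")
        summary = {
            "count": len(report.unmapped_ids),
            "examples": report.unmapped_ids[:5],  # hypothetical preview size
        }
        return ValidationResult(passed, messages, rate, report.success_rate_uniprot, summary)
```

    With the default thresholds, a rate of exactly 0.95 passes without a warning, 0.92 passes with a warning, and 0.80 fails, which matches the three validator tests listed above.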
  </action>
  <verify>
    cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_gene_mapping.py -v
  </verify>
  <done>
    - MappingValidator enforces configurable success rate thresholds
    - Validation gates report clear pass/fail/warn with messages
    - Unmapped gene report saves to file for review
    - All 10 gene mapping tests pass with mocked mygene responses (no real API calls)
  </done>
</task>

</tasks>

<verification>
1. `pytest tests/test_gene_mapping.py -v` -- all tests pass
2. `python -c "from usher_pipeline.gene_mapping import GeneMapper; from usher_pipeline.gene_mapping.validator import MappingValidator"` -- imports work
3. Gene mapping validation gates enforce minimum success rates and produce actionable reports
</verification>

<success_criteria>
- Gene universe function retrieves human protein-coding genes via mygene with count validation
- Batch mapper converts Ensembl IDs to HGNC symbols + UniProt accessions with proper edge case handling
- Validation gates enforce configurable success rate thresholds and produce unmapped gene reports
- All tests pass with mocked API responses
</success_criteria>

<output>
After completion, create `.planning/phases/01-data-infrastructure/01-02-SUMMARY.md`
</output>