docs(01-data-infrastructure): create phase plan
@@ -31,9 +31,13 @@ Decimal phases appear between their surrounding integers in numeric order.
 3. API clients retrieve data from external sources with rate limiting, retry logic, and persistent disk caching
 4. DuckDB database stores intermediate results enabling restart-from-checkpoint without re-downloading
 5. Every pipeline output includes provenance metadata: pipeline version, data source versions, timestamps, config hash
-**Plans**: TBD
+**Plans**: 4 plans
 
-Plans: (to be created during plan-phase)
+Plans:
+- [ ] 01-01-PLAN.md -- Project scaffold, config system, and base API client
+- [ ] 01-02-PLAN.md -- Gene ID mapping with validation gates
+- [ ] 01-03-PLAN.md -- DuckDB persistence and provenance tracking
+- [ ] 01-04-PLAN.md -- CLI integration and end-to-end wiring
 
 ### Phase 2: Prototype Evidence Layer
 **Goal**: Validate retrieval-to-storage pattern with single evidence layer

@@ -112,7 +116,7 @@ Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6
 
 | Phase | Plans Complete | Status | Completed |
 |-------|----------------|--------|-----------|
-| 1. Data Infrastructure | 0/TBD | Not started | - |
+| 1. Data Infrastructure | 0/4 | Planning complete | - |
 | 2. Prototype Evidence Layer | 0/TBD | Not started | - |
 | 3. Core Evidence Layers | 0/TBD | Not started | - |
 | 4. Scoring & Integration | 0/TBD | Not started | - |
.planning/phases/01-data-infrastructure/01-01-PLAN.md (new file, 177 lines)
@@ -0,0 +1,177 @@
---
phase: 01-data-infrastructure
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - pyproject.toml
  - src/usher_pipeline/__init__.py
  - src/usher_pipeline/config/__init__.py
  - src/usher_pipeline/config/schema.py
  - src/usher_pipeline/config/loader.py
  - src/usher_pipeline/api_clients/__init__.py
  - src/usher_pipeline/api_clients/base.py
  - config/default.yaml
  - tests/__init__.py
  - tests/test_config.py
autonomous: true

must_haves:
  truths:
    - "YAML config loads and validates with Pydantic, returning typed PipelineConfig object"
    - "Invalid config (missing required fields, wrong types, bad values) raises ValidationError with clear messages"
    - "CachedAPIClient makes HTTP requests with automatic retry on 429/5xx and persistent SQLite caching"
    - "Pipeline is installable as Python package with all dependencies pinned"
  artifacts:
    - path: "pyproject.toml"
      provides: "Package definition with all dependencies"
      contains: "mygene"
    - path: "src/usher_pipeline/config/schema.py"
      provides: "Pydantic models for pipeline configuration"
      contains: "class PipelineConfig"
    - path: "src/usher_pipeline/config/loader.py"
      provides: "YAML config loading with validation"
      contains: "def load_config"
    - path: "src/usher_pipeline/api_clients/base.py"
      provides: "Base API client with retry and caching"
      contains: "class CachedAPIClient"
    - path: "config/default.yaml"
      provides: "Default pipeline configuration"
      contains: "ensembl_release"
  key_links:
    - from: "src/usher_pipeline/config/loader.py"
      to: "src/usher_pipeline/config/schema.py"
      via: "imports PipelineConfig for validation"
      pattern: "from.*schema.*import.*PipelineConfig"
    - from: "src/usher_pipeline/api_clients/base.py"
      to: "requests_cache"
      via: "creates CachedSession for persistent caching"
      pattern: "requests_cache\\.CachedSession"
---

<objective>
Create the Python project scaffold, configuration system with Pydantic validation, and base API client with retry/caching.

Purpose: Establish the foundational package structure, dependency management, and two core infrastructure components (config loading, API client pattern) that all subsequent plans depend on. This is the skeleton that gene mapping, persistence, and CLI modules plug into.

Output: Installable Python package with validated config loading and reusable API client base class.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-data-infrastructure/01-RESEARCH.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Create Python package scaffold with config system</name>
<files>
pyproject.toml
src/usher_pipeline/__init__.py
src/usher_pipeline/config/__init__.py
src/usher_pipeline/config/schema.py
src/usher_pipeline/config/loader.py
config/default.yaml
tests/__init__.py
tests/test_config.py
</files>
<action>
1. Create `pyproject.toml` using modern Python packaging (PEP 621). Project name: `usher-pipeline`. Requires Python >=3.11. Include all dependencies from research: mygene, requests, requests-cache, tenacity, pydantic>=2.0, pydantic-yaml, duckdb, click, polars, pyarrow. Add dev dependencies: pytest, pytest-cov. Use `src` layout with `src/usher_pipeline` as the package.

2. Create `src/usher_pipeline/__init__.py` with `__version__ = "0.1.0"`.

3. Create `src/usher_pipeline/config/schema.py` with Pydantic v2 models (see the sketch after this action block):
   - `DataSourceVersions(BaseModel)`: fields for ensembl_release (int, ge=100), gnomad_version (str, default "v4.1"), gtex_version (str, default "v8"), hpa_version (str, default "23.0"). All fields with descriptions.
   - `ScoringWeights(BaseModel)`: placeholder fields for per-layer weights (gnomad, expression, annotation, localization, animal_model, literature), all float 0.0-1.0 with defaults summing to 1.0.
   - `APIConfig(BaseModel)`: rate_limit_per_second (int, default 5), max_retries (int, default 5, ge=1, le=20), cache_ttl_seconds (int, default 86400), timeout_seconds (int, default 30).
   - `PipelineConfig(BaseModel)`: data_dir (Path), cache_dir (Path), duckdb_path (Path), versions (DataSourceVersions), api (APIConfig), scoring (ScoringWeights). Add a field_validator on data_dir and cache_dir to create directories with mkdir(parents=True, exist_ok=True). Add a method `config_hash() -> str` that computes a SHA-256 hash of the config dict (json.dumps with sort_keys=True, default=str).

4. Create `src/usher_pipeline/config/loader.py`:
   - `load_config(config_path: Path) -> PipelineConfig`: reads the YAML file, parses with pydantic_yaml.parse_yaml_raw_as into PipelineConfig. Raises FileNotFoundError if the path is missing, pydantic.ValidationError if invalid.
   - `load_config_with_overrides(config_path: Path, overrides: dict) -> PipelineConfig`: loads the base config, applies dict overrides (for CLI flags), re-validates.

5. Create `config/default.yaml` with sensible defaults: data_dir = "data", cache_dir = "data/cache", duckdb_path = "data/pipeline.duckdb", versions with ensembl_release 113, gnomad v4.1, gtex v8. API config with rate_limit 5, max_retries 5, cache_ttl 86400.

6. Create `tests/test_config.py` with pytest tests:
   - test_load_valid_config: loads default.yaml, asserts PipelineConfig returned with correct types
   - test_invalid_config_missing_field: config missing data_dir raises ValidationError
   - test_invalid_ensembl_release: ensembl_release < 100 raises ValidationError
   - test_config_hash_deterministic: same config produces same hash, different config produces different hash
   - test_config_creates_directories: loading config with non-existent data_dir creates the directory (use tmp_path fixture)
</action>
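
To make the schema and hashing behavior above concrete, here is a minimal sketch under the stated field names (ScoringWeights and load_config_with_overrides omitted for brevity). It is illustrative only, not the final implementation, and assumes pydantic-yaml's `parse_yaml_raw_as` for the loader.

```python
# Sketch of src/usher_pipeline/config/schema.py (illustrative, not the final implementation).
import hashlib
import json
from pathlib import Path

from pydantic import BaseModel, Field, field_validator


class DataSourceVersions(BaseModel):
    ensembl_release: int = Field(113, ge=100, description="Ensembl release number")
    gnomad_version: str = Field("v4.1", description="gnomAD release")
    gtex_version: str = Field("v8", description="GTEx release")
    hpa_version: str = Field("23.0", description="Human Protein Atlas release")


class APIConfig(BaseModel):
    rate_limit_per_second: int = 5
    max_retries: int = Field(5, ge=1, le=20)
    cache_ttl_seconds: int = 86400
    timeout_seconds: int = 30


class PipelineConfig(BaseModel):
    data_dir: Path
    cache_dir: Path
    duckdb_path: Path
    versions: DataSourceVersions = DataSourceVersions()
    api: APIConfig = APIConfig()

    @field_validator("data_dir", "cache_dir")
    @classmethod
    def _ensure_dir(cls, value: Path) -> Path:
        # Create the directory during validation so downstream code can assume it exists.
        value.mkdir(parents=True, exist_ok=True)
        return value

    def config_hash(self) -> str:
        # Deterministic: sort keys and stringify Paths before hashing the JSON text.
        payload = json.dumps(self.model_dump(), sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Sketch of src/usher_pipeline/config/loader.py.
from pydantic_yaml import parse_yaml_raw_as


def load_config(config_path: Path) -> PipelineConfig:
    config_path = Path(config_path)  # tolerate str paths coming from CLI flags
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    return parse_yaml_raw_as(PipelineConfig, config_path.read_text())
```

Hashing the sort_keys=True JSON dump is what keeps `config_hash()` deterministic across runs, which the config and provenance tests rely on.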
<verify>
cd /Users/gbanyan/Project/usher-exploring && pip install -e ".[dev]" && pytest tests/test_config.py -v
</verify>
<done>
- pyproject.toml installs cleanly with all dependencies
- All 5 config tests pass
- PipelineConfig validates types, rejects invalid input, creates directories, produces deterministic hashes
</done>
</task>

<task type="auto">
<name>Task 2: Create base API client with retry logic and persistent caching</name>
<files>
src/usher_pipeline/api_clients/__init__.py
src/usher_pipeline/api_clients/base.py
tests/test_api_client.py
</files>
<action>
1. Create `src/usher_pipeline/api_clients/base.py` with a `CachedAPIClient` class (a sketch follows this action block):
   - Constructor takes `cache_dir: Path`, `rate_limit: int = 5`, `max_retries: int = 5`, `cache_ttl: int = 86400`, `timeout: int = 30`.
   - Creates `requests_cache.CachedSession` with SQLite backend at `cache_dir / 'api_cache'`, expire_after=cache_ttl.
   - Implements `get(url: str, params: dict = None, **kwargs) -> requests.Response` decorated with `@retry` from tenacity: stop_after_attempt(max_retries), wait_exponential(multiplier=1, min=2, max=60), retrying on HTTPError, Timeout, and ConnectionError. Before raising on 429, log a warning about rate limiting. Sets timeout on all requests.
   - Implements `get_json(url: str, params: dict = None, **kwargs) -> dict` that calls get() and returns response.json().
   - Implements simple rate limiting via `time.sleep(1/rate_limit)` before non-cached requests. Check `response.from_cache` to skip the sleep for cached responses.
   - Add `@classmethod from_config(cls, config: PipelineConfig)` that creates an instance from the config's api settings and cache_dir.
   - Add a `clear_cache()` method to clear the SQLite cache.
   - Add `cache_stats() -> dict` returning hit/miss counts from the session.

2. Create `tests/test_api_client.py` with pytest tests:
   - test_client_creates_cache_dir: instantiating with a non-existent cache_dir creates it (use tmp_path)
   - test_client_caches_response: mock a GET request, call twice, second call returns from cache (use responses or unittest.mock to mock HTTP)
   - test_client_from_config: create from PipelineConfig, verify settings applied
   - test_rate_limit_respected: verify sleep is called between non-cached requests (mock time.sleep)

Use unittest.mock for HTTP mocking (avoid adding extra test dependencies). Patch requests_cache.CachedSession if needed to avoid real HTTP in tests.
</action>
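
A minimal sketch of the client described above. Because `max_retries` comes from the constructor, this sketch uses tenacity's `Retrying` controller inside `get()` rather than a class-level `@retry` decorator; either form satisfies the plan. The cache file name, logging details, and the omitted clear_cache()/cache_stats() helpers are assumptions, not the final implementation.

```python
# Sketch of src/usher_pipeline/api_clients/base.py (illustrative).
import logging
import time
from pathlib import Path

import requests
import requests_cache
from tenacity import Retrying, retry_if_exception_type, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)


class CachedAPIClient:
    def __init__(self, cache_dir: Path, rate_limit: int = 5, max_retries: int = 5,
                 cache_ttl: int = 86400, timeout: int = 30):
        cache_dir.mkdir(parents=True, exist_ok=True)
        self.rate_limit = rate_limit
        self.max_retries = max_retries
        self.timeout = timeout
        # Persistent SQLite cache: repeated GETs within the TTL never hit the network.
        self.session = requests_cache.CachedSession(
            str(cache_dir / "api_cache"), backend="sqlite", expire_after=cache_ttl
        )

    def get(self, url: str, params: dict = None, **kwargs) -> requests.Response:
        # A Retrying controller (instead of a decorator) lets max_retries vary per instance.
        retryer = Retrying(
            stop=stop_after_attempt(self.max_retries),
            wait=wait_exponential(multiplier=1, min=2, max=60),
            retry=retry_if_exception_type(
                (requests.HTTPError, requests.Timeout, requests.ConnectionError)
            ),
            reraise=True,
        )
        return retryer(self._get_once, url, params, **kwargs)

    def _get_once(self, url, params, **kwargs):
        resp = self.session.get(url, params=params, timeout=self.timeout, **kwargs)
        if resp.status_code == 429:
            logger.warning("Rate limited by %s; backing off before retry", url)
        resp.raise_for_status()
        if not getattr(resp, "from_cache", False):
            # Simple rate limiting: only sleep when we actually hit the network.
            time.sleep(1 / self.rate_limit)
        return resp

    def get_json(self, url: str, params: dict = None, **kwargs) -> dict:
        return self.get(url, params=params, **kwargs).json()

    @classmethod
    def from_config(cls, config) -> "CachedAPIClient":
        return cls(
            cache_dir=config.cache_dir,
            rate_limit=config.api.rate_limit_per_second,
            max_retries=config.api.max_retries,
            cache_ttl=config.api.cache_ttl_seconds,
            timeout=config.api.timeout_seconds,
        )
```

Sleeping only when `response.from_cache` is false keeps cached replays instant while still pacing real network calls, which is the behavior test_rate_limit_respected checks.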
<verify>
cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_api_client.py -v
</verify>
<done>
- CachedAPIClient instantiates with SQLite cache in specified directory
- Retry configured with exponential backoff on HTTP errors
- Rate limiting sleeps between non-cached requests
- All 4 API client tests pass
- from_config classmethod correctly reads PipelineConfig settings
</done>
</task>

</tasks>

<verification>
1. `pip install -e ".[dev]"` completes without errors
2. `pytest tests/ -v` shows all config and API client tests passing
3. `python -c "from usher_pipeline.config.loader import load_config; c = load_config('config/default.yaml'); print(c.config_hash())"` prints a SHA-256 hash
4. `python -c "from usher_pipeline.api_clients.base import CachedAPIClient"` imports without error
</verification>

<success_criteria>
- Python package installs with all bioinformatics dependencies
- Config loads from YAML, validates with Pydantic, rejects invalid input
- API client provides retry + caching foundation for all future API modules
- All tests pass
</success_criteria>

<output>
After completion, create `.planning/phases/01-data-infrastructure/01-01-SUMMARY.md`
</output>
.planning/phases/01-data-infrastructure/01-02-PLAN.md (new file, 176 lines)
@@ -0,0 +1,176 @@
---
phase: 01-data-infrastructure
plan: 02
type: execute
wave: 2
depends_on: ["01-01"]
files_modified:
  - src/usher_pipeline/gene_mapping/__init__.py
  - src/usher_pipeline/gene_mapping/universe.py
  - src/usher_pipeline/gene_mapping/mapper.py
  - src/usher_pipeline/gene_mapping/validator.py
  - tests/test_gene_mapping.py
autonomous: true

must_haves:
  truths:
    - "Gene universe contains only human protein-coding genes from Ensembl, excluding pseudogenes"
    - "Gene IDs map between Ensembl, HGNC symbols, and UniProt accessions via mygene batch queries"
    - "Mapping validation reports percentage successfully mapped and flags unmapped genes for review"
    - "Validation gate fails pipeline if mapping success rate drops below configurable threshold"
  artifacts:
    - path: "src/usher_pipeline/gene_mapping/universe.py"
      provides: "Gene universe definition and Ensembl protein-coding gene retrieval"
      contains: "protein_coding"
    - path: "src/usher_pipeline/gene_mapping/mapper.py"
      provides: "Batch gene ID mapping via mygene"
      contains: "class GeneMapper"
    - path: "src/usher_pipeline/gene_mapping/validator.py"
      provides: "Mapping validation gates with success rate reporting"
      contains: "class MappingValidator"
  key_links:
    - from: "src/usher_pipeline/gene_mapping/mapper.py"
      to: "mygene"
      via: "MyGeneInfo client for batch ID queries"
      pattern: "mygene\\.MyGeneInfo"
    - from: "src/usher_pipeline/gene_mapping/validator.py"
      to: "src/usher_pipeline/gene_mapping/mapper.py"
      via: "validates mapping results from GeneMapper"
      pattern: "MappingResult"
    - from: "src/usher_pipeline/gene_mapping/universe.py"
      to: "src/usher_pipeline/config/schema.py"
      via: "reads Ensembl release from PipelineConfig"
      pattern: "ensembl_release"
---

<objective>
Build the gene ID mapping system that defines the gene universe (human protein-coding genes from Ensembl) and provides validated mapping between Ensembl gene IDs, HGNC symbols, and UniProt accessions.

Purpose: Gene ID mapping is the foundation of the entire pipeline. Every downstream evidence layer retrieves data using gene identifiers, so reliable mapping between ID systems (Ensembl, HGNC, UniProt) with validation gates is critical. The gene universe definition (INFRA-01) ensures we work with the correct set of ~20,000 protein-coding genes.

Output: Gene mapping module with universe definition, batch mapper, and validation gates.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-data-infrastructure/01-RESEARCH.md
@.planning/phases/01-data-infrastructure/01-01-SUMMARY.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Define gene universe and build batch ID mapper</name>
<files>
src/usher_pipeline/gene_mapping/__init__.py
src/usher_pipeline/gene_mapping/universe.py
src/usher_pipeline/gene_mapping/mapper.py
</files>
<action>
1. Create `src/usher_pipeline/gene_mapping/universe.py`:
   - `fetch_protein_coding_genes(ensembl_release: int = 113) -> list[str]`: Use mygene.MyGeneInfo() to query all human protein-coding genes. Query with `mg.query('type_of_gene:protein-coding', species=9606, fields='ensembl.gene,symbol,name', fetch_all=True)`. Extract Ensembl gene IDs (ENSG format). Filter to only include entries that have valid Ensembl gene IDs (start with "ENSG").
   - Validate that the gene count is in the expected range: 19,000-22,000. Log a warning if it falls outside this range (indicates pseudogene contamination or missing data).
   - Return a deduplicated, sorted list of Ensembl gene IDs.
   - Add a type alias `GeneUniverse = list[str]` for clarity.

2. Create `src/usher_pipeline/gene_mapping/mapper.py` (a sketch of both modules follows this action block):
   - Define a `MappingResult` dataclass: ensembl_id (str), hgnc_symbol (str | None), uniprot_accession (str | None), mapping_source (str = "mygene").
   - Define a `MappingReport` dataclass: total_genes (int), mapped_hgnc (int), mapped_uniprot (int), unmapped_ids (list[str]), success_rate_hgnc (float), success_rate_uniprot (float).
   - `class GeneMapper`:
     - Constructor takes batch_size (int, default 1000) for chunked queries.
     - `map_ensembl_ids(ensembl_ids: list[str]) -> tuple[list[MappingResult], MappingReport]`: Use mygene.MyGeneInfo().querymany() with scopes='ensembl.gene', fields='symbol,uniprot.Swiss-Prot', species=9606, returnall=True. Process in batches of batch_size. For each result, extract the symbol and UniProt Swiss-Prot accession (handle both string and list cases for uniprot -- take the first if a list). Build a MappingResult for each gene. Compute a MappingReport with success rates.
     - Handle edge cases: 'notfound' in result, missing 'symbol' key, uniprot being a nested dict vs string, duplicate query results (take the first non-null).

3. Create `src/usher_pipeline/gene_mapping/__init__.py` exporting GeneMapper, MappingResult, MappingReport, fetch_protein_coding_genes.
</action>
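
A minimal sketch of the two modules under the assumptions above. The mygene result shapes (a dict with 'out'/'missing' keys when returnall=True, uniprot nested under 'Swiss-Prot' as a string or list) match the mock fixture used later in this plan. The 19k-22k count check and logging are omitted for brevity, and since mygene does not take an Ensembl release argument, `ensembl_release` is only carried for provenance here. Illustrative only.

```python
# Sketch of src/usher_pipeline/gene_mapping/universe.py and mapper.py (illustrative).
from dataclasses import dataclass, field

import mygene


def fetch_protein_coding_genes(ensembl_release: int = 113) -> list[str]:
    # ensembl_release is recorded for provenance; mygene serves its current annotation.
    mg = mygene.MyGeneInfo()
    hits = mg.query("type_of_gene:protein-coding", species=9606,
                    fields="ensembl.gene,symbol,name", fetch_all=True)
    ids: set[str] = set()
    for hit in hits:
        ens = hit.get("ensembl")
        entries = ens if isinstance(ens, list) else [ens] if ens else []
        for entry in entries:
            gene_id = entry.get("gene")
            if isinstance(gene_id, str) and gene_id.startswith("ENSG"):
                ids.add(gene_id)
    return sorted(ids)


@dataclass
class MappingResult:
    ensembl_id: str
    hgnc_symbol: str | None = None
    uniprot_accession: str | None = None
    mapping_source: str = "mygene"


@dataclass
class MappingReport:
    total_genes: int
    mapped_hgnc: int
    mapped_uniprot: int
    unmapped_ids: list[str] = field(default_factory=list)
    success_rate_hgnc: float = 0.0
    success_rate_uniprot: float = 0.0


def _first_swissprot(uniprot_field) -> str | None:
    # mygene returns uniprot as {'Swiss-Prot': 'P51587'} or {'Swiss-Prot': [...]}.
    if not isinstance(uniprot_field, dict):
        return None
    acc = uniprot_field.get("Swiss-Prot")
    if isinstance(acc, list):
        return acc[0] if acc else None
    return acc


class GeneMapper:
    def __init__(self, batch_size: int = 1000):
        self.batch_size = batch_size
        self._mg = mygene.MyGeneInfo()

    def map_ensembl_ids(self, ensembl_ids):
        best: dict[str, MappingResult] = {}
        for start in range(0, len(ensembl_ids), self.batch_size):
            chunk = ensembl_ids[start:start + self.batch_size]
            out = self._mg.querymany(chunk, scopes="ensembl.gene",
                                     fields="symbol,uniprot.Swiss-Prot",
                                     species=9606, returnall=True)
            for hit in out["out"]:
                qid = hit["query"]
                if hit.get("notfound"):
                    best.setdefault(qid, MappingResult(ensembl_id=qid))
                    continue
                result = MappingResult(
                    ensembl_id=qid,
                    hgnc_symbol=hit.get("symbol"),
                    uniprot_accession=_first_swissprot(hit.get("uniprot", {})),
                )
                # Duplicate hits for one query: keep the first hit that has a symbol.
                if qid not in best or best[qid].hgnc_symbol is None:
                    best[qid] = result
        results = [best.get(g, MappingResult(ensembl_id=g)) for g in ensembl_ids]
        n = len(results) or 1
        mapped_hgnc = sum(r.hgnc_symbol is not None for r in results)
        mapped_uniprot = sum(r.uniprot_accession is not None for r in results)
        report = MappingReport(
            total_genes=len(results),
            mapped_hgnc=mapped_hgnc,
            mapped_uniprot=mapped_uniprot,
            unmapped_ids=[r.ensembl_id for r in results if r.hgnc_symbol is None],
            success_rate_hgnc=mapped_hgnc / n,
            success_rate_uniprot=mapped_uniprot / n,
        )
        return results, report
```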
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.gene_mapping.mapper import GeneMapper, MappingResult, MappingReport
from usher_pipeline.gene_mapping.universe import fetch_protein_coding_genes
# Verify imports work and classes are defined
print('MappingResult fields:', [f.name for f in MappingResult.__dataclass_fields__.values()] if hasattr(MappingResult, '__dataclass_fields__') else 'OK')
print('GeneMapper has map_ensembl_ids:', hasattr(GeneMapper, 'map_ensembl_ids'))
"
</verify>
<done>
- universe.py fetches protein-coding genes via mygene and validates count in 19k-22k range
- mapper.py provides batch mapping from Ensembl to HGNC+UniProt with MappingResult/MappingReport dataclasses
- All classes import cleanly
</done>
</task>

<task type="auto">
<name>Task 2: Create mapping validation gates with tests</name>
<files>
src/usher_pipeline/gene_mapping/validator.py
tests/test_gene_mapping.py
</files>
<action>
1. Create `src/usher_pipeline/gene_mapping/validator.py` (a sketch follows this action block):
   - `class MappingValidator`:
     - Constructor takes min_success_rate (float, default 0.90), warn_threshold (float, default 0.95).
     - `validate(report: MappingReport) -> ValidationResult`: Returns a ValidationResult with passed (bool) and messages (list[str]). Fails if the HGNC success_rate < min_success_rate. Warns if the success_rate is between min_success_rate and warn_threshold.
     - `save_unmapped_report(report: MappingReport, output_path: Path)`: Writes unmapped gene IDs to a text file for manual review, one per line, with a header comment showing timestamp and total count.
   - `class ValidationResult`: dataclass with passed (bool), messages (list[str]), hgnc_rate (float), uniprot_rate (float).
   - `validate_gene_universe(genes: list[str]) -> ValidationResult`: Checks the gene count is in range, all IDs start with ENSG, no duplicates. Returns a ValidationResult.
   - Update `src/usher_pipeline/gene_mapping/__init__.py` to also export MappingValidator and ValidationResult so the phase verification import works.

2. Create `tests/test_gene_mapping.py` with pytest tests. Use mocked mygene responses (do NOT call the real API in tests):
   - test_mapping_result_creation: create MappingResult with all fields, verify attributes
   - test_mapper_handles_successful_mapping: mock mygene.querymany to return successful results for 3 genes (with symbol and uniprot), verify MappingResult list and MappingReport success rates = 1.0
   - test_mapper_handles_unmapped_genes: mock mygene.querymany to return 'notfound' for 1 of 3 genes, verify unmapped_ids contains the missing gene, success rate = 0.667
   - test_mapper_handles_uniprot_list: mock result where uniprot.Swiss-Prot is a list of accessions, verify the first is taken
   - test_validator_passes_high_rate: MappingReport with 95% success rate passes with min_success_rate=0.90
   - test_validator_fails_low_rate: MappingReport with 80% success rate fails with min_success_rate=0.90
   - test_validator_warns_medium_rate: MappingReport with 92% passes but includes warning
   - test_validate_gene_universe_valid: list of 20000 ENSG IDs passes
   - test_validate_gene_universe_invalid_count: list of 50000 genes fails (too many, likely includes non-coding)
   - test_save_unmapped_report: verify unmapped genes written to file (use tmp_path)

Mock mygene using unittest.mock.patch on the mygene.MyGeneInfo class. Create a fixture with a realistic mygene response format:

```python
MOCK_MYGENE_RESPONSE = {
    'out': [
        {'query': 'ENSG00000139618', 'symbol': 'BRCA2', 'uniprot': {'Swiss-Prot': 'P51587'}},
        {'query': 'ENSG00000141510', 'symbol': 'TP53', 'uniprot': {'Swiss-Prot': 'P04637'}},
        {'query': 'ENSG00000000000', 'notfound': True}
    ],
    'missing': ['ENSG00000000000']
}
```
</action>
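
A minimal sketch of the threshold logic described above (the unmapped-report writer and the gene-universe check are omitted). Field names follow the ValidationResult dataclass as specified; illustrative only.

```python
# Sketch of src/usher_pipeline/gene_mapping/validator.py (illustrative).
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    passed: bool
    messages: list[str] = field(default_factory=list)
    hgnc_rate: float = 0.0
    uniprot_rate: float = 0.0


class MappingValidator:
    def __init__(self, min_success_rate: float = 0.90, warn_threshold: float = 0.95):
        self.min_success_rate = min_success_rate
        self.warn_threshold = warn_threshold

    def validate(self, report) -> ValidationResult:
        messages = []
        passed = report.success_rate_hgnc >= self.min_success_rate
        if not passed:
            # Hard gate: fail the pipeline when too many genes cannot be mapped to HGNC.
            messages.append(
                f"HGNC mapping rate {report.success_rate_hgnc:.1%} is below the minimum "
                f"{self.min_success_rate:.0%}; {len(report.unmapped_ids)} genes unmapped"
            )
        elif report.success_rate_hgnc < self.warn_threshold:
            messages.append(
                f"Warning: HGNC mapping rate {report.success_rate_hgnc:.1%} is below "
                f"{self.warn_threshold:.0%}"
            )
        return ValidationResult(
            passed=passed,
            messages=messages,
            hgnc_rate=report.success_rate_hgnc,
            uniprot_rate=report.success_rate_uniprot,
        )
```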
<verify>
cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_gene_mapping.py -v
</verify>
<done>
- MappingValidator enforces configurable success rate thresholds
- Validation gates report clear pass/fail/warn with messages
- Unmapped gene report saves to file for review
- All 10 gene mapping tests pass with mocked mygene responses (no real API calls)
</done>
</task>

</tasks>

<verification>
1. `pytest tests/test_gene_mapping.py -v` -- all tests pass
2. `python -c "from usher_pipeline.gene_mapping import GeneMapper, MappingValidator"` -- imports work
3. Gene mapping validation gates enforce minimum success rates and produce actionable reports
</verification>

<success_criteria>
- Gene universe function retrieves human protein-coding genes via mygene with count validation
- Batch mapper converts Ensembl IDs to HGNC symbols + UniProt accessions with proper edge case handling
- Validation gates enforce configurable success rate thresholds and produce unmapped gene reports
- All tests pass with mocked API responses
</success_criteria>

<output>
After completion, create `.planning/phases/01-data-infrastructure/01-02-SUMMARY.md`
</output>
.planning/phases/01-data-infrastructure/01-03-PLAN.md (new file, 179 lines)
@@ -0,0 +1,179 @@
---
phase: 01-data-infrastructure
plan: 03
type: execute
wave: 2
depends_on: ["01-01"]
files_modified:
  - src/usher_pipeline/persistence/__init__.py
  - src/usher_pipeline/persistence/duckdb_store.py
  - src/usher_pipeline/persistence/provenance.py
  - tests/test_persistence.py
autonomous: true

must_haves:
  truths:
    - "DuckDB database stores DataFrames as tables and exports to Parquet"
    - "Checkpoint system detects existing tables to enable restart-from-checkpoint without re-downloading"
    - "Provenance metadata captures pipeline version, data source versions, timestamps, and config hash"
    - "Provenance sidecar JSON file is saved alongside every pipeline output"
  artifacts:
    - path: "src/usher_pipeline/persistence/duckdb_store.py"
      provides: "DuckDB-based storage with checkpoint-restart capability"
      contains: "class PipelineStore"
    - path: "src/usher_pipeline/persistence/provenance.py"
      provides: "Provenance metadata creation and persistence"
      contains: "class ProvenanceTracker"
  key_links:
    - from: "src/usher_pipeline/persistence/duckdb_store.py"
      to: "duckdb"
      via: "duckdb.connect for file-based database"
      pattern: "duckdb\\.connect"
    - from: "src/usher_pipeline/persistence/provenance.py"
      to: "src/usher_pipeline/config/schema.py"
      via: "reads PipelineConfig for version info and config hash"
      pattern: "config_hash|PipelineConfig"
    - from: "src/usher_pipeline/persistence/duckdb_store.py"
      to: "src/usher_pipeline/persistence/provenance.py"
      via: "attaches provenance metadata when saving checkpoints"
      pattern: "provenance|ProvenanceTracker"
---

<objective>
Build the DuckDB persistence layer for intermediate results with checkpoint-restart capability, and the provenance metadata system for reproducibility tracking.

Purpose: The pipeline fetches expensive API data that takes hours. DuckDB persistence enables restart-from-checkpoint so failures don't require re-downloading everything (INFRA-07). Provenance tracking ensures every output is traceable to specific data versions, config parameters, and timestamps (INFRA-06).

Output: PipelineStore class for DuckDB operations and ProvenanceTracker for metadata.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-data-infrastructure/01-RESEARCH.md
@.planning/phases/01-data-infrastructure/01-01-SUMMARY.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Create DuckDB persistence layer with checkpoint-restart</name>
<files>
src/usher_pipeline/persistence/__init__.py
src/usher_pipeline/persistence/duckdb_store.py
</files>
<action>
1. Create `src/usher_pipeline/persistence/duckdb_store.py` with a `PipelineStore` class (a sketch follows this action block):
   - Constructor takes `db_path: Path`. Creates parent directories. Connects via `duckdb.connect(str(db_path))`. Creates an internal `_checkpoints` metadata table on init: `CREATE TABLE IF NOT EXISTS _checkpoints (table_name VARCHAR PRIMARY KEY, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, row_count INTEGER, description VARCHAR)`.
   - `save_dataframe(df: polars.DataFrame | pandas.DataFrame, table_name: str, description: str = "", replace: bool = True)`: Detect the DataFrame type. For polars, use `conn.execute("CREATE OR REPLACE TABLE ... AS SELECT * FROM df")` via DuckDB's native polars support. For pandas, same pattern. Update _checkpoints metadata with row count and description. Use parameterized queries for the metadata insert.
   - `load_dataframe(table_name: str, as_polars: bool = True) -> polars.DataFrame | pandas.DataFrame | None`: Load the table. Return None if the table doesn't exist (catch duckdb.CatalogException). Return polars by default (use `conn.execute("SELECT * FROM ...").pl()`), pandas if as_polars=False.
   - `has_checkpoint(table_name: str) -> bool`: Check the _checkpoints table for existence.
   - `list_checkpoints() -> list[dict]`: Return all checkpoint metadata (table_name, created_at, row_count, description).
   - `delete_checkpoint(table_name: str)`: Drop the table and remove it from _checkpoints.
   - `export_parquet(table_name: str, output_path: Path)`: Export the table to a Parquet file using `COPY ... TO ... (FORMAT PARQUET)`. Create parent dirs.
   - `execute_query(query: str, params: list = None) -> polars.DataFrame`: Execute arbitrary SQL, return a polars DataFrame.
   - `close()`: Close the DuckDB connection. Implement `__enter__` and `__exit__` for context manager support.
   - `@classmethod from_config(cls, config: PipelineConfig)`: Create from the config's duckdb_path.

2. Create `src/usher_pipeline/persistence/__init__.py` exporting PipelineStore and ProvenanceTracker.
</action>
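
A minimal sketch of the core checkpoint operations (export_parquet, delete_checkpoint, list_checkpoints, execute_query, and from_config are omitted). It registers the incoming DataFrame explicitly rather than relying on DuckDB's implicit local-variable lookup; both work, this is just the more explicit form. Illustrative only.

```python
# Sketch of src/usher_pipeline/persistence/duckdb_store.py (illustrative).
from pathlib import Path

import duckdb


class PipelineStore:
    def __init__(self, db_path: Path):
        db_path.parent.mkdir(parents=True, exist_ok=True)
        self.conn = duckdb.connect(str(db_path))
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS _checkpoints ("
            "table_name VARCHAR PRIMARY KEY, "
            "created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, "
            "row_count INTEGER, description VARCHAR)"
        )

    def save_dataframe(self, df, table_name: str, description: str = "", replace: bool = True):
        # register() accepts pandas and polars DataFrames (via Arrow), so SQL can see it.
        self.conn.register("incoming_df", df)
        verb = "CREATE OR REPLACE TABLE" if replace else "CREATE TABLE"
        self.conn.execute(f'{verb} "{table_name}" AS SELECT * FROM incoming_df')
        self.conn.unregister("incoming_df")
        self.conn.execute(
            "INSERT OR REPLACE INTO _checkpoints (table_name, row_count, description) "
            "VALUES (?, ?, ?)",
            [table_name, len(df), description],
        )

    def load_dataframe(self, table_name: str, as_polars: bool = True):
        try:
            result = self.conn.execute(f'SELECT * FROM "{table_name}"')
        except duckdb.CatalogException:
            return None  # Table does not exist: no checkpoint to restore.
        return result.pl() if as_polars else result.df()

    def has_checkpoint(self, table_name: str) -> bool:
        row = self.conn.execute(
            "SELECT 1 FROM _checkpoints WHERE table_name = ?", [table_name]
        ).fetchone()
        return row is not None

    def close(self):
        self.conn.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```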
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "
from usher_pipeline.persistence.duckdb_store import PipelineStore
from pathlib import Path
import tempfile, polars as pl

with tempfile.TemporaryDirectory() as tmp:
    store = PipelineStore(Path(tmp) / 'test.duckdb')
    df = pl.DataFrame({'gene': ['BRCA1', 'TP53'], 'score': [0.95, 0.88]})
    store.save_dataframe(df, 'test_genes', 'test data')
    assert store.has_checkpoint('test_genes')
    loaded = store.load_dataframe('test_genes')
    assert loaded.shape == (2, 2)
    assert not store.has_checkpoint('nonexistent')
    store.close()
    print('PipelineStore: all basic operations work')
"
</verify>
<done>
- PipelineStore creates DuckDB file, saves/loads polars DataFrames
- Checkpoint system tracks which tables exist with metadata
- has_checkpoint returns True for existing tables, False for missing
- Context manager support works for clean connection handling
</done>
</task>

<task type="auto">
<name>Task 2: Create provenance tracker with tests for both modules</name>
<files>
src/usher_pipeline/persistence/provenance.py
tests/test_persistence.py
</files>
<action>
1. Create `src/usher_pipeline/persistence/provenance.py` with a `ProvenanceTracker` class (a sketch follows this action block):
   - Constructor takes `pipeline_version: str`, `config: PipelineConfig`.
   - Stores config_hash from config.config_hash() and data_source_versions from config.versions.model_dump().
   - `record_step(step_name: str, details: dict = None)`: Appends to an internal processing_steps list with timestamp, step_name, and optional details dict.
   - `create_metadata() -> dict`: Returns the full provenance dict: pipeline_version, data_source_versions, config_hash, created_at (ISO timestamp), processing_steps list.
   - `save_sidecar(output_path: Path)`: Saves provenance metadata as a JSON sidecar file at `output_path.with_suffix('.provenance.json')`. Uses json.dumps with indent=2, default=str for Path serialization.
   - `save_to_store(store: PipelineStore)`: Saves provenance metadata to a `_provenance` table in DuckDB (flattened: version, config_hash, created_at, steps_json).
   - `@staticmethod load_sidecar(sidecar_path: Path) -> dict`: Reads and returns the provenance JSON.
   - `@staticmethod from_config(config: PipelineConfig, version: str = None) -> ProvenanceTracker`: Creates a tracker, using the version from usher_pipeline.__version__ if not provided.

2. Create `tests/test_persistence.py` with pytest tests using tmp_path:

   DuckDB store tests:
   - test_store_creates_database: PipelineStore creates .duckdb file at specified path
   - test_save_and_load_polars: save polars DataFrame, load back, verify shape and values
   - test_save_and_load_pandas: save pandas DataFrame, load back as pandas (as_polars=False)
   - test_checkpoint_lifecycle: save -> has_checkpoint True -> delete -> has_checkpoint False
   - test_list_checkpoints: save 3 tables, list_checkpoints returns 3 entries with metadata
   - test_export_parquet: save DataFrame, export to Parquet, verify Parquet file exists and is readable by polars
   - test_load_nonexistent_returns_none: load_dataframe for non-existent table returns None
   - test_context_manager: use `with PipelineStore(...) as store:` and verify operations work

   Provenance tests:
   - test_provenance_metadata_structure: create ProvenanceTracker, verify metadata dict has all required keys
   - test_provenance_records_steps: record 2 steps, verify they appear in metadata with timestamps
   - test_provenance_sidecar_roundtrip: save sidecar, load sidecar, verify content matches
   - test_provenance_config_hash_included: verify config_hash in metadata matches config.config_hash()
   - test_provenance_save_to_store: save to PipelineStore, verify _provenance table exists

For tests needing PipelineConfig, create a minimal test fixture that builds one with tmp_path directories.
</action>
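
A minimal sketch of the tracker (save_to_store and from_config omitted). Timestamps use UTC ISO-8601 here; that detail is an assumption, the plan only requires ISO timestamps. Illustrative only.

```python
# Sketch of src/usher_pipeline/persistence/provenance.py (illustrative).
import json
from datetime import datetime, timezone
from pathlib import Path


class ProvenanceTracker:
    def __init__(self, pipeline_version: str, config):
        self.pipeline_version = pipeline_version
        self.config_hash = config.config_hash()
        self.data_source_versions = config.versions.model_dump()
        self.processing_steps: list[dict] = []

    def record_step(self, step_name: str, details: dict = None):
        # Each step is timestamped so the sidecar doubles as a processing log.
        self.processing_steps.append({
            "step": step_name,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "details": details or {},
        })

    def create_metadata(self) -> dict:
        return {
            "pipeline_version": self.pipeline_version,
            "data_source_versions": self.data_source_versions,
            "config_hash": self.config_hash,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "processing_steps": self.processing_steps,
        }

    def save_sidecar(self, output_path: Path) -> Path:
        sidecar = output_path.with_suffix(".provenance.json")
        sidecar.write_text(json.dumps(self.create_metadata(), indent=2, default=str))
        return sidecar

    @staticmethod
    def load_sidecar(sidecar_path: Path) -> dict:
        return json.loads(sidecar_path.read_text())
```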
<verify>
cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_persistence.py -v
</verify>
<done>
- ProvenanceTracker creates metadata with pipeline version, data source versions, config hash, timestamps, and processing steps
- Sidecar JSON saves alongside outputs and round-trips correctly
- All 13 persistence tests pass (8 DuckDB store + 5 provenance)
</done>
</task>

</tasks>

<verification>
1. `pytest tests/test_persistence.py -v` -- all tests pass
2. `python -c "from usher_pipeline.persistence import PipelineStore, ProvenanceTracker"` -- imports work
3. DuckDB checkpoint-restart verified: save -> check -> load cycle works
4. Provenance sidecar JSON created with all required metadata fields
</verification>

<success_criteria>
- DuckDB store saves/loads DataFrames with checkpoint metadata tracking
- Checkpoint-restart pattern works: has_checkpoint -> skip expensive re-fetch
- Provenance tracker captures all required metadata (INFRA-06): pipeline version, data source versions, timestamps, config hash, processing steps
- Parquet export works for downstream compatibility
- All tests pass
</success_criteria>

<output>
After completion, create `.planning/phases/01-data-infrastructure/01-03-SUMMARY.md`
</output>
.planning/phases/01-data-infrastructure/01-04-PLAN.md (new file, 187 lines)
@@ -0,0 +1,187 @@
---
phase: 01-data-infrastructure
plan: 04
type: execute
wave: 3
depends_on: ["01-01", "01-02", "01-03"]
files_modified:
  - src/usher_pipeline/cli/__init__.py
  - src/usher_pipeline/cli/main.py
  - src/usher_pipeline/cli/setup_cmd.py
  - pyproject.toml
  - tests/test_integration.py
  - .gitignore
autonomous: true

must_haves:
  truths:
    - "CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands"
    - "Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance"
    - "All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance"
    - "Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching"
  artifacts:
    - path: "src/usher_pipeline/cli/main.py"
      provides: "CLI entry point with click command group"
      contains: "def cli"
    - path: "src/usher_pipeline/cli/setup_cmd.py"
      provides: "Setup command wiring config, gene mapping, persistence, provenance"
      contains: "def setup"
    - path: "tests/test_integration.py"
      provides: "Integration tests verifying module wiring"
      contains: "test_"
  key_links:
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/config/loader.py"
      via: "loads pipeline config from YAML"
      pattern: "load_config"
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/gene_mapping/mapper.py"
      via: "maps gene IDs using GeneMapper"
      pattern: "GeneMapper"
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "saves results to DuckDB with checkpoint"
      pattern: "PipelineStore"
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/persistence/provenance.py"
      via: "tracks provenance for setup step"
      pattern: "ProvenanceTracker"
    - from: "src/usher_pipeline/cli/main.py"
      to: "src/usher_pipeline/cli/setup_cmd.py"
      via: "registers setup as click subcommand"
      pattern: "cli\\.add_command|@cli\\.command"
---

<objective>
Wire all infrastructure modules together with a CLI entry point and integration tests to verify the complete data infrastructure works end-to-end.

Purpose: Individual modules (config, gene mapping, persistence, provenance) must work together as a cohesive system. This plan creates the CLI interface (click-based), implements the `setup` subcommand that exercises the full pipeline flow, and adds integration tests proving the wiring is correct. This is the "it all works together" plan.

Output: Working CLI with a setup command that demonstrates the full infrastructure flow, plus integration tests.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-data-infrastructure/01-RESEARCH.md
@.planning/phases/01-data-infrastructure/01-01-SUMMARY.md
@.planning/phases/01-data-infrastructure/01-02-SUMMARY.md
@.planning/phases/01-data-infrastructure/01-03-SUMMARY.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Create CLI entry point with setup command</name>
<files>
src/usher_pipeline/cli/__init__.py
src/usher_pipeline/cli/main.py
src/usher_pipeline/cli/setup_cmd.py
.gitignore
</files>
<action>
1. Create `src/usher_pipeline/cli/main.py`:
   - Use click to create a command group: `@click.group()` named `cli`.
   - Add a `--config` option (default "config/default.yaml") and a `--verbose` flag to the group.
   - Pass config_path and verbose to the context (ctx.obj) for subcommands.
   - Add an `info` subcommand that prints pipeline version, config path, config hash, and data source versions.
   - Register setup_cmd.

2. Create `src/usher_pipeline/cli/setup_cmd.py` (a sketch of the wiring follows this action block):
   - `@click.command('setup')` subcommand with a `--force` flag to re-run even if checkpoints exist.
   - Implementation flow:
     a. Load config via load_config(config_path).
     b. Create PipelineStore from config.
     c. Create ProvenanceTracker from config.
     d. Check checkpoint: if store.has_checkpoint('gene_universe') and not force, print "Gene universe already loaded (use --force to re-fetch)" and skip to validation.
     e. If no checkpoint or force: call fetch_protein_coding_genes(config.versions.ensembl_release) to get the gene universe.
     f. Run validate_gene_universe() on the result. If validation fails, print the error and exit(1).
     g. Create GeneMapper, call map_ensembl_ids() on the gene universe.
     h. Run MappingValidator.validate() on the mapping report. Print success rates. If validation fails, save the unmapped report and exit(1) with a clear message about the low mapping rate.
     i. Save the gene universe to DuckDB as a 'gene_universe' table (polars DataFrame with columns: ensembl_id, hgnc_symbol, uniprot_accession).
     j. Record provenance steps: "fetch_gene_universe", "map_gene_ids", "validate_mapping".
     k. Save the provenance sidecar to data_dir / "setup.provenance.json".
     l. Print a summary: gene count, HGNC mapping rate, UniProt mapping rate, DuckDB path, provenance path.
   - Use click.echo for output, click.style for colored status (green=OK, yellow=warn, red=fail).

3. Update `pyproject.toml` to add the CLI entry point: `[project.scripts] usher-pipeline = "usher_pipeline.cli.main:cli"`.

4. Create `.gitignore` with: data/, *.duckdb, *.duckdb.wal, __pycache__/, *.pyc, .pytest_cache/, *.egg-info/, dist/, build/, .eggs/, *.provenance.json (not in data/).

5. Create `src/usher_pipeline/cli/__init__.py`.
</action>
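
A compressed sketch of the group/subcommand wiring and the checkpoint-skip branch described above. Error handling, the universe validation gate, the unmapped-report path, colored warnings, and the info command are abbreviated or omitted; PipelineStore.from_config and ProvenanceTracker.from_config are assumed from plans 01-01 through 01-03. Illustrative only.

```python
# Sketch of src/usher_pipeline/cli/main.py + setup_cmd.py wiring (illustrative).
import sys
from pathlib import Path

import click
import polars as pl


@click.group()
@click.option("--config", "config_path", default="config/default.yaml", show_default=True)
@click.option("--verbose", is_flag=True)
@click.pass_context
def cli(ctx, config_path, verbose):
    ctx.ensure_object(dict)
    ctx.obj["config_path"] = Path(config_path)
    ctx.obj["verbose"] = verbose


@click.command("setup")
@click.option("--force", is_flag=True, help="Re-fetch even if a checkpoint exists.")
@click.pass_context
def setup(ctx, force):
    from usher_pipeline.config.loader import load_config
    from usher_pipeline.gene_mapping import GeneMapper, fetch_protein_coding_genes
    from usher_pipeline.gene_mapping.validator import MappingValidator
    from usher_pipeline.persistence import PipelineStore, ProvenanceTracker

    config = load_config(ctx.obj["config_path"])
    store = PipelineStore.from_config(config)
    provenance = ProvenanceTracker.from_config(config)

    # Checkpoint-restart: skip the expensive fetch when the table already exists.
    if store.has_checkpoint("gene_universe") and not force:
        click.echo("Gene universe already loaded (use --force to re-fetch)")
        return

    genes = fetch_protein_coding_genes(config.versions.ensembl_release)
    provenance.record_step("fetch_gene_universe", {"count": len(genes)})

    results, report = GeneMapper().map_ensembl_ids(genes)
    provenance.record_step("map_gene_ids")
    validation = MappingValidator().validate(report)
    provenance.record_step("validate_mapping")
    if not validation.passed:
        click.echo(click.style("Mapping rate below threshold; aborting", fg="red"))
        sys.exit(1)

    df = pl.DataFrame([
        {"ensembl_id": r.ensembl_id, "hgnc_symbol": r.hgnc_symbol,
         "uniprot_accession": r.uniprot_accession}
        for r in results
    ])
    store.save_dataframe(df, "gene_universe", "Ensembl protein-coding gene universe")
    provenance.save_sidecar(config.data_dir / "setup")  # writes setup.provenance.json
    click.echo(click.style(f"Setup complete: {len(df)} genes", fg="green"))


cli.add_command(setup)
```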
<verify>
cd /Users/gbanyan/Project/usher-exploring && pip install -e ".[dev]" && usher-pipeline --help && usher-pipeline info --config config/default.yaml
</verify>
<done>
- `usher-pipeline --help` shows available commands (setup, info)
- `usher-pipeline info` displays version, config hash, data source versions
- Setup command implements full flow: config -> gene universe -> mapping -> validation -> DuckDB -> provenance
- Checkpoint-restart: setup skips re-fetch if gene_universe table exists
- .gitignore excludes data files and build artifacts
</done>
</task>

<task type="auto">
<name>Task 2: Create integration tests verifying module wiring</name>
<files>
tests/test_integration.py
</files>
<action>
1. Create `tests/test_integration.py` with integration tests. These tests verify module wiring without calling real external APIs (mock mygene):

   - test_config_to_store_roundtrip: Load config from default.yaml, create PipelineStore with a tmp duckdb path, save a test DataFrame, verify the checkpoint exists, load back, verify data matches. Tests config -> persistence wiring.

   - test_config_to_provenance: Load config, create ProvenanceTracker, record steps, save a sidecar to a tmp dir, verify the sidecar file exists and contains a config_hash matching config.config_hash(). Tests config -> provenance wiring.

   - test_full_setup_flow_mocked: Mock mygene.MyGeneInfo to return a small set of 5 fake protein-coding genes with valid Ensembl IDs, symbols, and UniProt accessions. Mock fetch_protein_coding_genes to return 5 ENSG IDs. Run the setup flow programmatically (not via CLI): load config, create store, fetch universe (mocked), map IDs (mocked), validate, save to DuckDB, create provenance. Verify:
     a. gene_universe table exists in DuckDB with 5 rows
     b. DataFrame has columns: ensembl_id, hgnc_symbol, uniprot_accession
     c. Provenance sidecar exists with correct structure
     d. has_checkpoint('gene_universe') returns True

   - test_checkpoint_skip_flow: After test_full_setup_flow_mocked, verify that running setup again detects the checkpoint and skips re-fetch (mock fetch_protein_coding_genes, verify it is NOT called when the checkpoint exists).

   - test_setup_cli_help: Use click.testing.CliRunner to invoke `cli` with `['setup', '--help']`, verify exit_code=0 and output contains '--force' and '--config'.

   - test_info_cli: Use CliRunner to invoke `cli` with `['info', '--config', 'config/default.yaml']`, verify exit_code=0 and output contains the version string.

Use the tmp_path fixture extensively. Create a conftest.py helper or fixtures within the file for shared setup (mock config with tmp dirs). A CliRunner-based sketch of the two CLI tests follows this action block.
</action>
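
A minimal sketch of the two CLI-facing tests using click's CliRunner. Because `--config` is defined on the group in this plan, group-level options precede the subcommand when invoking; the expected version string is an assumption based on `__version__ = "0.1.0"` from plan 01-01. Illustrative only.

```python
# Sketch of the CLI-facing integration tests (illustrative).
from click.testing import CliRunner

from usher_pipeline.cli.main import cli


def test_setup_cli_help():
    runner = CliRunner()
    result = runner.invoke(cli, ["setup", "--help"])
    assert result.exit_code == 0
    assert "--force" in result.output


def test_info_cli():
    runner = CliRunner()
    # Group-level options (like --config) come before the subcommand name.
    result = runner.invoke(cli, ["--config", "config/default.yaml", "info"])
    assert result.exit_code == 0
    assert "0.1.0" in result.output  # version string from usher_pipeline.__version__
```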
<verify>
cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_integration.py -v
</verify>
<done>
- All 6 integration tests pass
- Config -> PipelineStore -> ProvenanceTracker wiring verified
- Full setup flow works end-to-end with mocked API calls
- Checkpoint-restart verified: second run skips fetch
- CLI commands respond correctly
</done>
</task>

</tasks>

<verification>
1. `pytest tests/ -v` -- ALL tests pass (config, api_client, gene_mapping, persistence, integration)
2. `usher-pipeline --help` -- shows setup, info commands
3. `usher-pipeline info` -- displays pipeline version and config info
4. Full test suite covers: config validation, API client retry/cache, gene mapping with validation gates, DuckDB persistence with checkpoints, provenance tracking, module wiring
</verification>

<success_criteria>
- CLI entry point works with setup and info subcommands
- Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
- Checkpoint-restart works: existing DuckDB data skips re-downloading
- All integration tests pass verifying cross-module wiring
- Full test suite (all files) passes: `pytest tests/ -v`
</success_criteria>

<output>
After completion, create `.planning/phases/01-data-infrastructure/01-04-SUMMARY.md`
</output>