---
phase: 01-data-infrastructure
plan: 04
type: execute
wave: 3
depends_on: ["01-01", "01-02", "01-03"]
files_modified:
  - src/usher_pipeline/cli/__init__.py
  - src/usher_pipeline/cli/main.py
  - src/usher_pipeline/cli/setup_cmd.py
  - tests/test_integration.py
  - .gitignore
autonomous: true

must_haves:
  truths:
    - "CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands"
    - "Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance"
    - "All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance"
    - "Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching"
  artifacts:
    - path: "src/usher_pipeline/cli/main.py"
      provides: "CLI entry point with click command group"
      contains: "def cli"
    - path: "src/usher_pipeline/cli/setup_cmd.py"
      provides: "Setup command wiring config, gene mapping, persistence, provenance"
      contains: "def setup"
    - path: "tests/test_integration.py"
      provides: "Integration tests verifying module wiring"
      contains: "test_"
  key_links:
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/config/loader.py"
      via: "loads pipeline config from YAML"
      pattern: "load_config"
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/gene_mapping/mapper.py"
      via: "maps gene IDs using GeneMapper"
      pattern: "GeneMapper"
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "saves results to DuckDB with checkpoint"
      pattern: "PipelineStore"
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/persistence/provenance.py"
      via: "tracks provenance for setup step"
      pattern: "ProvenanceTracker"
    - from: "src/usher_pipeline/cli/main.py"
      to: "src/usher_pipeline/cli/setup_cmd.py"
      via: "registers setup as click subcommand"
      pattern: "cli\\.add_command|@cli\\.command"
---

<objective>
Wire all infrastructure modules together behind a CLI entry point, with integration tests verifying that the complete data infrastructure works end-to-end.

Purpose: Individual modules (config, gene mapping, persistence, provenance) must work together as a cohesive system. This plan creates the click-based CLI interface, implements the `setup` subcommand that exercises the full pipeline flow, and adds integration tests proving the wiring is correct. This is the "it all works together" plan.

Output: A working CLI whose setup command demonstrates the full infrastructure flow, plus integration tests.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-data-infrastructure/01-RESEARCH.md
@.planning/phases/01-data-infrastructure/01-01-SUMMARY.md
@.planning/phases/01-data-infrastructure/01-02-SUMMARY.md
@.planning/phases/01-data-infrastructure/01-03-SUMMARY.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Create CLI entry point with setup command</name>
<files>
src/usher_pipeline/cli/__init__.py
src/usher_pipeline/cli/main.py
src/usher_pipeline/cli/setup_cmd.py
.gitignore
</files>
<action>
1. Create `src/usher_pipeline/cli/main.py` (minimal sketch after this list):
   - Use click to create a command group: `@click.group()` named `cli`.
   - Add a `--config` option (default "config/default.yaml") and a `--verbose` flag to the group.
   - Pass config_path and verbose to the context (ctx.obj) for subcommands.
   - Add an `info` subcommand that prints pipeline version, config path, config hash, and data source versions.
   - Register setup_cmd.

2. Create `src/usher_pipeline/cli/setup_cmd.py` (sketch after this list):
   - `@click.command('setup')` subcommand with a `--force` flag to re-run even if checkpoints exist.
   - Implementation flow:
     a. Load config via load_config(config_path).
     b. Create PipelineStore from config.
     c. Create ProvenanceTracker from config.
     d. Check checkpoint: if store.has_checkpoint('gene_universe') and not force, print "Gene universe already loaded (use --force to re-fetch)" and skip to validation.
     e. If no checkpoint or force: call fetch_protein_coding_genes(config.versions.ensembl_release) to get the gene universe.
     f. Run validate_gene_universe() on the result. If validation fails, print the error and exit(1).
     g. Create GeneMapper, call map_ensembl_ids() on the gene universe.
     h. Run MappingValidator.validate() on the mapping report. Print success rates. If validation fails, save the unmapped report and exit(1) with a clear message about the low mapping rate.
     i. Save the gene universe to DuckDB as the 'gene_universe' table (polars DataFrame with columns: ensembl_id, hgnc_symbol, uniprot_accession).
     j. Record provenance steps: "fetch_gene_universe", "map_gene_ids", "validate_mapping".
     k. Save the provenance sidecar to data_dir / "setup.provenance.json".
     l. Print a summary: gene count, HGNC mapping rate, UniProt mapping rate, DuckDB path, provenance path.
   - Use click.echo for output and click.style for colored status (green=OK, yellow=warn, red=fail).

3. Update `pyproject.toml` to add the CLI entry point (expanded snippet after this list): `[project.scripts] usher-pipeline = "usher_pipeline.cli.main:cli"`.

4. Create `.gitignore` with: data/, *.duckdb, *.duckdb.wal, __pycache__/, *.pyc, .pytest_cache/, *.egg-info/, dist/, build/, .eggs/, *.provenance.json (not in data/).

5. Create `src/usher_pipeline/cli/__init__.py`.
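
A minimal sketch of the `main.py` wiring, assuming only what the steps above and the other plan summaries specify; the `usher_pipeline.__version__` attribute and the exact accessors on the loaded config (`config_hash()`, `versions`) are assumptions that may need adjusting:

```python
# src/usher_pipeline/cli/main.py -- minimal sketch, not the final implementation.
import click

from usher_pipeline.cli.setup_cmd import setup


@click.group()
@click.option("--config", "config_path", default="config/default.yaml",
              show_default=True, help="Path to the pipeline YAML config.")
@click.option("--verbose", is_flag=True, help="Enable verbose output.")
@click.pass_context
def cli(ctx: click.Context, config_path: str, verbose: bool) -> None:
    """usher-pipeline: data infrastructure commands."""
    # Stash shared options so subcommands can read them from ctx.obj.
    ctx.ensure_object(dict)
    ctx.obj["config_path"] = config_path
    ctx.obj["verbose"] = verbose


@cli.command("info")
@click.pass_context
def info(ctx: click.Context) -> None:
    """Print pipeline version, config path/hash, and data source versions."""
    from usher_pipeline import __version__            # assumed package attribute
    from usher_pipeline.config.loader import load_config

    config = load_config(ctx.obj["config_path"])
    click.echo(f"usher-pipeline {__version__}")
    click.echo(f"config: {ctx.obj['config_path']} (hash {config.config_hash()})")
    click.echo(f"data source versions: {config.versions}")


cli.add_command(setup)
```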
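
And a sketch of the setup flow. Import locations for fetch_protein_coding_genes, validate_gene_universe, and MappingValidator, the store/tracker/mapper method names (`save_table`, `record_step`, `save_sidecar`, `save_unmapped`), and the `config.data_dir` attribute are all assumptions to be replaced with the real APIs from plans 01-01 through 01-03. On a checkpoint hit this sketch simply returns; the plan's "skip to validation" branch would instead reload the saved table and re-validate it:

```python
# src/usher_pipeline/cli/setup_cmd.py -- illustrative sketch only.
import click

from usher_pipeline.config.loader import load_config
from usher_pipeline.gene_mapping.mapper import GeneMapper
from usher_pipeline.persistence.duckdb_store import PipelineStore
from usher_pipeline.persistence.provenance import ProvenanceTracker

# Assumed import locations -- adjust to wherever these actually live:
from usher_pipeline.gene_mapping.mapper import (
    MappingValidator, fetch_protein_coding_genes, validate_gene_universe)


@click.command("setup")
@click.option("--force", is_flag=True, help="Re-fetch even if checkpoints exist.")
@click.pass_context
def setup(ctx: click.Context, force: bool) -> None:
    """Fetch the gene universe, map IDs, validate, and persist with provenance."""
    config = load_config(ctx.obj["config_path"])                          # a
    store = PipelineStore(config)                                         # b (assumed constructor)
    tracker = ProvenanceTracker(config)                                   # c (assumed constructor)

    if store.has_checkpoint("gene_universe") and not force:               # d
        click.echo(click.style(
            "Gene universe already loaded (use --force to re-fetch)", fg="yellow"))
        return

    genes = fetch_protein_coding_genes(config.versions.ensembl_release)   # e
    if not validate_gene_universe(genes):                                 # f
        click.echo(click.style("Gene universe validation failed", fg="red"))
        raise SystemExit(1)

    mapper = GeneMapper()                                                 # g
    mapped, report = mapper.map_ensembl_ids(genes)                        #   (assumed return shape)

    if not MappingValidator().validate(report):                           # h
        report.save_unmapped(config.data_dir / "unmapped_genes.tsv")      #   (assumed helper and path)
        click.echo(click.style(
            "Mapping rate below threshold; unmapped report saved", fg="red"))
        raise SystemExit(1)

    store.save_table("gene_universe", mapped)                             # i (assumed method)
    for step in ("fetch_gene_universe", "map_gene_ids", "validate_mapping"):
        tracker.record_step(step)                                         # j (assumed method)
    tracker.save_sidecar(config.data_dir / "setup.provenance.json")       # k (assumed method)

    click.echo(click.style(f"OK: {mapped.height} genes saved to DuckDB", fg="green"))  # l
    # The real summary should also print HGNC/UniProt mapping rates and the DuckDB/provenance paths.
```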
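
Step 3's entry point, written out as the pyproject.toml table it expands to:

```toml
[project.scripts]
usher-pipeline = "usher_pipeline.cli.main:cli"
```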
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && pip install -e ".[dev]" && usher-pipeline --help && usher-pipeline info --config config/default.yaml
</verify>
<done>
- `usher-pipeline --help` shows available commands (setup, info)
- `usher-pipeline info` displays version, config hash, data source versions
- Setup command implements full flow: config -> gene universe -> mapping -> validation -> DuckDB -> provenance
- Checkpoint-restart: setup skips re-fetch if gene_universe table exists
- .gitignore excludes data files and build artifacts
</done>
</task>

<task type="auto">
<name>Task 2: Create integration tests verifying module wiring</name>
<files>
tests/test_integration.py
</files>
<action>
1. Create `tests/test_integration.py` with integration tests. These tests verify module wiring without calling real external APIs (mock mygene). Sketches of the checkpoint-skip and CLI tests follow this list.

   - test_config_to_store_roundtrip: Load config from default.yaml, create PipelineStore with a tmp duckdb path, save a test DataFrame, verify the checkpoint exists, load it back, verify the data matches. Tests config -> persistence wiring.

   - test_config_to_provenance: Load config, create ProvenanceTracker, record steps, save the sidecar to a tmp dir, verify the sidecar file exists and contains a config_hash matching config.config_hash(). Tests config -> provenance wiring.

   - test_full_setup_flow_mocked: Mock mygene.MyGeneInfo to return a small set of 5 fake protein-coding genes with valid Ensembl IDs, symbols, and UniProt accessions. Mock fetch_protein_coding_genes to return 5 ENSG IDs. Run the setup flow programmatically (not via the CLI): load config, create store, fetch universe (mocked), map IDs (mocked), validate, save to DuckDB, create provenance. Verify:
     a. gene_universe table exists in DuckDB with 5 rows
     b. DataFrame has columns: ensembl_id, hgnc_symbol, uniprot_accession
     c. Provenance sidecar exists with the correct structure
     d. has_checkpoint('gene_universe') returns True

   - test_checkpoint_skip_flow: After test_full_setup_flow_mocked, verify that running setup again detects the checkpoint and skips the re-fetch (mock fetch_protein_coding_genes, verify it is NOT called when a checkpoint exists).

   - test_setup_cli_help: Use click.testing.CliRunner to invoke `cli` with `['setup', '--help']`, verify exit_code=0 and that the output contains '--force' and '--config'.

   - test_info_cli: Use CliRunner to invoke `cli` with `['info', '--config', 'config/default.yaml']`, verify exit_code=0 and that the output contains the version string.

   Use the tmp_path fixture extensively. Create a conftest.py helper or fixtures within the file for shared setup (a mock config with tmp dirs).
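
A sketch of the checkpoint-skip test under the same assumptions as Task 1 (the patch target, the `mock_config` fixture, the store methods, and the `run_setup_flow` helper wrapping the programmatic setup flow are all hypothetical):

```python
# tests/test_integration.py -- illustrative sketch; adapt names to the real APIs.
from unittest.mock import patch

import polars as pl

from usher_pipeline.persistence.duckdb_store import PipelineStore
# run_setup_flow would be imported from wherever the programmatic setup flow
# lives (hypothetical helper, not part of the confirmed modules).

FAKE_UNIVERSE = pl.DataFrame({
    "ensembl_id": ["ENSG00000000001", "ENSG00000000002"],
    "hgnc_symbol": ["GENE1", "GENE2"],
    "uniprot_accession": ["P00001", "P00002"],
})


def test_checkpoint_skip_flow(tmp_path, mock_config):       # mock_config: hypothetical fixture
    store = PipelineStore(mock_config)                       # assumed constructor
    store.save_table("gene_universe", FAKE_UNIVERSE)         # assumed method; creates the checkpoint

    with patch("usher_pipeline.gene_mapping.mapper.fetch_protein_coding_genes") as fetch:  # assumed path
        run_setup_flow(mock_config, store, force=False)      # hypothetical helper for steps d-l
        fetch.assert_not_called()                            # checkpoint present => no re-fetch

    assert store.has_checkpoint("gene_universe")
```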
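
And the CLI smoke tests with click.testing.CliRunner. Note that with --config defined on the group (as in the main.py sketch), group options must precede the subcommand name in the argument list; if the `['info', '--config', ...]` ordering above is required, the subcommands would also need to accept --config themselves:

```python
from click.testing import CliRunner

from usher_pipeline.cli.main import cli


def test_setup_cli_help():
    runner = CliRunner()
    result = runner.invoke(cli, ["setup", "--help"])
    assert result.exit_code == 0
    assert "--force" in result.output


def test_info_cli():
    runner = CliRunner()
    # Group-level --config goes before the subcommand name.
    result = runner.invoke(cli, ["--config", "config/default.yaml", "info"])
    assert result.exit_code == 0
    assert "usher-pipeline" in result.output  # version line; exact text depends on main.py
```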
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_integration.py -v
</verify>
<done>
- All 6 integration tests pass
- Config -> PipelineStore -> ProvenanceTracker wiring verified
- Full setup flow works end-to-end with mocked API calls
- Checkpoint-restart verified: second run skips fetch
- CLI commands respond correctly
</done>
</task>

</tasks>

<verification>
1. `pytest tests/ -v` -- ALL tests pass (config, api_client, gene_mapping, persistence, integration)
2. `usher-pipeline --help` -- shows setup, info commands
3. `usher-pipeline info` -- displays pipeline version and config info
4. Full test suite covers: config validation, API client retry/cache, gene mapping with validation gates, DuckDB persistence with checkpoints, provenance tracking, module wiring
</verification>

<success_criteria>
- CLI entry point works with setup and info subcommands
- Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
- Checkpoint-restart works: existing DuckDB data skips re-downloading
- All integration tests pass verifying cross-module wiring
- Full test suite (all files) passes: `pytest tests/ -v`
</success_criteria>

<output>
After completion, create `.planning/phases/01-data-infrastructure/01-04-SUMMARY.md`
</output>