usher-exploring/.planning/phases/01-data-infrastructure/01-04-PLAN.md

---
phase: 01-data-infrastructure
plan: "04"
type: execute
wave: 3
depends_on: [01-01, 01-02, 01-03]
files_modified:
  - src/usher_pipeline/cli/__init__.py
  - src/usher_pipeline/cli/main.py
  - src/usher_pipeline/cli/setup_cmd.py
  - tests/test_integration.py
  - .gitignore
autonomous: true
must_haves:
  truths:
    - CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands
    - Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance
    - "All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance"
    - "Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching"
  artifacts:
    - path: src/usher_pipeline/cli/main.py
      provides: CLI entry point with click command group
      contains: def cli
    - path: src/usher_pipeline/cli/setup_cmd.py
      provides: Setup command wiring config, gene mapping, persistence, provenance
      contains: def setup
    - path: tests/test_integration.py
      provides: Integration tests verifying module wiring
      contains: test_
  key_links:
    - from: src/usher_pipeline/cli/setup_cmd.py
      to: src/usher_pipeline/config/loader.py
      via: loads pipeline config from YAML
      pattern: load_config
    - from: src/usher_pipeline/cli/setup_cmd.py
      to: src/usher_pipeline/gene_mapping/mapper.py
      via: maps gene IDs using GeneMapper
      pattern: GeneMapper
    - from: src/usher_pipeline/cli/setup_cmd.py
      to: src/usher_pipeline/persistence/duckdb_store.py
      via: saves results to DuckDB with checkpoint
      pattern: PipelineStore
    - from: src/usher_pipeline/cli/setup_cmd.py
      to: src/usher_pipeline/persistence/provenance.py
      via: tracks provenance for setup step
      pattern: ProvenanceTracker
    - from: src/usher_pipeline/cli/main.py
      to: src/usher_pipeline/cli/setup_cmd.py
      via: registers setup as click subcommand
      pattern: cli.add_command|@cli.command
---
Wire all infrastructure modules together with a CLI entry point and integration tests to verify the complete data infrastructure works end-to-end.

Purpose: Individual modules (config, gene mapping, persistence, provenance) must work together as a cohesive system. This plan creates the CLI interface (click-based), implements the setup subcommand that exercises the full pipeline flow, and adds integration tests proving the wiring is correct. This is the "it all works together" plan.

Output: Working CLI with setup command that demonstrates full infrastructure flow, integration tests.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/01-data-infrastructure/01-RESEARCH.md @.planning/phases/01-data-infrastructure/01-01-SUMMARY.md @.planning/phases/01-data-infrastructure/01-02-SUMMARY.md @.planning/phases/01-data-infrastructure/01-03-SUMMARY.md

Task 1: Create CLI entry point with setup command

Files: src/usher_pipeline/cli/__init__.py, src/usher_pipeline/cli/main.py, src/usher_pipeline/cli/setup_cmd.py, .gitignore

1. Create `src/usher_pipeline/cli/main.py`:
   - Use click to create a command group: `@click.group()` named `cli`.
   - Add a `--config` option (default "config/default.yaml") and a `--verbose` flag to the group.
   - Pass config_path and verbose to the context (ctx.obj) for subcommands.
   - Add an `info` subcommand that prints the pipeline version, config path, config hash, and data source versions.
   - Register setup_cmd.
2. Create `src/usher_pipeline/cli/setup_cmd.py`:
   - `@click.command('setup')` subcommand with `--force` flag to re-run even if checkpoints exist.
   - Implementation flow:
     a. Load config via load_config(config_path).
     b. Create PipelineStore from config.
     c. Create ProvenanceTracker from config.
     d. Check checkpoint: if store.has_checkpoint('gene_universe') and not force, print "Gene universe already loaded (use --force to re-fetch)" and skip to validation.
     e. If no checkpoint or force: call fetch_protein_coding_genes(config.versions.ensembl_release) to get gene universe.
     f. Run validate_gene_universe() on the result. If validation fails, print error and exit(1).
     g. Create GeneMapper, call map_ensembl_ids() on gene universe.
     h. Run MappingValidator.validate() on mapping report. Print success rates. If validation fails, save unmapped report and exit(1) with clear message about low mapping rate.
     i. Save gene universe to DuckDB as 'gene_universe' table (polars DataFrame with columns: ensembl_id, hgnc_symbol, uniprot_accession).
     j. Record provenance steps: "fetch_gene_universe", "map_gene_ids", "validate_mapping".
     k. Save provenance sidecar to data_dir / "setup.provenance.json".
     l. Print summary: gene count, HGNC mapping rate, UniProt mapping rate, DuckDB path, provenance path.
   - Use click.echo for output, click.style for colored status (green=OK, yellow=warn, red=fail).

3. Update `pyproject.toml` to add CLI entry point: `[project.scripts] usher-pipeline = "usher_pipeline.cli.main:cli"`.

4. Create `.gitignore` with: data/, *.duckdb, *.duckdb.wal, __pycache__/, *.pyc, .pytest_cache/, *.egg-info/, dist/, build/, .eggs/, *.provenance.json (not in data/).

5. Create `src/usher_pipeline/cli/__init__.py`.
Verify: `cd /Users/gbanyan/Project/usher-exploring && pip install -e ".[dev]" && usher-pipeline --help && usher-pipeline info --config config/default.yaml`

- `usher-pipeline --help` shows available commands (setup, info)
- `usher-pipeline info` displays version, config hash, data source versions
- Setup command implements the full flow: config -> gene universe -> mapping -> validation -> DuckDB -> provenance
- Checkpoint-restart: setup skips re-fetch if the gene_universe table exists
- .gitignore excludes data files and build artifacts

Task 2: Create integration tests verifying module wiring

Files: tests/test_integration.py

1. Create `tests/test_integration.py` with integration tests. These tests verify module wiring without calling real external APIs (mock mygene):
   - test_config_to_store_roundtrip: Load config from default.yaml, create PipelineStore with tmp duckdb path, save a test DataFrame, verify checkpoint exists, load back, verify data matches. Tests config -> persistence wiring.

   - test_config_to_provenance: Load config, create ProvenanceTracker, record steps, save sidecar to tmp dir, verify sidecar file exists and contains config_hash matching config.config_hash(). Tests config -> provenance wiring.

   - test_full_setup_flow_mocked: Mock mygene.MyGeneInfo to return a small set of 5 fake protein-coding genes with valid Ensembl IDs, symbols, and UniProt accessions. Mock fetch_protein_coding_genes to return 5 ENSG IDs. Run the setup flow programmatically (not via CLI): load config, create store, fetch universe (mocked), map IDs (mocked), validate, save to DuckDB, create provenance. Verify:
     a. gene_universe table exists in DuckDB with 5 rows
     b. DataFrame has columns: ensembl_id, hgnc_symbol, uniprot_accession
     c. Provenance sidecar exists with correct structure
     d. has_checkpoint('gene_universe') returns True

   - test_checkpoint_skip_flow: After test_full_setup_flow_mocked, verify that running setup again detects checkpoint and skips re-fetch (mock fetch_protein_coding_genes, verify it is NOT called when checkpoint exists).

   - test_setup_cli_help: Use click.testing.CliRunner to invoke `cli` with `['setup', '--help']`, verify exit_code=0 and output contains '--force' and '--config'.

   - test_info_cli: Use CliRunner to invoke `cli` with `['info', '--config', 'config/default.yaml']`, verify exit_code=0 and output contains version string.

Use tmp_path fixture extensively. Create a conftest.py helper or fixtures within the file for shared setup (mock config with tmp dirs).
Verify: `cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_integration.py -v`

- All 6 integration tests pass
- Config -> PipelineStore -> ProvenanceTracker wiring verified
- Full setup flow works end-to-end with mocked API calls
- Checkpoint-restart verified: second run skips fetch
- CLI commands respond correctly

Final verification:

1. `pytest tests/ -v` -- ALL tests pass (config, api_client, gene_mapping, persistence, integration)
2. `usher-pipeline --help` -- shows setup, info commands
3. `usher-pipeline info` -- displays pipeline version and config info
4. Full test suite covers: config validation, API client retry/cache, gene mapping with validation gates, DuckDB persistence with checkpoints, provenance tracking, module wiring
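The checkpoint-skip assertion at the heart of test_checkpoint_skip_flow can be sketched with stdlib mocks. Here `run_setup` is a stand-in for the real setup flow (the actual code lives in setup_cmd.py), and the `store`/`fetch_fn` names mirror this plan's assumed `PipelineStore.has_checkpoint` and `fetch_protein_coding_genes` interfaces:

```python
# Sketch: verify the fetch step is skipped when a checkpoint exists,
# using unittest.mock stand-ins instead of the real modules.
from unittest.mock import MagicMock


def run_setup(store, fetch_fn, force=False):
    # Mirrors steps (d)/(e) of the setup flow: skip the fetch when the
    # gene_universe checkpoint already exists and --force was not given.
    if store.has_checkpoint("gene_universe") and not force:
        return "skipped"
    fetch_fn()
    return "fetched"


store = MagicMock()
fetch = MagicMock()

# Checkpoint present: fetch must NOT be called.
store.has_checkpoint.return_value = True
assert run_setup(store, fetch) == "skipped"
fetch.assert_not_called()

# No checkpoint: fetch runs exactly once.
store.has_checkpoint.return_value = False
assert run_setup(store, fetch) == "fetched"
fetch.assert_called_once()
```

The same pattern, with `unittest.mock.patch` targeting the real fetch function, gives the "verify it is NOT called when checkpoint exists" assertion from the test list above.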

<success_criteria>

  • CLI entry point works with setup and info subcommands
  • Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
  • Checkpoint-restart works: existing DuckDB data skips re-downloading
  • All integration tests pass verifying cross-module wiring
  • Full test suite (all files) passes: pytest tests/ -v

</success_criteria>
After completion, create `.planning/phases/01-data-infrastructure/01-04-SUMMARY.md`