usher-exploring/.planning/phases/01-data-infrastructure/01-04-PLAN.md

---
phase: 01-data-infrastructure
plan: "04"
type: execute
wave: 3
depends_on: [01-01, 01-02, 01-03]
files_modified:
  - src/usher_pipeline/cli/__init__.py
  - src/usher_pipeline/cli/main.py
  - src/usher_pipeline/cli/setup_cmd.py
  - tests/test_integration.py
  - .gitignore
autonomous: true
must_haves:
  truths:
    - CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands
    - Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance
    - "All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance"
    - "Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching"
  artifacts:
    - path: src/usher_pipeline/cli/main.py
      provides: CLI entry point with click command group
      contains: def cli
    - path: src/usher_pipeline/cli/setup_cmd.py
      provides: Setup command wiring config, gene mapping, persistence, provenance
      contains: def setup
    - path: tests/test_integration.py
      provides: Integration tests verifying module wiring
      contains: test_
  key_links:
    - from: src/usher_pipeline/cli/setup_cmd.py
      to: src/usher_pipeline/config/loader.py
      via: loads pipeline config from YAML
      pattern: load_config
    - from: src/usher_pipeline/cli/setup_cmd.py
      to: src/usher_pipeline/gene_mapping/mapper.py
      via: maps gene IDs using GeneMapper
      pattern: GeneMapper
    - from: src/usher_pipeline/cli/setup_cmd.py
      to: src/usher_pipeline/persistence/duckdb_store.py
      via: saves results to DuckDB with checkpoint
      pattern: PipelineStore
    - from: src/usher_pipeline/cli/setup_cmd.py
      to: src/usher_pipeline/persistence/provenance.py
      via: tracks provenance for setup step
      pattern: ProvenanceTracker
    - from: src/usher_pipeline/cli/main.py
      to: src/usher_pipeline/cli/setup_cmd.py
      via: registers setup as click subcommand
      pattern: cli.add_command|@cli.command
---
Wire all infrastructure modules together with a CLI entry point and integration tests to verify the complete data infrastructure works end-to-end.

Purpose: Individual modules (config, gene mapping, persistence, provenance) must work together as a cohesive system. This plan creates the CLI interface (click-based), implements the setup subcommand that exercises the full pipeline flow, and adds integration tests proving the wiring is correct. This is the "it all works together" plan.

Output: Working CLI with setup command that demonstrates full infrastructure flow, integration tests.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/01-data-infrastructure/01-RESEARCH.md @.planning/phases/01-data-infrastructure/01-01-SUMMARY.md @.planning/phases/01-data-infrastructure/01-02-SUMMARY.md @.planning/phases/01-data-infrastructure/01-03-SUMMARY.md

Task 1: Create CLI entry point with setup command

Files: src/usher_pipeline/cli/__init__.py, src/usher_pipeline/cli/main.py, src/usher_pipeline/cli/setup_cmd.py, .gitignore

1. Create `src/usher_pipeline/cli/main.py`:
   - Use click to create a command group: `@click.group()` named `cli`.
   - Add a `--config` option (default "config/default.yaml") and a `--verbose` flag to the group.
   - Pass config_path and verbose to the context (ctx.obj) for subcommands.
   - Add an `info` subcommand that prints the pipeline version, config path, config hash, and data source versions.
   - Register setup_cmd.
2. Create `src/usher_pipeline/cli/setup_cmd.py`:
   - `@click.command('setup')` subcommand with `--force` flag to re-run even if checkpoints exist.
   - Implementation flow:
     a. Load config via load_config(config_path).
     b. Create PipelineStore from config.
     c. Create ProvenanceTracker from config.
     d. Check checkpoint: if store.has_checkpoint('gene_universe') and not force, print "Gene universe already loaded (use --force to re-fetch)" and skip to validation.
     e. If no checkpoint or force: call fetch_protein_coding_genes(config.versions.ensembl_release) to get gene universe.
     f. Run validate_gene_universe() on the result. If validation fails, print error and exit(1).
     g. Create GeneMapper, call map_ensembl_ids() on gene universe.
     h. Run MappingValidator.validate() on mapping report. Print success rates. If validation fails, save unmapped report and exit(1) with clear message about low mapping rate.
     i. Save gene universe to DuckDB as 'gene_universe' table (polars DataFrame with columns: ensembl_id, hgnc_symbol, uniprot_accession).
     j. Record provenance steps: "fetch_gene_universe", "map_gene_ids", "validate_mapping".
     k. Save provenance sidecar to data_dir / "setup.provenance.json".
     l. Print summary: gene count, HGNC mapping rate, UniProt mapping rate, DuckDB path, provenance path.
   - Use click.echo for output, click.style for colored status (green=OK, yellow=warn, red=fail).

3. Update `pyproject.toml` to add CLI entry point: `[project.scripts] usher-pipeline = "usher_pipeline.cli.main:cli"`.

4. Create `.gitignore` with: data/, *.duckdb, *.duckdb.wal, __pycache__/, *.pyc, .pytest_cache/, *.egg-info/, dist/, build/, .eggs/, *.provenance.json (not in data/).

5. Create `src/usher_pipeline/cli/__init__.py`.
Verify: `cd /Users/gbanyan/Project/usher-exploring && pip install -e ".[dev]" && usher-pipeline --help && usher-pipeline info --config config/default.yaml`

- `usher-pipeline --help` shows available commands (setup, info)
- `usher-pipeline info` displays version, config hash, data source versions
- Setup command implements the full flow: config -> gene universe -> mapping -> validation -> DuckDB -> provenance
- Checkpoint-restart: setup skips re-fetch if the gene_universe table exists
- .gitignore excludes data files and build artifacts

Task 2: Create integration tests verifying module wiring

Files: tests/test_integration.py

1. Create `tests/test_integration.py` with integration tests. These tests verify module wiring without calling real external APIs (mock mygene):
   - test_config_to_store_roundtrip: Load config from default.yaml, create PipelineStore with tmp duckdb path, save a test DataFrame, verify checkpoint exists, load back, verify data matches. Tests config -> persistence wiring.

   - test_config_to_provenance: Load config, create ProvenanceTracker, record steps, save sidecar to tmp dir, verify sidecar file exists and contains config_hash matching config.config_hash(). Tests config -> provenance wiring.

   - test_full_setup_flow_mocked: Mock mygene.MyGeneInfo to return a small set of 5 fake protein-coding genes with valid Ensembl IDs, symbols, and UniProt accessions. Mock fetch_protein_coding_genes to return 5 ENSG IDs. Run the setup flow programmatically (not via CLI): load config, create store, fetch universe (mocked), map IDs (mocked), validate, save to DuckDB, create provenance. Verify:
     a. gene_universe table exists in DuckDB with 5 rows
     b. DataFrame has columns: ensembl_id, hgnc_symbol, uniprot_accession
     c. Provenance sidecar exists with correct structure
     d. has_checkpoint('gene_universe') returns True

   - test_checkpoint_skip_flow: After test_full_setup_flow_mocked, verify that running setup again detects checkpoint and skips re-fetch (mock fetch_protein_coding_genes, verify it is NOT called when checkpoint exists).

   - test_setup_cli_help: Use click.testing.CliRunner to invoke `cli` with `['setup', '--help']`, verify exit_code=0 and output contains '--force' and '--config'.

   - test_info_cli: Use CliRunner to invoke `cli` with `['info', '--config', 'config/default.yaml']`, verify exit_code=0 and output contains version string.

Use tmp_path fixture extensively. Create a conftest.py helper or fixtures within the file for shared setup (mock config with tmp dirs).
Verify: `cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_integration.py -v`

- All 6 integration tests pass
- Config -> PipelineStore -> ProvenanceTracker wiring verified
- Full setup flow works end-to-end with mocked API calls
- Checkpoint-restart verified: second run skips fetch
- CLI commands respond correctly

Final verification:

1. `pytest tests/ -v` -- ALL tests pass (config, api_client, gene_mapping, persistence, integration)
2. `usher-pipeline --help` -- shows setup, info commands
3. `usher-pipeline info` -- displays pipeline version and config info
4. Full test suite covers: config validation, API client retry/cache, gene mapping with validation gates, DuckDB persistence with checkpoints, provenance tracking, module wiring
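The checkpoint-skip assertion at the heart of test_checkpoint_skip_flow can be sketched with stdlib mocks. Here `run_setup` is a stand-in for the real setup flow (the actual code lives in setup_cmd.py), and the `store`/`fetch_fn` names mirror this plan's assumed `PipelineStore.has_checkpoint` and `fetch_protein_coding_genes` interfaces:

```python
# Sketch: verify the fetch step is skipped when a checkpoint exists,
# using unittest.mock stand-ins instead of the real modules.
from unittest.mock import MagicMock


def run_setup(store, fetch_fn, force=False):
    # Mirrors steps (d)/(e) of the setup flow: skip the fetch when the
    # gene_universe checkpoint already exists and --force was not given.
    if store.has_checkpoint("gene_universe") and not force:
        return "skipped"
    fetch_fn()
    return "fetched"


store = MagicMock()
fetch = MagicMock()

# Checkpoint present: fetch must NOT be called.
store.has_checkpoint.return_value = True
assert run_setup(store, fetch) == "skipped"
fetch.assert_not_called()

# No checkpoint: fetch runs exactly once.
store.has_checkpoint.return_value = False
assert run_setup(store, fetch) == "fetched"
fetch.assert_called_once()
```

The same pattern, with `unittest.mock.patch` targeting the real fetch function, gives the "verify it is NOT called when checkpoint exists" assertion from the test list above.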

<success_criteria>

  • CLI entry point works with setup and info subcommands
  • Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
  • Checkpoint-restart works: existing DuckDB data skips re-downloading
  • All integration tests pass verifying cross-module wiring
  • Full test suite (all files) passes: pytest tests/ -v

</success_criteria>
After completion, create `.planning/phases/01-data-infrastructure/01-04-SUMMARY.md`