---
phase: 01-data-infrastructure
plan: 04
type: execute
wave: 3
depends_on: ["01-01", "01-02", "01-03"]
files_modified:
  - src/usher_pipeline/cli/__init__.py
  - src/usher_pipeline/cli/main.py
  - src/usher_pipeline/cli/setup_cmd.py
  - tests/test_integration.py
  - .gitignore
autonomous: true

must_haves:
  truths:
    - "CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands"
    - "Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance"
    - "All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance"
    - "Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching"
  artifacts:
    - path: "src/usher_pipeline/cli/main.py"
      provides: "CLI entry point with click command group"
      contains: "def cli"
    - path: "src/usher_pipeline/cli/setup_cmd.py"
      provides: "Setup command wiring config, gene mapping, persistence, provenance"
      contains: "def setup"
    - path: "tests/test_integration.py"
      provides: "Integration tests verifying module wiring"
      contains: "test_"
  key_links:
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/config/loader.py"
      via: "loads pipeline config from YAML"
      pattern: "load_config"
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/gene_mapping/mapper.py"
      via: "maps gene IDs using GeneMapper"
      pattern: "GeneMapper"
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "saves results to DuckDB with checkpoint"
      pattern: "PipelineStore"
    - from: "src/usher_pipeline/cli/setup_cmd.py"
      to: "src/usher_pipeline/persistence/provenance.py"
      via: "tracks provenance for setup step"
      pattern: "ProvenanceTracker"
    - from: "src/usher_pipeline/cli/main.py"
      to: "src/usher_pipeline/cli/setup_cmd.py"
      via: "registers setup as click subcommand"
      pattern: "cli\\.add_command|@cli\\.command"
---

<objective>
Wire all infrastructure modules together behind a CLI entry point, with integration tests verifying that the complete data infrastructure works end-to-end.

Purpose: Individual modules (config, gene mapping, persistence, provenance) must work together as a cohesive system. This plan creates the click-based CLI interface, implements the `setup` subcommand that exercises the full pipeline flow, and adds integration tests proving the wiring is correct. This is the "it all works together" plan.

Output: A working CLI whose setup command demonstrates the full infrastructure flow, plus integration tests.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-data-infrastructure/01-RESEARCH.md
@.planning/phases/01-data-infrastructure/01-01-SUMMARY.md
@.planning/phases/01-data-infrastructure/01-02-SUMMARY.md
@.planning/phases/01-data-infrastructure/01-03-SUMMARY.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Create CLI entry point with setup command</name>
<files>
src/usher_pipeline/cli/__init__.py
src/usher_pipeline/cli/main.py
src/usher_pipeline/cli/setup_cmd.py
.gitignore
</files>
<action>
1. Create `src/usher_pipeline/cli/main.py` (minimal sketch after this list):
   - Use click to create a command group: `@click.group()` named `cli`.
   - Add a `--config` option (default "config/default.yaml") and a `--verbose` flag to the group.
   - Pass config_path and verbose to the context (ctx.obj) for subcommands.
   - Add an `info` subcommand that prints pipeline version, config path, config hash, and data source versions.
   - Register setup_cmd.

2. Create `src/usher_pipeline/cli/setup_cmd.py` (sketch after this list):
   - `@click.command('setup')` subcommand with a `--force` flag to re-run even if checkpoints exist.
   - Implementation flow:
     a. Load config via load_config(config_path).
     b. Create PipelineStore from config.
     c. Create ProvenanceTracker from config.
     d. Check checkpoint: if store.has_checkpoint('gene_universe') and not force, print "Gene universe already loaded (use --force to re-fetch)" and skip to validation.
     e. If no checkpoint or force: call fetch_protein_coding_genes(config.versions.ensembl_release) to get the gene universe.
     f. Run validate_gene_universe() on the result. If validation fails, print the error and exit(1).
     g. Create GeneMapper, call map_ensembl_ids() on the gene universe.
     h. Run MappingValidator.validate() on the mapping report. Print success rates. If validation fails, save the unmapped report and exit(1) with a clear message about the low mapping rate.
     i. Save the gene universe to DuckDB as the 'gene_universe' table (polars DataFrame with columns: ensembl_id, hgnc_symbol, uniprot_accession).
     j. Record provenance steps: "fetch_gene_universe", "map_gene_ids", "validate_mapping".
     k. Save the provenance sidecar to data_dir / "setup.provenance.json".
     l. Print a summary: gene count, HGNC mapping rate, UniProt mapping rate, DuckDB path, provenance path.
   - Use click.echo for output and click.style for colored status (green=OK, yellow=warn, red=fail).

3. Update `pyproject.toml` to add the CLI entry point (expanded snippet after this list): `[project.scripts] usher-pipeline = "usher_pipeline.cli.main:cli"`.

4. Create `.gitignore` with: data/, *.duckdb, *.duckdb.wal, __pycache__/, *.pyc, .pytest_cache/, *.egg-info/, dist/, build/, .eggs/, *.provenance.json (not in data/).

5. Create `src/usher_pipeline/cli/__init__.py`.
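
A minimal sketch of the `main.py` wiring, assuming only what the steps above and the other plan summaries specify; the `usher_pipeline.__version__` attribute and the exact accessors on the loaded config (`config_hash()`, `versions`) are assumptions that may need adjusting:

```python
# src/usher_pipeline/cli/main.py -- minimal sketch, not the final implementation.
import click

from usher_pipeline.cli.setup_cmd import setup


@click.group()
@click.option("--config", "config_path", default="config/default.yaml",
              show_default=True, help="Path to the pipeline YAML config.")
@click.option("--verbose", is_flag=True, help="Enable verbose output.")
@click.pass_context
def cli(ctx: click.Context, config_path: str, verbose: bool) -> None:
    """usher-pipeline: data infrastructure commands."""
    # Stash shared options so subcommands can read them from ctx.obj.
    ctx.ensure_object(dict)
    ctx.obj["config_path"] = config_path
    ctx.obj["verbose"] = verbose


@cli.command("info")
@click.pass_context
def info(ctx: click.Context) -> None:
    """Print pipeline version, config path/hash, and data source versions."""
    from usher_pipeline import __version__            # assumed package attribute
    from usher_pipeline.config.loader import load_config

    config = load_config(ctx.obj["config_path"])
    click.echo(f"usher-pipeline {__version__}")
    click.echo(f"config: {ctx.obj['config_path']} (hash {config.config_hash()})")
    click.echo(f"data source versions: {config.versions}")


cli.add_command(setup)
```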
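
And a sketch of the setup flow. Import locations for fetch_protein_coding_genes, validate_gene_universe, and MappingValidator, the store/tracker/mapper method names (`save_table`, `record_step`, `save_sidecar`, `save_unmapped`), and the `config.data_dir` attribute are all assumptions to be replaced with the real APIs from plans 01-01 through 01-03. On a checkpoint hit this sketch simply returns; the plan's "skip to validation" branch would instead reload the saved table and re-validate it:

```python
# src/usher_pipeline/cli/setup_cmd.py -- illustrative sketch only.
import click

from usher_pipeline.config.loader import load_config
from usher_pipeline.gene_mapping.mapper import GeneMapper
from usher_pipeline.persistence.duckdb_store import PipelineStore
from usher_pipeline.persistence.provenance import ProvenanceTracker

# Assumed import locations -- adjust to wherever these actually live:
from usher_pipeline.gene_mapping.mapper import (
    MappingValidator, fetch_protein_coding_genes, validate_gene_universe)


@click.command("setup")
@click.option("--force", is_flag=True, help="Re-fetch even if checkpoints exist.")
@click.pass_context
def setup(ctx: click.Context, force: bool) -> None:
    """Fetch the gene universe, map IDs, validate, and persist with provenance."""
    config = load_config(ctx.obj["config_path"])                          # a
    store = PipelineStore(config)                                         # b (assumed constructor)
    tracker = ProvenanceTracker(config)                                   # c (assumed constructor)

    if store.has_checkpoint("gene_universe") and not force:               # d
        click.echo(click.style(
            "Gene universe already loaded (use --force to re-fetch)", fg="yellow"))
        return

    genes = fetch_protein_coding_genes(config.versions.ensembl_release)   # e
    if not validate_gene_universe(genes):                                 # f
        click.echo(click.style("Gene universe validation failed", fg="red"))
        raise SystemExit(1)

    mapper = GeneMapper()                                                 # g
    mapped, report = mapper.map_ensembl_ids(genes)                        #   (assumed return shape)

    if not MappingValidator().validate(report):                           # h
        report.save_unmapped(config.data_dir / "unmapped_genes.tsv")      #   (assumed helper and path)
        click.echo(click.style(
            "Mapping rate below threshold; unmapped report saved", fg="red"))
        raise SystemExit(1)

    store.save_table("gene_universe", mapped)                             # i (assumed method)
    for step in ("fetch_gene_universe", "map_gene_ids", "validate_mapping"):
        tracker.record_step(step)                                         # j (assumed method)
    tracker.save_sidecar(config.data_dir / "setup.provenance.json")       # k (assumed method)

    click.echo(click.style(f"OK: {mapped.height} genes saved to DuckDB", fg="green"))  # l
    # The real summary should also print HGNC/UniProt mapping rates and the DuckDB/provenance paths.
```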
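
Step 3's entry point, written out as the pyproject.toml table it expands to:

```toml
[project.scripts]
usher-pipeline = "usher_pipeline.cli.main:cli"
```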
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && pip install -e ".[dev]" && usher-pipeline --help && usher-pipeline info --config config/default.yaml
</verify>
<done>
- `usher-pipeline --help` shows available commands (setup, info)
- `usher-pipeline info` displays version, config hash, data source versions
- Setup command implements full flow: config -> gene universe -> mapping -> validation -> DuckDB -> provenance
- Checkpoint-restart: setup skips re-fetch if gene_universe table exists
- .gitignore excludes data files and build artifacts
</done>
</task>

<task type="auto">
<name>Task 2: Create integration tests verifying module wiring</name>
<files>
tests/test_integration.py
</files>
<action>
1. Create `tests/test_integration.py` with integration tests. These tests verify module wiring without calling real external APIs (mock mygene). Sketches of the checkpoint-skip and CLI tests follow this list.

   - test_config_to_store_roundtrip: Load config from default.yaml, create PipelineStore with a tmp duckdb path, save a test DataFrame, verify the checkpoint exists, load it back, verify the data matches. Tests config -> persistence wiring.

   - test_config_to_provenance: Load config, create ProvenanceTracker, record steps, save the sidecar to a tmp dir, verify the sidecar file exists and contains a config_hash matching config.config_hash(). Tests config -> provenance wiring.

   - test_full_setup_flow_mocked: Mock mygene.MyGeneInfo to return a small set of 5 fake protein-coding genes with valid Ensembl IDs, symbols, and UniProt accessions. Mock fetch_protein_coding_genes to return 5 ENSG IDs. Run the setup flow programmatically (not via the CLI): load config, create store, fetch universe (mocked), map IDs (mocked), validate, save to DuckDB, create provenance. Verify:
     a. gene_universe table exists in DuckDB with 5 rows
     b. DataFrame has columns: ensembl_id, hgnc_symbol, uniprot_accession
     c. Provenance sidecar exists with the correct structure
     d. has_checkpoint('gene_universe') returns True

   - test_checkpoint_skip_flow: After test_full_setup_flow_mocked, verify that running setup again detects the checkpoint and skips the re-fetch (mock fetch_protein_coding_genes, verify it is NOT called when a checkpoint exists).

   - test_setup_cli_help: Use click.testing.CliRunner to invoke `cli` with `['setup', '--help']`, verify exit_code=0 and that the output contains '--force' and '--config'.

   - test_info_cli: Use CliRunner to invoke `cli` with `['info', '--config', 'config/default.yaml']`, verify exit_code=0 and that the output contains the version string.

   Use the tmp_path fixture extensively. Create a conftest.py helper or fixtures within the file for shared setup (a mock config with tmp dirs).
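
A sketch of the checkpoint-skip test under the same assumptions as Task 1 (the patch target, the `mock_config` fixture, the store methods, and the `run_setup_flow` helper wrapping the programmatic setup flow are all hypothetical):

```python
# tests/test_integration.py -- illustrative sketch; adapt names to the real APIs.
from unittest.mock import patch

import polars as pl

from usher_pipeline.persistence.duckdb_store import PipelineStore
# run_setup_flow would be imported from wherever the programmatic setup flow
# lives (hypothetical helper, not part of the confirmed modules).

FAKE_UNIVERSE = pl.DataFrame({
    "ensembl_id": ["ENSG00000000001", "ENSG00000000002"],
    "hgnc_symbol": ["GENE1", "GENE2"],
    "uniprot_accession": ["P00001", "P00002"],
})


def test_checkpoint_skip_flow(tmp_path, mock_config):       # mock_config: hypothetical fixture
    store = PipelineStore(mock_config)                       # assumed constructor
    store.save_table("gene_universe", FAKE_UNIVERSE)         # assumed method; creates the checkpoint

    with patch("usher_pipeline.gene_mapping.mapper.fetch_protein_coding_genes") as fetch:  # assumed path
        run_setup_flow(mock_config, store, force=False)      # hypothetical helper for steps d-l
        fetch.assert_not_called()                            # checkpoint present => no re-fetch

    assert store.has_checkpoint("gene_universe")
```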
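
And the CLI smoke tests with click.testing.CliRunner. Note that with --config defined on the group (as in the main.py sketch), group options must precede the subcommand name in the argument list; if the `['info', '--config', ...]` ordering above is required, the subcommands would also need to accept --config themselves:

```python
from click.testing import CliRunner

from usher_pipeline.cli.main import cli


def test_setup_cli_help():
    runner = CliRunner()
    result = runner.invoke(cli, ["setup", "--help"])
    assert result.exit_code == 0
    assert "--force" in result.output


def test_info_cli():
    runner = CliRunner()
    # Group-level --config goes before the subcommand name.
    result = runner.invoke(cli, ["--config", "config/default.yaml", "info"])
    assert result.exit_code == 0
    assert "usher-pipeline" in result.output  # version line; exact text depends on main.py
```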
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_integration.py -v
</verify>
<done>
- All 6 integration tests pass
- Config -> PipelineStore -> ProvenanceTracker wiring verified
- Full setup flow works end-to-end with mocked API calls
- Checkpoint-restart verified: second run skips fetch
- CLI commands respond correctly
</done>
</task>

</tasks>

<verification>
1. `pytest tests/ -v` -- ALL tests pass (config, api_client, gene_mapping, persistence, integration)
2. `usher-pipeline --help` -- shows setup, info commands
3. `usher-pipeline info` -- displays pipeline version and config info
4. Full test suite covers: config validation, API client retry/cache, gene mapping with validation gates, DuckDB persistence with checkpoints, provenance tracking, module wiring
</verification>

<success_criteria>
- CLI entry point works with setup and info subcommands
- Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
- Checkpoint-restart works: existing DuckDB data skips re-downloading
- All integration tests pass verifying cross-module wiring
- Full test suite (all files) passes: `pytest tests/ -v`
</success_criteria>

<output>
After completion, create `.planning/phases/01-data-infrastructure/01-04-SUMMARY.md`
</output>