| phase | plan | subsystem | tags | dependency_graph | tech_stack | key_files | decisions | metrics |
|---|---|---|---|---|---|---|---|---|
| 01-data-infrastructure | 04 | integration | | | | | | |
Phase 01 Plan 04: CLI Integration and End-to-End Testing Summary
One-liner: Click-based CLI with setup command orchestrating full infrastructure flow (config -> fetch gene universe -> map IDs -> validate -> DuckDB persistence -> provenance) and 6 integration tests verifying cross-module wiring with mocked APIs.
What Was Built
Wired all infrastructure modules together with a CLI interface and integration tests:
- CLI Entry Point
  - Click command group with global options (--config, --verbose)
  - info command: displays pipeline version, config hash, data source versions, paths, API config
  - setup command: orchestrates full infrastructure flow
  - Colored output with status indicators (green=OK, yellow=warn, red=fail)
  - Entry point: usher-pipeline binary installed with package
- Setup Command Flow
  - Load config from YAML
  - Create PipelineStore and ProvenanceTracker from config
  - Check for existing checkpoint (gene_universe table in DuckDB)
  - If checkpoint exists and no --force: skip fetch, display summary
  - If no checkpoint or --force:
  - Fetch protein-coding genes from mygene (19k-22k genes)
  - Validate gene universe (count, format, duplicates)
  - Map Ensembl IDs to HGNC + UniProt via batch queries
  - Validate mapping quality (min 90% HGNC success rate)
  - Save to DuckDB as gene_universe table
  - Record provenance steps
  - Save provenance sidecar JSON
  - Display summary with counts, rates, paths
- Integration Test Suite
  - 6 tests verifying module wiring with mocked mygene API:
    - test_config_to_store_roundtrip: config -> PipelineStore -> save/load
    - test_config_to_provenance: config -> ProvenanceTracker -> sidecar
    - test_full_setup_flow_mocked: full setup with 5 mocked genes
    - test_checkpoint_skip_flow: verify checkpoint-restart skips re-fetch
    - test_setup_cli_help: CLI help output verification
    - test_info_cli: info command with config display
  - All tests use tmp_path fixtures for isolation
  - No external API calls (mocked mygene responses)
- .gitignore
  - Excludes data/, *.duckdb, *.duckdb.wal
  - Python artifacts: __pycache__, *.pyc, *.egg-info, dist/, build/
  - Testing: .pytest_cache/, .coverage, htmlcov/
  - Provenance files (not in data/)
  - Virtual environment: .venv/
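The checkpoint-restart control flow of the setup command can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in setup_cmd.py: InMemoryStore stands in for PipelineStore, and run_setup, fetch_genes and map_ids are hypothetical names. Only the gene_universe checkpoint name and the --force behaviour come from this summary.

```python
from typing import Callable, Dict, List


class InMemoryStore:
    """Hypothetical stand-in for PipelineStore; the real store persists to DuckDB."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[dict]] = {}

    def has_checkpoint(self, table: str) -> bool:
        # A checkpoint exists once the table has been saved.
        return table in self._tables

    def save(self, table: str, rows: List[dict]) -> None:
        self._tables[table] = rows


def run_setup(
    store: InMemoryStore,
    fetch_genes: Callable[[], List[str]],
    map_ids: Callable[[List[str]], List[dict]],
    force: bool = False,
) -> str:
    """Return 'skipped' when the checkpoint is reused, 'fetched' otherwise."""
    if store.has_checkpoint("gene_universe") and not force:
        return "skipped"  # existing data: no API calls are made

    gene_ids = fetch_genes()   # gene universe validation would run after this
    rows = map_ids(gene_ids)   # mapping quality validation would run after this
    store.save("gene_universe", rows)
    return "fetched"
```

Calling run_setup twice on the same store fetches once and skips the second time; passing force=True fetches again regardless of the checkpoint.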
Tests
50 tests total (49 passed, 1 skipped):
Integration Tests (6 tests, all passed)
- test_config_to_store_roundtrip: Load config, create store, save/load DataFrame, verify roundtrip
- test_config_to_provenance: Load config, create provenance, record steps, save/load sidecar, verify config_hash
- test_full_setup_flow_mocked: Full setup flow with mocked mygene (5 genes), verify DuckDB table, provenance
- test_checkpoint_skip_flow: Create checkpoint, verify second run skips fetch
- test_setup_cli_help: CLI help shows --force and checkpoint info
- test_info_cli: Info command displays version, config hash, data sources
All Tests Summary
- Config tests: 5 passed
- API client tests: 5 passed
- Gene mapping tests: 15 passed
- Persistence tests: 12 passed, 1 skipped (pandas)
- Integration tests: 6 passed
Verification Results
All plan verification steps passed:
# 1. All tests pass
$ pytest tests/ -v
========================= 49 passed, 1 skipped, 1 warning in 0.42s =========================
# 2. CLI help works
$ usher-pipeline --help
Usage: usher-pipeline [OPTIONS] COMMAND [ARGS]...
Commands:
info Display pipeline information and configuration summary.
setup Initialize pipeline data infrastructure.
# 3. Info command works
$ usher-pipeline info
Usher Pipeline v0.1.0
Config: config/default.yaml
Config Hash: ddbb5195738ac354...
Data Source Versions:
Ensembl Release: 113
gnomAD Version: v4.1
GTEx Version: v8
HPA Version: 23.0
Deviations from Plan
None - plan executed exactly as written.
Task Execution Log
Task 1: Create CLI entry point with setup command
Status: Complete
Duration: ~3 minutes
Commit: f33b048
Actions:
- Created src/usher_pipeline/cli/ package
- Implemented main.py with click command group:
- Global options: --config (default config/default.yaml), --verbose
- info command: displays version, config hash, data sources, paths, API config
- Registers setup command
- Implemented setup_cmd.py with full orchestration:
- Load config, create store/provenance
- Checkpoint detection: has_checkpoint('gene_universe')
- Fetch gene universe (mygene) with count validation
- Map IDs (Ensembl -> HGNC + UniProt) with batch queries
- Validate mapping (min 90% HGNC success rate)
- Save to DuckDB with provenance sidecar
- Colored output with status indicators
- Resource cleanup in finally block
- Updated pyproject.toml: fixed entry point to usher_pipeline.cli.main:cli
- Created .gitignore with data/, *.duckdb, build artifacts
Files created or modified: 5 files (cli/__init__.py, main.py, setup_cmd.py and .gitignore created; pyproject.toml modified)
Key features:
- Checkpoint-restart: skips expensive fetch if data exists
- Validation gates: enforces data quality thresholds
- Provenance tracking: captures all setup steps
- Colored CLI output with clear status messages
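The validation gates can be illustrated with a short sketch. The function names and the count bounds used as defaults are assumptions made for illustration (the summary gives 19k-22k as the expected gene count and 90% as the minimum HGNC success rate, but does not show the validator code). The ID pattern is the standard human Ensembl gene ID format, ENSG followed by 11 digits.

```python
import re
from typing import Iterable, List, Optional

# Human Ensembl gene IDs: "ENSG" followed by 11 digits.
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def validate_gene_universe(
    gene_ids: Iterable[str],
    min_count: int = 19_000,  # assumed bounds, taken from the expected 19k-22k range
    max_count: int = 22_000,
) -> List[str]:
    """Return a list of problems; an empty list means the gate passes."""
    ids = list(gene_ids)
    problems: List[str] = []
    if not min_count <= len(ids) <= max_count:
        problems.append(f"gene count {len(ids)} outside [{min_count}, {max_count}]")
    malformed = [g for g in ids if not ENSEMBL_GENE_ID.fullmatch(g)]
    if malformed:
        problems.append(f"{len(malformed)} IDs are not valid Ensembl gene IDs")
    duplicates = len(ids) - len(set(ids))
    if duplicates:
        problems.append(f"{duplicates} duplicate IDs")
    return problems


def hgnc_success_rate(hgnc_ids: Iterable[Optional[str]]) -> float:
    """Fraction of genes that received a non-empty HGNC ID."""
    values = list(hgnc_ids)
    if not values:
        return 0.0
    return sum(1 for v in values if v) / len(values)


def mapping_gate_passes(hgnc_ids: Iterable[Optional[str]], min_rate: float = 0.90) -> bool:
    """True when the HGNC success rate meets the minimum threshold."""
    return hgnc_success_rate(hgnc_ids) >= min_rate
```

Returning a list of problems rather than raising on the first one lets the CLI report every failed check in a single run before deciding whether to abort.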
Task 2: Create integration tests verifying module wiring
Status: Complete
Duration: ~2 minutes
Commit: e4d71d0
Actions:
- Created tests/test_integration.py with 6 tests
- Mock data setup:
- MOCK_GENES: 5 Ensembl IDs
- MOCK_MYGENE_QUERY_RESPONSE: 5 genes with symbols
- MOCK_MYGENE_QUERYMANY_RESPONSE: 5 genes with HGNC + UniProt
- Test fixtures:
- test_config: creates temp config with tmp_path for isolation
- Integration tests:
- Config -> PipelineStore -> save/load roundtrip
- Config -> ProvenanceTracker -> sidecar creation
- Full setup flow with mocked mygene (fetch, map, validate, save, provenance)
- Checkpoint-restart verification
- CLI help and info commands
- All tests pass with mocked API (no external dependencies)
Files created: 1 file (test_integration.py)
Key features:
- Mocked mygene API calls (no rate limits, reproducible)
- Temporary paths for isolation (no pollution)
- Verifies cross-module wiring works correctly
- Fast execution (<1s for all 6 tests)
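The mocking approach can be sketched as below. This is not the content of tests/test_integration.py: map_and_save is a hypothetical unit under test, the mock payload is made-up placeholder data (two genes rather than five), and the response shape (a list of dicts keyed by query, symbol and HGNC) is an assumption about what a mygene-style querymany call returns. The client is a plain MagicMock, so no network access or mygene install is needed.

```python
import json
from pathlib import Path
from typing import List
from unittest.mock import MagicMock

# Placeholder payload for illustration only; IDs and symbols are not real genes.
MOCK_QUERYMANY_RESPONSE = [
    {"query": "ENSG00000000001", "symbol": "GENE_A", "HGNC": "1"},
    {"query": "ENSG00000000002", "symbol": "GENE_B", "HGNC": "2"},
]


def map_and_save(client, ensembl_ids: List[str], out_path: Path) -> int:
    """Hypothetical unit under test: map IDs via the client, persist rows as JSON."""
    hits = client.querymany(
        ensembl_ids, scopes="ensembl.gene", fields="symbol,HGNC", species="human"
    )
    rows = [
        {"ensembl_id": h["query"], "symbol": h.get("symbol"), "hgnc_id": h.get("HGNC")}
        for h in hits
    ]
    out_path.write_text(json.dumps(rows))
    return len(rows)


def test_map_and_save_with_mocked_client(tmp_path: Path) -> None:
    # The mocked client returns canned data, so the test is fast and reproducible.
    client = MagicMock()
    client.querymany.return_value = MOCK_QUERYMANY_RESPONSE
    ids = [h["query"] for h in MOCK_QUERYMANY_RESPONSE]

    out = tmp_path / "gene_universe.json"  # tmp_path keeps the test isolated
    assert map_and_save(client, ids, out) == 2

    client.querymany.assert_called_once()
    saved = json.loads(out.read_text())
    assert saved[0] == {"ensembl_id": "ENSG00000000001", "symbol": "GENE_A", "hgnc_id": "1"}
```

Under pytest, tmp_path is supplied automatically as a per-test temporary directory, so nothing is written into the repository or the data/ directory.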
Success Criteria Verification
- CLI entry point works with setup and info subcommands
- Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
- Checkpoint-restart works: existing DuckDB data skips re-downloading
- All integration tests pass verifying cross-module wiring
- Full test suite (all files) passes: pytest tests/ -v (49 passed, 1 skipped)
Must-Haves Verification
Truths:
- CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands
- Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance
- All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance
- Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching
Artifacts:
- src/usher_pipeline/cli/main.py provides "CLI entry point with click command group" containing "def cli"
- src/usher_pipeline/cli/setup_cmd.py provides "Setup command wiring config, gene mapping, persistence, provenance" containing "def setup"
- tests/test_integration.py provides "Integration tests verifying module wiring" containing "test_"
Key Links:
- src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/config/loader.py via "loads pipeline config from YAML" (pattern: load_config)
- src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/gene_mapping/mapper.py via "maps gene IDs using GeneMapper" (pattern: GeneMapper)
- src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/duckdb_store.py via "saves results to DuckDB with checkpoint" (pattern: PipelineStore)
- src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/provenance.py via "tracks provenance for setup step" (pattern: ProvenanceTracker)
- src/usher_pipeline/cli/main.py -> src/usher_pipeline/cli/setup_cmd.py via "registers setup as click subcommand" (pattern: cli\.add_command)
Impact on Roadmap
Phase 01 Data Infrastructure Complete:
All 4 plans in Phase 01 are now complete:
- 01-01: Python package scaffold, config system, base API client
- 01-02: Gene ID mapping and validation
- 01-03: DuckDB persistence and provenance tracking
- 01-04: CLI integration and end-to-end testing
Ready for Phase 02 (Evidence Layer Ingestion):
- CLI provides interface for running pipeline operations
- Config system defines data source versions
- Gene universe can be fetched and validated
- ID mapping handles Ensembl -> HGNC + UniProt
- DuckDB checkpoint-restart enables incremental processing
- Provenance tracking captures all processing steps
- Integration tests prove modules work together
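To show how a config hash and a provenance sidecar fit together for later phases, here is a minimal sketch. It is not the ProvenanceTracker implementation: SimpleProvenance, config_hash, the sidecar field names and the choice of SHA-256 over canonical JSON are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def config_hash(config: Dict[str, Any]) -> str:
    """Hash a config dict deterministically (assumed scheme: SHA-256 of sorted JSON)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SimpleProvenance:
    """Hypothetical tracker: records steps and writes them next to the data."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config_hash = config_hash(config)
        self.steps: List[Dict[str, Any]] = []

    def record_step(self, name: str, **details: Any) -> None:
        self.steps.append(
            {
                "step": name,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "details": details,
            }
        )

    def save_sidecar(self, path: Path) -> None:
        payload = {"config_hash": self.config_hash, "steps": self.steps}
        path.write_text(json.dumps(payload, indent=2))
```

Sorting the keys before hashing means two configs with the same values produce the same hash regardless of key order, which is what makes the hash usable for checking that a checkpoint was built from the current configuration.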
Next Steps
Phase 02 will add evidence layer ingestion commands to the CLI:
- usher-pipeline fetch gnomad: download gnomAD data
- usher-pipeline fetch expression: download GTEx/HPA data
- usher-pipeline fetch annotations: download GO/HPO annotations
Each command will use the same checkpoint-restart pattern established in setup.
Self-Check: PASSED
Files verified:
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/__init__.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/main.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/setup_cmd.py
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_integration.py
FOUND: /Users/gbanyan/Project/usher-exploring/.gitignore
Commits verified:
FOUND: f33b048 (Task 1)
FOUND: e4d71d0 (Task 2)
All files and commits exist as documented.