--- phase: 01-data-infrastructure plan: 04 subsystem: integration tags: [cli, integration, wiring, testing, click] dependency_graph: requires: - phase: 01-01 provides: ["Python package scaffold", "Pydantic v2 config system", "CachedAPIClient"] - phase: 01-02 provides: ["Gene universe definition", "Batch ID mapper", "Validation gates"] - phase: 01-03 provides: ["DuckDB checkpoint-restart storage", "Provenance tracking"] provides: - CLI entry point with setup and info commands - Full infrastructure integration (config -> fetch -> map -> validate -> persist -> provenance) - Checkpoint-restart capability for expensive operations - Integration test suite verifying cross-module wiring affects: - All future pipeline operations (CLI is primary interface) tech_stack: added: - click for CLI framework patterns: - Click command group with global options (--config, --verbose) - Colored CLI output with status indicators - Context manager for resource cleanup - Mock-based integration testing key_files: created: - src/usher_pipeline/cli/__init__.py: CLI package exports - src/usher_pipeline/cli/main.py: Click command group with info command - src/usher_pipeline/cli/setup_cmd.py: Setup command orchestrating full flow - tests/test_integration.py: 6 integration tests - .gitignore: Data files, build artifacts, provenance exclusions modified: - pyproject.toml: Fixed CLI entry point to usher_pipeline.cli.main:cli decisions: - decision: "Click for CLI framework" rationale: "Standard Python CLI library with excellent UX features (colored output, help generation, subcommands)" alternatives: ["argparse (rejected - verbose)", "typer (rejected - less mature)"] - decision: "Setup command uses checkpoint-restart pattern" rationale: "Gene universe fetch can take minutes; checkpoint enables fast restart without re-fetching" impact: "Setup detects existing DuckDB tables and skips re-fetch unless --force flag used" - decision: "Mock mygene in integration tests" rationale: "Avoids external API dependency, ensures reproducible tests, faster execution" impact: "All 6 integration tests run in <1s with mocked responses" metrics: duration_minutes: 5 tasks_completed: 2 files_created: 5 files_modified: 1 tests_added: 6 commits: 2 completed_date: "2026-02-11" --- # Phase 01 Plan 04: CLI Integration and End-to-End Testing Summary **One-liner:** Click-based CLI with setup command orchestrating full infrastructure flow (config -> fetch gene universe -> map IDs -> validate -> DuckDB persistence -> provenance) and 6 integration tests verifying cross-module wiring with mocked APIs. ## What Was Built Wired all infrastructure modules together with a CLI interface and integration tests: 1. **CLI Entry Point** - Click command group with global options (--config, --verbose) - `info` command: displays pipeline version, config hash, data source versions, paths, API config - `setup` command: orchestrates full infrastructure flow - Colored output with status indicators (green=OK, yellow=warn, red=fail) - Entry point: `usher-pipeline` binary installed with package 2. **Setup Command Flow** - Load config from YAML - Create PipelineStore and ProvenanceTracker from config - Check for existing checkpoint (gene_universe table in DuckDB) - If checkpoint exists and no --force: skip fetch, display summary - If no checkpoint or --force: - Fetch protein-coding genes from mygene (19k-22k genes) - Validate gene universe (count, format, duplicates) - Map Ensembl IDs to HGNC + UniProt via batch queries - Validate mapping quality (min 90% HGNC success rate) - Save to DuckDB as gene_universe table - Record provenance steps - Save provenance sidecar JSON - Display summary with counts, rates, paths 3. **Integration Test Suite** - 6 tests verifying module wiring with mocked mygene API: - `test_config_to_store_roundtrip`: config -> PipelineStore -> save/load - `test_config_to_provenance`: config -> ProvenanceTracker -> sidecar - `test_full_setup_flow_mocked`: full setup with 5 mocked genes - `test_checkpoint_skip_flow`: verify checkpoint-restart skips re-fetch - `test_setup_cli_help`: CLI help output verification - `test_info_cli`: info command with config display - All tests use tmp_path fixtures for isolation - No external API calls (mocked mygene responses) 4. **.gitignore** - Excludes data/, *.duckdb, *.duckdb.wal - Python artifacts: __pycache__, *.pyc, *.egg-info, dist/, build/ - Testing: .pytest_cache/, .coverage, htmlcov/ - Provenance files (not in data/) - Virtual environment: .venv/ ## Tests **50 tests total (49 passed, 1 skipped):** ### Integration Tests (6 tests, all passed) - `test_config_to_store_roundtrip`: Load config, create store, save/load DataFrame, verify roundtrip - `test_config_to_provenance`: Load config, create provenance, record steps, save/load sidecar, verify config_hash - `test_full_setup_flow_mocked`: Full setup flow with mocked mygene (5 genes), verify DuckDB table, provenance - `test_checkpoint_skip_flow`: Create checkpoint, verify second run skips fetch - `test_setup_cli_help`: CLI help shows --force and checkpoint info - `test_info_cli`: Info command displays version, config hash, data sources ### All Tests Summary - Config tests: 5 passed - API client tests: 5 passed - Gene mapping tests: 15 passed - Persistence tests: 12 passed, 1 skipped (pandas) - Integration tests: 6 passed ## Verification Results All plan verification steps passed: ```bash # 1. All tests pass $ pytest tests/ -v ========================= 49 passed, 1 skipped, 1 warning in 0.42s ========================= # 2. CLI help works $ usher-pipeline --help Usage: usher-pipeline [OPTIONS] COMMAND [ARGS]... Commands: info Display pipeline information and configuration summary. setup Initialize pipeline data infrastructure. # 3. Info command works $ usher-pipeline info Usher Pipeline v0.1.0 Config: config/default.yaml Config Hash: ddbb5195738ac354... Data Source Versions: Ensembl Release: 113 gnomAD Version: v4.1 GTEx Version: v8 HPA Version: 23.0 ``` ## Deviations from Plan None - plan executed exactly as written. ## Task Execution Log ### Task 1: Create CLI entry point with setup command **Status:** Complete **Duration:** ~3 minutes **Commit:** f33b048 **Actions:** 1. Created src/usher_pipeline/cli/ package 2. Implemented main.py with click command group: - Global options: --config (default config/default.yaml), --verbose - info command: displays version, config hash, data sources, paths, API config - Registers setup command 3. Implemented setup_cmd.py with full orchestration: - Load config, create store/provenance - Checkpoint detection: has_checkpoint('gene_universe') - Fetch gene universe (mygene) with count validation - Map IDs (Ensembl -> HGNC + UniProt) with batch queries - Validate mapping (min 90% HGNC success rate) - Save to DuckDB with provenance sidecar - Colored output with status indicators - Resource cleanup in finally block 4. Updated pyproject.toml: fixed entry point to usher_pipeline.cli.main:cli 5. Created .gitignore with data/, *.duckdb, build artifacts **Files created:** 5 files (cli/__init__.py, main.py, setup_cmd.py, .gitignore, modified pyproject.toml) **Key features:** - Checkpoint-restart: skips expensive fetch if data exists - Validation gates: enforces data quality thresholds - Provenance tracking: captures all setup steps - Colored CLI output with clear status messages ### Task 2: Create integration tests verifying module wiring **Status:** Complete **Duration:** ~2 minutes **Commit:** e4d71d0 **Actions:** 1. Created tests/test_integration.py with 6 tests 2. Mock data setup: - MOCK_GENES: 5 Ensembl IDs - MOCK_MYGENE_QUERY_RESPONSE: 5 genes with symbols - MOCK_MYGENE_QUERYMANY_RESPONSE: 5 genes with HGNC + UniProt 3. Test fixtures: - test_config: creates temp config with tmp_path for isolation 4. Integration tests: - Config -> PipelineStore -> save/load roundtrip - Config -> ProvenanceTracker -> sidecar creation - Full setup flow with mocked mygene (fetch, map, validate, save, provenance) - Checkpoint-restart verification - CLI help and info commands 5. All tests pass with mocked API (no external dependencies) **Files created:** 1 file (test_integration.py) **Key features:** - Mocked mygene API calls (no rate limits, reproducible) - Temporary paths for isolation (no pollution) - Verifies cross-module wiring works correctly - Fast execution (<1s for all 6 tests) ## Success Criteria Verification - [x] CLI entry point works with setup and info subcommands - [x] Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking - [x] Checkpoint-restart works: existing DuckDB data skips re-downloading - [x] All integration tests pass verifying cross-module wiring - [x] Full test suite (all files) passes: `pytest tests/ -v` (49 passed, 1 skipped) ## Must-Haves Verification **Truths:** - [x] CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands - [x] Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance - [x] All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance - [x] Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching **Artifacts:** - [x] src/usher_pipeline/cli/main.py provides "CLI entry point with click command group" containing "def cli" - [x] src/usher_pipeline/cli/setup_cmd.py provides "Setup command wiring config, gene mapping, persistence, provenance" containing "def setup" - [x] tests/test_integration.py provides "Integration tests verifying module wiring" containing "test_" **Key Links:** - [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/config/loader.py via "loads pipeline config from YAML" (pattern: `load_config`) - [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/gene_mapping/mapper.py via "maps gene IDs using GeneMapper" (pattern: `GeneMapper`) - [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/duckdb_store.py via "saves results to DuckDB with checkpoint" (pattern: `PipelineStore`) - [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/provenance.py via "tracks provenance for setup step" (pattern: `ProvenanceTracker`) - [x] src/usher_pipeline/cli/main.py -> src/usher_pipeline/cli/setup_cmd.py via "registers setup as click subcommand" (pattern: `cli\.add_command`) ## Impact on Roadmap **Phase 01 Data Infrastructure Complete:** All 4 plans in Phase 01 are now complete: - 01-01: Python package scaffold, config system, base API client - 01-02: Gene ID mapping and validation - 01-03: DuckDB persistence and provenance tracking - 01-04: CLI integration and end-to-end testing **Ready for Phase 02 (Evidence Layer Ingestion):** - CLI provides interface for running pipeline operations - Config system defines data source versions - Gene universe can be fetched and validated - ID mapping handles Ensembl -> HGNC + UniProt - DuckDB checkpoint-restart enables incremental processing - Provenance tracking captures all processing steps - Integration tests prove modules work together ## Next Steps Phase 02 will add evidence layer ingestion commands to the CLI: - `usher-pipeline fetch gnomad` - download gnomAD data - `usher-pipeline fetch expression` - download GTEx/HPA data - `usher-pipeline fetch annotations` - download GO/HPO annotations Each command will use the same checkpoint-restart pattern established in setup. ## Self-Check: PASSED **Files verified:** ```bash FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/__init__.py FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/main.py FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/setup_cmd.py FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_integration.py FOUND: /Users/gbanyan/Project/usher-exploring/.gitignore ``` **Commits verified:** ```bash FOUND: f33b048 (Task 1) FOUND: e4d71d0 (Task 2) ``` All files and commits exist as documented.