---
phase: 01-data-infrastructure
plan: 04
subsystem: integration
tags: [cli, integration, wiring, testing, click]
dependency_graph:
  requires:
    - phase: 01-01
      provides: ["Python package scaffold", "Pydantic v2 config system", "CachedAPIClient"]
    - phase: 01-02
      provides: ["Gene universe definition", "Batch ID mapper", "Validation gates"]
    - phase: 01-03
      provides: ["DuckDB checkpoint-restart storage", "Provenance tracking"]
  provides:
    - CLI entry point with setup and info commands
    - Full infrastructure integration (config -> fetch -> map -> validate -> persist -> provenance)
    - Checkpoint-restart capability for expensive operations
    - Integration test suite verifying cross-module wiring
  affects:
    - All future pipeline operations (CLI is primary interface)
tech_stack:
  added:
    - click for CLI framework
  patterns:
    - Click command group with global options (--config, --verbose)
    - Colored CLI output with status indicators
    - Context manager for resource cleanup
    - Mock-based integration testing
key_files:
  created:
    - src/usher_pipeline/cli/__init__.py: CLI package exports
    - src/usher_pipeline/cli/main.py: Click command group with info command
    - src/usher_pipeline/cli/setup_cmd.py: Setup command orchestrating full flow
    - tests/test_integration.py: 6 integration tests
    - .gitignore: Data files, build artifacts, provenance exclusions
  modified:
    - pyproject.toml: Fixed CLI entry point to usher_pipeline.cli.main:cli
decisions:
  - decision: "Click for CLI framework"
    rationale: "Standard Python CLI library with excellent UX features (colored output, help generation, subcommands)"
    alternatives: ["argparse (rejected - verbose)", "typer (rejected - less mature)"]
  - decision: "Setup command uses checkpoint-restart pattern"
    rationale: "Gene universe fetch can take minutes; checkpoint enables fast restart without re-fetching"
    impact: "Setup detects existing DuckDB tables and skips re-fetch unless --force flag used"
  - decision: "Mock mygene in integration tests"
    rationale: "Avoids external API dependency, ensures reproducible tests, faster execution"
    impact: "All 6 integration tests run in <1s with mocked responses"
metrics:
  duration_minutes: 5
  tasks_completed: 2
  files_created: 5
  files_modified: 1
  tests_added: 6
  commits: 2
  completed_date: "2026-02-11"
---

# Phase 01 Plan 04: CLI Integration and End-to-End Testing Summary

**One-liner:** Click-based CLI with setup command orchestrating full infrastructure flow (config -> fetch gene universe -> map IDs -> validate -> DuckDB persistence -> provenance) and 6 integration tests verifying cross-module wiring with mocked APIs.

## What Was Built

Wired all infrastructure modules together with a CLI interface and integration tests:

1. **CLI Entry Point**
   - Click command group with global options (--config, --verbose)
   - `info` command: displays pipeline version, config hash, data source versions, paths, API config
   - `setup` command: orchestrates full infrastructure flow
   - Colored output with status indicators (green=OK, yellow=warn, red=fail)
   - Entry point: `usher-pipeline` binary installed with package

2. **Setup Command Flow**
   - Load config from YAML
   - Create PipelineStore and ProvenanceTracker from config
   - Check for existing checkpoint (gene_universe table in DuckDB)
   - If checkpoint exists and no --force: skip fetch, display summary
   - If no checkpoint or --force:
     - Fetch protein-coding genes from mygene (19k-22k genes)
     - Validate gene universe (count, format, duplicates)
     - Map Ensembl IDs to HGNC + UniProt via batch queries
     - Validate mapping quality (min 90% HGNC success rate)
     - Save to DuckDB as gene_universe table
     - Record provenance steps
   - Save provenance sidecar JSON
   - Display summary with counts, rates, paths

3. **Integration Test Suite**
   - 6 tests verifying module wiring with mocked mygene API:
     - `test_config_to_store_roundtrip`: config -> PipelineStore -> save/load
     - `test_config_to_provenance`: config -> ProvenanceTracker -> sidecar
     - `test_full_setup_flow_mocked`: full setup with 5 mocked genes
     - `test_checkpoint_skip_flow`: verify checkpoint-restart skips re-fetch
     - `test_setup_cli_help`: CLI help output verification
     - `test_info_cli`: info command with config display
   - All tests use tmp_path fixtures for isolation
   - No external API calls (mocked mygene responses)

4. **.gitignore**
   - Excludes `data/`, `*.duckdb`, `*.duckdb.wal`
   - Python artifacts: `__pycache__`, `*.pyc`, `*.egg-info`, `dist/`, `build/`
   - Testing: `.pytest_cache/`, `.coverage`, `htmlcov/`
   - Provenance files (not in data/)
   - Virtual environment: `.venv/`

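
The checkpoint-restart branch of the setup flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's own code: it uses the standard-library `sqlite3` module in place of DuckDB so that it runs without extra packages, and `run_setup` and `fetch_genes` are hypothetical names. `has_checkpoint` borrows the name used in this summary, but its signature here is an assumption.

```python
import sqlite3
from typing import Callable, Iterable


def has_checkpoint(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if `table` already exists (sqlite3 stand-in for DuckDB)."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def run_setup(
    conn: sqlite3.Connection,
    fetch_genes: Callable[[], Iterable[str]],  # hypothetical fetch callable
    force: bool = False,
) -> str:
    """Fetch and persist the gene universe unless a checkpoint exists."""
    if has_checkpoint(conn, "gene_universe") and not force:
        return "skipped"  # checkpoint hit: the expensive fetch is not repeated

    genes = list(fetch_genes())
    with conn:  # commits the inserts on success, rolls back on error
        conn.execute("DROP TABLE IF EXISTS gene_universe")
        conn.execute("CREATE TABLE gene_universe (ensembl_id TEXT PRIMARY KEY)")
        conn.executemany(
            "INSERT INTO gene_universe VALUES (?)", [(g,) for g in genes]
        )
    return "fetched"
```

Calling `run_setup` twice on the same connection should return `"fetched"` and then `"skipped"`; passing `force=True` re-runs the fetch, mirroring the --force flag described above.
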
## Tests

**50 tests total (49 passed, 1 skipped):**

### Integration Tests (6 tests, all passed)
- `test_config_to_store_roundtrip`: Load config, create store, save/load DataFrame, verify roundtrip
- `test_config_to_provenance`: Load config, create provenance, record steps, save/load sidecar, verify config_hash
- `test_full_setup_flow_mocked`: Full setup flow with mocked mygene (5 genes), verify DuckDB table, provenance
- `test_checkpoint_skip_flow`: Create checkpoint, verify second run skips fetch
- `test_setup_cli_help`: CLI help shows --force and checkpoint info
- `test_info_cli`: Info command displays version, config hash, data sources

### All Tests Summary
- Config tests: 5 passed
- API client tests: 5 passed
- Gene mapping tests: 15 passed
- Persistence tests: 12 passed, 1 skipped (pandas)
- Integration tests: 6 passed

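
The provenance roundtrip exercised by `test_config_to_provenance` (record steps, save a sidecar, verify the config hash) might look roughly like the sketch below. Everything here is an assumption made for illustration: `ProvenanceSketch` is a hypothetical stand-in for ProvenanceTracker, the sidecar layout is invented, and hashing the config as sorted-key JSON with SHA-256 is one common approach rather than the project's confirmed method.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def config_hash(config: dict) -> str:
    """SHA-256 of the config serialized as sorted-key JSON (assumed scheme)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ProvenanceSketch:
    """Hypothetical stand-in for ProvenanceTracker; the real class may differ."""

    def __init__(self, config: dict) -> None:
        self.config_hash = config_hash(config)
        self.steps: list[dict] = []

    def record_step(self, name: str, **details) -> None:
        # Each step keeps its name, a UTC timestamp, and free-form details.
        self.steps.append(
            {
                "step": name,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "details": details,
            }
        )

    def save_sidecar(self, path: Path) -> None:
        # The sidecar is a plain JSON file written next to the data it describes.
        payload = {"config_hash": self.config_hash, "steps": self.steps}
        path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Because the keys are sorted before hashing, two configs with the same content but different key order produce the same hash, which is what makes the hash usable as a reproducibility check.
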
## Verification Results

All plan verification steps passed:

```bash
# 1. All tests pass
$ pytest tests/ -v
========================= 49 passed, 1 skipped, 1 warning in 0.42s =========================

# 2. CLI help works
$ usher-pipeline --help
Usage: usher-pipeline [OPTIONS] COMMAND [ARGS]...
Commands:
  info   Display pipeline information and configuration summary.
  setup  Initialize pipeline data infrastructure.

# 3. Info command works
$ usher-pipeline info
Usher Pipeline v0.1.0
Config: config/default.yaml
Config Hash: ddbb5195738ac354...
Data Source Versions:
  Ensembl Release: 113
  gnomAD Version: v4.1
  GTEx Version: v8
  HPA Version: 23.0
```

## Deviations from Plan

None - plan executed exactly as written.

## Task Execution Log

### Task 1: Create CLI entry point with setup command
**Status:** Complete
**Duration:** ~3 minutes
**Commit:** f33b048

**Actions:**
1. Created src/usher_pipeline/cli/ package
2. Implemented main.py with click command group:
   - Global options: --config (default config/default.yaml), --verbose
   - info command: displays version, config hash, data sources, paths, API config
   - Registers setup command
3. Implemented setup_cmd.py with full orchestration:
   - Load config, create store/provenance
   - Checkpoint detection: has_checkpoint('gene_universe')
   - Fetch gene universe (mygene) with count validation
   - Map IDs (Ensembl -> HGNC + UniProt) with batch queries
   - Validate mapping (min 90% HGNC success rate)
   - Save to DuckDB with provenance sidecar
   - Colored output with status indicators
   - Resource cleanup in finally block
4. Updated pyproject.toml: fixed entry point to usher_pipeline.cli.main:cli
5. Created .gitignore with `data/`, `*.duckdb`, build artifacts

**Files created:** 4 files (`cli/__init__.py`, `main.py`, `setup_cmd.py`, `.gitignore`); **Files modified:** 1 file (`pyproject.toml`)

**Key features:**
- Checkpoint-restart: skips expensive fetch if data exists
- Validation gates: enforces data quality thresholds
- Provenance tracking: captures all setup steps
- Colored CLI output with clear status messages

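
The validation gates listed for Task 1 (gene count, ID format, duplicates, and the 90% HGNC mapping threshold) can be illustrated with a small sketch. The function names and return shapes are hypothetical and the real validators may be structured differently; the 19k-22k count range and the 0.90 threshold come from this summary, and the ID pattern relies on human Ensembl gene IDs having the form `ENSG` plus 11 digits.

```python
import re

# Human Ensembl gene IDs: "ENSG" followed by 11 digits.
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def validate_gene_universe(gene_ids, min_count=19_000, max_count=22_000):
    """Return a list of problems; an empty list means the gate passes."""
    problems = []
    if not min_count <= len(gene_ids) <= max_count:
        problems.append(f"count {len(gene_ids)} outside [{min_count}, {max_count}]")
    malformed = [g for g in gene_ids if not ENSEMBL_GENE_ID.fullmatch(g)]
    if malformed:
        problems.append(f"{len(malformed)} malformed IDs")
    duplicates = len(gene_ids) - len(set(gene_ids))
    if duplicates:
        problems.append(f"{duplicates} duplicate IDs")
    return problems


def validate_mapping(mapping, min_success_rate=0.90):
    """mapping: Ensembl ID -> HGNC ID or None. Returns (passed, success_rate)."""
    if not mapping:
        return False, 0.0
    mapped = sum(1 for hgnc in mapping.values() if hgnc)
    rate = mapped / len(mapping)
    return rate >= min_success_rate, rate
```

Returning the list of problems instead of raising on the first one lets a CLI report every failed check in a single run, which fits the colored status output described above.
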
### Task 2: Create integration tests verifying module wiring
**Status:** Complete
**Duration:** ~2 minutes
**Commit:** e4d71d0

**Actions:**
1. Created tests/test_integration.py with 6 tests
2. Mock data setup:
   - MOCK_GENES: 5 Ensembl IDs
   - MOCK_MYGENE_QUERY_RESPONSE: 5 genes with symbols
   - MOCK_MYGENE_QUERYMANY_RESPONSE: 5 genes with HGNC + UniProt
3. Test fixtures:
   - test_config: creates temp config with tmp_path for isolation
4. Integration tests:
   - Config -> PipelineStore -> save/load roundtrip
   - Config -> ProvenanceTracker -> sidecar creation
   - Full setup flow with mocked mygene (fetch, map, validate, save, provenance)
   - Checkpoint-restart verification
   - CLI help and info commands
5. All tests pass with mocked API (no external dependencies)

**Files created:** 1 file (test_integration.py)

**Key features:**
- Mocked mygene API calls (no rate limits, reproducible)
- Temporary paths for isolation (no pollution)
- Verifies cross-module wiring works correctly
- Fast execution (<1s for all 6 tests)

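
The mock-based approach used in Task 2 can be sketched with the standard library's `unittest.mock`. The names below are hypothetical: `fetch_gene_ids` stands in for the pipeline's fetch step, and the response shape is only loosely modeled on a mygene-style query result, so the actual fixtures in tests/test_integration.py may look different.

```python
from unittest.mock import MagicMock

# Hypothetical response with 5 genes, loosely modeled on a mygene-style result.
MOCK_QUERY_RESPONSE = {
    "hits": [
        {"ensembl": {"gene": f"ENSG{i:011d}"}, "symbol": f"GENE{i}"}
        for i in range(1, 6)
    ]
}


def fetch_gene_ids(client) -> list[str]:
    """Hypothetical fetch step: ask the client for protein-coding genes."""
    response = client.query("type_of_gene:protein-coding", species="human")
    return [hit["ensembl"]["gene"] for hit in response["hits"]]


def test_fetch_uses_mocked_client():
    # The mock replaces the API client, so no network request is made.
    client = MagicMock()
    client.query.return_value = MOCK_QUERY_RESPONSE

    gene_ids = fetch_gene_ids(client)

    assert len(gene_ids) == 5
    client.query.assert_called_once()  # exactly one call reached the mock
```

Injecting the client as a parameter keeps the fetch logic testable without patching module globals; a test for the checkpoint path could assert `client.query.assert_not_called()` on the second run.
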
## Success Criteria Verification

- [x] CLI entry point works with setup and info subcommands
- [x] Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
- [x] Checkpoint-restart works: existing DuckDB data skips re-downloading
- [x] All integration tests pass verifying cross-module wiring
- [x] Full test suite (all files) passes: `pytest tests/ -v` (49 passed, 1 skipped)

## Must-Haves Verification

**Truths:**
- [x] CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands
- [x] Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance
- [x] All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance
- [x] Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching

**Artifacts:**
- [x] src/usher_pipeline/cli/main.py provides "CLI entry point with click command group" containing "def cli"
- [x] src/usher_pipeline/cli/setup_cmd.py provides "Setup command wiring config, gene mapping, persistence, provenance" containing "def setup"
- [x] tests/test_integration.py provides "Integration tests verifying module wiring" containing "test_"

**Key Links:**
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/config/loader.py via "loads pipeline config from YAML" (pattern: `load_config`)
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/gene_mapping/mapper.py via "maps gene IDs using GeneMapper" (pattern: `GeneMapper`)
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/duckdb_store.py via "saves results to DuckDB with checkpoint" (pattern: `PipelineStore`)
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/provenance.py via "tracks provenance for setup step" (pattern: `ProvenanceTracker`)
- [x] src/usher_pipeline/cli/main.py -> src/usher_pipeline/cli/setup_cmd.py via "registers setup as click subcommand" (pattern: `cli\.add_command`)

## Impact on Roadmap

**Phase 01 Data Infrastructure Complete:**

All 4 plans in Phase 01 are now complete:
- 01-01: Python package scaffold, config system, base API client
- 01-02: Gene ID mapping and validation
- 01-03: DuckDB persistence and provenance tracking
- 01-04: CLI integration and end-to-end testing

**Ready for Phase 02 (Evidence Layer Ingestion):**
- CLI provides interface for running pipeline operations
- Config system defines data source versions
- Gene universe can be fetched and validated
- ID mapping handles Ensembl -> HGNC + UniProt
- DuckDB checkpoint-restart enables incremental processing
- Provenance tracking captures all processing steps
- Integration tests prove modules work together

## Next Steps

Phase 02 will add evidence layer ingestion commands to the CLI:
- `usher-pipeline fetch gnomad` - download gnomAD data
- `usher-pipeline fetch expression` - download GTEx/HPA data
- `usher-pipeline fetch annotations` - download GO/HPO annotations

Each command will use the same checkpoint-restart pattern established in setup.

## Self-Check: PASSED

**Files verified:**
```bash
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/__init__.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/main.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/setup_cmd.py
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_integration.py
FOUND: /Users/gbanyan/Project/usher-exploring/.gitignore
```

**Commits verified:**
```bash
FOUND: f33b048 (Task 1)
FOUND: e4d71d0 (Task 2)
```

All files and commits exist as documented.