docs(01-04): complete CLI integration and end-to-end testing plan
- CLI entry point with setup and info commands
- Full infrastructure integration verified
- 6 integration tests with mocked APIs
- Phase 01 Data Infrastructure complete
.planning/phases/01-data-infrastructure/01-04-SUMMARY.md
@@ -0,0 +1,292 @@
---
phase: 01-data-infrastructure
plan: 04
subsystem: integration
tags: [cli, integration, wiring, testing, click]
dependency_graph:
  requires:
    - phase: 01-01
      provides: ["Python package scaffold", "Pydantic v2 config system", "CachedAPIClient"]
    - phase: 01-02
      provides: ["Gene universe definition", "Batch ID mapper", "Validation gates"]
    - phase: 01-03
      provides: ["DuckDB checkpoint-restart storage", "Provenance tracking"]
  provides:
    - CLI entry point with setup and info commands
    - Full infrastructure integration (config -> fetch -> map -> validate -> persist -> provenance)
    - Checkpoint-restart capability for expensive operations
    - Integration test suite verifying cross-module wiring
  affects:
    - All future pipeline operations (CLI is primary interface)
tech_stack:
  added:
    - click for CLI framework
  patterns:
    - Click command group with global options (--config, --verbose)
    - Colored CLI output with status indicators
    - Context manager for resource cleanup
    - Mock-based integration testing
key_files:
  created:
    - src/usher_pipeline/cli/__init__.py: CLI package exports
    - src/usher_pipeline/cli/main.py: Click command group with info command
    - src/usher_pipeline/cli/setup_cmd.py: Setup command orchestrating full flow
    - tests/test_integration.py: 6 integration tests
    - .gitignore: Data files, build artifacts, provenance exclusions
  modified:
    - pyproject.toml: Fixed CLI entry point to usher_pipeline.cli.main:cli
decisions:
  - decision: "Click for CLI framework"
    rationale: "Standard Python CLI library with excellent UX features (colored output, help generation, subcommands)"
    alternatives: ["argparse (rejected - verbose)", "typer (rejected - less mature)"]
  - decision: "Setup command uses checkpoint-restart pattern"
    rationale: "Gene universe fetch can take minutes; checkpoint enables fast restart without re-fetching"
    impact: "Setup detects existing DuckDB tables and skips re-fetch unless --force flag used"
  - decision: "Mock mygene in integration tests"
    rationale: "Avoids external API dependency, ensures reproducible tests, faster execution"
    impact: "All 6 integration tests run in <1s with mocked responses"
metrics:
  duration_minutes: 5
  tasks_completed: 2
  files_created: 5
  files_modified: 1
  tests_added: 6
  commits: 2
  completed_date: "2026-02-11"
---

# Phase 01 Plan 04: CLI Integration and End-to-End Testing Summary

**One-liner:** Click-based CLI with setup command orchestrating full infrastructure flow (config -> fetch gene universe -> map IDs -> validate -> DuckDB persistence -> provenance) and 6 integration tests verifying cross-module wiring with mocked APIs.

## What Was Built

Wired all infrastructure modules together with a CLI interface and integration tests:

1. **CLI Entry Point**
   - Click command group with global options (--config, --verbose)
   - `info` command: displays pipeline version, config hash, data source versions, paths, API config
   - `setup` command: orchestrates full infrastructure flow
   - Colored output with status indicators (green=OK, yellow=warn, red=fail)
   - Entry point: `usher-pipeline` binary installed with package

2. **Setup Command Flow**
   - Load config from YAML
   - Create PipelineStore and ProvenanceTracker from config
   - Check for existing checkpoint (gene_universe table in DuckDB)
   - If checkpoint exists and no --force: skip fetch, display summary
   - If no checkpoint or --force:
     - Fetch protein-coding genes from mygene (19k-22k genes)
     - Validate gene universe (count, format, duplicates)
     - Map Ensembl IDs to HGNC + UniProt via batch queries
     - Validate mapping quality (min 90% HGNC success rate)
     - Save to DuckDB as gene_universe table
     - Record provenance steps
   - Save provenance sidecar JSON
   - Display summary with counts, rates, paths

3. **Integration Test Suite**
   - 6 tests verifying module wiring with mocked mygene API:
     - `test_config_to_store_roundtrip`: config -> PipelineStore -> save/load
     - `test_config_to_provenance`: config -> ProvenanceTracker -> sidecar
     - `test_full_setup_flow_mocked`: full setup with 5 mocked genes
     - `test_checkpoint_skip_flow`: verify checkpoint-restart skips re-fetch
     - `test_setup_cli_help`: CLI help output verification
     - `test_info_cli`: info command with config display
   - All tests use tmp_path fixtures for isolation
   - No external API calls (mocked mygene responses)

4. **.gitignore**
   - Excludes data/, *.duckdb, *.duckdb.wal
   - Python artifacts: __pycache__, *.pyc, *.egg-info, dist/, build/
   - Testing: .pytest_cache/, .coverage, htmlcov/
   - Provenance files (not in data/)
   - Virtual environment: .venv/

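
The checkpoint-restart branch of the setup flow can be sketched roughly as follows. This is a minimal illustration, not the code in `setup_cmd.py`: it uses the standard-library `sqlite3` module as a stand-in for DuckDB, and `run_setup`, `fake_fetch`, and the table schema are hypothetical names chosen for the example. Only `has_checkpoint` and the `gene_universe` table name come from this summary.

```python
import sqlite3
from typing import Callable, Iterable, List


def has_checkpoint(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if `table` already exists; its presence is the checkpoint."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def run_setup(
    conn: sqlite3.Connection,
    fetch_genes: Callable[[], Iterable[str]],
    force: bool = False,
) -> str:
    """Fetch and persist the gene universe unless a checkpoint exists."""
    if has_checkpoint(conn, "gene_universe") and not force:
        return "skipped"  # checkpoint hit: the expensive fetch is not repeated

    genes = list(fetch_genes())  # the expensive step in the real pipeline
    conn.execute("DROP TABLE IF EXISTS gene_universe")
    conn.execute("CREATE TABLE gene_universe (gene_id TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO gene_universe (gene_id) VALUES (?)",
        [(gene,) for gene in genes],
    )
    conn.commit()
    return "fetched"


fetch_calls: List[int] = []


def fake_fetch() -> List[str]:
    # Hypothetical stand-in for the mygene fetch; records each invocation.
    fetch_calls.append(1)
    return ["ENSG00000000001", "ENSG00000000002"]  # placeholder IDs


conn = sqlite3.connect(":memory:")
first = run_setup(conn, fake_fetch)               # no checkpoint yet: fetches
second = run_setup(conn, fake_fetch)              # checkpoint present: skips
forced = run_setup(conn, fake_fetch, force=True)  # --force: fetches again
row_count = conn.execute("SELECT COUNT(*) FROM gene_universe").fetchone()[0]
```

Treating the table's existence as the checkpoint keeps the restart logic stateless: there is no separate marker file that can drift out of sync with the stored data.
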
## Tests

**50 tests total (49 passed, 1 skipped):**

### Integration Tests (6 tests, all passed)
- `test_config_to_store_roundtrip`: Load config, create store, save/load DataFrame, verify roundtrip
- `test_config_to_provenance`: Load config, create provenance, record steps, save/load sidecar, verify config_hash
- `test_full_setup_flow_mocked`: Full setup flow with mocked mygene (5 genes), verify DuckDB table, provenance
- `test_checkpoint_skip_flow`: Create checkpoint, verify second run skips fetch
- `test_setup_cli_help`: CLI help shows --force and checkpoint info
- `test_info_cli`: Info command displays version, config hash, data sources

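
The provenance roundtrip that `test_config_to_provenance` exercises (record steps, save the sidecar, reload it, compare `config_hash`) might look like the sketch below. Everything here is an assumption made for illustration: `SidecarTracker` is a hypothetical, much-reduced stand-in for `ProvenanceTracker`, the hash is taken as SHA-256 over sorted-key JSON, and the config keys and sidecar filename are invented. The real hashing scheme and sidecar layout are not described in this summary.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_hash(config: dict) -> str:
    """Hash a config deterministically (SHA-256 of sorted-key JSON; an assumption)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SidecarTracker:
    """Hypothetical stand-in for ProvenanceTracker, reduced to the roundtrip."""

    def __init__(self, config: dict) -> None:
        self.record = {"config_hash": config_hash(config), "steps": []}

    def record_step(self, name: str, **details) -> None:
        self.record["steps"].append({"name": name, "details": details})

    def save_sidecar(self, path: Path) -> None:
        path.write_text(json.dumps(self.record, indent=2), encoding="utf-8")


# Illustrative config; the key names are assumptions, the versions are from `info`.
config = {"versions": {"ensembl_release": 113, "gnomad_version": "v4.1"}}

tracker = SidecarTracker(config)
tracker.record_step("fetch_gene_universe", gene_count=5)
tracker.record_step("map_ids", hgnc_success_rate=1.0)

with tempfile.TemporaryDirectory() as tmp:
    sidecar = Path(tmp) / "gene_universe.provenance.json"  # hypothetical name
    tracker.save_sidecar(sidecar)
    loaded = json.loads(sidecar.read_text(encoding="utf-8"))
```

Sorting keys before hashing makes the hash independent of the order in which the YAML loader happens to produce the mapping, so two equivalent configs yield the same value.
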
### All Tests Summary
- Config tests: 5 passed
- API client tests: 5 passed
- Gene mapping tests: 15 passed
- Persistence tests: 12 passed, 1 skipped (pandas)
- Integration tests: 6 passed

## Verification Results

All plan verification steps passed:

```bash
# 1. All tests pass
$ pytest tests/ -v
========================= 49 passed, 1 skipped, 1 warning in 0.42s =========================

# 2. CLI help works
$ usher-pipeline --help
Usage: usher-pipeline [OPTIONS] COMMAND [ARGS]...
Commands:
  info   Display pipeline information and configuration summary.
  setup  Initialize pipeline data infrastructure.

# 3. Info command works
$ usher-pipeline info
Usher Pipeline v0.1.0
Config: config/default.yaml
Config Hash: ddbb5195738ac354...
Data Source Versions:
  Ensembl Release: 113
  gnomAD Version: v4.1
  GTEx Version: v8
  HPA Version: 23.0
```

## Deviations from Plan

None - plan executed exactly as written.

## Task Execution Log

### Task 1: Create CLI entry point with setup command
**Status:** Complete
**Duration:** ~3 minutes
**Commit:** f33b048

**Actions:**
1. Created src/usher_pipeline/cli/ package
2. Implemented main.py with click command group:
   - Global options: --config (default config/default.yaml), --verbose
   - info command: displays version, config hash, data sources, paths, API config
   - Registers setup command
3. Implemented setup_cmd.py with full orchestration:
   - Load config, create store/provenance
   - Checkpoint detection: has_checkpoint('gene_universe')
   - Fetch gene universe (mygene) with count validation
   - Map IDs (Ensembl -> HGNC + UniProt) with batch queries
   - Validate mapping (min 90% HGNC success rate)
   - Save to DuckDB with provenance sidecar
   - Colored output with status indicators
   - Resource cleanup in finally block
4. Updated pyproject.toml: fixed entry point to usher_pipeline.cli.main:cli
5. Created .gitignore with data/, *.duckdb, build artifacts

**Files created:** 4 files (cli/__init__.py, cli/main.py, cli/setup_cmd.py, .gitignore)
**Files modified:** 1 file (pyproject.toml)

**Key features:**
- Checkpoint-restart: skips expensive fetch if data exists
- Validation gates: enforces data quality thresholds
- Provenance tracking: captures all setup steps
- Colored CLI output with clear status messages

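
The mapping validation gate (min 90% HGNC success rate) comes down to a ratio check. The sketch below is one way to express it, under stated assumptions: `validate_mapping`, `MappingReport`, and the input shape (Ensembl ID mapped to an HGNC ID or `None`) are hypothetical and do not reflect the actual interface in the gene mapping module. The IDs are placeholders.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MappingReport:
    total: int
    mapped: int
    success_rate: float
    passed: bool


def validate_mapping(
    hgnc_by_ensembl: Mapping[str, Optional[str]],
    min_success_rate: float = 0.90,  # threshold stated in this summary
) -> MappingReport:
    """Pass only if enough Ensembl IDs resolved to an HGNC ID."""
    total = len(hgnc_by_ensembl)
    mapped = sum(1 for hgnc in hgnc_by_ensembl.values() if hgnc)
    rate = mapped / total if total else 0.0  # an empty input fails the gate
    return MappingReport(total, mapped, rate, rate >= min_success_rate)


# Placeholder IDs: 10 genes, of which 9 (then 8) have an HGNC mapping.
nine_of_ten = {f"ENSG{i:011d}": (f"HGNC:{i}" if i < 9 else None) for i in range(10)}
eight_of_ten = {f"ENSG{i:011d}": (f"HGNC:{i}" if i < 8 else None) for i in range(10)}

report_ok = validate_mapping(nine_of_ten)
report_fail = validate_mapping(eight_of_ten)
```

Returning a report object rather than raising lets the CLI decide how to present the result, for example as a green, yellow, or red status line.
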
### Task 2: Create integration tests verifying module wiring
**Status:** Complete
**Duration:** ~2 minutes
**Commit:** e4d71d0

**Actions:**
1. Created tests/test_integration.py with 6 tests
2. Mock data setup:
   - MOCK_GENES: 5 Ensembl IDs
   - MOCK_MYGENE_QUERY_RESPONSE: 5 genes with symbols
   - MOCK_MYGENE_QUERYMANY_RESPONSE: 5 genes with HGNC + UniProt
3. Test fixtures:
   - test_config: creates temp config with tmp_path for isolation
4. Integration tests:
   - Config -> PipelineStore -> save/load roundtrip
   - Config -> ProvenanceTracker -> sidecar creation
   - Full setup flow with mocked mygene (fetch, map, validate, save, provenance)
   - Checkpoint-restart verification
   - CLI help and info commands
5. All tests pass with mocked API (no external dependencies)

**Files created:** 1 file (test_integration.py)

**Key features:**
- Mocked mygene API calls (no rate limits, reproducible)
- Temporary paths for isolation (no pollution)
- Verifies cross-module wiring works correctly
- Fast execution (<1s for all 6 tests)

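
The mocking approach can be illustrated with `unittest.mock`, as in the sketch below. It is not the content of `tests/test_integration.py`: the fixture values are placeholders (two genes rather than five), `map_ids` is a hypothetical stand-in for the real mapper, and the response shape (a list of hits carrying `query`, `symbol`, and `HGNC` keys) is an assumption about what the mocked mygene client returns.

```python
from unittest.mock import MagicMock

# Placeholder fixtures; the real test module defines five genes.
MOCK_GENES = ["ENSG00000000001", "ENSG00000000002"]
MOCK_MYGENE_QUERYMANY_RESPONSE = [
    {"query": "ENSG00000000001", "symbol": "GENE1", "HGNC": "1"},
    {"query": "ENSG00000000002", "symbol": "GENE2", "HGNC": "2"},
]


def map_ids(client, ensembl_ids):
    """Hypothetical mapper: one batched lookup, keyed by the queried ID."""
    hits = client.querymany(ensembl_ids, scopes="ensembl.gene", fields="symbol,HGNC")
    return {hit["query"]: hit.get("HGNC") for hit in hits}


# The mock replaces the network client, so no request ever leaves the process.
client = MagicMock()
client.querymany.return_value = MOCK_MYGENE_QUERYMANY_RESPONSE

mapped = map_ids(client, MOCK_GENES)

# Verify the wiring as well as the result: exactly one batched call was made.
client.querymany.assert_called_once_with(
    MOCK_GENES, scopes="ensembl.gene", fields="symbol,HGNC"
)
```

Asserting on the call itself, in addition to the return value, is what makes this a wiring test: it fails if the mapper stops batching or changes the fields it requests.
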
## Success Criteria Verification

- [x] CLI entry point works with setup and info subcommands
- [x] Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
- [x] Checkpoint-restart works: existing DuckDB data skips re-downloading
- [x] All integration tests pass verifying cross-module wiring
- [x] Full test suite (all files) passes: `pytest tests/ -v` (49 passed, 1 skipped)

## Must-Haves Verification

**Truths:**
- [x] CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands
- [x] Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance
- [x] All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance
- [x] Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching

**Artifacts:**
- [x] src/usher_pipeline/cli/main.py provides "CLI entry point with click command group" containing "def cli"
- [x] src/usher_pipeline/cli/setup_cmd.py provides "Setup command wiring config, gene mapping, persistence, provenance" containing "def setup"
- [x] tests/test_integration.py provides "Integration tests verifying module wiring" containing "test_"

**Key Links:**
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/config/loader.py via "loads pipeline config from YAML" (pattern: `load_config`)
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/gene_mapping/mapper.py via "maps gene IDs using GeneMapper" (pattern: `GeneMapper`)
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/duckdb_store.py via "saves results to DuckDB with checkpoint" (pattern: `PipelineStore`)
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/provenance.py via "tracks provenance for setup step" (pattern: `ProvenanceTracker`)
- [x] src/usher_pipeline/cli/main.py -> src/usher_pipeline/cli/setup_cmd.py via "registers setup as click subcommand" (pattern: `cli\.add_command`)

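
Each key link above pairs a source file with a regex pattern, so a link check can be as small as a single `re.search` over the file's text. The sketch below assumes that is how the patterns are meant to be applied. `has_key_link` is a hypothetical helper, and the file contents are invented for the demonstration rather than copied from `main.py`.

```python
import re
import tempfile
from pathlib import Path


def has_key_link(source_file: Path, pattern: str) -> bool:
    """Return True if `pattern` (a regex) occurs anywhere in the source file."""
    text = source_file.read_text(encoding="utf-8")
    return re.search(pattern, text) is not None


with tempfile.TemporaryDirectory() as tmp:
    # Invented file contents, just enough to exercise the check.
    main_py = Path(tmp) / "main.py"
    main_py.write_text(
        "from .setup_cmd import setup\n\ncli.add_command(setup)\n",
        encoding="utf-8",
    )

    link_found = has_key_link(main_py, r"cli\.add_command")
    link_missing = has_key_link(main_py, r"GeneMapper")
```

The escaped dot in `cli\.add_command` matters: an unescaped `.` would also match unrelated text such as `cli_add_command`.
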
## Impact on Roadmap

**Phase 01 Data Infrastructure Complete:**

All 4 plans in Phase 01 are now complete:
- 01-01: Python package scaffold, config system, base API client
- 01-02: Gene ID mapping and validation
- 01-03: DuckDB persistence and provenance tracking
- 01-04: CLI integration and end-to-end testing

**Ready for Phase 02 (Evidence Layer Ingestion):**
- CLI provides interface for running pipeline operations
- Config system defines data source versions
- Gene universe can be fetched and validated
- ID mapping handles Ensembl -> HGNC + UniProt
- DuckDB checkpoint-restart enables incremental processing
- Provenance tracking captures all processing steps
- Integration tests prove modules work together

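
The gene universe checks named in this summary (count, format, duplicates) could be combined along the lines of the sketch below. The function name, the error-list return value, and the exact ID pattern are assumptions: the pattern encodes the usual human Ensembl gene ID shape (`ENSG` plus 11 digits, no version suffix), and the 19k-22k bounds are taken from the setup flow description.

```python
import re
from collections import Counter
from typing import Iterable, List

# Human Ensembl gene IDs look like ENSG00000139618 (no version suffix assumed).
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def validate_gene_universe(
    gene_ids: Iterable[str],
    min_count: int = 19_000,
    max_count: int = 22_000,
) -> List[str]:
    """Return a list of problems; an empty list means the universe passes."""
    ids = list(gene_ids)
    errors: List[str] = []

    if not min_count <= len(ids) <= max_count:
        errors.append(f"count {len(ids)} outside [{min_count}, {max_count}]")

    malformed = [g for g in ids if not ENSEMBL_GENE_ID.fullmatch(g)]
    if malformed:
        errors.append(f"{len(malformed)} malformed IDs, e.g. {malformed[0]}")

    duplicated = [g for g, n in Counter(ids).items() if n > 1]
    if duplicated:
        errors.append(f"{len(duplicated)} duplicated IDs, e.g. {duplicated[0]}")

    return errors


# Synthetic IDs for demonstration only.
good = [f"ENSG{i:011d}" for i in range(20_000)]
ok_errors = validate_gene_universe(good)

# Five valid IDs, one gene symbol in place of an ID, and one repeated ID.
bad = good[:5] + ["BRCA2", good[0]]
bad_errors = validate_gene_universe(bad, min_count=1)
```

Collecting every failure before returning, rather than stopping at the first, gives a single run enough detail to diagnose a bad fetch.
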
## Next Steps

Phase 02 will add evidence layer ingestion commands to the CLI:
- `usher-pipeline fetch gnomad` - download gnomAD data
- `usher-pipeline fetch expression` - download GTEx/HPA data
- `usher-pipeline fetch annotations` - download GO/HPO annotations

Each command will use the same checkpoint-restart pattern established in setup.

## Self-Check: PASSED

**Files verified:**
```bash
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/__init__.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/main.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/setup_cmd.py
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_integration.py
FOUND: /Users/gbanyan/Project/usher-exploring/.gitignore
```

**Commits verified:**
```bash
FOUND: f33b048 (Task 1)
FOUND: e4d71d0 (Task 2)
```

All files and commits exist as documented.