docs(01-04): complete CLI integration and end-to-end testing plan

- CLI entry point with setup and info commands
- Full infrastructure integration verified
- 6 integration tests with mocked APIs
- Phase 01 Data Infrastructure complete
2026-02-11 16:45:12 +08:00
parent e4d71d0790
commit 102dcdbe84
2 changed files with 305 additions and 10 deletions


@@ -10,24 +10,24 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 ## Current Position
 Phase: 1 of 6 (Data Infrastructure)
-Plan: 3 of 4 in current phase
-Status: Executing
-Last activity: 2026-02-11 — Completed 01-02-PLAN.md (Gene ID mapping and validation)
-Progress: [███░░░░░░░] 25.0% (1/6 phases planned, 2/4 plans in phase 1 complete)
+Plan: 4 of 4 in current phase
+Status: Complete
+Last activity: 2026-02-11 — Completed 01-04-PLAN.md (CLI integration and end-to-end testing)
+Progress: [█████░░░░░] 16.7% (1/6 phases complete, 4/4 plans in phase 1 complete)
 ## Performance Metrics
 **Velocity:**
-- Total plans completed: 2
-- Average duration: 3 min
-- Total execution time: 0.12 hours
+- Total plans completed: 4
+- Average duration: 3.5 min
+- Total execution time: 0.23 hours
 **By Phase:**
 | Phase | Plans | Total | Avg/Plan |
 |-------|-------|-------|----------|
-| 01 - Data Infrastructure | 2/4 | 7 min | 3.5 min/plan |
+| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
 ## Accumulated Context
@@ -48,6 +48,9 @@ Recent decisions affecting current work:
 - [01-02]: Mock mygene in tests (avoids rate limits, ensures reproducibility)
 - [01-03]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics)
 - [01-03]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern)
+- [01-04]: Click for CLI framework (standard Python CLI library with excellent UX)
+- [01-04]: Setup command uses checkpoint-restart pattern (gene universe fetch can take minutes)
+- [01-04]: Mock mygene in integration tests (avoids external API dependency, reproducible)
 ### Pending Todos
@@ -60,5 +63,5 @@ None yet.
 ## Session Continuity
 Last session: 2026-02-11 - Plan execution
-Stopped at: Completed 01-02-PLAN.md
-Resume file: .planning/phases/01-data-infrastructure/01-02-SUMMARY.md
+Stopped at: Completed 01-04-PLAN.md (Phase 01 complete)
+Resume file: .planning/phases/01-data-infrastructure/01-04-SUMMARY.md


@@ -0,0 +1,292 @@
---
phase: 01-data-infrastructure
plan: 04
subsystem: integration
tags: [cli, integration, wiring, testing, click]
dependency_graph:
  requires:
    - phase: 01-01
      provides: ["Python package scaffold", "Pydantic v2 config system", "CachedAPIClient"]
    - phase: 01-02
      provides: ["Gene universe definition", "Batch ID mapper", "Validation gates"]
    - phase: 01-03
      provides: ["DuckDB checkpoint-restart storage", "Provenance tracking"]
  provides:
    - CLI entry point with setup and info commands
    - Full infrastructure integration (config -> fetch -> map -> validate -> persist -> provenance)
    - Checkpoint-restart capability for expensive operations
    - Integration test suite verifying cross-module wiring
  affects:
    - All future pipeline operations (CLI is primary interface)
tech_stack:
  added:
    - click for CLI framework
  patterns:
    - Click command group with global options (--config, --verbose)
    - Colored CLI output with status indicators
    - Context manager for resource cleanup
    - Mock-based integration testing
key_files:
  created:
    - src/usher_pipeline/cli/__init__.py: CLI package exports
    - src/usher_pipeline/cli/main.py: Click command group with info command
    - src/usher_pipeline/cli/setup_cmd.py: Setup command orchestrating full flow
    - tests/test_integration.py: 6 integration tests
    - .gitignore: Data files, build artifacts, provenance exclusions
  modified:
    - pyproject.toml: Fixed CLI entry point to usher_pipeline.cli.main:cli
decisions:
  - decision: "Click for CLI framework"
    rationale: "Standard Python CLI library with excellent UX features (colored output, help generation, subcommands)"
    alternatives: ["argparse (rejected - verbose)", "typer (rejected - less mature)"]
  - decision: "Setup command uses checkpoint-restart pattern"
    rationale: "Gene universe fetch can take minutes; checkpoint enables fast restart without re-fetching"
    impact: "Setup detects existing DuckDB tables and skips re-fetch unless --force flag used"
  - decision: "Mock mygene in integration tests"
    rationale: "Avoids external API dependency, ensures reproducible tests, faster execution"
    impact: "All 6 integration tests run in <1s with mocked responses"
metrics:
  duration_minutes: 5
  tasks_completed: 2
  files_created: 5
  files_modified: 1
  tests_added: 6
  commits: 2
completed_date: "2026-02-11"
---
# Phase 01 Plan 04: CLI Integration and End-to-End Testing Summary
**One-liner:** Click-based CLI with setup command orchestrating full infrastructure flow (config -> fetch gene universe -> map IDs -> validate -> DuckDB persistence -> provenance) and 6 integration tests verifying cross-module wiring with mocked APIs.
## What Was Built
Wired all infrastructure modules together with a CLI interface and integration tests:
1. **CLI Entry Point**
   - Click command group with global options (--config, --verbose)
   - `info` command: displays pipeline version, config hash, data source versions, paths, API config
   - `setup` command: orchestrates full infrastructure flow
   - Colored output with status indicators (green=OK, yellow=warn, red=fail)
   - Entry point: `usher-pipeline` binary installed with package
2. **Setup Command Flow**
   - Load config from YAML
   - Create PipelineStore and ProvenanceTracker from config
   - Check for existing checkpoint (gene_universe table in DuckDB)
   - If checkpoint exists and no --force: skip fetch, display summary
   - If no checkpoint or --force:
     - Fetch protein-coding genes from mygene (19k-22k genes)
     - Validate gene universe (count, format, duplicates)
     - Map Ensembl IDs to HGNC + UniProt via batch queries
     - Validate mapping quality (min 90% HGNC success rate)
     - Save to DuckDB as gene_universe table
     - Record provenance steps
   - Save provenance sidecar JSON
   - Display summary with counts, rates, paths
3. **Integration Test Suite**
   - 6 tests verifying module wiring with mocked mygene API:
     - `test_config_to_store_roundtrip`: config -> PipelineStore -> save/load
     - `test_config_to_provenance`: config -> ProvenanceTracker -> sidecar
     - `test_full_setup_flow_mocked`: full setup with 5 mocked genes
     - `test_checkpoint_skip_flow`: verify checkpoint-restart skips re-fetch
     - `test_setup_cli_help`: CLI help output verification
     - `test_info_cli`: info command with config display
   - All tests use tmp_path fixtures for isolation
   - No external API calls (mocked mygene responses)
4. **.gitignore**
   - Excludes data/, *.duckdb, *.duckdb.wal
   - Python artifacts: __pycache__, *.pyc, *.egg-info, dist/, build/
   - Testing: .pytest_cache/, .coverage, htmlcov/
   - Provenance files (not in data/)
   - Virtual environment: .venv/
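
The checkpoint-restart branch of the setup flow can be sketched as follows. This is a minimal illustration and not the code in `setup_cmd.py`: it uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies, and `run_setup`, `fetch_genes`, and the single-column table layout are hypothetical. Only the `has_checkpoint` name and the `gene_universe` table come from this summary.

```python
import sqlite3
from typing import Callable, List


def has_checkpoint(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if `table` already exists (sqlite3 stand-in for DuckDB)."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def run_setup(
    conn: sqlite3.Connection,
    fetch_genes: Callable[[], List[str]],
    force: bool = False,
) -> str:
    """Fetch and persist the gene universe unless a checkpoint exists."""
    if has_checkpoint(conn, "gene_universe") and not force:
        return "skipped"  # checkpoint found: reuse the stored data

    genes = fetch_genes()  # the expensive step (hypothetical callable)
    conn.execute("DROP TABLE IF EXISTS gene_universe")
    conn.execute("CREATE TABLE gene_universe (ensembl_id TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO gene_universe VALUES (?)", [(g,) for g in genes]
    )
    conn.commit()
    return "fetched"
```

Calling `run_setup` twice on the same connection triggers only one fetch; passing `force=True` fetches again and replaces the table.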
## Tests
**50 tests total (49 passed, 1 skipped):**
### Integration Tests (6 tests, all passed)
- `test_config_to_store_roundtrip`: Load config, create store, save/load DataFrame, verify roundtrip
- `test_config_to_provenance`: Load config, create provenance, record steps, save/load sidecar, verify config_hash
- `test_full_setup_flow_mocked`: Full setup flow with mocked mygene (5 genes), verify DuckDB table, provenance
- `test_checkpoint_skip_flow`: Create checkpoint, verify second run skips fetch
- `test_setup_cli_help`: CLI help shows --force and checkpoint info
- `test_info_cli`: Info command displays version, config hash, data sources
### All Tests Summary
- Config tests: 5 passed
- API client tests: 5 passed
- Gene mapping tests: 15 passed
- Persistence tests: 12 passed, 1 skipped (pandas)
- Integration tests: 6 passed
## Verification Results
All plan verification steps passed:
```bash
# 1. All tests pass
$ pytest tests/ -v
========================= 49 passed, 1 skipped, 1 warning in 0.42s =========================
# 2. CLI help works
$ usher-pipeline --help
Usage: usher-pipeline [OPTIONS] COMMAND [ARGS]...
Commands:
info Display pipeline information and configuration summary.
setup Initialize pipeline data infrastructure.
# 3. Info command works
$ usher-pipeline info
Usher Pipeline v0.1.0
Config: config/default.yaml
Config Hash: ddbb5195738ac354...
Data Source Versions:
Ensembl Release: 113
gnomAD Version: v4.1
GTEx Version: v8
HPA Version: 23.0
```
## Deviations from Plan
None - plan executed exactly as written.
## Task Execution Log
### Task 1: Create CLI entry point with setup command
**Status:** Complete
**Duration:** ~3 minutes
**Commit:** f33b048
**Actions:**
1. Created src/usher_pipeline/cli/ package
2. Implemented main.py with click command group:
   - Global options: --config (default config/default.yaml), --verbose
   - info command: displays version, config hash, data sources, paths, API config
   - Registers setup command
3. Implemented setup_cmd.py with full orchestration:
   - Load config, create store/provenance
   - Checkpoint detection: has_checkpoint('gene_universe')
   - Fetch gene universe (mygene) with count validation
   - Map IDs (Ensembl -> HGNC + UniProt) with batch queries
   - Validate mapping (min 90% HGNC success rate)
   - Save to DuckDB with provenance sidecar
   - Colored output with status indicators
   - Resource cleanup in finally block
4. Updated pyproject.toml: fixed entry point to usher_pipeline.cli.main:cli
5. Created .gitignore with data/, *.duckdb, build artifacts
**Files touched:** 5 files (created: cli/__init__.py, cli/main.py, cli/setup_cmd.py, .gitignore; modified: pyproject.toml)
**Key features:**
- Checkpoint-restart: skips expensive fetch if data exists
- Validation gates: enforces data quality thresholds
- Provenance tracking: captures all setup steps
- Colored CLI output with clear status messages
### Task 2: Create integration tests verifying module wiring
**Status:** Complete
**Duration:** ~2 minutes
**Commit:** e4d71d0
**Actions:**
1. Created tests/test_integration.py with 6 tests
2. Mock data setup:
   - MOCK_GENES: 5 Ensembl IDs
   - MOCK_MYGENE_QUERY_RESPONSE: 5 genes with symbols
   - MOCK_MYGENE_QUERYMANY_RESPONSE: 5 genes with HGNC + UniProt
3. Test fixtures:
   - test_config: creates temp config with tmp_path for isolation
4. Integration tests:
   - Config -> PipelineStore -> save/load roundtrip
   - Config -> ProvenanceTracker -> sidecar creation
   - Full setup flow with mocked mygene (fetch, map, validate, save, provenance)
   - Checkpoint-restart verification
   - CLI help and info commands
5. All tests pass with mocked API (no external dependencies)
**Files created:** 1 file (test_integration.py)
**Key features:**
- Mocked mygene API calls (no rate limits, reproducible)
- Temporary paths for isolation (no pollution)
- Verifies cross-module wiring works correctly
- Fast execution (<1s for all 6 tests)
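
The mocking approach can be illustrated with a small, self-contained test. Everything here is hypothetical: `map_ids` is a stand-in helper rather than the project's `GeneMapper`, the gene IDs are placeholders, and the response shape is an assumption modeled on what mygene's `querymany` typically returns.

```python
from unittest.mock import MagicMock

# Placeholder payload in the assumed shape of a mygene querymany response.
MOCK_QUERYMANY_RESPONSE = [
    {"query": "ENSG_A", "symbol": "GENE_A", "HGNC": "1"},
    {"query": "ENSG_B", "symbol": "GENE_B", "HGNC": "2"},
    {"query": "ENSG_C", "notfound": True},
]


def map_ids(client, ensembl_ids):
    """Map Ensembl IDs to HGNC IDs via a mygene-like client (hypothetical)."""
    hits = client.querymany(
        ensembl_ids,
        scopes="ensembl.gene",
        fields="symbol,HGNC,uniprot",
        species="human",
    )
    # Drop entries the service could not resolve.
    return {h["query"]: h.get("HGNC") for h in hits if not h.get("notfound")}


def test_map_ids_uses_mocked_client():
    client = MagicMock()  # no network access: the client is fully mocked
    client.querymany.return_value = MOCK_QUERYMANY_RESPONSE
    mapping = map_ids(client, ["ENSG_A", "ENSG_B", "ENSG_C"])
    assert mapping == {"ENSG_A": "1", "ENSG_B": "2"}
    client.querymany.assert_called_once()
```

Because the client is injected, the test never touches the real API, which is what keeps the suite fast and reproducible.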
## Success Criteria Verification
- [x] CLI entry point works with setup and info subcommands
- [x] Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
- [x] Checkpoint-restart works: existing DuckDB data skips re-downloading
- [x] All integration tests pass verifying cross-module wiring
- [x] Full test suite (all files) passes: `pytest tests/ -v` (49 passed, 1 skipped)
## Must-Haves Verification
**Truths:**
- [x] CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands
- [x] Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance
- [x] All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance
- [x] Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching
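
The mapping validation gate (minimum 90% HGNC success rate) could be expressed along these lines. This is a sketch under assumptions: the function names, the dict-based mapping, and the choice to raise `ValueError` are illustrative, and only the 90% threshold is taken from this summary.

```python
from typing import Mapping, Optional

MIN_HGNC_SUCCESS_RATE = 0.90  # threshold stated in this summary


def hgnc_success_rate(mapping: Mapping[str, Optional[str]]) -> float:
    """Fraction of Ensembl IDs that received an HGNC ID."""
    if not mapping:
        return 0.0
    mapped = sum(1 for hgnc in mapping.values() if hgnc)
    return mapped / len(mapping)


def validate_mapping(
    mapping: Mapping[str, Optional[str]],
    min_rate: float = MIN_HGNC_SUCCESS_RATE,
) -> float:
    """Return the success rate, or raise ValueError if it is below min_rate."""
    rate = hgnc_success_rate(mapping)
    if rate < min_rate:
        raise ValueError(
            f"HGNC success rate {rate:.1%} is below the required {min_rate:.0%}"
        )
    return rate
```

Failing loudly at this gate stops a poorly mapped gene universe from being persisted as a checkpoint that later runs would silently reuse.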
**Artifacts:**
- [x] src/usher_pipeline/cli/main.py provides "CLI entry point with click command group" containing "def cli"
- [x] src/usher_pipeline/cli/setup_cmd.py provides "Setup command wiring config, gene mapping, persistence, provenance" containing "def setup"
- [x] tests/test_integration.py provides "Integration tests verifying module wiring" containing "test_"
**Key Links:**
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/config/loader.py via "loads pipeline config from YAML" (pattern: `load_config`)
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/gene_mapping/mapper.py via "maps gene IDs using GeneMapper" (pattern: `GeneMapper`)
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/duckdb_store.py via "saves results to DuckDB with checkpoint" (pattern: `PipelineStore`)
- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/provenance.py via "tracks provenance for setup step" (pattern: `ProvenanceTracker`)
- [x] src/usher_pipeline/cli/main.py -> src/usher_pipeline/cli/setup_cmd.py via "registers setup as click subcommand" (pattern: `cli\.add_command`)
## Impact on Roadmap
**Phase 01 Data Infrastructure Complete:**
All 4 plans in Phase 01 are now complete:
- 01-01: Python package scaffold, config system, base API client
- 01-02: Gene ID mapping and validation
- 01-03: DuckDB persistence and provenance tracking
- 01-04: CLI integration and end-to-end testing
**Ready for Phase 02 (Evidence Layer Ingestion):**
- CLI provides interface for running pipeline operations
- Config system defines data source versions
- Gene universe can be fetched and validated
- ID mapping handles Ensembl -> HGNC + UniProt
- DuckDB checkpoint-restart enables incremental processing
- Provenance tracking captures all processing steps
- Integration tests prove modules work together
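
For readers unfamiliar with provenance sidecars, the idea can be sketched as below. This is not the project's `ProvenanceTracker`: the hashing scheme, the `.provenance.json` suffix, and the record fields are assumptions chosen for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def config_hash(config: dict) -> str:
    """SHA-256 over a canonical JSON dump, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def write_sidecar(output_path: Path, config: dict, steps: list) -> Path:
    """Write `<output name>.provenance.json` next to the output file."""
    sidecar = output_path.with_name(output_path.name + ".provenance.json")
    record = {
        "config_hash": config_hash(config),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "steps": steps,  # e.g. one entry per recorded processing step
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Keeping the record beside the output means the metadata travels with the data file, and the config hash makes it possible to tell which configuration produced it.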
## Next Steps
Phase 02 will add evidence layer ingestion commands to the CLI:
- `usher-pipeline fetch gnomad` - download gnomAD data
- `usher-pipeline fetch expression` - download GTEx/HPA data
- `usher-pipeline fetch annotations` - download GO/HPO annotations
Each command will use the same checkpoint-restart pattern established in setup.
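
One possible shape for these planned commands is a nested Click group. The sketch below is purely hypothetical, since Phase 02 has not been implemented: the command bodies are placeholders, the `--force` flag is assumed by analogy with `setup`, and `cli` is a stand-in for the existing top-level group.

```python
import click


@click.group()
def cli() -> None:
    """Top-level group (stand-in for the existing usher-pipeline group)."""


@cli.group()
def fetch() -> None:
    """Download evidence-layer data."""


def _make_fetch_command(name: str, description: str) -> None:
    """Attach one placeholder subcommand to the `fetch` group."""

    @fetch.command(name=name, help=description)
    @click.option("--force", is_flag=True,
                  help="Re-download even if a checkpoint exists.")
    def _cmd(force: bool) -> None:
        # A real command would check its checkpoint before downloading.
        click.echo(f"fetch {name} (force={force})")


for _name, _desc in [
    ("gnomad", "Download gnomAD data."),
    ("expression", "Download GTEx/HPA data."),
    ("annotations", "Download GO/HPO annotations."),
]:
    _make_fetch_command(_name, _desc)
```

With this layout, `usher-pipeline fetch gnomad --force` would route to the `gnomad` subcommand while sharing any options defined on the parent groups.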
## Self-Check: PASSED
**Files verified:**
```bash
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/__init__.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/main.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/setup_cmd.py
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_integration.py
FOUND: /Users/gbanyan/Project/usher-exploring/.gitignore
```
**Commits verified:**
```bash
FOUND: f33b048 (Task 1)
FOUND: e4d71d0 (Task 2)
```
All files and commits exist as documented.