docs(01-04): complete CLI integration and end-to-end testing plan

- CLI entry point with setup and info commands
- Full infrastructure integration verified
- 6 integration tests with mocked APIs
- Phase 01 Data Infrastructure complete
2026-02-11 16:45:12 +08:00
parent e4d71d0790
commit 102dcdbe84
2 changed files with 305 additions and 10 deletions
@@ -10,24 +10,24 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 ## Current Position
 Phase: 1 of 6 (Data Infrastructure)
-Plan: 3 of 4 in current phase
+Plan: 4 of 4 in current phase
-Status: Executing
+Status: Complete
-Last activity: 2026-02-11 — Completed 01-02-PLAN.md (Gene ID mapping and validation)
+Last activity: 2026-02-11 — Completed 01-04-PLAN.md (CLI integration and end-to-end testing)
-Progress: [███░░░░░░░] 25.0% (1/6 phases planned, 2/4 plans in phase 1 complete)
+Progress: [█████░░░░░] 16.7% (1/6 phases complete, 4/4 plans in phase 1 complete)
 ## Performance Metrics
 **Velocity:**
-- Total plans completed: 2
+- Total plans completed: 4
-- Average duration: 3 min
+- Average duration: 3.5 min
-- Total execution time: 0.12 hours
+- Total execution time: 0.23 hours
 **By Phase:**
 | Phase | Plans | Total | Avg/Plan |
 |-------|-------|-------|----------|
-| 01 - Data Infrastructure | 2/4 | 7 min | 3.5 min/plan |
+| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
 ## Accumulated Context
@@ -48,6 +48,9 @@ Recent decisions affecting current work:
 - [01-02]: Mock mygene in tests (avoids rate limits, ensures reproducibility)
 - [01-03]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics)
 - [01-03]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern)
+- [01-04]: Click for CLI framework (standard Python CLI library with excellent UX)
+- [01-04]: Setup command uses checkpoint-restart pattern (gene universe fetch can take minutes)
+- [01-04]: Mock mygene in integration tests (avoids external API dependency, reproducible)
 ### Pending Todos
@@ -60,5 +63,5 @@ None yet.
 ## Session Continuity
 Last session: 2026-02-11 - Plan execution
-Stopped at: Completed 01-02-PLAN.md
+Stopped at: Completed 01-04-PLAN.md (Phase 01 complete)
-Resume file: .planning/phases/01-data-infrastructure/01-02-SUMMARY.md
+Resume file: .planning/phases/01-data-infrastructure/01-04-SUMMARY.md
@@ -0,0 +1,292 @@
 ---
 phase: 01-data-infrastructure
 plan: 04
 subsystem: integration
 tags: [cli, integration, wiring, testing, click]
 dependency_graph:
  requires:
    - phase: 01-01
      provides: ["Python package scaffold", "Pydantic v2 config system", "CachedAPIClient"]
    - phase: 01-02
      provides: ["Gene universe definition", "Batch ID mapper", "Validation gates"]
    - phase: 01-03
      provides: ["DuckDB checkpoint-restart storage", "Provenance tracking"]
  provides:
    - CLI entry point with setup and info commands
    - Full infrastructure integration (config -> fetch -> map -> validate -> persist -> provenance)
    - Checkpoint-restart capability for expensive operations
    - Integration test suite verifying cross-module wiring
  affects:
    - All future pipeline operations (CLI is primary interface)
 tech_stack:
  added:
    - click for CLI framework
  patterns:
    - Click command group with global options (--config, --verbose)
    - Colored CLI output with status indicators
    - Context manager for resource cleanup
    - Mock-based integration testing
 key_files:
  created:
    - src/usher_pipeline/cli/__init__.py: CLI package exports
    - src/usher_pipeline/cli/main.py: Click command group with info command
    - src/usher_pipeline/cli/setup_cmd.py: Setup command orchestrating full flow
    - tests/test_integration.py: 6 integration tests
    - .gitignore: Data files, build artifacts, provenance exclusions
  modified:
    - pyproject.toml: Fixed CLI entry point to usher_pipeline.cli.main:cli
 decisions:
  - decision: "Click for CLI framework"
    rationale: "Standard Python CLI library with excellent UX features (colored output, help generation, subcommands)"
    alternatives: ["argparse (rejected - verbose)", "typer (rejected - less mature)"]
  - decision: "Setup command uses checkpoint-restart pattern"
    rationale: "Gene universe fetch can take minutes; checkpoint enables fast restart without re-fetching"
    impact: "Setup detects existing DuckDB tables and skips re-fetch unless --force flag used"
  - decision: "Mock mygene in integration tests"
    rationale: "Avoids external API dependency, ensures reproducible tests, faster execution"
    impact: "All 6 integration tests run in <1s with mocked responses"
 metrics:
  duration_minutes: 5
  tasks_completed: 2
  files_created: 5
  files_modified: 1
  tests_added: 6
  commits: 2
  completed_date: "2026-02-11"
 ---
 # Phase 01 Plan 04: CLI Integration and End-to-End Testing Summary
 **One-liner:** Click-based CLI with setup command orchestrating full infrastructure flow (config -> fetch gene universe -> map IDs -> validate -> DuckDB persistence -> provenance) and 6 integration tests verifying cross-module wiring with mocked APIs.
 ## What Was Built
 Wired all infrastructure modules together with a CLI interface and integration tests:
 1. **CLI Entry Point**
   - Click command group with global options (--config, --verbose)
   - `info` command: displays pipeline version, config hash, data source versions, paths, API config
   - `setup` command: orchestrates full infrastructure flow
   - Colored output with status indicators (green=OK, yellow=warn, red=fail)
   - Entry point: `usher-pipeline` console script installed with the package
 2. **Setup Command Flow**
   - Load config from YAML
   - Create PipelineStore and ProvenanceTracker from config
   - Check for existing checkpoint (gene_universe table in DuckDB)
   - If checkpoint exists and no --force: skip fetch, display summary
   - If no checkpoint or --force:
     - Fetch protein-coding genes from mygene (19k-22k genes)
     - Validate gene universe (count, format, duplicates)
     - Map Ensembl IDs to HGNC + UniProt via batch queries
     - Validate mapping quality (min 90% HGNC success rate)
     - Save to DuckDB as gene_universe table
     - Record provenance steps
     - Save provenance sidecar JSON
   - Display summary with counts, rates, paths
 3. **Integration Test Suite**
   - 6 tests verifying module wiring with mocked mygene API:
     - `test_config_to_store_roundtrip`: config -> PipelineStore -> save/load
     - `test_config_to_provenance`: config -> ProvenanceTracker -> sidecar
     - `test_full_setup_flow_mocked`: full setup with 5 mocked genes
     - `test_checkpoint_skip_flow`: verify checkpoint-restart skips re-fetch
     - `test_setup_cli_help`: CLI help output verification
     - `test_info_cli`: info command with config display
   - All tests use tmp_path fixtures for isolation
   - No external API calls (mocked mygene responses)
 4. **.gitignore**
   - Excludes data/, *.duckdb, *.duckdb.wal
   - Python artifacts: __pycache__, *.pyc, *.egg-info, dist/, build/
   - Testing: .pytest_cache/, .coverage, htmlcov/
   - Provenance files that live outside data/
   - Virtual environment: .venv/
 ## Tests
 **50 tests total (49 passed, 1 skipped):**
 ### Integration Tests (6 tests, all passed)
 - `test_config_to_store_roundtrip`: Load config, create store, save/load DataFrame, verify roundtrip
 - `test_config_to_provenance`: Load config, create provenance, record steps, save/load sidecar, verify config_hash
 - `test_full_setup_flow_mocked`: Full setup flow with mocked mygene (5 genes), verify DuckDB table, provenance
 - `test_checkpoint_skip_flow`: Create checkpoint, verify second run skips fetch
 - `test_setup_cli_help`: CLI help shows --force and checkpoint info
 - `test_info_cli`: Info command displays version, config hash, data sources
 ### All Tests Summary
 - Config tests: 5 passed
 - API client tests: 5 passed
 - Gene mapping tests: 15 passed
 - Persistence tests: 12 passed, 1 skipped (pandas)
 - Integration tests: 6 passed
 ## Verification Results
 All plan verification steps passed:
 ```bash
 # 1. All tests pass
 $ pytest tests/ -v
 ========================= 49 passed, 1 skipped, 1 warning in 0.42s =========================
 # 2. CLI help works
 $ usher-pipeline --help
 Usage: usher-pipeline [OPTIONS] COMMAND [ARGS]...
 Commands:
  info   Display pipeline information and configuration summary.
  setup  Initialize pipeline data infrastructure.
 # 3. Info command works
 $ usher-pipeline info
 Usher Pipeline v0.1.0
 Config: config/default.yaml
 Config Hash: ddbb5195738ac354...
 Data Source Versions:
  Ensembl Release: 113
  gnomAD Version:  v4.1
  GTEx Version:    v8
  HPA Version:     23.0
 ```
 ## Deviations from Plan
 None - plan executed exactly as written.
 ## Task Execution Log
 ### Task 1: Create CLI entry point with setup command
 **Status:** Complete
 **Duration:** ~3 minutes
 **Commit:** f33b048
 **Actions:**
 1. Created src/usher_pipeline/cli/ package
 2. Implemented main.py with click command group:
   - Global options: --config (default config/default.yaml), --verbose
   - info command: displays version, config hash, data sources, paths, API config
   - Registers setup command
 3. Implemented setup_cmd.py with full orchestration:
   - Load config, create store/provenance
   - Checkpoint detection: has_checkpoint('gene_universe')
   - Fetch gene universe (mygene) with count validation
   - Map IDs (Ensembl -> HGNC + UniProt) with batch queries
   - Validate mapping (min 90% HGNC success rate)
   - Save to DuckDB with provenance sidecar
   - Colored output with status indicators
   - Resource cleanup in finally block
 4. Updated pyproject.toml: fixed entry point to usher_pipeline.cli.main:cli
 5. Created .gitignore with data/, *.duckdb, build artifacts
 **Files created:** 4 files (cli/__init__.py, main.py, setup_cmd.py, .gitignore); **Files modified:** 1 file (pyproject.toml)
 **Key features:**
 - Checkpoint-restart: skips expensive fetch if data exists
 - Validation gates: enforce data quality thresholds
 - Provenance tracking: captures all setup steps
 - Colored CLI output with clear status messages
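The mapping validation gate (minimum 90% HGNC success rate) might look roughly like the sketch below. The `MappedGene` record, the `MappingGateError` exception, and both function names are hypothetical and chosen only for illustration; the project's actual validator may be structured differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MappedGene:
    """Hypothetical record for one mapped gene; field names are illustrative."""
    ensembl_id: str
    hgnc_id: Optional[str] = None
    uniprot_id: Optional[str] = None


class MappingGateError(Exception):
    """Raised when the mapping quality gate fails."""


def hgnc_success_rate(genes: List[MappedGene]) -> float:
    """Return the fraction of genes that received an HGNC ID."""
    if not genes:
        return 0.0
    mapped = sum(1 for gene in genes if gene.hgnc_id)
    return mapped / len(genes)


def check_mapping_gate(genes: List[MappedGene], min_rate: float = 0.90) -> float:
    """Return the success rate, or raise if it falls below min_rate."""
    rate = hgnc_success_rate(genes)
    if rate < min_rate:
        raise MappingGateError(
            f"HGNC success rate {rate:.1%} is below the required {min_rate:.0%}"
        )
    return rate
```

Raising an exception instead of returning a flag means a failed gate stops the setup flow before anything is written to the store.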
 ### Task 2: Create integration tests verifying module wiring
 **Status:** Complete
 **Duration:** ~2 minutes
 **Commit:** e4d71d0
 **Actions:**
 1. Created tests/test_integration.py with 6 tests
 2. Mock data setup:
   - MOCK_GENES: 5 Ensembl IDs
   - MOCK_MYGENE_QUERY_RESPONSE: 5 genes with symbols
   - MOCK_MYGENE_QUERYMANY_RESPONSE: 5 genes with HGNC + UniProt
 3. Test fixtures:
   - test_config: creates temp config with tmp_path for isolation
 4. Integration tests:
   - Config -> PipelineStore -> save/load roundtrip
   - Config -> ProvenanceTracker -> sidecar creation
   - Full setup flow with mocked mygene (fetch, map, validate, save, provenance)
   - Checkpoint-restart verification
   - CLI help and info commands
 5. All tests pass with mocked API (no external dependencies)
 **Files created:** 1 file (test_integration.py)
 **Key features:**
 - Mocked mygene API calls (no rate limits, reproducible)
 - Temporary paths (tmp_path) isolate each test, so no files leak into the repository
 - Verifies cross-module wiring works correctly
 - Fast execution (<1s for all 6 tests)
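A minimal sketch of the mocking approach, using only `unittest.mock`. The `map_ids` helper, the placeholder gene records, and the exact response fields are assumptions for illustration; they only approximate the general shape of a mygene `querymany` result and are not the project's `MOCK_MYGENE_QUERYMANY_RESPONSE`.

```python
from unittest.mock import MagicMock

# Made-up placeholder records (not real genes), shaped like querymany hits.
MOCK_QUERYMANY_RESPONSE = [
    {"query": "ENSG00000000001", "symbol": "GENE1", "HGNC": "1",
     "uniprot": {"Swiss-Prot": "P00001"}},
    {"query": "ENSG00000000002", "symbol": "GENE2", "HGNC": "2",
     "uniprot": {"Swiss-Prot": "P00002"}},
    {"query": "ENSG00000000003", "notfound": True},
]


def map_ids(client, ensembl_ids):
    """Map Ensembl IDs to HGNC and UniProt IDs using a mygene-style client."""
    hits = client.querymany(
        ensembl_ids, scopes="ensembl.gene",
        fields="symbol,HGNC,uniprot", species="human",
    )
    mapping = {}
    for hit in hits:
        if hit.get("notfound"):
            continue  # unmapped IDs are simply left out of the result
        uniprot = hit.get("uniprot") or {}
        mapping[hit["query"]] = {
            "symbol": hit.get("symbol"),
            "hgnc_id": hit.get("HGNC"),
            "uniprot_id": uniprot.get("Swiss-Prot"),
        }
    return mapping


def make_mock_client():
    """Build a stand-in client whose querymany returns the canned response."""
    client = MagicMock()
    client.querymany.return_value = MOCK_QUERYMANY_RESPONSE
    return client
```

Passing the client in as an argument keeps the test free of network access; patching the client where it is imported would achieve the same isolation.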
 ## Success Criteria Verification
 - [x] CLI entry point works with setup and info subcommands
 - [x] Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
 - [x] Checkpoint-restart works: existing DuckDB data skips re-downloading
 - [x] All integration tests pass verifying cross-module wiring
 - [x] Full test suite (all files) passes: `pytest tests/ -v` (49 passed, 1 skipped)
 ## Must-Haves Verification
 **Truths:**
 - [x] CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands
 - [x] Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance
 - [x] All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance
 - [x] Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching
 **Artifacts:**
 - [x] src/usher_pipeline/cli/main.py provides "CLI entry point with click command group" containing "def cli"
 - [x] src/usher_pipeline/cli/setup_cmd.py provides "Setup command wiring config, gene mapping, persistence, provenance" containing "def setup"
 - [x] tests/test_integration.py provides "Integration tests verifying module wiring" containing "test_"
 **Key Links:**
 - [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/config/loader.py via "loads pipeline config from YAML" (pattern: `load_config`)
 - [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/gene_mapping/mapper.py via "maps gene IDs using GeneMapper" (pattern: `GeneMapper`)
 - [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/duckdb_store.py via "saves results to DuckDB with checkpoint" (pattern: `PipelineStore`)
 - [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/provenance.py via "tracks provenance for setup step" (pattern: `ProvenanceTracker`)
 - [x] src/usher_pipeline/cli/main.py -> src/usher_pipeline/cli/setup_cmd.py via "registers setup as click subcommand" (pattern: `cli\.add_command`)
 ## Impact on Roadmap
 **Phase 01 Data Infrastructure Complete:**
 All 4 plans in Phase 01 are now complete:
 - 01-01: Python package scaffold, config system, base API client
 - 01-02: Gene ID mapping and validation
 - 01-03: DuckDB persistence and provenance tracking
 - 01-04: CLI integration and end-to-end testing
 **Ready for Phase 02 (Evidence Layer Ingestion):**
 - CLI provides interface for running pipeline operations
 - Config system defines data source versions
 - Gene universe can be fetched and validated
 - ID mapping handles Ensembl -> HGNC + UniProt
 - DuckDB checkpoint-restart enables incremental processing
 - Provenance tracking captures all processing steps
 - Integration tests prove modules work together
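For readers unfamiliar with the provenance sidecar pattern, the sketch below shows one way a config hash and recorded steps could be written next to an output file. `ProvenanceSketch`, its methods, and the `.provenance.json` suffix are hypothetical; this is not the project's `ProvenanceTracker`, whose API is not shown in this summary.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def config_hash(config: dict) -> str:
    """Hash a config dict deterministically (sorted keys, compact separators)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ProvenanceSketch:
    """Illustrative stand-in for a provenance tracker."""

    def __init__(self, config: dict) -> None:
        self.record = {"config_hash": config_hash(config), "steps": []}

    def record_step(self, name: str, **details) -> None:
        """Append one processing step with a UTC timestamp."""
        self.record["steps"].append({
            "name": name,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "details": details,
        })

    def save_sidecar(self, output_path: Path) -> Path:
        """Write the record next to output_path (suffix is an assumption)."""
        sidecar = output_path.with_name(output_path.name + ".provenance.json")
        sidecar.write_text(json.dumps(self.record, indent=2))
        return sidecar
```

Hashing a canonical JSON form means two configs with the same values produce the same hash regardless of key order.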
 ## Next Steps
 Phase 02 will add evidence layer ingestion commands to the CLI:
 - `usher-pipeline fetch gnomad` - download gnomAD data
 - `usher-pipeline fetch expression` - download GTEx/HPA data
 - `usher-pipeline fetch annotations` - download GO/HPO annotations
 Each command will use the same checkpoint-restart pattern established in setup.
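The checkpoint-restart pattern that these commands would reuse can be sketched independently of DuckDB. In the example below, only `has_checkpoint` is a name taken from this summary; `InMemoryStore`, its `save` and `load` methods, and `fetch_with_checkpoint` are hypothetical stand-ins so the sketch runs without a database.

```python
from typing import Callable, Dict, List


class InMemoryStore:
    """Stand-in for a DuckDB-backed store, used only for illustration."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[dict]] = {}

    def has_checkpoint(self, table: str) -> bool:
        return table in self._tables

    def save(self, table: str, rows: List[dict]) -> None:
        self._tables[table] = rows

    def load(self, table: str) -> List[dict]:
        return self._tables[table]


def fetch_with_checkpoint(store, table: str,
                          fetch: Callable[[], List[dict]],
                          force: bool = False) -> List[dict]:
    """Return checkpointed rows if present; otherwise fetch and save them."""
    if store.has_checkpoint(table) and not force:
        return store.load(table)  # skip the expensive fetch
    rows = fetch()
    store.save(table, rows)
    return rows
```

A second call with the same table name returns the stored rows without invoking `fetch`, and `force=True` mirrors the `--force` flag by refetching and overwriting the checkpoint.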
 ## Self-Check: PASSED
 **Files verified:**
 ```bash
 FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/__init__.py
 FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/main.py
 FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/setup_cmd.py
 FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_integration.py
 FOUND: /Users/gbanyan/Project/usher-exploring/.gitignore
 ```
 **Commits verified:**
 ```bash
 FOUND: f33b048 (Task 1)
 FOUND: e4d71d0 (Task 2)
 ```
 All files and commits exist as documented.