Files
usher-exploring/.planning/phases/01-data-infrastructure/01-04-SUMMARY.md
gbanyan 102dcdbe84 docs(01-04): complete CLI integration and end-to-end testing plan
- CLI entry point with setup and info commands
- Full infrastructure integration verified
- 6 integration tests with mocked APIs
- Phase 01 Data Infrastructure complete
2026-02-11 16:45:12 +08:00

12 KiB

phase, plan, subsystem, tags, dependency_graph, tech_stack, key_files, decisions, metrics
phase plan subsystem tags dependency_graph tech_stack key_files decisions metrics
01-data-infrastructure 04 integration
cli
integration
wiring
testing
click
requires provides affects
phase provides
01-01
Python package scaffold
Pydantic v2 config system
CachedAPIClient
phase provides
01-02
Gene universe definition
Batch ID mapper
Validation gates
phase provides
01-03
DuckDB checkpoint-restart storage
Provenance tracking
CLI entry point with setup and info commands
Full infrastructure integration (config -> fetch -> map -> validate -> persist -> provenance)
Checkpoint-restart capability for expensive operations
Integration test suite verifying cross-module wiring
All future pipeline operations (CLI is primary interface)
added patterns
click for CLI framework
Click command group with global options (--config, --verbose)
Colored CLI output with status indicators
Context manager for resource cleanup
Mock-based integration testing
created modified
src/usher_pipeline/cli/__init__.py
CLI package exports
src/usher_pipeline/cli/main.py
Click command group with info command
src/usher_pipeline/cli/setup_cmd.py
Setup command orchestrating full flow
tests/test_integration.py
6 integration tests
.gitignore
Data files, build artifacts, provenance exclusions
pyproject.toml
Fixed CLI entry point to usher_pipeline.cli.main:cli
decision rationale alternatives
Click for CLI framework Standard Python CLI library with excellent UX features (colored output, help generation, subcommands)
argparse (rejected - verbose)
typer (rejected - less mature)
decision rationale impact
Setup command uses checkpoint-restart pattern Gene universe fetch can take minutes; checkpoint enables fast restart without re-fetching Setup detects existing DuckDB tables and skips re-fetch unless --force flag used
decision rationale impact
Mock mygene in integration tests Avoids external API dependency, ensures reproducible tests, faster execution All 6 integration tests run in <1s with mocked responses
duration_minutes tasks_completed files_created files_modified tests_added commits completed_date
5 2 5 1 6 2 2026-02-11

Phase 01 Plan 04: CLI Integration and End-to-End Testing Summary

One-liner: Click-based CLI with setup command orchestrating full infrastructure flow (config -> fetch gene universe -> map IDs -> validate -> DuckDB persistence -> provenance) and 6 integration tests verifying cross-module wiring with mocked APIs.

What Was Built

Wired all infrastructure modules together with a CLI interface and integration tests:

  1. CLI Entry Point

    • Click command group with global options (--config, --verbose)
    • info command: displays pipeline version, config hash, data source versions, paths, API config
    • setup command: orchestrates full infrastructure flow
    • Colored output with status indicators (green=OK, yellow=warn, red=fail)
    • Entry point: usher-pipeline binary installed with package
  2. Setup Command Flow

    • Load config from YAML
    • Create PipelineStore and ProvenanceTracker from config
    • Check for existing checkpoint (gene_universe table in DuckDB)
    • If checkpoint exists and no --force: skip fetch, display summary
    • If no checkpoint or --force:
      • Fetch protein-coding genes from mygene (19k-22k genes)
      • Validate gene universe (count, format, duplicates)
      • Map Ensembl IDs to HGNC + UniProt via batch queries
      • Validate mapping quality (min 90% HGNC success rate)
      • Save to DuckDB as gene_universe table
      • Record provenance steps
      • Save provenance sidecar JSON
    • Display summary with counts, rates, paths
  3. Integration Test Suite

    • 6 tests verifying module wiring with mocked mygene API:
      • test_config_to_store_roundtrip: config -> PipelineStore -> save/load
      • test_config_to_provenance: config -> ProvenanceTracker -> sidecar
      • test_full_setup_flow_mocked: full setup with 5 mocked genes
      • test_checkpoint_skip_flow: verify checkpoint-restart skips re-fetch
      • test_setup_cli_help: CLI help output verification
      • test_info_cli: info command with config display
    • All tests use tmp_path fixtures for isolation
    • No external API calls (mocked mygene responses)
  4. .gitignore

    • Excludes data/, *.duckdb, *.duckdb.wal
    • Python artifacts: pycache, *.pyc, *.egg-info, dist/, build/
    • Testing: .pytest_cache/, .coverage, htmlcov/
    • Provenance files (not in data/)
    • Virtual environment: .venv/

Tests

50 tests total (49 passed, 1 skipped):

Integration Tests (6 tests, all passed)

  • test_config_to_store_roundtrip: Load config, create store, save/load DataFrame, verify roundtrip
  • test_config_to_provenance: Load config, create provenance, record steps, save/load sidecar, verify config_hash
  • test_full_setup_flow_mocked: Full setup flow with mocked mygene (5 genes), verify DuckDB table, provenance
  • test_checkpoint_skip_flow: Create checkpoint, verify second run skips fetch
  • test_setup_cli_help: CLI help shows --force and checkpoint info
  • test_info_cli: Info command displays version, config hash, data sources

All Tests Summary

  • Config tests: 5 passed
  • API client tests: 5 passed
  • Gene mapping tests: 15 passed
  • Persistence tests: 12 passed, 1 skipped (pandas)
  • Integration tests: 6 passed

Verification Results

All plan verification steps passed:

# 1. All tests pass
$ pytest tests/ -v
========================= 49 passed, 1 skipped, 1 warning in 0.42s =========================

# 2. CLI help works
$ usher-pipeline --help
Usage: usher-pipeline [OPTIONS] COMMAND [ARGS]...
Commands:
  info   Display pipeline information and configuration summary.
  setup  Initialize pipeline data infrastructure.

# 3. Info command works
$ usher-pipeline info
Usher Pipeline v0.1.0
Config: config/default.yaml
Config Hash: ddbb5195738ac354...
Data Source Versions:
  Ensembl Release: 113
  gnomAD Version:  v4.1
  GTEx Version:    v8
  HPA Version:     23.0

Deviations from Plan

None - plan executed exactly as written.

Task Execution Log

Task 1: Create CLI entry point with setup command

Status: Complete Duration: ~3 minutes Commit: f33b048

Actions:

  1. Created src/usher_pipeline/cli/ package
  2. Implemented main.py with click command group:
    • Global options: --config (default config/default.yaml), --verbose
    • info command: displays version, config hash, data sources, paths, API config
    • Registers setup command
  3. Implemented setup_cmd.py with full orchestration:
    • Load config, create store/provenance
    • Checkpoint detection: has_checkpoint('gene_universe')
    • Fetch gene universe (mygene) with count validation
    • Map IDs (Ensembl -> HGNC + UniProt) with batch queries
    • Validate mapping (min 90% HGNC success rate)
    • Save to DuckDB with provenance sidecar
    • Colored output with status indicators
    • Resource cleanup in finally block
  4. Updated pyproject.toml: fixed entry point to usher_pipeline.cli.main:cli
  5. Created .gitignore with data/, *.duckdb, build artifacts

Files created: 5 files (cli/init.py, main.py, setup_cmd.py, .gitignore, modified pyproject.toml)

Key features:

  • Checkpoint-restart: skips expensive fetch if data exists
  • Validation gates: enforces data quality thresholds
  • Provenance tracking: captures all setup steps
  • Colored CLI output with clear status messages

Task 2: Create integration tests verifying module wiring

Status: Complete Duration: ~2 minutes Commit: e4d71d0

Actions:

  1. Created tests/test_integration.py with 6 tests
  2. Mock data setup:
    • MOCK_GENES: 5 Ensembl IDs
    • MOCK_MYGENE_QUERY_RESPONSE: 5 genes with symbols
    • MOCK_MYGENE_QUERYMANY_RESPONSE: 5 genes with HGNC + UniProt
  3. Test fixtures:
    • test_config: creates temp config with tmp_path for isolation
  4. Integration tests:
    • Config -> PipelineStore -> save/load roundtrip
    • Config -> ProvenanceTracker -> sidecar creation
    • Full setup flow with mocked mygene (fetch, map, validate, save, provenance)
    • Checkpoint-restart verification
    • CLI help and info commands
  5. All tests pass with mocked API (no external dependencies)

Files created: 1 file (test_integration.py)

Key features:

  • Mocked mygene API calls (no rate limits, reproducible)
  • Temporary paths for isolation (no pollution)
  • Verifies cross-module wiring works correctly
  • Fast execution (<1s for all 6 tests)

Success Criteria Verification

  • CLI entry point works with setup and info subcommands
  • Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
  • Checkpoint-restart works: existing DuckDB data skips re-downloading
  • All integration tests pass verifying cross-module wiring
  • Full test suite (all files) passes: pytest tests/ -v (49 passed, 1 skipped)

Must-Haves Verification

Truths:

  • CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands
  • Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance
  • All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance
  • Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching

Artifacts:

  • src/usher_pipeline/cli/main.py provides "CLI entry point with click command group" containing "def cli"
  • src/usher_pipeline/cli/setup_cmd.py provides "Setup command wiring config, gene mapping, persistence, provenance" containing "def setup"
  • tests/test_integration.py provides "Integration tests verifying module wiring" containing "test_"

Key Links:

  • src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/config/loader.py via "loads pipeline config from YAML" (pattern: load_config)
  • src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/gene_mapping/mapper.py via "maps gene IDs using GeneMapper" (pattern: GeneMapper)
  • src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/duckdb_store.py via "saves results to DuckDB with checkpoint" (pattern: PipelineStore)
  • src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/provenance.py via "tracks provenance for setup step" (pattern: ProvenanceTracker)
  • src/usher_pipeline/cli/main.py -> src/usher_pipeline/cli/setup_cmd.py via "registers setup as click subcommand" (pattern: cli\.add_command)

Impact on Roadmap

Phase 01 Data Infrastructure Complete:

All 4 plans in Phase 01 are now complete:

  • 01-01: Python package scaffold, config system, base API client
  • 01-02: Gene ID mapping and validation
  • 01-03: DuckDB persistence and provenance tracking
  • 01-04: CLI integration and end-to-end testing

Ready for Phase 02 (Evidence Layer Ingestion):

  • CLI provides interface for running pipeline operations
  • Config system defines data source versions
  • Gene universe can be fetched and validated
  • ID mapping handles Ensembl -> HGNC + UniProt
  • DuckDB checkpoint-restart enables incremental processing
  • Provenance tracking captures all processing steps
  • Integration tests prove modules work together

Next Steps

Phase 02 will add evidence layer ingestion commands to the CLI:

  • usher-pipeline fetch gnomad - download gnomAD data
  • usher-pipeline fetch expression - download GTEx/HPA data
  • usher-pipeline fetch annotations - download GO/HPO annotations

Each command will use the same checkpoint-restart pattern established in setup.

Self-Check: PASSED

Files verified:

FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/__init__.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/main.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/setup_cmd.py
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_integration.py
FOUND: /Users/gbanyan/Project/usher-exploring/.gitignore

Commits verified:

FOUND: f33b048 (Task 1)
FOUND: e4d71d0 (Task 2)

All files and commits exist as documented.