diff --git a/.planning/STATE.md b/.planning/STATE.md
index 479e4e2..9af2844 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -10,24 +10,24 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 ## Current Position
 
 Phase: 1 of 6 (Data Infrastructure)
-Plan: 3 of 4 in current phase
-Status: Executing
-Last activity: 2026-02-11 — Completed 01-02-PLAN.md (Gene ID mapping and validation)
+Plan: 4 of 4 in current phase
+Status: Complete
+Last activity: 2026-02-11 — Completed 01-04-PLAN.md (CLI integration and end-to-end testing)
 
-Progress: [███░░░░░░░] 25.0% (1/6 phases planned, 2/4 plans in phase 1 complete)
+Progress: [█████░░░░░] 16.7% (1/6 phases complete, 4/4 plans in phase 1 complete)
 
 ## Performance Metrics
 
 **Velocity:**
-- Total plans completed: 2
-- Average duration: 3 min
-- Total execution time: 0.12 hours
+- Total plans completed: 4
+- Average duration: 3.5 min
+- Total execution time: 0.23 hours
 
 **By Phase:**
 
 | Phase | Plans | Total | Avg/Plan |
 |-------|-------|-------|----------|
-| 01 - Data Infrastructure | 2/4 | 7 min | 3.5 min/plan |
+| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
 
 ## Accumulated Context
 
@@ -48,6 +48,9 @@ Recent decisions affecting current work:
 - [01-02]: Mock mygene in tests (avoids rate limits, ensures reproducibility)
 - [01-03]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics)
 - [01-03]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern)
+- [01-04]: Click for CLI framework (standard Python CLI library with excellent UX)
+- [01-04]: Setup command uses checkpoint-restart pattern (gene universe fetch can take minutes)
+- [01-04]: Mock mygene in integration tests (avoids external API dependency, reproducible)
 
 ### Pending Todos
 
@@ -60,5 +63,5 @@ None yet.
 ## Session Continuity
 
 Last session: 2026-02-11 - Plan execution
-Stopped at: Completed 01-02-PLAN.md
-Resume file: .planning/phases/01-data-infrastructure/01-02-SUMMARY.md
+Stopped at: Completed 01-04-PLAN.md (Phase 01 complete)
+Resume file: .planning/phases/01-data-infrastructure/01-04-SUMMARY.md
diff --git a/.planning/phases/01-data-infrastructure/01-04-SUMMARY.md b/.planning/phases/01-data-infrastructure/01-04-SUMMARY.md
new file mode 100644
index 0000000..f8de96c
--- /dev/null
+++ b/.planning/phases/01-data-infrastructure/01-04-SUMMARY.md
@@ -0,0 +1,292 @@
+---
+phase: 01-data-infrastructure
+plan: 04
+subsystem: integration
+tags: [cli, integration, wiring, testing, click]
+dependency_graph:
+  requires:
+    - phase: 01-01
+      provides: ["Python package scaffold", "Pydantic v2 config system", "CachedAPIClient"]
+    - phase: 01-02
+      provides: ["Gene universe definition", "Batch ID mapper", "Validation gates"]
+    - phase: 01-03
+      provides: ["DuckDB checkpoint-restart storage", "Provenance tracking"]
+  provides:
+    - CLI entry point with setup and info commands
+    - Full infrastructure integration (config -> fetch -> map -> validate -> persist -> provenance)
+    - Checkpoint-restart capability for expensive operations
+    - Integration test suite verifying cross-module wiring
+  affects:
+    - All future pipeline operations (CLI is primary interface)
+tech_stack:
+  added:
+    - click for CLI framework
+  patterns:
+    - Click command group with global options (--config, --verbose)
+    - Colored CLI output with status indicators
+    - Context manager for resource cleanup
+    - Mock-based integration testing
+key_files:
+  created:
+    - src/usher_pipeline/cli/__init__.py: CLI package exports
+    - src/usher_pipeline/cli/main.py: Click command group with info command
+    - src/usher_pipeline/cli/setup_cmd.py: Setup command orchestrating full flow
+    - tests/test_integration.py: 6 integration tests
+    - .gitignore: Data files, build artifacts, provenance exclusions
+  modified:
+    - pyproject.toml: Fixed CLI entry point to usher_pipeline.cli.main:cli
+decisions:
+  - decision: "Click for CLI framework"
+    rationale: "Standard Python CLI library with excellent UX features (colored output, help generation, subcommands)"
+    alternatives: ["argparse (rejected - verbose)", "typer (rejected - less mature)"]
+  - decision: "Setup command uses checkpoint-restart pattern"
+    rationale: "Gene universe fetch can take minutes; checkpoint enables fast restart without re-fetching"
+    impact: "Setup detects existing DuckDB tables and skips re-fetch unless --force flag used"
+  - decision: "Mock mygene in integration tests"
+    rationale: "Avoids external API dependency, ensures reproducible tests, faster execution"
+    impact: "All 6 integration tests run in <1s with mocked responses"
+metrics:
+  duration_minutes: 5
+  tasks_completed: 2
+  files_created: 5
+  files_modified: 1
+  tests_added: 6
+  commits: 2
+  completed_date: "2026-02-11"
+---
+
+# Phase 01 Plan 04: CLI Integration and End-to-End Testing Summary
+
+**One-liner:** Click-based CLI with setup command orchestrating full infrastructure flow (config -> fetch gene universe -> map IDs -> validate -> DuckDB persistence -> provenance) and 6 integration tests verifying cross-module wiring with mocked APIs.
+
+## What Was Built
+
+Wired all infrastructure modules together with a CLI interface and integration tests:
+
+1. **CLI Entry Point**
+   - Click command group with global options (--config, --verbose)
+   - `info` command: displays pipeline version, config hash, data source versions, paths, API config
+   - `setup` command: orchestrates full infrastructure flow
+   - Colored output with status indicators (green=OK, yellow=warn, red=fail)
+   - Entry point: `usher-pipeline` binary installed with package
+
+2. **Setup Command Flow**
+   - Load config from YAML
+   - Create PipelineStore and ProvenanceTracker from config
+   - Check for existing checkpoint (gene_universe table in DuckDB)
+   - If checkpoint exists and no --force: skip fetch, display summary
+   - If no checkpoint or --force:
+     - Fetch protein-coding genes from mygene (19k-22k genes)
+     - Validate gene universe (count, format, duplicates)
+     - Map Ensembl IDs to HGNC + UniProt via batch queries
+     - Validate mapping quality (min 90% HGNC success rate)
+     - Save to DuckDB as gene_universe table
+     - Record provenance steps
+   - Save provenance sidecar JSON
+   - Display summary with counts, rates, paths
+
+3. **Integration Test Suite**
+   - 6 tests verifying module wiring with mocked mygene API:
+     - `test_config_to_store_roundtrip`: config -> PipelineStore -> save/load
+     - `test_config_to_provenance`: config -> ProvenanceTracker -> sidecar
+     - `test_full_setup_flow_mocked`: full setup with 5 mocked genes
+     - `test_checkpoint_skip_flow`: verify checkpoint-restart skips re-fetch
+     - `test_setup_cli_help`: CLI help output verification
+     - `test_info_cli`: info command with config display
+   - All tests use tmp_path fixtures for isolation
+   - No external API calls (mocked mygene responses)
+
+4. **.gitignore**
+   - Excludes data/, *.duckdb, *.duckdb.wal
+   - Python artifacts: __pycache__, *.pyc, *.egg-info, dist/, build/
+   - Testing: .pytest_cache/, .coverage, htmlcov/
+   - Provenance files (not in data/)
+   - Virtual environment: .venv/
+
+## Tests
+
+**50 tests total (49 passed, 1 skipped):**
+
+### Integration Tests (6 tests, all passed)
+- `test_config_to_store_roundtrip`: Load config, create store, save/load DataFrame, verify roundtrip
+- `test_config_to_provenance`: Load config, create provenance, record steps, save/load sidecar, verify config_hash
+- `test_full_setup_flow_mocked`: Full setup flow with mocked mygene (5 genes), verify DuckDB table, provenance
+- `test_checkpoint_skip_flow`: Create checkpoint, verify second run skips fetch
+- `test_setup_cli_help`: CLI help shows --force and checkpoint info
+- `test_info_cli`: Info command displays version, config hash, data sources
+
+### All Tests Summary
+- Config tests: 5 passed
+- API client tests: 5 passed
+- Gene mapping tests: 15 passed
+- Persistence tests: 12 passed, 1 skipped (pandas)
+- Integration tests: 6 passed
+
+## Verification Results
+
+All plan verification steps passed:
+
+```bash
+# 1. All tests pass
+$ pytest tests/ -v
+========================= 49 passed, 1 skipped, 1 warning in 0.42s =========================
+
+# 2. CLI help works
+$ usher-pipeline --help
+Usage: usher-pipeline [OPTIONS] COMMAND [ARGS]...
+Commands:
+  info   Display pipeline information and configuration summary.
+  setup  Initialize pipeline data infrastructure.
+
+# 3. Info command works
+$ usher-pipeline info
+Usher Pipeline v0.1.0
+Config: config/default.yaml
+Config Hash: ddbb5195738ac354...
+Data Source Versions:
+  Ensembl Release: 113
+  gnomAD Version: v4.1
+  GTEx Version: v8
+  HPA Version: 23.0
+```
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Task Execution Log
+
+### Task 1: Create CLI entry point with setup command
+**Status:** Complete
+**Duration:** ~3 minutes
+**Commit:** f33b048
+
+**Actions:**
+1. Created src/usher_pipeline/cli/ package
+2. Implemented main.py with click command group:
+   - Global options: --config (default config/default.yaml), --verbose
+   - info command: displays version, config hash, data sources, paths, API config
+   - Registers setup command
+3. Implemented setup_cmd.py with full orchestration:
+   - Load config, create store/provenance
+   - Checkpoint detection: has_checkpoint('gene_universe')
+   - Fetch gene universe (mygene) with count validation
+   - Map IDs (Ensembl -> HGNC + UniProt) with batch queries
+   - Validate mapping (min 90% HGNC success rate)
+   - Save to DuckDB with provenance sidecar
+   - Colored output with status indicators
+   - Resource cleanup in finally block
+4. Updated pyproject.toml: fixed entry point to usher_pipeline.cli.main:cli
+5. Created .gitignore with data/, *.duckdb, build artifacts
+
+**Files created:** 5 files (cli/__init__.py, main.py, setup_cmd.py, .gitignore, modified pyproject.toml)
+
+**Key features:**
+- Checkpoint-restart: skips expensive fetch if data exists
+- Validation gates: enforces data quality thresholds
+- Provenance tracking: captures all setup steps
+- Colored CLI output with clear status messages
+
+### Task 2: Create integration tests verifying module wiring
+**Status:** Complete
+**Duration:** ~2 minutes
+**Commit:** e4d71d0
+
+**Actions:**
+1. Created tests/test_integration.py with 6 tests
+2. Mock data setup:
+   - MOCK_GENES: 5 Ensembl IDs
+   - MOCK_MYGENE_QUERY_RESPONSE: 5 genes with symbols
+   - MOCK_MYGENE_QUERYMANY_RESPONSE: 5 genes with HGNC + UniProt
+3. Test fixtures:
+   - test_config: creates temp config with tmp_path for isolation
+4. Integration tests:
+   - Config -> PipelineStore -> save/load roundtrip
+   - Config -> ProvenanceTracker -> sidecar creation
+   - Full setup flow with mocked mygene (fetch, map, validate, save, provenance)
+   - Checkpoint-restart verification
+   - CLI help and info commands
+5. All tests pass with mocked API (no external dependencies)
+
+**Files created:** 1 file (test_integration.py)
+
+**Key features:**
+- Mocked mygene API calls (no rate limits, reproducible)
+- Temporary paths for isolation (no pollution)
+- Verifies cross-module wiring works correctly
+- Fast execution (<1s for all 6 tests)
+
+## Success Criteria Verification
+
+- [x] CLI entry point works with setup and info subcommands
+- [x] Setup command wires together all infrastructure: config loading, gene universe fetch, ID mapping, validation, DuckDB storage, provenance tracking
+- [x] Checkpoint-restart works: existing DuckDB data skips re-downloading
+- [x] All integration tests pass verifying cross-module wiring
+- [x] Full test suite (all files) passes: `pytest tests/ -v` (49 passed, 1 skipped)
+
+## Must-Haves Verification
+
+**Truths:**
+- [x] CLI entry point 'usher-pipeline' is available after package install with setup and info subcommands
+- [x] Running 'usher-pipeline setup' loads config, fetches gene universe, maps IDs, validates mappings, and saves to DuckDB with provenance
+- [x] All infrastructure modules work together end-to-end: config -> gene mapping -> persistence -> provenance
+- [x] Pipeline can restart from checkpoint: if gene_universe checkpoint exists in DuckDB, setup skips re-fetching
+
+**Artifacts:**
+- [x] src/usher_pipeline/cli/main.py provides "CLI entry point with click command group" containing "def cli"
+- [x] src/usher_pipeline/cli/setup_cmd.py provides "Setup command wiring config, gene mapping, persistence, provenance" containing "def setup"
+- [x] tests/test_integration.py provides "Integration tests verifying module wiring" containing "test_"
+
+**Key Links:**
+- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/config/loader.py via "loads pipeline config from YAML" (pattern: `load_config`)
+- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/gene_mapping/mapper.py via "maps gene IDs using GeneMapper" (pattern: `GeneMapper`)
+- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/duckdb_store.py via "saves results to DuckDB with checkpoint" (pattern: `PipelineStore`)
+- [x] src/usher_pipeline/cli/setup_cmd.py -> src/usher_pipeline/persistence/provenance.py via "tracks provenance for setup step" (pattern: `ProvenanceTracker`)
+- [x] src/usher_pipeline/cli/main.py -> src/usher_pipeline/cli/setup_cmd.py via "registers setup as click subcommand" (pattern: `cli\.add_command`)
+
+## Impact on Roadmap
+
+**Phase 01 Data Infrastructure Complete:**
+
+All 4 plans in Phase 01 are now complete:
+- 01-01: Python package scaffold, config system, base API client
+- 01-02: Gene ID mapping and validation
+- 01-03: DuckDB persistence and provenance tracking
+- 01-04: CLI integration and end-to-end testing
+
+**Ready for Phase 02 (Evidence Layer Ingestion):**
+- CLI provides interface for running pipeline operations
+- Config system defines data source versions
+- Gene universe can be fetched and validated
+- ID mapping handles Ensembl -> HGNC + UniProt
+- DuckDB checkpoint-restart enables incremental processing
+- Provenance tracking captures all processing steps
+- Integration tests prove modules work together
+
+## Next Steps
+
+Phase 02 will add evidence layer ingestion commands to the CLI:
+- `usher-pipeline fetch gnomad` - download gnomAD data
+- `usher-pipeline fetch expression` - download GTEx/HPA data
+- `usher-pipeline fetch annotations` - download GO/HPO annotations
+
+Each command will use the same checkpoint-restart pattern established in setup.
+
+## Self-Check: PASSED
+
+**Files verified:**
+```bash
+FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/__init__.py
+FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/main.py
+FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/cli/setup_cmd.py
+FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_integration.py
+FOUND: /Users/gbanyan/Project/usher-exploring/.gitignore
+```
+
+**Commits verified:**
+```bash
+FOUND: f33b048 (Task 1)
+FOUND: e4d71d0 (Task 2)
+```
+
+All files and commits exist as documented.
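
Reviewer note on the checkpoint-restart pattern: the summary says `setup` skips the gene universe fetch when a `gene_universe` table already exists, unless `--force` is passed, and that the Phase 02 `fetch` commands will reuse the same pattern. The sketch below shows one way that control flow could look. It is not the code in `setup_cmd.py`. It swaps in the standard-library `sqlite3` module as a stand-in for DuckDB so it runs without extra dependencies, and the names `has_checkpoint` (as a free function), `run_with_checkpoint`, and the `fetch` callable are assumptions made for illustration.

```python
import sqlite3
from typing import Callable, Iterable, List, Tuple


def has_checkpoint(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if a table with this name already exists (hypothetical helper)."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def run_with_checkpoint(
    conn: sqlite3.Connection,
    table: str,
    fetch: Callable[[], Iterable[str]],
    force: bool = False,
) -> Tuple[List[str], bool]:
    """Load gene IDs from the checkpoint table, or fetch and persist them.

    Returns (gene_ids, fetched), where fetched is False when the
    checkpoint was reused and the expensive fetch was skipped.
    """
    # Table names cannot be bound as SQL parameters, so validate before
    # interpolating the name into the statements below.
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")

    # Checkpoint hit: reuse stored rows unless the caller forces a refresh.
    if has_checkpoint(conn, table) and not force:
        rows = conn.execute(
            f"SELECT gene_id FROM {table} ORDER BY rowid"
        ).fetchall()
        return [r[0] for r in rows], False

    # Checkpoint miss (or force): run the expensive step, then persist it.
    gene_ids = list(fetch())
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (gene_id TEXT)")
    conn.executemany(
        f"INSERT INTO {table} (gene_id) VALUES (?)",
        [(g,) for g in gene_ids],
    )
    conn.commit()
    return gene_ids, True
```

In this sketch, a second call with the same connection and table name returns the stored IDs without invoking `fetch`, and passing `force=True` runs `fetch` again and overwrites the table. A real implementation would presumably also run the validation gates and record provenance between the fetch and the save.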