--- phase: 01-data-infrastructure plan: 01 subsystem: foundation tags: [infrastructure, config, api-client, testing] dependency_graph: requires: [] provides: - Python package scaffold with src layout - Pydantic v2 config system with YAML loading - Base API client pattern with retry and caching affects: - All future plans (foundational dependencies) tech_stack: added: - Python 3.11+ with pyproject.toml packaging - Pydantic v2 for config validation - pydantic-yaml for YAML parsing - requests-cache for persistent SQLite caching - tenacity for retry with exponential backoff - pytest for testing patterns: - Config-driven architecture with validation gates - Reusable API client base class pattern - Virtual environment isolation key_files: created: - pyproject.toml: Package definition with bioinformatics dependencies - src/usher_pipeline/__init__.py: Package root - src/usher_pipeline/config/schema.py: Pydantic models (PipelineConfig, DataSourceVersions, APIConfig, ScoringWeights) - src/usher_pipeline/config/loader.py: YAML loading with override support - src/usher_pipeline/api_clients/base.py: CachedAPIClient with retry and rate limiting - config/default.yaml: Default pipeline configuration (Ensembl 113, gnomAD v4.1) - tests/test_config.py: 5 config validation tests - tests/test_api_client.py: 5 API client tests modified: [] decisions: - decision: "Virtual environment required due to externally-managed Python" rationale: "macOS system Python uses PEP 668 protection; venv isolates dependencies" alternatives: ["--break-system-packages flag (rejected - risky)", "pipx (rejected - inappropriate for development)"] - decision: "Auto-created .venv during Task 1 execution" rationale: "Blocking issue (Rule 3) - pip install failed without venv" impact: "Added venv creation step; documented in deviation log" metrics: duration_minutes: 3 tasks_completed: 2 files_created: 11 tests_added: 10 commits: 2 completed_date: "2026-02-11" --- # Phase 01 Plan 01: Project Scaffold, Config System, and Base API Client Summary **One-liner:** Installable Python package with Pydantic v2 config validation (YAML loading, directory creation, deterministic hashing) and reusable CachedAPIClient base class (SQLite persistence, retry with exponential backoff, rate limiting). ## What Was Built Created the foundational Python package structure and two core infrastructure components that all subsequent plans depend on: 1. **Python Package Scaffold** - Modern pyproject.toml with PEP 621 packaging - src/usher_pipeline layout for clean imports - All bioinformatics dependencies: mygene, requests, requests-cache, tenacity, pydantic>=2.0, pydantic-yaml, duckdb, click, polars, pyarrow - Dev dependencies: pytest, pytest-cov - Virtual environment (.venv) for dependency isolation 2. **Config System** - Pydantic v2 models with validation: - DataSourceVersions: Ensembl (>= 100), gnomAD, GTEx, HPA versions - ScoringWeights: Per-layer weights (gnomad, expression, annotation, localization, animal_model, literature) - APIConfig: Rate limiting, retries, cache TTL, timeout - PipelineConfig: Aggregates all settings with Path validation - Field validators: ensembl_release >= 100, auto-create directories - Config hash method: SHA-256 for cache invalidation and provenance - YAML loader with override support for CLI flags - Default config: Ensembl 113, gnomAD v4.1, GTEx v8 3. **Base API Client** - CachedAPIClient class with: - SQLite persistent cache via requests_cache - Retry with exponential backoff (tenacity): 429/5xx/network errors - Rate limiting: configurable req/sec with skip for cached responses - Timeout and max_retries configuration - from_config classmethod for pipeline integration - cache_stats() and clear_cache() utilities ## Tests **10 tests total (all passing):** ### Config Tests (5) - `test_load_valid_config`: Loads default.yaml, validates PipelineConfig types - `test_invalid_config_missing_field`: Missing required field raises ValidationError - `test_invalid_ensembl_release`: ensembl_release < 100 raises ValidationError - `test_config_hash_deterministic`: Same config = same hash, different config = different hash - `test_config_creates_directories`: Non-existent data_dir/cache_dir created on load ### API Client Tests (5) - `test_client_creates_cache_dir`: Instantiation creates cache directory - `test_client_caches_response`: Second request retrieves from cache - `test_client_from_config`: from_config applies PipelineConfig settings - `test_rate_limit_respected`: Non-cached requests trigger sleep (1/rate_limit) - `test_rate_limit_skipped_for_cached`: Cached requests skip rate limiting ## Verification Results All plan verification steps passed: ```bash # 1. Package installation $ pip install -e ".[dev]" Successfully installed usher-pipeline-0.1.0 # 2. All tests pass $ pytest tests/ -v ======================== 10 passed, 1 warning in 0.15s ========================= # 3. Config hash generation $ python -c "from usher_pipeline.config.loader import load_config; c = load_config('config/default.yaml'); print(c.config_hash())" ddbb5195738ac3540f08ed0a46d5936cca070ec880fba3f65e7da48b81ca2b0f # 4. API client import $ python -c "from usher_pipeline.api_clients.base import CachedAPIClient" CachedAPIClient imported successfully ``` ## Deviations from Plan ### Auto-fixed Issues **1. [Rule 3 - Blocking] Created virtual environment for dependency isolation** - **Found during:** Task 1 (pip install -e ".[dev]") - **Issue:** macOS system Python is externally-managed (PEP 668), blocking direct pip installs - **Fix:** Created .venv with `python3 -m venv .venv`, upgraded pip/setuptools/wheel, installed package in isolated environment - **Files modified:** .venv/ (created) - **Commit:** Included in Task 1 commit (4a80a03) - **Rationale:** Blocking issue preventing task completion; venv is standard practice for Python development No other deviations. Plan executed as written after venv creation. ## Task Execution Log ### Task 1: Create Python package scaffold with config system **Status:** Complete **Duration:** ~2 minutes **Commit:** 4a80a03 **Actions:** 1. Created pyproject.toml with modern PEP 621 packaging 2. Created src/usher_pipeline package structure 3. Implemented Pydantic v2 config schema with validators 4. Implemented YAML loader with override support 5. Created default.yaml with sensible defaults 6. Wrote 5 comprehensive config tests 7. Fixed blocking venv issue (deviation Rule 3) 8. Installed package with `pip install -e ".[dev]"` 9. Verified all 5 tests pass **Files created:** 8 files (pyproject.toml, 3 config modules, default.yaml, 2 test files, package __init__) **Key validation gates:** - ensembl_release >= 100 (rejects outdated releases) - Directory auto-creation on Path fields - Config hash for cache invalidation ### Task 2: Create base API client with retry logic and persistent caching **Status:** Complete **Duration:** ~1 minute **Commit:** 4204116 **Actions:** 1. Implemented CachedAPIClient base class 2. Integrated requests_cache with SQLite backend 3. Added tenacity retry decorator with exponential backoff 4. Implemented rate limiting with cache-aware skip 5. Added from_config classmethod for pipeline integration 6. Wrote 5 comprehensive API client tests 7. Verified all 5 tests pass **Files created:** 3 files (base.py, api_clients __init__, test_api_client.py) **Key features:** - Persistent SQLite cache with configurable TTL - Retry on 429/5xx/network errors (exponential backoff: 2-60s) - Rate limiting (default 5 req/sec) skipped for cached responses - Timeout configuration (default 30s) ## Success Criteria Verification - [x] Python package installs with all bioinformatics dependencies - [x] Config loads from YAML, validates with Pydantic, rejects invalid input - [x] API client provides retry + caching foundation for all future API modules - [x] All tests pass (10/10) ## Must-Haves Verification **Truths:** - [x] YAML config loads and validates with Pydantic, returning typed PipelineConfig object - [x] Invalid config (missing required fields, wrong types, bad values) raises ValidationError with clear messages - [x] CachedAPIClient makes HTTP requests with automatic retry on 429/5xx and persistent SQLite caching - [x] Pipeline is installable as Python package with all dependencies pinned **Artifacts:** - [x] pyproject.toml provides "Package definition with all dependencies" containing "mygene" - [x] src/usher_pipeline/config/schema.py provides "Pydantic models for pipeline configuration" containing "class PipelineConfig" - [x] src/usher_pipeline/config/loader.py provides "YAML config loading with validation" containing "def load_config" - [x] src/usher_pipeline/api_clients/base.py provides "Base API client with retry and caching" containing "class CachedAPIClient" - [x] config/default.yaml provides "Default pipeline configuration" containing "ensembl_release" **Key Links:** - [x] src/usher_pipeline/config/loader.py → src/usher_pipeline/config/schema.py via "imports PipelineConfig for validation" (pattern: `from.*schema.*import.*PipelineConfig`) - [x] src/usher_pipeline/api_clients/base.py → requests_cache via "creates CachedSession for persistent caching" (pattern: `requests_cache\.CachedSession`) ## Next Steps Phase 01, Plan 02 depends on this plan's outputs: - Gene ID mapping will use PipelineConfig for data versioning - API clients for Ensembl/MyGene will inherit from CachedAPIClient - DuckDB persistence (Plan 03) will store config_hash for provenance ## Self-Check: PASSED **Files verified:** ```bash FOUND: /Users/gbanyan/Project/usher-exploring/pyproject.toml FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/__init__.py FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/schema.py FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/loader.py FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/api_clients/base.py FOUND: /Users/gbanyan/Project/usher-exploring/config/default.yaml FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_config.py FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_api_client.py ``` **Commits verified:** ```bash FOUND: 4a80a03 (Task 1) FOUND: 4204116 (Task 2) ``` All files and commits exist as documented.