---
phase: 01-data-infrastructure
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- pyproject.toml
- src/usher_pipeline/__init__.py
- src/usher_pipeline/config/__init__.py
- src/usher_pipeline/config/schema.py
- src/usher_pipeline/config/loader.py
- src/usher_pipeline/api_clients/__init__.py
- src/usher_pipeline/api_clients/base.py
- config/default.yaml
- tests/__init__.py
- tests/test_config.py
- tests/test_api_client.py
autonomous: true
must_haves:
  truths:
    - "YAML config loads and validates with Pydantic, returning typed PipelineConfig object"
    - "Invalid config (missing required fields, wrong types, bad values) raises ValidationError with clear messages"
    - "CachedAPIClient makes HTTP requests with automatic retry on 429/5xx and persistent SQLite caching"
    - "Pipeline is installable as Python package with all dependencies pinned"
  artifacts:
    - path: "pyproject.toml"
      provides: "Package definition with all dependencies"
      contains: "mygene"
    - path: "src/usher_pipeline/config/schema.py"
      provides: "Pydantic models for pipeline configuration"
      contains: "class PipelineConfig"
    - path: "src/usher_pipeline/config/loader.py"
      provides: "YAML config loading with validation"
      contains: "def load_config"
    - path: "src/usher_pipeline/api_clients/base.py"
      provides: "Base API client with retry and caching"
      contains: "class CachedAPIClient"
    - path: "config/default.yaml"
      provides: "Default pipeline configuration"
      contains: "ensembl_release"
  key_links:
    - from: "src/usher_pipeline/config/loader.py"
      to: "src/usher_pipeline/config/schema.py"
      via: "imports PipelineConfig for validation"
      pattern: "from.*schema.*import.*PipelineConfig"
    - from: "src/usher_pipeline/api_clients/base.py"
      to: "requests_cache"
      via: "creates CachedSession for persistent caching"
      pattern: "requests_cache\\.CachedSession"
---
Create the Python project scaffold, configuration system with Pydantic validation, and base API client with retry/caching.
Purpose: Establish the foundational package structure, dependency management, and two core infrastructure components (config loading, API client pattern) that all subsequent plans depend on. This is the skeleton that gene mapping, persistence, and CLI modules plug into.
Output: Installable Python package with validated config loading and reusable API client base class.
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-data-infrastructure/01-RESEARCH.md
Task 1: Create Python package scaffold with config system
pyproject.toml
src/usher_pipeline/__init__.py
src/usher_pipeline/config/__init__.py
src/usher_pipeline/config/schema.py
src/usher_pipeline/config/loader.py
config/default.yaml
tests/__init__.py
tests/test_config.py
1. Create `pyproject.toml` using modern Python packaging (PEP 621). Project name: `usher-pipeline`. Python requires >=3.11. Include all dependencies from research: mygene, requests, requests-cache, tenacity, pydantic>=2.0, pydantic-yaml, duckdb, click, polars, pyarrow. Add dev dependencies: pytest, pytest-cov. Use `src` layout with `src/usher_pipeline` as the package.
2. Create `src/usher_pipeline/__init__.py` with `__version__ = "0.1.0"`.
3. Create `src/usher_pipeline/config/schema.py` with Pydantic v2 models:
- `DataSourceVersions(BaseModel)`: fields for ensembl_release (int, ge=100), gnomad_version (str, default "v4.1"), gtex_version (str, default "v8"), hpa_version (str, default "23.0"). All fields with descriptions.
- `ScoringWeights(BaseModel)`: placeholder fields for per-layer weights (gnomad, expression, annotation, localization, animal_model, literature) all float 0.0-1.0 with defaults summing to 1.0.
- `APIConfig(BaseModel)`: rate_limit_per_second (int, default 5), max_retries (int, default 5, ge=1, le=20), cache_ttl_seconds (int, default 86400), timeout_seconds (int, default 30).
- `PipelineConfig(BaseModel)`: data_dir (Path), cache_dir (Path), duckdb_path (Path), versions (DataSourceVersions), api (APIConfig), scoring (ScoringWeights). Add field_validator on data_dir and cache_dir to create directories with mkdir(parents=True, exist_ok=True). Add a method `config_hash() -> str` that computes SHA-256 hash of the config dict (json.dumps with sort_keys=True, default=str).
4. Create `src/usher_pipeline/config/loader.py`:
- `load_config(config_path: Path | str) -> PipelineConfig`: converts the argument to `Path`, reads the YAML file, and parses it with pydantic_yaml.parse_yaml_raw_as into PipelineConfig. Raises FileNotFoundError if the path is missing, pydantic.ValidationError if the content is invalid.
- `load_config_with_overrides(config_path: Path, overrides: dict) -> PipelineConfig`: loads base config, applies dict overrides (for CLI flags), re-validates.
5. Create `config/default.yaml` with sensible defaults: data_dir = "data", cache_dir = "data/cache", duckdb_path = "data/pipeline.duckdb", versions with ensembl_release 113, gnomad v4.1, gtex v8. API config with rate_limit 5, max_retries 5, cache_ttl 86400.
6. Create `tests/test_config.py` with pytest tests:
- test_load_valid_config: loads default.yaml, asserts PipelineConfig returned with correct types
- test_invalid_config_missing_field: config missing data_dir raises ValidationError
- test_invalid_ensembl_release: ensembl_release < 100 raises ValidationError
- test_config_hash_deterministic: same config produces same hash, different config produces different hash
- test_config_creates_directories: loading config with non-existent data_dir creates the directory (use tmp_path fixture)
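The hashing step described for `config_hash()` can be sketched with the standard library alone. This is a minimal illustration, not the final method: it assumes the dictionary comes from something like Pydantic's `model_dump()`, and the standalone function form and the sample dictionaries `a`, `b`, and `c` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def config_hash(config_dict: dict) -> str:
    """Return a SHA-256 hex digest of a config dictionary."""
    # sort_keys makes the serialization independent of key insertion order;
    # default=str converts non-JSON types such as Path into strings.
    payload = json.dumps(config_dict, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical sample configs: a and b differ only in key order, c differs in a value.
a = {"data_dir": Path("data"), "versions": {"ensembl_release": 113, "gnomad_version": "v4.1"}}
b = {"versions": {"gnomad_version": "v4.1", "ensembl_release": 113}, "data_dir": Path("data")}
c = {"data_dir": Path("data"), "versions": {"ensembl_release": 112, "gnomad_version": "v4.1"}}

assert config_hash(a) == config_hash(b)  # key order does not change the hash
assert config_hash(a) != config_hash(c)  # any value change does
```

This is the property `test_config_hash_deterministic` is meant to check: identical content yields an identical digest, and any changed value yields a different one.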
cd /Users/gbanyan/Project/usher-exploring && pip install -e ".[dev]" && pytest tests/test_config.py -v
- pyproject.toml installs cleanly with all dependencies
- All 5 config tests pass
- PipelineConfig validates types, rejects invalid input, creates directories, produces deterministic hashes
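One possible way to implement the override step in `load_config_with_overrides` is a recursive dictionary merge performed before re-validation. The helper name `apply_overrides` and the sample values are assumptions made for illustration; the merged dictionary would then be re-validated, for example with `PipelineConfig.model_validate(merged)`.

```python
from copy import deepcopy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a new dict with overrides merged into base, recursing into nested dicts."""
    merged = deepcopy(base)  # never mutate the caller's base config
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them wholesale.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base config and a CLI-style override of one nested field.
base = {"data_dir": "data", "api": {"max_retries": 5, "timeout_seconds": 30}}
merged = apply_overrides(base, {"api": {"max_retries": 3}})

assert merged["api"] == {"max_retries": 3, "timeout_seconds": 30}
assert base["api"]["max_retries"] == 5  # base is left untouched
```

Merging recursively means a CLI flag that changes a single nested field does not discard its sibling settings.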
Task 2: Create base API client with retry logic and persistent caching
src/usher_pipeline/api_clients/__init__.py
src/usher_pipeline/api_clients/base.py
tests/test_api_client.py
1. Create `src/usher_pipeline/api_clients/base.py` with `CachedAPIClient` class:
- Constructor takes `cache_dir: Path`, `rate_limit: int = 5`, `max_retries: int = 5`, `cache_ttl: int = 86400`, `timeout: int = 30`.
- Creates `requests_cache.CachedSession` with SQLite backend at `cache_dir / 'api_cache'`, expire_after=cache_ttl.
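- Implements `get(url: str, params: dict | None = None, **kwargs) -> requests.Response` wrapped with tenacity `retry`: stop_after_attempt(max_retries), wait_exponential(multiplier=1, min=2, max=60), retrying on HTTPError (429 and 5xx only), Timeout, and ConnectionError. Because `max_retries` is a per-instance setting, apply the retry wrapper in the constructor (or use `tenacity.Retrying` inside `get`); arguments to a class-level `@retry` decorator are fixed when the class is defined. Call `response.raise_for_status()` on 429/5xx responses so they raise HTTPError and trigger a retry. Before raising on 429, log a warning about rate limiting. Sets timeout on all requests.
- Implements `get_json(url: str, params: dict | None = None, **kwargs) -> dict` that calls get() and returns response.json().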
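- Implements simple rate limiting via `time.sleep(1/rate_limit)` after each non-cached response, so that consecutive network requests are spaced out. Whether a response came from the cache is only known once it is returned, so check `response.from_cache` at that point and skip the sleep for cached responses.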
- Add `@classmethod from_config(cls, config: PipelineConfig)` that creates instance from config's api settings and cache_dir.
- Add `clear_cache()` method to clear the SQLite cache.
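- Add `cache_stats() -> dict` returning hit/miss counts. The session does not expose such counters, so track them in the client by incrementing a hit or miss counter in `get()` based on `response.from_cache`.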
2. Create `tests/test_api_client.py` with pytest tests:
- test_client_creates_cache_dir: instantiating with non-existent cache_dir creates it (use tmp_path)
- test_client_caches_response: mock a GET request, call twice, second call returns from cache (use unittest.mock to mock HTTP)
- test_client_from_config: create from PipelineConfig, verify settings applied
- test_rate_limit_respected: verify sleep is called between non-cached requests (mock time.sleep)
Use unittest.mock for HTTP mocking (avoid adding extra test dependencies). Patch requests_cache.CachedSession if needed to avoid real HTTP in tests.
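The rate-limiting and cache-statistics behaviour can be sketched without any HTTP library. This is an illustration under stated assumptions: `ThrottledGetter`, `_FakeSession`, and `_FakeResponse` are hypothetical names, the session is any object whose `get()` returns a response with a `from_cache` attribute, and the sleep function is injectable so that a test can observe it.

```python
import time
from typing import Any, Callable


class ThrottledGetter:
    """Wraps a session; sleeps after network responses and counts cache hits and misses."""

    def __init__(self, session: Any, rate_limit: int = 5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.session = session
        self.min_interval = 1 / rate_limit  # seconds between network requests
        self._sleep = sleep
        self.hits = 0
        self.misses = 0

    def get(self, url: str, **kwargs: Any) -> Any:
        response = self.session.get(url, **kwargs)
        if getattr(response, "from_cache", False):
            self.hits += 1  # served from cache: no throttling needed
        else:
            self.misses += 1
            self._sleep(self.min_interval)  # space out real network requests
        return response

    def cache_stats(self) -> dict:
        return {"hits": self.hits, "misses": self.misses}


class _FakeResponse:
    def __init__(self, from_cache: bool) -> None:
        self.from_cache = from_cache


class _FakeSession:
    """Stand-in for a caching session: a URL counts as cached after its first request."""

    def __init__(self) -> None:
        self._seen: set = set()

    def get(self, url: str, **kwargs: Any) -> _FakeResponse:
        cached = url in self._seen
        self._seen.add(url)
        return _FakeResponse(cached)


sleeps: list = []  # records each sleep duration instead of actually sleeping
client = ThrottledGetter(_FakeSession(), rate_limit=5, sleep=sleeps.append)
client.get("https://example.org/a")  # miss, triggers a sleep
client.get("https://example.org/a")  # hit, no sleep
client.get("https://example.org/b")  # miss, triggers a sleep
```

After the three calls, `client.cache_stats()` reports one hit and two misses, and `sleeps` holds two entries of 0.2 seconds. The same injection approach lets `test_rate_limit_respected` assert on sleep calls without waiting in real time.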
cd /Users/gbanyan/Project/usher-exploring && pytest tests/test_api_client.py -v
- CachedAPIClient instantiates with SQLite cache in specified directory
- Retry decorator configured with exponential backoff on HTTP errors
- Rate limiting sleeps between non-cached requests
- All 4 API client tests pass
- from_config classmethod correctly reads PipelineConfig settings
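To see what the exponential backoff settings imply in practice, the wait schedule can be worked out by hand. The sketch below assumes the wait before retry n is `multiplier * 2 ** (n - 1)` clamped to the range `[min, max]`, which is how tenacity's `wait_exponential` is understood to behave; `backoff_waits` is a hypothetical helper, not part of tenacity.

```python
def backoff_waits(max_attempts: int, multiplier: float = 1.0,
                  min_wait: float = 2.0, max_wait: float = 60.0) -> list:
    """Return the waits (in seconds) between attempts for a clamped exponential backoff."""
    waits = []
    # There is one wait after each failed attempt except the last one.
    for attempt in range(1, max_attempts):
        raw = multiplier * 2 ** (attempt - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


waits = backoff_waits(5)
assert waits == [2.0, 2.0, 4.0, 8.0]
assert sum(waits) == 16.0  # worst-case added delay with max_retries=5
```

Under these assumptions, the default `max_retries=5` adds at most 16 seconds of waiting on top of request timeouts, and the 60-second cap is only reached from the seventh wait onward.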
1. `pip install -e ".[dev]"` completes without errors
2. `pytest tests/ -v` shows all config and API client tests passing
3. `python -c "from usher_pipeline.config.loader import load_config; c = load_config('config/default.yaml'); print(c.config_hash())"` prints a SHA-256 hash
4. `python -c "from usher_pipeline.api_clients.base import CachedAPIClient"` imports without error
- Python package installs with all bioinformatics dependencies
- Config loads from YAML, validates with Pydantic, rejects invalid input
- API client provides retry + caching foundation for all future API modules
- All tests pass