- Created comprehensive SUMMARY.md with all execution details - Updated STATE.md: 1/4 plans in phase 1 complete, 16.7% overall progress - Documented deviation (venv creation) and decisions - Verified all files and commits exist (self-check passed)
243 lines
10 KiB
Markdown
243 lines
10 KiB
Markdown
---
|
|
phase: 01-data-infrastructure
|
|
plan: 01
|
|
subsystem: foundation
|
|
tags: [infrastructure, config, api-client, testing]
|
|
dependency_graph:
|
|
requires: []
|
|
provides:
|
|
- Python package scaffold with src layout
|
|
- Pydantic v2 config system with YAML loading
|
|
- Base API client pattern with retry and caching
|
|
affects:
|
|
- All future plans (foundational dependencies)
|
|
tech_stack:
|
|
added:
|
|
- Python 3.11+ with pyproject.toml packaging
|
|
- Pydantic v2 for config validation
|
|
- pydantic-yaml for YAML parsing
|
|
- requests-cache for persistent SQLite caching
|
|
- tenacity for retry with exponential backoff
|
|
- pytest for testing
|
|
patterns:
|
|
- Config-driven architecture with validation gates
|
|
- Reusable API client base class pattern
|
|
- Virtual environment isolation
|
|
key_files:
|
|
created:
|
|
- pyproject.toml: Package definition with bioinformatics dependencies
|
|
- src/usher_pipeline/__init__.py: Package root
|
|
- src/usher_pipeline/config/schema.py: Pydantic models (PipelineConfig, DataSourceVersions, APIConfig, ScoringWeights)
|
|
- src/usher_pipeline/config/loader.py: YAML loading with override support
|
|
- src/usher_pipeline/api_clients/base.py: CachedAPIClient with retry and rate limiting
|
|
- config/default.yaml: Default pipeline configuration (Ensembl 113, gnomAD v4.1)
|
|
- tests/test_config.py: 5 config validation tests
|
|
- tests/test_api_client.py: 5 API client tests
|
|
modified: []
|
|
decisions:
|
|
- decision: "Virtual environment required due to externally-managed Python"
|
|
rationale: "macOS system Python uses PEP 668 protection; venv isolates dependencies"
|
|
alternatives: ["--break-system-packages flag (rejected - risky)", "pipx (rejected - inappropriate for development)"]
|
|
- decision: "Auto-created .venv during Task 1 execution"
|
|
rationale: "Blocking issue (Rule 3) - pip install failed without venv"
|
|
impact: "Added venv creation step; documented in deviation log"
|
|
metrics:
|
|
duration_minutes: 3
|
|
tasks_completed: 2
|
|
files_created: 11
|
|
tests_added: 10
|
|
commits: 2
|
|
completed_date: "2026-02-11"
|
|
---
|
|
|
|
# Phase 01 Plan 01: Project Scaffold, Config System, and Base API Client Summary
|
|
|
|
**One-liner:** Installable Python package with Pydantic v2 config validation (YAML loading, directory creation, deterministic hashing) and reusable CachedAPIClient base class (SQLite persistence, retry with exponential backoff, rate limiting).
|
|
|
|
## What Was Built
|
|
|
|
Created the foundational Python package structure and two core infrastructure components that all subsequent plans depend on:
|
|
|
|
1. **Python Package Scaffold**
|
|
- Modern pyproject.toml with PEP 621 packaging
|
|
- src/usher_pipeline layout for clean imports
|
|
- All bioinformatics dependencies: mygene, requests, requests-cache, tenacity, pydantic>=2.0, pydantic-yaml, duckdb, click, polars, pyarrow
|
|
- Dev dependencies: pytest, pytest-cov
|
|
- Virtual environment (.venv) for dependency isolation
|
|
|
|
2. **Config System**
|
|
- Pydantic v2 models with validation:
|
|
- DataSourceVersions: Ensembl (>= 100), gnomAD, GTEx, HPA versions
|
|
- ScoringWeights: Per-layer weights (gnomad, expression, annotation, localization, animal_model, literature)
|
|
- APIConfig: Rate limiting, retries, cache TTL, timeout
|
|
- PipelineConfig: Aggregates all settings with Path validation
|
|
- Field validators: ensembl_release >= 100, auto-create directories
|
|
- Config hash method: SHA-256 for cache invalidation and provenance
|
|
- YAML loader with override support for CLI flags
|
|
- Default config: Ensembl 113, gnomAD v4.1, GTEx v8
|
|
|
|
3. **Base API Client**
|
|
- CachedAPIClient class with:
|
|
- SQLite persistent cache via requests_cache
|
|
- Retry with exponential backoff (tenacity): 429/5xx/network errors
|
|
- Rate limiting: configurable req/sec with skip for cached responses
|
|
- Timeout and max_retries configuration
|
|
- from_config classmethod for pipeline integration
|
|
- cache_stats() and clear_cache() utilities
|
|
|
|
## Tests
|
|
|
|
**10 tests total (all passing):**
|
|
|
|
### Config Tests (5)
|
|
- `test_load_valid_config`: Loads default.yaml, validates PipelineConfig types
|
|
- `test_invalid_config_missing_field`: Missing required field raises ValidationError
|
|
- `test_invalid_ensembl_release`: ensembl_release < 100 raises ValidationError
|
|
- `test_config_hash_deterministic`: Same config = same hash, different config = different hash
|
|
- `test_config_creates_directories`: Non-existent data_dir/cache_dir created on load
|
|
|
|
### API Client Tests (5)
|
|
- `test_client_creates_cache_dir`: Instantiation creates cache directory
|
|
- `test_client_caches_response`: Second request retrieves from cache
|
|
- `test_client_from_config`: from_config applies PipelineConfig settings
|
|
- `test_rate_limit_respected`: Non-cached requests trigger sleep (1/rate_limit)
|
|
- `test_rate_limit_skipped_for_cached`: Cached requests skip rate limiting
|
|
|
|
## Verification Results
|
|
|
|
All plan verification steps passed:
|
|
|
|
```bash
|
|
# 1. Package installation
|
|
$ pip install -e ".[dev]"
|
|
Successfully installed usher-pipeline-0.1.0
|
|
|
|
# 2. All tests pass
|
|
$ pytest tests/ -v
|
|
======================== 10 passed, 1 warning in 0.15s =========================
|
|
|
|
# 3. Config hash generation
|
|
$ python -c "from usher_pipeline.config.loader import load_config; c = load_config('config/default.yaml'); print(c.config_hash())"
|
|
ddbb5195738ac3540f08ed0a46d5936cca070ec880fba3f65e7da48b81ca2b0f
|
|
|
|
# 4. API client import
|
|
$ python -c "from usher_pipeline.api_clients.base import CachedAPIClient"
|
|
CachedAPIClient imported successfully
|
|
```
|
|
|
|
## Deviations from Plan
|
|
|
|
### Auto-fixed Issues
|
|
|
|
**1. [Rule 3 - Blocking] Created virtual environment for dependency isolation**
|
|
- **Found during:** Task 1 (pip install -e ".[dev]")
|
|
- **Issue:** macOS system Python is externally-managed (PEP 668), blocking direct pip installs
|
|
- **Fix:** Created .venv with `python3 -m venv .venv`, upgraded pip/setuptools/wheel, installed package in isolated environment
|
|
- **Files modified:** .venv/ (created)
|
|
- **Commit:** Included in Task 1 commit (4a80a03)
|
|
- **Rationale:** Blocking issue preventing task completion; venv is standard practice for Python development
|
|
|
|
No other deviations. Plan executed as written after venv creation.
|
|
|
|
## Task Execution Log
|
|
|
|
### Task 1: Create Python package scaffold with config system
|
|
**Status:** Complete
|
|
**Duration:** ~2 minutes
|
|
**Commit:** 4a80a03
|
|
|
|
**Actions:**
|
|
1. Created pyproject.toml with modern PEP 621 packaging
|
|
2. Created src/usher_pipeline package structure
|
|
3. Implemented Pydantic v2 config schema with validators
|
|
4. Implemented YAML loader with override support
|
|
5. Created default.yaml with sensible defaults
|
|
6. Wrote 5 comprehensive config tests
|
|
7. Fixed blocking venv issue (deviation Rule 3)
|
|
8. Installed package with `pip install -e ".[dev]"`
|
|
9. Verified all 5 tests pass
|
|
|
|
**Files created:** 8 files (pyproject.toml, 3 config modules, default.yaml, 2 test files, package __init__)
|
|
|
|
**Key validation gates:**
|
|
- ensembl_release >= 100 (rejects outdated releases)
|
|
- Directory auto-creation on Path fields
|
|
- Config hash for cache invalidation
|
|
|
|
### Task 2: Create base API client with retry logic and persistent caching
|
|
**Status:** Complete
|
|
**Duration:** ~1 minute
|
|
**Commit:** 4204116
|
|
|
|
**Actions:**
|
|
1. Implemented CachedAPIClient base class
|
|
2. Integrated requests_cache with SQLite backend
|
|
3. Added tenacity retry decorator with exponential backoff
|
|
4. Implemented rate limiting with cache-aware skip
|
|
5. Added from_config classmethod for pipeline integration
|
|
6. Wrote 5 comprehensive API client tests
|
|
7. Verified all 5 tests pass
|
|
|
|
**Files created:** 3 files (base.py, api_clients __init__, test_api_client.py)
|
|
|
|
**Key features:**
|
|
- Persistent SQLite cache with configurable TTL
|
|
- Retry on 429/5xx/network errors (exponential backoff: 2-60s)
|
|
- Rate limiting (default 5 req/sec) skipped for cached responses
|
|
- Timeout configuration (default 30s)
|
|
|
|
## Success Criteria Verification
|
|
|
|
- [x] Python package installs with all bioinformatics dependencies
|
|
- [x] Config loads from YAML, validates with Pydantic, rejects invalid input
|
|
- [x] API client provides retry + caching foundation for all future API modules
|
|
- [x] All tests pass (10/10)
|
|
|
|
## Must-Haves Verification
|
|
|
|
**Truths:**
|
|
- [x] YAML config loads and validates with Pydantic, returning typed PipelineConfig object
|
|
- [x] Invalid config (missing required fields, wrong types, bad values) raises ValidationError with clear messages
|
|
- [x] CachedAPIClient makes HTTP requests with automatic retry on 429/5xx and persistent SQLite caching
|
|
- [x] Pipeline is installable as Python package with all dependencies pinned
|
|
|
|
**Artifacts:**
|
|
- [x] pyproject.toml provides "Package definition with all dependencies" containing "mygene"
|
|
- [x] src/usher_pipeline/config/schema.py provides "Pydantic models for pipeline configuration" containing "class PipelineConfig"
|
|
- [x] src/usher_pipeline/config/loader.py provides "YAML config loading with validation" containing "def load_config"
|
|
- [x] src/usher_pipeline/api_clients/base.py provides "Base API client with retry and caching" containing "class CachedAPIClient"
|
|
- [x] config/default.yaml provides "Default pipeline configuration" containing "ensembl_release"
|
|
|
|
**Key Links:**
|
|
- [x] src/usher_pipeline/config/loader.py → src/usher_pipeline/config/schema.py via "imports PipelineConfig for validation" (pattern: `from.*schema.*import.*PipelineConfig`)
|
|
- [x] src/usher_pipeline/api_clients/base.py → requests_cache via "creates CachedSession for persistent caching" (pattern: `requests_cache\.CachedSession`)
|
|
|
|
## Next Steps
|
|
|
|
Phase 01, Plan 02 depends on this plan's outputs:
|
|
- Gene ID mapping will use PipelineConfig for data versioning
|
|
- API clients for Ensembl/MyGene will inherit from CachedAPIClient
|
|
- DuckDB persistence (Plan 03) will store config_hash for provenance
|
|
|
|
## Self-Check: PASSED
|
|
|
|
**Files verified:**
|
|
```bash
|
|
FOUND: /Users/gbanyan/Project/usher-exploring/pyproject.toml
|
|
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/__init__.py
|
|
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/schema.py
|
|
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/loader.py
|
|
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/api_clients/base.py
|
|
FOUND: /Users/gbanyan/Project/usher-exploring/config/default.yaml
|
|
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_config.py
|
|
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_api_client.py
|
|
```
|
|
|
|
**Commits verified:**
|
|
```bash
|
|
FOUND: 4a80a03 (Task 1)
|
|
FOUND: 4204116 (Task 2)
|
|
```
|
|
|
|
All files and commits exist as documented.
|