- Created comprehensive SUMMARY.md with all execution details
- Updated STATE.md: 1/4 plans in phase 1 complete, 16.7% overall progress
- Documented deviation (venv creation) and decisions
- Verified all files and commits exist (self-check passed)
| phase | plan | subsystem | tags | dependency_graph | tech_stack | key_files | decisions | metrics |
|---|---|---|---|---|---|---|---|---|
| 01-data-infrastructure | 01 | foundation | | | | | | |
Phase 01 Plan 01: Project Scaffold, Config System, and Base API Client Summary
One-liner: Installable Python package with Pydantic v2 config validation (YAML loading, directory creation, deterministic hashing) and reusable CachedAPIClient base class (SQLite persistence, retry with exponential backoff, rate limiting).
What Was Built
Created the foundational Python package structure and two core infrastructure components that all subsequent plans depend on:
- Python Package Scaffold
  - Modern pyproject.toml with PEP 621 packaging
  - src/usher_pipeline layout for clean imports
  - All bioinformatics dependencies: mygene, requests, requests-cache, tenacity, pydantic>=2.0, pydantic-yaml, duckdb, click, polars, pyarrow
  - Dev dependencies: pytest, pytest-cov
  - Virtual environment (.venv) for dependency isolation
- Config System
  - Pydantic v2 models with validation:
    - DataSourceVersions: Ensembl (>= 100), gnomAD, GTEx, HPA versions
    - ScoringWeights: Per-layer weights (gnomad, expression, annotation, localization, animal_model, literature)
    - APIConfig: Rate limiting, retries, cache TTL, timeout
    - PipelineConfig: Aggregates all settings with Path validation
  - Field validators: ensembl_release >= 100, auto-create directories
  - Config hash method: SHA-256 for cache invalidation and provenance
  - YAML loader with override support for CLI flags
  - Default config: Ensembl 113, gnomAD v4.1, GTEx v8
- Base API Client
  - CachedAPIClient class with:
    - SQLite persistent cache via requests_cache
    - Retry with exponential backoff (tenacity): 429/5xx/network errors
    - Rate limiting: configurable req/sec with skip for cached responses
    - Timeout and max_retries configuration
    - from_config classmethod for pipeline integration
    - cache_stats() and clear_cache() utilities
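The override support in the YAML loader is not spelled out in this summary. One plausible reading is a recursive merge of CLI-supplied values on top of the parsed YAML. The sketch below illustrates that idea with the standard library only; `apply_overrides` and every dictionary key shown are hypothetical names chosen for illustration, not the loader's actual API or schema.

```python
from copy import deepcopy
from typing import Any, Mapping


def apply_overrides(base: Mapping[str, Any], overrides: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of `base` with `overrides` merged in recursively (hypothetical helper)."""
    merged = deepcopy(dict(base))
    for key, value in overrides.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            # Nested section: merge key by key instead of replacing it wholesale.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Values as they might look after parsing a YAML file (key names are assumptions).
defaults = {
    "versions": {"ensembl_release": 113, "gnomad_version": "v4.1", "gtex_version": "v8"},
    "api": {"rate_limit_per_second": 5, "timeout_seconds": 30},
}
# A hypothetical override coming from a CLI flag.
cli_overrides = {"api": {"timeout_seconds": 60}}

effective = apply_overrides(defaults, cli_overrides)
print(effective["api"])
```

Merging before validation would let the schema check the final, effective values rather than the file contents alone, and leaves the defaults dictionary untouched.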
Tests
10 tests total (all passing):
Config Tests (5)
- test_load_valid_config: Loads default.yaml, validates PipelineConfig types
- test_invalid_config_missing_field: Missing required field raises ValidationError
- test_invalid_ensembl_release: ensembl_release < 100 raises ValidationError
- test_config_hash_deterministic: Same config = same hash, different config = different hash
- test_config_creates_directories: Non-existent data_dir/cache_dir created on load
API Client Tests (5)
- test_client_creates_cache_dir: Instantiation creates cache directory
- test_client_caches_response: Second request retrieves from cache
- test_client_from_config: from_config applies PipelineConfig settings
- test_rate_limit_respected: Non-cached requests trigger sleep (1/rate_limit)
- test_rate_limit_skipped_for_cached: Cached requests skip rate limiting
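The two rate-limit tests describe behaviour that is easy to check in isolation: a non-cached request sleeps for 1/rate_limit seconds, and a cached one does not sleep at all. The following is a minimal sketch of that rule, not the CachedAPIClient code; the `RateLimiter` class and the injectable `sleep` parameter are assumptions made so the behaviour can be observed without real delays.

```python
import time
from typing import Callable


class RateLimiter:
    """Hypothetical helper: pause 1/rate_limit seconds before non-cached requests."""

    def __init__(self, rate_limit: float, sleep: Callable[[float], None] = time.sleep) -> None:
        if rate_limit <= 0:
            raise ValueError("rate_limit must be positive")
        self.min_interval = 1.0 / rate_limit
        self._sleep = sleep

    def wait(self, from_cache: bool) -> None:
        # Cached responses never reach the remote API, so they are not throttled.
        if not from_cache:
            self._sleep(self.min_interval)


# Record requested sleep durations instead of actually sleeping.
recorded: list[float] = []
limiter = RateLimiter(rate_limit=5, sleep=recorded.append)
limiter.wait(from_cache=False)  # network request: one sleep is recorded
limiter.wait(from_cache=True)   # cache hit: no sleep is recorded
print(recorded)
```

Injecting the sleep function is one way to keep such tests fast; patching `time.sleep` with a mock would achieve the same effect.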
Verification Results
All plan verification steps passed:
# 1. Package installation
$ pip install -e ".[dev]"
Successfully installed usher-pipeline-0.1.0
# 2. All tests pass
$ pytest tests/ -v
======================== 10 passed, 1 warning in 0.15s =========================
# 3. Config hash generation
$ python -c "from usher_pipeline.config.loader import load_config; c = load_config('config/default.yaml'); print(c.config_hash())"
ddbb5195738ac3540f08ed0a46d5936cca070ec880fba3f65e7da48b81ca2b0f
# 4. API client import
$ python -c "from usher_pipeline.api_clients.base import CachedAPIClient"
CachedAPIClient imported successfully
Deviations from Plan
Auto-fixed Issues
1. [Rule 3 - Blocking] Created virtual environment for dependency isolation
- Found during: Task 1 (pip install -e ".[dev]")
- Issue: macOS system Python is externally-managed (PEP 668), blocking direct pip installs
- Fix: Created .venv with python3 -m venv .venv, upgraded pip/setuptools/wheel, installed package in isolated environment
- Files modified: .venv/ (created)
- Commit: Included in Task 1 commit (4a80a03)
- Rationale: Blocking issue preventing task completion; venv is standard practice for Python development
No other deviations. Plan executed as written after venv creation.
Task Execution Log
Task 1: Create Python package scaffold with config system
Status: Complete
Duration: ~2 minutes
Commit: 4a80a03
Actions:
- Created pyproject.toml with modern PEP 621 packaging
- Created src/usher_pipeline package structure
- Implemented Pydantic v2 config schema with validators
- Implemented YAML loader with override support
- Created default.yaml with sensible defaults
- Wrote 5 comprehensive config tests
- Fixed blocking venv issue (deviation Rule 3)
- Installed package with pip install -e ".[dev]"
- Verified all 5 tests pass
Files created: 8 files (pyproject.toml, 3 config modules, default.yaml, 2 test files, package init)
Key validation gates:
- ensembl_release >= 100 (rejects outdated releases)
- Directory auto-creation on Path fields
- Config hash for cache invalidation
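The three gates above can be illustrated together. The project implements them with Pydantic v2 validators; the sketch below deliberately swaps in a standard-library dataclass so it runs without third-party packages. `SketchConfig`, its field set, and the choice to hash a sorted-key JSON dump are assumptions for illustration and may differ from the real PipelineConfig.

```python
import hashlib
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchConfig:
    """Stand-in for the Pydantic model (hypothetical, reduced field set)."""

    ensembl_release: int
    data_dir: Path
    cache_dir: Path

    def __post_init__(self) -> None:
        # Gate 1: reject outdated Ensembl releases.
        if self.ensembl_release < 100:
            raise ValueError("ensembl_release must be >= 100")
        # Gate 2: create missing directories as a side effect of validation.
        for directory in (self.data_dir, self.cache_dir):
            Path(directory).mkdir(parents=True, exist_ok=True)

    def config_hash(self) -> str:
        # Gate 3: hash a canonical (sorted-key) JSON dump so equal configs hash equally.
        payload = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cfg_a = SketchConfig(113, root / "data", root / "cache")
    cfg_b = SketchConfig(113, root / "data", root / "cache")
    cfg_c = SketchConfig(112, root / "data", root / "cache")
    dirs_created = (root / "data").is_dir() and (root / "cache").is_dir()
    same_hash = cfg_a.config_hash() == cfg_b.config_hash()
    different_hash = cfg_a.config_hash() != cfg_c.config_hash()
    try:
        SketchConfig(99, root / "data", root / "cache")
        rejected = False
    except ValueError:
        rejected = True

print(dirs_created, same_hash, different_hash, rejected)
```

Sorting keys before hashing is what makes the digest independent of field ordering, which matters if the hash is later stored for provenance.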
Task 2: Create base API client with retry logic and persistent caching
Status: Complete
Duration: ~1 minute
Commit: 4204116
Actions:
- Implemented CachedAPIClient base class
- Integrated requests_cache with SQLite backend
- Added tenacity retry decorator with exponential backoff
- Implemented rate limiting with cache-aware skip
- Added from_config classmethod for pipeline integration
- Wrote 5 comprehensive API client tests
- Verified all 5 tests pass
Files created: 3 files (base.py, api_clients init, test_api_client.py)
Key features:
- Persistent SQLite cache with configurable TTL
- Retry on 429/5xx/network errors (exponential backoff: 2-60s)
- Rate limiting (default 5 req/sec) skipped for cached responses
- Timeout configuration (default 30s)
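The retry behaviour listed above is provided by tenacity in the actual client. As a rough illustration of the underlying idea only, the sketch below uses a plain loop instead of tenacity: the wait doubles after each failure, starting at 2 seconds and capped at 60 seconds. `RetryableError`, `call_with_backoff`, and the default of 5 retries are hypothetical, and tenacity's exact wait schedule may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Stand-in for a 429, 5xx, or network failure (hypothetical)."""


def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 5,
    min_wait: float = 2.0,
    max_wait: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on RetryableError with capped exponential waits."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the last error
            # Double the wait on each failed attempt, clamped to max_wait.
            sleep(min(max_wait, min_wait * 2 ** attempt))
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky() -> str:
    # Fails three times with a simulated server error, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 4:
        raise RetryableError("simulated 503")
    return "ok"


waits: list[float] = []
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Capping the wait keeps a long outage from stretching a single request indefinitely, while the doubling gives a rate-limited API time to recover.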
Success Criteria Verification
- Python package installs with all bioinformatics dependencies
- Config loads from YAML, validates with Pydantic, rejects invalid input
- API client provides retry + caching foundation for all future API modules
- All tests pass (10/10)
Must-Haves Verification
Truths:
- YAML config loads and validates with Pydantic, returning typed PipelineConfig object
- Invalid config (missing required fields, wrong types, bad values) raises ValidationError with clear messages
- CachedAPIClient makes HTTP requests with automatic retry on 429/5xx and persistent SQLite caching
- Pipeline is installable as Python package with all dependencies pinned
Artifacts:
- pyproject.toml provides "Package definition with all dependencies" containing "mygene"
- src/usher_pipeline/config/schema.py provides "Pydantic models for pipeline configuration" containing "class PipelineConfig"
- src/usher_pipeline/config/loader.py provides "YAML config loading with validation" containing "def load_config"
- src/usher_pipeline/api_clients/base.py provides "Base API client with retry and caching" containing "class CachedAPIClient"
- config/default.yaml provides "Default pipeline configuration" containing "ensembl_release"
Key Links:
- src/usher_pipeline/config/loader.py → src/usher_pipeline/config/schema.py via "imports PipelineConfig for validation" (pattern: from.*schema.*import.*PipelineConfig)
- src/usher_pipeline/api_clients/base.py → requests_cache via "creates CachedSession for persistent caching" (pattern: requests_cache\.CachedSession)
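The key-link patterns are regular expressions, so they can be checked mechanically against source text. The sketch below shows how such a check might look; the sample source lines are illustrative stand-ins written for this example, not lines copied from the repository, and how the real verification is performed is not stated in this summary.

```python
import re

# Patterns copied from the Key Links list above.
KEY_LINK_PATTERNS = {
    "loader imports PipelineConfig": r"from.*schema.*import.*PipelineConfig",
    "base client creates CachedSession": r"requests_cache\.CachedSession",
}

# Illustrative source lines (assumptions, not actual repository content).
samples = {
    "loader imports PipelineConfig": "from usher_pipeline.config.schema import PipelineConfig",
    "base client creates CachedSession": "self.session = requests_cache.CachedSession(cache_name)",
}

# A link counts as present when its pattern matches somewhere in the sample text.
results = {
    name: bool(re.search(pattern, samples[name]))
    for name, pattern in KEY_LINK_PATTERNS.items()
}
print(results)
```

The escaped dot in the second pattern matters: it requires a literal `.` between `requests_cache` and `CachedSession` rather than any character.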
Next Steps
Phase 01, Plan 02 depends on this plan's outputs:
- Gene ID mapping will use PipelineConfig for data versioning
- API clients for Ensembl/MyGene will inherit from CachedAPIClient
- DuckDB persistence (Plan 03) will store config_hash for provenance
Self-Check: PASSED
Files verified:
FOUND: /Users/gbanyan/Project/usher-exploring/pyproject.toml
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/__init__.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/schema.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/loader.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/api_clients/base.py
FOUND: /Users/gbanyan/Project/usher-exploring/config/default.yaml
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_config.py
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_api_client.py
Commits verified:
FOUND: 4a80a03 (Task 1)
FOUND: 4204116 (Task 2)
All files and commits exist as documented.