gbanyan 9ee3ec2e84 docs(01-01): complete project scaffold and config system plan
- Created comprehensive SUMMARY.md with all execution details
- Updated STATE.md: 1/4 plans in phase 1 complete, 16.7% overall progress
- Documented deviation (venv creation) and decisions
- Verified all files and commits exist (self-check passed)
2026-02-11 16:28:03 +08:00

phase: 01-data-infrastructure
plan: 01
subsystem: foundation
tags:
  - infrastructure
  - config
  - api-client
  - testing
dependency_graph:
  requires: (none listed)
  provides:
    - Python package scaffold with src layout
    - Pydantic v2 config system with YAML loading
    - Base API client pattern with retry and caching
  affects:
    - All future plans (foundational dependencies)
tech_stack:
  added:
    - Python 3.11+ with pyproject.toml packaging
    - Pydantic v2 for config validation
    - pydantic-yaml for YAML parsing
    - requests-cache for persistent SQLite caching
    - tenacity for retry with exponential backoff
    - pytest for testing
  patterns:
    - Config-driven architecture with validation gates
    - Reusable API client base class pattern
    - Virtual environment isolation
key_files:
  created:
    - pyproject.toml: Package definition with bioinformatics dependencies
    - src/usher_pipeline/__init__.py: Package root
    - src/usher_pipeline/config/schema.py: Pydantic models (PipelineConfig, DataSourceVersions, APIConfig, ScoringWeights)
    - src/usher_pipeline/config/loader.py: YAML loading with override support
    - src/usher_pipeline/api_clients/base.py: CachedAPIClient with retry and rate limiting
    - config/default.yaml: Default pipeline configuration (Ensembl 113, gnomAD v4.1)
    - tests/test_config.py: 5 config validation tests
    - tests/test_api_client.py: 5 API client tests
  modified: (none listed)
decisions:
  - decision: Virtual environment required due to externally-managed Python
    rationale: macOS system Python uses PEP 668 protection; venv isolates dependencies
    alternatives:
      - --break-system-packages flag (rejected - risky)
      - pipx (rejected - inappropriate for development)
  - decision: Auto-created .venv during Task 1 execution
    rationale: Blocking issue (Rule 3) - pip install failed without venv
    impact: Added venv creation step; documented in deviation log
metrics:
  duration_minutes: 3
  tasks_completed: 2
  files_created: 11
  tests_added: 10
  commits: 2
  completed_date: 2026-02-11

Phase 01 Plan 01: Project Scaffold, Config System, and Base API Client Summary

One-liner: Installable Python package with Pydantic v2 config validation (YAML loading, directory creation, deterministic hashing) and reusable CachedAPIClient base class (SQLite persistence, retry with exponential backoff, rate limiting).

What Was Built

Created the foundational Python package structure and two core infrastructure components that all subsequent plans depend on:

  1. Python Package Scaffold

    • Modern pyproject.toml with PEP 621 packaging
    • src/usher_pipeline layout for clean imports
    • All bioinformatics dependencies: mygene, requests, requests-cache, tenacity, pydantic>=2.0, pydantic-yaml, duckdb, click, polars, pyarrow
    • Dev dependencies: pytest, pytest-cov
    • Virtual environment (.venv) for dependency isolation
  2. Config System

    • Pydantic v2 models with validation:
      • DataSourceVersions: Ensembl (>= 100), gnomAD, GTEx, HPA versions
      • ScoringWeights: Per-layer weights (gnomad, expression, annotation, localization, animal_model, literature)
      • APIConfig: Rate limiting, retries, cache TTL, timeout
      • PipelineConfig: Aggregates all settings with Path validation
    • Field validators: ensembl_release >= 100, auto-create directories
    • Config hash method: SHA-256 for cache invalidation and provenance
    • YAML loader with override support for CLI flags
    • Default config: Ensembl 113, gnomAD v4.1, GTEx v8
  3. Base API Client

    • CachedAPIClient class with:
      • SQLite persistent cache via requests_cache
      • Retry with exponential backoff (tenacity): 429/5xx/network errors
      • Rate limiting: configurable req/sec with skip for cached responses
      • Timeout and max_retries configuration
      • from_config classmethod for pipeline integration
      • cache_stats() and clear_cache() utilities
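
One way to picture the loader's override support is a recursive merge of CLI-supplied values over the parsed YAML mapping, done before validation. The following is a minimal stdlib-only sketch under that assumption. The helper `apply_overrides` and all key names are hypothetical and are not taken from loader.py.

```python
from copy import deepcopy
from typing import Any


def apply_overrides(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new mapping with `overrides` merged recursively over `base`.

    Hypothetical helper: the real loader may merge differently.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them whole.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-in for the parsed contents of config/default.yaml (key names assumed).
defaults = {
    "versions": {"ensembl_release": 113, "gnomad_version": "v4.1"},
    "api": {"rate_limit_per_second": 5, "timeout_seconds": 30},
}
# Stand-in for values collected from CLI flags.
cli_overrides = {"api": {"timeout_seconds": 60}}

effective = apply_overrides(defaults, cli_overrides)
```

Merging into a copy leaves the defaults untouched, so the same base mapping can be reused for several runs with different flags.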

Tests

10 tests total (all passing):

Config Tests (5)

  • test_load_valid_config: Loads default.yaml, validates PipelineConfig types
  • test_invalid_config_missing_field: Missing required field raises ValidationError
  • test_invalid_ensembl_release: ensembl_release < 100 raises ValidationError
  • test_config_hash_deterministic: Same config = same hash, different config = different hash
  • test_config_creates_directories: Non-existent data_dir/cache_dir created on load
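
The property checked by test_config_hash_deterministic can be illustrated with a small stdlib sketch. It assumes the hash is computed over a canonical JSON rendering of the settings; the source only states that SHA-256 is used, so the serialization step and the free function `config_hash` shown here are assumptions, not the method on PipelineConfig.

```python
import hashlib
import json


def config_hash(settings: dict) -> str:
    """SHA-256 hex digest of a canonical JSON rendering of the settings."""
    # sort_keys and fixed separators make the serialization independent of
    # dict insertion order and whitespace, so equal configs hash equally.
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = {"ensembl_release": 113, "gnomad_version": "v4.1"}
b = {"gnomad_version": "v4.1", "ensembl_release": 113}  # same content, different order
c = {"ensembl_release": 112, "gnomad_version": "v4.1"}  # one value changed

same = config_hash(a) == config_hash(b)
different = config_hash(a) != config_hash(c)
```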

API Client Tests (5)

  • test_client_creates_cache_dir: Instantiation creates cache directory
  • test_client_caches_response: Second request retrieves from cache
  • test_client_from_config: from_config applies PipelineConfig settings
  • test_rate_limit_respected: Non-cached requests trigger sleep (1/rate_limit)
  • test_rate_limit_skipped_for_cached: Cached requests skip rate limiting
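
The two rate-limit tests describe a simple rule: pause for 1/rate_limit seconds after a request that hit the network, and do not pause after a cache hit. A minimal sketch of that rule follows. The `RateLimiter` class and its injectable `sleep` parameter are hypothetical and only illustrate the behavior; CachedAPIClient may structure this differently.

```python
import time
from typing import Callable


class RateLimiter:
    """Hypothetical helper: pause 1/rate seconds after each non-cached request."""

    def __init__(self, requests_per_second: float,
                 sleep: Callable[[float], None] = time.sleep):
        self.interval = 1.0 / requests_per_second
        self._sleep = sleep

    def after_request(self, from_cache: bool) -> None:
        # A cache hit never reached the remote server, so no pause is needed.
        if not from_cache:
            self._sleep(self.interval)


# Record the requested pauses instead of actually sleeping.
pauses: list[float] = []
limiter = RateLimiter(requests_per_second=5, sleep=pauses.append)

limiter.after_request(from_cache=False)  # network request: one pause recorded
limiter.after_request(from_cache=True)   # cache hit: no pause recorded
```

Injecting the sleep function keeps a test of this rule fast, since it can assert on the recorded pauses without waiting.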

Verification Results

All plan verification steps passed:

# 1. Package installation
$ pip install -e ".[dev]"
Successfully installed usher-pipeline-0.1.0

# 2. All tests pass
$ pytest tests/ -v
======================== 10 passed, 1 warning in 0.15s =========================

# 3. Config hash generation
$ python -c "from usher_pipeline.config.loader import load_config; c = load_config('config/default.yaml'); print(c.config_hash())"
ddbb5195738ac3540f08ed0a46d5936cca070ec880fba3f65e7da48b81ca2b0f

# 4. API client import
$ python -c "from usher_pipeline.api_clients.base import CachedAPIClient; print('CachedAPIClient imported successfully')"
CachedAPIClient imported successfully

Deviations from Plan

Auto-fixed Issues

1. [Rule 3 - Blocking] Created virtual environment for dependency isolation

  • Found during: Task 1 (pip install -e ".[dev]")
  • Issue: macOS system Python is externally-managed (PEP 668), blocking direct pip installs
  • Fix: Created .venv with python3 -m venv .venv, upgraded pip/setuptools/wheel, installed package in isolated environment
  • Files modified: .venv/ (created)
  • Commit: Included in Task 1 commit (4a80a03)
  • Rationale: Blocking issue preventing task completion; venv is standard practice for Python development

No other deviations. Plan executed as written after venv creation.

Task Execution Log

Task 1: Create Python package scaffold with config system

Status: Complete
Duration: ~2 minutes
Commit: 4a80a03

Actions:

  1. Created pyproject.toml with modern PEP 621 packaging
  2. Created src/usher_pipeline package structure
  3. Implemented Pydantic v2 config schema with validators
  4. Implemented YAML loader with override support
  5. Created default.yaml with sensible defaults
  6. Wrote 5 comprehensive config tests
  7. Fixed blocking venv issue (deviation Rule 3)
  8. Installed package with pip install -e ".[dev]"
  9. Verified all 5 tests pass

Files created: 8 files (pyproject.toml, 3 config modules, default.yaml, 2 test files, package init)

Key validation gates:

  • ensembl_release >= 100 (rejects outdated releases)
  • Directory auto-creation on Path fields
  • Config hash for cache invalidation
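
The first two gates can be sketched without Pydantic. The example below uses a stdlib dataclass as a stand-in for a validated model, so it shows the checks themselves rather than the field validators in schema.py. `MiniConfig` and its field names are hypothetical.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class MiniConfig:
    """Stdlib stand-in for a validated config model (not the real PipelineConfig)."""

    ensembl_release: int
    data_dir: Path

    def __post_init__(self) -> None:
        # Gate 1: reject releases older than 100.
        if self.ensembl_release < 100:
            raise ValueError(
                f"ensembl_release must be >= 100, got {self.ensembl_release}"
            )
        # Gate 2: normalise to Path and create the directory if it is missing.
        self.data_dir = Path(self.data_dir)
        self.data_dir.mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    cfg = MiniConfig(ensembl_release=113, data_dir=Path(tmp) / "data")
    created = cfg.data_dir.is_dir()

    try:
        MiniConfig(ensembl_release=99, data_dir=Path(tmp) / "rejected")
        rejected = False
    except ValueError:
        rejected = True
    # The release check runs first, so a rejected config creates no directory.
    rejected_dir_exists = (Path(tmp) / "rejected").exists()
```

Running the value check before the directory step means an invalid config leaves no side effects on disk in this sketch.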

Task 2: Create base API client with retry logic and persistent caching

Status: Complete
Duration: ~1 minute
Commit: 4204116

Actions:

  1. Implemented CachedAPIClient base class
  2. Integrated requests_cache with SQLite backend
  3. Added tenacity retry decorator with exponential backoff
  4. Implemented rate limiting with cache-aware skip
  5. Added from_config classmethod for pipeline integration
  6. Wrote 5 comprehensive API client tests
  7. Verified all 5 tests pass

Files created: 3 files (base.py, api_clients init, test_api_client.py)

Key features:

  • Persistent SQLite cache with configurable TTL
  • Retry on 429/5xx/network errors (exponential backoff: 2-60s)
  • Rate limiting (default 5 req/sec) skipped for cached responses
  • Timeout configuration (default 30s)
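
The retry policy above combines a status-code check with a delay that grows per attempt and stays inside the 2-60s window. The sketch below shows that shape in plain Python. It is not the tenacity configuration used in base.py; the doubling base and the function names are assumptions, and only the 429/5xx rule and the 2-60s bounds come from this summary.

```python
def should_retry(status_code: int) -> bool:
    """Retry on HTTP 429 (rate limited) and on any 5xx server error."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_delay(attempt: int, minimum: float = 2.0, maximum: float = 60.0) -> float:
    """Delay in seconds before retry number `attempt` (1-based).

    Doubles on each attempt and is clamped to the [minimum, maximum] window.
    """
    return max(minimum, min(maximum, 2.0 ** attempt))


delays = [backoff_delay(n) for n in range(1, 8)]
# delays == [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The clamp at 60 seconds bounds the wait for any single retry, so total waiting time is limited by max_retries rather than by unbounded doubling.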

Success Criteria Verification

  • Python package installs with all bioinformatics dependencies
  • Config loads from YAML, validates with Pydantic, rejects invalid input
  • API client provides retry + caching foundation for all future API modules
  • All tests pass (10/10)

Must-Haves Verification

Truths:

  • YAML config loads and validates with Pydantic, returning typed PipelineConfig object
  • Invalid config (missing required fields, wrong types, bad values) raises ValidationError with clear messages
  • CachedAPIClient makes HTTP requests with automatic retry on 429/5xx and persistent SQLite caching
  • Pipeline is installable as a Python package with all dependencies declared in pyproject.toml

Artifacts:

  • pyproject.toml provides "Package definition with all dependencies" containing "mygene"
  • src/usher_pipeline/config/schema.py provides "Pydantic models for pipeline configuration" containing "class PipelineConfig"
  • src/usher_pipeline/config/loader.py provides "YAML config loading with validation" containing "def load_config"
  • src/usher_pipeline/api_clients/base.py provides "Base API client with retry and caching" containing "class CachedAPIClient"
  • config/default.yaml provides "Default pipeline configuration" containing "ensembl_release"

Key Links:

  • src/usher_pipeline/config/loader.py → src/usher_pipeline/config/schema.py via "imports PipelineConfig for validation" (pattern: from.*schema.*import.*PipelineConfig)
  • src/usher_pipeline/api_clients/base.py → requests_cache via "creates CachedSession for persistent caching" (pattern: requests_cache\.CachedSession)

Next Steps

Phase 01, Plan 02 depends on this plan's outputs:

  • Gene ID mapping will use PipelineConfig for data versioning
  • API clients for Ensembl/MyGene will inherit from CachedAPIClient
  • DuckDB persistence (Plan 03) will store config_hash for provenance

Self-Check: PASSED

Files verified:

FOUND: /Users/gbanyan/Project/usher-exploring/pyproject.toml
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/__init__.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/schema.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/loader.py
FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/api_clients/base.py
FOUND: /Users/gbanyan/Project/usher-exploring/config/default.yaml
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_config.py
FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_api_client.py

Commits verified:

FOUND: 4a80a03 (Task 1)
FOUND: 4204116 (Task 2)

All files and commits exist as documented.