From 9ee3ec2e84723374ad529e62baf7e64f4f9e3a53 Mon Sep 17 00:00:00 2001
From: gbanyan
Date: Wed, 11 Feb 2026 16:28:03 +0800
Subject: [PATCH] docs(01-01): complete project scaffold and config system plan

- Created comprehensive SUMMARY.md with all execution details
- Updated STATE.md: 1/4 plans in phase 1 complete, 16.7% overall progress
- Documented deviation (venv creation) and decisions
- Verified all files and commits exist (self-check passed)
---
 .planning/STATE.md                            |  30 +--
 .../01-data-infrastructure/01-01-SUMMARY.md   | 242 ++++++++++++++++++
 2 files changed, 255 insertions(+), 17 deletions(-)
 create mode 100644 .planning/phases/01-data-infrastructure/01-01-SUMMARY.md

diff --git a/.planning/STATE.md b/.planning/STATE.md
index e5955aa..ba069f0 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -10,30 +10,24 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 ## Current Position
 
 Phase: 1 of 6 (Data Infrastructure)
-Plan: 0 of TBD in current phase
-Status: Ready to plan
-Last activity: 2026-02-11 — Roadmap created with 6 phases covering all 40 v1 requirements
+Plan: 1 of 4 in current phase
+Status: Executing
+Last activity: 2026-02-11 — Completed 01-01-PLAN.md (Project scaffold, config system, base API client)
 
-Progress: [░░░░░░░░░░] 0%
+Progress: [██░░░░░░░░] 16.7% (1/6 phases planned, 1/4 plans in phase 1 complete)
 
 ## Performance Metrics
 
 **Velocity:**
-- Total plans completed: 0
-- Average duration: - min
-- Total execution time: 0.0 hours
+- Total plans completed: 1
+- Average duration: 3 min
+- Total execution time: 0.05 hours
 
 **By Phase:**
 
 | Phase | Plans | Total | Avg/Plan |
 |-------|-------|-------|----------|
-| - | - | - | - |
-
-**Recent Trend:**
-- Last 5 plans: None yet
-- Trend: No data
-
-*Updated after each plan completion*
+| 01 - Data Infrastructure | 1/4 | 3 min | 3 min/plan |
 
 ## Accumulated Context
 
@@ -46,6 +40,8 @@ Recent decisions affecting current work:
 - Weighted rule-based scoring over ML for explainability
 - Public data only for reproducibility
 - Modular CLI scripts for flexibility during development
+- Virtual environment required for dependency isolation (01-01: PEP 668 externally-managed Python)
+- Auto-creation of directories on config load (01-01: data_dir, cache_dir field validators)
 
 ### Pending Todos
 
@@ -57,6 +53,6 @@ None yet.
 
 ## Session Continuity
 
-Last session: 2026-02-11 - Roadmap creation
-Stopped at: Roadmap and STATE files initialized, ready to plan Phase 1
-Resume file: None
+Last session: 2026-02-11 - Plan execution
+Stopped at: Completed 01-01-PLAN.md
+Resume file: .planning/phases/01-data-infrastructure/01-01-SUMMARY.md
diff --git a/.planning/phases/01-data-infrastructure/01-01-SUMMARY.md b/.planning/phases/01-data-infrastructure/01-01-SUMMARY.md
new file mode 100644
index 0000000..cf4012c
--- /dev/null
+++ b/.planning/phases/01-data-infrastructure/01-01-SUMMARY.md
@@ -0,0 +1,242 @@
+---
+phase: 01-data-infrastructure
+plan: 01
+subsystem: foundation
+tags: [infrastructure, config, api-client, testing]
+dependency_graph:
+  requires: []
+  provides:
+    - Python package scaffold with src layout
+    - Pydantic v2 config system with YAML loading
+    - Base API client pattern with retry and caching
+  affects:
+    - All future plans (foundational dependencies)
+tech_stack:
+  added:
+    - Python 3.11+ with pyproject.toml packaging
+    - Pydantic v2 for config validation
+    - pydantic-yaml for YAML parsing
+    - requests-cache for persistent SQLite caching
+    - tenacity for retry with exponential backoff
+    - pytest for testing
+  patterns:
+    - Config-driven architecture with validation gates
+    - Reusable API client base class pattern
+    - Virtual environment isolation
+key_files:
+  created:
+    - pyproject.toml: Package definition with bioinformatics dependencies
+    - src/usher_pipeline/__init__.py: Package root
+    - src/usher_pipeline/config/schema.py: Pydantic models (PipelineConfig, DataSourceVersions, APIConfig, ScoringWeights)
+    - src/usher_pipeline/config/loader.py: YAML loading with override support
+    - src/usher_pipeline/api_clients/base.py: CachedAPIClient with retry and rate limiting
+    - config/default.yaml: Default pipeline configuration (Ensembl 113, gnomAD v4.1)
+    - tests/test_config.py: 5 config validation tests
+    - tests/test_api_client.py: 5 API client tests
+  modified: []
+decisions:
+  - decision: "Virtual environment required due to externally-managed Python"
+    rationale: "macOS system Python uses PEP 668 protection; venv isolates dependencies"
+    alternatives: ["--break-system-packages flag (rejected - risky)", "pipx (rejected - inappropriate for development)"]
+  - decision: "Auto-created .venv during Task 1 execution"
+    rationale: "Blocking issue (Rule 3) - pip install failed without venv"
+    impact: "Added venv creation step; documented in deviation log"
+metrics:
+  duration_minutes: 3
+  tasks_completed: 2
+  files_created: 11
+  tests_added: 10
+  commits: 2
+  completed_date: "2026-02-11"
+---
+
+# Phase 01 Plan 01: Project Scaffold, Config System, and Base API Client Summary
+
+**One-liner:** Installable Python package with Pydantic v2 config validation (YAML loading, directory creation, deterministic hashing) and reusable CachedAPIClient base class (SQLite persistence, retry with exponential backoff, rate limiting).
+
+## What Was Built
+
+Created the foundational Python package structure and two core infrastructure components that all subsequent plans depend on:
+
+1. **Python Package Scaffold**
+   - Modern pyproject.toml with PEP 621 packaging
+   - src/usher_pipeline layout for clean imports
+   - All bioinformatics dependencies: mygene, requests, requests-cache, tenacity, pydantic>=2.0, pydantic-yaml, duckdb, click, polars, pyarrow
+   - Dev dependencies: pytest, pytest-cov
+   - Virtual environment (.venv) for dependency isolation
+
+2. **Config System**
+   - Pydantic v2 models with validation:
+     - DataSourceVersions: Ensembl (>= 100), gnomAD, GTEx, HPA versions
+     - ScoringWeights: Per-layer weights (gnomad, expression, annotation, localization, animal_model, literature)
+     - APIConfig: Rate limiting, retries, cache TTL, timeout
+     - PipelineConfig: Aggregates all settings with Path validation
+   - Field validators: ensembl_release >= 100, auto-create directories
+   - Config hash method: SHA-256 for cache invalidation and provenance
+   - YAML loader with override support for CLI flags
+   - Default config: Ensembl 113, gnomAD v4.1, GTEx v8
+
+3. **Base API Client**
+   - CachedAPIClient class with:
+     - SQLite persistent cache via requests_cache
+     - Retry with exponential backoff (tenacity): 429/5xx/network errors
+     - Rate limiting: configurable req/sec with skip for cached responses
+     - Timeout and max_retries configuration
+   - from_config classmethod for pipeline integration
+   - cache_stats() and clear_cache() utilities
+
+## Tests
+
+**10 tests total (all passing):**
+
+### Config Tests (5)
+- `test_load_valid_config`: Loads default.yaml, validates PipelineConfig types
+- `test_invalid_config_missing_field`: Missing required field raises ValidationError
+- `test_invalid_ensembl_release`: ensembl_release < 100 raises ValidationError
+- `test_config_hash_deterministic`: Same config = same hash, different config = different hash
+- `test_config_creates_directories`: Non-existent data_dir/cache_dir created on load
+
+### API Client Tests (5)
+- `test_client_creates_cache_dir`: Instantiation creates cache directory
+- `test_client_caches_response`: Second request retrieves from cache
+- `test_client_from_config`: from_config applies PipelineConfig settings
+- `test_rate_limit_respected`: Non-cached requests trigger sleep (1/rate_limit)
+- `test_rate_limit_skipped_for_cached`: Cached requests skip rate limiting
+
+## Verification Results
+
+All plan verification steps passed:
+
+```bash
+# 1. Package installation
+$ pip install -e ".[dev]"
+Successfully installed usher-pipeline-0.1.0
+
+# 2. All tests pass
+$ pytest tests/ -v
+======================== 10 passed, 1 warning in 0.15s =========================
+
+# 3. Config hash generation
+$ python -c "from usher_pipeline.config.loader import load_config; c = load_config('config/default.yaml'); print(c.config_hash())"
+ddbb5195738ac3540f08ed0a46d5936cca070ec880fba3f65e7da48b81ca2b0f
+
+# 4. API client import
+$ python -c "from usher_pipeline.api_clients.base import CachedAPIClient"
+CachedAPIClient imported successfully
+```
+
+## Deviations from Plan
+
+### Auto-fixed Issues
+
+**1. [Rule 3 - Blocking] Created virtual environment for dependency isolation**
+- **Found during:** Task 1 (pip install -e ".[dev]")
+- **Issue:** macOS system Python is externally-managed (PEP 668), blocking direct pip installs
+- **Fix:** Created .venv with `python3 -m venv .venv`, upgraded pip/setuptools/wheel, installed package in isolated environment
+- **Files modified:** .venv/ (created)
+- **Commit:** Included in Task 1 commit (4a80a03)
+- **Rationale:** Blocking issue preventing task completion; venv is standard practice for Python development
+
+No other deviations. Plan executed as written after venv creation.
+
+## Task Execution Log
+
+### Task 1: Create Python package scaffold with config system
+**Status:** Complete
+**Duration:** ~2 minutes
+**Commit:** 4a80a03
+
+**Actions:**
+1. Created pyproject.toml with modern PEP 621 packaging
+2. Created src/usher_pipeline package structure
+3. Implemented Pydantic v2 config schema with validators
+4. Implemented YAML loader with override support
+5. Created default.yaml with sensible defaults
+6. Wrote 5 comprehensive config tests
+7. Fixed blocking venv issue (deviation Rule 3)
+8. Installed package with `pip install -e ".[dev]"`
+9. Verified all 5 tests pass
+
+**Files created:** 8 files (pyproject.toml, 3 config modules, default.yaml, 2 test files, package __init__)
+
+**Key validation gates:**
+- ensembl_release >= 100 (rejects outdated releases)
+- Directory auto-creation on Path fields
+- Config hash for cache invalidation
+
+### Task 2: Create base API client with retry logic and persistent caching
+**Status:** Complete
+**Duration:** ~1 minute
+**Commit:** 4204116
+
+**Actions:**
+1. Implemented CachedAPIClient base class
+2. Integrated requests_cache with SQLite backend
+3. Added tenacity retry decorator with exponential backoff
+4. Implemented rate limiting with cache-aware skip
+5. Added from_config classmethod for pipeline integration
+6. Wrote 5 comprehensive API client tests
+7. Verified all 5 tests pass
+
+**Files created:** 3 files (base.py, api_clients __init__, test_api_client.py)
+
+**Key features:**
+- Persistent SQLite cache with configurable TTL
+- Retry on 429/5xx/network errors (exponential backoff: 2-60s)
+- Rate limiting (default 5 req/sec) skipped for cached responses
+- Timeout configuration (default 30s)
+
+## Success Criteria Verification
+
+- [x] Python package installs with all bioinformatics dependencies
+- [x] Config loads from YAML, validates with Pydantic, rejects invalid input
+- [x] API client provides retry + caching foundation for all future API modules
+- [x] All tests pass (10/10)
+
+## Must-Haves Verification
+
+**Truths:**
+- [x] YAML config loads and validates with Pydantic, returning typed PipelineConfig object
+- [x] Invalid config (missing required fields, wrong types, bad values) raises ValidationError with clear messages
+- [x] CachedAPIClient makes HTTP requests with automatic retry on 429/5xx and persistent SQLite caching
+- [x] Pipeline is installable as Python package with all dependencies pinned
+
+**Artifacts:**
+- [x] pyproject.toml provides "Package definition with all dependencies" containing "mygene"
+- [x] src/usher_pipeline/config/schema.py provides "Pydantic models for pipeline configuration" containing "class PipelineConfig"
+- [x] src/usher_pipeline/config/loader.py provides "YAML config loading with validation" containing "def load_config"
+- [x] src/usher_pipeline/api_clients/base.py provides "Base API client with retry and caching" containing "class CachedAPIClient"
+- [x] config/default.yaml provides "Default pipeline configuration" containing "ensembl_release"
+
+**Key Links:**
+- [x] src/usher_pipeline/config/loader.py → src/usher_pipeline/config/schema.py via "imports PipelineConfig for validation" (pattern: `from.*schema.*import.*PipelineConfig`)
+- [x] src/usher_pipeline/api_clients/base.py → requests_cache via "creates CachedSession for persistent caching" (pattern: `requests_cache\.CachedSession`)
+
+## Next Steps
+
+Phase 01, Plan 02 depends on this plan's outputs:
+- Gene ID mapping will use PipelineConfig for data versioning
+- API clients for Ensembl/MyGene will inherit from CachedAPIClient
+- DuckDB persistence (Plan 03) will store config_hash for provenance
+
+## Self-Check: PASSED
+
+**Files verified:**
+```bash
+FOUND: /Users/gbanyan/Project/usher-exploring/pyproject.toml
+FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/__init__.py
+FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/schema.py
+FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/config/loader.py
+FOUND: /Users/gbanyan/Project/usher-exploring/src/usher_pipeline/api_clients/base.py
+FOUND: /Users/gbanyan/Project/usher-exploring/config/default.yaml
+FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_config.py
+FOUND: /Users/gbanyan/Project/usher-exploring/tests/test_api_client.py
+```
+
+**Commits verified:**
+```bash
+FOUND: 4a80a03 (Task 1)
+FOUND: 4204116 (Task 2)
+```
+
+All files and commits exist as documented.
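
Reviewer note: the summary above describes two config behaviours, the `ensembl_release >= 100` validation gate and a deterministic SHA-256 config hash. The following is a minimal standard-library sketch of those two ideas, not the code in `schema.py`, which per the summary uses Pydantic v2 field validators. `SketchConfig` and its two fields are hypothetical, and hashing canonical JSON with sorted keys is an assumption; the summary only states that the hash is SHA-256.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for PipelineConfig; field names are illustrative."""

    ensembl_release: int
    gnomad_version: str

    def __post_init__(self) -> None:
        # Validation gate from the summary: reject Ensembl releases below 100.
        if self.ensembl_release < 100:
            raise ValueError("ensembl_release must be >= 100")

    def config_hash(self) -> str:
        # Assumed scheme: serialize with sorted keys so that identical
        # settings always produce identical bytes, then hash with SHA-256.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


same_a = SketchConfig(ensembl_release=113, gnomad_version="v4.1")
same_b = SketchConfig(ensembl_release=113, gnomad_version="v4.1")
different = SketchConfig(ensembl_release=112, gnomad_version="v4.1")
```

Under these assumptions, `same_a.config_hash()` equals `same_b.config_hash()` and differs from `different.config_hash()`, which is the property `test_config_hash_deterministic` is described as checking. Constructing `SketchConfig(ensembl_release=99, ...)` raises `ValueError`, where the real models are described as raising Pydantic's `ValidationError`.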