Commit Graph

68 Commits

Author SHA1 Message Date
102dcdbe84 docs(01-04): complete CLI integration and end-to-end testing plan
- CLI entry point with setup and info commands
- Full infrastructure integration verified
- 6 integration tests with mocked APIs
- Phase 01 Data Infrastructure complete
2026-02-11 16:45:12 +08:00
e4d71d0790 test(01-04): add integration tests verifying module wiring
- test_config_to_store_roundtrip: config -> PipelineStore -> save/load
- test_config_to_provenance: config -> ProvenanceTracker -> sidecar
- test_full_setup_flow_mocked: full setup with mocked mygene (fetch, map, validate, save, provenance)
- test_checkpoint_skip_flow: verify checkpoint-restart skips re-fetch
- test_setup_cli_help: CLI help output verification
- test_info_cli: info command with config display

All tests pass with mocked API calls (no external dependencies).
2026-02-11 16:42:13 +08:00
f33b048635 feat(01-04): add CLI entry point with setup and info commands
- Create click-based CLI with command group (--config, --verbose options)
- Add 'info' command displaying pipeline version, config hash, data source versions
- Add 'setup' command orchestrating full infrastructure flow:
  - Load config -> create store/provenance
  - Fetch gene universe (with checkpoint-restart)
  - Map Ensembl IDs to HGNC + UniProt
  - Validate mapping quality gates
  - Save to DuckDB with provenance sidecar
- Update pyproject.toml entry point to usher_pipeline.cli.main:cli
- Add .gitignore for data/, *.duckdb, build artifacts, provenance files
2026-02-11 16:39:50 +08:00
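Note on f33b048635: the command group with `setup` and `info` subcommands and group-level `--config`/`--verbose` options can be sketched roughly as below. The commit uses click; this sketch substitutes the standard-library `argparse` so it runs without extra dependencies. The program name and the default config path are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "usher-pipeline" and "config.yaml" are assumed names, not taken from the repo.
    parser = argparse.ArgumentParser(prog="usher-pipeline")
    parser.add_argument("--config", default="config.yaml", help="path to the YAML config")
    parser.add_argument("--verbose", action="store_true", help="enable verbose logging")

    # One subparser per command, mirroring the 'setup' and 'info' commands.
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("info", help="show pipeline version, config hash, data source versions")
    sub.add_parser("setup", help="fetch gene universe, map IDs, validate, save")
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    # A real CLI would dispatch to the command implementation here.
    return args.command
```

As with a click group, the group-level options precede the subcommand: `main(["--config", "my.yaml", "info"])`.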
e29d39d1dc docs(01-02): complete gene ID mapping and validation plan
- Gene universe definition with mygene protein-coding gene retrieval
- Batch Ensembl->HGNC+UniProt mapping with edge case handling
- Validation gates with configurable success rate thresholds
- 15 comprehensive tests with mocked API responses
2026-02-11 16:35:57 +08:00
92322b1d7c docs(01-03): complete DuckDB persistence and provenance tracking plan
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 16:34:00 +08:00
0200395d9e feat(01-02): create mapping validation gates with tests
- Add MappingValidator with configurable success rate thresholds (min_success_rate, warn_threshold)
- Add validate_gene_universe for gene count, format, and duplicate checks
- Add save_unmapped_report for manual review output
- Implement 15 comprehensive tests with mocked mygene responses (no real API calls)
- Tests cover: successful mapping, notfound handling, uniprot list parsing, batching, validation gates, universe validation
2026-02-11 16:33:36 +08:00
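Note on 0200395d9e: a minimal sketch of a success-rate gate with the two thresholds named in the commit (`min_success_rate`, `warn_threshold`). The function name, the result type, and the default values 0.90 and 0.95 are assumptions for illustration, not the values used by `MappingValidator`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    success_rate: float
    passed: bool   # False means the gate fails and the pipeline should stop
    warning: bool  # True means the gate passed, but below the warning threshold


def check_mapping_gate(total: int, mapped: int,
                       min_success_rate: float = 0.90,
                       warn_threshold: float = 0.95) -> GateResult:
    # Default thresholds are illustrative assumptions.
    if total <= 0:
        raise ValueError("total must be positive")
    rate = mapped / total
    passed = rate >= min_success_rate
    warning = passed and rate < warn_threshold
    return GateResult(success_rate=rate, passed=passed, warning=warning)
```

Keeping the warning band separate from the hard failure lets a run continue while still flagging a mapping rate that deserves manual review.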
98a1a750dd feat(01-03): create provenance tracker with comprehensive tests
- ProvenanceTracker class for metadata tracking
- Records pipeline version, data source versions, config hash, timestamps
- Sidecar JSON export alongside outputs
- DuckDB _provenance table support
- 13 comprehensive tests (8 DuckDB + 5 provenance)
- All tests pass (12 passed, 1 skipped - pandas)
2026-02-11 16:31:51 +08:00
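Note on 98a1a750dd: a sidecar JSON export can be sketched as follows. The record fields follow the commit message (pipeline version, data source versions, config hash, timestamp); the function name, the key names, and the `.provenance.json` suffix are assumptions, not the `ProvenanceTracker` API.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_provenance_sidecar(output_path, pipeline_version, data_sources, config_hash):
    """Write a JSON sidecar next to output_path and return the sidecar path."""
    output_path = Path(output_path)
    # Assumed naming convention: <output file name>.provenance.json
    sidecar = output_path.with_name(output_path.name + ".provenance.json")
    record = {
        "pipeline_version": pipeline_version,
        "data_source_versions": data_sources,
        "config_hash": config_hash,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar
```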
d51141f7d5 feat(01-03): create DuckDB persistence layer with checkpoint-restart
- PipelineStore class for DuckDB-based storage
- save_dataframe/load_dataframe for polars and pandas
- Checkpoint system with has_checkpoint and metadata tracking
- Parquet export capability
- Context manager support
2026-02-11 16:30:25 +08:00
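Note on d51141f7d5: the checkpoint-restart idea (skip an expensive fetch when its result is already stored) can be sketched as below. The commit uses DuckDB; this sketch substitutes the standard-library `sqlite3` and stores rows as JSON text, so only `has_checkpoint` and the context-manager behaviour mirror the commit. All other names are hypothetical.

```python
import json
import sqlite3


class CheckpointStore:
    """Tiny stand-in for a persistent store with named checkpoints."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS _checkpoints (name TEXT PRIMARY KEY, payload TEXT)"
        )

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.conn.close()

    def has_checkpoint(self, name):
        row = self.conn.execute(
            "SELECT 1 FROM _checkpoints WHERE name = ?", (name,)
        ).fetchone()
        return row is not None

    def save(self, name, rows):
        self.conn.execute(
            "INSERT OR REPLACE INTO _checkpoints (name, payload) VALUES (?, ?)",
            (name, json.dumps(rows)),
        )
        self.conn.commit()

    def load(self, name):
        row = self.conn.execute(
            "SELECT payload FROM _checkpoints WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return json.loads(row[0])


def fetch_with_checkpoint(store, name, fetch):
    # Reuse the stored result when present; otherwise fetch once and persist it.
    if store.has_checkpoint(name):
        return store.load(name)
    rows = fetch()
    store.save(name, rows)
    return rows
```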
9ee3ec2e84 docs(01-01): complete project scaffold and config system plan
- Created comprehensive SUMMARY.md with all execution details
- Updated STATE.md: 1/4 plans in phase 1 complete, 16.7% overall progress
- Documented deviation (venv creation) and decisions
- Verified all files and commits exist (self-check passed)
2026-02-11 16:28:03 +08:00
4204116772 feat(01-01): create base API client with retry and caching
- CachedAPIClient with SQLite persistent cache
- Exponential backoff retry on 429/5xx/network errors (tenacity)
- Rate limiting with skip for cached responses
- from_config classmethod for pipeline integration
- 5 passing tests for cache creation, rate limiting, and config integration
2026-02-11 16:25:46 +08:00
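Note on 4204116772: exponential backoff on 429/5xx/network errors can be sketched as below. The commit relies on tenacity; this is a hand-rolled standard-library stand-in, and the exception class, function names, and default delays are assumptions.

```python
import time


class HTTPStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(exc):
    # Retry on 429, any 5xx status, and network-level failures.
    if isinstance(exc, HTTPStatusError):
        return exc.status == 429 or 500 <= exc.status < 600
    return isinstance(exc, (ConnectionError, TimeoutError))


def call_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func, doubling the wait after each retryable failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so tests can record delays instead of waiting.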
4a80a0398e feat(01-01): create Python package scaffold with config system
- pyproject.toml: installable package with bioinformatics dependencies
- Pydantic config schema with validation (ensembl_release >= 100, directory creation)
- YAML config loader with override support
- Default config with Ensembl 113, gnomAD v4.1
- 5 passing tests for config validation and hashing
2026-02-11 16:24:35 +08:00
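Note on 4a80a0398e: config validation and hashing can be sketched as below. The commit uses a Pydantic schema; this sketch substitutes a standard-library dataclass. The `ensembl_release >= 100` rule and the defaults (Ensembl 113, gnomAD v4.1) come from the commit message, while the class name, field names, and the choice of SHA-256 over canonical JSON are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical field names; defaults mirror the commit message.
    ensembl_release: int = 113
    gnomad_version: str = "v4.1"

    def __post_init__(self):
        if self.ensembl_release < 100:
            raise ValueError("ensembl_release must be >= 100")

    def config_hash(self) -> str:
        # Sorted keys and fixed separators keep the hash stable across runs.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Hashing a canonical serialization means two configs with identical values always produce the same hash, which is what makes the hash usable in provenance records.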
cab2f5fc66 docs(01-data-infrastructure): create phase plan
2026-02-11 16:04:42 +08:00
982f7f5a9b docs(01-data-infrastructure): research phase domain
2026-02-11 15:56:40 +08:00
f80f384a61 docs: create roadmap (6 phases)
2026-02-11 15:47:36 +08:00
0fb1a9581f docs: define v1 requirements
2026-02-11 15:31:05 +08:00
bb7bfaedab docs: complete project research
2026-02-11 14:52:06 +08:00
c0abe8bc6c chore: add project config
2026-02-11 14:41:35 +08:00
e2c202d689 docs: initialize project
2026-02-11 14:40:36 +08:00