6645c59b0b
feat(03-04): create localization evidence data model and processing
...
- Define LocalizationRecord model with HPA and proteomics fields
- Implement fetch_hpa_subcellular to download HPA bulk data
- Implement fetch_cilia_proteomics with curated reference gene sets
- Implement classify_evidence_type (experimental vs computational)
- Implement score_localization with cilia proximity scoring
- Implement process_localization_evidence end-to-end pipeline
- Create load_to_duckdb for persistence with provenance
2026-02-11 19:00:09 +08:00
adbb74b965
feat(03-01): implement annotation evidence fetch and transform modules
...
- Create AnnotationRecord model with GO counts, UniProt scores, tier classification
- Implement fetch_go_annotations using mygene.info batch queries
- Implement fetch_uniprot_scores using UniProt REST API
- Add classify_annotation_tier with 3-tier system (well/partial/poor)
- Add normalize_annotation_score with weighted composite (GO 50%, UniProt 30%, Pathway 20%)
- Implement process_annotation_evidence end-to-end pipeline
- Follow NULL preservation pattern from gnomAD (unknown != zero)
- Use lazy polars evaluation where applicable
2026-02-11 18:58:45 +08:00
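Note on adbb74b965: the weighted composite named above (GO 50%, UniProt 30%, Pathway 20%) can be sketched as below. This is a minimal illustration, not the committed code. The function name `composite_annotation_score`, the assumption that each component is already scaled to [0, 1], and the rule that any missing component yields a missing composite are all assumptions; only the weights and the "unknown != zero" principle come from the commit message.

```python
from typing import Optional

# Weights taken from the commit message; everything else is hypothetical.
WEIGHTS = {"go": 0.5, "uniprot": 0.3, "pathway": 0.2}


def composite_annotation_score(
    go: Optional[float],
    uniprot: Optional[float],
    pathway: Optional[float],
) -> Optional[float]:
    """Weighted sum of components assumed to be pre-normalized to [0, 1].

    Assumption: if any component is unknown (None), the composite is also
    unknown, so a missing value is never silently treated as zero.
    """
    components = {"go": go, "uniprot": uniprot, "pathway": pathway}
    if any(value is None for value in components.values()):
        return None
    return sum(WEIGHTS[name] * value for name, value in components.items())
```

Another plausible design would renormalize the weights over the components that are present; the commit message does not say which approach the pipeline takes.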
ee27f3ad2f
feat(02-02): add DuckDB loader and CLI evidence command for gnomAD
...
- load_to_duckdb: Saves constraint DataFrame to gnomad_constraint table with provenance tracking
- query_constrained_genes: Queries constrained genes by LOEUF threshold (validates GCON-03 interpretation)
- evidence_cmd.py: CLI command group with gnomad subcommand (fetch->transform->load orchestration)
- Checkpoint-restart: Skips processing if gnomad_constraint table exists (--force to override)
- Full CLI: usher-pipeline evidence gnomad [--force] [--url URL] [--min-depth N] [--min-cds-pct N]
2026-02-11 18:19:07 +08:00
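Note on ee27f3ad2f: the checkpoint-restart behaviour (skip when the table already exists, rerun with --force) can be sketched as below. To keep the example dependency-free it uses the standard-library `sqlite3` module as a stand-in for DuckDB, so the catalog query differs from what the pipeline would run. `has_checkpoint` is named in an earlier commit, but its signature here, along with `run_step` and the `build` callback, is hypothetical.

```python
import sqlite3
from typing import Callable


def has_checkpoint(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if a table with this name already exists (SQLite catalog)."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def run_step(
    conn: sqlite3.Connection,
    table: str,
    build: Callable[[sqlite3.Connection], None],
    force: bool = False,
) -> str:
    """Run `build` unless its output table exists; `force` overrides the skip."""
    if has_checkpoint(conn, table) and not force:
        return "skipped"
    # Table name comes from pipeline code, not user input, in this sketch.
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    build(conn)
    return "loaded"
```

Usage: calling `run_step(conn, "gnomad_constraint", build)` twice performs the load once and skips the second time; passing `force=True` rebuilds the table.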
174c4af02d
feat(02-01): add gnomAD transform pipeline and comprehensive tests
...
- Implement filter_by_coverage with quality_flag categorization (measured/incomplete_coverage/no_data)
- Add normalize_scores with LOEUF inversion (lower LOEUF = higher score)
- NULL preservation throughout pipeline (unknown != zero constraint)
- process_gnomad_constraint end-to-end pipeline function
- 15 comprehensive unit tests covering edge cases:
  - NULL handling and preservation
  - Coverage filtering without dropping genes
  - Normalization bounds and inversion
  - Mixed type handling for robust parsing
- Fix column mapping to handle gnomAD v4.x loeuf/loeuf_upper duplication
- All existing tests continue to pass
2026-02-11 18:14:41 +08:00
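Note on 174c4af02d: the coverage flags and LOEUF inversion described above can be sketched roughly as follows. This is a pure-Python illustration rather than the committed polars code. The flag names and the "lower LOEUF = higher score" direction come from the commit message; the min-max formula, the function signatures, and the threshold defaults are assumptions.

```python
from typing import List, Optional, Sequence


def quality_flag(
    mean_depth: Optional[float],
    cds_pct: Optional[float],
    min_depth: float = 30.0,   # hypothetical default
    min_cds_pct: float = 0.9,  # hypothetical default
) -> str:
    """Label a gene's coverage quality instead of dropping the gene."""
    if mean_depth is None or cds_pct is None:
        return "no_data"
    if mean_depth < min_depth or cds_pct < min_cds_pct:
        return "incomplete_coverage"
    return "measured"


def normalize_loeuf(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Inverted min-max scaling to [0, 1]; None stays None (unknown != zero)."""
    known = [v for v in values if v is not None]
    if not known:
        return [None] * len(values)
    lo, hi = min(known), max(known)
    if hi == lo:
        # Degenerate case: all known values are equal. Returning 0.5 is an
        # arbitrary choice made for this sketch.
        return [None if v is None else 0.5 for v in values]
    # Lowest LOEUF maps to 1.0, highest LOEUF maps to 0.0.
    return [None if v is None else (hi - v) / (hi - lo) for v in values]
```

Keeping flagged genes in the output, rather than filtering them out, matches the "Coverage filtering without dropping genes" test case listed in the commit.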
a88b0eea60
feat(02-01): add gnomAD constraint data models and download module
...
- Create evidence layer package structure
- Define ConstraintRecord Pydantic model with NULL preservation
- Implement streaming download with httpx and tenacity retry
- Add lazy TSV parser with column name variant handling
- Add httpx and structlog dependencies
2026-02-11 18:11:49 +08:00
f33b048635
feat(01-04): add CLI entry point with setup and info commands
...
- Create click-based CLI with command group (--config, --verbose options)
- Add 'info' command displaying pipeline version, config hash, data source versions
- Add 'setup' command orchestrating full infrastructure flow:
  - Load config -> create store/provenance
  - Fetch gene universe (with checkpoint-restart)
  - Map Ensembl IDs to HGNC + UniProt
  - Validate mapping quality gates
  - Save to DuckDB with provenance sidecar
- Update pyproject.toml entry point to usher_pipeline.cli.main:cli
- Add .gitignore for data/, *.duckdb, build artifacts, provenance files
2026-02-11 16:39:50 +08:00
0200395d9e
feat(01-02): create mapping validation gates with tests
...
- Add MappingValidator with configurable success rate thresholds (min_success_rate, warn_threshold)
- Add validate_gene_universe for gene count, format, and duplicate checks
- Add save_unmapped_report for manual review output
- Implement 15 comprehensive tests with mocked mygene responses (no real API calls)
- Tests cover: successful mapping, notfound handling, uniprot list parsing, batching, validation gates, universe validation
2026-02-11 16:33:36 +08:00
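Note on 0200395d9e: a validation gate driven by `min_success_rate` and `warn_threshold` might look like the sketch below. The two parameter names come from the commit message; the default values, the `GateResult` type, the `check_mapping_rate` function, and the three-way pass/warn/fail outcome are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    success_rate: float
    status: str  # one of "pass", "warn", "fail"


def check_mapping_rate(
    mapped: int,
    total: int,
    min_success_rate: float = 0.90,  # hypothetical default
    warn_threshold: float = 0.95,    # hypothetical default
) -> GateResult:
    """Fail below min_success_rate, warn below warn_threshold, else pass."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = mapped / total
    if rate < min_success_rate:
        status = "fail"
    elif rate < warn_threshold:
        status = "warn"
    else:
        status = "pass"
    return GateResult(success_rate=rate, status=status)
```

Usage: `check_mapping_rate(92, 100)` returns a result with status "warn" under these defaults, which a caller could log before continuing, while a "fail" would stop the setup command.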
98a1a750dd
feat(01-03): create provenance tracker with comprehensive tests
...
- ProvenanceTracker class for metadata tracking
- Records pipeline version, data source versions, config hash, timestamps
- Sidecar JSON export alongside outputs
- DuckDB _provenance table support
- 13 comprehensive tests (8 DuckDB + 5 provenance)
- All tests pass (12 passed, 1 skipped - pandas)
2026-02-11 16:31:51 +08:00
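Note on 98a1a750dd: the provenance record and sidecar export listed above could be sketched as follows. The recorded fields (pipeline version, data source versions, config hash, timestamp) come from the commit message. The SHA-256-over-canonical-JSON hashing scheme, the `.provenance.json` file suffix, and all function names are assumptions, not the ProvenanceTracker API.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def config_hash(config: dict) -> str:
    """Hash a config dict so that key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_provenance_record(
    config: dict, pipeline_version: str, source_versions: dict
) -> dict:
    """Collect the metadata fields named in the commit message."""
    return {
        "pipeline_version": pipeline_version,
        "data_source_versions": source_versions,
        "config_hash": config_hash(config),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def write_sidecar(output_path: Path, record: dict) -> Path:
    """Write the record next to the output file (suffix is hypothetical)."""
    sidecar = output_path.parent / (output_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Serializing with sorted keys means two configs with the same content but different key order hash identically, which is what makes the hash usable for detecting real config changes between runs.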
d51141f7d5
feat(01-03): create DuckDB persistence layer with checkpoint-restart
...
- PipelineStore class for DuckDB-based storage
- save_dataframe/load_dataframe for polars and pandas
- Checkpoint system with has_checkpoint and metadata tracking
- Parquet export capability
- Context manager support
2026-02-11 16:30:25 +08:00
4204116772
feat(01-01): create base API client with retry and caching
...
- CachedAPIClient with SQLite persistent cache
- Exponential backoff retry on 429/5xx/network errors (tenacity)
- Rate limiting with skip for cached responses
- from_config classmethod for pipeline integration
- 5 passing tests for cache creation, rate limiting, and config integration
2026-02-11 16:25:46 +08:00
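Note on 4204116772: the retry policy described above (exponential backoff on 429, 5xx, and network errors) can be sketched in plain Python as below. The commit uses tenacity; this hand-rolled loop is a stand-in that shows the same idea without the dependency. The function name `call_with_backoff`, the `(status, body)` return shape, and the attempt and delay defaults are all hypothetical.

```python
import time
from typing import Callable, Optional, Tuple

# Status codes treated as transient: 429 plus the whole 5xx range.
RETRYABLE_STATUS = {429} | set(range(500, 600))


def call_with_backoff(
    request: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `request` until it returns a non-retryable status.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(max_attempts):
        status: Optional[int]
        try:
            status, body = request()
        except ConnectionError:
            # Network failure: treat like a retryable response.
            status, body = None, ""
        if status is not None and status not in RETRYABLE_STATUS:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"request failed after {max_attempts} attempts")
```

The injectable `sleep` parameter lets tests record the delays instead of waiting. A cached client would typically check its cache before entering this loop, so cache hits skip both the retry logic and the rate limit, as the commit notes.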
4a80a0398e
feat(01-01): create Python package scaffold with config system
...
- pyproject.toml: installable package with bioinformatics dependencies
- Pydantic config schema with validation (ensembl_release >= 100, directory creation)
- YAML config loader with override support
- Default config with Ensembl 113, gnomAD v4.1
- 5 passing tests for config validation and hashing
2026-02-11 16:24:35 +08:00