99bc975a2c
docs(03-01): complete annotation completeness plan
2026-02-11 19:05:56 +08:00
942aaf2ec3
feat(03-04): add localization CLI command and comprehensive tests
- Add localization subcommand to evidence command group
- Implement checkpoint-restart pattern for HPA download
- Display summary with evidence type distribution
- Create 17 unit and integration tests (all pass)
- Test HPA parsing, evidence classification, scoring, and DuckDB persistence
- Fix evidence type terminology (computational vs predicted) for consistency
- Mock HTTP calls in integration tests for reproducibility
2026-02-11 19:05:22 +08:00
d70239c4ce
feat(03-01): add annotation DuckDB loader, CLI command, and tests
- Create load_to_duckdb with provenance tracking and tier distribution stats
- Add query_poorly_annotated helper to find under-studied genes
- Register `evidence annotation` CLI command with checkpoint-restart pattern
- Add comprehensive unit tests (9 tests) covering GO extraction, NULL handling, tier classification, score normalization, weighting
- Add integration tests (6 tests) for pipeline, idempotency, checkpoint-restart, provenance, queries
- All 15 tests pass with proper NULL preservation and schema validation
2026-02-11 19:03:10 +08:00
0e389c7e41
feat(03-05): implement animal model evidence fetch and transform
- models.py: AnimalModelRecord with ortholog confidence, phenotype flags, and normalized scoring
- fetch.py: Retrieve orthologs from HCOP, phenotypes from MGI/ZFIN/IMPC with retry
- transform.py: Filter sensory/cilia-relevant phenotypes, score with confidence weighting
- Ortholog confidence: HIGH (8+ sources), MEDIUM (4-7), LOW (1-3)
- Scoring: mouse +0.4, zebrafish +0.3, IMPC +0.3, weighted by confidence
- NULL preservation: no ortholog = NULL score (not zero)
2026-02-11 19:00:24 +08:00
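Note on 0e389c7e41: the scoring rules listed in this commit can be sketched roughly as below. This is a minimal illustration, not the code in transform.py. The function names are hypothetical, and the confidence multipliers (1.0, 0.7, 0.4) are assumptions, since the commit only says scores are "weighted by confidence".

```python
from typing import Optional

# Source weights and confidence bands are taken from the commit message.
SOURCE_WEIGHTS = {"mouse": 0.4, "zebrafish": 0.3, "impc": 0.3}
# ASSUMPTION: the commit does not give the multipliers; these are placeholders.
CONFIDENCE_MULTIPLIER = {"HIGH": 1.0, "MEDIUM": 0.7, "LOW": 0.4}


def ortholog_confidence(n_sources: int) -> Optional[str]:
    """Map the number of supporting ortholog sources to a confidence band."""
    if n_sources >= 8:
        return "HIGH"
    if n_sources >= 4:
        return "MEDIUM"
    if n_sources >= 1:
        return "LOW"
    return None  # no ortholog found


def animal_model_score(n_sources: int, phenotype_flags: dict) -> Optional[float]:
    """Sum the weights of sources with a relevant phenotype, scaled by confidence."""
    confidence = ortholog_confidence(n_sources)
    if confidence is None:
        return None  # NULL preservation: no ortholog is unknown, not zero
    raw = sum(w for src, w in SOURCE_WEIGHTS.items() if phenotype_flags.get(src))
    return round(raw * CONFIDENCE_MULTIPLIER[confidence], 4)
```

A gene with an ortholog but no relevant phenotype scores 0.0, while a gene with no ortholog at all stays None, which keeps "measured and absent" distinct from "not measurable".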
8aa66987f8
feat(03-06): implement literature evidence models, PubMed fetch, and scoring
- Create LiteratureRecord pydantic model with context-specific counts
- Implement PubMed query via Biopython Entrez with rate limiting (3/sec default, 10/sec with API key)
- Define SEARCH_CONTEXTS for cilia, sensory, cytoskeleton, cell_polarity queries
- Implement evidence tier classification: direct_experimental > functional_mention > hts_hit > incidental > none
- Implement quality-weighted scoring with bias mitigation via log2(total_pubmed_count) normalization
- Add biopython>=1.84 dependency to pyproject.toml
- Support checkpoint-restart for long-running PubMed queries (estimated 3-11 hours for 20K genes)
2026-02-11 19:00:20 +08:00
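Note on 8aa66987f8: a rough sketch of how tier weighting and log2 normalization could combine into one score. The tier names and the request rates come from the commit message. The numeric tier weights, the function names, and the exact formula (including the +1 inside the logarithm, added here so a gene with a single publication does not divide by zero) are assumptions for illustration only.

```python
import math
from typing import Optional

# Tier names and their order come from the commit message.
# ASSUMPTION: the numeric weights are placeholders, not the project's values.
TIER_WEIGHT = {
    "direct_experimental": 1.0,
    "functional_mention": 0.6,
    "hts_hit": 0.3,
    "incidental": 0.1,
    "none": 0.0,
}


def min_request_interval(has_api_key: bool) -> float:
    """Seconds between PubMed requests: 3/sec by default, 10/sec with an API key."""
    return 1.0 / (10 if has_api_key else 3)


def literature_score(
    context_hits: int, total_pubmed_count: Optional[int], tier: str
) -> Optional[float]:
    """Tier-weighted context hits, damped by log2 of the gene's total publications."""
    if total_pubmed_count is None:
        return None  # query failed or gene not searched: unknown, not zero
    if total_pubmed_count == 0:
        return 0.0
    # ASSUMPTION: +1 keeps the denominator positive when the count is 1.
    return TIER_WEIGHT[tier] * context_hits / math.log2(total_pubmed_count + 1)
```

Dividing by the logarithm of the total count means a heavily studied gene needs proportionally more context-specific hits to reach the same score as a sparsely studied one, which is the bias mitigation the commit describes.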
6645c59b0b
feat(03-04): create localization evidence data model and processing
- Define LocalizationRecord model with HPA and proteomics fields
- Implement fetch_hpa_subcellular to download HPA bulk data
- Implement fetch_cilia_proteomics with curated reference gene sets
- Implement classify_evidence_type (experimental vs computational)
- Implement score_localization with cilia proximity scoring
- Implement process_localization_evidence end-to-end pipeline
- Create load_to_duckdb for persistence with provenance
2026-02-11 19:00:09 +08:00
adbb74b965
feat(03-01): implement annotation evidence fetch and transform modules
- Create AnnotationRecord model with GO counts, UniProt scores, tier classification
- Implement fetch_go_annotations using mygene.info batch queries
- Implement fetch_uniprot_scores using UniProt REST API
- Add classify_annotation_tier with 3-tier system (well/partial/poor)
- Add normalize_annotation_score with weighted composite (GO 50%, UniProt 30%, Pathway 20%)
- Implement process_annotation_evidence end-to-end pipeline
- Follow NULL preservation pattern from gnomAD (unknown != zero)
- Use lazy polars evaluation where applicable
2026-02-11 18:58:45 +08:00
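Note on adbb74b965: the weighted composite (GO 50%, UniProt 30%, Pathway 20%) with NULL preservation might look like the sketch below. The function name and the handling of partially missing inputs are assumptions: here a missing component contributes nothing, and the result is None only when every component is missing. Inputs are assumed to be already normalized to [0, 1].

```python
from typing import Optional

# Weights are taken from the commit message.
WEIGHTS = {"go": 0.5, "uniprot": 0.3, "pathway": 0.2}


def composite_annotation_score(
    go: Optional[float], uniprot: Optional[float], pathway: Optional[float]
) -> Optional[float]:
    """Weighted sum of normalized components; None if nothing was measured."""
    components = {"go": go, "uniprot": uniprot, "pathway": pathway}
    if all(value is None for value in components.values()):
        return None  # unknown != zero
    # ASSUMPTION: a missing component adds 0 rather than rescaling the weights.
    return sum(WEIGHTS[k] * v for k, v in components.items() if v is not None)
```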
0d252da348
docs(03): create phase plan
2026-02-11 18:46:28 +08:00
3354cfe006
docs(phase-03): research core evidence layers domain
2026-02-11 18:37:14 +08:00
ffb4963d2b
docs(phase-02): complete phase execution
2026-02-11 18:28:13 +08:00
a0388cf4e1
docs(02-02): complete gnomAD evidence layer integration plan
- DuckDB persistence: gnomad_constraint table with CREATE OR REPLACE (idempotent)
- CLI evidence command: usher-pipeline evidence gnomad with checkpoint-restart
- Provenance tracking: records processing steps, saves sidecar JSON
- Query helpers: query_constrained_genes validates GCON-03 interpretation
- 12 integration tests: end-to-end pipeline, checkpoint, provenance, CLI
- Phase 2 complete: Evidence layer pattern established for future sources
- Duration: 4 min, 2 tasks, 5 files, 70 tests passing
Phase 2 (Prototype Evidence Layer) complete.
2026-02-11 18:23:32 +08:00
56e04e68c2
test(02-02): add comprehensive gnomAD integration tests
- 12 integration tests covering full pipeline: fetch->transform->load->query
- test_full_pipeline_to_duckdb: End-to-end pipeline verification with DuckDB storage
- test_checkpoint_restart_skips_processing: Checkpoint detection works correctly
- test_provenance_recorded: Provenance step records expected details
- test_provenance_sidecar_created: JSON sidecar file creation and structure
- test_query_constrained_genes_filters_correctly: Query returns only measured genes below threshold
- test_null_loeuf_not_in_constrained_results: NULL LOEUF genes excluded from queries
- test_duckdb_schema_has_quality_flag: Schema includes quality_flag with valid values
- test_normalized_scores_in_duckdb: Normalized scores in [0,1] for measured genes, NULL for others
- test_cli_evidence_gnomad_help: CLI help text displays correctly
- test_cli_evidence_gnomad_with_mock: CLI command runs end-to-end with mocked download
- test_idempotent_load_replaces_table: Loading twice replaces table (not appends)
- test_quality_flag_categorization: Quality flags correctly categorize genes
All 70 tests pass (58 existing + 12 new), no regressions
2026-02-11 18:20:59 +08:00
ee27f3ad2f
feat(02-02): add DuckDB loader and CLI evidence command for gnomAD
- load_to_duckdb: Saves constraint DataFrame to gnomad_constraint table with provenance tracking
- query_constrained_genes: Queries constrained genes by LOEUF threshold (validates GCON-03 interpretation)
- evidence_cmd.py: CLI command group with gnomad subcommand (fetch->transform->load orchestration)
- Checkpoint-restart: Skips processing if gnomad_constraint table exists (--force to override)
- Full CLI: usher-pipeline evidence gnomad [--force] [--url URL] [--min-depth N] [--min-cds-pct N]
2026-02-11 18:19:07 +08:00
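Note on ee27f3ad2f: the checkpoint-restart behaviour (skip when the table exists, reload with --force, replace rather than append) can be illustrated with a small self-contained sketch. To keep it runnable with only the standard library it uses sqlite3 as a stand-in for DuckDB, and DROP TABLE IF EXISTS plus CREATE TABLE in place of DuckDB's CREATE OR REPLACE TABLE. The function names and the two-column schema are hypothetical.

```python
import sqlite3
from typing import Callable, Iterable, Optional, Tuple

Row = Tuple[str, Optional[float]]


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Checkpoint test: has this evidence table already been written?"""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def run_evidence_step(
    conn: sqlite3.Connection,
    table: str,
    fetch_rows: Callable[[], Iterable[Row]],
    force: bool = False,
) -> str:
    """Return 'skipped' if a checkpoint exists, otherwise fetch and (re)load."""
    if table_exists(conn, table) and not force:
        return "skipped"  # the expensive fetch is never called
    rows = list(fetch_rows())
    with conn:  # one transaction: the old table is replaced, not appended to
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'CREATE TABLE "{table}" (gene_id TEXT PRIMARY KEY, loeuf REAL)')
        conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    return "loaded"
```

Because the load replaces the table, running the step twice with force=True leaves the same row count as running it once.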
c6198122ac
docs(02-01): complete gnomAD constraint data pipeline plan
2026-02-11 18:16:35 +08:00
174c4af02d
feat(02-01): add gnomAD transform pipeline and comprehensive tests
- Implement filter_by_coverage with quality_flag categorization (measured/incomplete_coverage/no_data)
- Add normalize_scores with LOEUF inversion (lower LOEUF = higher score)
- NULL preservation throughout pipeline (unknown != zero constraint)
- process_gnomad_constraint end-to-end pipeline function
- 15 comprehensive unit tests covering edge cases:
  - NULL handling and preservation
  - Coverage filtering without dropping genes
  - Normalization bounds and inversion
  - Mixed type handling for robust parsing
- Fix column mapping to handle gnomAD v4.x loeuf/loeuf_upper duplication
- All existing tests continue to pass
2026-02-11 18:14:41 +08:00
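Note on 174c4af02d: the LOEUF inversion with NULL preservation could be sketched as below. The commit states only that lower LOEUF maps to a higher score and that scores fall in [0, 1] for measured genes. Inverted min-max scaling, the record layout, the function name, and the 0.5 returned when all measured values are equal are assumptions.

```python
from typing import Optional


def normalize_loeuf(records: list[dict]) -> list[Optional[float]]:
    """Inverted min-max scaling over measured genes; everything else stays None."""
    measured = [
        r["loeuf"]
        for r in records
        if r["quality_flag"] == "measured" and r["loeuf"] is not None
    ]
    if not measured:
        return [None] * len(records)
    lo, hi = min(measured), max(measured)
    span = hi - lo

    scores: list[Optional[float]] = []
    for r in records:
        if r["quality_flag"] != "measured" or r["loeuf"] is None:
            scores.append(None)  # no gene is dropped; unknown != zero constraint
        elif span == 0:
            scores.append(0.5)  # ASSUMPTION: neutral value when all LOEUF are equal
        else:
            scores.append((hi - r["loeuf"]) / span)  # lower LOEUF = higher score
    return scores
```

The output has one entry per input record, so coverage filtering never shortens the gene list.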
a88b0eea60
feat(02-01): add gnomAD constraint data models and download module
- Create evidence layer package structure
- Define ConstraintRecord Pydantic model with NULL preservation
- Implement streaming download with httpx and tenacity retry
- Add lazy TSV parser with column name variant handling
- Add httpx and structlog dependencies
2026-02-11 18:11:49 +08:00
c7753e7b1c
docs(02): create phase plan
2026-02-11 17:47:23 +08:00
d328467737
docs(phase-02): research prototype evidence layer
2026-02-11 17:41:35 +08:00
34437fdf0a
docs(phase-01): complete phase execution
Phase 1 (Data Infrastructure) verified: 5/5 must-haves, 12/12 artifacts,
9/9 key links, 7/7 requirements satisfied. All 4 plans executed across
3 waves with 49 tests passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 16:50:30 +08:00
102dcdbe84
docs(01-04): complete CLI integration and end-to-end testing plan
- CLI entry point with setup and info commands
- Full infrastructure integration verified
- 6 integration tests with mocked APIs
- Phase 01 Data Infrastructure complete
2026-02-11 16:45:12 +08:00
e4d71d0790
test(01-04): add integration tests verifying module wiring
- test_config_to_store_roundtrip: config -> PipelineStore -> save/load
- test_config_to_provenance: config -> ProvenanceTracker -> sidecar
- test_full_setup_flow_mocked: full setup with mocked mygene (fetch, map, validate, save, provenance)
- test_checkpoint_skip_flow: verify checkpoint-restart skips re-fetch
- test_setup_cli_help: CLI help output verification
- test_info_cli: info command with config display
All tests pass with mocked API calls (no external dependencies).
2026-02-11 16:42:13 +08:00
f33b048635
feat(01-04): add CLI entry point with setup and info commands
- Create click-based CLI with command group (--config, --verbose options)
- Add 'info' command displaying pipeline version, config hash, data source versions
- Add 'setup' command orchestrating full infrastructure flow:
  - Load config -> create store/provenance
  - Fetch gene universe (with checkpoint-restart)
  - Map Ensembl IDs to HGNC + UniProt
  - Validate mapping quality gates
  - Save to DuckDB with provenance sidecar
- Update pyproject.toml entry point to usher_pipeline.cli.main:cli
- Add .gitignore for data/, *.duckdb, build artifacts, provenance files
2026-02-11 16:39:50 +08:00
e29d39d1dc
docs(01-02): complete gene ID mapping and validation plan
- Gene universe definition with mygene protein-coding gene retrieval
- Batch Ensembl->HGNC+UniProt mapping with edge case handling
- Validation gates with configurable success rate thresholds
- 15 comprehensive tests with mocked API responses
2026-02-11 16:35:57 +08:00
92322b1d7c
docs(01-03): complete DuckDB persistence and provenance tracking plan
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 16:34:00 +08:00
0200395d9e
feat(01-02): create mapping validation gates with tests
- Add MappingValidator with configurable success rate thresholds (min_success_rate, warn_threshold)
- Add validate_gene_universe for gene count, format, and duplicate checks
- Add save_unmapped_report for manual review output
- Implement 15 comprehensive tests with mocked mygene responses (no real API calls)
- Tests cover: successful mapping, notfound handling, uniprot list parsing, batching, validation gates, universe validation
2026-02-11 16:33:36 +08:00
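Note on 0200395d9e: a validation gate driven by min_success_rate and warn_threshold might behave like the sketch below. The two parameter names come from the commit message. The default values (0.90 and 0.95), the function name, and the result type are assumptions, not the MappingValidator API.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    success_rate: float
    passed: bool   # False blocks the pipeline
    warning: bool  # True means "passed, but review the unmapped report"


def validate_mapping(
    total: int,
    mapped: int,
    min_success_rate: float = 0.90,  # ASSUMPTION: placeholder default
    warn_threshold: float = 0.95,    # ASSUMPTION: placeholder default
) -> ValidationResult:
    """Fail below min_success_rate, warn between the two thresholds."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = mapped / total
    passed = rate >= min_success_rate
    warning = passed and rate < warn_threshold
    return ValidationResult(success_rate=rate, passed=passed, warning=warning)
```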
98a1a750dd
feat(01-03): create provenance tracker with comprehensive tests
- ProvenanceTracker class for metadata tracking
- Records pipeline version, data source versions, config hash, timestamps
- Sidecar JSON export alongside outputs
- DuckDB _provenance table support
- 13 comprehensive tests (8 DuckDB + 5 provenance)
- No test failures (12 passed, 1 skipped: pandas)
2026-02-11 16:31:51 +08:00
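Note on 98a1a750dd: the sidecar idea (a JSON file written next to each output, carrying the pipeline version, data source versions, config hash, and a timestamp) can be sketched as follows. The function names, the `.provenance.json` suffix, and the JSON field names are hypothetical. Hashing a key-sorted JSON dump with SHA-256 is one common way to get a stable config hash and is assumed here.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def config_hash(config: dict) -> str:
    """Stable hash: key order in the config dict must not change the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def write_sidecar(
    output_path: Path,
    config: dict,
    pipeline_version: str,
    data_source_versions: dict,
    steps: list,
) -> Path:
    """Write <output>.provenance.json alongside the output file."""
    sidecar = output_path.with_name(output_path.name + ".provenance.json")
    payload = {
        "pipeline_version": pipeline_version,
        "data_source_versions": data_source_versions,
        "config_hash": config_hash(config),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "steps": steps,
    }
    sidecar.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return sidecar
```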
d51141f7d5
feat(01-03): create DuckDB persistence layer with checkpoint-restart
- PipelineStore class for DuckDB-based storage
- save_dataframe/load_dataframe for polars and pandas
- Checkpoint system with has_checkpoint and metadata tracking
- Parquet export capability
- Context manager support
2026-02-11 16:30:25 +08:00
9ee3ec2e84
docs(01-01): complete project scaffold and config system plan
- Created comprehensive SUMMARY.md with all execution details
- Updated STATE.md: 1/4 plans in phase 1 complete, 16.7% overall progress
- Documented deviation (venv creation) and decisions
- Verified all files and commits exist (self-check passed)
2026-02-11 16:28:03 +08:00
4204116772
feat(01-01): create base API client with retry and caching
- CachedAPIClient with SQLite persistent cache
- Exponential backoff retry on 429/5xx/network errors (tenacity)
- Rate limiting with skip for cached responses
- from_config classmethod for pipeline integration
- 5 passing tests for cache creation, rate limiting, and config integration
2026-02-11 16:25:46 +08:00
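Note on 4204116772: the interplay of persistent cache, rate limiting, and exponential backoff can be shown with a standard-library sketch. It is not CachedAPIClient itself: the class name, constructor parameters, and TransientError are hypothetical, the retry loop is hand-rolled instead of using tenacity, and the network call is injected as a plain function so the example runs offline.

```python
import sqlite3
import time
from typing import Callable


class TransientError(Exception):
    """Stand-in for a retryable failure (HTTP 429/5xx or a network error)."""


class CachedFetcher:
    def __init__(
        self,
        fetch: Callable[[str], str],
        db_path: str = ":memory:",
        min_interval: float = 0.0,
        max_attempts: int = 4,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, body TEXT)"
        )

    def get(self, url: str) -> str:
        row = self._conn.execute(
            "SELECT body FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no rate-limit wait and no request
        if self._min_interval > 0:
            self._sleep(self._min_interval)  # simple fixed-interval rate limit
        for attempt in range(self._max_attempts):
            try:
                body = self._fetch(url)
                break
            except TransientError:
                if attempt == self._max_attempts - 1:
                    raise  # retries exhausted
                self._sleep(self._base_delay * 2 ** attempt)  # 0.5, 1.0, 2.0, ...
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO cache VALUES (?, ?)", (url, body)
            )
        return body
```

Checking the cache before the rate limiter is what makes reruns cheap: a fully cached run issues no requests and spends no time waiting.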
4a80a0398e
feat(01-01): create Python package scaffold with config system
- pyproject.toml: installable package with bioinformatics dependencies
- Pydantic config schema with validation (ensembl_release >= 100, directory creation)
- YAML config loader with override support
- Default config with Ensembl 113, gnomAD v4.1
- 5 passing tests for config validation and hashing
2026-02-11 16:24:35 +08:00
cab2f5fc66
docs(01-data-infrastructure): create phase plan
2026-02-11 16:04:42 +08:00
982f7f5a9b
docs(01-data-infrastructure): research phase domain
2026-02-11 15:56:40 +08:00
f80f384a61
docs: create roadmap (6 phases)
2026-02-11 15:47:36 +08:00
0fb1a9581f
docs: define v1 requirements
2026-02-11 15:31:05 +08:00
bb7bfaedab
docs: complete project research
2026-02-11 14:52:06 +08:00
c0abe8bc6c
chore: add project config
2026-02-11 14:41:35 +08:00
e2c202d689
docs: initialize project
2026-02-11 14:40:36 +08:00