usher-exploring

Author	SHA1	Message	Date
gbanyan	e72c516669	docs(03-06): complete literature evidence layer - Created SUMMARY.md with full implementation details - Updated STATE.md: progress 60%, 12/20 plans complete, Phase 3 complete - Documented 4 key decisions (tier priority, bias mitigation, context weights, rate limiting) - All verification criteria met: 17/17 tests pass, CLI functional, bias mitigation validated - Self-check PASSED: all files and commits verified. Key accomplishments: - PubMed evidence layer queries per gene across cilia/sensory/cytoskeleton/polarity contexts - Quality tier classification: direct_experimental > hts_hit > functional_mention > incidental - Bias mitigation via log2(total_pubmed_count) prevents well-studied gene dominance - Novel genes with 10 total/5 cilia publications score higher than TP53-like genes with 100K total/5 cilia - Biopython Entrez integration with rate limiting (3/sec default, 10/sec with API key)	2026-02-11 19:13:26 +08:00
gbanyan	0e89bf0dd6	docs(03-02): complete expression evidence layer plan - Create 03-02-SUMMARY.md with performance metrics, decisions, and deviations - Update STATE.md: 5 of 6 plans complete in Phase 03 (03-06 remaining) - Update progress: 55% complete (11/20 plans across all phases) - Add key decisions: Tau calculation, expression scoring, CellxGene optional - Record duration: 12 min for 2 tasks (9 files modified) - Self-check passed: all files and commits verified. Expression layer provides: - HPA/GTEx tissue expression with Tau specificity index - Usher-tissue enrichment scoring (retina, inner ear, cilia) - Optional CellxGene single-cell integration - CLI command with checkpoint-restart - 11 passing unit and integration tests	2026-02-11 19:12:18 +08:00
gbanyan	cfe4b830e6	docs(03-03): complete protein features plan with SUMMARY and STATE updates	2026-02-11 19:10:03 +08:00
gbanyan	053f0d926b	docs(03-05): complete animal model phenotype evidence layer plan - SUMMARY.md: Ortholog-mapped animal evidence from MGI/ZFIN/IMPC - Confidence-weighted scoring (mouse +0.4, zebrafish +0.3, IMPC +0.3) - 14/14 tests passing: ortholog confidence, keyword filtering, NULL preservation - Deviations: Schema mismatches, NULL handling, polars deprecations auto-fixed - Duration: 10 minutes, 2 tasks, 8 files, 2 commits	2026-02-11 19:08:45 +08:00
gbanyan	d8009f1236	docs(03-04): complete subcellular localization evidence layer - Created SUMMARY.md with full implementation details - Updated STATE.md: progress 40%, 8/20 plans complete - Documented 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting) - All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete	2026-02-11 19:08:01 +08:00
gbanyan	46059874f2	feat(03-03): implement protein evidence layer with UniProt/InterPro integration - Create protein features data model with domain, coiled-coil, TM, cilia motifs - Implement fetch.py with UniProt REST API and InterPro API queries - Implement transform.py with feature extraction, motif detection, normalization - Implement load.py with DuckDB persistence and provenance tracking - Add CLI protein command following evidence layer pattern - Add comprehensive unit and integration tests (all passing) - Handle NULL preservation and List(Null) edge case - Add get_steps() method to ProvenanceTracker for test compatibility	2026-02-11 19:07:30 +08:00
gbanyan	bcd3c4ffbe	feat(03-05): add animal model DuckDB loader, CLI, and comprehensive tests - load.py: DuckDB persistence with provenance tracking, ortholog confidence distribution stats - CLI animal-models command: checkpoint-restart pattern, top scoring genes display - 10 unit tests: ortholog confidence scoring, keyword filtering, multi-organism bonus, NULL preservation - 4 integration tests: full pipeline, checkpoint-restart, provenance tracking, empty phenotype handling - All tests pass (14/14): validates fetch->transform->load->CLI flow - Fixed polars deprecations: str.join replaces str.concat, pl.len replaces pl.count	2026-02-11 19:06:49 +08:00
gbanyan	99bc975a2c	docs(03-01): complete annotation completeness plan	2026-02-11 19:05:56 +08:00
gbanyan	942aaf2ec3	feat(03-04): add localization CLI command and comprehensive tests - Add localization subcommand to evidence command group - Implement checkpoint-restart pattern for HPA download - Display summary with evidence type distribution - Create 17 unit and integration tests (all pass) - Test HPA parsing, evidence classification, scoring, and DuckDB persistence - Fix evidence type terminology (computational vs predicted) for consistency - Mock HTTP calls in integration tests for reproducibility	2026-02-11 19:05:22 +08:00
gbanyan	d70239c4ce	feat(03-01): add annotation DuckDB loader, CLI command, and tests - Create load_to_duckdb with provenance tracking and tier distribution stats - Add query_poorly_annotated helper to find under-studied genes - Register `evidence annotation` CLI command with checkpoint-restart pattern - Add comprehensive unit tests (9 tests) covering GO extraction, NULL handling, tier classification, score normalization, weighting - Add integration tests (6 tests) for pipeline, idempotency, checkpoint-restart, provenance, queries - All 15 tests pass with proper NULL preservation and schema validation	2026-02-11 19:03:10 +08:00
gbanyan	0e389c7e41	feat(03-05): implement animal model evidence fetch and transform - models.py: AnimalModelRecord with ortholog confidence, phenotype flags, and normalized scoring - fetch.py: Retrieve orthologs from HCOP, phenotypes from MGI/ZFIN/IMPC with retry - transform.py: Filter sensory/cilia-relevant phenotypes, score with confidence weighting - Ortholog confidence: HIGH (8+ sources), MEDIUM (4-7), LOW (1-3) - Scoring: mouse +0.4, zebrafish +0.3, IMPC +0.3, weighted by confidence - NULL preservation: no ortholog = NULL score (not zero)	2026-02-11 19:00:24 +08:00
gbanyan	8aa66987f8	feat(03-06): implement literature evidence models, PubMed fetch, and scoring - Create LiteratureRecord pydantic model with context-specific counts - Implement PubMed query via Biopython Entrez with rate limiting (3/sec default, 10/sec with API key) - Define SEARCH_CONTEXTS for cilia, sensory, cytoskeleton, cell_polarity queries - Implement evidence tier classification: direct_experimental > functional_mention > hts_hit > incidental > none - Implement quality-weighted scoring with bias mitigation via log2(total_pubmed_count) normalization - Add biopython>=1.84 dependency to pyproject.toml - Support checkpoint-restart for long-running PubMed queries (estimated 3-11 hours for 20K genes)	2026-02-11 19:00:20 +08:00
gbanyan	6645c59b0b	feat(03-04): create localization evidence data model and processing - Define LocalizationRecord model with HPA and proteomics fields - Implement fetch_hpa_subcellular to download HPA bulk data - Implement fetch_cilia_proteomics with curated reference gene sets - Implement classify_evidence_type (experimental vs computational) - Implement score_localization with cilia proximity scoring - Implement process_localization_evidence end-to-end pipeline - Create load_to_duckdb for persistence with provenance	2026-02-11 19:00:09 +08:00
gbanyan	adbb74b965	feat(03-01): implement annotation evidence fetch and transform modules - Create AnnotationRecord model with GO counts, UniProt scores, tier classification - Implement fetch_go_annotations using mygene.info batch queries - Implement fetch_uniprot_scores using UniProt REST API - Add classify_annotation_tier with 3-tier system (well/partial/poor) - Add normalize_annotation_score with weighted composite (GO 50%, UniProt 30%, Pathway 20%) - Implement process_annotation_evidence end-to-end pipeline - Follow NULL preservation pattern from gnomAD (unknown != zero) - Use lazy polars evaluation where applicable	2026-02-11 18:58:45 +08:00
gbanyan	0d252da348	docs(03): create phase plan	2026-02-11 18:46:28 +08:00
gbanyan	3354cfe006	docs(phase-03): research core evidence layers domain	2026-02-11 18:37:14 +08:00
gbanyan	ffb4963d2b	docs(phase-02): complete phase execution	2026-02-11 18:28:13 +08:00
gbanyan	a0388cf4e1	docs(02-02): complete gnomAD evidence layer integration plan - DuckDB persistence: gnomad_constraint table with CREATE OR REPLACE (idempotent) - CLI evidence command: usher-pipeline evidence gnomad with checkpoint-restart - Provenance tracking: records processing steps, saves sidecar JSON - Query helpers: query_constrained_genes validates GCON-03 interpretation - 12 integration tests: end-to-end pipeline, checkpoint, provenance, CLI - Phase 2 complete: Evidence layer pattern established for future sources - Duration: 4 min, 2 tasks, 5 files, 70 tests passing. Phase 2 (Prototype Evidence Layer) complete.	2026-02-11 18:23:32 +08:00
gbanyan	56e04e68c2	test(02-02): add comprehensive gnomAD integration tests - 12 integration tests covering full pipeline: fetch->transform->load->query - test_full_pipeline_to_duckdb: End-to-end pipeline verification with DuckDB storage - test_checkpoint_restart_skips_processing: Checkpoint detection works correctly - test_provenance_recorded: Provenance step records expected details - test_provenance_sidecar_created: JSON sidecar file creation and structure - test_query_constrained_genes_filters_correctly: Query returns only measured genes below threshold - test_null_loeuf_not_in_constrained_results: NULL LOEUF genes excluded from queries - test_duckdb_schema_has_quality_flag: Schema includes quality_flag with valid values - test_normalized_scores_in_duckdb: Normalized scores in [0,1] for measured genes, NULL for others - test_cli_evidence_gnomad_help: CLI help text displays correctly - test_cli_evidence_gnomad_with_mock: CLI command runs end-to-end with mocked download - test_idempotent_load_replaces_table: Loading twice replaces table (not appends) - test_quality_flag_categorization: Quality flags correctly categorize genes. All 70 tests pass (58 existing + 12 new), no regressions	2026-02-11 18:20:59 +08:00
gbanyan	ee27f3ad2f	feat(02-02): add DuckDB loader and CLI evidence command for gnomAD - load_to_duckdb: Saves constraint DataFrame to gnomad_constraint table with provenance tracking - query_constrained_genes: Queries constrained genes by LOEUF threshold (validates GCON-03 interpretation) - evidence_cmd.py: CLI command group with gnomad subcommand (fetch->transform->load orchestration) - Checkpoint-restart: Skips processing if gnomad_constraint table exists (--force to override) - Full CLI: usher-pipeline evidence gnomad [--force] [--url URL] [--min-depth N] [--min-cds-pct N]	2026-02-11 18:19:07 +08:00
gbanyan	c6198122ac	docs(02-01): complete gnomAD constraint data pipeline plan	2026-02-11 18:16:35 +08:00
gbanyan	174c4af02d	feat(02-01): add gnomAD transform pipeline and comprehensive tests - Implement filter_by_coverage with quality_flag categorization (measured/incomplete_coverage/no_data) - Add normalize_scores with LOEUF inversion (lower LOEUF = higher score) - NULL preservation throughout pipeline (unknown != zero constraint) - process_gnomad_constraint end-to-end pipeline function - 15 comprehensive unit tests covering edge cases: - NULL handling and preservation - Coverage filtering without dropping genes - Normalization bounds and inversion - Mixed type handling for robust parsing - Fix column mapping to handle gnomAD v4.x loeuf/loeuf_upper duplication - All existing tests continue to pass	2026-02-11 18:14:41 +08:00
gbanyan	a88b0eea60	feat(02-01): add gnomAD constraint data models and download module - Create evidence layer package structure - Define ConstraintRecord Pydantic model with NULL preservation - Implement streaming download with httpx and tenacity retry - Add lazy TSV parser with column name variant handling - Add httpx and structlog dependencies	2026-02-11 18:11:49 +08:00
gbanyan	c7753e7b1c	docs(02): create phase plan	2026-02-11 17:47:23 +08:00
gbanyan	d328467737	docs(phase-02): research prototype evidence layer	2026-02-11 17:41:35 +08:00
gbanyan	34437fdf0a	docs(phase-01): complete phase execution. Phase 1 (Data Infrastructure) verified: 5/5 must-haves, 12/12 artifacts, 9/9 key links, 7/7 requirements satisfied. All 4 plans executed across 3 waves with 49 tests passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 16:50:30 +08:00
gbanyan	102dcdbe84	docs(01-04): complete CLI integration and end-to-end testing plan - CLI entry point with setup and info commands - Full infrastructure integration verified - 6 integration tests with mocked APIs - Phase 01 Data Infrastructure complete	2026-02-11 16:45:12 +08:00
gbanyan	e4d71d0790	test(01-04): add integration tests verifying module wiring - test_config_to_store_roundtrip: config -> PipelineStore -> save/load - test_config_to_provenance: config -> ProvenanceTracker -> sidecar - test_full_setup_flow_mocked: full setup with mocked mygene (fetch, map, validate, save, provenance) - test_checkpoint_skip_flow: verify checkpoint-restart skips re-fetch - test_setup_cli_help: CLI help output verification - test_info_cli: info command with config display. All tests pass with mocked API calls (no external dependencies).	2026-02-11 16:42:13 +08:00
gbanyan	f33b048635	feat(01-04): add CLI entry point with setup and info commands - Create click-based CLI with command group (--config, --verbose options) - Add 'info' command displaying pipeline version, config hash, data source versions - Add 'setup' command orchestrating full infrastructure flow: - Load config -> create store/provenance - Fetch gene universe (with checkpoint-restart) - Map Ensembl IDs to HGNC + UniProt - Validate mapping quality gates - Save to DuckDB with provenance sidecar - Update pyproject.toml entry point to usher_pipeline.cli.main:cli - Add .gitignore for data/, *.duckdb, build artifacts, provenance files	2026-02-11 16:39:50 +08:00
gbanyan	e29d39d1dc	docs(01-02): complete gene ID mapping and validation plan - Gene universe definition with mygene protein-coding gene retrieval - Batch Ensembl->HGNC+UniProt mapping with edge case handling - Validation gates with configurable success rate thresholds - 15 comprehensive tests with mocked API responses	2026-02-11 16:35:57 +08:00
gbanyan	92322b1d7c	docs(01-03): complete DuckDB persistence and provenance tracking plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 16:34:00 +08:00
gbanyan	0200395d9e	feat(01-02): create mapping validation gates with tests - Add MappingValidator with configurable success rate thresholds (min_success_rate, warn_threshold) - Add validate_gene_universe for gene count, format, and duplicate checks - Add save_unmapped_report for manual review output - Implement 15 comprehensive tests with mocked mygene responses (no real API calls) - Tests cover: successful mapping, notfound handling, uniprot list parsing, batching, validation gates, universe validation	2026-02-11 16:33:36 +08:00
gbanyan	98a1a750dd	feat(01-03): create provenance tracker with comprehensive tests - ProvenanceTracker class for metadata tracking - Records pipeline version, data source versions, config hash, timestamps - Sidecar JSON export alongside outputs - DuckDB _provenance table support - 13 comprehensive tests (8 DuckDB + 5 provenance) - All tests pass (12 passed, 1 skipped - pandas)	2026-02-11 16:31:51 +08:00
gbanyan	d51141f7d5	feat(01-03): create DuckDB persistence layer with checkpoint-restart - PipelineStore class for DuckDB-based storage - save_dataframe/load_dataframe for polars and pandas - Checkpoint system with has_checkpoint and metadata tracking - Parquet export capability - Context manager support	2026-02-11 16:30:25 +08:00
gbanyan	9ee3ec2e84	docs(01-01): complete project scaffold and config system plan - Created comprehensive SUMMARY.md with all execution details - Updated STATE.md: 1/4 plans in phase 1 complete, 16.7% overall progress - Documented deviation (venv creation) and decisions - Verified all files and commits exist (self-check passed)	2026-02-11 16:28:03 +08:00
gbanyan	4204116772	feat(01-01): create base API client with retry and caching - CachedAPIClient with SQLite persistent cache - Exponential backoff retry on 429/5xx/network errors (tenacity) - Rate limiting with skip for cached responses - from_config classmethod for pipeline integration - 5 passing tests for cache creation, rate limiting, and config integration	2026-02-11 16:25:46 +08:00
gbanyan	4a80a0398e	feat(01-01): create Python package scaffold with config system - pyproject.toml: installable package with bioinformatics dependencies - Pydantic config schema with validation (ensembl_release >= 100, directory creation) - YAML config loader with override support - Default config with Ensembl 113, gnomAD v4.1 - 5 passing tests for config validation and hashing	2026-02-11 16:24:35 +08:00
gbanyan	cab2f5fc66	docs(01-data-infrastructure): create phase plan	2026-02-11 16:04:42 +08:00
gbanyan	982f7f5a9b	docs(01-data-infrastructure): research phase domain	2026-02-11 15:56:40 +08:00
gbanyan	f80f384a61	docs: create roadmap (6 phases)	2026-02-11 15:47:36 +08:00
gbanyan	0fb1a9581f	docs: define v1 requirements	2026-02-11 15:31:05 +08:00
gbanyan	bb7bfaedab	docs: complete project research	2026-02-11 14:52:06 +08:00
gbanyan	c0abe8bc6c	chore: add project config	2026-02-11 14:41:35 +08:00
gbanyan	e2c202d689	docs: initialize project	2026-02-11 14:40:36 +08:00

44 Commits
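
The animal model commits (0e389c7e41, bcd3c4ffbe) describe ortholog confidence tiers, per-source bonuses, and NULL preservation. The sketch below is one possible reading of those rules, not the repository's actual code. The function names, the `CONFIDENCE_WEIGHT` multipliers, and the use of a single confidence value per gene are all assumptions made for illustration; only the tier thresholds (HIGH 8+, MEDIUM 4-7, LOW 1-3), the bonuses (mouse +0.4, zebrafish +0.3, IMPC +0.3), and the "no ortholog = NULL, not zero" rule come from the log.

```python
from typing import Iterable, Optional

# Bonuses stated in the commit log: mouse +0.4, zebrafish +0.3, IMPC +0.3.
SOURCE_BONUS = {"mouse": 0.4, "zebrafish": 0.3, "impc": 0.3}

# ASSUMPTION: the log says scores are "weighted by confidence" but gives
# no multipliers. These values are hypothetical placeholders.
CONFIDENCE_WEIGHT = {"HIGH": 1.0, "MEDIUM": 0.7, "LOW": 0.4}


def ortholog_confidence(support_count: Optional[int]) -> Optional[str]:
    """Map the number of supporting ortholog sources to a tier.

    Thresholds follow the log: HIGH (8+), MEDIUM (4-7), LOW (1-3).
    Returns None when there is no ortholog at all.
    """
    if support_count is None or support_count < 1:
        return None
    if support_count >= 8:
        return "HIGH"
    if support_count >= 4:
        return "MEDIUM"
    return "LOW"


def animal_model_score(
    support_count: Optional[int], phenotype_sources: Iterable[str]
) -> Optional[float]:
    """Sum the per-source bonuses and scale by the confidence weight.

    None (NULL) means "no ortholog, unknown"; 0.0 means "ortholog exists
    but no relevant phenotype was found". The two are kept distinct.
    """
    tier = ortholog_confidence(support_count)
    if tier is None:
        return None  # NULL preservation: unknown is not zero
    raw = sum(SOURCE_BONUS.get(source, 0.0) for source in set(phenotype_sources))
    return round(raw * CONFIDENCE_WEIGHT[tier], 3)


if __name__ == "__main__":
    print(animal_model_score(None, ["mouse"]))                   # no ortholog
    print(animal_model_score(9, ["mouse", "zebrafish", "impc"]))  # all sources
    print(animal_model_score(2, []))                             # ortholog, no phenotype
```

Returning `None` rather than `0.0` for genes without an ortholog keeps missing evidence from being counted as negative evidence when scores are later combined, which matches the "unknown != zero" convention repeated throughout the log.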