usher-exploring

Author	SHA1	Message	Date
gbanyan	5af63eab46	feat(05-02): implement reproducibility report module with JSON and Markdown output - Create ReproducibilityReport dataclass with all metadata fields - Implement generate_reproducibility_report function - Extract parameters from PipelineConfig (scoring weights, data versions) - Capture software environment (Python, polars, duckdb versions) - Build filtering steps from ProvenanceTracker - Compute tier statistics from tiered DataFrame - Support optional validation metrics - to_json: write as indented JSON for machine-readable format - to_markdown: write with tables and headers for human-readable format - 7 tests covering all report fields, formats, and edge cases	2026-02-12 04:00:21 +08:00
gbanyan	4e46b488f1	feat(05-01): add dual-format writer with provenance and tests - Implement TSV+Parquet writer with deterministic output sorting - Generate YAML provenance sidecar with statistics and metadata - Add comprehensive unit tests (9 tests covering all functionality) - Fix deprecated pl.count() -> pl.len() usage	2026-02-12 03:59:26 +08:00
gbanyan	150417ffcc	feat(05-02): implement visualization module with matplotlib/seaborn plots - Add matplotlib>=3.8.0 and seaborn>=0.13.0 to dependencies - Create visualizations.py with 3 plot functions and orchestrator - plot_score_distribution: histogram colored by confidence tier - plot_layer_contributions: bar chart of evidence layer coverage - plot_tier_breakdown: pie chart of tier distribution - Use Agg backend for headless/CLI safety - All plots saved at 300 DPI with proper figure cleanup - 6 tests covering file creation, edge cases, and return values	2026-02-12 03:57:50 +08:00
gbanyan	a6ad6c6d19	test(04-03): add unit and integration tests for scoring module - test_scoring.py: 7 unit tests for known genes, weight validation, NULL preservation - test_scoring_integration.py: 3 integration tests for end-to-end pipeline with synthetic data - Tests verify NULL handling (genes with no evidence get NULL composite score) - Tests verify known genes rank highly when given high scores - Tests verify QC detects missing data above thresholds - All tests use synthetic data (no external API calls, fast, reproducible)	2026-02-11 20:54:39 +08:00
gbanyan	d8009f1236	docs(03-04): complete subcellular localization evidence layer - Created SUMMARY.md with full implementation details - Updated STATE.md: progress 40%, 8/20 plans complete - Documented 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting) - All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete	2026-02-11 19:08:01 +08:00
gbanyan	46059874f2	feat(03-03): implement protein evidence layer with UniProt/InterPro integration - Create protein features data model with domain, coiled-coil, TM, cilia motifs - Implement fetch.py with UniProt REST API and InterPro API queries - Implement transform.py with feature extraction, motif detection, normalization - Implement load.py with DuckDB persistence and provenance tracking - Add CLI protein command following evidence layer pattern - Add comprehensive unit and integration tests (all passing) - Handle NULL preservation and List(Null) edge case - Add get_steps() method to ProvenanceTracker for test compatibility	2026-02-11 19:07:30 +08:00
gbanyan	bcd3c4ffbe	feat(03-05): add animal model DuckDB loader, CLI, and comprehensive tests - load.py: DuckDB persistence with provenance tracking, ortholog confidence distribution stats - CLI animal-models command: checkpoint-restart pattern, top scoring genes display - 10 unit tests: ortholog confidence scoring, keyword filtering, multi-organism bonus, NULL preservation - 4 integration tests: full pipeline, checkpoint-restart, provenance tracking, empty phenotype handling - All tests pass (14/14): validates fetch->transform->load->CLI flow - Fixed polars deprecations: str.join replaces str.concat, pl.len replaces pl.count	2026-02-11 19:06:49 +08:00
gbanyan	942aaf2ec3	feat(03-04): add localization CLI command and comprehensive tests - Add localization subcommand to evidence command group - Implement checkpoint-restart pattern for HPA download - Display summary with evidence type distribution - Create 17 unit and integration tests (all pass) - Test HPA parsing, evidence classification, scoring, and DuckDB persistence - Fix evidence type terminology (computational vs predicted) for consistency - Mock HTTP calls in integration tests for reproducibility	2026-02-11 19:05:22 +08:00
gbanyan	d70239c4ce	feat(03-01): add annotation DuckDB loader, CLI command, and tests - Create load_to_duckdb with provenance tracking and tier distribution stats - Add query_poorly_annotated helper to find under-studied genes - Register `evidence annotation` CLI command with checkpoint-restart pattern - Add comprehensive unit tests (9 tests) covering GO extraction, NULL handling, tier classification, score normalization, weighting - Add integration tests (6 tests) for pipeline, idempotency, checkpoint-restart, provenance, queries - All 15 tests pass with proper NULL preservation and schema validation	2026-02-11 19:03:10 +08:00
gbanyan	56e04e68c2	test(02-02): add comprehensive gnomAD integration tests - 12 integration tests covering full pipeline: fetch->transform->load->query - test_full_pipeline_to_duckdb: End-to-end pipeline verification with DuckDB storage - test_checkpoint_restart_skips_processing: Checkpoint detection works correctly - test_provenance_recorded: Provenance step records expected details - test_provenance_sidecar_created: JSON sidecar file creation and structure - test_query_constrained_genes_filters_correctly: Query returns only measured genes below threshold - test_null_loeuf_not_in_constrained_results: NULL LOEUF genes excluded from queries - test_duckdb_schema_has_quality_flag: Schema includes quality_flag with valid values - test_normalized_scores_in_duckdb: Normalized scores in [0,1] for measured genes, NULL for others - test_cli_evidence_gnomad_help: CLI help text displays correctly - test_cli_evidence_gnomad_with_mock: CLI command runs end-to-end with mocked download - test_idempotent_load_replaces_table: Loading twice replaces table (not appends) - test_quality_flag_categorization: Quality flags correctly categorize genes. All 70 tests pass (58 existing + 12 new), no regressions.	2026-02-11 18:20:59 +08:00
gbanyan	174c4af02d	feat(02-01): add gnomAD transform pipeline and comprehensive tests - Implement filter_by_coverage with quality_flag categorization (measured/incomplete_coverage/no_data) - Add normalize_scores with LOEUF inversion (lower LOEUF = higher score) - NULL preservation throughout pipeline (unknown != zero constraint) - process_gnomad_constraint end-to-end pipeline function - 15 comprehensive unit tests covering edge cases: - NULL handling and preservation - Coverage filtering without dropping genes - Normalization bounds and inversion - Mixed type handling for robust parsing - Fix column mapping to handle gnomAD v4.x loeuf/loeuf_upper duplication - All existing tests continue to pass	2026-02-11 18:14:41 +08:00
gbanyan	e4d71d0790	test(01-04): add integration tests verifying module wiring - test_config_to_store_roundtrip: config -> PipelineStore -> save/load - test_config_to_provenance: config -> ProvenanceTracker -> sidecar - test_full_setup_flow_mocked: full setup with mocked mygene (fetch, map, validate, save, provenance) - test_checkpoint_skip_flow: verify checkpoint-restart skips re-fetch - test_setup_cli_help: CLI help output verification - test_info_cli: info command with config display. All tests pass with mocked API calls (no external dependencies).	2026-02-11 16:42:13 +08:00
gbanyan	0200395d9e	feat(01-02): create mapping validation gates with tests - Add MappingValidator with configurable success rate thresholds (min_success_rate, warn_threshold) - Add validate_gene_universe for gene count, format, and duplicate checks - Add save_unmapped_report for manual review output - Implement 15 comprehensive tests with mocked mygene responses (no real API calls) - Tests cover: successful mapping, notfound handling, uniprot list parsing, batching, validation gates, universe validation	2026-02-11 16:33:36 +08:00
gbanyan	98a1a750dd	feat(01-03): create provenance tracker with comprehensive tests - ProvenanceTracker class for metadata tracking - Records pipeline version, data source versions, config hash, timestamps - Sidecar JSON export alongside outputs - DuckDB _provenance table support - 13 comprehensive tests (8 DuckDB + 5 provenance) - All tests pass (12 passed, 1 skipped - pandas)	2026-02-11 16:31:51 +08:00
gbanyan	4204116772	feat(01-01): create base API client with retry and caching - CachedAPIClient with SQLite persistent cache - Exponential backoff retry on 429/5xx/network errors (tenacity) - Rate limiting with skip for cached responses - from_config classmethod for pipeline integration - 5 passing tests for cache creation, rate limiting, and config integration	2026-02-11 16:25:46 +08:00
gbanyan	4a80a0398e	feat(01-01): create Python package scaffold with config system - pyproject.toml: installable package with bioinformatics dependencies - Pydantic config schema with validation (ensembl_release >= 100, directory creation) - YAML config loader with override support - Default config with Ensembl 113, gnomAD v4.1 - 5 passing tests for config validation and hashing	2026-02-11 16:24:35 +08:00

16 Commits
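
The gnomAD commits (174c4af02d, 56e04e68c2) describe a normalization in which lower LOEUF yields a higher score, measured genes land in [0,1], and NULL stays NULL ("unknown != zero constraint"). The sketch below illustrates that idea in plain Python and is not the repository's code. The function name `normalize_loeuf_inverted`, the choice of min-max scaling, and the 0.5 fallback when all measured values are identical are assumptions made for illustration; the log does not state which scaling the actual `normalize_scores` uses.

```python
from typing import Optional, Sequence


def normalize_loeuf_inverted(
    values: Sequence[Optional[float]],
) -> list[Optional[float]]:
    """Map LOEUF values to [0, 1] so that lower LOEUF gives a higher score.

    None (no measurement) is passed through unchanged rather than being
    coerced to 0, so "unknown" is never confused with "unconstrained".
    Min-max scaling is an assumption of this sketch.
    """
    measured = [v for v in values if v is not None]
    if not measured:
        # Nothing measured: every score stays unknown.
        return [None] * len(values)

    lo, hi = min(measured), max(measured)
    span = hi - lo

    scores: list[Optional[float]] = []
    for v in values:
        if v is None:
            scores.append(None)          # preserve NULL
        elif span == 0:
            scores.append(0.5)           # assumed fallback: no spread to rank on
        else:
            scores.append((hi - v) / span)  # inverted: lowest LOEUF -> 1.0
    return scores


if __name__ == "__main__":
    print(normalize_loeuf_inverted([0.5, None, 1.5, 1.0]))
```

Keeping the missing entries as `None` means a downstream composite score can tell "no constraint data" apart from "measured and unconstrained", which is the distinction the scoring tests in a6ad6c6d19 check for.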