usher-exploring

Author	SHA1	Message	Date
gbanyan	63e3cccd3c	fix: implement real checkpoint persistence for literature layer Previously checkpoints only logged a message but never wrote to DuckDB, causing all progress to be lost on process kill. Now every batch_size genes (default 500) are persisted to a literature_partial table. On restart the CLI loads partial results and resumes from where it left off. The partial table is cleaned up after successful completion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:14:16 +08:00
gbanyan	6605ff0f2b	fix: resolve runtime bugs for pipeline execution on Python 3.14 + latest deps - gene_mapping: wrap mygene fetch_all generator in list() to fix len() error - gene_mapping: raise MAX_EXPECTED_GENES to 23000 (mygene DB growth) - setup_cmd: rename gene_universe columns to gene_id/gene_symbol for consistency with all downstream evidence layer code - gnomad: handle missing coverage columns in v4.1 constraint TSV - expression: fix HPA URL (v23.proteinatlas.org) and GTEx URL (v8 path) - expression: fix Polars pivot() API change (columns -> on), collect first - expression: handle missing GTEx tissues (Eye - Retina not in v8) - expression: ensure all expected columns exist even when sources unavailable - expression/load: safely check column existence before filtering - localization: fix HPA subcellular URL to v23 - animal_models: fix httpx stream response.read() before .text access - animal_models: increase infer_schema_length for HCOP and MGI TSV parsing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 03:44:01 +08:00
gbanyan	10f19f89f4	feat(06-03): comprehensive validation report and CLI validate command - Created validation_report.py with comprehensive report generation - generate_comprehensive_validation_report: combines positive, negative, sensitivity - recommend_weight_tuning: provides targeted weight adjustment recommendations - save_validation_report: persists report to file - Created validate_cmd.py CLI command following score_cmd.py pattern - Orchestrates positive controls, negative controls, sensitivity analysis - Options: --force, --skip-sensitivity, --output-dir, --top-n - Styled output with click.echo patterns - Provenance tracking for all validation steps - Updated main.py to register validate command - Updated scoring.__init__.py to export validation_report functions	2026-02-12 04:48:25 +08:00
gbanyan	0084a67fba	feat(06-02): export sensitivity analysis module from scoring package - Add sensitivity module imports and exports - Preserve existing negative_controls exports from Plan 01 - All sensitivity functions and constants now importable from usher_pipeline.scoring	2026-02-12 04:41:27 +08:00
gbanyan	0f615c0d53	feat(06-01): add recall@k metrics and extended positive control validation - compute_recall_at_k() measures recall at absolute (100, 500, 1000, 2000) and percentage (5%, 10%, 20%) thresholds - validate_positive_controls_extended() combines percentile + recall + per-source metrics - Per-source breakdown separates OMIM Usher from SYSCILIA SCGS v2 validation - Export all new functions in __init__.py including negative control imports	2026-02-12 04:41:00 +08:00
gbanyan	a7589d9bf1	feat(06-02): implement sensitivity analysis module with weight perturbation and Spearman correlation - Add perturb_weight() function with renormalization to maintain sum=1.0 - Add run_sensitivity_analysis() for parameter sweep across all layers and deltas - Add summarize_sensitivity() for stability classification - Add generate_sensitivity_report() for human-readable output - Default perturbations: ±5% and ±10% with stability threshold 0.85	2026-02-12 04:40:21 +08:00
gbanyan	e488ff2d7a	feat(06-01): add negative control validation with housekeeping genes - Create negative_controls.py with 13 curated housekeeping genes - HOUSEKEEPING_GENES_CORE frozenset with source provenance - compile_housekeeping_genes() returns DataFrame matching known_genes pattern - validate_negative_controls() uses PERCENT_RANK with inverted threshold logic - generate_negative_control_report() provides human-readable output	2026-02-12 04:39:48 +08:00
gbanyan	c10d59548f	test(05-03): add CliRunner integration tests for report command - Create test_report_cmd.py with 9 comprehensive tests - Test fixtures: test_config (minimal YAML), populated_db (synthetic scored_genes data) - Test coverage: help output, file generation, tier counts, visualizations, skip flags, custom thresholds, error handling, custom output directory - Synthetic data design: 3 HIGH, 5 MEDIUM, 5 LOW, 4 EXCLUDED, 3 NULL composite_score - All tests pass with isolated tmp_path DuckDB instances - Fixed report_cmd.py tier threshold format (uppercase keys: HIGH/MEDIUM/LOW, composite_score field) - Fixed write_candidate_output parameter name (filename_base not base_filename)	2026-02-12 04:07:34 +08:00
gbanyan	2ab25ef5c2	feat(05-03): implement CLI report command - Create report_cmd.py following established CLI pattern - Orchestrate full output pipeline: tiering, evidence summary, dual-format output, visualizations, reproducibility reports - Support --output-dir, --force, --skip-viz, --skip-report flags - Configurable tier thresholds (--high-threshold, --medium-threshold, --low-threshold, --min-evidence-high, --min-evidence-medium) - Register report command in main.py CLI entry point - Follow score_cmd.py pattern: config load, store init, checkpoint check, pipeline steps, summary display, cleanup - CLI now has 5 commands: setup, evidence, score, report, info	2026-02-12 04:05:52 +08:00
gbanyan	5af63eab46	feat(05-02): implement reproducibility report module with JSON and Markdown output - Create ReproducibilityReport dataclass with all metadata fields - Implement generate_reproducibility_report function - Extract parameters from PipelineConfig (scoring weights, data versions) - Capture software environment (Python, polars, duckdb versions) - Build filtering steps from ProvenanceTracker - Compute tier statistics from tiered DataFrame - Support optional validation metrics - to_json: write as indented JSON for machine-readable format - to_markdown: write with tables and headers for human-readable format - 7 tests covering all report fields, formats, and edge cases	2026-02-12 04:00:21 +08:00
gbanyan	4e46b488f1	feat(05-01): add dual-format writer with provenance and tests - Implement TSV+Parquet writer with deterministic output sorting - Generate YAML provenance sidecar with statistics and metadata - Add comprehensive unit tests (9 tests covering all functionality) - Fix deprecated pl.count() -> pl.len() usage	2026-02-12 03:59:26 +08:00
gbanyan	150417ffcc	feat(05-02): implement visualization module with matplotlib/seaborn plots - Add matplotlib>=3.8.0 and seaborn>=0.13.0 to dependencies - Create visualizations.py with 3 plot functions and orchestrator - plot_score_distribution: histogram colored by confidence tier - plot_layer_contributions: bar chart of evidence layer coverage - plot_tier_breakdown: pie chart of tier distribution - Use Agg backend for headless/CLI safety - All plots saved at 300 DPI with proper figure cleanup - 6 tests covering file creation, edge cases, and return values	2026-02-12 03:57:50 +08:00
gbanyan	d2ef3a2b84	feat(05-01): implement tiering logic and evidence summary module - Add confidence tier classification (HIGH/MEDIUM/LOW) based on composite_score and evidence_count - Add supporting_layers and evidence_gaps columns per gene - Use vectorized polars expressions for performance - Configurable thresholds for tier assignment	2026-02-12 03:56:42 +08:00
gbanyan	d57a5f2826	feat(04-03): add CLI score command with checkpoint-restart - Created score_cmd.py following evidence_cmd.py pattern - Orchestrates full scoring pipeline: known genes -> composite scores -> QC -> validation - Options: --force, --skip-qc, --skip-validation for flexible iteration - Registered score command in main CLI group - Displays comprehensive summary with quality flag distribution	2026-02-11 20:52:37 +08:00
gbanyan	70a5d6eff8	feat(04-02): implement positive control validation - Create validation.py with known gene ranking validation - validate_known_gene_ranking: PERCENT_RANK window function over all genes - Computes median percentile, top quartile count/fraction for known genes - generate_validation_report: human-readable text output with formatted table - Update __init__.py to export run_qc_checks, validate_known_gene_ranking, generate_validation_report	2026-02-11 20:47:59 +08:00
gbanyan	ba2f97ac55	feat(04-02): implement QC checks for scoring results - Add scipy>=1.14 dependency for MAD-based outlier detection - Create quality_control.py with 4 QC functions - compute_missing_data_rates: NULL rate detection with warn/error thresholds - compute_distribution_stats: mean/median/std per layer with anomaly detection - detect_outliers: MAD-based robust outlier detection (>3 MAD) - run_qc_checks: orchestrator with composite score percentiles	2026-02-11 20:46:57 +08:00
gbanyan	f441e8c1ad	feat(04-01): implement multi-evidence weighted scoring integration - Create join_evidence_layers() with LEFT JOIN preserving NULLs from all 6 evidence tables - Implement compute_composite_scores() with NULL-preserving weighted average (weighted_sum / available_weight) - Add quality_flag classification based on evidence_count (sufficient/moderate/sparse/no_evidence) - Include per-layer contribution columns for explainability - Add persist_scored_genes() to save scored_genes table to DuckDB - Log summary stats: coverage, mean/median scores, quality distribution, NULL rates	2026-02-11 20:41:44 +08:00
gbanyan	0cd2f7c9dd	feat(04-01): implement known gene compilation and ScoringWeights validation - Create scoring module with OMIM_USHER_GENES (10 genes) and SYSCILIA_SCGS_V2_CORE (28 genes) - Implement compile_known_genes() returning DataFrame with gene_symbol, source, confidence - Add load_known_genes_to_duckdb() to persist known genes table - Add ScoringWeights.validate_sum() method enforcing weight sum constraint (1.0 ± 1e-6)	2026-02-11 20:41:31 +08:00
gbanyan	ed21f18a98	fix(03-05): handle NULL columns and deprecated polars API in animal models - Add NULL/empty column checks in fetch_ortholog_mapping - Fix NULL handling in filter_sensory_phenotypes with is_not_null guard - Replace deprecated str.concat with str.join - Add explicit schema to empty DataFrames for consistency	2026-02-11 20:38:36 +08:00
gbanyan	d8009f1236	docs(03-04): complete subcellular localization evidence layer - Created SUMMARY.md with full implementation details - Updated STATE.md: progress 40%, 8/20 plans complete - Documented 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting) - All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete	2026-02-11 19:08:01 +08:00
gbanyan	46059874f2	feat(03-03): implement protein evidence layer with UniProt/InterPro integration - Create protein features data model with domain, coiled-coil, TM, cilia motifs - Implement fetch.py with UniProt REST API and InterPro API queries - Implement transform.py with feature extraction, motif detection, normalization - Implement load.py with DuckDB persistence and provenance tracking - Add CLI protein command following evidence layer pattern - Add comprehensive unit and integration tests (all passing) - Handle NULL preservation and List(Null) edge case - Add get_steps() method to ProvenanceTracker for test compatibility	2026-02-11 19:07:30 +08:00
gbanyan	bcd3c4ffbe	feat(03-05): add animal model DuckDB loader, CLI, and comprehensive tests - load.py: DuckDB persistence with provenance tracking, ortholog confidence distribution stats - CLI animal-models command: checkpoint-restart pattern, top scoring genes display - 10 unit tests: ortholog confidence scoring, keyword filtering, multi-organism bonus, NULL preservation - 4 integration tests: full pipeline, checkpoint-restart, provenance tracking, empty phenotype handling - All tests pass (14/14): validates fetch->transform->load->CLI flow - Fixed polars deprecations: str.join replaces str.concat, pl.len replaces pl.count	2026-02-11 19:06:49 +08:00
gbanyan	942aaf2ec3	feat(03-04): add localization CLI command and comprehensive tests - Add localization subcommand to evidence command group - Implement checkpoint-restart pattern for HPA download - Display summary with evidence type distribution - Create 17 unit and integration tests (all pass) - Test HPA parsing, evidence classification, scoring, and DuckDB persistence - Fix evidence type terminology (computational vs predicted) for consistency - Mock HTTP calls in integration tests for reproducibility	2026-02-11 19:05:22 +08:00
gbanyan	d70239c4ce	feat(03-01): add annotation DuckDB loader, CLI command, and tests - Create load_to_duckdb with provenance tracking and tier distribution stats - Add query_poorly_annotated helper to find under-studied genes - Register `evidence annotation` CLI command with checkpoint-restart pattern - Add comprehensive unit tests (9 tests) covering GO extraction, NULL handling, tier classification, score normalization, weighting - Add integration tests (6 tests) for pipeline, idempotency, checkpoint-restart, provenance, queries - All 15 tests pass with proper NULL preservation and schema validation	2026-02-11 19:03:10 +08:00
gbanyan	0e389c7e41	feat(03-05): implement animal model evidence fetch and transform - models.py: AnimalModelRecord with ortholog confidence, phenotype flags, and normalized scoring - fetch.py: Retrieve orthologs from HCOP, phenotypes from MGI/ZFIN/IMPC with retry - transform.py: Filter sensory/cilia-relevant phenotypes, score with confidence weighting - Ortholog confidence: HIGH (8+ sources), MEDIUM (4-7), LOW (1-3) - Scoring: mouse +0.4, zebrafish +0.3, IMPC +0.3, weighted by confidence - NULL preservation: no ortholog = NULL score (not zero)	2026-02-11 19:00:24 +08:00
gbanyan	8aa66987f8	feat(03-06): implement literature evidence models, PubMed fetch, and scoring - Create LiteratureRecord pydantic model with context-specific counts - Implement PubMed query via Biopython Entrez with rate limiting (3/sec default, 10/sec with API key) - Define SEARCH_CONTEXTS for cilia, sensory, cytoskeleton, cell_polarity queries - Implement evidence tier classification: direct_experimental > functional_mention > hts_hit > incidental > none - Implement quality-weighted scoring with bias mitigation via log2(total_pubmed_count) normalization - Add biopython>=1.84 dependency to pyproject.toml - Support checkpoint-restart for long-running PubMed queries (estimated 3-11 hours for 20K genes)	2026-02-11 19:00:20 +08:00
gbanyan	6645c59b0b	feat(03-04): create localization evidence data model and processing - Define LocalizationRecord model with HPA and proteomics fields - Implement fetch_hpa_subcellular to download HPA bulk data - Implement fetch_cilia_proteomics with curated reference gene sets - Implement classify_evidence_type (experimental vs computational) - Implement score_localization with cilia proximity scoring - Implement process_localization_evidence end-to-end pipeline - Create load_to_duckdb for persistence with provenance	2026-02-11 19:00:09 +08:00
gbanyan	adbb74b965	feat(03-01): implement annotation evidence fetch and transform modules - Create AnnotationRecord model with GO counts, UniProt scores, tier classification - Implement fetch_go_annotations using mygene.info batch queries - Implement fetch_uniprot_scores using UniProt REST API - Add classify_annotation_tier with 3-tier system (well/partial/poor) - Add normalize_annotation_score with weighted composite (GO 50%, UniProt 30%, Pathway 20%) - Implement process_annotation_evidence end-to-end pipeline - Follow NULL preservation pattern from gnomAD (unknown != zero) - Use lazy polars evaluation where applicable	2026-02-11 18:58:45 +08:00
gbanyan	ee27f3ad2f	feat(02-02): add DuckDB loader and CLI evidence command for gnomAD - load_to_duckdb: Saves constraint DataFrame to gnomad_constraint table with provenance tracking - query_constrained_genes: Queries constrained genes by LOEUF threshold (validates GCON-03 interpretation) - evidence_cmd.py: CLI command group with gnomad subcommand (fetch->transform->load orchestration) - Checkpoint-restart: Skips processing if gnomad_constraint table exists (--force to override) - Full CLI: usher-pipeline evidence gnomad [--force] [--url URL] [--min-depth N] [--min-cds-pct N]	2026-02-11 18:19:07 +08:00
gbanyan	174c4af02d	feat(02-01): add gnomAD transform pipeline and comprehensive tests - Implement filter_by_coverage with quality_flag categorization (measured/incomplete_coverage/no_data) - Add normalize_scores with LOEUF inversion (lower LOEUF = higher score) - NULL preservation throughout pipeline (unknown != zero constraint) - process_gnomad_constraint end-to-end pipeline function - 15 comprehensive unit tests covering edge cases: - NULL handling and preservation - Coverage filtering without dropping genes - Normalization bounds and inversion - Mixed type handling for robust parsing - Fix column mapping to handle gnomAD v4.x loeuf/loeuf_upper duplication - All existing tests continue to pass	2026-02-11 18:14:41 +08:00
gbanyan	a88b0eea60	feat(02-01): add gnomAD constraint data models and download module - Create evidence layer package structure - Define ConstraintRecord Pydantic model with NULL preservation - Implement streaming download with httpx and tenacity retry - Add lazy TSV parser with column name variant handling - Add httpx and structlog dependencies	2026-02-11 18:11:49 +08:00
gbanyan	f33b048635	feat(01-04): add CLI entry point with setup and info commands - Create click-based CLI with command group (--config, --verbose options) - Add 'info' command displaying pipeline version, config hash, data source versions - Add 'setup' command orchestrating full infrastructure flow: - Load config -> create store/provenance - Fetch gene universe (with checkpoint-restart) - Map Ensembl IDs to HGNC + UniProt - Validate mapping quality gates - Save to DuckDB with provenance sidecar - Update pyproject.toml entry point to usher_pipeline.cli.main:cli - Add .gitignore for data/, *.duckdb, build artifacts, provenance files	2026-02-11 16:39:50 +08:00
gbanyan	0200395d9e	feat(01-02): create mapping validation gates with tests - Add MappingValidator with configurable success rate thresholds (min_success_rate, warn_threshold) - Add validate_gene_universe for gene count, format, and duplicate checks - Add save_unmapped_report for manual review output - Implement 15 comprehensive tests with mocked mygene responses (no real API calls) - Tests cover: successful mapping, notfound handling, uniprot list parsing, batching, validation gates, universe validation	2026-02-11 16:33:36 +08:00
gbanyan	98a1a750dd	feat(01-03): create provenance tracker with comprehensive tests - ProvenanceTracker class for metadata tracking - Records pipeline version, data source versions, config hash, timestamps - Sidecar JSON export alongside outputs - DuckDB _provenance table support - 13 comprehensive tests (8 DuckDB + 5 provenance) - All tests pass (12 passed, 1 skipped - pandas)	2026-02-11 16:31:51 +08:00
gbanyan	d51141f7d5	feat(01-03): create DuckDB persistence layer with checkpoint-restart - PipelineStore class for DuckDB-based storage - save_dataframe/load_dataframe for polars and pandas - Checkpoint system with has_checkpoint and metadata tracking - Parquet export capability - Context manager support	2026-02-11 16:30:25 +08:00
gbanyan	4204116772	feat(01-01): create base API client with retry and caching - CachedAPIClient with SQLite persistent cache - Exponential backoff retry on 429/5xx/network errors (tenacity) - Rate limiting with skip for cached responses - from_config classmethod for pipeline integration - 5 passing tests for cache creation, rate limiting, and config integration	2026-02-11 16:25:46 +08:00
gbanyan	4a80a0398e	feat(01-01): create Python package scaffold with config system - pyproject.toml: installable package with bioinformatics dependencies - Pydantic config schema with validation (ensembl_release >= 100, directory creation) - YAML config loader with override support - Default config with Ensembl 113, gnomAD v4.1 - 5 passing tests for config validation and hashing	2026-02-11 16:24:35 +08:00

37 Commits
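
Two of the scoring commits above describe arithmetic that is easier to follow with a small example: f441e8c1ad (NULL-preserving weighted average, weighted_sum / available_weight) and a7589d9bf1 (perturb_weight() with renormalization to keep the weight sum at 1.0). The sketch below is a plain-Python illustration of those two ideas only. It is not the repository's code: the function signatures, the layer names, the additive interpretation of the delta, and the clamp at zero are all assumptions made for illustration.

```python
from typing import Dict, Mapping, Optional


def composite_score(
    layer_scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted average over the layers that have a score.

    Missing layers (None) are skipped rather than treated as zero, so the
    result is weighted_sum / available_weight. Returns None when no layer
    has evidence. Signature is hypothetical.
    """
    weighted_sum = 0.0
    available_weight = 0.0
    for layer, weight in weights.items():
        score = layer_scores.get(layer)
        if score is None:
            continue  # unknown is not the same as zero
        weighted_sum += weight * score
        available_weight += weight
    if available_weight == 0.0:
        return None
    return weighted_sum / available_weight


def perturb_weight(
    weights: Mapping[str, float], layer: str, delta: float
) -> Dict[str, float]:
    """Shift one layer's weight by delta, then renormalize so the sum is 1.0.

    Assumption: delta is additive (e.g. 0.05 for +5%) and negative results
    are clamped to zero. The real module may define this differently.
    """
    perturbed = dict(weights)
    perturbed[layer] = max(0.0, perturbed[layer] + delta)
    total = sum(perturbed.values())
    if total == 0.0:
        raise ValueError("all weights are zero after perturbation")
    return {name: value / total for name, value in perturbed.items()}


# Hypothetical three-layer example; the pipeline itself joins six layers.
weights = {"gnomad": 0.5, "expression": 0.3, "literature": 0.2}
scores = {"gnomad": 0.8, "expression": None, "literature": 0.4}

score = composite_score(scores, weights)       # uses only gnomad + literature
shifted = perturb_weight(weights, "gnomad", 0.05)
```

Dividing by the available weight rather than the full weight sum keeps a gene with one missing layer from being penalized as if that layer had scored zero, which matches the "unknown != zero" convention repeated throughout the log.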