usher-exploring

Author	SHA1	Message	Date
gbanyan	674a9ae845	docs: add project README (zh/en) and CLAUDE.md Researcher-facing README explaining pipeline rationale, six evidence layers with scientific basis, scoring methodology, and step-by-step execution guide for CLI newcomers. CLAUDE.md provides development context for future Claude Code sessions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 04:50:02 +08:00
gbanyan	6605ff0f2b	fix: resolve runtime bugs for pipeline execution on Python 3.14 + latest deps - gene_mapping: wrap mygene fetch_all generator in list() to fix len() error - gene_mapping: raise MAX_EXPECTED_GENES to 23000 (mygene DB growth) - setup_cmd: rename gene_universe columns to gene_id/gene_symbol for consistency with all downstream evidence layer code - gnomad: handle missing coverage columns in v4.1 constraint TSV - expression: fix HPA URL (v23.proteinatlas.org) and GTEx URL (v8 path) - expression: fix Polars pivot() API change (columns -> on), collect first - expression: handle missing GTEx tissues (Eye - Retina not in v8) - expression: ensure all expected columns exist even when sources unavailable - expression/load: safely check column existence before filtering - localization: fix HPA subcellular URL to v23 - animal_models: fix httpx stream response.read() before .text access - animal_models: increase infer_schema_length for HCOP and MGI TSV parsing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 03:44:01 +08:00
gbanyan	a2ef2125ba	chore: complete v1.0 MVP milestone Archive v1.0 milestone: 6 phases, 21 plans, 40/40 requirements. Reorganize ROADMAP.md, evolve PROJECT.md, archive requirements. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 21:31:43 +08:00
gbanyan	c284804493	docs: milestone v1.0 audit — passed (40/40 requirements)	2026-02-12 10:28:53 +08:00
gbanyan	294b4defbb	docs(phase-06): complete phase execution	2026-02-12 04:57:56 +08:00
gbanyan	c97592d629	docs(06-03): complete comprehensive validation report & CLI plan - Created 06-03-SUMMARY.md documenting plan execution - Updated STATE.md: - Current Position: Phase 6 COMPLETE (21/21 plans) - Performance Metrics: Phase 06 3/3 plans, 10 min total, 3.3 min/plan avg - Added decisions for comprehensive validation report and weight tuning recommendations - Session Continuity: Stopped at 06-03-PLAN.md completion Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 04:53:05 +08:00
gbanyan	5879ae9768	test(06-03): unit tests for all validation modules - Created test_validation.py with 13 comprehensive tests - create_synthetic_scored_db helper: designs scores for known genes high, housekeeping low - Tests for negative controls: - test_compile_housekeeping_genes_structure - test_compile_housekeeping_genes_known_genes_present - test_validate_negative_controls_with_synthetic_data - test_validate_negative_controls_inverted_logic - Tests for recall@k: - test_compute_recall_at_k - Tests for sensitivity/perturbation: - test_perturb_weight_renormalizes - test_perturb_weight_large_negative - test_perturb_weight_invalid_layer - Tests for validation report: - test_generate_comprehensive_validation_report_format - test_recommend_weight_tuning_all_pass - test_recommend_weight_tuning_positive_fail - test_recommend_weight_tuning_negative_fail - test_recommend_weight_tuning_sensitivity_fail - All tests pass using synthetic DuckDB with tmp_path isolation	2026-02-12 04:50:17 +08:00
gbanyan	10f19f89f4	feat(06-03): comprehensive validation report and CLI validate command - Created validation_report.py with comprehensive report generation - generate_comprehensive_validation_report: combines positive, negative, sensitivity - recommend_weight_tuning: provides targeted weight adjustment recommendations - save_validation_report: persists report to file - Created validate_cmd.py CLI command following score_cmd.py pattern - Orchestrates positive controls, negative controls, sensitivity analysis - Options: --force, --skip-sensitivity, --output-dir, --top-n - Styled output with click.echo patterns - Provenance tracking for all validation steps - Updated main.py to register validate command - Updated scoring.__init__.py to export validation_report functions	2026-02-12 04:48:25 +08:00
gbanyan	2d29f43848	docs(06-02): complete sensitivity analysis plan - Create SUMMARY.md with implementation details and verification results - Update STATE.md: progress 100% (20/20 plans), plan 06-02 complete - Record decisions: perturbation deltas, stability threshold, renormalization - All tasks completed with 2 commits in 3 minutes	2026-02-12 04:44:13 +08:00
gbanyan	a2d6e97acf	docs(06-01): complete negative controls and recall@k validation plan Summary: - Created negative_controls.py with 13 housekeeping genes - Added recall@k metrics (absolute and percentage thresholds) - Added per-source breakdown for OMIM vs SYSCILIA - Updated STATE.md with Phase 6 progress and decisions Plan Summary: .planning/phases/06-validation/06-01-SUMMARY.md	2026-02-12 04:43:59 +08:00
gbanyan	0084a67fba	feat(06-02): export sensitivity analysis module from scoring package - Add sensitivity module imports and exports - Preserve existing negative_controls exports from Plan 01 - All sensitivity functions and constants now importable from usher_pipeline.scoring	2026-02-12 04:41:27 +08:00
gbanyan	0f615c0d53	feat(06-01): add recall@k metrics and extended positive control validation - compute_recall_at_k() measures recall at absolute (100, 500, 1000, 2000) and percentage (5%, 10%, 20%) thresholds - validate_positive_controls_extended() combines percentile + recall + per-source metrics - Per-source breakdown separates OMIM Usher from SYSCILIA SCGS v2 validation - Export all new functions in __init__.py including negative control imports	2026-02-12 04:41:00 +08:00
gbanyan	a7589d9bf1	feat(06-02): implement sensitivity analysis module with weight perturbation and Spearman correlation - Add perturb_weight() function with renormalization to maintain sum=1.0 - Add run_sensitivity_analysis() for parameter sweep across all layers and deltas - Add summarize_sensitivity() for stability classification - Add generate_sensitivity_report() for human-readable output - Default perturbations: ±5% and ±10% with stability threshold 0.85	2026-02-12 04:40:21 +08:00
gbanyan	e488ff2d7a	feat(06-01): add negative control validation with housekeeping genes - Create negative_controls.py with 13 curated housekeeping genes - HOUSEKEEPING_GENES_CORE frozenset with source provenance - compile_housekeeping_genes() returns DataFrame matching known_genes pattern - validate_negative_controls() uses PERCENT_RANK with inverted threshold logic - generate_negative_control_report() provides human-readable output	2026-02-12 04:39:48 +08:00
gbanyan	844295c681	docs(06): create phase plan	2026-02-12 04:33:17 +08:00
gbanyan	ca2b715d8e	docs(phase-06): research validation domain with controls and sensitivity analysis	2026-02-12 04:27:26 +08:00
gbanyan	964c1cebe8	docs(phase-05): complete phase execution	2026-02-12 04:15:50 +08:00
gbanyan	00e2836eb9	docs(05-03): complete CLI report command plan - Create 05-03-SUMMARY.md with comprehensive execution report - Update STATE.md: Phase 5 complete (3/3 plans), progress 90% (18/20) - Record decisions: CLI pattern, configurable thresholds, skip flags, graceful degradation - Document deviations: 2 auto-fixed bugs (tier threshold format, parameter name) - Update metrics: Phase 05 total 12 min (4.0 min/plan avg) - All tests pass (9/9 CliRunner integration tests)	2026-02-12 04:09:26 +08:00
gbanyan	c10d59548f	test(05-03): add CliRunner integration tests for report command - Create test_report_cmd.py with 9 comprehensive tests - Test fixtures: test_config (minimal YAML), populated_db (synthetic scored_genes data) - Test coverage: help output, file generation, tier counts, visualizations, skip flags, custom thresholds, error handling, custom output directory - Synthetic data design: 3 HIGH, 5 MEDIUM, 5 LOW, 4 EXCLUDED, 3 NULL composite_score - All tests pass with isolated tmp_path DuckDB instances - Fixed report_cmd.py tier threshold format (uppercase keys: HIGH/MEDIUM/LOW, composite_score field) - Fixed write_candidate_output parameter name (filename_base not base_filename)	2026-02-12 04:07:34 +08:00
gbanyan	2ab25ef5c2	feat(05-03): implement CLI report command - Create report_cmd.py following established CLI pattern - Orchestrate full output pipeline: tiering, evidence summary, dual-format output, visualizations, reproducibility reports - Support --output-dir, --force, --skip-viz, --skip-report flags - Configurable tier thresholds (--high-threshold, --medium-threshold, --low-threshold, --min-evidence-high, --min-evidence-medium) - Register report command in main.py CLI entry point - Follow score_cmd.py pattern: config load, store init, checkpoint check, pipeline steps, summary display, cleanup - CLI now has 5 commands: setup, evidence, score, report, info	2026-02-12 04:05:52 +08:00
gbanyan	5f14dc2e64	docs(05-02): complete visualization and reproducibility report plan - Plan 05-02 executed successfully - 2 tasks completed with 2 commits - 13 tests passing (6 visualization + 7 reproducibility) - 4 files created, 2 files modified - Duration: 5 minutes - Updated STATE.md with progress (17/20 plans complete, 85%)	2026-02-12 04:03:08 +08:00
gbanyan	434c79c0a8	docs(05-01): complete output generation core plan - Add 05-01-SUMMARY.md with performance metrics and decisions - Update STATE.md to Phase 5, Plan 1 of 3 (80% overall progress) - Record key decisions: configurable tiers, dual-format output, YAML provenance - Document deviation: pl.count() -> pl.len() deprecation fix Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 04:01:24 +08:00
gbanyan	5af63eab46	feat(05-02): implement reproducibility report module with JSON and Markdown output - Create ReproducibilityReport dataclass with all metadata fields - Implement generate_reproducibility_report function - Extract parameters from PipelineConfig (scoring weights, data versions) - Capture software environment (Python, polars, duckdb versions) - Build filtering steps from ProvenanceTracker - Compute tier statistics from tiered DataFrame - Support optional validation metrics - to_json: write as indented JSON for machine-readable format - to_markdown: write with tables and headers for human-readable format - 7 tests covering all report fields, formats, and edge cases	2026-02-12 04:00:21 +08:00
gbanyan	4e46b488f1	feat(05-01): add dual-format writer with provenance and tests - Implement TSV+Parquet writer with deterministic output sorting - Generate YAML provenance sidecar with statistics and metadata - Add comprehensive unit tests (9 tests covering all functionality) - Fix deprecated pl.count() -> pl.len() usage	2026-02-12 03:59:26 +08:00
gbanyan	150417ffcc	feat(05-02): implement visualization module with matplotlib/seaborn plots - Add matplotlib>=3.8.0 and seaborn>=0.13.0 to dependencies - Create visualizations.py with 3 plot functions and orchestrator - plot_score_distribution: histogram colored by confidence tier - plot_layer_contributions: bar chart of evidence layer coverage - plot_tier_breakdown: pie chart of tier distribution - Use Agg backend for headless/CLI safety - All plots saved at 300 DPI with proper figure cleanup - 6 tests covering file creation, edge cases, and return values	2026-02-12 03:57:50 +08:00
gbanyan	d2ef3a2b84	feat(05-01): implement tiering logic and evidence summary module - Add confidence tier classification (HIGH/MEDIUM/LOW) based on composite_score and evidence_count - Add supporting_layers and evidence_gaps columns per gene - Use vectorized polars expressions for performance - Configurable thresholds for tier assignment	2026-02-12 03:56:42 +08:00
gbanyan	6ab7fd1378	docs(05-output-cli): create phase plan	2026-02-11 21:14:37 +08:00
gbanyan	1799906138	docs(05): research phase output-cli domain	2026-02-11 21:07:56 +08:00
gbanyan	de678858cd	docs(phase-04): complete phase execution	2026-02-11 21:01:50 +08:00
gbanyan	386fbf51b2	docs(04-03): complete CLI score command and tests plan - SUMMARY.md: CLI orchestration with checkpoint-restart + 10 comprehensive tests - STATE.md: Updated position (Phase 4 complete), progress (75%), velocity, decisions - Duration: 3 minutes, 2 tasks, 4 files (3 created, 1 modified)	2026-02-11 20:56:31 +08:00
gbanyan	a6ad6c6d19	test(04-03): add unit and integration tests for scoring module - test_scoring.py: 7 unit tests for known genes, weight validation, NULL preservation - test_scoring_integration.py: 3 integration tests for end-to-end pipeline with synthetic data - Tests verify NULL handling (genes with no evidence get NULL composite score) - Tests verify known genes rank highly when given high scores - Tests verify QC detects missing data above thresholds - All tests use synthetic data (no external API calls, fast, reproducible)	2026-02-11 20:54:39 +08:00
gbanyan	d57a5f2826	feat(04-03): add CLI score command with checkpoint-restart - Created score_cmd.py following evidence_cmd.py pattern - Orchestrates full scoring pipeline: known genes -> composite scores -> QC -> validation - Options: --force, --skip-qc, --skip-validation for flexible iteration - Registered score command in main CLI group - Displays comprehensive summary with quality flag distribution	2026-02-11 20:52:37 +08:00
gbanyan	c501951b0f	docs(04-02): complete QC and validation plan - Add SUMMARY for quality control and positive control validation - Update STATE.md: Plan 2 of 3 in Phase 04 complete - Progress: 70% (14/20 plans complete) - Decisions: scipy MAD outlier detection, PERCENT_RANK validation	2026-02-11 20:50:00 +08:00
gbanyan	70a5d6eff8	feat(04-02): implement positive control validation - Create validation.py with known gene ranking validation - validate_known_gene_ranking: PERCENT_RANK window function over all genes - Computes median percentile, top quartile count/fraction for known genes - generate_validation_report: human-readable text output with formatted table - Update __init__.py to export run_qc_checks, validate_known_gene_ranking, generate_validation_report	2026-02-11 20:47:59 +08:00
gbanyan	ba2f97ac55	feat(04-02): implement QC checks for scoring results - Add scipy>=1.14 dependency for MAD-based outlier detection - Create quality_control.py with 4 QC functions - compute_missing_data_rates: NULL rate detection with warn/error thresholds - compute_distribution_stats: mean/median/std per layer with anomaly detection - detect_outliers: MAD-based robust outlier detection (>3 MAD) - run_qc_checks: orchestrator with composite score percentiles	2026-02-11 20:46:57 +08:00
gbanyan	71c4e8f736	docs(04-01): complete known gene compilation and weighted scoring plan - Known genes: 38 (10 OMIM Usher + 28 SYSCILIA SCGS v2 core) - ScoringWeights.validate_sum() enforcing weight sum = 1.0 - NULL-preserving weighted average (weighted_sum / available_weight) - Quality flags based on evidence_count thresholds - Per-layer contributions for explainability - 2 tasks, 4 files, 4 min duration	2026-02-11 20:44:09 +08:00
gbanyan	f441e8c1ad	feat(04-01): implement multi-evidence weighted scoring integration - Create join_evidence_layers() with LEFT JOIN preserving NULLs from all 6 evidence tables - Implement compute_composite_scores() with NULL-preserving weighted average (weighted_sum / available_weight) - Add quality_flag classification based on evidence_count (sufficient/moderate/sparse/no_evidence) - Include per-layer contribution columns for explainability - Add persist_scored_genes() to save scored_genes table to DuckDB - Log summary stats: coverage, mean/median scores, quality distribution, NULL rates	2026-02-11 20:41:44 +08:00
gbanyan	0cd2f7c9dd	feat(04-01): implement known gene compilation and ScoringWeights validation - Create scoring module with OMIM_USHER_GENES (10 genes) and SYSCILIA_SCGS_V2_CORE (28 genes) - Implement compile_known_genes() returning DataFrame with gene_symbol, source, confidence - Add load_known_genes_to_duckdb() to persist known genes table - Add ScoringWeights.validate_sum() method enforcing weight sum constraint (1.0 ± 1e-6)	2026-02-11 20:41:31 +08:00
gbanyan	ed21f18a98	fix(03-05): handle NULL columns and deprecated polars API in animal models - Add NULL/empty column checks in fetch_ortholog_mapping - Fix NULL handling in filter_sensory_phenotypes with is_not_null guard - Replace deprecated str.concat with str.join - Add explicit schema to empty DataFrames for consistency	2026-02-11 20:38:36 +08:00
gbanyan	a52724aff4	docs(04): create phase plan for scoring and integration	2026-02-11 20:31:55 +08:00
gbanyan	32988c631f	docs(04): research multi-evidence weighted scoring with NULL preservation	2026-02-11 20:24:42 +08:00
gbanyan	190bedaa80	docs(phase-03): complete phase execution	2026-02-11 19:18:12 +08:00
gbanyan	e72c516669	docs(03-06): complete literature evidence layer - Created SUMMARY.md with full implementation details - Updated STATE.md: progress 60%, 12/20 plans complete, Phase 3 complete - Documented 4 key decisions (tier priority, bias mitigation, context weights, rate limiting) - All verification criteria met: 17/17 tests pass, CLI functional, bias mitigation validated - Self-check PASSED: all files and commits verified Key accomplishments: - PubMed evidence layer queries per gene across cilia/sensory/cytoskeleton/polarity contexts - Quality tier classification: direct_experimental > hts_hit > functional_mention > incidental - Bias mitigation via log2(total_pubmed_count) prevents well-studied gene dominance - Novel genes with 10 total/5 cilia publications score higher than TP53-like genes with 100K total/5 cilia - Biopython Entrez integration with rate limiting (3/sec default, 10/sec with API key)	2026-02-11 19:13:26 +08:00
gbanyan	0e89bf0dd6	docs(03-02): complete expression evidence layer plan - Create 03-02-SUMMARY.md with performance metrics, decisions, and deviations - Update STATE.md: 5 of 6 plans complete in Phase 03 (03-06 remaining) - Update progress: 55% complete (11/20 plans across all phases) - Add key decisions: Tau calculation, expression scoring, CellxGene optional - Record duration: 12 min for 2 tasks (9 files modified) - Self-check passed: all files and commits verified Expression layer provides: - HPA/GTEx tissue expression with Tau specificity index - Usher-tissue enrichment scoring (retina, inner ear, cilia) - Optional CellxGene single-cell integration - CLI command with checkpoint-restart - 11 passing unit and integration tests	2026-02-11 19:12:18 +08:00
gbanyan	cfe4b830e6	docs(03-03): complete protein features plan with SUMMARY and STATE updates	2026-02-11 19:10:03 +08:00
gbanyan	053f0d926b	docs(03-05): complete animal model phenotype evidence layer plan - SUMMARY.md: Ortholog-mapped animal evidence from MGI/ZFIN/IMPC - Confidence-weighted scoring (mouse +0.4, zebrafish +0.3, IMPC +0.3) - 14/14 tests passing: ortholog confidence, keyword filtering, NULL preservation - Deviations: Schema mismatches, NULL handling, polars deprecations auto-fixed - Duration: 10 minutes, 2 tasks, 8 files, 2 commits	2026-02-11 19:08:45 +08:00
gbanyan	d8009f1236	docs(03-04): complete subcellular localization evidence layer - Created SUMMARY.md with full implementation details - Updated STATE.md: progress 40%, 8/20 plans complete - Documented 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting) - All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete	2026-02-11 19:08:01 +08:00
gbanyan	46059874f2	feat(03-03): implement protein evidence layer with UniProt/InterPro integration - Create protein features data model with domain, coiled-coil, TM, cilia motifs - Implement fetch.py with UniProt REST API and InterPro API queries - Implement transform.py with feature extraction, motif detection, normalization - Implement load.py with DuckDB persistence and provenance tracking - Add CLI protein command following evidence layer pattern - Add comprehensive unit and integration tests (all passing) - Handle NULL preservation and List(Null) edge case - Add get_steps() method to ProvenanceTracker for test compatibility	2026-02-11 19:07:30 +08:00
gbanyan	bcd3c4ffbe	feat(03-05): add animal model DuckDB loader, CLI, and comprehensive tests - load.py: DuckDB persistence with provenance tracking, ortholog confidence distribution stats - CLI animal-models command: checkpoint-restart pattern, top scoring genes display - 10 unit tests: ortholog confidence scoring, keyword filtering, multi-organism bonus, NULL preservation - 4 integration tests: full pipeline, checkpoint-restart, provenance tracking, empty phenotype handling - All tests pass (14/14): validates fetch->transform->load->CLI flow - Fixed polars deprecations: str.join replaces str.concat, pl.len replaces pl.count	2026-02-11 19:06:49 +08:00
gbanyan	99bc975a2c	docs(03-01): complete annotation completeness plan	2026-02-11 19:05:56 +08:00
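
Note on the scoring commits (04-01): the messages describe a NULL-preserving weighted average computed as weighted_sum / available_weight, with the weights required to sum to 1.0 ± 1e-6. The following is a minimal sketch of that idea in plain Python, not the repository's implementation. The function name `composite_score`, the layer names, and the example weights are all hypothetical; the real code reportedly works on DuckDB tables and polars expressions.

```python
from typing import Mapping, Optional


def composite_score(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
    tol: float = 1e-6,
) -> Optional[float]:
    """Weighted average over the layers that actually have a score.

    Hypothetical helper: a missing layer (None) is skipped rather than
    treated as zero, and the result is None when no layer has evidence.
    """
    # Weight-sum constraint as described in the log (1.0 within a tolerance).
    if abs(sum(weights.values()) - 1.0) > tol:
        raise ValueError("weights must sum to 1.0")

    weighted_sum = 0.0
    available_weight = 0.0
    for layer, weight in weights.items():
        score = scores.get(layer)
        if score is None:  # no evidence for this layer: leave it out
            continue
        weighted_sum += weight * score
        available_weight += weight

    if available_weight == 0.0:
        return None  # no evidence at all: keep the composite score NULL
    return weighted_sum / available_weight


# Illustrative layer names and weights only (assumptions, not project config).
example_weights = {"expression": 0.5, "localization": 0.3, "literature": 0.2}
partial = composite_score(
    {"expression": 0.8, "localization": None, "literature": 0.4},
    example_weights,
)
empty = composite_score({}, example_weights)
```

Dividing by the available weight rather than the total weight means a gene is not penalized for layers where no data exists, which matches the "genes with no evidence get NULL composite score" behavior mentioned in the test commit.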

86 Commits
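
Note on the sensitivity commits (06-02): the messages mention a perturb_weight() function that renormalizes so the weights keep summing to 1.0, with default perturbations of ±5% and ±10%. A rough sketch of one way this could work is given below. The signature, the clamping at zero, and the choice to treat the delta as an absolute shift are assumptions; the log does not show the actual code.

```python
from typing import Dict, Mapping


def perturb_weight(
    weights: Mapping[str, float], layer: str, delta: float
) -> Dict[str, float]:
    """Shift one layer's weight by delta, then rescale all weights to sum to 1.0.

    Assumed behavior (not confirmed by the source): an unknown layer raises
    ValueError, and a large negative delta clamps the weight at 0.0.
    """
    if layer not in weights:
        raise ValueError(f"unknown layer: {layer}")

    perturbed = dict(weights)
    perturbed[layer] = max(0.0, perturbed[layer] + delta)

    total = sum(perturbed.values())
    if total == 0.0:
        raise ValueError("all weights are zero after perturbation")

    # Renormalize so the perturbed weights again sum to 1.0.
    return {name: value / total for name, value in perturbed.items()}


# Illustrative weights only (hypothetical layer names).
base = {"expression": 0.5, "localization": 0.3, "literature": 0.2}
up = perturb_weight(base, "expression", 0.10)
clamped = perturb_weight(base, "literature", -0.50)
```

Each perturbed weight set would then be used to re-score the genes, and the new ranking compared with the baseline ranking (the log names Spearman correlation with a stability threshold of 0.85).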