86 Commits

Author SHA1 Message Date
674a9ae845 docs: add project README (zh/en) and CLAUDE.md
Researcher-facing README explaining pipeline rationale, six evidence
layers with scientific basis, scoring methodology, and step-by-step
execution guide for CLI newcomers. CLAUDE.md provides development
context for future Claude Code sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 04:50:02 +08:00
6605ff0f2b fix: resolve runtime bugs for pipeline execution on Python 3.14 + latest deps
- gene_mapping: wrap mygene fetch_all generator in list() to fix len() error
- gene_mapping: raise MAX_EXPECTED_GENES to 23000 (mygene DB growth)
- setup_cmd: rename gene_universe columns to gene_id/gene_symbol for
  consistency with all downstream evidence layer code
- gnomad: handle missing coverage columns in v4.1 constraint TSV
- expression: fix HPA URL (v23.proteinatlas.org) and GTEx URL (v8 path)
- expression: adapt to Polars pivot() API change (columns -> on); call collect() before pivoting
- expression: handle missing GTEx tissues (Eye - Retina not in v8)
- expression: ensure all expected columns exist even when sources unavailable
- expression/load: safely check column existence before filtering
- localization: fix HPA subcellular URL to v23
- animal_models: call response.read() on httpx streamed responses before accessing .text
- animal_models: increase infer_schema_length for HCOP and MGI TSV parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 03:44:01 +08:00
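
The first gene_mapping fix above (wrapping a generator in list() before calling len()) can be reproduced without mygene. This is a minimal sketch: `fake_fetch_all` is a hypothetical stand-in for any client call that yields records lazily, not the mygene API itself.

```python
from typing import Iterator


def fake_fetch_all() -> Iterator[dict]:
    """Hypothetical stand-in for a client call that yields records lazily."""
    for i in range(3):
        yield {"gene_id": f"ENSG{i:011d}", "gene_symbol": f"GENE{i}"}


def count_records_safely() -> int:
    # Materialize the generator first; generators have no __len__.
    records = list(fake_fetch_all())
    return len(records)


def len_on_generator_fails() -> bool:
    # Demonstrates the original failure mode: len() on a generator raises TypeError.
    try:
        len(fake_fetch_all())  # type: ignore[arg-type]
    except TypeError:
        return True
    return False
```

Materializing costs memory proportional to the result size, which is presumably acceptable for a gene universe in the low tens of thousands of rows.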
a2ef2125ba chore: complete v1.0 MVP milestone
Archive v1.0 milestone: 6 phases, 21 plans, 40/40 requirements.
Reorganize ROADMAP.md, evolve PROJECT.md, archive requirements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 21:31:43 +08:00
c284804493 docs: milestone v1.0 audit — passed (40/40 requirements) 2026-02-12 10:28:53 +08:00
294b4defbb docs(phase-06): complete phase execution 2026-02-12 04:57:56 +08:00
c97592d629 docs(06-03): complete comprehensive validation report & CLI plan
- Created 06-03-SUMMARY.md documenting plan execution
- Updated STATE.md:
  - Current Position: Phase 6 COMPLETE (21/21 plans)
  - Performance Metrics: Phase 06 3/3 plans, 10 min total, 3.3 min/plan avg
  - Added decisions for comprehensive validation report and weight tuning recommendations
  - Session Continuity: Stopped at 06-03-PLAN.md completion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 04:53:05 +08:00
5879ae9768 test(06-03): unit tests for all validation modules
- Created test_validation.py with 13 comprehensive tests
- create_synthetic_scored_db helper: assigns high scores to known genes and low scores to housekeeping genes
- Tests for negative controls:
  - test_compile_housekeeping_genes_structure
  - test_compile_housekeeping_genes_known_genes_present
  - test_validate_negative_controls_with_synthetic_data
  - test_validate_negative_controls_inverted_logic
- Tests for recall@k:
  - test_compute_recall_at_k
- Tests for sensitivity/perturbation:
  - test_perturb_weight_renormalizes
  - test_perturb_weight_large_negative
  - test_perturb_weight_invalid_layer
- Tests for validation report:
  - test_generate_comprehensive_validation_report_format
  - test_recommend_weight_tuning_all_pass
  - test_recommend_weight_tuning_positive_fail
  - test_recommend_weight_tuning_negative_fail
  - test_recommend_weight_tuning_sensitivity_fail
- All tests pass using synthetic DuckDB with tmp_path isolation
2026-02-12 04:50:17 +08:00
10f19f89f4 feat(06-03): comprehensive validation report and CLI validate command
- Created validation_report.py with comprehensive report generation
  - generate_comprehensive_validation_report: combines positive, negative, sensitivity
  - recommend_weight_tuning: provides targeted weight adjustment recommendations
  - save_validation_report: persists report to file
- Created validate_cmd.py CLI command following score_cmd.py pattern
  - Orchestrates positive controls, negative controls, sensitivity analysis
  - Options: --force, --skip-sensitivity, --output-dir, --top-n
  - Styled output with click.echo patterns
  - Provenance tracking for all validation steps
- Updated main.py to register validate command
- Updated scoring.__init__.py to export validation_report functions
2026-02-12 04:48:25 +08:00
2d29f43848 docs(06-02): complete sensitivity analysis plan
- Create SUMMARY.md with implementation details and verification results
- Update STATE.md: progress 100% (20/20 plans), plan 06-02 complete
- Record decisions: perturbation deltas, stability threshold, renormalization
- All tasks completed with 2 commits in 3 minutes
2026-02-12 04:44:13 +08:00
a2d6e97acf docs(06-01): complete negative controls and recall@k validation plan
Summary:
- Created negative_controls.py with 13 housekeeping genes
- Added recall@k metrics (absolute and percentage thresholds)
- Added per-source breakdown for OMIM vs SYSCILIA
- Updated STATE.md with Phase 6 progress and decisions

Plan Summary: .planning/phases/06-validation/06-01-SUMMARY.md
2026-02-12 04:43:59 +08:00
0084a67fba feat(06-02): export sensitivity analysis module from scoring package
- Add sensitivity module imports and exports
- Preserve existing negative_controls exports from Plan 01
- All sensitivity functions and constants now importable from usher_pipeline.scoring
2026-02-12 04:41:27 +08:00
0f615c0d53 feat(06-01): add recall@k metrics and extended positive control validation
- compute_recall_at_k() measures recall at absolute (100, 500, 1000, 2000) and percentage (5%, 10%, 20%) thresholds
- validate_positive_controls_extended() combines percentile + recall + per-source metrics
- Per-source breakdown separates OMIM Usher from SYSCILIA SCGS v2 validation
- Export all new functions in __init__.py including negative control imports
2026-02-12 04:41:00 +08:00
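
For readers unfamiliar with the metric in the commit above, recall@k can be sketched in a few lines. This is an illustration under stated assumptions, not the project's `compute_recall_at_k()`: the function names, the input shapes (a best-first ranked list of gene symbols and a set of known genes), and the choice to count known genes missing from the ranking as misses are all assumptions.

```python
from typing import Iterable, Sequence


def recall_at_k_sketch(ranked_genes: Sequence[str], known_genes: Iterable[str], k: int) -> float:
    """Fraction of known genes that appear among the top-k ranked genes."""
    known = set(known_genes)
    if not known:
        raise ValueError("known_genes must not be empty")
    top_k = set(ranked_genes[:k])
    return len(known & top_k) / len(known)


def recall_at_fraction_sketch(
    ranked_genes: Sequence[str], known_genes: Iterable[str], fraction: float
) -> float:
    """Recall with k expressed as a fraction of the ranked list (0.05 for the top 5%)."""
    k = max(1, int(len(ranked_genes) * fraction))
    return recall_at_k_sketch(ranked_genes, known_genes, k)
```

An absolute threshold answers "how many known genes would a fixed-size follow-up list capture", while a percentage threshold stays comparable if the gene universe grows.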
a7589d9bf1 feat(06-02): implement sensitivity analysis module with weight perturbation and Spearman correlation
- Add perturb_weight() function with renormalization to maintain sum=1.0
- Add run_sensitivity_analysis() for parameter sweep across all layers and deltas
- Add summarize_sensitivity() for stability classification
- Add generate_sensitivity_report() for human-readable output
- Default perturbations: ±5% and ±10%, with a Spearman correlation stability threshold of 0.85
2026-02-12 04:40:21 +08:00
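
Weight perturbation with renormalization, as described in the commit above, might look roughly like the following. This is a sketch under assumptions: the delta is treated as additive, a perturbed weight is clamped at zero, and the function name and dict-based weights are hypothetical rather than taken from the project's `perturb_weight()`.

```python
def perturb_weight_sketch(weights: dict[str, float], layer: str, delta: float) -> dict[str, float]:
    """Shift one layer's weight by delta, then rescale all weights to sum to 1.0."""
    if layer not in weights:
        raise ValueError(f"unknown layer: {layer}")
    perturbed = dict(weights)
    # Assumption: a large negative delta clamps the weight at zero instead of going negative.
    perturbed[layer] = max(0.0, perturbed[layer] + delta)
    total = sum(perturbed.values())
    if total == 0:
        raise ValueError("all weights are zero after perturbation")
    # Renormalize so the perturbed weights remain a valid convex combination.
    return {name: w / total for name, w in perturbed.items()}
```

Renormalizing keeps composite scores on the same scale across perturbations, so rank correlations between baseline and perturbed runs compare like with like.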
e488ff2d7a feat(06-01): add negative control validation with housekeeping genes
- Create negative_controls.py with 13 curated housekeeping genes
- HOUSEKEEPING_GENES_CORE frozenset with source provenance
- compile_housekeeping_genes() returns DataFrame matching known_genes pattern
- validate_negative_controls() uses PERCENT_RANK with inverted threshold logic
- generate_negative_control_report() provides human-readable output
2026-02-12 04:39:48 +08:00
844295c681 docs(06): create phase plan 2026-02-12 04:33:17 +08:00
ca2b715d8e docs(phase-06): research validation domain with controls and sensitivity analysis 2026-02-12 04:27:26 +08:00
964c1cebe8 docs(phase-05): complete phase execution 2026-02-12 04:15:50 +08:00
00e2836eb9 docs(05-03): complete CLI report command plan
- Create 05-03-SUMMARY.md with comprehensive execution report
- Update STATE.md: Phase 5 complete (3/3 plans), progress 90% (18/20)
- Record decisions: CLI pattern, configurable thresholds, skip flags, graceful degradation
- Document deviations: 2 auto-fixed bugs (tier threshold format, parameter name)
- Update metrics: Phase 05 total 12 min (4.0 min/plan avg)
- All tests pass (9/9 CliRunner integration tests)
2026-02-12 04:09:26 +08:00
c10d59548f test(05-03): add CliRunner integration tests for report command
- Create test_report_cmd.py with 9 comprehensive tests
- Test fixtures: test_config (minimal YAML), populated_db (synthetic scored_genes data)
- Test coverage: help output, file generation, tier counts, visualizations, skip flags, custom thresholds, error handling, custom output directory
- Synthetic data design: 3 HIGH, 5 MEDIUM, 5 LOW, 4 EXCLUDED, 3 NULL composite_score
- All tests pass with isolated tmp_path DuckDB instances
- Fixed report_cmd.py tier threshold format (uppercase keys: HIGH/MEDIUM/LOW, composite_score field)
- Fixed write_candidate_output parameter name (filename_base not base_filename)
2026-02-12 04:07:34 +08:00
2ab25ef5c2 feat(05-03): implement CLI report command
- Create report_cmd.py following established CLI pattern
- Orchestrate full output pipeline: tiering, evidence summary, dual-format output, visualizations, reproducibility reports
- Support --output-dir, --force, --skip-viz, --skip-report flags
- Configurable tier thresholds (--high-threshold, --medium-threshold, --low-threshold, --min-evidence-high, --min-evidence-medium)
- Register report command in main.py CLI entry point
- Follow score_cmd.py pattern: config load, store init, checkpoint check, pipeline steps, summary display, cleanup
- CLI now has 5 commands: setup, evidence, score, report, info
2026-02-12 04:05:52 +08:00
5f14dc2e64 docs(05-02): complete visualization and reproducibility report plan
- Plan 05-02 executed successfully
- 2 tasks completed with 2 commits
- 13 tests passing (6 visualization + 7 reproducibility)
- 4 files created, 2 files modified
- Duration: 5 minutes
- Updated STATE.md with progress (17/20 plans complete, 85%)
2026-02-12 04:03:08 +08:00
434c79c0a8 docs(05-01): complete output generation core plan
- Add 05-01-SUMMARY.md with performance metrics and decisions
- Update STATE.md to Phase 5, Plan 1 of 3 (80% overall progress)
- Record key decisions: configurable tiers, dual-format output, YAML provenance
- Document deviation: pl.count() -> pl.len() deprecation fix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 04:01:24 +08:00
5af63eab46 feat(05-02): implement reproducibility report module with JSON and Markdown output
- Create ReproducibilityReport dataclass with all metadata fields
- Implement generate_reproducibility_report function
- Extract parameters from PipelineConfig (scoring weights, data versions)
- Capture software environment (Python, polars, duckdb versions)
- Build filtering steps from ProvenanceTracker
- Compute tier statistics from tiered DataFrame
- Support optional validation metrics
- to_json: write as indented JSON for machine-readable format
- to_markdown: write with tables and headers for human-readable format
- 7 tests covering all report fields, formats, and edge cases
2026-02-12 04:00:21 +08:00
4e46b488f1 feat(05-01): add dual-format writer with provenance and tests
- Implement TSV+Parquet writer with deterministic output sorting
- Generate YAML provenance sidecar with statistics and metadata
- Add comprehensive unit tests (9 tests covering all functionality)
- Fix deprecated pl.count() -> pl.len() usage
2026-02-12 03:59:26 +08:00
150417ffcc feat(05-02): implement visualization module with matplotlib/seaborn plots
- Add matplotlib>=3.8.0 and seaborn>=0.13.0 to dependencies
- Create visualizations.py with 3 plot functions and orchestrator
- plot_score_distribution: histogram colored by confidence tier
- plot_layer_contributions: bar chart of evidence layer coverage
- plot_tier_breakdown: pie chart of tier distribution
- Use Agg backend for headless/CLI safety
- All plots saved at 300 DPI with proper figure cleanup
- 6 tests covering file creation, edge cases, and return values
2026-02-12 03:57:50 +08:00
d2ef3a2b84 feat(05-01): implement tiering logic and evidence summary module
- Add confidence tier classification (HIGH/MEDIUM/LOW) based on composite_score and evidence_count
- Add supporting_layers and evidence_gaps columns per gene
- Use vectorized polars expressions for performance
- Configurable thresholds for tier assignment
2026-02-12 03:56:42 +08:00
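
Tier classification from composite_score and evidence_count, as described in the commit above, can be sketched as a plain function. Every threshold below is a placeholder chosen for illustration, the "EXCLUDED" fallback and the None-for-NULL behaviour are assumptions, and the project itself reports using vectorized polars expressions rather than a per-row function.

```python
from typing import Optional


def assign_tier_sketch(
    composite_score: Optional[float],
    evidence_count: int,
    *,
    high: float = 0.7,              # placeholder threshold
    medium: float = 0.4,            # placeholder threshold
    low: float = 0.2,               # placeholder threshold
    min_evidence_high: int = 3,     # placeholder minimum layer count
    min_evidence_medium: int = 2,   # placeholder minimum layer count
) -> Optional[str]:
    """Classify a gene by score and by the number of layers with evidence."""
    if composite_score is None:
        return None  # assumption: genes without a score get no tier
    if composite_score >= high and evidence_count >= min_evidence_high:
        return "HIGH"
    if composite_score >= medium and evidence_count >= min_evidence_medium:
        return "MEDIUM"
    if composite_score >= low:
        return "LOW"
    return "EXCLUDED"
```

Requiring a minimum evidence count for the upper tiers guards against a gene reaching HIGH on the strength of a single well-scoring layer.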
6ab7fd1378 docs(05-output-cli): create phase plan 2026-02-11 21:14:37 +08:00
1799906138 docs(05): research phase output-cli domain 2026-02-11 21:07:56 +08:00
de678858cd docs(phase-04): complete phase execution 2026-02-11 21:01:50 +08:00
386fbf51b2 docs(04-03): complete CLI score command and tests plan
- SUMMARY.md: CLI orchestration with checkpoint-restart + 10 comprehensive tests
- STATE.md: Updated position (Phase 4 complete), progress (75%), velocity, decisions
- Duration: 3 minutes, 2 tasks, 4 files (3 created, 1 modified)
2026-02-11 20:56:31 +08:00
a6ad6c6d19 test(04-03): add unit and integration tests for scoring module
- test_scoring.py: 7 unit tests for known genes, weight validation, NULL preservation
- test_scoring_integration.py: 3 integration tests for end-to-end pipeline with synthetic data
- Tests verify NULL handling (genes with no evidence get NULL composite score)
- Tests verify known genes rank highly when given high scores
- Tests verify QC detects missing data above thresholds
- All tests use synthetic data (no external API calls, fast, reproducible)
2026-02-11 20:54:39 +08:00
d57a5f2826 feat(04-03): add CLI score command with checkpoint-restart
- Created score_cmd.py following evidence_cmd.py pattern
- Orchestrates full scoring pipeline: known genes -> composite scores -> QC -> validation
- Options: --force, --skip-qc, --skip-validation for flexible iteration
- Registered score command in main CLI group
- Displays comprehensive summary with quality flag distribution
2026-02-11 20:52:37 +08:00
c501951b0f docs(04-02): complete QC and validation plan
- Add SUMMARY for quality control and positive control validation
- Update STATE.md: Plan 2 of 3 in Phase 04 complete
- Progress: 70% (14/20 plans complete)
- Decisions: scipy MAD outlier detection, PERCENT_RANK validation
2026-02-11 20:50:00 +08:00
70a5d6eff8 feat(04-02): implement positive control validation
- Create validation.py with known gene ranking validation
- validate_known_gene_ranking: PERCENT_RANK window function over all genes
- Computes median percentile, top quartile count/fraction for known genes
- generate_validation_report: human-readable text output with formatted table
- Update __init__.py to export run_qc_checks, validate_known_gene_ranking, generate_validation_report
2026-02-11 20:47:59 +08:00
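
The percentile logic in the commit above relies on the SQL PERCENT_RANK window function, defined as (rank - 1) / (n - 1) with ties sharing the lowest rank. The pure-Python sketch below mirrors that definition so the summary statistics are easy to follow; the function names and the 0.75 top-quartile cutoff are assumptions, and the project computes this in DuckDB rather than in Python.

```python
from statistics import median


def percent_rank_sketch(scores: dict[str, float]) -> dict[str, float]:
    """SQL-style PERCENT_RANK over ascending scores: (rank - 1) / (n - 1)."""
    n = len(scores)
    if n == 0:
        return {}
    if n == 1:
        return {gene: 0.0 for gene in scores}
    ordered = sorted(scores.values())
    first_index: dict[float, int] = {}
    for i, value in enumerate(ordered):
        # The first position of a value in sorted order equals rank - 1 (ties share it).
        first_index.setdefault(value, i)
    return {gene: first_index[score] / (n - 1) for gene, score in scores.items()}


def summarize_known_genes_sketch(scores: dict[str, float], known: set[str]) -> dict[str, float]:
    """Median percentile and top-quartile share for known genes present in scores."""
    pct = percent_rank_sketch(scores)
    known_pct = [pct[g] for g in known if g in pct]
    if not known_pct:
        raise ValueError("no known genes found in scores")
    in_top = sum(1 for p in known_pct if p >= 0.75)  # assumed top-quartile cutoff
    return {
        "median_percentile": median(known_pct),
        "top_quartile_count": in_top,
        "top_quartile_fraction": in_top / len(known_pct),
    }
```

A high median percentile for positive controls suggests the scoring separates known genes from the background; negative controls would be checked with the inequality reversed.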
ba2f97ac55 feat(04-02): implement QC checks for scoring results
- Add scipy>=1.14 dependency for MAD-based outlier detection
- Create quality_control.py with 4 QC functions
- compute_missing_data_rates: NULL rate detection with warn/error thresholds
- compute_distribution_stats: mean/median/std per layer with anomaly detection
- detect_outliers: MAD-based robust outlier detection (>3 MAD)
- run_qc_checks: orchestrator with composite score percentiles
2026-02-11 20:46:57 +08:00
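
MAD-based outlier detection, mentioned in the commit above, flags values that lie more than a multiple of the median absolute deviation from the median. The sketch below uses only the standard library instead of scipy, applies the unscaled MAD, and returns indices; all three choices are assumptions and may differ from the project's `detect_outliers`.

```python
from statistics import median
from typing import Sequence


def mad_outlier_indices(values: Sequence[float], threshold: float = 3.0) -> list[int]:
    """Indices of values more than `threshold` MADs away from the median."""
    if not values:
        return []
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # More than half the values equal the median; the MAD carries no scale information.
        return []
    return [i for i, d in enumerate(deviations) if d > threshold * mad]
```

Unlike a mean and standard deviation rule, the median and MAD are barely moved by the outliers themselves, which is why the approach is described as robust.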
71c4e8f736 docs(04-01): complete known gene compilation and weighted scoring plan
- Known genes: 38 (10 OMIM Usher + 28 SYSCILIA SCGS v2 core)
- ScoringWeights.validate_sum() enforcing weight sum = 1.0
- NULL-preserving weighted average (weighted_sum / available_weight)
- Quality flags based on evidence_count thresholds
- Per-layer contributions for explainability
- 2 tasks, 4 files, 4 min duration
2026-02-11 20:44:09 +08:00
f441e8c1ad feat(04-01): implement multi-evidence weighted scoring integration
- Create join_evidence_layers() with LEFT JOIN preserving NULLs from all 6 evidence tables
- Implement compute_composite_scores() with NULL-preserving weighted average (weighted_sum / available_weight)
- Add quality_flag classification based on evidence_count (sufficient/moderate/sparse/no_evidence)
- Include per-layer contribution columns for explainability
- Add persist_scored_genes() to save scored_genes table to DuckDB
- Log summary stats: coverage, mean/median scores, quality distribution, NULL rates
2026-02-11 20:41:44 +08:00
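
The NULL-preserving weighted average named in the commit above (weighted_sum / available_weight) can be illustrated per gene as follows. This is a sketch: the function name and dict-based inputs are hypothetical, and the project performs the computation over joined DuckDB tables rather than Python dicts.

```python
from typing import Mapping, Optional


def composite_score_sketch(
    layer_scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted average over the layers that have a score; None if no layer does."""
    weighted_sum = 0.0
    available_weight = 0.0
    for layer, weight in weights.items():
        score = layer_scores.get(layer)
        if score is None:
            continue  # a missing layer is skipped, not treated as zero
        weighted_sum += weight * score
        available_weight += weight
    if available_weight == 0:
        return None  # no evidence at all: keep the composite score NULL
    return weighted_sum / available_weight
```

Dividing by the available weight means a gene is not penalized for layers where data is simply missing, while a gene with no evidence at all stays NULL instead of receiving a misleading zero.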
0cd2f7c9dd feat(04-01): implement known gene compilation and ScoringWeights validation
- Create scoring module with OMIM_USHER_GENES (10 genes) and SYSCILIA_SCGS_V2_CORE (28 genes)
- Implement compile_known_genes() returning DataFrame with gene_symbol, source, confidence
- Add load_known_genes_to_duckdb() to persist known genes table
- Add ScoringWeights.validate_sum() method enforcing weight sum constraint (1.0 ± 1e-6)
2026-02-11 20:41:31 +08:00
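
The weight-sum constraint (1.0 ± 1e-6) from the commit above amounts to a tolerance check. The sketch below works on a plain dict with a hypothetical function name; the project's `ScoringWeights.validate_sum()` is a method on its own config class, whose fields are not shown in this log.

```python
import math
from typing import Mapping


def validate_weight_sum_sketch(weights: Mapping[str, float], tolerance: float = 1e-6) -> None:
    """Raise ValueError unless the weights sum to 1.0 within the tolerance."""
    total = math.fsum(weights.values())  # fsum limits floating-point accumulation error
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"weights sum to {total:.6f}, expected 1.0 +/- {tolerance}")
```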
ed21f18a98 fix(03-05): handle NULL columns and deprecated polars API in animal models
- Add NULL/empty column checks in fetch_ortholog_mapping
- Fix NULL handling in filter_sensory_phenotypes with is_not_null guard
- Replace deprecated str.concat with str.join
- Add explicit schema to empty DataFrames for consistency
2026-02-11 20:38:36 +08:00
a52724aff4 docs(04): create phase plan for scoring and integration 2026-02-11 20:31:55 +08:00
32988c631f docs(04): research multi-evidence weighted scoring with NULL preservation 2026-02-11 20:24:42 +08:00
190bedaa80 docs(phase-03): complete phase execution 2026-02-11 19:18:12 +08:00
e72c516669 docs(03-06): complete literature evidence layer
- Created SUMMARY.md with full implementation details
- Updated STATE.md: progress 60%, 12/20 plans complete, Phase 3 complete
- Documented 4 key decisions (tier priority, bias mitigation, context weights, rate limiting)
- All verification criteria met: 17/17 tests pass, CLI functional, bias mitigation validated
- Self-check PASSED: all files and commits verified

Key accomplishments:
- PubMed evidence layer queries per gene across cilia/sensory/cytoskeleton/polarity contexts
- Quality tier classification: direct_experimental > hts_hit > functional_mention > incidental
- Bias mitigation via log2(total_pubmed_count) normalization prevents well-studied genes from dominating
- Novel genes with 10 total/5 cilia publications score higher than TP53-like genes with 100K total/5 cilia
- Biopython Entrez integration with rate limiting (3/sec default, 10/sec with API key)
2026-02-11 19:13:26 +08:00
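
The bias-mitigation claim in the commit above can be checked with a small calculation. The exact formula is not given in this log, so the one below (context count divided by log2 of the total count plus 2, the offset keeping the denominator at least 1) is an assumption used only to illustrate the effect, and the function name is hypothetical.

```python
import math


def literature_score_sketch(context_count: int, total_count: int) -> float:
    """Context-specific publication count damped by log2 of the total count."""
    if context_count < 0 or total_count < 0:
        raise ValueError("publication counts must be non-negative")
    # Assumed normalization: the +2 offset makes the denominator >= 1 even at total_count == 0.
    return context_count / math.log2(total_count + 2)


# Figures from the commit message: 5 cilia publications in both cases.
novel_gene = literature_score_sketch(context_count=5, total_count=10)
well_studied_gene = literature_score_sketch(context_count=5, total_count=100_000)
```

Under this assumed formula the gene with 10 total publications scores higher than the one with 100K, matching the ordering the commit message describes.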
0e89bf0dd6 docs(03-02): complete expression evidence layer plan
- Create 03-02-SUMMARY.md with performance metrics, decisions, and deviations
- Update STATE.md: 5 of 6 plans complete in Phase 03 (03-06 remaining)
- Update progress: 55% complete (11/20 plans across all phases)
- Add key decisions: Tau calculation, expression scoring, CellxGene optional
- Record duration: 12 min for 2 tasks (9 files modified)
- Self-check passed: all files and commits verified

Expression layer provides:
- HPA/GTEx tissue expression with Tau specificity index
- Usher-tissue enrichment scoring (retina, inner ear, cilia)
- Optional CellxGene single-cell integration
- CLI command with checkpoint-restart
- 11 passing unit and integration tests
2026-02-11 19:12:18 +08:00
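
The Tau specificity index mentioned in the commit above is commonly defined as the sum over tissues of (1 - x_i / x_max), divided by (n - 1). It ranges from 0 for uniform expression to 1 for expression in a single tissue. The sketch below follows that standard definition; the function name, the non-negative input assumption, and returning None for all-zero expression are assumptions, not details from the project.

```python
from typing import Optional, Sequence


def tau_specificity_sketch(expression: Sequence[float]) -> Optional[float]:
    """Tau index over non-negative per-tissue expression values."""
    n = len(expression)
    if n < 2:
        raise ValueError("Tau needs at least two tissues")
    x_max = max(expression)
    if x_max <= 0:
        return None  # assumption: no detectable expression, so specificity is undefined
    return sum(1 - x / x_max for x in expression) / (n - 1)
```

For example, a gene expressed only in retina across four tissues gets Tau = 1.0, while one expressed equally everywhere gets 0.0.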
cfe4b830e6 docs(03-03): complete protein features plan with SUMMARY and STATE updates 2026-02-11 19:10:03 +08:00
053f0d926b docs(03-05): complete animal model phenotype evidence layer plan
- SUMMARY.md: Ortholog-mapped animal evidence from MGI/ZFIN/IMPC
- Confidence-weighted scoring (mouse +0.4, zebrafish +0.3, IMPC +0.3)
- 14/14 tests passing: ortholog confidence, keyword filtering, NULL preservation
- Deviations: Schema mismatches, NULL handling, polars deprecations auto-fixed
- Duration: 10 minutes, 2 tasks, 8 files, 2 commits
2026-02-11 19:08:45 +08:00
d8009f1236 docs(03-04): complete subcellular localization evidence layer
- Created SUMMARY.md with full implementation details
- Updated STATE.md: progress 40%, 8/20 plans complete
- Documented 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting)
- All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete
2026-02-11 19:08:01 +08:00
46059874f2 feat(03-03): implement protein evidence layer with UniProt/InterPro integration
- Create protein features data model with domain, coiled-coil, TM, cilia motifs
- Implement fetch.py with UniProt REST API and InterPro API queries
- Implement transform.py with feature extraction, motif detection, normalization
- Implement load.py with DuckDB persistence and provenance tracking
- Add CLI protein command following evidence layer pattern
- Add comprehensive unit and integration tests (all passing)
- Handle NULL preservation and List(Null) edge case
- Add get_steps() method to ProvenanceTracker for test compatibility
2026-02-11 19:07:30 +08:00
bcd3c4ffbe feat(03-05): add animal model DuckDB loader, CLI, and comprehensive tests
- load.py: DuckDB persistence with provenance tracking, ortholog confidence distribution stats
- CLI animal-models command: checkpoint-restart pattern, top scoring genes display
- 10 unit tests: ortholog confidence scoring, keyword filtering, multi-organism bonus, NULL preservation
- 4 integration tests: full pipeline, checkpoint-restart, provenance tracking, empty phenotype handling
- All tests pass (14/14): validates fetch->transform->load->CLI flow
- Fixed polars deprecations: str.join replaces str.concat, pl.len replaces pl.count
2026-02-11 19:06:49 +08:00
99bc975a2c docs(03-01): complete annotation completeness plan 2026-02-11 19:05:56 +08:00