Commit Graph

13 Commits

Author SHA1 Message Date
a6ad6c6d19 test(04-03): add unit and integration tests for scoring module
- test_scoring.py: 7 unit tests for known genes, weight validation, NULL preservation
- test_scoring_integration.py: 3 integration tests for end-to-end pipeline with synthetic data
- Tests verify NULL handling (genes with no evidence get NULL composite score)
- Tests verify known genes rank highly when given high scores
- Tests verify QC detects missing data above thresholds
- All tests use synthetic data (no external API calls, fast, reproducible)
2026-02-11 20:54:39 +08:00
d8009f1236 docs(03-04): complete subcellular localization evidence layer
- Created SUMMARY.md with full implementation details
- Updated STATE.md: progress 40%, 8/20 plans complete
- Documented 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting)
- All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete
2026-02-11 19:08:01 +08:00
46059874f2 feat(03-03): implement protein evidence layer with UniProt/InterPro integration
- Create protein features data model with domain, coiled-coil, TM, cilia motifs
- Implement fetch.py with UniProt REST API and InterPro API queries
- Implement transform.py with feature extraction, motif detection, normalization
- Implement load.py with DuckDB persistence and provenance tracking
- Add CLI protein command following evidence layer pattern
- Add comprehensive unit and integration tests (all passing)
- Handle NULL preservation and List(Null) edge case
- Add get_steps() method to ProvenanceTracker for test compatibility
2026-02-11 19:07:30 +08:00
bcd3c4ffbe feat(03-05): add animal model DuckDB loader, CLI, and comprehensive tests
- load.py: DuckDB persistence with provenance tracking, ortholog confidence distribution stats
- CLI animal-models command: checkpoint-restart pattern, top scoring genes display
- 10 unit tests: ortholog confidence scoring, keyword filtering, multi-organism bonus, NULL preservation
- 4 integration tests: full pipeline, checkpoint-restart, provenance tracking, empty phenotype handling
- All tests pass (14/14): validates fetch->transform->load->CLI flow
- Fixed polars deprecations: str.join replaces str.concat, pl.len replaces pl.count
2026-02-11 19:06:49 +08:00
942aaf2ec3 feat(03-04): add localization CLI command and comprehensive tests
- Add localization subcommand to evidence command group
- Implement checkpoint-restart pattern for HPA download
- Display summary with evidence type distribution
- Create 17 unit and integration tests (all pass)
- Test HPA parsing, evidence classification, scoring, and DuckDB persistence
- Fix evidence type terminology (computational vs predicted) for consistency
- Mock HTTP calls in integration tests for reproducibility
2026-02-11 19:05:22 +08:00
d70239c4ce feat(03-01): add annotation DuckDB loader, CLI command, and tests
- Create load_to_duckdb with provenance tracking and tier distribution stats
- Add query_poorly_annotated helper to find under-studied genes
- Register `evidence annotation` CLI command with checkpoint-restart pattern
- Add comprehensive unit tests (9 tests) covering GO extraction, NULL handling, tier classification, score normalization, weighting
- Add integration tests (6 tests) for pipeline, idempotency, checkpoint-restart, provenance, queries
- All 15 tests pass with proper NULL preservation and schema validation
2026-02-11 19:03:10 +08:00
56e04e68c2 test(02-02): add comprehensive gnomAD integration tests
- 12 integration tests covering full pipeline: fetch->transform->load->query
- test_full_pipeline_to_duckdb: End-to-end pipeline verification with DuckDB storage
- test_checkpoint_restart_skips_processing: Checkpoint detection works correctly
- test_provenance_recorded: Provenance step records expected details
- test_provenance_sidecar_created: JSON sidecar file creation and structure
- test_query_constrained_genes_filters_correctly: Query returns only measured genes below threshold
- test_null_loeuf_not_in_constrained_results: NULL LOEUF genes excluded from queries
- test_duckdb_schema_has_quality_flag: Schema includes quality_flag with valid values
- test_normalized_scores_in_duckdb: Normalized scores in [0,1] for measured genes, NULL for others
- test_cli_evidence_gnomad_help: CLI help text displays correctly
- test_cli_evidence_gnomad_with_mock: CLI command runs end-to-end with mocked download
- test_idempotent_load_replaces_table: Loading twice replaces table (not appends)
- test_quality_flag_categorization: Quality flags correctly categorize genes

All 70 tests pass (58 existing + 12 new), no regressions
2026-02-11 18:20:59 +08:00
174c4af02d feat(02-01): add gnomAD transform pipeline and comprehensive tests
- Implement filter_by_coverage with quality_flag categorization (measured/incomplete_coverage/no_data)
- Add normalize_scores with LOEUF inversion (lower LOEUF = higher score)
- NULL preservation throughout pipeline (unknown != zero constraint)
- process_gnomad_constraint end-to-end pipeline function
- 15 comprehensive unit tests covering edge cases:
  - NULL handling and preservation
  - Coverage filtering without dropping genes
  - Normalization bounds and inversion
  - Mixed type handling for robust parsing
- Fix column mapping to handle gnomAD v4.x loeuf/loeuf_upper duplication
- All existing tests continue to pass
2026-02-11 18:14:41 +08:00
e4d71d0790 test(01-04): add integration tests verifying module wiring
- test_config_to_store_roundtrip: config -> PipelineStore -> save/load
- test_config_to_provenance: config -> ProvenanceTracker -> sidecar
- test_full_setup_flow_mocked: full setup with mocked mygene (fetch, map, validate, save, provenance)
- test_checkpoint_skip_flow: verify checkpoint-restart skips re-fetch
- test_setup_cli_help: CLI help output verification
- test_info_cli: info command with config display

All tests pass with mocked API calls (no external dependencies).
2026-02-11 16:42:13 +08:00
0200395d9e feat(01-02): create mapping validation gates with tests
- Add MappingValidator with configurable success rate thresholds (min_success_rate, warn_threshold)
- Add validate_gene_universe for gene count, format, and duplicate checks
- Add save_unmapped_report for manual review output
- Implement 15 comprehensive tests with mocked mygene responses (no real API calls)
- Tests cover: successful mapping, notfound handling, uniprot list parsing, batching, validation gates, universe validation
2026-02-11 16:33:36 +08:00
98a1a750dd feat(01-03): create provenance tracker with comprehensive tests
- ProvenanceTracker class for metadata tracking
- Records pipeline version, data source versions, config hash, timestamps
- Sidecar JSON export alongside outputs
- DuckDB _provenance table support
- 13 comprehensive tests (8 DuckDB + 5 provenance)
- All tests pass (12 passed, 1 skipped - pandas)
2026-02-11 16:31:51 +08:00
4204116772 feat(01-01): create base API client with retry and caching
- CachedAPIClient with SQLite persistent cache
- Exponential backoff retry on 429/5xx/network errors (tenacity)
- Rate limiting with skip for cached responses
- from_config classmethod for pipeline integration
- 5 passing tests for cache creation, rate limiting, and config integration
2026-02-11 16:25:46 +08:00
4a80a0398e feat(01-01): create Python package scaffold with config system
- pyproject.toml: installable package with bioinformatics dependencies
- Pydantic config schema with validation (ensembl_release >= 100, directory creation)
- YAML config loader with override support
- Default config with Ensembl 113, gnomAD v4.1
- 5 passing tests for config validation and hashing
2026-02-11 16:24:35 +08:00