- DuckDB persistence: gnomad_constraint table with CREATE OR REPLACE (idempotent) - CLI evidence command: usher-pipeline evidence gnomad with checkpoint-restart - Provenance tracking: records processing steps, saves sidecar JSON - Query helpers: query_constrained_genes validates GCON-03 interpretation - 12 integration tests: end-to-end pipeline, checkpoint, provenance, CLI - Phase 2 complete: Evidence layer pattern established for future sources - Duration: 4 min, 2 tasks, 5 files, 70 tests passing Phase 2 (Prototype Evidence Layer) complete.
79 lines
3.6 KiB
Markdown
79 lines
3.6 KiB
Markdown
# Project State
|
|
|
|
## Project Reference
|
|
|
|
See: .planning/PROJECT.md (updated 2026-02-11)
|
|
|
|
**Core value:** Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable — every gene's inclusion is explainable by specific evidence, and every gap is documented.
|
|
**Current focus:** Phase 1 complete — ready for Phase 2
|
|
|
|
## Current Position
|
|
|
|
Phase: 2 of 6 (Prototype Evidence Layer)
|
|
Plan: 2 of 2 in current phase (phase complete)
|
|
Status: Phase 2 complete - ready for Phase 3
|
|
Last activity: 2026-02-11 — Completed 02-02: gnomAD evidence layer integration (DuckDB persistence, CLI, checkpoint-restart)
|
|
|
|
Progress: [█████░░░░░] 33.3% (2/6 phases complete)
|
|
|
|
## Performance Metrics
|
|
|
|
**Velocity:**
|
|
- Total plans completed: 6
|
|
- Average duration: 3.7 min
|
|
- Total execution time: 0.37 hours
|
|
|
|
**By Phase:**
|
|
|
|
| Phase | Plans | Total | Avg/Plan |
|
|
|-------|-------|-------|----------|
|
|
| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
|
|
| 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan |
|
|
|
|
## Accumulated Context
|
|
|
|
### Decisions
|
|
|
|
Decisions are logged in PROJECT.md Key Decisions table.
|
|
Recent decisions affecting current work:
|
|
|
|
- Python over R/Bioconductor for rich data integration ecosystem
|
|
- Weighted rule-based scoring over ML for explainability
|
|
- Public data only for reproducibility
|
|
- Modular CLI scripts for flexibility during development
|
|
- Virtual environment required for dependency isolation (01-01: PEP 668 externally-managed Python)
|
|
- Auto-creation of directories on config load (01-01: data_dir, cache_dir field validators)
|
|
- [01-02]: Warn on gene count outside 19k-22k range but don't fail (allows for Ensembl version variations)
|
|
- [01-02]: HGNC success rate is primary validation gate (UniProt mapping tracked but not used for pass/fail)
|
|
- [01-02]: Take first UniProt accession when multiple exist (simplifies data model)
|
|
- [01-02]: Mock mygene in tests (avoids rate limits, ensures reproducibility)
|
|
- [01-03]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics)
|
|
- [01-03]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern)
|
|
- [01-04]: Click for CLI framework (standard Python CLI library with excellent UX)
|
|
- [01-04]: Setup command uses checkpoint-restart pattern (gene universe fetch can take minutes)
|
|
- [01-04]: Mock mygene in integration tests (avoids external API dependency, reproducible)
|
|
- [02-01]: httpx over requests for streaming downloads (async-native, cleaner API)
|
|
- [02-01]: structlog for structured logging (JSON-formatted, context-aware)
|
|
- [02-01]: LOEUF normalization with inversion (lower LOEUF = more constrained = higher 0-1 score)
|
|
- [02-01]: Quality flags instead of filtering (preserve all genes with measured/incomplete_coverage/no_data categorization)
|
|
- [02-01]: NULL preservation pattern (unknown constraint != zero constraint, must not be conflated)
|
|
- [02-01]: Lazy polars evaluation (LazyFrame until final collect() for query optimization)
|
|
- [02-02]: load_to_duckdb uses CREATE OR REPLACE for idempotency (safe to re-run)
|
|
- [02-02]: CLI evidence command group for extensibility (future evidence sources follow same pattern)
|
|
- [02-02]: Checkpoint at table level (has_checkpoint checks DuckDB table existence)
|
|
- [02-02]: Integration tests with synthetic fixtures (no external downloads, fast, reproducible)
|
|
|
|
### Pending Todos
|
|
|
|
None yet.
|
|
|
|
### Blockers/Concerns
|
|
|
|
None yet.
|
|
|
|
## Session Continuity
|
|
|
|
Last session: 2026-02-11 - Plan execution
|
|
Stopped at: Completed 02-02-PLAN.md (gnomAD evidence layer integration) - Phase 2 complete
|
|
Resume file: .planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
|