docs(01-02): complete gene ID mapping and validation plan

- Gene universe definition with mygene protein-coding gene retrieval
- Batch Ensembl->HGNC+UniProt mapping with edge case handling
- Validation gates with configurable success rate thresholds
- 15 comprehensive tests with mocked API responses
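
The validation gates named in this commit message can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `validate_mapping`, the record keys `hgnc_symbol` and `uniprot`, and the default threshold of 0.90 are all assumptions made for the example. Only the 19k-22k warning range and the rule that HGNC alone decides pass/fail come from the decisions recorded in this commit.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class MappingReport:
    """Summary of one mapping run (hypothetical structure)."""
    total: int
    hgnc_rate: float
    uniprot_rate: float
    passed: bool


def validate_mapping(records, min_hgnc_rate=0.90, count_range=(19_000, 22_000)):
    """Apply the validation gates to a list of mapping records.

    Each record is assumed to be a dict with optional 'hgnc_symbol'
    and 'uniprot' keys (hypothetical field names).
    """
    total = len(records)
    low, high = count_range

    # Gene count outside the expected range only warns; it never fails the run.
    if not low <= total <= high:
        logger.warning("Gene count %d outside expected range %d-%d", total, low, high)

    hgnc_mapped = sum(1 for r in records if r.get("hgnc_symbol"))
    uniprot_mapped = sum(1 for r in records if r.get("uniprot"))
    hgnc_rate = hgnc_mapped / total if total else 0.0
    uniprot_rate = uniprot_mapped / total if total else 0.0

    # HGNC success rate is the only pass/fail gate; UniProt is tracked only.
    return MappingReport(
        total=total,
        hgnc_rate=hgnc_rate,
        uniprot_rate=uniprot_rate,
        passed=hgnc_rate >= min_hgnc_rate,
    )


sample = [
    {"hgnc_symbol": "TP53", "uniprot": "P04637"},
    {"hgnc_symbol": "BRCA1", "uniprot": None},
    {"hgnc_symbol": None, "uniprot": None},
]
strict = validate_mapping(sample)                      # default threshold
lenient = validate_mapping(sample, min_hgnc_rate=0.5)  # configurable threshold
```

Keeping the threshold as a parameter is what makes the gate configurable: the same records pass or fail depending on the rate the caller asks for, while the UniProt rate is reported without affecting the outcome.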
@@ -12,7 +12,7 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 Phase: 1 of 6 (Data Infrastructure)
 Plan: 3 of 4 in current phase
 Status: Executing
-Last activity: 2026-02-11 — Completed 01-03-PLAN.md (DuckDB persistence and provenance tracking)
+Last activity: 2026-02-11 — Completed 01-02-PLAN.md (Gene ID mapping and validation)

 Progress: [███░░░░░░░] 25.0% (1/6 phases planned, 2/4 plans in phase 1 complete)

@@ -20,14 +20,14 @@ Progress: [███░░░░░░░] 25.0% (1/6 phases planned, 2/4 plans

 **Velocity:**
 - Total plans completed: 2
-- Average duration: 2.5 min
-- Total execution time: 0.08 hours
+- Average duration: 3 min
+- Total execution time: 0.12 hours

 **By Phase:**

 | Phase | Plans | Total | Avg/Plan |
 |-------|-------|-------|----------|
-| 01 - Data Infrastructure | 2/4 | 5 min | 2.5 min/plan |
+| 01 - Data Infrastructure | 2/4 | 7 min | 3.5 min/plan |

 ## Accumulated Context

@@ -42,8 +42,12 @@ Recent decisions affecting current work:
 - Modular CLI scripts for flexibility during development
 - Virtual environment required for dependency isolation (01-01: PEP 668 externally-managed Python)
 - Auto-creation of directories on config load (01-01: data_dir, cache_dir field validators)
-- [Phase 01-data-infrastructure]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics)
-- [Phase 01-data-infrastructure]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern)
+- [01-02]: Warn on gene count outside 19k-22k range but don't fail (allows for Ensembl version variations)
+- [01-02]: HGNC success rate is primary validation gate (UniProt mapping tracked but not used for pass/fail)
+- [01-02]: Take first UniProt accession when multiple exist (simplifies data model)
+- [01-02]: Mock mygene in tests (avoids rate limits, ensures reproducibility)
+- [01-03]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics)
+- [01-03]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern)

 ### Pending Todos

@@ -56,5 +60,5 @@ None yet.
 ## Session Continuity

 Last session: 2026-02-11 - Plan execution
-Stopped at: Completed 01-03-PLAN.md
-Resume file: .planning/phases/01-data-infrastructure/01-03-SUMMARY.md
+Stopped at: Completed 01-02-PLAN.md
+Resume file: .planning/phases/01-data-infrastructure/01-02-SUMMARY.md
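
Two of the [01-02] decisions recorded in this diff, taking the first UniProt accession when several exist and mocking mygene in tests, might look like the sketch below. The helper names `first_uniprot_accession` and `map_genes`, the output keys, and the choice to prefer Swiss-Prot over TrEMBL are assumptions for illustration and are not taken from the repository. The `querymany` call mirrors the mygene client method of that name, but here it is served by a `unittest.mock.Mock`, so no network access or mygene installation is needed.

```python
from unittest.mock import Mock


def first_uniprot_accession(hit):
    """Return a single UniProt accession from one mygene-style hit, or None.

    The 'uniprot' field may hold a string or a list per source; when a list
    is present, only its first entry is kept. Preferring Swiss-Prot over
    TrEMBL is an assumption of this sketch.
    """
    uniprot = hit.get("uniprot") or {}
    value = uniprot.get("Swiss-Prot") or uniprot.get("TrEMBL")
    if isinstance(value, list):
        return value[0] if value else None
    return value


def map_genes(client, ensembl_ids):
    """Map Ensembl gene IDs to HGNC symbol and UniProt accession (hypothetical helper)."""
    hits = client.querymany(
        ensembl_ids,
        scopes="ensembl.gene",
        fields="symbol,uniprot",
        species="human",
    )
    return [
        {
            "ensembl_id": hit["query"],
            "hgnc_symbol": hit.get("symbol"),  # None for unmapped IDs
            "uniprot": first_uniprot_accession(hit),
        }
        for hit in hits
    ]


# A mocked client stands in for mygene, so the example is deterministic.
fake_client = Mock()
fake_client.querymany.return_value = [
    {"query": "ENSG00000141510", "symbol": "TP53",
     "uniprot": {"Swiss-Prot": ["P04637", "A0A000EXAMPLE"]}},  # second entry is made up
    {"query": "ENSG00000012048", "symbol": "BRCA1",
     "uniprot": {"Swiss-Prot": "P38398"}},
    {"query": "ENSG00000000000", "notfound": True},  # unmapped edge case
]

rows = map_genes(fake_client, ["ENSG00000141510", "ENSG00000012048", "ENSG00000000000"])
```

Passing the client in as an argument is what makes the mock straightforward: the test controls every response, which avoids rate limits and keeps results reproducible.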