usher-exploring/.planning/STATE.md
gbanyan 053f0d926b docs(03-05): complete animal model phenotype evidence layer plan
- SUMMARY.md: Ortholog-mapped animal evidence from MGI/ZFIN/IMPC
- Confidence-weighted scoring (mouse +0.4, zebrafish +0.3, IMPC +0.3)
- 14/14 tests passing: ortholog confidence, keyword filtering, NULL preservation
- Deviations: Schema mismatches, NULL handling, polars deprecations auto-fixed
- Duration: 10 minutes, 2 tasks, 8 files, 2 commits
2026-02-11 19:08:45 +08:00

Project State

Project Reference

See: .planning/PROJECT.md (updated 2026-02-11)

Core value: Produce a high-confidence, multi-evidence-backed ranked list of under-studied cilia/Usher candidate genes that is fully traceable: every gene's inclusion is explainable by specific evidence, and every gap is documented.

Current focus: Phase 2 complete, ready for Phase 3

Current Position

Phase: 3 of 6 (Core Evidence Layers)
Plan: 4 of 6 in current phase
Status: In progress - 03-04 complete (subcellular localization)
Last activity: 2026-02-11 - Completed 03-04-PLAN.md (Subcellular Localization evidence layer)

Progress: [████░░░░░░] 40.0% (8/20 plans complete across all phases)

Performance Metrics

Velocity:

  • Total plans completed: 8
  • Average duration: 4.7 min
  • Total execution time: 0.63 hours

By Phase:

| Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------|
| 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
| 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan |
| 03 - Core Evidence Layers | 2/6 | 16 min | 8.0 min/plan |

Phase 03 P05: 10 min, 2 tasks, 8 files
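The velocity figures can be re-derived from the By Phase rows. The sketch below is illustrative only: the `phases` dictionary and the `summarize` helper are hypothetical names, and the numbers are copied from the table and the Progress line. Working from the whole-minute totals gives an overall average of 4.75 min per plan, close to the reported 4.7 min; the small gap is presumably because the table totals are rounded.

```python
# (plans completed, total minutes) per phase, copied from the By Phase table.
phases = {
    "01 - Data Infrastructure": (4, 14),
    "02 - Prototype Evidence Layer": (2, 8),
    "03 - Core Evidence Layers": (2, 16),
}
TOTAL_PLANS = 20  # denominator taken from the Progress line


def summarize(phases, total_plans):
    """Recompute the headline velocity metrics from per-phase totals."""
    plans_done = sum(plans for plans, _ in phases.values())
    minutes = sum(mins for _, mins in phases.values())
    return {
        "plans_done": plans_done,
        "total_hours": round(minutes / 60, 2),
        "avg_min_per_plan": minutes / plans_done,
        "progress_pct": 100 * plans_done / total_plans,
    }


summary = summarize(phases, TOTAL_PLANS)
# Average minutes per plan within each phase (the Avg/Plan column).
per_phase_avg = {name: mins / plans for name, (plans, mins) in phases.items()}
```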

Accumulated Context

Decisions

Decisions are logged in the PROJECT.md Key Decisions table. Recent decisions affecting current work:

  • Python over R/Bioconductor for rich data integration ecosystem
  • Weighted rule-based scoring over ML for explainability
  • Public data only for reproducibility
  • Modular CLI scripts for flexibility during development
  • [01-01]: Virtual environment required for dependency isolation (PEP 668 externally-managed Python)
  • [01-01]: Auto-creation of directories on config load (data_dir, cache_dir field validators)
  • [01-02]: Warn on gene count outside 19k-22k range but don't fail (allows for Ensembl version variations)
  • [01-02]: HGNC success rate is primary validation gate (UniProt mapping tracked but not used for pass/fail)
  • [01-02]: Take first UniProt accession when multiple exist (simplifies data model)
  • [01-02]: Mock mygene in tests (avoids rate limits, ensures reproducibility)
  • [01-03]: DuckDB over SQLite for DataFrame storage (native polars/pandas integration, better analytics)
  • [01-03]: Provenance sidecar files alongside outputs (co-located metadata, bioinformatics standard pattern)
  • [01-04]: Click for CLI framework (standard Python CLI library with excellent UX)
  • [01-04]: Setup command uses checkpoint-restart pattern (gene universe fetch can take minutes)
  • [01-04]: Mock mygene in integration tests (avoids external API dependency, reproducible)
  • [02-01]: httpx over requests for streaming downloads (async-native, cleaner API)
  • [02-01]: structlog for structured logging (JSON-formatted, context-aware)
  • [02-01]: LOEUF normalization with inversion (lower LOEUF = more constrained = higher 0-1 score)
  • [02-01]: Quality flags instead of filtering (preserve all genes with measured/incomplete_coverage/no_data categorization)
  • [02-01]: NULL preservation pattern (unknown constraint != zero constraint, must not be conflated)
  • [02-01]: Lazy polars evaluation (LazyFrame until final collect() for query optimization)
  • [02-02]: load_to_duckdb uses CREATE OR REPLACE for idempotency (safe to re-run)
  • [02-02]: CLI evidence command group for extensibility (future evidence sources follow same pattern)
  • [02-02]: Checkpoint at table level (has_checkpoint checks DuckDB table existence)
  • [02-02]: Integration tests with synthetic fixtures (no external downloads, fast, reproducible)
  • [03-01]: Annotation tier thresholds: Well-annotated requires GO >= 20 AND UniProt >= 4; Partial requires GO >= 5 OR UniProt >= 3
  • [03-01]: Composite annotation score weighting: GO 50%, UniProt 30%, Pathway 20%
  • [03-01]: NULL GO counts treated as zero for tier classification but preserved as NULL in data (conservative assumption)
  • [03-04]: Evidence type terminology standardized to computational (not predicted) for consistency with bioinformatics convention
  • [03-04]: Proteomics absence stored as False (informative negative) vs HPA absence as NULL (unknown/not tested)
  • [03-04]: Curated proteomics reference gene sets (CiliaCarta, Centrosome-DB) embedded as Python constants for simpler deployment
  • [03-04]: Computational evidence (HPA Uncertain/Approved) downweighted to 0.6x vs experimental (Enhanced/Supported, proteomics) at 1.0x
  • [03-05]: Ortholog confidence based on HCOP support count (HIGH: 8+, MEDIUM: 4-7, LOW: 1-3)
  • [03-05]: NULL score for genes without orthologs (preserves NULL pattern)
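
Several of the decisions above are simple rules that can be sketched in a few lines. The following is a minimal illustration in plain Python, not the project's actual code: all function names, the tier labels, and the min-max form of the LOEUF normalization are assumptions made for this example. Only the thresholds, the inversion direction, and the NULL-handling rules come from the decisions listed above.

```python
from typing import Optional


def normalize_loeuf(loeuf: Optional[float], lo: float, hi: float) -> Optional[float]:
    """Invert LOEUF onto a 0-1 scale so that lower LOEUF gives a higher score.

    Min-max scaling over [lo, hi] is an assumption; the decision log only
    states that the score is inverted and lands in 0-1.
    """
    if loeuf is None:
        return None  # NULL preservation: unknown constraint is not zero constraint
    if hi == lo:
        return None  # hypothetical guard against a degenerate range
    return 1.0 - (loeuf - lo) / (hi - lo)


def ortholog_confidence(support_count: Optional[int]) -> Optional[str]:
    """Map an HCOP support count to HIGH (8+), MEDIUM (4-7), or LOW (1-3)."""
    if support_count is None or support_count < 1:
        return None  # no ortholog: keep NULL rather than inventing a tier
    if support_count >= 8:
        return "HIGH"
    if support_count >= 4:
        return "MEDIUM"
    return "LOW"


def annotation_tier(go_count: Optional[int], uniprot_score: Optional[int]) -> str:
    """Classify annotation depth using the 03-01 thresholds.

    NULL GO counts are treated as zero for tiering only; the stored value
    stays NULL. Treating a NULL UniProt score the same way, and the tier
    label strings, are assumptions for this sketch.
    """
    go = go_count if go_count is not None else 0
    uniprot = uniprot_score if uniprot_score is not None else 0
    if go >= 20 and uniprot >= 4:
        return "well_annotated"
    if go >= 5 or uniprot >= 3:
        return "partially_annotated"
    return "poorly_annotated"
```

Returning `None` instead of `0.0` keeps "not measured" distinguishable from "measured and low" when scores are later combined, which is the point of the NULL preservation pattern.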

Pending Todos

None yet.

Blockers/Concerns

None yet.

Session Continuity

Last session: 2026-02-11 - Plan execution
Stopped at: Completed 03-04-PLAN.md (Subcellular Localization evidence layer)
Resume file: .planning/phases/03-core-evidence-layers/03-04-SUMMARY.md