gbanyan 434c79c0a8 docs(05-01): complete output generation core plan

- Add 05-01-SUMMARY.md with performance metrics and decisions
- Update STATE.md to Phase 5, Plan 1 of 3 (80% overall progress)
- Record key decisions: configurable tiers, dual-format output, YAML provenance
- Document deviation: pl.count() -> pl.len() deprecation fix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-12 04:01:24 +08:00


phase: 05-output-cli
plan: 01
subsystem: output
tags: [polars, yaml, tsv, parquet, tiering, evidence-summary]
requires:
  - phase: 04-scoring-integration
    provides: scored_genes DataFrame with composite_score, evidence_count, and layer contributions
provides:
  - "Confidence tier classification (HIGH/MEDIUM/LOW) based on composite_score and evidence_count"
  - "Per-gene evidence summary (supporting_layers and evidence_gaps columns)"
  - "Dual-format TSV+Parquet writer with YAML provenance sidecar"
  - "Comprehensive unit test suite for output module"
affects: [05-02, 05-03, reporting, visualization, downstream-tools]
tech-stack:
  added: [pyyaml]
  patterns: [vectorized-polars-expressions, dual-format-output, provenance-sidecars, deterministic-sorting]
key-files:
  created:
    - src/usher_pipeline/output/tiers.py
    - src/usher_pipeline/output/evidence_summary.py
    - src/usher_pipeline/output/writers.py
    - src/usher_pipeline/output/__init__.py
    - tests/test_output.py
  modified: []
key-decisions:
  - "Configurable tier thresholds (HIGH: score>=0.7 and evidence>=3, MEDIUM: score>=0.4 and evidence>=2, LOW: score>=0.2)"
  - "EXCLUDED genes filtered out (below LOW threshold or NULL composite_score)"
  - "Deterministic sorting (composite_score DESC, gene_id ASC) for reproducible output"
  - "Dual-format TSV+Parquet with identical data for downstream tool compatibility"
  - "YAML provenance sidecar includes statistics (tier counts) and column metadata"
  - "Fixed deprecated pl.count() -> pl.len() usage for polars 0.20.5+ compatibility"
patterns-established:
  - "Vectorized polars when/then/otherwise chains for tier assignment (not row-by-row)"
  - "concat_list + list.drop_nulls + list.join for comma-separated string columns"
  - "Provenance YAML sidecars alongside output files for full traceability"
  - "Deterministic sorting before writing for reproducible output across runs"
duration: 4min
completed: 2026-02-11

Phase 05 Plan 01: Output Generation Core Summary

Tiered candidate classification with supporting/gap evidence tracking and dual-format TSV+Parquet output with YAML provenance sidecars

Performance

Duration: 4 minutes
Started: 2026-02-11T19:55:28Z
Completed: 2026-02-11T19:59:31Z
Tasks: 2
Files modified: 5

Accomplishments

Implemented configurable confidence tier classification (HIGH/MEDIUM/LOW) with filtering of EXCLUDED genes
Added per-gene evidence summary columns (supporting_layers and evidence_gaps) tracking which layers contributed
Created dual-format writer producing identical TSV and Parquet outputs with YAML provenance sidecars
Built a test suite of 9 unit tests covering tiering, evidence summary, and writers (all 9 passing)

Task Commits

Each task was committed atomically:

Task 1: Tiering logic and evidence summary module - d2ef3a2 (feat)
- tiers.py with assign_tiers() and configurable TIER_THRESHOLDS
- evidence_summary.py with add_evidence_summary() and EVIDENCE_LAYERS
- __init__.py with exports
Task 2: Dual-format writer with provenance sidecar and unit tests - 4e46b48 (feat)
- writers.py with write_candidate_output()
- tests/test_output.py with 9 comprehensive tests
- Fixed deprecated pl.count() -> pl.len() usage

Files Created/Modified

src/usher_pipeline/output/tiers.py - Confidence tier assignment (HIGH/MEDIUM/LOW) with configurable thresholds
src/usher_pipeline/output/evidence_summary.py - Per-gene supporting_layers and evidence_gaps columns
src/usher_pipeline/output/writers.py - Dual-format TSV+Parquet writer with YAML provenance sidecar
src/usher_pipeline/output/__init__.py - Package exports
tests/test_output.py - 9 unit tests covering tiering, evidence summary, and writers

Decisions Made

Configurable thresholds: The TIER_THRESHOLDS dictionary holds research-derived defaults and leaves room to expose the thresholds through the CLI later
EXCLUDED filtering: Genes below LOW threshold (score < 0.2) or with NULL composite_score are filtered out before output
Deterministic sorting: Sort by composite_score DESC, gene_id ASC for reproducible output across runs
Dual-format output: TSV for human-readability and tools like Excel; Parquet for efficient large-scale data processing
YAML provenance: Sidecar includes statistics (tier counts), column metadata, and timestamp for full reproducibility tracking
Polars 0.20.5+ compatibility: Replaced deprecated pl.count() with pl.len() to eliminate deprecation warnings

Deviations from Plan

Auto-fixed Issues

1. [Rule 1 - Bug] Fixed deprecated polars API usage

Found during: Task 2 (test execution)
Issue: pl.count() deprecated in polars 0.20.5+, producing warnings
Fix: Replaced all occurrences of pl.count() with pl.len() in tests and writers.py, updated row access from row["count"] to row["len"]
Files modified: tests/test_output.py, src/usher_pipeline/output/writers.py
Verification: Tests run without deprecation warnings
Committed in: 4e46b48 (Task 2 commit)

Total deviations: 1 auto-fixed (1 bug fix)
Impact on plan: Necessary fix for current polars version compatibility. No scope creep.

Issues Encountered

None - plan executed smoothly with only the deprecated API fix needed.

User Setup Required

None - no external service configuration required.

Next Phase Readiness

Output module core complete with tiering, evidence summary, and dual-format writing
Ready for visualization module (05-02) and reproducibility reporting (05-03)
Ready for CLI command integration to generate candidate outputs
All tests pass, no blockers

Phase: 05-output-cli
Completed: 2026-02-11

Self-Check: PASSED

All files and commits verified:

Files created:

✓ src/usher_pipeline/output/tiers.py
✓ src/usher_pipeline/output/evidence_summary.py
✓ src/usher_pipeline/output/writers.py
✓ src/usher_pipeline/output/__init__.py
✓ tests/test_output.py

Commits:

✓ d2ef3a2 (Task 1: Tiering logic and evidence summary module)
✓ 4e46b48 (Task 2: Dual-format writer with provenance and tests)
