usher-exploring/.planning/phases/05-output-cli/05-01-PLAN.md at 150417ffcc1b46401b3d11c151ed6f3566a9c11d

gbanyan 6ab7fd1378 docs(05-output-cli): create phase plan

2026-02-11 21:14:37 +08:00

---
phase: 05-output-cli
plan:
type: execute
wave:
depends_on:
files_modified:
  - src/usher_pipeline/output/__init__.py
  - src/usher_pipeline/output/tiers.py
  - src/usher_pipeline/output/evidence_summary.py
  - src/usher_pipeline/output/writers.py
  - tests/test_output.py
autonomous: true
must_haves:
  truths:
    - "scored_genes DataFrame is classified into HIGH/MEDIUM/LOW/EXCLUDED tiers based on composite_score and evidence_count"
    - "Each candidate gene has a supporting_layers field listing which evidence layers contributed and an evidence_gaps field listing which are NULL"
    - "Output is written in both TSV and Parquet formats with identical data"
    - "Provenance YAML sidecar is generated alongside output files"
  artifacts:
    - path: "src/usher_pipeline/output/tiers.py"
      provides: "Confidence tiering logic"
      exports:
        - assign_tiers
        - TIER_THRESHOLDS
    - path: "src/usher_pipeline/output/evidence_summary.py"
      provides: "Per-gene evidence summary columns"
      exports:
        - add_evidence_summary
    - path: "src/usher_pipeline/output/writers.py"
      provides: "Dual-format TSV+Parquet writer with provenance sidecar"
      exports:
        - write_candidate_output
    - path: "tests/test_output.py"
      provides: "Unit tests for tiering, evidence summary, and writers"
  key_links:
    - from: "src/usher_pipeline/output/tiers.py"
      to: "scored_genes DuckDB table"
      via: "polars DataFrame with composite_score and evidence_count columns"
      pattern: "pl.when.composite_score.evidence_count"
    - from: "src/usher_pipeline/output/writers.py"
      to: "output files"
      via: "polars write_csv (separator=tab) and write_parquet"
      pattern: "write_csv.separator.write_parquet"
---

Create the output generation module: tiered candidate classification, per-gene evidence summary, and dual-format (TSV+Parquet) file writer with provenance sidecars.

Purpose: This is the core data transformation that converts raw scored_genes into the pipeline's primary deliverable -- a tiered, annotated candidate list. All downstream reporting and visualization depend on this module.

Output: src/usher_pipeline/output/ package with tiers.py, evidence_summary.py, writers.py and unit tests.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @src/usher_pipeline/scoring/integration.py @src/usher_pipeline/scoring/quality_control.py @src/usher_pipeline/persistence/provenance.py @src/usher_pipeline/persistence/duckdb_store.py @src/usher_pipeline/config/schema.py

Task 1: Tiering logic and evidence summary module

Files: src/usher_pipeline/output/__init__.py, src/usher_pipeline/output/tiers.py, src/usher_pipeline/output/evidence_summary.py

Create `src/usher_pipeline/output/` package directory.

tiers.py: Create tiering module with configurable thresholds.

Define TIER_THRESHOLDS as a dict with defaults from research:

- HIGH: composite_score >= 0.7 AND evidence_count >= 3
- MEDIUM: composite_score >= 0.4 AND evidence_count >= 2
- LOW: composite_score >= 0.2 (any evidence_count)
- Everything else: EXCLUDED (filtered out)

Implement `assign_tiers(scored_df: pl.DataFrame, thresholds: dict | None = None) -> pl.DataFrame`:

- Accepts polars DataFrame with columns: gene_id, gene_symbol, composite_score, evidence_count, quality_flag, all 6 layer score columns, all 6 contribution columns
- Uses pl.when/then/otherwise chain (vectorized, not row-by-row) to add confidence_tier column
- Filters OUT rows where confidence_tier == "EXCLUDED"
- Sorts by composite_score DESC (deterministic: break ties by gene_id ASC)
- Returns DataFrame with confidence_tier column added

Allow thresholds parameter to override defaults (for CLI configurability later).

evidence_summary.py: Create evidence summary module.

Define the 6 evidence layer names as a constant list: EVIDENCE_LAYERS = ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]

Implement `add_evidence_summary(df: pl.DataFrame) -> pl.DataFrame`:

- For each layer in EVIDENCE_LAYERS, checks if {layer}_score column is not null
- Adds supporting_layers column: comma-separated list of layer names where score is NOT NULL (e.g., "gnomad,expression,annotation")
- Adds evidence_gaps column: comma-separated list of layer names where score IS NULL (e.g., "localization,animal_model,literature")
- Uses polars expressions (pl.concat_str or equivalent) -- do NOT convert to pandas
- Handles edge case: gene with all NULLs -> supporting_layers="" and evidence_gaps="gnomad,expression,annotation,localization,animal_model,literature"

`__init__.py`: Export assign_tiers, TIER_THRESHOLDS, and add_evidence_summary. The write_candidate_output export is added in Task 2, once writers.py exists; importing it before then would raise ImportError and break the verification commands for this task.

Verify: run `python -c "from usher_pipeline.output.tiers import assign_tiers, TIER_THRESHOLDS; print('tiers OK')"` and `python -c "from usher_pipeline.output.evidence_summary import add_evidence_summary, EVIDENCE_LAYERS; print('summary OK')"`.

Done: assign_tiers() classifies genes into HIGH/MEDIUM/LOW tiers using configurable thresholds; add_evidence_summary() adds supporting_layers and evidence_gaps columns; both functions operate on polars DataFrames using expressions only, with no conversion to pandas.

Task 2: Dual-format writer with provenance sidecar and unit tests

Files: src/usher_pipeline/output/writers.py, src/usher_pipeline/output/__init__.py, tests/test_output.py

**writers.py**: Create dual-format output writer.

Implement `write_candidate_output(df: pl.DataFrame | pl.LazyFrame, output_dir: Path, filename_base: str = "candidates") -> dict`:

- Collects the input first if it is a LazyFrame (both DataFrame and LazyFrame input are accepted)
- Writes TSV: {output_dir}/{filename_base}.tsv using df.write_csv(path, separator="\t", include_header=True)
- Writes Parquet: {output_dir}/{filename_base}.parquet using df.write_parquet(path, compression="snappy", use_pyarrow=True)
- Writes provenance YAML sidecar: {output_dir}/{filename_base}.provenance.yaml containing:
  - generated_at (ISO timestamp)
  - output_files: [tsv filename, parquet filename]
  - statistics: total_candidates, high_count, medium_count, low_count (counted from confidence_tier column)
  - column_count and column_names for downstream tool compatibility verification
- Uses pyyaml (already in dependencies) for YAML output
- Creates output_dir if it doesn't exist
- Returns dict with paths: {"tsv": Path, "parquet": Path, "provenance": Path}

Ensure deterministic output: sort by composite_score DESC, gene_id ASC before writing (avoid non-deterministic ordering pitfall from research).
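The sidecar contents listed above could be assembled as a plain dict before serialization. The helper below is hypothetical (`build_provenance` is not named in the plan), and grouping the counts under a `statistics` key is an assumption based on the field list; it uses only the standard library so the payload can be tested without touching the filesystem.

```python
from collections import Counter
from datetime import datetime, timezone


def build_provenance(tiers, column_names, tsv_name, parquet_name):
    """Return the provenance payload for one pair of output files.

    tiers is the list of confidence_tier values that were written;
    column_names is the ordered list of output columns.
    """
    counts = Counter(tiers)
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "output_files": [tsv_name, parquet_name],
        "statistics": {
            "total_candidates": len(tiers),
            # Counter returns 0 for tiers that do not occur.
            "high_count": counts["HIGH"],
            "medium_count": counts["MEDIUM"],
            "low_count": counts["LOW"],
        },
        "column_count": len(column_names),
        "column_names": list(column_names),
    }
```

Inside write_candidate_output, a dict like this could then be written with `yaml.safe_dump(payload, fh, sort_keys=False)`, with `tiers` taken from `df["confidence_tier"].to_list()` and `column_names` from `df.columns`.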

Update `__init__.py`: Add write_candidate_output to exports.

tests/test_output.py: Create comprehensive unit tests.

Use tmp_path pytest fixture. Create synthetic scored_genes DataFrame with ~20 rows spanning all tiers:

- 3 genes with score >= 0.7, evidence_count >= 3 (HIGH)
- 5 genes with score 0.4-0.69, evidence_count >= 2 (MEDIUM)
- 5 genes with score 0.2-0.39 (LOW)
- 3 genes with score < 0.2 (EXCLUDED -- should be filtered out)
- 4 genes with NULL composite_score (no evidence)
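The 20-row fixture described above might be generated from a small group table. The sketch below is illustrative only: the helper name, the specific scores, the gene_id format, and an evidence_count of 0 for the NULL-score genes are all assumptions, and the layer score and contribution columns are omitted for brevity.

```python
def make_scored_rows():
    """Build 20 synthetic rows covering every tier plus NULL scores."""
    # (number of genes, composite_score, evidence_count)
    groups = [
        (3, 0.85, 4),  # HIGH
        (5, 0.55, 2),  # MEDIUM
        (5, 0.30, 1),  # LOW
        (3, 0.10, 1),  # EXCLUDED, expected to be filtered out
        (4, None, 0),  # NULL composite_score, no evidence
    ]
    rows = []
    for n_genes, score, evidence in groups:
        for _ in range(n_genes):
            idx = len(rows) + 1
            rows.append(
                {
                    "gene_id": f"ENSG{idx:011d}",  # hypothetical identifier format
                    "gene_symbol": f"GENE{idx}",
                    "composite_score": score,
                    "evidence_count": evidence,
                }
            )
    return rows
```

Passing the result to `pl.DataFrame(make_scored_rows())` would give the test input. Scores are deliberately identical within each group, so the gene_id tie-break in the sort is exercised as well.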

Tests:

- test_assign_tiers_default_thresholds: Verify correct tier assignment counts, EXCLUDED genes removed
- test_assign_tiers_custom_thresholds: Override thresholds, verify different classification
- test_assign_tiers_sorting: Verify output sorted by composite_score DESC
- test_add_evidence_summary_supporting_layers: Gene with 3 non-NULL scores has 3 layers listed
- test_add_evidence_summary_gaps: Gene with all NULL scores has all 6 layers as gaps
- test_write_candidate_output_creates_files: TSV, Parquet, and provenance.yaml all created
- test_write_candidate_output_tsv_readable: Read back TSV with polars, verify columns and row count match
- test_write_candidate_output_parquet_readable: Read back Parquet, verify schema matches
- test_write_candidate_output_provenance_yaml: Parse YAML, verify statistics match

Verify: run `cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_output.py -v`.

Done: All 9+ tests pass. TSV and Parquet outputs contain the same data (same columns, rows, and values when read back). Provenance YAML contains accurate statistics. Tiering correctly classifies and filters genes. Evidence summary correctly identifies supporting layers and gaps.

- `python -c "from usher_pipeline.output import assign_tiers, add_evidence_summary, write_candidate_output; print('All exports OK')"` succeeds
- `python -m pytest tests/test_output.py -v` -- all tests pass
- Synthetic data round-trips through tier assignment, evidence summary, and dual-format writing without errors

<success_criteria>

- Tiering logic classifies scored genes into HIGH/MEDIUM/LOW confidence tiers using composite_score and evidence_count thresholds
- Evidence summary adds supporting_layers and evidence_gaps columns per gene
- Writer produces identical data in TSV and Parquet formats with provenance YAML sidecar
- All unit tests pass

</success_criteria>

After completion, create `.planning/phases/05-output-cli/05-01-SUMMARY.md`
