docs(05-output-cli): create phase plan
.planning/phases/05-output-cli/05-01-PLAN.md (new file, 189 lines)

@@ -0,0 +1,189 @@
---
phase: 05-output-cli
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/output/__init__.py
  - src/usher_pipeline/output/tiers.py
  - src/usher_pipeline/output/evidence_summary.py
  - src/usher_pipeline/output/writers.py
  - tests/test_output.py
autonomous: true

must_haves:
  truths:
    - "scored_genes DataFrame is classified into HIGH/MEDIUM/LOW/EXCLUDED tiers based on composite_score and evidence_count"
    - "Each candidate gene has a supporting_layers field listing which evidence layers contributed and an evidence_gaps field listing which are NULL"
    - "Output is written in both TSV and Parquet formats with identical data"
    - "Provenance YAML sidecar is generated alongside output files"
  artifacts:
    - path: "src/usher_pipeline/output/tiers.py"
      provides: "Confidence tiering logic"
      exports: ["assign_tiers", "TIER_THRESHOLDS"]
    - path: "src/usher_pipeline/output/evidence_summary.py"
      provides: "Per-gene evidence summary columns"
      exports: ["add_evidence_summary"]
    - path: "src/usher_pipeline/output/writers.py"
      provides: "Dual-format TSV+Parquet writer with provenance sidecar"
      exports: ["write_candidate_output"]
    - path: "tests/test_output.py"
      provides: "Unit tests for tiering, evidence summary, and writers"
  key_links:
    - from: "src/usher_pipeline/output/tiers.py"
      to: "scored_genes DuckDB table"
      via: "polars DataFrame with composite_score and evidence_count columns"
      pattern: "pl\\.when.*composite_score.*evidence_count"
    - from: "src/usher_pipeline/output/writers.py"
      to: "output files"
      via: "polars write_csv (separator=tab) and write_parquet"
      pattern: "write_csv.*separator.*write_parquet"
---

<objective>
Create the output generation module: tiered candidate classification, per-gene evidence summary, and dual-format (TSV+Parquet) file writer with provenance sidecars.

Purpose: This is the core data transformation that converts raw scored_genes into the pipeline's primary deliverable -- a tiered, annotated candidate list. All downstream reporting and visualization depend on this module.
Output: `src/usher_pipeline/output/` package with tiers.py, evidence_summary.py, writers.py and unit tests.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@src/usher_pipeline/scoring/integration.py
@src/usher_pipeline/scoring/quality_control.py
@src/usher_pipeline/persistence/provenance.py
@src/usher_pipeline/persistence/duckdb_store.py
@src/usher_pipeline/config/schema.py
</context>

<tasks>

<task type="auto">
<name>Task 1: Tiering logic and evidence summary module</name>
<files>
src/usher_pipeline/output/__init__.py
src/usher_pipeline/output/tiers.py
src/usher_pipeline/output/evidence_summary.py
</files>
<action>
Create `src/usher_pipeline/output/` package directory.

**tiers.py**: Create tiering module with configurable thresholds.

Define `TIER_THRESHOLDS` as a dict with defaults from research:
- HIGH: composite_score >= 0.7 AND evidence_count >= 3
- MEDIUM: composite_score >= 0.4 AND evidence_count >= 2
- LOW: composite_score >= 0.2 (any evidence_count)
- Everything else: EXCLUDED (filtered out)

Implement `assign_tiers(scored_df: pl.DataFrame, thresholds: dict | None = None) -> pl.DataFrame`:
- Accepts polars DataFrame with columns: gene_id, gene_symbol, composite_score, evidence_count, quality_flag, all 6 layer score columns, all 6 contribution columns
- Uses pl.when/then/otherwise chain (vectorized, not row-by-row) to add `confidence_tier` column
- Filters OUT rows where confidence_tier == "EXCLUDED"
- Sorts by composite_score DESC (deterministic: break ties by gene_id ASC)
- Returns DataFrame with confidence_tier column added

Allow the thresholds parameter to override the defaults (for CLI configurability later).
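
A minimal sketch of what tiers.py could look like under the defaults above (the threshold-dict layout and exact column names are assumptions of this plan, not a verified implementation; the override here swaps the whole dict rather than merging keys):

```python
# tiers.py -- illustrative sketch only; names follow the plan's assumptions.
import polars as pl

TIER_THRESHOLDS = {
    "HIGH": {"composite_score": 0.7, "evidence_count": 3},
    "MEDIUM": {"composite_score": 0.4, "evidence_count": 2},
    "LOW": {"composite_score": 0.2, "evidence_count": 0},
}


def assign_tiers(scored_df: pl.DataFrame, thresholds: dict | None = None) -> pl.DataFrame:
    t = thresholds or TIER_THRESHOLDS  # whole-dict override in this sketch
    tier_expr = (
        pl.when(
            (pl.col("composite_score") >= t["HIGH"]["composite_score"])
            & (pl.col("evidence_count") >= t["HIGH"]["evidence_count"])
        )
        .then(pl.lit("HIGH"))
        .when(
            (pl.col("composite_score") >= t["MEDIUM"]["composite_score"])
            & (pl.col("evidence_count") >= t["MEDIUM"]["evidence_count"])
        )
        .then(pl.lit("MEDIUM"))
        .when(pl.col("composite_score") >= t["LOW"]["composite_score"])
        .then(pl.lit("LOW"))
        # NULL composite_score fails every comparison and falls through to EXCLUDED.
        .otherwise(pl.lit("EXCLUDED"))
    )
    return (
        scored_df.with_columns(tier_expr.alias("confidence_tier"))
        .filter(pl.col("confidence_tier") != "EXCLUDED")
        .sort(["composite_score", "gene_id"], descending=[True, False])
    )
```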

**evidence_summary.py**: Create evidence summary module.

Define the 6 evidence layer names as a constant list: EVIDENCE_LAYERS = ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]

Implement `add_evidence_summary(df: pl.DataFrame) -> pl.DataFrame` (a sketch follows the list below):
- For each layer in EVIDENCE_LAYERS, checks if `{layer}_score` column is not null
- Adds `supporting_layers` column: comma-separated list of layer names where score is NOT NULL (e.g., "gnomad,expression,annotation")
- Adds `evidence_gaps` column: comma-separated list of layer names where score IS NULL (e.g., "localization,animal_model,literature")
- Uses polars expressions (pl.concat_str or equivalent) -- do NOT convert to pandas
- Handles edge case: gene with all NULLs -> supporting_layers="" and evidence_gaps="gnomad,expression,annotation,localization,animal_model,literature"
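
A sketch of those summary expressions, assuming the `{layer}_score` naming above and a polars version where `pl.concat_str` accepts `ignore_nulls`:

```python
# evidence_summary.py -- illustrative sketch; column names assume the {layer}_score convention.
import polars as pl

EVIDENCE_LAYERS = ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]


def add_evidence_summary(df: pl.DataFrame) -> pl.DataFrame:
    # For each layer, emit the layer name when its score is present (or missing), else null,
    # then join the non-null names with commas.
    supporting = [
        pl.when(pl.col(f"{layer}_score").is_not_null()).then(pl.lit(layer)).otherwise(None)
        for layer in EVIDENCE_LAYERS
    ]
    gaps = [
        pl.when(pl.col(f"{layer}_score").is_null()).then(pl.lit(layer)).otherwise(None)
        for layer in EVIDENCE_LAYERS
    ]
    return df.with_columns(
        # fill_null("") guards the all-null edge case so the column is "" rather than null.
        pl.concat_str(supporting, separator=",", ignore_nulls=True).fill_null("").alias("supporting_layers"),
        pl.concat_str(gaps, separator=",", ignore_nulls=True).fill_null("").alias("evidence_gaps"),
    )
```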

**__init__.py**: Export assign_tiers, TIER_THRESHOLDS, add_evidence_summary, and write_candidate_output (from writers.py created in Task 2).
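
The package init could then be as simple as this sketch (module paths assume the layout in files_modified):

```python
# src/usher_pipeline/output/__init__.py -- re-export the package's public API.
from usher_pipeline.output.evidence_summary import add_evidence_summary
from usher_pipeline.output.tiers import TIER_THRESHOLDS, assign_tiers
from usher_pipeline.output.writers import write_candidate_output

__all__ = ["TIER_THRESHOLDS", "add_evidence_summary", "assign_tiers", "write_candidate_output"]
```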
</action>
<verify>
Run: `python -c "from usher_pipeline.output.tiers import assign_tiers, TIER_THRESHOLDS; print('tiers OK')"` and `python -c "from usher_pipeline.output.evidence_summary import add_evidence_summary, EVIDENCE_LAYERS; print('summary OK')"`
</verify>
<done>
assign_tiers() classifies genes into HIGH/MEDIUM/LOW tiers using configurable thresholds; add_evidence_summary() adds supporting_layers and evidence_gaps columns; both functions operate on polars DataFrames via vectorized expressions (no pandas conversion or row-wise loops).
</done>
</task>

<task type="auto">
<name>Task 2: Dual-format writer with provenance sidecar and unit tests</name>
<files>
src/usher_pipeline/output/writers.py
src/usher_pipeline/output/__init__.py
tests/test_output.py
</files>
<action>
**writers.py**: Create dual-format output writer.

Implement `write_candidate_output(df: pl.DataFrame, output_dir: Path, filename_base: str = "candidates") -> dict`:
- Collects LazyFrame if needed (handle both DataFrame and LazyFrame input)
- Writes TSV: `{output_dir}/{filename_base}.tsv` using `df.write_csv(path, separator="\t", include_header=True)`
- Writes Parquet: `{output_dir}/{filename_base}.parquet` using `df.write_parquet(path, compression="snappy", use_pyarrow=True)`
- Writes provenance YAML sidecar: `{output_dir}/{filename_base}.provenance.yaml` containing:
  - generated_at (ISO timestamp)
  - output_files: [tsv filename, parquet filename]
  - statistics: total_candidates, high_count, medium_count, low_count (counted from confidence_tier column)
  - column_count and column_names for downstream tool compatibility verification
- Uses pyyaml (already in dependencies) for YAML output
- Creates output_dir if it doesn't exist
- Returns dict with paths: {"tsv": Path, "parquet": Path, "provenance": Path}

Ensure deterministic output: sort by composite_score DESC, gene_id ASC before writing (avoiding the non-deterministic ordering pitfall noted in research).
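
A sketch of the writer following the bullets above (the provenance keys and the direct yaml.safe_dump call are assumptions of this sketch; the real module may instead reuse helpers from persistence/provenance.py):

```python
# writers.py -- illustrative sketch; mirrors the bullets above, not the final implementation.
from datetime import datetime, timezone
from pathlib import Path

import polars as pl
import yaml


def write_candidate_output(df: pl.DataFrame, output_dir: Path, filename_base: str = "candidates") -> dict:
    if isinstance(df, pl.LazyFrame):  # accept LazyFrame as well as DataFrame
        df = df.collect()
    # Deterministic row order before writing.
    df = df.sort(["composite_score", "gene_id"], descending=[True, False])

    output_dir.mkdir(parents=True, exist_ok=True)
    tsv_path = output_dir / f"{filename_base}.tsv"
    parquet_path = output_dir / f"{filename_base}.parquet"
    provenance_path = output_dir / f"{filename_base}.provenance.yaml"

    df.write_csv(tsv_path, separator="\t", include_header=True)
    df.write_parquet(parquet_path, compression="snappy", use_pyarrow=True)

    # Count rows per tier for the statistics block.
    tier_counts = dict(df.group_by("confidence_tier").len().iter_rows())
    provenance = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "output_files": [tsv_path.name, parquet_path.name],
        "statistics": {
            "total_candidates": df.height,
            "high_count": tier_counts.get("HIGH", 0),
            "medium_count": tier_counts.get("MEDIUM", 0),
            "low_count": tier_counts.get("LOW", 0),
        },
        "column_count": df.width,
        "column_names": df.columns,
    }
    provenance_path.write_text(yaml.safe_dump(provenance, sort_keys=False))

    return {"tsv": tsv_path, "parquet": parquet_path, "provenance": provenance_path}
```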

**Update __init__.py**: Add write_candidate_output to exports.

**tests/test_output.py**: Create comprehensive unit tests.

Use tmp_path pytest fixture. Create synthetic scored_genes DataFrame with ~20 rows spanning all tiers:
- 3 genes with score >= 0.7, evidence_count >= 3 (HIGH)
- 5 genes with score 0.4-0.69, evidence_count >= 2 (MEDIUM)
- 5 genes with score 0.2-0.39 (LOW)
- 3 genes with score < 0.2 (EXCLUDED -- should be filtered out)
- 4 genes with NULL composite_score (no evidence)

Tests (a sketch of the fixture and one representative test follows the list):
1. test_assign_tiers_default_thresholds: Verify correct tier assignment counts, EXCLUDED genes removed
2. test_assign_tiers_custom_thresholds: Override thresholds, verify different classification
3. test_assign_tiers_sorting: Verify output sorted by composite_score DESC
4. test_add_evidence_summary_supporting_layers: Gene with 3 non-NULL scores has 3 layers listed
5. test_add_evidence_summary_gaps: Gene with all NULL scores has all 6 layers as gaps
6. test_write_candidate_output_creates_files: TSV, Parquet, and provenance.yaml all created
7. test_write_candidate_output_tsv_readable: Read back TSV with polars, verify columns and row count match
8. test_write_candidate_output_parquet_readable: Read back Parquet, verify schema matches
9. test_write_candidate_output_provenance_yaml: Parse YAML, verify statistics match
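
As a sketch of the shape of these tests (rows, values, and the trimmed column set are illustrative; the real fixture covers all ~20 cases and every layer column):

```python
# tests/test_output.py -- sketch of the fixture plus one representative test.
import polars as pl
import pytest

from usher_pipeline.output import assign_tiers


@pytest.fixture
def scored_genes() -> pl.DataFrame:
    # One row per outcome shown here for brevity; only two layer columns included.
    return pl.DataFrame(
        {
            "gene_id": ["G1", "G2", "G3", "G4", "G5"],
            "gene_symbol": ["A", "B", "C", "D", "E"],
            "composite_score": [0.85, 0.55, 0.25, 0.10, None],
            "evidence_count": [4, 2, 1, 1, 0],
            "gnomad_score": [0.9, 0.5, None, None, None],
            "expression_score": [0.8, None, 0.3, 0.1, None],
        }
    )


def test_assign_tiers_default_thresholds(scored_genes: pl.DataFrame) -> None:
    tiered = assign_tiers(scored_genes)
    counts = dict(tiered.group_by("confidence_tier").len().iter_rows())
    # EXCLUDED and NULL-score rows are filtered out entirely.
    assert counts == {"HIGH": 1, "MEDIUM": 1, "LOW": 1}
```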
</action>
<verify>
Run: `cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_output.py -v`
</verify>
<done>
All 9+ tests pass. TSV and Parquet outputs contain identical data. Provenance YAML contains accurate statistics. Tiering correctly classifies and filters genes. Evidence summary correctly identifies supporting layers and gaps.
</done>
</task>

</tasks>

<verification>
- `python -c "from usher_pipeline.output import assign_tiers, add_evidence_summary, write_candidate_output; print('All exports OK')"` succeeds
- `python -m pytest tests/test_output.py -v` -- all tests pass
- Synthetic data round-trips through tier assignment, evidence summary, and dual-format writing without errors
</verification>

<success_criteria>
- Tiering logic classifies scored genes into HIGH/MEDIUM/LOW confidence tiers using composite_score and evidence_count thresholds
- Evidence summary adds supporting_layers and evidence_gaps columns per gene
- Writer produces identical data in TSV and Parquet formats with provenance YAML sidecar
- All unit tests pass
</success_criteria>

<output>
After completion, create `.planning/phases/05-output-cli/05-01-SUMMARY.md`
</output>