Files

12 KiB

phase, plan, type, wave, depends_on, files_modified, autonomous, must_haves
phase plan type wave depends_on files_modified autonomous must_haves
04-scoring-integration 02 execute 2
04-01
src/usher_pipeline/scoring/quality_control.py
src/usher_pipeline/scoring/validation.py
src/usher_pipeline/scoring/__init__.py
true
truths artifacts key_links
Quality control detects missing data rates per evidence layer and flags layers above threshold
Score distribution anomalies (no variation, out-of-range values) are detected per layer
Outlier genes are identified using MAD-based robust detection per evidence layer
Known cilia/Usher genes rank in top quartile of composite scores before exclusion
path provides contains
src/usher_pipeline/scoring/quality_control.py QC checks for missing data, distributions, outliers median_absolute_deviation
path provides contains
src/usher_pipeline/scoring/validation.py Positive control validation against known gene ranking PERCENT_RANK
from to via pattern
src/usher_pipeline/scoring/quality_control.py DuckDB scored_genes table store.conn.execute SQL queries scored_genes
from to via pattern
src/usher_pipeline/scoring/validation.py src/usher_pipeline/scoring/known_genes.py compile_known_genes import compile_known_genes
from to via pattern
src/usher_pipeline/scoring/validation.py DuckDB scored_genes table PERCENT_RANK window function PERCENT_RANK.*ORDER BY.*composite_score
Implement quality control checks and positive control validation for the scoring system.

Purpose: QC catches upstream failures (high missing data rates, distribution anomalies, normalization bugs, outliers) before results are trusted. Positive control validation confirms known cilia/Usher genes rank highly, proving the scoring logic works correctly. Both are essential for scoring system credibility.

Output: src/usher_pipeline/scoring/quality_control.py and src/usher_pipeline/scoring/validation.py.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/04-scoring-integration/04-RESEARCH.md @.planning/phases/04-scoring-integration/04-01-SUMMARY.md @src/usher_pipeline/scoring/integration.py @src/usher_pipeline/scoring/known_genes.py @src/usher_pipeline/persistence/duckdb_store.py Task 1: Quality control checks for scoring results src/usher_pipeline/scoring/quality_control.py 1. Create `src/usher_pipeline/scoring/quality_control.py`:

Import: numpy, polars, structlog, PipelineStore from persistence. Import scipy.stats for MAD-based outlier detection (add scipy>=1.14 to project dependencies if not present -- check pyproject.toml first).

Define constants:

  • MISSING_RATE_WARN = 0.5 (50% missing = warning)
  • MISSING_RATE_ERROR = 0.8 (80% missing = error)
  • MIN_STD_THRESHOLD = 0.01 (std < 0.01 = no variation warning)
  • OUTLIER_MAD_THRESHOLD = 3.0 (>3 MAD from median = outlier)
  • EVIDENCE_LAYERS = list of 6 layer names: ["gnomad", "expression", "annotation", "localization", "animal_model", "literature"]
  • SCORE_COLUMNS = dict mapping layer name to column name: {"gnomad": "gnomad_score", "expression": "expression_score", "annotation": "annotation_score", "localization": "localization_score", "animal_model": "animal_model_score", "literature": "literature_score"}

Create function compute_missing_data_rates(store: PipelineStore) -> dict:

  • Query scored_genes table to compute fraction of NULL values per score column.
  • Use a single SQL query: SELECT COUNT(*) AS total, AVG(CASE WHEN col IS NULL THEN 1.0 ELSE 0.0 END) AS col_missing ... for all 6 columns.
  • Return dict with keys: "rates" (dict[str, float] per layer), "warnings" (list of str), "errors" (list of str).
  • Classify: rate > MISSING_RATE_ERROR -> add to errors; rate > MISSING_RATE_WARN -> add to warnings.
  • Log each layer's rate with appropriate level (error/warning/info).

Create function compute_distribution_stats(store: PipelineStore) -> dict:

  • For each evidence layer, query non-NULL scores from scored_genes.
  • Compute: mean, median, std, min, max using numpy on the extracted array.
  • Detect anomalies:
    • std < MIN_STD_THRESHOLD -> warning "no variation"
    • min < 0.0 or max > 1.0 -> error "out of range"
  • Return dict with keys: "distributions" (dict per layer with stats), "warnings", "errors".

Create function detect_outliers(store: PipelineStore) -> dict:

  • For each evidence layer, query gene_symbol + score from scored_genes where score IS NOT NULL.
  • Compute MAD: median(|scores - median(scores)|).
  • If MAD == 0, skip (no variation). Otherwise, flag genes where |score - median| > OUTLIER_MAD_THRESHOLD * MAD.
  • Return dict with keys per layer: "count" (int), "example_genes" (list of up to 5 gene symbols).
  • Log outlier counts per layer.

Create function run_qc_checks(store: PipelineStore) -> dict:

  • Orchestrates all three checks above.
  • Returns combined dict with keys: "missing_data", "distributions", "outliers", "composite_stats", "warnings" (combined list), "errors" (combined list), "passed" (bool: True if no errors).
  • Also compute composite score distribution: mean, median, std, percentiles (p10, p25, p50, p75, p90) from scored_genes WHERE composite_score IS NOT NULL.
  • Log final QC summary: total warnings, total errors, pass/fail.

Note: Use numpy for statistics (scipy.stats.median_abs_deviation can be used for MAD, or compute manually with np.median(np.abs(x - np.median(x)))). Keep scipy import conditional or direct -- either approach is fine as long as outlier detection works. Run: `cd /Users/gbanyan/Project/usher-exploring && python -c " from usher_pipeline.scoring.quality_control import ( compute_missing_data_rates, compute_distribution_stats, detect_outliers, run_qc_checks, MISSING_RATE_WARN, MISSING_RATE_ERROR, OUTLIER_MAD_THRESHOLD, EVIDENCE_LAYERS, SCORE_COLUMNS, ) import inspect print('Constants:') print(f' MISSING_RATE_WARN={MISSING_RATE_WARN}') print(f' MISSING_RATE_ERROR={MISSING_RATE_ERROR}') print(f' OUTLIER_MAD_THRESHOLD={OUTLIER_MAD_THRESHOLD}') print(f' EVIDENCE_LAYERS={EVIDENCE_LAYERS}') print(f' SCORE_COLUMNS has {len(SCORE_COLUMNS)} entries') assert len(EVIDENCE_LAYERS) == 6 assert len(SCORE_COLUMNS) == 6

Verify function signatures

for fn in [compute_missing_data_rates, compute_distribution_stats, detect_outliers, run_qc_checks]: sig = inspect.signature(fn) assert 'store' in sig.parameters, f'{fn.name} missing store param' print('All QC functions verified') "` - compute_missing_data_rates() queries NULL rates per layer from scored_genes with warning/error classification - compute_distribution_stats() computes mean/median/std/min/max per layer and detects anomalies (no variation, out of range) - detect_outliers() uses MAD-based detection (>3 MAD from median) per layer - run_qc_checks() orchestrates all checks and returns combined results with pass/fail status

Task 2: Positive control validation against known gene rankings src/usher_pipeline/scoring/validation.py src/usher_pipeline/scoring/__init__.py 1. Create `src/usher_pipeline/scoring/validation.py`:

Import: duckdb, polars, structlog, PipelineStore, compile_known_genes from scoring.known_genes.

Create function validate_known_gene_ranking(store: PipelineStore, percentile_threshold: float = 0.75) -> dict:

  • Call compile_known_genes() to get known gene DataFrame.
  • Register the known genes DataFrame as a temporary DuckDB table: store.conn.execute("CREATE OR REPLACE TEMP TABLE _known_genes AS SELECT * FROM known_df") (where known_df is the polars DataFrame variable).
  • Execute a DuckDB SQL query that: a. Computes PERCENT_RANK() OVER (ORDER BY composite_score) for ALL genes in scored_genes (not just known genes). b. INNER JOINs with _known_genes on gene_symbol to get percentile ranks for known genes only. c. Filters WHERE composite_score IS NOT NULL. d. Returns gene_symbol, composite_score, source, percentile_rank.
  • Compute validation metrics from the result:
    • total_known_in_dataset: count of known genes found in scored_genes
    • total_known_expected: count of unique gene_symbols in known gene list
    • median_percentile: median of percentile_rank values
    • top_quartile_count: count where percentile_rank >= 0.75
    • top_quartile_fraction: top_quartile_count / total_known_in_dataset
    • validation_passed: median_percentile >= percentile_threshold
    • known_gene_details: list of dicts with gene_symbol, composite_score, percentile_rank, source (sorted by percentile_rank descending, limited to top 20)
  • If no known genes found in scored_genes, return validation_passed=False with reason="no_known_genes_found".
  • Log validation results: passed/failed, median_percentile, top_quartile stats.
  • Drop the temp table after query: store.conn.execute("DROP TABLE IF EXISTS _known_genes").
  • Return the metrics dict.

Create function generate_validation_report(metrics: dict) -> str:

  • Produce a human-readable text report summarizing validation results.
  • Include: pass/fail status, median percentile, top quartile count/fraction, table of top-ranked known genes.
  • Format percentiles as percentages (e.g., "87.3%").
  • Return the report as a string.
  1. Update src/usher_pipeline/scoring/__init__.py to also export: run_qc_checks, validate_known_gene_ranking, generate_validation_report. Run: cd /Users/gbanyan/Project/usher-exploring && python -c " from usher_pipeline.scoring.validation import validate_known_gene_ranking, generate_validation_report from usher_pipeline.scoring import ( compile_known_genes, load_known_genes_to_duckdb, join_evidence_layers, compute_composite_scores, persist_scored_genes, run_qc_checks, validate_known_gene_ranking, generate_validation_report, ) import inspect sig = inspect.signature(validate_known_gene_ranking) assert 'store' in sig.parameters assert 'percentile_threshold' in sig.parameters src = inspect.getsource(validate_known_gene_ranking) assert 'PERCENT_RANK' in src, 'Missing PERCENT_RANK window function' assert 'compile_known_genes' in src, 'Missing known genes integration' print('Validation function verified') print('All scoring module exports verified') "
    • validate_known_gene_ranking() computes percentile ranks of known genes in scored_genes using PERCENT_RANK window function
    • Returns metrics dict with median_percentile, top_quartile_count/fraction, validation_passed boolean
    • generate_validation_report() produces human-readable text summary
    • All scoring module functions exported from init.py
- Quality control detects missing data rates and classifies as warning (>50%) or error (>80%) - Distribution stats computed per layer with anomaly detection (no variation, out-of-range) - MAD-based outlier detection flags genes >3 MAD from median per layer - Known gene validation uses PERCENT_RANK before exclusion (not after filtering) - All functions importable from usher_pipeline.scoring

<success_criteria>

  • run_qc_checks() returns structured dict with missing_data, distributions, outliers, composite_stats, pass/fail
  • validate_known_gene_ranking() computes percentile ranks for known cilia/Usher genes
  • Validation checks median percentile against threshold (default 0.75 = top quartile)
  • generate_validation_report() produces readable text output
  • scipy used for MAD computation (or equivalent numpy manual calculation) </success_criteria>
After completion, create `.planning/phases/04-scoring-integration/04-02-SUMMARY.md`