docs(04-02): complete QC and validation plan
- Add SUMMARY for quality control and positive control validation
- Update STATE.md: Plan 2 of 3 in Phase 04 complete
- Progress: 70% (14/20 plans complete)
- Decisions: scipy MAD outlier detection, PERCENT_RANK validation
@@ -10,18 +10,18 @@ See: .planning/PROJECT.md (updated 2026-02-11)
 ## Current Position
 
 Phase: 4 of 6 (Scoring and Integration)
-Plan: 1 of 3 in current phase (in progress)
-Status: Plan 04-01 complete — known gene compilation and weighted scoring integration
-Last activity: 2026-02-11 — Completed 04-01-PLAN.md
+Plan: 2 of 3 in current phase (in progress)
+Status: Plan 04-02 complete — quality control and positive control validation
+Last activity: 2026-02-11 — Completed 04-02-PLAN.md
 
-Progress: [██████░░░░] 65.0% (13/20 plans complete across all phases)
+Progress: [███████░░░] 70.0% (14/20 plans complete across all phases)
 
 ## Performance Metrics
 
 **Velocity:**
-- Total plans completed: 13
-- Average duration: 5.5 min
-- Total execution time: 1.2 hours
+- Total plans completed: 14
+- Average duration: 5.4 min
+- Total execution time: 1.3 hours
 
 **By Phase:**
 
@@ -30,17 +30,17 @@ Progress: [██████░░░░] 65.0% (13/20 plans complete across al
 | 01 - Data Infrastructure | 4/4 | 14 min | 3.5 min/plan |
 | 02 - Prototype Evidence Layer | 2/2 | 8 min | 4.0 min/plan |
 | 03 - Core Evidence Layers | 6/6 | 52 min | 8.7 min/plan |
-| 04 - Scoring Integration | 1/3 | 4 min | 4.0 min/plan |
+| 04 - Scoring Integration | 2/3 | 7 min | 3.5 min/plan |
 
 **Recent Plan Details:**
 | Plan | Duration | Tasks | Files |
 |------|----------|-------|-------|
-| Phase 03 P02 | 12 min | 2 tasks | 9 files |
 | Phase 03 P03 | 11 min | 2 tasks | 7 files |
 | Phase 03 P04 | 8 min | 2 tasks | 8 files |
 | Phase 03 P05 | 10 min | 2 tasks | 8 files |
 | Phase 03 P06 | 13 min | 2 tasks | 10 files |
 | Phase 04 P01 | 4 min | 2 tasks | 4 files |
+| Phase 04 P02 | 3 min | 2 tasks | 4 files |
 
 ## Accumulated Context
 
@@ -103,6 +103,10 @@ Recent decisions affecting current work:
 - [04-01]: Quality flags based on evidence_count (>=4 sufficient, >=2 moderate, >=1 sparse, 0 no_evidence)
 - [04-01]: Per-layer contribution tracking (score * weight) for explainability
 - [04-01]: ScoringWeights validation enforcing sum = 1.0 ± 1e-6 tolerance
+- [04-02]: scipy MAD-based outlier detection (>3 MAD threshold) for robust anomaly detection
+- [04-02]: Missing data thresholds: 50% warn, 80% error for graduated QC feedback
+- [04-02]: PERCENT_RANK validation computed before known gene exclusion (validates scoring system)
+- [04-02]: Top quartile validation criterion (median percentile >= 0.75 for known genes)
 
 ### Pending Todos
 
@@ -115,5 +119,5 @@ None yet.
 ## Session Continuity
 
 Last session: 2026-02-11 - Plan execution
-Stopped at: Completed 04-01-PLAN.md (Known gene compilation and weighted scoring)
-Resume file: .planning/phases/04-scoring-integration/04-01-SUMMARY.md
+Stopped at: Completed 04-02-PLAN.md (Quality control and positive control validation)
+Resume file: .planning/phases/04-scoring-integration/04-02-SUMMARY.md
146
.planning/phases/04-scoring-integration/04-02-SUMMARY.md
Normal file
@@ -0,0 +1,146 @@
---
phase: 04-scoring-integration
plan: 02
subsystem: scoring
tags: [qc, validation, quality-control, outlier-detection, positive-controls, mad, percentile-rank]

# Dependency graph
requires:
  - phase: 04-scoring-integration
    plan: 01
    provides: scored_genes table, known_genes compilation
  - phase: 01-data-infrastructure
    provides: PipelineStore, DuckDB persistence
provides:
  - Quality control checks for missing data, distributions, outliers
  - MAD-based robust outlier detection per evidence layer
  - Positive control validation against known gene rankings
  - PERCENT_RANK window function validation
  - QC orchestrator with composite score statistics
affects: [04-03, candidate-ranking, filtering, quality-assessment]

# Tech tracking
tech-stack:
  added:
    - scipy>=1.14 for MAD-based outlier detection
  patterns:
    - MAD-based robust outlier detection (>3 MAD threshold)
    - PERCENT_RANK window function for percentile ranking
    - Threshold-based classification (warn/error levels)
    - Temporary table pattern for DuckDB joins

key-files:
  created:
    - src/usher_pipeline/scoring/quality_control.py
    - src/usher_pipeline/scoring/validation.py
  modified:
    - pyproject.toml
    - src/usher_pipeline/scoring/__init__.py

key-decisions:
  - "scipy.stats.median_abs_deviation for MAD computation (robust to outliers)"
  - "Missing data thresholds: 50% warn, 80% error"
  - "Outlier threshold: >3 MAD from median per layer"
  - "PERCENT_RANK computed across ALL genes before exclusion (validates scoring system)"
  - "Top quartile validation: median percentile >= 0.75 for known genes"
  - "Temporary table pattern for known gene join (avoids external temp files)"

patterns-established:
  - "QC orchestrator pattern: run_qc_checks combines all checks with pass/fail boolean"
  - "Composite score percentiles (p10, p25, p50, p75, p90) for distribution analysis"
  - "Human-readable validation reports with formatted tables"
  - "NULL-aware statistics: only compute on non-NULL values per layer"

# Metrics
duration: 2m 54s
completed: 2026-02-11
---

# Phase 04 Plan 02: Quality Control and Positive Control Validation Summary

**MAD-based outlier detection and PERCENT_RANK validation ensuring scoring system credibility before candidate ranking**

## Performance

- **Duration:** 2 minutes 54 seconds (174 seconds)
- **Started:** 2026-02-11T12:45:12Z
- **Completed:** 2026-02-11T12:48:06Z
- **Tasks:** 2
- **Files modified:** 4

## Accomplishments
- Added scipy>=1.14 dependency for robust statistical methods
- Implemented compute_missing_data_rates() with warn (>50%) and error (>80%) thresholds
- Created compute_distribution_stats() detecting no variation (std < 0.01) and out-of-range values
- Built detect_outliers() using MAD-based detection (>3 MAD from median per layer)
- Developed run_qc_checks() orchestrator with composite score percentiles (p10-p90)
- Implemented validate_known_gene_ranking() using PERCENT_RANK window function
- Created generate_validation_report() with human-readable formatted output
- Exported run_qc_checks, validate_known_gene_ranking, generate_validation_report from scoring module
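
The warn/error thresholds for missing data listed above can be illustrated with a small standard-library sketch. This is not the code in `quality_control.py`: the function names, the list-of-dicts input, and the layer names `layer_a`, `layer_b`, `layer_c` are hypothetical, and the real pipeline reads from DuckDB rather than from Python dicts. The only details taken from the summary are the strict `>50%` warn and `>80%` error cut-offs.

```python
def classify_missing_rate(rate, warn=0.5, error=0.8):
    """Map a missing-data fraction to 'ok', 'warn', or 'error' (strict > comparisons)."""
    if rate > error:
        return "error"
    if rate > warn:
        return "warn"
    return "ok"


def missing_data_rates(rows, layers):
    """Return {layer: (missing_fraction, status)} for each evidence layer.

    `rows` is a list of dicts (one per gene); a value of None counts as missing.
    """
    total = len(rows)
    report = {}
    for layer in layers:
        missing = sum(1 for row in rows if row.get(layer) is None)
        rate = missing / total if total else 0.0
        report[layer] = (rate, classify_missing_rate(rate))
    return report


# Hypothetical scores for four genes across three evidence layers.
rows = [
    {"layer_a": 0.9, "layer_b": None, "layer_c": None},
    {"layer_a": 0.4, "layer_b": None, "layer_c": None},
    {"layer_a": 0.7, "layer_b": 0.2, "layer_c": None},
    {"layer_a": 0.1, "layer_b": None, "layer_c": None},
]
report = missing_data_rates(rows, ["layer_a", "layer_b", "layer_c"])
```

In this toy input, `layer_a` has no missing values, `layer_b` is 75% missing and falls in the warn band, and `layer_c` is entirely missing and is classified as an error.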

## Task Commits

Each task was committed atomically:

1. **Task 1: Quality control checks for scoring results** - `ba2f97a` (feat)
2. **Task 2: Positive control validation against known gene rankings** - `70a5d6e` (feat)

## Files Created/Modified
- `src/usher_pipeline/scoring/quality_control.py` - QC checks with MAD-based outlier detection
- `src/usher_pipeline/scoring/validation.py` - Known gene ranking validation with PERCENT_RANK
- `pyproject.toml` - Added scipy>=1.14 dependency
- `src/usher_pipeline/scoring/__init__.py` - Export QC and validation functions

## Decisions Made

1. **scipy for MAD computation:** Used scipy.stats.median_abs_deviation with scale="normal" for robust outlier detection. MAD is less sensitive to outliers than standard deviation, making it ideal for detecting anomalous genes in scored datasets.

2. **Missing data threshold classification:** 50% missing = warning (may still be usable), 80% missing = error (too sparse to trust). This dual-threshold approach allows for graduated QC feedback.

3. **Outlier threshold calibration:** >3 MAD from median flags outliers. This is a standard robust statistics threshold balancing sensitivity and specificity. Layers with MAD=0 (no variation) skip outlier detection.

4. **PERCENT_RANK validation timing:** Validation computes percentile ranks BEFORE known gene exclusion to validate the scoring system itself (not post-filtering artifacts). Uses temporary table pattern for efficient DuckDB join.

5. **Top quartile validation criterion:** Median percentile >= 0.75 ensures known cilia/Usher genes rank in top 25% of all genes. This threshold confirms scoring logic prioritizes cilia-relevant features.

6. **Composite score percentiles:** Included p10, p25, p50, p75, p90 in run_qc_checks() for distribution analysis. These percentiles help identify skewness, outliers, and score concentration patterns.
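
Decisions 1 and 3 can be sketched without scipy. The snippet below is an illustration under stated assumptions, not the project's `detect_outliers()`: the function name `detect_outliers_mad` and the list-based input are hypothetical. It rescales the raw MAD by the normal-consistency constant (the 0.75 quantile of the standard normal, about 0.6745), which is the rescaling that `scale="normal"` refers to in `scipy.stats.median_abs_deviation`, and it skips detection when the MAD is zero, as decision 3 describes.

```python
from statistics import NormalDist, median


def detect_outliers_mad(scores, threshold=3.0):
    """Return indices of values more than `threshold` scaled MADs from the median.

    None entries are treated as missing and ignored (NULL-aware statistics).
    """
    values = [s for s in scores if s is not None]
    if not values:
        return []
    med = median(values)
    raw_mad = median([abs(v - med) for v in values])
    if raw_mad == 0:
        # No variation in this layer: outlier detection is skipped.
        return []
    # Rescale so the MAD is comparable to a standard deviation under normality.
    scaled_mad = raw_mad / NormalDist().inv_cdf(0.75)
    return [
        i
        for i, s in enumerate(scores)
        if s is not None and abs(s - med) / scaled_mad > threshold
    ]


# Hypothetical layer scores: one missing value and one extreme value.
layer_scores = [0.50, 0.52, 0.48, None, 0.51, 0.49, 5.0]
outlier_indices = detect_outliers_mad(layer_scores)
```

Because the median and MAD are barely moved by the single extreme value, only index 6 exceeds the 3-MAD threshold here; a mean and standard deviation computed on the same data would be inflated by that value.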

## Deviations from Plan

None - plan executed exactly as written.

## Issues Encountered

None. Both verification tests passed on first attempt.

## User Setup Required

None - scipy installed automatically in virtual environment.

## Next Phase Readiness

Ready for Phase 04 Plan 03 (ranked candidate list generation and filtering):
- QC checks available via run_qc_checks() for post-scoring validation
- Positive control validation confirms scoring system works correctly
- Outlier detection identifies anomalous genes per layer
- Composite score percentiles provide distribution context for threshold selection
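
The positive control check mentioned above can be sketched in plain Python. The project computes ranks with the SQL `PERCENT_RANK` window function in DuckDB; the snippet below only mirrors that function's standard definition, `(rank - 1) / (n - 1)` with tied values sharing the lowest rank, and is an assumption-laden illustration rather than the code in `validation.py`. The function names and the gene identifiers `GENE_A` to `GENE_E` are hypothetical.

```python
from bisect import bisect_left
from statistics import median


def percent_rank(scores):
    """Percentile rank per gene over ALL genes, ascending by composite score.

    Mirrors SQL PERCENT_RANK: (rank - 1) / (n - 1), ties share the lowest rank.
    """
    n = len(scores)
    if n <= 1:
        return {gene: 0.0 for gene in scores}
    ordered = sorted(scores.values())
    # bisect_left counts strictly smaller values, which equals rank - 1.
    return {gene: bisect_left(ordered, s) / (n - 1) for gene, s in scores.items()}


def validate_known_genes(scores, known_genes, min_median=0.75):
    """Return (median_percentile, passed) for known genes present in `scores`."""
    ranks = percent_rank(scores)  # computed before any known-gene exclusion
    known_ranks = [ranks[g] for g in known_genes if g in ranks]
    if not known_ranks:
        return None, False
    med = median(known_ranks)
    return med, med >= min_median


# Hypothetical composite scores; GENE_D and GENE_E play the role of known genes.
scores = {"GENE_A": 0.1, "GENE_B": 0.2, "GENE_C": 0.3, "GENE_D": 0.8, "GENE_E": 0.9}
median_pct, passed = validate_known_genes(scores, {"GENE_D", "GENE_E"})
```

Ranking over the full gene set before exclusion matters for the same reason given in decision 4: if known genes were removed first, there would be nothing left to check the scoring system against.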

No blockers. Next plan can implement:
- Known gene exclusion filtering
- Quality flag filtering (sufficient_evidence threshold)
- Composite score ranking
- Top-N candidate selection with provenance

## Self-Check: PASSED

All claimed files and commits verified:
- src/usher_pipeline/scoring/quality_control.py - FOUND
- src/usher_pipeline/scoring/validation.py - FOUND
- pyproject.toml - FOUND
- Commit ba2f97a (Task 1) - FOUND
- Commit 70a5d6e (Task 2) - FOUND

---
*Phase: 04-scoring-integration*
*Plan: 02*
*Completed: 2026-02-11*