docs(03): create phase plan

2026-02-11 18:46:28 +08:00
parent 3354cfe006
commit 0d252da348
7 changed files with 1022 additions and 3 deletions


@@ -0,0 +1,178 @@
---
phase: 03-core-evidence-layers
plan: 06
type: execute
wave: 1
depends_on: []
files_modified:
- src/usher_pipeline/evidence/literature/__init__.py
- src/usher_pipeline/evidence/literature/models.py
- src/usher_pipeline/evidence/literature/fetch.py
- src/usher_pipeline/evidence/literature/transform.py
- src/usher_pipeline/evidence/literature/load.py
- tests/test_literature.py
- tests/test_literature_integration.py
- src/usher_pipeline/cli/evidence_cmd.py
- pyproject.toml
autonomous: true
user_setup:
- service: ncbi
why: "PubMed E-utilities rate limit increase (3/sec -> 10/sec)"
env_vars:
- name: NCBI_API_KEY
source: "https://www.ncbi.nlm.nih.gov/account/settings/ -> API Key Management -> Create"
- name: NCBI_EMAIL
source: "Your email address (required by NCBI E-utilities)"
dashboard_config: []
must_haves:
truths:
- "Pipeline performs systematic PubMed queries per candidate gene for cilia, sensory, cytoskeleton, and cell polarity contexts"
- "Literature evidence distinguishes direct experimental evidence from incidental mentions and HTS hits"
- "Literature score reflects evidence quality not just publication count to mitigate well-studied gene bias"
artifacts:
- path: "src/usher_pipeline/evidence/literature/fetch.py"
provides: "PubMed query and publication retrieval per gene"
exports: ["query_pubmed_gene", "fetch_literature_evidence"]
- path: "src/usher_pipeline/evidence/literature/transform.py"
provides: "Evidence tier classification and quality-weighted scoring"
exports: ["classify_evidence_tier", "compute_literature_score", "process_literature_evidence"]
- path: "src/usher_pipeline/evidence/literature/load.py"
provides: "DuckDB persistence for literature evidence"
exports: ["load_to_duckdb"]
- path: "tests/test_literature.py"
provides: "Unit tests for evidence classification and quality scoring"
key_links:
- from: "src/usher_pipeline/evidence/literature/fetch.py"
to: "NCBI PubMed E-utilities"
via: "Biopython Entrez with ratelimit"
pattern: "Entrez\\.esearch|Entrez\\.efetch"
- from: "src/usher_pipeline/evidence/literature/transform.py"
to: "src/usher_pipeline/evidence/literature/fetch.py"
via: "classifies and scores fetched publication data"
pattern: "classify_evidence_tier|compute_literature_score"
- from: "src/usher_pipeline/evidence/literature/load.py"
to: "src/usher_pipeline/persistence/duckdb_store.py"
via: "store.save_dataframe"
pattern: "save_dataframe.*literature_evidence"
---
<objective>
Implement the Literature Evidence layer (LITE-01/02/03): perform systematic PubMed queries per gene for cilia/sensory contexts, classify evidence into quality tiers (direct experimental, functional mention, HTS hit, incidental), and compute a quality-weighted literature score that avoids well-studied gene bias.
Purpose: Literature evidence confirms or suggests gene involvement in cilia/sensory pathways. Quality weighting prevents TP53-like genes (100K publications) from outscoring novel candidates backed by only 5 high-quality papers.
Output: literature_evidence DuckDB table with per-gene PubMed hit counts by context, evidence tier classification, and quality-weighted normalized literature score.
</objective>
<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create literature evidence data model, PubMed fetch, and scoring transform</name>
<files>
src/usher_pipeline/evidence/literature/__init__.py
src/usher_pipeline/evidence/literature/models.py
src/usher_pipeline/evidence/literature/fetch.py
src/usher_pipeline/evidence/literature/transform.py
pyproject.toml
</files>
<action>
Create the literature evidence layer following the established fetch->transform pattern.
**models.py**: Define the LiteratureRecord pydantic model with fields:
- gene_id (str), gene_symbol (str)
- total_pubmed_count (int|None) -- total publications mentioning the gene
- cilia_context_count, sensory_context_count, cytoskeleton_context_count, cell_polarity_context_count (int|None) -- publications per search context
- direct_experimental_count (int|None) -- publications with knockout/mutation/knockdown language
- hts_screen_count (int|None) -- publications from high-throughput screens
- evidence_tier (str) -- one of "direct_experimental", "functional_mention", "hts_hit", "incidental", "none"
- literature_score_normalized (float|None) -- quality-weighted score in [0, 1]

Define LITERATURE_TABLE_NAME = "literature_evidence". Define SEARCH_CONTEXTS as a dict mapping context names to PubMed search terms: {"cilia": "(cilia OR cilium OR ciliary OR flagellum OR intraflagellar)", "sensory": "(retina OR cochlea OR hair cell OR photoreceptor OR vestibular OR hearing OR usher syndrome)", "cytoskeleton": "(cytoskeleton OR actin OR microtubule OR motor protein)", "cell_polarity": "(cell polarity OR planar cell polarity OR apicobasal OR tight junction)"}. Define DIRECT_EVIDENCE_TERMS = "(knockout OR knockdown OR mutation OR CRISPR OR siRNA OR morpholino OR null allele)".
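A minimal sketch of models.py under these definitions (assuming pydantic v2; module paths are the ones planned in this phase, not existing code):

```python
# Sketch of models.py -- field set as specified in this plan; assumes pydantic v2.
from pydantic import BaseModel

LITERATURE_TABLE_NAME = "literature_evidence"

SEARCH_CONTEXTS: dict[str, str] = {
    "cilia": "(cilia OR cilium OR ciliary OR flagellum OR intraflagellar)",
    "sensory": "(retina OR cochlea OR hair cell OR photoreceptor OR vestibular OR hearing OR usher syndrome)",
    "cytoskeleton": "(cytoskeleton OR actin OR microtubule OR motor protein)",
    "cell_polarity": "(cell polarity OR planar cell polarity OR apicobasal OR tight junction)",
}

DIRECT_EVIDENCE_TERMS = "(knockout OR knockdown OR mutation OR CRISPR OR siRNA OR morpholino OR null allele)"


class LiteratureRecord(BaseModel):
    gene_id: str
    gene_symbol: str
    total_pubmed_count: int | None = None
    cilia_context_count: int | None = None
    sensory_context_count: int | None = None
    cytoskeleton_context_count: int | None = None
    cell_polarity_context_count: int | None = None
    direct_experimental_count: int | None = None
    hts_screen_count: int | None = None
    evidence_tier: str = "none"
    literature_score_normalized: float | None = None
```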
**fetch.py**: Two functions:
1. `query_pubmed_gene(gene_symbol: str, contexts: dict[str, str], email: str, api_key: str|None = None) -> dict` -- Use Biopython Entrez.esearch to query PubMed. For each context, construct query: `({gene_symbol}[Gene Name]) AND {context_terms}`. Also query with DIRECT_EVIDENCE_TERMS appended for direct experimental count. Query for HTS: `({gene_symbol}[Gene Name]) AND (screen[Title/Abstract] OR proteomics[Title/Abstract] OR transcriptomics[Title/Abstract])`. Get total count for gene (no context filter). Use ratelimit decorator: 3 req/sec without API key, 10 req/sec with key. Return dict with counts per context plus direct_experimental and hts counts. Set Entrez.email and Entrez.api_key from parameters.
2. `fetch_literature_evidence(gene_symbols: list[str], email: str, api_key: str|None = None) -> pl.DataFrame` -- Process all genes through query_pubmed_gene. Batch processing with progress logging (log every 100 genes). Build DataFrame with all count columns. For genes that fail (API errors), set counts to NULL (not zero). This function will be slow (~20K genes * ~6 queries each = ~120K queries, at 10/sec = ~3.3 hours with API key, ~11 hours without). Log estimated time at start. Support resumption by accepting partial results DataFrame.
Add `biopython>=1.84` to pyproject.toml dependencies (if not already present).
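For fetch.py, a minimal sketch of the per-gene count query using Biopython's Entrez.esearch and the ratelimit package. The fixed 3 req/sec limit, the exact query composition, and the lack of error handling are simplifications of what the plan requires (the real code must switch to 10 req/sec with an API key, preserve NULLs on failure, and checkpoint):

```python
# Sketch only: module-level 3 req/sec limit; the real implementation should use
# 10 req/sec when an NCBI API key is provided and set counts to NULL on failure.
from Bio import Entrez
from ratelimit import limits, sleep_and_retry

from usher_pipeline.evidence.literature.models import DIRECT_EVIDENCE_TERMS  # planned module


@sleep_and_retry
@limits(calls=3, period=1)
def _esearch_count(term: str) -> int:
    """Return the PubMed hit count for a query without downloading records."""
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])


def query_pubmed_gene(gene_symbol: str, contexts: dict[str, str],
                      email: str, api_key: str | None = None) -> dict:
    Entrez.email = email
    if api_key:
        Entrez.api_key = api_key
    base = f"({gene_symbol}[Gene Name])"
    counts: dict = {"total_pubmed_count": _esearch_count(base)}
    for name, terms in contexts.items():
        counts[f"{name}_context_count"] = _esearch_count(f"{base} AND {terms}")
    counts["direct_experimental_count"] = _esearch_count(f"{base} AND {DIRECT_EVIDENCE_TERMS}")
    counts["hts_screen_count"] = _esearch_count(
        f"{base} AND (screen[Title/Abstract] OR proteomics[Title/Abstract]"
        " OR transcriptomics[Title/Abstract])"
    )
    return counts
```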
**transform.py**: Three functions:
1. `classify_evidence_tier(df: pl.DataFrame) -> pl.DataFrame` -- Classify each gene: "direct_experimental" if direct_experimental_count >= 1 AND cilia/sensory context count >= 1, "functional_mention" if cilia/sensory context count >= 1 AND total count >= 3, "hts_hit" if hts_screen_count >= 1 AND cilia/sensory context count >= 1, "incidental" if total_pubmed_count >= 1 but no context matches, "none" if total_pubmed_count == 0 or NULL. Add evidence_tier column.
2. `compute_literature_score(df: pl.DataFrame) -> pl.DataFrame` -- Quality-weighted scoring that avoids well-studied gene bias. Formula: context_score = weighted sum of context counts (cilia * 2.0 + sensory * 2.0 + cytoskeleton * 1.0 + cell_polarity * 1.0), with cilia/sensory weighted highest for relevance. evidence_quality_weight = {"direct_experimental": 1.0, "functional_mention": 0.6, "hts_hit": 0.3, "incidental": 0.1, "none": 0.0}. raw_score = context_score * evidence_quality_weight. CRITICAL bias mitigation: divide by log2(total_pubmed_count + 1) to penalize genes whose cilia mentions are a tiny fraction of their total literature. Final normalization: rank-percentile across all genes, scaled to [0, 1]. NULL if total_pubmed_count is NULL.
3. `process_literature_evidence(gene_ids: list[str], gene_symbol_map: pl.DataFrame, email: str, api_key: str|None = None) -> pl.DataFrame` -- End-to-end: map gene_ids to symbols -> fetch literature -> classify tier -> compute score -> collect.
Follow established patterns: NULL preservation, structlog logging. IMPORTANT: PubMed queries are slow. Design for checkpoint-restart (save partial results every 500 genes). Log progress extensively.
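A minimal Polars sketch of the scoring in item 2 above, assuming classify_evidence_tier has already added evidence_tier and that the count columns follow the model; the raw-score column name is hypothetical:

```python
# Sketch of compute_literature_score -- weights per this plan; intermediate
# "raw_literature_score" column name is illustrative only.
import polars as pl

CONTEXT_WEIGHTS = {"cilia": 2.0, "sensory": 2.0, "cytoskeleton": 1.0, "cell_polarity": 1.0}


def compute_literature_score(df: pl.DataFrame) -> pl.DataFrame:
    # Relevance-weighted sum of per-context hit counts.
    context_score = sum(
        pl.col(f"{name}_context_count").fill_null(0) * weight
        for name, weight in CONTEXT_WEIGHTS.items()
    )
    # Evidence-tier quality weights.
    quality_weight = (
        pl.when(pl.col("evidence_tier") == "direct_experimental").then(1.0)
        .when(pl.col("evidence_tier") == "functional_mention").then(0.6)
        .when(pl.col("evidence_tier") == "hts_hit").then(0.3)
        .when(pl.col("evidence_tier") == "incidental").then(0.1)
        .otherwise(0.0)
    )
    # Bias mitigation: divide by log2(total + 1); NULL totals stay NULL, zero totals score 0.
    raw_score = (
        pl.when(pl.col("total_pubmed_count").is_null()).then(None)
        .when(pl.col("total_pubmed_count") == 0).then(0.0)
        .otherwise(context_score * quality_weight / (pl.col("total_pubmed_count") + 1).log(2))
        .alias("raw_literature_score")
    )
    df = df.with_columns(raw_score)
    # Rank-percentile normalization to [0, 1]; NULL raw scores keep NULL ranks.
    return df.with_columns(
        (pl.col("raw_literature_score").rank(method="average")
         / pl.col("raw_literature_score").count()).alias("literature_score_normalized")
    )
```

As a rough check of the bias term: a novel gene with 5 cilia papers out of 10 total and direct experimental evidence gets raw ≈ (5 · 2.0 · 1.0) / log2(11) ≈ 2.9, while a TP53-like gene with the same 5 cilia papers out of 100,000 total gets ≈ 10 / log2(100,001) ≈ 0.6, so the focused gene ranks higher before percentile scaling.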
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.literature import query_pubmed_gene, fetch_literature_evidence, classify_evidence_tier, compute_literature_score, process_literature_evidence; print('imports OK')"
</verify>
<done>
Literature fetch queries PubMed via Biopython Entrez with rate limiting and context-specific search terms. Transform classifies evidence tiers and computes quality-weighted score with bias mitigation via total publication normalization. All functions importable.
</done>
</task>
<task type="auto">
<name>Task 2: Create literature DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/literature/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_literature.py
tests/test_literature_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "literature_evidence" table. Record provenance: genes queried, genes with direct evidence, tier distribution, mean literature score, total PubMed queries made. Create `query_literature_supported(store, min_tier="functional_mention") -> pl.DataFrame` helper returning genes with evidence tier at or above specified tier.
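A rough sketch of load_to_duckdb, assuming the persistence store exposes save_dataframe(df, table_name) as the gnomad key link implies and that provenance is a plain dict the caller persists; the actual duckdb_store signatures may differ:

```python
# Rough sketch -- store.save_dataframe(df, table_name) is assumed from the gnomad
# loader pattern; adjust to the real duckdb_store API and provenance record type.
import polars as pl
import structlog

from usher_pipeline.evidence.literature.models import LITERATURE_TABLE_NAME  # planned module

logger = structlog.get_logger()


def load_to_duckdb(df: pl.DataFrame, store, provenance: dict, description: str) -> None:
    tier_distribution = df["evidence_tier"].value_counts().to_dicts()
    provenance.update({
        "description": description,
        "genes_queried": df.height,
        "genes_with_direct_evidence": df.filter(
            pl.col("evidence_tier") == "direct_experimental"
        ).height,
        "tier_distribution": tier_distribution,
        "mean_literature_score": df["literature_score_normalized"].mean(),
    })
    store.save_dataframe(df, LITERATURE_TABLE_NAME)
    logger.info("literature_evidence_loaded", rows=df.height, tiers=tier_distribution)
```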
**evidence_cmd.py**: Add `literature` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('literature_evidence')), --force flag, --email (required, for NCBI), --api-key (optional, for higher rate limit), --batch-size (default 500, for partial checkpoint saves), load gene universe for gene symbols, call process_literature_evidence, load to DuckDB, save provenance sidecar to data/literature/pubmed.provenance.json. Display summary: genes queried, evidence tier distribution, estimated remaining time. IMPORTANT: This command will take hours. Log progress clearly. Save partial checkpoints to DuckDB every batch-size genes.
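Hypothetical Click wiring for the subcommand (the plan refers to an evidence command group; if the project uses a different CLI framework, translate accordingly -- the command name, options, and body comments below are assumptions, not existing code):

```python
# Hypothetical Click wiring for `evidence literature`; adapt to the project's CLI framework.
import click


@click.command("literature")
@click.option("--email", required=True, help="Contact email required by NCBI E-utilities.")
@click.option("--api-key", default=None, help="NCBI API key (raises rate limit to 10 req/sec).")
@click.option("--batch-size", default=500, show_default=True, help="Genes per partial checkpoint save.")
@click.option("--force", is_flag=True, help="Re-run even if a literature_evidence checkpoint exists.")
def literature_cmd(email: str, api_key: str | None, batch_size: int, force: bool) -> None:
    """Fetch, score, and persist PubMed literature evidence (long-running)."""
    # 1. Skip if has_checkpoint("literature_evidence") and not force.
    # 2. Load the gene universe and map gene_ids to symbols.
    # 3. process_literature_evidence(...), saving partial checkpoints every batch_size genes.
    # 4. load_to_duckdb(...) and write data/literature/pubmed.provenance.json.
```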
**tests/test_literature.py**: Unit tests with synthetic data. Mock Biopython Entrez.esearch. Test cases:
- test_direct_experimental_classification: Gene with knockout paper in cilia context -> "direct_experimental"
- test_functional_mention_classification: Gene with cilia context but no knockout -> "functional_mention"
- test_hts_hit_classification: Gene from proteomics screen in cilia context -> "hts_hit"
- test_incidental_classification: Gene with publications but no cilia/sensory context -> "incidental"
- test_no_evidence: Gene with zero publications -> "none"
- test_bias_mitigation: TP53-like gene (100K total, 5 cilia) scores LOWER than novel gene (10 total, 5 cilia)
- test_quality_weighting: Direct experimental evidence scores higher than incidental mention
- test_null_preservation: Failed PubMed query -> NULL counts, not zero
- test_context_weighting: Cilia/sensory contexts weighted higher than cytoskeleton
- test_score_normalization: Final score in [0, 1]
**tests/test_literature_integration.py**: Integration tests. Mock Entrez.esearch/efetch responses. Test full pipeline with small gene set, checkpoint-restart (verify partial results preserved), provenance recording. Synthetic PubMed response fixtures.
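As one concrete example, the bias-mitigation unit test could look roughly like this (synthetic counts only, no network; import paths follow the planned module layout):

```python
# Sketch of test_bias_mitigation -- synthetic counts, no PubMed calls.
import polars as pl

from usher_pipeline.evidence.literature.transform import (
    classify_evidence_tier,
    compute_literature_score,
)


def test_bias_mitigation():
    df = pl.DataFrame({
        "gene_id": ["ENSG_WELL_STUDIED", "ENSG_NOVEL"],
        "gene_symbol": ["TP53LIKE", "NOVEL1"],
        "total_pubmed_count": [100_000, 10],
        "cilia_context_count": [5, 5],
        "sensory_context_count": [0, 0],
        "cytoskeleton_context_count": [0, 0],
        "cell_polarity_context_count": [0, 0],
        "direct_experimental_count": [1, 1],
        "hts_screen_count": [0, 0],
    })
    scored = compute_literature_score(classify_evidence_tier(df))
    novel = scored.filter(pl.col("gene_symbol") == "NOVEL1")["literature_score_normalized"][0]
    well_studied = scored.filter(pl.col("gene_symbol") == "TP53LIKE")["literature_score_normalized"][0]
    # Focused evidence should beat incidental breadth.
    assert novel > well_studied
```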
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_literature.py tests/test_literature_integration.py -v
</verify>
<done>
All literature unit and integration tests pass. CLI `evidence literature` command registered with --email and --api-key options. DuckDB stores literature_evidence table with context counts, evidence tiers, and quality-weighted scores. Bias mitigation validated: novel genes with focused evidence score higher than well-studied genes with incidental mentions. Checkpoint-restart works for long-running PubMed queries.
</done>
</task>
</tasks>
<verification>
- `python -m pytest tests/test_literature.py tests/test_literature_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.literature import *"` -- all exports importable
- `usher-pipeline evidence literature --help` -- CLI help displays with --email and --api-key options
- DuckDB literature_evidence table has columns: gene_id, gene_symbol, total_pubmed_count, cilia_context_count, sensory_context_count, direct_experimental_count, evidence_tier, literature_score_normalized
- Bias test: Gene with 10 total/5 cilia publications scores higher than gene with 100K total/5 cilia publications
</verification>
<success_criteria>
- LITE-01: Systematic PubMed queries for each gene across cilia, sensory, cytoskeleton, and cell polarity contexts
- LITE-02: Evidence classified into tiers: direct_experimental, functional_mention, hts_hit, incidental, none
- LITE-03: Quality-weighted score with bias mitigation via total publication normalization -- evidence quality matters more than count
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
- User setup: NCBI_API_KEY and NCBI_EMAIL documented for rate limit increase
</success_criteria>
<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-06-SUMMARY.md`
</output>