docs(03): create phase plan
169
.planning/phases/03-core-evidence-layers/03-03-PLAN.md
Normal file
@@ -0,0 +1,169 @@
---
phase: 03-core-evidence-layers
plan: 03
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/evidence/protein/__init__.py
  - src/usher_pipeline/evidence/protein/models.py
  - src/usher_pipeline/evidence/protein/fetch.py
  - src/usher_pipeline/evidence/protein/transform.py
  - src/usher_pipeline/evidence/protein/load.py
  - tests/test_protein.py
  - tests/test_protein_integration.py
  - src/usher_pipeline/cli/evidence_cmd.py
autonomous: true

must_haves:
  truths:
    - "Pipeline extracts protein length, domain composition, and domain count from UniProt/InterPro per gene"
    - "Pipeline identifies coiled-coil regions, scaffold/adaptor domains, and transmembrane domains"
    - "Pipeline checks for known cilia-associated motifs and domain annotations without presupposing conclusions"
    - "Protein features are encoded as binary and continuous features normalized to 0-1 scale"
  artifacts:
    - path: "src/usher_pipeline/evidence/protein/fetch.py"
      provides: "UniProt protein features and InterPro domain retrieval"
      exports: ["fetch_uniprot_features", "fetch_interpro_domains"]
    - path: "src/usher_pipeline/evidence/protein/transform.py"
      provides: "Feature extraction, cilia motif detection, and normalization"
      exports: ["extract_protein_features", "detect_cilia_motifs", "normalize_protein_features", "process_protein_evidence"]
    - path: "src/usher_pipeline/evidence/protein/load.py"
      provides: "DuckDB persistence for protein features"
      exports: ["load_to_duckdb"]
    - path: "tests/test_protein.py"
      provides: "Unit tests for feature extraction and motif detection"
  key_links:
    - from: "src/usher_pipeline/evidence/protein/fetch.py"
      to: "UniProt REST API"
      via: "httpx batch queries with ratelimit"
      pattern: "rest\\.uniprot\\.org"
    - from: "src/usher_pipeline/evidence/protein/fetch.py"
      to: "InterPro REST API"
      via: "httpx with tenacity retry"
      pattern: "interpro.*api"
    - from: "src/usher_pipeline/evidence/protein/transform.py"
      to: "src/usher_pipeline/evidence/protein/fetch.py"
      via: "processes raw features into scored evidence"
      pattern: "detect_cilia_motifs|extract_protein_features"
    - from: "src/usher_pipeline/evidence/protein/load.py"
      to: "src/usher_pipeline/persistence/duckdb_store.py"
      via: "store.save_dataframe"
      pattern: "save_dataframe.*protein_features"
---

<objective>
Implement the Protein Sequence and Structure Features evidence layer (PROT-01/02/03/04): extract protein length, domains, coiled-coils, transmembrane regions, and cilia-associated motifs from UniProt/InterPro, encode as binary and continuous features normalized to 0-1.

Purpose: Usher proteins share structural motifs (coiled-coils, scaffold domains, transmembrane regions). Identifying these features in unstudied genes provides structural evidence for potential cilia/Usher involvement.
Output: protein_features DuckDB table with per-gene protein length, domain count, coiled-coil flag, transmembrane count, cilia motif flags, and normalized composite protein score.
</objective>

<execution_context>
@/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md
@/Users/gbanyan/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py
</context>

<tasks>

<task type="auto">
<name>Task 1: Create protein features data model, fetch, and transform modules</name>
<files>
src/usher_pipeline/evidence/protein/__init__.py
src/usher_pipeline/evidence/protein/models.py
src/usher_pipeline/evidence/protein/fetch.py
src/usher_pipeline/evidence/protein/transform.py
</files>
<action>
Create the protein features evidence layer following the established gnomAD fetch->transform pattern.

**models.py**: Define ProteinFeatureRecord pydantic model with fields: gene_id (str), gene_symbol (str), uniprot_id (str|None), protein_length (int|None), domain_count (int|None), coiled_coil (bool|None -- has coiled-coil region), coiled_coil_count (int|None), transmembrane_count (int|None), scaffold_adaptor_domain (bool|None -- has PDZ, SH3, ankyrin, WD40, or similar scaffold domains), has_cilia_domain (bool|None -- has IFT, BBSome, ciliary targeting, or transition zone domain), has_sensory_domain (bool|None -- has stereocilia, photoreceptor, or sensory-associated domain), protein_score_normalized (float|None -- 0-1 composite). Define PROTEIN_TABLE_NAME = "protein_features". Define CILIA_DOMAIN_KEYWORDS as list: ["IFT", "intraflagellar", "BBSome", "ciliary", "cilia", "basal body", "centrosome", "transition zone", "axoneme"]. Define SCAFFOLD_DOMAIN_TYPES as list: ["PDZ", "SH3", "Ankyrin", "WD40", "Coiled coil", "SAM", "FERM", "Harmonin"].

**fetch.py**: Two fetch functions:
1. `fetch_uniprot_features(uniprot_ids: list[str]) -> pl.DataFrame` -- Query UniProt REST API in batches of 100 accessions. Use search endpoint with fields parameter: `accession,length,ft_domain,ft_coiled,ft_transmem,annotation_score`. Parse JSON response to extract: protein_length, list of domain names, coiled-coil region count, transmembrane region count. Use httpx with tenacity retry (5 attempts, exponential backoff) and ratelimit (200 req/sec). Return DataFrame with uniprot_id and extracted features. NULL for accessions not found.
2. `fetch_interpro_domains(uniprot_ids: list[str]) -> pl.DataFrame` -- Query InterPro REST API for domain annotations per protein. Endpoint: `https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/{accession}`. Use conservative rate limiting (10 req/sec as recommended in RESEARCH.md). Extract domain names, InterPro IDs. Return DataFrame with uniprot_id, domain_names list, interpro_ids list. This supplements UniProt with more detailed domain classification. For >10K proteins, consider InterPro bulk download as fallback (log warning if API too slow).
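
The batching step described for `fetch_uniprot_features` can be sketched as follows. This is a minimal sketch, not the module itself: it assumes the UniProt search endpoint at `https://rest.uniprot.org/uniprotkb/search` accepts `accession:` terms joined with `OR` together with `fields`, `format`, and `size` parameters. The helper names `chunked` and `build_uniprot_params` are hypothetical, and the httpx call, tenacity retry, and rate limiting are left out so the snippet runs offline.

```python
from typing import Dict, Iterator, List

# Assumed endpoint and field list; the field names come from the plan text.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"
UNIPROT_FIELDS = "accession,length,ft_domain,ft_coiled,ft_transmem,annotation_score"
BATCH_SIZE = 100  # batch size stated in the plan


def chunked(ids: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` accessions (hypothetical helper)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_uniprot_params(batch: List[str]) -> Dict[str, str]:
    """Build query parameters for one batch (hypothetical helper)."""
    query = " OR ".join(f"accession:{acc}" for acc in batch)
    return {
        "query": query,
        "fields": UNIPROT_FIELDS,
        "format": "json",
        "size": str(len(batch)),  # ask for at most one result per accession
    }


if __name__ == "__main__":
    accessions = [f"P{i:05d}" for i in range(250)]  # synthetic IDs for illustration
    for batch in chunked(accessions):
        params = build_uniprot_params(batch)
        # In fetch.py this is where httpx.get(UNIPROT_SEARCH_URL, params=params)
        # would be wrapped in the retry and rate-limit decorators.
        print(len(batch), len(params["query"]))
```

Keeping parameter construction separate from the network call makes the batching logic testable without mocking httpx.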

**transform.py**: Four functions:
1. `extract_protein_features(uniprot_df: pl.DataFrame, interpro_df: pl.DataFrame) -> pl.DataFrame` -- Join UniProt and InterPro data on uniprot_id. Compute domain_count from combined sources (deduplicate). Set coiled_coil boolean from UniProt ft_coiled count > 0. Set transmembrane_count from UniProt ft_transmem.
2. `detect_cilia_motifs(df: pl.DataFrame) -> pl.DataFrame` -- Scan domain names for CILIA_DOMAIN_KEYWORDS (case-insensitive substring match). Set has_cilia_domain = True if any domain matches. Scan for SCAFFOLD_DOMAIN_TYPES. Set scaffold_adaptor_domain = True if match found. Set has_sensory_domain based on keywords: ["stereocilia", "photoreceptor", "usher", "harmonin", "cadherin 23", "protocadherin"]. Important: this is pattern matching on domain annotations, NOT presupposing cilia involvement -- it flags structural features that happen to be associated with cilia biology.
3. `normalize_protein_features(df: pl.DataFrame) -> pl.DataFrame` -- Normalize continuous features: protein_length via log-transform rank percentile (0-1), domain_count via rank percentile (0-1), transmembrane_count capped at 20 then /20. Binary features stay as 0/1. Composite protein_score_normalized = 0.15 * length_rank + 0.20 * domain_rank + 0.20 * coiled_coil + 0.20 * transmembrane_normalized + 0.15 * has_cilia_domain + 0.10 * scaffold_adaptor_domain. NULL if no UniProt entry exists for gene.
4. `process_protein_evidence(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- End-to-end: map gene_ids to uniprot_ids -> fetch UniProt -> fetch InterPro -> extract -> detect motifs -> normalize -> collect.
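
The keyword matching and composite score described in items 2 and 3 can be illustrated in plain Python. This is a sketch under stated assumptions, not the polars implementation the plan calls for: the helper names `matches_any`, `rank_percentile`, and `composite_score` are hypothetical, and the rank percentile here is simply the fraction of non-null values less than or equal to each value, which is one of several possible tie conventions.

```python
import bisect
import math
from typing import List, Optional

# Keyword lists as given in the plan for models.py.
CILIA_DOMAIN_KEYWORDS = ["IFT", "intraflagellar", "BBSome", "ciliary", "cilia",
                         "basal body", "centrosome", "transition zone", "axoneme"]
SCAFFOLD_DOMAIN_TYPES = ["PDZ", "SH3", "Ankyrin", "WD40", "Coiled coil",
                         "SAM", "FERM", "Harmonin"]


def matches_any(domain_names: List[str], keywords: List[str]) -> bool:
    """Case-insensitive substring match of any keyword in any domain name."""
    lowered = [name.lower() for name in domain_names]
    return any(kw.lower() in name for kw in keywords for name in lowered)


def rank_percentile(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fraction of non-null values <= each value; None is preserved."""
    present = sorted(v for v in values if v is not None)
    return [
        None if v is None else bisect.bisect_right(present, v) / len(present)
        for v in values
    ]


def composite_score(length_rank: float, domain_rank: float, coiled_coil: bool,
                    transmembrane_count: int, has_cilia_domain: bool,
                    scaffold_adaptor_domain: bool) -> float:
    """Weighted sum with the weights stated in the plan (they add up to 1.0)."""
    tm_norm = min(transmembrane_count, 20) / 20  # cap at 20, then scale to 0-1
    return (0.15 * length_rank + 0.20 * domain_rank + 0.20 * float(coiled_coil)
            + 0.20 * tm_norm + 0.15 * float(has_cilia_domain)
            + 0.10 * float(scaffold_adaptor_domain))


if __name__ == "__main__":
    lengths = [5202, None, 350]  # synthetic values; None means no UniProt entry
    # The log is applied as the plan describes; ranks are unchanged by a
    # monotonic transform, so it does not affect the percentile itself.
    length_ranks = rank_percentile(
        [None if n is None else math.log(n) for n in lengths])
    print(length_ranks)
    print(matches_any(["IFT complex B subunit"], CILIA_DOMAIN_KEYWORDS))
```

One caveat of plain substring matching is that short keywords such as "IFT" or "SAM" also match inside unrelated words (for example "shift" or "same"), so the unit tests may want a negative case for that.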

Follow established patterns: NULL preservation, structlog logging. Note: do NOT use external tools (CoCoNat, Phobius) for coiled-coil/TM prediction -- use UniProt annotations only (already curated). Research mentioned these tools but they add complexity with marginal value at this stage.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.protein import fetch_uniprot_features, fetch_interpro_domains, extract_protein_features, detect_cilia_motifs, normalize_protein_features, process_protein_evidence; print('imports OK')"
</verify>
<done>
Protein fetch module retrieves features from UniProt REST API and InterPro API. Transform module extracts domain features, detects cilia-associated motifs via keyword matching, and normalizes to 0-1 composite. All functions importable.
</done>
</task>

<task type="auto">
<name>Task 2: Create protein DuckDB loader, CLI command, and tests</name>
<files>
src/usher_pipeline/evidence/protein/load.py
src/usher_pipeline/cli/evidence_cmd.py
tests/test_protein.py
tests/test_protein_integration.py
</files>
<action>
**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "protein_features" table. Record provenance: gene count, genes with cilia domains, genes with scaffold domains, genes with coiled-coils, genes with TM domains, mean domain count, NULL UniProt count. Create `query_cilia_candidates(store) -> pl.DataFrame` helper querying genes with has_cilia_domain=True OR (coiled_coil=True AND scaffold_adaptor_domain=True).
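
The filter behind `query_cilia_candidates` can be sketched as a SQL predicate. In this sketch the standard-library `sqlite3` module stands in for DuckDB so that it runs without dependencies; the project's store API is not shown in this plan, so the function name `query_cilia_candidates_sketch` and the reduced table schema are assumptions for illustration only.

```python
import sqlite3
from typing import List

# Predicate from the plan: cilia domain, OR coiled-coil together with a scaffold domain.
CANDIDATE_SQL = """
SELECT gene_id
FROM protein_features
WHERE has_cilia_domain
   OR (coiled_coil AND scaffold_adaptor_domain)
ORDER BY gene_id
"""


def query_cilia_candidates_sketch(conn: sqlite3.Connection) -> List[str]:
    """Return gene IDs matching the candidate predicate (stand-in for the store helper)."""
    return [row[0] for row in conn.execute(CANDIDATE_SQL)]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with synthetic rows, including an all-NULL gene."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein_features ("
        "gene_id TEXT, has_cilia_domain BOOLEAN, "
        "coiled_coil BOOLEAN, scaffold_adaptor_domain BOOLEAN)"
    )
    rows = [
        ("G1", True, False, False),   # cilia domain only: candidate
        ("G2", False, True, True),    # coiled-coil plus scaffold: candidate
        ("G3", False, True, False),   # coiled-coil alone: not a candidate
        ("G4", None, None, None),     # no UniProt entry: NULLs never match
    ]
    conn.executemany("INSERT INTO protein_features VALUES (?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    print(query_cilia_candidates_sketch(build_demo_db()))
```

Because a NULL predicate is not true in SQL, genes without a UniProt entry drop out of the result without an explicit `IS NOT NULL` check, which is consistent with the NULL-preservation pattern.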

**evidence_cmd.py**: Add `protein` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('protein_features')), --force flag, load gene universe for gene_ids and uniprot mappings, call process_protein_evidence, load to DuckDB, save provenance sidecar to data/protein/features.provenance.json. Display summary: genes with UniProt data, cilia domain count, scaffold domain count, coiled-coil count.

**tests/test_protein.py**: Unit tests with synthetic data, mock httpx for UniProt/InterPro. Test cases:
- test_uniprot_feature_extraction: Correct parsing of length, domain, coiled-coil, TM from UniProt JSON
- test_cilia_motif_detection_positive: Domain name containing "IFT" -> has_cilia_domain=True
- test_cilia_motif_detection_negative: Standard domain (e.g., "Kinase") -> has_cilia_domain=False
- test_scaffold_detection: PDZ domain -> scaffold_adaptor_domain=True
- test_null_uniprot: Gene without UniProt entry -> all features NULL
- test_normalization_bounds: All features in [0, 1]
- test_composite_score_cilia_gene: Gene with cilia domains scores higher
- test_composite_score_null_handling: NULL UniProt -> NULL composite
- test_domain_keyword_case_insensitive: "intraflagellar" matches case-insensitively

**tests/test_protein_integration.py**: Integration tests. Mock UniProt and InterPro API responses. Test full pipeline, checkpoint-restart, provenance recording. Synthetic UniProt JSON fixtures with realistic domain structures.
</action>
<verify>
cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_protein.py tests/test_protein_integration.py -v
</verify>
<done>
All protein unit and integration tests pass. CLI `evidence protein` command registered. DuckDB stores protein_features table with domain counts, motif flags, and normalized score. Checkpoint-restart works.
</done>
</task>

</tasks>

<verification>
- `python -m pytest tests/test_protein.py tests/test_protein_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.protein import *"` -- all exports importable
- `usher-pipeline evidence protein --help` -- CLI help displays
- DuckDB protein_features table has columns: gene_id, gene_symbol, protein_length, domain_count, coiled_coil, transmembrane_count, has_cilia_domain, scaffold_adaptor_domain, protein_score_normalized
</verification>

<success_criteria>
- PROT-01: Protein length, domain composition, domain count extracted from UniProt/InterPro per gene
- PROT-02: Coiled-coil, scaffold/adaptor, and transmembrane domains identified
- PROT-03: Cilia-associated motifs detected via domain keyword matching without presupposing conclusions
- PROT-04: Binary and continuous protein features normalized to 0-1 composite score
- Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure
</success_criteria>

<output>
After completion, create `.planning/phases/03-core-evidence-layers/03-03-SUMMARY.md`
</output>