2026-02-11 18:46:28 +08:00

---
phase: 03-core-evidence-layers
plan: 03
type: execute
wave: 1
depends_on: []
files_modified:
  - src/usher_pipeline/evidence/protein/__init__.py
  - src/usher_pipeline/evidence/protein/models.py
  - src/usher_pipeline/evidence/protein/fetch.py
  - src/usher_pipeline/evidence/protein/transform.py
  - src/usher_pipeline/evidence/protein/load.py
  - tests/test_protein.py
  - tests/test_protein_integration.py
  - src/usher_pipeline/cli/evidence_cmd.py
autonomous: true
must_haves:
  truths:
    - "Pipeline extracts protein length, domain composition, and domain count from UniProt/InterPro per gene"
    - "Pipeline identifies coiled-coil regions, scaffold/adaptor domains, and transmembrane domains"
    - "Pipeline checks for known cilia-associated motifs and domain annotations without presupposing conclusions"
    - "Protein features are encoded as binary and continuous features normalized to 0-1 scale"
  artifacts:
    - path: src/usher_pipeline/evidence/protein/fetch.py
      provides: UniProt protein features and InterPro domain retrieval
      exports:
        - fetch_uniprot_features
        - fetch_interpro_domains
    - path: src/usher_pipeline/evidence/protein/transform.py
      provides: Feature extraction, cilia motif detection, and normalization
      exports:
        - extract_protein_features
        - detect_cilia_motifs
        - normalize_protein_features
        - process_protein_evidence
    - path: src/usher_pipeline/evidence/protein/load.py
      provides: DuckDB persistence for protein features
      exports:
        - load_to_duckdb
    - path: tests/test_protein.py
      provides: Unit tests for feature extraction and motif detection
  key_links:
    - from: src/usher_pipeline/evidence/protein/fetch.py
      to: UniProt REST API
      via: httpx batch queries with ratelimit
      pattern: "rest.uniprot.org"
    - from: src/usher_pipeline/evidence/protein/fetch.py
      to: InterPro REST API
      via: httpx with tenacity retry
      pattern: "interpro.*api"
    - from: src/usher_pipeline/evidence/protein/transform.py
      to: src/usher_pipeline/evidence/protein/fetch.py
      via: processes raw features into scored evidence
      pattern: "detect_cilia_motifs|extract_protein_features"
    - from: src/usher_pipeline/evidence/protein/load.py
      to: src/usher_pipeline/persistence/duckdb_store.py
      via: store.save_dataframe
      pattern: "save_dataframe.*protein_features"
---
Implement the Protein Sequence and Structure Features evidence layer (PROT-01/02/03/04): extract protein length, domains, coiled-coils, transmembrane regions, and cilia-associated motifs from UniProt/InterPro, encode as binary and continuous features normalized to 0-1.

Purpose: Usher proteins share structural motifs (coiled-coils, scaffold domains, transmembrane regions). Identifying these features in unstudied genes provides structural evidence for potential cilia/Usher involvement.

Output: protein_features DuckDB table with per-gene protein length, domain count, coiled-coil flag, transmembrane count, cilia motif flags, and normalized composite protein score.

<execution_context> @/Users/gbanyan/.claude/get-shit-done/workflows/execute-plan.md @/Users/gbanyan/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-core-evidence-layers/03-RESEARCH.md
@.planning/phases/02-prototype-evidence-layer/02-01-SUMMARY.md
@.planning/phases/02-prototype-evidence-layer/02-02-SUMMARY.md
@src/usher_pipeline/evidence/gnomad/fetch.py
@src/usher_pipeline/evidence/gnomad/transform.py
@src/usher_pipeline/evidence/gnomad/load.py
@src/usher_pipeline/cli/evidence_cmd.py
@src/usher_pipeline/persistence/duckdb_store.py

Task 1: Create protein features data model, fetch, and transform modules

Files:
- src/usher_pipeline/evidence/protein/__init__.py
- src/usher_pipeline/evidence/protein/models.py
- src/usher_pipeline/evidence/protein/fetch.py
- src/usher_pipeline/evidence/protein/transform.py

Create the protein features evidence layer following the established gnomAD fetch->transform pattern.

**models.py**: Define ProteinFeatureRecord pydantic model with fields: gene_id (str), gene_symbol (str), uniprot_id (str|None), protein_length (int|None), domain_count (int|None), coiled_coil (bool|None -- has coiled-coil region), coiled_coil_count (int|None), transmembrane_count (int|None), scaffold_adaptor_domain (bool|None -- has PDZ, SH3, ankyrin, WD40, or similar scaffold domains), has_cilia_domain (bool|None -- has IFT, BBSome, ciliary targeting, or transition zone domain), has_sensory_domain (bool|None -- has stereocilia, photoreceptor, or sensory-associated domain), protein_score_normalized (float|None -- 0-1 composite). Define PROTEIN_TABLE_NAME = "protein_features". Define CILIA_DOMAIN_KEYWORDS as list: ["IFT", "intraflagellar", "BBSome", "ciliary", "cilia", "basal body", "centrosome", "transition zone", "axoneme"]. Define SCAFFOLD_DOMAIN_TYPES as list: ["PDZ", "SH3", "Ankyrin", "WD40", "Coiled coil", "SAM", "FERM", "Harmonin"].
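
The field list above can be sketched as follows. This is a minimal illustration only: the plan calls for a pydantic model, and a stdlib dataclass is swapped in here so the snippet runs without dependencies. The `__post_init__` range check is an assumption added for illustration and is not specified in the plan.

```python
from dataclasses import dataclass
from typing import Optional

PROTEIN_TABLE_NAME = "protein_features"

CILIA_DOMAIN_KEYWORDS = [
    "IFT", "intraflagellar", "BBSome", "ciliary", "cilia",
    "basal body", "centrosome", "transition zone", "axoneme",
]
SCAFFOLD_DOMAIN_TYPES = [
    "PDZ", "SH3", "Ankyrin", "WD40", "Coiled coil", "SAM", "FERM", "Harmonin",
]


@dataclass
class ProteinFeatureRecord:
    """Dataclass stand-in for the pydantic model described in the plan."""

    gene_id: str
    gene_symbol: str
    uniprot_id: Optional[str] = None
    protein_length: Optional[int] = None
    domain_count: Optional[int] = None
    coiled_coil: Optional[bool] = None              # has coiled-coil region
    coiled_coil_count: Optional[int] = None
    transmembrane_count: Optional[int] = None
    scaffold_adaptor_domain: Optional[bool] = None  # PDZ, SH3, ankyrin, WD40, ...
    has_cilia_domain: Optional[bool] = None
    has_sensory_domain: Optional[bool] = None
    protein_score_normalized: Optional[float] = None  # 0-1 composite

    def __post_init__(self) -> None:
        # Assumed validation: reject composite scores outside [0, 1].
        score = self.protein_score_normalized
        if score is not None and not 0.0 <= score <= 1.0:
            raise ValueError("protein_score_normalized must be within [0, 1]")
```

Every feature field defaults to None, so a gene without a UniProt entry is representable with only gene_id and gene_symbol set.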

**fetch.py**: Two fetch functions:
1. `fetch_uniprot_features(uniprot_ids: list[str]) -> pl.DataFrame` -- Query UniProt REST API in batches of 100 accessions. Use search endpoint with fields parameter: `accession,length,ft_domain,ft_coiled,ft_transmem,annotation_score`. Parse JSON response to extract: protein_length, list of domain names, coiled-coil region count, transmembrane region count. Use httpx with tenacity retry (5 attempts, exponential backoff) and ratelimit (200 req/sec). Return DataFrame with uniprot_id and extracted features. NULL for accessions not found.
2. `fetch_interpro_domains(uniprot_ids: list[str]) -> pl.DataFrame` -- Query InterPro REST API for domain annotations per protein. Endpoint: `https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/{accession}`. Use conservative rate limiting (10 req/sec as recommended in RESEARCH.md). Extract domain names, InterPro IDs. Return DataFrame with uniprot_id, domain_names list, interpro_ids list. This supplements UniProt with more detailed domain classification. For >10K proteins, consider InterPro bulk download as fallback (log warning if API too slow).
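
The batching and parsing steps of `fetch_uniprot_features` might look roughly like the sketch below. It leaves out the HTTP layer (httpx, tenacity, ratelimit) and works on plain dicts instead of a DataFrame. The JSON keys used (`primaryAccession`, `sequence.length`, `features[].type`, `features[].description`) and the `accession:X OR accession:Y` query shape are assumptions about the UniProt response and search syntax that should be checked against the API documentation. The helper names are hypothetical.

```python
from typing import Any, Iterator

BATCH_SIZE = 100  # batch size named in the plan


def batched(ids: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` accessions."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_accession_query(batch: list[str]) -> str:
    """Join accessions into one search query string (assumed query shape)."""
    return " OR ".join(f"accession:{acc}" for acc in batch)


def parse_uniprot_entry(entry: dict[str, Any]) -> dict[str, Any]:
    """Reduce one (assumed) UniProt JSON entry to the features the plan lists."""
    features = entry.get("features", [])

    def count(kind: str) -> int:
        return sum(1 for f in features if f.get("type") == kind)

    return {
        "uniprot_id": entry.get("primaryAccession"),
        "protein_length": entry.get("sequence", {}).get("length"),
        "domain_names": [
            f.get("description", "") for f in features if f.get("type") == "Domain"
        ],
        "coiled_coil_count": count("Coiled coil"),
        "transmembrane_count": count("Transmembrane"),
    }


def rows_for_batch(batch: list[str], entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one row per requested accession, with None features when not found."""
    found = {e.get("primaryAccession"): parse_uniprot_entry(e) for e in entries}
    empty = {
        "protein_length": None,
        "domain_names": None,
        "coiled_coil_count": None,
        "transmembrane_count": None,
    }
    return [found.get(acc, {"uniprot_id": acc, **empty}) for acc in batch]
```

Emitting a row for every requested accession, found or not, keeps the NULL-preservation rule visible at the fetch stage rather than relying on a later join to reintroduce missing proteins.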

**transform.py**: Four functions:
1. `extract_protein_features(uniprot_df: pl.DataFrame, interpro_df: pl.DataFrame) -> pl.DataFrame` -- Join UniProt and InterPro data on uniprot_id. Compute domain_count from combined sources (deduplicate). Set coiled_coil boolean from UniProt ft_coiled count > 0. Set transmembrane_count from UniProt ft_transmem.
2. `detect_cilia_motifs(df: pl.DataFrame) -> pl.DataFrame` -- Scan domain names for CILIA_DOMAIN_KEYWORDS (case-insensitive substring match). Set has_cilia_domain = True if any domain matches. Scan for SCAFFOLD_DOMAIN_TYPES. Set scaffold_adaptor_domain = True if match found. Set has_sensory_domain based on keywords: ["stereocilia", "photoreceptor", "usher", "harmonin", "cadherin 23", "protocadherin"]. Important: this is pattern matching on domain annotations, NOT presupposing cilia involvement -- it flags structural features that happen to be associated with cilia biology.
3. `normalize_protein_features(df: pl.DataFrame) -> pl.DataFrame` -- Normalize continuous features: protein_length via log-transform rank percentile (0-1), domain_count via rank percentile (0-1), transmembrane_count capped at 20 then /20. Binary features stay as 0/1. Composite protein_score_normalized = 0.15 * length_rank + 0.20 * domain_rank + 0.20 * coiled_coil + 0.20 * transmembrane_normalized + 0.15 * has_cilia_domain + 0.10 * scaffold_adaptor_domain. NULL if no UniProt entry exists for gene.
4. `process_protein_evidence(gene_ids: list[str], uniprot_mapping: pl.DataFrame) -> pl.DataFrame` -- End-to-end: map gene_ids to uniprot_ids -> fetch UniProt -> fetch InterPro -> extract -> detect motifs -> normalize -> collect.
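
The normalization and composite formula in step 3 can be illustrated in plain Python. This is a sketch under stated assumptions, not the polars implementation the plan calls for: "rank percentile" is taken to mean the fraction of non-null values less than or equal to the value, and the composite is set to None whenever any input is None (the plan only states NULL when no UniProt entry exists). The function names are hypothetical.

```python
import bisect
import math
from typing import Optional, Sequence


def rank_percentile(values: Sequence[Optional[float]]) -> list[Optional[float]]:
    """Fraction of non-null values <= each value; None stays None."""
    present = sorted(v for v in values if v is not None)
    n = len(present)
    # bisect_right counts how many sorted values are <= v.
    return [
        None if v is None else bisect.bisect_right(present, v) / n
        for v in values
    ]


def length_rank(lengths: Sequence[Optional[int]]) -> list[Optional[float]]:
    """Log-transform lengths, then rank. The log is monotone, so the ranks
    equal those of the raw lengths; it is kept here to mirror the plan."""
    logged = [None if v is None else math.log(v) for v in lengths]
    return rank_percentile(logged)


def composite_score(
    length_pct: Optional[float],
    domain_pct: Optional[float],
    coiled_coil: Optional[bool],
    transmembrane_count: Optional[int],
    has_cilia_domain: Optional[bool],
    scaffold_adaptor_domain: Optional[bool],
) -> Optional[float]:
    """Weighted sum from the plan; weights add up to 1.0."""
    parts = (length_pct, domain_pct, coiled_coil, transmembrane_count,
             has_cilia_domain, scaffold_adaptor_domain)
    if any(p is None for p in parts):
        return None  # assumption: any missing input yields a NULL composite
    tm_norm = min(transmembrane_count, 20) / 20  # cap at 20, then divide by 20
    return (
        0.15 * length_pct
        + 0.20 * domain_pct
        + 0.20 * float(coiled_coil)
        + 0.20 * tm_norm
        + 0.15 * float(has_cilia_domain)
        + 0.10 * float(scaffold_adaptor_domain)
    )
```

Because every term lies in [0, 1] and the weights sum to 1.0, the composite is bounded by [0, 1] without a final clipping step.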

Follow established patterns: NULL preservation, structlog logging. Note: do NOT use external tools (CoCoNat, Phobius) for coiled-coil/TM prediction -- use UniProt annotations only (already curated). Research mentioned these tools but they add complexity with marginal value at this stage.

Verify: cd /Users/gbanyan/Project/usher-exploring && python -c "from usher_pipeline.evidence.protein import fetch_uniprot_features, fetch_interpro_domains, extract_protein_features, detect_cilia_motifs, normalize_protein_features, process_protein_evidence; print('imports OK')"

Done: Protein fetch module retrieves features from UniProt REST API and InterPro API. Transform module extracts domain features, detects cilia-associated motifs via keyword matching, and normalizes to 0-1 composite. All functions importable.

Task 2: Create protein DuckDB loader, CLI command, and tests

Files:
- src/usher_pipeline/evidence/protein/load.py
- src/usher_pipeline/cli/evidence_cmd.py
- tests/test_protein.py
- tests/test_protein_integration.py

**load.py**: Follow gnomad/load.py pattern. Create `load_to_duckdb(df, store, provenance, description)` saving to "protein_features" table. Record provenance: gene count, genes with cilia domains, genes with scaffold domains, genes with coiled-coils, genes with TM domains, mean domain count, NULL UniProt count. Create `query_cilia_candidates(store) -> pl.DataFrame` helper querying genes with has_cilia_domain=True OR (coiled_coil=True AND scaffold_adaptor_domain=True).
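
The candidate filter described for `query_cilia_candidates` can be sketched as a SQL predicate. To stay dependency-free, the example runs against an in-memory SQLite database as a stand-in for DuckDB and returns plain tuples instead of a polars DataFrame. The function name and the reduced column set are hypothetical, and the real helper would go through the project's store object.

```python
import sqlite3

# Predicate from the plan: cilia domain, or coiled-coil plus scaffold domain.
CILIA_CANDIDATE_SQL = """
SELECT gene_id, gene_symbol
FROM protein_features
WHERE has_cilia_domain = TRUE
   OR (coiled_coil = TRUE AND scaffold_adaptor_domain = TRUE)
ORDER BY gene_id
"""


def query_cilia_candidates_sqlite(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Hypothetical stand-in for query_cilia_candidates(store)."""
    return conn.execute(CILIA_CANDIDATE_SQL).fetchall()


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny synthetic protein_features table for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein_features ("
        "gene_id TEXT, gene_symbol TEXT, has_cilia_domain BOOLEAN, "
        "coiled_coil BOOLEAN, scaffold_adaptor_domain BOOLEAN)"
    )
    rows = [
        ("G1", "SYN1", True, False, False),   # cilia domain only
        ("G2", "SYN2", False, True, True),    # coiled-coil and scaffold
        ("G3", "SYN3", False, True, False),   # coiled-coil only
        ("G4", "SYN4", None, None, None),     # no UniProt entry
    ]
    conn.executemany("INSERT INTO protein_features VALUES (?, ?, ?, ?, ?)", rows)
    return conn
```

Rows with NULL flags drop out of the result because a comparison against NULL is not true in SQL, so genes without UniProt data are never reported as candidates.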
**evidence_cmd.py**: Add `protein` subcommand to evidence command group. Follow gnomad pattern: checkpoint check (has_checkpoint('protein_features')), --force flag, load gene universe for gene_ids and uniprot mappings, call process_protein_evidence, load to DuckDB, save provenance sidecar to data/protein/features.provenance.json. Display summary: genes with UniProt data, cilia domain count, scaffold domain count, coiled-coil count.

**tests/test_protein.py**: Unit tests with synthetic data, mock httpx for UniProt/InterPro. Test cases:
- test_uniprot_feature_extraction: Correct parsing of length, domain, coiled-coil, TM from UniProt JSON
- test_cilia_motif_detection_positive: Domain name containing "IFT" -> has_cilia_domain=True
- test_cilia_motif_detection_negative: Standard domain (e.g., "Kinase") -> has_cilia_domain=False
- test_scaffold_detection: PDZ domain -> scaffold_adaptor_domain=True
- test_null_uniprot: Gene without UniProt entry -> all features NULL
- test_normalization_bounds: All features in [0, 1]
- test_composite_score_cilia_gene: Gene with cilia domains scores higher
- test_composite_score_null_handling: NULL UniProt -> NULL composite
- test_domain_keyword_case_insensitive: "intraflagellar" matches case-insensitively
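
The motif-detection cases in this list could be exercised with a sketch like the one below. It uses a plain-Python helper, `has_keyword`, as a hypothetical stand-in for the keyword scan inside `detect_cilia_motifs`; the real function operates on a polars DataFrame. The domain names are synthetic.

```python
from typing import Optional, Sequence

CILIA_DOMAIN_KEYWORDS = [
    "IFT", "intraflagellar", "BBSome", "ciliary", "cilia",
    "basal body", "centrosome", "transition zone", "axoneme",
]


def has_keyword(
    domain_names: Optional[Sequence[str]], keywords: Sequence[str]
) -> Optional[bool]:
    """Case-insensitive substring match; None input stays None."""
    if domain_names is None:
        return None
    lowered = [name.lower() for name in domain_names]
    return any(kw.lower() in name for name in lowered for kw in keywords)


def test_cilia_motif_detection_positive() -> None:
    assert has_keyword(["IFT complex B subunit"], CILIA_DOMAIN_KEYWORDS) is True


def test_cilia_motif_detection_negative() -> None:
    assert has_keyword(["Protein kinase"], CILIA_DOMAIN_KEYWORDS) is False


def test_domain_keyword_case_insensitive() -> None:
    assert has_keyword(["INTRAFLAGELLAR transport"], CILIA_DOMAIN_KEYWORDS) is True


def test_null_uniprot() -> None:
    assert has_keyword(None, CILIA_DOMAIN_KEYWORDS) is None
```

Plain substring matching means a short keyword such as "IFT" also matches inside unrelated words (for example "shift"), so a negative test with such a name may be worth adding if word-boundary matching is adopted later.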

**tests/test_protein_integration.py**: Integration tests. Mock UniProt and InterPro API responses. Test full pipeline, checkpoint-restart, provenance recording. Synthetic UniProt JSON fixtures with realistic domain structures.

Verify: cd /Users/gbanyan/Project/usher-exploring && python -m pytest tests/test_protein.py tests/test_protein_integration.py -v

Done: All protein unit and integration tests pass. CLI `evidence protein` command registered. DuckDB stores protein_features table with domain counts, motif flags, and normalized score. Checkpoint-restart works.

Verification:
- `python -m pytest tests/test_protein.py tests/test_protein_integration.py -v` -- all tests pass
- `python -c "from usher_pipeline.evidence.protein import *"` -- all exports importable
- `usher-pipeline evidence protein --help` -- CLI help displays
- DuckDB protein_features table has columns: gene_id, gene_symbol, protein_length, domain_count, coiled_coil, transmembrane_count, has_cilia_domain, scaffold_adaptor_domain, protein_score_normalized

<success_criteria>

  • PROT-01: Protein length, domain composition, domain count extracted from UniProt/InterPro per gene
  • PROT-02: Coiled-coil, scaffold/adaptor, and transmembrane domains identified
  • PROT-03: Cilia-associated motifs detected via domain keyword matching without presupposing conclusions
  • PROT-04: Binary and continuous protein features normalized to 0-1 composite score
  • Pattern compliance: fetch->transform->load->CLI->tests matching evidence layer structure

</success_criteria>

After completion, create `.planning/phases/03-core-evidence-layers/03-03-SUMMARY.md`