ed21f18a98
fix(03-05): handle NULL columns and deprecated polars API in animal models
...
- Add NULL/empty column checks in fetch_ortholog_mapping
- Fix NULL handling in filter_sensory_phenotypes with is_not_null guard
- Replace deprecated str.concat with str.join
- Add explicit schema to empty DataFrames for consistency
2026-02-11 20:38:36 +08:00
d8009f1236
docs(03-04): complete subcellular localization evidence layer
...
- Create SUMMARY.md with full implementation details
- Update STATE.md: progress 40%, 8/20 plans complete
- Document 4 key decisions (evidence terminology, NULL semantics, embedded proteomics, evidence weighting)
- All verification criteria met: 17/17 tests pass, CLI functional, DuckDB integration complete
2026-02-11 19:08:01 +08:00
46059874f2
feat(03-03): implement protein evidence layer with UniProt/InterPro integration
...
- Create protein features data model with domain, coiled-coil, TM, cilia motifs
- Implement fetch.py with UniProt REST API and InterPro API queries
- Implement transform.py with feature extraction, motif detection, normalization
- Implement load.py with DuckDB persistence and provenance tracking
- Add CLI protein command following evidence layer pattern
- Add comprehensive unit and integration tests (all passing)
- Handle NULL preservation and List(Null) edge case
- Add get_steps() method to ProvenanceTracker for test compatibility
2026-02-11 19:07:30 +08:00
bcd3c4ffbe
feat(03-05): add animal model DuckDB loader, CLI, and comprehensive tests
...
- load.py: DuckDB persistence with provenance tracking, ortholog confidence distribution stats
- CLI animal-models command: checkpoint-restart pattern, top scoring genes display
- 10 unit tests: ortholog confidence scoring, keyword filtering, multi-organism bonus, NULL preservation
- 4 integration tests: full pipeline, checkpoint-restart, provenance tracking, empty phenotype handling
- All tests pass (14/14): validates fetch->transform->load->CLI flow
- Fixed polars deprecations: str.join replaces str.concat, pl.len replaces pl.count
2026-02-11 19:06:49 +08:00
942aaf2ec3
feat(03-04): add localization CLI command and comprehensive tests
...
- Add localization subcommand to evidence command group
- Implement checkpoint-restart pattern for HPA download
- Display summary with evidence type distribution
- Create 17 unit and integration tests (all pass)
- Test HPA parsing, evidence classification, scoring, and DuckDB persistence
- Fix evidence type terminology (computational vs predicted) for consistency
- Mock HTTP calls in integration tests for reproducibility
2026-02-11 19:05:22 +08:00
d70239c4ce
feat(03-01): add annotation DuckDB loader, CLI command, and tests
...
- Create load_to_duckdb with provenance tracking and tier distribution stats
- Add query_poorly_annotated helper to find under-studied genes
- Register `evidence annotation` CLI command with checkpoint-restart pattern
- Add comprehensive unit tests (9 tests) covering GO extraction, NULL handling, tier classification, score normalization, weighting
- Add integration tests (6 tests) for pipeline, idempotency, checkpoint-restart, provenance, queries
- All 15 tests pass with proper NULL preservation and schema validation
2026-02-11 19:03:10 +08:00
0e389c7e41
feat(03-05): implement animal model evidence fetch and transform
...
- models.py: AnimalModelRecord with ortholog confidence, phenotype flags, and normalized scoring
- fetch.py: Retrieve orthologs from HCOP, phenotypes from MGI/ZFIN/IMPC with retry
- transform.py: Filter sensory/cilia-relevant phenotypes, score with confidence weighting
- Ortholog confidence: HIGH (8+ sources), MEDIUM (4-7), LOW (1-3)
- Scoring: mouse +0.4, zebrafish +0.3, IMPC +0.3, weighted by confidence
- NULL preservation: no ortholog = NULL score (not zero)
2026-02-11 19:00:24 +08:00
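The confidence tiers and scoring rules listed in 0e389c7e41 can be sketched roughly as follows. The source-count cutoffs and per-organism increments come from the commit message; the multipliers in `CONFIDENCE_WEIGHT`, the function names, and the choice to multiply the summed score by the tier weight are assumptions made for illustration, not the code in transform.py.

```python
from typing import Optional

# Hypothetical multipliers: the commit says scores are "weighted by
# confidence" but does not state the weights.
CONFIDENCE_WEIGHT = {"HIGH": 1.0, "MEDIUM": 0.7, "LOW": 0.4}


def ortholog_confidence(n_sources: int) -> Optional[str]:
    """Map the number of supporting sources to a confidence tier."""
    if n_sources >= 8:
        return "HIGH"
    if n_sources >= 4:
        return "MEDIUM"
    if n_sources >= 1:
        return "LOW"
    return None  # no ortholog found


def animal_model_score(
    n_sources: int, mouse: bool, zebrafish: bool, impc: bool
) -> Optional[float]:
    """Sum per-organism evidence, then scale by ortholog confidence."""
    tier = ortholog_confidence(n_sources)
    if tier is None:
        # NULL preservation: no ortholog means "unknown", not zero.
        return None
    raw = 0.4 * mouse + 0.3 * zebrafish + 0.3 * impc
    return raw * CONFIDENCE_WEIGHT[tier]
```

Returning `None` rather than `0.0` keeps "no ortholog" distinguishable from "ortholog with no relevant phenotype", which is the distinction the commit calls out.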
8aa66987f8
feat(03-06): implement literature evidence models, PubMed fetch, and scoring
...
- Create LiteratureRecord pydantic model with context-specific counts
- Implement PubMed query via Biopython Entrez with rate limiting (3/sec default, 10/sec with API key)
- Define SEARCH_CONTEXTS for cilia, sensory, cytoskeleton, cell_polarity queries
- Implement evidence tier classification: direct_experimental > functional_mention > hts_hit > incidental > none
- Implement quality-weighted scoring with bias mitigation via log2(total_pubmed_count) normalization
- Add biopython>=1.84 dependency to pyproject.toml
- Support checkpoint-restart for long-running PubMed queries (estimated 3-11 hours for 20K genes)
2026-02-11 19:00:20 +08:00
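As a rough illustration of the bias mitigation described in 8aa66987f8, the sketch below divides a quality-weighted context count by a log2 term of the gene's total PubMed count, so heavily studied genes do not dominate. The tier names and their order come from the commit; the numeric `TIER_WEIGHT` values, the `+ 2` offset in the denominator, and the function name are assumptions, not the project's actual formula.

```python
import math
from typing import Optional

# Hypothetical quality weights; the commit gives the tier ordering only.
TIER_WEIGHT = {
    "direct_experimental": 1.0,
    "functional_mention": 0.6,
    "hts_hit": 0.3,
    "incidental": 0.1,
    "none": 0.0,
}


def literature_score(
    tier: str,
    context_count: Optional[int],
    total_pubmed_count: Optional[int],
) -> Optional[float]:
    """Quality-weighted context count, damped by overall publication volume."""
    if context_count is None or total_pubmed_count is None:
        return None  # unknown stays unknown
    # Assumed offset: log2(total + 2) is at least 1, so genes with zero or
    # one publication do not cause a division by zero.
    denominator = math.log2(total_pubmed_count + 2)
    return TIER_WEIGHT[tier] * context_count / denominator
```

With these assumed weights, two genes with five cilia-context papers each score differently when one has 2 total publications and the other has 1022, since the second is divided by a much larger log term.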
6645c59b0b
feat(03-04): create localization evidence data model and processing
...
- Define LocalizationRecord model with HPA and proteomics fields
- Implement fetch_hpa_subcellular to download HPA bulk data
- Implement fetch_cilia_proteomics with curated reference gene sets
- Implement classify_evidence_type (experimental vs computational)
- Implement score_localization with cilia proximity scoring
- Implement process_localization_evidence end-to-end pipeline
- Create load_to_duckdb for persistence with provenance
2026-02-11 19:00:09 +08:00
adbb74b965
feat(03-01): implement annotation evidence fetch and transform modules
...
- Create AnnotationRecord model with GO counts, UniProt scores, tier classification
- Implement fetch_go_annotations using mygene.info batch queries
- Implement fetch_uniprot_scores using UniProt REST API
- Add classify_annotation_tier with 3-tier system (well/partial/poor)
- Add normalize_annotation_score with weighted composite (GO 50%, UniProt 30%, Pathway 20%)
- Implement process_annotation_evidence end-to-end pipeline
- Follow NULL preservation pattern from gnomAD (unknown != zero)
- Use lazy polars evaluation where applicable
2026-02-11 18:58:45 +08:00
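The weighted composite from adbb74b965 (GO 50%, UniProt 30%, Pathway 20%) could look something like the sketch below. The weights are from the commit message; the function name, the expectation that each component is already scaled to [0, 1], and the rule that any missing component makes the whole score NULL are assumptions for illustration.

```python
from typing import Optional

# Weights as stated in the commit message.
WEIGHTS = {"go": 0.5, "uniprot": 0.3, "pathway": 0.2}


def annotation_composite(
    go: Optional[float],
    uniprot: Optional[float],
    pathway: Optional[float],
) -> Optional[float]:
    """Combine pre-normalized component scores into one composite score."""
    components = {"go": go, "uniprot": uniprot, "pathway": pathway}
    # Assumed NULL rule: if any component is unknown, the composite is
    # unknown too, rather than silently treating the gap as zero.
    if any(value is None for value in components.values()):
        return None
    return sum(WEIGHTS[name] * value for name, value in components.items())
```

A different NULL policy (for example, renormalizing over the components that are present) would also be consistent with "unknown != zero"; the commit does not say which one the pipeline uses.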
ee27f3ad2f
feat(02-02): add DuckDB loader and CLI evidence command for gnomAD
...
- load_to_duckdb: Saves constraint DataFrame to gnomad_constraint table with provenance tracking
- query_constrained_genes: Queries constrained genes by LOEUF threshold (validates GCON-03 interpretation)
- evidence_cmd.py: CLI command group with gnomad subcommand (fetch->transform->load orchestration)
- Checkpoint-restart: Skips processing if gnomad_constraint table exists (--force to override)
- Full CLI: usher-pipeline evidence gnomad [--force] [--url URL] [--min-depth N] [--min-cds-pct N]
2026-02-11 18:19:07 +08:00
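The checkpoint-restart behaviour in ee27f3ad2f (skip when the table exists, rerun with `--force`) reduces to a small guard. The sketch below is storage-agnostic: `run_with_checkpoint`, `existing_tables`, and `pipeline` are hypothetical names, and the real command presumably asks DuckDB for its table list instead of receiving one as an argument.

```python
from typing import Callable, Iterable


def run_with_checkpoint(
    table: str,
    existing_tables: Iterable[str],
    pipeline: Callable[[], None],
    force: bool = False,
) -> str:
    """Run `pipeline` unless `table` already exists and `force` is not set."""
    if table in set(existing_tables) and not force:
        return "skipped"  # checkpoint hit: earlier results are reused
    pipeline()  # fetch -> transform -> load
    return "ran"
```

Called with `force=True`, the guard is bypassed and the pipeline runs even when `gnomad_constraint` is already present.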
174c4af02d
feat(02-01): add gnomAD transform pipeline and comprehensive tests
...
- Implement filter_by_coverage with quality_flag categorization (measured/incomplete_coverage/no_data)
- Add normalize_scores with LOEUF inversion (lower LOEUF = higher score)
- NULL preservation throughout pipeline (unknown != zero constraint)
- process_gnomad_constraint end-to-end pipeline function
- 15 comprehensive unit tests covering edge cases:
  - NULL handling and preservation
  - Coverage filtering without dropping genes
  - Normalization bounds and inversion
  - Mixed type handling for robust parsing
- Fix column mapping to handle gnomAD v4.x loeuf/loeuf_upper duplication
- All existing tests continue to pass
2026-02-11 18:14:41 +08:00
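The LOEUF inversion in 174c4af02d (lower LOEUF = higher score, NULLs preserved) can be illustrated with a plain-Python sketch. Min-max scaling, the function name, and the 0.5 fallback when all known values are equal are assumptions; the commit only states the direction of the inversion and the NULL behaviour, and the real `normalize_scores` works on a polars DataFrame.

```python
from typing import List, Optional


def invert_loeuf(values: List[Optional[float]]) -> List[Optional[float]]:
    """Min-max scale LOEUF to [0, 1] and flip it so low LOEUF scores high."""
    known = [v for v in values if v is not None]
    if not known:
        return [None] * len(values)
    lo, hi = min(known), max(known)
    if hi == lo:
        # Assumed fallback: no spread to scale by, so use the midpoint.
        return [None if v is None else 0.5 for v in values]
    return [
        None if v is None else 1.0 - (v - lo) / (hi - lo)
        for v in values
    ]
```

Genes with no measurement stay `None` in the output and keep their position, so nothing is dropped and unknown constraint is never reported as zero.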
a88b0eea60
feat(02-01): add gnomAD constraint data models and download module
...
- Create evidence layer package structure
- Define ConstraintRecord Pydantic model with NULL preservation
- Implement streaming download with httpx and tenacity retry
- Add lazy TSV parser with column name variant handling
- Add httpx and structlog dependencies
2026-02-11 18:11:49 +08:00
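The retry behaviour around the download in a88b0eea60 is provided by tenacity in the project. As a dependency-free stand-in, the sketch below shows the same idea (retry with exponential backoff) using only the standard library; the `retry` helper, its parameters, and the delay schedule are hypothetical and do not mirror tenacity's API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on failure with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Real code would catch only transient network errors.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.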