feat: Add Deduplication Agent with embedding and LLM methods

Implement a new Deduplication Agent that identifies and groups similar
transformation descriptions. Supports two deduplication methods:
- Embedding: Fast vector similarity comparison using cosine similarity
- LLM: Pairwise semantic comparison (slower but more precise)

Backend changes:
- Add deduplication router with /deduplicate endpoint
- Add embedding_service for vector-based similarity
- Add llm_deduplication_service for LLM-based comparison
- Improve expert_transformation error handling and progress reporting

Frontend changes:
- Add DeduplicationPanel with interactive group visualization
- Add useDeduplication hook for state management
- Integrate deduplication tab in main App
- Add threshold slider and method selector in sidebar

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-22 20:26:17 +08:00
parent 5571076406
commit bc281b8e0a
18 changed files with 1397 additions and 25 deletions

@@ -232,3 +232,38 @@ class ExpertTransformationRequest(BaseModel):
    # LLM parameters
    model: Optional[str] = None
    temperature: Optional[float] = 0.7


# ===== Deduplication Agent schemas =====

class DeduplicationMethod(str, Enum):
    """Deduplication method."""
    EMBEDDING = "embedding"  # vector similarity
    LLM = "llm"  # pairwise LLM judgment


class DeduplicationRequest(BaseModel):
    """Deduplication request."""
    descriptions: List[ExpertTransformationDescription]
    method: DeduplicationMethod = DeduplicationMethod.EMBEDDING  # deduplication method
    similarity_threshold: float = 0.85  # cosine similarity threshold (0.0-1.0); used by the embedding method only
    model: Optional[str] = None  # embedding/LLM model


class DescriptionGroup(BaseModel):
    """Group of similar descriptions."""
    group_id: str  # "group-0", "group-1"...
    representative: ExpertTransformationDescription  # representative description
    duplicates: List[ExpertTransformationDescription]  # similar descriptions
    similarity_scores: List[float]  # similarity score of each duplicate


class DeduplicationResult(BaseModel):
    """Deduplication result."""
    total_input: int  # total number of input descriptions
    total_groups: int  # number of groups
    total_duplicates: int  # number of duplicates
    groups: List[DescriptionGroup]
    threshold_used: float
    method_used: DeduplicationMethod  # deduplication method used
    model_used: str  # model used
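
The hunk above only defines the schemas; the body of `embedding_service` is not shown here. As a rough illustration of how a cosine-similarity threshold could produce `group_id`, `representative`, `duplicates`, and `similarity_scores`, the following is a minimal sketch. The greedy first-seen-representative strategy, the `Group` dataclass, and the function names are assumptions made for this example, not the commit's actual implementation, and descriptions are referred to by index rather than by `ExpertTransformationDescription` objects.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Group:
    """Hypothetical stand-in for DescriptionGroup, holding indices only."""
    group_id: str
    representative: int  # index of the representative description
    duplicates: List[int] = field(default_factory=list)
    similarity_scores: List[float] = field(default_factory=list)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(
    vectors: Sequence[Sequence[float]], threshold: float = 0.85
) -> List[Group]:
    """Greedy grouping (assumed strategy): each vector joins the group whose
    representative it is most similar to, provided the score reaches the
    threshold; otherwise it starts a new group as its representative."""
    groups: List[Group] = []
    for idx, vec in enumerate(vectors):
        best_group, best_score = None, threshold
        for group in groups:
            score = cosine_similarity(vec, vectors[group.representative])
            if score >= best_score:
                best_group, best_score = group, score
        if best_group is None:
            groups.append(Group(group_id=f"group-{len(groups)}", representative=idx))
        else:
            best_group.duplicates.append(idx)
            best_group.similarity_scores.append(best_score)
    return groups
```

With this sketch, `total_groups` would correspond to `len(groups)` and `total_duplicates` to the summed length of each group's `duplicates`. Comparing only against representatives keeps the work proportional to the number of groups per item, at the cost of results that depend on input order.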