- Add complete experiments directory with pilot study infrastructure
  - 5 experimental conditions (direct, expert-only, attribute-only, full-pipeline, random-perspective)
  - Human assessment tool with React frontend and FastAPI backend
  - AUT flexibility analysis with jump signal detection
  - Result visualization and metrics computation
- Add novelty-driven agent loop module (experiments/novelty_loop/)
  - NoveltyDrivenTaskAgent with expert perspective perturbation
  - Three termination strategies: breakthrough, exhaust, coverage
  - Interactive CLI demo with colored output
  - Embedding-based novelty scoring
- Add DDC knowledge domain classification data (en/zh)
- Add CLAUDE.md project documentation
- Update research report with experiment findings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# Research Publication Plan and Future Work

**Created:** 2026-01-19
**Project:** Breaking Semantic Gravity in LLM-Based Creative Ideation

---

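## 1. Publication Feasibility Assessment

### Coverage of Existing Research

| Topic | Representative Paper | How We Differ |
|------|----------|------------|
| LLM-based creativity assessment | Organisciak et al. (2023) | They assess creativity; we **enhance** it |
| AUT flexibility scoring | Hadas & Hershkovitz (2024) | Theirs is an evaluation method; ours is a **generation method** |
| Prompt engineering | Zhou et al. (2023) | They optimize prompts; we build a **structured pipeline** |
| LLM-as-Judge | Zheng et al. (2023) | An evaluation tool, not a core contribution |
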
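### Unique Contributions of This Work

| Contribution | Description | Academic Value |
|--------|------|----------|
| Context-Free Keyword Generation | Experts never see the original query, which forces bisociation | Methodological innovation |
| Sub-additive interaction | Attribute × Expert = sub-additive | Empirical finding |
| Random perspective ≈ domain expert | The perspective shift itself matters more than domain expertise | Theoretical contribution |
| Novelty-flexibility orthogonality | Validated for the first time in LLM creative generation | Theoretical validation |
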
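Sub-additive here means that the combined gain from attributes and experts falls short of the sum of their individual gains over the direct baseline. A minimal sketch of that check on the 2×2 part of the design, assuming per-condition mean scores are already computed; the function name and the numbers are illustrative placeholders, not results from this study.

```python
def interaction_contrast(direct: float, attribute_only: float,
                         expert_only: float, full_pipeline: float) -> float:
    """Return the 2x2 interaction contrast on condition means.

    Negative -> sub-additive, zero -> additive, positive -> super-additive.
    """
    attribute_gain = attribute_only - direct
    expert_gain = expert_only - direct
    combined_gain = full_pipeline - direct
    return combined_gain - (attribute_gain + expert_gain)


# Hypothetical means for illustration only (not measured values).
means = {"direct": 0.40, "attribute_only": 0.50,
         "expert_only": 0.60, "full_pipeline": 0.65}
contrast = interaction_contrast(**means)
print("sub-additive" if contrast < 0 else "additive or super-additive")
```
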
---

## 2. Current Research Status

### Completed ✓

| Element | Status | Details |
|------|:----:|------|
| Theoretical framework | ✓ | Bisociation Theory + Torrance Creativity Framework |
| Experimental design | ✓ | 2×2 factorial + control (5 conditions) |
| Pipeline implementation | ✓ | Attribute decomposition → expert transformation → deduplication |
| Automatic evaluation metrics | ✓ | Novelty, flexibility, diversity, cohesion, jump signal |
| Human assessment interface | ✓ | Web-based Torrance rating tool |
| Statistical analysis | ✓ | ANOVA, effect sizes, correlation analysis |
| Pilot experiment | ✓ | 10 queries, Qwen3:8b, 1119 ideas |

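### Remaining Gaps ✗

| Gap | Importance | Notes |
|------|:------:|------|
| Multi-model validation | **High** | Only Qwen3:8b so far |
| Human assessment data | **High** | The interface is built, but no data has been collected |
| Larger sample size | **Medium** | 10 → 30-50 queries |
| Baseline comparison | **Medium** | Compare against other creativity-enhancement methods |
| LLM-as-Judge | Medium | Validate its correlation with human assessment |

---
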
## 3. Publication Strategy Options

### Option A: Full Paper (Top Conference / Journal)

**Target venues:**
- ACL / EMNLP (top NLP conferences)
- CHI (top HCI conference)
- Creativity Research Journal (creativity research)
- Thinking Skills and Creativity (creative thinking)

**Suggested paper title:**
> "Breaking Semantic Gravity: Context-Free Expert Perspectives for LLM Creative Ideation"

**Additional work required:**

| Work Item | Estimated Time | Priority |
|----------|:--------:|:------:|
| GPT-4 experiments | 1 week | P0 |
| Claude experiments | 1 week | P0 |
| Llama-3 experiments | 1 week | P1 |
| Human assessment collection | 2-3 weeks | P0 |
| Sample size expansion (30 queries) | 1 week | P1 |
| Baseline comparison experiments | 1-2 weeks | P1 |
| Paper writing | 2-3 weeks | - |

**Total estimated time:** 2-3 months

---

### Option B: Short Paper / Workshop Paper

**Targets:**
- ACL/EMNLP Workshop on Creativity and AI
- NeurIPS Workshop on Creativity and Design
- ICCC (International Conference on Computational Creativity)

**Additional work required:**

| Work Item | Estimated Time | Priority |
|----------|:--------:|:------:|
| GPT-4 experiments | 1 week | P0 |
| Small-scale human assessment (50-100 ideas) | 1 week | P0 |
| Paper writing | 1 week | - |

**Total estimated time:** 2-4 weeks

---

## 4. Supplementary Experiment Plan

### Phase 1: Multi-Model Validation (Priority P0)

```
Goal: Verify that the method generalizes across models

Models:
  □ GPT-4 / GPT-4o (OpenAI)
  □ Claude 3.5 Sonnet (Anthropic)
  □ Llama-3-70B (Meta)
  □ Gemini Pro (Google) [optional]

Experimental design:
  - Same 10 queries
  - Same 5 conditions
  - Same evaluation metrics

Expected outputs:
  - Cross-model consistency analysis
  - Identification of model-specific effects
```

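The Phase 1 design amounts to a full cross of models, queries, and conditions. A minimal sketch of enumerating that run grid follows. The condition names come from the pilot study infrastructure; the model identifier strings and the placeholder query names are assumptions for illustration only.

```python
from itertools import product

# Condition names as used in the pilot study infrastructure.
CONDITIONS = ["direct", "expert-only", "attribute-only",
              "full-pipeline", "random-perspective"]

# Hypothetical identifiers; the real model/API names may differ.
MODELS = ["gpt-4o", "claude-3.5-sonnet", "llama-3-70b"]

# Placeholders standing in for the same 10 pilot queries.
QUERIES = [f"query_{i:02d}" for i in range(1, 11)]


def build_run_grid(models, queries, conditions):
    """Return one run spec per (model, query, condition) combination."""
    return [
        {"model": m, "query": q, "condition": c}
        for m, q, c in product(models, queries, conditions)
    ]


runs = build_run_grid(MODELS, QUERIES, CONDITIONS)
print(len(runs))  # 3 models x 10 queries x 5 conditions = 150
```

Keeping the grid explicit makes it easy to confirm that every model sees the same queries and conditions before any API budget is spent.
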
### Phase 2: Human Assessment (Priority P0)

```
Goal: Validate the correlation between automatic metrics and human judgment

Rating dimensions (Torrance Framework):
  1. Originality - 1-5 Likert
  2. Elaboration - 1-5 Likert
  3. Feasibility - 1-5 Likert
  4. Nonsense - Binary

Sampling strategy:
  - Stratified sampling: 4 ideas per condition × query cell
  - Total: 5 × 10 × 4 = 200 ideas
  - Raters: 3-5 people (compute ICC)

Interface:
  - Already built: experiments/assessment/
  - Still needed: recruit raters, collect data
```

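The stratified sampling step in Phase 2 can be sketched as follows. This is a minimal illustration, assuming each idea is a dict with `condition`, `query`, and `text` keys; that schema and the function name are hypothetical, and the pool below is synthetic rather than the pilot data.

```python
import random
from collections import defaultdict


def stratified_sample(ideas, per_cell=4, seed=0):
    """Draw up to `per_cell` ideas from every (condition, query) cell."""
    rng = random.Random(seed)  # fixed seed keeps the rating set reproducible
    cells = defaultdict(list)
    for idea in ideas:
        cells[(idea["condition"], idea["query"])].append(idea)

    sample = []
    for key in sorted(cells):  # sorted for a deterministic cell order
        pool = cells[key]
        sample.extend(rng.sample(pool, min(per_cell, len(pool))))
    return sample


# Synthetic pool: 5 conditions x 10 queries x 20 ideas each.
conditions = ["direct", "expert-only", "attribute-only",
              "full-pipeline", "random-perspective"]
ideas = [
    {"condition": c, "query": f"query_{q:02d}", "text": f"idea {i}"}
    for c in conditions for q in range(1, 11) for i in range(20)
]
selected = stratified_sample(ideas)
print(len(selected))  # 5 x 10 x 4 = 200
```

Sampling per cell rather than from the pooled ideas keeps conditions that produce more ideas from dominating the rating set.
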
### Phase 3: Sample Size Expansion (Priority P1)

```
Goal: Increase statistical power

Expansion plan:
  - Current: 10 queries
  - Target: 30-50 queries

Query sources:
  - Objects: furniture, tools, appliances, vehicles
  - Concepts: services, systems, processes
  - Hybrid: combining physical and digital elements

Power analysis:
  - Current effect size d ≈ 2-3 (large effect)
  - 30 queries should be sufficient to reach power > 0.95
```

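The power claim in Phase 3 can be sanity-checked with a normal approximation. The sketch below assumes a two-sided paired (within-query) comparison at α = 0.05, which the plan does not state explicitly; the function name is hypothetical, and an exact t-based calculation would give slightly lower power at small n.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided paired (one-sample) z-test.

    d is the standardized effect size of the paired differences and
    n is the number of queries.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for d in (0.5, 0.8, 2.0):
    print(d, round(approx_power(d, n=30), 3))
```

Under these assumptions, d = 2 with 30 queries sits far above the 0.95 target, so the approximation is mainly useful for checking how small an effect 30 queries could still detect.
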
### Phase 4: Baseline Comparison (Priority P1)

```
Goal: Compare against existing methods

Baseline methods:
  1. Vanilla Prompting
     "Generate creative uses for [object]"

  2. Chain-of-Thought (CoT)
     "Think step by step about creative uses..."

  3. Few-shot Examples
     Provide 3-5 creative examples

  4. Role-Playing (Standard)
     "As a [expert], suggest uses for [object]"
     (the expert sees the full query)

Comparison metrics:
  - Novelty, flexibility, diversity
  - Number of ideas, generation time
  - Human assessment scores
```

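A minimal sketch of how the two fully specified baseline prompts (Vanilla and Role-Playing) could be templated. The CoT and few-shot baselines are left out because their wording is not spelled out in the plan. The dictionary name, the helper function, and the example inputs are all hypothetical.

```python
# Templates copied from the two fully specified baselines in the plan.
BASELINE_TEMPLATES = {
    "vanilla": "Generate creative uses for {object}",
    "role_playing": "As a {expert}, suggest uses for {object}",
}


def build_prompt(baseline: str, **fields: str) -> str:
    """Fill a baseline template.

    Raises KeyError for an unknown baseline or a missing field.
    """
    return BASELINE_TEMPLATES[baseline].format(**fields)


# "chair" and "marine biologist" are illustrative inputs, not study queries.
print(build_prompt("vanilla", object="chair"))
print(build_prompt("role_playing", expert="marine biologist", object="chair"))
```
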
---

## 5. Draft Paper Outline

### Title
"Breaking Semantic Gravity: Context-Free Expert Perspectives for Enhanced LLM Creative Ideation"

### Abstract
- Problem: LLMs generate ideas clustered around training distributions
- Method: Attribute decomposition + context-free expert transformation
- Results: Sub-additive interaction, random ≈ expert, novelty ⊥ flexibility
- Contribution: Novel pipeline + empirical findings

### 1. Introduction
- Semantic gravity problem in LLM creativity
- Bisociation theory and creative thinking
- Research questions (RQ1-4)

### 2. Related Work
- LLM creativity evaluation
- Prompt engineering for creativity
- Computational creativity methods

### 3. Method
- Pipeline architecture
- Context-free keyword generation
- Experimental design (2×2 + control)

### 4. Evaluation Framework
- Automatic metrics (novelty, flexibility, diversity)
- Human evaluation (Torrance dimensions)
- LLM-as-Judge validation

### 5. Results
- RQ1: Attribute effect
- RQ2: Expert effect
- RQ3: Interaction effect
- RQ4: Expert vs Random
- Cross-model validation

### 6. Discussion
- Attribute anchoring effect
- Value of perspective shift
- Novelty vs flexibility orthogonality

### 7. Conclusion
- Contributions
- Limitations
- Future work

---

## 6. Timeline

### Fast Track (Workshop Paper)

```
Week 1-2: Multi-model experiments (GPT-4, Claude)
Week 2-3: Small-scale human assessment
Week 3-4: Paper writing and submission

Target: 2026 Q1 workshop deadline
```

### Full Track (Full Paper)

```
Month 1:
  - Week 1-2: Multi-model experiments
  - Week 3-4: Sample size expansion

Month 2:
  - Week 1-2: Human assessment collection
  - Week 3-4: Baseline comparison experiments

Month 3:
  - Week 1-2: Data analysis and statistics
  - Week 3-4: Paper writing

Target: ACL 2026 / EMNLP 2026
```

---

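## 7. Risks and Mitigation

| Risk | Likelihood | Impact | Mitigation |
|------|:------:|:----:|----------|
| Inconsistent results across models | Medium | High | Report as "model-specific findings" |
| Low ICC in human assessment | Medium | Medium | Add raters and revise the rating guidelines |
| Effects vanish at larger sample sizes | Low | High | Current effect sizes are large, so the risk is low |
| Competing paper published first | Low | High | Submit to a workshop first to establish priority |

---
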
## 8. Resource Requirements

### Compute Resources

| Resource | Purpose | Estimated Cost |
|------|------|:--------:|
| OpenAI API | GPT-4 experiments | ~$50-100 |
| Anthropic API | Claude experiments | ~$50-100 |
| Local GPU | Llama experiments | Already available |
| Ollama | Embedding | Already available |

### Human Resources

| Role | Need | Notes |
|------|------|------|
| Human raters | 3-5 people | Recruit classmates or use crowdsourcing |
| Statistical consultant | Optional | Consultation on complex statistical analysis |

---

## 9. Success Metrics

### Short Term (within 1 month)

- [ ] Complete GPT-4 experiments
- [ ] Complete Claude experiments
- [ ] Collect at least 100 human-rated samples

### Mid Term (within 3 months)

- [ ] Complete experiments on all models
- [ ] Complete human assessment (200+ samples, ICC > 0.7)
- [ ] Complete baseline comparison
- [ ] Submit the first paper

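The ICC > 0.7 target can be checked once ratings are collected. Below is a minimal sketch of ICC(2,1) (two-way random effects, absolute agreement, single rater). Choosing this variant is an assumption, since the plan does not say which ICC form will be reported. The ratings matrix is the classic worked example from Shrout & Fleiss (1979), not project data.

```python
def icc2_1(ratings):
    """ICC(2,1) for an n_items x k_raters matrix with no missing values."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Six items rated by four judges (Shrout & Fleiss, 1979).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(example), 2))  # 0.29
```

ICC(2,1) penalizes systematic differences between raters, so it is a conservative choice; if only rank consistency matters, a consistency-type ICC would give a higher value on the same data.
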
### Long Term (within 6 months)

- [ ] Paper accepted
- [ ] Open-source the code and dataset
- [ ] Extend to other creative tasks

---

## 10. References

1. Hadas, S., & Hershkovitz, A. (2024). Using Large Language Models to Evaluate Alternative Uses Task Flexibility Score. *Thinking Skills and Creativity*, 52, 101549.

2. Organisciak, P., et al. (2023). Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models. *Thinking Skills and Creativity*, 49, 101356.

3. Koestler, A. (1964). *The Act of Creation*. Hutchinson.

4. Torrance, E. P. (1974). *Torrance Tests of Creative Thinking*. Scholastic Testing Service.

5. Stevenson, C., et al. (2024). Characterizing Creative Processes in Humans and Large Language Models. *arXiv:2405.00899*.

6. Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. *NeurIPS 2023*.