novelty-seeking/experiments/docs/future_research_plan_zh.md

# Research Publication Plan and Future Work

**Created:** 2026-01-19
**Project:** Breaking Semantic Gravity in LLM-Based Creative Ideation

---

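## 1. Publication Feasibility Assessment

### Coverage of Existing Research

| Topic | Representative paper | How we differ |
|------|----------|------------|
| LLM creativity evaluation | Organisciak et al. (2023) | They use LLMs to score creativity; we **enhance** creativity |
| AUT flexibility scoring | Hadas & Hershkovitz (2024) | Theirs is an evaluation method; ours is a **generation method** |
| Prompt engineering | Zhou et al. (2023) | They optimize prompts; we build a **structured pipeline** |
| LLM-as-Judge | Zheng et al. (2023) | Used here as an evaluation tool, not a core contribution |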

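### Unique Contributions of This Work

| Contribution | Description | Academic value |
|--------|------|----------|
| Context-Free Keyword Generation | Experts never see the original query, which forces bisociation | Methodological innovation |
| Sub-additive interaction | Attributes × experts combine sub-additively | Empirical finding |
| Random perspectives ≈ domain experts | The perspective shift itself matters more than domain expertise | Theoretical contribution |
| Novelty-flexibility orthogonality | To our knowledge, first validated in LLM creative generation | Theoretical validation |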

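The sub-additive interaction listed in the table above can be read as a negative interaction contrast in the 2×2 design: the gain from combining both factors is smaller than the sum of the two individual gains. A minimal sketch, assuming hypothetical condition names and placeholder cell means (illustrative numbers only, not results from this study):

```python
def interaction_contrast(baseline, attr_only, expert_only, both):
    """Return the 2x2 interaction contrast computed on cell means.

    A negative value means the combined gain is smaller than the sum of
    the two single-factor gains, i.e. the factors combine sub-additively.
    """
    gain_attr = attr_only - baseline
    gain_expert = expert_only - baseline
    gain_both = both - baseline
    return gain_both - (gain_attr + gain_expert)


# Placeholder cell means for illustration only; these are NOT measured values.
cells = {"baseline": 0.40, "attr_only": 0.50, "expert_only": 0.60, "both": 0.62}

contrast = interaction_contrast(**cells)
label = "sub-additive" if contrast < 0 else "additive or super-additive"
print(f"interaction contrast = {contrast:+.2f} ({label})")
```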
---

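## 2. Current Research Status

### Completed ✓

| Component | Status | Details |
|------|:----:|------|
| Theoretical framework | ✓ | Bisociation Theory + Torrance Creativity Framework |
| Experimental design | ✓ | 2×2 factorial + control (5 conditions) |
| Pipeline implementation | ✓ | Attribute decomposition → expert transformation → deduplication |
| Automatic evaluation metrics | ✓ | Novelty, flexibility, diversity, cohesion, jump signal |
| Human evaluation interface | ✓ | Web-based Torrance scoring tool |
| Statistical analysis | ✓ | ANOVA, effect sizes, correlation analysis |
| Pilot experiment | ✓ | 10 queries, Qwen3:8b, 1119 ideas |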

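### Still Needed ✗

| Gap | Importance | Notes |
|------|:------:|------|
| Multi-model validation | **High** | Only Qwen3:8b has been tested so far |
| Human evaluation data | **High** | The interface is built, but no data has been collected |
| Larger sample | **Medium** | 10 → 30-50 queries |
| Baseline comparison | **Medium** | Compare against other creativity-enhancement methods |
| LLM-as-Judge | **Medium** | Validate its correlation with human evaluation |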

---

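## 3. Publication Strategy Options

### Option A: Full Paper (Top Conference / Journal)

**Target venues:**
- ACL / EMNLP (top NLP conferences)
- CHI (top HCI conference)
- Creativity Research Journal (creativity research journal)
- Thinking Skills and Creativity (creative thinking journal)

**Suggested paper title:**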
> "Breaking Semantic Gravity: Context-Free Expert Perspectives for LLM Creative Ideation"

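**Additional work required:**

| Work item | Estimated time | Priority |
|----------|:--------:|:------:|
| GPT-4 experiments | 1 week | P0 |
| Claude experiments | 1 week | P0 |
| Llama-3 experiments | 1 week | P1 |
| Human evaluation collection | 2-3 weeks | P0 |
| Sample expansion (30 queries) | 1 week | P1 |
| Baseline comparison experiments | 1-2 weeks | P1 |
| Paper writing | 2-3 weeks | - |

**Estimated total time:** 2-3 months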

---

### Option B: Short Paper / Workshop Paper

**Target venues:**
- ACL/EMNLP Workshop on Creativity and AI
- NeurIPS Workshop on Creativity and Design
- ICCC (International Conference on Computational Creativity)

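**Additional work required:**

| Work item | Estimated time | Priority |
|----------|:--------:|:------:|
| GPT-4 experiments | 1 week | P0 |
| Small-scale human evaluation (50-100 ideas) | 1 week | P0 |
| Paper writing | 1 week | - |

**Estimated total time:** 2-4 weeks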

---

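## 4. Supplementary Experiment Plan

### Phase 1: Multi-Model Validation (Priority P0)

```
Goal: verify that the method generalizes across models

Models:
  □ GPT-4 / GPT-4o (OpenAI)
  □ Claude 3.5 Sonnet (Anthropic)
  □ Llama-3-70B (Meta)
  □ Gemini Pro (Google) [optional]

Experimental design:
  - Same 10 queries
  - Same 5 conditions
  - Same evaluation metrics

Expected outputs:
  - Cross-model consistency analysis
  - Identification of model-specific effects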
```
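The Phase 1 design is a full cross of models, queries, and conditions, so the run list can be enumerated up front. A minimal sketch, assuming three of the listed models; the model labels, condition names, and query IDs below are hypothetical placeholders, not identifiers from the project or from any API:

```python
from itertools import product

# Hypothetical labels for illustration; not actual API model identifiers.
MODELS = ["gpt-4o", "claude-3.5-sonnet", "llama-3-70b"]
# Placeholder names for the 5 conditions (2x2 factorial + control).
CONDITIONS = ["c1", "c2", "c3", "c4", "c5"]
# Placeholder IDs for the same 10 queries used in the pilot experiment.
QUERIES = [f"query_{i:02d}" for i in range(1, 11)]


def build_run_grid(models, queries, conditions):
    """Enumerate every (model, query, condition) combination as one run."""
    return [
        {"model": m, "query": q, "condition": c}
        for m, q, c in product(models, queries, conditions)
    ]


runs = build_run_grid(MODELS, QUERIES, CONDITIONS)
print(len(runs))  # 3 models * 10 queries * 5 conditions = 150
```

Keeping the grid as plain data makes it easy to resume partially completed runs and to confirm that every model saw the same queries and conditions.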

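### Phase 2: Human Evaluation (Priority P0)

```
Goal: validate how well the automatic metrics correlate with human judgment

Evaluation dimensions (Torrance Framework):
  1. Originality - 1-5 Likert
  2. Elaboration - 1-5 Likert
  3. Feasibility - 1-5 Likert
  4. Nonsense - Binary

Sampling strategy:
  - Stratified sampling: 4 ideas per condition × query cell
  - Total: 5 × 10 × 4 = 200 ideas
  - Raters: 3-5 (compute ICC for inter-rater reliability)

Interface:
  - Already built: experiments/assessment/
  - Still needed: recruit raters and collect data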
```
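The stratified sampling step above can be sketched as follows. This is an illustration under assumptions: the idea records and their `condition` / `query` field names are hypothetical, and the pool here is synthetic rather than the project's actual 1119 ideas.

```python
import random
from collections import defaultdict


def stratified_sample(ideas, per_cell=4, seed=0):
    """Draw `per_cell` ideas from every (condition, query) cell."""
    rng = random.Random(seed)  # fixed seed so the rating set is reproducible
    cells = defaultdict(list)
    for idea in ideas:
        cells[(idea["condition"], idea["query"])].append(idea)

    sample = []
    for key in sorted(cells):
        pool = cells[key]
        if len(pool) < per_cell:
            raise ValueError(f"cell {key} has only {len(pool)} ideas")
        sample.extend(rng.sample(pool, per_cell))
    return sample


# Synthetic pool: 5 conditions x 10 queries x 6 ideas per cell (placeholder data).
ideas = [
    {"id": f"{c}-{q}-{i}", "condition": c, "query": q}
    for c in range(5)
    for q in range(10)
    for i in range(6)
]

picked = stratified_sample(ideas)
print(len(picked))  # 5 * 10 * 4 = 200
```

Raising an error on under-filled cells, rather than silently sampling fewer ideas, keeps the design balanced across conditions.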

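### Phase 3: Sample Expansion (Priority P1)

```
Goal: increase statistical power

Expansion plan:
  - Current: 10 queries
  - Target: 30-50 queries

Query sources:
  - Objects: furniture, tools, appliances, vehicles
  - Concepts: services, systems, processes
  - Hybrid: combining physical and digital elements

Power analysis:
  - Current effect size d ≈ 2-3 (large effect)
  - 30 queries should be enough to reach power > 0.95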
```
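The power claim above can be sanity-checked with a quick approximation. The sketch below assumes queries are the unit of analysis and conditions are compared within query (a paired, two-sided test); it uses a normal approximation rather than the exact noncentral t distribution, so it is slightly optimistic for small n. The smaller effect sizes are included only to show what happens if the pilot estimate shrinks.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided paired (one-sample) test.

    Normal approximation: power ~= Phi(d * sqrt(n) - z_{1 - alpha/2}).
    The rejection region in the opposite tail is ignored as negligible.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * sqrt(n) - z_crit)


# d = 2.0 reflects the pilot estimate; 0.5 and 0.8 are hypothetical shrunken effects.
for d in (0.5, 0.8, 2.0):
    for n in (10, 30, 50):
        print(f"d={d:.1f}, n={n:2d}: power ~ {approx_power(d, n):.3f}")
```

Under these assumptions, 30 queries clear the 0.95 threshold for d = 2 with a wide margin and still clear it at d = 0.8, but not at d = 0.5.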

### Phase 4: Baseline Comparison (Priority P1)

```
Goal: compare against existing methods

Baseline methods:
  1. Vanilla Prompting
     "Generate creative uses for [object]"

  2. Chain-of-Thought (CoT)
     "Think step by step about creative uses..."

  3. Few-shot Examples
     Provide 3-5 creative examples

  4. Role-Playing (Standard)
     "As a [expert], suggest uses for [object]"
     (the expert sees the full query)

Comparison metrics:
  - Novelty, flexibility, diversity
  - Number of ideas, generation time
  - Human evaluation scores
```
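The four baselines above could be kept as plain prompt templates so that every method receives the same object string. A minimal sketch: the function name, the few-shot layout, and the tail of the CoT prompt (which the plan elides with "...") are assumptions made for illustration, not the project's actual prompts.

```python
BASELINE_TEMPLATES = {
    "vanilla": "Generate creative uses for {obj}",
    # The plan elides the rest of this prompt; "for {obj}" is an assumed completion.
    "cot": "Think step by step about creative uses for {obj}",
    # Standard role-playing: the expert sees the full query.
    "role_play": "As a {expert}, suggest uses for {obj}",
}


def build_prompt(baseline, obj, expert=None, examples=None):
    """Build the prompt text for one baseline method (hypothetical helper)."""
    if baseline == "few_shot":
        if not examples or not 3 <= len(examples) <= 5:
            raise ValueError("few_shot expects 3-5 example ideas")
        shots = "\n".join(f"- {e}" for e in examples)
        # Assumed layout: examples first, then the vanilla instruction.
        return (
            f"Examples of creative uses:\n{shots}\n"
            + BASELINE_TEMPLATES["vanilla"].format(obj=obj)
        )
    if baseline == "role_play" and expert is None:
        raise ValueError("role_play requires an expert")
    # str.format ignores unused keyword arguments, so one call covers all templates.
    return BASELINE_TEMPLATES[baseline].format(obj=obj, expert=expert)


print(build_prompt("role_play", "chair", expert="marine biologist"))
```

Keeping the templates in one dictionary makes it simple to log the exact prompt text alongside each generated idea.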

---

## 5. Draft Paper Outline

### Title
"Breaking Semantic Gravity: Context-Free Expert Perspectives for Enhanced LLM Creative Ideation"

### Abstract
- Problem: LLMs generate ideas clustered around training distributions
- Method: Attribute decomposition + context-free expert transformation
- Results: Sub-additive interaction, random ≈ expert, novelty ⊥ flexibility
- Contribution: Novel pipeline + empirical findings

### 1. Introduction
- Semantic gravity problem in LLM creativity
- Bisociation theory and creative thinking
- Research questions (RQ1-4)

### 2. Related Work
- LLM creativity evaluation
- Prompt engineering for creativity
- Computational creativity methods

### 3. Method
- Pipeline architecture
- Context-free keyword generation
- Experimental design (2×2 + control)

### 4. Evaluation Framework
- Automatic metrics (novelty, flexibility, diversity)
- Human evaluation (Torrance dimensions)
- LLM-as-Judge validation

### 5. Results
- RQ1: Attribute effect
- RQ2: Expert effect
- RQ3: Interaction effect
- RQ4: Expert vs Random
- Cross-model validation

### 6. Discussion
- Attribute anchoring effect
- Value of perspective shift
- Novelty vs flexibility orthogonality

### 7. Conclusion
- Contributions
- Limitations
- Future work

---

## 6. Timeline

### Fast Track (Workshop Paper)

```
Week 1-2: Multi-model experiments (GPT-4, Claude)
Week 2-3: Small-scale human evaluation
Week 3-4: Paper writing and submission

Target: 2026 Q1 Workshop Deadline
```

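### Full Track (Full Paper)

```
Month 1:
  - Week 1-2: Multi-model experiments
  - Week 3-4: Sample expansion

Month 2:
  - Week 1-2: Human evaluation collection
  - Week 3-4: Baseline comparison experiments

Month 3:
  - Week 1-2: Data analysis and statistics
  - Week 3-4: Paper writing

Target: ACL 2026 / EMNLP 2026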
```

---

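## 7. Risks and Mitigation

| Risk | Likelihood | Impact | Mitigation |
|------|:------:|:----:|----------|
| Inconsistent results across models | Medium | High | Report as "model-specific findings" |
| Low ICC in human evaluation | Medium | Medium | Add raters; revise the scoring guidelines |
| Effect disappears at larger sample sizes | Low | High | Current effect sizes are large, so the risk is low |
| A competing paper is published first | Low | High | Submit to a workshop first to establish priority |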

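The plan does not say which ICC form will be reported for the human evaluation. As one possibility, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from a complete ideas × raters score matrix. The function name is hypothetical, and the toy matrix is the 6-target, 4-rater worked example from Shrout & Fleiss (1979), not project data.

```python
def icc2_1(ratings):
    """ICC(2,1) for a complete matrix: one row per idea, one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: ideas (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy data (Shrout & Fleiss worked example), used only to check the formula.
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(toy), 2))  # 0.29
```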
---

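## 8. Resource Requirements

### Compute Resources

| Resource | Purpose | Estimated cost |
|------|------|:--------:|
| OpenAI API | GPT-4 experiments | ~$50-100 |
| Anthropic API | Claude experiments | ~$50-100 |
| Local GPU | Llama experiments | Already available |
| Ollama | Embedding | Already available |

### Personnel

| Role | Need | Notes |
|------|------|------|
| Human raters | 3-5 people | Recruit classmates or use crowdsourcing |
| Statistical consultant | Optional | Consultation on complex statistical analyses |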

---

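## 9. Success Metrics

### Short Term (within 1 month)

- [ ] Complete the GPT-4 experiments
- [ ] Complete the Claude experiments
- [ ] Collect at least 100 human evaluation samples

### Medium Term (within 3 months)

- [ ] Complete experiments on all models
- [ ] Complete human evaluation (200+ samples, ICC > 0.7)
- [ ] Complete the baseline comparison
- [ ] Submit the first paper

### Long Term (within 6 months)

- [ ] Paper accepted
- [ ] Open-source the code and dataset
- [ ] Extend to other creative tasks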

---

## 10. References

1. Hadas, S., & Hershkovitz, A. (2024). Using Large Language Models to Evaluate Alternative Uses Task Flexibility Score. *Thinking Skills and Creativity*, 52, 101549.

2. Organisciak, P., et al. (2023). Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models. *Thinking Skills and Creativity*, 49, 101356.

3. Koestler, A. (1964). *The Act of Creation*. Hutchinson.

4. Torrance, E.P. (1974). *Torrance Tests of Creative Thinking*. Scholastic Testing Service.

5. Stevenson, C., et al. (2024). Characterizing Creative Processes in Humans and Large Language Models. *arXiv:2405.00899*.

6. Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. *NeurIPS 2023*.