feat: Add experiments framework and novelty-driven agent loop

- Add complete experiments directory with pilot study infrastructure
  - 5 experimental conditions (direct, expert-only, attribute-only, full-pipeline, random-perspective)
  - Human assessment tool with React frontend and FastAPI backend
  - AUT flexibility analysis with jump signal detection
  - Result visualization and metrics computation

- Add novelty-driven agent loop module (experiments/novelty_loop/)
  - NoveltyDrivenTaskAgent with expert perspective perturbation
  - Three termination strategies: breakthrough, exhaust, coverage
  - Interactive CLI demo with colored output
  - Embedding-based novelty scoring

- Add DDC knowledge domain classification data (en/zh)
- Add CLAUDE.md project documentation
- Update research report with experiment findings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@@ -0,0 +1,301 @@
# AUT Flexibility Evaluation Methods
## What Is the AUT (Alternative Uses Task)?
The AUT (Alternative Uses Task) is a classic **divergent thinking test** introduced by Guilford in 1967.
**Test format:**
```
Prompt: "List all possible uses for a brick"
Typical answers:
1. Build a house
2. Use as a doorstop
3. Weigh down papers
4. Use as a weapon
5. Prop things up
...
```
---
## Torrance's Four Dimensions of Creativity
| Dimension | Definition | How Measured |
|------|------|----------|
| **Fluency** | How many ideas are produced | Count ideas |
| **Flexibility** | How many distinct categories the ideas span | Count categories |
| **Originality** | How rare the ideas are | Statistical rarity |
| **Elaboration** | How detailed the ideas are | Assess detail |
---
## The Three Flexibility Evaluation Methods We Implemented
### Method 1: Two-Stage LLM Classification (Hadas & Hershkovitz, 2024)
**Principle:** Have a large language model identify the semantic categories of the ideas, then count the categories.
```
Stage 1: Have the LLM identify semantic categories across all ideas
  Input: the 195 creative ideas for "chair"
  Output: ["Transportation", "Art & Decoration", "Healthcare", "Education", "Storage", ...]
Stage 2: Assign each idea to a category
  Idea 1: "Solar-powered charging chair" → Technology
  Idea 2: "Chair converted into a stretcher" → Healthcare
  Idea 3: "Chair legs as drumsticks" → Art
Flexibility score = number of distinct categories used
```
**Pros:** category names carry semantic meaning; highly interpretable
**Cons:** depends on LLM consistency; parsing errors are possible
---
### Method 2: Embedding-Based Hierarchical Clustering (arXiv:2405.00899)
**Principle:** Convert ideas into vectors and cluster them automatically.
```
Step 1: Convert each idea into an embedding vector
  "Solar-powered charging chair" → [0.12, -0.34, 0.56, ...] (1024 dimensions)
Step 2: Run hierarchical clustering with Ward linkage
  Compute cosine distances between all ideas
  Merge the most similar groups bottom-up
Step 3: Cut the dendrogram at a similarity threshold of ≥ 0.7
  Ensures ideas within a cluster are sufficiently similar
Flexibility score = number of resulting clusters
```
**Pros:** objective, reproducible, independent of LLM judgment
**Cons:** clusters carry no semantic labels and require manual interpretation
---
### Method 3: Combined Jump Signal (arXiv:2405.00899)
**Principle:** Use a stricter definition of a "true jump" to reduce false positives.
```
Combined jump = category jump ∧ semantic jump
Category jump (jumpcat): consecutive ideas fall in different embedding clusters
Semantic jump (jumpSS): semantic similarity of consecutive ideas < 0.7
True jump = both conditions must hold
```
**Why is the combined jump needed?**
```
Problem: category jumps alone can produce false positives
Example: "ergonomic chair" and "adjustable chair"
  - may land in different clusters (category jump = True)
  - but are semantically very similar (semantic jump = False)
  - should not count as a true "creative jump"
Solution: the combined jump requires both conditions, which is more accurate
```
| Jump Ratio | Exploration Mode | Meaning |
|----------|----------|------|
| High (>45%) | Flexible | Broad category switching; leaping thought |
| Medium (30-45%) | Mixed | Moderate switching |
| Low (<30%) | Persistent | Deep focus within a single domain |
**Application:** distinguishing the creative patterns of LLMs from those of humans
---
## Findings
### Finding 1: Novelty and Flexibility Are Independent Dimensions
| Condition | Novelty Score | Flexibility (clusters) | Mean Similarity | Pattern |
|------|:----------:|:--------------:|:----------:|------|
| C4 Full Pipeline | **0.395** (highest) | 10 | 0.583 | High novelty, moderate flexibility |
| C5 Random-Perspective | 0.365 | **15** (highest) | 0.521 | High novelty, high flexibility |
| C2 Expert-Only | 0.315 | 13 | 0.517 | Moderate novelty, high flexibility |
| C3 Attribute-Only | 0.337 | 12 | - | Moderate novelty, moderate flexibility |
| C1 Direct | 0.273 (lowest) | **1** (lowest) | 0.647 | Low novelty, low flexibility |
**Visual interpretation:**
```
C1 Direct ideas:
┌─────────────────────────────────────┐
│ ○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○ │ ← all ideas sit in one "ordinary region"
│ (similar to each other, all typical) │    (low novelty + low flexibility)
└─────────────────────────────────────┘
C5 Random-Perspective ideas:
┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│ ★ │ │ ★ │ │ ★ │ │ ★ │ │ ★ │ ← scattered across several "novel regions"
└───┘ └───┘ └───┘ └───┘ └───┘    (high novelty + high flexibility)
  ↑     ↑     ↑     ↑     ↑
Transport Medical  Art  Education Tech
C4 Full Pipeline ideas:
      ┌─────────────────┐
  ┌──┤ ★★★★★★★★★★★★ ├──┐ ← concentrated in one "novel region" with several subcategories
  │   └─────────────────┘ │    (highest novelty + moderate flexibility)
  │          ↓            │
  └── 10 semantic clusters ┘
```
### Finding 2: Combined Jump Signal Results
| Condition | Category Jumps | Semantic Jumps | **Combined Jumps** | Flexibility Profile |
|------|:--------:|:--------:|:------------:|:--------:|
| C2 Expert-Only | 54 | 125 | **48** | Persistent |
| C3 Attribute-Only | 34 | 107 | **33** | Persistent |
| C5 Random-Perspective | 22 | 116 | **20** | Persistent |
| C4 Full Pipeline | 13 | 348 | **13** | Persistent |
| C1 Direct | 0 | 104 | **0** | Persistent |
**Combined jump ratios:**
| Condition | Combined Jump Ratio | Profile | Interpretation |
|------|:------------:|:--------:|------|
| C3 Attribute-Only | **26.6%** | Persistent | Moderate category switching |
| C2 Expert-Only | **24.4%** | Persistent | Moderate category switching |
| C5 Random-Perspective | 10.1% | Persistent | Lower category switching |
| C4 Full Pipeline | **3.2%** | Persistent | Very focused exploration |
| C1 Direct | 0.0% | Persistent | Single cluster (no jumps) |
**Key insight:** combined jumps ≤ category jumps (as expected). All conditions show the "Persistent" exploration pattern.
---
### Finding 3: 🔑 Originality-Flexibility Correlation (Key Finding)
**Paper finding (arXiv:2405.00899):**
- **Humans:** originality and flexibility are **uncorrelated** (r ≈ 0)
- **Typical LLMs:** **positively correlated**; more flexible LLMs score as more original
**Our results:**
| Metric | Value | Interpretation |
|------|:----:|------|
| **Pearson r** | **0.071** | Near-zero correlation |
| Pattern | **Human-like** | Breaks the typical LLM pattern |
**Per-condition data:**
| Condition | Novelty Score | Flexibility (combined jumps) |
|------|:----------:|:------------------:|
| C4 Full Pipeline | **0.395** (highest) | **13** (lowest) |
| C5 Random-Perspective | 0.365 | 20 |
| C3 Attribute-Only | 0.337 | 33 |
| C2 Expert-Only | 0.315 | 48 (highest) |
| C1 Direct | 0.273 (lowest) | 0 |
**Major finding:** the attribute+expert pipeline (C4) achieves the **highest novelty with the lowest flexibility**, showing that structured context-free generation produces **focused novelty** rather than scattered exploration.
**What does this mean?**
```
Typical LLM pattern:
  higher flexibility → higher novelty (positive correlation)
  the more scattered the ideas, the more likely they hit novel concepts
Our pipeline (C4):
  low flexibility + high novelty (breaks the pattern)
  focused exploration of one novel region instead of jumping around
This is a "human-like" creative pattern!
Human experts usually explore one domain deeply rather than dabbling broadly but shallowly
```
---
## What This Means for Creativity Research
1. **Creativity is multidimensional**
   - Novelty and flexibility are **independent dimensions**
   - High novelty does not imply high flexibility, and vice versa
   - Fluency, flexibility, originality, and elaboration must all be considered
2. **Pipeline design trade-offs**
| Strategy | Novelty | Flexibility | Character |
|------|:------:|:----:|------|
| Direct (C1) | Low | Low | Fast but ordinary |
| Expert-Only (C2) | Medium | High | Diverse viewpoints |
| Random-Perspective (C5) | High | **Highest** | Forced jumps |
| Full Pipeline (C4) | **Highest** | Medium | Structured novelty |
3. **Why do expert/random perspectives produce more categories?**
```
C1 Direct:
  no external stimulus → the LLM stays in the single "furniture improvement" domain
  mean similarity 0.647 (highest) → ideas resemble each other
C2 Expert-Only:
  4 experts from different domains → different thinking frames introduced
  mean similarity 0.517 (lower) → ideas more spread out
C5 Random-Perspective:
  random words force jumps → unexpected connections
  mean similarity 0.521 → the most semantic categories (15)
```
4. **Practical recommendations**
   - For **high novelty**: use the Full Pipeline (C4)
   - For **high flexibility/diversity**: use Random-Perspective (C5) or Expert-Only (C2)
   - For **both**: a hybrid strategy is probably needed
---
## Methodology Correction Note
### Problem with the Original Algorithm
The initial clustering algorithm had a logic error:
```
Original logic (wrong):
  Goal: find clusters with within-cluster similarity >= 0.7
  Problem: when ideas are scattered (low similarity),
           no tight cluster can satisfy the threshold
           → the algorithm gives up and returns 1 cluster
Result: the scattered ideas of C2/C5 were mislabeled as "1 cluster"
```
### Corrected Algorithm
```
Corrected logic (right):
  Method: hierarchical clustering with average linkage
  Threshold: cut the dendrogram at distance 0.5
             (i.e., split when similarity < 0.5)
Result: scattered ideas are correctly split into multiple clusters
```
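A minimal sketch of the corrected step in Python, assuming `embeddings` is an `(n, 1024)` array of idea vectors (array shapes and names are illustrative, not the project's exact API):
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def count_clusters(embeddings: np.ndarray, distance_threshold: float = 0.5) -> int:
    """Flexibility score = number of clusters after cutting the dendrogram.

    Cosine distance = 1 - cosine similarity, so cutting at distance 0.5
    splits ideas whose similarity falls below 0.5.
    """
    distances = pdist(embeddings, metric="cosine")   # condensed pairwise distances
    tree = linkage(distances, method="average")      # average-linkage hierarchy
    labels = fcluster(tree, t=distance_threshold, criterion="distance")
    return int(labels.max())
```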
### Before/After Comparison
| Condition | Clusters Before Fix | Clusters After Fix | Mean Similarity |
|------|:------------:|:------------:|:----------:|
| C1 Direct | 29 | **1** | 0.647 (high) |
| C2 Expert-Only | 1 | **13** | 0.517 (low) |
| C5 Random-Perspective | 1 | **15** | 0.521 (low) |
**Key insight:** low similarity = high diversity = high flexibility score
---
## References
1. Hadas & Hershkovitz (2024). "Using Large Language Models to Evaluate Alternative Uses Task Flexibility Score." *Thinking Skills and Creativity*, Vol. 52.
2. Stevenson, C., et al. (2024). "Characterising the Creative Process in Humans and Large Language Models." *arXiv:2405.00899*. (Jump signal methodology.)
3. Guilford, J.P. (1967). *The Nature of Human Intelligence*. McGraw-Hill.
4. Torrance, E.P. (1974). *Torrance Tests of Creative Thinking*. Scholastic Testing Service.


@@ -0,0 +1,477 @@
# Creative Process Characterization Metrics in Detail
## Methodology Based on arXiv:2405.00899
**Paper title:** "Characterising the Creative Process in Humans and Large Language Models"
**Source:** [arXiv:2405.00899](https://arxiv.org/html/2405.00899v2)
This document explains in detail the creative-process metrics we adopted from that paper, and the key findings those metrics reveal in our experiments.
---
## 1. Combined Jump Signal
### 1.1 What Is a "Jump"?
In divergent creative thinking, a "jump" is a **switch of semantic category** between consecutively generated ideas.
```
Example idea sequence:
1. Solar-powered charging chair → Technology
2. Smart temperature-controlled seat → Technology (no jump)
3. Chair converted into a stretcher → Healthcare (jump!)
4. Wheelchair stand-assist function → Healthcare (no jump)
5. Chair legs as drumsticks → Art (jump!)
```
### 1.2 Why a "Combined" Jump?
**Problem with the original approach:**
Using the category jump (jumpcat) alone can produce **false positives**:
```
Problem scenario:
Idea A: "Foldable camping chair" → cluster 1
Idea B: "Portable picnic chair" → cluster 2
Category jump = True (different clusters)
But the two ideas are semantically very similar!
This should not count as a true "creative jump"
```
**The paper's solution: the combined jump signal**
```
Combined jump = category jump ∧ semantic jump
Where:
  Category jump (jumpcat): consecutive ideas fall in different embedding clusters
  Semantic jump (jumpSS): cosine similarity of consecutive ideas < 0.7
True jump = both conditions must hold
```
### 1.3 Mathematical Definition
For consecutive ideas $i$ and $i-1$:
$$
\text{jump}_i = \text{jump}_{cat,i} \land \text{jump}_{SS,i}
$$
where:
- $\text{jump}_{cat,i} = \mathbb{1}[c_i \neq c_{i-1}]$ (did the cluster assignment change)
- $\text{jump}_{SS,i} = \mathbb{1}[\text{sim}(e_i, e_{i-1}) < 0.7]$ (is the similarity below the threshold)
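A minimal sketch of this definition, assuming `labels` are per-idea cluster assignments and `embeddings` the corresponding vectors; the project's own version is `compute_jump_signal()` in `experiments/aut_flexibility_analysis.py`:
```python
import numpy as np

def combined_jump_signal(labels: np.ndarray, embeddings: np.ndarray,
                         sim_threshold: float = 0.7) -> np.ndarray:
    """Boolean signal over consecutive idea pairs: jump_cat AND jump_ss."""
    a, b = embeddings[:-1], embeddings[1:]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    jump_cat = labels[1:] != labels[:-1]   # different embedding cluster
    jump_ss = sims < sim_threshold         # semantically dissimilar
    return jump_cat & jump_ss
```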
### 1.4 Our Experimental Results
| Condition | Category Jumps | Semantic Jumps | **Combined Jumps** | Combined Ratio |
|------|:--------:|:--------:|:------------:|:--------:|
| C2 Expert-Only | 54 | 125 | **48** | 24.4% |
| C3 Attribute-Only | 34 | 107 | **33** | 26.6% |
| C5 Random-Perspective | 22 | 116 | **20** | 10.1% |
| C4 Full Pipeline | 13 | 348 | **13** | 3.2% |
| C1 Direct | 0 | 104 | **0** | 0.0% |
**Key observations:**
- Combined jumps ≤ category jumps (validates the method)
- C4 has many semantic jumps (348) but few category jumps (13) → ideas are semantically spread out yet stay within similar categories
- C1 has no category jumps → all ideas sit inside a single semantic cluster
---
## 2. Flexibility Profile Classification
### 2.1 Three Modes of Creative Exploration
Following the paper, creative exploration falls into three modes (a classification sketch follows the table):
| Profile | Jump Ratio | Characteristics |
|------|:--------:|------|
| **Persistent** | < 30% | Deep exploration of a single domain; ideas developed in place |
| **Mixed** | 30-45% | Moderate switching; balances depth and breadth |
| **Flexible** | > 45% | Frequent jumps; broad coverage of different domains |
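A minimal classifier for these bands (thresholds as in the paper; this mirrors the intent of the project's `classify_flexibility_profile()`):
```python
def classify_flexibility_profile(jump_ratio: float) -> str:
    """Map a combined-jump ratio (jumps / transitions) to a profile label."""
    if jump_ratio > 0.45:
        return "Flexible"    # frequent category switching
    if jump_ratio >= 0.30:
        return "Mixed"       # balanced depth and breadth
    return "Persistent"      # deep within-category exploration
```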
### 2.2 Visual Intuition
```
Persistent:
┌─────────────────────────────────────┐
│ ●→●→●→●→●→●→●→●→●→●                │ deep exploration of one domain
│ Technology                          │ occasional switches (<30%)
│        ↓                            │
│ ●→●→●→●                             │
│ Healthcare                          │
└─────────────────────────────────────┘
Flexible:
┌─────────────────────────────────────┐
│ ●→ ●→ ●→ ●→ ●→ ●→ ●→ ●             │ frequent jumps between domains
│ Tech Med Art Edu Tech Soc Env Tech  │ very short stay in each domain
│                                     │ (>45% jumps)
└─────────────────────────────────────┘
Mixed:
┌─────────────────────────────────────┐
│ ●→●→●→●→ ●→●→●→ ●→●→●→●            │ moderate balance
│ Technology Healthcare Art           │ (30-45% jumps)
└─────────────────────────────────────┘
```
### 2.3 Our Experimental Results
| Condition | Combined Jump Ratio | Profile | Interpretation |
|------|:------------:|:--------:|------|
| C3 Attribute-Only | 26.6% | Persistent | Close to the Mixed boundary |
| C2 Expert-Only | 24.4% | Persistent | Moderate category switching |
| C5 Random-Perspective | 10.1% | Persistent | Fewer switches |
| **C4 Full Pipeline** | **3.2%** | **Persistent** | Very focused exploration |
| C1 Direct | 0.0% | Persistent | Single cluster |
**Key finding:** all conditions show the Persistent pattern, but to different degrees.
---
## 3. Originality-Flexibility Correlation Analysis
### 3.1 The Paper's Core Finding
arXiv:2405.00899 identified a key difference:
| Subject | Originality-Flexibility Relationship | Interpretation |
|------|:------------------:|------|
| **Humans** | r ≈ 0 (uncorrelated) | Originality and flexibility are independent abilities |
| **Typical LLMs** | r > 0 (positive) | More flexible LLMs are more original |
**Why the difference?**
```
Human creative pattern:
- some people excel at deep exploration (low flexibility, high originality)
- some people excel at broad association (high flexibility, high originality)
- the two abilities are independent dimensions
Typical LLM pattern:
- LLMs produce diversity through "randomness"
- higher temperature → more jumps → more accidental discoveries
- flexibility and originality are tied together by randomness
```
### 3.2 Our Experimental Results
**Pearson correlation coefficient: r = 0.071**
| Metric | Value | Interpretation |
|------|:----:|------|
| **Pearson r** | **0.071** | Near zero |
| Statistical meaning | No significant correlation | The two dimensions are independent |
| **Pattern** | **Human-like** | Breaks the typical LLM pattern |
**Per-condition data:**
| Condition | Novelty (centroid distance) | Flexibility (combined jumps) | Combination |
|------|:------------------:|:------------------:|------|
| C4 Full Pipeline | **0.395** (highest) | **13** (lowest) | High novelty + low flexibility |
| C5 Random-Perspective | 0.365 | 20 | High novelty + low flexibility |
| C3 Attribute-Only | 0.337 | 33 | Moderate novelty + moderate flexibility |
| C2 Expert-Only | 0.315 | **48** (highest) | Moderate novelty + high flexibility |
| C1 Direct | 0.273 (lowest) | 0 | Low novelty + low flexibility |
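A toy correlation check on the condition-level aggregates above (the project's full analysis, `analyze_originality_flexibility_correlation()`, presumably pairs scores at a finer grain, so the exact r differs slightly):
```python
from scipy.stats import pearsonr

novelty = [0.273, 0.315, 0.337, 0.395, 0.365]   # C1..C5 centroid distances
flexibility = [0, 48, 33, 13, 20]               # C1..C5 combined jumps

r, p = pearsonr(novelty, flexibility)
print(f"Pearson r = {r:.3f}")                   # ≈ 0.075 on these aggregates
```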
### 3.3 Why This Finding Matters
```
┌─────────────────────────────────────────────────────────────┐
│                 Originality-Flexibility space                │
│                                                              │
│ High originality │ C4●                                       │
│                  │        C5●                                │
│                  │            C3●                            │
│                  │                C2●                        │
│                  │                                           │
│ Low originality  │ C1●                                       │
│                  └─────────────────────────────────────────  │
│                 Low flexibility           High flexibility   │
│                                                              │
│ r = 0.071 → nearly orthogonal to the diagonal → uncorrelated │
│           → human-like!                                      │
└─────────────────────────────────────────────────────────────┘
Contrast with a typical LLM (r > 0.3):
┌─────────────────────────────────────────────────────────────┐
│ High originality │                              ●            │
│                  │                       ●                   │
│                  │              ●                            │
│                  │       ●                                   │
│ Low originality  │ ●                                         │
│                  └─────────────────────────────────────────  │
│                 Low flexibility           High flexibility   │
│                                                              │
│ r > 0.3 → points lie along the diagonal → positive           │
│         correlation → typical LLM pattern                    │
└─────────────────────────────────────────────────────────────┘
```
---
## 4. Cumulative Jump Profile
### 4.1 What Is a Cumulative Jump Profile?
It tracks how the jump count accumulates over the idea-generation sequence.
```
Idea position:  1  2  3  4  5  6  7  8  9  10
Jump occurred:  -  -  ✓  -  ✓  -  ✓  ✓  -  ✓
Cumulative:     0  0  1  1  2  2  3  4  4  5
Profile line:
5 │                                    ●
4 │                        ●────●
3 │                   ●────●
2 │         ●────●
1 │    ●────●
0 │●────●
  └────────────────────────────────────────
   1  2  3  4  5  6  7  8  9  10
              Idea position
```
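The profile itself is just a running sum of the jump signal from Section 1.3; a minimal sketch reproducing the example above:
```python
import numpy as np

jumps = np.array([0, 0, 1, 0, 1, 0, 1, 1, 0, 1])  # jump signal per position
print(np.cumsum(jumps))                            # [0 0 1 1 2 2 3 4 4 5]
```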
### 4.2 Reading the Profile
| Profile Feature | Meaning | Creative Pattern |
|----------|------|----------|
| **Steep slope** | Jumps accumulate quickly | Frequent category switching |
| **Flat region** | Jumps pause | Deep exploration of the current category |
| **Staircase** | Sudden bursts of jumps | Moving on after exhausting a category |
| **Nearly horizontal** | Almost no jumps | Staying within a single domain |
### 4.3 Our Experimental Visualization
![Cumulative jump profiles](../results/cumulative_jump_profiles.png)
**Per-condition profile readings:**
| Condition | Profile Shape | Creative Strategy |
|------|----------|----------|
| C2 Expert-Only | Steady rise | Continual category switching |
| C3 Attribute-Only | Steady rise | Continual category switching |
| C5 Random-Perspective | Slow rise | Fewer switches |
| C4 Full Pipeline | Nearly horizontal | Very focused single-domain exploration |
| C1 Direct | Completely flat | No category switching at all |
---
## 5. What the Findings Mean Together
### 5.1 Summary of Core Findings
| Finding | Content | Significance |
|------|------|------|
| **Finding 1** | Originality-flexibility correlation r = 0.071 | The pipeline yields a "human-like" creative pattern |
| **Finding 2** | C4: highest novelty + lowest flexibility | Structured methods produce focused novelty |
| **Finding 3** | All conditions are Persistent | LLMs favor depth over breadth |
| **Finding 4** | Combined jumps < category jumps | Validates the methodology |
### 5.2 Why Can C4 Break the LLM Pattern?
```
The problem with typical LLMs:
┌─────────────────────────────────────────────────────────────┐
│ Direct generation: "Give me innovative uses for a chair"     │
│                                                              │
│ The LLM relies on temperature for diversity                  │
│ → higher temperature = more randomness                       │
│ → more randomness = more jumps (high flexibility)            │
│ → more jumps = better odds of novel ideas (high originality) │
│                                                              │
│ Result: flexibility and originality are tied together        │
│         (positive correlation)                               │
└─────────────────────────────────────────────────────────────┘
C4's breakthrough:
┌─────────────────────────────────────────────────────────────┐
│ Structured generation:                                       │
│                                                              │
│ Step 1: Attribute decomposition                              │
│   "Chair" → [portable, stackable, ergonomic, ...]            │
│                                                              │
│ Step 2: Context-free expert keywords                         │
│   Accountant + "portable" → "mobile assets"                  │
│   (never knows it is a chair!)                               │
│                                                              │
│ Step 3: Recombination                                        │
│   "Chair" + "mobile assets" + accountant's perspective       │
│   → "Corporate chairs with RFID asset tracking"              │
│                                                              │
│ Key mechanisms:                                              │
│ - structure forces a leap out of the typical semantic space  │
│   (high novelty)                                             │
│ - but all ideas stay anchored to the same attribute set      │
│   (low flexibility)                                          │
│ - novelty comes from forced bisociation, not random          │
│   exploration                                                │
│                                                              │
│ Result: high novelty + low flexibility → breaks the          │
│         positive correlation → human-like                    │
└─────────────────────────────────────────────────────────────┘
```
### 5.3 Implications for Creative AI Research
**Theoretical contributions:**
1. **LLMs can produce "human-like" creative patterns**
   - Not by imitating human data
   - But through structured creative pipeline design
2. **Originality and flexibility can be controlled independently**
   - Conventional wisdom: high originality requires high randomness
   - We show that structured constraints can also reach high originality
3. **"Focused novelty" vs. "scattered exploration"**
   - C4: dig deep into one novel domain (specialist strategy)
   - C5: touch many domains broadly (generalist strategy)
   - Both are valuable, but the mechanisms differ
**Practical applications:**
| Goal | Recommended Strategy | Reason |
|------|----------|------|
| Maximize novelty | C4 Full Pipeline | Highest centroid-distance score |
| Maximize category diversity | C2 Expert-Only | Most combined jumps |
| Balance novelty and diversity | C3 Attribute-Only | Middle ground on both |
| Fast generation | C1 Direct | Fewest API calls |
---
## 6. Methodology Validation
### 6.1 Combined Jumps ≤ Category Jumps
This is a necessary-condition check on the method:
```
Derivation:
Combined jump = category jump ∧ semantic jump
When category jump = False:
  combined jump = False ∧ ? = False
When category jump = True:
  combined jump = True ∧ semantic jump = semantic jump (True or False)
Therefore: combined jumps ≤ category jumps (always holds)
```
**Empirical check:**
| Condition | Category Jumps | Combined Jumps | Check |
|------|:--------:|:--------:|:----:|
| C2 | 54 | 48 | ✓ |
| C3 | 34 | 33 | ✓ |
| C5 | 22 | 20 | ✓ |
| C4 | 13 | 13 | ✓ |
| C1 | 0 | 0 | ✓ |
### 6.2 Choice of Flexibility Profile Thresholds
The paper's thresholds (30%, 45%) are based on the distribution of human experimental data. In our LLM experiments all conditions fall in the Persistent band, which is itself a finding:
```
Human distribution (paper data):
  Persistent: ~33%
  Mixed: ~34%
  Flexible: ~33%
Our LLM distribution:
  Persistent: 100% (all conditions)
  Mixed: 0%
  Flexible: 0%
Reading:
  LLMs (even with expert/attribute guidance) still favor persistent exploration
  This may be an inherent property of the LLM architecture
```
---
## 7. Integration with the Other Metrics
### 7.1 The Full Metric System
| Dimension | Metric | Source | C4 Performance |
|------|------|------|:-------:|
| **Fluency** | Idea count | Torrance | 402 (most) |
| **Flexibility** | Combined jumps | arXiv:2405.00899 | 13 (lowest) |
| **Originality** | Centroid distance | This study | 0.395 (highest) |
| **Elaboration** | Mean word count | Torrance | 26.2 |
### 7.2 C4's Unique Position
```
Position in creativity space:
        High originality
   C4 ●    │
           │      C5●
           │          C3●
           │              C2●
   C1 ●    │
           └──────────────────── High flexibility
        Low originality
C4 occupies the unique "high originality + low flexibility" position.
This is common among human creators (the specialist type) but rare in LLMs.
```
---
## 8. Future Research Directions
Suggested follow-up studies based on these findings:
1. **Cross-model validation**
   - Repeat the experiments on GPT-4, Claude, and Llama-3
   - Confirm whether the findings are a general phenomenon
2. **Temperature sensitivity tests**
   - The paper found LLMs insensitive to temperature
   - Test whether our pipeline shares this property
3. **Human baseline comparison**
   - Collect human data on the same tasks
   - Directly compare flexibility profile distributions
4. **Pipeline variants**
   - Vary the number of attributes and experts
   - Find the optimal balance point
---
## References
1. **arXiv:2405.00899** - "Characterising the Creative Process in Humans and Large Language Models"
   - Source of the combined jump signal and flexibility profile classification
2. **Hadas & Hershkovitz (2024)** - "Using LLMs to Evaluate AUT Flexibility Score"
   - Source of the two-stage LLM classification method
3. **Torrance (1974)** - *Torrance Tests of Creative Thinking*
   - The four-dimension creativity framework
4. **Koestler (1964)** - *The Act of Creation*
   - Theoretical basis for bisociation
---
## Appendix: Code Reference
The analysis code lives at:
- `experiments/aut_flexibility_analysis.py`
  - `compute_jump_signal()` - combined jump computation
  - `classify_flexibility_profile()` - flexibility profile classification
  - `analyze_originality_flexibility_correlation()` - correlation analysis
  - `compute_cumulative_jump_profile()` - cumulative jump profile
  - `plot_cumulative_jump_profiles()` - visualization
Run the analysis:
```bash
cd experiments
source ../backend/venv/bin/activate
python aut_flexibility_analysis.py experiment_20260119_165650_deduped.json
```


@@ -0,0 +1,259 @@
# Experiment Design: 5-Condition Idea Generation Study
**Date:** January 19, 2026
**Version:** 1.0
**Status:** Pilot Implementation
## Overview
This experiment tests whether the novelty-seeking system's two key mechanisms—**attribute decomposition** and **expert transformation**—independently and jointly improve creative ideation quality compared to direct LLM generation.
## Research Questions
1. Does decomposing a query into structured attributes improve idea diversity?
2. Do expert perspectives improve idea novelty?
3. Do these mechanisms have synergistic effects when combined?
4. Is the benefit from experts due to domain knowledge, or simply perspective-shifting?
## Experimental Design
### 2×2 Factorial Design + Control
| | No Attributes | With Attributes |
|--------------------|---------------|-----------------|
| **No Experts** | C1: Direct | C3: Attr-Only |
| **With Experts** | C2: Expert-Only | C4: Full Pipeline |
**Plus:** C5: Random-Perspective (tests perspective-shifting without domain knowledge)
### Condition Descriptions
#### C1: Direct Generation (Baseline)
- Single LLM call: "Generate 20 creative ideas for [query]"
- No attribute decomposition
- No expert perspectives
- Purpose: Baseline for standard LLM ideation
#### C2: Expert-Only
- 4 experts from curated occupations
- Each expert generates 5 ideas directly for the query
- No attribute decomposition
- Purpose: Isolate expert contribution
#### C3: Attribute-Only
- Decompose query into 4 fixed categories
- Generate attributes per category
- Direct idea generation per attribute (no expert framing)
- Purpose: Isolate attribute decomposition contribution
#### C4: Full Pipeline
- Full attribute decomposition (4 categories)
- Expert transformation (4 experts × 1 keyword per attribute)
- Purpose: Test combined mechanism (main system)
#### C5: Random-Perspective
- 4 random words per query (from curated pool)
- Each word used as a "perspective" to generate 5 ideas
- Purpose: Control for perspective-shifting vs. expert knowledge
---
## Key Design Decisions & Rationale
### 1. Why 5 Conditions?
C1-C4 form a 2×2 factorial design that isolates the independent contributions of:
- **Attribute decomposition** (C1 vs C3, C2 vs C4)
- **Expert perspectives** (C1 vs C2, C3 vs C4)
C5 addresses a critical confound: if experts improve ideation, is it because of their **domain knowledge** or simply because any **perspective shift** helps? By using random words instead of domain experts, C5 tests whether the perspective-taking mechanism alone provides benefits.
### 2. Why Random Words in C5 (Not Fixed)?
**Decision:** Use randomly sampled words (with seed) rather than a fixed set.
**Rationale:**
- Stronger generalization: results hold across many word combinations
- Avoids cherry-picking accusation ("you just picked easy words")
- Reproducible via random seed (seed=42)
- Each query gets different random words, increasing robustness
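A minimal sketch of the seeded sampling (the JSON structure and field name are assumptions, not the exact contents of `data/random_words.json`):
```python
import json
import random

rng = random.Random(42)  # seed=42 for reproducibility

with open("data/random_words.json") as f:
    pool = json.load(f)["words"]       # assumed structure: {"words": [...]}

words_for_query = rng.sample(pool, 4)  # 4 distinct words; varies per query, fixed per run
```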
### 3. Why Apply Deduplication Uniformly?
**Decision:** Apply embedding-based deduplication (threshold=0.85) to ALL conditions after generation.
**Rationale:**
- Fair comparison: all conditions normalized to unique ideas
- Creates "dedup survival rate" as an additional metric
- Hypothesis: Full Pipeline ideas are diverse (low redundancy), not just numerous
- Direct generation may produce many similar ideas that collapse after dedup
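A minimal sketch of the greedy embedding-based deduplication (assumes L2-normalized embeddings are supplied; the actual implementation lives in `deduplication.py`):
```python
import numpy as np

def deduplicate(ideas: list[str], embeddings: np.ndarray,
                threshold: float = 0.85) -> list[str]:
    """Keep an idea only if its cosine similarity to every kept idea is below threshold."""
    kept: list[int] = []
    for i in range(len(ideas)):
        if all(float(embeddings[i] @ embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [ideas[k] for k in kept]
```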
### 4. Why FIXED_ONLY Categories?
**Decision:** Use 4 fixed categories: Functions, Usages, User Groups, Characteristics
**Rationale:**
- Best for proof power: isolates "attribute decomposition" effect
- No confound from dynamic category selection variability
- Universal applicability: these 4 categories apply to objects, technology, and services
- Dropped "Materials" category as it doesn't apply well to services
### 5. Why Curated Expert Source?
**Decision:** Use curated occupations (210 professions) rather than LLM-generated experts.
**Rationale:**
- Reproducibility: same occupation pool across runs
- Consistency: no variance from LLM expert generation
- Control: we know exactly which experts are available
- Validation: occupations were manually curated for diversity
### 6. Why Temperature 0.9?
**Decision:** Use temperature=0.9 for all conditions.
**Rationale:**
- Higher temperature encourages more diverse/creative outputs
- Matches typical creative task settings
- Consistent across conditions for fair comparison
- Lower temperatures (0.7) showed more repetitive outputs in testing
### 7. Why 10 Pilot Queries?
**Decision:** Start with 10 queries before scaling to full 30.
**Rationale:**
- Validate pipeline works before full investment
- Catch implementation bugs early
- Balanced across categories (3 everyday, 3 technology, 4 services)
- Sufficient for initial pattern detection
---
## Configuration Summary
| Setting | Value | Rationale |
|---------|-------|-----------|
| **LLM Model** | qwen3:8b | Local, fast, consistent |
| **Temperature** | 0.9 | Encourages creativity |
| **Expert Count** | 4 | Balance diversity vs. cost |
| **Expert Source** | Curated | Reproducibility |
| **Keywords/Expert** | 1 | Simplifies analysis |
| **Language** | English | Consistency |
| **Categories** | Functions, Usages, User Groups, Characteristics | Universal applicability |
| **Dedup Threshold** | 0.85 | Standard similarity cutoff |
| **Random Seed** | 42 | Reproducibility |
| **Pilot Queries** | 10 | Validation before scaling |
---
## Query Selection
### Pilot Queries (10)
| ID | Query | Category |
|----|-------|----------|
| A1 | Chair | Everyday |
| A5 | Bicycle | Everyday |
| A7 | Smartphone | Everyday |
| B1 | Solar panel | Technology |
| B3 | 3D printer | Technology |
| B4 | Drone | Technology |
| C1 | Food delivery service | Services |
| C2 | Online education platform | Services |
| C4 | Public transportation | Services |
| C9 | Elderly care service | Services |
### Selection Criteria
- Balanced across 3 domains (everyday objects, technology, services)
- Varying complexity levels
- Different user familiarity levels
- Subset from full 30-query experimental protocol
---
## Random Word Pool (C5)
35 words selected across 7 conceptual categories:
| Category | Words |
|----------|-------|
| Nature | ocean, mountain, forest, desert, cave |
| Optics | microscope, telescope, kaleidoscope, prism, lens |
| Animals | butterfly, elephant, octopus, eagle, ant |
| Weather | sunrise, thunderstorm, rainbow, fog, aurora |
| Art | clockwork, origami, mosaic, symphony, ballet |
| Temporal | ancient, futuristic, organic, crystalline, liquid |
| Sensory | whisper, explosion, rhythm, silence, echo |
**Selection Criteria:**
- Concrete and evocative (easy to generate associations)
- Diverse domains (no overlap with typical expert knowledge)
- No obvious connection to test queries
- Equal representation across categories
---
## Expected Outputs
### Per Condition Per Query
| Condition | Expected Ideas (pre-dedup) | Mechanism |
|-----------|---------------------------|-----------|
| C1 | 20 | Direct request |
| C2 | 20 | 4 experts × 5 ideas |
| C3 | ~20 | Varies by attribute count |
| C4 | ~20 | 4 experts × ~5 keywords × 1 description |
| C5 | 20 | 4 words × 5 ideas |
### Metrics to Collect
1. **Pre-deduplication count**: Raw ideas generated
2. **Post-deduplication count**: Unique ideas after similarity filtering
3. **Dedup survival rate**: post/pre ratio
4. **Generation metadata**: Experts/words used, attributes generated
---
## File Structure
```
experiments/
├── __init__.py
├── config.py # Experiment configuration
├── docs/
│ └── experiment_design_2026-01-19.md # This file
├── conditions/
│ ├── __init__.py
│ ├── c1_direct.py
│ ├── c2_expert_only.py
│ ├── c3_attribute_only.py
│ ├── c4_full_pipeline.py
│ └── c5_random_perspective.py
├── data/
│ ├── queries.json # 10 pilot queries
│ └── random_words.json # Word pool for C5
├── generate_ideas.py # Main runner
├── deduplication.py # Post-processing
└── results/ # Output (gitignored)
```
---
## Verification Checklist
- [ ] Each condition produces expected number of ideas
- [ ] Deduplication reduces count meaningfully
- [ ] Results JSON contains all required metadata
- [ ] Random seed produces reproducible C5 results
- [ ] No runtime errors on all 10 pilot queries
---
## Next Steps After Pilot
1. Analyze pilot results for obvious issues
2. Adjust parameters if needed (idea count normalization, etc.)
3. Scale to full 30 queries
4. Human evaluation of idea quality (novelty, usefulness, feasibility)
5. Statistical analysis of condition differences


@@ -0,0 +1,813 @@
---
marp: true
theme: default
paginate: true
backgroundColor: #fff
style: |
section {
font-size: 24px;
}
h1 {
color: #2c3e50;
}
h2 {
color: #34495e;
}
table {
font-size: 18px;
}
.columns {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 1rem;
}
---
# Breaking Semantic Gravity in LLM-Based Creative Ideation
## A Pilot Study on Attribute Decomposition and Expert Perspectives
**Date:** January 19, 2026
**Model:** Qwen3:8b (Temperature: 0.9)
**Queries:** 10 pilot queries
---
# Research Problem
## The "Semantic Gravity" Challenge
LLMs tend to generate ideas clustered around **high-probability training distributions**
```
Query: "Chair"
Typical LLM output:
- Ergonomic office chair
- Comfortable reading chair
- Foldable portable chair
← All within "furniture comfort" semantic cluster
```
**Goal:** Break this gravitational pull toward obvious solutions
---
# Theoretical Framework
## Bisociation Theory (Koestler, 1964)
Creative thinking occurs when two unrelated "matrices of thought" collide
**Our Approach:**
1. **Attribute Decomposition** → Break object into structural components
2. **Expert Perspectives** → Introduce distant domain knowledge
3. **Context-Free Keywords** → Force unexpected conceptual leaps
---
# Experimental Design
## 2×2 Factorial + Control
| Condition | Attributes | Experts | Description |
|-----------|:----------:|:-------:|-------------|
| **C1** Direct | - | - | Baseline: Direct LLM generation |
| **C2** Expert-Only | - | ✓ | Expert perspectives without structure |
| **C3** Attribute-Only | ✓ | - | Structure without expert knowledge |
| **C4** Full Pipeline | ✓ | ✓ | Combined approach |
| **C5** Random-Perspective | - | Random | Control: Random words as "experts" |
---
# Research Questions
1. **RQ1:** Does attribute decomposition increase idea diversity?
2. **RQ2:** Do expert perspectives increase idea diversity?
3. **RQ3:** Is there a synergistic (super-additive) interaction effect?
4. **RQ4:** Do domain-relevant experts outperform random perspectives?
---
# Pipeline Architecture
## C4: Full Pipeline Process
```
Query: "Chair"
Step 1: Attribute Decomposition
→ "portable", "stackable", "ergonomic", ...
Step 2: Context-Free Keyword Generation (Expert sees ONLY attribute)
→ Accountant + "portable" → "mobile assets"
→ Architect + "portable" → "modular units"
Step 3: Idea Synthesis (Reunite with query)
→ "Chair" + "mobile assets" + Accountant perspective
→ "Asset-tracking chairs for corporate inventory management"
```
---
# Key Design Decision
## Context-Free Keyword Generation
The expert **never sees the original query** when generating keywords
```python
# Step 2: Expert sees only attribute
prompt = f"As a {expert}, what keyword comes to mind for '{attribute}'?"
# Input: "portable" (NOT "portable chair")
# Step 3: Reunite with query
prompt = f"Apply '{keyword}' to '{query}' from {expert}'s perspective"
# Input: "mobile assets" + "Chair" + "Accountant"
```
**Purpose:** Force bisociation by preventing obvious associations
---
# Pilot Study Parameters
## Model & Generation Settings
| Parameter | Value |
|-----------|-------|
| LLM Model | Qwen3:8b (Ollama) |
| Temperature | 0.9 |
| Ollama Endpoint | localhost:11435 |
| Language | English |
| Random Seed | 42 |
---
# Pilot Study Parameters (cont.)
## Pipeline Configuration
| Parameter | Value |
|-----------|-------|
| Queries | 10 (Chair, Bicycle, Smartphone, Solar panel, 3D printer, Drone, Food delivery, Online education, Public transport, Elderly care) |
| Attribute Categories | 4 (Functions, Usages, User Groups, Characteristics) |
| Attributes per Category | 5 |
| Expert Source | Curated (210 occupations) |
| Experts per Query | 4 |
| Keywords per Expert | 1 |
---
# Pilot Study Parameters (cont.)
## Output & Evaluation
| Parameter | Value |
|-----------|-------|
| Total Ideas Generated | 1,119 (after deduplication) |
| Ideas by Condition | C1: 195, C2: 198, C3: 125, C4: 402, C5: 199 |
| Deduplication Threshold | 0.90 (cosine similarity) |
| Embedding Model | qwen3-embedding:4b (1024D) |
---
# Background: Embedding Models Evolution
## From Static to Contextual Representations
| Generation | Model | Characteristics | Limitation |
|------------|-------|-----------------|------------|
| **1st Gen** | Word2Vec, GloVe | Static vectors, one vector per word | "bank" = same vector (river vs finance) |
| **2nd Gen** | BERT, Sentence-BERT | Contextual, transformer-based | Limited context window, older training |
| **3rd Gen** | Qwen3-embedding | LLM-based, instruction-tuned | Requires more compute |
---
# Background: Transformer vs LLM-based Embedding
## Architecture Differences
| Aspect | Transformer (BERT) | LLM-based (Qwen3) |
|--------|-------------------|-------------------|
| **Architecture** | Encoder-only | Decoder-only (GPT-style) |
| **Training objective** | MLM (masked language modeling) | Next-token prediction |
| **Training data** | ~16GB (Wikipedia + Books) | ~several TB (web, code, books) |
| **Parameters** | 110M - 340M | 4B+ |
| **Context** | 512 tokens | 8K - 128K tokens |
---
# Background
## Key Comparison
```
1. More knowledge from training
   BERT: only knows pre-2019 knowledge
   Qwen3: knows modern concepts like "drone delivery", "AI-powered", "IoT"
2. Broader semantic understanding
   BERT: "chair for elderly" ≈ "elderly chair" (bag-of-words similarity)
   Qwen3: understands the difference between "mobility assistance" and "comfort seating"
3. Instruction tuning
   Traditional models: cannot adapt to task intent
   Qwen3: can follow "find semantic differences between creative ideas"
```
---
# Background: Why Qwen3-Embedding?
## Comparison with Traditional Methods
```
Traditional Sentence-BERT (all-MiniLM-L6-v2):
- 384-dimensional vectors
- trained on pre-2021 data
- strong on short sentences, limited on long text
- encoder-only, MLM training
Qwen3-Embedding (qwen3-embedding:4b):
- 1024-dimensional vectors (richer semantic representation)
- based on the Qwen3 LLM (2024+ training data)
- long-context support (8K tokens)
- instruction-tuned → adapts to task intent
- inherits some LLM capabilities
```
**Why this choice:** creative ideas tend to be long and semantically complex, requiring stronger contextual understanding
---
# Background: How Embedding Works
## Semantic Similarity via Vector Space
```
Step 1: Convert text into a vector
  "Solar-powered charging chair" → [0.12, -0.34, 0.56, ..., 0.78] (1024D)
Step 2: Compute cosine similarity
  similarity = cos(θ) = (A · B) / (|A| × |B|)
Step 3: Interpret the similarity
  1.0 = identical
  0.9 = very similar (likely a duplicate idea)
  0.5 = moderately related
  0.0 = unrelated
```
**Uses:** deduplication (similarity > 0.9), flexibility analysis (clustering), novelty (centroid distance)
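A minimal similarity sketch (toy 3D vectors stand in for the real 1024D qwen3-embedding vectors):
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(θ) = (A · B) / (|A| × |B|)"""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.12, -0.34, 0.56])
b = np.array([0.10, -0.30, 0.60])
print(cosine_similarity(a, b))  # ≈ 0.996 → near-duplicate territory
```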
---
# Results: Semantic Diversity
## Mean Pairwise Distance (Higher = More Diverse)
> **Method:** We convert each idea into a vector embedding (qwen3-embedding:4b), then calculate the average cosine distance between all pairs of ideas within each condition. Higher values indicate ideas are more spread out in semantic space.
| Condition | Mean | SD | vs C1 (Cohen's d) |
|-----------|:----:|:--:|:-----------------:|
| C1 Direct | 0.294 | 0.039 | - |
| C2 Expert-Only | 0.400 | 0.028 | **3.15*** |
| C3 Attribute-Only | 0.377 | 0.036 | **2.20*** |
| C4 Full Pipeline | 0.395 | 0.019 | **3.21*** |
| C5 Random | 0.405 | 0.062 | **2.72*** |
*p < 0.001, Large effect sizes (d > 0.8)
> **Cohen's d:** Measures effect size (how big the difference is). d > 0.8 = large effect, d > 0.5 = medium, d > 0.2 = small.
---
# Results: ANOVA Summary
## Normalized Diversity Metric
> **Method:** Two-way ANOVA tests whether Attributes and Experts each have independent effects on diversity, and whether combining them produces extra benefit (interaction). F-statistic measures variance between groups vs within groups.
| Effect | F | p | Significant |
|--------|:-:|:-:|:-----------:|
| **Attributes (RQ1)** | 5.31 | 0.027 | Yes |
| **Experts (RQ2)** | 26.07 | <0.001 | Yes |
| **Interaction (RQ3)** | - | - | Sub-additive |
**Key Finding:** Both factors work, but combination is **not synergistic**
---
# Results: Expert vs Random (RQ4)
## C2 (Expert-Only) vs C5 (Random-Perspective)
| Metric | C2 Expert | C5 Random | p-value | Effect |
|--------|:---------:|:---------:|:-------:|:------:|
| Diversity | 0.399 | 0.414 | 0.463 | n.s. |
| Query Distance | 0.448 | 0.437 | 0.654 | n.s. |
**Finding:** Random words perform as well as domain experts
Implication: The value may be in **perspective shift itself**, not expert knowledge
---
# Results: Efficiency Analysis
## Diversity per Idea Generated
| Condition | Mean Ideas | Diversity | Efficiency |
|-----------|:----------:|:---------:|:----------:|
| C1 Direct | 20.0 | 0.293 | 1.46 |
| C2 Expert-Only | 20.0 | 0.399 | **1.99** |
| C3 Attribute-Only | 12.8 | 0.376 | **3.01** |
| C4 Full Pipeline | 51.9 | 0.393 | 0.78 |
| C5 Random | 20.0 | 0.405 | 2.02 |
**C4 produces 2.6× more ideas but achieves same diversity**
---
# Visualization: Diversity by Condition
![height:450px](../results/figures/20260119_165650_diversity_boxplot.png)
---
# Visualization: Query Distance
![height:450px](../results/figures/20260119_165650_query_distance_boxplot.png)
---
# Advanced Analysis: Lexical Diversity
## Type-Token Ratio & Vocabulary Richness
> **Method:** Type-Token Ratio (TTR) = unique words ÷ total words. High TTR means more varied vocabulary; low TTR means more word repetition. Vocabulary size counts total unique words across all ideas in a condition.
| Condition | TTR | Vocabulary | Avg Words/Idea |
|-----------|:---:|:----------:|:--------------:|
| C1 Direct | **0.382** | 853 | 11.5 |
| C2 Expert-Only | 0.330 | 1,358 | 20.8 |
| C3 Attribute-Only | 0.330 | 1,098 | 26.6 |
| C4 Full Pipeline | 0.189 | **1,992** | 26.2 |
| C5 Random | 0.320 | 1,331 | 20.9 |
**Finding:** C4 has largest vocabulary (1,992) but lowest TTR (0.189)
→ More words but more repetition across ideas
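A minimal TTR sketch (whitespace tokenization for illustration; the real analysis presumably tokenizes more carefully):
```python
def type_token_ratio(ideas: list[str]) -> float:
    """TTR = unique words / total words, pooled over all ideas in a condition."""
    tokens = [w for idea in ideas for w in idea.lower().split()]
    return len(set(tokens)) / len(tokens)

print(type_token_ratio(["Ergonomic office chair", "Foldable office chair"]))  # 4/6 ≈ 0.67
```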
---
# Advanced Analysis: Concept Extraction
## Top Keywords by Condition
> **Method:** Extract meaningful keywords from idea texts using NLP (removing stopwords, lemmatization). Top keywords show most frequent concepts; unique keywords count distinct terms. Domain coverage checks if ideas span different knowledge areas.
| Condition | Top Keywords | Unique Keywords |
|-----------|--------------|:---------------:|
| C1 Direct | solar, powered, smart, delivery, drone | 805 |
| C2 Expert | real, create, design, time, develop | 1,306 |
| C3 Attribute | real, time, create, develop, powered | 1,046 |
| C4 Pipeline | time, real, data, ensuring, enhancing | **1,937** |
| C5 Random | like, solar, inspired, energy, uses | 1,286 |
**Finding:** C5 Random shows "inspired" → suggests analogical thinking
All conditions cover 6 domain categories
---
# Advanced Analysis: Novelty Scores
## Distance from Global Centroid (Higher = More Novel)
> **Method:** Compute the centroid (average vector) of ALL ideas across all conditions. Then measure each idea's distance from this "typical idea" center. Ideas far from the centroid are semantically unusual compared to the overall pool.
| Condition | Mean | Std | Interpretation |
|-----------|:----:|:---:|----------------|
| C1 Direct | 0.273 | 0.037 | Closest to "typical" ideas |
| C2 Expert-Only | 0.315 | 0.062 | Moderate novelty |
| C3 Attribute-Only | 0.337 | 0.066 | Moderate novelty |
| C5 Random | 0.365 | 0.069 | High novelty |
| **C4 Full Pipeline** | **0.395** | 0.083 | **Highest novelty** |
**Finding:** C4 produces ideas furthest from the "average" idea space
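A minimal centroid-distance sketch (assumes `embeddings` stacks all ideas from all conditions):
```python
import numpy as np

def novelty_scores(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance of each idea from the global centroid."""
    centroid = embeddings.mean(axis=0)
    sims = embeddings @ centroid / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid))
    return 1.0 - sims
```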
---
# Advanced Analysis: Cross-Condition Cohesion
## % Nearest Neighbors from Same Condition
> **Method:** For each idea, find its K nearest neighbors in embedding space. Cohesion = percentage of neighbors from the same condition. High cohesion means ideas from that condition cluster together; low cohesion means they're scattered among other conditions.
| Condition | Cohesion | Interpretation |
|-----------|:--------:|----------------|
| **C4 Full Pipeline** | **88.6%** | Highly distinct idea cluster |
| C2 Expert-Only | 72.7% | Moderate clustering |
| C5 Random | 71.4% | Moderate clustering |
| C1 Direct | 70.8% | Moderate clustering |
| C3 Attribute-Only | 51.2% | Ideas scattered, overlap with others |
**Finding:** C4 ideas form a distinct cluster in semantic space
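A minimal cohesion sketch using scikit-learn nearest neighbors (K=10 and the cosine metric are assumptions about the project's settings):
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cohesion_by_condition(embeddings: np.ndarray, labels: np.ndarray,
                          k: int = 10) -> dict:
    """% of each idea's k nearest neighbors that share its condition label."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)                # column 0 is the idea itself
    same = labels[idx[:, 1:]] == labels[:, None]
    return {c: float(same[labels == c].mean()) for c in np.unique(labels)}
```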
---
# Advanced Analysis: AUT Flexibility
## Semantic Category Diversity (Hadas & Hershkovitz 2024)
> **Method:** Uses the Alternative Uses Task (AUT) flexibility framework. Embedding-based: Hierarchical clustering with average linkage, cut at distance threshold 0.5. Higher cluster count = more semantic categories covered = higher flexibility.
| Condition | Embedding Clusters | Mean Pairwise Similarity |
|-----------|:------------------:|:------------------------:|
| **C5 Random** | **15** | 0.521 (most diverse) |
| **C2 Expert-Only** | **13** | 0.517 |
| C3 Attribute-Only | 12 | - |
| C4 Full Pipeline | 10 | 0.583 |
| C1 Direct | **1** | 0.647 (most similar) |
**Finding:** Expert perspectives (C2, C5) produce more diverse categories than direct generation (C1)
---
# Advanced Analysis: Combined Jump Signal
## Enhanced Method from arXiv:2405.00899
> **Method:** Combined jump signal uses logical AND of two conditions:
> - **jumpcat:** Category changes between consecutive ideas (from embedding clustering)
> - **jumpSS:** Semantic similarity < 0.7 (ideas are semantically dissimilar)
>
> **True jump = jumpcat ∧ jumpSS** — reduces false positives where similar ideas happen to be in different clusters.
| Condition | Cat-Only | Sem-Only | **Combined** | Profile |
|-----------|:--------:|:--------:|:------------:|---------|
| C2 Expert-Only | 54 | 125 | **48** | Persistent |
| C3 Attribute-Only | 34 | 107 | **33** | Persistent |
| C5 Random | 22 | 116 | **20** | Persistent |
| C4 Full Pipeline | 13 | 348 | **13** | Persistent |
| C1 Direct | 0 | 104 | **0** | Persistent |
**Finding:** Combined jumps ≤ category jumps (as expected). All conditions show "Persistent" exploration pattern.
---
# Advanced Analysis: Flexibility Profiles
## Classification Based on Combined Jump Ratio
> **Method:** Classify creativity style based on normalized jump ratio (jumps / transitions):
> - **Persistent:** ratio < 0.30 (deep exploration within categories)
> - **Flexible:** ratio > 0.45 (broad exploration across categories)
> - **Mixed:** 0.30 ≤ ratio ≤ 0.45
| Condition | Combined Jump Ratio | Profile | Interpretation |
|-----------|:-------------------:|:-------:|----------------|
| C3 Attribute-Only | **26.6%** | Persistent | Moderate category switching |
| C2 Expert-Only | **24.4%** | Persistent | Moderate category switching |
| C5 Random | 10.1% | Persistent | Low category switching |
| **C4 Full Pipeline** | **3.2%** | Persistent | Very deep within-category exploration |
| C1 Direct | 0.0% | Persistent | Single semantic cluster |
**Key Insight:** C4's low jump ratio indicates focused, persistent exploration within novel semantic territory
---
# Key Finding: Originality-Flexibility Correlation
## Does Our Pipeline Break the Typical LLM Pattern?
> **Paper Finding (arXiv:2405.00899):**
> - **Humans:** No correlation between flexibility and originality (r ≈ 0)
> - **LLMs:** Positive correlation — flexible LLMs score higher on originality
**Our Results:**
| Metric | Value | Interpretation |
|--------|:-----:|----------------|
| **Pearson r** | **0.071** | Near zero correlation |
| Interpretation | **Human-like pattern** | Breaks typical LLM pattern |
**Per-Condition Breakdown:**
| Condition | Novelty | Flexibility (combined jumps) |
|-----------|:-------:|:----------------------------:|
| C4 Full Pipeline | **0.395** (highest) | **13** (lowest) |
| C5 Random | 0.365 | 20 |
| C3 Attribute-Only | 0.337 | 33 |
| C2 Expert-Only | 0.315 | 48 (highest) |
| C1 Direct | 0.273 (lowest) | 0 |
**Critical Finding:** The attribute+expert pipeline (C4) achieves **highest novelty with lowest flexibility**, demonstrating that structured context-free generation produces **focused novelty** rather than scattered exploration.
---
# Cumulative Jump Profile Visualization
## Exploration Patterns Over Generation Sequence
> **Method:** Track cumulative jump count at each response position. Steep slopes indicate rapid category switching; flat regions indicate persistent exploration within categories.
![height:400px](../results/cumulative_jump_profiles.png)
**Visual Pattern:**
- C2/C3 show steady accumulation of jumps → regular category switching
- C4/C5 show flatter profiles → persistent within-category exploration
- C1 is flat (0 jumps) → all ideas in single cluster
---
# Flexibility vs Novelty: Key Insight
## Novelty and Flexibility are Orthogonal Dimensions
| Condition | Novelty (centroid dist) | Flexibility (combined jumps) | Pattern |
|-----------|:-----------------------:|:----------------------------:|---------|
| C4 Pipeline | **0.395** (highest) | **13** (lowest) | High novel, low flex |
| C5 Random | 0.365 | 20 | High novel, low flex |
| C2 Expert | 0.315 | **48** (highest) | Moderate novel, high flex |
| C3 Attribute | 0.337 | 33 | Moderate both |
| C1 Direct | 0.273 (lowest) | 0 | Typical, single category |
**Interpretation:**
- **C1 Direct** produces similar ideas within one typical category (low novelty, no jumps)
- **C4 Full Pipeline** produces the most novel ideas with focused exploration (low jump ratio)
- **C2 Expert-Only** produces the most category switching but moderate novelty
- **r = 0.071** confirms these are orthogonal dimensions (human-like pattern)
---
# Embedding Visualization: PCA
> **Method:** Principal Component Analysis reduces high-dimensional embeddings (1024D) to 2D for visualization by finding directions of maximum variance. Points close together = semantically similar ideas. Colors represent conditions.
![height:450px](../results/embedding_pca.png)
---
# Embedding Visualization: t-SNE
> **Method:** t-SNE (t-distributed Stochastic Neighbor Embedding) preserves local neighborhood structure when reducing to 2D. Better at revealing clusters than PCA, but distances between clusters are less meaningful. Good for seeing if conditions form distinct groups.
![height:450px](../results/embedding_tsne.png)
---
# Integrated Findings
## What the Advanced Analysis Reveals
| Analysis | C4 Full Pipeline Characteristic |
|----------|--------------------------------|
| Lexical | Largest vocabulary (1,992 words) |
| Novelty | Highest distance from centroid (0.395) |
| Cohesion | Tightest cluster (88.6% same-condition NN) |
| Diversity | High pairwise distance (0.395) |
| **Flexibility** | **Lowest combined jumps (13) = focused exploration** |
**Interpretation:** C4 creates a **distinct semantic territory** -
novel ideas that are internally coherent but far from other conditions.
Low flexibility (3.2% jump ratio) indicates deep, focused exploration within a novel space.
## Understanding Novelty vs Flexibility
| Condition | Novelty | Flexibility (jumps) | Strategy |
|-----------|:-------:|:-------------------:|----------|
| C1 Direct | Low | Lowest (0) | Typical, single category |
| C2 Expert | Medium | **Highest (48)** | Experts = diverse exploration |
| C3 Attribute | Medium | Medium (33) | Structured exploration |
| C5 Random | High | Low (20) | Random but focused |
| **C4 Pipeline** | **Highest** | **Low (13)** | **Focused novelty** |
---
# Critical Limitation
## Embedding Distance ≠ True Novelty
Current metrics measure **semantic spread**, not **creative value**
| What We Measure | What We Miss |
|-----------------|--------------|
| Vector distance | Practical usefulness |
| Cluster spread | Conceptual surprise |
| Query distance | Non-obviousness |
| | Feasibility |
```
"Quantum entanglement chair" → High distance, Low novelty
"Chair legs as drumsticks" → Low distance, High novelty
```
---
# Torrance Creativity Framework
## What True Novelty Assessment Requires
| Dimension | Definition | Our Coverage |
|-----------|------------|:------------:|
| **Fluency** | Number of ideas | ✓ Measured |
| **Flexibility** | Category diversity | ✓ Measured (LLM + embedding) |
| **Originality** | Statistical rarity | Not measured |
| **Elaboration** | Detail & development | Not measured |
**Originality requires human judgment or LLM-as-Judge**
---
# Discussion: The Attribute Anchoring Effect
## Why C4 Has Highest Novelty but Lowest Flexibility
```
C2 (Expert-Only): HIGHEST FLEXIBILITY (48 combined jumps)
Architect → "load-bearing furniture"
Chef → "dining experience design"
← Each expert explores freely, frequent category switching
C4 (Full Pipeline): LOWEST FLEXIBILITY (13 combined jumps, 3.2% ratio)
All experts respond to same attribute set
Architect + "portable" → "modular portable"
Chef + "portable" → "portable serving"
← Attribute anchoring constrains category switching
← BUT forced bisociation produces HIGHEST NOVELTY
```
**Key Mechanism:** Attributes anchor experts to similar conceptual space (low flexibility),
but context-free keyword generation forces novel associations (high novelty).
**Result:** "Focused novelty" — deep exploration in a distant semantic territory
---
# Key Findings Summary
| RQ | Question | Answer |
|----|----------|--------|
| RQ1 | Attributes increase diversity? | **Yes** (p=0.027) |
| RQ2 | Experts increase diversity? | **Yes** (p<0.001) |
| RQ3 | Synergistic interaction? | **No** (sub-additive) |
| RQ4 | Experts > Random? | **No** (p=0.463) |
**Additional Findings (arXiv:2405.00899 Metrics):**
- Full Pipeline (C4) has **highest novelty** but **lowest flexibility**
- **Originality-Flexibility correlation r=0.071** (human-like, breaks typical LLM pattern)
- Novelty and Flexibility are **orthogonal dimensions**
- All conditions show **Persistent** exploration profile (combined jump ratio < 30%)
- Direct generation (C1) produces ideas in a **single semantic cluster**
---
# Limitations
1. **Sample Size:** 10 queries (pilot study)
2. **Novelty Measurement:** Embedding-based metrics only measure semantic distance, not true creative value
3. **Single Model:** Results may vary with different LLMs
4. **No Human Evaluation:** No validation of idea quality or usefulness
5. **Fixed Categories:** 4 attribute categories may limit exploration
---
# Future Work
## Immediate Next Steps
1. **Human Assessment Interface** (Built)
- Web-based rating tool with Torrance dimensions
- Stratified sampling: 200 ideas (4 per condition × 10 queries)
- 4 dimensions: Originality, Elaboration, Coherence, Usefulness
2. **Multi-Model Validation** (Priority)
- Replicate on GPT-4, Claude, Llama-3
- Verify findings generalize across LLMs
3. **LLM-as-Judge evaluation** for full-scale scoring
4. **Scale to 30 queries** for statistical power
5. **Alternative pipeline designs** to address attribute anchoring
**Documentation:**
- `experiments/docs/future_research_plan_zh.md` - Detailed research plan
- `experiments/docs/creative_process_metrics_zh.md` - arXiv:2405.00899 metrics explanation
---
# Conclusion
## Key Takeaways
1. **Both attribute decomposition and expert perspectives significantly increase semantic diversity** compared to direct generation
2. **The combination is sub-additive**, suggesting attribute structure may constrain expert creativity
3. **Random perspectives work as well as domain experts**, implying the value is in perspective shift, not expert knowledge
4. **Novelty and Flexibility are orthogonal creativity dimensions** - high novelty ≠ high flexibility
- C4 Full Pipeline: Highest novelty, lowest flexibility
- C5 Random: Higher flexibility, moderate novelty
5. **🔑 Key Finding:** The pipeline produces **human-like originality-flexibility patterns** (r=0.071)
- Typical LLMs show positive correlation (flexible → more original)
- Our method breaks this pattern: high novelty with focused exploration
6. **True novelty assessment requires judgment-based evaluation** beyond embedding metrics
---
# Appendix: Statistical Details
## T-test Results (vs C1 Baseline)
| Comparison | t | p | Cohen's d |
|------------|:-:|:-:|:---------:|
| C4 vs C1 | 8.55 | <0.001 | 4.05 |
| C2 vs C1 | 7.67 | <0.001 | 3.43 |
| C3 vs C1 | 4.23 | <0.001 | 1.89 |
All experimental conditions significantly outperform baseline
---
# Appendix: Experiment Configuration
```python
EXPERIMENT_CONFIG = {
"model": "qwen3:8b",
"temperature": 0.9,
"expert_count": 4,
"expert_source": "curated", # 210 occupations
"keywords_per_expert": 1,
"categories": ["Functions", "Usages",
"User Groups", "Characteristics"],
"dedup_threshold": 0.90,
"random_seed": 42
}
```
---
# Thank You
## Questions?
**Repository:** novelty-seeking
**Experiment Date:** January 19, 2026
**Contact:** [Your Email]
---
# Backup Slides
---
# Backup: Deduplication Threshold Analysis
Original threshold (0.85) was too aggressive:
- 40.5% of removed pairs were borderline (0.85-0.87)
- Many genuinely different concepts were grouped
Raised to 0.90:
- RQ1 (Attributes) became significant (p: 0.052 → 0.027)
- Preserved ~103 additional unique ideas
---
# Backup: Sample Ideas by Condition
## Query: "Chair"
**C1 Direct:**
- Ergonomic office chair with lumbar support
- Foldable camping chair
**C2 Expert-Only (Architect):**
- Load-bearing furniture integrated into building structure
**C4 Full Pipeline:**
- Asset-tracking chairs with RFID for corporate inventory
- (Accountant + "portable" → "mobile assets")
---
# Backup: Efficiency Calculation
$$\text{Efficiency} = \frac{\text{Mean Pairwise Distance}}{\text{Idea Count}} \times 100$$
| Condition | Calculation | Result |
|-----------|-------------|:------:|
| C3 Attribute | 0.376 / 12.8 × 100 | 3.01 |
| C4 Pipeline | 0.393 / 51.9 × 100 | 0.78 |
C3 achieves 96% of C4's diversity with 25% of the ideas


@@ -0,0 +1,342 @@
# Publication Plan and Future Work
**Created:** 2026-01-19
**Project:** Breaking Semantic Gravity in LLM-Based Creative Ideation
---
## 1. Publication Feasibility Assessment
### Coverage of Existing Work
| Topic | Representative Paper | Our Difference |
|------|----------|------------|
| LLM creativity evaluation | Organisciak et al. (2023) | They evaluate LLM creativity; we **enhance** it |
| AUT flexibility scoring | Hadas & Hershkovitz (2024) | Theirs is an evaluation method; ours is a **generation method** |
| Prompt engineering | Zhou et al. (2023) | They optimize prompts; we build a **structured pipeline** |
| LLM-as-Judge | Zheng et al. (2023) | An evaluation tool, not a core contribution |
### Unique Contributions of This Study
| Contribution | Description | Academic Value |
|--------|------|----------|
| Context-free keyword generation | Experts never see the original query, forcing bisociation | Methodological innovation |
| Sub-additive interaction | Attributes × experts = sub-additive | Empirical finding |
| Random perspectives ≈ domain experts | The perspective shift matters more than domain knowledge | Theoretical contribution |
| Novelty-flexibility orthogonality | First verified in LLM creative generation | Theoretical validation |
---
## 2. Current Research Status
### Completed ✓
| Element | Status | Details |
|------|:----:|------|
| Theoretical framework | ✓ | Bisociation Theory + Torrance Creativity Framework |
| Experimental design | ✓ | 2×2 factorial + control (5 conditions) |
| Pipeline implementation | ✓ | Attribute decomposition → expert transformation → deduplication |
| Automatic metrics | ✓ | Novelty, flexibility, diversity, cohesion, jump signal |
| Human assessment interface | ✓ | Web-based Torrance rating tool |
| Statistical analysis | ✓ | ANOVA, effect sizes, correlation analysis |
| Pilot experiment | ✓ | 10 queries, Qwen3:8b, 1,119 ideas |
### Still Needed ✗
| Gap | Importance | Notes |
|------|:------:|------|
| Multi-model validation | **High** | Only Qwen3:8b so far |
| Human evaluation data | **High** | Interface built, but no data collected |
| Larger sample | **Medium** | 10 → 30-50 queries |
| Baseline comparison | **Medium** | Compare against other creativity-enhancement methods |
| LLM-as-Judge | Medium | Validate correlation with human ratings |
---
## 3. Publication Strategy Options
### Option A: Full Paper (Top Conference/Journal)
**Target venues:**
- ACL / EMNLP (top NLP conferences)
- CHI (top HCI conference)
- Creativity Research Journal
- Thinking Skills and Creativity
**Suggested title:**
> "Breaking Semantic Gravity: Context-Free Expert Perspectives for LLM Creative Ideation"
**Work still required:**
| Task | Est. Time | Priority |
|----------|:--------:|:------:|
| GPT-4 experiments | 1 week | P0 |
| Claude experiments | 1 week | P0 |
| Llama-3 experiments | 1 week | P1 |
| Human evaluation collection | 2-3 weeks | P0 |
| Sample expansion (30 queries) | 1 week | P1 |
| Baseline comparison experiments | 1-2 weeks | P1 |
| Paper writing | 2-3 weeks | - |
**Total estimate:** 2-3 months
---
### Option B: Short Paper / Workshop Paper
**Targets:**
- ACL/EMNLP Workshop on Creativity and AI
- NeurIPS Workshop on Creativity and Design
- ICCC (International Conference on Computational Creativity)
**Work still required:**
| Task | Est. Time | Priority |
|----------|:--------:|:------:|
| GPT-4 experiments | 1 week | P0 |
| Small-scale human evaluation (50-100 ideas) | 1 week | P0 |
| Paper writing | 1 week | - |
**Total estimate:** 2-4 weeks
---
## 4. Supplementary Experiment Plan
### Phase 1: Multi-Model Validation (Priority P0)
```
Goal: verify that the method generalizes
Model list:
□ GPT-4 / GPT-4o (OpenAI)
□ Claude 3.5 Sonnet (Anthropic)
□ Llama-3-70B (Meta)
□ Gemini Pro (Google) [optional]
Design:
- same 10 queries
- same 5 conditions
- same evaluation metrics
Expected outcomes:
- cross-model consistency analysis
- identification of model-specific effects
```
### Phase 2: Human Evaluation (Priority P0)
```
Goal: validate the correlation between automatic metrics and human judgment
Rating dimensions (Torrance framework):
1. Originality - 1-5 Likert
2. Elaboration - 1-5 Likert
3. Feasibility - 1-5 Likert
4. Nonsense - binary
Sampling strategy:
- stratified: 4 ideas per condition × per query
- total: 5 × 10 × 4 = 200 ideas
- raters: 3-5 (compute ICC)
Interface:
- built: experiments/assessment/
- needed: recruit raters, collect data
```
### Phase 3: Sample Expansion (Priority P1)
```
Goal: increase statistical power
Expansion plan:
- current: 10 queries
- target: 30-50 queries
Query sources:
- objects: furniture, tools, appliances, vehicles
- concepts: services, systems, processes
- hybrid: combined physical and digital elements
Power analysis:
- current effect sizes d ≈ 2-3 (large)
- 30 queries should be enough for power > 0.95
```
### Phase 4: Baseline Comparison (Priority P1)
```
Goal: compare against existing methods
Baselines:
1. Vanilla prompting
   "Generate creative uses for [object]"
2. Chain-of-Thought (CoT)
   "Think step by step about creative uses..."
3. Few-shot examples
   provide 3-5 creative examples
4. Role-playing (standard)
   "As a [expert], suggest uses for [object]"
   (the expert sees the full query)
Comparison metrics:
- novelty, flexibility, diversity
- idea count, generation time
- human rating scores
```
---
## 5. Draft Paper Outline
### Title
"Breaking Semantic Gravity: Context-Free Expert Perspectives for Enhanced LLM Creative Ideation"
### Abstract
- Problem: LLMs generate ideas clustered around training distributions
- Method: Attribute decomposition + context-free expert transformation
- Results: Sub-additive interaction, random ≈ expert, novelty ⊥ flexibility
- Contribution: Novel pipeline + empirical findings
### 1. Introduction
- Semantic gravity problem in LLM creativity
- Bisociation theory and creative thinking
- Research questions (RQ1-4)
### 2. Related Work
- LLM creativity evaluation
- Prompt engineering for creativity
- Computational creativity methods
### 3. Method
- Pipeline architecture
- Context-free keyword generation
- Experimental design (2×2 + control)
### 4. Evaluation Framework
- Automatic metrics (novelty, flexibility, diversity)
- Human evaluation (Torrance dimensions)
- LLM-as-Judge validation
### 5. Results
- RQ1: Attribute effect
- RQ2: Expert effect
- RQ3: Interaction effect
- RQ4: Expert vs Random
- Cross-model validation
### 6. Discussion
- Attribute anchoring effect
- Value of perspective shift
- Novelty vs flexibility orthogonality
### 7. Conclusion
- Contributions
- Limitations
- Future work
---
## 6. Timeline
### Fast Track (Workshop Paper)
```
Week 1-2: multi-model experiments (GPT-4, Claude)
Week 2-3: small-scale human evaluation
Week 3-4: paper writing and submission
Target: 2026 Q1 workshop deadline
```
### Full Track (Full Paper)
```
Month 1:
- Week 1-2: multi-model experiments
- Week 3-4: sample expansion
Month 2:
- Week 1-2: human evaluation collection
- Week 3-4: baseline comparison experiments
Month 3:
- Week 1-2: data analysis and statistics
- Week 3-4: paper writing
Target: ACL 2026 / EMNLP 2026
```
---
## 7. Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:------:|:----:|----------|
| Inconsistent cross-model results | Medium | High | Report as "model-specific findings" |
| Low ICC in human ratings | Medium | Medium | Add raters; revise the rating guidelines |
| Effect vanishes at larger samples | Low | High | Current effect sizes are large; low risk |
| A competing paper publishes first | Low | High | Submit to a workshop first to establish priority |
---
## 8. Resource Requirements
### Compute
| Resource | Use | Est. Cost |
|------|------|:--------:|
| OpenAI API | GPT-4 experiments | ~$50-100 |
| Anthropic API | Claude experiments | ~$50-100 |
| Local GPU | Llama experiments | Already available |
| Ollama | Embeddings | Already available |
### People
| Role | Need | Notes |
|------|------|------|
| Human raters | 3-5 | Recruit classmates or crowdsource |
| Statistics consultant | Optional | For complex statistical analysis |
---
## 9. Success Criteria
### Short term (within 1 month)
- [ ] Complete the GPT-4 experiments
- [ ] Complete the Claude experiments
- [ ] Collect at least 100 human rating samples
### Medium term (within 3 months)
- [ ] Complete all model experiments
- [ ] Complete the human evaluation (200+ samples, ICC > 0.7)
- [ ] Complete the baseline comparison
- [ ] Submit the first paper
### Long term (within 6 months)
- [ ] Paper accepted
- [ ] Open-source the code and dataset
- [ ] Extend to other creative tasks
---
## 10. References
1. Hadas, S., & Hershkovitz, A. (2024). Using Large Language Models to Evaluate Alternative Uses Task Flexibility Score. *Thinking Skills and Creativity*, 52, 101549.
2. Organisciak, P., et al. (2023). Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models. *Thinking Skills and Creativity*, 49, 101356.
3. Koestler, A. (1964). *The Act of Creation*. Hutchinson.
4. Torrance, E.P. (1974). *Torrance Tests of Creative Thinking*. Scholastic Testing Service.
5. Stevenson, C., et al. (2024). Characterising the Creative Process in Humans and Large Language Models. *arXiv:2405.00899*.
6. Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. *NeurIPS 2023*.


@@ -0,0 +1,178 @@
# Presentation Speaker Notes
---
## Opening (1-2 minutes)
**Problem:** LLMs suffer from "semantic gravity" when generating creative ideas
- Ask for "innovative uses for a chair" → you get "ergonomic chair", "folding chair"
- Ideas concentrate in the high-frequency regions of the training data
**Our solution:** bisociation
- Attribute decomposition + expert perspectives + context-free keywords
- Force unexpected connections
---
## Experimental Design (1 minute)
**Five conditions (2×2 + control):**
| Condition | Shorthand | Point |
|------|------|------|
| C1 | Direct | Baseline |
| C2 | Experts only | Experts improvise freely |
| C3 | Attributes only | Structure without experts |
| C4 | Full pipeline | Attributes + experts |
| C5 | Random words | Control: random vs. expert |
**Key design point:** experts **never see the original query** when generating keywords
- Accountant + "portable" → "mobile assets" (doesn't know it's a chair)
- Then recombine "mobile assets" + "chair"
---
## Answers to the Four Research Questions
| RQ | Question | Answer | One-liner |
|----|------|:----:|--------|
| RQ1 | Do attributes help? | ✓ Yes | p=0.027 |
| RQ2 | Do experts help? | ✓ Yes | p<0.001 |
| RQ3 | Any synergy? | ✗ No | Sub-additive |
| RQ4 | Experts > random? | ✗ No | p=0.463 |
**Surprise finding:** random words work as well as experts → the value lies in the "perspective shift" itself
---
## Core Numbers (memorize these)
### Novelty (centroid distance; higher = more novel)
```
C4: 0.395 ← highest!
C5: 0.365
C3: 0.337
C2: 0.315
C1: 0.273 ← lowest (most typical)
```
### Flexibility (combined jumps; higher = more scattered)
```
C2: 48 ← highest! (experts explore freely)
C3: 33
C5: 20
C4: 13 ← lowest! (focused exploration)
C1: 0  ← single cluster
```
---
## 🔑 Key Findings (the headline)
### Finding 1: Originality-Flexibility Correlation
**The paper says:**
- Humans: r ≈ 0 (uncorrelated)
- Typical LLMs: r > 0 (positively correlated)
**Our result: r = 0.071 (near zero)**
**We get a "human-like" creative pattern!**
### Finding 2: C4's Unique Position
```
C4 = highest novelty + lowest flexibility
This means "focused novelty":
- not jumping all over the place (high flexibility)
- but digging into one novel domain (low flexibility, high novelty)
- like the creative pattern of human experts
```
### Finding 3: Why Does This Happen?
```
Attribute anchoring effect:
All experts respond to the same attribute set
→ ideas are anchored in a similar conceptual space (low flexibility)
→ but context-free keywords force novel associations (high novelty)
Result: focused novelty
```
---
## Methodology Highlights
### Combined Jump Signal
- Old method: looks only at category switches
- New method: category switch **and** semantic dissimilarity
- Fewer false positives, more accurate
### Flexibility Profile Classification
| Profile | Jump Ratio | Our Result |
|------|:--------:|:----------:|
| Persistent | <30% | All conditions |
| Mixed | 30-45% | None |
| Flexible | >45% | None |
→ LLMs lean toward "persistent exploration" rather than "flexible jumping"
---
## Limitations (being honest)
1. **Small sample:** 10 queries (pilot study)
2. **No human evaluation:** embedding metrics only
3. **Single model:** only Qwen3:8b tested
4. **Semantic distance ≠ true novelty:** a "quantum entanglement chair" is distant but not genuinely novel
---
## Next Steps (if asked)
1. **Human assessment interface** (already built)
2. **Multi-model validation** (GPT-4, Claude)
3. **LLM-as-Judge** for large-scale scoring
4. **30 queries** for more statistical power
---
## One-Sentence Summary
> **Our attribute+expert pipeline makes the LLM produce a "human-expert-like" creative pattern:
> high novelty with focused exploration, breaking the typical LLM "flexibility = novelty" correlation.**
---
## Quick Q&A
**Q: Why do random words work as well as experts?**
A: The value lies in the perspective shift itself, not in domain expertise.
**Q: Why does C4 have the lowest flexibility but the highest novelty?**
A: Attributes anchor all experts in the same conceptual space, while context-free keywords force novel connections.
**Q: What does r=0.071 mean?**
A: Novelty and flexibility are uncorrelated, as in humans, breaking the typical LLM positive correlation.
**Q: Is a Persistent profile good or bad?**
A: Neither; it is an exploration strategy. C4 shows a pipeline can be persistent and still novel.
**Q: What is the practical takeaway?**
A: For high novelty → use C4; for diverse categories → use C2.
---
## Quick Reference Numbers
| Metric | C1 | C2 | C3 | C4 | C5 |
|------|:--:|:--:|:--:|:--:|:--:|
| Idea count | 195 | 198 | 125 | **402** | 199 |
| Novelty | 0.273 | 0.315 | 0.337 | **0.395** | 0.365 |
| Flexibility (jumps) | 0 | **48** | 33 | 13 | 20 |
| Jump ratio | 0% | 24% | 27% | **3%** | 10% |
| Cohesion | 71% | 73% | 51% | **89%** | 71% |
**Mnemonic:** C4 = most novel, most cohesive, least flexible = "focused novelty"