Research Publication Plan and Future Work

Created: 2026-01-19
Project: Breaking Semantic Gravity in LLM-Based Creative Ideation

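I. Publication Feasibility Assessment

Coverage of Existing Research

Topic	Representative paper	How we differ
LLM creativity evaluation	Organisciak et al. (2023)	They evaluate LLM creativity; we enhance it
AUT flexibility scoring	Hadas & Hershkovitz (2024)	Theirs is an evaluation method; ours is a generation method
Prompt engineering	Zhou et al. (2023)	They optimize prompts; we build a structured pipeline
LLM-as-Judge	Zheng et al. (2023)	An evaluation tool, not a core contribution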

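Unique Contributions of This Work

Contribution	Description	Academic value
Context-Free Keyword Generation	Experts never see the original query, which forces bisociation	Methodological innovation
Sub-additive interaction	Attribute × Expert = sub-additive	Empirical finding
Random perspective ≈ domain expert	The perspective shift itself matters more than domain expertise	Theoretical contribution
Novelty-flexibility orthogonality	Validated for the first time in LLM creative ideation	Theoretical validation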
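
To make the "sub-additive interaction" row concrete, here is a minimal sketch of how the two main effects and the interaction contrast fall out of the four cell means of a 2×2 design. The cell names and every number are illustrative placeholders, not results from this study; a negative interaction contrast means the combined gain is smaller than the sum of the two separate gains.

```python
from dataclasses import dataclass


@dataclass
class CellMeans:
    """Mean score per cell of a 2x2 design (hypothetical placeholder values)."""
    direct: float          # no attributes, no experts
    attribute_only: float  # attributes only
    expert_only: float     # experts only
    full_pipeline: float   # attributes and experts


def factorial_effects(m: CellMeans) -> dict:
    """Return main effects and the interaction contrast from the cell means."""
    # Main effect of a factor: its average gain across both levels of the other.
    attribute_effect = ((m.attribute_only - m.direct)
                        + (m.full_pipeline - m.expert_only)) / 2
    expert_effect = ((m.expert_only - m.direct)
                     + (m.full_pipeline - m.attribute_only)) / 2
    # Interaction: expert gain with attributes minus expert gain without them.
    interaction = ((m.full_pipeline - m.attribute_only)
                   - (m.expert_only - m.direct))
    return {
        "attribute_effect": attribute_effect,
        "expert_effect": expert_effect,
        "interaction": interaction,
        "sub_additive": interaction < 0,
    }


# Placeholder means chosen only to illustrate a sub-additive pattern.
effects = factorial_effects(CellMeans(direct=0.30, attribute_only=0.40,
                                      expert_only=0.50, full_pipeline=0.52))
print(effects)
```

In the placeholder data, attributes alone add 0.10 and experts alone add 0.20, but the full pipeline gains only 0.22 over the direct condition, so the contrast is negative.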

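II. Current Research Status

Completed ✓

Element	Status	Details
Theoretical framework	✓	Bisociation Theory + Torrance Creativity Framework
Experimental design	✓	2×2 factorial + control (5 conditions)
Pipeline implementation	✓	Attribute decomposition → expert transformation → deduplication
Automatic evaluation metrics	✓	Novelty, flexibility, diversity, cohesion, jump signal
Human evaluation interface	✓	Web-based Torrance rating tool
Statistical analysis	✓	ANOVA, effect sizes, correlation analysis
Pilot experiment	✓	10 queries, Qwen3:8b, 1119 ideas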

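Still Needed ✗

Gap	Importance	Notes
Multi-model validation	High	Only Qwen3:8b so far
Human evaluation data	High	The interface is built, but no data has been collected
Larger sample	Medium	10 → 30-50 queries
Baseline comparison	Medium	Compare against other creativity-enhancement methods
LLM-as-Judge	Medium	Validate its correlation with human evaluation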

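III. Publication Strategy Options

Option A: Full Paper (Top Conference / Journal)

Target venues:

ACL / EMNLP (top NLP conferences)
CHI (top HCI conference)
Creativity Research Journal (creativity research journal)
Thinking Skills and Creativity (creative thinking journal)

Suggested paper title:

"Breaking Semantic Gravity: Context-Free Expert Perspectives for LLM Creative Ideation"

Additional work required:

Task	Estimated time	Priority
GPT-4 experiments	1 week	P0
Claude experiments	1 week	P0
Llama-3 experiments	1 week	P1
Human evaluation collection	2-3 weeks	P0
Sample expansion (30 queries)	1 week	P1
Baseline comparison experiments	1-2 weeks	P1
Paper writing	2-3 weeks	-

Total estimated time: 2-3 months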

Option B: Short Paper / Workshop Paper

Target venues:

ACL/EMNLP Workshop on Creativity and AI
NeurIPS Workshop on Creativity and Design
ICCC (International Conference on Computational Creativity)

Additional work required:

Task	Estimated time	Priority
GPT-4 experiments	1 week	P0
Small-scale human evaluation (50-100 ideas)	1 week	P0
Paper writing	1 week	-

Total estimated time: 2-4 weeks

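IV. Supplementary Experiment Plan

Phase 1: Multi-Model Validation (Priority P0)

Goal: verify that the method generalizes across models

Models:
  □ GPT-4 / GPT-4o (OpenAI)
  □ Claude 3.5 Sonnet (Anthropic)
  □ Llama-3-70B (Meta)
  □ Gemini Pro (Google) [optional]

Experimental design:
  - Same 10 queries
  - Same 5 conditions
  - Same evaluation metrics

Expected outputs:
  - Cross-model consistency analysis
  - Identification of model-specific effects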
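
One possible way to quantify cross-model consistency is to check whether two models rank the five conditions in the same order. The sketch below computes a Spearman rank correlation in plain Python; the choice of Spearman correlation, the model labels, and the per-condition means are all assumptions for illustration, not part of the plan or its data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mean novelty per condition, in a fixed condition order:
# direct, attribute-only, expert-only, full-pipeline, random-perspective.
model_a = [0.30, 0.40, 0.50, 0.52, 0.49]
model_b = [0.25, 0.33, 0.47, 0.45, 0.46]
rho = spearman(model_a, model_b)
print(f"rank agreement between models: {rho:.2f}")
```

With only five conditions the coefficient is coarse, so it is better read as a descriptive summary alongside per-model effect sizes than as a significance test.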

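Phase 2: Human Evaluation (Priority P0)

Goal: validate the correlation between the automatic metrics and human judgment

Evaluation dimensions (Torrance Framework):
  1. Originality - 1-5 Likert
  2. Elaboration - 1-5 Likert
  3. Feasibility - 1-5 Likert
  4. Nonsense - Binary

Sampling strategy:
  - Stratified sampling: 4 ideas per condition × query cell
  - Total: 5 × 10 × 4 = 200 ideas
  - Raters: 3-5 people (compute ICC)

Interface:
  - Already built: experiments/assessment/
  - Still needed: recruit raters, collect data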
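
The stratified sampling step above can be sketched as follows: group ideas by (condition, query) cell and draw a fixed number from each. The record fields `condition`, `query`, and `text`, and the function name, are hypothetical; the actual data format used by experiments/assessment/ is not described here.

```python
import random
from collections import defaultdict


def stratified_sample(ideas, per_cell=4, seed=0):
    """Draw up to `per_cell` ideas from every (condition, query) cell.

    `ideas` is a list of dicts with hypothetical keys
    'condition', 'query', and 'text'.
    """
    rng = random.Random(seed)  # fixed seed so the rating set is reproducible
    cells = defaultdict(list)
    for idea in ideas:
        cells[(idea["condition"], idea["query"])].append(idea)

    sample = []
    for key in sorted(cells):  # sorted for a deterministic cell order
        pool = cells[key]
        # Cells with fewer ideas than requested contribute everything they have.
        sample.extend(rng.sample(pool, min(per_cell, len(pool))))
    return sample


# Synthetic stand-in data: 5 conditions x 10 queries x 20 ideas each.
synthetic = [
    {"condition": f"condition_{c}", "query": f"query_{q}", "text": f"idea {i}"}
    for c in range(5) for q in range(10) for i in range(20)
]
rating_set = stratified_sample(synthetic, per_cell=4)
print(len(rating_set))
```

Fixing the seed keeps the sampled set identical across raters, which matters when computing agreement on the same 200 items.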

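Phase 3: Sample Expansion (Priority P1)

Goal: increase statistical power

Expansion plan:
  - Current: 10 queries
  - Target: 30-50 queries

Query sources:
  - Objects: furniture, tools, appliances, vehicles
  - Concepts: services, systems, processes
  - Hybrid: combining physical and digital elements

Power analysis:
  - Current effect size d ≈ 2-3 (large effect)
  - 30 queries should be enough to reach power > 0.95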
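
As a rough check on the power claim above, the sketch below uses a normal approximation to the power of a two-sided paired comparison across queries. It assumes that `d` is the standardized effect on per-query differences and that queries are independent; neither assumption is stated in the plan. The approximation ignores the t-distribution correction, so it is slightly optimistic for small samples.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-sided paired test with n pairs.

    Normal approximation: the test statistic is treated as N(d * sqrt(n), 1).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Power at several assumed effect sizes, for the current and target sample sizes.
for d in (0.5, 1.0, 2.0):
    for n in (10, 30, 50):
        print(f"d={d:.1f}  n={n:2d}  power~{approx_power(d, n):.3f}")
```

Under this approximation, 30 queries clear the 0.95 threshold comfortably if the true effect is anywhere near d = 2, and the margin stays useful even if the true effect is well below the pilot estimate.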

Phase 4: Baseline Comparison (Priority P1)

Goal: compare against existing methods

Baseline methods:
  1. Vanilla Prompting
     "Generate creative uses for [object]"

  2. Chain-of-Thought (CoT)
     "Think step by step about creative uses..."

  3. Few-shot Examples
     Provide 3-5 creative examples

  4. Role-Playing (Standard)
     "As a [expert], suggest uses for [object]"
     (the expert sees the full query)

Comparison metrics:
  - Novelty, flexibility, diversity
  - Number of ideas, generation time
  - Human evaluation scores
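
The four baselines above could be kept in one place as prompt templates so that every method receives the same object. This is a minimal sketch: the vanilla and role-playing strings follow the wording in the plan, while the completed CoT wording, the few-shot layout, and the `build_prompt` helper are assumptions made for illustration.

```python
# Prompt templates per baseline. The "cot" and "few_shot" wordings are
# assumptions; the plan only gives a truncated CoT prompt and no few-shot text.
BASELINE_TEMPLATES = {
    "vanilla": "Generate creative uses for {obj}",
    "cot": "Think step by step about creative uses for {obj}",
    "few_shot": "{examples}\nGenerate creative uses for {obj}",
    "role_play": "As a {expert}, suggest uses for {obj}",
}


def build_prompt(method, obj, expert="", examples=()):
    """Fill in the template for one baseline method (hypothetical helper)."""
    if method not in BASELINE_TEMPLATES:
        raise ValueError(f"unknown baseline: {method}")
    if method == "role_play" and not expert:
        raise ValueError("role_play requires an expert")
    if method == "few_shot" and not examples:
        raise ValueError("few_shot requires example ideas")
    example_block = "\n".join(f"- {e}" for e in examples)
    return BASELINE_TEMPLATES[method].format(
        obj=obj, expert=expert, examples=example_block
    )


print(build_prompt("vanilla", "brick"))
print(build_prompt("role_play", "brick", expert="marine biologist"))
```

Note that in the role-playing baseline the expert sees the full query, which is the contrast with the context-free expert step in the proposed pipeline.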

V. Draft Paper Outline

Title

"Breaking Semantic Gravity: Context-Free Expert Perspectives for Enhanced LLM Creative Ideation"

Abstract

Problem: LLMs generate ideas clustered around training distributions
Method: Attribute decomposition + context-free expert transformation
Results: Sub-additive interaction, random ≈ expert, novelty ⊥ flexibility
Contribution: Novel pipeline + empirical findings

1. Introduction

Semantic gravity problem in LLM creativity
Bisociation theory and creative thinking
Research questions (RQ1-4)

2. Related Work

LLM creativity evaluation
Prompt engineering for creativity
Computational creativity methods

3. Method

Pipeline architecture
Context-free keyword generation
Experimental design (2×2 + control)

4. Evaluation Framework

Automatic metrics (novelty, flexibility, diversity)
Human evaluation (Torrance dimensions)
LLM-as-Judge validation

5. Results

RQ1: Attribute effect
RQ2: Expert effect
RQ3: Interaction effect
RQ4: Expert vs Random
Cross-model validation

6. Discussion

Attribute anchoring effect
Value of perspective shift
Novelty vs flexibility orthogonality

7. Conclusion

Contributions
Limitations
Future work

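VI. Timeline

Fast Track (Workshop Paper)

Week 1-2: Multi-model experiments (GPT-4, Claude)
Week 2-3: Small-scale human evaluation
Week 3-4: Paper writing and submission

Target: 2026 Q1 workshop deadline

Full Track (Full Paper)

Month 1:
  - Week 1-2: Multi-model experiments
  - Week 3-4: Sample expansion

Month 2:
  - Week 1-2: Human evaluation collection
  - Week 3-4: Baseline comparison experiments

Month 3:
  - Week 1-2: Data analysis and statistics
  - Week 3-4: Paper writing

Target: ACL 2026 / EMNLP 2026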

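VII. Risks and Mitigation

Risk	Likelihood	Impact	Mitigation
Inconsistent results across models	Medium	High	Report as "model-specific findings"
Low ICC in human evaluation	Medium	Medium	Add raters, revise the rating guidelines
Effects vanish in a larger sample	Low	High	Current effect sizes are large, so the risk is low
Scooped by a competing paper	Low	High	Submit to a workshop first to establish priority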

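VIII. Resource Requirements

Compute

Resource	Purpose	Estimated cost
OpenAI API	GPT-4 experiments	~$50-100
Anthropic API	Claude experiments	~$50-100
Local GPU	Llama experiments	Already available
Ollama	Embedding	Already available

Personnel

Role	Need	Notes
Human raters	3-5 people	Recruit classmates or crowdsource
Statistical consultant	Optional	Consultation on complex statistical analyses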

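IX. Success Metrics

Short term (within 1 month)

Complete the GPT-4 experiments
Complete the Claude experiments
Collect at least 100 human-evaluated samples

Medium term (within 3 months)

Complete all model experiments
Complete human evaluation (200+ samples, ICC > 0.7)
Complete the baseline comparison
Submit the first paper

Long term (within 6 months)

Paper accepted
Open-source the code and dataset
Extend to other creative tasks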
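
The ICC > 0.7 criterion above can be checked once ratings are in. The plan does not say which ICC form is intended; the sketch below assumes ICC(2,1), the two-way random-effects, absolute-agreement, single-rater form, computed from the ANOVA mean squares. The toy matrix is not project data and is not on the 1-5 Likert scale.

```python
def icc2_1(ratings):
    """ICC(2,1) for a complete matrix: rows are rated ideas, columns are raters."""
    n = len(ratings)     # number of ideas
    k = len(ratings[0])  # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Toy ratings: 6 ideas (rows) scored by 4 raters (columns).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc2_1(toy)
print(f"ICC(2,1) = {icc:.3f}  meets 0.7 threshold: {icc > 0.7}")
```

In the toy matrix the raters order the ideas similarly but use very different parts of the scale, which is why the absolute-agreement form comes out low. If only consistency of ordering matters, a different ICC form would be the appropriate choice.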

X. References

Hadas, S., & Hershkovitz, A. (2024). Using Large Language Models to Evaluate Alternative Uses Task Flexibility Score. Thinking Skills and Creativity, 52, 101549.
Organisciak, P., et al. (2023). Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models. Thinking Skills and Creativity, 49, 101356.
Koestler, A. (1964). The Act of Creation. Hutchinson.
Torrance, E.P. (1974). Torrance Tests of Creative Thinking. Scholastic Testing Service.
Stevenson, C., et al. (2024). Characterizing Creative Processes in Humans and Large Language Models. arXiv:2405.00899.
Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
