feat: Add experiments framework and novelty-driven agent loop

- Add complete experiments directory with pilot study infrastructure
  - 5 experimental conditions (direct, expert-only, attribute-only, full-pipeline, random-perspective)
  - Human assessment tool with React frontend and FastAPI backend
  - AUT flexibility analysis with jump signal detection
  - Result visualization and metrics computation
- Add novelty-driven agent loop module (experiments/novelty_loop/)
  - NoveltyDrivenTaskAgent with expert perspective perturbation
  - Three termination strategies: breakthrough, exhaust, coverage
  - Interactive CLI demo with colored output
  - Embedding-based novelty scoring
- Add DDC knowledge domain classification data (en/zh)
- Add CLAUDE.md project documentation
- Update research report with experiment findings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 10:16:21 +08:00
parent 26a56a2a07
commit 43c025e060
81 changed files with 18766 additions and 2 deletions
--- a/experiments/docs/experiment_report_2026-01-19.md
+++ b/experiments/docs/experiment_report_2026-01-19.md
@@ -0,0 +1,813 @@
+---
+marp: true
+theme: default
+paginate: true
+backgroundColor: #fff
+style: |
+  section {
+    font-size: 24px;
+  }
+  h1 {
+    color: #2c3e50;
+  }
+  h2 {
+    color: #34495e;
+  }
+  table {
+    font-size: 18px;
+  }
+  .columns {
+    display: grid;
+    grid-template-columns: 1fr 1fr;
+    gap: 1rem;
+  }
+---
+
+# Breaking Semantic Gravity in LLM-Based Creative Ideation
+
+## A Pilot Study on Attribute Decomposition and Expert Perspectives
+
+**Date:** January 19, 2026
+**Model:** Qwen3:8b (Temperature: 0.9)
+**Queries:** 10 pilot queries
+
+---
+
+# Research Problem
+
+## The "Semantic Gravity" Challenge
+
+LLMs tend to generate ideas clustered around **high-probability training distributions**
+
+```
+Query: "Chair"
+Typical LLM output:
+  - Ergonomic office chair
+  - Comfortable reading chair
+  - Foldable portable chair
+  ← All within "furniture comfort" semantic cluster
+```
+
+**Goal:** Break this gravitational pull toward obvious solutions
+
+---
+
+# Theoretical Framework
+
+## Bisociation Theory (Koestler, 1964)
+
+Creative thinking occurs when two unrelated "matrices of thought" collide
+
+**Our Approach:**
+1. **Attribute Decomposition** → Break object into structural components
+2. **Expert Perspectives** → Introduce distant domain knowledge
+3. **Context-Free Keywords** → Force unexpected conceptual leaps
+
+---
+
+# Experimental Design
+
+## 2×2 Factorial + Control
+
+| Condition | Attributes | Experts | Description |
+|-----------|:----------:|:-------:|-------------|
+| **C1** Direct | - | - | Baseline: Direct LLM generation |
+| **C2** Expert-Only | - | ✓ | Expert perspectives without structure |
+| **C3** Attribute-Only | ✓ | - | Structure without expert knowledge |
+| **C4** Full Pipeline | ✓ | ✓ | Combined approach |
+| **C5** Random-Perspective | - | Random | Control: Random words as "experts" |
+
+---
+
+# Research Questions
+
+1. **RQ1:** Does attribute decomposition increase idea diversity?
+
+2. **RQ2:** Do expert perspectives increase idea diversity?
+
+3. **RQ3:** Is there a synergistic (super-additive) interaction effect?
+
+4. **RQ4:** Do domain-relevant experts outperform random perspectives?
+
+---
+
+# Pipeline Architecture
+
+## C4: Full Pipeline Process
+
+```
+Query: "Chair"
+    ↓
+Step 1: Attribute Decomposition
+    → "portable", "stackable", "ergonomic", ...
+    ↓
+Step 2: Context-Free Keyword Generation (Expert sees ONLY attribute)
+    → Accountant + "portable" → "mobile assets"
+    → Architect + "portable" → "modular units"
+    ↓
+Step 3: Idea Synthesis (Reunite with query)
+    → "Chair" + "mobile assets" + Accountant perspective
+    → "Asset-tracking chairs for corporate inventory management"
+```
+
+---
+
+# Key Design Decision
+
+## Context-Free Keyword Generation
+
+The expert **never sees the original query** when generating keywords
+
+```python
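+# Example values taken from the pipeline walkthrough
+expert, attribute, query = "Accountant", "portable", "Chair"
+
+# Step 2: the expert sees only the attribute, never the query
+keyword_prompt = f"As a {expert}, what keyword comes to mind for '{attribute}'?"
+# Input: "portable" (NOT "portable chair")
+
+# Step 3: reunite the returned keyword with the query
+keyword = "mobile assets"  # example keyword returned by Step 2
+idea_prompt = f"Apply '{keyword}' to '{query}' from {expert}'s perspective"
+# Input: "mobile assets" + "Chair" + "Accountant"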
+```
+
+**Purpose:** Force bisociation by preventing obvious associations
+
+---
+
+# Pilot Study Parameters
+
+## Model & Generation Settings
+
+| Parameter | Value |
+|-----------|-------|
+| LLM Model | Qwen3:8b (Ollama) |
+| Temperature | 0.9 |
+| Ollama Endpoint | localhost:11435 |
+| Language | English |
+| Random Seed | 42 |
+
+---
+
+# Pilot Study Parameters (cont.)
+
+## Pipeline Configuration
+
+| Parameter | Value |
+|-----------|-------|
+| Queries | 10 (Chair, Bicycle, Smartphone, Solar panel, 3D printer, Drone, Food delivery, Online education, Public transport, Elderly care) |
+| Attribute Categories | 4 (Functions, Usages, User Groups, Characteristics) |
+| Attributes per Category | 5 |
+| Expert Source | Curated (210 occupations) |
+| Experts per Query | 4 |
+| Keywords per Expert | 1 |
+
+---
+
+# Pilot Study Parameters (cont.)
+
+## Output & Evaluation
+
+| Parameter | Value |
+|-----------|-------|
+| Total Ideas Generated | 1,119 (after deduplication) |
+| Ideas by Condition | C1: 195, C2: 198, C3: 125, C4: 402, C5: 199 |
+| Deduplication Threshold | 0.90 (cosine similarity) |
+| Embedding Model | qwen3-embedding:4b (1024D) |
+
+---
+
+# Background: Embedding Models Evolution
+
+## From Static to Contextual Representations
+
+| Generation | Model | Characteristics | Limitation |
+|------------|-------|-----------------|------------|
+| **1st Gen** | Word2Vec, GloVe | Static vectors, one vector per word | "bank" = same vector (river vs finance) |
+| **2nd Gen** | BERT, Sentence-BERT | Contextual, transformer-based | Limited context window, older training |
+| **3rd Gen** | Qwen3-embedding | LLM-based, instruction-tuned | Requires more compute |
+
+---
+
+# Background: Transformer vs LLM-based Embedding
+
+## Architecture Differences
+
+| Aspect | Transformer (BERT) | LLM-based (Qwen3) |
+|--------|-------------------|-------------------|
+| **Architecture** | Encoder-only | Decoder-only (GPT-style) |
+| **Training objective** | MLM (masked language modeling) | Next-token prediction |
+| **Training data** | ~16GB (Wikipedia + Books) | Several TB (web pages, code, books) |
+| **Parameters** | 110M - 340M | 4B+ |
+| **Context window** | 512 tokens | 8K - 128K tokens |
+
+---
+
+# Background
+
+## Key Comparison
+
+```
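+1. Broader training knowledge
+   BERT: only knows information from before 2019
+   Qwen3: knows modern concepts such as "drone delivery", "AI-powered", "IoT"
+
+2. Broader semantic understanding
+   BERT: "chair for elderly" ≈ "elderly chair" (bag-of-words similarity)
+   Qwen3: distinguishes "mobility assistance" from "comfort seating"
+
+3. Instruction tuning
+   Traditional: cannot adapt to the intent of the task
+   Qwen3: can follow an instruction such as "find the semantic differences between creative ideas"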
+```
+
+---
+
+# Background: Why Qwen3-Embedding?
+
+## Comparison with Traditional Methods
+
+```
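+Traditional Sentence-BERT (all-MiniLM-L6-v2):
+  - 384-dimensional vectors
+  - Trained on data from before 2021
+  - Works well on short sentences; limited understanding of long text
+  - Encoder-only, MLM training
+
+Qwen3-Embedding (qwen3-embedding:4b):
+  - 1024-dimensional vectors (richer semantic representation)
+  - Based on the Qwen3 LLM (2024+ training data)
+  - Supports long context (8K tokens)
+  - Instruction-tuned → adapts to task intent
+  - Inherits part of the LLM's capabilities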
+```
+
+**Rationale:** Creative ideas tend to be long and semantically complex, so they need stronger contextual understanding
+
+---
+
+# Background: How Embedding Works
+
+## Semantic Similarity via Vector Space
+
+```
+Step 1: Convert text to a vector
+  "Solar-powered charging chair" → [0.12, -0.34, 0.56, ..., 0.78] (1024D)
+
+Step 2: Compute cosine similarity
+  similarity = cos(θ) = (A · B) / (|A| × |B|)
+
+Step 3: Interpret the similarity
+  1.0 = identical
+  0.9 = very similar (likely duplicate ideas)
+  0.5 = moderately related
+  0.0 = unrelated
+```
+
+**Applications:** deduplication (similarity > 0.9), flexibility analysis (clustering), novelty (centroid distance)
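+
+As a small worked illustration (the 2-D vectors are made up for the example and are not real embeddings), take A = (3, 4) and B = (4, 3):
+
+$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{3 \cdot 4 + 4 \cdot 3}{5 \times 5} = \frac{24}{25} = 0.96$$
+
+A pair like this would fall above the 0.90 threshold and be treated as a duplicate; the corresponding cosine distance is 1 − 0.96 = 0.04.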
+
+---
+
+# Results: Semantic Diversity 
+
+## Mean Pairwise Distance (Higher = More Diverse)
+
+> **Method:** We convert each idea into a vector embedding (qwen3-embedding:4b), then calculate the average cosine distance between all pairs of ideas within each condition. Higher values indicate ideas are more spread out in semantic space.
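+
+Written out, and assuming cosine distance is taken as one minus cosine similarity (the symbols $e_i$ for the embedding of idea $i$ and $n$ for the number of ideas being compared are notation introduced here):
+
+$$\bar{D} = \frac{2}{n(n-1)} \sum_{i<j} \left(1 - \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}\right)$$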
+
+| Condition | Mean | SD | vs C1 (Cohen's d) |
+|-----------|:----:|:--:|:-----------------:|
+| C1 Direct | 0.294 | 0.039 | - |
+| C2 Expert-Only | 0.400 | 0.028 | **3.15*** |
+| C3 Attribute-Only | 0.377 | 0.036 | **2.20*** |
+| C4 Full Pipeline | 0.395 | 0.019 | **3.21*** |
+| C5 Random | 0.405 | 0.062 | **2.72*** |
+
+\* p < 0.001. All effect sizes are large (d > 0.8).
+
+> **Cohen's d:** Measures effect size (how big the difference is). d > 0.8 = large effect, d > 0.5 = medium, d > 0.2 = small.
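+
+The exact formula used is not stated here; a common definition, assumed for this sketch, divides the mean difference by the pooled standard deviation for equal group sizes ($M$ and $s$ denote a condition's mean and SD):
+
+$$d = \frac{M_k - M_1}{\sqrt{(s_k^2 + s_1^2)/2}}, \qquad d_{C3} \approx \frac{0.377 - 0.294}{\sqrt{(0.036^2 + 0.039^2)/2}} \approx 2.21$$
+
+Under that assumption, the C3 value computed from the rounded table entries is close to the reported 2.20.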
+
+---
+
+# Results: ANOVA Summary
+
+## Normalized Diversity Metric
+
+> **Method:** Two-way ANOVA tests whether Attributes and Experts each have independent effects on diversity, and whether combining them produces extra benefit (interaction). F-statistic measures variance between groups vs within groups.
+
+| Effect | F | p | Significant |
+|--------|:-:|:-:|:-----------:|
+| **Attributes (RQ1)** | 5.31 | 0.027 | Yes |
+| **Experts (RQ2)** | 26.07 | <0.001 | Yes |
+| **Interaction (RQ3)** | - | - | Sub-additive |
+
+**Key Finding:** Both factors work, but combination is **not synergistic**
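+
+One way to read "sub-additive", sketched with the raw mean pairwise distances from the diversity table (an assumption, since the ANOVA itself was run on the normalized metric): the combined gain over C1 is smaller than the sum of the two single-factor gains.
+
+$$\underbrace{0.395 - 0.294}_{\text{C4 gain} \,=\, 0.101} \;<\; \underbrace{(0.400 - 0.294)}_{\text{C2 gain} \,=\, 0.106} + \underbrace{(0.377 - 0.294)}_{\text{C3 gain} \,=\, 0.083} = 0.189$$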
+
+---
+
+# Results: Expert vs Random (RQ4)
+
+## C2 (Expert-Only) vs C5 (Random-Perspective)
+
+| Metric | C2 Expert | C5 Random | p-value | Effect |
+|--------|:---------:|:---------:|:-------:|:------:|
+| Diversity | 0.399 | 0.414 | 0.463 | n.s. |
+| Query Distance | 0.448 | 0.437 | 0.654 | n.s. |
+
+**Finding:** Random words are statistically indistinguishable from domain experts on both metrics (p = 0.463 and p = 0.654)
+
+Implication: The value may be in **perspective shift itself**, not expert knowledge
+
+---
+
+# Results: Efficiency Analysis
+
+## Diversity per Idea Generated
+
+| Condition | Mean Ideas | Diversity | Efficiency |
+|-----------|:----------:|:---------:|:----------:|
+| C1 Direct | 20.0 | 0.293 | 1.46 |
+| C2 Expert-Only | 20.0 | 0.399 | **1.99** |
+| C3 Attribute-Only | 12.8 | 0.376 | **3.01** |
+| C4 Full Pipeline | 51.9 | 0.393 | 0.78 |
+| C5 Random | 20.0 | 0.405 | 2.02 |
+
+**C4 produces about 2.6× as many ideas as C1, C2, and C5 (51.9 vs 20.0 per query) at essentially the same diversity as C2 and C5**
+
+---
+
+# Visualization: Diversity by Condition
+
+![height:450px](../results/figures/20260119_165650_diversity_boxplot.png)
+
+---
+
+# Visualization: Query Distance
+
+![height:450px](../results/figures/20260119_165650_query_distance_boxplot.png)
+
+---
+
+# Advanced Analysis: Lexical Diversity
+
+## Type-Token Ratio & Vocabulary Richness
+
+> **Method:** Type-Token Ratio (TTR) = unique words ÷ total words. High TTR means more varied vocabulary; low TTR means more word repetition. Vocabulary size counts total unique words across all ideas in a condition.
+
+| Condition | TTR | Vocabulary | Avg Words/Idea |
+|-----------|:---:|:----------:|:--------------:|
+| C1 Direct | **0.382** | 853 | 11.5 |
+| C2 Expert-Only | 0.330 | 1,358 | 20.8 |
+| C3 Attribute-Only | 0.330 | 1,098 | 26.6 |
+| C4 Full Pipeline | 0.189 | **1,992** | 26.2 |
+| C5 Random | 0.320 | 1,331 | 20.9 |
+
+**Finding:** C4 has largest vocabulary (1,992) but lowest TTR (0.189)
+→ More words but more repetition across ideas
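+
+As a rough consistency check, assume TTR is computed over the pooled text of a condition, with $V$ the vocabulary size and the token count $N$ approximated by ideas × average words per idea (402 × 26.2 ≈ 10,532 for C4):
+
+$$\text{TTR}_{C4} = \frac{V}{N} \approx \frac{1992}{10532} \approx 0.189$$
+
+TTR generally falls as a text sample grows, so part of C4's low value may reflect its larger idea pool rather than more repetitive wording.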
+
+---
+
+# Advanced Analysis: Concept Extraction
+
+## Top Keywords by Condition
+
+> **Method:** Extract meaningful keywords from idea texts using NLP (removing stopwords, lemmatization). Top keywords show most frequent concepts; unique keywords count distinct terms. Domain coverage checks if ideas span different knowledge areas.
+
+| Condition | Top Keywords | Unique Keywords |
+|-----------|--------------|:---------------:|
+| C1 Direct | solar, powered, smart, delivery, drone | 805 |
+| C2 Expert | real, create, design, time, develop | 1,306 |
+| C3 Attribute | real, time, create, develop, powered | 1,046 |
+| C4 Pipeline | time, real, data, ensuring, enhancing | **1,937** |
+| C5 Random | like, solar, inspired, energy, uses | 1,286 |
+
+**Finding:** C5 Random shows "inspired" → suggests analogical thinking
+All conditions cover 6 domain categories
+
+---
+
+# Advanced Analysis: Novelty Scores
+
+## Distance from Global Centroid (Higher = More Novel)
+
+> **Method:** Compute the centroid (average vector) of ALL ideas across all conditions. Then measure each idea's distance from this "typical idea" center. Ideas far from the centroid are semantically unusual compared to the overall pool.
+
+| Condition | Mean | Std | Interpretation |
+|-----------|:----:|:---:|----------------|
+| C1 Direct | 0.273 | 0.037 | Closest to "typical" ideas |
+| C2 Expert-Only | 0.315 | 0.062 | Moderate novelty |
+| C3 Attribute-Only | 0.337 | 0.066 | Moderate novelty |
+| C5 Random | 0.365 | 0.069 | High novelty |
+| **C4 Full Pipeline** | **0.395** | 0.083 | **Highest novelty** |
+
+**Finding:** C4 produces ideas furthest from the "average" idea space
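+
+In symbols, assuming cosine distance as the distance measure (the metric is not specified here), with $e_k$ the embedding of idea $k$ and $N$ the total number of ideas across all conditions:
+
+$$c = \frac{1}{N} \sum_{k=1}^{N} e_k, \qquad \text{novelty}_i = 1 - \frac{e_i \cdot c}{\lVert e_i \rVert \, \lVert c \rVert}$$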
+
+---
+
+# Advanced Analysis: Cross-Condition Cohesion
+
+## % Nearest Neighbors from Same Condition
+
+> **Method:** For each idea, find its K nearest neighbors in embedding space. Cohesion = percentage of neighbors from the same condition. High cohesion means ideas from that condition cluster together; low cohesion means they're scattered among other conditions.
+
+| Condition | Cohesion | Interpretation |
+|-----------|:--------:|----------------|
+| **C4 Full Pipeline** | **88.6%** | Highly distinct idea cluster |
+| C2 Expert-Only | 72.7% | Moderate clustering |
+| C5 Random | 71.4% | Moderate clustering |
+| C1 Direct | 70.8% | Moderate clustering |
+| C3 Attribute-Only | 51.2% | Ideas scattered, overlap with others |
+
+**Finding:** C4 ideas form a distinct cluster in semantic space
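+
+Stated as a formula, with $I_c$ the set of ideas in condition $c$ and $N_K(i)$ the $K$ nearest neighbors of idea $i$ (notation introduced here):
+
+$$\text{cohesion}(c) = \frac{1}{\lvert I_c \rvert} \sum_{i \in I_c} \frac{1}{K} \sum_{j \in N_K(i)} \mathbf{1}[\text{cond}(j) = c]$$
+
+For reference, if ideas were mixed at random the expected value would be roughly each condition's share of the 1,119-idea pool: about 402/1119 ≈ 36% for C4 and 125/1119 ≈ 11% for C3.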
+
+---
+
+# Advanced Analysis: AUT Flexibility
+
+## Semantic Category Diversity (Hadas & Hershkovitz 2024)
+
+> **Method:** Uses the Alternative Uses Task (AUT) flexibility framework. Embedding-based: Hierarchical clustering with average linkage, cut at distance threshold 0.5. Higher cluster count = more semantic categories covered = higher flexibility.
+
+| Condition | Embedding Clusters | Mean Pairwise Similarity |
+|-----------|:------------------:|:------------------------:|
+| **C5 Random** | **15** | 0.521 (most diverse) |
+| **C2 Expert-Only** | **13** | 0.517 |
+| C3 Attribute-Only | 12 | - |
+| C4 Full Pipeline | 10 | 0.583 |
+| C1 Direct | **1** | 0.647 (most similar) |
+
+**Finding:** Perspective-based conditions (C2 Expert-Only, C5 Random) produce more semantic categories than direct generation (C1)
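+
+Average linkage defines the distance between two clusters $A$ and $B$ as the mean distance over all cross-cluster pairs, and merging stops once this exceeds the 0.5 cut (taking $d$ to be the pairwise embedding distance is an assumption):
+
+$$d(A, B) = \frac{1}{\lvert A \rvert \, \lvert B \rvert} \sum_{a \in A} \sum_{b \in B} d(a, b)$$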
+
+---
+
+# Advanced Analysis: Combined Jump Signal
+
+## Enhanced Method from arXiv:2405.00899
+
+> **Method:** Combined jump signal uses logical AND of two conditions:
+> - **jumpcat:** Category changes between consecutive ideas (from embedding clustering)
+> - **jumpSS:** Semantic similarity < 0.7 (ideas are semantically dissimilar)
+>
+> **True jump = jumpcat ∧ jumpSS** — reduces false positives where similar ideas happen to be in different clusters.
+
+| Condition | Cat-Only | Sem-Only | **Combined** | Profile |
+|-----------|:--------:|:--------:|:------------:|---------|
+| C2 Expert-Only | 54 | 125 | **48** | Persistent |
+| C3 Attribute-Only | 34 | 107 | **33** | Persistent |
+| C5 Random | 22 | 116 | **20** | Persistent |
+| C4 Full Pipeline | 13 | 348 | **13** | Persistent |
+| C1 Direct | 0 | 104 | **0** | Persistent |
+
+**Finding:** Combined jumps ≤ category jumps (as expected). All conditions show "Persistent" exploration pattern.
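+
+In indicator form, with $k_t$ the cluster label and $e_t$ the embedding of the $t$-th idea in a sequence (notation introduced here):
+
+$$\text{jump}_t = \mathbf{1}[k_t \neq k_{t-1}] \cdot \mathbf{1}[\text{sim}(e_t, e_{t-1}) < 0.7]$$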
+
+---
+
+# Advanced Analysis: Flexibility Profiles
+
+## Classification Based on Combined Jump Ratio
+
+> **Method:** Classify creativity style based on normalized jump ratio (jumps / transitions):
+> - **Persistent:** ratio < 0.30 (deep exploration within categories)
+> - **Flexible:** ratio > 0.45 (broad exploration across categories)
+> - **Mixed:** 0.30 ≤ ratio ≤ 0.45
+
+| Condition | Combined Jump Ratio | Profile | Interpretation |
+|-----------|:-------------------:|:-------:|----------------|
+| C3 Attribute-Only | **26.6%** | Persistent | Moderate category switching |
+| C2 Expert-Only | **24.4%** | Persistent | Moderate category switching |
+| C5 Random | 10.1% | Persistent | Low category switching |
+| **C4 Full Pipeline** | **3.2%** | Persistent | Very deep within-category exploration |
+| C1 Direct | 0.0% | Persistent | Single semantic cluster |
+
+**Key Insight:** C4's low jump ratio indicates focused, persistent exploration within novel semantic territory
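+
+The reported ratios for C2, C3, C5, and C4 are reproduced if the number of transitions is taken as the condition's idea count minus one (an inference from the numbers, not stated explicitly):
+
+$$\frac{48}{198 - 1} \approx 24.4\%, \quad \frac{33}{125 - 1} \approx 26.6\%, \quad \frac{20}{199 - 1} \approx 10.1\%, \quad \frac{13}{402 - 1} \approx 3.2\%$$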
+
+---
+
+# Key Finding: Originality-Flexibility Correlation
+
+## Does Our Pipeline Break the Typical LLM Pattern?
+
+> **Paper Finding (arXiv:2405.00899):**
+> - **Humans:** No correlation between flexibility and originality (r ≈ 0)
+> - **LLMs:** Positive correlation — flexible LLMs score higher on originality
+
+**Our Results:**
+
+| Metric | Value | Interpretation |
+|--------|:-----:|----------------|
+| **Pearson r** | **0.071** | Near zero correlation |
+| Interpretation | **Human-like pattern** | Breaks typical LLM pattern |
+
+**Per-Condition Breakdown:**
+
+| Condition | Novelty | Flexibility (combined jumps) |
+|-----------|:-------:|:----------------------------:|
+| C4 Full Pipeline | **0.395** (highest) | **13** (lowest) |
+| C5 Random | 0.365 | 20 |
+| C3 Attribute-Only | 0.337 | 33 |
+| C2 Expert-Only | 0.315 | 48 (highest) |
+| C1 Direct | 0.273 (lowest) | 0 |
+
+**Critical Finding:** The attribute+expert pipeline (C4) achieves **highest novelty with lowest flexibility**, demonstrating that structured context-free generation produces **focused novelty** rather than scattered exploration.
+
+---
+
+# Cumulative Jump Profile Visualization
+
+## Exploration Patterns Over Generation Sequence
+
+> **Method:** Track cumulative jump count at each response position. Steep slopes indicate rapid category switching; flat regions indicate persistent exploration within categories.
+
+![height:400px](../results/cumulative_jump_profiles.png)
+
+**Visual Pattern:**
+- C2/C3 show steady accumulation of jumps → regular category switching
+- C4/C5 show flatter profiles → persistent within-category exploration
+- C1 is flat (0 jumps) → all ideas in single cluster
+
+---
+
+# Flexibility vs Novelty: Key Insight
+
+## Novelty and Flexibility are Orthogonal Dimensions
+
+| Condition | Novelty (centroid dist) | Flexibility (combined jumps) | Pattern |
+|-----------|:-----------------------:|:----------------------------:|---------|
+| C4 Pipeline | **0.395** (highest) | **13** (lowest) | High novel, low flex |
+| C5 Random | 0.365 | 20 | High novel, low flex |
+| C2 Expert | 0.315 | **48** (highest) | Moderate novel, high flex |
+| C3 Attribute | 0.337 | 33 | Moderate both |
+| C1 Direct | 0.273 (lowest) | 0 | Typical, single category |
+
+**Interpretation:**
+- **C1 Direct** produces similar ideas within one typical category (low novelty, no jumps)
+- **C4 Full Pipeline** produces the most novel ideas with focused exploration (low jump ratio)
+- **C2 Expert-Only** produces the most category switching but moderate novelty
+- **r = 0.071** suggests the two dimensions are largely independent in this pilot (human-like pattern)
+
+---
+
+# Embedding Visualization: PCA
+
+> **Method:** Principal Component Analysis reduces high-dimensional embeddings (1024D) to 2D for visualization by finding directions of maximum variance. Points close together = semantically similar ideas. Colors represent conditions.
+
+![height:450px](../results/embedding_pca.png)
+
+---
+
+# Embedding Visualization: t-SNE
+
+> **Method:** t-SNE (t-distributed Stochastic Neighbor Embedding) preserves local neighborhood structure when reducing to 2D. Better at revealing clusters than PCA, but distances between clusters are less meaningful. Good for seeing if conditions form distinct groups.
+
+![height:450px](../results/embedding_tsne.png)
+
+---
+
+# Integrated Findings
+
+## What the Advanced Analysis Reveals
+
+| Analysis | C4 Full Pipeline Characteristic |
+|----------|--------------------------------|
+| Lexical | Largest vocabulary (1,992 words) |
+| Novelty | Highest distance from centroid (0.395) |
+| Cohesion | Tightest cluster (88.6% same-condition NN) |
+| Diversity | High pairwise distance (0.395) |
+| **Flexibility** | **Lowest combined jumps (13) = focused exploration** |
+
+**Interpretation:** C4 creates a **distinct semantic territory** -
+novel ideas that are internally coherent but far from other conditions.
+Low flexibility (3.2% jump ratio) indicates deep, focused exploration within a novel space.
+
+## Understanding Novelty vs Flexibility
+
+| Condition | Novelty | Flexibility (jumps) | Strategy |
+|-----------|:-------:|:-------------------:|----------|
+| C1 Direct | Low | Lowest (0) | Typical, single category |
+| C2 Expert | Medium | **Highest (48)** | Experts = diverse exploration |
+| C3 Attribute | Medium | Medium (33) | Structured exploration |
+| C5 Random | High | Low (20) | Random but focused |
+| **C4 Pipeline** | **Highest** | **Low (13)** | **Focused novelty** |
+
+---
+
+# Critical Limitation
+
+## Embedding Distance ≠ True Novelty
+
+Current metrics measure **semantic spread**, not **creative value**
+
+| What We Measure | What We Miss |
+|-----------------|--------------|
+| Vector distance | Practical usefulness |
+| Cluster spread | Conceptual surprise |
+| Query distance | Non-obviousness |
+| | Feasibility |
+
+```
+"Quantum entanglement chair" → High distance, Low novelty
+"Chair legs as drumsticks" → Low distance, High novelty
+```
+
+---
+
+# Torrance Creativity Framework
+
+## What True Novelty Assessment Requires
+
+| Dimension | Definition | Our Coverage |
+|-----------|------------|:------------:|
+| **Fluency** | Number of ideas | ✓ Measured |
+| **Flexibility** | Category diversity | ✓ Measured (LLM + embedding) |
+| **Originality** | Statistical rarity | Not measured |
+| **Elaboration** | Detail & development | Not measured |
+
+**Originality requires human judgment or LLM-as-Judge**
+
+---
+
+# Discussion: The Attribute Anchoring Effect
+
+## Why C4 Has Highest Novelty but Lowest Flexibility
+
+```
+C2 (Expert-Only): HIGHEST FLEXIBILITY (48 combined jumps)
+  Architect → "load-bearing furniture"
+  Chef → "dining experience design"
+  ← Each expert explores freely, frequent category switching
+
+C4 (Full Pipeline): LOWEST FLEXIBILITY (13 combined jumps, 3.2% ratio)
+  All experts respond to same attribute set
+  Architect + "portable" → "modular portable"
+  Chef + "portable" → "portable serving"
+  ← Attribute anchoring constrains category switching
+  ← BUT forced bisociation produces HIGHEST NOVELTY
+```
+
+**Key Mechanism:** Attributes anchor experts to similar conceptual space (low flexibility),
+but context-free keyword generation forces novel associations (high novelty).
+
+**Result:** "Focused novelty" — deep exploration in a distant semantic territory
+
+---
+
+# Key Findings Summary
+
+| RQ | Question | Answer |
+|----|----------|--------|
+| RQ1 | Attributes increase diversity? | **Yes** (p=0.027) |
+| RQ2 | Experts increase diversity? | **Yes** (p<0.001) |
+| RQ3 | Synergistic interaction? | **No** (sub-additive) |
+| RQ4 | Experts > Random? | **No** (p=0.463) |
+
+**Additional Findings (arXiv:2405.00899 Metrics):**
+- Full Pipeline (C4) has **highest novelty** but **lowest flexibility**
+- **Originality-Flexibility correlation r=0.071** (human-like, breaks typical LLM pattern)
+- Novelty and Flexibility are **orthogonal dimensions**
+- All conditions show **Persistent** exploration profile (combined jump ratio < 30%)
+- Direct generation (C1) produces ideas in a **single semantic cluster**
+
+---
+
+# Limitations
+
+1. **Sample Size:** 10 queries (pilot study)
+
+2. **Novelty Measurement:** Embedding-based metrics only measure semantic distance, not true creative value
+
+3. **Single Model:** Results may vary with different LLMs
+
+4. **No Human Evaluation:** No validation of idea quality or usefulness
+
+5. **Fixed Categories:** 4 attribute categories may limit exploration
+
+---
+
+# Future Work
+
+## Immediate Next Steps
+
+1. **Human Assessment Interface** (Built)
+   - Web-based rating tool with Torrance dimensions
+   - Stratified sampling: 200 ideas (4 per condition × 5 conditions × 10 queries)
+   - 4 dimensions: Originality, Elaboration, Coherence, Usefulness
+
+2. **Multi-Model Validation** (Priority)
+   - Replicate on GPT-4, Claude, Llama-3
+   - Verify findings generalize across LLMs
+
+3. **LLM-as-Judge evaluation** for full-scale scoring
+
+4. **Scale to 30 queries** for statistical power
+
+5. **Alternative pipeline designs** to address attribute anchoring
+
+**Documentation:**
+- `experiments/docs/future_research_plan_zh.md` - Detailed research plan
+- `experiments/docs/creative_process_metrics_zh.md` - arXiv:2405.00899 metrics explanation
+
+---
+
+# Conclusion
+
+## Key Takeaways
+
+1. **Both attribute decomposition and expert perspectives significantly increase semantic diversity** compared to direct generation
+
+2. **The combination is sub-additive**, suggesting attribute structure may constrain expert creativity
+
+3. **Random perspectives work as well as domain experts**, implying the value is in perspective shift, not expert knowledge
+
+4. **Novelty and Flexibility are orthogonal creativity dimensions** - high novelty ≠ high flexibility
+   - C4 Full Pipeline: Highest novelty, lowest flexibility
+   - C2 Expert-Only: Highest flexibility, moderate novelty
+
+5. **🔑 Key Finding:** The pipeline produces **human-like originality-flexibility patterns** (r=0.071)
+   - Typical LLMs show positive correlation (flexible → more original)
+   - Our method breaks this pattern: high novelty with focused exploration
+
+6. **True novelty assessment requires judgment-based evaluation** beyond embedding metrics
+
+---
+
+# Appendix: Statistical Details
+
+## T-test Results (vs C1 Baseline)
+
+| Comparison | t | p | Cohen's d |
+|------------|:-:|:-:|:---------:|
+| C4 vs C1 | 8.55 | <0.001 | 4.05 |
+| C2 vs C1 | 7.67 | <0.001 | 3.43 |
+| C3 vs C1 | 4.23 | <0.001 | 1.89 |
+
+All three listed conditions significantly outperform the C1 baseline
+
+---
+
+# Appendix: Experiment Configuration
+
+```python
+EXPERIMENT_CONFIG = {
+    "model": "qwen3:8b",
+    "temperature": 0.9,
+    "expert_count": 4,
+    "expert_source": "curated",  # 210 occupations
+    "keywords_per_expert": 1,
+    "categories": ["Functions", "Usages",
+                   "User Groups", "Characteristics"],
+    "dedup_threshold": 0.90,
+    "random_seed": 42
+}
+```
+
+---
+
+# Thank You
+
+## Questions?
+
+**Repository:** novelty-seeking
+**Experiment Date:** January 19, 2026
+**Contact:** [Your Email]
+
+---
+
+# Backup Slides
+
+---
+
+# Backup: Deduplication Threshold Analysis
+
+Original threshold (0.85) was too aggressive:
+- 40.5% of removed pairs were borderline (0.85-0.87)
+- Many genuinely different concepts were grouped
+
+Raised to 0.90:
+- RQ1 (Attributes) became significant (p: 0.052 → 0.027)
+- Preserved ~103 additional unique ideas
+
+---
+
+# Backup: Sample Ideas by Condition
+
+## Query: "Chair"
+
+**C1 Direct:**
+- Ergonomic office chair with lumbar support
+- Foldable camping chair
+
+**C2 Expert-Only (Architect):**
+- Load-bearing furniture integrated into building structure
+
+**C4 Full Pipeline:**
+- Asset-tracking chairs with RFID for corporate inventory
+- (Accountant + "portable" → "mobile assets")
+
+---
+
+# Backup: Efficiency Calculation
+
+$$\text{Efficiency} = \frac{\text{Mean Pairwise Distance}}{\text{Idea Count}} \times 100$$
+
+| Condition | Calculation | Result |
+|-----------|-------------|:------:|
+| C3 Attribute | 0.376 / 12.8 × 100 | 3.01 |
+| C4 Pipeline | 0.393 / 51.9 × 100 | 0.78 |
+
+C3 achieves 96% of C4's diversity with 25% of the ideas
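+
+The two percentages follow from the rounded table values:
+
+$$\frac{0.376}{0.393} \approx 0.957, \qquad \frac{12.8}{51.9} \approx 0.247$$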