
---
marp: true
theme: default
paginate: true
backgroundColor:
style: |
  section { font-size: 24px; }
  h1 { color: #2c3e50; }
  h2 { color: #34495e; }
  table { font-size: 18px; }
  .columns { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; }
---

Breaking Semantic Gravity in LLM-Based Creative Ideation

A Pilot Study on Attribute Decomposition and Expert Perspectives

Date: January 19, 2026
Model: Qwen3:8b (Temperature: 0.9)
Queries: 10 pilot queries


Research Problem

The "Semantic Gravity" Challenge

LLMs tend to generate ideas clustered around high-probability training distributions

Query: "Chair"
Typical LLM output:
  - Ergonomic office chair
  - Comfortable reading chair
  - Foldable portable chair
  ← All within "furniture comfort" semantic cluster

Goal: Break this gravitational pull toward obvious solutions


Theoretical Framework

Bisociation Theory (Koestler, 1964)

Creative thinking occurs when two unrelated "matrices of thought" collide

Our Approach:

  1. Attribute Decomposition → Break object into structural components
  2. Expert Perspectives → Introduce distant domain knowledge
  3. Context-Free Keywords → Force unexpected conceptual leaps

Experimental Design

2×2 Factorial + Control

| Condition | Attributes | Experts | Description |
|---|---|---|---|
| C1 Direct | - | - | Baseline: direct LLM generation |
| C2 Expert-Only | - | ✓ | Expert perspectives without structure |
| C3 Attribute-Only | ✓ | - | Structure without expert knowledge |
| C4 Full Pipeline | ✓ | ✓ | Combined approach |
| C5 Random-Perspective | - | Random | Control: random words as "experts" |

Research Questions

  1. RQ1: Does attribute decomposition increase idea diversity?

  2. RQ2: Do expert perspectives increase idea diversity?

  3. RQ3: Is there a synergistic (super-additive) interaction effect?

  4. RQ4: Do domain-relevant experts outperform random perspectives?


Pipeline Architecture

C4: Full Pipeline Process

Query: "Chair"
    ↓
Step 1: Attribute Decomposition
    → "portable", "stackable", "ergonomic", ...
    ↓
Step 2: Context-Free Keyword Generation (Expert sees ONLY attribute)
    → Accountant + "portable" → "mobile assets"
    → Architect + "portable" → "modular units"
    ↓
Step 3: Idea Synthesis (Reunite with query)
    → "Chair" + "mobile assets" + Accountant perspective
    → "Asset-tracking chairs for corporate inventory management"

Key Design Decision

Context-Free Keyword Generation

The expert never sees the original query when generating keywords

# Step 2: Expert sees only attribute
prompt = f"As a {expert}, what keyword comes to mind for '{attribute}'?"
# Input: "portable" (NOT "portable chair")

# Step 3: Reunite with query
prompt = f"Apply '{keyword}' to '{query}' from {expert}'s perspective"
# Input: "mobile assets" + "Chair" + "Accountant"

Purpose: Force bisociation by preventing obvious associations
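
A minimal sketch of how the two-step prompting could be wired up against the Ollama endpoint used in the pilot; the `generate` and `bisociate` helpers are illustrative assumptions, not the project's actual code:

```python
# Sketch of the context-free keyword step (assumed helpers, not project code).
import requests

OLLAMA_URL = "http://localhost:11435/api/generate"  # endpoint from the pilot config

def generate(prompt: str, model: str = "qwen3:8b", temperature: float = 0.9) -> str:
    """Call the Ollama /api/generate endpoint and return the completion text."""
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    })
    resp.raise_for_status()
    return resp.json()["response"]

def bisociate(query: str, attribute: str, expert: str) -> str:
    # Step 2: the expert sees only the attribute, never the original query.
    keyword = generate(f"As a {expert}, what keyword comes to mind for '{attribute}'?")
    # Step 3: reunite keyword, query, and expert perspective into an idea.
    return generate(f"Apply '{keyword.strip()}' to '{query}' from a {expert}'s perspective.")

# Example: bisociate("Chair", "portable", "Accountant")
```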


Pilot Study Parameters

Model & Generation Settings

| Parameter | Value |
|---|---|
| LLM Model | Qwen3:8b (Ollama) |
| Temperature | 0.9 |
| Ollama Endpoint | localhost:11435 |
| Language | English |
| Random Seed | 42 |

Pilot Study Parameters (cont.)

Pipeline Configuration

| Parameter | Value |
|---|---|
| Queries | 10 (Chair, Bicycle, Smartphone, Solar panel, 3D printer, Drone, Food delivery, Online education, Public transport, Elderly care) |
| Attribute Categories | 4 (Functions, Usages, User Groups, Characteristics) |
| Attributes per Category | 5 |
| Expert Source | Curated (210 occupations) |
| Experts per Query | 4 |
| Keywords per Expert | 1 |

Pilot Study Parameters (cont.)

Output & Evaluation

| Parameter | Value |
|---|---|
| Total Ideas Generated | 1,119 (after deduplication) |
| Ideas by Condition | C1: 195, C2: 198, C3: 125, C4: 402, C5: 199 |
| Deduplication Threshold | 0.90 (cosine similarity) |
| Embedding Model | qwen3-embedding:4b (1024D) |

Background: Embedding Models Evolution

From Static to Contextual Representations

| Generation | Model | Characteristics | Limitation |
|---|---|---|---|
| 1st Gen | Word2Vec, GloVe | Static vectors, one vector per word | "bank" = same vector (river vs. finance) |
| 2nd Gen | BERT, Sentence-BERT | Contextual, transformer-based | Limited context window, older training data |
| 3rd Gen | Qwen3-embedding | LLM-based, instruction-tuned | Requires more compute |

Background: Transformer vs LLM-based Embedding

Architecture Differences

| Aspect | Transformer (BERT) | LLM-based (Qwen3) |
|---|---|---|
| Architecture | Encoder-only | Decoder-only (GPT-style) |
| Training objective | MLM (masked language modeling) | Next-token prediction |
| Training data | ~16 GB (Wikipedia + books) | Several TB (web pages, code, books) |
| Parameters | 110M - 340M | 4B+ |
| Context length | 512 tokens | 8K - 128K tokens |

Background

Key Comparison

1. More training knowledge
   BERT: only covers knowledge up to 2019
   Qwen3: knows modern concepts such as "drone delivery", "AI-powered", and "IoT"

2. Broader semantic understanding
   BERT: "chair for elderly" ≈ "elderly chair" (bag-of-words similarity)
   Qwen3: understands the difference between "mobility assistance" and "comfort seating"

3. Instruction tuning
   Traditional models: cannot adapt to task intent
   Qwen3: can follow instructions like "find the semantic differences between creative ideas"

Background: Why Qwen3-Embedding?

Comparison with Traditional Methods

Traditional Sentence-BERT (all-MiniLM-L6-v2):
  - 384-dimensional vectors
  - Trained on pre-2021 data
  - Works well for short sentences; limited understanding of long text
  - Encoder-only, MLM training

Qwen3-Embedding (qwen3-embedding:4b):
  - 1024-dimensional vectors (richer semantic representation)
  - Built on the Qwen3 LLM (2024+ training data)
  - Supports long context (8K tokens)
  - Instruction-tuned → adapts to task intent
  - Inherits part of the LLM's capabilities

Why we chose it: creative ideas tend to be long and semantically complex, requiring stronger contextual understanding


Background: How Embedding Works

Semantic Similarity via Vector Space

Step 1: Convert text into a vector
  "Solar-powered charging chair" → [0.12, -0.34, 0.56, ..., 0.78] (1024D)

Step 2: Compute cosine similarity
  similarity = cos(θ) = (A · B) / (|A| × |B|)

Step 3: Interpret the similarity
  1.0 = identical
  0.9 = very similar (likely a duplicate idea)
  0.5 = moderately related
  0.0 = unrelated

Applications: deduplication (similarity > 0.9), flexibility analysis (clustering), novelty (centroid distance)
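
For concreteness, a small sketch of the similarity computation described above (NumPy only; the random vectors stand in for real qwen3-embedding:4b outputs):

```python
# Sketch: cosine similarity between two idea embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (|A| * |B|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors; the study uses 1024-D qwen3-embedding:4b vectors.
a, b = np.random.rand(1024), np.random.rand(1024)
print(cosine_similarity(a, b))      # ~1.0 duplicate, ~0.0 unrelated
print(1 - cosine_similarity(a, b))  # cosine distance, used by the diversity metrics
```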


Results: Semantic Diversity

Mean Pairwise Distance (Higher = More Diverse)

Method: We convert each idea into a vector embedding (qwen3-embedding:4b), then calculate the average cosine distance between all pairs of ideas within each condition. Higher values indicate ideas are more spread out in semantic space.

| Condition | Mean | SD | vs C1 (Cohen's d) |
|---|---|---|---|
| C1 Direct | 0.294 | 0.039 | - |
| C2 Expert-Only | 0.400 | 0.028 | 3.15* |
| C3 Attribute-Only | 0.377 | 0.036 | 2.20* |
| C4 Full Pipeline | 0.395 | 0.019 | 3.21* |
| C5 Random | 0.405 | 0.062 | 2.72* |

*p < 0.001, Large effect sizes (d > 0.8)

Cohen's d: Measures effect size (how big the difference is). d > 0.8 = large effect, d > 0.5 = medium, d > 0.2 = small.
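
A sketch of how the mean pairwise distance could be computed per condition, assuming one (n_ideas × 1024) embedding matrix per condition; this is an illustration, not the project's analysis script:

```python
# Sketch: mean pairwise cosine distance within one condition's idea embeddings.
import numpy as np
from scipy.spatial.distance import pdist

def mean_pairwise_distance(embeddings: np.ndarray) -> float:
    """Average cosine distance over all unordered idea pairs (higher = more diverse)."""
    return float(pdist(embeddings, metric="cosine").mean())

# embeddings: (n_ideas, 1024) array per condition; random data here for illustration.
print(mean_pairwise_distance(np.random.rand(50, 1024)))
```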


Results: ANOVA Summary

Normalized Diversity Metric

Method: Two-way ANOVA tests whether Attributes and Experts each have independent effects on diversity, and whether combining them produces extra benefit (interaction). F-statistic measures variance between groups vs within groups.

| Effect | F | p | Significant |
|---|---|---|---|
| Attributes (RQ1) | 5.31 | 0.027 | Yes |
| Experts (RQ2) | 26.07 | <0.001 | Yes |
| Interaction (RQ3) | - | - | Sub-additive |

Key Finding: Both factors work, but combination is not synergistic
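
A sketch of the two-way ANOVA setup using statsmodels, assuming a per-run table with a `diversity` score and binary `attributes` / `experts` factors (the column names are assumptions):

```python
# Sketch: two-way ANOVA with interaction on per-run diversity scores.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def two_way_anova(df: pd.DataFrame) -> pd.DataFrame:
    # Main effects for attributes and experts, plus their interaction term.
    model = ols("diversity ~ C(attributes) * C(experts)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)

# two_way_anova(df) yields F and p for attributes, experts, and the interaction.
```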


Results: Expert vs Random (RQ4)

C2 (Expert-Only) vs C5 (Random-Perspective)

| Metric | C2 Expert | C5 Random | p-value | Effect |
|---|---|---|---|---|
| Diversity | 0.399 | 0.414 | 0.463 | n.s. |
| Query Distance | 0.448 | 0.437 | 0.654 | n.s. |

Finding: Random words perform as well as domain experts

Implication: The value may be in perspective shift itself, not expert knowledge


Results: Efficiency Analysis

Diversity per Idea Generated

| Condition | Mean Ideas | Diversity | Efficiency |
|---|---|---|---|
| C1 Direct | 20.0 | 0.293 | 1.46 |
| C2 Expert-Only | 20.0 | 0.399 | 1.99 |
| C3 Attribute-Only | 12.8 | 0.376 | 3.01 |
| C4 Full Pipeline | 51.9 | 0.393 | 0.78 |
| C5 Random | 20.0 | 0.405 | 2.02 |

C4 produces 2.6× more ideas but achieves same diversity


Visualization: Diversity by Condition

[Figure: semantic diversity by condition]


Visualization: Query Distance

[Figure: query distance by condition]


Advanced Analysis: Lexical Diversity

Type-Token Ratio & Vocabulary Richness

Method: Type-Token Ratio (TTR) = unique words ÷ total words. High TTR means more varied vocabulary; low TTR means more word repetition. Vocabulary size counts total unique words across all ideas in a condition.

| Condition | TTR | Vocabulary | Avg Words/Idea |
|---|---|---|---|
| C1 Direct | 0.382 | 853 | 11.5 |
| C2 Expert-Only | 0.330 | 1,358 | 20.8 |
| C3 Attribute-Only | 0.330 | 1,098 | 26.6 |
| C4 Full Pipeline | 0.189 | 1,992 | 26.2 |
| C5 Random | 0.320 | 1,331 | 20.9 |

Finding: C4 has largest vocabulary (1,992) but lowest TTR (0.189) → More words but more repetition across ideas
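
A sketch of how TTR and vocabulary size could be computed; the tokenizer here is a simple regex and may differ from the one behind the numbers above:

```python
# Sketch: type-token ratio and vocabulary size for a list of idea strings.
import re

def lexical_stats(ideas: list[str]) -> dict:
    tokens = [w for idea in ideas for w in re.findall(r"[a-z]+", idea.lower())]
    types = set(tokens)
    return {
        "ttr": len(types) / len(tokens),          # unique words / total words
        "vocabulary": len(types),                 # distinct words in the condition
        "avg_words_per_idea": len(tokens) / len(ideas),
    }

print(lexical_stats(["Ergonomic office chair", "Foldable camping chair"]))
```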


Advanced Analysis: Concept Extraction

Top Keywords by Condition

Method: Extract meaningful keywords from idea texts using NLP (removing stopwords, lemmatization). Top keywords show most frequent concepts; unique keywords count distinct terms. Domain coverage checks if ideas span different knowledge areas.

| Condition | Top Keywords | Unique Keywords |
|---|---|---|
| C1 Direct | solar, powered, smart, delivery, drone | 805 |
| C2 Expert | real, create, design, time, develop | 1,306 |
| C3 Attribute | real, time, create, develop, powered | 1,046 |
| C4 Pipeline | time, real, data, ensuring, enhancing | 1,937 |
| C5 Random | like, solar, inspired, energy, uses | 1,286 |

Finding: C5 Random's "inspired" keyword suggests analogical thinking. All conditions cover 6 domain categories.
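
A simplified sketch of frequency-based keyword extraction; it drops English stopwords via scikit-learn but skips the lemmatization step mentioned above, so it is a stand-in rather than the project's extractor:

```python
# Sketch: top keywords and unique-keyword count for one condition's ideas.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

def top_keywords(ideas: list[str], n: int = 5) -> tuple[list[str], int]:
    vec = CountVectorizer(stop_words="english")       # removes common English stopwords
    counts = vec.fit_transform(ideas).sum(axis=0).A1  # total frequency per term
    vocab = vec.get_feature_names_out()
    ranked = [w for w, _ in Counter(dict(zip(vocab, counts))).most_common(n)]
    return ranked, len(vocab)                         # top-n keywords, unique keyword count
```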


Advanced Analysis: Novelty Scores

Distance from Global Centroid (Higher = More Novel)

Method: Compute the centroid (average vector) of ALL ideas across all conditions. Then measure each idea's distance from this "typical idea" center. Ideas far from the centroid are semantically unusual compared to the overall pool.

| Condition | Mean | Std | Interpretation |
|---|---|---|---|
| C1 Direct | 0.273 | 0.037 | Closest to "typical" ideas |
| C2 Expert-Only | 0.315 | 0.062 | Moderate novelty |
| C3 Attribute-Only | 0.337 | 0.066 | Moderate novelty |
| C5 Random | 0.365 | 0.069 | High novelty |
| C4 Full Pipeline | 0.395 | 0.083 | Highest novelty |

Finding: C4 produces ideas furthest from the "average" idea space
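
A sketch of the centroid-distance novelty score, assuming cosine distance to the mean vector of the pooled embeddings:

```python
# Sketch: novelty as cosine distance from the global centroid of all idea embeddings.
import numpy as np
from scipy.spatial.distance import cosine

def novelty_scores(all_embeddings: np.ndarray, condition_embeddings: np.ndarray) -> np.ndarray:
    centroid = all_embeddings.mean(axis=0)  # the "typical idea" across every condition
    return np.array([cosine(e, centroid) for e in condition_embeddings])

# Averaging novelty_scores(...) per condition reproduces the kind of table shown above.
```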


Advanced Analysis: Cross-Condition Cohesion

% Nearest Neighbors from Same Condition

Method: For each idea, find its K nearest neighbors in embedding space. Cohesion = percentage of neighbors from the same condition. High cohesion means ideas from that condition cluster together; low cohesion means they're scattered among other conditions.

| Condition | Cohesion | Interpretation |
|---|---|---|
| C4 Full Pipeline | 88.6% | Highly distinct idea cluster |
| C2 Expert-Only | 72.7% | Moderate clustering |
| C5 Random | 71.4% | Moderate clustering |
| C1 Direct | 70.8% | Moderate clustering |
| C3 Attribute-Only | 51.2% | Ideas scattered, overlap with others |

Finding: C4 ideas form a distinct cluster in semantic space
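
A sketch of the cohesion metric using scikit-learn nearest neighbors; K = 5 is an assumption, since the neighborhood size is not stated here:

```python
# Sketch: share of each idea's K nearest neighbors that come from the same condition.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cohesion(embeddings: np.ndarray, labels: np.ndarray, k: int = 5) -> dict:
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)            # idx[:, 0] is the idea itself
    same = labels[idx[:, 1:]] == labels[:, None]  # neighbor label == own label
    per_idea = same.mean(axis=1)
    return {c: float(per_idea[labels == c].mean()) for c in np.unique(labels)}
```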


Advanced Analysis: AUT Flexibility

Semantic Category Diversity (Hadas & Hershkovitz 2024)

Method: Uses the Alternative Uses Task (AUT) flexibility framework. Embedding-based: Hierarchical clustering with average linkage, cut at distance threshold 0.5. Higher cluster count = more semantic categories covered = higher flexibility.

| Condition | Embedding Clusters | Mean Pairwise Similarity |
|---|---|---|
| C5 Random | 15 | 0.521 (most diverse) |
| C2 Expert-Only | 13 | 0.517 |
| C3 Attribute-Only | 12 | - |
| C4 Full Pipeline | 10 | 0.583 |
| C1 Direct | 1 | 0.647 (most similar) |

Finding: Expert perspectives (C2, C5) produce more diverse categories than direct generation (C1)
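
A sketch of the embedding-based flexibility measure as described above: average-linkage hierarchical clustering on cosine distances, cut at the 0.5 threshold (SciPy):

```python
# Sketch: count semantic categories via hierarchical clustering of idea embeddings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def flexibility_clusters(embeddings: np.ndarray, threshold: float = 0.5):
    dists = pdist(embeddings, metric="cosine")
    labels = fcluster(linkage(dists, method="average"), t=threshold, criterion="distance")
    n_clusters = len(np.unique(labels))   # more clusters = higher flexibility
    mean_similarity = 1 - dists.mean()    # companion statistic reported in the table
    return n_clusters, mean_similarity, labels
```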


Advanced Analysis: Combined Jump Signal

Enhanced Method from arXiv:2405.00899

Method: Combined jump signal uses logical AND of two conditions:

  • jump_cat: the cluster category changes between consecutive ideas (from embedding clustering)
  • jump_SS: semantic similarity < 0.7 (the consecutive ideas are semantically dissimilar)

True jump = jump_cat ∧ jump_SS, which reduces false positives where similar ideas happen to fall in different clusters.

| Condition | Cat-Only | Sem-Only | Combined | Profile |
|---|---|---|---|---|
| C2 Expert-Only | 54 | 125 | 48 | Persistent |
| C3 Attribute-Only | 34 | 107 | 33 | Persistent |
| C5 Random | 22 | 116 | 20 | Persistent |
| C4 Full Pipeline | 13 | 348 | 13 | Persistent |
| C1 Direct | 0 | 104 | 0 | Persistent |

Finding: Combined jumps ≤ category jumps (as expected). All conditions show "Persistent" exploration pattern.
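
A sketch of the combined jump computation, assuming ideas are ordered by generation position and `labels` holds their cluster assignments:

```python
# Sketch: combined jump signal between consecutive ideas in generation order.
import numpy as np

def combined_jumps(labels: np.ndarray, embeddings: np.ndarray, sim_threshold: float = 0.7) -> np.ndarray:
    """True where consecutive ideas change cluster AND are semantically dissimilar."""
    jump_cat = labels[1:] != labels[:-1]
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.sum(normed[1:] * normed[:-1], axis=1)  # cosine similarity of consecutive pairs
    jump_ss = sims < sim_threshold
    return jump_cat & jump_ss

# combined_jumps(...).sum() corresponds to the "Combined" column above.
```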


Advanced Analysis: Flexibility Profiles

Classification Based on Combined Jump Ratio

Method: Classify creativity style based on normalized jump ratio (jumps / transitions):

  • Persistent: ratio < 0.30 (deep exploration within categories)
  • Flexible: ratio > 0.45 (broad exploration across categories)
  • Mixed: 0.30 ≤ ratio ≤ 0.45

| Condition | Combined Jump Ratio | Profile | Interpretation |
|---|---|---|---|
| C3 Attribute-Only | 26.6% | Persistent | Moderate category switching |
| C2 Expert-Only | 24.4% | Persistent | Moderate category switching |
| C5 Random | 10.1% | Persistent | Low category switching |
| C4 Full Pipeline | 3.2% | Persistent | Very deep within-category exploration |
| C1 Direct | 0.0% | Persistent | Single semantic cluster |

Key Insight: C4's low jump ratio indicates focused, persistent exploration within novel semantic territory
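
A sketch of the profile classification from the thresholds above; using the transition count as the denominator is an assumption about how the ratio is normalized:

```python
# Sketch: map a combined jump ratio (jumps / transitions) onto a flexibility profile.
def flexibility_profile(n_jumps: int, n_transitions: int) -> str:
    ratio = n_jumps / n_transitions if n_transitions else 0.0
    if ratio < 0.30:
        return "Persistent"   # deep exploration within categories
    if ratio > 0.45:
        return "Flexible"     # broad exploration across categories
    return "Mixed"
```

With C4's 13 combined jumps spread over roughly 400 transitions, the ratio lands near 3%, firmly in the Persistent band.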


Key Finding: Originality-Flexibility Correlation

Does Our Pipeline Break the Typical LLM Pattern?

Paper Finding (arXiv:2405.00899):

  • Humans: No correlation between flexibility and originality (r ≈ 0)
  • LLMs: Positive correlation — flexible LLMs score higher on originality

Our Results:

| Metric | Value | Interpretation |
|---|---|---|
| Pearson r | 0.071 | Near-zero correlation |
| Interpretation | Human-like pattern | Breaks the typical LLM pattern |

Per-Condition Breakdown:

| Condition | Novelty | Flexibility (combined jumps) |
|---|---|---|
| C4 Full Pipeline | 0.395 (highest) | 13 (lowest) |
| C5 Random | 0.365 | 20 |
| C3 Attribute-Only | 0.337 | 33 |
| C2 Expert-Only | 0.315 | 48 (highest) |
| C1 Direct | 0.273 (lowest) | 0 |

Critical Finding: The attribute+expert pipeline (C4) achieves highest novelty with lowest flexibility, demonstrating that structured context-free generation produces focused novelty rather than scattered exploration.
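
A sketch of the correlation check, assuming paired novelty and jump-count values per generation run (the exact aggregation unit is not specified here):

```python
# Sketch: Pearson correlation between originality (novelty) and flexibility (jumps).
import numpy as np
from scipy.stats import pearsonr

def originality_flexibility_r(novelty: np.ndarray, jumps: np.ndarray) -> tuple[float, float]:
    r, p = pearsonr(novelty, jumps)  # r near 0 = the uncoupled, human-like pattern
    return r, p
```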


Cumulative Jump Profile Visualization

Exploration Patterns Over Generation Sequence

Method: Track cumulative jump count at each response position. Steep slopes indicate rapid category switching; flat regions indicate persistent exploration within categories.

[Figure: cumulative combined-jump count by response position, per condition]

Visual Pattern:

  • C2/C3 show steady accumulation of jumps → regular category switching
  • C4/C5 show flatter profiles → persistent within-category exploration
  • C1 is flat (0 jumps) → all ideas in single cluster

Flexibility vs Novelty: Key Insight

Novelty and Flexibility are Orthogonal Dimensions

| Condition | Novelty (centroid dist) | Flexibility (combined jumps) | Pattern |
|---|---|---|---|
| C4 Pipeline | 0.395 (highest) | 13 (lowest) | High novelty, low flexibility |
| C5 Random | 0.365 | 20 | High novelty, low flexibility |
| C2 Expert | 0.315 | 48 (highest) | Moderate novelty, high flexibility |
| C3 Attribute | 0.337 | 33 | Moderate on both |
| C1 Direct | 0.273 (lowest) | 0 | Typical, single category |

Interpretation:

  • C1 Direct produces similar ideas within one typical category (low novelty, no jumps)
  • C4 Full Pipeline produces the most novel ideas with focused exploration (low jump ratio)
  • C2 Expert-Only produces the most category switching but moderate novelty
  • r = 0.071 indicates these are largely independent (orthogonal) dimensions, matching the human-like pattern

Embedding Visualization: PCA

Method: Principal Component Analysis reduces high-dimensional embeddings (1024D) to 2D for visualization by finding directions of maximum variance. Points close together = semantically similar ideas. Colors represent conditions.

[Figure: PCA projection of idea embeddings, colored by condition]


Embedding Visualization: t-SNE

Method: t-SNE (t-distributed Stochastic Neighbor Embedding) preserves local neighborhood structure when reducing to 2D. Better at revealing clusters than PCA, but distances between clusters are less meaningful. Good for seeing if conditions form distinct groups.

[Figure: t-SNE projection of idea embeddings, colored by condition]
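
A sketch of both projections with scikit-learn; apart from the 2-D output and the random seed, the parameters here are assumptions:

```python
# Sketch: 2-D projections of the 1024-D idea embeddings for the two plots above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project(embeddings: np.ndarray, seed: int = 42):
    pca_2d = PCA(n_components=2, random_state=seed).fit_transform(embeddings)
    tsne_2d = TSNE(n_components=2, random_state=seed, perplexity=30).fit_transform(embeddings)
    return pca_2d, tsne_2d

# Each output row is one idea; color the scatter points by condition.
```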


Integrated Findings

What the Advanced Analysis Reveals

| Analysis | C4 Full Pipeline Characteristic |
|---|---|
| Lexical | Largest vocabulary (1,992 words) |
| Novelty | Highest distance from centroid (0.395) |
| Cohesion | Tightest cluster (88.6% same-condition NN) |
| Diversity | High pairwise distance (0.395) |
| Flexibility | Lowest combined jumps (13) = focused exploration |

Interpretation: C4 carves out a distinct semantic territory of novel ideas that are internally coherent but far from the other conditions. Its low flexibility (3.2% jump ratio) indicates deep, focused exploration within that novel space.

Understanding Novelty vs Flexibility

| Condition | Novelty | Flexibility (jumps) | Strategy |
|---|---|---|---|
| C1 Direct | Low | Lowest (0) | Typical, single category |
| C2 Expert | Medium | Highest (48) | Experts = diverse exploration |
| C3 Attribute | Medium | Medium (33) | Structured exploration |
| C5 Random | High | Low (20) | Random but focused |
| C4 Pipeline | Highest | Low (13) | Focused novelty |

Critical Limitation

Embedding Distance ≠ True Novelty

Current metrics measure semantic spread, not creative value

| What We Measure | What We Miss |
|---|---|
| Vector distance | Practical usefulness |
| Cluster spread | Conceptual surprise |
| Query distance | Non-obviousness |
| | Feasibility |

"Quantum entanglement chair" → high embedding distance, low true novelty
"Chair legs as drumsticks" → low embedding distance, high true novelty

Torrance Creativity Framework

What True Novelty Assessment Requires

| Dimension | Definition | Our Coverage |
|---|---|---|
| Fluency | Number of ideas | ✓ Measured |
| Flexibility | Category diversity | ✓ Measured (LLM + embedding) |
| Originality | Statistical rarity | Not measured |
| Elaboration | Detail & development | Not measured |

Originality requires human judgment or LLM-as-Judge


Discussion: The Attribute Anchoring Effect

Why C4 Has Highest Novelty but Lowest Flexibility

C2 (Expert-Only): HIGHEST FLEXIBILITY (48 combined jumps)
  Architect → "load-bearing furniture"
  Chef → "dining experience design"
  ← Each expert explores freely, frequent category switching

C4 (Full Pipeline): LOWEST FLEXIBILITY (13 combined jumps, 3.2% ratio)
  All experts respond to same attribute set
  Architect + "portable" → "modular portable"
  Chef + "portable" → "portable serving"
  ← Attribute anchoring constrains category switching
  ← BUT forced bisociation produces HIGHEST NOVELTY

Key Mechanism: Attributes anchor experts to similar conceptual space (low flexibility), but context-free keyword generation forces novel associations (high novelty).

Result: "Focused novelty" — deep exploration in a distant semantic territory


Key Findings Summary

| RQ | Question | Answer |
|---|---|---|
| RQ1 | Attributes increase diversity? | Yes (p = 0.027) |
| RQ2 | Experts increase diversity? | Yes (p < 0.001) |
| RQ3 | Synergistic interaction? | No (sub-additive) |
| RQ4 | Experts > Random? | No (p = 0.463) |

Additional Findings (arXiv:2405.00899 Metrics):

  • Full Pipeline (C4) has highest novelty but lowest flexibility
  • Originality-Flexibility correlation r=0.071 (human-like, breaks typical LLM pattern)
  • Novelty and Flexibility are orthogonal dimensions
  • All conditions show Persistent exploration profile (combined jump ratio < 30%)
  • Direct generation (C1) produces ideas in a single semantic cluster

Limitations

  1. Sample Size: 10 queries (pilot study)

  2. Novelty Measurement: Embedding-based metrics only measure semantic distance, not true creative value

  3. Single Model: Results may vary with different LLMs

  4. No Human Evaluation: No validation of idea quality or usefulness

  5. Fixed Categories: 4 attribute categories may limit exploration


Future Work

Immediate Next Steps

  1. Human Assessment Interface (Built)

    • Web-based rating tool with Torrance dimensions
    • Stratified sampling: 200 ideas (4 per condition per query × 5 conditions × 10 queries)
    • 4 dimensions: Originality, Elaboration, Coherence, Usefulness
  2. Multi-Model Validation (Priority)

    • Replicate on GPT-4, Claude, Llama-3
    • Verify findings generalize across LLMs
  3. LLM-as-Judge evaluation for full-scale scoring

  4. Scale to 30 queries for statistical power

  5. Alternative pipeline designs to address attribute anchoring

Documentation:

  • experiments/docs/future_research_plan_zh.md - Detailed research plan
  • experiments/docs/creative_process_metrics_zh.md - arXiv:2405.00899 metrics explanation

Conclusion

Key Takeaways

  1. Both attribute decomposition and expert perspectives significantly increase semantic diversity compared to direct generation

  2. The combination is sub-additive, suggesting attribute structure may constrain expert creativity

  3. Random perspectives work as well as domain experts, implying the value is in perspective shift, not expert knowledge

  4. Novelty and Flexibility are orthogonal creativity dimensions - high novelty ≠ high flexibility

    • C4 Full Pipeline: Highest novelty, lowest flexibility
    • C5 Random: Higher flexibility, moderate novelty
  5. 🔑 Key Finding: The pipeline produces human-like originality-flexibility patterns (r=0.071)

    • Typical LLMs show positive correlation (flexible → more original)
    • Our method breaks this pattern: high novelty with focused exploration
  6. True novelty assessment requires judgment-based evaluation beyond embedding metrics


Appendix: Statistical Details

T-test Results (vs C1 Baseline)

| Comparison | t | p | Cohen's d |
|---|---|---|---|
| C4 vs C1 | 8.55 | <0.001 | 4.05 |
| C2 vs C1 | 7.67 | <0.001 | 3.43 |
| C3 vs C1 | 4.23 | <0.001 | 1.89 |

All experimental conditions significantly outperform baseline
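
A sketch of the baseline comparison, assuming Student's t-test on per-run distance values and the equal-n pooled-SD form of Cohen's d (the report does not state which variant was used):

```python
# Sketch: t-test and Cohen's d for one condition versus the C1 baseline.
import numpy as np
from scipy.stats import ttest_ind

def compare_to_baseline(cond: np.ndarray, baseline: np.ndarray):
    t, p = ttest_ind(cond, baseline)
    pooled_sd = np.sqrt((cond.var(ddof=1) + baseline.var(ddof=1)) / 2)  # equal-n pooled SD
    d = (cond.mean() - baseline.mean()) / pooled_sd
    return t, p, d
```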


Appendix: Experiment Configuration

EXPERIMENT_CONFIG = {
    "model": "qwen3:8b",
    "temperature": 0.9,
    "expert_count": 4,
    "expert_source": "curated",  # 210 occupations
    "keywords_per_expert": 1,
    "categories": ["Functions", "Usages",
                   "User Groups", "Characteristics"],
    "dedup_threshold": 0.90,
    "random_seed": 42
}

Thank You

Questions?

Repository: novelty-seeking
Experiment Date: January 19, 2026
Contact: [Your Email]


Backup Slides


Backup: Deduplication Threshold Analysis

Original threshold (0.85) was too aggressive:

  • 40.5% of removed pairs were borderline (0.85-0.87)
  • Many genuinely different concepts were grouped

Raised to 0.90:

  • RQ1 (Attributes) became significant (p: 0.052 → 0.027)
  • Preserved ~103 additional unique ideas
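
A sketch of a greedy deduplication pass at the 0.90 threshold; the actual pipeline's ordering and tie-breaking may differ:

```python
# Sketch: greedy embedding-based deduplication at the 0.90 cosine-similarity threshold.
import numpy as np

def deduplicate(ideas: list[str], embeddings: np.ndarray, threshold: float = 0.90) -> list[str]:
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept, kept_vecs = [], []
    for idea, vec in zip(ideas, normed):
        if kept_vecs and np.max(np.stack(kept_vecs) @ vec) > threshold:
            continue  # near-duplicate of an already kept idea
        kept.append(idea)
        kept_vecs.append(vec)
    return kept
```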

Backup: Sample Ideas by Condition

Query: "Chair"

C1 Direct:

  • Ergonomic office chair with lumbar support
  • Foldable camping chair

C2 Expert-Only (Architect):

  • Load-bearing furniture integrated into building structure

C4 Full Pipeline:

  • Asset-tracking chairs with RFID for corporate inventory
  • (Accountant + "portable" → "mobile assets")

Backup: Efficiency Calculation

$$\text{Efficiency} = \frac{\text{Mean Pairwise Distance}}{\text{Idea Count}} \times 100$$

| Condition | Calculation | Result |
|---|---|---|
| C3 Attribute | 0.376 / 12.8 × 100 | 3.01 |
| C4 Pipeline | 0.393 / 51.9 × 100 | 0.78 |

C3 achieves 96% of C4's diversity with 25% of the ideas