
---
marp: true
theme: default
paginate: true
backgroundColor:
style: |
  section { font-size: 24px; }
  h1 { color: #2c3e50; }
  h2 { color: #34495e; }
  table { font-size: 18px; }
  .columns { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; }
---

Breaking Semantic Gravity in LLM-Based Creative Ideation

A Pilot Study on Attribute Decomposition and Expert Perspectives

Date: January 19, 2026
Model: Qwen3:8b (Temperature: 0.9)
Queries: 10 pilot queries


Research Problem

The "Semantic Gravity" Challenge

LLMs tend to generate ideas clustered around high-probability training distributions

Query: "Chair"
Typical LLM output:
  - Ergonomic office chair
  - Comfortable reading chair
  - Foldable portable chair
  ← All within "furniture comfort" semantic cluster

Goal: Break this gravitational pull toward obvious solutions


Theoretical Framework

Bisociation Theory (Koestler, 1964)

Creative thinking occurs when two unrelated "matrices of thought" collide

Our Approach:

  1. Attribute Decomposition → Break object into structural components
  2. Expert Perspectives → Introduce distant domain knowledge
  3. Context-Free Keywords → Force unexpected conceptual leaps

Experimental Design

2×2 Factorial + Control

| Condition | Attributes | Experts | Description |
|---|---|---|---|
| C1 Direct | - | - | Baseline: direct LLM generation |
| C2 Expert-Only | - | ✓ | Expert perspectives without structure |
| C3 Attribute-Only | ✓ | - | Structure without expert knowledge |
| C4 Full Pipeline | ✓ | ✓ | Combined approach |
| C5 Random-Perspective | - | Random | Control: random words as "experts" |

Research Questions

  1. RQ1: Does attribute decomposition increase idea diversity?

  2. RQ2: Do expert perspectives increase idea diversity?

  3. RQ3: Is there a synergistic (super-additive) interaction effect?

  4. RQ4: Do domain-relevant experts outperform random perspectives?


Pipeline Architecture

C4: Full Pipeline Process

Query: "Chair"
    ↓
Step 1: Attribute Decomposition
    → "portable", "stackable", "ergonomic", ...
    ↓
Step 2: Context-Free Keyword Generation (Expert sees ONLY attribute)
    → Accountant + "portable" → "mobile assets"
    → Architect + "portable" → "modular units"
    ↓
Step 3: Idea Synthesis (Reunite with query)
    → "Chair" + "mobile assets" + Accountant perspective
    → "Asset-tracking chairs for corporate inventory management"

Key Design Decision

Context-Free Keyword Generation

The expert never sees the original query when generating keywords

# Step 2: Expert sees only attribute
prompt = f"As a {expert}, what keyword comes to mind for '{attribute}'?"
# Input: "portable" (NOT "portable chair")

# Step 3: Reunite with query
prompt = f"Apply '{keyword}' to '{query}' from {expert}'s perspective"
# Input: "mobile assets" + "Chair" + "Accountant"

Purpose: Force bisociation by preventing obvious associations
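
A minimal sketch of how the two-step prompting could be wired up against the Ollama endpoint used in the pilot; the `generate` and `bisociate` helpers are illustrative assumptions, not the project's actual code:

```python
# Sketch of the context-free keyword step (assumed helpers, not project code).
import requests

OLLAMA_URL = "http://localhost:11435/api/generate"  # endpoint from the pilot config

def generate(prompt: str, model: str = "qwen3:8b", temperature: float = 0.9) -> str:
    """Call the Ollama /api/generate endpoint and return the completion text."""
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    })
    resp.raise_for_status()
    return resp.json()["response"]

def bisociate(query: str, attribute: str, expert: str) -> str:
    # Step 2: the expert sees only the attribute, never the original query.
    keyword = generate(f"As a {expert}, what keyword comes to mind for '{attribute}'?")
    # Step 3: reunite keyword, query, and expert perspective into an idea.
    return generate(f"Apply '{keyword.strip()}' to '{query}' from a {expert}'s perspective.")

# Example: bisociate("Chair", "portable", "Accountant")
```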


Pilot Study Parameters

Model & Generation Settings

| Parameter | Value |
|---|---|
| LLM Model | Qwen3:8b (Ollama) |
| Temperature | 0.9 |
| Ollama Endpoint | localhost:11435 |
| Language | English |
| Random Seed | 42 |

Pilot Study Parameters (cont.)

Pipeline Configuration

| Parameter | Value |
|---|---|
| Queries | 10 (Chair, Bicycle, Smartphone, Solar panel, 3D printer, Drone, Food delivery, Online education, Public transport, Elderly care) |
| Attribute Categories | 4 (Functions, Usages, User Groups, Characteristics) |
| Attributes per Category | 5 |
| Expert Source | Curated (210 occupations) |
| Experts per Query | 4 |
| Keywords per Expert | 1 |

Pilot Study Parameters (cont.)

Output & Evaluation

| Parameter | Value |
|---|---|
| Total Ideas Generated | 1,119 (after deduplication) |
| Ideas by Condition | C1: 195, C2: 198, C3: 125, C4: 402, C5: 199 |
| Deduplication Threshold | 0.90 (cosine similarity) |
| Embedding Model | qwen3-embedding:4b (1024D) |

Background: Embedding Models Evolution

From Static to Contextual Representations

| Generation | Model | Characteristics | Limitation |
|---|---|---|---|
| 1st Gen | Word2Vec, GloVe | Static vectors, one vector per word | "bank" = same vector (river vs. finance) |
| 2nd Gen | BERT, Sentence-BERT | Contextual, transformer-based | Limited context window, older training data |
| 3rd Gen | Qwen3-embedding | LLM-based, instruction-tuned | Requires more compute |

Background: Transformer vs LLM-based Embedding

Architecture Differences

| Aspect | Transformer (BERT) | LLM-based (Qwen3) |
|---|---|---|
| Architecture | Encoder-only | Decoder-only (GPT-style) |
| Training objective | MLM (masked language modeling) | Next-token prediction |
| Training data | ~16 GB (Wikipedia + books) | Several TB (web pages, code, books) |
| Parameters | 110M - 340M | 4B+ |
| Context length | 512 tokens | 8K - 128K tokens |

Background

Key Comparison

1. More training knowledge
   BERT: only covers knowledge up to 2019
   Qwen3: knows modern concepts such as "drone delivery", "AI-powered", and "IoT"

2. Broader semantic understanding
   BERT: "chair for elderly" ≈ "elderly chair" (bag-of-words similarity)
   Qwen3: understands the difference between "mobility assistance" and "comfort seating"

3. Instruction tuning
   Traditional models: cannot adapt to task intent
   Qwen3: can follow instructions like "find the semantic differences between creative ideas"

Background: Why Qwen3-Embedding?

Comparison with Traditional Methods

Traditional Sentence-BERT (all-MiniLM-L6-v2):
  - 384-dimensional vectors
  - Trained on pre-2021 data
  - Works well for short sentences; limited understanding of long text
  - Encoder-only, MLM training

Qwen3-Embedding (qwen3-embedding:4b):
  - 1024-dimensional vectors (richer semantic representation)
  - Built on the Qwen3 LLM (2024+ training data)
  - Supports long context (8K tokens)
  - Instruction-tuned → adapts to task intent
  - Inherits part of the LLM's capabilities

Why we chose it: creative ideas tend to be long and semantically complex, requiring stronger contextual understanding


Background: How Embedding Works

Semantic Similarity via Vector Space

Step 1: Convert text into a vector
  "Solar-powered charging chair" → [0.12, -0.34, 0.56, ..., 0.78] (1024D)

Step 2: Compute cosine similarity
  similarity = cos(θ) = (A · B) / (|A| × |B|)

Step 3: Interpret the similarity
  1.0 = identical
  0.9 = very similar (likely a duplicate idea)
  0.5 = moderately related
  0.0 = unrelated

Applications: deduplication (similarity > 0.9), flexibility analysis (clustering), novelty (centroid distance)
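
For concreteness, a small sketch of the similarity computation described above (NumPy only; the random vectors stand in for real qwen3-embedding:4b outputs):

```python
# Sketch: cosine similarity between two idea embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (|A| * |B|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors; the study uses 1024-D qwen3-embedding:4b vectors.
a, b = np.random.rand(1024), np.random.rand(1024)
print(cosine_similarity(a, b))      # ~1.0 duplicate, ~0.0 unrelated
print(1 - cosine_similarity(a, b))  # cosine distance, used by the diversity metrics
```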


Results: Semantic Diversity

Mean Pairwise Distance (Higher = More Diverse)

Method: We convert each idea into a vector embedding (qwen3-embedding:4b), then calculate the average cosine distance between all pairs of ideas within each condition. Higher values indicate ideas are more spread out in semantic space.

| Condition | Mean | SD | vs C1 (Cohen's d) |
|---|---|---|---|
| C1 Direct | 0.294 | 0.039 | - |
| C2 Expert-Only | 0.400 | 0.028 | 3.15* |
| C3 Attribute-Only | 0.377 | 0.036 | 2.20* |
| C4 Full Pipeline | 0.395 | 0.019 | 3.21* |
| C5 Random | 0.405 | 0.062 | 2.72* |

*p < 0.001, Large effect sizes (d > 0.8)

Cohen's d: Measures effect size (how big the difference is). d > 0.8 = large effect, d > 0.5 = medium, d > 0.2 = small.
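
A sketch of how the mean pairwise distance could be computed per condition, assuming one (n_ideas × 1024) embedding matrix per condition; this is an illustration, not the project's analysis script:

```python
# Sketch: mean pairwise cosine distance within one condition's idea embeddings.
import numpy as np
from scipy.spatial.distance import pdist

def mean_pairwise_distance(embeddings: np.ndarray) -> float:
    """Average cosine distance over all unordered idea pairs (higher = more diverse)."""
    return float(pdist(embeddings, metric="cosine").mean())

# embeddings: (n_ideas, 1024) array per condition; random data here for illustration.
print(mean_pairwise_distance(np.random.rand(50, 1024)))
```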


Results: ANOVA Summary

Normalized Diversity Metric

Method: Two-way ANOVA tests whether Attributes and Experts each have independent effects on diversity, and whether combining them produces extra benefit (interaction). F-statistic measures variance between groups vs within groups.

| Effect | F | p | Significant |
|---|---|---|---|
| Attributes (RQ1) | 5.31 | 0.027 | Yes |
| Experts (RQ2) | 26.07 | <0.001 | Yes |
| Interaction (RQ3) | - | - | Sub-additive |

Key Finding: Both factors work, but combination is not synergistic
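
A sketch of the two-way ANOVA setup using statsmodels, assuming a per-run table with a `diversity` score and binary `attributes` / `experts` factors (the column names are assumptions):

```python
# Sketch: two-way ANOVA with interaction on per-run diversity scores.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def two_way_anova(df: pd.DataFrame) -> pd.DataFrame:
    # Main effects for attributes and experts, plus their interaction term.
    model = ols("diversity ~ C(attributes) * C(experts)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)

# two_way_anova(df) yields F and p for attributes, experts, and the interaction.
```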


Results: Expert vs Random (RQ4)

C2 (Expert-Only) vs C5 (Random-Perspective)

| Metric | C2 Expert | C5 Random | p-value | Effect |
|---|---|---|---|---|
| Diversity | 0.399 | 0.414 | 0.463 | n.s. |
| Query Distance | 0.448 | 0.437 | 0.654 | n.s. |

Finding: Random words perform as well as domain experts

Implication: The value may be in perspective shift itself, not expert knowledge


Results: Efficiency Analysis

Diversity per Idea Generated

| Condition | Mean Ideas | Diversity | Efficiency |
|---|---|---|---|
| C1 Direct | 20.0 | 0.293 | 1.46 |
| C2 Expert-Only | 20.0 | 0.399 | 1.99 |
| C3 Attribute-Only | 12.8 | 0.376 | 3.01 |
| C4 Full Pipeline | 51.9 | 0.393 | 0.78 |
| C5 Random | 20.0 | 0.405 | 2.02 |

C4 produces 2.6× more ideas but achieves same diversity


Visualization: Diversity by Condition

[Figure: semantic diversity by condition]


Visualization: Query Distance

[Figure: query distance by condition]


Advanced Analysis: Lexical Diversity

Type-Token Ratio & Vocabulary Richness

Method: Type-Token Ratio (TTR) = unique words ÷ total words. High TTR means more varied vocabulary; low TTR means more word repetition. Vocabulary size counts total unique words across all ideas in a condition.

| Condition | TTR | Vocabulary | Avg Words/Idea |
|---|---|---|---|
| C1 Direct | 0.382 | 853 | 11.5 |
| C2 Expert-Only | 0.330 | 1,358 | 20.8 |
| C3 Attribute-Only | 0.330 | 1,098 | 26.6 |
| C4 Full Pipeline | 0.189 | 1,992 | 26.2 |
| C5 Random | 0.320 | 1,331 | 20.9 |

Finding: C4 has largest vocabulary (1,992) but lowest TTR (0.189) → More words but more repetition across ideas
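
A sketch of how TTR and vocabulary size could be computed; the tokenizer here is a simple regex and may differ from the one behind the numbers above:

```python
# Sketch: type-token ratio and vocabulary size for a list of idea strings.
import re

def lexical_stats(ideas: list[str]) -> dict:
    tokens = [w for idea in ideas for w in re.findall(r"[a-z]+", idea.lower())]
    types = set(tokens)
    return {
        "ttr": len(types) / len(tokens),          # unique words / total words
        "vocabulary": len(types),                 # distinct words in the condition
        "avg_words_per_idea": len(tokens) / len(ideas),
    }

print(lexical_stats(["Ergonomic office chair", "Foldable camping chair"]))
```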


Advanced Analysis: Concept Extraction

Top Keywords by Condition

Method: Extract meaningful keywords from idea texts using NLP (removing stopwords, lemmatization). Top keywords show most frequent concepts; unique keywords count distinct terms. Domain coverage checks if ideas span different knowledge areas.

| Condition | Top Keywords | Unique Keywords |
|---|---|---|
| C1 Direct | solar, powered, smart, delivery, drone | 805 |
| C2 Expert | real, create, design, time, develop | 1,306 |
| C3 Attribute | real, time, create, develop, powered | 1,046 |
| C4 Pipeline | time, real, data, ensuring, enhancing | 1,937 |
| C5 Random | like, solar, inspired, energy, uses | 1,286 |

Finding: C5 Random's "inspired" keyword suggests analogical thinking. All conditions cover 6 domain categories.
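
A simplified sketch of frequency-based keyword extraction; it drops English stopwords via scikit-learn but skips the lemmatization step mentioned above, so it is a stand-in rather than the project's extractor:

```python
# Sketch: top keywords and unique-keyword count for one condition's ideas.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

def top_keywords(ideas: list[str], n: int = 5) -> tuple[list[str], int]:
    vec = CountVectorizer(stop_words="english")       # removes common English stopwords
    counts = vec.fit_transform(ideas).sum(axis=0).A1  # total frequency per term
    vocab = vec.get_feature_names_out()
    ranked = [w for w, _ in Counter(dict(zip(vocab, counts))).most_common(n)]
    return ranked, len(vocab)                         # top-n keywords, unique keyword count
```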


Advanced Analysis: Novelty Scores

Distance from Global Centroid (Higher = More Novel)

Method: Compute the centroid (average vector) of ALL ideas across all conditions. Then measure each idea's distance from this "typical idea" center. Ideas far from the centroid are semantically unusual compared to the overall pool.

| Condition | Mean | Std | Interpretation |
|---|---|---|---|
| C1 Direct | 0.273 | 0.037 | Closest to "typical" ideas |
| C2 Expert-Only | 0.315 | 0.062 | Moderate novelty |
| C3 Attribute-Only | 0.337 | 0.066 | Moderate novelty |
| C5 Random | 0.365 | 0.069 | High novelty |
| C4 Full Pipeline | 0.395 | 0.083 | Highest novelty |

Finding: C4 produces ideas furthest from the "average" idea space
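
A sketch of the centroid-distance novelty score, assuming cosine distance to the mean vector of the pooled embeddings:

```python
# Sketch: novelty as cosine distance from the global centroid of all idea embeddings.
import numpy as np
from scipy.spatial.distance import cosine

def novelty_scores(all_embeddings: np.ndarray, condition_embeddings: np.ndarray) -> np.ndarray:
    centroid = all_embeddings.mean(axis=0)  # the "typical idea" across every condition
    return np.array([cosine(e, centroid) for e in condition_embeddings])

# Averaging novelty_scores(...) per condition reproduces the kind of table shown above.
```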


Advanced Analysis: Cross-Condition Cohesion

% Nearest Neighbors from Same Condition

Method: For each idea, find its K nearest neighbors in embedding space. Cohesion = percentage of neighbors from the same condition. High cohesion means ideas from that condition cluster together; low cohesion means they're scattered among other conditions.

| Condition | Cohesion | Interpretation |
|---|---|---|
| C4 Full Pipeline | 88.6% | Highly distinct idea cluster |
| C2 Expert-Only | 72.7% | Moderate clustering |
| C5 Random | 71.4% | Moderate clustering |
| C1 Direct | 70.8% | Moderate clustering |
| C3 Attribute-Only | 51.2% | Ideas scattered, overlap with others |

Finding: C4 ideas form a distinct cluster in semantic space
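
A sketch of the cohesion metric using scikit-learn nearest neighbors; K = 5 is an assumption, since the neighborhood size is not stated here:

```python
# Sketch: share of each idea's K nearest neighbors that come from the same condition.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cohesion(embeddings: np.ndarray, labels: np.ndarray, k: int = 5) -> dict:
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)            # idx[:, 0] is the idea itself
    same = labels[idx[:, 1:]] == labels[:, None]  # neighbor label == own label
    per_idea = same.mean(axis=1)
    return {c: float(per_idea[labels == c].mean()) for c in np.unique(labels)}
```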


Advanced Analysis: AUT Flexibility

Semantic Category Diversity (Hadas & Hershkovitz 2024)

Method: Uses the Alternative Uses Task (AUT) flexibility framework. Embedding-based: Hierarchical clustering with average linkage, cut at distance threshold 0.5. Higher cluster count = more semantic categories covered = higher flexibility.

| Condition | Embedding Clusters | Mean Pairwise Similarity |
|---|---|---|
| C5 Random | 15 | 0.521 (most diverse) |
| C2 Expert-Only | 13 | 0.517 |
| C3 Attribute-Only | 12 | - |
| C4 Full Pipeline | 10 | 0.583 |
| C1 Direct | 1 | 0.647 (most similar) |

Finding: Expert perspectives (C2, C5) produce more diverse categories than direct generation (C1)
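
A sketch of the embedding-based flexibility measure as described above: average-linkage hierarchical clustering on cosine distances, cut at the 0.5 threshold (SciPy):

```python
# Sketch: count semantic categories via hierarchical clustering of idea embeddings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def flexibility_clusters(embeddings: np.ndarray, threshold: float = 0.5):
    dists = pdist(embeddings, metric="cosine")
    labels = fcluster(linkage(dists, method="average"), t=threshold, criterion="distance")
    n_clusters = len(np.unique(labels))   # more clusters = higher flexibility
    mean_similarity = 1 - dists.mean()    # companion statistic reported in the table
    return n_clusters, mean_similarity, labels
```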


Advanced Analysis: Combined Jump Signal

Enhanced Method from arXiv:2405.00899

Method: Combined jump signal uses logical AND of two conditions:

  • jump_cat: the cluster category changes between consecutive ideas (from embedding clustering)
  • jump_SS: semantic similarity < 0.7 (the consecutive ideas are semantically dissimilar)

True jump = jump_cat ∧ jump_SS, which reduces false positives where similar ideas happen to fall in different clusters.

| Condition | Cat-Only | Sem-Only | Combined | Profile |
|---|---|---|---|---|
| C2 Expert-Only | 54 | 125 | 48 | Persistent |
| C3 Attribute-Only | 34 | 107 | 33 | Persistent |
| C5 Random | 22 | 116 | 20 | Persistent |
| C4 Full Pipeline | 13 | 348 | 13 | Persistent |
| C1 Direct | 0 | 104 | 0 | Persistent |

Finding: Combined jumps ≤ category jumps (as expected). All conditions show "Persistent" exploration pattern.
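
A sketch of the combined jump computation, assuming ideas are ordered by generation position and `labels` holds their cluster assignments:

```python
# Sketch: combined jump signal between consecutive ideas in generation order.
import numpy as np

def combined_jumps(labels: np.ndarray, embeddings: np.ndarray, sim_threshold: float = 0.7) -> np.ndarray:
    """True where consecutive ideas change cluster AND are semantically dissimilar."""
    jump_cat = labels[1:] != labels[:-1]
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.sum(normed[1:] * normed[:-1], axis=1)  # cosine similarity of consecutive pairs
    jump_ss = sims < sim_threshold
    return jump_cat & jump_ss

# combined_jumps(...).sum() corresponds to the "Combined" column above.
```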


Advanced Analysis: Flexibility Profiles

Classification Based on Combined Jump Ratio

Method: Classify creativity style based on normalized jump ratio (jumps / transitions):

  • Persistent: ratio < 0.30 (deep exploration within categories)
  • Flexible: ratio > 0.45 (broad exploration across categories)
  • Mixed: 0.30 ≤ ratio ≤ 0.45

| Condition | Combined Jump Ratio | Profile | Interpretation |
|---|---|---|---|
| C3 Attribute-Only | 26.6% | Persistent | Moderate category switching |
| C2 Expert-Only | 24.4% | Persistent | Moderate category switching |
| C5 Random | 10.1% | Persistent | Low category switching |
| C4 Full Pipeline | 3.2% | Persistent | Very deep within-category exploration |
| C1 Direct | 0.0% | Persistent | Single semantic cluster |

Key Insight: C4's low jump ratio indicates focused, persistent exploration within novel semantic territory
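
A sketch of the profile classification from the thresholds above; using the transition count as the denominator is an assumption about how the ratio is normalized:

```python
# Sketch: map a combined jump ratio (jumps / transitions) onto a flexibility profile.
def flexibility_profile(n_jumps: int, n_transitions: int) -> str:
    ratio = n_jumps / n_transitions if n_transitions else 0.0
    if ratio < 0.30:
        return "Persistent"   # deep exploration within categories
    if ratio > 0.45:
        return "Flexible"     # broad exploration across categories
    return "Mixed"
```

With C4's 13 combined jumps spread over roughly 400 transitions, the ratio lands near 3%, firmly in the Persistent band.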


Key Finding: Originality-Flexibility Correlation

Does Our Pipeline Break the Typical LLM Pattern?

Paper Finding (arXiv:2405.00899):

  • Humans: No correlation between flexibility and originality (r ≈ 0)
  • LLMs: Positive correlation — flexible LLMs score higher on originality

Our Results:

| Metric | Value | Interpretation |
|---|---|---|
| Pearson r | 0.071 | Near-zero correlation |
| Interpretation | Human-like pattern | Breaks the typical LLM pattern |

Per-Condition Breakdown:

| Condition | Novelty | Flexibility (combined jumps) |
|---|---|---|
| C4 Full Pipeline | 0.395 (highest) | 13 (lowest) |
| C5 Random | 0.365 | 20 |
| C3 Attribute-Only | 0.337 | 33 |
| C2 Expert-Only | 0.315 | 48 (highest) |
| C1 Direct | 0.273 (lowest) | 0 |

Critical Finding: The attribute+expert pipeline (C4) achieves highest novelty with lowest flexibility, demonstrating that structured context-free generation produces focused novelty rather than scattered exploration.
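
A sketch of the correlation check, assuming paired novelty and jump-count values per generation run (the exact aggregation unit is not specified here):

```python
# Sketch: Pearson correlation between originality (novelty) and flexibility (jumps).
import numpy as np
from scipy.stats import pearsonr

def originality_flexibility_r(novelty: np.ndarray, jumps: np.ndarray) -> tuple[float, float]:
    r, p = pearsonr(novelty, jumps)  # r near 0 = the uncoupled, human-like pattern
    return r, p
```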


Cumulative Jump Profile Visualization

Exploration Patterns Over Generation Sequence

Method: Track cumulative jump count at each response position. Steep slopes indicate rapid category switching; flat regions indicate persistent exploration within categories.

[Figure: cumulative combined-jump count by response position, per condition]

Visual Pattern:

  • C2/C3 show steady accumulation of jumps → regular category switching
  • C4/C5 show flatter profiles → persistent within-category exploration
  • C1 is flat (0 jumps) → all ideas in single cluster

Flexibility vs Novelty: Key Insight

Novelty and Flexibility are Orthogonal Dimensions

| Condition | Novelty (centroid dist) | Flexibility (combined jumps) | Pattern |
|---|---|---|---|
| C4 Pipeline | 0.395 (highest) | 13 (lowest) | High novelty, low flexibility |
| C5 Random | 0.365 | 20 | High novelty, low flexibility |
| C2 Expert | 0.315 | 48 (highest) | Moderate novelty, high flexibility |
| C3 Attribute | 0.337 | 33 | Moderate on both |
| C1 Direct | 0.273 (lowest) | 0 | Typical, single category |

Interpretation:

  • C1 Direct produces similar ideas within one typical category (low novelty, no jumps)
  • C4 Full Pipeline produces the most novel ideas with focused exploration (low jump ratio)
  • C2 Expert-Only produces the most category switching but moderate novelty
  • r = 0.071 indicates these are largely independent (orthogonal) dimensions, matching the human-like pattern

Embedding Visualization: PCA

Method: Principal Component Analysis reduces high-dimensional embeddings (1024D) to 2D for visualization by finding directions of maximum variance. Points close together = semantically similar ideas. Colors represent conditions.

[Figure: PCA projection of idea embeddings, colored by condition]


Embedding Visualization: t-SNE

Method: t-SNE (t-distributed Stochastic Neighbor Embedding) preserves local neighborhood structure when reducing to 2D. Better at revealing clusters than PCA, but distances between clusters are less meaningful. Good for seeing if conditions form distinct groups.

[Figure: t-SNE projection of idea embeddings, colored by condition]
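
A sketch of both projections with scikit-learn; apart from the 2-D output and the random seed, the parameters here are assumptions:

```python
# Sketch: 2-D projections of the 1024-D idea embeddings for the two plots above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project(embeddings: np.ndarray, seed: int = 42):
    pca_2d = PCA(n_components=2, random_state=seed).fit_transform(embeddings)
    tsne_2d = TSNE(n_components=2, random_state=seed, perplexity=30).fit_transform(embeddings)
    return pca_2d, tsne_2d

# Each output row is one idea; color the scatter points by condition.
```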


Integrated Findings

What the Advanced Analysis Reveals

| Analysis | C4 Full Pipeline Characteristic |
|---|---|
| Lexical | Largest vocabulary (1,992 words) |
| Novelty | Highest distance from centroid (0.395) |
| Cohesion | Tightest cluster (88.6% same-condition NN) |
| Diversity | High pairwise distance (0.395) |
| Flexibility | Lowest combined jumps (13) = focused exploration |

Interpretation: C4 carves out a distinct semantic territory of novel ideas that are internally coherent but far from the other conditions. Its low flexibility (3.2% jump ratio) indicates deep, focused exploration within that novel space.

Understanding Novelty vs Flexibility

| Condition | Novelty | Flexibility (jumps) | Strategy |
|---|---|---|---|
| C1 Direct | Low | Lowest (0) | Typical, single category |
| C2 Expert | Medium | Highest (48) | Experts = diverse exploration |
| C3 Attribute | Medium | Medium (33) | Structured exploration |
| C5 Random | High | Low (20) | Random but focused |
| C4 Pipeline | Highest | Low (13) | Focused novelty |

Critical Limitation

Embedding Distance ≠ True Novelty

Current metrics measure semantic spread, not creative value

| What We Measure | What We Miss |
|---|---|
| Vector distance | Practical usefulness |
| Cluster spread | Conceptual surprise |
| Query distance | Non-obviousness |
| | Feasibility |

"Quantum entanglement chair" → high embedding distance, low true novelty
"Chair legs as drumsticks" → low embedding distance, high true novelty

Torrance Creativity Framework

What True Novelty Assessment Requires

| Dimension | Definition | Our Coverage |
|---|---|---|
| Fluency | Number of ideas | ✓ Measured |
| Flexibility | Category diversity | ✓ Measured (LLM + embedding) |
| Originality | Statistical rarity | Not measured |
| Elaboration | Detail & development | Not measured |

Originality requires human judgment or LLM-as-Judge


Discussion: The Attribute Anchoring Effect

Why C4 Has Highest Novelty but Lowest Flexibility

C2 (Expert-Only): HIGHEST FLEXIBILITY (48 combined jumps)
  Architect → "load-bearing furniture"
  Chef → "dining experience design"
  ← Each expert explores freely, frequent category switching

C4 (Full Pipeline): LOWEST FLEXIBILITY (13 combined jumps, 3.2% ratio)
  All experts respond to same attribute set
  Architect + "portable" → "modular portable"
  Chef + "portable" → "portable serving"
  ← Attribute anchoring constrains category switching
  ← BUT forced bisociation produces HIGHEST NOVELTY

Key Mechanism: Attributes anchor experts to similar conceptual space (low flexibility), but context-free keyword generation forces novel associations (high novelty).

Result: "Focused novelty" — deep exploration in a distant semantic territory


Key Findings Summary

| RQ | Question | Answer |
|---|---|---|
| RQ1 | Attributes increase diversity? | Yes (p = 0.027) |
| RQ2 | Experts increase diversity? | Yes (p < 0.001) |
| RQ3 | Synergistic interaction? | No (sub-additive) |
| RQ4 | Experts > Random? | No (p = 0.463) |

Additional Findings (arXiv:2405.00899 Metrics):

  • Full Pipeline (C4) has highest novelty but lowest flexibility
  • Originality-Flexibility correlation r=0.071 (human-like, breaks typical LLM pattern)
  • Novelty and Flexibility are orthogonal dimensions
  • All conditions show Persistent exploration profile (combined jump ratio < 30%)
  • Direct generation (C1) produces ideas in a single semantic cluster

Limitations

  1. Sample Size: 10 queries (pilot study)

  2. Novelty Measurement: Embedding-based metrics only measure semantic distance, not true creative value

  3. Single Model: Results may vary with different LLMs

  4. No Human Evaluation: No validation of idea quality or usefulness

  5. Fixed Categories: 4 attribute categories may limit exploration


Future Work

Immediate Next Steps

  1. Human Assessment Interface (Built)

    • Web-based rating tool with Torrance dimensions
    • Stratified sampling: 200 ideas (4 per condition per query × 5 conditions × 10 queries)
    • 4 dimensions: Originality, Elaboration, Coherence, Usefulness
  2. Multi-Model Validation (Priority)

    • Replicate on GPT-4, Claude, Llama-3
    • Verify findings generalize across LLMs
  3. LLM-as-Judge evaluation for full-scale scoring

  4. Scale to 30 queries for statistical power

  5. Alternative pipeline designs to address attribute anchoring

Documentation:

  • experiments/docs/future_research_plan_zh.md - Detailed research plan
  • experiments/docs/creative_process_metrics_zh.md - arXiv:2405.00899 metrics explanation

Conclusion

Key Takeaways

  1. Both attribute decomposition and expert perspectives significantly increase semantic diversity compared to direct generation

  2. The combination is sub-additive, suggesting attribute structure may constrain expert creativity

  3. Random perspectives work as well as domain experts, implying the value is in perspective shift, not expert knowledge

  4. Novelty and Flexibility are orthogonal creativity dimensions - high novelty ≠ high flexibility

    • C4 Full Pipeline: Highest novelty, lowest flexibility
    • C5 Random: Higher flexibility, moderate novelty
  5. 🔑 Key Finding: The pipeline produces human-like originality-flexibility patterns (r=0.071)

    • Typical LLMs show positive correlation (flexible → more original)
    • Our method breaks this pattern: high novelty with focused exploration
  6. True novelty assessment requires judgment-based evaluation beyond embedding metrics


Appendix: Statistical Details

T-test Results (vs C1 Baseline)

| Comparison | t | p | Cohen's d |
|---|---|---|---|
| C4 vs C1 | 8.55 | <0.001 | 4.05 |
| C2 vs C1 | 7.67 | <0.001 | 3.43 |
| C3 vs C1 | 4.23 | <0.001 | 1.89 |

All experimental conditions significantly outperform baseline
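
A sketch of the baseline comparison, assuming Student's t-test on per-run distance values and the equal-n pooled-SD form of Cohen's d (the report does not state which variant was used):

```python
# Sketch: t-test and Cohen's d for one condition versus the C1 baseline.
import numpy as np
from scipy.stats import ttest_ind

def compare_to_baseline(cond: np.ndarray, baseline: np.ndarray):
    t, p = ttest_ind(cond, baseline)
    pooled_sd = np.sqrt((cond.var(ddof=1) + baseline.var(ddof=1)) / 2)  # equal-n pooled SD
    d = (cond.mean() - baseline.mean()) / pooled_sd
    return t, p, d
```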


Appendix: Experiment Configuration

EXPERIMENT_CONFIG = {
    "model": "qwen3:8b",
    "temperature": 0.9,
    "expert_count": 4,
    "expert_source": "curated",  # 210 occupations
    "keywords_per_expert": 1,
    "categories": ["Functions", "Usages",
                   "User Groups", "Characteristics"],
    "dedup_threshold": 0.90,
    "random_seed": 42
}

Thank You

Questions?

Repository: novelty-seeking
Experiment Date: January 19, 2026
Contact: [Your Email]


Backup Slides


Backup: Deduplication Threshold Analysis

Original threshold (0.85) was too aggressive:

  • 40.5% of removed pairs were borderline (0.85-0.87)
  • Many genuinely different concepts were grouped

Raised to 0.90:

  • RQ1 (Attributes) became significant (p: 0.052 → 0.027)
  • Preserved ~103 additional unique ideas
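
A sketch of a greedy deduplication pass at the 0.90 threshold; the actual pipeline's ordering and tie-breaking may differ:

```python
# Sketch: greedy embedding-based deduplication at the 0.90 cosine-similarity threshold.
import numpy as np

def deduplicate(ideas: list[str], embeddings: np.ndarray, threshold: float = 0.90) -> list[str]:
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept, kept_vecs = [], []
    for idea, vec in zip(ideas, normed):
        if kept_vecs and np.max(np.stack(kept_vecs) @ vec) > threshold:
            continue  # near-duplicate of an already kept idea
        kept.append(idea)
        kept_vecs.append(vec)
    return kept
```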

Backup: Sample Ideas by Condition

Query: "Chair"

C1 Direct:

  • Ergonomic office chair with lumbar support
  • Foldable camping chair

C2 Expert-Only (Architect):

  • Load-bearing furniture integrated into building structure

C4 Full Pipeline:

  • Asset-tracking chairs with RFID for corporate inventory
  • (Accountant + "portable" → "mobile assets")

Backup: Efficiency Calculation

$$\text{Efficiency} = \frac{\text{Mean Pairwise Distance}}{\text{Idea Count}} \times 100$$

| Condition | Calculation | Result |
|---|---|---|
| C3 Attribute | 0.376 / 12.8 × 100 | 3.01 |
| C4 Pipeline | 0.393 / 51.9 × 100 | 0.78 |

C3 achieves 96% of C4's diversity with 25% of the ideas