---
marp: true
theme: default
paginate: true
size: 16:9
style: |
  section {
    font-size: 24px;
  }
  h1 {
    color: #2563eb;
  }
  h2 {
    color: #1e40af;
  }
  table {
    font-size: 20px;
  }
  .columns {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 1rem;
  }
---

# Breaking Semantic Gravity

## Expert-Augmented LLM Ideation for Enhanced Creativity

**Research Progress Report**

January 2026

---

# Agenda

1. Research Problem & Motivation
2. Theoretical Framework: "Semantic Gravity"
3. Proposed Solution: Expert-Augmented Ideation
4. Experimental Design
5. Implementation Progress
6. Timeline & Next Steps

---

# 1. Research Problem

## The Myth and Problem of LLM Creativity

**Myth**: LLMs enable infinite idea generation for creative tasks

**Problem**: Generated ideas lack **diversity** and **novelty**

- Ideas cluster around high-probability training distributions
- Limited exploration of distant conceptual spaces
- "Creative" outputs are **interpolations**, not **extrapolations**

---

# The "Semantic Gravity" Phenomenon

```
Direct LLM Generation:
Input: "Generate creative ideas for a chair"

Result:
- "Ergonomic office chair" (high probability)
- "Foldable portable chair" (high probability)
- "Eco-friendly bamboo chair" (moderate probability)

Problem:
→ Ideas cluster in predictable semantic neighborhoods
→ Limited exploration of distant conceptual spaces
```

---

# Why Does Semantic Gravity Occur?

| Factor | Description |
|--------|-------------|
| **Statistical Pattern Learning** | LLMs learn co-occurrence patterns from training data |
| **Model Collapse** (to revisit) | Sampling from "creative ideas" distribution seen in training |
| **Relevance Trap** (to revisit) | Strong associations dominate weak ones |
| **Domain Bias** | Outputs gravitate toward category prototypes |

---

# 2. Theoretical Framework

## Three Key Foundations

1. **Semantic Distance Theory** (Mednick, 1962)
   - Creativity correlates with conceptual "jump" distance

2. **Conceptual Blending Theory** (Fauconnier & Turner, 2002)
   - Creative products emerge from blending input spaces

3. **Design Fixation** (Jansson & Smith, 1991)
   - Blind adherence to initial ideas limits creativity

---

# Semantic Distance in Action

```
Without Expert:
"Chair" → furniture, sitting, comfort, design
Semantic distance: SHORT

With Marine Biologist Expert:
"Chair" → underwater pressure, coral structure, buoyancy
Semantic distance: LONG

Result: Novel ideas like "pressure-adaptive seating"
```

**Key Insight**: Expert perspectives force semantic jumps that LLMs wouldn't naturally make.

---
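
# Sketch: Quantifying Semantic Distance

The jump above can be quantified as cosine distance between embedding vectors. A minimal sketch with toy vectors (a real sentence-embedding model would supply the vectors; the ones below are hypothetical):

```python
import numpy as np

def semantic_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance: 0 = same direction, values near 1+ = far apart."""
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos_sim)

# Hypothetical embeddings: "chair" vs a near and a far concept
chair   = np.array([0.9, 0.1, 0.0])
sitting = np.array([0.8, 0.3, 0.1])   # same neighborhood -> SHORT distance
coral   = np.array([0.1, 0.2, 0.9])   # expert-induced jump -> LONG distance

print(semantic_distance(chair, sitting) < semantic_distance(chair, coral))  # True
```

---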

# 3. Proposed Solution

## Expert-Augmented LLM Ideation Pipeline

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Attribute   │ →  │   Expert     │ →  │   Expert     │
│ Decomposition│    │  Generation  │    │Transformation│
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                                               ▼
                    ┌──────────────┐    ┌──────────────┐
                    │   Novelty    │ ←  │ Deduplication│
                    │  Validation  │    │              │
                    └──────────────┘    └──────────────┘
```

---
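
# Sketch: Pipeline Stages as Code

The stages above can be sketched as composable functions. The names and stub logic below are illustrative only, not the project's actual API; the real stages would each call an LLM:

```python
from dataclasses import dataclass

@dataclass
class Idea:
    text: str
    expert: str
    attribute: str

def decompose(query: str) -> list[str]:
    # Stub: a real implementation would prompt an LLM for attributes
    return ["material", "structure", "function"]

def generate_experts(query: str, n: int = 3) -> list[str]:
    # Stub: LLM-generated / curated / DBpedia sources in the real system
    return ["marine biologist", "accountant", "origami artist"][:n]

def transform(attribute: str, expert: str, query: str) -> Idea:
    # Stub: an expert persona reinterprets one attribute of the query
    return Idea(f"{query} via {expert}'s view of {attribute}", expert, attribute)

def pipeline(query: str) -> list[Idea]:
    ideas = [transform(a, e, query)
             for a in decompose(query)
             for e in generate_experts(query)]
    # Deduplication and novelty validation would filter `ideas` here
    return ideas

print(len(pipeline("chair")))  # 3 attributes x 3 experts = 9
```

---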

# From "Wisdom of Crowds" to "Inner Crowd"

**Traditional Crowd**:
- Person 1 → Ideas from perspective 1
- Person 2 → Ideas from perspective 2
- Aggregation → Diverse idea pool

**Our "Inner Crowd"**:
- LLM + Expert 1 Persona → Ideas from perspective 1
- LLM + Expert 2 Persona → Ideas from perspective 2
- Aggregation → Diverse idea pool (simulated crowd)

---

# Expert Sources

| Source | Description | Coverage |
|--------|-------------|----------|
| **LLM-Generated** | Query-specific, prioritizes unconventional | Flexible |
| **Curated** | 210 pre-selected high-quality occupations | Controlled |
| **DBpedia** | 2,164 occupations from database | Broad |

---

# 4. Research Questions (2×2 Factorial Design)

| ID | Research Question |
|----|-------------------|
| **RQ1** | Does attribute decomposition improve semantic diversity? |
| **RQ2** | Does expert perspective transformation improve semantic diversity? |
| **RQ3** | Is there an interaction effect between the two factors? |
| **RQ4** | Which combination produces the highest patent novelty? |
| **RQ5** | How do expert sources (LLM vs Curated vs External) affect quality? |
| **RQ6** | What is the hallucination/nonsense rate of context-free generation? |

---

# Design Choice: Context-Free Keyword Generation

Our system intentionally excludes the original query during keyword generation:

```
Stage 1 (Keyword): Expert sees "木質" (wood) + "會計師" (accountant)
                   Expert does NOT see "椅子" (chair)
                   → Generates: "資金流動" (cash flow)

Stage 2 (Description): Expert sees "椅子" + "資金流動"
                       → Applies keyword to original query
```

**Rationale**: Forces maximum semantic distance for novelty
**Risk**: Some keywords may be too distant → nonsense/hallucination
**RQ6**: Measure this tradeoff

---
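
# Sketch: Two-Stage Prompt Construction

The two stages can be sketched as prompt builders; the wording is hypothetical, but the key property holds: the query never enters Stage 1.

```python
def keyword_prompt(attribute: str, expert: str) -> str:
    # Stage 1: the original query is deliberately withheld
    return (f"You are a {expert}. Given the attribute '{attribute}', "
            "name one concept central to your field that relates to it.")

def description_prompt(query: str, keyword: str) -> str:
    # Stage 2: the keyword is applied back to the original query
    return (f"Generate a novel idea for '{query}' that incorporates "
            f"the concept '{keyword}'.")

p1 = keyword_prompt("wood", "accountant")
assert "chair" not in p1   # context-free: query never leaks into Stage 1

p2 = description_prompt("chair", "cash flow")
assert "chair" in p2 and "cash flow" in p2
```

---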

# The Semantic Distance Tradeoff

```
Too Close                  Optimal Zone                  Too Far
(Semantic Gravity)         (Creative)                    (Hallucination)
├──────────────────────────┼──────────────────────────────┼─────────────────────────┤
"Ergonomic office chair"   "Pressure-adaptive seating"   "Quantum chair consciousness"

High usefulness            High novelty + useful         High novelty, nonsense
Low novelty                                              Low usefulness
```

**H6**: Full Pipeline has higher nonsense rate than Direct, but acceptable (<20%)

---

# Measuring Nonsense/Hallucination (RQ6) - Three Methods

| Method | Metric | Pros | Cons |
|--------|--------|------|------|
| **Automatic** | Semantic distance > 0.85 | Fast, cheap | May miss contextual nonsense |
| **LLM-as-Judge** | GPT-4 relevance score (1-3) | Moderate cost, scalable | Potential LLM bias |
| **Human Evaluation** | Relevance rating (1-7 Likert) | Gold standard | Expensive, slow |

**Triangulation**: Compare all three methods
- Agreement → high confidence in nonsense detection
- Disagreement → interesting edge cases to analyze

---
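
# Sketch: Automatic Nonsense Flagging

The automatic method reduces to a threshold on query–idea embedding distance. A minimal sketch with toy vectors (the 0.85 cutoff is from the table; real embeddings would come from a sentence-embedding model):

```python
import numpy as np

NONSENSE_THRESHOLD = 0.85  # semantic distance above this flags nonsense

def distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def flag_nonsense(query_vec, idea_vecs):
    return [distance(query_vec, v) > NONSENSE_THRESHOLD for v in idea_vecs]

query = np.array([1.0, 0.0])
ideas = [np.array([0.9, 0.4]),    # close to query: kept
         np.array([-0.5, 0.9])]   # nearly opposite: flagged
print(flag_nonsense(query, ideas))  # [False, True]
```

---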

# Core Hypotheses (2×2 Factorial)

| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1: Attributes** | (Attr-Only + Full) > (Direct + Expert-Only) | Semantic diversity |
| **H2: Experts** | (Expert-Only + Full) > (Direct + Attr-Only) | Semantic diversity |
| **H3: Interaction** | Full > (Attr-Only + Expert-Only - Direct) | Super-additive effect |
| **H4: Novelty** | Full Pipeline > all others | Patent novelty rate |
| **H5: Control** | Expert-Only > Random-Perspective | Validates expert knowledge |
| **H6: Tradeoff** | Full Pipeline nonsense rate < 20% | Nonsense rate |

---

# Experimental Conditions (2×2 Factorial)

| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| **C1: Direct** | ❌ | ❌ | Baseline: "Generate 20 ideas for [query]" |
| **C2: Expert-Only** | ❌ | ✅ | Expert personas generate for whole query |
| **C3: Attribute-Only** | ✅ | ❌ | Decompose query, direct generate per attribute |
| **C4: Full Pipeline** | ✅ | ✅ | Decompose query, experts generate per attribute |
| **C5: Random-Perspective** | ❌ | (random) | Control: random words as "perspectives" |

---

# Expected 2×2 Pattern

```
                     Without Experts       With Experts
                     ---------------       ------------
Without Attributes   Direct (low)          Expert-Only (medium)

With Attributes      Attr-Only (medium)    Full Pipeline (high)
```

**Key prediction**: The combination (Full Pipeline) produces **super-additive** effects
- Experts are more effective when given structured attributes to transform
- The interaction term should be statistically significant

---
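
# Sketch: Interaction Contrast

The super-additive prediction can be checked with a simple interaction contrast on condition means. The numbers below are made up for illustration; the real analysis would use per-query scores in a two-way ANOVA:

```python
# Hypothetical mean semantic-diversity scores per condition
means = {"direct": 0.30, "expert_only": 0.42,
         "attr_only": 0.40, "full": 0.65}

# Interaction = (Full - Attr-Only) - (Expert-Only - Direct):
# extra benefit of experts when attributes are present vs absent
interaction = (means["full"] - means["attr_only"]) \
            - (means["expert_only"] - means["direct"])

print(round(interaction, 2))  # 0.13 > 0 -> super-additive in this toy data
```

---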

# Query Dataset (30 Queries)

**Category A: Everyday Objects (10)**
- Chair, Umbrella, Backpack, Coffee mug, Bicycle...

**Category B: Technology & Tools (10)**
- Solar panel, Electric vehicle, 3D printer, Drone...

**Category C: Services & Systems (10)**
- Food delivery, Online education, Healthcare appointment...

**Total**: 30 queries × 5 conditions (4 factorial + 1 control) × 20 ideas = **3,000 ideas**

---

# Metrics: Automatic Evaluation

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Mean Pairwise Distance** | avg(1 - cos_sim(i, j)) | Higher = more diverse |
| **Silhouette Score** | Cluster cohesion vs separation | Higher = clearer clusters |
| **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from original |
| **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel |

---
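
# Sketch: Computing the Automatic Metrics

The pairwise-distance and query-distance metrics follow directly from the formulas in the table. A minimal numpy sketch on toy vectors (silhouette score would come from a clustering library such as scikit-learn):

```python
import numpy as np
from itertools import combinations

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def mean_pairwise_distance(vecs):
    """avg(1 - cos_sim(i, j)) over all idea pairs: higher = more diverse."""
    dists = [1 - cos_sim(a, b) for a, b in combinations(vecs, 2)]
    return sum(dists) / len(dists)

def mean_query_distance(query_vec, vecs):
    """avg(1 - cos_sim(query, idea)): higher = farther from the original."""
    return sum(1 - cos_sim(query_vec, v) for v in vecs) / len(vecs)

query = np.array([1.0, 0.0, 0.0])
ideas = [np.array([0.9, 0.1, 0.0]),
         np.array([0.2, 0.9, 0.1]),
         np.array([0.0, 0.3, 0.9])]
print(mean_pairwise_distance(ideas), mean_query_distance(query, ideas))
```

---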

# Metrics: Human Evaluation

**Participants**: 60 evaluators (Prolific/MTurk)

**Rating Scales** (7-point Likert):

- **Novelty**: How novel/surprising is this idea?
- **Usefulness**: How practical is this idea?
- **Creativity**: How creative is this idea overall?
- **Relevance**: How relevant/coherent is this idea to the query? **(RQ6)**
  - Open question: add an explicit nonsense flag?

**Quality Control**:

- Attention checks, completion time monitoring
- Inter-rater reliability (Cronbach's α > 0.7)

---
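
# Sketch: Cronbach's α Check

The reliability criterion can be verified with a short numpy function; the ratings matrix below is toy data (rows = ideas, columns = raters), treating raters as the "items" in the standard formula:

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum(per-rater var) / var(row sums))."""
    k = ratings.shape[1]                      # number of raters
    rater_vars = ratings.var(axis=0, ddof=1)  # variance per rater
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - rater_vars.sum() / total_var)

# Toy data: three raters who largely agree -> alpha should exceed 0.7
ratings = np.array([[6, 7, 6],
                    [2, 3, 2],
                    [5, 5, 6],
                    [1, 2, 1],
                    [7, 6, 7]])
print(cronbach_alpha(ratings) > 0.7)  # True
```

---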

# What is Prolific/MTurk?

Online platforms for recruiting human participants for research studies.

| Platform | Description | Best For |
|----------|-------------|----------|
| **Prolific** | Academic-focused crowdsourcing | Research studies (higher quality) |
| **MTurk** | Amazon Mechanical Turk | Large-scale tasks (lower cost) |

**How it works for our study**:
1. Upload 600 ideas to evaluate (subset of generated ideas)
2. Recruit 60 participants (~$8-15/hour compensation)
3. Each participant rates ~30 ideas (novelty, usefulness, creativity)
4. Download ratings → statistical analysis

**Cost estimate**: 60 participants × 30 min × $12/hr = ~$360

---

# Alternative: LLM-as-Judge

If human evaluation is too expensive or time-consuming:

| Approach | Pros | Cons |
|----------|------|------|
| **Human (Prolific/MTurk)** | Gold standard, publishable | Cost, time, IRB approval |
| **LLM-as-Judge (GPT-4)** | Fast, cheap, reproducible | Less rigorous, potential bias |
| **Automatic metrics only** | No human cost | Missing subjective quality |

**Recommendation**: Start with automatic metrics, add human evaluation for final paper submission.

---

# 5. Implementation Status

## System Components (Implemented)

- Attribute decomposition pipeline
- Expert team generation (LLM, Curated, DBpedia sources)
- Expert transformation with parallel processing
- Semantic deduplication (embedding + LLM methods)
- Patent search integration
- Web-based visualization interface

---

# Implementation Checklist

### Experiment Scripts (To Do)
- [ ] `experiments/generate_ideas.py` - Idea generation
- [ ] `experiments/compute_metrics.py` - Automatic metrics
- [ ] `experiments/export_for_evaluation.py` - Human evaluation prep
- [ ] `experiments/analyze_results.py` - Statistical analysis
- [ ] `experiments/visualize.py` - Generate figures

---

# 6. Timeline

| Phase | Activity |
|-------|----------|
| **Phase 1** | Implement idea generation scripts |
| **Phase 2** | Generate all ideas (5 conditions × 30 queries) |
| **Phase 3** | Compute automatic metrics |
| **Phase 4** | Design and pilot human evaluation |
| **Phase 5** | Run human evaluation (60 participants) |
| **Phase 6** | Analyze results and write paper |

---

# Target Venues

### Tier 1 (Recommended)
- **CHI** - ACM Conference on Human Factors (Sept deadline)
- **CSCW** - Computer-Supported Cooperative Work (Apr/Jan deadline)
- **Creativity & Cognition** - Specialized computational creativity

### Journal Options
- **IJHCS** - International Journal of Human-Computer Studies
- **TOCHI** - ACM Transactions on CHI

---

# Key Contributions

1. **Theoretical**: "Semantic gravity" framework + two-factor solution

2. **Methodological**: 2×2 factorial design isolates attribute vs expert contributions

3. **Empirical**: Quantitative evidence for interaction effects in LLM creativity

4. **Practical**: Open-source system with both factors for maximum diversity

---

# Key Differentiator vs PersonaFlow

```
PersonaFlow (2024):  Query → Experts → Ideas
                     (Experts see WHOLE query, no structure)

Our Approach:        Query → Attributes → (Attributes × Experts) → Ideas
                     (Experts see SPECIFIC attributes, systematic)
```

**What we can answer that PersonaFlow cannot:**
1. Does problem structure alone help? (Attribute-Only vs Direct)
2. Do experts help beyond structure? (Full vs Attribute-Only)
3. Is there an interaction effect? (amplification hypothesis)

---

# Related Work Comparison

| Approach | Limitation | Our Advantage |
|----------|------------|---------------|
| Direct LLM | Semantic gravity | Two-factor enhancement |
| **PersonaFlow** | **No problem structure** | **Attribute decomposition amplifies experts** |
| PopBlends | Two-concept only | Systematic attribute × expert matrix |
| BILLY | Cannot isolate factors | 2×2 factorial isolates contributions |

---

# References (Key Papers)

1. Siangliulue et al. (2017) - Wisdom of Crowds via Role Assumption
2. Liu et al. (2024) - PersonaFlow: LLM-Simulated Expert Perspectives
3. Choi et al. (2023) - PopBlends: Conceptual Blending with LLMs
4. Wadinambiarachchi et al. (2024) - Effects of Generative AI on Design Fixation
5. Mednick (1962) - Semantic Distance Theory
6. Fauconnier & Turner (2002) - Conceptual Blending Theory

*Full reference list: 55+ papers in `research/references.md`*

---

# Questions & Discussion

## Next Steps
1. Finalize experimental design details
2. Implement experiment scripts
3. Collect pilot data for validation
4. Submit IRB for human evaluation (if needed)

---

# Thank You

**Project Repository**: novelty-seeking

**Research Materials**:
- `research/literature_review.md`
- `research/theoretical_framework.md`
- `research/experimental_protocol.md`
- `research/paper_outline.md`
- `research/references.md`

---

# Discussion

- Future work: knowledge domain classification (Dewey Decimal Classification)