---
marp: true
theme: default
paginate: true
size: 16:9
style: |
  section {
    font-size: 24px;
  }
  h1 {
    color: #2563eb;
  }
  h2 {
    color: #1e40af;
  }
  table {
    font-size: 20px;
  }
  .columns {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 1rem;
  }
---
# Breaking Semantic Gravity
## Expert-Augmented LLM Ideation for Enhanced Creativity
**Research Progress Report**
January 2026
---
# Agenda
1. Research Problem & Motivation
2. Theoretical Framework: "Semantic Gravity"
3. Proposed Solution: Expert-Augmented Ideation
4. Experimental Design
5. Implementation Progress
6. Timeline & Next Steps
---
# 1. Research Problem
## The Myth and Problem of LLM Creativity
**Myth**: LLMs enable infinite idea generation for creative tasks
**Problem**: Generated ideas lack **diversity** and **novelty**
- Ideas cluster around high-probability training distributions
- Limited exploration of distant conceptual spaces
- "Creative" outputs are **interpolations**, not **extrapolations**
---
# The "Semantic Gravity" Phenomenon
```
Direct LLM Generation:
Input: "Generate creative ideas for a chair"
Result:
- "Ergonomic office chair" (high probability)
- "Foldable portable chair" (high probability)
- "Eco-friendly bamboo chair" (moderate probability)
Problem:
→ Ideas cluster in predictable semantic neighborhoods
→ Limited exploration of distant conceptual spaces
```
---
# Why Does Semantic Gravity Occur?
| Factor | Description |
|--------|-------------|
| **Statistical Pattern Learning** | LLMs learn co-occurrence patterns from training data |
| **Model Collapse** (to revisit) | Sampling from "creative ideas" distribution seen in training |
| **Relevance Trap** (to revisit) | Strong associations dominate weak ones |
| **Domain Bias** | Outputs gravitate toward category prototypes |
---
# 2. Theoretical Framework
## Three Key Foundations
1. **Semantic Distance Theory** (Mednick, 1962)
- Creativity correlates with conceptual "jump" distance
2. **Conceptual Blending Theory** (Fauconnier & Turner, 2002)
- Creative products emerge from blending input spaces
3. **Design Fixation** (Jansson & Smith, 1991)
- Blind adherence to initial ideas limits creativity
---
# Semantic Distance in Action
```
Without Expert:
"Chair" → furniture, sitting, comfort, design
Semantic distance: SHORT
With Marine Biologist Expert:
"Chair" → underwater pressure, coral structure, buoyancy
Semantic distance: LONG
Result: Novel ideas like "pressure-adaptive seating"
```
**Key Insight**: Expert perspectives force semantic jumps that LLMs wouldn't naturally make.
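A minimal sketch of how such jumps could be quantified, assuming a sentence-transformers embedding model (the model name is an illustrative choice, not the project's confirmed setup):
```python
# Sketch: quantify the semantic jump an expert perspective induces.
# Assumes sentence-transformers; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_distance(a: str, b: str) -> float:
    """1 - cosine similarity between two phrases' embeddings."""
    emb = model.encode([a, b])
    return 1.0 - util.cos_sim(emb[0], emb[1]).item()

print(semantic_distance("chair", "ergonomic office chair"))    # short jump
print(semantic_distance("chair", "pressure-adaptive seating")) # longer jump
```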
---
# 3. Proposed Solution
## Expert-Augmented LLM Ideation Pipeline
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Attribute   │  →   │    Expert    │  →   │    Expert    │
│Decomposition │      │  Generation  │      │Transformation│
└──────────────┘      └──────────────┘      └──────────────┘
                                                    ↓
┌──────────────┐      ┌──────────────┐
│   Novelty    │  ←   │ Deduplication│
│  Validation  │      │              │
└──────────────┘      └──────────────┘
```
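A skeletal sketch of the five stages as composable functions; all names and placeholder bodies below are illustrative, not the repository's actual API:
```python
# Sketch: the five pipeline stages as composable functions.
# Names and placeholder bodies are illustrative, not the repo's API.
def decompose_attributes(query: str) -> list[str]:
    return ["material", "form", "function"]        # placeholder

def generate_experts(attribute: str) -> list[str]:
    return ["marine biologist", "accountant"]      # placeholder

def transform(attribute: str, expert: str) -> list[str]:
    return [f"{expert} idea for {attribute}"]      # placeholder

def deduplicate(ideas: list[str]) -> list[str]:
    return list(dict.fromkeys(ideas))              # order-preserving dedup

def validate_novelty(ideas: list[str]) -> list[str]:
    return ideas                                   # patent-check stub

def full_pipeline(query: str) -> list[str]:
    ideas = [i for a in decompose_attributes(query)
               for e in generate_experts(a)
               for i in transform(a, e)]
    return validate_novelty(deduplicate(ideas))
```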
---
# From "Wisdom of Crowds" to "Inner Crowd"
**Traditional Crowd**:
- Person 1 → Ideas from perspective 1
- Person 2 → Ideas from perspective 2
- Aggregation → Diverse idea pool
**Our "Inner Crowd"**:
- LLM + Expert 1 Persona → Ideas from perspective 1
- LLM + Expert 2 Persona → Ideas from perspective 2
- Aggregation → Diverse idea pool (simulated crowd)
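A minimal sketch of the simulated-crowd loop, assuming the OpenAI Python SDK; the model name and prompt wording are illustrative assumptions:
```python
# Sketch: "inner crowd" — one LLM, many expert personas.
# Assumes the OpenAI SDK; model and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def persona_ideas(query: str, expert: str, n: int = 5) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"You are a {expert}. Answer only from that perspective."},
            {"role": "user",
             "content": f"Propose {n} ideas for: {query}. One per line."},
        ],
    )
    return resp.choices[0].message.content.splitlines()

pool = []
for expert in ["marine biologist", "accountant", "choreographer"]:
    pool.extend(persona_ideas("a chair", expert))  # aggregate the simulated crowd
```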
---
# Expert Sources
| Source | Description | Coverage |
|--------|-------------|----------|
| **LLM-Generated** | Query-specific, prioritizes unconventional | Flexible |
| **Curated** | 210 pre-selected high-quality occupations | Controlled |
| **DBpedia** | 2,164 occupations queried from the knowledge base | Broad |
Note: use the domain list (try adding two levels of the Dewey Decimal Classification? Future work?)
---
# 4. Research Questions (2×2 Factorial Design)
| ID | Research Question |
|----|-------------------|
| **RQ1** | Does attribute decomposition improve semantic diversity? |
| **RQ2** | Does expert perspective transformation improve semantic diversity? |
| **RQ3** | Is there an interaction effect between the two factors? |
| **RQ4** | Which combination produces the highest patent novelty? |
| **RQ5** | How do expert sources (LLM vs Curated vs DBpedia) affect quality? |
| **RQ6** | What is the hallucination/nonsense rate of context-free generation? |
---
# Design Choice: Context-Free Keyword Generation
Our system intentionally excludes the original query during keyword generation:
```
Stage 1 (Keyword): Expert sees "木質" (wood) + "會計師" (accountant)
Expert does NOT see "椅子" (chair)
→ Generates: "資金流動" (cash flow)
Stage 2 (Description): Expert sees "椅子" + "資金流動"
→ Applies keyword to original query
```
**Rationale**: Forces maximum semantic distance for novelty
**Risk**: Some keywords may be too distant → nonsense/hallucination
**RQ6**: Measure this tradeoff
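A sketch of the two-stage prompting under the same assumptions (OpenAI SDK, illustrative prompts); `ask` is a hypothetical wrapper, not the system's real helper:
```python
# Sketch: context-free keyword generation (Stage 1 hides the query).
# `ask` is a hypothetical wrapper around a chat-completion call.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def stage1_keyword(attribute: str, expert: str) -> str:
    # The expert sees the attribute only, never the original query.
    return ask(f"As a {expert}, name one concept you associate with '{attribute}'.")

def stage2_description(query: str, keyword: str) -> str:
    # Only now is the original query revealed and combined with the keyword.
    return ask(f"Apply the concept '{keyword}' to '{query}' as a product idea.")

keyword = stage1_keyword("wood", "accountant")   # e.g. "cash flow"
idea = stage2_description("chair", keyword)
```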
---
# The Semantic Distance Tradeoff
```
Too Close                   Optimal Zone                   Too Far
(Semantic Gravity)          (Creative)                     (Hallucination)
├───────────────────────────┼──────────────────────────────┼─────────────────────────┤
"Ergonomic office chair"    "Pressure-adaptive seating"    "Quantum chair consciousness"
High usefulness             High novelty + useful          High novelty, nonsense
Low novelty                                                Low usefulness
```
**H6**: Full Pipeline has higher nonsense rate than Direct, but acceptable (<20%)
---
# Measuring Nonsense/Hallucination (RQ6) - Three Methods
| Method | Metric | Pros | Cons |
|--------|--------|------|------|
| **Automatic** | Semantic distance > 0.85 | Fast, cheap | May miss contextual nonsense |
| **LLM-as-Judge** | GPT-4 relevance score (1-3) | Moderate cost, scalable | Potential LLM bias |
| **Human Evaluation** | Relevance rating (1-7 Likert) | Gold standard | Expensive, slow |
**Triangulation**: Compare all three methods
- Agreement → high confidence in nonsense detection
- Disagreement → interesting edge cases to analyze
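A minimal sketch of the automatic method only, assuming a sentence-transformers model (illustrative choice) and the 0.85 threshold from the table:
```python
# Sketch: flag ideas whose embedding distance from the query exceeds 0.85.
# Model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def nonsense_rate(query: str, ideas: list[str], threshold: float = 0.85) -> float:
    q = model.encode(query)
    flags = [1.0 - util.cos_sim(q, model.encode(i)).item() > threshold
             for i in ideas]
    return sum(flags) / len(flags)

print(nonsense_rate("chair", ["pressure-adaptive seating",
                              "quantum chair consciousness"]))
```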
---
# Core Hypotheses (2×2 Factorial)
| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1: Attributes** | (Attr-Only + Full) > (Direct + Expert-Only) | Semantic diversity |
| **H2: Experts** | (Expert-Only + Full) > (Direct + Attr-Only) | Semantic diversity |
| **H3: Interaction** | Full > (Attr-Only + Expert-Only - Direct) | Super-additive effect |
| **H4: Novelty** | Full Pipeline > all others | Patent novelty rate |
| **H5: Control** | Expert-Only > Random-Perspective | Validates expert knowledge |
| **H6: Tradeoff** | Full Pipeline nonsense rate < 20% | Nonsense rate |
---
# Experimental Conditions (2×2 Factorial)
| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| **C1: Direct** | ❌ | ❌ | Baseline: "Generate 20 ideas for [query]" |
| **C2: Expert-Only** | ❌ | ✅ | Expert personas generate for whole query |
| **C3: Attribute-Only** | ✅ | ❌ | Decompose query, direct generate per attribute |
| **C4: Full Pipeline** | ✅ | ✅ | Decompose query, experts generate per attribute |
| **C5: Random-Perspective** | ❌ | (random) | Control: random words as "perspectives" |
---
# Expected 2×2 Pattern
```
                     Without Experts       With Experts
                     ---------------       ------------
Without Attributes   Direct (low)          Expert-Only (medium)
With Attributes      Attr-Only (medium)    Full Pipeline (high)
```
**Key prediction**: The combination (Full Pipeline) produces **super-additive** effects
- Experts are more effective when given structured attributes to transform
- The interaction term should be statistically significant
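A sketch of how the interaction term could be tested, assuming statsmodels and a hypothetical results table with `diversity`, `attributes`, and `experts` columns:
```python
# Sketch: two-way ANOVA on the 2x2 design.
# Assumes statsmodels; file name and column names are illustrative.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("results.csv")  # one row per generated idea batch

model = smf.ols("diversity ~ C(attributes) * C(experts)", data=df).fit()
print(anova_lm(model, typ=2))    # a significant C(attributes):C(experts) row
                                 # supports the super-additive prediction
```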
---
# Query Dataset (30 Queries)
**Category A: Everyday Objects (10)**
- Chair, Umbrella, Backpack, Coffee mug, Bicycle...
**Category B: Technology & Tools (10)**
- Solar panel, Electric vehicle, 3D printer, Drone...
**Category C: Services & Systems (10)**
- Food delivery, Online education, Healthcare appointment...
**Total**: 30 queries × 5 conditions (4 factorial + 1 control) × 20 ideas = **3,000 ideas**
---
# Metrics: Automatic Evaluation
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Mean Pairwise Distance** | avg(1 - cos_sim(i, j)) | Higher = more diverse |
| **Silhouette Score** | Cluster cohesion vs separation | Higher = clearer clusters |
| **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from original |
| **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel |
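A minimal sketch computing the first and third metrics from embeddings, assuming sentence-transformers and scikit-learn (model choice is illustrative):
```python
# Sketch: mean pairwise distance and query distance from embeddings.
# Assumes sentence-transformers + scikit-learn; model is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def diversity_metrics(query: str, ideas: list[str]) -> dict[str, float]:
    E = model.encode(ideas)                  # (n, d) idea embeddings
    q = model.encode([query])                # (1, d) query embedding
    sim = cosine_similarity(E)               # pairwise idea similarities
    n = len(ideas)
    mpd = (1.0 - sim)[np.triu_indices(n, k=1)].mean()  # mean pairwise distance
    qd = (1.0 - cosine_similarity(q, E)).mean()        # mean query distance
    return {"mean_pairwise_distance": float(mpd), "query_distance": float(qd)}
```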
---
# Metrics: Human Evaluation
**Participants**: 60 evaluators (Prolific/MTurk)
**Rating Scales** (7-point Likert):
- **Novelty**: How novel/surprising is this idea?
- **Usefulness**: How practical is this idea?
- **Creativity**: How creative is this idea overall?
- **Relevance**: How relevant/coherent is this idea to the query? **(RQ6)**
- Nonsense? (open question: whether to add a direct nonsense rating)
**Quality Control**:
- Attention checks, completion time monitoring
- Inter-rater reliability (Cronbach's α > 0.7)
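A pure-NumPy sketch of the Cronbach's α check, treating raters as items; the demo data below is random, not real ratings:
```python
# Sketch: Cronbach's alpha for inter-rater reliability (alpha > 0.7 passes).
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: shape (n_ideas, n_raters)."""
    k = ratings.shape[1]                         # number of raters
    item_vars = ratings.var(axis=0, ddof=1)      # variance per rater
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)                   # random demo data only
print(cronbach_alpha(rng.integers(1, 8, size=(30, 3)).astype(float)))
```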
---
# What is Prolific/MTurk?
Online platforms for recruiting human participants for research studies.
| Platform | Description | Best For |
|----------|-------------|----------|
| **Prolific** | Academic-focused crowdsourcing | Research studies (higher quality) |
| **MTurk** | Amazon Mechanical Turk | Large-scale tasks (lower cost) |
**How it works for our study**:
1. Upload 600 ideas to evaluate (subset of generated ideas)
2. Recruit 60 participants (~$8-15/hour compensation)
3. Each participant rates ~30 ideas (novelty, usefulness, creativity)
4. Download ratings → statistical analysis
**Cost estimate**: 60 participants × 30 min × $12/hr = ~$360
---
# Alternative: LLM-as-Judge
If human evaluation is too expensive or time-consuming:
| Approach | Pros | Cons |
|----------|------|------|
| **Human (Prolific/MTurk)** | Gold standard, publishable | Cost, time, IRB approval |
| **LLM-as-Judge (GPT-4)** | Fast, cheap, reproducible | Less rigorous, potential bias |
| **Automatic metrics only** | No human cost | Missing subjective quality |
**Recommendation**: Start with automatic metrics, add human evaluation for final paper submission.
---
# 5. Implementation Status
## System Components (Implemented)
- Attribute decomposition pipeline
- Expert team generation (LLM, Curated, DBpedia sources)
- Expert transformation with parallel processing
- Semantic deduplication (embedding + LLM methods)
- Patent search integration
- Web-based visualization interface
---
# Implementation Checklist
### Experiment Scripts (To Do)
- [ ] `experiments/generate_ideas.py` - Idea generation
- [ ] `experiments/compute_metrics.py` - Automatic metrics
- [ ] `experiments/export_for_evaluation.py` - Human evaluation prep
- [ ] `experiments/analyze_results.py` - Statistical analysis
- [ ] `experiments/visualize.py` - Generate figures
---
# 6. Timeline
| Phase | Activity |
|-------|----------|
| **Phase 1** | Implement idea generation scripts |
| **Phase 2** | Generate all ideas (5 conditions × 30 queries) |
| **Phase 3** | Compute automatic metrics |
| **Phase 4** | Design and pilot human evaluation |
| **Phase 5** | Run human evaluation (60 participants) |
| **Phase 6** | Analyze results and write paper |
---
# Target Venues
### Tier 1 (Recommended)
- **CHI** - ACM Conference on Human Factors in Computing Systems (Sept deadline)
- **CSCW** - Computer-Supported Cooperative Work (Apr/Jan deadline)
- **Creativity & Cognition** - Specialized computational creativity
### Journal Options
- **IJHCS** - International Journal of Human-Computer Studies
- **TOCHI** - ACM Transactions on Computer-Human Interaction
---
# Key Contributions
1. **Theoretical**: "Semantic gravity" framework + two-factor solution
2. **Methodological**: 2×2 factorial design isolates attribute vs expert contributions
3. **Empirical**: Quantitative evidence for interaction effects in LLM creativity
4. **Practical**: Open-source system with both factors for maximum diversity
---
# Key Differentiator vs PersonaFlow
```
PersonaFlow (2024):  Query → Experts → Ideas
                     (Experts see WHOLE query, no structure)

Our Approach:        Query → Attributes → (Attributes × Experts) → Ideas
                     (Experts see SPECIFIC attributes, systematic)
```
**What we can answer that PersonaFlow cannot:**
1. Does problem structure alone help? (Attribute-Only vs Direct)
2. Do experts help beyond structure? (Full vs Attribute-Only)
3. Is there an interaction effect? (amplification hypothesis)
---
# Related Work Comparison
| Approach | Limitation | Our Advantage |
|----------|------------|---------------|
| Direct LLM | Semantic gravity | Two-factor enhancement |
| **PersonaFlow** | **No problem structure** | **Attribute decomposition amplifies experts** |
| PopBlends | Two-concept only | Systematic attribute × expert matrix |
| BILLY | Cannot isolate factors | 2×2 factorial isolates contributions |
---
# References (Key Papers)
1. Siangliulue et al. (2017) - Wisdom of Crowds via Role Assumption
2. Liu et al. (2024) - PersonaFlow: LLM-Simulated Expert Perspectives
3. Choi et al. (2023) - PopBlends: Conceptual Blending with LLMs
4. Wadinambiarachchi et al. (2024) - Effects of Generative AI on Design Fixation
5. Mednick (1962) - Semantic Distance Theory
6. Fauconnier & Turner (2002) - Conceptual Blending Theory
*Full reference list: 55+ papers in `research/references.md`*
---
# Questions & Discussion
## Next Steps
1. Finalize experimental design details
2. Implement experiment scripts
3. Collect pilot data for validation
4. Submit IRB for human evaluation (if needed)
---
# Thank You
**Project Repository**: novelty-seeking
**Research Materials**:
- `research/literature_review.md`
- `research/theoretical_framework.md`
- `research/experimental_protocol.md`
- `research/paper_outline.md`
- `research/references.md`