---
marp: true
theme: default
paginate: true
size: 16:9
style: |
  section {
    font-size: 24px;
  }
  h1 {
    color: #2563eb;
  }
  h2 {
    color: #1e40af;
  }
  table {
    font-size: 20px;
  }
  .columns {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 1rem;
  }
---
# Breaking Semantic Gravity
## Expert-Augmented LLM Ideation for Enhanced Creativity
**Research Progress Report**
January 2026
---
# Agenda
1. Research Problem & Motivation
2. Theoretical Framework: "Semantic Gravity"
3. Proposed Solution: Expert-Augmented Ideation
4. Experimental Design
5. Implementation Progress
6. Timeline & Next Steps
---
# 1. Research Problem
## The Myth and Problem of LLM Creativity
**Myth**: LLMs enable infinite idea generation for creative tasks
**Problem**: Generated ideas lack **diversity** and **novelty**
- Ideas cluster around high-probability training distributions
- Limited exploration of distant conceptual spaces
- "Creative" outputs are **interpolations**, not **extrapolations**
---
# The "Semantic Gravity" Phenomenon
```
Direct LLM Generation:
Input: "Generate creative ideas for a chair"
Result:
- "Ergonomic office chair" (high probability)
- "Foldable portable chair" (high probability)
- "Eco-friendly bamboo chair" (moderate probability)
Problem:
→ Ideas cluster in predictable semantic neighborhoods
→ Limited exploration of distant conceptual spaces
```
---
# Why Does Semantic Gravity Occur?
| Factor | Description |
|--------|-------------|
| **Statistical Pattern Learning** | LLMs learn co-occurrence patterns from training data |
| **Model Collapse** (to revisit) | Sampling from "creative ideas" distribution seen in training |
| **Relevance Trap** (to revisit) | Strong associations dominate weak ones |
| **Domain Bias** | Outputs gravitate toward category prototypes |
---
# 2. Theoretical Framework
## Three Key Foundations
1. **Semantic Distance Theory** (Mednick, 1962)
- Creativity correlates with conceptual "jump" distance
2. **Conceptual Blending Theory** (Fauconnier & Turner, 2002)
- Creative products emerge from blending input spaces
3. **Design Fixation** (Jansson & Smith, 1991)
- Blind adherence to initial ideas limits creativity
---
# Semantic Distance in Action
```
Without Expert:
"Chair" → furniture, sitting, comfort, design
Semantic distance: SHORT
With Marine Biologist Expert:
"Chair" → underwater pressure, coral structure, buoyancy
Semantic distance: LONG
Result: Novel ideas like "pressure-adaptive seating"
```
**Key Insight**: Expert perspectives force semantic jumps that LLMs wouldn't naturally make.
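A minimal sketch of how such jumps could be quantified, assuming a sentence-transformers embedding model (the model name is an illustrative choice, not the project's confirmed setup):
```python
# Sketch: quantify the semantic jump an expert perspective induces.
# Assumes sentence-transformers; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_distance(a: str, b: str) -> float:
    """1 - cosine similarity between two phrases' embeddings."""
    emb = model.encode([a, b])
    return 1.0 - util.cos_sim(emb[0], emb[1]).item()

print(semantic_distance("chair", "ergonomic office chair"))    # short jump
print(semantic_distance("chair", "pressure-adaptive seating")) # longer jump
```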
---
# 3. Proposed Solution
## Expert-Augmented LLM Ideation Pipeline
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Attribute   │  →   │    Expert    │  →   │    Expert    │
│Decomposition │      │  Generation  │      │Transformation│
└──────────────┘      └──────────────┘      └──────────────┘
                                                    ↓
┌──────────────┐      ┌──────────────┐
│   Novelty    │  ←   │ Deduplication│
│  Validation  │      │              │
└──────────────┘      └──────────────┘
```
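A skeletal sketch of the five stages as composable functions; all names and placeholder bodies below are illustrative, not the repository's actual API:
```python
# Sketch: the five pipeline stages as composable functions.
# Names and placeholder bodies are illustrative, not the repo's API.
def decompose_attributes(query: str) -> list[str]:
    return ["material", "form", "function"]        # placeholder

def generate_experts(attribute: str) -> list[str]:
    return ["marine biologist", "accountant"]      # placeholder

def transform(attribute: str, expert: str) -> list[str]:
    return [f"{expert} idea for {attribute}"]      # placeholder

def deduplicate(ideas: list[str]) -> list[str]:
    return list(dict.fromkeys(ideas))              # order-preserving dedup

def validate_novelty(ideas: list[str]) -> list[str]:
    return ideas                                   # patent-check stub

def full_pipeline(query: str) -> list[str]:
    ideas = [i for a in decompose_attributes(query)
               for e in generate_experts(a)
               for i in transform(a, e)]
    return validate_novelty(deduplicate(ideas))
```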
---
# From "Wisdom of Crowds" to "Inner Crowd"
**Traditional Crowd**:
- Person 1 → Ideas from perspective 1
- Person 2 → Ideas from perspective 2
- Aggregation → Diverse idea pool
**Our "Inner Crowd"**:
- LLM + Expert 1 Persona → Ideas from perspective 1
- LLM + Expert 2 Persona → Ideas from perspective 2
- Aggregation → Diverse idea pool (simulated crowd)
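A minimal sketch of the simulated-crowd loop, assuming the OpenAI Python SDK; the model name and prompt wording are illustrative assumptions:
```python
# Sketch: "inner crowd" — one LLM, many expert personas.
# Assumes the OpenAI SDK; model and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def persona_ideas(query: str, expert: str, n: int = 5) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"You are a {expert}. Answer only from that perspective."},
            {"role": "user",
             "content": f"Propose {n} ideas for: {query}. One per line."},
        ],
    )
    return resp.choices[0].message.content.splitlines()

pool = []
for expert in ["marine biologist", "accountant", "choreographer"]:
    pool.extend(persona_ideas("a chair", expert))  # aggregate the simulated crowd
```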
---
# Expert Sources
| Source | Description | Coverage |
|--------|-------------|----------|
| **LLM-Generated** | Query-specific, prioritizes unconventional | Flexible |
| **Curated** | 210 pre-selected high-quality occupations | Controlled |
| **DBpedia** | 2,164 occupations queried from the knowledge base | Broad |
Note: use the domain list (try adding two levels of the Dewey Decimal Classification? Future work?)
---
# 4. Research Questions (2×2 Factorial Design)
| ID | Research Question |
|----|-------------------|
| **RQ1** | Does attribute decomposition improve semantic diversity? |
| **RQ2** | Does expert perspective transformation improve semantic diversity? |
| **RQ3** | Is there an interaction effect between the two factors? |
| **RQ4** | Which combination produces the highest patent novelty? |
| **RQ5** | How do expert sources (LLM vs Curated vs DBpedia) affect quality? |
| **RQ6** | What is the hallucination/nonsense rate of context-free generation? |
---
# Design Choice: Context-Free Keyword Generation
Our system intentionally excludes the original query during keyword generation:
```
Stage 1 (Keyword): Expert sees "木質" (wood) + "會計師" (accountant)
Expert does NOT see "椅子" (chair)
→ Generates: "資金流動" (cash flow)
Stage 2 (Description): Expert sees "椅子" + "資金流動"
→ Applies keyword to original query
```
**Rationale**: Forces maximum semantic distance for novelty
**Risk**: Some keywords may be too distant → nonsense/hallucination
**RQ6**: Measure this tradeoff
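A sketch of the two-stage prompting under the same assumptions (OpenAI SDK, illustrative prompts); `ask` is a hypothetical wrapper, not the system's real helper:
```python
# Sketch: context-free keyword generation (Stage 1 hides the query).
# `ask` is a hypothetical wrapper around a chat-completion call.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def stage1_keyword(attribute: str, expert: str) -> str:
    # The expert sees the attribute only, never the original query.
    return ask(f"As a {expert}, name one concept you associate with '{attribute}'.")

def stage2_description(query: str, keyword: str) -> str:
    # Only now is the original query revealed and combined with the keyword.
    return ask(f"Apply the concept '{keyword}' to '{query}' as a product idea.")

keyword = stage1_keyword("wood", "accountant")   # e.g. "cash flow"
idea = stage2_description("chair", keyword)
```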
---
# The Semantic Distance Tradeoff
```
Too Close                   Optimal Zone                   Too Far
(Semantic Gravity)          (Creative)                     (Hallucination)
├───────────────────────────┼──────────────────────────────┼─────────────────────────┤
"Ergonomic office chair"    "Pressure-adaptive seating"    "Quantum chair consciousness"
High usefulness             High novelty + useful          High novelty, nonsense
Low novelty                                                Low usefulness
```
**H6**: Full Pipeline has higher nonsense rate than Direct, but acceptable (<20%)
---
# Measuring Nonsense/Hallucination (RQ6) - Three Methods
| Method | Metric | Pros | Cons |
|--------|--------|------|------|
| **Automatic** | Semantic distance > 0.85 | Fast, cheap | May miss contextual nonsense |
| **LLM-as-Judge** | GPT-4 relevance score (1-3) | Moderate cost, scalable | Potential LLM bias |
| **Human Evaluation** | Relevance rating (1-7 Likert) | Gold standard | Expensive, slow |
**Triangulation**: Compare all three methods
- Agreement → high confidence in nonsense detection
- Disagreement → interesting edge cases to analyze
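A minimal sketch of the automatic method only, assuming a sentence-transformers model (illustrative choice) and the 0.85 threshold from the table:
```python
# Sketch: flag ideas whose embedding distance from the query exceeds 0.85.
# Model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def nonsense_rate(query: str, ideas: list[str], threshold: float = 0.85) -> float:
    q = model.encode(query)
    flags = [1.0 - util.cos_sim(q, model.encode(i)).item() > threshold
             for i in ideas]
    return sum(flags) / len(flags)

print(nonsense_rate("chair", ["pressure-adaptive seating",
                              "quantum chair consciousness"]))
```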
---
# Core Hypotheses (2×2 Factorial)
| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1: Attributes** | (Attr-Only + Full) > (Direct + Expert-Only) | Semantic diversity |
| **H2: Experts** | (Expert-Only + Full) > (Direct + Attr-Only) | Semantic diversity |
| **H3: Interaction** | Full > (Attr-Only + Expert-Only - Direct) | Super-additive effect |
| **H4: Novelty** | Full Pipeline > all others | Patent novelty rate |
| **H5: Control** | Expert-Only > Random-Perspective | Validates expert knowledge |
| **H6: Tradeoff** | Full Pipeline nonsense rate < 20% | Nonsense rate |
---
# Experimental Conditions (2×2 Factorial)
| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| **C1: Direct** | ❌ | ❌ | Baseline: "Generate 20 ideas for [query]" |
| **C2: Expert-Only** | ❌ | ✅ | Expert personas generate for whole query |
| **C3: Attribute-Only** | ✅ | ❌ | Decompose query, direct generate per attribute |
| **C4: Full Pipeline** | ✅ | ✅ | Decompose query, experts generate per attribute |
| **C5: Random-Perspective** | ❌ | (random) | Control: random words as "perspectives" |
---
# Expected 2×2 Pattern
```
                     Without Experts       With Experts
                     ---------------       ------------
Without Attributes   Direct (low)          Expert-Only (medium)
With Attributes      Attr-Only (medium)    Full Pipeline (high)
```
**Key prediction**: The combination (Full Pipeline) produces **super-additive** effects
- Experts are more effective when given structured attributes to transform
- The interaction term should be statistically significant
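A sketch of how the interaction term could be tested, assuming statsmodels and a hypothetical results table with `diversity`, `attributes`, and `experts` columns:
```python
# Sketch: two-way ANOVA on the 2x2 design.
# Assumes statsmodels; file name and column names are illustrative.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("results.csv")  # one row per generated idea batch

model = smf.ols("diversity ~ C(attributes) * C(experts)", data=df).fit()
print(anova_lm(model, typ=2))    # a significant C(attributes):C(experts) row
                                 # supports the super-additive prediction
```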
---
# Query Dataset (30 Queries)
**Category A: Everyday Objects (10)**
- Chair, Umbrella, Backpack, Coffee mug, Bicycle...
**Category B: Technology & Tools (10)**
- Solar panel, Electric vehicle, 3D printer, Drone...
**Category C: Services & Systems (10)**
- Food delivery, Online education, Healthcare appointment...
**Total**: 30 queries × 5 conditions (4 factorial + 1 control) × 20 ideas = **3,000 ideas**
---
# Metrics: Automatic Evaluation
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Mean Pairwise Distance** | avg(1 - cos_sim(i, j)) | Higher = more diverse |
| **Silhouette Score** | Cluster cohesion vs separation | Higher = clearer clusters |
| **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from original |
| **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel |
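A minimal sketch computing the first and third metrics from embeddings, assuming sentence-transformers and scikit-learn (model choice is illustrative):
```python
# Sketch: mean pairwise distance and query distance from embeddings.
# Assumes sentence-transformers + scikit-learn; model is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def diversity_metrics(query: str, ideas: list[str]) -> dict[str, float]:
    E = model.encode(ideas)                  # (n, d) idea embeddings
    q = model.encode([query])                # (1, d) query embedding
    sim = cosine_similarity(E)               # pairwise idea similarities
    n = len(ideas)
    mpd = (1.0 - sim)[np.triu_indices(n, k=1)].mean()  # mean pairwise distance
    qd = (1.0 - cosine_similarity(q, E)).mean()        # mean query distance
    return {"mean_pairwise_distance": float(mpd), "query_distance": float(qd)}
```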
---
# Metrics: Human Evaluation
**Participants**: 60 evaluators (Prolific/MTurk)
**Rating Scales** (7-point Likert):
- **Novelty**: How novel/surprising is this idea?
- **Usefulness**: How practical is this idea?
- **Creativity**: How creative is this idea overall?
- **Relevance**: How relevant/coherent is this idea to the query? **(RQ6)**
- Nonsense? (open question: whether to add a direct nonsense rating)
**Quality Control**:
- Attention checks, completion time monitoring
- Inter-rater reliability (Cronbach's α > 0.7)
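A pure-NumPy sketch of the Cronbach's α check, treating raters as items; the demo data below is random, not real ratings:
```python
# Sketch: Cronbach's alpha for inter-rater reliability (alpha > 0.7 passes).
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: shape (n_ideas, n_raters)."""
    k = ratings.shape[1]                         # number of raters
    item_vars = ratings.var(axis=0, ddof=1)      # variance per rater
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)                   # random demo data only
print(cronbach_alpha(rng.integers(1, 8, size=(30, 3)).astype(float)))
```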
---
# What is Prolific/MTurk?
Online platforms for recruiting human participants for research studies.
| Platform | Description | Best For |
|----------|-------------|----------|
| **Prolific** | Academic-focused crowdsourcing | Research studies (higher quality) |
| **MTurk** | Amazon Mechanical Turk | Large-scale tasks (lower cost) |
**How it works for our study**:
1. Upload 600 ideas to evaluate (subset of generated ideas)
2. Recruit 60 participants (~$8-15/hour compensation)
3. Each participant rates ~30 ideas (novelty, usefulness, creativity)
4. Download ratings → statistical analysis
**Cost estimate**: 60 participants × 30 min × $12/hr = ~$360
---
# Alternative: LLM-as-Judge
If human evaluation is too expensive or time-consuming:
| Approach | Pros | Cons |
|----------|------|------|
| **Human (Prolific/MTurk)** | Gold standard, publishable | Cost, time, IRB approval |
| **LLM-as-Judge (GPT-4)** | Fast, cheap, reproducible | Less rigorous, potential bias |
| **Automatic metrics only** | No human cost | Missing subjective quality |
**Recommendation**: Start with automatic metrics, add human evaluation for final paper submission.
---
# 5. Implementation Status
## System Components (Implemented)
- Attribute decomposition pipeline
- Expert team generation (LLM, Curated, DBpedia sources)
- Expert transformation with parallel processing
- Semantic deduplication (embedding + LLM methods)
- Patent search integration
- Web-based visualization interface
---
# Implementation Checklist
### Experiment Scripts (To Do)
- [ ] `experiments/generate_ideas.py` - Idea generation
- [ ] `experiments/compute_metrics.py` - Automatic metrics
- [ ] `experiments/export_for_evaluation.py` - Human evaluation prep
- [ ] `experiments/analyze_results.py` - Statistical analysis
- [ ] `experiments/visualize.py` - Generate figures
---
# 6. Timeline
| Phase | Activity |
|-------|----------|
| **Phase 1** | Implement idea generation scripts |
| **Phase 2** | Generate all ideas (5 conditions × 30 queries) |
| **Phase 3** | Compute automatic metrics |
| **Phase 4** | Design and pilot human evaluation |
| **Phase 5** | Run human evaluation (60 participants) |
| **Phase 6** | Analyze results and write paper |
---
# Target Venues
### Tier 1 (Recommended)
- **CHI** - ACM Conference on Human Factors in Computing Systems (Sept deadline)
- **CSCW** - Computer-Supported Cooperative Work (Apr/Jan deadline)
- **Creativity & Cognition** - Specialized computational creativity
### Journal Options
- **IJHCS** - International Journal of Human-Computer Studies
- **TOCHI** - ACM Transactions on Computer-Human Interaction
---
# Key Contributions
1. **Theoretical**: "Semantic gravity" framework + two-factor solution
2. **Methodological**: 2×2 factorial design isolates attribute vs expert contributions
3. **Empirical**: Quantitative evidence for interaction effects in LLM creativity
4. **Practical**: Open-source system with both factors for maximum diversity
---
# Key Differentiator vs PersonaFlow
```
PersonaFlow (2024):  Query → Experts → Ideas
                     (Experts see WHOLE query, no structure)

Our Approach:        Query → Attributes → (Attributes × Experts) → Ideas
                     (Experts see SPECIFIC attributes, systematic)
```
**What we can answer that PersonaFlow cannot:**
1. Does problem structure alone help? (Attribute-Only vs Direct)
2. Do experts help beyond structure? (Full vs Attribute-Only)
3. Is there an interaction effect? (amplification hypothesis)
---
# Related Work Comparison
| Approach | Limitation | Our Advantage |
|----------|------------|---------------|
| Direct LLM | Semantic gravity | Two-factor enhancement |
| **PersonaFlow** | **No problem structure** | **Attribute decomposition amplifies experts** |
| PopBlends | Two-concept only | Systematic attribute × expert matrix |
| BILLY | Cannot isolate factors | 2×2 factorial isolates contributions |
---
# References (Key Papers)
1. Siangliulue et al. (2017) - Wisdom of Crowds via Role Assumption
2. Liu et al. (2024) - PersonaFlow: LLM-Simulated Expert Perspectives
3. Choi et al. (2023) - PopBlends: Conceptual Blending with LLMs
4. Wadinambiarachchi et al. (2024) - Effects of Generative AI on Design Fixation
5. Mednick (1962) - Semantic Distance Theory
6. Fauconnier & Turner (2002) - Conceptual Blending Theory
*Full reference list: 55+ papers in `research/references.md`*
---
# Questions & Discussion
## Next Steps
1. Finalize experimental design details
2. Implement experiment scripts
3. Collect pilot data for validation
4. Submit IRB for human evaluation (if needed)
---
# Thank You
**Project Repository**: novelty-seeking
**Research Materials**:
- `research/literature_review.md`
- `research/theoretical_framework.md`
- `research/experimental_protocol.md`
- `research/paper_outline.md`
- `research/references.md`