---
marp: true
theme: default
paginate: true
size: 16:9
style: |
  section { font-size: 24px; }
  h1 { color: #2563eb; }
  h2 { color: #1e40af; }
  table { font-size: 20px; }
  .columns { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; }
---
Breaking Semantic Gravity
Expert-Augmented LLM Ideation for Enhanced Creativity
Research Progress Report
January 2026
Agenda
- Research Problem & Motivation
- Theoretical Framework: "Semantic Gravity"
- Proposed Solution: Expert-Augmented Ideation
- Experimental Design
- Implementation Progress
- Timeline & Next Steps
1. Research Problem
The Myth and Problem of LLM Creativity
Myth: LLMs enable infinite idea generation for creative tasks
Problem: Generated ideas lack diversity and novelty
- Ideas cluster around high-probability training distributions
- Limited exploration of distant conceptual spaces
- "Creative" outputs are interpolations, not extrapolations
The "Semantic Gravity" Phenomenon
Direct LLM Generation:
Input: "Generate creative ideas for a chair"
Result:
- "Ergonomic office chair" (high probability)
- "Foldable portable chair" (high probability)
- "Eco-friendly bamboo chair" (moderate probability)
Problem:
→ Ideas cluster in predictable semantic neighborhoods
→ Limited exploration of distant conceptual spaces
Why Does Semantic Gravity Occur?
| Factor | Description |
|---|---|
| Statistical Pattern Learning | LLMs learn co-occurrence patterns from training data |
| Model Collapse (to revisit) | Sampling from "creative ideas" distribution seen in training |
| Relevance Trap (to revisit) | Strong associations dominate weak ones |
| Domain Bias | Outputs gravitate toward category prototypes |
2. Theoretical Framework
Three Key Foundations
1. Semantic Distance Theory (Mednick, 1962)
   - Creativity correlates with conceptual "jump" distance
2. Conceptual Blending Theory (Fauconnier & Turner, 2002)
   - Creative products emerge from blending input spaces
3. Design Fixation (Jansson & Smith, 1991)
   - Blind adherence to initial ideas limits creativity
Semantic Distance in Action
Without Expert:
"Chair" → furniture, sitting, comfort, design
Semantic distance: SHORT
With Marine Biologist Expert:
"Chair" → underwater pressure, coral structure, buoyancy
Semantic distance: LONG
Result: Novel ideas like "pressure-adaptive seating"
Key Insight: Expert perspectives force semantic jumps that LLMs wouldn't naturally make.
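To make the notion of a semantic jump concrete, here is a minimal sketch that scores the distance between a query and candidate concepts; the sentence-transformers package and the all-MiniLM-L6-v2 model are assumptions for illustration, not choices made by the pipeline.

```python
# Sketch: quantifying semantic "jumps" with sentence embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_distance(a: str, b: str) -> float:
    """Cosine distance between two phrases (higher = larger jump)."""
    ea, eb = model.encode([a, b])
    cos_sim = float(np.dot(ea, eb) / (np.linalg.norm(ea) * np.linalg.norm(eb)))
    return 1.0 - cos_sim

query = "chair"
near = ["furniture", "sitting comfort", "ergonomic design"]    # short jumps
far = ["underwater pressure", "coral structure", "buoyancy"]   # long jumps
for concept in near + far:
    print(f"{concept:>22}: {semantic_distance(query, concept):.3f}")
```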
3. Proposed Solution
Expert-Augmented LLM Ideation Pipeline
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Attribute │ → │ Expert │ → │ Expert │
│ Decomposition│ │ Generation │ │Transformation│
└──────────────┘ └──────────────┘ └──────────────┘
│
▼
┌──────────────┐ ┌──────────────┐
│ Novelty │ ← │ Deduplication│
│ Validation │ │ │
└──────────────┘ └──────────────┘
From "Wisdom of Crowds" to "Inner Crowd"
Traditional Crowd:
- Person 1 → Ideas from perspective 1
- Person 2 → Ideas from perspective 2
- Aggregation → Diverse idea pool
Our "Inner Crowd":
- LLM + Expert 1 Persona → Ideas from perspective 1
- LLM + Expert 2 Persona → Ideas from perspective 2
- Aggregation → Diverse idea pool (simulated crowd)
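A minimal sketch of how this simulated crowd could be orchestrated: the OpenAI Python client, the model name, and the prompt wording are illustrative assumptions, not the project's actual prompts.

```python
# Sketch: one LLM prompted under several expert personas, outputs pooled.
from openai import OpenAI

client = OpenAI()
personas = ["marine biologist", "accountant", "origami artist"]

def inner_crowd_ideas(query: str, ideas_per_persona: int = 5) -> list[str]:
    pool = []
    for persona in personas:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": f"You are a {persona}. Answer strictly from that perspective."},
                {"role": "user",
                 "content": f"Propose {ideas_per_persona} unconventional ideas for: {query}"},
            ],
        )
        pool.extend(resp.choices[0].message.content.splitlines())
    # Keep non-empty lines, stripping list markers from the model output.
    return [line.strip("- ").strip() for line in pool if line.strip()]
```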
Expert Sources
| Source | Description | Coverage |
|---|---|---|
| LLM-Generated | Query-specific, prioritizes unconventional | Flexible |
| Curated | 210 pre-selected high-quality occupations | Controlled |
| DBpedia | 2,164 occupations from database | Broad |
4. Research Questions (2×2 Factorial Design)
| ID | Research Question |
|---|---|
| RQ1 | Does attribute decomposition improve semantic diversity? |
| RQ2 | Does expert perspective transformation improve semantic diversity? |
| RQ3 | Is there an interaction effect between the two factors? |
| RQ4 | Which combination produces the highest patent novelty? |
| RQ5 | How do expert sources (LLM vs Curated vs External) affect quality? |
| RQ6 | What is the hallucination/nonsense rate of context-free generation? |
Design Choice: Context-Free Keyword Generation
Our system intentionally excludes the original query during keyword generation:
Stage 1 (Keyword): Expert sees "木質" (wood) + "會計師" (accountant)
Expert does NOT see "椅子" (chair)
→ Generates: "資金流動" (cash flow)
Stage 2 (Description): Expert sees "椅子" (chair) + "資金流動" (cash flow)
→ Applies keyword to original query
Rationale: Forces maximum semantic distance for novelty
Risk: Some keywords may be too distant → nonsense/hallucination
RQ6: Measures this tradeoff
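A sketch of the two-stage, context-free flow described above; the `ask` helper, prompt wording, and model name are assumptions made for illustration. The key point is that Stage 1 never sees the original query.

```python
# Sketch: context-free keyword generation (Stage 1), then re-grounding (Stage 2).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def context_free_transform(query: str, attribute: str, expert: str) -> str:
    # Stage 1 (keyword): expert sees only the attribute, NOT the query.
    keyword = ask(
        f"As a {expert}, name one concept from your field most associated with "
        f"'{attribute}'. Answer with a single keyword."
    )
    # Stage 2 (description): the keyword is applied back to the original query.
    return ask(f"Describe a novel idea for '{query}' inspired by '{keyword}'.")

# Example: context_free_transform("chair", "wood", "accountant")
```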
The Semantic Distance Tradeoff
| Too Close (Semantic Gravity) | Optimal Zone (Creative) | Too Far (Hallucination) |
|---|---|---|
| "Ergonomic office chair" | "Pressure-adaptive seating" | "Quantum chair consciousness" |
| High usefulness, low novelty | High novelty + useful | High novelty, nonsense, low usefulness |
H6: Full Pipeline has higher nonsense rate than Direct, but acceptable (<20%)
Measuring Nonsense/Hallucination (RQ6) - Three Methods
| Method | Metric | Pros | Cons |
|---|---|---|---|
| Automatic | Semantic distance > 0.85 | Fast, cheap | May miss contextual nonsense |
| LLM-as-Judge | GPT-4 relevance score (1-3) | Moderate cost, scalable | Potential LLM bias |
| Human Evaluation | Relevance rating (1-7 Likert) | Gold standard | Expensive, slow |
Triangulation: Compare all three methods
- Agreement → high confidence in nonsense detection
- Disagreement → interesting edge cases to analyze
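For the automatic method, a minimal sketch of the threshold rule: the 0.85 cutoff comes from the table above, while the embedding model is an assumption.

```python
# Sketch: flag ideas whose embedding distance to the query exceeds a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def flag_nonsense(query: str, ideas: list[str], threshold: float = 0.85) -> list[bool]:
    q_emb = model.encode([query])[0]
    flags = []
    for emb in model.encode(ideas):
        cos_sim = float(np.dot(q_emb, emb) / (np.linalg.norm(q_emb) * np.linalg.norm(emb)))
        flags.append((1.0 - cos_sim) > threshold)  # distance above cutoff -> potential nonsense
    return flags
```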
Core Hypotheses (2×2 Factorial)
| Hypothesis | Prediction | Metric |
|---|---|---|
| H1: Attributes | (Attr-Only + Full) > (Direct + Expert-Only) | Semantic diversity |
| H2: Experts | (Expert-Only + Full) > (Direct + Attr-Only) | Semantic diversity |
| H3: Interaction | Full > (Attr-Only + Expert-Only - Direct) | Super-additive effect |
| H4: Novelty | Full Pipeline > all others | Patent novelty rate |
| H5: Control | Expert-Only > Random-Perspective | Validates expert knowledge |
| H6: Tradeoff | Full Pipeline nonsense rate < 20% | Nonsense rate |
Experimental Conditions (2×2 Factorial)
| Condition | Attributes | Experts | Description |
|---|---|---|---|
| C1: Direct | ❌ | ❌ | Baseline: "Generate 20 ideas for [query]" |
| C2: Expert-Only | ❌ | ✅ | Expert personas generate for whole query |
| C3: Attribute-Only | ✅ | ❌ | Decompose query, direct generate per attribute |
| C4: Full Pipeline | ✅ | ✅ | Decompose query, experts generate per attribute |
| C5: Random-Perspective | ❌ | (random) | Control: random words as "perspectives" |
Expected 2×2 Pattern
|  | Without Experts | With Experts |
|---|---|---|
| Without Attributes | Direct (low) | Expert-Only (medium) |
| With Attributes | Attr-Only (medium) | Full Pipeline (high) |
Key prediction: The combination (Full Pipeline) produces super-additive effects
- Experts are more effective when given structured attributes to transform
- The interaction term should be statistically significant
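The interaction test itself can be run as a standard two-way ANOVA; a sketch assuming pandas/statsmodels and a hypothetical results file with `attributes`, `experts`, and `diversity` columns (one row per query × condition):

```python
# Sketch: 2x2 ANOVA on diversity scores with attributes and experts as factors.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("results/diversity_by_condition.csv")  # hypothetical path

model = smf.ols("diversity ~ C(attributes) * C(experts)", data=df).fit()
print(anova_lm(model, typ=2))  # the C(attributes):C(experts) row tests H3
```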
Query Dataset (30 Queries)
Category A: Everyday Objects (10)
- Chair, Umbrella, Backpack, Coffee mug, Bicycle...
Category B: Technology & Tools (10)
- Solar panel, Electric vehicle, 3D printer, Drone...
Category C: Services & Systems (10)
- Food delivery, Online education, Healthcare appointment...
Total: 30 queries × 5 conditions (4 factorial + 1 control) × 20 ideas = 3,000 ideas
Metrics: Automatic Evaluation
| Metric | Formula | Interpretation |
|---|---|---|
| Mean Pairwise Distance | avg(1 - cos_sim(i, j)) | Higher = more diverse |
| Silhouette Score | Cluster cohesion vs separation | Higher = clearer clusters |
| Query Distance | 1 - cos_sim(query, idea) | Higher = farther from original |
| Patent Novelty Rate | 1 - (matches / total) | Higher = more novel |
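A sketch of how the pairwise-distance, query-distance, and patent-novelty metrics could be computed; the embedding model is an assumption, and the silhouette score is omitted because it additionally requires cluster labels.

```python
# Sketch: automatic diversity and novelty metrics over a set of generated ideas.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def _cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_pairwise_distance(ideas: list[str]) -> float:
    embs = model.encode(ideas)
    dists = [1 - _cos(embs[i], embs[j])
             for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return float(np.mean(dists))

def mean_query_distance(query: str, ideas: list[str]) -> float:
    q = model.encode([query])[0]
    return float(np.mean([1 - _cos(q, e) for e in model.encode(ideas)]))

def patent_novelty_rate(n_matches: int, n_total: int) -> float:
    return 1 - n_matches / n_total
```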
Metrics: Human Evaluation
Participants: 60 evaluators (Prolific/MTurk)
Rating Scales (7-point Likert):
- Novelty: How novel/surprising is this idea?
- Usefulness: How practical is this idea?
- Creativity: How creative is this idea overall?
- Relevance: How relevant/coherent is this idea to the query? (RQ6)
- Nonsense: separate scale still under consideration
Quality Control:
- Attention checks, completion time monitoring
- Inter-rater reliability (Cronbach's α > 0.7)
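Cronbach's α can be computed directly from its definition; a sketch assuming a fully crossed ratings matrix of shape (n_ideas, n_raters) for one scale (e.g., novelty):

```python
# Sketch: Cronbach's alpha for inter-rater reliability on one rating scale.
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: shape (n_ideas, n_raters)."""
    k = ratings.shape[1]                      # number of raters
    item_vars = ratings.var(axis=0, ddof=1)   # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of summed scores per idea
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```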
What is Prolific/MTurk?
Online platforms for recruiting human participants for research studies.
| Platform | Description | Best For |
|---|---|---|
| Prolific | Academic-focused crowdsourcing | Research studies (higher quality) |
| MTurk | Amazon Mechanical Turk | Large-scale tasks (lower cost) |
How it works for our study:
- Upload 600 ideas to evaluate (subset of generated ideas)
- Recruit 60 participants (~$8-15/hour compensation)
- Each participant rates ~30 ideas (novelty, usefulness, creativity)
- Download ratings → statistical analysis
Cost estimate: 60 participants × 30 min × $12/hr = ~$360
Alternative: LLM-as-Judge
If human evaluation is too expensive or time-consuming:
| Approach | Pros | Cons |
|---|---|---|
| Human (Prolific/MTurk) | Gold standard, publishable | Cost, time, IRB approval |
| LLM-as-Judge (GPT-4) | Fast, cheap, reproducible | Less rigorous, potential bias |
| Automatic metrics only | No human cost | Missing subjective quality |
Recommendation: Start with automatic metrics, add human evaluation for final paper submission.
5. Implementation Status
System Components (Implemented)
- Attribute decomposition pipeline
- Expert team generation (LLM, Curated, DBpedia sources)
- Expert transformation with parallel processing
- Semantic deduplication (embedding + LLM methods)
- Patent search integration
- Web-based visualization interface
Implementation Checklist
Experiment Scripts (To Do)
- `experiments/generate_ideas.py` - Idea generation
- `experiments/compute_metrics.py` - Automatic metrics
- `experiments/export_for_evaluation.py` - Human evaluation prep
- `experiments/analyze_results.py` - Statistical analysis
- `experiments/visualize.py` - Generate figures
6. Timeline
| Phase | Activity |
|---|---|
| Phase 1 | Implement idea generation scripts |
| Phase 2 | Generate all ideas (5 conditions × 30 queries) |
| Phase 3 | Compute automatic metrics |
| Phase 4 | Design and pilot human evaluation |
| Phase 5 | Run human evaluation (60 participants) |
| Phase 6 | Analyze results and write paper |
Target Venues
Tier 1 (Recommended)
- CHI - ACM Conference on Human Factors in Computing Systems (Sept deadline)
- CSCW - Computer-Supported Cooperative Work (Apr/Jan deadline)
- Creativity & Cognition - Specialized computational creativity
Journal Options
- IJHCS - International Journal of Human-Computer Studies
- TOCHI - ACM Transactions on Computer-Human Interaction
Key Contributions
- Theoretical: "Semantic gravity" framework + two-factor solution
- Methodological: 2×2 factorial design isolates attribute vs expert contributions
- Empirical: Quantitative evidence for interaction effects in LLM creativity
- Practical: Open-source system with both factors for maximum diversity
Key Differentiator vs PersonaFlow
PersonaFlow (2024): Query → Experts → Ideas
(Experts see WHOLE query, no structure)
Our Approach: Query → Attributes → (Attributes × Experts) → Ideas
(Experts see SPECIFIC attributes, systematic)
What we can answer that PersonaFlow cannot:
- Does problem structure alone help? (Attribute-Only vs Direct)
- Do experts help beyond structure? (Full vs Attribute-Only)
- Is there an interaction effect? (amplification hypothesis)
Related Work Comparison
| Approach | Limitation | Our Advantage |
|---|---|---|
| Direct LLM | Semantic gravity | Two-factor enhancement |
| PersonaFlow | No problem structure | Attribute decomposition amplifies experts |
| PopBlends | Two-concept only | Systematic attribute × expert matrix |
| BILLY | Cannot isolate factors | 2×2 factorial isolates contributions |
References (Key Papers)
- Siangliulue et al. (2017) - Wisdom of Crowds via Role Assumption
- Liu et al. (2024) - PersonaFlow: LLM-Simulated Expert Perspectives
- Choi et al. (2023) - PopBlends: Conceptual Blending with LLMs
- Wadinambiarachchi et al. (2024) - Effects of Generative AI on Design Fixation
- Mednick (1962) - Semantic Distance Theory
- Fauconnier & Turner (2002) - Conceptual Blending Theory
Full reference list: 55+ papers in research/references.md
Questions & Discussion
Next Steps
- Finalize experimental design details
- Implement experiment scripts
- Collect pilot data for validation
- Submit IRB for human evaluation (if needed)
Thank You
Project Repository: novelty-seeking
Research Materials:
- `research/literature_review.md`
- `research/theoretical_framework.md`
- `research/experimental_protocol.md`
- `research/paper_outline.md`
- `research/references.md`
Discussion
- Future work: Knowledge domain classification, Dewey Decimal Classification (DDC)