Experimental Protocol: Expert-Augmented LLM Ideation
Executive Summary
This document outlines a comprehensive experimental design to test the hypothesis that multi-expert LLM-based ideation produces more diverse and novel ideas than direct LLM generation.
1. Research Questions
| ID | Research Question |
| --- | --- |
| RQ1 | Does multi-expert generation produce higher semantic diversity than direct LLM generation? |
| RQ2 | Does multi-expert generation produce ideas with lower patent overlap (higher novelty)? |
| RQ3 | What is the optimal number of experts for maximizing diversity? |
| RQ4 | How do different expert sources (LLM vs. Curated vs. DBpedia) affect idea quality? |
| RQ5 | Does structured attribute decomposition enhance the multi-expert effect? |
2. Experimental Design Overview
2.1 Design Type
Mixed Design: Between-subjects for main conditions × Within-subjects for queries
2.2 Variables
Independent Variables (Manipulated)
| Variable | Levels | System Parameter |
| --- | --- | --- |
| Generation Method | 5 levels (see Section 3.1) | Condition-dependent |
| Expert Count | 1, 2, 4, 6, 8 | `expert_count` |
| Expert Source | LLM, Curated, DBpedia | `expert_source` |
| Attribute Structure | With/without decomposition | Pipeline inclusion |
Dependent Variables (Measured)
| Variable | Measurement Method |
| --- | --- |
| Semantic Diversity | Mean pairwise cosine distance (embeddings) |
| Cluster Spread | Number of clusters, silhouette score |
| Patent Novelty | 1 - (ideas with patent match / total ideas) |
| Semantic Distance | Distance from query centroid |
| Human Novelty Rating | 7-point Likert scale |
| Human Usefulness Rating | 7-point Likert scale |
| Human Creativity Rating | 7-point Likert scale |
Control Variables (Held Constant)
| Variable | Fixed Value |
| --- | --- |
| LLM Model | Qwen3:8b (or specify) |
| Temperature | 0.7 |
| Total Ideas per Query | 20 |
| Ideas per Keyword | 1 (`keywords_per_expert` varies by condition so totals stay at 20) |
| Deduplication | Disabled for raw comparison |
| Language | English (for patent search) |
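These fixed settings map naturally onto a small configuration object. A minimal sketch, assuming a dataclass-style container (the class and field names other than `expert_count`, `expert_source`, and `keywords_per_expert` are illustrative, not the system's actual API):

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    """Illustrative container for the control variables above."""
    model: str = "qwen3:8b"        # LLM Model (fixed)
    temperature: float = 0.7       # Temperature (fixed)
    total_ideas: int = 20          # Total Ideas per Query (fixed)
    dedupe: bool = False           # Deduplication disabled for raw comparison
    language: str = "en"           # English, for patent search
    expert_count: int = 4          # varied per condition (Section 3)
    keywords_per_expert: int = 5   # varied so expert_count x keywords = 20
```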
3. Experimental Conditions
3.1 Main Study: Generation Method Comparison
| Condition | Description | Implementation |
| --- | --- | --- |
| C1: Direct | Direct LLM generation | Prompt: "Generate 20 creative ideas for [query]" |
| C2: Single-Expert | 1 expert × 20 ideas | `expert_count=1, keywords_per_expert=20` |
| C3: Multi-Expert-4 | 4 experts × 5 ideas each | `expert_count=4, keywords_per_expert=5` |
| C4: Multi-Expert-8 | 8 experts × 2-3 ideas each | `expert_count=8, keywords_per_expert=2-3` |
| C5: Random-Perspective | 4 random words as "perspectives" | Custom prompt with random nouns |
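These conditions reduce to parameter overrides on the pipeline. A sketch (the condition names mirror the table; the dictionary itself is illustrative):

```python
# Overrides per condition; None marks conditions that bypass the expert
# pipeline and use a custom prompt instead (C1 direct, C5 random nouns).
CONDITIONS = {
    "C1_direct":        None,
    "C2_single_expert": {"expert_count": 1, "keywords_per_expert": 20},
    "C3_multi_4":       {"expert_count": 4, "keywords_per_expert": 5},
    "C4_multi_8":       {"expert_count": 8, "keywords_per_expert": 3},  # 2-3 in the table
    "C5_random":        None,
}
```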
3.2 Expert Count Study
| Condition | Expert Count | Ideas per Expert |
| --- | --- | --- |
| E1 | 1 | 20 |
| E2 | 2 | 10 |
| E4 | 4 | 5 |
| E6 | 6 | 3-4 |
| E8 | 8 | 2-3 |
3.3 Expert Source Study
| Condition | Source | Implementation |
| --- | --- | --- |
| S-LLM | LLM-generated | `expert_source=ExpertSource.LLM` |
| S-Curated | Curated list of 210 occupations | `expert_source=ExpertSource.CURATED` |
| S-DBpedia | DBpedia list of 2,164 occupations | `expert_source=ExpertSource.DBPEDIA` |
| S-Random | Random-word "experts" | Custom implementation |
4. Query Dataset
4.1 Design Principles
- Diversity: Cover multiple domains (consumer products, technology, services, abstract concepts)
- Complexity Variation: Simple objects to complex systems
- Familiarity Variation: Common items to specialized equipment
- Cultural Neutrality: Concepts understandable across cultures
4.2 Query Set (30 Queries)
Category A: Everyday Objects (10)
| ID | Query | Complexity |
| --- | --- | --- |
| A1 | Chair | Low |
| A2 | Umbrella | Low |
| A3 | Backpack | Low |
| A4 | Coffee mug | Low |
| A5 | Bicycle | Medium |
| A6 | Refrigerator | Medium |
| A7 | Smartphone | Medium |
| A8 | Running shoes | Medium |
| A9 | Kitchen knife | Low |
| A10 | Desk lamp | Low |
Category B: Technology & Tools (10)
| ID | Query | Complexity |
| --- | --- | --- |
| B1 | Solar panel | Medium |
| B2 | Electric vehicle | High |
| B3 | 3D printer | High |
| B4 | Drone | Medium |
| B5 | Smart thermostat | Medium |
| B6 | Noise-canceling headphones | Medium |
| B7 | Water purifier | Medium |
| B8 | Wind turbine | High |
| B9 | Robotic vacuum | Medium |
| B10 | Wearable fitness tracker | Medium |
Category C: Services & Systems (10)
| ID | Query | Complexity |
| --- | --- | --- |
| C1 | Food delivery service | Medium |
| C2 | Online education platform | High |
| C3 | Healthcare appointment system | High |
| C4 | Public transportation | High |
| C5 | Hotel booking system | Medium |
| C6 | Personal finance app | Medium |
| C7 | Grocery shopping experience | Medium |
| C8 | Parking solution | Medium |
| C9 | Elderly care service | High |
| C10 | Waste management system | High |
4.3 Sample Size Justification
Based on a CHI meta-study of typical effect sizes:
- Queries: 30 (crossed with conditions)
- Expected effect size: d = 0.5 (medium)
- Power target: 80%
- For automatic metrics: 30 queries × 5 conditions × 20 ideas = 3,000 ideas
- For human evaluation: Subset of 10 queries × 3 conditions × 20 ideas = 600 ideas
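As a sanity check on these targets, the per-group n for a simple two-sample comparison at d = 0.5 and 80% power can be computed with statsmodels (a simplification: the crossed design reuses every query in every condition, so the mixed-model power will be higher):

```python
from statsmodels.stats.power import TTestIndPower

# n per group for an independent-samples t-test, d = 0.5, alpha = .05, power = .80
n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n))  # ~64 per group
```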
5. Automatic Metrics Collection
5.1 Semantic Diversity Metrics
5.1.1 Mean Pairwise Distance (Primary)
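A minimal sketch of the primary metric, assuming ideas are embedded with a sentence-transformer (the model name is a placeholder, not mandated by this protocol):

```python
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import pdist

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def mean_pairwise_distance(ideas: list[str]) -> float:
    """avg(1 - cos_sim(i, j)) over all unordered pairs of ideas."""
    embeddings = model.encode(ideas)
    # pdist with metric="cosine" returns cosine distance, i.e. 1 - cos_sim
    return float(pdist(embeddings, metric="cosine").mean())
```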
5.1.2 Cluster Analysis
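For cluster spread, one standard recipe is k-means over the same embeddings with k chosen by silhouette (a sketch, reusing the embeddings from 5.1.1):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_spread(embeddings: np.ndarray, k_max: int = 10) -> tuple[int, float]:
    """Return (optimal cluster count by argmax silhouette, silhouette at that k)."""
    scores = {}
    for k in range(2, min(k_max, len(embeddings) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    best_k = max(scores, key=scores.get)
    return best_k, float(scores[best_k])
```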
5.1.3 Semantic Distance from Query
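Distance from the query is the mean cosine distance between the query embedding and each idea (a sketch, reusing `model` from the 5.1.1 sketch):

```python
from sklearn.metrics.pairwise import cosine_similarity

def query_distance(query: str, ideas: list[str]) -> float:
    """Mean 1 - cos_sim(query, idea); higher = farther from the original concept."""
    q = model.encode([query])   # `model` as defined in the 5.1.1 sketch
    e = model.encode(ideas)
    return float((1 - cosine_similarity(q, e)).mean())
```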
5.2 Patent Novelty Metrics
5.2.1 Patent Overlap Rate
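A sketch of the overlap rate; `has_patent_match` is a hypothetical callable standing in for whatever patent search backend is used (no specific API is assumed here):

```python
from typing import Callable

def patent_novelty_rate(ideas: list[str],
                        has_patent_match: Callable[[str], bool]) -> float:
    """1 - (ideas with a patent match / total ideas); higher = more novel."""
    matches = sum(1 for idea in ideas if has_patent_match(idea))
    return 1 - matches / len(ideas)
```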
5.3 Metrics Summary Table
| Metric | Formula | Interpretation |
| --- | --- | --- |
| Mean Pairwise Distance | `avg(1 - cos_sim(i, j))` over all pairs | Higher = more diverse |
| Silhouette Score | Cluster cohesion vs. separation | Higher = clearer clusters |
| Optimal Cluster Count | `argmax_k silhouette(k)` | More clusters = more themes |
| Query Distance | `1 - cos_sim(query, idea)` | Higher = farther from original |
| Patent Novelty Rate | `1 - (matches / total)` | Higher = more novel |
6. Human Evaluation Protocol
6.1 Participants
6.1.1 Recruitment
- Platform: Prolific, MTurk, or domain experts
- Sample Size: 60 evaluators (20 per condition group)
- Criteria:
- Native English speakers
- Bachelor's degree or higher
- Attention check pass rate > 80%
6.1.2 Compensation
- $15/hour equivalent
- ~30 minutes per session
- Bonus for high-quality ratings
6.2 Rating Scales
6.2.1 Novelty (7-point Likert)
6.2.2 Usefulness (7-point Likert)
6.2.3 Creativity (7-point Likert)
6.3 Procedure
1. Introduction (5 min)
   - Study purpose (without revealing hypotheses)
   - Rating scale explanation
   - Practice with 3 example ideas
2. Training (5 min)
   - Rate 5 calibration ideas with feedback
   - Discuss edge cases
3. Main Evaluation (20 min)
   - Rate 30 ideas (randomized order)
   - 3 attention check items embedded
   - Break after 15 ideas
4. Debriefing (2 min)
   - Demographics
   - Open-ended feedback
6.4 Quality Control
| Check | Threshold | Action |
| --- | --- | --- |
| Attention checks | < 2/3 correct | Exclude |
| Completion time | < 10 min | Flag for review |
| Variance in ratings | All same score | Exclude |
| Inter-rater reliability | Cronbach's α < 0.7 | Review ratings |
6.5 Analysis Plan
6.5.1 Reliability
- Cronbach's alpha for each scale
- ICC (Intraclass Correlation) for inter-rater agreement
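Both statistics are available in the pingouin package; a sketch, assuming `ratings_wide` (one column per rater) and `ratings_long` (one row per idea × rater) DataFrames with illustrative column names:

```python
import pingouin as pg

# Cronbach's alpha over raters (wide format: one column per rater)
alpha, ci = pg.cronbach_alpha(data=ratings_wide)

# ICC for inter-rater agreement (long format)
icc = pg.intraclass_corr(data=ratings_long, targets="idea_id",
                         raters="rater_id", ratings="novelty")
```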
6.5.2 Main Analysis
- Mixed-effects ANOVA: Condition × Query
- Post-hoc: Tukey HSD for pairwise comparisons
- Effect sizes: Cohen's d
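One statsmodels formulation treats condition as a fixed effect and query as a random intercept, with Tukey HSD for the pairwise follow-ups (a sketch; `df` and its column names are illustrative):

```python
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Mixed model: fixed effect of condition, random intercept per query
m = smf.mixedlm("novelty ~ C(condition)", df, groups=df["query_id"]).fit()
print(m.summary())

# Post-hoc pairwise comparisons across conditions
print(pairwise_tukeyhsd(df["novelty"], df["condition"]))
```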
6.5.3 Correlation with Automatic Metrics
- Pearson correlation: Human ratings vs semantic diversity
- Regression: Predict human ratings from automatic metrics
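A sketch of both steps, again with illustrative column names:

```python
from scipy.stats import pearsonr
import statsmodels.formula.api as smf

# Correlation between the automatic diversity metric and human novelty ratings
r, p = pearsonr(df["mean_pairwise_distance"], df["human_novelty"])

# Predict human ratings from the automatic metrics
ols = smf.ols("human_novelty ~ mean_pairwise_distance + query_distance"
              " + patent_novelty_rate", data=df).fit()
print(r, p, ols.rsquared)
```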
7. Experimental Procedure
7.1 Phase 1: Idea Generation
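A sketch of the generation loop; `generate_ideas` is a hypothetical stand-in for the system's actual pipeline call, and `CONDITIONS` is the mapping sketched in Section 3.1:

```python
import json

QUERIES = ["Chair", "Umbrella", "Backpack"]  # extend to the full 30 from Section 4.2

def generate_ideas(query, overrides, model, temperature, total_ideas):
    """Hypothetical stand-in: wire up the real ideation pipeline here."""
    raise NotImplementedError

results = []
for query in QUERIES:
    for name, overrides in CONDITIONS.items():
        ideas = generate_ideas(query, overrides, model="qwen3:8b",
                               temperature=0.7, total_ideas=20)
        results.append({"query": query, "condition": name, "ideas": ideas})

with open("ideas_raw.json", "w") as f:
    json.dump(results, f, indent=2)
```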
7.2 Phase 2: Automatic Metrics
7.3 Phase 3: Human Evaluation
7.4 Phase 4: Analysis
8. Implementation Checklist
8.1 Code to Implement
8.2 Data Files to Create
8.3 Analysis Outputs
9. Expected Results & Hypotheses
9.1 Primary Hypotheses
| Hypothesis | Prediction | Metric |
| --- | --- | --- |
| H1 | Multi-Expert-4 > Single-Expert > Direct | Semantic diversity |
| H2 | Multi-Expert-8 ≈ Multi-Expert-4 (diminishing returns) | Semantic diversity |
| H3 | Multi-Expert > Direct | Patent novelty rate |
| H4 | LLM experts > Curated > DBpedia | Unconventionality |
| H5 | With attributes > Without attributes | Overall diversity |
9.2 Expected Effect Sizes
Based on related work:
- Diversity increase: d = 0.5-0.8 (medium to large)
- Patent novelty increase: 20-40% improvement
- Human creativity rating: d = 0.3-0.5 (small to medium)
9.3 Potential Confounds
| Confound | Mitigation |
| --- | --- |
| Query difficulty | Crossed design (all queries × all conditions) |
| LLM variability | Multiple runs; fixed seed where possible |
| Evaluator bias | Randomized presentation, blinding |
| Order effects | Counterbalancing in human evaluation |
10. Timeline
| Week | Activity |
| --- | --- |
| 1-2 | Implement idea generation scripts |
| 3 | Generate all ideas (5 conditions × 30 queries) |
| 4 | Compute automatic metrics |
| 5 | Design and pilot human evaluation |
| 6-7 | Run human evaluation (60 participants) |
| 8 | Analyze results |
| 9-10 | Write paper |
| 11 | Internal review |
| 12 | Submit |
11. Appendix: Direct Generation Prompt
For baseline condition C1 (Direct LLM generation):
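A minimal template, matching the wording in Section 3.1 (the second line is an illustrative output-format instruction, not part of the recorded prompt):

```
Generate 20 creative ideas for [query].
Return one idea per line as a short, self-contained description.
```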
12. Appendix: Random Perspective Words
For condition C5 (Random-Perspective), sample from:
This condition tests whether *any* perspective shift helps, or whether *expert* perspectives specifically matter.