# Experimental Protocol: Expert-Augmented LLM Ideation
## Executive Summary
This document outlines a comprehensive experimental design to test the hypothesis that multi-expert LLM-based ideation produces more diverse and novel ideas than direct LLM generation.
---
## 1. Research Questions
| ID | Research Question |
|----|-------------------|
| **RQ1** | Does multi-expert generation produce higher semantic diversity than direct LLM generation? |
| **RQ2** | Does multi-expert generation produce ideas with lower patent overlap (higher novelty)? |
| **RQ3** | What is the optimal number of experts for maximizing diversity? |
| **RQ4** | How do different expert sources (LLM vs Curated vs DBpedia) affect idea quality? |
| **RQ5** | Does structured attribute decomposition enhance the multi-expert effect? |
---
## 2. Experimental Design Overview
### 2.1 Design Type
**Mixed Design**: Between-subjects for main conditions × Within-subjects for queries
### 2.2 Variables
#### Independent Variables (Manipulated)
| Variable | Levels | System Parameter |
|----------|--------|----------------------|
| **Generation Method** | 5 levels (see conditions) | Condition-dependent |
| **Expert Count** | 1, 2, 4, 6, 8 | `expert_count` |
| **Expert Source** | LLM, Curated, DBpedia | `expert_source` |
| **Attribute Structure** | With/Without decomposition | Pipeline inclusion |
#### Dependent Variables (Measured)
| Variable | Measurement Method |
|----------|-------------------|
| **Semantic Diversity** | Mean pairwise cosine distance (embeddings) |
| **Cluster Spread** | Number of clusters, silhouette score |
| **Patent Novelty** | 1 - (ideas with patent match / total ideas) |
| **Semantic Distance** | Mean embedding distance from the query |
| **Human Novelty Rating** | 7-point Likert scale |
| **Human Usefulness Rating** | 7-point Likert scale |
| **Human Creativity Rating** | 7-point Likert scale |
#### Control Variables (Held Constant)
| Variable | Fixed Value |
|----------|-------------|
| LLM Model | Qwen3:8b (or specify) |
| Temperature | 0.7 |
| Total Ideas per Query | 20 |
| Keywords per Expert | 1 |
| Deduplication | Disabled for raw comparison |
| Language | English (for patent search) |
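To keep every condition run auditable, these settings could be pinned in a single config object; a minimal sketch (the dataclass itself and any field not named in the tables above are assumptions, values taken from the tables):

```python
from dataclasses import dataclass
from enum import Enum

class ExpertSource(Enum):   # levels of the Expert Source variable
    LLM = "llm"
    CURATED = "curated"
    DBPEDIA = "dbpedia"

@dataclass(frozen=True)
class GenerationConfig:
    expert_count: int = 4                           # manipulated: 1, 2, 4, 6, 8
    expert_source: ExpertSource = ExpertSource.LLM  # manipulated
    model: str = "qwen3:8b"                         # held constant
    temperature: float = 0.7                        # held constant
    total_ideas: int = 20                           # held constant per query
    keywords_per_expert: int = 1                    # held constant
    deduplicate: bool = False                       # disabled for raw comparison
    language: str = "en"                            # English for patent search
```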
---
## 3. Experimental Conditions
### 3.1 Main Study: Generation Method Comparison
| Condition | Description | Implementation |
|-----------|-------------|----------------|
| **C1: Direct** | Direct LLM generation | Prompt: "Generate 20 creative ideas for [query]" |
| **C2: Single-Expert** | 1 expert × 20 ideas | `expert_count=1`, `keywords_per_expert=20` |
| **C3: Multi-Expert-4** | 4 experts × 5 ideas each | `expert_count=4`, `keywords_per_expert=5` |
| **C4: Multi-Expert-8** | 8 experts × 2-3 ideas each | `expert_count=8`, `keywords_per_expert=2-3` |
| **C5: Random-Perspective** | 4 random words as "perspectives" | Custom prompt with random nouns |
### 3.2 Expert Count Study
| Condition | Expert Count | Ideas per Expert |
|-----------|--------------|------------------|
| **E1** | 1 | 20 |
| **E2** | 2 | 10 |
| **E4** | 4 | 5 |
| **E6** | 6 | 3-4 |
| **E8** | 8 | 2-3 |
### 3.3 Expert Source Study
| Condition | Source | Implementation |
|-----------|--------|----------------|
| **S-LLM** | LLM-generated | `expert_source=ExpertSource.LLM` |
| **S-Curated** | Curated 210 occupations | `expert_source=ExpertSource.CURATED` |
| **S-DBpedia** | DBpedia 2164 occupations | `expert_source=ExpertSource.DBPEDIA` |
| **S-Random** | Random word "experts" | Custom implementation |
---
## 4. Query Dataset
### 4.1 Design Principles
- **Diversity**: Cover multiple domains (consumer products, technology, services, abstract concepts)
- **Complexity Variation**: Simple objects to complex systems
- **Familiarity Variation**: Common items to specialized equipment
- **Cultural Neutrality**: Concepts understandable across cultures
### 4.2 Query Set (30 Queries)
#### Category A: Everyday Objects (10)
| ID | Query | Complexity |
|----|-------|------------|
| A1 | Chair | Low |
| A2 | Umbrella | Low |
| A3 | Backpack | Low |
| A4 | Coffee mug | Low |
| A5 | Bicycle | Medium |
| A6 | Refrigerator | Medium |
| A7 | Smartphone | Medium |
| A8 | Running shoes | Medium |
| A9 | Kitchen knife | Low |
| A10 | Desk lamp | Low |
#### Category B: Technology & Tools (10)
| ID | Query | Complexity |
|----|-------|------------|
| B1 | Solar panel | Medium |
| B2 | Electric vehicle | High |
| B3 | 3D printer | High |
| B4 | Drone | Medium |
| B5 | Smart thermostat | Medium |
| B6 | Noise-canceling headphones | Medium |
| B7 | Water purifier | Medium |
| B8 | Wind turbine | High |
| B9 | Robotic vacuum | Medium |
| B10 | Wearable fitness tracker | Medium |
#### Category C: Services & Systems (10)
| ID | Query | Complexity |
|----|-------|------------|
| C1 | Food delivery service | Medium |
| C2 | Online education platform | High |
| C3 | Healthcare appointment system | High |
| C4 | Public transportation | High |
| C5 | Hotel booking system | Medium |
| C6 | Personal finance app | Medium |
| C7 | Grocery shopping experience | Medium |
| C8 | Parking solution | Medium |
| C9 | Elderly care service | High |
| C10 | Waste management system | High |
### 4.3 Sample Size Justification
Based on [CHI meta-study on effect sizes](https://dl.acm.org/doi/10.1145/3706598.3713671):
- **Queries**: 30 (crossed with conditions)
- **Expected effect size**: d = 0.5 (medium)
- **Power target**: 80%
- **For automatic metrics**: 30 queries × 5 conditions × 20 ideas = 3,000 ideas
- **For human evaluation**: Subset of 10 queries × 3 conditions × 20 ideas = 600 ideas
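These targets can be sanity-checked with a standard power analysis; a minimal sketch using statsmodels, treating the crossed design conservatively as a paired comparison across queries:

```python
from statsmodels.stats.power import TTestIndPower, TTestPower

# Paired comparison across queries (every query sees every condition)
n_paired = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"queries needed (paired, d = 0.5): {n_paired:.1f}")   # ~34

# Independent-groups comparison, for reference (e.g., evaluator groups)
n_ind = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"n per group (independent, d = 0.5): {n_ind:.1f}")    # ~64
```

The paired estimate lands close to the 30-query set; the 20 ideas per cell and the mixed-effects analysis in Section 6.5 provide additional sensitivity beyond this conservative check.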
---
## 5. Automatic Metrics Collection
### 5.1 Semantic Diversity Metrics
#### 5.1.1 Mean Pairwise Distance (Primary)
```python
import numpy as np
from typing import List, Tuple
from sklearn.metrics.pairwise import cosine_similarity


def compute_mean_pairwise_distance(
    ideas: List[str], embedding_model: str
) -> Tuple[float, float]:
    """
    Compute mean cosine distance between all idea pairs.
    Higher = more diverse. Returns (mean, std).
    """
    embeddings = np.asarray(get_embeddings(ideas, model=embedding_model))
    sim = cosine_similarity(embeddings)       # n x n similarity matrix
    i, j = np.triu_indices(len(ideas), k=1)   # all pairs with i < j
    distances = 1 - sim[i, j]
    return float(np.mean(distances)), float(np.std(distances))
```
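Here and below, `get_embeddings` / `get_embedding` are assumed project utilities. A minimal stand-in using sentence-transformers (the model choice is an assumption):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def get_embeddings(texts, model=None):
    # `model` is accepted for interface compatibility with the snippets
    # above; a real implementation would dispatch on it.
    return np.asarray(_model.encode(texts, normalize_embeddings=True))

def get_embedding(text, model=None):
    return get_embeddings([text], model=model)[0]
```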
#### 5.1.2 Cluster Analysis
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def compute_cluster_metrics(ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze idea clustering patterns: sweep k and keep the k
    with the best silhouette score.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)
    silhouette_scores = []
    for k in range(2, min(len(ideas), 10)):
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)  # fixed seed
        labels = kmeans.fit_predict(embeddings)
        silhouette_scores.append((k, silhouette_score(embeddings, labels)))
    best_k, best_score = max(silhouette_scores, key=lambda x: x[1])
    return {
        'optimal_clusters': best_k,
        'silhouette_score': best_score,
        'cluster_distribution': compute_cluster_sizes(embeddings, best_k),
    }
```
#### 5.1.3 Semantic Distance from Query
```python
def compute_query_distance(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Measure how far ideas are from the original query.
    Higher = more novel/distant.
    """
    query_emb = np.asarray(get_embedding(query, model=embedding_model)).reshape(1, -1)
    idea_embs = np.asarray(get_embeddings(ideas, model=embedding_model))
    distances = 1 - cosine_similarity(query_emb, idea_embs)[0]
    return {
        'mean_distance': float(np.mean(distances)),
        'max_distance': float(np.max(distances)),
        'min_distance': float(np.min(distances)),
        'std_distance': float(np.std(distances)),
    }
```
### 5.2 Patent Novelty Metrics
#### 5.2.1 Patent Overlap Rate
```python
def compute_patent_novelty(ideas: List[str], query: str) -> dict:
    """
    Search patents for each idea and compute the overlap rate.
    Uses the existing patent_search_service.
    """
    matches = 0
    match_details = []
    for idea in ideas:
        result = patent_search_service.search(idea)
        if result.has_match:
            matches += 1
            match_details.append({
                'idea': idea,
                'patent': result.best_match,
            })
    return {
        'novelty_rate': 1 - (matches / len(ideas)),
        'match_count': matches,
        'total_ideas': len(ideas),
        'match_details': match_details,
    }
```
### 5.3 Metrics Summary Table
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Mean Pairwise Distance** | avg(1 - cos_sim(i, j)) for all pairs | Higher = more diverse |
| **Silhouette Score** | Cluster cohesion vs separation | Higher = clearer clusters |
| **Optimal Cluster Count** | argmax(silhouette) | More clusters = more themes |
| **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from original |
| **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel |
---
## 6. Human Evaluation Protocol
### 6.1 Participants
#### 6.1.1 Recruitment
- **Platform**: Prolific, MTurk, or domain experts
- **Sample Size**: 60 evaluators (20 per condition group)
- **Criteria**:
- Native English speakers
- Bachelor's degree or higher
- Attention check pass rate > 80%
#### 6.1.2 Compensation
- $15/hour equivalent
- ~30 minutes per session
- Bonus for high-quality ratings
### 6.2 Rating Scales
#### 6.2.1 Novelty (7-point Likert)
```
How novel/surprising is this idea?
1 = Not at all novel (very common/obvious)
4 = Moderately novel
7 = Extremely novel (never seen before)
```
#### 6.2.2 Usefulness (7-point Likert)
```
How useful/practical is this idea?
1 = Not at all useful (impractical)
4 = Moderately useful
7 = Extremely useful (highly practical)
```
#### 6.2.3 Creativity (7-point Likert)
```
How creative is this idea overall?
1 = Not at all creative
4 = Moderately creative
7 = Extremely creative
```
### 6.3 Procedure
1. **Introduction** (5 min)
- Study purpose (without revealing hypotheses)
- Rating scale explanation
- Practice with 3 example ideas
2. **Training** (5 min)
- Rate 5 calibration ideas with feedback
- Discuss edge cases
3. **Main Evaluation** (20 min)
- Rate 30 ideas (randomized order)
- 3 attention check items embedded
- Break after 15 ideas
4. **Debriefing** (2 min)
- Demographics
- Open-ended feedback
### 6.4 Quality Control
| Check | Threshold | Action |
|-------|-----------|--------|
| Attention checks | < 2/3 correct | Exclude |
| Completion time | < 10 min | Flag for review |
| Variance in ratings | All same score | Exclude |
| Inter-rater reliability | Cronbach's α < 0.7 | Review ratings |
### 6.5 Analysis Plan
#### 6.5.1 Reliability
- Cronbach's alpha for each scale
- ICC (Intraclass Correlation) for inter-rater agreement
#### 6.5.2 Main Analysis
- Mixed-effects ANOVA: Condition × Query
- Post-hoc: Tukey HSD for pairwise comparisons
- Effect sizes: Cohen's d
#### 6.5.3 Correlation with Automatic Metrics
- Pearson correlation: Human ratings vs semantic diversity
- Regression: Predict human ratings from automatic metrics
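A sketch of the reliability and correlation steps using pingouin and scipy; the ratings-table schema (`rater`, `idea_id`, `query`, `condition`, plus one column per scale) is an assumption:

```python
import pandas as pd
import pingouin as pg
from scipy.stats import pearsonr

ratings = pd.read_csv("data/human_ratings/ratings.csv")  # long format, assumed path

# Cronbach's alpha for the novelty scale, treating raters as items;
# pairwise NaN handling since each rater only sees a subset of ideas
wide = ratings.pivot_table(index="idea_id", columns="rater", values="novelty")
alpha, ci = pg.cronbach_alpha(data=wide, nan_policy="pairwise")

# Inter-rater agreement (ICC) on the same scale
icc = pg.intraclass_corr(data=ratings, targets="idea_id",
                         raters="rater", ratings="novelty", nan_policy="omit")

# Human novelty vs automatic diversity, per (query, condition) cell
human = ratings.groupby(["query", "condition"])["novelty"].mean().reset_index()
auto = pd.read_csv("results/diversity_by_condition.csv")
merged = human.merge(auto, on=["query", "condition"])
r, p = pearsonr(merged["novelty"], merged["mean_pairwise_distance"])
```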
---
## 7. Experimental Procedure
### 7.1 Phase 1: Idea Generation
```
For each query Q in QuerySet:
    For each condition C in Conditions:
        If C == "Direct":
            ideas = direct_llm_generation(Q, n=20)
        Elif C == "Single-Expert":
            expert = generate_expert(Q, n=1)
            ideas = expert_transformation(Q, expert, ideas_per_expert=20)
        Elif C == "Multi-Expert-4":
            experts = generate_experts(Q, n=4)
            ideas = expert_transformation(Q, experts, ideas_per_expert=5)
        Elif C == "Multi-Expert-8":
            experts = generate_experts(Q, n=8)
            ideas = expert_transformation(Q, experts, ideas_per_expert=2-3)
        Elif C == "Random-Perspective":
            perspectives = random.sample(RANDOM_WORDS, 4)
            ideas = perspective_generation(Q, perspectives, ideas_per=5)
        Store(Q, C, ideas)
```
### 7.2 Phase 2: Automatic Metrics
```
For each (Q, C, ideas) in Results:
    metrics = {
        'diversity': compute_mean_pairwise_distance(ideas),
        'clusters': compute_cluster_metrics(ideas),
        'query_distance': compute_query_distance(Q, ideas),
        'patent_novelty': compute_patent_novelty(ideas, Q)
    }
    Store(Q, C, metrics)
```
### 7.3 Phase 3: Human Evaluation
```
# Sample selection
selected_queries = random.sample(QuerySet, 10)
selected_conditions = ["Direct", "Multi-Expert-4", "Multi-Expert-8"]

# Create evaluation set
evaluation_items = []
For each Q in selected_queries:
    For each C in selected_conditions:
        ideas = Get(Q, C)
        For each idea in ideas:
            evaluation_items.append((Q, C, idea))

# Randomize and assign to evaluators
random.shuffle(evaluation_items)
assignments = assign_to_evaluators(evaluation_items, n_evaluators=60)

# Collect ratings
ratings = collect_human_ratings(assignments)
```
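`assign_to_evaluators` is left unspecified above; one balanced scheme, assuming each idea is rated by three evaluators (600 ideas × 3 ratings ÷ 60 raters = 30 items per rater, matching the 30-idea session in Section 6.3, and ignoring the per-condition grouping of Section 6.1.1 for simplicity):

```python
import random

def assign_to_evaluators(items, n_evaluators=60, ratings_per_item=3):
    """Round-robin each (query, condition, idea) item to `ratings_per_item`
    distinct evaluators, keeping per-evaluator load exactly balanced."""
    items = list(items)
    random.shuffle(items)
    assignments = {e: [] for e in range(n_evaluators)}
    slot = 0
    for item in items:
        for _ in range(ratings_per_item):  # consecutive slots -> distinct raters
            assignments[slot % n_evaluators].append(item)
            slot += 1
    return assignments
```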
### 7.4 Phase 4: Analysis
```
# Automatic metrics analysis
Run ANOVA: diversity ~ condition + query + condition:query
Run post-hoc: Tukey HSD for condition pairs
Compute effect sizes

# Human ratings analysis
Check reliability: Cronbach's alpha, ICC
Run mixed-effects model: rating ~ condition + (1|evaluator) + (1|query)
Compute correlations: human vs automatic metrics

# Visualization
Plot: Diversity by condition (box plots)
Plot: t-SNE of idea embeddings colored by condition
Plot: Expert count vs diversity curve
```
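A sketch of the automatic-metrics analysis with statsmodels; the CSV layout and column names are assumptions:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One row per (query, condition, run); the interaction term needs
# multiple runs per cell -- with a single run, drop it from the formula
df = pd.read_csv("results/diversity_by_condition.csv")

model = smf.ols("diversity ~ C(condition) * C(query)", data=df).fit()
print(anova_lm(model, typ=2))

# Post-hoc pairwise comparisons between conditions
print(pairwise_tukeyhsd(df["diversity"], df["condition"]))
```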
---
## 8. Implementation Checklist
### 8.1 Code to Implement
- [ ] `experiments/generate_ideas.py` - Idea generation for all conditions
- [ ] `experiments/compute_metrics.py` - Automatic metric computation
- [ ] `experiments/export_for_evaluation.py` - Prepare human evaluation set
- [ ] `experiments/analyze_results.py` - Statistical analysis
- [ ] `experiments/visualize.py` - Generate figures
### 8.2 Data Files to Create
- [ ] `data/queries.json` - 30 queries with metadata (example schema after this list)
- [ ] `data/random_words.json` - Random perspective words
- [ ] `data/generated_ideas/` - Raw idea outputs
- [ ] `data/metrics/` - Computed metric results
- [ ] `data/human_ratings/` - Collected ratings
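A possible shape for `data/queries.json`, mirroring the tables in Section 4.2 (field names are an assumption):

```json
[
  {"id": "A1", "query": "Chair", "category": "Everyday Objects", "complexity": "Low"},
  {"id": "B2", "query": "Electric vehicle", "category": "Technology & Tools", "complexity": "High"},
  {"id": "C9", "query": "Elderly care service", "category": "Services & Systems", "complexity": "High"}
]
```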
### 8.3 Analysis Outputs
- [ ] `results/diversity_by_condition.csv`
- [ ] `results/patent_novelty_by_condition.csv`
- [ ] `results/human_ratings_summary.csv`
- [ ] `results/statistical_tests.txt`
- [ ] `figures/` - All visualizations
---
## 9. Expected Results & Hypotheses
### 9.1 Primary Hypotheses
| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1** | Multi-Expert-4 > Single-Expert > Direct | Semantic diversity |
| **H2** | Multi-Expert-8 ≈ Multi-Expert-4 (diminishing returns) | Semantic diversity |
| **H3** | Multi-Expert > Direct | Patent novelty rate |
| **H4** | LLM experts > Curated > DBpedia | Unconventionality |
| **H5** | With attributes > Without attributes | Overall diversity |
### 9.2 Expected Effect Sizes
Based on related work:
- Diversity increase: d = 0.5-0.8 (medium to large)
- Patent novelty increase: 20-40% improvement
- Human creativity rating: d = 0.3-0.5 (small to medium)
### 9.3 Potential Confounds
| Confound | Mitigation |
|----------|-----------|
| Query difficulty | Crossed design (all queries × all conditions) |
| LLM variability | Multiple runs, fixed seed where possible |
| Evaluator bias | Randomized presentation, blinding |
| Order effects | Counterbalancing in human evaluation |
---
## 10. Timeline
| Week | Activity |
|------|----------|
| 1-2 | Implement idea generation scripts |
| 3 | Generate all ideas (5 conditions × 30 queries) |
| 4 | Compute automatic metrics |
| 5 | Design and pilot human evaluation |
| 6-7 | Run human evaluation (60 participants) |
| 8 | Analyze results |
| 9-10 | Write paper |
| 11 | Internal review |
| 12 | Submit |
---
## 11. Appendix: Direct Generation Prompt
For baseline condition C1 (Direct LLM generation):
```
You are a creative innovation consultant. Generate 20 unique and creative ideas
for improving or reimagining a [QUERY].
Requirements:
- Each idea should be distinct and novel
- Ideas should range from incremental improvements to radical innovations
- Consider different aspects: materials, functions, user experiences, contexts
- Provide a brief (15-30 word) description for each idea
Output format:
1. [Idea keyword]: [Description]
2. [Idea keyword]: [Description]
...
20. [Idea keyword]: [Description]
```
---
## 12. Appendix: Random Perspective Words
For condition C5 (Random-Perspective), sample from:
```json
[
"ocean", "mountain", "forest", "desert", "cave",
"microscope", "telescope", "kaleidoscope", "prism", "lens",
"butterfly", "elephant", "octopus", "eagle", "ant",
"sunrise", "thunderstorm", "rainbow", "fog", "aurora",
"clockwork", "origami", "mosaic", "symphony", "ballet",
"ancient", "futuristic", "organic", "crystalline", "liquid",
"whisper", "explosion", "rhythm", "silence", "echo"
]
```
This tests whether *any* perspective shift helps, or whether *expert* perspectives specifically matter.