Experimental Protocol: Expert-Augmented LLM Ideation

Executive Summary

This document outlines a comprehensive experimental design to test the hypothesis that multi-expert LLM-based ideation produces more diverse and novel ideas than direct LLM generation.


1. Research Questions

| ID | Research Question |
| --- | --- |
| RQ1 | Does multi-expert generation produce higher semantic diversity than direct LLM generation? |
| RQ2 | Does multi-expert generation produce ideas with lower patent overlap (higher novelty)? |
| RQ3 | What is the optimal number of experts for maximizing diversity? |
| RQ4 | How do different expert sources (LLM vs Curated vs DBpedia) affect idea quality? |
| RQ5 | Does structured attribute decomposition enhance the multi-expert effect? |

2. Experimental Design Overview

2.1 Design Type

Mixed Design: Between-subjects for main conditions × Within-subjects for queries

2.2 Variables

Independent Variables (Manipulated)

| Variable | Levels | System Parameter |
| --- | --- | --- |
| Generation Method | 5 levels (see conditions) | Condition-dependent |
| Expert Count | 1, 2, 4, 6, 8 | expert_count |
| Expert Source | LLM, Curated, DBpedia | expert_source |
| Attribute Structure | With/without decomposition | Pipeline inclusion |

Dependent Variables (Measured)

| Variable | Measurement Method |
| --- | --- |
| Semantic Diversity | Mean pairwise cosine distance (embeddings) |
| Cluster Spread | Number of clusters, silhouette score |
| Patent Novelty | 1 - (ideas with patent match / total ideas) |
| Semantic Distance | Distance from query centroid |
| Human Novelty Rating | 7-point Likert scale |
| Human Usefulness Rating | 7-point Likert scale |
| Human Creativity Rating | 7-point Likert scale |

Control Variables (Held Constant)

| Variable | Fixed Value |
| --- | --- |
| LLM Model | Qwen3:8b (or specify) |
| Temperature | 0.7 |
| Total Ideas per Query | 20 |
| Keywords per Expert | 1 |
| Deduplication | Disabled for raw comparison |
| Language | English (for patent search) |

3. Experimental Conditions

3.1 Main Study: Generation Method Comparison

| Condition | Description | Implementation |
| --- | --- | --- |
| C1: Direct | Direct LLM generation | Prompt: "Generate 20 creative ideas for [query]" |
| C2: Single-Expert | 1 expert × 20 ideas | expert_count=1, keywords_per_expert=20 |
| C3: Multi-Expert-4 | 4 experts × 5 ideas each | expert_count=4, keywords_per_expert=5 |
| C4: Multi-Expert-8 | 8 experts × 2-3 ideas each | expert_count=8, keywords_per_expert=2-3 |
| C5: Random-Perspective | 4 random words as "perspectives" | Custom prompt with random nouns |
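
These conditions can be expressed as declarative configuration for the generation scripts in Section 8.1. A minimal sketch, using the system parameters listed above (expert_count, keywords_per_expert); the method labels and dictionary layout are illustrative assumptions, not an existing API:

CONDITIONS = {
    "C1_direct":             {"method": "direct"},
    "C2_single_expert":      {"method": "expert", "expert_count": 1, "keywords_per_expert": 20},
    "C3_multi_expert_4":     {"method": "expert", "expert_count": 4, "keywords_per_expert": 5},
    "C4_multi_expert_8":     {"method": "expert", "expert_count": 8, "keywords_per_expert": 3},  # 2-3 per expert in the protocol
    "C5_random_perspective": {"method": "random_perspective", "perspective_count": 4, "ideas_per_perspective": 5},
}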

3.2 Expert Count Study

| Condition | Expert Count | Ideas per Expert |
| --- | --- | --- |
| E1 | 1 | 20 |
| E2 | 2 | 10 |
| E4 | 4 | 5 |
| E6 | 6 | 3-4 |
| E8 | 8 | 2-3 |

3.3 Expert Source Study

| Condition | Source | Implementation |
| --- | --- | --- |
| S-LLM | LLM-generated | expert_source=ExpertSource.LLM |
| S-Curated | Curated (210 occupations) | expert_source=ExpertSource.CURATED |
| S-DBpedia | DBpedia (2,164 occupations) | expert_source=ExpertSource.DBPEDIA |
| S-Random | Random-word "experts" | Custom implementation |

4. Query Dataset

4.1 Design Principles

  • Diversity: Cover multiple domains (consumer products, technology, services, abstract concepts)
  • Complexity Variation: Simple objects to complex systems
  • Familiarity Variation: Common items to specialized equipment
  • Cultural Neutrality: Concepts understandable across cultures

4.2 Query Set (30 Queries)

Category A: Everyday Objects (10)

| ID | Query | Complexity |
| --- | --- | --- |
| A1 | Chair | Low |
| A2 | Umbrella | Low |
| A3 | Backpack | Low |
| A4 | Coffee mug | Low |
| A5 | Bicycle | Medium |
| A6 | Refrigerator | Medium |
| A7 | Smartphone | Medium |
| A8 | Running shoes | Medium |
| A9 | Kitchen knife | Low |
| A10 | Desk lamp | Low |

Category B: Technology & Tools (10)

| ID | Query | Complexity |
| --- | --- | --- |
| B1 | Solar panel | Medium |
| B2 | Electric vehicle | High |
| B3 | 3D printer | High |
| B4 | Drone | Medium |
| B5 | Smart thermostat | Medium |
| B6 | Noise-canceling headphones | Medium |
| B7 | Water purifier | Medium |
| B8 | Wind turbine | High |
| B9 | Robotic vacuum | Medium |
| B10 | Wearable fitness tracker | Medium |

Category C: Services & Systems (10)

| ID | Query | Complexity |
| --- | --- | --- |
| C1 | Food delivery service | Medium |
| C2 | Online education platform | High |
| C3 | Healthcare appointment system | High |
| C4 | Public transportation | High |
| C5 | Hotel booking system | Medium |
| C6 | Personal finance app | Medium |
| C7 | Grocery shopping experience | Medium |
| C8 | Parking solution | Medium |
| C9 | Elderly care service | High |
| C10 | Waste management system | High |

4.3 Sample Size Justification

Based on a CHI meta-study of effect sizes:

  • Queries: 30 (crossed with conditions)
  • Expected effect size: d = 0.5 (medium)
  • Power target: 80%
  • For automatic metrics: 30 queries × 5 conditions × 20 ideas = 3,000 ideas
  • For human evaluation: Subset of 10 queries × 3 conditions × 20 ideas = 600 ideas
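
As a sanity check, the per-group sample implied by d = 0.5 at 80% power can be verified with statsmodels. This is a minimal sketch for a simple two-group comparison; it ignores the additional power contributed by the crossed, repeated-measures structure over queries:

from statsmodels.stats.power import TTestIndPower

# Observations needed per condition for an independent two-sample t-test
# at d = 0.5, alpha = 0.05, power = 0.80 (approximately 64 per group).
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group: {n_per_group:.0f}")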

5. Automatic Metrics Collection

5.1 Semantic Diversity Metrics

5.1.1 Mean Pairwise Distance (Primary)

def compute_mean_pairwise_distance(ideas: List[str], embedding_model: str) -> Tuple[float, float]:
    """
    Compute the mean and standard deviation of cosine distance over all idea pairs.
    Higher mean = more diverse.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)
    n = len(embeddings)
    distances = []
    for i in range(n):
        for j in range(i + 1, n):
            dist = 1 - cosine_similarity(embeddings[i], embeddings[j])
            distances.append(dist)
    return np.mean(distances), np.std(distances)

5.1.2 Cluster Analysis

def compute_cluster_metrics(ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze idea clustering patterns (requires at least 3 ideas).
    """
    embeddings = get_embeddings(ideas, model=embedding_model)

    # Find optimal k using silhouette score; fixed seed for reproducibility
    silhouette_scores = []
    for k in range(2, min(len(ideas), 10)):
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        silhouette_scores.append((k, score))

    best_k, best_score = max(silhouette_scores, key=lambda x: x[1])

    return {
        'optimal_clusters': best_k,
        'silhouette_score': best_score,
        'cluster_distribution': compute_cluster_sizes(embeddings, best_k)
    }

5.1.3 Semantic Distance from Query

def compute_query_distance(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Measure how far ideas are from the original query.
    Higher = more novel/distant.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)

    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]

    return {
        'mean_distance': np.mean(distances),
        'max_distance': np.max(distances),
        'min_distance': np.min(distances),
        'std_distance': np.std(distances)
    }

5.2 Patent Novelty Metrics

5.2.1 Patent Overlap Rate

def compute_patent_novelty(ideas: List[str], query: str) -> dict:
    """
    Search patents for each idea and compute overlap rate.
    Uses existing patent_search_service.
    """
    matches = 0
    match_details = []

    for idea in ideas:
        result = patent_search_service.search(idea)
        if result.has_match:
            matches += 1
            match_details.append({
                'idea': idea,
                'patent': result.best_match
            })

    return {
        'novelty_rate': 1 - (matches / len(ideas)),
        'match_count': matches,
        'total_ideas': len(ideas),
        'match_details': match_details
    }

5.3 Metrics Summary Table

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Mean Pairwise Distance | avg(1 - cos_sim(i, j)) over all pairs | Higher = more diverse |
| Silhouette Score | Cluster cohesion vs separation | Higher = clearer clusters |
| Optimal Cluster Count | argmax(silhouette) | More clusters = more themes |
| Query Distance | 1 - cos_sim(query, idea) | Higher = farther from original |
| Patent Novelty Rate | 1 - (matches / total) | Higher = more novel |

6. Human Evaluation Protocol

6.1 Participants

6.1.1 Recruitment

  • Platform: Prolific, MTurk, or domain experts
  • Sample Size: 60 evaluators (20 per condition group)
  • Criteria:
    • Native English speakers
    • Bachelor's degree or higher
    • Attention check pass rate > 80%

6.1.2 Compensation

  • $15/hour equivalent
  • ~30 minutes per session
  • Bonus for high-quality ratings

6.2 Rating Scales

6.2.1 Novelty (7-point Likert)

How novel/surprising is this idea?
1 = Not at all novel (very common/obvious)
4 = Moderately novel
7 = Extremely novel (never seen before)

6.2.2 Usefulness (7-point Likert)

How useful/practical is this idea?
1 = Not at all useful (impractical)
4 = Moderately useful
7 = Extremely useful (highly practical)

6.2.3 Creativity (7-point Likert)

How creative is this idea overall?
1 = Not at all creative
4 = Moderately creative
7 = Extremely creative

6.3 Procedure

  1. Introduction (5 min)

    • Study purpose (without revealing hypotheses)
    • Rating scale explanation
    • Practice with 3 example ideas
  2. Training (5 min)

    • Rate 5 calibration ideas with feedback
    • Discuss edge cases
  3. Main Evaluation (20 min)

    • Rate 30 ideas (randomized order)
    • 3 attention check items embedded
    • Break after 15 ideas
  4. Debriefing (2 min)

    • Demographics
    • Open-ended feedback

6.4 Quality Control

| Check | Threshold | Action |
| --- | --- | --- |
| Attention checks | < 2/3 correct | Exclude |
| Completion time | < 10 min | Flag for review |
| Variance in ratings | All same score | Exclude |
| Inter-rater reliability | Cronbach's α < 0.7 | Review ratings |

6.5 Analysis Plan

6.5.1 Reliability

  • Cronbach's alpha for each scale
  • ICC (Intraclass Correlation) for inter-rater agreement
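
A minimal sketch of the Cronbach's alpha computation, assuming ratings are arranged in a DataFrame with one row per rated idea and one column per rater; ICC can be obtained analogously (e.g., via the pingouin package):

import pandas as pd

def cronbach_alpha(ratings: pd.DataFrame) -> float:
    """ratings: rows = rated items, columns = raters."""
    k = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1)
    total_variance = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)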

6.5.2 Main Analysis

  • Mixed-effects ANOVA: Condition × Query
  • Post-hoc: Tukey HSD for pairwise comparisons
  • Effect sizes: Cohen's d

6.5.3 Correlation with Automatic Metrics

  • Pearson correlation: Human ratings vs semantic diversity
  • Regression: Predict human ratings from automatic metrics
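
A minimal sketch of this step, assuming a per-idea DataFrame df with hypothetical columns human_novelty, pairwise_diversity, and query_distance:

from scipy.stats import pearsonr
import statsmodels.formula.api as smf

# Pearson correlation between human novelty ratings and semantic diversity
r, p = pearsonr(df["human_novelty"], df["pairwise_diversity"])
print(f"r = {r:.2f}, p = {p:.4f}")

# Regression: predict human ratings from the automatic metrics
ols_fit = smf.ols("human_novelty ~ pairwise_diversity + query_distance", data=df).fit()
print(ols_fit.summary())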

7. Experimental Procedure

7.1 Phase 1: Idea Generation

For each query Q in QuerySet:
    For each condition C in Conditions:

        If C == "Direct":
            ideas = direct_llm_generation(Q, n=20)

        Elif C == "Single-Expert":
            expert = generate_expert(Q, n=1)
            ideas = expert_transformation(Q, expert, ideas_per_expert=20)

        Elif C == "Multi-Expert-4":
            experts = generate_experts(Q, n=4)
            ideas = expert_transformation(Q, experts, ideas_per_expert=5)

        Elif C == "Multi-Expert-8":
            experts = generate_experts(Q, n=8)
            ideas = expert_transformation(Q, experts, ideas_per_expert=2-3)

        Elif C == "Random-Perspective":
            perspectives = random.sample(RANDOM_WORDS, 4)
            ideas = perspective_generation(Q, perspectives, ideas_per=5)

        Store(Q, C, ideas)
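
A minimal sketch of the Store step, writing one JSON file per (query, condition) cell into data/generated_ideas/ (file layout per Section 8.2); the field names are illustrative:

import json
from pathlib import Path
from typing import List

def store(query_id: str, condition: str, ideas: List[str],
          out_dir: str = "data/generated_ideas") -> None:
    """Persist the raw ideas for one (query, condition) cell."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    record = {"query_id": query_id, "condition": condition, "ideas": ideas}
    with open(path / f"{query_id}_{condition}.json", "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)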

7.2 Phase 2: Automatic Metrics

For each (Q, C, ideas) in Results:
    metrics = {
        'diversity': compute_mean_pairwise_distance(ideas),
        'clusters': compute_cluster_metrics(ideas),
        'query_distance': compute_query_distance(Q, ideas),
        'patent_novelty': compute_patent_novelty(ideas, Q)
    }
    Store(Q, C, metrics)

7.3 Phase 3: Human Evaluation

# Sample selection
selected_queries = random.sample(QuerySet, 10)
selected_conditions = ["Direct", "Multi-Expert-4", "Multi-Expert-8"]

# Create evaluation set
evaluation_items = []
For each Q in selected_queries:
    For each C in selected_conditions:
        ideas = Get(Q, C)
        For each idea in ideas:
            evaluation_items.append((Q, C, idea))

# Randomize and assign to evaluators
random.shuffle(evaluation_items)
assignments = assign_to_evaluators(evaluation_items, n_evaluators=60)

# Collect ratings
ratings = collect_human_ratings(assignments)
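
A minimal sketch of assign_to_evaluators: with 600 ideas, 60 evaluators, and 30 ratings per evaluator, each idea receives three independent ratings. The round-robin below is illustrative; a production version would also guarantee that no evaluator sees the same idea twice and counterbalance condition order:

import random
from typing import List, Tuple

Item = Tuple[str, str, str]  # (query, condition, idea)

def assign_to_evaluators(items: List[Item], n_evaluators: int,
                         ratings_per_item: int = 3) -> List[List[Item]]:
    """Replicate each item ratings_per_item times, then spread round-robin over evaluators."""
    pool = items * ratings_per_item
    random.shuffle(pool)
    assignments = [[] for _ in range(n_evaluators)]
    for i, item in enumerate(pool):
        assignments[i % n_evaluators].append(item)
    return assignments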

7.4 Phase 4: Analysis

# Automatic metrics analysis
Run ANOVA: diversity ~ condition + query + condition:query
Run post-hoc: Tukey HSD for condition pairs
Compute effect sizes

# Human ratings analysis
Check reliability: Cronbach's alpha, ICC
Run mixed-effects model: rating ~ condition + (1|evaluator) + (1|query)
Compute correlations: human vs automatic metrics

# Visualization
Plot: Diversity by condition (box plots)
Plot: t-SNE of idea embeddings colored by condition
Plot: Expert count vs diversity curve
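
A minimal sketch of the core tests, assuming a long-format DataFrame auto_df with hypothetical columns condition, query, diversity, and a ratings_df with evaluator, query, condition, rating:

import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Automatic metrics: condition effect on diversity, with query as the grouping factor
m_auto = smf.mixedlm("diversity ~ condition", data=auto_df, groups=auto_df["query"]).fit()
print(m_auto.summary())

# Post-hoc pairwise comparison of conditions
tukey = pairwise_tukeyhsd(endog=auto_df["diversity"], groups=auto_df["condition"], alpha=0.05)
print(tukey.summary())

# Human ratings: random intercept per evaluator, query as a variance component
m_human = smf.mixedlm("rating ~ condition", data=ratings_df,
                      groups=ratings_df["evaluator"],
                      vc_formula={"query": "0 + C(query)"}).fit()
print(m_human.summary())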

8. Implementation Checklist

8.1 Code to Implement

  • experiments/generate_ideas.py - Idea generation for all conditions
  • experiments/compute_metrics.py - Automatic metric computation
  • experiments/export_for_evaluation.py - Prepare human evaluation set
  • experiments/analyze_results.py - Statistical analysis
  • experiments/visualize.py - Generate figures

8.2 Data Files to Create

  • data/queries.json - 30 queries with metadata
  • data/random_words.json - Random perspective words
  • data/generated_ideas/ - Raw idea outputs
  • data/metrics/ - Computed metric results
  • data/human_ratings/ - Collected ratings

8.3 Analysis Outputs

  • results/diversity_by_condition.csv
  • results/patent_novelty_by_condition.csv
  • results/human_ratings_summary.csv
  • results/statistical_tests.txt
  • figures/ - All visualizations

9. Expected Results & Hypotheses

9.1 Primary Hypotheses

| Hypothesis | Prediction | Metric |
| --- | --- | --- |
| H1 | Multi-Expert-4 > Single-Expert > Direct | Semantic diversity |
| H2 | Multi-Expert-8 ≈ Multi-Expert-4 (diminishing returns) | Semantic diversity |
| H3 | Multi-Expert > Direct | Patent novelty rate |
| H4 | LLM experts > Curated > DBpedia | Unconventionality |
| H5 | With attributes > Without attributes | Overall diversity |

9.2 Expected Effect Sizes

Based on related work:

  • Diversity increase: d = 0.5-0.8 (medium to large)
  • Patent novelty increase: 20-40% improvement
  • Human creativity rating: d = 0.3-0.5 (small to medium)

9.3 Potential Confounds

| Confound | Mitigation |
| --- | --- |
| Query difficulty | Crossed design (all queries × all conditions) |
| LLM variability | Multiple runs, fixed seed where possible |
| Evaluator bias | Randomized presentation, blinding |
| Order effects | Counterbalancing in human evaluation |

10. Timeline

| Week | Activity |
| --- | --- |
| 1-2 | Implement idea generation scripts |
| 3 | Generate all ideas (5 conditions × 30 queries) |
| 4 | Compute automatic metrics |
| 5 | Design and pilot human evaluation |
| 6-7 | Run human evaluation (60 participants) |
| 8 | Analyze results |
| 9-10 | Write paper |
| 11 | Internal review |
| 12 | Submit |

11. Appendix: Direct Generation Prompt

For baseline condition C1 (Direct LLM generation):

You are a creative innovation consultant. Generate 20 unique and creative ideas
for improving or reimagining a [QUERY].

Requirements:
- Each idea should be distinct and novel
- Ideas should range from incremental improvements to radical innovations
- Consider different aspects: materials, functions, user experiences, contexts
- Provide a brief (15-30 word) description for each idea

Output format:
1. [Idea keyword]: [Description]
2. [Idea keyword]: [Description]
...
20. [Idea keyword]: [Description]
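
A minimal sketch of issuing this prompt against the locally served model, assuming Qwen3:8b is exposed through an OpenAI-compatible endpoint (e.g., Ollama at localhost:11434); the endpoint, model tag, and DIRECT_PROMPT constant are assumptions about the local setup:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # local server; key is unused

def direct_llm_generation(query: str) -> str:
    """Condition C1: returns raw text containing 20 ideas; parsing happens downstream."""
    prompt = DIRECT_PROMPT.replace("[QUERY]", query)  # DIRECT_PROMPT = template above
    response = client.chat.completions.create(
        model="qwen3:8b",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # fixed control variable (Section 2.2)
    )
    return response.choices[0].message.content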

12. Appendix: Random Perspective Words

For condition C5 (Random-Perspective), sample from:

[
  "ocean", "mountain", "forest", "desert", "cave",
  "microscope", "telescope", "kaleidoscope", "prism", "lens",
  "butterfly", "elephant", "octopus", "eagle", "ant",
  "sunrise", "thunderstorm", "rainbow", "fog", "aurora",
  "clockwork", "origami", "mosaic", "symphony", "ballet",
  "ancient", "futuristic", "organic", "crystalline", "liquid",
  "whisper", "explosion", "rhythm", "silence", "echo"
]

This condition tests whether *any* perspective shift helps, or whether *expert* perspectives specifically matter.