Experimental Protocol: Expert-Augmented LLM Ideation

Executive Summary

This document outlines a comprehensive experimental design to test the hypothesis that multi-expert LLM-based ideation produces more diverse and novel ideas than direct LLM generation.


1. Research Questions

| ID | Research Question |
|-----|-------------------|
| RQ1 | Does attribute decomposition improve semantic diversity of generated ideas? |
| RQ2 | Does expert perspective transformation improve semantic diversity of generated ideas? |
| RQ3 | Is there an interaction effect between attribute decomposition and expert perspectives? |
| RQ4 | Which combination produces the highest patent novelty (lowest overlap)? |
| RQ5 | How do different expert sources (LLM vs Curated vs External) affect idea quality? |
| RQ6 | Does context-free keyword generation (current design) increase hallucination/nonsense rate? |

Design Note: Context-Free Keyword Generation

Our system intentionally excludes the original query during keyword generation (Stage 1):

Stage 1 (Keyword): Expert sees "wood" (木質) + "accountant" (會計師)
                   Expert does NOT see "chair" (椅子)
                   → Generates: "cash flow" (資金流動)

Stage 2 (Description): Expert sees "chair" + "cash flow"
                       → Applies the keyword to the original query

Rationale: This forces maximum semantic distance in keyword generation. Risk: Some keywords may be too distant, resulting in nonsensical or unusable ideas. RQ6 investigates: What is the hallucination/nonsense rate, and is the tradeoff worthwhile?
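
To make the two-stage split concrete, the sketch below shows one way to wire it up. It is illustrative only: llm(prompt) is a hypothetical completion helper and the prompt wording is not the production prompt.

# Sketch of the context-free two-stage flow (Stage 1 never sees the query).
def stage1_keyword(attribute: str, expert: str, llm) -> str:
    """Stage 1: the expert sees only one attribute plus their occupation."""
    prompt = (
        f"You are a {expert}. Given the attribute '{attribute}', name one concept "
        f"from your professional domain related to it. Reply with a single keyword."
    )
    return llm(prompt).strip()

def stage2_description(query: str, keyword: str, llm) -> str:
    """Stage 2: the keyword is applied back to the original query."""
    prompt = (
        f"Original query: {query}\nKeyword: {keyword}\n"
        f"Describe one concrete idea that applies this keyword to the query in 15-30 words."
    )
    return llm(prompt).strip()

# e.g. stage1_keyword("wood", "accountant", llm) -> "cash flow"
#      stage2_description("chair", "cash flow", llm) -> an accounting-flavored chair idea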


2. Experimental Design Overview

2.1 Design Type

2×2 Factorial Design: Attribute Decomposition (With/Without) × Expert Perspectives (With/Without)

  • Within-subjects for queries (all queries tested across all conditions)

2.2 Variables

Independent Variables (Manipulated)

| Variable | Levels | Description |
|----------|--------|-------------|
| Attribute Decomposition | 2 levels: With / Without | Whether to decompose query into structured attributes |
| Expert Perspectives | 2 levels: With / Without | Whether to use expert personas for idea generation |
| Expert Source (secondary) | LLM, Curated, External | Source of expert occupations (tested within Expert=With conditions) |

Dependent Variables (Measured)

| Variable | Measurement Method |
|----------|--------------------|
| Semantic Diversity | Mean pairwise cosine distance (embeddings) |
| Cluster Spread | Number of clusters, silhouette score |
| Patent Novelty | 1 - (ideas with patent match / total ideas) |
| Semantic Distance | Distance from query centroid |
| Human Novelty Rating | 7-point Likert scale |
| Human Usefulness Rating | 7-point Likert scale |
| Human Creativity Rating | 7-point Likert scale |

Control Variables (Held Constant)

| Variable | Fixed Value |
|----------|-------------|
| LLM Model | Qwen3:8b (or specify) |
| Temperature | 0.7 |
| Total Ideas per Query | 20 |
| Keywords per Expert | 1 |
| Deduplication | Disabled for raw comparison |
| Language | English (for patent search) |
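
In code, these constants can live in a single configuration object shared by every condition. The dictionary below is an illustrative sketch; the key names are not taken from the codebase.

# Generation settings held constant across all conditions (values from the table above).
GENERATION_CONFIG = {
    "model": "qwen3:8b",            # LLM model held constant across conditions
    "temperature": 0.7,
    "ideas_per_query": 20,
    "keywords_per_expert": 1,
    "deduplication": False,         # disabled so raw outputs remain comparable
    "patent_search_language": "en", # English for patent search
}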

3. Experimental Conditions

3.1 Main Study: 2×2 Factorial Design

| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| C1: Direct | Without | Without | Baseline: "Generate 20 creative ideas for [query]" |
| C2: Expert-Only | Without | With | Expert personas generate ideas for the whole query |
| C3: Attribute-Only | With | Without | Decompose the query, direct generation per attribute |
| C4: Full Pipeline | With | With | Decompose the query, experts generate per attribute |

3.2 Control Condition

| Condition | Description | Purpose |
|-----------|-------------|---------|
| C5: Random-Perspective | 4 random words as "perspectives" | Tests if ANY perspective shift helps, or if EXPERT knowledge specifically matters |

3.3 Expert Source Study (Secondary, within Expert=With conditions)

| Condition | Source | Implementation |
|-----------|--------|----------------|
| S-LLM | LLM-generated | Query-specific experts generated by LLM |
| S-Curated | Curated occupations | Pre-selected high-quality occupations |
| S-External | External sources | Wikidata/ConceptNet occupations |

4. Query Dataset

4.1 Design Principles

  • Diversity: Cover multiple domains (consumer products, technology, services, abstract concepts)
  • Complexity Variation: Simple objects to complex systems
  • Familiarity Variation: Common items to specialized equipment
  • Cultural Neutrality: Concepts understandable across cultures

4.2 Query Set (30 Queries)

Category A: Everyday Objects (10)

| ID | Query | Complexity |
|----|-------|------------|
| A1 | Chair | Low |
| A2 | Umbrella | Low |
| A3 | Backpack | Low |
| A4 | Coffee mug | Low |
| A5 | Bicycle | Medium |
| A6 | Refrigerator | Medium |
| A7 | Smartphone | Medium |
| A8 | Running shoes | Medium |
| A9 | Kitchen knife | Low |
| A10 | Desk lamp | Low |

Category B: Technology & Tools (10)

| ID | Query | Complexity |
|----|-------|------------|
| B1 | Solar panel | Medium |
| B2 | Electric vehicle | High |
| B3 | 3D printer | High |
| B4 | Drone | Medium |
| B5 | Smart thermostat | Medium |
| B6 | Noise-canceling headphones | Medium |
| B7 | Water purifier | Medium |
| B8 | Wind turbine | High |
| B9 | Robotic vacuum | Medium |
| B10 | Wearable fitness tracker | Medium |

Category C: Services & Systems (10)

| ID | Query | Complexity |
|----|-------|------------|
| C1 | Food delivery service | Medium |
| C2 | Online education platform | High |
| C3 | Healthcare appointment system | High |
| C4 | Public transportation | High |
| C5 | Hotel booking system | Medium |
| C6 | Personal finance app | Medium |
| C7 | Grocery shopping experience | Medium |
| C8 | Parking solution | Medium |
| C9 | Elderly care service | High |
| C10 | Waste management system | High |

4.3 Sample Size Justification

Based on a CHI meta-study of effect sizes (a power-analysis sketch follows the list below):

  • Queries: 30 (crossed with conditions)
  • Expected effect size: d = 0.5 (medium)
  • Power target: 80%
  • For automatic metrics: 30 queries × 5 conditions × 20 ideas = 3,000 ideas
  • For human evaluation: Subset of 10 queries × 3 conditions × 20 ideas = 600 ideas
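
As a rough sanity check on these numbers, the sketch below uses the statsmodels power calculators. It treats a single two-condition contrast (paired at the query level, or independent groups), whereas the actual analysis is a mixed-effects ANOVA, so the result is indicative only.

# Rough power check for d = 0.5, 80% power, alpha = .05.
from statsmodels.stats.power import TTestPower, TTestIndPower

n_paired = TTestPower().solve_power(effect_size=0.5, power=0.8, alpha=0.05)
n_indep = TTestIndPower().solve_power(effect_size=0.5, power=0.8, alpha=0.05)

print(f"queries needed for a paired contrast: {n_paired:.1f}")   # ~34
print(f"per group for an independent contrast: {n_indep:.1f}")   # ~64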

5. Automatic Metrics Collection

5.1 Semantic Diversity Metrics

5.1.1 Mean Pairwise Distance (Primary)

from typing import List, Tuple

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compute_mean_pairwise_distance(ideas: List[str], embedding_model: str) -> Tuple[float, float]:
    """
    Compute the mean and standard deviation of cosine distance over all idea pairs.
    Higher mean = more diverse. get_embeddings() is the project's embedding helper.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)
    n = len(embeddings)
    distances = []
    for i in range(n):
        for j in range(i + 1, n):
            dist = 1 - cosine_similarity(embeddings[i], embeddings[j])
            distances.append(dist)
    return np.mean(distances), np.std(distances)

5.1.2 Cluster Analysis

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def compute_cluster_metrics(ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze idea clustering patterns.
    compute_cluster_sizes() is a project helper returning the size of each cluster.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)

    # Find the optimal k using the silhouette score
    silhouette_scores = []
    for k in range(2, min(len(ideas), 10)):
        kmeans = KMeans(n_clusters=k, random_state=0)  # fixed seed for reproducibility
        labels = kmeans.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        silhouette_scores.append((k, score))

    best_k, best_score = max(silhouette_scores, key=lambda x: x[1])

    return {
        'optimal_clusters': best_k,
        'silhouette_score': best_score,
        'cluster_distribution': compute_cluster_sizes(embeddings, best_k)
    }

5.1.3 Semantic Distance from Query

def compute_query_distance(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Measure how far ideas are from the original query.
    Higher = more novel/distant.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)

    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]

    return {
        'mean_distance': np.mean(distances),
        'max_distance': np.max(distances),
        'min_distance': np.min(distances),
        'std_distance': np.std(distances)
    }

5.2 Patent Novelty Metrics

5.2.1 Patent Overlap Rate

def compute_patent_novelty(ideas: List[str], query: str) -> dict:
    """
    Search patents for each idea and compute overlap rate.
    Uses existing patent_search_service.
    """
    matches = 0
    match_details = []

    for idea in ideas:
        result = patent_search_service.search(idea)
        if result.has_match:
            matches += 1
            match_details.append({
                'idea': idea,
                'patent': result.best_match
            })

    return {
        'novelty_rate': 1 - (matches / len(ideas)),
        'match_count': matches,
        'total_ideas': len(ideas),
        'match_details': match_details
    }

5.3 Hallucination/Nonsense Metrics (RQ6)

Since our design intentionally excludes the original query during keyword generation, we need to measure the "cost" of this approach.

5.3.1 LLM-as-Judge for Relevance

def compute_relevance_score(query: str, ideas: List[str], judge_model: str) -> dict:
    """
    Use LLM to judge if each idea is relevant/applicable to the original query.
    """
    relevant_count = 0
    nonsense_count = 0
    results = []

    for idea in ideas:
        prompt = f"""
        Original query: {query}
        Generated idea: {idea}

        Is this idea relevant and applicable to the original query?
        Rate: 1 (nonsense/irrelevant), 2 (weak connection), 3 (relevant)

        Return JSON: {{"score": N, "reason": "brief explanation"}}
        """
        result = llm_judge(prompt, model=judge_model)
        results.append(result)
        if result['score'] == 1:
            nonsense_count += 1
        elif result['score'] >= 2:
            relevant_count += 1

    return {
        'relevance_rate': relevant_count / len(ideas),
        'nonsense_rate': nonsense_count / len(ideas),
        'details': results
    }

5.3.2 Semantic Distance Threshold Analysis

def analyze_distance_threshold(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze which ideas exceed a "too far" semantic distance threshold.
    Ideas beyond threshold may be creative OR nonsensical.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)

    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]

    # Define thresholds (to be calibrated)
    CREATIVE_THRESHOLD = 0.6  # Ideas this far are "creative"
    NONSENSE_THRESHOLD = 0.85  # Ideas this far may be "nonsense"

    return {
        'creative_zone': sum(1 for d in distances if CREATIVE_THRESHOLD <= d < NONSENSE_THRESHOLD),
        'potential_nonsense': sum(1 for d in distances if d >= NONSENSE_THRESHOLD),
        'safe_zone': sum(1 for d in distances if d < CREATIVE_THRESHOLD),
        'distance_distribution': distances
    }

5.4 Metrics Summary Table

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| Mean Pairwise Distance | avg(1 - cos_sim(i, j)) for all pairs | Higher = more diverse |
| Silhouette Score | Cluster cohesion vs separation | Higher = clearer clusters |
| Optimal Cluster Count | argmax(silhouette) | More clusters = more themes |
| Query Distance | 1 - cos_sim(query, idea) | Higher = farther from original |
| Patent Novelty Rate | 1 - (matches / total) | Higher = more novel |

5.5 Nonsense/Hallucination Analysis (RQ6) - Three Methods

| Method | Metric | How it works | Pros/Cons |
|--------|--------|--------------|-----------|
| Automatic | Semantic Distance Threshold | Ideas with distance > 0.85 flagged as "potential nonsense" | Fast, cheap; may miss contextual nonsense |
| LLM-as-Judge | Relevance Score (1-3) | GPT-4 rates if idea is relevant to original query | Moderate cost; good balance |
| Human Evaluation | Relevance Rating (1-7 Likert) | Humans rate coherence/relevance | Gold standard; most expensive |

Triangulation: Compare all three methods to validate findings:

  • If automatic + LLM + human agree → high confidence
  • If they disagree → investigate why (interesting edge cases)
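
One lightweight way to operationalize the triangulation, assuming each method has already been reduced to a per-idea boolean nonsense flag, is sketched below; more formal agreement statistics such as Cohen's kappa could be substituted.

from itertools import combinations

def triangulate_nonsense(auto_flags, judge_flags, human_flags):
    """Pairwise agreement between the three nonsense measures, plus disagreeing ideas."""
    methods = {"automatic": auto_flags, "llm_judge": judge_flags, "human": human_flags}
    agreement = {}
    for (name_a, a), (name_b, b) in combinations(methods.items(), 2):
        agreement[f"{name_a}_vs_{name_b}"] = sum(x == y for x, y in zip(a, b)) / len(a)
    # Ideas the methods disagree on are the "interesting edge cases" noted above.
    disagreements = [i for i, flags in enumerate(zip(auto_flags, judge_flags, human_flags))
                     if len(set(flags)) > 1]
    return agreement, disagreements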

6. Human Evaluation Protocol

6.1 Participants

6.1.1 Recruitment

  • Platform: Prolific, MTurk, or domain experts
  • Sample Size: 60 evaluators (20 per condition group)
  • Criteria:
    • Native English speakers
    • Bachelor's degree or higher
    • Attention check pass rate > 80%

6.1.2 Compensation

  • $15/hour equivalent
  • ~30 minutes per session
  • Bonus for high-quality ratings

6.2 Rating Scales

6.2.1 Novelty (7-point Likert)

How novel/surprising is this idea?
1 = Not at all novel (very common/obvious)
4 = Moderately novel
7 = Extremely novel (never seen before)

6.2.2 Usefulness (7-point Likert)

How useful/practical is this idea?
1 = Not at all useful (impractical)
4 = Moderately useful
7 = Extremely useful (highly practical)

6.2.3 Creativity (7-point Likert)

How creative is this idea overall?
1 = Not at all creative
4 = Moderately creative
7 = Extremely creative

6.2.4 Relevance/Coherence (7-point Likert) - For RQ6

How relevant and coherent is this idea to the original query?
1 = Nonsense/completely irrelevant (no logical connection)
2 = Very weak connection (hard to see relevance)
3 = Weak connection (requires stretch to see relevance)
4 = Moderate connection (somewhat relevant)
5 = Good connection (clearly relevant)
6 = Strong connection (directly applicable)
7 = Perfect fit (highly relevant and coherent)

Note: This scale specifically measures the "cost" of context-free generation.

  • Ideas with high novelty but low relevance (1-3) = potential hallucination
  • Ideas with high novelty AND high relevance (5-7) = successful creative leap

6.3 Procedure

  1. Introduction (5 min)

    • Study purpose (without revealing hypotheses)
    • Rating scale explanation
    • Practice with 3 example ideas
  2. Training (5 min)

    • Rate 5 calibration ideas with feedback
    • Discuss edge cases
  3. Main Evaluation (20 min)

    • Rate 30 ideas (randomized order)
    • 3 attention check items embedded
    • Break after 15 ideas
  4. Debriefing (2 min)

    • Demographics
    • Open-ended feedback

6.4 Quality Control

| Check | Threshold | Action |
|-------|-----------|--------|
| Attention checks | < 2/3 correct | Exclude |
| Completion time | < 10 min | Flag for review |
| Variance in ratings | All same score | Exclude |
| Inter-rater reliability | Cronbach's α < 0.7 | Review ratings |
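
A sketch of how these rules could be applied automatically before analysis; it assumes a per-rating pandas DataFrame with evaluator_id, attention_correct (out of 3), completion_minutes, and rating columns (the column names are illustrative, not from the codebase).

import pandas as pd

def apply_quality_control(ratings: pd.DataFrame) -> pd.DataFrame:
    per_eval = ratings.groupby("evaluator_id").agg(
        attention=("attention_correct", "max"),
        minutes=("completion_minutes", "max"),
        rating_var=("rating", "var"),
    )
    excluded = per_eval[(per_eval["attention"] < 2) | (per_eval["rating_var"] == 0)].index
    flagged = per_eval[per_eval["minutes"] < 10].index   # flagged for manual review only
    print(f"excluded evaluators: {len(excluded)}; flagged for review: {len(flagged)}")
    return ratings[~ratings["evaluator_id"].isin(excluded)]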

6.5 Analysis Plan

6.5.1 Reliability

  • Cronbach's alpha for each scale
  • ICC (Intraclass Correlation) for inter-rater agreement
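
Cronbach's alpha can be computed directly from an evaluators-by-items score matrix; the sketch below implements the standard formula (the ICC can be obtained from a package such as pingouin).

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: 2-D array, rows = evaluators, columns = items rated on the same scale."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)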

6.5.2 Main Analysis

  • Mixed-effects ANOVA: Condition × Query
  • Post-hoc: Tukey HSD for pairwise comparisons
  • Effect sizes: Cohen's d
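
A sketch of the main analysis in statsmodels, assuming a long-format DataFrame df with rating, condition, and query columns (e.g. assembled from data/human_ratings/); treating query as the random grouping factor is one reasonable simplification of the Condition × Query design.

import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Mixed-effects model: condition as fixed effect, query as random grouping factor.
mixed = smf.mixedlm("rating ~ C(condition)", df, groups=df["query"]).fit()
print(mixed.summary())

# Post-hoc pairwise comparisons across conditions (Tukey HSD).
tukey = pairwise_tukeyhsd(endog=df["rating"], groups=df["condition"], alpha=0.05)
print(tukey.summary())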

6.5.3 Correlation with Automatic Metrics

  • Pearson correlation: Human ratings vs semantic diversity
  • Regression: Predict human ratings from automatic metrics
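
A matching sketch for the correlation step, assuming per-(query, condition) aggregates in a DataFrame cell_df that holds both the human rating means and the automatic metrics (column names are illustrative).

from scipy.stats import pearsonr
import statsmodels.formula.api as smf

r, p = pearsonr(cell_df["mean_pairwise_distance"], cell_df["human_creativity"])
print(f"diversity vs human creativity: r = {r:.2f}, p = {p:.3f}")

# Predict human ratings from the automatic metrics.
ols = smf.ols("human_creativity ~ mean_pairwise_distance + query_distance + patent_novelty_rate",
              data=cell_df).fit()
print(ols.summary())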

7. Experimental Procedure

7.1 Phase 1: Idea Generation

For each query Q in QuerySet:
    For each condition C in Conditions:

        If C == "Direct":
            # No attributes, no experts
            ideas = direct_llm_generation(Q, n=20)

        Elif C == "Expert-Only":
            # No attributes, with experts
            experts = generate_experts(Q, n=4)
            ideas = expert_generation_whole_query(Q, experts, ideas_per_expert=5)

        Elif C == "Attribute-Only":
            # With attributes, no experts
            attributes = decompose_attributes(Q)
            ideas = direct_generation_per_attribute(Q, attributes, ideas_per_attr=5)

        Elif C == "Full-Pipeline":
            # With attributes, with experts
            attributes = decompose_attributes(Q)
            experts = generate_experts(Q, n=4)
            ideas = expert_transformation(Q, attributes, experts, ideas_per_combo=1-2)

        Elif C == "Random-Perspective":
            # Control: random words instead of experts
            perspectives = random.sample(RANDOM_WORDS, 4)
            ideas = perspective_generation(Q, perspectives, ideas_per=5)

        Store(Q, C, ideas)

7.2 Phase 2: Automatic Metrics

For each (Q, C, ideas) in Results:
    metrics = {
        'diversity': compute_mean_pairwise_distance(ideas),
        'clusters': compute_cluster_metrics(ideas),
        'query_distance': compute_query_distance(Q, ideas),
        'patent_novelty': compute_patent_novelty(ideas, Q)
    }
    Store(Q, C, metrics)

7.3 Phase 3: Human Evaluation

# Sample selection
selected_queries = random.sample(QuerySet, 10)
selected_conditions = ["Direct", "Expert-Only", "Full-Pipeline"]  # 3 of the 5 conditions (see 4.3)

# Create evaluation set
evaluation_items = []
For each Q in selected_queries:
    For each C in selected_conditions:
        ideas = Get(Q, C)
        For each idea in ideas:
            evaluation_items.append((Q, C, idea))

# Randomize and assign to evaluators
random.shuffle(evaluation_items)
assignments = assign_to_evaluators(evaluation_items, n_evaluators=60)

# Collect ratings
ratings = collect_human_ratings(assignments)

7.4 Phase 4: Analysis

# Automatic metrics analysis
Run ANOVA: diversity ~ condition + query + condition:query
Run post-hoc: Tukey HSD for condition pairs
Compute effect sizes

# Human ratings analysis
Check reliability: Cronbach's alpha, ICC
Run mixed-effects model: rating ~ condition + (1|evaluator) + (1|query)
Compute correlations: human vs automatic metrics

# Visualization
Plot: Diversity by condition (box plots)
Plot: t-SNE of idea embeddings colored by condition
Plot: Expert count vs diversity curve

8. Implementation Checklist

8.1 Code to Implement

  • experiments/generate_ideas.py - Idea generation for all conditions
  • experiments/compute_metrics.py - Automatic metric computation
  • experiments/export_for_evaluation.py - Prepare human evaluation set
  • experiments/analyze_results.py - Statistical analysis
  • experiments/visualize.py - Generate figures

8.2 Data Files to Create

  • data/queries.json - 30 queries with metadata
  • data/random_words.json - Random perspective words
  • data/generated_ideas/ - Raw idea outputs
  • data/metrics/ - Computed metric results
  • data/human_ratings/ - Collected ratings

8.3 Analysis Outputs

  • results/diversity_by_condition.csv
  • results/patent_novelty_by_condition.csv
  • results/human_ratings_summary.csv
  • results/statistical_tests.txt
  • figures/ - All visualizations

9. Expected Results & Hypotheses

9.1 Primary Hypotheses (2×2 Factorial)

| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| H1: Main Effect of Attributes | Attribute-Only > Direct | Semantic diversity |
| H2: Main Effect of Experts | Expert-Only > Direct | Semantic diversity |
| H3: Interaction Effect | Full Pipeline > (Attribute-Only + Expert-Only - Direct) | Semantic diversity |
| H4: Novelty | Full Pipeline > all other conditions | Patent novelty rate |
| H5: Expert vs Random | Expert-Only > Random-Perspective | Validates expert knowledge matters |
| H6: Novelty-Usefulness Tradeoff | Full Pipeline has higher nonsense rate than Direct, but acceptable (<20%) | Nonsense rate |

9.2 Expected Pattern

                    Without Experts    With Experts
                    ---------------    ------------
Without Attributes    Direct (low)      Expert-Only (medium)
With Attributes       Attr-Only (medium) Full Pipeline (high)

Expected interaction: The combination (Full Pipeline) should produce super-additive effects: the benefit of experts is amplified when combined with structured attributes.
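
The super-additivity claim in H3 can be checked with a simple contrast on the four cell means; the sketch below uses hypothetical diversity values for illustration.

def interaction_contrast(direct: float, expert_only: float,
                         attr_only: float, full: float) -> float:
    """Positive = super-additive combination of attributes and experts (H3)."""
    additive_prediction = attr_only + expert_only - direct
    return full - additive_prediction

# Hypothetical cell means for mean pairwise distance:
# interaction_contrast(direct=0.30, expert_only=0.38, attr_only=0.36, full=0.48)  # -> +0.04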

9.3 Expected Effect Sizes

Based on related work:

  • Main effect of attributes: d = 0.3-0.5 (small to medium)
  • Main effect of experts: d = 0.4-0.6 (medium)
  • Interaction effect: d = 0.2-0.4 (small)
  • Patent novelty increase: 20-40% improvement
  • Human creativity rating: d = 0.3-0.5 (small to medium)

9.4 Potential Confounds

| Confound | Mitigation |
|----------|------------|
| Query difficulty | Crossed design (all queries × all conditions) |
| LLM variability | Multiple runs, fixed seed where possible |
| Evaluator bias | Randomized presentation, blinding |
| Order effects | Counterbalancing in human evaluation |

10. Timeline

| Week | Activity |
|------|----------|
| 1-2 | Implement idea generation scripts |
| 3 | Generate all ideas (5 conditions × 30 queries) |
| 4 | Compute automatic metrics |
| 5 | Design and pilot human evaluation |
| 6-7 | Run human evaluation (60 participants) |
| 8 | Analyze results |
| 9-10 | Write paper |
| 11 | Internal review |
| 12 | Submit |

11. Appendix: Direct Generation Prompt

For baseline condition C1 (Direct LLM generation):

You are a creative innovation consultant. Generate 20 unique and creative ideas
for improving or reimagining a [QUERY].

Requirements:
- Each idea should be distinct and novel
- Ideas should range from incremental improvements to radical innovations
- Consider different aspects: materials, functions, user experiences, contexts
- Provide a brief (15-30 word) description for each idea

Output format:
1. [Idea keyword]: [Description]
2. [Idea keyword]: [Description]
...
20. [Idea keyword]: [Description]
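
Downstream scripts will need to parse this numbered output; a minimal, tolerant parser might look like the sketch below (the regex and structure are illustrative, not the project's actual parser).

import re

LINE_PATTERN = re.compile(r"^\s*\d+\.\s*(?P<keyword>[^:]+):\s*(?P<description>.+)$")

def parse_ideas(llm_output: str) -> list:
    """Parse '1. [Idea keyword]: [Description]' lines into keyword/description dicts."""
    ideas = []
    for line in llm_output.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            ideas.append({
                "keyword": match.group("keyword").strip(),
                "description": match.group("description").strip(),
            })
    return ideas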

12. Appendix: Random Perspective Words

For condition C5 (Random-Perspective), sample from:

[
  "ocean", "mountain", "forest", "desert", "cave",
  "microscope", "telescope", "kaleidoscope", "prism", "lens",
  "butterfly", "elephant", "octopus", "eagle", "ant",
  "sunrise", "thunderstorm", "rainbow", "fog", "aurora",
  "clockwork", "origami", "mosaic", "symphony", "ballet",
  "ancient", "futuristic", "organic", "crystalline", "liquid",
  "whisper", "explosion", "rhythm", "silence", "echo"
]

This tests whether ANY perspective shift helps, or if EXPERT perspectives specifically matter.