Experimental Protocol: Expert-Augmented LLM Ideation

Executive Summary

This document outlines a comprehensive experimental design to test the hypothesis that multi-expert LLM-based ideation produces more diverse and novel ideas than direct LLM generation.


1. Research Questions

| ID | Research Question |
|-----|-------------------|
| RQ1 | Does attribute decomposition improve semantic diversity of generated ideas? |
| RQ2 | Does expert perspective transformation improve semantic diversity of generated ideas? |
| RQ3 | Is there an interaction effect between attribute decomposition and expert perspectives? |
| RQ4 | Which combination produces the highest patent novelty (lowest overlap)? |
| RQ5 | How do different expert sources (LLM vs Curated vs External) affect idea quality? |
| RQ6 | Does context-free keyword generation (current design) increase hallucination/nonsense rate? |

Design Note: Context-Free Keyword Generation

Our system intentionally excludes the original query during keyword generation (Stage 1):

Stage 1 (Keyword): Expert sees "wood" (木質) + "accountant" (會計師)
                   Expert does NOT see "chair" (椅子)
                   → Generates: "cash flow" (資金流動)

Stage 2 (Description): Expert sees "chair" + "cash flow"
                       → Applies the keyword to the original query

Rationale: This forces maximum semantic distance in keyword generation. Risk: Some keywords may be too distant, resulting in nonsensical or unusable ideas. RQ6 investigates: What is the hallucination/nonsense rate, and is the tradeoff worthwhile?
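
To make the two-stage split concrete, the sketch below shows one way to wire it up. It is illustrative only: llm(prompt) is a hypothetical completion helper and the prompt wording is not the production prompt.

# Sketch of the context-free two-stage flow (Stage 1 never sees the query).
def stage1_keyword(attribute: str, expert: str, llm) -> str:
    """Stage 1: the expert sees only one attribute plus their occupation."""
    prompt = (
        f"You are a {expert}. Given the attribute '{attribute}', name one concept "
        f"from your professional domain related to it. Reply with a single keyword."
    )
    return llm(prompt).strip()

def stage2_description(query: str, keyword: str, llm) -> str:
    """Stage 2: the keyword is applied back to the original query."""
    prompt = (
        f"Original query: {query}\nKeyword: {keyword}\n"
        f"Describe one concrete idea that applies this keyword to the query in 15-30 words."
    )
    return llm(prompt).strip()

# e.g. stage1_keyword("wood", "accountant", llm) -> "cash flow"
#      stage2_description("chair", "cash flow", llm) -> an accounting-flavored chair idea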


2. Experimental Design Overview

2.1 Design Type

2×2 Factorial Design: Attribute Decomposition (With/Without) × Expert Perspectives (With/Without)

  • Within-subjects for queries (all queries tested across all conditions)

2.2 Variables

Independent Variables (Manipulated)

| Variable | Levels | Description |
|----------|--------|-------------|
| Attribute Decomposition | 2 levels: With / Without | Whether to decompose query into structured attributes |
| Expert Perspectives | 2 levels: With / Without | Whether to use expert personas for idea generation |
| Expert Source (secondary) | LLM, Curated, External | Source of expert occupations (tested within Expert=With conditions) |

Dependent Variables (Measured)

| Variable | Measurement Method |
|----------|--------------------|
| Semantic Diversity | Mean pairwise cosine distance (embeddings) |
| Cluster Spread | Number of clusters, silhouette score |
| Patent Novelty | 1 - (ideas with patent match / total ideas) |
| Semantic Distance | Distance from query centroid |
| Human Novelty Rating | 7-point Likert scale |
| Human Usefulness Rating | 7-point Likert scale |
| Human Creativity Rating | 7-point Likert scale |

Control Variables (Held Constant)

| Variable | Fixed Value |
|----------|-------------|
| LLM Model | Qwen3:8b (or specify) |
| Temperature | 0.7 |
| Total Ideas per Query | 20 |
| Keywords per Expert | 1 |
| Deduplication | Disabled for raw comparison |
| Language | English (for patent search) |
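
In code, these constants can live in a single configuration object shared by every condition. The dictionary below is an illustrative sketch; the key names are not taken from the codebase.

# Generation settings held constant across all conditions (values from the table above).
GENERATION_CONFIG = {
    "model": "qwen3:8b",            # LLM model held constant across conditions
    "temperature": 0.7,
    "ideas_per_query": 20,
    "keywords_per_expert": 1,
    "deduplication": False,         # disabled so raw outputs remain comparable
    "patent_search_language": "en", # English for patent search
}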

3. Experimental Conditions

3.1 Main Study: 2×2 Factorial Design

| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| C1: Direct | Without | Without | Baseline: "Generate 20 creative ideas for [query]" |
| C2: Expert-Only | Without | With | Expert personas generate ideas for the whole query |
| C3: Attribute-Only | With | Without | Decompose the query, direct generation per attribute |
| C4: Full Pipeline | With | With | Decompose the query, experts generate per attribute |

3.2 Control Condition

| Condition | Description | Purpose |
|-----------|-------------|---------|
| C5: Random-Perspective | 4 random words as "perspectives" | Tests if ANY perspective shift helps, or if EXPERT knowledge specifically matters |

3.3 Expert Source Study (Secondary, within Expert=With conditions)

| Condition | Source | Implementation |
|-----------|--------|----------------|
| S-LLM | LLM-generated | Query-specific experts generated by LLM |
| S-Curated | Curated occupations | Pre-selected high-quality occupations |
| S-External | External sources | Wikidata/ConceptNet occupations |

4. Query Dataset

4.1 Design Principles

  • Diversity: Cover multiple domains (consumer products, technology, services, abstract concepts)
  • Complexity Variation: Simple objects to complex systems
  • Familiarity Variation: Common items to specialized equipment
  • Cultural Neutrality: Concepts understandable across cultures

4.2 Query Set (30 Queries)

Category A: Everyday Objects (10)

| ID | Query | Complexity |
|----|-------|------------|
| A1 | Chair | Low |
| A2 | Umbrella | Low |
| A3 | Backpack | Low |
| A4 | Coffee mug | Low |
| A5 | Bicycle | Medium |
| A6 | Refrigerator | Medium |
| A7 | Smartphone | Medium |
| A8 | Running shoes | Medium |
| A9 | Kitchen knife | Low |
| A10 | Desk lamp | Low |

Category B: Technology & Tools (10)

| ID | Query | Complexity |
|----|-------|------------|
| B1 | Solar panel | Medium |
| B2 | Electric vehicle | High |
| B3 | 3D printer | High |
| B4 | Drone | Medium |
| B5 | Smart thermostat | Medium |
| B6 | Noise-canceling headphones | Medium |
| B7 | Water purifier | Medium |
| B8 | Wind turbine | High |
| B9 | Robotic vacuum | Medium |
| B10 | Wearable fitness tracker | Medium |

Category C: Services & Systems (10)

| ID | Query | Complexity |
|----|-------|------------|
| C1 | Food delivery service | Medium |
| C2 | Online education platform | High |
| C3 | Healthcare appointment system | High |
| C4 | Public transportation | High |
| C5 | Hotel booking system | Medium |
| C6 | Personal finance app | Medium |
| C7 | Grocery shopping experience | Medium |
| C8 | Parking solution | Medium |
| C9 | Elderly care service | High |
| C10 | Waste management system | High |

4.3 Sample Size Justification

Based on a CHI meta-study of effect sizes (a power-analysis sketch follows the list below):

  • Queries: 30 (crossed with conditions)
  • Expected effect size: d = 0.5 (medium)
  • Power target: 80%
  • For automatic metrics: 30 queries × 5 conditions × 20 ideas = 3,000 ideas
  • For human evaluation: Subset of 10 queries × 3 conditions × 20 ideas = 600 ideas
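
As a rough sanity check on these numbers, the sketch below uses the statsmodels power calculators. It treats a single two-condition contrast (paired at the query level, or independent groups), whereas the actual analysis is a mixed-effects ANOVA, so the result is indicative only.

# Rough power check for d = 0.5, 80% power, alpha = .05.
from statsmodels.stats.power import TTestPower, TTestIndPower

n_paired = TTestPower().solve_power(effect_size=0.5, power=0.8, alpha=0.05)
n_indep = TTestIndPower().solve_power(effect_size=0.5, power=0.8, alpha=0.05)

print(f"queries needed for a paired contrast: {n_paired:.1f}")   # ~34
print(f"per group for an independent contrast: {n_indep:.1f}")   # ~64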

5. Automatic Metrics Collection

5.1 Semantic Diversity Metrics

5.1.1 Mean Pairwise Distance (Primary)

from typing import List, Tuple

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compute_mean_pairwise_distance(ideas: List[str], embedding_model: str) -> Tuple[float, float]:
    """
    Compute the mean and standard deviation of cosine distance over all idea pairs.
    Higher mean = more diverse. get_embeddings() is the project's embedding helper.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)
    n = len(embeddings)
    distances = []
    for i in range(n):
        for j in range(i + 1, n):
            dist = 1 - cosine_similarity(embeddings[i], embeddings[j])
            distances.append(dist)
    return np.mean(distances), np.std(distances)

5.1.2 Cluster Analysis

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def compute_cluster_metrics(ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze idea clustering patterns.
    compute_cluster_sizes() is a project helper returning the size of each cluster.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)

    # Find the optimal k using the silhouette score
    silhouette_scores = []
    for k in range(2, min(len(ideas), 10)):
        kmeans = KMeans(n_clusters=k, random_state=0)  # fixed seed for reproducibility
        labels = kmeans.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        silhouette_scores.append((k, score))

    best_k, best_score = max(silhouette_scores, key=lambda x: x[1])

    return {
        'optimal_clusters': best_k,
        'silhouette_score': best_score,
        'cluster_distribution': compute_cluster_sizes(embeddings, best_k)
    }

5.1.3 Semantic Distance from Query

def compute_query_distance(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Measure how far ideas are from the original query.
    Higher = more novel/distant.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)

    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]

    return {
        'mean_distance': np.mean(distances),
        'max_distance': np.max(distances),
        'min_distance': np.min(distances),
        'std_distance': np.std(distances)
    }

5.2 Patent Novelty Metrics

5.2.1 Patent Overlap Rate

def compute_patent_novelty(ideas: List[str], query: str) -> dict:
    """
    Search patents for each idea and compute overlap rate.
    Uses existing patent_search_service.
    """
    matches = 0
    match_details = []

    for idea in ideas:
        result = patent_search_service.search(idea)
        if result.has_match:
            matches += 1
            match_details.append({
                'idea': idea,
                'patent': result.best_match
            })

    return {
        'novelty_rate': 1 - (matches / len(ideas)),
        'match_count': matches,
        'total_ideas': len(ideas),
        'match_details': match_details
    }

5.3 Hallucination/Nonsense Metrics (RQ6)

Since our design intentionally excludes the original query during keyword generation, we need to measure the "cost" of this approach.

5.3.1 LLM-as-Judge for Relevance

def compute_relevance_score(query: str, ideas: List[str], judge_model: str) -> dict:
    """
    Use LLM to judge if each idea is relevant/applicable to the original query.
    """
    relevant_count = 0
    nonsense_count = 0
    results = []

    for idea in ideas:
        prompt = f"""
        Original query: {query}
        Generated idea: {idea}

        Is this idea relevant and applicable to the original query?
        Rate: 1 (nonsense/irrelevant), 2 (weak connection), 3 (relevant)

        Return JSON: {{"score": N, "reason": "brief explanation"}}
        """
        result = llm_judge(prompt, model=judge_model)
        results.append(result)
        if result['score'] == 1:
            nonsense_count += 1
        elif result['score'] >= 2:
            relevant_count += 1

    return {
        'relevance_rate': relevant_count / len(ideas),
        'nonsense_rate': nonsense_count / len(ideas),
        'details': results
    }

5.3.2 Semantic Distance Threshold Analysis

def analyze_distance_threshold(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze which ideas exceed a "too far" semantic distance threshold.
    Ideas beyond threshold may be creative OR nonsensical.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)

    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]

    # Define thresholds (to be calibrated)
    CREATIVE_THRESHOLD = 0.6  # Ideas this far are "creative"
    NONSENSE_THRESHOLD = 0.85  # Ideas this far may be "nonsense"

    return {
        'creative_zone': sum(1 for d in distances if CREATIVE_THRESHOLD <= d < NONSENSE_THRESHOLD),
        'potential_nonsense': sum(1 for d in distances if d >= NONSENSE_THRESHOLD),
        'safe_zone': sum(1 for d in distances if d < CREATIVE_THRESHOLD),
        'distance_distribution': distances
    }

5.4 Metrics Summary Table

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| Mean Pairwise Distance | avg(1 - cos_sim(i, j)) for all pairs | Higher = more diverse |
| Silhouette Score | Cluster cohesion vs separation | Higher = clearer clusters |
| Optimal Cluster Count | argmax(silhouette) | More clusters = more themes |
| Query Distance | 1 - cos_sim(query, idea) | Higher = farther from original |
| Patent Novelty Rate | 1 - (matches / total) | Higher = more novel |

5.5 Nonsense/Hallucination Analysis (RQ6) - Three Methods

| Method | Metric | How it works | Pros/Cons |
|--------|--------|--------------|-----------|
| Automatic | Semantic Distance Threshold | Ideas with distance > 0.85 flagged as "potential nonsense" | Fast, cheap; may miss contextual nonsense |
| LLM-as-Judge | Relevance Score (1-3) | GPT-4 rates if idea is relevant to original query | Moderate cost; good balance |
| Human Evaluation | Relevance Rating (1-7 Likert) | Humans rate coherence/relevance | Gold standard; most expensive |

Triangulation: Compare all three methods to validate findings:

  • If automatic + LLM + human agree → high confidence
  • If they disagree → investigate why (interesting edge cases)
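
One lightweight way to operationalize the triangulation, assuming each method has already been reduced to a per-idea boolean nonsense flag, is sketched below; more formal agreement statistics such as Cohen's kappa could be substituted.

from itertools import combinations

def triangulate_nonsense(auto_flags, judge_flags, human_flags):
    """Pairwise agreement between the three nonsense measures, plus disagreeing ideas."""
    methods = {"automatic": auto_flags, "llm_judge": judge_flags, "human": human_flags}
    agreement = {}
    for (name_a, a), (name_b, b) in combinations(methods.items(), 2):
        agreement[f"{name_a}_vs_{name_b}"] = sum(x == y for x, y in zip(a, b)) / len(a)
    # Ideas the methods disagree on are the "interesting edge cases" noted above.
    disagreements = [i for i, flags in enumerate(zip(auto_flags, judge_flags, human_flags))
                     if len(set(flags)) > 1]
    return agreement, disagreements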

6. Human Evaluation Protocol

6.1 Participants

6.1.1 Recruitment

  • Platform: Prolific, MTurk, or domain experts
  • Sample Size: 60 evaluators (20 per condition group)
  • Criteria:
    • Native English speakers
    • Bachelor's degree or higher
    • Attention check pass rate > 80%

6.1.2 Compensation

  • $15/hour equivalent
  • ~30 minutes per session
  • Bonus for high-quality ratings

6.2 Rating Scales

6.2.1 Novelty (7-point Likert)

How novel/surprising is this idea?
1 = Not at all novel (very common/obvious)
4 = Moderately novel
7 = Extremely novel (never seen before)

6.2.2 Usefulness (7-point Likert)

How useful/practical is this idea?
1 = Not at all useful (impractical)
4 = Moderately useful
7 = Extremely useful (highly practical)

6.2.3 Creativity (7-point Likert)

How creative is this idea overall?
1 = Not at all creative
4 = Moderately creative
7 = Extremely creative

6.2.4 Relevance/Coherence (7-point Likert) - For RQ6

How relevant and coherent is this idea to the original query?
1 = Nonsense/completely irrelevant (no logical connection)
2 = Very weak connection (hard to see relevance)
3 = Weak connection (requires stretch to see relevance)
4 = Moderate connection (somewhat relevant)
5 = Good connection (clearly relevant)
6 = Strong connection (directly applicable)
7 = Perfect fit (highly relevant and coherent)

Note: This scale specifically measures the "cost" of context-free generation.

  • Ideas with high novelty but low relevance (1-3) = potential hallucination
  • Ideas with high novelty AND high relevance (5-7) = successful creative leap

6.3 Procedure

  1. Introduction (5 min)

    • Study purpose (without revealing hypotheses)
    • Rating scale explanation
    • Practice with 3 example ideas
  2. Training (5 min)

    • Rate 5 calibration ideas with feedback
    • Discuss edge cases
  3. Main Evaluation (20 min)

    • Rate 30 ideas (randomized order)
    • 3 attention check items embedded
    • Break after 15 ideas
  4. Debriefing (2 min)

    • Demographics
    • Open-ended feedback

6.4 Quality Control

| Check | Threshold | Action |
|-------|-----------|--------|
| Attention checks | < 2/3 correct | Exclude |
| Completion time | < 10 min | Flag for review |
| Variance in ratings | All same score | Exclude |
| Inter-rater reliability | Cronbach's α < 0.7 | Review ratings |
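
A sketch of how these rules could be applied automatically before analysis; it assumes a per-rating pandas DataFrame with evaluator_id, attention_correct (out of 3), completion_minutes, and rating columns (the column names are illustrative, not from the codebase).

import pandas as pd

def apply_quality_control(ratings: pd.DataFrame) -> pd.DataFrame:
    per_eval = ratings.groupby("evaluator_id").agg(
        attention=("attention_correct", "max"),
        minutes=("completion_minutes", "max"),
        rating_var=("rating", "var"),
    )
    excluded = per_eval[(per_eval["attention"] < 2) | (per_eval["rating_var"] == 0)].index
    flagged = per_eval[per_eval["minutes"] < 10].index   # flagged for manual review only
    print(f"excluded evaluators: {len(excluded)}; flagged for review: {len(flagged)}")
    return ratings[~ratings["evaluator_id"].isin(excluded)]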

6.5 Analysis Plan

6.5.1 Reliability

  • Cronbach's alpha for each scale
  • ICC (Intraclass Correlation) for inter-rater agreement
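
Cronbach's alpha can be computed directly from an evaluators-by-items score matrix; the sketch below implements the standard formula (the ICC can be obtained from a package such as pingouin).

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: 2-D array, rows = evaluators, columns = items rated on the same scale."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)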

6.5.2 Main Analysis

  • Mixed-effects ANOVA: Condition × Query
  • Post-hoc: Tukey HSD for pairwise comparisons
  • Effect sizes: Cohen's d
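
A sketch of the main analysis in statsmodels, assuming a long-format DataFrame df with rating, condition, and query columns (e.g. assembled from data/human_ratings/); treating query as the random grouping factor is one reasonable simplification of the Condition × Query design.

import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Mixed-effects model: condition as fixed effect, query as random grouping factor.
mixed = smf.mixedlm("rating ~ C(condition)", df, groups=df["query"]).fit()
print(mixed.summary())

# Post-hoc pairwise comparisons across conditions (Tukey HSD).
tukey = pairwise_tukeyhsd(endog=df["rating"], groups=df["condition"], alpha=0.05)
print(tukey.summary())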

6.5.3 Correlation with Automatic Metrics

  • Pearson correlation: Human ratings vs semantic diversity
  • Regression: Predict human ratings from automatic metrics
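
A matching sketch for the correlation step, assuming per-(query, condition) aggregates in a DataFrame cell_df that holds both the human rating means and the automatic metrics (column names are illustrative).

from scipy.stats import pearsonr
import statsmodels.formula.api as smf

r, p = pearsonr(cell_df["mean_pairwise_distance"], cell_df["human_creativity"])
print(f"diversity vs human creativity: r = {r:.2f}, p = {p:.3f}")

# Predict human ratings from the automatic metrics.
ols = smf.ols("human_creativity ~ mean_pairwise_distance + query_distance + patent_novelty_rate",
              data=cell_df).fit()
print(ols.summary())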

7. Experimental Procedure

7.1 Phase 1: Idea Generation

For each query Q in QuerySet:
    For each condition C in Conditions:

        If C == "Direct":
            # No attributes, no experts
            ideas = direct_llm_generation(Q, n=20)

        Elif C == "Expert-Only":
            # No attributes, with experts
            experts = generate_experts(Q, n=4)
            ideas = expert_generation_whole_query(Q, experts, ideas_per_expert=5)

        Elif C == "Attribute-Only":
            # With attributes, no experts
            attributes = decompose_attributes(Q)
            ideas = direct_generation_per_attribute(Q, attributes, ideas_per_attr=5)

        Elif C == "Full-Pipeline":
            # With attributes, with experts
            attributes = decompose_attributes(Q)
            experts = generate_experts(Q, n=4)
            ideas = expert_transformation(Q, attributes, experts, ideas_per_combo=1-2)

        Elif C == "Random-Perspective":
            # Control: random words instead of experts
            perspectives = random.sample(RANDOM_WORDS, 4)
            ideas = perspective_generation(Q, perspectives, ideas_per=5)

        Store(Q, C, ideas)

7.2 Phase 2: Automatic Metrics

For each (Q, C, ideas) in Results:
    metrics = {
        'diversity': compute_mean_pairwise_distance(ideas),
        'clusters': compute_cluster_metrics(ideas),
        'query_distance': compute_query_distance(Q, ideas),
        'patent_novelty': compute_patent_novelty(ideas, Q)
    }
    Store(Q, C, metrics)

7.3 Phase 3: Human Evaluation

# Sample selection
selected_queries = random.sample(QuerySet, 10)
selected_conditions = ["Direct", "Expert-Only", "Full-Pipeline"]  # 3 of the 5 conditions (see 4.3)

# Create evaluation set
evaluation_items = []
For each Q in selected_queries:
    For each C in selected_conditions:
        ideas = Get(Q, C)
        For each idea in ideas:
            evaluation_items.append((Q, C, idea))

# Randomize and assign to evaluators
random.shuffle(evaluation_items)
assignments = assign_to_evaluators(evaluation_items, n_evaluators=60)

# Collect ratings
ratings = collect_human_ratings(assignments)

7.4 Phase 4: Analysis

# Automatic metrics analysis
Run ANOVA: diversity ~ condition + query + condition:query
Run post-hoc: Tukey HSD for condition pairs
Compute effect sizes

# Human ratings analysis
Check reliability: Cronbach's alpha, ICC
Run mixed-effects model: rating ~ condition + (1|evaluator) + (1|query)
Compute correlations: human vs automatic metrics

# Visualization
Plot: Diversity by condition (box plots)
Plot: t-SNE of idea embeddings colored by condition
Plot: Expert count vs diversity curve

8. Implementation Checklist

8.1 Code to Implement

  • experiments/generate_ideas.py - Idea generation for all conditions
  • experiments/compute_metrics.py - Automatic metric computation
  • experiments/export_for_evaluation.py - Prepare human evaluation set
  • experiments/analyze_results.py - Statistical analysis
  • experiments/visualize.py - Generate figures

8.2 Data Files to Create

  • data/queries.json - 30 queries with metadata
  • data/random_words.json - Random perspective words
  • data/generated_ideas/ - Raw idea outputs
  • data/metrics/ - Computed metric results
  • data/human_ratings/ - Collected ratings

8.3 Analysis Outputs

  • results/diversity_by_condition.csv
  • results/patent_novelty_by_condition.csv
  • results/human_ratings_summary.csv
  • results/statistical_tests.txt
  • figures/ - All visualizations

9. Expected Results & Hypotheses

9.1 Primary Hypotheses (2×2 Factorial)

| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| H1: Main Effect of Attributes | Attribute-Only > Direct | Semantic diversity |
| H2: Main Effect of Experts | Expert-Only > Direct | Semantic diversity |
| H3: Interaction Effect | Full Pipeline > (Attribute-Only + Expert-Only - Direct) | Semantic diversity |
| H4: Novelty | Full Pipeline > all other conditions | Patent novelty rate |
| H5: Expert vs Random | Expert-Only > Random-Perspective | Validates expert knowledge matters |
| H6: Novelty-Usefulness Tradeoff | Full Pipeline has higher nonsense rate than Direct, but acceptable (<20%) | Nonsense rate |

9.2 Expected Pattern

                    Without Experts    With Experts
                    ---------------    ------------
Without Attributes    Direct (low)      Expert-Only (medium)
With Attributes       Attr-Only (medium) Full Pipeline (high)

Expected interaction: The combination (Full Pipeline) should produce super-additive effects: the benefit of experts is amplified when combined with structured attributes.
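
The super-additivity claim in H3 can be checked with a simple contrast on the four cell means; the sketch below uses hypothetical diversity values for illustration.

def interaction_contrast(direct: float, expert_only: float,
                         attr_only: float, full: float) -> float:
    """Positive = super-additive combination of attributes and experts (H3)."""
    additive_prediction = attr_only + expert_only - direct
    return full - additive_prediction

# Hypothetical cell means for mean pairwise distance:
# interaction_contrast(direct=0.30, expert_only=0.38, attr_only=0.36, full=0.48)  # -> +0.04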

9.3 Expected Effect Sizes

Based on related work:

  • Main effect of attributes: d = 0.3-0.5 (small to medium)
  • Main effect of experts: d = 0.4-0.6 (medium)
  • Interaction effect: d = 0.2-0.4 (small)
  • Patent novelty increase: 20-40% improvement
  • Human creativity rating: d = 0.3-0.5 (small to medium)

9.4 Potential Confounds

| Confound | Mitigation |
|----------|------------|
| Query difficulty | Crossed design (all queries × all conditions) |
| LLM variability | Multiple runs, fixed seed where possible |
| Evaluator bias | Randomized presentation, blinding |
| Order effects | Counterbalancing in human evaluation |

10. Timeline

| Week | Activity |
|------|----------|
| 1-2 | Implement idea generation scripts |
| 3 | Generate all ideas (5 conditions × 30 queries) |
| 4 | Compute automatic metrics |
| 5 | Design and pilot human evaluation |
| 6-7 | Run human evaluation (60 participants) |
| 8 | Analyze results |
| 9-10 | Write paper |
| 11 | Internal review |
| 12 | Submit |

11. Appendix: Direct Generation Prompt

For baseline condition C1 (Direct LLM generation):

You are a creative innovation consultant. Generate 20 unique and creative ideas
for improving or reimagining a [QUERY].

Requirements:
- Each idea should be distinct and novel
- Ideas should range from incremental improvements to radical innovations
- Consider different aspects: materials, functions, user experiences, contexts
- Provide a brief (15-30 word) description for each idea

Output format:
1. [Idea keyword]: [Description]
2. [Idea keyword]: [Description]
...
20. [Idea keyword]: [Description]
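
Downstream scripts will need to parse this numbered output; a minimal, tolerant parser might look like the sketch below (the regex and structure are illustrative, not the project's actual parser).

import re

LINE_PATTERN = re.compile(r"^\s*\d+\.\s*(?P<keyword>[^:]+):\s*(?P<description>.+)$")

def parse_ideas(llm_output: str) -> list:
    """Parse '1. [Idea keyword]: [Description]' lines into keyword/description dicts."""
    ideas = []
    for line in llm_output.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            ideas.append({
                "keyword": match.group("keyword").strip(),
                "description": match.group("description").strip(),
            })
    return ideas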

12. Appendix: Random Perspective Words

For condition C5 (Random-Perspective), sample from:

[
  "ocean", "mountain", "forest", "desert", "cave",
  "microscope", "telescope", "kaleidoscope", "prism", "lens",
  "butterfly", "elephant", "octopus", "eagle", "ant",
  "sunrise", "thunderstorm", "rainbow", "fog", "aurora",
  "clockwork", "origami", "mosaic", "symphony", "ballet",
  "ancient", "futuristic", "organic", "crystalline", "liquid",
  "whisper", "explosion", "rhythm", "silence", "echo"
]

This tests whether ANY perspective shift helps, or if EXPERT perspectives specifically matter.