# Experimental Protocol: Expert-Augmented LLM Ideation
## Executive Summary
This document outlines a comprehensive experimental design to test the hypothesis that multi-expert LLM-based ideation produces more diverse and novel ideas than direct LLM generation.
---
## 1. Research Questions
| ID | Research Question |
|----|-------------------|
| **RQ1** | Does attribute decomposition improve semantic diversity of generated ideas? |
| **RQ2** | Does expert perspective transformation improve semantic diversity of generated ideas? |
| **RQ3** | Is there an interaction effect between attribute decomposition and expert perspectives? |
| **RQ4** | Which combination produces the highest patent novelty (lowest overlap)? |
| **RQ5** | How do different expert sources (LLM vs Curated vs External) affect idea quality? |
| **RQ6** | Does context-free keyword generation (current design) increase hallucination/nonsense rate? |
### Design Note: Context-Free Keyword Generation
Our system intentionally excludes the original query during keyword generation (Stage 1):
```
Stage 1 (Keyword): Expert sees "wood" (木質) + "accountant" (會計師)
                   Expert does NOT see "chair" (椅子)
                   → Generates: "cash flow" (資金流動)
Stage 2 (Description): Expert sees "chair" + "cash flow"
                       → Applies keyword to original query
```
**Rationale**: This forces maximum semantic distance in keyword generation.
**Risk**: Some keywords may be too distant, resulting in nonsensical or unusable ideas.
**RQ6 investigates**: What is the hallucination/nonsense rate, and is the tradeoff worthwhile?
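To make the two stages concrete, a minimal sketch follows; `llm_complete` and the prompt wording are illustrative assumptions, not the production implementation:

```python
# Minimal sketch of context-free two-stage generation.
# `llm_complete` is a hypothetical text-completion helper.

def generate_idea(query: str, attribute: str, expert: str) -> dict:
    # Stage 1: the expert sees only the attribute, never the query,
    # which forces maximum semantic distance in the keyword.
    keyword = llm_complete(
        f"As a {expert}, name one concept you strongly associate "
        f"with '{attribute}'. Reply with the concept only."
    )
    # Stage 2: the keyword is brought back to the original query.
    description = llm_complete(
        f"Propose a novel idea for a {query} inspired by the "
        f"concept '{keyword}'. One or two sentences."
    )
    return {"keyword": keyword, "description": description}
```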
---
## 2. Experimental Design Overview
### 2.1 Design Type
**2×2 Factorial Design**: Attribute Decomposition (With/Without) × Expert Perspectives (With/Without)
- Within-subjects for queries (all queries tested across all conditions)
### 2.2 Variables
#### Independent Variables (Manipulated)
| Variable | Levels | Description |
|----------|--------|-------------|
| **Attribute Decomposition** | 2 levels: With / Without | Whether to decompose query into structured attributes |
| **Expert Perspectives** | 2 levels: With / Without | Whether to use expert personas for idea generation |
| **Expert Source** (secondary) | LLM, Curated, External | Source of expert occupations (tested within Expert=With conditions) |
#### Dependent Variables (Measured)
| Variable | Measurement Method |
|----------|-------------------|
| **Semantic Diversity** | Mean pairwise cosine distance (embeddings) |
| **Cluster Spread** | Number of clusters, silhouette score |
| **Patent Novelty** | 1 - (ideas with patent match / total ideas) |
| **Semantic Distance** | Distance from query centroid |
| **Human Novelty Rating** | 7-point Likert scale |
| **Human Usefulness Rating** | 7-point Likert scale |
| **Human Creativity Rating** | 7-point Likert scale |
#### Control Variables (Held Constant)
| Variable | Fixed Value |
|----------|-------------|
| LLM Model | Qwen3:8b (or specify) |
| Temperature | 0.7 |
| Total Ideas per Query | 20 |
| Keywords per Expert | 1 |
| Deduplication | Disabled for raw comparison |
| Language | English (for patent search) |
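To keep these constants in one place, a configuration sketch along these lines may help (key names are illustrative, not the project's actual schema):

```python
# Illustrative fixed-parameter config mirroring the table above
# (key names are assumptions, not the project's config schema).
GENERATION_CONFIG = {
    "model": "qwen3:8b",           # or the final model choice
    "temperature": 0.7,
    "ideas_per_query": 20,
    "keywords_per_expert": 1,
    "deduplicate": False,          # disabled for raw comparison
    "patent_search_language": "en",
}
```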
---
## 3. Experimental Conditions
### 3.1 Main Study: 2×2 Factorial Design
| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| **C1: Direct** | ❌ Without | ❌ Without | Baseline: "Generate 20 creative ideas for [query]" |
| **C2: Expert-Only** | ❌ Without | ✅ With | Expert personas generate for whole query |
| **C3: Attribute-Only** | ✅ With | ❌ Without | Decompose query, direct generate per attribute |
| **C4: Full Pipeline** | ✅ With | ✅ With | Decompose query, experts generate per attribute |
### 3.2 Control Condition
| Condition | Description | Purpose |
|-----------|-------------|---------|
| **C5: Random-Perspective** | 4 random words as "perspectives" | Tests if ANY perspective shift helps, or if EXPERT knowledge specifically matters |
### 3.3 Expert Source Study (Secondary, within Expert=With conditions)
| Condition | Source | Implementation |
|-----------|--------|----------------|
| **S-LLM** | LLM-generated | Query-specific experts generated by LLM |
| **S-Curated** | Curated occupations | Pre-selected high-quality occupations |
| **S-External** | External sources | Wikidata/ConceptNet occupations |
---
## 4. Query Dataset
### 4.1 Design Principles
- **Diversity**: Cover multiple domains (consumer products, technology, services, abstract concepts)
- **Complexity Variation**: Simple objects to complex systems
- **Familiarity Variation**: Common items to specialized equipment
- **Cultural Neutrality**: Concepts understandable across cultures
### 4.2 Query Set (30 Queries)
#### Category A: Everyday Objects (10)
| ID | Query | Complexity |
|----|-------|------------|
| A1 | Chair | Low |
| A2 | Umbrella | Low |
| A3 | Backpack | Low |
| A4 | Coffee mug | Low |
| A5 | Bicycle | Medium |
| A6 | Refrigerator | Medium |
| A7 | Smartphone | Medium |
| A8 | Running shoes | Medium |
| A9 | Kitchen knife | Low |
| A10 | Desk lamp | Low |
#### Category B: Technology & Tools (10)
| ID | Query | Complexity |
|----|-------|------------|
| B1 | Solar panel | Medium |
| B2 | Electric vehicle | High |
| B3 | 3D printer | High |
| B4 | Drone | Medium |
| B5 | Smart thermostat | Medium |
| B6 | Noise-canceling headphones | Medium |
| B7 | Water purifier | Medium |
| B8 | Wind turbine | High |
| B9 | Robotic vacuum | Medium |
| B10 | Wearable fitness tracker | Medium |
#### Category C: Services & Systems (10)
| ID | Query | Complexity |
|----|-------|------------|
| C1 | Food delivery service | Medium |
| C2 | Online education platform | High |
| C3 | Healthcare appointment system | High |
| C4 | Public transportation | High |
| C5 | Hotel booking system | Medium |
| C6 | Personal finance app | Medium |
| C7 | Grocery shopping experience | Medium |
| C8 | Parking solution | Medium |
| C9 | Elderly care service | High |
| C10 | Waste management system | High |
### 4.3 Sample Size Justification
Based on [CHI meta-study on effect sizes](https://dl.acm.org/doi/10.1145/3706598.3713671):
- **Queries**: 30 (crossed with conditions)
- **Expected effect size**: d = 0.5 (medium)
- **Power target**: 80%
- **For automatic metrics**: 30 queries × 5 conditions × 20 ideas = 3,000 ideas
- **For human evaluation**: Subset of 10 queries × 3 conditions × 20 ideas = 600 ideas
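As a sanity check on the query budget, the power target can be verified with statsmodels, assuming the within-subjects contrast reduces to a paired t-test at a two-sided α = .05:

```python
# Power check for a paired d = 0.5 contrast across queries (statsmodels).
import math
from statsmodels.stats.power import TTestPower

n_queries = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(math.ceil(n_queries))  # ≈ 34 under these assumptions, close to the 30-query budget
```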
---
## 5. Automatic Metrics Collection
### 5.1 Semantic Diversity Metrics
#### 5.1.1 Mean Pairwise Distance (Primary)
```python
from typing import List, Tuple
import numpy as np
# get_embeddings / cosine_similarity are shared project helpers (assumed).

def compute_mean_pairwise_distance(
    ideas: List[str], embedding_model: str
) -> Tuple[float, float]:
    """
    Compute mean cosine distance between all idea pairs.
    Higher = more diverse.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)
    n = len(embeddings)
    distances = []
    for i in range(n):
        for j in range(i + 1, n):
            dist = 1 - cosine_similarity(embeddings[i], embeddings[j])
            distances.append(dist)
    return np.mean(distances), np.std(distances)
```
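An equivalent vectorized form using SciPy's condensed distance matrix, shown as an alternative rather than the project's implementation:

```python
import numpy as np
from scipy.spatial.distance import pdist

def mean_pairwise_distance(embeddings: np.ndarray) -> tuple[float, float]:
    # pdist with metric="cosine" yields all pairwise 1 - cos_sim values
    d = pdist(embeddings, metric="cosine")
    return float(d.mean()), float(d.std())
```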
#### 5.1.2 Cluster Analysis
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def compute_cluster_metrics(ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze idea clustering patterns.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)
    # Find optimal k using silhouette score
    silhouette_scores = []
    for k in range(2, min(len(ideas), 10)):
        kmeans = KMeans(n_clusters=k)
        labels = kmeans.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        silhouette_scores.append((k, score))
    best_k, best_score = max(silhouette_scores, key=lambda x: x[1])
    return {
        'optimal_clusters': best_k,
        'silhouette_score': best_score,
        'cluster_distribution': compute_cluster_sizes(embeddings, best_k)
    }
```
#### 5.1.3 Semantic Distance from Query
```python
def compute_query_distance(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Measure how far ideas are from the original query.
    Higher = more novel/distant.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)
    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]
    return {
        'mean_distance': np.mean(distances),
        'max_distance': np.max(distances),
        'min_distance': np.min(distances),
        'std_distance': np.std(distances)
    }
```
### 5.2 Patent Novelty Metrics
#### 5.2.1 Patent Overlap Rate
```python
def compute_patent_novelty(ideas: List[str], query: str) -> dict:
    """
    Search patents for each idea and compute overlap rate.
    Uses existing patent_search_service.
    """
    matches = 0
    match_details = []
    for idea in ideas:
        result = patent_search_service.search(idea)
        if result.has_match:
            matches += 1
            match_details.append({
                'idea': idea,
                'patent': result.best_match
            })
    return {
        'novelty_rate': 1 - (matches / len(ideas)),
        'match_count': matches,
        'total_ideas': len(ideas),
        'match_details': match_details
    }
```
### 5.3 Hallucination/Nonsense Metrics (RQ6)
Since our design intentionally excludes the original query during keyword generation, we need to measure the "cost" of this approach.
#### 5.3.1 LLM-as-Judge for Relevance
```python
def compute_relevance_score(query: str, ideas: List[str], judge_model: str) -> dict:
    """
    Use LLM to judge if each idea is relevant/applicable to the original query.
    """
    relevant_count = 0
    nonsense_count = 0
    results = []
    for idea in ideas:
        prompt = f"""
Original query: {query}
Generated idea: {idea}
Is this idea relevant and applicable to the original query?
Rate: 1 (nonsense/irrelevant), 2 (weak connection), 3 (relevant)
Return JSON: {{"score": N, "reason": "brief explanation"}}
"""
        result = llm_judge(prompt, model=judge_model)
        results.append(result)
        if result['score'] == 1:
            nonsense_count += 1
        elif result['score'] >= 2:
            relevant_count += 1
    return {
        'relevance_rate': relevant_count / len(ideas),
        'nonsense_rate': nonsense_count / len(ideas),
        'details': results
    }
```
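Judge replies do not always come back as clean JSON. A defensive parser along these lines (a hypothetical helper, not part of the existing service) keeps the loop from crashing on malformed output:

```python
import json
import re

def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply, falling back to a regex scan."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r'"score"\s*:\s*([1-3])', raw)
        if match:
            return {"score": int(match.group(1)), "reason": "recovered via regex"}
        return {"score": None, "reason": "parse failure"}
```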
#### 5.3.2 Semantic Distance Threshold Analysis
```python
def analyze_distance_threshold(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze which ideas exceed a "too far" semantic distance threshold.
    Ideas beyond threshold may be creative OR nonsensical.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)
    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]
    # Define thresholds (to be calibrated)
    CREATIVE_THRESHOLD = 0.6   # ideas at least this far are "creative"
    NONSENSE_THRESHOLD = 0.85  # ideas this far may be "nonsense"
    return {
        'creative_zone': sum(1 for d in distances if CREATIVE_THRESHOLD <= d < NONSENSE_THRESHOLD),
        'potential_nonsense': sum(1 for d in distances if d >= NONSENSE_THRESHOLD),
        'safe_zone': sum(1 for d in distances if d < CREATIVE_THRESHOLD),
        'distance_distribution': distances
    }
```
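Since the thresholds are marked "to be calibrated", one possible calibration (an assumption, not a prescribed method) anchors them to percentiles of the Direct-condition distance distribution:

```python
import numpy as np

def calibrate_thresholds(baseline_distances: list[float]) -> tuple[float, float]:
    # Hypothetical calibration: the 75th percentile of baseline (Direct)
    # query distances marks "creative", the 99th marks "potential nonsense".
    creative = float(np.percentile(baseline_distances, 75))
    nonsense = float(np.percentile(baseline_distances, 99))
    return creative, nonsense
```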
### 5.4 Metrics Summary Table
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Mean Pairwise Distance** | avg(1 - cos_sim(i, j)) for all pairs | Higher = more diverse |
| **Silhouette Score** | Cluster cohesion vs separation | Higher = clearer clusters |
| **Optimal Cluster Count** | argmax(silhouette) | More clusters = more themes |
| **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from original |
| **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel |
### 5.5 Nonsense/Hallucination Analysis (RQ6) - Three Methods
| Method | Metric | How it works | Pros/Cons |
|--------|--------|--------------|-----------|
| **Automatic** | Semantic Distance Threshold | Ideas with distance > 0.85 flagged as "potential nonsense" | Fast and cheap; may miss contextual nonsense |
| **LLM-as-Judge** | Relevance Score (1-3) | GPT-4 rates whether an idea is relevant to the original query | Moderate cost; good balance |
| **Human Evaluation** | Relevance Rating (1-7 Likert) | Humans rate coherence/relevance | Gold standard; most expensive |
**Triangulation**: Compare all three methods to validate findings:
- If automatic + LLM + human agree → high confidence
- If they disagree → investigate why (interesting edge cases)
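A minimal sketch of the agreement computation, assuming each method emits a boolean per-idea nonsense flag:

```python
import numpy as np

def triangulate(auto_flags, judge_flags, human_flags) -> dict:
    """Pairwise agreement rates between the three nonsense detectors.
    Inputs are aligned boolean sequences (True = flagged as nonsense)."""
    a, j, h = (np.asarray(x, dtype=bool) for x in (auto_flags, judge_flags, human_flags))
    return {
        "auto_vs_judge": float((a == j).mean()),
        "auto_vs_human": float((a == h).mean()),
        "judge_vs_human": float((j == h).mean()),
        "all_three_agree": float(((a == j) & (j == h)).mean()),
    }
```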
---
## 6. Human Evaluation Protocol
### 6.1 Participants
#### 6.1.1 Recruitment
- **Platform**: Prolific, MTurk, or domain experts
- **Sample Size**: 60 evaluators (20 per condition group)
- **Criteria**:
- Native English speakers
- Bachelor's degree or higher
- Attention check pass rate > 80%
#### 6.1.2 Compensation
- $15/hour equivalent
- ~30 minutes per session
- Bonus for high-quality ratings
### 6.2 Rating Scales
#### 6.2.1 Novelty (7-point Likert)
```
How novel/surprising is this idea?
1 = Not at all novel (very common/obvious)
4 = Moderately novel
7 = Extremely novel (never seen before)
```
#### 6.2.2 Usefulness (7-point Likert)
```
How useful/practical is this idea?
1 = Not at all useful (impractical)
4 = Moderately useful
7 = Extremely useful (highly practical)
```
#### 6.2.3 Creativity (7-point Likert)
```
How creative is this idea overall?
1 = Not at all creative
4 = Moderately creative
7 = Extremely creative
```
#### 6.2.4 Relevance/Coherence (7-point Likert) - For RQ6
```
How relevant and coherent is this idea to the original query?
1 = Nonsense/completely irrelevant (no logical connection)
2 = Very weak connection (hard to see relevance)
3 = Weak connection (requires stretch to see relevance)
4 = Moderate connection (somewhat relevant)
5 = Good connection (clearly relevant)
6 = Strong connection (directly applicable)
7 = Perfect fit (highly relevant and coherent)
```
**Note**: This scale specifically measures the "cost" of context-free generation.
- Ideas with high novelty but low relevance (1-3) = potential hallucination
- Ideas with high novelty AND high relevance (5-7) = successful creative leap
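This novelty × relevance reading can be operationalized as a simple quadrant rule; the cutoffs on the 7-point scales below are assumptions to be tuned:

```python
def classify_idea(novelty: float, relevance: float) -> str:
    # Assumed cutoffs on the 7-point scales: >= 5 is "high", <= 3 is "low".
    if novelty >= 5 and relevance <= 3:
        return "potential hallucination"  # novel but unmoored from the query
    if novelty >= 5 and relevance >= 5:
        return "creative leap"            # novel AND clearly applicable
    if relevance >= 5:
        return "conventional but relevant"
    return "weak idea"
```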
### 6.3 Procedure
1. **Introduction** (5 min)
- Study purpose (without revealing hypotheses)
- Rating scale explanation
- Practice with 3 example ideas
2. **Training** (5 min)
- Rate 5 calibration ideas with feedback
- Discuss edge cases
3. **Main Evaluation** (20 min)
- Rate 30 ideas (randomized order)
- 3 attention check items embedded
- Break after 15 ideas
4. **Debriefing** (2 min)
- Demographics
- Open-ended feedback
### 6.4 Quality Control
| Check | Threshold | Action |
|-------|-----------|--------|
| Attention checks | < 2/3 correct | Exclude |
| Completion time | < 10 min | Flag for review |
| Variance in ratings | All same score | Exclude |
| Inter-rater reliability | Cronbach's α < 0.7 | Review ratings |
### 6.5 Analysis Plan
#### 6.5.1 Reliability
- Cronbach's alpha for each scale (a direct computation is sketched below)
- ICC (Intraclass Correlation) for inter-rater agreement
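For reference, Cronbach's alpha can be computed directly; a minimal sketch, assuming a matrix with one row per rated idea and one column per rater:

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: shape (n_ideas, n_raters), one scale at a time."""
    k = ratings.shape[1]
    rater_vars = ratings.var(axis=0, ddof=1).sum()  # sum of per-rater variances
    total_var = ratings.sum(axis=1).var(ddof=1)     # variance of per-idea totals
    return (k / (k - 1)) * (1 - rater_vars / total_var)
```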
#### 6.5.2 Main Analysis
- Mixed-effects ANOVA: Condition × Query
- Post-hoc: Tukey HSD for pairwise comparisons
- Effect sizes: Cohen's d
#### 6.5.3 Correlation with Automatic Metrics
- Pearson correlation: Human ratings vs semantic diversity
- Regression: Predict human ratings from automatic metrics
---
## 7. Experimental Procedure
### 7.1 Phase 1: Idea Generation
```
For each query Q in QuerySet:
    For each condition C in Conditions:
        If C == "Direct":
            # No attributes, no experts
            ideas = direct_llm_generation(Q, n=20)
        Elif C == "Expert-Only":
            # No attributes, with experts
            experts = generate_experts(Q, n=4)
            ideas = expert_generation_whole_query(Q, experts, ideas_per_expert=5)
        Elif C == "Attribute-Only":
            # With attributes, no experts
            attributes = decompose_attributes(Q)
            ideas = direct_generation_per_attribute(Q, attributes, ideas_per_attr=5)
        Elif C == "Full-Pipeline":
            # With attributes, with experts
            attributes = decompose_attributes(Q)
            experts = generate_experts(Q, n=4)
            ideas = expert_transformation(Q, attributes, experts, ideas_per_combo=1-2)
        Elif C == "Random-Perspective":
            # Control: random words instead of experts
            perspectives = random.sample(RANDOM_WORDS, 4)
            ideas = perspective_generation(Q, perspectives, ideas_per=5)
        Store(Q, C, ideas)
```
### 7.2 Phase 2: Automatic Metrics
```
For each (Q, C, ideas) in Results:
    metrics = {
        'diversity': compute_mean_pairwise_distance(ideas),
        'clusters': compute_cluster_metrics(ideas),
        'query_distance': compute_query_distance(Q, ideas),
        'patent_novelty': compute_patent_novelty(ideas, Q)
    }
    Store(Q, C, metrics)
```
### 7.3 Phase 3: Human Evaluation
```
# Sample selection
selected_queries = random.sample(QuerySet, 10)
selected_conditions = ["Direct", "Expert-Only", "Full-Pipeline"]

# Create evaluation set
evaluation_items = []
For each Q in selected_queries:
    For each C in selected_conditions:
        ideas = Get(Q, C)
        For each idea in ideas:
            evaluation_items.append((Q, C, idea))

# Randomize and assign to evaluators
random.shuffle(evaluation_items)
assignments = assign_to_evaluators(evaluation_items, n_evaluators=60)

# Collect ratings
ratings = collect_human_ratings(assignments)
```
### 7.4 Phase 4: Analysis
```
# Automatic metrics analysis
Run ANOVA: diversity ~ condition + query + condition:query
Run post-hoc: Tukey HSD for condition pairs
Compute effect sizes
# Human ratings analysis
Check reliability: Cronbach's alpha, ICC
Run mixed-effects model: rating ~ condition + (1|evaluator) + (1|query)
Compute correlations: human vs automatic metrics
# Visualization
Plot: Diversity by condition (box plots)
Plot: t-SNE of idea embeddings colored by condition
Plot: Diversity by expert source (LLM / Curated / External)
```
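A sketch of this pipeline in Python. statsmodels expresses the crossed random effects (1|evaluator) + (1|query) through its variance-components interface (R's lme4 states them more directly); file paths and column names are assumptions:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import pearsonr

ratings = pd.read_csv("data/human_ratings/ratings.csv")   # assumed layout
auto = pd.read_csv("results/diversity_by_condition.csv")  # assumed layout

# Mixed-effects model: rating ~ condition + (1|evaluator) + (1|query)
ratings["all"] = 1  # single dummy group; crossed effects go in vc_formula
vc = {"evaluator": "0 + C(evaluator)", "query": "0 + C(query)"}
mixed = smf.mixedlm("rating ~ C(condition)", ratings,
                    groups="all", vc_formula=vc).fit()
print(mixed.summary())

# Post-hoc Tukey HSD on per-(query, condition) diversity scores
print(pairwise_tukeyhsd(auto["diversity"], auto["condition"], alpha=0.05))

# Human ratings vs the automatic diversity metric
merged = (ratings.groupby(["query", "condition"])["rating"].mean()
                 .reset_index().merge(auto, on=["query", "condition"]))
r, p = pearsonr(merged["rating"], merged["diversity"])
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```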
---
## 8. Implementation Checklist
### 8.1 Code to Implement
- [ ] `experiments/generate_ideas.py` - Idea generation for all conditions
- [ ] `experiments/compute_metrics.py` - Automatic metric computation
- [ ] `experiments/export_for_evaluation.py` - Prepare human evaluation set
- [ ] `experiments/analyze_results.py` - Statistical analysis
- [ ] `experiments/visualize.py` - Generate figures
### 8.2 Data Files to Create
- [ ] `data/queries.json` - 30 queries with metadata
- [ ] `data/random_words.json` - Random perspective words
- [ ] `data/generated_ideas/` - Raw idea outputs
- [ ] `data/metrics/` - Computed metric results
- [ ] `data/human_ratings/` - Collected ratings
### 8.3 Analysis Outputs
- [ ] `results/diversity_by_condition.csv`
- [ ] `results/patent_novelty_by_condition.csv`
- [ ] `results/human_ratings_summary.csv`
- [ ] `results/statistical_tests.txt`
- [ ] `figures/` - All visualizations
---
## 9. Expected Results & Hypotheses
### 9.1 Primary Hypotheses (2×2 Factorial)
| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1: Main Effect of Attributes** | Attribute-Only > Direct | Semantic diversity |
| **H2: Main Effect of Experts** | Expert-Only > Direct | Semantic diversity |
| **H3: Interaction Effect** | Full Pipeline > (Attribute-Only + Expert-Only - Direct) | Semantic diversity |
| **H4: Novelty** | Full Pipeline > all other conditions | Patent novelty rate |
| **H5: Expert vs Random** | Expert-Only > Random-Perspective | Validates expert knowledge matters |
| **H6: Novelty-Relevance Tradeoff** | Full Pipeline has a higher nonsense rate than Direct, but an acceptable one (<20%) | Nonsense rate |
### 9.2 Expected Pattern
| | Without Experts | With Experts |
|---|-----------------|--------------|
| **Without Attributes** | Direct (low) | Expert-Only (medium) |
| **With Attributes** | Attribute-Only (medium) | Full Pipeline (high) |
**Expected interaction**: The combination (Full Pipeline) should produce super-additive effects, with the benefit of experts amplified when combined with structured attributes.
### 9.3 Expected Effect Sizes
Based on related work:
- Main effect of attributes: d = 0.3-0.5 (small to medium)
- Main effect of experts: d = 0.4-0.6 (medium)
- Interaction effect: d = 0.2-0.4 (small)
- Patent novelty increase: 20-40% improvement
- Human creativity rating: d = 0.3-0.5 (small to medium)
### 9.4 Potential Confounds
| Confound | Mitigation |
|----------|-----------|
| Query difficulty | Crossed design (all queries × all conditions) |
| LLM variability | Multiple runs, fixed seed where possible |
| Evaluator bias | Randomized presentation, blinding |
| Order effects | Counterbalancing in human evaluation |
---
## 10. Timeline
| Week | Activity |
|------|----------|
| 1-2 | Implement idea generation scripts |
| 3 | Generate all ideas (5 conditions × 30 queries) |
| 4 | Compute automatic metrics |
| 5 | Design and pilot human evaluation |
| 6-7 | Run human evaluation (60 participants) |
| 8 | Analyze results |
| 9-10 | Write paper |
| 11 | Internal review |
| 12 | Submit |
---
## 11. Appendix: Direct Generation Prompt
For baseline condition C1 (Direct LLM generation):
```
You are a creative innovation consultant. Generate 20 unique and creative ideas
for improving or reimagining a [QUERY].
Requirements:
- Each idea should be distinct and novel
- Ideas should range from incremental improvements to radical innovations
- Consider different aspects: materials, functions, user experiences, contexts
- Provide a brief (15-30 word) description for each idea
Output format:
1. [Idea keyword]: [Description]
2. [Idea keyword]: [Description]
...
20. [Idea keyword]: [Description]
```
---
## 12. Appendix: Random Perspective Words
For condition C5 (Random-Perspective), sample from:
```json
[
"ocean", "mountain", "forest", "desert", "cave",
"microscope", "telescope", "kaleidoscope", "prism", "lens",
"butterfly", "elephant", "octopus", "eagle", "ant",
"sunrise", "thunderstorm", "rainbow", "fog", "aurora",
"clockwork", "origami", "mosaic", "symphony", "ballet",
"ancient", "futuristic", "organic", "crystalline", "liquid",
"whisper", "explosion", "rhythm", "silence", "echo"
]
```
This tests whether ANY perspective shift helps, or if EXPERT perspectives specifically matter.