# Experimental Protocol: Expert-Augmented LLM Ideation

## Executive Summary

This document outlines a comprehensive experimental design to test the hypothesis that multi-expert LLM-based ideation produces more diverse and novel ideas than direct LLM generation.

---

## 1. Research Questions

| ID | Research Question |
|----|-------------------|
| **RQ1** | Does attribute decomposition improve the semantic diversity of generated ideas? |
| **RQ2** | Does expert perspective transformation improve the semantic diversity of generated ideas? |
| **RQ3** | Is there an interaction effect between attribute decomposition and expert perspectives? |
| **RQ4** | Which combination produces the highest patent novelty (lowest overlap)? |
| **RQ5** | How do different expert sources (LLM vs. Curated vs. External) affect idea quality? |
| **RQ6** | Does context-free keyword generation (the current design) increase the hallucination/nonsense rate? |
### Design Note: Context-Free Keyword Generation

Our system intentionally excludes the original query during keyword generation (Stage 1):

```
Stage 1 (Keyword): Expert sees "木質" (wood) + "會計師" (accountant)
                   Expert does NOT see "椅子" (chair)
                   → Generates: "資金流動" (cash flow)

Stage 2 (Description): Expert sees "椅子" (chair) + "資金流動" (cash flow)
                       → Applies the keyword to the original query
```

**Rationale**: This forces maximum semantic distance during keyword generation.
**Risk**: Some keywords may be too distant, resulting in nonsensical or unusable ideas.
**RQ6 investigates**: What is the hallucination/nonsense rate, and is the tradeoff worthwhile?

---
## 2. Experimental Design Overview

### 2.1 Design Type

**2×2 Factorial Design**: Attribute Decomposition (With/Without) × Expert Perspectives (With/Without)

- Within-subjects for queries (all queries are tested across all conditions)
### 2.2 Variables

#### Independent Variables (Manipulated)

| Variable | Levels | Description |
|----------|--------|-------------|
| **Attribute Decomposition** | 2 levels: With / Without | Whether to decompose the query into structured attributes |
| **Expert Perspectives** | 2 levels: With / Without | Whether to use expert personas for idea generation |
| **Expert Source** (secondary) | LLM, Curated, External | Source of expert occupations (tested within Expert=With conditions) |

#### Dependent Variables (Measured)

| Variable | Measurement Method |
|----------|-------------------|
| **Semantic Diversity** | Mean pairwise cosine distance (embeddings) |
| **Cluster Spread** | Number of clusters, silhouette score |
| **Patent Novelty** | 1 - (ideas with a patent match / total ideas) |
| **Semantic Distance** | Distance from the query centroid |
| **Human Novelty Rating** | 7-point Likert scale |
| **Human Usefulness Rating** | 7-point Likert scale |
| **Human Creativity Rating** | 7-point Likert scale |

#### Control Variables (Held Constant)

| Variable | Fixed Value |
|----------|-------------|
| LLM Model | Qwen3:8b (or specify) |
| Temperature | 0.7 |
| Total Ideas per Query | 20 |
| Keywords per Expert | 1 |
| Deduplication | Disabled for raw comparison |
| Language | English (for patent search) |
---

## 3. Experimental Conditions

### 3.1 Main Study: 2×2 Factorial Design

| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| **C1: Direct** | ❌ Without | ❌ Without | Baseline: "Generate 20 creative ideas for [query]" |
| **C2: Expert-Only** | ❌ Without | ✅ With | Expert personas generate for the whole query |
| **C3: Attribute-Only** | ✅ With | ❌ Without | Decompose the query, generate directly per attribute |
| **C4: Full Pipeline** | ✅ With | ✅ With | Decompose the query, experts generate per attribute |

### 3.2 Control Condition

| Condition | Description | Purpose |
|-----------|-------------|---------|
| **C5: Random-Perspective** | 4 random words as "perspectives" | Tests whether ANY perspective shift helps, or whether EXPERT knowledge specifically matters |

### 3.3 Expert Source Study (Secondary, within Expert=With conditions)

| Condition | Source | Implementation |
|-----------|--------|----------------|
| **S-LLM** | LLM-generated | Query-specific experts generated by the LLM |
| **S-Curated** | Curated occupations | Pre-selected high-quality occupations |
| **S-External** | External sources | Wikidata/ConceptNet occupations |
## 4. Query Dataset

### 4.1 Design Principles

- **Diversity**: Cover multiple domains (consumer products, technology, services, abstract concepts)
- **Complexity Variation**: Simple objects to complex systems
- **Familiarity Variation**: Common items to specialized equipment
- **Cultural Neutrality**: Concepts understandable across cultures

### 4.2 Query Set (30 Queries)

#### Category A: Everyday Objects (10)

| ID | Query | Complexity |
|----|-------|------------|
| A1 | Chair | Low |
| A2 | Umbrella | Low |
| A3 | Backpack | Low |
| A4 | Coffee mug | Low |
| A5 | Bicycle | Medium |
| A6 | Refrigerator | Medium |
| A7 | Smartphone | Medium |
| A8 | Running shoes | Medium |
| A9 | Kitchen knife | Low |
| A10 | Desk lamp | Low |

#### Category B: Technology & Tools (10)

| ID | Query | Complexity |
|----|-------|------------|
| B1 | Solar panel | Medium |
| B2 | Electric vehicle | High |
| B3 | 3D printer | High |
| B4 | Drone | Medium |
| B5 | Smart thermostat | Medium |
| B6 | Noise-canceling headphones | Medium |
| B7 | Water purifier | Medium |
| B8 | Wind turbine | High |
| B9 | Robotic vacuum | Medium |
| B10 | Wearable fitness tracker | Medium |

#### Category C: Services & Systems (10)

| ID | Query | Complexity |
|----|-------|------------|
| C1 | Food delivery service | Medium |
| C2 | Online education platform | High |
| C3 | Healthcare appointment system | High |
| C4 | Public transportation | High |
| C5 | Hotel booking system | Medium |
| C6 | Personal finance app | Medium |
| C7 | Grocery shopping experience | Medium |
| C8 | Parking solution | Medium |
| C9 | Elderly care service | High |
| C10 | Waste management system | High |

### 4.3 Sample Size Justification

Based on [CHI meta-study on effect sizes](https://dl.acm.org/doi/10.1145/3706598.3713671):

- **Queries**: 30 (crossed with conditions)
- **Expected effect size**: d = 0.5 (medium)
- **Power target**: 80%
- **For automatic metrics**: 30 queries × 5 conditions × 20 ideas = 3,000 ideas
- **For human evaluation**: subset of 10 queries × 3 conditions × 20 ideas = 600 ideas
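The 80% power target can be sanity-checked with a stdlib-only sketch of the standard normal-approximation sample-size formula for a two-group comparison (the exact t-based calculation gives a slightly larger n; this is illustrative, not part of the pipeline):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n to detect effect size d in a two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.5))  # → 63
```

With 30 queries per condition cell this is underpowered for a pure between-queries contrast, which is why the design crosses all queries with all conditions.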
---

## 5. Automatic Metrics Collection

### 5.1 Semantic Diversity Metrics

#### 5.1.1 Mean Pairwise Distance (Primary)
```python
from typing import List, Tuple

import numpy as np

def compute_mean_pairwise_distance(ideas: List[str], embedding_model: str) -> Tuple[float, float]:
    """
    Compute the mean cosine distance between all idea pairs.
    Higher = more diverse. Returns (mean, std).
    """
    # get_embeddings() and cosine_similarity() are project helpers.
    embeddings = get_embeddings(ideas, model=embedding_model)
    n = len(embeddings)
    distances = []
    for i in range(n):
        for j in range(i + 1, n):
            dist = 1 - cosine_similarity(embeddings[i], embeddings[j])
            distances.append(dist)
    return np.mean(distances), np.std(distances)
```
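As a self-contained illustration of the metric (toy 2-D vectors stand in for real embeddings; `get_embeddings` above is the project's own helper):

```python
import numpy as np

def mean_pairwise_cosine_distance(embs: np.ndarray) -> float:
    """Mean of 1 - cos_sim over all unordered pairs of row vectors."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T                # cosine similarity matrix
    iu = np.triu_indices(len(embs), k=1)    # upper triangle = unique pairs
    return float(np.mean(1 - sims[iu]))

embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(mean_pairwise_cosine_distance(embs))  # 2/3: two orthogonal pairs, one identical pair
```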
#### 5.1.2 Cluster Analysis

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def compute_cluster_metrics(ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze idea clustering patterns.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)

    # Find the best k by silhouette score (k must be < number of ideas).
    silhouette_scores = []
    for k in range(2, min(len(ideas), 10)):
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)  # fixed seed for reproducibility
        labels = kmeans.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        silhouette_scores.append((k, score))

    best_k, best_score = max(silhouette_scores, key=lambda x: x[1])

    return {
        'optimal_clusters': best_k,
        'silhouette_score': best_score,
        'cluster_distribution': compute_cluster_sizes(embeddings, best_k)
    }
```
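A minimal check of the k-selection logic on synthetic points (assumes scikit-learn; two well-separated groups should yield an optimal k of 2):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated groups of four points each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)

scores = []
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append((k, silhouette_score(X, labels)))

best_k = max(scores, key=lambda s: s[1])[0]
print(best_k)  # 2
```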
#### 5.1.3 Semantic Distance from Query

```python
def compute_query_distance(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Measure how far the ideas are from the original query.
    Higher = more novel/distant.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)

    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]

    return {
        'mean_distance': np.mean(distances),
        'max_distance': np.max(distances),
        'min_distance': np.min(distances),
        'std_distance': np.std(distances)
    }
```
### 5.2 Patent Novelty Metrics

#### 5.2.1 Patent Overlap Rate

```python
def compute_patent_novelty(ideas: List[str], query: str) -> dict:
    """
    Search patents for each idea and compute the overlap rate.
    Uses the existing patent_search_service.
    """
    matches = 0
    match_details = []

    for idea in ideas:
        result = patent_search_service.search(idea)
        if result.has_match:
            matches += 1
            match_details.append({
                'idea': idea,
                'patent': result.best_match
            })

    return {
        'novelty_rate': 1 - (matches / len(ideas)),
        'match_count': matches,
        'total_ideas': len(ideas),
        'match_details': match_details
    }
```
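The novelty-rate arithmetic can be exercised with a stubbed match predicate (the real `patent_search_service` is the project's own component; the stub and example ideas below are purely illustrative):

```python
from typing import Callable, List

def novelty_rate(ideas: List[str], has_match: Callable[[str], bool]) -> float:
    """1 - (ideas with a patent match / total ideas); higher = more novel."""
    matches = sum(1 for idea in ideas if has_match(idea))
    return 1 - matches / len(ideas)

# Stub: pretend only ideas mentioning "wheel" already appear in patents.
ideas = ["folding wheel chair", "singing chair", "chair that brews tea", "compostable chair"]
print(novelty_rate(ideas, lambda i: "wheel" in i))  # 0.75
```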
### 5.3 Hallucination/Nonsense Metrics (RQ6)

Since our design intentionally excludes the original query during keyword generation, we need to measure the "cost" of this approach.

#### 5.3.1 LLM-as-Judge for Relevance

```python
def compute_relevance_score(query: str, ideas: List[str], judge_model: str) -> dict:
    """
    Use an LLM to judge whether each idea is relevant/applicable to the original query.
    """
    relevant_count = 0
    nonsense_count = 0
    results = []

    for idea in ideas:
        prompt = f"""
Original query: {query}
Generated idea: {idea}

Is this idea relevant and applicable to the original query?
Rate: 1 (nonsense/irrelevant), 2 (weak connection), 3 (relevant)

Return JSON: {{"score": N, "reason": "brief explanation"}}
"""
        result = llm_judge(prompt, model=judge_model)
        results.append(result)
        if result['score'] == 1:
            nonsense_count += 1
        elif result['score'] >= 2:
            relevant_count += 1

    return {
        'relevance_rate': relevant_count / len(ideas),
        'nonsense_rate': nonsense_count / len(ideas),
        'details': results
    }
```
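LLM judges often wrap the requested JSON in extra prose, so the judge output should be parsed defensively. A stdlib-only sketch (the field names follow the prompt above; `parse_judge_reply` is a hypothetical helper, not existing project code):

```python
import json
import re
from typing import Optional

def parse_judge_reply(text: str) -> Optional[dict]:
    """Extract the first {...} object from a judge reply and validate the score."""
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if obj.get('score') not in (1, 2, 3):
        return None  # out-of-range score: treat as a failed judgment and re-query
    return obj

reply = 'Sure! Here is my rating: {"score": 2, "reason": "weak but plausible link"}'
print(parse_judge_reply(reply)['score'])  # 2
```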
#### 5.3.2 Semantic Distance Threshold Analysis

```python
def analyze_distance_threshold(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze which ideas exceed a "too far" semantic distance threshold.
    Ideas beyond the threshold may be creative OR nonsensical.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)

    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]

    # Thresholds to be calibrated on pilot data.
    CREATIVE_THRESHOLD = 0.6   # ideas at least this far are "creative"
    NONSENSE_THRESHOLD = 0.85  # ideas this far may be "nonsense"

    return {
        'creative_zone': sum(1 for d in distances if CREATIVE_THRESHOLD <= d < NONSENSE_THRESHOLD),
        'potential_nonsense': sum(1 for d in distances if d >= NONSENSE_THRESHOLD),
        'safe_zone': sum(1 for d in distances if d < CREATIVE_THRESHOLD),
        'distance_distribution': distances
    }
```
### 5.4 Metrics Summary Table

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Mean Pairwise Distance** | avg(1 - cos_sim(i, j)) over all pairs | Higher = more diverse |
| **Silhouette Score** | Cluster cohesion vs. separation | Higher = clearer clusters |
| **Optimal Cluster Count** | argmax(silhouette) | More clusters = more themes |
| **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from the original |
| **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel |

### 5.5 Nonsense/Hallucination Analysis (RQ6) - Three Methods

| Method | Metric | How it works | Pros/Cons |
|--------|--------|--------------|-----------|
| **Automatic** | Semantic distance threshold | Ideas with distance > 0.85 are flagged as "potential nonsense" | Fast, cheap; may miss contextual nonsense |
| **LLM-as-Judge** | Relevance score (1-3) | GPT-4 rates whether the idea is relevant to the original query | Moderate cost; good balance |
| **Human Evaluation** | Relevance rating (1-7 Likert) | Humans rate coherence/relevance | Gold standard; most expensive |

**Triangulation**: Compare all three methods to validate the findings:

- If automatic + LLM + human agree → high confidence
- If they disagree → investigate why (interesting edge cases)
---

## 6. Human Evaluation Protocol

### 6.1 Participants

#### 6.1.1 Recruitment

- **Platform**: Prolific, MTurk, or domain experts
- **Sample Size**: 60 evaluators (20 per condition group)
- **Criteria**:
  - Native English speakers
  - Bachelor's degree or higher
  - Attention check pass rate > 80%

#### 6.1.2 Compensation

- $15/hour equivalent
- ~30 minutes per session
- Bonus for high-quality ratings
### 6.2 Rating Scales

#### 6.2.1 Novelty (7-point Likert)

```
How novel/surprising is this idea?
1 = Not at all novel (very common/obvious)
4 = Moderately novel
7 = Extremely novel (never seen before)
```

#### 6.2.2 Usefulness (7-point Likert)

```
How useful/practical is this idea?
1 = Not at all useful (impractical)
4 = Moderately useful
7 = Extremely useful (highly practical)
```

#### 6.2.3 Creativity (7-point Likert)

```
How creative is this idea overall?
1 = Not at all creative
4 = Moderately creative
7 = Extremely creative
```

#### 6.2.4 Relevance/Coherence (7-point Likert) - For RQ6

```
How relevant and coherent is this idea with respect to the original query?
1 = Nonsense/completely irrelevant (no logical connection)
2 = Very weak connection (hard to see relevance)
3 = Weak connection (requires a stretch to see relevance)
4 = Moderate connection (somewhat relevant)
5 = Good connection (clearly relevant)
6 = Strong connection (directly applicable)
7 = Perfect fit (highly relevant and coherent)
```

**Note**: This scale specifically measures the "cost" of context-free generation.

- Ideas with high novelty but low relevance (1-3) = potential hallucination
- Ideas with high novelty AND high relevance (5-7) = successful creative leap
### 6.3 Procedure

1. **Introduction** (5 min)
   - Study purpose (without revealing hypotheses)
   - Rating scale explanation
   - Practice with 3 example ideas

2. **Training** (5 min)
   - Rate 5 calibration ideas with feedback
   - Discuss edge cases

3. **Main Evaluation** (20 min)
   - Rate 30 ideas (randomized order)
   - 3 attention check items embedded
   - Break after 15 ideas

4. **Debriefing** (2 min)
   - Demographics
   - Open-ended feedback

### 6.4 Quality Control

| Check | Threshold | Action |
|-------|-----------|--------|
| Attention checks | < 2/3 correct | Exclude |
| Completion time | < 10 min | Flag for review |
| Variance in ratings | All same score | Exclude |
| Inter-rater reliability | Cronbach's α < 0.7 | Review ratings |
### 6.5 Analysis Plan

#### 6.5.1 Reliability

- Cronbach's alpha for each scale
- ICC (intraclass correlation) for inter-rater agreement

#### 6.5.2 Main Analysis

- Mixed-effects ANOVA: Condition × Query
- Post-hoc: Tukey HSD for pairwise comparisons
- Effect sizes: Cohen's d

#### 6.5.3 Correlation with Automatic Metrics

- Pearson correlation: human ratings vs. semantic diversity
- Regression: predict human ratings from automatic metrics
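Cronbach's alpha is simple enough to compute directly from the rating matrix; a minimal NumPy sketch (respondents as rows, scale items as columns; the example data is synthetic):

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: shape (n_respondents, k_items). Uses sample variances (ddof=1)."""
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Three perfectly consistent items -> alpha = 1.0
r = np.array([[1, 1, 1], [4, 4, 4], [7, 7, 7]], dtype=float)
print(round(cronbach_alpha(r), 3))  # 1.0
```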
---
## 7. Experimental Procedure

### 7.1 Phase 1: Idea Generation

```
For each query Q in QuerySet:
    For each condition C in Conditions:

        If C == "Direct":
            # No attributes, no experts
            ideas = direct_llm_generation(Q, n=20)

        Elif C == "Expert-Only":
            # No attributes, with experts
            experts = generate_experts(Q, n=4)
            ideas = expert_generation_whole_query(Q, experts, ideas_per_expert=5)

        Elif C == "Attribute-Only":
            # With attributes, no experts
            attributes = decompose_attributes(Q)
            ideas = direct_generation_per_attribute(Q, attributes, ideas_per_attr=5)

        Elif C == "Full-Pipeline":
            # With attributes, with experts
            attributes = decompose_attributes(Q)
            experts = generate_experts(Q, n=4)
            ideas = expert_transformation(Q, attributes, experts,
                                          ideas_per_combo=1 or 2)  # chosen so each query totals 20 ideas

        Elif C == "Random-Perspective":
            # Control: random words instead of experts
            perspectives = random.sample(RANDOM_WORDS, 4)
            ideas = perspective_generation(Q, perspectives, ideas_per=5)

        Store(Q, C, ideas)
```
### 7.2 Phase 2: Automatic Metrics

```
For each (Q, C, ideas) in Results:
    metrics = {
        'diversity': compute_mean_pairwise_distance(ideas),
        'clusters': compute_cluster_metrics(ideas),
        'query_distance': compute_query_distance(Q, ideas),
        'patent_novelty': compute_patent_novelty(ideas, Q)
    }
    Store(Q, C, metrics)
```
### 7.3 Phase 3: Human Evaluation

```
# Sample selection
selected_queries = random.sample(QuerySet, 10)
selected_conditions = ["Direct", "Expert-Only", "Full-Pipeline"]

# Create evaluation set
evaluation_items = []
For each Q in selected_queries:
    For each C in selected_conditions:
        ideas = Get(Q, C)
        For each idea in ideas:
            evaluation_items.append((Q, C, idea))

# Randomize and assign to evaluators
random.shuffle(evaluation_items)
assignments = assign_to_evaluators(evaluation_items, n_evaluators=60)

# Collect ratings
ratings = collect_human_ratings(assignments)
```
### 7.4 Phase 4: Analysis

```
# Automatic metrics analysis
Run ANOVA: diversity ~ condition + query + condition:query
Run post-hoc: Tukey HSD for condition pairs
Compute effect sizes

# Human ratings analysis
Check reliability: Cronbach's alpha, ICC
Run mixed-effects model: rating ~ condition + (1|evaluator) + (1|query)
Compute correlations: human vs automatic metrics

# Visualization
Plot: Diversity by condition (box plots)
Plot: t-SNE of idea embeddings colored by condition
Plot: Nonsense rate by condition (RQ6)
```
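The first analysis step can be sketched with SciPy's one-way ANOVA on per-query diversity scores (toy numbers for three conditions; the full mixed-effects model requires statsmodels or R):

```python
from scipy.stats import f_oneway

# Toy per-query diversity scores for three conditions.
direct        = [0.31, 0.29, 0.33, 0.30, 0.32]
expert_only   = [0.41, 0.39, 0.43, 0.40, 0.42]
full_pipeline = [0.52, 0.50, 0.54, 0.51, 0.53]

stat, p = f_oneway(direct, expert_only, full_pipeline)
print(p < 0.05)  # True: the groups differ
```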
---

## 8. Implementation Checklist

### 8.1 Code to Implement

- [ ] `experiments/generate_ideas.py` - Idea generation for all conditions
- [ ] `experiments/compute_metrics.py` - Automatic metric computation
- [ ] `experiments/export_for_evaluation.py` - Prepare the human evaluation set
- [ ] `experiments/analyze_results.py` - Statistical analysis
- [ ] `experiments/visualize.py` - Generate figures

### 8.2 Data Files to Create

- [ ] `data/queries.json` - 30 queries with metadata
- [ ] `data/random_words.json` - Random perspective words
- [ ] `data/generated_ideas/` - Raw idea outputs
- [ ] `data/metrics/` - Computed metric results
- [ ] `data/human_ratings/` - Collected ratings

### 8.3 Analysis Outputs

- [ ] `results/diversity_by_condition.csv`
- [ ] `results/patent_novelty_by_condition.csv`
- [ ] `results/human_ratings_summary.csv`
- [ ] `results/statistical_tests.txt`
- [ ] `figures/` - All visualizations

---
## 9. Expected Results & Hypotheses

### 9.1 Primary Hypotheses (2×2 Factorial)

| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1: Main Effect of Attributes** | Attribute-Only > Direct | Semantic diversity |
| **H2: Main Effect of Experts** | Expert-Only > Direct | Semantic diversity |
| **H3: Interaction Effect** | Full Pipeline > (Attribute-Only + Expert-Only - Direct) | Semantic diversity |
| **H4: Novelty** | Full Pipeline > all other conditions | Patent novelty rate |
| **H5: Expert vs Random** | Expert-Only > Random-Perspective | Validates that expert knowledge matters |
| **H6: Novelty-Relevance Tradeoff** | Full Pipeline has a higher nonsense rate than Direct, but an acceptable one (<20%) | Nonsense rate |

### 9.2 Expected Pattern

```
                     Without Experts        With Experts
                     ---------------        ------------
Without Attributes   Direct (low)           Expert-Only (medium)
With Attributes      Attr-Only (medium)     Full Pipeline (high)
```

**Expected interaction**: The combination (Full Pipeline) should produce super-additive effects: the benefit of experts is amplified when combined with structured attributes.

### 9.3 Expected Effect Sizes

Based on related work:

- Main effect of attributes: d = 0.3-0.5 (small to medium)
- Main effect of experts: d = 0.4-0.6 (medium)
- Interaction effect: d = 0.2-0.4 (small)
- Patent novelty increase: 20-40% improvement
- Human creativity rating: d = 0.3-0.5 (small to medium)

### 9.4 Potential Confounds

| Confound | Mitigation |
|----------|-----------|
| Query difficulty | Crossed design (all queries × all conditions) |
| LLM variability | Multiple runs, fixed seed where possible |
| Evaluator bias | Randomized presentation, blinding |
| Order effects | Counterbalancing in human evaluation |
---

## 10. Timeline

| Week | Activity |
|------|----------|
| 1-2 | Implement idea generation scripts |
| 3 | Generate all ideas (5 conditions × 30 queries) |
| 4 | Compute automatic metrics |
| 5 | Design and pilot the human evaluation |
| 6-7 | Run the human evaluation (60 participants) |
| 8 | Analyze results |
| 9-10 | Write the paper |
| 11 | Internal review |
| 12 | Submit |
## 11. Appendix: Direct Generation Prompt

For baseline condition C1 (Direct LLM generation):

```
You are a creative innovation consultant. Generate 20 unique and creative ideas
for improving or reimagining a [QUERY].

Requirements:
- Each idea should be distinct and novel
- Ideas should range from incremental improvements to radical innovations
- Consider different aspects: materials, functions, user experiences, contexts
- Provide a brief (15-30 word) description for each idea

Output format:
1. [Idea keyword]: [Description]
2. [Idea keyword]: [Description]
...
20. [Idea keyword]: [Description]
```

---
## 12. Appendix: Random Perspective Words

For condition C5 (Random-Perspective), sample from:

```json
[
  "ocean", "mountain", "forest", "desert", "cave",
  "microscope", "telescope", "kaleidoscope", "prism", "lens",
  "butterfly", "elephant", "octopus", "eagle", "ant",
  "sunrise", "thunderstorm", "rainbow", "fog", "aurora",
  "clockwork", "origami", "mosaic", "symphony", "ballet",
  "ancient", "futuristic", "organic", "crystalline", "liquid",
  "whisper", "explosion", "rhythm", "silence", "echo"
]
```

This tests whether ANY perspective shift helps, or if EXPERT perspectives specifically matter.