# Experimental Protocol: Expert-Augmented LLM Ideation
## Executive Summary
This document specifies an experimental design for testing the hypothesis that multi-expert, LLM-based ideation produces more diverse and more novel ideas than direct LLM generation.
---
## 1. Research Questions
| ID | Research Question |
|----|-------------------|
| **RQ1** | Does multi-expert generation produce higher semantic diversity than direct LLM generation? |
| **RQ2** | Does multi-expert generation produce ideas with lower patent overlap (higher novelty)? |
| **RQ3** | What is the optimal number of experts for maximizing diversity? |
| **RQ4** | How do different expert sources (LLM vs Curated vs DBpedia) affect idea quality? |
| **RQ5** | Does structured attribute decomposition enhance the multi-expert effect? |
---
## 2. Experimental Design Overview
### 2.1 Design Type
**Mixed design**: between-subjects for the main generation conditions, within-subjects across queries (every condition is run on every query)
### 2.2 Variables
#### Independent Variables (Manipulated)
| Variable | Levels | System Parameter |
|----------|--------|------------------|
| **Generation Method** | 5 levels (see conditions) | Condition-dependent |
| **Expert Count** | 1, 2, 4, 6, 8 | `expert_count` |
| **Expert Source** | LLM, Curated, DBpedia | `expert_source` |
| **Attribute Structure** | With/Without decomposition | Pipeline inclusion |
#### Dependent Variables (Measured)
| Variable | Measurement Method |
|----------|-------------------|
| **Semantic Diversity** | Mean pairwise cosine distance (embeddings) |
| **Cluster Spread** | Number of clusters, silhouette score |
| **Patent Novelty** | 1 - (ideas with patent match / total ideas) |
| **Semantic Distance** | Distance from query centroid |
| **Human Novelty Rating** | 7-point Likert scale |
| **Human Usefulness Rating** | 7-point Likert scale |
| **Human Creativity Rating** | 7-point Likert scale |
#### Control Variables (Held Constant)
| Variable | Fixed Value |
|----------|-------------|
| LLM Model | Qwen3:8b (or specify) |
| Temperature | 0.7 |
| Total Ideas per Query | 20 |
| Keywords per Expert | 1 |
| Deduplication | Disabled for raw comparison |
| Language | English (for patent search) |
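
These constants can be pinned in a single config object so every run is auditable. A minimal sketch, assuming the §2.2 parameter names (`expert_count`, `expert_source`); the `ExperimentConfig` dataclass itself is hypothetical, not part of the existing codebase:

```python
from dataclasses import dataclass
from enum import Enum

class ExpertSource(Enum):
    """Mirrors the §3.3 identifiers; the string values are assumptions."""
    LLM = "llm"
    CURATED = "curated"
    DBPEDIA = "dbpedia"

@dataclass(frozen=True)
class ExperimentConfig:
    """Control variables held constant across all conditions (§2.2)."""
    llm_model: str = "qwen3:8b"
    temperature: float = 0.7
    total_ideas_per_query: int = 20
    keywords_per_expert: int = 1
    deduplication: bool = False          # disabled for raw comparison
    language: str = "en"                 # English, for patent search
    # Manipulated per condition:
    expert_count: int = 4
    expert_source: ExpertSource = ExpertSource.LLM
```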
---
## 3. Experimental Conditions
### 3.1 Main Study: Generation Method Comparison
| Condition | Description | Implementation |
|-----------|-------------|----------------|
| **C1: Direct** | Direct LLM generation | Prompt: "Generate 20 creative ideas for [query]" |
| **C2: Single-Expert** | 1 expert × 20 ideas | `expert_count=1`, `keywords_per_expert=20` |
| **C3: Multi-Expert-4** | 4 experts × 5 ideas each | `expert_count=4`, `keywords_per_expert=5` |
| **C4: Multi-Expert-8** | 8 experts × 2-3 ideas each | `expert_count=8`, `keywords_per_expert=2-3` |
| **C5: Random-Perspective** | 4 random words as "perspectives" | Custom prompt with random nouns |
### 3.2 Expert Count Study
| Condition | Expert Count | Ideas per Expert |
|-----------|--------------|------------------|
| **E1** | 1 | 20 |
| **E2** | 2 | 10 |
| **E4** | 4 | 5 |
| **E6** | 6 | 3-4 |
| **E8** | 8 | 2-3 |
### 3.3 Expert Source Study
| Condition | Source | Implementation |
|-----------|--------|----------------|
| **S-LLM** | LLM-generated | `expert_source=ExpertSource.LLM` |
| **S-Curated** | Curated 210 occupations | `expert_source=ExpertSource.CURATED` |
| **S-DBpedia** | DBpedia 2164 occupations | `expert_source=ExpertSource.DBPEDIA` |
| **S-Random** | Random word "experts" | Custom implementation |
---
## 4. Query Dataset
### 4.1 Design Principles
- **Diversity**: Cover multiple domains (consumer products, technology, services, abstract concepts)
- **Complexity Variation**: Simple objects to complex systems
- **Familiarity Variation**: Common items to specialized equipment
- **Cultural Neutrality**: Concepts understandable across cultures
### 4.2 Query Set (30 Queries)
#### Category A: Everyday Objects (10)
| ID | Query | Complexity |
|----|-------|------------|
| A1 | Chair | Low |
| A2 | Umbrella | Low |
| A3 | Backpack | Low |
| A4 | Coffee mug | Low |
| A5 | Bicycle | Medium |
| A6 | Refrigerator | Medium |
| A7 | Smartphone | Medium |
| A8 | Running shoes | Medium |
| A9 | Kitchen knife | Low |
| A10 | Desk lamp | Low |
#### Category B: Technology & Tools (10)
| ID | Query | Complexity |
|----|-------|------------|
| B1 | Solar panel | Medium |
| B2 | Electric vehicle | High |
| B3 | 3D printer | High |
| B4 | Drone | Medium |
| B5 | Smart thermostat | Medium |
| B6 | Noise-canceling headphones | Medium |
| B7 | Water purifier | Medium |
| B8 | Wind turbine | High |
| B9 | Robotic vacuum | Medium |
| B10 | Wearable fitness tracker | Medium |
#### Category C: Services & Systems (10)
| ID | Query | Complexity |
|----|-------|------------|
| C1 | Food delivery service | Medium |
| C2 | Online education platform | High |
| C3 | Healthcare appointment system | High |
| C4 | Public transportation | High |
| C5 | Hotel booking system | Medium |
| C6 | Personal finance app | Medium |
| C7 | Grocery shopping experience | Medium |
| C8 | Parking solution | Medium |
| C9 | Elderly care service | High |
| C10 | Waste management system | High |
### 4.3 Sample Size Justification
Based on [CHI meta-study on effect sizes](https://dl.acm.org/doi/10.1145/3706598.3713671):
- **Queries**: 30 (crossed with conditions)
- **Expected effect size**: d = 0.5 (medium)
- **Power target**: 80% (checked in the sketch below)
- **For automatic metrics**: 30 queries × 5 conditions × 20 ideas = 3,000 ideas
- **For human evaluation**: Subset of 10 queries × 3 conditions × 20 ideas = 600 ideas
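
The 80% power target at d = 0.5 implies roughly 64 observations per group, which the crossed design clears comfortably. A quick check, here with statsmodels (the specific package is an assumption; any power calculator gives the same number):

```python
from statsmodels.stats.power import TTestIndPower

# Two-sided independent-samples t-test: d = 0.5, alpha = .05, power = .80
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group: {n_per_group:.1f}")  # ~63.8; each condition has 30 x 20 = 600 ideas
```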
---
## 5. Automatic Metrics Collection
### 5.1 Semantic Diversity Metrics
#### 5.1.1 Mean Pairwise Distance (Primary)
```python
from typing import List, Tuple

import numpy as np
from scipy.spatial.distance import cosine  # cosine distance = 1 - cosine similarity

def compute_mean_pairwise_distance(
    ideas: List[str], embedding_model: str
) -> Tuple[float, float]:
    """
    Compute the mean (and std) cosine distance over all idea pairs.
    Higher mean = more diverse.
    """
    embeddings = get_embeddings(ideas, model=embedding_model)  # project helper
    n = len(embeddings)
    distances = [
        cosine(embeddings[i], embeddings[j])
        for i in range(n) for j in range(i + 1, n)
    ]
    return float(np.mean(distances)), float(np.std(distances))
```
#### 5.1.2 Cluster Analysis
```python
from typing import List

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def compute_cluster_metrics(ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze idea clustering patterns: pick k by silhouette score.
    """
    embeddings = np.asarray(get_embeddings(ideas, model=embedding_model))
    # Find optimal k in [2, min(n, 10)) using the silhouette score
    silhouette_scores = []
    for k in range(2, min(len(ideas), 10)):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        silhouette_scores.append((k, silhouette_score(embeddings, labels)))
    best_k, best_score = max(silhouette_scores, key=lambda x: x[1])
    # Cluster sizes at the best k (inlines the former compute_cluster_sizes helper)
    best_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(embeddings)
    return {
        'optimal_clusters': best_k,
        'silhouette_score': best_score,
        'cluster_distribution': np.bincount(best_labels).tolist(),
    }
```
#### 5.1.3 Semantic Distance from Query
```python
from typing import List

import numpy as np
from scipy.spatial.distance import cosine  # cosine distance = 1 - cosine similarity

def compute_query_distance(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Measure how far ideas are from the original query.
    Higher = more novel/distant.
    """
    query_emb = get_embedding(query, model=embedding_model)    # project helpers
    idea_embs = get_embeddings(ideas, model=embedding_model)
    distances = [cosine(query_emb, e) for e in idea_embs]
    return {
        'mean_distance': float(np.mean(distances)),
        'max_distance': float(np.max(distances)),
        'min_distance': float(np.min(distances)),
        'std_distance': float(np.std(distances)),
    }
```
### 5.2 Patent Novelty Metrics
#### 5.2.1 Patent Overlap Rate
```python
from typing import List

def compute_patent_novelty(ideas: List[str], query: str) -> dict:
    """
    Search patents for each idea and compute the overlap rate.
    Uses the existing patent_search_service; `query` is kept for
    query-scoped search backends that need it.
    """
    matches = 0
    match_details = []
    for idea in ideas:
        result = patent_search_service.search(idea)
        if result.has_match:
            matches += 1
            match_details.append({
                'idea': idea,
                'patent': result.best_match,
            })
    return {
        'novelty_rate': 1 - (matches / len(ideas)),
        'match_count': matches,
        'total_ideas': len(ideas),
        'match_details': match_details,
    }
```
### 5.3 Metrics Summary Table
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Mean Pairwise Distance** | avg(1 - cos_sim(i, j)) for all pairs | Higher = more diverse |
| **Silhouette Score** | Cluster cohesion vs separation | Higher = clearer clusters |
| **Optimal Cluster Count** | argmax(silhouette) | More clusters = more themes |
| **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from original |
| **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel |
---
## 6. Human Evaluation Protocol
### 6.1 Participants
#### 6.1.1 Recruitment
- **Platform**: Prolific, MTurk, or domain experts
- **Sample Size**: 60 evaluators (20 per condition group)
- **Criteria**:
- Native English speakers
- Bachelor's degree or higher
- Attention check pass rate > 80%
#### 6.1.2 Compensation
- $15/hour equivalent
- ~30 minutes per session
- Bonus for high-quality ratings
### 6.2 Rating Scales
#### 6.2.1 Novelty (7-point Likert)
```
How novel/surprising is this idea?
1 = Not at all novel (very common/obvious)
4 = Moderately novel
7 = Extremely novel (never seen before)
```
#### 6.2.2 Usefulness (7-point Likert)
```
How useful/practical is this idea?
1 = Not at all useful (impractical)
4 = Moderately useful
7 = Extremely useful (highly practical)
```
#### 6.2.3 Creativity (7-point Likert)
```
How creative is this idea overall?
1 = Not at all creative
4 = Moderately creative
7 = Extremely creative
```
### 6.3 Procedure
1. **Introduction** (5 min)
- Study purpose (without revealing hypotheses)
- Rating scale explanation
- Practice with 3 example ideas
2. **Training** (5 min)
- Rate 5 calibration ideas with feedback
- Discuss edge cases
3. **Main Evaluation** (20 min)
- Rate 30 ideas (randomized order)
- 3 attention check items embedded
- Break after 15 ideas
4. **Debriefing** (2 min)
- Demographics
- Open-ended feedback
### 6.4 Quality Control
| Check | Threshold | Action |
|-------|-----------|--------|
| Attention checks | < 2/3 correct | Exclude |
| Completion time | < 10 min | Flag for review |
| Variance in ratings | All same score | Exclude |
| Inter-rater reliability | Cronbach's α < 0.7 | Review ratings |
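
These exclusion rules map directly onto a dataframe filter. A minimal sketch with pandas, assuming one row per evaluator with hypothetical column names (`attention_correct`, `minutes`, `rating_std`):

```python
import pandas as pd

def apply_quality_control(raters: pd.DataFrame) -> pd.DataFrame:
    """Apply the exclusion/flagging rules above to a per-evaluator summary table."""
    raters = raters.copy()
    # Exclude: fewer than 2 of 3 attention checks correct
    raters = raters[raters["attention_correct"] >= 2]
    # Exclude: zero variance across all ratings (straight-lining)
    raters = raters[raters["rating_std"] > 0]
    # Flag for manual review (not excluded): implausibly fast sessions
    raters["flag_fast"] = raters["minutes"] < 10
    return raters
```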
### 6.5 Analysis Plan
#### 6.5.1 Reliability
- Cronbach's alpha for each scale
- ICC (Intraclass Correlation) for inter-rater agreement (see the sketch after §6.5.3)
#### 6.5.2 Main Analysis
- Mixed-effects ANOVA: Condition × Query
- Post-hoc: Tukey HSD for pairwise comparisons
- Effect sizes: Cohen's d
#### 6.5.3 Correlation with Automatic Metrics
- Pearson correlation: Human ratings vs semantic diversity
- Regression: Predict human ratings from automatic metrics
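
A sketch of the reliability and mixed-model steps, assuming a long-format ratings table with columns `item`, `rating`, `condition`, `evaluator`, and `query` (pingouin and statsmodels are assumed tooling; Cronbach's α runs on the wide pivot of the same table). statsmodels has no native crossed random effects, so `query` enters as a variance component:

```python
import pandas as pd
import pingouin as pg
import statsmodels.formula.api as smf

ratings = pd.read_csv("data/human_ratings/ratings_long.csv")  # hypothetical path

# 6.5.1 Inter-rater agreement
icc = pg.intraclass_corr(data=ratings, targets="item",
                         raters="evaluator", ratings="rating")

# 6.5.2 Mixed-effects model: random intercept per evaluator,
# with query folded in as a variance component
model = smf.mixedlm("rating ~ condition", ratings,
                    groups=ratings["evaluator"],
                    vc_formula={"query": "0 + C(query)"}).fit()
print(model.summary())
```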
---
## 7. Experimental Procedure
### 7.1 Phase 1: Idea Generation
```
For each query Q in QuerySet:
For each condition C in Conditions:
If C == "Direct":
ideas = direct_llm_generation(Q, n=20)
Elif C == "Single-Expert":
expert = generate_expert(Q, n=1)
ideas = expert_transformation(Q, expert, ideas_per_expert=20)
Elif C == "Multi-Expert-4":
experts = generate_experts(Q, n=4)
ideas = expert_transformation(Q, experts, ideas_per_expert=5)
Elif C == "Multi-Expert-8":
experts = generate_experts(Q, n=8)
ideas = expert_transformation(Q, experts, ideas_per_expert=2-3)
Elif C == "Random-Perspective":
perspectives = random.sample(RANDOM_WORDS, 4)
ideas = perspective_generation(Q, perspectives, ideas_per=5)
Store(Q, C, ideas)
```
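
A runnable skeleton of this loop (the Random-Perspective branch follows the same pattern and is omitted). The generation helpers are assumed to exist with roughly the signatures used in the pseudocode above; output lands as one JSON file per (query, condition) cell in `data/generated_ideas/`, matching §8.2:

```python
import json
from pathlib import Path
from typing import List

OUT_DIR = Path("data/generated_ideas")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# condition name -> (expert_count, ideas_per_expert); None = direct generation
CONDITIONS = {
    "direct": None,
    "single_expert": (1, 20),
    "multi_expert_4": (4, 5),
    "multi_expert_8": (8, 3),  # capped at 20 ideas total below
}

def run_phase1(queries: List[str]) -> None:
    for query in queries:
        for name, params in CONDITIONS.items():
            if params is None:
                ideas = direct_llm_generation(query, n=20)           # assumed helper
            else:
                count, per_expert = params
                experts = generate_experts(query, n=count)           # assumed helper
                ideas = expert_transformation(                       # assumed helper
                    query, experts, ideas_per_expert=per_expert)[:20]
            out = OUT_DIR / f"{query.replace(' ', '_')}__{name}.json"
            out.write_text(json.dumps(
                {"query": query, "condition": name, "ideas": ideas}, indent=2))
```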
### 7.2 Phase 2: Automatic Metrics
```
For each (Q, C, ideas) in Results:
metrics = {
'diversity': compute_mean_pairwise_distance(ideas),
'clusters': compute_cluster_metrics(ideas),
'query_distance': compute_query_distance(Q, ideas),
'patent_novelty': compute_patent_novelty(ideas, Q)
}
Store(Q, C, metrics)
```
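
A sketch of the metrics pass, reusing the §5 functions and flattening everything into the `results/diversity_by_condition.csv` layout from §8.3 (`EMB_MODEL` is a placeholder name):

```python
import json
from pathlib import Path

import pandas as pd

EMB_MODEL = "all-MiniLM-L6-v2"  # placeholder embedding model name

rows = []
for path in Path("data/generated_ideas").glob("*.json"):
    record = json.loads(path.read_text())
    query, condition, ideas = record["query"], record["condition"], record["ideas"]
    mean_dist, std_dist = compute_mean_pairwise_distance(ideas, EMB_MODEL)
    rows.append({
        "query": query,
        "condition": condition,
        "mean_pairwise_distance": mean_dist,
        "std_pairwise_distance": std_dist,
        "optimal_clusters": compute_cluster_metrics(ideas, EMB_MODEL)["optimal_clusters"],
        **compute_query_distance(query, ideas, EMB_MODEL),
        "patent_novelty": compute_patent_novelty(ideas, query)["novelty_rate"],
    })

pd.DataFrame(rows).to_csv("results/diversity_by_condition.csv", index=False)
```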
### 7.3 Phase 3: Human Evaluation
```
# Sample selection
selected_queries = random.sample(QuerySet, 10)
selected_conditions = ["Direct", "Multi-Expert-4", "Multi-Expert-8"]
# Create evaluation set
evaluation_items = []
For each Q in selected_queries:
For each C in selected_conditions:
ideas = Get(Q, C)
For each idea in ideas:
evaluation_items.append((Q, C, idea))
# Randomize and assign to evaluators
random.shuffle(evaluation_items)
assignments = assign_to_evaluators(evaluation_items, n_evaluators=60)
# Collect ratings
ratings = collect_human_ratings(assignments)
```
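
Given 600 items rated 3 times each (1,800 ratings = 60 evaluators × 30 items), `assign_to_evaluators` can be a simple chunking of the tripled, shuffled item list. A minimal sketch (a production version should also prevent duplicate items within one batch):

```python
import random
from typing import List

def assign_to_evaluators(items: List, n_evaluators: int = 60,
                         ratings_per_item: int = 3,
                         batch_size: int = 30) -> List[List]:
    """Chunk (query, condition, idea) items into per-evaluator batches."""
    pool = items * ratings_per_item  # each item rated by 3 evaluators
    random.shuffle(pool)
    assert len(pool) == n_evaluators * batch_size, "item count must match the design"
    return [pool[i * batch_size:(i + 1) * batch_size] for i in range(n_evaluators)]
```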
### 7.4 Phase 4: Analysis
```
# Automatic metrics analysis
Run ANOVA: diversity ~ condition + query + condition:query
Run post-hoc: Tukey HSD for condition pairs
Compute effect sizes
# Human ratings analysis
Check reliability: Cronbach's alpha, ICC
Run mixed-effects model: rating ~ condition + (1|evaluator) + (1|query)
Compute correlations: human vs automatic metrics
# Visualization
Plot: Diversity by condition (box plots)
Plot: t-SNE of idea embeddings colored by condition
Plot: Expert count vs diversity curve
```
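
The t-SNE plot from the visualization step, as a sketch (matplotlib and scikit-learn assumed; `embs` is the stacked idea-embedding matrix, `conds` the per-idea condition labels):

```python
from typing import List

import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(embs: np.ndarray, conds: List[str],
              out: str = "figures/tsne_by_condition.png") -> None:
    """Project idea embeddings to 2D and color points by condition."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(embs)
    labels = np.array(conds)
    for cond in sorted(set(conds)):
        mask = labels == cond
        plt.scatter(xy[mask, 0], xy[mask, 1], s=8, alpha=0.6, label=cond)
    plt.legend()
    plt.savefig(out, dpi=200, bbox_inches="tight")
```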
---
## 8. Implementation Checklist
### 8.1 Code to Implement
- [ ] `experiments/generate_ideas.py` - Idea generation for all conditions
- [ ] `experiments/compute_metrics.py` - Automatic metric computation
- [ ] `experiments/export_for_evaluation.py` - Prepare human evaluation set
- [ ] `experiments/analyze_results.py` - Statistical analysis
- [ ] `experiments/visualize.py` - Generate figures
### 8.2 Data Files to Create
- [ ] `data/queries.json` - 30 queries with metadata
- [ ] `data/random_words.json` - Random perspective words
- [ ] `data/generated_ideas/` - Raw idea outputs
- [ ] `data/metrics/` - Computed metric results
- [ ] `data/human_ratings/` - Collected ratings
### 8.3 Analysis Outputs
- [ ] `results/diversity_by_condition.csv`
- [ ] `results/patent_novelty_by_condition.csv`
- [ ] `results/human_ratings_summary.csv`
- [ ] `results/statistical_tests.txt`
- [ ] `figures/` - All visualizations
---
## 9. Expected Results & Hypotheses
### 9.1 Primary Hypotheses
| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1** | Multi-Expert-4 > Single-Expert > Direct | Semantic diversity |
| **H2** | Multi-Expert-8 ≈ Multi-Expert-4 (diminishing returns) | Semantic diversity |
| **H3** | Multi-Expert > Direct | Patent novelty rate |
| **H4** | LLM experts > Curated > DBpedia | Unconventionality |
| **H5** | With attributes > Without attributes | Overall diversity |
### 9.2 Expected Effect Sizes
Based on related work:
- Diversity increase: d = 0.5-0.8 (medium to large)
- Patent novelty increase: 20-40% improvement
- Human creativity rating: d = 0.3-0.5 (small to medium)
### 9.3 Potential Confounds
| Confound | Mitigation |
|----------|-----------|
| Query difficulty | Crossed design (all queries × all conditions) |
| LLM variability | Multiple runs, fixed seed where possible |
| Evaluator bias | Randomized presentation, blinding |
| Order effects | Counterbalancing in human evaluation |
---
## 10. Timeline
| Week | Activity |
|------|----------|
| 1-2 | Implement idea generation scripts |
| 3 | Generate all ideas (5 conditions × 30 queries) |
| 4 | Compute automatic metrics |
| 5 | Design and pilot human evaluation |
| 6-7 | Run human evaluation (60 participants) |
| 8 | Analyze results |
| 9-10 | Write paper |
| 11 | Internal review |
| 12 | Submit |
---
## 11. Appendix: Direct Generation Prompt
For baseline condition C1 (Direct LLM generation):
```
You are a creative innovation consultant. Generate 20 unique and creative ideas
for improving or reimagining a [QUERY].
Requirements:
- Each idea should be distinct and novel
- Ideas should range from incremental improvements to radical innovations
- Consider different aspects: materials, functions, user experiences, contexts
- Provide a brief (15-30 word) description for each idea
Output format:
1. [Idea keyword]: [Description]
2. [Idea keyword]: [Description]
...
20. [Idea keyword]: [Description]
```
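
Since the control table fixes the model as Qwen3:8b, one plausible way to run this prompt is through the Ollama Python client; that serving stack is an assumption, as is the template path below. The numbered output format parses with a single regex:

```python
import re

import ollama  # assumes a local Ollama server with qwen3:8b pulled

# Hypothetical location for the template above
DIRECT_PROMPT = open("prompts/direct_generation.txt").read()

def direct_llm_generation(query: str, n: int = 20) -> list:
    prompt = DIRECT_PROMPT.replace("[QUERY]", query)
    resp = ollama.chat(model="qwen3:8b",
                       messages=[{"role": "user", "content": prompt}],
                       options={"temperature": 0.7})
    text = resp["message"]["content"]
    # Matches lines like "1. Keyword: description"
    return re.findall(r"^\s*\d+\.\s*(.+)$", text, flags=re.MULTILINE)[:n]
```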
---
## 12. Appendix: Random Perspective Words
For condition C5 (Random-Perspective), sample from:
```json
[
"ocean", "mountain", "forest", "desert", "cave",
"microscope", "telescope", "kaleidoscope", "prism", "lens",
"butterfly", "elephant", "octopus", "eagle", "ant",
"sunrise", "thunderstorm", "rainbow", "fog", "aurora",
"clockwork", "origami", "mosaic", "symphony", "ballet",
"ancient", "futuristic", "organic", "crystalline", "liquid",
"whisper", "explosion", "rhythm", "silence", "echo"
]
```
This tests whether *any* perspective shift helps, or whether *expert* perspectives specifically matter.
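
A sketch of how C5 could sample and frame these words (the prompt wording here is illustrative, not the study's final text):

```python
import json
import random

RANDOM_WORDS = json.load(open("data/random_words.json"))

def build_random_perspective_prompt(query: str, k: int = 4) -> str:
    perspectives = random.sample(RANDOM_WORDS, k)
    return (f"Generate 20 creative ideas for improving or reimagining a {query}. "
            f"Generate 5 ideas from the perspective of each of the following: "
            + ", ".join(perspectives) + ".")
```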