# Experimental Protocol: Expert-Augmented LLM Ideation

## Executive Summary

This document outlines a comprehensive experimental design to test the hypothesis that multi-expert LLM-based ideation produces more diverse and novel ideas than direct LLM generation.

---

## 1. Research Questions

| ID | Research Question |
|----|-------------------|
| **RQ1** | Does attribute decomposition improve semantic diversity of generated ideas? |
| **RQ2** | Does expert perspective transformation improve semantic diversity of generated ideas? |
| **RQ3** | Is there an interaction effect between attribute decomposition and expert perspectives? |
| **RQ4** | Which combination produces the highest patent novelty (lowest overlap)? |
| **RQ5** | How do different expert sources (LLM vs Curated vs External) affect idea quality? |
| **RQ6** | Does context-free keyword generation (the current design) increase the hallucination/nonsense rate? |

### Design Note: Context-Free Keyword Generation

Our system intentionally excludes the original query during keyword generation (Stage 1):

```
Stage 1 (Keyword):     Expert sees "木質" (wood) + "會計師" (accountant)
                       Expert does NOT see "椅子" (chair)
                       → Generates: "資金流動" (cash flow)

Stage 2 (Description): Expert sees "椅子" (chair) + "資金流動" (cash flow)
                       → Applies the keyword to the original query
```

**Rationale**: This forces maximum semantic distance in keyword generation.

**Risk**: Some keywords may be too distant, resulting in nonsensical or unusable ideas.

**RQ6 investigates**: What is the hallucination/nonsense rate, and is the tradeoff worthwhile?

---

## 2. Experimental Design Overview

### 2.1 Design Type

**2×2 Factorial Design**: Attribute Decomposition (With/Without) × Expert Perspectives (With/Without)

- Within-subjects for queries (all queries tested across all conditions)

### 2.2 Variables

#### Independent Variables (Manipulated)

| Variable | Levels | Description |
|----------|--------|-------------|
| **Attribute Decomposition** | 2 levels: With / Without | Whether to decompose the query into structured attributes |
| **Expert Perspectives** | 2 levels: With / Without | Whether to use expert personas for idea generation |
| **Expert Source** (secondary) | LLM, Curated, External | Source of expert occupations (tested within Expert=With conditions) |

#### Dependent Variables (Measured)

| Variable | Measurement Method |
|----------|-------------------|
| **Semantic Diversity** | Mean pairwise cosine distance (embeddings) |
| **Cluster Spread** | Number of clusters, silhouette score |
| **Patent Novelty** | 1 - (ideas with patent match / total ideas) |
| **Semantic Distance** | Distance from query centroid |
| **Human Novelty Rating** | 7-point Likert scale |
| **Human Usefulness Rating** | 7-point Likert scale |
| **Human Creativity Rating** | 7-point Likert scale |

#### Control Variables (Held Constant)

| Variable | Fixed Value |
|----------|-------------|
| LLM Model | Qwen3:8b (or specify) |
| Temperature | 0.7 |
| Total Ideas per Query | 20 |
| Keywords per Expert | 1 |
| Deduplication | Disabled for raw comparison |
| Language | English (for patent search) |
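To keep these settings identical across all conditions, the fixed parameters and the four factorial cells can be captured in one configuration object. A minimal sketch; names such as `RunConfig` and `CONDITIONS` are illustrative, not part of the existing codebase:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    model: str = "qwen3:8b"        # LLM held constant across conditions
    temperature: float = 0.7
    ideas_per_query: int = 20
    keywords_per_expert: int = 1
    deduplicate: bool = False      # disabled for raw comparison
    language: str = "en"           # English, for patent search

# The four cells of the 2x2 design, keyed by (attributes, experts);
# names match the conditions defined in Section 3.
CONDITIONS = {
    (False, False): "Direct",
    (False, True):  "Expert-Only",
    (True,  False): "Attribute-Only",
    (True,  True):  "Full-Pipeline",
}

if __name__ == "__main__":
    config = RunConfig()
    for attributes, experts in product([False, True], repeat=2):
        print(CONDITIONS[(attributes, experts)], config)
```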
---

## 3. Experimental Conditions

### 3.1 Main Study: 2×2 Factorial Design

| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| **C1: Direct** | ❌ Without | ❌ Without | Baseline: "Generate 20 creative ideas for [query]" |
| **C2: Expert-Only** | ❌ Without | ✅ With | Expert personas generate for the whole query |
| **C3: Attribute-Only** | ✅ With | ❌ Without | Decompose the query, generate directly per attribute |
| **C4: Full Pipeline** | ✅ With | ✅ With | Decompose the query, experts generate per attribute |

### 3.2 Control Condition

| Condition | Description | Purpose |
|-----------|-------------|---------|
| **C5: Random-Perspective** | 4 random words as "perspectives" | Tests if ANY perspective shift helps, or if EXPERT knowledge specifically matters |

### 3.3 Expert Source Study (Secondary, within Expert=With conditions)

| Condition | Source | Implementation |
|-----------|--------|----------------|
| **S-LLM** | LLM-generated | Query-specific experts generated by the LLM |
| **S-Curated** | Curated occupations | Pre-selected high-quality occupations |
| **S-External** | External sources | Wikidata/ConceptNet occupations |

---

## 4. Query Dataset

### 4.1 Design Principles

- **Diversity**: Cover multiple domains (consumer products, technology, services, abstract concepts)
- **Complexity Variation**: Simple objects to complex systems
- **Familiarity Variation**: Common items to specialized equipment
- **Cultural Neutrality**: Concepts understandable across cultures

### 4.2 Query Set (30 Queries)

#### Category A: Everyday Objects (10)

| ID | Query | Complexity |
|----|-------|------------|
| A1 | Chair | Low |
| A2 | Umbrella | Low |
| A3 | Backpack | Low |
| A4 | Coffee mug | Low |
| A5 | Bicycle | Medium |
| A6 | Refrigerator | Medium |
| A7 | Smartphone | Medium |
| A8 | Running shoes | Medium |
| A9 | Kitchen knife | Low |
| A10 | Desk lamp | Low |

#### Category B: Technology & Tools (10)

| ID | Query | Complexity |
|----|-------|------------|
| B1 | Solar panel | Medium |
| B2 | Electric vehicle | High |
| B3 | 3D printer | High |
| B4 | Drone | Medium |
| B5 | Smart thermostat | Medium |
| B6 | Noise-canceling headphones | Medium |
| B7 | Water purifier | Medium |
| B8 | Wind turbine | High |
| B9 | Robotic vacuum | Medium |
| B10 | Wearable fitness tracker | Medium |

#### Category C: Services & Systems (10)

| ID | Query | Complexity |
|----|-------|------------|
| C1 | Food delivery service | Medium |
| C2 | Online education platform | High |
| C3 | Healthcare appointment system | High |
| C4 | Public transportation | High |
| C5 | Hotel booking system | Medium |
| C6 | Personal finance app | Medium |
| C7 | Grocery shopping experience | Medium |
| C8 | Parking solution | Medium |
| C9 | Elderly care service | High |
| C10 | Waste management system | High |

### 4.3 Sample Size Justification

Based on [CHI meta-study on effect sizes](https://dl.acm.org/doi/10.1145/3706598.3713671):

- **Queries**: 30 (crossed with conditions)
- **Expected effect size**: d = 0.5 (medium)
- **Power target**: 80%
- **For automatic metrics**: 30 queries × 5 conditions × 20 ideas = 3,000 ideas
- **For human evaluation**: subset of 10 queries × 3 conditions × 20 ideas = 600 ideas

A quick power check for the query-level sample size is sketched below.
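Assuming the primary test is a paired comparison between two conditions at the query level (a framing assumption, since the design is within-subjects for queries), the required number of queries can be checked with `statsmodels`:

```python
from statsmodels.stats.power import TTestPower

# Number of queries needed to detect a medium within-subject effect
# (Cohen's d = 0.5) in a paired two-condition comparison,
# at alpha = 0.05 (two-sided) and 80% power.
analysis = TTestPower()
n_queries = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                 alternative="two-sided")
print(f"Required queries per paired comparison: {n_queries:.1f}")
```

The printed value should be compared against the 30 planned queries; if it exceeds 30, the expected effect size or power target needs revisiting.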
""" embeddings = get_embeddings(ideas, model=embedding_model) n = len(embeddings) distances = [] for i in range(n): for j in range(i+1, n): dist = 1 - cosine_similarity(embeddings[i], embeddings[j]) distances.append(dist) return np.mean(distances), np.std(distances) ``` #### 5.1.2 Cluster Analysis ```python def compute_cluster_metrics(ideas: List[str], embedding_model: str) -> dict: """ Analyze idea clustering patterns. """ embeddings = get_embeddings(ideas, model=embedding_model) # Find optimal k using silhouette score silhouette_scores = [] for k in range(2, min(len(ideas), 10)): kmeans = KMeans(n_clusters=k) labels = kmeans.fit_predict(embeddings) score = silhouette_score(embeddings, labels) silhouette_scores.append((k, score)) best_k = max(silhouette_scores, key=lambda x: x[1])[0] return { 'optimal_clusters': best_k, 'silhouette_score': max(silhouette_scores, key=lambda x: x[1])[1], 'cluster_distribution': compute_cluster_sizes(embeddings, best_k) } ``` #### 5.1.3 Semantic Distance from Query ```python def compute_query_distance(query: str, ideas: List[str], embedding_model: str) -> dict: """ Measure how far ideas are from the original query. Higher = more novel/distant. """ query_emb = get_embedding(query, model=embedding_model) idea_embs = get_embeddings(ideas, model=embedding_model) distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs] return { 'mean_distance': np.mean(distances), 'max_distance': np.max(distances), 'min_distance': np.min(distances), 'std_distance': np.std(distances) } ``` ### 5.2 Patent Novelty Metrics #### 5.2.1 Patent Overlap Rate ```python def compute_patent_novelty(ideas: List[str], query: str) -> dict: """ Search patents for each idea and compute overlap rate. Uses existing patent_search_service. """ matches = 0 match_details = [] for idea in ideas: result = patent_search_service.search(idea) if result.has_match: matches += 1 match_details.append({ 'idea': idea, 'patent': result.best_match }) return { 'novelty_rate': 1 - (matches / len(ideas)), 'match_count': matches, 'total_ideas': len(ideas), 'match_details': match_details } ``` ### 5.3 Hallucination/Nonsense Metrics (RQ6) Since our design intentionally excludes the original query during keyword generation, we need to measure the "cost" of this approach. #### 5.3.1 LLM-as-Judge for Relevance ```python def compute_relevance_score(query: str, ideas: List[str], judge_model: str) -> dict: """ Use LLM to judge if each idea is relevant/applicable to the original query. """ relevant_count = 0 nonsense_count = 0 results = [] for idea in ideas: prompt = f""" Original query: {query} Generated idea: {idea} Is this idea relevant and applicable to the original query? Rate: 1 (nonsense/irrelevant), 2 (weak connection), 3 (relevant) Return JSON: {{"score": N, "reason": "brief explanation"}} """ result = llm_judge(prompt, model=judge_model) results.append(result) if result['score'] == 1: nonsense_count += 1 elif result['score'] >= 2: relevant_count += 1 return { 'relevance_rate': relevant_count / len(ideas), 'nonsense_rate': nonsense_count / len(ideas), 'details': results } ``` #### 5.3.2 Semantic Distance Threshold Analysis ```python def analyze_distance_threshold(query: str, ideas: List[str], embedding_model: str) -> dict: """ Analyze which ideas exceed a "too far" semantic distance threshold. Ideas beyond threshold may be creative OR nonsensical. 
""" query_emb = get_embedding(query, model=embedding_model) idea_embs = get_embeddings(ideas, model=embedding_model) distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs] # Define thresholds (to be calibrated) CREATIVE_THRESHOLD = 0.6 # Ideas this far are "creative" NONSENSE_THRESHOLD = 0.85 # Ideas this far may be "nonsense" return { 'creative_zone': sum(1 for d in distances if CREATIVE_THRESHOLD <= d < NONSENSE_THRESHOLD), 'potential_nonsense': sum(1 for d in distances if d >= NONSENSE_THRESHOLD), 'safe_zone': sum(1 for d in distances if d < CREATIVE_THRESHOLD), 'distance_distribution': distances } ``` ### 5.4 Metrics Summary Table | Metric | Formula | Interpretation | |--------|---------|----------------| | **Mean Pairwise Distance** | avg(1 - cos_sim(i, j)) for all pairs | Higher = more diverse | | **Silhouette Score** | Cluster cohesion vs separation | Higher = clearer clusters | | **Optimal Cluster Count** | argmax(silhouette) | More clusters = more themes | | **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from original | | **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel | ### 5.5 Nonsense/Hallucination Analysis (RQ6) - Three Methods | Method | Metric | How it works | Pros/Cons | |--------|--------|--------------|-----------| | **Automatic** | Semantic Distance Threshold | Ideas with distance > 0.85 flagged as "potential nonsense" | Fast, cheap; May miss contextual nonsense | | **LLM-as-Judge** | Relevance Score (1-3) | GPT-4 rates if idea is relevant to original query | Moderate cost; Good balance | | **Human Evaluation** | Relevance Rating (1-7 Likert) | Humans rate coherence/relevance | Gold standard; Most expensive | **Triangulation**: Compare all three methods to validate findings: - If automatic + LLM + human agree → high confidence - If they disagree → investigate why (interesting edge cases) --- ## 6. Human Evaluation Protocol ### 6.1 Participants #### 6.1.1 Recruitment - **Platform**: Prolific, MTurk, or domain experts - **Sample Size**: 60 evaluators (20 per condition group) - **Criteria**: - Native English speakers - Bachelor's degree or higher - Attention check pass rate > 80% #### 6.1.2 Compensation - $15/hour equivalent - ~30 minutes per session - Bonus for high-quality ratings ### 6.2 Rating Scales #### 6.2.1 Novelty (7-point Likert) ``` How novel/surprising is this idea? 1 = Not at all novel (very common/obvious) 4 = Moderately novel 7 = Extremely novel (never seen before) ``` #### 6.2.2 Usefulness (7-point Likert) ``` How useful/practical is this idea? 1 = Not at all useful (impractical) 4 = Moderately useful 7 = Extremely useful (highly practical) ``` #### 6.2.3 Creativity (7-point Likert) ``` How creative is this idea overall? 1 = Not at all creative 4 = Moderately creative 7 = Extremely creative ``` #### 6.2.4 Relevance/Coherence (7-point Likert) - For RQ6 ``` How relevant and coherent is this idea to the original query? 1 = Nonsense/completely irrelevant (no logical connection) 2 = Very weak connection (hard to see relevance) 3 = Weak connection (requires stretch to see relevance) 4 = Moderate connection (somewhat relevant) 5 = Good connection (clearly relevant) 6 = Strong connection (directly applicable) 7 = Perfect fit (highly relevant and coherent) ``` **Note**: This scale specifically measures the "cost" of context-free generation. - Ideas with high novelty but low relevance (1-3) = potential hallucination - Ideas with high novelty AND high relevance (5-7) = successful creative leap ### 6.3 Procedure 1. 
### 6.3 Procedure

1. **Introduction** (5 min)
   - Study purpose (without revealing hypotheses)
   - Rating scale explanation
   - Practice with 3 example ideas
2. **Training** (5 min)
   - Rate 5 calibration ideas with feedback
   - Discuss edge cases
3. **Main Evaluation** (20 min)
   - Rate 30 ideas (randomized order)
   - 3 attention check items embedded
   - Break after 15 ideas
4. **Debriefing** (2 min)
   - Demographics
   - Open-ended feedback

### 6.4 Quality Control

| Check | Threshold | Action |
|-------|-----------|--------|
| Attention checks | < 2/3 correct | Exclude |
| Completion time | < 10 min | Flag for review |
| Variance in ratings | All same score | Exclude |
| Inter-rater reliability | Cronbach's α < 0.7 | Review ratings |

### 6.5 Analysis Plan

#### 6.5.1 Reliability

- Cronbach's alpha for each scale
- ICC (intraclass correlation) for inter-rater agreement

#### 6.5.2 Main Analysis

- Mixed-effects ANOVA: Condition × Query
- Post-hoc: Tukey HSD for pairwise comparisons
- Effect sizes: Cohen's d

#### 6.5.3 Correlation with Automatic Metrics

- Pearson correlation: human ratings vs semantic diversity
- Regression: predict human ratings from automatic metrics

---

## 7. Experimental Procedure

### 7.1 Phase 1: Idea Generation

```
For each query Q in QuerySet:
    For each condition C in Conditions:

        If C == "Direct":
            # No attributes, no experts
            ideas = direct_llm_generation(Q, n=20)

        Elif C == "Expert-Only":
            # No attributes, with experts
            experts = generate_experts(Q, n=4)
            ideas = expert_generation_whole_query(Q, experts, ideas_per_expert=5)

        Elif C == "Attribute-Only":
            # With attributes, no experts
            attributes = decompose_attributes(Q)
            ideas = direct_generation_per_attribute(Q, attributes, ideas_per_attr=5)

        Elif C == "Full-Pipeline":
            # With attributes, with experts
            attributes = decompose_attributes(Q)
            experts = generate_experts(Q, n=4)
            # ideas_per_combo is chosen so the total per query stays at 20
            ideas = expert_transformation(Q, attributes, experts,
                                          ideas_per_combo=20 / (len(experts) * len(attributes)))

        Elif C == "Random-Perspective":
            # Control: random words instead of experts
            perspectives = random.sample(RANDOM_WORDS, 4)
            ideas = perspective_generation(Q, perspectives, ideas_per=5)

        Store(Q, C, ideas)
```

### 7.2 Phase 2: Automatic Metrics

```
For each (Q, C, ideas) in Results:
    metrics = {
        'diversity': compute_mean_pairwise_distance(ideas),
        'clusters': compute_cluster_metrics(ideas),
        'query_distance': compute_query_distance(Q, ideas),
        'patent_novelty': compute_patent_novelty(ideas, Q)
    }
    Store(Q, C, metrics)
```

### 7.3 Phase 3: Human Evaluation

```
# Sample selection
selected_queries = random.sample(QuerySet, 10)
selected_conditions = ["Direct", "Expert-Only", "Full-Pipeline"]

# Create evaluation set
evaluation_items = []
For each Q in selected_queries:
    For each C in selected_conditions:
        ideas = Get(Q, C)
        For each idea in ideas:
            evaluation_items.append((Q, C, idea))

# Randomize and assign to evaluators
random.shuffle(evaluation_items)
assignments = assign_to_evaluators(evaluation_items, n_evaluators=60)

# Collect ratings
ratings = collect_human_ratings(assignments)
```

### 7.4 Phase 4: Analysis

```
# Automatic metrics analysis
Run ANOVA: diversity ~ condition + query + condition:query
Run post-hoc: Tukey HSD for condition pairs
Compute effect sizes

# Human ratings analysis
Check reliability: Cronbach's alpha, ICC
Run mixed-effects model: rating ~ condition + (1|evaluator) + (1|query)
Compute correlations: human vs automatic metrics

# Visualization
Plot: Diversity by condition (box plots)
Plot: t-SNE of idea embeddings colored by condition
Plot: Attribute × Expert interaction (condition means with CIs)
```
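The `rating ~ condition + (1|evaluator) + (1|query)` line above is lme4 notation; in Python it can be approximated with `statsmodels` by encoding the crossed random intercepts as variance components. A minimal sketch, assuming the ratings file path and column names match the Section 8 checklist:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data/human_ratings/ratings.csv")  # assumed export file
df["all"] = 1  # single dummy group so both random effects can be crossed

model = smf.mixedlm(
    "rating ~ C(condition)",
    df,
    groups="all",
    re_formula="0",  # no random effect for the dummy grouping itself
    vc_formula={"evaluator": "0 + C(evaluator)",  # random intercept per evaluator
                "query": "0 + C(query)"},         # random intercept per query
)
print(model.fit().summary())
```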
---

## 8. Implementation Checklist

### 8.1 Code to Implement

- [ ] `experiments/generate_ideas.py` - Idea generation for all conditions
- [ ] `experiments/compute_metrics.py` - Automatic metric computation
- [ ] `experiments/export_for_evaluation.py` - Prepare the human evaluation set
- [ ] `experiments/analyze_results.py` - Statistical analysis
- [ ] `experiments/visualize.py` - Generate figures

### 8.2 Data Files to Create

- [ ] `data/queries.json` - 30 queries with metadata
- [ ] `data/random_words.json` - Random perspective words
- [ ] `data/generated_ideas/` - Raw idea outputs
- [ ] `data/metrics/` - Computed metric results
- [ ] `data/human_ratings/` - Collected ratings

### 8.3 Analysis Outputs

- [ ] `results/diversity_by_condition.csv`
- [ ] `results/patent_novelty_by_condition.csv`
- [ ] `results/human_ratings_summary.csv`
- [ ] `results/statistical_tests.txt`
- [ ] `figures/` - All visualizations

---

## 9. Expected Results & Hypotheses

### 9.1 Primary Hypotheses (2×2 Factorial)

| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1: Main Effect of Attributes** | Attribute-Only > Direct | Semantic diversity |
| **H2: Main Effect of Experts** | Expert-Only > Direct | Semantic diversity |
| **H3: Interaction Effect** | Full Pipeline > (Attribute-Only + Expert-Only - Direct) | Semantic diversity |
| **H4: Novelty** | Full Pipeline > all other conditions | Patent novelty rate |
| **H5: Expert vs Random** | Expert-Only > Random-Perspective | Validates that expert knowledge matters |
| **H6: Novelty-Usefulness Tradeoff** | Full Pipeline has a higher nonsense rate than Direct, but an acceptable one (< 20%) | Nonsense rate |

### 9.2 Expected Pattern

```
                      Without Experts       With Experts
                      ---------------       ------------
Without Attributes    Direct (low)          Expert-Only (medium)
With Attributes       Attr-Only (medium)    Full Pipeline (high)
```

**Expected interaction**: The combination (Full Pipeline) should produce super-additive effects: the benefit of experts is amplified when combined with structured attributes.

### 9.3 Expected Effect Sizes

Based on related work:

- Main effect of attributes: d = 0.3-0.5 (small to medium)
- Main effect of experts: d = 0.4-0.6 (medium)
- Interaction effect: d = 0.2-0.4 (small)
- Patent novelty increase: 20-40% improvement
- Human creativity rating: d = 0.3-0.5 (small to medium)

### 9.4 Potential Confounds

| Confound | Mitigation |
|----------|-----------|
| Query difficulty | Crossed design (all queries × all conditions) |
| LLM variability | Multiple runs, fixed seed where possible |
| Evaluator bias | Randomized presentation, blinding |
| Order effects | Counterbalancing in human evaluation |
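The H3 prediction is equivalent to a positive super-additivity contrast on per-query diversity scores: (Full-Pipeline - Attribute-Only) - (Expert-Only - Direct) > 0. A minimal sketch of computing it from the planned metrics output; the CSV layout (columns `query`, `condition`, `diversity`) is an assumption:

```python
import pandas as pd

def h3_interaction_contrast(path: str = "results/diversity_by_condition.csv") -> float:
    """Mean per-query super-additivity contrast; positive values support H3."""
    df = pd.read_csv(path)
    means = df.pivot_table(index="query", columns="condition", values="diversity")
    contrast = (means["Full-Pipeline"] - means["Attribute-Only"]) \
             - (means["Expert-Only"] - means["Direct"])
    return float(contrast.mean())
```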
---

## 10. Timeline

| Week | Activity |
|------|----------|
| 1-2 | Implement idea generation scripts |
| 3 | Generate all ideas (5 conditions × 30 queries) |
| 4 | Compute automatic metrics |
| 5 | Design and pilot the human evaluation |
| 6-7 | Run the human evaluation (60 participants) |
| 8 | Analyze results |
| 9-10 | Write paper |
| 11 | Internal review |
| 12 | Submit |

---

## 11. Appendix: Direct Generation Prompt

For baseline condition C1 (direct LLM generation):

```
You are a creative innovation consultant. Generate 20 unique and creative ideas
for improving or reimagining a [QUERY].

Requirements:
- Each idea should be distinct and novel
- Ideas should range from incremental improvements to radical innovations
- Consider different aspects: materials, functions, user experiences, contexts
- Provide a brief (15-30 word) description for each idea

Output format:
1. [Idea keyword]: [Description]
2. [Idea keyword]: [Description]
...
20. [Idea keyword]: [Description]
```

---

## 12. Appendix: Random Perspective Words

For condition C5 (Random-Perspective), sample from:

```json
[
  "ocean", "mountain", "forest", "desert", "cave",
  "microscope", "telescope", "kaleidoscope", "prism", "lens",
  "butterfly", "elephant", "octopus", "eagle", "ant",
  "sunrise", "thunderstorm", "rainbow", "fog", "aurora",
  "clockwork", "origami", "mosaic", "symphony", "ballet",
  "ancient", "futuristic", "organic", "crystalline", "liquid",
  "whisper", "explosion", "rhythm", "silence", "echo"
]
```

This tests whether ANY perspective shift helps, or if EXPERT perspectives specifically matter.
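For completeness, a minimal sketch of how condition C5 might draw from this list, assuming the words are stored at `data/random_words.json` per the Section 8.2 checklist:

```python
import json
import random

def sample_perspectives(path: str = "data/random_words.json", k: int = 4) -> list:
    """Draw k random 'perspective' words for the C5 control condition."""
    with open(path, encoding="utf-8") as f:
        words = json.load(f)
    return random.sample(words, k)

# Example: four perspectives for one query
print(sample_perspectives())
```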