---
marp: true
theme: default
paginate: true
size: 16:9
style: |
  section { font-size: 24px; }
  h1 { color: #2563eb; }
  h2 { color: #1e40af; }
  table { font-size: 20px; }
  .columns { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; }
---

# Breaking Semantic Gravity
## Expert-Augmented LLM Ideation for Enhanced Creativity

**Research Progress Report**
January 2026

---

# Agenda

1. Research Problem & Motivation
2. Theoretical Framework: "Semantic Gravity"
3. Proposed Solution: Expert-Augmented Ideation
4. Experimental Design
5. Implementation Progress
6. Timeline & Next Steps

---

# 1. Research Problem
## The Myth and the Problem of LLM Creativity

**Myth**: LLMs enable infinite idea generation for creative tasks

**Problem**: Generated ideas lack **diversity** and **novelty**

- Ideas cluster around high-probability training distributions
- Limited exploration of distant conceptual spaces
- "Creative" outputs are **interpolations**, not **extrapolations**

---

# The "Semantic Gravity" Phenomenon

```
Direct LLM Generation:

Input: "Generate creative ideas for a chair"

Result:
- "Ergonomic office chair"     (high probability)
- "Foldable portable chair"    (high probability)
- "Eco-friendly bamboo chair"  (moderate probability)

Problem:
→ Ideas cluster in predictable semantic neighborhoods
→ Limited exploration of distant conceptual spaces
```

---

# Why Does Semantic Gravity Occur?

| Factor | Description |
|--------|-------------|
| **Statistical Pattern Learning** | LLMs learn co-occurrence patterns from training data |
| **Mode Collapse** (revisit) | Sampling from the "creative ideas" distribution seen in training |
| **Relevance Trap** (revisit) | Strong associations dominate weak ones |
| **Domain Bias** | Outputs gravitate toward category prototypes |

---

# 2. Theoretical Framework
## Three Key Foundations

1. **Semantic Distance Theory** (Mednick, 1962)
   - Creativity correlates with conceptual "jump" distance
2. **Conceptual Blending Theory** (Fauconnier & Turner, 2002)
   - Creative products emerge from blending input spaces
3. **Design Fixation** (Jansson & Smith, 1991)
   - Blind adherence to initial ideas limits creativity

---

# Semantic Distance in Action

```
Without Expert:
"Chair" → furniture, sitting, comfort, design
Semantic distance: SHORT

With Marine Biologist Expert:
"Chair" → underwater pressure, coral structure, buoyancy
Semantic distance: LONG

Result: Novel ideas like "pressure-adaptive seating"
```

**Key Insight**: Expert perspectives force semantic jumps that LLMs wouldn't naturally make.
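
---

# Sketch: Quantifying Semantic Distance

How the semantic "jump" above could be measured with embeddings. A minimal sketch assuming the `sentence-transformers` library; the model choice and the `semantic_distance` helper are illustrative, not the project's implementation:

```python
# Minimal sketch: embedding-based distance between a query and an idea.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_distance(query: str, idea: str) -> float:
    """Return 1 - cosine similarity between the two texts' embeddings."""
    emb = model.encode([query, idea])
    return 1.0 - float(cosine_similarity(emb[:1], emb[1:])[0, 0])

print(semantic_distance("chair", "ergonomic office chair"))     # short jump
print(semantic_distance("chair", "pressure-adaptive seating"))  # longer jump
```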

---

# 3. Proposed Solution
## Expert-Augmented LLM Ideation Pipeline

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Attribute   │ →  │    Expert    │ →  │    Expert    │
│ Decomposition│    │  Generation  │    │Transformation│
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                                               ▼
┌──────────────┐    ┌──────────────┐
│   Novelty    │ ←  │ Deduplication│
│  Validation  │    │              │
└──────────────┘    └──────────────┘
```

---

# From "Wisdom of Crowds" to "Inner Crowd"

**Traditional Crowd**:
- Person 1 → Ideas from perspective 1
- Person 2 → Ideas from perspective 2
- Aggregation → Diverse idea pool

**Our "Inner Crowd"**:
- LLM + Expert 1 Persona → Ideas from perspective 1
- LLM + Expert 2 Persona → Ideas from perspective 2
- Aggregation → Diverse idea pool (simulated crowd)

---

# Expert Sources

| Source | Description | Coverage |
|--------|-------------|----------|
| **LLM-Generated** | Query-specific, prioritizes unconventional occupations | Flexible |
| **Curated** | 210 pre-selected high-quality occupations | Controlled |
| **DBpedia** | 2,164 occupations from the database | Broad |

Note: use the domain list (try adding two levels of the Dewey Decimal Classification? Future work?)

---

# 4. Research Questions (2×2 Factorial Design)

| ID | Research Question |
|----|-------------------|
| **RQ1** | Does attribute decomposition improve semantic diversity? |
| **RQ2** | Does expert perspective transformation improve semantic diversity? |
| **RQ3** | Is there an interaction effect between the two factors? |
| **RQ4** | Which combination produces the highest patent novelty? |
| **RQ5** | How do expert sources (LLM vs Curated vs External) affect quality? |
| **RQ6** | What is the hallucination/nonsense rate of context-free generation? |

---

# Design Choice: Context-Free Keyword Generation

Our system intentionally excludes the original query during keyword generation:

```
Stage 1 (Keyword):
Expert sees "木質" (wood) + "會計師" (accountant)
Expert does NOT see "椅子" (chair)
→ Generates: "資金流動" (cash flow)

Stage 2 (Description):
Expert sees "椅子" (chair) + "資金流動" (cash flow)
→ Applies the keyword to the original query
```

**Rationale**: Forces maximum semantic distance for novelty
**Risk**: Some keywords may be too distant → nonsense/hallucination
**RQ6**: Measure this tradeoff
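
---

# Sketch: Two-Stage Context-Free Generation

A minimal sketch of the two-stage prompting, assuming the `openai` Python SDK; the prompt wording, model name, and the `ask`/`generate_idea` helpers are illustrative placeholders, not the system's actual prompts:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def generate_idea(query: str, attribute: str, expert: str) -> str:
    # Stage 1: the expert sees only the attribute, never the original query.
    keyword = ask(
        f"You are a {expert}. Name one concept from your field related to "
        f"the attribute '{attribute}'. Answer with the concept only."
    )
    # Stage 2: the keyword is applied back to the original query.
    return ask(
        f"As a {expert}, propose one novel idea for '{query}' "
        f"inspired by the concept '{keyword}'."
    )

print(generate_idea("chair", "wood", "accountant"))
```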

---

# The Semantic Distance Tradeoff

```
Too Close                    Optimal Zone                   Too Far
(Semantic Gravity)           (Creative)                     (Hallucination)
├────────────────────────────┼──────────────────────────────┼──────────────────────────┤
"Ergonomic office chair"     "Pressure-adaptive seating"    "Quantum chair consciousness"
High usefulness              High novelty + useful          High novelty, nonsense
Low novelty                                                 Low usefulness
```

**H6**: The Full Pipeline has a higher nonsense rate than Direct, but an acceptable one (<20%)

---

# Measuring Nonsense/Hallucination (RQ6): Three Methods

| Method | Metric | Pros | Cons |
|--------|--------|------|------|
| **Automatic** | Semantic distance > 0.85 | Fast, cheap | May miss contextual nonsense |
| **LLM-as-Judge** | GPT-4 relevance score (1-3) | Moderate cost, scalable | Potential LLM bias |
| **Human Evaluation** | Relevance rating (1-7 Likert) | Gold standard | Expensive, slow |

**Triangulation**: Compare all three methods
- Agreement → high confidence in nonsense detection
- Disagreement → interesting edge cases to analyze

---

# Core Hypotheses (2×2 Factorial)

| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1: Attributes** | (Attr-Only + Full) > (Direct + Expert-Only) | Semantic diversity |
| **H2: Experts** | (Expert-Only + Full) > (Direct + Attr-Only) | Semantic diversity |
| **H3: Interaction** | Full > (Attr-Only + Expert-Only - Direct) | Super-additive effect |
| **H4: Novelty** | Full Pipeline > all others | Patent novelty rate |
| **H5: Control** | Expert-Only > Random-Perspective | Validates expert knowledge |
| **H6: Tradeoff** | Full Pipeline nonsense rate < 20% | Nonsense rate |

---

# Experimental Conditions (2×2 Factorial)

| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| **C1: Direct** | ❌ | ❌ | Baseline: "Generate 20 ideas for [query]" |
| **C2: Expert-Only** | ❌ | ✅ | Expert personas generate for the whole query |
| **C3: Attribute-Only** | ✅ | ❌ | Decompose query, direct generation per attribute |
| **C4: Full Pipeline** | ✅ | ✅ | Decompose query, experts generate per attribute |
| **C5: Random-Perspective** | ❌ | (random) | Control: random words as "perspectives" |

---

# Expected 2×2 Pattern

```
                     Without Experts       With Experts
                     ---------------       ------------
Without Attributes   Direct (low)          Expert-Only (medium)
With Attributes      Attr-Only (medium)    Full Pipeline (high)
```

**Key prediction**: The combination (Full Pipeline) produces **super-additive** effects
- Experts are more effective when given structured attributes to transform
- The interaction term should be statistically significant

---

# Query Dataset (30 Queries)

**Category A: Everyday Objects (10)**
- Chair, Umbrella, Backpack, Coffee mug, Bicycle...

**Category B: Technology & Tools (10)**
- Solar panel, Electric vehicle, 3D printer, Drone...

**Category C: Services & Systems (10)**
- Food delivery, Online education, Healthcare appointment...

**Total**: 30 queries × 5 conditions (4 factorial + 1 control) × 20 ideas = **3,000 ideas**

---

# Metrics: Automatic Evaluation

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Mean Pairwise Distance** | avg(1 - cos_sim(i, j)) | Higher = more diverse |
| **Silhouette Score** | Cluster cohesion vs. separation | Higher = clearer clusters |
| **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from original |
| **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel |
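
---

# Sketch: Computing the Diversity Metrics

A minimal sketch of the first and third metrics above, assuming `sentence-transformers` embeddings; the function names are ours, not the planned `experiments/compute_metrics.py` API:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def mean_pairwise_distance(ideas: list[str]) -> float:
    """avg(1 - cos_sim(i, j)) over all unordered pairs of ideas."""
    sims = cosine_similarity(model.encode(ideas))
    upper = np.triu_indices_from(sims, k=1)  # pairs above the diagonal
    return float(np.mean(1.0 - sims[upper]))

def mean_query_distance(query: str, ideas: list[str]) -> float:
    """Mean of 1 - cos_sim(query, idea) across all ideas."""
    emb = model.encode([query] + ideas)
    return float(np.mean(1.0 - cosine_similarity(emb[:1], emb[1:])))
```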

---

# Metrics: Human Evaluation

**Participants**: 60 evaluators (Prolific/MTurk)

**Rating Scales** (7-point Likert):
- **Novelty**: How novel/surprising is this idea?
- **Usefulness**: How practical is this idea?
- **Creativity**: How creative is this idea overall?
- **Relevance**: How relevant/coherent is this idea to the query? **(RQ6: nonsense check)**

**Quality Control**:
- Attention checks, completion time monitoring
- Inter-rater reliability (Cronbach's α > 0.7)

---

# What is Prolific/MTurk?

Online platforms for recruiting human participants for research studies.

| Platform | Description | Best For |
|----------|-------------|----------|
| **Prolific** | Academic-focused crowdsourcing | Research studies (higher quality) |
| **MTurk** | Amazon Mechanical Turk | Large-scale tasks (lower cost) |

**How it works for our study**:
1. Upload 600 ideas to evaluate (a subset of the generated ideas)
2. Recruit 60 participants (~$8-15/hour compensation)
3. Each participant rates ~30 ideas (novelty, usefulness, creativity)
4. Download ratings → statistical analysis

**Cost estimate**: 60 participants × 30 min × $12/hr ≈ $360

---

# Alternative: LLM-as-Judge

If human evaluation is too expensive or time-consuming:

| Approach | Pros | Cons |
|----------|------|------|
| **Human (Prolific/MTurk)** | Gold standard, publishable | Cost, time, IRB approval |
| **LLM-as-Judge (GPT-4)** | Fast, cheap, reproducible | Less rigorous, potential bias |
| **Automatic metrics only** | No human cost | Misses subjective quality |

**Recommendation**: Start with automatic metrics; add human evaluation for the final paper submission.
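
---

# Sketch: LLM-as-Judge Relevance Scoring

A minimal sketch of the GPT-4 relevance judge (1-3 scale) from the RQ6 measurement plan, assuming the `openai` SDK; the prompt wording and the `judge_relevance` helper are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def judge_relevance(query: str, idea: str) -> int:
    """Rate query-idea relevance on a 1-3 scale (1 = nonsense, 3 = relevant)."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Rate how relevant/coherent the idea '{idea}' is to the "
                f"query '{query}' on a scale of 1 (nonsense) to 3 (clearly "
                f"relevant). Answer with a single digit."
            ),
        }],
    )
    return int(resp.choices[0].message.content.strip()[0])

# Ideas rated 1 would count toward the nonsense rate (H6: < 20%).
```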

---

# 5. Implementation Status
## System Components (Implemented)

- Attribute decomposition pipeline
- Expert team generation (LLM, Curated, DBpedia sources)
- Expert transformation with parallel processing
- Semantic deduplication (embedding + LLM methods)
- Patent search integration
- Web-based visualization interface

---

# Implementation Checklist

### Experiment Scripts (To Do)

- [ ] `experiments/generate_ideas.py` - Idea generation
- [ ] `experiments/compute_metrics.py` - Automatic metrics
- [ ] `experiments/export_for_evaluation.py` - Human evaluation prep
- [ ] `experiments/analyze_results.py` - Statistical analysis
- [ ] `experiments/visualize.py` - Generate figures

---

# 6. Timeline

| Phase | Activity |
|-------|----------|
| **Phase 1** | Implement idea generation scripts |
| **Phase 2** | Generate all ideas (5 conditions × 30 queries) |
| **Phase 3** | Compute automatic metrics |
| **Phase 4** | Design and pilot human evaluation |
| **Phase 5** | Run human evaluation (60 participants) |
| **Phase 6** | Analyze results and write paper |

---

# Target Venues

### Tier 1 (Recommended)
- **CHI** - ACM Conference on Human Factors in Computing Systems (Sept deadline)
- **CSCW** - Computer-Supported Cooperative Work (Apr/Jan deadlines)
- **Creativity & Cognition** - Specialized computational creativity venue

### Journal Options
- **IJHCS** - International Journal of Human-Computer Studies
- **TOCHI** - ACM Transactions on Computer-Human Interaction

---

# Key Contributions

1. **Theoretical**: "Semantic gravity" framework + two-factor solution
2. **Methodological**: 2×2 factorial design isolates attribute vs. expert contributions
3. **Empirical**: Quantitative evidence for interaction effects in LLM creativity
4. **Practical**: Open-source system combining both factors for maximum diversity

---

# Key Differentiator vs PersonaFlow

```
PersonaFlow (2024):
Query → Experts → Ideas
(Experts see the WHOLE query, no structure)

Our Approach:
Query → Attributes → (Attributes × Experts) → Ideas
(Experts see SPECIFIC attributes, systematically)
```

**What we can answer that PersonaFlow cannot:**
1. Does problem structure alone help? (Attribute-Only vs Direct)
2. Do experts help beyond structure? (Full vs Attribute-Only)
3. Is there an interaction effect? (amplification hypothesis)

---

# Related Work Comparison

| Approach | Limitation | Our Advantage |
|----------|------------|---------------|
| Direct LLM | Semantic gravity | Two-factor enhancement |
| **PersonaFlow** | **No problem structure** | **Attribute decomposition amplifies experts** |
| PopBlends | Two-concept blends only | Systematic attribute × expert matrix |
| BILLY | Cannot isolate factors | 2×2 factorial isolates contributions |

---

# References (Key Papers)

1. Siangliulue et al. (2017) - Wisdom of Crowds via Role Assumption
2. Liu et al. (2024) - PersonaFlow: LLM-Simulated Expert Perspectives
3. Choi et al. (2023) - PopBlends: Conceptual Blending with LLMs
4. Wadinambiarachchi et al. (2024) - Effects of Generative AI on Design Fixation
5. Mednick (1962) - Semantic Distance Theory
6. Fauconnier & Turner (2002) - Conceptual Blending Theory

*Full reference list: 55+ papers in `research/references.md`*

---

# Questions & Discussion

## Next Steps

1. Finalize experimental design details
2. Implement experiment scripts
3. Collect pilot data for validation
4. Submit IRB application for human evaluation (if needed)

---

# Thank You

**Project Repository**: novelty-seeking

**Research Materials**:
- `research/literature_review.md`
- `research/theoretical_framework.md`
- `research/experimental_protocol.md`
- `research/paper_outline.md`
- `research/references.md`