
---
marp: true
theme: default
paginate: true
size: 16:9
style: |
  section { font-size: 24px; }
  h1 { color: #2563eb; }
  h2 { color: #1e40af; }
  table { font-size: 20px; }
  .columns { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; }
---

Breaking Semantic Gravity

Expert-Augmented LLM Ideation for Enhanced Creativity

Research Progress Report

January 2026


Agenda

  1. Research Problem & Motivation
  2. Theoretical Framework: "Semantic Gravity"
  3. Proposed Solution: Expert-Augmented Ideation
  4. Experimental Design
  5. Implementation Progress
  6. Timeline & Next Steps

1. Research Problem

The Myth and Problem of LLM Creativity

Myth: LLMs enable infinite idea generation for creative tasks

Problem: Generated ideas lack diversity and novelty

  • Ideas cluster around high-probability training distributions
  • Limited exploration of distant conceptual spaces
  • "Creative" outputs are interpolations, not extrapolations

The "Semantic Gravity" Phenomenon

Direct LLM Generation:
  Input: "Generate creative ideas for a chair"

  Result:
    - "Ergonomic office chair"      (high probability)
    - "Foldable portable chair"     (high probability)
    - "Eco-friendly bamboo chair"   (moderate probability)

  Problem:
    → Ideas cluster in predictable semantic neighborhoods
    → Limited exploration of distant conceptual spaces

Why Does Semantic Gravity Occur?

| Factor | Description |
|---|---|
| Statistical Pattern Learning | LLMs learn co-occurrence patterns from training data |
| Model Collapse (to revisit) | Sampling from the "creative ideas" distribution seen in training |
| Relevance Trap (to revisit) | Strong associations dominate weak ones |
| Domain Bias | Outputs gravitate toward category prototypes |

2. Theoretical Framework

Three Key Foundations

  1. Semantic Distance Theory (Mednick, 1962)

    • Creativity correlates with conceptual "jump" distance
  2. Conceptual Blending Theory (Fauconnier & Turner, 2002)

    • Creative products emerge from blending input spaces
  3. Design Fixation (Jansson & Smith, 1991)

    • Blind adherence to initial ideas limits creativity

Semantic Distance in Action

Without Expert:
  "Chair" → furniture, sitting, comfort, design
  Semantic distance: SHORT

With Marine Biologist Expert:
  "Chair" → underwater pressure, coral structure, buoyancy
  Semantic distance: LONG

Result: Novel ideas like "pressure-adaptive seating"

Key Insight: Expert perspectives force semantic jumps that LLMs wouldn't naturally make.


3. Proposed Solution

Expert-Augmented LLM Ideation Pipeline

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Attribute  │ → │    Expert    │ → │    Expert    │
│ Decomposition│   │  Generation  │   │Transformation│
└──────────────┘   └──────────────┘   └──────────────┘
                                              │
                                              ▼
                   ┌──────────────┐   ┌──────────────┐
                   │   Novelty    │ ← │ Deduplication│
                   │  Validation  │   │              │
                   └──────────────┘   └──────────────┘
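
The control flow below is a minimal sketch of how these five stages chain together; every function name and signature is a hypothetical placeholder, not the project's actual module API.

```python
# Sketch of the pipeline's control flow only; all stage functions here are
# hypothetical placeholder signatures, not the project's real module API.
def decompose_attributes(query: str) -> list[str]: ...
def generate_expert_team(attributes: list[str]) -> list[str]: ...
def transform(attribute: str, expert: str, query: str) -> str: ...
def deduplicate(ideas: list[str]) -> list[str]: ...
def validate_novelty(ideas: list[str]) -> list[str]: ...

def run_pipeline(query: str) -> list[str]:
    attributes = decompose_attributes(query)       # Stage 1: attribute decomposition
    experts = generate_expert_team(attributes)     # Stage 2: expert generation
    raw = [transform(a, e, query)                  # Stage 3: expert transformation
           for a in attributes for e in experts]
    unique = deduplicate(raw)                      # Stage 4: semantic deduplication
    return validate_novelty(unique)                # Stage 5: patent-based novelty validation
```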

From "Wisdom of Crowds" to "Inner Crowd"

Traditional Crowd:

  • Person 1 → Ideas from perspective 1
  • Person 2 → Ideas from perspective 2
  • Aggregation → Diverse idea pool

Our "Inner Crowd":

  • LLM + Expert 1 Persona → Ideas from perspective 1
  • LLM + Expert 2 Persona → Ideas from perspective 2
  • Aggregation → Diverse idea pool (simulated crowd)

Expert Sources

| Source | Description | Coverage |
|---|---|---|
| LLM-Generated | Query-specific, prioritizes unconventional occupations | Flexible |
| Curated | 210 pre-selected high-quality occupations | Controlled |
| DBpedia | 2,164 occupations from the DBpedia database | Broad |

Note: use the domain list (try adding two levels of the Dewey Decimal Classification? Future work?)


4. Research Questions (2×2 Factorial Design)

| ID | Research Question |
|---|---|
| RQ1 | Does attribute decomposition improve semantic diversity? |
| RQ2 | Does expert perspective transformation improve semantic diversity? |
| RQ3 | Is there an interaction effect between the two factors? |
| RQ4 | Which combination produces the highest patent novelty? |
| RQ5 | How do expert sources (LLM vs. Curated vs. External) affect quality? |
| RQ6 | What is the hallucination/nonsense rate of context-free generation? |

Design Choice: Context-Free Keyword Generation

Our system intentionally excludes the original query during keyword generation:

Stage 1 (Keyword):     Expert sees "木質" (wood) + "會計師" (accountant)
                       Expert does NOT see "椅子" (chair)
                       → Generates: "資金流動" (cash flow)

Stage 2 (Description): Expert sees "椅子" + "資金流動"
                       → Applies keyword to original query

Rationale: Forces maximum semantic distance for novelty
Risk: Some keywords may be too distant → nonsense/hallucination
RQ6: Measure this tradeoff
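
A minimal sketch of this two-stage, context-free prompting, assuming an OpenAI-style chat client; the prompt wording, the `ask` helper, and the `gpt-4` model choice are illustrative assumptions rather than the system's actual prompts.

```python
# Sketch of the two-stage, context-free prompting described above.
# Prompts, helper names, and model choice are illustrative assumptions,
# not the system's actual implementation.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def stage1_keyword(attribute: str, expert: str) -> str:
    # Stage 1: the expert sees only the attribute, never the original query.
    return ask(f"You are a {expert}. Name one concept from your field that you "
               f"associate with '{attribute}'. Reply with the concept only.")

def stage2_description(query: str, keyword: str) -> str:
    # Stage 2: the keyword is applied back to the original query.
    return ask(f"Propose a novel idea for '{query}' that builds on the concept "
               f"'{keyword}'. One sentence.")

keyword = stage1_keyword("wood", "accountant")   # e.g. "cash flow"
idea = stage2_description("chair", keyword)
```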


The Semantic Distance Tradeoff

Too Close                 Optimal Zone                   Too Far
(Semantic Gravity)        (Creative)                     (Hallucination)
├─────────────────────────┼──────────────────────────────┼─────────────────────────┤
"Ergonomic office chair"  "Pressure-adaptive seating"    "Quantum chair consciousness"

High usefulness           High novelty + useful          High novelty, nonsense
Low novelty                                              Low usefulness

H6: Full Pipeline has higher nonsense rate than Direct, but acceptable (<20%)


Measuring Nonsense/Hallucination (RQ6) - Three Methods

| Method | Metric | Pros | Cons |
|---|---|---|---|
| Automatic | Semantic distance > 0.85 | Fast, cheap | May miss contextual nonsense |
| LLM-as-Judge | GPT-4 relevance score (1-3) | Moderate cost, scalable | Potential LLM bias |
| Human Evaluation | Relevance rating (1-7 Likert) | Gold standard | Expensive, slow |

Triangulation: Compare all three methods

  • Agreement → high confidence in nonsense detection
  • Disagreement → interesting edge cases to analyze
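
As one concrete illustration of the LLM-as-Judge row in the table above, the sketch below scores relevance on the 1-3 scale; the rubric wording, the `judge_relevance` helper, and the model name are assumptions, not the study's actual judge prompt.

```python
# Sketch of the LLM-as-Judge relevance check (1-3 scale) from the table above.
# Rubric wording and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def judge_relevance(query: str, idea: str) -> int:
    prompt = (
        f"Query: {query}\nIdea: {idea}\n"
        "Rate how relevant and coherent the idea is to the query:\n"
        "1 = nonsense, 2 = loosely related, 3 = clearly relevant.\n"
        "Answer with a single digit."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip()[0])
```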

Core Hypotheses (2×2 Factorial)

| Hypothesis | Prediction | Metric |
|---|---|---|
| H1: Attributes | (Attr-Only + Full) > (Direct + Expert-Only) | Semantic diversity |
| H2: Experts | (Expert-Only + Full) > (Direct + Attr-Only) | Semantic diversity |
| H3: Interaction | Full > (Attr-Only + Expert-Only - Direct) | Super-additive effect |
| H4: Novelty | Full Pipeline > all others | Patent novelty rate |
| H5: Control | Expert-Only > Random-Perspective | Validates expert knowledge |
| H6: Tradeoff | Full Pipeline nonsense rate < 20% | Nonsense rate |

Experimental Conditions (2×2 Factorial)

| Condition | Attributes | Experts | Description |
|---|---|---|---|
| C1: Direct | – | – | Baseline: "Generate 20 ideas for [query]" |
| C2: Expert-Only | – | ✓ | Expert personas generate for the whole query |
| C3: Attribute-Only | ✓ | – | Decompose query, direct generation per attribute |
| C4: Full Pipeline | ✓ | ✓ | Decompose query, experts generate per attribute |
| C5: Random-Perspective | – | (random) | Control: random words as "perspectives" |

Expected 2×2 Pattern

                      Without Experts       With Experts
                      ---------------       ------------
Without Attributes    Direct (low)          Expert-Only (medium)

With Attributes       Attr-Only (medium)    Full Pipeline (high)

Key prediction: The combination (Full Pipeline) produces super-additive effects

  • Experts are more effective when given structured attributes to transform
  • The interaction term should be statistically significant
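
A sketch of how the H3 interaction test could be run with a 2×2 ANOVA, assuming one diversity score per query × condition with 0/1 indicator columns for the two factors; the synthetic data below is only a stand-in for the table that compute_metrics.py would produce, and the column names are hypothetical.

```python
# Sketch of the 2x2 interaction test (H3). The synthetic data is a stand-in
# for one diversity score per query x condition; column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
rows = []
for query_id in range(30):
    for attributes in (0, 1):
        for experts in (0, 1):
            score = (0.30 + 0.05 * attributes + 0.08 * experts
                     + 0.04 * attributes * experts + rng.normal(0, 0.02))
            rows.append({"query_id": query_id, "attributes": attributes,
                         "experts": experts, "diversity": score})
df = pd.DataFrame(rows)

model = ols("diversity ~ C(attributes) * C(experts)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # C(attributes):C(experts) row tests the interaction
```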

Query Dataset (30 Queries)

Category A: Everyday Objects (10)

  • Chair, Umbrella, Backpack, Coffee mug, Bicycle...

Category B: Technology & Tools (10)

  • Solar panel, Electric vehicle, 3D printer, Drone...

Category C: Services & Systems (10)

  • Food delivery, Online education, Healthcare appointment...

Total: 30 queries × 5 conditions (4 factorial + 1 control) × 20 ideas = 3,000 ideas


Metrics: Statistical Evaluation

| Metric | Formula | Interpretation |
|---|---|---|
| Mean Pairwise Distance | avg(1 - cos_sim(i, j)) | Higher = more diverse |
| Silhouette Score | Cluster cohesion vs. separation | Higher = clearer clusters |
| Query Distance | 1 - cos_sim(query, idea) | Higher = farther from original |
| Patent Novelty Rate | 1 - (matches / total) | Higher = more novel |
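
A sketch of two of these metrics (mean pairwise distance and query distance), plus the RQ6 distance threshold, using sentence-transformers embeddings; the embedding model is an illustrative default, not necessarily the project's choice.

```python
# Sketch of mean pairwise distance and query distance from the table above;
# the embedding model is an illustrative default, not necessarily the project's.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def mean_pairwise_distance(ideas: list[str]) -> float:
    emb = model.encode(ideas)
    sim = cosine_similarity(emb)
    iu = np.triu_indices(len(ideas), k=1)        # each unordered pair once
    return float(np.mean(1.0 - sim[iu]))

def query_distance(query: str, idea: str) -> float:
    q, i = model.encode([query, idea])
    return float(1.0 - cosine_similarity([q], [i])[0, 0])

def is_nonsense(query: str, idea: str, threshold: float = 0.85) -> bool:
    # Automatic nonsense flag from the RQ6 slide: semantic distance > 0.85.
    return query_distance(query, idea) > threshold
```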

Metrics: Human Evaluation

Participants: 60 evaluators (Prolific/MTurk)

Rating Scales (7-point Likert):

  • Novelty: How novel/surprising is this idea?
  • Usefulness: How practical is this idea?
  • Creativity: How creative is this idea overall?
  • Relevance: How relevant/coherent is this idea to the query? (RQ6)
  • Nonsense: whether to add a separate nonsense scale is still an open question

Quality Control:

  • Attention checks, completion time monitoring
  • Inter-rater reliability (Cronbach's α > 0.7)
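
A small sketch of the Cronbach's α check, treating each rater as an "item" and each rated idea as a case, which is one common way to apply the formula; the example scores are made up.

```python
# Sketch of the Cronbach's alpha check (target > 0.7). Raters are treated as
# "items" and each rated idea as a case; the example scores are made up.
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: shape (n_ideas, n_raters), one score per idea per rater."""
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)       # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of per-idea total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example: 5 ideas rated for novelty by 3 raters on a 7-point scale
scores = np.array([[6, 5, 6], [2, 3, 2], [4, 4, 5], [7, 6, 6], [1, 2, 1]])
print(round(cronbach_alpha(scores), 2))
```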

What is Prolific/MTurk?

Online platforms for recruiting human participants for research studies.

| Platform | Description | Best For |
|---|---|---|
| Prolific | Academic-focused crowdsourcing | Research studies (higher quality) |
| MTurk | Amazon Mechanical Turk | Large-scale tasks (lower cost) |

How it works for our study:

  1. Upload 600 ideas to evaluate (subset of generated ideas)
  2. Recruit 60 participants (~$8-15/hour compensation)
  3. Each participant rates ~30 ideas (novelty, usefulness, creativity)
  4. Download ratings → statistical analysis

Cost estimate: 60 participants × 30 min × $12/hr = ~$360


Alternative: LLM-as-Judge

If human evaluation is too expensive or time-consuming:

| Approach | Pros | Cons |
|---|---|---|
| Human (Prolific/MTurk) | Gold standard, publishable | Cost, time, IRB approval |
| LLM-as-Judge (GPT-4) | Fast, cheap, reproducible | Less rigorous, potential bias |
| Automatic metrics only | No human cost | Missing subjective quality |

Recommendation: Start with automatic metrics, add human evaluation for final paper submission.


5. Implementation Status

System Components (Implemented)

  • Attribute decomposition pipeline
  • Expert team generation (LLM, Curated, DBpedia sources)
  • Expert transformation with parallel processing
  • Semantic deduplication (embedding + LLM methods; see the sketch after this list)
  • Patent search integration
  • Web-based visualization interface
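
A sketch of the embedding-based half of the deduplication step (the LLM-based pass is not shown); the 0.9 similarity threshold and embedding model are illustrative assumptions.

```python
# Sketch of embedding-based deduplication (the LLM-based pass is not shown);
# the 0.9 similarity threshold and model choice are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def deduplicate(ideas: list[str], threshold: float = 0.9) -> list[str]:
    emb = model.encode(ideas)
    sim = cosine_similarity(emb)
    kept: list[int] = []
    for i in range(len(ideas)):
        # Greedy pass: drop an idea if it is too similar to one already kept.
        if all(sim[i, j] < threshold for j in kept):
            kept.append(i)
    return [ideas[i] for i in kept]
```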

Implementation Checklist

Experiment Scripts (To Do)

  • experiments/generate_ideas.py - Idea generation
  • experiments/compute_metrics.py - Automatic metrics
  • experiments/export_for_evaluation.py - Human evaluation prep
  • experiments/analyze_results.py - Statistical analysis
  • experiments/visualize.py - Generate figures

6. Timeline

| Phase | Activity |
|---|---|
| Phase 1 | Implement idea generation scripts |
| Phase 2 | Generate all ideas (5 conditions × 30 queries) |
| Phase 3 | Compute automatic metrics |
| Phase 4 | Design and pilot human evaluation |
| Phase 5 | Run human evaluation (60 participants) |
| Phase 6 | Analyze results and write paper |

Target Venues

  • CHI - ACM Conference on Human Factors in Computing Systems (Sept deadline)
  • CSCW - Computer-Supported Cooperative Work (Apr/Jan deadline)
  • Creativity & Cognition - Specialized computational creativity

Journal Options

  • IJHCS - International Journal of Human-Computer Studies
  • TOCHI - ACM Transactions on Computer-Human Interaction

Key Contributions

  1. Theoretical: "Semantic gravity" framework + two-factor solution

  2. Methodological: 2×2 factorial design isolates attribute vs expert contributions

  3. Empirical: Quantitative evidence for interaction effects in LLM creativity

  4. Practical: Open-source system with both factors for maximum diversity


Key Differentiator vs PersonaFlow

PersonaFlow (2024):   Query → Experts → Ideas
                      (Experts see WHOLE query, no structure)

Our Approach:         Query → Attributes → (Attributes × Experts) → Ideas
                      (Experts see SPECIFIC attributes, systematic)

What we can answer that PersonaFlow cannot:

  1. Does problem structure alone help? (Attribute-Only vs Direct)
  2. Do experts help beyond structure? (Full vs Attribute-Only)
  3. Is there an interaction effect? (amplification hypothesis)

Related Work Comparison

| Approach | Limitation | Our Advantage |
|---|---|---|
| Direct LLM | Semantic gravity | Two-factor enhancement |
| PersonaFlow | No problem structure | Attribute decomposition amplifies experts |
| PopBlends | Two-concept blends only | Systematic attribute × expert matrix |
| BILLY | Cannot isolate factors | 2×2 factorial isolates contributions |

References (Key Papers)

  1. Siangliulue et al. (2017) - Wisdom of Crowds via Role Assumption
  2. Liu et al. (2024) - PersonaFlow: LLM-Simulated Expert Perspectives
  3. Wang et al. (2023) - PopBlends: Conceptual Blending with LLMs
  4. Wadinambiarachchi et al. (2024) - Effects of Generative AI on Design Fixation
  5. Mednick (1962) - Semantic Distance Theory
  6. Fauconnier & Turner (2002) - Conceptual Blending Theory

Full reference list: 55+ papers in research/references.md


Questions & Discussion

Next Steps

  1. Finalize experimental design details
  2. Implement experiment scripts
  3. Collect pilot data for validation
  4. Submit IRB for human evaluation (if needed)

Thank You

Project Repository: novelty-seeking

Research Materials:

  • research/literature_review.md
  • research/theoretical_framework.md
  • research/experimental_protocol.md
  • research/paper_outline.md
  • research/references.md