feat: Enhance patent search and update research documentation

- Improve patent search service with expanded functionality
- Update PatentSearchPanel UI component
- Add new research_report.md
- Update experimental protocol, literature review, paper outline, and theoretical framework

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -10,29 +10,47 @@ This document outlines a comprehensive experimental design to test the hypothesi

| ID | Research Question |
|----|-------------------|
| **RQ1** | Does multi-expert generation produce higher semantic diversity than direct LLM generation? |
| **RQ2** | Does multi-expert generation produce ideas with lower patent overlap (higher novelty)? |
| **RQ3** | What is the optimal number of experts for maximizing diversity? |
| **RQ4** | How do different expert sources (LLM vs Curated vs DBpedia) affect idea quality? |
| **RQ5** | Does structured attribute decomposition enhance the multi-expert effect? |
| **RQ1** | Does attribute decomposition improve semantic diversity of generated ideas? |
| **RQ2** | Does expert perspective transformation improve semantic diversity of generated ideas? |
| **RQ3** | Is there an interaction effect between attribute decomposition and expert perspectives? |
| **RQ4** | Which combination produces the highest patent novelty (lowest overlap)? |
| **RQ5** | How do different expert sources (LLM vs Curated vs External) affect idea quality? |
| **RQ6** | Does context-free keyword generation (current design) increase hallucination/nonsense rate? |

### Design Note: Context-Free Keyword Generation

Our system intentionally excludes the original query during keyword generation (Stage 1):

```
Stage 1 (Keyword): Expert sees "木質" (wood) + "會計師" (accountant)
                   Expert does NOT see "椅子" (chair)
                   → Generates: "資金流動" (cash flow)

Stage 2 (Description): Expert sees "椅子" (chair) + "資金流動" (cash flow)
                       → Applies keyword to original query
```

**Rationale**: This forces maximum semantic distance in keyword generation.

**Risk**: Some keywords may be too distant, resulting in nonsensical or unusable ideas.

**RQ6 investigates**: What is the hallucination/nonsense rate, and is the tradeoff worthwhile?
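
To make the two-stage flow concrete, here is a minimal sketch of the prompting sequence. The function names, prompt wording, and the injected `llm` callable are illustrative assumptions, not the production implementation.

```python
from typing import Callable

def generate_keyword(attribute: str, expert: str, llm: Callable[[str], str]) -> str:
    """Stage 1: the expert sees only the attribute, never the original query."""
    prompt = (
        f"You are a {expert}. Given the attribute '{attribute}', name one concept "
        "from your professional domain that relates to it. Reply with a single keyword."
    )
    return llm(prompt)

def generate_idea(query: str, keyword: str, expert: str, llm: Callable[[str], str]) -> str:
    """Stage 2: the original query is reintroduced and combined with the keyword."""
    prompt = (
        f"You are a {expert}. Apply the concept '{keyword}' to '{query}' and "
        "describe one concrete idea in 1-2 sentences."
    )
    return llm(prompt)

# Mirrors the walkthrough above:
#   generate_keyword("木質", "會計師", llm)           -> e.g. "資金流動"
#   generate_idea("椅子", "資金流動", "會計師", llm)  -> applies the keyword to the chair
```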
---
## 2. Experimental Design Overview

### 2.1 Design Type

**Mixed Design**: Between-subjects for main conditions × Within-subjects for queries

**2×2 Factorial Design**: Attribute Decomposition (With/Without) × Expert Perspectives (With/Without)

- Within-subjects for queries (all queries tested across all conditions)

### 2.2 Variables

#### Independent Variables (Manipulated)

| Variable | Levels | Your System Parameter |
|----------|--------|----------------------|
| **Generation Method** | 5 levels (see conditions) | Condition-dependent |
| **Expert Count** | 1, 2, 4, 6, 8 | `expert_count` |
| **Expert Source** | LLM, Curated, DBpedia | `expert_source` |
| **Attribute Structure** | With/Without decomposition | Pipeline inclusion |

| Variable | Levels | Description |
|----------|--------|-------------|
| **Attribute Decomposition** | 2 levels: With / Without | Whether the query is decomposed into structured attributes |
| **Expert Perspectives** | 2 levels: With / Without | Whether expert personas are used for idea generation |
| **Expert Source** (secondary) | LLM, Curated, External | Source of expert occupations (tested within Expert=With conditions) |

#### Dependent Variables (Measured)
@@ -61,34 +79,28 @@ This document outlines a comprehensive experimental design to test the hypothesi

## 3. Experimental Conditions

### 3.1 Main Study: Generation Method Comparison

| Condition | Description | Implementation |
|-----------|-------------|----------------|
| **C1: Direct** | Direct LLM generation | Prompt: "Generate 20 creative ideas for [query]" |
| **C2: Single-Expert** | 1 expert × 20 ideas | `expert_count=1`, `keywords_per_expert=20` |
| **C3: Multi-Expert-4** | 4 experts × 5 ideas each | `expert_count=4`, `keywords_per_expert=5` |
| **C4: Multi-Expert-8** | 8 experts × 2-3 ideas each | `expert_count=8`, `keywords_per_expert=2-3` |
| **C5: Random-Perspective** | 4 random words as "perspectives" | Custom prompt with random nouns |

### 3.1 Main Study: 2×2 Factorial Design

| Condition | Attributes | Experts | Description |
|-----------|------------|---------|-------------|
| **C1: Direct** | ❌ Without | ❌ Without | Baseline: "Generate 20 creative ideas for [query]" |
| **C2: Expert-Only** | ❌ Without | ✅ With | Expert personas generate ideas for the whole query |
| **C3: Attribute-Only** | ✅ With | ❌ Without | Decompose the query, then generate directly per attribute |
| **C4: Full Pipeline** | ✅ With | ✅ With | Decompose the query, then experts generate per attribute |

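As a sketch of how the four factorial cells could be enumerated programmatically (the dataclass and helper names below are illustrative, not the actual implementation):

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Condition:
    name: str
    use_attributes: bool  # Attribute Decomposition factor
    use_experts: bool     # Expert Perspectives factor

# Labels for the four cells of the 2x2 design, matching the table above.
NAMES = {
    (False, False): "C1: Direct",
    (False, True): "C2: Expert-Only",
    (True, False): "C3: Attribute-Only",
    (True, True): "C4: Full Pipeline",
}

CONDITIONS = [
    Condition(NAMES[(attrs, experts)], attrs, experts)
    for attrs, experts in product([False, True], repeat=2)
]

for c in CONDITIONS:
    print(f"{c.name}: attributes={c.use_attributes}, experts={c.use_experts}")
```
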
### 3.2 Expert Count Study

| Condition | Expert Count | Ideas per Expert |
|-----------|--------------|------------------|
| **E1** | 1 | 20 |
| **E2** | 2 | 10 |
| **E4** | 4 | 5 |
| **E6** | 6 | 3-4 |
| **E8** | 8 | 2-3 |

### 3.2 Control Condition

| Condition | Description | Purpose |
|-----------|-------------|---------|
| **C5: Random-Perspective** | 4 random words as "perspectives" | Tests whether ANY perspective shift helps, or whether EXPERT knowledge specifically matters |

### 3.3 Expert Source Study (Secondary, within Expert=With conditions)

| Condition | Source | Implementation |
|-----------|--------|----------------|
| **S-LLM** | LLM-generated | `expert_source=ExpertSource.LLM` |
| **S-Curated** | Curated 210 occupations | `expert_source=ExpertSource.CURATED` |
| **S-DBpedia** | DBpedia 2164 occupations | `expert_source=ExpertSource.DBPEDIA` |
| **S-Random** | Random word "experts" | Custom implementation |
| **S-LLM** | LLM-generated | Query-specific experts generated by the LLM |
| **S-Curated** | Curated occupations | Pre-selected high-quality occupations |
| **S-External** | External sources | Wikidata/ConceptNet occupations |

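A minimal sketch of how the source conditions map onto the `expert_source` parameter. The enum member names (`LLM`, `CURATED`, `DBPEDIA`) and the pool sizes come from the table above; the enum definition, the `sample_experts` helper, and the toy pools are assumptions for illustration.

```python
import random
from enum import Enum, auto

class ExpertSource(Enum):
    LLM = auto()      # query-specific experts generated by the LLM
    CURATED = auto()  # curated occupation list (~210 occupations)
    DBPEDIA = auto()  # DBpedia occupation list (~2164 occupations)

def sample_experts(source: ExpertSource, pools: dict, n: int = 4) -> list:
    """Pick n expert occupations from the chosen source's pool.

    For ExpertSource.LLM the pool would itself be produced per query by the LLM.
    """
    return random.sample(pools[source], n)

# Example with toy pools:
pools = {
    ExpertSource.CURATED: ["accountant", "chef", "surgeon", "architect", "pilot"],
    ExpertSource.DBPEDIA: ["beekeeper", "locksmith", "cartographer", "falconer", "luthier"],
}
print(sample_experts(ExpertSource.CURATED, pools, n=2))
```
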
---
@@ -251,7 +263,69 @@ def compute_patent_novelty(ideas: List[str], query: str) -> dict:
    }
```

### 5.3 Hallucination/Nonsense Metrics (RQ6)

Since our design intentionally excludes the original query during keyword generation, we need to measure the "cost" of this approach.

#### 5.3.1 LLM-as-Judge for Relevance

```python
from typing import List

def compute_relevance_score(query: str, ideas: List[str], judge_model: str) -> dict:
    """
    Use an LLM to judge whether each idea is relevant/applicable to the original query.
    Assumes an `llm_judge(prompt, model)` helper that returns the parsed JSON verdict.
    """
    relevant_count = 0
    nonsense_count = 0
    results = []

    for idea in ideas:
        prompt = f"""
Original query: {query}
Generated idea: {idea}

Is this idea relevant and applicable to the original query?
Rate: 1 (nonsense/irrelevant), 2 (weak connection), 3 (relevant)

Return JSON: {{"score": N, "reason": "brief explanation"}}
"""
        result = llm_judge(prompt, model=judge_model)
        results.append(result)
        if result['score'] == 1:
            nonsense_count += 1
        elif result['score'] >= 2:
            relevant_count += 1

    return {
        'relevance_rate': relevant_count / len(ideas),
        'nonsense_rate': nonsense_count / len(ideas),
        'details': results
    }
```

#### 5.3.2 Semantic Distance Threshold Analysis

```python
from typing import List

def analyze_distance_threshold(query: str, ideas: List[str], embedding_model: str) -> dict:
    """
    Analyze which ideas exceed a "too far" semantic distance threshold.
    Ideas beyond the threshold may be creative OR nonsensical.
    Assumes `get_embedding`, `get_embeddings`, and `cosine_similarity` helpers.
    """
    query_emb = get_embedding(query, model=embedding_model)
    idea_embs = get_embeddings(ideas, model=embedding_model)

    distances = [1 - cosine_similarity(query_emb, e) for e in idea_embs]

    # Thresholds (to be calibrated on pilot data)
    CREATIVE_THRESHOLD = 0.6   # ideas at least this far are "creative"
    NONSENSE_THRESHOLD = 0.85  # ideas this far may be "nonsense"

    return {
        'creative_zone': sum(1 for d in distances if CREATIVE_THRESHOLD <= d < NONSENSE_THRESHOLD),
        'potential_nonsense': sum(1 for d in distances if d >= NONSENSE_THRESHOLD),
        'safe_zone': sum(1 for d in distances if d < CREATIVE_THRESHOLD),
        'distance_distribution': distances
    }
```

### 5.4 Metrics Summary Table

| Metric | Formula | Interpretation |
|--------|---------|----------------|
@@ -261,6 +335,18 @@ def compute_patent_novelty(ideas: List[str], query: str) -> dict:
| **Query Distance** | 1 - cos_sim(query, idea) | Higher = farther from original |
| **Patent Novelty Rate** | 1 - (matches / total) | Higher = more novel |

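The two formulas above translate directly to code. A minimal sketch, assuming the query and each idea are already embedded as vectors (function names are illustrative):

```python
import numpy as np

def query_distance(query_emb: np.ndarray, idea_emb: np.ndarray) -> float:
    """Query Distance = 1 - cos_sim(query, idea); higher = farther from the original query."""
    cos_sim = float(np.dot(query_emb, idea_emb) /
                    (np.linalg.norm(query_emb) * np.linalg.norm(idea_emb)))
    return 1.0 - cos_sim

def patent_novelty_rate(matches: int, total: int) -> float:
    """Patent Novelty Rate = 1 - (matches / total); higher = more novel."""
    return 1.0 - matches / total

# Example: 3 of 20 ideas matched existing patents -> novelty rate of 0.85
print(patent_novelty_rate(3, 20))
```
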
### 5.5 Nonsense/Hallucination Analysis (RQ6) - Three Methods

| Method | Metric | How it works | Pros/Cons |
|--------|--------|--------------|-----------|
| **Automatic** | Semantic Distance Threshold | Ideas with distance > 0.85 are flagged as "potential nonsense" | Fast, cheap; may miss contextual nonsense |
| **LLM-as-Judge** | Relevance Score (1-3) | GPT-4 rates whether the idea is relevant to the original query | Moderate cost; good balance |
| **Human Evaluation** | Relevance Rating (1-7 Likert) | Humans rate coherence/relevance | Gold standard; most expensive |

**Triangulation**: Compare all three methods to validate findings:
- If automatic, LLM, and human judgments agree → high confidence
- If they disagree → investigate why (interesting edge cases)

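One way to quantify the triangulation is pairwise agreement between the three per-idea "nonsense" flags. A minimal sketch, assuming each method has been reduced to a boolean label per idea (thresholds follow the values above; the helper names are illustrative):

```python
from sklearn.metrics import cohen_kappa_score

def nonsense_flags(distances, llm_scores, human_ratings,
                   dist_thr=0.85, human_thr=3):
    """Convert the three measures into per-idea boolean 'nonsense' flags."""
    auto = [d >= dist_thr for d in distances]        # automatic: too far from the query
    llm = [s == 1 for s in llm_scores]               # LLM judge: score 1 = nonsense
    human = [r <= human_thr for r in human_ratings]  # human: Likert 1-3 = weak/irrelevant
    return auto, llm, human

def pairwise_agreement(auto, llm, human) -> dict:
    """Cohen's kappa per pair of methods; high kappa = the methods agree."""
    return {
        'auto_vs_llm': cohen_kappa_score(auto, llm),
        'auto_vs_human': cohen_kappa_score(auto, human),
        'llm_vs_human': cohen_kappa_score(llm, human),
    }
```
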
---

## 6. Human Evaluation Protocol

@@ -306,6 +392,22 @@ How creative is this idea overall?
7 = Extremely creative
```

#### 6.2.4 Relevance/Coherence (7-point Likert) - For RQ6
```
How relevant and coherent is this idea to the original query?
1 = Nonsense/completely irrelevant (no logical connection)
2 = Very weak connection (hard to see relevance)
3 = Weak connection (requires a stretch to see relevance)
4 = Moderate connection (somewhat relevant)
5 = Good connection (clearly relevant)
6 = Strong connection (directly applicable)
7 = Perfect fit (highly relevant and coherent)
```

**Note**: This scale specifically measures the "cost" of context-free generation.
- Ideas with high novelty but low relevance (1-3) = potential hallucination
- Ideas with high novelty AND high relevance (5-7) = successful creative leap

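As a small sketch of how each rated idea could be bucketed from its novelty (query distance) and its relevance rating, following the note above (the novelty threshold and the two extra bucket names are illustrative assumptions):

```python
def classify_idea(query_distance: float, relevance: int, novelty_thr: float = 0.6) -> str:
    """Bucket an idea by novelty (semantic distance from the query) and relevance (1-7 Likert)."""
    high_novelty = query_distance >= novelty_thr
    if high_novelty and relevance >= 5:
        return "successful creative leap"   # far from the query, yet clearly relevant
    if high_novelty and relevance <= 3:
        return "potential hallucination"    # far from the query and weakly connected
    if not high_novelty and relevance >= 5:
        return "safe but conventional"
    return "needs review"

print(classify_idea(0.72, 6))  # -> "successful creative leap"
```
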
### 6.3 Procedure

1. **Introduction** (5 min)

@@ -361,21 +463,27 @@ For each query Q in QuerySet:
    For each condition C in Conditions:

        If C == "Direct":
            # No attributes, no experts
            ideas = direct_llm_generation(Q, n=20)

        Elif C == "Single-Expert":
            expert = generate_expert(Q, n=1)
            ideas = expert_transformation(Q, expert, ideas_per_expert=20)

        Elif C == "Multi-Expert-4":
            experts = generate_experts(Q, n=4)
            ideas = expert_transformation(Q, experts, ideas_per_expert=5)

        Elif C == "Expert-Only":
            # No attributes, with experts
            experts = generate_experts(Q, n=4)
            ideas = expert_generation_whole_query(Q, experts, ideas_per_expert=5)

        Elif C == "Multi-Expert-8":
            experts = generate_experts(Q, n=8)
            ideas = expert_transformation(Q, experts, ideas_per_expert=2-3)

        Elif C == "Attribute-Only":
            # With attributes, no experts
            attributes = decompose_attributes(Q)
            ideas = direct_generation_per_attribute(Q, attributes, ideas_per_attr=5)

        Elif C == "Full-Pipeline":
            # With attributes, with experts
            attributes = decompose_attributes(Q)
            experts = generate_experts(Q, n=4)
            ideas = expert_transformation(Q, attributes, experts, ideas_per_combo=1-2)

        Elif C == "Random-Perspective":
            # Control: random words instead of experts
            perspectives = random.sample(RANDOM_WORDS, 4)
            ideas = perspective_generation(Q, perspectives, ideas_per=5)

@@ -469,20 +577,34 @@ Plot: Expert count vs diversity curve

## 9. Expected Results & Hypotheses

### 9.1 Primary Hypotheses (2×2 Factorial)

| Hypothesis | Prediction | Metric |
|------------|------------|--------|
| **H1** | Multi-Expert-4 > Single-Expert > Direct | Semantic diversity |
| **H2** | Multi-Expert-8 ≈ Multi-Expert-4 (diminishing returns) | Semantic diversity |
| **H3** | Multi-Expert > Direct | Patent novelty rate |
| **H4** | LLM experts > Curated > DBpedia | Unconventionality |
| **H5** | With attributes > Without attributes | Overall diversity |
| **H1: Main Effect of Attributes** | Attribute-Only > Direct | Semantic diversity |
| **H2: Main Effect of Experts** | Expert-Only > Direct | Semantic diversity |
| **H3: Interaction Effect** | Full Pipeline > (Attribute-Only + Expert-Only - Direct) | Semantic diversity |
| **H4: Novelty** | Full Pipeline > all other conditions | Patent novelty rate |
| **H5: Expert vs Random** | Expert-Only > Random-Perspective | Validates that expert knowledge matters |
| **H6: Novelty-Usefulness Tradeoff** | Full Pipeline has a higher nonsense rate than Direct, but acceptable (<20%) | Nonsense rate |

### 9.2 Expected Pattern

```
                     Without Experts        With Experts
                     ---------------        ------------
Without Attributes   Direct (low)           Expert-Only (medium)
With Attributes      Attr-Only (medium)     Full Pipeline (high)
```

**Expected interaction**: The combination (Full Pipeline) should produce super-additive effects: the benefit of experts is amplified when combined with structured attributes.

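The interaction prediction (H3) and the pattern above can be checked with a standard 2×2 ANOVA. A minimal sketch with statsmodels, assuming a long-format table with one diversity score per query × condition cell (the column names and the synthetic placeholder data are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n_queries = 20

# Long format: one row per query x condition cell with its diversity score.
rows = []
for attrs in ["without", "with"]:
    for experts in ["without", "with"]:
        # Placeholder scores; in the real analysis these come from the embedding-based diversity metric.
        for score in rng.normal(loc=0.5, scale=0.1, size=n_queries):
            rows.append({"attributes": attrs, "experts": experts, "diversity": score})
df = pd.DataFrame(rows)

# Main effects of both factors plus their interaction (H1-H3).
model = smf.ols("diversity ~ C(attributes) * C(experts)", data=df).fit()
print(anova_lm(model, typ=2))
```
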
### 9.3 Expected Effect Sizes

Based on related work:
- Diversity increase: d = 0.5-0.8 (medium to large)
- Main effect of attributes: d = 0.3-0.5 (small to medium)
- Main effect of experts: d = 0.4-0.6 (medium)
- Interaction effect: d = 0.2-0.4 (small)
- Patent novelty increase: 20-40% improvement
- Human creativity rating: d = 0.3-0.5 (small to medium)
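
These anticipated effect sizes feed directly into a power analysis for choosing sample sizes. A minimal sketch with statsmodels, using a simple two-group t-test approximation rather than the full factorial power calculation (the plugged-in d values are the midpoints of the ranges above):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("main effect of attributes", 0.4),
                 ("main effect of experts", 0.5),
                 ("interaction effect", 0.3)]:
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"{label}: ~{n:.0f} observations per group for 80% power")
```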