feat: Add experiments framework and novelty-driven agent loop
- Add complete experiments directory with pilot study infrastructure
  - 5 experimental conditions (direct, expert-only, attribute-only, full-pipeline, random-perspective)
  - Human assessment tool with React frontend and FastAPI backend
  - AUT flexibility analysis with jump signal detection
  - Result visualization and metrics computation
- Add novelty-driven agent loop module (experiments/novelty_loop/)
  - NoveltyDrivenTaskAgent with expert perspective perturbation
  - Three termination strategies: breakthrough, exhaust, coverage
  - Interactive CLI demo with colored output
  - Embedding-based novelty scoring
- Add DDC knowledge domain classification data (en/zh)
- Add CLAUDE.md project documentation
- Update research report with experiment findings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
experiments/docs/experiment_design_2026-01-19.md (new file, 259 lines)

# Experiment Design: 5-Condition Idea Generation Study
**Date:** January 19, 2026

**Version:** 1.0

**Status:** Pilot Implementation

## Overview

This experiment tests whether the novelty-seeking system's two key mechanisms—**attribute decomposition** and **expert transformation**—independently and jointly improve creative ideation quality compared to direct LLM generation.
## Research Questions

1. Does decomposing a query into structured attributes improve idea diversity?
2. Do expert perspectives improve idea novelty?
3. Do these mechanisms have synergistic effects when combined?
4. Is the benefit from experts due to domain knowledge, or simply perspective-shifting?
## Experimental Design

### 2×2 Factorial Design + Control

|                    | No Attributes   | With Attributes   |
|--------------------|-----------------|-------------------|
| **No Experts**     | C1: Direct      | C3: Attr-Only     |
| **With Experts**   | C2: Expert-Only | C4: Full Pipeline |

**Plus:** C5: Random-Perspective (tests perspective-shifting without domain knowledge)
### Condition Descriptions

#### C1: Direct Generation (Baseline)
- Single LLM call: "Generate 20 creative ideas for [query]"
- No attribute decomposition
- No expert perspectives
- Purpose: Baseline for standard LLM ideation

#### C2: Expert-Only
- 4 experts from curated occupations
- Each expert generates 5 ideas directly for the query
- No attribute decomposition
- Purpose: Isolate expert contribution

#### C3: Attribute-Only
- Decompose query into 4 fixed categories
- Generate attributes per category
- Direct idea generation per attribute (no expert framing)
- Purpose: Isolate attribute decomposition contribution

#### C4: Full Pipeline
- Full attribute decomposition (4 categories)
- Expert transformation (4 experts × 1 keyword per attribute)
- Purpose: Test combined mechanism (main system)

#### C5: Random-Perspective
- 4 random words per query (from curated pool)
- Each word used as a "perspective" to generate 5 ideas
- Purpose: Control for perspective-shifting vs. expert knowledge
---

## Key Design Decisions & Rationale

### 1. Why 5 Conditions?

C1-C4 form a 2×2 factorial design that isolates the independent contributions of:
- **Attribute decomposition** (C1 vs C3, C2 vs C4)
- **Expert perspectives** (C1 vs C2, C3 vs C4)

C5 addresses a critical confound: if experts improve ideation, is it because of their **domain knowledge** or simply because any **perspective shift** helps? By using random words instead of domain experts, C5 tests whether the perspective-taking mechanism alone provides benefits.
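The planned statistical analysis (see Next Steps After Pilot) can exploit this factorial structure directly. A minimal sketch, assuming idea-level quality ratings are collected in long format with binary factors for the two mechanisms; the column names and toy data below are hypothetical, not the project's actual analysis code:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Toy long-format results for C1-C4: one row per rated idea, with binary
# indicators for the two mechanisms (C5 would be compared separately).
df = pd.DataFrame({
    "novelty":    [3.2, 3.0, 4.1, 4.3, 3.8, 3.6, 4.6, 4.9],
    "attributes": [0,   0,   0,   0,   1,   1,   1,   1  ],  # C3/C4 = 1
    "experts":    [0,   0,   1,   1,   0,   0,   1,   1  ],  # C2/C4 = 1
})

# Two-way model with interaction: a significant interaction term is the
# synergy asked about in research question 3.
model = ols("novelty ~ C(attributes) * C(experts)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```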
### 2. Why Random Words in C5 (Not Fixed)?

**Decision:** Use randomly sampled words (with seed) rather than a fixed set.

**Rationale:**
- Stronger generalization: results hold across many word combinations
- Avoids accusations of cherry-picking ("you just picked easy words")
- Reproducible via random seed (seed=42)
- Each query gets different random words, increasing robustness; see the sampling sketch below
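A minimal sketch of the seeded sampling, assuming random_words.json holds a flat list of the 35 words; the helper name and schema are illustrative, not the actual API of c5_random_perspective.py:

```python
import json
import random

# Assumed schema: experiments/data/random_words.json is a flat list of 35 words.
with open("experiments/data/random_words.json") as f:
    WORD_POOL = json.load(f)

def sample_words(query_id: str, k: int = 4, seed: int = 42) -> list[str]:
    # Mix the global seed with the query ID so each query draws different
    # words while every run reproduces the same draw. String seeds are
    # hashed deterministically by random.Random, unlike built-in hash().
    rng = random.Random(f"{seed}:{query_id}")
    return rng.sample(WORD_POOL, k)

# e.g. sample_words("A1") returns the same 4 words for "Chair" on every run.
```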
### 3. Why Apply Deduplication Uniformly?

**Decision:** Apply embedding-based deduplication (threshold=0.85) to ALL conditions after generation.

**Rationale:**
- Fair comparison: all conditions normalized to unique ideas
- Creates "dedup survival rate" as an additional metric
- Hypothesis: Full Pipeline ideas are diverse (low redundancy), not just numerous
- Direct generation may produce many similar ideas that collapse after dedup
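A minimal sketch of greedy cosine-similarity deduplication at the 0.85 threshold; the embedding step is left abstract because the doc does not pin down an embedding model, so this is illustrative rather than the repo's deduplication.py:

```python
import numpy as np

def deduplicate(ideas: list[str], embeddings: np.ndarray,
                threshold: float = 0.85) -> list[str]:
    """Keep an idea only if its cosine similarity to every already-kept
    idea is below the threshold. `embeddings` has one row per idea."""
    # Normalize rows so a dot product equals cosine similarity.
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(ideas)):
        if all(float(embs[i] @ embs[j]) < threshold for j in kept):
            kept.append(i)
    return [ideas[i] for i in kept]
```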
### 4. Why FIXED_ONLY Categories?

**Decision:** Use 4 fixed categories: Functions, Usages, User Groups, Characteristics

**Rationale:**
- Strongest inferential power: cleanly isolates the "attribute decomposition" effect
- No confound from the variability of dynamic category selection
- Universal applicability: these 4 categories apply to objects, technology, and services
- Dropped "Materials" category as it doesn't apply well to services
### 5. Why Curated Expert Source?

**Decision:** Use curated occupations (210 professions) rather than LLM-generated experts.

**Rationale:**
- Reproducibility: same occupation pool across runs
- Consistency: no variance from LLM expert generation
- Control: we know exactly which experts are available
- Validation: occupations were manually curated for diversity
### 6. Why Temperature 0.9?

**Decision:** Use temperature=0.9 for all conditions.

**Rationale:**
- Higher temperature encourages more diverse/creative outputs
- Matches typical creative task settings
- Consistent across conditions for fair comparison
- Lower temperatures (0.7) showed more repetitive outputs in testing
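For concreteness, a single generation call at these settings might look as follows, assuming the ollama Python client is used to reach the local qwen3:8b model; the prompt wording is illustrative:

```python
import ollama

# One C1-style call: model name and temperature come from this design doc;
# everything else (prompt text, client choice) is an assumption.
response = ollama.chat(
    model="qwen3:8b",
    messages=[{"role": "user",
               "content": "Generate 20 creative ideas for a chair."}],
    options={"temperature": 0.9},
)
print(response["message"]["content"])
```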
### 7. Why 10 Pilot Queries?

**Decision:** Start with 10 queries before scaling to full 30.

**Rationale:**
- Validate pipeline works before full investment
- Catch implementation bugs early
- Balanced across categories (3 everyday, 3 technology, 4 services)
- Sufficient for initial pattern detection
---

## Configuration Summary

| Setting | Value | Rationale |
|---------|-------|-----------|
| **LLM Model** | qwen3:8b | Local, fast, consistent |
| **Temperature** | 0.9 | Encourages creativity |
| **Expert Count** | 4 | Balance diversity vs. cost |
| **Expert Source** | Curated | Reproducibility |
| **Keywords/Expert** | 1 | Simplifies analysis |
| **Language** | English | Consistency |
| **Categories** | Functions, Usages, User Groups, Characteristics | Universal applicability |
| **Dedup Threshold** | 0.85 | Standard similarity cutoff |
| **Random Seed** | 42 | Reproducibility |
| **Pilot Queries** | 10 | Validation before scaling |
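These settings map naturally onto a single config object. A sketch of how they might appear in code; experiments/config.py ships in this commit, but the field names below are illustrative rather than its actual contents:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    # Values mirror the Configuration Summary table above.
    llm_model: str = "qwen3:8b"
    temperature: float = 0.9
    expert_count: int = 4
    expert_source: str = "curated"          # 210-profession pool
    keywords_per_expert: int = 1
    language: str = "en"
    categories: tuple[str, ...] = (
        "Functions", "Usages", "User Groups", "Characteristics")
    dedup_threshold: float = 0.85
    random_seed: int = 42
    pilot_query_count: int = 10
```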
---

## Query Selection

### Pilot Queries (10)

| ID | Query | Category |
|----|-------|----------|
| A1 | Chair | Everyday |
| A5 | Bicycle | Everyday |
| A7 | Smartphone | Everyday |
| B1 | Solar panel | Technology |
| B3 | 3D printer | Technology |
| B4 | Drone | Technology |
| C1 | Food delivery service | Services |
| C2 | Online education platform | Services |
| C4 | Public transportation | Services |
| C9 | Elderly care service | Services |
### Selection Criteria
- Balanced across 3 domains (everyday objects, technology, services)
- Varying complexity levels
- Different user familiarity levels
- Subset from full 30-query experimental protocol
---

## Random Word Pool (C5)

35 words selected across 7 conceptual categories:

| Category | Words |
|----------|-------|
| Nature | ocean, mountain, forest, desert, cave |
| Optics | microscope, telescope, kaleidoscope, prism, lens |
| Animals | butterfly, elephant, octopus, eagle, ant |
| Weather | sunrise, thunderstorm, rainbow, fog, aurora |
| Art | clockwork, origami, mosaic, symphony, ballet |
| Temporal | ancient, futuristic, organic, crystalline, liquid |
| Sensory | whisper, explosion, rhythm, silence, echo |
**Selection Criteria:**
- Concrete and evocative (easy to generate associations)
- Diverse domains (no overlap with typical expert knowledge)
- No obvious connection to test queries
- Equal representation across categories
---

## Expected Outputs

### Per Condition Per Query

| Condition | Expected Ideas (pre-dedup) | Mechanism |
|-----------|----------------------------|-----------|
| C1 | 20 | Direct request |
| C2 | 20 | 4 experts × 5 ideas |
| C3 | ~20 | Varies by attribute count |
| C4 | ~20 | 4 experts × ~5 keywords × 1 description |
| C5 | 20 | 4 words × 5 ideas |
### Metrics to Collect

1. **Pre-deduplication count**: Raw ideas generated
2. **Post-deduplication count**: Unique ideas after similarity filtering
3. **Dedup survival rate**: post/pre ratio
4. **Generation metadata**: Experts/words used, attributes generated
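A minimal sketch of bundling these per-condition, per-query metrics; the function and field names are illustrative, not the repo's actual result schema:

```python
def compute_metrics(pre_dedup: list[str], post_dedup: list[str],
                    metadata: dict) -> dict:
    """Bundle the four metrics listed above for one condition/query run."""
    return {
        "pre_count": len(pre_dedup),
        "post_count": len(post_dedup),
        # Survival rate = post/pre; guard against an empty generation.
        "dedup_survival_rate": (
            len(post_dedup) / len(pre_dedup) if pre_dedup else 0.0),
        "metadata": metadata,  # experts/words used, attributes generated
    }
```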
---

## File Structure

```
experiments/
├── __init__.py
├── config.py                # Experiment configuration
├── docs/
│   └── experiment_design_2026-01-19.md   # This file
├── conditions/
│   ├── __init__.py
│   ├── c1_direct.py
│   ├── c2_expert_only.py
│   ├── c3_attribute_only.py
│   ├── c4_full_pipeline.py
│   └── c5_random_perspective.py
├── data/
│   ├── queries.json         # 10 pilot queries
│   └── random_words.json    # Word pool for C5
├── generate_ideas.py        # Main runner
├── deduplication.py         # Post-processing
└── results/                 # Output (gitignored)
```
---

## Verification Checklist

- [ ] Each condition produces expected number of ideas
- [ ] Deduplication reduces count meaningfully
- [ ] Results JSON contains all required metadata
- [ ] Random seed produces reproducible C5 results
- [ ] No runtime errors on all 10 pilot queries
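The seed-reproducibility item can be checked mechanically. A self-contained sketch with a toy pool; the real check would run the C5 generator twice and diff its outputs:

```python
import random

def test_seeded_sampling_is_reproducible():
    # Two independent RNGs with the same string seed must agree.
    pool = ["ocean", "mountain", "forest", "desert", "cave"]
    first = random.Random("42:A1").sample(pool, 4)
    second = random.Random("42:A1").sample(pool, 4)
    assert first == second

test_seeded_sampling_is_reproducible()
```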
---

## Next Steps After Pilot

1. Analyze pilot results for obvious issues
2. Adjust parameters if needed (idea count normalization, etc.)
3. Scale to full 30 queries
4. Human evaluation of idea quality (novelty, usefulness, feasibility)
5. Statistical analysis of condition differences