# Experiment Design: 5-Condition Idea Generation Study

**Date:** January 19, 2026
**Version:** 1.0
**Status:** Pilot Implementation

## Overview

This experiment tests whether the novelty-seeking system's two key mechanisms—**attribute decomposition** and **expert transformation**—independently and jointly improve creative ideation quality compared to direct LLM generation.

## Research Questions

1. Does decomposing a query into structured attributes improve idea diversity?
2. Do expert perspectives improve idea novelty?
3. Do these mechanisms have synergistic effects when combined?
4. Is the benefit from experts due to domain knowledge, or simply perspective-shifting?

## Experimental Design

### 2×2 Factorial Design + Control

|                  | No Attributes   | With Attributes   |
|------------------|-----------------|-------------------|
| **No Experts**   | C1: Direct      | C3: Attr-Only     |
| **With Experts** | C2: Expert-Only | C4: Full Pipeline |

**Plus:** C5: Random-Perspective (tests perspective-shifting without domain knowledge)

### Condition Descriptions

#### C1: Direct Generation (Baseline)
- Single LLM call: "Generate 20 creative ideas for [query]"
- No attribute decomposition
- No expert perspectives
- Purpose: Baseline for standard LLM ideation

#### C2: Expert-Only
- 4 experts from curated occupations
- Each expert generates 5 ideas directly for the query
- No attribute decomposition
- Purpose: Isolate expert contribution

#### C3: Attribute-Only
- Decompose query into 4 fixed categories
- Generate attributes per category
- Direct idea generation per attribute (no expert framing)
- Purpose: Isolate attribute decomposition contribution

#### C4: Full Pipeline
- Full attribute decomposition (4 categories)
- Expert transformation (4 experts × 1 keyword per attribute)
- Purpose: Test combined mechanism (main system)

#### C5: Random-Perspective
- 4 random words per query (from curated pool)
- Each word used as a "perspective" to generate 5 ideas
- Purpose: Control for perspective-shifting vs. expert knowledge

---

## Key Design Decisions & Rationale

### 1. Why 5 Conditions?

C1-C4 form a 2×2 factorial design that isolates the independent contributions of:
- **Attribute decomposition** (C1 vs C3, C2 vs C4)
- **Expert perspectives** (C1 vs C2, C3 vs C4)

C5 addresses a critical confound: if experts improve ideation, is it because of their **domain knowledge** or simply because any **perspective shift** helps? By using random words instead of domain experts, C5 tests whether the perspective-taking mechanism alone provides benefits.

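The four factorial cells can be expressed as two boolean factors. A minimal sketch of this encoding and the isolating contrasts (the `ConditionSpec` name and helper functions are illustrative, not taken from the experiment code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConditionSpec:
    """One cell of the 2x2 design, encoded as two factor flags."""
    use_attributes: bool
    use_experts: bool

# C1-C4 are the four factorial cells; C5 is handled outside the factorial.
CONDITIONS = {
    "C1": ConditionSpec(use_attributes=False, use_experts=False),  # Direct
    "C2": ConditionSpec(use_attributes=False, use_experts=True),   # Expert-Only
    "C3": ConditionSpec(use_attributes=True,  use_experts=False),  # Attr-Only
    "C4": ConditionSpec(use_attributes=True,  use_experts=True),   # Full Pipeline
}

def attribute_contrast():
    """Pairs that isolate the attribute factor (expert factor held constant)."""
    return [("C1", "C3"), ("C2", "C4")]

def expert_contrast():
    """Pairs that isolate the expert factor (attribute factor held constant)."""
    return [("C1", "C2"), ("C3", "C4")]
```

Each contrast pair differs in exactly one factor, which is what licenses reading condition differences as main effects.
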
### 2. Why Random Words in C5 (Not Fixed)?

**Decision:** Use randomly sampled words (with seed) rather than a fixed set.

**Rationale:**
- Stronger generalization: results hold across many word combinations
- Avoids cherry-picking accusations ("you just picked easy words")
- Reproducible via random seed (seed=42)
- Each query gets different random words, increasing robustness

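One way to get both properties at once (per-query variation, global reproducibility) is to derive each query's sampling seed from the global seed plus the query ID. A sketch, with an illustrative function name and a truncated word pool:

```python
import random

# Illustrative subset of the 35-word pool defined later in this document.
WORD_POOL = ["ocean", "mountain", "microscope", "butterfly",
             "sunrise", "origami", "whisper"]

def sample_words(query_id: str, n: int = 4, seed: int = 42) -> list:
    """Deterministically sample n perspective words for one query.

    Mixing the global seed with the query id gives each query its own
    word set while keeping every run of the experiment reproducible.
    """
    rng = random.Random(f"{seed}:{query_id}")
    return rng.sample(WORD_POOL, n)
```

Re-running the experiment with the same seed reproduces every C5 condition exactly, while different queries still draw different word sets.
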
### 3. Why Apply Deduplication Uniformly?

**Decision:** Apply embedding-based deduplication (threshold=0.85) to ALL conditions after generation.

**Rationale:**
- Fair comparison: all conditions normalized to unique ideas
- Creates "dedup survival rate" as an additional metric
- Hypothesis: Full Pipeline ideas are diverse (low redundancy), not just numerous
- Direct generation may produce many similar ideas that collapse after dedup

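The uniform deduplication step can be sketched as a greedy similarity filter. The bag-of-words `embed` below is only a self-contained stand-in for whatever sentence-embedding model `deduplication.py` actually uses; the thresholding logic at 0.85 is the part that mirrors the design:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would use a sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(ideas: list, threshold: float = 0.85) -> list:
    """Keep an idea only if it is below `threshold` similarity to every kept idea."""
    kept = []
    for idea in ideas:
        if all(cosine(embed(idea), embed(k)) < threshold for k in kept):
            kept.append(idea)
    return kept
```

Because the same filter runs on every condition, `len(deduplicate(ideas)) / len(ideas)` is exactly the "dedup survival rate" metric described above.
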
### 4. Why FIXED_ONLY Categories?

**Decision:** Use 4 fixed categories: Functions, Usages, User Groups, Characteristics

**Rationale:**
- Strongest proof power: cleanly isolates the "attribute decomposition" effect
- No confound from variability in dynamic category selection
- Universal applicability: these 4 categories apply to objects, technology, and services
- Dropped the "Materials" category because it doesn't apply well to services

### 5. Why Curated Expert Source?

**Decision:** Use curated occupations (210 professions) rather than LLM-generated experts.

**Rationale:**
- Reproducibility: same occupation pool across runs
- Consistency: no variance from LLM expert generation
- Control: we know exactly which experts are available
- Validation: occupations were manually curated for diversity

### 6. Why Temperature 0.9?

**Decision:** Use temperature=0.9 for all conditions.

**Rationale:**
- Higher temperature encourages more diverse/creative outputs
- Matches typical creative task settings
- Consistent across conditions for fair comparison
- Lower temperatures (0.7) showed more repetitive outputs in testing

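If qwen3:8b is served through Ollama's `/api/generate` endpoint (an assumption; the design only says the model is local), a C1 direct-generation request with this temperature might be assembled as follows. The function name and prompt wording are illustrative:

```python
def build_direct_request(query: str, n_ideas: int = 20,
                         model: str = "qwen3:8b",
                         temperature: float = 0.9) -> dict:
    """Build the JSON payload for one direct-generation (C1) call.

    Targets Ollama's /api/generate request shape; swap this out if the
    model is served differently.
    """
    return {
        "model": model,
        "prompt": f"Generate {n_ideas} creative ideas for: {query}",
        "stream": False,
        "options": {"temperature": temperature},
    }
```

Keeping `temperature` a single shared parameter makes it trivial to verify that every condition was generated under identical sampling settings.
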
### 7. Why 10 Pilot Queries?

**Decision:** Start with 10 queries before scaling to the full 30.

**Rationale:**
- Validate the pipeline works before full investment
- Catch implementation bugs early
- Balanced across categories (3 everyday, 3 technology, 4 services)
- Sufficient for initial pattern detection

---

## Configuration Summary

| Setting | Value | Rationale |
|---------|-------|-----------|
| **LLM Model** | qwen3:8b | Local, fast, consistent |
| **Temperature** | 0.9 | Encourages creativity |
| **Expert Count** | 4 | Balance diversity vs. cost |
| **Expert Source** | Curated | Reproducibility |
| **Keywords/Expert** | 1 | Simplifies analysis |
| **Language** | English | Consistency |
| **Categories** | Functions, Usages, User Groups, Characteristics | Universal applicability |
| **Dedup Threshold** | 0.85 | Standard similarity cutoff |
| **Random Seed** | 42 | Reproducibility |
| **Pilot Queries** | 10 | Validation before scaling |

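The table above could be mirrored in code roughly as follows; the field names are assumptions for illustration, not necessarily those used in `experiments/config.py`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    """Frozen snapshot of the experiment settings, matching the summary table."""
    llm_model: str = "qwen3:8b"
    temperature: float = 0.9
    expert_count: int = 4
    expert_source: str = "curated"
    keywords_per_expert: int = 1
    language: str = "English"
    categories: tuple = ("Functions", "Usages", "User Groups", "Characteristics")
    dedup_threshold: float = 0.85
    random_seed: int = 42
    pilot_query_count: int = 10
```

A frozen dataclass keeps the configuration immutable during a run, so every condition provably sees identical settings.
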
---

## Query Selection

### Pilot Queries (10)

| ID | Query | Category |
|----|-------|----------|
| A1 | Chair | Everyday |
| A5 | Bicycle | Everyday |
| A7 | Smartphone | Everyday |
| B1 | Solar panel | Technology |
| B3 | 3D printer | Technology |
| B4 | Drone | Technology |
| C1 | Food delivery service | Services |
| C2 | Online education platform | Services |
| C4 | Public transportation | Services |
| C9 | Elderly care service | Services |

### Selection Criteria
- Balanced across 3 domains (everyday objects, technology, services)
- Varying complexity levels
- Different user familiarity levels
- Subset of the full 30-query experimental protocol

---

## Random Word Pool (C5)

35 words selected across 7 conceptual categories:

| Category | Words |
|----------|-------|
| Nature | ocean, mountain, forest, desert, cave |
| Optics | microscope, telescope, kaleidoscope, prism, lens |
| Animals | butterfly, elephant, octopus, eagle, ant |
| Weather | sunrise, thunderstorm, rainbow, fog, aurora |
| Art | clockwork, origami, mosaic, symphony, ballet |
| Temporal | ancient, futuristic, organic, crystalline, liquid |
| Sensory | whisper, explosion, rhythm, silence, echo |

**Selection Criteria:**
- Concrete and evocative (easy to generate associations)
- Diverse domains (no overlap with typical expert knowledge)
- No obvious connection to test queries
- Equal representation across categories

---

## Expected Outputs

### Per Condition Per Query

| Condition | Expected Ideas (pre-dedup) | Mechanism |
|-----------|----------------------------|-----------|
| C1 | 20 | Direct request |
| C2 | 20 | 4 experts × 5 ideas |
| C3 | ~20 | Varies by attribute count |
| C4 | ~20 | 4 experts × ~5 keywords (1 per attribute) × 1 description |
| C5 | 20 | 4 words × 5 ideas |

### Metrics to Collect

1. **Pre-deduplication count**: Raw ideas generated
2. **Post-deduplication count**: Unique ideas after similarity filtering
3. **Dedup survival rate**: post/pre ratio
4. **Generation metadata**: Experts/words used, attributes generated

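The per-condition record implied by these four metrics can be sketched as follows (field names are illustrative, not the actual results schema):

```python
from typing import Optional

def condition_metrics(raw_ideas: list, unique_ideas: list,
                      metadata: Optional[dict] = None) -> dict:
    """Assemble one condition's metric record from pre- and post-dedup idea lists."""
    pre, post = len(raw_ideas), len(unique_ideas)
    return {
        "pre_dedup_count": pre,
        "post_dedup_count": post,
        # Survival rate = unique / raw; higher means less redundancy.
        "dedup_survival_rate": post / pre if pre else 0.0,
        "metadata": metadata or {},
    }
```

For example, 20 raw ideas collapsing to 13 unique ones gives a survival rate of 0.65; the hypothesis above predicts C4 should sit near 1.0 while C1 sits lower.
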
---

## File Structure

```
experiments/
├── __init__.py
├── config.py                # Experiment configuration
├── docs/
│   └── experiment_design_2026-01-19.md  # This file
├── conditions/
│   ├── __init__.py
│   ├── c1_direct.py
│   ├── c2_expert_only.py
│   ├── c3_attribute_only.py
│   ├── c4_full_pipeline.py
│   └── c5_random_perspective.py
├── data/
│   ├── queries.json         # 10 pilot queries
│   └── random_words.json    # Word pool for C5
├── generate_ideas.py        # Main runner
├── deduplication.py         # Post-processing
└── results/                 # Output (gitignored)
```

---

## Verification Checklist

- [ ] Each condition produces expected number of ideas
- [ ] Deduplication reduces count meaningfully
- [ ] Results JSON contains all required metadata
- [ ] Random seed produces reproducible C5 results
- [ ] No runtime errors on all 10 pilot queries

---

## Next Steps After Pilot

1. Analyze pilot results for obvious issues
2. Adjust parameters if needed (idea count normalization, etc.)
3. Scale to full 30 queries
4. Human evaluation of idea quality (novelty, usefulness, feasibility)
5. Statistical analysis of condition differences