feat: Add experiments framework and novelty-driven agent loop
- Add complete experiments directory with pilot study infrastructure
  - 5 experimental conditions (direct, expert-only, attribute-only, full-pipeline, random-perspective)
  - Human assessment tool with React frontend and FastAPI backend
  - AUT flexibility analysis with jump signal detection
  - Result visualization and metrics computation
- Add novelty-driven agent loop module (experiments/novelty_loop/)
  - NoveltyDrivenTaskAgent with expert perspective perturbation
  - Three termination strategies: breakthrough, exhaust, coverage
  - Interactive CLI demo with colored output
  - Embedding-based novelty scoring
- Add DDC knowledge domain classification data (en/zh)
- Add CLAUDE.md project documentation
- Update research report with experiment findings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
experiments/docs/experiment_design_2026-01-19.md (new file, 259 lines)

# Experiment Design: 5-Condition Idea Generation Study
**Date:** January 19, 2026

**Version:** 1.0

**Status:** Pilot Implementation

## Overview

This experiment tests whether the novelty-seeking system's two key mechanisms—**attribute decomposition** and **expert transformation**—independently and jointly improve creative ideation quality compared to direct LLM generation.
## Research Questions

1. Does decomposing a query into structured attributes improve idea diversity?
2. Do expert perspectives improve idea novelty?
3. Do these mechanisms have synergistic effects when combined?
4. Is the benefit from experts due to domain knowledge, or simply perspective-shifting?
## Experimental Design

### 2×2 Factorial Design + Control

|                    | No Attributes   | With Attributes   |
|--------------------|-----------------|-------------------|
| **No Experts**     | C1: Direct      | C3: Attr-Only     |
| **With Experts**   | C2: Expert-Only | C4: Full Pipeline |

**Plus:** C5: Random-Perspective (tests perspective-shifting without domain knowledge)
### Condition Descriptions

#### C1: Direct Generation (Baseline)
- Single LLM call: "Generate 20 creative ideas for [query]"
- No attribute decomposition
- No expert perspectives
- Purpose: Baseline for standard LLM ideation

#### C2: Expert-Only
- 4 experts from curated occupations
- Each expert generates 5 ideas directly for the query
- No attribute decomposition
- Purpose: Isolate expert contribution

#### C3: Attribute-Only
- Decompose query into 4 fixed categories
- Generate attributes per category
- Direct idea generation per attribute (no expert framing)
- Purpose: Isolate attribute decomposition contribution

#### C4: Full Pipeline
- Full attribute decomposition (4 categories)
- Expert transformation (4 experts × 1 keyword per attribute)
- Purpose: Test combined mechanism (main system)

#### C5: Random-Perspective
- 4 random words per query (from curated pool)
- Each word used as a "perspective" to generate 5 ideas
- Purpose: Control for perspective-shifting vs. expert knowledge
---

## Key Design Decisions & Rationale

### 1. Why 5 Conditions?

C1-C4 form a 2×2 factorial design that isolates the independent contributions of:
- **Attribute decomposition** (C1 vs C3, C2 vs C4)
- **Expert perspectives** (C1 vs C2, C3 vs C4)

C5 addresses a critical confound: if experts improve ideation, is it because of their **domain knowledge** or simply because any **perspective shift** helps? By using random words instead of domain experts, C5 tests whether the perspective-taking mechanism alone provides benefits.
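The planned statistical analysis (see Next Steps After Pilot) can exploit this factorial structure directly. A minimal sketch, assuming idea-level quality ratings are collected in long format with binary factors for the two mechanisms; the column names and toy data below are hypothetical, not the project's actual analysis code:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Toy long-format results for C1-C4: one row per rated idea, with binary
# indicators for the two mechanisms (C5 would be compared separately).
df = pd.DataFrame({
    "novelty":    [3.2, 3.0, 4.1, 4.3, 3.8, 3.6, 4.6, 4.9],
    "attributes": [0,   0,   0,   0,   1,   1,   1,   1  ],  # C3/C4 = 1
    "experts":    [0,   0,   1,   1,   0,   0,   1,   1  ],  # C2/C4 = 1
})

# Two-way model with interaction: a significant interaction term is the
# synergy asked about in research question 3.
model = ols("novelty ~ C(attributes) * C(experts)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```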
### 2. Why Random Words in C5 (Not Fixed)?

**Decision:** Use randomly sampled words (with seed) rather than a fixed set.

**Rationale:**
- Stronger generalization: results hold across many word combinations
- Avoids accusations of cherry-picking ("you just picked easy words")
- Reproducible via random seed (seed=42)
- Each query gets different random words, increasing robustness; see the sampling sketch below
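A minimal sketch of the seeded sampling, assuming random_words.json holds a flat list of the 35 words; the helper name and schema are illustrative, not the actual API of c5_random_perspective.py:

```python
import json
import random

# Assumed schema: experiments/data/random_words.json is a flat list of 35 words.
with open("experiments/data/random_words.json") as f:
    WORD_POOL = json.load(f)

def sample_words(query_id: str, k: int = 4, seed: int = 42) -> list[str]:
    # Mix the global seed with the query ID so each query draws different
    # words while every run reproduces the same draw. String seeds are
    # hashed deterministically by random.Random, unlike built-in hash().
    rng = random.Random(f"{seed}:{query_id}")
    return rng.sample(WORD_POOL, k)

# e.g. sample_words("A1") returns the same 4 words for "Chair" on every run.
```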
### 3. Why Apply Deduplication Uniformly?

**Decision:** Apply embedding-based deduplication (threshold=0.85) to ALL conditions after generation.

**Rationale:**
- Fair comparison: all conditions normalized to unique ideas
- Creates "dedup survival rate" as an additional metric
- Hypothesis: Full Pipeline ideas are diverse (low redundancy), not just numerous
- Direct generation may produce many similar ideas that collapse after dedup
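A minimal sketch of greedy cosine-similarity deduplication at the 0.85 threshold; the embedding step is left abstract because the doc does not pin down an embedding model, so this is illustrative rather than the repo's deduplication.py:

```python
import numpy as np

def deduplicate(ideas: list[str], embeddings: np.ndarray,
                threshold: float = 0.85) -> list[str]:
    """Keep an idea only if its cosine similarity to every already-kept
    idea is below the threshold. `embeddings` has one row per idea."""
    # Normalize rows so a dot product equals cosine similarity.
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(ideas)):
        if all(float(embs[i] @ embs[j]) < threshold for j in kept):
            kept.append(i)
    return [ideas[i] for i in kept]
```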
### 4. Why FIXED_ONLY Categories?

**Decision:** Use 4 fixed categories: Functions, Usages, User Groups, Characteristics

**Rationale:**
- Strongest inferential power: cleanly isolates the "attribute decomposition" effect
- No confound from the variability of dynamic category selection
- Universal applicability: these 4 categories apply to objects, technology, and services
- Dropped "Materials" category as it doesn't apply well to services
### 5. Why Curated Expert Source?

**Decision:** Use curated occupations (210 professions) rather than LLM-generated experts.

**Rationale:**
- Reproducibility: same occupation pool across runs
- Consistency: no variance from LLM expert generation
- Control: we know exactly which experts are available
- Validation: occupations were manually curated for diversity
### 6. Why Temperature 0.9?

**Decision:** Use temperature=0.9 for all conditions.

**Rationale:**
- Higher temperature encourages more diverse/creative outputs
- Matches typical creative task settings
- Consistent across conditions for fair comparison
- Lower temperatures (0.7) showed more repetitive outputs in testing
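For concreteness, a single generation call at these settings might look as follows, assuming the ollama Python client is used to reach the local qwen3:8b model; the prompt wording is illustrative:

```python
import ollama

# One C1-style call: model name and temperature come from this design doc;
# everything else (prompt text, client choice) is an assumption.
response = ollama.chat(
    model="qwen3:8b",
    messages=[{"role": "user",
               "content": "Generate 20 creative ideas for a chair."}],
    options={"temperature": 0.9},
)
print(response["message"]["content"])
```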
### 7. Why 10 Pilot Queries?

**Decision:** Start with 10 queries before scaling to full 30.

**Rationale:**
- Validate pipeline works before full investment
- Catch implementation bugs early
- Balanced across categories (3 everyday, 3 technology, 4 services)
- Sufficient for initial pattern detection
---

## Configuration Summary

| Setting | Value | Rationale |
|---------|-------|-----------|
| **LLM Model** | qwen3:8b | Local, fast, consistent |
| **Temperature** | 0.9 | Encourages creativity |
| **Expert Count** | 4 | Balance diversity vs. cost |
| **Expert Source** | Curated | Reproducibility |
| **Keywords/Expert** | 1 | Simplifies analysis |
| **Language** | English | Consistency |
| **Categories** | Functions, Usages, User Groups, Characteristics | Universal applicability |
| **Dedup Threshold** | 0.85 | Standard similarity cutoff |
| **Random Seed** | 42 | Reproducibility |
| **Pilot Queries** | 10 | Validation before scaling |
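These settings map naturally onto a single config object. A sketch of how they might appear in code; experiments/config.py ships in this commit, but the field names below are illustrative rather than its actual contents:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    # Values mirror the Configuration Summary table above.
    llm_model: str = "qwen3:8b"
    temperature: float = 0.9
    expert_count: int = 4
    expert_source: str = "curated"          # 210-profession pool
    keywords_per_expert: int = 1
    language: str = "en"
    categories: tuple[str, ...] = (
        "Functions", "Usages", "User Groups", "Characteristics")
    dedup_threshold: float = 0.85
    random_seed: int = 42
    pilot_query_count: int = 10
```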
---

## Query Selection

### Pilot Queries (10)

| ID | Query | Category |
|----|-------|----------|
| A1 | Chair | Everyday |
| A5 | Bicycle | Everyday |
| A7 | Smartphone | Everyday |
| B1 | Solar panel | Technology |
| B3 | 3D printer | Technology |
| B4 | Drone | Technology |
| C1 | Food delivery service | Services |
| C2 | Online education platform | Services |
| C4 | Public transportation | Services |
| C9 | Elderly care service | Services |
### Selection Criteria
- Balanced across 3 domains (everyday objects, technology, services)
- Varying complexity levels
- Different user familiarity levels
- Subset from full 30-query experimental protocol
---

## Random Word Pool (C5)

35 words selected across 7 conceptual categories:

| Category | Words |
|----------|-------|
| Nature | ocean, mountain, forest, desert, cave |
| Optics | microscope, telescope, kaleidoscope, prism, lens |
| Animals | butterfly, elephant, octopus, eagle, ant |
| Weather | sunrise, thunderstorm, rainbow, fog, aurora |
| Art | clockwork, origami, mosaic, symphony, ballet |
| Temporal | ancient, futuristic, organic, crystalline, liquid |
| Sensory | whisper, explosion, rhythm, silence, echo |
**Selection Criteria:**
- Concrete and evocative (easy to generate associations)
- Diverse domains (no overlap with typical expert knowledge)
- No obvious connection to test queries
- Equal representation across categories
---

## Expected Outputs

### Per Condition Per Query

| Condition | Expected Ideas (pre-dedup) | Mechanism |
|-----------|----------------------------|-----------|
| C1 | 20 | Direct request |
| C2 | 20 | 4 experts × 5 ideas |
| C3 | ~20 | Varies by attribute count |
| C4 | ~20 | 4 experts × ~5 keywords × 1 description |
| C5 | 20 | 4 words × 5 ideas |
### Metrics to Collect

1. **Pre-deduplication count**: Raw ideas generated
2. **Post-deduplication count**: Unique ideas after similarity filtering
3. **Dedup survival rate**: post/pre ratio
4. **Generation metadata**: Experts/words used, attributes generated
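A minimal sketch of bundling these per-condition, per-query metrics; the function and field names are illustrative, not the repo's actual result schema:

```python
def compute_metrics(pre_dedup: list[str], post_dedup: list[str],
                    metadata: dict) -> dict:
    """Bundle the four metrics listed above for one condition/query run."""
    return {
        "pre_count": len(pre_dedup),
        "post_count": len(post_dedup),
        # Survival rate = post/pre; guard against an empty generation.
        "dedup_survival_rate": (
            len(post_dedup) / len(pre_dedup) if pre_dedup else 0.0),
        "metadata": metadata,  # experts/words used, attributes generated
    }
```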
---

## File Structure

```
experiments/
├── __init__.py
├── config.py                # Experiment configuration
├── docs/
│   └── experiment_design_2026-01-19.md   # This file
├── conditions/
│   ├── __init__.py
│   ├── c1_direct.py
│   ├── c2_expert_only.py
│   ├── c3_attribute_only.py
│   ├── c4_full_pipeline.py
│   └── c5_random_perspective.py
├── data/
│   ├── queries.json         # 10 pilot queries
│   └── random_words.json    # Word pool for C5
├── generate_ideas.py        # Main runner
├── deduplication.py         # Post-processing
└── results/                 # Output (gitignored)
```
---

## Verification Checklist

- [ ] Each condition produces expected number of ideas
- [ ] Deduplication reduces count meaningfully
- [ ] Results JSON contains all required metadata
- [ ] Random seed produces reproducible C5 results
- [ ] No runtime errors on all 10 pilot queries
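The seed-reproducibility item can be checked mechanically. A self-contained sketch with a toy pool; the real check would run the C5 generator twice and diff its outputs:

```python
import random

def test_seeded_sampling_is_reproducible():
    # Two independent RNGs with the same string seed must agree.
    pool = ["ocean", "mountain", "forest", "desert", "cave"]
    first = random.Random("42:A1").sample(pool, 4)
    second = random.Random("42:A1").sample(pool, 4)
    assert first == second

test_seeded_sampling_is_reproducible()
```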
---

## Next Steps After Pilot

1. Analyze pilot results for obvious issues
2. Adjust parameters if needed (idea count normalization, etc.)
3. Scale to full 30 queries
4. Human evaluation of idea quality (novelty, usefulness, feasibility)
5. Statistical analysis of condition differences