# Experiment Design: 5-Condition Idea Generation Study
**Date:** January 19, 2026
**Version:** 1.0
**Status:** Pilot Implementation
## Overview
This experiment tests whether the novelty-seeking system's two key mechanisms—**attribute decomposition** and **expert transformation**—independently and jointly improve creative ideation quality compared to direct LLM generation.
## Research Questions
1. Does decomposing a query into structured attributes improve idea diversity?
2. Do expert perspectives improve idea novelty?
3. Do these mechanisms have synergistic effects when combined?
4. Is the benefit from experts due to domain knowledge, or simply perspective-shifting?
## Experimental Design
### 2×2 Factorial Design + Control
| | No Attributes | With Attributes |
|--------------------|---------------|-----------------|
| **No Experts** | C1: Direct | C3: Attr-Only |
| **With Experts** | C2: Expert-Only | C4: Full Pipeline |
**Plus:** C5: Random-Perspective (tests perspective-shifting without domain knowledge)
### Condition Descriptions
#### C1: Direct Generation (Baseline)
- Single LLM call: "Generate 20 creative ideas for [query]"
- No attribute decomposition
- No expert perspectives
- Purpose: Baseline for standard LLM ideation
#### C2: Expert-Only
- 4 experts from curated occupations
- Each expert generates 5 ideas directly for the query
- No attribute decomposition
- Purpose: Isolate expert contribution
#### C3: Attribute-Only
- Decompose query into 4 fixed categories
- Generate attributes per category
- Direct idea generation per attribute (no expert framing)
- Purpose: Isolate attribute decomposition contribution
#### C4: Full Pipeline
- Full attribute decomposition (4 categories)
- Expert transformation (4 experts × 1 keyword per attribute)
- Purpose: Test combined mechanism (main system)
#### C5: Random-Perspective
- 4 random words per query (from curated pool)
- Each word used as a "perspective" to generate 5 ideas
- Purpose: Control for perspective-shifting vs. expert knowledge
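
To make the contrast concrete, the sketch below shows, in simplified Python, how prompting differs between the single-call baseline and the perspective-based conditions. The `llm_generate` callable and the prompt wording are placeholders for illustration, not the actual code in `experiments/conditions/`.
```python
from typing import Callable, List

def run_direct(query: str, llm_generate: Callable[[str], List[str]]) -> List[str]:
    """C1: a single call asking for 20 ideas; no decomposition, no perspectives."""
    return llm_generate(f"Generate 20 creative ideas for {query}.")

def run_perspectives(query: str, perspectives: List[str],
                     llm_generate: Callable[[str], List[str]]) -> List[str]:
    """C2/C5: each perspective (expert occupation or random word) contributes 5 ideas."""
    ideas: List[str] = []
    for p in perspectives:
        ideas.extend(llm_generate(
            f"From the perspective of '{p}', generate 5 creative ideas for {query}."
        ))
    return ideas
```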
---
## Key Design Decisions & Rationale
### 1. Why 5 Conditions?
C1-C4 form a 2×2 factorial design that isolates the independent contributions of:
- **Attribute decomposition** (C1 vs C3, C2 vs C4)
- **Expert perspectives** (C1 vs C2, C3 vs C4)
C5 addresses a critical confound: if experts improve ideation, is it because of their **domain knowledge** or simply because any **perspective shift** helps? By using random words instead of domain experts, C5 tests whether the perspective-taking mechanism alone provides benefits.
### 2. Why Random Words in C5 (Not Fixed)?
**Decision:** Use randomly sampled words (with seed) rather than a fixed set.
**Rationale:**
- Stronger generalization: results hold across many word combinations
- Avoids accusations of cherry-picking ("you just picked easy words")
- Reproducible via random seed (seed=42)
- Each query gets different random words, increasing robustness
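
A minimal sketch of seeded per-query sampling follows; the pool file layout and the per-query seed derivation are assumptions for illustration, not the actual implementation.
```python
import json
import random

def sample_words(query_id: str,
                 pool_path: str = "experiments/data/random_words.json",
                 n: int = 4, base_seed: int = 42) -> list[str]:
    """Draw n words for one query, reproducibly, from the C5 word pool."""
    with open(pool_path, encoding="utf-8") as f:
        pool = json.load(f)  # assumed here to be a flat list of the 35 words
    # Per-query seed: same base seed, different (but reproducible) words per query.
    rng = random.Random(f"{base_seed}:{query_id}")
    return rng.sample(pool, n)
```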
### 3. Why Apply Deduplication Uniformly?
**Decision:** Apply embedding-based deduplication (threshold=0.85) to ALL conditions after generation.
**Rationale:**
- Fair comparison: all conditions normalized to unique ideas
- Creates "dedup survival rate" as an additional metric
- Hypothesis: Full Pipeline ideas are diverse (low redundancy), not just numerous
- Direct generation may produce many similar ideas that collapse after dedup
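
A sketch of the deduplication step is shown below, using `sentence-transformers` as a stand-in embedder (the actual pipeline may use a different embedding model or selection order).
```python
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(ideas: list[str], threshold: float = 0.85) -> list[str]:
    """Greedy dedup: keep an idea only if its cosine similarity to every
    already-kept idea is below the cutoff."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(ideas, normalize_embeddings=True)  # unit vectors, so dot = cosine
    kept: list[int] = []
    for i in range(len(ideas)):
        if all(float(np.dot(emb[i], emb[j])) < threshold for j in kept):
            kept.append(i)
    return [ideas[i] for i in kept]
```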
### 4. Why FIXED_ONLY Categories?
**Decision:** Use 4 fixed categories: Functions, Usages, User Groups, Characteristics
**Rationale:**
- Strongest inferential power: cleanly isolates the "attribute decomposition" effect
- No confound from dynamic category selection variability
- Universal applicability: these 4 categories apply to objects, technology, and services
- Dropped "Materials" category as it doesn't apply well to services
### 5. Why Curated Expert Source?
**Decision:** Use curated occupations (210 professions) rather than LLM-generated experts.
**Rationale:**
- Reproducibility: same occupation pool across runs
- Consistency: no variance from LLM expert generation
- Control: we know exactly which experts are available
- Validation: occupations were manually curated for diversity
### 6. Why Temperature 0.9?
**Decision:** Use temperature=0.9 for all conditions.
**Rationale:**
- Higher temperature encourages more diverse/creative outputs
- Matches typical creative task settings
- Consistent across conditions for fair comparison
- Lower temperatures (0.7) showed more repetitive outputs in testing
### 7. Why 10 Pilot Queries?
**Decision:** Start with 10 queries before scaling to full 30.
**Rationale:**
- Validate pipeline works before full investment
- Catch implementation bugs early
- Balanced across categories (3 everyday, 3 technology, 4 services)
- Sufficient for initial pattern detection
---
## Configuration Summary
| Setting | Value | Rationale |
|---------|-------|-----------|
| **LLM Model** | qwen3:8b | Local, fast, consistent |
| **Temperature** | 0.9 | Encourages creativity |
| **Expert Count** | 4 | Balance diversity vs. cost |
| **Expert Source** | Curated | Reproducibility |
| **Keywords/Expert** | 1 | Simplifies analysis |
| **Language** | English | Consistency |
| **Categories** | Functions, Usages, User Groups, Characteristics | Universal applicability |
| **Dedup Threshold** | 0.85 | Standard similarity cutoff |
| **Random Seed** | 42 | Reproducibility |
| **Pilot Queries** | 10 | Validation before scaling |
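
For illustration only, these settings could be collected into a single config object along the following lines; the field names are hypothetical, and `experiments/config.py` is authoritative.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    model: str = "qwen3:8b"
    temperature: float = 0.9
    expert_count: int = 4
    expert_source: str = "curated"          # 210 manually curated occupations
    keywords_per_expert: int = 1
    language: str = "en"
    categories: tuple[str, ...] = ("Functions", "Usages", "User Groups", "Characteristics")
    dedup_threshold: float = 0.85
    random_seed: int = 42
    pilot_query_count: int = 10
```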
---
## Query Selection
### Pilot Queries (10)
| ID | Query | Category |
|----|-------|----------|
| A1 | Chair | Everyday |
| A5 | Bicycle | Everyday |
| A7 | Smartphone | Everyday |
| B1 | Solar panel | Technology |
| B3 | 3D printer | Technology |
| B4 | Drone | Technology |
| C1 | Food delivery service | Services |
| C2 | Online education platform | Services |
| C4 | Public transportation | Services |
| C9 | Elderly care service | Services |
### Selection Criteria
- Balanced across 3 domains (everyday objects, technology, services)
- Varying complexity levels
- Different user familiarity levels
- Subset from full 30-query experimental protocol
---
## Random Word Pool (C5)
35 words selected across 7 conceptual categories:
| Category | Words |
|----------|-------|
| Nature | ocean, mountain, forest, desert, cave |
| Optics | microscope, telescope, kaleidoscope, prism, lens |
| Animals | butterfly, elephant, octopus, eagle, ant |
| Weather | sunrise, thunderstorm, rainbow, fog, aurora |
| Art | clockwork, origami, mosaic, symphony, ballet |
| Temporal | ancient, futuristic, organic, crystalline, liquid |
| Sensory | whisper, explosion, rhythm, silence, echo |
**Selection Criteria:**
- Concrete and evocative (easy to generate associations)
- Diverse domains (no overlap with typical expert knowledge)
- No obvious connection to test queries
- Equal representation across categories
---
## Expected Outputs
### Per Condition Per Query
| Condition | Expected Ideas (pre-dedup) | Mechanism |
|-----------|---------------------------|-----------|
| C1 | 20 | Direct request |
| C2 | 20 | 4 experts × 5 ideas |
| C3 | ~20 | Varies by attribute count |
| C4 | ~20 | 4 experts × ~5 keywords × 1 description |
| C5 | 20 | 4 words × 5 ideas |
### Metrics to Collect
1. **Pre-deduplication count**: Raw ideas generated
2. **Post-deduplication count**: Unique ideas after similarity filtering
3. **Dedup survival rate**: post/pre ratio
4. **Generation metadata**: Experts/words used, attributes generated
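
A minimal sketch of how metrics 1-3 combine per condition and query (the record fields are illustrative):
```python
from dataclasses import dataclass

@dataclass
class IdeaMetrics:
    pre_dedup: int   # raw ideas generated
    post_dedup: int  # unique ideas after similarity filtering

    @property
    def survival_rate(self) -> float:
        """post/pre ratio; 1.0 means nothing collapsed during deduplication."""
        return self.post_dedup / self.pre_dedup if self.pre_dedup else 0.0

# e.g. 20 raw ideas, 14 unique after the 0.85 filter -> survival rate 0.7
print(IdeaMetrics(pre_dedup=20, post_dedup=14).survival_rate)
```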
---
## File Structure
```
experiments/
├── __init__.py
├── config.py                    # Experiment configuration
├── docs/
│   └── experiment_design_2026-01-19.md   # This file
├── conditions/
│   ├── __init__.py
│   ├── c1_direct.py
│   ├── c2_expert_only.py
│   ├── c3_attribute_only.py
│   ├── c4_full_pipeline.py
│   └── c5_random_perspective.py
├── data/
│   ├── queries.json             # 10 pilot queries
│   └── random_words.json        # Word pool for C5
├── generate_ideas.py            # Main runner
├── deduplication.py             # Post-processing
└── results/                     # Output (gitignored)
```
---
## Verification Checklist
- [ ] Each condition produces expected number of ideas
- [ ] Deduplication reduces count meaningfully
- [ ] Results JSON contains all required metadata
- [ ] Random seed produces reproducible C5 results
- [ ] No runtime errors on all 10 pilot queries
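
A rough automated pass over this checklist might look like the following; the results filenames and JSON field names are assumptions for illustration, not the actual output schema.
```python
import json
from pathlib import Path

MIN_IDEAS = 15  # loose lower bound; C3/C4 counts vary with attribute count

def verify(results_dir: str = "experiments/results") -> None:
    for path in Path(results_dir).glob("*.json"):
        record = json.loads(path.read_text(encoding="utf-8"))
        assert record["pre_dedup_count"] >= MIN_IDEAS, path
        assert record["post_dedup_count"] <= record["pre_dedup_count"], path
        assert "metadata" in record, path  # experts/words used, attributes generated
    print("verification checks passed")
```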
---
## Next Steps After Pilot
1. Analyze pilot results for obvious issues
2. Adjust parameters if needed (idea count normalization, etc.)
3. Scale to full 30 queries
4. Human evaluation of idea quality (novelty, usefulness, feasibility)
5. Statistical analysis of condition differences