# Experiment Design: 5-Condition Idea Generation Study

- **Date:** January 19, 2026
- **Version:** 1.0
- **Status:** Pilot Implementation

## Overview
This experiment tests whether the novelty-seeking system's two key mechanisms—attribute decomposition and expert transformation—independently and jointly improve creative ideation quality compared to direct LLM generation.
## Research Questions

1. Does decomposing a query into structured attributes improve idea diversity?
2. Do expert perspectives improve idea novelty?
3. Do these mechanisms have synergistic effects when combined?
4. Is the benefit from experts due to domain knowledge, or simply perspective-shifting?
## Experimental Design

### 2×2 Factorial Design + Control

|   | No Attributes | With Attributes |
|---|---|---|
| **No Experts** | C1: Direct | C3: Attr-Only |
| **With Experts** | C2: Expert-Only | C4: Full Pipeline |
Plus: C5: Random-Perspective (tests perspective-shifting without domain knowledge)
### Condition Descriptions

#### C1: Direct Generation (Baseline)
- Single LLM call: "Generate 20 creative ideas for [query]"
- No attribute decomposition
- No expert perspectives
- Purpose: Baseline for standard LLM ideation
#### C2: Expert-Only
- 4 experts from curated occupations
- Each expert generates 5 ideas directly for the query
- No attribute decomposition
- Purpose: Isolate expert contribution
#### C3: Attribute-Only
- Decompose query into 4 fixed categories
- Generate attributes per category
- Direct idea generation per attribute (no expert framing)
- Purpose: Isolate attribute decomposition contribution
#### C4: Full Pipeline
- Full attribute decomposition (4 categories)
- Expert transformation (4 experts × 1 keyword per attribute)
- Purpose: Test combined mechanism (main system)
#### C5: Random-Perspective
- 4 random words per query (from curated pool)
- Each word used as a "perspective" to generate 5 ideas
- Purpose: Control for perspective-shifting vs. expert knowledge
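To make the contrasts concrete, the sketch below shows how two of the conditions could map onto LLM calls. `call_llm` is a hypothetical stand-in for the project's model client, and the prompt wording is illustrative, not the shipped prompts.

```python
from typing import Callable

# Hypothetical stand-in for the project's LLM client; the name and
# signature are assumptions for illustration only.
LLMFn = Callable[[str, float], list[str]]

def c1_direct(query: str, call_llm: LLMFn) -> list[str]:
    """C1: a single call asking for 20 ideas, no scaffolding."""
    prompt = f"Generate 20 creative ideas for: {query}"
    return call_llm(prompt, 0.9)

def c5_random_perspective(query: str, words: list[str],
                          call_llm: LLMFn) -> list[str]:
    """C5: each of 4 random words frames a 5-idea generation call."""
    ideas: list[str] = []
    for word in words:
        prompt = (f"Adopt the perspective of '{word}'. "
                  f"Generate 5 creative ideas for: {query}")
        ideas.extend(call_llm(prompt, 0.9))
    return ideas
```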
## Key Design Decisions & Rationale

### 1. Why 5 Conditions?
C1-C4 form a 2×2 factorial design that isolates the independent contributions of:
- Attribute decomposition (C1 vs C3, C2 vs C4)
- Expert perspectives (C1 vs C2, C3 vs C4)
C5 addresses a critical confound: if experts improve ideation, is it because of their domain knowledge or simply because any perspective shift helps? By using random words instead of domain experts, C5 tests whether the perspective-taking mechanism alone provides benefits.
### 2. Why Random Words in C5 (Not Fixed)?

**Decision:** Use randomly sampled words (with seed) rather than a fixed set.

**Rationale:**
- Stronger generalization: results hold across many word combinations
- Avoids accusations of cherry-picking ("you just picked easy words")
- Reproducible via random seed (seed=42)
- Each query gets different random words, increasing robustness
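A minimal sketch of this sampling scheme (the helper name and the per-query seed derivation are assumptions; the seed value 42 comes from the configuration):

```python
import random

def sample_perspective_words(word_pool: list[str], query_id: str,
                             n: int = 4, seed: int = 42) -> list[str]:
    """Sample n words for one query, reproducibly.

    Deriving a per-query seed from the global seed gives each query a
    different word set while keeping the whole run reproducible.
    """
    rng = random.Random(f"{seed}:{query_id}")
    return rng.sample(word_pool, n)
```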
### 3. Why Apply Deduplication Uniformly?

**Decision:** Apply embedding-based deduplication (threshold = 0.85) to ALL conditions after generation.

**Rationale:**
- Fair comparison: all conditions normalized to unique ideas
- Creates "dedup survival rate" as an additional metric
- Hypothesis: Full Pipeline ideas are diverse (low redundancy), not just numerous
- Direct generation may produce many similar ideas that collapse after dedup
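A sketch of one common way to implement this greedy embedding-based filter (the real `experiments/deduplication.py` may differ); it assumes the ideas have already been embedded by some sentence-embedding model:

```python
import numpy as np

def deduplicate(ideas: list[str], embeddings: np.ndarray,
                threshold: float = 0.85) -> list[str]:
    """Greedily keep each idea only if it is dissimilar to all kept ones."""
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(ideas)):
        if all(float(normed[i] @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return [ideas[i] for i in kept]
```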
### 4. Why FIXED_ONLY Categories?

**Decision:** Use 4 fixed categories: Functions, Usages, User Groups, Characteristics.

**Rationale:**
- Strongest evidential power: cleanly isolates the attribute-decomposition effect
- No confound from dynamic category selection variability
- Universal applicability: these 4 categories apply to objects, technology, and services
- Dropped "Materials" category as it doesn't apply well to services
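An illustrative sketch of the fixed-category decomposition step; the prompt wording and helper name are assumptions, not the shipped code:

```python
FIXED_CATEGORIES = ("Functions", "Usages", "User Groups", "Characteristics")

def decomposition_prompt(query: str, category: str, n: int = 5) -> str:
    """Build one attribute-generation prompt per fixed category."""
    return (f"List {n} distinct {category.lower()} of '{query}'. "
            "Answer with one short phrase per line.")

# Example: 4 categories x ~5 attributes each feed the later stages.
prompts = [decomposition_prompt("Chair", c) for c in FIXED_CATEGORIES]
```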
### 5. Why Curated Expert Source?

**Decision:** Use curated occupations (210 professions) rather than LLM-generated experts.

**Rationale:**
- Reproducibility: same occupation pool across runs
- Consistency: no variance from LLM expert generation
- Control: we know exactly which experts are available
- Validation: occupations were manually curated for diversity
### 6. Why Temperature 0.9?

**Decision:** Use temperature = 0.9 for all conditions.

**Rationale:**
- Higher temperature encourages more diverse/creative outputs
- Matches typical creative task settings
- Consistent across conditions for fair comparison
- Lower temperatures (0.7) showed more repetitive outputs in testing
### 7. Why 10 Pilot Queries?

**Decision:** Start with 10 queries before scaling to the full 30.

**Rationale:**
- Validate pipeline works before full investment
- Catch implementation bugs early
- Balanced across categories (3 everyday, 3 technology, 4 services)
- Sufficient for initial pattern detection
## Configuration Summary
| Setting | Value | Rationale |
|---|---|---|
| LLM Model | qwen3:8b | Local, fast, consistent |
| Temperature | 0.9 | Encourages creativity |
| Expert Count | 4 | Balance diversity vs. cost |
| Expert Source | Curated | Reproducibility |
| Keywords/Expert | 1 | Simplifies analysis |
| Language | English | Consistency |
| Categories | Functions, Usages, User Groups, Characteristics | Universal applicability |
| Dedup Threshold | 0.85 | Standard similarity cutoff |
| Random Seed | 42 | Reproducibility |
| Pilot Queries | 10 | Validation before scaling |
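These settings could be centralized along the lines sketched below; this mirrors the table rather than the actual contents of `experiments/config.py`, and the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    """Illustrative sketch of the experiment's fixed settings."""
    llm_model: str = "qwen3:8b"
    temperature: float = 0.9
    expert_count: int = 4
    expert_source: str = "curated"
    keywords_per_expert: int = 1
    language: str = "English"
    categories: tuple[str, ...] = (
        "Functions", "Usages", "User Groups", "Characteristics")
    dedup_threshold: float = 0.85
    random_seed: int = 42
    pilot_query_count: int = 10
```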
## Query Selection

### Pilot Queries (10)
| ID | Query | Category |
|---|---|---|
| A1 | Chair | Everyday |
| A5 | Bicycle | Everyday |
| A7 | Smartphone | Everyday |
| B1 | Solar panel | Technology |
| B3 | 3D printer | Technology |
| B4 | Drone | Technology |
| C1 | Food delivery service | Services |
| C2 | Online education platform | Services |
| C4 | Public transportation | Services |
| C9 | Elderly care service | Services |
### Selection Criteria
- Balanced across 3 domains (everyday objects, technology, services)
- Varying complexity levels
- Different user familiarity levels
- Subset of the full 30-query experimental protocol
### Random Word Pool (C5)
35 words selected across 7 conceptual categories:
| Category | Words |
|---|---|
| Nature | ocean, mountain, forest, desert, cave |
| Optics | microscope, telescope, kaleidoscope, prism, lens |
| Animals | butterfly, elephant, octopus, eagle, ant |
| Weather | sunrise, thunderstorm, rainbow, fog, aurora |
| Art | clockwork, origami, mosaic, symphony, ballet |
| Temporal | ancient, futuristic, organic, crystalline, liquid |
| Sensory | whisper, explosion, rhythm, silence, echo |
**Selection Criteria:**
- Concrete and evocative (easy to generate associations)
- Diverse domains (no overlap with typical expert knowledge)
- No obvious connection to test queries
- Equal representation across categories
## Expected Outputs

### Per Condition Per Query
| Condition | Expected Ideas (pre-dedup) | Mechanism |
|---|---|---|
| C1 | 20 | Direct request |
| C2 | 20 | 4 experts × 5 ideas |
| C3 | ~20 | Varies by attribute count |
| C4 | ~20 | 4 experts × ~5 keywords (1 per attribute) × 1 description |
| C5 | 20 | 4 words × 5 ideas |
### Metrics to Collect
- Pre-deduplication count: Raw ideas generated
- Post-deduplication count: Unique ideas after similarity filtering
- Dedup survival rate: post/pre ratio
- Generation metadata: Experts/words used, attributes generated
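The count metrics reduce to a small computation; a sketch with illustrative field names:

```python
def run_metrics(pre_ideas: list[str], post_ideas: list[str]) -> dict:
    """Compute the count metrics for one (condition, query) run."""
    pre, post = len(pre_ideas), len(post_ideas)
    return {
        "pre_dedup_count": pre,
        "post_dedup_count": post,
        # Near 1.0 means the condition generated genuinely diverse ideas;
        # well below 1.0 means many near-duplicates collapsed.
        "dedup_survival_rate": post / pre if pre else 0.0,
    }
```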
## File Structure
experiments/
├── __init__.py
├── config.py # Experiment configuration
├── docs/
│ └── experiment_design_2026-01-19.md # This file
├── conditions/
│ ├── __init__.py
│ ├── c1_direct.py
│ ├── c2_expert_only.py
│ ├── c3_attribute_only.py
│ ├── c4_full_pipeline.py
│ └── c5_random_perspective.py
├── data/
│ ├── queries.json # 10 pilot queries
│ └── random_words.json # Word pool for C5
├── generate_ideas.py # Main runner
├── deduplication.py # Post-processing
└── results/ # Output (gitignored)
## Verification Checklist

- [ ] Each condition produces the expected number of ideas
- [ ] Deduplication reduces counts meaningfully
- [ ] Results JSON contains all required metadata
- [ ] Random seed produces reproducible C5 results
- [ ] No runtime errors across all 10 pilot queries
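The seed check, for example, can be spot-tested in a few lines (illustrative; mirrors the per-query seed derivation sketched in Section 2):

```python
import random

# Same derived seed must yield the same C5 words for a given query.
pool = ["ocean", "mountain", "forest", "desert", "cave",
        "microscope", "telescope", "prism"]  # truncated pool for the test
first = random.Random("42:A1").sample(pool, 4)
second = random.Random("42:A1").sample(pool, 4)
assert first == second, "C5 word sampling is not reproducible"
```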
## Next Steps After Pilot

1. Analyze pilot results for obvious issues
2. Adjust parameters if needed (e.g., idea count normalization)
3. Scale to the full 30 queries
4. Run human evaluation of idea quality (novelty, usefulness, feasibility)
5. Run statistical analysis of condition differences
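For the factorial analysis, a two-way ANOVA over C1-C4 is a natural fit. The sketch below assumes a long-format results table with one quality score per (query, condition) row; the file path and column names are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_json("results/scores.json")  # hypothetical scores file
c14 = df[df["condition"].isin(["C1", "C2", "C3", "C4"])].copy()
c14["experts"] = c14["condition"].isin(["C2", "C4"])
c14["attributes"] = c14["condition"].isin(["C3", "C4"])

# Main effects of each mechanism plus their interaction, which is the
# synergy question (Research Question 3).
model = ols("score ~ experts * attributes", data=c14).fit()
print(sm.stats.anova_lm(model, typ=2))
```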