novelty-seeking/experiments/docs/experiment_design_2026-01-19.md
gbanyan 43c025e060 feat: Add experiments framework and novelty-driven agent loop
- Add complete experiments directory with pilot study infrastructure
  - 5 experimental conditions (direct, expert-only, attribute-only, full-pipeline, random-perspective)
  - Human assessment tool with React frontend and FastAPI backend
  - AUT flexibility analysis with jump signal detection
  - Result visualization and metrics computation

- Add novelty-driven agent loop module (experiments/novelty_loop/)
  - NoveltyDrivenTaskAgent with expert perspective perturbation
  - Three termination strategies: breakthrough, exhaust, coverage
  - Interactive CLI demo with colored output
  - Embedding-based novelty scoring

- Add DDC knowledge domain classification data (en/zh)
- Add CLAUDE.md project documentation
- Update research report with experiment findings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 10:16:21 +08:00


Experiment Design: 5-Condition Idea Generation Study

Date: January 19, 2026
Version: 1.0
Status: Pilot Implementation

Overview

This experiment tests whether the novelty-seeking system's two key mechanisms—attribute decomposition and expert transformation—independently and jointly improve creative ideation quality compared to direct LLM generation.

Research Questions

  1. Does decomposing a query into structured attributes improve idea diversity?
  2. Do expert perspectives improve idea novelty?
  3. Do these mechanisms have synergistic effects when combined?
  4. Is the benefit from experts due to domain knowledge, or simply perspective-shifting?

Experimental Design

2×2 Factorial Design + Control

|              | No Attributes   | With Attributes   |
|--------------|-----------------|-------------------|
| No Experts   | C1: Direct      | C3: Attr-Only     |
| With Experts | C2: Expert-Only | C4: Full Pipeline |

Plus a fifth condition, C5: Random-Perspective, which tests perspective-shifting without domain knowledge.

Condition Descriptions

C1: Direct Generation (Baseline)

  • Single LLM call: "Generate 20 creative ideas for [query]"
  • No attribute decomposition
  • No expert perspectives
  • Purpose: Baseline for standard LLM ideation

C2: Expert-Only

  • 4 experts from curated occupations
  • Each expert generates 5 ideas directly for the query
  • No attribute decomposition
  • Purpose: Isolate expert contribution

C3: Attribute-Only

  • Decompose query into 4 fixed categories
  • Generate attributes per category
  • Direct idea generation per attribute (no expert framing)
  • Purpose: Isolate attribute decomposition contribution

C4: Full Pipeline

  • Full attribute decomposition (4 categories)
  • Expert transformation (4 experts × 1 keyword per attribute)
  • Purpose: Test combined mechanism (main system)

C5: Random-Perspective

  • 4 random words per query (from curated pool)
  • Each word used as a "perspective" to generate 5 ideas
  • Purpose: Control for perspective-shifting vs. expert knowledge
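
The five conditions above can be summarized as a small generation plan. The sketch below is a minimal, hypothetical parameterization for illustration; the class and field names are not taken from the actual modules in conditions/.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConditionPlan:
    """Hypothetical summary of how each condition reaches ~20 raw ideas."""
    name: str
    uses_attributes: bool       # decompose the query into fixed categories first?
    perspective_source: str     # "none", "curated_experts", or "random_words"
    n_perspectives: int         # 0 means a single direct prompt
    ideas_per_perspective: int  # target count per perspective (or per direct call)

CONDITION_PLANS = [
    ConditionPlan("C1: Direct",             False, "none",            0, 20),
    ConditionPlan("C2: Expert-Only",        False, "curated_experts", 4, 5),
    ConditionPlan("C3: Attribute-Only",     True,  "none",            0, 20),
    ConditionPlan("C4: Full Pipeline",      True,  "curated_experts", 4, 5),
    ConditionPlan("C5: Random-Perspective", False, "random_words",    4, 5),
]
# For C3 and C4 the exact per-call split depends on how many attributes
# the decomposition step produces, hence the ~20 totals later in this document.
```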

Key Design Decisions & Rationale

1. Why 5 Conditions?

C1-C4 form a 2×2 factorial design that isolates the independent contributions of:

  • Attribute decomposition (C1 vs C3, C2 vs C4)
  • Expert perspectives (C1 vs C2, C3 vs C4)

C5 addresses a critical confound: if experts improve ideation, is it because of their domain knowledge or simply because any perspective shift helps? By using random words instead of domain experts, C5 tests whether the perspective-taking mechanism alone provides benefits.

2. Why Random Words in C5 (Not Fixed)?

Decision: Use randomly sampled words (with seed) rather than a fixed set.

Rationale:

  • Stronger generalization: results hold across many word combinations
  • Avoids cherry-picking accusation ("you just picked easy words")
  • Reproducible via random seed (seed=42)
  • Each query gets different random words, increasing robustness
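
A minimal sketch of the seeded sampling described above; the pool file layout and function name are assumptions for illustration, not the actual c5_random_perspective.py.

```python
import json
import random

def sample_words_per_query(query_ids, pool_path="data/random_words.json",
                           n_words=4, seed=42):
    """Draw a different 4-word perspective set for every query, reproducibly."""
    with open(pool_path, encoding="utf-8") as f:
        pool = json.load(f)["words"]   # assumed: a flat list of the 35 pool words

    rng = random.Random(seed)          # one seeded stream for the whole run
    return {qid: rng.sample(pool, n_words) for qid in query_ids}

# Re-running with seed=42 reproduces the exact same word assignment,
# while each query still receives its own random 4-word combination.
```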

3. Why Apply Deduplication Uniformly?

Decision: Apply embedding-based deduplication (threshold=0.85) to ALL conditions after generation.

Rationale:

  • Fair comparison: all conditions normalized to unique ideas
  • Creates "dedup survival rate" as an additional metric
  • Hypothesis: Full Pipeline ideas are diverse (low redundancy), not just numerous
  • Direct generation may produce many similar ideas that collapse after dedup
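
A minimal sketch of greedy embedding-based deduplication at the 0.85 cosine-similarity threshold; the embed() argument is a placeholder for whatever embedding model deduplication.py actually uses.

```python
import numpy as np

def deduplicate(ideas, embed, threshold=0.85):
    """Keep an idea only if it is below-threshold similar to every kept idea."""
    kept, kept_vecs = [], []
    for idea in ideas:
        vec = np.asarray(embed(idea), dtype=float)
        vec = vec / np.linalg.norm(vec)              # unit-normalize once
        if all(float(vec @ other) < threshold for other in kept_vecs):
            kept.append(idea)
            kept_vecs.append(vec)
    return kept

# dedup_survival_rate = len(deduplicate(ideas, embed)) / len(ideas)
```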

4. Why FIXED_ONLY Categories?

Decision: Use 4 fixed categories: Functions, Usages, User Groups, Characteristics

Rationale:

  • Strongest evidential power: cleanly isolates the "attribute decomposition" effect
  • No confound from dynamic category selection variability
  • Universal applicability: these 4 categories apply to objects, technology, and services
  • Dropped "Materials" category as it doesn't apply well to services

5. Why Curated Expert Source?

Decision: Use curated occupations (210 professions) rather than LLM-generated experts.

Rationale:

  • Reproducibility: same occupation pool across runs
  • Consistency: no variance from LLM expert generation
  • Control: we know exactly which experts are available
  • Validation: occupations were manually curated for diversity

6. Why Temperature 0.9?

Decision: Use temperature=0.9 for all conditions.

Rationale:

  • Higher temperature encourages more diverse/creative outputs
  • Matches typical creative task settings
  • Consistent across conditions for fair comparison
  • Lower temperatures (0.7) showed more repetitive outputs in testing

7. Why 10 Pilot Queries?

Decision: Start with 10 queries before scaling to full 30.

Rationale:

  • Validate pipeline works before full investment
  • Catch implementation bugs early
  • Balanced across categories (3 everyday, 3 technology, 4 services)
  • Sufficient for initial pattern detection

Configuration Summary

| Setting         | Value                                          | Rationale                  |
|-----------------|------------------------------------------------|----------------------------|
| LLM Model       | qwen3:8b                                       | Local, fast, consistent    |
| Temperature     | 0.9                                            | Encourages creativity      |
| Expert Count    | 4                                              | Balance diversity vs. cost |
| Expert Source   | Curated                                        | Reproducibility            |
| Keywords/Expert | 1                                              | Simplifies analysis        |
| Language        | English                                        | Consistency                |
| Categories      | Functions, Usages, User Groups, Characteristics | Universal applicability   |
| Dedup Threshold | 0.85                                           | Standard similarity cutoff |
| Random Seed     | 42                                             | Reproducibility            |
| Pilot Queries   | 10                                             | Validation before scaling  |
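
The table above maps naturally onto a single configuration object. The sketch below is a hypothetical mirror of these settings, not the actual contents of config.py.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    llm_model: str = "qwen3:8b"          # local, fast, consistent
    temperature: float = 0.9             # encourages creativity
    expert_count: int = 4                # balance diversity vs. cost
    expert_source: str = "curated"       # 210-occupation pool, reproducible
    keywords_per_expert: int = 1         # simplifies analysis
    language: str = "en"                 # consistency across conditions
    categories: tuple = ("Functions", "Usages", "User Groups", "Characteristics")
    dedup_threshold: float = 0.85        # cosine-similarity cutoff
    random_seed: int = 42                # reproducibility (incl. C5 sampling)
    pilot_query_count: int = 10          # validate before scaling to 30
```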

Query Selection

Pilot Queries (10)

| ID | Query                     | Category   |
|----|---------------------------|------------|
| A1 | Chair                     | Everyday   |
| A5 | Bicycle                   | Everyday   |
| A7 | Smartphone                | Everyday   |
| B1 | Solar panel               | Technology |
| B3 | 3D printer                | Technology |
| B4 | Drone                     | Technology |
| C1 | Food delivery service     | Services   |
| C2 | Online education platform | Services   |
| C4 | Public transportation     | Services   |
| C9 | Elderly care service      | Services   |

Selection Criteria

  • Balanced across 3 domains (everyday objects, technology, services)
  • Varying complexity levels
  • Different user familiarity levels
  • Subset from full 30-query experimental protocol

Random Word Pool (C5)

35 words selected across 7 conceptual categories:

| Category | Words                                              |
|----------|----------------------------------------------------|
| Nature   | ocean, mountain, forest, desert, cave              |
| Optics   | microscope, telescope, kaleidoscope, prism, lens   |
| Animals  | butterfly, elephant, octopus, eagle, ant           |
| Weather  | sunrise, thunderstorm, rainbow, fog, aurora        |
| Art      | clockwork, origami, mosaic, symphony, ballet       |
| Temporal | ancient, futuristic, organic, crystalline, liquid  |
| Sensory  | whisper, explosion, rhythm, silence, echo          |

Selection Criteria:

  • Concrete and evocative (easy to generate associations)
  • Diverse domains (no overlap with typical expert knowledge)
  • No obvious connection to test queries
  • Equal representation across categories

Expected Outputs

Per Condition Per Query

| Condition | Expected Ideas (pre-dedup) | Mechanism                                 |
|-----------|----------------------------|-------------------------------------------|
| C1        | 20                         | Direct request                            |
| C2        | 20                         | 4 experts × 5 ideas                       |
| C3        | ~20                        | Varies by attribute count                 |
| C4        | ~20                        | 4 experts × ~5 keywords × 1 description   |
| C5        | 20                         | 4 words × 5 ideas                         |

Metrics to Collect

  1. Pre-deduplication count: Raw ideas generated
  2. Post-deduplication count: Unique ideas after similarity filtering
  3. Dedup survival rate: post/pre ratio
  4. Generation metadata: Experts/words used, attributes generated
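
A minimal sketch of how these four metrics could be bundled per condition/query run; the field names are assumptions about the results JSON, not its actual schema.

```python
def collect_metrics(raw_ideas, unique_ideas, perspectives, attributes):
    """Assemble the four per-run metrics listed above into one record."""
    pre, post = len(raw_ideas), len(unique_ideas)
    return {
        "pre_dedup_count": pre,
        "post_dedup_count": post,
        "dedup_survival_rate": post / pre if pre else 0.0,
        "metadata": {
            "perspectives": perspectives,   # experts or random words used
            "attributes": attributes,       # empty for C1, C2, C5
        },
    }
```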

File Structure

experiments/
├── __init__.py
├── config.py               # Experiment configuration
├── docs/
│   └── experiment_design_2026-01-19.md  # This file
├── conditions/
│   ├── __init__.py
│   ├── c1_direct.py
│   ├── c2_expert_only.py
│   ├── c3_attribute_only.py
│   ├── c4_full_pipeline.py
│   └── c5_random_perspective.py
├── data/
│   ├── queries.json        # 10 pilot queries
│   └── random_words.json   # Word pool for C5
├── generate_ideas.py       # Main runner
├── deduplication.py        # Post-processing
└── results/                # Output (gitignored)

Verification Checklist

  • Each condition produces expected number of ideas
  • Deduplication reduces count meaningfully
  • Results JSON contains all required metadata
  • Random seed produces reproducible C5 results
  • No runtime errors on all 10 pilot queries

Next Steps After Pilot

  1. Analyze pilot results for obvious issues
  2. Adjust parameters if needed (idea count normalization, etc.)
  3. Scale to full 30 queries
  4. Human evaluation of idea quality (novelty, usefulness, feasibility)
  5. Statistical analysis of condition differences