feat: Add experiments framework and novelty-driven agent loop

- Add complete experiments directory with pilot study infrastructure
  - 5 experimental conditions (direct, expert-only, attribute-only, full-pipeline, random-perspective)
  - Human assessment tool with React frontend and FastAPI backend
  - AUT flexibility analysis with jump signal detection
  - Result visualization and metrics computation

- Add novelty-driven agent loop module (experiments/novelty_loop/)
  - NoveltyDrivenTaskAgent with expert perspective perturbation
  - Three termination strategies: breakthrough, exhaust, coverage
  - Interactive CLI demo with colored output
  - Embedding-based novelty scoring

- Add DDC knowledge domain classification data (en/zh)
- Add CLAUDE.md project documentation
- Update research report with experiment findings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

commit 43c025e060 (parent 26a56a2a07)
Date: 2026-01-20 10:16:21 +08:00
81 changed files with 18766 additions and 2 deletions

@@ -0,0 +1,314 @@
# Human Assessment Web Interface
A standalone web application for human assessment of generated ideas using Torrance-inspired creativity metrics.
## Overview
This tool enables blind evaluation of creative ideas generated by the novelty-seeking experiment. Raters assess ideas on four dimensions without knowing which experimental condition produced each idea, ensuring unbiased evaluation.
## Quick Start
```bash
cd experiments/assessment
# 1. Prepare assessment data (if not already done)
python3 prepare_data.py
# 2. Start the system
./start.sh
# 3. Open browser
open http://localhost:5174
```
## Directory Structure
```
assessment/
├── backend/
│   ├── app.py                  # FastAPI backend API
│   ├── database.py             # SQLite database operations
│   ├── models.py               # Pydantic models & dimension definitions
│   └── requirements.txt        # Python dependencies
├── frontend/
│   ├── src/
│   │   ├── components/         # React UI components
│   │   ├── hooks/              # React state management
│   │   ├── services/           # API client
│   │   └── types/              # TypeScript definitions
│   └── package.json
├── data/
│   └── assessment_items.json   # Prepared ideas for rating
├── results/
│   └── ratings.db              # SQLite database with ratings
├── prepare_data.py             # Data preparation script
├── analyze_ratings.py          # Inter-rater reliability analysis
├── start.sh                    # Start both servers
├── stop.sh                     # Stop all services
└── README.md                   # This file
```
## Data Preparation
### List Available Experiment Files
```bash
python3 prepare_data.py --list
```
Output:
```
Available experiment files (most recent first):
experiment_20260119_165650_deduped.json (1571.3 KB)
experiment_20260119_163040_deduped.json (156.4 KB)
```
### Prepare Assessment Data
```bash
# Use all ideas (not recommended for human assessment)
python3 prepare_data.py
# RECOMMENDED: Stratified sampling - 4 ideas per condition per query
# Results in ~200 ideas (5 conditions × 4 ideas × 10 queries)
python3 prepare_data.py --per-condition 4
# Alternative: Sample 150 ideas total (proportionally across queries)
python3 prepare_data.py --sample 150
# Limit per query (20 ideas max per query)
python3 prepare_data.py --per-query 20
# Combined: 4 per condition, max 15 per query
python3 prepare_data.py --per-condition 4 --per-query 15
# Specify a different experiment file
python3 prepare_data.py experiment_20260119_163040_deduped.json --per-condition 4
```
### Sampling Options
| Option | Description | Example |
|--------|-------------|---------|
| `--per-condition N` | Max N ideas per condition per query (stratified) | `--per-condition 4` → ~200 ideas |
| `--per-query N` | Max N ideas per query | `--per-query 20` |
| `--sample N` | Total N ideas (proportionally distributed) | `--sample 150` |
| `--seed N` | Random seed for reproducibility | `--seed 42` (default) |
**Recommendation**: Use `--per-condition 4` for balanced assessment across conditions.

The script:
1. Loads the deduped experiment results
2. Extracts all unique ideas with hidden metadata (condition, expert, keyword)
3. Assigns stable IDs to each idea
4. Shuffles ideas within each query (reproducible with seed=42)
5. Outputs `data/assessment_items.json`
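
Conceptually, the stratified `--per-condition` option reduces to a grouped random draw. The sketch below illustrates the idea only; it is not the actual `prepare_data.py` implementation, and the `query_id`/`condition` field names and input layout are assumptions:
```python
import random
from collections import defaultdict

def stratified_sample(ideas, per_condition, seed=42):
    """Keep at most `per_condition` ideas per (query, condition) pair."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idea in ideas:  # each idea assumed to be a dict with query_id/condition keys
        groups[(idea["query_id"], idea["condition"])].append(idea)

    sampled = []
    for _, group in sorted(groups.items()):
        rng.shuffle(group)                     # reproducible with the fixed seed
        sampled.extend(group[:per_condition])  # cap each stratum at per_condition
    return sampled
```
With 5 conditions, 10 queries, and `per_condition=4`, this yields the ~200 ideas quoted above.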
## Assessment Dimensions
Raters evaluate each idea on four dimensions using a 1-5 Likert scale:
### Originality
*How unexpected or surprising is this idea?*
| Score | Description |
|-------|-------------|
| 1 | Very common/obvious idea anyone would suggest |
| 2 | Somewhat common, slight variation on expected ideas |
| 3 | Moderately original, some unexpected elements |
| 4 | Quite original, notably different approach |
| 5 | Highly unexpected, truly novel concept |
### Elaboration
*How detailed and well-developed is this idea?*
| Score | Description |
|-------|-------------|
| 1 | Vague, minimal detail, just a concept |
| 2 | Basic idea with little specificity |
| 3 | Moderately detailed, some specifics provided |
| 4 | Well-developed with clear implementation hints |
| 5 | Highly specific, thoroughly developed concept |
### Coherence
*Does this idea make logical sense and relate to the query object?*
| Score | Description |
|-------|-------------|
| 1 | Nonsensical, irrelevant, or incomprehensible |
| 2 | Mostly unclear, weak connection to query |
| 3 | Partially coherent, some logical gaps |
| 4 | Mostly coherent with minor issues |
| 5 | Fully coherent, clearly relates to query |
### Usefulness
*Could this idea have practical value or inspire real innovation?*
| Score | Description |
|-------|-------------|
| 1 | No practical value whatsoever |
| 2 | Minimal usefulness, highly impractical |
| 3 | Some potential value with major limitations |
| 4 | Useful idea with realistic applications |
| 5 | Highly useful, clear practical value |
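
These four dimensions map directly onto the rating payload. A minimal Pydantic sketch of how the 1-5 bounds can be enforced; the actual definitions live in `backend/models.py` and may differ:
```python
from pydantic import BaseModel, Field

class RatingSubmission(BaseModel):
    """One rater's scores for a single idea; every dimension is a 1-5 Likert rating."""
    # Sketch only: field names mirror the ratings table; backend/models.py is authoritative.
    rater_id: str
    idea_id: str
    query_id: str
    originality: int = Field(ge=1, le=5)
    elaboration: int = Field(ge=1, le=5)
    coherence: int = Field(ge=1, le=5)
    usefulness: int = Field(ge=1, le=5)
    skipped: bool = False
```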
## Running the System
### Start
```bash
./start.sh
```
This will:
1. Check for `data/assessment_items.json` (runs `prepare_data.py` if missing)
2. Install frontend dependencies if needed
3. Start backend API on port 8002
4. Start frontend dev server on port 5174
### Stop
```bash
./stop.sh
```
Or press `Ctrl+C` in the terminal running `start.sh`.
### Manual Start (Development)
```bash
# Terminal 1: Backend
cd backend
../../../backend/venv/bin/uvicorn app:app --host 0.0.0.0 --port 8002 --reload
# Terminal 2: Frontend
cd frontend
npm run dev
```
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/health` | GET | Health check |
| `/api/info` | GET | Experiment info (total ideas, queries, conditions) |
| `/api/dimensions` | GET | Dimension definitions for UI |
| `/api/raters` | GET | List all raters |
| `/api/raters` | POST | Register/login rater |
| `/api/queries` | GET | List all queries |
| `/api/queries/{id}` | GET | Get query with all ideas |
| `/api/queries/{id}/unrated?rater_id=X` | GET | Get unrated ideas for rater |
| `/api/ratings` | POST | Submit a rating |
| `/api/progress/{rater_id}` | GET | Get rater's progress |
| `/api/statistics` | GET | Overall statistics |
| `/api/export` | GET | Export all ratings with metadata |
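
For scripted access (e.g. a quick smoke test), ratings can be submitted directly against the backend with the `requests` library. A sketch only; the payload field names mirror the database schema below and, like the rater and idea IDs, are assumptions:
```python
import requests

BASE = "http://localhost:8002"

# Register (or re-login) a rater.
requests.post(f"{BASE}/api/raters", json={"rater_id": "rater_01", "name": "Alice"})

# Submit one rating (field names assumed from the ratings table schema).
rating = {
    "rater_id": "rater_01",
    "idea_id": "idea_placeholder",   # take a real ID from /api/queries/{id}
    "query_id": "query_placeholder",
    "originality": 4,
    "elaboration": 3,
    "coherence": 5,
    "usefulness": 4,
}
requests.post(f"{BASE}/api/ratings", json=rating).raise_for_status()

# Check this rater's progress.
print(requests.get(f"{BASE}/api/progress/rater_01").json())
```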
## Analysis
After collecting ratings from multiple raters:
```bash
python3 analyze_ratings.py
```
This calculates:
- **Krippendorff's alpha**: Inter-rater reliability for ordinal data
- **ICC(2,1)**: Intraclass Correlation Coefficient with 95% CI
- **Mean ratings per condition**: Compare experimental conditions
- **Kruskal-Wallis test**: Statistical significance between conditions

Output is saved to `results/analysis_results.json`.
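
For orientation, the condition comparison can be reproduced by hand with SciPy's Kruskal-Wallis test. This is a sketch only (Krippendorff's alpha and ICC(2,1) require dedicated routines), and the JSON layout of `assessment_items.json` is assumed:
```python
import json
import sqlite3
from collections import defaultdict
from scipy.stats import kruskal

# Recover the hidden condition for each idea from the prepared items file
# (layout assumed: a list of dicts with idea_id/condition keys).
with open("data/assessment_items.json") as f:
    items = json.load(f)
condition_of = {item["idea_id"]: item["condition"] for item in items}

# Group originality scores by condition from the ratings database.
conn = sqlite3.connect("results/ratings.db")
scores = defaultdict(list)
for idea_id, originality in conn.execute(
    "SELECT idea_id, originality FROM ratings WHERE skipped = 0"
):
    scores[condition_of.get(idea_id, "unknown")].append(originality)

# Kruskal-Wallis H-test: do the conditions differ in originality?
h_stat, p_value = kruskal(*scores.values())
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```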
## Database Schema
SQLite database (`results/ratings.db`):
```sql
-- Raters
CREATE TABLE raters (
    rater_id   TEXT PRIMARY KEY,
    name       TEXT,
    created_at TIMESTAMP
);

-- Ratings
CREATE TABLE ratings (
    id          INTEGER PRIMARY KEY,
    rater_id    TEXT,
    idea_id     TEXT,
    query_id    TEXT,
    originality INTEGER CHECK(originality BETWEEN 1 AND 5),
    elaboration INTEGER CHECK(elaboration BETWEEN 1 AND 5),
    coherence   INTEGER CHECK(coherence BETWEEN 1 AND 5),
    usefulness  INTEGER CHECK(usefulness BETWEEN 1 AND 5),
    skipped     INTEGER DEFAULT 0,
    timestamp   TIMESTAMP,
    UNIQUE(rater_id, idea_id)
);

-- Progress tracking
CREATE TABLE progress (
    rater_id        TEXT,
    query_id        TEXT,
    completed_count INTEGER,
    total_count     INTEGER,
    PRIMARY KEY (rater_id, query_id)
);
```
## Blind Assessment Design
To ensure unbiased evaluation:
1. **Randomization**: Ideas are shuffled within each query using a fixed seed (42) for reproducibility
2. **Hidden metadata**: Condition, expert name, and keywords are stored but not shown to raters
3. **Consistent ordering**: All raters see the same randomized order
4. **Context provided**: Only the query text is shown (e.g., "Chair", "Bicycle")
## Workflow for Raters
1. **Login**: Enter a unique rater ID
2. **Instructions**: Read dimension definitions (shown before first rating)
3. **Rate ideas**: For each idea:
- Read the idea text
- Rate all 4 dimensions (1-5)
- Click "Submit & Next" or "Skip"
4. **Progress**: Track completion per query and overall
5. **Completion**: Summary shown when all ideas are rated
## Troubleshooting
### Backend won't start
```bash
# Check if port 8002 is in use
lsof -i :8002
# Check backend logs
cat /tmp/assessment_backend.log
```
### Frontend won't start
```bash
# Reinstall dependencies
cd frontend
rm -rf node_modules
npm install
```
### Reset database
```bash
rm results/ratings.db
# Database is auto-created on next backend start
```
### Regenerate assessment data
```bash
rm data/assessment_items.json
python3 prepare_data.py
```
## Tech Stack
- **Backend**: Python 3.11+, FastAPI, SQLite, Pydantic
- **Frontend**: React 19, TypeScript, Vite, Ant Design 6.0
- **Analysis**: NumPy, SciPy (for statistical tests)