feat: Add experiments framework and novelty-driven agent loop
- Add complete experiments directory with pilot study infrastructure
  - 5 experimental conditions (direct, expert-only, attribute-only, full-pipeline, random-perspective)
  - Human assessment tool with React frontend and FastAPI backend
  - AUT flexibility analysis with jump signal detection
  - Result visualization and metrics computation
- Add novelty-driven agent loop module (experiments/novelty_loop/)
  - NoveltyDrivenTaskAgent with expert perspective perturbation
  - Three termination strategies: breakthrough, exhaust, coverage
  - Interactive CLI demo with colored output
  - Embedding-based novelty scoring
- Add DDC knowledge domain classification data (en/zh)
- Add CLAUDE.md project documentation
- Update research report with experiment findings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
314
experiments/assessment/README.md
Normal file
@@ -0,0 +1,314 @@
# Human Assessment Web Interface

A standalone web application for human assessment of generated ideas using Torrance-inspired creativity metrics.

## Overview

This tool enables blind evaluation of creative ideas generated by the novelty-seeking experiment. Raters assess ideas on four dimensions without knowing which experimental condition produced each idea, ensuring unbiased evaluation.

## Quick Start

```bash
cd experiments/assessment

# 1. Prepare assessment data (if not already done)
python3 prepare_data.py

# 2. Start the system
./start.sh

# 3. Open browser
open http://localhost:5174
```
## Directory Structure

```
assessment/
├── backend/
│   ├── app.py                    # FastAPI backend API
│   ├── database.py               # SQLite database operations
│   ├── models.py                 # Pydantic models & dimension definitions
│   └── requirements.txt          # Python dependencies
├── frontend/
│   ├── src/
│   │   ├── components/           # React UI components
│   │   ├── hooks/                # React state management
│   │   ├── services/             # API client
│   │   └── types/                # TypeScript definitions
│   └── package.json
├── data/
│   └── assessment_items.json     # Prepared ideas for rating
├── results/
│   └── ratings.db                # SQLite database with ratings
├── prepare_data.py               # Data preparation script
├── analyze_ratings.py            # Inter-rater reliability analysis
├── start.sh                      # Start both servers
├── stop.sh                       # Stop all services
└── README.md                     # This file
```
## Data Preparation

### List Available Experiment Files

```bash
python3 prepare_data.py --list
```

Output:
```
Available experiment files (most recent first):
  experiment_20260119_165650_deduped.json (1571.3 KB)
  experiment_20260119_163040_deduped.json (156.4 KB)
```

### Prepare Assessment Data

```bash
# Use all ideas (not recommended for human assessment)
python3 prepare_data.py

# RECOMMENDED: Stratified sampling - 4 ideas per condition per query
# Results in ~200 ideas (5 conditions × 4 ideas × 10 queries)
python3 prepare_data.py --per-condition 4

# Alternative: Sample 150 ideas total (proportionally across queries)
python3 prepare_data.py --sample 150

# Limit per query (20 ideas max per query)
python3 prepare_data.py --per-query 20

# Combined: 4 per condition, max 15 per query
python3 prepare_data.py --per-condition 4 --per-query 15

# Specify a different experiment file
python3 prepare_data.py experiment_20260119_163040_deduped.json --per-condition 4
```
### Sampling Options

| Option | Description | Example |
|--------|-------------|---------|
| `--per-condition N` | Max N ideas per condition per query (stratified) | `--per-condition 4` → ~200 ideas |
| `--per-query N` | Max N ideas per query | `--per-query 20` |
| `--sample N` | Total N ideas (proportionally distributed) | `--sample 150` |
| `--seed N` | Random seed for reproducibility | `--seed 42` (default) |

**Recommendation**: Use `--per-condition 4` for balanced assessment across conditions.

The script:
1. Loads the deduped experiment results
2. Extracts all unique ideas with hidden metadata (condition, expert, keyword)
3. Assigns stable IDs to each idea
4. Shuffles ideas within each query (reproducible with seed=42)
5. Outputs `data/assessment_items.json`
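
For orientation, here is a rough sketch of the stratified sampling and shuffling steps above. The record fields (`query`, `condition`, `idea`, `expert`, `keyword`), the ID scheme, and the helper name are assumptions for illustration, not the exact `prepare_data.py` implementation:

```python
import json
import random
from collections import defaultdict


def stratified_sample(path: str, per_condition: int = 4, seed: int = 42) -> list[dict]:
    """Illustrative per-condition sampling; the real script may differ in detail."""
    with open(path) as f:
        records = json.load(f)  # assumed: a flat list of {"query", "condition", "idea", ...}

    rng = random.Random(seed)

    # Group by (query, condition), then keep at most `per_condition` ideas per group.
    groups: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for rec in records:
        groups[(rec["query"], rec["condition"])].append(rec)

    items = []
    for (query, condition), ideas in sorted(groups.items()):
        rng.shuffle(ideas)
        for idx, rec in enumerate(ideas[:per_condition]):
            items.append({
                "idea_id": f"{query}-{condition}-{idx}",  # reproducible with the fixed seed
                "query": query,
                "text": rec["idea"],
                # hidden metadata: retained for analysis, never shown to raters
                "condition": condition,
                "expert": rec.get("expert"),
                "keyword": rec.get("keyword"),
            })

    # Shuffle within each query so presentation order carries no condition signal.
    by_query: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        by_query[item["query"]].append(item)
    ordered = []
    for query in sorted(by_query):
        rng.shuffle(by_query[query])
        ordered.extend(by_query[query])
    return ordered
```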
## Assessment Dimensions

Raters evaluate each idea on four dimensions using a 1-5 Likert scale:

### Originality
*How unexpected or surprising is this idea?*

| Score | Description |
|-------|-------------|
| 1 | Very common/obvious idea anyone would suggest |
| 2 | Somewhat common, slight variation on expected ideas |
| 3 | Moderately original, some unexpected elements |
| 4 | Quite original, notably different approach |
| 5 | Highly unexpected, truly novel concept |

### Elaboration
*How detailed and well-developed is this idea?*

| Score | Description |
|-------|-------------|
| 1 | Vague, minimal detail, just a concept |
| 2 | Basic idea with little specificity |
| 3 | Moderately detailed, some specifics provided |
| 4 | Well-developed with clear implementation hints |
| 5 | Highly specific, thoroughly developed concept |

### Coherence
*Does this idea make logical sense and relate to the query object?*

| Score | Description |
|-------|-------------|
| 1 | Nonsensical, irrelevant, or incomprehensible |
| 2 | Mostly unclear, weak connection to query |
| 3 | Partially coherent, some logical gaps |
| 4 | Mostly coherent with minor issues |
| 5 | Fully coherent, clearly relates to query |

### Usefulness
*Could this idea have practical value or inspire real innovation?*

| Score | Description |
|-------|-------------|
| 1 | No practical value whatsoever |
| 2 | Minimal usefulness, highly impractical |
| 3 | Some potential value with major limitations |
| 4 | Useful idea with realistic applications |
| 5 | Highly useful, clear practical value |
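
This rubric is what the backend serves to the UI via `/api/dimensions`. As a minimal sketch of how the definitions and a submitted rating might be modeled in `models.py` (class and field names here are illustrative, not the actual code):

```python
from pydantic import BaseModel, Field


class Dimension(BaseModel):
    key: str                 # e.g. "originality"
    question: str            # prompt shown above the 1-5 scale
    anchors: dict[int, str]  # score -> anchor description


class Rating(BaseModel):
    rater_id: str
    idea_id: str
    query_id: str
    originality: int = Field(ge=1, le=5)
    elaboration: int = Field(ge=1, le=5)
    coherence: int = Field(ge=1, le=5)
    usefulness: int = Field(ge=1, le=5)


ORIGINALITY = Dimension(
    key="originality",
    question="How unexpected or surprising is this idea?",
    anchors={
        1: "Very common/obvious idea anyone would suggest",
        5: "Highly unexpected, truly novel concept",
    },
)
```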
## Running the System

### Start

```bash
./start.sh
```

This will:
1. Check for `data/assessment_items.json` (runs `prepare_data.py` if missing)
2. Install frontend dependencies if needed
3. Start backend API on port 8002
4. Start frontend dev server on port 5174

### Stop

```bash
./stop.sh
```

Or press `Ctrl+C` in the terminal running `start.sh`.

### Manual Start (Development)

```bash
# Terminal 1: Backend
cd backend
../../../backend/venv/bin/uvicorn app:app --host 0.0.0.0 --port 8002 --reload

# Terminal 2: Frontend
cd frontend
npm run dev
```
## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/health` | GET | Health check |
| `/api/info` | GET | Experiment info (total ideas, queries, conditions) |
| `/api/dimensions` | GET | Dimension definitions for UI |
| `/api/raters` | GET | List all raters |
| `/api/raters` | POST | Register/login rater |
| `/api/queries` | GET | List all queries |
| `/api/queries/{id}` | GET | Get query with all ideas |
| `/api/queries/{id}/unrated?rater_id=X` | GET | Get unrated ideas for rater |
| `/api/ratings` | POST | Submit a rating |
| `/api/progress/{rater_id}` | GET | Get rater's progress |
| `/api/statistics` | GET | Overall statistics |
| `/api/export` | GET | Export all ratings with metadata |
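
A quick way to exercise the API from Python. The payload fields, the `chair` query ID, and the shape of the responses are assumptions for illustration; check `backend/app.py` and `models.py` for the actual contracts:

```python
import requests

BASE = "http://localhost:8002"

# Register (or log back in as) a rater; payload fields are assumed
requests.post(f"{BASE}/api/raters", json={"rater_id": "rater_01", "name": "Alice"})

# Fetch ideas this rater has not yet rated for a given query
unrated = requests.get(
    f"{BASE}/api/queries/chair/unrated", params={"rater_id": "rater_01"}
).json()

# Submit one rating (1-5 on each dimension)
requests.post(f"{BASE}/api/ratings", json={
    "rater_id": "rater_01",
    "idea_id": unrated[0]["idea_id"],
    "query_id": "chair",
    "originality": 4,
    "elaboration": 3,
    "coherence": 5,
    "usefulness": 4,
})

# Check progress
print(requests.get(f"{BASE}/api/progress/rater_01").json())
```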
## Analysis

After collecting ratings from multiple raters:

```bash
python3 analyze_ratings.py
```

This calculates:
- **Krippendorff's alpha**: Inter-rater reliability for ordinal data
- **ICC(2,1)**: Intraclass Correlation Coefficient with 95% CI
- **Mean ratings per condition**: Compare experimental conditions
- **Kruskal-Wallis test**: Statistical significance between conditions

Output is saved to `results/analysis_results.json`.
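
Krippendorff's alpha and ICC(2,1) need the full rater-by-item matrix and are best left to `analyze_ratings.py`, but the per-condition comparison can be sketched in a few lines. The file layout and field names below (`idea_id`, `condition` in `assessment_items.json`) are assumptions:

```python
import json
import sqlite3
from collections import defaultdict

from scipy import stats

# Join ratings with the hidden condition labels kept in the prepared data.
with open("data/assessment_items.json") as f:
    condition_of = {item["idea_id"]: item["condition"] for item in json.load(f)}

conn = sqlite3.connect("results/ratings.db")
rows = conn.execute(
    "SELECT idea_id, originality FROM ratings WHERE skipped = 0"
).fetchall()

scores_by_condition = defaultdict(list)
for idea_id, originality in rows:
    scores_by_condition[condition_of[idea_id]].append(originality)

# Per-condition means and a Kruskal-Wallis test across conditions
for condition, scores in sorted(scores_by_condition.items()):
    mean = sum(scores) / len(scores)
    print(f"{condition}: mean originality = {mean:.2f} (n={len(scores)})")

h_stat, p_value = stats.kruskal(*scores_by_condition.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")
```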
## Database Schema

SQLite database (`results/ratings.db`):

```sql
-- Raters
CREATE TABLE raters (
    rater_id TEXT PRIMARY KEY,
    name TEXT,
    created_at TIMESTAMP
);

-- Ratings
CREATE TABLE ratings (
    id INTEGER PRIMARY KEY,
    rater_id TEXT,
    idea_id TEXT,
    query_id TEXT,
    originality INTEGER CHECK(originality BETWEEN 1 AND 5),
    elaboration INTEGER CHECK(elaboration BETWEEN 1 AND 5),
    coherence INTEGER CHECK(coherence BETWEEN 1 AND 5),
    usefulness INTEGER CHECK(usefulness BETWEEN 1 AND 5),
    skipped INTEGER DEFAULT 0,
    timestamp TIMESTAMP,
    UNIQUE(rater_id, idea_id)
);

-- Progress tracking
CREATE TABLE progress (
    rater_id TEXT,
    query_id TEXT,
    completed_count INTEGER,
    total_count INTEGER,
    PRIMARY KEY (rater_id, query_id)
);
```
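
The `UNIQUE(rater_id, idea_id)` constraint means a rater who re-submits a rating for the same idea can simply overwrite the earlier row. One way this could be handled with an upsert (a sketch, not necessarily how `database.py` does it; the rater, idea, and query IDs are made up):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("results/ratings.db")

# Insert a rating, or replace the previous one from the same rater for the same idea.
conn.execute(
    """
    INSERT INTO ratings (rater_id, idea_id, query_id,
                         originality, elaboration, coherence, usefulness,
                         skipped, timestamp)
    VALUES (?, ?, ?, ?, ?, ?, ?, 0, ?)
    ON CONFLICT(rater_id, idea_id) DO UPDATE SET
        originality = excluded.originality,
        elaboration = excluded.elaboration,
        coherence = excluded.coherence,
        usefulness = excluded.usefulness,
        timestamp = excluded.timestamp
    """,
    ("rater_01", "chair-042", "chair", 4, 3, 5, 4,
     datetime.now(timezone.utc).isoformat()),
)
conn.commit()
```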
## Blind Assessment Design

To ensure unbiased evaluation:

1. **Randomization**: Ideas are shuffled within each query using a fixed seed (42) for reproducibility
2. **Hidden metadata**: Condition, expert name, and keywords are stored but not shown to raters
3. **Consistent ordering**: All raters see the same randomized order
4. **Context provided**: Only the query text is shown (e.g., "Chair", "Bicycle")
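
In practice, hiding the metadata just means the payload served to the frontend omits those fields; something along these lines (purely illustrative):

```python
HIDDEN_FIELDS = {"condition", "expert", "keyword"}  # stored for analysis, never served


def to_rater_view(item: dict) -> dict:
    """Drop hidden metadata before an idea is sent to the frontend."""
    return {k: v for k, v in item.items() if k not in HIDDEN_FIELDS}
```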
## Workflow for Raters

1. **Login**: Enter a unique rater ID
2. **Instructions**: Read dimension definitions (shown before first rating)
3. **Rate ideas**: For each idea:
   - Read the idea text
   - Rate all 4 dimensions (1-5)
   - Click "Submit & Next" or "Skip"
4. **Progress**: Track completion per query and overall
5. **Completion**: Summary shown when all ideas are rated

## Troubleshooting

### Backend won't start
```bash
# Check if port 8002 is in use
lsof -i :8002

# Check backend logs
cat /tmp/assessment_backend.log
```

### Frontend won't start
```bash
# Reinstall dependencies
cd frontend
rm -rf node_modules
npm install
```

### Reset database
```bash
rm results/ratings.db
# Database is auto-created on next backend start
```

### Regenerate assessment data
```bash
rm data/assessment_items.json
python3 prepare_data.py
```

## Tech Stack

- **Backend**: Python 3.11+, FastAPI, SQLite, Pydantic
- **Frontend**: React 19, TypeScript, Vite, Ant Design 6.0
- **Analysis**: NumPy, SciPy (for statistical tests)