feat: Add experiments framework and novelty-driven agent loop

- Add complete experiments directory with pilot study infrastructure
  - 5 experimental conditions (direct, expert-only, attribute-only, full-pipeline, random-perspective)
  - Human assessment tool with React frontend and FastAPI backend
  - AUT flexibility analysis with jump signal detection
  - Result visualization and metrics computation

- Add novelty-driven agent loop module (experiments/novelty_loop/)
  - NoveltyDrivenTaskAgent with expert perspective perturbation
  - Three termination strategies: breakthrough, exhaust, coverage
  - Interactive CLI demo with colored output
  - Embedding-based novelty scoring

- Add DDC knowledge domain classification data (en/zh)
- Add CLAUDE.md project documentation
- Update research report with experiment findings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

commit 43c025e060 (parent 26a56a2a07)
Date: 2026-01-20 10:16:21 +08:00
81 changed files with 18766 additions and 2 deletions

@@ -0,0 +1,314 @@
# Human Assessment Web Interface
A standalone web application for human assessment of generated ideas using Torrance-inspired creativity metrics.
## Overview
This tool enables blind evaluation of creative ideas generated by the novelty-seeking experiment. Raters assess ideas on four dimensions without knowing which experimental condition produced each idea, ensuring unbiased evaluation.
## Quick Start
```bash
cd experiments/assessment
# 1. Prepare assessment data (if not already done)
python3 prepare_data.py
# 2. Start the system
./start.sh
# 3. Open browser
open http://localhost:5174
```
## Directory Structure
```
assessment/
├── backend/
│   ├── app.py                  # FastAPI backend API
│   ├── database.py             # SQLite database operations
│   ├── models.py               # Pydantic models & dimension definitions
│   └── requirements.txt        # Python dependencies
├── frontend/
│   ├── src/
│   │   ├── components/         # React UI components
│   │   ├── hooks/              # React state management
│   │   ├── services/           # API client
│   │   └── types/              # TypeScript definitions
│   └── package.json
├── data/
│   └── assessment_items.json   # Prepared ideas for rating
├── results/
│   └── ratings.db              # SQLite database with ratings
├── prepare_data.py             # Data preparation script
├── analyze_ratings.py          # Inter-rater reliability analysis
├── start.sh                    # Start both servers
├── stop.sh                     # Stop all services
└── README.md                   # This file
```
## Data Preparation
### List Available Experiment Files
```bash
python3 prepare_data.py --list
```
Output:
```
Available experiment files (most recent first):
experiment_20260119_165650_deduped.json (1571.3 KB)
experiment_20260119_163040_deduped.json (156.4 KB)
```
### Prepare Assessment Data
```bash
# Use all ideas (not recommended for human assessment)
python3 prepare_data.py
# RECOMMENDED: Stratified sampling - 4 ideas per condition per query
# Results in ~200 ideas (5 conditions × 4 ideas × 10 queries)
python3 prepare_data.py --per-condition 4
# Alternative: Sample 150 ideas total (proportionally across queries)
python3 prepare_data.py --sample 150
# Limit per query (20 ideas max per query)
python3 prepare_data.py --per-query 20
# Combined: 4 per condition, max 15 per query
python3 prepare_data.py --per-condition 4 --per-query 15
# Specify a different experiment file
python3 prepare_data.py experiment_20260119_163040_deduped.json --per-condition 4
```
### Sampling Options
| Option | Description | Example |
|--------|-------------|---------|
| `--per-condition N` | Max N ideas per condition per query (stratified) | `--per-condition 4` → ~200 ideas |
| `--per-query N` | Max N ideas per query | `--per-query 20` |
| `--sample N` | Total N ideas (proportionally distributed) | `--sample 150` |
| `--seed N` | Random seed for reproducibility | `--seed 42` (default) |
**Recommendation**: Use `--per-condition 4` for balanced assessment across conditions.

The script:
1. Loads the deduped experiment results
2. Extracts all unique ideas with hidden metadata (condition, expert, keyword)
3. Assigns stable IDs to each idea
4. Shuffles ideas within each query (reproducible with seed=42)
5. Outputs `data/assessment_items.json`
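
Conceptually, the stratified `--per-condition` option reduces to a grouped random draw. The sketch below illustrates the idea only; it is not the actual `prepare_data.py` implementation, and the `query_id`/`condition` field names and input layout are assumptions:
```python
import random
from collections import defaultdict

def stratified_sample(ideas, per_condition, seed=42):
    """Keep at most `per_condition` ideas per (query, condition) pair."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idea in ideas:  # each idea assumed to be a dict with query_id/condition keys
        groups[(idea["query_id"], idea["condition"])].append(idea)

    sampled = []
    for _, group in sorted(groups.items()):
        rng.shuffle(group)                     # reproducible with the fixed seed
        sampled.extend(group[:per_condition])  # cap each stratum at per_condition
    return sampled
```
With 5 conditions, 10 queries, and `per_condition=4`, this yields the ~200 ideas quoted above.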
## Assessment Dimensions
Raters evaluate each idea on four dimensions using a 1-5 Likert scale:
### Originality
*How unexpected or surprising is this idea?*
| Score | Description |
|-------|-------------|
| 1 | Very common/obvious idea anyone would suggest |
| 2 | Somewhat common, slight variation on expected ideas |
| 3 | Moderately original, some unexpected elements |
| 4 | Quite original, notably different approach |
| 5 | Highly unexpected, truly novel concept |
### Elaboration
*How detailed and well-developed is this idea?*
| Score | Description |
|-------|-------------|
| 1 | Vague, minimal detail, just a concept |
| 2 | Basic idea with little specificity |
| 3 | Moderately detailed, some specifics provided |
| 4 | Well-developed with clear implementation hints |
| 5 | Highly specific, thoroughly developed concept |
### Coherence
*Does this idea make logical sense and relate to the query object?*
| Score | Description |
|-------|-------------|
| 1 | Nonsensical, irrelevant, or incomprehensible |
| 2 | Mostly unclear, weak connection to query |
| 3 | Partially coherent, some logical gaps |
| 4 | Mostly coherent with minor issues |
| 5 | Fully coherent, clearly relates to query |
### Usefulness
*Could this idea have practical value or inspire real innovation?*
| Score | Description |
|-------|-------------|
| 1 | No practical value whatsoever |
| 2 | Minimal usefulness, highly impractical |
| 3 | Some potential value with major limitations |
| 4 | Useful idea with realistic applications |
| 5 | Highly useful, clear practical value |
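
These four dimensions map directly onto the rating payload. A minimal Pydantic sketch of how the 1-5 bounds can be enforced; the actual definitions live in `backend/models.py` and may differ:
```python
from pydantic import BaseModel, Field

class RatingSubmission(BaseModel):
    """One rater's scores for a single idea; every dimension is a 1-5 Likert rating."""
    # Sketch only: field names mirror the ratings table; backend/models.py is authoritative.
    rater_id: str
    idea_id: str
    query_id: str
    originality: int = Field(ge=1, le=5)
    elaboration: int = Field(ge=1, le=5)
    coherence: int = Field(ge=1, le=5)
    usefulness: int = Field(ge=1, le=5)
    skipped: bool = False
```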
## Running the System
### Start
```bash
./start.sh
```
This will:
1. Check for `data/assessment_items.json` (runs `prepare_data.py` if missing)
2. Install frontend dependencies if needed
3. Start backend API on port 8002
4. Start frontend dev server on port 5174
### Stop
```bash
./stop.sh
```
Or press `Ctrl+C` in the terminal running `start.sh`.
### Manual Start (Development)
```bash
# Terminal 1: Backend
cd backend
../../../backend/venv/bin/uvicorn app:app --host 0.0.0.0 --port 8002 --reload
# Terminal 2: Frontend
cd frontend
npm run dev
```
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/health` | GET | Health check |
| `/api/info` | GET | Experiment info (total ideas, queries, conditions) |
| `/api/dimensions` | GET | Dimension definitions for UI |
| `/api/raters` | GET | List all raters |
| `/api/raters` | POST | Register/login rater |
| `/api/queries` | GET | List all queries |
| `/api/queries/{id}` | GET | Get query with all ideas |
| `/api/queries/{id}/unrated?rater_id=X` | GET | Get unrated ideas for rater |
| `/api/ratings` | POST | Submit a rating |
| `/api/progress/{rater_id}` | GET | Get rater's progress |
| `/api/statistics` | GET | Overall statistics |
| `/api/export` | GET | Export all ratings with metadata |
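
For scripted access (e.g. a quick smoke test), ratings can be submitted directly against the backend with the `requests` library. A sketch only; the payload field names mirror the database schema below and, like the rater and idea IDs, are assumptions:
```python
import requests

BASE = "http://localhost:8002"

# Register (or re-login) a rater.
requests.post(f"{BASE}/api/raters", json={"rater_id": "rater_01", "name": "Alice"})

# Submit one rating (field names assumed from the ratings table schema).
rating = {
    "rater_id": "rater_01",
    "idea_id": "idea_placeholder",   # take a real ID from /api/queries/{id}
    "query_id": "query_placeholder",
    "originality": 4,
    "elaboration": 3,
    "coherence": 5,
    "usefulness": 4,
}
requests.post(f"{BASE}/api/ratings", json=rating).raise_for_status()

# Check this rater's progress.
print(requests.get(f"{BASE}/api/progress/rater_01").json())
```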
## Analysis
After collecting ratings from multiple raters:
```bash
python3 analyze_ratings.py
```
This calculates:
- **Krippendorff's alpha**: Inter-rater reliability for ordinal data
- **ICC(2,1)**: Intraclass Correlation Coefficient with 95% CI
- **Mean ratings per condition**: Compare experimental conditions
- **Kruskal-Wallis test**: Statistical significance between conditions

Output is saved to `results/analysis_results.json`.
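
For orientation, the condition comparison can be reproduced by hand with SciPy's Kruskal-Wallis test. This is a sketch only (Krippendorff's alpha and ICC(2,1) require dedicated routines), and the JSON layout of `assessment_items.json` is assumed:
```python
import json
import sqlite3
from collections import defaultdict
from scipy.stats import kruskal

# Recover the hidden condition for each idea from the prepared items file
# (layout assumed: a list of dicts with idea_id/condition keys).
with open("data/assessment_items.json") as f:
    items = json.load(f)
condition_of = {item["idea_id"]: item["condition"] for item in items}

# Group originality scores by condition from the ratings database.
conn = sqlite3.connect("results/ratings.db")
scores = defaultdict(list)
for idea_id, originality in conn.execute(
    "SELECT idea_id, originality FROM ratings WHERE skipped = 0"
):
    scores[condition_of.get(idea_id, "unknown")].append(originality)

# Kruskal-Wallis H-test: do the conditions differ in originality?
h_stat, p_value = kruskal(*scores.values())
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```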
## Database Schema
SQLite database (`results/ratings.db`):
```sql
-- Raters
CREATE TABLE raters (
    rater_id   TEXT PRIMARY KEY,
    name       TEXT,
    created_at TIMESTAMP
);

-- Ratings
CREATE TABLE ratings (
    id          INTEGER PRIMARY KEY,
    rater_id    TEXT,
    idea_id     TEXT,
    query_id    TEXT,
    originality INTEGER CHECK(originality BETWEEN 1 AND 5),
    elaboration INTEGER CHECK(elaboration BETWEEN 1 AND 5),
    coherence   INTEGER CHECK(coherence BETWEEN 1 AND 5),
    usefulness  INTEGER CHECK(usefulness BETWEEN 1 AND 5),
    skipped     INTEGER DEFAULT 0,
    timestamp   TIMESTAMP,
    UNIQUE(rater_id, idea_id)
);

-- Progress tracking
CREATE TABLE progress (
    rater_id        TEXT,
    query_id        TEXT,
    completed_count INTEGER,
    total_count     INTEGER,
    PRIMARY KEY (rater_id, query_id)
);
```
## Blind Assessment Design
To ensure unbiased evaluation:
1. **Randomization**: Ideas are shuffled within each query using a fixed seed (42) for reproducibility
2. **Hidden metadata**: Condition, expert name, and keywords are stored but not shown to raters
3. **Consistent ordering**: All raters see the same randomized order
4. **Context provided**: Only the query text is shown (e.g., "Chair", "Bicycle")
## Workflow for Raters
1. **Login**: Enter a unique rater ID
2. **Instructions**: Read dimension definitions (shown before first rating)
3. **Rate ideas**: For each idea:
- Read the idea text
- Rate all 4 dimensions (1-5)
- Click "Submit & Next" or "Skip"
4. **Progress**: Track completion per query and overall
5. **Completion**: Summary shown when all ideas are rated
## Troubleshooting
### Backend won't start
```bash
# Check if port 8002 is in use
lsof -i :8002
# Check backend logs
cat /tmp/assessment_backend.log
```
### Frontend won't start
```bash
# Reinstall dependencies
cd frontend
rm -rf node_modules
npm install
```
### Reset database
```bash
rm results/ratings.db
# Database is auto-created on next backend start
```
### Regenerate assessment data
```bash
rm data/assessment_items.json
python3 prepare_data.py
```
## Tech Stack
- **Backend**: Python 3.11+, FastAPI, SQLite, Pydantic
- **Frontend**: React 19, TypeScript, Vite, Ant Design 6.0
- **Analysis**: NumPy, SciPy (for statistical tests)