# Human Assessment Web Interface

A standalone web application for human assessment of generated ideas using Torrance-inspired creativity metrics.

## Overview

This tool enables blind evaluation of creative ideas generated by the novelty-seeking experiment. Raters assess ideas on four dimensions without knowing which experimental condition produced each idea, ensuring unbiased evaluation.

## Quick Start

```bash
cd experiments/assessment

# 1. Prepare assessment data (if not already done)
python3 prepare_data.py

# 2. Start the system
./start.sh

# 3. Open browser
open http://localhost:5174
```

## Directory Structure

```
assessment/
├── backend/
│   ├── app.py                 # FastAPI backend API
│   ├── database.py            # SQLite database operations
│   ├── models.py              # Pydantic models & dimension definitions
│   └── requirements.txt       # Python dependencies
├── frontend/
│   ├── src/
│   │   ├── components/        # React UI components
│   │   ├── hooks/             # React state management
│   │   ├── services/          # API client
│   │   └── types/             # TypeScript definitions
│   └── package.json
├── data/
│   └── assessment_items.json  # Prepared ideas for rating
├── results/
│   └── ratings.db             # SQLite database with ratings
├── prepare_data.py            # Data preparation script
├── analyze_ratings.py         # Inter-rater reliability analysis
├── start.sh                   # Start both servers
├── stop.sh                    # Stop all services
└── README.md                  # This file
```

## Data Preparation

### List Available Experiment Files

```bash
python3 prepare_data.py --list
```

Output:

```
Available experiment files (most recent first):
experiment_20260119_165650_deduped.json (1571.3 KB)
experiment_20260119_163040_deduped.json (156.4 KB)
```

### Prepare Assessment Data

```bash
# Use all ideas (not recommended for human assessment)
python3 prepare_data.py

# RECOMMENDED: Stratified sampling - 4 ideas per condition per query
# Results in ~200 ideas (5 conditions × 4 ideas × 10 queries)
python3 prepare_data.py --per-condition 4

# Alternative: Sample 150 ideas total (proportionally across queries)
python3 prepare_data.py --sample 150

# Limit per query (20 ideas max per query)
python3 prepare_data.py --per-query 20

# Combined: 4 per condition, max 15 per query
python3 prepare_data.py --per-condition 4 --per-query 15

# Specify a different experiment file
python3 prepare_data.py experiment_20260119_163040_deduped.json --per-condition 4
```

### Sampling Options

| Option | Description | Example |
|--------|-------------|---------|
| `--per-condition N` | Max N ideas per condition per query (stratified) | `--per-condition 4` → ~200 ideas |
| `--per-query N` | Max N ideas per query | `--per-query 20` |
| `--sample N` | Total N ideas (proportionally distributed) | `--sample 150` |
| `--seed N` | Random seed for reproducibility | `--seed 42` (default) |

**Recommendation**: Use `--per-condition 4` for balanced assessment across conditions.

The script:

1. Loads the deduped experiment results
2. Extracts all unique ideas with hidden metadata (condition, expert, keyword)
3. Assigns stable IDs to each idea
4. Shuffles ideas within each query (reproducible with seed=42)
5. Outputs `data/assessment_items.json`

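The sketch below illustrates the sampling and shuffling steps above in miniature. It is not the code in `prepare_data.py`; the input field names (`query`, `condition`, `text`, `expert`, `keyword`) and the ID scheme are assumptions for illustration only.

```python
import json
import random
from collections import defaultdict

def stratified_sample(ideas, per_condition=4, seed=42):
    """Illustrative sketch: keep at most `per_condition` ideas per
    (query, condition) pair, then shuffle within each query so every
    rater sees the same randomized, condition-blind order."""
    rng = random.Random(seed)

    # Group ideas by (query, condition); field names are illustrative.
    buckets = defaultdict(list)
    for idea in ideas:
        buckets[(idea["query"], idea["condition"])].append(idea)

    # Sample per bucket, then regroup by query.
    by_query = defaultdict(list)
    for (query, _condition), items in buckets.items():
        by_query[query].extend(rng.sample(items, min(per_condition, len(items))))

    # Shuffle within each query; the fixed seed keeps the order reproducible.
    output = []
    for query, items in by_query.items():
        rng.shuffle(items)
        for i, idea in enumerate(items):
            output.append({
                "id": f"{query}_{i:03d}",          # stable ID for this run (hypothetical scheme)
                "query": query,
                "text": idea["text"],
                # Hidden metadata: kept for analysis, never shown to raters.
                "_condition": idea["condition"],
                "_expert": idea.get("expert"),
                "_keyword": idea.get("keyword"),
            })
    return output

if __name__ == "__main__":
    with open("experiment_results.json") as f:     # hypothetical input file
        ideas = json.load(f)
    items = stratified_sample(ideas, per_condition=4)
    with open("data/assessment_items.json", "w") as f:
        json.dump(items, f, indent=2, ensure_ascii=False)
```
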
## Assessment Dimensions

Raters evaluate each idea on four dimensions using a 1-5 Likert scale:

### Originality

*How unexpected or surprising is this idea?*

| Score | Description |
|-------|-------------|
| 1 | Very common/obvious idea anyone would suggest |
| 2 | Somewhat common, slight variation on expected ideas |
| 3 | Moderately original, some unexpected elements |
| 4 | Quite original, notably different approach |
| 5 | Highly unexpected, truly novel concept |

### Elaboration

*How detailed and well-developed is this idea?*

| Score | Description |
|-------|-------------|
| 1 | Vague, minimal detail, just a concept |
| 2 | Basic idea with little specificity |
| 3 | Moderately detailed, some specifics provided |
| 4 | Well-developed with clear implementation hints |
| 5 | Highly specific, thoroughly developed concept |

### Coherence

*Does this idea make logical sense and relate to the query object?*

| Score | Description |
|-------|-------------|
| 1 | Nonsensical, irrelevant, or incomprehensible |
| 2 | Mostly unclear, weak connection to query |
| 3 | Partially coherent, some logical gaps |
| 4 | Mostly coherent with minor issues |
| 5 | Fully coherent, clearly relates to query |

### Usefulness

*Could this idea have practical value or inspire real innovation?*

| Score | Description |
|-------|-------------|
| 1 | No practical value whatsoever |
| 2 | Minimal usefulness, highly impractical |
| 3 | Some potential value with major limitations |
| 4 | Useful idea with realistic applications |
| 5 | Highly useful, clear practical value |

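The dimension definitions live in `backend/models.py` alongside the Pydantic models. As a rough sketch of how a rating submission could encode the 1-5 constraint and the skip option (class and field names here are assumptions, not the actual models):

```python
from typing import Optional
from pydantic import BaseModel, Field

class RatingSubmission(BaseModel):
    """Hypothetical shape of a rating payload; see backend/models.py for the real models."""
    rater_id: str
    idea_id: str
    query_id: str
    # Each dimension is a 1-5 Likert score, or None when the idea is skipped.
    originality: Optional[int] = Field(default=None, ge=1, le=5)
    elaboration: Optional[int] = Field(default=None, ge=1, le=5)
    coherence: Optional[int] = Field(default=None, ge=1, le=5)
    usefulness: Optional[int] = Field(default=None, ge=1, le=5)
    skipped: bool = False
```
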
## Running the System

### Start

```bash
./start.sh
```

This will:

1. Check for `data/assessment_items.json` (runs `prepare_data.py` if missing)
2. Install frontend dependencies if needed
3. Start backend API on port 8002
4. Start frontend dev server on port 5174

### Stop

```bash
./stop.sh
```

Or press `Ctrl+C` in the terminal running `start.sh`.

### Manual Start (Development)

```bash
# Terminal 1: Backend
cd backend
../../../backend/venv/bin/uvicorn app:app --host 0.0.0.0 --port 8002 --reload

# Terminal 2: Frontend
cd frontend
npm run dev
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/health` | GET | Health check |
| `/api/info` | GET | Experiment info (total ideas, queries, conditions) |
| `/api/dimensions` | GET | Dimension definitions for UI |
| `/api/raters` | GET | List all raters |
| `/api/raters` | POST | Register/login rater |
| `/api/queries` | GET | List all queries |
| `/api/queries/{id}` | GET | Get query with all ideas |
| `/api/queries/{id}/unrated?rater_id=X` | GET | Get unrated ideas for rater |
| `/api/ratings` | POST | Submit a rating |
| `/api/progress/{rater_id}` | GET | Get rater's progress |
| `/api/statistics` | GET | Overall statistics |
| `/api/export` | GET | Export all ratings with metadata |

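For scripted access (e.g., smoke-testing the backend), a minimal client could look like the sketch below. It assumes the `requests` package and guesses at the JSON field names, which are not documented in this README; check `backend/app.py` for the actual request and response shapes.

```python
import requests

BASE = "http://localhost:8002"

# Health check.
print(requests.get(f"{BASE}/api/health", timeout=5).json())

# Register (or log in) a rater; the payload fields are assumptions.
rater = requests.post(
    f"{BASE}/api/raters",
    json={"rater_id": "rater_01", "name": "Rater One"},
    timeout=5,
).json()

# Submit one rating; field names mirror the ratings table columns but are illustrative.
rating = {
    "rater_id": "rater_01",
    "idea_id": "chair_001",
    "query_id": "chair",
    "originality": 4,
    "elaboration": 3,
    "coherence": 5,
    "usefulness": 4,
}
resp = requests.post(f"{BASE}/api/ratings", json=rating, timeout=5)
resp.raise_for_status()

# Check progress for this rater.
print(requests.get(f"{BASE}/api/progress/rater_01", timeout=5).json())
```
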
## Analysis

After collecting ratings from multiple raters:

```bash
python3 analyze_ratings.py
```

This calculates:

- **Krippendorff's alpha**: Inter-rater reliability for ordinal data
- **ICC(2,1)**: Intraclass Correlation Coefficient with 95% CI
- **Mean ratings per condition**: Compare experimental conditions
- **Kruskal-Wallis test**: Statistical significance between conditions

Output is saved to `results/analysis_results.json`.

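To illustrate the kind of computation `analyze_ratings.py` performs (this is not its actual code), the sketch below compares per-condition originality scores with a Kruskal-Wallis test and computes Krippendorff's alpha. It assumes a flat list of rating dicts (e.g., built from `/api/export`) and the third-party `krippendorff` package.

```python
import numpy as np
from scipy.stats import kruskal
import krippendorff  # third-party: pip install krippendorff

# Hypothetical flat ratings with condition metadata attached after export.
ratings = [
    {"idea_id": "chair_001", "rater_id": "r1", "condition": "direct", "originality": 2},
    {"idea_id": "chair_001", "rater_id": "r2", "condition": "direct", "originality": 3},
    {"idea_id": "chair_002", "rater_id": "r1", "condition": "full-pipeline", "originality": 5},
    {"idea_id": "chair_002", "rater_id": "r2", "condition": "full-pipeline", "originality": 4},
]

# Mean ratings per condition and Kruskal-Wallis test across conditions.
by_condition = {}
for r in ratings:
    by_condition.setdefault(r["condition"], []).append(r["originality"])
means = {cond: float(np.mean(vals)) for cond, vals in by_condition.items()}
h_stat, p_value = kruskal(*by_condition.values())
print(means, h_stat, p_value)

# Krippendorff's alpha: one row per rater, one column per idea, NaN where missing.
raters = sorted({r["rater_id"] for r in ratings})
ideas = sorted({r["idea_id"] for r in ratings})
matrix = np.full((len(raters), len(ideas)), np.nan)
for r in ratings:
    matrix[raters.index(r["rater_id"]), ideas.index(r["idea_id"])] = r["originality"]
alpha = krippendorff.alpha(reliability_data=matrix, level_of_measurement="ordinal")
print("Krippendorff's alpha (originality):", alpha)
```
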
## Database Schema

SQLite database (`results/ratings.db`):

```sql
-- Raters
CREATE TABLE raters (
    rater_id TEXT PRIMARY KEY,
    name TEXT,
    created_at TIMESTAMP
);

-- Ratings
CREATE TABLE ratings (
    id INTEGER PRIMARY KEY,
    rater_id TEXT,
    idea_id TEXT,
    query_id TEXT,
    originality INTEGER CHECK(originality BETWEEN 1 AND 5),
    elaboration INTEGER CHECK(elaboration BETWEEN 1 AND 5),
    coherence INTEGER CHECK(coherence BETWEEN 1 AND 5),
    usefulness INTEGER CHECK(usefulness BETWEEN 1 AND 5),
    skipped INTEGER DEFAULT 0,
    timestamp TIMESTAMP,
    UNIQUE(rater_id, idea_id)
);

-- Progress tracking
CREATE TABLE progress (
    rater_id TEXT,
    query_id TEXT,
    completed_count INTEGER,
    total_count INTEGER,
    PRIMARY KEY (rater_id, query_id)
);
```

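The `UNIQUE(rater_id, idea_id)` constraint means a re-submitted rating must either be rejected or update the existing row. For ad-hoc inspection or bulk import outside the API, a minimal sketch using Python's standard `sqlite3` module (not the backend's `database.py`) could take the upsert route:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("results/ratings.db")

# Upsert: the UNIQUE(rater_id, idea_id) constraint turns a re-submission
# into an update of the existing row instead of a duplicate.
conn.execute(
    """
    INSERT INTO ratings (rater_id, idea_id, query_id,
                         originality, elaboration, coherence, usefulness,
                         skipped, timestamp)
    VALUES (?, ?, ?, ?, ?, ?, ?, 0, ?)
    ON CONFLICT(rater_id, idea_id) DO UPDATE SET
        originality = excluded.originality,
        elaboration = excluded.elaboration,
        coherence = excluded.coherence,
        usefulness = excluded.usefulness,
        timestamp = excluded.timestamp
    """,
    ("rater_01", "chair_001", "chair", 4, 3, 5, 4,
     datetime.now(timezone.utc).isoformat()),
)
conn.commit()

# Quick sanity check: how many ratings has each rater submitted?
# (Condition metadata is not stored here; it stays hidden in the prepared items.)
for row in conn.execute("SELECT rater_id, COUNT(*) FROM ratings GROUP BY rater_id"):
    print(row)
conn.close()
```
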
## Blind Assessment Design

To ensure unbiased evaluation:

1. **Randomization**: Ideas are shuffled within each query using a fixed seed (42) for reproducibility
2. **Hidden metadata**: Condition, expert name, and keywords are stored but not shown to raters
3. **Consistent ordering**: All raters see the same randomized order
4. **Context provided**: Only the query text is shown (e.g., "Chair", "Bicycle")

## Workflow for Raters

1. **Login**: Enter a unique rater ID
2. **Instructions**: Read dimension definitions (shown before first rating)
3. **Rate ideas**: For each idea:
   - Read the idea text
   - Rate all 4 dimensions (1-5)
   - Click "Submit & Next" or "Skip"
4. **Progress**: Track completion per query and overall
5. **Completion**: Summary shown when all ideas are rated

## Troubleshooting

### Backend won't start

```bash
# Check if port 8002 is in use
lsof -i :8002

# Check backend logs
cat /tmp/assessment_backend.log
```

### Frontend won't start

```bash
# Reinstall dependencies
cd frontend
rm -rf node_modules
npm install
```

### Reset database

```bash
rm results/ratings.db
# Database is auto-created on next backend start
```

### Regenerate assessment data

```bash
rm data/assessment_items.json
python3 prepare_data.py
```

## Tech Stack

- **Backend**: Python 3.11+, FastAPI, SQLite, Pydantic
- **Frontend**: React 19, TypeScript, Vite, Ant Design 6.0
- **Analysis**: NumPy, SciPy (for statistical tests)