Human Assessment Web Interface
A standalone web application for human assessment of generated ideas using Torrance-inspired creativity metrics.
Overview
This tool enables blind evaluation of creative ideas generated by the novelty-seeking experiment. Raters assess ideas on four dimensions without knowing which experimental condition produced each idea, ensuring unbiased evaluation.
Quick Start
cd experiments/assessment
# 1. Prepare assessment data (if not already done)
python3 prepare_data.py
# 2. Start the system
./start.sh
# 3. Open browser
open http://localhost:5174
Directory Structure
assessment/
├── backend/
│ ├── app.py # FastAPI backend API
│ ├── database.py # SQLite database operations
│ ├── models.py # Pydantic models & dimension definitions
│ └── requirements.txt # Python dependencies
├── frontend/
│ ├── src/
│ │ ├── components/ # React UI components
│ │ ├── hooks/ # React state management
│ │ ├── services/ # API client
│ │ └── types/ # TypeScript definitions
│ └── package.json
├── data/
│ └── assessment_items.json # Prepared ideas for rating
├── results/
│ └── ratings.db # SQLite database with ratings
├── prepare_data.py # Data preparation script
├── analyze_ratings.py # Inter-rater reliability analysis
├── start.sh # Start both servers
├── stop.sh # Stop all services
└── README.md # This file
Data Preparation
List Available Experiment Files
python3 prepare_data.py --list
Output:
Available experiment files (most recent first):
experiment_20260119_165650_deduped.json (1571.3 KB)
experiment_20260119_163040_deduped.json (156.4 KB)
Prepare Assessment Data
# Use all ideas (not recommended for human assessment)
python3 prepare_data.py
# RECOMMENDED: Stratified sampling - 4 ideas per condition per query
# Results in ~200 ideas (5 conditions × 4 ideas × 10 queries)
python3 prepare_data.py --per-condition 4
# Alternative: Sample 150 ideas total (proportionally across queries)
python3 prepare_data.py --sample 150
# Limit per query (20 ideas max per query)
python3 prepare_data.py --per-query 20
# Combined: 4 per condition, max 15 per query
python3 prepare_data.py --per-condition 4 --per-query 15
# Specify a different experiment file
python3 prepare_data.py experiment_20260119_163040_deduped.json --per-condition 4
Sampling Options
| Option | Description | Example |
|---|---|---|
| `--per-condition N` | Max N ideas per condition per query (stratified) | `--per-condition 4` → ~200 ideas |
| `--per-query N` | Max N ideas per query | `--per-query 20` |
| `--sample N` | Total N ideas (proportionally distributed) | `--sample 150` |
| `--seed N` | Random seed for reproducibility | `--seed 42` (default) |
Recommendation: Use `--per-condition 4` for balanced assessment across conditions.
The script:
- Loads the deduped experiment results
- Extracts all unique ideas with hidden metadata (condition, expert, keyword)
- Assigns stable IDs to each idea
- Shuffles ideas within each query (reproducible with seed=42)
- Outputs `data/assessment_items.json` (see the sketch below)
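The actual logic lives in `prepare_data.py`; the following is a minimal sketch of the stratified sampling and seeded shuffling, assuming hypothetical record fields (`query_id`, `condition`, `text`) rather than the real experiment schema.

```python
# Sketch only: field names are assumptions, not the actual experiment schema.
import json
import random
from collections import defaultdict

SEED = 42
PER_CONDITION = 4

def prepare(input_path: str, output_path: str) -> None:
    with open(input_path) as f:
        records = json.load(f)

    # Group ideas by query, then by condition, for stratified sampling.
    by_query = defaultdict(lambda: defaultdict(list))
    for rec in records:
        by_query[rec["query_id"]][rec["condition"]].append(rec)

    rng = random.Random(SEED)  # fixed seed => reproducible shuffling
    items = []
    for query_id, conditions in by_query.items():
        sampled = []
        for condition, ideas in conditions.items():
            sampled.extend(rng.sample(ideas, min(PER_CONDITION, len(ideas))))
        rng.shuffle(sampled)  # blind raters to condition ordering
        for i, idea in enumerate(sampled):
            items.append({
                "id": f"{query_id}-{i:03d}",   # stable ID
                "query_id": query_id,
                "text": idea["text"],
                # Hidden metadata: stored for analysis, never shown to raters.
                "condition": idea["condition"],
            })

    with open(output_path, "w") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
```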
Assessment Dimensions
Raters evaluate each idea on four dimensions using a 1-5 Likert scale:
Originality
How unexpected or surprising is this idea?
| Score | Description |
|---|---|
| 1 | Very common/obvious idea anyone would suggest |
| 2 | Somewhat common, slight variation on expected ideas |
| 3 | Moderately original, some unexpected elements |
| 4 | Quite original, notably different approach |
| 5 | Highly unexpected, truly novel concept |
Elaboration
How detailed and well-developed is this idea?
| Score | Description |
|---|---|
| 1 | Vague, minimal detail, just a concept |
| 2 | Basic idea with little specificity |
| 3 | Moderately detailed, some specifics provided |
| 4 | Well-developed with clear implementation hints |
| 5 | Highly specific, thoroughly developed concept |
Coherence
Does this idea make logical sense and relate to the query object?
| Score | Description |
|---|---|
| 1 | Nonsensical, irrelevant, or incomprehensible |
| 2 | Mostly unclear, weak connection to query |
| 3 | Partially coherent, some logical gaps |
| 4 | Mostly coherent with minor issues |
| 5 | Fully coherent, clearly relates to query |
Usefulness
Could this idea have practical value or inspire real innovation?
| Score | Description |
|---|---|
| 1 | No practical value whatsoever |
| 2 | Minimal usefulness, highly impractical |
| 3 | Some potential value with major limitations |
| 4 | Useful idea with realistic applications |
| 5 | Highly useful, clear practical value |
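The four dimensions and their 1-5 bounds map naturally onto a validated request model. The sketch below is illustrative only and is not the actual `backend/models.py`; field names follow the database schema shown later in this README.

```python
# Illustrative rating model; the real models.py may differ.
from pydantic import BaseModel, Field

class RatingSubmission(BaseModel):
    rater_id: str
    idea_id: str
    query_id: str
    originality: int = Field(..., ge=1, le=5)
    elaboration: int = Field(..., ge=1, le=5)
    coherence: int = Field(..., ge=1, le=5)
    usefulness: int = Field(..., ge=1, le=5)
    skipped: bool = False
```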
Running the System
Start
./start.sh
This will:
- Check for `data/assessment_items.json` (runs `prepare_data.py` if missing)
- Install frontend dependencies if needed
- Start backend API on port 8002
- Start frontend dev server on port 5174
Stop
./stop.sh
Or press Ctrl+C in the terminal running start.sh.
Manual Start (Development)
# Terminal 1: Backend
cd backend
../../../backend/venv/bin/uvicorn app:app --host 0.0.0.0 --port 8002 --reload
# Terminal 2: Frontend
cd frontend
npm run dev
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/health` | GET | Health check |
| `/api/info` | GET | Experiment info (total ideas, queries, conditions) |
| `/api/dimensions` | GET | Dimension definitions for UI |
| `/api/raters` | GET | List all raters |
| `/api/raters` | POST | Register/login rater |
| `/api/queries` | GET | List all queries |
| `/api/queries/{id}` | GET | Get query with all ideas |
| `/api/queries/{id}/unrated?rater_id=X` | GET | Get unrated ideas for rater |
| `/api/ratings` | POST | Submit a rating |
| `/api/progress/{rater_id}` | GET | Get rater's progress |
| `/api/statistics` | GET | Overall statistics |
| `/api/export` | GET | Export all ratings with metadata |
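As a quick illustration of how a script could drive the API on port 8002, the snippet below submits one rating and fetches progress. The JSON body fields mirror the database schema and the model sketch above; they are assumptions, not the verified request schema.

```python
# Example client calls against the local backend (hypothetical IDs).
import requests

BASE = "http://localhost:8002"

resp = requests.post(f"{BASE}/api/ratings", json={
    "rater_id": "rater-01",
    "idea_id": "Chair-007",   # hypothetical idea ID
    "query_id": "Chair",
    "originality": 4,
    "elaboration": 3,
    "coherence": 5,
    "usefulness": 4,
    "skipped": False,
})
resp.raise_for_status()

progress = requests.get(f"{BASE}/api/progress/rater-01").json()
print(progress)
```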
Analysis
After collecting ratings from multiple raters:
python3 analyze_ratings.py
This calculates:
- Krippendorff's alpha: Inter-rater reliability for ordinal data
- ICC(2,1): Intraclass Correlation Coefficient with 95% CI
- Mean ratings per condition: Compare experimental conditions
- Kruskal-Wallis test: Statistical significance between conditions
Output is saved to results/analysis_results.json.
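The statistics above are computed by `analyze_ratings.py`. As a rough sketch of the condition-comparison step (not the actual script), a Kruskal-Wallis test over per-condition scores might look like this, assuming the exported ratings have been joined back to their hidden condition labels:

```python
# Sketch: compare one dimension (e.g. originality) across conditions.
import numpy as np
from scipy import stats

def compare_conditions(ratings_by_condition: dict[str, list[int]]) -> None:
    """ratings_by_condition maps condition name -> scores (1-5)."""
    groups = [np.asarray(v) for v in ratings_by_condition.values()]
    h_stat, p_value = stats.kruskal(*groups)  # non-parametric, suits ordinal data
    for name, scores in ratings_by_condition.items():
        print(f"{name:20s} mean={np.mean(scores):.2f} n={len(scores)}")
    print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_value:.4f}")
```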
Database Schema
SQLite database (results/ratings.db):
-- Raters
CREATE TABLE raters (
rater_id TEXT PRIMARY KEY,
name TEXT,
created_at TIMESTAMP
);
-- Ratings
CREATE TABLE ratings (
id INTEGER PRIMARY KEY,
rater_id TEXT,
idea_id TEXT,
query_id TEXT,
originality INTEGER CHECK(originality BETWEEN 1 AND 5),
elaboration INTEGER CHECK(elaboration BETWEEN 1 AND 5),
coherence INTEGER CHECK(coherence BETWEEN 1 AND 5),
usefulness INTEGER CHECK(usefulness BETWEEN 1 AND 5),
skipped INTEGER DEFAULT 0,
timestamp TIMESTAMP,
UNIQUE(rater_id, idea_id)
);
-- Progress tracking
CREATE TABLE progress (
rater_id TEXT,
query_id TEXT,
completed_count INTEGER,
total_count INTEGER,
PRIMARY KEY (rater_id, query_id)
);
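The actual write path lives in `backend/database.py`; the sketch below only illustrates how an insert could respect the `UNIQUE(rater_id, idea_id)` constraint so that re-submitting overwrites the earlier rating (assumes SQLite 3.24+ upsert support).

```python
# Illustrative upsert against the ratings table shown above.
import sqlite3

def save_rating(db_path: str, rating: dict) -> None:
    with sqlite3.connect(db_path) as conn:  # commits on success
        conn.execute(
            """
            INSERT INTO ratings
                (rater_id, idea_id, query_id, originality, elaboration,
                 coherence, usefulness, skipped, timestamp)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
            ON CONFLICT(rater_id, idea_id) DO UPDATE SET
                originality = excluded.originality,
                elaboration = excluded.elaboration,
                coherence   = excluded.coherence,
                usefulness  = excluded.usefulness,
                skipped     = excluded.skipped,
                timestamp   = CURRENT_TIMESTAMP
            """,
            (
                rating["rater_id"], rating["idea_id"], rating["query_id"],
                rating["originality"], rating["elaboration"],
                rating["coherence"], rating["usefulness"],
                int(rating.get("skipped", False)),
            ),
        )
```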
Blind Assessment Design
To ensure unbiased evaluation:
- Randomization: Ideas are shuffled within each query using a fixed seed (42) for reproducibility
- Hidden metadata: Condition, expert name, and keywords are stored but not shown to raters
- Consistent ordering: All raters see the same randomized order
- Context provided: Only the query text is shown (e.g., "Chair", "Bicycle")
Workflow for Raters
- Login: Enter a unique rater ID
- Instructions: Read dimension definitions (shown before first rating)
- Rate ideas: For each idea:
  - Read the idea text
  - Rate all 4 dimensions (1-5)
  - Click "Submit & Next" or "Skip"
- Progress: Track completion per query and overall
- Completion: Summary shown when all ideas are rated
Troubleshooting
Backend won't start
# Check if port 8002 is in use
lsof -i :8002
# Check backend logs
cat /tmp/assessment_backend.log
Frontend won't start
# Reinstall dependencies
cd frontend
rm -rf node_modules
npm install
Reset database
rm results/ratings.db
# Database is auto-created on next backend start
Regenerate assessment data
rm data/assessment_items.json
python3 prepare_data.py
Tech Stack
- Backend: Python 3.11+, FastAPI, SQLite, Pydantic
- Frontend: React 19, TypeScript, Vite, Ant Design 6.0
- Analysis: NumPy, SciPy (for statistical tests)