Human Assessment Web Interface
A standalone web application for human assessment of generated ideas using Torrance-inspired creativity metrics.
Overview
This tool enables blind evaluation of creative ideas generated by the novelty-seeking experiment. Raters assess ideas on four dimensions without knowing which experimental condition produced each idea, ensuring unbiased evaluation.
Quick Start
cd experiments/assessment
# 1. Prepare assessment data (if not already done)
python3 prepare_data.py
# 2. Start the system
./start.sh
# 3. Open browser
open http://localhost:5174
Directory Structure
assessment/
├── backend/
│ ├── app.py # FastAPI backend API
│ ├── database.py # SQLite database operations
│ ├── models.py # Pydantic models & dimension definitions
│ └── requirements.txt # Python dependencies
├── frontend/
│ ├── src/
│ │ ├── components/ # React UI components
│ │ ├── hooks/ # React state management
│ │ ├── services/ # API client
│ │ └── types/ # TypeScript definitions
│ └── package.json
├── data/
│ └── assessment_items.json # Prepared ideas for rating
├── results/
│ └── ratings.db # SQLite database with ratings
├── prepare_data.py # Data preparation script
├── analyze_ratings.py # Inter-rater reliability analysis
├── start.sh # Start both servers
├── stop.sh # Stop all services
└── README.md # This file
Data Preparation
List Available Experiment Files
python3 prepare_data.py --list
Output:
Available experiment files (most recent first):
experiment_20260119_165650_deduped.json (1571.3 KB)
experiment_20260119_163040_deduped.json (156.4 KB)
Prepare Assessment Data
# Use all ideas (not recommended for human assessment)
python3 prepare_data.py
# RECOMMENDED: Stratified sampling - 4 ideas per condition per query
# Results in ~200 ideas (5 conditions × 4 ideas × 10 queries)
python3 prepare_data.py --per-condition 4
# Alternative: Sample 150 ideas total (proportionally across queries)
python3 prepare_data.py --sample 150
# Limit per query (20 ideas max per query)
python3 prepare_data.py --per-query 20
# Combined: 4 per condition, max 15 per query
python3 prepare_data.py --per-condition 4 --per-query 15
# Specify a different experiment file
python3 prepare_data.py experiment_20260119_163040_deduped.json --per-condition 4
Sampling Options
| Option | Description | Example |
|---|---|---|
| `--per-condition N` | Max N ideas per condition per query (stratified) | `--per-condition 4` → ~200 ideas |
| `--per-query N` | Max N ideas per query | `--per-query 20` |
| `--sample N` | Total N ideas (proportionally distributed) | `--sample 150` |
| `--seed N` | Random seed for reproducibility | `--seed 42` (default) |
Recommendation: Use `--per-condition 4` for balanced assessment across conditions.
The script:
- Loads the deduped experiment results
- Extracts all unique ideas with hidden metadata (condition, expert, keyword)
- Assigns stable IDs to each idea
- Shuffles ideas within each query (reproducible with seed=42)
- Outputs `data/assessment_items.json` (see the sketch below)
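The actual logic lives in `prepare_data.py`; the following is a minimal sketch of the stratified sampling and seeded shuffling, assuming hypothetical record fields (`query_id`, `condition`, `text`) rather than the real experiment schema.

```python
# Sketch only: field names are assumptions, not the actual experiment schema.
import json
import random
from collections import defaultdict

SEED = 42
PER_CONDITION = 4

def prepare(input_path: str, output_path: str) -> None:
    with open(input_path) as f:
        records = json.load(f)

    # Group ideas by query, then by condition, for stratified sampling.
    by_query = defaultdict(lambda: defaultdict(list))
    for rec in records:
        by_query[rec["query_id"]][rec["condition"]].append(rec)

    rng = random.Random(SEED)  # fixed seed => reproducible shuffling
    items = []
    for query_id, conditions in by_query.items():
        sampled = []
        for condition, ideas in conditions.items():
            sampled.extend(rng.sample(ideas, min(PER_CONDITION, len(ideas))))
        rng.shuffle(sampled)  # blind raters to condition ordering
        for i, idea in enumerate(sampled):
            items.append({
                "id": f"{query_id}-{i:03d}",   # stable ID
                "query_id": query_id,
                "text": idea["text"],
                # Hidden metadata: stored for analysis, never shown to raters.
                "condition": idea["condition"],
            })

    with open(output_path, "w") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
```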
Assessment Dimensions
Raters evaluate each idea on four dimensions using a 1-5 Likert scale:
Originality
How unexpected or surprising is this idea?
| Score | Description |
|---|---|
| 1 | Very common/obvious idea anyone would suggest |
| 2 | Somewhat common, slight variation on expected ideas |
| 3 | Moderately original, some unexpected elements |
| 4 | Quite original, notably different approach |
| 5 | Highly unexpected, truly novel concept |
Elaboration
How detailed and well-developed is this idea?
| Score | Description |
|---|---|
| 1 | Vague, minimal detail, just a concept |
| 2 | Basic idea with little specificity |
| 3 | Moderately detailed, some specifics provided |
| 4 | Well-developed with clear implementation hints |
| 5 | Highly specific, thoroughly developed concept |
Coherence
Does this idea make logical sense and relate to the query object?
| Score | Description |
|---|---|
| 1 | Nonsensical, irrelevant, or incomprehensible |
| 2 | Mostly unclear, weak connection to query |
| 3 | Partially coherent, some logical gaps |
| 4 | Mostly coherent with minor issues |
| 5 | Fully coherent, clearly relates to query |
Usefulness
Could this idea have practical value or inspire real innovation?
| Score | Description |
|---|---|
| 1 | No practical value whatsoever |
| 2 | Minimal usefulness, highly impractical |
| 3 | Some potential value with major limitations |
| 4 | Useful idea with realistic applications |
| 5 | Highly useful, clear practical value |
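The four dimensions and their 1-5 bounds map naturally onto a validated request model. The sketch below is illustrative only and is not the actual `backend/models.py`; field names follow the database schema shown later in this README.

```python
# Illustrative rating model; the real models.py may differ.
from pydantic import BaseModel, Field

class RatingSubmission(BaseModel):
    rater_id: str
    idea_id: str
    query_id: str
    originality: int = Field(..., ge=1, le=5)
    elaboration: int = Field(..., ge=1, le=5)
    coherence: int = Field(..., ge=1, le=5)
    usefulness: int = Field(..., ge=1, le=5)
    skipped: bool = False
```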
Running the System
Start
./start.sh
This will:
- Check for `data/assessment_items.json` (runs `prepare_data.py` if missing)
- Install frontend dependencies if needed
- Start backend API on port 8002
- Start frontend dev server on port 5174
Stop
./stop.sh
Or press Ctrl+C in the terminal running start.sh.
Manual Start (Development)
# Terminal 1: Backend
cd backend
../../../backend/venv/bin/uvicorn app:app --host 0.0.0.0 --port 8002 --reload
# Terminal 2: Frontend
cd frontend
npm run dev
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/health` | GET | Health check |
| `/api/info` | GET | Experiment info (total ideas, queries, conditions) |
| `/api/dimensions` | GET | Dimension definitions for UI |
| `/api/raters` | GET | List all raters |
| `/api/raters` | POST | Register/login rater |
| `/api/queries` | GET | List all queries |
| `/api/queries/{id}` | GET | Get query with all ideas |
| `/api/queries/{id}/unrated?rater_id=X` | GET | Get unrated ideas for rater |
| `/api/ratings` | POST | Submit a rating |
| `/api/progress/{rater_id}` | GET | Get rater's progress |
| `/api/statistics` | GET | Overall statistics |
| `/api/export` | GET | Export all ratings with metadata |
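As a quick illustration of how a script could drive the API on port 8002, the snippet below submits one rating and fetches progress. The JSON body fields mirror the database schema and the model sketch above; they are assumptions, not the verified request schema.

```python
# Example client calls against the local backend (hypothetical IDs).
import requests

BASE = "http://localhost:8002"

resp = requests.post(f"{BASE}/api/ratings", json={
    "rater_id": "rater-01",
    "idea_id": "Chair-007",   # hypothetical idea ID
    "query_id": "Chair",
    "originality": 4,
    "elaboration": 3,
    "coherence": 5,
    "usefulness": 4,
    "skipped": False,
})
resp.raise_for_status()

progress = requests.get(f"{BASE}/api/progress/rater-01").json()
print(progress)
```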
Analysis
After collecting ratings from multiple raters:
python3 analyze_ratings.py
This calculates:
- Krippendorff's alpha: Inter-rater reliability for ordinal data
- ICC(2,1): Intraclass Correlation Coefficient with 95% CI
- Mean ratings per condition: Compare experimental conditions
- Kruskal-Wallis test: Statistical significance between conditions
Output is saved to results/analysis_results.json.
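The statistics above are computed by `analyze_ratings.py`. As a rough sketch of the condition-comparison step (not the actual script), a Kruskal-Wallis test over per-condition scores might look like this, assuming the exported ratings have been joined back to their hidden condition labels:

```python
# Sketch: compare one dimension (e.g. originality) across conditions.
import numpy as np
from scipy import stats

def compare_conditions(ratings_by_condition: dict[str, list[int]]) -> None:
    """ratings_by_condition maps condition name -> scores (1-5)."""
    groups = [np.asarray(v) for v in ratings_by_condition.values()]
    h_stat, p_value = stats.kruskal(*groups)  # non-parametric, suits ordinal data
    for name, scores in ratings_by_condition.items():
        print(f"{name:20s} mean={np.mean(scores):.2f} n={len(scores)}")
    print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_value:.4f}")
```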
Database Schema
SQLite database (results/ratings.db):
-- Raters
CREATE TABLE raters (
rater_id TEXT PRIMARY KEY,
name TEXT,
created_at TIMESTAMP
);
-- Ratings
CREATE TABLE ratings (
id INTEGER PRIMARY KEY,
rater_id TEXT,
idea_id TEXT,
query_id TEXT,
originality INTEGER CHECK(originality BETWEEN 1 AND 5),
elaboration INTEGER CHECK(elaboration BETWEEN 1 AND 5),
coherence INTEGER CHECK(coherence BETWEEN 1 AND 5),
usefulness INTEGER CHECK(usefulness BETWEEN 1 AND 5),
skipped INTEGER DEFAULT 0,
timestamp TIMESTAMP,
UNIQUE(rater_id, idea_id)
);
-- Progress tracking
CREATE TABLE progress (
rater_id TEXT,
query_id TEXT,
completed_count INTEGER,
total_count INTEGER,
PRIMARY KEY (rater_id, query_id)
);
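The actual write path lives in `backend/database.py`; the sketch below only illustrates how an insert could respect the `UNIQUE(rater_id, idea_id)` constraint so that re-submitting overwrites the earlier rating (assumes SQLite 3.24+ upsert support).

```python
# Illustrative upsert against the ratings table shown above.
import sqlite3

def save_rating(db_path: str, rating: dict) -> None:
    with sqlite3.connect(db_path) as conn:  # commits on success
        conn.execute(
            """
            INSERT INTO ratings
                (rater_id, idea_id, query_id, originality, elaboration,
                 coherence, usefulness, skipped, timestamp)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
            ON CONFLICT(rater_id, idea_id) DO UPDATE SET
                originality = excluded.originality,
                elaboration = excluded.elaboration,
                coherence   = excluded.coherence,
                usefulness  = excluded.usefulness,
                skipped     = excluded.skipped,
                timestamp   = CURRENT_TIMESTAMP
            """,
            (
                rating["rater_id"], rating["idea_id"], rating["query_id"],
                rating["originality"], rating["elaboration"],
                rating["coherence"], rating["usefulness"],
                int(rating.get("skipped", False)),
            ),
        )
```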
Blind Assessment Design
To ensure unbiased evaluation:
- Randomization: Ideas are shuffled within each query using a fixed seed (42) for reproducibility
- Hidden metadata: Condition, expert name, and keywords are stored but not shown to raters
- Consistent ordering: All raters see the same randomized order
- Context provided: Only the query text is shown (e.g., "Chair", "Bicycle")
Workflow for Raters
- Login: Enter a unique rater ID
- Instructions: Read dimension definitions (shown before first rating)
- Rate ideas: For each idea:
  - Read the idea text
  - Rate all 4 dimensions (1-5)
  - Click "Submit & Next" or "Skip"
- Progress: Track completion per query and overall
- Completion: Summary shown when all ideas are rated
Troubleshooting
Backend won't start
# Check if port 8002 is in use
lsof -i :8002
# Check backend logs
cat /tmp/assessment_backend.log
Frontend won't start
# Reinstall dependencies
cd frontend
rm -rf node_modules
npm install
Reset database
rm results/ratings.db
# Database is auto-created on next backend start
Regenerate assessment data
rm data/assessment_items.json
python3 prepare_data.py
Tech Stack
- Backend: Python 3.11+, FastAPI, SQLite, Pydantic
- Frontend: React 19, TypeScript, Vite, Ant Design 6.0
- Analysis: NumPy, SciPy (for statistical tests)