Human Assessment Web Interface

A standalone web application for human assessment of generated ideas using Torrance-inspired creativity metrics.

Overview

This tool enables blind evaluation of creative ideas generated by the novelty-seeking experiment. Raters assess ideas on four dimensions without knowing which experimental condition produced each idea, ensuring unbiased evaluation.

Quick Start

cd experiments/assessment

# 1. Prepare assessment data (if not already done)
python3 prepare_data.py

# 2. Start the system
./start.sh

# 3. Open browser
open http://localhost:5174

Directory Structure

assessment/
├── backend/
│   ├── app.py           # FastAPI backend API
│   ├── database.py      # SQLite database operations
│   ├── models.py        # Pydantic models & dimension definitions
│   └── requirements.txt # Python dependencies
├── frontend/
│   ├── src/
│   │   ├── components/  # React UI components
│   │   ├── hooks/       # React state management
│   │   ├── services/    # API client
│   │   └── types/       # TypeScript definitions
│   └── package.json
├── data/
│   └── assessment_items.json  # Prepared ideas for rating
├── results/
│   └── ratings.db             # SQLite database with ratings
├── prepare_data.py      # Data preparation script
├── analyze_ratings.py   # Inter-rater reliability analysis
├── start.sh             # Start both servers
├── stop.sh              # Stop all services
└── README.md            # This file

Data Preparation

List Available Experiment Files

python3 prepare_data.py --list

Output:

Available experiment files (most recent first):
  experiment_20260119_165650_deduped.json (1571.3 KB)
  experiment_20260119_163040_deduped.json (156.4 KB)

Prepare Assessment Data

# Use all ideas (not recommended for human assessment)
python3 prepare_data.py

# RECOMMENDED: Stratified sampling - 4 ideas per condition per query
# Results in ~200 ideas (5 conditions × 4 ideas × 10 queries)
python3 prepare_data.py --per-condition 4

# Alternative: Sample 150 ideas total (proportionally across queries)
python3 prepare_data.py --sample 150

# Limit per query (20 ideas max per query)
python3 prepare_data.py --per-query 20

# Combined: 4 per condition, max 15 per query
python3 prepare_data.py --per-condition 4 --per-query 15

# Specify a different experiment file
python3 prepare_data.py experiment_20260119_163040_deduped.json --per-condition 4

Sampling Options

Option              Description                                        Example
--per-condition N   Max N ideas per condition per query (stratified)   --per-condition 4 → ~200 ideas
--per-query N       Max N ideas per query                              --per-query 20
--sample N          Total N ideas (proportionally distributed)         --sample 150
--seed N            Random seed for reproducibility                    --seed 42 (default)

Recommendation: Use --per-condition 4 for balanced assessment across conditions.

The script (a minimal sketch follows this list):

  1. Loads the deduped experiment results
  2. Extracts all unique ideas with hidden metadata (condition, expert, keyword)
  3. Assigns stable IDs to each idea
  4. Shuffles ideas within each query (reproducible with seed=42)
  5. Outputs data/assessment_items.json
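
The stratified --per-condition path boils down to roughly the following. This is a minimal sketch, not the actual prepare_data.py; the assumed input shape ({query_id: [idea, ...]}) and the field names text, condition, expert, and keyword are inferred from the metadata described above.

import json
import random
from collections import defaultdict

SEED = 42  # fixed seed: every rater sees the same shuffled order


def prepare_items(experiment_path, per_condition=4):
    """Stratified sampling sketch: keep at most `per_condition` ideas per
    condition per query, then shuffle within each query."""
    with open(experiment_path, encoding="utf-8") as f:
        results = json.load(f)  # assumed shape: {query_id: [idea, ...]}

    rng = random.Random(SEED)
    items = []
    for query_id, ideas in results.items():
        by_condition = defaultdict(list)
        for idea in ideas:
            by_condition[idea["condition"]].append(idea)

        sampled = []
        for _, group in sorted(by_condition.items()):
            rng.shuffle(group)
            sampled.extend(group[:per_condition])

        # Stable IDs first, then the blind shuffle that raters will see.
        for i, idea in enumerate(sampled):
            idea["idea_id"] = f"{query_id}-{i:03d}"
        rng.shuffle(sampled)

        for idea in sampled:
            items.append({
                "idea_id": idea["idea_id"],
                "query_id": query_id,
                "text": idea.get("text", ""),
                # Hidden metadata: stored for analysis, never shown to raters.
                "condition": idea["condition"],
                "expert": idea.get("expert"),
                "keyword": idea.get("keyword"),
            })
    return items


if __name__ == "__main__":
    with open("data/assessment_items.json", "w", encoding="utf-8") as f:
        json.dump(prepare_items("experiment_20260119_165650_deduped.json"),
                  f, ensure_ascii=False, indent=2)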

Assessment Dimensions

Raters evaluate each idea on four dimensions using a 1-5 Likert scale:

Originality

How unexpected or surprising is this idea?

Score  Description
1      Very common/obvious idea anyone would suggest
2      Somewhat common, slight variation on expected ideas
3      Moderately original, some unexpected elements
4      Quite original, notably different approach
5      Highly unexpected, truly novel concept

Elaboration

How detailed and well-developed is this idea?

Score  Description
1      Vague, minimal detail, just a concept
2      Basic idea with little specificity
3      Moderately detailed, some specifics provided
4      Well-developed with clear implementation hints
5      Highly specific, thoroughly developed concept

Coherence

Does this idea make logical sense and relate to the query object?

Score  Description
1      Nonsensical, irrelevant, or incomprehensible
2      Mostly unclear, weak connection to query
3      Partially coherent, some logical gaps
4      Mostly coherent with minor issues
5      Fully coherent, clearly relates to query

Usefulness

Could this idea have practical value or inspire real innovation?

Score  Description
1      No practical value whatsoever
2      Minimal usefulness, highly impractical
3      Some potential value with major limitations
4      Useful idea with realistic applications
5      Highly useful, clear practical value
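
The four dimensions map directly onto the rating payload the backend validates. A minimal Pydantic sketch, assuming field names that mirror the ratings table in the Database Schema section; the real definitions live in backend/models.py and may differ:

from datetime import datetime
from pydantic import BaseModel, Field


class RatingSubmission(BaseModel):
    """One rater's scores for one idea; every dimension is a 1-5 Likert value."""
    rater_id: str
    idea_id: str
    query_id: str
    originality: int = Field(ge=1, le=5)
    elaboration: int = Field(ge=1, le=5)
    coherence: int = Field(ge=1, le=5)
    usefulness: int = Field(ge=1, le=5)
    skipped: bool = False
    timestamp: datetime | None = None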

Running the System

Start

./start.sh

This will:

  1. Check for data/assessment_items.json (runs prepare_data.py if missing)
  2. Install frontend dependencies if needed
  3. Start backend API on port 8002
  4. Start frontend dev server on port 5174
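
Once both steps finish, a quick way to confirm the backend is up is to hit the documented health endpoint on port 8002 (standard-library sketch; the exact response payload is not documented here):

import json
from urllib.request import urlopen

# Backend health endpoint (port 8002, see step 3 above).
with urlopen("http://localhost:8002/api/health") as resp:
    print(resp.status, json.load(resp))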

Stop

./stop.sh

Or press Ctrl+C in the terminal running start.sh.

Manual Start (Development)

# Terminal 1: Backend
cd backend
../../../backend/venv/bin/uvicorn app:app --host 0.0.0.0 --port 8002 --reload

# Terminal 2: Frontend
cd frontend
npm run dev

API Endpoints

Endpoint                               Method  Description
/api/health                            GET     Health check
/api/info                              GET     Experiment info (total ideas, queries, conditions)
/api/dimensions                        GET     Dimension definitions for UI
/api/raters                            GET     List all raters
/api/raters                            POST    Register/login rater
/api/queries                           GET     List all queries
/api/queries/{id}                      GET     Get query with all ideas
/api/queries/{id}/unrated?rater_id=X   GET     Get unrated ideas for rater
/api/ratings                           POST    Submit a rating
/api/progress/{rater_id}               GET     Get rater's progress
/api/statistics                        GET     Overall statistics
/api/export                            GET     Export all ratings with metadata
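
A minimal rater-side client against these endpoints, using only the standard library. The request payload fields are assumptions that mirror the ratings table below; the actual schemas are defined in backend/models.py:

import json
from urllib.request import Request, urlopen

BASE = "http://localhost:8002"


def post(path, payload):
    # POST a JSON body and return the decoded JSON response.
    req = Request(BASE + path,
                  data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"},
                  method="POST")
    with urlopen(req) as resp:
        return json.load(resp)


def get(path):
    with urlopen(BASE + path) as resp:
        return json.load(resp)


# Register (or log back in as) a rater -- field names are assumed.
post("/api/raters", {"rater_id": "rater_01", "name": "Rater One"})

# Fetch ideas this rater has not yet scored for a query ("q1" is a placeholder ID).
unrated = get("/api/queries/q1/unrated?rater_id=rater_01")

# Submit one rating; dimension fields mirror the database schema below.
post("/api/ratings", {
    "rater_id": "rater_01",
    "idea_id": "idea-000",   # placeholder idea ID
    "query_id": "q1",
    "originality": 4,
    "elaboration": 3,
    "coherence": 5,
    "usefulness": 3,
})

# Check this rater's progress.
print(get("/api/progress/rater_01"))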

Analysis

After collecting ratings from multiple raters:

python3 analyze_ratings.py

This calculates:

  • Krippendorff's alpha: Inter-rater reliability for ordinal data
  • ICC(2,1): Intraclass Correlation Coefficient with 95% CI
  • Mean ratings per condition: Compare experimental conditions
  • Kruskal-Wallis test: Tests for significant differences between conditions

Output is saved to results/analysis_results.json.
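
A sketch of two of these computations on toy numbers, to show the shapes involved. SciPy is listed in the tech stack; the krippendorff package is an assumption about how the alpha is computed, and ICC(2,1) is omitted here:

import numpy as np
from scipy.stats import kruskal
import krippendorff  # assumption: not listed in the tech stack


def interrater_alpha(matrix):
    # Krippendorff's alpha for ordinal ratings.
    # `matrix` is raters x ideas, with np.nan where a rater skipped an idea.
    return krippendorff.alpha(reliability_data=matrix,
                              level_of_measurement="ordinal")


def compare_conditions(scores_by_condition):
    # Kruskal-Wallis H test across experimental conditions.
    groups = [np.asarray(v, dtype=float) for v in scores_by_condition.values()]
    h, p = kruskal(*groups)
    return {"H": h, "p": p}


# Toy example with made-up numbers; condition names are placeholders.
matrix = np.array([
    [4, 3, np.nan, 5],   # rater 1
    [4, 2, 3,      5],   # rater 2
    [5, 3, 3,      4],   # rater 3
])
print(interrater_alpha(matrix))
print(compare_conditions({"condition_a": [2.5, 3.0, 2.8],
                          "condition_b": [3.6, 4.0, 3.4]}))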

Database Schema

SQLite database (results/ratings.db):

-- Raters
CREATE TABLE raters (
    rater_id TEXT PRIMARY KEY,
    name TEXT,
    created_at TIMESTAMP
);

-- Ratings
CREATE TABLE ratings (
    id INTEGER PRIMARY KEY,
    rater_id TEXT,
    idea_id TEXT,
    query_id TEXT,
    originality INTEGER CHECK(originality BETWEEN 1 AND 5),
    elaboration INTEGER CHECK(elaboration BETWEEN 1 AND 5),
    coherence INTEGER CHECK(coherence BETWEEN 1 AND 5),
    usefulness INTEGER CHECK(usefulness BETWEEN 1 AND 5),
    skipped INTEGER DEFAULT 0,
    timestamp TIMESTAMP,
    UNIQUE(rater_id, idea_id)
);

-- Progress tracking
CREATE TABLE progress (
    rater_id TEXT,
    query_id TEXT,
    completed_count INTEGER,
    total_count INTEGER,
    PRIMARY KEY (rater_id, query_id)
);
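
For quick ad-hoc checks, the ratings table can also be queried directly. A read-only sketch that computes per-dimension means over non-skipped ratings:

import sqlite3

# Open the ratings database produced by the backend.
conn = sqlite3.connect("results/ratings.db")
row = conn.execute(
    """
    SELECT COUNT(*)         AS n_ratings,
           AVG(originality) AS mean_originality,
           AVG(elaboration) AS mean_elaboration,
           AVG(coherence)   AS mean_coherence,
           AVG(usefulness)  AS mean_usefulness
    FROM ratings
    WHERE skipped = 0
    """
).fetchone()
print(dict(zip(["n_ratings", "mean_originality", "mean_elaboration",
                "mean_coherence", "mean_usefulness"], row)))
conn.close()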

Blind Assessment Design

To ensure unbiased evaluation:

  1. Randomization: Ideas are shuffled within each query using a fixed seed (42) for reproducibility
  2. Hidden metadata: Condition, expert name, and keywords are stored but not shown to raters (see the sketch after this list)
  3. Consistent ordering: All raters see the same randomized order
  4. Context provided: Only the query text is shown (e.g., "Chair", "Bicycle")
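
One way to keep that metadata blind (point 2 above) is to expose a response model that simply omits the hidden fields. A hedged sketch; backend/models.py may implement this differently:

from pydantic import BaseModel


class IdeaInternal(BaseModel):
    # Full record as stored in data/assessment_items.json.
    idea_id: str
    query_id: str
    text: str
    condition: str               # hidden from raters
    expert: str | None = None    # hidden from raters
    keyword: str | None = None   # hidden from raters


class IdeaPublic(BaseModel):
    # What the frontend receives: no condition, expert, or keyword.
    idea_id: str
    query_id: str
    text: str


def to_public(idea: IdeaInternal) -> IdeaPublic:
    # Only the whitelisted fields survive the conversion (Pydantic v2 API).
    return IdeaPublic(**idea.model_dump(include={"idea_id", "query_id", "text"}))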

Workflow for Raters

  1. Login: Enter a unique rater ID
  2. Instructions: Read dimension definitions (shown before first rating)
  3. Rate ideas: For each idea:
    • Read the idea text
    • Rate all 4 dimensions (1-5)
    • Click "Submit & Next" or "Skip"
  4. Progress: Track completion per query and overall
  5. Completion: Summary shown when all ideas are rated

Troubleshooting

Backend won't start

# Check if port 8002 is in use
lsof -i :8002

# Check backend logs
cat /tmp/assessment_backend.log

Frontend won't start

# Reinstall dependencies
cd frontend
rm -rf node_modules
npm install

Reset database

rm results/ratings.db
# Database is auto-created on next backend start

Regenerate assessment data

rm data/assessment_items.json
python3 prepare_data.py

Tech Stack

  • Backend: Python 3.11+, FastAPI, SQLite, Pydantic
  • Frontend: React 19, TypeScript, Vite, Ant Design 6.0
  • Analysis: NumPy, SciPy (for statistical tests)