gbanyan 52612e14ba Add hybrid signature extraction with name-based verification
Implement a hybrid VLM name extraction + CV detection approach that replaces
the unreliable VLM coordinate method with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code
2025-10-26 23:39:52 +08:00

# Session Initialization - PDF Signature Extraction Project
**Purpose:** This document helps you (or another Claude instance) quickly understand the project state and continue working.
---
## Project Quick Summary
**Goal:** Extract handwritten Chinese signatures from 86,073 PDF documents automatically.
**Current Status:** ✅ Working solution with 70% recall, 100% precision (tested on 5 PDFs)
**Approach:** Hybrid VLM name extraction + Computer Vision detection + VLM verification
---
## 🚀 Quick Start (Resume Work)
### If you want to continue testing:
```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
# Test with more files (edit line 425 in script)
python extract_signatures_hybrid.py
```
### If you want to review what was done:
```bash
# Read the complete history
less PROJECT_DOCUMENTATION.md
# Check test results
ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
```
### If you want to commit to git:
```bash
# Follow the guide
less COMMIT_SUMMARY.md
```
---
## 📁 Key Files (What Each Does)
### Production Scripts ✅
- **extract_pages_from_csv.py** - Step 1: Extract pages from CSV (tested: 100 files)
- **extract_signatures_hybrid.py** - Step 2: Extract signatures (CURRENT WORKING, tested: 5 files)
- **extract_handwriting.py** - CV-only approach (component used in hybrid)
### Documentation 📚
- **PROJECT_DOCUMENTATION.md** - ⭐ READ THIS FIRST - Complete history of all 5 approaches tested
- **README.md** - Quick start guide
- **COMMIT_SUMMARY.md** - Git commit instructions
- **SESSION_INIT.md** - This file (for session continuity)
### Configuration ⚙️
- **.gitignore** - Excludes diagnostic scripts and test outputs
---
## 🎯 Current Working Solution
### Architecture
```
1. VLM extracts signature names: "周寶蓮", "魏興海"
2. CV detects signature-like regions (5K-200K pixels)
3. VLM verifies each region against expected names
4. Save verified signatures: signature_周寶蓮.png
```
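A minimal Python sketch of that flow, assuming a PIL page image and treating the three steps as injected callables. The function name `run_hybrid_pipeline` and the callable signatures are illustrative, not the actual API of `extract_signatures_hybrid.py`:
```python
# Sketch only: function and parameter names are illustrative, not the actual
# API of extract_signatures_hybrid.py.
from pathlib import Path
from typing import Callable

def run_hybrid_pipeline(
    page_image,                              # PIL.Image of the rendered PDF page
    pdf_stem: str,                           # PDF filename without extension
    output_dir: Path,
    extract_names: Callable,                 # Step 1: VLM -> list of signer names
    detect_regions: Callable,                # Step 2: CV  -> iterable of (x, y, w, h)
    verify_region: Callable,                 # Step 3: VLM -> matched name or None
) -> dict:
    """Run steps 1-4 and return {person name: saved PNG path}."""
    expected_names = extract_names(page_image)
    saved = {}
    rejected_dir = output_dir / "rejected"
    rejected_dir.mkdir(parents=True, exist_ok=True)

    for x, y, w, h in detect_regions(page_image):
        crop = page_image.crop((x, y, x + w, y + h))
        name = verify_region(crop, expected_names)
        if name and name not in saved:       # Step 4: keep one signature per person
            path = output_dir / f"{pdf_stem}_signature_{name}.png"
            crop.save(path)
            saved[name] = path
        else:                                # no match or duplicate -> rejected folder
            crop.save(rejected_dir / f"{pdf_stem}_{x}_{y}.png")
    return saved
```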
### Test Results (5 PDFs)
| Metric | Value |
|--------|-------|
| Expected signatures | 10 |
| Found signatures | 7 |
| Recall | 70% |
| Precision | 100% |
| False positives | 0 |
### Why Are 30% Missed?
- Computer vision parameters are too conservative
- Some signatures fall outside the 5K-200K pixel area range
- The aspect ratio filter (0.5-10) may exclude some signatures
---
## ⚠️ Critical Context (What You MUST Know)
### 1. VLM Coordinate System is UNRELIABLE ❌
**Discovery:** VLM (qwen2.5vl:32b) provides inaccurate coordinates.
**Example:**
- VLM said signatures at: top=58%, top=68%
- Actual location: top=26%
- Error: ~32% offset (NOT consistent across files!)
**Test file:** `201301_2458_AI1_page4.pdf`
- VLM correctly identifies 2 signatures: "周寶蓮", "魏興海"
- Cropping at the VLM-reported coordinates yields 100% white/blank regions
- This is why the coordinate-based approach was abandoned
**Evidence:** See diagnostic scripts and results in PROJECT_DOCUMENTATION.md
### 2. Name-Based Approach is the Solution ✅
Instead of using VLM coordinates, split the work by what each tool does well (a name-extraction call is sketched after this list):
- ✅ Use VLM to extract **names** (reliable)
- ✅ Use CV to find **locations** (pixel-accurate)
- ✅ Use VLM to **verify** each region against names (accurate)
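For illustration, a minimal name-extraction call against the Ollama `/api/generate` endpoint. The prompt wording and response parsing are assumptions; the real prompts live in `extract_signatures_hybrid.py`:
```python
import base64
import requests

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"

def extract_signature_names(image_path: str) -> list[str]:
    """Ask the VLM only for the handwritten signer names, never for coordinates."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": OLLAMA_MODEL,
            "prompt": (
                "List the handwritten signature names on this page, "
                "one name per line. Output only the names."
            ),
            "images": [image_b64],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["response"]
    return [line.strip() for line in text.splitlines() if line.strip()]
```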
### 3. All Test PDFs Are Scanned Images
- No searchable text layer
- PDF text layer method (Method A) is **untested**
- All current results use CV detection (Method B)
---
## 🔧 Configuration Details
### Ollama Setup
```python
OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
```
**Verify connection:**
```bash
curl http://192.168.30.36:11434/api/tags
```
### File Paths
```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = ".../signatures/rejected"
```
### CV Detection Parameters (adjust to improve recall)
```python
# In extract_signatures_hybrid.py, detect_signature_regions_cv()
MIN_CONTOUR_AREA = 5000 # ⬇️ Lower = catch smaller signatures
MAX_CONTOUR_AREA = 200000 # ⬆️ Higher = catch larger signatures
ASPECT_RATIO_MIN = 0.5 # ⬇️ Lower = catch taller signatures
ASPECT_RATIO_MAX = 10.0 # ⬆️ Higher = catch wider signatures
```
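A minimal sketch of how these thresholds might be applied inside a `detect_signature_regions_cv()`-style function, assuming a standard OpenCV pipeline of Otsu thresholding, dilation, and contour filtering. The kernel size and the use of bounding-box area as the area measure are illustrative; the real implementation may differ:
```python
import cv2
import numpy as np

MIN_CONTOUR_AREA = 5000
MAX_CONTOUR_AREA = 200000
ASPECT_RATIO_MIN = 0.5
ASPECT_RATIO_MAX = 10.0

def detect_signature_regions_cv(page_bgr: np.ndarray) -> list[tuple[int, int, int, int]]:
    """Return candidate (x, y, w, h) boxes that look like handwritten strokes."""
    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)
    # Invert so ink becomes white, then binarize with Otsu.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Dilate horizontally so separate strokes of one signature merge into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 9))
    merged = cv2.dilate(binary, kernel, iterations=1)

    contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        area = w * h                       # bounding-box area as a simple proxy
        aspect = w / h if h else 0
        if (MIN_CONTOUR_AREA <= area <= MAX_CONTOUR_AREA
                and ASPECT_RATIO_MIN <= aspect <= ASPECT_RATIO_MAX):
            boxes.append((x, y, w, h))
    return boxes
```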
---
## 🎬 What Happened (Session History)
### Approaches Tested (Chronological)
1. **PDF Image Objects** → Abandoned (extracted full pages, not signatures)
2. **Simple Page Extraction** → ✅ Working (extract pages from CSV)
3. **Computer Vision Only** → Insufficient (6,420 candidate regions from 100 pages, far too many to review)
4. **VLM Coordinates** → ❌ Failed (coordinates unreliable, extracted blank regions)
5. **Hybrid Name-Based** → ✅ Current (70% recall, 100% precision)
### Key Decisions Made
✅ Use VLM for names, not coordinates
✅ Verify each region against expected names
✅ Save signatures with person names
✅ Reject regions that don't match any name
✅ Prevent duplicate signatures per person
### Diagnostic Work Done
Created 11 diagnostic scripts to investigate the VLM coordinate failure:
- Visualized bounding boxes
- Analyzed pixel content
- Tested actual vs. reported locations
- Confirmed coordinates were ~32% off on the test file
All findings are documented in PROJECT_DOCUMENTATION.md; a simplified pixel-content check is sketched below.
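A simplified version of that pixel-content check, assuming the page is a grayscale NumPy array and the VLM box is given as top/left/width/height percentages. The box format and the 245 white threshold are assumptions:
```python
import numpy as np

def blank_ratio(page_gray: np.ndarray, box_pct: dict, white_threshold: int = 245) -> float:
    """Fraction of near-white pixels inside a VLM-reported percentage box.

    Values close to 1.0 mean the reported region is blank paper,
    i.e. the coordinates do not point at a signature.
    """
    h, w = page_gray.shape[:2]
    top = int(h * box_pct["top"] / 100)
    left = int(w * box_pct["left"] / 100)
    bottom = top + int(h * box_pct["height"] / 100)
    right = left + int(w * box_pct["width"] / 100)
    crop = page_gray[top:bottom, left:right]
    if crop.size == 0:
        return 1.0
    return float((crop >= white_threshold).mean())

# On 201301_2458_AI1_page4 the VLM claimed a signature around top=58%, but the
# actual signatures sit near top=26%; the claimed box comes back ~100% white.
```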
---
## 🚧 Known Issues & Next Steps
### Issue 1: 30% Missing Signatures
**Status:** Open
**Options:**
1. Widen CV parameter ranges (test with different thresholds)
2. Multi-pass detection with different kernels
3. Ask the VLM for help when signatures are missing
4. Manual review of rejected folder
### Issue 2: Text Layer Method Untested
**Status:** Pending
**Need:** PDFs with searchable text to test Method A
### Issue 3: Performance (24 sec/PDF)
**Status:** Acceptable for now
**Future:** Optimize if processing full 86K dataset
---
## 📊 Test Data Reference
### Test Files Used (5 PDFs)
```
201301_1324_AI1_page3.pdf - ✅ Found 2/2: 楊智惠, 張志銘
201301_2061_AI1_page5.pdf - ⚠️ Found 1/2: 廖阿甚 (missing 林姿妤)
201301_2458_AI1_page4.pdf - ⚠️ Found 1/2: 周寶蓮 (missing 魏興海) ← VLM coordinate test file
201301_2923_AI1_page3.pdf - ⚠️ Found 1/2: 黄瑞展 (missing 陈丽琦)
201301_3189_AI1_page3.pdf - ✅ Found 2/2: 黄辉, 黄益辉
```
### Output Location
```
/Volumes/NV2/PDF-Processing/signature-image-output/signatures/
├── 201301_1324_AI1_page3_signature_張志銘.png
├── 201301_1324_AI1_page3_signature_楊智惠.png
├── 201301_2061_AI1_page5_signature_廖阿甚.png
├── 201301_2458_AI1_page4_signature_周寶蓮.png
├── 201301_2923_AI1_page3_signature_黄瑞展.png
├── 201301_3189_AI1_page3_signature_黄辉.png
├── 201301_3189_AI1_page3_signature_黄益辉.png
└── rejected/ (non-signature regions)
```
---
## 💡 How to Continue Work
### Option 1: Improve Recall (Find Missing Signatures)
**Goal:** Get from 70% to 90%+ recall
**Approach:**
1. Check the rejected folder to see whether the missing signatures were detected but failed name verification
2. Adjust CV parameters in `detect_signature_regions_cv()`:
```python
MIN_CONTOUR_AREA = 3000 # Lower threshold
MAX_CONTOUR_AREA = 300000 # Higher threshold
```
3. Test on same 5 PDFs and compare results
4. If recall improves without too many false positives, proceed
**Files to edit:**
- `extract_signatures_hybrid.py` lines 178-214
### Option 2: Scale Up Testing
**Goal:** Test on 100 PDFs to verify reliability
**Approach:**
1. Edit `extract_signatures_hybrid.py` line 425:
```python
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]
```
2. Run the script (roughly 40 minutes at ~24 seconds per PDF)
3. Analyze results in log file
4. Calculate overall recall/precision (a minimal scoring sketch follows this list)
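A minimal scoring sketch for step 4, assuming the output files follow the `<pdf_stem>_signature_<name>.png` convention shown above and that expected names are listed in a hand-built ground-truth dict. The dict below only covers three of the five current test PDFs and would need extending for a 100-file run:
```python
from pathlib import Path

OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"

# Ground truth: expected signer names per PDF stem (extend for a larger test run).
EXPECTED = {
    "201301_1324_AI1_page3": {"楊智惠", "張志銘"},
    "201301_2061_AI1_page5": {"廖阿甚", "林姿妤"},
    "201301_2458_AI1_page4": {"周寶蓮", "魏興海"},
}

def score(output_dir: str = OUTPUT_PATH) -> None:
    true_pos = false_pos = expected_total = 0
    for stem, names in EXPECTED.items():
        found = {
            p.stem.split("_signature_")[-1]
            for p in Path(output_dir).glob(f"{stem}_signature_*.png")
        }
        true_pos += len(found & names)
        false_pos += len(found - names)
        expected_total += len(names)
    recall = true_pos / expected_total if expected_total else 0.0
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    print(f"Recall: {recall:.0%}  Precision: {precision:.0%}")
```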
### Option 3: Prepare for Production
**Goal:** Process all 86,073 files
**Requirements:**
1. Confirm the current approach is acceptable (is 70% recall sufficient?)
2. Estimate time: 86K files × 24 sec/file = ~24 days
3. Consider parallel processing or optimization
4. Set up monitoring and resume capability (a parallel-processing sketch with a resume check follows this list)
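A minimal sketch covering requirements 3 and 4, assuming `process_pdf_page()`, `PDF_INPUT_PATH`, and `OUTPUT_PATH` are importable from `extract_signatures_hybrid.py` (as the debug command later in this document suggests) and using "output already exists" as a crude resume check. Note the heuristic misses PDFs whose only outputs were rejected crops:
```python
from multiprocessing import Pool
from pathlib import Path

# Assumption: these names are importable from the existing script.
from extract_signatures_hybrid import process_pdf_page, PDF_INPUT_PATH, OUTPUT_PATH

def already_done(pdf: Path) -> bool:
    """Resume heuristic: skip PDFs that already produced a signature PNG."""
    return any(Path(OUTPUT_PATH).glob(f"{pdf.stem}_signature_*.png"))

def worker(pdf: Path) -> str:
    try:
        process_pdf_page(str(pdf), OUTPUT_PATH)
        return f"OK {pdf.name}"
    except Exception as exc:            # keep the batch alive on single-file failures
        return f"FAIL {pdf.name}: {exc}"

if __name__ == "__main__":
    pdfs = [p for p in sorted(Path(PDF_INPUT_PATH).glob("*.pdf")) if not already_done(p)]
    # Keep the pool small: every worker sends VLM requests to the same Ollama server.
    with Pool(processes=4) as pool:
        for line in pool.imap_unordered(worker, pdfs):
            print(line, flush=True)
```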
### Option 4: Commit Current State
**Goal:** Save working solution to git
**Steps:**
1. Read `COMMIT_SUMMARY.md`
2. Review files to commit
3. Run verification checks
4. Execute git commands
5. Tag release: `v1.0-hybrid-70percent`
---
## 🔍 How to Debug Issues
### If extraction fails:
```bash
# Check Ollama connection
curl http://192.168.30.36:11434/api/tags
# Check input PDFs exist
ls /Volumes/NV2/PDF-Processing/signature-image-output/*.pdf | head -5
# Run with single file for testing
python -c "from extract_signatures_hybrid import *; process_pdf_page('/path/to/test.pdf', OUTPUT_PATH)"
```
### If too many false positives:
- Increase `MIN_CONTOUR_AREA` (filter out small regions)
- Decrease `MAX_CONTOUR_AREA` (filter out large regions)
- Check rejected folder to verify they're actually non-signatures
### If missing signatures:
- Check rejected folder (might be detected but not verified)
- Lower `MIN_CONTOUR_AREA` (catch smaller signatures)
- Increase `MAX_CONTOUR_AREA` (catch larger signatures)
- Widen aspect ratio range
---
## 📋 Session Handoff Checklist
When starting a new session, provide this context:
✅ **Project Goal:** Extract Chinese signatures from 86K PDFs
✅ **Current Approach:** Hybrid VLM name + CV detection + VLM verification
✅ **Status:** Working at 70% recall, 100% precision on 5 test files
✅ **Key Context:** VLM coordinates unreliable (32% offset), use names instead
✅ **Key Files:** extract_signatures_hybrid.py (main), PROJECT_DOCUMENTATION.md (history)
✅ **Next Steps:** Improve recall OR scale up testing OR commit to git
---
## 🎓 Important Lessons Learned
1. **VLM spatial reasoning is unreliable** - Don't trust percentage-based coordinates
2. **VLM text recognition is excellent** - Use for extracting names, not locations
3. **Computer vision is precise** - Use for pixel-level location detection
4. **Name-based verification works** - Filters false positives effectively
5. **Diagnostic scripts are crucial** - Helped discover coordinate offset issue
6. **Conservative parameters** - Better to miss signatures than get false positives
---
## 📞 Quick Reference
### Most Important Command
```bash
python extract_signatures_hybrid.py # Run signature extraction
```
### Most Important File
```bash
less PROJECT_DOCUMENTATION.md # Complete project history
```
### Most Important Finding
**VLM coordinates are unreliable → Use VLM for names, CV for locations**
---
## ✨ Session Start Template
**When starting a new session, say:**
> "I'm continuing work on the PDF signature extraction project. Please read `/Volumes/NV2/pdf_recognize/SESSION_INIT.md` and `/Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md` to understand the current state.
>
> Current status: Working hybrid approach with 70% recall on 5 test files.
>
> I want to: [choose one]
> - Improve recall by tuning CV parameters
> - Test on 100 PDFs to verify reliability
> - Commit current solution to git
> - Process full 86K dataset
> - Debug a specific issue: [describe]"
---
**Document Created:** October 26, 2025
**Last Updated:** October 26, 2025
**Status:** Ready for Next Session
**Working Directory:** `/Volumes/NV2/pdf_recognize/`