Add hybrid signature extraction with name-based verification

Implement VLM name extraction + CV detection hybrid approach to replace unreliable VLM coordinate system with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code
2025-10-26 23:39:52 +08:00
commit 52612e14ba
14 changed files with 3583 additions and 0 deletions
--- a/SESSION_INIT.md
+++ b/SESSION_INIT.md
@@ -0,0 +1,372 @@
+# Session Initialization - PDF Signature Extraction Project
+
+**Purpose:** This document helps you (or another Claude instance) quickly understand the project state and continue working.
+
+---
+
+## Project Quick Summary
+
+**Goal:** Extract handwritten Chinese signatures from 86,073 PDF documents automatically.
+
+**Current Status:** ✅ Working solution with 70% recall, 100% precision (tested on 5 PDFs)
+
+**Approach:** Hybrid VLM name extraction + Computer Vision detection + VLM verification
+
+---
+
+## 🚀 Quick Start (Resume Work)
+
+### If you want to continue testing:
+```bash
+cd /Volumes/NV2/pdf_recognize
+source venv/bin/activate
+
+# Test with more files (edit line 425 in script)
+python extract_signatures_hybrid.py
+```
+
+### If you want to review what was done:
+```bash
+# Read the complete history
+less PROJECT_DOCUMENTATION.md
+
+# Check test results
+ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
+```
+
+### If you want to commit to git:
+```bash
+# Follow the guide
+less COMMIT_SUMMARY.md
+```
+
+---
+
+## 📁 Key Files (What Each Does)
+
+### Production Scripts ✅
+- **extract_pages_from_csv.py** - Step 1: Extract pages from CSV (tested: 100 files)
+- **extract_signatures_hybrid.py** - Step 2: Extract signatures (CURRENT WORKING, tested: 5 files)
+- **extract_handwriting.py** - CV-only approach (component used in hybrid)
+
+### Documentation 📚
+- **PROJECT_DOCUMENTATION.md** - ⭐ READ THIS FIRST - Complete history of all 5 approaches tested
+- **README.md** - Quick start guide
+- **COMMIT_SUMMARY.md** - Git commit instructions
+- **SESSION_INIT.md** - This file (for session continuity)
+
+### Configuration ⚙️
+- **.gitignore** - Excludes diagnostic scripts and test outputs
+
+---
+
+## 🎯 Current Working Solution
+
+### Architecture
+```
+1. VLM extracts signature names: "周寶蓮", "魏興海"
+2. CV detects signature-like regions (5K-200K pixels)
+3. VLM verifies each region against expected names
+4. Save verified signatures: signature_周寶蓮.png
+```
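
The four steps above can be sketched as a small control loop. This is only an illustration of the flow, not the code in `extract_signatures_hybrid.py`: the function name `extract_signatures` and the three injected callables are hypothetical stand-ins for the VLM name extraction, the CV detector, and the VLM verifier.

```python
from typing import Callable, Iterable, Optional

Region = tuple[int, int, int, int]  # (x, y, width, height) in pixels


def extract_signatures(
    page: object,
    extract_names: Callable[[object], list[str]],
    detect_regions: Callable[[object], Iterable[Region]],
    verify_region: Callable[[object, Region, list[str]], Optional[str]],
) -> tuple[dict[str, Region], list[Region]]:
    """Return (accepted regions keyed by person name, rejected regions)."""
    names = extract_names(page)  # step 1: VLM reads the expected names
    accepted: dict[str, Region] = {}
    rejected: list[Region] = []
    for region in detect_regions(page):  # step 2: CV proposes candidate regions
        # Names already matched are excluded, so each person is saved once.
        remaining = [n for n in names if n not in accepted]
        # Step 3: ask the verifier which remaining name (if any) the region shows.
        match = verify_region(page, region, remaining) if remaining else None
        if match in remaining:
            accepted[match] = region  # step 4: keep it under the person's name
        else:
            rejected.append(region)  # unmatched regions go to the rejected pile
    return accepted, rejected
```

Passing the three stages in as callables keeps the loop testable without a running Ollama server: stubs can replace the VLM calls while the duplicate and rejection logic is checked.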
+
+### Test Results (5 PDFs)
+| Metric | Value |
+|--------|-------|
+| Expected signatures | 10 |
+| Found signatures | 7 |
+| Recall | 70% |
+| Precision | 100% |
+| False positives | 0 |
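
For reference, the recall and precision figures in the table follow from the counts alone. The helper names below are illustrative and not taken from the project scripts.

```python
def recall(found_correct: int, expected: int) -> float:
    """Share of expected signatures that were actually extracted."""
    return found_correct / expected if expected else 0.0


def precision(found_correct: int, false_positives: int) -> float:
    """Share of extracted regions that really are signatures."""
    total = found_correct + false_positives
    return found_correct / total if total else 0.0


# Counts from the 5-PDF test run: 7 of 10 signatures found, 0 false positives.
print(f"recall={recall(7, 10):.0%}, precision={precision(7, 0):.0%}")
# prints: recall=70%, precision=100%
```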
+
+### Why 30% Missing?
+- Computer vision parameters too conservative
+- Some signatures fall outside the 5K-200K pixel area range
+- Aspect ratio filter (0.5-10) may exclude some signatures
+
+---
+
+## ⚠️ Critical Context (What You MUST Know)
+
+### 1. VLM Coordinate System is UNRELIABLE ❌
+
+**Discovery:** VLM (qwen2.5vl:32b) provides inaccurate coordinates.
+
+**Example:**
+- VLM said signatures at: top=58%, top=68%
+- Actual location: top=26%
+- Error: ~32 percentage points or more (58% reported vs. 26% actual), and the offset is NOT consistent across files!
+
+**Test file:** `201301_2458_AI1_page4.pdf`
+- VLM correctly identifies 2 signatures: "周寶蓮", "魏興海"
+- VLM coordinates extract 100% white/blank regions
+- This is why we abandoned coordinate-based approach
+
+**Evidence:** See diagnostic scripts and results in PROJECT_DOCUMENTATION.md
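
To make the size of that error concrete, the sketch below converts the reported and actual percentages into pixel rows. The page height of 3508 px (an A4 page rendered at 300 DPI) is an assumption chosen for illustration; the real render size used by the scripts is not stated here.

```python
def pct_to_px(pct: float, page_height_px: int) -> int:
    """Convert a 'top = N%' coordinate into a pixel row on the rendered page."""
    return round(pct / 100 * page_height_px)


PAGE_HEIGHT_PX = 3508  # assumption: A4 at 300 DPI, not a value from the project

reported_top = 58  # first position reported by the VLM (percent of page height)
actual_top = 26    # where the signatures actually are (percent of page height)

offset_points = reported_top - actual_top
offset_px = pct_to_px(reported_top, PAGE_HEIGHT_PX) - pct_to_px(actual_top, PAGE_HEIGHT_PX)
print(f"offset: {offset_points} percentage points, about {offset_px} px")
```

Under that assumption the crop lands roughly a third of a page below the signatures, which is consistent with the blank regions that were extracted.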
+
+### 2. Name-Based Approach is the Solution ✅
+
+Instead of using VLM coordinates:
+- ✅ Use VLM to extract **names** (reliable)
+- ✅ Use CV to find **locations** (pixel-accurate)
+- ✅ Use VLM to **verify** each region against names (accurate)
+
+### 3. All Test PDFs Are Scanned Images
+
+- No searchable text layer
+- PDF text layer method (Method A) is **untested**
+- All current results use CV detection (Method B)
+
+---
+
+## 🔧 Configuration Details
+
+### Ollama Setup
+```python
+OLLAMA_URL = "http://192.168.30.36:11434"
+OLLAMA_MODEL = "qwen2.5vl:32b"
+```
+
+**Verify connection:**
+```bash
+curl http://192.168.30.36:11434/api/tags
+```
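
The same check can be done from Python before a long run starts. This is a minimal sketch using only the standard library; `fetch_tags` and `model_available` are hypothetical helper names, and the response shape (a top-level `models` list whose entries carry a `name`) is what Ollama's `/api/tags` endpoint returns.

```python
import json
import urllib.request


def fetch_tags(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/api/tags and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_available(tags: dict, model: str) -> bool:
    """True if the model name appears in the server's model list."""
    return any(m.get("name") == model for m in tags.get("models", []))


# Example usage (requires the Ollama server to be reachable):
# tags = fetch_tags("http://192.168.30.36:11434")
# print(model_available(tags, "qwen2.5vl:32b"))
```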
+
+### File Paths
+```python
+PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
+OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
+REJECTED_PATH = ".../signatures/rejected"
+```
+
+### CV Detection Parameters (adjust to improve recall)
+```python
+# In extract_signatures_hybrid.py, detect_signature_regions_cv()
+MIN_CONTOUR_AREA = 5000      # ⬇️ Lower = catch smaller signatures
+MAX_CONTOUR_AREA = 200000    # ⬆️ Higher = catch larger signatures
+ASPECT_RATIO_MIN = 0.5       # ⬇️ Lower = catch taller signatures
+ASPECT_RATIO_MAX = 10.0      # ⬆️ Higher = catch wider signatures
+```
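
As an illustration of how these four thresholds interact, the sketch below filters a candidate by area and aspect ratio. It makes two assumptions that may differ from `detect_signature_regions_cv()`: aspect ratio is taken as width divided by height, and the bounding-box area is used when no contour area is supplied. With OpenCV, the inputs would typically come from `cv2.boundingRect(contour)` and `cv2.contourArea(contour)`.

```python
from typing import Optional

MIN_CONTOUR_AREA = 5000
MAX_CONTOUR_AREA = 200000
ASPECT_RATIO_MIN = 0.5
ASPECT_RATIO_MAX = 10.0


def is_signature_candidate(width: int, height: int, area: Optional[float] = None) -> bool:
    """Apply the area and aspect-ratio filters to one detected region."""
    if width <= 0 or height <= 0:
        return False
    # Assumption: fall back to bounding-box area when no contour area is given.
    region_area = width * height if area is None else area
    aspect_ratio = width / height  # assumption: width over height
    return (
        MIN_CONTOUR_AREA <= region_area <= MAX_CONTOUR_AREA
        and ASPECT_RATIO_MIN <= aspect_ratio <= ASPECT_RATIO_MAX
    )
```

A 200x50 region passes both filters, while a 50x50 region is dropped for being under the minimum area, which is the kind of small signature the current settings would miss.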
+
+---
+
+## 🎬 What Happened (Session History)
+
+### Approaches Tested (Chronological)
+
+1. **PDF Image Objects** → Abandoned (extracted full pages, not signatures)
+2. **Simple Page Extraction** → ✅ Working (extract pages from CSV)
+3. **Computer Vision Only** → Insufficient (6,420 regions from 100 pages - too many)
+4. **VLM Coordinates** → ❌ Failed (coordinates unreliable, extracted blank regions)
+5. **Hybrid Name-Based** → ✅ Current (70% recall, 100% precision)
+
+### Key Decisions Made
+
+✅ Use VLM for names, not coordinates
+✅ Verify each region against expected names
+✅ Save signatures with person names
+✅ Reject regions that don't match any name
+✅ Prevent duplicate signatures per person
+
+### Diagnostic Work Done
+
+Created 11 diagnostic scripts to investigate VLM coordinate failure:
+- Visualized bounding boxes
+- Analyzed pixel content
+- Tested actual vs. reported locations
+- Confirmed coordinates are about 32 percentage points off on the test file
+
+All findings documented in PROJECT_DOCUMENTATION.md
+
+---
+
+## 🚧 Known Issues & Next Steps
+
+### Issue 1: 30% Missing Signatures
+**Status:** Open
+**Options:**
+1. Widen CV parameter ranges (test with different thresholds)
+2. Multi-pass detection with different kernels
+3. Ask the VLM for help when expected signatures are still missing
+4. Manual review of rejected folder
+
+### Issue 2: Text Layer Method Untested
+**Status:** Pending
+**Need:** PDFs with searchable text to test Method A
+
+### Issue 3: Performance (24 sec/PDF)
+**Status:** Acceptable for now
+**Future:** Optimize if processing full 86K dataset
+
+---
+
+## 📊 Test Data Reference
+
+### Test Files Used (5 PDFs)
+```
+201301_1324_AI1_page3.pdf - ✅ Found 2/2: 楊智惠, 張志銘
+201301_2061_AI1_page5.pdf - ⚠️ Found 1/2: 廖阿甚 (missing 林姿妤)
+201301_2458_AI1_page4.pdf - ⚠️ Found 1/2: 周寶蓮 (missing 魏興海) ← VLM coordinate test file
+201301_2923_AI1_page3.pdf - ⚠️ Found 1/2: 黄瑞展 (missing 陈丽琦)
+201301_3189_AI1_page3.pdf - ✅ Found 2/2: 黄辉, 黄益辉
+```
+
+### Output Location
+```
+/Volumes/NV2/PDF-Processing/signature-image-output/signatures/
+├── 201301_1324_AI1_page3_signature_張志銘.png
+├── 201301_1324_AI1_page3_signature_楊智惠.png
+├── 201301_2061_AI1_page5_signature_廖阿甚.png
+├── 201301_2458_AI1_page4_signature_周寶蓮.png
+├── 201301_2923_AI1_page3_signature_黄瑞展.png
+├── 201301_3189_AI1_page3_signature_黄辉.png
+├── 201301_3189_AI1_page3_signature_黄益辉.png
+└── rejected/ (non-signature regions)
+```
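
The naming pattern visible in the listing above is the PDF's base name, the literal `_signature_`, and the person's name. A minimal sketch of that convention, with `signature_filename` as a hypothetical helper name:

```python
from pathlib import Path


def signature_filename(pdf_path: str, name: str) -> str:
    """Build '<pdf stem>_signature_<name>.png' for a verified signature."""
    return f"{Path(pdf_path).stem}_signature_{name}.png"


print(signature_filename("201301_2458_AI1_page4.pdf", "周寶蓮"))
# prints: 201301_2458_AI1_page4_signature_周寶蓮.png
```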
+
+---
+
+## 💡 How to Continue Work
+
+### Option 1: Improve Recall (Find Missing Signatures)
+
+**Goal:** Get from 70% to 90%+ recall
+
+**Approach:**
+1. Review the rejected folder to see whether the missing signatures were detected but rejected
+2. Adjust CV parameters in `detect_signature_regions_cv()`:
+   ```python
+   MIN_CONTOUR_AREA = 3000      # Lower threshold
+   MAX_CONTOUR_AREA = 300000    # Higher threshold
+   ```
+3. Test on same 5 PDFs and compare results
+4. If recall improves without too many false positives, proceed
+
+**Files to edit:**
+- `extract_signatures_hybrid.py` lines 178-214
+
+### Option 2: Scale Up Testing
+
+**Goal:** Test on 100 PDFs to verify reliability
+
+**Approach:**
+1. Edit `extract_signatures_hybrid.py` line 425:
+   ```python
+   pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]
+   ```
+2. Run script (will take ~40 minutes)
+3. Analyze results in log file
+4. Calculate overall recall/precision
+
+### Option 3: Prepare for Production
+
+**Goal:** Process all 86,073 files
+
+**Requirements:**
+1. Verify current approach is acceptable (70% recall OK?)
+2. Estimate time: 86K files × 24 sec/file = ~24 days if processed sequentially
+3. Consider parallel processing or optimization
+4. Set up monitoring and resume capability
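
The time estimate and the resume requirement can both be sketched in a few lines. Everything here is hypothetical: the helper names, the idea of a plain-text log with one finished filename per line, and the assumption that throughput scales linearly with the number of workers (a single Ollama server may not allow that).

```python
from pathlib import Path


def estimate_days(n_files: int, sec_per_file: float = 24.0, workers: int = 1) -> float:
    """Wall-clock days, assuming perfectly linear scaling across workers."""
    return n_files * sec_per_file / workers / 86_400


def load_done(done_log: Path) -> set[str]:
    """Read the filenames already processed; empty set if no log exists yet."""
    if not done_log.exists():
        return set()
    return set(done_log.read_text(encoding="utf-8").splitlines())


def filter_pending(all_files: list[str], done: set[str]) -> list[str]:
    """Keep only files that are not yet in the done set, preserving order."""
    return [f for f in all_files if f not in done]


def mark_done(filename: str, done_log: Path) -> None:
    """Append one finished filename so an interrupted run can resume."""
    with done_log.open("a", encoding="utf-8") as fh:
        fh.write(filename + "\n")


print(f"{estimate_days(86_073):.1f} days sequential")
```

Appending to the log after each file means a crash loses at most the file in progress; on restart, `filter_pending(all_files, load_done(log))` yields the remaining work.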
+
+### Option 4: Commit Current State
+
+**Goal:** Save working solution to git
+
+**Steps:**
+1. Read `COMMIT_SUMMARY.md`
+2. Review files to commit
+3. Run verification checks
+4. Execute git commands
+5. Tag release: `v1.0-hybrid-70percent`
+
+---
+
+## 🔍 How to Debug Issues
+
+### If extraction fails:
+```bash
+# Check Ollama connection
+curl http://192.168.30.36:11434/api/tags
+
+# Check input PDFs exist
+ls /Volumes/NV2/PDF-Processing/signature-image-output/*.pdf | head -5
+
+# Run with single file for testing
+python -c "from extract_signatures_hybrid import *; process_pdf_page('/path/to/test.pdf', OUTPUT_PATH)"
+```
+
+### If too many false positives:
+- Increase `MIN_CONTOUR_AREA` (filter out small regions)
+- Decrease `MAX_CONTOUR_AREA` (filter out large regions)
+- Check rejected folder to verify they're actually non-signatures
+
+### If missing signatures:
+- Check rejected folder (might be detected but not verified)
+- Lower `MIN_CONTOUR_AREA` (catch smaller signatures)
+- Increase `MAX_CONTOUR_AREA` (catch larger signatures)
+- Widen aspect ratio range
+
+---
+
+## 📋 Session Handoff Checklist
+
+When starting a new session, provide this context:
+
+✅ **Project Goal:** Extract Chinese signatures from 86K PDFs
+✅ **Current Approach:** Hybrid VLM name + CV detection + VLM verification
+✅ **Status:** Working at 70% recall, 100% precision on 5 test files
+✅ **Key Context:** VLM coordinates unreliable (about 32 percentage points off on the test file), use names instead
+✅ **Key Files:** extract_signatures_hybrid.py (main), PROJECT_DOCUMENTATION.md (history)
+✅ **Next Steps:** Improve recall OR scale up testing OR commit to git
+
+---
+
+## 🎓 Important Lessons Learned
+
+1. **VLM spatial reasoning is unreliable** - Don't trust percentage-based coordinates
+2. **VLM text recognition is excellent** - Use for extracting names, not locations
+3. **Computer vision is precise** - Use for pixel-level location detection
+4. **Name-based verification works** - Filters false positives effectively
+5. **Diagnostic scripts are crucial** - Helped discover coordinate offset issue
+6. **Conservative parameters trade recall for precision** - Better to miss signatures than get false positives
+
+---
+
+## 📞 Quick Reference
+
+### Most Important Command
+```bash
+python extract_signatures_hybrid.py  # Run signature extraction
+```
+
+### Most Important File
+```bash
+less PROJECT_DOCUMENTATION.md  # Complete project history
+```
+
+### Most Important Finding
+**VLM coordinates are unreliable → Use VLM for names, CV for locations**
+
+---
+
+## ✨ Session Start Template
+
+**When starting a new session, say:**
+
+> "I'm continuing work on the PDF signature extraction project. Please read `/Volumes/NV2/pdf_recognize/SESSION_INIT.md` and `/Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md` to understand the current state.
+>
+> Current status: Working hybrid approach with 70% recall on 5 test files.
+>
+> I want to: [choose one]
+> - Improve recall by tuning CV parameters
+> - Test on 100 PDFs to verify reliability
+> - Commit current solution to git
+> - Process full 86K dataset
+> - Debug a specific issue: [describe]"
+
+---
+
+**Document Created:** October 26, 2025
+**Last Updated:** October 26, 2025
+**Status:** Ready for Next Session
+**Working Directory:** `/Volumes/NV2/pdf_recognize/`