Add hybrid signature extraction with name-based verification

Implement a hybrid approach (VLM name extraction + CV detection) that
replaces the unreliable VLM coordinate approach with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract PDF pages listed in a CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code
2025-10-26 23:39:52 +08:00
commit 52612e14ba
14 changed files with 3583 additions and 0 deletions

SESSION_INIT.md Normal file

@@ -0,0 +1,372 @@
# Session Initialization - PDF Signature Extraction Project
**Purpose:** This document helps you (or another Claude instance) quickly understand the project state and continue working.
---
## Project Quick Summary
**Goal:** Extract handwritten Chinese signatures from 86,073 PDF documents automatically.
**Current Status:** ✅ Working solution with 70% recall, 100% precision (tested on 5 PDFs)
**Approach:** Hybrid VLM name extraction + Computer Vision detection + VLM verification
---
## 🚀 Quick Start (Resume Work)
### If you want to continue testing:
```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
# Test with more files (edit line 425 in script)
python extract_signatures_hybrid.py
```
### If you want to review what was done:
```bash
# Read the complete history
less PROJECT_DOCUMENTATION.md
# Check test results
ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
```
### If you want to commit to git:
```bash
# Follow the guide
less COMMIT_SUMMARY.md
```
---
## 📁 Key Files (What Each Does)
### Production Scripts ✅
- **extract_pages_from_csv.py** - Step 1: Extract PDF pages listed in a CSV (tested: 100 files)
- **extract_signatures_hybrid.py** - Step 2: Extract signatures (CURRENT WORKING, tested: 5 files)
- **extract_handwriting.py** - CV-only approach (component used in hybrid)
### Documentation 📚
- **PROJECT_DOCUMENTATION.md** - ⭐ READ THIS FIRST - Complete history of all 5 approaches tested
- **README.md** - Quick start guide
- **COMMIT_SUMMARY.md** - Git commit instructions
- **SESSION_INIT.md** - This file (for session continuity)
### Configuration ⚙️
- **.gitignore** - Excludes diagnostic scripts and test outputs
---
## 🎯 Current Working Solution
### Architecture
```
1. VLM extracts signature names: "周寶蓮", "魏興海"
2. CV detects signature-like regions (5K-200K pixels)
3. VLM verifies each region against expected names
4. Save verified signatures: signature_周寶蓮.png
```
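The four steps above can be sketched as a single function that takes the VLM and CV stages as callables. This is only an illustration of the control flow, not the code in `extract_signatures_hybrid.py`: the function name, the callable signatures, and the choice to send duplicate matches to the rejected list are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def extract_by_name(
    page: object,
    extract_names: Callable[[object], Sequence[str]],
    detect_regions: Callable[[object], Sequence[Box]],
    verify_region: Callable[[object, Box, Sequence[str]], Optional[str]],
) -> Tuple[Dict[str, Box], List[Box]]:
    """Return (accepted regions keyed by person name, rejected regions)."""
    names = list(extract_names(page))              # step 1: VLM reads the names
    accepted: Dict[str, Box] = {}
    rejected: List[Box] = []
    for box in detect_regions(page):               # step 2: CV proposes regions
        match = verify_region(page, box, names)    # step 3: VLM checks the crop
        if match in names and match not in accepted:
            accepted[match] = box                  # step 4: one region per name
        else:
            rejected.append(box)                   # no match, or a duplicate
    return accepted, rejected
```

Passing the stages in as callables keeps the sketch runnable without Ollama or OpenCV, and makes it easy to swap the CV detector for the untested text-layer method.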
### Test Results (5 PDFs)
| Metric | Value |
|--------|-------|
| Expected signatures | 10 |
| Found signatures | 7 |
| Recall | 70% |
| Precision | 100% |
| False positives | 0 |
### Why 30% Missing?
- Computer vision parameters too conservative
- Some signatures fall outside the 5K-200K pixel area range (too small or too large)
- Aspect ratio filter (0.5-10) may exclude some signatures
---
## ⚠️ Critical Context (What You MUST Know)
### 1. VLM Coordinate System is UNRELIABLE ❌
**Discovery:** VLM (qwen2.5vl:32b) provides inaccurate coordinates.
**Example:**
- VLM said signatures at: top=58%, top=68%
- Actual location: top=26%
- Error: offset of ~32 percentage points (NOT consistent across files!)
**Test file:** `201301_2458_AI1_page4.pdf`
- VLM correctly identifies 2 signatures: "周寶蓮", "魏興海"
- Cropping at the VLM coordinates yields 100% white (blank) regions
- This is why we abandoned the coordinate-based approach
**Evidence:** See diagnostic scripts and results in PROJECT_DOCUMENTATION.md
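A minimal sketch of the kind of check that exposes this failure: convert a percentage-based position into pixel rows, then measure how much of that band is white. The helper names, the 5% band height, the white threshold of 250, and the synthetic page are assumptions for illustration; the actual diagnostic scripts may work differently.

```python
from typing import List, Tuple

Gray = List[List[int]]  # rows of 0-255 grayscale values


def percent_to_rows(top_pct: float, height_pct: float, img_height: int) -> Tuple[int, int]:
    """Convert a percentage-based vertical span into a [start, stop) row range."""
    start = int(img_height * top_pct / 100)
    stop = int(img_height * (top_pct + height_pct) / 100)
    return max(0, start), min(img_height, stop)


def white_fraction(image: Gray, rows: Tuple[int, int], white_threshold: int = 250) -> float:
    """Fraction of pixels in the row band at or above white_threshold."""
    start, stop = rows
    band = [px for row in image[start:stop] for px in row]
    if not band:
        return 1.0  # an empty crop contains no ink
    return sum(px >= white_threshold for px in band) / len(band)


# Synthetic 100x50 page: white everywhere except "ink" at rows 26-30.
page = [[255] * 50 for _ in range(100)]
for y in range(26, 31):
    page[y][10:40] = [0] * 30

reported = percent_to_rows(58, 5, len(page))  # where the VLM pointed
actual = percent_to_rows(26, 5, len(page))    # where the ink really is
print(white_fraction(page, reported))  # 1.0 (blank crop)
print(white_fraction(page, actual))    # 0.4
```

A white fraction of 1.0 for the reported band reproduces the "100% white" symptom described above.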
### 2. Name-Based Approach is the Solution ✅
Instead of using VLM coordinates:
- ✅ Use VLM to extract **names** (reliable)
- ✅ Use CV to find **locations** (pixel-accurate)
- ✅ Use VLM to **verify** each region against names (accurate)
### 3. All Test PDFs Are Scanned Images
- No searchable text layer
- PDF text layer method (Method A) is **untested**
- All current results use CV detection (Method B)
---
## 🔧 Configuration Details
### Ollama Setup
```python
OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
```
**Verify connection:**
```bash
curl http://192.168.30.36:11434/api/tags
```
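The same check can be done from Python, which also confirms that the configured model is actually installed on the server. This is a sketch using only the standard library; the helper names are hypothetical, and it assumes the `/api/tags` response is a JSON object with a `models` list whose entries carry a `name` field.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """GET /api/tags and return the parsed JSON (raises on network errors)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_names(tags_payload: dict) -> List[str]:
    """Pull the model names out of a parsed /api/tags response."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def model_available(tags_payload: dict, model: str = OLLAMA_MODEL) -> bool:
    """True if the given model name appears in the server's model list."""
    return model in model_names(tags_payload)
```

Usage: `model_available(fetch_tags())` should return `True` before a long run is started.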
### File Paths
```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = ".../signatures/rejected"
```
### CV Detection Parameters (adjust to improve recall)
```python
# In extract_signatures_hybrid.py, detect_signature_regions_cv()
MIN_CONTOUR_AREA = 5000 # ⬇️ Lower = catch smaller signatures
MAX_CONTOUR_AREA = 200000 # ⬆️ Higher = catch larger signatures
ASPECT_RATIO_MIN = 0.5 # ⬇️ Lower = catch taller signatures
ASPECT_RATIO_MAX = 10.0 # ⬆️ Higher = catch wider signatures
```
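To see which filter is responsible when a region is dropped, a small helper can report the reason. This sketch assumes the area is the bounding-box width times height and the aspect ratio is width divided by height; the real `detect_signature_regions_cv()` may measure contour area instead, so treat the function and its reasons as hypothetical.

```python
from typing import Optional

MIN_CONTOUR_AREA = 5000
MAX_CONTOUR_AREA = 200000
ASPECT_RATIO_MIN = 0.5
ASPECT_RATIO_MAX = 10.0


def rejection_reason(
    width: int,
    height: int,
    min_area: int = MIN_CONTOUR_AREA,
    max_area: int = MAX_CONTOUR_AREA,
    ratio_min: float = ASPECT_RATIO_MIN,
    ratio_max: float = ASPECT_RATIO_MAX,
) -> Optional[str]:
    """Return None if the box passes every filter, else a short reason."""
    if width <= 0 or height <= 0:
        return "empty box"
    area = width * height      # assumption: bounding-box area
    ratio = width / height     # assumption: width / height
    if area < min_area:
        return "area below minimum"
    if area > max_area:
        return "area above maximum"
    if ratio < ratio_min:
        return "aspect ratio below minimum"
    if ratio > ratio_max:
        return "aspect ratio above maximum"
    return None


print(rejection_reason(200, 60))                # None
print(rejection_reason(80, 40))                 # area below minimum
print(rejection_reason(80, 40, min_area=3000))  # None
```

The last call shows how lowering `MIN_CONTOUR_AREA` to 3000 would let an 80x40 region through.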
---
## 🎬 What Happened (Session History)
### Approaches Tested (Chronological)
1. **PDF Image Objects** → Abandoned (extracted full pages, not signatures)
2. **Simple Page Extraction** → ✅ Working (extract pages from CSV)
3. **Computer Vision Only** → Insufficient (6,420 regions from 100 pages - too many)
4. **VLM Coordinates** → ❌ Failed (coordinates unreliable, extracted blank regions)
5. **Hybrid Name-Based** → ✅ Current (70% recall, 100% precision)
### Key Decisions Made
✅ Use VLM for names, not coordinates
✅ Verify each region against expected names
✅ Save signatures with person names
✅ Reject regions that don't match any name
✅ Prevent duplicate signatures per person
### Diagnostic Work Done
Created 11 diagnostic scripts to investigate VLM coordinate failure:
- Visualized bounding boxes
- Analyzed pixel content
- Tested actual vs. reported locations
- Confirmed coordinates were off by ~32 percentage points on the test file
All findings documented in PROJECT_DOCUMENTATION.md
---
## 🚧 Known Issues & Next Steps
### Issue 1: 30% Missing Signatures
**Status:** Open
**Options:**
1. Widen CV parameter ranges (test with different thresholds)
2. Multi-pass detection with different kernels
3. Fall back to the VLM when expected signatures are still missing
4. Manual review of rejected folder
### Issue 2: Text Layer Method Untested
**Status:** Pending
**Need:** PDFs with searchable text to test Method A
### Issue 3: Performance (24 sec/PDF)
**Status:** Acceptable for now
**Future:** Optimize if processing full 86K dataset
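For planning, the run time can be estimated from the measured 24 seconds per PDF. The sketch below assumes work divides evenly across workers, which a single Ollama server may not deliver in practice, so the multi-worker figure is an optimistic bound rather than a measurement.

```python
SECONDS_PER_DAY = 86_400


def estimated_days(num_files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Wall-clock days, assuming perfectly even scaling across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return num_files * seconds_per_file / workers / SECONDS_PER_DAY


print(round(estimated_days(86_073, 24), 1))             # 23.9
print(round(estimated_days(86_073, 24, workers=4), 1))  # 6.0
print(round(estimated_days(100, 24) * 24 * 60))         # 40 (minutes)
```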
---
## 📊 Test Data Reference
### Test Files Used (5 PDFs)
```
201301_1324_AI1_page3.pdf - ✅ Found 2/2: 楊智惠, 張志銘
201301_2061_AI1_page5.pdf - ⚠️ Found 1/2: 廖阿甚 (missing 林姿妤)
201301_2458_AI1_page4.pdf - ⚠️ Found 1/2: 周寶蓮 (missing 魏興海) ← VLM coordinate test file
201301_2923_AI1_page3.pdf - ⚠️ Found 1/2: 黄瑞展 (missing 陈丽琦)
201301_3189_AI1_page3.pdf - ✅ Found 2/2: 黄辉, 黄益辉
```
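The headline numbers follow directly from this per-file list. A short sketch that recomputes them, assuming 2 expected signatures per page (as listed) and 0 false positives (as reported in the results table); the dictionary layout is just one convenient way to hold the data.

```python
from typing import Dict, Tuple

# file -> (expected signatures, correctly found, false positives)
TEST_RESULTS: Dict[str, Tuple[int, int, int]] = {
    "201301_1324_AI1_page3.pdf": (2, 2, 0),
    "201301_2061_AI1_page5.pdf": (2, 1, 0),
    "201301_2458_AI1_page4.pdf": (2, 1, 0),
    "201301_2923_AI1_page3.pdf": (2, 1, 0),
    "201301_3189_AI1_page3.pdf": (2, 2, 0),
}


def recall_precision(per_file: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Aggregate recall and precision over all files."""
    expected = sum(e for e, _, _ in per_file.values())
    found = sum(f for _, f, _ in per_file.values())
    false_pos = sum(fp for _, _, fp in per_file.values())
    recall = found / expected if expected else 0.0
    precision = found / (found + false_pos) if (found + false_pos) else 0.0
    return recall, precision


def incomplete_files(per_file: Dict[str, Tuple[int, int, int]]) -> list:
    """Files where at least one expected signature was not found."""
    return [name for name, (e, f, _) in per_file.items() if f < e]


print(recall_precision(TEST_RESULTS))       # (0.7, 1.0)
print(len(incomplete_files(TEST_RESULTS)))  # 3
```

The same function can be reused unchanged once a larger test set has been reviewed.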
### Output Location
```
/Volumes/NV2/PDF-Processing/signature-image-output/signatures/
├── 201301_1324_AI1_page3_signature_張志銘.png
├── 201301_1324_AI1_page3_signature_楊智惠.png
├── 201301_2061_AI1_page5_signature_廖阿甚.png
├── 201301_2458_AI1_page4_signature_周寶蓮.png
├── 201301_2923_AI1_page3_signature_黄瑞展.png
├── 201301_3189_AI1_page3_signature_黄辉.png
├── 201301_3189_AI1_page3_signature_黄益辉.png
└── rejected/ (non-signature regions)
```
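The filenames above follow the pattern `<page stem>_signature_<name>.png`, so the output folder can be summarized without re-running extraction. The helpers below are hypothetical and rely only on that naming pattern as it appears in the listing.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

SEPARATOR = "_signature_"


def signature_filename(pdf_path: str, person_name: str) -> str:
    """Build the output name for one verified signature."""
    return f"{Path(pdf_path).stem}{SEPARATOR}{person_name}.png"


def parse_signature_filename(filename: str) -> Tuple[str, str]:
    """Split an output name back into (page stem, person name)."""
    page, sep, name = Path(filename).stem.rpartition(SEPARATOR)
    if not sep:
        raise ValueError(f"not a signature file: {filename}")
    return page, name


def names_by_page(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group extracted person names by their source page."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for filename in filenames:
        page, name = parse_signature_filename(filename)
        grouped[page].append(name)
    return dict(grouped)


print(signature_filename("201301_1324_AI1_page3.pdf", "張志銘"))
# 201301_1324_AI1_page3_signature_張志銘.png
```

Comparing `names_by_page()` against the names the VLM reported for each page gives a quick list of which people are still missing.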
---
## 💡 How to Continue Work
### Option 1: Improve Recall (Find Missing Signatures)
**Goal:** Get from 70% to 90%+ recall
**Approach:**
1. Inspect the rejected folder to see whether the missing signatures were detected but rejected
2. Adjust CV parameters in `detect_signature_regions_cv()`:
```python
MIN_CONTOUR_AREA = 3000 # Lower threshold
MAX_CONTOUR_AREA = 300000 # Higher threshold
```
3. Test on same 5 PDFs and compare results
4. If recall improves without too many false positives, proceed
**Files to edit:**
- `extract_signatures_hybrid.py` lines 178-214
### Option 2: Scale Up Testing
**Goal:** Test on 100 PDFs to verify reliability
**Approach:**
1. Edit `extract_signatures_hybrid.py` line 425:
```python
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]
```
2. Run script (will take ~40 minutes)
3. Analyze results in log file
4. Calculate overall recall/precision
### Option 3: Prepare for Production
**Goal:** Process all 86,073 files
**Requirements:**
1. Verify current approach is acceptable (70% recall OK?)
2. Estimate time: 86,073 files × 24 sec/file ≈ 24 days if run sequentially
3. Consider parallel processing or optimization
4. Set up monitoring and resume capability
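One simple way to get resume capability for a multi-day run is an append-only progress log: each finished PDF is written as one JSON line, and a restart skips anything already logged. This is a sketch under assumptions; the log format, function names, and the `process_one` callable are hypothetical and not part of the current scripts.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Set


def load_done(log_path: Path) -> Set[str]:
    """Read the progress log and return the PDFs already processed."""
    if not log_path.exists():
        return set()
    done: Set[str] = set()
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                done.add(json.loads(line)["pdf"])
    return done


def process_with_resume(
    pdf_paths: Iterable[str],
    process_one: Callable[[str], object],
    log_path: Path,
) -> int:
    """Process only PDFs missing from the log; return how many were run."""
    done = load_done(log_path)
    processed = 0
    with log_path.open("a", encoding="utf-8") as fh:
        for pdf in pdf_paths:
            if pdf in done:
                continue  # finished in an earlier run
            result = process_one(pdf)
            record = {"pdf": pdf, "result": result}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            fh.flush()  # keep the log current if the run is interrupted
            processed += 1
    return processed
```

The same log doubles as a basic monitoring source, since its line count is the number of PDFs completed so far.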
### Option 4: Commit Current State
**Goal:** Save working solution to git
**Steps:**
1. Read `COMMIT_SUMMARY.md`
2. Review files to commit
3. Run verification checks
4. Execute git commands
5. Tag release: `v1.0-hybrid-70percent`
---
## 🔍 How to Debug Issues
### If extraction fails:
```bash
# Check Ollama connection
curl http://192.168.30.36:11434/api/tags
# Check input PDFs exist
ls /Volumes/NV2/PDF-Processing/signature-image-output/*.pdf | head -5
# Run with single file for testing
python -c "from extract_signatures_hybrid import *; process_pdf_page('/path/to/test.pdf', OUTPUT_PATH)"
```
### If too many false positives:
- Increase `MIN_CONTOUR_AREA` (filter out small regions)
- Decrease `MAX_CONTOUR_AREA` (filter out large regions)
- Check rejected folder to verify they're actually non-signatures
### If missing signatures:
- Check rejected folder (might be detected but not verified)
- Lower `MIN_CONTOUR_AREA` (catch smaller signatures)
- Increase `MAX_CONTOUR_AREA` (catch larger signatures)
- Widen aspect ratio range
---
## 📋 Session Handoff Checklist
When starting a new session, provide this context:
✅ **Project Goal:** Extract Chinese signatures from 86K PDFs
✅ **Current Approach:** Hybrid VLM name + CV detection + VLM verification
✅ **Status:** Working at 70% recall, 100% precision on 5 test files
✅ **Key Context:** VLM coordinates unreliable (off by ~32 percentage points on the test file), use names instead
✅ **Key Files:** extract_signatures_hybrid.py (main), PROJECT_DOCUMENTATION.md (history)
✅ **Next Steps:** Improve recall OR scale up testing OR commit to git
---
## 🎓 Important Lessons Learned
1. **VLM spatial reasoning is unreliable** - Don't trust percentage-based coordinates
2. **VLM text recognition is excellent** - Use for extracting names, not locations
3. **Computer vision is precise** - Use for pixel-level location detection
4. **Name-based verification works** - Filters false positives effectively
5. **Diagnostic scripts are crucial** - Helped discover coordinate offset issue
6. **Conservative parameters favor precision** - Better to miss signatures than get false positives
---
## 📞 Quick Reference
### Most Important Command
```bash
python extract_signatures_hybrid.py # Run signature extraction
```
### Most Important File
```bash
less PROJECT_DOCUMENTATION.md # Complete project history
```
### Most Important Finding
**VLM coordinates are unreliable → Use VLM for names, CV for locations**
---
## ✨ Session Start Template
**When starting a new session, say:**
> "I'm continuing work on the PDF signature extraction project. Please read `/Volumes/NV2/pdf_recognize/SESSION_INIT.md` and `/Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md` to understand the current state.
>
> Current status: Working hybrid approach with 70% recall on 5 test files.
>
> I want to: [choose one]
> - Improve recall by tuning CV parameters
> - Test on 100 PDFs to verify reliability
> - Commit current solution to git
> - Process full 86K dataset
> - Debug a specific issue: [describe]"
---
**Document Created:** October 26, 2025
**Last Updated:** October 26, 2025
**Status:** Ready for Next Session
**Working Directory:** `/Volumes/NV2/pdf_recognize/`