Implement VLM name extraction + CV detection hybrid approach to
replace unreliable VLM coordinate system with name-based verification.
Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling
Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)
Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs
Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF
Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)
🤖 Generated with Claude Code
Session Initialization - PDF Signature Extraction Project
Purpose: This document helps you (or another Claude instance) quickly understand the project state and continue working.
Project Quick Summary
Goal: Extract handwritten Chinese signatures from 86,073 PDF documents automatically.
Current Status: ✅ Working solution with 70% recall, 100% precision (tested on 5 PDFs)
Approach: Hybrid VLM name extraction + Computer Vision detection + VLM verification
🚀 Quick Start (Resume Work)
If you want to continue testing:
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
# Test with more files (edit line 425 in script)
python extract_signatures_hybrid.py
If you want to review what was done:
# Read the complete history
less PROJECT_DOCUMENTATION.md
# Check test results
ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
If you want to commit to git:
# Follow the guide
less COMMIT_SUMMARY.md
📁 Key Files (What Each Does)
Production Scripts ✅
- extract_pages_from_csv.py - Step 1: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py - Step 2: Extract signatures (CURRENT WORKING, tested: 5 files)
- extract_handwriting.py - CV-only approach (component used in hybrid)
Documentation 📚
- PROJECT_DOCUMENTATION.md - ⭐ READ THIS FIRST - Complete history of all 5 approaches tested
- README.md - Quick start guide
- COMMIT_SUMMARY.md - Git commit instructions
- SESSION_INIT.md - This file (for session continuity)
Configuration ⚙️
- .gitignore - Excludes diagnostic scripts and test outputs
🎯 Current Working Solution
Architecture
1. VLM extracts signature names: "周寶蓮", "魏興海"
2. CV detects signature-like regions (5K-200K pixels)
3. VLM verifies each region against expected names
4. Save verified signatures: signature_周寶蓮.png
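The four steps above can be sketched as a single control loop. This is a minimal sketch, not the code in extract_signatures_hybrid.py: `run_hybrid_pipeline` and the three injected callables (`extract_names`, `detect_regions`, `verify_region`) are hypothetical names standing in for the VLM and CV calls, so the flow can be read and tested without Ollama or OpenCV.

```python
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def run_hybrid_pipeline(
    page: object,
    extract_names: Callable[[object], List[str]],
    detect_regions: Callable[[object], List[Box]],
    verify_region: Callable[[object, Box, List[str]], Optional[str]],
) -> Tuple[Dict[str, Box], List[Box]]:
    """Return (accepted regions keyed by person name, rejected regions)."""
    names = extract_names(page)  # step 1: VLM reads the expected names
    accepted: Dict[str, Box] = {}
    rejected: List[Box] = []
    for box in detect_regions(page):  # step 2: CV proposes candidate regions
        remaining = [n for n in names if n not in accepted]
        if not remaining:
            break  # every expected name already has a signature
        match = verify_region(page, box, remaining)  # step 3: VLM verifies
        if match in remaining:
            accepted[match] = box  # step 4: keep one region per person
        else:
            rejected.append(box)  # matches no outstanding name: reject
    return accepted, rejected
```

Passing only the still-unmatched names to the verifier is one way to get duplicate prevention and rejection handling in the same loop; the real script may do this differently.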
Test Results (5 PDFs)
| Metric | Value |
|---|---|
| Expected signatures | 10 |
| Found signatures | 7 |
| Recall | 70% |
| Precision | 100% |
| False positives | 0 |
Why 30% Missing?
- Computer vision parameters too conservative
- Some signatures smaller/larger than 5K-200K pixel range
- Aspect ratio filter (0.5-10) may exclude some signatures
⚠️ Critical Context (What You MUST Know)
1. VLM Coordinate System is UNRELIABLE ❌
Discovery: VLM (qwen2.5vl:32b) provides inaccurate coordinates.
Example:
- VLM said signatures at: top=58%, top=68%
- Actual location: top=26%
- Error: ~32 percentage points off for the first reported position (the offset is NOT consistent across files!)
Test file: 201301_2458_AI1_page4.pdf
- VLM correctly identifies 2 signatures: "周寶蓮", "魏興海"
- VLM coordinates extract 100% white/blank regions
- This is why we abandoned coordinate-based approach
Evidence: See diagnostic scripts and results in PROJECT_DOCUMENTATION.md
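To make the size of that error concrete, the percentages can be converted to pixel rows. This is a rough illustration only: the page height of 3000 px is an assumed value (the source does not state the render resolution), and `percent_to_pixels` is a hypothetical helper.

```python
def percent_to_pixels(percent: float, page_height_px: int) -> int:
    """Convert a 'top = N%' coordinate into a pixel row."""
    return round(page_height_px * percent / 100.0)


PAGE_HEIGHT_PX = 3000  # assumption for illustration, not taken from the project

reported_top = percent_to_pixels(58, PAGE_HEIGHT_PX)  # first VLM-reported position
actual_top = percent_to_pixels(26, PAGE_HEIGHT_PX)    # where the signature really is
offset_px = reported_top - actual_top

print(reported_top, actual_top, offset_px)  # 1740 780 960
```

At this assumed height, a crop placed at the reported position starts 960 px below the actual signature, which is consistent with the crops coming out blank.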
2. Name-Based Approach is the Solution ✅
Instead of using VLM coordinates:
- ✅ Use VLM to extract names (reliable)
- ✅ Use CV to find locations (pixel-accurate)
- ✅ Use VLM to verify each region against names (accurate)
3. All Test PDFs Are Scanned Images
- No searchable text layer
- PDF text layer method (Method A) is untested
- All current results use CV detection (Method B)
🔧 Configuration Details
Ollama Setup
OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
Verify connection:
curl http://192.168.30.36:11434/api/tags
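The same check can be done from Python before a long run. This is a sketch under the assumption that `/api/tags` returns JSON with a `models` list whose entries carry a `name` field; `fetch_tags` and `model_available` are hypothetical helpers, not functions from the project scripts.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Download the model list from the Ollama server (raises on network errors)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_available(tags: dict, model: str = OLLAMA_MODEL) -> bool:
    """True if the expected model name appears in the server's model list."""
    return any(entry.get("name") == model for entry in tags.get("models", []))
```

Calling `model_available(fetch_tags())` at the top of a batch run would fail fast if the server is unreachable or the model has not been pulled.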
File Paths
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = ".../signatures/rejected"
CV Detection Parameters (adjust to improve recall)
# In extract_signatures_hybrid.py, detect_signature_regions_cv()
MIN_CONTOUR_AREA = 5000 # ⬇️ Lower = catch smaller signatures
MAX_CONTOUR_AREA = 200000 # ⬆️ Higher = catch larger signatures
ASPECT_RATIO_MIN = 0.5 # ⬇️ Lower = catch taller signatures
ASPECT_RATIO_MAX = 10.0 # ⬆️ Higher = catch wider signatures
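A minimal sketch of how these four thresholds might be applied to one candidate region. It assumes the aspect ratio is width divided by height, which the comments above suggest but the source does not state; `passes_size_filters` is a hypothetical helper, not the body of detect_signature_regions_cv().

```python
MIN_CONTOUR_AREA = 5000
MAX_CONTOUR_AREA = 200000
ASPECT_RATIO_MIN = 0.5
ASPECT_RATIO_MAX = 10.0


def passes_size_filters(
    area: float,
    width: int,
    height: int,
    min_area: float = MIN_CONTOUR_AREA,
    max_area: float = MAX_CONTOUR_AREA,
    min_aspect: float = ASPECT_RATIO_MIN,
    max_aspect: float = ASPECT_RATIO_MAX,
) -> bool:
    """Keep a region only if both its area and its width/height ratio are in range."""
    if width <= 0 or height <= 0:
        return False
    aspect = width / height  # assumed definition: > 1 means wider than tall
    return min_area <= area <= max_area and min_aspect <= aspect <= max_aspect


print(passes_size_filters(6000, 200, 50))                 # True
print(passes_size_filters(4000, 100, 40))                 # False: below MIN_CONTOUR_AREA
print(passes_size_filters(4000, 100, 40, min_area=3000))  # True with a lower threshold
```

Exposing the thresholds as keyword arguments makes it easy to try wider ranges on the same candidates without editing the constants.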
🎬 What Happened (Session History)
Approaches Tested (Chronological)
- PDF Image Objects → Abandoned (extracted full pages, not signatures)
- Simple Page Extraction → ✅ Working (extract pages from CSV)
- Computer Vision Only → Insufficient (6,420 regions from 100 pages - too many)
- VLM Coordinates → ❌ Failed (coordinates unreliable, extracted blank regions)
- Hybrid Name-Based → ✅ Current (70% recall, 100% precision)
Key Decisions Made
- ✅ Use VLM for names, not coordinates
- ✅ Verify each region against expected names
- ✅ Save signatures with person names
- ✅ Reject regions that don't match any name
- ✅ Prevent duplicate signatures per person
Diagnostic Work Done
Created 11 diagnostic scripts to investigate VLM coordinate failure:
- Visualized bounding boxes
- Analyzed pixel content
- Tested actual vs. reported locations
- Confirmed coordinates 32% off on test file
All findings documented in PROJECT_DOCUMENTATION.md
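The pixel-content check can be approximated in a few lines. This is a sketch of the idea, not one of the 11 diagnostic scripts: it works on a grayscale crop given as rows of 0-255 values, and the names `white_fraction`, `is_blank` and both thresholds are assumptions chosen for illustration.

```python
from typing import Sequence


def white_fraction(gray_rows: Sequence[Sequence[int]], white_threshold: int = 250) -> float:
    """Fraction of pixels at or above the threshold (255 = pure white)."""
    total = sum(len(row) for row in gray_rows)
    if total == 0:
        return 1.0  # treat an empty crop as blank
    white = sum(1 for row in gray_rows for value in row if value >= white_threshold)
    return white / total


def is_blank(gray_rows: Sequence[Sequence[int]], max_ink_fraction: float = 0.001) -> bool:
    """True if almost no dark pixels are present in the crop."""
    return 1.0 - white_fraction(gray_rows) <= max_ink_fraction
```

A crop taken at VLM-reported coordinates that comes back as blank under a check like this points at a location error rather than a recognition error.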
🚧 Known Issues & Next Steps
Issue 1: 30% Missing Signatures
Status: Open
Options:
- Widen CV parameter ranges (test with different thresholds)
- Multi-pass detection with different kernels
- Ask the VLM for help when signatures are missing
- Manual review of rejected folder
Issue 2: Text Layer Method Untested
Status: Pending
Need: PDFs with searchable text to test Method A
Issue 3: Performance (24 sec/PDF)
Status: Acceptable for now
Future: Optimize if processing full 86K dataset
📊 Test Data Reference
Test Files Used (5 PDFs)
201301_1324_AI1_page3.pdf - ✅ Found 2/2: 楊智惠, 張志銘
201301_2061_AI1_page5.pdf - ⚠️ Found 1/2: 廖阿甚 (missing 林姿妤)
201301_2458_AI1_page4.pdf - ⚠️ Found 1/2: 周寶蓮 (missing 魏興海) ← VLM coordinate test file
201301_2923_AI1_page3.pdf - ⚠️ Found 1/2: 黄瑞展 (missing 陈丽琦)
201301_3189_AI1_page3.pdf - ✅ Found 2/2: 黄辉, 黄益辉
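The headline figures follow directly from these per-file counts. A small sketch of the arithmetic, with the (found, expected) pairs copied from the list above; the function names are illustrative.

```python
from typing import Dict, Tuple

# (found, expected) per test file, copied from the list above
RESULTS: Dict[str, Tuple[int, int]] = {
    "201301_1324_AI1_page3.pdf": (2, 2),
    "201301_2061_AI1_page5.pdf": (1, 2),
    "201301_2458_AI1_page4.pdf": (1, 2),
    "201301_2923_AI1_page3.pdf": (1, 2),
    "201301_3189_AI1_page3.pdf": (2, 2),
}


def recall(results: Dict[str, Tuple[int, int]]) -> float:
    """Signatures found divided by signatures expected, over all files."""
    found = sum(f for f, _ in results.values())
    expected = sum(e for _, e in results.values())
    return found / expected if expected else 0.0


def precision(true_positives: int, false_positives: int) -> float:
    """Correct extractions divided by all extractions."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


print(f"recall = {recall(RESULTS):.0%}")      # recall = 70%
print(f"precision = {precision(7, 0):.0%}")   # precision = 100%
```

With only 10 expected signatures, each missed signature moves recall by 10 points, so these numbers are an early indication rather than a stable estimate.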
Output Location
/Volumes/NV2/PDF-Processing/signature-image-output/signatures/
├── 201301_1324_AI1_page3_signature_張志銘.png
├── 201301_1324_AI1_page3_signature_楊智惠.png
├── 201301_2061_AI1_page5_signature_廖阿甚.png
├── 201301_2458_AI1_page4_signature_周寶蓮.png
├── 201301_2923_AI1_page3_signature_黄瑞展.png
├── 201301_3189_AI1_page3_signature_黄辉.png
├── 201301_3189_AI1_page3_signature_黄益辉.png
└── rejected/ (non-signature regions)
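The naming pattern visible in this listing is the PDF stem, then `_signature_`, then the person's name. A minimal sketch that reproduces it; `signature_filename` is a hypothetical helper and may not match the function used in the script.

```python
from pathlib import Path


def signature_filename(pdf_path: str, person_name: str) -> str:
    """Build '<pdf stem>_signature_<name>.png' from a PDF path and a verified name."""
    return f"{Path(pdf_path).stem}_signature_{person_name}.png"


print(signature_filename("201301_2458_AI1_page4.pdf", "周寶蓮"))
# 201301_2458_AI1_page4_signature_周寶蓮.png
```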
💡 How to Continue Work
Option 1: Improve Recall (Find Missing Signatures)
Goal: Get from 70% to 90%+ recall
Approach:
- Inspect the rejected folder to see whether the missing signatures were detected but rejected
- Adjust CV parameters in detect_signature_regions_cv():
MIN_CONTOUR_AREA = 3000 # Lower threshold
MAX_CONTOUR_AREA = 300000 # Higher threshold
- Test on same 5 PDFs and compare results
- If recall improves without too many false positives, proceed
Files to edit: extract_signatures_hybrid.py, lines 178-214
Option 2: Scale Up Testing
Goal: Test on 100 PDFs to verify reliability
Approach:
- Edit extract_signatures_hybrid.py line 425:
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]
- Run script (will take ~40 minutes)
- Analyze results in log file
- Calculate overall recall/precision
Option 3: Prepare for Production
Goal: Process all 86,073 files
Requirements:
- Verify current approach is acceptable (70% recall OK?)
- Estimate time: 86K files × 24 sec/file = ~24 days
- Consider parallel processing or optimization
- Set up monitoring and resume capability
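The time estimate and the resume requirement can both be sketched briefly. The worker scaling below assumes throughput grows linearly with the number of workers, which is untested here and may not hold if the single Ollama server is the bottleneck; `estimate_days` and `pending_files` are hypothetical helpers.

```python
from typing import Iterable, List

SECONDS_PER_DAY = 86_400


def estimate_days(n_files: int, sec_per_file: float = 24.0, workers: int = 1) -> float:
    """Wall-clock days for a run, assuming ideal linear scaling across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return n_files * sec_per_file / workers / SECONDS_PER_DAY


def pending_files(all_files: Iterable[str], done_files: Iterable[str]) -> List[str]:
    """Files still to process after a restart, preserving the original order."""
    done = set(done_files)
    return [name for name in all_files if name not in done]


print(round(estimate_days(86_073), 1))             # 23.9
print(round(estimate_days(86_073, workers=4), 1))  # 6.0
```

For resume capability, one simple option is to append each finished file name to a log and pass that log's contents as `done_files` on the next start.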
Option 4: Commit Current State
Goal: Save working solution to git
Steps:
- Read COMMIT_SUMMARY.md
- Review files to commit
- Run verification checks
- Execute git commands
- Tag release: v1.0-hybrid-70percent
🔍 How to Debug Issues
If extraction fails:
# Check Ollama connection
curl http://192.168.30.36:11434/api/tags
# Check input PDFs exist
ls /Volumes/NV2/PDF-Processing/signature-image-output/*.pdf | head -5
# Run with single file for testing
python -c "from extract_signatures_hybrid import *; process_pdf_page('/path/to/test.pdf', OUTPUT_PATH)"
If too many false positives:
- Increase MIN_CONTOUR_AREA (filter out small regions)
- Decrease MAX_CONTOUR_AREA (filter out large regions)
- Check rejected folder to verify they're actually non-signatures
If missing signatures:
- Check rejected folder (might be detected but not verified)
- Lower MIN_CONTOUR_AREA (catch smaller signatures)
- Increase MAX_CONTOUR_AREA (catch larger signatures)
- Widen aspect ratio range
📋 Session Handoff Checklist
When starting a new session, provide this context:
- ✅ Project Goal: Extract Chinese signatures from 86K PDFs
- ✅ Current Approach: Hybrid VLM name + CV detection + VLM verification
- ✅ Status: Working at 70% recall, 100% precision on 5 test files
- ✅ Key Context: VLM coordinates unreliable (32% offset), use names instead
- ✅ Key Files: extract_signatures_hybrid.py (main), PROJECT_DOCUMENTATION.md (history)
- ✅ Next Steps: Improve recall OR scale up testing OR commit to git
🎓 Important Lessons Learned
- VLM spatial reasoning is unreliable - Don't trust percentage-based coordinates
- VLM text recognition is excellent - Use for extracting names, not locations
- Computer vision is precise - Use for pixel-level location detection
- Name-based verification works - Filters false positives effectively
- Diagnostic scripts are crucial - Helped discover coordinate offset issue
- Conservative parameters - Better to miss signatures than get false positives
📞 Quick Reference
Most Important Command
python extract_signatures_hybrid.py # Run signature extraction
Most Important File
less PROJECT_DOCUMENTATION.md # Complete project history
Most Important Finding
VLM coordinates are unreliable → Use VLM for names, CV for locations
✨ Session Start Template
When starting a new session, say:
"I'm continuing work on the PDF signature extraction project. Please read /Volumes/NV2/pdf_recognize/SESSION_INIT.md and /Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md to understand the current state.
Current status: Working hybrid approach with 70% recall on 5 test files.
I want to: [choose one]
- Improve recall by tuning CV parameters
- Test on 100 PDFs to verify reliability
- Commit current solution to git
- Process full 86K dataset
- Debug a specific issue: [describe]"
Document Created: October 26, 2025
Last Updated: October 26, 2025
Status: Ready for Next Session
Working Directory: /Volumes/NV2/pdf_recognize/