
gbanyan 52612e14ba Add hybrid signature extraction with name-based verification

Implement VLM name extraction + CV detection hybrid approach to
replace unreliable VLM coordinate system with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code

2025-10-26 23:39:52 +08:00


Session Initialization - PDF Signature Extraction Project

Purpose: This document helps you (or another Claude instance) quickly understand the project state and continue working.

Project Quick Summary

Goal: Extract handwritten Chinese signatures from 86,073 PDF documents automatically.

Current Status: ✅ Working solution with 70% recall, 100% precision (tested on 5 PDFs)

Approach: Hybrid VLM name extraction + Computer Vision detection + VLM verification

🚀 Quick Start (Resume Work)

If you want to continue testing:

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate

# Test with more files (edit line 425 in script)
python extract_signatures_hybrid.py

If you want to review what was done:

# Read the complete history
less PROJECT_DOCUMENTATION.md

# Check test results
ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png

If you want to commit to git:

# Follow the guide
less COMMIT_SUMMARY.md

📁 Key Files (What Each Does)

Production Scripts ✅

extract_pages_from_csv.py - Step 1: Extract pages from CSV (tested: 100 files)
extract_signatures_hybrid.py - Step 2: Extract signatures (CURRENT WORKING, tested: 5 files)
extract_handwriting.py - CV-only approach (component used in hybrid)

Documentation 📚

PROJECT_DOCUMENTATION.md - ⭐ READ THIS FIRST - Complete history of all 5 approaches tested
README.md - Quick start guide
COMMIT_SUMMARY.md - Git commit instructions
SESSION_INIT.md - This file (for session continuity)

Configuration ⚙️

.gitignore - Excludes diagnostic scripts and test outputs

🎯 Current Working Solution

Architecture

1. VLM extracts signature names: "周寶蓮", "魏興海"
2. CV detects signature-like regions (5K-200K pixels)
3. VLM verifies each region against expected names
4. Save verified signatures: signature_周寶蓮.png
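
The four steps above can be sketched as one function. This is a minimal sketch under stated assumptions, not the code in extract_signatures_hybrid.py: `extract_names`, `detect_regions`, and `verify_region` are hypothetical callables standing in for the VLM and CV steps, and a region can be any object.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def match_signatures(
    page: object,
    extract_names: Callable[[object], List[str]],
    detect_regions: Callable[[object], Iterable[object]],
    verify_region: Callable[[object, List[str]], Optional[str]],
) -> Tuple[Dict[str, object], List[object]]:
    """Return (accepted regions keyed by person name, rejected regions)."""
    expected = extract_names(page)            # step 1: VLM reads the names
    accepted: Dict[str, object] = {}
    rejected: List[object] = []
    for region in detect_regions(page):       # step 2: CV proposes regions
        # Only names that have no signature yet are still candidates.
        remaining = [n for n in expected if n not in accepted]
        # Step 3: the verifier returns a matching name or None.
        name = verify_region(region, remaining) if remaining else None
        if name in remaining:
            accepted[name] = region           # step 4: one region per person
        else:
            rejected.append(region)           # no match, or a duplicate
    return accepted, rejected
```

Passing the steps in as callables keeps the matching logic testable without a running Ollama server; regions that match no remaining name end up in the rejected list, mirroring the rejected/ folder.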

Test Results (5 PDFs)

Metric	Value
Expected signatures	10
Found signatures	7
Recall	70%
Precision	100%
False positives	0

Why 30% Missing?

Computer vision parameters too conservative
Some signatures smaller/larger than 5K-200K pixel range
Aspect ratio filter (0.5-10) may exclude some signatures

⚠️ Critical Context (What You MUST Know)

1. VLM Coordinate System is UNRELIABLE ❌

Discovery: VLM (qwen2.5vl:32b) provides inaccurate coordinates.

Example:

VLM said signatures at: top=58%, top=68%
Actual location: top=26%
Error: ~32 percentage-point offset (NOT consistent across files!)

Test file: 201301_2458_AI1_page4.pdf

VLM correctly identifies 2 signatures: "周寶蓮", "魏興海"
VLM coordinates extract 100% white/blank regions
This is why we abandoned coordinate-based approach

Evidence: See diagnostic scripts and results in PROJECT_DOCUMENTATION.md
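
One way to reproduce the blank-region check is to turn a reported percentage into pixel rows and measure how much of the crop is white. This is a minimal sketch, assuming a grayscale page stored as a list of rows of 0-255 values; the function names and the white threshold of 250 are illustrative and are not taken from the diagnostic scripts.

```python
from typing import List

Gray = List[List[int]]


def crop_rows(page: Gray, top_pct: float, height_pct: float) -> Gray:
    """Crop a horizontal band given as percentages of the page height."""
    total = len(page)
    start = int(total * top_pct / 100)
    stop = min(total, start + max(1, int(total * height_pct / 100)))
    return page[start:stop]


def white_fraction(region: Gray, white_threshold: int = 250) -> float:
    """Fraction of pixels at or above the threshold (1.0 means fully blank)."""
    pixels = [value for row in region for value in row]
    if not pixels:
        return 1.0
    return sum(value >= white_threshold for value in pixels) / len(pixels)


# Synthetic 100-row page with ink only in rows 26-35.
page = [[255] * 10 for _ in range(100)]
for r in range(26, 36):
    page[r] = [0] * 10

reported = white_fraction(crop_rows(page, top_pct=58, height_pct=10))
actual = white_fraction(crop_rows(page, top_pct=26, height_pct=10))
```

On this synthetic page the crop at the reported position is entirely white while the crop at the actual position is entirely ink, which is the pattern described for the test file.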

2. Name-Based Approach is the Solution ✅

Instead of using VLM coordinates:

✅ Use VLM to extract names (reliable)
✅ Use CV to find locations (pixel-accurate)
✅ Use VLM to verify each region against names (accurate)

3. All Test PDFs Are Scanned Images

No searchable text layer
PDF text layer method (Method A) is untested
All current results use CV detection (Method B)

🔧 Configuration Details

Ollama Setup

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"

Verify connection:

curl http://192.168.30.36:11434/api/tags
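
The same check can be run from Python before a long batch. This is a minimal sketch using only the standard library; it assumes the `/api/tags` response is JSON with a `models` list whose entries carry a `name` field, and the helper names are illustrative.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"


def model_names(tags_json: str) -> List[str]:
    """Extract model names from the body returned by /api/tags."""
    return [m.get("name", "") for m in json.loads(tags_json).get("models", [])]


def model_available(base_url: str = OLLAMA_URL, model: str = OLLAMA_MODEL,
                    timeout: float = 5.0) -> bool:
    """Return True if the server answers and lists the requested model."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model in model_names(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection errors and timeouts are OSError; malformed JSON is ValueError.
        return False
```

Calling `model_available()` at the top of a run makes a missing server or model fail fast instead of partway through the PDFs.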

File Paths

PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = ".../signatures/rejected"

CV Detection Parameters (adjust to improve recall)

# In extract_signatures_hybrid.py, detect_signature_regions_cv()
MIN_CONTOUR_AREA = 5000      # ⬇️ Lower = catch smaller signatures
MAX_CONTOUR_AREA = 200000    # ⬆️ Higher = catch larger signatures
ASPECT_RATIO_MIN = 0.5       # ⬇️ Lower = catch taller signatures
ASPECT_RATIO_MAX = 10.0      # ⬆️ Higher = catch wider signatures
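
To make the effect of these four thresholds concrete, here is a minimal sketch of the filter they describe. It assumes aspect ratio means width divided by height and that area is measured in pixels, which matches the comments above but has not been checked against detect_signature_regions_cv(); `region_passes` is a hypothetical helper.

```python
MIN_CONTOUR_AREA = 5000
MAX_CONTOUR_AREA = 200000
ASPECT_RATIO_MIN = 0.5
ASPECT_RATIO_MAX = 10.0


def region_passes(width: int, height: int, area: float,
                  min_area: float = MIN_CONTOUR_AREA,
                  max_area: float = MAX_CONTOUR_AREA,
                  min_ratio: float = ASPECT_RATIO_MIN,
                  max_ratio: float = ASPECT_RATIO_MAX) -> bool:
    """Return True if a candidate region survives the area and shape filters."""
    if width <= 0 or height <= 0:
        return False
    if not (min_area <= area <= max_area):
        return False
    ratio = width / height  # assumed definition: wider regions have larger ratios
    return min_ratio <= ratio <= max_ratio


# A 60x50 region with area 3000 is dropped by the current minimum area,
# but kept if the minimum is lowered to 3000.
small_default = region_passes(60, 50, 3000)
small_relaxed = region_passes(60, 50, 3000, min_area=3000)
```

Trying candidate values through keyword arguments like this shows which of the missed signatures a looser threshold would admit before the constants in the script are changed.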

🎬 What Happened (Session History)

Approaches Tested (Chronological)

PDF Image Objects → Abandoned (extracted full pages, not signatures)
Simple Page Extraction → ✅ Working (extract pages from CSV)
Computer Vision Only → Insufficient (6,420 regions from 100 pages - too many)
VLM Coordinates → ❌ Failed (coordinates unreliable, extracted blank regions)
Hybrid Name-Based → ✅ Current (70% recall, 100% precision)

Key Decisions Made

✅ Use VLM for names, not coordinates
✅ Verify each region against expected names
✅ Save signatures with person names
✅ Reject regions that don't match any name
✅ Prevent duplicate signatures per person

Diagnostic Work Done

Created 11 diagnostic scripts to investigate VLM coordinate failure:

Visualized bounding boxes
Analyzed pixel content
Tested actual vs. reported locations
Confirmed coordinates 32% off on test file

All findings documented in PROJECT_DOCUMENTATION.md

🚧 Known Issues & Next Steps

Issue 1: 30% Missing Signatures

Status: Open

Options:

Widen CV parameter ranges (test with different thresholds)
Multi-pass detection with different kernels
Ask the VLM for help when signatures are missing
Manual review of rejected folder

Issue 2: Text Layer Method Untested

Status: Pending

Need: PDFs with searchable text to test Method A

Issue 3: Performance (24 sec/PDF)

Status: Acceptable for now

Future: Optimize if processing full 86K dataset

📊 Test Data Reference

Test Files Used (5 PDFs)

201301_1324_AI1_page3.pdf - ✅ Found 2/2: 楊智惠, 張志銘
201301_2061_AI1_page5.pdf - ⚠️ Found 1/2: 廖阿甚 (missing 林姿妤)
201301_2458_AI1_page4.pdf - ⚠️ Found 1/2: 周寶蓮 (missing 魏興海) ← VLM coordinate test file
201301_2923_AI1_page3.pdf - ⚠️ Found 1/2: 黄瑞展 (missing 陈丽琦)
201301_3189_AI1_page3.pdf - ✅ Found 2/2: 黄辉, 黄益辉
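
The per-file results above are enough to recompute recall and list what is missing. A minimal sketch, with the expected and found names copied from the list; `missing_by_file` and `recall` are illustrative helpers, not functions from the project scripts.

```python
from typing import Dict, List, Tuple

# (expected names, found names) per test file, copied from the list above.
RESULTS: Dict[str, Tuple[List[str], List[str]]] = {
    "201301_1324_AI1_page3.pdf": (["楊智惠", "張志銘"], ["楊智惠", "張志銘"]),
    "201301_2061_AI1_page5.pdf": (["廖阿甚", "林姿妤"], ["廖阿甚"]),
    "201301_2458_AI1_page4.pdf": (["周寶蓮", "魏興海"], ["周寶蓮"]),
    "201301_2923_AI1_page3.pdf": (["黄瑞展", "陈丽琦"], ["黄瑞展"]),
    "201301_3189_AI1_page3.pdf": (["黄辉", "黄益辉"], ["黄辉", "黄益辉"]),
}


def missing_by_file(results: Dict[str, Tuple[List[str], List[str]]]) -> Dict[str, List[str]]:
    """Map each file with gaps to the expected names that were not found."""
    gaps = {}
    for filename, (expected, found) in results.items():
        missing = [name for name in expected if name not in found]
        if missing:
            gaps[filename] = missing
    return gaps


def recall(results: Dict[str, Tuple[List[str], List[str]]]) -> float:
    """Found expected signatures divided by all expected signatures."""
    expected_total = sum(len(expected) for expected, _ in results.values())
    hits = sum(len(set(expected) & set(found)) for expected, found in results.values())
    return hits / expected_total if expected_total else 0.0
```

Keeping results in this shape makes it straightforward to rerun the same five files after a parameter change and compare the list of missing names.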

Output Location

/Volumes/NV2/PDF-Processing/signature-image-output/signatures/
├── 201301_1324_AI1_page3_signature_張志銘.png
├── 201301_1324_AI1_page3_signature_楊智惠.png
├── 201301_2061_AI1_page5_signature_廖阿甚.png
├── 201301_2458_AI1_page4_signature_周寶蓮.png
├── 201301_2923_AI1_page3_signature_黄瑞展.png
├── 201301_3189_AI1_page3_signature_黄辉.png
├── 201301_3189_AI1_page3_signature_黄益辉.png
└── rejected/ (non-signature regions)
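
The file names above follow the pattern of PDF stem, then `_signature_`, then the person's name. A minimal sketch of building such a path; `signature_path` is a hypothetical helper and may not match how the script constructs names internally.

```python
from pathlib import Path


def signature_path(pdf_path: str, person_name: str, output_dir: str) -> Path:
    """Build the output path for one verified signature image."""
    stem = Path(pdf_path).stem  # file name without directory or .pdf suffix
    return Path(output_dir) / f"{stem}_signature_{person_name}.png"
```

For example, `signature_path("201301_2458_AI1_page4.pdf", "周寶蓮", OUTPUT_PATH)` yields a path whose file name is `201301_2458_AI1_page4_signature_周寶蓮.png`.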

💡 How to Continue Work

Option 1: Improve Recall (Find Missing Signatures)

Goal: Get from 70% to 90%+ recall

Approach:

Read rejected folder to see if missing signatures were detected but rejected

Adjust CV parameters in detect_signature_regions_cv():

MIN_CONTOUR_AREA = 3000      # Lower threshold
MAX_CONTOUR_AREA = 300000    # Higher threshold

Test on same 5 PDFs and compare results
If recall improves without too many false positives, proceed

Files to edit:

extract_signatures_hybrid.py lines 178-214

Option 2: Scale Up Testing

Goal: Test on 100 PDFs to verify reliability

Approach:

Edit extract_signatures_hybrid.py line 425:

pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]

Run script (will take ~40 minutes)
Analyze results in log file
Calculate overall recall/precision

Option 3: Prepare for Production

Goal: Process all 86,073 files

Requirements:

Verify current approach is acceptable (70% recall OK?)
Estimate time: 86K files × 24 sec/file = ~24 days
Consider parallel processing or optimization
Set up monitoring and resume capability
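
The time estimate and the resume requirement can both be sketched in a few lines. This is a minimal sketch: `estimated_days` assumes throughput scales linearly with the number of workers, which may not hold if every worker shares one Ollama server, and `pending_files` assumes completed file names are recorded somewhere (for example a log file) and loaded into a set. Both helper names are illustrative.

```python
from typing import Iterable, List, Set

SECONDS_PER_DAY = 86_400


def estimated_days(n_files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Wall-clock days, assuming linear scaling with the number of workers."""
    return n_files * seconds_per_file / workers / SECONDS_PER_DAY


def pending_files(all_files: Iterable[str], done: Set[str]) -> List[str]:
    """Files still to process, sorted, skipping those already recorded as done."""
    return sorted(f for f in all_files if f not in done)


single = estimated_days(86_073, 24)              # one worker
parallel = estimated_days(86_073, 24, workers=4) # optimistic four-worker estimate
```

With one worker the estimate comes to roughly 23.9 days, consistent with the ~24 days quoted above; rerunning `pending_files` after an interruption gives the list to continue from.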

Option 4: Commit Current State

Goal: Save working solution to git

Steps:

Read COMMIT_SUMMARY.md
Review files to commit
Run verification checks
Execute git commands
Tag release: v1.0-hybrid-70percent

🔍 How to Debug Issues

If extraction fails:

# Check Ollama connection
curl http://192.168.30.36:11434/api/tags

# Check input PDFs exist
ls /Volumes/NV2/PDF-Processing/signature-image-output/*.pdf | head -5

# Run with single file for testing
python -c "from extract_signatures_hybrid import *; process_pdf_page('/path/to/test.pdf', OUTPUT_PATH)"

If too many false positives:

Increase MIN_CONTOUR_AREA (filter out small regions)
Decrease MAX_CONTOUR_AREA (filter out large regions)
Check rejected folder to verify they're actually non-signatures

If missing signatures:

Check rejected folder (might be detected but not verified)
Lower MIN_CONTOUR_AREA (catch smaller signatures)
Increase MAX_CONTOUR_AREA (catch larger signatures)
Widen aspect ratio range

📋 Session Handoff Checklist

When starting a new session, provide this context:

✅ Project Goal: Extract Chinese signatures from 86K PDFs
✅ Current Approach: Hybrid VLM name + CV detection + VLM verification
✅ Status: Working at 70% recall, 100% precision on 5 test files
✅ Key Context: VLM coordinates unreliable (32% offset), use names instead
✅ Key Files: extract_signatures_hybrid.py (main), PROJECT_DOCUMENTATION.md (history)
✅ Next Steps: Improve recall OR scale up testing OR commit to git

🎓 Important Lessons Learned

VLM spatial reasoning is unreliable - Don't trust percentage-based coordinates
VLM text recognition is excellent - Use for extracting names, not locations
Computer vision is precise - Use for pixel-level location detection
Name-based verification works - Filters false positives effectively
Diagnostic scripts are crucial - Helped discover coordinate offset issue
Conservative parameters - Better to miss signatures than get false positives

📞 Quick Reference

Most Important Command

python extract_signatures_hybrid.py  # Run signature extraction

Most Important File

less PROJECT_DOCUMENTATION.md  # Complete project history

Most Important Finding

VLM coordinates are unreliable → Use VLM for names, CV for locations

✨ Session Start Template

When starting a new session, say:

"I'm continuing work on the PDF signature extraction project. Please read /Volumes/NV2/pdf_recognize/SESSION_INIT.md and /Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md to understand the current state.

Current status: Working hybrid approach with 70% recall on 5 test files.

I want to: [choose one]

Improve recall by tuning CV parameters

Test on 100 PDFs to verify reliability

Commit current solution to git

Process full 86K dataset

Debug a specific issue: [describe]"

Document Created: October 26, 2025
Last Updated: October 26, 2025
Status: Ready for Next Session
Working Directory: /Volumes/NV2/pdf_recognize/
