Implement VLM name extraction + CV detection hybrid approach to
replace unreliable VLM coordinate system with name-based verification.
Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling
Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)
Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs
Known Limitations:
- 30% of signatures missed, likely because the CV parameters are too conservative
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF
Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)
🤖 Generated with Claude Code
Hybrid Signature Extraction
This script uses a hybrid approach combining VLM (Vision Language Model) name recognition with computer vision detection.
Key Innovation
Instead of relying on VLM's unreliable coordinate system, we:
- Use VLM for name extraction (what it's good at)
- Use computer vision for location detection (precise pixel-level detection)
- Use VLM for name-specific verification (matching signatures to people)
Workflow
┌─────────────────────────────────────────┐
│ Step 1: VLM extracts signature names │
│ Example: "周寶蓮", "魏興海" │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Step 2a: Search PDF text layer │
│ - If names found in PDF text objects │
│ - Use precise text coordinates │
│ - Expand region to capture nearby sig │
│ │
│ Step 2b: Fallback to Computer Vision │
│ - If no text layer or names not found │
│ - Use OpenCV to detect signature regions│
│ - Based on size, density, morphology │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Step 3: Extract all candidate regions │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Step 4: VLM verifies EACH region │
│ "Does this contain signature of: │
│ 周寶蓮, 魏興海?" │
│ │
│ - If matches: Save as signature_周寶蓮 │
│ - If duplicate: Reject │
│ - If no match: Move to rejected/ │
└─────────────────────────────────────────┘
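The bookkeeping in Step 4 can be sketched as follows. This is a minimal illustration, not the code in extract_signatures_hybrid.py: `assign_regions` and the `verify` callback are hypothetical names, and `verify` stands in for the VLM call, returning the matched name or None.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def assign_regions(
    regions: Iterable[str],
    expected_names: List[str],
    verify: Callable[[str, List[str]], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Map each expected name to the first region verified for it.

    `verify` is a hypothetical stand-in for the VLM verification call:
    it returns the matched name, or None when no expected name matches.
    """
    accepted: Dict[str, str] = {}  # person name -> region id
    rejected: List[str] = []       # non-matches and duplicates
    for region in regions:
        name = verify(region, expected_names)
        if name is None or name not in expected_names:
            rejected.append(region)   # no match: goes to rejected/
        elif name in accepted:
            rejected.append(region)   # duplicate of a saved signature
        else:
            accepted[name] = region   # first verified match wins
    return accepted, rejected
```

Keeping the first verified match per name is one possible policy; a real implementation might instead prefer the region the VLM scores highest.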
Advantages
✅ More reliable - Uses VLM for names, not unreliable coordinates
✅ Name-based verification - Matches specific signatures to specific people
✅ Prevents duplicates - Tracks which signatures already found
✅ Better organization - Files named by person: signature_周寶蓮.png
✅ Handles both scenarios - PDFs with/without text layer
✅ Fewer false positives - Only saves verified signatures
Configuration
Edit these values in extract_signatures_hybrid.py:
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"
OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
DPI = 300 # Resolution for PDF rendering
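As a rough sketch of how these settings might feed a verification request, the function below builds a request body for Ollama's /api/generate endpoint. The function name and the prompt wording are assumptions for illustration; the script's actual prompt is not shown in this document.

```python
import base64

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"


def build_verification_payload(image_bytes: bytes, expected_names: list) -> dict:
    """Build a JSON body for POST {OLLAMA_URL}/api/generate.

    The prompt text is an assumption, not the script's real prompt.
    """
    prompt = (
        "Does this image contain the handwritten signature of any of: "
        + ", ".join(expected_names)
        + "? Answer with the matching name, or 'none'."
    )
    return {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # request one complete JSON response
    }
```

The payload could then be sent with `requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=120)`, with the model's answer read from the `response` field of the returned JSON. The timeout value here is arbitrary.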
Usage
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_signatures_hybrid.py
Test Results (5 PDFs)
| File | Expected | Found | Names Extracted |
|---|---|---|---|
| 201301_1324_AI1_page3 | 2 | 2 ✓ | 楊智惠, 張志銘 |
| 201301_2061_AI1_page5 | 2 | 1 ⚠️ | 廖阿甚 (missing 林姿妤) |
| 201301_2458_AI1_page4 | 2 | 1 ⚠️ | 周寶蓮 (missing 魏興海) |
| 201301_2923_AI1_page3 | 2 | 1 ⚠️ | 黄瑞展 (missing 陈丽琦) |
| 201301_3189_AI1_page3 | 2 | 2 ✓ | 黄辉, 黄益辉 |
| Total | 10 | 7 | 70% recall |
Comparison with previous approach:
- Old VLM coordinate method: 44 extractions (many false positives, blank regions)
- New hybrid method: 7 extractions (all verified, no blank regions)
Why Some Signatures Are Missed
The current CV detection parameters may be too conservative:
# Filter by area (signatures are medium-sized)
if 5000 < area < 200000:  # May need adjustment
    # Filter by aspect ratio
    if 0.5 < aspect_ratio < 10:  # May need widening
Options to improve recall:
- Widen CV detection parameters (may increase false positives)
- Add multiple passes with different parameters
- Use VLM to suggest additional search regions if expected signatures not found
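The multi-pass option could look roughly like this. It is a sketch under assumptions: candidates are given as bounding boxes, area is taken as width times height, and aspect ratio as width over height, which may differ from how the script measures contours. The first pass mirrors the thresholds quoted above; the second pass values are illustrative and untuned.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels

# (min_area, max_area, min_aspect, max_aspect) per pass.
PASSES = [
    (5000, 200000, 0.5, 10.0),   # current thresholds
    (2500, 300000, 0.3, 15.0),   # hypothetical wider fallback
]


def filter_boxes(boxes: List[Box], min_area, max_area, min_aspect, max_aspect) -> List[Box]:
    """Keep boxes whose area and aspect ratio fall strictly inside the limits."""
    kept = []
    for x, y, w, h in boxes:
        if w <= 0 or h <= 0:
            continue  # skip degenerate boxes
        area = w * h
        aspect_ratio = w / h
        if min_area < area < max_area and min_aspect < aspect_ratio < max_aspect:
            kept.append((x, y, w, h))
    return kept


def detect_with_fallback(boxes: List[Box], expected_count: int) -> List[Box]:
    """Try each pass in order and stop once enough candidates are found."""
    kept: List[Box] = []
    for params in PASSES:
        kept = filter_boxes(boxes, *params)
        if len(kept) >= expected_count:
            break
    return kept
```

Because every candidate still goes through VLM verification, a wider second pass mainly costs extra VLM calls rather than precision.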
Output Files
Extracted Signatures
Location: /Volumes/NV2/PDF-Processing/signature-image-output/signatures/
Naming: {pdf_name}_signature_{person_name}.png
Examples:
201301_2458_AI1_page4_signature_周寶蓮.png
201301_1324_AI1_page3_signature_張志銘.png
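A small helper for the naming scheme might look like this. `signature_filename` is a hypothetical name, and replacing path separators is a defensive assumption; the script may sanitise names differently or not at all.

```python
from pathlib import Path


def signature_filename(pdf_path: str, person_name: str) -> str:
    """Return '{pdf_name}_signature_{person_name}.png' for a source PDF."""
    # Replace path separators so a name cannot escape the output folder.
    safe_name = person_name.replace("/", "_").replace("\\", "_").strip()
    return f"{Path(pdf_path).stem}_signature_{safe_name}.png"
```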
Rejected Regions
Location: /Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected/
Contains regions that:
- Don't match any expected signatures
- Are duplicates of already-found signatures
Log File
Location: /Volumes/NV2/PDF-Processing/signature-image-output/signatures/hybrid_extraction_log_YYYYMMDD_HHMMSS.csv
Columns:
- pdf_filename: Source PDF
- signatures_found: Number of verified signatures
- method_used: "text_layer" or "computer_vision"
- extracted_files: List of saved filenames
- error: Error message if any
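A log with these columns could be written with the standard csv module, as sketched below. `write_log` is a hypothetical helper, and joining the filename list with ";" is an assumption about how the list is stored in a single cell.

```python
import csv
import io

LOG_COLUMNS = ["pdf_filename", "signatures_found", "method_used",
               "extracted_files", "error"]


def write_log(rows, stream) -> None:
    """Write one CSV row per processed PDF to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=LOG_COLUMNS)
    writer.writeheader()
    for row in rows:
        record = dict(row)
        # Assumption: the filename list is stored as a ';'-joined string.
        record["extracted_files"] = ";".join(record.get("extracted_files", []))
        record.setdefault("error", "")  # empty when processing succeeded
        writer.writerow(record)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_log(
        [{"pdf_filename": "201301_2458_AI1_page4.pdf",
          "signatures_found": 1,
          "method_used": "computer_vision",
          "extracted_files": ["201301_2458_AI1_page4_signature_周寶蓮.png"]}],
        buffer,
    )
    print(buffer.getvalue())
```

When writing to a real file, opening it with `newline=""` and `encoding="utf-8"` keeps row endings and the Chinese names intact.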
Performance
- Processing speed: ~2-3 PDFs per minute (depends on VLM API latency)
- VLM calls per PDF: 1 (name extraction) + N (region verification)
- For 5 test PDFs: ~2 minutes total
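These figures allow a back-of-the-envelope estimate for larger runs. The sketch assumes strictly sequential processing and a constant ~24 seconds per PDF, both of which are simplifications; real VLM latency will vary.

```python
SECONDS_PER_PDF = 24    # approximate, from the 5-PDF test run
FULL_DATASET = 86_073   # files in the full dataset


def estimated_hours(num_files: int, seconds_per_pdf: float = SECONDS_PER_PDF) -> float:
    """Sequential processing time in hours at a constant per-PDF cost."""
    return num_files * seconds_per_pdf / 3600


if __name__ == "__main__":
    print(f"100 files:    {estimated_hours(100):.2f} h")
    print(f"full dataset: {estimated_hours(FULL_DATASET):.1f} h")
```

Under these assumptions, 100 files take about 40 minutes and the full dataset about 574 hours (roughly 24 days), which suggests batching or parallel VLM requests would be needed for the full run.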
Next Steps
To process the 100 files selected from the CSV instead of the 5 test PDFs:
# In extract_signatures_hybrid.py, change the [:5] slice to [:100]
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]  # Remove the slice to process every file
Troubleshooting
No signatures extracted:
- Check Ollama connection: curl http://192.168.30.36:11434/api/tags
- Verify PDF files exist in input directory
- Check if PDF is readable (not corrupted)
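The connection check can also confirm that the required model is installed. The helper below is a sketch with a hypothetical name; it inspects an already-parsed /api/tags response, which lists installed models under a `models` key.

```python
def model_available(tags_response: dict, model: str = "qwen2.5vl:32b") -> bool:
    """Check whether `model` is listed in a parsed /api/tags response."""
    models = tags_response.get("models", [])
    return any(entry.get("name") == model for entry in models)
```

The response itself could be fetched with `requests.get("http://192.168.30.36:11434/api/tags", timeout=10).json()`; a connection error at that point indicates the Ollama server is unreachable.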
Too many false positives:
- Tighten CV detection parameters (increase MIN_CONTOUR_AREA)
- Reduce MAX_CONTOUR_AREA
- Adjust aspect ratio filters
Missing expected signatures:
- Loosen CV detection parameters
- Check rejected folder to see if signature was detected but not verified
- Reduce minimum area threshold
- Increase maximum area threshold
Dependencies
- Python 3.9+
- PyMuPDF (fitz)
- OpenCV (cv2)
- NumPy
- Requests (for Ollama API)
- Ollama with qwen2.5vl:32b model