# PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.

## Quick Start

### Step 1: Extract Pages from CSV

```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
```

### Step 2: Extract Signatures

```bash
python extract_signatures_hybrid.py
```

## Documentation

- **[PROJECT_DOCUMENTATION.md](PROJECT_DOCUMENTATION.md)** - Complete project history, all approaches tested, detailed results
- **[README_page_extraction.md](README_page_extraction.md)** - Page extraction documentation
- **[README_hybrid_extraction.md](README_hybrid_extraction.md)** - Hybrid signature extraction documentation

## Current Performance

**Test Dataset:** 5 PDF pages

- **Signatures expected:** 10
- **Signatures found:** 7
- **Precision:** 100% (no false positives)
- **Recall:** 70%

## Key Features

- ✅ **Hybrid Approach:** VLM name extraction + CV detection + VLM verification
- ✅ **Name-Based:** Signatures saved as `signature_周寶蓮.png`
- ✅ **No False Positives:** Name-specific verification filters out dates, text, stamps
- ✅ **Duplicate Prevention:** Only one signature per person
- ✅ **Handles Both:** PDFs with/without text layer

## File Structure

```
extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide
```

## Requirements

- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434

## Data

- **Input:** `/Volumes/NV2/PDF-Processing/master_signatures.csv` (86,073 rows)
- **PDFs:** `/Volumes/NV2/PDF-Processing/total-pdf/batch_*/`
- **Output:** `/Volumes/NV2/PDF-Processing/signature-image-output/`

## Status

- ✅ Page extraction: Tested with 100 files, working
- ✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
- ⏳ Large-scale testing: Pending
- ⏳ Full dataset (86K files): Pending

See [PROJECT_DOCUMENTATION.md](PROJECT_DOCUMENTATION.md) for complete details.
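
## Appendix: How the Performance Figures Are Derived

The precision and recall reported for the 5-page test set follow from the raw counts (10 signatures expected, 7 found, no false positives). The sketch below only illustrates that arithmetic. The function name `precision_recall` is hypothetical and is not part of the project scripts, and it assumes every found signature that is not a false positive counts as a true positive.

```python
def precision_recall(found: int, false_positives: int, expected: int) -> tuple[float, float]:
    """Return (precision, recall) from raw detection counts.

    Assumption: every found signature that is not a false positive
    is a correct match for one of the expected signatures.
    """
    true_positives = found - false_positives
    # Precision: share of extracted signatures that are correct.
    precision = true_positives / found if found else 0.0
    # Recall: share of expected signatures that were extracted.
    recall = true_positives / expected if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Counts taken from the 5-page test set reported in this README.
    p, r = precision_recall(found=7, false_positives=0, expected=10)
    print(f"Precision: {p:.0%}, Recall: {r:.0%}")
```

With the README's counts this gives 7/7 for precision and 7/10 for recall, matching the reported 100% and 70%. The same calculation could be rerun once large-scale testing produces new counts.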