# Hybrid Signature Extraction

This script uses a **hybrid approach** that combines VLM (Vision Language Model) name recognition with computer vision detection.

## Key Innovation

Instead of relying on the VLM's unreliable coordinate output, the script:

1. **Uses the VLM for name extraction** (what it is good at)
2. **Uses computer vision for location detection** (precise pixel-level detection)
3. **Uses the VLM for name-specific verification** (matching signatures to people)

## Workflow

```
┌─────────────────────────────────────────┐
│ Step 1: VLM extracts signature names    │
│ Example: "周寶蓮", "魏興海"             │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Step 2a: Search PDF text layer          │
│ - If names found in PDF text objects    │
│ - Use precise text coordinates          │
│ - Expand region to capture nearby sig   │
│                                         │
│ Step 2b: Fallback to Computer Vision    │
│ - If no text layer or names not found   │
│ - Use OpenCV to detect signature regions│
│ - Based on size, density, morphology    │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Step 3: Extract all candidate regions   │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Step 4: VLM verifies EACH region        │
│ "Does this contain signature of:        │
│  周寶蓮, 魏興海?"                       │
│                                         │
│ - If matches: Save as signature_周寶蓮  │
│ - If duplicate: Reject                  │
│ - If no match: Move to rejected/        │
└─────────────────────────────────────────┘
```

## Advantages

- ✅ **More reliable** - Uses the VLM for names, not for unreliable coordinates
- ✅ **Name-based verification** - Matches specific signatures to specific people
- ✅ **Prevents duplicates** - Tracks which signatures have already been found
- ✅ **Better organization** - Files are named by person: `signature_周寶蓮.png`
- ✅ **Handles both scenarios** - PDFs with or without a text layer
- ✅ **Fewer false positives** - Only verified signatures are saved

## Configuration

Edit these values in `extract_signatures_hybrid.py`:

```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"
OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
DPI = 300  # Resolution for PDF rendering
```

## Usage

```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_signatures_hybrid.py
```

## Test Results (5 PDFs)

| File | Expected | Found | Names Extracted |
|------|----------|-------|----------------|
| 201301_1324_AI1_page3 | 2 | 2 ✓ | 楊智惠, 張志銘 |
| 201301_2061_AI1_page5 | 2 | 1 ⚠️ | 廖阿甚 (missing 林姿妤) |
| 201301_2458_AI1_page4 | 2 | 1 ⚠️ | 周寶蓮 (missing 魏興海) |
| 201301_2923_AI1_page3 | 2 | 1 ⚠️ | 黄瑞展 (missing 陈丽琦) |
| 201301_3189_AI1_page3 | 2 | 2 ✓ | 黄辉, 黄益辉 |
| **Total** | **10** | **7** | **70% recall** |

**Comparison with previous approach:**

- Old VLM coordinate method: 44 extractions (many false positives, blank regions)
- New hybrid method: 7 extractions (all verified, no blank regions)

## Why Some Signatures Are Missed

The current CV detection parameters may be too conservative:

```python
# Filter by area (signatures are medium-sized)
if 5000 < area < 200000:  # May need adjustment

# Filter by aspect ratio
if 0.5 < aspect_ratio < 10:  # May need widening
```

**Options to improve recall:**

1. Widen the CV detection parameters (may increase false positives)
2. Add multiple passes with different parameters
3. Use the VLM to suggest additional search regions when expected signatures are not found

## Output Files

### Extracted Signatures

Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/`

**Naming:** `{pdf_name}_signature_{person_name}.png`

Examples:

- `201301_2458_AI1_page4_signature_周寶蓮.png`
- `201301_1324_AI1_page3_signature_張志銘.png`

### Rejected Regions

Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected/`

Contains regions that:

- Don't match any expected signature
- Are duplicates of already-found signatures

### Log File

Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/hybrid_extraction_log_YYYYMMDD_HHMMSS.csv`

Columns:

- `pdf_filename` - Source PDF
- `signatures_found` - Number of verified signatures
- `method_used` - "text_layer" or "computer_vision"
- `extracted_files` - List of saved filenames
- `error` - Error message, if any

## Performance

- Processing speed: ~2-3 PDFs per minute (depends on VLM API latency)
- VLM calls per PDF: 1 (name extraction) + N (region verification, one per candidate region)
- For the 5 test PDFs: ~2 minutes total

## Next Steps

To process the full dataset (100 files from CSV):

```python
# Edit this line in extract_signatures_hybrid.py
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]  # Replaces [:5]; remove the slice to process all PDFs
```

## Troubleshooting

**No signatures extracted:**

- Check the Ollama connection: `curl http://192.168.30.36:11434/api/tags`
- Verify that PDF files exist in the input directory
- Check that the PDF is readable (not corrupted)

**Too many false positives:**

- Tighten the CV detection parameters (increase `MIN_CONTOUR_AREA`)
- Reduce `MAX_CONTOUR_AREA`
- Adjust the aspect ratio filters

**Missing expected signatures:**

- Loosen the CV detection parameters
- Check the rejected folder to see whether the signature was detected but not verified
- Reduce the minimum area threshold
- Increase the maximum area threshold

## Dependencies

- Python 3.9+
- PyMuPDF (fitz)
- OpenCV (cv2)
- NumPy
- Requests (for the Ollama API)
- Ollama with the qwen2.5vl:32b model
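
Option 2 under "Why Some Signatures Are Missed" (multiple passes with different parameters) could look roughly like the sketch below. It is not taken from `extract_signatures_hybrid.py`: `FilterParams`, `passes_filter`, and `multi_pass_select` are hypothetical names, the looser thresholds are illustrative only, and it assumes that `area` means bounding-box width times height and that `aspect_ratio` means width divided by height. The boxes are plain `(x, y, w, h)` tuples of the kind `cv2.boundingRect` returns, so the sketch itself needs only the standard library.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), e.g. from cv2.boundingRect


@dataclass(frozen=True)
class FilterParams:
    """Hypothetical container; defaults mirror the thresholds quoted in this README."""
    min_area: int = 5000
    max_area: int = 200000
    min_aspect: float = 0.5
    max_aspect: float = 10.0


def passes_filter(box: Box, p: FilterParams) -> bool:
    """Return True if the box satisfies both the area and aspect-ratio bounds."""
    _, _, w, h = box
    if w <= 0 or h <= 0:
        return False
    area = w * h          # assumption: bounding-box area, not contour area
    aspect_ratio = w / h  # assumption: width divided by height
    return (p.min_area < area < p.max_area
            and p.min_aspect < aspect_ratio < p.max_aspect)


def multi_pass_select(boxes: Sequence[Box],
                      passes: Sequence[FilterParams],
                      expected: int) -> List[Box]:
    """Try each parameter set in order (strictest first) and stop as soon as
    at least `expected` candidate boxes survive the filter."""
    selected: List[Box] = []
    for params in passes:
        selected = [b for b in boxes if passes_filter(b, params)]
        if len(selected) >= expected:
            break
    return selected


if __name__ == "__main__":
    strict = FilterParams()
    # Illustrative looser thresholds, not tuned on real data.
    loose = FilterParams(min_area=2000, max_area=400000,
                         min_aspect=0.3, max_aspect=15.0)
    boxes = [(100, 200, 300, 80),  # area 24000: passes the strict filter
             (500, 900, 90, 40)]   # area 3600: passes only the loose filter
    print(multi_pass_select(boxes, [strict, loose], expected=2))
```

Running the strict pass first keeps false positives low on pages where the default thresholds already work, and the looser pass only runs when fewer regions survive than the VLM reported names. Every surviving region would still go through the Step 4 verification, so the looser pass mainly costs extra VLM calls.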