Add hybrid signature extraction with name-based verification

Implement VLM name extraction + CV detection hybrid approach to replace unreliable VLM coordinate system with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code
2025-10-26 23:39:52 +08:00
commit 52612e14ba
14 changed files with 3583 additions and 0 deletions
--- a/README_hybrid_extraction.md
+++ b/README_hybrid_extraction.md
@@ -0,0 +1,179 @@
+# Hybrid Signature Extraction
+
+This script uses a **hybrid approach** combining VLM (Vision Language Model) name recognition with computer vision detection.
+
+## Key Innovation
+
+Instead of relying on VLM's unreliable coordinate system, we:
+1. **Use VLM for name extraction** (what it's good at)
+2. **Use computer vision for location detection** (precise pixel-level detection)
+3. **Use VLM for name-specific verification** (matching signatures to people)
+
+## Workflow
+
+```
+┌─────────────────────────────────────────┐
+│ Step 1: VLM extracts signature names   │
+│ Example: "周寶蓮", "魏興海"              │
+└─────────────────────────────────────────┘
+                   ↓
+┌─────────────────────────────────────────┐
+│ Step 2a: Search PDF text layer         │
+│ - If names found in PDF text objects   │
+│ - Use precise text coordinates          │
+│ - Expand region to capture nearby sig   │
+│                                          │
+│ Step 2b: Fallback to Computer Vision   │
+│ - If no text layer or names not found   │
+│ - Use OpenCV to detect signature regions│
+│ - Based on size, density, morphology    │
+└─────────────────────────────────────────┘
+                   ↓
+┌─────────────────────────────────────────┐
+│ Step 3: Extract all candidate regions   │
+└─────────────────────────────────────────┘
+                   ↓
+┌─────────────────────────────────────────┐
+│ Step 4: VLM verifies EACH region        │
+│ "Does this contain signature of:        │
+│  周寶蓮, 魏興海?"                         │
+│                                          │
+│ - If matches: Save as signature_周寶蓮   │
+│ - If duplicate: Reject                   │
+│ - If no match: Move to rejected/        │
+└─────────────────────────────────────────┘
+```
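The verification step in the diagram (match, duplicate, or no match) can be sketched as a small loop. This is a minimal illustration, not the code in `extract_signatures_hybrid.py`: the function name `assign_regions`, the `verify` callable, and the use of plain strings as region identifiers are all assumptions made for the example.

```python
from typing import Callable, Iterable, Optional


def assign_regions(
    regions: Iterable[str],
    expected_names: list[str],
    verify: Callable[[str, list[str]], Optional[str]],
) -> tuple[dict[str, str], list[str]]:
    """Map each expected name to at most one region; collect the rest.

    `verify` stands in for the VLM call (hypothetical interface): it returns
    the matched name for a region, or None when nothing matches.
    """
    saved: dict[str, str] = {}
    rejected: list[str] = []
    for region in regions:
        name = verify(region, expected_names)
        if name in expected_names and name not in saved:
            saved[name] = region      # first verified match for this person
        else:
            rejected.append(region)   # no match, or duplicate of a saved name
    return saved, rejected
```

With a stubbed `verify`, a second region that matches an already saved name ends up in the rejected list, which mirrors the "If duplicate: Reject" branch.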
+
+## Advantages
+
+- ✅ **More reliable** - Uses VLM for names, not unreliable coordinates
+- ✅ **Name-based verification** - Matches specific signatures to specific people
+- ✅ **Prevents duplicates** - Tracks which signatures already found
+- ✅ **Better organization** - Files named by person: `signature_周寶蓮.png`
+- ✅ **Handles both scenarios** - PDFs with/without text layer
+- ✅ **Fewer false positives** - Only saves verified signatures
+
+## Configuration
+
+Edit these values in `extract_signatures_hybrid.py`:
+
+```python
+PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
+OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
+REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"
+
+OLLAMA_URL = "http://192.168.30.36:11434"
+OLLAMA_MODEL = "qwen2.5vl:32b"
+
+DPI = 300  # Resolution for PDF rendering
+```
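The `DPI` setting determines the pixel size of each rendered page, and therefore the scale of any pixel-based area threshold. The sketch below shows the arithmetic only. PDF coordinates are in points (72 per inch), so the zoom factor is `DPI / 72`; in PyMuPDF such a factor is typically passed as `fitz.Matrix(zoom, zoom)` to `page.get_pixmap(matrix=...)`, though whether the script renders this way is an assumption. The helper names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def zoom_for_dpi(dpi: int) -> float:
    """Scale factor from PDF points to pixels at the given DPI."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at `dpi`."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)


# A 595 x 842 pt page (roughly A4) at 300 DPI:
print(rendered_size(595, 842))
```

Because region area grows with the square of the zoom factor, changing `DPI` without rescaling the area thresholds would shift which regions pass the CV filter.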
+
+## Usage
+
+```bash
+cd /Volumes/NV2/pdf_recognize
+source venv/bin/activate
+python extract_signatures_hybrid.py
+```
+
+## Test Results (5 PDFs)
+
+| File | Expected | Found | Names Extracted |
+|------|----------|-------|----------------|
+| 201301_1324_AI1_page3 | 2 | 2 ✓ | 楊智惠, 張志銘 |
+| 201301_2061_AI1_page5 | 2 | 1 ⚠️ | 廖阿甚 (missing 林姿妤) |
+| 201301_2458_AI1_page4 | 2 | 1 ⚠️ | 周寶蓮 (missing 魏興海) |
+| 201301_2923_AI1_page3 | 2 | 1 ⚠️ | 黄瑞展 (missing 陈丽琦) |
+| 201301_3189_AI1_page3 | 2 | 2 ✓ | 黄辉, 黄益辉 |
+| **Total** | **10** | **7** | **70% recall** |
+
+**Comparison with previous approach:**
+- Old VLM coordinate method: 44 extractions (many false positives, blank regions)
+- New hybrid method: 7 extractions (all verified, no blank regions)
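The 70% recall figure follows directly from the per-file counts in the table. A short sketch of that arithmetic, with the counts copied from the table (the `recall` helper is only for illustration):

```python
# (expected, found) per test page, copied from the results table
results = {
    "201301_1324_AI1_page3": (2, 2),
    "201301_2061_AI1_page5": (2, 1),
    "201301_2458_AI1_page4": (2, 1),
    "201301_2923_AI1_page3": (2, 1),
    "201301_3189_AI1_page3": (2, 2),
}


def recall(counts: dict[str, tuple[int, int]]) -> float:
    """Fraction of expected signatures that were actually extracted."""
    expected = sum(e for e, _ in counts.values())
    found = sum(f for _, f in counts.values())
    return found / expected


print(f"recall = {recall(results):.0%}")
```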
+
+## Why Some Signatures Are Missed
+
+The current CV detection parameters may be too conservative:
+
+```python
+# Filter by area (signatures are medium-sized)
+if 5000 < area < 200000:  # May need adjustment
+
+# Filter by aspect ratio
+if 0.5 < aspect_ratio < 10:  # May need widening
+```
+
+**Options to improve recall:**
+1. Widen CV detection parameters (may increase false positives)
+2. Add multiple passes with different parameters
+3. Use VLM to suggest additional search regions if expected signatures not found
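Option 2 (multiple passes) could look roughly like the sketch below. It works on plain bounding boxes so that it runs without OpenCV. The first parameter set mirrors the thresholds quoted above; the second, looser set is a hypothetical, untuned example. Computing area as `w * h` and aspect ratio as `w / h` is also an assumption, since the script may use contour area instead.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: int
    y: int
    w: int
    h: int


# (min_area, max_area, min_aspect, max_aspect)
PASSES = [
    (5000, 200000, 0.5, 10.0),   # thresholds quoted in this README
    (2000, 300000, 0.3, 15.0),   # hypothetical looser pass, not tuned
]


def keep(box: Box, min_area: float, max_area: float,
         min_aspect: float, max_aspect: float) -> bool:
    """Apply the area and aspect-ratio filters to one bounding box."""
    area = box.w * box.h
    aspect = box.w / box.h if box.h else 0.0
    return min_area < area < max_area and min_aspect < aspect < max_aspect


def multi_pass(boxes: list[Box], expected_count: int, passes=PASSES) -> list[Box]:
    """Try each parameter set in order; stop once enough candidates survive."""
    kept: list[Box] = []
    for params in passes:
        kept = [b for b in boxes if keep(b, *params)]
        if len(kept) >= expected_count:
            break
    return kept
```

Looser passes only run when the strict pass yields fewer candidates than the number of names the VLM reported, which limits the extra verification calls to pages that need them.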
+
+## Output Files
+
+### Extracted Signatures
+Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/`
+
+**Naming:** `{pdf_name}_signature_{person_name}.png`
+
+Examples:
+- `201301_2458_AI1_page4_signature_周寶蓮.png`
+- `201301_1324_AI1_page3_signature_張志銘.png`
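A minimal sketch of how a filename in this pattern can be built from the source PDF path and the verified name. The helper name `signature_filename` is hypothetical and the example path is made up.

```python
from pathlib import Path


def signature_filename(pdf_path: str, person_name: str) -> str:
    """Build `{pdf_name}_signature_{person_name}.png` from a PDF path."""
    return f"{Path(pdf_path).stem}_signature_{person_name}.png"


print(signature_filename("input/201301_2458_AI1_page4.pdf", "周寶蓮"))
```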
+
+### Rejected Regions
+Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected/`
+
+Contains regions that:
+- Don't match any expected signatures
+- Are duplicates of already-found signatures
+
+### Log File
+Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/hybrid_extraction_log_YYYYMMDD_HHMMSS.csv`
+
+Columns:
+- `pdf_filename` - Source PDF
+- `signatures_found` - Number of verified signatures
+- `method_used` - "text_layer" or "computer_vision"
+- `extracted_files` - List of saved filenames
+- `error` - Error message if any
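For reference, a log with these columns can be written with the standard `csv` module. This is a sketch under assumptions: the helper names are hypothetical, and joining `extracted_files` with `"; "` is a guess, since the README does not say how the list is serialized.

```python
import csv
import io
from datetime import datetime

FIELDS = ["pdf_filename", "signatures_found", "method_used", "extracted_files", "error"]


def log_filename(now: datetime) -> str:
    """Timestamped name in the hybrid_extraction_log_YYYYMMDD_HHMMSS.csv pattern."""
    return f"hybrid_extraction_log_{now:%Y%m%d_%H%M%S}.csv"


def write_log(rows: list[dict], stream) -> None:
    """Write a header plus one row per processed PDF."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        row = dict(row)
        # Assumption: the list of saved files is flattened into one cell.
        row["extracted_files"] = "; ".join(row.get("extracted_files", []))
        writer.writerow(row)


buf = io.StringIO()
write_log([{
    "pdf_filename": "201301_2458_AI1_page4.pdf",
    "signatures_found": 1,
    "method_used": "computer_vision",
    "extracted_files": ["201301_2458_AI1_page4_signature_周寶蓮.png"],
    "error": "",
}], buf)
print(buf.getvalue())
```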
+
+## Performance
+
+- Processing speed: ~2-3 PDFs per minute (depends on VLM API latency)
+- VLM calls per PDF: 1 (name extraction) + N (region verification)
+- For 5 test PDFs: ~2 minutes total
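These figures support a rough runtime estimate for larger runs. The sketch below assumes strictly serial processing at the ~24 seconds per PDF reported in the commit message, and uses the 86,073-file dataset size from the same message. Actual throughput depends on VLM latency and on how many candidate regions each page produces.

```python
SECONDS_PER_PDF = 24.0  # approximate figure from the commit message


def vlm_calls(num_regions: int) -> int:
    """One name-extraction call plus one verification call per candidate region."""
    return 1 + num_regions


def estimate_seconds(num_pdfs: int, seconds_per_pdf: float = SECONDS_PER_PDF) -> float:
    """Serial runtime estimate; ignores retries and failures."""
    return num_pdfs * seconds_per_pdf


for n in (5, 100, 86_073):
    secs = estimate_seconds(n)
    print(f"{n:>6} PDFs: {secs / 60:8.1f} min  ({secs / 86_400:5.1f} days)")
```

At this rate the 5-file test takes about 2 minutes, and the full dataset would take roughly 24 days of continuous serial processing.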
+
+## Next Steps
+
+To process the 100 files extracted from the CSV:
+
+```python
+# Edit this line in extract_signatures_hybrid.py (requires `from pathlib import Path`)
+pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]  # Test run used [:5]; remove the slice to process all files
+```
+
+## Troubleshooting
+
+**No signatures extracted:**
+- Check Ollama connection: `curl http://192.168.30.36:11434/api/tags`
+- Verify PDF files exist in input directory
+- Check if PDF is readable (not corrupted)
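The connection check can also be done from Python, which makes it possible to confirm that the configured model is actually installed. A minimal sketch using only the standard library: the helper names are hypothetical, and it relies on Ollama's `/api/tags` endpoint returning a JSON object with a `models` list whose entries carry a `name` field.

```python
import json
import urllib.request


def fetch_tags(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/api/tags and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_available(tags: dict, model: str) -> bool:
    """True if `model` appears in the server's list of installed models."""
    return any(m.get("name") == model for m in tags.get("models", []))


# Example (requires network access to the Ollama host):
# tags = fetch_tags("http://192.168.30.36:11434")
# print(model_available(tags, "qwen2.5vl:32b"))
```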
+
+**Too many false positives:**
+- Tighten CV detection parameters (increase `MIN_CONTOUR_AREA`)
+- Reduce `MAX_CONTOUR_AREA`
+- Adjust aspect ratio filters
+
+**Missing expected signatures:**
+- Loosen CV detection parameters
+- Check rejected folder to see if signature was detected but not verified
+- Reduce minimum area threshold
+- Increase maximum area threshold
+
+## Dependencies
+
+- Python 3.9+
+- PyMuPDF (fitz)
+- OpenCV (cv2)
+- NumPy
+- Requests (for Ollama API)
+- Ollama with qwen2.5vl:32b model