Add hybrid signature extraction with name-based verification

Implement VLM name extraction + CV detection hybrid approach to replace unreliable VLM coordinate system with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code
2025-10-26 23:39:52 +08:00
commit 52612e14ba
14 changed files with 3583 additions and 0 deletions
@@ -0,0 +1,35 @@
+I'm continuing work on the PDF signature extraction project at /Volumes/NV2/pdf_recognize/
+
+Please read these files to understand the current state:
+1. /Volumes/NV2/pdf_recognize/SESSION_INIT.md (start here)
+2. /Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md (complete history)
+
+Key context:
+- Working hybrid approach: VLM name extraction + CV detection + VLM verification
+- Test results: 70% recall, 100% precision (5 PDFs tested)
+- Important: VLM coordinates are unreliable (32% offset discovered), so we match on names instead
+- Current script: extract_signatures_hybrid.py
+
+I want to: [CHOOSE ONE OR DESCRIBE YOUR GOAL]
+
+Option A: Improve recall from 70% to 90%+
+- Tune CV detection parameters to catch more signatures
+- Check whether the missed signatures ended up in the rejected folder
+
+Option B: Scale up testing to 100 PDFs
+- Verify reliability on larger dataset
+- Analyze results and calculate overall metrics
+
+Option C: Commit current solution to git
+- Follow instructions in COMMIT_SUMMARY.md
+- Tag release as v1.0-hybrid-70percent
+
+Option D: Process full dataset (86,073 files)
+- Estimate time and optimize if needed
+- Set up monitoring and resume capability
+
+Option E: Debug specific issue
+- [Describe the issue you're encountering]
+
+Option F: Other
+- [Describe what you want to work on]
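
For reference when weighing Option A against Option D, the figures quoted in this commit (7 of 10 signatures extracted, no false positives, ~24 seconds per PDF, 86,073 files) can be turned into recall, precision, and a rough runtime estimate. This is a minimal sketch, not part of the committed scripts: the function names and the `workers` parameter are hypothetical, and the runtime figure assumes the ~24 s per-PDF cost holds across the whole dataset and that work splits evenly across workers.

```python
def recall(found: int, expected: int) -> float:
    """Fraction of expected signatures that were actually extracted."""
    return found / expected


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of extracted regions that were real signatures."""
    return true_positives / (true_positives + false_positives)


def estimated_hours(n_files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Rough wall-clock estimate; `workers` is a hypothetical parallelism factor."""
    return n_files * seconds_per_file / workers / 3600


if __name__ == "__main__":
    # Figures taken from the commit message above.
    print(f"recall:    {recall(7, 10):.0%}")
    print(f"precision: {precision(7, 0):.0%}")
    hours = estimated_hours(86_073, 24)
    print(f"full run:  {hours:.1f} h (~{hours / 24:.1f} days) on one worker")
```

At the quoted rate a single sequential pass over the full dataset would take roughly 24 days, which is why Option D pairs the run with monitoring and resume capability.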