Add hybrid signature extraction with name-based verification

Implement VLM name extraction + CV detection hybrid approach to replace
unreliable VLM coordinate system with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code
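The name-based verification flow listed under Key Features could be sketched roughly as below. This is a minimal illustration, not the code in extract_signatures_hybrid.py: the function names, the `Region` box format, and the `verify` callback (standing in for the VLM check) are all hypothetical. It shows only the matching logic (one region per expected name, duplicates and unmatched regions rejected) and the recall arithmetic behind the reported 7/10 figure.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical box format: (x, y, width, height) in page pixels.
Region = Tuple[int, int, int, int]


def match_regions_to_names(
    expected_names: Iterable[str],
    regions: Iterable[Region],
    verify: Callable[[Region, List[str]], Optional[str]],
) -> Tuple[Dict[str, Region], List[Region]]:
    """Assign each candidate region to at most one expected name.

    `verify` stands in for the VLM call: given a region and the names
    still unmatched, it returns the matching name or None.
    """
    remaining = list(dict.fromkeys(expected_names))  # de-duplicate, keep order
    accepted: Dict[str, Region] = {}
    rejected: List[Region] = []
    for region in regions:
        name = verify(region, remaining) if remaining else None
        if name in remaining:
            accepted[name] = region   # first verified region wins
            remaining.remove(name)    # duplicate prevention
        else:
            rejected.append(region)   # blank, unknown, or already-matched name
    return accepted, rejected


def output_filename(name: str) -> str:
    """File name pattern taken from the commit message."""
    return f"signature_{name}.png"


def recall(found: int, expected: int) -> float:
    """Fraction of expected signatures that were extracted."""
    return found / expected if expected else 0.0
```

With a stub `verify` that looks regions up in a dictionary, a second region claiming an already-matched name ends up in the rejected list, and `recall(7, 10)` gives the 70% figure from the test results.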
2025-10-26 23:39:52 +08:00
commit 52612e14ba
14 changed files with 3583 additions and 0 deletions
--- a/HOW_TO_CONTINUE.txt
+++ b/HOW_TO_CONTINUE.txt
@@ -0,0 +1,53 @@
+╔═══════════════════════════════════════════════════════════════╗
+║         PDF SIGNATURE EXTRACTION - SESSION HANDOFF            ║
+╚═══════════════════════════════════════════════════════════════╝
+
+📂 FOR YOUR NEXT SESSION:
+
+  1️⃣  Copy this prompt:
+      cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt
+
+  2️⃣  Paste it into a new Claude Code session
+
+  3️⃣  Claude will read:
+      ✅ SESSION_INIT.md (quick start)
+      ✅ PROJECT_DOCUMENTATION.md (complete history)
+
+═══════════════════════════════════════════════════════════════
+
+📋 QUICK REFERENCE:
+
+  Current Status:  ✅ Working (70% recall, 100% precision)
+  Main Script:     extract_signatures_hybrid.py
+  Test Results:    7/10 signatures found (5 PDFs tested)
+  Key Finding:     VLM coordinates unreliable → use names instead
+
+═══════════════════════════════════════════════════════════════
+
+🎯 WHAT YOU CAN ASK CLAUDE TO DO:
+
+  Option A: Improve recall to 90%+ (tune parameters)
+  Option B: Test on 100 PDFs (verify reliability)
+  Option C: Commit to git (save working solution)
+  Option D: Process 86K files (full production run)
+  Option E: Debug issue (specific problem)
+
+═══════════════════════════════════════════════════════════════
+
+📄 FILES CREATED FOR YOU:
+
+  SESSION_INIT.md          → Quick project overview & how to continue
+  NEW_SESSION_PROMPT.txt   → Copy-paste prompt for next session
+  PROJECT_DOCUMENTATION.md → Complete history (24KB, READ THIS!)
+  COMMIT_SUMMARY.md        → Git commit instructions
+  README.md                → Quick start guide
+
+═══════════════════════════════════════════════════════════════
+
+✨ NEXT SESSION COMMAND:
+
+  cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt
+
+  Then paste the output into a new Claude Code session!
+
+═══════════════════════════════════════════════════════════════