Add hybrid signature extraction with name-based verification
Implement a hybrid approach that combines VLM name extraction with CV
region detection, replacing the unreliable VLM coordinate output with
name-based verification.
Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling
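The feature list above can be sketched as a small matching loop. This is a minimal illustration, not the code in extract_signatures_hybrid.py: the names `match_regions_to_names`, `output_filename`, the `Region` box layout, and the `verify` callback are all hypothetical stand-ins for the VLM and CV steps.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical region format: (x, y, width, height) in page pixels.
Region = Tuple[int, int, int, int]


def match_regions_to_names(
    expected_names: Iterable[str],
    regions: Iterable[Region],
    verify: Callable[[Region, List[str]], Optional[str]],
) -> Dict[str, Region]:
    """Assign each candidate region to at most one expected name.

    `verify` stands in for the VLM check: it receives a region and the
    names still unmatched, and returns the matching name or None.
    """
    remaining = list(expected_names)
    matched: Dict[str, Region] = {}
    for region in regions:
        if not remaining:
            break  # every expected signer already has a signature
        name = verify(region, remaining)
        if name is None or name not in remaining:
            continue  # rejection handling and duplicate prevention
        matched[name] = region
        remaining.remove(name)
    return matched


def output_filename(name: str) -> str:
    """Build the per-person file name, e.g. signature_周寶蓮.png."""
    return f"signature_{name}.png"
```

Removing a name from `remaining` once it is matched is what keeps a second region from being saved for the same person; a region the verifier does not recognize is simply skipped.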
Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)
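For reference, the recall and precision figures above follow from the standard definitions. The sketch below assumes the 7 extracted signatures are the true positives, the 3 missed ones are false negatives, and there are no false positives; the helper name `precision_recall` is illustrative only.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts taken from the test results: 7 found, 0 false positives, 3 missed.
precision, recall = precision_recall(tp=7, fp=0, fn=3)
print(f"precision={precision:.0%} recall={recall:.0%}")
```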
Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs
Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF
Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)
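A rough estimate of the full run helps when planning these steps. The sketch below assumes the ~24 seconds per PDF measured on the test set holds across all 86,073 files and that processing is sequential; the `workers` parameter is a hypothetical knob for parallel runs and assumes perfect scaling, which this commit does not implement or measure.

```python
from typing import Tuple


def estimate_runtime(n_files: int, seconds_per_file: float, workers: int = 1) -> Tuple[float, float]:
    """Return the estimated wall-clock time as (hours, days)."""
    total_seconds = n_files * seconds_per_file / workers
    return total_seconds / 3600, total_seconds / 86400


hours, days = estimate_runtime(86_073, 24)
print(f"~{hours:.0f} hours (~{days:.1f} days) sequentially")
```

Under those assumptions a single sequential pass takes about 574 hours, or roughly 24 days, so per-file speed or parallelism is worth addressing before the production run.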
🤖 Generated with Claude Code
HOW_TO_CONTINUE.txt
Normal file
@@ -0,0 +1,53 @@
╔═══════════════════════════════════════════════════════════════╗
║ PDF SIGNATURE EXTRACTION - SESSION HANDOFF ║
╚═══════════════════════════════════════════════════════════════╝

📂 FOR YOUR NEXT SESSION:

1️⃣ Copy this prompt:
cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt

2️⃣ Paste to new Claude Code session

3️⃣ Claude will read:
✅ SESSION_INIT.md (quick start)
✅ PROJECT_DOCUMENTATION.md (complete history)

═══════════════════════════════════════════════════════════════

📋 QUICK REFERENCE:

Current Status: ✅ Working (70% recall, 100% precision)
Main Script: extract_signatures_hybrid.py
Test Results: 7/10 signatures found (5 PDFs tested)
Key Finding: VLM coordinates unreliable → use names instead

═══════════════════════════════════════════════════════════════

🎯 WHAT YOU CAN ASK CLAUDE TO DO:

Option A: Improve recall to 90%+ (tune parameters)
Option B: Test on 100 PDFs (verify reliability)
Option C: Commit to git (save working solution)
Option D: Process 86K files (full production run)
Option E: Debug issue (specific problem)

═══════════════════════════════════════════════════════════════

📄 FILES CREATED FOR YOU:

SESSION_INIT.md → Quick project overview & how to continue
NEW_SESSION_PROMPT.txt → Copy-paste prompt for next session
PROJECT_DOCUMENTATION.md → Complete history (24KB, READ THIS!)
COMMIT_SUMMARY.md → Git commit instructions
README.md → Quick start guide

═══════════════════════════════════════════════════════════════

✨ NEXT SESSION COMMAND:

cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt

Then paste output to new Claude Code session!

═══════════════════════════════════════════════════════════════