Implement VLM name extraction + CV detection hybrid approach to
replace unreliable VLM coordinate system with name-based verification.
Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling
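The name-based matching described above can be sketched roughly as follows. This is a minimal illustration, not the code in extract_signatures_hybrid.py: the `Region` tuple, `match_regions_to_names`, `output_path`, and the `verify` callback (a stand-in for the VLM verification call) are all hypothetical names, and "first match wins" is an assumed duplicate-prevention policy.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional

# Hypothetical region format: (x, y, width, height) in page pixels.
Region = tuple[int, int, int, int]


def match_regions_to_names(
    regions: Iterable[Region],
    expected_names: list[str],
    verify: Callable[[Region, list[str]], Optional[str]],
) -> dict[str, Region]:
    """Assign each expected name to at most one verified region.

    `verify` stands in for the VLM check: given a region and the names
    still unmatched, it returns the matching name or None to reject.
    """
    matched: dict[str, Region] = {}
    for region in regions:
        remaining = [n for n in expected_names if n not in matched]
        if not remaining:
            break  # every expected name already has a signature
        name = verify(region, remaining)
        if name is None or name not in remaining:
            continue  # rejected region, or an unexpected name
        matched[name] = region  # assumed policy: first match wins
    return matched


def output_path(out_dir: Path, name: str) -> Path:
    """Build the per-person filename, e.g. signature_周寶蓮.png."""
    return out_dir / f"signature_{name}.png"


def fake_verify(region: Region, names: list[str]) -> Optional[str]:
    """Stub verifier for demonstration only; a real one would query the VLM."""
    lookup = {
        (0, 0, 10, 10): "周寶蓮",
        (5, 5, 10, 10): "周寶蓮",  # second candidate for the same person
        (20, 20, 10, 10): "魏興海",
    }
    name = lookup.get(region)
    return name if name in names else None


candidates = [(0, 0, 10, 10), (5, 5, 10, 10), (20, 20, 10, 10), (40, 40, 10, 10)]
result = match_regions_to_names(candidates, ["周寶蓮", "魏興海"], fake_verify)
for person, box in result.items():
    print(output_path(Path("signatures"), person), box)
```

Passing only the still-unmatched names to the verifier is what keeps a second region from being saved under a name that already has a signature.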
Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)
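For reference, the recall and precision figures follow from the counts above, assuming 7 true positives, 0 false positives, and 3 missed signatures (false negatives). The helper name `precision_recall` is illustrative only.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall); a zero denominator yields 0.0."""
    predicted = true_pos + false_pos   # everything the pipeline extracted
    actual = true_pos + false_neg      # everything that should have been found
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Counts from the 5-page test: 7 of 10 signatures extracted, no false positives.
precision, recall = precision_recall(true_pos=7, false_pos=0, false_neg=3)
print(f"precision={precision:.0%} recall={recall:.0%}")
```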
Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs
Known Limitations:
- 3 of 10 signatures (30%) missed, attributed to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF
Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)
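A back-of-the-envelope estimate for the full run, assuming the ~24 seconds per PDF measured above holds across all 86,073 files. The `workers` parameter is a hypothetical knob that assumes perfectly parallel processing, which nothing in the test results confirms.

```python
SECONDS_PER_PDF = 24      # approximate figure from the test run
TOTAL_FILES = 86_073      # size of the full dataset


def estimated_days(files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Estimate wall-clock days, assuming ideal scaling across workers."""
    total_seconds = files * seconds_per_file
    return total_seconds / workers / 86_400  # 86,400 seconds per day


total_seconds = TOTAL_FILES * SECONDS_PER_PDF
sequential_days = estimated_days(TOTAL_FILES, SECONDS_PER_PDF)
print(f"{total_seconds:,} s total, about {sequential_days:.1f} days sequentially")
print(f"about {estimated_days(TOTAL_FILES, SECONDS_PER_PDF, workers=4):.1f} days with 4 workers")
```

At roughly 24 days of sequential processing, some form of batching or parallelism is worth considering before the production run.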
🤖 Generated with Claude Code
54 lines
2.7 KiB
Plaintext
╔═══════════════════════════════════════════════════════════════╗
║          PDF SIGNATURE EXTRACTION - SESSION HANDOFF           ║
╚═══════════════════════════════════════════════════════════════╝

📂 FOR YOUR NEXT SESSION:

1️⃣ Copy this prompt:
   cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt

2️⃣ Paste to new Claude Code session

3️⃣ Claude will read:
   ✅ SESSION_INIT.md (quick start)
   ✅ PROJECT_DOCUMENTATION.md (complete history)

═══════════════════════════════════════════════════════════════

📋 QUICK REFERENCE:

Current Status: ✅ Working (70% recall, 100% precision)
Main Script:    extract_signatures_hybrid.py
Test Results:   7/10 signatures found (5 PDFs tested)
Key Finding:    VLM coordinates unreliable → use names instead

═══════════════════════════════════════════════════════════════

🎯 WHAT YOU CAN ASK CLAUDE TO DO:

Option A: Improve recall to 90%+ (tune parameters)
Option B: Test on 100 PDFs (verify reliability)
Option C: Commit to git (save working solution)
Option D: Process 86K files (full production run)
Option E: Debug issue (specific problem)

═══════════════════════════════════════════════════════════════

📄 FILES CREATED FOR YOU:

SESSION_INIT.md          → Quick project overview & how to continue
NEW_SESSION_PROMPT.txt   → Copy-paste prompt for next session
PROJECT_DOCUMENTATION.md → Complete history (24KB, READ THIS!)
COMMIT_SUMMARY.md        → Git commit instructions
README.md                → Quick start guide

═══════════════════════════════════════════════════════════════

✨ NEXT SESSION COMMAND:

   cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt

Then paste output to new Claude Code session!

═══════════════════════════════════════════════════════════════