pdf_signature_extraction/HOW_TO_CONTINUE.txt
gbanyan 52612e14ba Add hybrid signature extraction with name-based verification
Implement a hybrid approach that combines VLM (vision-language model) name
extraction with CV (computer vision) region detection, replacing the
unreliable VLM coordinate output with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling
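
The verification loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in extract_signatures_hybrid.py: the names match_regions_to_names, output_filename, and the verify callback are hypothetical, and the VLM call is assumed to return either one of the candidate names or None for a rejected region.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical region type: (x, y, width, height) of a candidate box.
Region = Tuple[int, int, int, int]


def match_regions_to_names(
    expected_names: Iterable[str],
    regions: Iterable[Region],
    verify: Callable[[Region, List[str]], Optional[str]],
) -> Dict[str, Region]:
    """Assign each expected name to at most one verified region.

    `verify` stands in for the VLM check (an assumption): it receives a
    region and the names still unmatched, and returns the matching name,
    or None to reject the region.
    """
    remaining = list(dict.fromkeys(expected_names))  # de-duplicate, keep order
    matched: Dict[str, Region] = {}
    for region in regions:
        if not remaining:
            break  # every expected signature has been found
        name = verify(region, remaining)
        if name is None or name not in remaining:
            continue  # rejected region, or a duplicate of a saved name
        matched[name] = region
        remaining.remove(name)
    return matched


def output_filename(name: str) -> str:
    """Build the per-person file name, e.g. signature_周寶蓮.png."""
    return f"signature_{name}.png"
```

Removing a name from the remaining list once it is matched is one simple way to get duplicate prevention; the actual script may handle duplicates differently.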

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)
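
For reference, the recall and precision figures above follow from the counts as shown in this small sketch. The function name precision_recall is illustrative; the counts (7 extracted, 10 expected, 0 false positives) are taken from the results listed here.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall); a ratio with no samples is reported as 0.0."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# 7 of 10 expected signatures extracted, none of them false positives.
precision, recall = precision_recall(true_pos=7, false_pos=0, false_neg=3)
print(f"precision={precision:.0%} recall={recall:.0%}")
# prints: precision=100% recall=70%
```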

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)
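
As a rough planning aid for the full run, the sketch below estimates wall-clock time from the figures above. It assumes the ~24 seconds per PDF holds across the whole dataset and that files are processed one at a time; the workers parameter is a hypothetical knob that assumes perfect parallel scaling, which real runs are unlikely to achieve.

```python
SECONDS_PER_PDF = 24   # approximate figure from Known Limitations
TOTAL_FILES = 86_073   # full dataset size from Next Steps


def estimate_hours(n_files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Estimate wall-clock hours, assuming a constant per-file cost."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return n_files * seconds_per_file / 3600 / workers


hours = estimate_hours(TOTAL_FILES, SECONDS_PER_PDF)
print(f"{hours:.1f} hours (~{hours / 24:.1f} days) sequentially")
# prints: 573.8 hours (~23.9 days) sequentially
```

Under these assumptions a sequential run takes a little over three weeks, which suggests checking throughput on the 100+ file test before committing to the full dataset.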

🤖 Generated with Claude Code
2025-10-26 23:39:52 +08:00


╔═══════════════════════════════════════════════════════════════╗
║ PDF SIGNATURE EXTRACTION - SESSION HANDOFF ║
╚═══════════════════════════════════════════════════════════════╝
📂 FOR YOUR NEXT SESSION:
1. Print this prompt and copy it:
cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt
2. Paste it into a new Claude Code session
3. Claude will then read:
✅ SESSION_INIT.md (quick start)
✅ PROJECT_DOCUMENTATION.md (complete history)
═══════════════════════════════════════════════════════════════
📋 QUICK REFERENCE:
Current Status: ✅ Working (70% recall, 100% precision)
Main Script: extract_signatures_hybrid.py
Test Results: 7/10 signatures found (5 PDFs tested)
Key Finding: VLM coordinates unreliable → use names instead
═══════════════════════════════════════════════════════════════
🎯 WHAT YOU CAN ASK CLAUDE TO DO:
Option A: Improve recall to 90%+ (tune parameters)
Option B: Test on 100 PDFs (verify reliability)
Option C: Commit to git (save working solution)
Option D: Process 86K files (full production run)
Option E: Debug issue (specific problem)
═══════════════════════════════════════════════════════════════
📄 FILES CREATED FOR YOU:
SESSION_INIT.md → Quick project overview & how to continue
NEW_SESSION_PROMPT.txt → Copy-paste prompt for next session
PROJECT_DOCUMENTATION.md → Complete history (24KB, READ THIS!)
COMMIT_SUMMARY.md → Git commit instructions
README.md → Quick start guide
═══════════════════════════════════════════════════════════════
✨ NEXT SESSION COMMAND:
cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt
Then paste the output into a new Claude Code session.
═══════════════════════════════════════════════════════════════