Add hybrid signature extraction with name-based verification

Implement VLM name extraction + CV detection hybrid approach to
replace unreliable VLM coordinate system with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code

SESSION_CHECKLIST.md
@@ -0,0 +1,195 @@
# Session Handoff Checklist ✓

## Before You Exit This Session

- [x] All documentation written
- [x] Test results recorded (7/10 signatures, 70% recall)
- [x] Session initialization files created
- [x] .gitignore configured
- [x] Commit guide prepared
- [ ] **Git commit performed** (waiting for user approval)

## Files Created for Next Session

### Essential Files ⭐
- [x] **SESSION_INIT.md** - Read this first in next session
- [x] **NEW_SESSION_PROMPT.txt** - Copy-paste prompt template
- [x] **PROJECT_DOCUMENTATION.md** - Complete 24KB history
- [x] **HOW_TO_CONTINUE.txt** - Visual guide

### Supporting Files
- [x] README.md - Quick start guide
- [x] COMMIT_SUMMARY.md - Git instructions
- [x] README_page_extraction.md - Page extraction docs
- [x] README_hybrid_extraction.md - Signature extraction docs
- [x] .gitignore - Configured properly

### Working Scripts
- [x] extract_pages_from_csv.py - Tested (100 files)
- [x] extract_signatures_hybrid.py - Tested (5 files, 70% recall)
- [x] extract_handwriting.py - Component script

## What's Working ✅

| Component | Status | Details |
|-----------|--------|---------|
| Page extraction | ✅ Working | 100 files tested |
| VLM name extraction | ✅ Working | 100% accurate on 5 files |
| CV detection | ⚠️ Conservative | Finds 70% of signatures |
| VLM verification | ✅ Working | 100% precision, no false positives |
| Overall system | ✅ Working | 70% recall, 100% precision |
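
The recall and precision figures in the table follow from the counts recorded in this checklist: 7 signatures extracted and verified, 3 missed, no false positives. A minimal sketch of that arithmetic; the function name `precision_recall` is illustrative and is not part of the project scripts.

```python
from typing import Tuple


def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall) as fractions in [0, 1]."""
    predicted = true_positives + false_positives   # everything the system saved
    actual = true_positives + false_negatives      # everything it should have saved
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Counts from the 5-file test: 7 verified extractions, 0 false positives, 3 missed.
precision, recall = precision_recall(true_positives=7, false_positives=0,
                                     false_negatives=3)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=100% recall=70%
```

With only 10 expected signatures in the test set, each additional hit or miss moves recall by 10 percentage points, so these figures are best read as indicative until the 100+ file test is run.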

## What's Not Working / Unknown ⚠️

| Issue | Status | Next Steps |
|-------|--------|------------|
| Missing 30% signatures | Known | Tune CV parameters |
| Text layer method | Untested | Need PDFs with text |
| Large-scale performance | Unknown | Test with 100+ files |
| Full dataset (86K) | Unknown | Estimate time & optimize |

## Critical Context to Remember 🧠

1. **VLM coordinates are unreliable** (32% offset on test file)
   - Don't use VLM for location detection
   - Use VLM for name extraction only

2. **Name-based approach is the solution**
   - VLM extracts names ✓
   - CV finds locations ✓
   - VLM verifies regions ✓

3. **Test file with coordinate issue:**
   - `201301_2458_AI1_page4.pdf`
   - VLM found 2 names but coordinates pointed to blank areas
   - Actual signatures were at 26%, but the VLM reported 58% and 68%
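
The name-based approach in item 2 can be sketched as a small matching loop. This is an illustration under stated assumptions, not the code in `extract_signatures_hybrid.py`: the function `match_regions_to_names`, the `verify` callback signature, and the `(x, y, width, height)` region layout are all hypothetical, and the VLM call is replaced by a stub so the example runs offline.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Assumed region layout: (x, y, width, height) in pixels.
Region = Tuple[int, int, int, int]


def match_regions_to_names(
    names: Iterable[str],
    regions: Iterable[Region],
    verify: Callable[[Region, List[str]], Optional[str]],
) -> Dict[str, Region]:
    """Assign each expected name to at most one verified region."""
    remaining = list(names)              # names the VLM extracted from the page
    matched: Dict[str, Region] = {}
    for region in regions:               # candidates from CV (or text layer) detection
        if not remaining:
            break                        # every expected name already has a region
        name = verify(region, remaining)
        if name is None or name not in remaining:
            continue                     # rejected: no match, or a duplicate name
        matched[name] = region
        remaining.remove(name)           # duplicate prevention
    return matched


# Stub standing in for the VLM verification call (hypothetical answers).
expected_names = ["周寶蓮", "魏興海"]
candidate_regions = [(40, 300, 200, 80), (40, 420, 200, 80), (40, 900, 200, 80)]
fake_answers = {
    candidate_regions[0]: "周寶蓮",
    candidate_regions[1]: None,          # blank or printed-text region
    candidate_regions[2]: "周寶蓮",      # second hit for the same person
}


def fake_verify(region: Region, remaining: List[str]) -> Optional[str]:
    return fake_answers.get(region)


matched = match_regions_to_names(expected_names, candidate_regions, fake_verify)
print(matched)
```

In this stubbed run only the first region is kept: the second is rejected and the third is a duplicate. The name 魏興海 stays unmatched, which is how a signature missed by CV detection lowers recall without affecting precision.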

## To Start Next Session

### Simple Method (Recommended)
```bash
cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt
# Copy output and paste to new Claude Code session
```

### Manual Method
Tell Claude:
> "I'm continuing the PDF signature extraction project at `/Volumes/NV2/pdf_recognize/`. Please read `SESSION_INIT.md` and `PROJECT_DOCUMENTATION.md` to understand the current state. I want to [choose option from SESSION_INIT.md]."

## Quick Commands Reference

### View Documentation
```bash
less /Volumes/NV2/pdf_recognize/SESSION_INIT.md
less /Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md
```

### Run Scripts
```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_signatures_hybrid.py  # Main script
```

### Check Results
```bash
ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
```

### View Session Handoff
```bash
cat /Volumes/NV2/pdf_recognize/HOW_TO_CONTINUE.txt
```

## What Can Be Improved (Future Work)

### Priority 1: Increase Recall
- Current: 70%
- Target: 90%+
- Method: Tune CV parameters in lines 178-214 of extract_signatures_hybrid.py

### Priority 2: Scale Testing
- Current: 5 files tested
- Next: 100 files
- Future: 86,073 files (full dataset)

### Priority 3: Optimization
- Current: ~24 seconds per PDF
- Consider: Parallel processing, batch VLM calls
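
To put Priority 3 in numbers: at ~24 seconds per PDF, a serial run over 86,073 files needs about 2,065,752 seconds, roughly 574 hours or 23.9 days. The sketch below shows that estimate and one possible way to fan work out over threads. `process_pdf` is a hypothetical per-file function, and the perfect-scaling assumption is optimistic: whether the Ollama server can serve several VLM requests concurrently has not been tested here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

SECONDS_PER_PDF = 24     # measured figure from this checklist
TOTAL_FILES = 86_073     # full dataset size from this checklist


def estimated_days(files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Wall-clock estimate in days, assuming perfect scaling across workers."""
    return files * seconds_per_file / workers / 86_400


def process_all(paths: Iterable[T], process_pdf: Callable[[T], R],
                workers: int = 4) -> List[R]:
    """Run process_pdf over paths on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_pdf, paths))


print(f"serial:    {estimated_days(TOTAL_FILES, SECONDS_PER_PDF):.1f} days")
print(f"4 workers: {estimated_days(TOTAL_FILES, SECONDS_PER_PDF, workers=4):.1f} days")
```

Threads are used here on the assumption that most of the 24 seconds is spent waiting on the remote VLM. If CV detection turns out to dominate, a process pool may be the better fit.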

### Priority 4: Text Layer Testing
- Current: Untested (all PDFs are scanned)
- Need: Find PDFs with searchable text layer

## Verification Steps

Before next session, verify files exist:
```bash
cd /Volumes/NV2/pdf_recognize

# Check essential docs
ls -lh SESSION_INIT.md PROJECT_DOCUMENTATION.md NEW_SESSION_PROMPT.txt

# Check working scripts
ls -lh extract_pages_from_csv.py extract_signatures_hybrid.py

# Check test results
ls /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png | wc -l
# Should show: 7 (the 7 verified signatures)
```

## Known Good State

### Environment
- Python: 3.9+ with venv
- Ollama: http://192.168.30.36:11434
- Model: qwen2.5vl:32b
- Working directory: /Volumes/NV2/pdf_recognize/
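
Before starting a long run it may help to confirm that the Ollama server is reachable and still lists the model. The sketch below uses Ollama's `GET /api/tags` endpoint, which returns the locally available models. The helper names `model_is_listed` and `check_ollama` are illustrative and do not come from the project scripts.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"   # from the Environment list above
MODEL = "qwen2.5vl:32b"


def model_is_listed(tags_payload: dict, model: str) -> bool:
    """True if `model` appears in a parsed /api/tags response."""
    return any(entry.get("name") == model
               for entry in tags_payload.get("models", []))


def check_ollama(base_url: str = OLLAMA_URL, model: str = MODEL,
                 timeout: float = 5.0) -> bool:
    """Return False if the server is unreachable, replies oddly, or lacks the model."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return model_is_listed(json.load(response), model)
    except (OSError, ValueError):            # network failure or invalid JSON
        return False
```

Calling `check_ollama()` from the working directory's venv returns `True` only when the server answers and the model name matches exactly, including the `:32b` tag.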

### Test Data
- 5 PDFs processed
- 7 signatures extracted
- All verified (100% precision)
- 3 signatures missed (70% recall)

### Output Files
```
201301_1324_AI1_page3_signature_張志銘.png  (33 KB)
201301_1324_AI1_page3_signature_楊智惠.png  (37 KB)
201301_2061_AI1_page5_signature_廖阿甚.png  (87 KB)
201301_2458_AI1_page4_signature_周寶蓮.png  (230 KB)
201301_2923_AI1_page3_signature_黄瑞展.png  (184 KB)
201301_3189_AI1_page3_signature_黄益辉.png  (24 KB)
201301_3189_AI1_page3_signature_黄辉.png    (84 KB)
```
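
The output names above appear to follow the pattern `<PDF stem>_signature_<name>.png`. A minimal sketch of building such a name and skipping ones already written; this is inferred from the listing, and both helpers (`signature_filename`, `next_output_name`) are hypothetical rather than taken from `extract_signatures_hybrid.py`.

```python
from pathlib import Path
from typing import Optional, Set


def signature_filename(pdf_path: str, name: str) -> str:
    """Build '<pdf stem>_signature_<name>.png' from a PDF path and a person name."""
    return f"{Path(pdf_path).stem}_signature_{name}.png"


def next_output_name(pdf_path: str, name: str, already_saved: Set[str]) -> Optional[str]:
    """Return the file name to write, or None if this signature was saved before."""
    filename = signature_filename(pdf_path, name)
    if filename in already_saved:
        return None                      # duplicate: keep the first extraction
    already_saved.add(filename)
    return filename


saved: Set[str] = set()
first = next_output_name("201301_2458_AI1_page4.pdf", "周寶蓮", saved)
second = next_output_name("201301_2458_AI1_page4.pdf", "周寶蓮", saved)
print(first, second)
```

Because the PDF stem is part of the name, two people with the same name on different pages still get separate files.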

## Git Status (Pre-Commit)

Files staged for commit:
- [ ] extract_pages_from_csv.py
- [ ] extract_signatures_hybrid.py
- [ ] extract_handwriting.py
- [ ] README.md
- [ ] PROJECT_DOCUMENTATION.md
- [ ] README_page_extraction.md
- [ ] README_hybrid_extraction.md
- [ ] .gitignore

**Waiting for:** User to review docs and approve commit

## Session Health Check ✓

- [x] All scripts working
- [x] Test results documented
- [x] Issues identified and recorded
- [x] Next steps defined
- [x] Session continuity files created
- [x] Git commit prepared

**Status:** ✅ Ready for handoff

---

**Last Updated:** October 26, 2025
**Session End:** Ready for next session
**Next Action:** User reviews docs → Git commit → Continue work