Add hybrid signature extraction with name-based verification

Implement a hybrid approach (VLM name extraction + CV detection) that
replaces the unreliable VLM coordinate system with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling
Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)
Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs
Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF
Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)
🤖 Generated with Claude Code
SESSION_INIT.md
@@ -0,0 +1,372 @@
# Session Initialization - PDF Signature Extraction Project

**Purpose:** This document helps you (or another Claude instance) quickly understand the project state and continue working.

---

## Project Quick Summary

**Goal:** Extract handwritten Chinese signatures from 86,073 PDF documents automatically.

**Current Status:** ✅ Working solution with 70% recall, 100% precision (tested on 5 PDFs)

**Approach:** Hybrid VLM name extraction + Computer Vision detection + VLM verification

---

## 🚀 Quick Start (Resume Work)

### If you want to continue testing:
```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate

# Test with more files (edit line 425 in script)
python extract_signatures_hybrid.py
```

### If you want to review what was done:
```bash
# Read the complete history
less PROJECT_DOCUMENTATION.md

# Check test results
ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
```

### If you want to commit to git:
```bash
# Follow the guide
less COMMIT_SUMMARY.md
```

---

## 📁 Key Files (What Each Does)

### Production Scripts ✅
- **extract_pages_from_csv.py** - Step 1: Extract pages from CSV (tested: 100 files)
- **extract_signatures_hybrid.py** - Step 2: Extract signatures (CURRENT WORKING, tested: 5 files)
- **extract_handwriting.py** - CV-only approach (component used in hybrid)

### Documentation 📚
- **PROJECT_DOCUMENTATION.md** - ⭐ READ THIS FIRST - Complete history of all 5 approaches tested
- **README.md** - Quick start guide
- **COMMIT_SUMMARY.md** - Git commit instructions
- **SESSION_INIT.md** - This file (for session continuity)

### Configuration ⚙️
- **.gitignore** - Excludes diagnostic scripts and test outputs

---

## 🎯 Current Working Solution

### Architecture
```
1. VLM extracts signature names: "周寶蓮", "魏興海"
2. CV detects signature-like regions (5K-200K pixels)
3. VLM verifies each region against expected names
4. Save verified signatures: signature_周寶蓮.png
```
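
The four steps above can be sketched as a small matching function. This is a minimal sketch under stated assumptions, not the code in `extract_signatures_hybrid.py`: `match_regions_to_names`, `signature_filename`, and the `verify_region` callable are hypothetical names, and `verify_region` stands in for the VLM verification call. It shows how name-based verification and duplicate prevention could fit together, with each expected name accepted at most once and everything else sent to the rejected list.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple, TypeVar

Region = TypeVar("Region")  # any object describing a candidate region


def match_regions_to_names(
    names: List[str],
    regions: Iterable[Region],
    verify_region: Callable[[Region, List[str]], Optional[str]],
) -> Tuple[Dict[str, Region], List[Region]]:
    """Assign each candidate region to at most one expected name.

    verify_region is a hypothetical stand-in for the VLM check: it returns
    the matched name, or None when the region matches no expected name.
    Returns (accepted, rejected).
    """
    accepted: Dict[str, Region] = {}
    rejected: List[Region] = []
    for region in regions:
        # Names that already have a signature are no longer candidates,
        # which prevents duplicate signatures per person.
        remaining = [n for n in names if n not in accepted]
        if not remaining:
            rejected.append(region)
            continue
        matched = verify_region(region, remaining)
        if matched in remaining:
            accepted[matched] = region
        else:
            rejected.append(region)
    return accepted, rejected


def signature_filename(pdf_stem: str, name: str) -> str:
    """Build an output file name in the style of the saved test outputs."""
    return f"{pdf_stem}_signature_{name}.png"
```

Keeping the VLM call behind a plain callable means the matching logic can be exercised with a fake verifier before spending VLM time on real pages.
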

### Test Results (5 PDFs)

| Metric | Value |
|--------|-------|
| Expected signatures | 10 |
| Found signatures | 7 |
| Recall | 70% |
| Precision | 100% |
| False positives | 0 |

### Why 30% Missing?
- Computer vision parameters too conservative
- Some signatures smaller/larger than 5K-200K pixel range
- Aspect ratio filter (0.5-10) may exclude some signatures

---

## ⚠️ Critical Context (What You MUST Know)

### 1. VLM Coordinate System is UNRELIABLE ❌

**Discovery:** The VLM (qwen2.5vl:32b) returns inaccurate coordinates.

**Example:**
- VLM reported signatures at: top=58% and top=68%
- Actual location: top=26%
- Error: offset of ~32 percentage points (NOT consistent across files!)

**Test file:** `201301_2458_AI1_page4.pdf`
- VLM correctly identifies 2 signatures: "周寶蓮", "魏興海"
- Crops taken at the VLM coordinates are 100% white/blank regions
- This is why we abandoned the coordinate-based approach

**Evidence:** See diagnostic scripts and results in PROJECT_DOCUMENTATION.md

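
One way to catch the blank-crop failure described above is to convert a percentage-based box to pixels and measure how much of the crop is near-white. The following is a minimal sketch, not one of the project's diagnostic scripts: the function names, the (top, left, height, width) percentage layout, and the 250 white threshold are all assumptions, and the page is modeled as a grayscale list of rows with values 0-255.

```python
from typing import Sequence, Tuple


def percent_box_to_pixels(
    top_pct: float, left_pct: float, height_pct: float, width_pct: float,
    page_h: int, page_w: int,
) -> Tuple[int, int, int, int]:
    """Convert a percentage box to (top, left, bottom, right) in pixels."""
    top = max(0, round(top_pct * page_h / 100))
    left = max(0, round(left_pct * page_w / 100))
    bottom = min(page_h, round((top_pct + height_pct) * page_h / 100))
    right = min(page_w, round((left_pct + width_pct) * page_w / 100))
    return top, left, bottom, right


def white_fraction(
    gray: Sequence[Sequence[int]],
    top: int, left: int, bottom: int, right: int,
    white_threshold: int = 250,  # assumed cutoff for "white" pixels
) -> float:
    """Return the share of pixels in the box at or above white_threshold."""
    total = (bottom - top) * (right - left)
    if total <= 0:
        raise ValueError("empty region")
    white = sum(
        1
        for row in gray[top:bottom]
        for value in row[left:right]
        if value >= white_threshold
    )
    return white / total
```

A crop whose white fraction is at or near 1.0 contains no ink, which matches the symptom seen on the test file.
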

### 2. Name-Based Approach is the Solution ✅

Instead of using VLM coordinates:
- ✅ Use VLM to extract **names** (reliable)
- ✅ Use CV to find **locations** (pixel-accurate)
- ✅ Use VLM to **verify** each region against names (accurate)

### 3. All Test PDFs Are Scanned Images

- No searchable text layer
- PDF text layer method (Method A) is **untested**
- All current results use CV detection (Method B)

---

## 🔧 Configuration Details

### Ollama Setup
```python
OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
```

**Verify connection:**
```bash
curl http://192.168.30.36:11434/api/tags
```
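
The same check can be done from Python before a long run starts. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the `/api/tags` response is a JSON object with a `models` list whose entries carry a `name` field, which is worth confirming against the Ollama version in use.

```python
import json
import urllib.request
from typing import Any, Dict, List

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 10.0) -> Dict[str, Any]:
    """Fetch the model listing from the Ollama server (raises on failure)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def list_model_names(tags_response: Dict[str, Any]) -> List[str]:
    """Extract model names, assuming a {"models": [{"name": ...}]} layout."""
    return [m.get("name", "") for m in tags_response.get("models", [])]


def model_is_available(tags_response: Dict[str, Any], model: str = OLLAMA_MODEL) -> bool:
    """Return True when the configured model appears in the listing."""
    return model in list_model_names(tags_response)
```

Calling `model_is_available(fetch_tags())` at the top of a batch run would fail fast when the server is unreachable or the model has not been pulled, rather than partway through the dataset.
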

### File Paths
```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = ".../signatures/rejected"
```

### CV Detection Parameters (adjust to improve recall)
```python
# In extract_signatures_hybrid.py, detect_signature_regions_cv()
MIN_CONTOUR_AREA = 5000       # ⬇️ Lower = catch smaller signatures
MAX_CONTOUR_AREA = 200000     # ⬆️ Higher = catch larger signatures
ASPECT_RATIO_MIN = 0.5        # ⬇️ Lower = catch taller signatures
ASPECT_RATIO_MAX = 10.0       # ⬆️ Higher = catch wider signatures
```
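
To see how these four parameters interact, the filter can be sketched on plain bounding boxes. This is an illustration, not the body of `detect_signature_regions_cv()`: it assumes area is measured as bounding-box width × height (the script may use true contour area instead) and that aspect ratio means width / height. `is_signature_like` and `filter_boxes` are hypothetical names. Passing different keyword values to `filter_boxes` makes it possible to compare parameter sets on the same candidate boxes without rerunning detection.

```python
from typing import Iterable, List, Tuple

MIN_CONTOUR_AREA = 5000
MAX_CONTOUR_AREA = 200000
ASPECT_RATIO_MIN = 0.5
ASPECT_RATIO_MAX = 10.0

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def is_signature_like(
    box: Box,
    min_area: float = MIN_CONTOUR_AREA,
    max_area: float = MAX_CONTOUR_AREA,
    min_ratio: float = ASPECT_RATIO_MIN,
    max_ratio: float = ASPECT_RATIO_MAX,
) -> bool:
    """Return True when the box passes both the area and aspect-ratio filters."""
    _, _, width, height = box
    if width <= 0 or height <= 0:
        return False
    area = width * height    # assumption: bounding-box area, not contour area
    aspect = width / height  # assumption: ratio is width over height
    return min_area <= area <= max_area and min_ratio <= aspect <= max_ratio


def filter_boxes(boxes: Iterable[Box], **params: float) -> List[Box]:
    """Keep only the boxes that pass the filters with the given parameters."""
    return [b for b in boxes if is_signature_like(b, **params)]
```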

---

## 🎬 What Happened (Session History)

### Approaches Tested (Chronological)

1. **PDF Image Objects** → Abandoned (extracted full pages, not signatures)
2. **Simple Page Extraction** → ✅ Working (extract pages from CSV)
3. **Computer Vision Only** → Insufficient (6,420 regions from 100 pages - too many)
4. **VLM Coordinates** → ❌ Failed (coordinates unreliable, extracted blank regions)
5. **Hybrid Name-Based** → ✅ Current (70% recall, 100% precision)

### Key Decisions Made

- ✅ Use VLM for names, not coordinates
- ✅ Verify each region against expected names
- ✅ Save signatures with person names
- ✅ Reject regions that don't match any name
- ✅ Prevent duplicate signatures per person

### Diagnostic Work Done

Created 11 diagnostic scripts to investigate the VLM coordinate failure:
- Visualized bounding boxes
- Analyzed pixel content
- Tested actual vs. reported locations
- Confirmed coordinates were ~32 percentage points off on the test file

All findings are documented in PROJECT_DOCUMENTATION.md.

---

## 🚧 Known Issues & Next Steps

### Issue 1: 30% Missing Signatures
**Status:** Open
**Options:**
1. Widen CV parameter ranges (test with different thresholds)
2. Multi-pass detection with different kernels
3. Ask the VLM for help when signatures are missing
4. Manual review of the rejected folder

### Issue 2: Text Layer Method Untested
**Status:** Pending
**Need:** PDFs with searchable text to test Method A

### Issue 3: Performance (24 sec/PDF)
**Status:** Acceptable for now
**Future:** Optimize if processing the full 86K dataset

---

## 📊 Test Data Reference

### Test Files Used (5 PDFs)
```
201301_1324_AI1_page3.pdf - ✅ Found 2/2: 楊智惠, 張志銘
201301_2061_AI1_page5.pdf - ⚠️ Found 1/2: 廖阿甚 (missing 林姿妤)
201301_2458_AI1_page4.pdf - ⚠️ Found 1/2: 周寶蓮 (missing 魏興海) ← VLM coordinate test file
201301_2923_AI1_page3.pdf - ⚠️ Found 1/2: 黄瑞展 (missing 陈丽琦)
201301_3189_AI1_page3.pdf - ✅ Found 2/2: 黄辉, 黄益辉
```
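
The per-file counts above add up to the reported 70% recall and 100% precision. As a short worked example, with the (found, expected) pairs copied from the list and the false-positive count taken from the results table; the function and variable names are illustrative only:

```python
from typing import Dict, Tuple

# (found, expected) per test file, copied from the list above.
RESULTS: Dict[str, Tuple[int, int]] = {
    "201301_1324_AI1_page3.pdf": (2, 2),
    "201301_2061_AI1_page5.pdf": (1, 2),
    "201301_2458_AI1_page4.pdf": (1, 2),
    "201301_2923_AI1_page3.pdf": (1, 2),
    "201301_3189_AI1_page3.pdf": (2, 2),
}
FALSE_POSITIVES = 0  # from the test results table


def recall(results: Dict[str, Tuple[int, int]]) -> float:
    """Found signatures divided by expected signatures, over all files."""
    found = sum(f for f, _ in results.values())
    expected = sum(e for _, e in results.values())
    return found / expected


def precision(true_positives: int, false_positives: int) -> float:
    """Correct extractions divided by all extractions."""
    return true_positives / (true_positives + false_positives)


total_found = sum(f for f, _ in RESULTS.values())
print(f"recall={recall(RESULTS):.0%} precision={precision(total_found, FALSE_POSITIVES):.0%}")
```

The same two functions apply unchanged to a larger run once per-file counts are collected from the log.
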

### Output Location
```
/Volumes/NV2/PDF-Processing/signature-image-output/signatures/
├── 201301_1324_AI1_page3_signature_張志銘.png
├── 201301_1324_AI1_page3_signature_楊智惠.png
├── 201301_2061_AI1_page5_signature_廖阿甚.png
├── 201301_2458_AI1_page4_signature_周寶蓮.png
├── 201301_2923_AI1_page3_signature_黄瑞展.png
├── 201301_3189_AI1_page3_signature_黄辉.png
├── 201301_3189_AI1_page3_signature_黄益辉.png
└── rejected/ (non-signature regions)
```

---

## 💡 How to Continue Work

### Option 1: Improve Recall (Find Missing Signatures)

**Goal:** Get from 70% to 90%+ recall

**Approach:**
1. Review the rejected folder to see whether the missing signatures were detected but rejected
2. Adjust CV parameters in `detect_signature_regions_cv()`:
   ```python
   MIN_CONTOUR_AREA = 3000    # Lower threshold
   MAX_CONTOUR_AREA = 300000  # Higher threshold
   ```
3. Test on the same 5 PDFs and compare results
4. If recall improves without too many false positives, proceed

**Files to edit:**
- `extract_signatures_hybrid.py` lines 178-214

### Option 2: Scale Up Testing

**Goal:** Test on 100 PDFs to verify reliability

**Approach:**
1. Edit `extract_signatures_hybrid.py` line 425:
   ```python
   pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]
   ```
2. Run the script (will take ~40 minutes at 24 sec/PDF)
3. Analyze results in the log file
4. Calculate overall recall/precision

### Option 3: Prepare for Production

**Goal:** Process all 86,073 files

**Requirements:**
1. Verify the current approach is acceptable (is 70% recall OK?)
2. Estimate time: 86,073 files × 24 sec/file = ~24 days
3. Consider parallel processing or optimization
4. Set up monitoring and resume capability
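
The time estimate and the resume requirement can both be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names and the one-name-per-line checkpoint file are hypothetical, not part of the current scripts, and the `workers` parameter assumes throughput scales linearly, which may not hold if the single Ollama server is the bottleneck.

```python
from pathlib import Path
from typing import Iterable, List, Set

SECONDS_PER_DAY = 86_400


def estimated_days(num_files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Estimate wall-clock days, assuming perfectly linear scaling across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return num_files * seconds_per_file / workers / SECONDS_PER_DAY


def load_done(checkpoint: Path) -> Set[str]:
    """Read the names of already-processed PDFs (one per line)."""
    if not checkpoint.exists():
        return set()
    lines = checkpoint.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_done(checkpoint: Path, pdf_name: str) -> None:
    """Append a processed PDF name so an interrupted run can resume."""
    with checkpoint.open("a", encoding="utf-8") as fh:
        fh.write(pdf_name + "\n")


def pending_files(all_files: Iterable[str], done: Set[str]) -> List[str]:
    """Return the files that still need processing, in their original order."""
    return [f for f in all_files if f not in done]


print(f"{estimated_days(86_073, 24):.1f} days with 1 worker")
```

Appending to the checkpoint after each PDF means a crash loses at most the file in progress, which matters for a run measured in weeks.
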

### Option 4: Commit Current State

**Goal:** Save working solution to git

**Steps:**
1. Read `COMMIT_SUMMARY.md`
2. Review files to commit
3. Run verification checks
4. Execute git commands
5. Tag release: `v1.0-hybrid-70percent`

---

## 🔍 How to Debug Issues

### If extraction fails:
```bash
# Check Ollama connection
curl http://192.168.30.36:11434/api/tags

# Check input PDFs exist
ls /Volumes/NV2/PDF-Processing/signature-image-output/*.pdf | head -5

# Run with single file for testing
python -c "from extract_signatures_hybrid import *; process_pdf_page('/path/to/test.pdf', OUTPUT_PATH)"
```

### If too many false positives:
- Increase `MIN_CONTOUR_AREA` (filter out small regions)
- Decrease `MAX_CONTOUR_AREA` (filter out large regions)
- Check rejected folder to verify they're actually non-signatures

### If missing signatures:
- Check rejected folder (might be detected but not verified)
- Lower `MIN_CONTOUR_AREA` (catch smaller signatures)
- Increase `MAX_CONTOUR_AREA` (catch larger signatures)
- Widen aspect ratio range

---

## 📋 Session Handoff Checklist

When starting a new session, provide this context:

- ✅ **Project Goal:** Extract Chinese signatures from 86K PDFs
- ✅ **Current Approach:** Hybrid VLM name + CV detection + VLM verification
- ✅ **Status:** Working at 70% recall, 100% precision on 5 test files
- ✅ **Key Context:** VLM coordinates unreliable (32% offset), use names instead
- ✅ **Key Files:** extract_signatures_hybrid.py (main), PROJECT_DOCUMENTATION.md (history)
- ✅ **Next Steps:** Improve recall OR scale up testing OR commit to git

---

## 🎓 Important Lessons Learned

1. **VLM spatial reasoning is unreliable** - Don't trust percentage-based coordinates
2. **VLM text recognition is excellent** - Use for extracting names, not locations
3. **Computer vision is precise** - Use for pixel-level location detection
4. **Name-based verification works** - Filters false positives effectively
5. **Diagnostic scripts are crucial** - Helped discover coordinate offset issue
6. **Conservative parameters** - Better to miss signatures than get false positives

---

## 📞 Quick Reference

### Most Important Command
```bash
python extract_signatures_hybrid.py  # Run signature extraction
```

### Most Important File
```bash
less PROJECT_DOCUMENTATION.md  # Complete project history
```

### Most Important Finding
**VLM coordinates are unreliable → Use VLM for names, CV for locations**

---

## ✨ Session Start Template

**When starting a new session, say:**

> "I'm continuing work on the PDF signature extraction project. Please read `/Volumes/NV2/pdf_recognize/SESSION_INIT.md` and `/Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md` to understand the current state.
>
> Current status: Working hybrid approach with 70% recall on 5 test files.
>
> I want to: [choose one]
> - Improve recall by tuning CV parameters
> - Test on 100 PDFs to verify reliability
> - Commit current solution to git
> - Process full 86K dataset
> - Debug a specific issue: [describe]"

---

**Document Created:** October 26, 2025
**Last Updated:** October 26, 2025
**Status:** Ready for Next Session
**Working Directory:** `/Volumes/NV2/pdf_recognize/`