Implement VLM name extraction + CV detection hybrid approach to
replace unreliable VLM coordinate system with name-based verification.
Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling
Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)
Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs
Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF
Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)
🤖 Generated with Claude Code
373 lines
11 KiB
Markdown
# Session Initialization - PDF Signature Extraction Project

**Purpose:** This document helps you (or another Claude instance) quickly understand the project state and continue working.

---

## Project Quick Summary

**Goal:** Automatically extract handwritten Chinese signatures from 86,073 PDF documents.

**Current Status:** ✅ Working solution with 70% recall and 100% precision (tested on 5 PDFs)

**Approach:** Hybrid of VLM name extraction + computer vision (CV) detection + VLM verification

---

## 🚀 Quick Start (Resume Work)

### If you want to continue testing:
```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate

# Test with more files (edit line 425 in script)
python extract_signatures_hybrid.py
```

### If you want to review what was done:
```bash
# Read the complete history
less PROJECT_DOCUMENTATION.md

# Check test results
ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
```

### If you want to commit to git:
```bash
# Follow the guide
less COMMIT_SUMMARY.md
```

---

## 📁 Key Files (What Each Does)

### Production Scripts ✅
- **extract_pages_from_csv.py** - Step 1: Extract pages from CSV (tested: 100 files)
- **extract_signatures_hybrid.py** - Step 2: Extract signatures (CURRENT WORKING, tested: 5 files)
- **extract_handwriting.py** - CV-only approach (component used in hybrid)

### Documentation 📚
- **PROJECT_DOCUMENTATION.md** - ⭐ READ THIS FIRST - Complete history of all 5 approaches tested
- **README.md** - Quick start guide
- **COMMIT_SUMMARY.md** - Git commit instructions
- **SESSION_INIT.md** - This file (for session continuity)

### Configuration ⚙️
- **.gitignore** - Excludes diagnostic scripts and test outputs

---

## 🎯 Current Working Solution

### Architecture
```
1. VLM extracts signature names: "周寶蓮", "魏興海"
2. CV detects signature-like regions (contour area 5K-200K pixels)
3. VLM verifies each region against expected names
4. Save verified signatures: signature_周寶蓮.png
```

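The four steps above can be sketched as a small orchestration function. This is a minimal illustration, not the code in `extract_signatures_hybrid.py`: the names `run_hybrid`, `PageResult`, and `Region`, and the three injected callables, are hypothetical stand-ins for the real VLM and CV calls. It also shows the two safeguards described in this document, rejecting regions that match no expected name and keeping at most one signature per person.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical region type: (x, y, width, height) in pixels.
Region = Tuple[int, int, int, int]


@dataclass
class PageResult:
    saved: Dict[str, Region] = field(default_factory=dict)  # name -> region
    rejected: List[Region] = field(default_factory=list)    # matched no name
    missing: List[str] = field(default_factory=list)        # names never found


def run_hybrid(
    extract_names: Callable[[], List[str]],
    detect_regions: Callable[[], List[Region]],
    verify_region: Callable[[Region, List[str]], Optional[str]],
) -> PageResult:
    """Run the name -> detect -> verify -> save flow for one page."""
    result = PageResult()
    names = extract_names()                       # step 1: VLM reads the names
    for region in detect_regions():               # step 2: CV proposes regions
        # Duplicate prevention: only names not yet saved are still eligible.
        remaining = [n for n in names if n not in result.saved]
        matched = verify_region(region, remaining) if remaining else None
        if matched is not None and matched in remaining:  # step 3: VLM verifies
            result.saved[matched] = region        # step 4: keep under that name
        else:
            result.rejected.append(region)        # no expected name matched
    result.missing = [n for n in names if n not in result.saved]
    return result
```

Passing the VLM and CV steps in as callables keeps the control flow testable without an Ollama server; the real script may be structured differently.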
### Test Results (5 PDFs)
| Metric | Value |
|--------|-------|
| Expected signatures | 10 |
| Found signatures | 7 |
| Recall | 70% |
| Precision | 100% |
| False positives | 0 |

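As a worked check of the table, the sketch below recomputes recall and precision from the per-file counts listed under Test Data Reference. The function names and the `PER_FILE` dictionary are illustrative only; the counts themselves come from this document.

```python
from typing import Dict, Tuple

# (found, expected) per test file, taken from the test data in this document.
PER_FILE: Dict[str, Tuple[int, int]] = {
    "201301_1324_AI1_page3.pdf": (2, 2),
    "201301_2061_AI1_page5.pdf": (1, 2),
    "201301_2458_AI1_page4.pdf": (1, 2),
    "201301_2923_AI1_page3.pdf": (1, 2),
    "201301_3189_AI1_page3.pdf": (2, 2),
}


def recall(true_positives: int, expected: int) -> float:
    """Fraction of expected signatures that were actually extracted."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return true_positives / expected


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of extracted signatures that were correct."""
    extracted = true_positives + false_positives
    if extracted <= 0:
        raise ValueError("nothing was extracted")
    return true_positives / extracted


found = sum(f for f, _ in PER_FILE.values())
expected = sum(e for _, e in PER_FILE.values())
print(f"recall={recall(found, expected):.0%}, precision={precision(found, 0):.0%}")
```

With only 10 expected signatures, each miss moves recall by 10 percentage points, so these figures are a rough indication rather than a stable estimate.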
### Why 30% Missing?
- Computer vision parameters are too conservative
- Some signatures are smaller or larger than the 5K-200K pixel area range
- The aspect ratio filter (0.5-10) may exclude some signatures

---

## ⚠️ Critical Context (What You MUST Know)

### 1. VLM Coordinate System is UNRELIABLE ❌

**Discovery:** The VLM (qwen2.5vl:32b) provides inaccurate coordinates.

**Example:**
- VLM said signatures at: top=58%, top=68%
- Actual location: top=26%
- Error: ~32 percentage points of offset (NOT consistent across files!)

**Test file:** `201301_2458_AI1_page4.pdf`
- VLM correctly identifies 2 signatures: "周寶蓮", "魏興海"
- Regions cropped at the VLM coordinates are 100% white/blank
- This is why we abandoned the coordinate-based approach

**Evidence:** See diagnostic scripts and results in PROJECT_DOCUMENTATION.md

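To make the size of that error concrete, the following sketch converts the offset into pixels. The page height of 3508 px (A4 at 300 DPI) is an assumption for illustration, not a measured property of the test file, and only the first reported value (58%) is compared because the document gives a single actual location.

```python
def offset_points(reported_top_pct: float, actual_top_pct: float) -> float:
    """Vertical error in percentage points of page height."""
    return reported_top_pct - actual_top_pct


def percent_to_pixels(percent: float, page_height_px: int) -> int:
    """Convert a percentage of page height to a pixel distance."""
    return round(percent * page_height_px / 100)


ASSUMED_PAGE_HEIGHT_PX = 3508  # assumption: A4 rendered at 300 DPI

error_pct = offset_points(58.0, 26.0)
error_px = percent_to_pixels(error_pct, ASSUMED_PAGE_HEIGHT_PX)
print(f"offset: {error_pct:.0f} points, about {error_px} px on the assumed page")
```

An error of that size is far larger than a signature, which is consistent with the crops coming back blank.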
### 2. Name-Based Approach is the Solution ✅

Instead of using VLM coordinates:
- ✅ Use VLM to extract **names** (reliable)
- ✅ Use CV to find **locations** (pixel-accurate)
- ✅ Use VLM to **verify** each region against names (accurate)

### 3. All Test PDFs Are Scanned Images

- No searchable text layer
- PDF text layer method (Method A) is **untested**
- All current results use CV detection (Method B)

---

## 🔧 Configuration Details

### Ollama Setup
```python
OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
```

**Verify connection:**
```bash
curl http://192.168.30.36:11434/api/tags
```

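The same check can be done from Python before a long run. This is a sketch using only the standard library; `fetch_tags` and `model_available` are hypothetical helper names, and the code assumes the `/api/tags` response is a JSON object with a `models` list whose entries carry a `name` field.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Fetch the list of locally available models from the Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_available(tags: dict, model: str = OLLAMA_MODEL) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    return any(entry.get("name") == model for entry in tags.get("models", []))


# Usage (requires network access to the server):
#   if not model_available(fetch_tags()):
#       raise SystemExit(f"{OLLAMA_MODEL} is not available on {OLLAMA_URL}")
```

Separating the network call from the parsing keeps the availability check testable offline.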
### File Paths
```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = ".../signatures/rejected"
```

### CV Detection Parameters (adjust to improve recall)
```python
# In extract_signatures_hybrid.py, detect_signature_regions_cv()
MIN_CONTOUR_AREA = 5000      # ⬇️ Lower = catch smaller signatures
MAX_CONTOUR_AREA = 200000    # ⬆️ Higher = catch larger signatures
ASPECT_RATIO_MIN = 0.5       # ⬇️ Lower = catch taller signatures
ASPECT_RATIO_MAX = 10.0      # ⬆️ Higher = catch wider signatures
```

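How these four parameters interact can be shown with a small filter. This is a sketch only: `passes_filters` is a hypothetical helper, the bounds are treated as inclusive, and the aspect ratio is assumed to be width divided by height (which matches the comments above, where lower means taller and higher means wider). The real `detect_signature_regions_cv()` may differ. A candidate region must pass both the area test and the aspect ratio test, so loosening one range alone may not recover a missed signature.

```python
MIN_CONTOUR_AREA = 5000
MAX_CONTOUR_AREA = 200000
ASPECT_RATIO_MIN = 0.5
ASPECT_RATIO_MAX = 10.0


def passes_filters(
    area: float,
    width: int,
    height: int,
    min_area: float = MIN_CONTOUR_AREA,
    max_area: float = MAX_CONTOUR_AREA,
    min_aspect: float = ASPECT_RATIO_MIN,
    max_aspect: float = ASPECT_RATIO_MAX,
) -> bool:
    """Return True if a candidate region is kept as signature-like."""
    if width <= 0 or height <= 0:
        return False
    aspect = width / height  # assumption: aspect ratio = width / height
    return min_area <= area <= max_area and min_aspect <= aspect <= max_aspect


# A 4,000-pixel region is dropped by the defaults but kept with a lower floor.
print(passes_filters(4000, 100, 40))                  # False
print(passes_filters(4000, 100, 40, min_area=3000))   # True
```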
---

## 🎬 What Happened (Session History)

### Approaches Tested (Chronological)

1. **PDF Image Objects** → Abandoned (extracted full pages, not signatures)
2. **Simple Page Extraction** → ✅ Working (extract pages from CSV)
3. **Computer Vision Only** → Insufficient (6,420 regions from 100 pages - too many)
4. **VLM Coordinates** → ❌ Failed (coordinates unreliable, extracted blank regions)
5. **Hybrid Name-Based** → ✅ Current (70% recall, 100% precision)

### Key Decisions Made

- ✅ Use VLM for names, not coordinates
- ✅ Verify each region against expected names
- ✅ Save signatures with person names
- ✅ Reject regions that don't match any name
- ✅ Prevent duplicate signatures per person

### Diagnostic Work Done

Created 11 diagnostic scripts to investigate the VLM coordinate failure:
- Visualized bounding boxes
- Analyzed pixel content
- Tested actual vs. reported locations
- Confirmed coordinates were ~32 percentage points off on the test file

All findings are documented in PROJECT_DOCUMENTATION.md

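The pixel-content analysis can be illustrated with a blank-region check. This is not one of the 11 diagnostic scripts; it is a dependency-free sketch that assumes a crop is given as rows of 8-bit grayscale values, and the names `white_fraction` and `is_blank` and both thresholds are hypothetical. A crop like the ones produced from the VLM coordinates would report a white fraction of 1.0.

```python
from typing import Sequence


def white_fraction(pixels: Sequence[Sequence[int]], white_threshold: int = 250) -> float:
    """Fraction of pixels at or above `white_threshold` (0-255 grayscale)."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        raise ValueError("empty crop")
    white = sum(1 for row in pixels for value in row if value >= white_threshold)
    return white / total


def is_blank(pixels: Sequence[Sequence[int]], max_ink_fraction: float = 0.001) -> bool:
    """Treat a crop as blank when almost none of its pixels carry ink."""
    return 1.0 - white_fraction(pixels) <= max_ink_fraction


blank_crop = [[255] * 10 for _ in range(10)]
inked_crop = [[255] * 10 for _ in range(10)]
inked_crop[5][2:8] = [0] * 6  # a short dark stroke

print(is_blank(blank_crop), is_blank(inked_crop))
```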
---

## 🚧 Known Issues & Next Steps

### Issue 1: 30% Missing Signatures
**Status:** Open
**Options:**
1. Widen CV parameter ranges (test with different thresholds)
2. Multi-pass detection with different kernels
3. Ask the VLM for help when signatures are missing
4. Manual review of the rejected folder

### Issue 2: Text Layer Method Untested
**Status:** Pending
**Need:** PDFs with searchable text to test Method A

### Issue 3: Performance (24 sec/PDF)
**Status:** Acceptable for now
**Future:** Optimize before processing the full 86K dataset

---

## 📊 Test Data Reference

### Test Files Used (5 PDFs)
```
201301_1324_AI1_page3.pdf - ✅ Found 2/2: 楊智惠, 張志銘
201301_2061_AI1_page5.pdf - ⚠️ Found 1/2: 廖阿甚 (missing 林姿妤)
201301_2458_AI1_page4.pdf - ⚠️ Found 1/2: 周寶蓮 (missing 魏興海) ← VLM coordinate test file
201301_2923_AI1_page3.pdf - ⚠️ Found 1/2: 黄瑞展 (missing 陈丽琦)
201301_3189_AI1_page3.pdf - ✅ Found 2/2: 黄辉, 黄益辉
```

### Output Location
```
/Volumes/NV2/PDF-Processing/signature-image-output/signatures/
├── 201301_1324_AI1_page3_signature_張志銘.png
├── 201301_1324_AI1_page3_signature_楊智惠.png
├── 201301_2061_AI1_page5_signature_廖阿甚.png
├── 201301_2458_AI1_page4_signature_周寶蓮.png
├── 201301_2923_AI1_page3_signature_黄瑞展.png
├── 201301_3189_AI1_page3_signature_黄辉.png
├── 201301_3189_AI1_page3_signature_黄益辉.png
└── rejected/ (non-signature regions)
```

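The file names in the listing follow the pattern `<pdf stem>_signature_<name>.png`. A sketch of building such a path is shown here; `signature_output_path` is a hypothetical helper, and the check against path separators in the name is an added assumption rather than documented behavior of the script.

```python
from pathlib import Path


def signature_output_path(pdf_path: str, name: str, output_dir: str) -> Path:
    """Build '<output_dir>/<pdf stem>_signature_<name>.png'."""
    # Assumption: names come from the VLM, so guard against empty or unsafe values.
    if not name or "/" in name or "\\" in name:
        raise ValueError(f"unusable signature name: {name!r}")
    stem = Path(pdf_path).stem  # file name without directory or '.pdf'
    return Path(output_dir) / f"{stem}_signature_{name}.png"


path = signature_output_path(
    "/Volumes/NV2/PDF-Processing/signature-image-output/201301_2458_AI1_page4.pdf",
    "周寶蓮",
    "/Volumes/NV2/PDF-Processing/signature-image-output/signatures",
)
print(path.name)
```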
---

## 💡 How to Continue Work

### Option 1: Improve Recall (Find Missing Signatures)

**Goal:** Get from 70% to 90%+ recall

**Approach:**
1. Review the rejected folder to see whether the missing signatures were detected but rejected
2. Adjust CV parameters in `detect_signature_regions_cv()`:
   ```python
   MIN_CONTOUR_AREA = 3000    # Lower threshold
   MAX_CONTOUR_AREA = 300000  # Higher threshold
   ```
3. Test on the same 5 PDFs and compare results
4. If recall improves without too many false positives, proceed

**Files to edit:**
- `extract_signatures_hybrid.py` lines 178-214

### Option 2: Scale Up Testing

**Goal:** Test on 100 PDFs to verify reliability

**Approach:**
1. Edit `extract_signatures_hybrid.py` line 425:
   ```python
   pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]
   ```
2. Run the script (about 40 minutes at ~24 sec/PDF)
3. Analyze results in the log file
4. Calculate overall recall/precision

### Option 3: Prepare for Production

**Goal:** Process all 86,073 files

**Requirements:**
1. Verify the current approach is acceptable (70% recall OK?)
2. Estimate time: 86K files × 24 sec/file = ~24 days of sequential processing
3. Consider parallel processing or optimization
4. Set up monitoring and resume capability

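The time estimate in the requirements can be reproduced with a short calculation. This is a back-of-the-envelope sketch: `estimated_days` is a hypothetical helper, and the `workers` parameter assumes perfectly linear scaling, which is unlikely to hold if every worker shares the same Ollama server.

```python
SECONDS_PER_DAY = 86_400


def estimated_days(num_files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Wall-clock days to process `num_files`, assuming ideal parallel scaling."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return num_files * seconds_per_file / workers / SECONDS_PER_DAY


print(f"sequential: {estimated_days(86_073, 24):.1f} days")
print(f"4 workers (ideal): {estimated_days(86_073, 24, workers=4):.1f} days")
print(f"100-file test: {estimated_days(100, 24) * 24 * 60:.0f} minutes")
```

Treat the multi-worker figure as a lower bound; measuring throughput on the 100-file test would give a more realistic number.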
### Option 4: Commit Current State

**Goal:** Save working solution to git

**Steps:**
1. Read `COMMIT_SUMMARY.md`
2. Review files to commit
3. Run verification checks
4. Execute git commands
5. Tag release: `v1.0-hybrid-70percent`

---

## 🔍 How to Debug Issues

### If extraction fails:
```bash
# Check Ollama connection
curl http://192.168.30.36:11434/api/tags

# Check input PDFs exist
ls /Volumes/NV2/PDF-Processing/signature-image-output/*.pdf | head -5

# Run with single file for testing
python -c "from extract_signatures_hybrid import *; process_pdf_page('/path/to/test.pdf', OUTPUT_PATH)"
```

### If too many false positives:
- Increase `MIN_CONTOUR_AREA` (filter out small regions)
- Decrease `MAX_CONTOUR_AREA` (filter out large regions)
- Check rejected folder to verify they're actually non-signatures

### If missing signatures:
- Check rejected folder (might be detected but not verified)
- Lower `MIN_CONTOUR_AREA` (catch smaller signatures)
- Increase `MAX_CONTOUR_AREA` (catch larger signatures)
- Widen aspect ratio range

---

## 📋 Session Handoff Checklist

When starting a new session, provide this context:

- ✅ **Project Goal:** Extract Chinese signatures from 86K PDFs
- ✅ **Current Approach:** Hybrid VLM name + CV detection + VLM verification
- ✅ **Status:** Working at 70% recall, 100% precision on 5 test files
- ✅ **Key Context:** VLM coordinates unreliable (~32 percentage point offset), use names instead
- ✅ **Key Files:** extract_signatures_hybrid.py (main), PROJECT_DOCUMENTATION.md (history)
- ✅ **Next Steps:** Improve recall OR scale up testing OR commit to git

---

## 🎓 Important Lessons Learned

1. **VLM spatial reasoning is unreliable** - Don't trust percentage-based coordinates
2. **VLM text recognition is excellent** - Use for extracting names, not locations
3. **Computer vision is precise** - Use for pixel-level location detection
4. **Name-based verification works** - Filters false positives effectively
5. **Diagnostic scripts are crucial** - Helped discover the coordinate offset issue
6. **Conservative parameters** - Better to miss signatures than to accept false positives

---

## 📞 Quick Reference

### Most Important Command
```bash
python extract_signatures_hybrid.py  # Run signature extraction
```

### Most Important File
```bash
less PROJECT_DOCUMENTATION.md  # Complete project history
```

### Most Important Finding
**VLM coordinates are unreliable → Use VLM for names, CV for locations**

---

## ✨ Session Start Template

**When starting a new session, say:**

> "I'm continuing work on the PDF signature extraction project. Please read `/Volumes/NV2/pdf_recognize/SESSION_INIT.md` and `/Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md` to understand the current state.
>
> Current status: Working hybrid approach with 70% recall on 5 test files.
>
> I want to: [choose one]
> - Improve recall by tuning CV parameters
> - Test on 100 PDFs to verify reliability
> - Commit current solution to git
> - Process full 86K dataset
> - Debug a specific issue: [describe]"

---

**Document Created:** October 26, 2025
**Last Updated:** October 26, 2025
**Status:** Ready for Next Session
**Working Directory:** `/Volumes/NV2/pdf_recognize/`