gbanyan 52612e14ba Add hybrid signature extraction with name-based verification
Implement a hybrid VLM name extraction + CV detection approach that replaces
the unreliable VLM coordinate method with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code
2025-10-26 23:39:52 +08:00

# Session Initialization - PDF Signature Extraction Project
**Purpose:** This document helps you (or another Claude instance) quickly understand the project state and continue working.
---
## Project Quick Summary
**Goal:** Extract handwritten Chinese signatures from 86,073 PDF documents automatically.
**Current Status:** ✅ Working solution with 70% recall, 100% precision (tested on 5 PDFs)
**Approach:** Hybrid VLM name extraction + Computer Vision detection + VLM verification
---
## 🚀 Quick Start (Resume Work)
### If you want to continue testing:
```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
# Test with more files (edit line 425 in script)
python extract_signatures_hybrid.py
```
### If you want to review what was done:
```bash
# Read the complete history
less PROJECT_DOCUMENTATION.md
# Check test results
ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
```
### If you want to commit to git:
```bash
# Follow the guide
less COMMIT_SUMMARY.md
```
---
## 📁 Key Files (What Each Does)
### Production Scripts ✅
- **extract_pages_from_csv.py** - Step 1: Extract pages from CSV (tested: 100 files)
- **extract_signatures_hybrid.py** - Step 2: Extract signatures (CURRENT WORKING, tested: 5 files)
- **extract_handwriting.py** - CV-only approach (component used in hybrid)
### Documentation 📚
- **PROJECT_DOCUMENTATION.md** - ⭐ READ THIS FIRST - Complete history of all 5 approaches tested
- **README.md** - Quick start guide
- **COMMIT_SUMMARY.md** - Git commit instructions
- **SESSION_INIT.md** - This file (for session continuity)
### Configuration ⚙️
- **.gitignore** - Excludes diagnostic scripts and test outputs
---
## 🎯 Current Working Solution
### Architecture
```
1. VLM extracts signature names: "周寶蓮", "魏興海"
2. CV detects signature-like regions (5K-200K pixels)
3. VLM verifies each region against expected names
4. Save verified signatures: signature_周寶蓮.png
```
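A minimal Python sketch of that flow, assuming a PIL page image and treating the three steps as injected callables. The function name `run_hybrid_pipeline` and the callable signatures are illustrative, not the actual API of `extract_signatures_hybrid.py`:
```python
# Sketch only: function and parameter names are illustrative, not the actual
# API of extract_signatures_hybrid.py.
from pathlib import Path
from typing import Callable

def run_hybrid_pipeline(
    page_image,                              # PIL.Image of the rendered PDF page
    pdf_stem: str,                           # PDF filename without extension
    output_dir: Path,
    extract_names: Callable,                 # Step 1: VLM -> list of signer names
    detect_regions: Callable,                # Step 2: CV  -> iterable of (x, y, w, h)
    verify_region: Callable,                 # Step 3: VLM -> matched name or None
) -> dict:
    """Run steps 1-4 and return {person name: saved PNG path}."""
    expected_names = extract_names(page_image)
    saved = {}
    rejected_dir = output_dir / "rejected"
    rejected_dir.mkdir(parents=True, exist_ok=True)

    for x, y, w, h in detect_regions(page_image):
        crop = page_image.crop((x, y, x + w, y + h))
        name = verify_region(crop, expected_names)
        if name and name not in saved:       # Step 4: keep one signature per person
            path = output_dir / f"{pdf_stem}_signature_{name}.png"
            crop.save(path)
            saved[name] = path
        else:                                # no match or duplicate -> rejected folder
            crop.save(rejected_dir / f"{pdf_stem}_{x}_{y}.png")
    return saved
```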
### Test Results (5 PDFs)
| Metric | Value |
|--------|-------|
| Expected signatures | 10 |
| Found signatures | 7 |
| Recall | 70% |
| Precision | 100% |
| False positives | 0 |
### Why Are 30% Missed?
- Computer vision parameters are too conservative
- Some signatures fall outside the 5K-200K pixel area range
- The aspect ratio filter (0.5-10) may exclude some signatures
---
## ⚠️ Critical Context (What You MUST Know)
### 1. VLM Coordinate System is UNRELIABLE ❌
**Discovery:** VLM (qwen2.5vl:32b) provides inaccurate coordinates.
**Example:**
- VLM said signatures at: top=58%, top=68%
- Actual location: top=26%
- Error: ~32% offset (NOT consistent across files!)
**Test file:** `201301_2458_AI1_page4.pdf`
- VLM correctly identifies 2 signatures: "周寶蓮", "魏興海"
- Cropping at the VLM-reported coordinates yields 100% white/blank regions
- This is why the coordinate-based approach was abandoned
**Evidence:** See diagnostic scripts and results in PROJECT_DOCUMENTATION.md
### 2. Name-Based Approach is the Solution ✅
Instead of using VLM coordinates, split the work by what each tool does well (a name-extraction call is sketched after this list):
- ✅ Use VLM to extract **names** (reliable)
- ✅ Use CV to find **locations** (pixel-accurate)
- ✅ Use VLM to **verify** each region against names (accurate)
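For illustration, a minimal name-extraction call against the Ollama `/api/generate` endpoint. The prompt wording and response parsing are assumptions; the real prompts live in `extract_signatures_hybrid.py`:
```python
import base64
import requests

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"

def extract_signature_names(image_path: str) -> list[str]:
    """Ask the VLM only for the handwritten signer names, never for coordinates."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": OLLAMA_MODEL,
            "prompt": (
                "List the handwritten signature names on this page, "
                "one name per line. Output only the names."
            ),
            "images": [image_b64],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["response"]
    return [line.strip() for line in text.splitlines() if line.strip()]
```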
### 3. All Test PDFs Are Scanned Images
- No searchable text layer
- PDF text layer method (Method A) is **untested**
- All current results use CV detection (Method B)
---
## 🔧 Configuration Details
### Ollama Setup
```python
OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
```
**Verify connection:**
```bash
curl http://192.168.30.36:11434/api/tags
```
### File Paths
```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = ".../signatures/rejected"
```
### CV Detection Parameters (adjust to improve recall)
```python
# In extract_signatures_hybrid.py, detect_signature_regions_cv()
MIN_CONTOUR_AREA = 5000 # ⬇️ Lower = catch smaller signatures
MAX_CONTOUR_AREA = 200000 # ⬆️ Higher = catch larger signatures
ASPECT_RATIO_MIN = 0.5 # ⬇️ Lower = catch taller signatures
ASPECT_RATIO_MAX = 10.0 # ⬆️ Higher = catch wider signatures
```
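A minimal sketch of how these thresholds might be applied inside a `detect_signature_regions_cv()`-style function, assuming a standard OpenCV pipeline of Otsu thresholding, dilation, and contour filtering. The kernel size and the use of bounding-box area as the area measure are illustrative; the real implementation may differ:
```python
import cv2
import numpy as np

MIN_CONTOUR_AREA = 5000
MAX_CONTOUR_AREA = 200000
ASPECT_RATIO_MIN = 0.5
ASPECT_RATIO_MAX = 10.0

def detect_signature_regions_cv(page_bgr: np.ndarray) -> list[tuple[int, int, int, int]]:
    """Return candidate (x, y, w, h) boxes that look like handwritten strokes."""
    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)
    # Invert so ink becomes white, then binarize with Otsu.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Dilate horizontally so separate strokes of one signature merge into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 9))
    merged = cv2.dilate(binary, kernel, iterations=1)

    contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        area = w * h                       # bounding-box area as a simple proxy
        aspect = w / h if h else 0
        if (MIN_CONTOUR_AREA <= area <= MAX_CONTOUR_AREA
                and ASPECT_RATIO_MIN <= aspect <= ASPECT_RATIO_MAX):
            boxes.append((x, y, w, h))
    return boxes
```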
---
## 🎬 What Happened (Session History)
### Approaches Tested (Chronological)
1. **PDF Image Objects** → Abandoned (extracted full pages, not signatures)
2. **Simple Page Extraction** → ✅ Working (extract pages from CSV)
3. **Computer Vision Only** → Insufficient (6,420 candidate regions from 100 pages, far too many to review)
4. **VLM Coordinates** → ❌ Failed (coordinates unreliable, extracted blank regions)
5. **Hybrid Name-Based** → ✅ Current (70% recall, 100% precision)
### Key Decisions Made
✅ Use VLM for names, not coordinates
✅ Verify each region against expected names
✅ Save signatures with person names
✅ Reject regions that don't match any name
✅ Prevent duplicate signatures per person
### Diagnostic Work Done
Created 11 diagnostic scripts to investigate the VLM coordinate failure:
- Visualized bounding boxes
- Analyzed pixel content
- Tested actual vs. reported locations
- Confirmed coordinates were ~32% off on the test file
All findings are documented in PROJECT_DOCUMENTATION.md; a simplified pixel-content check is sketched below.
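A simplified version of that pixel-content check, assuming the page is a grayscale NumPy array and the VLM box is given as top/left/width/height percentages. The box format and the 245 white threshold are assumptions:
```python
import numpy as np

def blank_ratio(page_gray: np.ndarray, box_pct: dict, white_threshold: int = 245) -> float:
    """Fraction of near-white pixels inside a VLM-reported percentage box.

    Values close to 1.0 mean the reported region is blank paper,
    i.e. the coordinates do not point at a signature.
    """
    h, w = page_gray.shape[:2]
    top = int(h * box_pct["top"] / 100)
    left = int(w * box_pct["left"] / 100)
    bottom = top + int(h * box_pct["height"] / 100)
    right = left + int(w * box_pct["width"] / 100)
    crop = page_gray[top:bottom, left:right]
    if crop.size == 0:
        return 1.0
    return float((crop >= white_threshold).mean())

# On 201301_2458_AI1_page4 the VLM claimed a signature around top=58%, but the
# actual signatures sit near top=26%; the claimed box comes back ~100% white.
```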
---
## 🚧 Known Issues & Next Steps
### Issue 1: 30% Missing Signatures
**Status:** Open
**Options:**
1. Widen CV parameter ranges (test with different thresholds)
2. Multi-pass detection with different kernels
3. Ask the VLM for help when signatures are missing
4. Manual review of rejected folder
### Issue 2: Text Layer Method Untested
**Status:** Pending
**Need:** PDFs with searchable text to test Method A
### Issue 3: Performance (24 sec/PDF)
**Status:** Acceptable for now
**Future:** Optimize if processing full 86K dataset
---
## 📊 Test Data Reference
### Test Files Used (5 PDFs)
```
201301_1324_AI1_page3.pdf - ✅ Found 2/2: 楊智惠, 張志銘
201301_2061_AI1_page5.pdf - ⚠️ Found 1/2: 廖阿甚 (missing 林姿妤)
201301_2458_AI1_page4.pdf - ⚠️ Found 1/2: 周寶蓮 (missing 魏興海) ← VLM coordinate test file
201301_2923_AI1_page3.pdf - ⚠️ Found 1/2: 黄瑞展 (missing 陈丽琦)
201301_3189_AI1_page3.pdf - ✅ Found 2/2: 黄辉, 黄益辉
```
### Output Location
```
/Volumes/NV2/PDF-Processing/signature-image-output/signatures/
├── 201301_1324_AI1_page3_signature_張志銘.png
├── 201301_1324_AI1_page3_signature_楊智惠.png
├── 201301_2061_AI1_page5_signature_廖阿甚.png
├── 201301_2458_AI1_page4_signature_周寶蓮.png
├── 201301_2923_AI1_page3_signature_黄瑞展.png
├── 201301_3189_AI1_page3_signature_黄辉.png
├── 201301_3189_AI1_page3_signature_黄益辉.png
└── rejected/ (non-signature regions)
```
---
## 💡 How to Continue Work
### Option 1: Improve Recall (Find Missing Signatures)
**Goal:** Get from 70% to 90%+ recall
**Approach:**
1. Check the rejected folder to see whether the missing signatures were detected but failed name verification
2. Adjust CV parameters in `detect_signature_regions_cv()`:
```python
MIN_CONTOUR_AREA = 3000 # Lower threshold
MAX_CONTOUR_AREA = 300000 # Higher threshold
```
3. Test on same 5 PDFs and compare results
4. If recall improves without too many false positives, proceed
**Files to edit:**
- `extract_signatures_hybrid.py` lines 178-214
### Option 2: Scale Up Testing
**Goal:** Test on 100 PDFs to verify reliability
**Approach:**
1. Edit `extract_signatures_hybrid.py` line 425:
```python
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]
```
2. Run the script (roughly 40 minutes at ~24 seconds per PDF)
3. Analyze results in log file
4. Calculate overall recall/precision (a minimal scoring sketch follows this list)
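A minimal scoring sketch for step 4, assuming the output files follow the `<pdf_stem>_signature_<name>.png` convention shown above and that expected names are listed in a hand-built ground-truth dict. The dict below only covers three of the five current test PDFs and would need extending for a 100-file run:
```python
from pathlib import Path

OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"

# Ground truth: expected signer names per PDF stem (extend for a larger test run).
EXPECTED = {
    "201301_1324_AI1_page3": {"楊智惠", "張志銘"},
    "201301_2061_AI1_page5": {"廖阿甚", "林姿妤"},
    "201301_2458_AI1_page4": {"周寶蓮", "魏興海"},
}

def score(output_dir: str = OUTPUT_PATH) -> None:
    true_pos = false_pos = expected_total = 0
    for stem, names in EXPECTED.items():
        found = {
            p.stem.split("_signature_")[-1]
            for p in Path(output_dir).glob(f"{stem}_signature_*.png")
        }
        true_pos += len(found & names)
        false_pos += len(found - names)
        expected_total += len(names)
    recall = true_pos / expected_total if expected_total else 0.0
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    print(f"Recall: {recall:.0%}  Precision: {precision:.0%}")
```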
### Option 3: Prepare for Production
**Goal:** Process all 86,073 files
**Requirements:**
1. Confirm the current approach is acceptable (is 70% recall sufficient?)
2. Estimate time: 86K files × 24 sec/file = ~24 days
3. Consider parallel processing or optimization
4. Set up monitoring and resume capability (a parallel-processing sketch with a resume check follows this list)
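A minimal sketch covering requirements 3 and 4, assuming `process_pdf_page()`, `PDF_INPUT_PATH`, and `OUTPUT_PATH` are importable from `extract_signatures_hybrid.py` (as the debug command later in this document suggests) and using "output already exists" as a crude resume check. Note the heuristic misses PDFs whose only outputs were rejected crops:
```python
from multiprocessing import Pool
from pathlib import Path

# Assumption: these names are importable from the existing script.
from extract_signatures_hybrid import process_pdf_page, PDF_INPUT_PATH, OUTPUT_PATH

def already_done(pdf: Path) -> bool:
    """Resume heuristic: skip PDFs that already produced a signature PNG."""
    return any(Path(OUTPUT_PATH).glob(f"{pdf.stem}_signature_*.png"))

def worker(pdf: Path) -> str:
    try:
        process_pdf_page(str(pdf), OUTPUT_PATH)
        return f"OK {pdf.name}"
    except Exception as exc:            # keep the batch alive on single-file failures
        return f"FAIL {pdf.name}: {exc}"

if __name__ == "__main__":
    pdfs = [p for p in sorted(Path(PDF_INPUT_PATH).glob("*.pdf")) if not already_done(p)]
    # Keep the pool small: every worker sends VLM requests to the same Ollama server.
    with Pool(processes=4) as pool:
        for line in pool.imap_unordered(worker, pdfs):
            print(line, flush=True)
```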
### Option 4: Commit Current State
**Goal:** Save working solution to git
**Steps:**
1. Read `COMMIT_SUMMARY.md`
2. Review files to commit
3. Run verification checks
4. Execute git commands
5. Tag release: `v1.0-hybrid-70percent`
---
## 🔍 How to Debug Issues
### If extraction fails:
```bash
# Check Ollama connection
curl http://192.168.30.36:11434/api/tags
# Check input PDFs exist
ls /Volumes/NV2/PDF-Processing/signature-image-output/*.pdf | head -5
# Run with single file for testing
python -c "from extract_signatures_hybrid import *; process_pdf_page('/path/to/test.pdf', OUTPUT_PATH)"
```
### If too many false positives:
- Increase `MIN_CONTOUR_AREA` (filter out small regions)
- Decrease `MAX_CONTOUR_AREA` (filter out large regions)
- Check rejected folder to verify they're actually non-signatures
### If missing signatures:
- Check rejected folder (might be detected but not verified)
- Lower `MIN_CONTOUR_AREA` (catch smaller signatures)
- Increase `MAX_CONTOUR_AREA` (catch larger signatures)
- Widen aspect ratio range
---
## 📋 Session Handoff Checklist
When starting a new session, provide this context:
✅ **Project Goal:** Extract Chinese signatures from 86K PDFs
✅ **Current Approach:** Hybrid VLM name + CV detection + VLM verification
✅ **Status:** Working at 70% recall, 100% precision on 5 test files
✅ **Key Context:** VLM coordinates unreliable (32% offset), use names instead
✅ **Key Files:** extract_signatures_hybrid.py (main), PROJECT_DOCUMENTATION.md (history)
✅ **Next Steps:** Improve recall OR scale up testing OR commit to git
---
## 🎓 Important Lessons Learned
1. **VLM spatial reasoning is unreliable** - Don't trust percentage-based coordinates
2. **VLM text recognition is excellent** - Use for extracting names, not locations
3. **Computer vision is precise** - Use for pixel-level location detection
4. **Name-based verification works** - Filters false positives effectively
5. **Diagnostic scripts are crucial** - Helped discover coordinate offset issue
6. **Conservative parameters** - Better to miss signatures than get false positives
---
## 📞 Quick Reference
### Most Important Command
```bash
python extract_signatures_hybrid.py # Run signature extraction
```
### Most Important File
```bash
less PROJECT_DOCUMENTATION.md # Complete project history
```
### Most Important Finding
**VLM coordinates are unreliable → Use VLM for names, CV for locations**
---
## ✨ Session Start Template
**When starting a new session, say:**
> "I'm continuing work on the PDF signature extraction project. Please read `/Volumes/NV2/pdf_recognize/SESSION_INIT.md` and `/Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md` to understand the current state.
>
> Current status: Working hybrid approach with 70% recall on 5 test files.
>
> I want to: [choose one]
> - Improve recall by tuning CV parameters
> - Test on 100 PDFs to verify reliability
> - Commit current solution to git
> - Process full 86K dataset
> - Debug a specific issue: [describe]"
---
**Document Created:** October 26, 2025
**Last Updated:** October 26, 2025
**Status:** Ready for Next Session
**Working Directory:** `/Volumes/NV2/pdf_recognize/`