Implement VLM name extraction + CV detection hybrid approach to
replace unreliable VLM coordinate system with name-based verification.
Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling
Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)
Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs
Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF
Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)
🤖 Generated with Claude Code
# Hybrid Signature Extraction

This script uses a **hybrid approach** that combines VLM (Vision Language Model) name recognition with computer vision detection.

## Key Innovation

Instead of relying on the VLM's unreliable coordinate output, we:

1. **Use the VLM for name extraction** (what it's good at)
2. **Use computer vision for location detection** (precise pixel-level detection)
3. **Use the VLM for name-specific verification** (matching signatures to people)

## Workflow

```
┌─────────────────────────────────────────┐
│ Step 1: VLM extracts signature names    │
│ Example: "周寶蓮", "魏興海"             │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Step 2a: Search PDF text layer          │
│ - If names found in PDF text objects    │
│ - Use precise text coordinates          │
│ - Expand region to capture nearby sig   │
│                                         │
│ Step 2b: Fallback to Computer Vision    │
│ - If no text layer or names not found   │
│ - Use OpenCV to detect signature regions│
│ - Based on size, density, morphology    │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Step 3: Extract all candidate regions   │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Step 4: VLM verifies EACH region        │
│ "Does this contain signature of:        │
│  周寶蓮, 魏興海?"                       │
│                                         │
│ - If matches: Save as signature_周寶蓮  │
│ - If duplicate: Reject                  │
│ - If no match: Move to rejected/        │
└─────────────────────────────────────────┘
```

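The Step 4 routing in the diagram can be sketched as below. This is only an illustration, not the script's actual code: `route_regions`, `RoutingResult`, and the `(region_id, matched_name)` verdict format are hypothetical names, and the VLM call itself is assumed to have already happened.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingResult:
    saved: dict = field(default_factory=dict)     # person name -> region id
    rejected: list = field(default_factory=list)  # (region id, reason) pairs


def route_regions(expected_names, verdicts):
    """Route each verified region to 'saved' or 'rejected'.

    verdicts: list of (region_id, matched_name or None), i.e. the
    (hypothetical) parsed answer of the VLM for each candidate region.
    """
    result = RoutingResult()
    for region_id, name in verdicts:
        if name is None or name not in expected_names:
            result.rejected.append((region_id, "no_match"))
        elif name in result.saved:
            # This person's signature was already found in an earlier region.
            result.rejected.append((region_id, "duplicate"))
        else:
            result.saved[name] = region_id
    return result


demo = route_regions(
    ["周寶蓮", "魏興海"],
    [(0, "周寶蓮"), (1, None), (2, "周寶蓮")],
)
print(demo.saved)
print(demo.rejected)
```

Keeping the routing separate from the VLM call makes the duplicate and rejection rules testable without a running model.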
## Advantages

- ✅ **More reliable** - Uses VLM for names, not unreliable coordinates
- ✅ **Name-based verification** - Matches specific signatures to specific people
- ✅ **Prevents duplicates** - Tracks which signatures already found
- ✅ **Better organization** - Files named by person: `signature_周寶蓮.png`
- ✅ **Handles both scenarios** - PDFs with/without text layer
- ✅ **Fewer false positives** - Only saves verified signatures

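To make the "both scenarios" point concrete, here is a minimal sketch of choosing between the text-layer path and the computer-vision fallback, and of expanding a text hit into a crop region. The helper names, the padding values, and the A4 page size are assumptions for illustration. In a real run the hit rectangles would come from something like PyMuPDF's `page.search_for(name)`, which returns rectangles in PDF points.

```python
DPI = 300  # matches the rendering resolution used by the script


def choose_method(text_hits):
    """text_hits maps each expected name to a list of (x0, y0, x1, y1) rects."""
    return "text_layer" if any(text_hits.values()) else "computer_vision"


def expand_to_pixels(rect_pts, page_w_pts, page_h_pts, pad_x=120.0, pad_y=40.0):
    """Grow a text rectangle (in PDF points) so it also covers a nearby
    signature, clamp it to the page, and convert it to pixel coordinates.

    pad_x and pad_y are hypothetical margins, not values from the script.
    """
    x0, y0, x1, y1 = rect_pts
    x0 = max(0.0, x0 - pad_x)
    y0 = max(0.0, y0 - pad_y)
    x1 = min(page_w_pts, x1 + pad_x)
    y1 = min(page_h_pts, y1 + pad_y)
    # 72 PDF points per inch, DPI pixels per inch.
    return tuple(round(v * DPI / 72) for v in (x0, y0, x1, y1))


hits = {"周寶蓮": [(100.0, 700.0, 160.0, 715.0)], "魏興海": []}
method = choose_method(hits)
region = expand_to_pixels(hits["周寶蓮"][0], page_w_pts=595.0, page_h_pts=842.0)
print(method, region)
```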
## Configuration

Edit these values in `extract_signatures_hybrid.py`:

```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"

DPI = 300  # Resolution for PDF rendering
```

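As a rough illustration of how the Ollama settings might be used for the name-extraction call, the sketch below builds a request body for Ollama's `/api/generate` endpoint and parses a reply. The prompt wording, the JSON-array reply convention, and the helper names `build_name_request` and `parse_names` are assumptions, not taken from the script.

```python
import base64
import json
import re

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"


def build_name_request(image_bytes: bytes) -> dict:
    """Build the JSON body for POST {OLLAMA_URL}/api/generate."""
    return {
        "model": OLLAMA_MODEL,
        # Hypothetical prompt; the real script may word this differently.
        "prompt": (
            "List the names of the people who signed this page. "
            "Reply with a JSON array of strings only."
        ),
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def parse_names(reply_text: str) -> list:
    """Pull the first JSON array out of the reply, tolerating code fences."""
    match = re.search(r"\[.*?\]", reply_text, flags=re.DOTALL)
    if not match:
        return []
    try:
        names = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [n.strip() for n in names if isinstance(n, str) and n.strip()]


payload = build_name_request(b"fake-png-bytes")
names = parse_names('```json\n["周寶蓮", "魏興海"]\n```')
print(names)
```

With the Requests dependency, the payload could then be sent with `requests.post(f"{OLLAMA_URL}/api/generate", json=payload)`, and the model's text read from the `response` field of the returned JSON.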
## Usage

```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_signatures_hybrid.py
```

## Test Results (5 PDFs)

| File | Expected | Found | Names Extracted |
|------|----------|-------|-----------------|
| 201301_1324_AI1_page3 | 2 | 2 ✓ | 楊智惠, 張志銘 |
| 201301_2061_AI1_page5 | 2 | 1 ⚠️ | 廖阿甚 (missing 林姿妤) |
| 201301_2458_AI1_page4 | 2 | 1 ⚠️ | 周寶蓮 (missing 魏興海) |
| 201301_2923_AI1_page3 | 2 | 1 ⚠️ | 黄瑞展 (missing 陈丽琦) |
| 201301_3189_AI1_page3 | 2 | 2 ✓ | 黄辉, 黄益辉 |
| **Total** | **10** | **7** | **70% recall** |

**Comparison with previous approach:**
- Old VLM coordinate method: 44 extractions (many false positives, blank regions)
- New hybrid method: 7 extractions (all verified, no blank regions)

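The recall figure in the table, and the precision implied by "all verified", can be reproduced with a few lines of arithmetic. The per-file counts are copied from the table; treating the false-positive count as zero follows the reported result of this test run.

```python
# (expected, found) per test file, copied from the table above
results = {
    "201301_1324_AI1_page3": (2, 2),
    "201301_2061_AI1_page5": (2, 1),
    "201301_2458_AI1_page4": (2, 1),
    "201301_2923_AI1_page3": (2, 1),
    "201301_3189_AI1_page3": (2, 2),
}

expected_total = sum(expected for expected, _ in results.values())
found_total = sum(found for _, found in results.values())
false_positives = 0  # none were reported for this run

recall = found_total / expected_total
precision = found_total / (found_total + false_positives)
print(f"recall={recall:.0%} precision={precision:.0%}")
```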
## Why Some Signatures Are Missed

The current CV detection parameters may be too conservative:

```python
# Filter by area (signatures are medium-sized)
if 5000 < area < 200000:  # May need adjustment

# Filter by aspect ratio
if 0.5 < aspect_ratio < 10:  # May need widening
```

**Options to improve recall:**
1. Widen CV detection parameters (may increase false positives)
2. Add multiple passes with different parameters
3. Use VLM to suggest additional search regions if expected signatures not found

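Option 2 (multiple passes) could look roughly like the sketch below, which works on plain bounding boxes so it runs without OpenCV. The first pass uses the thresholds quoted above. The second, looser pass and the use of bounding-box area (rather than contour area) are assumptions for illustration only.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: int
    y: int
    w: int
    h: int


PASSES = [
    # Current thresholds from the script excerpt
    {"min_area": 5000, "max_area": 200000, "min_ratio": 0.5, "max_ratio": 10},
    # Hypothetical looser second pass
    {"min_area": 2500, "max_area": 300000, "min_ratio": 0.3, "max_ratio": 15},
]


def passes_filter(box, params):
    area = box.w * box.h
    ratio = box.w / box.h if box.h else 0.0
    return (params["min_area"] < area < params["max_area"]
            and params["min_ratio"] < ratio < params["max_ratio"])


def select_candidates(boxes, needed):
    """Run the passes in order and stop once `needed` candidates are found."""
    selected = []
    for params in PASSES:
        for box in boxes:
            if box not in selected and passes_filter(box, params):
                selected.append(box)
        if len(selected) >= needed:
            break
    return selected


boxes = [Box(100, 900, 300, 80), Box(500, 900, 90, 40), Box(0, 0, 2000, 20)]
candidates = select_candidates(boxes, needed=2)
print(candidates)
```

Because the looser pass only runs when the strict pass finds fewer regions than the VLM reported names, pages that already work are not exposed to extra false positives.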
## Output Files

### Extracted Signatures
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/`

**Naming:** `{pdf_name}_signature_{person_name}.png`

Examples:
- `201301_2458_AI1_page4_signature_周寶蓮.png`
- `201301_1324_AI1_page3_signature_張志銘.png`

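The naming pattern can be expressed as a small helper. This is a sketch: `signature_filename` is a hypothetical name, and replacing path separators in the person name is an added precaution that the script may or may not take.

```python
from pathlib import Path


def signature_filename(pdf_path, person_name):
    """Build '{pdf_name}_signature_{person_name}.png' from a PDF path."""
    # Assumption: strip whitespace and neutralise path separators in the name.
    safe_name = person_name.strip().replace("/", "_").replace("\\", "_")
    return f"{Path(pdf_path).stem}_signature_{safe_name}.png"


example = signature_filename("201301_2458_AI1_page4.pdf", "周寶蓮")
print(example)
```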
### Rejected Regions
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected/`

Contains regions that:
- Don't match any expected signatures
- Are duplicates of already-found signatures

### Log File
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/hybrid_extraction_log_YYYYMMDD_HHMMSS.csv`

Columns:
- `pdf_filename` - Source PDF
- `signatures_found` - Number of verified signatures
- `method_used` - "text_layer" or "computer_vision"
- `extracted_files` - List of saved filenames
- `error` - Error message if any

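A log with these columns could be written with the standard `csv` module, as in the sketch below. The helper names and the choice to join `extracted_files` with `;` are assumptions; the script may serialise the list differently.

```python
import csv
import io
from datetime import datetime

LOG_COLUMNS = ["pdf_filename", "signatures_found", "method_used",
               "extracted_files", "error"]


def log_filename(now):
    """Timestamped name in the hybrid_extraction_log_YYYYMMDD_HHMMSS.csv form."""
    return f"hybrid_extraction_log_{now:%Y%m%d_%H%M%S}.csv"


def write_log(rows, stream):
    writer = csv.DictWriter(stream, fieldnames=LOG_COLUMNS)
    writer.writeheader()
    for row in rows:
        record = dict(row)
        # Assumption: the list of saved files is flattened with ';'.
        record["extracted_files"] = ";".join(row.get("extracted_files", []))
        writer.writerow(record)


buffer = io.StringIO()  # stands in for an open file in this example
write_log(
    [{
        "pdf_filename": "201301_1324_AI1_page3.pdf",
        "signatures_found": 2,
        "method_used": "computer_vision",
        "extracted_files": [
            "201301_1324_AI1_page3_signature_楊智惠.png",
            "201301_1324_AI1_page3_signature_張志銘.png",
        ],
        "error": "",
    }],
    buffer,
)
log_lines = buffer.getvalue().splitlines()
print(log_filename(datetime.now()))
print(log_lines[0])
```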
## Performance

- Processing speed: ~2-3 PDFs per minute (depends on VLM API latency)
- VLM calls per PDF: 1 (name extraction) + N (region verification)
- For 5 test PDFs: ~2 minutes total

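These figures allow a back-of-the-envelope estimate for larger runs. The sketch below assumes the ~24 seconds per PDF and the 86,073-file dataset size mentioned in the commit notes, and assumes throughput stays constant and sequential, which may not hold in practice.

```python
SECONDS_PER_PDF = 24  # approximate figure from the commit notes


def vlm_calls(n_regions):
    """One name-extraction call plus one verification call per region."""
    return 1 + n_regions


def estimated_hours(n_pdfs, seconds_per_pdf=SECONDS_PER_PDF):
    return n_pdfs * seconds_per_pdf / 3600


test_minutes = round(estimated_hours(5) * 60, 1)  # the 5-PDF test run
full_hours = round(estimated_hours(86_073))       # the full dataset
print(vlm_calls(6), test_minutes, full_hours)
```

At that rate the full dataset would take roughly 574 hours, or about 24 days, of sequential processing, so batching or parallel VLM workers would be worth considering before a full run.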
## Next Steps

To process the larger 100-file set extracted from the CSV, change the slice on the file list:

```python
# Edit this line in extract_signatures_hybrid.py
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]  # Test runs used [:5]; drop the slice to process every PDF
```

## Troubleshooting

**No signatures extracted:**
- Check Ollama connection: `curl http://192.168.30.36:11434/api/tags`
- Verify PDF files exist in input directory
- Check if PDF is readable (not corrupted)

**Too many false positives:**
- Tighten CV detection parameters (increase `MIN_CONTOUR_AREA`)
- Reduce `MAX_CONTOUR_AREA`
- Adjust aspect ratio filters

**Missing expected signatures:**
- Loosen CV detection parameters
- Check rejected folder to see if signature was detected but not verified
- Reduce minimum area threshold
- Increase maximum area threshold

## Dependencies

- Python 3.9+
- PyMuPDF (fitz)
- OpenCV (cv2)
- NumPy
- Requests (for Ollama API)
- Ollama with qwen2.5vl:32b model