Add hybrid signature extraction with name-based verification
Implement a hybrid approach that pairs VLM name extraction with CV region
detection, replacing the unreliable VLM coordinate output with name-based verification.
Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling
Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)
Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs
Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF
Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)
🤖 Generated with Claude Code
179
README_hybrid_extraction.md
Normal file
@@ -0,0 +1,179 @@
# Hybrid Signature Extraction

This script uses a **hybrid approach** combining VLM (Vision Language Model) name recognition with computer vision detection.

## Key Innovation

Instead of relying on the VLM's unreliable coordinate output, we:
1. **Use the VLM for name extraction** (what it is good at)
2. **Use computer vision for location detection** (precise pixel-level detection)
3. **Use the VLM for name-specific verification** (matching signatures to people)

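As an illustration of the name-extraction step, the sketch below builds a request body for Ollama's `/api/generate` endpoint and parses the reply. The prompt wording and the helper names `build_name_request` and `parse_names` are assumptions made for this example, not code taken from `extract_signatures_hybrid.py`.

```python
import base64
import json

OLLAMA_MODEL = "qwen2.5vl:32b"  # same value as in the Configuration section


def build_name_request(image_bytes: bytes, model: str = OLLAMA_MODEL) -> dict:
    """Build an /api/generate payload asking for signer names (hypothetical prompt)."""
    prompt = (
        "List the names of the people who signed this page. "
        "Answer with a JSON array of strings only."
    )
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def parse_names(response_text: str) -> list[str]:
    """Turn the model reply into a clean list of names; return [] if it is not a JSON array."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [n.strip() for n in data if isinstance(n, str) and n.strip()]
```

The payload would typically be sent with `requests.post(f"{OLLAMA_URL}/api/generate", json=payload)`, and the text to parse is the `response` field of the returned JSON.
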
## Workflow

```
┌─────────────────────────────────────────┐
│ Step 1: VLM extracts signature names    │
│ Example: "周寶蓮", "魏興海"             │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Step 2a: Search PDF text layer          │
│ - If names found in PDF text objects    │
│ - Use precise text coordinates          │
│ - Expand region to capture nearby sig   │
│                                         │
│ Step 2b: Fallback to Computer Vision    │
│ - If no text layer or names not found   │
│ - Use OpenCV to detect signature regions│
│ - Based on size, density, morphology    │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Step 3: Extract all candidate regions   │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Step 4: VLM verifies EACH region        │
│ "Does this contain signature of:        │
│  周寶蓮, 魏興海?"                       │
│                                         │
│ - If matches: Save as signature_周寶蓮  │
│ - If duplicate: Reject                  │
│ - If no match: Move to rejected/        │
└─────────────────────────────────────────┘
```

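The accept / duplicate / reject logic of Step 4 can be sketched independently of the VLM. In this minimal example the `verify` callable stands in for the VLM call and the function name `assign_regions` is hypothetical; regions are represented by plain strings so the control flow is easy to follow.

```python
from typing import Callable, Iterable, Optional


def assign_regions(
    regions: Iterable[str],
    expected_names: list[str],
    verify: Callable[[str, list[str]], Optional[str]],
) -> tuple[dict[str, str], list[str]]:
    """Return ({name: region}, rejected_regions) following the Step 4 rules."""
    accepted: dict[str, str] = {}
    rejected: list[str] = []
    for region in regions:
        # verify() returns the matched name, or None when nothing matches
        name = verify(region, expected_names)
        if name is None or name not in expected_names:
            rejected.append(region)  # no match: goes to rejected/
        elif name in accepted:
            rejected.append(region)  # duplicate of an already-found signature
        else:
            accepted[name] = region  # first verified region for this person
    return accepted, rejected


# Usage with a fake verifier that maps region ids to names
fake = {"r1": "周寶蓮", "r2": None, "r3": "周寶蓮", "r4": "魏興海"}
accepted, rejected = assign_regions(
    ["r1", "r2", "r3", "r4"], ["周寶蓮", "魏興海"], lambda r, names: fake[r]
)
```

Keeping the first verified region per name is a design choice of this sketch; the actual script may rank candidates differently.
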
## Advantages

✅ **More reliable** - Uses VLM for names, not unreliable coordinates
✅ **Name-based verification** - Matches specific signatures to specific people
✅ **Prevents duplicates** - Tracks which signatures already found
✅ **Better organization** - Files named by person: `signature_周寶蓮.png`
✅ **Handles both scenarios** - PDFs with/without text layer
✅ **Fewer false positives** - Only saves verified signatures

## Configuration

Edit these values in `extract_signatures_hybrid.py`:

```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"

DPI = 300  # Resolution for PDF rendering
```

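The `DPI` value matters for the text-layer path (Step 2a): PDF text coordinates are expressed in points (1/72 inch), while the rendered page image is in pixels. The sketch below shows one way to convert a text rectangle to pixel coordinates and grow it to capture a nearby signature. The helper names and the margin value are assumptions for illustration; in PyMuPDF the input rectangle would come from something like `page.search_for(name)`.

```python
DPI = 300  # same value as in the configuration above


def pdf_rect_to_pixels(rect, dpi=DPI):
    """Convert (x0, y0, x1, y1) in PDF points (1/72 inch) to pixel coordinates."""
    return tuple(round(v * dpi / 72) for v in rect)


def expand_box(box, margin, width, height):
    """Grow a pixel box by `margin` on every side, clamped to the image bounds."""
    x0, y0, x1, y1 = box
    return (
        max(0, x0 - margin),
        max(0, y0 - margin),
        min(width, x1 + margin),
        min(height, y1 + margin),
    )


# Usage: a text hit at 1-3 inches across and 2-4 inches down, on a 2480x3508 px page
pixel_box = pdf_rect_to_pixels((72, 144, 216, 288))
search_box = expand_box(pixel_box, margin=150, width=2480, height=3508)
```
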
## Usage

```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_signatures_hybrid.py
```

## Test Results (5 PDFs)

| File | Expected | Found | Names Extracted |
|------|----------|-------|----------------|
| 201301_1324_AI1_page3 | 2 | 2 ✓ | 楊智惠, 張志銘 |
| 201301_2061_AI1_page5 | 2 | 1 ⚠️ | 廖阿甚 (missing 林姿妤) |
| 201301_2458_AI1_page4 | 2 | 1 ⚠️ | 周寶蓮 (missing 魏興海) |
| 201301_2923_AI1_page3 | 2 | 1 ⚠️ | 黄瑞展 (missing 陈丽琦) |
| 201301_3189_AI1_page3 | 2 | 2 ✓ | 黄辉, 黄益辉 |
| **Total** | **10** | **7** | **70% recall** |

**Comparison with previous approach:**
- Old VLM coordinate method: 44 extractions (many false positives, blank regions)
- New hybrid method: 7 extractions (all verified, no blank regions)

## Why Some Signatures Are Missed

The current CV detection parameters may be too conservative:

```python
# Filter by area (signatures are medium-sized)
if 5000 < area < 200000:  # May need adjustment

# Filter by aspect ratio
if 0.5 < aspect_ratio < 10:  # May need widening
```

**Options to improve recall:**
1. Widen CV detection parameters (may increase false positives)
2. Add multiple passes with different parameters
3. Use VLM to suggest additional search regions if expected signatures not found

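Option 2 (multiple passes) could look roughly like the sketch below. It assumes that `area` is the bounding-box area `w * h` and that `aspect_ratio` is `w / h`, which the excerpt above does not confirm. The first parameter set uses the current thresholds; the second, looser set is purely hypothetical and would need tuning against real pages.

```python
def filter_boxes(boxes, min_area, max_area, min_ratio, max_ratio):
    """Keep (x, y, w, h) boxes whose area and aspect ratio fall inside the open ranges."""
    kept = []
    for x, y, w, h in boxes:
        if h == 0:
            continue  # avoid division by zero on degenerate boxes
        area = w * h  # assumption: bounding-box area
        aspect_ratio = w / h  # assumption: width over height
        if min_area < area < max_area and min_ratio < aspect_ratio < max_ratio:
            kept.append((x, y, w, h))
    return kept


# Pass 1: current thresholds. Pass 2: hypothetical looser thresholds.
PASSES = [
    (5000, 200000, 0.5, 10),
    (2000, 300000, 0.3, 15),
]


def multi_pass(boxes, expected_count, passes=PASSES):
    """Try each parameter set in order and stop once enough candidates are found."""
    kept = []
    for params in passes:
        kept = filter_boxes(boxes, *params)
        if len(kept) >= expected_count:
            break
    return kept


# Usage: the second box is too small for pass 1 but survives pass 2
boxes = [(0, 0, 200, 50), (0, 0, 60, 50), (0, 0, 10, 10)]
candidates = multi_pass(boxes, expected_count=2)
```

Because every candidate still goes through VLM verification, a looser second pass mainly costs extra VLM calls rather than precision.
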
## Output Files

### Extracted Signatures
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/`

**Naming:** `{pdf_name}_signature_{person_name}.png`

Examples:
- `201301_2458_AI1_page4_signature_周寶蓮.png`
- `201301_1324_AI1_page3_signature_張志銘.png`

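A minimal sketch of the naming rule, assuming `{pdf_name}` is the PDF filename without its extension. The function name `signature_filename` is hypothetical, and the replacement of path separators in the person name is a precaution added for this example rather than documented behavior.

```python
from pathlib import Path


def signature_filename(pdf_path: str, person_name: str) -> str:
    """Build '{pdf_name}_signature_{person_name}.png' from a PDF path and a name."""
    stem = Path(pdf_path).stem  # filename without directory or extension
    # Precaution (assumption): keep path separators out of the output filename
    safe_name = person_name.replace("/", "_").replace("\\", "_")
    return f"{stem}_signature_{safe_name}.png"


# Usage
example = signature_filename("/data/201301_2458_AI1_page4.pdf", "周寶蓮")
```
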
### Rejected Regions
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected/`

Contains regions that:
- Don't match any expected signatures
- Are duplicates of already-found signatures

### Log File
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/hybrid_extraction_log_YYYYMMDD_HHMMSS.csv`

Columns:
- `pdf_filename` - Source PDF
- `signatures_found` - Number of verified signatures
- `method_used` - "text_layer" or "computer_vision"
- `extracted_files` - List of saved filenames
- `error` - Error message if any

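For reference, a log with these columns can be produced with the standard `csv` module. This is a sketch only: the helper names are hypothetical, and joining `extracted_files` with `;` is an assumption, since the serialization of the list is not specified above.

```python
import csv
import io
from datetime import datetime

LOG_COLUMNS = ["pdf_filename", "signatures_found", "method_used", "extracted_files", "error"]


def log_filename(now: datetime) -> str:
    """Build the timestamped log name, e.g. hybrid_extraction_log_YYYYMMDD_HHMMSS.csv."""
    return f"hybrid_extraction_log_{now:%Y%m%d_%H%M%S}.csv"


def write_log(rows, stream) -> None:
    """Write one header line plus one line per processed PDF to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=LOG_COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Usage: write a single row to an in-memory buffer
buffer = io.StringIO()
write_log(
    [{
        "pdf_filename": "201301_2458_AI1_page4.pdf",
        "signatures_found": 1,
        "method_used": "computer_vision",
        # Assumption: multiple filenames are joined with ';'
        "extracted_files": "201301_2458_AI1_page4_signature_周寶蓮.png",
        "error": "",
    }],
    buffer,
)
```

When writing to a real file, open it with `newline=""` and `encoding="utf-8"` so that the CJK names and line endings are stored correctly.
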
## Performance

- Processing speed: ~2-3 PDFs per minute (depends on VLM API latency)
- VLM calls per PDF: 1 (name extraction) + N (region verification)
- For 5 test PDFs: ~2 minutes total

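These figures allow a rough runtime estimate for larger batches. The sketch below assumes strictly sequential processing at about 24 seconds per PDF (the per-file figure from the commit summary, consistent with 2-3 PDFs per minute); both helper names are hypothetical.

```python
def vlm_calls(num_regions: int) -> int:
    """One name-extraction call plus one verification call per candidate region."""
    return 1 + num_regions


def estimate_hours(num_pdfs: int, seconds_per_pdf: float = 24.0) -> float:
    """Rough wall-clock estimate in hours, assuming sequential processing."""
    return num_pdfs * seconds_per_pdf / 3600


# Usage
hours_for_100 = estimate_hours(100)
hours_for_full = estimate_hours(86_073)
```

At this rate 100 files take about 40 minutes, and the 86,073-file dataset about 574 hours (roughly 24 days), so batching or parallel VLM requests would likely be needed for the full run.
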
## Next Steps

To process the full set of 100 files extracted from the CSV:

```python
# Edit line in extract_signatures_hybrid.py
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]  # Replaces the [:5] test limit; remove the slice to process all files
```

## Troubleshooting

**No signatures extracted:**
- Check Ollama connection: `curl http://192.168.30.36:11434/api/tags`
- Verify PDF files exist in input directory
- Check if PDF is readable (not corrupted)

**Too many false positives:**
- Tighten CV detection parameters (increase `MIN_CONTOUR_AREA`)
- Reduce `MAX_CONTOUR_AREA`
- Adjust aspect ratio filters

**Missing expected signatures:**
- Loosen CV detection parameters
- Check rejected folder to see if signature was detected but not verified
- Reduce minimum area threshold
- Increase maximum area threshold

## Dependencies

- Python 3.9+
- PyMuPDF (fitz)
- OpenCV (cv2)
- NumPy
- Requests (for Ollama API)
- Ollama with qwen2.5vl:32b model