Add hybrid signature extraction with name-based verification

Implement a hybrid approach that combines VLM name extraction with CV
region detection, replacing the unreliable VLM coordinate output with
name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code
2025-10-26 23:39:52 +08:00
commit 52612e14ba
14 changed files with 3583 additions and 0 deletions

README_hybrid_extraction.md Normal file

@@ -0,0 +1,179 @@
# Hybrid Signature Extraction
This script uses a **hybrid approach** combining VLM (Vision Language Model) name recognition with computer vision detection.
## Key Innovation
Instead of relying on VLM's unreliable coordinate system, we:
1. **Use VLM for name extraction** (what it's good at)
2. **Use computer vision for location detection** (precise pixel-level detection)
3. **Use VLM for name-specific verification** (matching signatures to people)
## Workflow
```
┌─────────────────────────────────────────┐
│ Step 1: VLM extracts signature names │
│ Example: "周寶蓮", "魏興海" │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Step 2a: Search PDF text layer │
│ - If names found in PDF text objects │
│ - Use precise text coordinates │
│ - Expand region to capture nearby sig │
│ │
│ Step 2b: Fallback to Computer Vision │
│ - If no text layer or names not found │
│ - Use OpenCV to detect signature regions│
│ - Based on size, density, morphology │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Step 3: Extract all candidate regions │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Step 4: VLM verifies EACH region │
│ "Does this contain signature of: │
│ 周寶蓮, 魏興海?" │
│ │
│ - If matches: Save as signature_周寶蓮 │
│ - If duplicate: Reject │
│ - If no match: Move to rejected/ │
└─────────────────────────────────────────┘
```
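The VLM calls in Steps 1 and 4 could be sketched roughly as follows. This assumes Ollama's `/api/generate` endpoint with a base64-encoded page image; the function names and the comma-separated reply format are illustrative assumptions, not taken from `extract_signatures_hybrid.py`.

```python
import base64
import json
import re
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"


def build_payload(prompt: str, image_png: bytes, model: str = OLLAMA_MODEL) -> dict:
    """Build a non-streaming /api/generate request carrying one image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_png).decode("ascii")],
        "stream": False,
    }


def ask_vlm(payload: dict, base_url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the payload to Ollama and return the model's text reply."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


def parse_names(reply: str) -> list:
    """Split a reply such as '周寶蓮, 魏興海' into a de-duplicated name list.

    Assumes the prompt asked for names separated by commas or newlines.
    """
    names = []
    for part in re.split(r"[,，、\n]", reply):
        name = part.strip()
        if name and name not in names:
            names.append(name)
    return names
```

Keeping the payload builder and the reply parser separate from the network call makes both easy to check without a running Ollama server.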
## Advantages
- **More reliable**: uses the VLM for names, not for unreliable coordinates
- **Name-based verification**: matches specific signatures to specific people
- **Prevents duplicates**: tracks which signatures have already been found
- **Better organization**: files are named by person, e.g. `signature_周寶蓮.png`
- **Handles both scenarios**: PDFs with or without a text layer
- **Fewer false positives**: only verified signatures are saved
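The duplicate prevention and rejection handling described above might be tracked with a small state object like the one below. This is a minimal sketch; the class name, method names, and reason strings are hypothetical and not taken from the script.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VerificationState:
    """Tracks which expected names have already been matched to a region."""

    expected: list
    found: dict = field(default_factory=dict)      # name -> region id
    rejected: list = field(default_factory=list)   # (region id, reason)

    def record(self, region_id: str, matched_name: Optional[str]) -> str:
        """Classify one VLM verdict as 'saved', 'duplicate', or 'rejected'."""
        if matched_name is None or matched_name not in self.expected:
            self.rejected.append((region_id, "no_match"))
            return "rejected"
        if matched_name in self.found:
            self.rejected.append((region_id, "duplicate"))
            return "duplicate"
        self.found[matched_name] = region_id
        return "saved"

    def missing(self) -> list:
        """Names the VLM listed in Step 1 that no region has matched yet."""
        return [name for name in self.expected if name not in self.found]
```

`missing()` gives the list of names that would need a second search pass when recall falls short.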
## Configuration
Edit these values in `extract_signatures_hybrid.py`:
```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"
OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
DPI = 300 # Resolution for PDF rendering
```
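The `DPI` setting also determines how text-layer coordinates map onto the rendered image: PDF coordinates are in points (72 per inch), so a page rendered at 300 DPI is scaled by 300/72. A sketch of that conversion, including the region expansion from Step 2a, is shown below; the function name and the `margin_px` value are assumptions for illustration.

```python
DPI = 300  # same value as in the configuration above


def text_rect_to_pixel_region(rect_pt, page_size_px, dpi=DPI, margin_px=150):
    """Convert a text rectangle in PDF points to an expanded pixel region.

    rect_pt:      (x0, y0, x1, y1) in points, as found in the text layer
    page_size_px: (width, height) of the rendered page image
    margin_px:    hypothetical padding so the nearby signature is captured
    """
    x0, y0, x1, y1 = (round(v * dpi / 72.0) for v in rect_pt)
    width, height = page_size_px
    left = max(0, x0 - margin_px)
    top = max(0, y0 - margin_px)
    right = min(width, x1 + margin_px)
    bottom = min(height, y1 + margin_px)
    return left, top, right, bottom
```

Clamping to the page bounds keeps the crop valid when a name sits near the page edge.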
## Usage
```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_signatures_hybrid.py
```
## Test Results (5 PDFs)
| File | Expected | Found | Names Extracted |
|------|----------|-------|----------------|
| 201301_1324_AI1_page3 | 2 | 2 ✓ | 楊智惠, 張志銘 |
| 201301_2061_AI1_page5 | 2 | 1 ⚠️ | 廖阿甚 (missing 林姿妤) |
| 201301_2458_AI1_page4 | 2 | 1 ⚠️ | 周寶蓮 (missing 魏興海) |
| 201301_2923_AI1_page3 | 2 | 1 ⚠️ | 黄瑞展 (missing 陈丽琦) |
| 201301_3189_AI1_page3 | 2 | 2 ✓ | 黄辉, 黄益辉 |
| **Total** | **10** | **7** | **70% recall** |
**Comparison with previous approach:**
- Old VLM coordinate method: 44 extractions (many false positives, blank regions)
- New hybrid method: 7 extractions (all verified, no blank regions)
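The recall and precision figures follow directly from the table. The short calculation below reproduces them; the false-positive count of 0 reflects the statement that all 7 extractions were verified.

```python
# (expected, found) per test page, copied from the table above
RESULTS = {
    "201301_1324_AI1_page3": (2, 2),
    "201301_2061_AI1_page5": (2, 1),
    "201301_2458_AI1_page4": (2, 1),
    "201301_2923_AI1_page3": (2, 1),
    "201301_3189_AI1_page3": (2, 2),
}
FALSE_POSITIVES = 0  # all saved regions were verified signatures


def recall_and_precision(results, false_positives):
    """Recall = found / expected; precision = found / (found + false positives)."""
    expected = sum(e for e, _ in results.values())
    found = sum(f for _, f in results.values())
    return found / expected, found / (found + false_positives)


recall, precision = recall_and_precision(RESULTS, FALSE_POSITIVES)
print(f"recall={recall:.0%} precision={precision:.0%}")  # recall=70% precision=100%
```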
## Why Some Signatures Are Missed
The current CV detection parameters may be too conservative:
```python
# Filter by area (signatures are medium-sized)
if 5000 < area < 200000:  # May need adjustment
    # Filter by aspect ratio
    if 0.5 < aspect_ratio < 10:  # May need widening
        ...  # region is kept as a signature candidate
```
**Options to improve recall:**
1. Widen CV detection parameters (may increase false positives)
2. Add multiple passes with different parameters
3. Use VLM to suggest additional search regions if expected signatures not found
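Option 2 (multiple passes) could look something like the sketch below. It works on bounding boxes and uses the bounding-box area, which may differ from the contour area the script measures; the looser thresholds in the second pass are hypothetical values, not tuned ones.

```python
def filter_boxes(boxes, min_area=5000, max_area=200000,
                 min_aspect=0.5, max_aspect=10.0):
    """Keep (x, y, w, h) boxes whose area and aspect ratio fall in range."""
    kept = []
    for x, y, w, h in boxes:
        if h == 0:
            continue  # avoid division by zero on degenerate boxes
        if min_area < w * h < max_area and min_aspect < w / h < max_aspect:
            kept.append((x, y, w, h))
    return kept


# First pass: current thresholds. Second pass: hypothetical looser values.
PASSES = [
    {"min_area": 5000, "max_area": 200000, "min_aspect": 0.5, "max_aspect": 10.0},
    {"min_area": 2000, "max_area": 300000, "min_aspect": 0.3, "max_aspect": 15.0},
]


def multi_pass_filter(boxes, expected_count, passes=PASSES):
    """Loosen thresholds only if a stricter pass yields too few candidates."""
    kept = []
    for params in passes:
        kept = filter_boxes(boxes, **params)
        if len(kept) >= expected_count:
            break
    return kept
```

Because every candidate still goes through VLM verification, a looser second pass mainly costs extra VLM calls rather than precision.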
## Output Files
### Extracted Signatures
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/`
**Naming:** `{pdf_name}_signature_{person_name}.png`
Examples:
- `201301_2458_AI1_page4_signature_周寶蓮.png`
- `201301_1324_AI1_page3_signature_張志銘.png`
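A small helper that produces names in this pattern might look like this. The function name is hypothetical, and replacing path separators in the person's name is an added precaution that the script may not take.

```python
from pathlib import Path


def signature_filename(pdf_path: str, person_name: str) -> str:
    """Build '{pdf_name}_signature_{person_name}.png' from a PDF path and a name."""
    safe_name = person_name.strip().replace("/", "_").replace("\\", "_")
    return f"{Path(pdf_path).stem}_signature_{safe_name}.png"
```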
### Rejected Regions
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected/`
Contains regions that:
- Don't match any expected signatures
- Are duplicates of already-found signatures
### Log File
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/hybrid_extraction_log_YYYYMMDD_HHMMSS.csv`
Columns:
- `pdf_filename` - Source PDF
- `signatures_found` - Number of verified signatures
- `method_used` - "text_layer" or "computer_vision"
- `extracted_files` - List of saved filenames
- `error` - Error message if any
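For reference, a log with these columns can be written with the standard `csv` module, as sketched below. The helper names and the `;` separator used to join `extracted_files` into one cell are assumptions; the script may serialize the list differently.

```python
import csv
from datetime import datetime

LOG_COLUMNS = ["pdf_filename", "signatures_found", "method_used",
               "extracted_files", "error"]


def log_filename(now: datetime) -> str:
    """Timestamped name in the hybrid_extraction_log_YYYYMMDD_HHMMSS.csv pattern."""
    return f"hybrid_extraction_log_{now:%Y%m%d_%H%M%S}.csv"


def write_log(rows, stream) -> None:
    """Write one CSV row per processed PDF to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=LOG_COLUMNS)
    writer.writeheader()
    for row in rows:
        # Join the list of saved files into a single cell (assumed format).
        writer.writerow({**row, "extracted_files": ";".join(row["extracted_files"])})
```

When writing to a real file, opening it with `newline=""` and `encoding="utf-8"` avoids blank lines on Windows and keeps the Chinese names intact.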
## Performance
- Processing speed: ~2-3 PDFs per minute (depends on VLM API latency)
- VLM calls per PDF: 1 (name extraction) + N (region verification)
- For 5 test PDFs: ~2 minutes total
## Next Steps
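To process the 100 files extracted from the CSV, raise or remove the file limit:
```python
# In extract_signatures_hybrid.py, the test run limits the file list with [:5].
# Raise the limit to 100, or drop the slice to process every PDF in the folder.
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]
```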
```
## Troubleshooting
**No signatures extracted:**
- Check Ollama connection: `curl http://192.168.30.36:11434/api/tags`
- Verify PDF files exist in input directory
- Check if PDF is readable (not corrupted)
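The connection check can also be done from Python, which additionally confirms that the configured model is installed. This sketch assumes the `/api/tags` response is a JSON object with a `models` list whose entries carry a `name` field; the function names are hypothetical.

```python
import json
import urllib.request


def fetch_tags(base_url: str, timeout: float = 10.0) -> dict:
    """GET /api/tags from the Ollama server and return the parsed JSON."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def model_is_installed(tags: dict, model: str) -> bool:
    """Return True if the given model name appears in the tags listing."""
    return any(entry.get("name") == model for entry in tags.get("models", []))


# Example (requires network access to the Ollama host):
# tags = fetch_tags("http://192.168.30.36:11434")
# print(model_is_installed(tags, "qwen2.5vl:32b"))
```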
**Too many false positives:**
- Tighten CV detection parameters (increase `MIN_CONTOUR_AREA`)
- Reduce `MAX_CONTOUR_AREA`
- Adjust aspect ratio filters
**Missing expected signatures:**
- Loosen CV detection parameters
- Check rejected folder to see if signature was detected but not verified
- Reduce minimum area threshold
- Increase maximum area threshold
## Dependencies
- Python 3.9+
- PyMuPDF (fitz)
- OpenCV (cv2)
- NumPy
- Requests (for Ollama API)
- Ollama with qwen2.5vl:32b model