# Hybrid Signature Extraction

`extract_signatures_hybrid.py` uses a **hybrid approach** that combines VLM (Vision Language Model) name recognition with computer vision detection.

## Key Innovation

Instead of relying on the VLM's unreliable bounding-box coordinates, we:
1. **Use VLM for name extraction** (what it's good at)
2. **Use computer vision for location detection** (precise pixel-level detection)
3. **Use VLM for name-specific verification** (matching signatures to people)

## Workflow

```
┌─────────────────────────────────────────┐
│ Step 1: VLM extracts signature names   │
│ Example: "周寶蓮", "魏興海"              │
└─────────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Step 2a: Search PDF text layer         │
│ - If names found in PDF text objects   │
│ - Use precise text coordinates          │
│ - Expand region to capture nearby sig   │
│                                          │
│ Step 2b: Fallback to Computer Vision   │
│ - If no text layer or names not found   │
│ - Use OpenCV to detect signature regions│
│ - Based on size, density, morphology    │
└─────────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Step 3: Extract all candidate regions   │
└─────────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Step 4: VLM verifies EACH region        │
│ "Does this contain signature of:        │
│  周寶蓮, 魏興海?"                         │
│                                          │
│ - If matches: Save as signature_周寶蓮   │
│ - If duplicate: Reject                   │
│ - If no match: Move to rejected/        │
└─────────────────────────────────────────┘
```
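
Step 2a's "expand region" idea can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the name's rectangle is given in PDF points (1/72 inch), which is how PyMuPDF's `page.search_for()` reports matches, and the function name `expand_name_rect` and the padding values are hypothetical.

```python
DPI = 300  # same rendering resolution as in the configuration

def expand_name_rect(rect, page_size, pad_x=60.0, pad_y=40.0, dpi=DPI):
    """Pad a text rectangle and convert it from PDF points to pixels.

    rect      -- (x0, y0, x1, y1) of the printed name, in PDF points
    page_size -- (width, height) of the page, in PDF points
    pad_x/y   -- hypothetical padding in points; tune for your documents
    """
    x0, y0, x1, y1 = rect
    width, height = page_size
    # Grow the box so a signature next to the printed name is included,
    # clamping to the page so the crop never leaves the rendered image.
    x0 = max(0.0, x0 - pad_x)
    y0 = max(0.0, y0 - pad_y)
    x1 = min(float(width), x1 + pad_x)
    y1 = min(float(height), y1 + pad_y)
    # One PDF point is 1/72 inch, so pixels = points * dpi / 72.
    return tuple(int(round(v * dpi / 72)) for v in (x0, y0, x1, y1))

print(expand_name_rect((100, 700, 160, 715), (595, 842)))
# → (167, 2750, 917, 3146)
```

The returned tuple can be used to slice the page image rendered at the same DPI.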

## Advantages

- ✅ **More reliable** - Uses VLM for names, not unreliable coordinates
- ✅ **Name-based verification** - Matches specific signatures to specific people
- ✅ **Prevents duplicates** - Tracks which signatures were already found
- ✅ **Better organization** - Files named by person: `signature_周寶蓮.png`
- ✅ **Handles both scenarios** - PDFs with or without a text layer
- ✅ **Fewer false positives** - Only saves verified signatures

## Configuration

Edit these values in `extract_signatures_hybrid.py`:

```python
PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"

DPI = 300  # Resolution for PDF rendering
```
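
To show how `OLLAMA_URL` and `OLLAMA_MODEL` might be used, here is a hedged sketch that builds a name-extraction request. It assumes Ollama's `/api/generate` endpoint, which accepts base64-encoded images in an `images` list. The function name and the prompt wording are hypothetical, and the script itself may use a different endpoint or prompt.

```python
import base64
import json

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"

def build_name_extraction_request(png_bytes, url=OLLAMA_URL, model=OLLAMA_MODEL):
    """Return (endpoint, JSON body) for a non-streaming Ollama generate call."""
    payload = {
        "model": model,
        # Hypothetical prompt; the script's real wording is not shown in this README.
        "prompt": "List the names of the people who signed this page, one per line.",
        # Ollama expects images as base64 strings, not raw bytes.
        "images": [base64.b64encode(png_bytes).decode("ascii")],
        # Ask for a single JSON reply instead of a token stream.
        "stream": False,
    }
    return f"{url}/api/generate", json.dumps(payload)
```

The body can then be posted with `requests.post(endpoint, data=body, timeout=...)`; for a non-streaming call the model's text is in the `response` field of the returned JSON.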

## Usage

```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_signatures_hybrid.py
```

## Test Results (5 PDFs)

| File | Expected | Found | Names Extracted |
|------|----------|-------|----------------|
| 201301_1324_AI1_page3 | 2 | 2 ✓ | 楊智惠, 張志銘 |
| 201301_2061_AI1_page5 | 2 | 1 ⚠️ | 廖阿甚 (missing 林姿妤) |
| 201301_2458_AI1_page4 | 2 | 1 ⚠️ | 周寶蓮 (missing 魏興海) |
| 201301_2923_AI1_page3 | 2 | 1 ⚠️ | 黄瑞展 (missing 陈丽琦) |
| 201301_3189_AI1_page3 | 2 | 2 ✓ | 黄辉, 黄益辉 |
| **Total** | **10** | **7** | **70% recall** |

**Comparison with previous approach:**
- Old VLM coordinate method: 44 extractions (many false positives, blank regions)
- New hybrid method: 7 extractions (all verified, no blank regions)
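
The recall figure can be reproduced from the per-file counts in the table. The snippet below only restates that arithmetic and lists the files that still need attention.

```python
# (expected, found) per test PDF, copied from the results table
results = {
    "201301_1324_AI1_page3": (2, 2),
    "201301_2061_AI1_page5": (2, 1),
    "201301_2458_AI1_page4": (2, 1),
    "201301_2923_AI1_page3": (2, 1),
    "201301_3189_AI1_page3": (2, 2),
}

expected = sum(e for e, _ in results.values())
found = sum(f for _, f in results.values())
recall = found / expected

# Files where at least one expected signature was not recovered
incomplete = [name for name, (e, f) in results.items() if f < e]

print(f"{found}/{expected} = {recall:.0%}")
# → 7/10 = 70%
```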

## Why Some Signatures Are Missed

The current CV detection parameters may be too conservative:

```python
# Excerpt of the contour filters (bodies elided)
# Filter by area (signatures are medium-sized)
if 5000 < area < 200000:  # May need adjustment
    ...

# Filter by aspect ratio
if 0.5 < aspect_ratio < 10:  # May need widening
    ...
```

**Options to improve recall:**
1. Widen CV detection parameters (may increase false positives)
2. Add multiple passes with different parameters
3. Use VLM to suggest additional search regions if expected signatures not found
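
Option 2 (multiple passes) could look roughly like the sketch below. It works on plain `(x, y, w, h)` bounding boxes so it runs without OpenCV. The first pass uses the thresholds quoted above; the second, wider pass is a hypothetical setting. It also assumes area means bounding-box area and aspect ratio means width divided by height, which may differ from the script.

```python
def filter_boxes(boxes, min_area, max_area, min_aspect, max_aspect):
    """Keep (x, y, w, h) boxes whose area and w/h ratio fall inside the limits."""
    kept = []
    for x, y, w, h in boxes:
        if h == 0:
            continue  # degenerate box, aspect ratio undefined
        area = w * h
        aspect = w / h
        if min_area < area < max_area and min_aspect < aspect < max_aspect:
            kept.append((x, y, w, h))
    return kept

# (min_area, max_area, min_aspect, max_aspect) per pass
PASSES = [
    (5000, 200000, 0.5, 10),   # current thresholds
    (2500, 300000, 0.3, 15),   # hypothetical wider fallback
]

def multi_pass(boxes, expected_count, passes=PASSES):
    """Try each parameter set in order; stop once enough candidates are found."""
    kept = []
    for params in passes:
        kept = filter_boxes(boxes, *params)
        if len(kept) >= expected_count:
            break
    return kept
```

Because every candidate still goes through VLM verification in Step 4, a wider second pass mainly costs extra VLM calls; it does not by itself save false positives.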

## Output Files

### Extracted Signatures
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/`

**Naming:** `{pdf_name}_signature_{person_name}.png`

Examples:
- `201301_2458_AI1_page4_signature_周寶蓮.png`
- `201301_1324_AI1_page3_signature_張志銘.png`
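
A small helper that produces names in this pattern might look like the following. The function name is hypothetical, and replacing path separators in the person's name is a defensive assumption, not something the script is known to do.

```python
from pathlib import Path

def signature_filename(pdf_path, person_name):
    """Build '{pdf_name}_signature_{person_name}.png' from a PDF path and a name."""
    # Assumption: strip path separators so a name cannot escape the output folder.
    safe_name = person_name.strip().replace("/", "_").replace("\\", "_")
    return f"{Path(pdf_path).stem}_signature_{safe_name}.png"

print(signature_filename("201301_2458_AI1_page4.pdf", "周寶蓮"))
# → 201301_2458_AI1_page4_signature_周寶蓮.png
```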

### Rejected Regions
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected/`

Contains regions that:
- Don't match any expected signatures
- Are duplicates of already-found signatures
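
The save-or-reject decision from Step 4 can be sketched as a small routing function. This is an illustration under assumptions: `route_region` and its return labels are hypothetical, and the VLM's answer is assumed to have been parsed into a single name or `None` already.

```python
def route_region(matched_name, expected_names, found_names):
    """Decide what to do with one VLM-verified region.

    matched_name   -- name the VLM reported for the region, or None
    expected_names -- names extracted in Step 1
    found_names    -- set of names already saved (updated in place)
    """
    if matched_name is None or matched_name not in expected_names:
        return "rejected:no_match"
    if matched_name in found_names:
        return "rejected:duplicate"
    # First region for this person: remember it so later copies are rejected.
    found_names.add(matched_name)
    return "saved"

found = set()
expected = ["周寶蓮", "魏興海"]
for name in ["周寶蓮", "周寶蓮", None]:
    print(route_region(name, expected, found))
# → saved
# → rejected:duplicate
# → rejected:no_match
```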

### Log File
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/signatures/hybrid_extraction_log_YYYYMMDD_HHMMSS.csv`

Columns:
- `pdf_filename` - Source PDF
- `signatures_found` - Number of verified signatures
- `method_used` - "text_layer" or "computer_vision"
- `extracted_files` - List of saved filenames
- `error` - Error message if any
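
A log with these columns can be written with the standard `csv` module. The sketch below is one possible implementation, not the script's own: the function name is hypothetical, and joining `extracted_files` with `;` is an assumption about how the list is stored in a single cell.

```python
import csv
from datetime import datetime
from pathlib import Path

FIELDS = ["pdf_filename", "signatures_found", "method_used", "extracted_files", "error"]

def write_log(rows, output_dir):
    """Write one CSV row per processed PDF and return the log path."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = Path(output_dir) / f"hybrid_extraction_log_{stamp}.csv"
    # utf-8 keeps the Chinese names in filenames intact.
    with open(log_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for row in rows:
            # Assumption: the file list is flattened with ';' to fit one cell.
            flat = dict(row, extracted_files=";".join(row.get("extracted_files", [])))
            writer.writerow(flat)
    return log_path
```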

## Performance

- Processing speed: ~2-3 PDFs per minute (depends on VLM API latency)
- VLM calls per PDF: 1 (name extraction) + N (region verification)
- For 5 test PDFs: ~2 minutes total

## Next Steps

To process the full dataset (100 files from CSV):

```python
# Edit this line in extract_signatures_hybrid.py (the test run used [:5])
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]  # Or drop the slice to process all PDFs
```

## Troubleshooting

**No signatures extracted:**
- Check Ollama connection: `curl http://192.168.30.36:11434/api/tags`
- Verify PDF files exist in input directory
- Check if PDF is readable (not corrupted)

**Too many false positives:**
- Tighten CV detection parameters (increase `MIN_CONTOUR_AREA`)
- Reduce `MAX_CONTOUR_AREA`
- Adjust aspect ratio filters

**Missing expected signatures:**
- Loosen CV detection parameters
- Check rejected folder to see if signature was detected but not verified
- Reduce minimum area threshold
- Increase maximum area threshold

## Dependencies

- Python 3.9+
- PyMuPDF (fitz)
- OpenCV (cv2)
- NumPy
- Requests (for Ollama API)
- Ollama with qwen2.5vl:32b model