pdf_signature_extraction/README_hybrid_extraction.md
gbanyan 52612e14ba Add hybrid signature extraction with name-based verification
Implement VLM name extraction + CV detection hybrid approach to
replace unreliable VLM coordinate system with name-based verification.

Key Features:
- VLM extracts signature names (周寶蓮, 魏興海, etc.)
- CV or PDF text layer detects regions
- VLM verifies each region against expected names
- Signatures saved with person names: signature_周寶蓮.png
- Duplicate prevention and rejection handling

Test Results:
- 5 PDF pages tested
- 7/10 signatures extracted (70% recall)
- 100% precision (no false positives)
- No blank regions extracted (previous issue resolved)

Files:
- extract_pages_from_csv.py: Extract pages from CSV (tested: 100 files)
- extract_signatures_hybrid.py: Hybrid extraction (current working solution)
- extract_handwriting.py: CV-only approach (component)
- extract_signatures_vlm.py: Deprecated VLM coordinate approach
- PROJECT_DOCUMENTATION.md: Complete project history and results
- SESSION_INIT.md: Session handoff documentation
- SESSION_CHECKLIST.md: Status checklist
- NEW_SESSION_PROMPT.txt: Template for next session
- HOW_TO_CONTINUE.txt: Visual handoff guide
- COMMIT_SUMMARY.md: Commit preparation guide
- README.md: Quick start guide
- README_page_extraction.md: Page extraction docs
- README_hybrid_extraction.md: Hybrid approach docs
- .gitignore: Exclude diagnostic scripts and outputs

Known Limitations:
- 30% of signatures missed due to conservative CV parameters
- Text layer method untested (all test PDFs are scanned images)
- Performance: ~24 seconds per PDF

Next Steps:
- Tune CV parameters for higher recall
- Test with larger dataset (100+ files)
- Process full dataset (86,073 files)

🤖 Generated with Claude Code
2025-10-26 23:39:52 +08:00


Hybrid Signature Extraction

extract_signatures_hybrid.py uses a hybrid approach that combines VLM (Vision Language Model) name recognition with computer vision detection.

Key Innovation

Instead of relying on the VLM's unreliable coordinate output, we:

  1. Use VLM for name extraction (what it's good at)
  2. Use computer vision for location detection (precise pixel-level detection)
  3. Use VLM for name-specific verification (matching signatures to people)
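As a rough illustration of the name-extraction step, the sketch below builds a request for Ollama's `/api/generate` endpoint and parses a comma-separated reply. The prompt text, the reply format, and the helper names (`build_name_request`, `parse_names`, `extract_names`) are assumptions made for this example; the script's actual prompt and parsing are not shown in this README. It uses `urllib` from the standard library so the snippet runs without extra packages, although the script itself depends on Requests.

```python
import base64
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://192.168.30.36:11434"  # value from the Configuration section
OLLAMA_MODEL = "qwen2.5vl:32b"             # value from the Configuration section

# Hypothetical prompt: the real wording used by the script is not documented here.
NAME_PROMPT = "List the names of the people who signed this page, separated by commas."


def build_name_request(image_bytes: bytes) -> dict:
    """Build a non-streaming /api/generate payload with one base64-encoded image."""
    return {
        "model": OLLAMA_MODEL,
        "prompt": NAME_PROMPT,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def parse_names(response_text: str) -> List[str]:
    """Split a reply on ASCII/CJK commas and newlines (assumed reply format)."""
    for sep in ("，", "、", "\n"):
        response_text = response_text.replace(sep, ",")
    return [name.strip() for name in response_text.split(",") if name.strip()]


def extract_names(image_bytes: bytes, timeout: float = 120.0) -> List[str]:
    """Send the page image to Ollama and return the parsed names."""
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(build_name_request(image_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return parse_names(body["response"])
```

Keeping payload construction and reply parsing separate from the network call makes both easy to check without a running Ollama server.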

Workflow

┌─────────────────────────────────────────┐
│ Step 1: VLM extracts signature names   │
│ Example: "周寶蓮", "魏興海"              │
└─────────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Step 2a: Search PDF text layer         │
│ - If names found in PDF text objects   │
│ - Use precise text coordinates          │
│ - Expand region to capture nearby sig   │
│                                          │
│ Step 2b: Fallback to Computer Vision   │
│ - If no text layer or names not found   │
│ - Use OpenCV to detect signature regions│
│ - Based on size, density, morphology    │
└─────────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Step 3: Extract all candidate regions   │
└─────────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Step 4: VLM verifies EACH region        │
│ "Does this contain signature of:        │
│  周寶蓮, 魏興海?"                         │
│                                          │
│ - If matches: Save as signature_周寶蓮   │
│ - If duplicate: Reject                   │
│ - If no match: Move to rejected/        │
└─────────────────────────────────────────┘
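The verification step in the diagram can be sketched as a small loop that keeps the first region verified for each expected name and rejects everything else. This is a minimal sketch, not the script's code: `assign_regions` and the `verify` callback are hypothetical names, and `verify` stands in for the VLM call that returns the matched name or `None`.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple


def assign_regions(
    regions: Iterable[Hashable],
    expected_names: List[str],
    verify: Callable[[Hashable, List[str]], Optional[str]],
) -> Tuple[Dict[str, Hashable], List[Hashable]]:
    """Return (accepted, rejected) for Step 4 of the workflow.

    accepted maps each expected name to the first region verified for it;
    rejected collects regions with no match and duplicates of found names.
    """
    accepted: Dict[str, Hashable] = {}
    rejected: List[Hashable] = []
    for region in regions:
        name = verify(region, expected_names)
        if name in expected_names and name not in accepted:
            accepted[name] = region      # first verified match for this person
        else:
            rejected.append(region)      # no match, or name already found
    return accepted, rejected


if __name__ == "__main__":
    # Stub verifier: a lookup table replaces the VLM for demonstration only.
    labels = {"r1": "周寶蓮", "r2": None, "r3": "周寶蓮", "r4": "魏興海"}
    accepted, rejected = assign_regions(
        ["r1", "r2", "r3", "r4"],
        ["周寶蓮", "魏興海"],
        lambda region, names: labels[region],
    )
    print(accepted)
    print(rejected)
```

Passing the verifier in as a callback keeps the duplicate-tracking logic testable without a VLM.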

Advantages

  • More reliable - Uses VLM for names, not unreliable coordinates
  • Name-based verification - Matches specific signatures to specific people
  • Prevents duplicates - Tracks which signatures already found
  • Better organization - Files named by person: signature_周寶蓮.png
  • Handles both scenarios - PDFs with/without text layer
  • Fewer false positives - Only saves verified signatures

Configuration

Edit these values in extract_signatures_hybrid.py:

PDF_INPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures"
REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"

DPI = 300  # Resolution for PDF rendering
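The DPI setting matters when text-layer coordinates (Step 2a) are mapped onto the rendered image: PDF coordinates are in points (1/72 inch), so a box found in the text layer has to be scaled by DPI / 72 before cropping. The sketch below shows that conversion plus a clamped region expansion. The function names and the margin value are assumptions for illustration; the margin the script actually uses is not stated here.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def points_to_pixels(value_pt: float, dpi: int = 300) -> int:
    """Convert a length in PDF points (1/72 inch) to pixels at the given DPI."""
    return round(value_pt * dpi / 72)


def expand_box(box: Box, margin: float, page_width: float, page_height: float) -> Box:
    """Grow a box by `margin` on every side, clamped to the page bounds."""
    x0, y0, x1, y1 = box
    return (
        max(0, x0 - margin),
        max(0, y0 - margin),
        min(page_width, x1 + margin),
        min(page_height, y1 + margin),
    )


if __name__ == "__main__":
    # Hypothetical name box on an A4 page (595 x 842 pt), expanded by 50 pt.
    region_pt = expand_box((100, 200, 180, 220), 50, 595, 842)
    region_px = tuple(points_to_pixels(v) for v in region_pt)
    print(region_pt, region_px)
```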

Usage

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_signatures_hybrid.py

Test Results (5 PDFs)

| File | Expected | Found | Names Extracted |
|------|----------|-------|-----------------|
| 201301_1324_AI1_page3 | 2 | 2 ✓ | 楊智惠, 張志銘 |
| 201301_2061_AI1_page5 | 2 | 1 ⚠️ | 廖阿甚 (missing 林姿妤) |
| 201301_2458_AI1_page4 | 2 | 1 ⚠️ | 周寶蓮 (missing 魏興海) |
| 201301_2923_AI1_page3 | 2 | 1 ⚠️ | 黄瑞展 (missing 陈丽琦) |
| 201301_3189_AI1_page3 | 2 | 2 ✓ | 黄辉, 黄益辉 |
| Total | 10 | 7 | 70% recall |
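For reference, the recall figure is simply found / expected summed over the test pages. The snippet below recomputes it from the per-file counts in the table; the `RESULTS` dictionary and `recall` helper are illustrative names, not part of the script.

```python
from typing import Iterable, Tuple

# (expected, found) per test page, copied from the table above.
RESULTS = {
    "201301_1324_AI1_page3": (2, 2),
    "201301_2061_AI1_page5": (2, 1),
    "201301_2458_AI1_page4": (2, 1),
    "201301_2923_AI1_page3": (2, 1),
    "201301_3189_AI1_page3": (2, 2),
}


def recall(counts: Iterable[Tuple[int, int]]) -> float:
    """Recall = total signatures found / total signatures expected."""
    counts = list(counts)
    expected = sum(e for e, _ in counts)
    found = sum(f for _, f in counts)
    return found / expected


if __name__ == "__main__":
    print(f"recall = {recall(RESULTS.values()):.0%}")
```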

Comparison with previous approach:

  • Old VLM coordinate method: 44 extractions (many false positives, blank regions)
  • New hybrid method: 7 extractions (all verified, no blank regions)

Why Some Signatures Are Missed

The current CV detection parameters may be too conservative:

# Filter by area (signatures are medium-sized)
if 5000 < area < 200000:  # May need adjustment

# Filter by aspect ratio
if 0.5 < aspect_ratio < 10:  # May need widening

Options to improve recall:

  1. Widen CV detection parameters (may increase false positives)
  2. Add multiple passes with different parameters
  3. Use VLM to suggest additional search regions if expected signatures not found
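Option 2 could look roughly like the sketch below: run the area and aspect-ratio filter with the current thresholds first, then retry with wider ones only if fewer candidates than expected survive. This is an assumption-laden sketch: it works on bounding boxes rather than contours, takes aspect ratio as width / height, and the second pass's thresholds are made-up values that would need tuning on real pages.

```python
from typing import Iterable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]              # (x, y, width, height)
Params = Tuple[float, float, float, float]   # (min_area, max_area, min_aspect, max_aspect)

PASSES: List[Params] = [
    (5000, 200000, 0.5, 10.0),   # thresholds quoted in this README
    (2500, 300000, 0.3, 15.0),   # hypothetical wider second pass
]


def filter_boxes(boxes: Iterable[Box], min_area: float, max_area: float,
                 min_aspect: float, max_aspect: float) -> List[Box]:
    """Keep boxes whose area and width/height ratio fall inside the limits."""
    kept = []
    for x, y, w, h in boxes:
        if w <= 0 or h <= 0:
            continue  # skip degenerate boxes
        if min_area < w * h < max_area and min_aspect < w / h < max_aspect:
            kept.append((x, y, w, h))
    return kept


def multi_pass(boxes: Iterable[Box], expected_count: int,
               passes: Sequence[Params] = PASSES) -> List[Box]:
    """Try each parameter set in order; stop once enough candidates are found."""
    boxes = list(boxes)
    kept: List[Box] = []
    for params in passes:
        kept = filter_boxes(boxes, *params)
        if len(kept) >= expected_count:
            break
    return kept
```

Because every candidate still goes through VLM verification, a wider second pass mainly costs extra VLM calls rather than adding false positives to the output.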

Output Files

Extracted Signatures

Location: /Volumes/NV2/PDF-Processing/signature-image-output/signatures/

Naming: {pdf_name}_signature_{person_name}.png

Examples:

  • 201301_2458_AI1_page4_signature_周寶蓮.png
  • 201301_1324_AI1_page3_signature_張志銘.png
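The naming pattern can be reproduced with a short helper. This is a sketch under assumptions: `signature_filename` is a hypothetical name, and replacing path separators in the person name is an added safety step that the script may or may not perform.

```python
from pathlib import Path


def signature_filename(pdf_path: str, person_name: str) -> str:
    """Build '{pdf_name}_signature_{person_name}.png' from a PDF path and a name."""
    pdf_name = Path(pdf_path).stem
    # Assumed precaution: keep path separators out of the file name.
    safe_name = person_name.strip().replace("/", "_").replace("\\", "_")
    return f"{pdf_name}_signature_{safe_name}.png"


if __name__ == "__main__":
    print(signature_filename("201301_2458_AI1_page4.pdf", "周寶蓮"))
```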

Rejected Regions

Location: /Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected/

Contains regions that:

  • Don't match any expected signatures
  • Are duplicates of already-found signatures

Log File

Location: /Volumes/NV2/PDF-Processing/signature-image-output/signatures/hybrid_extraction_log_YYYYMMDD_HHMMSS.csv

Columns:

  • pdf_filename - Source PDF
  • signatures_found - Number of verified signatures
  • method_used - "text_layer" or "computer_vision"
  • extracted_files - List of saved filenames
  • error - Error message if any
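A log with these columns could be written as in the sketch below. The column names and the file name pattern come from this section; the helper names and the `;` separator used to join `extracted_files` are assumptions, since the script's exact serialization is not described.

```python
import csv
import io
from datetime import datetime
from typing import Iterable, Mapping, TextIO

LOG_COLUMNS = ["pdf_filename", "signatures_found", "method_used",
               "extracted_files", "error"]


def log_filename(now: datetime) -> str:
    """Return hybrid_extraction_log_YYYYMMDD_HHMMSS.csv for the given time."""
    return f"hybrid_extraction_log_{now:%Y%m%d_%H%M%S}.csv"


def write_log(rows: Iterable[Mapping], stream: TextIO) -> None:
    """Write one CSV row per processed PDF; missing fields are left empty."""
    writer = csv.DictWriter(stream, fieldnames=LOG_COLUMNS)
    writer.writeheader()
    for row in rows:
        row = dict(row)
        # Assumed format: join the list of saved files with ';'.
        row["extracted_files"] = ";".join(row.get("extracted_files", []))
        writer.writerow(row)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_log([{
        "pdf_filename": "201301_2458_AI1_page4.pdf",
        "signatures_found": 1,
        "method_used": "computer_vision",
        "extracted_files": ["201301_2458_AI1_page4_signature_周寶蓮.png"],
    }], buffer)
    print(log_filename(datetime.now()))
    print(buffer.getvalue())
```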

Performance

  • Processing speed: ~2-3 PDFs per minute (depends on VLM API latency)
  • VLM calls per PDF: 1 (name extraction) + N (region verification)
  • For 5 test PDFs: ~2 minutes total
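These numbers give a quick way to budget larger runs: 5 PDFs in about 2 minutes is roughly 24 seconds per PDF. The sketch below extrapolates linearly from that figure, which is an assumption, since real throughput depends on VLM latency and on how many candidate regions each page produces. The function names are illustrative.

```python
def estimate_hours(num_pdfs: int, seconds_per_pdf: float = 24.0) -> float:
    """Linear runtime estimate in hours at a fixed per-PDF cost."""
    return num_pdfs * seconds_per_pdf / 3600


def vlm_calls(num_regions: int) -> int:
    """One name-extraction call plus one verification call per candidate region."""
    return 1 + num_regions


if __name__ == "__main__":
    for count in (5, 100, 86073):  # test set, CSV subset, full dataset
        print(f"{count} PDFs: ~{estimate_hours(count):.1f} hours")
```

At this rate the 100-file subset takes about 40 minutes, while the full 86,073 files would take on the order of weeks on a single worker.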

Next Steps

To process the larger dataset (the 100 files extracted from the CSV), raise the slice limit:

# Edit this line in extract_signatures_hybrid.py
pdf_files = sorted(Path(PDF_INPUT_PATH).glob("*.pdf"))[:100]  # Was [:5]; remove the slice to process all files

Troubleshooting

No signatures extracted:

  • Check Ollama connection: curl http://192.168.30.36:11434/api/tags
  • Verify PDF files exist in input directory
  • Check if PDF is readable (not corrupted)
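The connection check can also be scripted so that it confirms the model is actually installed, not just that the server answers. The sketch below queries `/api/tags` and looks for the model name in the returned `models` list; the helper names are hypothetical, and it uses `urllib` from the standard library rather than Requests to stay self-contained.

```python
import json
import urllib.request


def model_available(tags_response: dict, model: str) -> bool:
    """Return True if `model` appears in an /api/tags response body."""
    return any(entry.get("name") == model
               for entry in tags_response.get("models", []))


def check_ollama(base_url: str, model: str, timeout: float = 10.0) -> bool:
    """Fetch /api/tags and report whether the model is installed.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_available(json.load(resp), model)


if __name__ == "__main__":
    ok = check_ollama("http://192.168.30.36:11434", "qwen2.5vl:32b")
    print("model installed" if ok else "model missing: try `ollama pull`")
```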

Too many false positives:

  • Tighten CV detection parameters (increase MIN_CONTOUR_AREA)
  • Reduce MAX_CONTOUR_AREA
  • Adjust aspect ratio filters

Missing expected signatures:

  • Loosen CV detection parameters
  • Check rejected folder to see if signature was detected but not verified
  • Reduce minimum area threshold
  • Increase maximum area threshold

Dependencies

  • Python 3.9+
  • PyMuPDF (fitz)
  • OpenCV (cv2)
  • NumPy
  • Requests (for Ollama API)
  • Ollama with qwen2.5vl:32b model