pdf_signature_extraction/PADDLEOCR_STATUS.md
gbanyan 479d4e0019 Add PaddleOCR masking and region detection pipeline
- Created PaddleOCR client for remote server communication
- Implemented text masking + region detection pipeline
- Test results: 100% recall on sample PDF (found both signatures)
- Identified issues: split regions, printed text not fully masked
- Documented 5 solution options in PADDLEOCR_STATUS.md
- Next: Implement region merging and two-stage cleaning
2025-10-28 22:28:18 +08:00


PaddleOCR Signature Extraction - Status & Options

Date: October 28, 2025
Branch: PaddleOCR-Cover
Current Stage: Masking + Region Detection Working, Refinement Needed


Current Approach Overview

Strategy: PaddleOCR masks printed text → Detect remaining regions → VLM verification

Pipeline Steps

1. PaddleOCR (Linux server 192.168.30.36:5555)
   └─> Detect printed text bounding boxes

2. OpenCV Masking (Local)
   └─> Black out all printed text areas

3. Region Detection (Local)
   └─> Find non-white areas (potential handwriting)

4. VLM Verification (TODO)
   └─> Confirm which regions are handwritten signatures

Test Results (File: 201301_1324_AI1_page3.pdf)

Performance

| Metric | Value |
|---|---|
| Printed text regions masked | 26 |
| Candidate regions detected | 12 |
| Actual signatures found | 2 |
| False positives (printed text) | 9 |
| Split signatures | 1 (Region 5 might be part of Region 4) |

Success

  • PaddleOCR detected most printed text (26 regions)
  • Masking works correctly (black rectangles)
  • Region detection found both signatures (regions 2, 4)
  • No false negatives (didn't miss any signatures)

Issues Identified

Problem 1: Handwriting Split Into Multiple Regions

  • Some signatures may be split into 2+ separate regions
  • Example: Region 4 and Region 5 might be parts of same signature area
  • Caused by gaps between handwritten strokes after masking

Problem 2: Printed Name + Handwritten Signature Mixed

  • Region 2: Contains "張 志 銘" (printed) + handwritten signature
  • Region 4: Contains "楊 智 惠" (printed) + handwritten signature
  • PaddleOCR missed these printed names, so they weren't masked
  • Final output includes both printed and handwritten parts

Problem 3: Printed Text Not Masked by PaddleOCR

  • 9 regions contain printed text that PaddleOCR didn't detect
  • These became false positive candidates
  • Examples: dates, company names, paragraph text
  • Shows PaddleOCR's detection isn't 100% complete

Proposed Solutions

Problem 1: Split Signatures

Option A: More Aggressive Morphology EASY

Approach: Increase kernel size and iterations to connect nearby strokes

# Current settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)

# Proposed settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))  # 3x larger per side
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=5)  # More iterations

Pros:

  • Simple change (kernel size and iteration count only)
  • Connects nearby strokes automatically
  • Fast execution

Cons:

  • May merge unrelated regions if too aggressive
  • Need to tune parameters carefully
  • Could lose fine details

Recommendation: Try first - easiest to implement and test
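
To gauge how aggressive each setting is, it helps to estimate the widest gap it can bridge. The sketch below is a 1-D pure-Python model, not the project's code. It assumes that cv2.morphologyEx with MORPH_CLOSE and iterations=n behaves like n dilations followed by n erosions, so a k x k rectangular kernel fills background gaps of up to roughly n * (k - 1) pixels. All helper names are hypothetical, and real 2-D strokes will behave less cleanly than this model.

```python
def dilate(row, k):
    """Binary dilation of a 1-D row with a flat window of odd width k."""
    r, n = k // 2, len(row)
    return [int(any(row[max(0, i - r):min(n, i + r + 1)])) for i in range(n)]


def erode(row, k):
    """Binary erosion; pixels outside the row are ignored (treated as set)."""
    r, n = k // 2, len(row)
    return [int(all(row[max(0, i - r):min(n, i + r + 1)])) for i in range(n)]


def close(row, k, iterations):
    """Closing modelled as `iterations` dilations, then `iterations` erosions."""
    for _ in range(iterations):
        row = dilate(row, k)
    for _ in range(iterations):
        row = erode(row, k)
    return row


def count_runs(row):
    """Number of separate foreground runs (a 1-D stand-in for regions)."""
    return sum(1 for i, v in enumerate(row) if v and (i == 0 or not row[i - 1]))


def two_strokes(gap, stroke=3, margin=40):
    """Two strokes of `stroke` pixels separated by `gap` background pixels."""
    return [0] * margin + [1] * stroke + [0] * gap + [1] * stroke + [0] * margin


def max_bridged_gap(k, iterations):
    """Estimated widest gap a k x k closing can fill: iterations * (k - 1)."""
    return iterations * (k - 1)


for k, n in [(5, 2), (15, 5)]:  # current and proposed settings
    g = max_bridged_gap(k, n)
    runs_at_limit = count_runs(close(two_strokes(g), k, n))
    runs_past_limit = count_runs(close(two_strokes(g + 1), k, n))
    print(k, n, g, runs_at_limit, runs_past_limit)
# prints "5 2 8 1 2" then "15 5 70 1 2"
```

Under this model the current settings bridge gaps of up to 8 px and the proposed settings up to 70 px, which is consistent with the risk of merging unrelated regions listed in the cons.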


Option B: Region Merging

Approach: After detecting all regions, merge those that are close together

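def merge_nearby_regions(regions, distance_threshold=50):
    """
    Merge regions that are within distance_threshold pixels of each other.

    Args:
        regions: List of region dicts with 'box' (x, y, w, h)
        distance_threshold: Maximum pixels between regions to merge

    Returns:
        List of merged regions (dicts with 'box' only; other keys are dropped)
    """
    def gap(a, b):
        # Edge-to-edge gap (0 if overlapping); larger of the x and y gaps
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        dx = max(bx - (ax + aw), ax - (bx + bw), 0)
        dy = max(by - (ay + ah), ay - (by + bh), 0)
        return max(dx, dy)

    boxes = [tuple(r['box']) for r in regions]
    merged_any = True
    while merged_any:  # Repeat until no more merges are possible
        merged_any = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if gap(boxes[i], boxes[j]) < distance_threshold:
                    # Replace box i with the union of boxes i and j
                    ax, ay, aw, ah = boxes[i]
                    bx, by, bw, bh = boxes.pop(j)
                    x, y = min(ax, bx), min(ay, by)
                    x2, y2 = max(ax + aw, bx + bw), max(ay + ah, by + bh)
                    boxes[i] = (x, y, x2 - x, y2 - y)
                    merged_any = True
                    break
            if merged_any:
                break
    return [{'box': box} for box in boxes]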

Pros:

  • Keeps signatures together intelligently
  • Won't merge distant unrelated regions
  • Preserves original stroke details
  • Can use vertical/horizontal distance separately

Cons:

  • Need to tune distance threshold
  • More complex than Option A
  • May need multiple merge passes

Recommendation: Best balance - implement this if Option A is not enough
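
The point above about using vertical and horizontal distance separately can be sketched as follows. This is an illustrative variant, not the project's implementation: the function name `merge_regions_xy`, the thresholds `max_dx` and `max_dy`, and their default values are all assumptions. It groups boxes with a small union-find so that chains of nearby boxes end up in one region.

```python
def merge_regions_xy(boxes, max_dx=60, max_dy=15):
    """Merge (x, y, w, h) boxes whose edge gaps are within max_dx and max_dy.

    max_dx / max_dy are assumed values: signatures tend to run horizontally,
    so a larger horizontal gap than vertical gap is tolerated.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the group root, compressing the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def gaps(a, b):
        # Edge-to-edge gaps along x and y (0 when the boxes overlap on that axis)
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        dx = max(bx - (ax + aw), ax - (bx + bw), 0)
        dy = max(by - (ay + ah), ay - (by + bh), 0)
        return dx, dy

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            dx, dy = gaps(boxes[i], boxes[j])
            if dx <= max_dx and dy <= max_dy:
                parent[find(j)] = find(i)

    # Take the union of the bounding boxes in each group
    groups = {}
    for i, (x, y, w, h) in enumerate(boxes):
        root = find(i)
        x1, y1, x2, y2 = groups.get(root, (x, y, x + w, y + h))
        groups[root] = (min(x1, x), min(y1, y), max(x2, x + w), max(y2, y + h))
    return [(x1, y1, x2 - x1, y2 - y1) for x1, y1, x2, y2 in groups.values()]


boxes = [(100, 500, 80, 40), (200, 505, 60, 35), (100, 700, 50, 20)]
print(merge_regions_xy(boxes))
# [(100, 500, 160, 40), (100, 700, 50, 20)]
```

The first two boxes sit on the same line 20 px apart and are merged. The third is 160 px lower and stays separate. Proximity is tested on the original boxes only, so a merged box is not re-checked against its neighbours.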


Option C: Don't Split - Extract Larger Context EASY

Approach: When extracting regions, add significant padding to capture full context

# Current: padding = 10 pixels
padding = 50  # Much larger padding

# Or: Merge all regions in the bottom 20% of page
# (signatures are usually at the bottom)

Pros:

  • Guaranteed to capture complete signatures
  • Very simple to implement
  • No risk of losing parts

Cons:

  • May include extra unwanted content
  • Larger image files
  • Makes VLM verification more complex

Recommendation: Use as fallback if B doesn't work
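
Both ideas in this option are small enough to sketch directly. The helpers below are hypothetical (`pad_box` and `in_bottom_fraction` are not names from the codebase) and assume boxes are (x, y, w, h) in pixels with the origin at the top-left of the page image. The page size is a made-up example.

```python
def pad_box(box, padding, img_w, img_h):
    """Grow an (x, y, w, h) box by `padding` pixels, clamped to the image."""
    x, y, w, h = box
    x1, y1 = max(0, x - padding), max(0, y - padding)
    x2, y2 = min(img_w, x + w + padding), min(img_h, y + h + padding)
    return (x1, y1, x2 - x1, y2 - y1)


def in_bottom_fraction(box, img_h, fraction=0.2):
    """True if the box centre lies in the bottom `fraction` of the page."""
    x, y, w, h = box
    return (y + h / 2) >= img_h * (1 - fraction)


page_w, page_h = 1000, 1400  # assumed page size in pixels
region = (20, 1300, 200, 80)

print(pad_box(region, 50, page_w, page_h))            # (0, 1250, 270, 150)
print(in_bottom_fraction(region, page_h))             # True
print(in_bottom_fraction((20, 300, 200, 80), page_h)) # False
```

Clamping keeps the padded crop inside the page, so a region near the edge gets less padding on that side instead of producing an invalid slice.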


Problem 2: Printed + Handwritten in Same Region

Option A: Expand PaddleOCR Masking Boxes EASY

Approach: Add padding when masking text boxes to catch edges

padding = 20  # pixels

img_h, img_w = image.shape[:2]
for (x, y, w, h) in text_boxes:
    # Expand box in all directions, clamped to the image bounds
    x1 = max(0, x - padding)
    y1 = max(0, y - padding)
    x2 = min(img_w - 1, x + w + padding)
    y2 = min(img_h - 1, y + h + padding)

    cv2.rectangle(masked_image, (x1, y1), (x2, y2), (0, 0, 0), -1)

Pros:

  • Very simple - one parameter change
  • Catches text edges and nearby text
  • Fast execution

Cons:

  • If padding too large, may mask handwriting
  • If padding too small, still misses text
  • Hard to find perfect padding value

Recommendation: Quick test - try with padding=20-30


Option B: Run PaddleOCR Again on Each Region MEDIUM

Approach: Second-pass OCR on extracted regions to find remaining printed text

def clean_region(region_image, ocr_client):
    """
    Remove any remaining printed text from a region.

    Args:
        region_image: Extracted candidate region
        ocr_client: PaddleOCR client

    Returns:
        Cleaned image with only handwriting
    """
    # Run OCR on this specific region
    text_boxes = ocr_client.get_text_boxes(region_image)

    # Mask any detected printed text
    cleaned = region_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(cleaned, (x, y), (x+w, y+h), (0, 0, 0), -1)

    return cleaned

Pros:

  • Very accurate - catches printed text PaddleOCR missed initially
  • Clean separation of printed vs handwritten
  • No manual tuning needed

Cons:

  • Slower (one extra OCR call per candidate region)
  • May occasionally mask handwritten text if it looks printed
  • More complex pipeline

Recommendation: Good option if masking padding isn't enough


Option C: Computer Vision Stroke Analysis HARD

Approach: Analyze stroke characteristics to distinguish printed vs handwritten

def separate_printed_handwritten(region_image):
    """
    Use CV techniques to separate printed from handwritten.

    Techniques:
    - Stroke width analysis (printed = uniform, handwritten = variable)
    - Edge detection + smoothness (printed = sharp, handwritten = organic)
    - Connected component analysis
    - Hough line detection (printed = straight, handwritten = curved)
    """
    # Not implemented: placeholder for the techniques listed above
    raise NotImplementedError

Pros:

  • No API calls needed (fast)
  • Can work when OCR fails
  • Can be tuned to patterns in the data

Cons:

  • Very complex to implement
  • May not be reliable across different documents
  • Requires significant tuning
  • Hard to maintain

Recommendation: Skip for now - too complex, uncertain results


Option D: VLM Crop Guidance ⚠️ RISKY

Approach: Ask VLM to provide coordinates of handwriting location

prompt = """
This image contains both printed and handwritten text.
Where is the handwritten signature located?
Provide coordinates as: x_start, y_start, x_end, y_end
"""

# VLM returns coordinates
# Crop to that region only

Pros:

  • VLM understands visual context
  • Can distinguish printed vs handwritten

Cons:

  • VLM coordinates are unreliable (32% offset discovered in previous tests!)
  • This was the original problem that led to PaddleOCR approach
  • May extract wrong region

Recommendation: DO NOT USE - VLM coordinates proven unreliable


Option E: Two-Stage Approach

Approach: Combine detection with targeted cleaning

def extract_signatures_twostage(pdf_path, ocr_client):
    """
    Stage 1: Detect candidate regions (current pipeline)
    Stage 2: Clean each region, then verify it with the VLM
    """
    # Stage 1: Full page processing
    image = render_pdf(pdf_path)
    text_boxes = ocr_client.get_text_boxes(image)
    masked_image = mask_text_regions(image, text_boxes, padding=20)
    candidate_regions = detect_regions(masked_image)

    # Stage 2: Per-region cleaning
    signatures = []
    for region_box in candidate_regions:
        # Extract region from ORIGINAL image (not masked)
        region_img = extract_region(image, region_box)

        # Step 2a: Run OCR again to find remaining printed text
        region_text_boxes = ocr_client.get_text_boxes(region_img)
        cleaned_region = mask_text_regions(region_img, region_text_boxes)

        # Step 2b: Ask VLM if it contains handwriting (yes/no only, no coordinates!)
        is_handwriting = vlm_verify(cleaned_region)

        if is_handwriting:
            signatures.append(cleaned_region)

    return signatures

Pros:

  • Best accuracy - two passes of OCR
  • Combines strengths of both approaches
  • VLM only for yes/no, not coordinates
  • Clean final output with only handwriting

Cons:

  • Slower (one full-page OCR call plus one per candidate region)
  • More complex code
  • Higher computational cost

Recommendation: BEST OVERALL - implement this for production
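
The yes/no verification step (no coordinates requested) might look like the sketch below. It assumes the VLM is reached through Ollama's /api/generate endpoint at the host and model listed under Server Configuration, and that a non-streaming reply carries the model's text in a `response` field. The prompt wording and the helper names `build_vlm_payload`, `parse_yes_no`, and this `vlm_verify` are illustrative, not existing project code. The function takes already-encoded image bytes (for example PNG), which is also an assumption.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434/api/generate"  # VLM server from this document
MODEL = "qwen2.5vl:32b"

# Assumed prompt wording; asks for a one-word answer and no coordinates
PROMPT = (
    "Does this image contain a handwritten signature? "
    "Answer with a single word: yes or no."
)


def build_vlm_payload(image_bytes, model=MODEL, prompt=PROMPT):
    """Build the JSON body for a non-streaming Ollama generate request."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def parse_yes_no(text):
    """Map a free-text reply to True/False; None if it is ambiguous."""
    words = text.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


def vlm_verify(image_bytes, url=OLLAMA_URL, timeout=120):
    """Ask the VLM whether the crop holds a handwritten signature."""
    body = json.dumps(build_vlm_payload(image_bytes)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        answer = json.loads(reply.read().decode("utf-8"))
    return parse_yes_no(answer.get("response", ""))
```

Returning None for an ambiguous reply lets the caller decide whether to keep the region for manual review or discard it, rather than silently treating it as a "no".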


Implementation Priority

Phase 1: Quick Wins (Test Immediately)

  1. Expand masking padding (Problem 2, Option A) - 5 minutes
  2. More aggressive morphology (Problem 1, Option A) - 5 minutes
  3. Test and measure improvement

Phase 2: Region Merging (If Phase 1 insufficient)

  1. Implement region merging algorithm (Problem 1, Option B) - 30 minutes
  2. Test on multiple PDFs
  3. Tune distance threshold

Phase 3: Two-Stage Approach (Best quality)

  1. Implement second-pass OCR on regions (Problem 2, Option E) - 1 hour
  2. Add VLM verification (Step 4 of pipeline) - 30 minutes
  3. Full pipeline testing

Code Files Status

Existing Files

  • paddleocr_client.py - REST API client for PaddleOCR server
  • test_paddleocr_client.py - Connection and OCR test
  • test_mask_and_detect.py - Current masking + detection pipeline

To Be Created 📝

  • extract_signatures_paddleocr.py - Production pipeline with all improvements
  • region_merger.py - Region merging utilities
  • vlm_verifier.py - VLM handwriting verification

Server Configuration

PaddleOCR Server:

  • Host: 192.168.30.36:5555
  • Running: Yes (PID: 210417)
  • Version: 3.3.0
  • GPU: Enabled
  • Language: Chinese (lang='ch')

VLM Server:

  • Host: 192.168.30.36:11434 (Ollama)
  • Model: qwen2.5vl:32b
  • Status: Not tested yet in this pipeline

Test Plan

Test File

  • File: 201301_1324_AI1_page3.pdf
  • Expected signatures: 2 (楊智惠, 張志銘)
  • Current recall: 100% (found both)
  • Current precision: 16.7% (2 correct out of 12 regions)

Success Metrics After Improvements

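| Metric | Current | Target |
|---|---|---|
| Signatures found | 2/2 (100%) | 2/2 (100%) |
| False positives | 10 (9 printed text + 1 possible signature fragment) | < 2 |
| Precision | 16.7% | > 80% |
| Signatures split | Unknown | 0 |
| Printed text in regions | Yes | No |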
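
The precision and recall figures above follow directly from the region counts. A small helper (hypothetical name `precision_recall`) makes the arithmetic explicit and can be reused when re-measuring after each phase:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) as fractions in [0, 1]."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Current test page: 12 candidate regions, 2 real signatures, none missed
precision, recall = precision_recall(true_positives=2, false_positives=10, false_negatives=0)
print(f"precision={precision:.1%} recall={recall:.1%}")  # precision=16.7% recall=100.0%

# Same page with a single remaining false positive
precision, recall = precision_recall(2, 1, 0)
print(f"precision={precision:.1%} recall={recall:.1%}")  # precision=66.7% recall=100.0%
```

On a page with only two signatures, one false positive already gives 66.7% precision, so the > 80% target effectively requires zero false positives on this test file.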

Git Branch Strategy

Current branch: PaddleOCR-Cover
Status: Masking + Region Detection working, needs refinement

Recommended next steps:

  1. Commit current state with tag: paddleocr-v1-basic
  2. Create feature branches:
    • paddleocr-region-merging - For Problem 1 solutions
    • paddleocr-two-stage - For Problem 2 solutions
  3. Merge best solution back to PaddleOCR-Cover

Next Actions

Immediate (Today)

  • Commit current working state
  • Test Phase 1 quick wins (padding + morphology)
  • Measure improvement

Short-term (This week)

  • Implement Region Merging (Option B)
  • Implement Two-Stage OCR (Option E)
  • Add VLM verification
  • Test on 10 PDFs

Long-term (Production)

  • Optimize performance (parallel processing)
  • Error handling and logging
  • Process full 86K dataset
  • Compare with previous hybrid approach (70% recall)

Comparison: PaddleOCR vs Previous Hybrid Approach

Previous Approach (VLM-Cover branch)

  • Method: VLM names + CV detection + VLM verification
  • Results: 70% recall, 100% precision
  • Problem: Missed 30% of signatures (CV parameters too conservative)

PaddleOCR Approach (Current)

  • Method: PaddleOCR masking + CV detection + VLM verification
  • Results: 100% recall (found both signatures)
  • Problem: Low precision (many false positives), printed text not fully removed

Winner: TBD

  • PaddleOCR shows better recall potential, though so far only on a single test page
  • Refinements in Phases 2-3 are expected to raise precision while keeping recall high; this is not yet measured
  • Need to test on larger dataset to confirm

Document version: 1.0
Last updated: October 28, 2025
Author: Claude Code
Status: Ready for implementation