gbanyan 479d4e0019 Add PaddleOCR masking and region detection pipeline

- Created PaddleOCR client for remote server communication
- Implemented text masking + region detection pipeline
- Test results: 100% recall on sample PDF (found both signatures)
- Identified issues: split regions, printed text not fully masked
- Documented 5 solution options in PADDLEOCR_STATUS.md
- Next: Implement region merging and two-stage cleaning

2025-10-28 22:28:18 +08:00

PaddleOCR Signature Extraction - Status & Options

Date: October 28, 2025
Branch: PaddleOCR-Cover
Current Stage: Masking + Region Detection Working, Refinement Needed

Current Approach Overview

Strategy: PaddleOCR masks printed text → Detect remaining regions → VLM verification

Pipeline Steps

1. PaddleOCR (Linux server 192.168.30.36:5555)
   └─> Detect printed text bounding boxes

2. OpenCV Masking (Local)
   └─> Black out all printed text areas

3. Region Detection (Local)
   └─> Find non-white areas (potential handwriting)

4. VLM Verification (TODO)
   └─> Confirm which regions are handwritten signatures

Test Results (File: 201301_1324_AI1_page3.pdf)

Performance

Metric	Value
Printed text regions masked	26
Candidate regions detected	12
Actual signatures found	2 ✅
False positives (printed text)	9
Split signatures	1 (Region 5 might be part of Region 4)

Success

✅ PaddleOCR detected most printed text (26 regions)
✅ Masking works correctly (black rectangles)
✅ Region detection found both signatures (regions 2, 4)
✅ No false negatives (didn't miss any signatures)

Issues Identified

❌ Problem 1: Handwriting Split Into Multiple Regions

Some signatures may be split into 2+ separate regions
Example: Region 4 and Region 5 might be parts of same signature area
Caused by gaps between handwritten strokes after masking

❌ Problem 2: Printed Name + Handwritten Signature Mixed

Region 2: Contains "張志銘" (printed) + handwritten signature
Region 4: Contains "楊智惠" (printed) + handwritten signature
PaddleOCR missed these printed names, so they weren't masked
Final output includes both printed and handwritten parts

❌ Problem 3: Printed Text Not Masked by PaddleOCR

9 regions contain printed text that PaddleOCR didn't detect
These became false positive candidates
Examples: dates, company names, paragraph text
Shows PaddleOCR's detection isn't 100% complete

Proposed Solutions

Problem 1: Split Signatures

Option A: More Aggressive Morphology ⭐ EASY

Approach: Increase kernel size and iterations to connect nearby strokes

# Current settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)

# Proposed settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))  # 3x larger
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=5)  # More iterations

Pros:

Simple parameter change (kernel size and iteration count)
Connects nearby strokes automatically
Fast execution

Cons:

May merge unrelated regions if too aggressive
Need to tune parameters carefully
Could lose fine details

Recommendation: ⭐ Try first - easiest to implement and test

Option B: Region Merging After Detection ⭐⭐ MEDIUM (RECOMMENDED)

Approach: After detecting all regions, merge those that are close together

def merge_nearby_regions(regions, distance_threshold=50):
    """
    Merge regions that are within distance_threshold pixels of each other.

    Args:
        regions: List of region dicts with 'box' (x, y, w, h)
        distance_threshold: Maximum pixels between regions to merge

    Returns:
        List of merged regions
    """
    # Algorithm:
    # 1. Calculate distance between all region pairs
    # 2. If distance < threshold, merge their bounding boxes
    # 3. Repeat until no more merges possible

    merged = []
    # Implementation here...
    return merged
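
The three steps in the comments above could be filled in roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: it treats the "distance" between two regions as the larger of their horizontal and vertical gaps (0 where they overlap), it keeps only the 'box' key of each region dict, and the helper names `box_gap` and `union_box` are hypothetical. The sample coordinates are illustrative, not taken from the test PDF.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def box_gap(a: Box, b: Box) -> int:
    """Larger of the horizontal and vertical gaps between two boxes.

    The gap along an axis is 0 when the boxes overlap or touch on that axis.
    (Assumption: this is one reasonable reading of "distance between regions".)
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    gap_x = max(0, max(ax, bx) - min(ax + aw, bx + bw))
    gap_y = max(0, max(ay, by) - min(ay + ah, by + bh))
    return max(gap_x, gap_y)


def union_box(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    x1 = min(a[0], b[0])
    y1 = min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)


def merge_nearby_regions(regions: List[Dict], distance_threshold: int = 50) -> List[Dict]:
    """Repeatedly merge any two boxes closer than distance_threshold."""
    boxes: List[Box] = [tuple(r["box"]) for r in regions]
    merged_any = True
    while merged_any:  # restart the scan after every merge until nothing changes
        merged_any = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if box_gap(boxes[i], boxes[j]) < distance_threshold:
                    boxes[i] = union_box(boxes[i], boxes[j])
                    del boxes[j]
                    merged_any = True
                    break
            if merged_any:
                break
    return [{"box": b} for b in boxes]


# Illustrative input: two fragments 20 px apart plus one distant region.
sample = [
    {"box": (100, 100, 80, 40)},
    {"box": (200, 105, 60, 35)},
    {"box": (600, 800, 50, 30)},
]
result = merge_nearby_regions(sample, distance_threshold=50)
```

Restarting the scan after each merge covers the "repeat until no more merges" step: a merged box can come within range of a region that neither fragment was close to on its own. The loop is quadratic per pass, which should be acceptable for the dozen or so candidate regions per page seen so far.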

Pros:

Keeps signatures together intelligently
Won't merge distant unrelated regions
Preserves original stroke details
Can use vertical/horizontal distance separately

Cons:

Need to tune distance threshold
More complex than Option A
May need multiple merge passes

Recommendation: ⭐⭐ Best balance - implement this if Option A's quick test is not enough (see Implementation Priority, Phase 2)

Option C: Don't Split - Extract Larger Context ⭐ EASY

Approach: When extracting regions, add significant padding to capture full context

# Current: padding = 10 pixels
padding = 50  # Much larger padding

# Or: Merge all regions in the bottom 20% of page
# (signatures are usually at the bottom)
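
Both ideas above can be expressed as small box helpers. The following is a sketch under assumptions: boxes are `(x, y, w, h)` tuples as elsewhere in this document, and the function names `pad_box` and `in_bottom_band` are hypothetical rather than existing project code.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def pad_box(box: Box, padding: int, image_width: int, image_height: int) -> Box:
    """Grow a box by `padding` pixels on every side, clamped to the image."""
    x, y, w, h = box
    x1 = max(0, x - padding)
    y1 = max(0, y - padding)
    x2 = min(image_width, x + w + padding)
    y2 = min(image_height, y + h + padding)
    return (x1, y1, x2 - x1, y2 - y1)


def in_bottom_band(box: Box, image_height: int, fraction: float = 0.2) -> bool:
    """True if the box starts inside the bottom `fraction` of the page."""
    _, y, _, _ = box
    return y >= image_height * (1.0 - fraction)


# Illustrative values for a 1000 x 1000 page.
padded = pad_box((30, 900, 200, 80), padding=50, image_width=1000, image_height=1000)
near_bottom = in_bottom_band((30, 900, 200, 80), image_height=1000)
```

With a NumPy image, the padded box would then be cropped as `image[y:y + h, x:x + w]`. Clamping from corner coordinates keeps the expansion symmetric even when the box sits near a page edge.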

Pros:

Guaranteed to capture complete signatures
Very simple to implement
No risk of losing parts

Cons:

May include extra unwanted content
Larger image files
Makes VLM verification more complex

Recommendation: ⭐ Use as fallback if B doesn't work

Problem 2: Printed + Handwritten in Same Region

Option A: Expand PaddleOCR Masking Boxes ⭐ EASY

Approach: Add padding when masking text boxes to catch edges

padding = 20  # pixels

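img_h, img_w = image.shape[:2]

for (x, y, w, h) in text_boxes:
    # Expand the box by `padding` on every side, clamped to the image bounds.
    # Working with corner coordinates keeps the right/bottom edge at
    # x + w + padding even when the left/top edge is clamped to 0.
    x1 = max(0, x - padding)
    y1 = max(0, y - padding)
    x2 = min(img_w, x + w + padding)
    y2 = min(img_h, y + h + padding)

    # Thickness -1 draws a filled rectangle
    cv2.rectangle(masked_image, (x1, y1), (x2, y2), (0, 0, 0), -1)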

Pros:

Very simple - one parameter change
Catches text edges and nearby text
Fast execution

Cons:

If padding too large, may mask handwriting
If padding too small, still misses text
Hard to find perfect padding value

Recommendation: ⭐ Quick test - try with padding=20-30

Option B: Run PaddleOCR Again on Each Region ⭐⭐ MEDIUM

Approach: Second-pass OCR on extracted regions to find remaining printed text

def clean_region(region_image, ocr_client):
    """
    Remove any remaining printed text from a region.

    Args:
        region_image: Extracted candidate region
        ocr_client: PaddleOCR client

    Returns:
        Cleaned image with only handwriting
    """
    # Run OCR on this specific region
    text_boxes = ocr_client.get_text_boxes(region_image)

    # Mask any detected printed text
    cleaned = region_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(cleaned, (x, y), (x+w, y+h), (0, 0, 0), -1)

    return cleaned

Pros:

Very accurate - catches printed text PaddleOCR missed initially
Clean separation of printed vs handwritten
No manual tuning needed

Cons:

Slower (one extra OCR call per candidate region, on top of the full-page call)
May occasionally mask handwritten text if it looks printed
More complex pipeline

Recommendation: ⭐⭐ Good option if masking padding isn't enough

Option C: Computer Vision Stroke Analysis ⭐⭐⭐ HARD

Approach: Analyze stroke characteristics to distinguish printed vs handwritten

def separate_printed_handwritten(region_image):
    """
    Use CV techniques to separate printed from handwritten.

    Techniques:
    - Stroke width analysis (printed = uniform, handwritten = variable)
    - Edge detection + smoothness (printed = sharp, handwritten = organic)
    - Connected component analysis
    - Hough line detection (printed = straight, handwritten = curved)
    """
    # Complex implementation...
    pass

Pros:

No API calls needed (fast)
Can work when OCR fails
Needs no trained model or training data (rule-based heuristics)

Cons:

Very complex to implement
May not be reliable across different documents
Requires significant tuning
Hard to maintain

Recommendation: ❌ Skip for now - too complex, uncertain results

Option D: VLM Crop Guidance ⚠️ RISKY

Approach: Ask VLM to provide coordinates of handwriting location

prompt = """
This image contains both printed and handwritten text.
Where is the handwritten signature located?
Provide coordinates as: x_start, y_start, x_end, y_end
"""

# VLM returns coordinates
# Crop to that region only

Pros:

VLM understands visual context
Can distinguish printed vs handwritten

Cons:

VLM coordinates are unreliable (32% offset discovered in previous tests!)
This was the original problem that led to PaddleOCR approach
May extract wrong region

Recommendation: ❌ DO NOT USE - VLM coordinates proven unreliable

Option E: Two-Stage Hybrid Approach ⭐⭐⭐ BEST (RECOMMENDED)

Approach: Combine detection with targeted cleaning

def extract_signatures_twostage(pdf_path):
    """
    Stage 1: Detect candidate regions (current pipeline)
    Stage 2: Clean each region
    """
    # Stage 1: Full page processing
    image = render_pdf(pdf_path)
    text_boxes = ocr_client.get_text_boxes(image)
    masked_image = mask_text_regions(image, text_boxes, padding=20)
    candidate_regions = detect_regions(masked_image)

    # Stage 2: Per-region cleaning
    signatures = []
    for region_box in candidate_regions:
        # Extract region from ORIGINAL image (not masked)
        region_img = extract_region(image, region_box)

        # Stage 2a: Run OCR again to find remaining printed text
        region_text_boxes = ocr_client.get_text_boxes(region_img)
        cleaned_region = mask_text_regions(region_img, region_text_boxes)

        # Stage 2b: Ask VLM if it contains handwriting (yes/no only, no coordinates!)
        is_handwriting = vlm_verify(cleaned_region)

        if is_handwriting:
            signatures.append(cleaned_region)

    return signatures

Pros:

Best accuracy - two passes of OCR
Combines strengths of both approaches
VLM only for yes/no, not coordinates
Clean final output with only handwriting

Cons:

Slower (one full-page OCR call plus one OCR call per candidate region)
More complex code
Higher computational cost

Recommendation: ⭐⭐⭐ BEST OVERALL - implement this for production
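
The yes/no verification step might look like the sketch below. It assumes the Ollama server listed under Server Configuration and its `/api/generate` endpoint with a base64-encoded image. The prompt wording, the PNG-bytes input, and the `parse_yes_no` helper are illustrative assumptions, and nothing here has been tested against this pipeline.

```python
import base64
import json
import urllib.request

# Host and model are taken from the Server Configuration section.
OLLAMA_URL = "http://192.168.30.36:11434/api/generate"
MODEL = "qwen2.5vl:32b"

# Hypothetical prompt: asks for a yes/no answer only, never for coordinates.
PROMPT = (
    "Does this image contain a handwritten signature? "
    "Answer with exactly one word: yes or no."
)


def parse_yes_no(answer: str) -> bool:
    """Treat any reply that does not start with 'yes' as a 'no'."""
    return answer.strip().lower().startswith("yes")


def vlm_verify(png_bytes: bytes, timeout: float = 120.0) -> bool:
    """Send one PNG-encoded region to the VLM and return its yes/no verdict."""
    payload = {
        "model": MODEL,
        "prompt": PROMPT,
        "images": [base64.b64encode(png_bytes).decode("ascii")],
        "stream": False,  # ask for a single JSON object instead of a stream
    }
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return parse_yes_no(body.get("response", ""))
```

Defaulting to "no" on an unclear reply favors precision. If recall matters more at this stage, ambiguous replies could instead be kept for manual review.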

Implementation Priority

Phase 1: Quick Wins (Test Immediately)

Expand masking padding (Problem 2, Option A) - 5 minutes
More aggressive morphology (Problem 1, Option A) - 5 minutes
Test and measure improvement

Phase 2: Region Merging (If Phase 1 insufficient)

Implement region merging algorithm (Problem 1, Option B) - 30 minutes
Test on multiple PDFs
Tune distance threshold

Phase 3: Two-Stage Approach (Best quality)

Implement second-pass OCR on regions (Problem 2, Option E) - 1 hour
Add VLM verification (Step 4 of pipeline) - 30 minutes
Full pipeline testing

Code Files Status

Existing Files ✅

paddleocr_client.py - REST API client for PaddleOCR server
test_paddleocr_client.py - Connection and OCR test
test_mask_and_detect.py - Current masking + detection pipeline

To Be Created 📝

extract_signatures_paddleocr.py - Production pipeline with all improvements
region_merger.py - Region merging utilities
vlm_verifier.py - VLM handwriting verification

Server Configuration

PaddleOCR Server:

Host: 192.168.30.36:5555
Running: ✅ Yes (PID: 210417)
Version: 3.3.0
GPU: Enabled
Language: Chinese (lang='ch')

VLM Server:

Host: 192.168.30.36:11434 (Ollama)
Model: qwen2.5vl:32b
Status: Not tested yet in this pipeline

Test Plan

Test File

File: 201301_1324_AI1_page3.pdf
Expected signatures: 2 (楊智惠, 張志銘)
Current recall: 100% (found both)
Current precision: 16.7% (2 correct out of 12 regions)

Success Metrics After Improvements

Metric	Current	Target
Signatures found	2/2 (100%)	2/2 (100%)
False positives	10 (9 printed text + 1 suspected split fragment)	< 2
Precision	16.7%	> 80%
Signatures split	1 suspected (Regions 4 and 5)	0
Printed text in regions	Yes	No
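
For reference, the precision and recall figures in this section follow from the region counts. A small sketch of the arithmetic (the function name is hypothetical):

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall); each is 0.0 when its denominator is zero."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Current state: 12 candidate regions, 2 real signatures, none missed.
precision, recall = precision_recall(true_positives=2, false_positives=10, false_negatives=0)
print(f"precision={precision:.1%} recall={recall:.1%}")
# prints: precision=16.7% recall=100.0%
```

On this single page, one remaining false positive would already give 2/3 ≈ 66.7% precision, so the > 80% target is only reachable here with zero false positives, or when precision is measured across more pages.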

Git Branch Strategy

Current branch: PaddleOCR-Cover
Status: Masking + Region Detection working, needs refinement

Recommended next steps:

Commit current state with tag: paddleocr-v1-basic
Create feature branches:
- paddleocr-region-merging - For Problem 1 solutions
- paddleocr-two-stage - For Problem 2 solutions
Merge best solution back to PaddleOCR-Cover

Next Actions

Immediate (Today)

Commit current working state
Test Phase 1 quick wins (padding + morphology)
Measure improvement

Short-term (This week)

Implement Region Merging (Option B)
Implement Two-Stage OCR (Option E)
Add VLM verification
Test on 10 PDFs

Long-term (Production)

Optimize performance (parallel processing)
Error handling and logging
Process full 86K dataset
Compare with previous hybrid approach (70% recall)

Comparison: PaddleOCR vs Previous Hybrid Approach

Previous Approach (VLM-Cover branch)

Method: VLM names + CV detection + VLM verification
Results: 70% recall, 100% precision
Problem: Missed 30% of signatures (CV parameters too conservative)

PaddleOCR Approach (Current)

Method: PaddleOCR masking + CV detection + VLM verification
Results: 100% recall on the single test page (found 2 of 2 signatures)
Problem: Low precision (many false positives), printed text not fully removed

Winner: TBD

PaddleOCR shows better recall potential
The Phase 2-3 refinements are expected to raise precision while keeping recall high (not yet measured)
Need to test on larger dataset to confirm

Document version: 1.0
Last updated: October 28, 2025
Author: Claude Code
Status: Ready for implementation
