PaddleOCR Signature Extraction - Status & Options
Date: October 28, 2025
Branch: PaddleOCR-Cover
Current Stage: Masking + Region Detection Working, Refinement Needed
Current Approach Overview
Strategy: PaddleOCR masks printed text → Detect remaining regions → VLM verification
Pipeline Steps
1. PaddleOCR (Linux server 192.168.30.36:5555)
└─> Detect printed text bounding boxes
2. OpenCV Masking (Local)
└─> Black out all printed text areas
3. Region Detection (Local)
└─> Find non-white areas (potential handwriting)
4. VLM Verification (TODO)
└─> Confirm which regions are handwritten signatures
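As a rough illustration of steps 2 and 3, the sketch below masks text boxes on a tiny binary grid and then groups the remaining ink pixels into bounding boxes. It is pure Python for readability only; the real pipeline runs OpenCV on full-page images, and the names `mask_boxes` and `find_regions` are hypothetical. One simplification to note: masked pixels are cleared here instead of painted black, so they are not picked up again as regions.

```python
from collections import deque

def mask_boxes(grid, boxes):
    """Return a copy of grid with every (x, y, w, h) box cleared to 0 (no ink)."""
    out = [row[:] for row in grid]
    for x, y, w, h in boxes:
        for r in range(y, min(y + h, len(out))):
            for c in range(x, min(x + w, len(out[0]))):
                out[r][c] = 0
    return out

def find_regions(grid):
    """Group 4-connected ink pixels (value 1) and return their (x, y, w, h) boxes."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill, tracking the extent of this component
            queue = deque([(r, c)])
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                cr, cc = queue.popleft()
                r0, r1 = min(r0, cr), max(r1, cr)
                c0, c1 = min(c0, cc), max(c1, cc)
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append((c0, r0, c1 - c0 + 1, r1 - r0 + 1))
    return regions

# Toy page: a "printed" run on the top row and a small "handwritten" blob below
page = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
printed_text_boxes = [(0, 0, 3, 1)]  # what OCR would report for the top row
candidates = find_regions(mask_boxes(page, printed_text_boxes))
print(candidates)
```

Only the blob that OCR did not cover survives as a candidate, which is the behavior the pipeline relies on.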
Test Results (File: 201301_1324_AI1_page3.pdf)
Performance
| Metric | Value |
|---|---|
| Printed text regions masked | 26 |
| Candidate regions detected | 12 |
| Actual signatures found | 2 ✅ |
| False positives (printed text) | 9 |
| Split signatures | 1 (Region 5 might be part of Region 4) |
Success
- ✅ PaddleOCR detected most printed text (26 regions)
- ✅ Masking works correctly (black rectangles)
- ✅ Region detection found both signatures (regions 2, 4)
- ✅ No false negatives (didn't miss any signatures)
Issues Identified
❌ Problem 1: Handwriting Split Into Multiple Regions
- Some signatures may be split into 2+ separate regions
- Example: Region 4 and Region 5 might be parts of same signature area
- Caused by gaps between handwritten strokes after masking
❌ Problem 2: Printed Name + Handwritten Signature Mixed
- Region 2: Contains "張 志 銘" (printed) + handwritten signature
- Region 4: Contains "楊 智 惠" (printed) + handwritten signature
- PaddleOCR missed these printed names, so they weren't masked
- Final output includes both printed and handwritten parts
❌ Problem 3: Printed Text Not Masked by PaddleOCR
- 9 regions contain printed text that PaddleOCR didn't detect
- These became false positive candidates
- Examples: dates, company names, paragraph text
- Shows PaddleOCR's detection isn't 100% complete
Proposed Solutions
Problem 1: Split Signatures
Option A: More Aggressive Morphology ⭐ EASY
Approach: Increase kernel size and iterations to connect nearby strokes
# Current settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
# Proposed settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15)) # 3x larger per side
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=5) # More iterations
Pros:
- Simple change to two parameters (kernel size and iterations)
- Connects nearby strokes automatically
- Fast execution
Cons:
- May merge unrelated regions if too aggressive
- Need to tune parameters carefully
- Could lose fine details
Recommendation: ⭐ Try first - easiest to implement and test
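To judge how aggressive the proposed settings are, it helps to estimate the effective size of the closing. OpenCV's `morphologyEx` with `MORPH_CLOSE` and `iterations=n` dilates n times and then erodes n times, and n dilations with a k x k rectangle behave like a single dilation with a rectangle of side (k - 1) * n + 1 (ignoring image-border effects). The helper below is a hypothetical name, not part of the pipeline; it only does that arithmetic.

```python
def effective_close_size(kernel_side, iterations):
    """Side of the single square kernel equivalent to repeated k x k passes."""
    return (kernel_side - 1) * iterations + 1

current = effective_close_size(5, 2)     # current settings
proposed = effective_close_size(15, 5)   # proposed settings
print(current, proposed)
```

Under this estimate the current settings act like a 9 x 9 closing and the proposed ones like a 71 x 71 closing, so marks tens of pixels apart may be joined. An intermediate setting may be worth testing before jumping to the full proposal.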
Option B: Region Merging After Detection ⭐⭐ MEDIUM (RECOMMENDED)
Approach: After detecting all regions, merge those that are close together
def merge_nearby_regions(regions, distance_threshold=50):
    """
    Merge regions that are within distance_threshold pixels of each other.

    Args:
        regions: List of region dicts with 'box' (x, y, w, h)
        distance_threshold: Maximum pixels between regions to merge

    Returns:
        List of merged regions
    """
    # Algorithm:
    # 1. Calculate distance between all region pairs
    # 2. If distance < threshold, merge their bounding boxes
    # 3. Repeat until no more merges possible
    merged = []
    # Implementation here...
    return merged
Pros:
- Keeps signatures together intelligently
- Won't merge distant unrelated regions
- Preserves original stroke details
- Can use vertical/horizontal distance separately
Cons:
- Need to tune distance threshold
- More complex than Option A
- May need multiple merge passes
Recommendation: ⭐⭐ Best balance - implement this if Option A is not sufficient
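One possible way to fill in the `merge_nearby_regions` stub is sketched below. It is an assumption, not the final implementation: distance is taken as the larger of the horizontal and vertical gaps between two boxes (0 if they overlap), and the helpers `box_gap` and `union_box` are hypothetical names. The loop repeats until no pair is closer than the threshold, which covers the "multiple merge passes" case.

```python
def box_gap(a, b):
    """Gap in pixels between two (x, y, w, h) boxes; 0 if they touch or overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = max(bx - (ax + aw), ax - (bx + bw), 0)
    dy = max(by - (ay + ah), ay - (by + bh), 0)
    return max(dx, dy)

def union_box(a, b):
    """Smallest (x, y, w, h) box that contains both a and b."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)

def merge_nearby_regions(regions, distance_threshold=50):
    """Merge region dicts whose 'box' values are closer than distance_threshold."""
    boxes = [tuple(r["box"]) for r in regions]
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if box_gap(boxes[i], boxes[j]) < distance_threshold:
                    # Replace the pair with its union and restart the scan
                    boxes[i] = union_box(boxes[i], boxes[j])
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return [{"box": b} for b in boxes]

# Made-up coordinates: two fragments 20 px apart plus one distant text line
regions = [
    {"box": (100, 800, 120, 40)},
    {"box": (240, 805, 60, 30)},
    {"box": (100, 100, 200, 20)},
]
merged = merge_nearby_regions(regions)
print(merged)
```

The two nearby fragments collapse into one box while the distant region is left alone. Separate horizontal and vertical thresholds could be added by comparing `dx` and `dy` individually instead of taking their maximum.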
Option C: Don't Split - Extract Larger Context ⭐ EASY
Approach: When extracting regions, add significant padding to capture full context
# Current: padding = 10 pixels
padding = 50 # Much larger padding
# Or: Merge all regions in the bottom 20% of page
# (signatures are usually at the bottom)
Pros:
- Guaranteed to capture complete signatures
- Very simple to implement
- No risk of losing parts
Cons:
- May include extra unwanted content
- Larger image files
- Makes VLM verification more complex
Recommendation: ⭐ Use as fallback if B doesn't work
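The "bottom 20% of page" variant mentioned in the comment above could look roughly like this. The function name `merge_bottom_regions` and the sample coordinates are made up for illustration; boxes are `(x, y, w, h)` with `y` measured from the top of the page.

```python
def merge_bottom_regions(boxes, page_height, bottom_fraction=0.2):
    """Union all boxes that start inside the bottom part of the page.

    Returns (merged_box_or_None, remaining_boxes).
    """
    cutoff = page_height * (1 - bottom_fraction)
    bottom = [b for b in boxes if b[1] >= cutoff]
    rest = [b for b in boxes if b[1] < cutoff]
    if not bottom:
        return None, rest
    x0 = min(b[0] for b in bottom)
    y0 = min(b[1] for b in bottom)
    x1 = max(b[0] + b[2] for b in bottom)
    y1 = max(b[1] + b[3] for b in bottom)
    return (x0, y0, x1 - x0, y1 - y0), rest

boxes = [(100, 850, 120, 40), (300, 900, 80, 30), (50, 300, 400, 20)]
merged_box, remaining = merge_bottom_regions(boxes, page_height=1000)
print(merged_box, remaining)
```

Note that this joins every bottom-of-page region into a single crop, so two separate signatures would end up in the same image. That is the "extra unwanted content" trade-off listed in the cons.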
Problem 2: Printed + Handwritten in Same Region
Option A: Expand PaddleOCR Masking Boxes ⭐ EASY
Approach: Add padding when masking text boxes to catch edges
padding = 20  # pixels
img_h, img_w = image.shape[:2]
for (x, y, w, h) in text_boxes:
    # Expand the box by `padding` on every side, clamped to the image bounds.
    # Computing the far corner from x + w (not from the clamped origin) avoids
    # over-masking on the right/bottom when the box sits near the top-left edge.
    x0 = max(0, x - padding)
    y0 = max(0, y - padding)
    x1 = min(img_w, x + w + padding)
    y1 = min(img_h, y + h + padding)
    cv2.rectangle(masked_image, (x0, y0), (x1, y1), (0, 0, 0), -1)
Pros:
- Very simple - one parameter change
- Catches text edges and nearby text
- Fast execution
Cons:
- If padding too large, may mask handwriting
- If padding too small, still misses text
- Hard to find perfect padding value
Recommendation: ⭐ Quick test - try with padding=20-30
Option B: Run PaddleOCR Again on Each Region ⭐⭐ MEDIUM
Approach: Second-pass OCR on extracted regions to find remaining printed text
def clean_region(region_image, ocr_client):
    """
    Remove any remaining printed text from a region.

    Args:
        region_image: Extracted candidate region
        ocr_client: PaddleOCR client

    Returns:
        Cleaned image with only handwriting
    """
    # Run OCR on this specific region
    text_boxes = ocr_client.get_text_boxes(region_image)
    # Mask any detected printed text
    cleaned = region_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(cleaned, (x, y), (x+w, y+h), (0, 0, 0), -1)
    return cleaned
Pros:
- Very accurate - catches printed text PaddleOCR missed initially
- Clean separation of printed vs handwritten
- No manual tuning needed
Cons:
- Slower: one additional OCR call per candidate region (12 on the test page)
- May occasionally mask handwritten text if it looks printed
- More complex pipeline
Recommendation: ⭐⭐ Good option if masking padding isn't enough
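One detail to keep in mind with a second pass: boxes returned by OCR on a cropped region are relative to the crop, not the page. `clean_region` masks the crop directly, so no conversion is needed there, but if the second-pass boxes are ever logged or applied to the full page, the region offset has to be added back. A minimal sketch, with the hypothetical helper `to_page_coords` and made-up coordinates:

```python
def to_page_coords(region_box, local_boxes):
    """Shift crop-relative (x, y, w, h) boxes by the region's top-left corner."""
    rx, ry, _, _ = region_box
    return [(rx + x, ry + y, w, h) for (x, y, w, h) in local_boxes]

region_box = (400, 900, 300, 120)      # candidate region on the page
local_text_boxes = [(10, 20, 90, 30)]  # OCR result inside the crop
page_boxes = to_page_coords(region_box, local_text_boxes)
print(page_boxes)
```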
Option C: Computer Vision Stroke Analysis ⭐⭐⭐ HARD
Approach: Analyze stroke characteristics to distinguish printed vs handwritten
def separate_printed_handwritten(region_image):
    """
    Use CV techniques to separate printed from handwritten.

    Techniques:
    - Stroke width analysis (printed = uniform, handwritten = variable)
    - Edge detection + smoothness (printed = sharp, handwritten = organic)
    - Connected component analysis
    - Hough line detection (printed = straight, handwritten = curved)
    """
    # Complex implementation...
    pass
Pros:
- No API calls needed (fast)
- Can work when OCR fails
- Learns patterns in data
Cons:
- Very complex to implement
- May not be reliable across different documents
- Requires significant tuning
- Hard to maintain
Recommendation: ❌ Skip for now - too complex, uncertain results
Option D: VLM Crop Guidance ⚠️ RISKY
Approach: Ask VLM to provide coordinates of handwriting location
prompt = """
This image contains both printed and handwritten text.
Where is the handwritten signature located?
Provide coordinates as: x_start, y_start, x_end, y_end
"""
# VLM returns coordinates
# Crop to that region only
Pros:
- VLM understands visual context
- Can distinguish printed vs handwritten
Cons:
- VLM coordinates are unreliable (32% offset discovered in previous tests!)
- This was the original problem that led to PaddleOCR approach
- May extract wrong region
Recommendation: ❌ DO NOT USE - VLM coordinates proven unreliable
Option E: Two-Stage Hybrid Approach ⭐⭐⭐ BEST (RECOMMENDED)
Approach: Combine detection with targeted cleaning
def extract_signatures_twostage(pdf_path):
    """
    Stage 1: Detect candidate regions (current pipeline)
    Stage 2: Clean each region
    """
    # Stage 1: Full page processing
    image = render_pdf(pdf_path)
    text_boxes = ocr_client.get_text_boxes(image)
    masked_image = mask_text_regions(image, text_boxes, padding=20)
    candidate_regions = detect_regions(masked_image)

    # Stage 2: Per-region cleaning
    signatures = []
    for region_box in candidate_regions:
        # Extract region from ORIGINAL image (not masked)
        region_img = extract_region(image, region_box)

        # Step 2a: Run OCR again to find remaining printed text
        region_text_boxes = ocr_client.get_text_boxes(region_img)
        cleaned_region = mask_text_regions(region_img, region_text_boxes)

        # Step 2b: Ask VLM if it contains handwriting (yes/no only, no coordinates!)
        is_handwriting = vlm_verify(cleaned_region)
        if is_handwriting:
            signatures.append(cleaned_region)
    return signatures
Pros:
- Best accuracy - two passes of OCR
- Combines strengths of both approaches
- VLM only for yes/no, not coordinates
- Clean final output with only handwriting
Cons:
- Slower (one full-page OCR call plus one call per candidate region)
- More complex code
- Higher computational cost
Recommendation: ⭐⭐⭐ BEST OVERALL - implement this for production
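The `vlm_verify` step is still a TODO. A minimal sketch of a yes/no check against the Ollama server is shown below, using only the standard library. Several things here are assumptions: the crop is passed as PNG bytes, not as an image array, the prompt wording is invented, and answers are treated as positive only when they start with "yes". The request shape follows Ollama's `/api/generate` endpoint (`model`, `prompt`, `images` as base64 strings, `stream`).

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434/api/generate"  # host from Server Configuration

def build_payload(png_bytes, model="qwen2.5vl:32b"):
    """Build a non-streaming generate request that asks a yes/no question."""
    return {
        "model": model,
        "prompt": "Does this image contain a handwritten signature? Answer yes or no.",
        "images": [base64.b64encode(png_bytes).decode("ascii")],
        "stream": False,
    }

def parse_yes_no(answer):
    """Treat an answer starting with 'yes' as positive; anything else as negative."""
    return answer.strip().lower().startswith("yes")

def vlm_verify(png_bytes, url=OLLAMA_URL, timeout=60):
    """Send the crop to the VLM and return True if it reports a signature."""
    data = json.dumps(build_payload(png_bytes)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return parse_yes_no(body.get("response", ""))
```

Keeping the parsing strict (anything that is not a clear "yes" counts as "no") favors precision; whether that costs recall would need to be checked on real crops.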
Implementation Priority
Phase 1: Quick Wins (Test Immediately)
- Expand masking padding (Problem 2, Option A) - 5 minutes
- More aggressive morphology (Problem 1, Option A) - 5 minutes
- Test and measure improvement
Phase 2: Region Merging (If Phase 1 insufficient)
- Implement region merging algorithm (Problem 1, Option B) - 30 minutes
- Test on multiple PDFs
- Tune distance threshold
Phase 3: Two-Stage Approach (Best quality)
- Implement second-pass OCR on regions (Problem 2, Option E) - 1 hour
- Add VLM verification (Step 4 of pipeline) - 30 minutes
- Full pipeline testing
Code Files Status
Existing Files ✅
- paddleocr_client.py - REST API client for PaddleOCR server
- test_paddleocr_client.py - Connection and OCR test
- test_mask_and_detect.py - Current masking + detection pipeline
To Be Created 📝
- extract_signatures_paddleocr.py - Production pipeline with all improvements
- region_merger.py - Region merging utilities
- vlm_verifier.py - VLM handwriting verification
Server Configuration
PaddleOCR Server:
- Host: 192.168.30.36:5555
- Running: ✅ Yes (PID: 210417)
- Version: 3.3.0
- GPU: Enabled
- Language: Chinese (lang='ch')
VLM Server:
- Host: 192.168.30.36:11434 (Ollama)
- Model: qwen2.5vl:32b
- Status: Not tested yet in this pipeline
Test Plan
Test File
- File: 201301_1324_AI1_page3.pdf
- Expected signatures: 2 (楊智惠, 張志銘)
- Current recall: 100% (found both)
- Current precision: 16.7% (2 correct out of 12 regions)
Success Metrics After Improvements
| Metric | Current | Target |
|---|---|---|
| Signatures found | 2/2 (100%) | 2/2 (100%) |
| False positives | 10 (9 printed text + 1 possible split fragment) | < 2 |
| Precision | 16.7% | > 80% |
| Signatures split | Possibly 1 (Regions 4 and 5) | 0 |
| Printed text in regions | Yes | No |
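For reference, the precision and recall figures follow directly from the counts on the test page. The helper name below is hypothetical.

```python
def precision_recall(true_positives, candidates, expected):
    """Precision = TP / candidates returned; recall = TP / signatures expected."""
    precision = true_positives / candidates if candidates else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall

# Test page: 12 candidate regions, 2 of them real signatures, 2 expected
precision, recall = precision_recall(true_positives=2, candidates=12, expected=2)
print(f"precision={precision:.1%}, recall={recall:.1%}")

# Same page with a single remaining false positive
one_fp_precision, _ = precision_recall(true_positives=2, candidates=3, expected=2)
print(f"precision with one false positive={one_fp_precision:.1%}")
```

On a page with only two signatures, a single false positive already drops precision to about 66.7%, so the > 80% target is probably better measured across a larger test set than on this one page.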
Git Branch Strategy
Current branch: PaddleOCR-Cover
Status: Masking + Region Detection working, needs refinement
Recommended next steps:
- Commit current state with tag: paddleocr-v1-basic
- Create feature branches:
  - paddleocr-region-merging - For Problem 1 solutions
  - paddleocr-two-stage - For Problem 2 solutions
- Merge best solution back to PaddleOCR-Cover
Next Actions
Immediate (Today)
- Commit current working state
- Test Phase 1 quick wins (padding + morphology)
- Measure improvement
Short-term (This week)
- Implement Region Merging (Option B)
- Implement Two-Stage OCR (Option E)
- Add VLM verification
- Test on 10 PDFs
Long-term (Production)
- Optimize performance (parallel processing)
- Error handling and logging
- Process full 86K dataset
- Compare with previous hybrid approach (70% recall)
Comparison: PaddleOCR vs Previous Hybrid Approach
Previous Approach (VLM-Cover branch)
- Method: VLM names + CV detection + VLM verification
- Results: 70% recall, 100% precision
- Problem: Missed 30% of signatures (CV parameters too conservative)
PaddleOCR Approach (Current)
- Method: PaddleOCR masking + CV detection + VLM verification
- Results: 100% recall on the single test page (found both signatures), 16.7% precision
- Problem: Low precision (many false positives), printed text not fully removed
Winner: TBD
- PaddleOCR shows better recall potential
- After implementing refinements (Phase 2-3), should achieve high recall + high precision
- Need to test on larger dataset to confirm
Document version: 1.0
Last updated: October 28, 2025
Author: Claude Code
Status: Ready for implementation