# PaddleOCR Signature Extraction - Status & Options

**Date**: October 28, 2025
**Branch**: `PaddleOCR-Cover`
**Current Stage**: Masking + Region Detection Working, Refinement Needed

---

## Current Approach Overview

**Strategy**: PaddleOCR masks printed text → Detect remaining regions → VLM verification

### Pipeline Steps

```
1. PaddleOCR (Linux server 192.168.30.36:5555)
   └─> Detect printed text bounding boxes

2. OpenCV Masking (Local)
   └─> Black out all printed text areas

3. Region Detection (Local)
   └─> Find non-white areas (potential handwriting)

4. VLM Verification (TODO)
   └─> Confirm which regions are handwritten signatures
```

---

## Test Results (File: 201301_1324_AI1_page3.pdf)

### Performance

| Metric | Value |
|--------|-------|
| Printed text regions masked | 26 |
| Candidate regions detected | 12 |
| Actual signatures found | 2 ✅ |
| False positives (printed text) | 9 |
| Split signatures | 1 (Region 5 might be part of Region 4) |

### Success

✅ **PaddleOCR detected most printed text** (26 regions)
✅ **Masking works correctly** (black rectangles)
✅ **Region detection found both signatures** (regions 2, 4)
✅ **No false negatives** (didn't miss any signatures)

### Issues Identified

❌ **Problem 1: Handwriting Split Into Multiple Regions**
- Some signatures may be split into 2+ separate regions
- Example: Region 4 and Region 5 might be parts of the same signature area
- Caused by gaps between handwritten strokes after masking

❌ **Problem 2: Printed Name + Handwritten Signature Mixed**
- Region 2: Contains "張 志 銘" (printed) + handwritten signature
- Region 4: Contains "楊 智 惠" (printed) + handwritten signature
- PaddleOCR missed these printed names, so they weren't masked
- Final output includes both printed and handwritten parts

❌ **Problem 3: Printed Text Not Masked by PaddleOCR**
- 9 regions contain printed text that PaddleOCR didn't detect
- These became false positive candidates
- Examples: dates, company names, paragraph text
- Shows that PaddleOCR's text detection is not complete on this page

---

## Proposed Solutions

### Problem 1: Split Signatures

#### Option A: More Aggressive Morphology ⭐ EASY
**Approach**: Increase kernel size and iterations to connect nearby strokes

```python
import cv2

# `binary` is assumed to be the binarized page image after masking.

# Current settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)

# Proposed settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))  # 3x larger per side
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=5)  # More iterations
```

**Pros**:
- Simple parameter change (kernel size and iteration count)
- Connects nearby strokes automatically
- Fast execution

**Cons**:
- May merge unrelated regions if too aggressive
- Need to tune parameters carefully
- Could lose fine details

**Recommendation**: ⭐ Try first - easiest to implement and test
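
To get a feel for how far each setting reaches before tuning, the sketch below models closing on a single pixel row in pure Python. It is a simplified 1-D model, not the pipeline code: it assumes a rectangular kernel and assumes that `cv2.morphologyEx` with `MORPH_CLOSE` runs all dilation iterations first and then all erosion iterations. The helper names (`dilate_1d`, `erode_1d`, `close_1d`, `gap_is_bridged`, `max_bridged_gap`) are hypothetical.

```python
def dilate_1d(row, radius):
    """Set a pixel if any pixel within `radius` of it is set."""
    return [int(any(row[max(0, i - radius): i + radius + 1])) for i in range(len(row))]


def erode_1d(row, radius):
    """Keep a pixel only if every pixel within `radius` of it is set."""
    return [int(all(row[max(0, i - radius): i + radius + 1])) for i in range(len(row))]


def close_1d(row, kernel_size, iterations):
    """Closing = `iterations` dilations followed by `iterations` erosions."""
    radius = (kernel_size - 1) // 2
    for _ in range(iterations):
        row = dilate_1d(row, radius)
    for _ in range(iterations):
        row = erode_1d(row, radius)
    return row


def gap_is_bridged(gap, kernel_size, iterations, stroke=10, margin=100):
    """Two strokes separated by `gap` background pixels: does closing join them?"""
    row = [0] * margin + [1] * stroke + [0] * gap + [1] * stroke + [0] * margin
    closed = close_1d(row, kernel_size, iterations)
    start = margin + stroke
    return all(closed[start:start + gap])


def max_bridged_gap(kernel_size, iterations, limit=90):
    """Largest gap (in pixels, below `limit`) that the closing still bridges."""
    return max(g for g in range(1, limit) if gap_is_bridged(g, kernel_size, iterations))


print(max_bridged_gap(5, 2))    # current settings  -> 8
print(max_bridged_gap(15, 5))   # proposed settings -> 70
```

Under this model the reach is `iterations * (kernel_size - 1)` pixels, so the proposed settings bridge gaps roughly nine times wider than the current ones, not three times. That makes the "may merge unrelated regions" risk above worth checking on the test page before settling on these values.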

---

#### Option B: Region Merging After Detection ⭐⭐ MEDIUM (RECOMMENDED)
**Approach**: After detecting all regions, merge those that are close together

```python
def merge_nearby_regions(regions, distance_threshold=50):
    """
    Merge regions that are within distance_threshold pixels of each other.

    Args:
        regions: List of region dicts with 'box' (x, y, w, h)
        distance_threshold: Maximum pixels between regions to merge

    Returns:
        List of merged regions
    """
    def gap(a, b):
        # Edge-to-edge distance between two boxes (0 if they touch or overlap).
        # Uses the larger of the horizontal and vertical gaps.
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        dx = max(bx - (ax + aw), ax - (bx + bw), 0)
        dy = max(by - (ay + ah), ay - (by + bh), 0)
        return max(dx, dy)

    merged = [dict(r) for r in regions]
    changed = True
    while changed:  # 3. Repeat until no more merges are possible
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i]['box'], merged[j]['box']
                # 1. Distance between this pair of regions
                if gap(a, b) < distance_threshold:
                    # 2. Replace the pair with the union of their bounding boxes
                    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
                    x2 = max(a[0] + a[2], b[0] + b[2])
                    y2 = max(a[1] + a[3], b[1] + b[3])
                    merged[i]['box'] = (x1, y1, x2 - x1, y2 - y1)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

**Pros**:
- Keeps signatures together intelligently
- Won't merge distant unrelated regions
- Preserves original stroke details
- Can use vertical/horizontal distance separately

**Cons**:
- Need to tune distance threshold
- More complex than Option A
- May need multiple merge passes

**Recommendation**: ⭐⭐ **Best balance** - implement this if the Phase 1 quick wins are not enough

---

#### Option C: Don't Split - Extract Larger Context ⭐ EASY
**Approach**: When extracting regions, add significant padding to capture full context

```python
# Current: padding = 10 pixels
padding = 50  # Much larger padding

# Or: Merge all regions in the bottom 20% of page
# (signatures are usually at the bottom)
```

**Pros**:
- Captures complete signatures in most cases
- Very simple to implement
- Little risk of cutting off parts of a signature

**Cons**:
- May include extra unwanted content
- Larger image files
- Makes VLM verification more complex

**Recommendation**: ⭐ Use as fallback if B doesn't work
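
The "merge all regions in the bottom 20% of page" idea from the comment above could look roughly like the sketch below. It assumes boxes are `(x, y, w, h)` tuples in page pixels with the origin at the top-left, and it treats a region as "bottom" when its vertical center falls in the bottom fraction of the page; both the function name `merge_bottom_regions` and that center rule are illustrative choices, not part of the current pipeline.

```python
def merge_bottom_regions(boxes, page_height, bottom_fraction=0.2):
    """Union all boxes whose center lies in the bottom part of the page.

    Boxes above the cutoff are returned unchanged; the bottom boxes are
    replaced by one bounding box that covers all of them.
    """
    cutoff = page_height * (1.0 - bottom_fraction)
    bottom = [b for b in boxes if b[1] + b[3] / 2 >= cutoff]
    others = [b for b in boxes if b[1] + b[3] / 2 < cutoff]
    if not bottom:
        return others

    x1 = min(x for x, _, _, _ in bottom)
    y1 = min(y for _, y, _, _ in bottom)
    x2 = max(x + w for x, _, w, _ in bottom)
    y2 = max(y + h for _, y, _, h in bottom)
    return others + [(x1, y1, x2 - x1, y2 - y1)]


# Example with made-up boxes on a 1000 px tall page
boxes = [(10, 10, 50, 20), (100, 850, 80, 40), (200, 900, 60, 30)]
result = merge_bottom_regions(boxes, page_height=1000)
print(result)  # [(10, 10, 50, 20), (100, 850, 160, 80)]
```

Because the merged box can also swallow printed text that sits between signatures, this fits the fallback role described in the recommendation rather than a default.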

---

### Problem 2: Printed + Handwritten in Same Region

#### Option A: Expand PaddleOCR Masking Boxes ⭐ EASY
**Approach**: Add padding when masking text boxes to catch edges

```python
import cv2

padding = 20  # pixels
img_h, img_w = image.shape[:2]

for (x, y, w, h) in text_boxes:
    # Expand box in all directions, clamped to the image bounds
    x1 = max(0, x - padding)
    y1 = max(0, y - padding)
    x2 = min(img_w, x + w + padding)
    y2 = min(img_h, y + h + padding)

    cv2.rectangle(masked_image, (x1, y1), (x2, y2), (0, 0, 0), -1)
```

**Pros**:
- Very simple - one parameter change
- Catches text edges and nearby text
- Fast execution

**Cons**:
- If padding too large, may mask handwriting
- If padding too small, still misses text
- Hard to find perfect padding value

**Recommendation**: ⭐ Quick test - try with padding=20-30

---

#### Option B: Run PaddleOCR Again on Each Region ⭐⭐ MEDIUM
**Approach**: Second-pass OCR on extracted regions to find remaining printed text

```python
import cv2


def clean_region(region_image, ocr_client):
    """
    Remove any remaining printed text from a region.

    Args:
        region_image: Extracted candidate region
        ocr_client: PaddleOCR client

    Returns:
        Cleaned image with only handwriting
    """
    # Run OCR on this specific region
    text_boxes = ocr_client.get_text_boxes(region_image)

    # Mask any detected printed text
    cleaned = region_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(cleaned, (x, y), (x + w, y + h), (0, 0, 0), -1)

    return cleaned
```

**Pros**:
- Can catch printed text that the full-page pass missed
- Clean separation of printed vs handwritten
- No manual tuning needed

**Cons**:
- Slower: one extra OCR call per candidate region (12 extra calls on the test page)
- May occasionally mask handwritten text if it looks printed
- More complex pipeline

**Recommendation**: ⭐⭐ Good option if masking padding isn't enough

---

#### Option C: Computer Vision Stroke Analysis ⭐⭐⭐ HARD
**Approach**: Analyze stroke characteristics to distinguish printed vs handwritten

```python
def separate_printed_handwritten(region_image):
    """
    Use CV techniques to separate printed from handwritten.

    Techniques:
    - Stroke width analysis (printed = uniform, handwritten = variable)
    - Edge detection + smoothness (printed = sharp, handwritten = organic)
    - Connected component analysis
    - Hough line detection (printed = straight, handwritten = curved)
    """
    # Complex implementation, intentionally not written yet (see Recommendation)
    raise NotImplementedError
```

**Pros**:
- No API calls needed (fast)
- Can work when OCR fails
- Rule-based: needs no training data

**Cons**:
- Very complex to implement
- May not be reliable across different documents
- Requires significant tuning
- Hard to maintain

**Recommendation**: ❌ Skip for now - too complex, uncertain results

---

#### Option D: VLM Crop Guidance ⚠️ RISKY
**Approach**: Ask VLM to provide coordinates of handwriting location

```python
prompt = """
This image contains both printed and handwritten text.
Where is the handwritten signature located?
Provide coordinates as: x_start, y_start, x_end, y_end
"""

# VLM returns coordinates
# Crop to that region only
```

**Pros**:
- VLM understands visual context
- Can distinguish printed vs handwritten

**Cons**:
- **VLM coordinates are unreliable** (32% offset discovered in previous tests!)
- This was the original problem that led to the PaddleOCR approach
- May extract wrong region

**Recommendation**: ❌ **DO NOT USE** - VLM coordinates proven unreliable

---

#### Option E: Two-Stage Hybrid Approach ⭐⭐⭐ BEST (RECOMMENDED)
**Approach**: Combine detection with targeted cleaning

```python
# render_pdf, mask_text_regions, detect_regions, extract_region and vlm_verify
# are pipeline helpers (some of them still to be written).

def extract_signatures_twostage(pdf_path, ocr_client):
    """
    Stage 1: Detect candidate regions (current pipeline)
    Stage 2: Clean each region
    """
    # Stage 1: Full page processing
    image = render_pdf(pdf_path)
    text_boxes = ocr_client.get_text_boxes(image)
    masked_image = mask_text_regions(image, text_boxes, padding=20)
    candidate_regions = detect_regions(masked_image)

    # Stage 2: Per-region cleaning
    signatures = []
    for region_box in candidate_regions:
        # Extract region from ORIGINAL image (not masked)
        region_img = extract_region(image, region_box)

        # Step 2a: Run OCR again to find remaining printed text
        region_text_boxes = ocr_client.get_text_boxes(region_img)
        cleaned_region = mask_text_regions(region_img, region_text_boxes)

        # Step 2b: Ask VLM if it contains handwriting (no coordinates!)
        is_handwriting = vlm_verify(cleaned_region)

        if is_handwriting:
            signatures.append(cleaned_region)

    return signatures
```

**Pros**:
- Expected best accuracy - two passes of OCR
- Combines strengths of both approaches
- VLM only for yes/no, not coordinates
- Clean final output with only handwriting

**Cons**:
- Slower (one full-page OCR call plus one OCR call per candidate region)
- More complex code
- Higher computational cost

**Recommendation**: ⭐⭐⭐ **BEST OVERALL** - implement this for production
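
The yes/no `vlm_verify` step is not written yet. A minimal sketch, assuming the Ollama server listed under Server Configuration is reached through its `/api/generate` endpoint and that the region has already been encoded as PNG bytes (for example with `cv2.imencode`), might look like this. The prompt wording, the helper names `build_payload` and `parse_yes_no`, and the rule that an unclear answer counts as "not a signature" are all assumptions for illustration.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434/api/generate"  # host from Server Configuration
MODEL = "qwen2.5vl:32b"
PROMPT = (
    "Does this image contain a handwritten signature? "
    "Answer with exactly one word: yes or no."
)


def build_payload(png_bytes, model=MODEL, prompt=PROMPT):
    """Build a non-streaming generate request with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(png_bytes).decode("ascii")],
        "stream": False,
    }


def parse_yes_no(answer):
    """Return True for 'yes', False for 'no', None if the answer is unclear."""
    words = answer.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


def vlm_verify(png_bytes, url=OLLAMA_URL, timeout=120):
    """Ask the VLM a yes/no question only; no coordinates are requested."""
    data = json.dumps(build_payload(png_bytes)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: an unclear answer is treated as "not a signature".
    return parse_yes_no(body.get("response", "")) is True
```

Since recall is currently the strength of this pipeline, it may be safer to route unclear answers (`None`) to manual review instead of dropping them.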

---

## Implementation Priority

### Phase 1: Quick Wins (Test Immediately)
1. **Expand masking padding** (Problem 2, Option A) - 5 minutes
2. **More aggressive morphology** (Problem 1, Option A) - 5 minutes
3. **Test and measure improvement**

### Phase 2: Region Merging (If Phase 1 insufficient)
4. **Implement region merging algorithm** (Problem 1, Option B) - 30 minutes
5. **Test on multiple PDFs**
6. **Tune distance threshold**

### Phase 3: Two-Stage Approach (Best quality)
7. **Implement second-pass OCR on regions** (Problem 2, Option E) - 1 hour
8. **Add VLM verification** (Step 4 of pipeline) - 30 minutes
9. **Full pipeline testing**

---

## Code Files Status

### Existing Files ✅
- **`paddleocr_client.py`** - REST API client for PaddleOCR server
- **`test_paddleocr_client.py`** - Connection and OCR test
- **`test_mask_and_detect.py`** - Current masking + detection pipeline

### To Be Created 📝
- **`extract_signatures_paddleocr.py`** - Production pipeline with all improvements
- **`region_merger.py`** - Region merging utilities
- **`vlm_verifier.py`** - VLM handwriting verification

---

## Server Configuration

**PaddleOCR Server**:
- Host: `192.168.30.36:5555`
- Running: ✅ Yes (PID: 210417)
- Version: 3.3.0
- GPU: Enabled
- Language: Chinese (lang='ch')

**VLM Server**:
- Host: `192.168.30.36:11434` (Ollama)
- Model: `qwen2.5vl:32b`
- Status: Not tested yet in this pipeline

---

## Test Plan

### Test File
- **File**: `201301_1324_AI1_page3.pdf`
- **Expected signatures**: 2 (楊智惠, 張志銘)
- **Current recall**: 100% (found both)
- **Current precision**: 16.7% (2 correct out of 12 regions)

### Success Metrics After Improvements

| Metric | Current | Target |
|--------|---------|--------|
| Signatures found | 2/2 (100%) | 2/2 (100%) |
| False positives | 10 (9 printed text + 1 suspected split fragment) | < 2 |
| Precision | 16.7% | > 80% |
| Signatures split | 1 suspected (Region 5) | 0 |
| Printed text in regions | Yes | No |
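
The recall and precision figures in the table follow directly from the region counts. The small sketch below reproduces them; it assumes every candidate region that is not one of the two confirmed signatures counts as a false positive (including the suspected split fragment), and the function name `detection_metrics` is only illustrative.

```python
def detection_metrics(true_signatures, signatures_found, candidate_regions):
    """Return (recall, precision, false_positives) from simple region counts."""
    recall = signatures_found / true_signatures
    precision = signatures_found / candidate_regions
    false_positives = candidate_regions - signatures_found
    return recall, precision, false_positives


# Current state on the test page: 2 signatures, both found, 12 candidates
recall, precision, fp = detection_metrics(2, 2, 12)
print(f"recall={recall:.1%} precision={precision:.1%} false_positives={fp}")
# recall=100.0% precision=16.7% false_positives=10

# Same page with a single remaining false positive (3 candidates in total)
_, precision_one_fp, _ = detection_metrics(2, 2, 3)
print(f"precision with one false positive: {precision_one_fp:.1%}")
# precision with one false positive: 66.7%
```

On a page with only two signatures, one leftover false positive already drops precision to 66.7%, so the "> 80%" target is effectively stricter than "< 2 false positives" here. The two targets are easier to compare once precision is measured over a larger set of PDFs.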

---

## Git Branch Strategy

**Current branch**: `PaddleOCR-Cover`
**Status**: Masking + Region Detection working, needs refinement

**Recommended next steps**:
1. Commit current state with tag: `paddleocr-v1-basic`
2. Create feature branches:
   - `paddleocr-region-merging` - For Problem 1 solutions
   - `paddleocr-two-stage` - For Problem 2 solutions
3. Merge best solution back to `PaddleOCR-Cover`

---

## Next Actions

### Immediate (Today)
- [ ] Commit current working state
- [ ] Test Phase 1 quick wins (padding + morphology)
- [ ] Measure improvement

### Short-term (This week)
- [ ] Implement Region Merging (Option B)
- [ ] Implement Two-Stage OCR (Option E)
- [ ] Add VLM verification
- [ ] Test on 10 PDFs

### Long-term (Production)
- [ ] Optimize performance (parallel processing)
- [ ] Error handling and logging
- [ ] Process full 86K dataset
- [ ] Compare with previous hybrid approach (70% recall)

---

## Comparison: PaddleOCR vs Previous Hybrid Approach

### Previous Approach (VLM-Cover branch)
- **Method**: VLM names + CV detection + VLM verification
- **Results**: 70% recall, 100% precision
- **Problem**: Missed 30% of signatures (CV parameters too conservative)

### PaddleOCR Approach (Current)
- **Method**: PaddleOCR masking + CV detection + VLM verification
- **Results**: 100% recall on the single test page (found both signatures)
- **Problem**: Low precision (many false positives), printed text not fully removed

### Winner: TBD
- PaddleOCR shows **better recall potential**
- The refinements in Phases 2-3 are expected to give **high recall + high precision**, but this is not yet measured
- Need to test on larger dataset to confirm

---

**Document version**: 1.0
**Last updated**: October 28, 2025
**Author**: Claude Code
**Status**: Ready for implementation