# PaddleOCR Signature Extraction - Status & Options

**Date**: October 28, 2025
**Branch**: `PaddleOCR-Cover`
**Current Stage**: Masking + Region Detection Working, Refinement Needed

---

## Current Approach Overview

**Strategy**: PaddleOCR masks printed text → Detect remaining regions → VLM verification

### Pipeline Steps

```
1. PaddleOCR (Linux server 192.168.30.36:5555)
   └─> Detect printed text bounding boxes

2. OpenCV Masking (Local)
   └─> Black out all printed text areas

3. Region Detection (Local)
   └─> Find non-white areas (potential handwriting)

4. VLM Verification (TODO)
   └─> Confirm which regions are handwritten signatures
```

---

## Test Results (File: 201301_1324_AI1_page3.pdf)

### Performance

| Metric | Value |
|--------|-------|
| Printed text regions masked | 26 |
| Candidate regions detected | 12 |
| Actual signatures found | 2 ✅ |
| False positives (printed text) | 9 |
| Split signatures | 1 (Region 5 might be part of Region 4) |

### Success

- ✅ **PaddleOCR detected most printed text** (26 regions)
- ✅ **Masking works correctly** (black rectangles)
- ✅ **Region detection found both signatures** (regions 2, 4)
- ✅ **No false negatives** (didn't miss any signatures)

### Issues Identified

❌ **Problem 1: Handwriting Split Into Multiple Regions**
- Some signatures may be split into 2+ separate regions
- Example: Region 4 and Region 5 might be parts of the same signature area
- Caused by gaps between handwritten strokes after masking

❌ **Problem 2: Printed Name + Handwritten Signature Mixed**
- Region 2: Contains "張 志 銘" (printed) + handwritten signature
- Region 4: Contains "楊 智 惠" (printed) + handwritten signature
- PaddleOCR missed these printed names, so they weren't masked
- Final output includes both printed and handwritten parts

❌ **Problem 3: Printed Text Not Masked by PaddleOCR**
- 9 regions contain printed text that PaddleOCR didn't detect
- These became false positive candidates
- Examples: dates, company names, paragraph text
- Shows that PaddleOCR's text detection is not exhaustive

---

## Proposed Solutions

### Problem 1: Split Signatures

#### Option A: More Aggressive Morphology ⭐ EASY

**Approach**: Increase kernel size and iterations to connect nearby strokes

```python
import cv2

# `binary` is the binarized page image produced by the current pipeline.

# Current settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)

# Proposed settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))  # 3x larger per side
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=5)  # More iterations
```

**Pros**:
- Simple parameter change (kernel size and iteration count)
- Connects nearby strokes automatically
- Fast execution

**Cons**:
- May merge unrelated regions if too aggressive
- Need to tune parameters carefully
- Could lose fine details

**Recommendation**: ⭐ Try first - easiest to implement and test

---

#### Option B: Region Merging After Detection ⭐⭐ MEDIUM (RECOMMENDED)

**Approach**: After detecting all regions, merge those that are close together

```python
def merge_nearby_regions(regions, distance_threshold=50):
    """
    Merge regions that are within distance_threshold pixels of each other.

    Args:
        regions: List of region dicts with 'box' (x, y, w, h)
        distance_threshold: Maximum pixels between regions to merge

    Returns:
        List of merged regions
    """
    # Algorithm:
    # 1. Calculate distance between all region pairs
    # 2. If distance < threshold, merge their bounding boxes
    # 3. Repeat until no more merges possible

    merged = []
    # Implementation here...
    return merged
```

**Pros**:
- Keeps signatures together intelligently
- Won't merge distant unrelated regions
- Preserves original stroke details
- Can use vertical/horizontal distance separately

**Cons**:
- Need to tune distance threshold
- More complex than Option A
- May need multiple merge passes

**Recommendation**: ⭐⭐ **Best balance** - implement this if the Option A quick fix is not sufficient (see Implementation Priority)

---

#### Option C: Don't Split - Extract Larger Context ⭐ EASY

**Approach**: When extracting regions, add significant padding to capture full context

```python
# Current: padding = 10 pixels
padding = 50  # Much larger padding

# Or: Merge all regions in the bottom 20% of page
# (signatures are usually at the bottom)
```

**Pros**:
- More likely to capture complete signatures
- Very simple to implement
- Low risk of losing parts

**Cons**:
- May include extra unwanted content
- Larger image files
- Makes VLM verification more complex

**Recommendation**: ⭐ Use as fallback if B doesn't work

---

### Problem 2: Printed + Handwritten in Same Region

#### Option A: Expand PaddleOCR Masking Boxes ⭐ EASY

**Approach**: Add padding when masking text boxes to catch edges

```python
import cv2

padding = 20  # pixels
img_h, img_w = image.shape[:2]

for (x, y, w, h) in text_boxes:
    # Expand the box in all directions, clamped to the image bounds
    x1 = max(0, x - padding)
    y1 = max(0, y - padding)
    x2 = min(img_w, x + w + padding)
    y2 = min(img_h, y + h + padding)

    cv2.rectangle(masked_image, (x1, y1), (x2, y2), (0, 0, 0), -1)
```

**Pros**:
- Very simple - one parameter change
- Catches text edges and nearby text
- Fast execution

**Cons**:
- If padding too large, may mask handwriting
- If padding too small, still misses text
- Hard to find perfect padding value

**Recommendation**: ⭐ Quick test - try with padding=20-30

---

#### Option B: Run PaddleOCR Again on Each Region ⭐⭐ MEDIUM

**Approach**: Second-pass OCR on extracted regions to find remaining printed text

```python
import cv2

def clean_region(region_image, ocr_client):
    """
    Remove any remaining printed text from a region.

    Args:
        region_image: Extracted candidate region
        ocr_client: PaddleOCR client

    Returns:
        Cleaned image with only handwriting
    """
    # Run OCR on this specific region
    text_boxes = ocr_client.get_text_boxes(region_image)

    # Mask any detected printed text
    cleaned = region_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(cleaned, (x, y), (x + w, y + h), (0, 0, 0), -1)

    return cleaned
```

**Pros**:
- Can catch printed text that PaddleOCR missed on the full-page pass
- Clean separation of printed vs handwritten
- No manual tuning needed

**Cons**:
- Slower: one extra OCR call per candidate region (12 extra calls on the test page)
- May occasionally mask handwritten text if it looks printed
- More complex pipeline

**Recommendation**: ⭐⭐ Good option if masking padding isn't enough

---

#### Option C: Computer Vision Stroke Analysis ⭐⭐⭐ HARD

**Approach**: Analyze stroke characteristics to distinguish printed vs handwritten

```python
def separate_printed_handwritten(region_image):
    """
    Use CV techniques to separate printed from handwritten.

    Techniques:
    - Stroke width analysis (printed = uniform, handwritten = variable)
    - Edge detection + smoothness (printed = sharp, handwritten = organic)
    - Connected component analysis
    - Hough line detection (printed = straight, handwritten = curved)
    """
    # Complex implementation...
    pass
```

**Pros**:
- No API calls needed (fast)
- Can work when OCR fails
- Learns patterns in data

**Cons**:
- Very complex to implement
- May not be reliable across different documents
- Requires significant tuning
- Hard to maintain

**Recommendation**: ❌ Skip for now - too complex, uncertain results

---

#### Option D: VLM Crop Guidance ⚠️ RISKY

**Approach**: Ask VLM to provide coordinates of handwriting location

```python
prompt = """
This image contains both printed and handwritten text.
Where is the handwritten signature located?
Provide coordinates as: x_start, y_start, x_end, y_end
"""

# VLM returns coordinates
# Crop to that region only
```

**Pros**:
- VLM understands visual context
- Can distinguish printed vs handwritten

**Cons**:
- **VLM coordinates are unreliable** (32% offset discovered in previous tests!)
- This was the original problem that led to the PaddleOCR approach
- May extract wrong region

**Recommendation**: ❌ **DO NOT USE** - VLM coordinates proven unreliable

---

#### Option E: Two-Stage Hybrid Approach ⭐⭐⭐ BEST (RECOMMENDED)

**Approach**: Combine detection with targeted cleaning

```python
# Note: ocr_client, render_pdf, mask_text_regions, detect_regions,
# extract_region and vlm_verify are expected to come from the pipeline modules.

def extract_signatures_twostage(pdf_path):
    """
    Stage 1: Detect candidate regions (current pipeline)
    Stage 2: Clean each region
    """
    # Stage 1: Full page processing
    image = render_pdf(pdf_path)
    text_boxes = ocr_client.get_text_boxes(image)
    masked_image = mask_text_regions(image, text_boxes, padding=20)
    candidate_regions = detect_regions(masked_image)

    # Stage 2: Per-region cleaning
    signatures = []
    for region_box in candidate_regions:
        # Extract region from ORIGINAL image (not masked)
        region_img = extract_region(image, region_box)

        # Step 2a: Run OCR again to find remaining printed text
        region_text_boxes = ocr_client.get_text_boxes(region_img)
        cleaned_region = mask_text_regions(region_img, region_text_boxes)

        # Step 2b: Ask VLM if it contains handwriting (no coordinates!)
        is_handwriting = vlm_verify(cleaned_region)

        if is_handwriting:
            signatures.append(cleaned_region)

    return signatures
```

**Pros**:
- Best expected accuracy - two passes of OCR
- Combines strengths of both approaches
- VLM only for yes/no, not coordinates
- Clean final output with only handwriting

**Cons**:
- Slower (one full-page OCR call plus one OCR call and one VLM call per candidate region)
- More complex code
- Higher computational cost

**Recommendation**: ⭐⭐⭐ **BEST OVERALL** - implement this for production

---

## Implementation Priority

### Phase 1: Quick Wins (Test Immediately)

1. **Expand masking padding** (Problem 2, Option A) - 5 minutes
2. **More aggressive morphology** (Problem 1, Option A) - 5 minutes
3. **Test and measure improvement**

### Phase 2: Region Merging (If Phase 1 insufficient)

4. **Implement region merging algorithm** (Problem 1, Option B) - 30 minutes
5. **Test on multiple PDFs**
6. **Tune distance threshold**

### Phase 3: Two-Stage Approach (Best quality)

7. **Implement second-pass OCR on regions** (Problem 2, Option E) - 1 hour
8. **Add VLM verification** (Step 4 of pipeline) - 30 minutes
9. **Full pipeline testing**

---

## Code Files Status

### Existing Files ✅

- **`paddleocr_client.py`** - REST API client for PaddleOCR server
- **`test_paddleocr_client.py`** - Connection and OCR test
- **`test_mask_and_detect.py`** - Current masking + detection pipeline

### To Be Created 📝

- **`extract_signatures_paddleocr.py`** - Production pipeline with all improvements
- **`region_merger.py`** - Region merging utilities
- **`vlm_verifier.py`** - VLM handwriting verification

---

## Server Configuration

**PaddleOCR Server**:
- Host: `192.168.30.36:5555`
- Running: ✅ Yes (PID: 210417)
- Version: 3.3.0
- GPU: Enabled
- Language: Chinese (lang='ch')

**VLM Server**:
- Host: `192.168.30.36:11434` (Ollama)
- Model: `qwen2.5vl:32b`
- Status: Not tested yet in this pipeline

---

## Test Plan

### Test File

- **File**: `201301_1324_AI1_page3.pdf`
- **Expected signatures**: 2 (楊智惠, 張志銘)
- **Current recall**: 100% (found both)
- **Current precision**: 16.7% (2 correct out of 12 regions)

### Success Metrics After Improvements

| Metric | Current | Target |
|--------|---------|--------|
| Signatures found | 2/2 (100%) | 2/2 (100%) |
| False positives | 10 (9 printed text + 1 possible split fragment) | < 2 |
| Precision | 16.7% | > 80% |
| Signatures split | Unconfirmed (possibly 1: Regions 4 and 5) | 0 |
| Printed text in regions | Yes | No |

---

## Git Branch Strategy

**Current branch**: `PaddleOCR-Cover`
**Status**: Masking + Region Detection working, needs refinement

**Recommended next steps**:
1. Commit current state with tag: `paddleocr-v1-basic`
2. Create feature branches:
   - `paddleocr-region-merging` - For Problem 1 solutions
   - `paddleocr-two-stage` - For Problem 2 solutions
3. Merge best solution back to `PaddleOCR-Cover`

---

## Next Actions

### Immediate (Today)

- [ ] Commit current working state
- [ ] Test Phase 1 quick wins (padding + morphology)
- [ ] Measure improvement

### Short-term (This week)

- [ ] Implement Region Merging (Option B)
- [ ] Implement Two-Stage OCR (Option E)
- [ ] Add VLM verification
- [ ] Test on 10 PDFs

### Long-term (Production)

- [ ] Optimize performance (parallel processing)
- [ ] Error handling and logging
- [ ] Process full 86K dataset
- [ ] Compare with previous hybrid approach (70% recall)

---

## Comparison: PaddleOCR vs Previous Hybrid Approach

### Previous Approach (VLM-Cover branch)

- **Method**: VLM names + CV detection + VLM verification
- **Results**: 70% recall, 100% precision
- **Problem**: Missed 30% of signatures (CV parameters too conservative)

### PaddleOCR Approach (Current)

- **Method**: PaddleOCR masking + CV detection + VLM verification
- **Results**: 100% recall on the single test page (found both signatures)
- **Problem**: Low precision (many false positives), printed text not fully removed

### Winner: TBD

- PaddleOCR shows **better recall potential**
- After implementing refinements (Phase 2-3), it is expected to achieve **high recall + high precision**
- Need to test on larger dataset to confirm

---

**Document version**: 1.0
**Last updated**: October 28, 2025
**Author**: Claude Code
**Status**: Ready for implementation
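
---

## Appendix: Region Merging Sketch (Problem 1, Option B)

The region merging step listed under Next Actions is only stubbed out in Option B. The block below is a minimal sketch of one way it could work, not the project's actual implementation. It keeps the `merge_nearby_regions(regions, distance_threshold=50)` signature and the `'box': (x, y, w, h)` format from Option B. The helpers `box_gap` and `union_box` are hypothetical names introduced here, and measuring distance as the edge-to-edge Euclidean gap between boxes is an assumption (the document does not say how distance is defined). Any extra keys in the region dicts are dropped in this sketch.

```python
from math import hypot


def box_gap(a, b):
    """Edge-to-edge distance in pixels between two (x, y, w, h) boxes.

    Returns 0 when the boxes touch or overlap. (Hypothetical helper.)
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = max(0, max(ax, bx) - min(ax + aw, bx + bw))
    dy = max(0, max(ay, by) - min(ay + ah, by + bh))
    return hypot(dx, dy)


def union_box(a, b):
    """Smallest (x, y, w, h) box that contains both inputs. (Hypothetical helper.)"""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = min(ax, bx), min(ay, by)
    x2, y2 = max(ax + aw, bx + bw), max(ay + ah, by + bh)
    return (x1, y1, x2 - x1, y2 - y1)


def merge_nearby_regions(regions, distance_threshold=50):
    """Repeatedly merge any two boxes whose gap is below the threshold."""
    boxes = [tuple(r["box"]) for r in regions]
    changed = True
    while changed:  # step 3 of the stub: repeat until no more merges
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if box_gap(boxes[i], boxes[j]) < distance_threshold:
                    boxes[i] = union_box(boxes[i], boxes[j])
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since a grown box may now reach others
    return [{"box": b} for b in boxes]


# Example with made-up coordinates: two fragments 20 px apart plus a distant region.
example = [
    {"box": (100, 100, 80, 40)},
    {"box": (200, 105, 60, 35)},
    {"box": (600, 900, 50, 20)},
]
merged_example = merge_nearby_regions(example)
```

In this example the first two boxes are merged into `(100, 100, 160, 40)` and the distant box is left alone. The restart-after-merge loop is quadratic per pass, which should be acceptable for the dozen or so candidate regions seen on the test page; a union-find structure would be the usual alternative if region counts grow.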