Add PaddleOCR masking and region detection pipeline
- Created PaddleOCR client for remote server communication
- Implemented text masking + region detection pipeline
- Test results: 100% recall on sample PDF (found both signatures)
- Identified issues: split regions, printed text not fully masked
- Documented 5 solution options in PADDLEOCR_STATUS.md
- Next: implement region merging and two-stage cleaning
475
PADDLEOCR_STATUS.md
Normal file
@@ -0,0 +1,475 @@
# PaddleOCR Signature Extraction - Status & Options

**Date**: October 28, 2025
**Branch**: `PaddleOCR-Cover`
**Current Stage**: Masking + Region Detection Working, Refinement Needed

---

## Current Approach Overview

**Strategy**: PaddleOCR masks printed text → Detect remaining regions → VLM verification

### Pipeline Steps

```
1. PaddleOCR (Linux server 192.168.30.36:5555)
   └─> Detect printed text bounding boxes

2. OpenCV Masking (Local)
   └─> Black out all printed text areas

3. Region Detection (Local)
   └─> Find non-white areas (potential handwriting)

4. VLM Verification (TODO)
   └─> Confirm which regions are handwritten signatures
```

---

## Test Results (File: 201301_1324_AI1_page3.pdf)

### Performance

| Metric | Value |
|--------|-------|
| Printed text regions masked | 26 |
| Candidate regions detected | 12 |
| Actual signatures found | 2 ✅ |
| False positives (printed text) | 9 |
| Split signatures | 1 (Region 5 might be part of Region 4) |

### Success

✅ **PaddleOCR detected most printed text** (26 regions)
✅ **Masking works correctly** (black rectangles)
✅ **Region detection found both signatures** (regions 2, 4)
✅ **No false negatives** (didn't miss any signatures)

### Issues Identified

❌ **Problem 1: Handwriting Split Into Multiple Regions**
- Some signatures may be split into 2+ separate regions
- Example: Region 4 and Region 5 might be parts of the same signature area
- Caused by gaps between handwritten strokes after masking

❌ **Problem 2: Printed Name + Handwritten Signature Mixed**
- Region 2: contains "張 志 銘" (printed) plus a handwritten signature
- Region 4: contains "楊 智 惠" (printed) plus a handwritten signature
- PaddleOCR missed these printed names, so they weren't masked
- Final output includes both printed and handwritten parts

❌ **Problem 3: Printed Text Not Masked by PaddleOCR**
- 9 regions contain printed text that PaddleOCR didn't detect
- These became false-positive candidates
- Examples: dates, company names, paragraph text
- Shows PaddleOCR's detection isn't 100% complete

---

## Proposed Solutions

### Problem 1: Split Signatures

#### Option A: More Aggressive Morphology ⭐ EASY
**Approach**: Increase kernel size and iterations to connect nearby strokes

```python
# Current settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)

# Proposed settings:
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))  # 3x larger
morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=5)  # More iterations
```

**Pros**:
- Simple one-line change
- Connects nearby strokes automatically
- Fast execution

**Cons**:
- May merge unrelated regions if too aggressive
- Need to tune parameters carefully
- Could lose fine details

**Recommendation**: ⭐ Try first - easiest to implement and test

---

#### Option B: Region Merging After Detection ⭐⭐ MEDIUM (RECOMMENDED)
**Approach**: After detecting all regions, merge those that are close together

```python
def merge_nearby_regions(regions, distance_threshold=50):
    """
    Merge regions that are within distance_threshold pixels of each other.

    Args:
        regions: List of region dicts with 'box' (x, y, w, h)
        distance_threshold: Maximum pixels between regions to merge

    Returns:
        List of merged regions
    """
    # Algorithm:
    # 1. Calculate the gap between all region pairs
    # 2. If the gap < threshold on both axes, merge their bounding boxes
    # 3. Repeat until no more merges are possible
    boxes = [r['box'] for r in regions]
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                xi, yi, wi, hi = boxes[i]
                xj, yj, wj, hj = boxes[j]
                # Axis-aligned gap between the two boxes (0 if they overlap)
                gap_x = max(0, max(xi, xj) - min(xi + wi, xj + wj))
                gap_y = max(0, max(yi, yj) - min(yi + hi, yj + hj))
                if gap_x <= distance_threshold and gap_y <= distance_threshold:
                    x, y = min(xi, xj), min(yi, yj)
                    w = max(xi + wi, xj + wj) - x
                    h = max(yi + hi, yj + hj) - y
                    boxes[i] = (x, y, w, h)
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return [{'box': b} for b in boxes]
```
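
A possible call site, using the `potential_regions` list produced by `test_mask_and_detect.py` (the threshold value is a starting guess, not a tested setting):

```python
# hypothetical usage with the candidate list from test_mask_and_detect.py
merged_regions = merge_nearby_regions(potential_regions, distance_threshold=50)
print(f"{len(potential_regions)} candidates -> {len(merged_regions)} after merging")
```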

**Pros**:
- Keeps signatures together intelligently
- Won't merge distant unrelated regions
- Preserves original stroke details
- Can use vertical/horizontal distance separately

**Cons**:
- Need to tune distance threshold
- More complex than Option A
- May need multiple merge passes

**Recommendation**: ⭐⭐ **Best balance** - implement this first

---

#### Option C: Don't Split - Extract Larger Context ⭐ EASY
**Approach**: When extracting regions, add significant padding to capture full context

```python
# Current: padding = 10 pixels
padding = 50  # Much larger padding

# Or: merge all regions in the bottom 20% of the page
# (signatures are usually at the bottom)
page_h = original_image.shape[0]
bottom_regions = [r for r in potential_regions if r['box'][1] > 0.8 * page_h]
```

**Pros**:
- Guaranteed to capture complete signatures
- Very simple to implement
- No risk of losing parts

**Cons**:
- May include extra unwanted content
- Larger image files
- Makes VLM verification more complex

**Recommendation**: ⭐ Use as fallback if B doesn't work

---

### Problem 2: Printed + Handwritten in Same Region

#### Option A: Expand PaddleOCR Masking Boxes ⭐ EASY
**Approach**: Add padding when masking text boxes to catch edges

```python
padding = 20  # pixels

for (x, y, w, h) in text_boxes:
    # Expand box in all directions
    x_pad = max(0, x - padding)
    y_pad = max(0, y - padding)
    w_pad = min(image.shape[1] - x_pad, w + 2*padding)
    h_pad = min(image.shape[0] - y_pad, h + 2*padding)

    cv2.rectangle(masked_image, (x_pad, y_pad),
                  (x_pad + w_pad, y_pad + h_pad), (0, 0, 0), -1)
```

**Pros**:
- Very simple - one parameter change
- Catches text edges and nearby text
- Fast execution

**Cons**:
- If padding is too large, it may mask handwriting
- If padding is too small, it still misses text
- Hard to find the perfect padding value

**Recommendation**: ⭐ Quick test - try with padding=20-30

---

#### Option B: Run PaddleOCR Again on Each Region ⭐⭐ MEDIUM
**Approach**: Second-pass OCR on extracted regions to find remaining printed text

```python
def clean_region(region_image, ocr_client):
    """
    Remove any remaining printed text from a region.

    Args:
        region_image: Extracted candidate region
        ocr_client: PaddleOCR client

    Returns:
        Cleaned image with only handwriting
    """
    # Run OCR on this specific region
    text_boxes = ocr_client.get_text_boxes(region_image)

    # Mask any detected printed text
    cleaned = region_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(cleaned, (x, y), (x+w, y+h), (0, 0, 0), -1)

    return cleaned
```

**Pros**:
- Very accurate - catches printed text PaddleOCR missed initially
- Clean separation of printed vs handwritten
- No manual tuning needed

**Cons**:
- Slower (one extra OCR call per region)
- May occasionally mask handwritten text if it looks printed
- More complex pipeline

**Recommendation**: ⭐⭐ Good option if masking padding isn't enough

---

#### Option C: Computer Vision Stroke Analysis ⭐⭐⭐ HARD
**Approach**: Analyze stroke characteristics to distinguish printed vs handwritten

```python
import cv2

def separate_printed_handwritten(region_image):
    """
    Use CV techniques to separate printed from handwritten.

    Techniques:
    - Stroke width analysis (printed = uniform, handwritten = variable)
    - Edge detection + smoothness (printed = sharp, handwritten = organic)
    - Connected component analysis
    - Hough line detection (printed = straight, handwritten = curved)
    """
    # Sketch of the first heuristic only: stroke width via distance transform
    gray = cv2.cvtColor(region_image, cv2.COLOR_RGB2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
    widths = dist[binary > 0]
    # High stroke-width variance suggests handwriting; the 0.5 cutoff is untested
    return 'handwritten' if widths.std() / (widths.mean() + 1e-6) > 0.5 else 'printed'
```

**Pros**:
- No API calls needed (fast)
- Can work when OCR fails
- Exploits consistent visual differences between printed and handwritten strokes

**Cons**:
- Very complex to implement
- May not be reliable across different documents
- Requires significant tuning
- Hard to maintain

**Recommendation**: ❌ Skip for now - too complex, uncertain results

---

#### Option D: VLM Crop Guidance ⚠️ RISKY
**Approach**: Ask VLM to provide coordinates of handwriting location

```python
prompt = """
This image contains both printed and handwritten text.
Where is the handwritten signature located?
Provide coordinates as: x_start, y_start, x_end, y_end
"""

# VLM returns coordinates
# Crop to that region only
```

**Pros**:
- VLM understands visual context
- Can distinguish printed vs handwritten

**Cons**:
- **VLM coordinates are unreliable** (32% offset discovered in previous tests!)
- This was the original problem that led to the PaddleOCR approach
- May extract the wrong region

**Recommendation**: ❌ **DO NOT USE** - VLM coordinates proven unreliable

---

#### Option E: Two-Stage Hybrid Approach ⭐⭐⭐ BEST (RECOMMENDED)
**Approach**: Combine detection with targeted cleaning

```python
def extract_signatures_twostage(pdf_path):
    """
    Stage 1: Detect candidate regions (current pipeline)
    Stage 2: Clean each region
    """
    # Stage 1: Full page processing
    image = render_pdf(pdf_path)
    text_boxes = ocr_client.get_text_boxes(image)
    masked_image = mask_text_regions(image, text_boxes, padding=20)
    candidate_regions = detect_regions(masked_image)

    # Stage 2: Per-region cleaning
    signatures = []
    for region_box in candidate_regions:
        # Extract region from ORIGINAL image (not masked)
        region_img = extract_region(image, region_box)

        # Option 1: Run OCR again to find remaining printed text
        region_text_boxes = ocr_client.get_text_boxes(region_img)
        cleaned_region = mask_text_regions(region_img, region_text_boxes)

        # Option 2: Ask VLM if it contains handwriting (no coordinates!)
        is_handwriting = vlm_verify(cleaned_region)

        if is_handwriting:
            signatures.append(cleaned_region)

    return signatures
```
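
`vlm_verify` above is still a stub; a minimal sketch of it, reusing the Ollama URL, model, and request shape from `check_rejected_for_missing.py` (the prompt wording here is an assumption), could look like:

```python
import base64

import cv2
import requests

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"

def vlm_verify(region_image) -> bool:
    """Ask the VLM a yes/no question about the region (no coordinates)."""
    # Encode the RGB numpy region as base64 PNG, as the Ollama API expects
    _, png = cv2.imencode('.png', cv2.cvtColor(region_image, cv2.COLOR_RGB2BGR))
    payload = {
        "model": OLLAMA_MODEL,
        "prompt": "Does this image contain a handwritten signature? Answer only 'yes' or 'no'.",
        "images": [base64.b64encode(png.tobytes()).decode('utf-8')],
        "stream": False,
    }
    response = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=60)
    response.raise_for_status()
    return 'yes' in response.json()['response'].strip().lower()
```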

**Pros**:
- Best accuracy - two passes of OCR
- Combines strengths of both approaches
- VLM only for yes/no, not coordinates
- Clean final output with only handwriting

**Cons**:
- Slower (full-page OCR plus a second OCR call per candidate region)
- More complex code
- Higher computational cost

**Recommendation**: ⭐⭐⭐ **BEST OVERALL** - implement this for production

---

## Implementation Priority

### Phase 1: Quick Wins (Test Immediately)
1. **Expand masking padding** (Problem 2, Option A) - 5 minutes
2. **More aggressive morphology** (Problem 1, Option A) - 5 minutes
3. **Test and measure improvement**

### Phase 2: Region Merging (If Phase 1 insufficient)
4. **Implement region merging algorithm** (Problem 1, Option B) - 30 minutes
5. **Test on multiple PDFs**
6. **Tune distance threshold**

### Phase 3: Two-Stage Approach (Best quality)
7. **Implement second-pass OCR on regions** (Problem 2, Option E) - 1 hour
8. **Add VLM verification** (Step 4 of pipeline) - 30 minutes
9. **Full pipeline testing**

---

## Code Files Status

### Existing Files ✅
- **`paddleocr_client.py`** - REST API client for the PaddleOCR server
- **`test_paddleocr_client.py`** - Connection and OCR test
- **`test_mask_and_detect.py`** - Current masking + detection pipeline

### To Be Created 📝
- **`extract_signatures_paddleocr.py`** - Production pipeline with all improvements
- **`region_merger.py`** - Region merging utilities
- **`vlm_verifier.py`** - VLM handwriting verification

---

## Server Configuration

**PaddleOCR Server**:
- Host: `192.168.30.36:5555`
- Running: ✅ Yes (PID: 210417)
- Version: 3.3.0
- GPU: Enabled
- Language: Chinese (lang='ch')
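
The server side of this contract isn't included in the commit. A minimal Flask sketch that would satisfy the `/health` and `/ocr` endpoints (hypothetical, inferred from `paddleocr_client.py`'s request/response handling; the result parsing follows the classic `ocr()` output format and may need adapting to the installed PaddleOCR version):

```python
# Hypothetical server counterpart, matching the client's /health and /ocr
# contract. Assumes Flask and PaddleOCR are installed on the Linux host.
import base64
from io import BytesIO

import numpy as np
from flask import Flask, jsonify, request
from paddleocr import PaddleOCR
from PIL import Image

app = Flask(__name__)
engine = PaddleOCR(lang='ch')

@app.route('/health')
def health():
    return jsonify({'status': 'ok'})

@app.route('/ocr', methods=['POST'])
def run_ocr():
    try:
        image_b64 = request.json['image']
        image = np.array(Image.open(BytesIO(base64.b64decode(image_b64))))
        raw = engine.ocr(image)
        results = []
        for line in (raw[0] or []):
            box, (text, confidence) = line
            results.append({
                'box': [[float(x), float(y)] for x, y in box],
                'text': text,
                'confidence': float(confidence),
            })
        return jsonify({'success': True, 'results': results})
    except Exception as e:
        return jsonify({'success': False, 'error': str(e)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5555)
```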

**VLM Server**:
- Host: `192.168.30.36:11434` (Ollama)
- Model: `qwen2.5vl:32b`
- Status: Not tested yet in this pipeline
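
A quick connectivity check before running the pipeline (`/health` is the route used by `paddleocr_client.py`; `/api/tags` is Ollama's standard model-listing endpoint):

```python
import requests

# PaddleOCR server: health_check() expects {'status': 'ok'} from /health
print(requests.get("http://192.168.30.36:5555/health", timeout=5).json())

# Ollama server: /api/tags lists the models it can serve
tags = requests.get("http://192.168.30.36:11434/api/tags", timeout=5).json()
print([m['name'] for m in tags.get('models', [])])
```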

---

## Test Plan

### Test File
- **File**: `201301_1324_AI1_page3.pdf`
- **Expected signatures**: 2 (楊智惠, 張志銘)
- **Current recall**: 100% (found both)
- **Current precision**: 16.7% (2 correct out of 12 regions)

### Success Metrics After Improvements

| Metric | Current | Target |
|--------|---------|--------|
| Signatures found | 2/2 (100%) | 2/2 (100%) |
| False positives | 10 | < 2 |
| Precision | 16.7% | > 80% |
| Signatures split | Unknown | 0 |
| Printed text in regions | Yes | No |
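
For reference, the precision and recall figures above follow the standard definitions; a minimal check of the arithmetic:

```python
tp, fp, fn = 2, 10, 0          # from the current test run (12 candidates)
precision = tp / (tp + fp)     # 2/12 ≈ 16.7%
recall = tp / (tp + fn)        # 2/2 = 100%
print(f"precision={precision:.1%}, recall={recall:.1%}")
```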

---

## Git Branch Strategy

**Current branch**: `PaddleOCR-Cover`
**Status**: Masking + region detection working, needs refinement

**Recommended next steps**:
1. Commit current state with tag: `paddleocr-v1-basic`
2. Create feature branches:
   - `paddleocr-region-merging` - For Problem 1 solutions
   - `paddleocr-two-stage` - For Problem 2 solutions
3. Merge the best solution back to `PaddleOCR-Cover`

---

## Next Actions

### Immediate (Today)
- [ ] Commit current working state
- [ ] Test Phase 1 quick wins (padding + morphology)
- [ ] Measure improvement

### Short-term (This week)
- [ ] Implement region merging (Option B)
- [ ] Implement two-stage OCR (Option E)
- [ ] Add VLM verification
- [ ] Test on 10 PDFs

### Long-term (Production)
- [ ] Optimize performance (parallel processing)
- [ ] Error handling and logging
- [ ] Process full 86K dataset
- [ ] Compare with previous hybrid approach (70% recall)

---

## Comparison: PaddleOCR vs Previous Hybrid Approach

### Previous Approach (VLM-Cover branch)
- **Method**: VLM names + CV detection + VLM verification
- **Results**: 70% recall, 100% precision
- **Problem**: Missed 30% of signatures (CV parameters too conservative)

### PaddleOCR Approach (Current)
- **Method**: PaddleOCR masking + CV detection + VLM verification
- **Results**: 100% recall (found both signatures)
- **Problem**: Low precision (many false positives), printed text not fully removed

### Winner: TBD
- PaddleOCR shows **better recall potential**
- After implementing refinements (Phase 2-3), it should achieve **high recall + high precision**
- Need to test on a larger dataset to confirm

---

**Document version**: 1.0
**Last updated**: October 28, 2025
**Author**: Claude Code
**Status**: Ready for implementation
75
check_rejected_for_missing.py
Normal file
@@ -0,0 +1,75 @@
#!/usr/bin/env python3
"""Check if rejected regions contain the missing signatures."""

import base64
import requests
from pathlib import Path

OLLAMA_URL = "http://192.168.30.36:11434"
OLLAMA_MODEL = "qwen2.5vl:32b"
REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"

# Missing signatures based on test results
MISSING = {
    "201301_2061_AI1_page5": "林姿妤",
    "201301_2458_AI1_page4": "魏興海",
    "201301_2923_AI1_page3": "陈丽琦"
}

def encode_image_to_base64(image_path):
    """Encode image file to base64."""
    with open(image_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

def ask_vlm_about_signature(image_base64, expected_name):
    """Ask VLM if the image contains the expected signature."""
    prompt = f"""Does this image contain a handwritten signature with the Chinese name: "{expected_name}"?

Look carefully for handwritten Chinese characters matching this name.

Answer only 'yes' or 'no'."""

    payload = {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "images": [image_base64],
        "stream": False
    }

    try:
        response = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=60)
        response.raise_for_status()
        answer = response.json()['response'].strip().lower()
        return answer
    except Exception as e:
        return f"error: {str(e)}"

# Check each missing signature
for pdf_stem, missing_name in MISSING.items():
    print(f"\n{'='*80}")
    print(f"Checking rejected regions from: {pdf_stem}")
    print(f"Looking for missing signature: {missing_name}")
    print('='*80)

    # Find all rejected regions from this PDF
    rejected_regions = sorted(Path(REJECTED_PATH).glob(f"{pdf_stem}_region_*.png"))

    print(f"Found {len(rejected_regions)} rejected regions to check")

    for region_path in rejected_regions:
        region_name = region_path.name
        print(f"\nChecking: {region_name}...", end='', flush=True)

        # Encode and ask VLM
        image_base64 = encode_image_to_base64(region_path)
        answer = ask_vlm_about_signature(image_base64, missing_name)

        if 'yes' in answer:
            print(f" ✅ FOUND! This region contains {missing_name}")
            print(" → The signature was detected by CV but rejected by verification!")
        else:
            print(f" ❌ No (VLM says: {answer})")

print(f"\n{'='*80}")
print("Analysis complete!")
print('='*80)
169
paddleocr_client.py
Normal file
@@ -0,0 +1,169 @@
#!/usr/bin/env python3
"""
PaddleOCR Client
Connects to a remote PaddleOCR server for OCR inference.
"""

import requests
import base64
import numpy as np
from typing import List, Dict, Tuple
from PIL import Image
from io import BytesIO

class PaddleOCRClient:
    """Client for remote PaddleOCR server."""

    def __init__(self, server_url: str = "http://192.168.30.36:5555"):
        """
        Initialize PaddleOCR client.

        Args:
            server_url: URL of the PaddleOCR server
        """
        self.server_url = server_url.rstrip('/')
        self.timeout = 30  # seconds

    def health_check(self) -> bool:
        """
        Check if the server is healthy.

        Returns:
            True if the server is healthy, False otherwise
        """
        try:
            response = requests.get(
                f"{self.server_url}/health",
                timeout=5
            )
            return response.status_code == 200 and response.json().get('status') == 'ok'
        except Exception as e:
            print(f"Health check failed: {e}")
            return False

    def ocr(self, image: np.ndarray) -> List[Dict]:
        """
        Perform OCR on an image.

        Args:
            image: numpy array of the image (RGB format)

        Returns:
            List of detection results, each containing:
            - box: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
            - text: detected text string
            - confidence: confidence score (0-1)

        Raises:
            Exception if OCR fails
        """
        # Convert numpy array to PIL Image
        if len(image.shape) == 2:  # Grayscale
            pil_image = Image.fromarray(image)
        else:  # RGB or RGBA
            pil_image = Image.fromarray(image.astype(np.uint8))

        # Encode to base64
        buffered = BytesIO()
        pil_image.save(buffered, format="PNG")
        image_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')

        # Send request
        try:
            response = requests.post(
                f"{self.server_url}/ocr",
                json={"image": image_base64},
                timeout=self.timeout
            )
            response.raise_for_status()

            result = response.json()

            if not result.get('success'):
                error_msg = result.get('error', 'Unknown error')
                raise Exception(f"OCR failed: {error_msg}")

            return result.get('results', [])

        except requests.exceptions.Timeout:
            raise Exception(f"OCR request timed out after {self.timeout} seconds")
        except requests.exceptions.ConnectionError:
            raise Exception(f"Could not connect to server at {self.server_url}")
        except Exception as e:
            raise Exception(f"OCR request failed: {str(e)}")

    def get_text_boxes(self, image: np.ndarray) -> List[Tuple[int, int, int, int]]:
        """
        Get bounding boxes of all detected text.

        Args:
            image: numpy array of the image

        Returns:
            List of bounding boxes as (x, y, w, h) tuples
        """
        results = self.ocr(image)
        boxes = []

        for result in results:
            box = result['box']  # [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]

            # Convert polygon to axis-aligned bounding box
            xs = [point[0] for point in box]
            ys = [point[1] for point in box]

            x = int(min(xs))
            y = int(min(ys))
            w = int(max(xs) - min(xs))
            h = int(max(ys) - min(ys))

            boxes.append((x, y, w, h))

        return boxes

    def __repr__(self):
        return f"PaddleOCRClient(server_url='{self.server_url}')"


# Convenience function
def create_ocr_client(server_url: str = "http://192.168.30.36:5555") -> PaddleOCRClient:
    """
    Create and test a PaddleOCR client.

    Args:
        server_url: URL of the PaddleOCR server

    Returns:
        PaddleOCRClient instance

    Raises:
        Exception if the server is not reachable
    """
    client = PaddleOCRClient(server_url)

    if not client.health_check():
        raise Exception(
            f"PaddleOCR server at {server_url} is not responding. "
            "Make sure the server is running on the Linux machine."
        )

    return client


if __name__ == "__main__":
    # Test the client
    print("Testing PaddleOCR client...")

    try:
        client = create_ocr_client()
        print(f"✅ Connected to server: {client.server_url}")

        # Create a blank white test image
        test_image = np.ones((100, 100, 3), dtype=np.uint8) * 255

        print("Running test OCR...")
        results = client.ocr(test_image)
        print(f"✅ OCR test successful! Found {len(results)} text regions")

    except Exception as e:
        print(f"❌ Error: {e}")
216
test_mask_and_detect.py
Normal file
@@ -0,0 +1,216 @@
#!/usr/bin/env python3
"""
Test PaddleOCR Masking + Region Detection Pipeline

This script demonstrates:
1. PaddleOCR detects printed text bounding boxes
2. Mask out all printed text areas (fill with black)
3. Detect remaining non-white regions (potential handwriting)
4. Visualize the results
"""

import fitz  # PyMuPDF
import numpy as np
import cv2
from pathlib import Path
from paddleocr_client import create_ocr_client

# Configuration
TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
OUTPUT_DIR = "/Volumes/NV2/PDF-Processing/signature-image-output/mask_test"
DPI = 300

# Region detection parameters
MIN_REGION_AREA = 3000     # Minimum pixels for a region
MAX_REGION_AREA = 300000   # Maximum pixels for a region
MIN_ASPECT_RATIO = 0.3     # Minimum width/height ratio
MAX_ASPECT_RATIO = 15.0    # Maximum width/height ratio

print("="*80)
print("PaddleOCR Masking + Region Detection Test")
print("="*80)

# Create output directory
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Step 1: Connect to PaddleOCR server
print("\n1. Connecting to PaddleOCR server...")
try:
    ocr_client = create_ocr_client()
    print(f" ✅ Connected: {ocr_client.server_url}")
except Exception as e:
    print(f" ❌ Error: {e}")
    exit(1)

# Step 2: Render PDF to image
print("\n2. Rendering PDF to image...")
try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    original_image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)

    if pix.n == 4:  # RGBA
        original_image = cv2.cvtColor(original_image, cv2.COLOR_RGBA2RGB)

    print(f" ✅ Rendered: {original_image.shape[1]}x{original_image.shape[0]} pixels")
    doc.close()
except Exception as e:
    print(f" ❌ Error: {e}")
    exit(1)

# Step 3: Detect printed text with PaddleOCR
print("\n3. Detecting printed text with PaddleOCR...")
try:
    text_boxes = ocr_client.get_text_boxes(original_image)
    print(f" ✅ Detected {len(text_boxes)} text regions")

    # Show some sample boxes
    if text_boxes:
        print(" Sample text boxes (x, y, w, h):")
        for i, box in enumerate(text_boxes[:3]):
            print(f" {i+1}. {box}")
except Exception as e:
    print(f" ❌ Error: {e}")
    exit(1)

# Step 4: Mask out printed text areas
print("\n4. Masking printed text areas...")
try:
    masked_image = original_image.copy()

    # Fill each text box with black
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(masked_image, (x, y), (x + w, y + h), (0, 0, 0), -1)

    print(f" ✅ Masked {len(text_boxes)} text regions")

    # Save masked image
    masked_path = Path(OUTPUT_DIR) / "01_masked_image.png"
    cv2.imwrite(str(masked_path), cv2.cvtColor(masked_image, cv2.COLOR_RGB2BGR))
    print(f" 📁 Saved: {masked_path}")

except Exception as e:
    print(f" ❌ Error: {e}")
    exit(1)

# Step 5: Detect remaining non-white regions
print("\n5. Detecting remaining non-white regions...")
try:
    # Convert to grayscale
    gray = cv2.cvtColor(masked_image, cv2.COLOR_RGB2GRAY)

    # Threshold to find non-white areas
    # Anything darker than 250 is considered "content"
    _, binary = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY_INV)

    # Apply morphological operations to connect nearby regions
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)

    # Find contours
    contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    print(f" ✅ Found {len(contours)} contours")

    # Filter contours by size and aspect ratio
    potential_regions = []

    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        area = w * h
        aspect_ratio = w / h if h > 0 else 0

        # Check constraints
        if (MIN_REGION_AREA <= area <= MAX_REGION_AREA and
                MIN_ASPECT_RATIO <= aspect_ratio <= MAX_ASPECT_RATIO):
            potential_regions.append({
                'box': (x, y, w, h),
                'area': area,
                'aspect_ratio': aspect_ratio
            })

    print(f" ✅ Filtered to {len(potential_regions)} potential handwriting regions")

    # Show region details
    if potential_regions:
        print("\n Detected regions:")
        for i, region in enumerate(potential_regions[:5]):
            x, y, w, h = region['box']
            print(f" {i+1}. Box: ({x}, {y}, {w}, {h}), "
                  f"Area: {region['area']}, "
                  f"Aspect: {region['aspect_ratio']:.2f}")

except Exception as e:
    print(f" ❌ Error: {e}")
    import traceback
    traceback.print_exc()
    exit(1)

# Step 6: Visualize results
print("\n6. Creating visualizations...")
try:
    # Visualization 1: Original with text boxes
    vis_original = original_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(vis_original, (x, y), (x + w, y + h), (0, 255, 0), 3)

    vis_original_path = Path(OUTPUT_DIR) / "02_original_with_text_boxes.png"
    cv2.imwrite(str(vis_original_path), cv2.cvtColor(vis_original, cv2.COLOR_RGB2BGR))
    print(f" 📁 Original + text boxes: {vis_original_path}")

    # Visualization 2: Masked image with detected regions
    vis_masked = masked_image.copy()
    for region in potential_regions:
        x, y, w, h = region['box']
        cv2.rectangle(vis_masked, (x, y), (x + w, y + h), (255, 0, 0), 3)

    vis_masked_path = Path(OUTPUT_DIR) / "03_masked_with_regions.png"
    cv2.imwrite(str(vis_masked_path), cv2.cvtColor(vis_masked, cv2.COLOR_RGB2BGR))
    print(f" 📁 Masked + regions: {vis_masked_path}")

    # Visualization 3: Binary threshold result
    binary_path = Path(OUTPUT_DIR) / "04_binary_threshold.png"
    cv2.imwrite(str(binary_path), binary)
    print(f" 📁 Binary threshold: {binary_path}")

    # Visualization 4: Morphed result
    morphed_path = Path(OUTPUT_DIR) / "05_morphed.png"
    cv2.imwrite(str(morphed_path), morphed)
    print(f" 📁 Morphed: {morphed_path}")

    # Extract and save each detected region
    print("\n7. Extracting detected regions...")
    for i, region in enumerate(potential_regions):
        x, y, w, h = region['box']

        # Add padding
        padding = 10
        x_pad = max(0, x - padding)
        y_pad = max(0, y - padding)
        w_pad = min(original_image.shape[1] - x_pad, w + 2*padding)
        h_pad = min(original_image.shape[0] - y_pad, h + 2*padding)

        # Extract region from original image
        region_img = original_image[y_pad:y_pad+h_pad, x_pad:x_pad+w_pad]

        # Save region
        region_path = Path(OUTPUT_DIR) / f"region_{i+1:02d}.png"
        cv2.imwrite(str(region_path), cv2.cvtColor(region_img, cv2.COLOR_RGB2BGR))
        print(f" 📁 Region {i+1}: {region_path}")

except Exception as e:
    print(f" ❌ Error: {e}")
    import traceback
    traceback.print_exc()

print("\n" + "="*80)
print("Test completed!")
print(f"Results saved to: {OUTPUT_DIR}")
print("="*80)
print("\nSummary:")
print(f" - Printed text regions detected: {len(text_boxes)}")
print(f" - Potential handwriting regions: {len(potential_regions)}")
print(" - Expected signatures: 2 (楊智惠, 張志銘)")
print("="*80)
102
test_paddleocr.py
Normal file
@@ -0,0 +1,102 @@
#!/usr/bin/env python3
"""Test PaddleOCR on a sample PDF page."""

import fitz  # PyMuPDF
from paddleocr import PaddleOCR
import numpy as np
from PIL import Image
import cv2
from pathlib import Path

# Configuration
TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
DPI = 300

print("="*80)
print("Testing PaddleOCR on macOS Apple Silicon")
print("="*80)

# Step 1: Render PDF to image
print("\n1. Rendering PDF to image...")
try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)

    if pix.n == 4:  # RGBA
        image = cv2.cvtColor(image, cv2.COLOR_RGBA2RGB)

    print(f" ✅ Rendered: {image.shape[1]}x{image.shape[0]} pixels")
    doc.close()
except Exception as e:
    print(f" ❌ Error: {e}")
    exit(1)

# Step 2: Initialize PaddleOCR
print("\n2. Initializing PaddleOCR...")
print(" (First run will download models, may take a few minutes...)")
try:
    # Use the initialization syntax from the official docs
    ocr = PaddleOCR(
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_textline_orientation=False,
        lang='ch'  # Chinese language
    )
    print(" ✅ PaddleOCR initialized successfully")
except Exception as e:
    print(f" ❌ Error: {e}")
    import traceback
    traceback.print_exc()
    print("\n Note: PaddleOCR requires the PaddlePaddle backend.")
    print(" If this is a module import error, PaddlePaddle may not support this platform.")
    exit(1)

# Step 3: Run OCR
print("\n3. Running OCR to detect printed text...")
try:
    result = ocr.ocr(image, cls=False)

    if result and result[0]:
        print(f" ✅ Detected {len(result[0])} text regions")

        # Show first few detections
        print("\n Sample detections:")
        for i, item in enumerate(result[0][:5]):
            box = item[0]            # Bounding box coordinates
            text = item[1][0]        # Detected text
            confidence = item[1][1]  # Confidence score
            print(f" {i+1}. Text: '{text}' (confidence: {confidence:.2f})")
            print(f" Box: {box}")
    else:
        print(" ⚠️ No text detected")

except Exception as e:
    print(f" ❌ Error during OCR: {e}")
    import traceback
    traceback.print_exc()
    exit(1)

# Step 4: Visualize detection
print("\n4. Creating visualization...")
try:
    vis_image = image.copy()

    if result and result[0]:
        for item in result[0]:
            box = np.array(item[0], dtype=np.int32)
            cv2.polylines(vis_image, [box], True, (0, 255, 0), 2)

    # Save visualization
    output_path = "/Volumes/NV2/PDF-Processing/signature-image-output/paddleocr_test_detection.png"
    cv2.imwrite(output_path, cv2.cvtColor(vis_image, cv2.COLOR_RGB2BGR))
    print(f" ✅ Saved visualization: {output_path}")

except Exception as e:
    print(f" ❌ Error during visualization: {e}")

print("\n" + "="*80)
print("PaddleOCR test completed!")
print("="*80)
81
test_paddleocr_client.py
Normal file
@@ -0,0 +1,81 @@
#!/usr/bin/env python3
"""Test PaddleOCR client with a real PDF page."""

import fitz  # PyMuPDF
import numpy as np
import cv2
from paddleocr_client import create_ocr_client

# Test PDF
TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
DPI = 300

print("="*80)
print("Testing PaddleOCR Client with Real PDF")
print("="*80)

# Step 1: Connect to server
print("\n1. Connecting to PaddleOCR server...")
try:
    client = create_ocr_client()
    print(f" ✅ Connected: {client.server_url}")
except Exception as e:
    print(f" ❌ Connection failed: {e}")
    exit(1)

# Step 2: Render PDF
print("\n2. Rendering PDF to image...")
try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)

    if pix.n == 4:  # RGBA
        image = cv2.cvtColor(image, cv2.COLOR_RGBA2RGB)

    print(f" ✅ Rendered: {image.shape[1]}x{image.shape[0]} pixels")
    doc.close()
except Exception as e:
    print(f" ❌ Error: {e}")
    exit(1)

# Step 3: Run OCR
print("\n3. Running OCR on image...")
try:
    results = client.ocr(image)
    print(" ✅ OCR successful!")
    print(f" Found {len(results)} text regions")

    # Show first few results
    if results:
        print("\n Sample detections:")
        for i, result in enumerate(results[:5]):
            text = result['text']
            confidence = result['confidence']
            print(f" {i+1}. '{text}' (confidence: {confidence:.2f})")

except Exception as e:
    print(f" ❌ OCR failed: {e}")
    import traceback
    traceback.print_exc()
    exit(1)

# Step 4: Get bounding boxes
print("\n4. Getting text bounding boxes...")
try:
    boxes = client.get_text_boxes(image)
    print(f" ✅ Got {len(boxes)} bounding boxes")

    if boxes:
        print(" Sample boxes (x, y, w, h):")
        for i, box in enumerate(boxes[:3]):
            print(f" {i+1}. {box}")

except Exception as e:
    print(f" ❌ Error: {e}")

print("\n" + "="*80)
print("Test completed successfully!")
print("="*80)