Complete PP-OCRv5 research and v4 vs v5 comparison

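## Research results

### PP-OCRv5 API testing
- Upgraded to PaddleOCR 3.3.2 (PP-OCRv5)
- Worked out the new API structure and calling conventions
- Verified basic detection

### Key finding
❌ PP-OCRv5 has **no built-in handwriting classification**
- The text_type field is a language type, not a handwritten/printed label
- OpenCV Method 3 is still needed to separate handwritten from printed text

### Full pipeline comparison
- v4 (2.7.3): 14 text regions detected → 4 candidate regions
- v5 (3.3.2): 50 text regions detected → 7 candidate regions
- Main signature region: nearly identical in both versions (1150x511 vs 1144x511)

### Performance analysis
Pros:
- v5 handwriting recognition accuracy +13.7% (as claimed in the documentation)
- May reduce missed detections

Cons:
- Over-detection (small text inside stamps, etc.)
- API completely rewritten and incompatible
- Still cannot replace OpenCV Method 3

### Files
- PP_OCRV5_RESEARCH_FINDINGS.md: full research report
- signature-comparison/: v4 vs v5 comparison results
- test_results/: v5 test output
- test_*_pipeline.py: full test scripts

### Recommendation
The current setup (v2.7.3 + OpenCV Method 3) is stable enough. Hold off on upgrading to v5 unless missed detections become frequent.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>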
Complete OpenCV Method 3 implementation with 86.5% handwriting retention
2025-11-27 11:21:55 +08:00 · 2025-11-27 10:35:46 +08:00 · 2025-10-28 22:28:18 +08:00
22 changed files with 6562 additions and 0 deletions
--- a/CURRENT_STATUS.md
+++ b/CURRENT_STATUS.md
@@ -0,0 +1,252 @@
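 # Current Project Status
 **Last updated**: 2025-10-29
 **Branch**: `paddleocr-improvements`
 **PaddleOCR version**: 2.7.3 (stable)
 ---
 ## Progress Summary
 ### ✅ Completed
 1. **PaddleOCR server deployment** (192.168.30.36:5555)
    - Version: PaddleOCR 2.7.3
    - GPU: enabled
    - Language: Chinese
    - Status: running stably
 2. **Basic pipeline implementation**
    - ✅ PDF → image rendering (DPI=300)
    - ✅ PaddleOCR text detection (26 regions/page)
    - ✅ Text region masking (padding=25px)
    - ✅ Candidate region detection
    - ✅ Region merging algorithm (12→4 regions)
 3. **OpenCV separation method tests**
    - Method 1: stroke width analysis - ❌ poor results
    - Method 2: basic connected-component analysis - ⚠️ mediocre results
    - Method 3: combined feature analysis - ✅ **best approach** (86.5% handwriting retention)
 4. **Test results**
    - Test file: `201301_1324_AI1_page3.pdf`
    - Expected signatures: 2 (楊智惠, 張志銘)
    - Detection result: 2 signature regions merged successfully
    - Retention: 86.5% of handwritten content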
 ---
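 ## Architecture
 ```
 PDF document
  ↓
 1. Render (PyMuPDF, 300 DPI)
  ↓
 2. PaddleOCR detection (finds printed text)
  ↓
 3. Mask printed text (black fill, padding=25px)
  ↓
 4. Region detection (OpenCV morphology)
  ↓
 5. Region merging (distance thresholds: H≤100px, V≤50px)
  ↓
 6. Feature analysis (size + stroke length + regularity)
  ↓
 7. [TODO] VLM verification
  ↓
 Extracted signatures
 ```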
 ---
 ## Core Files
 | File | Description | Status |
 |------|------|------|
 | `paddleocr_client.py` | PaddleOCR REST client | ✅ Stable |
 | `test_mask_and_detect.py` | Basic masking + detection test | ✅ Done |
 | `test_opencv_separation.py` | OpenCV Method 1+2 tests | ✅ Done |
 | `test_opencv_advanced.py` | OpenCV Method 3 (best) | ✅ Done |
 | `extract_signatures_paddleocr_improved.py` | Full pipeline (Method B+E) | ⚠️ Method E is broken |
 | `PADDLEOCR_STATUS.md` | Detailed technical notes | ✅ Done |
 ---
 ## Method 3: Combined Feature Analysis (current best approach)
 ### Classification criteria
 **Your observations** (all accurate):
 1. ✅ **Handwriting is larger than print** - height > 50px
 2. ✅ **Handwritten strokes are longer** - stroke_ratio > 0.4
 3. ✅ **Print is regular, handwriting is messy** - compactness, solidity
 ### Scoring system
 ```python
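 handwriting_score = 0
 # Size
 if height > 50: handwriting_score += 3
 elif height > 35: handwriting_score += 2
 # Stroke length
 if stroke_ratio > 0.5: handwriting_score += 2
 elif stroke_ratio > 0.35: handwriting_score += 1
 # Regularity
 if is_irregular: handwriting_score += 1  # irregular = handwritten
 else: handwriting_score -= 1              # regular = printed
 # Area
 if area > 2000: handwriting_score += 2
 elif area < 500: handwriting_score -= 1
 # Classification: handwriting_score > 0 → handwritten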
 ```
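The scoring rules above can be wrapped in a small function so they are easy to try on individual components. This is a minimal sketch, not the code in `test_opencv_advanced.py`: the `Component` dataclass and the function names are hypothetical, and it assumes the four features (`height`, `stroke_ratio`, `is_irregular`, `area`) have already been measured for each connected component.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical container for one connected component's features."""
    height: float        # bounding-box height in pixels
    stroke_ratio: float  # stroke-length feature named in the notes
    is_irregular: bool   # True when the shape looks irregular
    area: float          # pixel area of the component


def handwriting_score(c: Component) -> int:
    """Apply the thresholds from the scoring rules above."""
    score = 0
    # Size
    if c.height > 50:
        score += 3
    elif c.height > 35:
        score += 2
    # Stroke length
    if c.stroke_ratio > 0.5:
        score += 2
    elif c.stroke_ratio > 0.35:
        score += 1
    # Regularity: irregular shapes lean handwritten, regular ones printed
    score += 1 if c.is_irregular else -1
    # Area
    if c.area > 2000:
        score += 2
    elif c.area < 500:
        score -= 1
    return score


def is_handwritten(c: Component) -> bool:
    """A component is treated as handwriting when its score is positive."""
    return handwriting_score(c) > 0


# Illustrative values only, not measurements from the test PDF.
large_stroke = Component(height=80, stroke_ratio=0.6, is_irregular=True, area=3000)
small_glyph = Component(height=20, stroke_ratio=0.2, is_irregular=False, area=300)
print(handwriting_score(large_stroke), is_handwritten(large_stroke))
print(handwriting_score(small_glyph), is_handwritten(small_glyph))
```

Keeping the thresholds in one function makes it straightforward to tune them later, which the notes list as open work.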
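 ### Results
 - Handwritten pixels retained: **86.5%** ✅
 - Printed pixels filtered out: 13.5%
 - Top 10 components all classified correctly
 ---
 ## Known Issues
 ### 1. Method E (two-stage OCR) does not work ❌
 **Cause**: PaddleOCR cannot tell printed text from handwriting, so the second OCR pass also detects the handwriting and masks it out
 **Resolution**:
 - ❌ Do not use Method E
 - ✅ Use Method B (region merging) + OpenCV Method 3
 ### 2. Printed names overlap handwritten signatures
 **Symptom**: a region contains "楊 智 惠" (printed) plus the handwritten signature
 **Strategy**: accept a small amount of printed residue and prioritize keeping the handwriting intact
 **Follow-up**: final verification with the VLM
 ### 3. Masking padding trade-off
 **Small padding (5-10px)**: more printed residue, but handwriting is untouched
 **Large padding (25px)**: printed text is removed cleanly, but handwriting edges may be masked
 **Current**: 25px, relying on OpenCV Method 3 to filter the residue
 ---
 ## Next Steps
 ### Short term (continue with the current approach)
 - [ ] Combine Method B + OpenCV Method 3 into a full pipeline
 - [ ] Add the VLM verification step
 - [ ] Test on 10 samples
 - [ ] Tune parameters (height threshold, merge distance, etc.)
 ### Medium term (PP-OCRv5 research)
 **New branch**: `pp-ocrv5-research`
 - [ ] Study the new PaddleOCR 3.3.0 API
 - [ ] Test PP-OCRv5 handwriting detection
 - [ ] Compare performance: v4 vs v5
 - [ ] Decide whether to upgrade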
 ---
 ## Server Configuration
 ### PaddleOCR server (Linux)
 ```
 Host: 192.168.30.36:5555
 SSH: ssh gblinux
 Path: ~/Project/paddleocr-server/
 Version: PaddleOCR 2.7.3, numpy 1.26.4, opencv-contrib 4.6.0.66
 Start: cd ~/Project/paddleocr-server && source venv/bin/activate && python paddleocr_server.py
 Log: ~/Project/paddleocr-server/server_stable.log
 ```
 ### VLM server (Ollama)
 ```
 Host: 192.168.30.36:11434
 Model: qwen2.5vl:32b
 Status: not used in the current pipeline
 ```
 ---
 ## Test Data
 ### Sample file
 ```
 /Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf
 - Page: 3
 - Expected signatures: 2 (楊智惠, 張志銘)
 - Size: 2481x3510 pixels
 ```
 ### Output directories
 ```
 /Volumes/NV2/PDF-Processing/signature-image-output/
 ├── mask_test/              # Basic masking test results
 ├── paddleocr_improved/     # Method B+E test (E failed)
 ├── opencv_separation_test/ # Method 1+2 tests
 └── opencv_advanced_test/   # Method 3 test (best)
 ```
 ---
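 ## Performance Comparison
 | Method | Handwriting retained | Print removed | Verdict |
 |------|---------|---------|------|
 | Basic masking | 100% | Low | ⚠️ Too much printed residue |
 | Method 1 (stroke width) | 0% | - | ❌ Complete failure |
 | Method 2 (connected components) | 1% | Medium | ❌ Loses too much handwriting |
 | Method 3 (combined features) | **86.5%** | High | ✅ **Best** |
 ---
 ## Git Status
 ```
 Current branch: paddleocr-improvements
 Based on: PaddleOCR-Cover
 Tag: paddleocr-v1-basic (basic masking version)
 Pending commit:
 - OpenCV advanced separation (Method 3)
 - Full test scripts and results
 - Documentation updates
 ```
 ---
 ## Known Limitations
 1. **Parameters need tuning**: the height threshold, merge distance, and similar values may need adjusting for different documents
 2. **Depends on document quality**: blurry or skewed documents may give worse results
 3. **Performance**: the OpenCV steps are fast, but the full pipeline still needs optimization
 4. **Generalization**: tested on only 1 sample; more samples are needed for validation
 ---
 ## Contact and Collaboration
 **Lead developer**: Claude Code
 **Workflow**: session-based development
 **Repository**: local Git repository
 **Test environment**: macOS (local) + Linux (server)
 ---
 **Status**: ✅ Current approach is stable; development can continue
 **Recommendation**: test Method 3 on more samples before considering the PP-OCRv5 upgrade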
--- a/NEW_SESSION_HANDOFF.md
+++ b/NEW_SESSION_HANDOFF.md
@@ -0,0 +1,432 @@
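 # New Session Handoff - PP-OCRv5 Research
 **Date**: 2025-10-29
 **Previous session**: development on the PaddleOCR-Cover branch
 **Current branch**: `paddleocr-improvements` (stable)
 **New branch**: `pp-ocrv5-research` (to be created)
 ---
 ## 🎯 Goal
 Research and implement handwritten signature detection with **PP-OCRv5**
 ---
 ## 📋 Background
 ### Current state
 ✅ **Stable approach in place** (`paddleocr-improvements` branch):
 - PaddleOCR 2.7.3 + OpenCV Method 3
 - 86.5% handwriting retention
 - Region merging works well
 - Tested: both signatures detected in 1 PDF
 ⚠️ **The PP-OCRv5 upgrade ran into problems**:
 - The PaddleOCR 3.3.0 API changed completely
 - The old server code is incompatible
 - The new API needs deeper study
 ### Why research PP-OCRv5?
 **Per the documentation**: https://www.paddleocr.ai/main/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5.html
 PP-OCRv5 improvements:
 - Handwritten Chinese detection: **0.706 → 0.803** (+13.7%)
 - Handwritten English detection: **0.249 → 0.841** (+237%)
 - May support outputting handwriting region coordinates directly
 **Potential benefits**:
 1. Better handwriting recognition
 2. Possibly built-in handwritten/printed classification
 3. More accurate coordinates
 4. Less complex post-processing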
 ---
 ## 🔧 Tech Stack
 ### Server environment
 ```
 Host: 192.168.30.36 (Linux GPU server)
 SSH: ssh gblinux
 Directory: ~/Project/paddleocr-server/
 ```
 **Current stable versions**:
 - PaddleOCR: 2.7.3
 - numpy: 1.26.4
 - opencv-contrib-python: 4.6.0.66
 - Server file: `paddleocr_server.py`
 **Installed but unused**:
 - PaddleOCR 3.3.0 (PP-OCRv5)
 - Temporary server: `paddleocr_server_v5.py` (unfinished)
 ### Local environment
 ```
 macOS
 Python: 3.14
 Virtual environment: venv/
 Client: paddleocr_client.py
 ```
 ---
 ## 📝 Core Problems
 ### 1. API changes
 **Old API (2.7.3)**:
 ```python
 from paddleocr import PaddleOCR
 ocr = PaddleOCR(lang='ch')
 result = ocr.ocr(image_np, cls=False)
 # Return format:
 # [[[box], (text, confidence)], ...]
 ```
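As an illustration of the old return format, the sketch below flattens a 2.7.x-style result into plain dictionaries with `(x, y, w, h)` boxes. It assumes the result is a list with one entry per page, each entry being a list of `[box, (text, confidence)]` items where `box` is four `[x, y]` corner points, and that a page with no detections may be `None`. The function name `parse_v2_result` and the sample data are hypothetical.

```python
def parse_v2_result(result):
    """Flatten a 2.7.x-style OCR result into a list of region dicts.

    Assumed input shape: [page, ...] where page is a list of
    [box, (text, confidence)] and box is four [x, y] corner points.
    """
    regions = []
    for page in result or []:
        for box, (text, confidence) in page or []:
            xs = [point[0] for point in box]
            ys = [point[1] for point in box]
            regions.append({
                "text": text,
                "confidence": float(confidence),
                # Axis-aligned bounding box as (x, y, w, h)
                "box": (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)),
            })
    return regions


# Illustrative sample, not real server output.
sample = [[[[[10, 20], [110, 20], [110, 50], [10, 50]], ("審計報告", 0.98)]]]
print(parse_v2_result(sample))
```

Normalizing to `(x, y, w, h)` keeps the masking and region-merging steps independent of which OCR version produced the boxes.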
 **New API (3.3.0)** - ⚠️ not fully understood yet:
 ```python
 # Option 1: legacy call (deprecated)
 result = ocr.ocr(image_np)  # Warning: Please use predict instead
 # Option 2: new call
 from paddlex import create_model
 model = create_model("???")  # model name unknown
 result = model.predict(image_np)
 # Return format: ???
 ```
 ### 2. Errors encountered
 **Error 1**: the `cls` argument is no longer supported
 ```python
 # Error: PaddleOCR.predict() got an unexpected keyword argument 'cls'
 result = ocr.ocr(image_np, cls=False)  # ❌
 ```
 **Error 2**: the return format changed
 ```python
 # The old parsing code fails:
 text = item[1][0]       # ❌ IndexError
 confidence = item[1][1]  # ❌ IndexError
 ```
 **Error 3**: wrong model name
 ```python
 model = create_model("PP-OCRv5_server")  # ❌ Model not supported
 ```
 ---
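 ## 🎯 Research Task List
 ### Phase 1: API research (high priority)
 - [ ] **Read the official docs**
  - Full PP-OCRv5 documentation
  - PaddleX API documentation
  - Migration guide (if any)
 - [ ] **Understand the new API**
  ```python
  # To figure out:
  1. The correct imports
  2. How to initialize the model
  3. predict() arguments and return format
  4. How to tell handwritten from printed text
  5. Whether there is a dedicated handwriting detection feature
  ```
 - [ ] **Write test scripts**
  - `test_pp_ocrv5_api.py` - exercise the basic API calls
  - Print the full result data structure
  - Compare the v4 and v5 return values
 ### Phase 2: Server adaptation
 - [ ] **Rewrite the server code**
  - Adapt to the new API
  - Parse the returned data correctly
  - Keep the REST interface compatible
 - [ ] **Test stability**
  - Test 10 PDF samples
  - Check GPU utilization
  - Compare against v4 performance
 ### Phase 3: Handwriting detection
 - [ ] **Look for handwriting detection capabilities**
  ```python
  # Possible avenues:
  1. Does result contain a text_type field?
  2. Is there a dedicated handwriting_detection model?
  3. Are there confidence differences that could be exploited?
  4. Layout analysis from PP-Structure?
  ```
 - [ ] **Comparison tests**
  - v4 (current approach) vs v5
  - Precision, recall, speed
  - Handwriting detection
 ### Phase 4: Integration decision
 - [ ] **Performance evaluation**
  - If v5 is better → upgrade
  - If the improvement is marginal → stay on v4
 - [ ] **Documentation updates**
  - Document how to use v5
  - Update PADDLEOCR_STATUS.md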
 ---
 ## 🔍 Debugging Tips
 ### 1. Inspect the full return value
 ```python
 import pprint
 result = model.predict(image)
 pprint.pprint(result)  # print every field
 # or
 import json
 print(json.dumps(result, indent=2, ensure_ascii=False))
 ```
 ### 2. Find official examples
 ```bash
 # Look for examples in the PaddleOCR install on the server
 find ~/Project/paddleocr-server/venv/lib/python3.12/site-packages/paddleocr -name "*.py" | grep example
 # Read the source
 less ~/Project/paddleocr-server/venv/lib/python3.12/site-packages/paddleocr/paddleocr.py
 ```
 ### 3. List available models
 ```python
 from paddlex.inference.models import OFFICIAL_MODELS
 print(OFFICIAL_MODELS)  # list all supported model names
 ```
 ### 4. Web documentation
 Key places to look:
 - https://github.com/PaddlePaddle/PaddleOCR
 - https://www.paddleocr.ai
 - https://github.com/PaddlePaddle/PaddleX
 ---
 ## 📂 File Structure
 ```
 /Volumes/NV2/pdf_recognize/
 ├── CURRENT_STATUS.md          # Current status document ✅
 ├── NEW_SESSION_HANDOFF.md     # This file ✅
 ├── PADDLEOCR_STATUS.md        # Detailed technical notes ✅
 ├── SESSION_INIT.md            # Initial session information
 │
 ├── paddleocr_client.py        # Stable client (v2.7.3) ✅
 ├── paddleocr_server_v5.py     # v5 server (unfinished) ⚠️
 │
 ├── test_paddleocr_client.py           # Basic test
 ├── test_mask_and_detect.py            # Masking + detection
 ├── test_opencv_separation.py          # Method 1+2
 ├── test_opencv_advanced.py            # Method 3 (best) ✅
 ├── extract_signatures_paddleocr_improved.py  # Full pipeline
 │
 └── check_rejected_for_missing.py      # Diagnostic script
 ```
 **Server side** (`ssh gblinux`):
 ```
 ~/Project/paddleocr-server/
 ├── paddleocr_server.py        # v2.7.3 stable ✅
 ├── paddleocr_server_v5.py     # v5 version (to be finished) ⚠️
 ├── paddleocr_server_backup.py # Backup
 ├── server_stable.log          # Log of the running server
 └── venv/                      # Virtual environment
 ```
 ---
 ## ⚡ Quick Start
 ### Start the stable server (v2.7.3)
 ```bash
 ssh gblinux
 cd ~/Project/paddleocr-server
 source venv/bin/activate
 python paddleocr_server.py
 ```
 ### Test the connection
 ```bash
 # On the local Mac
 cd /Volumes/NV2/pdf_recognize
 source venv/bin/activate
 python test_paddleocr_client.py
 ```
 ### Create the new research branch
 ```bash
 cd /Volumes/NV2/pdf_recognize
 git checkout -b pp-ocrv5-research
 ```
 ---
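 ## 🚨 Cautions
 ### 1. Do not break the stable version
 - Keep the `paddleocr-improvements` branch stable
 - Run all v5 experiments on the new `pp-ocrv5-research` branch
 - Keep `paddleocr_server.py` (v2.7.3) on the server
 - Name new code `paddleocr_server_v5.py`
 ### 2. Environment isolation
 - The server virtual environment may need to be rebuilt
 - Or isolate v4 and v5 with Docker
 - Avoid version conflicts
 ### 3. Performance testing
 - Record concrete metrics for v4 and v5
 - Test at least 10 samples
 - Include speed, precision, and recall
 ### 4. Documentation-driven
 - Record every finding in the docs
 - Document API usage clearly
 - Make future maintenance easier
 ---
 ## 📊 Success Criteria
 ### Minimum goals
 - [ ] Run basic PP-OCRv5 OCR successfully
 - [ ] Understand how to call the new API
 - [ ] Server runs stably
 - [ ] Complete documentation
 ### Stretch goals
 - [ ] Find a handwriting detection feature
 - [ ] Outperform the v4 approach
 - [ ] Simplify the pipeline
 - [ ] Raise accuracy above 90%
 ### Decision points
 **If v5 is clearly better** → upgrade to v5 and retire v4
 **If the v5 improvement is marginal** → stay on v4 and keep v5 as a research record only
 **If v5 has bugs** → wait for upstream fixes and keep using v4
 ---
 ## 📞 Troubleshooting
 ### When something goes wrong
 1. **Check the logs first**: `tail -f ~/Project/paddleocr-server/server_stable.log`
 2. **Read the source**: find the PaddleOCR code inside the venv
 3. **Search the issues**: https://github.com/PaddlePaddle/PaddleOCR/issues
 4. **Downgrade test**: confirm that v2.7.3 still works
 ### FAQ
 **Q: Server fails to start?**
 A: Check the numpy version (must be < 2.0)
 **Q: Model not found?**
 A: Model names may have changed; check OFFICIAL_MODELS
 **Q: API call fails?**
 A: Compare against the official docs; argument formats may have changed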
 ---
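 ## 🎓 Learning Resources
 ### Official documentation
 1. **PP-OCRv5**: https://www.paddleocr.ai/main/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5.html
 2. **PaddleOCR GitHub**: https://github.com/PaddlePaddle/PaddleOCR
 3. **PaddleX**: https://github.com/PaddlePaddle/PaddleX
 ### Related technologies
 - PaddlePaddle deep learning framework
 - PP-Structure document structure analysis
 - Handwriting recognition
 - Layout analysis
 ---
 ## 💡 Hints
 ### If built-in handwriting detection turns up
 Possible usage:
 ```python
 # Guess 1: the result includes a type
 for item in result:
    text_type = item.get('type')  # 'printed' or 'handwritten'?
 # Guess 2: a dedicated layout model
 from paddlex import create_model
 layout_model = create_model("PP-Structure")
 layout_result = layout_model.predict(image)
 # Might return: text, handwriting, figure, table...
 # Guess 3: confidence differences
 # Handwritten text may have lower confidence
 ```
 ### If there is no built-in handwriting detection
 Then the current OpenCV Method 3 remains the best approach, and v5 only contributes better OCR accuracy.
 ---
 ## ✅ Completion Checklist
 When the research is done, make sure that:
 - [ ] The new API is fully understood and documented
 - [ ] The server code is rewritten and passes testing
 - [ ] Performance comparison data is recorded
 - [ ] A decision document exists (upgrade vs stay on v4)
 - [ ] Code is committed to the `pp-ocrv5-research` branch
 - [ ] `CURRENT_STATUS.md` is updated
 - [ ] If upgrading: merge into the main branch
 ---
 **Good luck with the research!** 🚀
 If questions come up, consult:
 - `CURRENT_STATUS.md` - details of the current approach
 - `PADDLEOCR_STATUS.md` - technical details and problem analysis
 **Most important**: record every finding. Successes and failures are both valuable.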
--- a/PADDLEOCR_STATUS.md
+++ b/PADDLEOCR_STATUS.md
@@ -0,0 +1,475 @@
 # PaddleOCR Signature Extraction - Status & Options
 **Date**: October 28, 2025
 **Branch**: `PaddleOCR-Cover`
 **Current Stage**: Masking + Region Detection Working, Refinement Needed
 ---
 ## Current Approach Overview
 **Strategy**: PaddleOCR masks printed text → Detect remaining regions → VLM verification
 ### Pipeline Steps
 ```
 1. PaddleOCR (Linux server 192.168.30.36:5555)
   └─> Detect printed text bounding boxes
 2. OpenCV Masking (Local)
   └─> Black out all printed text areas
 3. Region Detection (Local)
   └─> Find non-white areas (potential handwriting)
 4. VLM Verification (TODO)
   └─> Confirm which regions are handwritten signatures
 ```
 ---
 ## Test Results (File: 201301_1324_AI1_page3.pdf)
 ### Performance
 | Metric | Value |
 |--------|-------|
 | Printed text regions masked | 26 |
 | Candidate regions detected | 12 |
 | Actual signatures found | 2 ✅ |
 | False positives (printed text) | 9 |
 | Split signatures | 1 (Region 5 might be part of Region 4) |
 ### Success
 ✅ **PaddleOCR detected most printed text** (26 regions)
 ✅ **Masking works correctly** (black rectangles)
 ✅ **Region detection found both signatures** (regions 2, 4)
 ✅ **No false negatives** (didn't miss any signatures)
 ### Issues Identified
 ❌ **Problem 1: Handwriting Split Into Multiple Regions**
 - Some signatures may be split into 2+ separate regions
 - Example: Region 4 and Region 5 might be parts of same signature area
 - Caused by gaps between handwritten strokes after masking
 ❌ **Problem 2: Printed Name + Handwritten Signature Mixed**
 - Region 2: Contains "張 志 銘" (printed) + handwritten signature
 - Region 4: Contains "楊 智 惠" (printed) + handwritten signature
 - PaddleOCR missed these printed names, so they weren't masked
 - Final output includes both printed and handwritten parts
 ❌ **Problem 3: Printed Text Not Masked by PaddleOCR**
 - 9 regions contain printed text that PaddleOCR didn't detect
 - These became false positive candidates
 - Examples: dates, company names, paragraph text
 - Shows PaddleOCR's detection isn't 100% complete
 ---
 ## Proposed Solutions
 ### Problem 1: Split Signatures
 #### Option A: More Aggressive Morphology ⭐ EASY
 **Approach**: Increase kernel size and iterations to connect nearby strokes
 ```python
 # Current settings:
 kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
 morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
 # Proposed settings:
 kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))  # 3x larger
 morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=5)  # More iterations
 ```
 **Pros**:
 - Simple parameter change (kernel size and iteration count)
 - Connects nearby strokes automatically
 - Fast execution
 **Cons**:
 - May merge unrelated regions if too aggressive
 - Need to tune parameters carefully
 - Could lose fine details
 **Recommendation**: ⭐ Try first - easiest to implement and test
 ---
 #### Option B: Region Merging After Detection ⭐⭐ MEDIUM (RECOMMENDED)
 **Approach**: After detecting all regions, merge those that are close together
 ```python
 def merge_nearby_regions(regions, distance_threshold=50):
    """
    Merge regions that are within distance_threshold pixels of each other.
    Args:
        regions: List of region dicts with 'box' (x, y, w, h)
        distance_threshold: Maximum pixels between regions to merge
    Returns:
        List of merged regions
    """
    # Algorithm:
    # 1. Calculate distance between all region pairs
    # 2. If distance < threshold, merge their bounding boxes
    # 3. Repeat until no more merges possible
    merged = []
    # Implementation here...
    return merged
 ```
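The algorithm outlined in the stub above could be filled in along the following lines. This is a sketch under assumptions, not the implementation in the repository: it works on plain `(x, y, w, h)` tuples rather than region dicts, the name `merge_nearby_boxes` is hypothetical, and it uses separate horizontal and vertical gap thresholds (defaults of 100px and 50px, the values quoted in `CURRENT_STATUS.md`) instead of a single `distance_threshold`.

```python
def box_gaps(a, b):
    """Return (horizontal_gap, vertical_gap) between two (x, y, w, h) boxes.

    A gap is 0 when the boxes overlap or touch along that axis.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    h_gap = max(0, max(ax, bx) - min(ax + aw, bx + bw))
    v_gap = max(0, max(ay, by) - min(ay + ah, by + bh))
    return h_gap, v_gap


def box_union(a, b):
    """Smallest (x, y, w, h) box that contains both inputs."""
    x1 = min(a[0], b[0])
    y1 = min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)


def merge_nearby_boxes(boxes, h_threshold=100, v_threshold=50):
    """Repeatedly merge any two boxes whose gaps are within both thresholds."""
    boxes = list(boxes)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                h_gap, v_gap = box_gaps(boxes[i], boxes[j])
                if h_gap <= h_threshold and v_gap <= v_threshold:
                    boxes[i] = box_union(boxes[i], boxes[j])
                    del boxes[j]
                    merged_any = True
                    break
            if merged_any:
                break  # restart the scan because the box list changed
    return boxes


# Illustrative boxes: two fragments 50px apart and one distant region.
fragments = [(0, 0, 100, 40), (150, 10, 80, 40), (1000, 1000, 50, 50)]
print(merge_nearby_boxes(fragments))
```

Restarting the scan after every merge is quadratic in the worst case, but with around a dozen candidate regions per page that cost should be negligible, and it guarantees that chains of fragments collapse into one box.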
 **Pros**:
 - Keeps signatures together intelligently
 - Won't merge distant unrelated regions
 - Preserves original stroke details
 - Can use vertical/horizontal distance separately
 **Cons**:
 - Need to tune distance threshold
 - More complex than Option A
 - May need multiple merge passes
 **Recommendation**: ⭐⭐ **Best balance** - implement this if Option A alone is not enough
 ---
 #### Option C: Don't Split - Extract Larger Context ⭐ EASY
 **Approach**: When extracting regions, add significant padding to capture full context
 ```python
 # Current: padding = 10 pixels
 padding = 50  # Much larger padding
 # Or: Merge all regions in the bottom 20% of page
 # (signatures are usually at the bottom)
 ```
 **Pros**:
 - Guaranteed to capture complete signatures
 - Very simple to implement
 - No risk of losing parts
 **Cons**:
 - May include extra unwanted content
 - Larger image files
 - Makes VLM verification more complex
 **Recommendation**: ⭐ Use as fallback if B doesn't work
 ---
 ### Problem 2: Printed + Handwritten in Same Region
 #### Option A: Expand PaddleOCR Masking Boxes ⭐ EASY
 **Approach**: Add padding when masking text boxes to catch edges
 ```python
 padding = 20  # pixels
 img_h, img_w = image.shape[:2]
 for (x, y, w, h) in text_boxes:
    # Expand the box in all directions, clamping each corner to the image
    x1 = max(0, x - padding)
    y1 = max(0, y - padding)
    x2 = min(img_w - 1, x + w + padding)
    y2 = min(img_h - 1, y + h + padding)
    cv2.rectangle(masked_image, (x1, y1), (x2, y2), (0, 0, 0), -1)
 ```
 **Pros**:
 - Very simple - one parameter change
 - Catches text edges and nearby text
 - Fast execution
 **Cons**:
 - If padding too large, may mask handwriting
 - If padding too small, still misses text
 - Hard to find perfect padding value
 **Recommendation**: ⭐ Quick test - try with padding=20-30
 ---
 #### Option B: Run PaddleOCR Again on Each Region ⭐⭐ MEDIUM
 **Approach**: Second-pass OCR on extracted regions to find remaining printed text
 ```python
 def clean_region(region_image, ocr_client):
    """
    Remove any remaining printed text from a region.
    Args:
        region_image: Extracted candidate region
        ocr_client: PaddleOCR client
    Returns:
        Cleaned image with only handwriting
    """
    # Run OCR on this specific region
    text_boxes = ocr_client.get_text_boxes(region_image)
    # Mask any detected printed text
    cleaned = region_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(cleaned, (x, y), (x+w, y+h), (0, 0, 0), -1)
    return cleaned
 ```
 **Pros**:
 - Very accurate - catches printed text PaddleOCR missed initially
 - Clean separation of printed vs handwritten
 - No manual tuning needed
 **Cons**:
 - 2x slower (OCR call per region)
 - May occasionally mask handwritten text if it looks printed
 - More complex pipeline
 **Recommendation**: ⭐⭐ Good option if masking padding isn't enough
 ---
 #### Option C: Computer Vision Stroke Analysis ⭐⭐⭐ HARD
 **Approach**: Analyze stroke characteristics to distinguish printed vs handwritten
 ```python
 def separate_printed_handwritten(region_image):
    """
    Use CV techniques to separate printed from handwritten.
    Techniques:
    - Stroke width analysis (printed = uniform, handwritten = variable)
    - Edge detection + smoothness (printed = sharp, handwritten = organic)
    - Connected component analysis
    - Hough line detection (printed = straight, handwritten = curved)
    """
    # Complex implementation...
    pass
 ```
 **Pros**:
 - No API calls needed (fast)
 - Can work when OCR fails
 - Learns patterns in data
 **Cons**:
 - Very complex to implement
 - May not be reliable across different documents
 - Requires significant tuning
 - Hard to maintain
 **Recommendation**: ❌ Skip for now - too complex, uncertain results
 ---
 #### Option D: VLM Crop Guidance ⚠️ RISKY
 **Approach**: Ask VLM to provide coordinates of handwriting location
 ```python
 prompt = """
 This image contains both printed and handwritten text.
 Where is the handwritten signature located?
 Provide coordinates as: x_start, y_start, x_end, y_end
 """
 # VLM returns coordinates
 # Crop to that region only
 ```
 **Pros**:
 - VLM understands visual context
 - Can distinguish printed vs handwritten
 **Cons**:
 - **VLM coordinates are unreliable** (32% offset discovered in previous tests!)
 - This was the original problem that led to PaddleOCR approach
 - May extract wrong region
 **Recommendation**: ❌ **DO NOT USE** - VLM coordinates proven unreliable
 ---
 #### Option E: Two-Stage Hybrid Approach ⭐⭐⭐ BEST (RECOMMENDED)
 **Approach**: Combine detection with targeted cleaning
 ```python
 def extract_signatures_twostage(pdf_path):
    """
    Stage 1: Detect candidate regions (current pipeline)
    Stage 2: Clean each region
    """
    # Stage 1: Full page processing
    image = render_pdf(pdf_path)
    text_boxes = ocr_client.get_text_boxes(image)
    masked_image = mask_text_regions(image, text_boxes, padding=20)
    candidate_regions = detect_regions(masked_image)
    # Stage 2: Per-region cleaning
    signatures = []
    for region_box in candidate_regions:
        # Extract region from ORIGINAL image (not masked)
        region_img = extract_region(image, region_box)
        # Option 1: Run OCR again to find remaining printed text
        region_text_boxes = ocr_client.get_text_boxes(region_img)
        cleaned_region = mask_text_regions(region_img, region_text_boxes)
        # Option 2: Ask VLM if it contains handwriting (no coordinates!)
        is_handwriting = vlm_verify(cleaned_region)
        if is_handwriting:
            signatures.append(cleaned_region)
    return signatures
 ```
 **Pros**:
 - Best accuracy - two passes of OCR
 - Combines strengths of both approaches
 - VLM only for yes/no, not coordinates
 - Clean final output with only handwriting
 **Cons**:
 - Slower (one full-page OCR call plus one more per candidate region)
 - More complex code
 - Higher computational cost
 **Recommendation**: ⭐⭐⭐ **BEST OVERALL** - implement this for production
 ---
 ## Implementation Priority
 ### Phase 1: Quick Wins (Test Immediately)
 1. **Expand masking padding** (Problem 2, Option A) - 5 minutes
 2. **More aggressive morphology** (Problem 1, Option A) - 5 minutes
 3. **Test and measure improvement**
 ### Phase 2: Region Merging (If Phase 1 insufficient)
 4. **Implement region merging algorithm** (Problem 1, Option B) - 30 minutes
 5. **Test on multiple PDFs**
 6. **Tune distance threshold**
 ### Phase 3: Two-Stage Approach (Best quality)
 7. **Implement second-pass OCR on regions** (Problem 2, Option E) - 1 hour
 8. **Add VLM verification** (Step 4 of pipeline) - 30 minutes
 9. **Full pipeline testing**
 ---
 ## Code Files Status
 ### Existing Files ✅
 - **`paddleocr_client.py`** - REST API client for PaddleOCR server
 - **`test_paddleocr_client.py`** - Connection and OCR test
 - **`test_mask_and_detect.py`** - Current masking + detection pipeline
 ### To Be Created 📝
 - **`extract_signatures_paddleocr.py`** - Production pipeline with all improvements
 - **`region_merger.py`** - Region merging utilities
 - **`vlm_verifier.py`** - VLM handwriting verification
 ---
 ## Server Configuration
 **PaddleOCR Server**:
 - Host: `192.168.30.36:5555`
 - Running: ✅ Yes (PID: 210417)
 - Version: 3.3.0
 - GPU: Enabled
 - Language: Chinese (lang='ch')
 **VLM Server**:
 - Host: `192.168.30.36:11434` (Ollama)
 - Model: `qwen2.5vl:32b`
 - Status: Not tested yet in this pipeline
 ---
 ## Test Plan
 ### Test File
 - **File**: `201301_1324_AI1_page3.pdf`
 - **Expected signatures**: 2 (楊智惠, 張志銘)
 - **Current recall**: 100% (found both)
 - **Current precision**: 16.7% (2 correct out of 12 regions)
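The recall and precision figures above follow directly from the counts in the test results (2 expected signatures, 2 found, 12 candidate regions). A small worked example, with hypothetical helper names:

```python
def precision(true_positives: int, candidates: int) -> float:
    """Fraction of candidate regions that are real signatures."""
    return true_positives / candidates


def recall(true_positives: int, expected: int) -> float:
    """Fraction of expected signatures that were found."""
    return true_positives / expected


# Counts from the single test page: 2 signatures found among 12 candidates.
p = precision(2, 12)
r = recall(2, 2)
print(f"precision={p:.1%}, recall={r:.1%}")  # precision=16.7%, recall=100.0%
```

With only one test page these numbers are a starting point rather than a benchmark; the same helpers can be reused once more PDFs are evaluated.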
 ### Success Metrics After Improvements
 | Metric | Current | Target |
 |--------|---------|--------|
 | Signatures found | 2/2 (100%) | 2/2 (100%) |
 | False positives | 10 (9 printed-text regions + 1 possible split fragment) | < 2 |
 | Precision | 16.7% | > 80% |
 | Signatures split | Possibly 1 (Regions 4 and 5) | 0 |
 | Printed text in regions | Yes | No |
 ---
 ## Git Branch Strategy
 **Current branch**: `PaddleOCR-Cover`
 **Status**: Masking + Region Detection working, needs refinement
 **Recommended next steps**:
 1. Commit current state with tag: `paddleocr-v1-basic`
 2. Create feature branches:
   - `paddleocr-region-merging` - For Problem 1 solutions
   - `paddleocr-two-stage` - For Problem 2 solutions
 3. Merge best solution back to `PaddleOCR-Cover`
 ---
 ## Next Actions
 ### Immediate (Today)
 - [ ] Commit current working state
 - [ ] Test Phase 1 quick wins (padding + morphology)
 - [ ] Measure improvement
 ### Short-term (This week)
 - [ ] Implement Region Merging (Option B)
 - [ ] Implement Two-Stage OCR (Option E)
 - [ ] Add VLM verification
 - [ ] Test on 10 PDFs
 ### Long-term (Production)
 - [ ] Optimize performance (parallel processing)
 - [ ] Error handling and logging
 - [ ] Process full 86K dataset
 - [ ] Compare with previous hybrid approach (70% recall)
 ---
 ## Comparison: PaddleOCR vs Previous Hybrid Approach
 ### Previous Approach (VLM-Cover branch)
 - **Method**: VLM names + CV detection + VLM verification
 - **Results**: 70% recall, 100% precision
 - **Problem**: Missed 30% of signatures (CV parameters too conservative)
 ### PaddleOCR Approach (Current)
 - **Method**: PaddleOCR masking + CV detection + VLM verification
 - **Results**: 100% recall (found both signatures)
 - **Problem**: Low precision (many false positives), printed text not fully removed
 ### Winner: TBD
 - PaddleOCR shows **better recall potential**
 - After implementing refinements (Phase 2-3), should achieve **high recall + high precision**
 - Need to test on larger dataset to confirm
 ---
 **Document version**: 1.0
 **Last updated**: October 28, 2025
 **Author**: Claude Code
 **Status**: Ready for implementation
--- a/PP_OCRV5_RESEARCH_FINDINGS.md
+++ b/PP_OCRV5_RESEARCH_FINDINGS.md
@@ -0,0 +1,281 @@
 # PP-OCRv5 Research Findings
 **Date**: 2025-11-27
 **Branch**: pp-ocrv5-research
 **Status**: research complete
 ---
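 ## 📋 Summary
 We upgraded to PP-OCRv5 and tested it. The key findings are below:
 ### ✅ Completed
 1. PaddleOCR upgrade: 2.7.3 → 3.3.2
 2. New API understood and verified
 3. Handwriting detection capability tested
 4. Data structure analyzed
 ### ❌ Key limitation
 **PP-OCRv5 has no built-in handwritten vs printed text classification**
 ---
 ## 🔧 Technical Details
 ### API changes
 **Old API (2.7.3)**: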
 ```python
 from paddleocr import PaddleOCR
 ocr = PaddleOCR(lang='ch', show_log=False)
 result = ocr.ocr(image_np, cls=False)
 ```
 **New API (3.3.2)**:
 ```python
 from paddleocr import PaddleOCR
 ocr = PaddleOCR(
    text_detection_model_name="PP-OCRv5_server_det",
    text_recognition_model_name="PP-OCRv5_server_rec",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False
    # ❌ No longer supported: show_log, cls
 )
 result = ocr.predict(image_path)  # ✅ Use predict() instead of ocr()
 ```
 ### Main API differences
 | Feature | v2.7.3 | v3.3.2 |
 |------|--------|--------|
 | Initialization | `PaddleOCR(lang='ch')` | `PaddleOCR(text_detection_model_name=...)` |
 | Prediction method | `ocr.ocr()` | `ocr.predict()` |
 | `cls` argument | ✅ Supported | ❌ Removed |
 | `show_log` argument | ✅ Supported | ❌ Removed |
 | Return format | `[[[box], (text, conf)], ...]` | `OCRResult` object with a `.json` attribute |
 | Dependencies | Standalone | Requires PaddleX >=3.3.0 |
 ---
 ## 📊 Return Data Structure
 ### v3.3.2 return format
 ```python
 result = ocr.predict(image_path)
 json_data = result[0].json['res']
 # Available fields:
 json_data = {
    'input_path': str,                    # Input image path
    'page_index': None,                   # PDF page index (None for images)
    'model_settings': dict,               # Model configuration
    'dt_polys': list,                     # Detection polygons (N, 4, 2)
    'dt_scores': list,                    # Detection confidences
    'rec_texts': list,                    # Recognized text
    'rec_scores': list,                   # Recognition confidences
    'rec_boxes': list,                    # Rectangles [x_min, y_min, x_max, y_max]
    'rec_polys': list,                    # Recognition polygons
    'text_det_params': dict,              # Detection parameters
    'text_rec_score_thresh': float,       # Recognition threshold
    'text_type': str,                     # ⚠️ 'general' (language type, not a handwriting label)
    'textline_orientation_angles': list,  # Text line orientation angles
    'return_word_box': bool               # Whether word-level boxes are returned
 }
 ```
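To reuse the existing masking code with this structure, the `rec_boxes`, `rec_texts`, and `rec_scores` lists can be zipped into `(x, y, w, h)` regions. The sketch below operates on a plain dictionary shaped like the `res` payload described above, so it runs without PaddleOCR installed; the helper name `boxes_from_v3` and the sample values are hypothetical.

```python
def boxes_from_v3(res, min_score=0.0):
    """Convert a v3-style 'res' dict into a list of region dicts.

    Assumes rec_boxes holds [x_min, y_min, x_max, y_max] rectangles that
    line up index-for-index with rec_texts and rec_scores.
    """
    regions = []
    for rect, text, score in zip(res["rec_boxes"], res["rec_texts"], res["rec_scores"]):
        if score < min_score:
            continue  # skip low-confidence recognitions
        x_min, y_min, x_max, y_max = (int(v) for v in rect)
        regions.append({
            "text": text,
            "confidence": float(score),
            "box": (x_min, y_min, x_max - x_min, y_max - y_min),
        })
    return regions


# Illustrative payload; the coordinates are made up for the example.
sample_res = {
    "rec_boxes": [[100, 200, 900, 240], [100, 260, 700, 300]],
    "rec_texts": ["line one", "line two"],
    "rec_scores": [0.9936, 0.42],
}
print(boxes_from_v3(sample_res, min_score=0.5))
```

Producing the same `(x, y, w, h)` shape as the old client would let the REST interface stay unchanged if the server is ever migrated.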
 ---
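 ## 🔍 Handwriting Detection Test
 ### Question tested
 **Can PP-OCRv5 distinguish handwritten from printed text?**
 ### Result: ❌ No
 #### Test process
 1. ✅ Found the `text_type` field
 2. ❌ But `text_type = 'general'` is a **language type**, not a writing style
 3. ✅ Confirmed against the official documentation
 4. ❌ No field labels text as handwritten vs printed
 #### What the official documentation says
 - Possible `text_type` values: 'general', 'ch', 'en', 'japan', 'pinyin'
 - These values refer to the **language/script type**
 - They are **not** a handwritten vs printed classification
 ### Conclusion
 PP-OCRv5 can **recognize** handwritten text, but it does **not label** whether a text region is handwritten or printed.
 ---
 ## 📈 Performance Gains (per the official documentation)
 ### Handwritten text recognition accuracy
 | Type | PP-OCRv4 | PP-OCRv5 | Gain |
 |------|----------|----------|------|
 | Handwritten Chinese | 0.706 | 0.803 | **+13.7%** |
 | Handwritten English | 0.249 | 0.841 | **+237%** |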
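The gains in the table are relative improvements over the PP-OCRv4 score, not percentage-point differences. A short check of the arithmetic, using a hypothetical helper name:

```python
def relative_gain(old: float, new: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100


chinese_gain = relative_gain(0.706, 0.803)
english_gain = relative_gain(0.249, 0.841)
print(f"Handwritten Chinese: +{chinese_gain:.1f}%")
print(f"Handwritten English: +{english_gain:.1f}%")
```

The English figure works out to about 237.8%, which the table shows truncated to +237%.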
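 ### Measured results (full_page_original.png)
 **v3.3.2 (PP-OCRv5)**:
 - **50** text regions detected
 - Average confidence: ~0.98
 - Examples:
  - "依本會計師核閱結果..." (0.9936)
  - "在所有重大方面有違反..." (0.9976)
 **To be tested**: comparison results for v2.7.3 (requires a rollback test)
 ---
 ## 💡 Upgrade Impact Analysis
 ### Advantages
 1. ✅ **Better handwriting recognition** (+13.7%)
 2. ✅ **May detect more handwritten regions**
 3. ✅ **Higher recognition confidence**
 4. ✅ **Unified pipeline architecture**
 ### Disadvantages
 1. ❌ **Cannot distinguish handwritten from printed text** (OpenCV Method 3 is still needed)
 2. ⚠️ **Completely incompatible API** (server code must be rewritten)
 3. ⚠️ **Depends on PaddleX** (an extra dependency)
 4. ⚠️ **OpenCV version bump** (4.6 → 4.10)
 ---
 ## 🎯 Impact on Our Project
 ### Current approach (v2.7.3 + OpenCV Method 3)
 ```
 PDF → PaddleOCR detection → mask printed text → OpenCV Method 3 separates handwriting → VLM verification
                        ↑ 86.5% handwriting retention
 ```
 ### PP-OCRv5 approach
 ```
 PDF → PP-OCRv5 detection → mask printed text → OpenCV Method 3 separates handwriting → VLM verification
      ↑ may detect more handwriting   ↑ still required!
 ```
 ### Key finding
 **PP-OCRv5 cannot replace OpenCV Method 3!**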
 ---
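 ## 🤔 Upgrade Recommendation
 ### Reasons to upgrade
 1. Better detection of handwritten signatures (+13.7% accuracy)
 2. May reduce missed detections
 3. Higher recognition confidence can help later analysis
 ### Reasons not to upgrade
 1. The current approach is already stable (86.5% retention)
 2. OpenCV Method 3 is still required
 3. Rewriting for the new API is costly
 4. Extra dependencies and complexity
 ### Recommended decision
 **Staged upgrade strategy**:
 1. **Short term (now)**:
    - ✅ Keep the stable v2.7.3 setup
    - ✅ Keep using OpenCV Method 3
    - ✅ Test the current approach on more samples
 2. **Medium term (if optimization is needed)**:
    - Compare v2.7.3 and v3.3.2 on real signature samples
    - If v5 clearly reduces missed detections → upgrade
    - If the difference is small → stay on v2.7.3
 3. **Long term**:
    - Watch whether PaddleOCR adds handwriting classification
    - If it does → re-evaluate the value of upgrading
 ---
 ## 📝 Technical Debt
 ### If we decide to upgrade to v3.3.2
 Work required:
 1. **Server side**:
    - [ ] Rewrite `paddleocr_server.py` for the new API
    - [ ] Test GPU utilization and speed
    - [ ] Handle OpenCV 4.10 compatibility
    - [ ] Update the dependency documentation
 2. **Client side**:
    - [ ] Update `paddleocr_client.py` (if the REST interface changes)
    - [ ] Adapt to the new return format
 3. **Testing**:
    - [ ] Comparison test on 10+ samples
    - [ ] Performance benchmarks
    - [ ] Stability tests
 4. **Documentation**:
    - [ ] Update CURRENT_STATUS.md
    - [ ] Write an API migration guide
    - [ ] Update the deployment docs
 ---
 ## ✅ Work Completed
 1. ✅ Upgraded PaddleOCR: 2.7.3 → 3.3.2
 2. ✅ Understood the new API structure
 3. ✅ Tested basic functionality
 4. ✅ Analyzed the return data structure
 5. ✅ Tested for handwriting classification (conclusion: none)
 6. ✅ Verified against the official documentation
 7. ✅ Documented the full research process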
 ---
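 ## 🎓 Lessons Learned
 1. **API upgrade risk**: major version upgrades usually bring breaking changes
 2. **Verify features yourself**: "handwriting support" in the docs does not mean "handwriting classification"
 3. **Value of the existing approach**: OpenCV Method 3 is still required
 4. **Performance vs complexity**: not every performance gain justifies an immediate upgrade
 ---
 ## 🔗 Related Documents
 - [CURRENT_STATUS.md](./CURRENT_STATUS.md) - current stable approach
 - [NEW_SESSION_HANDOFF.md](./NEW_SESSION_HANDOFF.md) - research task list
 - [PADDLEOCR_STATUS.md](./PADDLEOCR_STATUS.md) - detailed technical analysis
 ---
 ## 📌 Next Steps
 Suggestions for the user:
 1. **Act now**:
    - Test the current approach on more PDF samples
    - Record success rates and failure cases
 2. **Evaluate the upgrade**:
    - If the current approach is satisfactory → stay on v2.7.3
    - If many detections are missed → consider v3.3.2
 3. **Monitor long term**:
    - Follow the PaddleOCR GitHub Issues
    - Track any updates that add handwriting classification
 ---
 **Conclusion**: PP-OCRv5 improves handwriting recognition, but it cannot replace OpenCV Method 3 for separating handwritten from printed text. The current approach (v2.7.3 + OpenCV Method 3) is good enough; upgrading right away is not recommended unless a performance bottleneck appears.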
--- a/check_rejected_for_missing.py
+++ b/check_rejected_for_missing.py
@@ -0,0 +1,75 @@
 #!/usr/bin/env python3
 """Check if rejected regions contain the missing signatures."""
 import base64
 import requests
 from pathlib import Path
 OLLAMA_URL = "http://192.168.30.36:11434"
 OLLAMA_MODEL = "qwen2.5vl:32b"
 REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"
 # Missing signatures based on test results
 MISSING = {
    "201301_2061_AI1_page5": "林姿妤",
    "201301_2458_AI1_page4": "魏興海",
    "201301_2923_AI1_page3": "陈丽琦"
 }
 def encode_image_to_base64(image_path):
    """Encode image file to base64."""
    with open(image_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')
 def ask_vlm_about_signature(image_base64, expected_name):
    """Ask VLM if the image contains the expected signature."""
    prompt = f"""Does this image contain a handwritten signature with the Chinese name: "{expected_name}"?
 Look carefully for handwritten Chinese characters matching this name.
 Answer only 'yes' or 'no'."""
    payload = {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "images": [image_base64],
        "stream": False
    }
    try:
        response = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=60)
        response.raise_for_status()
        answer = response.json()['response'].strip().lower()
        return answer
    except Exception as e:
        return f"error: {str(e)}"
 # Check each missing signature
 for pdf_stem, missing_name in MISSING.items():
    print(f"\n{'='*80}")
    print(f"Checking rejected regions from: {pdf_stem}")
    print(f"Looking for missing signature: {missing_name}")
    print('='*80)
    # Find all rejected regions from this PDF
    rejected_regions = sorted(Path(REJECTED_PATH).glob(f"{pdf_stem}_region_*.png"))
    print(f"Found {len(rejected_regions)} rejected regions to check")
    for region_path in rejected_regions:
        region_name = region_path.name
        print(f"\nChecking: {region_name}...", end='', flush=True)
        # Encode and ask VLM
        image_base64 = encode_image_to_base64(region_path)
        answer = ask_vlm_about_signature(image_base64, missing_name)
        if 'yes' in answer:
            print(f" ✅ FOUND! This region contains {missing_name}")
            print(f"   → The signature was detected by CV but rejected by verification!")
        else:
            print(f" ❌ No (VLM says: {answer})")
 print(f"\n{'='*80}")
 print("Analysis complete!")
 print('='*80)
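
The script above counts any reply containing the substring "yes" as a hit, which would also match an unrelated word such as "eyes" or a failed request whose error text happens to contain it. The following is a sketch of a stricter parser, assuming the model answers with a leading "yes" or "no" as the prompt requests; `parse_yes_no` is a hypothetical helper and not part of the script.

```python
import re
from typing import Optional


def parse_yes_no(answer: str) -> Optional[bool]:
    """Return True/False for a clear yes/no reply, None if the reply is unusable."""
    text = answer.strip().lower()
    if text.startswith("error:"):
        return None  # the request failed; this is not a verdict
    # Accept only a leading whole word "yes" or "no" (optional punctuation first).
    match = re.match(r"\W*(yes|no)\b", text)
    if match is None:
        return None
    return match.group(1) == "yes"


print(parse_yes_no("Yes."))
print(parse_yes_no("error: timed out, check the eyes of the server"))
```

Returning `None` for unusable replies lets the caller report them separately instead of folding them into the "no" bucket.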
--- a/extract_signatures_paddleocr_improved.py
+++ b/extract_signatures_paddleocr_improved.py
@@ -0,0 +1,415 @@
 #!/usr/bin/env python3
 """
 PaddleOCR Signature Extraction - Improved Pipeline
 Implements:
 - Method B: Region Merging (merge nearby regions to avoid splits)
 - Method E: Two-Stage Approach (second OCR pass on regions)
 Pipeline:
 1. PaddleOCR detects printed text on full page
 2. Mask printed text with padding
 3. Detect candidate regions
 4. Merge nearby regions (METHOD B)
 5. For each region: Run OCR again to remove remaining printed text (METHOD E)
 6. VLM verification (optional)
 7. Save cleaned handwriting regions
 """
 import fitz  # PyMuPDF
 import numpy as np
 import cv2
 from pathlib import Path
 from paddleocr_client import create_ocr_client
 from typing import List, Dict, Tuple
 import base64
 import requests
 # Configuration
 TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
 OUTPUT_DIR = "/Volumes/NV2/PDF-Processing/signature-image-output/paddleocr_improved"
 DPI = 300
 # PaddleOCR Settings
 MASKING_PADDING = 25  # Pixels to expand text boxes when masking
 # Region Detection Parameters
 MIN_REGION_AREA = 3000
 MAX_REGION_AREA = 300000
 MIN_ASPECT_RATIO = 0.3
 MAX_ASPECT_RATIO = 15.0
 # Region Merging Parameters (METHOD B)
 MERGE_DISTANCE_HORIZONTAL = 100  # pixels
 MERGE_DISTANCE_VERTICAL = 50     # pixels
 # VLM Settings (optional)
 USE_VLM_VERIFICATION = False  # Set to True to enable VLM filtering
 OLLAMA_URL = "http://192.168.30.36:11434"
 OLLAMA_MODEL = "qwen2.5vl:32b"
 def merge_nearby_regions(regions: List[Dict],
                        h_distance: int = 100,
                        v_distance: int = 50) -> List[Dict]:
    """
    Merge regions that are close to each other (METHOD B).
    Args:
        regions: List of region dicts with 'box': (x, y, w, h)
        h_distance: Maximum horizontal distance between regions to merge
        v_distance: Maximum vertical distance between regions to merge
    Returns:
        List of merged regions
    """
    if not regions:
        return []
    # Sort regions by y-coordinate (top to bottom)
    regions = sorted(regions, key=lambda r: r['box'][1])
    merged = []
    skip_indices = set()
    for i, region1 in enumerate(regions):
        if i in skip_indices:
            continue
        x1, y1, w1, h1 = region1['box']
        # Find all regions that should merge with this one
        merge_group = [region1]
        for j, region2 in enumerate(regions[i+1:], start=i+1):
            if j in skip_indices:
                continue
            x2, y2, w2, h2 = region2['box']
            # Calculate distances
            # Horizontal distance: gap between boxes horizontally
            h_dist = max(0, max(x1, x2) - min(x1 + w1, x2 + w2))
            # Vertical distance: gap between boxes vertically
            v_dist = max(0, max(y1, y2) - min(y1 + h1, y2 + h2))
            # Check if regions are close enough to merge
            if h_dist <= h_distance and v_dist <= v_distance:
                merge_group.append(region2)
                skip_indices.add(j)
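                # Update bounding box to include new region.
                # Compute the right/bottom edges first: they depend on the
                # old x1/y1, which are overwritten below.
                right = max(x1 + w1, x2 + w2)
                bottom = max(y1 + h1, y2 + h2)
                x1 = min(x1, x2)
                y1 = min(y1, y2)
                w1 = right - x1
                h1 = bottom - y1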
        # Create merged region
        merged_box = (x1, y1, w1, h1)
        merged_area = w1 * h1
        merged_aspect = w1 / h1 if h1 > 0 else 0
        merged.append({
            'box': merged_box,
            'area': merged_area,
            'aspect_ratio': merged_aspect,
            'merged_count': len(merge_group)
        })
    return merged
 def clean_region_with_ocr(region_image: np.ndarray,
                          ocr_client,
                          padding: int = 10) -> np.ndarray:
    """
    Remove printed text from a region using second OCR pass (METHOD E).
    Args:
        region_image: The region image to clean
        ocr_client: PaddleOCR client
        padding: Padding around detected text boxes
    Returns:
        Cleaned region with printed text masked
    """
    try:
        # Run OCR on this specific region
        text_boxes = ocr_client.get_text_boxes(region_image)
        if not text_boxes:
            return region_image  # No text found, return as-is
        # Mask detected printed text
        cleaned = region_image.copy()
        for (x, y, w, h) in text_boxes:
            # Add padding
            x_pad = max(0, x - padding)
            y_pad = max(0, y - padding)
            w_pad = min(cleaned.shape[1] - x_pad, w + 2*padding)
            h_pad = min(cleaned.shape[0] - y_pad, h + 2*padding)
            cv2.rectangle(cleaned, (x_pad, y_pad),
                         (x_pad + w_pad, y_pad + h_pad),
                         (255, 255, 255), -1)  # Fill with white
        return cleaned
    except Exception as e:
        print(f"      Warning: OCR cleaning failed: {e}")
        return region_image
 def verify_handwriting_with_vlm(image: np.ndarray) -> Tuple[bool, float]:
    """
    Use VLM to verify if image contains handwriting.
    Args:
        image: Region image (RGB numpy array)
    Returns:
        (is_handwriting: bool, confidence: float)
    """
    try:
        # Convert image to base64
        from PIL import Image
        from io import BytesIO
        pil_image = Image.fromarray(image.astype(np.uint8))
        buffered = BytesIO()
        pil_image.save(buffered, format="PNG")
        image_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
        # Ask VLM
        prompt = """Does this image contain handwritten text or a handwritten signature?
 Answer only 'yes' or 'no', followed by a confidence score 0-100.
 Format: yes 95 OR no 80"""
        payload = {
            "model": OLLAMA_MODEL,
            "prompt": prompt,
            "images": [image_base64],
            "stream": False
        }
        response = requests.post(f"{OLLAMA_URL}/api/generate",
                                json=payload, timeout=30)
        response.raise_for_status()
        answer = response.json()['response'].strip().lower()
        # Parse answer
        is_handwriting = 'yes' in answer
        # Try to extract confidence
        confidence = 0.5
        parts = answer.split()
        for part in parts:
            try:
                conf = float(part)
                if 0 <= conf <= 100:
                    confidence = conf / 100
                    break
            except ValueError:
                continue
        return is_handwriting, confidence
    except Exception as e:
        print(f"      Warning: VLM verification failed: {e}")
        return True, 0.5  # Default to accepting the region
 print("="*80)
 print("PaddleOCR Improved Pipeline - Region Merging + Two-Stage Cleaning")
 print("="*80)
 # Create output directory
 Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
 # Step 1: Connect to PaddleOCR
 print("\n1. Connecting to PaddleOCR server...")
 try:
    ocr_client = create_ocr_client()
    print(f"   ✅ Connected: {ocr_client.server_url}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 2: Render PDF
 print("\n2. Rendering PDF...")
 try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    original_image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
        pix.height, pix.width, pix.n)
    if pix.n == 4:
        original_image = cv2.cvtColor(original_image, cv2.COLOR_RGBA2RGB)
    print(f"   ✅ Rendered: {original_image.shape[1]}x{original_image.shape[0]}")
    doc.close()
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 3: Detect printed text (Stage 1)
 print("\n3. Detecting printed text (Stage 1 OCR)...")
 try:
    text_boxes = ocr_client.get_text_boxes(original_image)
    print(f"   ✅ Detected {len(text_boxes)} text regions")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 4: Mask printed text with padding
 print(f"\n4. Masking printed text (padding={MASKING_PADDING}px)...")
 try:
    masked_image = original_image.copy()
    for (x, y, w, h) in text_boxes:
        # Add padding
        x_pad = max(0, x - MASKING_PADDING)
        y_pad = max(0, y - MASKING_PADDING)
        w_pad = min(masked_image.shape[1] - x_pad, w + 2*MASKING_PADDING)
        h_pad = min(masked_image.shape[0] - y_pad, h + 2*MASKING_PADDING)
        cv2.rectangle(masked_image, (x_pad, y_pad),
                     (x_pad + w_pad, y_pad + h_pad), (0, 0, 0), -1)
    print(f"   ✅ Masked {len(text_boxes)} regions")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 5: Detect candidate regions
 print("\n5. Detecting candidate regions...")
 try:
    gray = cv2.cvtColor(masked_image, cv2.COLOR_RGB2GRAY)
    _, binary = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY_INV)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
    contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    candidate_regions = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        area = w * h
        aspect_ratio = w / h if h > 0 else 0
        if (MIN_REGION_AREA <= area <= MAX_REGION_AREA and
            MIN_ASPECT_RATIO <= aspect_ratio <= MAX_ASPECT_RATIO):
            candidate_regions.append({
                'box': (x, y, w, h),
                'area': area,
                'aspect_ratio': aspect_ratio
            })
    print(f"   ✅ Found {len(candidate_regions)} candidate regions")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 6: Merge nearby regions (METHOD B)
 print(f"\n6. Merging nearby regions (h_dist<={MERGE_DISTANCE_HORIZONTAL}, v_dist<={MERGE_DISTANCE_VERTICAL})...")
 try:
    merged_regions = merge_nearby_regions(
        candidate_regions,
        h_distance=MERGE_DISTANCE_HORIZONTAL,
        v_distance=MERGE_DISTANCE_VERTICAL
    )
    print(f"   ✅ Merged {len(candidate_regions)} → {len(merged_regions)} regions")
    for i, region in enumerate(merged_regions):
        if region['merged_count'] > 1:
            print(f"      Region {i+1}: Merged {region['merged_count']} sub-regions")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    import traceback
    traceback.print_exc()
    exit(1)
 # Step 7: Extract and clean each region (METHOD E)
 print("\n7. Extracting and cleaning regions (Stage 2 OCR)...")
 final_signatures = []
 for i, region in enumerate(merged_regions):
    x, y, w, h = region['box']
    print(f"\n   Region {i+1}/{len(merged_regions)}: ({x}, {y}, {w}, {h})")
    # Extract region from ORIGINAL image (not masked)
    padding = 10
    x_pad = max(0, x - padding)
    y_pad = max(0, y - padding)
    w_pad = min(original_image.shape[1] - x_pad, w + 2*padding)
    h_pad = min(original_image.shape[0] - y_pad, h + 2*padding)
    region_img = original_image[y_pad:y_pad+h_pad, x_pad:x_pad+w_pad].copy()
    print(f"      - Extracted: {region_img.shape[1]}x{region_img.shape[0]}px")
    # Clean with second OCR pass
    print(f"      - Running Stage 2 OCR to remove printed text...")
    cleaned_region = clean_region_with_ocr(region_img, ocr_client, padding=5)
    # VLM verification (optional)
    if USE_VLM_VERIFICATION:
        print(f"      - VLM verification...")
        is_handwriting, confidence = verify_handwriting_with_vlm(cleaned_region)
        print(f"      - VLM says: {'✅ Handwriting' if is_handwriting else '❌ Not handwriting'} (confidence: {confidence:.2f})")
        if not is_handwriting:
            print(f"      - Skipping (not handwriting)")
            continue
    # Save
    final_signatures.append({
        'image': cleaned_region,
        'box': region['box'],
        'original_image': region_img
    })
    print(f"      ✅ Kept as signature candidate")
 print(f"\n   ✅ Final signatures: {len(final_signatures)}")
 # Step 8: Save results
 print("\n8. Saving results...")
 for i, sig in enumerate(final_signatures):
    # Save cleaned signature
    sig_path = Path(OUTPUT_DIR) / f"signature_{i+1:02d}_cleaned.png"
    cv2.imwrite(str(sig_path), cv2.cvtColor(sig['image'], cv2.COLOR_RGB2BGR))
    # Save original region for comparison
    orig_path = Path(OUTPUT_DIR) / f"signature_{i+1:02d}_original.png"
    cv2.imwrite(str(orig_path), cv2.cvtColor(sig['original_image'], cv2.COLOR_RGB2BGR))
    print(f"   📁 Signature {i+1}: {sig_path.name}")
 # Save visualizations
 vis_merged = original_image.copy()
 for region in merged_regions:
    x, y, w, h = region['box']
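    # Compare boxes directly: region dicts carry extra keys, so comparing
    # whole dicts against {'box': ...} would never match.
    color = (255, 0, 0) if any(region['box'] == s['box'] for s in final_signatures) else (128, 128, 128)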
    cv2.rectangle(vis_merged, (x, y), (x + w, y + h), color, 3)
 vis_path = Path(OUTPUT_DIR) / "visualization_merged_regions.png"
 cv2.imwrite(str(vis_path), cv2.cvtColor(vis_merged, cv2.COLOR_RGB2BGR))
 print(f"   📁 Visualization: {vis_path.name}")
 print("\n" + "="*80)
 print("Pipeline completed!")
 print(f"Results: {OUTPUT_DIR}")
 print("="*80)
 print(f"\nSummary:")
 print(f"  - Stage 1 OCR: {len(text_boxes)} text regions masked")
 print(f"  - Initial candidates: {len(candidate_regions)}")
 print(f"  - After merging: {len(merged_regions)}")
 print(f"  - Final signatures: {len(final_signatures)}")
 print(f"  - Expected signatures: 2 (楊智惠, 張志銘)")
 print("="*80)
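
The gap test and the box union used by `merge_nearby_regions` can be checked in isolation. This is a small sketch with hypothetical helper names (`box_gap`, `union_box`) and made-up boxes; it mirrors the distance formula from the script and computes the union from edges so the result does not depend on update order.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def box_gap(a: Box, b: Box) -> Tuple[int, int]:
    """Horizontal and vertical gap between two boxes (0 if they overlap on that axis)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    h_gap = max(0, max(ax, bx) - min(ax + aw, bx + bw))
    v_gap = max(0, max(ay, by) - min(ay + ah, by + bh))
    return h_gap, v_gap


def union_box(a: Box, b: Box) -> Box:
    """Smallest box containing both a and b, computed from edges."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    left, top = min(ax, bx), min(ay, by)
    right = max(ax + aw, bx + bw)
    bottom = max(ay + ah, by + bh)
    return (left, top, right - left, bottom - top)


a = (100, 100, 50, 40)  # spans x 100..150, y 100..140
b = (200, 110, 60, 40)  # spans x 200..260, y 110..150
print(box_gap(a, b))    # (50, 0)
print(union_box(b, a))  # (100, 100, 160, 50)
```

With the thresholds used in the script (100 px horizontal, 50 px vertical), these two boxes would be merged.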
--- a/paddleocr_client.py
+++ b/paddleocr_client.py
@@ -0,0 +1,169 @@
 #!/usr/bin/env python3
 """
 PaddleOCR Client
 Connects to remote PaddleOCR server for OCR inference
 """
 import requests
 import base64
 import numpy as np
 from typing import List, Dict, Tuple, Optional
 from PIL import Image
 from io import BytesIO
 class PaddleOCRClient:
    """Client for remote PaddleOCR server."""
    def __init__(self, server_url: str = "http://192.168.30.36:5555"):
        """
        Initialize PaddleOCR client.
        Args:
            server_url: URL of the PaddleOCR server
        """
        self.server_url = server_url.rstrip('/')
        self.timeout = 30  # seconds
    def health_check(self) -> bool:
        """
        Check if server is healthy.
        Returns:
            True if server is healthy, False otherwise
        """
        try:
            response = requests.get(
                f"{self.server_url}/health",
                timeout=5
            )
            return response.status_code == 200 and response.json().get('status') == 'ok'
        except Exception as e:
            print(f"Health check failed: {e}")
            return False
    def ocr(self, image: np.ndarray) -> List[Dict]:
        """
        Perform OCR on an image.
        Args:
            image: numpy array of the image (RGB format)
        Returns:
            List of detection results, each containing:
                - box: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
                - text: detected text string
                - confidence: confidence score (0-1)
        Raises:
            Exception if OCR fails
        """
        # Convert numpy array to PIL Image
        if len(image.shape) == 2:  # Grayscale
            pil_image = Image.fromarray(image)
        else:  # RGB or RGBA
            pil_image = Image.fromarray(image.astype(np.uint8))
        # Encode to base64
        buffered = BytesIO()
        pil_image.save(buffered, format="PNG")
        image_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
        # Send request
        try:
            response = requests.post(
                f"{self.server_url}/ocr",
                json={"image": image_base64},
                timeout=self.timeout
            )
            response.raise_for_status()
            result = response.json()
            if not result.get('success'):
                error_msg = result.get('error', 'Unknown error')
                raise Exception(f"OCR failed: {error_msg}")
            return result.get('results', [])
        except requests.exceptions.Timeout:
            raise Exception(f"OCR request timed out after {self.timeout} seconds")
        except requests.exceptions.ConnectionError:
            raise Exception(f"Could not connect to server at {self.server_url}")
        except Exception as e:
            raise Exception(f"OCR request failed: {str(e)}")
    def get_text_boxes(self, image: np.ndarray) -> List[Tuple[int, int, int, int]]:
        """
        Get bounding boxes of all detected text.
        Args:
            image: numpy array of the image
        Returns:
            List of bounding boxes as (x, y, w, h) tuples
        """
        results = self.ocr(image)
        boxes = []
        for result in results:
            box = result['box']  # [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
            # Convert polygon to bounding box
            xs = [point[0] for point in box]
            ys = [point[1] for point in box]
            x = int(min(xs))
            y = int(min(ys))
            w = int(max(xs) - min(xs))
            h = int(max(ys) - min(ys))
            boxes.append((x, y, w, h))
        return boxes
    def __repr__(self):
        return f"PaddleOCRClient(server_url='{self.server_url}')"
 # Convenience function
 def create_ocr_client(server_url: str = "http://192.168.30.36:5555") -> PaddleOCRClient:
    """
    Create and test PaddleOCR client.
    Args:
        server_url: URL of the PaddleOCR server
    Returns:
        PaddleOCRClient instance
    Raises:
        Exception if server is not reachable
    """
    client = PaddleOCRClient(server_url)
    if not client.health_check():
        raise Exception(
            f"PaddleOCR server at {server_url} is not responding. "
            "Make sure the server is running on the Linux machine."
        )
    return client
 if __name__ == "__main__":
    # Test the client
    print("Testing PaddleOCR client...")
    try:
        client = create_ocr_client()
        print(f"✅ Connected to server: {client.server_url}")
        # Create a test image
        test_image = np.ones((100, 100, 3), dtype=np.uint8) * 255
        print("Running test OCR...")
        results = client.ocr(test_image)
        print(f"✅ OCR test successful! Found {len(results)} text regions")
    except Exception as e:
        print(f"❌ Error: {e}")
--- a/paddleocr_server_v5.py
+++ b/paddleocr_server_v5.py
@@ -0,0 +1,91 @@
 #!/usr/bin/env python3
 """
 PaddleOCR Server v5 (PP-OCRv5)
 Flask HTTP server exposing PaddleOCR v3.3.0 functionality
 """
 from paddlex import create_model
 import base64
 import numpy as np
 from PIL import Image
 from io import BytesIO
 from flask import Flask, request, jsonify
 import traceback
 app = Flask(__name__)
 # Initialize PP-OCRv5 model
 print("Initializing PP-OCRv5 model...")
 model = create_model("PP-OCRv5_server")
 print("PP-OCRv5 model loaded successfully!")
 @app.route('/health', methods=['GET'])
 def health():
    """Health check endpoint."""
    return jsonify({
        'status': 'ok',
        'service': 'paddleocr-server-v5',
        'version': '3.3.0',
        'model': 'PP-OCRv5_server',
        'gpu_enabled': True
    })
 @app.route('/ocr', methods=['POST'])
 def ocr_endpoint():
    """
    OCR endpoint using PP-OCRv5.
    Accepts: {"image": "base64_encoded_image"}
    Returns: {"success": true, "count": N, "results": [...]}
    """
    try:
        # Parse request
        data = request.get_json()
        image_base64 = data['image']
        # Decode image
        image_bytes = base64.b64decode(image_base64)
        image = Image.open(BytesIO(image_bytes))
        image_np = np.array(image)
        # Run OCR with PP-OCRv5
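        # predict() may return a generator in PaddleX 3.x; materialize it
        # so that result[0] below works either way.
        result = list(model.predict(image_np))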
        # Format results
        formatted_results = []
        if result and 'dt_polys' in result[0] and 'rec_text' in result[0]:
            dt_polys = result[0]['dt_polys']
            rec_texts = result[0]['rec_text']
            rec_scores = result[0]['rec_score']
            for i in range(len(dt_polys)):
                box = dt_polys[i].tolist()  # Convert to list
                text = rec_texts[i]
                confidence = float(rec_scores[i])
                formatted_results.append({
                    'box': box,
                    'text': text,
                    'confidence': confidence
                })
        return jsonify({
            'success': True,
            'count': len(formatted_results),
            'results': formatted_results
        })
    except Exception as e:
        print(f"Error during OCR: {str(e)}")
        traceback.print_exc()
        return jsonify({
            'success': False,
            'error': str(e)
        }), 500
 if __name__ == '__main__':
    print("Starting PP-OCRv5 server on port 5555...")
    print("Model: PP-OCRv5_server")
    print("Version: 3.3.0")
    app.run(host='0.0.0.0', port=5555, debug=False)
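
The `/ocr` endpoint answers every failure with HTTP 500, including a malformed request body. The following sketch shows how the request payload might be validated first so that client mistakes can be reported separately (for example with a 400 status). `decode_image_field` is a hypothetical helper that uses only the standard library; it is not part of the server.

```python
import base64
import binascii
from typing import Any, Optional, Tuple


def decode_image_field(data: Any) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (image_bytes, None) on success or (None, error_message) on bad input."""
    if not isinstance(data, dict) or "image" not in data:
        return None, "missing 'image' field"
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        return base64.b64decode(data["image"], validate=True), None
    except (binascii.Error, ValueError, TypeError):
        return None, "'image' is not valid base64"


good = {"image": base64.b64encode(b"abc").decode("utf-8")}
print(decode_image_field(good))
print(decode_image_field({"image": "!!!"}))
```

In the endpoint, a non-`None` error message could be returned with `jsonify({'success': False, 'error': msg}), 400` before any model call is made.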
--- a/signature-comparison/v4-current/SUMMARY.txt
+++ b/signature-comparison/v4-current/SUMMARY.txt
@@ -0,0 +1,17 @@
 PaddleOCR v2.7.3 (v4) Full Pipeline Test Results
 ============================================================
 1. OCR detection: 14 text regions
 2. Mask printed text: done
 3. Candidate regions detected: 4
 4. Signatures extracted: 4
 Candidate region details:
 ------------------------------------------------------------
 Region 1: position (1211, 1462), size 965x191, area=184315
 Region 2: position (1215, 877), size 1150x511, area=587650
 Region 3: position (332, 150), size 197x96, area=18912
 Region 4: position (1147, 3303), size 159x42, area=6678
 All results saved to: /Volumes/NV2/pdf_recognize/signature-comparison/v4-current
--- a/signature-comparison/v5-new/SUMMARY.txt
+++ b/signature-comparison/v5-new/SUMMARY.txt
@@ -0,0 +1,20 @@
 PP-OCRv5 Full Pipeline Test Results
 ============================================================
 1. OCR detection: 50 text regions
 2. Mask printed text: /Volumes/NV2/pdf_recognize/test_results/v5_pipeline/01_masked.png
 3. Candidate regions detected: 7
 4. Signatures extracted: 7
 Candidate region details:
 ------------------------------------------------------------
 Region 1: position (1218, 877), size 1144x511, area=584584
 Region 2: position (1213, 1457), size 961x196, area=188356
 Region 3: position (228, 386), size 2028x209, area=423852
 Region 4: position (330, 310), size 1932x63, area=121716
 Region 5: position (1990, 945), size 375x212, area=79500
 Region 6: position (327, 145), size 203x101, area=20503
 Region 7: position (1139, 3289), size 174x63, area=10962
 All results saved to: /Volumes/NV2/pdf_recognize/test_results/v5_pipeline
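
The two summaries can be compared region by region with an intersection-over-union (IoU) match. The sketch below uses the boxes listed in the two SUMMARY.txt files; the 0.5 threshold and the simple best-match pairing are assumptions made for illustration, not part of the tested pipeline.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


# Boxes copied from the v4 and v5 SUMMARY.txt files.
v4: List[Box] = [(1211, 1462, 965, 191), (1215, 877, 1150, 511),
                 (332, 150, 197, 96), (1147, 3303, 159, 42)]
v5: List[Box] = [(1218, 877, 1144, 511), (1213, 1457, 961, 196),
                 (228, 386, 2028, 209), (330, 310, 1932, 63),
                 (1990, 945, 375, 212), (327, 145, 203, 101),
                 (1139, 3289, 174, 63)]

THRESHOLD = 0.5  # assumed cut-off for "same region"
matched: Dict[int, int] = {}
for i, a in enumerate(v4):
    best = max(range(len(v5)), key=lambda j: iou(a, v5[j]))
    if iou(a, v5[best]) >= THRESHOLD:
        matched[i] = best
v5_only = [j for j in range(len(v5)) if j not in matched.values()]

for i, j in matched.items():
    print(f"v4 region {i + 1} ~ v5 region {j + 1}: IoU={iou(v4[i], v5[j]):.3f}")
print("v5-only regions:", [j + 1 for j in v5_only])
```

Under these assumptions every v4 region pairs with a v5 region (the main signature region with an IoU above 0.99), and v5 regions 3, 4 and 5 have no v4 counterpart, which is consistent with the over-detection noted for v5.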
--- a/test_mask_and_detect.py
+++ b/test_mask_and_detect.py
@@ -0,0 +1,216 @@
 #!/usr/bin/env python3
 """
 Test PaddleOCR Masking + Region Detection Pipeline
 This script demonstrates:
 1. PaddleOCR detects printed text bounding boxes
 2. Mask out all printed text areas (fill with black)
 3. Detect remaining non-white regions (potential handwriting)
 4. Visualize the results
 """
 import fitz  # PyMuPDF
 import numpy as np
 import cv2
 from pathlib import Path
 from paddleocr_client import create_ocr_client
 # Configuration
 TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
 OUTPUT_DIR = "/Volumes/NV2/PDF-Processing/signature-image-output/mask_test"
 DPI = 300
 # Region detection parameters
 MIN_REGION_AREA = 3000      # Minimum pixels for a region
 MAX_REGION_AREA = 300000    # Maximum pixels for a region
 MIN_ASPECT_RATIO = 0.3      # Minimum width/height ratio
 MAX_ASPECT_RATIO = 15.0     # Maximum width/height ratio
 print("="*80)
 print("PaddleOCR Masking + Region Detection Test")
 print("="*80)
 # Create output directory
 Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
 # Step 1: Connect to PaddleOCR server
 print("\n1. Connecting to PaddleOCR server...")
 try:
    ocr_client = create_ocr_client()
    print(f"   ✅ Connected: {ocr_client.server_url}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 2: Render PDF to image
 print("\n2. Rendering PDF to image...")
 try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    original_image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    if pix.n == 4:  # RGBA
        original_image = cv2.cvtColor(original_image, cv2.COLOR_RGBA2RGB)
    print(f"   ✅ Rendered: {original_image.shape[1]}x{original_image.shape[0]} pixels")
    doc.close()
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 3: Detect printed text with PaddleOCR
 print("\n3. Detecting printed text with PaddleOCR...")
 try:
    text_boxes = ocr_client.get_text_boxes(original_image)
    print(f"   ✅ Detected {len(text_boxes)} text regions")
    # Show some sample boxes
    if text_boxes:
        print("   Sample text boxes (x, y, w, h):")
        for i, box in enumerate(text_boxes[:3]):
            print(f"      {i+1}. {box}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 4: Mask out printed text areas
 print("\n4. Masking printed text areas...")
 try:
    masked_image = original_image.copy()
    # Fill each text box with black
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(masked_image, (x, y), (x + w, y + h), (0, 0, 0), -1)
    print(f"   ✅ Masked {len(text_boxes)} text regions")
    # Save masked image
    masked_path = Path(OUTPUT_DIR) / "01_masked_image.png"
    cv2.imwrite(str(masked_path), cv2.cvtColor(masked_image, cv2.COLOR_RGB2BGR))
    print(f"   📁 Saved: {masked_path}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 5: Detect remaining non-white regions
 print("\n5. Detecting remaining non-white regions...")
 try:
    # Convert to grayscale
    gray = cv2.cvtColor(masked_image, cv2.COLOR_RGB2GRAY)
    # Threshold to find non-white areas
    # Anything darker than 250 is considered "content"
    _, binary = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY_INV)
    # Apply morphological operations to connect nearby regions
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
    # Find contours
    contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    print(f"   ✅ Found {len(contours)} contours")
    # Filter contours by size and aspect ratio
    potential_regions = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        area = w * h
        aspect_ratio = w / h if h > 0 else 0
        # Check constraints
        if (MIN_REGION_AREA <= area <= MAX_REGION_AREA and
            MIN_ASPECT_RATIO <= aspect_ratio <= MAX_ASPECT_RATIO):
            potential_regions.append({
                'box': (x, y, w, h),
                'area': area,
                'aspect_ratio': aspect_ratio
            })
    print(f"   ✅ Filtered to {len(potential_regions)} potential handwriting regions")
    # Show region details
    if potential_regions:
        print("\n   Detected regions:")
        for i, region in enumerate(potential_regions[:5]):
            x, y, w, h = region['box']
            print(f"      {i+1}. Box: ({x}, {y}, {w}, {h}), "
                  f"Area: {region['area']}, "
                  f"Aspect: {region['aspect_ratio']:.2f}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    import traceback
    traceback.print_exc()
    exit(1)
 # Step 6: Visualize results
 print("\n6. Creating visualizations...")
 try:
    # Visualization 1: Original with text boxes
    vis_original = original_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(vis_original, (x, y), (x + w, y + h), (0, 255, 0), 3)
    vis_original_path = Path(OUTPUT_DIR) / "02_original_with_text_boxes.png"
    cv2.imwrite(str(vis_original_path), cv2.cvtColor(vis_original, cv2.COLOR_RGB2BGR))
    print(f"   📁 Original + text boxes: {vis_original_path}")
    # Visualization 2: Masked image with detected regions
    vis_masked = masked_image.copy()
    for region in potential_regions:
        x, y, w, h = region['box']
        cv2.rectangle(vis_masked, (x, y), (x + w, y + h), (255, 0, 0), 3)
    vis_masked_path = Path(OUTPUT_DIR) / "03_masked_with_regions.png"
    cv2.imwrite(str(vis_masked_path), cv2.cvtColor(vis_masked, cv2.COLOR_RGB2BGR))
    print(f"   📁 Masked + regions: {vis_masked_path}")
    # Visualization 3: Binary threshold result
    binary_path = Path(OUTPUT_DIR) / "04_binary_threshold.png"
    cv2.imwrite(str(binary_path), binary)
    print(f"   📁 Binary threshold: {binary_path}")
    # Visualization 4: Morphed result
    morphed_path = Path(OUTPUT_DIR) / "05_morphed.png"
    cv2.imwrite(str(morphed_path), morphed)
    print(f"   📁 Morphed: {morphed_path}")
    # Extract and save each detected region
    print("\n7. Extracting detected regions...")
    for i, region in enumerate(potential_regions):
        x, y, w, h = region['box']
        # Add padding
        padding = 10
        x_pad = max(0, x - padding)
        y_pad = max(0, y - padding)
        w_pad = min(original_image.shape[1] - x_pad, w + 2*padding)
        h_pad = min(original_image.shape[0] - y_pad, h + 2*padding)
        # Extract region from original image
        region_img = original_image[y_pad:y_pad+h_pad, x_pad:x_pad+w_pad]
        # Save region
        region_path = Path(OUTPUT_DIR) / f"region_{i+1:02d}.png"
        cv2.imwrite(str(region_path), cv2.cvtColor(region_img, cv2.COLOR_RGB2BGR))
        print(f"   📁 Region {i+1}: {region_path}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    import traceback
    traceback.print_exc()
 print("\n" + "="*80)
 print("Test completed!")
 print(f"Results saved to: {OUTPUT_DIR}")
 print("="*80)
 print("\nSummary:")
 print(f"  - Printed text regions detected: {len(text_boxes)}")
 print(f"  - Potential handwriting regions: {len(potential_regions)}")
 print(f"  - Expected signatures: 2 (楊智惠, 張志銘)")
 print("="*80)
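
The padded crops in these scripts use `w + 2*padding` for the width and height. When a box lies closer than `padding` pixels to the left or top edge, that formula extends the crop slightly further right or down than intended. The sketch below clamps each edge separately; `pad_box` is a hypothetical helper and the boxes are made-up values.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def pad_box(box: Box, padding: int, img_w: int, img_h: int) -> Box:
    """Grow a box by `padding` on every side, clamped to the image bounds."""
    x, y, w, h = box
    left = max(0, x - padding)
    top = max(0, y - padding)
    right = min(img_w, x + w + padding)
    bottom = min(img_h, y + h + padding)
    return left, top, right - left, bottom - top


print(pad_box((5, 5, 100, 50), 10, 1000, 1000))     # (0, 0, 115, 65)
print(pad_box((950, 980, 40, 15), 10, 1000, 1000))  # (940, 970, 60, 30)
```

For the first box the script's formula would give a width of 120 rather than 115, that is, 5 pixels more on the right than the requested padding.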
--- a/test_opencv_advanced.py
+++ b/test_opencv_advanced.py
@@ -0,0 +1,256 @@
 #!/usr/bin/env python3
 """
 Advanced OpenCV separation based on key observations:
 1. Handwriting is LARGER than printed text
 2. Handwriting strokes are LONGER
 3. Printed text (standard Kai typeface) is regular; handwriting is messy
 """
 import cv2
 import numpy as np
 from pathlib import Path
 from scipy import ndimage
 # Test image
 TEST_IMAGE = "/Volumes/NV2/PDF-Processing/signature-image-output/paddleocr_improved/signature_02_original.png"
 OUTPUT_DIR = "/Volumes/NV2/PDF-Processing/signature-image-output/opencv_advanced_test"
 print("="*80)
 print("Advanced OpenCV Separation - Size + Stroke Length + Regularity")
 print("="*80)
 Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
 # Load and preprocess
 image = cv2.imread(TEST_IMAGE)
 gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
 _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
 print(f"\nImage: {image.shape[1]}x{image.shape[0]}")
 # Save binary
 cv2.imwrite(str(Path(OUTPUT_DIR) / "00_binary.png"), binary)
 print("\n" + "="*80)
 print("METHOD 3: Comprehensive Feature Analysis")
 print("="*80)
 # Find connected components
 num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
 print(f"\nFound {num_labels - 1} connected components")
 print("\nAnalyzing each component...")
 # Store analysis for each component
 components_analysis = []
 for i in range(1, num_labels):
    x, y, w, h, area = stats[i]
    # Extract component mask
    component_mask = (labels == i).astype(np.uint8) * 255
    # ============================================
    # FEATURE 1: Size (handwriting is larger than printed text)
    # ============================================
    bbox_area = w * h
    font_height = h  # Character height is a good indicator
    # ============================================
    # FEATURE 2: Stroke length
    # ============================================
    # Skeletonize to get the actual stroke centerline
    from skimage.morphology import skeletonize
    skeleton = skeletonize(component_mask // 255)
    stroke_length = np.sum(skeleton)  # Skeleton pixel count, a proxy for total stroke length
    # Stroke length ratio (length relative to area)
    stroke_length_ratio = stroke_length / area if area > 0 else 0
    # ============================================
    # FEATURE 3: Regularity vs Messiness
    # ============================================
    # 3a. Compactness (regular shapes are more compact)
    contours, _ = cv2.findContours(component_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        perimeter = cv2.arcLength(contours[0], True)
        compactness = (4 * np.pi * area) / (perimeter * perimeter) if perimeter > 0 else 0
    else:
        perimeter = 0  # Keep perimeter defined for the edge-roughness feature below
        compactness = 0
    # 3b. Solidity (ratio of area to convex hull area)
    if contours:
        hull = cv2.convexHull(contours[0])
        hull_area = cv2.contourArea(hull)
        solidity = area / hull_area if hull_area > 0 else 0
    else:
        solidity = 0
    # 3c. Extent (ratio of area to bounding box area)
    extent = area / bbox_area if bbox_area > 0 else 0
    # 3d. Edge roughness (measure irregularity)
    # More irregular edges = more "messy" = likely handwriting
    edges = cv2.Canny(component_mask, 50, 150)
    edge_pixels = np.sum(edges > 0)
    edge_roughness = edge_pixels / perimeter if perimeter > 0 else 0
    # ============================================
    # CLASSIFICATION LOGIC
    # ============================================
    # Large characters are likely handwriting
    is_large = font_height > 40  # Threshold for "large" characters
    # Long strokes relative to area indicate handwriting
    is_long_stroke = stroke_length_ratio > 0.4  # Handwriting has higher ratio
    # Regular shapes (high compactness, high solidity) = printed
    # Irregular shapes (low compactness, low solidity) = handwriting
    is_irregular = compactness < 0.3 or solidity < 0.7 or extent < 0.5
    # DECISION RULES
    handwriting_score = 0
    # Size-based scoring (important!)
    if font_height > 50:
        handwriting_score += 3  # Very large = likely handwriting
    elif font_height > 35:
        handwriting_score += 2  # Medium-large = possibly handwriting
    elif font_height < 25:
        handwriting_score -= 2  # Small = likely printed
    # Stroke length scoring
    if stroke_length_ratio > 0.5:
        handwriting_score += 2  # Long strokes
    elif stroke_length_ratio > 0.35:
        handwriting_score += 1
    # Regularity scoring (printed Kai typeface is regular, handwriting is messy)
    if is_irregular:
        handwriting_score += 1  # Irregular = handwriting
    else:
        handwriting_score -= 1  # Regular = printed
    # Area scoring
    if area > 2000:
        handwriting_score += 2  # Large area = handwriting
    elif area < 500:
        handwriting_score -= 1  # Small area = printed
    # Final classification
    is_handwriting = handwriting_score > 0
    components_analysis.append({
        'id': i,
        'box': (x, y, w, h),
        'area': area,
        'height': font_height,
        'stroke_length': stroke_length,
        'stroke_ratio': stroke_length_ratio,
        'compactness': compactness,
        'solidity': solidity,
        'extent': extent,
        'edge_roughness': edge_roughness,
        'handwriting_score': handwriting_score,
        'is_handwriting': is_handwriting,
        'mask': component_mask
    })
 # Sort by area (largest first)
 components_analysis.sort(key=lambda c: c['area'], reverse=True)
 # Print analysis
 print("\n" + "-"*80)
 print("Top 10 Components Analysis:")
 print("-"*80)
 print(f"{'ID':<4} {'Area':<6} {'H':<4} {'StrokeLen':<9} {'StrokeR':<7} {'Compact':<7} "
      f"{'Solid':<6} {'Score':<5} {'Type':<12}")
 print("-"*80)
 for i, comp in enumerate(components_analysis[:10]):
    comp_type = "✅ Handwriting" if comp['is_handwriting'] else "❌ Printed"
    print(f"{comp['id']:<4} {comp['area']:<6} {comp['height']:<4} "
          f"{comp['stroke_length']:<9.0f} {comp['stroke_ratio']:<7.3f} "
          f"{comp['compactness']:<7.3f} {comp['solidity']:<6.3f} "
          f"{comp['handwriting_score']:>+5} {comp_type:<12}")
 # Create masks
 handwriting_mask = np.zeros_like(binary)
 printed_mask = np.zeros_like(binary)
 for comp in components_analysis:
    if comp['is_handwriting']:
        handwriting_mask = cv2.bitwise_or(handwriting_mask, comp['mask'])
    else:
        printed_mask = cv2.bitwise_or(printed_mask, comp['mask'])
 # Statistics
 hw_count = sum(1 for c in components_analysis if c['is_handwriting'])
 pr_count = sum(1 for c in components_analysis if not c['is_handwriting'])
 print("\n" + "="*80)
 print("Classification Results:")
 print("="*80)
 print(f"  Handwriting components: {hw_count}")
 print(f"  Printed components: {pr_count}")
 print(f"  Total: {len(components_analysis)}")
 # Apply to original image
 result_handwriting = cv2.bitwise_and(image, image, mask=handwriting_mask)
 result_printed = cv2.bitwise_and(image, image, mask=printed_mask)
 # Save results
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_handwriting_mask.png"), handwriting_mask)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_printed_mask.png"), printed_mask)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_handwriting_result.png"), result_handwriting)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_printed_result.png"), result_printed)
 # Create visualization
 vis_overlay = image.copy()
 vis_overlay[handwriting_mask > 0] = [0, 255, 0]  # Green for handwriting
 vis_overlay[printed_mask > 0] = [0, 0, 255]      # Red for printed
 vis_final = cv2.addWeighted(image, 0.6, vis_overlay, 0.4, 0)
 # Add labels to visualization
 for comp in components_analysis[:15]:  # Label top 15
    x, y, w, h = comp['box']
    cx, cy = x + w//2, y + h//2
    color = (0, 255, 0) if comp['is_handwriting'] else (0, 0, 255)
    label = f"H{comp['handwriting_score']:+d}" if comp['is_handwriting'] else f"P{comp['handwriting_score']:+d}"
    cv2.putText(vis_final, label, (cx-15, cy), cv2.FONT_HERSHEY_SIMPLEX, 0.4, color, 1)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_visualization.png"), vis_final)
 print("\n📁 Saved results:")
 print("  - method3_handwriting_mask.png")
 print("  - method3_printed_mask.png")
 print("  - method3_handwriting_result.png")
 print("  - method3_printed_result.png")
 print("  - method3_visualization.png")
 # Calculate content pixels
 hw_pixels = np.count_nonzero(handwriting_mask)
 pr_pixels = np.count_nonzero(printed_mask)
 total_pixels = np.count_nonzero(binary)
 print("\n" + "="*80)
 print("Pixel Distribution:")
 print("="*80)
 print(f"  Total foreground:   {total_pixels:6d} pixels (100.0%)")
 print(f"  Handwriting:        {hw_pixels:6d} pixels ({hw_pixels/total_pixels*100:5.1f}%)")
 print(f"  Printed:            {pr_pixels:6d} pixels ({pr_pixels/total_pixels*100:5.1f}%)")
 print("\n" + "="*80)
 print("Test completed!")
 print(f"Results: {OUTPUT_DIR}")
 print("="*80)
 print("\n📊 Feature Analysis Summary:")
 print("  ✅ Size-based classification: Large characters → Handwriting")
 print("  ✅ Stroke length analysis: Long stroke ratio → Handwriting")
 print("  ✅ Regularity analysis: Irregular shapes → Handwriting")
 print("\nNext: Review visualization to tune thresholds if needed")
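The decision rules of Method 3 can be restated as a pure function, which makes the thresholds easier to unit-test and tune. This is a sketch that copies the thresholds from the script above; the function name `handwriting_score` and the two sets of feature values in the demo are illustrative assumptions, not measurements from the test image.

```python
def handwriting_score(height, stroke_ratio, compactness, solidity, extent, area):
    """Additive score mirroring the script's rules; > 0 is treated as handwriting."""
    score = 0
    # Size: tall components lean towards handwriting, short ones towards print
    if height > 50:
        score += 3
    elif height > 35:
        score += 2
    elif height < 25:
        score -= 2
    # Stroke length relative to area
    if stroke_ratio > 0.5:
        score += 2
    elif stroke_ratio > 0.35:
        score += 1
    # Regularity: any one of the three shape tests marks a component as irregular
    irregular = compactness < 0.3 or solidity < 0.7 or extent < 0.5
    score += 1 if irregular else -1
    # Pixel area
    if area > 2000:
        score += 2
    elif area < 500:
        score -= 1
    return score


if __name__ == "__main__":
    # Hypothetical feature values for a large signature stroke and a small printed glyph
    print(handwriting_score(120, 0.20, 0.05, 0.30, 0.20, 5000))
    print(handwriting_score(20, 0.30, 0.50, 0.85, 0.60, 300))
```

Because height and area both reward large components, size contributes up to 5 of the maximum 8 points, so the size thresholds are the first ones to revisit if the scan resolution changes.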
--- a/test_opencv_separation.py
+++ b/test_opencv_separation.py
@@ -0,0 +1,272 @@
 #!/usr/bin/env python3
 """
 Test OpenCV methods to separate handwriting from printed text.
 Tests two methods:
 1. Stroke Width Analysis
 2. Connected Components + Shape Features
 """
 import cv2
 import numpy as np
 from pathlib import Path
 # Test image - contains both printed and handwritten
 TEST_IMAGE = "/Volumes/NV2/PDF-Processing/signature-image-output/paddleocr_improved/signature_02_original.png"
 OUTPUT_DIR = "/Volumes/NV2/PDF-Processing/signature-image-output/opencv_separation_test"
 print("="*80)
 print("OpenCV Handwriting Separation Test")
 print("="*80)
 # Create output directory
 Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
 # Load image
 print(f"\nLoading test image: {Path(TEST_IMAGE).name}")
 image = cv2.imread(TEST_IMAGE)
 if image is None:
    print(f"Error: Cannot load image from {TEST_IMAGE}")
    exit(1)
 image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
 print(f"Image size: {image.shape[1]}x{image.shape[0]}")
 # Convert to grayscale
 gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
 # Binarize
 _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
 # Save binary for reference
 cv2.imwrite(str(Path(OUTPUT_DIR) / "00_binary.png"), binary)
 print("\n📁 Saved: 00_binary.png")
 print("\n" + "="*80)
 print("METHOD 1: Stroke Width Analysis")
 print("="*80)
 def method1_stroke_width(binary_img, threshold_values=(2.0, 3.0, 4.0, 5.0)):
    """
    Method 1: Separate by stroke width using distance transform
    Args:
        binary_img: Binary image (foreground = 255, background = 0)
        threshold_values: List of distance thresholds to test
    Returns:
        List of (threshold, result_image) tuples
    """
    results = []
    # Calculate distance transform
    dist_transform = cv2.distanceTransform(binary_img, cv2.DIST_L2, 5)
    # Normalize for visualization
    dist_normalized = cv2.normalize(dist_transform, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U)
    results.append(('distance_transform', dist_normalized))
    print("\n  Distance transform statistics:")
    print(f"    Min: {dist_transform.min():.2f}")
    print(f"    Max: {dist_transform.max():.2f}")
    print(f"    Mean: {dist_transform.mean():.2f}")
    print(f"    Median: {np.median(dist_transform):.2f}")
    # Test different thresholds
    print("\n  Testing different stroke width thresholds:")
    for threshold in threshold_values:
        # Pixels with distance > threshold are considered "thick strokes" (handwriting)
        handwriting_mask = (dist_transform > threshold).astype(np.uint8) * 255
        # Count pixels
        total_foreground = np.count_nonzero(binary_img)
        handwriting_pixels = np.count_nonzero(handwriting_mask)
        percentage = (handwriting_pixels / total_foreground * 100) if total_foreground > 0 else 0
        print(f"    Threshold {threshold:.1f}: {handwriting_pixels} pixels ({percentage:.1f}% of foreground)")
        results.append((f'threshold_{threshold:.1f}', handwriting_mask))
    return results
 # Run Method 1
 method1_results = method1_stroke_width(binary, threshold_values=[2.0, 2.5, 3.0, 3.5, 4.0, 5.0])
 # Save Method 1 results
 print("\n  Saving results...")
 for name, result_img in method1_results:
    output_path = Path(OUTPUT_DIR) / f"method1_{name}.png"
    cv2.imwrite(str(output_path), result_img)
    print(f"    📁 {output_path.name}")
 # Apply best threshold result to original image
 best_threshold = 3.0  # Will adjust based on visual inspection
 # Look the mask up by the same name format used when the results were stored
 best_mask = dict(method1_results)[f'threshold_{best_threshold:.1f}']
 # Dilate mask slightly to connect nearby strokes
 kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
 best_mask_dilated = cv2.dilate(best_mask, kernel, iterations=1)
 # Apply to color image
 result_method1 = cv2.bitwise_and(image, image, mask=best_mask_dilated)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method1_final_result.png"), result_method1)
 print(f"\n  📁 Final result: method1_final_result.png (threshold={best_threshold})")
 print("\n" + "="*80)
 print("METHOD 2: Connected Components + Shape Features")
 print("="*80)
 def method2_component_analysis(binary_img, original_img):
    """
    Method 2: Analyze each connected component's shape features
    Printed text characteristics:
    - Regular bounding box (aspect ratio between 0.3 and 3.0)
    - Small to medium size (200-1000 pixels)
    - High circularity/compactness (recorded for inspection, not used in the rule)
    Handwriting characteristics:
    - Irregular shapes
    - May be large (connected strokes)
    - Variable aspect ratios
    """
    # Find connected components
    num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_img, connectivity=8)
    print(f"\n  Found {num_labels - 1} connected components")
    # Create masks for different categories
    handwriting_mask = np.zeros_like(binary_img)
    printed_mask = np.zeros_like(binary_img)
    # Analyze each component
    component_info = []
    for i in range(1, num_labels):  # Skip background (0)
        x, y, w, h, area = stats[i]
        # Calculate features
        aspect_ratio = w / h if h > 0 else 0
        perimeter = cv2.arcLength(cv2.findContours((labels == i).astype(np.uint8),
                                                    cv2.RETR_EXTERNAL,
                                                    cv2.CHAIN_APPROX_SIMPLE)[0][0], True)
        compactness = (4 * np.pi * area) / (perimeter * perimeter) if perimeter > 0 else 0
        # Classification logic
        # Printed text: small-to-medium size with a regular aspect ratio
        is_printed = (
            (200 < area < 1000) and              # Small to medium size
            (0.3 < aspect_ratio < 3.0)           # Not too elongated
        )
        # Handwriting: everything that is not clearly printed. This covers
        # large components (area >= 1000), very elongated ones (connected
        # strokes, aspect > 3.0), very tall ones (aspect < 0.3), and small specks.
        is_handwriting = not is_printed
        component_info.append({
            'id': i,
            'area': area,
            'aspect_ratio': aspect_ratio,
            'compactness': compactness,
            'is_printed': is_printed,
            'is_handwriting': is_handwriting
        })
        # Assign to mask
        if is_handwriting:
            handwriting_mask[labels == i] = 255
        if is_printed:
            printed_mask[labels == i] = 255
    # Print statistics
    print("\n  Component statistics:")
    handwriting_components = [c for c in component_info if c['is_handwriting']]
    printed_components = [c for c in component_info if c['is_printed']]
    print(f"    Handwriting components: {len(handwriting_components)}")
    print(f"    Printed components: {len(printed_components)}")
    # Show top 5 largest components
    print("\n  Top 5 largest components:")
    sorted_components = sorted(component_info, key=lambda c: c['area'], reverse=True)
    for i, comp in enumerate(sorted_components[:5], 1):
        comp_type = "Handwriting" if comp['is_handwriting'] else "Printed"
        print(f"    {i}. Area: {comp['area']:5d}, Aspect: {comp['aspect_ratio']:.2f}, "
              f"Type: {comp_type}")
    return handwriting_mask, printed_mask, component_info
 # Run Method 2
 handwriting_mask_m2, printed_mask_m2, components = method2_component_analysis(binary, image)
 # Save Method 2 results
 print("\n  Saving results...")
 # Handwriting mask
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_handwriting_mask.png"), handwriting_mask_m2)
 print(f"    📁 method2_handwriting_mask.png")
 # Printed mask
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_printed_mask.png"), printed_mask_m2)
 print(f"    📁 method2_printed_mask.png")
 # Apply to original image
 result_handwriting = cv2.bitwise_and(image, image, mask=handwriting_mask_m2)
 result_printed = cv2.bitwise_and(image, image, mask=printed_mask_m2)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_handwriting_result.png"), result_handwriting)
 print(f"    📁 method2_handwriting_result.png")
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_printed_result.png"), result_printed)
 print(f"    📁 method2_printed_result.png")
 # Create visualization overlay
 # Color code: green = handwriting, red = printed
 vis_overlay = image.copy()
 vis_overlay[handwriting_mask_m2 > 0] = [0, 255, 0]  # Green for handwriting
 vis_overlay[printed_mask_m2 > 0] = [0, 0, 255]      # Red for printed
 # Blend with original
 vis_final = cv2.addWeighted(image, 0.6, vis_overlay, 0.4, 0)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_visualization.png"), vis_final)
 print(f"    📁 method2_visualization.png (green=handwriting, red=printed)")
 print("\n" + "="*80)
 print("COMPARISON")
 print("="*80)
 # Count retained foreground pixels from the masks themselves. Counting
 # non-black pixels of the colour images would treat the white page
 # background of the original as content and skew the percentages.
 def count_content_pixels(mask):
    return np.count_nonzero(mask)
 original_pixels = count_content_pixels(binary)
 method1_pixels = count_content_pixels(cv2.bitwise_and(best_mask_dilated, binary))
 method2_pixels = count_content_pixels(handwriting_mask_m2)
 print("\nContent pixels retained:")
 print(f"  Original foreground: {original_pixels:6d} pixels")
 print(f"  Method 1 (stroke):  {method1_pixels:6d} pixels ({method1_pixels/original_pixels*100:.1f}%)")
 print(f"  Method 2 (component): {method2_pixels:6d} pixels ({method2_pixels/original_pixels*100:.1f}%)")
 print("\n" + "="*80)
 print("Test completed!")
 print(f"Results saved to: {OUTPUT_DIR}")
 print("="*80)
 print("\nNext steps:")
 print("  1. Review the output images")
 print("  2. Check which method better preserves handwriting")
 print("  3. Adjust thresholds if needed")
 print("  4. Choose the best method for production pipeline")
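The effective rule of Method 2 is compact enough to state as a standalone function. A minimal sketch, assuming the thresholds from the script above; `classify_component` is a hypothetical name and the component sizes in the demo are invented for illustration.

```python
def classify_component(area, width, height):
    """Label a connected component the way Method 2 effectively does.

    A component counts as printed only when it is small-to-medium
    (200 < area < 1000) and not too elongated (0.3 < aspect < 3.0);
    everything else falls through to handwriting.
    """
    aspect_ratio = width / height if height > 0 else 0
    is_printed = 200 < area < 1000 and 0.3 < aspect_ratio < 3.0
    return "printed" if is_printed else "handwriting"


if __name__ == "__main__":
    # Hypothetical components: (area, width, height)
    for comp in [(600, 30, 32), (4500, 300, 80), (600, 120, 20), (50, 8, 8)]:
        print(comp, classify_component(*comp))
```

Note that the last component, a small speck, is labeled handwriting because the rule defaults to handwriting for anything not clearly printed. That default is one plausible reason this method only reaches a medium-quality separation.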
--- a/test_paddleocr.py
+++ b/test_paddleocr.py
@@ -0,0 +1,102 @@
 #!/usr/bin/env python3
 """Test PaddleOCR on a sample PDF page."""
 import fitz  # PyMuPDF
 from paddleocr import PaddleOCR
 import numpy as np
 from PIL import Image
 import cv2
 from pathlib import Path
 # Configuration
 TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
 DPI = 300
 print("="*80)
 print("Testing PaddleOCR on macOS Apple Silicon")
 print("="*80)
 # Step 1: Render PDF to image
 print("\n1. Rendering PDF to image...")
 try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    if pix.n == 4:  # RGBA
        image = cv2.cvtColor(image, cv2.COLOR_RGBA2RGB)
    print(f"   ✅ Rendered: {image.shape[1]}x{image.shape[0]} pixels")
    doc.close()
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 2: Initialize PaddleOCR
 print("\n2. Initializing PaddleOCR...")
 print("   (First run will download models, may take a few minutes...)")
 try:
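    # Keyword names below follow the PaddleOCR 3.x docs, while the
    # ocr(..., cls=False) call in step 3 is the 2.x call style; match both
    # to the installed version.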
    ocr = PaddleOCR(
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_textline_orientation=False,
        lang='ch'  # Chinese language
    )
    print("   ✅ PaddleOCR initialized successfully")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    import traceback
    traceback.print_exc()
    print("\n   Note: PaddleOCR requires PaddlePaddle backend.")
    print("   If this is a module import error, PaddlePaddle may not support this platform.")
    exit(1)
 # Step 3: Run OCR
 print("\n3. Running OCR to detect printed text...")
 try:
    result = ocr.ocr(image, cls=False)
    if result and result[0]:
        print(f"   ✅ Detected {len(result[0])} text regions")
        # Show first few detections
        print("\n   Sample detections:")
        for i, item in enumerate(result[0][:5]):
            box = item[0]  # Bounding box coordinates
            text = item[1][0]  # Detected text
            confidence = item[1][1]  # Confidence score
            print(f"      {i+1}. Text: '{text}' (confidence: {confidence:.2f})")
            print(f"         Box: {box}")
    else:
        print("   ⚠️  No text detected")
 except Exception as e:
    print(f"   ❌ Error during OCR: {e}")
    import traceback
    traceback.print_exc()
    exit(1)
 # Step 4: Visualize detection
 print("\n4. Creating visualization...")
 try:
    vis_image = image.copy()
    if result and result[0]:
        for item in result[0]:
            box = np.array(item[0], dtype=np.int32)
            cv2.polylines(vis_image, [box], True, (0, 255, 0), 2)
    # Save visualization
    output_path = "/Volumes/NV2/PDF-Processing/signature-image-output/paddleocr_test_detection.png"
    cv2.imwrite(output_path, cv2.cvtColor(vis_image, cv2.COLOR_RGB2BGR))
    print(f"   ✅ Saved visualization: {output_path}")
 except Exception as e:
    print(f"   ❌ Error during visualization: {e}")
 print("\n" + "="*80)
 print("PaddleOCR test completed!")
 print("="*80)
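The `fitz.Matrix(DPI/72, DPI/72)` call in step 1 works because PDF page coordinates are measured in points, with 72 points per inch. The sketch below shows only the arithmetic; `rendered_size` is a hypothetical helper, the 595 x 842 point page is an assumed size (roughly A4), and the actual pixmap dimensions may differ by a pixel depending on how the renderer rounds.

```python
def rendered_size(width_pt, height_pt, dpi=300):
    """Approximate pixel size of a PDF page rendered at `dpi`."""
    zoom = dpi / 72  # 72 PDF points per inch
    return round(width_pt * zoom), round(height_pt * zoom)


if __name__ == "__main__":
    # Assumed page size of 595 x 842 points
    print(rendered_size(595, 842, dpi=300))
    print(rendered_size(595, 842, dpi=72))
```

At 300 DPI the zoom factor is about 4.17, which is why pixel thresholds tuned at this resolution need to be rescaled if the DPI changes.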
--- a/test_paddleocr_client.py
+++ b/test_paddleocr_client.py
@@ -0,0 +1,81 @@
 #!/usr/bin/env python3
 """Test PaddleOCR client with a real PDF page."""
 import fitz  # PyMuPDF
 import numpy as np
 import cv2
 from paddleocr_client import create_ocr_client
 # Test PDF
 TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
 DPI = 300
 print("="*80)
 print("Testing PaddleOCR Client with Real PDF")
 print("="*80)
 # Step 1: Connect to server
 print("\n1. Connecting to PaddleOCR server...")
 try:
    client = create_ocr_client()
    print(f"   ✅ Connected: {client.server_url}")
 except Exception as e:
    print(f"   ❌ Connection failed: {e}")
    exit(1)
 # Step 2: Render PDF
 print("\n2. Rendering PDF to image...")
 try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    if pix.n == 4:  # RGBA
        image = cv2.cvtColor(image, cv2.COLOR_RGBA2RGB)
    print(f"   ✅ Rendered: {image.shape[1]}x{image.shape[0]} pixels")
    doc.close()
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 3: Run OCR
 print("\n3. Running OCR on image...")
 try:
    results = client.ocr(image)
    print(f"   ✅ OCR successful!")
    print(f"   Found {len(results)} text regions")
    # Show first few results
    if results:
        print("\n   Sample detections:")
        for i, result in enumerate(results[:5]):
            text = result['text']
            confidence = result['confidence']
            print(f"      {i+1}. '{text}' (confidence: {confidence:.2f})")
 except Exception as e:
    print(f"   ❌ OCR failed: {e}")
    import traceback
    traceback.print_exc()
    exit(1)
 # Step 4: Get bounding boxes
 print("\n4. Getting text bounding boxes...")
 try:
    boxes = client.get_text_boxes(image)
    print(f"   ✅ Got {len(boxes)} bounding boxes")
    if boxes:
        print("   Sample boxes (x, y, w, h):")
        for i, box in enumerate(boxes[:3]):
            print(f"      {i+1}. {box}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
 print("\n" + "="*80)
 print("Test completed successfully!")
 print("="*80)
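Step 4 prints boxes as `(x, y, w, h)`, whereas the direct PaddleOCR call in `test_paddleocr.py` yields four-point polygons. The implementation of `get_text_boxes` is not shown here, so the following is only an assumption about how such a conversion could look; `quad_to_xywh` is a hypothetical name and the sample polygon is made up.

```python
def quad_to_xywh(quad):
    """Convert a four-point polygon [[x, y], ...] to an axis-aligned (x, y, w, h) box."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    x0, y0 = min(xs), min(ys)
    return x0, y0, max(xs) - x0, max(ys) - y0


if __name__ == "__main__":
    # Hypothetical, slightly skewed text box
    print(quad_to_xywh([[10, 20], [110, 22], [109, 60], [9, 58]]))
```

Taking the min and max over all four corners means a skewed polygon is always fully covered by the resulting rectangle, which is the safer choice when the box is later used for masking.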
--- a/test_pp_ocrv5_api.py
+++ b/test_pp_ocrv5_api.py
@@ -0,0 +1,254 @@
 #!/usr/bin/env python3
 """
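 Test basic functionality of the PP-OCRv5 API
 Goals:
 1. Verify the correct way to call the API
 2. Inspect the full structure of the returned data
 3. Compare v4 and v5 detection results
 4. Check whether handwriting classification is available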
 """
 import sys
 import json
 import pprint
 from pathlib import Path
 # Test image path
 TEST_IMAGE = "/Volumes/NV2/pdf_recognize/test_images/page_0.png"
 def test_basic_import():
    """Test the basic import"""
    print("=" * 60)
    print("Test 1: Basic import")
    print("=" * 60)
    try:
        from paddleocr import PaddleOCR
        print("✅ Successfully imported PaddleOCR")
        return True
    except ImportError as e:
        print(f"❌ Import failed: {e}")
        return False
 def test_model_initialization():
    """Test model initialization"""
    print("\n" + "=" * 60)
    print("Test 2: Model initialization")
    print("=" * 60)
    try:
        from paddleocr import PaddleOCR
        print("\nInitializing PP-OCRv5...")
        # show_log is omitted: it is a PaddleOCR 2.x option that 3.x does not accept
        ocr = PaddleOCR(
            text_detection_model_name="PP-OCRv5_server_det",
            text_recognition_model_name="PP-OCRv5_server_rec",
            use_doc_orientation_classify=False,
            use_doc_unwarping=False,
            use_textline_orientation=False,
        )
        print("✅ Model initialized successfully")
        return ocr
    except Exception as e:
        print(f"❌ Initialization failed: {e}")
        import traceback
        traceback.print_exc()
        return None
 def test_prediction(ocr):
    """Test prediction"""
    print("\n" + "=" * 60)
    print("Test 3: Prediction")
    print("=" * 60)
    if not Path(TEST_IMAGE).exists():
        print(f"❌ Test image not found: {TEST_IMAGE}")
        return None
    try:
        print(f"\nPredicting on image: {TEST_IMAGE}")
        result = ocr.predict(TEST_IMAGE)
        print(f"✅ Prediction succeeded, returned {len(result)} result(s)")
        return result
    except Exception as e:
        print(f"❌ Prediction failed: {e}")
        import traceback
        traceback.print_exc()
        return None
 def analyze_result_structure(result):
    """Analyze the full structure of the returned result"""
    print("\n" + "=" * 60)
    print("Test 4: Analyze the result structure")
    print("=" * 60)
    if not result:
        print("❌ No result to analyze")
        return
    # Take the first result
    first_result = result[0]
    print("\nResult type:", type(first_result))
    print("Result attributes:", dir(first_result))
    # Check for a json attribute
    if hasattr(first_result, 'json'):
        print("\n✅ Found the .json attribute")
        json_data = first_result.json
        print("\nJSON data keys:")
        for key in json_data.keys():
            print(f"  - {key}: {type(json_data[key])}")
        # Look for fields related to handwriting classification
        print("\nSearching for handwriting classification fields...")
        handwriting_related_keys = [
            k for k in json_data.keys()
            if any(word in k.lower() for word in ['handwriting', 'handwritten', 'type', 'class', 'category'])
        ]
        if handwriting_related_keys:
            print(f"✅ Found possibly related fields: {handwriting_related_keys}")
            for key in handwriting_related_keys:
                print(f"  {key}: {json_data[key]}")
        else:
            print("❌ No handwriting classification fields found")
        # Print some of the detections
        if 'rec_texts' in json_data and json_data['rec_texts']:
            print("\nDetected text (first 5):")
            for i, text in enumerate(json_data['rec_texts'][:5]):
                box = json_data['rec_boxes'][i] if 'rec_boxes' in json_data else None
                score = json_data['rec_scores'][i] if 'rec_scores' in json_data else None
                print(f"  [{i}] Text: {text}")
                print(f"      Score: {score}")
                print(f"      Box: {box}")
        # Save the full JSON to a file
        output_path = "/Volumes/NV2/pdf_recognize/test_results/pp_ocrv5_result.json"
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(json_data, f, ensure_ascii=False, indent=2, default=str)
        print(f"\n✅ Full result saved to: {output_path}")
        return json_data
    else:
        print("❌ .json attribute not found")
        print("\nPrinting the result directly:")
        pprint.pprint(first_result)
 def compare_with_v4():
    """Compare v4 and v5 results"""
    print("\n" + "=" * 60)
    print("Test 5: Compare v4 and v5")
    print("=" * 60)
    try:
        from paddleocr import PaddleOCR
        # v4 (show_log is omitted: it is a PaddleOCR 2.x option that 3.x does not accept)
        print("\nInitializing PP-OCRv4...")
        ocr_v4 = PaddleOCR(
            ocr_version="PP-OCRv4",
            use_doc_orientation_classify=False,
        )
        print("Predicting with v4...")
        result_v4 = ocr_v4.predict(TEST_IMAGE)
        json_v4 = result_v4[0].json if hasattr(result_v4[0], 'json') else None
        # v5
        print("\nInitializing PP-OCRv5...")
        ocr_v5 = PaddleOCR(
            text_detection_model_name="PP-OCRv5_server_det",
            text_recognition_model_name="PP-OCRv5_server_rec",
            use_doc_orientation_classify=False,
        )
        print("Predicting with v5...")
        result_v5 = ocr_v5.predict(TEST_IMAGE)
        json_v5 = result_v5[0].json if hasattr(result_v5[0], 'json') else None
        # Compare
        if json_v4 and json_v5:
            print("\nComparison:")
            print(f"  v4 detected {len(json_v4.get('rec_texts', []))} text regions")
            print(f"  v5 detected {len(json_v5.get('rec_texts', []))} text regions")
            # Save the comparison
            comparison = {
                "v4": {
                    "count": len(json_v4.get('rec_texts', [])),
                    "texts": json_v4.get('rec_texts', [])[:10],  # First 10
                    "scores": json_v4.get('rec_scores', [])[:10]
                },
                "v5": {
                    "count": len(json_v5.get('rec_texts', [])),
                    "texts": json_v5.get('rec_texts', [])[:10],
                    "scores": json_v5.get('rec_scores', [])[:10]
                }
            }
            output_path = "/Volumes/NV2/pdf_recognize/test_results/v4_vs_v5_comparison.json"
            with open(output_path, 'w', encoding='utf-8') as f:
                json.dump(comparison, f, ensure_ascii=False, indent=2, default=str)
            print(f"\n✅ Comparison saved to: {output_path}")
    except Exception as e:
        print(f"❌ Comparison failed: {e}")
        import traceback
        traceback.print_exc()
 def main():
    """Main test flow"""
    print("Starting PP-OCRv5 API tests\n")
    # Test 1: import
    if not test_basic_import():
        print("\n❌ Import failed, cannot continue")
        return
    # Test 2: initialization
    ocr = test_model_initialization()
    if not ocr:
        print("\n❌ Initialization failed, cannot continue")
        return
    # Test 3: prediction
    result = test_prediction(ocr)
    if not result:
        print("\n❌ Prediction failed, cannot continue")
        return
    # Test 4: analyze structure
    json_data = analyze_result_structure(result)
    # Test 5: compare v4 and v5
    compare_with_v4()
    print("\n" + "=" * 60)
    print("Tests completed")
    print("=" * 60)
 if __name__ == "__main__":
    main()
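The field search in `analyze_result_structure` is a plain substring match, so any key that merely contains `type` or `class` is reported as a candidate. A small sketch of that behaviour follows; `find_candidate_keys` is a hypothetical name, and the key list is illustrative (only `rec_texts`, `rec_scores`, `rec_boxes`, and `text_type` are mentioned in this commit).

```python
KEYWORDS = ("handwriting", "handwritten", "type", "class", "category")


def find_candidate_keys(keys, keywords=KEYWORDS):
    """Return keys whose lowercase name contains any keyword as a substring."""
    return [k for k in keys if any(word in k.lower() for word in keywords)]


if __name__ == "__main__":
    # Illustrative key names; the real result may expose a different set
    sample_keys = ["input_path", "text_type", "rec_texts", "rec_scores", "rec_boxes"]
    print(find_candidate_keys(sample_keys))
```

A hit such as `text_type` only says the name matched. As the commit message notes, that field describes the language type rather than a handwritten/printed label, so every candidate still needs manual inspection.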
--- a/test_results/v5_analysis_report.txt
+++ b/test_results/v5_analysis_report.txt
@@ -0,0 +1,58 @@
 PP-OCRv5 detailed detection report
 ================================================================================
 Total: 50
 Mean confidence: 0.4579
 Full detection list:
 --------------------------------------------------------------------------------
 [ 0] 0.8783   202x100  KPMG
 [ 1] 0.9936  1931x 62  依本會計師核閱結果，除第三段及第四段所述該等被投資公司財務季報告倘經會計師核閱
 [ 2] 0.9976  2013x 62  ，對第一段所述合併財務季報告可能有所調整之影響外，並未發現第一段所述合併財務季報告
 [ 3] 0.9815  2025x 62  在所有重大方面有違反證券發行人財務報告編製準則及金融監督管理委員會認可之國際會計準
 [ 4] 0.9912  1125x 56  則第三十四號「期中財務報導」而須作修正之情事。
 [ 5] 0.9712   872x 61  安侯建業聯合會計師事務所
 [ 6] 0.9123   174x203  寶
 [ 7] 0.8466   166x179  蓮
 [ 8] 0.0000    36x 18  
 [ 9] 0.9968   175x193  周
 [10] 0.0000    33x 69  
 [11] 0.2521     7x 12  5
 [12] 0.0000    35x 13  
 [13] 0.0000    28x 10  
 [14] 0.4726    12x  9  vA
 [15] 0.1788     9x 11  上
 [16] 0.0000    38x 14  
 [17] 0.4133    21x  8  R-
 [18] 0.4681    15x  8  40
 [19] 0.0000    38x 13  
 [20] 0.5587    16x  7  GAN
 [21] 0.9623   291x 61  會計師：
 [22] 0.9893   213x234  魏
 [23] 0.1751   190x174  興
 [24] 0.8862   180x191  海
 [25] 0.0000    65x 17  
 [26] 0.5110    27x  7  U
 [27] 0.1669    10x  8  2
 [28] 0.4839    39x 10  eredooos
 [29] 0.1775    10x 24  B
 [30] 0.4896    29x 10  n
 [31] 0.3774     7x  7  1
 [32] 0.0000    34x 14  
 [33] 0.0000     7x 15  
 [34] 0.0000    12x 38  
 [35] 0.8701    22x 11  0
 [36] 0.2034     8x 23  40
 [37] 0.0000    20x 12  
 [38] 0.0000    29x 10  
 [39] 0.0970     9x 10  m
 [40] 0.3102    20x  7  A
 [41] 0.0000    34x  6  
 [42] 0.2435    21x  6  专
 [43] 0.3260    41x 15  o
 [44] 0.0000    31x  7  
 [45] 0.9769   960x 73  證券主管機關．金管證六字第0940100754號
 [46] 0.9747   899x 60  核准簽證文號(88)台財證(六)第18311號
 [47] 0.9205   824x 67  民國一〇二年五月二
 [48] 0.9996    47x 46  日
 [49] 0.8414   173x 62  ~3-1~
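The report above shows that many of the 50 detections carry empty text with a confidence of 0.0000 (for example indices 8, 10 and 12), and that a number of very small boxes have low scores. A minimal sketch of dropping such entries before they are used for masking follows. The helper name `filter_detections` and the `min_score` and `min_side` thresholds are assumptions made for illustration; they are not part of the scripts in this change and have not been tuned.

```python
from typing import List, Sequence, Tuple

# (text, score, (width, height)), mirroring the columns of the report
Detection = Tuple[str, float, Tuple[int, int]]


def filter_detections(
    detections: Sequence[Detection],
    min_score: float = 0.5,   # assumed threshold, not taken from the scripts
    min_side: int = 20,       # assumed minimum box side in pixels
) -> List[Detection]:
    """Keep detections that have text, a usable score, and a non-trivial size."""
    kept = []
    for text, score, (width, height) in detections:
        if not text.strip():
            continue  # empty recognition result
        if score < min_score:
            continue  # low-confidence result
        if min(width, height) < min_side:
            continue  # tiny box
        kept.append((text, score, (width, height)))
    return kept


# A few rows copied from the report above
sample = [
    ("KPMG", 0.8783, (202, 100)),
    ("", 0.0000, (36, 18)),
    ("5", 0.2521, (7, 12)),
    ("會計師：", 0.9623, (291, 61)),
    ("0", 0.8701, (22, 11)),
]
kept = filter_detections(sample)
print([text for text, _, _ in kept])
```

Whether large low-scoring boxes such as index 23 should be masked or kept as signature candidates depends on the goal of the pipeline, so any threshold of this kind would need to be checked against real pages.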
--- a/test_results/v5_pipeline/SUMMARY.txt
+++ b/test_results/v5_pipeline/SUMMARY.txt
@@ -0,0 +1,20 @@
 PP-OCRv5 Full Pipeline Test Results
 ============================================================
 1. OCR detection: 50 text regions
 2. Mask printed text: /Volumes/NV2/pdf_recognize/test_results/v5_pipeline/01_masked.png
 3. Candidate regions detected: 7
 4. Signatures extracted: 7
 Candidate region details:
 ------------------------------------------------------------
 Region 1: position (1218, 877), size 1144x511, area=584584
 Region 2: position (1213, 1457), size 961x196, area=188356
 Region 3: position (228, 386), size 2028x209, area=423852
 Region 4: position (330, 310), size 1932x63, area=121716
 Region 5: position (1990, 945), size 375x212, area=79500
 Region 6: position (327, 145), size 203x101, area=20503
 Region 7: position (1139, 3289), size 174x63, area=10962
 All results saved in: /Volumes/NV2/pdf_recognize/test_results/v5_pipeline
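To compare the v4 and v5 runs programmatically, the `Region N: ...` lines of a `SUMMARY.txt` file can be parsed back into numbers. The sketch below is an illustration only: `parse_regions` and the regular expression are hypothetical helpers, not part of this change. The pattern matches the digits and punctuation only, so it does not depend on the language of the labels.

```python
import re
from typing import Dict, List

# Matches "Region N: <label>(x, y), <label>WxH, <label>=AREA"
REGION_LINE = re.compile(
    r"Region (\d+): \D*\((\d+), (\d+)\)\D*?(\d+)x(\d+)\D*=(\d+)"
)


def parse_regions(summary_text: str) -> List[Dict[str, int]]:
    """Extract index, position, size and area from every region line."""
    regions = []
    for match in REGION_LINE.finditer(summary_text):
        index, x, y, w, h, area = (int(g) for g in match.groups())
        regions.append(
            {"index": index, "x": x, "y": y, "w": w, "h": h, "area": area}
        )
    return regions


# Two lines copied from the summary above
sample = """
Region 1: 位置(1218, 877), 大小1144x511, 面積=584584
Region 7: 位置(1139, 3289), 大小174x63, 面積=10962
"""
regions = parse_regions(sample)
for region in regions:
    # The reported area is the bounding-box area, i.e. width times height
    assert region["w"] * region["h"] == region["area"]
print(regions[0])
```

The area check also explains why Region 1 (584584) exceeds the `MAX_AREA = 300000` contour filter: the summary reports the bounding-box area after merging, not the contour area that was filtered.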
--- a/test_results/v5_result.json
+++ b/test_results/v5_result.json
--- a/test_v4_full_pipeline.py
+++ b/test_v4_full_pipeline.py
@@ -0,0 +1,290 @@
 #!/usr/bin/env python3
 """
 Run the full signature-extraction pipeline with PaddleOCR v2.7.3 (v4)
 for comparison with v5.
 """
 import sys
 import json
 import cv2
 import numpy as np
 import requests
 from pathlib import Path
 # Configuration
 OCR_SERVER = "http://192.168.30.36:5555"
 OUTPUT_DIR = Path("/Volumes/NV2/pdf_recognize/signature-comparison/v4-current")
 MASKING_PADDING = 0
 def setup_output_dir():
    """Create the output directory."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    print(f"Output directory: {OUTPUT_DIR}")
 def get_page_image():
    """Load the test page image."""
    test_image = "/Volumes/NV2/pdf_recognize/full_page_original.png"
    if Path(test_image).exists():
        return cv2.imread(test_image)
    else:
        print(f"❌ Test image not found: {test_image}")
        return None
 def call_ocr_server(image):
    """Call PaddleOCR v2.7.3 on the server."""
    print("\nCalling the PaddleOCR v2.7.3 server...")
    try:
        import base64
        # cv2.imencode returns (success_flag, buffer); check the flag before use
        ok, buffer = cv2.imencode('.png', image)
        if not ok:
            print("❌ Failed to encode the image as PNG")
            return None
        img_base64 = base64.b64encode(buffer).decode('utf-8')
        response = requests.post(
            f"{OCR_SERVER}/ocr",
            json={'image': img_base64},
            timeout=30
        )
        if response.status_code == 200:
            result = response.json()
            print(f"✅ OCR complete, detected {len(result.get('results', []))} text regions")
            return result.get('results', [])
        else:
            print(f"❌ Server error: {response.status_code}")
            return None
    except Exception as e:
        print(f"❌ OCR call failed: {e}")
        import traceback
        traceback.print_exc()
        return None
 def mask_printed_text(image, ocr_results):
    """Mask printed text."""
    print("\nMasking printed text...")
    masked_image = image.copy()
    for i, result in enumerate(ocr_results):
        box = result.get('box')
        if box is None:
            continue
        # v2.7.3 returns a polygon: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
        # Convert it to an axis-aligned bounding rectangle
        box_points = np.array(box)
        x_min = int(box_points[:, 0].min())
        y_min = int(box_points[:, 1].min())
        x_max = int(box_points[:, 0].max())
        y_max = int(box_points[:, 1].max())
        # Fill the rectangle with black (thickness -1 means filled)
        cv2.rectangle(
            masked_image,
            (x_min - MASKING_PADDING, y_min - MASKING_PADDING),
            (x_max + MASKING_PADDING, y_max + MASKING_PADDING),
            (0, 0, 0),
            -1
        )
    masked_path = OUTPUT_DIR / "01_masked.png"
    cv2.imwrite(str(masked_path), masked_image)
    print(f"✅ Masking complete: {masked_path}")
    return masked_image
 def detect_regions(masked_image):
    """Detect candidate regions."""
    print("\nDetecting candidate regions...")
    gray = cv2.cvtColor(masked_image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
    cv2.imwrite(str(OUTPUT_DIR / "02_binary.png"), binary)
    cv2.imwrite(str(OUTPUT_DIR / "03_morphed.png"), morphed)
    contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    MIN_AREA = 3000
    MAX_AREA = 300000
    candidate_regions = []
    for contour in contours:
        area = cv2.contourArea(contour)
        if MIN_AREA <= area <= MAX_AREA:
            x, y, w, h = cv2.boundingRect(contour)
            aspect_ratio = w / h if h > 0 else 0
            candidate_regions.append({
                'box': (x, y, w, h),
                'area': area,
                'aspect_ratio': aspect_ratio
            })
    candidate_regions.sort(key=lambda r: r['area'], reverse=True)
    print(f"✅ Found {len(candidate_regions)} candidate regions")
    return candidate_regions
 def merge_nearby_regions(regions, h_distance=100, v_distance=50):
    """Merge nearby regions."""
    print("\nMerging nearby regions...")
    if not regions:
        return []
    merged = []
    used = set()
    for i, r1 in enumerate(regions):
        if i in used:
            continue
        x1, y1, w1, h1 = r1['box']
        merged_box = [x1, y1, x1 + w1, y1 + h1]
        group = [i]
        for j, r2 in enumerate(regions):
            if j <= i or j in used:
                continue
            x2, y2, w2, h2 = r2['box']
            h_dist = min(abs(x1 - (x2 + w2)), abs((x1 + w1) - x2))
            v_dist = min(abs(y1 - (y2 + h2)), abs((y1 + h1) - y2))
            x_overlap = not (x1 + w1 < x2 or x2 + w2 < x1)
            y_overlap = not (y1 + h1 < y2 or y2 + h2 < y1)
            if (x_overlap and v_dist <= v_distance) or (y_overlap and h_dist <= h_distance):
                merged_box[0] = min(merged_box[0], x2)
                merged_box[1] = min(merged_box[1], y2)
                merged_box[2] = max(merged_box[2], x2 + w2)
                merged_box[3] = max(merged_box[3], y2 + h2)
                group.append(j)
                used.add(j)
        used.add(i)
        x, y = merged_box[0], merged_box[1]
        w, h = merged_box[2] - merged_box[0], merged_box[3] - merged_box[1]
        merged.append({
            'box': (x, y, w, h),
            'area': w * h,
            'merged_count': len(group)
        })
    print(f"✅ {len(merged)} regions remain after merging")
    return merged
 def extract_signatures(image, regions):
    """Extract signature regions."""
    print("\nExtracting signature regions...")
    vis_image = image.copy()
    for i, region in enumerate(regions):
        x, y, w, h = region['box']
        cv2.rectangle(vis_image, (x, y), (x + w, y + h), (0, 255, 0), 3)
        cv2.putText(vis_image, f"Region {i+1}", (x, y - 10),
                   cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        signature = image[y:y+h, x:x+w]
        sig_path = OUTPUT_DIR / f"signature_{i+1}.png"
        cv2.imwrite(str(sig_path), signature)
        print(f"  Region {i+1}: {w}x{h} px, area={region['area']}")
    vis_path = OUTPUT_DIR / "04_detected_regions.png"
    cv2.imwrite(str(vis_path), vis_image)
    print(f"\n✅ Annotated image saved: {vis_path}")
    return vis_image
 def generate_summary(ocr_count, regions):
    """Generate the summary report."""
    summary = f"""
 PaddleOCR v2.7.3 (v4) Full Pipeline Test Results
 {'=' * 60}
 1. OCR detection: {ocr_count} text regions
 2. Mask printed text: done
 3. Candidate regions detected: {len(regions)}
 4. Signatures extracted: {len(regions)}
 Candidate region details:
 {'-' * 60}
 """
    for i, region in enumerate(regions):
        x, y, w, h = region['box']
        area = region['area']
        summary += f"Region {i+1}: position ({x}, {y}), size {w}x{h}, area={area}\n"
    summary += f"\nAll results saved in: {OUTPUT_DIR}\n"
    return summary
 def main():
    print("=" * 60)
    print("PaddleOCR v2.7.3 (v4) Full Pipeline Test")
    print("=" * 60)
    setup_output_dir()
    print("\n1. Loading the test image...")
    image = get_page_image()
    if image is None:
        return
    print(f"   Image shape: {image.shape}")
    cv2.imwrite(str(OUTPUT_DIR / "00_original.png"), image)
    print("\n2. Detecting text with PaddleOCR v2.7.3...")
    ocr_results = call_ocr_server(image)
    if ocr_results is None:
        print("❌ OCR failed; aborting the test")
        return
    print("\n3. Masking printed text...")
    masked_image = mask_printed_text(image, ocr_results)
    print("\n4. Detecting candidate regions...")
    regions = detect_regions(masked_image)
    print("\n5. Merging nearby regions...")
    merged_regions = merge_nearby_regions(regions)
    print("\n6. Extracting signatures...")
    vis_image = extract_signatures(image, merged_regions)
    print("\n7. Generating the summary report...")
    summary = generate_summary(len(ocr_results), merged_regions)
    print(summary)
    summary_path = OUTPUT_DIR / "SUMMARY.txt"
    with open(summary_path, 'w', encoding='utf-8') as f:
        f.write(summary)
    print("=" * 60)
    print("✅ v4 test complete!")
    print(f"Results directory: {OUTPUT_DIR}")
    print("=" * 60)
 if __name__ == "__main__":
    main()
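`mask_printed_text` in the v4 script turns each 4-point polygon into an axis-aligned rectangle and expands it by `MASKING_PADDING`. With a non-zero padding the corners can fall outside the image; if such a rectangle is later used for NumPy slicing, a negative start index would wrap around instead of clipping. The sketch below shows one way to pad and clamp in a single step. `polygon_to_rect` is a hypothetical helper, and the 2480x3508 page size in the usage line is an assumed example, not a value taken from the scripts.

```python
from typing import Sequence, Tuple


def polygon_to_rect(
    polygon: Sequence[Sequence[float]],
    padding: int,
    image_width: int,
    image_height: int,
) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max), padded and clamped to the image."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    x_min = max(0, int(min(xs)) - padding)
    y_min = max(0, int(min(ys)) - padding)
    x_max = min(image_width - 1, int(max(xs)) + padding)
    y_max = min(image_height - 1, int(max(ys)) + padding)
    return x_min, y_min, x_max, y_max


# A slightly skewed text box near the top-left corner of the page
box = [[10, 5], [210, 8], [208, 60], [9, 57]]
rect = polygon_to_rect(box, padding=25, image_width=2480, image_height=3508)
print(rect)  # (0, 0, 235, 85)
```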
--- a/test_v5_full_pipeline.py
+++ b/test_v5_full_pipeline.py
@@ -0,0 +1,322 @@
 #!/usr/bin/env python3
 """
 Run the full signature-extraction pipeline with PP-OCRv5.
 Steps:
 1. Detect text with PP-OCRv5 on the server
 2. Mask printed text
 3. Detect candidate regions
 4. Extract signatures
 """
 import sys
 import json
 import cv2
 import numpy as np
 import requests
 from pathlib import Path
 # Configuration
 OCR_SERVER = "http://192.168.30.36:5555"
 PDF_PATH = "/Volumes/NV2/pdf_recognize/test.pdf"
 OUTPUT_DIR = Path("/Volumes/NV2/pdf_recognize/test_results/v5_pipeline")
 MASKING_PADDING = 0
 def setup_output_dir():
    """Create the output directory."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    print(f"Output directory: {OUTPUT_DIR}")
 def get_page_image():
    """Load the test page image."""
    # Use the existing test image
    test_image = "/Volumes/NV2/pdf_recognize/full_page_original.png"
    if Path(test_image).exists():
        return cv2.imread(test_image)
    else:
        print(f"❌ Test image not found: {test_image}")
        return None
 def call_ocr_server(image):
    """Call PP-OCRv5 on the server."""
    print("\nCalling the PP-OCRv5 server...")
    try:
        # Encode the image; cv2.imencode returns (success_flag, buffer)
        import base64
        ok, buffer = cv2.imencode('.png', image)
        if not ok:
            print("❌ Failed to encode the image as PNG")
            return None
        img_base64 = base64.b64encode(buffer).decode('utf-8')
        # Send the request
        response = requests.post(
            f"{OCR_SERVER}/ocr",
            json={'image': img_base64},
            timeout=30
        )
        if response.status_code == 200:
            result = response.json()
            print(f"✅ OCR complete, detected {len(result.get('results', []))} text regions")
            return result.get('results', [])
        else:
            print(f"❌ Server error: {response.status_code}")
            return None
    except Exception as e:
        print(f"❌ OCR call failed: {e}")
        import traceback
        traceback.print_exc()
        return None
 def mask_printed_text(image, ocr_results):
    """Mask printed text."""
    print("\nMasking printed text...")
    masked_image = image.copy()
    for i, result in enumerate(ocr_results):
        box = result.get('box')
        if box is None:
            continue
        # box format: [x, y, w, h]; cast to int in case the JSON values are floats
        x, y, w, h = (int(v) for v in box)
        # Mask with a filled black rectangle
        cv2.rectangle(
            masked_image,
            (x - MASKING_PADDING, y - MASKING_PADDING),
            (x + w + MASKING_PADDING, y + h + MASKING_PADDING),
            (0, 0, 0),
            -1
        )
    # Save the masked image
    masked_path = OUTPUT_DIR / "01_masked.png"
    cv2.imwrite(str(masked_path), masked_image)
    print(f"✅ Masking complete: {masked_path}")
    return masked_image
 def detect_regions(masked_image):
    """Detect candidate regions."""
    print("\nDetecting candidate regions...")
    # Convert to grayscale
    gray = cv2.cvtColor(masked_image, cv2.COLOR_BGR2GRAY)
    # Binarize (dark ink becomes white foreground)
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    # Morphological closing to join nearby strokes
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
    # Save intermediate results
    cv2.imwrite(str(OUTPUT_DIR / "02_binary.png"), binary)
    cv2.imwrite(str(OUTPUT_DIR / "03_morphed.png"), morphed)
    # Find contours
    contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Filter candidate regions by contour area
    MIN_AREA = 3000
    MAX_AREA = 300000
    candidate_regions = []
    for contour in contours:
        area = cv2.contourArea(contour)
        if MIN_AREA <= area <= MAX_AREA:
            x, y, w, h = cv2.boundingRect(contour)
            aspect_ratio = w / h if h > 0 else 0
            candidate_regions.append({
                'box': (x, y, w, h),
                'area': area,
                'aspect_ratio': aspect_ratio
            })
    # Sort by area, largest first
    candidate_regions.sort(key=lambda r: r['area'], reverse=True)
    print(f"✅ Found {len(candidate_regions)} candidate regions")
    return candidate_regions
 def merge_nearby_regions(regions, h_distance=100, v_distance=50):
    """Merge nearby regions."""
    print("\nMerging nearby regions...")
    if not regions:
        return []
    merged = []
    used = set()
    for i, r1 in enumerate(regions):
        if i in used:
            continue
        x1, y1, w1, h1 = r1['box']
        merged_box = [x1, y1, x1 + w1, y1 + h1]  # [x_min, y_min, x_max, y_max]
        group = [i]
        for j, r2 in enumerate(regions):
            if j <= i or j in used:
                continue
            x2, y2, w2, h2 = r2['box']
            # Compute gaps relative to the seed region r1 (not the growing merged box)
            h_dist = min(abs(x1 - (x2 + w2)), abs((x1 + w1) - x2))
            v_dist = min(abs(y1 - (y2 + h2)), abs((y1 + h1) - y2))
            # Check for overlap or proximity
            x_overlap = not (x1 + w1 < x2 or x2 + w2 < x1)
            y_overlap = not (y1 + h1 < y2 or y2 + h2 < y1)
            if (x_overlap and v_dist <= v_distance) or (y_overlap and h_dist <= h_distance):
                # Merge
                merged_box[0] = min(merged_box[0], x2)
                merged_box[1] = min(merged_box[1], y2)
                merged_box[2] = max(merged_box[2], x2 + w2)
                merged_box[3] = max(merged_box[3], y2 + h2)
                group.append(j)
                used.add(j)
        used.add(i)
        # Convert back to (x, y, w, h)
        x, y = merged_box[0], merged_box[1]
        w, h = merged_box[2] - merged_box[0], merged_box[3] - merged_box[1]
        merged.append({
            'box': (x, y, w, h),
            'area': w * h,
            'merged_count': len(group)
        })
    print(f"✅ {len(merged)} regions remain after merging")
    return merged
 def extract_signatures(image, regions):
    """Extract signature regions."""
    print("\nExtracting signature regions...")
    # Annotate every region on a copy of the image
    vis_image = image.copy()
    for i, region in enumerate(regions):
        x, y, w, h = region['box']
        # Draw the box
        cv2.rectangle(vis_image, (x, y), (x + w, y + h), (0, 255, 0), 3)
        cv2.putText(vis_image, f"Region {i+1}", (x, y - 10),
                   cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        # Crop and save
        signature = image[y:y+h, x:x+w]
        sig_path = OUTPUT_DIR / f"signature_{i+1}.png"
        cv2.imwrite(str(sig_path), signature)
        print(f"  Region {i+1}: {w}x{h} px, area={region['area']}")
    # Save the annotated image
    vis_path = OUTPUT_DIR / "04_detected_regions.png"
    cv2.imwrite(str(vis_path), vis_image)
    print(f"\n✅ Annotated image saved: {vis_path}")
    return vis_image
 def generate_summary(ocr_count, masked_path, regions):
    """Generate the summary report."""
    summary = f"""
 PP-OCRv5 Full Pipeline Test Results
 {'=' * 60}
 1. OCR detection: {ocr_count} text regions
 2. Mask printed text: {masked_path}
 3. Candidate regions detected: {len(regions)}
 4. Signatures extracted: {len(regions)}
 Candidate region details:
 {'-' * 60}
 """
    for i, region in enumerate(regions):
        x, y, w, h = region['box']
        area = region['area']
        summary += f"Region {i+1}: position ({x}, {y}), size {w}x{h}, area={area}\n"
    summary += f"\nAll results saved in: {OUTPUT_DIR}\n"
    return summary
 def main():
    print("=" * 60)
    print("PP-OCRv5 Full Pipeline Test")
    print("=" * 60)
    # Setup
    setup_output_dir()
    # 1. Load the image
    print("\n1. Loading the test image...")
    image = get_page_image()
    if image is None:
        return
    print(f"   Image shape: {image.shape}")
    # Save the original image
    cv2.imwrite(str(OUTPUT_DIR / "00_original.png"), image)
    # 2. OCR detection
    print("\n2. Detecting text with PP-OCRv5...")
    ocr_results = call_ocr_server(image)
    if ocr_results is None:
        print("❌ OCR failed; aborting the test")
        return
    # 3. Mask printed text
    print("\n3. Masking printed text...")
    masked_image = mask_printed_text(image, ocr_results)
    # 4. Detect candidate regions
    print("\n4. Detecting candidate regions...")
    regions = detect_regions(masked_image)
    # 5. Merge nearby regions
    print("\n5. Merging nearby regions...")
    merged_regions = merge_nearby_regions(regions)
    # 6. Extract signatures
    print("\n6. Extracting signatures...")
    vis_image = extract_signatures(image, merged_regions)
    # 7. Generate the summary
    print("\n7. Generating the summary report...")
    summary = generate_summary(len(ocr_results), OUTPUT_DIR / "01_masked.png", merged_regions)
    print(summary)
    # Save the summary
    summary_path = OUTPUT_DIR / "SUMMARY.txt"
    with open(summary_path, 'w', encoding='utf-8') as f:
        f.write(summary)
    print("=" * 60)
    print("✅ Test complete!")
    print(f"Results directory: {OUTPUT_DIR}")
    print("=" * 60)
 if __name__ == "__main__":
    main()
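`merge_nearby_regions` makes a single greedy pass and measures every candidate against the seed region `r1`, so a chain of regions (A near B, B near C, but A not near C) is only partly merged. The sketch below is a different technique, not the method used in these scripts: it groups boxes into connected components of a "near" relation with a small union-find. The names `gap`, `are_near` and `merge_transitively` are hypothetical, and the nearness test is a variant of the script's rule in which the gap is 0 whenever two intervals overlap.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def gap(a_start: int, a_len: int, b_start: int, b_len: int) -> int:
    """Distance between two 1-D intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_start + a_len, b_start + b_len))


def are_near(a: Box, b: Box, h_distance: int, v_distance: int) -> bool:
    """Overlap on one axis and a small enough gap on the other."""
    h_gap = gap(a[0], a[2], b[0], b[2])
    v_gap = gap(a[1], a[3], b[1], b[3])
    return (h_gap == 0 and v_gap <= v_distance) or (v_gap == 0 and h_gap <= h_distance)


def merge_transitively(
    boxes: List[Box], h_distance: int = 100, v_distance: int = 50
) -> List[Box]:
    """Merge boxes that are connected through any chain of near neighbours."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if are_near(boxes[i], boxes[j], h_distance, v_distance):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    merged = []
    for members in groups.values():
        x_min = min(b[0] for b in members)
        y_min = min(b[1] for b in members)
        x_max = max(b[0] + b[2] for b in members)
        y_max = max(b[1] + b[3] for b in members)
        merged.append((x_min, y_min, x_max - x_min, y_max - y_min))
    return merged


# A chain: the first box is near the second, the second near the third,
# but the first and third are 260 px apart; the fourth box is isolated.
boxes = [(0, 0, 100, 50), (180, 0, 100, 50), (360, 0, 100, 50), (2000, 2000, 50, 50)]
print(sorted(merge_transitively(boxes)))
```

The pairwise loop is quadratic in the number of regions, which should be acceptable for the handful of candidates per page reported in the summaries.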
--- a/visualize_v5_results.py
+++ b/visualize_v5_results.py
@@ -0,0 +1,181 @@
 #!/usr/bin/env python3
 """
 Visualize the PP-OCRv5 detection results
 """
 import json
 import cv2
 import numpy as np
 from pathlib import Path
 def load_results():
    """Load the v5 detection results."""
    result_file = "/Volumes/NV2/pdf_recognize/test_results/v5_result.json"
    with open(result_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data['res']
 def draw_detections(image_path, results, output_path):
    """Draw detection boxes and indices on the image."""
    # Read the image
    img = cv2.imread(image_path)
    if img is None:
        print(f"❌ Cannot read image: {image_path}")
        return None
    # Draw on a copy
    vis_img = img.copy()
    # Fetch the detection results
    rec_texts = results.get('rec_texts', [])
    rec_boxes = results.get('rec_boxes', [])
    rec_scores = results.get('rec_scores', [])
    print(f"\nDetected {len(rec_texts)} text regions")
    # Draw each detection box
    for i, (text, box, score) in enumerate(zip(rec_texts, rec_boxes, rec_scores)):
        # Cast to int in case the JSON values are floats
        x_min, y_min, x_max, y_max = (int(v) for v in box)
        # Rectangle (green)
        cv2.rectangle(vis_img, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
        # Index label (small font)
        cv2.putText(vis_img, f"{i}", (x_min, y_min - 5),
                   cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    # Save the result
    cv2.imwrite(output_path, vis_img)
    print(f"✅ Visualization saved: {output_path}")
    return vis_img
 def generate_text_report(results):
    """Generate a text report."""
    rec_texts = results.get('rec_texts', [])
    rec_scores = results.get('rec_scores', [])
    rec_boxes = results.get('rec_boxes', [])
    if not rec_scores:
        # np.max / np.min raise ValueError on an empty sequence
        print("No detections; nothing to report")
        return
    print("\n" + "=" * 80)
    print("PP-OCRv5 Detection Report")
    print("=" * 80)
    print(f"\nTotal detected: {len(rec_texts)} text regions")
    print(f"Mean confidence: {np.mean(rec_scores):.4f}")
    print(f"Max confidence: {np.max(rec_scores):.4f}")
    print(f"Min confidence: {np.min(rec_scores):.4f}")
    # Confidence buckets
    high_conf = sum(1 for s in rec_scores if s >= 0.95)
    medium_conf = sum(1 for s in rec_scores if 0.8 <= s < 0.95)
    low_conf = sum(1 for s in rec_scores if s < 0.8)
    print("\nConfidence distribution:")
    print(f"  High (≥0.95): {high_conf} ({high_conf/len(rec_scores)*100:.1f}%)")
    print(f"  Medium (0.8-0.95): {medium_conf} ({medium_conf/len(rec_scores)*100:.1f}%)")
    print(f"  Low (<0.8): {low_conf} ({low_conf/len(rec_scores)*100:.1f}%)")
    # Show the first 20 detections
    print("\nFirst 20 detections:")
    print("-" * 80)
    for i in range(min(20, len(rec_texts))):
        text = rec_texts[i]
        score = rec_scores[i]
        box = rec_boxes[i]
        # Box size
        width = box[2] - box[0]
        height = box[3] - box[1]
        print(f"[{i:2d}] conf: {score:.4f}  size: {width:4d}x{height:3d}  text: {text}")
    if len(rec_texts) > 20:
        print(f"\n... {len(rec_texts) - 20} more results (omitted)")
    # Look for possible handwriting regions (low confidence or large glyphs)
    print("\n" + "=" * 80)
    print("Possible handwriting regions")
    print("=" * 80)
    potential_handwriting = []
    for i, (text, score, box) in enumerate(zip(rec_texts, rec_scores, rec_boxes)):
        width = box[2] - box[0]
        height = box[3] - box[1]
        # Criteria (any one is enough):
        # 1. Tall box (>50px)
        # 2. Low confidence (<0.9)
        # 3. Short text in a large font
        is_large = height > 50
        is_low_conf = score < 0.9
        is_short_text = len(text) <= 3 and height > 40
        if is_large or is_low_conf or is_short_text:
            potential_handwriting.append({
                'index': i,
                'text': text,
                'score': score,
                'height': height,
                'width': width,
                'reason': []
            })
            if is_large:
                potential_handwriting[-1]['reason'].append('large glyph')
            if is_low_conf:
                potential_handwriting[-1]['reason'].append('low confidence')
            if is_short_text:
                potential_handwriting[-1]['reason'].append('short text, large glyph')
    if potential_handwriting:
        print(f"\nFound {len(potential_handwriting)} possible handwriting regions:")
        print("-" * 80)
        for item in potential_handwriting[:15]:  # show only the first 15
            reasons = ', '.join(item['reason'])
            print(f"[{item['index']:2d}] {item['height']:3d}px  {item['score']:.4f}  ({reasons})  {item['text']}")
    else:
        print("No regions with clear handwriting characteristics found")
    # Save the detailed report to a file
    report_path = "/Volumes/NV2/pdf_recognize/test_results/v5_analysis_report.txt"
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write("PP-OCRv5 Detailed Detection Report\n")
        f.write("=" * 80 + "\n\n")
        f.write(f"Total: {len(rec_texts)}\n")
        f.write(f"Mean confidence: {np.mean(rec_scores):.4f}\n\n")
        f.write("Full detection list:\n")
        f.write("-" * 80 + "\n")
        for i, (text, score, box) in enumerate(zip(rec_texts, rec_scores, rec_boxes)):
            width = box[2] - box[0]
            height = box[3] - box[1]
            f.write(f"[{i:2d}] {score:.4f}  {width:4d}x{height:3d}  {text}\n")
    print(f"\nDetailed report saved: {report_path}")
 def main():
    # Load the results
    print("Loading PP-OCRv5 detection results...")
    results = load_results()
    # Generate the text report
    generate_text_report(results)
    # Visualize
    print("\n" + "=" * 80)
    print("Generating the visualization")
    print("=" * 80)
    image_path = "/Volumes/NV2/pdf_recognize/full_page_original.png"
    output_path = "/Volumes/NV2/pdf_recognize/test_results/v5_visualization.png"
    if Path(image_path).exists():
        draw_detections(image_path, results, output_path)
    else:
        print(f"⚠️  Original image not found: {image_path}")
    print("\n" + "=" * 80)
    print("Analysis complete")
    print("=" * 80)
 if __name__ == "__main__":
    main()
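The confidence statistics in `generate_text_report` are computed inline and printed directly, which makes them hard to reuse or test. NumPy reductions such as `np.max` also raise `ValueError` on an empty sequence, so any statistics code of this kind needs an explicit empty case. A minimal sketch of the same bucket thresholds (0.95 and 0.8) as a pure function follows; `confidence_summary` is a hypothetical helper and uses only the standard library.

```python
from statistics import mean
from typing import Dict, Sequence, Union

Number = Union[int, float]


def confidence_summary(scores: Sequence[float]) -> Dict[str, Number]:
    """Count, mean, and high/medium/low buckets for recognition scores."""
    if not scores:
        # Explicit empty case instead of failing inside a reduction
        return {"count": 0, "mean": 0.0, "high": 0, "medium": 0, "low": 0}
    return {
        "count": len(scores),
        "mean": mean(scores),
        "high": sum(1 for s in scores if s >= 0.95),
        "medium": sum(1 for s in scores if 0.8 <= s < 0.95),
        "low": sum(1 for s in scores if s < 0.8),
    }


# Scores of detections 0 to 4 and 8 from v5_analysis_report.txt
summary = confidence_summary([0.8783, 0.9936, 0.9976, 0.9815, 0.9912, 0.0])
print(summary["high"], summary["medium"], summary["low"])
```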