Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

Paper draft includes all sections (Abstract through Conclusion), 36 references, and supporting scripts.

Key methodology: Cosine similarity + dHash dual-method verification with thresholds calibrated against known-replication firm (Firm A).

Includes:
- 8 section markdown files (paper_a_*.md)
- Ablation study script (ResNet-50 vs VGG-16 vs EfficientNet-B0)
- Recalibrated classification script (84,386 PDFs, 5-tier system)
- Figure generation and Word export scripts
- Citation renumbering script ([1]-[36])
- Signature analysis pipeline (12 steps)
- YOLO extraction scripts

Three rounds of AI review completed (GPT-5.4, Claude Opus 4.6, Gemini 3 Pro).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete PP-OCRv5 research and v4 vs v5 comparison
2026-04-06 23:05:33 +08:00 · 2025-11-27 11:21:55 +08:00 · 2025-11-27 10:35:46 +08:00 · 2025-10-28 22:28:18 +08:00
55 changed files with 15877 additions and 0 deletions
@@ -0,0 +1,252 @@
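 # Current Project Status
 **Last updated**: 2025-10-29
 **Branch**: `paddleocr-improvements`
 **PaddleOCR version**: 2.7.3 (stable)
 ---
 ## Progress Summary
 ### ✅ Completed
 1. **PaddleOCR server deployment** (192.168.30.36:5555)
   - Version: PaddleOCR 2.7.3
   - GPU: enabled
   - Language: Chinese
   - Status: running stably
 2. **Basic pipeline implementation**
   - ✅ PDF → image rendering (DPI=300)
   - ✅ PaddleOCR text detection (26 regions/page)
   - ✅ Text-region masking (padding=25px)
   - ✅ Candidate region detection
   - ✅ Region merging algorithm (12→4 regions)
 3. **OpenCV separation method tests**
   - Method 1: stroke width analysis - ❌ poor results
   - Method 2: basic connected-component analysis - ⚠️ moderate results
   - Method 3: combined feature analysis - ✅ **best option** (86.5% handwriting retention)
 4. **Test results**
   - Test file: `201301_1324_AI1_page3.pdf`
   - Expected signatures: 2 (楊智惠, 張志銘)
   - Detection result: 2 signature regions merged successfully
   - Retention: 86.5% of handwritten content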
 ---
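 ## Technical Architecture
 ```
 PDF document
  ↓
 1. Rendering (PyMuPDF, 300 DPI)
  ↓
 2. PaddleOCR detection (locates printed text)
  ↓
 3. Mask printed text (black fill, padding=25px)
  ↓
 4. Region detection (OpenCV morphology)
  ↓
 5. Region merging (distance thresholds: H≤100px, V≤50px)
  ↓
 6. Feature analysis (size + stroke length + regularity)
  ↓
 7. [TODO] VLM verification
  ↓
 Extracted signatures
 ```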
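The separate horizontal and vertical thresholds in step 5 can be sketched as a pairwise test. This is a minimal illustration rather than the project's code: the names `axis_gap` and `should_merge` are hypothetical, boxes are assumed to be `(x, y, w, h)` tuples, and the gap is assumed to be measured edge to edge.

```python
def axis_gap(start_a, len_a, start_b, len_b):
    """Edge-to-edge gap between two 1-D intervals (0 if they overlap)."""
    return max(start_a - (start_b + len_b), start_b - (start_a + len_a), 0)


def should_merge(box_a, box_b, h_max=100, v_max=50):
    """Return True when two (x, y, w, h) boxes are close enough to merge.

    h_max and v_max mirror the H<=100px and V<=50px thresholds of step 5.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    h_gap = axis_gap(ax, aw, bx, bw)
    v_gap = axis_gap(ay, ah, by, bh)
    return h_gap <= h_max and v_gap <= v_max


# Two boxes on the same text line, 60 px apart horizontally.
print(should_merge((100, 500, 200, 80), (360, 510, 150, 70)))
# The same boxes, but 300 px apart vertically.
print(should_merge((100, 500, 200, 80), (360, 880, 150, 70)))
```

Keeping the two axes separate lets strokes of one signature join across a wide horizontal gap without pulling in text from the line above or below.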
 ---
 ## Core Files
 | File | Description | Status |
 |------|------|------|
 | `paddleocr_client.py` | PaddleOCR REST client | ✅ Stable |
 | `test_mask_and_detect.py` | Basic masking + detection test | ✅ Done |
 | `test_opencv_separation.py` | OpenCV Method 1+2 tests | ✅ Done |
 | `test_opencv_advanced.py` | OpenCV Method 3 (best) | ✅ Done |
 | `extract_signatures_paddleocr_improved.py` | Full pipeline (Method B+E) | ⚠️ Method E is broken |
 | `PADDLEOCR_STATUS.md` | Detailed technical documentation | ✅ Done |
 ---
 ## Method 3: Combined Feature Analysis (current best option)
 ### Decision criteria
 **Your observations** (very accurate):
 1. ✅ **Handwritten characters are larger than printed ones** - height > 50px
 2. ✅ **Handwritten strokes are longer** - stroke_ratio > 0.4
 3. ✅ **Printed text is regular, handwriting is irregular** - compactness, solidity
 ### Scoring system
 ```python
 score = 0  # handwriting score for one connected component
 # Size
 if height > 50: score += 3
 elif height > 35: score += 2
 # Stroke length
 if stroke_ratio > 0.5: score += 2
 elif stroke_ratio > 0.35: score += 1
 # Regularity
 if is_irregular: score += 1  # irregular = handwriting
 else: score -= 1              # regular = printed
 # Area
 if area > 2000: score += 2
 elif area < 500: score -= 1
 # Classification: score > 0 → handwriting
 ```
 ### Results
 - Handwriting pixels retained: **86.5%** ✅
 - Printed pixels filtered: 13.5%
 - Top 10 components all classified correctly
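For reference, the scoring rules above can be wrapped into a function that is callable per component. This is a sketch that assumes the four features are already computed for each connected component; the function names `handwriting_score` and `is_handwriting` are hypothetical, and how `stroke_ratio` and `is_irregular` are derived is not shown here.

```python
def handwriting_score(height, stroke_ratio, is_irregular, area):
    """Score one connected component; a positive score means handwriting."""
    score = 0
    # Size: taller components are more likely handwritten
    if height > 50:
        score += 3
    elif height > 35:
        score += 2
    # Stroke length
    if stroke_ratio > 0.5:
        score += 2
    elif stroke_ratio > 0.35:
        score += 1
    # Regularity: irregular shapes count toward handwriting
    score += 1 if is_irregular else -1
    # Area
    if area > 2000:
        score += 2
    elif area < 500:
        score -= 1
    return score


def is_handwriting(height, stroke_ratio, is_irregular, area):
    """Apply the score > 0 classification rule."""
    return handwriting_score(height, stroke_ratio, is_irregular, area) > 0


# A large, irregular component with long strokes.
print(handwriting_score(height=80, stroke_ratio=0.6, is_irregular=True, area=3000))
# A small, regular component.
print(handwriting_score(height=20, stroke_ratio=0.2, is_irregular=False, area=300))
```

Returning the raw score rather than only a boolean keeps the threshold adjustable when the parameters are tuned on more samples.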
 ---
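 ## Known Issues
 ### 1. Method E (two-stage OCR) does not work ❌
 **Cause**: PaddleOCR cannot tell printed text from handwriting, so the second OCR pass also detects the handwriting and removes it
 **Resolution**:
 - ❌ Do not use Method E
 - ✅ Use Method B (region merging) + OpenCV Method 3
 ### 2. Printed names overlap handwritten signatures
 **Symptom**: a region contains "楊 智 惠" (printed) plus the handwritten signature
 **Strategy**: accept a small amount of printed residue and prioritize keeping the handwriting intact
 **Follow-up**: use the VLM for final verification
 ### 3. Masking padding trade-off
 **Small padding (5-10px)**: more printed residue, but the handwriting is not damaged
 **Large padding (25px)**: printed text is removed cleanly, but the edges of the handwriting may be masked
 **Current choice**: 25px, relying on OpenCV Method 3 to filter the residue
 ---
 ## Next Steps
 ### Short term (continue the current approach)
 - [ ] Integrate Method B + OpenCV Method 3 into a complete pipeline
 - [ ] Add the VLM verification step
 - [ ] Test on 10 samples
 - [ ] Tune parameters (height threshold, merge distance, etc.)
 ### Medium term (PP-OCRv5 research)
 **New branch**: `pp-ocrv5-research`
 - [ ] Study the new PaddleOCR 3.3.0 API
 - [ ] Test PP-OCRv5 handwriting detection
 - [ ] Compare performance: v4 vs v5
 - [ ] Decide whether to upgrade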
 ---
 ## Server Configuration
 ### PaddleOCR server (Linux)
 ```
 Host: 192.168.30.36:5555
 SSH: ssh gblinux
 Path: ~/Project/paddleocr-server/
 Versions: PaddleOCR 2.7.3, numpy 1.26.4, opencv-contrib 4.6.0.66
 Start: cd ~/Project/paddleocr-server && source venv/bin/activate && python paddleocr_server.py
 Log: ~/Project/paddleocr-server/server_stable.log
 ```
 ### VLM server (Ollama)
 ```
 Host: 192.168.30.36:11434
 Model: qwen2.5vl:32b
 Status: not used in the current pipeline
 ```
 ---
 ## Test Data
 ### Sample file
 ```
 /Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf
 - Page: page 3
 - Expected signatures: 2 (楊智惠, 張志銘)
 - Size: 2481x3510 pixels
 ```
 ### Output directories
 ```
 /Volumes/NV2/PDF-Processing/signature-image-output/
 ├── mask_test/              # Basic masking test results
 ├── paddleocr_improved/     # Method B+E test (E failed)
 ├── opencv_separation_test/ # Method 1+2 tests
 └── opencv_advanced_test/   # Method 3 test (best)
 ```
 ---
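 ## Performance Comparison
 | Method | Handwriting retained | Printed text removed | Verdict |
 |------|---------|---------|------|
 | Basic masking | 100% | Low | ⚠️ Too much printed residue |
 | Method 1 (stroke width) | 0% | - | ❌ Complete failure |
 | Method 2 (connected components) | 1% | Medium | ❌ Loses too much handwriting |
 | Method 3 (combined features) | **86.5%** | High | ✅ **Best** |
 ---
 ## Git Status
 ```
 Current branch: paddleocr-improvements
 Based on: PaddleOCR-Cover
 Tag: paddleocr-v1-basic (basic masking version)
 Pending commit:
 - Advanced OpenCV separation method (Method 3)
 - Full test scripts and results
 - Documentation updates
 ```
 ---
 ## Known Limitations
 1. **Parameters need tuning**: the height threshold, merge distance, and similar values may need adjusting for different documents
 2. **Depends on document quality**: blurry or skewed documents may give worse results
 3. **Compute performance**: the OpenCV steps are fast, but the full pipeline still needs optimization
 4. **Generalization**: tested on only 1 sample; more samples are needed for validation
 ---
 ## Contact and Collaboration
 **Lead developer**: Claude Code
 **Working mode**: conversational development
 **Repository**: local Git repository
 **Test environment**: macOS (local) + Linux (server)
 ---
 **Status**: ✅ The current approach is stable and development can continue
 **Recommendation**: test Method 3 on more samples first, then consider the PP-OCRv5 upgrade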
@@ -0,0 +1,432 @@
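 # New Session Handoff - PP-OCRv5 Research
 **Date**: 2025-10-29
 **Previous session**: development on the PaddleOCR-Cover branch
 **Current branch**: `paddleocr-improvements` (stable)
 **New branch**: `pp-ocrv5-research` (to be created)
 ---
 ## 🎯 Goal
 Research and implement handwritten signature detection with **PP-OCRv5**
 ---
 ## 📋 Background
 ### Current situation
 ✅ **A stable solution exists** (`paddleocr-improvements` branch):
 - PaddleOCR 2.7.3 + OpenCV Method 3
 - 86.5% handwriting retention
 - The region merging algorithm works well
 - Test: both signatures detected in 1 PDF
 ⚠️ **The PP-OCRv5 upgrade ran into problems**:
 - The PaddleOCR 3.3.0 API changed completely
 - The old server code is incompatible
 - The new API needs deeper study
 ### Why research PP-OCRv5?
 **Per the documentation**: https://www.paddleocr.ai/main/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5.html
 PP-OCRv5 improvements:
 - Handwritten Chinese detection: **0.706 → 0.803** (+13.7% relative)
 - Handwritten English detection: **0.249 → 0.841** (+237% relative)
 - May support outputting handwriting region coordinates directly
 **Potential advantages**:
 1. Better handwriting recognition
 2. Possibly built-in handwritten/printed classification
 3. More accurate coordinate output
 4. Less complex post-processing
 ---
 ## 🔧 Tech Stack
 ### Server environment
 ```
 Host: 192.168.30.36 (Linux GPU server)
 SSH: ssh gblinux
 Directory: ~/Project/paddleocr-server/
 ```
 **Current stable versions**:
 - PaddleOCR: 2.7.3
 - numpy: 1.26.4
 - opencv-contrib-python: 4.6.0.66
 - Server file: `paddleocr_server.py`
 **Installed but unused**:
 - PaddleOCR 3.3.0 (PP-OCRv5)
 - Temporary server: `paddleocr_server_v5.py` (unfinished)
 ### Local environment
 ```
 macOS
 Python: 3.14
 Virtual environment: venv/
 Client: paddleocr_client.py
 ```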
 ---
 ## 📝 Core Problems
 ### 1. API changes
 **Old API (2.7.3)**:
 ```python
 from paddleocr import PaddleOCR
 ocr = PaddleOCR(lang='ch')
 result = ocr.ocr(image_np, cls=False)
 # Return format:
 # [[[box], (text, confidence)], ...]
 ```
 **New API (3.3.0)** - ⚠️ not fully understood yet:
 ```python
 # Option 1: legacy call (deprecated)
 result = ocr.ocr(image_np)  # Warning: Please use predict instead
 # Option 2: new call
 from paddlex import create_model
 model = create_model("???")  # model name unknown
 result = model.predict(image_np)
 # Return format: ???
 ```
 ### 2. Errors encountered
 **Error 1**: the `cls` argument is no longer supported
 ```python
 # Error: PaddleOCR.predict() got an unexpected keyword argument 'cls'
 result = ocr.ocr(image_np, cls=False)  # ❌
 ```
 **Error 2**: the return format changed
 ```python
 # The old parsing code fails:
 text = item[1][0]       # ❌ IndexError
 confidence = item[1][1]  # ❌ IndexError
 ```
 **Error 3**: wrong model name
 ```python
 model = create_model("PP-OCRv5_server")  # ❌ Model not supported
 ```
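To make the parsing failure in Error 2 concrete, here is a sketch of what the old client-side parsing expects. It assumes each entry has the legacy shape `[box, (text, confidence)]` noted above, with `box` given as four `[x, y]` corner points; the function name `parse_legacy_items` and the sample data are hypothetical, not real OCR output.

```python
def parse_legacy_items(items):
    """Convert legacy-style [box, (text, confidence)] entries into dicts."""
    parsed = []
    for item in items or []:
        box, (text, confidence) = item[0], item[1]
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        parsed.append({
            "text": text,
            "confidence": float(confidence),
            # Axis-aligned bounding box as (x, y, w, h)
            "box": (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)),
        })
    return parsed


# Hand-written sample in the old shape.
sample = [
    [[[10, 20], [110, 20], [110, 50], [10, 50]], ("example", 0.98)],
]
print(parse_legacy_items(sample))
```

Code written this way indexes into positional lists, which is why it breaks as soon as the library returns a differently structured result.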
 ---
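 ## 🎯 Research Task List
 ### Phase 1: API research (high priority)
 - [ ] **Read the official documentation**
  - Full PP-OCRv5 documentation
  - PaddleX API documentation
  - Migration guide (if one exists)
 - [ ] **Understand the new API**
  ```python
  # To be clarified:
  # 1. The correct import path
  # 2. How to initialize the model
  # 3. predict() arguments and return format
  # 4. How to tell handwriting from printed text
  # 5. Whether there is a dedicated handwriting detection feature
  ```
 - [ ] **Write a test script**
  - `test_pp_ocrv5_api.py` - test basic API calls
  - Print the full structure of `result`
  - Compare the v4 and v5 return values
 ### Phase 2: Server adaptation
 - [ ] **Rewrite the server code**
  - Adapt to the new API
  - Parse the returned data correctly
  - Keep the REST interface compatible
 - [ ] **Test stability**
  - Test 10 PDF samples
  - Check GPU utilization
  - Compare with v4 performance
 ### Phase 3: Handwriting detection
 - [ ] **Look for handwriting detection capability**
  ```python
  # Possibilities:
  # 1. Does result contain a text_type field?
  # 2. Is there a dedicated handwriting_detection model?
  # 3. Is there a confidence difference that can be exploited?
  # 4. PP-Structure layout analysis?
  ```
 - [ ] **Comparison tests**
  - v4 (current approach) vs v5
  - Precision, recall, speed
  - Handwriting detection capability
 ### Phase 4: Integration decision
 - [ ] **Performance evaluation**
  - If v5 is better → upgrade
  - If the improvement is marginal → stay on v4
 - [ ] **Documentation updates**
  - Document how to use v5
  - Update PADDLEOCR_STATUS.md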
 ---
 ## 🔍 Debugging Tips
 ### 1. Inspect the full return value
 ```python
 import pprint
 result = model.predict(image)
 pprint.pprint(result)  # print every field
 # Or
 import json
 print(json.dumps(result, indent=2, ensure_ascii=False))
 ```
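One caveat with the `json.dumps` call: it raises `TypeError` on values it cannot serialize, and OCR results may contain array-like objects. The sketch below assumes such objects expose a `tolist()` method, as NumPy arrays do; `to_jsonable`, `dump_result`, and the `FakeArray` stand-in are hypothetical names used only for illustration.

```python
import json


def to_jsonable(obj):
    """Fallback for json.dumps: convert array-like objects via .tolist()."""
    if hasattr(obj, "tolist"):
        return obj.tolist()
    # Anything else is shown by its repr so the dump never fails
    return repr(obj)


def dump_result(result):
    """Serialize a result structure for inspection, keeping non-ASCII text readable."""
    return json.dumps(result, indent=2, ensure_ascii=False, default=to_jsonable)


class FakeArray:
    """Stand-in for a NumPy array so the example runs without NumPy."""

    def __init__(self, data):
        self._data = data

    def tolist(self):
        return self._data


print(dump_result({"rec_texts": ["示例"], "dt_polys": FakeArray([[0, 0], [1, 1]])}))
```

The `default=` hook is only called for objects the encoder does not already understand, so plain dicts, lists, strings, and numbers pass through unchanged.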
 ### 2. Look for official examples
 ```bash
 # Find the examples shipped with the PaddleOCR installation on the server
 find ~/Project/paddleocr-server/venv/lib/python3.12/site-packages/paddleocr -name "*.py" | grep example
 # Read the source
 less ~/Project/paddleocr-server/venv/lib/python3.12/site-packages/paddleocr/paddleocr.py
 ```
 ### 3. List available models
 ```python
 from paddlex.inference.models import OFFICIAL_MODELS
 print(OFFICIAL_MODELS)  # lists every supported model name
 ```
 ### 4. Web documentation search
 Key places to check:
 - https://github.com/PaddlePaddle/PaddleOCR
 - https://www.paddleocr.ai
 - https://github.com/PaddlePaddle/PaddleX
 ---
 ## 📂 File Structure
 ```
 /Volumes/NV2/pdf_recognize/
 ├── CURRENT_STATUS.md          # Current status document ✅
 ├── NEW_SESSION_HANDOFF.md     # This file ✅
 ├── PADDLEOCR_STATUS.md        # Detailed technical documentation ✅
 ├── SESSION_INIT.md            # Initial session information
 │
 ├── paddleocr_client.py        # Stable client (v2.7.3) ✅
 ├── paddleocr_server_v5.py     # v5 server (unfinished) ⚠️
 │
 ├── test_paddleocr_client.py           # Basic test
 ├── test_mask_and_detect.py            # Masking + detection
 ├── test_opencv_separation.py          # Method 1+2
 ├── test_opencv_advanced.py            # Method 3 (best) ✅
 ├── extract_signatures_paddleocr_improved.py  # Full pipeline
 │
 └── check_rejected_for_missing.py      # Diagnostic script
 ```
 **Server side** (`ssh gblinux`):
 ```
 ~/Project/paddleocr-server/
 ├── paddleocr_server.py        # v2.7.3 stable ✅
 ├── paddleocr_server_v5.py     # v5 version (to be completed) ⚠️
 ├── paddleocr_server_backup.py # Backup
 ├── server_stable.log          # Current runtime log
 └── venv/                      # Virtual environment
 ```
 ---
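 ## ⚡ Quick Start
 ### Start the stable server (v2.7.3)
 ```bash
 ssh gblinux
 cd ~/Project/paddleocr-server
 source venv/bin/activate
 python paddleocr_server.py
 ```
 ### Test the connection
 ```bash
 # Local Mac
 cd /Volumes/NV2/pdf_recognize
 source venv/bin/activate
 python test_paddleocr_client.py
 ```
 ### Create the new research branch
 ```bash
 cd /Volumes/NV2/pdf_recognize
 git checkout -b pp-ocrv5-research
 ```
 ---
 ## 🚨 Cautions
 ### 1. Do not break the stable version
 - Keep the `paddleocr-improvements` branch stable
 - Run all v5 experiments on the new `pp-ocrv5-research` branch
 - Keep `paddleocr_server.py` (v2.7.3) on the server
 - Name the new code `paddleocr_server_v5.py`
 ### 2. Environment isolation
 - The server virtual environment may need to be rebuilt
 - Alternatively, isolate v4 and v5 with Docker
 - Avoid version conflicts
 ### 3. Performance testing
 - Record concrete metrics for v4 and v5
 - Test at least 10 samples
 - Cover speed, precision, and recall
 ### 4. Documentation-driven work
 - Record every finding in the docs
 - Describe API usage clearly
 - Make future maintenance easier
 ---
 ## 📊 Success Criteria
 ### Minimum goals
 - [ ] Run basic PP-OCRv5 OCR successfully
 - [ ] Understand how to call the new API
 - [ ] Server runs stably
 - [ ] Complete documentation recorded
 ### Stretch goals
 - [ ] Find a handwriting detection feature
 - [ ] Outperform the v4 approach
 - [ ] Simplify the pipeline
 - [ ] Raise accuracy above 90%
 ### Decision points
 **If v5 is clearly better** → upgrade to v5 and retire v4
 **If v5's improvement is marginal** → stay on v4 and keep v5 as a research record only
 **If v5 has bugs** → wait for an upstream fix and keep using v4
 ---
 ## 📞 Troubleshooting
 ### When something goes wrong
 1. **Check the log first**: `tail -f ~/Project/paddleocr-server/server_stable.log`
 2. **Read the source**: find the PaddleOCR code inside the venv
 3. **Search the issues**: https://github.com/PaddlePaddle/PaddleOCR/issues
 4. **Downgrade test**: confirm that v2.7.3 still works
 ### FAQ
 **Q: The server fails to start?**
 A: Check the numpy version (must be < 2.0)
 **Q: Model not found?**
 A: The model name may have changed; check OFFICIAL_MODELS
 **Q: API call fails?**
 A: Compare against the official documentation; the argument format may have changed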
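The numpy constraint from the first FAQ entry can be checked before the server starts. A minimal sketch, assuming only the major version matters; `major_version` and `numpy_is_compatible` are hypothetical helper names.

```python
def major_version(version):
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version.split(".")[0])


def numpy_is_compatible(version):
    """True when the NumPy major version is below 2, as the stable setup requires."""
    return major_version(version) < 2


print(numpy_is_compatible("1.26.4"))
print(numpy_is_compatible("2.1.0"))
```

On the server, the argument would be `numpy.__version__`; failing fast with a clear message is easier to diagnose than an import error deep inside the OCR stack.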
 ---
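 ## 🎓 Learning Resources
 ### Official documentation
 1. **PP-OCRv5**: https://www.paddleocr.ai/main/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5.html
 2. **PaddleOCR GitHub**: https://github.com/PaddlePaddle/PaddleOCR
 3. **PaddleX**: https://github.com/PaddlePaddle/PaddleX
 ### Related topics
 - PaddlePaddle deep learning framework
 - PP-Structure document structure analysis
 - Handwriting recognition
 - Layout analysis
 ---
 ## 💡 Hints
 ### If built-in handwriting detection turns up
 Possible usage:
 ```python
 # Guess 1: the result includes a type
 for item in result:
    text_type = item.get('type')  # 'printed' or 'handwritten'?
 # Guess 2: a dedicated layout model
 from paddlex import create_model
 layout_model = create_model("PP-Structure")
 layout_result = layout_model.predict(image)
 # Might return: text, handwriting, figure, table...
 # Guess 3: confidence difference
 # Handwritten text may have lower confidence
 ```
 ### If there is no built-in handwriting detection
 Then the current OpenCV Method 3 remains the best option, and v5 only offers better OCR accuracy.
 ---
 ## ✅ Completion Checklist
 When the research is done, make sure that:
 - [ ] The new API is fully understood and documented
 - [ ] The server code is rewritten and passes testing
 - [ ] Performance comparison data is recorded
 - [ ] A decision document exists (upgrade vs stay on v4)
 - [ ] Code is committed to the `pp-ocrv5-research` branch
 - [ ] `CURRENT_STATUS.md` is updated
 - [ ] If upgrading: merged into the main branch
 ---
 **Good luck with the research!** 🚀
 For questions, refer to:
 - `CURRENT_STATUS.md` - details of the current approach
 - `PADDLEOCR_STATUS.md` - technical details and problem analysis
 **Most important**: record every finding; successes and failures are both valuable experience!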
@@ -0,0 +1,475 @@
 # PaddleOCR Signature Extraction - Status & Options
 **Date**: October 28, 2025
 **Branch**: `PaddleOCR-Cover`
 **Current Stage**: Masking + Region Detection Working, Refinement Needed
 ---
 ## Current Approach Overview
 **Strategy**: PaddleOCR masks printed text → Detect remaining regions → VLM verification
 ### Pipeline Steps
 ```
 1. PaddleOCR (Linux server 192.168.30.36:5555)
   └─> Detect printed text bounding boxes
 2. OpenCV Masking (Local)
   └─> Black out all printed text areas
 3. Region Detection (Local)
   └─> Find non-white areas (potential handwriting)
 4. VLM Verification (TODO)
   └─> Confirm which regions are handwritten signatures
 ```
 ---
 ## Test Results (File: 201301_1324_AI1_page3.pdf)
 ### Performance
 | Metric | Value |
 |--------|-------|
 | Printed text regions masked | 26 |
 | Candidate regions detected | 12 |
 | Actual signatures found | 2 ✅ |
 | False positives (printed text) | 9 |
 | Split signatures | 1 (Region 5 might be part of Region 4) |
 ### Success
 ✅ **PaddleOCR detected most printed text** (26 regions)
 ✅ **Masking works correctly** (black rectangles)
 ✅ **Region detection found both signatures** (regions 2, 4)
 ✅ **No false negatives** (didn't miss any signatures)
 ### Issues Identified
 ❌ **Problem 1: Handwriting Split Into Multiple Regions**
 - Some signatures may be split into 2+ separate regions
 - Example: Region 4 and Region 5 might be parts of same signature area
 - Caused by gaps between handwritten strokes after masking
 ❌ **Problem 2: Printed Name + Handwritten Signature Mixed**
 - Region 2: Contains "張 志 銘" (printed) + handwritten signature
 - Region 4: Contains "楊 智 惠" (printed) + handwritten signature
 - PaddleOCR missed these printed names, so they weren't masked
 - Final output includes both printed and handwritten parts
 ❌ **Problem 3: Printed Text Not Masked by PaddleOCR**
 - 9 regions contain printed text that PaddleOCR didn't detect
 - These became false positive candidates
 - Examples: dates, company names, paragraph text
 - Shows PaddleOCR's detection isn't 100% complete
 ---
 ## Proposed Solutions
 ### Problem 1: Split Signatures
 #### Option A: More Aggressive Morphology ⭐ EASY
 **Approach**: Increase kernel size and iterations to connect nearby strokes
 ```python
 # Current settings:
 kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
 morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
 # Proposed settings:
 kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))  # 3x larger
 morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=5)  # More iterations
 ```
 **Pros**:
 - Simple parameter change (kernel size and iteration count)
 - Connects nearby strokes automatically
 - Fast execution
 **Cons**:
 - May merge unrelated regions if too aggressive
 - Need to tune parameters carefully
 - Could lose fine details
 **Recommendation**: ⭐ Try first - easiest to implement and test
 ---
 #### Option B: Region Merging After Detection ⭐⭐ MEDIUM (RECOMMENDED)
 **Approach**: After detecting all regions, merge those that are close together
 ```python
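 def merge_nearby_regions(regions, distance_threshold=50):
    """
    Merge regions that are within distance_threshold pixels of each other.
    Args:
        regions: List of region dicts with 'box' (x, y, w, h)
        distance_threshold: Maximum pixels between regions to merge
    Returns:
        List of merged regions
    """
    def gap(a, b):
        # Edge-to-edge gap per axis (0 when overlapping); the larger one is used
        dx = max(a[0] - (b[0] + b[2]), b[0] - (a[0] + a[2]), 0)
        dy = max(a[1] - (b[1] + b[3]), b[1] - (a[1] + a[3]), 0)
        return max(dx, dy)
    # One possible implementation: repeat pairwise merging until stable
    boxes = [tuple(r['box']) for r in regions]
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if gap(a, b) < distance_threshold:
                    # Replace the pair with the union of their bounding boxes
                    x, y = min(a[0], b[0]), min(a[1], b[1])
                    x2 = max(a[0] + a[2], b[0] + b[2])
                    y2 = max(a[1] + a[3], b[1] + b[3])
                    boxes[i] = (x, y, x2 - x, y2 - y)
                    del boxes[j]
                    merged_any = True
                    break
            if merged_any:
                break
    return [{'box': box} for box in boxes]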
 **Pros**:
 - Keeps signatures together intelligently
 - Won't merge distant unrelated regions
 - Preserves original stroke details
 - Can use vertical/horizontal distance separately
 **Cons**:
 - Need to tune distance threshold
 - More complex than Option A
 - May need multiple merge passes
 **Recommendation**: ⭐⭐ **Best balance** - implement this if Option A is not sufficient
 ---
 #### Option C: Don't Split - Extract Larger Context ⭐ EASY
 **Approach**: When extracting regions, add significant padding to capture full context
 ```python
 # Current: padding = 10 pixels
 padding = 50  # Much larger padding
 # Or: Merge all regions in the bottom 20% of page
 # (signatures are usually at the bottom)
 ```
 **Pros**:
 - Guaranteed to capture complete signatures
 - Very simple to implement
 - No risk of losing parts
 **Cons**:
 - May include extra unwanted content
 - Larger image files
 - Makes VLM verification more complex
 **Recommendation**: ⭐ Use as fallback if B doesn't work
 ---
 ### Problem 2: Printed + Handwritten in Same Region
 #### Option A: Expand PaddleOCR Masking Boxes ⭐ EASY
 **Approach**: Add padding when masking text boxes to catch edges
 ```python
 padding = 20  # pixels
 for (x, y, w, h) in text_boxes:
    # Expand box in all directions
    x_pad = max(0, x - padding)
    y_pad = max(0, y - padding)
    w_pad = min(image.shape[1] - x_pad, w + 2*padding)
    h_pad = min(image.shape[0] - y_pad, h + 2*padding)
    cv2.rectangle(masked_image, (x_pad, y_pad),
                  (x_pad + w_pad, y_pad + h_pad), (0, 0, 0), -1)
 ```
 **Pros**:
 - Very simple - one parameter change
 - Catches text edges and nearby text
 - Fast execution
 **Cons**:
 - If padding too large, may mask handwriting
 - If padding too small, still misses text
 - Hard to find perfect padding value
 **Recommendation**: ⭐ Quick test - try with padding=20-30
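One detail worth checking when testing padding values: the snippet above derives the padded width from the clamped origin, so a box closer than `padding` to the left or top edge grows by more than `padding` on the opposite side. Working with corner coordinates avoids that. This is a sketch under that reading; `expand_box` is a hypothetical helper name.

```python
def expand_box(x, y, w, h, padding, image_width, image_height):
    """Grow an (x, y, w, h) box by `padding` on every side, clamped to the image."""
    x1 = max(0, x - padding)
    y1 = max(0, y - padding)
    x2 = min(image_width, x + w + padding)
    y2 = min(image_height, y + h + padding)
    return x1, y1, x2 - x1, y2 - y1


# Box near the top-left corner of a 2481x3510 page, padding=20.
print(expand_box(5, 10, 100, 40, 20, 2481, 3510))
# Box near the right edge of the same page.
print(expand_box(2400, 100, 80, 40, 20, 2481, 3510))
```

The returned tuple can be passed straight to `cv2.rectangle` as `(x, y)` and `(x + w, y + h)`, matching the drawing call used above.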
 ---
 #### Option B: Run PaddleOCR Again on Each Region ⭐⭐ MEDIUM
 **Approach**: Second-pass OCR on extracted regions to find remaining printed text
 ```python
 def clean_region(region_image, ocr_client):
    """
    Remove any remaining printed text from a region.
    Args:
        region_image: Extracted candidate region
        ocr_client: PaddleOCR client
    Returns:
        Cleaned image with only handwriting
    """
    # Run OCR on this specific region
    text_boxes = ocr_client.get_text_boxes(region_image)
    # Mask any detected printed text
    cleaned = region_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(cleaned, (x, y), (x+w, y+h), (0, 0, 0), -1)
    return cleaned
 ```
 **Pros**:
 - Very accurate - catches printed text PaddleOCR missed initially
 - Clean separation of printed vs handwritten
 - No manual tuning needed
 **Cons**:
 - Slower (one extra OCR call per candidate region)
 - May occasionally mask handwritten text if it looks printed
 - More complex pipeline
 **Recommendation**: ⭐⭐ Good option if masking padding isn't enough
 ---
 #### Option C: Computer Vision Stroke Analysis ⭐⭐⭐ HARD
 **Approach**: Analyze stroke characteristics to distinguish printed vs handwritten
 ```python
 def separate_printed_handwritten(region_image):
    """
    Use CV techniques to separate printed from handwritten.
    Techniques:
    - Stroke width analysis (printed = uniform, handwritten = variable)
    - Edge detection + smoothness (printed = sharp, handwritten = organic)
    - Connected component analysis
    - Hough line detection (printed = straight, handwritten = curved)
    """
    # Complex implementation...
    pass
 ```
 **Pros**:
 - No API calls needed (fast)
 - Can work when OCR fails
 - Relies on interpretable geometric features rather than a trained model
 **Cons**:
 - Very complex to implement
 - May not be reliable across different documents
 - Requires significant tuning
 - Hard to maintain
 **Recommendation**: ❌ Skip for now - too complex, uncertain results
 ---
 #### Option D: VLM Crop Guidance ⚠️ RISKY
 **Approach**: Ask VLM to provide coordinates of handwriting location
 ```python
 prompt = """
 This image contains both printed and handwritten text.
 Where is the handwritten signature located?
 Provide coordinates as: x_start, y_start, x_end, y_end
 """
 # VLM returns coordinates
 # Crop to that region only
 ```
 **Pros**:
 - VLM understands visual context
 - Can distinguish printed vs handwritten
 **Cons**:
 - **VLM coordinates are unreliable** (32% offset discovered in previous tests!)
 - This was the original problem that led to PaddleOCR approach
 - May extract wrong region
 **Recommendation**: ❌ **DO NOT USE** - VLM coordinates proven unreliable
 ---
 #### Option E: Two-Stage Hybrid Approach ⭐⭐⭐ BEST (RECOMMENDED)
 **Approach**: Combine detection with targeted cleaning
 ```python
 def extract_signatures_twostage(pdf_path):
    """
    Stage 1: Detect candidate regions (current pipeline)
    Stage 2: Clean each region
    """
    # Stage 1: Full page processing
    image = render_pdf(pdf_path)
    text_boxes = ocr_client.get_text_boxes(image)
    masked_image = mask_text_regions(image, text_boxes, padding=20)
    candidate_regions = detect_regions(masked_image)
    # Stage 2: Per-region cleaning
    signatures = []
    for region_box in candidate_regions:
        # Extract region from ORIGINAL image (not masked)
        region_img = extract_region(image, region_box)
        # Option 1: Run OCR again to find remaining printed text
        region_text_boxes = ocr_client.get_text_boxes(region_img)
        cleaned_region = mask_text_regions(region_img, region_text_boxes)
        # Option 2: Ask VLM if it contains handwriting (no coordinates!)
        is_handwriting = vlm_verify(cleaned_region)
        if is_handwriting:
            signatures.append(cleaned_region)
    return signatures
 ```
 **Pros**:
 - Best accuracy - two passes of OCR
 - Combines strengths of both approaches
 - VLM only for yes/no, not coordinates
 - Clean final output with only handwriting
 **Cons**:
 - Slower (one full-page OCR call plus one OCR call per candidate region)
 - More complex code
 - Higher computational cost
 **Recommendation**: ⭐⭐⭐ **BEST OVERALL** - implement this for production
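Because the VLM is used only for a yes/no decision here, its free-text reply needs to be mapped to a boolean. A minimal sketch of that step, assuming the reply is a short string that starts with the answer; `parse_yes_no` is a hypothetical helper and is not part of `vlm_verify` as written above.

```python
def parse_yes_no(answer):
    """Map a free-text VLM reply to True/False, or None when it is ambiguous."""
    cleaned = answer.strip().lower()
    for mark in ".,!:;":
        cleaned = cleaned.replace(mark, " ")
    words = cleaned.split()
    if not words:
        return None
    first = words[0].strip("'\"")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


print(parse_yes_no("Yes."))
print(parse_yes_no("no"))
print(parse_yes_no("The image shows a stamp."))
```

Returning `None` for anything other than a clear yes or no leaves the caller free to decide what to do with ambiguous replies, for example keeping the region for manual review instead of silently rejecting it.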
 ---
 ## Implementation Priority
 ### Phase 1: Quick Wins (Test Immediately)
 1. **Expand masking padding** (Problem 2, Option A) - 5 minutes
 2. **More aggressive morphology** (Problem 1, Option A) - 5 minutes
 3. **Test and measure improvement**
 ### Phase 2: Region Merging (If Phase 1 insufficient)
 4. **Implement region merging algorithm** (Problem 1, Option B) - 30 minutes
 5. **Test on multiple PDFs**
 6. **Tune distance threshold**
 ### Phase 3: Two-Stage Approach (Best quality)
 7. **Implement second-pass OCR on regions** (Problem 2, Option E) - 1 hour
 8. **Add VLM verification** (Step 4 of pipeline) - 30 minutes
 9. **Full pipeline testing**
 ---
 ## Code Files Status
 ### Existing Files ✅
 - **`paddleocr_client.py`** - REST API client for PaddleOCR server
 - **`test_paddleocr_client.py`** - Connection and OCR test
 - **`test_mask_and_detect.py`** - Current masking + detection pipeline
 ### To Be Created 📝
 - **`extract_signatures_paddleocr.py`** - Production pipeline with all improvements
 - **`region_merger.py`** - Region merging utilities
 - **`vlm_verifier.py`** - VLM handwriting verification
 ---
 ## Server Configuration
 **PaddleOCR Server**:
 - Host: `192.168.30.36:5555`
 - Running: ✅ Yes (PID: 210417)
 - Version: 3.3.0
 - GPU: Enabled
 - Language: Chinese (lang='ch')
 **VLM Server**:
 - Host: `192.168.30.36:11434` (Ollama)
 - Model: `qwen2.5vl:32b`
 - Status: Not tested yet in this pipeline
 ---
 ## Test Plan
 ### Test File
 - **File**: `201301_1324_AI1_page3.pdf`
 - **Expected signatures**: 2 (楊智惠, 張志銘)
 - **Current recall**: 100% (found both)
 - **Current precision**: 16.7% (2 correct out of 12 regions)
 ### Success Metrics After Improvements
 | Metric | Current | Target |
 |--------|---------|--------|
 | Signatures found | 2/2 (100%) | 2/2 (100%) |
 | False positives | 10 | < 2 |
 | Precision | 16.7% | > 80% |
 | Signatures split | 1 suspected (Regions 4 and 5) | 0 |
 | Printed text in regions | Yes | No |
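The "Current" column follows directly from the counts in the test results. The arithmetic is spelled out below as a small sketch; the function names are illustrative only.

```python
def precision(true_positives, detected):
    """Fraction of detected regions that are real signatures."""
    return true_positives / detected if detected else 0.0


def recall(true_positives, expected):
    """Fraction of expected signatures that were found."""
    return true_positives / expected if expected else 0.0


tp, detected, expected = 2, 12, 2
print(f"precision = {precision(tp, detected):.1%}")
print(f"recall    = {recall(tp, expected):.1%}")
print(f"false positives = {detected - tp}")
```

On a page with only two signatures, a single false positive already gives a precision of 2/3 (about 66.7%), so the "> 80%" target is reachable on this page only with zero false positives; the two targets are easier to satisfy together when measured over a larger sample.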
 ---
 ## Git Branch Strategy
 **Current branch**: `PaddleOCR-Cover`
 **Status**: Masking + Region Detection working, needs refinement
 **Recommended next steps**:
 1. Commit current state with tag: `paddleocr-v1-basic`
 2. Create feature branches:
   - `paddleocr-region-merging` - For Problem 1 solutions
   - `paddleocr-two-stage` - For Problem 2 solutions
 3. Merge best solution back to `PaddleOCR-Cover`
 ---
 ## Next Actions
 ### Immediate (Today)
 - [ ] Commit current working state
 - [ ] Test Phase 1 quick wins (padding + morphology)
 - [ ] Measure improvement
 ### Short-term (This week)
 - [ ] Implement Region Merging (Option B)
 - [ ] Implement Two-Stage OCR (Option E)
 - [ ] Add VLM verification
 - [ ] Test on 10 PDFs
 ### Long-term (Production)
 - [ ] Optimize performance (parallel processing)
 - [ ] Error handling and logging
 - [ ] Process full 86K dataset
 - [ ] Compare with previous hybrid approach (70% recall)
 ---
 ## Comparison: PaddleOCR vs Previous Hybrid Approach
 ### Previous Approach (VLM-Cover branch)
 - **Method**: VLM names + CV detection + VLM verification
 - **Results**: 70% recall, 100% precision
 - **Problem**: Missed 30% of signatures (CV parameters too conservative)
 ### PaddleOCR Approach (Current)
 - **Method**: PaddleOCR masking + CV detection + VLM verification
 - **Results**: 100% recall (found both signatures)
 - **Problem**: Low precision (many false positives), printed text not fully removed
 ### Winner: TBD
 - PaddleOCR shows **better recall potential**
 - After implementing refinements (Phase 2-3), should achieve **high recall + high precision**
 - Need to test on larger dataset to confirm
 ---
 **Document version**: 1.0
 **Last updated**: October 28, 2025
 **Author**: Claude Code
 **Status**: Ready for implementation
@@ -0,0 +1,281 @@
 # PP-OCRv5 Research Findings
 **Date**: 2025-11-27
 **Branch**: pp-ocrv5-research
 **Status**: research complete
 ---
 ## 📋 Summary
 We upgraded to PP-OCRv5 and tested it successfully. The key findings are:
 ### ✅ Completed
 1. PaddleOCR upgrade: 2.7.3 → 3.3.2
 2. New API understood and verified
 3. Handwriting detection capability tested
 4. Data structure analyzed
 ### ❌ Key limitation
 **PP-OCRv5 has no built-in handwritten vs printed text classification**
 ---
 ## 🔧 Technical Details
 ### API changes
 **Old API (2.7.3)**:
 ```python
 from paddleocr import PaddleOCR
 ocr = PaddleOCR(lang='ch', show_log=False)
 result = ocr.ocr(image_np, cls=False)
 ```
 **New API (3.3.2)**:
 ```python
 from paddleocr import PaddleOCR
 ocr = PaddleOCR(
    text_detection_model_name="PP-OCRv5_server_det",
    text_recognition_model_name="PP-OCRv5_server_rec",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False
    # ❌ No longer supported: show_log, cls
 )
 result = ocr.predict(image_path)  # ✅ use predict() instead of ocr()
 ```
 ### Main API differences
 | Feature | v2.7.3 | v3.3.2 |
 |------|--------|--------|
 | Initialization | `PaddleOCR(lang='ch')` | `PaddleOCR(text_detection_model_name=...)` |
 | Prediction method | `ocr.ocr()` | `ocr.predict()` |
 | `cls` argument | ✅ Supported | ❌ Removed |
 | `show_log` argument | ✅ Supported | ❌ Removed |
 | Return format | `[[[box], (text, conf)], ...]` | `OCRResult` object with a `.json` attribute |
 | Dependencies | Standalone | Requires PaddleX >=3.3.0 |
 ---
 ## 📊 Return Data Structure
 ### v3.3.2 return format
 ```python
 result = ocr.predict(image_path)
 json_data = result[0].json['res']
 # Available fields (schema, shown as field: type):
 json_data = {
    'input_path': str,                    # Input image path
    'page_index': None,                   # PDF page index (None for images)
    'model_settings': dict,               # Model configuration
    'dt_polys': list,                     # Detection polygons (N, 4, 2)
    'dt_scores': list,                    # Detection confidences
    'rec_texts': list,                    # Recognized text
    'rec_scores': list,                   # Recognition confidences
    'rec_boxes': list,                    # Rectangles [x_min, y_min, x_max, y_max]
    'rec_polys': list,                    # Recognition polygons
    'text_det_params': dict,              # Detection parameters
    'text_rec_score_thresh': float,       # Recognition threshold
    'text_type': str,                     # ⚠️ 'general' (language type, not a handwriting label)
    'textline_orientation_angles': list,  # Text line orientation angles
    'return_word_box': bool               # Whether word-level boxes are returned
 }
 ```
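For the masking step, only three of these fields are needed. The sketch below reads `rec_texts`, `rec_scores`, and `rec_boxes` from a dict of the shape listed above and converts each rectangle to the `(x, y, w, h)` form used elsewhere in the pipeline. The function name `boxes_from_v3_result`, the `min_score` parameter, and the sample values are hypothetical; the sample is hand-written, not real OCR output.

```python
def boxes_from_v3_result(res, min_score=0.0):
    """Turn a v3-style result dict into a list of text regions.

    Reads only the rec_texts, rec_scores, and rec_boxes fields.
    """
    regions = []
    for text, score, box in zip(res["rec_texts"], res["rec_scores"], res["rec_boxes"]):
        if score < min_score:
            continue  # skip low-confidence recognitions
        x_min, y_min, x_max, y_max = (int(v) for v in box)
        regions.append({
            "text": text,
            "confidence": float(score),
            "box": (x_min, y_min, x_max - x_min, y_max - y_min),  # (x, y, w, h)
        })
    return regions


sample = {
    "rec_texts": ["first line", "second line"],
    "rec_scores": [0.99, 0.42],
    "rec_boxes": [[10, 20, 110, 50], [10, 60, 90, 85]],
}
print(boxes_from_v3_result(sample, min_score=0.5))
```

Accessing fields by name keeps the client independent of positional ordering, which is what broke the old parsing code during the upgrade.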
 ---
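 ## 🔍 Handwriting Detection Test
 ### Question
 **Can PP-OCRv5 distinguish handwritten from printed text?**
 ### Result: ❌ No
 #### Test process
 1. ✅ Found the `text_type` field
 2. ❌ But `text_type = 'general'` is a **language type**, not a writing style
 3. ✅ Confirmed against the official documentation
 4. ❌ No field labels text as handwritten vs printed
 #### What the official documentation says
 - Possible `text_type` values: 'general', 'ch', 'en', 'japan', 'pinyin'
 - These values refer to the **language/script type**
 - They are **not** a handwritten vs printed classification
 ### Conclusion
 PP-OCRv5 can **recognize** handwritten text, but it does **not label** whether a text region is handwritten or printed.
 ---
 ## 📈 Performance Gains (per the official documentation)
 ### Handwritten text detection scores
 | Type | PP-OCRv4 | PP-OCRv5 | Relative gain |
 |------|----------|----------|------|
 | Handwritten Chinese | 0.706 | 0.803 | **+13.7%** |
 | Handwritten English | 0.249 | 0.841 | **+237%** |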
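The percentages in the last column are relative gains over the v4 score, not differences in points. The calculation is reproduced below; `relative_gain` is an illustrative name.

```python
def relative_gain(old, new):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


rows = [
    ("handwritten Chinese", 0.706, 0.803),
    ("handwritten English", 0.249, 0.841),
]
for label, old, new in rows:
    print(f"{label}: {old} -> {new} ({relative_gain(old, new):+.1%})")
```

The handwritten English figure works out to about +237.8%, which the table lists as +237%. In absolute terms the gains are 0.097 and 0.592 respectively.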
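 ### Measured results (full_page_original.png)
 **v3.3.2 (PP-OCRv5)**:
 - Detected **50** text regions
 - Average confidence: ~0.98
 - Examples:
  - "依本會計師核閱結果..." (0.9936)
  - "在所有重大方面有違反..." (0.9976)
 **Still to test**: comparison results for v2.7.3 (requires a rollback test)
 ---
 ## 💡 Upgrade Impact Analysis
 ### Advantages
 1. ✅ **Better handwriting detection** (+13.7% on the official handwritten Chinese benchmark)
 2. ✅ **May detect more handwritten regions**
 3. ✅ **Higher recognition confidence**
 4. ✅ **Unified pipeline architecture**
 ### Disadvantages
 1. ❌ **Cannot distinguish handwritten from printed text** (OpenCV Method 3 is still required)
 2. ⚠️ **Completely incompatible API** (the server code must be rewritten)
 3. ⚠️ **Depends on PaddleX** (an extra dependency)
 4. ⚠️ **OpenCV version upgrade** (4.6 → 4.10)
 ---
 ## 🎯 Impact on Our Project
 ### Current approach (v2.7.3 + OpenCV Method 3)
 ```
 PDF → PaddleOCR detection → mask printed text → OpenCV Method 3 separates handwriting → VLM verification
                                                 ↑ 86.5% handwriting retention
 ```
 ### PP-OCRv5 approach
 ```
 PDF → PP-OCRv5 detection → mask printed text → OpenCV Method 3 separates handwriting → VLM verification
       ↑ may detect more handwriting            ↑ still required!
 ```
 ### Key finding
 **PP-OCRv5 cannot replace OpenCV Method 3!**
 ---
 ## 🤔 Upgrade Recommendation
 ### Reasons to upgrade
 1. Better detection of handwritten text, and so potentially of signatures (+13.7% on the official handwritten Chinese benchmark)
 2. May reduce missed detections
 3. Higher recognition confidence can help downstream analysis
 ### Reasons not to upgrade
 1. The current approach is already stable (86.5% retention)
 2. OpenCV Method 3 is still required
 3. Rewriting against the new API is costly
 4. Extra dependencies and complexity
 ### Recommended decision
 **Phased upgrade strategy**:
 1. **Short term (now)**:
   - ✅ Keep the stable v2.7.3 setup
   - ✅ Keep using OpenCV Method 3
   - ✅ Test the current approach on more samples
 2. **Medium term (if optimization is needed)**:
   - Compare v2.7.3 and v3.3.2 on real signature samples
   - If v5 clearly reduces missed detections → upgrade
   - If the difference is small → stay on v2.7.3
 3. **Long term**:
   - Watch whether PaddleOCR adds handwriting classification
   - If it does → re-evaluate the value of upgrading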
 ---
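 ## 📝 Technical Debt Log
 ### If we decide to upgrade to v3.3.2
 Work to complete:
 1. **Server side**:
   - [ ] Rewrite `paddleocr_server.py` for the new API
   - [ ] Test GPU utilization and speed
   - [ ] Handle OpenCV 4.10 compatibility
   - [ ] Update the dependency documentation
 2. **Client side**:
   - [ ] Update `paddleocr_client.py` (if the REST interface changes)
   - [ ] Adapt to the new return format
 3. **Testing**:
   - [ ] Comparison test on 10+ samples
   - [ ] Performance benchmarks
   - [ ] Stability tests
 4. **Documentation**:
   - [ ] Update CURRENT_STATUS.md
   - [ ] Write an API migration guide
   - [ ] Update the deployment documentation
 ---
 ## ✅ Completed Work
 1. ✅ Upgraded PaddleOCR: 2.7.3 → 3.3.2
 2. ✅ Understood the new API structure
 3. ✅ Tested basic functionality
 4. ✅ Analyzed the returned data structure
 5. ✅ Tested for handwriting classification (conclusion: none)
 6. ✅ Verified against the official documentation
 7. ✅ Recorded the full research process
 ---
 ## 🎓 Lessons Learned
 1. **API upgrade risk**: major version upgrades usually bring breaking changes
 2. **Importance of verifying features**: the "handwriting support" mentioned in the docs does not mean "handwriting classification"
 3. **Value of the existing approach**: OpenCV Method 3 is still required
 4. **Performance vs complexity trade-off**: not every performance gain justifies an immediate upgrade
 ---
 ## 🔗 Related Documents
 - [CURRENT_STATUS.md](./CURRENT_STATUS.md) - current stable approach
 - [NEW_SESSION_HANDOFF.md](./NEW_SESSION_HANDOFF.md) - research task list
 - [PADDLEOCR_STATUS.md](./PADDLEOCR_STATUS.md) - detailed technical analysis
 ---
 ## 📌 Next Steps
 Recommendations for the user:
 1. **Immediate actions**:
   - Test the current approach on more PDF samples
   - Record success rates and failure cases
 2. **Evaluate the upgrade**:
   - If the current approach is satisfactory → stay on v2.7.3
   - If many signatures are missed → consider v3.3.2
 3. **Long-term monitoring**:
   - Watch the PaddleOCR GitHub issues
   - Track any updates that add handwriting classification
 ---
 **Conclusion**: PP-OCRv5 improves the handling of handwritten text, but it cannot replace OpenCV Method 3 for separating handwritten from printed text. The current approach (v2.7.3 + OpenCV Method 3) is good enough; unless a performance bottleneck appears, an immediate upgrade is not recommended.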
@@ -0,0 +1,110 @@
 # SAM3 Handwritten/Printed Region Segmentation: Research Results
 ## Test Environment
 - **Server**: Linux GPU (192.168.30.36)
 - **CUDA**: 13.0
 - **Python**: 3.12.3
 - **SAM3 version**: latest (released 2025/11/20)
 - **Model size**: 848M parameters
 ## Test Image
 - Source: scanned page from a CPA attestation report PDF
 - Size: 2481 x 3508 (downscaled to 1024 x 1447 for testing)
 - Content: KPMG logo, printed Chinese text, handwritten signatures (3), red stamps (2)
 ---
 ## Test Results
 ### Effective detection (score > 0.5)
 | Prompt | Regions | Top score | Result |
 |--------|--------|----------|----------|
 | `company logo` | 6 | **0.855** | ✅ KPMG logo detected accurately |
 | `logo` | 8 | **0.853** | ✅ KPMG logo detected accurately |
 | `stamp` | 24 | **0.705** | ✅ Both red stamps detected accurately |
 ### Ineffective detection (score < 0.2)
 | Prompt | Regions | Top score | Result |
 |--------|--------|----------|----------|
 | `handwritten signature` | 0 | - | ❌ Nothing detected |
 | `signature` | 0 | - | ❌ Nothing detected |
 | `handwriting` | 0 | - | ❌ Nothing detected |
 | `scribble` | 13 | 0.147 | ⚠️ Low score, inaccurate location |
 | `Chinese characters` | 11 | 0.069 | ⚠️ Very low score |
 ### No detections at all
 - `handwritten text`
 - `written name`
 - `cursive writing`
 - `autograph`
 - `red stamp` (although `stamp` works)
 - `calligraphy`
 ---
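 ## Key Findings
 ### SAM3 Strengths
 1. **Logo detection**: very accurate (scores of 0.85+)
 2. **Stamp detection**: works well (scores of 0.70+)
 3. **General object segmentation**: excellent on objects in natural scenes
 ### SAM3 Limitations
 1. **Cannot recognize handwritten signatures**: this is the most important finding
   - Every signature-related prompt scored close to 0
   - SAM3 may not have been trained on handwritten signatures in documents
 2. **Poor recognition of Chinese handwriting**:
   - `Chinese handwritten characters` returned no response
   - Possibly because the training data lacks Chinese handwriting samples
 3. **Weak performance on documents**:
   - SAM3 mainly targets natural-scene images
   - Limited support for scanned documents, tables, and similar content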
 ---
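 ## Conclusion
 ### SAM3 Is Not Suitable as the Primary Approach for Handwritten Signature Extraction
 **Reasons**:
 1. It cannot reliably recognize the concept of a "handwritten signature"
 2. Its support for Chinese handwriting is insufficient
 3. It performs far worse on scanned documents than on natural scenes
 ### Recommendation: Keep the Current Approach
 The current **PaddleOCR + OpenCV Method 3** approach (86.5% handwriting retention) remains the better choice:
 - PaddleOCR: trained specifically for text recognition, so it can accurately locate printed text
 - OpenCV: separates handwritten strokes effectively through masking and morphological processing
 ### Potential Uses for SAM3
 Although it is not suitable for handwritten signature extraction, SAM3 could be used to:
 - Detect and mask logo regions
 - Detect and exclude stamp interference
 - Serve as a supplementary tool in preprocessing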
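One way the "exclude stamp interference" idea could plug into the existing pipeline is to drop candidate regions that are mostly covered by a detected stamp or logo box. The sketch below assumes the detections have already been converted to `(x, y, w, h)` boxes, the same layout the pipeline scripts use for regions. `overlap_fraction`, `drop_covered_regions`, the 0.5 threshold, and the sample boxes are all hypothetical; no SAM3 API is called here.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def overlap_fraction(region: Box, other: Box) -> float:
    """Return the fraction of `region`'s area that is covered by `other`."""
    x1, y1, w1, h1 = region
    x2, y2, w2, h2 = other
    inter_w = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))
    inter_h = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))
    area = w1 * h1
    return (inter_w * inter_h) / area if area > 0 else 0.0


def drop_covered_regions(candidates: List[Dict],
                         exclude_boxes: List[Box],
                         max_overlap: float = 0.5) -> List[Dict]:
    """Keep candidates whose overlap with every excluded box is at most `max_overlap`."""
    return [
        c for c in candidates
        if all(overlap_fraction(c['box'], b) <= max_overlap for b in exclude_boxes)
    ]


# Made-up coordinates for illustration.
candidates = [{'box': (100, 100, 200, 80)}, {'box': (500, 500, 100, 100)}]
stamp_boxes = [(480, 480, 150, 150)]
kept = drop_covered_regions(candidates, stamp_boxes)
print(kept)
```

A signature that is stamped over would also be dropped by this filter, so in practice the threshold would need tuning against real pages, or the stamp pixels could be removed by color instead.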
 ---
 ## Visualizations
 Saved test result images:
 - `sam3_stamp_result.png` - stamp detection (high accuracy)
 - `sam3_logo_result.png` - logo detection (high accuracy)
 - `sam3_scribble_result.png` - scribble detection (low accuracy)
 ---
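 ## Follow-up Recommendations
 1. **Keep the existing approach**: PaddleOCR 2.7.3 + OpenCV Method 3
 2. **Optionally integrate SAM3**: as an auxiliary for logo/stamp detection
 3. **Explore other models**:
   - Dedicated handwriting detection models
   - Document analysis models (Document AI)
   - Document understanding models such as LayoutLM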
 ---
 *Test date: 2025-11-27*
 *Branch: sam3-research*
@@ -0,0 +1,75 @@
 #!/usr/bin/env python3
 """Check if rejected regions contain the missing signatures."""
 import base64
 import requests
 from pathlib import Path
 OLLAMA_URL = "http://192.168.30.36:11434"
 OLLAMA_MODEL = "qwen2.5vl:32b"
 REJECTED_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/signatures/rejected"
 # Missing signatures based on test results
 MISSING = {
    "201301_2061_AI1_page5": "林姿妤",
    "201301_2458_AI1_page4": "魏興海",
    "201301_2923_AI1_page3": "陈丽琦"
 }
 def encode_image_to_base64(image_path):
    """Encode image file to base64."""
    with open(image_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')
 def ask_vlm_about_signature(image_base64, expected_name):
    """Ask VLM if the image contains the expected signature."""
    prompt = f"""Does this image contain a handwritten signature with the Chinese name: "{expected_name}"?
 Look carefully for handwritten Chinese characters matching this name.
 Answer only 'yes' or 'no'."""
    payload = {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "images": [image_base64],
        "stream": False
    }
    try:
        response = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=60)
        response.raise_for_status()
        answer = response.json()['response'].strip().lower()
        return answer
    except Exception as e:
        return f"error: {str(e)}"
 # Check each missing signature
 for pdf_stem, missing_name in MISSING.items():
    print(f"\n{'='*80}")
    print(f"Checking rejected regions from: {pdf_stem}")
    print(f"Looking for missing signature: {missing_name}")
    print('='*80)
    # Find all rejected regions from this PDF
    rejected_regions = sorted(Path(REJECTED_PATH).glob(f"{pdf_stem}_region_*.png"))
    print(f"Found {len(rejected_regions)} rejected regions to check")
    for region_path in rejected_regions:
        region_name = region_path.name
        print(f"\nChecking: {region_name}...", end='', flush=True)
        # Encode and ask VLM
        image_base64 = encode_image_to_base64(region_path)
        answer = ask_vlm_about_signature(image_base64, missing_name)
        if 'yes' in answer:
            print(f" ✅ FOUND! This region contains {missing_name}")
            print(f"   → The signature was detected by CV but rejected by verification!")
        else:
            print(f" ❌ No (VLM says: {answer})")
 print(f"\n{'='*80}")
 print("Analysis complete!")
 print('='*80)
@@ -0,0 +1,415 @@
 #!/usr/bin/env python3
 """
 PaddleOCR Signature Extraction - Improved Pipeline
 Implements:
 - Method B: Region Merging (merge nearby regions to avoid splits)
 - Method E: Two-Stage Approach (second OCR pass on regions)
 Pipeline:
 1. PaddleOCR detects printed text on full page
 2. Mask printed text with padding
 3. Detect candidate regions
 4. Merge nearby regions (METHOD B)
 5. For each region: Run OCR again to remove remaining printed text (METHOD E)
 6. VLM verification (optional)
 7. Save cleaned handwriting regions
 """
 import fitz  # PyMuPDF
 import numpy as np
 import cv2
 from pathlib import Path
 from paddleocr_client import create_ocr_client
 from typing import List, Dict, Tuple
 import base64
 import requests
 # Configuration
 TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
 OUTPUT_DIR = "/Volumes/NV2/PDF-Processing/signature-image-output/paddleocr_improved"
 DPI = 300
 # PaddleOCR Settings
 MASKING_PADDING = 25  # Pixels to expand text boxes when masking
 # Region Detection Parameters
 MIN_REGION_AREA = 3000
 MAX_REGION_AREA = 300000
 MIN_ASPECT_RATIO = 0.3
 MAX_ASPECT_RATIO = 15.0
 # Region Merging Parameters (METHOD B)
 MERGE_DISTANCE_HORIZONTAL = 100  # pixels
 MERGE_DISTANCE_VERTICAL = 50     # pixels
 # VLM Settings (optional)
 USE_VLM_VERIFICATION = False  # Set to True to enable VLM filtering
 OLLAMA_URL = "http://192.168.30.36:11434"
 OLLAMA_MODEL = "qwen2.5vl:32b"
 def merge_nearby_regions(regions: List[Dict],
                        h_distance: int = 100,
                        v_distance: int = 50) -> List[Dict]:
    """
    Merge regions that are close to each other (METHOD B).
    Args:
        regions: List of region dicts with 'box': (x, y, w, h)
        h_distance: Maximum horizontal distance between regions to merge
        v_distance: Maximum vertical distance between regions to merge
    Returns:
        List of merged regions
    """
    if not regions:
        return []
    # Sort regions by y-coordinate (top to bottom)
    regions = sorted(regions, key=lambda r: r['box'][1])
    merged = []
    skip_indices = set()
    for i, region1 in enumerate(regions):
        if i in skip_indices:
            continue
        x1, y1, w1, h1 = region1['box']
        # Find all regions that should merge with this one
        merge_group = [region1]
        for j, region2 in enumerate(regions[i+1:], start=i+1):
            if j in skip_indices:
                continue
            x2, y2, w2, h2 = region2['box']
            # Calculate distances
            # Horizontal distance: gap between boxes horizontally
            h_dist = max(0, max(x1, x2) - min(x1 + w1, x2 + w2))
            # Vertical distance: gap between boxes vertically
            v_dist = max(0, max(y1, y2) - min(y1 + h1, y2 + h2))
            # Check if regions are close enough to merge
            if h_dist <= h_distance and v_dist <= v_distance:
                merge_group.append(region2)
                skip_indices.add(j)
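                # Update bounding box to include new region.
                # Compute the right/bottom edges first, before x1/y1 change,
                # otherwise the old width is added to the new origin.
                right = max(x1 + w1, x2 + w2)
                bottom = max(y1 + h1, y2 + h2)
                x1 = min(x1, x2)
                y1 = min(y1, y2)
                w1 = right - x1
                h1 = bottom - y1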
        # Create merged region
        merged_box = (x1, y1, w1, h1)
        merged_area = w1 * h1
        merged_aspect = w1 / h1 if h1 > 0 else 0
        merged.append({
            'box': merged_box,
            'area': merged_area,
            'aspect_ratio': merged_aspect,
            'merged_count': len(merge_group)
        })
    return merged
 def clean_region_with_ocr(region_image: np.ndarray,
                          ocr_client,
                          padding: int = 10) -> np.ndarray:
    """
    Remove printed text from a region using second OCR pass (METHOD E).
    Args:
        region_image: The region image to clean
        ocr_client: PaddleOCR client
        padding: Padding around detected text boxes
    Returns:
        Cleaned region with printed text masked
    """
    try:
        # Run OCR on this specific region
        text_boxes = ocr_client.get_text_boxes(region_image)
        if not text_boxes:
            return region_image  # No text found, return as-is
        # Mask detected printed text
        cleaned = region_image.copy()
        for (x, y, w, h) in text_boxes:
            # Add padding
            x_pad = max(0, x - padding)
            y_pad = max(0, y - padding)
            w_pad = min(cleaned.shape[1] - x_pad, w + 2*padding)
            h_pad = min(cleaned.shape[0] - y_pad, h + 2*padding)
            cv2.rectangle(cleaned, (x_pad, y_pad),
                         (x_pad + w_pad, y_pad + h_pad),
                         (255, 255, 255), -1)  # Fill with white
        return cleaned
    except Exception as e:
        print(f"      Warning: OCR cleaning failed: {e}")
        return region_image
 def verify_handwriting_with_vlm(image: np.ndarray) -> Tuple[bool, float]:
    """
    Use VLM to verify if image contains handwriting.
    Args:
        image: Region image (RGB numpy array)
    Returns:
        (is_handwriting: bool, confidence: float)
    """
    try:
        # Convert image to base64
        from PIL import Image
        from io import BytesIO
        pil_image = Image.fromarray(image.astype(np.uint8))
        buffered = BytesIO()
        pil_image.save(buffered, format="PNG")
        image_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
        # Ask VLM
        prompt = """Does this image contain handwritten text or a handwritten signature?
 Answer only 'yes' or 'no', followed by a confidence score 0-100.
 Format: yes 95 OR no 80"""
        payload = {
            "model": OLLAMA_MODEL,
            "prompt": prompt,
            "images": [image_base64],
            "stream": False
        }
        response = requests.post(f"{OLLAMA_URL}/api/generate",
                                json=payload, timeout=30)
        response.raise_for_status()
        answer = response.json()['response'].strip().lower()
        # Parse answer
        is_handwriting = 'yes' in answer
        # Try to extract confidence
        confidence = 0.5
        parts = answer.split()
        for part in parts:
            try:
                conf = float(part)
                if 0 <= conf <= 100:
                    confidence = conf / 100
                    break
            except ValueError:
                continue
        return is_handwriting, confidence
    except Exception as e:
        print(f"      Warning: VLM verification failed: {e}")
        return True, 0.5  # Default to accepting the region
 print("="*80)
 print("PaddleOCR Improved Pipeline - Region Merging + Two-Stage Cleaning")
 print("="*80)
 # Create output directory
 Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
 # Step 1: Connect to PaddleOCR
 print("\n1. Connecting to PaddleOCR server...")
 try:
    ocr_client = create_ocr_client()
    print(f"   ✅ Connected: {ocr_client.server_url}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 2: Render PDF
 print("\n2. Rendering PDF...")
 try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    original_image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
        pix.height, pix.width, pix.n)
    if pix.n == 4:
        original_image = cv2.cvtColor(original_image, cv2.COLOR_RGBA2RGB)
    print(f"   ✅ Rendered: {original_image.shape[1]}x{original_image.shape[0]}")
    doc.close()
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 3: Detect printed text (Stage 1)
 print("\n3. Detecting printed text (Stage 1 OCR)...")
 try:
    text_boxes = ocr_client.get_text_boxes(original_image)
    print(f"   ✅ Detected {len(text_boxes)} text regions")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 4: Mask printed text with padding
 print(f"\n4. Masking printed text (padding={MASKING_PADDING}px)...")
 try:
    masked_image = original_image.copy()
    for (x, y, w, h) in text_boxes:
        # Add padding
        x_pad = max(0, x - MASKING_PADDING)
        y_pad = max(0, y - MASKING_PADDING)
        w_pad = min(masked_image.shape[1] - x_pad, w + 2*MASKING_PADDING)
        h_pad = min(masked_image.shape[0] - y_pad, h + 2*MASKING_PADDING)
        cv2.rectangle(masked_image, (x_pad, y_pad),
                     (x_pad + w_pad, y_pad + h_pad), (0, 0, 0), -1)
    print(f"   ✅ Masked {len(text_boxes)} regions")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 5: Detect candidate regions
 print("\n5. Detecting candidate regions...")
 try:
    gray = cv2.cvtColor(masked_image, cv2.COLOR_RGB2GRAY)
    _, binary = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY_INV)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
    contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    candidate_regions = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        area = w * h
        aspect_ratio = w / h if h > 0 else 0
        if (MIN_REGION_AREA <= area <= MAX_REGION_AREA and
            MIN_ASPECT_RATIO <= aspect_ratio <= MAX_ASPECT_RATIO):
            candidate_regions.append({
                'box': (x, y, w, h),
                'area': area,
                'aspect_ratio': aspect_ratio
            })
    print(f"   ✅ Found {len(candidate_regions)} candidate regions")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 6: Merge nearby regions (METHOD B)
 print(f"\n6. Merging nearby regions (h_dist<={MERGE_DISTANCE_HORIZONTAL}, v_dist<={MERGE_DISTANCE_VERTICAL})...")
 try:
    merged_regions = merge_nearby_regions(
        candidate_regions,
        h_distance=MERGE_DISTANCE_HORIZONTAL,
        v_distance=MERGE_DISTANCE_VERTICAL
    )
    print(f"   ✅ Merged {len(candidate_regions)} → {len(merged_regions)} regions")
    for i, region in enumerate(merged_regions):
        if region['merged_count'] > 1:
            print(f"      Region {i+1}: Merged {region['merged_count']} sub-regions")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    import traceback
    traceback.print_exc()
    exit(1)
 # Step 7: Extract and clean each region (METHOD E)
 print("\n7. Extracting and cleaning regions (Stage 2 OCR)...")
 final_signatures = []
 for i, region in enumerate(merged_regions):
    x, y, w, h = region['box']
    print(f"\n   Region {i+1}/{len(merged_regions)}: ({x}, {y}, {w}, {h})")
    # Extract region from ORIGINAL image (not masked)
    padding = 10
    x_pad = max(0, x - padding)
    y_pad = max(0, y - padding)
    w_pad = min(original_image.shape[1] - x_pad, w + 2*padding)
    h_pad = min(original_image.shape[0] - y_pad, h + 2*padding)
    region_img = original_image[y_pad:y_pad+h_pad, x_pad:x_pad+w_pad].copy()
    print(f"      - Extracted: {region_img.shape[1]}x{region_img.shape[0]}px")
    # Clean with second OCR pass
    print(f"      - Running Stage 2 OCR to remove printed text...")
    cleaned_region = clean_region_with_ocr(region_img, ocr_client, padding=5)
    # VLM verification (optional)
    if USE_VLM_VERIFICATION:
        print(f"      - VLM verification...")
        is_handwriting, confidence = verify_handwriting_with_vlm(cleaned_region)
        print(f"      - VLM says: {'✅ Handwriting' if is_handwriting else '❌ Not handwriting'} (confidence: {confidence:.2f})")
        if not is_handwriting:
            print(f"      - Skipping (not handwriting)")
            continue
    # Save
    final_signatures.append({
        'image': cleaned_region,
        'box': region['box'],
        'original_image': region_img
    })
    print(f"      ✅ Kept as signature candidate")
 print(f"\n   ✅ Final signatures: {len(final_signatures)}")
 # Step 8: Save results
 print("\n8. Saving results...")
 for i, sig in enumerate(final_signatures):
    # Save cleaned signature
    sig_path = Path(OUTPUT_DIR) / f"signature_{i+1:02d}_cleaned.png"
    cv2.imwrite(str(sig_path), cv2.cvtColor(sig['image'], cv2.COLOR_RGB2BGR))
    # Save original region for comparison
    orig_path = Path(OUTPUT_DIR) / f"signature_{i+1:02d}_original.png"
    cv2.imwrite(str(orig_path), cv2.cvtColor(sig['original_image'], cv2.COLOR_RGB2BGR))
    print(f"   📁 Signature {i+1}: {sig_path.name}")
 # Save visualizations
 vis_merged = original_image.copy()
 for region in merged_regions:
    x, y, w, h = region['box']
    # Compare by box only: merged regions carry extra keys, so dict equality never matched
    color = (255, 0, 0) if region['box'] in [s['box'] for s in final_signatures] else (128, 128, 128)
    cv2.rectangle(vis_merged, (x, y), (x + w, y + h), color, 3)
 vis_path = Path(OUTPUT_DIR) / "visualization_merged_regions.png"
 cv2.imwrite(str(vis_path), cv2.cvtColor(vis_merged, cv2.COLOR_RGB2BGR))
 print(f"   📁 Visualization: {vis_path.name}")
 print("\n" + "="*80)
 print("Pipeline completed!")
 print(f"Results: {OUTPUT_DIR}")
 print("="*80)
 print(f"\nSummary:")
 print(f"  - Stage 1 OCR: {len(text_boxes)} text regions masked")
 print(f"  - Initial candidates: {len(candidate_regions)}")
 print(f"  - After merging: {len(merged_regions)}")
 print(f"  - Final signatures: {len(final_signatures)}")
 print(f"  - Expected signatures: 2 (楊智惠, 張志銘)")
 print("="*80)
@@ -0,0 +1,413 @@
 #!/usr/bin/env python3
 """
 YOLO-based signature extraction from PDF documents.
 Uses a trained YOLOv11n model to detect and extract handwritten signatures.
 Pipeline:
    PDF → Render to Image → YOLO Detection → Crop Signatures → Output
 """
 import csv
 import json
 import os
 import random
 import sys
 from datetime import datetime
 from pathlib import Path
 from typing import Optional
 import cv2
 import fitz  # PyMuPDF
 import numpy as np
 from ultralytics import YOLO
 # Configuration
 CSV_PATH = "/Volumes/NV2/PDF-Processing/master_signatures.csv"
 PDF_BASE_PATH = "/Volumes/NV2/PDF-Processing/total-pdf"
 OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output/yolo"
 OUTPUT_PATH_NO_STAMP = "/Volumes/NV2/PDF-Processing/signature-image-output/yolo_no_stamp"
 MODEL_PATH = "/Volumes/NV2/pdf_recognize/models/best.pt"
 # Detection parameters
 DPI = 300
 CONFIDENCE_THRESHOLD = 0.5
 def remove_red_stamp(image: np.ndarray) -> np.ndarray:
    """
    Remove red stamp pixels from an image by replacing them with white.
    Uses HSV color space to detect red regions (stamps are typically red/orange).
    Args:
        image: RGB image as numpy array
    Returns:
        Image with red stamp pixels replaced by white
    """
    # Convert to HSV
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
    # Red color wraps around in HSV, so we need two ranges
    # Range 1: H = 0-10 (red-orange)
    lower_red1 = np.array([0, 50, 50])
    upper_red1 = np.array([10, 255, 255])
    # Range 2: H = 160-180 (red-magenta)
    lower_red2 = np.array([160, 50, 50])
    upper_red2 = np.array([180, 255, 255])
    # Create masks for red regions
    mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
    mask2 = cv2.inRange(hsv, lower_red2, upper_red2)
    # Combine masks
    red_mask = cv2.bitwise_or(mask1, mask2)
    # Optional: dilate mask slightly to catch edges
    kernel = np.ones((3, 3), np.uint8)
    red_mask = cv2.dilate(red_mask, kernel, iterations=1)
    # Replace red pixels with white
    result = image.copy()
    result[red_mask > 0] = [255, 255, 255]
    return result
 class YOLOSignatureExtractor:
    """Extract signatures from PDF pages using YOLO object detection."""
    def __init__(self, model_path: str = MODEL_PATH, conf_threshold: float = CONFIDENCE_THRESHOLD):
        """
        Initialize the extractor with a trained YOLO model.
        Args:
            model_path: Path to the YOLO model weights
            conf_threshold: Minimum confidence threshold for detections
        """
        print(f"Loading YOLO model from {model_path}...")
        self.model = YOLO(model_path)
        self.conf_threshold = conf_threshold
        self.dpi = DPI
        print(f"Model loaded. Confidence threshold: {conf_threshold}")
    def render_pdf_page(self, pdf_path: str, page_num: int) -> Optional[np.ndarray]:
        """
        Render a PDF page to an image array.
        Args:
            pdf_path: Path to the PDF file
            page_num: Page number (1-indexed)
        Returns:
            RGB image as numpy array, or None if failed
        """
        try:
            doc = fitz.open(pdf_path)
            if page_num < 1 or page_num > len(doc):
                print(f"  Invalid page number: {page_num} (PDF has {len(doc)} pages)")
                doc.close()
                return None
            page = doc[page_num - 1]
            mat = fitz.Matrix(self.dpi / 72, self.dpi / 72)
            pix = page.get_pixmap(matrix=mat, alpha=False)
            image = np.frombuffer(pix.samples, dtype=np.uint8)
            image = image.reshape(pix.height, pix.width, pix.n)
            doc.close()
            return image
        except Exception as e:
            print(f"  Error rendering PDF: {e}")
            return None
    def detect_signatures(self, image: np.ndarray) -> list[dict]:
        """
        Detect signature regions in an image using YOLO.
        Args:
            image: RGB image as numpy array
        Returns:
            List of detected signatures with box coordinates and confidence
        """
        results = self.model(image, conf=self.conf_threshold, verbose=False)
        signatures = []
        for r in results:
            for box in r.boxes:
                x1, y1, x2, y2 = map(int, box.xyxy[0].cpu().numpy())
                conf = float(box.conf[0].cpu().numpy())
                signatures.append({
                    'box': (x1, y1, x2 - x1, y2 - y1),  # x, y, w, h format
                    'xyxy': (x1, y1, x2, y2),
                    'confidence': conf
                })
        # Sort by y-coordinate (top to bottom), then x-coordinate (left to right)
        signatures.sort(key=lambda s: (s['box'][1], s['box'][0]))
        return signatures
    def extract_signature_images(self, image: np.ndarray, signatures: list[dict]) -> list[np.ndarray]:
        """
        Crop signature regions from the image.
        Args:
            image: RGB image as numpy array
            signatures: List of detected signatures
        Returns:
            List of cropped signature images
        """
        cropped = []
        for sig in signatures:
            x, y, w, h = sig['box']
            # Ensure bounds are within image
            x = max(0, x)
            y = max(0, y)
            x2 = min(image.shape[1], x + w)
            y2 = min(image.shape[0], y + h)
            cropped.append(image[y:y2, x:x2])
        return cropped
    def create_visualization(self, image: np.ndarray, signatures: list[dict]) -> np.ndarray:
        """
        Create a visualization with detection boxes drawn on the image.
        Args:
            image: RGB image as numpy array
            signatures: List of detected signatures
        Returns:
            Image with drawn bounding boxes
        """
        vis = image.copy()
        for i, sig in enumerate(signatures):
            x1, y1, x2, y2 = sig['xyxy']
            conf = sig['confidence']
            # Draw box
            cv2.rectangle(vis, (x1, y1), (x2, y2), (255, 0, 0), 3)
            # Draw label
            label = f"sig{i+1}: {conf:.2f}"
            font_scale = 0.8
            thickness = 2
            (text_w, text_h), _ = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, font_scale, thickness)
            cv2.rectangle(vis, (x1, y1 - text_h - 10), (x1 + text_w + 5, y1), (255, 0, 0), -1)
            cv2.putText(vis, label, (x1 + 2, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX,
                        font_scale, (255, 255, 255), thickness)
        return vis
 def find_pdf_file(filename: str) -> Optional[str]:
    """
    Search for PDF file in batch directories.
    Args:
        filename: PDF filename to search for
    Returns:
        Full path if found, None otherwise
    """
    for batch_dir in sorted(Path(PDF_BASE_PATH).glob("batch_*")):
        pdf_path = batch_dir / filename
        if pdf_path.exists():
            return str(pdf_path)
    return None
 def load_csv_samples(csv_path: str, sample_size: int = 50, seed: int = 42) -> list[dict]:
    """
    Load random samples from the CSV file.
    Args:
        csv_path: Path to master_signatures.csv
        sample_size: Number of samples to load
        seed: Random seed for reproducibility
    Returns:
        List of dictionaries with filename and page info
    """
    with open(csv_path, 'r', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        all_rows = list(reader)
    random.seed(seed)
    samples = random.sample(all_rows, min(sample_size, len(all_rows)))
    return samples
 def process_samples(extractor: YOLOSignatureExtractor, samples: list[dict],
                    output_dir: str, output_dir_no_stamp: Optional[str] = None,
                    save_visualization: bool = True) -> dict:
    """
    Process a list of PDF samples and extract signatures.
    Args:
        extractor: YOLOSignatureExtractor instance
        samples: List of sample dictionaries from CSV
        output_dir: Output directory for signatures
        output_dir_no_stamp: Output directory for stamp-removed signatures (optional)
        save_visualization: Whether to save visualization images
    Returns:
        Results dictionary with statistics and per-file results
    """
    os.makedirs(output_dir, exist_ok=True)
    if save_visualization:
        os.makedirs(os.path.join(output_dir, "visualization"), exist_ok=True)
    # Create no-stamp output directory if specified
    if output_dir_no_stamp:
        os.makedirs(output_dir_no_stamp, exist_ok=True)
    results = {
        'timestamp': datetime.now().isoformat(),
        'total_samples': len(samples),
        'processed': 0,
        'pdf_not_found': 0,
        'render_failed': 0,
        'total_signatures': 0,
        'files': {}
    }
    for i, row in enumerate(samples):
        filename = row['filename']
        page_num = int(row['page'])
        base_name = Path(filename).stem
        print(f"[{i+1}/{len(samples)}] Processing: {filename}, page {page_num}...", end=' ', flush=True)
        # Find PDF
        pdf_path = find_pdf_file(filename)
        if pdf_path is None:
            print("PDF NOT FOUND")
            results['pdf_not_found'] += 1
            results['files'][filename] = {'status': 'pdf_not_found'}
            continue
        # Render page
        image = extractor.render_pdf_page(pdf_path, page_num)
        if image is None:
            print("RENDER FAILED")
            results['render_failed'] += 1
            results['files'][filename] = {'status': 'render_failed'}
            continue
        # Detect signatures
        signatures = extractor.detect_signatures(image)
        num_sigs = len(signatures)
        results['total_signatures'] += num_sigs
        results['processed'] += 1
        print(f"Found {num_sigs} signature(s)")
        # Extract and save signature crops
        crops = extractor.extract_signature_images(image, signatures)
        for j, (crop, sig) in enumerate(zip(crops, signatures)):
            crop_filename = f"{base_name}_page{page_num}_sig{j+1}.png"
            crop_path = os.path.join(output_dir, crop_filename)
            cv2.imwrite(crop_path, cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))
            # Save stamp-removed version if output dir specified
            if output_dir_no_stamp:
                crop_no_stamp = remove_red_stamp(crop)
                crop_no_stamp_path = os.path.join(output_dir_no_stamp, crop_filename)
                cv2.imwrite(crop_no_stamp_path, cv2.cvtColor(crop_no_stamp, cv2.COLOR_RGB2BGR))
        # Save visualization
        if save_visualization and signatures:
            vis_image = extractor.create_visualization(image, signatures)
            vis_filename = f"{base_name}_page{page_num}_annotated.png"
            vis_path = os.path.join(output_dir, "visualization", vis_filename)
            cv2.imwrite(vis_path, cv2.cvtColor(vis_image, cv2.COLOR_RGB2BGR))
        # Store file results
        results['files'][filename] = {
            'status': 'success',
            'page': page_num,
            'signatures': [
                {
                    'box': list(sig['box']),
                    'confidence': sig['confidence']
                }
                for sig in signatures
            ]
        }
    return results
 def print_summary(results: dict):
    """Print processing summary."""
    print("\n" + "=" * 60)
    print("YOLO SIGNATURE EXTRACTION SUMMARY")
    print("=" * 60)
    print(f"Total samples:        {results['total_samples']}")
    print(f"Successfully processed: {results['processed']}")
    print(f"PDFs not found:       {results['pdf_not_found']}")
    print(f"Render failed:        {results['render_failed']}")
    print(f"Total signatures found: {results['total_signatures']}")
    if results['processed'] > 0:
        avg_sigs = results['total_signatures'] / results['processed']
        print(f"Average signatures/page: {avg_sigs:.2f}")
    print("=" * 60)
 def main():
    """Main entry point for signature extraction."""
    print("=" * 60)
    print("YOLO Signature Extraction Pipeline")
    print("=" * 60)
    print(f"Model: {MODEL_PATH}")
    print(f"CSV: {CSV_PATH}")
    print(f"Output (original): {OUTPUT_PATH}")
    print(f"Output (no stamp): {OUTPUT_PATH_NO_STAMP}")
    print(f"Confidence threshold: {CONFIDENCE_THRESHOLD}")
    print("=" * 60 + "\n")
    # Initialize extractor
    extractor = YOLOSignatureExtractor(MODEL_PATH, CONFIDENCE_THRESHOLD)
    # Load samples
    print("\nLoading samples from CSV...")
    samples = load_csv_samples(CSV_PATH, sample_size=50, seed=42)
    print(f"Loaded {len(samples)} samples\n")
    # Process samples (with stamp removal)
    results = process_samples(
        extractor, samples, OUTPUT_PATH,
        output_dir_no_stamp=OUTPUT_PATH_NO_STAMP,
        save_visualization=True
    )
    # Save results JSON
    results_path = os.path.join(OUTPUT_PATH, "results.json")
    with open(results_path, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to: {results_path}")
    # Print summary
    print_summary(results)
    print(f"\nStamp-removed signatures saved to: {OUTPUT_PATH_NO_STAMP}")
 if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\n\nProcess interrupted by user.")
        sys.exit(1)
    except Exception as e:
        print(f"\n\nFATAL ERROR: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)
@@ -0,0 +1,169 @@
 #!/usr/bin/env python3
 """
 PaddleOCR Client
 Connects to remote PaddleOCR server for OCR inference
 """
 import requests
 import base64
 import numpy as np
 from typing import List, Dict, Tuple, Optional
 from PIL import Image
 from io import BytesIO
 class PaddleOCRClient:
    """Client for remote PaddleOCR server."""
    def __init__(self, server_url: str = "http://192.168.30.36:5555"):
        """
        Initialize PaddleOCR client.
        Args:
            server_url: URL of the PaddleOCR server
        """
        self.server_url = server_url.rstrip('/')
        self.timeout = 30  # seconds
    def health_check(self) -> bool:
        """
        Check if server is healthy.
        Returns:
            True if server is healthy, False otherwise
        """
        try:
            response = requests.get(
                f"{self.server_url}/health",
                timeout=5
            )
            return response.status_code == 200 and response.json().get('status') == 'ok'
        except Exception as e:
            print(f"Health check failed: {e}")
            return False
    def ocr(self, image: np.ndarray) -> List[Dict]:
        """
        Perform OCR on an image.
        Args:
            image: numpy array of the image (RGB format)
        Returns:
            List of detection results, each containing:
                - box: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
                - text: detected text string
                - confidence: confidence score (0-1)
        Raises:
            Exception if OCR fails
        """
        # Convert numpy array to PIL Image
        if len(image.shape) == 2:  # Grayscale
            pil_image = Image.fromarray(image)
        else:  # RGB or RGBA
            pil_image = Image.fromarray(image.astype(np.uint8))
        # Encode to base64
        buffered = BytesIO()
        pil_image.save(buffered, format="PNG")
        image_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
        # Send request
        try:
            response = requests.post(
                f"{self.server_url}/ocr",
                json={"image": image_base64},
                timeout=self.timeout
            )
            response.raise_for_status()
            result = response.json()
            if not result.get('success'):
                error_msg = result.get('error', 'Unknown error')
                raise Exception(f"OCR failed: {error_msg}")
            return result.get('results', [])
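        except requests.exceptions.Timeout as e:
            raise Exception(f"OCR request timed out after {self.timeout} seconds") from e
        except requests.exceptions.ConnectionError as e:
            raise Exception(f"Could not connect to server at {self.server_url}") from e
        except (requests.exceptions.RequestException, ValueError) as e:
            # HTTP errors and malformed JSON. Server-reported failures
            # ("OCR failed: ...") propagate unchanged instead of being re-wrapped.
            raise Exception(f"OCR request failed: {e}") from e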
    def get_text_boxes(self, image: np.ndarray) -> List[Tuple[int, int, int, int]]:
        """
        Get bounding boxes of all detected text.
        Args:
            image: numpy array of the image
        Returns:
            List of bounding boxes as (x, y, w, h) tuples
        """
        results = self.ocr(image)
        boxes = []
        for result in results:
            box = result['box']  # [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
            # Convert polygon to bounding box
            xs = [point[0] for point in box]
            ys = [point[1] for point in box]
            x = int(min(xs))
            y = int(min(ys))
            w = int(max(xs) - min(xs))
            h = int(max(ys) - min(ys))
            boxes.append((x, y, w, h))
        return boxes
    def __repr__(self):
        return f"PaddleOCRClient(server_url='{self.server_url}')"
 # Convenience function
 def create_ocr_client(server_url: str = "http://192.168.30.36:5555") -> PaddleOCRClient:
    """
    Create and test PaddleOCR client.
    Args:
        server_url: URL of the PaddleOCR server
    Returns:
        PaddleOCRClient instance
    Raises:
        Exception if server is not reachable
    """
    client = PaddleOCRClient(server_url)
    if not client.health_check():
        raise Exception(
            f"PaddleOCR server at {server_url} is not responding. "
            "Make sure the server is running on the Linux machine."
        )
    return client
 if __name__ == "__main__":
    # Test the client
    print("Testing PaddleOCR client...")
    try:
        client = create_ocr_client()
        print(f"✅ Connected to server: {client.server_url}")
        # Create a test image
        test_image = np.ones((100, 100, 3), dtype=np.uint8) * 255
        print("Running test OCR...")
        results = client.ocr(test_image)
        print(f"✅ OCR test successful! Found {len(results)} text regions")
    except Exception as e:
        print(f"❌ Error: {e}")
@@ -0,0 +1,91 @@
 #!/usr/bin/env python3
 """
 PaddleOCR Server v5 (PP-OCRv5)
 Flask HTTP server exposing PaddleOCR v3.3.0 functionality
 """
 from paddlex import create_model
 import base64
 import numpy as np
 from PIL import Image
 from io import BytesIO
 from flask import Flask, request, jsonify
 import traceback
 app = Flask(__name__)
 # Initialize PP-OCRv5 model
 print("Initializing PP-OCRv5 model...")
 model = create_model("PP-OCRv5_server")
 print("PP-OCRv5 model loaded successfully!")
@app.route('/health', methods=['GET'])
 def health():
    """Health check endpoint."""
    return jsonify({
        'status': 'ok',
        'service': 'paddleocr-server-v5',
        'version': '3.3.0',
        'model': 'PP-OCRv5_server',
        'gpu_enabled': True
    })
@app.route('/ocr', methods=['POST'])
 def ocr_endpoint():
    """
    OCR endpoint using PP-OCRv5.
    Accepts: {"image": "base64_encoded_image"}
    Returns: {"success": true, "count": N, "results": [...]}
    """
    try:
        # Parse request
        data = request.get_json()
        image_base64 = data['image']
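        # Decode image; force 3-channel RGB so grayscale/RGBA PNGs do not reach the model
        image_bytes = base64.b64decode(image_base64)
        image = Image.open(BytesIO(image_bytes)).convert('RGB')
        image_np = np.array(image)
        # Run OCR with PP-OCRv5; predict() may return a generator, so
        # materialize it before indexing
        result = list(model.predict(image_np))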
        # Format results
        formatted_results = []
        if result and 'dt_polys' in result[0] and 'rec_text' in result[0]:
            dt_polys = result[0]['dt_polys']
            rec_texts = result[0]['rec_text']
            rec_scores = result[0]['rec_score']
            for i in range(len(dt_polys)):
                box = dt_polys[i].tolist()  # Convert to list
                text = rec_texts[i]
                confidence = float(rec_scores[i])
                formatted_results.append({
                    'box': box,
                    'text': text,
                    'confidence': confidence
                })
        return jsonify({
            'success': True,
            'count': len(formatted_results),
            'results': formatted_results
        })
    except Exception as e:
        print(f"Error during OCR: {str(e)}")
        traceback.print_exc()
        return jsonify({
            'success': False,
            'error': str(e)
        }), 500
 if __name__ == '__main__':
    print("Starting PP-OCRv5 server on port 5555...")
    print("Model: PP-OCRv5_server")
    print("Version: 3.3.0")
    app.run(host='0.0.0.0', port=5555, debug=False)
@@ -0,0 +1,493 @@
 #!/usr/bin/env python3
 """
 Ablation Study: Backbone Comparison for Signature Feature Extraction
 ====================================================================
 Compares ResNet-50 vs VGG-16 vs EfficientNet-B0 on:
  1. Feature extraction speed
  2. Intra/Inter class cosine similarity separation (Cohen's d)
  3. KDE crossover point
  4. Firm A (known replication) distribution
 Usage:
  python ablation_backbone_comparison.py              # Run all backbones
  python ablation_backbone_comparison.py --extract     # Feature extraction only
  python ablation_backbone_comparison.py --analyze     # Analysis only (features must exist)
 """
 import torch
 import torch.nn as nn
 import torchvision.models as models
 import torchvision.transforms as transforms
 from torch.utils.data import Dataset, DataLoader
 import numpy as np
 import sqlite3
 import time
 import argparse
 import json
 from pathlib import Path
 from collections import defaultdict
 from tqdm import tqdm
 import warnings
 warnings.filterwarnings('ignore')
 # === Configuration ===
 IMAGES_DIR = Path("/Volumes/NV2/PDF-Processing/yolo-signatures/images")
 FEATURES_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/features")
 DB_PATH = Path("/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db")
 OUTPUT_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/ablation")
 FILENAMES_PATH = FEATURES_DIR / "signature_filenames.txt"
 BATCH_SIZE = 64
 NUM_WORKERS = 4
 DEVICE = torch.device("mps" if torch.backends.mps.is_available() else
                      "cuda" if torch.cuda.is_available() else "cpu")
 # Sampling for analysis
 INTER_CLASS_SAMPLE_SIZE = 500_000
 INTRA_CLASS_MIN_SIGNATURES = 3
 RANDOM_SEED = 42
 # Known replication firm (Deloitte Taiwan = 勤業眾信)
 FIRM_A_NAME = "勤業眾信聯合"
 BACKBONES = {
    "resnet50": {
        "model_fn": lambda: models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2),
        "feature_dim": 2048,
        "description": "ResNet-50 (ImageNet1K_V2)",
    },
    "vgg16": {
        "model_fn": lambda: models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1),
        "feature_dim": 4096,
        "description": "VGG-16 (ImageNet1K_V1)",
    },
    "efficientnet_b0": {
        "model_fn": lambda: models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1),
        "feature_dim": 1280,
        "description": "EfficientNet-B0 (ImageNet1K_V1)",
    },
 }
 class SignatureDataset(Dataset):
    def __init__(self, image_paths, transform=None):
        self.image_paths = image_paths
        self.transform = transform
    def __len__(self):
        return len(self.image_paths)
    def __getitem__(self, idx):
        import cv2
        img_path = self.image_paths[idx]
        img = cv2.imread(str(img_path))
        if img is None:
            img = np.ones((224, 224, 3), dtype=np.uint8) * 255
        else:
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = self._resize_with_padding(img, 224, 224)
        if self.transform:
            img = self.transform(img)
        return img, str(img_path.name)
    @staticmethod
    def _resize_with_padding(img, target_w, target_h):
        h, w = img.shape[:2]
        scale = min(target_w / w, target_h / h)
        new_w, new_h = int(w * scale), int(h * scale)
        import cv2
        resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_AREA)
        canvas = np.ones((target_h, target_w, 3), dtype=np.uint8) * 255
        x_off = (target_w - new_w) // 2
        y_off = (target_h - new_h) // 2
        canvas[y_off:y_off+new_h, x_off:x_off+new_w] = resized
        return canvas
 def build_feature_extractor(backbone_name):
    """Build a feature extractor for the given backbone."""
    config = BACKBONES[backbone_name]
    model = config["model_fn"]()
    if backbone_name == "vgg16":
        features_part = model.features
        avgpool = model.avgpool
        # Drop last Linear (classifier) to get 4096-dim output
        classifier_part = nn.Sequential(*list(model.classifier.children())[:-1])
        class VGGFeatureExtractor(nn.Module):
            def __init__(self, features, avgpool, classifier):
                super().__init__()
                self.features = features
                self.avgpool = avgpool
                self.classifier = classifier
            def forward(self, x):
                x = self.features(x)
                x = self.avgpool(x)
                x = torch.flatten(x, 1)
                x = self.classifier(x)
                return x
        model = VGGFeatureExtractor(features_part, avgpool, classifier_part)
    elif backbone_name == "resnet50":
        model = nn.Sequential(*list(model.children())[:-1])
    elif backbone_name == "efficientnet_b0":
        model.classifier = nn.Identity()
    model = model.to(DEVICE)
    model.eval()
    return model
 def extract_features(backbone_name):
    """Extract features for all signatures using the given backbone."""
    print(f"\n{'='*60}")
    print(f"Extracting features: {BACKBONES[backbone_name]['description']}")
    print(f"{'='*60}")
    output_path = OUTPUT_DIR / f"features_{backbone_name}.npy"
    if output_path.exists():
        print(f"  Features already exist: {output_path}")
        print(f"  Skipping extraction. Delete file to re-extract.")
        return np.load(output_path)
    # Load filenames
    with open(FILENAMES_PATH) as f:
        filenames = [line.strip() for line in f if line.strip()]
    print(f"  Images: {len(filenames):,}")
    image_paths = [IMAGES_DIR / fn for fn in filenames]
    # Build model
    model = build_feature_extractor(backbone_name)
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    dataset = SignatureDataset(image_paths, transform=transform)
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False,
                           num_workers=NUM_WORKERS, pin_memory=True)
    all_features = []
    start_time = time.time()
    with torch.no_grad():
        for images, _ in tqdm(dataloader, desc=f"  {backbone_name}"):
            images = images.to(DEVICE)
            feats = model(images)
            feats = feats.view(feats.size(0), -1)  # flatten
            feats = nn.functional.normalize(feats, p=2, dim=1)  # L2 normalize
            all_features.append(feats.cpu().numpy())
    elapsed = time.time() - start_time
    all_features = np.vstack(all_features)
    print(f"  Feature shape: {all_features.shape}")
    print(f"  Time: {elapsed:.1f}s ({elapsed/60:.1f}min)")
    print(f"  Speed: {len(filenames)/elapsed:.1f} images/sec")
    # Save
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    np.save(output_path, all_features)
    print(f"  Saved: {output_path} ({all_features.nbytes / 1e9:.2f} GB)")
    return all_features
 def load_accountant_data():
    """Load accountant assignments and firm info from DB."""
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    cur.execute('''
        SELECT image_filename, assigned_accountant
        FROM signatures
        WHERE feature_vector IS NOT NULL
        ORDER BY signature_id
    ''')
    sig_rows = cur.fetchall()
    cur.execute('SELECT name, firm FROM accountants')
    acct_firm = {r[0]: r[1] for r in cur.fetchall()}
    conn.close()
    filename_to_acct = {r[0]: r[1] for r in sig_rows}
    return filename_to_acct, acct_firm
 def analyze_backbone(backbone_name, features, filenames, filename_to_acct, acct_firm):
    """Compute intra/inter class stats for a backbone's features."""
    print(f"\n{'='*60}")
    print(f"Analyzing: {BACKBONES[backbone_name]['description']}")
    print(f"{'='*60}")
    np.random.seed(RANDOM_SEED)
    # Map features to accountants
    accountants = []
    valid_indices = []
    for i, fn in enumerate(filenames):
        acct = filename_to_acct.get(fn)
        if acct:
            accountants.append(acct)
            valid_indices.append(i)
    valid_features = features[valid_indices]
    print(f"  Valid signatures with accountant: {len(valid_indices):,}")
    # Group by accountant
    acct_groups = defaultdict(list)
    for i, acct in enumerate(accountants):
        acct_groups[acct].append(i)
    # --- Intra-class ---
    print("  Computing intra-class similarities...")
    intra_sims = []
    for acct, indices in tqdm(acct_groups.items(), desc="  Intra-class", leave=False):
        if len(indices) < INTRA_CLASS_MIN_SIGNATURES:
            continue
        vecs = valid_features[indices]
        sim_matrix = vecs @ vecs.T
        n = len(indices)
        triu_idx = np.triu_indices(n, k=1)
        intra_sims.extend(sim_matrix[triu_idx].tolist())
    intra_sims = np.array(intra_sims)
    print(f"  Intra-class pairs: {len(intra_sims):,}")
    # --- Inter-class ---
    print("  Computing inter-class similarities...")
    all_acct_list = list(acct_groups.keys())
    inter_sims = []
    for _ in range(INTER_CLASS_SAMPLE_SIZE):
        a1, a2 = np.random.choice(len(all_acct_list), 2, replace=False)
        i1 = np.random.choice(acct_groups[all_acct_list[a1]])
        i2 = np.random.choice(acct_groups[all_acct_list[a2]])
        sim = float(valid_features[i1] @ valid_features[i2])
        inter_sims.append(sim)
    inter_sims = np.array(inter_sims)
    print(f"  Inter-class pairs: {len(inter_sims):,}")
    # --- Firm A (known replication) ---
    print(f"  Computing Firm A ({FIRM_A_NAME}) distribution...")
    firm_a_accts = [acct for acct in acct_groups if acct_firm.get(acct) == FIRM_A_NAME]
    firm_a_sims = []
    for acct in firm_a_accts:
        indices = acct_groups[acct]
        if len(indices) < 2:
            continue
        vecs = valid_features[indices]
        sim_matrix = vecs @ vecs.T
        n = len(indices)
        triu_idx = np.triu_indices(n, k=1)
        firm_a_sims.extend(sim_matrix[triu_idx].tolist())
    firm_a_sims = np.array(firm_a_sims) if firm_a_sims else np.array([])
    print(f"  Firm A accountants: {len(firm_a_accts)}, pairs: {len(firm_a_sims):,}")
    # --- Statistics ---
    def dist_stats(arr, name):
        return {
            "name": name,
            "n": len(arr),
            "mean": float(np.mean(arr)),
            "std": float(np.std(arr)),
            "median": float(np.median(arr)),
            "p1": float(np.percentile(arr, 1)),
            "p5": float(np.percentile(arr, 5)),
            "p25": float(np.percentile(arr, 25)),
            "p75": float(np.percentile(arr, 75)),
            "p95": float(np.percentile(arr, 95)),
            "p99": float(np.percentile(arr, 99)),
            "min": float(np.min(arr)),
            "max": float(np.max(arr)),
        }
    intra_stats = dist_stats(intra_sims, "intra")
    inter_stats = dist_stats(inter_sims, "inter")
    firm_a_stats = dist_stats(firm_a_sims, "firm_a") if len(firm_a_sims) > 0 else None
    # Cohen's d, using an unweighted pooled std (group sizes are ignored)
    pooled_std = np.sqrt((intra_stats["std"]**2 + inter_stats["std"]**2) / 2)
    cohens_d = (intra_stats["mean"] - inter_stats["mean"]) / pooled_std if pooled_std > 0 else 0
    # KDE crossover
    try:
        from scipy.stats import gaussian_kde
        x_grid = np.linspace(0, 1, 1000)
        kde_intra = gaussian_kde(intra_sims)
        kde_inter = gaussian_kde(inter_sims)
        diff = kde_intra(x_grid) - kde_inter(x_grid)
        sign_changes = np.where(np.diff(np.sign(diff)))[0]
        crossovers = x_grid[sign_changes]
        valid_crossovers = crossovers[(crossovers > 0.5) & (crossovers < 1.0)]
        kde_crossover = float(valid_crossovers[-1]) if len(valid_crossovers) > 0 else None
    except Exception as e:
        print(f"  KDE crossover computation failed: {e}")
        kde_crossover = None
    results = {
        "backbone": backbone_name,
        "description": BACKBONES[backbone_name]["description"],
        "feature_dim": BACKBONES[backbone_name]["feature_dim"],
        "intra": intra_stats,
        "inter": inter_stats,
        "firm_a": firm_a_stats,
        "cohens_d": float(cohens_d),
        "kde_crossover": kde_crossover,
    }
    # Print summary
    print(f"\n  --- {backbone_name} Summary ---")
    print(f"  Feature dim:    {results['feature_dim']}")
    print(f"  Intra mean:     {intra_stats['mean']:.4f} +/- {intra_stats['std']:.4f}")
    print(f"  Inter mean:     {inter_stats['mean']:.4f} +/- {inter_stats['std']:.4f}")
    print(f"  Cohen's d:      {cohens_d:.4f}")
    print(f"  KDE crossover:  {kde_crossover}")
    if firm_a_stats:
        print(f"  Firm A mean:    {firm_a_stats['mean']:.4f} +/- {firm_a_stats['std']:.4f}")
        print(f"  Firm A 1st pct: {firm_a_stats['p1']:.4f}")
    return results
 def generate_comparison_table(all_results):
    """Generate a markdown comparison table."""
    print(f"\n{'='*60}")
    print("COMPARISON TABLE")
    print(f"{'='*60}\n")
    results_by_name = {r["backbone"]: r for r in all_results}
    def get_val(backbone, key, sub=None):
        r = results_by_name.get(backbone)
        if not r:
            return None
        if sub:
            section = r.get(sub)
            if isinstance(section, dict):
                return section.get(key)
            return None
        return r.get(key)
    def fmt(val, fmt_str=".4f"):
        if val is None:
            return "---"
        if isinstance(val, int):
            return str(val)
        return f"{val:{fmt_str}}"
    names = ["resnet50", "vgg16", "efficientnet_b0"]
    header = "| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |"
    sep    = "|--------|-----------|--------|-----------------|"
    rows = [
        f"| Feature dim | {fmt(get_val('resnet50','feature_dim'),'')} | {fmt(get_val('vgg16','feature_dim'),'')} | {fmt(get_val('efficientnet_b0','feature_dim'),'')} |",
        f"| Intra mean | {fmt(get_val('resnet50','mean','intra'))} | {fmt(get_val('vgg16','mean','intra'))} | {fmt(get_val('efficientnet_b0','mean','intra'))} |",
        f"| Intra std | {fmt(get_val('resnet50','std','intra'))} | {fmt(get_val('vgg16','std','intra'))} | {fmt(get_val('efficientnet_b0','std','intra'))} |",
        f"| Inter mean | {fmt(get_val('resnet50','mean','inter'))} | {fmt(get_val('vgg16','mean','inter'))} | {fmt(get_val('efficientnet_b0','mean','inter'))} |",
        f"| Inter std | {fmt(get_val('resnet50','std','inter'))} | {fmt(get_val('vgg16','std','inter'))} | {fmt(get_val('efficientnet_b0','std','inter'))} |",
        f"| **Cohen's d** | **{fmt(get_val('resnet50','cohens_d'))}** | **{fmt(get_val('vgg16','cohens_d'))}** | **{fmt(get_val('efficientnet_b0','cohens_d'))}** |",
        f"| KDE crossover | {fmt(get_val('resnet50','kde_crossover'))} | {fmt(get_val('vgg16','kde_crossover'))} | {fmt(get_val('efficientnet_b0','kde_crossover'))} |",
        f"| Firm A mean | {fmt(get_val('resnet50','mean','firm_a'))} | {fmt(get_val('vgg16','mean','firm_a'))} | {fmt(get_val('efficientnet_b0','mean','firm_a'))} |",
        f"| Firm A 1st pct | {fmt(get_val('resnet50','p1','firm_a'))} | {fmt(get_val('vgg16','p1','firm_a'))} | {fmt(get_val('efficientnet_b0','p1','firm_a'))} |",
    ]
    table = "\n".join([header, sep] + rows)
    print(table)
    # Save report
    report_path = OUTPUT_DIR / "ablation_comparison.md"
    with open(report_path, 'w') as f:
        f.write("# Ablation Study: Backbone Comparison\n\n")
        f.write(f"Date: {time.strftime('%Y-%m-%d %H:%M')}\n\n")
        f.write("## Comparison Table\n\n")
        f.write(table + "\n\n")
        f.write("## Interpretation\n\n")
        f.write("- **Cohen's d**: Higher = better separation between same-CPA and different-CPA signatures\n")
        f.write("- **KDE crossover**: Point where the intra- and inter-class densities intersect; the Bayes-optimal decision boundary under equal priors (higher = easier to classify)\n")
        f.write("- **Firm A**: Known replication firm; expect very high mean similarity\n")
        f.write("- **Firm A 1st percentile**: Lower bound of known-replication similarity\n")
    json_path = OUTPUT_DIR / "ablation_results.json"
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(all_results, f, indent=2, ensure_ascii=False)
    print(f"\n  Report saved: {report_path}")
    print(f"  Raw data saved: {json_path}")
    return table
 def main():
    parser = argparse.ArgumentParser(description="Ablation: backbone comparison")
    parser.add_argument("--extract", action="store_true", help="Feature extraction only")
    parser.add_argument("--analyze", action="store_true", help="Analysis only")
    parser.add_argument("--backbone", type=str, choices=list(BACKBONES.keys()),
                        help="Run single backbone (resnet50/vgg16/efficientnet_b0)")
    args = parser.parse_args()
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    # Load filenames
    with open(FILENAMES_PATH) as f:
        filenames = [line.strip() for line in f if line.strip()]
    backbones_to_run = [args.backbone] if args.backbone else list(BACKBONES.keys())
    if not args.analyze:
        # === Phase 1: Feature Extraction ===
        print("\n" + "=" * 60)
        print("PHASE 1: FEATURE EXTRACTION")
        print("=" * 60)
        # For ResNet-50, copy existing features instead of re-extracting
        resnet_ablation_path = OUTPUT_DIR / "features_resnet50.npy"
        resnet_existing_path = FEATURES_DIR / "signature_features.npy"
        if "resnet50" in backbones_to_run and not resnet_ablation_path.exists() and resnet_existing_path.exists():
            print(f"\nCopying existing ResNet-50 features...")
            import shutil
            resnet_ablation_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(resnet_existing_path, resnet_ablation_path)
            print(f"  Copied: {resnet_ablation_path}")
        for name in backbones_to_run:
            if name == "resnet50" and resnet_ablation_path.exists():
                continue
            extract_features(name)
    if args.extract:
        print("\nFeature extraction complete. Run with --analyze to compute statistics.")
        return
    # === Phase 2: Analysis ===
    print("\n" + "=" * 60)
    print("PHASE 2: ANALYSIS")
    print("=" * 60)
    filename_to_acct, acct_firm = load_accountant_data()
    all_results = []
    for name in backbones_to_run:
        feat_path = OUTPUT_DIR / f"features_{name}.npy"
        if not feat_path.exists():
            print(f"\n  WARNING: {feat_path} not found, skipping {name}")
            continue
        features = np.load(feat_path)
        results = analyze_backbone(name, features, filenames, filename_to_acct, acct_firm)
        all_results.append(results)
    if len(all_results) > 1:
        generate_comparison_table(all_results)
    elif len(all_results) == 1:
        print(f"\nOnly one backbone analyzed. Run all three for comparison table.")
    print("\nDone!")
 if __name__ == "__main__":
    main()
@@ -0,0 +1,83 @@
 #!/bin/bash
 # Build complete Paper A Word document from section markdown files
 # Uses pandoc with embedded figures
 PAPER_DIR="/Volumes/NV2/pdf_recognize/paper"
 FIG_DIR="/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures"
 OUTPUT="$PAPER_DIR/Paper_A_IEEE_TAI_Draft_v2.docx"
 # Create combined markdown with title page
 cat > "$PAPER_DIR/_combined.md" << 'TITLEEOF'
 ---
 title: "Automated Detection of Digitally Replicated Signatures in Large-Scale Financial Audit Reports"
 author: "[Authors removed for double-blind review]"
 date: ""
 geometry: margin=1in
 fontsize: 11pt
 ---
 TITLEEOF
 # Append each section (strip the # heading line if it duplicates)
 for section in \
    paper_a_abstract.md \
    paper_a_impact_statement.md \
    paper_a_introduction.md \
    paper_a_related_work.md \
    paper_a_methodology.md \
    paper_a_results.md \
    paper_a_discussion.md \
    paper_a_conclusion.md \
    paper_a_references.md
 do
    echo "" >> "$PAPER_DIR/_combined.md"
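    # Strip HTML comments and append. A sed range (/^<!--/,/-->$/) only looks for
    # the closing --> on later lines, so a one-line comment would delete too much.
    perl -0777 -pe 's/<!--.*?-->//gs' "$PAPER_DIR/$section" >> "$PAPER_DIR/_combined.md"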
    echo "" >> "$PAPER_DIR/_combined.md"
 done
 # Insert figure references as actual images
 # Fig 1 after "Fig. 1 illustrates"
 # (BSD sed does not expand \n in a replacement, so append lines with a\ instead)
 sed -i '' "/Fig\. 1 illustrates the overall architecture\./a\\
 \\
 ![Fig. 1. Pipeline architecture for automated signature replication detection.]($FIG_DIR/fig1_pipeline.png){width=100%}\\
 " "$PAPER_DIR/_combined.md"
 # Fig 2 after "Fig. 2 presents the cosine"
 sed -i '' "/^Fig. 2 presents the cosine/a\\
 \\
 ![Fig. 2. Cosine similarity distributions: intra-class vs. inter-class. KDE crossover at 0.837.]($FIG_DIR/fig2_intra_inter_kde.png){width=60%}\\
 " "$PAPER_DIR/_combined.md"
 # Fig 3 after "Fig. 3 presents"
 sed -i '' "/^Fig. 3 presents/a\\
 \\
 ![Fig. 3. Per-signature best-match cosine similarity: Firm A vs. other CPAs.]($FIG_DIR/fig3_firm_a_calibration.png){width=60%}\\
 " "$PAPER_DIR/_combined.md"
 # Fig 4 after "To validate the choice of ResNet-50 ... we conducted"
 sed -i '' "/^To validate the choice of ResNet-50.*we conducted/a\\
 \\
 ![Fig. 4. Ablation study: backbone comparison.]($FIG_DIR/fig4_ablation.png){width=100%}\\
 " "$PAPER_DIR/_combined.md"
 # Build with pandoc using its default docx styles.
 # (--reference-doc=/dev/null is not a valid .docx, so that attempt always
 # failed and fell through to this plain invocation.)
 pandoc "$PAPER_DIR/_combined.md" \
    -o "$OUTPUT" \
    -f markdown \
    --wrap=none \
    2>&1
 # Clean up
 rm -f "$PAPER_DIR/_combined.md"
 echo "Output: $OUTPUT"
 ls -lh "$OUTPUT"
@@ -0,0 +1,231 @@
 #!/usr/bin/env python3
 """Export Paper A v2 to Word, reading from md section files."""
 from docx import Document
 from docx.shared import Inches, Pt, RGBColor
 from docx.enum.text import WD_ALIGN_PARAGRAPH
 from pathlib import Path
 import re
 PAPER_DIR = Path("/Volumes/NV2/pdf_recognize/paper")
 FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures")
 OUTPUT = PAPER_DIR / "Paper_A_IEEE_TAI_Draft_v2.docx"
 SECTIONS = [
    "paper_a_abstract.md",
    "paper_a_impact_statement.md",
    "paper_a_introduction.md",
    "paper_a_related_work.md",
    "paper_a_methodology.md",
    "paper_a_results.md",
    "paper_a_discussion.md",
    "paper_a_conclusion.md",
    "paper_a_references.md",
 ]
 FIGURES = {
    "Fig. 1 illustrates": ("fig1_pipeline.png", "Fig. 1. Pipeline architecture for automated signature replication detection.", 6.5),
    "Fig. 2 presents": ("fig2_intra_inter_kde.png", "Fig. 2. Cosine similarity distributions: intra-class vs. inter-class with KDE crossover at 0.837.", 3.5),
    "Fig. 3 presents": ("fig3_firm_a_calibration.png", "Fig. 3. Per-signature best-match cosine similarity: Firm A (known replication) vs. other CPAs.", 3.5),
    "conducted an ablation study comparing three": ("fig4_ablation.png", "Fig. 4. Ablation study comparing three feature extraction backbones.", 6.5),
 }
 def strip_comments(text):
    """Remove HTML comments from markdown."""
    return re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
 def extract_tables(text):
    """Find markdown tables and return a list of (start_index, table_lines) tuples."""
    lines = text.split('\n')
    tables = []
    i = 0
    while i < len(lines):
        if '|' in lines[i] and i + 1 < len(lines) and re.match(r'\s*\|[-|: ]+\|', lines[i+1]):
            start = i
            while i < len(lines) and '|' in lines[i]:
                i += 1
            tables.append((start, lines[start:i]))
        else:
            i += 1
    return tables
 def add_md_table(doc, table_lines):
    """Convert markdown table to docx table."""
    rows_data = []
    for line in table_lines:
        # Strip surrounding whitespace first so indented rows lose their outer pipes too
        cells = [c.strip() for c in line.strip().strip('|').split('|')]
        if not re.match(r'^[-: ]+$', cells[0]):
            rows_data.append(cells)
    if len(rows_data) < 2:
        return
    ncols = len(rows_data[0])
    table = doc.add_table(rows=len(rows_data), cols=ncols)
    table.style = 'Table Grid'
    for r_idx, row in enumerate(rows_data):
        for c_idx in range(min(len(row), ncols)):
            cell = table.rows[r_idx].cells[c_idx]
            cell.text = row[c_idx]
            for p in cell.paragraphs:
                p.alignment = WD_ALIGN_PARAGRAPH.CENTER
                for run in p.runs:
                    run.font.size = Pt(8)
                    run.font.name = 'Times New Roman'
                    if r_idx == 0:
                        run.bold = True
    doc.add_paragraph()
 def process_section(doc, filepath):
    """Process a markdown section file into docx."""
    text = filepath.read_text(encoding='utf-8')
    text = strip_comments(text)
    lines = text.split('\n')
    i = 0
    while i < len(lines):
        line = lines[i]
        stripped = line.strip()
        # Skip empty lines
        if not stripped:
            i += 1
            continue
        # Headings
        if stripped.startswith('# '):
            h = doc.add_heading(stripped[2:], level=1)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        elif stripped.startswith('## '):
            h = doc.add_heading(stripped[3:], level=2)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        elif stripped.startswith('### '):
            h = doc.add_heading(stripped[4:], level=3)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        # Markdown table
        if '|' in stripped and i + 1 < len(lines) and re.match(r'\s*\|[-|: ]+\|', lines[i+1]):
            table_lines = []
            while i < len(lines) and '|' in lines[i]:
                table_lines.append(lines[i])
                i += 1
            add_md_table(doc, table_lines)
            continue
        # Numbered list
        if re.match(r'^\d+\.\s', stripped):
            p = doc.add_paragraph(style='List Number')
            content = re.sub(r'^\d+\.\s', '', stripped)
            content = re.sub(r'\*\*(.+?)\*\*', r'\1', content)  # strip bold markers
            run = p.add_run(content)
            run.font.size = Pt(10)
            run.font.name = 'Times New Roman'
            i += 1
            continue
        # Bullet list
        if stripped.startswith('- '):
            p = doc.add_paragraph(style='List Bullet')
            content = stripped[2:]
            content = re.sub(r'\*\*(.+?)\*\*', r'\1', content)
            run = p.add_run(content)
            run.font.size = Pt(10)
            run.font.name = 'Times New Roman'
            i += 1
            continue
        # Regular paragraph - collect continuation lines
        para_lines = [stripped]
        i += 1
        while i < len(lines):
            next_line = lines[i].strip()
            if not next_line or next_line.startswith('#') or next_line.startswith('|') or \
               next_line.startswith('- ') or re.match(r'^\d+\.\s', next_line):
                break
            para_lines.append(next_line)
            i += 1
        para_text = ' '.join(para_lines)
        # Clean markdown formatting
        para_text = re.sub(r'\*\*\*(.+?)\*\*\*', r'\1', para_text)  # bold italic
        para_text = re.sub(r'\*\*(.+?)\*\*', r'\1', para_text)  # bold
        para_text = re.sub(r'\*(.+?)\*', r'\1', para_text)  # italic
        para_text = re.sub(r'`(.+?)`', r'\1', para_text)  # code
        para_text = para_text.replace('$$', '')  # LaTeX delimiters
        para_text = para_text.replace('---', '\u2014')  # em dash
        p = doc.add_paragraph()
        p.paragraph_format.space_after = Pt(6)
        run = p.add_run(para_text)
        run.font.size = Pt(10)
        run.font.name = 'Times New Roman'
        # Check if we should insert a figure after this paragraph
        for trigger, (fig_file, caption, width) in FIGURES.items():
            if trigger in para_text:
                fig_path = FIG_DIR / fig_file
                if fig_path.exists():
                    fp = doc.add_paragraph()
                    fp.alignment = WD_ALIGN_PARAGRAPH.CENTER
                    fr = fp.add_run()
                    fr.add_picture(str(fig_path), width=Inches(width))
                    cp = doc.add_paragraph()
                    cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
                    cr = cp.add_run(caption)
                    cr.font.size = Pt(9)
                    cr.font.name = 'Times New Roman'
                    cr.italic = True
 def main():
    doc = Document()
    # Set default font
    style = doc.styles['Normal']
    style.font.name = 'Times New Roman'
    style.font.size = Pt(10)
    # Title page
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(12)
    run = p.add_run("Automated Detection of Digitally Replicated Signatures\nin Large-Scale Financial Audit Reports")
    run.font.size = Pt(16)
    run.font.name = 'Times New Roman'
    run.bold = True
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(20)
    run = p.add_run("[Authors removed for double-blind review]")
    run.font.size = Pt(10)
    run.italic = True
    # Process each section
    for section_file in SECTIONS:
        filepath = PAPER_DIR / section_file
        if filepath.exists():
            process_section(doc, filepath)
    doc.save(str(OUTPUT))
    print(f"Saved: {OUTPUT}")
 if __name__ == "__main__":
    main()
@@ -0,0 +1,392 @@
 #!/usr/bin/env python3
 """
 Generate all figures for Paper A (IEEE TAI submission).
 Outputs to /Volumes/NV2/PDF-Processing/signature-analysis/paper_figures/
 """
 import numpy as np
 import sqlite3
 import json
 import matplotlib
 matplotlib.use('Agg')
 import matplotlib.pyplot as plt
 from matplotlib.patches import FancyBboxPatch
 from collections import defaultdict
 from pathlib import Path
 # Config
 DB_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 ABLATION_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json'
 OUTPUT_DIR = Path('/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures')
 OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
 RANDOM_SEED = 42
 np.random.seed(RANDOM_SEED)
 # IEEE formatting
 plt.rcParams.update({
    'font.family': 'serif',
    'font.serif': ['Times New Roman', 'DejaVu Serif'],
    'font.size': 9,
    'axes.labelsize': 10,
    'axes.titlesize': 10,
    'xtick.labelsize': 8,
    'ytick.labelsize': 8,
    'legend.fontsize': 8,
    'figure.dpi': 300,
    'savefig.dpi': 300,
    'savefig.bbox': 'tight',
    'savefig.pad_inches': 0.05,
 })
 # IEEE column widths
 COL_WIDTH = 3.5  # single column inches
 FULL_WIDTH = 7.16  # full page width inches
 def load_signature_data():
    """Load per-signature best-match similarities and accountant info."""
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.assigned_accountant, s.max_similarity_to_same_accountant, a.firm
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.max_similarity_to_same_accountant IS NOT NULL
        AND s.assigned_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    data = {
        'accountants': [r[0] for r in rows],
        'max_sims': np.array([r[1] for r in rows]),
        'firms': [r[2] for r in rows],
    }
    return data
 def load_intra_inter_from_features():
    """Compute intra/inter class distributions from feature vectors."""
    print("Loading features for intra/inter distributions...")
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    cur.execute('''
        SELECT assigned_accountant, feature_vector
        FROM signatures
        WHERE feature_vector IS NOT NULL AND assigned_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    acct_groups = defaultdict(list)
    features_list = []
    accountants = []
    for r in rows:
        feat = np.frombuffer(r[1], dtype=np.float32)
        idx = len(features_list)
        features_list.append(feat)
        accountants.append(r[0])
        acct_groups[r[0]].append(idx)
    features = np.array(features_list)
    print(f"  Loaded {len(features)} signatures, {len(acct_groups)} accountants")
    # Intra-class
    print("  Computing intra-class...")
    intra_sims = []
    for acct, indices in acct_groups.items():
        if len(indices) < 3:
            continue
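        vecs = features[indices]
        # Dot products equal cosine similarities only if the stored feature
        # vectors are L2-normalized (assumed here).
        sim_matrix = vecs @ vecs.T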
        n = len(indices)
        triu_idx = np.triu_indices(n, k=1)
        intra_sims.extend(sim_matrix[triu_idx].tolist())
    intra_sims = np.array(intra_sims)
    print(f"  Intra-class: {len(intra_sims):,} pairs")
    # Inter-class
    print("  Computing inter-class...")
    all_acct_list = list(acct_groups.keys())
    inter_sims = []
    for _ in range(500_000):
        a1, a2 = np.random.choice(len(all_acct_list), 2, replace=False)
        i1 = np.random.choice(acct_groups[all_acct_list[a1]])
        i2 = np.random.choice(acct_groups[all_acct_list[a2]])
        sim = float(features[i1] @ features[i2])
        inter_sims.append(sim)
    inter_sims = np.array(inter_sims)
    print(f"  Inter-class: {len(inter_sims):,} pairs")
    return intra_sims, inter_sims
 def fig1_pipeline(output_path):
    """Fig 1: Pipeline architecture diagram."""
    print("Generating Fig 1: Pipeline...")
    fig, ax = plt.subplots(1, 1, figsize=(FULL_WIDTH, 1.8))
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 2)
    ax.axis('off')
    # Stages
    stages = [
        ("90,282\nPDFs", "#E3F2FD"),
        ("VLM\nPre-screen", "#BBDEFB"),
        ("YOLO\nDetection", "#90CAF9"),
        ("ResNet-50\nFeatures", "#64B5F6"),
        ("Cosine +\ndHash", "#42A5F5"),
        ("Calibration\n& Classify", "#1E88E5"),
    ]
    annotations = [
        "86,072 docs",
        "182,328 sigs",
        "2048-dim",
        "Dual verify",
        "Verdicts",
    ]
    box_w = 1.3
    box_h = 1.0
    gap = 0.38
    start_x = 0.15
    y_center = 1.0
    for i, (label, color) in enumerate(stages):
        x = start_x + i * (box_w + gap)
        box = FancyBboxPatch(
            (x, y_center - box_h/2), box_w, box_h,
            boxstyle="round,pad=0.1",
            facecolor=color, edgecolor='#1565C0', linewidth=1.2
        )
        ax.add_patch(box)
        ax.text(x + box_w/2, y_center, label,
                ha='center', va='center', fontsize=8, fontweight='bold',
                color='#0D47A1' if i < 3 else 'white')
        # Arrow + annotation
        if i < len(stages) - 1:
            arrow_x = x + box_w + 0.02
            ax.annotate('', xy=(arrow_x + gap - 0.04, y_center),
                       xytext=(arrow_x, y_center),
                       arrowprops=dict(arrowstyle='->', color='#1565C0', lw=1.5))
            ax.text(arrow_x + gap/2, y_center - 0.62, annotations[i],
                   ha='center', va='top', fontsize=6.5, color='#555555', style='italic')
    plt.savefig(output_path, format='png')
    plt.savefig(output_path.with_suffix('.pdf'), format='pdf')
    plt.close()
    print(f"  Saved: {output_path}")
 def fig2_intra_inter_kde(intra_sims, inter_sims, output_path):
    """Fig 2: Intra vs Inter class cosine similarity distributions."""
    print("Generating Fig 2: Intra vs Inter KDE...")
    from scipy.stats import gaussian_kde
    fig, ax = plt.subplots(1, 1, figsize=(COL_WIDTH, 2.5))
    x_grid = np.linspace(0.3, 1.0, 500)
    kde_intra = gaussian_kde(intra_sims, bw_method=0.02)
    kde_inter = gaussian_kde(inter_sims, bw_method=0.02)
    y_intra = kde_intra(x_grid)
    y_inter = kde_inter(x_grid)
    ax.fill_between(x_grid, y_intra, alpha=0.3, color='#E53935', label='Intra-class (same CPA)')
    ax.fill_between(x_grid, y_inter, alpha=0.3, color='#1E88E5', label='Inter-class (diff. CPA)')
    ax.plot(x_grid, y_intra, color='#C62828', linewidth=1.5)
    ax.plot(x_grid, y_inter, color='#1565C0', linewidth=1.5)
    # Find crossover
    diff = y_intra - y_inter
    sign_changes = np.where(np.diff(np.sign(diff)))[0]
    crossovers = x_grid[sign_changes]
    valid = crossovers[(crossovers > 0.5) & (crossovers < 1.0)]
    if len(valid) > 0:
        xover = valid[-1]
        ax.axvline(x=xover, color='#4CAF50', linestyle='--', linewidth=1.2, alpha=0.8)
        ax.text(xover + 0.01, ax.get_ylim()[1] * 0.85, f'KDE crossover\n= {xover:.3f}',
                fontsize=7, color='#2E7D32', va='top')
    ax.set_xlabel('Cosine Similarity')
    ax.set_ylabel('Density')
    ax.legend(loc='upper left', framealpha=0.9)
    ax.set_xlim(0.35, 1.0)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.tight_layout()
    plt.savefig(output_path, format='png')
    plt.savefig(output_path.with_suffix('.pdf'), format='pdf')
    plt.close()
    print(f"  Saved: {output_path}")
 def fig3_firm_a_calibration(data, output_path):
    """Fig 3: Firm A calibration - per-signature best match distribution."""
    print("Generating Fig 3: Firm A Calibration...")
    from scipy.stats import gaussian_kde
    firm_a_mask = np.array([f == '勤業眾信聯合' for f in data['firms']])
    non_firm_a_mask = ~firm_a_mask
    firm_a_sims = data['max_sims'][firm_a_mask]
    others_sims = data['max_sims'][non_firm_a_mask]
    fig, ax = plt.subplots(1, 1, figsize=(COL_WIDTH, 2.5))
    x_grid = np.linspace(0.5, 1.0, 500)
    kde_a = gaussian_kde(firm_a_sims, bw_method=0.015)
    kde_others = gaussian_kde(others_sims, bw_method=0.015)
    y_a = kde_a(x_grid)
    y_others = kde_others(x_grid)
    ax.fill_between(x_grid, y_a, alpha=0.35, color='#E53935',
                    label=f'Firm A (known replication, n={len(firm_a_sims):,})')
    ax.fill_between(x_grid, y_others, alpha=0.25, color='#78909C',
                    label=f'Other CPAs (n={len(others_sims):,})')
    ax.plot(x_grid, y_a, color='#C62828', linewidth=1.5)
    ax.plot(x_grid, y_others, color='#546E7A', linewidth=1.5)
    # Mark key statistics
    p1 = np.percentile(firm_a_sims, 1)
    ax.axvline(x=p1, color='#E53935', linestyle=':', linewidth=1, alpha=0.7)
    ax.text(p1 - 0.01, ax.get_ylim()[1] * 0.5 if ax.get_ylim()[1] > 0 else 10,
            f'Firm A\n1st pct\n= {p1:.3f}', fontsize=6.5, color='#C62828',
            ha='right', va='center')
    mean_a = firm_a_sims.mean()
    ax.axvline(x=mean_a, color='#E53935', linestyle='--', linewidth=1, alpha=0.7)
    ax.set_xlabel('Per-Signature Best-Match Cosine Similarity')
    ax.set_ylabel('Density')
    ax.legend(loc='upper left', framealpha=0.9, fontsize=7)
    ax.set_xlim(0.5, 1.005)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.tight_layout()
    plt.savefig(output_path, format='png')
    plt.savefig(output_path.with_suffix('.pdf'), format='pdf')
    plt.close()
    print(f"  Saved: {output_path}")
 def fig4_ablation(output_path):
    """Fig 4: Ablation backbone comparison."""
    print("Generating Fig 4: Ablation...")
    with open(ABLATION_PATH) as f:
        results = json.load(f)
    backbones = ['ResNet-50\n(2048-d)', 'VGG-16\n(4096-d)', 'EfficientNet-B0\n(1280-d)']
    backbone_keys = ['resnet50', 'vgg16', 'efficientnet_b0']
    results_map = {r['backbone']: r for r in results}
    fig, axes = plt.subplots(1, 3, figsize=(FULL_WIDTH, 2.2))
    colors = ['#1E88E5', '#FFA726', '#66BB6A']
    # Panel (a): Intra/Inter means with error bars
    ax = axes[0]
    x = np.arange(len(backbones))
    width = 0.35
    intra_means = [results_map[k]['intra']['mean'] for k in backbone_keys]
    intra_stds = [results_map[k]['intra']['std'] for k in backbone_keys]
    inter_means = [results_map[k]['inter']['mean'] for k in backbone_keys]
    inter_stds = [results_map[k]['inter']['std'] for k in backbone_keys]
    bars1 = ax.bar(x - width/2, intra_means, width, yerr=intra_stds,
                   color='#E53935', alpha=0.7, label='Intra', capsize=3, error_kw={'linewidth': 0.8})
    bars2 = ax.bar(x + width/2, inter_means, width, yerr=inter_stds,
                   color='#1E88E5', alpha=0.7, label='Inter', capsize=3, error_kw={'linewidth': 0.8})
    ax.set_ylabel('Cosine Similarity')
    ax.set_xticks(x)
    ax.set_xticklabels(backbones, fontsize=7)
    ax.legend(fontsize=7)
    ax.set_ylim(0.5, 1.0)
    ax.set_title('(a) Mean Similarity', fontsize=9)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    # Panel (b): Cohen's d
    ax = axes[1]
    cohens_ds = [results_map[k]['cohens_d'] for k in backbone_keys]
    bars = ax.bar(x, cohens_ds, 0.5, color=colors, alpha=0.8, edgecolor='#333', linewidth=0.5)
    ax.set_ylabel("Cohen's d")
    ax.set_xticks(x)
    ax.set_xticklabels(backbones, fontsize=7)
    ax.set_ylim(0, 0.9)
    ax.set_title("(b) Cohen's d", fontsize=9)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    # Add value labels
    for bar, val in zip(bars, cohens_ds):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                f'{val:.3f}', ha='center', va='bottom', fontsize=7, fontweight='bold')
    # Panel (c): KDE crossover
    ax = axes[2]
    crossovers = [results_map[k]['kde_crossover'] for k in backbone_keys]
    bars = ax.bar(x, crossovers, 0.5, color=colors, alpha=0.8, edgecolor='#333', linewidth=0.5)
    ax.set_ylabel('KDE Crossover')
    ax.set_xticks(x)
    ax.set_xticklabels(backbones, fontsize=7)
    ax.set_ylim(0.7, 0.9)
    ax.set_title('(c) KDE Crossover', fontsize=9)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    for bar, val in zip(bars, crossovers):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
                f'{val:.3f}', ha='center', va='bottom', fontsize=7, fontweight='bold')
    plt.tight_layout()
    plt.savefig(output_path, format='png')
    plt.savefig(output_path.with_suffix('.pdf'), format='pdf')
    plt.close()
    print(f"  Saved: {output_path}")
 def main():
    print("=" * 60)
    print("Generating Paper Figures")
    print("=" * 60)
    # Fig 1: Pipeline (no data needed)
    fig1_pipeline(OUTPUT_DIR / 'fig1_pipeline.png')
    # Fig 4: Ablation (uses pre-computed JSON)
    fig4_ablation(OUTPUT_DIR / 'fig4_ablation.png')
    # Load data for Fig 2 & 3
    data = load_signature_data()
    print(f"Loaded {len(data['max_sims']):,} signatures")
    # Fig 3: Firm A calibration (uses per-signature best match from DB)
    fig3_firm_a_calibration(data, OUTPUT_DIR / 'fig3_firm_a_calibration.png')
    # Fig 2: Intra vs Inter (needs full feature vectors)
    intra_sims, inter_sims = load_intra_inter_from_features()
    fig2_intra_inter_kde(intra_sims, inter_sims, OUTPUT_DIR / 'fig2_intra_inter_kde.png')
    print("\n" + "=" * 60)
    print("All figures saved to:", OUTPUT_DIR)
    print("=" * 60)
 if __name__ == "__main__":
    main()
@@ -0,0 +1,413 @@
 #!/usr/bin/env python3
 """
 Generate complete PDF-level Excel report with Firm A-calibrated dual-method classification.
 Output: One row per PDF with identification, CPA info, detection stats,
        cosine similarity, dHash distance, and new dual-method verdicts.
 """
 import sqlite3
 import numpy as np
 import openpyxl
 from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
 from collections import defaultdict
 from pathlib import Path
 from datetime import datetime
 DB_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 OUTPUT_DIR = Path('/Volumes/NV2/PDF-Processing/signature-analysis/recalibrated')
 OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
 OUTPUT_PATH = OUTPUT_DIR / 'pdf_level_recalibrated_report.xlsx'
 FIRM_A = '勤業眾信聯合'
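 # Classification thresholds. The "PHASH" names and DB columns hold the
 # distances that the report labels as dHash.
 KDE_CROSSOVER = 0.837   # cosine similarity at the KDE crossover
 COSINE_HIGH = 0.95      # cosine high threshold
 PHASH_HIGH_CONF = 5     # dHash high-confidence cutoff (Firm A median)
 PHASH_MOD_CONF = 15     # dHash moderate-confidence cutoff (Firm A p95)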
 def load_all_data():
    """Load all signature data grouped by PDF."""
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    # Get all signatures with their stats
    cur.execute('''
        SELECT s.signature_id, s.image_filename, s.assigned_accountant,
               s.max_similarity_to_same_accountant,
               s.phash_distance_to_closest,
               s.ssim_to_closest,
               s.signature_verdict,
               a.firm, a.risk_level, a.mean_similarity, a.ratio_gt_95,
               a.signature_count
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    # Get the YOLO detection confidence for every signature image
    cur.execute('''
        SELECT s.image_filename,
               s.detection_confidence
        FROM signatures s
    ''')
    detection_rows = cur.fetchall()
    detection_conf = {r[0]: r[1] for r in detection_rows}
    conn.close()
    # Group by PDF
    pdf_data = defaultdict(lambda: {
        'signatures': [],
        'accountants': set(),
        'firms': set(),
    })
    for r in rows:
        sig_id, filename, accountant, cosine, phash, ssim, verdict, \
            firm, risk, mean_sim, ratio95, sig_count = r
        # Extract PDF key from filename
        # Format: {company}_{year}_{type}_page{N}_sig{M}.png or similar
        parts = filename.rsplit('_sig', 1)
        pdf_key = parts[0] if len(parts) > 1 else filename.rsplit('.', 1)[0]
        page_parts = pdf_key.rsplit('_page', 1)
        pdf_key = page_parts[0] if len(page_parts) > 1 else pdf_key
        pdf_data[pdf_key]['signatures'].append({
            'sig_id': sig_id,
            'filename': filename,
            'accountant': accountant,
            'cosine': cosine,
            'phash': phash,
            'ssim': ssim,
            'old_verdict': verdict,
            'firm': firm,
            'risk_level': risk,
            'acct_mean_sim': mean_sim,
            'acct_ratio_95': ratio95,
            'acct_sig_count': sig_count,
            'detection_conf': detection_conf.get(filename),
        })
        if accountant:
            pdf_data[pdf_key]['accountants'].add(accountant)
        if firm:
            pdf_data[pdf_key]['firms'].add(firm)
    print(f"Loaded {sum(len(v['signatures']) for v in pdf_data.values()):,} signatures across {len(pdf_data):,} PDFs")
    return pdf_data
 def classify_dual_method(max_cosine, min_phash):
    """New dual-method classification with Firm A-calibrated thresholds."""
    if max_cosine is None:
        return 'unknown', 'none'
    if max_cosine > COSINE_HIGH:
        if min_phash is not None and min_phash <= PHASH_HIGH_CONF:
            return 'high_confidence_replication', 'high'
        elif min_phash is not None and min_phash <= PHASH_MOD_CONF:
            return 'moderate_confidence_replication', 'medium'
        else:
            return 'high_style_consistency', 'low'
    elif max_cosine > KDE_CROSSOVER:
        return 'uncertain', 'low'
    else:
        return 'likely_genuine', 'medium'
 def build_report(pdf_data):
    """Build Excel report."""
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.title = "PDF-Level Report"
    # Define columns
    columns = [
        # Group A: PDF Identification (Blue)
        ('pdf_key', 'PDF Key'),
        ('n_signatures', '# Signatures'),
        # Group B: CPA Info (Green)
        ('accountant_1', 'CPA 1 Name'),
        ('accountant_2', 'CPA 2 Name'),
        ('firm_1', 'Firm 1'),
        ('firm_2', 'Firm 2'),
        ('is_firm_a', 'Is Firm A'),
        # Group C: Detection (Yellow)
        ('avg_detection_conf', 'Avg Detection Conf'),
        # Group D: Cosine Similarity - Sig 1 (Red)
        ('sig1_cosine', 'Sig1 Max Cosine'),
        ('sig1_cosine_verdict', 'Sig1 Cosine Verdict'),
        ('sig1_acct_mean', 'Sig1 CPA Mean Sim'),
        ('sig1_acct_ratio95', 'Sig1 CPA >0.95 Ratio'),
        ('sig1_acct_count', 'Sig1 CPA Sig Count'),
        # Group E: Cosine Similarity - Sig 2 (Purple)
        ('sig2_cosine', 'Sig2 Max Cosine'),
        ('sig2_cosine_verdict', 'Sig2 Cosine Verdict'),
        ('sig2_acct_mean', 'Sig2 CPA Mean Sim'),
        ('sig2_acct_ratio95', 'Sig2 CPA >0.95 Ratio'),
        ('sig2_acct_count', 'Sig2 CPA Sig Count'),
        # Group F: dHash Distance (Orange)
        ('min_phash', 'Min dHash Distance'),
        ('max_phash', 'Max dHash Distance'),
        ('avg_phash', 'Avg dHash Distance'),
        ('sig1_phash', 'Sig1 dHash Distance'),
        ('sig2_phash', 'Sig2 dHash Distance'),
        # Group G: SSIM (for reference only) (Gray)
        ('max_ssim', 'Max SSIM'),
        ('avg_ssim', 'Avg SSIM'),
        # Group H: Dual-Method Classification (Dark Blue)
        ('dual_verdict', 'Dual-Method Verdict'),
        ('dual_confidence', 'Confidence Level'),
        ('max_cosine', 'PDF Max Cosine'),
        ('pdf_min_phash', 'PDF Min dHash'),
        # Group I: CPA Risk (Teal)
        ('sig1_risk', 'Sig1 CPA Risk Level'),
        ('sig2_risk', 'Sig2 CPA Risk Level'),
    ]
    col_keys = [c[0] for c in columns]
    col_names = [c[1] for c in columns]
    # Header styles
    header_fill = PatternFill(start_color='1F4E79', end_color='1F4E79', fill_type='solid')
    header_font = Font(name='Arial', size=9, bold=True, color='FFFFFF')
    data_font = Font(name='Arial', size=9)
    thin_border = Border(
        left=Side(style='thin'),
        right=Side(style='thin'),
        top=Side(style='thin'),
        bottom=Side(style='thin'),
    )
    # Group colors
    group_colors = {
        'A': 'D6E4F0',  # Blue - PDF ID
        'B': 'D9E2D0',  # Green - CPA
        'C': 'FFF2CC',  # Yellow - Detection
        'D': 'F4CCCC',  # Red - Cosine Sig1
        'E': 'E1D5E7',  # Purple - Cosine Sig2
        'F': 'FFE0B2',  # Orange - dHash
        'G': 'E0E0E0',  # Gray - SSIM
        'H': 'B3D4FC',  # Dark Blue - Dual method
        'I': 'B2DFDB',  # Teal - Risk
    }
    group_ranges = {
        'A': (0, 2), 'B': (2, 7), 'C': (7, 8),
        'D': (8, 13), 'E': (13, 18), 'F': (18, 23),
        'G': (23, 25), 'H': (25, 29), 'I': (29, 31),
    }
    # Write header
    for col_idx, name in enumerate(col_names, 1):
        cell = ws.cell(row=1, column=col_idx, value=name)
        cell.font = header_font
        cell.fill = header_fill
        cell.alignment = Alignment(horizontal='center', wrap_text=True)
        cell.border = thin_border
    # Process PDFs
    row_idx = 2
    verdict_counts = defaultdict(int)
    firm_a_counts = defaultdict(int)
    for pdf_key, pdata in sorted(pdf_data.items()):
        sigs = pdata['signatures']
        if not sigs:
            continue
        # Sort signatures by filename (lexicographic), which approximates position order (sig1, sig2)
        sigs_sorted = sorted(sigs, key=lambda s: s['filename'])
        sig1 = sigs_sorted[0] if len(sigs_sorted) > 0 else None
        sig2 = sigs_sorted[1] if len(sigs_sorted) > 1 else None
        # Compute PDF-level aggregates
        cosines = [s['cosine'] for s in sigs if s['cosine'] is not None]
        phashes = [s['phash'] for s in sigs if s['phash'] is not None]
        ssims = [s['ssim'] for s in sigs if s['ssim'] is not None]
        confs = [s['detection_conf'] for s in sigs if s['detection_conf'] is not None]
        max_cosine = max(cosines) if cosines else None
        min_phash = min(phashes) if phashes else None
        max_phash = max(phashes) if phashes else None
        avg_phash = np.mean(phashes) if phashes else None
        max_ssim = max(ssims) if ssims else None
        avg_ssim = np.mean(ssims) if ssims else None
        avg_conf = np.mean(confs) if confs else None
        is_firm_a = FIRM_A in pdata['firms']
        # Dual-method classification
        verdict, confidence = classify_dual_method(max_cosine, min_phash)
        verdict_counts[verdict] += 1
        if is_firm_a:
            firm_a_counts[verdict] += 1
        # Cosine verdicts per signature
        def cosine_verdict(cos):
            if cos is None: return None
            if cos > COSINE_HIGH: return 'high'
            if cos > KDE_CROSSOVER: return 'uncertain'
            return 'low'
        # Build row
        row_data = {
            'pdf_key': pdf_key,
            'n_signatures': len(sigs),
            'accountant_1': sig1['accountant'] if sig1 else None,
            'accountant_2': sig2['accountant'] if sig2 else None,
            'firm_1': sig1['firm'] if sig1 else None,
            'firm_2': sig2['firm'] if sig2 else None,
            'is_firm_a': 'Yes' if is_firm_a else 'No',
            # Compare against None explicitly so that legitimate 0.0 values are kept
            'avg_detection_conf': round(avg_conf, 4) if avg_conf is not None else None,
            'sig1_cosine': round(sig1['cosine'], 4) if sig1 and sig1['cosine'] is not None else None,
            'sig1_cosine_verdict': cosine_verdict(sig1['cosine']) if sig1 else None,
            'sig1_acct_mean': round(sig1['acct_mean_sim'], 4) if sig1 and sig1['acct_mean_sim'] is not None else None,
            'sig1_acct_ratio95': round(sig1['acct_ratio_95'], 4) if sig1 and sig1['acct_ratio_95'] is not None else None,
            'sig1_acct_count': sig1['acct_sig_count'] if sig1 else None,
            'sig2_cosine': round(sig2['cosine'], 4) if sig2 and sig2['cosine'] is not None else None,
            'sig2_cosine_verdict': cosine_verdict(sig2['cosine']) if sig2 else None,
            'sig2_acct_mean': round(sig2['acct_mean_sim'], 4) if sig2 and sig2['acct_mean_sim'] is not None else None,
            'sig2_acct_ratio95': round(sig2['acct_ratio_95'], 4) if sig2 and sig2['acct_ratio_95'] is not None else None,
            'sig2_acct_count': sig2['acct_sig_count'] if sig2 else None,
            'min_phash': min_phash,
            'max_phash': max_phash,
            'avg_phash': round(avg_phash, 2) if avg_phash is not None else None,
            'sig1_phash': sig1['phash'] if sig1 else None,
            'sig2_phash': sig2['phash'] if sig2 else None,
            'max_ssim': round(max_ssim, 4) if max_ssim is not None else None,
            'avg_ssim': round(avg_ssim, 4) if avg_ssim is not None else None,
            'dual_verdict': verdict,
            'dual_confidence': confidence,
            'max_cosine': round(max_cosine, 4) if max_cosine is not None else None,
            'pdf_min_phash': min_phash,
            'sig1_risk': sig1['risk_level'] if sig1 else None,
            'sig2_risk': sig2['risk_level'] if sig2 else None,
        }
        for col_idx, key in enumerate(col_keys, 1):
            val = row_data.get(key)
            cell = ws.cell(row=row_idx, column=col_idx, value=val)
            cell.font = data_font
            cell.border = thin_border
            # Color by group
            for group, (start, end) in group_ranges.items():
                if start <= col_idx - 1 < end:
                    cell.fill = PatternFill(start_color=group_colors[group],
                                           end_color=group_colors[group],
                                           fill_type='solid')
                    break
            # Highlight Firm A rows
            if is_firm_a and col_idx == 7:
                cell.font = Font(name='Arial', size=9, bold=True, color='CC0000')
            # Color verdicts
            if key == 'dual_verdict':
                colors = {
                    'high_confidence_replication': 'FF0000',
                    'moderate_confidence_replication': 'FF6600',
                    'high_style_consistency': '009900',
                    'uncertain': 'FF9900',
                    'likely_genuine': '006600',
                }
                if val in colors:
                    cell.font = Font(name='Arial', size=9, bold=True, color=colors[val])
        row_idx += 1
    # Auto-width
    for col_idx in range(1, len(col_keys) + 1):
        ws.column_dimensions[openpyxl.utils.get_column_letter(col_idx)].width = 15
    # Freeze header
    ws.freeze_panes = 'A2'
    ws.auto_filter.ref = f"A1:{openpyxl.utils.get_column_letter(len(col_keys))}{row_idx-1}"
    # === Summary Sheet ===
    ws2 = wb.create_sheet("Summary")
    ws2.cell(row=1, column=1, value="Dual-Method Classification Summary").font = Font(size=14, bold=True)
    ws2.cell(row=2, column=1, value=f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
    ws2.cell(row=3, column=1,
             value=f"Calibration: Firm A (dHash median={PHASH_HIGH_CONF}, p95={PHASH_MOD_CONF})")
    ws2.cell(row=5, column=1, value="Verdict").font = Font(bold=True)
    ws2.cell(row=5, column=2, value="Count").font = Font(bold=True)
    ws2.cell(row=5, column=3, value="%").font = Font(bold=True)
    ws2.cell(row=5, column=4, value="Firm A").font = Font(bold=True)
    ws2.cell(row=5, column=5, value="Firm A %").font = Font(bold=True)
    total = sum(verdict_counts.values())
    fa_total = sum(firm_a_counts.values())
    order = ['high_confidence_replication', 'moderate_confidence_replication',
             'high_style_consistency', 'uncertain', 'likely_genuine', 'unknown']
    for i, v in enumerate(order):
        n = verdict_counts.get(v, 0)
        fa = firm_a_counts.get(v, 0)
        ws2.cell(row=6+i, column=1, value=v)
        ws2.cell(row=6+i, column=2, value=n)
        ws2.cell(row=6+i, column=3, value=f"{100*n/total:.1f}%" if total > 0 else "0%")
        ws2.cell(row=6+i, column=4, value=fa)
        ws2.cell(row=6+i, column=5, value=f"{100*fa/fa_total:.1f}%" if fa_total > 0 else "0%")
    ws2.cell(row=6+len(order), column=1, value="Total").font = Font(bold=True)
    ws2.cell(row=6+len(order), column=2, value=total)
    ws2.cell(row=6+len(order), column=4, value=fa_total)
    # Thresholds
    ws2.cell(row=15, column=1, value="Thresholds Used").font = Font(size=12, bold=True)
    ws2.cell(row=16, column=1, value="Cosine high threshold")
    ws2.cell(row=16, column=2, value=COSINE_HIGH)
    ws2.cell(row=17, column=1, value="KDE crossover")
    ws2.cell(row=17, column=2, value=KDE_CROSSOVER)
    ws2.cell(row=18, column=1, value="dHash high-confidence (Firm A median)")
    ws2.cell(row=18, column=2, value=PHASH_HIGH_CONF)
    ws2.cell(row=19, column=1, value="dHash moderate-confidence (Firm A p95)")
    ws2.cell(row=19, column=2, value=PHASH_MOD_CONF)
    for col in range(1, 6):
        ws2.column_dimensions[openpyxl.utils.get_column_letter(col)].width = 30
    # Save
    wb.save(str(OUTPUT_PATH))
    print(f"\nSaved: {OUTPUT_PATH}")
    print(f"Total PDFs: {total:,}")
    print(f"Firm A PDFs: {fa_total:,}")
    # Print summary
    print(f"\n{'Verdict':<35} {'Count':>8} {'%':>7}  | {'Firm A':>8} {'%':>7}")
    print("-" * 70)
    for v in order:
        n = verdict_counts.get(v, 0)
        fa = firm_a_counts.get(v, 0)
        if n > 0:
            line = f"  {v:<33} {n:>8,} {100*n/total:>6.1f}%"
            # Append the Firm A columns only when there are Firm A PDFs
            if fa_total > 0:
                line += f"  | {fa:>8,} {100*fa/fa_total:>6.1f}%"
            print(line)
    print("-" * 70)
    print(f"  {'Total':<33} {total:>8,}         | {fa_total:>8,}")
 def main():
    print("=" * 60)
    print("Generating Recalibrated PDF-Level Report")
    print(f"Calibration: Firm A ({FIRM_A})")
    print(f"Method: Dual (Cosine + dHash)")
    print("=" * 60)
    pdf_data = load_all_data()
    build_report(pdf_data)
 if __name__ == "__main__":
    main()
@@ -0,0 +1,16 @@
 # Abstract
 <!-- 150-250 words -->
 Regulations in many jurisdictions require Certified Public Accountants (CPAs) to attest to each audit report they certify, typically by affixing a signature or seal.
 However, the digitization of financial reporting makes it straightforward to reuse a scanned signature image across multiple reports, potentially undermining the intent of individualized attestation.
 Unlike signature forgery, where an impostor imitates another person's handwriting, signature replication involves a legitimate signer reusing a digital copy of their own genuine signature---a practice that is difficult to detect through manual inspection at scale.
 We present an end-to-end AI pipeline that automatically detects signature replication in financial audit reports.
 The pipeline employs a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-method verification combining cosine similarity with difference hashing (dHash).
 This dual-method design distinguishes consistent handwriting style (high feature similarity but divergent perceptual hashes) from digital replication (convergent evidence across both methods), addressing an ambiguity that single-metric approaches cannot resolve.
 We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan over a decade (2013--2023), analyzing 182,328 signatures from 758 CPAs.
 Using an accounting firm independently identified as employing digital replication as a calibration reference, we establish empirically grounded detection thresholds.
 Our analysis reveals that among documents with high feature-level similarity (cosine > 0.95), the structural verification layer stratifies them into distinct populations: 41% with converging replication evidence, 52% with partial structural similarity, and 7% with no structural corroboration despite near-identical features---demonstrating that single-metric approaches conflate style consistency with digital duplication.
 To our knowledge, this represents the largest-scale analysis of signature authenticity in financial audit documents to date.
 <!-- Word count: ~220 -->
@@ -0,0 +1,21 @@
 # VI. Conclusion and Future Work
 ## Conclusion
 We have presented an end-to-end AI pipeline for detecting digitally replicated signatures in financial audit reports at scale.
 Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-method similarity verification.
 Our key findings are threefold.
 First, we argued that signature replication detection is a distinct problem from signature forgery detection, requiring different analytical tools focused on intra-signer similarity distributions.
 Second, we showed that combining cosine similarity of deep features with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the structural verification layer revealed that only 41% exhibit converging replication evidence, while 7% show no structural corroboration despite near-identical features, demonstrating that a single-metric approach conflates style consistency with digital duplication.
 Third, we introduced a calibration methodology using a known-replication reference group whose distributional characteristics (dHash median = 5, 95th percentile = 15) directly informed the classification thresholds, achieving 96.9% capture of the calibration group.
 An ablation study comparing three feature extraction backbones (ResNet-50, VGG-16, EfficientNet-B0) confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
 ## Future Work
 Several directions merit further investigation.
 Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
 Temporal analysis of signature similarity trends---tracking how individual CPAs' similarity profiles evolve over years---could reveal transitions between genuine signing and digital replication practices.
 The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
 Finally, integration with regulatory monitoring systems and small-scale ground truth validation through expert review would strengthen the practical deployment potential of this approach.
@@ -0,0 +1,57 @@
 # V. Discussion
 ## A. Replication Detection as a Distinct Problem
 Our results highlight the importance of distinguishing signature replication detection from the well-studied signature forgery detection problem.
 In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature.
 In replication detection, the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and digital duplication (a CPA who reuses a scanned image).
 This distinction has direct methodological consequences.
 Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures.
 Replication detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and digital copies becomes ambiguous.
 The dual-method framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-method approaches cannot.
 ## B. The Style-Replication Gap
 Perhaps the most important empirical finding is the stratification that the dual-method framework reveals within the high-cosine population.
 Of 71,656 documents with cosine similarity exceeding 0.95, the dHash dimension partitions them into three distinct groups: 29,529 (41.2%) with high-confidence structural evidence of replication, 36,994 (51.6%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
 A cosine-only approach would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations.
 The 7.2% classified as "high style consistency" (cosine > 0.95 but dHash > 15) are particularly informative.
 Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions.
 Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the feature level while retaining the microscopic variations inherent to handwriting.
 Some may use signing pads or templates that further constrain variability without constituting digital replication.
 The dual-method framework separates these cases from digitally replicated signatures by detecting the absence of structural-level convergence.
 ## C. Value of Known-Replication Calibration
 The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
 In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
 Our approach leverages domain knowledge---the established practice of digital signature replication at a specific firm---to create a naturally occurring positive control group within the dataset.
 This calibration strategy has broader applicability beyond signature analysis.
 Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and percentile-based thresholds are preferred over parametric alternatives.
 ## D. Limitations
 Several limitations should be acknowledged.
 First, comprehensive ground truth labels are not available for the full dataset.
 While Firm A provides a known-replication reference and the dual-method framework produces internally consistent results, the classification of non-Firm-A documents relies on statistical inference without independent per-document ground truth.
 A small-scale manual verification study (e.g., 100--200 documents sampled across classification categories) would strengthen confidence in the classification boundaries.
 Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning.
 While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor trained on a curated dataset could improve discriminative performance.
 Third, the red stamp removal preprocessing uses simple HSV color space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions.
 In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity.
 This effect would make replication harder to detect (biasing toward false negatives) rather than easier, but the magnitude of the impact has not been quantified.
 Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the study period (2013--2023), potentially affecting similarity measurements.
 While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.
 Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted digital replication later).
 Temporal segmentation of signature similarity could reveal such transitions but is beyond the scope of this study.
 Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
 Whether digital replication of a CPA's own genuine signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.
@@ -0,0 +1,10 @@
 # Impact Statement
 <!-- 100-150 words. Non-specialist readable. No jargon. Specific, not vague. -->
 Auditor signatures on financial reports are a key safeguard of corporate accountability.
 When Certified Public Accountants digitally copy and paste a single signature image across multiple reports instead of signing each one individually, this safeguard is undermined---yet detecting such practices through manual inspection is infeasible at the scale of modern financial markets.
 We developed an artificial intelligence system that automatically extracts and analyzes signatures from more than 90,000 audit reports spanning over a decade of filings by publicly listed companies.
 By combining deep learning-based visual feature analysis with perceptual hashing, the system distinguishes genuinely handwritten signatures from digitally replicated ones.
 Our analysis reveals substantial variation in signature similarity patterns across accounting firms, with a calibration group independently identified as using digital replication exhibiting distinctly higher similarity scores.
 After further validation, this technology could serve as an automated screening tool to support financial regulators in monitoring signature authenticity at national scale.
@@ -0,0 +1,81 @@
 # I. Introduction
 <!-- Target: ~1.5 pages double-column IEEE format. Double-blind: no author/institution info. -->
 Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection.
 In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require that certifying CPAs affix their signature or seal (簽名或蓋章) to each audit report [1].
 While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].
 The digitization of financial reporting, however, has introduced a practice that challenges this intent.
 As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically trivial for a CPA to digitally replicate a single scanned signature image and paste it across multiple reports.
 Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful attestation of individual professional judgment for each engagement.
 Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, signature replication involves the legitimate signer reusing a digital copy of their own genuine signature.
 This practice, while potentially widespread, is virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of digital duplication.
 The distinction between signature *replication* and signature *forgery* is both conceptually and technically important.
 The extensive body of research on offline signature verification [3]--[8] has focused almost exclusively on forgery detection---determining whether a questioned signature was produced by its purported author or by an impostor.
 This framing presupposes that the central threat is identity fraud.
 In our context, identity is not in question; the CPA is indeed the legitimate signer.
 The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was digitally propagated across many reports.
 This replication detection problem differs fundamentally from forgery detection: while it does not require modeling the variability of skilled forgers, it introduces the distinct challenge of separating legitimate intra-signer consistency from digital duplication, requiring an analytical framework focused on detecting abnormally high similarity across documents.
 Despite the significance of this problem for audit quality and regulatory oversight, no prior work has specifically addressed the detection of same-signer digital replication in financial audit documents at scale.
 Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of digital copies.
 Copy-move forgery detection methods [10], [11] address duplicated regions within or across images, but are designed for natural images and do not account for the specific characteristics of scanned document signatures, where legitimate visual similarity between a signer's authentic signatures is expected and must be distinguished from digital duplication.
 Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations, but has not been applied to document forensics or signature analysis.
 In this paper, we present a fully automated, end-to-end pipeline for detecting digitally replicated CPA signatures in audit reports at scale.
 Our approach processes raw PDF documents through six sequential stages: (1) signature page identification using a Vision-Language Model (VLM), (2) signature region detection using a trained YOLOv11 object detector, (3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network, (4) dual-method similarity verification combining cosine similarity of deep features with difference hash (dHash) distance, (5) distribution-free threshold calibration using a known-replication reference group, and (6) statistical classification with cross-method validation.
 The dual-method verification is central to our contribution.
 Cosine similarity of deep feature embeddings captures high-level visual style similarity---it can identify signatures that share similar stroke patterns and spatial layouts---but cannot distinguish between a CPA who signs consistently and one who reuses a digital copy.
 Perceptual hashing (specifically, difference hashing), by contrast, encodes structural-level image gradients into compact binary fingerprints that are robust to scan noise but sensitive to substantive content differences.
 By requiring convergent evidence from both methods, we can differentiate *style consistency* (high cosine similarity but divergent dHash) from *digital replication* (high cosine similarity with convergent dHash), resolving an ambiguity that neither method can address alone.
 A distinctive feature of our approach is the use of a known-replication calibration group for threshold validation.
 One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as using digitally replicated signatures across its audit reports.
 This status was established through two independent lines of evidence prior to our analysis: (1) visual inspection of a random sample of Firm A's reports reveals pixel-identical signature images across different audit engagements and fiscal years; and (2) the practice is acknowledged as common knowledge among audit practitioners in Taiwan.
 Our subsequent quantitative analysis is consistent with this status: 92.5% of Firm A's signatures exhibit best-match cosine similarity exceeding 0.95, as expected under digital replication rather than handwriting.
 Importantly, Firm A's known-replication status was not derived from the thresholds we calibrate against it; the identification is based on domain knowledge and visual evidence that is independent of the statistical pipeline.
 This provides an empirical anchor for calibrating detection thresholds: any threshold that fails to classify the vast majority of Firm A's signatures as replicated is demonstrably too conservative, while Firm A's distributional characteristics establish the range of similarity values achievable through replication in real-world scanned documents.
 This calibration strategy---using a known-positive subpopulation to validate detection thresholds---addresses a persistent challenge in document forensics, where comprehensive ground truth labels are scarce.
 We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants.
 To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature.
 The contributions of this paper are summarized as follows:
 1. **Problem formulation:** We formally define the signature replication detection problem as distinct from signature forgery detection, and argue that it requires a different analytical framework focused on intra-signer similarity distributions rather than genuine-versus-forged classification.
 2. **End-to-end pipeline:** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-method similarity verification, with automated inference requiring no manual intervention after initial training and annotation.
 3. **Dual-method verification:** We demonstrate that combining deep feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and digital replication, supported by an ablation study comparing three feature extraction backbones.
 4. **Calibration methodology:** We introduce a threshold calibration approach using a known-replication reference group, providing empirical validation in a domain where labeled ground truth is scarce.
 5. **Large-scale empirical analysis:** We report findings from the analysis of over 90,000 audit reports spanning a decade, providing the first large-scale empirical evidence on signature replication practices in financial reporting.
 The remainder of this paper is organized as follows.
 Section II reviews related work on signature verification, document forensics, and perceptual hashing.
 Section III describes the proposed methodology.
 Section IV presents experimental results including the ablation study and calibration group analysis.
 Section V discusses the implications and limitations of our findings.
 Section VI concludes with directions for future work.
 <!-- 
 REFERENCES used in Introduction:
 [1] Taiwan CPA Act §4 (會計師法第4條) + FSC Attestation Regulations §6 (查核簽證核准準則第6條)
    - CPA Act: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067
    - FSC Regs: https://law.moj.gov.tw/LawClass/LawAll.aspx?pcode=G0400013
 [2] Yen, Chang & Chen 2013 — Does the signature of a CPA matter? (Res. Account. Regul., vol. 25, no. 2)
 [3] Bromley et al. 1993 — Siamese time delay neural network for signature verification (NeurIPS)
 [4] Dey et al. 2017 — SigNet: Siamese CNN for writer-independent offline SV (arXiv:1707.02131)
 [5] Hadjadj et al. 2020 — Single known sample offline SV (Applied Sciences)
 [6] Li et al. 2024 — TransOSV: Transformer for offline SV (Pattern Recognition)
 [7] Tehsin et al. 2024 — Triplet Siamese for digital documents (Mathematics)
 [8] Brimoh & Olisah 2024 — Consensus threshold for offline SV (arXiv:2401.03085)
 [9] Woodruff et al. 2021 — Fully automatic pipeline for document signature analysis / money laundering (arXiv:2107.14091)
 [10] Abramova & Böhme 2016 — Copy-move forgery detection in scanned text documents (Electronic Imaging)
 [11] Copy-move forgery detection survey — MTAP 2024
 [12] Jakhar & Borah 2025 — Near-duplicate detection using pHash + deep learning (Info. Processing & Management)
 [13] Pizzi et al. 2022 — SSCD: Self-supervised copy detection (CVPR)
 -->
@@ -0,0 +1,146 @@
 # III. Methodology
 ## A. Pipeline Overview
 We propose a six-stage pipeline for large-scale signature replication detection in scanned financial documents.
 Fig. 1 illustrates the overall architecture.
 The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures into one of five categories---high-confidence replication, moderate-confidence replication, high style consistency, uncertain, or likely genuine---along with supporting evidence from multiple verification methods.
 <!--
 [Figure 1: Pipeline Architecture - clean vector diagram]
 90,282 PDFs → VLM Pre-screening → 86,072 PDFs
 → YOLOv11 Detection → 182,328 signatures
 → ResNet-50 Features → 2048-dim embeddings
 → Dual-Method Verification (Cosine + dHash)
 → Threshold Calibration (Firm A) → Classification
 -->
 ## B. Data Collection
 The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023.
 The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings.
 An automated web scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period.
 Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.
 CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit firm tenure registry that covers 758 unique CPAs and 15 document types; standard audit reports account for the majority (86.4%).
 Table I summarizes the dataset composition.
 <!-- TABLE I: Dataset Summary
 | Attribute | Value |
 |-----------|-------|
 | Total PDF documents | 90,282 |
 | Date range | 2013–2023 |
 | Documents with signatures | 86,072 (95.3%) |
 | Unique CPAs identified | 758 |
 | Accounting firms | >50 |
 -->
 ## C. Signature Page Identification
 To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism.
 Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature.
 The model was configured with temperature 0 for deterministic output.
 The scanning range was restricted to the first quarter of each document's pages, reflecting the regulatory structure of Taiwanese audit reports, in which the auditor's report page consistently appears within that range.
 Scanning terminated upon the first positive detection.
 This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded.
 An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.
 Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents, establishing an upper bound on the VLM false positive rate of 1.2%.
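The first-quarter scan with early termination described in this subsection can be sketched as follows. This is only an illustration under stated assumptions: `page_has_signature` is a hypothetical stand-in for the VLM call, and rounding the quarter up (so that short documents still have at least one page checked) is our assumption, not a detail given in the text.

```python
import math
from typing import Callable, Optional


def find_signature_page(num_pages: int,
                        page_has_signature: Callable[[int], bool]) -> Optional[int]:
    """Return the 0-based index of the first page flagged as a signature page.

    Only the first quarter of the document is scanned, and scanning stops
    at the first positive answer. `page_has_signature` is a hypothetical
    callback standing in for the VLM query on one rendered page.
    """
    if num_pages <= 0:
        return None
    # Assumption: round the quarter up so at least one page is always checked.
    limit = max(1, math.ceil(num_pages / 4))
    for page_index in range(limit):
        if page_has_signature(page_index):
            return page_index  # early termination on the first positive
    return None
```

Stopping at the first positive keeps the number of VLM calls per document small, at the cost of never inspecting later pages that might also carry signatures.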
 ## D. Signature Detection
 We adopted YOLOv11n (nano variant) [25] for signature region localization.
 A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction.
 A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.
 The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
 <!-- TABLE II: YOLO Detection Performance
 | Metric | Value |
 |--------|-------|
 | Precision | 0.97–0.98 |
 | Recall | 0.95–0.98 |
 | mAP@0.50 | 0.98–0.99 |
 | mAP@0.50:0.95 | 0.85–0.90 |
 -->
 Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers).
 A red stamp removal step was applied to each cropped signature using HSV color space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
 Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
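A minimal sketch of the HSV-based red stamp removal is given below, using only the Python standard library. The hue band and the saturation and value cutoffs (`hue_band`, `sat_min`, `val_min`) are illustrative assumptions; the text does not state the values used in the actual pipeline.

```python
import colorsys


def remove_red_stamp(pixels, sat_min=0.4, val_min=0.3, hue_band=0.05):
    """Replace red-looking pixels with white.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples with
    channel values in 0..255. All thresholds are hypothetical defaults.
    """
    white = (255, 255, 255)
    cleaned = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            # Red sits at both ends of the hue circle (near 0.0 and near 1.0).
            is_red_hue = h <= hue_band or h >= 1.0 - hue_band
            if is_red_hue and s >= sat_min and v >= val_min:
                new_row.append(white)
            else:
                new_row.append((r, g, b))
        cleaned.append(new_row)
    return cleaned
```

The saturation check is what keeps dark ink and white paper untouched: both have near-zero saturation, so they fail the red test even though their hue is reported as 0.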
 ## E. Feature Extraction
 Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning.
 The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.
 Preprocessing consisted of resizing to 224×224 pixels with aspect ratio preservation and white padding, followed by ImageNet channel normalization.
 All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
 The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D).
 This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
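The aspect-preserving resize with white padding can be illustrated by the geometry computation below. Centering the resized signature on the 224×224 canvas is an assumption made for this sketch; the text specifies only that the aspect ratio is preserved and the remainder is padded with white.

```python
def letterbox_geometry(width, height, target=224):
    """Compute the resized size and padding offsets for a square canvas.

    Returns (new_width, new_height, pad_left, pad_top). The longer side is
    scaled to `target`; centering the result is an assumption of this sketch.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top
```

For a wide 448×112 crop, for instance, the signature is scaled to 224×56 and the remaining rows above and below are filled with white.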
 ## F. Dual-Method Similarity Verification
 For each signature, the most similar signature from the same CPA across all other documents was identified via cosine similarity of feature vectors.
 Two complementary measures were then computed against this closest match:
 **Cosine similarity** captures high-level visual style similarity:
 $$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$
 where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized feature vectors.
 A high cosine similarity indicates that two signatures share similar visual characteristics---stroke patterns, spatial layout, and overall appearance---but does not distinguish between consistent handwriting style and digital duplication.
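The following sketch shows why L2 normalization reduces cosine similarity to a dot product, and how a best match among a signer's other signatures could be selected. The function names are hypothetical and the vectors are plain Python lists rather than the 2048-dimensional embeddings used in the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(f_a, f_b):
    """Dot product of two vectors that are assumed to be L2-normalized."""
    return sum(a * b for a, b in zip(f_a, f_b))


def best_match(query, candidates):
    """Return (index, similarity) of the candidate most similar to `query`."""
    sims = [cosine_similarity(query, c) for c in candidates]
    idx = max(range(len(sims)), key=sims.__getitem__)
    return idx, sims[idx]
```

In use, `candidates` would hold the normalized features of all other signatures attributed to the same CPA, so the returned similarity is the best-match score discussed in the text.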
 **Perceptual hash distance** captures structural-level similarity.
 Specifically, we employ a difference hash (dHash) [27], a perceptual hashing variant that encodes relative intensity gradients rather than absolute pixel values.
 Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint.
 The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images.
 Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
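A compact sketch of the difference hash and its Hamming distance is shown below. Two details are assumptions of this illustration: the nearest-neighbor downscaling (a production pipeline would typically use a library resize with interpolation) and the bit convention, where a bit is set when the left pixel is brighter than its right neighbor.

```python
def resize_nearest(gray, width, height):
    """Nearest-neighbor downscale of a 2D list of grayscale values."""
    src_h, src_w = len(gray), len(gray[0])
    return [[gray[y * src_h // height][x * src_w // width]
             for x in range(width)]
            for y in range(height)]


def dhash(gray, hash_size=8):
    """64-bit difference hash (for hash_size=8) of a grayscale image.

    The image is reduced to 9 columns by 8 rows; each of the 8 adjacent
    column pairs in each row contributes one bit.
    """
    small = resize_nearest(gray, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for x in range(hash_size):
            bits = (bits << 1) | (1 if row[x] > row[x + 1] else 0)
    return bits


def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")
```

Because each bit depends only on the sign of a local brightness difference, uniform brightness or contrast shifts from rescanning leave the hash largely unchanged, while a differently drawn stroke flips bits.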
 The complementarity of these two measures is the key to resolving the style-versus-replication ambiguity:
 - High cosine similarity + low dHash distance → converging evidence of digital replication
 - High cosine similarity + high dHash distance → consistent handwriting style, not replication
 This dual-method design was preferred over SSIM (Structural Similarity Index), which proved unreliable for scanned documents: a known-replication firm exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content.
 Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle, making them more suitable for this application.
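The decision logic for the high-cosine population can be sketched as below, using the cutoffs stated elsewhere in the paper (cosine > 0.95, dHash distance ≤ 5 and ≤ 15). The function name is hypothetical, the labels mirror those in the report script, and signatures at or below the cosine cutoff are deliberately left unclassified here because their tiers are not covered by this sketch.

```python
COSINE_HIGH = 0.95      # feature-level cutoff quoted in the text
DHASH_HIGH_CONF = 5     # Firm A median dHash distance
DHASH_MOD_CONF = 15     # Firm A 95th-percentile dHash distance


def classify_high_cosine(cosine, dhash_distance):
    """Stratify a best-match pair by structural evidence.

    Returns None when the cosine similarity does not exceed the cutoff;
    the remaining tiers are outside the scope of this illustration.
    """
    if cosine <= COSINE_HIGH:
        return None
    if dhash_distance <= DHASH_HIGH_CONF:
        return "high_confidence_replication"
    if dhash_distance <= DHASH_MOD_CONF:
        return "moderate_confidence_replication"
    return "high_style_consistency"
```

The cosine check acts as a gate and the dHash distance as the stratifier, which is what lets two signatures with identical cosine scores receive different verdicts.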
 ## G. Threshold Selection and Calibration
 ### Distribution-Free Thresholds
 To establish classification thresholds, we computed cosine similarity distributions for two groups:
 - **Intra-class** (same CPA): all pairwise similarities among signatures attributed to the same CPA (41.3M pairs from 728 CPAs with ≥3 signatures)
 - **Inter-class** (different CPAs): 500,000 randomly sampled cross-CPA pairs
 Shapiro-Wilk tests rejected normality for both distributions ($p < 0.001$), motivating the use of distribution-free, percentile-based thresholds rather than parametric ($\mu \pm k\sigma$) approaches.
 The primary threshold was derived via Kernel Density Estimation (KDE) [28]: the crossover point where the intra-class and inter-class density functions intersect.
 Under equal prior probabilities and symmetric misclassification costs, this crossover approximates the optimal decision boundary between the two classes.
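As a rough illustration of the crossover idea, the sketch below fits a Gaussian KDE to each group and bisects for the point where the two densities are equal. The bandwidth rule (Silverman's rule of thumb), the bisection search between the two sample means, and the synthetic scores are all assumptions made for this example; the actual analysis script may differ.

```python
import math
import random
from statistics import mean, stdev
from typing import Callable, Sequence


def gaussian_kde(samples: Sequence[float]) -> Callable[[float], float]:
    """Gaussian KDE with Silverman's rule-of-thumb bandwidth (assumed choice)."""
    n = len(samples)
    h = 1.06 * stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))

    def density(x: float) -> float:
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return density


def kde_crossover(intra: Sequence[float], inter: Sequence[float], tol: float = 1e-4) -> float:
    """Bisect between the two sample means for the point where the densities are equal."""
    f_intra, f_inter = gaussian_kde(intra), gaussian_kde(inter)

    def diff(x: float) -> float:
        return f_intra(x) - f_inter(x)

    lo, hi = sorted((mean(inter), mean(intra)))
    if diff(lo) * diff(hi) > 0:
        raise ValueError("densities do not cross between the two means")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(lo) * diff(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# Synthetic similarity scores, for illustration only (not the paper's data).
rng = random.Random(0)
intra_scores = [rng.gauss(0.82, 0.05) for _ in range(400)]
inter_scores = [rng.gauss(0.72, 0.05) for _ in range(400)]

crossover = kde_crossover(intra_scores, inter_scores)
print(round(crossover, 3))
```

With equal priors and symmetric costs, scores above the returned value are more likely under the intra-class density than under the inter-class density, which is why the crossover serves as a decision boundary.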
 ### Known-Replication Calibration
 A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm whose use of digitally replicated signatures was established through independent visual inspection and domain knowledge prior to threshold calibration (see Section I)---as a calibration reference.
 Firm A's signature similarity distribution provides two critical anchors:
 1. **Lower bound validation:** Any detection threshold must classify the vast majority of Firm A's signatures as replicated; a threshold that fails this criterion is too conservative.
 2. **Replication floor estimation:** Firm A's 1st percentile of cosine similarity establishes how low similarity scores can fall even among confirmed replicated signatures, due to scan noise and PDF compression artifacts. This lower bound on replication similarity informs the minimum sensitivity required of any detection threshold.
 This calibration strategy addresses a persistent challenge in document forensics where comprehensive ground truth labels are unavailable.
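The two anchors can be expressed as two small functions: one estimates a low percentile of the reference group's scores, and the other measures how much of the reference group a candidate threshold captures. The sketch below uses linear interpolation for the percentile and a strict `>` comparison, both assumptions, and the `reference_scores` list is hypothetical rather than Firm A's data.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Empirical percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def capture_rate(values: Sequence[float], threshold: float) -> float:
    """Fraction of reference-group scores that a threshold would label as replicated."""
    return sum(v > threshold for v in values) / len(values)


# Hypothetical best-match cosine scores for a known-replication group.
reference_scores = [0.91, 0.94, 0.96, 0.97, 0.98, 0.98, 0.99, 0.99, 0.99, 1.00]

replication_floor = percentile(reference_scores, 1)  # anchor 2: how low replicated scores fall
rates = {t: capture_rate(reference_scores, t) for t in (0.837, 0.90, 0.95)}  # anchor 1
print(replication_floor, rates)
```

A candidate threshold whose capture rate on the reference group is low would be rejected as too conservative under the first anchor.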
 ## H. Classification
 The final per-document classification uses exclusively the dual-method framework (cosine similarity + dHash distance), with thresholds calibrated against Firm A's known-replication distribution.
 Firm A's dHash distances show a median of 5 and a 95th percentile of 15; we use these empirical values to define confidence tiers:
 1. **High-confidence replication:** Cosine similarity > 0.95 AND dHash distance ≤ 5. Both feature-level and structural-level evidence converge, consistent with Firm A's median behavior.
 2. **Moderate-confidence replication:** Cosine similarity > 0.95 AND dHash distance 6--15. Feature-level evidence is strong; structural similarity is present but weaker than at the Firm A median (the distance lies between Firm A's median and its 95th percentile), possibly due to scan variations.
 3. **High style consistency:** Cosine similarity > 0.95 AND dHash distance > 15. High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not digitally.
 4. **Uncertain:** Cosine similarity between the KDE crossover (0.837) and 0.95, without sufficient evidence for classification in either direction.
 5. **Likely genuine:** Cosine similarity below the KDE crossover threshold.
 The dHash thresholds (≤ 5 and ≤ 15) are directly derived from Firm A's calibration distribution rather than set ad hoc, ensuring that the classification boundaries are empirically grounded.
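The five tiers translate directly into a small decision function. The sketch below is illustrative only: the function and parameter names are hypothetical, the default values are the thresholds quoted in this section, and the handling of a score exactly equal to the KDE crossover (treated here as "uncertain") is an assumption the text does not settle.

```python
def classify(cosine: float, dhash_distance: int,
             kde_crossover: float = 0.837, cosine_high: float = 0.95,
             dhash_median: int = 5, dhash_p95: int = 15) -> str:
    """Map a (cosine similarity, dHash distance) pair to one of the five tiers."""
    if cosine > cosine_high:
        if dhash_distance <= dhash_median:
            return "high-confidence replication"
        if dhash_distance <= dhash_p95:
            return "moderate-confidence replication"
        return "high style consistency"
    if cosine >= kde_crossover:  # boundary handling is an assumption
        return "uncertain"
    return "likely genuine"


examples = [(0.99, 2), (0.97, 11), (0.96, 23), (0.90, 4), (0.80, 30)]
for cos, dist in examples:
    print(cos, dist, classify(cos, dist))
```

Note that the dHash distance only matters once the cosine similarity exceeds 0.95; below that level the tier is decided by cosine similarity alone.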
@@ -0,0 +1,282 @@
 # Paper A: IEEE TAI Outline (Draft)
 > **Target:** IEEE Transactions on Artificial Intelligence (Regular Paper, ≤10 pages)
 > **Review:** Double-blind
 > **Status:** Outline; expand each section after discussion and confirmation
 ---
 ## Title (candidates)
 1. "Automated Detection of Digitally Replicated Signatures in Large-Scale Financial Audit Reports"
 2. "Are They Really Signing? A Deep Learning Pipeline for Detecting Signature Replication in 90K Audit Reports"
 3. "Large-Scale Forensic Analysis of CPA Signature Authenticity Using Deep Features and Perceptual Hashing"
 > Recommend 1 or 3, which read as more formally academic. 2 is catchier, but TAI may lean conservative.
 ---
 ## Abstract (150-250 words)
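 **Key elements:**
 - Problem: audit reports require hand-signed signatures, but in practice signatures may be digitally replicated (stamped copies)
 - Gap: no large-scale automated detection method currently exists
 - Method: VLM pre-screening → YOLO detection → ResNet-50 feature extraction → Cosine + pHash verification
 - Scale: 90,282 PDFs, 182,328 signatures, 758 CPAs, 2013-2023
 - Key finding: calibrating against a firm known to use replicated signatures to establish a distribution-free threshold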
 - Contribution: first large-scale study, end-to-end pipeline, empirical threshold validation
 ---
 ## Impact Statement (100-150 words)
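 **Direction (understandable to non-specialists):**
 A CPA's signature on an audit report is an important safeguard of the credibility of financial reporting. If a signature is not produced by hand each time but is instead digitally copied and pasted, audit quality and investor protection are affected. This study develops an automated AI pipeline and analyzes more than 90,000 audit reports of Taiwan-listed companies spanning 10 years, extracting and comparing 180,000 signatures. By cross-validating deep learning features against perceptual hashing, we can distinguish "hand-signed signatures with a consistent style" from "digitally replicated signatures." We find that the signatures of some accounting firms exhibit a degree of consistency that is statistically impossible to produce by handwriting. The method can be applied directly in automated audit systems at financial regulators.
 > Note: the submission will use an English version; this text was first drafted in Chinese to settle the direction of the content.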
 ---
 ## I. Introduction (~1.5 pages)
 ### Paragraph structure:
 **P1 — Problem context**
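 - Legal significance of signatures on audit reports (Taiwan regulations require hand-signing)
 - A loophole introduced by digitization: signatures in PDF reports are easy to copy and paste
 - Regulators cannot manually inspect every report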
 **P2 — Why this matters (motivation)**
 - Audit quality → investor protection → capital-market trust
 - Signature authenticity is a proxy indicator of audit independence
 - [REF: literature on audit quality]
 **P3 — What exists (gap)**
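 - Existing signature verification research focuses on forgery detection
 - Our question is different: not "was this signed by the person?" but "was it hand-signed every time?"
 - Replication detection ≠ Forgery detection
 - No large-scale studies on real financial reports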
 **P4 — What we do (contribution)**
 - End-to-end pipeline: VLM → YOLO → ResNet → Cosine + pHash
 - Scale: 90K+ documents, 180K+ signatures, 10 years
 - Distribution-free threshold with known-replication calibration group
 - First study applying AI to audit signature authenticity at this scale
 **P5 — Paper organization**
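 - One sentence covering each section
 ### Contribution list (stated explicitly):
 1. **Pipeline**: a complete end-to-end automated system for detecting signature authenticity
 2. **Scale**: the largest analysis of audit report signatures to date (90K PDFs, 180K signatures)
 3. **Methodology**: two-layer verification combining deep features (Cosine) and perceptual hashing (pHash), resolving the "consistent style vs. digital replication" ambiguity
 4. **Calibration**: a firm known to use replicated signatures serves as ground truth for calibrating a distribution-free threshold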
 ---
 ## II. Related Work (~1 page)
 ### A. Offline Signature Verification
 - Siamese networks: Bromley et al. 1993, Dey et al. 2017 (SigNet)
 - CNN-based: Hadjadj et al. 2020 (single known sample)
 - Triplet Siamese: Mathematics 2024
 - Consensus threshold: arXiv:2401.03085
 - **Positioning**: these all address forgery detection (verifying authenticity); we address replication detection (verifying whether signatures are replicated)
 ### B. Document Forensics & Copy-Move Detection
 - Copy-move forgery detection survey (MTAP 2024)
 - Image forensics in scanned documents
 - **Positioning**: typically targets image tampering, not signature reuse
 ### C. VLM & Object Detection in Document Analysis
 - Vision-Language Models for document understanding
 - YOLO variants in document element detection
 - **Positioning**: we use VLM + YOLO as the pipeline front end; not a core contribution, but it needs to be described
 ### D. Perceptual Hashing for Image Comparison
 - pHash in near-duplicate detection
 - Complementarity with deep features
 ---
 ## III. Methodology (~3 pages)
 > Condense from methodology_draft_v1.md; focus on the core method and omit implementation details
 ### A. Pipeline Overview
 - Figure 1: full pipeline diagram (condensed)
 - One-sentence description of each stage
 ### B. Data Collection
 - 90,282 PDFs from TWSE MOPS, 2013-2023
 - Table I: Dataset summary (condensed)
 - CPA registry matching
 ### C. Signature Detection
 - VLM pre-screening (Qwen2.5-VL): hit-and-stop strategy, 86,072 docs
 - YOLOv11n: 500 annotated → mAP50=0.99 → 182,328 signatures
 - Red stamp removal post-processing
 - **Omit**: full VLM prompt, annotation protocol details, validation details → move to a footnote or mention briefly
 ### D. Feature Extraction
 - ResNet-50 (ImageNet1K_V2), no fine-tuning, 2048-dim, L2 normalized
 - Why no fine-tuning: similarity task, not classification; generalizability
 - CPA matching: 92.6% success rate
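 ### E. Dual-Method Verification (core)
 - **Cosine similarity**: captures style-level similarity (high-level)
 - **pHash distance**: captures perceptual-level similarity (structural)
 - Why this combination:
  - High cosine + low pHash distance = strong evidence (digital replication)
  - High cosine + high pHash distance = consistent style but not replication (hand-signed)
  - Complementarity resolves the ambiguity of any single metric
 - **Why SSIM is excluded**: sensitive to scan noise; known replicated signatures reach an SSIM of only 0.70 (mention in a footnote)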
 ### F. Threshold Selection
 - Distribution-free approach (non-normal → percentiles)
 - KDE crossover = 0.838
 - Intra/Inter class distributions（Table + Figure）
 - **Calibration via known-replication firm**（key contribution）:
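  - Deloitte Taiwan: domain knowledge confirms that all signatures are replicated
  - Cosine mean = 0.980, 1st percentile = 0.908
  - pHash ≤5: 58.75%
  - Used as the anchor point for threshold calibration
 > Note on double-blind review: do not write "Deloitte"; use "Firm A (a Big-4 firm known to use digital replication)" instead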
 ---
 ## IV. Experiments and Results (~2.5 pages)
 ### A. Experimental Setup
 - Hardware/software environment
 - Definitions of evaluation metrics
 ### B. Signature Detection Performance
 - Table: YOLO metrics (Precision, Recall, mAP)
 - VLM-YOLO agreement rate: 98.8%
 ### C. Distribution Analysis
 - Figure: Intra vs Inter cosine similarity distributions
 - Figure: pHash distance distributions (intra vs inter)
 - Table: Distributional statistics
 - Normality tests → justify percentile-based thresholds
 ### D. Calibration Group Analysis (key section)
 - Cosine/pHash distributions of "Firm A" (known replication)
 - Comparison against the non-Big-4 distributions
 - KDE crossover (Firm A vs non-Big-4) = 0.969
 - Figure: Firm A distribution vs overall distribution
 - **This is the most persuasive section**
 ### E. Classification Results
 - Table: Overall verdict distribution (definite_copy / likely_copy / uncertain / genuine)
 - Cross-method agreement analysis
 - **Key finding**: Cosine-high ≠ pixel-identical
  - 71,656 PDFs with Cosine > 0.95
  - Only 3.4% also have SSIM > 0.95
  - Only 0.4% are pixel-identical
 ### F. Ablation Study (new; strengthens the AI contribution)
 - **Feature backbone comparison**: ResNet-50 vs VGG-16 vs EfficientNet-B0
  - Compare intra/inter class separation (Cohen's d)
  - Compute cost vs discriminative power trade-off
 - **Single method vs dual method**:
  - Cosine only vs pHash only vs Cosine + pHash
  - Use Firm A as the positive set to compute precision/recall
 - **Threshold sensitivity**:
  - How classification results change under different cosine thresholds
  - ROC-like curve (with Firm A as positive)
 ---
 ## V. Discussion (~1 page)
 ### A. Replication vs Forgery: A Distinction That Matters
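 - Our problem is inherently simpler and more direct
 - No need to account for the presence of an impostor
 - Physical impossibility argument: one person's hand-signed signatures cannot be pixel-identical every time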
 ### B. The Gap Between Style Similarity and Digital Replication
 - 81.4% likely_copy (Cosine) vs 2.8% definite_copy (pixel-level)
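 - Interpretation: most CPAs' signatures are highly consistent in style, but are not digital replicas
 - Possible causes: use of signature pads, fixed signing environments
 - **Policy implication**: relying on cosine similarity alone would severely overestimate the replication rate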
 ### C. The Value of a Known-Replication Calibration Group
 - The importance of a ground truth anchor for threshold calibration
 - Generalizable to other document forensics problems
 ### D. Limitations
 - Condensed limitations (3-4 points)
 - No labeled ground truth for full dataset
 - Feature extractor not fine-tuned
 - Scan quality variation over 10 years
 - Regulatory/legal definition of "replication" varies
 ---
 ## VI. Conclusion and Future Work (~0.5 page)
 ### Conclusion
 - Summarize the pipeline, scale, and key findings
 - Emphasize the necessity of the dual method (cosine alone is not enough)
 - The methodological contribution of the calibration group
 ### Future Work
 - Fine-tuned signature-specific feature extractor
 - Temporal analysis (year-over-year trends)
 - Cross-country generalization
 - Integration with regulatory monitoring systems
 - Small-scale ground truth validation (100-200 PDFs)
 ---
 ## Figures & Tables Budget (allocation under the 10-page limit)
 | # | Type | Content | Est. space |
 |---|------|---------|------------|
 | Fig 1 | Pipeline | Full pipeline diagram | 1/3 page |
 | Fig 2 | Distribution | Intra vs Inter cosine KDE | 1/3 page |
 | Fig 3 | Distribution | pHash distance intra vs inter | 1/4 page |
 | Fig 4 | Calibration | Firm A vs overall distribution | 1/3 page |
 | Fig 5 | Ablation | Backbone comparison / threshold sensitivity | 1/3 page |
 | Table I | Data | Dataset summary | 1/4 page |
 | Table II | Detection | YOLO performance | 1/6 page |
 | Table III | Statistics | Distribution stats + tests | 1/4 page |
 | Table IV | Results | Classification verdicts | 1/4 page |
 | Table V | Ablation | Feature backbone comparison | 1/4 page |
 **Total figures/tables**: ~3 pages → Text: ~7 pages → Feasible for 10-page limit
 ---
 ## To-Do Checklist
 ### Analyses to add (Ablation Study)
 - [ ] ResNet-50 vs VGG-16 vs EfficientNet-B0 feature comparison
 - [ ] Single method vs dual method precision/recall (with Firm A as positive set)
 - [ ] Threshold sensitivity curve
 ### Figures and tables to prepare
 - [ ] Fig 1: Pipeline diagram (clean vector version)
 - [ ] Fig 4: Firm A calibration distribution (new figure)
 - [ ] Fig 5: Ablation results (new figure)
 - [ ] Convert all figures and tables to English
 ### Writing
 - [ ] Impact Statement (English version)
 - [ ] Abstract (English version)
 - [ ] Introduction
 - [ ] Related Work: needs additional literature search
 - [ ] Methodology (condensed from v1)
 - [ ] Results (newly written)
 - [ ] Discussion (newly written)
 - [ ] Conclusion
 ### Submission preparation
 - [ ] Anonymization (Deloitte → Firm A; remove all identifying information)
 - [ ] IEEE LaTeX template
 - [ ] Reference formatting (IEEE numbered style)
 - [ ] Similarity index < 20%
@@ -0,0 +1,77 @@
 # References
 <!-- IEEE numbered style, sequential by first appearance in text -->
 [1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067
 [2] S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," *Res. Account. Regul.*, vol. 25, no. 2, pp. 230–235, 2013.
 [3] J. Bromley et al., "Signature verification using a Siamese time delay neural network," in *Proc. NeurIPS*, 1993.
 [4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.
 [5] I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," *Appl. Sci.*, vol. 10, no. 11, p. 3716, 2020.
 [6] H. Li et al., "TransOSV: Offline signature verification with transformers," *Pattern Recognit.*, vol. 145, p. 109882, 2024.
 [7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," *Mathematics*, vol. 12, no. 17, p. 2757, 2024.
 [8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024.
 [9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021.
 [10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," in *Proc. Electronic Imaging*, 2016.
 [11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," *Multimedia Tools Appl.*, 2024.
 [12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," *Inf. Process. Manage.*, p. 104086, 2025.
 [13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," in *Proc. CVPR*, 2022.
 [14] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," *Pattern Recognit.*, vol. 70, pp. 163–176, 2017.
 [15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," *IEEE Trans. Inf. Forensics Security*, vol. 19, pp. 1342–1356, 2024.
 [16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," *IEEE Trans. Inf. Forensics Security*, vol. 15, pp. 1735–1745, 2019.
 [17] H. Farid, "Image forgery detection," *IEEE Signal Process. Mag.*, vol. 26, no. 2, pp. 16–25, 2009.
 [18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," *Pattern Recognit.*, vol. 144, art. no. 109778, 2023.
 [19] J. Luo et al., "A survey of perceptual hashing for multimedia," *ACM Trans. Multimedia Comput. Commun. Appl.*, vol. 21, no. 7, 2025.
 [20] D. Engin et al., "Offline signature verification on real-world documents," in *Proc. CVPRW*, 2020.
 [21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," *Expert Syst. Appl.*, 2022.
 [22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification," *Procedia Comput. Sci.*, vol. 270, 2025.
 [23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in *Proc. ECCV*, 2014, pp. 584–599.
 [24] Qwen2.5-VL Technical Report, Alibaba Group, 2025.
 [25] Ultralytics, "YOLOv11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/
 [26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. CVPR*, 2016.
 [27] N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013. [Online]. Available: https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
 [28] B. W. Silverman, *Density Estimation for Statistics and Data Analysis*. London: Chapman & Hall, 1986.
 [29] J. Cohen, *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.
 [30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," *IEEE Trans. Image Process.*, vol. 13, no. 4, pp. 600–612, 2004.
 [31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," *The Accounting Review*, vol. 88, no. 5, pp. 1511–1546, 2013.
 [32] A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," *Int. J. Auditing*, vol. 18, no. 3, pp. 172–192, 2014.
 [33] W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," *Contemp. Account. Res.*, vol. 26, no. 2, pp. 359–391, 2009.
 [34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in *Proc. CVPR*, 2016, pp. 779–788.
 [35] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 46, no. 8, pp. 5625–5644, 2024.
 [36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," *Ann. Math. Statist.*, vol. 18, no. 1, pp. 50–60, 1947.
 <!-- Total: 36 references -->
@@ -0,0 +1,77 @@
 # II. Related Work
 ## A. Offline Signature Verification
 Offline signature verification---determining whether a static signature image is genuine or forged---has been studied extensively using deep learning.
 Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant.
 Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work.
 Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining.
 Hadjadj et al. [5] addressed the practical constraint of limited reference samples, achieving competitive verification accuracy using only a single known genuine signature per writer.
 More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results.
 Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives.
 Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer, a property relevant to our setting where CPA signatures span diverse writing styles.
 Hafemann et al. [16] further addressed the practical challenge of adapting to new users through meta-learning, reducing the enrollment burden for signature verification systems.
 A common thread in this literature is the assumption that the primary threat is *identity fraud*: a forger attempting to produce a convincing imitation of another person's signature.
 Our work addresses a fundamentally different problem---detecting whether the *legitimate signer* reused a digital copy of their own signature---which requires analyzing intra-signer similarity distributions rather than modeling inter-signer discriminability.
 Brimoh and Olisah [8] proposed a consensus-threshold approach that derives classification boundaries from known genuine reference pairs, the methodology most closely related to our calibration strategy.
 However, their method operates on standard verification benchmarks with laboratory-collected signatures, whereas our approach applies threshold calibration using a known-replication subpopulation identified through domain expertise in real-world regulatory documents.
 ## B. Document Forensics and Copy Detection
 Image forensics encompasses a broad range of techniques for detecting manipulated visual content [17], with recent surveys highlighting the growing role of deep learning in forgery detection [18].
 Copy-move forgery detection (CMFD) identifies duplicated regions within or across images, typically targeting manipulated photographs [11].
 Abramova and Bohme [10] adapted block-based CMFD to scanned text documents, noting that standard methods perform poorly in this domain because legitimate character repetitions produce high similarity scores that confound duplicate detection.
 Woodruff et al. [9] developed the work most closely related to ours: a fully automated pipeline for extracting and analyzing signatures from corporate filings in the context of anti-money laundering investigations.
 Their system uses connected component analysis for signature detection, GANs for noise removal, and Siamese networks for author clustering.
 While their pipeline shares our goal of large-scale automated signature analysis on real regulatory documents, their objective---grouping signatures by authorship---differs fundamentally from ours, which is detecting digital replication within a single author's signatures across documents.
 In the domain of image copy detection, Pizzi et al. [13] proposed SSCD, a self-supervised descriptor using ResNet-50 with contrastive learning for large-scale copy detection on natural images.
 Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature extraction approach.
 ## C. Perceptual Hashing
 Perceptual hashing algorithms generate compact fingerprints that are robust to minor image transformations while remaining sensitive to substantive content changes [19].
 Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process.
 Jakhar and Borah [12] demonstrated that combining perceptual hashing with deep learning features significantly outperforms either approach alone for near-duplicate image detection, achieving AUROC of 0.99 on standard benchmarks.
 Their two-stage architecture---pHash for fast structural comparison followed by deep features for semantic verification---provides methodological precedent for our dual-method approach, though applied to natural images rather than document signatures.
 Our work differs from prior perceptual hashing studies in its application context and in the specific challenge it addresses: distinguishing legitimate high visual consistency (a careful signer producing similar-looking signatures) from digital duplication (identical pixel content arising from copy-paste operations) in scanned financial documents.
 ## D. Deep Feature Extraction for Signature Analysis
 Several studies have explored pre-trained CNN features for signature comparison without metric learning or Siamese architectures.
 Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing, a pipeline design closely paralleling our approach.
 Tsourounis et al. [21] demonstrated successful transfer from handwritten text recognition to signature verification, showing that CNN features trained on related but distinct handwriting tasks generalize effectively to signature comparison.
 Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature extraction approach.
 Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature comparison approach.
 These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical.
 <!--
 REFERENCES for Related Work (see paper_a_references.md for full list):
 [3] Bromley et al. 1993: Siamese TDNN (NeurIPS)
 [4] Dey et al. 2017: SigNet (arXiv:1707.02131)
 [5] Hadjadj et al. 2020: Single sample SV (Applied Sciences)
 [6] Li et al. 2024: TransOSV (Pattern Recognition)
 [7] Tehsin et al. 2024: Triplet Siamese (Mathematics)
 [8] Brimoh & Olisah 2024: Consensus threshold (arXiv:2401.03085)
 [9] Woodruff et al. 2021: AML signature pipeline (arXiv:2107.14091)
 [10] Abramova & Böhme 2016: CMFD in scanned docs (Electronic Imaging)
 [11] Copy-move forgery detection survey: MTAP 2024
 [12] Jakhar & Borah 2025: pHash + DL (Info. Processing & Management)
 [13] Pizzi et al. 2022: SSCD (CVPR)
 [14] Hafemann et al. 2017: CNN features for signature verification (Pattern Recognition)
 [15] Zois et al. 2024: SPD manifold signature verification (IEEE TIFS)
 [16] Hafemann et al. 2019: Meta-learning for signature verification (IEEE TIFS)
 [17] Farid 2009: Image forgery detection survey (IEEE SPM)
 [18] Mehrjardi et al. 2023: DL-based image forgery detection survey (Pattern Recognition)
 [19] Perceptual hashing survey: ACM TOMM 2025
 [20] Engin et al. 2020: ResNet + cosine on real docs (CVPRW)
 [21] Tsourounis et al. 2022: Transfer from text to signatures (Expert Systems with Applications)
 [22] Chamakh & Bounouh 2025: ResNet18 unified SV (Procedia Computer Science)
 [23] Babenko et al. 2014: Neural codes for image retrieval (ECCV)
 -->
@@ -0,0 +1,153 @@
 # IV. Experiments and Results
 ## A. Experimental Setup
 All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration.
 Feature extraction used PyTorch 2.9 with torchvision model implementations.
 The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
 ## B. Signature Detection Performance
 The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
 We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
 However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), an average of 2.14 signatures per document with at least one detection (182,328 / 85,042), consistent with the standard practice of two certifying CPAs per audit report.
 The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.
 <!-- TABLE III: Extraction Results
 | Metric | Value |
 |--------|-------|
 | Documents processed | 86,071 |
 | Documents with detections | 85,042 (98.8%) |
 | Total signatures extracted | 182,328 |
 | Avg. signatures per document | 2.14 |
 | CPA-matched signatures | 168,755 (92.6%) |
 | Processing rate | 43.1 docs/sec |
 -->
 ## C. Distribution Analysis
 Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs.
 Table IV summarizes the distributional statistics.
 <!-- TABLE IV: Cosine Similarity Distribution Statistics
 | Statistic | Intra-class | Inter-class |
 |-----------|-------------|-------------|
 | N (pairs) | 41,352,824 | 500,000 |
 | Mean | 0.821 | 0.758 |
 | Std. Dev. | 0.098 | 0.090 |
 | Median | 0.836 | 0.774 |
 | Skewness | −0.711 | −0.851 |
 | Kurtosis | 0.550 | 1.027 |
 -->
 Both distributions are left-skewed and leptokurtic.
 Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
 Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived nonparametrically via KDE to avoid distributional assumptions.
 The KDE crossover---where the two density functions intersect---was located at 0.837.
 Under the assumption of equal prior probabilities and equal misclassification costs, this crossover approximates the optimal decision boundary between the two classes.
 We note that this threshold is derived from all-pairs similarity distributions and is used as a reference point for interpreting per-signature best-match scores; the relationship between the two scales is mediated by the fact that the best-match statistic selects the maximum over all pairwise comparisons for a given CPA, producing systematically higher values (see Section IV-D).
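The difference between the all-pairs scale and the per-signature best-match scale can be seen on a toy example. The sketch below is a plain-Python illustration under assumed names (`best_match_scores`, `all_pair_scores`) with made-up three-dimensional feature vectors; the real features are 2048-dimensional.

```python
import math
from itertools import combinations
from statistics import mean
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def all_pair_scores(features: List[Sequence[float]]) -> List[float]:
    """Similarity of every unordered pair of signatures from one CPA."""
    return [cosine(a, b) for a, b in combinations(features, 2)]


def best_match_scores(features: List[Sequence[float]]) -> List[float]:
    """For each signature, the highest similarity to any other signature."""
    n = len(features)
    return [
        max(cosine(features[i], features[j]) for j in range(n) if j != i)
        for i in range(n)
    ]


# Made-up feature vectors for four signatures attributed to one CPA.
features = [
    [1.00, 0.00, 0.20],
    [0.98, 0.05, 0.22],
    [0.30, 0.90, 0.10],
    [0.20, 1.00, 0.30],
]

pairs = all_pair_scores(features)
best = best_match_scores(features)
print(round(mean(pairs), 3), round(mean(best), 3))
```

Each signature's best match is at least as large as the average of its own pairwise scores, so the mean best-match score can never fall below the mean all-pairs score. This is why a threshold derived on the all-pairs scale serves only as a reference point for best-match values.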
 Statistical tests confirmed significant separation between the two distributions (Table V).
 <!-- TABLE V: Statistical Separation Tests
 | Test | Statistic | p-value |
 |------|-----------|---------|
 | Mann-Whitney U | 6.91 × 10⁹ | < 0.001 |
 | Welch's t-test | t = 149.36 | < 0.001 |
 | K-S 2-sample | D = 0.290 | < 0.001 |
 | Cohen's d | 0.669 | — |
 -->
 We emphasize that the pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders p-values unreliable as measures of evidence strength.
 We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
 Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
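As a sanity check, the effect size can be approximately recomputed from the rounded values in Table IV. Which pooling convention the analysis script uses is not stated in this section, so the sketch below shows two common ones: an equal-weight average of the two variances, and the sample-size-weighted pooled variance. On the rounded inputs the equal-weight form gives about 0.67, close to the reported 0.669, while the weighted form gives about 0.64. Function names are hypothetical.

```python
import math


def cohens_d_equal_weight(m1: float, s1: float, m2: float, s2: float) -> float:
    """Cohen's d using the unweighted average of the two variances."""
    return (m1 - m2) / math.sqrt((s1 ** 2 + s2 ** 2) / 2)


def cohens_d_pooled(m1: float, s1: float, n1: int,
                    m2: float, s2: float, n2: int) -> float:
    """Cohen's d using the sample-size-weighted pooled variance."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)


# Rounded values from Table IV (intra-class first, inter-class second).
d_equal = cohens_d_equal_weight(0.821, 0.098, 0.758, 0.090)
d_pooled = cohens_d_pooled(0.821, 0.098, 41_352_824, 0.758, 0.090, 500_000)
print(round(d_equal, 3), round(d_pooled, 3))
```

Both values fall in the conventional medium range, so the qualitative conclusion does not depend on the pooling convention.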
 ## D. Calibration Group Analysis
 Fig. 3 presents the cosine similarity distribution of Firm A (the known-replication reference group) compared to the overall intra-class distribution.
 Firm A comprises 180 CPAs contributing 16.0 million intra-firm signature pairs.
 Its distributional characteristics provide empirical anchors for threshold validation:
 <!-- TABLE VI: Firm A Calibration Statistics (per-signature best match, ResNet-50)
 | Statistic | Firm A | All CPAs |
 |-----------|--------|----------|
 | N (signatures) | 60,448 | 168,740 |
 | Mean | 0.980 | 0.961 |
 | Std. Dev. | 0.019 | 0.029 |
 | Median | 0.986 | — |
 | 1st percentile | 0.908 | — |
 | 5th percentile | 0.941 | — |
 | % > 0.95 | 92.5% | — |
 | % > 0.90 | 99.3% | — |
 -->
 Firm A's per-signature best-match cosine similarity (mean = 0.980, std = 0.019) is notably higher and more concentrated than the overall CPA population (mean = 0.961, std = 0.029).
 Critically, 99.3% of Firm A's signatures exhibit a best-match similarity exceeding 0.90, and the 1st percentile is 0.908---establishing that any threshold set above 0.91 would fail to capture the most dissimilar replicated signatures in the calibration group.
 This concentration provides strong empirical validation for the threshold selection: the KDE crossover at 0.837 captures essentially all of Firm A's signatures (>99.9%), while more conservative thresholds (e.g., 0.95) still capture 92.5%.
 The narrow spread (std = 0.019) further confirms that digital replication produces highly predictable similarity scores, as expected when the same source image is reused across documents with only scan-induced variations.
 ## E. Classification Results
 Table VII presents the classification results for 84,386 documents using the dual-method framework with Firm A-calibrated thresholds.
 <!-- TABLE VII: Recalibrated Classification Results (Dual-Method: Cosine + dHash)
 | Verdict | N (PDFs) | % | Firm A | Firm A % |
 |---------|----------|---|--------|----------|
 | High-confidence replication | 29,529 | 35.0% | 22,970 | 76.0% |
 | Moderate-confidence replication | 36,994 | 43.8% | 6,311 | 20.9% |
 | High style consistency | 5,133 | 6.1% | 183 | 0.6% |
 | Uncertain | 12,683 | 15.0% | 758 | 2.5% |
 | Likely genuine | 47 | 0.1% | 4 | 0.0% |
 -->
 The dual-method classification reveals a nuanced picture within the 71,656 documents exceeding the cosine similarity threshold of 0.95.
 Rather than treating these uniformly as "likely copies" (as a single-metric approach would), the dHash dimension stratifies them into three distinct populations:
 29,529 (41.2%) show converging structural evidence of replication (dHash ≤ 5),
 36,994 (51.6%) show partial structural similarity (dHash 6--15) consistent with replication degraded by scan variations,
 and 5,133 (7.2%) show no structural corroboration (dHash > 15), suggesting high signing consistency rather than digital duplication.
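The five verdicts in Table VII follow from two numbers per document: the best-match cosine similarity and the smallest dHash distance. The sketch below restates that rule as a single function. The thresholds (0.95, 0.837, 5, 15) are the ones used in the recalibration script of this commit; the function name `classify` and the treatment of a missing dHash as "no structural corroboration" mirror that script but are written here only as an illustration.

```python
from typing import Optional

COSINE_HIGH = 0.95      # high-similarity cosine threshold
KDE_CROSSOVER = 0.837   # intra/inter KDE crossover
DHASH_HIGH = 5          # dHash distance bound for high confidence
DHASH_MODERATE = 15     # dHash distance bound for moderate confidence


def classify(cosine: float, dhash: Optional[int]) -> str:
    """Map (cosine similarity, dHash distance) to one of the five verdicts."""
    if cosine > COSINE_HIGH:
        if dhash is not None and dhash <= DHASH_HIGH:
            return "high_confidence_replication"
        if dhash is not None and dhash <= DHASH_MODERATE:
            return "moderate_confidence_replication"
        # High cosine without structural corroboration (or no hash available).
        return "high_style_consistency"
    if cosine > KDE_CROSSOVER:
        return "uncertain"
    return "likely_genuine"


print(classify(0.97, 3))    # high_confidence_replication
print(classify(0.97, 20))   # high_style_consistency
print(classify(0.90, 0))    # uncertain
```

Checking cosine first means the dHash distance only refines documents that already pass the 0.95 threshold; it never promotes a low-cosine document.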
 ### Calibration Validation
 The Firm A column in Table VII validates the calibration: 96.9% of Firm A's documents are classified as replication (high or moderate confidence), and only 0.6% fall into the "high style consistency" category.
 This confirms that the dHash thresholds, derived from Firm A's distributional characteristics (median = 5, 95th percentile = 15), correctly capture the known-replication population.
 Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash ≤ 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
 ## F. Ablation Study: Feature Backbone Comparison
 To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
 All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
 Table IX presents the comparison.
 <!-- TABLE IX: Backbone Comparison
 | Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
 |--------|-----------|--------|-----------------|
 | Feature dim | 2048 | 4096 | 1280 |
 | Intra mean | 0.821 | 0.822 | 0.786 |
 | Inter mean | 0.758 | 0.767 | 0.699 |
 | Cohen's d | 0.669 | 0.564 | 0.707 |
 | KDE crossover | 0.837 | 0.850 | 0.792 |
 | Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
 | Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |
 Note: Firm A values in this table are computed over all intra-firm pairwise
 similarities (16.0M pairs) for cross-backbone comparability. These differ from
 the per-signature best-match values in Table VI (mean = 0.980), which reflect
 the classification-relevant statistic: the similarity of each signature to its
 single closest match from the same CPA.
 -->
 EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
 However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), resulting in lower per-sample classification confidence.
 VGG-16 yields the weakest separation (Cohen's $d$ = 0.564, the lowest of the three) despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.
 ResNet-50 provides the best overall balance:
 (1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
 (2) its tighter distributions yield more reliable individual classifications;
 (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
 (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.
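The "KDE crossover" row in Table IX is presumably the similarity at which the intra-class and inter-class density estimates intersect. A rough, standard-library-only sketch of such a computation follows. It assumes Gaussian kernels, a normal-reference (Silverman-style) bandwidth, and a grid search between the two class means; none of these choices is confirmed by this section, and all function names are hypothetical.

```python
import math
from statistics import mean, stdev


def normal_reference_bandwidth(xs):
    """Rule-of-thumb bandwidth for a Gaussian kernel: 1.06 * sigma * n^(-1/5)."""
    return 1.06 * stdev(xs) * len(xs) ** (-1 / 5)


def gaussian_kde(xs, h):
    """Return a function that evaluates the Gaussian KDE of xs with bandwidth h."""
    norm = 1.0 / (len(xs) * h * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs)

    return density


def kde_crossover(intra, inter, steps=1000):
    """First point between the class means where the two densities cross."""
    f_intra = gaussian_kde(intra, normal_reference_bandwidth(intra))
    f_inter = gaussian_kde(inter, normal_reference_bandwidth(inter))
    lo, hi = sorted((mean(inter), mean(intra)))
    prev_x, prev_diff = lo, f_intra(lo) - f_inter(lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        diff = f_intra(x) - f_inter(x)
        if prev_diff * diff <= 0:  # sign change: densities intersect in [prev_x, x]
            return (prev_x + x) / 2
        prev_x, prev_diff = x, diff
    return None  # no crossing found between the means


# Toy data: two identically shaped samples shifted by 0.3.
intra = [0.80, 0.85, 0.90, 0.95, 1.00]
inter = [0.50, 0.55, 0.60, 0.65, 0.70]
print(kde_crossover(intra, inter))
```

For the toy samples the crossing sits near the midpoint of the two means, as symmetry requires. With millions of pairs, a binned or subsampled estimate would be needed instead of this quadratic-time evaluation.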
@@ -0,0 +1,305 @@
 #!/usr/bin/env python3
 """
 Recalibrate classification using Firm A as ground truth.
 Dual-method only: Cosine + dHash (drops SSIM and pixel-identical).
 Naming note: the hash distance is read from the `phash_distance_to_closest`
 column and carried in variables named `phash`; the printed output labels it
 "pHash (dHash)".
 Approach:
 1. Load per-signature best-match cosine + hash distance from DB
 2. Use Firm A (勤業眾信聯合) as known-positive calibration set
 3. Analyze 2D distribution (cosine × hash distance) for Firm A vs others
 4. Determine calibrated thresholds
 5. Reclassify all PDFs
 6. Output new Table VII
 """
 import sqlite3
 import numpy as np
 from collections import defaultdict
 from pathlib import Path
 import json
 DB_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 OUTPUT_DIR = Path('/Volumes/NV2/PDF-Processing/signature-analysis/recalibrated')
 OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
 FIRM_A = '勤業眾信聯合'
 KDE_CROSSOVER = 0.837  # from intra/inter analysis
 def load_data():
    """Load per-signature data with cosine and pHash."""
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.signature_id, s.image_filename, s.assigned_accountant,
               s.max_similarity_to_same_accountant,
               s.phash_distance_to_closest,
               a.firm
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
        AND s.max_similarity_to_same_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    data = []
    for r in rows:
        data.append({
            'sig_id': r[0],
            'filename': r[1],
            'accountant': r[2],
            'cosine': r[3],
            'phash': r[4],  # may be None
            'firm': r[5],
        })
    print(f"Loaded {len(data):,} signatures")
    return data
 def analyze_firm_a(data):
    """Analyze Firm A's dual-method distribution to calibrate thresholds."""
    firm_a = [d for d in data if d['firm'] == FIRM_A]
    others = [d for d in data if d['firm'] != FIRM_A]
    print(f"\n{'='*60}")
    print(f"FIRM A CALIBRATION ANALYSIS")
    print(f"{'='*60}")
    print(f"Firm A signatures: {len(firm_a):,}")
    print(f"Other signatures:  {len(others):,}")
    # Firm A cosine distribution
    fa_cosine = np.array([d['cosine'] for d in firm_a])
    ot_cosine = np.array([d['cosine'] for d in others])
    print(f"\n--- Cosine Similarity ---")
    print(f"Firm A:  mean={fa_cosine.mean():.4f}, std={fa_cosine.std():.4f}, "
          f"p1={np.percentile(fa_cosine,1):.4f}, p5={np.percentile(fa_cosine,5):.4f}")
    print(f"Others:  mean={ot_cosine.mean():.4f}, std={ot_cosine.std():.4f}")
    # Firm A pHash distribution (only where available)
    fa_phash = [d['phash'] for d in firm_a if d['phash'] is not None]
    ot_phash = [d['phash'] for d in others if d['phash'] is not None]
    print(f"\n--- pHash (dHash) Distance ---")
    print(f"Firm A with pHash: {len(fa_phash):,}")
    print(f"Others with pHash: {len(ot_phash):,}")
    if fa_phash:
        fa_ph = np.array(fa_phash)
        print(f"Firm A:  mean={fa_ph.mean():.2f}, median={np.median(fa_ph):.0f}, "
              f"p95={np.percentile(fa_ph,95):.0f}")
        print(f"  pHash=0:  {(fa_ph==0).sum():,} ({100*(fa_ph==0).mean():.1f}%)")
        print(f"  pHash<=2: {(fa_ph<=2).sum():,} ({100*(fa_ph<=2).mean():.1f}%)")
        print(f"  pHash<=5: {(fa_ph<=5).sum():,} ({100*(fa_ph<=5).mean():.1f}%)")
        print(f"  pHash<=10:{(fa_ph<=10).sum():,} ({100*(fa_ph<=10).mean():.1f}%)")
        print(f"  pHash<=15:{(fa_ph<=15).sum():,} ({100*(fa_ph<=15).mean():.1f}%)")
        print(f"  pHash>15: {(fa_ph>15).sum():,} ({100*(fa_ph>15).mean():.1f}%)")
    if ot_phash:
        ot_ph = np.array(ot_phash)
        print(f"\nOthers:  mean={ot_ph.mean():.2f}, median={np.median(ot_ph):.0f}")
        print(f"  pHash=0:  {(ot_ph==0).sum():,} ({100*(ot_ph==0).mean():.1f}%)")
        print(f"  pHash<=5: {(ot_ph<=5).sum():,} ({100*(ot_ph<=5).mean():.1f}%)")
        print(f"  pHash<=10:{(ot_ph<=10).sum():,} ({100*(ot_ph<=10).mean():.1f}%)")
        print(f"  pHash>15: {(ot_ph>15).sum():,} ({100*(ot_ph>15).mean():.1f}%)")
    # 2D analysis: cosine × pHash for Firm A
    print(f"\n--- 2D Analysis: Cosine × pHash (Firm A) ---")
    fa_both = [(d['cosine'], d['phash']) for d in firm_a if d['phash'] is not None]
    if fa_both:
        cosines, phashes = zip(*fa_both)
        cosines = np.array(cosines)
        phashes = np.array(phashes)
        # Cross-tabulate
        for cos_thresh in [0.95, 0.90, KDE_CROSSOVER]:
            for ph_thresh in [5, 10, 15]:
                match = ((cosines > cos_thresh) & (phashes <= ph_thresh)).sum()
                total = len(cosines)
                print(f"  Cosine>{cos_thresh:.3f} AND pHash<={ph_thresh}: "
                      f"{match:,}/{total:,} ({100*match/total:.1f}%)")
    # Same for others (high cosine subset)
    print(f"\n--- 2D Analysis: Cosine × pHash (Others, cosine > 0.95 only) ---")
    ot_both_high = [(d['cosine'], d['phash']) for d in others
                    if d['phash'] is not None and d['cosine'] > 0.95]
    if ot_both_high:
        cosines_o, phashes_o = zip(*ot_both_high)
        phashes_o = np.array(phashes_o)
        print(f"  N (others with cosine>0.95 and pHash): {len(ot_both_high):,}")
        for ph_thresh in [5, 10, 15]:
            match = (phashes_o <= ph_thresh).sum()
            print(f"  pHash<={ph_thresh}: {match:,}/{len(phashes_o):,} ({100*match/len(phashes_o):.1f}%)")
    return fa_phash, ot_phash
 def reclassify_pdfs(data):
    """
    Reclassify all PDFs using calibrated dual-method thresholds.
    New classification (cosine + dHash only):
    1. High-confidence replication: cosine > 0.95 AND pHash ≤ 5
    2. Moderate-confidence replication: cosine > 0.95 AND pHash 6-15
    3. High style consistency: cosine > 0.95 AND (pHash > 15 OR pHash unavailable)
    4. Uncertain: cosine between KDE_CROSSOVER and 0.95
    5. Likely genuine: cosine < KDE_CROSSOVER
    """
     # Group signatures by source PDF. Signature filenames look like
     # {pdfname}_page{N}_sig{M}.png, so the PDF key is the filename with the
     # _sig and _page suffixes stripped.
     # `data` comes from load_data(), which already ran the same query, so the
     # database is not queried a second time here.
     pdf_sigs = defaultdict(list)
     for d in data:
         filename = d['filename']
         # Extract PDF name (everything before _page or _sig)
         parts = filename.rsplit('_sig', 1)
         pdf_key = parts[0] if len(parts) > 1 else filename.rsplit('.', 1)[0]
         # Further strip _page part
         page_parts = pdf_key.rsplit('_page', 1)
         pdf_key = page_parts[0] if len(page_parts) > 1 else pdf_key
         pdf_sigs[pdf_key].append({
             'cosine': d['cosine'],
             'phash': d['phash'],
             'firm': d['firm'],
             'accountant': d['accountant'],
         })
    print(f"\n{'='*60}")
    print(f"RECLASSIFICATION (Dual-Method: Cosine + dHash)")
    print(f"{'='*60}")
    print(f"Total PDFs: {len(pdf_sigs):,}")
    # Classify each PDF based on its signatures
    verdicts = defaultdict(int)
    firm_a_verdicts = defaultdict(int)
    details = []
    for pdf_key, sigs in pdf_sigs.items():
        # Use the signature with the highest cosine as the representative
        best_sig = max(sigs, key=lambda s: s['cosine'])
        cosine = best_sig['cosine']
        phash = best_sig['phash']
        is_firm_a = best_sig['firm'] == FIRM_A
        # Also check if ANY signature in this PDF has low pHash
        min_phash = None
        for s in sigs:
            if s['phash'] is not None:
                if min_phash is None or s['phash'] < min_phash:
                    min_phash = s['phash']
        # Classification
        if cosine > 0.95 and min_phash is not None and min_phash <= 5:
            verdict = 'high_confidence_replication'
        elif cosine > 0.95 and min_phash is not None and min_phash <= 15:
            verdict = 'moderate_confidence_replication'
        elif cosine > 0.95:
            verdict = 'high_style_consistency'
        elif cosine > KDE_CROSSOVER:
            verdict = 'uncertain'
        else:
            verdict = 'likely_genuine'
        verdicts[verdict] += 1
        if is_firm_a:
            firm_a_verdicts[verdict] += 1
        details.append({
            'pdf': pdf_key,
            'cosine': cosine,
            'min_phash': min_phash,
            'verdict': verdict,
            'is_firm_a': is_firm_a,
        })
    total = sum(verdicts.values())
    firm_a_total = sum(firm_a_verdicts.values())
    # Print results
    print(f"\n--- New Classification Results ---")
    print(f"{'Verdict':<35} {'Count':>8} {'%':>7}  |  {'Firm A':>8} {'%':>7}")
    print("-" * 75)
    order = ['high_confidence_replication', 'moderate_confidence_replication',
             'high_style_consistency', 'uncertain', 'likely_genuine']
    labels = {
        'high_confidence_replication': 'High-conf. replication',
        'moderate_confidence_replication': 'Moderate-conf. replication',
        'high_style_consistency': 'High style consistency',
        'uncertain': 'Uncertain',
        'likely_genuine': 'Likely genuine',
    }
    for v in order:
        n = verdicts.get(v, 0)
        fa = firm_a_verdicts.get(v, 0)
        pct = 100 * n / total if total > 0 else 0
        fa_pct = 100 * fa / firm_a_total if firm_a_total > 0 else 0
        print(f"  {labels.get(v, v):<33} {n:>8,} {pct:>6.1f}%  |  {fa:>8,} {fa_pct:>6.1f}%")
    print("-" * 75)
    print(f"  {'Total':<33} {total:>8,} {'100.0%':>7}  |  {firm_a_total:>8,} {'100.0%':>7}")
     # Capture rate (recall) with Firm A as the known-positive set; precision is not computed here
     print(f"\n--- Firm A Capture Rate (Calibration Validation) ---")
     fa_replication = firm_a_verdicts.get('high_confidence_replication', 0) + \
                      firm_a_verdicts.get('moderate_confidence_replication', 0)
     fa_high = firm_a_verdicts.get('high_confidence_replication', 0)
     if firm_a_total > 0:  # guard against division by zero when no Firm A PDFs are loaded
         print(f"  Firm A classified as replication (high+moderate): {fa_replication:,}/{firm_a_total:,} "
               f"({100*fa_replication/firm_a_total:.1f}%)")
         print(f"  Firm A classified as high-confidence: {fa_high:,}/{firm_a_total:,} "
               f"({100*fa_high/firm_a_total:.1f}%)")
     else:
         print("  No Firm A PDFs found; capture rate is undefined.")
    # Save results
    results = {
        'classification': {v: verdicts.get(v, 0) for v in order},
        'firm_a': {v: firm_a_verdicts.get(v, 0) for v in order},
        'total_pdfs': total,
        'firm_a_pdfs': firm_a_total,
        'thresholds': {
            'cosine_high': 0.95,
            'kde_crossover': KDE_CROSSOVER,
            'phash_high_confidence': 5,
            'phash_moderate_confidence': 15,
        },
    }
    with open(OUTPUT_DIR / 'recalibrated_results.json', 'w') as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved: {OUTPUT_DIR / 'recalibrated_results.json'}")
    return results
 def main():
    data = load_data()
    analyze_firm_a(data)
    results = reclassify_pdfs(data)
 if __name__ == "__main__":
    main()
@@ -0,0 +1,195 @@
 #!/usr/bin/env python3
 """
 Renumber all in-text citations to sequential order by first appearance.
 Also rewrites paper_a_references.md with the final numbering.
 """
 import re
 from pathlib import Path
 PAPER_DIR = Path("/Volumes/NV2/pdf_recognize/paper")
 # === FINAL NUMBERING (by order of first appearance in paper) ===
 # Format: new_number: (short_key, full_citation)
 FINAL_REFS = {
    1:  ("cpa_act", 'Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067'),
    2:  ("yen2013", 'S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," *Res. Account. Regul.*, vol. 25, no. 2, pp. 230–235, 2013.'),
    3:  ("bromley1993", 'J. Bromley et al., "Signature verification using a Siamese time delay neural network," in *Proc. NeurIPS*, 1993.'),
    4:  ("dey2017", 'S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.'),
    5:  ("hadjadj2020", 'I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," *Appl. Sci.*, vol. 10, no. 11, p. 3716, 2020.'),
    6:  ("li2024", 'H. Li et al., "TransOSV: Offline signature verification with transformers," *Pattern Recognit.*, vol. 145, p. 109882, 2024.'),
    7:  ("tehsin2024", 'S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," *Mathematics*, vol. 12, no. 17, p. 2757, 2024.'),
    8:  ("brimoh2024", 'P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024.'),
    9:  ("woodruff2021", 'N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021.'),
    10: ("abramova2016", 'S. Abramova and R. Bohme, "Detecting copy-move forgeries in scanned text documents," in *Proc. Electronic Imaging*, 2016.'),
    11: ("cmfd_survey", 'Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," *Multimedia Tools Appl.*, 2024.'),
    12: ("jakhar2025", 'Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," *Inf. Process. Manage.*, p. 104086, 2025.'),
    13: ("pizzi2022", 'E. Pizzi et al., "A self-supervised descriptor for image copy detection," in *Proc. CVPR*, 2022.'),
    14: ("hafemann2017", 'L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," *Pattern Recognit.*, vol. 70, pp. 163–176, 2017.'),
    15: ("zois2024", 'E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," *IEEE Trans. Inf. Forensics Security*, vol. 19, pp. 1342–1356, 2024.'),
    16: ("hafemann2019", 'L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," *IEEE Trans. Inf. Forensics Security*, vol. 15, pp. 1735–1745, 2019.'),
    17: ("farid2009", 'H. Farid, "Image forgery detection," *IEEE Signal Process. Mag.*, vol. 26, no. 2, pp. 16–25, 2009.'),
    18: ("mehrjardi2023", 'F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," *Pattern Recognit.*, vol. 144, art. no. 109778, 2023.'),
    19: ("phash_survey", 'J. Luo et al., "A survey of perceptual hashing for multimedia," *ACM Trans. Multimedia Comput. Commun. Appl.*, vol. 21, no. 7, 2025.'),
    20: ("engin2020", 'D. Engin et al., "Offline signature verification on real-world documents," in *Proc. CVPRW*, 2020.'),
    21: ("tsourounis2022", 'D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," *Expert Syst. Appl.*, 2022.'),
    22: ("chamakh2025", 'B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification," *Procedia Comput. Sci.*, vol. 270, 2025.'),
    23: ("babenko2014", 'A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in *Proc. ECCV*, 2014, pp. 584–599.'),
    24: ("qwen2025", 'Qwen2.5-VL Technical Report, Alibaba Group, 2025.'),
    25: ("yolov11", 'Ultralytics, "YOLOv11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/'),
    26: ("he2016", 'K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. CVPR*, 2016.'),
    27: ("krawetz2013", 'N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013. [Online]. Available: https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html'),
    28: ("silverman1986", 'B. W. Silverman, *Density Estimation for Statistics and Data Analysis*. London: Chapman & Hall, 1986.'),
    29: ("cohen1988", 'J. Cohen, *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.'),
    30: ("wang2004", 'Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," *IEEE Trans. Image Process.*, vol. 13, no. 4, pp. 600–612, 2004.'),
    31: ("carcello2013", 'J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," *The Accounting Review*, vol. 88, no. 5, pp. 1511–1546, 2013.'),
    32: ("blay2014", 'A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," *Int. J. Auditing*, vol. 18, no. 3, pp. 172–192, 2014.'),
    33: ("chi2009", 'W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," *Contemp. Account. Res.*, vol. 26, no. 2, pp. 359–391, 2009.'),
    34: ("redmon2016", 'J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in *Proc. CVPR*, 2016, pp. 779–788.'),
    35: ("vlm_survey", 'J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 46, no. 8, pp. 5625–5644, 2024.'),
    36: ("mann1947", 'H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," *Ann. Math. Statist.*, vol. 18, no. 1, pp. 50–60, 1947.'),
 }
 # === LINE-SPECIFIC REPLACEMENTS PER FILE ===
 # Each entry: (unique_context_string, old_text, new_text)
 INTRO_FIXES = [
    # Line 16: SV range should start at [3] not [2] (since [2] is Yen)
    ("offline signature verification [2]--[7]",
     "offline signature verification [2]--[7]",
     "offline signature verification [3]--[8]"),
    # Line 23: Woodruff
    ("Woodruff et al. [8]",
     "Woodruff et al. [8]",
     "Woodruff et al. [9]"),
    # Line 24: CMFD refs
    ("Copy-move forgery detection methods [9], [10]",
     "methods [9], [10]",
     "methods [10], [11]"),
    # Line 25: pHash+DL refs
    ("perceptual hashing combined with deep learning [11], [12]",
     "deep learning [11], [12]",
     "deep learning [12], [13]"),
    # Line 28: pHash -> dHash in pipeline description
    ("perceptual hash (pHash) distance",
     "perceptual hash (pHash) distance",
     "difference hash (dHash) distance"),
 ]
 RW_FIXES = [
    # Line 7: Hafemann 2017
    ("Hafemann et al. [24]", "et al. [24]", "et al. [14]"),
    # Line 12: Zois
    ("Zois et al. [26]", "et al. [26]", "et al. [15]"),
    # Line 13: Hafemann 2019
    ("Hafemann et al. [25]", "et al. [25]", "et al. [16]"),
    # Line 18: Brimoh (wrongly [7], should be [8])
    ("Brimoh and Olisah [7]", "Olisah [7]", "Olisah [8]"),
    # Line 23: Farid
    ("manipulated visual content [27]", "content [27]", "content [17]"),
    # Line 23: Mehrjardi
    ("forgery detection [28]", "detection [28]", "detection [18]"),
    # Line 24: CMFD survey
    ("manipulated photographs [10]", "photographs [10]", "photographs [11]"),
    # Line 25: Abramova (was [11], should be [10])
    ("Abramova and Bohme [11]", "Bohme [11]", "Bohme [10]"),
    # Line 27: Woodruff (was [8], should be [9])
    ("Woodruff et al. [8]", "et al. [8]", "et al. [9]"),
    # Line 31: Pizzi (was [12], should be [13])
    ("Pizzi et al. [12]", "et al. [12]", "et al. [13]"),
    # Line 36: pHash survey (was [13], should be [19])
    ("substantive content changes [13]", "changes [13]", "changes [19]"),
    # Line 39: Jakhar (was [11], should be [12])
    ("Jakhar and Borah [11]", "Borah [11]", "Borah [12]"),
    # Line 47: Engin (was [14], should be [20])
    ("Engin et al. [14]", "et al. [14]", "et al. [20]"),
    # Line 48: Tsourounis (was [15], should be [21])
    ("Tsourounis et al. [15]", "et al. [15]", "et al. [21]"),
    # Line 49: Chamakh (was [16], should be [22])
    ("Chamakh and Bounouh [16]", "Bounouh [16]", "Bounouh [22]"),
    # Line 51: Babenko (was [29], should be [23])
    ("Babenko et al. [29]", "et al. [29]", "et al. [23]"),
 ]
 METH_FIXES = [
    # Line 40: Qwen (was [17], should be [24])
    ("parameters) [17]", ") [17]", ") [24]"),
    # Line 53: YOLO (was [18], should be [25])
    ("(nano variant) [18]", "variant) [18]", "variant) [25]"),
    # Line 75: ResNet (was [19], should be [26])
    ("neural network [19]", "network [19]", "network [26]"),
    # Line 81: Engin, Tsourounis (was [14], [15], should be [20], [21])
    ("document analysis tasks [14], [15]",
     "tasks [14], [15]",
     "tasks [20], [21]"),
    # Line 98: Krawetz dHash (was [36], should be [27])
    ("(dHash) [36]", ") [36]", ") [27]"),
    # Line 101: pHash survey ref (was [14], should be [19])
    ("scan-induced variations [14]",
     "variations [14]",
     "variations [19]"),
    # Line 122: Silverman KDE (was [33], should be [28])
    ("(KDE) [33]", ") [33]", ") [28]"),
 ]
 RESULTS_FIXES = [
    # Cohen's d citation (was [34], should be [29])
    ("effect size [34]", "size [34]", "size [29]"),
 ]
 DISCUSSION_FIXES = [
    # Engin/Tsourounis/Chamakh range (was [14]--[16], should be [20]--[22])
    ("prior literature [14]--[16]",
     "literature [14]--[16]",
     "literature [20]--[22]"),
 ]
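 def apply_fixes(filepath, fixes):
     """Apply (context, old, new) fixes, confining each replacement to its context string."""
     text = filepath.read_text(encoding='utf-8')
     changes = 0
     for context, old, new in fixes:
         if context in text:
             # Replace `old` only inside the matched context. A bare
             # text.replace(old, new, 1) hits the first `old` anywhere in the
             # file, including one written by an earlier fix (e.g. "et al. [14]"
             # produced by the Hafemann [24] -> [14] fix would then be turned
             # into [20] by the Engin fix).
             text = text.replace(context, context.replace(old, new, 1), 1)
             changes += 1
         else:
             print(f"  WARNING: context not found in {filepath.name}: {context[:60]}...")
     filepath.write_text(text, encoding='utf-8')
     print(f"  {filepath.name}: {changes} fixes applied")
     return changes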
 def rewrite_references():
    """Rewrite references.md with final sequential numbering."""
    lines = ["# References\n\n"]
    lines.append("<!-- IEEE numbered style, sequential by first appearance in text -->\n\n")
    for num, (key, citation) in sorted(FINAL_REFS.items()):
        lines.append(f"[{num}] {citation}\n\n")
    lines.append(f"<!-- Total: {len(FINAL_REFS)} references -->\n")
    ref_path = PAPER_DIR / "paper_a_references.md"
    ref_path.write_text("".join(lines), encoding='utf-8')
    print(f"  paper_a_references.md: rewritten with {len(FINAL_REFS)} references")
 def main():
    print("Renumbering citations...\n")
    total = 0
    total += apply_fixes(PAPER_DIR / "paper_a_introduction.md", INTRO_FIXES)
    total += apply_fixes(PAPER_DIR / "paper_a_related_work.md", RW_FIXES)
    total += apply_fixes(PAPER_DIR / "paper_a_methodology.md", METH_FIXES)
    total += apply_fixes(PAPER_DIR / "paper_a_results.md", RESULTS_FIXES)
    total += apply_fixes(PAPER_DIR / "paper_a_discussion.md", DISCUSSION_FIXES)
    print(f"\nTotal fixes: {total}")
    print("\nRewriting references.md...")
    rewrite_references()
    print("\nDone! Verify with: grep -n '\\[.*\\]' paper/paper_a_*.md")
 if __name__ == "__main__":
    main()
@@ -0,0 +1,17 @@
 PaddleOCR v2.7.3 (v4) full pipeline test results
 ============================================================
 1. OCR detection: 14 text regions
 2. Mask printed text: done
 3. Candidate regions detected: 4
 4. Signatures extracted: 4
 Candidate region details:
 ------------------------------------------------------------
 Region 1: position (1211, 1462), size 965x191, area=184315
 Region 2: position (1215, 877), size 1150x511, area=587650
 Region 3: position (332, 150), size 197x96, area=18912
 Region 4: position (1147, 3303), size 159x42, area=6678
 All results saved to: /Volumes/NV2/pdf_recognize/signature-comparison/v4-current
@@ -0,0 +1,20 @@
 PP-OCRv5 full pipeline test results
 ============================================================
 1. OCR detection: 50 text regions
 2. Mask printed text: /Volumes/NV2/pdf_recognize/test_results/v5_pipeline/01_masked.png
 3. Candidate regions detected: 7
 4. Signatures extracted: 7
 Candidate region details:
 ------------------------------------------------------------
 Region 1: position (1218, 877), size 1144x511, area=584584
 Region 2: position (1213, 1457), size 961x196, area=188356
 Region 3: position (228, 386), size 2028x209, area=423852
 Region 4: position (330, 310), size 1932x63, area=121716
 Region 5: position (1990, 945), size 375x212, area=79500
 Region 6: position (327, 145), size 203x101, area=20503
 Region 7: position (1139, 3289), size 174x63, area=10962
 All results saved to: /Volumes/NV2/pdf_recognize/test_results/v5_pipeline
@@ -0,0 +1,246 @@
 #!/usr/bin/env python3
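 """
 Step 1: Create the SQLite database and import signature records
 Import data from extraction_results.csv, expanding each image into its own record
 Parse image filenames to populate year_month and sig_index
 Compute image dimensions width and height
 """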
 import sqlite3
 import pandas as pd
 import cv2
 import os
 import re
 from pathlib import Path
 from tqdm import tqdm
 from concurrent.futures import ThreadPoolExecutor, as_completed
 # Path configuration
 IMAGES_DIR = Path("/Volumes/NV2/PDF-Processing/yolo-signatures/images")
 CSV_PATH = Path("/Volumes/NV2/PDF-Processing/yolo-signatures/extraction_results.csv")
 OUTPUT_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis")
 DB_PATH = OUTPUT_DIR / "signature_analysis.db"
 def parse_image_filename(filename: str) -> dict:
     """
     Parse an image filename and extract structured information
     Example: 201301_2458_AI1_page4_sig1.png
     """
     # Remove the .png extension
     name = filename.replace('.png', '')
     # Pattern: {YYYYMM}_{SERIAL}_{DOCTYPE}_page{PAGE}_sig{N}
    match = re.match(r'^(\d{6})_([^_]+)_([^_]+)_page(\d+)_sig(\d+)$', name)
    if match:
        year_month, serial, doc_type, page, sig_index = match.groups()
        return {
            'year_month': year_month,
            'serial_number': serial,
            'doc_type': doc_type,
            'page_number': int(page),
            'sig_index': int(sig_index)
        }
    else:
         # Return None for every field when the filename cannot be parsed
        return {
            'year_month': None,
            'serial_number': None,
            'doc_type': None,
            'page_number': None,
            'sig_index': None
        }
 def get_image_dimensions(image_path: Path) -> tuple:
     """Read image dimensions; returns (width, height), or (None, None) on failure"""
    try:
        img = cv2.imread(str(image_path))
        if img is not None:
            h, w = img.shape[:2]
            return w, h
        return None, None
    except Exception:
        return None, None
 def process_single_image(args: tuple) -> dict:
     """Process a single image and return its data record"""
     image_filename, source_pdf, confidence_avg = args
     # Parse the filename
     parsed = parse_image_filename(image_filename)
     # Get the image dimensions
    image_path = IMAGES_DIR / image_filename
    width, height = get_image_dimensions(image_path)
    return {
        'image_filename': image_filename,
        'source_pdf': source_pdf,
        'year_month': parsed['year_month'],
        'serial_number': parsed['serial_number'],
        'doc_type': parsed['doc_type'],
        'page_number': parsed['page_number'],
        'sig_index': parsed['sig_index'],
        'detection_confidence': confidence_avg,
        'image_width': width,
        'image_height': height
    }
 def create_database():
     """Create the database schema"""
     OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
     conn = sqlite3.connect(DB_PATH)
     cursor = conn.cursor()
     # Create the signatures table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS signatures (
            signature_id INTEGER PRIMARY KEY AUTOINCREMENT,
            image_filename TEXT UNIQUE NOT NULL,
            source_pdf TEXT NOT NULL,
            year_month TEXT,
            serial_number TEXT,
            doc_type TEXT,
            page_number INTEGER,
            sig_index INTEGER,
            detection_confidence REAL,
            image_width INTEGER,
            image_height INTEGER,
            accountant_name TEXT,
            accountant_id INTEGER,
            feature_vector BLOB,
            cluster_id INTEGER,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')
    # 建立索引
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_source_pdf ON signatures(source_pdf)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_year_month ON signatures(year_month)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_accountant_id ON signatures(accountant_id)')
    conn.commit()
    conn.close()
    print(f"資料庫已建立: {DB_PATH}")
 def expand_csv_to_records(csv_path: Path) -> list:
    """
    Expand the CSV into one record per signature image.
    CSV columns: filename,page,num_signatures,confidence_avg,image_files
    The comma-separated image_files field is split into one record per image.
    """
    df = pd.read_csv(csv_path)
    records = []
    for _, row in df.iterrows():
        source_pdf = row['filename']
        confidence_avg = row['confidence_avg']
        image_files_str = row['image_files']
        # 展開 image_files（逗號分隔）
        if pd.notna(image_files_str):
            image_files = [f.strip() for f in image_files_str.split(',')]
            for img_file in image_files:
                records.append((img_file, source_pdf, confidence_avg))
    return records
 def import_data():
    """匯入資料到資料庫"""
    print("讀取 CSV 並展開記錄...")
    records = expand_csv_to_records(CSV_PATH)
    print(f"共 {len(records)} 張簽名圖片待處理")
    print("處理圖片資訊（讀取尺寸）...")
    processed_records = []
    # 使用多執行緒加速圖片尺寸讀取
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = {executor.submit(process_single_image, r): r for r in records}
        for future in tqdm(as_completed(futures), total=len(records), desc="處理圖片"):
            result = future.result()
            processed_records.append(result)
    print("寫入資料庫...")
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    # 批次插入
    insert_sql = '''
        INSERT OR IGNORE INTO signatures (
            image_filename, source_pdf, year_month, serial_number, doc_type,
            page_number, sig_index, detection_confidence, image_width, image_height
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    '''
    batch_data = [
        (
            r['image_filename'], r['source_pdf'], r['year_month'], r['serial_number'],
            r['doc_type'], r['page_number'], r['sig_index'], r['detection_confidence'],
            r['image_width'], r['image_height']
        )
        for r in processed_records
    ]
    cursor.executemany(insert_sql, batch_data)
    conn.commit()
    # 統計結果
    cursor.execute('SELECT COUNT(*) FROM signatures')
    total = cursor.fetchone()[0]
    cursor.execute('SELECT COUNT(DISTINCT source_pdf) FROM signatures')
    pdf_count = cursor.fetchone()[0]
    cursor.execute('SELECT COUNT(DISTINCT year_month) FROM signatures')
    period_count = cursor.fetchone()[0]
    cursor.execute('SELECT MIN(year_month), MAX(year_month) FROM signatures')
    min_date, max_date = cursor.fetchone()
    conn.close()
    print("\n" + "=" * 50)
    print("資料庫建立完成")
    print("=" * 50)
    print(f"簽名總數: {total:,}")
    print(f"PDF 檔案數: {pdf_count:,}")
    print(f"時間範圍: {min_date} ~ {max_date} ({period_count} 個月)")
    print(f"資料庫位置: {DB_PATH}")
 def main():
    print("=" * 50)
    print("Step 1: 建立簽名分析資料庫")
    print("=" * 50)
    # 檢查來源檔案
    if not CSV_PATH.exists():
        print(f"錯誤: 找不到 CSV 檔案 {CSV_PATH}")
        return
    if not IMAGES_DIR.exists():
        print(f"錯誤: 找不到圖片目錄 {IMAGES_DIR}")
        return
    # 建立資料庫
    create_database()
    # 匯入資料
    import_data()
 if __name__ == "__main__":
    main()
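The CSV expansion and filename parsing performed by Step 1 can be sketched with the standard library alone. This is a minimal illustration, not the script's own implementation: `FILENAME_RE`, `expand_rows`, `parse_name`, and the sample rows are hypothetical, and the pattern assumes names shaped like `201301_2458_AI1_page4_sig1.png`.

```python
import csv
import io
import re

# Hypothetical pattern, assuming names like 201301_2458_AI1_page4_sig1.png
FILENAME_RE = re.compile(
    r"^(?P<year_month>\d{6})_(?P<serial_number>[^_]+)_(?P<doc_type>[^_]+)"
    r"_page(?P<page_number>\d+)_sig(?P<sig_index>\d+)\.png$"
)

# Made-up sample data in the CSV layout described by expand_csv_to_records
SAMPLE_CSV = (
    "filename,page,num_signatures,confidence_avg,image_files\n"
    '201301_2458_AI1.pdf,4,2,0.91,'
    '"201301_2458_AI1_page4_sig1.png, 201301_2458_AI1_page4_sig2.png"\n'
    "201301_9999_AI1.pdf,3,0,0.0,\n"
)


def expand_rows(csv_text):
    """Yield one (image_file, source_pdf, confidence_avg) tuple per image."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        image_files = (row.get("image_files") or "").strip()
        if not image_files:
            continue  # page without extracted signatures
        for name in image_files.split(","):
            name = name.strip()
            if name:
                yield name, row["filename"], float(row["confidence_avg"])


def parse_name(name):
    """Return the parsed fields as a dict, or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["page_number"] = int(fields["page_number"])
    fields["sig_index"] = int(fields["sig_index"])
    return fields


records = list(expand_rows(SAMPLE_CSV))
```

Returning `None` for an unparseable name is a design choice of this sketch; the script itself returns a record whose fields are all `None`, which keeps the database insert uniform.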
@@ -0,0 +1,241 @@
 #!/usr/bin/env python3
 """
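 Step 2: Extract feature vectors from signature images with ResNet-50
 Preprocessing pipeline:
 1. Load the image (RGB)
 2. Resize to 224x224 (aspect ratio preserved, padded with white)
 3. Normalize (ImageNet mean/std)
 4. Run through ResNet-50 (final classification layer removed)
 5. L2-normalize
 6. Output a 2048-dimensional feature vector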
 """
 import torch
 import torch.nn as nn
 import torchvision.models as models
 import torchvision.transforms as transforms
 from torch.utils.data import Dataset, DataLoader
 import numpy as np
 import cv2
 import sqlite3
 from pathlib import Path
 from tqdm import tqdm
 import warnings
 warnings.filterwarnings('ignore')
 # 路徑配置
 IMAGES_DIR = Path("/Volumes/NV2/PDF-Processing/yolo-signatures/images")
 OUTPUT_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis")
 DB_PATH = OUTPUT_DIR / "signature_analysis.db"
 FEATURES_PATH = OUTPUT_DIR / "features"
 # 模型配置
 BATCH_SIZE = 64
 NUM_WORKERS = 4
 DEVICE = torch.device("mps" if torch.backends.mps.is_available() else
                      "cuda" if torch.cuda.is_available() else "cpu")
 class SignatureDataset(Dataset):
    """簽名圖片資料集"""
    def __init__(self, image_paths: list, transform=None):
        self.image_paths = image_paths
        self.transform = transform
    def __len__(self):
        return len(self.image_paths)
    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        # 載入圖片
        img = cv2.imread(str(img_path))
        if img is None:
            # 如果讀取失敗，返回白色圖片
            img = np.ones((224, 224, 3), dtype=np.uint8) * 255
        else:
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        # 調整大小（保持比例，填充白色）
        img = self.resize_with_padding(img, 224, 224)
        if self.transform:
            img = self.transform(img)
        return img, str(img_path.name)
    @staticmethod
    def resize_with_padding(img, target_w, target_h):
        """調整大小並填充白色以保持比例"""
        h, w = img.shape[:2]
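        # Compute the scale factor; clamp to 1 px so extreme aspect ratios
        # never hand cv2.resize a zero-sized target
        scale = min(target_w / w, target_h / h)
        new_w = max(1, int(w * scale))
        new_h = max(1, int(h * scale))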
        # 縮放
        resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_AREA)
        # 建立白色畫布
        canvas = np.ones((target_h, target_w, 3), dtype=np.uint8) * 255
        # 置中貼上
        x_offset = (target_w - new_w) // 2
        y_offset = (target_h - new_h) // 2
        canvas[y_offset:y_offset+new_h, x_offset:x_offset+new_w] = resized
        return canvas
 class FeatureExtractor:
    """特徵提取器"""
    def __init__(self, device):
        self.device = device
        # 載入預訓練 ResNet-50
        print(f"載入 ResNet-50 模型... (device: {device})")
        self.model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        # 移除最後的分類層，保留特徵
        self.model = nn.Sequential(*list(self.model.children())[:-1])
        self.model = self.model.to(device)
        self.model.eval()
        # ImageNet 正規化
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        ])
    @torch.no_grad()
    def extract_batch(self, images):
        """提取一批圖片的特徵"""
        images = images.to(self.device)
        features = self.model(images)
        features = features.squeeze(-1).squeeze(-1)  # [B, 2048]
        # L2 正規化
        features = nn.functional.normalize(features, p=2, dim=1)
        return features.cpu().numpy()
 def get_image_list_from_db():
    """從資料庫取得所有圖片檔名"""
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    cursor.execute('SELECT image_filename FROM signatures ORDER BY signature_id')
    filenames = [row[0] for row in cursor.fetchall()]
    conn.close()
    return filenames
 def save_features_to_db(features_dict: dict):
    """將特徵向量存入資料庫"""
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    for filename, feature in tqdm(features_dict.items(), desc="寫入資料庫"):
        cursor.execute('''
            UPDATE signatures
            SET feature_vector = ?
            WHERE image_filename = ?
        ''', (feature.tobytes(), filename))
    conn.commit()
    conn.close()
 def main():
    print("=" * 60)
    print("Step 2: ResNet-50 特徵向量提取")
    print("=" * 60)
    print(f"裝置: {DEVICE}")
    # 確保輸出目錄存在
    FEATURES_PATH.mkdir(parents=True, exist_ok=True)
    # 從資料庫取得圖片列表
    print("從資料庫讀取圖片列表...")
    filenames = get_image_list_from_db()
    print(f"共 {len(filenames):,} 張圖片待處理")
    # 建立圖片路徑列表
    image_paths = [IMAGES_DIR / f for f in filenames]
    # 初始化特徵提取器
    extractor = FeatureExtractor(DEVICE)
    # 建立資料集和載入器
    dataset = SignatureDataset(image_paths, transform=extractor.transform)
    dataloader = DataLoader(
        dataset,
        batch_size=BATCH_SIZE,
        shuffle=False,
        num_workers=NUM_WORKERS,
        pin_memory=True
    )
    # 提取特徵
    print(f"\n開始提取特徵 (batch_size={BATCH_SIZE})...")
    all_features = []
    all_filenames = []
    for images, batch_filenames in tqdm(dataloader, desc="提取特徵"):
        features = extractor.extract_batch(images)
        all_features.append(features)
        all_filenames.extend(batch_filenames)
    # 合併所有特徵
    all_features = np.vstack(all_features)
    print(f"\n特徵矩陣形狀: {all_features.shape}")
    # 儲存為 numpy 檔案（備份）
    npy_path = FEATURES_PATH / "signature_features.npy"
    np.save(npy_path, all_features)
    print(f"特徵向量已儲存: {npy_path} ({all_features.nbytes / 1e9:.2f} GB)")
    # 儲存檔名對應（用於後續索引）
    filenames_path = FEATURES_PATH / "signature_filenames.txt"
    with open(filenames_path, 'w') as f:
        for fn in all_filenames:
            f.write(fn + '\n')
    print(f"檔名列表已儲存: {filenames_path}")
    # 更新資料庫
    print("\n更新資料庫中的特徵向量...")
    features_dict = dict(zip(all_filenames, all_features))
    save_features_to_db(features_dict)
    # 統計
    print("\n" + "=" * 60)
    print("特徵提取完成")
    print("=" * 60)
    print(f"處理圖片數: {len(all_filenames):,}")
    print(f"特徵維度: {all_features.shape[1]}")
    print(f"特徵檔案: {npy_path}")
    print(f"檔案大小: {all_features.nbytes / 1e9:.2f} GB")
    # 簡單驗證
    print("\n特徵統計:")
    print(f"  平均值: {all_features.mean():.6f}")
    print(f"  標準差: {all_features.std():.6f}")
    print(f"  最小值: {all_features.min():.6f}")
    print(f"  最大值: {all_features.max():.6f}")
    # L2 norm 驗證（應該都是 1.0）
    norms = np.linalg.norm(all_features, axis=1)
    print(f"  L2 norm: {norms.mean():.6f} ± {norms.std():.6f}")
 if __name__ == "__main__":
    main()
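The aspect-preserving resize used in Step 2 comes down to a little arithmetic, sketched here without OpenCV. `letterbox_geometry` is a hypothetical helper name, and the 1-pixel clamp is an assumption added for very elongated crops, where truncation would otherwise produce a zero-sized target.

```python
def letterbox_geometry(w, h, target_w=224, target_h=224):
    """Return (new_w, new_h, x_offset, y_offset) for a centered, padded resize."""
    # Use the smaller ratio so the whole image fits inside the target canvas
    scale = min(target_w / w, target_h / h)
    # Clamp to at least one pixel (assumption of this sketch)
    new_w = max(1, int(w * scale))
    new_h = max(1, int(h * scale))
    # Center the resized image; the remaining border is filled with white
    x_offset = (target_w - new_w) // 2
    y_offset = (target_h - new_h) // 2
    return new_w, new_h, x_offset, y_offset


wide = letterbox_geometry(448, 112)      # a 4:1 signature crop
extreme = letterbox_geometry(4096, 1)    # degenerate one-pixel-high crop
```

For the 448x112 crop the scale is 0.5, so the image becomes 224x56 and is pasted 84 pixels from the top. For the 4096x1 crop the unclamped height would truncate to 0, which is the case the clamp guards against.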
@@ -0,0 +1,368 @@
 #!/usr/bin/env python3
 """
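 Step 3: Explore the similarity distribution
 1. Randomly sample 100,000 signature pairs
 2. Compute cosine similarity
 3. Plot the histogram of the distribution
 4. Find high-similarity pairs (>0.95)
 5. Analyze where the high-similarity pairs come from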
 """
 import numpy as np
 import matplotlib.pyplot as plt
 import seaborn as sns
 from pathlib import Path
 from tqdm import tqdm
 import random
 from collections import defaultdict
 import json
 # 路徑配置
 OUTPUT_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis")
 FEATURES_PATH = OUTPUT_DIR / "features" / "signature_features.npy"
 FILENAMES_PATH = OUTPUT_DIR / "features" / "signature_filenames.txt"
 REPORTS_PATH = OUTPUT_DIR / "reports"
 # 分析配置
 NUM_RANDOM_PAIRS = 100000
 HIGH_SIMILARITY_THRESHOLD = 0.95
 VERY_HIGH_SIMILARITY_THRESHOLD = 0.99
 def load_data():
    """載入特徵向量和檔名"""
    print("載入特徵向量...")
    features = np.load(FEATURES_PATH)
    print(f"特徵矩陣形狀: {features.shape}")
    print("載入檔名列表...")
    with open(FILENAMES_PATH, 'r') as f:
        filenames = [line.strip() for line in f.readlines()]
    print(f"檔名數量: {len(filenames)}")
    return features, filenames
 def parse_filename(filename: str) -> dict:
    """解析檔名提取資訊"""
    # 範例: 201301_2458_AI1_page4_sig1.png
    parts = filename.replace('.png', '').split('_')
    if len(parts) >= 5:
        return {
            'year_month': parts[0],
            'serial': parts[1],
            'doc_type': parts[2],
            'page': parts[3].replace('page', ''),
            'sig_index': parts[4].replace('sig', '')
        }
    return {'raw': filename}
 def cosine_similarity(v1, v2):
    """計算餘弦相似度（向量已 L2 正規化）"""
    return np.dot(v1, v2)
 def random_sampling_analysis(features, filenames, n_pairs=100000):
    """隨機抽樣計算相似度分布"""
    print(f"\n隨機抽樣 {n_pairs:,} 對簽名...")
    n = len(filenames)
    similarities = []
    pair_indices = []
    # 產生隨機配對
    for _ in tqdm(range(n_pairs), desc="計算相似度"):
        i, j = random.sample(range(n), 2)
        sim = cosine_similarity(features[i], features[j])
        similarities.append(sim)
        pair_indices.append((i, j))
    return np.array(similarities), pair_indices
 def find_high_similarity_pairs(features, filenames, threshold=0.95, sample_size=100000):
    """找出高相似度的簽名對"""
    print(f"\n搜尋相似度 > {threshold} 的簽名對...")
    n = len(filenames)
    high_sim_pairs = []
    # Look for high-similarity pairs by random sampling.
    # An exhaustive comparison is too slow (n^2 = 33 billion pairs), so pairs are sampled instead.
    for _ in tqdm(range(sample_size), desc="搜尋高相似度"):
        i, j = random.sample(range(n), 2)
        sim = cosine_similarity(features[i], features[j])
        if sim > threshold:
            high_sim_pairs.append({
                'idx1': i,
                'idx2': j,
                'file1': filenames[i],
                'file2': filenames[j],
                'similarity': float(sim),
                'parsed1': parse_filename(filenames[i]),
                'parsed2': parse_filename(filenames[j])
            })
    return high_sim_pairs
 def systematic_high_similarity_search(features, filenames, threshold=0.95, batch_size=1000):
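    """
    More systematic high-similarity search.
    For each signature in a random sample of up to 5,000 queries, find every
    other signature whose similarity exceeds the threshold.
    Note: batch_size is currently unused.
    """
    print(f"\nSystematic high-similarity search (threshold={threshold})...")
    print("Comparing each sampled query signature against all signatures...")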
    n = len(filenames)
    high_sim_pairs = []
    seen_pairs = set()
    # 隨機抽樣一部分簽名作為查詢
    sample_indices = random.sample(range(n), min(5000, n))
    for idx in tqdm(sample_indices, desc="搜尋"):
        # 計算這個簽名與所有其他簽名的相似度
        # 使用矩陣運算加速
        sims = features @ features[idx]
        # 找出高於閾值的（排除自己）
        high_sim_idx = np.where(sims > threshold)[0]
        for j in high_sim_idx:
            if j != idx:
                pair_key = tuple(sorted([idx, int(j)]))
                if pair_key not in seen_pairs:
                    seen_pairs.add(pair_key)
                    high_sim_pairs.append({
                        'idx1': int(idx),
                        'idx2': int(j),
                        'file1': filenames[idx],
                        'file2': filenames[int(j)],
                        'similarity': float(sims[j]),
                        'parsed1': parse_filename(filenames[idx]),
                        'parsed2': parse_filename(filenames[int(j)])
                    })
    return high_sim_pairs
 def analyze_high_similarity_sources(high_sim_pairs):
    """分析高相似度對的來源特徵"""
    print("\n分析高相似度對的來源...")
    stats = {
        'same_pdf': 0,
        'same_year_month': 0,
        'same_doc_type': 0,
        'different_everything': 0,
        'total': len(high_sim_pairs)
    }
    for pair in high_sim_pairs:
        p1, p2 = pair.get('parsed1', {}), pair.get('parsed2', {})
        # 同一 PDF
        if p1.get('year_month') == p2.get('year_month') and \
           p1.get('serial') == p2.get('serial') and \
           p1.get('doc_type') == p2.get('doc_type'):
            stats['same_pdf'] += 1
        # 同月份
        elif p1.get('year_month') == p2.get('year_month'):
            stats['same_year_month'] += 1
        # 同類型
        elif p1.get('doc_type') == p2.get('doc_type'):
            stats['same_doc_type'] += 1
        else:
            stats['different_everything'] += 1
    return stats
 def plot_similarity_distribution(similarities, output_path):
    """繪製相似度分布圖"""
    print("\n繪製分布圖...")
    try:
        # Convert to a plain Python list to sidestep NumPy issues when plotting
        sim_list = similarities.tolist()
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        # Left panel: full distribution, with explicit bin edges from np.linspace
        ax1 = axes[0]
        ax1.hist(sim_list, bins=np.linspace(min(sim_list), max(sim_list), 101).tolist(),
                 density=True, alpha=0.7, color='steelblue', edgecolor='white')
        ax1.axvline(x=0.95, color='red', linestyle='--', label='0.95 threshold')
        ax1.axvline(x=0.99, color='darkred', linestyle='--', label='0.99 threshold')
        ax1.set_xlabel('Cosine Similarity', fontsize=12)
        ax1.set_ylabel('Density', fontsize=12)
        ax1.set_title('Signature Similarity Distribution (Random Sampling)', fontsize=14)
        ax1.legend()
        # 統計標註
        mean_sim = float(np.mean(similarities))
        std_sim = float(np.std(similarities))
        ax1.annotate(f'Mean: {mean_sim:.4f}\nStd: {std_sim:.4f}',
                    xy=(0.02, 0.95), xycoords='axes fraction',
                    fontsize=10, verticalalignment='top',
                    bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
        # 右圖：高相似度區域放大
        ax2 = axes[1]
        high_sim_list = [x for x in sim_list if x > 0.8]
        if len(high_sim_list) > 0:
            ax2.hist(high_sim_list, bins=np.linspace(0.8, max(high_sim_list), 51).tolist(),
                     density=True, alpha=0.7, color='coral', edgecolor='white')
            ax2.axvline(x=0.95, color='red', linestyle='--', label='0.95 threshold')
            ax2.axvline(x=0.99, color='darkred', linestyle='--', label='0.99 threshold')
        ax2.set_xlabel('Cosine Similarity', fontsize=12)
        ax2.set_ylabel('Density', fontsize=12)
        ax2.set_title('High Similarity Region (> 0.8)', fontsize=14)
        if high_sim_list:
            # Only draw the legend when the threshold lines were actually added
            ax2.legend()
        # 高相似度統計
        pct_95 = int((similarities > 0.95).sum()) / len(similarities) * 100
        pct_99 = int((similarities > 0.99).sum()) / len(similarities) * 100
        ax2.annotate(f'> 0.95: {pct_95:.4f}%\n> 0.99: {pct_99:.4f}%',
                    xy=(0.98, 0.95), xycoords='axes fraction',
                    fontsize=10, verticalalignment='top', horizontalalignment='right',
                    bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
        plt.tight_layout()
        plt.savefig(output_path, dpi=150, bbox_inches='tight')
        plt.close()
        print(f"分布圖已儲存: {output_path}")
    except Exception as e:
        print(f"繪圖失敗: {e}")
        print("跳過繪圖，繼續其他分析...")
 def generate_statistics_report(similarities, high_sim_pairs, source_stats, output_path):
    """生成統計報告"""
    report = {
        'random_sampling': {
            'n_pairs': len(similarities),
            'mean': float(np.mean(similarities)),
            'std': float(np.std(similarities)),
            'min': float(np.min(similarities)),
            'max': float(np.max(similarities)),
            'percentiles': {
                '25%': float(np.percentile(similarities, 25)),
                '50%': float(np.percentile(similarities, 50)),
                '75%': float(np.percentile(similarities, 75)),
                '90%': float(np.percentile(similarities, 90)),
                '95%': float(np.percentile(similarities, 95)),
                '99%': float(np.percentile(similarities, 99)),
            },
            'above_thresholds': {
                '>0.90': int((similarities > 0.90).sum()),
                '>0.95': int((similarities > 0.95).sum()),
                '>0.99': int((similarities > 0.99).sum()),
            }
        },
        'high_similarity_search': {
            'threshold': HIGH_SIMILARITY_THRESHOLD,
            'pairs_found': len(high_sim_pairs),
            'source_analysis': source_stats,
            'top_10_pairs': sorted(high_sim_pairs, key=lambda x: x['similarity'], reverse=True)[:10]
        }
    }
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(report, f, indent=2, ensure_ascii=False)
    print(f"統計報告已儲存: {output_path}")
    return report
 def print_summary(report):
    """印出摘要"""
    print("\n" + "=" * 70)
    print("相似度分布分析摘要")
    print("=" * 70)
    rs = report['random_sampling']
    print(f"\n隨機抽樣統計 ({rs['n_pairs']:,} 對):")
    print(f"  平均相似度: {rs['mean']:.4f}")
    print(f"  標準差: {rs['std']:.4f}")
    print(f"  範圍: [{rs['min']:.4f}, {rs['max']:.4f}]")
    print(f"\n百分位數:")
    for k, v in rs['percentiles'].items():
        print(f"  {k}: {v:.4f}")
    print(f"\n高相似度對數量:")
    for k, v in rs['above_thresholds'].items():
        pct = v / rs['n_pairs'] * 100
        print(f"  {k}: {v:,} ({pct:.4f}%)")
    hs = report['high_similarity_search']
    print(f"\n系統化搜尋結果 (threshold={hs['threshold']}):")
    print(f"  發現高相似度對: {hs['pairs_found']:,}")
    if hs['source_analysis']['total'] > 0:
        sa = hs['source_analysis']
        print(f"\n來源分析:")
        print(f"  同一 PDF: {sa['same_pdf']} ({sa['same_pdf']/sa['total']*100:.1f}%)")
        print(f"  同月份: {sa['same_year_month']} ({sa['same_year_month']/sa['total']*100:.1f}%)")
        print(f"  同類型: {sa['same_doc_type']} ({sa['same_doc_type']/sa['total']*100:.1f}%)")
        print(f"  完全不同: {sa['different_everything']} ({sa['different_everything']/sa['total']*100:.1f}%)")
    if hs['top_10_pairs']:
        print(f"\nTop 10 高相似度對:")
        for i, pair in enumerate(hs['top_10_pairs'], 1):
            print(f"  {i}. {pair['similarity']:.4f}")
            print(f"     {pair['file1']}")
            print(f"     {pair['file2']}")
 def main():
    print("=" * 70)
    print("Step 3: 相似度分布探索")
    print("=" * 70)
    # 確保輸出目錄存在
    REPORTS_PATH.mkdir(parents=True, exist_ok=True)
    # 載入資料
    features, filenames = load_data()
    # 隨機抽樣分析
    similarities, pair_indices = random_sampling_analysis(features, filenames, NUM_RANDOM_PAIRS)
    # 繪製分布圖
    plot_similarity_distribution(
        similarities,
        REPORTS_PATH / "similarity_distribution.png"
    )
    # 系統化搜尋高相似度對
    high_sim_pairs = systematic_high_similarity_search(
        features, filenames,
        threshold=HIGH_SIMILARITY_THRESHOLD
    )
    # 分析來源
    source_stats = analyze_high_similarity_sources(high_sim_pairs)
    # 生成報告
    report = generate_statistics_report(
        similarities, high_sim_pairs, source_stats,
        REPORTS_PATH / "similarity_statistics.json"
    )
    # 儲存高相似度對列表
    high_sim_output = REPORTS_PATH / "high_similarity_pairs.json"
    with open(high_sim_output, 'w', encoding='utf-8') as f:
        json.dump(high_sim_pairs, f, indent=2, ensure_ascii=False)
    print(f"高相似度對列表已儲存: {high_sim_output}")
    # 印出摘要
    print_summary(report)
 if __name__ == "__main__":
    main()
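Because the feature vectors are L2-normalized, cosine similarity in Step 3 is just a dot product, and each unordered pair can be deduplicated with a sorted index tuple. The sketch below shows both ideas in pure Python on toy 2-D vectors; `l2_normalize`, `high_similarity_pairs`, and the sample vectors are illustrative names and data, not part of the pipeline.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def high_similarity_pairs(vectors, threshold=0.95):
    """Return {(i, j): similarity} for every unordered pair above the threshold."""
    unit = [l2_normalize(v) for v in vectors]
    pairs = {}
    for i, query in enumerate(unit):
        for j, candidate in enumerate(unit):
            if i == j:
                continue  # skip self-matches
            # For unit vectors the dot product equals the cosine similarity
            sim = sum(a * b for a, b in zip(query, candidate))
            if sim > threshold:
                # Sorting the indices makes (i, j) and (j, i) the same key
                pairs[tuple(sorted((i, j)))] = sim
    return pairs


toy_vectors = [[3, 4], [6, 8], [4, -3], [3, 4.1]]
found = high_similarity_pairs(toy_vectors)
```

Vectors 0 and 1 point in the same direction, vector 3 is a slight perturbation of them, and vector 2 is orthogonal, so only the three pairs among indices 0, 1, and 3 pass the threshold. On real data the inner loop would be replaced by a matrix product, as the script does with `features @ features[idx]`.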
@@ -0,0 +1,274 @@
 #!/usr/bin/env python3
 """
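 Step 4: Generate a visual report of high-similarity cases
 Read high_similarity_pairs.json
 Build side-by-side comparison images for the Top N high-similarity pairs
 Generate an HTML report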
 """
 import json
 import cv2
 import numpy as np
 from pathlib import Path
 from tqdm import tqdm
 import base64
 from io import BytesIO
 # 路徑配置
 IMAGES_DIR = Path("/Volumes/NV2/PDF-Processing/yolo-signatures/images")
 REPORTS_PATH = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
 HIGH_SIM_JSON = REPORTS_PATH / "high_similarity_pairs.json"
 # 報告配置
 TOP_N = 100  # 顯示前 N 對
 def load_image(filename: str) -> np.ndarray:
    """載入圖片"""
    img_path = IMAGES_DIR / filename
    img = cv2.imread(str(img_path))
    if img is None:
        # 返回空白圖片
        return np.ones((100, 200, 3), dtype=np.uint8) * 255
    return img
 def create_comparison_image(file1: str, file2: str, similarity: float) -> np.ndarray:
    """建立並排對比圖"""
    img1 = load_image(file1)
    img2 = load_image(file2)
    # 統一高度
    h1, w1 = img1.shape[:2]
    h2, w2 = img2.shape[:2]
    target_h = max(h1, h2, 100)
    # 縮放
    if h1 != target_h:
        scale = target_h / h1
        img1 = cv2.resize(img1, (int(w1 * scale), target_h))
    if h2 != target_h:
        scale = target_h / h2
        img2 = cv2.resize(img2, (int(w2 * scale), target_h))
    # 加入分隔線
    separator = np.ones((target_h, 20, 3), dtype=np.uint8) * 200
    # 合併
    comparison = np.hstack([img1, separator, img2])
    return comparison
 def image_to_base64(img: np.ndarray) -> str:
    """將圖片轉換為 base64"""
    _, buffer = cv2.imencode('.png', img)
    return base64.b64encode(buffer).decode('utf-8')
 def generate_html_report(pairs: list, output_path: Path):
    """生成 HTML 報告"""
    html_content = """
 <!DOCTYPE html>
 <html>
 <head>
    <meta charset="UTF-8">
    <title>簽名相似度分析報告 - 高相似度案例</title>
    <style>
        body {
            font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
            max-width: 1400px;
            margin: 0 auto;
            padding: 20px;
            background-color: #f5f5f5;
        }
        h1 {
            color: #333;
            text-align: center;
            border-bottom: 2px solid #666;
            padding-bottom: 10px;
        }
        .summary {
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            padding: 20px;
            border-radius: 10px;
            margin-bottom: 30px;
        }
        .summary h2 {
            margin-top: 0;
        }
        .pair-card {
            background: white;
            border-radius: 10px;
            padding: 20px;
            margin-bottom: 20px;
            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
        }
        .pair-header {
            display: flex;
            justify-content: space-between;
            align-items: center;
            margin-bottom: 15px;
            padding-bottom: 10px;
            border-bottom: 1px solid #eee;
        }
        .pair-number {
            font-size: 1.2em;
            font-weight: bold;
            color: #333;
        }
        .similarity-badge {
            background: #dc3545;
            color: white;
            padding: 5px 15px;
            border-radius: 20px;
            font-weight: bold;
        }
        .similarity-badge.high {
            background: #dc3545;
        }
        .similarity-badge.very-high {
            background: #8b0000;
        }
        .file-info {
            font-family: monospace;
            font-size: 0.9em;
            color: #666;
            margin-bottom: 10px;
        }
        .comparison-image {
            max-width: 100%;
            border: 1px solid #ddd;
            border-radius: 5px;
        }
        .analysis {
            margin-top: 15px;
            padding: 10px;
            background: #f8f9fa;
            border-radius: 5px;
            font-size: 0.9em;
        }
        .tag {
            display: inline-block;
            padding: 2px 8px;
            border-radius: 3px;
            margin-right: 5px;
            font-size: 0.8em;
        }
        .tag-same-serial { background: #ffebee; color: #c62828; }
        .tag-same-month { background: #fff3e0; color: #e65100; }
        .tag-diff { background: #e8f5e9; color: #2e7d32; }
    </style>
 </head>
 <body>
    <h1>簽名相似度分析報告 - 高相似度案例</h1>
    <div class="summary">
        <h2>摘要</h2>
        <p><strong>分析結果：</strong>發現 """ + f"{len(pairs):,}" + """ 對高相似度簽名 (>0.95)</p>
        <p><strong>本報告顯示：</strong>Top """ + str(TOP_N) + """ 最高相似度案例</p>
        <p><strong>結論：</strong>存在大量相似度接近或等於 1.0 的簽名對，強烈暗示「複製貼上」行為</p>
    </div>
    <div class="pairs-container">
 """
    for i, pair in enumerate(pairs[:TOP_N], 1):
        sim = pair['similarity']
        file1 = pair['file1']
        file2 = pair['file2']
        p1 = pair.get('parsed1', {})
        p2 = pair.get('parsed2', {})
        # 分析關係
        tags = []
        if p1.get('serial') == p2.get('serial'):
            tags.append(('<span class="tag tag-same-serial">同序號</span>', ''))
        if p1.get('year_month') == p2.get('year_month'):
            tags.append(('<span class="tag tag-same-month">同月份</span>', ''))
        if p1.get('year_month') != p2.get('year_month') and p1.get('serial') != p2.get('serial'):
            tags.append(('<span class="tag tag-diff">完全不同文件</span>', ''))
        badge_class = 'very-high' if sim >= 0.99 else 'high'
        # 建立對比圖
        try:
            comparison_img = create_comparison_image(file1, file2, sim)
            img_base64 = image_to_base64(comparison_img)
            img_html = f'<img src="data:image/png;base64,{img_base64}" class="comparison-image">'
        except Exception as e:
            img_html = f'<p style="color:red">無法載入圖片: {e}</p>'
        tag_html = ''.join([t[0] for t in tags])
        html_content += f"""
        <div class="pair-card">
            <div class="pair-header">
                <span class="pair-number">#{i}</span>
                <span class="similarity-badge {badge_class}">相似度: {sim:.4f}</span>
            </div>
            <div class="file-info">
                <strong>簽名 1:</strong> {file1}<br>
                <strong>簽名 2:</strong> {file2}
            </div>
            {img_html}
            <div class="analysis">
                {tag_html}
                <br><small>日期: {p1.get('year_month', 'N/A')} vs {p2.get('year_month', 'N/A')} |
                序號: {p1.get('serial', 'N/A')} vs {p2.get('serial', 'N/A')}</small>
            </div>
        </div>
 """
    html_content += """
    </div>
    <div style="text-align: center; margin-top: 30px; color: #666;">
        <p>生成時間: 2024 | 簽名真實性研究計劃</p>
    </div>
 </body>
 </html>
 """
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html_content)
    print(f"HTML 報告已儲存: {output_path}")
 def main():
    print("=" * 60)
    print("Step 4: 生成高相似度案例視覺化報告")
    print("=" * 60)
    # 載入高相似度對
    print("載入高相似度對資料...")
    with open(HIGH_SIM_JSON, 'r', encoding='utf-8') as f:
        pairs = json.load(f)
    print(f"共 {len(pairs):,} 對高相似度簽名")
    # 按相似度排序
    pairs_sorted = sorted(pairs, key=lambda x: x['similarity'], reverse=True)
    # 統計
    sim_1 = len([p for p in pairs_sorted if p['similarity'] >= 0.9999])
    sim_99 = len([p for p in pairs_sorted if p['similarity'] >= 0.99])
    sim_97 = len([p for p in pairs_sorted if p['similarity'] >= 0.97])
    print(f"\n相似度統計:")
    print(f"  >= 0.9999 (near-identical): {sim_1:,}")
    print(f"  >= 0.99: {sim_99:,}")
    print(f"  >= 0.97: {sim_97:,}")
    # 生成報告
    print(f"\n生成 Top {TOP_N} 視覺化報告...")
    generate_html_report(pairs_sorted, REPORTS_PATH / "high_similarity_report.html")
    print("\n完成！")
 if __name__ == "__main__":
    main()
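The report in Step 4 embeds each comparison image as a base64 data URI and picks a badge class from the similarity score. A minimal standard-library sketch of that idea follows; `png_data_uri`, `badge_class`, and `pair_card` are hypothetical helpers, and escaping filenames with `html.escape` is an added precaution of this sketch rather than something the script does.

```python
import base64
import html


def png_data_uri(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URI usable in an <img src=...> attribute."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")


def badge_class(similarity: float, very_high: float = 0.99) -> str:
    """Map a similarity score to the CSS modifier used by the badge."""
    return "very-high" if similarity >= very_high else "high"


def pair_card(rank: int, file1: str, file2: str, similarity: float, png_bytes: bytes) -> str:
    """Build one self-contained HTML card for a signature pair."""
    return (
        '<div class="pair-card">'
        f'<span class="pair-number">#{rank}</span>'
        f'<span class="similarity-badge {badge_class(similarity)}">{similarity:.4f}</span>'
        # Escape filenames so unusual characters cannot break the markup
        f'<div class="file-info">{html.escape(file1)} / {html.escape(file2)}</div>'
        f'<img src="{png_data_uri(png_bytes)}" class="comparison-image">'
        "</div>"
    )


# Placeholder bytes stand in for a real encoded PNG in this illustration
card = pair_card(1, "a<b>.png", "c.png", 0.9912, b"abc")
```

Embedding images inline keeps the report a single portable file, at the cost of roughly one third more bytes than the original PNGs because of base64 encoding.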
@@ -0,0 +1,432 @@
 #!/usr/bin/env python3
 """
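 Step 5: Extract the accountants' printed names from the PDFs
 Workflow:
 1. Read signature records from the database, grouped by (PDF, page)
 2. Re-run YOLO on each page to get signature bounding-box coordinates
 3. Run PaddleOCR on the full page to extract printed text
 4. Filter for candidate names (2-4 Chinese characters)
 5. Match each signature to the nearest printed name
 6. Update the accountant_name column in the database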
 """
 import sqlite3
 import json
 import re
 import sys
 import time
 from pathlib import Path
 from typing import Optional, List, Dict, Tuple
 from collections import defaultdict
 from tqdm import tqdm
 import numpy as np
 import cv2
 import fitz  # PyMuPDF
 # 加入父目錄到路徑以便匯入
 sys.path.insert(0, str(Path(__file__).parent.parent))
 from paddleocr_client import PaddleOCRClient
 # 路徑配置
 PDF_BASE = Path("/Volumes/NV2/PDF-Processing/total-pdf")
 YOLO_MODEL_PATH = Path("/Volumes/NV2/pdf_recognize/models/best.pt")
 DB_PATH = Path("/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db")
 REPORTS_PATH = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
 # 處理配置
 DPI = 150
 CONFIDENCE_THRESHOLD = 0.5
 NAME_SEARCH_MARGIN = 200  # 簽名框周圍搜索姓名的像素範圍
 PROGRESS_SAVE_INTERVAL = 100  # 每處理 N 個頁面保存一次進度
 # 中文姓名正則
 CHINESE_NAME_PATTERN = re.compile(r'^[\u4e00-\u9fff]{2,4}$')
 def find_pdf_file(filename: str) -> Optional[str]:
    """搜尋 PDF 檔案路徑"""
    # 先在 batch_* 子目錄尋找
    for batch_dir in sorted(PDF_BASE.glob("batch_*")):
        pdf_path = batch_dir / filename
        if pdf_path.exists():
            return str(pdf_path)
    # 再在頂層目錄尋找
    pdf_path = PDF_BASE / filename
    if pdf_path.exists():
        return str(pdf_path)
    return None
 def render_pdf_page(pdf_path: str, page_num: int) -> Optional[np.ndarray]:
    """Render a 1-based PDF page to an image array; return None on failure."""
    try:
        # The context manager closes the document even if rendering raises
        with fitz.open(pdf_path) as doc:
            if page_num < 1 or page_num > len(doc):
                return None
            page = doc[page_num - 1]
            mat = fitz.Matrix(DPI / 72, DPI / 72)
            pix = page.get_pixmap(matrix=mat, alpha=False)
            image = np.frombuffer(pix.samples, dtype=np.uint8)
            image = image.reshape(pix.height, pix.width, pix.n)
        return image
    except Exception as e:
        print(f"渲染失敗: {pdf_path} page {page_num}: {e}")
        return None
 def detect_signatures_yolo(image: np.ndarray, model) -> List[Dict]:
    """使用 YOLO 偵測簽名框"""
    results = model(image, conf=CONFIDENCE_THRESHOLD, verbose=False)
    signatures = []
    for r in results:
        for box in r.boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0].cpu().numpy())
            conf = float(box.conf[0].cpu().numpy())
            signatures.append({
                'x': x1,
                'y': y1,
                'width': x2 - x1,
                'height': y2 - y1,
                'confidence': conf,
                'center_x': (x1 + x2) / 2,
                'center_y': (y1 + y2) / 2
            })
     # Sort by position (top to bottom, then left to right)
    signatures.sort(key=lambda s: (s['y'], s['x']))
    return signatures
 def extract_text_candidates(image: np.ndarray, ocr_client: PaddleOCRClient) -> List[Dict]:
     """Extract every OCR text candidate from a page image."""
    try:
        results = ocr_client.ocr(image)
        candidates = []
        for result in results:
            text = result.get('text', '').strip()
            box = result.get('box', [])
            confidence = result.get('confidence', 0.0)
            if not box or not text:
                continue
             # Compute the centre of the bounding box
            xs = [point[0] for point in box]
            ys = [point[1] for point in box]
            center_x = sum(xs) / len(xs)
            center_y = sum(ys) / len(ys)
            candidates.append({
                'text': text,
                'center_x': center_x,
                'center_y': center_y,
                'x': min(xs),
                'y': min(ys),
                'width': max(xs) - min(xs),
                'height': max(ys) - min(ys),
                'confidence': confidence
            })
        return candidates
    except Exception as e:
        print(f"OCR 失敗: {e}")
        return []
 def filter_name_candidates(candidates: List[Dict]) -> List[Dict]:
     """Keep text that looks like a name (2-4 Chinese characters, no digits or punctuation)."""
     names = []
     for c in candidates:
         text = c['text']
         # Strip whitespace and punctuation
         text_clean = re.sub(r'[\s\:\：\,\，\.\。]', '', text)
        if CHINESE_NAME_PATTERN.match(text_clean):
            c['text_clean'] = text_clean
            names.append(c)
    return names
 def match_signature_to_name(
    sig: Dict,
    name_candidates: List[Dict],
    margin: int = NAME_SEARCH_MARGIN
 ) -> Optional[str]:
     """Pair a signature box with the nearest name candidate."""
     sig_center_x = sig['center_x']
     sig_center_y = sig['center_y']
     # Collect the names that fall inside the search window
     nearby_names = []
     for name in name_candidates:
         dx = abs(name['center_x'] - sig_center_x)
         dy = abs(name['center_y'] - sig_center_y)
         # Inside the signature box expanded by margin on every side
         if dx <= margin + sig['width']/2 and dy <= margin + sig['height']/2:
             distance = (dx**2 + dy**2) ** 0.5
             nearby_names.append((name, distance))
     if not nearby_names:
         return None
     # Return the closest one
     nearby_names.sort(key=lambda x: x[1])
     return nearby_names[0][0]['text_clean']
 def get_pages_to_process(conn: sqlite3.Connection) -> List[Tuple[str, int, List[int]]]:
     """
     Fetch the (PDF, page) combinations that still need processing.
     Returns:
         List of (source_pdf, page_number, [signature_ids])
     """
     cursor = conn.cursor()
     # Select signatures that have no accountant_name yet, grouped by (PDF, page)
    cursor.execute('''
        SELECT source_pdf, page_number, GROUP_CONCAT(signature_id)
        FROM signatures
        WHERE accountant_name IS NULL OR accountant_name = ''
        GROUP BY source_pdf, page_number
        ORDER BY source_pdf, page_number
    ''')
    pages = []
    for row in cursor.fetchall():
        source_pdf, page_number, sig_ids_str = row
        sig_ids = [int(x) for x in sig_ids_str.split(',')]
        pages.append((source_pdf, page_number, sig_ids))
    return pages
 def update_signature_names(
    conn: sqlite3.Connection,
    updates: List[Tuple[int, str, int, int, int, int]]
 ):
     """
     Update signature names and box coordinates in the database.
     Args:
         updates: List of (signature_id, accountant_name, x, y, width, height)
     """
     cursor = conn.cursor()
     # Make sure the signature_boxes table exists
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS signature_boxes (
            signature_id INTEGER PRIMARY KEY,
            x INTEGER,
            y INTEGER,
            width INTEGER,
            height INTEGER,
            FOREIGN KEY (signature_id) REFERENCES signatures(signature_id)
        )
    ''')
    for sig_id, name, x, y, w, h in updates:
         # Update the name
        cursor.execute('''
            UPDATE signatures SET accountant_name = ? WHERE signature_id = ?
        ''', (name, sig_id))
         # Insert or update the coordinates
        cursor.execute('''
            INSERT OR REPLACE INTO signature_boxes (signature_id, x, y, width, height)
            VALUES (?, ?, ?, ?, ?)
        ''', (sig_id, x, y, w, h))
    conn.commit()
 def process_page(
    source_pdf: str,
    page_number: int,
    sig_ids: List[int],
    yolo_model,
    ocr_client: PaddleOCRClient,
    conn: sqlite3.Connection
 ) -> Dict:
     """
     Process one page: detect signature boxes, extract names, and pair them.
     Returns:
         Processing statistics for the page
     """
    result = {
        'source_pdf': source_pdf,
        'page_number': page_number,
        'num_signatures': len(sig_ids),
        'matched': 0,
        'unmatched': 0,
        'error': None
    }
     # Locate the PDF file
    pdf_path = find_pdf_file(source_pdf)
    if pdf_path is None:
        result['error'] = 'PDF not found'
        return result
     # Render the page
    image = render_pdf_page(pdf_path, page_number)
    if image is None:
        result['error'] = 'Render failed'
        return result
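     # Detect signature boxes with YOLO
     sig_boxes = detect_signatures_yolo(image, yolo_model)
     # Note: the number of detected boxes may differ from the number of DB
     # records. Boxes and records are paired by order below, and any records
     # left over are written as unmatched.
     # Extract text with OCR
     text_candidates = extract_text_candidates(image, ocr_client)
     # Keep only the name candidates
     name_candidates = filter_name_candidates(text_candidates)
     # Pair signatures with names
     updates = []
     for sig_id, sig_box in zip(sig_ids, sig_boxes):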
        matched_name = match_signature_to_name(sig_box, name_candidates)
        if matched_name:
            result['matched'] += 1
        else:
            result['unmatched'] += 1
             matched_name = ''  # empty string marks an unmatched signature
        updates.append((
            sig_id,
            matched_name,
            sig_box['x'],
            sig_box['y'],
            sig_box['width'],
            sig_box['height']
        ))
     # If YOLO found fewer boxes than there are records, mark the rest as unmatched
    if len(sig_boxes) < len(sig_ids):
        for sig_id in sig_ids[len(sig_boxes):]:
            updates.append((sig_id, '', 0, 0, 0, 0))
            result['unmatched'] += 1
     # Update the database
    update_signature_names(conn, updates)
    return result
 def main():
    print("=" * 60)
    print("Step 5: 從 PDF 提取會計師印刷姓名")
    print("=" * 60)
     # Make sure the reports directory exists
    REPORTS_PATH.mkdir(parents=True, exist_ok=True)
     # Connect to the database
    print("\n連接資料庫...")
    conn = sqlite3.connect(DB_PATH)
     # Fetch the pages that need processing
    print("查詢待處理頁面...")
    pages = get_pages_to_process(conn)
    print(f"共 {len(pages)} 個頁面待處理")
    if not pages:
        print("沒有需要處理的頁面")
        conn.close()
        return
     # Initialise YOLO
    print("\n載入 YOLO 模型...")
    from ultralytics import YOLO
    yolo_model = YOLO(str(YOLO_MODEL_PATH))
     # Initialise the OCR client
    print("連接 PaddleOCR 伺服器...")
    ocr_client = PaddleOCRClient()
    if not ocr_client.health_check():
        print("錯誤: PaddleOCR 伺服器無法連接")
        print("請確認伺服器 http://192.168.30.36:5555 正在運行")
        conn.close()
        return
    print("OCR 伺服器連接成功")
     # Statistics
    stats = {
        'total_pages': len(pages),
        'processed': 0,
        'matched': 0,
        'unmatched': 0,
        'errors': 0,
        'start_time': time.time()
    }
     # Process each page
    print(f"\n開始處理 {len(pages)} 個頁面...")
    for source_pdf, page_number, sig_ids in tqdm(pages, desc="處理頁面"):
        result = process_page(
            source_pdf, page_number, sig_ids,
            yolo_model, ocr_client, conn
        )
        stats['processed'] += 1
        stats['matched'] += result['matched']
        stats['unmatched'] += result['unmatched']
        if result['error']:
            stats['errors'] += 1
         # Print a progress report periodically
        if stats['processed'] % PROGRESS_SAVE_INTERVAL == 0:
            elapsed = time.time() - stats['start_time']
            rate = stats['processed'] / elapsed
            remaining = (stats['total_pages'] - stats['processed']) / rate if rate > 0 else 0
            print(f"\n進度: {stats['processed']}/{stats['total_pages']} "
                  f"({stats['processed']/stats['total_pages']*100:.1f}%)")
            print(f"配對成功: {stats['matched']}, 未配對: {stats['unmatched']}")
            print(f"預估剩餘時間: {remaining/60:.1f} 分鐘")
     # Final statistics
    elapsed = time.time() - stats['start_time']
    stats['elapsed_seconds'] = elapsed
    print("\n" + "=" * 60)
    print("處理完成")
    print("=" * 60)
    print(f"總頁面數: {stats['total_pages']}")
    print(f"處理成功: {stats['processed']}")
    print(f"配對成功: {stats['matched']}")
    print(f"未配對: {stats['unmatched']}")
    print(f"錯誤: {stats['errors']}")
    print(f"耗時: {elapsed/60:.1f} 分鐘")
     # Save the report
    report_path = REPORTS_PATH / "name_extraction_report.json"
    with open(report_path, 'w', encoding='utf-8') as f:
        json.dump(stats, f, indent=2, ensure_ascii=False)
    print(f"\n報告已儲存: {report_path}")
    conn.close()
 if __name__ == "__main__":
    main()
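The pairing rule in `match_signature_to_name` can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper name `nearest_name` and made-up coordinates and placeholder names. It keeps only candidates whose centre lies inside the signature box expanded by `margin` pixels on every side, then returns the closest one by Euclidean distance.

```python
from typing import Dict, List, Optional


def nearest_name(sig: Dict, names: List[Dict], margin: int = 200) -> Optional[str]:
    """Return the text of the closest candidate inside the search window, else None.

    Hypothetical stand-alone version of the matching rule; not the script's own function.
    """
    best_text, best_dist = None, float("inf")
    for name in names:
        dx = abs(name["center_x"] - sig["center_x"])
        dy = abs(name["center_y"] - sig["center_y"])
        # Window = signature box grown by `margin` pixels in every direction
        if dx <= margin + sig["width"] / 2 and dy <= margin + sig["height"] / 2:
            dist = (dx ** 2 + dy ** 2) ** 0.5
            if dist < best_dist:
                best_text, best_dist = name["text"], dist
    return best_text


# Made-up layout: one signature box and three OCR candidates (placeholder names).
sig = {"center_x": 500.0, "center_y": 1000.0, "width": 200, "height": 80}
names = [
    {"text": "甲乙丙", "center_x": 320.0, "center_y": 1010.0},   # inside window, farther
    {"text": "丁戊", "center_x": 560.0, "center_y": 1100.0},     # inside window, nearer
    {"text": "己庚辛", "center_x": 1500.0, "center_y": 1000.0},  # outside window
]
result = nearest_name(sig, names)
print(result)
```

Because the window grows with the box size, a wide signature searches a wider area than a narrow one. Whether 200 px is appropriate at 150 DPI depends on the page layout and is not established by this sketch.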
@@ -0,0 +1,402 @@
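 #!/usr/bin/env python3
 """
 Step 5: Extract accountant names from PDFs (full-run version)
 Pipeline:
 1. Read signature records from the database, grouped by (PDF, page)
 2. Re-run YOLO on each page to get the signature box coordinates
 3. Run PaddleOCR on the whole page to extract text
 4. Keep the name candidates (2-4 Chinese characters)
 5. Pair each signature with the nearest name
 6. Update the database and generate a report
 """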
 import sqlite3
 import json
 import re
 import sys
 import time
 from pathlib import Path
 from typing import Optional, List, Dict, Tuple
 from collections import defaultdict
 from datetime import datetime
 from tqdm import tqdm
 import numpy as np
 import fitz  # PyMuPDF
 # Add the parent directory to the import path
 sys.path.insert(0, str(Path(__file__).parent.parent))
 from paddleocr_client import PaddleOCRClient
 # Path configuration
 PDF_BASE = Path("/Volumes/NV2/PDF-Processing/total-pdf")
 YOLO_MODEL_PATH = Path("/Volumes/NV2/pdf_recognize/models/best.pt")
 DB_PATH = Path("/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db")
 REPORTS_PATH = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
 # Processing configuration
 DPI = 150
 CONFIDENCE_THRESHOLD = 0.5
 NAME_SEARCH_MARGIN = 200
 PROGRESS_SAVE_INTERVAL = 100
 BATCH_COMMIT_SIZE = 50
 # Regex for Chinese names (2-4 CJK characters)
 CHINESE_NAME_PATTERN = re.compile(r'^[\u4e00-\u9fff]{2,4}$')
 # Common non-name words to exclude
 EXCLUDE_WORDS = {'會計', '會計師', '事務所', '師', '聯合', '出具報告'}
 def find_pdf_file(filename: str) -> Optional[str]:
     """Locate a PDF file by name under PDF_BASE."""
    for batch_dir in sorted(PDF_BASE.glob("batch_*")):
        pdf_path = batch_dir / filename
        if pdf_path.exists():
            return str(pdf_path)
    pdf_path = PDF_BASE / filename
    if pdf_path.exists():
        return str(pdf_path)
    return None
 def render_pdf_page(pdf_path: str, page_num: int) -> Optional[np.ndarray]:
     """Render one PDF page (1-based page_num) to an RGB image array."""
     try:
         # The context manager closes the document even if rendering raises
         with fitz.open(pdf_path) as doc:
             if page_num < 1 or page_num > len(doc):
                 return None
             page = doc[page_num - 1]
             mat = fitz.Matrix(DPI / 72, DPI / 72)
             pix = page.get_pixmap(matrix=mat, alpha=False)
             image = np.frombuffer(pix.samples, dtype=np.uint8)
             # Copy so the array owns its data and is writable
             return image.reshape(pix.height, pix.width, pix.n).copy()
     except Exception:
         return None
 def detect_signatures_yolo(image: np.ndarray, model) -> List[Dict]:
     """Detect signature boxes on a page image with YOLO."""
    results = model(image, conf=CONFIDENCE_THRESHOLD, verbose=False)
    signatures = []
    for r in results:
        for box in r.boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0].cpu().numpy())
            conf = float(box.conf[0].cpu().numpy())
            signatures.append({
                'x': x1, 'y': y1,
                'width': x2 - x1, 'height': y2 - y1,
                'confidence': conf,
                'center_x': (x1 + x2) / 2,
                'center_y': (y1 + y2) / 2
            })
    signatures.sort(key=lambda s: (s['y'], s['x']))
    return signatures
 def extract_and_filter_names(image: np.ndarray, ocr_client: PaddleOCRClient) -> List[Dict]:
     """Extract name candidates from a page image via OCR and filter them."""
    try:
        results = ocr_client.ocr(image)
    except Exception:
        return []
    candidates = []
    for result in results:
        text = result.get('text', '').strip()
        box = result.get('box', [])
        if not box or not text:
            continue
         # Clean the text
         text_clean = re.sub(r'[\s\:\：\,\，\.\。\、]', '', text)
         # Keep it only if it looks like a name
        if CHINESE_NAME_PATTERN.match(text_clean) and text_clean not in EXCLUDE_WORDS:
            xs = [point[0] for point in box]
            ys = [point[1] for point in box]
            candidates.append({
                'text': text_clean,
                'center_x': sum(xs) / len(xs),
                'center_y': sum(ys) / len(ys),
            })
    return candidates
 def match_signature_to_name(sig: Dict, name_candidates: List[Dict]) -> Optional[str]:
     """Pair a signature box with the nearest name."""
    margin = NAME_SEARCH_MARGIN
    nearby = []
    for name in name_candidates:
        dx = abs(name['center_x'] - sig['center_x'])
        dy = abs(name['center_y'] - sig['center_y'])
        if dx <= margin + sig['width']/2 and dy <= margin + sig['height']/2:
            distance = (dx**2 + dy**2) ** 0.5
            nearby.append((name['text'], distance))
    if nearby:
        nearby.sort(key=lambda x: x[1])
        return nearby[0][0]
    return None
 def get_pages_to_process(conn: sqlite3.Connection) -> List[Tuple[str, int, List[int]]]:
     """Fetch the pages that still need processing from the database."""
    cursor = conn.cursor()
    cursor.execute('''
        SELECT source_pdf, page_number, GROUP_CONCAT(signature_id)
        FROM signatures
        WHERE accountant_name IS NULL OR accountant_name = ''
        GROUP BY source_pdf, page_number
        ORDER BY source_pdf, page_number
    ''')
    pages = []
    for row in cursor.fetchall():
        source_pdf, page_number, sig_ids_str = row
        sig_ids = [int(x) for x in sig_ids_str.split(',')]
        pages.append((source_pdf, page_number, sig_ids))
    return pages
 def process_page(
    source_pdf: str, page_number: int, sig_ids: List[int],
    yolo_model, ocr_client: PaddleOCRClient
 ) -> Dict:
     """Process a single page."""
    result = {
        'source_pdf': source_pdf,
        'page_number': page_number,
        'num_signatures': len(sig_ids),
        'matched': 0,
        'unmatched': 0,
        'error': None,
        'updates': []
    }
    pdf_path = find_pdf_file(source_pdf)
    if pdf_path is None:
        result['error'] = 'PDF not found'
        return result
    image = render_pdf_page(pdf_path, page_number)
    if image is None:
        result['error'] = 'Render failed'
        return result
    sig_boxes = detect_signatures_yolo(image, yolo_model)
    name_candidates = extract_and_filter_names(image, ocr_client)
    for i, sig_id in enumerate(sig_ids):
        if i < len(sig_boxes):
            sig = sig_boxes[i]
            matched_name = match_signature_to_name(sig, name_candidates)
            if matched_name:
                result['matched'] += 1
            else:
                result['unmatched'] += 1
                matched_name = ''
            result['updates'].append((
                sig_id, matched_name,
                sig['x'], sig['y'], sig['width'], sig['height']
            ))
        else:
            result['updates'].append((sig_id, '', 0, 0, 0, 0))
            result['unmatched'] += 1
    return result
 def save_updates_to_db(conn: sqlite3.Connection, updates: List[Tuple]):
     """Apply a batch of updates to the database."""
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS signature_boxes (
            signature_id INTEGER PRIMARY KEY,
            x INTEGER, y INTEGER, width INTEGER, height INTEGER,
            FOREIGN KEY (signature_id) REFERENCES signatures(signature_id)
        )
    ''')
    for sig_id, name, x, y, w, h in updates:
        cursor.execute('UPDATE signatures SET accountant_name = ? WHERE signature_id = ?', (name, sig_id))
         if w > 0 and h > 0:  # store a box only when one was detected (placeholder rows are all zeros)
            cursor.execute('''
                INSERT OR REPLACE INTO signature_boxes (signature_id, x, y, width, height)
                VALUES (?, ?, ?, ?, ?)
            ''', (sig_id, x, y, w, h))
    conn.commit()
 def generate_report(stats: Dict, output_path: Path):
     """Generate the processing report."""
    report = {
        'title': '會計師姓名提取報告',
        'generated_at': datetime.now().isoformat(),
        'summary': {
            'total_pages': stats['total_pages'],
            'processed_pages': stats['processed'],
            'total_signatures': stats['total_sigs'],
            'matched_signatures': stats['matched'],
            'unmatched_signatures': stats['unmatched'],
            'match_rate': f"{stats['matched']/stats['total_sigs']*100:.1f}%" if stats['total_sigs'] > 0 else "N/A",
            'errors': stats['errors'],
            'elapsed_seconds': stats['elapsed_seconds'],
            'elapsed_human': f"{stats['elapsed_seconds']/3600:.1f} 小時"
        },
        'methodology': {
            'step1': 'YOLO 模型偵測簽名框座標',
            'step2': 'PaddleOCR 整頁 OCR 提取文字',
            'step3': '過濾 2-4 個中文字作為姓名候選',
            'step4': f'在簽名框周圍 {NAME_SEARCH_MARGIN}px 範圍內配對最近的姓名',
            'dpi': DPI,
            'yolo_confidence': CONFIDENCE_THRESHOLD
        },
        'name_distribution': stats.get('name_distribution', {}),
        'error_samples': stats.get('error_samples', [])
    }
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(report, f, indent=2, ensure_ascii=False)
     # Also write a Markdown version of the report
    md_path = output_path.with_suffix('.md')
    with open(md_path, 'w', encoding='utf-8') as f:
        f.write(f"# {report['title']}\n\n")
        f.write(f"生成時間: {report['generated_at']}\n\n")
        f.write("## 摘要\n\n")
        f.write(f"| 指標 | 數值 |\n|------|------|\n")
        for k, v in report['summary'].items():
            f.write(f"| {k} | {v} |\n")
        f.write("\n## 方法論\n\n")
        for k, v in report['methodology'].items():
            f.write(f"- **{k}**: {v}\n")
        f.write("\n## 姓名分布 (Top 50)\n\n")
        names = sorted(report['name_distribution'].items(), key=lambda x: -x[1])[:50]
        for name, count in names:
            f.write(f"- {name}: {count}\n")
    return report
 def main():
    print("=" * 70)
    print("Step 5: 從 PDF 提取會計師姓名 - 完整處理")
    print("=" * 70)
    print(f"開始時間: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    REPORTS_PATH.mkdir(parents=True, exist_ok=True)
     # Connect to the database
    conn = sqlite3.connect(DB_PATH)
    pages = get_pages_to_process(conn)
    print(f"\n待處理頁面: {len(pages):,}")
    if not pages:
        print("沒有需要處理的頁面")
        conn.close()
        return
     # Load YOLO
    print("\n載入 YOLO 模型...")
    from ultralytics import YOLO
    yolo_model = YOLO(str(YOLO_MODEL_PATH))
     # Connect to the OCR server
    print("連接 PaddleOCR 伺服器...")
    ocr_client = PaddleOCRClient()
    if not ocr_client.health_check():
        print("錯誤: PaddleOCR 伺服器無法連接")
        conn.close()
        return
    print("OCR 伺服器連接成功\n")
     # Statistics
    stats = {
        'total_pages': len(pages),
        'processed': 0,
        'total_sigs': sum(len(p[2]) for p in pages),
        'matched': 0,
        'unmatched': 0,
        'errors': 0,
        'error_samples': [],
        'name_distribution': defaultdict(int),
        'start_time': time.time()
    }
    all_updates = []
     # Process each page
    for source_pdf, page_number, sig_ids in tqdm(pages, desc="處理頁面"):
        result = process_page(source_pdf, page_number, sig_ids, yolo_model, ocr_client)
        stats['processed'] += 1
        stats['matched'] += result['matched']
        stats['unmatched'] += result['unmatched']
        if result['error']:
            stats['errors'] += 1
            if len(stats['error_samples']) < 20:
                stats['error_samples'].append({
                    'pdf': source_pdf,
                    'page': page_number,
                    'error': result['error']
                })
        else:
            all_updates.extend(result['updates'])
            for update in result['updates']:
                 if update[1]:  # a name was matched
                    stats['name_distribution'][update[1]] += 1
         # Commit in batches
        if len(all_updates) >= BATCH_COMMIT_SIZE:
            save_updates_to_db(conn, all_updates)
            all_updates = []
         # Show progress periodically
        if stats['processed'] % PROGRESS_SAVE_INTERVAL == 0:
            elapsed = time.time() - stats['start_time']
            rate = stats['processed'] / elapsed
            remaining = (stats['total_pages'] - stats['processed']) / rate if rate > 0 else 0
            print(f"\n[進度] {stats['processed']:,}/{stats['total_pages']:,} "
                  f"({stats['processed']/stats['total_pages']*100:.1f}%) | "
                  f"配對: {stats['matched']:,} | "
                  f"剩餘: {remaining/60:.1f} 分鐘")
     # Commit the final batch
    if all_updates:
        save_updates_to_db(conn, all_updates)
    stats['elapsed_seconds'] = time.time() - stats['start_time']
    stats['name_distribution'] = dict(stats['name_distribution'])
     # Generate the report
    print("\n生成報告...")
    report_path = REPORTS_PATH / "name_extraction_report.json"
    generate_report(stats, report_path)
    print("\n" + "=" * 70)
    print("處理完成！")
    print("=" * 70)
    print(f"總頁面: {stats['total_pages']:,}")
    print(f"總簽名: {stats['total_sigs']:,}")
    print(f"配對成功: {stats['matched']:,} ({stats['matched']/stats['total_sigs']*100:.1f}%)")
    print(f"未配對: {stats['unmatched']:,}")
    print(f"錯誤: {stats['errors']:,}")
    print(f"耗時: {stats['elapsed_seconds']/3600:.2f} 小時")
    print(f"\n報告已儲存:")
    print(f"  - {report_path}")
    print(f"  - {report_path.with_suffix('.md')}")
    conn.close()
 if __name__ == "__main__":
    main()
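The name filter in `extract_and_filter_names` combines three checks: strip whitespace and punctuation, require 2-4 CJK characters, and reject a small stop-word set. A minimal sketch of that filter on its own follows. The pattern and stop words are copied from the script, while the helper name `is_name_candidate` and the sample strings are illustrative assumptions.

```python
import re

# Copied from the script above
CHINESE_NAME_PATTERN = re.compile(r'^[\u4e00-\u9fff]{2,4}$')
EXCLUDE_WORDS = {'會計', '會計師', '事務所', '師', '聯合', '出具報告'}
# Same character class as the script's re.sub call, without the redundant escapes
PUNCT = re.compile(r'[\s:：,，.。、]')


def is_name_candidate(text: str) -> bool:
    """Hypothetical helper: True if the cleaned text looks like a 2-4 character name."""
    cleaned = PUNCT.sub('', text)
    return bool(CHINESE_NAME_PATTERN.match(cleaned)) and cleaned not in EXCLUDE_WORDS


# Illustrative OCR fragments (placeholder name, not real data)
samples = ['會計師：', '甲 乙 丙', '事務所', '2013年', '查核報告書內容']
kept = [s for s in samples if is_name_candidate(s)]
print(kept)
```

The filter is purely lexical, so any other 2-4 character phrase near a signature would still pass. The stop-word set only removes the boilerplate terms listed in it.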
@@ -0,0 +1,450 @@
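 #!/usr/bin/env python3
 """
 Signature cleanup and accountant assignment
 1. Flag PDFs with sig_count > 2 and keep the best two signatures
 2. Assign signatures to accountants by OCR name or by coordinates
 3. Build the accountants table
 """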
 import sqlite3
 import json
 from collections import defaultdict
 from datetime import datetime
 from opencc import OpenCC
 # Simplified-to-Traditional Chinese converter
 cc_s2t = OpenCC('s2t')
 DB_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 REPORT_DIR = '/Volumes/NV2/PDF-Processing/signature-analysis/reports'
 def get_connection():
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    return conn
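 def add_columns_if_needed(conn):
     """Add the columns this script writes to, if they are missing."""
     cur = conn.cursor()
     # Inspect the existing columns
     cur.execute("PRAGMA table_info(signatures)")
     columns = [row[1] for row in cur.fetchall()]
     if 'is_valid' not in columns:
         cur.execute("ALTER TABLE signatures ADD COLUMN is_valid INTEGER DEFAULT 1")
         print("已添加 is_valid 欄位")
     if 'assigned_accountant' not in columns:
         cur.execute("ALTER TABLE signatures ADD COLUMN assigned_accountant TEXT")
         print("已添加 assigned_accountant 欄位")
     # build_accountants_table() writes signatures.accountant_id, so make sure it exists too
     if 'accountant_id' not in columns:
         cur.execute("ALTER TABLE signatures ADD COLUMN accountant_id INTEGER")
         print("已添加 accountant_id 欄位")
     conn.commit()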
 def create_accountants_table(conn):
     """Create the accountants table."""
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS accountants (
            accountant_id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT UNIQUE NOT NULL,
            signature_count INTEGER DEFAULT 0,
            firm TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    print("accountants 表已建立")
 def get_pdf_signatures(conn):
     """Fetch the signature records for each PDF."""
    cur = conn.cursor()
    cur.execute("""
        SELECT s.signature_id, s.source_pdf, s.page_number, s.accountant_name,
               s.excel_accountant1, s.excel_accountant2, s.excel_firm,
               sb.x, sb.y, sb.width, sb.height
        FROM signatures s
        LEFT JOIN signature_boxes sb ON s.signature_id = sb.signature_id
        ORDER BY s.source_pdf, s.page_number, sb.y
    """)
    pdf_sigs = defaultdict(list)
    for row in cur.fetchall():
        pdf_sigs[row['source_pdf']].append(dict(row))
    return pdf_sigs
 def normalize_name(name):
     """Normalise a name (Simplified to Traditional Chinese)."""
    if not name:
        return None
    return cc_s2t.convert(name)
 def names_match(ocr_name, excel_name):
     """Check whether an OCR name matches an Excel name."""
     if not ocr_name or not excel_name:
         return False
     # Exact match
     if ocr_name == excel_name:
         return True
     # Match after Simplified-to-Traditional conversion
     ocr_trad = normalize_name(ocr_name)
     if ocr_trad == excel_name:
         return True
     return False
 def score_signature(sig, excel_acc1, excel_acc2):
     """Score a signature candidate."""
     score = 0
     ocr_name = sig.get('accountant_name', '')
     # 1. OCR name matches an Excel accountant (+100)
     if names_match(ocr_name, excel_acc1) or names_match(ocr_name, excel_acc2):
         score += 100
     # 2. Plausible size (+20)
     width = sig.get('width', 0) or 0
     height = sig.get('height', 0) or 0
     if 30 < width < 500 and 20 < height < 200:
         score += 20
     # 3. Page position: a larger Y coordinate scores higher (up to +15)
     y = sig.get('y', 0) or 0
     score += min(y / 100, 15)
     # 4. Penalise oversized boxes, which may be stamps (-30)
     if width > 300 or height > 150:
         score -= 30
     return score
 def select_best_two(signatures, excel_acc1, excel_acc2):
     """Select the two best-scoring signatures."""
     if len(signatures) <= 2:
         return signatures
     scored = []
     for sig in signatures:
         score = score_signature(sig, excel_acc1, excel_acc2)
         scored.append((sig, score))
     # Sort by score, highest first
     scored.sort(key=lambda x: -x[1])
     # Keep the top two
     return [s[0] for s in scored[:2]]
 def assign_to_accountant(sig1, sig2, excel_acc1, excel_acc2):
     """Assign two signatures to the two accountants."""
     ocr1 = sig1.get('accountant_name', '')
     ocr2 = sig2.get('accountant_name', '')
     # Method A: match by OCR name
    if names_match(ocr1, excel_acc1):
        return [(sig1, excel_acc1), (sig2, excel_acc2)]
    elif names_match(ocr1, excel_acc2):
        return [(sig1, excel_acc2), (sig2, excel_acc1)]
    elif names_match(ocr2, excel_acc1):
        return [(sig1, excel_acc2), (sig2, excel_acc1)]
    elif names_match(ocr2, excel_acc2):
        return [(sig1, excel_acc1), (sig2, excel_acc2)]
     # Method B: fall back to the Y coordinate (assumes accountant 1 signs on top)
    y1 = sig1.get('y', 0) or 0
    y2 = sig2.get('y', 0) or 0
    if y1 <= y2:
        return [(sig1, excel_acc1), (sig2, excel_acc2)]
    else:
        return [(sig1, excel_acc2), (sig2, excel_acc1)]
 def process_all_pdfs(conn):
     """Process all PDFs."""
    print("正在載入簽名資料...")
    pdf_sigs = get_pdf_signatures(conn)
    print(f"共 {len(pdf_sigs)} 份 PDF")
    cur = conn.cursor()
    stats = {
        'total_pdfs': len(pdf_sigs),
        'sig_count_1': 0,
        'sig_count_2': 0,
        'sig_count_gt2': 0,
        'valid_signatures': 0,
        'invalid_signatures': 0,
        'ocr_matched': 0,
        'y_coordinate_assigned': 0,
        'no_excel_data': 0,
    }
    assignments = []  # (signature_id, assigned_accountant, is_valid)
    for pdf_name, sigs in pdf_sigs.items():
        sig_count = len(sigs)
        excel_acc1 = sigs[0].get('excel_accountant1') if sigs else None
        excel_acc2 = sigs[0].get('excel_accountant2') if sigs else None
        if not excel_acc1 and not excel_acc2:
             # No Excel data
            stats['no_excel_data'] += 1
            for sig in sigs:
                assignments.append((sig['signature_id'], None, 1))
            continue
        if sig_count == 1:
            stats['sig_count_1'] += 1
             # Only one signature: keep it, though the accountant may not be identifiable
            sig = sigs[0]
            ocr_name = sig.get('accountant_name', '')
            if names_match(ocr_name, excel_acc1):
                assignments.append((sig['signature_id'], excel_acc1, 1))
                stats['ocr_matched'] += 1
            elif names_match(ocr_name, excel_acc2):
                assignments.append((sig['signature_id'], excel_acc2, 1))
                stats['ocr_matched'] += 1
            else:
                 # Cannot tell; leave unassigned for now
                assignments.append((sig['signature_id'], None, 1))
            stats['valid_signatures'] += 1
        elif sig_count == 2:
            stats['sig_count_2'] += 1
             # Normal case
            sig1, sig2 = sigs[0], sigs[1]
            pairs = assign_to_accountant(sig1, sig2, excel_acc1, excel_acc2)
            for sig, acc in pairs:
                assignments.append((sig['signature_id'], acc, 1))
                stats['valid_signatures'] += 1
                 # Record how the assignment was made
                ocr_name = sig.get('accountant_name', '')
                if names_match(ocr_name, acc):
                    stats['ocr_matched'] += 1
                else:
                    stats['y_coordinate_assigned'] += 1
        else:
            stats['sig_count_gt2'] += 1
             # More than two signatures: filter down to the best two
            best_two = select_best_two(sigs, excel_acc1, excel_acc2)
             # Mark valid / invalid
            valid_ids = {s['signature_id'] for s in best_two}
            for sig in sigs:
                if sig['signature_id'] in valid_ids:
                    is_valid = 1
                    stats['valid_signatures'] += 1
                else:
                    is_valid = 0
                    stats['invalid_signatures'] += 1
                    assignments.append((sig['signature_id'], None, is_valid))
             # Assign the two valid signatures
            if len(best_two) == 2:
                sig1, sig2 = best_two[0], best_two[1]
                pairs = assign_to_accountant(sig1, sig2, excel_acc1, excel_acc2)
                for sig, acc in pairs:
                    assignments.append((sig['signature_id'], acc, 1))
                    ocr_name = sig.get('accountant_name', '')
                    if names_match(ocr_name, acc):
                        stats['ocr_matched'] += 1
                    else:
                        stats['y_coordinate_assigned'] += 1
            elif len(best_two) == 1:
                sig = best_two[0]
                ocr_name = sig.get('accountant_name', '')
                if names_match(ocr_name, excel_acc1):
                    assignments.append((sig['signature_id'], excel_acc1, 1))
                elif names_match(ocr_name, excel_acc2):
                    assignments.append((sig['signature_id'], excel_acc2, 1))
                else:
                    assignments.append((sig['signature_id'], None, 1))
     # Bulk-update the database
    print(f"正在更新 {len(assignments)} 筆簽名...")
    for sig_id, acc, is_valid in assignments:
        cur.execute("""
            UPDATE signatures
            SET assigned_accountant = ?, is_valid = ?
            WHERE signature_id = ?
        """, (acc, is_valid, sig_id))
    conn.commit()
    return stats
 def build_accountants_table(conn):
     """Populate the accountants table."""
     cur = conn.cursor()
     # Clear existing rows
     cur.execute("DELETE FROM accountants")
     # Count signatures per (accountant, firm) so the most common firm can be chosen below
     cur.execute("""
         SELECT assigned_accountant, excel_firm, COUNT(*) as cnt
         FROM signatures
         WHERE assigned_accountant IS NOT NULL AND is_valid = 1
         GROUP BY assigned_accountant, excel_firm
     """)
    accountants = {}
    for row in cur.fetchall():
        name = row[0]
        firm = row[1]
        count = row[2]
        if name not in accountants:
            accountants[name] = {'count': 0, 'firms': defaultdict(int)}
        accountants[name]['count'] += count
        if firm:
            accountants[name]['firms'][firm] += count
     # Insert into the accountants table
     for name, data in accountants.items():
         # Pick the most common firm
        main_firm = None
        if data['firms']:
            main_firm = max(data['firms'].items(), key=lambda x: x[1])[0]
        cur.execute("""
            INSERT INTO accountants (name, signature_count, firm)
            VALUES (?, ?, ?)
        """, (name, data['count'], main_firm))
    conn.commit()
     # Back-fill accountant_id on signatures
    cur.execute("""
        UPDATE signatures
        SET accountant_id = (
            SELECT accountant_id FROM accountants
            WHERE accountants.name = signatures.assigned_accountant
        )
        WHERE assigned_accountant IS NOT NULL
    """)
    conn.commit()
    return len(accountants)
 def generate_report(stats, accountant_count):
     """Generate the report."""
    report = {
        'generated_at': datetime.now().isoformat(),
        'summary': {
            'total_pdfs': stats['total_pdfs'],
            'pdfs_with_1_sig': stats['sig_count_1'],
            'pdfs_with_2_sigs': stats['sig_count_2'],
            'pdfs_with_gt2_sigs': stats['sig_count_gt2'],
            'pdfs_without_excel': stats['no_excel_data'],
        },
        'signatures': {
            'valid': stats['valid_signatures'],
            'invalid': stats['invalid_signatures'],
            'total': stats['valid_signatures'] + stats['invalid_signatures'],
        },
        'assignment_method': {
            'ocr_matched': stats['ocr_matched'],
            'y_coordinate': stats['y_coordinate_assigned'],
        },
        'accountants': {
            'total_unique': accountant_count,
        }
    }
    # Save JSON
    json_path = f"{REPORT_DIR}/signature_cleanup_report.json"
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(report, f, ensure_ascii=False, indent=2)
    # Save Markdown
    md_path = f"{REPORT_DIR}/signature_cleanup_report.md"
    with open(md_path, 'w', encoding='utf-8') as f:
        f.write("# 簽名清理與歸檔報告\n\n")
        f.write(f"生成時間: {report['generated_at']}\n\n")
        f.write("## PDF 分布\n\n")
        f.write("| 類型 | 數量 |\n")
        f.write("|------|------|\n")
        f.write(f"| 總 PDF 數 | {stats['total_pdfs']} |\n")
        f.write(f"| 1 個簽名 | {stats['sig_count_1']} |\n")
        f.write(f"| 2 個簽名 (正常) | {stats['sig_count_2']} |\n")
        f.write(f"| >2 個簽名 (需篩選) | {stats['sig_count_gt2']} |\n")
        f.write(f"| 無 Excel 資料 | {stats['no_excel_data']} |\n")
        f.write("\n## 簽名統計\n\n")
        f.write("| 類型 | 數量 |\n")
        f.write("|------|------|\n")
        f.write(f"| 有效簽名 | {stats['valid_signatures']} |\n")
        f.write(f"| 無效簽名 (誤判) | {stats['invalid_signatures']} |\n")
        f.write("\n## 歸檔方式\n\n")
        f.write("| 方式 | 數量 |\n")
        f.write("|------|------|\n")
        f.write(f"| OCR 姓名匹配 | {stats['ocr_matched']} |\n")
        f.write(f"| Y 座標推斷 | {stats['y_coordinate_assigned']} |\n")
        f.write(f"\n## 會計師\n\n")
        f.write(f"唯一會計師數: **{accountant_count}**\n")
    print(f"報告已儲存: {json_path}")
    print(f"報告已儲存: {md_path}")
    return report
 def main():
    print("=" * 60)
    print("簽名清理與會計師歸檔")
    print("=" * 60)
    conn = get_connection()
    # 1. Prepare the database
    print("\n[1/4] 準備資料庫...")
    add_columns_if_needed(conn)
    create_accountants_table(conn)
    # 2. Process all PDFs
    print("\n[2/4] 處理 PDF 簽名...")
    stats = process_all_pdfs(conn)
    # 3. Build the accountants table
    print("\n[3/4] 建立會計師表...")
    accountant_count = build_accountants_table(conn)
    # 4. Generate the report
    print("\n[4/4] 生成報告...")
    report = generate_report(stats, accountant_count)
    conn.close()
    print("\n" + "=" * 60)
    print("完成！")
    print("=" * 60)
    print(f"有效簽名: {stats['valid_signatures']}")
    print(f"無效簽名: {stats['invalid_signatures']}")
    print(f"唯一會計師: {accountant_count}")
 if __name__ == '__main__':
    main()
@@ -0,0 +1,272 @@
 #!/usr/bin/env python3
 """
 Stage 3: per-accountant signature similarity analysis
 Compares each accountant's signatures pairwise to flag possible copy-paste reuse.
 """
 import sqlite3
 import numpy as np
 import json
 from collections import defaultdict
 from datetime import datetime
 from tqdm import tqdm
 DB_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 FEATURES_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/features/signature_features.npy'
 REPORT_DIR = '/Volumes/NV2/PDF-Processing/signature-analysis/reports'
 def load_data():
    """Load the feature vectors and the accountant assignments."""
    print("載入特徵向量...")
    features = np.load(FEATURES_PATH)
    print(f"特徵矩陣形狀: {features.shape}")
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    # Fetch all signature_ids in order (row i of the feature matrix is assumed to belong to the i-th id)
    cur.execute("SELECT signature_id FROM signatures ORDER BY signature_id")
    all_sig_ids = [row[0] for row in cur.fetchall()]
    sig_id_to_idx = {sig_id: idx for idx, sig_id in enumerate(all_sig_ids)}
    # Fetch the accountant assignment of every valid signature
    cur.execute("""
        SELECT s.signature_id, s.assigned_accountant, s.accountant_id, a.name, a.firm
        FROM signatures s
        LEFT JOIN accountants a ON s.accountant_id = a.accountant_id
        WHERE s.is_valid = 1 AND s.assigned_accountant IS NOT NULL
        ORDER BY s.signature_id
    """)
    acc_signatures = defaultdict(list)
    acc_info = {}
    for row in cur.fetchall():
        sig_id, _, acc_id, acc_name, firm = row
        if acc_id and sig_id in sig_id_to_idx:
            acc_signatures[acc_id].append(sig_id)
            if acc_id not in acc_info:
                acc_info[acc_id] = {'name': acc_name, 'firm': firm}
    conn.close()
    return features, sig_id_to_idx, acc_signatures, acc_info
 def compute_similarity_stats(features, sig_ids, sig_id_to_idx):
    """Compute pairwise similarity statistics for one group of signatures."""
    if len(sig_ids) < 2:
        return None
    # Select the feature rows for this group
    indices = [sig_id_to_idx[sid] for sid in sig_ids]
    feat = features[indices]
    # L2-normalize (all-zero rows are left as zeros)
    norms = np.linalg.norm(feat, axis=1, keepdims=True)
    norms[norms == 0] = 1
    feat_norm = feat / norms
    # Cosine similarity matrix
    sim_matrix = np.dot(feat_norm, feat_norm.T)
    # Upper triangle without the diagonal: each unordered pair once
    upper_tri = sim_matrix[np.triu_indices(len(sim_matrix), k=1)]
    if len(upper_tri) == 0:
        return None
    # Summary statistics
    stats = {
        'total_pairs': len(upper_tri),
        'min_sim': float(upper_tri.min()),
        'max_sim': float(upper_tri.max()),
        'mean_sim': float(upper_tri.mean()),
        'std_sim': float(upper_tri.std()),
        'pairs_gt_90': int((upper_tri > 0.90).sum()),
        'pairs_gt_95': int((upper_tri > 0.95).sum()),
        'pairs_gt_99': int((upper_tri > 0.99).sum()),
    }
    # Share of pairs above each threshold
    stats['ratio_gt_90'] = stats['pairs_gt_90'] / stats['total_pairs']
    stats['ratio_gt_95'] = stats['pairs_gt_95'] / stats['total_pairs']
    stats['ratio_gt_99'] = stats['pairs_gt_99'] / stats['total_pairs']
    return stats
 def analyze_all_accountants(features, sig_id_to_idx, acc_signatures, acc_info):
    """Compute similarity statistics for every accountant."""
    results = []
    for acc_id, sig_ids in tqdm(acc_signatures.items(), desc="分析會計師"):
        info = acc_info.get(acc_id, {})
        stats = compute_similarity_stats(features, sig_ids, sig_id_to_idx)
        if stats:
            result = {
                'accountant_id': acc_id,
                'name': info.get('name', ''),
                'firm': info.get('firm', ''),
                'signature_count': len(sig_ids),
                **stats
            }
            results.append(result)
    return results
 def classify_risk(result):
    """Assign a risk level from the pairwise similarity statistics."""
    ratio_95 = result.get('ratio_gt_95', 0)
    ratio_99 = result.get('ratio_gt_99', 0)
    mean_sim = result.get('mean_sim', 0)
    # High risk: a large share of near-identical pairs
    if ratio_99 > 0.05 or ratio_95 > 0.3:
        return 'high'
    # Medium risk
    elif ratio_95 > 0.1 or mean_sim > 0.85:
        return 'medium'
    # Low risk
    else:
        return 'low'
 def save_results(results, acc_signatures):
    """Save the results as JSON and Markdown."""
    # Assign a risk level to each accountant
    for r in results:
        r['risk_level'] = classify_risk(r)
    # Count accountants per risk level
    risk_counts = defaultdict(int)
    for r in results:
        risk_counts[r['risk_level']] += 1
    summary = {
        'generated_at': datetime.now().isoformat(),
        'total_accountants': len(results),
        'risk_distribution': dict(risk_counts),
        'high_risk_count': risk_counts['high'],
        'medium_risk_count': risk_counts['medium'],
        'low_risk_count': risk_counts['low'],
    }
    # Sort by share of pairs above 0.95, then by mean similarity (both descending)
    results_sorted = sorted(results, key=lambda x: (-x.get('ratio_gt_95', 0), -x.get('mean_sim', 0)))
    # Save JSON
    output = {
        'summary': summary,
        'accountants': results_sorted
    }
    json_path = f"{REPORT_DIR}/accountant_similarity_analysis.json"
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    print(f"已儲存: {json_path}")
    # Save the Markdown report
    md_path = f"{REPORT_DIR}/accountant_similarity_analysis.md"
    with open(md_path, 'w', encoding='utf-8') as f:
        f.write("# 會計師簽名相似度分析報告\n\n")
        f.write(f"生成時間: {summary['generated_at']}\n\n")
        f.write("## 摘要\n\n")
        f.write(f"| 指標 | 數值 |\n")
        f.write(f"|------|------|\n")
        f.write(f"| 總會計師數 | {summary['total_accountants']} |\n")
        f.write(f"| 高風險 | {risk_counts['high']} |\n")
        f.write(f"| 中風險 | {risk_counts['medium']} |\n")
        f.write(f"| 低風險 | {risk_counts['low']} |\n")
        f.write("\n## 風險分類標準\n\n")
        f.write("- **高風險**: >5% 的簽名對相似度 >0.99，或 >30% 的簽名對相似度 >0.95\n")
        f.write("- **中風險**: >10% 的簽名對相似度 >0.95，或平均相似度 >0.85\n")
        f.write("- **低風險**: 其他情況\n")
        f.write("\n## 高風險會計師 (Top 30)\n\n")
        f.write("| 排名 | 姓名 | 事務所 | 簽名數 | 平均相似度 | >0.95比例 | >0.99比例 |\n")
        f.write("|------|------|--------|--------|------------|-----------|----------|\n")
        high_risk = [r for r in results_sorted if r['risk_level'] == 'high']
        for i, r in enumerate(high_risk[:30], 1):
            f.write(f"| {i} | {r['name']} | {r['firm'] or '-'} | {r['signature_count']} | ")
            f.write(f"{r['mean_sim']:.3f} | {r['ratio_gt_95']*100:.1f}% | {r['ratio_gt_99']*100:.1f}% |\n")
        f.write("\n## 所有會計師統計分布\n\n")
        # Distribution of per-accountant mean similarity
        mean_sims = [r['mean_sim'] for r in results]
        f.write("### 平均相似度分布\n\n")
        f.write(f"- 最小: {min(mean_sims):.3f}\n")
        f.write(f"- 最大: {max(mean_sims):.3f}\n")
        f.write(f"- 平均: {np.mean(mean_sims):.3f}\n")
        f.write(f"- 中位數: {np.median(mean_sims):.3f}\n")
    print(f"已儲存: {md_path}")
    return summary, results_sorted
 def update_database(results):
    """Update the database with the risk level and similarity statistics."""
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    # Add the columns one at a time, so that an existing column does not
    # prevent the remaining ones from being created
    for col_def in ("risk_level TEXT", "mean_similarity REAL", "ratio_gt_95 REAL"):
        try:
            cur.execute(f"ALTER TABLE accountants ADD COLUMN {col_def}")
        except sqlite3.OperationalError:
            pass  # column already exists
    # Write the values
    for r in results:
        cur.execute("""
            UPDATE accountants
            SET risk_level = ?, mean_similarity = ?, ratio_gt_95 = ?
            WHERE accountant_id = ?
        """, (r['risk_level'], r['mean_sim'], r['ratio_gt_95'], r['accountant_id']))
    conn.commit()
    conn.close()
    print("資料庫已更新")
 def main():
    print("=" * 60)
    print("第三階段：同人簽名聚類分析")
    print("=" * 60)
    # Load the data
    features, sig_id_to_idx, acc_signatures, acc_info = load_data()
    print(f"會計師數: {len(acc_signatures)}")
    # Analyze every accountant
    print("\n開始分析...")
    results = analyze_all_accountants(features, sig_id_to_idx, acc_signatures, acc_info)
    # Save the results
    print("\n儲存結果...")
    summary, results_sorted = save_results(results, acc_signatures)
    # Update the database
    update_database(results_sorted)
    print("\n" + "=" * 60)
    print("完成！")
    print("=" * 60)
    print(f"總會計師: {summary['total_accountants']}")
    print(f"高風險: {summary['high_risk_count']}")
    print(f"中風險: {summary['medium_risk_count']}")
    print(f"低風險: {summary['low_risk_count']}")
 if __name__ == '__main__':
    main()
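A minimal sketch of the pairwise statistic that `compute_similarity_stats` reports, run on made-up 2-D vectors instead of real signature features. The function name `pairwise_cosine_stats` and the toy input are illustrative assumptions and are not part of the script above.

```python
import numpy as np

def pairwise_cosine_stats(feat, threshold=0.95):
    """Return (mean similarity, share of pairs above threshold) for one group.

    Assumes the group has at least two rows, as the script checks before calling.
    """
    norms = np.linalg.norm(feat, axis=1, keepdims=True)
    norms[norms == 0] = 1  # avoid division by zero for all-zero rows
    unit = feat / norms
    sim = unit @ unit.T
    # Upper triangle without the diagonal: each unordered pair counted once
    pairs = sim[np.triu_indices(len(sim), k=1)]
    return float(pairs.mean()), float((pairs > threshold).mean())

# Toy data (hypothetical): two rows pointing the same way, one orthogonal row
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
mean_sim, ratio = pairwise_cosine_stats(toy)
```

Of the three pairs, only the first two rows are near-identical, so one third of the pairs exceed the threshold. A share like this is what the `ratio_gt_95` field feeds into `classify_risk`.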
@@ -0,0 +1,371 @@
 #!/usr/bin/env python3
 """
 Stage 4: PDF signature authenticity verdicts
 Decides for each PDF whether its signatures are hand-signed or copy-pasted.
 """
 import sqlite3
 import numpy as np
 import json
 import csv
 from collections import defaultdict
 from datetime import datetime
 from tqdm import tqdm
 DB_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 FEATURES_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/features/signature_features.npy'
 REPORT_DIR = '/Volumes/NV2/PDF-Processing/signature-analysis/reports'
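 # Thresholds
 THRESHOLD_COPY = 0.95      # at or above this value: classified as copy-paste
 THRESHOLD_AUTHENTIC = 0.85  # at or below this value: classified as hand-signed
 # Anything strictly in between is "uncertain"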
 def load_data():
    """Load the features and the signature metadata."""
    print("載入特徵向量...")
    features = np.load(FEATURES_PATH)
    # L2-normalize so that dot products are cosine similarities
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1
    features_norm = features / norms
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    # Fetch the signature metadata
    cur.execute("""
        SELECT s.signature_id, s.source_pdf, s.assigned_accountant,
               s.excel_accountant1, s.excel_accountant2, s.excel_firm
        FROM signatures s
        WHERE s.is_valid = 1 AND s.assigned_accountant IS NOT NULL
        ORDER BY s.signature_id
    """)
    sig_data = {}
    pdf_signatures = defaultdict(list)
    acc_signatures = defaultdict(list)
    pdf_info = {}
    for row in cur.fetchall():
        sig_id, pdf, acc_name, acc1, acc2, firm = row
        sig_data[sig_id] = {
            'pdf': pdf,
            'accountant': acc_name,
        }
        pdf_signatures[pdf].append((sig_id, acc_name))
        acc_signatures[acc_name].append(sig_id)
        if pdf not in pdf_info:
            pdf_info[pdf] = {
                'accountant1': acc1,
                'accountant2': acc2,
                'firm': firm
            }
    # signature_id -> feature index
    cur.execute("SELECT signature_id FROM signatures ORDER BY signature_id")
    all_sig_ids = [row[0] for row in cur.fetchall()]
    sig_id_to_idx = {sid: idx for idx, sid in enumerate(all_sig_ids)}
    conn.close()
    return features_norm, sig_data, pdf_signatures, acc_signatures, pdf_info, sig_id_to_idx
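 def get_max_similarity_to_others(sig_id, acc_name, acc_signatures, sig_id_to_idx, features_norm):
    """Return the highest similarity between this signature and the same accountant's other signatures, plus the id of that best match."""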
    other_sigs = [s for s in acc_signatures[acc_name] if s != sig_id and s in sig_id_to_idx]
    if not other_sigs:
        return None, None
    idx = sig_id_to_idx[sig_id]
    other_indices = [sig_id_to_idx[s] for s in other_sigs]
    feat = features_norm[idx]
    other_feats = features_norm[other_indices]
    similarities = np.dot(other_feats, feat)
    max_idx = similarities.argmax()
    return float(similarities[max_idx]), other_sigs[max_idx]
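 def classify_signature(max_sim):
    """Classify one signature from its maximum similarity."""
    if max_sim is None:
        return 'unknown'  # cannot decide (no other signature to compare against)
    elif max_sim >= THRESHOLD_COPY:
        return 'copy'     # copy-paste
    elif max_sim <= THRESHOLD_AUTHENTIC:
        return 'authentic'  # hand-signed
    else:
        return 'uncertain'  # in between the two thresholds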
 def classify_pdf(verdicts):
    """Derive the PDF-level verdict from the verdicts of its signatures."""
    if not verdicts:
        return 'unknown'
    # Any copied signature marks the whole PDF as a copy
    if 'copy' in verdicts:
        return 'copy'
    # Every signature is hand-signed
    elif all(v == 'authentic' for v in verdicts):
        return 'authentic'
    # At least one signature is uncertain
    elif 'uncertain' in verdicts:
        return 'uncertain'
    else:
        return 'unknown'
 def analyze_all_pdfs(features_norm, sig_data, pdf_signatures, acc_signatures, pdf_info, sig_id_to_idx):
    """Classify every signature and every PDF."""
    results = []
    for pdf, sigs in tqdm(pdf_signatures.items(), desc="分析 PDF"):
        info = pdf_info.get(pdf, {})
        pdf_result = {
            'pdf': pdf,
            'accountant1': info.get('accountant1', ''),
            'accountant2': info.get('accountant2', ''),
            'firm': info.get('firm', ''),
            'signatures': []
        }
        verdicts = []
        for sig_id, acc_name in sigs:
            max_sim, most_similar_sig = get_max_similarity_to_others(
                sig_id, acc_name, acc_signatures, sig_id_to_idx, features_norm
            )
            verdict = classify_signature(max_sim)
            verdicts.append(verdict)
            pdf_result['signatures'].append({
                'signature_id': sig_id,
                'accountant': acc_name,
                'max_similarity': max_sim,
                'verdict': verdict
            })
        pdf_result['pdf_verdict'] = classify_pdf(verdicts)
        results.append(pdf_result)
    return results
 def generate_statistics(results):
    """Aggregate verdict counts overall and per firm."""
    stats = {
        'total_pdfs': len(results),
        'pdf_verdicts': defaultdict(int),
        'signature_verdicts': defaultdict(int),
        'by_firm': defaultdict(lambda: defaultdict(int))
    }
    for r in results:
        stats['pdf_verdicts'][r['pdf_verdict']] += 1
        firm = r['firm'] or '未知'
        stats['by_firm'][firm][r['pdf_verdict']] += 1
        for sig in r['signatures']:
            stats['signature_verdicts'][sig['verdict']] += 1
    return stats
 def save_results(results, stats):
    """Save the results as JSON, CSV and Markdown."""
    timestamp = datetime.now().isoformat()
    # 1. Save the full JSON
    json_path = f"{REPORT_DIR}/pdf_signature_verdicts.json"
    output = {
        'generated_at': timestamp,
        'thresholds': {
            'copy': THRESHOLD_COPY,
            'authentic': THRESHOLD_AUTHENTIC
        },
        'statistics': {
            'total_pdfs': stats['total_pdfs'],
            'pdf_verdicts': dict(stats['pdf_verdicts']),
            'signature_verdicts': dict(stats['signature_verdicts'])
        },
        'results': results
    }
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    print(f"已儲存: {json_path}")
    # 2. Save the CSV (simplified view)
    csv_path = f"{REPORT_DIR}/pdf_signature_verdicts.csv"
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['PDF', '會計師1', '會計師2', '事務所', '判定結果',
                        '簽名1_會計師', '簽名1_相似度', '簽名1_判定',
                        '簽名2_會計師', '簽名2_相似度', '簽名2_判定'])
        for r in results:
            row = [
                r['pdf'],
                r['accountant1'],
                r['accountant2'],
                r['firm'] or '',
                r['pdf_verdict']
            ]
            for sig in r['signatures'][:2]:  # at most 2 signatures per row
                row.extend([
                    sig['accountant'],
                    f"{sig['max_similarity']:.3f}" if sig['max_similarity'] is not None else '',
                    sig['verdict']
                ])
            # Pad the row to the full 11 columns
            while len(row) < 11:
                row.append('')
            writer.writerow(row)
    print(f"已儲存: {csv_path}")
    # 3. Save the Markdown report
    md_path = f"{REPORT_DIR}/pdf_signature_verdict_report.md"
    with open(md_path, 'w', encoding='utf-8') as f:
        f.write("# PDF 簽名真偽判定報告\n\n")
        f.write(f"生成時間: {timestamp}\n\n")
        f.write("## 判定標準\n\n")
        f.write(f"- **複製貼上 (copy)**: 與同一會計師其他簽名相似度 ≥ {THRESHOLD_COPY}\n")
        f.write(f"- **親簽 (authentic)**: 與同一會計師其他簽名相似度 ≤ {THRESHOLD_AUTHENTIC}\n")
        f.write(f"- **不確定 (uncertain)**: 相似度介於 {THRESHOLD_AUTHENTIC} ~ {THRESHOLD_COPY}\n")
        f.write(f"- **無法判定 (unknown)**: 該會計師只有此一份簽名，無法比對\n\n")
        f.write("## 整體統計\n\n")
        f.write("### PDF 判定結果\n\n")
        f.write("| 判定 | 數量 | 百分比 |\n")
        f.write("|------|------|--------|\n")
        total = stats['total_pdfs']
        for verdict in ['copy', 'uncertain', 'authentic', 'unknown']:
            count = stats['pdf_verdicts'].get(verdict, 0)
            pct = count / total * 100 if total > 0 else 0
            label = {
                'copy': '複製貼上',
                'authentic': '親簽',
                'uncertain': '不確定',
                'unknown': '無法判定'
            }.get(verdict, verdict)
            f.write(f"| {label} | {count:,} | {pct:.1f}% |\n")
        f.write(f"\n**總計: {total:,} 份 PDF**\n")
        f.write("\n### 簽名判定結果\n\n")
        f.write("| 判定 | 數量 | 百分比 |\n")
        f.write("|------|------|--------|\n")
        sig_total = sum(stats['signature_verdicts'].values())
        for verdict in ['copy', 'uncertain', 'authentic', 'unknown']:
            count = stats['signature_verdicts'].get(verdict, 0)
            pct = count / sig_total * 100 if sig_total > 0 else 0
            label = {
                'copy': '複製貼上',
                'authentic': '親簽',
                'uncertain': '不確定',
                'unknown': '無法判定'
            }.get(verdict, verdict)
            f.write(f"| {label} | {count:,} | {pct:.1f}% |\n")
        f.write(f"\n**總計: {sig_total:,} 個簽名**\n")
        f.write("\n### 按事務所統計\n\n")
        f.write("| 事務所 | 複製貼上 | 不確定 | 親簽 | 無法判定 | 總計 |\n")
        f.write("|--------|----------|--------|------|----------|------|\n")
        # Sort firms by total PDF count, descending
        firms_sorted = sorted(stats['by_firm'].items(),
                             key=lambda x: sum(x[1].values()), reverse=True)
        for firm, verdicts in firms_sorted[:20]:
            copy_n = verdicts.get('copy', 0)
            uncertain_n = verdicts.get('uncertain', 0)
            authentic_n = verdicts.get('authentic', 0)
            unknown_n = verdicts.get('unknown', 0)
            total_n = copy_n + uncertain_n + authentic_n + unknown_n
            f.write(f"| {firm} | {copy_n:,} | {uncertain_n:,} | {authentic_n:,} | {unknown_n:,} | {total_n:,} |\n")
    print(f"已儲存: {md_path}")
    return stats
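 def update_database(results):
    """Write the per-signature verdicts to the database."""
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    # Add the columns one at a time, so that an existing column does not
    # prevent the other one from being created
    for col_def in ("signature_verdict TEXT", "max_similarity_to_same_accountant REAL"):
        try:
            cur.execute(f"ALTER TABLE signatures ADD COLUMN {col_def}")
        except sqlite3.OperationalError:
            pass  # column already exists
    # Write the values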
    for r in results:
        for sig in r['signatures']:
            cur.execute("""
                UPDATE signatures
                SET signature_verdict = ?, max_similarity_to_same_accountant = ?
                WHERE signature_id = ?
            """, (sig['verdict'], sig['max_similarity'], sig['signature_id']))
    conn.commit()
    conn.close()
    print("資料庫已更新")
 def main():
    print("=" * 60)
    print("第四階段：PDF 簽名真偽判定")
    print("=" * 60)
    # Load the data
    features_norm, sig_data, pdf_signatures, acc_signatures, pdf_info, sig_id_to_idx = load_data()
    print(f"PDF 數: {len(pdf_signatures)}")
    print(f"有效簽名: {len(sig_data)}")
    # Analyze every PDF
    print("\n開始分析...")
    results = analyze_all_pdfs(
        features_norm, sig_data, pdf_signatures, acc_signatures, pdf_info, sig_id_to_idx
    )
    # Aggregate statistics
    stats = generate_statistics(results)
    # Save the results
    print("\n儲存結果...")
    save_results(results, stats)
    # Update the database
    update_database(results)
    print("\n" + "=" * 60)
    print("完成！")
    print("=" * 60)
    print(f"\nPDF 判定結果:")
    print(f"  複製貼上: {stats['pdf_verdicts'].get('copy', 0):,}")
    print(f"  不確定: {stats['pdf_verdicts'].get('uncertain', 0):,}")
    print(f"  親簽: {stats['pdf_verdicts'].get('authentic', 0):,}")
    print(f"  無法判定: {stats['pdf_verdicts'].get('unknown', 0):,}")
 if __name__ == '__main__':
    main()
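The per-signature rule above (best match among the same accountant's other signatures, then two thresholds) can be sketched on toy unit vectors as follows. The helper name `verdict_for` and the toy matrix are illustrative assumptions. Only the two threshold values are taken from the script.

```python
import numpy as np

THRESHOLD_COPY = 0.95       # same values as the script above
THRESHOLD_AUTHENTIC = 0.85

def verdict_for(index, unit_feats):
    """Classify one signature by its best match among the other rows.

    unit_feats is assumed to hold L2-normalized rows of a single accountant.
    """
    if len(unit_feats) < 2:
        return 'unknown'  # nothing to compare against
    sims = unit_feats @ unit_feats[index]
    sims = np.delete(sims, index)  # drop the self-comparison
    best = float(sims.max())
    if best >= THRESHOLD_COPY:
        return 'copy'
    if best <= THRESHOLD_AUTHENTIC:
        return 'authentic'
    return 'uncertain'

# Toy data (hypothetical): rows 0 and 1 are identical, row 2 is unrelated
toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
verdicts = [verdict_for(i, toy) for i in range(len(toy))]
```

Because the rule looks only at the single best match, one duplicate elsewhere in the accountant's history is enough to flag a signature as a copy.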
@@ -0,0 +1,319 @@
 #!/usr/bin/env python3
 """
 Compute SSIM and pHash for all signature pairs (closest match per accountant).
 Uses multiprocessing for parallel image loading and computation.
 Saves results to database and outputs complete CSV.
 """
 import sqlite3
 import numpy as np
 import cv2
 import os
 import sys
 import json
 import csv
 import time
 from datetime import datetime
 from collections import defaultdict
 from multiprocessing import Pool, cpu_count
 from pathlib import Path
 DB_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 IMAGE_DIR = '/Volumes/NV2/PDF-Processing/yolo-signatures/images'
 OUTPUT_CSV = '/Volumes/NV2/PDF-Processing/signature-analysis/reports/complete_pdf_report.csv'
 CHECKPOINT_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/ssim_checkpoint.json'
 NUM_WORKERS = max(1, cpu_count() - 2)  # Leave 2 cores free
 BATCH_SIZE = 1000
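 def compute_phash(img, hash_size=8):
    """Compute a difference hash (dHash): one bit per horizontally adjacent pixel pair.

    Despite the name, this is not a DCT-based pHash. The function name and the
    phash_* column names are kept as they are for compatibility.
    """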
    resized = cv2.resize(img, (hash_size + 1, hash_size))
    diff = resized[:, 1:] > resized[:, :-1]
    return diff.flatten()
 def compute_pair_ssim(args):
    """Compute SSIM, pHash, histogram correlation for a pair of images."""
    sig_id, file1, file2, cosine_sim = args
    path1 = os.path.join(IMAGE_DIR, file1)
    path2 = os.path.join(IMAGE_DIR, file2)
    result = {
        'signature_id': sig_id,
        'match_file': file2,
        'cosine_similarity': cosine_sim,
        'ssim': None,
        'phash_distance': None,
        'histogram_corr': None,
        'pixel_identical': False,
    }
    try:
        img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)
        if img1 is None or img2 is None:
            return result
        # Resize to same dimensions
        h = min(img1.shape[0], img2.shape[0])
        w = min(img1.shape[1], img2.shape[1])
        if h < 3 or w < 3:
            return result
        img1_r = cv2.resize(img1, (w, h))
        img2_r = cv2.resize(img2, (w, h))
        # Pixel identical check
        result['pixel_identical'] = bool(np.array_equal(img1_r, img2_r))
        # SSIM
        try:
            from skimage.metrics import structural_similarity as ssim
            win_size = min(7, min(h, w))
            if win_size % 2 == 0:
                win_size -= 1
            if win_size >= 3:
                result['ssim'] = float(ssim(img1_r, img2_r, win_size=win_size))
            else:
                result['ssim'] = None
        except Exception:
            result['ssim'] = None
        # Histogram correlation
        hist1 = cv2.calcHist([img1_r], [0], None, [256], [0, 256])
        hist2 = cv2.calcHist([img2_r], [0], None, [256], [0, 256])
        result['histogram_corr'] = float(cv2.compareHist(hist1, hist2, cv2.HISTCMP_CORREL))
        # pHash distance
        h1 = compute_phash(img1_r)
        h2 = compute_phash(img2_r)
        result['phash_distance'] = int(np.sum(h1 != h2))
    except Exception:
        # Best effort: on any failure, return whatever metrics were computed so far
        pass
    return result
 def load_checkpoint():
    """Load checkpoint of already processed signature IDs."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, 'r') as f:
            data = json.load(f)
            return set(data.get('processed_ids', []))
    return set()
 def save_checkpoint(processed_ids):
    """Save checkpoint."""
    with open(CHECKPOINT_PATH, 'w') as f:
        json.dump({'processed_ids': list(processed_ids), 'timestamp': str(datetime.now())}, f)
 def main():
    start_time = time.time()
    print("=" * 70)
    print("SSIM & pHash Computation for All Signature Pairs")
    print(f"Workers: {NUM_WORKERS}")
    print("=" * 70)
    # --- Step 1: Load data ---
    print("\n[1/4] Loading data from database...")
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    cur.execute('''
        SELECT signature_id, image_filename, assigned_accountant, feature_vector
        FROM signatures
        WHERE feature_vector IS NOT NULL AND assigned_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    sig_ids = []
    filenames = []
    accountants = []
    features = []
    for row in rows:
        sig_ids.append(row[0])
        filenames.append(row[1])
        accountants.append(row[2])
        features.append(np.frombuffer(row[3], dtype=np.float32))
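    features = np.array(features)
    # L2-normalize so the dot products below are cosine similarities
    # (a no-op if the stored vectors are already unit length)
    if features.ndim == 2:
        norms = np.linalg.norm(features, axis=1, keepdims=True)
        norms[norms == 0] = 1
        features = features / norms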
    print(f"  Loaded {len(sig_ids)} signatures")
    # --- Step 2: Find closest match per signature ---
    print("\n[2/4] Finding closest match per signature (same accountant)...")
    acct_groups = defaultdict(list)
    for i, acct in enumerate(accountants):
        acct_groups[acct].append(i)
    # Load checkpoint
    processed_ids = load_checkpoint()
    print(f"  Checkpoint: {len(processed_ids)} already processed")
    # Prepare tasks
    tasks = []
    for acct, indices in acct_groups.items():
        if len(indices) < 2:
            continue
        vecs = features[indices]
        sim_matrix = vecs @ vecs.T
        np.fill_diagonal(sim_matrix, -1)  # Exclude self
        for local_i, global_i in enumerate(indices):
            if sig_ids[global_i] in processed_ids:
                continue
            best_local = np.argmax(sim_matrix[local_i])
            best_global = indices[best_local]
            best_sim = float(sim_matrix[local_i, best_local])
            tasks.append((
                sig_ids[global_i],
                filenames[global_i],
                filenames[best_global],
                best_sim
            ))
    print(f"  Tasks to process: {len(tasks)}")
    # --- Step 3: Compute SSIM/pHash in parallel ---
    print(f"\n[3/4] Computing SSIM & pHash ({len(tasks)} pairs, {NUM_WORKERS} workers)...")
    # Add the result columns if they do not exist yet. SQLite raises
    # OperationalError ("duplicate column name") when a column is already there.
    new_columns = [
        ('ssim_to_closest', 'REAL'),
        ('phash_distance_to_closest', 'INTEGER'),
        ('histogram_corr_to_closest', 'REAL'),
        ('pixel_identical_to_closest', 'INTEGER'),
        ('closest_match_file', 'TEXT'),
    ]
    for col_name, col_type in new_columns:
        try:
            cur.execute(f'ALTER TABLE signatures ADD COLUMN {col_name} {col_type}')
        except sqlite3.OperationalError:
            pass
    conn.commit()
    total = len(tasks)
    done = 0
    batch_results = []
    with Pool(NUM_WORKERS) as pool:
        for result in pool.imap_unordered(compute_pair_ssim, tasks, chunksize=50):
            batch_results.append(result)
            done += 1
            if done % BATCH_SIZE == 0 or done == total:
                # Save batch to database
                for r in batch_results:
                    cur.execute('''
                        UPDATE signatures SET
                            ssim_to_closest = ?,
                            phash_distance_to_closest = ?,
                            histogram_corr_to_closest = ?,
                            pixel_identical_to_closest = ?,
                            closest_match_file = ?
                        WHERE signature_id = ?
                    ''', (
                        r['ssim'],
                        r['phash_distance'],
                        r['histogram_corr'],
                        1 if r['pixel_identical'] else 0,
                        r['match_file'],
                        r['signature_id']
                    ))
                    processed_ids.add(r['signature_id'])
                conn.commit()
                save_checkpoint(processed_ids)
                batch_results = []
                elapsed = time.time() - start_time
                rate = done / elapsed
                eta = (total - done) / rate if rate > 0 else 0
                print(f"  {done:,}/{total:,} ({100*done/total:.1f}%) "
                      f"| {rate:.1f} pairs/s | ETA: {eta/60:.1f} min")
    # --- Step 4: Generate complete CSV ---
    print(f"\n[4/4] Generating complete CSV...")
    cur.execute('''
        SELECT
            s.source_pdf,
            s.year_month,
            s.serial_number,
            s.doc_type,
            s.page_number,
            s.sig_index,
            s.image_filename,
            s.assigned_accountant,
            s.excel_accountant1,
            s.excel_accountant2,
            s.excel_firm,
            s.detection_confidence,
            s.signature_verdict,
            s.max_similarity_to_same_accountant,
            s.ssim_to_closest,
            s.phash_distance_to_closest,
            s.histogram_corr_to_closest,
            s.pixel_identical_to_closest,
            s.closest_match_file,
            a.risk_level,
            a.mean_similarity as acct_mean_similarity,
            a.ratio_gt_95 as acct_ratio_gt_95
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        ORDER BY s.source_pdf, s.sig_index
    ''')
    columns = [
        'source_pdf', 'year_month', 'serial_number', 'doc_type',
        'page_number', 'sig_index', 'image_filename',
        'assigned_accountant', 'excel_accountant1', 'excel_accountant2', 'excel_firm',
        'detection_confidence', 'signature_verdict',
        'max_cosine_similarity', 'ssim_to_closest', 'phash_distance_to_closest',
        'histogram_corr_to_closest', 'pixel_identical_to_closest', 'closest_match_file',
        'accountant_risk_level', 'accountant_mean_similarity', 'accountant_ratio_gt_95'
    ]
    with open(OUTPUT_CSV, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        for row in cur:
            writer.writerow(row)
    # Count rows
    cur.execute('SELECT COUNT(*) FROM signatures')
    total_sigs = cur.fetchone()[0]
    cur.execute('SELECT COUNT(DISTINCT source_pdf) FROM signatures')
    total_pdfs = cur.fetchone()[0]
    conn.close()
    elapsed = time.time() - start_time
    print(f"\n{'='*70}")
    print(f"Complete!")
    print(f"  Total signatures: {total_sigs:,}")
    print(f"  Total PDFs: {total_pdfs:,}")
    print(f"  Output: {OUTPUT_CSV}")
    print(f"  Time: {elapsed/60:.1f} minutes")
    print(f"{'='*70}")
    # Clean up checkpoint
    if os.path.exists(CHECKPOINT_PATH):
        os.remove(CHECKPOINT_PATH)
 if __name__ == '__main__':
    main()
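For reference, the hash computed by `compute_phash` compares each pixel with its right-hand neighbour, and two hashes are compared by Hamming distance. The sketch below shows this with NumPy only, on arrays that are already 8 rows by 9 columns, so the `cv2.resize` step is skipped. The names `dhash_bits` and `hamming` and the ramp images are illustrative assumptions.

```python
import numpy as np

def dhash_bits(gray_8x9):
    """Difference hash of an already-resized (8 rows, 9 columns) grayscale array."""
    # One bit per horizontally adjacent pair: is the right pixel brighter?
    return (gray_8x9[:, 1:] > gray_8x9[:, :-1]).flatten()

def hamming(bits_a, bits_b):
    """Number of differing bits between two hashes."""
    return int(np.sum(bits_a != bits_b))

# Toy images (hypothetical): brightness rising left to right, and its mirror image
ramp = np.tile(np.arange(9, dtype=np.uint8), (8, 1))
flipped = ramp[:, ::-1]

same = hamming(dhash_bits(ramp), dhash_bits(ramp))
opposite = hamming(dhash_bits(ramp), dhash_bits(flipped))
```

With `hash_size=8` the hash has 64 bits, so the distance ranges from 0 (identical gradient pattern) to 64 (every gradient reversed).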
@@ -0,0 +1,407 @@
 #!/usr/bin/env python3
 """
 Generate PDF-level aggregated report with multi-method verdicts.
 One row per PDF with all Group A-F columns plus new SSIM/pHash/combined verdicts.
 """
 import sqlite3
 import csv
 import numpy as np
 from datetime import datetime
 from collections import defaultdict
 DB_PATH = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 OUTPUT_CSV = '/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv'
 # Thresholds from statistical analysis
 COSINE_THRESHOLD = 0.95
 COSINE_STATISTICAL = 0.944  # mu + 2*sigma
 KDE_CROSSOVER = 0.838
 SSIM_HIGH = 0.95
 SSIM_MEDIUM = 0.80
 PHASH_IDENTICAL = 0
 PHASH_SIMILAR = 5
 def classify_overall(max_cosine, max_ssim, min_phash, has_pixel_identical):
    """
    Multi-method combined verdict.
    Returns (verdict, confidence_level, n_methods_copy, n_methods_total)
    """
    evidence_copy = 0
    evidence_genuine = 0
    total_methods = 0
    # Method 1: Cosine similarity
    if max_cosine is not None:
        total_methods += 1
        if max_cosine > COSINE_THRESHOLD:
            evidence_copy += 1
        elif max_cosine < KDE_CROSSOVER:
            evidence_genuine += 1
    # Method 2: SSIM
    if max_ssim is not None:
        total_methods += 1
        if max_ssim > SSIM_HIGH:
            evidence_copy += 1
        elif max_ssim < 0.5:
            evidence_genuine += 1
    # Method 3: pHash
    if min_phash is not None:
        total_methods += 1
        if min_phash <= PHASH_IDENTICAL:
            evidence_copy += 1
        elif min_phash > 15:
            evidence_genuine += 1
    # Method 4: Pixel identical
    if has_pixel_identical is not None:
        total_methods += 1
        if has_pixel_identical:
            evidence_copy += 1
    # Decision logic
    if has_pixel_identical:
        verdict = 'definite_copy'
        confidence = 'very_high'
    elif max_ssim is not None and max_ssim > SSIM_HIGH and min_phash is not None and min_phash <= PHASH_SIMILAR:
        verdict = 'definite_copy'
        confidence = 'very_high'
    elif evidence_copy >= 3:
        verdict = 'very_likely_copy'
        confidence = 'high'
    elif evidence_copy >= 2:
        verdict = 'likely_copy'
        confidence = 'medium'
    elif max_cosine is not None and max_cosine > COSINE_THRESHOLD:
        verdict = 'likely_copy'
        confidence = 'medium'
    elif max_cosine is not None and max_cosine > KDE_CROSSOVER:
        verdict = 'uncertain'
        confidence = 'low'
    elif max_cosine is not None and max_cosine <= KDE_CROSSOVER:
        verdict = 'likely_genuine'
        confidence = 'medium'
    else:
        verdict = 'unknown'
        confidence = 'none'
    return verdict, confidence, evidence_copy, total_methods
 def main():
    print("=" * 70)
    print("PDF-Level Aggregated Report Generator")
    print("=" * 70)
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    # Load all signature data grouped by PDF
    print("\n[1/3] Loading signature data...")
    cur.execute('''
        SELECT
            s.source_pdf,
            s.year_month,
            s.serial_number,
            s.doc_type,
            s.page_number,
            s.sig_index,
            s.assigned_accountant,
            s.excel_accountant1,
            s.excel_accountant2,
            s.excel_firm,
            s.detection_confidence,
            s.signature_verdict,
            s.max_similarity_to_same_accountant,
            s.ssim_to_closest,
            s.phash_distance_to_closest,
            s.histogram_corr_to_closest,
            s.pixel_identical_to_closest,
            a.risk_level,
            a.mean_similarity,
            a.ratio_gt_95,
            a.signature_count
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        ORDER BY s.source_pdf, s.sig_index
    ''')
    # Group by PDF
    pdf_data = defaultdict(list)
    for row in cur:
        pdf_data[row[0]].append(row)
    print(f"  {len(pdf_data)} PDFs loaded")
    # Generate PDF-level rows
    print("\n[2/3] Aggregating per-PDF statistics...")
    columns = [
        # Group A: PDF Identity
        'source_pdf', 'year_month', 'serial_number', 'doc_type',
        # Group B: Excel Master Data
        'accountant_1', 'accountant_2', 'firm',
        # Group C: YOLO Detection
        'n_signatures_detected', 'avg_detection_confidence',
        # Group D: Cosine Similarity
        'max_cosine_similarity', 'min_cosine_similarity', 'avg_cosine_similarity',
        # Group E: Verdict (original per-sig)
        'sig1_cosine_verdict', 'sig2_cosine_verdict',
        # Group F: Accountant Risk
        'acct1_name', 'acct1_risk_level', 'acct1_mean_similarity',
        'acct1_ratio_gt_95', 'acct1_total_signatures',
        'acct2_name', 'acct2_risk_level', 'acct2_mean_similarity',
        'acct2_ratio_gt_95', 'acct2_total_signatures',
        # Group G: SSIM (NEW)
        'max_ssim', 'min_ssim', 'avg_ssim',
        'verdict_ssim',
        # Group H: pHash (NEW)
        'min_phash_distance', 'max_phash_distance', 'avg_phash_distance',
        'verdict_phash',
        # Group I: Histogram Correlation (NEW)
        'max_histogram_corr', 'avg_histogram_corr',
        # Group J: Pixel Identity (NEW)
        'has_pixel_identical',
        'verdict_pixel',
        # Group K: Statistical Threshold (NEW)
        'verdict_statistical',  # Based on mu+2sigma (0.944)
        # Group L: KDE Crossover (NEW)
        'verdict_kde',  # Based on KDE crossover (0.838)
        # Group M: Multi-Method Combined (NEW)
        'overall_verdict',
        'confidence_level',
        'n_methods_copy',
        'n_methods_total',
    ]
    rows = []
    for pdf_name, sigs in pdf_data.items():
        # Group A: Identity (from first signature)
        first = sigs[0]
        year_month = first[1]
        serial_number = first[2]
        doc_type = first[3]
        # Group B: Excel data
        excel_acct1 = first[7]
        excel_acct2 = first[8]
        excel_firm = first[9]
        # Group C: Detection
        n_sigs = len(sigs)
        confidences = [s[10] for s in sigs if s[10] is not None]
        avg_conf = np.mean(confidences) if confidences else None
        # Group D: Cosine similarity
        cosines = [s[12] for s in sigs if s[12] is not None]
        max_cosine = max(cosines) if cosines else None
        min_cosine = min(cosines) if cosines else None
        avg_cosine = np.mean(cosines) if cosines else None
        # Group E: Per-sig verdicts
        verdicts = [s[11] for s in sigs]
        sig1_verdict = verdicts[0] if len(verdicts) > 0 else None
        sig2_verdict = verdicts[1] if len(verdicts) > 1 else None
        # Group F: Accountant risk - separate for acct1 and acct2
        # Match by assigned_accountant to excel_accountant1/2
        acct1_info = {'name': None, 'risk': None, 'mean_sim': None, 'ratio': None, 'count': None}
        acct2_info = {'name': None, 'risk': None, 'mean_sim': None, 'ratio': None, 'count': None}
        for s in sigs:
            assigned = s[6]
            if assigned and assigned == excel_acct1 and acct1_info['name'] is None:
                acct1_info = {
                    'name': assigned, 'risk': s[17],
                    'mean_sim': s[18], 'ratio': s[19], 'count': s[20]
                }
            elif assigned and assigned == excel_acct2 and acct2_info['name'] is None:
                acct2_info = {
                    'name': assigned, 'risk': s[17],
                    'mean_sim': s[18], 'ratio': s[19], 'count': s[20]
                }
            elif assigned and acct1_info['name'] is None:
                acct1_info = {
                    'name': assigned, 'risk': s[17],
                    'mean_sim': s[18], 'ratio': s[19], 'count': s[20]
                }
            elif assigned and acct2_info['name'] is None:
                acct2_info = {
                    'name': assigned, 'risk': s[17],
                    'mean_sim': s[18], 'ratio': s[19], 'count': s[20]
                }
        # Group G: SSIM
        ssims = [s[13] for s in sigs if s[13] is not None]
        max_ssim = max(ssims) if ssims else None
        min_ssim = min(ssims) if ssims else None
        avg_ssim = np.mean(ssims) if ssims else None
        if max_ssim is not None:
            if max_ssim > SSIM_HIGH:
                verdict_ssim = 'copy'
            elif max_ssim > SSIM_MEDIUM:
                verdict_ssim = 'suspicious'
            else:
                verdict_ssim = 'genuine'
        else:
            verdict_ssim = None
        # Group H: pHash
        phashes = [s[14] for s in sigs if s[14] is not None]
        min_phash = min(phashes) if phashes else None
        max_phash = max(phashes) if phashes else None
        avg_phash = np.mean(phashes) if phashes else None
        if min_phash is not None:
            if min_phash <= PHASH_IDENTICAL:
                verdict_phash = 'copy'
            elif min_phash <= PHASH_SIMILAR:
                verdict_phash = 'suspicious'
            else:
                verdict_phash = 'genuine'
        else:
            verdict_phash = None
        # Group I: Histogram correlation
        histcorrs = [s[15] for s in sigs if s[15] is not None]
        max_histcorr = max(histcorrs) if histcorrs else None
        avg_histcorr = np.mean(histcorrs) if histcorrs else None
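        # Group J: Pixel identical
        pixel_ids = [s[16] for s in sigs if s[16] is not None]
        # Use None (not False) when no pixel comparison exists, so that
        # classify_overall does not count pixel identity as an available method.
        has_pixel = any(p == 1 for p in pixel_ids) if pixel_ids else None
        verdict_pixel = None if has_pixel is None else ('copy' if has_pixel else 'genuine')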
        # Group K: Statistical threshold (mu+2sigma = 0.944)
        if max_cosine is not None:
            if max_cosine > COSINE_STATISTICAL:
                verdict_stat = 'copy'
            elif max_cosine > KDE_CROSSOVER:
                verdict_stat = 'uncertain'
            else:
                verdict_stat = 'genuine'
        else:
            verdict_stat = None
        # Group L: KDE crossover (0.838)
        if max_cosine is not None:
            if max_cosine > KDE_CROSSOVER:
                verdict_kde = 'above_crossover'
            else:
                verdict_kde = 'below_crossover'
        else:
            verdict_kde = None
        # Group M: Multi-method combined
        overall, confidence, n_copy, n_total = classify_overall(
            max_cosine, max_ssim, min_phash, has_pixel)
        rows.append([
            # A
            pdf_name, year_month, serial_number, doc_type,
            # B
            excel_acct1, excel_acct2, excel_firm,
            # C
            n_sigs, avg_conf,
            # D
            max_cosine, min_cosine, avg_cosine,
            # E
            sig1_verdict, sig2_verdict,
            # F
            acct1_info['name'], acct1_info['risk'], acct1_info['mean_sim'],
            acct1_info['ratio'], acct1_info['count'],
            acct2_info['name'], acct2_info['risk'], acct2_info['mean_sim'],
            acct2_info['ratio'], acct2_info['count'],
            # G
            max_ssim, min_ssim, avg_ssim, verdict_ssim,
            # H
            min_phash, max_phash, avg_phash, verdict_phash,
            # I
            max_histcorr, avg_histcorr,
            # J
            1 if has_pixel else 0, verdict_pixel,
            # K
            verdict_stat,
            # L
            verdict_kde,
            # M
            overall, confidence, n_copy, n_total,
        ])
    # Write CSV
    print(f"\n[3/3] Writing {len(rows)} PDF rows to CSV...")
    with open(OUTPUT_CSV, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)
    conn.close()
    # Print summary statistics
    print(f"\n{'='*70}")
    print("SUMMARY")
    print(f"{'='*70}")
    print(f"Total PDFs: {len(rows):,}")
    # Overall verdict distribution
    verdict_counts = defaultdict(int)
    confidence_counts = defaultdict(int)
    for r in rows:
        verdict_counts[r[-4]] += 1
        confidence_counts[r[-3]] += 1
    print(f"\n--- Overall Verdict Distribution ---")
    for v in ['definite_copy', 'very_likely_copy', 'likely_copy', 'uncertain', 'likely_genuine', 'unknown']:
        c = verdict_counts.get(v, 0)
        print(f"  {v:20s}: {c:>6,} ({100*c/len(rows):5.1f}%)")
    print(f"\n--- Confidence Level Distribution ---")
    for c_level in ['very_high', 'high', 'medium', 'low', 'none']:
        c = confidence_counts.get(c_level, 0)
        print(f"  {c_level:10s}: {c:>6,} ({100*c/len(rows):5.1f}%)")
    # Per-method verdict distribution
    # Column indices: verdict_ssim=27, verdict_phash=31, verdict_pixel=35, verdict_stat=36, verdict_kde=37
    print(f"\n--- Per-Method Verdict Distribution ---")
    for col_idx, method_name in [(27, 'SSIM'), (31, 'pHash'), (35, 'Pixel'), (36, 'Statistical'), (37, 'KDE')]:
        counts = defaultdict(int)
        for r in rows:
            counts[r[col_idx]] += 1
        print(f"\n  {method_name}:")
        for k, v in sorted(counts.items(), key=lambda x: -x[1]):
            print(f"    {str(k):20s}: {v:>6,} ({100*v/len(rows):5.1f}%)")
    # Cross-method agreement
    print(f"\n--- Method Agreement (cosine>0.95 PDFs) ---")
    cosine_copy = [r for r in rows if r[9] is not None and r[9] > COSINE_THRESHOLD]
    if cosine_copy:
        ssim_agree = sum(1 for r in cosine_copy if r[27] == 'copy')
        phash_agree = sum(1 for r in cosine_copy if r[31] == 'copy')
        pixel_agree = sum(1 for r in cosine_copy if r[34] == 1)
        print(f"  PDFs with cosine > 0.95: {len(cosine_copy):,}")
        print(f"  Also SSIM > 0.95:  {ssim_agree:>6,} ({100*ssim_agree/len(cosine_copy):5.1f}%)")
        print(f"  Also pHash = 0:    {phash_agree:>6,} ({100*phash_agree/len(cosine_copy):5.1f}%)")
        print(f"  Also pixel-identical: {pixel_agree:>4,} ({100*pixel_agree/len(cosine_copy):5.1f}%)")
    print(f"\nOutput: {OUTPUT_CSV}")
    print(f"{'='*70}")
 if __name__ == '__main__':
    main()
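
Review note: the summary section above addresses report columns by hard-coded positions (27, 31, 35, 36, 37 and `r[-4]`), which silently break if a column is added or reordered. A small sketch of looking positions up by name is shown below. The helper `count_by_column` and the three-column toy data are hypothetical; only the column names are taken from the script.

```python
from collections import Counter


def count_by_column(columns, rows, name):
    """Count the values of the column called `name` across `rows`."""
    # list.index raises ValueError if the column was renamed or removed,
    # which surfaces a layout change instead of counting the wrong column.
    idx = columns.index(name)
    return Counter(row[idx] for row in rows)


# Toy data for illustration only (not results from the report).
columns = ["source_pdf", "verdict_ssim", "overall_verdict"]
rows = [
    ["a.pdf", "copy", "definite_copy"],
    ["b.pdf", "genuine", "likely_genuine"],
    ["c.pdf", "copy", "likely_copy"],
]
ssim_counts = count_by_column(columns, rows, "verdict_ssim")
```

The same lookup could replace the `(27, 'SSIM'), (31, 'pHash'), ...` pairs with `('verdict_ssim', 'SSIM')`, `('verdict_phash', 'pHash')` and so on, under the assumption that `columns` stays the single source of truth for the row layout.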
@@ -0,0 +1,216 @@
 #!/usr/bin/env python3
 """
 Test PaddleOCR Masking + Region Detection Pipeline
 This script demonstrates:
 1. PaddleOCR detects printed text bounding boxes
 2. Mask out all printed text areas (fill with black)
 3. Detect remaining non-white regions (potential handwriting)
 4. Visualize the results
 """
 import fitz  # PyMuPDF
 import numpy as np
 import cv2
 from pathlib import Path
 from paddleocr_client import create_ocr_client
 # Configuration
 TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
 OUTPUT_DIR = "/Volumes/NV2/PDF-Processing/signature-image-output/mask_test"
 DPI = 300
 # Region detection parameters
 MIN_REGION_AREA = 3000      # Minimum pixels for a region
 MAX_REGION_AREA = 300000    # Maximum pixels for a region
 MIN_ASPECT_RATIO = 0.3      # Minimum width/height ratio
 MAX_ASPECT_RATIO = 15.0     # Maximum width/height ratio
 print("="*80)
 print("PaddleOCR Masking + Region Detection Test")
 print("="*80)
 # Create output directory
 Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
 # Step 1: Connect to PaddleOCR server
 print("\n1. Connecting to PaddleOCR server...")
 try:
    ocr_client = create_ocr_client()
    print(f"   ✅ Connected: {ocr_client.server_url}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 2: Render PDF to image
 print("\n2. Rendering PDF to image...")
 try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    original_image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    if pix.n == 4:  # RGBA
        original_image = cv2.cvtColor(original_image, cv2.COLOR_RGBA2RGB)
    print(f"   ✅ Rendered: {original_image.shape[1]}x{original_image.shape[0]} pixels")
    doc.close()
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 3: Detect printed text with PaddleOCR
 print("\n3. Detecting printed text with PaddleOCR...")
 try:
    text_boxes = ocr_client.get_text_boxes(original_image)
    print(f"   ✅ Detected {len(text_boxes)} text regions")
    # Show some sample boxes
    if text_boxes:
        print("   Sample text boxes (x, y, w, h):")
        for i, box in enumerate(text_boxes[:3]):
            print(f"      {i+1}. {box}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 4: Mask out printed text areas
 print("\n4. Masking printed text areas...")
 try:
    masked_image = original_image.copy()
    # Fill each text box with black
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(masked_image, (x, y), (x + w, y + h), (0, 0, 0), -1)
    print(f"   ✅ Masked {len(text_boxes)} text regions")
    # Save masked image
    masked_path = Path(OUTPUT_DIR) / "01_masked_image.png"
    cv2.imwrite(str(masked_path), cv2.cvtColor(masked_image, cv2.COLOR_RGB2BGR))
    print(f"   📁 Saved: {masked_path}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 5: Detect remaining non-white regions
 print("\n5. Detecting remaining non-white regions...")
 try:
    # Convert to grayscale
    gray = cv2.cvtColor(masked_image, cv2.COLOR_RGB2GRAY)
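    # Threshold to find non-white areas: pixels at or below 250 become foreground.
    # Note: the black-filled text boxes from Step 4 are also below 250, so they
    # appear as foreground here; only the size/aspect filter below can exclude them.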
    _, binary = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY_INV)
    # Apply morphological operations to connect nearby regions
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
    # Find contours
    contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    print(f"   ✅ Found {len(contours)} contours")
    # Filter contours by size and aspect ratio
    potential_regions = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        area = w * h
        aspect_ratio = w / h if h > 0 else 0
        # Check constraints
        if (MIN_REGION_AREA <= area <= MAX_REGION_AREA and
            MIN_ASPECT_RATIO <= aspect_ratio <= MAX_ASPECT_RATIO):
            potential_regions.append({
                'box': (x, y, w, h),
                'area': area,
                'aspect_ratio': aspect_ratio
            })
    print(f"   ✅ Filtered to {len(potential_regions)} potential handwriting regions")
    # Show region details
    if potential_regions:
        print("\n   Detected regions:")
        for i, region in enumerate(potential_regions[:5]):
            x, y, w, h = region['box']
            print(f"      {i+1}. Box: ({x}, {y}, {w}, {h}), "
                  f"Area: {region['area']}, "
                  f"Aspect: {region['aspect_ratio']:.2f}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    import traceback
    traceback.print_exc()
    exit(1)
 # Step 6: Visualize results
 print("\n6. Creating visualizations...")
 try:
    # Visualization 1: Original with text boxes
    vis_original = original_image.copy()
    for (x, y, w, h) in text_boxes:
        cv2.rectangle(vis_original, (x, y), (x + w, y + h), (0, 255, 0), 3)
    vis_original_path = Path(OUTPUT_DIR) / "02_original_with_text_boxes.png"
    cv2.imwrite(str(vis_original_path), cv2.cvtColor(vis_original, cv2.COLOR_RGB2BGR))
    print(f"   📁 Original + text boxes: {vis_original_path}")
    # Visualization 2: Masked image with detected regions
    vis_masked = masked_image.copy()
    for region in potential_regions:
        x, y, w, h = region['box']
        cv2.rectangle(vis_masked, (x, y), (x + w, y + h), (255, 0, 0), 3)
    vis_masked_path = Path(OUTPUT_DIR) / "03_masked_with_regions.png"
    cv2.imwrite(str(vis_masked_path), cv2.cvtColor(vis_masked, cv2.COLOR_RGB2BGR))
    print(f"   📁 Masked + regions: {vis_masked_path}")
    # Visualization 3: Binary threshold result
    binary_path = Path(OUTPUT_DIR) / "04_binary_threshold.png"
    cv2.imwrite(str(binary_path), binary)
    print(f"   📁 Binary threshold: {binary_path}")
    # Visualization 4: Morphed result
    morphed_path = Path(OUTPUT_DIR) / "05_morphed.png"
    cv2.imwrite(str(morphed_path), morphed)
    print(f"   📁 Morphed: {morphed_path}")
    # Extract and save each detected region
    print("\n7. Extracting detected regions...")
    for i, region in enumerate(potential_regions):
        x, y, w, h = region['box']
        # Add padding
        padding = 10
        x_pad = max(0, x - padding)
        y_pad = max(0, y - padding)
        w_pad = min(original_image.shape[1] - x_pad, w + 2*padding)
        h_pad = min(original_image.shape[0] - y_pad, h + 2*padding)
        # Extract region from original image
        region_img = original_image[y_pad:y_pad+h_pad, x_pad:x_pad+w_pad]
        # Save region
        region_path = Path(OUTPUT_DIR) / f"region_{i+1:02d}.png"
        cv2.imwrite(str(region_path), cv2.cvtColor(region_img, cv2.COLOR_RGB2BGR))
        print(f"   📁 Region {i+1}: {region_path}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    import traceback
    traceback.print_exc()
 print("\n" + "="*80)
 print("Test completed!")
 print(f"Results saved to: {OUTPUT_DIR}")
 print("="*80)
 print("\nSummary:")
 print(f"  - Printed text regions detected: {len(text_boxes)}")
 print(f"  - Potential handwriting regions: {len(potential_regions)}")
 print(f"  - Expected signatures: 2 (楊智惠, 張志銘)")
 print("="*80)
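
Review note: in step 7 above, the crop width is computed as `w + 2*padding` from the clamped origin, so a region closer than `padding` to the left or top edge gains extra margin on the opposite side. A sketch of clamping both corners instead is given below. The function `padded_crop_box` is hypothetical and returns corner coordinates rather than the `(x, y, w, h)` tuples used in the script.

```python
def padded_crop_box(x, y, w, h, padding, img_w, img_h):
    """Return (x0, y0, x1, y1) for box (x, y, w, h) grown by `padding`,
    clamped to an image of size img_w x img_h."""
    x0 = max(0, x - padding)
    y0 = max(0, y - padding)
    x1 = min(img_w, x + w + padding)
    y1 = min(img_h, y + h + padding)
    return x0, y0, x1, y1
```

The result can be used directly for NumPy slicing, for example `original_image[y0:y1, x0:x1]`, assuming the image is indexed as rows then columns as in the script.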
@@ -0,0 +1,256 @@
 #!/usr/bin/env python3
 """
 Advanced OpenCV separation based on key observations:
 1. Handwriting is LARGER than printed text
 2. Handwriting strokes are LONGER
 3. Printed Kai-style type is regular; handwriting is messy
 """
 import cv2
 import numpy as np
 from pathlib import Path
 from scipy import ndimage
 # Test image
 TEST_IMAGE = "/Volumes/NV2/PDF-Processing/signature-image-output/paddleocr_improved/signature_02_original.png"
 OUTPUT_DIR = "/Volumes/NV2/PDF-Processing/signature-image-output/opencv_advanced_test"
 print("="*80)
 print("Advanced OpenCV Separation - Size + Stroke Length + Regularity")
 print("="*80)
 Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
 # Load and preprocess
 image = cv2.imread(TEST_IMAGE)
 if image is None:
    print(f"Error: Cannot load image from {TEST_IMAGE}")
    exit(1)
 gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
 _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
 print(f"\nImage: {image.shape[1]}x{image.shape[0]}")
 # Save binary
 cv2.imwrite(str(Path(OUTPUT_DIR) / "00_binary.png"), binary)
 print("\n" + "="*80)
 print("METHOD 3: Comprehensive Feature Analysis")
 print("="*80)
 # Find connected components
 num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
 print(f"\nFound {num_labels - 1} connected components")
 print("\nAnalyzing each component...")
 # Store analysis for each component
 components_analysis = []
 for i in range(1, num_labels):
    x, y, w, h, area = stats[i]
    # Extract component mask
    component_mask = (labels == i).astype(np.uint8) * 255
    # ============================================
    # FEATURE 1: Size (handwriting is larger than printed text)
    # ============================================
    bbox_area = w * h
    font_height = h  # Character height is a good indicator
    # ============================================
    # FEATURE 2: Stroke Length
    # ============================================
    # Skeletonize to get the actual stroke centerline
    from skimage.morphology import skeletonize
    skeleton = skeletonize(component_mask // 255)
    stroke_length = np.sum(skeleton)  # Total length of strokes
    # Stroke length ratio (length relative to area)
    stroke_length_ratio = stroke_length / area if area > 0 else 0
    # ============================================
    # FEATURE 3: Regularity vs Messiness
    # ============================================
    # 3a. Compactness (regular shapes are more compact)
    contours, _ = cv2.findContours(component_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        perimeter = cv2.arcLength(contours[0], True)
        compactness = (4 * np.pi * area) / (perimeter * perimeter) if perimeter > 0 else 0
    else:
        perimeter = 0  # keep defined for the edge-roughness ratio below
        compactness = 0
    # 3b. Solidity (ratio of area to convex hull area)
    if contours:
        hull = cv2.convexHull(contours[0])
        hull_area = cv2.contourArea(hull)
        solidity = area / hull_area if hull_area > 0 else 0
    else:
        solidity = 0
    # 3c. Extent (ratio of area to bounding box area)
    extent = area / bbox_area if bbox_area > 0 else 0
    # 3d. Edge roughness (measure irregularity)
    # More irregular edges = more "messy" = likely handwriting
    edges = cv2.Canny(component_mask, 50, 150)
    edge_pixels = np.sum(edges > 0)
    edge_roughness = edge_pixels / perimeter if perimeter > 0 else 0
    # ============================================
    # CLASSIFICATION LOGIC
    # ============================================
    # Large characters are likely handwriting
    is_large = font_height > 40  # Threshold for "large" characters
    # Long strokes relative to area indicate handwriting
    is_long_stroke = stroke_length_ratio > 0.4  # Handwriting has higher ratio
    # Regular shapes (high compactness, high solidity) = printed
    # Irregular shapes (low compactness, low solidity) = handwriting
    is_irregular = compactness < 0.3 or solidity < 0.7 or extent < 0.5
    # DECISION RULES
    handwriting_score = 0
    # Size-based scoring (important!)
    if font_height > 50:
        handwriting_score += 3  # Very large = likely handwriting
    elif font_height > 35:
        handwriting_score += 2  # Medium-large = possibly handwriting
    elif font_height < 25:
        handwriting_score -= 2  # Small = likely printed
    # Stroke length scoring
    if stroke_length_ratio > 0.5:
        handwriting_score += 2  # Long strokes
    elif stroke_length_ratio > 0.35:
        handwriting_score += 1
    # Regularity scoring (printed Kai-style type is regular, handwriting is messy)
    if is_irregular:
        handwriting_score += 1  # Irregular = handwriting
    else:
        handwriting_score -= 1  # Regular = printed
    # Area scoring
    if area > 2000:
        handwriting_score += 2  # Large area = handwriting
    elif area < 500:
        handwriting_score -= 1  # Small area = printed
    # Final classification
    is_handwriting = handwriting_score > 0
    components_analysis.append({
        'id': i,
        'box': (x, y, w, h),
        'area': area,
        'height': font_height,
        'stroke_length': stroke_length,
        'stroke_ratio': stroke_length_ratio,
        'compactness': compactness,
        'solidity': solidity,
        'extent': extent,
        'edge_roughness': edge_roughness,
        'handwriting_score': handwriting_score,
        'is_handwriting': is_handwriting,
        'mask': component_mask
    })
 # Sort by area (largest first)
 components_analysis.sort(key=lambda c: c['area'], reverse=True)
 # Print analysis
 print("\n" + "-"*80)
 print("Top 10 Components Analysis:")
 print("-"*80)
 print(f"{'ID':<4} {'Area':<6} {'H':<4} {'StrokeLen':<9} {'StrokeR':<7} {'Compact':<7} "
      f"{'Solid':<6} {'Score':<5} {'Type':<12}")
 print("-"*80)
 for i, comp in enumerate(components_analysis[:10]):
    comp_type = "✅ Handwriting" if comp['is_handwriting'] else "❌ Printed"
    print(f"{comp['id']:<4} {comp['area']:<6} {comp['height']:<4} "
          f"{comp['stroke_length']:<9.0f} {comp['stroke_ratio']:<7.3f} "
          f"{comp['compactness']:<7.3f} {comp['solidity']:<6.3f} "
          f"{comp['handwriting_score']:>+5} {comp_type:<12}")
 # Create masks
 handwriting_mask = np.zeros_like(binary)
 printed_mask = np.zeros_like(binary)
 for comp in components_analysis:
    if comp['is_handwriting']:
        handwriting_mask = cv2.bitwise_or(handwriting_mask, comp['mask'])
    else:
        printed_mask = cv2.bitwise_or(printed_mask, comp['mask'])
 # Statistics
 hw_count = sum(1 for c in components_analysis if c['is_handwriting'])
 pr_count = sum(1 for c in components_analysis if not c['is_handwriting'])
 print("\n" + "="*80)
 print("Classification Results:")
 print("="*80)
 print(f"  Handwriting components: {hw_count}")
 print(f"  Printed components: {pr_count}")
 print(f"  Total: {len(components_analysis)}")
 # Apply to original image
 result_handwriting = cv2.bitwise_and(image, image, mask=handwriting_mask)
 result_printed = cv2.bitwise_and(image, image, mask=printed_mask)
 # Save results
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_handwriting_mask.png"), handwriting_mask)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_printed_mask.png"), printed_mask)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_handwriting_result.png"), result_handwriting)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_printed_result.png"), result_printed)
 # Create visualization
 vis_overlay = image.copy()
 vis_overlay[handwriting_mask > 0] = [0, 255, 0]  # Green for handwriting
 vis_overlay[printed_mask > 0] = [0, 0, 255]      # Red for printed
 vis_final = cv2.addWeighted(image, 0.6, vis_overlay, 0.4, 0)
 # Add labels to visualization
 for comp in components_analysis[:15]:  # Label top 15
    x, y, w, h = comp['box']
    cx, cy = x + w//2, y + h//2
    color = (0, 255, 0) if comp['is_handwriting'] else (0, 0, 255)
    label = f"H{comp['handwriting_score']:+d}" if comp['is_handwriting'] else f"P{comp['handwriting_score']:+d}"
    cv2.putText(vis_final, label, (cx-15, cy), cv2.FONT_HERSHEY_SIMPLEX, 0.4, color, 1)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method3_visualization.png"), vis_final)
 print("\n📁 Saved results:")
 print("  - method3_handwriting_mask.png")
 print("  - method3_printed_mask.png")
 print("  - method3_handwriting_result.png")
 print("  - method3_printed_result.png")
 print("  - method3_visualization.png")
 # Calculate content pixels
 hw_pixels = np.count_nonzero(handwriting_mask)
 pr_pixels = np.count_nonzero(printed_mask)
 total_pixels = np.count_nonzero(binary)
 print("\n" + "="*80)
 print("Pixel Distribution:")
 print("="*80)
 print(f"  Total foreground:   {total_pixels:6d} pixels (100.0%)")
 print(f"  Handwriting:        {hw_pixels:6d} pixels ({hw_pixels/total_pixels*100:5.1f}%)")
 print(f"  Printed:            {pr_pixels:6d} pixels ({pr_pixels/total_pixels*100:5.1f}%)")
 print("\n" + "="*80)
 print("Test completed!")
 print(f"Results: {OUTPUT_DIR}")
 print("="*80)
 print("\n📊 Feature Analysis Summary:")
 print("  ✅ Size-based classification: Large characters → Handwriting")
 print("  ✅ Stroke length analysis: Long stroke ratio → Handwriting")
 print("  ✅ Regularity analysis: Irregular shapes → Handwriting")
 print("\nNext: Review visualization to tune thresholds if needed")
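
Review note: the decision rules above are embedded in the component loop, which makes the thresholds hard to tune or test in isolation. The sketch below restates the same additive scoring as a pure function. The thresholds are copied from the script; the function name `handwriting_score` and its signature are hypothetical.

```python
def handwriting_score(height, stroke_ratio, is_irregular, area):
    """Additive score; the script treats values above 0 as handwriting."""
    score = 0
    # Size: taller components lean towards handwriting.
    if height > 50:
        score += 3
    elif height > 35:
        score += 2
    elif height < 25:
        score -= 2
    # Stroke length relative to pixel area.
    if stroke_ratio > 0.5:
        score += 2
    elif stroke_ratio > 0.35:
        score += 1
    # Regularity: irregular shapes count towards handwriting.
    score += 1 if is_irregular else -1
    # Pixel area of the component.
    if area > 2000:
        score += 2
    elif area < 500:
        score -= 1
    return score
```

For example, a component with height 60, stroke ratio 0.6, irregular shape and area 2500 scores 3 + 2 + 1 + 2 = 8, while a small regular one (height 20, ratio 0.2, area 300) scores -2 - 1 - 1 = -4.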
@@ -0,0 +1,272 @@
 #!/usr/bin/env python3
 """
 Test OpenCV methods to separate handwriting from printed text
 Tests two methods:
 1. Stroke Width Analysis
 2. Connected Components + Shape Features
 """
 import cv2
 import numpy as np
 from pathlib import Path
 # Test image - contains both printed and handwritten
 TEST_IMAGE = "/Volumes/NV2/PDF-Processing/signature-image-output/paddleocr_improved/signature_02_original.png"
 OUTPUT_DIR = "/Volumes/NV2/PDF-Processing/signature-image-output/opencv_separation_test"
 print("="*80)
 print("OpenCV Handwriting Separation Test")
 print("="*80)
 # Create output directory
 Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
 # Load image
 print(f"\nLoading test image: {Path(TEST_IMAGE).name}")
 image = cv2.imread(TEST_IMAGE)
 if image is None:
    print(f"Error: Cannot load image from {TEST_IMAGE}")
    exit(1)
 image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
 print(f"Image size: {image.shape[1]}x{image.shape[0]}")
 # Convert to grayscale
 gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
 # Binarize
 _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
 # Save binary for reference
 cv2.imwrite(str(Path(OUTPUT_DIR) / "00_binary.png"), binary)
 print("\n📁 Saved: 00_binary.png")
 print("\n" + "="*80)
 print("METHOD 1: Stroke Width Analysis")
 print("="*80)
 def method1_stroke_width(binary_img, threshold_values=(2.0, 3.0, 4.0, 5.0)):
    """
    Method 1: Separate by stroke width using distance transform
    Args:
        binary_img: Binary image (foreground = 255, background = 0)
        threshold_values: List of distance thresholds to test
    Returns:
        List of (threshold, result_image) tuples
    """
    results = []
    # Calculate distance transform
    dist_transform = cv2.distanceTransform(binary_img, cv2.DIST_L2, 5)
    # Normalize for visualization
    dist_normalized = cv2.normalize(dist_transform, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U)
    results.append(('distance_transform', dist_normalized))
    print("\n  Distance transform statistics:")
    print(f"    Min: {dist_transform.min():.2f}")
    print(f"    Max: {dist_transform.max():.2f}")
    print(f"    Mean: {dist_transform.mean():.2f}")
    print(f"    Median: {np.median(dist_transform):.2f}")
    # Test different thresholds
    print("\n  Testing different stroke width thresholds:")
    for threshold in threshold_values:
        # Pixels with distance > threshold are considered "thick strokes" (handwriting)
        handwriting_mask = (dist_transform > threshold).astype(np.uint8) * 255
        # Count pixels
        total_foreground = np.count_nonzero(binary_img)
        handwriting_pixels = np.count_nonzero(handwriting_mask)
        percentage = (handwriting_pixels / total_foreground * 100) if total_foreground > 0 else 0
        print(f"    Threshold {threshold:.1f}: {handwriting_pixels} pixels ({percentage:.1f}% of foreground)")
        results.append((f'threshold_{threshold:.1f}', handwriting_mask))
    return results
 # Run Method 1
 method1_results = method1_stroke_width(binary, threshold_values=[2.0, 2.5, 3.0, 3.5, 4.0, 5.0])
 # Save Method 1 results
 print("\n  Saving results...")
 for name, result_img in method1_results:
    output_path = Path(OUTPUT_DIR) / f"method1_{name}.png"
    cv2.imwrite(str(output_path), result_img)
    print(f"    📁 {output_path.name}")
 # Apply best threshold result to original image
 best_threshold = 3.0  # Will adjust based on visual inspection
 # Look the mask up by the exact name it was stored under ('threshold_<value:.1f>')
 best_mask = dict(method1_results)[f'threshold_{best_threshold:.1f}']
 # Dilate mask slightly to connect nearby strokes
 kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
 best_mask_dilated = cv2.dilate(best_mask, kernel, iterations=1)
 # Apply to color image
 result_method1 = cv2.bitwise_and(image, image, mask=best_mask_dilated)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method1_final_result.png"), result_method1)
 print(f"\n  📁 Final result: method1_final_result.png (threshold={best_threshold})")
 print("\n" + "="*80)
 print("METHOD 2: Connected Components + Shape Features")
 print("="*80)
 def method2_component_analysis(binary_img, original_img):
    """
    Method 2: Analyze each connected component's shape features
    Printed text characteristics:
    - Regular bounding box (aspect ratio between 0.3 and 3.0)
    - Small to medium size (200-1000 pixels)
    - Compactness is computed and recorded, but not used in the rule below
    Handwriting characteristics:
    - Irregular shapes
    - May be large (connected strokes)
    - Variable aspect ratios
    """
    # Find connected components
    num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_img, connectivity=8)
    print(f"\n  Found {num_labels - 1} connected components")
    # Create masks for different categories
    handwriting_mask = np.zeros_like(binary_img)
    printed_mask = np.zeros_like(binary_img)
    # Analyze each component
    component_info = []
    for i in range(1, num_labels):  # Skip background (0)
        x, y, w, h, area = stats[i]
        # Calculate features
        aspect_ratio = w / h if h > 0 else 0
        perimeter = cv2.arcLength(cv2.findContours((labels == i).astype(np.uint8),
                                                    cv2.RETR_EXTERNAL,
                                                    cv2.CHAIN_APPROX_SIMPLE)[0][0], True)
        compactness = (4 * np.pi * area) / (perimeter * perimeter) if perimeter > 0 else 0
        # Classification logic
        # Printed text: small-to-medium size and a regular aspect ratio
        is_printed = (
            (200 < area < 1000) and              # Small to medium size
            (0.3 < aspect_ratio < 3.0)           # Not too elongated
        )
        # Handwriting: anything not clearly printed. This covers large
        # components, very wide ones (aspect_ratio >= 3.0, e.g. connected
        # strokes), very tall ones (aspect_ratio <= 0.3), and also tiny
        # specks (area <= 200).
        is_handwriting = not is_printed
        component_info.append({
            'id': i,
            'area': area,
            'aspect_ratio': aspect_ratio,
            'compactness': compactness,
            'is_printed': is_printed,
            'is_handwriting': is_handwriting
        })
        # Assign to mask
        if is_handwriting:
            handwriting_mask[labels == i] = 255
        if is_printed:
            printed_mask[labels == i] = 255
    # Print statistics
    print("\n  Component statistics:")
    handwriting_components = [c for c in component_info if c['is_handwriting']]
    printed_components = [c for c in component_info if c['is_printed']]
    print(f"    Handwriting components: {len(handwriting_components)}")
    print(f"    Printed components: {len(printed_components)}")
    # Show top 5 largest components
    print("\n  Top 5 largest components:")
    sorted_components = sorted(component_info, key=lambda c: c['area'], reverse=True)
    for i, comp in enumerate(sorted_components[:5], 1):
        comp_type = "Handwriting" if comp['is_handwriting'] else "Printed"
        print(f"    {i}. Area: {comp['area']:5d}, Aspect: {comp['aspect_ratio']:.2f}, "
              f"Type: {comp_type}")
    return handwriting_mask, printed_mask, component_info
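# Illustrative sketch, not part of the original test: the compactness feature
# computed above is 4*pi*area / perimeter**2. The helper names below are
# hypothetical and use ideal geometric shapes only, so they show the range of
# the feature rather than the values OpenCV would report for real components
# (pixel-grid perimeters from cv2.arcLength will differ somewhat).
import math


def ideal_compactness(area, perimeter):
    """Return 4*pi*area / perimeter**2, or 0.0 when the perimeter is zero."""
    return (4 * math.pi * area) / (perimeter * perimeter) if perimeter > 0 else 0.0


def rectangle_compactness(width, height):
    """Compactness of an ideal width x height rectangle."""
    return ideal_compactness(width * height, 2 * (width + height))


# A circle is the most compact shape (value 1), a square gives pi/4, and a
# long thin bar, similar to a single pen stroke, falls close to 0.
circle_compactness = ideal_compactness(math.pi * 10 ** 2, 2 * math.pi * 10)
square_compactness = rectangle_compactness(20, 20)
thin_bar_compactness = rectangle_compactness(200, 2)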
 # Run Method 2
 handwriting_mask_m2, printed_mask_m2, components = method2_component_analysis(binary, image)
 # Save Method 2 results
 print("\n  Saving results...")
 # Handwriting mask
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_handwriting_mask.png"), handwriting_mask_m2)
 print(f"    📁 method2_handwriting_mask.png")
 # Printed mask
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_printed_mask.png"), printed_mask_m2)
 print(f"    📁 method2_printed_mask.png")
 # Apply to original image
 result_handwriting = cv2.bitwise_and(image, image, mask=handwriting_mask_m2)
 result_printed = cv2.bitwise_and(image, image, mask=printed_mask_m2)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_handwriting_result.png"), result_handwriting)
 print(f"    📁 method2_handwriting_result.png")
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_printed_result.png"), result_printed)
 print(f"    📁 method2_printed_result.png")
 # Create visualization overlay
 # Color code (BGR order, as loaded by cv2.imread): green = handwriting, red = printed
 vis_overlay = image.copy()
 vis_overlay[handwriting_mask_m2 > 0] = [0, 255, 0]  # Green for handwriting
 vis_overlay[printed_mask_m2 > 0] = [0, 0, 255]      # Red for printed
 # Blend with original
 vis_final = cv2.addWeighted(image, 0.6, vis_overlay, 0.4, 0)
 cv2.imwrite(str(Path(OUTPUT_DIR) / "method2_visualization.png"), vis_final)
 print(f"    📁 method2_visualization.png (green=handwriting, red=printed)")
 print("\n" + "="*80)
 print("COMPARISON")
 print("="*80)
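 # Count non-black pixels (gray level > 10). The masked results have a black background,
 # but the original image keeps its page background, so its count includes background pixels.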
 def count_content_pixels(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if len(img.shape) == 3 else img
    return np.count_nonzero(gray > 10)
 original_pixels = count_content_pixels(image)
 method1_pixels = count_content_pixels(result_method1)
 method2_pixels = count_content_pixels(result_handwriting)
 print(f"\nContent pixels retained:")
 print(f"  Original image:     {original_pixels:6d} pixels")
 print(f"  Method 1 (stroke):  {method1_pixels:6d} pixels ({method1_pixels/original_pixels*100:.1f}%)")
 print(f"  Method 2 (component): {method2_pixels:6d} pixels ({method2_pixels/original_pixels*100:.1f}%)")
 print("\n" + "="*80)
 print("Test completed!")
 print(f"Results saved to: {OUTPUT_DIR}")
 print("="*80)
 print("\nNext steps:")
 print("  1. Review the output images")
 print("  2. Check which method better preserves handwriting")
 print("  3. Adjust thresholds if needed")
 print("  4. Choose the best method for production pipeline")
@@ -0,0 +1,102 @@
 #!/usr/bin/env python3
 """Test PaddleOCR on a sample PDF page."""
 import fitz  # PyMuPDF
 from paddleocr import PaddleOCR
 import numpy as np
 from PIL import Image
 import cv2
 from pathlib import Path
 # Configuration
 TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
 DPI = 300
 print("="*80)
 print("Testing PaddleOCR on macOS Apple Silicon")
 print("="*80)
 # Step 1: Render PDF to image
 print("\n1. Rendering PDF to image...")
 try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    if pix.n == 4:  # RGBA
        image = cv2.cvtColor(image, cv2.COLOR_RGBA2RGB)
    print(f"   ✅ Rendered: {image.shape[1]}x{image.shape[0]} pixels")
    doc.close()
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 2: Initialize PaddleOCR
 print("\n2. Initializing PaddleOCR...")
 print("   (First run will download models, may take a few minutes...)")
 try:
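    # Note: these keyword arguments follow the PaddleOCR 3.x constructor, whereas
    # ocr.ocr(image, cls=False) and the result parsing in step 3 follow the 2.x API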
    ocr = PaddleOCR(
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_textline_orientation=False,
        lang='ch'  # Chinese language
    )
    print("   ✅ PaddleOCR initialized successfully")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    import traceback
    traceback.print_exc()
    print("\n   Note: PaddleOCR requires PaddlePaddle backend.")
    print("   If this is a module import error, PaddlePaddle may not support this platform.")
    exit(1)
 # Step 3: Run OCR
 print("\n3. Running OCR to detect printed text...")
 try:
    result = ocr.ocr(image, cls=False)
    if result and result[0]:
        print(f"   ✅ Detected {len(result[0])} text regions")
        # Show first few detections
        print("\n   Sample detections:")
        for i, item in enumerate(result[0][:5]):
            box = item[0]  # Bounding box coordinates
            text = item[1][0]  # Detected text
            confidence = item[1][1]  # Confidence score
            print(f"      {i+1}. Text: '{text}' (confidence: {confidence:.2f})")
            print(f"         Box: {box}")
    else:
        print("   ⚠️  No text detected")
 except Exception as e:
    print(f"   ❌ Error during OCR: {e}")
    import traceback
    traceback.print_exc()
    exit(1)
 # Step 4: Visualize detection
 print("\n4. Creating visualization...")
 try:
    vis_image = image.copy()
    if result and result[0]:
        for item in result[0]:
            box = np.array(item[0], dtype=np.int32)
            cv2.polylines(vis_image, [box], True, (0, 255, 0), 2)
    # Save visualization
    output_path = "/Volumes/NV2/PDF-Processing/signature-image-output/paddleocr_test_detection.png"
    cv2.imwrite(output_path, cv2.cvtColor(vis_image, cv2.COLOR_RGB2BGR))
    print(f"   ✅ Saved visualization: {output_path}")
 except Exception as e:
    print(f"   ❌ Error during visualization: {e}")
 print("\n" + "="*80)
 print("PaddleOCR test completed!")
 print("="*80)
@@ -0,0 +1,81 @@
 #!/usr/bin/env python3
 """Test PaddleOCR client with a real PDF page."""
 import fitz  # PyMuPDF
 import numpy as np
 import cv2
 from paddleocr_client import create_ocr_client
 # Test PDF
 TEST_PDF = "/Volumes/NV2/PDF-Processing/signature-image-output/201301_1324_AI1_page3.pdf"
 DPI = 300
 print("="*80)
 print("Testing PaddleOCR Client with Real PDF")
 print("="*80)
 # Step 1: Connect to server
 print("\n1. Connecting to PaddleOCR server...")
 try:
    client = create_ocr_client()
    print(f"   ✅ Connected: {client.server_url}")
 except Exception as e:
    print(f"   ❌ Connection failed: {e}")
    exit(1)
 # Step 2: Render PDF
 print("\n2. Rendering PDF to image...")
 try:
    doc = fitz.open(TEST_PDF)
    page = doc[0]
    mat = fitz.Matrix(DPI/72, DPI/72)
    pix = page.get_pixmap(matrix=mat)
    image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    if pix.n == 4:  # RGBA
        image = cv2.cvtColor(image, cv2.COLOR_RGBA2RGB)
    print(f"   ✅ Rendered: {image.shape[1]}x{image.shape[0]} pixels")
    doc.close()
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 # Step 3: Run OCR
 print("\n3. Running OCR on image...")
 try:
    results = client.ocr(image)
    print(f"   ✅ OCR successful!")
    print(f"   Found {len(results)} text regions")
    # Show first few results
    if results:
        print("\n   Sample detections:")
        for i, result in enumerate(results[:5]):
            text = result['text']
            confidence = result['confidence']
            print(f"      {i+1}. '{text}' (confidence: {confidence:.2f})")
 except Exception as e:
    print(f"   ❌ OCR failed: {e}")
    import traceback
    traceback.print_exc()
    exit(1)
 # Step 4: Get bounding boxes
 print("\n4. Getting text bounding boxes...")
 try:
    boxes = client.get_text_boxes(image)
    print(f"   ✅ Got {len(boxes)} bounding boxes")
    if boxes:
        print("   Sample boxes (x, y, w, h):")
        for i, box in enumerate(boxes[:3]):
            print(f"      {i+1}. {box}")
 except Exception as e:
    print(f"   ❌ Error: {e}")
    exit(1)
 print("\n" + "="*80)
 print("Test completed successfully!")
 print("="*80)
@@ -0,0 +1,254 @@
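 #!/usr/bin/env python3
 """
 Test the basic functionality of the PP-OCRv5 API
 Goals:
 1. Verify the correct way to call the API
 2. Inspect the complete structure of the returned data
 3. Compare v4 and v5 detection results
 4. Check whether a handwriting classification feature exists
 """
 import sys
 import json
 import pprint
 from pathlib import Path
 # Test image path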
 TEST_IMAGE = "/Volumes/NV2/pdf_recognize/test_images/page_0.png"
 def test_basic_import():
    """Test that PaddleOCR can be imported."""
    print("=" * 60)
    print("Test 1: Basic import")
    print("=" * 60)
    try:
        from paddleocr import PaddleOCR
        print("✅ PaddleOCR imported successfully")
        return True
    except ImportError as e:
        print(f"❌ Import failed: {e}")
        return False
 def test_model_initialization():
    """Test model initialization."""
    print("\n" + "=" * 60)
    print("Test 2: Model initialization")
    print("=" * 60)
    try:
        from paddleocr import PaddleOCR
        print("\nInitializing PP-OCRv5...")
        ocr = PaddleOCR(
            text_detection_model_name="PP-OCRv5_server_det",
            text_recognition_model_name="PP-OCRv5_server_rec",
            use_doc_orientation_classify=False,
            use_doc_unwarping=False,
            use_textline_orientation=False,
        )
        print("✅ Model initialized successfully")
        return ocr
    except Exception as e:
        print(f"❌ Initialization failed: {e}")
        import traceback
        traceback.print_exc()
        return None
 def test_prediction(ocr):
    """Test prediction."""
    print("\n" + "=" * 60)
    print("Test 3: Prediction")
    print("=" * 60)
    if not Path(TEST_IMAGE).exists():
        print(f"❌ Test image not found: {TEST_IMAGE}")
        return None
    try:
        print(f"\nPredicting on image: {TEST_IMAGE}")
        result = ocr.predict(TEST_IMAGE)
        print(f"✅ Prediction succeeded, returned {len(result)} result(s)")
        return result
    except Exception as e:
        print(f"❌ Prediction failed: {e}")
        import traceback
        traceback.print_exc()
        return None
 def analyze_result_structure(result):
    """Analyze the complete structure of the returned result."""
    print("\n" + "=" * 60)
    print("Test 4: Analyze the result structure")
    print("=" * 60)
    if not result:
        print("❌ No result to analyze")
        return
    # Take the first result
    first_result = result[0]
    print("\nResult type:", type(first_result))
    print("Result attributes:", dir(first_result))
    # Check for a json attribute
    if hasattr(first_result, 'json'):
        print("\n✅ Found .json attribute")
        json_data = first_result.json
        print("\nJSON data keys:")
        for key in json_data.keys():
            print(f"  - {key}: {type(json_data[key])}")
        # Look for fields related to handwriting classification
        print("\nLooking for handwriting classification fields...")
        handwriting_related_keys = [
            k for k in json_data.keys()
            if any(word in k.lower() for word in ['handwriting', 'handwritten', 'type', 'class', 'category'])
        ]
        if handwriting_related_keys:
            print(f"✅ Found possibly related fields: {handwriting_related_keys}")
            for key in handwriting_related_keys:
                print(f"  {key}: {json_data[key]}")
        else:
            print("❌ No handwriting classification fields found")
        # Print some of the detection results
        if 'rec_texts' in json_data and json_data['rec_texts']:
            print("\nDetected text (first 5):")
            for i, text in enumerate(json_data['rec_texts'][:5]):
                box = json_data['rec_boxes'][i] if 'rec_boxes' in json_data else None
                score = json_data['rec_scores'][i] if 'rec_scores' in json_data else None
                print(f"  [{i}] Text: {text}")
                print(f"      Score: {score}")
                print(f"      Box: {box}")
        # Save the full JSON to a file (create missing parent directories too)
        output_path = "/Volumes/NV2/pdf_recognize/test_results/pp_ocrv5_result.json"
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(json_data, f, ensure_ascii=False, indent=2, default=str)
        print(f"\n✅ Full result saved to: {output_path}")
        return json_data
    else:
        print("❌ No .json attribute found")
        print("\nPrinting the result directly:")
        pprint.pprint(first_result)
 def compare_with_v4():
    """Compare v4 and v5 results."""
    print("\n" + "=" * 60)
    print("Test 5: Compare v4 and v5")
    print("=" * 60)
    try:
        from paddleocr import PaddleOCR
        # v4
        print("\nInitializing PP-OCRv4...")
        ocr_v4 = PaddleOCR(
            ocr_version="PP-OCRv4",
            use_doc_orientation_classify=False,
        )
        print("Predicting with v4...")
        result_v4 = ocr_v4.predict(TEST_IMAGE)
        json_v4 = result_v4[0].json if hasattr(result_v4[0], 'json') else None
        # v5
        print("\nInitializing PP-OCRv5...")
        ocr_v5 = PaddleOCR(
            text_detection_model_name="PP-OCRv5_server_det",
            text_recognition_model_name="PP-OCRv5_server_rec",
            use_doc_orientation_classify=False,
        )
        print("Predicting with v5...")
        result_v5 = ocr_v5.predict(TEST_IMAGE)
        json_v5 = result_v5[0].json if hasattr(result_v5[0], 'json') else None
        # Compare
        if json_v4 and json_v5:
            print("\nComparison:")
            print(f"  v4 detected {len(json_v4.get('rec_texts', []))} text regions")
            print(f"  v5 detected {len(json_v5.get('rec_texts', []))} text regions")
            # Save the comparison
            comparison = {
                "v4": {
                    "count": len(json_v4.get('rec_texts', [])),
                    "texts": json_v4.get('rec_texts', [])[:10],  # First 10
                    "scores": json_v4.get('rec_scores', [])[:10]
                },
                "v5": {
                    "count": len(json_v5.get('rec_texts', [])),
                    "texts": json_v5.get('rec_texts', [])[:10],
                    "scores": json_v5.get('rec_scores', [])[:10]
                }
            }
            output_path = "/Volumes/NV2/pdf_recognize/test_results/v4_vs_v5_comparison.json"
            # Make sure the output directory exists before writing
            Path(output_path).parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, 'w', encoding='utf-8') as f:
                json.dump(comparison, f, ensure_ascii=False, indent=2, default=str)
            print(f"\n✅ Comparison saved to: {output_path}")
    except Exception as e:
        print(f"❌ Comparison failed: {e}")
        import traceback
        traceback.print_exc()
 def main():
    """Main test flow."""
    print("Starting PP-OCRv5 API tests\n")
    # Test 1: import
    if not test_basic_import():
        print("\n❌ Import failed, cannot continue")
        return
    # Test 2: initialization
    ocr = test_model_initialization()
    if not ocr:
        print("\n❌ Initialization failed, cannot continue")
        return
    # Test 3: prediction
    result = test_prediction(ocr)
    if not result:
        print("\n❌ Prediction failed, cannot continue")
        return
    # Test 4: analyze the result structure
    json_data = analyze_result_structure(result)
    # Test 5: compare v4 and v5
    compare_with_v4()
    print("\n" + "=" * 60)
    print("Tests complete")
    print("=" * 60)
 if __name__ == "__main__":
    main()
@@ -0,0 +1,58 @@
 PP-OCRv5 Detection Results: Detailed Report
 ================================================================================
 Total: 50
 Mean confidence: 0.4579
 Full detection list (index, confidence, box size in pixels, recognized text):
 --------------------------------------------------------------------------------
 [ 0] 0.8783   202x100  KPMG
 [ 1] 0.9936  1931x 62  依本會計師核閱結果，除第三段及第四段所述該等被投資公司財務季報告倘經會計師核閱
 [ 2] 0.9976  2013x 62  ，對第一段所述合併財務季報告可能有所調整之影響外，並未發現第一段所述合併財務季報告
 [ 3] 0.9815  2025x 62  在所有重大方面有違反證券發行人財務報告編製準則及金融監督管理委員會認可之國際會計準
 [ 4] 0.9912  1125x 56  則第三十四號「期中財務報導」而須作修正之情事。
 [ 5] 0.9712   872x 61  安侯建業聯合會計師事務所
 [ 6] 0.9123   174x203  寶
 [ 7] 0.8466   166x179  蓮
 [ 8] 0.0000    36x 18  
 [ 9] 0.9968   175x193  周
 [10] 0.0000    33x 69  
 [11] 0.2521     7x 12  5
 [12] 0.0000    35x 13  
 [13] 0.0000    28x 10  
 [14] 0.4726    12x  9  vA
 [15] 0.1788     9x 11  上
 [16] 0.0000    38x 14  
 [17] 0.4133    21x  8  R-
 [18] 0.4681    15x  8  40
 [19] 0.0000    38x 13  
 [20] 0.5587    16x  7  GAN
 [21] 0.9623   291x 61  會計師：
 [22] 0.9893   213x234  魏
 [23] 0.1751   190x174  興
 [24] 0.8862   180x191  海
 [25] 0.0000    65x 17  
 [26] 0.5110    27x  7  U
 [27] 0.1669    10x  8  2
 [28] 0.4839    39x 10  eredooos
 [29] 0.1775    10x 24  B
 [30] 0.4896    29x 10  n
 [31] 0.3774     7x  7  1
 [32] 0.0000    34x 14  
 [33] 0.0000     7x 15  
 [34] 0.0000    12x 38  
 [35] 0.8701    22x 11  0
 [36] 0.2034     8x 23  40
 [37] 0.0000    20x 12  
 [38] 0.0000    29x 10  
 [39] 0.0970     9x 10  m
 [40] 0.3102    20x  7  A
 [41] 0.0000    34x  6  
 [42] 0.2435    21x  6  专
 [43] 0.3260    41x 15  o
 [44] 0.0000    31x  7  
 [45] 0.9769   960x 73  證券主管機關．金管證六字第0940100754號
 [46] 0.9747   899x 60  核准簽證文號(88)台財證(六)第18311號
 [47] 0.9205   824x 67  民國一〇二年五月二
 [48] 0.9996    47x 46  日
 [49] 0.8414   173x 62  ~3-1~
@@ -0,0 +1,20 @@
 PP-OCRv5 Full Pipeline Test Results
 ============================================================
 1. OCR detection: 50 text regions
 2. Mask printed text: /Volumes/NV2/pdf_recognize/test_results/v5_pipeline/01_masked.png
 3. Candidate regions detected: 7
 4. Signatures extracted: 7
 Candidate region details:
 ------------------------------------------------------------
 Region 1: position (1218, 877), size 1144x511, area=584584
 Region 2: position (1213, 1457), size 961x196, area=188356
 Region 3: position (228, 386), size 2028x209, area=423852
 Region 4: position (330, 310), size 1932x63, area=121716
 Region 5: position (1990, 945), size 375x212, area=79500
 Region 6: position (327, 145), size 203x101, area=20503
 Region 7: position (1139, 3289), size 174x63, area=10962
 All results saved in: /Volumes/NV2/pdf_recognize/test_results/v5_pipeline
@@ -0,0 +1,290 @@
 #!/usr/bin/env python3
 """
 Run the full signature extraction pipeline with PaddleOCR v2.7.3 (v4)
 for comparison against v5
 """
 import sys
 import json
 import cv2
 import numpy as np
 import requests
 from pathlib import Path
 # Configuration
 OCR_SERVER = "http://192.168.30.36:5555"
 OUTPUT_DIR = Path("/Volumes/NV2/pdf_recognize/signature-comparison/v4-current")
 MASKING_PADDING = 0
 def setup_output_dir():
    """Create the output directory."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    print(f"Output directory: {OUTPUT_DIR}")
 def get_page_image():
    """Load the test page image."""
    test_image = "/Volumes/NV2/pdf_recognize/full_page_original.png"
    if Path(test_image).exists():
        return cv2.imread(test_image)
    else:
        print(f"❌ Test image not found: {test_image}")
        return None
 def call_ocr_server(image):
    """Call PaddleOCR v2.7.3 on the server."""
    print("\nCalling the PaddleOCR v2.7.3 server...")
    try:
        import base64
        _, buffer = cv2.imencode('.png', image)
        img_base64 = base64.b64encode(buffer).decode('utf-8')
        response = requests.post(
            f"{OCR_SERVER}/ocr",
            json={'image': img_base64},
            timeout=30
        )
        if response.status_code == 200:
            result = response.json()
            print(f"✅ OCR complete, detected {len(result.get('results', []))} text regions")
            return result.get('results', [])
        else:
            print(f"❌ Server error: {response.status_code}")
            return None
    except Exception as e:
        print(f"❌ OCR call failed: {e}")
        import traceback
        traceback.print_exc()
        return None
 def mask_printed_text(image, ocr_results):
    """Mask printed text."""
    print("\nMasking printed text...")
    masked_image = image.copy()
    for i, result in enumerate(ocr_results):
        box = result.get('box')
        if box is None:
            continue
        # v2.7.3 returns a polygon: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
        # Convert it to an axis-aligned rectangle
        box_points = np.array(box)
        x_min = int(box_points[:, 0].min())
        y_min = int(box_points[:, 1].min())
        x_max = int(box_points[:, 0].max())
        y_max = int(box_points[:, 1].max())
        # Note: the fill is black, and detect_regions() treats dark pixels as
        # foreground (THRESH_BINARY_INV), so masked boxes can themselves
        # surface as candidate regions.
        cv2.rectangle(
            masked_image,
            (x_min - MASKING_PADDING, y_min - MASKING_PADDING),
            (x_max + MASKING_PADDING, y_max + MASKING_PADDING),
            (0, 0, 0),
            -1
        )
    masked_path = OUTPUT_DIR / "01_masked.png"
    cv2.imwrite(str(masked_path), masked_image)
    print(f"✅ Masking complete: {masked_path}")
    return masked_image
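# Illustrative sketch, not part of the original pipeline: the comment in
# mask_printed_text() above says the v2.7.3 server returns polygon boxes, while
# the PP-OCRv5 pipeline script unpacks boxes as [x, y, w, h]. Assuming those
# are the only two layouts, a hypothetical helper such as box_to_rect could
# normalize both to (x_min, y_min, x_max, y_max) before masking.
def box_to_rect(box):
    """Return integer (x_min, y_min, x_max, y_max) for either box layout."""
    if len(box) == 4 and all(isinstance(v, (int, float)) for v in box):
        # Flat layout: [x, y, w, h]
        x, y, w, h = box
        return int(x), int(y), int(x + w), int(y + h)
    # Polygon layout: [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
    xs = [point[0] for point in box]
    ys = [point[1] for point in box]
    return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))


# Both layouts describe the same region here and normalize to the same corners
# apart from the polygon's slight skew.
polygon_rect = box_to_rect([[10, 20], [110, 22], [108, 60], [9, 58]])
flat_rect = box_to_rect([10, 20, 100, 40])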
 def detect_regions(masked_image):
    """Detect candidate regions."""
    print("\nDetecting candidate regions...")
    gray = cv2.cvtColor(masked_image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
    cv2.imwrite(str(OUTPUT_DIR / "02_binary.png"), binary)
    cv2.imwrite(str(OUTPUT_DIR / "03_morphed.png"), morphed)
    contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    MIN_AREA = 3000
    MAX_AREA = 300000
    candidate_regions = []
    for contour in contours:
        area = cv2.contourArea(contour)
        if MIN_AREA <= area <= MAX_AREA:
            x, y, w, h = cv2.boundingRect(contour)
            aspect_ratio = w / h if h > 0 else 0
            candidate_regions.append({
                'box': (x, y, w, h),
                'area': area,
                'aspect_ratio': aspect_ratio
            })
    candidate_regions.sort(key=lambda r: r['area'], reverse=True)
    print(f"✅ Found {len(candidate_regions)} candidate regions")
    return candidate_regions
 def merge_nearby_regions(regions, h_distance=100, v_distance=50):
    """Merge nearby regions."""
    print("\nMerging nearby regions...")
    if not regions:
        return []
    merged = []
    used = set()
    for i, r1 in enumerate(regions):
        if i in used:
            continue
        x1, y1, w1, h1 = r1['box']
        merged_box = [x1, y1, x1 + w1, y1 + h1]
        group = [i]
        for j, r2 in enumerate(regions):
            if j <= i or j in used:
                continue
            x2, y2, w2, h2 = r2['box']
            h_dist = min(abs(x1 - (x2 + w2)), abs((x1 + w1) - x2))
            v_dist = min(abs(y1 - (y2 + h2)), abs((y1 + h1) - y2))
            x_overlap = not (x1 + w1 < x2 or x2 + w2 < x1)
            y_overlap = not (y1 + h1 < y2 or y2 + h2 < y1)
            if (x_overlap and v_dist <= v_distance) or (y_overlap and h_dist <= h_distance):
                merged_box[0] = min(merged_box[0], x2)
                merged_box[1] = min(merged_box[1], y2)
                merged_box[2] = max(merged_box[2], x2 + w2)
                merged_box[3] = max(merged_box[3], y2 + h2)
                group.append(j)
                used.add(j)
        used.add(i)
        x, y = merged_box[0], merged_box[1]
        w, h = merged_box[2] - merged_box[0], merged_box[3] - merged_box[1]
        merged.append({
            'box': (x, y, w, h),
            'area': w * h,
            'merged_count': len(group)
        })
    print(f"✅ {len(merged)} regions remain after merging")
    return merged
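# Illustrative sketch, not part of the original pipeline: the pairwise test
# used inside merge_nearby_regions() above, isolated as a hypothetical helper
# (should_merge) so the rule can be checked on hand-made boxes. Two boxes
# (x, y, w, h) are merged when they overlap on one axis and their gap on the
# other axis is within the corresponding threshold.
def should_merge(box1, box2, h_distance=100, v_distance=50):
    """Return True when the two boxes satisfy the merge rule."""
    x1, y1, w1, h1 = box1
    x2, y2, w2, h2 = box2
    h_dist = min(abs(x1 - (x2 + w2)), abs((x1 + w1) - x2))
    v_dist = min(abs(y1 - (y2 + h2)), abs((y1 + h1) - y2))
    x_overlap = not (x1 + w1 < x2 or x2 + w2 < x1)
    y_overlap = not (y1 + h1 < y2 or y2 + h2 < y1)
    return (x_overlap and v_dist <= v_distance) or (y_overlap and h_dist <= h_distance)


# Same row, 60 px horizontal gap: merged, because 60 <= h_distance.
side_by_side = should_merge((0, 0, 100, 50), (160, 0, 100, 50))
# Same column, 80 px vertical gap: kept apart, because 80 > v_distance.
stacked_far = should_merge((0, 0, 100, 50), (0, 130, 100, 50))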
 def extract_signatures(image, regions):
    """Extract signature regions."""
    print("\nExtracting signature regions...")
    vis_image = image.copy()
    for i, region in enumerate(regions):
        x, y, w, h = region['box']
        cv2.rectangle(vis_image, (x, y), (x + w, y + h), (0, 255, 0), 3)
        cv2.putText(vis_image, f"Region {i+1}", (x, y - 10),
                   cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        signature = image[y:y+h, x:x+w]
        sig_path = OUTPUT_DIR / f"signature_{i+1}.png"
        cv2.imwrite(str(sig_path), signature)
        print(f"  Region {i+1}: {w}x{h} px, area={region['area']}")
    vis_path = OUTPUT_DIR / "04_detected_regions.png"
    cv2.imwrite(str(vis_path), vis_image)
    print(f"\n✅ Annotated image saved: {vis_path}")
    return vis_image
 def generate_summary(ocr_count, regions):
    """Generate the summary report."""
    summary = f"""
 PaddleOCR v2.7.3 (v4) Full Pipeline Test Results
 {'=' * 60}
 1. OCR detection: {ocr_count} text regions
 2. Mask printed text: done
 3. Candidate regions detected: {len(regions)}
 4. Signatures extracted: {len(regions)}
 Candidate region details:
 {'-' * 60}
 """
    for i, region in enumerate(regions):
        x, y, w, h = region['box']
        area = region['area']
        summary += f"Region {i+1}: position ({x}, {y}), size {w}x{h}, area={area}\n"
    summary += f"\nAll results saved in: {OUTPUT_DIR}\n"
    return summary
 def main():
    print("=" * 60)
    print("PaddleOCR v2.7.3 (v4) Full Pipeline Test")
    print("=" * 60)
    setup_output_dir()
    print("\n1. Loading the test image...")
    image = get_page_image()
    if image is None:
        return
    print(f"   Image size: {image.shape}")
    cv2.imwrite(str(OUTPUT_DIR / "00_original.png"), image)
    print("\n2. Detecting text with PaddleOCR v2.7.3...")
    ocr_results = call_ocr_server(image)
    if ocr_results is None:
        print("❌ OCR failed, aborting the test")
        return
    print("\n3. Masking printed text...")
    masked_image = mask_printed_text(image, ocr_results)
    print("\n4. Detecting candidate regions...")
    regions = detect_regions(masked_image)
    print("\n5. Merging nearby regions...")
    merged_regions = merge_nearby_regions(regions)
    print("\n6. Extracting signatures...")
    vis_image = extract_signatures(image, merged_regions)
    print("\n7. Generating the summary report...")
    summary = generate_summary(len(ocr_results), merged_regions)
    print(summary)
    summary_path = OUTPUT_DIR / "SUMMARY.txt"
    with open(summary_path, 'w', encoding='utf-8') as f:
        f.write(summary)
    print("=" * 60)
    print("✅ v4 test complete!")
    print(f"Results directory: {OUTPUT_DIR}")
    print("=" * 60)
 if __name__ == "__main__":
    main()
@@ -0,0 +1,322 @@
 #!/usr/bin/env python3
 """
 Run the full signature extraction pipeline with PP-OCRv5
 Steps:
 1. Detect text with PP-OCRv5 on the server
 2. Mask printed text
 3. Detect candidate regions
 4. Extract signatures
 """
 import sys
 import json
 import cv2
 import numpy as np
 import requests
 from pathlib import Path
 # Configuration
 OCR_SERVER = "http://192.168.30.36:5555"
 PDF_PATH = "/Volumes/NV2/pdf_recognize/test.pdf"
 OUTPUT_DIR = Path("/Volumes/NV2/pdf_recognize/test_results/v5_pipeline")
 MASKING_PADDING = 0
 def setup_output_dir():
    """Create the output directory."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    print(f"Output directory: {OUTPUT_DIR}")
 def get_page_image():
    """Load the test page image."""
    # Use the existing test image
    test_image = "/Volumes/NV2/pdf_recognize/full_page_original.png"
    if Path(test_image).exists():
        return cv2.imread(test_image)
    else:
        print(f"❌ Test image not found: {test_image}")
        return None
 def call_ocr_server(image):
    """Call PP-OCRv5 on the server."""
    print("\nCalling the PP-OCRv5 server...")
    try:
        # Encode the image
        import base64
        _, buffer = cv2.imencode('.png', image)
        img_base64 = base64.b64encode(buffer).decode('utf-8')
        # Send the request
        response = requests.post(
            f"{OCR_SERVER}/ocr",
            json={'image': img_base64},
            timeout=30
        )
        if response.status_code == 200:
            result = response.json()
            print(f"✅ OCR complete, detected {len(result.get('results', []))} text regions")
            return result.get('results', [])
        else:
            print(f"❌ Server error: {response.status_code}")
            return None
    except Exception as e:
        print(f"❌ OCR call failed: {e}")
        import traceback
        traceback.print_exc()
        return None
 def mask_printed_text(image, ocr_results):
    """Mask printed text."""
    print("\nMasking printed text...")
    masked_image = image.copy()
    for i, result in enumerate(ocr_results):
        box = result.get('box')
        if box is None:
            continue
        # Box format: [x, y, w, h]
        x, y, w, h = box
        # Mask with a filled black rectangle.
        # Note: detect_regions() treats dark pixels as foreground
        # (THRESH_BINARY_INV), so masked boxes can themselves surface as
        # candidate regions.
        cv2.rectangle(
            masked_image,
            (x - MASKING_PADDING, y - MASKING_PADDING),
            (x + w + MASKING_PADDING, y + h + MASKING_PADDING),
            (0, 0, 0),
            -1
        )
    # Save the masked image
    masked_path = OUTPUT_DIR / "01_masked.png"
    cv2.imwrite(str(masked_path), masked_image)
    print(f"✅ Masking complete: {masked_path}")
    return masked_image
 def detect_regions(masked_image):
    """Detect candidate regions."""
    print("\nDetecting candidate regions...")
    # Convert to grayscale
    gray = cv2.cvtColor(masked_image, cv2.COLOR_BGR2GRAY)
    # Binarize (inverted): pixels at or below 200, including the black mask fill, become foreground
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    # Morphological closing to join nearby strokes
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
    # Save intermediate results
    cv2.imwrite(str(OUTPUT_DIR / "02_binary.png"), binary)
    cv2.imwrite(str(OUTPUT_DIR / "03_morphed.png"), morphed)
    # Find contours
    contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Filter candidate regions by contour area
    MIN_AREA = 3000
    MAX_AREA = 300000
    candidate_regions = []
    for contour in contours:
        area = cv2.contourArea(contour)
        if MIN_AREA <= area <= MAX_AREA:
            x, y, w, h = cv2.boundingRect(contour)
            aspect_ratio = w / h if h > 0 else 0
            candidate_regions.append({
                'box': (x, y, w, h),
                'area': area,
                'aspect_ratio': aspect_ratio
            })
    # Sort by area, largest first
    candidate_regions.sort(key=lambda r: r['area'], reverse=True)
    print(f"✅ Found {len(candidate_regions)} candidate regions")
    return candidate_regions
 def merge_nearby_regions(regions, h_distance=100, v_distance=50):
    """Merge nearby regions."""
    print("\nMerging nearby regions...")
    if not regions:
        return []
    merged = []
    used = set()
    for i, r1 in enumerate(regions):
        if i in used:
            continue
        x1, y1, w1, h1 = r1['box']
        merged_box = [x1, y1, x1 + w1, y1 + h1]  # [x_min, y_min, x_max, y_max]
        group = [i]
        for j, r2 in enumerate(regions):
            if j <= i or j in used:
                continue
            x2, y2, w2, h2 = r2['box']
            # Edge-to-edge distances, measured against the seed region r1 (not the growing merged box)
            h_dist = min(abs(x1 - (x2 + w2)), abs((x1 + w1) - x2))
            v_dist = min(abs(y1 - (y2 + h2)), abs((y1 + h1) - y2))
            # Check for overlap or proximity
            x_overlap = not (x1 + w1 < x2 or x2 + w2 < x1)
            y_overlap = not (y1 + h1 < y2 or y2 + h2 < y1)
            if (x_overlap and v_dist <= v_distance) or (y_overlap and h_dist <= h_distance):
                # Merge r2 into the group's bounding box
                merged_box[0] = min(merged_box[0], x2)
                merged_box[1] = min(merged_box[1], y2)
                merged_box[2] = max(merged_box[2], x2 + w2)
                merged_box[3] = max(merged_box[3], y2 + h2)
                group.append(j)
                used.add(j)
        used.add(i)
        # Convert back to (x, y, w, h) format
        x, y = merged_box[0], merged_box[1]
        w, h = merged_box[2] - merged_box[0], merged_box[3] - merged_box[1]
        merged.append({
            'box': (x, y, w, h),
            'area': w * h,
            'merged_count': len(group)
        })
    print(f"✅ {len(merged)} regions remain after merging")
    return merged
 def extract_signatures(image, regions):
    """Extract signature regions."""
    print("\nExtracting signature regions...")
    # Annotate every region on a copy of the image
    vis_image = image.copy()
    for i, region in enumerate(regions):
        x, y, w, h = region['box']
        # Draw the bounding box
        cv2.rectangle(vis_image, (x, y), (x + w, y + h), (0, 255, 0), 3)
        cv2.putText(vis_image, f"Region {i+1}", (x, y - 10),
                   cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        # Crop and save
        signature = image[y:y+h, x:x+w]
        sig_path = OUTPUT_DIR / f"signature_{i+1}.png"
        cv2.imwrite(str(sig_path), signature)
        print(f"  Region {i+1}: {w}x{h} px, area={region['area']}")
    # Save the annotated image
    vis_path = OUTPUT_DIR / "04_detected_regions.png"
    cv2.imwrite(str(vis_path), vis_image)
    print(f"\n✅ Annotated image saved: {vis_path}")
    return vis_image
 def generate_summary(ocr_count, masked_path, regions):
    """Generate the summary report."""
    summary = f"""
 PP-OCRv5 Full Pipeline Test Results
 {'=' * 60}
 1. OCR detection: {ocr_count} text regions
 2. Masked printed text: {masked_path}
 3. Candidate regions detected: {len(regions)}
 4. Signatures extracted: {len(regions)}
 Candidate region details:
 {'-' * 60}
 """
    for i, region in enumerate(regions):
        x, y, w, h = region['box']
        area = region['area']
        summary += f"Region {i+1}: position ({x}, {y}), size {w}x{h}, area={area}\n"
    summary += f"\nAll results saved to: {OUTPUT_DIR}\n"
    return summary
 def main():
    print("=" * 60)
    print("PP-OCRv5 Full Pipeline Test")
    print("=" * 60)
    # Setup
    setup_output_dir()
    # 1. Load the image
    print("\n1. Loading test image...")
    image = get_page_image()
    if image is None:
        return
    print(f"   Image size: {image.shape}")
    # Save the original image
    cv2.imwrite(str(OUTPUT_DIR / "00_original.png"), image)
    # 2. OCR detection
    print("\n2. Detecting text with PP-OCRv5...")
    ocr_results = call_ocr_server(image)
    if ocr_results is None:
        print("❌ OCR failed, aborting test")
        return
    # 3. Mask printed text
    print("\n3. Masking printed text...")
    masked_image = mask_printed_text(image, ocr_results)
    # 4. Detect candidate regions
    print("\n4. Detecting candidate regions...")
    regions = detect_regions(masked_image)
    # 5. Merge nearby regions
    print("\n5. Merging nearby regions...")
    merged_regions = merge_nearby_regions(regions)
    # 6. Extract signatures
    print("\n6. Extracting signatures...")
    vis_image = extract_signatures(image, merged_regions)
    # 7. Generate the summary
    print("\n7. Generating summary report...")
    summary = generate_summary(len(ocr_results), OUTPUT_DIR / "01_masked.png", merged_regions)
    print(summary)
    # Save the summary
    summary_path = OUTPUT_DIR / "SUMMARY.txt"
    with open(summary_path, 'w', encoding='utf-8') as f:
        f.write(summary)
    print("=" * 60)
    print("✅ Test complete!")
    print(f"Results directory: {OUTPUT_DIR}")
    print("=" * 60)
 if __name__ == "__main__":
    main()
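The proximity test inside `merge_nearby_regions` can be exercised on its own, which makes the `h_distance`/`v_distance` thresholds easier to tune. The following is a minimal sketch, assuming boxes in `(x, y, w, h)` form and the same default thresholds as the script; `should_merge` is a hypothetical helper name and the sample boxes are made up for illustration.

```python
def should_merge(box1, box2, h_distance=100, v_distance=50):
    """Return True when two (x, y, w, h) boxes satisfy the script's merge rule."""
    x1, y1, w1, h1 = box1
    x2, y2, w2, h2 = box2
    # Edge-to-edge gaps, computed the same way as in merge_nearby_regions
    h_dist = min(abs(x1 - (x2 + w2)), abs((x1 + w1) - x2))
    v_dist = min(abs(y1 - (y2 + h2)), abs((y1 + h1) - y2))
    # Projection overlap on each axis
    x_overlap = not (x1 + w1 < x2 or x2 + w2 < x1)
    y_overlap = not (y1 + h1 < y2 or y2 + h2 < y1)
    return (x_overlap and v_dist <= v_distance) or (y_overlap and h_dist <= h_distance)


# Hypothetical boxes: two fragments of one signature with a 40 px horizontal gap
left = (100, 200, 80, 60)
right = (220, 210, 90, 50)
# A block far below the signature in the same column
far_below = (100, 600, 80, 60)

print(should_merge(left, right))      # rows overlap and the gap is within h_distance
print(should_merge(left, far_below))  # columns overlap but the vertical gap exceeds v_distance
```

Because the rule requires overlap on one axis plus a small gap on the other, two boxes that are diagonal neighbours are never merged, however close their corners are.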
@@ -0,0 +1,181 @@
 #!/usr/bin/env python3
 """
 Visualize PP-OCRv5 detection results.
 """
 import json
 import cv2
 import numpy as np
 from pathlib import Path
 def load_results():
    """Load the v5 detection results."""
    result_file = "/Volumes/NV2/pdf_recognize/test_results/v5_result.json"
    with open(result_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data['res']
 def draw_detections(image_path, results, output_path):
    """Draw detection boxes and index labels on the image."""
    # Read the image
    img = cv2.imread(image_path)
    if img is None:
        print(f"❌ Cannot read image: {image_path}")
        return None
    # Draw on a copy
    vis_img = img.copy()
    # Get the detection results
    rec_texts = results.get('rec_texts', [])
    rec_boxes = results.get('rec_boxes', [])
    rec_scores = results.get('rec_scores', [])
    print(f"\nDetected {len(rec_texts)} text regions")
    # Draw each detection box
    for i, (text, box, score) in enumerate(zip(rec_texts, rec_boxes, rec_scores)):
        x_min, y_min, x_max, y_max = box
        # Draw the rectangle (green)
        cv2.rectangle(vis_img, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
        # Draw the index number (small font)
        cv2.putText(vis_img, f"{i}", (x_min, y_min - 5),
                   cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    # Save the result
    cv2.imwrite(output_path, vis_img)
    print(f"✅ Visualization saved: {output_path}")
    return vis_img
 def generate_text_report(results):
    """Generate a text report."""
    rec_texts = results.get('rec_texts', [])
    rec_scores = results.get('rec_scores', [])
    rec_boxes = results.get('rec_boxes', [])
    if not rec_scores:
        # Guard: np.max/np.min raise on an empty list and the percentages below divide by its length
        print("\nNo text regions detected; nothing to report.")
        return
    print("\n" + "=" * 80)
    print("PP-OCRv5 Detection Report")
    print("=" * 80)
    print(f"\nTotal detected: {len(rec_texts)} text regions")
    print(f"Mean confidence: {np.mean(rec_scores):.4f}")
    print(f"Max confidence: {np.max(rec_scores):.4f}")
    print(f"Min confidence: {np.min(rec_scores):.4f}")
    # Confidence buckets
    high_conf = sum(1 for s in rec_scores if s >= 0.95)
    medium_conf = sum(1 for s in rec_scores if 0.8 <= s < 0.95)
    low_conf = sum(1 for s in rec_scores if s < 0.8)
    print(f"\nConfidence distribution:")
    print(f"  High (≥0.95): {high_conf} ({high_conf/len(rec_scores)*100:.1f}%)")
    print(f"  Medium (0.8-0.95): {medium_conf} ({medium_conf/len(rec_scores)*100:.1f}%)")
    print(f"  Low (<0.8): {low_conf} ({low_conf/len(rec_scores)*100:.1f}%)")
    # Show the first 20 detections
    print("\nFirst 20 detections:")
    print("-" * 80)
    for i in range(min(20, len(rec_texts))):
        text = rec_texts[i]
        score = rec_scores[i]
        box = rec_boxes[i]
        # Compute the box size
        width = box[2] - box[0]
        height = box[3] - box[1]
        print(f"[{i:2d}] conf: {score:.4f}  size: {width:4d}x{height:3d}  text: {text}")
    if len(rec_texts) > 20:
        print(f"\n... {len(rec_texts) - 20} more results (omitted)")
    # Look for possible handwriting regions (low confidence or large text)
    print("\n" + "=" * 80)
    print("Possible Handwriting Region Analysis")
    print("=" * 80)
    potential_handwriting = []
    for i, (text, score, box) in enumerate(zip(rec_texts, rec_scores, rec_boxes)):
        width = box[2] - box[0]
        height = box[3] - box[1]
        # Criteria (any one is sufficient):
        # 1. Tall box (>50px)
        # 2. Low confidence (<0.9)
        # 3. Short text in a large font
        is_large = height > 50
        is_low_conf = score < 0.9
        is_short_text = len(text) <= 3 and height > 40
        if is_large or is_low_conf or is_short_text:
            potential_handwriting.append({
                'index': i,
                'text': text,
                'score': score,
                'height': height,
                'width': width,
                'reason': []
            })
            if is_large:
                potential_handwriting[-1]['reason'].append('large text')
            if is_low_conf:
                potential_handwriting[-1]['reason'].append('low confidence')
            if is_short_text:
                potential_handwriting[-1]['reason'].append('short text, large font')
    if potential_handwriting:
        print(f"\nFound {len(potential_handwriting)} possible handwriting regions:")
        print("-" * 80)
        for item in potential_handwriting[:15]:  # Show only the first 15
            reasons = ', '.join(item['reason'])
            print(f"[{item['index']:2d}] {item['height']:3d}px  {item['score']:.4f}  ({reasons})  {item['text']}")
    else:
        print("No regions with obvious handwriting features found")
    # Save the detailed report to a file
    report_path = "/Volumes/NV2/pdf_recognize/test_results/v5_analysis_report.txt"
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write(f"PP-OCRv5 Detailed Detection Report\n")
        f.write("=" * 80 + "\n\n")
        f.write(f"Total: {len(rec_texts)}\n")
        f.write(f"Mean confidence: {np.mean(rec_scores):.4f}\n\n")
        f.write("Full detection list:\n")
        f.write("-" * 80 + "\n")
        for i, (text, score, box) in enumerate(zip(rec_texts, rec_scores, rec_boxes)):
            width = box[2] - box[0]
            height = box[3] - box[1]
            f.write(f"[{i:2d}] {score:.4f}  {width:4d}x{height:3d}  {text}\n")
    print(f"\nDetailed report saved: {report_path}")
 def main():
    # Load the results
    print("Loading PP-OCRv5 detection results...")
    results = load_results()
    # Generate the text report
    generate_text_report(results)
    # Visualize
    print("\n" + "=" * 80)
    print("Generating visualization image")
    print("=" * 80)
    image_path = "/Volumes/NV2/pdf_recognize/full_page_original.png"
    output_path = "/Volumes/NV2/pdf_recognize/test_results/v5_visualization.png"
    if Path(image_path).exists():
        draw_detections(image_path, results, output_path)
    else:
        print(f"⚠️  Original image not found: {image_path}")
    print("\n" + "=" * 80)
    print("Analysis complete")
    print("=" * 80)
 if __name__ == "__main__":
    main()
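The handwriting heuristic in `generate_text_report` is easier to test when pulled out into a pure function. This is a sketch under stated assumptions: boxes are `[x_min, y_min, x_max, y_max]` as in `draw_detections`, the thresholds are the ones used in the report, `handwriting_reasons` is a hypothetical name, the reason labels are written in English, and the sample detections are invented.

```python
def handwriting_reasons(text, score, box):
    """Return the reasons a detection looks handwritten; an empty list means none apply."""
    x_min, y_min, x_max, y_max = box
    height = y_max - y_min
    reasons = []
    if height > 50:
        reasons.append('large text')
    if score < 0.9:
        reasons.append('low confidence')
    if len(text) <= 3 and height > 40:
        reasons.append('short text, large font')
    return reasons


# Hypothetical detections: a printed heading and a three-character signed name
printed = handwriting_reasons("Independent Auditor's Report", 0.99, (100, 100, 600, 130))
signed = handwriting_reasons("楊智惠", 0.72, (400, 900, 640, 980))

print(printed)
print(signed)
```

The three criteria are combined with OR, so any single one is enough to flag a region. That keeps recall high at the cost of also flagging large printed headings.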
@@ -0,0 +1,380 @@
 #!/usr/bin/env python3
 """
 YOLO Signature Extraction from VLM Index
 Extracts signatures from PDF pages specified in master_signatures.csv.
 Uses VLM-filtered index + YOLO for precise localization and cropping.
 Pipeline:
    CSV Index → Load specified page → YOLO Detection → Crop & Remove Red Stamp → Output
 """
 import argparse
 import csv
 import json
 import os
 import sys
 import time
 from concurrent.futures import ProcessPoolExecutor, as_completed
 from datetime import datetime
 from pathlib import Path
 from typing import Optional
 import cv2
 import fitz  # PyMuPDF
 import numpy as np
 # Configuration
 DPI = 150
 CONFIDENCE_THRESHOLD = 0.5
 PROGRESS_SAVE_INTERVAL = 500
 def remove_red_stamp(image: np.ndarray) -> np.ndarray:
    """Remove red stamp pixels from an image by replacing them with white."""
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
    lower_red1 = np.array([0, 50, 50])
    upper_red1 = np.array([10, 255, 255])
    lower_red2 = np.array([160, 50, 50])
    upper_red2 = np.array([180, 255, 255])
    mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
    mask2 = cv2.inRange(hsv, lower_red2, upper_red2)
    red_mask = cv2.bitwise_or(mask1, mask2)
    kernel = np.ones((3, 3), np.uint8)
    red_mask = cv2.dilate(red_mask, kernel, iterations=1)
    result = image.copy()
    result[red_mask > 0] = [255, 255, 255]
    return result
 def render_pdf_page(pdf_path: str, page_num: int, dpi: int = DPI) -> Optional[np.ndarray]:
    """Render a specific PDF page to an image array."""
    try:
        # The context manager closes the document even if rendering raises
        with fitz.open(pdf_path) as doc:
            if page_num < 1 or page_num > len(doc):
                return None
            page = doc[page_num - 1]  # Convert to 0-indexed
            mat = fitz.Matrix(dpi / 72, dpi / 72)
            pix = page.get_pixmap(matrix=mat, alpha=False)
            image = np.frombuffer(pix.samples, dtype=np.uint8)
            return image.reshape(pix.height, pix.width, pix.n)
    except Exception:
        return None
 def find_pdf_file(filename: str, pdf_base: str) -> Optional[str]:
    """Search for PDF file in batch directories."""
    base_path = Path(pdf_base)
    # Check for batch subdirectories
    for batch_dir in sorted(base_path.glob("batch_*")):
        pdf_path = batch_dir / filename
        if pdf_path.exists():
            return str(pdf_path)
    # Check flat directory
    pdf_path = base_path / filename
    if pdf_path.exists():
        return str(pdf_path)
    return None
 def process_single_entry(args: tuple) -> dict:
    """
    Process a single CSV entry: render page, detect signatures, crop and save.
    Args:
        args: Tuple of (row_dict, model_path, pdf_base, output_dir, conf_threshold)
    Returns:
        Result dictionary
    """
    row, model_path, pdf_base, output_dir, conf_threshold = args
    from ultralytics import YOLO
    filename = row['filename']
    page_num = int(row['page'])
    base_name = Path(filename).stem
    result = {
        'filename': filename,
        'page': page_num,
        'num_signatures': 0,
        'confidence_avg': 0.0,
        'image_files': [],
        'error': None
    }
    try:
        # Find PDF
        pdf_path = find_pdf_file(filename, pdf_base)
        if pdf_path is None:
            result['error'] = 'PDF not found'
            return result
        # Render page
        image = render_pdf_page(pdf_path, page_num)
        if image is None:
            result['error'] = 'Render failed'
            return result
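        # Load the model once per worker process and reuse it for later entries
        model_cache = process_single_entry.__dict__.setdefault('_model_cache', {})
        if model_path not in model_cache:
            model_cache[model_path] = YOLO(model_path)
        model = model_cache[model_path]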
        results = model(image, conf=conf_threshold, verbose=False)
        signatures = []
        for r in results:
            for box in r.boxes:
                x1, y1, x2, y2 = map(int, box.xyxy[0].cpu().numpy())
                conf = float(box.conf[0].cpu().numpy())
                signatures.append({
                    'box': (x1, y1, x2 - x1, y2 - y1),
                    'confidence': conf
                })
        if not signatures:
            result['num_signatures'] = 0
            return result
        # Sort signatures by position (top-left to bottom-right)
        signatures.sort(key=lambda s: (s['box'][1], s['box'][0]))
        result['num_signatures'] = len(signatures)
        result['confidence_avg'] = sum(s['confidence'] for s in signatures) / len(signatures)
        # Extract and save crops
        image_files = []
        for i, sig in enumerate(signatures):
            x, y, w, h = sig['box']
            x = max(0, x)
            y = max(0, y)
            x2 = min(image.shape[1], x + w)
            y2 = min(image.shape[0], y + h)
            crop = image[y:y2, x:x2]
            crop_clean = remove_red_stamp(crop)
            crop_filename = f"{base_name}_page{page_num}_sig{i + 1}.png"
            crop_path = os.path.join(output_dir, "images", crop_filename)
            cv2.imwrite(crop_path, cv2.cvtColor(crop_clean, cv2.COLOR_RGB2BGR))
            image_files.append(crop_filename)
        result['image_files'] = image_files
    except Exception as e:
        result['error'] = str(e)
    return result
 def load_progress(progress_file: str) -> set:
    """Load completed entries from progress checkpoint."""
    if os.path.exists(progress_file):
        try:
            with open(progress_file, 'r') as f:
                data = json.load(f)
                return set(data.get('completed_keys', []))
        except Exception:
            pass
    return set()
 def save_progress(progress_file: str, completed: set, total: int, start_time: float):
    """Save progress checkpoint."""
    elapsed = time.time() - start_time
    data = {
        'last_updated': datetime.now().isoformat(),
        'total_entries': total,
        'processed': len(completed),
        'remaining': total - len(completed),
        'elapsed_seconds': elapsed,
        'completed_keys': list(completed)
    }
    with open(progress_file, 'w') as f:
        json.dump(data, f)
 def main():
    parser = argparse.ArgumentParser(description='YOLO Signature Extraction from VLM Index')
    parser.add_argument('--csv', required=True, help='Path to master_signatures.csv')
    parser.add_argument('--pdf-base', required=True, help='Base directory containing PDFs')
    parser.add_argument('--output', required=True, help='Output directory')
    parser.add_argument('--model', default='best.pt', help='Path to YOLO model')
    parser.add_argument('--workers', type=int, default=8, help='Number of parallel workers')
    parser.add_argument('--conf', type=float, default=CONFIDENCE_THRESHOLD, help='Confidence threshold')
    parser.add_argument('--resume', action='store_true', help='Resume from checkpoint')
    args = parser.parse_args()
    # Setup output directories
    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "images").mkdir(exist_ok=True)
    progress_file = str(output_dir / "progress.json")
    csv_output = str(output_dir / "extraction_results.csv")
    report_file = str(output_dir / "extraction_report.json")
    print("=" * 70)
    print("YOLO Signature Extraction from VLM Index")
    print("=" * 70)
    print(f"CSV Index: {args.csv}")
    print(f"PDF Base: {args.pdf_base}")
    print(f"Output: {args.output}")
    print(f"Model: {args.model}")
    print(f"Workers: {args.workers}")
    print(f"Confidence: {args.conf}")
    print("=" * 70)
    # Load CSV
    print("\nLoading CSV index...")
    with open(args.csv, 'r') as f:
        reader = csv.DictReader(f)
        all_entries = list(reader)
    total_entries = len(all_entries)
    print(f"Total entries: {total_entries}")
    # Load progress if resuming
    completed_keys = set()
    if args.resume:
        completed_keys = load_progress(progress_file)
        print(f"Resuming: {len(completed_keys)} entries already processed")
    # Filter out completed entries
    def entry_key(row):
        return f"{row['filename']}_{row['page']}"
    entries_to_process = [e for e in all_entries if entry_key(e) not in completed_keys]
    print(f"Entries to process: {len(entries_to_process)}")
    if not entries_to_process:
        print("All entries already processed!")
        return
    # Prepare work arguments
    work_args = [
        (entry, args.model, args.pdf_base, str(output_dir), args.conf)
        for entry in entries_to_process
    ]
    # Results
    results_success = []
    results_no_sig = []
    errors = []
    start_time = time.time()
    processed_count = len(completed_keys)
    print(f"\nStarting extraction with {args.workers} workers...")
    print("-" * 70)
    with ProcessPoolExecutor(max_workers=args.workers) as executor:
        futures = {executor.submit(process_single_entry, arg): arg[0] for arg in work_args}
        for future in as_completed(futures):
            entry = futures[future]
            key = entry_key(entry)
            try:
                result = future.result()
                if result['error']:
                    errors.append(result)
                elif result['num_signatures'] > 0:
                    results_success.append(result)
                else:
                    results_no_sig.append(result)
                completed_keys.add(key)
                processed_count += 1
                # Progress output: the rate counts only entries finished in this run
                elapsed = time.time() - start_time
                done_this_run = len(results_success) + len(results_no_sig) + len(errors)
                rate = done_this_run / elapsed if elapsed > 0 else 0
                eta = (total_entries - processed_count) / rate / 60 if rate > 0 else 0
                status = f"SIG({result['num_signatures']})" if result['num_signatures'] > 0 else "---"
                if result['error']:
                    status = "ERR"
                print(f"[{processed_count}/{total_entries}] {status:8s} {result['filename'][:45]:45s} "
                      f"({rate:.1f}/s, ETA: {eta:.1f}m)")
                # Save progress
                if processed_count % PROGRESS_SAVE_INTERVAL == 0:
                    save_progress(progress_file, completed_keys, total_entries, start_time)
            except Exception as e:
                print(f"Error: {e}")
                errors.append({'filename': entry['filename'], 'error': str(e)})
    # Final progress save
    save_progress(progress_file, completed_keys, total_entries, start_time)
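    # Write CSV results (append when resuming so rows from earlier runs are not overwritten)
    print("\nWriting results CSV...")
    append_csv = args.resume and os.path.exists(csv_output)
    with open(csv_output, 'a' if append_csv else 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=[
            'filename', 'page', 'num_signatures', 'confidence_avg', 'image_files'
        ])
        if not append_csv:
            writer.writeheader()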
        for r in results_success:
            writer.writerow({
                'filename': r['filename'],
                'page': r['page'],
                'num_signatures': r['num_signatures'],
                'confidence_avg': round(r['confidence_avg'], 4),
                'image_files': ','.join(r['image_files'])
            })
    # Generate report
    elapsed_total = time.time() - start_time
    total_sigs = sum(r['num_signatures'] for r in results_success)
    report = {
        'extraction_date': datetime.now().isoformat(),
        'total_index_entries': total_entries,
        'with_signatures_detected': len(results_success),
        'no_signatures_detected': len(results_no_sig),
        'errors': len(errors),
        'total_signatures_extracted': total_sigs,
        'detection_rate': f"{len(results_success) / total_entries * 100:.2f}%" if total_entries > 0 else "0%",
        'processing_time_minutes': round(elapsed_total / 60, 2),
        'processing_rate_per_second': round(len(entries_to_process) / elapsed_total, 2) if elapsed_total > 0 else 0,
        'model': args.model,
        'confidence_threshold': args.conf,
        'workers': args.workers
    }
    with open(report_file, 'w') as f:
        json.dump(report, f, indent=2)
    # Print summary
    print("\n" + "=" * 70)
    print("EXTRACTION COMPLETE")
    print("=" * 70)
    print(f"Total index entries:      {total_entries}")
    print(f"With signatures:          {len(results_success)} ({len(results_success)/total_entries*100:.1f}%)")
    print(f"No signatures detected:   {len(results_no_sig)} ({len(results_no_sig)/total_entries*100:.1f}%)")
    print(f"Errors:                   {len(errors)}")
    print(f"Total signatures:         {total_sigs}")
    print(f"Processing time:          {elapsed_total/60:.1f} minutes")
    print(f"Rate:                     {len(entries_to_process)/elapsed_total:.1f} entries/second")
    print("-" * 70)
    print(f"Results saved to: {output_dir}")
    print("=" * 70)
 if __name__ == "__main__":
    main()
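The progress line in `main()` mixes two counts: entries finished in earlier runs (restored from the checkpoint) and entries finished in the current run. Only the latter should feed the throughput estimate. A minimal sketch of that arithmetic follows; `progress_stats` is a hypothetical helper, not part of the script, and the numbers are invented.

```python
def progress_stats(done_this_run, already_done, total, elapsed_seconds):
    """Return (entries per second, ETA in minutes) for a resumable run."""
    # Throughput uses only the work done since this process started
    rate = done_this_run / elapsed_seconds if elapsed_seconds > 0 else 0.0
    # Remaining work accounts for entries restored from the checkpoint
    remaining = total - (already_done + done_this_run)
    eta_minutes = remaining / rate / 60 if rate > 0 else 0.0
    return rate, eta_minutes


# Hypothetical resumed run: 5000 entries restored, 1200 processed in 10 minutes
rate, eta = progress_stats(done_this_run=1200, already_done=5000,
                           total=10000, elapsed_seconds=600.0)
print(f"{rate:.1f}/s, ETA: {eta:.1f}m")
```

Counting checkpointed entries in the numerator would inflate the rate right after a resume and make the ETA look far shorter than it is.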
@@ -0,0 +1,385 @@
 #!/usr/bin/env python3
 """
 YOLO Full PDF Signature Scanner
 Scans all PDFs to detect handwritten signatures using a trained YOLOv11n model.
 Supports multi-process GPU acceleration and checkpoint resumption.
 Features:
 - Skip first page of each PDF
 - Stop scanning once signature is found
 - Extract and save signature crops with red stamp removal
 - Progress checkpoint for resumption
 - Detailed statistics report
 """
 import argparse
 import csv
 import json
 import os
 import sys
 import time
 from concurrent.futures import ProcessPoolExecutor, as_completed
 from datetime import datetime
 from pathlib import Path
 from typing import Optional
 import cv2
 import fitz  # PyMuPDF
 import numpy as np
 # Will be imported in worker processes
 # from ultralytics import YOLO
 # Configuration
 DPI = 150  # Lower DPI for faster processing (150 vs 300)
 CONFIDENCE_THRESHOLD = 0.5
 PROGRESS_SAVE_INTERVAL = 100  # Save progress every N files
 def remove_red_stamp(image: np.ndarray) -> np.ndarray:
    """Remove red stamp pixels from an image by replacing them with white."""
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
    # Red color ranges in HSV
    lower_red1 = np.array([0, 50, 50])
    upper_red1 = np.array([10, 255, 255])
    lower_red2 = np.array([160, 50, 50])
    upper_red2 = np.array([180, 255, 255])
    mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
    mask2 = cv2.inRange(hsv, lower_red2, upper_red2)
    red_mask = cv2.bitwise_or(mask1, mask2)
    kernel = np.ones((3, 3), np.uint8)
    red_mask = cv2.dilate(red_mask, kernel, iterations=1)
    result = image.copy()
    result[red_mask > 0] = [255, 255, 255]
    return result
 def render_pdf_page(doc, page_num: int, dpi: int = DPI) -> Optional[np.ndarray]:
    """Render a PDF page to an image array."""
    try:
        page = doc[page_num]
        mat = fitz.Matrix(dpi / 72, dpi / 72)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        image = np.frombuffer(pix.samples, dtype=np.uint8)
        image = image.reshape(pix.height, pix.width, pix.n)
        return image
    except Exception:
        return None
 def scan_single_pdf(args: tuple) -> dict:
    """
    Scan a single PDF for signatures.
    Args:
        args: Tuple of (pdf_path, model_path, output_dir, conf_threshold)
    Returns:
        Result dictionary with signature info
    """
    pdf_path, model_path, output_dir, conf_threshold = args
    # Import here to avoid issues with multiprocessing
    from ultralytics import YOLO
    result = {
        'filename': os.path.basename(pdf_path),
        'source_dir': os.path.basename(os.path.dirname(pdf_path)),
        'has_signature': False,
        'page': None,
        'num_signatures': 0,
        'confidence_avg': 0.0,
        'error': None
    }
    try:
        # Load model (each worker loads its own)
        model = YOLO(model_path)
        doc = fitz.open(pdf_path)
        num_pages = len(doc)
        # Skip first page, scan remaining pages
        for page_num in range(1, num_pages):  # Start from page 2 (index 1)
            image = render_pdf_page(doc, page_num)
            if image is None:
                continue
            # Run YOLO detection
            results = model(image, conf=conf_threshold, verbose=False)
            signatures = []
            for r in results:
                for box in r.boxes:
                    x1, y1, x2, y2 = map(int, box.xyxy[0].cpu().numpy())
                    conf = float(box.conf[0].cpu().numpy())
                    signatures.append({
                        'box': (x1, y1, x2 - x1, y2 - y1),
                        'xyxy': (x1, y1, x2, y2),
                        'confidence': conf
                    })
            if signatures:
                # Found signatures! Record and stop scanning
                result['has_signature'] = True
                result['page'] = page_num + 1  # 1-indexed
                result['num_signatures'] = len(signatures)
                result['confidence_avg'] = sum(s['confidence'] for s in signatures) / len(signatures)
                # Extract and save signature crops
                base_name = Path(pdf_path).stem
                for i, sig in enumerate(signatures):
                    x, y, w, h = sig['box']
                    x = max(0, x)
                    y = max(0, y)
                    x2 = min(image.shape[1], x + w)
                    y2 = min(image.shape[0], y + h)
                    crop = image[y:y2, x:x2]
                    crop_no_stamp = remove_red_stamp(crop)
                    # Save to output directory
                    crop_filename = f"{base_name}_page{page_num + 1}_sig{i + 1}.png"
                    crop_path = os.path.join(output_dir, "images", crop_filename)
                    cv2.imwrite(crop_path, cv2.cvtColor(crop_no_stamp, cv2.COLOR_RGB2BGR))
                doc.close()
                return result
        doc.close()
    except Exception as e:
        result['error'] = str(e)
    return result
 def collect_pdf_files(input_dirs: list[str]) -> list[str]:
    """Collect all PDF files from input directories."""
    pdf_files = []
    for input_dir in input_dirs:
        input_path = Path(input_dir)
        if not input_path.exists():
            print(f"Warning: Directory not found: {input_dir}")
            continue
        # Check for batch subdirectories
        batch_dirs = list(input_path.glob("batch_*"))
        if batch_dirs:
            # Has batch subdirectories
            for batch_dir in sorted(batch_dirs):
                for pdf_file in batch_dir.glob("*.pdf"):
                    pdf_files.append(str(pdf_file))
        else:
            # Flat directory
            for pdf_file in input_path.glob("*.pdf"):
                pdf_files.append(str(pdf_file))
    return sorted(pdf_files)
 def load_progress(progress_file: str) -> set:
    """Load completed files from progress checkpoint."""
    if os.path.exists(progress_file):
        try:
            with open(progress_file, 'r') as f:
                data = json.load(f)
                return set(data.get('completed_files', []))
        except Exception:
            pass
    return set()
 def save_progress(progress_file: str, completed: set, total: int, start_time: float):
    """Save progress checkpoint."""
    elapsed = time.time() - start_time
    data = {
        'last_updated': datetime.now().isoformat(),
        'total_pdfs': total,
        'processed': len(completed),
        'remaining': total - len(completed),
        'elapsed_seconds': elapsed,
        'completed_files': list(completed)
    }
    with open(progress_file, 'w') as f:
        json.dump(data, f)
 def main():
    parser = argparse.ArgumentParser(description='YOLO Full PDF Signature Scanner')
    parser.add_argument('--input', nargs='+', required=True, help='Input directories containing PDFs')
    parser.add_argument('--output', required=True, help='Output directory for results')
    parser.add_argument('--model', default='best.pt', help='Path to YOLO model')
    parser.add_argument('--workers', type=int, default=4, help='Number of parallel workers')
    parser.add_argument('--conf', type=float, default=CONFIDENCE_THRESHOLD, help='Confidence threshold')
    parser.add_argument('--resume', action='store_true', help='Resume from checkpoint')
    args = parser.parse_args()
    # Setup output directories
    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "images").mkdir(exist_ok=True)
    progress_file = str(output_dir / "progress.json")
    csv_file = str(output_dir / "yolo_signatures.csv")
    report_file = str(output_dir / "scan_report.json")
    print("=" * 70)
    print("YOLO Full PDF Signature Scanner")
    print("=" * 70)
    print(f"Input directories: {args.input}")
    print(f"Output directory: {args.output}")
    print(f"Model: {args.model}")
    print(f"Workers: {args.workers}")
    print(f"Confidence threshold: {args.conf}")
    print(f"Resume mode: {args.resume}")
    print("=" * 70)
    # Collect all PDF files
    print("\nCollecting PDF files...")
    all_pdfs = collect_pdf_files(args.input)
    total_pdfs = len(all_pdfs)
    print(f"Found {total_pdfs} PDF files")
    # Load progress if resuming
    completed_files = set()
    if args.resume:
        completed_files = load_progress(progress_file)
        print(f"Resuming from checkpoint: {len(completed_files)} files already processed")
    # Filter out already processed files
    pdfs_to_process = [p for p in all_pdfs if os.path.basename(p) not in completed_files]
    print(f"PDFs to process: {len(pdfs_to_process)}")
    if not pdfs_to_process:
        print("All files already processed!")
        return
    # Prepare arguments for workers
    work_args = [
        (pdf_path, args.model, str(output_dir), args.conf)
        for pdf_path in pdfs_to_process
    ]
    # Statistics
    results_with_sig = []
    results_without_sig = []
    errors = []
    source_stats = {}
    start_time = time.time()
    processed_count = len(completed_files)
    # Process with multiprocessing
    print(f"\nStarting scan with {args.workers} workers...")
    print("-" * 70)
    with ProcessPoolExecutor(max_workers=args.workers) as executor:
        futures = {executor.submit(scan_single_pdf, arg): arg[0] for arg in work_args}
        for future in as_completed(futures):
            pdf_path = futures[future]
            filename = os.path.basename(pdf_path)
            try:
                result = future.result()
                # Update statistics
                source_dir = result['source_dir']
                if source_dir not in source_stats:
                    source_stats[source_dir] = {'scanned': 0, 'with_sig': 0}
                source_stats[source_dir]['scanned'] += 1
                if result['error']:
                    errors.append(result)
                elif result['has_signature']:
                    results_with_sig.append(result)
                    source_stats[source_dir]['with_sig'] += 1
                else:
                    results_without_sig.append(result)
                # Track completion
                completed_files.add(filename)
                processed_count += 1
                # Progress output
                elapsed = time.time() - start_time
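                # Count only files finished in this run. Re-reading the checkpoint here would
                # shrink the rate after every periodic save and hit the disk once per file.
                done_this_run = len(results_with_sig) + len(results_without_sig) + len(errors)
                rate = done_this_run / elapsed if elapsed > 0 else 0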
                eta = (total_pdfs - processed_count) / rate / 3600 if rate > 0 else 0
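                # Show failed files as ERR instead of reporting them as "no signature"
                status = "ERR" if result['error'] else ("SIG" if result['has_signature'] else "---")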
                print(f"[{processed_count}/{total_pdfs}] {status} {filename[:50]:50s} "
                      f"({rate:.1f}/s, ETA: {eta:.1f}h)")
                # Save progress periodically
                if processed_count % PROGRESS_SAVE_INTERVAL == 0:
                    save_progress(progress_file, completed_files, total_pdfs, start_time)
            except Exception as e:
                print(f"Error processing {filename}: {e}")
                errors.append({'filename': filename, 'error': str(e)})
    # Final progress save
    save_progress(progress_file, completed_files, total_pdfs, start_time)
    # Write CSV index
    print("\nWriting CSV index...")
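    # When resuming, append so rows written by an earlier run are kept; otherwise start fresh.
    append_csv = args.resume and os.path.exists(csv_file)
    with open(csv_file, 'a' if append_csv else 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['filename', 'page', 'num_signatures', 'confidence_avg'])
        if not append_csv:
            writer.writeheader()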
        for result in results_with_sig:
            writer.writerow({
                'filename': result['filename'],
                'page': result['page'],
                'num_signatures': result['num_signatures'],
                'confidence_avg': round(result['confidence_avg'], 4)
            })
    # Generate report
    elapsed_total = time.time() - start_time
    report = {
        'scan_date': datetime.now().isoformat(),
        'total_pdfs': total_pdfs,
        'with_signature': len(results_with_sig),
        'without_signature': len(results_without_sig),
        'errors': len(errors),
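        # Relative to the PDFs handled in this run; results of earlier (resumed) runs are not in memory
        'signature_rate': f"{len(results_with_sig) / len(pdfs_to_process) * 100:.2f}%" if pdfs_to_process else "0%",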
        'total_signatures_extracted': sum(r['num_signatures'] for r in results_with_sig),
        'processing_time_hours': round(elapsed_total / 3600, 2),
        'processing_rate_per_second': round(len(pdfs_to_process) / elapsed_total, 2) if elapsed_total > 0 else 0,
        'source_breakdown': source_stats,
        'model': args.model,
        'confidence_threshold': args.conf,
        'workers': args.workers
    }
    with open(report_file, 'w') as f:
        json.dump(report, f, indent=2)
    # Print summary
    print("\n" + "=" * 70)
    print("SCAN COMPLETE")
    print("=" * 70)
    scanned = len(pdfs_to_process)  # non-empty here: main() returns early when nothing is left
    print(f"Total PDFs found:       {total_pdfs}")
    print(f"Scanned this run:       {scanned}")
    print(f"With signature:         {len(results_with_sig)} ({len(results_with_sig)/scanned*100:.1f}%)")
    print(f"Without signature:      {len(results_without_sig)} ({len(results_without_sig)/scanned*100:.1f}%)")
    print(f"Errors:                 {len(errors)}")
    print(f"Total signatures:       {sum(r['num_signatures'] for r in results_with_sig)}")
    print(f"Processing time:        {elapsed_total/3600:.2f} hours")
    print(f"Processing rate:        {(len(pdfs_to_process)/elapsed_total if elapsed_total > 0 else 0):.1f} PDFs/second")
    print("-" * 70)
    print(f"Results saved to: {output_dir}")
    print("=" * 70)
if __name__ == "__main__":
    main()
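The `save_progress` function above rewrites `progress.json` in place, so a crash in the middle of a write could leave a truncated checkpoint that `--resume` then cannot read. A minimal sketch of an atomic variant, assuming the checkpoint directory is writable and sits on a single filesystem; the name `save_progress_atomic` is hypothetical and not part of the script:

```python
import json
import os
import tempfile
import time
from datetime import datetime


def save_progress_atomic(progress_file, completed, total, start_time):
    """Write the checkpoint to a temp file, then rename it over the target."""
    data = {
        'last_updated': datetime.now().isoformat(),
        'total_pdfs': total,
        'processed': len(completed),
        'remaining': total - len(completed),
        'elapsed_seconds': time.time() - start_time,
        'completed_files': sorted(completed),  # sorted for stable, diff-friendly output
    }
    # The temp file must live in the same directory so the rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(progress_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            json.dump(data, f)
        # os.replace swaps the file in one step, so readers see the old or the new version.
        os.replace(tmp_path, progress_file)
    except BaseException:
        # Clean up the partial temp file, then let the original error propagate.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

It takes the same arguments as `save_progress`, so it could be called from the same two places in `main()`. Only the filenames are checkpointed, as in the original; detection results held in memory are still lost if the run is interrupted.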
Author	SHA1	Message	Date
gbanyan and Claude Opus 4.6	939a348da4	Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification Paper draft includes all sections (Abstract through Conclusion), 36 references, and supporting scripts. Key methodology: Cosine similarity + dHash dual-method verification with thresholds calibrated against known-replication firm (Firm A). Includes: - 8 section markdown files (paper_a_*.md) - Ablation study script (ResNet-50 vs VGG-16 vs EfficientNet-B0) - Recalibrated classification script (84,386 PDFs, 5-tier system) - Figure generation and Word export scripts - Citation renumbering script ([1]-[36]) - Signature analysis pipeline (12 steps) - YOLO extraction scripts Three rounds of AI review completed (GPT-5.4, Claude Opus 4.6, Gemini 3 Pro). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 23:05:33 +08:00
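gbanyan and Claude	21df0ff387	Complete PP-OCRv5 research and v4 vs v5 comparison ## Research results ### PP-OCRv5 API testing - Successfully upgraded to PaddleOCR 3.3.2 (PP-OCRv5) - Worked out the new API structure and calling conventions - Verified basic detection ### Key findings ❌ PP-OCRv5 has no built-in handwriting classification - The text_type field is the language type, not a handwritten/printed classification - OpenCV Method 3 is still needed to separate handwritten from printed text ### Full pipeline comparison - v4 (2.7.3): 14 text items detected → 4 candidate regions - v5 (3.3.2): 50 text items detected → 7 candidate regions - Main signature region: nearly identical in both versions (1150x511 vs 1144x511) ### Performance analysis Pros: - v5 handwriting recognition accuracy +13.7% (as claimed in the documentation) - May reduce missed detections Cons: - Over-detection (small text in seals, etc.) - API completely rewritten, not backward compatible - Still cannot replace OpenCV Method 3 ### Files - PP_OCRV5_RESEARCH_FINDINGS.md: full research report - signature-comparison/: v4 vs v5 comparison results - test_results/: v5 test output - test_*_pipeline.py: full test scripts ### Recommendation The current setup (v2.7.3 + OpenCV Method 3) is stable enough; unless large numbers of missed detections occur, hold off on upgrading to v5. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-27 11:21:55 +08:00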
gbanyan and Claude	8f231da3bc	Complete OpenCV Method 3 implementation with 86.5% handwriting retention - Implemented comprehensive feature analysis based on size, stroke length, and regularity - Size-based scoring: height >50px indicates handwriting - Stroke length ratio: >0.4 indicates handwriting - Irregularity metrics: low compactness/solidity indicates handwriting - Successfully tested on sample PDF with 2 signatures (楊智惠, 張志銘) - Created detailed documentation: CURRENT_STATUS.md and NEW_SESSION_HANDOFF.md - Stable PaddleOCR 2.7.3 configuration documented (numpy 1.26.4, opencv 4.6.0.66) - Prepared research plan for PP-OCRv5 upgrade investigation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-27 10:35:46 +08:00
gbanyan	479d4e0019	Add PaddleOCR masking and region detection pipeline - Created PaddleOCR client for remote server communication - Implemented text masking + region detection pipeline - Test results: 100% recall on sample PDF (found both signatures) - Identified issues: split regions, printed text not fully masked - Documented 5 solution options in PADDLEOCR_STATUS.md - Next: Implement region merging and two-stage cleaning	2025-10-28 22:28:18 +08:00
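The oldest commit in the table above names region merging as its next step, and the project status notes give distance thresholds of H≤100px and V≤50px for it. A minimal sketch of such a merge, assuming boxes are `(x, y, w, h)` tuples and that two boxes merge only when both gaps fall within the thresholds; the function names `axis_gaps`, `union_box`, and `merge_regions` are hypothetical and not taken from the repository:

```python
def axis_gaps(a, b):
    """Return (horizontal, vertical) pixel gaps between two boxes; 0 where they overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    h_gap = max(0, max(ax, bx) - min(ax + aw, bx + bw))
    v_gap = max(0, max(ay, by) - min(ay + ah, by + bh))
    return h_gap, v_gap


def union_box(a, b):
    """Smallest (x, y, w, h) box that covers both inputs."""
    x0 = min(a[0], b[0])
    y0 = min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def merge_regions(boxes, max_h_gap=100, max_v_gap=50):
    """Repeatedly merge any pair of boxes that lie within both gap thresholds."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                h_gap, v_gap = axis_gaps(merged[i], merged[j])
                if h_gap <= max_h_gap and v_gap <= max_v_gap:
                    merged[i] = union_box(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since the grown box may now reach others
    return merged
```

The scan restarts after every merge because a merged box can come within range of regions that neither of its parts reached. That is quadratic per pass, which should be acceptable for the handful of candidate regions per page reported in the status notes.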