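diff --git a/.gitignore b/.gitignore
index 5d41f81..d983646 100644
--- a/.gitignore
+++ b/.gitignore
@@ -61,3 +61,5 @@ node_modules/
 
 # Sensitive/large data
 *.xlsx
+
+.serena/
diff --git a/NEW_SESSION_HANDOFF.md b/NEW_SESSION_HANDOFF.md
deleted file mode 100644
index 079f37f..0000000
--- a/NEW_SESSION_HANDOFF.md
+++ /dev/null
@@ -1,432 +0,0 @@
-# New Session Handoff Document - PP-OCRv5 Research
-
-**Date**: 2025-10-29
-**Previous session**: PaddleOCR-Cover branch development
-**Current branch**: `paddleocr-improvements` (stable)
-**New branch**: `pp-ocrv5-research` (to be created)
-
---
-
-## 🎯 Task Goal
-
-Research and implement handwritten signature detection with **PP-OCRv5**
-
---
-
-## 📋 Background
-
-### Current Status
-
-✅ **Existing stable solution** (`paddleocr-improvements` branch):
-- PaddleOCR 2.7.3 + OpenCV Method 3
-- 86.5% handwriting retention rate
-- Region-merging algorithm works well
-- Test: 2 signatures successfully detected in 1 PDF
-
-⚠️ **Problems encountered in the PP-OCRv5 upgrade**:
-- The PaddleOCR 3.3.0 API has changed completely
-- The old server code is incompatible
-- The new API needs in-depth study
-
-### Why research PP-OCRv5?
-
-**According to the documentation**: https://www.paddleocr.ai/main/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5.html
-
-PP-OCRv5 performance gains:
-- Handwritten Chinese detection: **0.706 → 0.803** (+13.7%)
-- Handwritten English detection: **0.249 → 0.841** (+237%)
-- May support direct output of handwriting region coordinates
-
-**Potential advantages**:
-1. Better handwriting recognition
-2. Possibly built-in handwritten/printed classification
-3. More accurate coordinate output
-4. Less complex post-processing
-
---
-
-## 🔧 Tech Stack
-
-### Server Environment
-
-```
-Host: 192.168.30.36 (Linux GPU server)
-SSH: ssh gblinux
-Directory: ~/Project/paddleocr-server/
-```
-
-**Current stable versions**:
-- PaddleOCR: 2.7.3
-- numpy: 1.26.4
-- opencv-contrib-python: 4.6.0.66
-- Server file: `paddleocr_server.py`
-
-**Installed but not in use**:
-- PaddleOCR 3.3.0 (PP-OCRv5)
-- Temporary server: `paddleocr_server_v5.py` (unfinished)
-
-### Local Environment
-
-```
-macOS
-Python: 3.14
-Virtual environment: venv/
-Client: paddleocr_client.py
-```
-
---
-
-## 📝 Core Problems
-
-### 1. API Changes
-
-**Old API (2.7.3)**:
-```python
-from paddleocr import PaddleOCR
-ocr = PaddleOCR(lang='ch')
-result = ocr.ocr(image_np, cls=False)
-
-# Return format:
-# [[[box], (text, confidence)], ...]
-```
-
-**New API (3.3.0)** - ⚠️ not fully understood:
-```python
-# Option 1: legacy call (Deprecated)
-result = ocr.ocr(image_np) # Warning: Please use predict instead
-
-# Option 2: new call
-from paddlex import create_model
-model = create_model("???") # model name unknown
-result = model.predict(image_np)
-
-# Return format: ???
-```
-
-### 2. Errors Encountered
-
-**Error 1**: the `cls` parameter is no longer supported
-```python
-# Error: PaddleOCR.predict() got an unexpected keyword argument 'cls'
-result = ocr.ocr(image_np, cls=False) # ❌
-```
-
-**Error 2**: the return format changed
-```python
-# Old parsing code fails:
-text = item[1][0] # ❌ IndexError
-confidence = item[1][1] # ❌ IndexError
-```
-
-**Error 3**: wrong model name
-```python
-model = create_model("PP-OCRv5_server") # ❌ Model not supported
-```
-
---
-
-## 🎯 Research Task List
-
-### Phase 1: API Research (high priority)
-
-- [ ] **Read the official documentation**
-  - Full PP-OCRv5 documentation
-  - PaddleX API documentation
-  - Migration guide (if one exists)
-
-- [ ] **Understand the new API**
-  ```python
-  # Things to figure out:
-  1. The correct import path
-  2. How to initialize the model
-  3. predict() parameters and return format
-  4. How to tell handwritten from printed text
-  5. Whether there is a dedicated handwriting-detection feature
-  ```
-
-- [ ] **Write test scripts**
-  - `test_pp_ocrv5_api.py` - test basic API calls
-  - Print the complete result data structure
-  - Compare the return values of v4 and v5
-
-### Phase 2: Server Adaptation
-
-- [ ] **Rewrite the server code**
-  - Adapt to the new API
-  - Parse the returned data correctly
-  - Keep the REST interface compatible
-
-- [ ] **Test stability**
-  - Test 10 PDF samples
-  - Check GPU utilization
-  - Compare against v4 performance
-
-### Phase 3: Handwriting Detection
-
-- [ ] **Look for handwriting-detection capability**
-  ```python
-  # Possible approaches:
-  1. Does result contain a text_type field?
-  2. Is there a dedicated handwriting_detection model?
-  3. Is there a confidence difference that can be exploited?
-  4. PP-Structure layout analysis?
-  ```
-
-- [ ] **Comparison testing**
-  - v4 (current solution) vs v5
-  - Accuracy, recall, speed
-  - Handwriting-detection capability
-
-### Phase 4: Integration Decision
-
-- [ ] **Performance evaluation**
-  - If v5 is better → upgrade
-  - If the improvement is marginal → stay on v4
-
-- [ ] **Documentation update**
-  - Document how to use v5
-  - Update PADDLEOCR_STATUS.md
-
---
-
-## 🔍 Debugging Tips
-
-### 1. Inspect the Full Returned Data
-
-```python
-import pprint
-result = model.predict(image)
-pprint.pprint(result) # print every field in full
-
-# Or
-import json
-print(json.dumps(result, indent=2, ensure_ascii=False))
-```
-
-### 2. Find Official Examples
-
-```bash
-# Look for examples in the PaddleOCR installation on the server
-find ~/Project/paddleocr-server/venv/lib/python3.12/site-packages/paddleocr -name "*.py" | grep example
-
-# Read the source code
-less ~/Project/paddleocr-server/venv/lib/python3.12/site-packages/paddleocr/paddleocr.py
-```
-
-### 3. List Available Models
-
-```python
-from paddlex.inference.models import OFFICIAL_MODELS
-print(OFFICIAL_MODELS) # list all supported model names
-```
-
-### 4. Web Documentation Search
-
-Key places to look:
-- https://github.com/PaddlePaddle/PaddleOCR
-- https://www.paddleocr.ai
-- https://github.com/PaddlePaddle/PaddleX
-
---
-
-## 📂 File Structure
-
-```
-/Volumes/NV2/pdf_recognize/
-├── CURRENT_STATUS.md # current status document ✅
-├── NEW_SESSION_HANDOFF.md # this file ✅
-├── PADDLEOCR_STATUS.md # detailed technical documentation ✅
-├── SESSION_INIT.md # initial session information
-│
-├── paddleocr_client.py # stable client (v2.7.3) ✅
-├── paddleocr_server_v5.py # v5 server (unfinished) ⚠️
-│
-├── test_paddleocr_client.py # basic test
-├── test_mask_and_detect.py # masking + detection
-├── test_opencv_separation.py # Method 1+2
-├── test_opencv_advanced.py # Method 3 (best) ✅
-├── extract_signatures_paddleocr_improved.py # full pipeline
-│
-└── check_rejected_for_missing.py # diagnostic script
-```
-
-**Server side** (`ssh gblinux`):
-```
-~/Project/paddleocr-server/
-├── paddleocr_server.py # v2.7.3 stable ✅
-├── paddleocr_server_v5.py # v5 version (to be completed) ⚠️
-├── paddleocr_server_backup.py # backup
-├── server_stable.log # current runtime log
-└── venv/ # virtual environment
-```
-
---
-
-## ⚡ Quick Start
-
-### Start the Stable Server (v2.7.3)
-
-```bash
-ssh gblinux
-cd ~/Project/paddleocr-server
-source venv/bin/activate
-python paddleocr_server.py
-```
-
-### Test the Connection
-
-```bash
-# Local Mac
-cd /Volumes/NV2/pdf_recognize
-source venv/bin/activate
-python test_paddleocr_client.py
-```
-
-### Create the New Research Branch
-
-```bash
-cd /Volumes/NV2/pdf_recognize
-git checkout -b pp-ocrv5-research
-```
-
---
-
-## 🚨 Cautions
-
-### 1. Do Not Break the Stable Version
-
-- Keep the `paddleocr-improvements` branch stable
-- Run all v5 experiments on the new `pp-ocrv5-research` branch
-- Keep `paddleocr_server.py` (v2.7.3) on the server
-- Name for the new code: `paddleocr_server_v5.py`
-
-### 2. Environment Isolation
-
-- The server virtual environment may need to be rebuilt
-- Alternatively, isolate v4 and v5 with Docker
-- Avoid version conflicts
-
-### 3. Performance Testing
-
-- Record concrete metrics for v4 and v5
-- Test at least 10 samples
-- Include speed, accuracy, and recall
-
-### 4. Documentation-Driven
-
-- Record every finding in the documentation
-- Write down API usage clearly
-- This eases future maintenance
-
---
-
-## 📊 Success Criteria
-
-### Minimum Goals
-
-- [ ] Run basic PP-OCRv5 OCR successfully
-- [ ] Understand how to call the new API
-- [ ] Server runs stably
-- [ ] Complete documentation recorded
-
-### Ideal Goals
-
-- [ ] Find a handwriting-detection feature
-- [ ] Performance exceeds the v4 solution
-- [ ] Reduce pipeline complexity
-- [ ] Raise accuracy above 90%
-
-### Decision Points
-
-**If v5 is clearly better** → upgrade to v5 and retire v4
-**If the v5 improvement is marginal** → stay on v4; keep v5 as a research record only
-**If v5 has bugs** → wait for an official fix and use v4 in the meantime
-
---
-
-## 📞 Troubleshooting
-
-### When You Hit a Problem
-
-1. **Check the logs first**: `tail -f ~/Project/paddleocr-server/server_stable.log`
-2. **Read the source**: find the PaddleOCR code inside the venv
-3. **Search Issues**: https://github.com/PaddlePaddle/PaddleOCR/issues
-4. **Downgrade test**: confirm that v2.7.3 still works
-
-### Common Problems
-
-**Q: The server fails to start?**
-A: Check the numpy version (must be < 2.0)
-
-**Q: Model not found?**
-A: Model names may have changed; check OFFICIAL_MODELS
-
-**Q: An API call fails?**
-A: Compare against the official documentation; the parameter format may have changed
-
---
-
-## 🎓 Learning Resources
-
-### Official Documentation
-
-1. **PP-OCRv5**: https://www.paddleocr.ai/main/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5.html
-2. **PaddleOCR GitHub**: https://github.com/PaddlePaddle/PaddleOCR
-3. **PaddleX**: https://github.com/PaddlePaddle/PaddleX
-
-### Related Technologies
-
-- PaddlePaddle deep learning framework
-- PP-Structure document structure analysis
-- Handwriting Recognition
-- Layout Analysis
-
---
-
-## 💡 Tips
-
-### If Built-in Handwriting Detection Is Found
-
-Possible usage:
-```python
-# Guess 1: the result includes a type
-for item in result:
-    text_type = item.get('type') # 'printed' or 'handwritten'?
-
-# Guess 2: a dedicated layout model
-from paddlex import create_model
-layout_model = create_model("PP-Structure")
-layout_result = layout_model.predict(image)
-# May return: text, handwriting, figure, table...
-
-# Guess 3: confidence difference
-# Handwritten text may have lower confidence
-```
-
-### If There Is No Built-in Handwriting Detection
-
-Then the current OpenCV Method 3 remains the best solution, and v5 only offers better OCR accuracy.
-
---
-
-## ✅ Completion Checklist
-
-Once the research is done, make sure:
-
-- [ ] New API usage is fully understood and documented
-- [ ] Server code is rewritten and passes testing
-- [ ] Performance comparison data is recorded
-- [ ] Decision document written (upgrade vs. stay on v4)
-- [ ] Code committed to the `pp-ocrv5-research` branch
-- [ ] `CURRENT_STATUS.md` updated
-- [ ] If upgrading: merged into the main branch
-
---
-
-**Good luck with the research!** 🚀
-
-If questions come up, consult:
-- `CURRENT_STATUS.md` - details of the current solution
-- `PADDLEOCR_STATUS.md` - technical details and problem analysis
-
-**Most important**: record every finding; whether it succeeds or fails, it is valuable experience!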
diff --git a/SESSION_CHECKLIST.md b/SESSION_CHECKLIST.md
deleted file mode 100644
index 627f7ff..0000000
--- a/SESSION_CHECKLIST.md
+++ /dev/null
@@ -1,195 +0,0 @@
-# Session Handoff Checklist ✓
-
-## Before You Exit This Session
-
-- [x] All documentation written
-- [x] Test results recorded (7/10 signatures, 70% recall)
-- [x] Session initialization files created
-- [x] .gitignore configured
-- [x] Commit guide prepared
-- [ ] **Git commit performed** (waiting for user approval)
-
-## Files Created for Next Session
-
-### Essential Files ⭐
-- [x] **SESSION_INIT.md** - Read this first in next session
-- [x] **NEW_SESSION_PROMPT.txt** - Copy-paste prompt template
-- [x] **PROJECT_DOCUMENTATION.md** - Complete 24KB history
-- [x] **HOW_TO_CONTINUE.txt** - Visual guide
-
-### Supporting Files
-- [x] README.md - Quick start guide
-- [x] COMMIT_SUMMARY.md - Git instructions
-- [x] README_page_extraction.md - Page extraction docs
-- [x] README_hybrid_extraction.md - Signature extraction docs
-- [x] .gitignore - Configured properly
-
-### Working Scripts
-- [x] extract_pages_from_csv.py - Tested (100 files)
-- [x] extract_signatures_hybrid.py - Tested (5 files, 70% recall)
-- [x] extract_handwriting.py - Component script
-
-## What's Working ✅
-
-| Component | Status | Details |
-|-----------|--------|---------|
-| Page extraction | ✅ Working | 100 files tested |
-| VLM name extraction | ✅ Working | 100% accurate on 5 files |
-| CV detection | ⚠️ Conservative | Finds 70% of signatures |
-| VLM verification | ✅ Working | 100% precision, no false positives |
-| Overall system | ✅ Working | 70% recall, 100% precision |
-
-## What's Not Working / Unknown ⚠️
-
-| Issue | Status | Next Steps |
-|-------|--------|------------|
-| Missing 30% signatures | Known | Tune CV parameters |
-| Text layer method | Untested | Need PDFs with text |
-| Large-scale performance | Unknown | Test with 100+ files |
-| Full dataset (86K) | Unknown | Estimate time & optimize |
-
-## Critical Context to Remember 🧠
-
-1. **VLM coordinates are unreliable** (32% offset on test file)
-   - Don't use VLM for location detection
-   - Use VLM for name extraction only
-
-2. **Name-based approach is the solution**
-   - VLM extracts names ✓
-   - CV finds locations ✓
-   - VLM verifies regions ✓
-
-3. **Test file with coordinate issue:**
-   - `201301_2458_AI1_page4.pdf`
-   - VLM found 2 names but coordinates pointed to blank areas
-   - Actual signatures at 26% (reported as 58% and 68%)
-
-## To Start Next Session
-
-### Simple Method (Recommended)
-```bash
-cat /Volumes/NV2/pdf_recognize/NEW_SESSION_PROMPT.txt
-# Copy output and paste to new Claude Code session
-```
-
-### Manual Method
-Tell Claude:
-> "I'm continuing the PDF signature extraction project at `/Volumes/NV2/pdf_recognize/`. Please read `SESSION_INIT.md` and `PROJECT_DOCUMENTATION.md` to understand the current state. I want to [choose option from SESSION_INIT.md]."
-
-## Quick Commands Reference
-
-### View Documentation
-```bash
-less /Volumes/NV2/pdf_recognize/SESSION_INIT.md
-less /Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md
-```
-
-### Run Scripts
-```bash
-cd /Volumes/NV2/pdf_recognize
-source venv/bin/activate
-python extract_signatures_hybrid.py # Main script
-```
-
-### Check Results
-```bash
-ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
-```
-
-### View Session Handoff
-```bash
-cat /Volumes/NV2/pdf_recognize/HOW_TO_CONTINUE.txt
-```
-
-## What Can Be Improved (Future Work)
-
-### Priority 1: Increase Recall
-- Current: 70%
-- Target: 90%+
-- Method: Tune CV parameters in lines 178-214 of extract_signatures_hybrid.py
-
-### Priority 2: Scale Testing
-- Current: 5 files tested
-- Next: 100 files
-- Future: 86,073 files (full dataset)
-
-### Priority 3: Optimization
-- Current: ~24 seconds per PDF
-- Consider: Parallel processing, batch VLM calls
-
-### Priority 4: Text Layer Testing
-- Current: Untested (all PDFs are scanned)
-- Need: Find PDFs with searchable text layer
-
-## Verification Steps
-
-Before next session, verify files exist:
-```bash
-cd /Volumes/NV2/pdf_recognize
-
-# Check essential docs
-ls -lh SESSION_INIT.md PROJECT_DOCUMENTATION.md NEW_SESSION_PROMPT.txt
-
-# Check working scripts
-ls -lh extract_pages_from_csv.py extract_signatures_hybrid.py
-
-# Check test results
-ls /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png | wc -l
-# Should show: 7 (the 7 verified signatures)
-```
-
-## Known Good State
-
-### Environment
-- Python: 3.9+ with venv
-- Ollama: http://192.168.30.36:11434
-- Model: qwen2.5vl:32b
-- Working directory: /Volumes/NV2/pdf_recognize/
-
-### Test Data
-- 5 PDFs processed
-- 7 signatures extracted
-- All verified (100% precision)
-- 3 signatures missed (70% recall)
-
-### Output Files
-```
-201301_1324_AI1_page3_signature_張志銘.png (33 KB)
-201301_1324_AI1_page3_signature_楊智惠.png (37 KB)
-201301_2061_AI1_page5_signature_廖阿甚.png (87 KB)
-201301_2458_AI1_page4_signature_周寶蓮.png (230 KB)
-201301_2923_AI1_page3_signature_黄瑞展.png (184 KB)
-201301_3189_AI1_page3_signature_黄益辉.png (24 KB)
-201301_3189_AI1_page3_signature_黄辉.png (84 KB)
-```
-
-## Git Status (Pre-Commit)
-
-Files staged for commit:
-- [ ] extract_pages_from_csv.py
-- [ ] extract_signatures_hybrid.py
-- [ ] extract_handwriting.py
-- [ ] README.md
-- [ ] PROJECT_DOCUMENTATION.md
-- [ ] README_page_extraction.md
-- [ ] README_hybrid_extraction.md
-- [ ] .gitignore
-
-**Waiting for:** User to review docs and approve commit
-
-## Session Health Check ✓
-
-- [x] All scripts working
-- [x] Test results documented
-- [x] Issues identified and recorded
-- [x] Next steps defined
-- [x] Session continuity files created
-- [x] Git commit prepared
-
-**Status:** ✅ Ready for handoff
-
---
-
-**Last Updated:** October 26, 2025
-**Session End:** Ready for next session
-**Next Action:** User reviews docs → Git commit → Continue work
diff --git a/paper/Paper_A_IEEE_Access_Draft_v4.3_20260604.pandoc.docx
b/paper/Paper_A_IEEE_Access_Draft_v4.3_20260604.pandoc.docx
new file mode 100644
index 0000000..8f9f701
Binary files /dev/null and b/paper/Paper_A_IEEE_Access_Draft_v4.3_20260604.pandoc.docx differ
diff --git a/paper/paper_a_v4_combined.md b/paper/paper_a_v4_combined.md
index 207ca50..1790be5 100644
--- a/paper/paper_a_v4_combined.md
+++ b/paper/paper_a_v4_combined.md
@@ -7,9 +7,9 @@ author: "[Authors removed for double-blind review]"
 
-Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a very low inter-CPA coincidence rate (per-comparison ICCR $0.000010$; per-signature $0.006$; per-document $0.012$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A never coincides cross-firm yet fires on $82\%$ of its own ($\sim 139\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.
+Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a low inter-CPA coincidence rate (per-comparison ICCR $0.000010$; per-signature $0.006$; per-document $0.012$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A never coincides cross-firm yet fires on $82\%$ of its own ($\sim 139\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.
 
-
+
 
 # I. Introduction
 
@@ -26,17 +26,17 @@ A methodological concern shapes the research design. Many prior similarity-based
 
 Despite the significance of the problem for audit quality and regulatory oversight, to our knowledge no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation.
 
-In this paper we present a fully automated, end-to-end pipeline for screening non-hand-signed CPA signatures in audit reports at scale, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour under explicit unsupervised assumptions. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) disclosure of each diagnostic's untested assumption (§III-M).
+In this paper we present a fully automated, end-to-end pipeline for screening non-hand-signed CPA signatures in audit reports at scale, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour under explicit unsupervised assumptions. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-I); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-J.1); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-K.4); and (8) disclosure of each diagnostic's untested assumption (§III-N).
 
 We are deliberate about what the system claims. The operating thresholds are *operator-tunable* rather than asserted as ground-truth decision boundaries: the contribution is not a fixed detector that pronounces a signature non-hand-signed, but (a) a dual-descriptor design that separates *style consistency* from *image reproduction*, and (b) a methodology for choosing and characterising a screening operating point in the absence of labels, so that an operator can set a specificity target and read off what each setting yields. Operationally the framework is a semi-automated triage step that surfaces a tractable set of replication candidates from hundreds of thousands of signatures for human adjudication; it does not adjudicate. The firm-level results and the byte-identical capture check are reported as *demonstrations that this triage works at scale*, not as forensic determinations.
 
-A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-I.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.
+A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-K.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.
-In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019); §III-L.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000010$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0059$ (CPA-block bootstrap 95% $[0.0045, 0.0073]$), and per-document ICCR $= 0.012$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.175$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide. +In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019); §III-I.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. 
On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000010$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0059$ (CPA-block bootstrap 95% $[0.0045, 0.0073]$), and per-document ICCR $= 0.012$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.175$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide. With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR $0.0059$), the deployed rule fires on each firm's *actual* same-CPA pools far above the inter-CPA coincidence floor: Firm A at $0.82$ ($\sim 139\times$ floor), Firms B/C/D at $0.24$–$0.35$ ($\sim 40$–$59\times$). Firm A scored against the clean 2013–2019 baseline coincides essentially never ($0.0001$, below the clean-baseline floor itself) — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. 
Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within $\sim 3.5\times$ (odds ratio $1.73$ for B, $0.49$ for C). Under the deployed any-pair rule, within-firm collision concentration is a *universal* Big-4 pattern — $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and the advisory moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels. -Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-J's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. 
A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G. +Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-L's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G. We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. The Big-4 sub-corpus comprises 437 CPAs and 150,442 signatures with both descriptors available. @@ -56,7 +56,7 @@ The contributions of this paper are: 7. 
**K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair. -8. **Annotation-free positive-anchor capture check and unsupervised-setting disclosure.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. Each supporting diagnostic in §III-M addresses one specific failure mode of an unsupervised screening classifier — composition artefacts, inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold sensitivity, or positive-anchor capture — with an explicitly disclosed untested assumption. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review. +8. **Annotation-free positive-anchor capture check and unsupervised-setting disclosure.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. Each supporting diagnostic in §III-N addresses one specific failure mode of an unsupervised screening classifier — composition artefacts, inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold sensitivity, or positive-anchor capture — with an explicitly disclosed untested assumption. 
We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review.
 The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity positive-anchor check, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work.
@@ -137,7 +137,7 @@ Under mild regularity conditions, White's quasi-MLE result [41] supports interpr
 The present study uses these tools diagnostically: first to test whether the descriptor distribution supports a natural operating boundary, and then, when that support fails under composition decomposition, to motivate anchor-based ICCR calibration of a fixed deployed rule.
 *Cross-validation in a small-cluster scope.*
-Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit.
Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the deployed five-way operational classifier (§III-H.1; calibrated separately in §III-L). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier.
+Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-M differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the deployed five-way operational classifier (§III-H.1; calibrated separately in §III-I). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier.