From b6913d2f93af0ab22c7ee89bfb8e8c5f245d2429 Mon Sep 17 00:00:00 2001
From: gbanyan
Date: Fri, 15 May 2026 18:07:31 +0800
Subject: [PATCH] =?UTF-8?q?Phase=206=20round-2=20reviewer=20revisions:=20?=
 =?UTF-8?q?=C2=A7III-H.1=20promotion=20+=20framing=20alignment?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Structural:
- Promote the operational classifier definition from §III-L.0 to a new
  §III-H.1, so the reader meets the five-way HC/MC/HSC/UN/LH rule before
  the §III-I/J/K diagnostic chain instead of ~130 lines after it. §III-L
  renamed to "Anchor-Based Threshold Calibration"; §III-L.0 retains only
  the calibration methodology, the three units of analysis, the any-pair
  semantics, and the FAR terminological note. §III-L.7 deleted
  (redundant with §III-J).
- Reorganise §V-H Limitations into Primary / Secondary / Documented
  features / Engineering groupings (was a flat 14-item list).
- Reframe §III-M from "ten-tool unsupervised-validation collection" to
  "each diagnostic addresses one specific unsupervised failure mode";
  rename "What v4.0 does/does not claim" → "Limits / Scope of the
  present analysis"; retitle Table XXVII.

Framing alignment (cross-section):
- Strip all v3.x / v4.0 / v3.20 / v4-new / inherited lineage labels from
  rendered text (Abstract, Intro, §II, §III, §IV, §V, §VI, Appendix,
  Impact).
- Replace "Paper A" rule references with "deployed" rule references.
- Soften "validation" to "characterise" / "check" / "screening label" /
  "consistency check" / "support"; "verdict" → "screening label".
- Remove codex-verified spike claims (non-Big-4 jittered dHash, Big-4
  pooled cosine after firm-mean centring). Only formally scripted
  evidence (Scripts 39b–39e) is retained; non-Big-4 evidence is framed
  as corroborating raw-axis cosine, not as calibration evidence.
- Strip script-provenance parentheticals from the Introduction; defer
  Script 39c internal references and similar to Methodology / Appendix.
Numerical / table fixes:
- §III-C document-count arithmetic: 12 corrupted → 13
  corrupted/unreadable, verified against the sqlite DB and total-pdf/
  folder counts (90,282 - 4,198 no-sig - 13 corrupted = 86,071 → 85,042
  with detections → 182,328 sigs → 168,755 CPA-matched). Table I shows
  VLM-positive (86,084) and processed-for-extraction (86,071) as
  separate rows.
- Wilson 95% CIs added for the joint-rule ICCR rows in Table XXI / the
  methodology table ([0.00011, 0.00018] and [0.00008, 0.00014]).
- Unit error fixed: 0.3856 pp / 0.4431 pp → 0.3856 (38.6 pp) /
  0.4431 (44.3 pp).

Smaller revisions:
- Pipeline framing: "detecting" → "screening" in Abstract / Intro /
  Conclusion for consistency with the unsupervised-screening
  positioning.
- "hard ground-truth subset" → "conservative hard-positive subset"
  throughout.
- §III-F SSIM / pixel-comparison rebuttal compressed from ~15 lines to
  4; the design-level argument deferred to supplementary materials.
- "stakeholders can adopt / can derive thresholds" → "alternative
  operating points can be characterised by inverting" (less
  prescriptive).
- "the same mechanism extending in milder form to Firms B/C/D" →
  "similar, milder production-related reuse patterns at Firms B/C/D"
  (mechanism claim softened).
- Appendix A "non-hand-signed mode" / "two-mechanism mixture" lineage
  language aligned with the v4 framing.

Appendix B:
- Rebuilt as a redirect-only stub. The HTML-commented obsolete table
  mapping (Table IX–XVIII labels with FAR / capture-rate / validation
  language) is removed and replaced with a short paragraph pointing to
  supplementary materials for full table-to-script provenance.

Cross-references:
- All §III-L references for the rule definition retargeted to §III-H.1;
  references for calibration still point to §III-L.
- §III-H references for byte-level Firm A evidence / the non-Big-4
  reverse anchor retargeted to §III-H.2.

Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1314 lines
  (was 1346 pre-review).
- Two review handoff documents added:
  paper/review_handoff_abstract_intro_20260515.md
  paper/review_handoff_body_20260515.md

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 paper/paper_a_abstract_v3.md                  |    2 +-
 paper/paper_a_appendix_v3.md                  |   33 +-
 paper/paper_a_conclusion_v3.md                |    6 +-
 paper/paper_a_discussion_v3.md                |   52 +-
 paper/paper_a_impact_statement_v3.md          |    6 +-
 paper/paper_a_introduction_v3.md              |   18 +-
 paper/paper_a_methodology_v3.md               |  197 ++-
 paper/paper_a_references_v3.md                |    4 +-
 paper/paper_a_related_work_v3.md              |   12 +-
 paper/paper_a_results_v3.md                   |   92 +-
 paper/paper_a_v4_combined.md                  | 1314 +++++++++++++++++
 .../review_handoff_abstract_intro_20260515.md |  361 +++++
 paper/review_handoff_body_20260515.md         |  397 +++++
 13 files changed, 2267 insertions(+), 227 deletions(-)
 create mode 100644 paper/paper_a_v4_combined.md
 create mode 100644 paper/review_handoff_abstract_intro_20260515.md
 create mode 100644 paper/review_handoff_body_20260515.md

diff --git a/paper/paper_a_abstract_v3.md b/paper/paper_a_abstract_v3.md
index ebfc068..d8ee96f 100644
--- a/paper/paper_a_abstract_v3.md
+++ b/paper/paper_a_abstract_v3.md
@@ -2,6 +2,6 @@
 
-Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes reusing a stored signature image across reports — through administrative stamping or firm-level electronic signing — undermining individualized attestation. We build an end-to-end pipeline detecting such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is $0.62$ versus $0.09$–$0.16$ at Firms B/C/D after pool-size adjustment, and under the deployed any-pair rule $77$–$99\%$ of inter-CPA collisions concentrate within the source firm — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
+Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports — through administrative stamping or firm-level electronic signing — thereby undermining individualized attestation. We build an end-to-end pipeline for screening such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is $0.62$ versus $0.09$–$0.16$ at Firms B/C/D after pool-size adjustment, and under the deployed any-pair rule $77$–$99\%$ of inter-CPA collisions concentrate within the source firm — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
 

diff --git a/paper/paper_a_appendix_v3.md b/paper/paper_a_appendix_v3.md
index 71881ea..62cf56d 100644
--- a/paper/paper_a_appendix_v3.md
+++ b/paper/paper_a_appendix_v3.md
@@ -27,38 +27,13 @@ First, the procedure consistently identifies a "transition" under every bin widt
 The $Z$ statistics also inflate superlinearly with the bin width (Firm A cosine $|Z|$ rises from $\sim 9$ at bin 0.003 to $\sim 106$ at bin 0.015) because wider bins aggregate more mass per bin and therefore shrink the per-bin standard error on a very large sample. Both features are characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity.
 
-Second, the candidate transitions all locate *inside* the non-hand-signed mode (cosine $\geq 0.975$, dHash $\leq 10$) rather than between modes, which is the location pattern we would expect of a clean two-mechanism boundary.
+Second, the candidate transitions all locate *inside* the high-similarity region (cosine $\geq 0.975$, dHash $\leq 10$) rather than at a between-mode boundary, which is the location pattern we would expect of a clean within-population antimode.
 
 Taken together, Table A.I shows that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the non-hand-signed mode rather than between modes.
 
-This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator and reinforces the joint reading of Section IV-D that per-signature similarity does not form a clean two-mechanism mixture.
+This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator and reinforces the joint reading of Section IV-D that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold.
 
 Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials.
 
-# Appendix B. Table-to-Script Provenance
+# Appendix B. Reproducibility Materials
 
-For reproducibility, the following table maps each numerical table in Section IV to the analysis script that produces its underlying values and to the report file emitted by that script. Scripts are under `signature_analysis/`. Report artifact paths below are listed relative to the project's analysis report root, which is `/Volumes/NV2/PDF-Processing/signature-analysis/` in our local deployment; replicators should rebase the paths to whatever report root they configure when invoking the scripts.
-
-
-
-The table-to-script mapping above is intended as a navigation aid for replicators. All scripts run deterministically under the fixed random seeds documented in the supplementary materials; the artifact paths above were verified against the local deployment at the time of submission, and any reviewer reproduction step should re-emit the artifacts from the listed scripts rather than depend on the absolute path layout.
+The full table-to-script provenance mapping, script source code, and report artefacts for every numerical table and figure in this paper are provided in the supplementary materials. Scripts run deterministically under fixed random seeds documented there; reviewer reproduction should re-emit artefacts from the listed scripts rather than rely on any local path layout.

diff --git a/paper/paper_a_conclusion_v3.md b/paper/paper_a_conclusion_v3.md
index 46c5509..6ce9541 100644
--- a/paper/paper_a_conclusion_v3.md
+++ b/paper/paper_a_conclusion_v3.md
@@ -1,7 +1,7 @@
 # VI. Conclusion and Future Work
 
-We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is an inherited Paper A five-way per-signature classifier with worst-case document-level aggregation (§III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population.
+We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature classifier with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population.
 
-Our central methodological contributions are: (1) a composition decomposition (Scripts 39b–39e) that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the inherited operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) a ten-tool unsupervised-validation collection (§III-M Table XXVII) that explicitly discloses each tool's untested assumption and positions the system as an anchor-calibrated screening framework with human-in-the-loop review, not as a validated forensic detector.
+Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (§III-M Table XXVII), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.
 
-Future work falls in four directions. *First*, a small-scale human-rated validation set would enable direct ROC optimisation and provide signature-level ground truth that v4.0 fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with v3.x's byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the v4.0 calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
+Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.

diff --git a/paper/paper_a_discussion_v3.md b/paper/paper_a_discussion_v3.md
index a6c9bc8..23bd89b 100644
--- a/paper/paper_a_discussion_v3.md
+++ b/paper/paper_a_discussion_v3.md
@@ -6,64 +6,70 @@ Non-hand-signing differs from forgery in that the questioned signature is produc
 
 ## B. Per-Signature Similarity is a Continuous Quality Spectrum; the Accountant-Level Multimodality is Composition-Driven
 
-A central empirical finding of v3.x was that *per-signature* similarity does not admit a clean two-mechanism mixture: dip-test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 strengthens and extends this signature-level reading.
-
-The Big-4 accountant-level descriptor distribution does reject unimodality on both marginals at $p < 5 \times 10^{-4}$ (Script 34). v4.0's composition decomposition (§III-I.4; Scripts 39b–39e) shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual non-Big-4 firm with $\geq 500$ signatures tested (cosine: Scripts 39b/39c; jittered-dHash: Script 39d for Big-4 plus codex-verified read-only spike for the ten non-Big-4 firms; see §III-I.4). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.
+The Big-4 accountant-level descriptor distribution rejects unimodality on both marginals at $p < 5 \times 10^{-4}$ (§IV-D Table V). The composition decomposition of §III-I.4 shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within the Big-4 firms, the descriptor marginals at the signature level are unimodal once integer ties are broken (Scripts 39b, 39d); eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (Script 39c) but are not used as calibration evidence (§III-I.4). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.
 
 ## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)
 
-Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). The additional v3.x finding that the 145 Firm A pixel-identical signatures span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches across different fiscal years, is inherited from v3.20.0 §IV-F.1 / Script 28 / Appendix B byte-decomposition output and was not regenerated in v4.0's spike scripts; we retain those numbers by reference.
+Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (Appendix B) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.
 
-In v4.0 we treat Firm A as a *templated-end case study* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). Firm A's high per-document HC$+$MC alarm rate of $0.62$ (versus Firms B/C/D's $0.09$–$0.16$) reflects high inter-CPA collision concentration under the deployed rule on real same-CPA pools, consistent with firm-specific template, stamp, or document-production reuse — though the inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence of v3.x §IV-F.1 (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence that firm-level template reuse does occur at Firm A; the within-firm collision pattern at all four Big-4 firms is consistent with that mechanism extending in milder form to Firms B/C/D.
+We treat Firm A as a *templated-end case study within the Big-4 sub-corpus* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). Firm A's high per-document HC$+$MC alarm rate of $0.62$ (versus Firms B/C/D's $0.09$–$0.16$) reflects high inter-CPA collision concentration under the deployed rule on real same-CPA pools, consistent with firm-specific template, stamp, or document-production reuse — though the inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the within-firm collision pattern at all four Big-4 firms is consistent with similar, milder production-related reuse patterns at Firms B/C/D.
 
 ## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
 
 Leave-one-firm-out cross-validation of the Big-4 mixture fit reveals a sharp contrast between K=2 and K=3 behaviour. K=2 is unstable: across-fold cosine-crossing deviation is $0.028$, and holding Firm A out gives a fold rule (cos $> 0.938$, dHash $\leq 8.79$) that classifies $100\%$ of held-out Firm A in the upper component, while holding any non-Firm-A Big-4 firm out gives a fold rule near (cos $> 0.975$, dHash $\leq 3.76$) that classifies $0\%$ of the held-out firm in the upper component. The K=2 boundary is essentially a Firm-A-vs-others separator — direct evidence that the K=2 partition reflects firm-compositional rather than mechanistic structure.
 
-K=3 in contrast has a *reproducible component shape* at the descriptor-position level: across the four folds the C1 (low-cos / high-dHash) component cosine mean varies by at most $0.005$, the dHash mean by at most $0.96$, and the weight by at most $0.023$. Hard-posterior membership for the held-out firm is composition-sensitive (absolute differences $1.8$–$12.8$ pp across folds). Together with the §III-I.4 composition decomposition (no within-population bimodal antimode), the K=3 stability supports a descriptive reading: the Big-4 descriptor plane has a reproducible three-region partition that reflects how firm-compositional weight is distributed across the descriptor space, *not* a three-mechanism latent-class structure. We accordingly do not use K=3 hard-posterior membership as an operational classifier; we use it as the accountant-level descriptive summary that complements the deployed signature-level five-way classifier of §III-L.
+K=3 in contrast has a *reproducible component shape* at the descriptor-position level: across the four folds the C1 (low-cos / high-dHash) component cosine mean varies by at most $0.005$, the dHash mean by at most $0.96$, and the weight by at most $0.023$. Hard-posterior membership for the held-out firm is composition-sensitive (absolute differences $1.8$–$12.8$ pp across folds). Together with the §III-I.4 composition decomposition (no within-population bimodal antimode), the K=3 stability supports a descriptive reading: the Big-4 descriptor plane has a reproducible three-region partition that reflects how firm-compositional weight is distributed across the descriptor space, *not* a three-mechanism latent-class structure. We accordingly do not use K=3 hard-posterior membership as an operational classifier; we use it as the accountant-level descriptive summary that complements the deployed signature-level five-way classifier of §III-H.1.
 
 ## E. Three-Score Convergent Internal-Consistency
 
-Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score, not a mechanism cluster posterior); the reverse-anchor cosine percentile under a non-Big-4 reference distribution; and the inherited Paper A box-rule less-replication-dominated rate. The three scores are *not* statistically independent measurements — they are deterministic functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency rather than external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The strength of the convergence (all pairwise $|\rho| > 0.87$) and its persistence at the signature level (Cohen $\kappa = 0.87$ between per-CPA-fit and per-signature-fit K=3 binary labels) are nevertheless informative: per-CPA aggregation does not collapse the broad three-region ordering, and three different summarisations of the descriptor space produce broadly concordant per-CPA rankings, with a residual non-Firm-A disagreement (the reverse-anchor cosine percentile ranks Firm D fractionally above Firm C, while the mixture posterior and the box-rule rate rank Firm C highest among non-Firm-A firms).
+Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score, not a mechanism cluster posterior); the reverse-anchor cosine percentile under a non-Big-4 reference distribution; and the deployed box-rule less-replication-dominated rate. The three scores are *not* statistically independent measurements — they are deterministic functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency rather than external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The strength of the convergence (all pairwise $|\rho| > 0.87$) and its persistence at the signature level (Cohen $\kappa = 0.87$ between per-CPA-fit and per-signature-fit K=3 binary labels) are nevertheless informative: per-CPA aggregation does not collapse the broad three-region ordering, and three different summarisations of the descriptor space produce broadly concordant per-CPA rankings, with a residual non-Firm-A disagreement (the reverse-anchor cosine percentile ranks Firm D fractionally above Firm C, while the mixture posterior and the deployed box-rule rate rank Firm C highest among non-Firm-A firms).
 
 ## F. Anchor-Based Multi-Level Calibration
 
-The operational specificity of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR replicates v3.x's per-comparison rate (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
+The operational specificity of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR matches a prior published per-comparison rate (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit.
The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$. -Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is $0.234$ (Wilson 95% $[0.190, 0.285]$): given the cosine gate, the structural dimension provides further per-comparison specificity at $\sim 4.3\times$ refinement. Second, the alert-rate sensitivity analysis (§III-L.5; Script 46) shows the inherited HC threshold is locally sensitive rather than plateau-stable (local gradient $\approx 25\times$ the median for cosine, $\approx 3.8\times$ for dHash); stakeholders requiring different specificity-alert-yield operating points can derive thresholds by inverting the ICCR curves (a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint gives per-signature ICCR $\approx 0.045$). The MC/HSC sub-band boundary at dHash$=15$, by contrast, *is* plateau-like (local-to-median ratio $\approx 0.08$), consistent with high-dHash-tail saturation. +Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is $0.234$ (Wilson 95% $[0.190, 0.285]$): given the cosine gate, the structural dimension provides further per-comparison specificity at $\sim 4.3\times$ refinement. Second, the alert-rate sensitivity analysis (§III-L.5) shows the deployed HC threshold is locally sensitive rather than plateau-stable (local gradient $\approx 25\times$ the median for cosine, $\approx 3.8\times$ for dHash); alternative operating points can be characterised by inverting the ICCR curves (e.g., a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint corresponds to per-signature ICCR $\approx 0.045$). The MC/HSC sub-band boundary at dHash$=15$, by contrast, *is* plateau-like (local-to-median ratio $\approx 0.08$), consistent with high-dHash-tail saturation. -## G. 
Pixel-Identity as a Hard Positive Anchor; Inherited Inter-CPA Negative Anchor Reframed as Coincidence Rate +## G. Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor -The only hard ground-truth subset in the corpus is pixel-identical signatures: those whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce byte-identical images, so these signatures are conservative-subset ground truth for the *replicated* class. On the Big-4 subset ($n = 262$ pixel-identical signatures), all three candidate checks — the inherited box rule, the K=3 hard label, and the reverse-anchor metric with a prevalence-calibrated cut — achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). We caution that this result is necessary but not sufficient: for the box rule it is close to tautological, because byte-identical neighbours have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region. The corresponding signature-level *negative* anchor evidence is developed in §III-L.1 above (v4 spike: cos$>0.95$ per-comparison ICCR $= 0.00060$, replicating v3.20.0's reported $0.0005$ under prior "FAR" terminology). We frame the per-comparison rate as a specificity proxy under the assumption that inter-CPA pairs constitute a clean negative anchor, and we document in §III-L.4 that this assumption is partially violated by within-firm cross-CPA template-like collision structures. +The only conservative hard-positive subset in the corpus is pixel-identical signatures: those whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce byte-identical images, so these signatures are a conservative hard-positive subset for image replication. 
On the Big-4 subset ($n = 262$ pixel-identical signatures), all three candidate checks — the deployed box rule, the K=3 hard label, and the reverse-anchor metric with a prevalence-calibrated cut — achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). We caution that this result is necessary but not sufficient: for the deployed box rule it is close to tautological, because byte-identical neighbours have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region. The corresponding signature-level *negative* anchor evidence is developed in §III-L.1 above (per-comparison ICCR $= 0.00060$ at cos$>0.95$, consistent with the prior published rate of $0.0005$). We frame the per-comparison rate as a specificity proxy under the assumption that inter-CPA pairs constitute a clean negative anchor, and we document in §III-L.4 that this assumption is partially violated by within-firm cross-CPA template-like collision structures. ## H. Limitations -Several limitations should be transparent. The first nine are v4.0-specific; the last five are inherited from v3.20.0 §V-G and still apply to the v4.0 pipeline. +Several limitations should be transparent. We group them into primary methodological limitations, secondary scope and validation caveats, documented design features, and engineering-level caveats of the pipeline. + +**Primary methodological limitations.** *No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M). 
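The independence-limit pool scaling invoked in §III-L above can be made concrete with a small numerical sketch. The helper below is illustrative only, and the pool size of 834 used in the comment is back-solved from the reported rates for the example rather than taken from the pipeline:

```python
def pool_normalised_rate(p_pair: float, n_pool: int) -> float:
    """Per-signature coincidence rate implied by a per-comparison rate
    p_pair when the classifier takes the extremum over n_pool
    independent comparisons: 1 - (1 - p_pair)^n_pool."""
    return 1.0 - (1.0 - p_pair) ** n_pool

# Back-solved illustration: with the joint per-comparison ICCR of
# 0.00014, an effective pool of roughly 834 comparisons reproduces the
# order of the pooled per-signature figure (about 0.11).
example = pool_normalised_rate(0.00014, 834)
```

The sketch makes the unit mismatch explicit: a per-comparison rate of $10^{-4}$ order is compatible with a per-signature rate of $10^{-1}$ order once the extremum over a same-CPA pool is taken.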
-*Inter-CPA negative-anchor assumption is partially violated and the violation is firm-dependent.* The cross-firm hit matrix of §III-L.4 shows that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs. Because the violation is firm-dependent, Firm A's per-firm ICCR is more contaminated by within-firm sharing than Firms B/C/D's; the per-firm B/C/D rates of $0.09$–$0.16$ are therefore closer to a clean specificity estimate than the pooled rate, and the Firm A vs Firms B/C/D contrast reflects both genuine firm heterogeneity and a firm-dependent proxy-contamination gradient.
+*Inter-CPA negative-anchor assumption is partially violated and the violation is firm-dependent.* The cross-firm hit matrix of §III-L.4 shows that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs. Because the violation is firm-dependent, Firm A's per-firm ICCR is more contaminated by within-firm sharing than Firms B/C/D's; the per-firm B/C/D rates of $0.09$–$0.16$ may therefore be less contaminated than the pooled rate, and the Firm A vs Firms B/C/D contrast reflects both genuine firm heterogeneity and a firm-dependent proxy-contamination gradient.

-*Scope.* The v4.0 primary analyses are scoped to the Big-4 sub-corpus. We did not perform the full per-signature pool-normalised ICCR analysis at the full $n = 686$ scope; the §IV-K full-dataset Spearman re-run shows the K=3 $+$ box-rule rank-convergence is preserved at $n = 686$ but does not validate the Big-4 operational ICCRs, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope.
+*Scope.* The primary analyses are scoped to the Big-4 sub-corpus. We did not perform the full per-signature pool-normalised ICCR analysis at the full $n = 686$ scope; the §IV-K full-dataset Spearman re-run shows the K=3 $+$ deployed box-rule rank-convergence is preserved at $n = 686$ but does not establish portability of the Big-4 operational ICCRs, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope.

-*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the inherited box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
+**Secondary scope and validation caveats.**
+
-*Inherited rule components are not separately v4-validated.* The five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain their v3.20.0 calibration and capture-rate evidence; v4.0's anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-validated by v4.0's diagnostic battery.
+*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the deployed box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
+
+*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain their prior calibration and capture-rate evidence (Appendix B); the anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
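For concreteness, the sub-band boundaries cited above (cosine gate $0.95$; dHash breakpoints $5$ and $15$) can be sketched as a small rule function. This is a simplified sketch: signatures below the cosine gate are collapsed into a single placeholder label, whereas the deployed five-way rule distinguishes further cases, and gating the style-consistency band on the cosine threshold is an assumption of the sketch:

```python
def band_label(max_cos: float, min_dhash: float) -> str:
    """Simplified confidence band for one signature's closest same-CPA
    match (max cosine, min dHash).  Band edges follow the text; the
    'below-gate' catch-all is a placeholder, not a deployed label."""
    if max_cos > 0.95:
        if min_dhash <= 5:
            return "HC"   # high confidence: cos > 0.95 and dHash <= 5
        if min_dhash <= 15:
            return "MC"   # moderate confidence: 5 < dHash <= 15
        return "HSC"      # style-consistency band: dHash > 15
    return "below-gate"
```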
*Deployed-rate excess is not a presumed true-positive rate.* The $\sim 44$-pp per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the inter-CPA proxy rate (HC: $0.18$) cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; within-firm template sharing inflates the inter-CPA proxy baseline). The gap is best read as a same-CPA repeatability signal.

*A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.

-*K=3 hard-posterior membership is composition-sensitive.* The K=3 hard-posterior membership for any single firm varies by up to $12.8$ pp across LOOO folds. This is documented as a composition-sensitivity band rather than failure, but it means K=3 hard labels are not used as v4.0 operational classifier output; they are reported only as accountant-level descriptive characterisation.
+**Documented design features.**
+
-*No partner-level mechanism attribution.* v4.0 reports population-level patterns; it does not perform partner-level mechanism attribution or report-level claims of intent. The signature-level outputs are signature-level quantities throughout. The within-firm cross-CPA collision concentration of §III-L.4 is consistent with template-like reuse but is not by itself diagnostic of deliberate sharing.
+*K=3 hard-posterior membership is composition-sensitive.* The K=3 hard-posterior membership for any single firm varies by up to $12.8$ pp across LOOO folds. This is documented as a composition-sensitivity band rather than failure, but it means K=3 hard labels are not used as operational classifier output; they are reported only as accountant-level descriptive characterisation.

-*Transferred ImageNet features (inherited from v3.20.0).* The ResNet-50 feature extractor uses pre-trained ImageNet weights without signature-domain fine-tuning. While our backbone-ablation study (§IV-L, inherited from v3.20.0 §IV-I) and prior literature support the effectiveness of transferred ImageNet features for signature comparison, a signature-domain fine-tuned feature extractor could improve discriminative performance.
+*No partner-level mechanism attribution.* The analysis reports population-level patterns; it does not perform partner-level mechanism attribution or report-level claims of intent. The signature-level outputs are signature-level quantities throughout. The within-firm cross-CPA collision concentration of §III-L.4 is consistent with template-like reuse but is not by itself diagnostic of deliberate sharing.

-*Red-stamp HSV preprocessing artifacts (inherited from v3.20.0).* The red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. Blended pixels are replaced with white, potentially creating small gaps in signature strokes that could reduce dHash similarity. This bias would push classifications toward false negatives rather than false positives.
+**Engineering-level caveats of the pipeline.**
+
-*Longitudinal scan / PDF / compression confounds (inherited from v3.20.0).* Scanning equipment, PDF generation software, and compression algorithms may have changed over the 2013–2023 study period, potentially affecting similarity measurements.
+*Transferred ImageNet features.* The ResNet-50 feature extractor uses pre-trained ImageNet weights without signature-domain fine-tuning. While our backbone-ablation study (§IV-L) and prior literature support the effectiveness of transferred ImageNet features for signature comparison, a signature-domain fine-tuned feature extractor could improve discriminative performance.

-*Source-exemplar misattribution in max/min pair logic (inherited from v3.20.0).* The max-cosine / min-dHash detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case where one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as a stamping or e-signature template, the pair correctly identifies image reuse but misattributes non-hand-signed status to the source exemplar. This affects at most one source document per template variant per CPA and is not expected to be common.
+*Red-stamp HSV preprocessing artifacts.* The red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. Blended pixels are replaced with white, potentially creating small gaps in signature strokes that could reduce dHash similarity. This bias would push classifications toward false negatives rather than false positives.

-*Legal and regulatory interpretation (inherited from v3.20.0).* Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a jurisdiction-specific legal question. Our technical analysis can inform such determinations but cannot resolve them.
+*Longitudinal scan / PDF / compression confounds.* Scanning equipment, PDF generation software, and compression algorithms may have changed over the 2013–2023 study period, potentially affecting similarity measurements.
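The red-stamp removal step described above can be illustrated with a minimal hue-and-saturation mask. The thresholds below (red dominance 0.15, saturation 0.4, value 0.3) are placeholders for illustration, not the pipeline's actual HSV bounds:

```python
import numpy as np

def remove_red_stamp(rgb: np.ndarray) -> np.ndarray:
    """Replace red-ish pixels with white, a minimal stand-in for the
    HSV stamp-removal step.  `rgb` is H x W x 3 uint8; the thresholds
    here are illustrative placeholders."""
    arr = rgb.astype(np.float32) / 255.0
    r, g, b = arr[..., 0], arr[..., 1], arr[..., 2]
    maxc = arr.max(axis=-1)
    minc = arr.min(axis=-1)
    # HSV-style saturation; guard the divide where the pixel is black
    sat = np.where(maxc > 0, (maxc - minc) / np.where(maxc > 0, maxc, 1.0), 0.0)
    # red hue proxy: R is the dominant channel by a clear margin
    red_hue = (r >= maxc) & (r - np.maximum(g, b) > 0.15)
    mask = red_hue & (sat > 0.4) & (maxc > 0.3)
    out = rgb.copy()
    out[mask] = 255  # stamp pixels become white background
    return out
```

As the limitation above notes, any such mask whitens blended stroke-over-stamp pixels too, which is the source of the false-negative-leaning bias.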
While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.
+
+*Source-exemplar misattribution in max/min pair logic.* The max-cosine / min-dHash detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case where one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as a stamping or e-signature template, the pair correctly identifies image reuse but misattributes non-hand-signed status to the source exemplar. This affects at most one source document per template variant per CPA and is not expected to be common.
+
+*Legal and regulatory interpretation.* Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a jurisdiction-specific legal question. Our technical analysis can inform such determinations but cannot resolve them.

diff --git a/paper/paper_a_impact_statement_v3.md b/paper/paper_a_impact_statement_v3.md
index 9fa176a..a73ebbd 100644
--- a/paper/paper_a_impact_statement_v3.md
+++ b/paper/paper_a_impact_statement_v3.md
@@ -4,7 +4,7 @@
ARCHIVED. Not part of the IEEE Access submission. IEEE Access Regular Papers do not include a separate Impact Statement section. The text below is retained for possible reuse in a cover letter, grant report, or non-IEEE venue. It is excluded from the
-assembled paper by export_v3.py.
+assembled paper by the manuscript export script.

If reused, note that the wording "distinguishes genuinely hand-signed signatures from reproduced ones" overstates what a five-way confidence

@@ -17,5 +17,5 @@ external use.
Auditor signatures on financial reports are a key safeguard of corporate accountability.
When the signature on an audit report is produced by reproducing a stored image instead of by the partner's own hand---whether through an administrative stamping workflow or a firm-level electronic signing system---this safeguard is weakened, yet detecting the practice through manual inspection is infeasible at the scale of modern financial markets. We developed a pipeline that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan.
-Combining deep-learning visual features with perceptual hashing and two methodologically distinct threshold estimators (plus a density-smoothness diagnostic), the system stratifies signatures into a five-way confidence-graded classification and quantifies how the practice varies across firms and over time.
-After further validation, the technology could support financial regulators in screening signature authenticity at national scale.
+Combining deep-learning visual features with perceptual hashing, distributional diagnostics, and anchor-based inter-CPA coincidence-rate calibration, the system stratifies signatures into a five-way confidence-graded classification and quantifies how the practice varies across firms and over time.
+With a future labelled evaluation set, the technology could support financial regulators in screening candidate non-hand-signed signatures at national scale.

diff --git a/paper/paper_a_introduction_v3.md b/paper/paper_a_introduction_v3.md
index dbab293..a853b79 100644
--- a/paper/paper_a_introduction_v3.md
+++ b/paper/paper_a_introduction_v3.md
@@ -10,17 +10,17 @@ The distinction between *non-hand-signing detection* and *signature forgery dete
A methodological concern shapes the research design. Many prior similarity-based classification studies rely on ad-hoc thresholds — declaring two images equivalent above a hand-picked cosine cutoff, for example — without principled statistical justification.
Such thresholds are fragile in an archival-data setting. A defensible approach requires (i) explicit calibration of the operational thresholds against measurable negative-anchor evidence; (ii) diagnostic procedures that test whether the descriptor distribution itself supports a within-population threshold, including formal decomposition of apparent multimodality into between-group composition and integer-tie artefacts; (iii) annotation-free reporting of operational alarm rates at multiple analysis units (per-comparison, per-signature pool, per-document) with Wilson 95% confidence intervals; (iv) per-firm stratification of the reported rates to surface heterogeneity that aggregate metrics conceal; and (v) explicit disclosure of the unsupervised setting's limits — in particular, the inability to estimate true error rates without signature-level ground-truth labels.

-Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation.
+Despite the significance of the problem for audit quality and regulatory oversight, to our knowledge no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation.
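Since the structural descriptor used throughout is a difference hash, a minimal sketch of the standard 9x8 dHash construction may help the reader; the pipeline's exact resizing and preprocessing may differ from this simple index-sampling version:

```python
import numpy as np

def dhash_bits(gray: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Difference hash of a 2-D grayscale array: downsample to
    hash_size x (hash_size + 1) cells by index sampling, then compare
    horizontally adjacent cells.  Returns hash_size**2 bits."""
    rows = np.linspace(0, gray.shape[0] - 1, hash_size).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, hash_size + 1).astype(int)
    small = gray[np.ix_(rows, cols)].astype(float)
    return (small[:, 1:] > small[:, :-1]).ravel()

def dhash_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between the two images' dHash bit vectors."""
    return int(np.count_nonzero(dhash_bits(a) != dhash_bits(b)))
```

A distance of 0 corresponds to structurally identical downsampled gradients, which is why byte-identical signature pairs sit at dHash $\approx 0$.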
-In this paper we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale, together with a multi-tool validation framework that explicitly discloses the unsupervised setting's limits. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) a multi-tool unsupervised validation strategy with disclosed assumption-violation analysis (§III-M). +In this paper we present a fully automated, end-to-end pipeline for screening non-hand-signed CPA signatures in audit reports at scale, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour under explicit unsupervised assumptions. 
The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) disclosure of each diagnostic's untested assumption (§III-M). -The methodological reframing relative to earlier versions of this work is central to our v4.0 contribution. Earlier work in this lineage adopted a distributional path to thresholds — fitting accountant-level finite-mixture models and treating their marginal crossings as data-derived "natural" thresholds. v4.0 reports a composition decomposition diagnostic (§III-I.4) that overturns this reading: the apparent multimodality of the Big-4 accountant-level distribution is fully explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. Once both confounds are removed (firm-mean centring plus uniform integer jitter), the Big-4 pooled dHash dip test yields $p_{\text{median}} = 0.35$ across five jitter seeds, eliminating the rejection. 
Within-firm signature-level cosine dip tests fail to reject in every individual Big-4 firm and in every individual mid/small firm with $\geq 500$ signatures (10 firms tested in Script 39c), and the corresponding within-firm jittered-dHash dip tests likewise fail to reject in all four Big-4 firms (Script 39d) and across a codex-verified read-only spike on the same ten mid/small firms ($0/10$ reject; §III-I.4). The descriptor distributions therefore contain no within-population bimodal antimode that could anchor an operational threshold. +A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-I.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors. -In place of distributional anchoring, v4.0 adopts an anchor-based inter-CPA coincidence-rate (ICCR) calibration. 
At the per-comparison unit, the inherited cos$>0.95$ operating point yields ICCR $= 0.00060$ on a $5 \times 10^5$-pair Big-4 sample (replicating v3.x's reported per-comparison rate of $0.0005$ under prior "FAR" terminology); the dHash$\leq 5$ structural cutoff yields ICCR $= 0.00129$ (v4 new); the joint rule cos$>0.95$ AND dHash$\leq 5$ yields joint ICCR $= 0.00014$ (any-pair semantics, matching the deployed extrema rule). At the pool-normalised per-signature unit, the same rule's effective coincidence rate is materially higher because the deployed classifier takes max-cosine and min-dHash over a same-CPA pool: pooled Big-4 any-pair ICCR is $0.1102$ (Wilson 95% CI $[0.1086, 0.1118]$; CPA-block bootstrap 95% $[0.0908, 0.1330]$). At the per-document unit, the operational HC$+$MC alarm fires on $33.75\%$ of Big-4 documents under the inter-CPA candidate-pool counterfactual. +In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration. At the per-comparison unit, the cos$>0.95$ operating point yields ICCR $= 0.00060$ on a $5 \times 10^5$-pair Big-4 sample; the dHash$\leq 5$ structural cutoff yields ICCR $= 0.00129$; the joint rule cos$>0.95$ AND dHash$\leq 5$ yields joint ICCR $= 0.00014$ (any-pair semantics, matching the deployed extrema rule). At the pool-normalised per-signature unit, the same rule's effective coincidence rate is materially higher because the deployed classifier takes max-cosine and min-dHash over a same-CPA pool: pooled Big-4 any-pair ICCR is $0.1102$ (Wilson 95% CI $[0.1086, 0.1118]$; CPA-block bootstrap 95% $[0.0908, 0.1330]$). At the per-document unit, the operational HC$+$MC alarm fires on $33.75\%$ of Big-4 documents under the inter-CPA candidate-pool counterfactual. -The pooled per-signature and per-document rates conceal striking firm heterogeneity. 
A logistic regression of the per-signature hit indicator on firm dummies (Firm A reference) and centred log pool size yields odds ratios of $0.053$ (Firm B), $0.010$ (Firm C), and $0.027$ (Firm D) — Firms B/C/D are an order of magnitude below Firm A even after controlling for the pool-size confound (Script 44). Cross-firm hit matrix analysis under the deployed any-pair rule shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (Table XXV; the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). The pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — though not by itself diagnostic of deliberate sharing. We retain the inherited Paper A v3.x five-way box rule as the operational classifier; v4.0's contribution is to characterise its multi-level coincidence behaviour against the inter-CPA negative anchor rather than to derive new thresholds. +The pooled per-signature and per-document rates conceal striking firm heterogeneity. A logistic regression of the per-signature hit indicator on firm dummies (Firm A reference) and centred log pool size yields odds ratios of $0.053$ (Firm B), $0.010$ (Firm C), and $0.027$ (Firm D) — Firms B/C/D are an order of magnitude below Firm A even after controlling for the pool-size confound. Cross-firm hit matrix analysis under the deployed any-pair rule shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (Table XXV; the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). The pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — though not by itself diagnostic of deliberate sharing. 
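The Wilson 95% intervals reported for the ICCR estimates can be reproduced from a hit count and a sample size. A minimal sketch follows; the count 16,579 is back-solved from the reported pooled rate of 0.1102 on the 150,442-signature Big-4 sample, so it is illustrative rather than a sourced tally:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Pooled per-signature ICCR (count back-solved from the reported 0.1102)
lo, hi = wilson_ci(16_579, 150_442)   # ~ (0.1086, 0.1118)

# Zero-miss positive anchor: one-sided upper bound for 0 misses in 262
_, ub = wilson_ci(0, 262)             # ~ 0.0145
```

The zero-numerator case reduces to the closed form $z^2/(n+z^2)$, which for $n = 262$ gives the $1.45\%$ upper bound quoted for the positive-anchor capture check.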
The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour rather than to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose false-positive and false-negative error rates remain unknown in the absence of signature-level labels. -Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$ (Script 38): the K=3 mixture posterior (now interpreted as a firm-compositional position score, not a mechanism cluster posterior; §III-J), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the inherited box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. Hard ground truth for the *replicated* class is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G. +Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-J's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate.
The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G. We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. The Big-4 sub-corpus comprises 437 CPAs and 150,442 signatures with both descriptors available. @@ -32,14 +32,14 @@ The contributions of this paper are: 3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between *style consistency* and *image reproduction*, and we validate the backbone choice through a feature-backbone ablation. -4. **Composition decomposition disproves the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; "natural threshold" language in this lineage's prior work is not empirically supported. +4. 
**Composition decomposition disproves the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported. 5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide. 6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-firm rates differ by an order of magnitude after pool-size adjustment (Firm A's per-document HC$+$MC alarm at $0.62$ versus Firms B/C/D at $0.09$–$0.16$). Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms); the pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — a descriptive finding about deployed-rule behaviour, not a claim of deliberate template sharing. -7. 
**K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (no longer interpreted as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair. +7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair. -8. **Annotation-free positive-anchor validation and unsupervised validation ceiling.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. We frame the overall validation strategy as a multi-tool collection of ten partial-evidence diagnostics (§III-M Table XXVII), each with an explicitly disclosed untested assumption; their conjunction constitutes the unsupervised validation ceiling achievable on this corpus. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review. +8. 
**Annotation-free positive-anchor capture check and unsupervised-setting disclosure.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. Each supporting diagnostic in §III-M addresses one specific failure mode of an unsupervised screening classifier — composition artefacts, inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold sensitivity, or positive-anchor capture — with an explicitly disclosed untested assumption. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review. The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity positive-anchor checks, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work. diff --git a/paper/paper_a_methodology_v3.md b/paper/paper_a_methodology_v3.md index 87b6faf..35e8d19 100644 --- a/paper/paper_a_methodology_v3.md +++ b/paper/paper_a_methodology_v3.md @@ -4,19 +4,19 @@ We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents. Fig. 1 illustrates the overall architecture.
-The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum anchored on whole-sample Firm A percentile heuristics and validated against a byte-level pixel-identity positive anchor and a large random inter-CPA negative anchor. +The pipeline takes as input a corpus of PDF audit reports and produces five-way operational screening labels (§III-H.1) whose behaviour is characterised by pixel-identity positive-anchor capture checks and inter-CPA coincidence-rate calibration (§III-L). Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise. ## B. Data Collection @@ -34,7 +34,8 @@ Table I summarizes the dataset composition. |-----------|-------| | Total PDF documents | 90,282 | | Date range | 2013–2023 | -| Documents with signatures | 86,072 (95.4%) | +| Signature-page candidates (VLM-positive) | 86,084 (95.3%) | +| Processed for signature extraction | 86,071 (95.3%) | | Unique CPAs identified | 758 | | Accounting firms | >50 | --> @@ -47,8 +48,8 @@ The model was configured with temperature 0 for deterministic output. The scanning range was restricted to the first quarter of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document. Scanning terminated upon the first positive detection.
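The first-quarter scan with early stopping can be sketched as follows. `is_signature_page` is a hypothetical stand-in for the per-page VLM query, and the ceiling-based rounding of the quarter boundary is an assumption, not a detail stated in the text:

```python
import math
from typing import Callable, Optional

def find_signature_page(
    n_pages: int,
    is_signature_page: Callable[[int], bool],
) -> Optional[int]:
    """Scan only the first quarter of a document's pages, stopping at
    the first positive detection; return the page index or None."""
    limit = max(1, math.ceil(n_pages / 4))   # first-quarter restriction
    for page in range(limit):
        if is_signature_page(page):          # stand-in for the VLM call
            return page                      # terminate on first positive
    return None

# e.g. a 40-page report is scanned only over pages 0..9
hit = find_signature_page(40, lambda p: p == 3)
```

Documents whose signature page falls outside the first quarter would be missed by construction, which is consistent with treating the 4,198 no-signature classifications as a VLM-stage outcome rather than ground truth.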
-This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. -An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents. +This process identified 86,084 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. +An additional 13 PDFs that could not be rendered (corruption or read errors) were excluded, yielding a final set of 86,071 documents. Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents. The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling. @@ -73,8 +74,7 @@ The model was trained for 100 epochs on a 425/75 training/validation split with Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content. -Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). 
-The remaining 7.4% (13,573 signatures) could not be matched to a registered CPA name---typically because the auditor's report page format deviates from the standard two-signature layout, or because OCR of the printed CPA name on the page returns a name not present in the registry---and these signatures are excluded from all subsequent same-CPA pairwise analyses (a same-CPA best-match statistic is undefined when a signature has no assigned CPA). The 92.6% matched subset is the sample that flows into Sections IV-D through IV-H; the unmatched 7.4% are excluded for definitional reasons rather than discarded as noise. +Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). The matched records assume standard two-signature ordering; residual order-mismatch risk remains for nonstandard layouts. The remaining 7.4% (13,573 signatures) could not be matched to a registered CPA name---typically because the auditor's report page format deviates from the standard two-signature layout, or because OCR of the printed CPA name on the page returns a name not present in the registry---and these signatures are excluded from all subsequent same-CPA pairwise analyses (a same-CPA best-match statistic is undefined when a signature has no assigned CPA). The 92.6% matched subset forms the candidate pool for same-CPA analyses, before the Big-4 and descriptor-completeness restrictions described in §III-G. ## E. Feature Extraction @@ -85,7 +85,7 @@ Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preserv All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product. 
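A quick numerical check of the normalisation property stated above; the 2048-dimensional random vectors are stand-ins for pooled ResNet-50 features, not corpus data:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=2048)   # stand-in for a pooled ResNet-50 feature vector
b = rng.normal(size=2048)

# L2-normalise, as in Section III-E
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# Cosine similarity of the raw vectors...
cos_full = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# ...equals the plain dot product of the normalised vectors
assert np.isclose(np.dot(a_n, b_n), cos_full)
```

This equality is what allows the large pairwise comparisons later in the pipeline to be computed as matrix products over the stored normalised vectors.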
The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-G). -This design choice is validated by an ablation study (Section IV-L) comparing ResNet-50 against VGG-16 and EfficientNet-B0. +This design choice is supported by an ablation study (Section IV-L) comparing ResNet-50 against VGG-16 and EfficientNet-B0. ## F. Dual-Method Similarity Descriptors @@ -105,30 +105,17 @@ Unlike DCT-based perceptual hashes, dHash is computationally lightweight and par These descriptors provide partially independent evidence. Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. -Non-hand-signing yields extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise. -Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). +Non-hand-signing is expected to yield extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise; scan-stage noise can in principle push a replicated pair off either extremum but rarely both. +Hand-signing, by contrast, often yields high dHash similarity (the overall layout of a signature is typically preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). 
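A minimal sketch of the dHash comparison, assuming the standard 9-wide by 8-high difference-hash construction; the resize-to-grid step and the deployed independent-minimum pooling over same-CPA pools are omitted:

```python
import numpy as np

def dhash_bits(grid: np.ndarray) -> np.ndarray:
    """Difference hash: sign of the horizontal gradient on an 8x9
    grayscale grid (8 rows, 9 columns) -> 64 bits."""
    assert grid.shape == (8, 9)              # 9 columns give 8 diffs/row
    return (grid[:, 1:] > grid[:, :-1]).astype(np.uint8).ravel()

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    """dHash distance = number of differing bits."""
    return int(np.count_nonzero(h1 != h2))

g = np.random.default_rng(1).integers(0, 256, size=(8, 9)).astype(float)
assert hamming(dhash_bits(g), dhash_bits(g)) == 0   # identical image
```

Because the grid compresses away resolution and sub-pixel alignment before the gradient signs are taken, a single-pixel luminance perturbation can flip at most the two comparisons adjacent to it; a replicated pair is therefore expected near distance 0, and the deployed structural cutoff flags dHash $\leq 5$.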
Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline. -We did not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors, and the reasons are specific to what each of those measures was designed to do rather than to how either happened to perform on our corpus. +We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. The supplementary materials contain the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument. -SSIM was developed by Wang et al. [30] as a perceptual quality index for *natural images*, and it factorises local-window image statistics into three components---luminance, contrast, and structural correlation---combined multiplicatively over a sliding window.
-Each of these components is computed at the pixel level on the original-resolution image and is *designed to be sensitive* to small fluctuations in local luminance and local contrast, because that is what makes SSIM track human perception of natural-image quality. -Applied to a binarised auditor's signature crop, exactly those design choices become liabilities: the JPEG block artifacts, scan-noise speckle, and faint scanner-rule ghosts that are routine in a print-scan cycle perturb local luminance and local contrast in every window they touch, and SSIM amplifies those perturbations in the structural-correlation product. -A signature reproduced twice from the same stored image---the very case that defines our positive class---is therefore one in which SSIM is structurally guaranteed to penalise the easily perturbed margins around the strokes, even though the strokes themselves are identical up to rendering noise. -This is a property of how SSIM is constructed, not a finding about how it scored on our data; the empirical observation that the calibration firm exhibits a mean SSIM of only $0.70$ in our corpus is a confirmation of the design-level prediction rather than the basis for the rejection. - -Pixel-level comparison---whether $L_1$, $L_2$, or pixel-identity counting---fails on a stricter design ground. -Pixel-level distances are defined on geometrically aligned images at a common resolution, and they treat any sub-pixel translation, rotation, or rescale as a large perturbation by construction (a one-pixel uniform translation flips a fraction of foreground pixels on a thin-stroke signature crop and inflates pixel L1 distance to the same magnitude as for a different signer's signature). 
-Two scans of the same physical document, however, do not share a common pixel grid: scanner DPI, paper-handling alignment, and PDF-page rasterisation each contribute random sub-pixel offsets, and the print-scan cycle that intervenes between the stored stamp image and the audit-report PDF additionally introduces resolution mismatch and small geometric drift. -A pixel-level descriptor cannot therefore satisfy the basic stability requirement for our task: two presentations of the same stored image must score nearly identically. -We retain pixel-identity counting only as a *threshold-free anchor* (Section III-K), because byte-identical pairs in our corpus are necessarily produced by literal file reuse rather than by repeated scanning, and so they do not interact with the alignment-fragility argument; they are not used as a primary similarity descriptor. - -Cosine similarity on deep embeddings and dHash, in contrast, both remain stable across the print-scan-rasterise cycle by design: cosine on L2-normalised pooled features is invariant to overall scale and bias and degrades gracefully under local-pixel noise that the convolutional backbone has been trained to absorb [14], [21], while dHash compresses the image to a $9 \times 8$ grayscale grid before computing horizontal-gradient signs, which removes the resolution and sub-pixel-alignment sensitivity that breaks pixel-level comparison [19], [27]. -Together they constitute the dual descriptor used throughout the rest of this paper. +Cosine similarity on L2-normalised deep embeddings and dHash both remain stable across the print-scan-rasterise cycle by design [14], [19], [21], [27]; together they constitute the dual descriptor used throughout the rest of this paper. ## G. Unit of Analysis and Scope -We analyse signatures at two units of resolution. 
The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inherited inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I; reported under prior "FAR" terminology in v3.x). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. +We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). 
The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism. @@ -138,35 +125,49 @@ We adopt one stipulation about same-CPA pair detectability: A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication. -**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, §III-L, and the v4-new analyses in §IV-D through §IV-J) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C, §IV-I (inter-CPA negative-anchor coincidence rate), and §IV-L (feature-backbone ablation) report inherited corpus-wide v3.x material that v4.0 does not re-scope to Big-4. §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. 
The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ — the threshold for accountant-level analyses (Scripts 36, 38) — totalling 150,442 Big-4 signatures with both pre-computed descriptors available. Restricting the v4-new analyses to Big-4 is a methodological choice driven by four considerations: +**Scope: the Big-4 sub-corpus.** The primary analyses (§III-I, §III-J, §III-K, §III-L, and the corresponding §IV-D through §IV-J and §IV-M tables) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F; §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ — the threshold for accountant-level analyses — totalling 150,442 Big-4 signatures with both pre-computed descriptors available. Restricting the primary analyses to Big-4 is a methodological choice driven by four considerations: -1. **Leave-one-firm-out fold feasibility.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm. +1. **Restricted generalisability claim and Big-4 institutional comparability.** The primary claims are scoped to the Big-4 audit-report context, where the four firms share comparable institutional scale, document-production infrastructure, and CPA-volume regime; we do not assert that the same descriptive mixture structure or operational alert behaviour extends to mid/small firms. 
The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H.2's reverse-anchor internal-consistency check, (b) as a robustness comparison in §IV-K, and (c) as a corroborating-population check on the dHash discrete-mass-point artefact in §III-I.4. Generalisation beyond Big-4 is left as future work. -2. **Firm A as templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-J K=3 component cross-tab; v3.x byte-level pair analysis referenced in §III-H). v4.0 retains Firm A within the Big-4 scope as a descriptive case study of the templated end, rather than treating Firm A as the calibration anchor for thresholds (the v3.x role of Firm A). +2. **Within-firm cross-CPA collision structure analysis.** §III-L.4 reports a Big-4 cross-firm hit-matrix analysis that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work. -3. **Within-firm cross-CPA collision structure analysis.** §III-L.4 reports a Big-4 cross-firm hit-matrix analysis (Script 44) that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work. +3. **Firm A as templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-J K=3 component cross-tab; byte-level pair analysis referenced in §III-H.2). We retain Firm A within the Big-4 scope as a descriptive case study of the templated end rather than as the calibration anchor for thresholds. -4. 
**Restricted generalisability claim.** v4.0's primary claims are scoped to the Big-4 audit-report context; we do not assert that the same descriptive mixture structure or operational alert behaviour extends to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H's reverse-anchor internal-consistency check, (b) as a robustness comparison in §IV-K, and (c) as a corroborating-population check on the dHash discrete-mass-point artefact in §III-I.4 (Script 39c). Generalisation beyond Big-4 is left as future work. - -We earlier (v4.0 first draft) listed "statistical multimodality at the accountant level" among the scope justifications, on the basis that the Hartigan dip test rejects unimodality on the Big-4 accountant-level marginals. §III-I.4 reports diagnostics (Scripts 39b–39e) that explain the rejection as a joint effect of between-firm composition shift and dHash integer mass points, not as evidence of within-population continuous bimodality. We therefore no longer list dip-test multimodality among the Big-4 scope rationales; the K=3 mixture is retained as a descriptive partition (§III-J), not as inferential evidence for two mechanism modes. +4. **Leave-one-firm-out fold feasibility.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm. **Sample-size reconciliation.** Two Big-4 signature counts appear in this section and §IV: $n = 150{,}442$ for analyses using the pre-computed per-signature descriptors $\text{cos}_s$ (`max_similarity_to_same_accountant`) and $\text{dHash}_s$ (`min_dhash_independent`), and $n = 150{,}453$ for analyses recomputing pair-level metrics directly from the stored feature and dHash byte vectors (Scripts 40b, 43, 44). 
The $11$-signature difference reflects descriptor-completion status: $11$ signatures have feature vectors and dHash byte vectors stored but lack the pre-computed extrema. The $11$ signatures are negligible at population scale and do not shift any reported coincidence rate by more than $0.01$ percentage point. The CPA counts $468$ (all Big-4 CPAs with both vectors stored) and $437$ (Big-4 CPAs with $n_{\text{sig}} \geq 10$ for accountant-level stability) likewise reflect a single uniform exclusion rule rather than analysis-specific subsetting. -## H. Reference Populations +## H. Operational Classifier and Reference Populations -v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing. +### H.1. Deployed Operational Rule -**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). The byte-level pair analysis reported in v3.x §IV-F.1 identifies 145 Firm A pixel-identical signatures at the signature level (Script 40 verifies the 145/262 split among Big-4 pixel-identical signatures); the additional details that v3.x attributes to this analysis (50 distinct Firm A partners of 180 registered; 35 byte-identical matches spanning different fiscal years) are inherited from the Script 28 / Appendix B byte-decomposition output and were not regenerated in the v4.0 spike scripts.
We retain those v3.x details by reference and mark them in the provenance table as "inherited from v3 §IV-F.1 / Script 28." +Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA: -In v4.0, Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate. +1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence. +2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff. +3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration — consistent with a CPA who signs very consistently but not via image reproduction. +4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$. +5. **Likely hand-signed (LH):** Cosine $\leq 0.837$. 
-**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores that v4.0 uses as a cross-check on the inherited per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked. +Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (Appendix B). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L). + +The remainder of this section (§III-H.2) describes the reference populations used to calibrate and cross-check this rule. 
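The five-way rule and the worst-case document aggregation described above can be stated compactly in code. The sketch below is illustrative (function names are ours; the thresholds are the deployed operating points quoted above):

```python
def classify_signature(cos_s: float, dhash_s: float) -> str:
    """Five-way per-signature screening label from (cos_s, dHash_s).

    Thresholds are the deployed operating points: cosine 0.95, the
    all-pairs KDE crossover 0.837, and dHash sub-band cutoffs 5 / 15.
    """
    if cos_s > 0.95:
        if dhash_s <= 5:
            return "HC"   # high-confidence non-hand-signed
        if dhash_s <= 15:
            return "MC"   # moderate-confidence non-hand-signed
        return "HSC"      # high style consistency, no structural corroboration
    if cos_s > 0.837:
        return "UN"       # uncertain band between the KDE crossover and 0.95
    return "LH"           # likely hand-signed


# Worst-case aggregation: a report inherits the most replication-consistent
# label among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH).
RANK = {"HC": 0, "MC": 1, "HSC": 2, "UN": 3, "LH": 4}


def classify_document(signature_labels):
    return min(signature_labels, key=RANK.__getitem__)
```

Note that both cosine comparisons are strict, so a signature at exactly $\text{cos} = 0.95$ falls into the uncertain band rather than the high-confidence box.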
§III-I demonstrates that the descriptor distributions do not provide a within-population natural threshold; §III-J–§III-K develop the descriptive partition and internal-consistency cross-checks; §III-L develops the anchor-based threshold calibration; §III-M discloses the unsupervised-setting limits. + +### H.2. Reference Populations + +The calibration distinguishes two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking. + +**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (Appendix B) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures. + +Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. 
Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate. + +**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores used as a cross-check on the per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked. The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 low-cos / high-dHash component (cos $= 0.946$, dHash $= 9.17$; §III-J); compared to the Big-4 high-cos / low-dHash component (cos $= 0.983$, dHash $= 2.41$; §III-J) the reference is markedly less replication-dominated. 
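The reverse-anchor score derived from this reference reduces to a marginal-CDF lookup with a sign flip. In the sketch below the reference centre $\overline{\text{cos}} = 0.935$ is the Script 38 value quoted above, but the marginal standard deviation is an illustrative placeholder, and the robust MCD fit itself (e.g. via scikit-learn's `MinCovDet` with `support_fraction=0.85`) is elided:

```python
import math


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def reverse_anchor_score(mean_cos: float,
                         ref_mu: float = 0.935,
                         ref_sigma: float = 0.02) -> float:
    """Sign-flipped marginal-cosine CDF value under the non-Big-4 reference.

    ref_mu is the reported reference centre; ref_sigma is illustrative
    (not a reported value). Higher (less negative) scores indicate a CPA
    whose mean cosine sits deeper in the reference left tail, i.e. further
    from the templated end of the descriptor plane.
    """
    return -normal_cdf(mean_cos, ref_mu, ref_sigma)
```

The sign convention matches the reported per-firm means (e.g. $-0.7125$ vs $-0.7672$): a score of $-p$ corresponds to a reference CDF value of $p$.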
The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end of the descriptor plane. This is a "deviation in the less-replication-dominated descriptor-position direction" measure, not a "deviation toward the templated descriptor-position" measure; the reference is the less-replication-dominated population. ## I. Distributional Diagnostics: Why the Composition Path Does Not Yield a Natural Threshold -This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G and tests whether the distribution provides distributional support — in the form of within-population bimodality — for the operational thresholds inherited from v3.x. We apply four diagnostic procedures in turn: a univariate unimodality test on each accountant-level marginal; a 2D Gaussian mixture fit (developed in §III-J); a density-smoothness diagnostic; and a composition decomposition that distinguishes within-population multimodality from between-firm location-shift artefacts (the v4-new diagnostic battery). The four diagnostics jointly imply that the operational thresholds are *not* anchored by distributional bimodality: §III-L develops an anchor-based calibration framework that does not require this assumption. +This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G and tests whether the distribution provides distributional support — in the form of within-population bimodality — for the deployed operational thresholds. 
We apply four diagnostic procedures in turn: a univariate unimodality test on each accountant-level marginal; a 2D Gaussian mixture fit (developed in §III-J); a density-smoothness diagnostic; and a composition decomposition that distinguishes within-population multimodality from between-firm location-shift artefacts. The four diagnostics jointly imply that the operational thresholds are *not* anchored by distributional bimodality: §III-L develops an anchor-based calibration framework that does not require this assumption. **1. Hartigan dip test on each accountant-level marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds in the comparison scopes tested in Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A CPAs pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). Single-firm dip tests for Firms B, C, and D were not separately computed; the comparison scopes above sufficed to establish that no narrower-than-Big-4 *tested* scope at the accountant level rejected unimodality. The accountant-level Big-4 rejection is a descriptive observation; §III-I.4 below shows that the rejection is fully explained by between-firm location-shift effects rather than within-population bimodality. 
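The bootstrap reporting convention used here (an upper bound of $1/n_{\text{boot}}$ when no replicate reaches the observed statistic, rather than $p = 0$) can be sketched independently of the dip statistic itself, which is elided:

```python
def bootstrap_p_value(obs_stat: float, boot_stats) -> str:
    """Empirical bootstrap p-value with a resolution bound.

    When no bootstrap replicate reaches the observed statistic, the
    p-value is reported as an upper bound (1 / n_boot) rather than zero,
    reflecting the resolution of the bootstrap.
    """
    n_boot = len(boot_stats)
    exceed = sum(1 for b in boot_stats if b >= obs_stat)
    if exceed == 0:
        return f"p < {1.0 / n_boot:g}"
    return f"p = {exceed / n_boot:g}"
```

With $n_{\text{boot}} = 2000$ and zero exceedances this yields the reported bound $p < 5 \times 10^{-4}$.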
@@ -178,7 +179,7 @@ This section characterises the joint distribution of accountant-level descriptor *Within-firm signature-level dip (Scripts 39b, 39c).* Repeating the dip test at the signature level inside each individual Big-4 firm (Script 39b) and inside each individual non-Big-4 firm with $\geq 500$ signatures (Script 39c) yields a consistent picture. The cosine marginal *fails* to reject unimodality in every single firm tested — all four Big-4 firms ($p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ for Firms A through D; Script 39b) and ten non-Big-4 firms with $\geq 500$ signatures ($p_{\text{cos}} \in [0.59, 0.99]$; Script 39c). The raw dHash marginal *does* reject unimodality in every firm tested ($p < 5 \times 10^{-4}$ in all $14$ firms), but the raw dHash values are integer-valued in $\{0, 1, \ldots, 64\}$, leaving open the possibility of an integer-tie artefact. -*Integer-jitter robustness (Scripts 39d, 39e).* Adding independent uniform jitter $\sim \mathrm{U}[-0.5, +0.5]$ to break exact dHash ties and re-running the dip test on the perturbed signature cloud (5 seeds, $n_{\text{boot}} = 2000$; Script 39d) eliminates the dHash within-firm rejection in every Big-4 firm tested (Firm A jittered $p_{\text{median}} = 0.999$; B $0.996$; C $0.999$; D $0.9995$; $0$/$5$ seeds reject at $\alpha = 0.05$ in any firm). A codex-verified read-only spike applying the same jitter procedure to the ten non-Big-4 firms with $\geq 500$ signatures (Script 39c substrate) likewise yields no rejection ($0$/$10$ firms reject at $\alpha = 0.05$; per-firm median-$p$ range $[0.38, 1.00]$). The pooled-Big-4 dHash dip *does* survive jitter alone ($p_{\text{median}} = 0$, $5$/$5$ seeds reject), but Firm A's mean dHash ($2.73$) is substantially below Firms B/C/D's ($6.46$, $7.39$, $7.21$) — a between-firm location shift. 
Script 39e applies a 2 \times 2 factorial correction (firm-mean centring $\times$ integer jitter) on the Big-4 pooled dHash: +*Integer-jitter robustness (Scripts 39d, 39e).* Adding independent uniform jitter $\sim \mathrm{U}[-0.5, +0.5]$ to break exact dHash ties and re-running the dip test on the perturbed signature cloud (5 seeds, $n_{\text{boot}} = 2000$; Script 39d) eliminates the dHash within-firm rejection in every Big-4 firm tested (Firm A jittered $p_{\text{median}} = 0.999$; B $0.996$; C $0.999$; D $0.9995$; $0$/$5$ seeds reject at $\alpha = 0.05$ in any firm). The pooled-Big-4 dHash dip *does* survive jitter alone ($p_{\text{median}} = 0$, $5$/$5$ seeds reject), but Firm A's mean dHash ($2.73$) is substantially below Firms B/C/D's ($6.46$, $7.39$, $7.21$) — a between-firm location shift. Script 39e applies a $2 \times 2$ factorial correction (firm-mean centring $\times$ integer jitter) on the Big-4 pooled dHash: | Condition | Firm-mean centred | Integer jitter | Median dip $p$ | Reject at $\alpha = 0.05$ | |---|---|---|---|---| @@ -189,15 +190,15 @@ This section characterises the joint distribution of accountant-level descriptor Removing *both* the between-firm location shift *and* the integer mass points eliminates the Big-4 dHash rejection. The Big-4 pooled dHash multimodality is therefore fully attributable to firm-composition contrast (primarily Firm A's mean $\text{dHash} = 2.73$ versus Firms B/C/D $\approx 6.5$–$7.4$) and integer-density artefacts, with no residual continuous within-firm bimodality. -*Cosine analogue.* The cosine axis exhibits the same pattern. Codex-verified read-only spike on the Big-4 pooled signature cloud yields signature-level cosine dip $p < 5 \times 10^{-4}$ on the raw data, but $p = 0.597$ after firm-mean centring; accountant-level cosine $p = 1.0$ after firm-mean centring. The cosine multimodality is therefore between-firm composition-driven, not within-population bimodality.
+*Cosine analogue.* The cosine axis follows the same pattern: the within-firm signature-level cosine dip tests above (Scripts 39b, 39c) fail to reject in every Big-4 firm and in every eligible non-Big-4 firm, so the pooled cosine multimodality is attributable to between-firm composition rather than to within-population bimodality. -*Integer-histogram valleys (Script 39d).* A genuine within-firm dHash antimode would appear as a strict local minimum in the count histogram with deep relative depth. Within each of the four Big-4 firms, the dHash histogram on bins $0$–$20$ exhibits no strict local minimum; the Big-4 pooled histogram exhibits one shallow valley at $\text{dHash} = 4$ with relative depth $0.021$ (a $2.1\%$ count drop). No valley near the inherited $\text{dHash} = 5$ operational boundary appears within any individual firm. The hypothesised dHash antimode near $\text{dHash} \approx 5$ is not empirically supported by the histogram analysis. +*Integer-histogram valleys (Script 39d).* A genuine within-firm dHash antimode would appear as a strict local minimum in the count histogram with deep relative depth. Within each of the four Big-4 firms, the dHash histogram on bins $0$–$20$ exhibits no strict local minimum; the Big-4 pooled histogram exhibits one shallow valley at $\text{dHash} = 4$ with relative depth $0.021$ (a $2.1\%$ count drop). No valley near the deployed $\text{dHash} = 5$ operational boundary appears within any individual firm. The hypothesised dHash antimode near $\text{dHash} \approx 5$ is not empirically supported by the histogram analysis. -**5.
Conclusion: no natural threshold from the descriptor distribution.** §III-I.4 jointly establishes that (a) the Big-4 accountant-level dip rejection is fully attributable to between-firm composition and integer mass-point artefacts; (b) within any individual firm, the descriptor marginals at the signature level are unimodal once integer ties are broken; and (c) no integer-histogram valley near the inherited $\text{dHash} = 5$ operational boundary exists within any firm. The descriptor distributions therefore do not contain a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits of §III-I.2 and §III-J are retained as *descriptive partitions* that reflect firm-composition contrast, not as inferential evidence for two or three population modes. §III-L develops the v4.0 anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode. +**5. Conclusion: no natural threshold from the descriptor distribution.** §III-I.4 jointly establishes that (a) the Big-4 accountant-level dip rejection is fully attributable to between-firm composition and integer mass-point artefacts; (b) within the Big-4 firms, the descriptor marginals at the signature level are unimodal once integer ties are broken (Scripts 39b, 39d); (c) eligible non-Big-4 checks provide corroborating raw-axis evidence on the cosine dimension (Script 39c) and corroborate the integer-mass-point reading of raw dHash, but are not used as calibration evidence for the deployed thresholds; and (d) no integer-histogram valley near the deployed $\text{dHash} = 5$ operational boundary exists within any Big-4 firm. The descriptor distributions therefore do not contain a within-population bimodal antimode that could anchor an operational threshold. 
The K=2 / K=3 mixture fits of §III-I.2 and §III-J are retained as *descriptive partitions* that reflect firm-composition contrast, not as inferential evidence for two or three population modes. §III-L develops the anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode. ## J. K=3 as a Descriptive Partition of Firm-Composition Contrast -This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes.** §III-I.4 demonstrates that the apparent multimodality of the accountant-level marginals is fully explained by between-firm location shifts and integer mass-point artefacts, leaving no residual evidence for two or three latent within-population mechanism classes. Neither mixture is used to assign signature-level or document-level labels in the v4.0 primary analysis. The operational classifier of §III-L is calibrated via inter-CPA negative-anchor coincidence rates, not via mixture-derived antimodes. +This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes.** §III-I.4 demonstrates that the apparent multimodality of the accountant-level marginals is fully explained by between-firm location shifts and integer mass-point artefacts, leaving no residual evidence for two or three latent within-population mechanism classes. Neither mixture is used to assign signature-level or document-level labels in the primary analysis. 
The operational classifier of §III-H.1 is calibrated in §III-L via inter-CPA negative-anchor coincidence rates, not via mixture-derived antimodes. **K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ (weight $0.689$) and $(0.983, 2.41)$ (weight $0.311$) (Script 34). $\text{BIC}(K{=}2) = -1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. We refer to the components by index rather than by mechanism labels, since §III-I.4 establishes that the K=2 separation is firm-compositional rather than mechanistic. @@ -209,7 +210,7 @@ This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 account | C2 | 0.9558 | 6.66 | 0.536 | central region | | C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner | -$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical preference for K=3 under standard BIC interpretation, but not by itself decisive). The "descriptive position" column replaces v3.x's "hand-leaning / mixed / replicated" mechanism labels: §III-I.4 establishes that the cosine and dHash axes both lack within-population bimodality, so component centres are best interpreted as locations in a continuous descriptor space rather than as latent mechanism modes. +$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical preference for K=3 under standard BIC interpretation, but not by itself decisive). The "descriptive position" column refrains from any mechanism interpretation: §III-I.4 establishes that the cosine and dHash axes both lack within-population bimodality, so component centres are best interpreted as locations in a continuous descriptor space rather than as latent mechanism modes. 
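The marginal crossings quoted for the K=2 fit (e.g. $\overline{\text{cos}}^* = 0.9755$) are the points where the two weighted component densities intersect on a marginal axis. A minimal bisection sketch follows; the component standard deviations are not reported above, so the values exercised in the usage note are illustrative rather than a reproduction of the Script 34 fit:

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def marginal_crossing(w1, mu1, s1, w2, mu2, s2, tol=1e-10):
    """Bisect on (mu1, mu2) for the point where the weighted component
    densities are equal. Assumes mu1 < mu2 and a single sign change of
    the density difference between the two means."""
    f = lambda x: w1 * normal_pdf(x, mu1, s1) - w2 * normal_pdf(x, mu2, s2)
    lo, hi = mu1, mu2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For equal weights and variances the crossing sits at the midpoint of the means; up-weighting the first component pushes the crossing toward the second mean, which is why the K=2 crossing ($0.9755$) lies much nearer the lighter $0.983$ component than the heavier $0.954$ one.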
**Per-firm component composition (Script 35 firm × cluster cross-tab).** The K=3 partition is dominated by firm membership: @@ -222,23 +223,23 @@ Firm A accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24 **Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.023$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label. 
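The across-fold stability summary above (maximum absolute deviation of the fold cosine crossings from their across-fold mean, compared against the $0.005$ tolerance, plus the pairwise range) reduces to a few lines. In the usage values below the two folds reported only as "near cos $> 0.975$" are filled with an illustrative $0.9750$; only the held-out-Firm-A ($0.9380$) and held-out-Firm-D ($0.9756$) crossings are reported exactly:

```python
def fold_stability(fold_values, tolerance=0.005):
    """Across-fold stability summary: max |fold - mean| and the pairwise
    range of the fold values, compared against a stability tolerance."""
    mean = sum(fold_values) / len(fold_values)
    max_dev = max(abs(v - mean) for v in fold_values)
    rng = max(fold_values) - min(fold_values)
    return {"max_dev": max_dev, "range": rng, "stable": max_dev <= tolerance}
```

With the illustrative fold values the summary reproduces the reported range ($0.0376$) and a max deviation near $0.028$, well outside the tolerance.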
-We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the v4.0 operational classifier: +We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the operational classifier: - The Big-4 K=2 marginal crossing $(0.975, 3.76)$ is essentially a firm-mass separator between Firm A and Firms B + C + D, not a within-Big-4 mechanism boundary. - The Big-4 K=3 mixture exhibits a reproducible three-component shape across LOOO folds at the descriptor-position level, with C1 consistently located at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$. -- Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp); K=3 is therefore not used to assign operational labels to CPAs in v4.0. +- Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp); K=3 is therefore not used to assign operational labels to CPAs. -The operational signature-level classifier of §III-L is calibrated against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the inherited five-way box rule and the K=3 partition appear in §III-K. +The operational signature-level classifier of §III-L is calibrated against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K. ## K. Convergent Internal-Consistency Checks -The descriptive partition of §III-J is supported by three feature-derived per-CPA scores and a hard-ground-truth subset analysis.
We caution at the outset that the three scores are **not statistically independent measurements** — all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ — so their high pairwise rank correlations are partly a mechanical consequence of shared inputs. Per §III-I.4, none of the three scores has a within-population bimodality interpretation; they are firm-compositional position scores at the accountant level. The checks below therefore document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide). +The descriptive partition of §III-J is supported by three feature-derived per-CPA scores and a conservative hard-positive subset analysis. We caution at the outset that the three scores are **not statistically independent measurements** — all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ — so their high pairwise rank correlations are partly a mechanical consequence of shared inputs. Per §III-I.4, none of the three scores has a within-population bimodality interpretation; they are firm-compositional position scores at the accountant level. The checks below therefore document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide). **1. Three feature-derived per-CPA scores (Script 38).** For each Big-4 CPA we compute: - **Score 1 (K=3 posterior on the low-cos / high-dHash component):** $P(\text{C1})$ from the K=3 fit of §III-J. Per §III-J this is a firm-compositional position score on the (cos, dHash) plane (not a probability of any latent "hand-signing mechanism") — a function of both descriptor means. 
-- **Score 2 (reverse-anchor cosine percentile):** the marginal cosine CDF value of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H, sign-flipped so that lower percentile (further into the reference's left tail) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end. This is a function of $\overline{\text{cos}}_a$ alone. -- **Score 3 (inherited binary high-confidence box rule rate):** the per-CPA fraction of signatures that do **not** satisfy the inherited binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$). This is a per-signature-aggregated function of the same descriptors. +- **Score 2 (reverse-anchor cosine percentile):** the marginal cosine CDF value of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H.2, sign-flipped so that lower percentile (further into the reference's left tail) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end. This is a function of $\overline{\text{cos}}_a$ alone. +- **Score 3 (deployed binary high-confidence box rule rate):** the per-CPA fraction of signatures that do **not** satisfy the deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$). This is a per-signature-aggregated function of the same descriptors. Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs (Script 38): @@ -248,63 +249,53 @@ Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs | Score 2 vs Score 3 | $+0.8890$ | $< 10^{-149}$ | | Score 1 vs Score 2 | $+0.8794$ | $< 10^{-142}$ | -We read this as the strongest internal-consistency signal in v4.0: three different summarisations of the same descriptor pair agree on the per-CPA descriptor-position ranking with $\rho > 0.87$. 
The three scores agree on placing Firm A as the most replication-dominated descriptor position and the three non-Firm-A Big-4 firms further from the templated end, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Scores 1 and 3) place Firm C at the less-replication-dominated end of Big-4 (mean P(C1) $= 0.311$; mean box-rule less-replication-dominated rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the inherited box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the less-replication-dominated end between the three non-A Big-4 firms. +We read this as the strongest internal-consistency signal in the analysis: three different summarisations of the same descriptor pair agree on the per-CPA descriptor-position ranking with $\rho > 0.87$. 
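The mechanical component of these rank correlations is easy to demonstrate: two summarisations that share the cosine input are rank-correlated by construction, before any mechanism enters. A synthetic sketch (hypothetical descriptor means, not corpus data; `scipy` assumed):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical per-CPA descriptor means (synthetic stand-ins, not corpus data).
cos_mean = rng.uniform(0.80, 1.00, size=437)
dhash_mean = rng.uniform(0.0, 20.0, size=437)

# Two stand-in summarisations of the shared descriptor pair:
score_blend = -cos_mean + 0.005 * dhash_mean  # blend of both axes (Score-1-like)
score_cos = -cos_mean                         # cosine-only score (Score-2-like)

rho, pval = spearmanr(score_blend, score_cos)
# rho is high although neither score encodes any mechanism: both are
# monotone-dominated by the shared cosine input.
```

This is why the $\rho > 0.87$ figures above are read as internal consistency rather than as three independent confirmations.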
The three scores agree on placing Firm A as the most replication-dominated descriptor position and the three non-Firm-A Big-4 firms further from the templated end, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Scores 1 and 3) place Firm C at the less-replication-dominated end of Big-4 (mean P(C1) $= 0.311$; mean box-rule less-replication-dominated rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the deployed box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the less-replication-dominated end between the three non-A Big-4 firms. **2. Per-signature consistency (Script 39).** Per-CPA aggregation could in principle reflect averaging across within-CPA heterogeneity rather than coherent within-CPA behaviour. We test this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points (Script 39) — and comparing labels. The per-CPA and per-signature K=3 fits recover a broadly similar three-component ordering; per-CPA C1 is at $\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$ vs per-signature C1 at $\overline{\text{cos}} = 0.928$, $\overline{\text{dHash}} = 9.75$ (an absolute cosine drift of $0.018$). 
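The signature-level refit can be sketched with scikit-learn's `GaussianMixture`; everything below is a synthetic illustration (three loose groups standing in for the C1/C2/C3 geometry described above), not the Script 39 fit itself:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic stand-in for signature-level (cos, dHash_indep) points:
# three loose groups mimicking the C1/C2/C3 geometry in the text.
pts = np.vstack([
    rng.normal([0.93, 9.8], [0.02, 1.5], size=(500, 2)),  # C1-like: lower cos, high dHash
    rng.normal([0.96, 6.0], [0.02, 1.5], size=(500, 2)),  # C2-like: intermediate
    rng.normal([0.99, 1.5], [0.01, 1.0], size=(500, 2)),  # C3-like: high cos, low dHash
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(pts)
hard_labels = gmm.predict(pts)
# Order components by descending dHash mean, mirroring the C1/C2/C3 convention.
order = np.argsort(-gmm.means_[:, 1])
```

Comparing hard labels from a per-CPA fit and a per-signature fit on the same points is then a direct label-agreement computation.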
Cohen $\kappa$ on the binary collapse (replication-dominated vs less-replication-dominated): | Pair | Cohen $\kappa$ | |---|---| -| Paper A binary high-confidence box rule vs per-CPA K=3 hard label | $0.662$ | -| Paper A binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ | +| Deployed binary high-confidence box rule vs per-CPA K=3 hard label | $0.662$ | +| Deployed binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ | | Per-CPA K=3 vs per-signature K=3 | $0.870$ | -The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the inherited five-way rule. We note that this comparison validates only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly validate the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, which retains its v3.20.0 calibration and capture-rate evaluation (v3.20.0 Tables IX, XI, XII, XII-B; documented as inherited in §IV-J). +The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the five-way rule. 
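Cohen's $\kappa$ on a binary collapse needs no library support; a stdlib-only sketch with toy labels (illustrative only, not the Script 39 label vectors):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance)

# Toy binary collapse (R = replication-dominated, L = less-replication-dominated):
kappa = cohen_kappa(["R", "R", "L", "L"], ["R", "L", "L", "L"])
```

The chance-agreement correction is what makes $\kappa = 0.56$–$0.66$ between the box rule and the K=3 fits a meaningfully moderate figure rather than a raw 80–90% agreement rate.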
This comparison checks only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly check the five-way rule's $5 < \text{dHash} \leq 15$ moderate-confidence band, whose calibration and capture-rate evidence is reported in Appendix B and not regenerated on the Big-4 subset.

**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference:
- *K=2 LOOO is unstable.* The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$, against the report's $0.005$ across-fold stability tolerance (Script 36; pairwise fold range $0.0376$, from $0.9380$ to $0.9756$). When Firm A is held out, the fold rule classifies $171/171$ of held-out Firm A CPAs as templated; when any non-Firm-A Big-4 firm is held out, the fold rule classifies $0$ of the held-out firm's CPAs as templated. This pattern indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
-- *K=3 LOOO is partially stable.* The C1 (low-cos / high-dHash) component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.023$ in mixture weight (Script 37).
Hard-posterior membership remains composition-sensitive — observed absolute differences are $1.8$–$12.8$ pp across the four folds, with the Firm C fold exceeding the report's $5$ pp viability bar; the report's own screening label is `P2_PARTIAL` ("K=3 is not predictively useful as an operational classifier"). We accordingly do not use K=3 hard-posterior membership as an operational label.

-**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one hard ground-truth subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
+**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one conservative hard-positive subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures serve as conservative positives for the replicated class. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
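The Wilson intervals quoted for this subset follow from the standard score-interval formula; a stdlib sketch (the zero-miss case shows why a zero numerator at $n = 262$ still leaves a non-trivial upper bound):

```python
import math

def wilson_ci(k, n, z=1.959964):
    """Wilson score interval for a binomial proportion k/n at ~95% coverage."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, centre - half), min(1.0, centre + half)

lo, hi = wilson_ci(0, 262)  # zero misses on the 262-signature byte-identical subset
# hi comes out near 0.0145: a [0%, 1.45%] bound is the price of a zero
# numerator at n = 262, not evidence of a zero true miss rate.
```

The same function reproduces the per-comparison joint-rule intervals of §III-L.1 (e.g. $k = 70$ hits in $5 \times 10^5$ pairs).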
The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 v4 sample) and the inherited corpus-wide v3.x version cited at §IV-I (reported under prior "FAR" terminology): +We report each candidate check's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions. This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 sample) and the corpus-wide version cited at §IV-I: | Candidate check | Pixel-identity miss rate (Wilson 95% CI) | |---|---| -| Inherited Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0\%$ $[0\%, 1.45\%]$ | +| Deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0\%$ $[0\%, 1.45\%]$ | | K=3 per-CPA hard label (C3 high-cos / low-dHash corner; descriptive only) | $0\%$ $[0\%, 1.45\%]$ | | Reverse-anchor with prevalence-calibrated cut | $0\%$ $[0\%, 1.45\%]$ | -All three candidate scores correctly assign every byte-identical signature to the replicated class. We caution that for the inherited box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them; v3.x discussed this conservative-subset caveat at length (v3 §III-J item 1, V-F). The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* this check would be disqualified), not a sufficient validation of the classifier's behaviour on the non-byte-identical replicated population. 
The reverse-anchor cut here is chosen by prevalence calibration against the inherited box rule's overall replicated rate ($49.58\%$ of Big-4 signatures; Script 40); this is a documented v4.0 limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation. +All three candidate scores correctly assign every byte-identical signature to the replicated class. We caution that for the box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them. The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* this check would be disqualified), not a sufficient validation of the classifier's behaviour on the non-byte-identical replicated population. The reverse-anchor cut here is chosen by prevalence calibration against the box rule's overall replicated rate ($49.58\%$ of Big-4 signatures); this is a documented limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation. -## L. Anchor-Based Threshold Calibration and Operational Classifier +## L. Anchor-Based Threshold Calibration -§III-I.4 established that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold; the K=3 mixture of §III-J is a descriptive firm-compositional partition, not a mechanism-cluster model. This section develops v4.0's anchor-based threshold calibration: the operational thresholds inherited from v3.x are characterised by their inter-CPA pair-level negative-anchor coincidence behaviour and their pool-normalised per-signature and per-document alert behaviour, at multiple units of analysis. Throughout this section we report **inter-CPA coincidence rates** rather than "False Acceptance Rates"; we explain the terminological choice in §III-L.0. 
+The operational classifier defined in §III-H.1 is calibrated by characterising the deployed thresholds' inter-CPA pair-level negative-anchor coincidence behaviour and their pool-normalised per-signature and per-document alert behaviour, at multiple units of analysis. §III-I.4 establishes that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold; the K=3 mixture of §III-J is a descriptive firm-compositional partition, not a mechanism-cluster model. Throughout this section we report **inter-CPA coincidence rates** rather than "False Acceptance Rates"; we explain the terminological choice in §III-L.0. ### L.0. Calibration methodology -**Operational classifier (inherited from v3.20.0 §III-K, retained unchanged).** Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA: - -1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence. -2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff. -3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration — consistent with a CPA who signs very consistently but not via image reproduction. -4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$. -5. **Likely hand-signed (LH):** Cosine $\leq 0.837$. 
- -The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) are inherited from v3.x §III-K and retain their v3.x calibration provenance. Document-level labels are aggregated via the v3.x worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). - -**Why retained without v4.0 recalibration.** The inherited thresholds preserve continuity with v3.x reporting and with the existing literature. §III-I.4 establishes that a v4.0 recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 confirms that the cosine threshold's specificity behaviour at the inter-CPA pair level (the v3.x calibration anchor) is reproducible on the v4 spike sample, and §III-L.1 newly characterises the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. Sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain v3.x's inherited calibration; v4.0 does not provide independent calibration for those sub-bands. +**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with prior reporting and with the existing literature. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine threshold's specificity behaviour at the inter-CPA pair level and the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain the prior calibration; the present calibration does not provide independent rates for those sub-bands. 
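For concreteness, the five-way assignment and the worst-case document aggregation reduce to a small decision function; a sketch using the thresholds quoted in this section (cos operating point $0.95$, all-pairs KDE crossover $0.837$, dHash sub-band cutoffs $5$ and $15$):

```python
def classify_signature(cos_s, dhash_s, kde_crossover=0.837):
    """Five-way screening label from the per-signature descriptor pair:
    cos_s = max cosine to another same-CPA signature,
    dhash_s = min independent dHash to another same-CPA signature."""
    if cos_s > 0.95:
        if dhash_s <= 5:
            return "HC"   # high-confidence non-hand-signed
        if dhash_s <= 15:
            return "MC"   # moderate-confidence non-hand-signed
        return "HSC"      # high style consistency
    if cos_s > kde_crossover:
        return "UN"       # uncertain
    return "LH"           # likely hand-signed

# Worst-case document aggregation: the most-replication-consistent
# category among a document's signatures wins (HC > MC > HSC > UN > LH).
RANK = {"HC": 0, "MC": 1, "HSC": 2, "UN": 3, "LH": 4}

def document_label(signature_labels):
    return min(signature_labels, key=RANK.get)
```

Note the rectangular decision geometry: the HC/MC/HSC split conditions on dHash only after the cosine gate, which is the geometry contrasted with the Gaussian-mixture posterior boundary in §III-K.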
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question: -- *Per comparison.* For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos $>$ cos\_threshold and / or dHash $\leq$ dHash\_threshold)? This is the unit at which v3.x §IV-I characterised the cosine threshold's specificity behaviour and at which threshold-derivation in biometric verification is conventionally calibrated. We report it for both the cosine and dHash dimensions, marginally and jointly (§III-L.1). +- *Per comparison.* For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos $>$ cos\_threshold and / or dHash $\leq$ dHash\_threshold)? This is the conventional pairwise calibration unit in biometric verification. We report it for both the cosine and dHash dimensions, marginally and jointly (§III-L.1). - *Per signature pool.* For a Big-4 source signature $s$ with same-CPA pool of size $n_{\text{pool}}(s)$, what is the probability that the deployed rule fires *under the counterfactual* of replacing the source's same-CPA pool with $n_{\text{pool}}(s)$ random non-same-CPA candidates? This addresses the standard concern that a per-pair rate computed on independent pairs is not the deployed-rule rate at the per-signature classifier level: the deployed rule takes max-cosine and min-dHash over a pool of size $n_{\text{pool}}(s)$, so its effective coincidence rate is approximately $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit (§III-L.2). - *Per document.* For an audit report aggregated via the worst-case rule, what fraction of documents have at least one signature whose deployed pool-normalised rule fires under the same inter-CPA candidate-replacement counterfactual? This is the operational alarm-rate unit (§III-L.3). 
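The independence-limit pool normalisation in the second bullet is a one-liner; a sketch, plugging in the §III-L.1 joint per-pair rate of $0.00014$ purely for illustration:

```python
def pool_iccr(p_pair, n_pool):
    """Independence-limit coincidence rate for an extremum rule over a pool:
    probability that at least one of n_pool independent inter-CPA candidates
    fires the per-pair rule."""
    return 1.0 - (1.0 - p_pair) ** n_pool

# Joint per-pair rate 0.00014 compounded over a range of pool sizes:
rates = {n: pool_iccr(0.00014, n) for n in (50, 200, 500, 1000)}
# Larger same-CPA pools mechanically inflate the deployed rule's effective
# coincidence rate, which is why the pool-normalised unit is reported
# separately from the per-comparison unit.
```

The compounding is substantial at realistic pool sizes, which motivates reporting all three units rather than the per-pair rate alone.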
-**Any-pair vs same-pair semantics.** The deployed rule uses independent extrema: a signature satisfies the HC rule if $\max_{\text{pool}} \text{cos} > 0.95$ AND $\min_{\text{pool}} \text{dHash} \leq 5$, *not* if a single candidate in the pool satisfies both. We refer to this as the **any-pair** rule. A stricter alternative — the **same-pair** rule — requires a single candidate to satisfy both inequalities; the deployed v3/v4 rule is any-pair, but we report same-pair as a stricter alternative classifier where useful (§III-L.2, §III-L.4). +**Any-pair vs same-pair semantics.** The deployed rule uses independent extrema: a signature satisfies the HC rule if $\max_{\text{pool}} \text{cos} > 0.95$ AND $\min_{\text{pool}} \text{dHash} \leq 5$, *not* if a single candidate in the pool satisfies both. We refer to this as the **any-pair** rule. A stricter alternative — the **same-pair** rule — requires a single candidate to satisfy both inequalities; the deployed rule is any-pair, but we report same-pair as a stricter alternative classifier where useful (§III-L.2, §III-L.4). -**Terminological note on "FAR".** The v3.x and biometric-verification literature speak of "False Acceptance Rate" (FAR) for a per-pair rate computed on independent inter-CPA pairs. We adopt **inter-CPA coincidence rate (ICCR)** as the v4.0 metric name and *do not* use "FAR" in the manuscript prose, for two reasons: (a) FAR has a specific biometric-verification meaning that requires ground-truth negative labels (which the corpus does not provide at the signature level); (b) §III-L.4 shows that the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures. Reading "inter-CPA coincidence rate" as a *specificity proxy* under an explicitly disclosed assumption is faithful to the evidence; reading it as a true biometric FAR would overstate the evidence. 
We retain the v3.x numerical results (which are quantitatively reproduced in §III-L.1) under the new terminology.
+**Terminological note on "FAR".** The biometric-verification literature speaks of "False Acceptance Rate" (FAR) for a per-pair rate computed on independent inter-CPA pairs. We adopt **inter-CPA coincidence rate (ICCR)** as the metric name and *do not* use "FAR" in the manuscript prose, for two reasons: (a) FAR has a specific biometric-verification meaning that requires ground-truth negative labels (which the corpus does not provide at the signature level); (b) §III-L.4 shows that the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures. Reading "inter-CPA coincidence rate" as a *specificity proxy* under an explicitly disclosed assumption is faithful to the evidence; reading it as a true biometric FAR would overstate the evidence.

### L.1. Per-comparison inter-CPA coincidence rate (Script 40b)

@@ -313,17 +304,17 @@ We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from Big-4 signatu

| Threshold | Per-comparison inter-CPA coincidence rate | 95% Wilson CI |
|---|---|---|
| Cosine $> 0.95$ | $0.00060$ | $[0.00053, 0.00067]$ |
-| Cosine $> 0.945$ (v3.x published "natural threshold") | $0.00081$ | $[0.00073, 0.00089]$ |
+| Cosine $> 0.945$ (prior published operating point) | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| Cosine $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| dHash $\leq 2$ | $0.00006$ | $[0.00004, 0.00008]$ |
-| Joint: cos $> 0.95$ AND dHash $\leq 5$ | $0.00014$ | (any-pair semantics) |
-| Joint: cos $> 0.95$ AND dHash $\leq 4$ | $0.00011$ | (any-pair semantics) |
+| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $0.00014$ | $[0.00011, 0.00018]$ |
+| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ |

-The cosine row at $\text{cos} > 0.95$ replicates the v3.x §IV-I Table X result (v3.x reported the per-comparison rate as $0.0005$ under prior "FAR" terminology from a similarly-sized inter-CPA negative anchor; the v4 spike on a $5 \times 10^5$-pair sample yields $0.00060$, within the v3.x reported precision). The dHash and joint rows are v4-new: v3.x calibration did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.
+The cosine row at $\text{cos} > 0.95$ is consistent with a prior published per-comparison rate of $0.0005$ on a similarly-sized inter-CPA sample; the present $5 \times 10^5$-pair sample yields $0.00060$, within that earlier precision. The dHash rows and joint rows are reported here for the first time on this corpus; the prior calibration did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.

The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.

@@ -353,16 +344,16 @@ Per-firm any-pair rates (no bootstrap; descriptive):

**Pool-size decile dependence.** The deployed rule's pool-normalised rate is broadly monotonically increasing in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$.
The trend is broadly monotonic with two minor non-monotone reversals (decile 5 and decile 9 dip below their predecessors). -**Threshold sensitivity at per-signature unit.** Tightening the HC rule from $\text{dHash} \leq 5$ to $\text{dHash} \leq 3$ (same-pair) reduces the per-signature ICCR from $0.0827$ to $0.0449$ (Big-4 pooled); tightening to $\text{dHash} \leq 4$ gives $0.0639$ (same-pair). Stakeholders requiring a per-signature ICCR of $\leq 0.05$ at HC can adopt dHash $\leq 3$ same-pair as a stricter operating point; at $\leq 0.10$ the inherited HC any-pair rule with $\text{dHash} \leq 5$ at $0.1102$ is within tolerance. +**Threshold sensitivity at per-signature unit.** Tightening the HC rule from $\text{dHash} \leq 5$ to $\text{dHash} \leq 3$ (same-pair) reduces the per-signature ICCR from $0.0827$ to $0.0449$ (Big-4 pooled); tightening to $\text{dHash} \leq 4$ gives $0.0639$ (same-pair). A stricter operating point of dHash $\leq 3$ same-pair would correspond to a per-signature ICCR of $\approx 0.05$; the deployed HC any-pair rule with $\text{dHash} \leq 5$ corresponds to $\approx 0.11$. Stakeholders requiring a tighter specificity proxy could consider the dHash $\leq 3$ same-pair variant, with the unsupervised-setting caveats of §III-M. ### L.3. Document-level inter-CPA proxy alert rate (Script 45) -The deployed worst-case aggregation classifies each document by the most-replication-consistent category among its constituent signatures (§III-L.0). Three operationally meaningful document-level alarm definitions are reported, each as the fraction of documents whose worst-case signature category falls in the alarm set under the same inter-CPA candidate-pool counterfactual as §III-L.2 (Script 45; $n_{\text{docs}} = 75{,}233$ Big-4 documents): +The deployed worst-case aggregation classifies each document by the most-replication-consistent category among its constituent signatures (§III-H.1). 
Three operationally meaningful document-level alarm definitions are reported, each as the fraction of documents whose worst-case signature category falls in the alarm set under the same inter-CPA candidate-pool counterfactual as §III-L.2 (Script 45; $n_{\text{docs}} = 75{,}233$ Big-4 documents): | Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI | |---|---|---|---| | D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ | -| D2 | HC + MC ("any non-hand-signed verdict") | $0.3375$ | $[0.3342, 0.3409]$ | +| D2 | HC + MC ("any non-hand-signed screening label") | $0.3375$ | $[0.3342, 0.3409]$ | | D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ | Per-firm D2 document-level rates: @@ -387,7 +378,7 @@ The document-level D2 rate of $33.75\%$ pooled over Big-4 is the most operationa | Firm D | $0.027$ | $< 1$ | $\sim 37\times$ lower odds than Firm A | | log(pool size, centred) | $4.01$ | $> 1$ | $\sim 4\times$ higher odds per unit log pool size | -The Firm B/C/D odds ratios are very small after controlling for pool size, indicating that firm membership accounts for a large multiplicative effect on the per-signature rate that is *not* explained by pool size alone. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm, and naive standard errors would be inflated by within-cluster correlation; a cluster-robust standard error analysis is left as a robustness check.) +The Firm B/C/D odds ratios are very small after controlling for pool size, indicating that firm membership accounts for a large multiplicative effect on the per-signature rate that is *not* explained by pool size alone. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm, and naive standard errors would be unreliable under within-cluster correlation; a cluster-robust standard error analysis is left as a robustness check.) 
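The odds ratios above correspond to a pooled logistic fit of the per-signature hit indicator on firm membership and centred log pool size; a minimal synthetic sketch (assumed effect sizes, scikit-learn assumed; per-CPA clustering is, as noted, not addressed here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 20000
# Synthetic stand-ins: a firm dummy and a centred log pool size, with
# assumed true effects (2.0 and 0.7 on the log-odds scale).
firm_a = rng.integers(0, 2, size=n)
log_pool = rng.normal(0.0, 1.0, size=n)
logit = -3.0 + 2.0 * firm_a + 0.7 * log_pool
hit = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X = np.column_stack([firm_a, log_pool]).astype(float)
fit = LogisticRegression(C=1e6).fit(X, hit)  # large C => near-unpenalised MLE
odds_ratios = np.exp(fit.coef_[0])  # exponentiated coefficients = odds ratios
# Naive (unclustered) inference only: cluster-robust standard errors would
# be needed before attaching uncertainty to these point estimates.
```

Exponentiating the fitted coefficients recovers the assumed multiplicative effects, which is the reading applied to the Firm B/C/D rows above.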
The per-decile per-firm breakdown (Script 44) confirms the pattern: within every pool-size decile, Firms B/C/D have rates of $0.0006$–$0.0358$, while Firm A's rate ranges $0.0541$–$0.5958$ across deciles. The firm gap is large within matched pool sizes, not driven by pool composition. @@ -402,23 +393,23 @@ The per-decile per-firm breakdown (Script 44) confirms the pattern: within every For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$). -**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$–$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. 
The byte-level evidence of v3.x §IV-F.1 (Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners) provides direct evidence that firm-level template reuse does occur at Firm A; the broader inter-CPA collision pattern in §III-L.4 is consistent with that mechanism extending in milder form to Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing. +**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$–$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (Appendix B; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder production-related reuse patterns at Firms B/C/D. 
We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing. This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where the within-firm collision concentration is the dominant pattern at all four Big-4 firms — most strongly at Firm A ($98.8\%$ any-pair, $99.96\%$ same-pair) and at materially lower but still majority levels at Firms B/C/D ($76.7$–$83.7\%$ any-pair; $97.0$–$98.2\%$ same-pair). -### L.5. Alert-rate sensitivity around inherited thresholds (Script 46) +### L.5. Alert-rate sensitivity around deployed thresholds (Script 46) -To test whether the inherited cosine threshold $0.95$ and dHash threshold $5$ coincide with a low-gradient (plateau-stable) region of the deployed-rule alert-rate surface — which would be weak distributional evidence that the inherited thresholds are stable operating points — we sweep each threshold across a range and report the per-signature alert rate on actual observed Big-4 same-CPA pools (not inter-CPA-replaced pools), comparing the local gradient at the inherited threshold to the median gradient across the sweep (Script 46). +To test whether the deployed cosine threshold $0.95$ and dHash threshold $5$ coincide with a low-gradient (plateau-stable) region of the deployed-rule alert-rate surface — which would be weak distributional evidence that the deployed thresholds are stable operating points — we sweep each threshold across a range and report the per-signature alert rate on actual observed Big-4 same-CPA pools (not inter-CPA-replaced pools), comparing the local gradient at the deployed threshold to the median gradient across the sweep (Script 46). 
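The local-vs-median gradient comparison generalises beyond this corpus; a synthetic sketch of the diagnostic (an assumed logistic-shaped alert-rate curve stands in for the Script 46 sweep; no corpus data):

```python
import numpy as np

# Assumed logistic-shaped alert-rate curve over a cosine-threshold sweep
# (illustrative stand-in for the empirical sweep).
thresholds = np.linspace(0.90, 0.99, 91)
alert_rate = 1.0 / (1.0 + np.exp((thresholds - 0.95) / 0.004))  # steep near 0.95

grad = np.abs(np.gradient(alert_rate, thresholds))
at_op = grad[np.argmin(np.abs(thresholds - 0.95))]  # local gradient at the cutoff
ratio = at_op / np.median(grad)
# ratio >> 1 reads as "locally sensitive, not plateau-stable" -- the same
# descriptive diagnostic applied to the deployed HC operating point.
```

A ratio near or below 1 would instead indicate a plateau, as found below for the dHash $= 15$ sub-band boundary.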
-At the inherited HC operating point cos $> 0.95$ AND dHash $\leq 5$, the local gradient of the per-signature alert rate is substantially larger than the median gradient across the sweep (cosine: ratio $\approx 25\times$ at the $0.95$ point relative to median; dHash: ratio $\approx 3.8\times$ at the $5$ point relative to median; both Script 46). Reading these ratios descriptively, the inherited HC threshold is *locally sensitive* rather than plateau-stable: small threshold perturbations materially change the deployed alert rate (cosine sweep at dHash $\leq 5$ yields rates of $0.5091$ at cos $> 0.945$ vs $0.4789$ at cos $> 0.955$, a $3.0$ pp swing across a $0.01$ cosine perturbation; dHash sweep at cos $> 0.95$ yields rates of $0.4207$ at dHash $\leq 4$ vs $0.5639$ at dHash $\leq 6$, a $14.3$ pp swing across a single integer step). The local-gradient-to-median-gradient ratios are descriptive diagnostics, not formal plateau tests; the primary evidence for "no within-population bimodal antimode at these thresholds" comes from §III-I.4's composition decomposition, not from §III-L.5. +At the deployed HC operating point cos $> 0.95$ AND dHash $\leq 5$, the local gradient of the per-signature alert rate is substantially larger than the median gradient across the sweep (cosine: ratio $\approx 25\times$ at the $0.95$ point relative to median; dHash: ratio $\approx 3.8\times$ at the $5$ point relative to median; both Script 46). Reading these ratios descriptively, the deployed HC threshold is *locally sensitive* rather than plateau-stable: small threshold perturbations materially change the deployed alert rate (cosine sweep at dHash $\leq 5$ yields rates of $0.5091$ at cos $> 0.945$ vs $0.4789$ at cos $> 0.955$, a $3.0$ pp swing across a $0.01$ cosine perturbation; dHash sweep at cos $> 0.95$ yields rates of $0.4207$ at dHash $\leq 4$ vs $0.5639$ at dHash $\leq 6$, a $14.3$ pp swing across a single integer step). 
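The local-gradient-to-median-gradient diagnostic can be sketched as follows (the helper and the synthetic sweep are ours; only the two reported cosine-sweep endpoint rates are taken from the text):

```python
import numpy as np

def gradient_ratio(thresholds, rates, at):
    """Local-gradient-to-median-gradient ratio: central-difference gradient
    at the operating point `at`, divided by the median absolute gradient
    across the sweep.  Descriptive only, not a formal plateau test."""
    t = np.asarray(thresholds, dtype=float)
    r = np.asarray(rates, dtype=float)
    grad = np.gradient(r, t)
    i = int(np.argmin(np.abs(t - at)))
    return abs(grad[i]) / np.median(np.abs(grad))

# Reported cosine-sweep endpoints around the deployed cutoff: the alert
# rate moves from 0.5091 (cos > 0.945) to 0.4789 (cos > 0.955), a 3.0 pp
# swing across a 0.01 cosine perturbation.
swing_pp = (0.5091 - 0.4789) * 100

# Synthetic sweep: a flat region followed by a steep step; the operating
# point at t = 3 sits on the steep part, so the ratio exceeds 1.
ratio = gradient_ratio([0, 1, 2, 3, 4], [0.10, 0.10, 0.10, 0.50, 0.90], at=3)
```

A ratio well above 1 flags a locally sensitive operating point; a ratio well below 1 flags a plateau-like one, matching the qualitative reading above.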
The local-gradient-to-median-gradient ratios are descriptive diagnostics, not formal plateau tests; the primary evidence for "no within-population bimodal antimode at these thresholds" comes from §III-I.4's composition decomposition, not from §III-L.5. The MC/HSC boundary at dHash $= 15$, by contrast, *is* in a low-gradient region (ratio $\approx 0.08$ to the median); the plateau-like behaviour around dHash $= 15$ is corroborating evidence that the high-end structural threshold lies in a regime where the rule's alert rate is approximately saturated, consistent with the high-dHash tail behaviour expected once near-identical pairs have been exhausted. The §III-L.5 non-plateau / local-sensitivity finding therefore applies specifically to the HC cutoff (cos $= 0.95$, dHash $= 5$); the MC/HSC sub-band boundary at dHash $= 15$ exhibits the opposite behaviour and is plateau-like. -We interpret the inherited HC thresholds as **specificity-anchored operating points** chosen for the specificity-vs-alert-yield tradeoff (§III-L.1), *not* as distributional antimodes. Stakeholders requiring different operating points on the tradeoff curve can derive thresholds by inverting the per-comparison or pool-normalised ICCR curves (§III-L.1, §III-L.2) at their preferred specificity target. +We interpret the deployed HC thresholds as **specificity-anchored operating points** chosen for the specificity-vs-alert-yield tradeoff (§III-L.1), *not* as distributional antimodes. Alternative operating points on the tradeoff curve can be characterised by inverting the per-comparison or pool-normalised ICCR curves (§III-L.1, §III-L.2) at the preferred specificity target. ### L.6. Observed deployed alert rate on actual same-CPA pools -The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. 
The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the inherited HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$). +The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the deployed HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$). The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised inter-CPA rate ($0.4958$ vs $0.1102$); the per-document observed-deployed rate is $\sim 3.5\times$ the pool-normalised inter-CPA D1 (HC) rate ($0.6228$ vs $0.1797$). We refer to this multiplicative gap as the **deployed-rate excess over the inter-CPA proxy**: @@ -427,36 +418,32 @@ The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as a *same-CPA repeatability signal* — a quantity that exceeds what random inter-CPA candidate replacement would produce — rather than as an estimate of true replication prevalence. -### L.7. K=3 not used as classifier - -The K=3 mixture of §III-J is reported in §IV as an accountant-level descriptive summary alongside the per-signature five-way classifier. 
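The multiplicative gap can be checked directly from the reported rates (a minimal arithmetic sketch; the variable names are ours):

```python
# Observed deployed alert rates vs pool-normalised inter-CPA proxy rates,
# as reported in this subsection.
sig_observed, sig_proxy = 0.4958, 0.1102   # per-signature
doc_observed, doc_proxy = 0.6228, 0.1797   # per-document, D1 (HC)

sig_excess = sig_observed / sig_proxy      # deployed-rate excess, per-signature
doc_excess = doc_observed / doc_proxy      # deployed-rate excess, per-document
```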
We do not assign signature-level or document-level labels from the K=3 mixture in any v4.0 result table; the K=3 hard label is used only for the accountant-level firm × cluster cross-tabulation (§III-J; Script 35), and the K=3 *posterior* P(C1) is used (as the continuous Score 1) in the internal-consistency Spearman correlations of §III-K. The operational classifier of §III-L.0 is the inherited v3.x five-way box rule; the calibration evidence in §III-L.1 through §III-L.6 characterises its multi-level coincidence behaviour against the inter-CPA negative anchor. - ## M. Validation Strategy and Limitations under Unsupervised Setting -The v4.0 corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4 and v3.x §IV-F.1) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth. +The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth. 
-In place of supervised validation, v4.0 adopts a **multi-tool collection of partial-evidence diagnostics** (Table XXVII), each with an explicitly disclosed assumption: +Each diagnostic reported in this paper therefore addresses one specific failure mode of an unsupervised screening classifier (Table XXVII), with an explicitly disclosed untested assumption: -**Table XXVII.** Ten-tool unsupervised-validation collection with disclosed untested assumptions. +**Table XXVII.** Diagnostics, failure mode addressed, and disclosed untested assumption. -| Tool | What it measures | Untested assumption | +| Diagnostic | Failure mode addressed | Disclosed untested assumption | |---|---|---| | Composition decomposition (§III-I.4; Scripts 39b–39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) | | Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) | | Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | Same as above + that pool replacement preserves the negative-anchor property | | Document-level ICCR (§III-L.3; Script 45) | Operational alarm rate proxy at per-document unit under three alarm definitions | Same as above | -| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | 
Per-signature observations are clustered by CPA/firm; naïve standard errors inflated; cluster-robust analysis is a future check | +| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check | | Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within source firm | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$–$99.96\%$ within-firm at all four firms versus $76.7$–$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 45's mode-of-firms tie-break (§IV-M.4 footnote) | | Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test | | Convergent score Spearman ranking (§III-K.1; Script 38) | Internal-consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent | -| Pixel-identical conservative positive capture (§III-K.4; v3.x; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold | +| Pixel-identical conservative positive capture (§III-K.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold | | LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity | -No single tool in this collection provides ground-truth validation. Their conjunction constitutes the unsupervised validation ceiling that the v4.0 corpus admits. 
+No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth. -**What v4.0 does not claim.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; v3.x byte-level evidence of Firm A's pixel-identical signatures across partners) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question. +**Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. 
We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, Appendix B) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question. -**What v4.0 does claim.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates. 
Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier. +**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier. **Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. 
Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and missed-replication risk (per false negative) — neither directly observable from corpus data. diff --git a/paper/paper_a_references_v3.md b/paper/paper_a_references_v3.md index 335dd32..0bb440f 100644 --- a/paper/paper_a_references_v3.md +++ b/paper/paper_a_references_v3.md @@ -1,6 +1,6 @@ # References - + [1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067 @@ -90,4 +90,4 @@ [44] A. Vehtari, A. Gelman, and J. Gabry, "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC," *Stat. Comput.*, vol. 27, no. 5, pp. 1413–1432, 2017. - + diff --git a/paper/paper_a_related_work_v3.md b/paper/paper_a_related_work_v3.md index cf82b04..b3ce027 100644 --- a/paper/paper_a_related_work_v3.md +++ b/paper/paper_a_related_work_v3.md @@ -15,8 +15,8 @@ Hafemann et al. [16] further addressed the practical challenge of adapting to ne A common thread in this literature is the assumption that the primary threat is *identity fraud*: a forger attempting to produce a convincing imitation of another person's signature. 
Our work addresses a fundamentally different problem---detecting whether the *legitimate signer's* stored signature image has been reproduced across many documents---which requires analyzing the upper tail of the intra-signer similarity distribution rather than modeling inter-signer discriminability. -Brimoh and Olisah [8] proposed a consensus-threshold approach that derives classification boundaries from known genuine reference pairs, the methodology most closely related to our calibration strategy. -However, their method operates on standard verification benchmarks with laboratory-collected signatures, whereas our approach applies threshold calibration using a replication-dominated subpopulation identified through domain expertise in real-world regulatory documents. +Brimoh and Olisah [8] are closest in spirit, using reference evidence to discipline threshold choice. +However, their setting relies on standard verification benchmarks with known genuine references, whereas our archival setting lacks signature-level labels; we therefore characterise a fixed deployed screening rule through inter-CPA coincidence-rate anchors. ## B. Document Forensics and Copy Detection @@ -51,9 +51,9 @@ Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine sim Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach. These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical. -## E. Statistical Methods for Threshold Determination +## E.
Statistical Methods for Threshold Characterisation and Calibration -Our threshold-determination framework combines three families of methods developed in statistics and accounting-econometrics. +Our threshold-characterisation and calibration framework combines three families of methods developed in statistics and accounting-econometrics. *Non-parametric density estimation.* Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions. @@ -71,10 +71,10 @@ When the empirical distribution is viewed as a weighted sum of two (or more) lat For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation. Under mild regularity conditions, White's quasi-MLE result [41] supports interpreting maximum-likelihood estimates under a mis-specified parametric family as consistent estimators of the pseudo-true parameter that minimizes the Kullback-Leibler divergence to the data-generating distribution within that family; we use this result to justify the Beta-mixture fit as a principled approximation rather than as a guarantee that the true distribution is Beta. -The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported. +The present study uses these tools diagnostically: first to test whether the descriptor distribution supports a natural operating boundary, and then, when that support fails under composition decomposition, to motivate anchor-based ICCR calibration of a fixed deployed rule. 
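The KDE-crossover construction referenced in this subsection can be sketched as follows (synthetic stand-in samples, not the paper's data; `gaussian_kde` uses Scott's rule by default):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the two similarity classes; locations and
# spreads are chosen only to make the two modes visible.
intra = np.clip(rng.normal(0.95, 0.03, 2000), 0.0, 1.0)
inter = np.clip(rng.normal(0.80, 0.06, 2000), 0.0, 1.0)

kde_intra, kde_inter = gaussian_kde(intra), gaussian_kde(inter)
grid = np.linspace(0.6, 1.0, 2001)
diff = kde_intra(grid) - kde_inter(grid)

# Crossover: the point where the two density estimates intersect, i.e.
# where the sign of their difference changes between the two modes.
flip = np.nonzero(np.diff(np.sign(diff)))[0][0]
crossover = float(grid[flip])
```

Under equal priors and costs the crossover approximates the equal-density boundary; the paper adopts such a crossover only as an operational boundary, not as a natural threshold.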
*Cross-validation in a small-cluster scope.* -Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the inherited five-way operational classifier (which is calibrated separately; §III-L). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier. +Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. 
Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the deployed five-way operational classifier (§III-H.1; calibrated separately in §III-L). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier. -The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in all §IV v4 primary analyses (§IV-D through §IV-J). +The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in the primary analyses of §IV-D through §IV-J. ## C. All-Pairs Intra-vs-Inter Class Distribution Analysis @@ -48,10 +48,10 @@ Table IV summarizes the distributional statistics. Both distributions are left-skewed and leptokurtic. Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate. -Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent threshold-estimator outputs reported in Section IV-D are derived via the methods of Section III-I to avoid single-family distributional assumptions. 
+Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; the subsequent distributional diagnostics in Section IV-D are produced via the methods of Section III-I to avoid single-family distributional assumptions. The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). -Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes. +Under equal prior probabilities and equal misclassification costs, this crossover is a candidate decision boundary between the two classes; we adopt it only as the operational LH/UN boundary in §III-H.1, not as a natural distributional threshold. Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$). We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength. @@ -60,7 +60,7 @@ A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the ## D. Big-4 Accountant-Level Distributional Characterisation -This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. All numbers below are direct re-statements from Scripts 32 / 34. The accountant-level dip-test rejection reported in Table V is, per §III-I.4 (Scripts 39b–39e), fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the v4-new composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration. 
+This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. The accountant-level dip-test rejection reported in Table V is, per §III-I.4, fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration. **Table V.** Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32). @@ -82,7 +82,7 @@ Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no boot | Firms B + C + D pooled | none | one transition at $\overline{\text{dHash}} = 10.8$ | | All non-Firm-A pooled | none | one transition at $\overline{\text{dHash}} = 6.6$ | -The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). These off-Big-4 dHash transitions are scope-dependent and are not used as v4.0 operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes. +The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). 
These off-Big-4 dHash transitions are scope-dependent and are not used as operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes. ## E. Big-4 K=2 / K=3 Mixture Fits @@ -122,15 +122,15 @@ This section reports the empirical evidence for §III-K's three-score internal-c | Score pair | Spearman $\rho$ | $p$-value | |---|---|---| -| K=3 P(C1) vs Paper A box-rule less-replication-dominated rate | $+0.9627$ | $< 10^{-248}$ | -| Reverse-anchor cosine percentile vs Paper A box-rule less-replication-dominated rate | $+0.8890$ | $< 10^{-149}$ | +| K=3 P(C1) vs deployed box-rule less-replication-dominated rate | $+0.9627$ | $< 10^{-248}$ | +| Reverse-anchor cosine percentile vs deployed box-rule less-replication-dominated rate | $+0.8890$ | $< 10^{-149}$ | | K=3 P(C1) vs Reverse-anchor cosine percentile | $+0.8794$ | $< 10^{-142}$ | (Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on $n = 249$ non-Big-4 CPAs; reference centre $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$. **Table X.** Per-firm summary across the three feature-derived scores, Big-4. 
-| Firm | $n$ CPAs | mean $P(\text{C1})$ | mean reverse-anchor score | mean Paper A less-replication-dominated rate | +| Firm | $n$ CPAs | mean $P(\text{C1})$ | mean reverse-anchor score | mean deployed less-replication-dominated rate | |---|---|---|---|---| | Firm A | 171 | 0.0072 | $-0.9726$ | 0.1935 | | Firm B | 112 | 0.1410 | $-0.8201$ | 0.6962 | @@ -145,11 +145,11 @@ The three scores agree on placing Firm A as the most replication-dominated and t | Pair | Cohen $\kappa$ | |---|---| -| Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) vs per-CPA K=3 hard label | 0.662 | -| Paper A binary high-confidence box rule vs per-signature K=3 hard label | 0.559 | +| deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) vs per-CPA K=3 hard label | 0.662 | +| deployed binary high-confidence box rule vs per-signature K=3 hard label | 0.559 | | Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 | -(Source: Script 39; verdict label `SIG_CONVERGENCE_MODERATE`.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evaluation (§IV-J). +(Source: Script 39.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. 
These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains its prior calibration and capture-rate evidence (Appendix B; cross-referenced in §IV-J). ## G. Leave-One-Firm-Out Reproducibility @@ -176,33 +176,33 @@ This section reports the firm-level cross-validation evidence motivating §III-J | Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp | | Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp | -(Source: Script 37; verdict label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L). +(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L). ## H. Pixel-Identity Positive-Anchor Miss Rate -This section reports the only hard-ground-truth subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class.
The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4). +This section reports the only conservative hard-positive subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4). **Table XIV.** Positive-anchor miss rate, $n = 262$ Big-4 byte-identical signatures. | Classifier | Misclassified as less-replication-dominated | Miss rate | Wilson 95% CI | |---|---|---|---| -| Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ | +| deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ | | K=3 per-CPA hard label (C3 = high-cos / low-dHash; descriptive) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ | | Reverse-anchor (prevalence-calibrated cut) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ | (Source: Script 40.) Per-firm breakdown of the byte-identical subset: Firm A 145; Firm B 8; Firm C 107; Firm D 2. All three candidate scores correctly assign every byte-identical signature to the replicated class. -We caution that for the Paper A box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region); v3.20.0 §V-F discusses this conservative-subset caveat at length and we retain that discussion. 
The reverse-anchor cut is chosen by *prevalence calibration* against the inherited box rule's overall replicated rate of $49.58\%$ across Big-4 signatures; this is a documented v4.0 limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation. +We caution that for the deployed box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region). The reverse-anchor cut is chosen by *prevalence calibration* against the box rule's overall replicated rate of $49.58\%$ across Big-4 signatures; this is a documented limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation. -## I. Inter-CPA Pair-Level Coincidence Rate (Big-4 spike + inherited corpus-wide) +## I. Inter-CPA Pair-Level Coincidence Rate -The signature-level inter-CPA pair-level coincidence-rate analysis (reported in v3.20.0 §IV-F.1, Table X as "FAR") is inherited and extended in v4.0. v4.0 retroactively reframes the metric as **inter-CPA pair-level coincidence rate (ICCR)** rather than "False Acceptance Rate" because the corpus does not provide signature-level ground-truth negative labels; the inter-CPA negative-anchor assumption underpinning the metric is itself partially violated by within-firm cross-CPA template-like collision structures (§III-L.4). The v3.20.0 corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs reported a per-comparison rate of $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$) at the cosine cut $0.95$. +The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It is the per-pair rate at which two signatures from different CPAs satisfy the deployed rule. 
We do not label it as a False Acceptance Rate because (a) FAR has a biometric-verification meaning that requires ground-truth negative labels, and (b) the inter-CPA negative-anchor assumption is partially violated by within-firm cross-CPA template-like collision structures (§III-L.4 cross-firm hit matrix). -v4.0 additionally reports the §III-L.1 Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs; Script 40b), which replicates and extends the v3 result and adds the structural dimension (dHash) and joint-rule rates. The §III-L.1 numbers are referenced rather than duplicated here; the consolidated v4-new ICCR calibration appears in §IV-M Tables XXI–XXVI. +A corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs gives a per-comparison rate of $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$) at the cosine cut $0.95$. The Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs) replicates this number, adds the structural dimension (dHash), and adds joint-rule rates; the §III-L.1 numbers are referenced rather than duplicated here, and the consolidated ICCR calibration appears in §IV-M Tables XXI–XXVI. ## J. Five-Way Per-Signature + Document-Level Classification Output -This section reports the §III-L five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. The five-way category definitions are inherited unchanged from v3.20.0 §III-K (now §III-L); see §III-L for the cosine and dHash cuts. +This section reports the five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. See §III-H.1 for the five-way category definitions and the cosine and dHash cuts; calibration is in §III-L. **Table XV.** Five-way per-signature category counts, Big-4 sub-corpus, $n = 150{,}442$ classified. @@ -227,7 +227,7 @@ This section reports the §III-L five-way per-signature + document-level worst-c (Source: Script 42 per-firm cross-tab.) 
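The document-level worst-case aggregation used in this section reduces each report's per-signature screening labels to the most severe one (precedence HC > MC > HSC > UN > LH, as defined in the calibration section). A minimal sketch of that rule; the label strings and function name are illustrative, not taken from the analysis scripts:

```python
# Worst-case document aggregation: the most severe per-signature
# screening label on a PDF determines the document label.
# Precedence (most to least severe): HC > MC > HSC > UN > LH.
SEVERITY = {"HC": 0, "MC": 1, "HSC": 2, "UN": 3, "LH": 4}

def document_label(signature_labels):
    """Return the worst-case label for one document's signatures."""
    # min over severity index picks the most severe label
    return min(signature_labels, key=SEVERITY.__getitem__)
```

Under this rule a document with one HC signature is HC regardless of how its other signatures are classified, which is why per-document alarm rates exceed per-signature rates.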
The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVI: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVI). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%). -**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.20.0 worst-case rule (HC > MC > HSC > UN > LH; §III-L). v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset). +**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the worst-case rule (HC > MC > HSC > UN > LH; §III-L), applied to the Big-4 sub-corpus. **Table XIX.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs. @@ -252,7 +252,7 @@ This section reports the §III-L five-way per-signature + document-level worst-c (Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.) -The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is **not separately validated by Scripts 38–40**, which evaluated only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). 
v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The v3.20.0 capture-rate calibration evidence for the moderate band (v3.20.0 Tables IX, XI, XII, XII-B) is carried into v4.0 by reference and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B). +The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains its prior calibration (Appendix B); it is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The capture-rate calibration evidence for the moderate band is reported in Appendix B and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B). **Table XVI.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus. @@ -269,7 +269,7 @@ The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < ## K. 
Full-Dataset Robustness (light scope) -This section reports the v4.0 reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + Paper A operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the v4.0 K=3 + Paper A convergence reproduces at the wider scope. The §III-L five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band is documented as inherited from v3.x calibration in §IV-J. +This section reports the reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + deployed operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the K=3 + deployed-rule convergence reproduces at the wider scope. The §III-H.1 five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band retains its prior calibration (Appendix B; §IV-J). **Table XVII.** K=3 component comparison, Big-4 sub-corpus vs full dataset. @@ -281,9 +281,9 @@ This section reports the v4.0 reproducibility cross-check at the full accountant (Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.) -**Table XVIII.** Spearman rank correlation between K=3 P(C1) and Paper A operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset. +**Table XVIII.** Spearman rank correlation between K=3 P(C1) and deployed operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset. 
-| Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs Paper A less-replication-dominated rate) | $p$-value | +| Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs deployed less-replication-dominated rate) | $p$-value | |---|---|---|---| | Big-4 (primary) | 437 | $+0.9627$ | $< 10^{-248}$ | | Full dataset | 686 | $+0.9558$ | $< 10^{-300}$ | @@ -291,15 +291,15 @@ This section reports the v4.0 reproducibility cross-check at the full accountant (Source: Script 41.) -**Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the Paper A box-rule less-replication-dominated rate are preserved at the full scope. Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more non-templated CPAs (mid/small firms); C1 (low-cos / high-dHash) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + Paper A convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the v4.0 primary methodology is restricted to Big-4 by design (§III-G item 4). +**Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the deployed box-rule less-replication-dominated rate are preserved at the full scope. 
Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more non-templated CPAs (mid/small firms); C1 (low-cos / high-dHash) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + deployed-rule convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the primary methodology is restricted to Big-4 by design (§III-G item 4). ## L. Ablation Study: Feature Backbone Comparison -To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). +To support the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. -The comparison summary is inherited unchanged from the v3.20.0 backbone-ablation table (v3.20.0 Table XVIII; not the same table as v4 Table XVIII which reports Big-4 vs full-dataset Spearman drift in §IV-K). +The comparison summary is reported in Appendix B (the backbone-ablation table; not the same table as Table XVIII in this section, which reports Big-4 vs full-dataset Spearman drift in §IV-K). 
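The Cohen's $d$ figures in this ablation compare the intra-class and inter-class similarity distributions under each backbone. A minimal sketch, assuming the conventional pooled-standard-deviation two-sample definition (variable names are illustrative):

```python
import statistics

def cohens_d(intra, inter):
    """Pooled-SD effect size between two similarity samples.

    intra / inter: lists of cosine similarities for same-CPA and
    cross-CPA pairs respectively (illustrative inputs).
    """
    n1, n2 = len(intra), len(inter)
    m1, m2 = statistics.fmean(intra), statistics.fmean(inter)
    v1, v2 = statistics.variance(intra), statistics.variance(inter)
    # pooled sample standard deviation
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return abs(m1 - m2) / pooled_sd
```

Note that a larger $d$ can coexist with a wider intra-class spread, which is the trade-off discussed for EfficientNet-B0 versus ResNet-50.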
- EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions. -However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), resulting in lower per-sample classification confidence. +However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), i.e., a wider descriptor dispersion per signature. VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task. ResNet-50 provides the best overall balance: (1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707; -(2) its tighter distributions yield more reliable individual classifications; -(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and +(2) its tighter distributions yield more stable descriptor behaviour at the per-signature level; +(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that Firm A replication-dominated signatures are least likely to produce low-similarity outlier pairs under this backbone; and (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures. -## M. v4-New Anchor-Based ICCR Calibration Results +## M. Anchor-Based ICCR Calibration Results -This section consolidates the v4-new empirical results that support the §III-L anchor-based threshold calibration framework. Numbers below are direct re-statements from the spike scripts cited per row; the corresponding §III provenance table entries appear in §III's provenance table. 
+This section consolidates the empirical results that support the §III-L anchor-based threshold calibration framework. ### M.1 Composition decomposition (Scripts 39b–39e) @@ -342,29 +342,29 @@ This section consolidates the v4-new empirical results that support the §III-L | Within-firm signature-level cosine dip | non-Big-4 (10 firms $\geq 500$ sigs) | $p_{\text{cos}} \in [0.59, 0.99]$ | 0/10 firms reject; cosine within-firm unimodal | | Within-firm jittered-dHash dip (5 seeds, median) | Big-4 (4 firms) | $p_{\text{med}} \in \{0.999, 0.996, 0.999, 0.9995\}$ | 0/4 firms reject after integer-jitter; raw rejection was integer-tie artefact | | Big-4 pooled dHash: 2×2 factorial | firm-centred + jittered (5 seeds) | $p_{\text{med}} = 0.35$, 0/5 seeds reject | combined corrections eliminate rejection; multimodality is composition + integer artefact | -| Integer-histogram valley near $\text{dHash} \approx 5$ | within each Big-4 firm | none (0/4 firms) | no within-firm dHash antimode at the inherited HC cutoff | +| Integer-histogram valley near $\text{dHash} \approx 5$ | within each Big-4 firm | none (0/4 firms) | no within-firm dHash antimode at the deployed HC cutoff | (Source: Scripts 39b, 39c, 39d, 39e; bootstrap $n_{\text{boot}} = 2000$; jitter $\sim \mathrm{U}[-0.5, +0.5]$.) ### M.2 Anchor-based inter-CPA pair-level ICCR (Script 40b) -**Table XXI.** Big-4 inter-CPA per-comparison ICCR sweep, $n = 5 \times 10^5$ pairs (Big-4 scope; v4 new). +**Table XXI.** Big-4 inter-CPA per-comparison ICCR sweep, $n = 5 \times 10^5$ pairs (Big-4 scope). 
| Threshold | Per-comparison ICCR | 95% Wilson CI | |---|---|---| -| cos $> 0.945$ (v3.x published "natural threshold") | $0.00081$ | $[0.00073, 0.00089]$ | -| cos $> 0.95$ (inherited operating point) | $0.00060$ | $[0.00053, 0.00067]$ | +| cos $> 0.945$ (prior published operating point) | $0.00081$ | $[0.00073, 0.00089]$ | +| cos $> 0.95$ (deployed operating point) | $0.00060$ | $[0.00053, 0.00067]$ | | cos $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ | | cos $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ | -| dHash $\leq 5$ (inherited operating point) | $0.00129$ | $[0.00120, 0.00140]$ | +| dHash $\leq 5$ (deployed operating point) | $0.00129$ | $[0.00120, 0.00140]$ | | dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ | | dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ | -| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | — | -| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | — | +| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | $[0.00011, 0.00018]$ | +| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ | Conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$ (Wilson 95% $[0.190, 0.285]$; $70$ of $299$ pairs). -The cos $> 0.95$ row replicates v3.20.0 §IV-F.1 Table X (v3 reported $0.0005$ under prior "FAR" terminology). The dHash row and joint row are v4 new. +The cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I (per-comparison rate $0.0005$). The dHash row and joint row are reported here for the first time on this corpus. ### M.3 Pool-normalised per-signature ICCR (Script 43) @@ -419,9 +419,9 @@ Per-decile per-firm rates (Table not duplicated here; Script 44 decile table ava Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively. 
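The Wilson 95% intervals reported in this section (Table XXI and the conditional-ICCR line) follow the standard score-interval formula; a minimal sketch (function name illustrative):

```python
import math

def wilson_ci(hits, n, z=1.959964):
    """Wilson score interval for a binomial proportion hits/n."""
    p = hits / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half) / denom, (centre + half) / denom
```

For example, the conditional ICCR of $70/299$ pairs gives approximately $(0.190, 0.285)$, matching the interval quoted above.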
-### M.6 Alert-rate sensitivity around inherited HC threshold (Script 46) +### M.6 Alert-rate sensitivity around deployed HC threshold (Script 46) -**Table XXVI.** Local-gradient / median-gradient ratio at inherited thresholds (descriptive plateau diagnostic). +**Table XXVI.** Local-gradient / median-gradient ratio at deployed thresholds (descriptive plateau diagnostic). | Threshold | Local / median gradient ratio | Interpretation | |---|---|---| @@ -429,4 +429,4 @@ Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\ | dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) | | dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) | -Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ pp per-signature and $0.4431$ pp per-document; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate. +Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ ($38.6$ pp) per-signature and $0.4431$ ($44.3$ pp) per-document; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate. diff --git a/paper/paper_a_v4_combined.md b/paper/paper_a_v4_combined.md new file mode 100644 index 0000000..4d22565 --- /dev/null +++ b/paper/paper_a_v4_combined.md @@ -0,0 +1,1314 @@ +# Abstract + + + +Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports — through administrative stamping or firm-level electronic signing — thereby undermining individualized attestation. 
We build an end-to-end pipeline for screening such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is $0.62$ versus $0.09$–$0.16$ at Firms B/C/D after pool-size adjustment, and under the deployed any-pair rule $77$–$99\%$ of inter-CPA collisions concentrate within the source firm — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth. + + + + +# I. Introduction + + + +Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection. 
In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require certifying CPAs to affix their signature or seal (簽名或蓋章) to each audit report [1]. While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2]. + +The digitization of financial reporting has introduced a practice that complicates this intent. As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one. This reproduction can occur either through an administrative stamping workflow — in which scanned signature images are affixed by staff as part of the report-assembly process — or through a firm-level electronic signing system that automates the same step. We refer to signatures produced by either workflow collectively as *non-hand-signed*. Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement. The accounting literature has examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33]. 
Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused, and is not visually detectable by report users at scale. + +The distinction between *non-hand-signing detection* and *signature forgery detection* is conceptually and technically important. The extensive body of research on offline signature verification [3]–[8] focuses almost exclusively on forgery detection — determining whether a questioned signature was produced by its purported author. In our context, identity is not in question; the CPA is indeed the legitimate signer. The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports. This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction. + +A methodological concern shapes the research design. Many prior similarity-based classification studies rely on ad-hoc thresholds — declaring two images equivalent above a hand-picked cosine cutoff, for example — without principled statistical justification. Such thresholds are fragile in an archival-data setting.
A defensible approach requires (i) explicit calibration of the operational thresholds against measurable negative-anchor evidence; (ii) diagnostic procedures that test whether the descriptor distribution itself supports a within-population threshold, including formal decomposition of apparent multimodality into between-group composition and integer-tie artefacts; (iii) annotation-free reporting of operational alarm rates at multiple analysis units (per-comparison, per-signature pool, per-document) with Wilson 95% confidence intervals; (iv) per-firm stratification of the reported rates to surface heterogeneity that aggregate metrics conceal; and (v) explicit disclosure of the unsupervised setting's limits — in particular, the inability to estimate true error rates without signature-level ground-truth labels. + +Despite the significance of the problem for audit quality and regulatory oversight, to our knowledge no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. 
From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation. + +In this paper we present a fully automated, end-to-end pipeline for screening non-hand-signed CPA signatures in audit reports at scale, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour under explicit unsupervised assumptions. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) disclosure of each diagnostic's untested assumption (§III-M). + +A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. 
After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-I.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.

In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration. At the per-comparison unit, the cos$>0.95$ operating point yields ICCR $= 0.00060$ on a $5 \times 10^5$-pair Big-4 sample; the dHash$\leq 5$ structural cutoff yields ICCR $= 0.00129$; the joint rule cos$>0.95$ AND dHash$\leq 5$ yields joint ICCR $= 0.00014$ (any-pair semantics, matching the deployed extrema rule). At the pool-normalised per-signature unit, the same rule's effective coincidence rate is materially higher because the deployed classifier takes max-cosine and min-dHash over a same-CPA pool: pooled Big-4 any-pair ICCR is $0.1102$ (Wilson 95% CI $[0.1086, 0.1118]$; CPA-block bootstrap 95% $[0.0908, 0.1330]$). At the per-document unit, the operational HC$+$MC alarm fires on $33.75\%$ of Big-4 documents under the inter-CPA candidate-pool counterfactual.

The pooled per-signature and per-document rates conceal striking firm heterogeneity. A logistic regression of the per-signature hit indicator on firm dummies (Firm A reference) and centred log pool size yields odds ratios of $0.053$ (Firm B), $0.010$ (Firm C), and $0.027$ (Firm D) — Firms B/C/D are an order of magnitude below Firm A even after controlling for the pool-size confound.
Cross-firm hit matrix analysis under the deployed any-pair rule shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (Table XXV; the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). The pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour rather than to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose false-positive and false-negative error rates remain unknown in the absence of signature-level labels.

Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-J's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.
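
The Wilson upper bound quoted above follows from the standard closed form. A minimal sketch (the function name `wilson_interval` is ours; only the arithmetic comes from the standard formula):

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion k/n.

    z = 1.96 gives the 95% interval used throughout the paper.
    """
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# 0 misses on the 262 byte-identical positive anchors:
lo, hi = wilson_interval(0, 262)
# hi ≈ 0.0145, i.e. the 1.45% upper bound quoted in the text
```

Unlike the normal approximation, the Wilson interval remains informative at $k = 0$, which is exactly the positive-anchor case here; given the underlying counts, the same function yields the other Wilson intervals reported in the text.
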

We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. The Big-4 sub-corpus comprises 437 CPAs and 150,442 signatures with both descriptors available.

The contributions of this paper are:

1. **Problem formulation.** We define non-hand-signing detection as distinct from signature forgery detection and frame it as a detection problem on intra-signer similarity distributions.

2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention after initial training.

3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between *style consistency* and *image reproduction*, and we support the backbone choice through a feature-backbone ablation.

4. **Composition decomposition disproves the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.

5.
**Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for settings with ground-truth negative labels, which the corpus does not provide.

6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-firm rates differ by an order of magnitude after pool-size adjustment (Firm A's per-document HC$+$MC alarm at $0.62$ versus Firms B/C/D at $0.09$–$0.16$). Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms); the pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — a descriptive finding about deployed-rule behaviour, not a claim of deliberate template sharing.

7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.

8.
**Annotation-free positive-anchor capture check and unsupervised-setting disclosure.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. Each supporting diagnostic in §III-M addresses one specific failure mode of an unsupervised screening classifier — composition artefacts, inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold sensitivity, or positive-anchor capture — with an explicitly disclosed untested assumption. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review.

The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity capture checks, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work.


# II. Related Work

## A. Offline Signature Verification

Offline signature verification---determining whether a static signature image is genuine or forged---has been studied extensively using deep learning.
Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant.
Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work.
Dey et al.
[4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining.
Kao and Wen [5] addressed offline verification and forgery detection using only a single known genuine signature per writer with an explainable deep-learning approach.
More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results.
Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives.
Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer.
Hafemann et al. [16] further addressed the practical challenge of adapting to new users through meta-learning, reducing the enrollment burden for signature verification systems.

A common thread in this literature is the assumption that the primary threat is *identity fraud*: a forger attempting to produce a convincing imitation of another person's signature.
Our work addresses a fundamentally different problem---detecting whether the *legitimate signer's* stored signature image has been reproduced across many documents---which requires analyzing the upper tail of the intra-signer similarity distribution rather than modeling inter-signer discriminability.

Brimoh and Olisah [8] are closest in spirit in using reference evidence to discipline threshold choice.
Their setting, however, uses standard verification benchmarks with known genuine references, whereas our archival setting lacks signature-level labels and therefore characterises a fixed deployed screening rule through inter-CPA coincidence-rate anchors.

## B. Document Forensics and Copy Detection

Image forensics encompasses a broad range of techniques for detecting manipulated visual content [17], with recent surveys highlighting the growing role of deep learning in forgery detection [18].
Copy-move forgery detection (CMFD) identifies duplicated regions within or across images, typically targeting manipulated photographs [11].
Abramova and Böhme [10] adapted block-based CMFD to scanned text documents, noting that standard methods perform poorly in this domain because legitimate character repetitions produce high similarity scores that confound duplicate detection.

Woodruff et al. [9] developed the work most closely related to ours: a fully automated pipeline for extracting and analyzing signatures from corporate filings in the context of anti-money-laundering investigations.
Their system uses connected component analysis for signature detection, GANs for noise removal, and Siamese networks for author clustering.
While their pipeline shares our goal of large-scale automated signature analysis on real regulatory documents, their objective---grouping signatures by authorship---differs fundamentally from ours, which is detecting image-level reproduction within a single author's signatures across documents.

In the domain of image copy detection, Pizzi et al. [13] proposed SSCD, a self-supervised descriptor using ResNet-50 with contrastive learning for large-scale copy detection on natural images.
Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach.

## C. Perceptual Hashing

Perceptual hashing algorithms generate compact fingerprints that are robust to minor image transformations while remaining sensitive to substantive content changes [19].
Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process.

Jakhar and Borah [12] demonstrated that combining perceptual hashing with deep learning features significantly outperforms either approach alone for near-duplicate image detection, achieving AUROC of 0.99 on standard benchmarks.
Their two-stage architecture---pHash for fast structural comparison followed by deep features for semantic verification---provides methodological precedent for our dual-descriptor approach, though applied to natural images rather than document signatures.

Our work differs from prior perceptual-hashing studies in its application context and in the specific challenge it addresses: distinguishing legitimate high visual consistency (a careful signer producing similar-looking signatures) from image-level reproduction in scanned financial documents.

## D. Deep Feature Extraction for Signature Analysis

Several studies have explored pre-trained CNN features for signature comparison without metric learning or Siamese architectures.
Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach.
Tsourounis et al. [21] demonstrated successful transfer from handwritten text recognition to signature verification, showing that CNN features trained on related but distinct handwriting tasks generalize effectively to signature comparison.
Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach.

Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach.
These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical.

## E. Statistical Methods for Threshold Characterisation and Calibration

Our threshold-characterisation and calibration framework combines three families of methods developed in statistics and accounting-econometrics.

*Non-parametric density estimation.*
Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions.
Where the distribution is bimodal, the local density minimum (antimode) between the two modes is the Bayes-optimal decision boundary under equal priors.
The statistical validity of the unimodality-vs-multimodality dichotomy can be tested via the Hartigan & Hartigan dip test [37], which tests the null of unimodality; we use rejection of this null as evidence consistent with (though not a direct test for) bimodality.

*Discontinuity tests on empirical distributions.*
Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions.
Under the null that the distribution is generated by a single smooth process, the expected count in any histogram bin equals the average of its two neighbours, and the standardized deviation from this expectation is approximately $N(0,1)$.
The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis.
The BD/McCrary pairing provides a local-density-discontinuity diagnostic that is informative about distributional smoothness under minimal assumptions; we use it in that diagnostic role (rather than as a threshold estimator) because its transitions in our corpus are bin-width-sensitive at the signature level and rarely significant at the accountant level (Appendix A).

*Finite mixture models.*
When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters.
For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation.
Under mild regularity conditions, White's quasi-MLE result [41] supports interpreting maximum-likelihood estimates under a mis-specified parametric family as consistent estimators of the pseudo-true parameter that minimizes the Kullback-Leibler divergence to the data-generating distribution within that family; we use this result to justify the Beta-mixture fit as a principled approximation rather than as a guarantee that the true distribution is Beta.
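
The neighbour-average smoothness diagnostic can be sketched in a few lines. The counts below are synthetic, not corpus data, and the variance approximation follows the conventional Burgstahler-Dichev form; this is an illustrative sketch, not the paper's scripted implementation:

```python
import numpy as np

def bd_smoothness_z(counts):
    """Burgstahler-Dichev style standardized deviations for interior histogram bins.

    Under the null of a single smooth generating process, the expected count in
    bin i is the mean of its two neighbours; the returned z-scores are then
    approximately N(0,1).  Bin-width sensitivity applies, as noted in the text.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p = counts / N
    z = np.full_like(counts, np.nan)  # end bins have no two-sided neighbourhood
    for i in range(1, len(counts) - 1):
        expected = (counts[i - 1] + counts[i + 1]) / 2.0
        var = (N * p[i] * (1 - p[i])
               + 0.25 * N * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        z[i] = (counts[i] - expected) / np.sqrt(var)
    return z

# A smooth histogram keeps all interior z-scores near zero;
# a local spike violates neighbour-average smoothness and stands out.
z_smooth = bd_smoothness_z([100, 110, 120, 130, 140])
z_spiky = bd_smoothness_z([100, 110, 400, 130, 140])
```

A large $|z|$ at one bin flags a local density discontinuity, which is exactly the diagnostic (not threshold-estimating) role the text assigns to the BD/McCrary pairing.
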

The present study uses these tools diagnostically: first to test whether the descriptor distribution supports a natural operating boundary, and then, when that support fails under composition decomposition, to motivate anchor-based ICCR calibration of a fixed deployed rule.

*Cross-validation in a small-cluster scope.*
Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the deployed five-way operational classifier (§III-H.1; calibrated separately in §III-L). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier.


# III. Methodology

## A. Pipeline Overview

We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
Fig. 1 illustrates the overall architecture.
The pipeline takes as input a corpus of PDF audit reports and produces five-way operational screening labels (§III-H.1) whose behaviour is characterised by pixel-identity positive-anchor capture checks and inter-CPA coincidence-rate calibration (§III-L).

Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.


## B. Data Collection

The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023.
The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings.
An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period.
Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.

CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs.
The collected filings span 15 document types, with the majority (86.4%) being standard audit reports.
Table I summarizes the dataset composition.


## C. Signature Page Identification

To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24], one of the multimodal generative models surveyed in [35], as an automated pre-screening mechanism.
Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature.
The model was configured with temperature 0 for deterministic output.

The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document.
Scanning terminated upon the first positive detection.
This process identified 86,084 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded.
An additional 13 PDFs that could not be rendered (corruption or read errors) were excluded, yielding a final set of 86,071 documents.

A cross-check between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents.
The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling.

## D. Signature Detection

We adopted YOLOv11n (nano variant) [25], a lightweight descendant of the original YOLO single-stage detector [34], for signature region localization.
A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction.
A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.

The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).


Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers).
A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.

Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). The matched records assume standard two-signature ordering; residual order-mismatch risk remains for nonstandard layouts. The remaining 7.4% (13,573 signatures) could not be matched to a registered CPA name---typically because the auditor's report page format deviates from the standard two-signature layout, or because OCR of the printed CPA name on the page returns a name not present in the registry---and these signatures are excluded from all subsequent same-CPA pairwise analyses (a same-CPA best-match statistic is undefined when a signature has no assigned CPA). The 92.6% matched subset forms the candidate pool for same-CPA analyses, before the Big-4 and descriptor-completeness restrictions described in §III-G.

## E. Feature Extraction

Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning.
The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.

Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.

The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-G).
This design choice is supported by an ablation study (Section IV-L) comparing ResNet-50 against VGG-16 and EfficientNet-B0.

## F. Dual-Method Similarity Descriptors

For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:

**Cosine similarity on deep embeddings** captures high-level visual style:

$$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$

where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized 2048-dim feature vectors.
Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].

**Perceptual hash distance (dHash)** [27] captures structural-level similarity.
Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint.
The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images.
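
Both descriptors are compact enough to sketch directly. The illustration below uses synthetic stand-ins (a random 8-row by 9-column thumbnail in place of the downsampled signature crop, and a random 2048-dim vector in place of the ResNet-50 embedding); it shows why a byte-identical reproduction necessarily sits at the extremum of both measures:

```python
import numpy as np

def dhash_bits(gray: np.ndarray) -> np.ndarray:
    """64-bit dHash: signs of horizontal gradients on an 8x9 grayscale
    thumbnail (the pipeline downsamples each signature crop to this size)."""
    assert gray.shape == (8, 9)
    # 8 column-pair comparisons per row x 8 rows -> 64 bits
    return (gray[:, 1:] > gray[:, :-1]).astype(np.uint8).ravel()

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

def cosine(fa: np.ndarray, fb: np.ndarray) -> float:
    """For L2-normalised vectors this reduces to the dot product."""
    return float(np.dot(fa, fb) / (np.linalg.norm(fa) * np.linalg.norm(fb)))

rng = np.random.default_rng(7)
thumb = rng.integers(0, 256, size=(8, 9))   # stand-in for a downsampled crop
feat = rng.standard_normal(2048)            # stand-in for a ResNet-50 embedding

# A byte-identical reproduction scores at both extrema:
assert hamming(dhash_bits(thumb), dhash_bits(thumb)) == 0
assert abs(cosine(feat, feat) - 1.0) < 1e-9
```

Scan-stage noise perturbs the two measures through different mechanisms (gradient-sign flips versus embedding drift), which is what makes their joint extremum an informative replication signal.
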
Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].

These descriptors provide partially independent evidence.
Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise.
Non-hand-signing is expected to yield extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise; scan-stage noise can in principle push a replicated pair off either extremum but rarely both.
Hand-signing, by contrast, often yields high dHash similarity (the overall layout of a signature is typically preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.

We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically.
Appendix B contains the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.

Cosine similarity on L2-normalised deep embeddings and dHash both remain stable across the print-scan-rasterise cycle by design [14], [19], [21], [27]; together they constitute the dual descriptor used throughout the rest of this paper.

## G. Unit of Analysis and Scope

We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.

We make no within-year or across-year uniformity assumption about CPA signing mechanisms.
Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.

We adopt one stipulation about same-CPA pair detectability:

> **(A1) Pair-detectability.** *If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the cross-year same-CPA pool used by the max-cosine / min-dHash computation.*

A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.

**Scope: the Big-4 sub-corpus.** The primary analyses (§III-I, §III-J, §III-K, §III-L, and the corresponding §IV-D through §IV-J and §IV-M tables) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F; §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ — the threshold for accountant-level analyses — totalling 150,442 Big-4 signatures with both pre-computed descriptors available.
Restricting the primary analyses to Big-4 is a methodological choice driven by four considerations: + +1. **Restricted generalisability claim and Big-4 institutional comparability.** The primary claims are scoped to the Big-4 audit-report context, where the four firms share comparable institutional scale, document-production infrastructure, and CPA-volume regime; we do not assert that the same descriptive mixture structure or operational alert behaviour extends to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H.2's reverse-anchor internal-consistency check, (b) as a robustness comparison in §IV-K, and (c) as a corroborating-population check on the dHash discrete-mass-point artefact in §III-I.4. Generalisation beyond Big-4 is left as future work. + +2. **Within-firm cross-CPA collision structure analysis.** §III-L.4 reports a Big-4 cross-firm hit-matrix analysis that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work. + +3. **Firm A as templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-J K=3 component cross-tab; byte-level pair analysis referenced in §III-H.2). We retain Firm A within the Big-4 scope as a descriptive case study of the templated end rather than as the calibration anchor for thresholds. + +4. **Leave-one-firm-out fold feasibility.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm. 
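The accountant-level aggregation of §III-G (per-CPA descriptor means under the $n_{\text{sig}} \geq 10$ stability filter) reduces to a small grouping step. A minimal sketch, with a hypothetical record layout `(cpa_id, cos_s, dhash_s)` that is not the corpus schema:

```python
from collections import defaultdict

MIN_SIGS = 10  # accountant-level stability threshold from the text


def accountant_means(signatures):
    """Aggregate per-signature descriptors into per-CPA means, keeping only
    CPAs with at least MIN_SIGS signatures; CPAs below the threshold are
    excluded from accountant-level analyses but not from per-signature ones."""
    pools = defaultdict(list)
    for cpa_id, cos_s, dhash_s in signatures:
        pools[cpa_id].append((cos_s, dhash_s))
    means = {}
    for cpa_id, rows in pools.items():
        if len(rows) < MIN_SIGS:
            continue
        n = len(rows)
        means[cpa_id] = (sum(c for c, _ in rows) / n,
                        sum(d for _, d in rows) / n)
    return means
```

The per-CPA mean here is, as the text stresses, a summary statistic of observed signatures, not a mechanism assignment.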
+ +**Sample-size reconciliation.** Two Big-4 signature counts appear in this section and §IV: $n = 150{,}442$ for analyses using the pre-computed per-signature descriptors $\text{cos}_s$ (`max_similarity_to_same_accountant`) and $\text{dHash}_s$ (`min_dhash_independent`), and $n = 150{,}453$ for analyses recomputing pair-level metrics directly from the stored feature and dHash byte vectors (Scripts 40b, 43, 44). The $11$-signature difference reflects descriptor-completion status: $11$ signatures have feature vectors and dHash byte vectors stored but lack the pre-computed extrema. The $11$ signatures are negligible at population scale and do not shift any reported coincidence rate by more than $0.01$ percentage points. The CPA counts $468$ (all Big-4 CPAs with both vectors stored) and $437$ (Big-4 CPAs with $n_{\text{sig}} \geq 10$ for accountant-level stability) likewise reflect a single uniform exclusion rule rather than analysis-specific subsetting. + +## H. Operational Classifier and Reference Populations + +### H.1. Deployed Operational Rule + +Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$, where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA: + +1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence. +2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff. +3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration — consistent with a CPA who signs very consistently but not via image reproduction.
+4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$. +5. **Likely hand-signed (LH):** Cosine $\leq 0.837$. + +Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (Appendix B). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L). + +The remainder of this section (§III-H.2) describes the reference populations used to calibrate and cross-check this rule. §III-I demonstrates that the descriptor distributions do not provide a within-population natural threshold; §III-J–§III-K develop the descriptive partition and internal-consistency cross-checks; §III-L develops the anchor-based threshold calibration; §III-M discloses the unsupervised-setting limits. + +### H.2. Reference Populations + +The calibration distinguishes two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking. + +**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. 
In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), 0% of Firm A's CPAs fall in the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% in the C2 component (central region), and 82.5% in the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3 — the Big-4 firm whose CPAs are most concentrated in C1). Byte-level decomposition of these signatures (Appendix B) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures. + +Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate. + +**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-sized firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38).
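The robust reference fit just described, and the marginal-CDF score derived from it, can be sketched as follows on synthetic stand-in data (the centre values are taken from the text; the covariance is illustrative, and `MinCovDet` is scikit-learn's Minimum Covariance Determinant estimator):

```python
import numpy as np
from scipy.stats import norm
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
# Synthetic stand-in for the 249 non-Big-4 per-CPA (mean-cos, mean-dHash)
# points; the reported reference centre is (0.935, 9.77), spread assumed.
ref = rng.multivariate_normal([0.935, 9.77],
                              [[0.0004, 0.0], [0.0, 4.0]], size=249)

# Robust 2D Gaussian reference with support fraction 0.85, as described.
mcd = MinCovDet(support_fraction=0.85, random_state=0).fit(ref)
mu_cos = mcd.location_[0]
sd_cos = np.sqrt(mcd.covariance_[0, 0])


def reverse_anchor_percentile(mean_cos):
    """Marginal cosine CDF value under the robust reference (the quantity
    that, sign-flipped, yields the reverse-anchor score)."""
    return norm.cdf(mean_cos, loc=mu_cos, scale=sd_cos)
```

A Big-4 CPA with a high mean cosine lands deep in the reference's right tail (percentile near 1); the sign flip described in the text then converts this into the deviation score used in §III-K.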
This reference is used in §III-K's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores used as a cross-check on the per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked. + +The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 low-cos / high-dHash component (cos $= 0.946$, dHash $= 9.17$; §III-J); compared to the Big-4 high-cos / low-dHash component (cos $= 0.983$, dHash $= 2.41$; §III-J) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end of the descriptor plane. This is a "deviation in the less-replication-dominated descriptor-position direction" measure, not a "deviation toward the templated descriptor-position" measure; the reference is the less-replication-dominated population. + +## I. Distributional Diagnostics: Why the Composition Path Does Not Yield a Natural Threshold + +This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G and tests whether the distribution provides distributional support — in the form of within-population bimodality — for the deployed operational thresholds. 
We apply four diagnostic procedures in turn: a univariate unimodality test on each accountant-level marginal; a 2D Gaussian mixture fit (developed in §III-J); a density-smoothness diagnostic; and a composition decomposition that distinguishes within-population multimodality from between-firm location-shift artefacts. The four diagnostics jointly imply that the operational thresholds are *not* anchored by distributional bimodality: §III-L develops an anchor-based calibration framework that does not require this assumption. + +**1. Hartigan dip test on each accountant-level marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds in the comparison scopes tested in Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A CPAs pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). Single-firm dip tests for Firms B, C, and D were not separately computed; the comparison scopes above sufficed to establish that no narrower-than-Big-4 *tested* scope at the accountant level rejected unimodality. The accountant-level Big-4 rejection is a descriptive observation; §III-I.4 below shows that the rejection is fully explained by between-firm location-shift effects rather than within-population bimodality. + +**2. 
K=2 / K=3 Gaussian mixture fits (descriptive partition).** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42; Script 34) recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ over $n_{\text{boot}} = 500$ resamples. The 3-component fit (§III-J) is BIC-preferred: $\text{BIC}(K{=}3) - \text{BIC}(K{=}2) = -3.48$, with lower BIC preferred (Script 36). The $\Delta$BIC magnitude is small in absolute terms; we do not treat $\Delta\text{BIC} = 3.5$ alone as decisive evidence for K=3 as a population mixture. Following §III-I.4 we treat both K=2 and K=3 fits as *descriptive partitions* of the joint Big-4 distribution that reflect firm-composition structure (Firm A vs others; §III-J) rather than as inferential evidence for two or three latent population modes. + +**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38], [39] as a *density-smoothness diagnostic* (rather than as a threshold estimator) on each accountant-level marginal axis (cosine in bins of $0.002$, dHash in integer bins). At the Big-4 scope, the diagnostic identifies no significant transition on either marginal at $\alpha = 0.05$ (Script 34). Outside Big-4, the diagnostic does flag dHash transitions in some subsets (Script 32: `big4_non_A` dHash transition at $10.8$; `all_non_A` dHash transition at $6.6$; pre-2018 and post-2020 time-stratified variants also exhibit one or more dHash transitions), but no cosine transition is identified in any subset.
The Big-4-scope null on both axes is consistent with §III-I.4 below: under the composition decomposition the Big-4 marginals are unimodal once between-firm and integer-tie confounds are removed, so a local-discontinuity test correctly fails to flag a within-population transition. + +**4. Composition decomposition (Scripts 39b–39e).** §III-I.1 establishes that the accountant-level marginals reject unimodality at the Big-4 sub-corpus. The remaining question is whether the rejection reflects (a) genuine within-population bimodality at the signature or accountant level, (b) between-firm location-shift artefacts (firms with different mean descriptor positions pool to a multi-peaked distribution), or (c) integer mass-point artefacts on the integer-valued dHash axis (the dHash dip statistic is sensitive to spikes at integer values). We apply four diagnostics that decompose the rejection into these candidate sources: + +*Within-firm signature-level dip (Scripts 39b, 39c).* Repeating the dip test at the signature level inside each individual Big-4 firm (Script 39b) and inside each individual non-Big-4 firm with $\geq 500$ signatures (Script 39c) yields a consistent picture. The cosine marginal *fails* to reject unimodality in every single firm tested — all four Big-4 firms ($p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ for Firms A through D; Script 39b) and ten non-Big-4 firms with $\geq 500$ signatures ($p_{\text{cos}} \in [0.59, 0.99]$; Script 39c). The raw dHash marginal *does* reject unimodality in every firm tested ($p < 5 \times 10^{-4}$ in all $14$ firms), but the raw dHash values are integer-valued in $\{0, 1, \ldots, 64\}$, leaving open the possibility of an integer-tie artefact. 
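The bootstrap $p$-value convention used throughout — report $p < 1/n_{\text{boot}}$ rather than $p = 0$ when no replicate exceeds the observed statistic — can be sketched generically. The dip statistic itself comes from a third-party implementation (e.g. the `diptest` package) in practice, so the sketch takes the statistic as an argument and demonstrates with a placeholder statistic that is *not* the Hartigan dip:

```python
import random


def bootstrap_p(sample, stat_fn, n_boot=2000, seed=42):
    """One-sided bootstrap p-value under a U(0,1) null. Returns the p-value
    floored at 1/n_boot, plus a flag indicating that no replicate exceeded
    the observed statistic (i.e. report as 'p < 1/n_boot')."""
    rng = random.Random(seed)
    observed = stat_fn(sample)
    n = len(sample)
    exceed = sum(
        stat_fn([rng.random() for _ in range(n)]) >= observed
        for _ in range(n_boot)
    )
    p = exceed / n_boot
    return max(p, 1.0 / n_boot), exceed == 0


def midrange_gap(xs):
    """Placeholder statistic for illustration only (NOT the dip statistic)."""
    xs = sorted(xs)
    return abs(xs[len(xs) // 2] - (xs[0] + xs[-1]) / 2)
```

With $n_{\text{boot}} = 2000$ the floor is $1/2000 = 5 \times 10^{-4}$, matching the reporting convention above.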
+ +*Integer-jitter robustness (Scripts 39d, 39e).* Adding independent uniform jitter $\sim \mathrm{U}[-0.5, +0.5]$ to break exact dHash ties and re-running the dip test on the perturbed signature cloud (5 seeds, $n_{\text{boot}} = 2000$; Script 39d) eliminates the dHash within-firm rejection in every Big-4 firm tested (Firm A jittered $p_{\text{median}} = 0.999$; B $0.996$; C $0.999$; D $0.9995$; $0$/$5$ seeds reject at $\alpha = 0.05$ in any firm). The pooled-Big-4 dHash dip *does* survive jitter alone ($p_{\text{median}} < 5 \times 10^{-4}$, $5$/$5$ seeds reject), but Firm A's mean dHash ($2.73$) is substantially below Firms B/C/D's ($6.46$, $7.39$, $7.21$) — a between-firm location shift. Script 39e applies a $2 \times 2$ factorial correction (firm-mean centring $\times$ integer jitter) on the Big-4 pooled dHash: + +| Condition | Firm-mean centred | Integer jitter | Median dip $p$ | Reject at $\alpha = 0.05$ | +|---|---|---|---|---| +| 1 raw | — | — | $< 5 \times 10^{-4}$ | $5/5$ | +| 2 centred only | $\checkmark$ | — | $< 5 \times 10^{-4}$ | $5/5$ | +| 3 jittered only | — | $\checkmark$ | $< 5 \times 10^{-4}$ | $5/5$ | +| 4 centred and jittered | $\checkmark$ | $\checkmark$ | $\mathbf{0.35}$ | $\mathbf{0/5}$ | + +Removing *both* the between-firm location shift *and* the integer mass points eliminates the Big-4 dHash rejection. The Big-4 pooled dHash multimodality is therefore fully attributable to firm-composition contrast (primarily Firm A's mean $\text{dHash} = 2.73$ versus Firms B/C/D $\approx 6.5$–$7.4$) and integer-density artefacts, with no residual continuous within-firm bimodality. + +*Cosine analogue.* The cosine axis follows the same pattern by construction: the within-firm signature-level cosine dip tests above (Scripts 39b, 39c) fail to reject in every Big-4 firm and in every eligible non-Big-4 firm, so any pooled cosine multimodality must arise from between-firm composition rather than from within-population bimodality.
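The two corrections in the Script 39e factorial — firm-mean centring to remove the between-firm location shift, and uniform jitter to break integer ties — can be sketched on synthetic integer-valued data (firm means chosen to echo the reported location shift; the dip test itself would then be run on each cell's output):

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic integer dHash values for four hypothetical firms with a
# between-firm location shift (one low-mean firm, three higher-mean firms).
firms = {
    "A": rng.poisson(2.7, 500), "B": rng.poisson(6.5, 500),
    "C": rng.poisson(7.4, 500), "D": rng.poisson(7.2, 500),
}


def factorial_condition(firms, centred, jittered, seed=0):
    """Pooled dHash sample under one cell of the 2x2 factorial: centring
    subtracts each firm's mean; jitter adds U[-0.5, 0.5] noise per value."""
    r = np.random.default_rng(seed)
    pooled = []
    for vals in firms.values():
        v = vals.astype(float)
        if centred:
            v = v - v.mean()
        if jittered:
            v = v + r.uniform(-0.5, 0.5, size=v.size)
        pooled.append(v)
    return np.concatenate(pooled)
```

Condition 4 of the table corresponds to `factorial_condition(firms, centred=True, jittered=True)`; conditions 1–3 toggle the two flags accordingly.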
+ +*Integer-histogram valleys (Script 39d).* A genuine within-firm dHash antimode would appear as a strict local minimum in the count histogram with deep relative depth. Within each of the four Big-4 firms, the dHash histogram on bins $0$–$20$ exhibits no strict local minimum; the Big-4 pooled histogram exhibits one shallow valley at $\text{dHash} = 4$ with relative depth $0.021$ (a $2.1\%$ count drop). No valley near the deployed $\text{dHash} = 5$ operational boundary appears within any individual firm. The hypothesised dHash antimode near $\text{dHash} \approx 5$ is not empirically supported by the histogram analysis. + +**5. Conclusion: no natural threshold from the descriptor distribution.** §III-I.4 jointly establishes that (a) the Big-4 accountant-level dip rejection is fully attributable to between-firm composition and integer mass-point artefacts; (b) within the Big-4 firms, the descriptor marginals at the signature level are unimodal once integer ties are broken (Scripts 39b, 39d); (c) eligible non-Big-4 checks provide corroborating raw-axis evidence on the cosine dimension (Script 39c) and corroborate the integer-mass-point reading of raw dHash, but are not used as calibration evidence for the deployed thresholds; and (d) no integer-histogram valley near the deployed $\text{dHash} = 5$ operational boundary exists within any Big-4 firm. The descriptor distributions therefore do not contain a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits of §III-I.2 and §III-J are retained as *descriptive partitions* that reflect firm-composition contrast, not as inferential evidence for two or three population modes. §III-L develops the anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode. + +## J. 
K=3 as a Descriptive Partition of Firm-Composition Contrast + +This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes.** §III-I.4 demonstrates that the apparent multimodality of the accountant-level marginals is fully explained by between-firm location shifts and integer mass-point artefacts, leaving no residual evidence for two or three latent within-population mechanism classes. Neither mixture is used to assign signature-level or document-level labels in the primary analysis. The operational classifier of §III-H.1 is calibrated in §III-L via inter-CPA negative-anchor coincidence rates, not via mixture-derived antimodes. + +**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ (weight $0.689$) and $(0.983, 2.41)$ (weight $0.311$) (Script 34). $\text{BIC}(K{=}2) = -1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. We refer to the components by index rather than by mechanism labels, since §III-I.4 establishes that the K=2 separation is firm-compositional rather than mechanistic. + +**K=3 fit.** Three components, sorted by ascending cosine mean (Script 35; Script 38 reproduces): + +| Component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position | +|---|---|---|---|---| +| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner | +| C2 | 0.9558 | 6.66 | 0.536 | central region | +| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner | + +$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical preference for K=3 under standard BIC interpretation, but not by itself decisive). 
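The mixture-fit mechanics (full-covariance GMM, multiple initialisations, marginal crossing where the weighted component densities are equal) can be reproduced on synthetic data; the group centres below are illustrative values near the reported component means, not corpus data:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Two overlapping, firm-driven groups of accountant-level cosine means.
x = np.concatenate([rng.normal(0.954, 0.010, 300),
                    rng.normal(0.983, 0.005, 137)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=15, random_state=42).fit(x)

# Sort components by mean so the crossing is bracketed between them.
order = np.argsort(gmm.means_.ravel())
m = gmm.means_.ravel()[order]
s = np.sqrt(gmm.covariances_.ravel()[order])
w = gmm.weights_[order]


def weighted_density_diff(v):
    """Difference of the two weighted component densities at v."""
    return w[0] * norm.pdf(v, m[0], s[0]) - w[1] * norm.pdf(v, m[1], s[1])


# Marginal crossing: root of the weighted-density difference between means.
crossing = brentq(weighted_density_diff, m[0], m[1])
```

On the corpus this crossing is the reported $\overline{\text{cos}}^* = 0.9755$; on the synthetic data above it simply lands between the two fitted means.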
The "descriptive position" column refrains from any mechanism interpretation: §III-I.4 establishes that the cosine and dHash axes both lack within-population bimodality, so component centres are best interpreted as locations in a continuous descriptor space rather than as latent mechanism modes. + +**Per-firm component composition (Script 35 firm × cluster cross-tab).** The K=3 partition is dominated by firm membership: + +- Firm A: $0\%$ C1, $17.5\%$ C2, $82.5\%$ C3 +- Firm B: $8.9\%$ C1, $\sim 78\%$ C2, $\sim 13\%$ C3 +- Firm C: $23.5\%$ C1, $75.5\%$ C2, $1.0\%$ C3 +- Firm D: $11.5\%$ C1, $\sim 84\%$ C2, $\sim 4.5\%$ C3 + +Firm A accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24$ of the $40$ C1-assigned CPAs. The K=3 partition is therefore well-described as a firm-compositional decomposition: C3 is essentially "Firm A and any non-Firm-A CPA whose mean descriptors happen to land in the high-cos / low-dHash corner"; C1 is essentially "non-Firm-A CPAs whose mean descriptors land in the low-cos / high-dHash corner." The composition contrast that K=3 captures at the accountant level reappears at the deployment level in the cross-firm hit matrix of §III-L.4 (Script 44): under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). The K=3 partition and the cross-firm hit matrix therefore describe the same underlying firm-compositional structure at two different units of analysis. + +**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). 
The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.023$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label. + +We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the operational classifier: + +- The Big-4 K=2 marginal crossing $(0.975, 3.76)$ is essentially a firm-mass separator between Firm A and Firms B + C + D, not a within-Big-4 mechanism boundary. +- The Big-4 K=3 mixture exhibits a reproducible three-component component shape across LOOO folds at the descriptor-position level, with C1 reproducibly located at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$. +- Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp); K=3 is therefore not used to assign operational labels to CPAs. 
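The leave-one-firm-out refit loop behind these stability numbers can be sketched on synthetic data (firm sizes echo the 171/112/102/52 split; the values are illustrative, not corpus data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic per-CPA cosine means for four hypothetical firms.
firms = {"A": rng.normal(0.983, 0.005, 171), "B": rng.normal(0.955, 0.010, 112),
         "C": rng.normal(0.950, 0.010, 102), "D": rng.normal(0.956, 0.010, 52)}


def lofo_component_means(firms, k=3):
    """Refit a K-component GMM with each firm held out in turn and collect
    the sorted component means (the component-shape quantity whose
    across-fold deviation is tracked)."""
    folds = {}
    for held_out in firms:
        train = np.concatenate([v for f, v in firms.items() if f != held_out])
        g = GaussianMixture(n_components=k, n_init=5, random_state=0)
        g.fit(train.reshape(-1, 1))
        folds[held_out] = np.sort(g.means_.ravel())
    return folds


folds = lofo_component_means(firms)
# Across-fold stability of the lowest-mean component (the C1 analogue).
c1 = np.array([v[0] for v in folds.values()])
max_dev = np.abs(c1 - c1.mean()).max()
```

Comparing `max_dev` against a pre-registered tolerance is the shape-stability check; hard-posterior membership of the held-out firm would be a separate, composition-sensitive comparison, as the text notes.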
+ +The operational signature-level classifier of §III-L is calibrated against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K. + +## K. Convergent Internal-Consistency Checks + +The descriptive partition of §III-J is supported by three feature-derived per-CPA scores and a conservative hard-positive subset analysis. We caution at the outset that the three scores are **not statistically independent measurements** — all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ — so their high pairwise rank correlations are partly a mechanical consequence of shared inputs. Per §III-I.4, none of the three scores has a within-population bimodality interpretation; they are firm-compositional position scores at the accountant level. The checks below therefore document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide). + +**1. Three feature-derived per-CPA scores (Script 38).** For each Big-4 CPA we compute: + +- **Score 1 (K=3 posterior on the low-cos / high-dHash component):** $P(\text{C1})$ from the K=3 fit of §III-J. Per §III-J this is a firm-compositional position score on the (cos, dHash) plane (not a probability of any latent "hand-signing mechanism") — a function of both descriptor means. +- **Score 2 (reverse-anchor cosine percentile):** the marginal cosine CDF value of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H.2, sign-flipped so that lower percentile (further into the reference's left tail) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end. This is a function of $\overline{\text{cos}}_a$ alone. 
+- **Score 3 (deployed binary high-confidence box rule rate):** the per-CPA fraction of signatures that do **not** satisfy the deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$). This is a per-signature-aggregated function of the same descriptors. + +Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs (Script 38): + +| Pair | Spearman $\rho$ | $p$-value | +|---|---|---| +| Score 1 vs Score 3 | $+0.9627$ | $< 10^{-248}$ | +| Score 2 vs Score 3 | $+0.8890$ | $< 10^{-149}$ | +| Score 1 vs Score 2 | $+0.8794$ | $< 10^{-142}$ | + +We read this as the strongest internal-consistency signal in the analysis: three different summarisations of the same descriptor pair agree on the per-CPA descriptor-position ranking with $\rho > 0.87$. The three scores agree on placing Firm A as the most replication-dominated descriptor position and the three non-Firm-A Big-4 firms further from the templated end, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Scores 1 and 3) place Firm C at the less-replication-dominated end of Big-4 (mean P(C1) $= 0.311$; mean box-rule less-replication-dominated rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the deployed box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the less-replication-dominated end between the three non-A Big-4 firms. + +**2. 
Per-signature consistency (Script 39).** Per-CPA aggregation could in principle reflect averaging across within-CPA heterogeneity rather than coherent within-CPA behaviour. We test this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points (Script 39) — and comparing labels. The per-CPA and per-signature K=3 fits recover a broadly similar three-component ordering; per-CPA C1 is at $\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$ vs per-signature C1 at $\overline{\text{cos}} = 0.928$, $\overline{\text{dHash}} = 9.75$ (an absolute cosine drift of $0.018$). Cohen $\kappa$ on the binary collapse (replication-dominated vs less-replication-dominated): + +| Pair | Cohen $\kappa$ | +|---|---| +| Deployed binary high-confidence box rule vs per-CPA K=3 hard label | $0.662$ | +| Deployed binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ | +| Per-CPA K=3 vs per-signature K=3 | $0.870$ | + +The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the five-way rule. This comparison checks only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly check the five-way rule's $5 < \text{dHash} \leq 15$ moderate-confidence band, whose calibration and capture-rate evidence is reported in Appendix B and not regenerated on the Big-4 subset. + +**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above.
We summarise the joint result for cross-reference: + +- *K=2 LOOO is unstable.* The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$, against the report's $0.005$ across-fold stability tolerance (Script 36; pairwise fold range $0.0376$, from $0.9380$ to $0.9756$). When Firm A is held out, the fold rule classifies $171/171$ of held-out Firm A CPAs as templated; when any non-Firm-A Big-4 firm is held out, the fold rule classifies $0$ of the held-out firm's CPAs as templated. This pattern indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary. + +- *K=3 LOOO is partially stable.* The C1 (low-cos / high-dHash) component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.023$ in mixture weight (Script 37). Hard-posterior membership remains composition-sensitive — observed absolute differences are $1.8$–$12.8$ pp across the four folds, with the Firm C fold exceeding the report's $5$ pp viability bar; the report's own screening label is `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"), i.e. K=3 is not predictively useful as an operational classifier. We accordingly do not use K=3 hard-posterior membership as an operational label. + +**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one conservative hard-positive subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Because independent hand-signing cannot produce pixel-identical images, these signatures constitute a conservative hard-positive subset for image replication. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40). + +We report each candidate check's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions.
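The Wilson 95% intervals attached to these miss rates (e.g. the $0/262$ bound) follow the standard score construction, which stays informative when zero events are observed; a self-contained sketch:

```python
from math import sqrt


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion. Unlike the normal
    (Wald) interval, it does not collapse to [0, 0] when successes == 0."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, centre - half), min(1.0, centre + half)
```

For zero misses out of 262, `wilson_ci(0, 262)` gives an upper bound of approximately $0.0145$, matching the $[0\%, 1.45\%]$ intervals reported for the byte-identical subset.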
This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 sample) and the corpus-wide version cited at §IV-I:

| Candidate check | Pixel-identity miss rate (Wilson 95% CI) |
|---|---|
| Deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0\%$ $[0\%, 1.45\%]$ |
| K=3 per-CPA hard label (C3 high-cos / low-dHash corner; descriptive only) | $0\%$ $[0\%, 1.45\%]$ |
| Reverse-anchor with prevalence-calibrated cut | $0\%$ $[0\%, 1.45\%]$ |

All three candidate scores correctly assign every byte-identical signature to the replicated class. We caution that for the box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them. The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* this check would be disqualified), not a sufficient validation of the classifier's behaviour on the non-byte-identical replicated population. The reverse-anchor cut here is chosen by prevalence calibration against the box rule's overall replicated rate ($49.58\%$ of Big-4 signatures); this is a documented limitation, since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.

## L. Anchor-Based Threshold Calibration

The operational classifier defined in §III-H.1 is calibrated by characterising the deployed thresholds' inter-CPA pair-level negative-anchor coincidence behaviour and their pool-normalised per-signature and per-document alert behaviour, at multiple units of analysis.
§III-I.4 establishes that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold; the K=3 mixture of §III-J is a descriptive firm-compositional partition, not a mechanism-cluster model. Throughout this section we report **inter-CPA coincidence rates** rather than "False Acceptance Rates"; we explain the terminological choice in §III-L.0.

### L.0. Calibration methodology

**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with prior reporting and with the existing literature. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine threshold's specificity behaviour at the inter-CPA pair level and the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain the prior calibration; the present calibration does not provide independent rates for those sub-bands.

**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:

- *Per comparison.* For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos $>$ cos\_threshold and / or dHash $\leq$ dHash\_threshold)? This is the conventional pairwise calibration unit in biometric verification. We report it for both the cosine and dHash dimensions, marginally and jointly (§III-L.1).
- *Per signature pool.* For a Big-4 source signature $s$ with same-CPA pool of size $n_{\text{pool}}(s)$, what is the probability that the deployed rule fires *under the counterfactual* of replacing the source's same-CPA pool with $n_{\text{pool}}(s)$ random non-same-CPA candidates?
This addresses the standard concern that a per-pair rate computed on independent pairs is not the deployed-rule rate at the per-signature classifier level: the deployed rule takes max-cosine and min-dHash over a pool of size $n_{\text{pool}}(s)$, so its effective coincidence rate is approximately $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit (§III-L.2).
- *Per document.* For an audit report aggregated via the worst-case rule, what fraction of documents have at least one signature whose deployed pool-normalised rule fires under the same inter-CPA candidate-replacement counterfactual? This is the operational alarm-rate unit (§III-L.3).

**Any-pair vs same-pair semantics.** The deployed rule uses independent extrema: a signature satisfies the HC rule if $\max_{\text{pool}} \text{cos} > 0.95$ AND $\min_{\text{pool}} \text{dHash} \leq 5$, *not* if a single candidate in the pool satisfies both. We refer to this as the **any-pair** rule. A stricter alternative — the **same-pair** rule — requires a single candidate to satisfy both inequalities; the deployed rule is any-pair, but we report same-pair as a stricter alternative classifier where useful (§III-L.2, §III-L.4).

**Terminological note on "FAR".** The biometric-verification literature speaks of "False Acceptance Rate" (FAR) for a per-pair rate computed on independent inter-CPA pairs. We adopt **inter-CPA coincidence rate (ICCR)** as the metric name and *do not* use "FAR" in the manuscript prose, for two reasons: (a) FAR has a specific biometric-verification meaning that requires ground-truth negative labels (which the corpus does not provide at the signature level); (b) §III-L.4 shows that the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures.
Reading "inter-CPA coincidence rate" as a *specificity proxy* under an explicitly disclosed assumption is faithful to the evidence; reading it as a true biometric FAR would overstate the evidence.

### L.1. Per-comparison inter-CPA coincidence rate (Script 40b)

We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from Big-4 signatures, computing for each pair the cosine similarity (feature dot product) and the Hamming distance between the dHash byte vectors. Marginal and joint rates at threshold $k$ are reported with Wilson 95% confidence intervals (Script 40b).

| Threshold | Per-comparison inter-CPA coincidence rate | 95% Wilson CI |
|---|---|---|
| Cosine $> 0.95$ | $0.00060$ | $[0.00053, 0.00067]$ |
| Cosine $> 0.945$ (prior published operating point) | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| Cosine $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| dHash $\leq 2$ | $0.00006$ | $[0.00004, 0.00008]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | $[0.00011, 0.00018]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ |

The cosine row at $\text{cos} > 0.95$ is consistent with a prior published per-comparison rate of $0.0005$ on a similarly sized inter-CPA sample; the present $5 \times 10^5$-pair sample yields $0.00060$, within that earlier precision. The dHash row and joint row are reported here for the first time on this corpus; the prior calibration did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.
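
As a cross-check on the Wilson intervals above, the joint-rule row can be reproduced from its raw count (70 joint hits among the $5 \times 10^5$ sampled pairs). The helper below is an illustrative sketch of the standard Wilson score interval, not the paper's Script 40b:

```python
import math

def wilson_ci(hits: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score 95% interval for a binomial proportion hits/n."""
    p = hits / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half) / denom, (centre + half) / denom

# Joint any-pair rule: 70 of 500,000 inter-CPA pairs satisfy
# cos > 0.95 AND dHash <= 5 (rate 0.00014, as in the table above).
lo, hi = wilson_ci(70, 500_000)
print(f"[{lo:.5f}, {hi:.5f}]")  # → [0.00011, 0.00018]
```

The Wilson form is preferred over the normal approximation here because the rates are tiny relative to the sample size, where the normal interval would be badly anti-conservative.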

The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.

**Conditional inter-CPA coincidence rate.** A natural follow-up question is whether the dHash dimension provides marginal specificity beyond the cosine gate. For pairs with cos $> 0.95$, the conditional rate of dHash $\leq 5$ is $0.234$ (Wilson 95% CI $[0.190, 0.285]$; $70$ of $299$ pairs in the Big-4 sample). At cos $> 0.95$, dHash provides $\sim 4.3\times$ further per-comparison specificity (joint $0.00014$ vs cos-only $0.00060$).

The per-comparison rate is a useful *specificity-proxy calibration* for the deployed rule's pair-level behaviour. It does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.

### L.2. Pool-normalised inter-CPA alert rate (Script 43)

The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \text{dHash}$ over the same-CPA pool of size $n_{\text{pool}}(s)$ for each signature $s$. A per-comparison rate is therefore not the rate at which the deployed classifier fires per signature. To compute the per-signature inter-CPA-equivalent rate, for each Big-4 source signature $s$ we simulate one realisation of an inter-CPA candidate pool of the same size $n_{\text{pool}}(s)$, drawn uniformly from non-same-CPA signatures across all firms, compute the deployed extrema and rule indicator, and aggregate (Script 43; $n_{\text{sig}} = 150{,}453$ vector-complete in this analysis; CPA-block bootstrap 95% CIs reported below).
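
The independence-limit form $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ invoked in §III-L.0 can be made concrete with a short sketch. The per-pair rate below is the §III-L.1 joint any-pair rate; the pool sizes are hypothetical, and this is not Script 43's pool-replacement simulation:

```python
def pool_rate(p_pair: float, n_pool: int) -> float:
    """Independence-limit probability that at least one of n_pool random
    inter-CPA candidates satisfies a rule with per-pair rate p_pair."""
    return 1.0 - (1.0 - p_pair) ** n_pool

p_joint = 0.00014  # §III-L.1 joint any-pair per-comparison rate
for n in (50, 200, 800):  # hypothetical pool sizes, illustrative only
    print(n, f"{pool_rate(p_joint, n):.4f}")
```

Note that the deployed any-pair rule takes extrema on the two descriptor dimensions independently, so its pool-normalised rate is not given exactly by this single-rate form; the sketch conveys the qualitative pool-size dependence only.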

**Headline rates (deployed any-pair rule, HC = cos $> 0.95$ AND dHash $\leq 5$).** Wilson 95% CIs on the point estimate, CPA-block bootstrap 95% CI on $n_{\text{boot}} = 1000$ replicates:

| Rule semantics | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|---|---|---|---|
| Any-pair (deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
| Same-pair (stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |

Per-firm any-pair rates (no bootstrap; descriptive):

| Firm | $n_{\text{sig}}$ | Any-pair ICCR | Same-pair ICCR |
|---|---|---|---|
| Firm A | $60{,}450$ | $0.2594$ | $0.2018$ |
| Firm B | $34{,}254$ | $0.0147$ | $0.0023$ |
| Firm C | $38{,}616$ | $0.0053$ | $0.0019$ |
| Firm D | $17{,}133$ | $0.0110$ | $0.0051$ |

**Pool-size decile dependence.** The deployed rule's pool-normalised rate is broadly monotonically increasing in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$. The trend is broadly monotonic with two minor reversals (deciles 5 and 9 dip below their predecessors).

**Threshold sensitivity at the per-signature unit.** Tightening the HC rule from $\text{dHash} \leq 5$ to $\text{dHash} \leq 3$ (same-pair) reduces the per-signature ICCR from $0.0827$ to $0.0449$ (Big-4 pooled); tightening to $\text{dHash} \leq 4$ gives $0.0639$ (same-pair). A stricter operating point of dHash $\leq 3$ same-pair would correspond to a per-signature ICCR of $\approx 0.05$; the deployed HC any-pair rule with $\text{dHash} \leq 5$ corresponds to $\approx 0.11$. Stakeholders requiring a tighter specificity proxy could consider the dHash $\leq 3$ same-pair variant, with the unsupervised-setting caveats of §III-M.

### L.3. Document-level inter-CPA proxy alert rate (Script 45)

The deployed worst-case aggregation classifies each document by the most-replication-consistent category among its constituent signatures (§III-H.1). Three operationally meaningful document-level alarm definitions are reported, each as the fraction of documents whose worst-case signature category falls in the alarm set under the same inter-CPA candidate-pool counterfactual as §III-L.2 (Script 45; $n_{\text{docs}} = 75{,}233$ Big-4 documents):

| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
| D2 | HC + MC ("any non-hand-signed screening label") | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |

Per-firm D2 document-level rates:

| Firm | $n_{\text{docs}}$ | D2 (HC + MC) ICCR |
|---|---|---|
| Firm A | $30{,}226$ | $0.6201$ |
| Firm B | $17{,}127$ | $0.1600$ |
| Firm C | $19{,}501$ | $0.1635$ |
| Firm D | $8{,}379$ | $0.0863$ |

The document-level D2 rate of $33.75\%$ pooled over Big-4 is the most operationally relevant alarm-rate metric: it is the fraction of audit documents that would carry at least one signature flagged HC or MC under the counterfactual of inter-CPA candidate-pool replacement. The non-trivial per-document inter-CPA alarm rate (and its concentration in Firm A at $62\%$) motivates the positioning of the operational system as a **screening framework with human-in-the-loop review**, not as an autonomous forensic classifier (§III-M).

### L.4. Firm heterogeneity (Script 44)

§III-L.2 and §III-L.3 report large per-firm variation in the deployed rule's pool-normalised behaviour: Firm A's any-pair per-signature ICCR is $0.2594$, an order of magnitude larger than Firm B's $0.0147$, Firm C's $0.0053$, and Firm D's $0.0110$.
A natural alternative explanation is the pool-size confound: Firm A's median pool size ($\sim 285$) is larger than the other firms', and pool size broadly monotonically increases the per-signature rate (§III-L.2 decile trend). We test the firm-vs-pool confound with a logistic regression of the per-signature hit indicator (any-pair HC) on firm dummies (Firm A = reference) and centred log pool size (Script 44):

| Term | Odds ratio (vs Firm A) | Direction | Magnitude |
|---|---|---|---|
| Firm B | $0.053$ | $< 1$ | $\sim 19\times$ lower odds than Firm A |
| Firm C | $0.010$ | $< 1$ | $\sim 100\times$ lower odds than Firm A |
| Firm D | $0.027$ | $< 1$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $> 1$ | $\sim 4\times$ higher odds per unit log pool size |

The Firm B/C/D odds ratios are very small after controlling for pool size, indicating that firm membership accounts for a large multiplicative effect on the per-signature rate that is *not* explained by pool size alone. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm, and naive standard errors would be unreliable under within-cluster correlation; a cluster-robust standard error analysis is left as a robustness check.)

The per-decile per-firm breakdown (Script 44) confirms the pattern: within every pool-size decile, Firms B/C/D have rates of $0.0006$–$0.0358$, while Firm A's rate ranges from $0.0541$ to $0.5958$ across deciles. The firm gap is large within matched pool sizes, not driven by pool composition.
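
The Magnitude column is plain arithmetic on the fitted odds ratios: an odds ratio below one maps to a "times lower" factor via its reciprocal, and each odds ratio is $e^{\beta}$ for the underlying logistic coefficient $\beta$. A minimal check, using the table's own values:

```python
import math

# Fitted odds ratios vs the Firm A reference (values from the table above).
odds_ratios = {"Firm B": 0.053, "Firm C": 0.010, "Firm D": 0.027}

for firm, odds_ratio in odds_ratios.items():
    times_lower = 1.0 / odds_ratio  # e.g. 0.053 -> ~19x lower odds
    beta = math.log(odds_ratio)     # logistic coefficient implied by the OR
    print(f"{firm}: ~{times_lower:.0f}x lower odds (beta = {beta:.2f})")
```

This also makes explicit why the Firm C row reads as $\sim 100\times$: $1/0.010 = 100$ exactly at the reported precision.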

**Cross-firm hit matrix.** Among Big-4 source signatures whose any-pair rule fires under the inter-CPA candidate-pool counterfactual, the candidate firm of the max-cosine partner is distributed as follows (Script 44):

| Source firm | Firm A candidate | Firm B | Firm C | Firm D | Non-Big-4 | Total hits |
|---|---|---|---|---|---|---|
| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ |
| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ |
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |

For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$).

**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$–$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm.
Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (Appendix B; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder production-related reuse patterns at Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.

This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where the within-firm collision concentration is the dominant pattern at all four Big-4 firms — most strongly at Firm A ($98.8\%$ any-pair, $99.96\%$ same-pair) and at materially lower but still majority levels at Firms B/C/D ($76.7$–$83.7\%$ any-pair; $97.0$–$98.2\%$ same-pair).

### L.5. Alert-rate sensitivity around deployed thresholds (Script 46)

To test whether the deployed cosine threshold $0.95$ and dHash threshold $5$ coincide with a low-gradient (plateau-stable) region of the deployed-rule alert-rate surface — which would be weak distributional evidence that the deployed thresholds are stable operating points — we sweep each threshold across a range and report the per-signature alert rate on actual observed Big-4 same-CPA pools (not inter-CPA-replaced pools), comparing the local gradient at the deployed threshold to the median gradient across the sweep (Script 46).
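
The local-gradient-to-median-gradient diagnostic amounts to finite differences over the swept alert-rate curve. The sketch below illustrates the computation; the threshold and rate values are hypothetical placeholders, not Script 46 output:

```python
def gradient_ratio(thresholds, alert_rates, deployed_idx):
    """Ratio of the absolute finite-difference gradient at the deployed
    threshold to the median absolute gradient across the sweep."""
    grads = [
        abs(alert_rates[i + 1] - alert_rates[i]) / (thresholds[i + 1] - thresholds[i])
        for i in range(len(thresholds) - 1)
    ]
    local = grads[deployed_idx]
    median = sorted(grads)[len(grads) // 2]
    return local / median

# Hypothetical cosine sweep: (threshold, per-signature alert rate) pairs.
ts = [0.90, 0.92, 0.94, 0.95, 0.96, 0.98]
rates = [0.60, 0.59, 0.55, 0.50, 0.44, 0.42]
print(gradient_ratio(ts, rates, deployed_idx=3))  # gradient at 0.95 vs median
```

A ratio well above one, as in the HC findings below, marks a locally sensitive operating point; a ratio well below one, as at the dHash $= 15$ boundary, marks a plateau-like region.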

At the deployed HC operating point cos $> 0.95$ AND dHash $\leq 5$, the local gradient of the per-signature alert rate is substantially larger than the median gradient across the sweep (cosine: ratio $\approx 25\times$ at the $0.95$ point relative to the median; dHash: ratio $\approx 3.8\times$ at the $5$ point relative to the median; both Script 46). Read descriptively, these ratios indicate that the deployed HC threshold is *locally sensitive* rather than plateau-stable: small threshold perturbations materially change the deployed alert rate (the cosine sweep at dHash $\leq 5$ yields rates of $0.5091$ at cos $> 0.945$ vs $0.4789$ at cos $> 0.955$, a $3.0$ pp swing across a $0.01$ cosine perturbation; the dHash sweep at cos $> 0.95$ yields rates of $0.4207$ at dHash $\leq 4$ vs $0.5639$ at dHash $\leq 6$, a $14.3$ pp swing across a single integer step). The local-gradient-to-median-gradient ratios are descriptive diagnostics, not formal plateau tests; the primary evidence for "no within-population bimodal antimode at these thresholds" comes from §III-I.4's composition decomposition, not from §III-L.5.

The MC/HSC boundary at dHash $= 15$, by contrast, *is* in a low-gradient region (ratio $\approx 0.08$ relative to the median); the plateau-like behaviour around dHash $= 15$ is corroborating evidence that the high-end structural threshold lies in a regime where the rule's alert rate is approximately saturated, consistent with the high-dHash tail behaviour expected once near-identical pairs have been exhausted. The §III-L.5 local-sensitivity finding therefore applies specifically to the HC cutoff (cos $= 0.95$, dHash $= 5$); the MC/HSC sub-band boundary at dHash $= 15$ exhibits the opposite, plateau-like behaviour.

We interpret the deployed HC thresholds as **specificity-anchored operating points** chosen for the specificity-vs-alert-yield tradeoff (§III-L.1), *not* as distributional antimodes.
Alternative operating points on the tradeoff curve can be characterised by inverting the per-comparison or pool-normalised ICCR curves (§III-L.1, §III-L.2) at the preferred specificity target.

### L.6. Observed deployed alert rate on actual same-CPA pools

The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the deployed HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$).

The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised inter-CPA rate ($0.4958$ vs $0.1102$); the per-document observed-deployed rate is $\sim 3.5\times$ the pool-normalised inter-CPA D1 (HC) rate ($0.6228$ vs $0.1797$). We refer to this multiplicative gap as the **deployed-rate excess over the inter-CPA proxy**:

- Per-signature: $0.4958 - 0.1102 = 0.3856$ ($38.6$ pp excess)
- Per-document HC: $0.6228 - 0.1797 = 0.4431$ ($44.3$ pp excess)

We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as a *same-CPA repeatability signal* — a quantity that exceeds what random inter-CPA candidate replacement would produce — rather than as an estimate of true replication prevalence.

## M. Validation Strategy and Limitations under Unsupervised Setting

The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated.
The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth.

Each diagnostic reported in this paper therefore addresses one specific failure mode of an unsupervised screening classifier (Table XXVII), with an explicitly disclosed untested assumption:

**Table XXVII.** Diagnostics, failure mode addressed, and disclosed untested assumption.

| Diagnostic | Failure mode addressed | Disclosed untested assumption |
|---|---|---|
| Composition decomposition (§III-I.4; Scripts 39b–39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) |
| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) |
| Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at the per-signature unit, accounting for pool size | Same as above, plus the assumption that pool replacement preserves the negative-anchor property |
| Document-level ICCR (§III-L.3; Script 45) | Operational alarm-rate proxy at the per-document unit under three alarm definitions | Same as above |
| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on the per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check |
| Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within the source firm | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$–$99.96\%$ within-firm at all four firms versus $76.7$–$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 45's mode-of-firms tie-break (§IV-M.4 footnote) |
| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of the deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test |
| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent |
| Pixel-identical conservative positive capture (§III-K.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold |
| LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of the K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity |

No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth.

**Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level.
We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make this assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity on the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, Appendix B) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.

**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates.
Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firms B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.

**Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both the per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and the per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and of missed-replication risk (per false negative) — neither directly observable from corpus data.

## N. Data Source and Firm Anonymization

**Audit-report corpus.** The 90,282 audit-report PDFs analyzed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation.
MOPS is the statutory public-disclosure platform for Taiwan-listed companies; every audit report filed on MOPS is already a publicly accessible regulatory document.
We did not access any non-public auditor work papers, internal firm records, or personally identifying information beyond the certifying CPAs' names and signatures, which are themselves published on the face of the audit report as part of the public regulatory filing.
The CPA registry used to map signatures to CPAs is a publicly available audit-firm tenure registry (Section III-B).

**Firm-level anonymization.** Although all audit reports and CPA identities in the corpus are public, we report firm-level results under the pseudonyms Firm A / B / C / D throughout this paper to avoid naming specific accounting firms in descriptive rate comparisons.
Readers with domain familiarity may still infer Firm A from contextual descriptors (Big-4 status, replication-dominated behavior); we disclose this residual identifiability explicitly and note that none of the paper's conclusions depend on the specific firm's name.

# IV. Experiments and Results

Section IV reports the empirical results that calibrate and characterise the operational classifier of §III-H.1 (calibration developed in §III-L). The primary analyses (§IV-D through §IV-J, and the anchor-based ICCR calibration consolidated in §IV-M) are scoped to the Big-4 sub-corpus (Firms A–D, $n = 437$ CPAs with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. §IV-K reports a full-dataset (686 CPAs) robustness check on the K=3 mixture and per-CPA score-rank convergence; §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F.

## A. Experimental Setup

Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration.
+Feature extraction used PyTorch 2.9 with torchvision model implementations. +The complete pipeline---from raw PDF processing through final classification---was implemented in Python. +Because all steps rely on deterministic forward inference over fixed pre-trained weights (no fine-tuning) plus fixed-seed numerical procedures, reported results are platform-independent to within floating-point precision. + +## B. Signature Detection Performance + +The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). +We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). +However, the subsequent production deployment provides a practical consistency check: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. +The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale. + + + +The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in the primary analyses of §IV-D through §IV-J. + +## C. All-Pairs Intra-vs-Inter Class Distribution Analysis + +Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). 
+This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L). +Table IV summarizes the distributional statistics. + + + +Both distributions are left-skewed and leptokurtic. +Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate. +Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; the subsequent distributional diagnostics in Section IV-D are produced via the methods of Section III-I to avoid single-family distributional assumptions. + +The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). +Under equal prior probabilities and equal misclassification costs, this crossover is a candidate decision boundary between the two classes; we adopt it only as the operational LH/UN boundary in §III-H.1, not as a natural distributional threshold. +Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$). + +We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength. +We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size. +A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count. + +## D. 
Big-4 Accountant-Level Distributional Characterisation + +This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. The accountant-level dip-test rejection reported in Table V is, per §III-I.4, fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration. + +**Table V.** Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32). + +| Population | $n$ CPAs | $p_{\text{cos}}$ | $p_{\text{dHash}}$ | Interpretation | +|---|---|---|---|---| +| **Big-4 pooled (primary)** | 437 | $< 5 \times 10^{-4}$ | $< 5 \times 10^{-4}$ | reject unimodality on both axes | +| Firm A pooled alone | 171 | 0.992 | 0.924 | unimodal | +| Firms B + C + D pooled | 266 | 0.998 | 0.906 | unimodal | +| All non-Firm-A pooled | 515 | 0.998 | 0.907 | unimodal | + +Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by the bootstrap resolution $1 / 2000 = 5 \times 10^{-4}$ (Script 34 reports this as $p = 0.0000$; we report $p < 5 \times 10^{-4}$ to reflect the resolution). Single-firm dip statistics for Firms B, C, and D were not separately computed. + +**Table VI.** Burgstahler-Dichev / McCrary density-smoothness diagnostic on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; $\alpha = 0.05$, two-sided). + +| Population | Cosine: significant transition? | dHash: significant transition? 
| +|---|---|---| +| **Big-4 pooled (primary)** | none ($p > 0.05$) | none ($p > 0.05$) | +| Firm A pooled alone | none | none | +| Firms B + C + D pooled | none | one transition at $\overline{\text{dHash}} = 10.8$ | +| All non-Firm-A pooled | none | one transition at $\overline{\text{dHash}} = 6.6$ | + +The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). These off-Big-4 dHash transitions are scope-dependent and are not used as operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes. + +## E. Big-4 K=2 / K=3 Mixture Fits + +This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings. + +**Table VII.** Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-J) and marginal-crossing bootstrap 95% confidence intervals. + +| K=2 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | +|---|---|---|---| +| K=2-a (low-cos / high-dHash position) | 0.954 | 7.14 | 0.689 | +| K=2-b (high-cos / low-dHash position) | 0.983 | 2.41 | 0.311 | + +Marginal crossings (point + bootstrap 95% CI, $n_{\text{boot}} = 500$): + +| Axis | Point | Bootstrap median | 95% CI | CI half-width | +|---|---|---|---|---| +| cos | 0.9755 | 0.9754 | $[0.9742, 0.9772]$ | 0.0015 | +| dHash | 3.755 | 3.763 | $[3.476, 3.969]$ | 0.246 | + +$\text{BIC}(K{=}2) = -1108.45$ (Script 34). 
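The marginal-crossing points reported above can be located numerically once component parameters are in hand. A minimal sketch of the crossing computation on the cosine marginal, assuming illustrative one-dimensional Gaussian components (the means and weights follow Table VII; the standard deviations are invented for illustration, since the fitted covariances are not reproduced in this section):

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Illustrative K=2 cosine-marginal parameters. Means and weights follow
# Table VII; the standard deviations are ASSUMED for illustration only.
w1, mu1, sd1 = 0.689, 0.954, 0.020   # K=2-a (low-cos component)
w2, mu2, sd2 = 0.311, 0.983, 0.008   # K=2-b (high-cos component)

def density_gap(x):
    # Difference of the weighted component densities; the marginal
    # crossing is the point between the means where this changes sign.
    return w1 * norm.pdf(x, mu1, sd1) - w2 * norm.pdf(x, mu2, sd2)

# Bracket the root between the two component means and solve.
crossing = brentq(density_gap, mu1, mu2)
```

A bootstrap interval of the kind reported above would repeat the fit-then-cross computation on each resample; with these assumed standard deviations the sketch lands near the reported cosine crossing, but the value is illustrative, not a reproduction of Script 34's output.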
+ +**Table VIII.** Big-4 K=3 mixture components (descriptive firm-compositional partition per §III-J; not mechanism clusters). + +| K=3 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position | +|---|---|---|---|---| +| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner | +| C2 | 0.9558 | 6.66 | 0.536 | central region | +| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner | + +$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively. + +## F. Convergent Internal-Consistency Checks + +This section reports the empirical evidence for §III-K's three-score internal-consistency analysis. We re-emphasise the §III-K caveat: the three scores are deterministic functions of the same per-CPA descriptor pair $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ and are *not statistically independent measurements*. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth. + +**Table IX.** Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, $n = 437$. 
+ +| Score pair | Spearman $\rho$ | $p$-value | +|---|---|---| +| K=3 P(C1) vs deployed box-rule less-replication-dominated rate | $+0.9627$ | $< 10^{-248}$ | +| Reverse-anchor cosine percentile vs deployed box-rule less-replication-dominated rate | $+0.8890$ | $< 10^{-149}$ | +| K=3 P(C1) vs Reverse-anchor cosine percentile | $+0.8794$ | $< 10^{-142}$ | + +(Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on $n = 249$ non-Big-4 CPAs; reference centre $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$. + +**Table X.** Per-firm summary across the three feature-derived scores, Big-4. + +| Firm | $n$ CPAs | mean $P(\text{C1})$ | mean reverse-anchor score | mean deployed less-replication-dominated rate | +|---|---|---|---|---| +| Firm A | 171 | 0.0072 | $-0.9726$ | 0.1935 | +| Firm B | 112 | 0.1410 | $-0.8201$ | 0.6962 | +| Firm C | 102 | 0.3110 | $-0.7672$ | 0.7896 | +| Firm D | 52 | 0.2406 | $-0.7125$ | 0.7608 | + +(Source: Script 38 per-firm summary; reverse-anchor score is sign-flipped so that *higher* values indicate deeper into the reference left tail = less replication-dominated relative to the non-Big-4 reference.) + +The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A firms as less replication-dominated. The K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Score 1 and Score 3) place Firm C at the least-replication-dominated end of Big-4; the reverse-anchor cosine percentile (Score 2) ranks Firm D fractionally above Firm C. This residual within-Big-4-non-A disagreement is a design feature of the reverse-anchor metric: Score 2 measures only the marginal cosine percentile under the non-Big-4 reference, so a firm with a slightly higher cosine but a markedly different dHash distribution (Firm D vs Firm C) can score higher on Score 2 while scoring lower on Scores 1 and 3, both of which use both descriptors. 
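The rank-agreement computation behind Table IX is a plain Spearman correlation between per-CPA score vectors. A minimal sketch on synthetic data (all values and score definitions below are invented for illustration; the two toy scores merely mimic the structural point of §III-K that both are deterministic functions of the same $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ pair):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic per-CPA descriptor pairs (437 CPAs, mirroring the Big-4 count).
cos_bar = rng.uniform(0.90, 0.99, size=437)
dhash_bar = rng.uniform(1.0, 12.0, size=437)

# Two toy scores, BOTH deterministic functions of the same descriptor pair:
# a continuous "less-replication-dominated" score and a box-rule indicator.
score_cont = -cos_bar + 0.01 * dhash_bar
score_box = ((cos_bar <= 0.95) | (dhash_bar > 5)).astype(float)

rho, p = spearmanr(score_cont, score_box)
```

A high $\rho$ here is expected by construction, which is exactly the §III-K caveat: agreement among such scores documents internal consistency of the feature-derived ranks, not validation against independent ground truth.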
+ +**Table XI.** Per-signature Cohen $\kappa$ (binary collapse, replication-dominated vs less-replication-dominated), $n = 150{,}442$ Big-4 signatures. + +| Pair | Cohen $\kappa$ | +|---|---| +| deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) vs per-CPA K=3 hard label | 0.662 | +| deployed binary high-confidence box rule vs per-signature K=3 hard label | 0.559 | +| Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 | + +(Source: Script 39.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) retains its prior calibration and capture-rate evidence (Appendix B; cross-referenced in §IV-J). + +## G. Leave-One-Firm-Out Reproducibility + +This section reports the firm-level cross-validation evidence motivating §III-J's "K=3 descriptive, not operational" framing. + +**Table XII.** K=2 leave-one-firm-out across the four Big-4 folds. + +| Held-out firm | $n_{\text{train}}$ | $n_{\text{held}}$ | Fold rule (cos cut, dHash cut) | Held-out classified as templated by fold rule | +|---|---|---|---|---| +| Firm A | 266 | 171 | cos $> 0.9380$ AND dHash $\leq 8.79$ | $171 / 171 = 100.00\%$ ($95\%$ Wilson $[97.80\%, 100.00\%]$) | +| Firm B | 325 | 112 | cos $> 0.9744$ AND dHash $\leq 3.98$ | $0 / 112 = 0\%$ ($95\%$ Wilson $[0\%, 3.32\%]$) | +| Firm C | 335 | 102 | cos $> 0.9752$ AND dHash $\leq 3.75$ | $0 / 102 = 0\%$ ($95\%$ Wilson $[0\%, 3.63\%]$) | +| Firm D | 385 | 52 | cos $> 0.9756$ AND dHash $\leq 3.74$ | $0 / 52 = 0\%$ ($95\%$ Wilson $[0\%, 6.88\%]$) | + +(Source: Script 36.) 
Across-fold cosine crossing: pairwise range $[0.9380, 0.9756]$, range = $0.0376$; max absolute deviation from the across-fold mean is $0.028$. This exceeds the report's $0.005$ across-fold stability tolerance by $5.6\times$ and is much larger than the full-Big-4 bootstrap CI half-width of $0.0015$. Together with the all-or-nothing held-out classification pattern (Firm A held out $\Rightarrow$ all held-out CPAs templated; any non-Firm-A firm held out $\Rightarrow$ none templated), this indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary. + +**Table XIII.** K=3 leave-one-firm-out: C1 component shape and held-out membership. + +| Held-out firm | C1 cos (fit) | C1 dHash (fit) | C1 weight (fit) | Held-out C1 hard-label rate | Full-Big-4 baseline C1% | Absolute difference | +|---|---|---|---|---|---|---| +| Full-Big-4 baseline | 0.9457 | 9.17 | 0.143 | — | — | — | +| Firm A held out | 0.9425 | 10.13 | 0.145 | $4.68\%$ | $0.00\%$ | $4.68$ pp | +| Firm B held out | 0.9441 | 9.16 | 0.127 | $7.14\%$ | $8.93\%$ | $1.76$ pp | +| Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp | +| Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp | + +(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L). + +## H. 
Pixel-Identity Positive-Anchor Miss Rate + +This section reports the only conservative hard-positive subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4). + +**Table XIV.** Positive-anchor miss rate, $n = 262$ Big-4 byte-identical signatures. + +| Classifier | Misclassified as less-replication-dominated | Miss rate | Wilson 95% CI | +|---|---|---|---| +| deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ | +| K=3 per-CPA hard label (C3 = high-cos / low-dHash; descriptive) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ | +| Reverse-anchor (prevalence-calibrated cut) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ | + +(Source: Script 40.) Per-firm breakdown of the byte-identical subset: Firm A 145; Firm B 8; Firm C 107; Firm D 2. All three candidate scores correctly assign every byte-identical signature to the replicated class. + +We caution that for the deployed box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region). The reverse-anchor cut is chosen by *prevalence calibration* against the box rule's overall replicated rate of $49.58\%$ across Big-4 signatures; this is a documented limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation. + +## I. 
Inter-CPA Pair-Level Coincidence Rate + +The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It is the per-pair rate at which two signatures from different CPAs satisfy the deployed rule. We do not label it as a False Acceptance Rate because (a) FAR has a biometric-verification meaning that requires ground-truth negative labels, and (b) the inter-CPA negative-anchor assumption is partially violated by within-firm cross-CPA template-like collision structures (§III-L.4 cross-firm hit matrix). + +A corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs gives a per-comparison rate of $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$) at the cosine cut $0.95$. The Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs) replicates this number, adds the structural dimension (dHash), and adds joint-rule rates; the §III-L.1 numbers are referenced rather than duplicated here, and the consolidated ICCR calibration appears in §IV-M Tables XXI–XXVI. + +## J. Five-Way Per-Signature + Document-Level Classification Output + +This section reports the five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. See §III-H.1 for the five-way category definitions and the cosine and dHash cuts; calibration is in §III-L. + +**Table XV.** Five-way per-signature category counts, Big-4 sub-corpus, $n = 150{,}442$ classified. + +| Category | Long name | $n$ signatures | % of classified | +|---|---|---|---| +| HC | High-confidence non-hand-signed | 74,593 | 49.58% | +| MC | Moderate-confidence non-hand-signed | 39,817 | 26.47% | +| HSC | High style consistency | 314 | 0.21% | +| UN | Uncertain | 35,480 | 23.58% | +| LH | Likely hand-signed | 238 | 0.16% | + +(Source: Script 42; 11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded. 
The $150{,}442$ vs $150{,}453$ distinction — descriptor-complete vs vector-complete — recurs across §IV: descriptor-complete analyses (§IV-D through §IV-J, all using accountant-level aggregates or per-signature category counts derived from the same 150,442-signature substrate) use $n = 150{,}442$; vector- or pair-recomputed analyses (§IV-M.2 Table XXI, §IV-M.3 Table XXII, §IV-M.5 Tables XXIV–XXV; Scripts 40b, 43, 44) use $n = 150{,}453$ because their pair- or pool-level computations load all vector-complete signatures including those failing the descriptor-complete filter. See §III-G for the sample-size reconciliation.) + +**Per-firm five-way breakdown (% within firm).** + +| Firm | HC | MC | HSC | UN | LH | total signatures | +|---|---|---|---|---|---|---| +| Firm A | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 | +| Firm B | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 | +| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 | +| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 | + +(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVI: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVI). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%). + +**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. 
We aggregate signature-level outcomes to document-level labels using the worst-case rule (HC > MC > HSC > UN > LH; §III-L), applied to the Big-4 sub-corpus.
+
+**Table XIX.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs.
+
+| Category | Long name | $n$ documents | % |
+|---|---|---|---|
+| HC | High-confidence non-hand-signed | 46,857 | 62.28% |
+| MC | Moderate-confidence non-hand-signed | 19,667 | 26.14% |
+| HSC | High style consistency | 167 | 0.22% |
+| UN | Uncertain | 8,524 | 11.33% |
+| LH | Likely hand-signed | 18 | 0.02% |
+
+(Source: Script 42 document-level table; 379 of 75,233 PDFs carried signatures from more than one Big-4 firm; these mixed-firm PDFs are pooled into the overall counts here and excluded from the single-firm per-firm breakdown below.)
+
+**Per-firm document-level breakdown (single-firm PDFs only).**
+
+| Firm | HC | MC | HSC | UN | LH | total docs |
+|---|---|---|---|---|---|---|
+| Firm A | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 |
+| Firm B | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 |
+| Firm C | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 |
+| Firm D | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 |
+
+(Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)
+
+The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains its prior calibration and capture-rate evidence (Appendix B); it is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively.
We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B). + +**Table XVI.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus. + +| Firm | $n$ | C1 (low-cos / high-dHash) | C2 (central) | C3 (high-cos / low-dHash) | C1 % | C3 % | +|---|---|---|---|---|---|---| +| Firm A | 171 | 0 | 30 | 141 | $0.00\%$ | $82.46\%$ | +| Firm B | 112 | 10 | 102 | 0 | $8.93\%$ | $0.00\%$ | +| Firm C | 102 | 24 | 77 | 1 | $23.53\%$ | $0.98\%$ | +| Firm D | 52 | 6 | 45 | 1 | $11.54\%$ | $1.92\%$ | + +(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there). + +**Document-level worst-case aggregation outputs are reported in Table XIX above.** + +## K. Full-Dataset Robustness (light scope) + +This section reports the reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + deployed operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the K=3 + deployed-rule convergence reproduces at the wider scope. 
The §III-H.1 five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band retains its prior calibration (Appendix B; §IV-J). + +**Table XVII.** K=3 component comparison, Big-4 sub-corpus vs full dataset. + +| K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full | +|---|---|---|---| +| C1 (low-cos / high-dHash) | 0.9457 / 9.17 / 0.143 | 0.9278 / 11.17 / 0.284 | $\lvert\Delta\rvert$ cos 0.018, dHash 1.99, wt 0.141 | +| C2 (central) | 0.9558 / 6.66 / 0.536 | 0.9535 / 6.99 / 0.512 | $\lvert\Delta\rvert$ cos 0.002, dHash 0.33, wt 0.024 | +| C3 (high-cos / low-dHash) | 0.9826 / 2.41 / 0.321 | 0.9826 / 2.40 / 0.205 | $\lvert\Delta\rvert$ cos 0.000, dHash 0.01, wt 0.117 | + +(Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.) + +**Table XVIII.** Spearman rank correlation between K=3 P(C1) and deployed operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset. + +| Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs deployed less-replication-dominated rate) | $p$-value | +|---|---|---|---| +| Big-4 (primary) | 437 | $+0.9627$ | $< 10^{-248}$ | +| Full dataset | 686 | $+0.9558$ | $< 10^{-300}$ | +| $\lvert\rho_{\text{full}} - \rho_{\text{Big-4}}\rvert$ | — | $0.0069$ | — | + +(Source: Script 41.) + +**Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the deployed box-rule less-replication-dominated rate are preserved at the full scope. 
Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more non-templated CPAs (mid/small firms); C1 (low-cos / high-dHash) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population adds mid/small-firm CPAs that land toward the low-cos / high-dHash region, a group the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + deployed-rule convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the primary methodology is restricted to Big-4 by design (§III-G item 4).
+
+## L. Ablation Study: Feature Backbone Comparison
+
+To support the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
+All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
+The comparison summary is reported in Appendix B (the backbone-ablation table; distinct from Table XVIII, which reports the Big-4 vs full-dataset Spearman comparison in §IV-K).
+
+
+
+EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
+However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), i.e., a wider descriptor dispersion per signature.
+VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task. + +ResNet-50 provides the best overall balance: +(1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707; +(2) its tighter distributions yield more stable descriptor behaviour at the per-signature level; +(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that Firm A replication-dominated signatures are least likely to produce low-similarity outlier pairs under this backbone; and +(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures. + +## M. Anchor-Based ICCR Calibration Results + +This section consolidates the empirical results that support the §III-L anchor-based threshold calibration framework. + +### M.1 Composition decomposition (Scripts 39b–39e) + +**Table XX.** Within-firm and between-firm decomposition of the Big-4 accountant-level dip-test rejection. 
+ +| Diagnostic | Scope | Statistic | Implication | +|---|---|---|---| +| Within-firm signature-level cosine dip | Big-4 (4 firms) | $p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ | 0/4 firms reject; cosine within-firm unimodal | +| Within-firm signature-level cosine dip | non-Big-4 (10 firms $\geq 500$ sigs) | $p_{\text{cos}} \in [0.59, 0.99]$ | 0/10 firms reject; cosine within-firm unimodal | +| Within-firm jittered-dHash dip (5 seeds, median) | Big-4 (4 firms) | $p_{\text{med}} \in \{0.999, 0.996, 0.999, 0.9995\}$ | 0/4 firms reject after integer-jitter; raw rejection was integer-tie artefact | +| Big-4 pooled dHash: 2×2 factorial | firm-centred + jittered (5 seeds) | $p_{\text{med}} = 0.35$, 0/5 seeds reject | combined corrections eliminate rejection; multimodality is composition + integer artefact | +| Integer-histogram valley near $\text{dHash} \approx 5$ | within each Big-4 firm | none (0/4 firms) | no within-firm dHash antimode at the deployed HC cutoff | + +(Source: Scripts 39b, 39c, 39d, 39e; bootstrap $n_{\text{boot}} = 2000$; jitter $\sim \mathrm{U}[-0.5, +0.5]$.) + +### M.2 Anchor-based inter-CPA pair-level ICCR (Script 40b) + +**Table XXI.** Big-4 inter-CPA per-comparison ICCR sweep, $n = 5 \times 10^5$ pairs (Big-4 scope). 
+ +| Threshold | Per-comparison ICCR | 95% Wilson CI | +|---|---|---| +| cos $> 0.945$ (prior published operating point) | $0.00081$ | $[0.00073, 0.00089]$ | +| cos $> 0.95$ (deployed operating point) | $0.00060$ | $[0.00053, 0.00067]$ | +| cos $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ | +| cos $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ | +| dHash $\leq 5$ (deployed operating point) | $0.00129$ | $[0.00120, 0.00140]$ | +| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ | +| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ | +| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | $[0.00011, 0.00018]$ | +| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ | + +Conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$ (Wilson 95% $[0.190, 0.285]$; $70$ of $299$ pairs). + +The cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I (per-comparison rate $0.0005$). The dHash row and joint row are reported here for the first time on this corpus. + +### M.3 Pool-normalised per-signature ICCR (Script 43) + +**Table XXII.** Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos $> 0.95$ AND dHash $\leq 5$); $n_{\text{sig}} = 150{,}453$ (vector-complete Big-4); CPA-block bootstrap $n_{\text{boot}} = 1000$. 
+ +| Scope | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI | +|---|---|---|---| +| Big-4 pooled (any-pair, deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ | +| Big-4 pooled (same-pair, stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ | +| Firm A (any-pair) | $0.2594$ | — | — | +| Firm B (any-pair) | $0.0147$ | — | — | +| Firm C (any-pair) | $0.0053$ | — | — | +| Firm D (any-pair) | $0.0110$ | — | — | +| Pool-size decile 1 (smallest pools) any-pair | $0.0249$ | — | — | +| Pool-size decile 10 (largest pools) any-pair | $0.1905$ | — | — | + +Decile trend is broadly monotone in pool size with two minor reversals (decile 5 and decile 9 dip below their predecessors). Stricter operating point cos $> 0.95$ AND dHash $\leq 3$ (same-pair) gives per-signature ICCR $0.0449$. + +### M.4 Document-level ICCR under three alarm definitions (Script 45) + +**Table XXIII.** Document-level inter-CPA ICCR by alarm definition; $n_{\text{docs}} = 75{,}233$. + +| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI | +|---|---|---|---| +| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ | +| D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ | +| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ | + +Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XIX's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs: all $379$ are $1{:}1$ Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XIX's per-firm rows sum to $74{,}854 = 75{,}233 - 379$. 
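The Wilson intervals quoted throughout §IV follow the standard binomial score-interval formula. A minimal sketch, checked against the joint-rule row of Table XXI ($70$ joint hits among the $5 \times 10^5$ sampled inter-CPA pairs):

```python
import math

def wilson_ci(k, n, z=1.96):
    """Two-sided Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Joint-rule row of Table XXI: cos > 0.95 AND dHash <= 5, any-pair.
lo, hi = wilson_ci(70, 500_000)   # -> (0.00011..., 0.00017...)
```

Rounded to the table's precision this reproduces the reported $[0.00011, 0.00018]$; the same helper yields the zero-numerator bound of Table XIV, $[0\%, 1.45\%]$, with $k = 0$, $n = 262$.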
+ +### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44) + +**Table XXIV.** Logistic regression of per-signature any-pair HC hit indicator on firm dummies and centred log pool size (Firm A reference). + +| Term | Odds ratio (vs Firm A) | Direction | +|---|---|---| +| Firm B | $0.053$ | $\sim 19\times$ lower odds than Firm A | +| Firm C | $0.010$ | $\sim 100\times$ lower odds than Firm A | +| Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A | +| log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size | + +Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$ while Firm A ranges $0.0541$–$0.5958$. The firm gap survives within matched pool sizes. + +**Table XXV.** Cross-firm hit matrix among Big-4 source signatures with any-pair HC hit; max-cosine partner firm (counts). + +| Source firm | Firm A cand. | Firm B | Firm C | Firm D | non-Big-4 | n hits | +|---|---|---|---|---|---|---| +| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ | +| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ | +| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ | +| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ | + +Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively. + +### M.6 Alert-rate sensitivity around deployed HC threshold (Script 46) + +**Table XXVI.** Local-gradient / median-gradient ratio at deployed thresholds (descriptive plateau diagnostic). 
+ +| Threshold | Local / median gradient ratio | Interpretation | +|---|---|---| +| cos $= 0.95$ (HC) | $\approx 25\times$ | locally sensitive (not plateau-stable) | +| dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) | +| dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) | + +Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ ($38.6$ pp) per-signature and $0.4431$ ($44.3$ pp) per-document; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate. + + +# V. Discussion + +## A. Non-Hand-Signing Detection as a Distinct Problem + +Non-hand-signing differs from forgery in that the questioned signature is produced by its legitimate signer's own stored image rather than by an impostor. The detection problem is therefore framed around *intra-signer image reproduction* rather than *inter-signer imitation*. This framing has analytical consequences. The within-CPA signature distribution is the analytical population of interest; the cross-CPA inter-class distribution is a *reference* against which intra-CPA similarity is interpreted, not the population to be modelled. This contrasts with most prior offline signature verification work, which treats genuine-versus-forged as the central two-class problem. + +## B. Per-Signature Similarity is a Continuous Quality Spectrum; the Accountant-Level Multimodality is Composition-Driven + +The Big-4 accountant-level descriptor distribution rejects unimodality on both marginals at $p < 5 \times 10^{-4}$ (§IV-D Table V). 
The composition decomposition of §III-I.4 shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within the Big-4 firms, the descriptor marginals at the signature level are unimodal once integer ties are broken (Scripts 39b, 39d); eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (Script 39c) but are not used as calibration evidence (§III-I.4). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes. + +## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor) + +Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. 
Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (Appendix B) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years. + +We treat Firm A as a *templated-end case study within the Big-4 sub-corpus* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). Firm A's high per-document HC$+$MC alarm rate of $0.62$ (versus Firms B/C/D's $0.09$–$0.16$) reflects high inter-CPA collision concentration under the deployed rule on real same-CPA pools, consistent with firm-specific template, stamp, or document-production reuse — though the inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the within-firm collision pattern at all four Big-4 firms is consistent with similar, milder production-related reuse patterns at Firms B/C/D. + +## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions + +Leave-one-firm-out cross-validation of the Big-4 mixture fit reveals a sharp contrast between K=2 and K=3 behaviour. 
K=2 is unstable: across-fold cosine-crossing deviation is $0.028$, and holding Firm A out gives a fold rule (cos $> 0.938$, dHash $\leq 8.79$) that classifies $100\%$ of held-out Firm A in the upper component, while holding any non-Firm-A Big-4 firm out gives a fold rule near (cos $> 0.975$, dHash $\leq 3.76$) that classifies $0\%$ of the held-out firm in the upper component. The K=2 boundary is essentially a Firm-A-vs-others separator — direct evidence that the K=2 partition reflects firm-compositional rather than mechanistic structure. + +K=3 in contrast has a *reproducible component shape* at the descriptor-position level: across the four folds the C1 (low-cos / high-dHash) component cosine mean varies by at most $0.005$, the dHash mean by at most $0.96$, and the weight by at most $0.023$. Hard-posterior membership for the held-out firm is composition-sensitive (absolute differences $1.8$–$12.8$ pp across folds). Together with the §III-I.4 composition decomposition (no within-population bimodal antimode), the K=3 stability supports a descriptive reading: the Big-4 descriptor plane has a reproducible three-region partition that reflects how firm-compositional weight is distributed across the descriptor space, *not* a three-mechanism latent-class structure. We accordingly do not use K=3 hard-posterior membership as an operational classifier; we use it as the accountant-level descriptive summary that complements the deployed signature-level five-way classifier of §III-H.1. + +## E. Three-Score Convergent Internal-Consistency + +Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score, not a mechanism cluster posterior); the reverse-anchor cosine percentile under a non-Big-4 reference distribution; and the deployed box-rule less-replication-dominated rate. 
The three scores are *not* statistically independent measurements — they are deterministic functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency rather than external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The strength of the convergence (all pairwise $|\rho| > 0.87$) and its persistence at the signature level (Cohen $\kappa = 0.87$ between per-CPA-fit and per-signature-fit K=3 binary labels) are nevertheless informative: per-CPA aggregation does not collapse the broad three-region ordering, and three different summarisations of the descriptor space produce broadly concordant per-CPA rankings, with a residual non-Firm-A disagreement (the reverse-anchor cosine percentile ranks Firm D fractionally above Firm C, while the mixture posterior and the deployed box-rule rate rank Firm C highest among non-Firm-A firms). + +## F. Anchor-Based Multi-Level Calibration + +The operational specificity of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR matches a prior published per-comparison rate (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. 
The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$. + +Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is $0.234$ (Wilson 95% $[0.190, 0.285]$): given the cosine gate, the structural dimension provides further per-comparison specificity at $\sim 4.3\times$ refinement. Second, the alert-rate sensitivity analysis (§III-L.5) shows the deployed HC threshold is locally sensitive rather than plateau-stable (local gradient $\approx 25\times$ the median for cosine, $\approx 3.8\times$ for dHash); alternative operating points can be characterised by inverting the ICCR curves (e.g., a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint corresponds to per-signature ICCR $\approx 0.045$). The MC/HSC sub-band boundary at dHash$=15$, by contrast, *is* plateau-like (local-to-median ratio $\approx 0.08$), consistent with high-dHash-tail saturation. + +## G. Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor + +The only conservative hard-positive subset in the corpus is pixel-identical signatures: those whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce byte-identical images, so these signatures are a conservative hard-positive subset for image replication. On the Big-4 subset ($n = 262$ pixel-identical signatures), all three candidate checks — the deployed box rule, the K=3 hard label, and the reverse-anchor metric with a prevalence-calibrated cut — achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). We caution that this result is necessary but not sufficient: for the deployed box rule it is close to tautological, because byte-identical neighbours have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region. 
The corresponding signature-level *negative* anchor evidence is developed in §III-L.1 above (per-comparison ICCR $= 0.00060$ at cos$>0.95$, consistent with the prior published rate of $0.0005$). We frame the per-comparison rate as a specificity proxy under the assumption that inter-CPA pairs constitute a clean negative anchor, and we document in §III-L.4 that this assumption is partially violated by within-firm cross-CPA template-like collision structures. + +## H. Limitations + +Several limitations should be stated transparently. We group them into primary methodological limitations, secondary scope and validation caveats, documented design features, and engineering-level caveats of the pipeline. + +**Primary methodological limitations.** + +*No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M). + +*Inter-CPA negative-anchor assumption is partially violated and the violation is firm-dependent.* The cross-firm hit matrix of §III-L.4 shows that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches.
Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs. Because the violation is firm-dependent, Firm A's per-firm ICCR is more contaminated by within-firm sharing than Firms B/C/D's; the per-firm B/C/D rates of $0.09$–$0.16$ may therefore be less contaminated than the pooled rate, and the Firm A vs Firms B/C/D contrast reflects both genuine firm heterogeneity and a firm-dependent proxy-contamination gradient. + +*Scope.* The primary analyses are scoped to the Big-4 sub-corpus. We did not perform the full per-signature pool-normalised ICCR analysis at the full $n = 686$ scope; the §IV-K full-dataset Spearman re-run shows the K=3 $+$ deployed box-rule rank-convergence is preserved at $n = 686$ but does not establish portability of the Big-4 operational ICCRs, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope. + +**Secondary scope and validation caveats.** + +*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the deployed box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures). 
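For reference, the Wilson 95% intervals quoted for the anchor rates above (e.g., the conditional ICCR's $[0.190, 0.285]$ on $70/299$ pairs, or the $1.45\%$ upper bound on the $0/262$ positive-anchor miss rate) follow the standard score interval. A minimal self-contained sketch, independent of the project's scripts:

```python
import math

def wilson_ci(k, n, z=1.959964):
    """Wilson score interval for a binomial proportion k/n at ~95%."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

print(wilson_ci(70, 299))  # conditional ICCR row: ~(0.190, 0.285)
print(wilson_ci(0, 262))   # zero-miss anchor: upper bound ~0.0145 (1.45%)
```

Unlike the normal-approximation interval, the Wilson interval stays informative at $k = 0$, which is why the zero-miss anchor result carries a non-trivial upper bound.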
+ +*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain their prior calibration and capture-rate evidence (Appendix B); the anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery. + +*Deployed-rate excess is not a presumed true-positive rate.* The $\sim 44$-pp per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the inter-CPA proxy rate (HC: $0.18$) cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; within-firm template sharing inflates the inter-CPA proxy baseline). The gap is best read as a same-CPA repeatability signal. + +*A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime. + +**Documented design features.** + +*K=3 hard-posterior membership is composition-sensitive.* The K=3 hard-posterior membership for any single firm varies by up to $12.8$ pp across LOOO folds. 
This is documented as a composition-sensitivity band rather than as a failure, but it means K=3 hard labels are not used as operational classifier output; they are reported only as accountant-level descriptive characterisation. + +*No partner-level mechanism attribution.* The analysis reports population-level patterns; it does not perform partner-level mechanism attribution or make report-level claims of intent. Signature-level outputs remain signature-level quantities throughout and are not aggregated into partner-level attributions. The within-firm cross-CPA collision concentration of §III-L.4 is consistent with template-like reuse but is not by itself diagnostic of deliberate sharing. + +**Engineering-level caveats of the pipeline.** + +*Transferred ImageNet features.* The ResNet-50 feature extractor uses pre-trained ImageNet weights without signature-domain fine-tuning. While our backbone-ablation study (§IV-L) and prior literature support the effectiveness of transferred ImageNet features for signature comparison, a signature-domain fine-tuned feature extractor could improve discriminative performance. + +*Red-stamp HSV preprocessing artefacts.* The red-stamp removal preprocessing uses simple HSV colour-space filtering, which may introduce artefacts where handwritten strokes overlap with red seal impressions. Blended pixels are replaced with white, potentially creating small gaps in signature strokes that could reduce dHash similarity. This bias would push classifications toward false negatives rather than false positives. + +*Longitudinal scan / PDF / compression confounds.* Scanning equipment, PDF generation software, and compression algorithms may have changed over the 2013–2023 study period, potentially affecting similarity measurements. While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.
+ +*Source-exemplar misattribution in max/min pair logic.* The max-cosine / min-dHash detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case where one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as a stamping or e-signature template, the pair correctly identifies image reuse but misattributes non-hand-signed status to the source exemplar. This affects at most one source document per template variant per CPA and is not expected to be common. + +*Legal and regulatory interpretation.* Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a jurisdiction-specific legal question. Our technical analysis can inform such determinations but cannot resolve them. + + +# VI. Conclusion and Future Work + +We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature classifier with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population. 
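As a concrete illustration of the structural descriptor named above, the dHash of [27] compares each pixel of a small grayscale grid to its right-hand neighbour and counts bit disagreements between two signatures. The sketch below assumes the image has already been cropped, normalised, and resized to an 8×9 intensity grid; it illustrates the descriptor, not the pipeline's actual implementation.

```python
def dhash(gray, hash_size=8):
    """Difference hash: one bit per horizontally adjacent pixel pair,
    set when the left pixel is brighter than its right neighbour.
    `gray` is a hash_size x (hash_size + 1) grid of intensities."""
    return [int(row[c] > row[c + 1])
            for row in gray
            for c in range(hash_size)]

def hamming(a, b):
    """dHash distance = number of differing bits (0 .. hash_size**2)."""
    return sum(x != y for x, y in zip(a, b))

# Byte-identical reuse yields distance 0, deep inside the HC band (<= 5).
grid = [[(r * 9 + c) % 17 for c in range(9)] for r in range(8)]
print(hamming(dhash(grid), dhash(grid)))  # 0
```

Because each bit encodes only a local brightness gradient, the distance is insensitive to global brightness and compression noise, which is what makes the low-dHash band informative for image-replication screening.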
+ +Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (§III-M Table XXVII), positioning the system as an anchor-calibrated screening 
framework with human-in-the-loop review rather than as a validated forensic detector. + +Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work. + + +# References + + + +[1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067 + +[2] S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," *Res. Account. Regul.*, vol. 25, no. 2, pp. 230–235, 2013. + +[3] J. 
Bromley et al., "Signature verification using a Siamese time delay neural network," in *Proc. NeurIPS*, 1993. + +[4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017. + +[5] H.-H. Kao and C.-Y. Wen, "An offline signature verification and forgery detection method based on a single known sample and an explainable deep learning approach," *Appl. Sci.*, vol. 10, no. 11, p. 3716, 2020. + +[6] H. Li et al., "TransOSV: Offline signature verification with transformers," *Pattern Recognit.*, vol. 145, p. 109882, 2024. + +[7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," *Mathematics*, vol. 12, no. 17, p. 2757, 2024. + +[8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024. + +[9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021. + +[10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," in *Proc. Electronic Imaging*, 2016. + +[11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," *Multimedia Tools Appl.*, 2024. + +[12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," *Inf. Process. Manage.*, p. 104086, 2025. + +[13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," in *Proc. CVPR*, 2022. + +[14] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," *Pattern Recognit.*, vol. 70, pp. 163–176, 2017. + +[15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," *IEEE Trans. Inf. 
Forensics Security*, vol. 19, pp. 1342–1356, 2024. + +[16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," *IEEE Trans. Inf. Forensics Security*, vol. 15, pp. 1735–1745, 2020. + +[17] H. Farid, "Image forgery detection," *IEEE Signal Process. Mag.*, vol. 26, no. 2, pp. 16–25, 2009. + +[18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," *Pattern Recognit.*, vol. 144, art. no. 109778, 2023. + +[19] J. Luo et al., "A survey of perceptual hashing for multimedia," *ACM Trans. Multimedia Comput. Commun. Appl.*, vol. 21, no. 7, 2025. + +[20] D. Engin et al., "Offline signature verification on real-world documents," in *Proc. CVPRW*, 2020. + +[21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," *Expert Syst. Appl.*, vol. 189, art. 116136, 2022. + +[22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification across multilingual datasets," *Procedia Comput. Sci.*, vol. 270, pp. 4024–4033, 2025. + +[23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in *Proc. ECCV*, 2014, pp. 584–599. + +[24] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv:2502.13923, 2025. [Online]. Available: https://arxiv.org/abs/2502.13923 + +[25] Ultralytics, "YOLO11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/models/yolo11/ + +[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. CVPR*, 2016. + +[27] N. 
Krawetz, "Kind of like that," The Hacker Factor Blog, 2013. [Online]. Available: https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html + +[28] B. W. Silverman, *Density Estimation for Statistics and Data Analysis*. London: Chapman & Hall, 1986. + +[29] J. Cohen, *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988. + +[30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," *IEEE Trans. Image Process.*, vol. 13, no. 4, pp. 600–612, 2004. + +[31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," *The Accounting Review*, vol. 88, no. 5, pp. 1511–1546, 2013. + +[32] A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," *Int. J. Auditing*, vol. 18, no. 3, pp. 172–192, 2014. + +[33] W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," *Contemp. Account. Res.*, vol. 26, no. 2, pp. 359–391, 2009. + +[34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in *Proc. CVPR*, 2016, pp. 779–788. + +[35] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 46, no. 8, pp. 5625–5644, 2024. + +[36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," *Ann. Math. Statist.*, vol. 18, no. 1, pp. 50–60, 1947. + +[37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," *Ann. Statist.*, vol. 13, no. 1, pp. 70–84, 1985. + +[38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," *J. Account. Econ.*, vol. 24, no. 1, pp. 
99–126, 1997. + +[39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," *J. Econometrics*, vol. 142, no. 2, pp. 698–714, 2008. + +[40] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," *J. R. Statist. Soc. B*, vol. 39, no. 1, pp. 1–38, 1977. + +[41] H. White, "Maximum likelihood estimation of misspecified models," *Econometrica*, vol. 50, no. 1, pp. 1–25, 1982. + +[42] M. Stone, "Cross-validatory choice and assessment of statistical predictions," *J. R. Statist. Soc. B*, vol. 36, no. 2, pp. 111–147, 1974. + +[43] S. Geisser, "The predictive sample reuse method with applications," *J. Amer. Statist. Assoc.*, vol. 70, no. 350, pp. 320–328, 1975. + +[44] A. Vehtari, A. Gelman, and J. Gabry, "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC," *Stat. Comput.*, vol. 27, no. 5, pp. 1413–1432, 2017. + + + + +# Appendix A. BD/McCrary Bin-Width Sensitivity (Signature Level) + +The main text (Section III-I, Section IV-D.2) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as a threshold estimator. +This appendix documents the empirical basis for that framing by sweeping the bin width within four variant panels: Firm A and full-sample, each in the cosine and $\text{dHash}_\text{indep}$ direction. + + + +Two patterns are visible in Table A.I. +First, the procedure consistently identifies a "transition" under every bin width, but the *location* of that transition drifts with bin width instead of staying fixed (Firm A cosine: 0.987 → 0.985 → 0.980 → 0.975 as bin width grows from 0.003 to 0.015; full-sample dHash: 2 → 10 → 9 as the bin width grows from 1 to 3).
+The $Z$ statistics also inflate superlinearly with the bin width (Firm A cosine $|Z|$ rises from $\sim 9$ at bin 0.003 to $\sim 106$ at bin 0.015) because wider bins aggregate more mass per bin and therefore shrink the per-bin standard error on a very large sample.
+Both features are characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity.
+
+Second, the candidate transitions all locate *inside* the high-similarity region (cosine $\geq 0.975$, dHash $\leq 10$) rather than at a between-mode boundary, where a clean within-population antimode would be expected to sit.
+
+Taken together, Table A.I shows that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the non-hand-signed mode rather than between modes.
+This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator and reinforces the joint reading of Section IV-D that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold.
+
+Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials.
+
+# Appendix B. Reproducibility Materials
+
+The full table-to-script provenance mapping, script source code, and report artefacts for every numerical table and figure in this paper are provided in the supplementary materials. Scripts run deterministically under fixed random seeds documented there; reviewer reproduction should re-emit artefacts from the listed scripts rather than rely on any local path layout.
+
+
+# Declarations
+
+**Conflict of interest.** The authors declare no conflict of interest with Firm A, Firm B, Firm C, or Firm D, or with any other entity referenced in this work.
+ +**Data availability.** All audit reports analysed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, a publicly accessible regulatory disclosure platform. The CPA registry used to map signatures to certifying CPAs is publicly available. Signature images, model weights, and reproducibility scripts are available in the supplementary materials. + +**Funding.** [To be filled in before submission.] + + + + +# Impact Statement (archived; not in IEEE Access submission) + +Auditor signatures on financial reports are a key safeguard of corporate accountability. +When the signature on an audit report is produced by reproducing a stored image instead of by the partner's own hand---whether through an administrative stamping workflow or a firm-level electronic signing system---this safeguard is weakened, yet detecting the practice through manual inspection is infeasible at the scale of modern financial markets. +We developed a pipeline that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan. +Combining deep-learning visual features with perceptual hashing, distributional diagnostics, and anchor-based inter-CPA coincidence-rate calibration, the system stratifies signatures into a five-way confidence-graded classification and quantifies how the practice varies across firms and over time. +With a future labelled evaluation set, the technology could support financial regulators in screening candidate non-hand-signed signatures at national scale. 
diff --git a/paper/review_handoff_abstract_intro_20260515.md b/paper/review_handoff_abstract_intro_20260515.md new file mode 100644 index 0000000..ed5e5b8 --- /dev/null +++ b/paper/review_handoff_abstract_intro_20260515.md @@ -0,0 +1,361 @@ +# Review Handoff: Abstract and Introduction + +Date: 2026-05-15 +Target manuscript: `paper/paper_a_v4_combined.md` +Scope reviewed: Abstract and Introduction only + +## Overall Assessment + +The Abstract and Introduction are substantively strong and defensible. The current argument is clear: + +- Regulations require CPA attestation, but digitized PDF workflows make stored-signature reuse operationally easy. +- The problem is not signature forgery; identity is not in dispute. The target is detecting possible image-level reproduction by the legitimate signer or firm workflow. +- The paper avoids claiming validated forensic detection and instead frames the system as an anchor-calibrated screening framework under unsupervised constraints. +- The strongest methodological move is replacing unsupported distributional "natural threshold" logic with anchor-based inter-CPA coincidence-rate (ICCR) calibration. + +Recommended disposition: Minor Revision for prose and narrative complexity, not for core empirical weakness. + +## Main Reviewer Concern + +The Introduction currently explains the methodology shift too explicitly as a research-process or version-history pivot. This is useful internally, but in the submitted paper it may increase complexity and invite reviewers to focus on why earlier versions used a different framing. + +The final manuscript should explain the final methodological choice, not the internal research journey. + +Keep: + +- The descriptor distribution does not support a stable within-population bimodal antimode. +- Apparent multimodality is explained by firm composition and integer mass-point artefacts. +- Mixture fits are descriptive, not threshold-generating. 
+- Operational rules are characterized using anchor-based ICCR at multiple units. + +Reduce or remove: + +- "Earlier work in this lineage..." +- "v4.0 contribution..." +- "overturns this reading..." +- "inherited Paper A v3.x..." +- Internal script-heavy provenance in the Introduction. + +Detailed provenance belongs in Methodology, Results, Appendix, or reproducibility notes, not in the opening narrative. + +## Suggested Rewrite Direction for Introduction Pivot Paragraph + +Current issue location: around `paper/paper_a_v4_combined.md`, Introduction paragraph beginning with "The methodological reframing relative to earlier versions..." + +Recommended replacement direction: + +```text +A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location shifts and integer mass-point artefacts on the dHash axis. After firm-mean centring and integer-tie jitter, the pooled dHash dip-test rejection disappears. Within-firm diagnostics likewise do not reveal a stable bimodal antimode. We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors. +``` + +This preserves the methodological defense while removing the internal v3-to-v4 story. + +## Abstract-Specific Comments + +The Abstract is strong but very dense. It is currently optimized for technical reviewers rather than broad readability. That may be acceptable for IEEE Access, but the first sentence has a small grammar/style issue. 
+ +Suggested edit: + +```text +Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports -- through administrative stamping or firm-level electronic signing -- thereby undermining individualized attestation. +``` + +Reason: + +- Current wording: "digitization makes reusing ... undermining ..." is grammatically awkward. +- The suggested version makes the causal relation explicit. + +No need to remove the final limitation sentence. The sentence "not as a validated forensic detector; no calibrated error rates..." is important and should remain. + +## Introduction-Specific Comments + +### 1. Keep the legal framing but avoid legal overclaiming + +The sentence saying non-hand-signed workflows "may fall within the literal statutory requirement" is acceptable because it is cautious. Do not strengthen it into a legal conclusion. + +Preferred style: + +- "may fall within" +- "raises substantive concerns" +- "may not represent meaningful individual attestation" + +Avoid: + +- "violates" +- "illegal" +- "non-compliant" +- "fraudulent" + +### 2. Preserve the forgery distinction + +The distinction between non-hand-signing detection and signature forgery detection is one of the strongest conceptual contributions. Keep it prominent. + +Key idea to preserve: + +- Forgery detection asks whether the signer is genuine. +- This paper asks whether the signing act was repeated for each document or a stored image was reused. + +### 3. Reduce script/provenance detail in the Introduction + +Current paragraph references scripts such as Script 39c and Script 39d. This makes the Introduction read like an internal review memo. + +Recommendation: + +- Remove or simplify script references from Introduction. +- Keep exact script provenance in Methodology, Results, Appendix B, or supplementary material. 
+ +Specific risk: + +- The current parenthetical "10 firms tested in Script 39c" is imprecise for jittered-dHash. Script 39c raw dHash tests reject unimodality; the non-Big-4 jittered-dHash no-rejection statement depends on a codex-verified read-only spike on the same substrate. + +Safer Introduction wording: + +```text +Within-firm diagnostics likewise fail to reveal stable bimodal structure after accounting for integer ties, including in eligible mid/small-firm checks. +``` + +If provenance must remain: + +```text +Within-firm signature-level cosine checks fail to reject in eligible firms, and corresponding jittered-dHash checks fail to reject in Big-4 firms and in a read-only spike on the same mid/small-firm substrate. +``` + +### 4. Avoid presenting the Introduction as a Results section + +The Introduction currently contains many detailed numbers. Some are necessary because the paper is methodological, but the v4 pivot paragraphs are numerically heavy. + +Keep headline numbers: + +- Dataset size: 90,282 reports, 182,328 signatures, 758 CPAs. +- Big-4 scope: 437 CPAs, 150,442 signatures. +- Key ICCR levels: per-comparison, per-signature, per-document. +- Firm heterogeneity: Firm A 0.62 vs Firms B/C/D 0.09-0.16. + +Consider moving or reducing: + +- Full script-specific details. +- Too many parenthetical rule semantics in the Introduction. +- Repeated mentions of inherited/v3/v4 framing. + +## Recommended Minimum Patch List + +1. Fix Abstract first sentence grammar: + +```text +digitization makes it feasible to reuse... +``` + +2. Rewrite the Introduction paragraph that begins with "The methodological reframing relative to earlier versions..." so it describes the final methodological rationale rather than v3-to-v4 revision history. + +3. Remove or narrow `Script 39c` provenance in the Introduction because the raw vs jittered dHash distinction is subtle and currently risky. + +4. 
Replace internal-version language across the Introduction: + +- Replace "v4.0 adopts..." with "We adopt..." +- Replace "Earlier work in this lineage..." with "A distributional-threshold approach would be inappropriate here because..." +- Replace "inherited Paper A v3.x five-way box rule" with "the deployed five-way box rule" unless historical provenance is essential. + +5. Preserve limitation language: + +- The paper should continue to say it is not a validated forensic detector. +- The paper should continue to say calibrated error rates cannot be reported without signature-level ground truth. + +## Reviewer Bottom Line + +The paper should not hide that the distributional threshold path failed; that is actually a methodological strength. But it should present this as a final empirical finding and design rationale, not as a visible research-history correction. + +Recommended framing: + +```text +Because the observed distribution does not provide a defensible natural threshold, we use ICCR calibration to characterize the deployed operating rules under explicit unsupervised assumptions. +``` + +This is cleaner, less complex, and more reviewer-facing than the current v3-to-v4 narrative. + +## Additional Framing Issue: Are We Giving Thresholds or Not? + +A likely reviewer confusion point is whether the paper provides a concrete classifier threshold or merely explains why no defensible threshold can be derived. + +The intended answer should be explicit: + +- The paper does provide a concrete, reproducible operational classifier. +- The paper does not claim that this classifier is ground-truth-optimal. +- The paper does not claim that the operating thresholds are natural antimodes in the descriptor distribution. +- The paper's calibration contribution is to characterize the deployed rule's inter-CPA coincidence behavior under unsupervised assumptions. + +Recommended high-level framing: + +```text +We use a fixed, pre-specified five-way operating rule. 
The present calibration does not derive an optimal threshold; instead, it quantifies the rule's inter-CPA coincidence behavior at per-comparison, per-signature, and per-document units under explicit unsupervised assumptions.
+```
+
+Interpretation (originally provided in Chinese):
+
+```text
+We have an explicit, reproducible five-way operating rule; the paper does not claim these thresholds are optimal or natural cut points, but rather uses ICCR to quantify the rule's specificity-proxy behavior in the absence of signature-level ground truth.
+```
+
+## Concrete Threshold Language to Make Visible
+
+The manuscript should not bury the actual operating thresholds. Somewhere early in Methodology, and preferably summarized in Introduction, make the rule explicit:
+
+```text
+High-confidence non-hand-signed: cosine > 0.95 AND dHash <= 5.
+Moderate-confidence non-hand-signed: cosine > 0.95 AND 5 < dHash <= 15.
+Other outcomes follow the fixed five-way box rule.
+```
+
+If space allows, add a compact sentence:
+
+```text
+Thus, the system has explicit decision rules; what remains uncalibrated in the absence of signature-level labels is their true false-positive and false-negative error rate.
+```
+
+This directly answers the reviewer question: "Do the authors actually have a classifier?"
+
+## Rewrite Style Recommendation
+
+Avoid language that sounds like the authors are unable to provide thresholds:
+
+- Avoid: "No threshold can be derived."
+- Avoid: "The distribution does not support classification."
+- Avoid: "We cannot determine a threshold."
+
+Use language that distinguishes operational thresholds from statistically natural or supervised-optimal thresholds:
+
+- Prefer: "The deployed thresholds are operational rules rather than natural antimodes."
+- Prefer: "We characterize these rules with ICCR rather than claiming supervised error rates."
+- Prefer: "The absence of a distributional antimode motivates anchor-based calibration, not threshold-free analysis."
+- Prefer: "The system is a concrete screening classifier with explicit unsupervised calibration limits."
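If the rule is surfaced early as recommended, a compact reference implementation would also pin down its semantics. A minimal sketch follows; only the HC/MC clauses are quoted in this handoff, so the `box_rule` name, the HSC/UN/LH abbreviations (high-similarity candidate / uncertain / likely hand-signed), and their boundaries below are illustrative placeholders, not the deployed rule:

```python
def box_rule(cosine: float, dhash: int) -> str:
    """Five-way screening label from the two similarity descriptors.

    The HC/MC clauses follow the thresholds quoted in this handoff.
    The HSC/UN/LH boundaries are ILLUSTRATIVE placeholders only; the
    deployed box rule's remaining clauses are not quoted here.
    """
    if cosine > 0.95 and dhash <= 5:
        return "HC"   # high-confidence non-hand-signed
    if cosine > 0.95 and 5 < dhash <= 15:
        return "MC"   # moderate-confidence non-hand-signed
    # Placeholder split for the remaining three outcomes:
    if cosine > 0.90:
        return "HSC"  # high-similarity candidate (placeholder boundary)
    if cosine > 0.80:
        return "UN"   # uncertain (placeholder boundary)
    return "LH"       # likely hand-signed (placeholder boundary)


print(box_rule(0.97, 3))   # HC
print(box_rule(0.97, 12))  # MC
```

Any such snippet belongs in supplementary materials rather than the manuscript body; its value here is only to show that the rule is a concrete, reproducible function of the two descriptors.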
+
+## Reviewer-Facing Answer to the Threshold Question
+
+If the manuscript needs one sentence that resolves the ambiguity, use:
+
+```text
+The system therefore uses explicit operating thresholds, but the evidentiary claim attached to those thresholds is limited: they define a reproducible screening rule whose coincidence behavior can be estimated under inter-CPA anchors, not a validated forensic decision boundary with calibrated error rates.
+```
+
+This should be the guiding style for Abstract, Introduction, and the start of Methodology.
+
+## Readability Risk: Too Many Diagnostics Can Look Like Methodological Overbuilding
+
+The manuscript's multi-method statistical design increases rigor, but it also creates a readability risk. In the current form, some sections may feel like a defensive accumulation of diagnostics rather than a clean research design.
+
+Reviewer risk:
+
+- The reader may ask: "Are the authors using many methods because the core classifier is unclear?"
+- The reader may miss the simple main claim because the paper introduces too many caveats and validation tools early.
+- The paper may look like "we used many methods, therefore credible" instead of "each method answers one necessary question."
+
+Recommended main-thread sentence:
+
+```text
+We deploy a fixed five-way screening rule and characterize its unsupervised reliability limits using ICCR, after showing that the descriptor distribution does not support a natural threshold.
+```
+
+Interpretation (originally provided in Chinese):
+
+```text
+We have an explicit five-way screening rule; we first show that natural distributional cut points cannot serve as thresholds, and then use ICCR to describe the rule's reliability limits on unlabelled data.
+```
+
+All methods and diagnostics should serve this main thread.
+
+## Core vs Supporting Diagnostics
+
+Treat the following as core and keep them prominent:
+
+- End-to-end pipeline: VLM -> YOLO -> ResNet -> cosine/dHash.
+- Explicit five-way operating rule.
+- Composition decomposition showing why the descriptor distribution does not yield a natural threshold.
+- ICCR calibration at three units: per-comparison, per-signature, per-document. +- Firm heterogeneity and within-firm collision concentration. +- Ground-truth limitation and no true error-rate claim. + +Treat the following as supporting diagnostics and avoid letting them dominate the main narrative: + +- K=2 / K=3 mixture fits. +- Three-score Spearman convergence. +- Leave-one-firm-out reproducibility. +- BD/McCrary sensitivity. +- Ten-tool validation table. +- Pixel-identity positive anchor, especially because it is close to tautological for the high-confidence rule. + +These supporting diagnostics can stay, but they should be framed as robustness checks, assumption checks, or supplementary evidence, not as independent central contributions. + +## Suggested Manuscript Structure for Clarity + +Recommended structure for the Methodology / Results narrative: + +1. Core Method + +Describe the pipeline, descriptor construction, and five-way rule. + +2. Why the Threshold Is Operational Rather Than Natural + +Use the composition decomposition only. Avoid over-explaining K=3, BD/McCrary, or historical mixture logic here. + +3. How the Rule Is Calibrated Without Ground Truth + +Explain ICCR and the three reporting units: per-comparison, per-signature, per-document. + +4. What the Calibration Reveals + +Report firm heterogeneity and within-firm collision concentration. + +5. Supporting Diagnostics + +Place K=3, Spearman convergence, LOOO, BD/McCrary, and pixel-identity checks here as supporting evidence. + +## Rewrite Style for Multi-Method Sections + +Avoid: + +```text +We apply a multi-tool validation framework consisting of ten diagnostics... +``` + +This can sound like methodological stacking. + +Prefer: + +```text +Each supporting diagnostic addresses a specific failure mode: composition artefacts, inter-CPA coincidence, pool-size effects, firm heterogeneity, or positive-anchor capture. +``` + +Avoid: + +```text +The conjunction of ten tools constitutes validation... 
+``` + +Prefer: + +```text +Together, these diagnostics define the limits of what can be supported without signature-level ground truth. +``` + +Avoid presenting auxiliary diagnostics before the reader understands the classifier. + +Preferred order: + +```text +Rule first. Then why not natural threshold. Then ICCR calibration. Then robustness. +``` + +## Reviewer-Facing Principle + +The paper should not read as: + +```text +We used many methods, so the result is credible. +``` + +It should read as: + +```text +We use one explicit screening rule. Each statistical diagnostic answers one necessary question about how that rule should be interpreted under unsupervised constraints. +``` + +This distinction is important for readability and reviewer trust. diff --git a/paper/review_handoff_body_20260515.md b/paper/review_handoff_body_20260515.md new file mode 100644 index 0000000..99b8737 --- /dev/null +++ b/paper/review_handoff_body_20260515.md @@ -0,0 +1,397 @@ +# Review Handoff: Methodology, Results, Discussion, Conclusion + +Date: 2026-05-15 +Target manuscript: `paper/paper_a_v4_combined.md` +Scope reviewed: §III Methodology, §IV Experiments and Results, §V Discussion, §VI Conclusion +Companion review: `paper/review_handoff_abstract_intro_20260515.md` (Abstract + Introduction) + +This handoff continues the same framing principle established for Abstract + Introduction: + +> *"One explicit screening rule. Each statistical diagnostic answers one necessary question about how that rule should be interpreted under unsupervised constraints."* + +If only the Abstract and Introduction are revised, the manuscript will exhibit tonal mismatch when the reader drops into the body sections, which currently retain internal-version language and a defensive-accumulation framing for the supporting diagnostics. The body must be brought into the same register. + +## Overall Assessment + +The body sections are substantively defensible. 
The core empirical results — composition decomposition, anchor-based ICCR at three units, firm heterogeneity logistic, cross-firm hit matrix, alert-rate sensitivity — are presented in adequate quantitative detail with explicit unsupervised-validation caveats. The Discussion correctly distinguishes positive and negative anchors. The Conclusion lists eight methodological contributions that map onto the v4 contribution set. + +The recurring weakness across §III / §IV / §V / §VI is *not* empirical. It is two intertwined narrative tendencies: + +1. The body is still written as a *revision history* relative to v3.x in many paragraphs — "v4.0 strengthens", "v4.0 retroactively reframes", "v4.0 adopts", "inherited from v3.x", "the v3.x role of Firm A". This is internally honest but, in a submitted paper, signals to the reviewer that the authors are arguing with themselves. + +2. The supporting diagnostics are repeatedly presented as a *collection* ("multi-tool framework", "ten-tool unsupervised-validation collection", "Table XXVII"). This collection framing is precisely the readability risk identified in the Abstract / Introduction handoff under "Readability Risk: Too Many Diagnostics Can Look Like Methodological Overbuilding." It currently appears unmodified in §III-M. + +Recommended disposition: Minor Revision for narrative voice and structural emphasis, not for empirical weakness. + +## Main Reviewer Concerns + +### 1. The v3-to-v4 revision narrative is pervasive in the body and must be removed + +The Abstract / Introduction handoff identified "v4.0 adopts", "Earlier work in this lineage", and "inherited Paper A v3.x five-way box rule" as patterns to strip. The same patterns occur throughout the body sections. Representative instances (not exhaustive): + +- §III-G: "We earlier (v4.0 first draft) listed 'statistical multimodality at the accountant level' among the scope justifications..." 
+- §III-H opening: "v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing." +- §III-I.5 closing sentence: "§III-L develops the v4.0 anchor-based threshold calibration framework..." +- §III-L.0 "Why retained without v4.0 recalibration" subsection title. +- §III-L.7 closing: "The operational classifier of §III-L.0 is the inherited v3.x five-way box rule..." +- §IV opening paragraph: "The v4.0 primary analyses (§IV-D through §IV-J) are scoped to..." and "§IV-A through §IV-C report inherited corpus-wide v3.x material; §IV-L (feature backbone ablation) is also inherited. §IV-M consolidates the v4-new anchor-based ICCR calibration tables." +- §IV-I: "v4.0 retroactively reframes the metric as inter-CPA pair-level coincidence rate (ICCR) rather than 'False Acceptance Rate'..." +- §IV-J: "v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset)." +- §IV-M opening: "v4-new empirical results that support..." +- §V-B: "A central empirical finding of v3.x was that per-signature similarity does not admit a clean two-mechanism mixture... v4.0 strengthens and extends this signature-level reading." +- §V-C: "In v4.0 we treat Firm A as a templated-end case study rather than as the calibration anchor for the operational threshold." +- §V-H opening: "The first nine are v4.0-specific; the last five are inherited from v3.20.0 §V-G and still apply to the v4.0 pipeline." + +The remediation principle is the same as for the Introduction pivot paragraph. The final manuscript should describe the *final methodological state* and its rationale, not the trajectory by which that state was reached. Internal provenance — "this analysis is reproduced from v3.x §IV-F.1 / Script 28" — belongs in an Appendix B reproducibility table or supplementary material, not in the main narrative arc. 
+
+A safe rewriting heuristic: every sentence that begins with "v4.0", "v3.x", "v4-new", "inherited", or "earlier work" should be treated as a candidate for either deletion or rewriting in the present tense without version labels.
+
+### 2. The "Ten-Tool Unsupervised-Validation Collection" frame must be retired
+
+§III-M Table XXVII is the canonical instance of the readability risk that the Abstract / Introduction handoff flagged. The current frame is:
+
+> "v4.0 adopts a multi-tool collection of partial-evidence diagnostics (Table XXVII), each with an explicitly disclosed assumption..."
+> "No single tool in this collection provides ground-truth validation. Their conjunction constitutes the unsupervised validation ceiling that the v4.0 corpus admits."
+
+This is exactly the language the Abstract / Introduction handoff identified as risky ("We used many methods, so the result is credible"). It reappears verbatim in the §VI Conclusion as "a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope" and "(8) a ten-tool unsupervised-validation collection (§III-M Table XXVII) that explicitly discloses each tool's untested assumption."
+
+The recommended reframe is:
+
+```text
+The corpus does not admit standard supervised classifier validation: no signature-level
+ground truth exists for hand-signed versus replicated classes, so False Rejection Rate,
+sensitivity, recall, EER, ROC-AUC, precision, and positive predictive value are not
+reportable. Each diagnostic in this section therefore addresses one specific
+failure mode of an unsupervised screening classifier: composition artefacts,
+inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold
+sensitivity, or positive-anchor capture. Together they characterise the limits of
+what can be claimed without signature-level ground truth.
+``` + +Keep Table XXVII as a reference table if useful, but retitle it as "Diagnostic — failure mode addressed — disclosed assumption" rather than "Ten-tool collection". The word "ten" should not appear in the manuscript. + +### 3. The §V-H Limitations list is correct but defensively ordered + +§V-H lists fourteen limitations. The first one — "No signature-level ground truth; no true error rates reportable" — is the load-bearing limitation that everything else in v4.0 hinges on. The next two — "Inter-CPA negative-anchor assumption is partially violated" and "Scope" — are also major. The other eleven are real but secondary. The current presentation gives every item roughly equal visual weight as a flat list. + +Recommended reorganisation: + +- *Primary limitations (3 items):* (a) no signature-level ground truth, (b) inter-CPA negative-anchor assumption partially violated and firm-dependent, (c) Big-4 scope (full-dataset robustness is light). +- *Secondary limitations (4 items):* pixel-identity conservative subset; inherited rule components not separately v4-validated; deployed-rate excess not a true-positive rate; A1 pair-detectability stipulation. +- *Documented features rather than limitations (2 items):* K=3 hard-posterior composition sensitivity; no partner-level mechanism attribution. +- *Inherited engineering limitations (5 items):* ImageNet features, red-stamp HSV preprocessing, longitudinal scan / PDF / compression, source-exemplar misattribution, legal interpretation. + +This preserves the disclosures but signals to the reviewer which limitations carry the methodological weight and which are routine engineering caveats. + +### 4. §III-F SSIM and pixel-comparison justification is too long for Methodology + +§III-F currently dedicates roughly 15 lines (lines 112–127 in `paper_a_methodology_v3.md`) to justifying *why* SSIM and pixel-level comparison are not used as primary descriptors. 
The argument is correct (design-level mismatch between SSIM's natural-image quality factors and signature-crop artefacts; sub-pixel alignment fragility of pixel L1/L2), but in its current form it reads as a defensive response to an anticipated reviewer objection rather than as forward Methodology exposition. + +Recommended reduction: collapse the argument to one short paragraph (3–4 sentences) and move the full design-level discussion to Appendix B. The Methodology body should state the choice (cosine on deep features + dHash) and briefly justify it (both stable across print-scan cycles by design), with the SSIM / pixel-comparison rebuttal in an appendix or a single citation footnote. + +### 5. §IV's section opener still encodes provenance not appropriate to a Results section opener + +The current §IV opener: + +> "The v4.0 primary analyses (§IV-D through §IV-J) are scoped to the Big-4 sub-corpus (Firms A–D, n = 437 CPAs with n_sig ≥ 10, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. The §IV-K Full-Dataset Robustness section reports the full-dataset (686 CPAs) variant of the K=3 mixture + Paper A box-rule Spearman analysis as a cross-scope robustness check. §IV-A through §IV-C report inherited corpus-wide v3.x material; §IV-L (feature backbone ablation) is also inherited. §IV-M consolidates the v4-new anchor-based ICCR calibration tables." + +Recommended replacement direction: + +```text +Section IV reports the empirical results that calibrate and characterise the +operational classifier of §III-L. The primary analyses (§IV-D through §IV-J, +§IV-M) are scoped to the Big-4 sub-corpus (Firms A–D, 437 CPAs, 150,442 +signatures); §IV-K reports a full-dataset (686 CPAs) robustness check on the K=3 +mixture and per-CPA score-rank convergence; §IV-A through §IV-C and §IV-L +report the corpus-wide pipeline performance and feature-backbone ablation that +support the descriptor choice of §III-F. 
+``` + +This preserves the scope information while removing the v3-to-v4 inheritance labels and the "v4-new" prefix on §IV-M. + +## Section-by-Section Comments + +### §III-A Pipeline Overview + +The pipeline diagram caption (lines 12–20) describes the classifier as "Firm A P7.5-anchored", which is residual v3 language that conflicts with the v4 reframe. v4 explicitly abandons Firm A as the calibration anchor in favour of inter-CPA ICCR (§III-H, §III-L). The figure caption should be updated to read "Anchor-Calibrated Five-Way Classifier" or similar, consistent with the §III-L title "Anchor-Based Threshold Calibration and Operational Classifier". + +The §III-A second paragraph ("Throughout this paper we use the term non-hand-signed rather than 'digitally replicated'...") is well-positioned and should be kept. + +### §III-B Data Collection + +No issues identified. + +### §III-C Signature Page Identification + +No issues identified. The 98.8% VLM-YOLO agreement footnote is appropriately scoped ("we do not attempt to attribute the residual"). + +### §III-D Signature Detection + +No issues identified. + +### §III-E Feature Extraction + +No issues identified. + +### §III-F Dual-Method Similarity Descriptors + +As noted in Main Concern 4: shorten the SSIM and pixel-comparison rebuttal to ~3–4 sentences and move full design-level argument to Appendix B. + +### §III-G Unit of Analysis and Scope + +This section is currently long and contains the "We earlier (v4.0 first draft) listed..." paragraph that explicitly walks through the methodological revision. That paragraph (currently at the end of §III-G, before the sample-size reconciliation) should be deleted. The four-item scope rationale list above it is good and should be kept. + +The sample-size reconciliation paragraph (n=150,442 vs n=150,453) is technically necessary but is repeated almost verbatim in §IV-J as a parenthetical. 
Consider centralising it in §III-G with a forward reference, or in an Appendix B reproducibility note.

### §III-H Reference Populations

Replace the opening sentence:

> "v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing."

with:

```text
The calibration distinguishes two reference populations: Firm A as a within-Big-4
templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference
for internal-consistency checking.
```

The remainder of §III-H is well-written; the descriptive content is fine. The "v3.x's single-anchor framing" phrase is the only internal-version language that needs removal.

### §III-I Distributional Diagnostics

This is the strongest single section in the body. The four sub-diagnostics (dip test, mixture, BD/McCrary, composition decomposition) are tightly organised around one claim: the descriptor distribution does not provide a within-population bimodal antimode. The 2×2 factorial table at §III-I.4 is the empirical centrepiece of the v4 reframe.

One small narrative issue: §III-I.5 ("Conclusion") closes with "§III-L develops the v4.0 anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode." Remove "v4.0" — write "§III-L develops the anchor-based threshold calibration framework..."

### §III-J K=3 as a Descriptive Partition of Firm-Composition Contrast

The section header is clear and the framing ("Both fits are descriptive partitions... not within-population mechanism modes") is correct.

The current closing paragraph references "§III-K" for cross-checks between the box rule and K=3, but §III-K is the next subsection — this is a within-Methodology forward reference and reads slightly oddly. Consider rephrasing as "Cross-checks between the inherited five-way box rule and the K=3 partition appear in §III-K below."
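For readers implementing the K=2 / K=3 comparison discussed here and reported in §IV-E, the BIC-based model-order check can be sketched as follows. This is a minimal illustration on synthetic data — the component means, spreads, and counts below are assumptions for demonstration, not the paper's descriptor values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for an accountant-level similarity descriptor:
# three overlapping bands (illustrative values only, NOT the paper's data).
x = np.concatenate([
    rng.normal(0.55, 0.05, 300),   # diffuse low-similarity bulk
    rng.normal(0.80, 0.04, 100),   # intermediate band
    rng.normal(0.95, 0.02, 37),    # tight high-similarity band
]).reshape(-1, 1)

# Fit K = 1..4 Gaussian mixtures and compare BIC; lower BIC is preferred.
bic = {k: GaussianMixture(n_components=k, n_init=5, random_state=0)
          .fit(x).bic(x) for k in (1, 2, 3, 4)}
best_k = min(bic, key=bic.get)
print(best_k, {k: round(v, 1) for k, v in bic.items()})
```

Note that, exactly as the review's framing insists, a BIC preference for K=3 identifies a descriptive partition of the pooled distribution — it says nothing by itself about within-population generating mechanisms.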

### §III-K Convergent Internal-Consistency Checks

This section is well-handled. The opening caveat — "the three scores are not statistically independent measurements... so their high pairwise rank correlations are partly a mechanical consequence of shared inputs" — is exactly the methodological honesty the v4 reframe needs.

One narrative issue: §III-K.4 (positive-anchor miss rate) and §III-K.3 (LOOO reproducibility) are *summarised* in §III-K but also reported in detail in §IV-H and §IV-G respectively. Consider whether the §III-K subsections add narrative value beyond cross-referencing — if not, §III-K could shrink to just the three-score Spearman block (§III-K.1) and a one-line cross-reference to LOOO and pixel-identity, with the detail living in §IV-G / §IV-H.

### §III-L Anchor-Based Threshold Calibration and Operational Classifier

This section has the operating-rule text that the Abstract / Introduction handoff explicitly asked for ("Cosine > 0.95 AND dHash ≤ 5" etc., §III-L.0 item 1). Good.

The "Terminological note on FAR" at the end of §III-L.0 is explicit and reviewer-facing. Keep it.

Issues:

- "Why retained without v4.0 recalibration" — replace the subsection title and contents to remove v4 references. The argument ("the inherited thresholds preserve continuity with prior reporting; §III-I.4 establishes that recalibration cannot be anchored on distributional antimodes; §III-L.1 confirms the cosine threshold's specificity at the inter-CPA pair level is reproducible") is intact without the v4 label.
- §III-L.7 ("K=3 not used as classifier") restates content already in §III-J. Consider deleting §III-L.7 and adding a one-line note inside §III-L.0 ("The K=3 mixture of §III-J is used as an accountant-level descriptive summary alongside the per-signature five-way classifier; K=3 hard-posterior membership is not used to assign signature-level or document-level labels in any result table").
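For concreteness, the joint high-confidence check quoted above ("Cosine > 0.95 AND dHash ≤ 5") reduces to a one-line conjunction. A minimal sketch — the function name is ours, and the remaining four bands' cut-offs are deliberately not reproduced here:

```python
def hc_pair_hit(cosine: float, dhash_hamming: int) -> bool:
    """Joint high-confidence rule (Cosine > 0.95 AND dHash <= 5): a
    signature pair fires only when BOTH descriptors agree. The
    conjunction, not either threshold alone, is what the §III-L
    inter-CPA coincidence-rate calibration characterises."""
    return cosine > 0.95 and dhash_hamming <= 5


print(hc_pair_hit(0.97, 3), hc_pair_hit(0.97, 9))  # → True False
```

Both comparisons are strict in the directions quoted: a cosine of exactly 0.95 does not fire, while a dHash Hamming distance of exactly 5 does.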

### §III-M Validation Strategy and Limitations under Unsupervised Setting

Replace the framing as described in Main Concern 2. Keep the underlying disclosure content. Consider whether Table XXVII is best presented as a numbered methodological table or as an Appendix B reproducibility-and-assumption summary; in either case retitle and reframe so that "ten" does not appear and the unifying principle is "each diagnostic addresses one specific unsupervised failure mode."

The "What v4.0 does not claim" and "What v4.0 does claim" subsections at the end of §III-M are strong, but the framing tags "v4.0 does not claim" / "v4.0 does claim" are the problematic version-language pattern. Replace with "Limits of the present analysis" and "Scope of the present analysis."

### §III-N Data Source and Firm Anonymization

No issues. The residual-identifiability disclosure is appropriately framed.

### §IV-A Experimental Setup

No issues identified.

### §IV-B Signature Detection Performance

No issues identified.

### §IV-C All-Pairs Intra-vs-Inter Class Distribution Analysis

The pairwise-non-independence caveat ("we therefore rely primarily on Cohen's d... A Cohen's d of 0.669 indicates a medium effect size, confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count") is well-positioned. Keep.

### §IV-D Big-4 Accountant-Level Distributional Characterisation

The Table V dip-test row labels are clear. The "v4-new composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration" should drop the "v4-new" — just write "...are tabulated in §IV-M below alongside the anchor-based ICCR calibration."

### §IV-E Big-4 K=2 / K=3 Mixture Fits

The "descriptive partition; not mechanism clusters per §III-J" labels in Tables VII and VIII are consistent with the v4 reframe. Keep. Drop "(v3.x role)" anywhere it appears.
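The §IV-C preference for Cohen's d over p-values is easy to make concrete: with hundreds of millions of pairwise comparisons, any nonzero mean difference is "significant", while the pooled-SD effect size stays scale-honest. A minimal sketch (variable names are ours, not the paper's):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation. Unlike a p-value, the
    effect size does not inflate with the enormous sample counts that
    all-pairs comparisons produce, which is why §IV-C leans on it."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = a.size, b.size
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd
```

Under the usual rule of thumb (0.2 small, 0.5 medium, 0.8 large), the reported d = 0.669 sits between medium and large, consistent with the §IV-C reading.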

### §IV-F Convergent Internal-Consistency Checks

This is duplicate Results-side reporting of §III-K. Consider whether the duplication adds value or is redundant. If both sections must remain, then §III-K should describe the *method* (three scores, why they are not independent) and §IV-F should report the *numbers*; currently §III-K reports both the method and the numbers, leaving §IV-F as a near-duplicate. Recommendation: trim §IV-F to just the per-firm summary table and the Cohen's-kappa block, with the method description living in §III-K.

### §IV-G Leave-One-Firm-Out Reproducibility

Tables XII and XIII are well-organised. The interpretation paragraph following Table XIII correctly identifies the K=2 vs K=3 contrast (K=2 unstable; K=3 component shape reproducible but hard-posterior membership composition-sensitive). Keep.

### §IV-H Pixel-Identity Positive-Anchor Miss Rate

The "close to tautological" caveat is appropriately positioned. Keep. The reverse-anchor-cut-by-prevalence calibration disclosure is also appropriate.

### §IV-I Inter-CPA Pair-Level Coincidence Rate

Replace:

> "v4.0 retroactively reframes the metric as inter-CPA pair-level coincidence rate (ICCR) rather than 'False Acceptance Rate' because..."

with:

```text
The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It
is the per-pair rate at which two signatures from different CPAs satisfy the
deployed rule. We do not label it as a False Acceptance Rate because (a) FAR has
a biometric-verification meaning that requires ground-truth negative labels, and
(b) the inter-CPA negative-anchor assumption is partially violated by within-firm
cross-CPA template-like collision structures (§III-L.4 cross-firm hit matrix).
```

### §IV-J Five-Way Per-Signature + Document-Level Classification Output

The sample-size reconciliation parenthetical ("11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded") is repeated from §III-G. Centralise once and forward-reference.

"v4.0 does not change this aggregation rule; only the population over which it is computed changes" should be "The aggregation rule is the inherited worst-case rule (HC > MC > HSC > UN > LH); we apply it to the Big-4 sub-corpus."

The MC band capture-rate inheritance disclosure is appropriately framed but should drop the "v4.0 does not re-derive" phrasing; rewrite as "The moderate-confidence band's calibration and capture-rate evidence is reported in [Appendix B / v3.20.0 Tables IX, XI, XII, XII-B] and is not regenerated on the Big-4 subset."

### §IV-K Full-Dataset Robustness

The scope-of-§IV-K paragraph ("The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + Paper A operational-rule per-CPA less-replication-dominated rate analysis...") is defensively framed but the substance is correct. Consider shortening the "what we do not do" enumeration and emphasising the "what we do show" finding (K=3 + Paper A box-rule Spearman convergence preserved at full scope; ρ drift = 0.007).

### §IV-L Feature Backbone Comparison

This is inherited v3.x content. The "inherited unchanged from the v3.20.0 backbone-ablation table" framing is acceptable here because it is a methodological choice (the ablation is not re-run at the Big-4 scope) rather than a narrative pivot. Keep.

### §IV-M v4-New Anchor-Based ICCR Calibration Results

Drop the "v4-new" from the section heading. Recommended replacement heading: "Anchor-Based ICCR Calibration Results".

The section is empirically dense and methodologically sound. Tables XXI–XXVI cover the four units (per-comparison, per-signature, per-document, firm logistic + hit matrix) and the alert-rate sensitivity.
Keep all tables. Drop "v4 new" / "v4-new" wherever it appears as a row qualifier or section subheading.

### §V-A Non-Hand-Signing Detection as a Distinct Problem

Keep. This section preserves the forgery distinction (Main Concern 2 in the Abstract / Introduction handoff).

### §V-B Per-Signature Similarity is a Continuous Quality Spectrum

Replace the v3-to-v4 opening:

> "A central empirical finding of v3.x was that per-signature similarity does not admit a clean two-mechanism mixture: dip-test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 strengthens and extends this signature-level reading."

with:

```text
The Big-4 accountant-level descriptor distribution rejects unimodality on both
marginals at p < 5 × 10⁻⁴ (§IV-D Table V). The composition decomposition of
§III-I.4 (Scripts 39b–39e) shows this rejection is fully attributable to two
non-mechanistic sources...
```

This preserves the §V-B content while removing the v3.x lineage statement.

### §V-C Firm A as the Templated End of Big-4

Replace "In v4.0 we treat Firm A as a templated-end case study rather than as the calibration anchor for the operational threshold" with "We treat Firm A as a templated-end case study within the Big-4 sub-corpus rather than as the calibration anchor for the operational threshold."

Drop the "the v3.x role of Firm A" historical sub-clause that appears in §III-G item 2.

The Firm A byte-level pixel-identity reference (145 signatures across ~50 distinct partners; 35 byte-identical matches across fiscal years) is inherited from v3.x §IV-F.1 / Script 28 — this byte-level granularity is the strongest single piece of v3.x evidence that *should* survive into v4, because it directly supports the §V-C templated-end characterisation.
Keep the reference but recast it as "Byte-level decomposition of these 145 signatures (Appendix B) shows..." rather than the current "The additional v3.x finding... is inherited from v3.20.0 §IV-F.1 / Script 28...".

### §V-D K=2 / K=3 as Descriptive Firm-Compositional Partitions

Keep. The contrast between K=2 instability and K=3's reproducible component shape with composition-sensitive membership is one of the cleanest narrative arcs in the paper.

### §V-E Three-Score Convergent Internal-Consistency

Keep. The "not statistically independent" caveat is correctly positioned. The within-Big-4 non-Firm-A disagreement between Score 2 and Scores 1/3 is correctly disclosed.

### §V-F Anchor-Based Multi-Level Calibration

Keep. This is the v4 contribution. Drop any residual "v4" labels.

### §V-G Pixel-Identity as a Hard Positive Anchor; Inherited Inter-CPA Negative Anchor

Keep. The "positive necessary but not sufficient" caveat and the "specificity proxy under a partially violated assumption" framing are exactly right.

Drop "Inherited" from the §V-G section heading — the heading currently reads "Pixel-Identity as a Hard Positive Anchor; Inherited Inter-CPA Negative Anchor Reframed as Coincidence Rate", which encodes the v3-to-v4 history in the section title itself. Recommended: "Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor".

### §V-H Limitations

Reorganise as described in Main Concern 3: primary (3) / secondary (4) / documented features (2) / inherited engineering (5).

Drop the "inherited from v3.20.0 §V-G" qualifiers — a limitation either applies to the pipeline or it does not; the version source is reproducibility metadata that belongs in Appendix B.
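The byte-identity check behind the §V-C / §V-G hard positive anchor is trivial to reproduce, which is part of why it is the strongest evidence in the paper: it involves no descriptor threshold at all. A minimal sketch, assuming signature crops are stored as individual files (the function and variable names are ours):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def byte_identical_groups(paths):
    """Group signature-crop files by exact byte content (SHA-256 digest).
    Any group of size >= 2 is a hard positive anchor: the files are
    literally the same bytes, so the match is necessarily true regardless
    of any similarity-descriptor calibration."""
    by_digest = defaultdict(list)
    for p in paths:
        by_digest[hashlib.sha256(Path(p).read_bytes()).hexdigest()].append(str(p))
    return [group for group in by_digest.values() if len(group) > 1]
```

As §V-G correctly cautions, byte identity is necessary but not sufficient as a validation target: it anchors the positive end while leaving near-duplicates (re-encoded or re-scanned copies) to the descriptor rule.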

### §VI Conclusion

Replace the opening framing:

> "We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope."

with:

```text
We present a fully automated pipeline for detecting non-hand-signed CPA
signatures in Taiwan-listed financial audit reports, together with an
anchor-calibrated screening framework that characterises the pipeline's
operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised
assumptions.
```

The eight numbered contributions are content-correct but presented in flat-list form. Consider grouping them into four thematic clusters:

- *Why the descriptor distribution does not anchor a natural threshold* (contributions 1, 5).
- *How the deployed rule is calibrated under unsupervised constraints* (contributions 2, 6, 7).
- *What the calibration reveals about firm heterogeneity* (contributions 3, 4).
- *Methodological positioning* (contribution 8 — but reframe per Main Concern 2).

The Future Work block (four items) is fine; consider trimming the second item ("a separate study to distinguish deliberate template sharing from passive firm-level production artefacts"), which is the only item that involves additional fieldwork rather than methodological extension.

## Recommended Minimum Patch List

1. Strip v3-to-v4 revision language throughout §III, §IV, §V, §VI. Do a mechanical pass on "v4.0", "v3.x", "v4-new", "inherited", "earlier work in this lineage". Replace with present-tense descriptions of the final methodological choice and forward references to Appendix B for reproducibility provenance.

2. Retire the "ten-tool unsupervised-validation collection" framing in §III-M and the "multi-tool framework" phrase in the §VI Conclusion. Replace with the "each diagnostic addresses one specific unsupervised failure mode" framing.
Retitle Table XXVII so that "ten" does not appear.

3. Reorganise §V-H Limitations into primary / secondary / documented-features / inherited-engineering groupings.

4. Shorten the §III-F SSIM and pixel-comparison rebuttal to ~3–4 sentences; move the design-level discussion to Appendix B.

5. Update the Figure 1 caption (currently in §III-A commented HTML) to remove the "Firm A P7.5-anchored" residual v3 language.

6. Rewrite the §IV opener paragraph to remove the inherited-vs-v4-new section labels.

7. Rewrite the §IV-I opening paragraph to remove "v4.0 retroactively reframes the metric...".

8. Drop "v4-new" from the §IV-M section heading; replace with "Anchor-Based ICCR Calibration Results".

9. Centralise the n=150,442 vs n=150,453 sample-size reconciliation in §III-G; remove the duplicate parenthetical from §IV-J.

10. Consider trimming §IV-F to numbers only (per-firm summary table + Cohen's kappa), with the method description living in §III-K.

11. Consider deleting §III-L.7 (duplicate of the §III-J K=3-not-used-as-classifier claim) and adding a one-line note in §III-L.0.

## Reviewer Bottom Line

The body sections of v4 are empirically defensible and methodologically internally consistent. The required revisions are stylistic and structural rather than substantive:

- Remove the v3-to-v4 revision narrative from the present-tense exposition.
- Reframe the supporting diagnostics from a "ten-tool collection" to "each diagnostic addresses one unsupervised failure mode."
- Reorganise the limitations list so that the load-bearing limitations are visibly more prominent than the routine engineering caveats.
- Move provenance and reproducibility detail to Appendix B / supplementary material.

These changes preserve every quantitative claim and every disclosure currently in the manuscript.
They tighten the narrative voice so that the reader experiences the v4 methodological choices as the final state of the design rather than as an ongoing argument with an earlier version. Combined with the Abstract / Introduction patches in the companion handoff, the manuscript should read as a single coherent submission rather than as a layered revision document.

## Additional Cross-Cutting Observation: Script Provenance in Tables

Across §III, §IV, §V, and the Conclusion, tables are annotated with `(Source: Script 32 / 34 / 35 / 38 / 40b / 43 / 44 / 45 / 46)` parentheticals. This is appropriate for reproducibility but heavy at the visual level — every table footer in §IV-D through §IV-M carries one of these annotations.

Recommended consolidation: move the script-to-table mapping to a single Appendix B reproducibility table ("Table B-1. Script-to-table provenance map"), and replace the inline annotations with a single one-line note at the start of §IV ("Script-to-table provenance is summarised in Appendix B Table B-1; raw outputs are available in the supplementary repository").

This is a minor change, but it materially reduces the visual signal that the paper is built on a large number of separate scripts.

## Closing Note

This review covers the body sections only. The Abstract / Introduction handoff (`paper/review_handoff_abstract_intro_20260515.md`) covers the front matter. The two handoffs should be applied together; applying only one of them will produce a tonal mismatch as the reader moves from the front matter into the body.

The References and the Appendix have not been reviewed and may benefit from a separate handoff if the Appendix is to absorb the SSIM / pixel-comparison material and the reproducibility-provenance table recommended above.