Opus round-9 review (paper/opus_final_review_v3_9.md) dissented from Gemini round-7 Accept and aligned with codex round-8 Minor, but for a DIFFERENT issue all prior reviewers missed: the paper's main text in four locations flatly claimed the BD/McCrary accountant-level null "persists across the Appendix-A bin-width sweep", yet Appendix A Table A.I itself documents a significant accountant-level cosine transition at bin 0.005 with |Z_below|=3.23, |Z_above|=5.18 (both past 1.96) located at cosine 0.980 --- on the upper edge of our two threshold estimators' convergence band [0.973, 0.979]. This is a paper-to-appendix contradiction that a careful reviewer would catch in 30 seconds. BLOCKER B1: BD/McCrary accountant-level claim softened across all four locations to match what Appendix A Table A.I actually reports: - Results IV-D.1 (lines 85-86): rewritten to say the null is not rejected at 2/3 cosine bin widths and 2/3 dHash bin widths, with the one cosine transition at bin 0.005 sitting on the upper edge of the convergence band and the one dHash transition at |Z|=1.96. - Results IV-E Table VIII row (line 145): "no transition / no transition" changed to "0.980 at bin 0.005 only; null at 0.002, 0.010" / "3.0 at bin 1.0 only ( |Z|=1.96); null at 0.2, 0.5". - Results IV-E line 130 (Third finding): "does not produce a significant transition (robust across bin-width sweep)" replaced with "largely null at the accountant level --- no significant transition at 2/3 cosine bin widths and 2/3 dHash bin widths, with the one cosine transition at bin 0.005 sitting at cosine 0.980 on the upper edge of the convergence band". - Results IV-E line 152 (Table VIII synthesis paragraph): matched reframing. - Discussion V-B (line 27): "does not produce a significant transition at the accountant level either" -> "largely null at the accountant level ... with the one cosine transition on the upper edge of the convergence band". - Conclusion (line 16): matched reframing with power caveat retained. MAJOR M1: Related Work L67 stale "well suited to detecting the boundary between two generative mechanisms" framing (residue from pre-demotion drafts) replaced with a local-density-discontinuity diagnostic framing that matches the rest of the paper and flags the signature-level bin-width sensitivity + accountant-level rarity as documented in Appendix A. MAJOR M2: Table XII orphaned in-text anchor --- Table XII is defined inside IV-G.3 but had no in-text "Table XII reports ..." pointer at its presentation location. Added a single sentence before the table comment. MINOR m1: Section IV-I.1 "4 of 30,000+ Firm A documents, 0.01%" replaced with the exact "4 of 30,226 Firm A documents, 0.013%". MINOR m2: Section IV-E "the two-dimensional two-component GMM" wording ambiguity (reader might confuse with the already-selected K*=3 GMM from BIC) replaced with explicit "a separately fit two-component 2D GMM (reported as a cross-check on the 1D accountant-level crossings)". MINOR m3: Section IV-D L59 "downstream all-pairs analyses (Tables XII, XVIII)" misnomer --- Table XII is per-signature classifier output not all-pairs; Table XVIII's all-pairs are over ~16M pairs not 168,740. Replaced with an accurate list: "same-CPA per-signature best-match analyses (Tables V and XII, and the Firm-A per-signature rows of Tables XIII and XVIII)". MINOR m4: Methodology III-H L156 "the validation role is played by ... the held-out Firm A fold" slightly overclaims what the held-out fold establishes (the fold-level rates differ by 1-5 pp with p<0.001). Parenthetical hedge added: "(which confirms the qualitative replication-dominated framing; fold-level rate differences are disclosed in Section IV-G.2)". Also add: - paper/opus_final_review_v3_9.md (Opus 4.7 max-effort review) - paper/gemini_review_v3_8.md (Gemini round-7 Accept verdict, was missing from prior commit) Abstract remains 243 words (under IEEE Access 250 limit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.5 KiB
VI. Conclusion and Future Work
Conclusion
We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale. Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through two methodologically distinct threshold estimators and a density-smoothness diagnostic applied at two analysis levels.
Our contributions are fourfold.
First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.
Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
Third, we introduced a convergent threshold framework combining two methodologically distinct estimators---KDE antimode (with a Hartigan unimodality test) and an EM-fitted Beta mixture (with a logit-Gaussian robustness check)---together with a Burgstahler-Dichev / McCrary density-smoothness diagnostic.
Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the KDE antimode together with the Beta-mixture and logit-Gaussian estimators agree within \sim 0.006 at cosine \approx 0.975.
The Burgstahler-Dichev / McCrary test, by contrast, is largely null at the accountant level (no significant transition at two of three cosine bin widths and two of three dHash bin widths, with the one cosine transition sitting on the upper edge of the convergence band; Appendix A); at N = 686 accountants the test has limited power and cannot affirmatively establish smoothness, but its largely-null pattern is consistent with the smoothly-mixed cluster boundaries implied by the accountant-level GMM.
The substantive reading is therefore narrower than "discrete behavior": pixel-level output quality is continuous and heavy-tailed, and accountant-level aggregate behavior is clustered into three recognizable groups whose inter-cluster boundaries are gradual rather than sharp.
Fourth, we introduced a replication-dominated calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor. To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85-95% capture band differ by 1-5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure. This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
Future Work
Several directions merit further investigation. Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study. Extending the accountant-level analysis to auditor-year units---using the same convergent threshold framework at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade. The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration. The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself. Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.