0471e36fd4
User review of the v3.15 Sanity Sample subsection revealed that the paper's claim of "inter-rater agreement with the classifier in all 30 cases" (Results IV-G.4) was not backed by any data artifact in the repository. Script 19 exports a 30-signature stratified sample to reports/pixel_validation/sanity_sample.csv, but that CSV contains only classifier output fields (stratum, sig_id, cosine, dhash_indep, pixel_identical, closest_match) and no human-annotation column, and no subsequent script computes any human--classifier agreement metric. User confirmed that the only human annotation in the project was the YOLO training-set bounding-box labeling; signature classification (stamped vs hand-signed) was done entirely by automated numerical methods. The 30/30 sanity-sample claim was therefore factually unsupported and has been removed.

Investigation additionally revealed that the "independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images...for many of the sampled partners" framing used as the first strand of Firm A's replication-dominated evidence (Section III-H first strand, Section V-C first strand, and the Conclusion fourth contribution) had the same provenance problem: no human visual inspection was performed. The underlying FACT (that Firm A contains many byte-identical same-CPA signature pairs) is correct and fully supported by automated byte-level pair analysis (Script 19), but the "visual inspection" phrasing misrepresents the provenance.

Changes:

1. Results IV-G.4 "Sanity Sample" subsection deleted entirely (results_v3.md L271-273).
2. Methodology III-K penultimate paragraph describing the 30-signature manual visual sanity inspection deleted (methodology_v3.md L259).
3. Methodology Section III-H first strand (L152) rewritten from "independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images...for many of the sampled partners" to "automated byte-level pair analysis (Section IV-G.1) identifies 145 Firm A signatures that are byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years." All four numbers verified directly from the signature_analysis.db database via pixel_identical_to_closest = 1 filter joined to accountants.firm.
4. Discussion V-C first strand (L41) rewritten analogously to refer to byte-level pair evidence with the same four verified numbers.
5. Conclusion fourth contribution (L21) rewritten to "byte-level pair analysis finding of 145 pixel-identical calibration-firm signatures across 50 distinct partners (Section IV-G.1)."
6. Abstract (L5): "visual inspection and accountant-level mixture evidence..." rewritten as "byte-level pixel-identity evidence (145 signatures across 50 partners) and accountant-level mixture evidence..." Abstract now at 250/250 words.
7. Introduction (L55): "visual-inspection evidence" relabeled "byte-level pixel-identity evidence" for internal consistency.
8. Methodology III-H penultimate (L164): "validation role is played by the visual inspection" relabeled "validation role is played by the byte-level pixel-identity evidence" for consistency.

All substantive claims are preserved and now back-traceable to Script 19 output and the signature_analysis.db pixel_identical_to_closest flag. This correction brings the paper's descriptive language into strict alignment with its actual methodology, which is fully automated (except for YOLO training annotation, disclosed in Methodology Section III-B).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
33 lines
5.7 KiB
Markdown
# VI. Conclusion and Future Work
## Conclusion
We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through two methodologically distinct threshold estimators and a density-smoothness diagnostic applied at two analysis levels.
The seven numbered contributions listed in Section I can be grouped into four broader methodological themes, summarized below.
First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.
Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
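The dual-descriptor idea summarized above can be illustrated with a minimal sketch. It assumes feature vectors are plain lists of floats and that each signature crop has already been converted to a small grayscale grid; the function names and both thresholds (`cos_thr`, `dhash_thr`) are hypothetical placeholders, not the values or code used in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dhash_bits(gray_rows):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    Assumes the image is already grayscale and resized; 8 rows of 9 pixels
    would give the usual 64-bit hash.
    """
    return [int(row[i] > row[i + 1])
            for row in gray_rows
            for i in range(len(row) - 1)]


def hamming(bits_a, bits_b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


def dual_descriptor_label(cos_sim, dhash_dist, cos_thr=0.95, dhash_thr=5):
    """Combine both descriptors; the thresholds are illustrative assumptions."""
    if cos_sim < cos_thr:
        return "below_feature_threshold"
    if dhash_dist <= dhash_thr:
        return "feature_and_structure_agree"
    return "feature_only"


# Toy usage on two tiny 2x3 "images" and two identical feature vectors.
img_a = [[10, 20, 15], [30, 30, 5]]
img_b = [[10, 5, 15], [30, 30, 5]]
dist = hamming(dhash_bits(img_a), dhash_bits(img_b))
label = dual_descriptor_label(cosine_similarity([1.0, 2.0], [1.0, 2.0]), dist)
```

The third label is the case the paragraph warns about: high feature-level similarity without structural corroboration, which a cosine-only rule would not separate from image reproduction.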
Third, we introduced a convergent threshold framework combining two methodologically distinct estimators---KDE antimode (with a Hartigan unimodality test) and an EM-fitted Beta mixture (with a logit-Gaussian robustness check)---together with a Burgstahler-Dichev / McCrary density-smoothness diagnostic.
Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the KDE antimode together with the Beta-mixture and logit-Gaussian estimators agree within $\sim 0.006$ at cosine $\approx 0.975$.
The Burgstahler-Dichev / McCrary test, by contrast, is largely null at the accountant level (no significant transition at two of three cosine bin widths and two of three dHash bin widths, with the one cosine transition sitting on the upper edge of the convergence band; Appendix A); at $N = 686$ accountants the test has limited power and cannot affirmatively establish smoothness, but its largely-null pattern is consistent with the smoothly-mixed cluster boundaries implied by the accountant-level GMM.
The substantive reading is therefore narrower than "discrete behavior": *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered into three recognizable groups whose inter-cluster boundaries are gradual rather than sharp.
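As a rough illustration of the KDE-antimode estimator named in this theme, the sketch below fits a fixed-bandwidth Gaussian KDE on a grid and returns the lowest-density point between the two highest modes. The bandwidth, grid size, and function names are assumptions made for the example; the Hartigan unimodality test that the paper pairs with the antimode is not implemented here.

```python
import math


def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a fixed-bandwidth Gaussian KDE at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


def kde_antimode(samples, bandwidth, n_grid=201):
    """Return the density minimum between the two highest KDE modes.

    Returns None when fewer than two interior modes are found, i.e. when
    the smoothed density looks unimodal at this bandwidth.
    """
    lo, hi = min(samples), max(samples)
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    dens = gaussian_kde(samples, grid, bandwidth)
    # Interior local maxima of the gridded density.
    peaks = [i for i in range(1, n_grid - 1)
             if dens[i - 1] < dens[i] >= dens[i + 1]]
    if len(peaks) < 2:
        return None
    # Keep the two tallest peaks, ordered left to right.
    left, right = sorted(sorted(peaks, key=lambda i: dens[i])[-2:])
    valley = min(range(left, right + 1), key=lambda i: dens[i])
    return grid[valley]


# Toy per-accountant mean cosine values (made up for illustration only).
toy_means = [0.88, 0.89, 0.90, 0.91, 0.92, 0.97, 0.975, 0.98, 0.985, 0.99]
threshold = kde_antimode(toy_means, bandwidth=0.01)
```

On real data the bandwidth choice matters: too wide a kernel merges the modes and the function returns `None`, which is one reason to cross-check the antimode against the mixture-based estimators.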
Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85--95% capture band differ by 1--5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure.
This framing is internally consistent with all available evidence: the byte-level pair analysis finding of 145 pixel-identical calibration-firm signatures across 50 distinct partners of 180 registered (Section IV-G.1); the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 registered CPAs; 178 after excluding two with disambiguation ties, Section IV-G.2), the 139 / 32 split between the high-replication and middle-band clusters.
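The CPA-level 70/30 split and the Wilson 95% intervals mentioned in this theme can be sketched as follows. The helper names, the seed, and the counts in the usage lines are hypothetical; they are not the paper's folds or capture rates.

```python
import math
import random


def split_cpas(cpa_ids, first_frac=0.7, seed=0):
    """Split at the CPA level so no accountant appears in both folds."""
    ids = sorted(set(cpa_ids))
    random.Random(seed).shuffle(ids)
    cut = round(first_frac * len(ids))
    return set(ids[:cut]), set(ids[cut:])


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Hypothetical usage: 10 CPA identifiers and a rule capturing 90 of 100 signatures.
fold_a, fold_b = split_cpas([f"cpa_{i}" for i in range(10)])
low, high = wilson_interval(90, 100)
```

Splitting by CPA rather than by signature keeps all signatures of one accountant in the same fold, so differences between folds reflect between-accountant heterogeneity rather than leakage of the same signer across folds.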
An ablation study comparing ResNet-50, VGG-16, and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
## Future Work
Several directions merit further investigation.
Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
Extending the accountant-level analysis to auditor-year units---using the same convergent threshold framework at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade.
The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself.
Finally, integration with regulatory monitoring systems and a larger negative-anchor study (for example, drawing from inter-CPA pairs under explicit accountant-level blocking) would strengthen the practical deployment potential of this approach.
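For readers who want to try the pixel-identity anchor in another setting, one possible sketch is to hash each cropped signature and flag those that share a digest with another signature by the same signer in a different document. The record layout `(sig_id, cpa_id, report_id, raw_bytes)` and the function name are assumptions for illustration; the paper's own Script 19 is not reproduced here.

```python
import hashlib
from collections import defaultdict


def byte_identical_signatures(records):
    """Flag signatures byte-identical to a same-CPA signature in another report.

    records: iterable of (sig_id, cpa_id, report_id, raw_bytes) tuples,
    a hypothetical layout chosen for this example.
    """
    buckets = defaultdict(list)
    for sig_id, cpa_id, report_id, raw in records:
        digest = hashlib.sha256(raw).hexdigest()
        buckets[(cpa_id, digest)].append((sig_id, report_id))

    flagged = set()
    for members in buckets.values():
        # Require at least two distinct reports so that duplicates within
        # a single report do not count as cross-report replication.
        if len({report_id for _, report_id in members}) > 1:
            flagged.update(sig_id for sig_id, _ in members)
    return flagged


# Toy records: only s1 and s2 match across reports for the same CPA.
toy_records = [
    ("s1", "A", "r1", b"abc"),
    ("s2", "A", "r2", b"abc"),
    ("s3", "A", "r2", b"xyz"),
    ("s4", "B", "r3", b"abc"),
    ("s5", "A", "r1", b"q"),
    ("s6", "A", "r1", b"q"),
]
flagged = byte_identical_signatures(toy_records)
```

Grouping by digest avoids comparing every pair of images directly, which matters when the corpus holds hundreds of thousands of signatures.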