Fully addresses the partial-resolution / unfixed items from codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):
Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier
table had 1-4-unit transcription errors in k values and a fabricated
cos > 0.9407 calibration row; both fixed by rerunning Script 24
with cos = 0.9407 added to COS_RULES and copying exact values from
the JSON output.
- Section III-L classifier now defined entirely in terms of the
independent-minimum dHash statistic that the deployed code (Scripts
21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
language is removed. Tables IX, XI, XII, XVI are now arithmetically
consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
per-signature cosine distribution, matching III-L and IV-F.
Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
limit. Removed "we break the circularity" overclaim; replaced with
"report capture rates on both folds with Wilson 95% intervals to
make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
Methods/Results don't deliver; replaced with anchor-based capture /
FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
intra-report consistency (IV-H.3) is a different test (two co-signers
on the same report, firm-level homogeneity) and is not a within-CPA
year-level mixing check; the assumption is maintained as a bounded
identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
the partner-level ranking is threshold-free"; longitudinal-stability
uses 0.95 cutoff, intra-report uses the operational classifier.
Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
Regular Papers do not have a standalone Impact Statement). The file
itself is retained as an archived non-paper note for cover-letter /
grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
[35] VLM survey, [36] Mann-Whitney) are now cited in-text:
[27] in Methodology III-E (dHash definition)
[31][32][33] in Introduction (audit-quality regulation context)
[34][35] in Methodology III-C/III-D
[36] in Results IV-C (Mann-Whitney result)
Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.1 KiB
VI. Conclusion and Future Work
Conclusion
We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale. Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three methodologically distinct methods applied at two analysis levels.
Our contributions are fourfold.
First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.
Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
Third, we introduced a three-method threshold framework combining KDE antimode (with a Hartigan unimodality test), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check).
Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the KDE antimode together with the Beta-mixture and logit-Gaussian estimators agree within \sim 0.006 at cosine \approx 0.975.
The Burgstahler-Dichev / McCrary test, by contrast, finds no significant transition at the accountant level, consistent with clustered but smoothly mixed rather than sharply discrete accountant-level heterogeneity.
The substantive reading is therefore narrower than "discrete behavior": pixel-level output quality is continuous and heavy-tailed, and accountant-level aggregate behavior is clustered with smooth cluster boundaries.
Fourth, we introduced a replication-dominated calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor. To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85-95% capture band differ by 1-5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure. This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
Future Work
Several directions merit further investigation. Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study. Extending the accountant-level analysis to auditor-year units---using the same three-method convergent framework but at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade. The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration. The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself. Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.