0ff1845b22
Three blockers from codex gpt-5.4 round-3 review (codex_review_gpt54_v3_3.md):
B1 Classifier vs three-method threshold mismatch
- Methodology III-L rewritten to make explicit that the per-signature
classifier and the accountant-level three-method convergence operate
at different units (signature vs accountant) and are complementary
rather than substitutable.
- Add Results IV-G.3 + Table XII operational-threshold sensitivity:
cos>0.95 vs cos>0.945 shifts dual-rule capture by 1.19 pp on whole
Firm A; ~5% of signatures flip at the Uncertain/Moderate boundary.
B2 Held-out validation false "within Wilson CI" claim
- Script 24 recomputes both calibration-fold and held-out-fold rates
with Wilson 95% CIs and a two-proportion z-test on each rule.
- Table XI replaced with the proper fold-vs-fold comparison; prose
in Results IV-G.2 and Discussion V-C corrected: extreme rules agree
across folds (p>0.7); operational rules in the 85-95% band differ
by 1-5 pp due to within-Firm-A heterogeneity (random 30% sample
contained more high-replication C1 accountants), not generalization
failure.
B3 Interview evidence reframed as practitioner knowledge
- The Firm A "interviews" referenced throughout v3.3 are private,
informal professional conversations, not structured research
interviews. Reframed accordingly: all "interview*" references in
abstract / intro / methodology / results / discussion / conclusion
are replaced with "domain knowledge / industry-practice knowledge".
- This avoids overclaiming methodological formality and removes the
human-subjects research framing that triggered the ethics-statement
requirement.
- Section III-H four-pillar Firm A validation now stands on visual
inspection, signature-level statistics, accountant-level GMM, and
the three Section IV-H analyses, with practitioner knowledge as
background context only.
- New Section III-M ("Data Source and Firm Anonymization") covers
MOPS public-data provenance, Firm A/B/C/D pseudonymization, and
conflict-of-interest declaration.
Add signature_analysis/24_validation_recalibration.py for the recomputed
calib-vs-held-out z-tests and the classifier sensitivity analysis;
output in reports/validation_recalibration/.
Pending (not in this commit): abstract length (368 -> 250 words),
Impact Statement removal, BD/McCrary sensitivity reporting, full
reproducibility appendix, references cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
33 lines
4.9 KiB
Markdown
33 lines
4.9 KiB
Markdown
# VI. Conclusion and Future Work
|
|
|
|
## Conclusion
|
|
|
|
We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
|
|
Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three methodologically distinct methods applied at two analysis levels.
|
|
|
|
Our contributions are fourfold.
|
|
|
|
First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.
|
|
|
|
Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
|
|
|
|
Third, we introduced a three-method threshold framework combining KDE antimode (with a Hartigan unimodality test), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check).
|
|
Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the KDE antimode together with the Beta-mixture and logit-Gaussian estimators agree within $\sim 0.006$ at cosine $\approx 0.975$.
|
|
The Burgstahler-Dichev / McCrary test, by contrast, finds no significant transition at the accountant level, consistent with clustered but smoothly mixed rather than sharply discrete accountant-level heterogeneity.
|
|
The substantive reading is therefore narrower than "discrete behavior": *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered with smooth cluster boundaries.
|
|
|
|
Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
|
|
To break the circularity of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report post-hoc capture rates on the held-out fold with Wilson 95% confidence intervals.
|
|
This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
|
|
|
|
An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
|
|
|
|
## Future Work
|
|
|
|
Several directions merit further investigation.
|
|
Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
|
|
Extending the accountant-level analysis to auditor-year units---using the same three-method convergent framework but at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade.
|
|
The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
|
|
The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself.
|
|
Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
|