Implements codex gpt-5.4 recommendation (paper/codex_bd_mccrary_opinion.md,
"option (c) hybrid"): demote BD/McCrary in the main text from a co-equal
threshold estimator to a density-smoothness diagnostic, and add a
bin-width sensitivity appendix as an audit trail.
Why: the bin-width sweep (Script 25) confirms that at the signature
level the BD transition moves with bin width (Firm A cosine falls
monotonically, 0.987 -> 0.985 -> 0.980 -> 0.975, as bin width widens
0.003 -> 0.015; full-sample dHash transitions shift from 2 to 10 to 9
across bin widths 1 / 2 / 3) and Z statistics inflate superlinearly
with bin width, both characteristic of a histogram-resolution
artifact. At the accountant level the BD null is robust across the
sweep. The paper's earlier "three methodologically distinct
estimators" framing therefore could not be defended to an IEEE Access
reviewer once the sweep was run.
Added
- signature_analysis/25_bd_mccrary_sensitivity.py: bin-width sweep
across 6 variants (Firm A / full-sample / accountant-level, each
cosine + dHash_indep) and 3-4 bin widths per variant. Reports
Z_below, Z_above, p-values, and number of significant transitions
per cell. Writes reports/bd_sensitivity/bd_sensitivity.{json,md}.
- paper/paper_a_appendix_v3.md: new "Appendix A. BD/McCrary Bin-Width
Sensitivity" with Table A.I (all 20 sensitivity cells) and
interpretation linking the empirical pattern to the main-text
framing decision.
- export_v3.py: appendix inserted into SECTIONS between conclusion
and references.
- paper/codex_bd_mccrary_opinion.md: codex gpt-5.4 recommendation
captured verbatim for audit trail.
Main-text reframing
- Abstract: "three methodologically distinct estimators" ->
"two estimators plus a Burgstahler-Dichev/McCrary density-
smoothness diagnostic". Trimmed to 243 words.
- Introduction: related-work summary, pipeline step 5, accountant-
level convergence sentence, contribution 4, and section-outline
line all updated. Contribution 4 renamed to "Convergent threshold
framework with a smoothness diagnostic".
- Methodology III-I: section renamed to "Convergent Threshold
Determination with a Density-Smoothness Diagnostic". "Method 2:
BD/McCrary Discontinuity" converted to "Density-Smoothness
Diagnostic" in a new subsection; Method 3 (Beta mixture) renumbered
to Method 2. Subsections 4 and 5 updated to refer to "two threshold
estimators" with BD as diagnostic.
- Methodology III-A pipeline overview: "three methodologically
distinct statistical methods" -> "two methodologically distinct
threshold estimators complemented by a density-smoothness
diagnostic".
- Methodology III-L: "three-method analysis" -> "accountant-level
threshold analysis (KDE antimode, Beta-2 crossing, logit-Gaussian
robustness crossing)".
- Results IV-D.1 heading: "BD/McCrary Discontinuity" ->
"BD/McCrary Density-Smoothness Diagnostic". Prose now notes the
Appendix-A bin-width instability explicitly.
- Results IV-E: Table VIII restructured to label BD rows
"(diagnostic only; bin-unstable)" and "(diagnostic; null across
Appendix A)". Summary sentence rewritten to frame BD null as
evidence for clustered-but-smoothly-mixed rather than as a
convergence failure. Table cosine P5 row corrected from 0.941 to
0.9407 to match III-K.
- Results IV-G.3 and IV-I.2: "three-method convergence/thresholds"
-> "accountant-level convergent thresholds" (clarifies the 3
converging estimates are KDE antimode, Beta-2, logit-Gaussian,
not KDE/BD/Beta).
- Discussion V-B: "three-method framework" -> "convergent threshold
framework".
- Conclusion: "three methodologically distinct methods" -> "two
threshold estimators and a density-smoothness diagnostic";
contribution 3 restated; future-work sentence updated.
- Impact Statement (archived): "three methodologically distinct
threshold-selection methods" -> "two methodologically distinct
threshold estimators plus a density-smoothness diagnostic" so the
archived text is internally consistent if reused.
Discussion V-B / V-G already framed BD as a diagnostic in v3.5
(unchanged in this commit). The reframing therefore brings Abstract /
Introduction / Methodology / Results / Conclusion into alignment with
the Discussion framing that codex had already endorsed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I. Introduction
Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection. In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require that certifying CPAs affix their signature or seal (簽名或蓋章) to each audit report [1]. While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].
The digitization of financial reporting has introduced a practice that complicates this intent. As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one. This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step. From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise. We refer to signatures produced by either workflow collectively as non-hand-signed. Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement. The accounting literature has long examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33]. Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused. 
This practice, while potentially widespread, is invisible to report users and virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of image reproduction.
The distinction between non-hand-signing detection and signature forgery detection is both conceptually and technically important. The extensive body of research on offline signature verification [3]--[8] has focused almost exclusively on forgery detection---determining whether a questioned signature was produced by its purported author or by an impostor. This framing presupposes that the central threat is identity fraud. In our context, identity is not in question; the CPA is indeed the legitimate signer. The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports. This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction, requiring an analytical framework focused on detecting abnormally high similarity across documents.
A secondary methodological concern shapes the research design.
Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification.
Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference.
A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against naturally occurring anchor populations (byte-level identical pairs as a conservative gold positive subset and large random inter-CPA pairs as a gold negative population), reported with Wilson 95% confidence intervals on per-rule capture and false-acceptance rates (FAR), since precision and $F_1$ are not meaningful when the positive and negative anchor populations are sampled from different units.
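As a minimal sketch of the interval named in point (iii), and not the paper's own code, the Wilson score interval for a capture or false-acceptance rate can be computed with the standard library alone. The function name `wilson_interval` and any counts passed to it are illustrative assumptions.

```python
import math
from typing import Tuple

# Two-sided 95% normal quantile; change it for other confidence levels.
Z_95 = 1.959963984540054


def wilson_interval(successes: int, trials: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    `successes` could be, for example, the number of anchor pairs a rule
    captures, and `trials` the size of that anchor population.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

Unlike the simple normal-approximation interval, the Wilson interval stays informative at the extremes: `wilson_interval(0, 100)` has an upper bound of roughly 0.037, so a rule with zero observed false acceptances still reports a non-zero upper limit.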
Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures, where legitimate visual similarity between a signer's authentic signatures is expected and must be distinguished from image reproduction. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a convergent threshold framework for document-forensics threshold selection.
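For readers unfamiliar with the Burgstahler-Dichev diagnostic mentioned above, the following is a rough sketch of the textbook standardized-difference statistic, which compares each histogram bin with the average of its two neighbours. It is an assumption that this matches the variant used in this work; the function names and the binning helper are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def histogram_counts(values: Sequence[float], lo: float, width: float, n_bins: int) -> List[int]:
    """Count values into n_bins equal-width bins starting at lo (values outside are dropped)."""
    counts = [0] * n_bins
    for v in values:
        i = int(math.floor((v - lo) / width))
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts


def bd_standardized_differences(counts: Sequence[int]) -> List[Optional[float]]:
    """Standardized difference for every interior bin; edge bins get None.

    z_i = (n_i - (n_{i-1} + n_{i+1}) / 2) / sqrt(Var_i), with
    Var_i = N p_i (1 - p_i) + (1/4) N (p_{i-1} + p_{i+1}) (1 - p_{i-1} - p_{i+1}).
    """
    total = sum(counts)
    z: List[Optional[float]] = [None] * len(counts)
    if total == 0:
        return z
    for i in range(1, len(counts) - 1):
        p_mid = counts[i] / total
        p_nbr = (counts[i - 1] + counts[i + 1]) / total
        var = total * p_mid * (1.0 - p_mid) + 0.25 * total * p_nbr * (1.0 - p_nbr)
        if var > 0:
            expected = (counts[i - 1] + counts[i + 1]) / 2.0
            z[i] = (counts[i] - expected) / math.sqrt(var)
    return z
```

Re-running `histogram_counts` with several values of `width` and comparing where the largest `|z|` falls is one simple way to check whether an apparent discontinuity is stable or merely follows the histogram resolution.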
In this paper, we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale. Our approach processes raw PDF documents through the following stages: (1) signature page identification using a Vision-Language Model (VLM); (2) signature region detection using a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network; (4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance; (5) threshold determination using two methodologically distinct estimators---KDE antimode with a Hartigan unimodality test and finite Beta mixture via EM with a logit-Gaussian robustness check---complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic, all applied at both the signature level and the accountant level; and (6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm.
The dual-descriptor verification is central to our contribution. Cosine similarity of deep feature embeddings captures high-level visual style similarity---it can identify signatures that share similar stroke patterns and spatial layouts---but cannot distinguish between a CPA who signs consistently and one whose signature is reproduced from a stored image. Perceptual hashing (specifically, difference hashing) encodes structural-level image gradients into compact binary fingerprints that are robust to scan noise but sensitive to substantive content differences. By requiring convergent evidence from both descriptors, we can differentiate style consistency (high cosine but divergent dHash) from image reproduction (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone.
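The dual-descriptor logic described above can be sketched in a few lines. This is an illustrative toy, not the pipeline's implementation: it assumes the signature crop has already been converted to a small grayscale grid (for a 64-bit dHash, 8 rows by 9 columns), and the function names, label strings, and cutoff parameters are all hypothetical.

```python
import math
from typing import List, Sequence


def dhash_bits(gray: Sequence[Sequence[float]]) -> List[int]:
    """Difference hash: one bit per horizontally adjacent pixel pair.

    A bit is 1 when a pixel is brighter than its right-hand neighbour.
    """
    return [1 if row[x] > row[x + 1] else 0 for row in gray for x in range(len(row) - 1)]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of differing bits between two equal-length hashes."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-length embedding")
    return dot / (norm_u * norm_v)


def classify_pair(cos_sim: float, dhash_dist: int, cos_cut: float, dhash_cut: int) -> str:
    """Combine both descriptors; the cutoffs are supplied by the caller."""
    if cos_sim < cos_cut:
        return "dissimilar"
    # High cosine alone is ambiguous; the hash distance resolves it.
    return "image reproduction" if dhash_dist <= dhash_cut else "style consistency"
```

The cutoffs are deliberately left as arguments rather than defaults, since choosing them is the job of the threshold-determination step rather than of the descriptors themselves.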
A second distinctive feature is our framing of the calibration reference. One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing for the majority of its certifying partners, while not ruling out that a minority may continue to hand-sign some reports. We therefore treat Firm A as a replication-dominated calibration reference rather than a pure positive class. This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in total) cluster into an accountant-level "middle band" rather than the high-replication mode. Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence among the visual-inspection evidence, the signature-level statistics, and the accountant-level mixture.
A third distinctive feature is our unit-of-analysis treatment.
Our threshold-framework analysis reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are clustered into three recognizable groups (BIC-best K = 3).
The substantive reading is that pixel-level output quality is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, while accountant-level aggregate behaviour is clustered but not sharply discrete---a given CPA tends to cluster into a dominant regime (high-replication, middle-band, or hand-signed-tendency), though the boundaries between regimes are smooth rather than discontinuous.
At the accountant level, the KDE antimode and the two mixture-based estimators (Beta-2 crossing and its logit-Gaussian robustness counterpart) converge within about 0.006 of one another on a cosine threshold of approximately 0.975, while the Burgstahler-Dichev / McCrary density-smoothness diagnostic finds no significant transition, an outcome (robust across a bin-width sweep, Appendix A) consistent with smoothly mixed clusters.
The two-dimensional GMM marginal crossings (cosine = 0.945, dHash = 8.10) are reported as a complementary cross-check rather than as the primary accountant-level threshold.
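To make the notion of a KDE antimode concrete, here is a minimal pure-Python sketch that evaluates a Gaussian kernel density estimate on a grid and returns the lowest interior local minimum. The bandwidth, grid size, and function names are assumptions chosen for illustration; the paper's own bandwidth rule and the accompanying Hartigan dip test are not reproduced here.

```python
import math
from typing import List, Optional, Sequence


def gaussian_kde(samples: Sequence[float], grid: Sequence[float], bandwidth: float) -> List[float]:
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]


def kde_antimode(samples: Sequence[float], bandwidth: float, grid_size: int = 201) -> Optional[float]:
    """Location of the lowest interior local minimum of the KDE, or None.

    None means the estimated density has no interior local minimum on the
    grid, which is what a unimodal sample produces.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    lo, hi = min(samples), max(samples)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = gaussian_kde(samples, grid, bandwidth)
    minima = [
        i for i in range(1, grid_size - 1)
        if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]
    ]
    if not minima:
        return None
    return grid[min(minima, key=lambda i: dens[i])]
```

With two well-separated clusters of per-accountant values the function returns a point between them, which can then be read as a candidate threshold. With a single cluster it returns `None`, which is why the antimode is paired with a unimodality test before any threshold is reported.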
We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature.
The contributions of this paper are summarized as follows:
- Problem formulation. We formally define non-hand-signing detection as distinct from signature forgery detection and argue that it requires an analytical framework focused on intra-signer similarity distributions rather than genuine-versus-forged classification.
- End-to-end pipeline. We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity computation, with automated inference requiring no manual intervention after initial training and annotation.
- Dual-descriptor verification. We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures.
- Convergent threshold framework with a smoothness diagnostic. We introduce a threshold-selection framework that applies two methodologically distinct estimators (KDE antimode with Hartigan unimodality test and EM-fitted Beta mixture with a logit-Gaussian robustness check) at both the signature and accountant levels, and uses a Burgstahler-Dichev / McCrary density-smoothness diagnostic to characterize the local density structure. The convergence of the two estimators, combined with the presence or absence of a BD/McCrary transition, is used as evidence about the mixture structure of the data.
- Continuous-quality / clustered-accountant finding. We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates cluster into three recognizable groups with smoothly mixed rather than sharply discrete boundaries, an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics.
- Replication-dominated calibration methodology. We introduce a calibration strategy using a known-majority-positive reference group, distinguishing replication-dominated from replication-pure anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.
- Large-scale empirical analysis. We report findings from the analysis of over 90,000 audit reports spanning a decade, providing the first large-scale empirical evidence on non-hand-signing practices in financial reporting under a methodology designed for peer-review defensibility.
The remainder of this paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination. Section III describes the proposed methodology. Section IV presents experimental results including the convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study. Section V discusses the implications and limitations of our findings. Section VI concludes with directions for future work.