V. Discussion
A. Non-Hand-Signing Detection as a Distinct Problem
Our results highlight the importance of distinguishing non-hand-signing detection from the well-studied signature forgery detection problem. In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature. In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).
This distinction has direct methodological consequences. Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures. Non-hand-signing detection, by contrast, requires sensitivity to the upper tail of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous. The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.
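As an illustration of the two descriptors, the following minimal Python sketch computes a cosine similarity between feature vectors and a difference-hash (dHash) Hamming distance between two small grayscale grids. It is a sketch under stated assumptions rather than the pipeline used in this study: the function names, the 8-row by 9-column grid (assumed to come from an upstream resize step), and the toy inputs are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dhash_bits(gray):
    """Difference hash of a grayscale grid with 8 rows and 9 columns.

    Each bit records whether a pixel is brighter than its right neighbour,
    so 8 rows x 8 comparisons give a 64-bit integer.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Toy inputs (hypothetical): two grids with opposite horizontal gradients.
ramp_down = [[9 - c for c in range(9)] for _ in range(8)]
ramp_up = [[c for c in range(9)] for _ in range(8)]

distance = hamming_distance(dhash_bits(ramp_down), dhash_bits(ramp_up))
similarity = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(distance, round(similarity, 6))
```

The two measures respond to different things: the cosine compares learned feature vectors, whereas the dHash distance compares the sign of local brightness gradients, which is why a pair can be close on one and distant on the other.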
B. Continuous-Quality Spectrum vs. Clustered Accountant-Level Heterogeneity
The most consequential empirical finding of this study is the asymmetry between the signature level and the accountant level revealed by the convergent threshold framework and the Hartigan dip test (Sections IV-D and IV-E).
At the per-signature level, the distribution of best-match cosine similarity is not cleanly bimodal.
Firm A's signature-level cosine is formally unimodal (dip test p = 0.17) with a long left tail.
The all-CPA signature-level cosine rejects unimodality (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
The BD/McCrary discontinuity test locates its transition at 0.985---inside the non-hand-signed mode rather than at a boundary between two mechanisms.
Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class.
Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level.
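The BIC comparison referred to above can be sketched in a few lines. The example below is illustrative only and makes two loud assumptions: it substitutes one-dimensional Gaussian components for the Beta components used in the signature-level analysis, and it runs on synthetic data with three well-separated groups. The function names, initialization scheme, and variance floor are hypothetical choices, not the authors' implementation.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, k, iters=60, var_floor=1e-6):
    """Fit a k-component 1D Gaussian mixture by EM and return its log-likelihood."""
    n = len(xs)
    ordered = sorted(xs)
    # Initialise means at evenly spaced quantiles (assumption, for determinism).
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    pooled_var = sum((x - grand_mean) ** 2 for x in xs) / n
    variances = [max(pooled_var / k, var_floor)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)
            weights[j] = nj / n
    return sum(
        math.log(max(sum(w * normal_pdf(x, m, v)
                         for w, m, v in zip(weights, means, variances)), 1e-300))
        for x in xs
    )


def bic(loglik, k, n):
    """BIC with 3k - 1 free parameters (k means, k variances, k - 1 weights)."""
    return (3 * k - 1) * math.log(n) - 2.0 * loglik


# Synthetic data (hypothetical): three groups of similarity scores.
random.seed(1)
data = ([random.gauss(0.900, 0.012) for _ in range(150)]
        + [random.gauss(0.950, 0.006) for _ in range(150)]
        + [random.gauss(0.985, 0.003) for _ in range(150)])

scores = {k: bic(fit_gmm_1d(data, k), k, len(data)) for k in range(1, 5)}
best_k = min(scores, key=scores.get)
print(best_k, scores)
```

Lower BIC is better. The penalty term grows with the number of components, so an extra component is retained only when it improves the log-likelihood by more than the added parameters cost.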
At the per-accountant aggregate level the picture partly reverses.
The distribution of per-accountant mean cosine (and mean dHash) rejects unimodality. A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms. The three 1D threshold methods applied at the accountant level produce mutually consistent estimates (KDE antimode = 0.973, Beta-2 crossing = 0.979, logit-GMM-2 crossing = 0.976).
The BD/McCrary test is largely null at the accountant level---no significant transition at two of three cosine bin widths and two of three dHash bin widths, and the one cosine transition (at bin 0.005, location 0.980) sits on the upper edge of the convergence band described above rather than outside it (Appendix A).
This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the test fails to reject the smoothness null at the sample size available (N = 686), and the GMM cluster boundaries appear gradual rather than sheer.
We qualify this interpretation in Section V-G: the BD null alone cannot affirmatively establish smoothness---only fail to falsify it---and our substantive claim of smoothly-mixed clustering rests on the joint weight of the GMM fit, the dip test, and the BD null rather than on the BD null alone.
The substantive interpretation we take from this evidence is therefore narrower than a "discrete-behaviour" claim: pixel-level output quality is continuous and heavy-tailed, and accountant-level aggregate behaviour is clustered (three recognizable groups) but not sharply discrete. The accountant-level mixture is a useful classifier of firm-and-practitioner-level signing regimes; individual behaviour may still transition or mix over time within a practitioner, and our cross-sectional analysis does not rule this out. Methodologically, the implication is that the two threshold estimators (KDE antimode, Beta mixture with logit-Gaussian robustness) are meaningfully applied at the accountant level for threshold estimation, while the BD/McCrary non-transition at the same level is a failure-to-reject rather than a failure of the method---informative alongside the other evidence but subject to the power caveat recorded in Section V-G.
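To make the KDE-antimode estimator concrete, the sketch below locates the density minimum between the two highest modes of a one-dimensional Gaussian kernel density estimate. It is a minimal illustration on synthetic per-accountant means, not the estimator configuration used in this study: the fixed bandwidth, grid size, function names, and simulated data are all assumptions.

```python
import math
import random


def gaussian_kde(samples, bandwidth):
    """Return a function evaluating a Gaussian KDE with a fixed bandwidth."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def kde_antimode(samples, bandwidth, grid_size=300):
    """Grid point of lowest density between the two highest local maxima.

    Returns None when the estimate has fewer than two local maxima.
    """
    lo, hi = min(samples), max(samples)
    xs = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    density = gaussian_kde(samples, bandwidth)
    ys = [density(x) for x in xs]
    peaks = [i for i in range(1, grid_size - 1) if ys[i - 1] < ys[i] >= ys[i + 1]]
    if len(peaks) < 2:
        return None
    # Keep the two tallest peaks, then order them left to right.
    left, right = sorted(sorted(peaks, key=lambda i: ys[i], reverse=True)[:2])
    valley = min(range(left, right + 1), key=lambda i: ys[i])
    return xs[valley]


# Synthetic per-accountant mean cosines (hypothetical): two behavioural groups.
random.seed(0)
samples = ([random.gauss(0.930, 0.010) for _ in range(200)]
           + [random.gauss(0.985, 0.004) for _ in range(200)])

threshold = kde_antimode(samples, bandwidth=0.005)
print(threshold)
```

The antimode depends on the bandwidth: a wider kernel can merge the two modes (in which case the function returns `None`), and a narrower one can create spurious valleys. That sensitivity is one reason to cross-check the estimate against mixture-based crossings.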
C. Firm A as a Replication-Dominated, Not Pure, Population
A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class. Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.
Three convergent strands of evidence support the replication-dominated framing.
First, the byte-level pair evidence: 145 Firm A signatures (from 50 distinct partners of 180 registered) have a byte-identical same-CPA match in a different audit report, with 35 of these matches spanning different fiscal years. Independent hand-signing cannot produce byte-identical images across distinct reports, so these pairs directly establish image reuse within Firm A.
Second, the signature-level statistical evidence: Firm A's per-signature cosine distribution is unimodal with a long left tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures ($\geq 10$) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---consistent with within-firm heterogeneity in signing output (potentially spanning hand-signing partners, multi-template replication workflows, CPAs undergoing mid-sample mechanism transitions, and CPAs whose pooled coordinates reflect mixed-quality replication; we do not disaggregate these mechanisms---see Section III-G for the scope of claims) rather than a pure replication population.
Of the 178 valid Firm A CPAs (the 180 registered CPAs minus two excluded for disambiguation ties in the registry; Section IV-G.2), seven are outside the GMM for having fewer than 10 signatures, so we cannot place them in a cluster from the cross-sectional analysis alone.
The held-out Firm A 70/30 validation (Section IV-G.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules (extreme rules are statistically indistinguishable; operational rules in the 85–95% band differ between folds by 1–5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure).
The accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to fold-level sampling variance.
The replication-dominated framing is consistent with all three strands of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise. We therefore recommend that future work building on this calibration strategy explicitly distinguish replication-dominated from replication-pure calibration anchors.
D. The Style-Replication Gap
The dHash descriptor partitions the 71,656 documents exceeding cosine 0.95 into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.6%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
The 7.2% classified as "high style consistency" (cosine > 0.95 but dHash > 15) are particularly informative.
Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions.
Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting.
Some may use signing pads or templates that further constrain variability without constituting image-level reproduction.
The dual-descriptor framework separates these cases from non-hand-signed signatures by detecting the absence of structural-level convergence.
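The partition described in this subsection can be expressed as a small decision rule. The sketch below is a hedged illustration: the cosine cut (0.95) and the dHash > 15 rule are taken from the text, whereas the boundary between the high-confidence and moderate bands (`DHASH_HIGH_CONF_MAX`) is a hypothetical placeholder, since this section does not state it. The counts are those reported above; the function and label names are assumptions.

```python
COSINE_MIN = 0.95          # from the text
DHASH_STYLE_MIN = 15       # from the text: dHash > 15 means no structural corroboration
DHASH_HIGH_CONF_MAX = 5    # HYPOTHETICAL cut between high-confidence and moderate bands


def structural_band(cosine, dhash):
    """Assign a document to one of the bands discussed in this subsection."""
    if cosine <= COSINE_MIN:
        return "below_cosine_cut"
    if dhash > DHASH_STYLE_MIN:
        return "high_style_consistency"
    if dhash <= DHASH_HIGH_CONF_MAX:
        return "high_confidence"
    return "moderate"


def shares(counts):
    """Percentage share of each band, rounded to one decimal place."""
    total = sum(counts.values())
    return {band: round(100.0 * n / total, 1) for band, n in counts.items()}


# Counts reported in the text for documents exceeding cosine 0.95.
reported = {
    "high_confidence": 29529,
    "moderate": 36994,
    "high_style_consistency": 5133,
}

print(sum(reported.values()), shares(reported))
print(structural_band(0.99, 20), structural_band(0.99, 2), structural_band(0.90, 2))
```

A cosine-only rule would collapse the first three bands into one; the second descriptor is what lets the high-style-consistency group be reported separately.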
E. Value of a Replication-Dominated Calibration Group
The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels. In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance. Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset.
This calibration strategy has broader applicability beyond signature analysis. Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives. The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates the within-firm heterogeneity quantified by the accountant-level mixture, and yields classification rates that are internally consistent with the data.
F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation
A further methodological contribution is the combination of byte-level pixel identity as an annotation-free conservative gold positive and a large random-inter-CPA negative anchor. Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is pair-level proof of image reuse and, modulo the narrow source-template edge case discussed in the seventh limitation below, a conservative positive for non-hand-signing without requiring human review. In our corpus 310 signatures satisfied this condition. We emphasize that byte-identical pairs are a subset of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways). Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.
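One way to identify such byte-identical same-CPA matches is sketched below. This is an assumption-laden illustration rather than the project's script: it groups signature crops by CPA and by a SHA-256 digest of their bytes (a shortcut for direct byte comparison), and flags a signature only when an identical crop by the same CPA appears in a different report. The record layout, identifiers, and function name are hypothetical.

```python
import hashlib
from collections import defaultdict


def byte_identical_signatures(records):
    """Flag signatures whose bytes match another same-CPA signature in a different report.

    records: iterable of (sig_id, cpa_id, report_id, image_bytes), where image_bytes
    is the crop after normalization (assumed to be produced upstream).
    """
    groups = defaultdict(list)
    for sig_id, cpa_id, report_id, image_bytes in records:
        digest = hashlib.sha256(image_bytes).hexdigest()
        groups[(cpa_id, digest)].append((sig_id, report_id))

    flagged = set()
    for members in groups.values():
        reports = {report_id for _, report_id in members}
        # More than one report in the group means every member has a
        # byte-identical counterpart in a different report.
        if len(reports) > 1:
            flagged.update(sig_id for sig_id, _ in members)
    return flagged


# Toy records (hypothetical identifiers and byte strings).
records = [
    ("s1", "cpaA", "r1", b"img-1"),
    ("s2", "cpaA", "r2", b"img-1"),   # same CPA, different report, same bytes
    ("s3", "cpaA", "r3", b"img-2"),   # unique bytes
    ("s4", "cpaB", "r4", b"img-1"),   # same bytes but a different CPA
    ("s5", "cpaC", "r5", b"img-3"),
    ("s6", "cpaC", "r5", b"img-3"),   # same CPA and bytes, but the same report
]

flagged = byte_identical_signatures(records)
print(sorted(flagged))
```

Because the rule requires exact equality, any re-scan or re-compression of the same template falls outside it, which is the sense in which the anchor is conservative.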
Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), which is a substantive improvement over the low-similarity same-CPA negative (n = 35) we originally considered.
The combination is a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself, provided that its generalization limits are acknowledged: FAR is informative, whereas recall is valid only for the conservative subset.
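The effect of anchor size on interval width can be illustrated with the standard Wilson score interval. The sketch below uses hypothetical event counts (zero false accepts in both anchors) purely to contrast n = 35 with n = 50,000; it does not reproduce the values in Table X, and the function name is an assumption.

```python
import math


def wilson_interval(events, trials, z=1.959964):
    """Two-sided Wilson score interval for a binomial proportion.

    The default z = 1.959964 corresponds to a 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = events / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# HYPOTHETICAL counts: zero false accepts observed in each negative anchor.
small_anchor = wilson_interval(0, 35)
large_anchor = wilson_interval(0, 50_000)
print(small_anchor, large_anchor)
```

With the same observed rate, the upper bound shrinks roughly in proportion to 1/n, which is why the larger inter-CPA anchor supports a much tighter FAR statement than the 35-pair anchor.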
G. Limitations
Several limitations should be acknowledged.
First, comprehensive per-document ground truth labels are not available.
The pixel-identity anchor is a strict subset of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class.
The low-similarity same-CPA anchor (n = 35) is small because intra-CPA pairs rarely fall below cosine 0.70; we use the $\sim$50,000-pair inter-CPA negative anchor as the primary negative reference, which yields tight Wilson 95% CIs on FAR (Table X), but it too does not exhaust the set of true negatives (in particular, same-CPA hand-signed pairs with moderate cosine similarity are not sampled).
A manual-adjudication study concentrated at the decision boundary---for example 100--300 auditor-years stratified by cosine band---would further strengthen the recall estimate against the full positive class.
Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning. While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.
Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity. This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified.
Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements. While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions.
Fifth, the accountant-level summary (Section III-J) is a cross-year pooled statistic by construction, so a CPA whose signing mechanism changed mid-sample is placed at a weighted mix of component means rather than at a single regime centroid. Extending the accountant-level analysis to auditor-year units---using the same convergent threshold framework at finer temporal resolution---is the natural next step for resolving such within-accountant transitions.
Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, and the test produces no significant transition at all at the accountant level.
In our application, therefore, BD/McCrary contributes diagnostic information about local density-smoothness rather than an independent accountant-level threshold estimate; that role is played by the KDE antimode and the two mixture-based estimators.
We emphasize that the accountant-level BD/McCrary null is consistent with---not affirmative proof of---smoothly mixed cluster boundaries: the BD/McCrary test is known to have limited statistical power at modest sample sizes, and with N = 686 accountants in our analysis the test cannot reliably detect anything less than a sharp cliff-type density discontinuity.
Failure to reject the smoothness null at this sample size therefore reinforces BD/McCrary's role as a diagnostic rather than a definitive estimator; the substantive claim of smoothly-mixed accountant-level clustering rests on the joint weight of the dip-test and Beta-mixture evidence together with the BD null, not on the BD null alone.
Seventh, the max/min detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case that one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as the stamping or e-signature template, the pair correctly identifies image reuse but misattributes the non-hand-signed status to the source exemplar. This misattribution affects at most one source document per template variant per CPA (the exemplar from which the template was produced), is not expected to be common given that stored signature templates are typically generated in a separate acquisition step rather than extracted from submitted audit reports, and does not materially affect aggregate capture rates at the firm level.
Eighth, our analyses remain at the signature level and the accountant (cross-year pooled) level; we abstain from partner-level frequency inferences such as "X% of CPAs hand-sign in a given year." Per-signature labels in this paper are not translated to per-report or per-partner mechanism assignments, because making such a translation would require an assumption of within-year uniformity of signing mechanisms that we do not adopt: a CPA's signatures within a single fiscal year may reflect a single replication template, multiple templates used in parallel (e.g., for different engagement positions or reporting pipelines), within-year mechanism mixing, or a combination, and the data at hand do not disambiguate these possibilities (Section III-G). The signature-level rates we report, including the 92.5% / 7.5% Firm A split and the year-by-year left-tail share of Section IV-H.1, should accordingly be read as signature-level quantities rather than partner-level frequencies.
Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing." Whether reproducing a CPA's own stored signature, rather than signing by hand, constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.