# V. Discussion

## A. Non-Hand-Signing Detection as a Distinct Problem

Our results highlight the importance of distinguishing *non-hand-signing detection* from the well-studied *signature forgery detection* problem. In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature. In non-hand-signing detection, the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).

This distinction has direct methodological consequences. Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures. Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous. The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.

## B. Per-Signature Similarity Is a Continuous Quality Spectrum

A central empirical finding of this study is that per-signature similarity does not form a clean two-mechanism mixture (Section IV-D). Firm A's signature-level cosine is formally unimodal (Hartigan dip test $p = 0.17$) with a long left tail.
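The dual-descriptor rule of Section V-A can be made concrete with a short sketch. The cut values below are hypothetical placeholders, not the calibrated thresholds, and `classify` simply names the bands discussed later in Section V-D:

```python
import math

def cosine(u, v):
    """Semantic-level score: cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(cos_sim, dhash_dist, cos_cut=0.95, structural_cut=15):
    """Illustrative dual-descriptor bands; thresholds are placeholders."""
    if cos_sim <= cos_cut:
        return "below semantic cut"
    if dhash_dist > structural_cut:
        # semantic match without structural corroboration
        return "high style consistency"
    return "structural replication candidate"
```

The point of the sketch is the asymmetry: a cosine-only rule collapses the last two bands, whereas the structural distance keeps them apart.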
The all-CPA signature-level cosine rejects unimodality ($p < 0.001$), reflecting the heterogeneity of signing practices across firms, but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit ($\Delta\text{BIC} = 381$ for Firm A; $10{,}175$ for the full sample), and the forced two-component Beta crossing and its logit-GMM robustness counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). The BD/McCrary discontinuity test locates its transition at cosine 0.985---*inside* the non-hand-signed mode rather than at a boundary between two mechanisms---and the transition is not bin-width-stable (Appendix A).

Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class cleanly separated from hand-signing. Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level.

The methodological implication is that the operational classifier's cosine cut should not be derived from a mixture-fit crossing. We accordingly anchor the operational cosine cut on the whole-sample Firm A P7.5 percentile (Section III-K), and treat the signature-level threshold-estimator outputs (KDE antimode, Beta and logit-Gaussian crossings) as descriptive characterization of the similarity distribution rather than as the source of operational thresholds. The BD/McCrary procedure plays a *density-smoothness diagnostic* role in this framing rather than that of an independent threshold estimator.

This continuous-spectrum finding also has substantive implications for downstream interpretation.
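The percentile anchoring of Section III-K amounts to taking the value below which the left-tail 7.5% of calibration-firm signatures fall. A minimal sketch with synthetic cosine values (the actual cut is computed on the full Firm A sample):

```python
def percentile(values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a sample."""
    xs = sorted(values)
    if len(xs) == 1:
        return xs[0]
    rank = (p / 100) * (len(xs) - 1)
    lo = int(rank)
    frac = rank - lo
    if lo + 1 < len(xs):
        return xs[lo] + frac * (xs[lo + 1] - xs[lo])
    return xs[lo]

# Anchor the operational cosine cut at the calibration firm's P7.5,
# i.e. the value below which the left-tail 7.5% of signatures fall.
firm_a_cosine = [0.99, 0.98, 0.97, 0.99, 0.96, 0.90, 0.995, 0.985, 0.99, 0.97]
cut = percentile(firm_a_cosine, 7.5)
```

Because the cut is an order statistic of the calibration sample rather than a mixture-fit crossing, it does not presuppose any particular parametric shape for the similarity distribution.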
Because pixel-level output quality varies continuously, *signature-level rates* (such as the 92.5% / 7.5% Firm A split) reflect the share of signatures whose similarity falls above or below a chosen threshold rather than the share that came from a "non-hand-signing mechanism" versus a "hand-signing mechanism." We accordingly report all rates as signature-level quantities and abstain from partner-level frequency claims (Section III-G).

## C. Firm A as a Replication-Dominated, Not Pure, Population

A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class. Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference. Two convergent strands of evidence support the replication-dominated framing.

First, the byte-level pair evidence: 145 Firm A signatures (from 50 distinct partners of 180 registered) have a byte-identical same-CPA match in a different audit report, with 35 of these matches spanning different fiscal years. Independent hand-signing cannot produce byte-identical images across distinct reports, so these pairs directly establish image reuse within Firm A as a concrete, threshold-free phenomenon, and the 50/180 partner spread shows that replication is widespread rather than confined to a handful of CPAs.

Second, the signature-level distributional evidence: Firm A's per-signature cosine distribution is unimodal long-tail (Hartigan dip test $p = 0.17$) rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
The unimodal-long-tail *shape*, not the precise 92.5 / 7.5 split, is the structural evidence: it is consistent with a dominant high-similarity regime plus residual within-firm heterogeneity, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).

Two additional checks, reported in Section IV-G, are robust to threshold choice and complement the two primary strands. The held-out Firm A 70/30 validation (Section IV-F.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules: extreme rules are statistically indistinguishable, and operational rules in the 85--95% band differ between folds by 1--5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure. The threshold-independent partner-ranking analysis (Section IV-G.2) shows that Firm A auditor-years occupy 95.9% of the top decile of similarity-ranked auditor-years against a 27.8% baseline share---a 3.5$\times$ concentration ratio that uses only ordinal ranking and is independent of any absolute cutoff.

The replication-dominated framing is internally coherent with both pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise. We therefore recommend that future work building on this calibration strategy explicitly distinguish replication-dominated from replication-pure calibration anchors.

## D. The Style-Replication Gap

The dHash descriptor partitions the 71,656 documents exceeding cosine $0.95$ into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance. A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.

The 7.2% classified as "high style consistency" (cosine $> 0.95$ but dHash $> 15$) are particularly informative. Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions. Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting. Some may use signing pads or templates that further constrain variability without constituting image-level reproduction. The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence.

## E. Value of a Replication-Dominated Calibration Group

The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels. In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
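The structural descriptor behind the Section V-D partition, dHash, can be sketched in pure Python. The sketch assumes the signature crop has already been resized to an 8-row by 9-column grayscale grid (in practice via an image library such as Pillow), a step omitted here:

```python
def dhash(pixels):
    """64-bit difference hash of an 8-row x 9-column grayscale grid:
    each bit records whether a pixel is brighter than its right neighbor."""
    assert len(pixels) == 8 and all(len(row) == 9 for row in pixels)
    h = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            h = (h << 1) | (1 if left > right else 0)
    return h

def hamming(h1, h2):
    """Number of differing bits between two 64-bit hashes."""
    return bin(h1 ^ h2).count("1")
```

Independently hand-signed exemplars typically differ in many of the 64 bits, whereas reproductions of one stored image converge to small Hamming distances; the dHash $> 15$ band corresponds to a large distance at this 64-bit scale.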
Our approach uses practitioner background---one Big-4 firm reportedly relies predominantly on stamping or e-signing workflows---only as a *motivation* for selecting that firm as a candidate reference population; the calibration role is then established from the audit-report images themselves (byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency), so the calibration does not depend on the practitioner-background claim being externally verified (Section III-H).

This calibration strategy has broader applicability beyond signature analysis. Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based threshold estimators are preferred over parametric alternatives. The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates the within-firm heterogeneity visible in the unimodal-long-tail shape of Firm A's per-signature cosine distribution, and yields classification rates that are internally consistent with the data.

## F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation

A further methodological contribution is the combination of byte-level pixel identity as an annotation-free *conservative* gold positive and a large random inter-CPA negative anchor. Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is pair-level proof of image reuse and, modulo the narrow source-template edge case discussed in the fifth limitation below, a conservative positive for non-hand-signing without requiring human review.
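The byte-identity check itself reduces to exact hashing of the normalized crop bytes and flagging same-CPA collisions across distinct reports. A minimal sketch (the triple layout and the choice of SHA-256 are illustrative, not the paper's implementation):

```python
import hashlib
from collections import defaultdict

def byte_identical_pairs(signatures):
    """signatures: iterable of (cpa_id, report_id, image_bytes) triples,
    where image_bytes is the cropped, normalized signature image.
    Returns same-CPA pairs of distinct reports whose images are
    byte-identical (a conservative positive for image reuse)."""
    buckets = defaultdict(list)
    for cpa_id, report_id, image_bytes in signatures:
        digest = hashlib.sha256(image_bytes).hexdigest()
        buckets[(cpa_id, digest)].append(report_id)
    pairs = []
    for (cpa_id, _), reports in buckets.items():
        for i in range(len(reports)):
            for j in range(i + 1, len(reports)):
                if reports[i] != reports[j]:
                    pairs.append((cpa_id, reports[i], reports[j]))
    return pairs
```

Because the digest is computed on the exact byte stream after crop and normalization, a collision within the same CPA across different reports cannot arise from two independent signing events.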
In our corpus, 310 signatures satisfied this condition. We emphasize that byte-identical pairs are a *subset* of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways). Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.

Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), a substantive improvement over the low-similarity same-CPA negative ($n = 35$) we originally considered. The combination is a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself, provided that its generalization limits are acknowledged: FAR is informative, whereas recall is valid only for the conservative subset.

## G. Limitations

Several limitations should be acknowledged.

First, comprehensive per-document ground truth labels are not available. The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class.
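The Wilson interval behind those FAR bounds is straightforward to compute; the counts below are synthetic placeholders, not the paper's:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (z = 1.96 gives the 95% level)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

# e.g. 5 false accepts among 50,000 inter-CPA negative pairs (synthetic)
far_lo, far_hi = wilson_interval(5, 50_000)
```

Unlike the normal-approximation interval, the Wilson interval stays inside $[0, 1]$ and remains informative when the false-accept count is near zero, which is exactly the regime of a $\sim$50,000-pair negative anchor.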
The low-similarity same-CPA anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70; we use the $\sim$50,000-pair inter-CPA negative anchor as the primary negative reference, which yields tight Wilson 95% CIs on FAR (Table X), but it too does not exhaust the set of true negatives (in particular, same-CPA hand-signed pairs with moderate cosine similarity are not sampled). A manual-adjudication study concentrated at the decision boundary---for example, 100--300 auditor-years stratified by cosine band---would further strengthen the recall estimate against the full positive class.

Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning. While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.

Third, the red-stamp-removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity. This effect would bias classification toward false negatives rather than false positives, but its magnitude has not been quantified.

Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements. While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.

Fifth, the max/min detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed.
In the rare case that one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as the stamping or e-signature template, the pair correctly identifies image reuse but misattributes non-hand-signed status to the source exemplar. This misattribution affects at most one source document per template variant per CPA (the exemplar from which the template was produced), is not expected to be common given that stored signature templates are typically generated in a separate acquisition step rather than extracted from submitted audit reports, and does not materially affect aggregate capture rates at the firm level.

Sixth, our analyses remain at the signature level; we abstain from partner-level frequency inferences such as "X% of CPAs hand-sign in a given year." Per-signature labels in this paper are not translated to per-report or per-partner mechanism assignments (Section III-G). The signature-level rates we report, including the 92.5% / 7.5% Firm A split and the year-by-year left-tail share of Section IV-G.1, should accordingly be read as signature-level quantities rather than partner-level frequencies.

Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing." Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.