Files
pdf_signature_extraction/paper/paper_a_discussion_v3.md
T
gbanyan 9b0b8358a2 Paper A v3.12: resolve Gemini 3.1 Pro round-11 full-paper review findings
Round-11 Gemini 3.1 Pro fresh full-paper review (Minor Revision)
surfaced four issues that the prior 10 rounds (codex gpt-5.4 x4, codex
gpt-5.5 x1, Gemini 3.1 Pro x2, Opus 4.7 x1, paragraph-level v3.11
review) all missed:

1. MAJOR - Percentile-terminology contradiction between Section III-L
   L290 and Section III-H L160. III-L called 0.95 the "whole-sample
   Firm A P95" of the per-signature best-match cosine distribution,
   but III-H states 92.5% of Firm A signatures exceed 0.95. Under
   standard bottom-up percentile convention this makes 0.95 the P7.5,
   not the P95; Table XI calibration-fold data (Firm A cosine
   median = 0.9862, P5 = 0.9407) confirms true P95 is near 0.998.
   Fix: rewrote III-L L290 to state 0.95 corresponds to approximately
   the whole-sample Firm A P7.5 with the 92.5%/7.5% complement stated
   explicitly. dHash P95 claims elsewhere (Table XI, L229/L233) were
   already correct under standard convention and are unchanged.

2. MINOR - Firm A CPA count inconsistency. Discussion V-C L44 said
   "Nine additional Firm A CPAs are excluded from the GMM for having
   fewer than 10 signatures" but Results IV-G.2 L216 defines 178
   valid Firm A CPAs (180 registry minus 2 disambiguation-excluded);
   178 - 171 = 7. Fix: corrected to "seven are outside the GMM" with
   explicit 178-baseline and cross-reference to IV-G.2.

3. MINOR - Table XVI mixed-firm handling broken promise. Results
   L355-356 previously said "mixed-firm reports are reported
   separately" but Table XVI only lists single-firm rows summing to
   exactly 83,970, and no subsequent prose reports the 384 mixed-firm
   agreement rate. Fix: rewrote L355-356 to state Table XVI covers
   the 83,970 single-firm reports only and that the 384 mixed-firm
   reports (0.46%) are excluded because firm-level agreement is not
   well defined when the two signers are at different firms.

4. MINOR - Contribution-count structural inconsistency. Introduction
   enumerates seven contributions, Conclusion opens with "Our
   contributions are fourfold." Fix: rewrote the Conclusion lead to
   "The seven numbered contributions listed in Section I can be
   grouped into four broader methodological themes," making the
   grouping explicit.

No re-computation. All tables (IV-XVIII) and Appendix A numbers
unchanged. Abstract unchanged (still 248/250 words).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:10:20 +08:00

116 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# V. Discussion
## A. Non-Hand-Signing Detection as a Distinct Problem
Our results highlight the importance of distinguishing *non-hand-signing detection* from the well-studied *signature forgery detection* problem.
In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature.
In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).
This distinction has direct methodological consequences.
Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures.
Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.
The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.
## B. Continuous-Quality Spectrum vs. Clustered Accountant-Level Heterogeneity
The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the convergent threshold framework and the Hartigan dip test (Sections IV-D and IV-E).
At the per-signature level, the distribution of best-match cosine similarity is *not* cleanly bimodal.
Firm A's signature-level cosine is formally unimodal (dip test $p = 0.17$) with a long left tail.
The all-CPA signature-level cosine rejects unimodality ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary between two mechanisms.
Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class.
Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level.
At the per-accountant aggregate level the picture partly reverses.
The distribution of per-accountant mean cosine (and mean dHash) rejects unimodality, a BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms, and the three 1D threshold methods applied at the accountant level produce mutually consistent estimates (KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, logit-GMM-2 crossing $= 0.976$).
The BD/McCrary test is largely null at the accountant level---no significant transition at two of three cosine bin widths and two of three dHash bin widths, and the one cosine transition (at bin 0.005, location 0.980) sits on the upper edge of the convergence band described above rather than outside it (Appendix A).
This pattern is consistent with a clustered *but smoothly mixed* accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the test fails to reject the smoothness null at the sample size available ($N = 686$), and the GMM cluster boundaries appear gradual rather than sheer.
We caveat this interpretation appropriately in Section V-G: the BD null alone cannot affirmatively establish smoothness---only fail to falsify it---and our substantive claim of smoothly-mixed clustering rests on the joint weight of the GMM fit, the dip test, and the BD null rather than on the BD null alone.
The substantive interpretation we take from this evidence is therefore narrower than a "discrete-behaviour" claim: *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behaviour* is clustered (three recognizable groups) but not sharply discrete.
The accountant-level mixture is a useful classifier of firm-and-practitioner-level signing regimes; individual behaviour may still transition or mix over time within a practitioner, and our cross-sectional analysis does not rule this out.
Methodologically, the implication is that the two threshold estimators (KDE antimode, Beta mixture with logit-Gaussian robustness) are meaningfully applied at the accountant level for threshold estimation, while the BD/McCrary non-transition at the same level is a failure-to-reject rather than a failure of the method---informative alongside the other evidence but subject to the power caveat recorded in Section V-G.
## C. Firm A as a Replication-Dominated, Not Pure, Population
A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class.
Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.
Three convergent strands of evidence support the replication-dominated framing.
First, the visual-inspection evidence: randomly sampled Firm A reports exhibit pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
Second, the signature-level statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures ($\geq 10$) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---consistent with within-firm heterogeneity in signing practice (spanning a minority of hand-signers, CPAs undergoing mid-sample mechanism transitions, and CPAs whose pooled coordinates reflect mixed-quality replication) rather than a pure replication population.
Of the 178 valid Firm A CPAs (the 180 registered CPAs minus two excluded for disambiguation ties in the registry; Section IV-G.2), seven are outside the GMM for having fewer than 10 signatures, so we cannot place them in a cluster from the cross-sectional analysis alone.
The held-out Firm A 70/30 validation (Section IV-G.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules (extreme rules are statistically indistinguishable; operational rules in the 8595% band differ between folds by 15 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure).
The accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to fold-level sampling variance.
The replication-dominated framing is internally coherent with all three pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise.
We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.
## D. The Style-Replication Gap
Within the 71,656 documents exceeding cosine $0.95$, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
The 7.2% classified as "high style consistency" (cosine $> 0.95$ but dHash $> 15$) are particularly informative.
Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions.
Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting.
Some may use signing pads or templates that further constrain variability without constituting image-level reproduction.
The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence.
## E. Value of a Replication-Dominated Calibration Group
The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset.
This calibration strategy has broader applicability beyond signature analysis.
Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives.
The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates the within-firm heterogeneity quantified by the accountant-level mixture, and yields classification rates that are internally consistent with the data.
## F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation
A further methodological contribution is the combination of byte-level pixel identity as an annotation-free *conservative* gold positive and a large random-inter-CPA negative anchor.
Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is pair-level proof of image reuse and, modulo the narrow source-template edge case discussed in the seventh limitation below, a conservative positive for non-hand-signing without requiring human review.
In our corpus 310 signatures satisfied this condition.
We emphasize that byte-identical pairs are a *subset* of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways).
Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.
Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), which is a substantive improvement over the low-similarity same-CPA negative ($n = 35$) we originally considered.
The combination is a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself, provided that its generalization limits are acknowledged: FAR is informative, whereas recall is valid only for the conservative subset.
## G. Limitations
Several limitations should be acknowledged.
First, comprehensive per-document ground truth labels are not available.
The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class.
The low-similarity same-CPA anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70; we use the $\sim$50,000-pair inter-CPA negative anchor as the primary negative reference, which yields tight Wilson 95% CIs on FAR (Table X), but it too does not exhaust the set of true negatives (in particular, same-CPA hand-signed pairs with moderate cosine similarity are not sampled).
A manual-adjudication study concentrated at the decision boundary---for example 100--300 auditor-years stratified by cosine band---would further strengthen the recall estimate against the full positive class.
Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning.
While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.
Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions.
In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity.
This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified.
Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements.
While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions.
Fifth, the accountant-level summary (Section III-J) is a cross-year pooled statistic by construction, so a CPA whose signing mechanism changed mid-sample is placed at a weighted mix of component means rather than at a single regime centroid.
Extending the accountant-level analysis to auditor-year units---using the same convergent threshold framework at finer temporal resolution---is the natural next step for resolving such within-accountant transitions.
Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, and the test produces no significant transition at all at the accountant level.
In our application, therefore, BD/McCrary contributes diagnostic information about local density-smoothness rather than an independent accountant-level threshold estimate; that role is played by the KDE antimode and the two mixture-based estimators.
We emphasize that the accountant-level BD/McCrary null is *consistent with*---not affirmative proof of---smoothly mixed cluster boundaries: the BD/McCrary test is known to have limited statistical power at modest sample sizes, and with $N = 686$ accountants in our analysis the test cannot reliably detect anything less than a sharp cliff-type density discontinuity.
Failure to reject the smoothness null at this sample size therefore reinforces BD/McCrary's role as a diagnostic rather than a definitive estimator; the substantive claim of smoothly-mixed accountant-level clustering rests on the joint weight of the dip-test and Beta-mixture evidence together with the BD null, not on the BD null alone.
Seventh, the max/min detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed.
In the rare case that one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as the stamping or e-signature template, the pair correctly identifies image reuse but misattributes the non-hand-signed status to the source exemplar.
This misattribution affects at most one source document per template variant per CPA (the exemplar from which the template was produced), is not expected to be common given that stored signature templates are typically generated in a separate acquisition step rather than extracted from submitted audit reports, and does not materially affect aggregate capture rates at the firm level.
Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.