Paper A v3.18: remove accountant-level + replication-dominated calibration + Gemini 2.5 Pro review minor fixes

Major changes (per partner red-pen + user decision):
- Delete entire accountant-level analysis (III.J, IV.E, Tables VI/VII/VIII,
  Fig 4) -- cross-year pooling assumption unjustified, removes the implicit
  "habitually stamps = always stamps" reading.
- Renumber sections III.J/K/L (was K/L/M) and IV.E/F/G/H/I (was F/G/H/I/J).
- Title: "Three-Method Convergent Thresholding" -> "Replication-Dominated
  Calibration" (the three diagnostics do NOT converge at signature level).
- Operational cosine cut anchored on whole-sample Firm A P7.5 (cos > 0.95).
- Three statistical diagnostics (Hartigan/Beta/BD-McCrary) reframed as
  descriptive characterisation, not threshold estimators.
- Firm A replication-dominated framing: 3 evidence strands -> 2.
- Discussion limitation list: drop accountant-level cross-year pooling and
  BD/McCrary diagnostic; add auditor-year longitudinal tracking as future work.
- Tone-shift: "we do not claim / do not derive" -> "we find / motivates".

Reference verification (independent web-search audit of all 41 refs):
- Fix [5] author hallucination: Hadjadj et al. -> Kao & Wen (real authors of
  Appl. Sci. 10:11:3716; report at paper/reference_verification_v3.md).
- Polish [16] [21] [22] [25] (year/volume/page-range/model-name).

Gemini 2.5 Pro peer review (Minor Revision verdict, A-F all positive):
- Neutralize script-path references in tables/appendix -> "supplementary
  materials".
- Move conflict-of-interest declaration from III-L to new Declarations
  section before References (paper_a_declarations_v3.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-27 17:43:09 +08:00
parent 6ab6e19137
commit 16e90bab20
13 changed files with 430 additions and 264 deletions
+71 -110
View File
@@ -2,7 +2,7 @@
## A. Experimental Setup
Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, 2D Gaussian mixture, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration.
Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration.
Feature extraction used PyTorch 2.9 with torchvision model implementations.
The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
Because all steps rely on deterministic forward inference over fixed pre-trained weights (no fine-tuning) plus fixed-seed numerical procedures, reported results are platform-independent to within floating-point precision.
@@ -28,7 +28,7 @@ The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliabi
## C. All-Pairs Intra-vs-Inter Class Distribution Analysis
Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-K).
Table IV summarizes the distributional statistics.
<!-- TABLE IV: Cosine Similarity Distribution Statistics
@@ -44,7 +44,7 @@ Table IV summarizes the distributional statistics.
Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent threshold-estimator outputs reported in Section IV-D are derived via the methods of Section III-I to avoid single-family distributional assumptions.
The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
@@ -54,7 +54,12 @@ We emphasize that pairwise observations are not independent---the same signature
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
## D. Hartigan Dip Test: Unimodality at the Signature Level
## D. Signature-Level Distributional Characterisation
This section applies the threshold-estimator and density-smoothness diagnostic of Section III-I to the per-signature similarity distribution.
The joint reading is that per-signature similarity is a continuous quality spectrum rather than a clean two-mechanism mixture, which is why the operational classifier (Section III-K) anchors its cosine cut on the whole-sample Firm A P7.5 percentile rather than on any mixture-fit crossing.
### 1) Hartigan Dip Test: Unimodality at the Signature Level
Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-signature best-match analyses (Tables V and XII, and the Firm-A per-signature rows of Tables XIII and XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
@@ -66,95 +71,54 @@ The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-sig
| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
| Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal |
| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
-->
Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to within-firm heterogeneity in signing outputs (see Section IV-E for the accountant-level mixture evidence and Section III-G for the scope of partner-level claims).
Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to within-firm heterogeneity in signing outputs (Section III-G discusses the scope of partner-level claims).
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.
The Firm A unimodal-long-tail finding is the structural evidence that supports the replication-dominated framing (Section III-H): a single dominant mechanism plus residual within-firm heterogeneity, rather than two cleanly separated mechanisms.
This asymmetry between signature level and accountant level is itself an empirical finding.
It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
### 1) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic
### 2) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic
Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine distribution yields a nominally significant $Z^- \rightarrow Z^+$ transition at cosine 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample under the bin width ($0.005$ / $1$) used here.
Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
First, the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
At the accountant level the BD/McCrary null is not rejected at two of three cosine bin widths (0.002, 0.010) and two of three dHash bin widths (0.2, 0.5); the one cosine transition that does occur (at bin width 0.005) sits at cosine 0.980---*at the upper edge* of the convergence band of our two threshold estimators (Section IV-E)---and the one dHash transition (at bin width 1.0, location dHash = 3.0) has $|Z_{\text{below}}|$ exactly at the 1.96 critical value.
We read this pattern as *largely but not uniformly* null and *consistent with*---not affirmative proof of---clustered-but-smoothly-mixed aggregates: at $N = 686$ accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness (Section V-G), and the one bin-0.005 cosine transition, sitting at the edge rather than outside the threshold band and flanked by bin-0.002 and bin-0.010 non-rejections, is consistent with a mild histogram-resolution effect rather than a stable cross-mode density discontinuity (Appendix A).
We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator, and the substantive claim of smoothly-mixed accountant clustering rests on the joint evidence of the dip test, the BIC-selected GMM, and the BD null.
We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator.
### 2) Beta Mixture at Signature Level: A Forced Fit
### 3) Beta Mixture at Signature Level: A Forced Fit
Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.
This motivates the pivot to the accountant-level analysis in Section IV-E, where aggregation over signatures reveals clustered (though not sharply discrete) patterns in individual-level signing *practice* that the signature-level analysis lacks.
### 4) Joint Reading of the Three Diagnostics
## E. Accountant-Level Gaussian Mixture
The three diagnostics agree that per-signature similarity does not form a clean two-mechanism mixture:
(i) the Hartigan dip test fails to reject unimodality for Firm A and rejects it for the heterogeneous-firm pooled sample;
(ii) BIC strongly prefers a 3-component over a 2-component Beta fit, so the 2-component crossing is a *forced fit* and the Beta-vs-logit-Gaussian disagreement (0.977 vs 0.999 for Firm A) reflects parametric-form sensitivity rather than a stable two-mechanism boundary;
(iii) the BD/McCrary procedure locates its candidate transition *inside* the non-hand-signed mode rather than between modes, and the transition is not bin-width-stable.
We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
BIC selects $K^* = 3$ (Table VI).
Table VI summarises the signature-level threshold-estimator outputs for cross-method comparison.
<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)
| K | BIC | AIC | Converged |
|---|-----|-----|-----------|
| 1 | 316 | 339 | ✓ |
| 2 | 545 | 595 | ✓ |
| 3 | **792** | **869** | (best) |
| 4 | 779 | 883 | ✓ |
| 5 | 747 | 879 | ✓ |
<!-- TABLE VI: Signature-Level Threshold-Estimator Summary
| Method | Cosine | dHash |
|--------|--------|-------|
| All-pairs KDE crossover (Section IV-C) | 0.837 | — |
| Beta-2 EM crossing (Firm A; forced fit, BIC prefers $K{=}3$) | 0.977 | — |
| logit-GMM-2 crossing (Full sample; forced fit) | 0.980 | |
| BD/McCrary transition (diagnostic; bin-unstable, Appendix A) | 0.985 | 2.0 |
| Firm A whole-sample P7.5 (operational anchor; Section III-K) | 0.95 | — |
| Firm A whole-sample dHash_indep P75 (operational $\leq 5$ band lower edge) | — | 4 |
| Firm A whole-sample dHash_indep ceiling (style-consistency boundary) | — | 15 |
| Firm A calibration-fold cosine P5 (held-out validation; Section IV-F.2) | 0.9407 | — |
| Firm A calibration-fold dHash_indep P95 (held-out validation) | — | 9 |
-->
Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.
Non-hand-signed replication quality is therefore best read as a continuous spectrum produced by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) acting on a common stored exemplar.
This finding has a direct methodological pay-off: it is *why* the operational cosine cut is anchored on the whole-sample Firm A P7.5 percentile (Section III-K), and it is *why* the byte-level pixel-identity anchor (Section IV-F.1) is the natural threshold-free positive reference for downstream validation.
<!-- TABLE VII: Accountant-Level 3-Component GMM
| Comp. | cos_mean | dHash_mean | weight | n | Dominant firms |
|-------|----------|------------|--------|---|----------------|
| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) |
| C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) |
| C3 (hand-signed tendency) | 0.928 | 11.17 | 0.28 | 184 | smaller domestic firms |
-->
Three empirical findings stand out.
First, of the 180 CPAs in the Firm A registry, 171 have $\geq 10$ signatures and therefore enter the accountant-level GMM (the remaining 9 have too few signatures for reliable aggregates and are excluded from this analysis only).
Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2.
This split is consistent with the within-firm heterogeneity framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
Third, applying the threshold framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary density-smoothness diagnostic is largely null at the accountant level---no significant transition at two of three cosine bin widths and two of three dHash bin widths, with the one cosine transition at bin 0.005 sitting at cosine 0.980 on the upper edge of the convergence band (Appendix A).
For completeness we also report the marginal crossings of a *separately fit* two-component 2D GMM (reported as a cross-check on the 1D accountant-level crossings) at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.
Table VIII summarizes the threshold estimates produced by the two threshold estimators and the BD/McCrary smoothness diagnostic across the two analysis levels for a compact cross-level comparison.
<!-- TABLE VIII: Threshold Convergence Summary Across Levels
| Level / method | Cosine threshold | dHash threshold |
|----------------|-------------------|------------------|
| Signature-level, all-pairs KDE crossover | 0.837 | — |
| Signature-level, Beta-2 EM crossing (Firm A) | 0.977 | — |
| Signature-level, logit-GMM-2 crossing (Full) | 0.980 | — |
| Signature-level, BD/McCrary transition (diagnostic only; bin-unstable, Appendix A) | 0.985 | 2.0 |
| Accountant-level, KDE antimode (threshold estimator) | **0.973** | **4.07** |
| Accountant-level, Beta-2 EM crossing (threshold estimator) | **0.979** | **3.41** |
| Accountant-level, logit-GMM-2 crossing (robustness) | **0.976** | **3.93** |
| Accountant-level, BD/McCrary transition (diagnostic; largely null, Appendix A) | 0.980 at bin 0.005 only; null at 0.002, 0.010 | 3.0 at bin 1.0 only (\|Z\|=1.96); null at 0.2, 0.5 |
| Accountant-level, 2D-GMM 2-comp marginal crossing (secondary) | 0.945 | 8.10 |
| Firm A calibration-fold cosine P5 | 0.9407 | — |
| Firm A calibration-fold dHash_indep P95 | — | 9 |
| Firm A calibration-fold dHash_indep median | — | 2 |
-->
At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$; the BD/McCrary density-smoothness diagnostic is largely null at the same level (two of three cosine bin widths and two of three dHash bin widths produce no significant transition; the one bin-0.005 cosine transition at 0.980 sits on the convergence-band upper edge and is flanked by non-rejections at bin 0.002 and bin 0.010, Appendix A), which is *consistent with*---though, at $N = 686$, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates.
This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
## F. Calibration Validation with Firm A
## E. Calibration Validation with Firm A
Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).
@@ -164,24 +128,23 @@ Table IX reports the proportion of Firm A signatures crossing each candidate thr
|------|-------------|-------|
| cosine > 0.837 (all-pairs KDE crossover) | 99.93% | 60,408 / 60,448 |
| cosine > 0.9407 (calibration-fold P5) | 95.15% | 57,518 / 60,448 |
| cosine > 0.945 (2D GMM marginal crossing) | 94.02% | 56,836 / 60,448 |
| cosine > 0.945 (calibration-fold P5 rounded) | 94.02% | 56,836 / 60,448 |
| cosine > 0.95 | 92.51% | 55,922 / 60,448 |
| cosine > 0.973 (accountant-level KDE antimode) | 79.45% | 48,028 / 60,448 |
| dHash_indep ≤ 5 (whole-sample upper-tail of mode) | 84.20% | 50,897 / 60,448 |
| dHash_indep ≤ 8 | 95.17% | 57,527 / 60,448 |
| dHash_indep ≤ 15 (style-consistency boundary) | 99.83% | 60,348 / 60,448 |
| cosine > 0.95 AND dHash_indep ≤ 8 (operational dual) | 89.95% | 54,370 / 60,448 |
All rates computed exactly from the full Firm A sample (N = 60,448 signatures); counts reproduce from `signature_analysis/24_validation_recalibration.py` (whole_firm_a section).
All rates computed exactly from the full Firm A sample (N = 60,448 signatures); per-rule counts and codes are available in the supplementary materials.
-->
Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to the whole-sample Firm A distribution described in Section III-L (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with both the accountant-level crossings (Section IV-E) and the 139/32 high-replication versus middle-band split within Firm A (Section IV-E).
Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to the whole-sample Firm A distribution described in Section III-K (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 92.5% / 7.5% signature-level split (Section III-H).
Section IV-F.2 reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
## G. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
## F. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
We report three validation analyses corresponding to the anchors of Section III-K.
We report three validation analyses corresponding to the anchors of Section III-J.
### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor
@@ -197,10 +160,10 @@ We do not report an Equal Error Rate: EER is meaningful only when the positive a
|-----------|-----|-------------------|
| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] |
| 0.900 | 0.0233 | [0.0221, 0.0247] |
| 0.945 (2D GMM marginal) | 0.0008 | [0.0006, 0.0011] |
| 0.950 | 0.0007 | [0.0005, 0.0009] |
| 0.973 (accountant KDE antimode) | 0.0003 | [0.0002, 0.0004] |
| 0.979 (accountant Beta-2) | 0.0002 | [0.0001, 0.0004] |
| 0.945 (calibration-fold P5 rounded) | 0.0008 | [0.0006, 0.0011] |
| 0.950 (whole-sample Firm A P7.5; operational cut) | 0.0007 | [0.0005, 0.0009] |
| 0.973 (signature-level Beta/KDE upper bound) | 0.0003 | [0.0002, 0.0004] |
| 0.979 (signature-level Beta-2 forced-fit crossing) | 0.0002 | [0.0001, 0.0004] |
Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
-->
@@ -208,13 +171,13 @@ Table note: We do not include FRR against the byte-identical positive anchor as
Two caveats apply.
First, the byte-identical positive anchor referenced above is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
A would-be FRR computed against this subset is definitionally $0$ at every threshold below $1$ (since byte-identical pairs have cosine $\approx 1$), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.
Second, the 0.945 / 0.95 thresholds are derived from the Firm A whole-sample and calibration-fold percentiles rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
The very low FAR at the operational cut is therefore informative about specificity against a realistic inter-CPA negative population.
### 2) Held-Out Firm A Validation (within-Firm-A sampling variance disclosure)
We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry and which we therefore exclude from both folds; this handling is made explicit here and has no effect on the accountant-level mixture analysis of Section IV-E, which uses the $\geq 10$-signature subset of 171 CPAs.
The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry and which we therefore exclude from both folds; this handling is made explicit here.
Thresholds are re-derived from calibration-fold percentiles only.
Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
@@ -223,15 +186,15 @@ Table XI reports both calibration-fold and held-out-fold capture rates with Wils
|------|---------------------------|-------------------------|----------|---|-----------|----------|
| cosine > 0.837 | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31 | 0.756 n.s. | 45,087/45,116 | 15,321/15,332 |
| cosine > 0.9407 (calib-fold P5) | 94.99% [94.79%, 95.19%] | 95.63% [95.29%, 95.94%] | -3.19 | 0.001 | 42,856/45,116 | 14,662/15,332 |
| cosine > 0.945 (2D GMM marginal) | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54 | <0.001 | 42,305/45,116 | 14,531/15,332 |
| cosine > 0.950 | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97 | <0.001 | 41,570/45,116 | 14,352/15,332 |
| cosine > 0.945 (calib-fold P5 rounded) | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54 | <0.001 | 42,305/45,116 | 14,531/15,332 |
| cosine > 0.950 (whole-sample P7.5; operational cut) | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97 | <0.001 | 41,570/45,116 | 14,352/15,332 |
| dHash_indep ≤ 5 | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001 | 37,430/45,116 | 13,467/15,332 |
| dHash_indep ≤ 8 | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45 | <0.001 | 42,788/45,116 | 14,739/15,332 |
| dHash_indep ≤ 9 (calib-fold P95) | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07 | <0.001 | 43,604/45,116 | 14,945/15,332 |
| dHash_indep ≤ 15 | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31 | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60 | <0.001 | 40,335/45,116 | 14,035/15,332 |
Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. All counts and z/p values are reproducible from `signature_analysis/24_validation_recalibration.py` (seed = 42).
Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. Counts and z/p values are reproducible from the supplementary materials (fixed random seed).
-->
Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
@@ -240,15 +203,13 @@ We report fold-versus-fold comparisons rather than fold-versus-whole-sample comp
Under this proper test the two extreme rules agree across folds (cosine $> 0.837$ and $\text{dHash}_\text{indep} \leq 15$; both $p > 0.7$).
The operationally relevant rules in the 8595% capture band differ between folds by 15 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).
Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 8599% range has a held-out counterpart in the 8799% range, and the operational dual rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.40% of the calibration fold and 91.54% of the held-out fold.
The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity (see the $139 / 32$ accountant-level split of Section IV-E): the random 30% CPA sample happened to contain proportionally more accountants from the high-replication C1 cluster.
We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to this fold variance.
The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity: the random 30% CPA sample evidently contained proportionally more high-replication CPAs.
We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-F.2) is the cross-check that is robust to this fold variance.
### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$
The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P7.5 heuristic (i.e., 7.5% of whole-sample Firm A signatures lie at or below 0.95; see Section III-L).
The accountant-level convergent threshold analysis (Section IV-E) places the primary accountant-level reference between $0.973$ and $0.979$ (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing), and the accountant-level 2D-GMM marginal at $0.945$.
Because the classifier operates at the signature level while these convergent accountant-level estimates are at the accountant level, they are formally non-substitutable.
We report a sensitivity check in which the classifier's operational cut cos $> 0.95$ is replaced by the nearest accountant-level reference, cos $> 0.945$.
The per-signature classifier (Section III-K) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P7.5 heuristic (i.e., 7.5% of whole-sample Firm A signatures lie at or below 0.95; see Section III-H).
We report a sensitivity check in which this round-number cut is replaced by the slightly stricter calibration-fold P5 rounded value cos $> 0.945$ (calibration-fold P5 = 0.9407, see Table XI).
Table XII reports the five-way classifier output under each cut.
<!-- TABLE XII: Classifier Sensitivity to the Operational Cosine Cut (All-Sample Five-Way Output, N = 168,740 signatures)
@@ -266,12 +227,12 @@ At the per-signature categorization level, replacing 0.95 by 0.945 reclassifies
The Likely-hand-signed category is unaffected because it depends only on the fixed all-pairs KDE crossover cosine $= 0.837$.
The High-confidence non-hand-signed share grows from 45.62% to 46.98%.
We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within the accountant-level convergence band, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.
The paper therefore retains cos $> 0.95$ as the primary operational cut for transparency and reports the 0.945 results as a sensitivity check rather than as a deployed alternative; a future deployment requiring tighter accountant-level alignment could substitute cos $> 0.945$ without altering the substantive firm-level conclusions.
We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within a 0.005-cosine neighbourhood of the Firm A P7.5 anchor, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.
The paper therefore retains cos $> 0.95$ as the primary operational cut for transparency (round-number P7.5 of the whole-sample Firm A reference distribution) and reports the 0.945 results as a sensitivity check rather than as a deployed alternative.
## H. Additional Firm A Benchmark Validation
## G. Additional Firm A Benchmark Validation
The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.
The capture rates of Section IV-E are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.
This section reports three complementary analyses that go beyond the whole-sample capture rates.
Subsection H.2 is fully threshold-independent (it uses only ordinal ranking).
Subsection H.1 uses a fixed 0.95 cutoff but derives information from the longitudinal stability of rates rather than from the absolute rate at any single year.
@@ -350,7 +311,7 @@ Taiwanese statutory audit reports are co-signed by two engagement partners (a pr
Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification.
Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.
For each report with exactly two signatures and complete per-signature data (84,354 reports total: 83,970 single-firm reports, in which both signers are at the same firm, and 384 mixed-firm reports, in which the two signers are at different firms), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.
For each report with exactly two signatures and complete per-signature data (84,354 reports total: 83,970 single-firm reports, in which both signers are at the same firm, and 384 mixed-firm reports, in which the two signers are at different firms), we classify each signature using the dual-descriptor rules of Section III-K and record whether the two classifications agree.
Table XVI reports per-firm intra-report agreement for the 83,970 single-firm reports only (firm-assignment defined by the common firm identity of both signers); the 384 mixed-firm reports (0.46% of the 2-signature corpus) are excluded from the intra-report analysis because firm-level agreement is not well defined when the two signers are at different firms.
<!-- TABLE XVI: Intra-Report Classification Agreement by Firm
@@ -370,14 +331,14 @@ Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports havin
The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) non-hand-signing practice.
We note that this test uses the calibrated classifier of Section III-L rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
We note that this test uses the calibrated classifier of Section III-K rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
## I. Classification Results
## H. Classification Results
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-L: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-H.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-F.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
<!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
| Verdict | N (PDFs) | % | Firm A | Firm A % |
@@ -388,7 +349,7 @@ Document-level rates therefore represent the share of reports in which *at least
| Uncertain | 12,683 | 15.0% | 758 | 2.5% |
| Likely hand-signed | 47 | 0.1% | 4 | 0.0% |
Per the worst-case aggregation rule of Section III-L, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
Per the worst-case aggregation rule of Section III-K, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
-->
Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
@@ -400,16 +361,16 @@ A cosine-only classifier would treat all 71,656 identically; the dual-descriptor
### 1) Firm A Capture Profile (Consistency Check)
96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the 32/171 middle-band minority identified by the accountant-level mixture (Section IV-E).
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the within-firm heterogeneity implied by the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 7.5% signature-level left tail (Section III-H).
The near-zero "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 count here is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset in Table XVI by 4 reports) indicates that the within-firm heterogeneity implied by the 7.5% signature-level left tail (Section IV-D) does not project into the lowest-cosine document-level category under the dual-descriptor rules; it is absorbed instead into the uncertain or high-style-consistency categories at this threshold set.
We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.
We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-F.2 is the corresponding external check.
### 2) Cross-Method Agreement
### 2) Cross-Firm Comparison of Dual-Descriptor Convergence
Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
This is consistent with the accountant-level convergent thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-F.2) and intra-report consistency (Section IV-F.3) findings.
## J. Ablation Study: Feature Backbone Comparison
## I. Ablation Study: Feature Backbone Comparison
To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.