Files
pdf_signature_extraction/paper/paper_a_results_v3.md
T
gbanyan 623eb4cd4b Paper A v3.19.1: address codex partner-redpen audit residual ("upper bound" wording)
Codex GPT-5.5 cross-verified the Gemini partner red-pen audit
(paper/codex_partner_redpen_audit_v3_19_0.md) and downgraded item (j) --
the BIC strict-3-component upper-bound framing -- from RESOLVED to
IMPROVED, because the "upper bound" wording the partner originally
red-circled in v3.17 still survived in two methodology sentences and one
Table VI row label, even though Section IV-D.3 had been retitled
"A Forced Fit" in v3.18.

This commit closes that residual:

- Methodology III-I.2: "the 2-component crossing should be treated as
  an upper bound rather than a definitive cut" -> "we report the
  resulting crossing only as a forced-fit descriptive reference and do
  not use it as an operational threshold".
- Methodology III-I.4: "should be read as an upper bound rather than a
  definitive cut" -> "reported only as a descriptive reference rather
  than as an operational threshold".
- Table VI row "0.973 (signature-level Beta/KDE upper bound)" relabelled
  to "0.973 (signature-level Beta/KDE forced-fit reference)" to match
  the IV-D.3 "Forced Fit" framing.
- reference_verification_v3.md header updated so the [5] entry reads as
  an audit trail of a fix already applied (v3.18 reference list reflects
  every correction) rather than as an active major problem.
- Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Also commits the codex partner-redpen audit artifact so the disagreement
trail with Gemini is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:05:39 +08:00

413 lines
42 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# IV. Experiments and Results
## A. Experimental Setup
Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration.
Feature extraction used PyTorch 2.9 with torchvision model implementations.
The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
Because all steps rely on deterministic forward inference over fixed pre-trained weights (no fine-tuning) plus fixed-seed numerical procedures, reported results are platform-independent to within floating-point precision.
## B. Signature Detection Performance
The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.
<!-- TABLE III: Extraction Results
| Metric | Value |
|--------|-------|
| Documents processed | 86,071 |
| Documents with detections | 85,042 (98.8%) |
| Total signatures extracted | 182,328 |
| Avg. signatures per document | 2.14 |
| CPA-matched signatures | 168,755 (92.6%) |
| Processing rate | 43.1 docs/sec |
-->
## C. All-Pairs Intra-vs-Inter Class Distribution Analysis
Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-K).
Table IV summarizes the distributional statistics.
<!-- TABLE IV: Cosine Similarity Distribution Statistics
| Statistic | Intra-class | Inter-class |
|-----------|-------------|-------------|
| N (pairs) | 41,352,824 | 500,000 |
| Mean | 0.821 | 0.758 |
| Std. Dev. | 0.098 | 0.090 |
| Median | 0.836 | 0.774 |
| Skewness | 0.711 | 0.851 |
| Kurtosis | 0.550 | 1.027 |
-->
Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent threshold-estimator outputs reported in Section IV-D are derived via the methods of Section III-I to avoid single-family distributional assumptions.
The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).
We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
## D. Signature-Level Distributional Characterisation
This section applies the threshold-estimator and density-smoothness diagnostic of Section III-I to the per-signature similarity distribution.
The joint reading is that per-signature similarity is a continuous quality spectrum rather than a clean two-mechanism mixture, which is why the operational classifier (Section III-K) anchors its cosine cut on the whole-sample Firm A P7.5 percentile rather than on any mixture-fit crossing.
### 1) Hartigan Dip Test: Unimodality at the Signature Level
Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-signature best-match analyses (Tables V and XII, and the Firm-A per-signature rows of Tables XIII and XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
<!-- TABLE V: Hartigan Dip Test Results
| Distribution | N | dip | p-value | Verdict (α=0.05) |
|--------------|---|-----|---------|------------------|
| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal |
| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
-->
Firm A's per-signature cosine distribution *fails to reject unimodality* ($p = 0.17$), a pattern consistent with a dominant high-similarity regime plus a long left tail attributable to within-firm heterogeneity in signing outputs (Section III-G discusses the scope of partner-level claims).
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
The Firm A unimodal-long-tail finding is, in conjunction with the byte-identity, partner-ranking, and intra-report evidence reported below, consistent with the replication-dominated framing (Section III-H): a dominant high-similarity regime plus residual within-firm heterogeneity, rather than two cleanly separated mechanisms.
### 2) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic
Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine distribution yields a nominally significant $Z^- \rightarrow Z^+$ transition at cosine 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample under the bin width ($0.005$ / $1$) used here.
Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
First, the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator.
### 3) Beta Mixture at Signature Level: A Forced Fit
Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
### 4) Joint Reading of the Three Diagnostics
The three diagnostics agree that per-signature similarity does not form a clean two-mechanism mixture:
(i) the Hartigan dip test fails to reject unimodality for Firm A and rejects it for the heterogeneous-firm pooled sample;
(ii) BIC strongly prefers a 3-component over a 2-component Beta fit, so the 2-component crossing is a *forced fit* and the Beta-vs-logit-Gaussian disagreement (0.977 vs 0.999 for Firm A) reflects parametric-form sensitivity rather than a stable two-mechanism boundary;
(iii) the BD/McCrary procedure locates its candidate transition *inside* the non-hand-signed mode rather than between modes, and the transition is not bin-width-stable.
Table VI summarises the signature-level threshold-estimator outputs for cross-method comparison.
<!-- TABLE VI: Signature-Level Threshold-Estimator Summary
| Method | Cosine | dHash |
|--------|--------|-------|
| All-pairs KDE crossover (Section IV-C) | 0.837 | — |
| Beta-2 EM crossing (Firm A; forced fit, BIC prefers $K{=}3$) | 0.977 | — |
| logit-GMM-2 crossing (Full sample; forced fit) | 0.980 | — |
| BD/McCrary transition (diagnostic; bin-unstable, Appendix A) | 0.985 | 2.0 |
| Firm A whole-sample P7.5 (operational anchor; Section III-K) | 0.95 | — |
| Firm A whole-sample dHash_indep P75 (operational $\leq 5$ band lower edge) | — | 4 |
| Firm A whole-sample dHash_indep ceiling (style-consistency boundary) | — | 15 |
| Firm A calibration-fold cosine P5 (held-out validation; Section IV-F.2) | 0.9407 | — |
| Firm A calibration-fold dHash_indep P95 (held-out validation) | — | 9 |
-->
Non-hand-signed replication quality is therefore best read as a continuous spectrum produced by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) acting on a common stored exemplar.
This finding has a direct methodological pay-off: it is *why* the operational cosine cut is anchored on the whole-sample Firm A P7.5 percentile (Section III-K), and it is *why* the byte-level pixel-identity anchor (Section IV-F.1) is the natural threshold-free positive reference for downstream validation.
## E. Calibration Validation with Firm A
Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).
<!-- TABLE IX: Firm A Whole-Sample Capture Rates (consistency check, NOT external validation)
| Rule | Firm A rate | k / N |
|------|-------------|-------|
| cosine > 0.837 (all-pairs KDE crossover) | 99.93% | 60,408 / 60,448 |
| cosine > 0.9407 (calibration-fold P5) | 95.15% | 57,518 / 60,448 |
| cosine > 0.945 (calibration-fold P5 rounded) | 94.02% | 56,836 / 60,448 |
| cosine > 0.95 | 92.51% | 55,922 / 60,448 |
| dHash_indep ≤ 5 (whole-sample upper-tail of mode) | 84.20% | 50,897 / 60,448 |
| dHash_indep ≤ 8 | 95.17% | 57,527 / 60,448 |
| dHash_indep ≤ 15 (style-consistency boundary) | 99.83% | 60,348 / 60,448 |
| cosine > 0.95 AND dHash_indep ≤ 8 (operational dual) | 89.95% | 54,370 / 60,448 |
All rates computed exactly from the full Firm A sample (N = 60,448 signatures); per-rule counts and codes are available in the supplementary materials.
-->
Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to the whole-sample Firm A distribution described in Section III-K (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 92.5% / 7.5% signature-level split (Section III-H).
Section IV-F.2 reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
## F. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
We report three validation analyses corresponding to the anchors of Section III-J.
### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor
Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G.
Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; this Firm A decomposition is reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (Appendix B).
As the gold-negative anchor we sample 50,000 i.i.d. random cross-CPA signature pairs from the full 168,755-signature matched corpus (inter-CPA cosine: mean $= 0.763$, $P_{95} = 0.886$, $P_{99} = 0.915$, max $= 0.992$).
Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold.
We do not report an Equal Error Rate: EER is meaningful only when the positive and negative error-rate curves cross in a nontrivial interior region, but byte-identical positives all sit at cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$. An EER calculation against this anchor would be arithmetic tautology rather than biometric performance, and we therefore omit it.
<!-- TABLE X: Cosine Threshold Sweep — FAR Against 50,000 Inter-CPA Negative Pairs
| Threshold | FAR | FAR 95% Wilson CI |
|-----------|-----|-------------------|
| 0.837 (all-pairs KDE crossover) | 0.2101 | [0.2066, 0.2137] |
| 0.900 | 0.0250 | [0.0237, 0.0264] |
| 0.945 (calibration-fold P5 rounded) | 0.0008 | [0.0006, 0.0011] |
| 0.950 (whole-sample Firm A P7.5; operational cut) | 0.0005 | [0.0003, 0.0007] |
| 0.973 (signature-level Beta/KDE forced-fit reference) | 0.0002 | [0.0001, 0.0004] |
| 0.979 (signature-level Beta-2 forced-fit crossing) | 0.0001 | [0.0001, 0.0003] |
Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
-->
Two caveats apply.
First, the byte-identical positive anchor referenced above is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
A would-be FRR computed against this subset is definitionally $0$ at every threshold below $1$ (since byte-identical pairs have cosine $\approx 1$), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
Second, the 0.945 / 0.95 thresholds are derived from the Firm A whole-sample and calibration-fold percentiles rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
The very low FAR at the operational cut is therefore informative about specificity against a realistic inter-CPA negative population.
### 2) Held-Out Firm A Validation (within-Firm-A sampling variance disclosure)
We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two registered Firm A partners whose signatures in the corpus are singletons (only one signature each, so the per-signature best-match cosine is undefined and they do not appear in the same-CPA matched-signature table that script `24_validation_recalibration.py` reads); they are therefore not represented in either fold by construction rather than by an explicit exclusion rule.
Thresholds are re-derived from calibration-fold percentiles only.
Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
<!-- TABLE XI: Held-Out vs Calibration Firm A Capture Rates and Generalization Test
| Rule | Calibration 70% fold (CI) | Held-out 30% fold (CI) | 2-prop z | p | k/n calib | k/n held |
|------|---------------------------|-------------------------|----------|---|-----------|----------|
| cosine > 0.837 | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31 | 0.756 n.s. | 45,087/45,116 | 15,321/15,332 |
| cosine > 0.9407 (calib-fold P5) | 94.99% [94.79%, 95.19%] | 95.63% [95.29%, 95.94%] | -3.19 | 0.001 | 42,856/45,116 | 14,662/15,332 |
| cosine > 0.945 (calib-fold P5 rounded) | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54 | <0.001 | 42,305/45,116 | 14,531/15,332 |
| cosine > 0.950 (whole-sample P7.5; operational cut) | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97 | <0.001 | 41,570/45,116 | 14,352/15,332 |
| dHash_indep ≤ 5 | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001 | 37,430/45,116 | 13,467/15,332 |
| dHash_indep ≤ 8 | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45 | <0.001 | 42,788/45,116 | 14,739/15,332 |
| dHash_indep ≤ 9 (calib-fold P95) | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07 | <0.001 | 43,604/45,116 | 14,945/15,332 |
| dHash_indep ≤ 15 | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31 | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60 | <0.001 | 40,335/45,116 | 14,035/15,332 |
Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. Counts and z/p values are reproducible from the supplementary materials (fixed random seed).
-->
Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
We report fold-versus-fold comparisons rather than fold-versus-whole-sample comparisons, because the whole-sample rate is a weighted average of the two folds and therefore cannot, in general, fall inside the Wilson CI of either fold when the folds differ in rate; the correct generalization reference is the calibration fold, which produced the thresholds.
Under this proper test the two extreme rules agree across folds (cosine $> 0.837$ and $\text{dHash}_\text{indep} \leq 15$; both $p > 0.7$).
The operationally relevant rules in the 8595% capture band differ between folds by 15 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).
Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 8599% range has a held-out counterpart in the 8799% range, and the operational dual rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.40% of the calibration fold and 91.54% of the held-out fold.
The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity: the random 30% CPA sample evidently contained proportionally more high-replication CPAs.
We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-G.2) is the cross-check that is robust to this fold variance.
### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$
The per-signature classifier (Section III-K) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P7.5 heuristic (i.e., 7.5% of whole-sample Firm A signatures lie at or below 0.95; see Section III-H).
We report a sensitivity check in which this round-number cut is replaced by the slightly stricter calibration-fold P5 rounded value cos $> 0.945$ (calibration-fold P5 = 0.9407, see Table XI).
Table XII reports the five-way classifier output under each cut.
<!-- TABLE XII: Classifier Sensitivity to the Operational Cosine Cut (All-Sample Five-Way Output, N = 168,740 signatures)
| Category | cos > 0.95 count (%) | cos > 0.945 count (%) | Δ count |
|--------------------------------------------|----------------------|-----------------------|---------|
| High-confidence non-hand-signed | 76,984 (45.62%) | 79,278 (46.98%) | +2,294 |
| Moderate-confidence non-hand-signed | 43,906 (26.02%) | 50,001 (29.63%) | +6,095 |
| High style consistency | 546 ( 0.32%) | 665 ( 0.39%) | +119 |
| Uncertain | 46,768 (27.72%) | 38,260 (22.67%) | -8,508 |
| Likely hand-signed | 536 ( 0.32%) | 536 ( 0.32%) | +0 |
-->
At the aggregate firm-level, the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
At the per-signature categorization level, replacing 0.95 by 0.945 reclassifies 8,508 signatures (5.04% of the corpus) out of the Uncertain band; 6,095 of them migrate to Moderate-confidence non-hand-signed, 2,294 to High-confidence non-hand-signed, and 119 to High style consistency.
The Likely-hand-signed category is unaffected because it depends only on the fixed all-pairs KDE crossover cosine $= 0.837$.
The High-confidence non-hand-signed share grows from 45.62% to 46.98%.
We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within a 0.005-cosine neighbourhood of the Firm A P7.5 anchor, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.
The paper therefore retains cos $> 0.95$ as the primary operational cut for transparency (round-number P7.5 of the whole-sample Firm A reference distribution) and reports the 0.945 results as a sensitivity check rather than as a deployed alternative.
## G. Additional Firm A Benchmark Validation
The capture rates of Section IV-E are an *internal* consistency check: they ask "how much of Firm A does our threshold capture?", but the threshold was itself derived from Firm A's percentiles, so a high capture rate is not surprising.
To go beyond this circular check, we report three further analyses, each chosen so that the *informative quantity* does not depend on the threshold's absolute value:
- **§IV-G.1 (year-by-year stability).** Holds the cosine cutoff fixed at 0.95 and asks whether the share of Firm A below the cutoff is *stable across years*. The information is in the temporal trend, not in the absolute rate; under a noise-only explanation of the left tail, the share should shrink as scan/PDF technology matured.
- **§IV-G.2 (partner-level similarity ranking).** Uses *no threshold at all*: every auditor-year is ranked by mean similarity, and we measure Firm A's share of the top decile against its baseline share. The information is in the concentration ratio, which is invariant to the choice of cutoff.
- **§IV-G.3 (intra-report agreement).** Applies the calibrated classifier and measures whether the *two co-signing CPAs on the same Firm A report* receive the same classifier label, then compares Firm A's intra-report agreement rate to the other firms'. The information is in the *cross-firm gap*; the absolute agreement rate at any one firm depends on the cutoff, but the gap is robust to moderate cutoff shifts as long as the same cutoff is applied uniformly across firms.
Together these three analyses provide threshold-free or threshold-robust evidence that complements the within-sample capture rates of Section IV-E.
### 1) Year-by-Year Stability of the Firm A Left Tail
Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year.
Under the replication-dominated interpretation (Section III-H), this signature-level left-tail rate reflects within-firm heterogeneity in signing outputs at Firm A.
Consistent with the scope-of-claims framing in Section III-G, we report the rate as a signature-level quantity without disaggregating the underlying mechanism (which may span a minority of hand-signing partners, multi-template replication workflows within the firm, or a combination); partner-level mechanism attribution is not attempted.
Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.
<!-- TABLE XIII: Firm A Per-Year Cosine Distribution
| Year | N sigs | mean best-match cosine | % below 0.95 |
|------|--------|-------------|--------------|
| 2013 | 2,167 | 0.9733 | 12.78% |
| 2014 | 5,256 | 0.9781 | 8.69% |
| 2015 | 5,484 | 0.9793 | 7.46% |
| 2016 | 5,739 | 0.9811 | 6.92% |
| 2017 | 5,796 | 0.9814 | 6.69% |
| 2018 | 5,986 | 0.9808 | 6.58% |
| 2019 | 6,122 | 0.9780 | 8.71% |
| 2020 | 6,122 | 0.9770 | 9.46% |
| 2021 | 5,996 | 0.9792 | 8.37% |
| 2022 | 5,918 | 0.9819 | 6.25% |
| 2023 | 5,862 | 0.9860 | 3.75% |
-->
The left tail is stable at 6-13% throughout the sample period and shows no pre/post-2020 level shift: the 2013-2019 mean left-tail share is 8.26% and the 2020-2023 mean is 6.96%.
The lowest observed share is in 2023 (3.75%), consistent with firm-level electronic signing systems producing more uniform output than earlier manual scanning-and-stamping, not less.
This stability supports the replication-dominated framing: a persistent within-firm heterogeneity component is consistent with a Beta left tail that is stable across production technologies, whereas a noise-only explanation would predict a shrinking share as technology improved.
### 2) Partner-Level Similarity Ranking
If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all auditor-years (across all firms).
We test this prediction directly.
For each auditor-year (CPA $\times$ fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
Firm A accounts for 1,287 of these (27.8% baseline share).
Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.
The per-signature best-match cosine underlying each auditor-year mean is taken over the full same-CPA pool (Section III-G) and may match against signatures from other fiscal years, so the auditor-year mean reflects the year's signatures' position within the CPA's full-sample similarity structure rather than purely within-year similarity; a within-year-restricted sensitivity replication is a natural robustness check and is left to future work.
<!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)
| Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
|-------|-------------|--------|--------|--------|--------|-----------|--------------|
| 10% | 462 | 443 | 2 | 3 | 0 | 14 | 95.9% |
| 25% | 1,157 | 1,043 | 32 | 23 | 9 | 50 | 90.1% |
| 50% | 2,314 | 1,220 | 473 | 273 | 102 | 246 | 52.7% |
-->
Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile.
Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.
<!-- TABLE XV: Firm A Share of Top-10% Similarity by Year
| Year | N auditor-years | Top-10% k | Firm A in top-10% | Firm A share | Firm A baseline |
|------|-----------------|-----------|-------------------|--------------|-----------------|
| 2013 | 324 | 32 | 32 | 100.0% | 32.4% |
| 2014 | 399 | 39 | 39 | 100.0% | 27.8% |
| 2015 | 394 | 39 | 38 | 97.4% | 27.7% |
| 2016 | 413 | 41 | 39 | 95.1% | 26.2% |
| 2017 | 415 | 41 | 41 | 100.0% | 27.2% |
| 2018 | 434 | 43 | 43 | 100.0% | 26.5% |
| 2019 | 429 | 42 | 42 | 100.0% | 27.0% |
| 2020 | 430 | 43 | 38 | 88.4% | 27.7% |
| 2021 | 450 | 45 | 44 | 97.8% | 28.7% |
| 2022 | 467 | 46 | 43 | 93.5% | 28.3% |
| 2023 | 474 | 47 | 46 | 97.9% | 27.4% |
-->
This over-representation is consistent with firm-wide non-hand-signing practice at Firm A and is not derived from any threshold we subsequently calibrate.
It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.
### 3) Intra-Report Consistency
Taiwanese statutory audit reports are co-signed by two engagement partners (a primary and a secondary signer).
Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification.
Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.
For each report with exactly two signatures and complete per-signature data (84,354 reports total: 83,970 single-firm reports, in which both signers are at the same firm, and 384 mixed-firm reports, in which the two signers are at different firms), we classify each signature using the dual-descriptor rules of Section III-K and record whether the two classifications agree.
Table XVI reports per-firm intra-report agreement for the 83,970 single-firm reports only (firm-assignment defined by the common firm identity of both signers); the 384 mixed-firm reports (0.46% of the 2-signature corpus) are excluded from the intra-report analysis because firm-level agreement is not well defined when the two signers are at different firms.
<!-- TABLE XVI: Intra-Report Classification Agreement by Firm
| Firm | Total 2-signer reports | Both non-hand-signed | Both uncertain | Both style | Both hand-signed | Mixed | Agreement rate |
|------|-----------------------|----------------------|----------------|------------|------------------|-------|----------------|
| Firm A | 30,222 | 26,435 | 734 | 0 | 4 | 3,049 | **89.91%** |
| Firm B | 17,121 | 9,260 | 2,159| 5 | 6 | 5,691 | 66.76% |
| Firm C | 19,112 | 8,983 | 3,035| 3 | 5 | 7,086 | 62.92% |
| Firm D | 8,375 | 3,028 | 2,376| 0 | 3 | 2,968 | 64.56% |
| Non-Big-4 | 9,140 | 1,671 | 3,945| 18| 27| 3,479 | 61.94% |
A report is "in agreement" if both signature labels fall in the same coarse bucket
(non-hand-signed = high+moderate; uncertain; style consistency; or likely hand-signed).
-->
Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.
The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
This 23-28 percentage-point gap in intra-report agreement between Firm A and the other firms is consistent with firm-wide (rather than partner-specific) non-hand-signing practice; we do not claim a sharp discontinuity in the formal sense, since classifier calibration, firm-specific document-production pipelines, and signer-mix differences could each contribute to gap magnitude.
We note that this test uses the calibrated classifier of Section III-K rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
## H. Classification Results
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents have no signature whose extracted handwriting could be matched to a registered CPA name (every such signature has `assigned_accountant IS NULL` in the database, typically because the auditor's report page deviates from the standard two-signature layout or the OCRed printed CPA name was not present in the registry); the per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity exists, and these documents are therefore excluded from the classification reported here.
We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
<!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
| Verdict | N (PDFs) | % | Firm A | Firm A % |
|---------|----------|---|--------|----------|
| High-confidence non-hand-signed | 29,529 | 35.0% | 22,970 | 76.0% |
| Moderate-confidence non-hand-signed | 36,994 | 43.8% | 6,311 | 20.9% |
| High style consistency | 5,133 | 6.1% | 183 | 0.6% |
| Uncertain | 12,683 | 15.0% | 758 | 2.5% |
| Likely hand-signed | 47 | 0.1% | 4 | 0.0% |
Per the worst-case aggregation rule of Section III-K, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
-->
Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash $\leq 5$);
36,994 (51.7%) show partial structural similarity (dHash in $[6, 15]$) consistent with replication degraded by scan variations;
and 5,133 (7.2%) show no structural corroboration (dHash $> 15$), suggesting high signing consistency rather than image reproduction.
A cosine-only classifier would treat all 71,656 identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
### 1) Firm A Capture Profile (Consistency Check)
96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the within-firm heterogeneity implied by the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 7.5% signature-level left tail (Section III-H).
The near-zero "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 count here is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset in Table XVI by 4 reports) indicates that the within-firm heterogeneity implied by the 7.5% signature-level left tail (Section IV-D) does not project into the lowest-cosine document-level category under the dual-descriptor rules; it is absorbed instead into the uncertain or high-style-consistency categories at this threshold set.
We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-F.2 is the corresponding external check.
### 2) Cross-Firm Comparison of Dual-Descriptor Convergence
Among the 65,514 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,922 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.
The Firm A denominator (55,922) matches Table IX exactly: both Table IX and the cross-firm decomposition define Firm A membership via the CPA registry (`accountants.firm`), and the cross-firm analysis additionally requires a non-null independent-min dHash record, which all 55,922 Firm A cosine-eligible signatures have in the current database.
This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-G.2) and intra-report consistency (Section IV-G.3) findings.
Counts and percentages are reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (see Appendix B for the table-to-script provenance map).
## I. Ablation Study: Feature Backbone Comparison
To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
Table XVIII presents the comparison.
<!-- TABLE XVIII: Backbone Comparison
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|--------|-----------|--------|-----------------|
| Feature dim | 2048 | 4096 | 1280 |
| Intra mean | 0.821 | 0.822 | 0.786 |
| Inter mean | 0.758 | 0.767 | 0.699 |
| Cohen's d | 0.669 | 0.564 | 0.707 |
| KDE crossover | 0.837 | 0.850 | 0.792 |
| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |
Note: Firm A values in this table are computed over all intra-firm pairwise
similarities (16.0M pairs) for cross-backbone comparability. These differ from
the per-signature best-match statistic used in Section IV-D and visualized in
Table XIII (whole-sample Firm A best-match mean $\approx 0.980$), which reflects
the classification-relevant quantity: the similarity of each signature to its
single closest match from the same CPA.
-->
EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), resulting in lower per-sample classification confidence.
VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.
ResNet-50 provides the best overall balance:
(1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
(2) its tighter distributions yield more reliable individual classifications;
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.