5717d61dd4
Codex (gpt-5.4) second-round review recommended 'minor revision'. This commit addresses all issues flagged in that review.

## Structural fixes

- dHash calibration inconsistency (codex #1, most important): Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come from the whole-sample Firm A cosine-conditional dHash distribution (median=5, P95=15), not from the calibration-fold independent-minimum dHash distribution (median=2, P95=9) which we report elsewhere as descriptive anchors. Added explicit note about the two dHash conventions and their relationship.
- Section IV-H framing (codex #2): Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence" to "Additional Firm A Benchmark Validation" and clarified in the section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully threshold-free, H.3 uses the calibrated classifier. H.3's concluding sentence now says "the substantive evidence lies in the cross-firm gap" rather than claiming the test is threshold-free.
- Table XVI 93,979 typo fixed (codex #3): Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm).
- Held-out Firm A denominator 124+54=178 vs 180 (codex #4): Added explicit note that 2 CPAs were excluded due to disambiguation ties in the CPA registry.
- Table VIII duplication (codex #5): Removed the duplicate accountant-level-only Table VIII comment; the comprehensive cross-level Table VIII subsumes it. Text now says "accountant-level rows of Table VIII (below)".
- Anonymization broken in Tables XIV-XVI (codex #6): Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/"Firm D" across Tables XIV, XV, XVI. Table and caption language updated accordingly.
- Table X unit mismatch (codex #7): Dropped precision, recall, F1 columns. Table now reports FAR (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR (against the byte-identical positive anchor). III-K and IV-G.1 text updated to justify the change.

## Sentence-level fixes

- "three independent statistical methods" in Methodology III-A -> "three methodologically distinct statistical methods".
- "three independent methods" in Conclusion -> "three methodologically distinct methods".
- Abstract "~0.006 converging" now explicitly acknowledges that BD/McCrary produces no significant accountant-level discontinuity.
- Conclusion ditto.
- Discussion limitation sentence "BD/McCrary should be interpreted at the accountant level for threshold-setting purposes" rewritten to reflect v3.3 result that BD/McCrary is a diagnostic, not a threshold estimator, at the accountant level.
- III-H "two analyses" -> "three analyses" (H.1 longitudinal stability, H.2 partner ranking, H.3 intra-report consistency).
- Related Work White 1982 overclaim rewritten: "consistent estimators of the pseudo-true parameter that minimizes KL divergence" replaces "guarantees asymptotic recovery".
- III-J "behavior is close to discrete" -> "practice is clustered".
- IV-D.2 pivot sentence "discreteness of individual behavior yields bimodality" -> "aggregation over signatures reveals clustered (though not sharply discrete) patterns".

Target journal remains IEEE Access. Output: Paper_A_IEEE_Access_Draft_v3.docx (395 KB). Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# IV. Experiments and Results

## A. Experimental Setup

All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration.
Feature extraction used PyTorch 2.9 with torchvision model implementations.
The complete pipeline---from raw PDF processing through final classification---was implemented in Python.

## B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

<!-- TABLE III: Extraction Results

| Metric | Value |
|--------|-------|
| Documents processed | 86,071 |
| Documents with detections | 85,042 (98.8%) |
| Total signatures extracted | 182,328 |
| Avg. signatures per document | 2.14 |
| CPA-matched signatures | 168,755 (92.6%) |
| Processing rate | 43.1 docs/sec |
-->
## C. All-Pairs Intra-vs-Inter Class Distribution Analysis

Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
This all-pairs analysis uses a different unit of analysis than the per-signature best-match statistics of Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
Table IV summarizes the distributional statistics.

<!-- TABLE IV: Cosine Similarity Distribution Statistics

| Statistic | Intra-class | Inter-class |
|-----------|-------------|-------------|
| N (pairs) | 41,352,824 | 500,000 |
| Mean | 0.821 | 0.758 |
| Std. Dev. | 0.098 | 0.090 |
| Median | 0.836 | 0.774 |
| Skewness | −0.711 | −0.851 |
| Kurtosis | 0.550 | 1.027 |
-->

Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.

The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney $p < 0.001$, K-S 2-sample $p < 0.001$).

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
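The KDE crossover described above can be located numerically by evaluating both kernel density estimates on a shared grid and finding where their difference changes sign. A minimal sketch on synthetic similarity scores (the normal stand-in distributions and grid resolution are assumptions for illustration, not the paper's data or estimator settings):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_crossover(intra, inter, n_grid=2000):
    """Locate the similarity value where the intra- and inter-class
    KDE curves intersect (the approximate equal-prior Bayes boundary)."""
    grid = np.linspace(min(intra.min(), inter.min()),
                       max(intra.max(), inter.max()), n_grid)
    diff = gaussian_kde(intra)(grid) - gaussian_kde(inter)(grid)
    idx = np.where(np.diff(np.sign(diff)) != 0)[0]  # sign changes = intersections
    mid = (intra.mean() + inter.mean()) / 2
    # keep the intersection nearest the midpoint of the two class means
    return grid[idx[np.argmin(np.abs(grid[idx] - mid))]]

rng = np.random.default_rng(0)
intra = rng.normal(0.84, 0.05, 5000)  # stand-in intra-class similarities
inter = rng.normal(0.76, 0.05, 5000)  # stand-in inter-class similarities
print(round(float(kde_crossover(intra, inter)), 3))
```

With equal-variance synthetic classes the crossing falls near the midpoint of the two means; on the real, skewed distributions it need not.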
## D. Hartigan Dip Test: Unimodality at the Signature Level

Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).

<!-- TABLE V: Hartigan Dip Test Results

| Distribution | N | dip | p-value | Verdict (α=0.05) |
|--------------|---|-----|---------|------------------|
| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal |
| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
| Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal |
| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
-->

Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews.
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.

This asymmetry between signature level and accountant level is itself an empirical finding.
It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
### 1) Burgstahler-Dichev / McCrary Discontinuity

Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a single significant transition at 0.985 for both Firm A and the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2, again for both Firm A and the full sample.
We note that the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
In contrast, the dHash transition at distance 2 is a substantively meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication.
At the accountant level the test does not produce a significant $Z^- \rightarrow Z^+$ transition in either the cosine-mean or the dHash-mean distribution (Section IV-E), reflecting that accountant aggregates are smooth at the bin resolution the test requires rather than exhibiting a sharp density discontinuity.
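The dHash descriptor and the Hamming distances it induces can be sketched generically as follows; this is a standard difference-hash over a grayscale array, and the 8×8 grid size is the common convention rather than a confirmed detail of the paper's configuration:

```python
import numpy as np

def dhash_bits(gray, hash_size=8):
    """Difference hash: downsample to a (hash_size x hash_size+1) grid and
    compare each cell to its right neighbor, yielding hash_size**2 bits."""
    h, w = gray.shape
    rows = np.arange(hash_size) * h // hash_size            # nearest-neighbor
    cols = np.arange(hash_size + 1) * w // (hash_size + 1)  # downsampling grid
    small = gray[np.ix_(rows, cols)]
    return (small[:, 1:] > small[:, :-1]).astype(np.uint8).ravel()

def hamming(a, b):
    """Hamming distance between two equal-length bit vectors."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 72))
noisy = img + rng.integers(-2, 3, size=img.shape)  # mild scan-like perturbation
print(hamming(dhash_bits(img), dhash_bits(img)))   # 0: identical pixels, identical hash
print(hamming(dhash_bits(img), dhash_bits(noisy))) # typically a small distance
```

Byte-identical replication yields distance 0, while mild noise flips only the few bits whose neighbor contrasts are smaller than the perturbation, which is the intuition behind the distance-2 boundary discussed above.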
### 2) Beta Mixture at Signature Level: A Forced Fit

Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.

The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.
This motivates the pivot to the accountant-level analysis in Section IV-E, where aggregation over signatures reveals clustered (though not sharply discrete) patterns in individual-level signing *practice* that the signature-level analysis lacks.
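The logit-GMM robustness check can be sketched as: map cosine values to the logit scale, fit a two-component 1D Gaussian mixture by EM, and map the density crossing between the component means back through the sigmoid. The EM implementation below is a minimal illustration on synthetic bimodal data (the mode locations and spreads are assumptions), not the paper's estimation code:

```python
import numpy as np

def em_gmm2(z, iters=300):
    """Two-component 1D Gaussian mixture fit by EM; returns (weights, means, variances)."""
    mu = np.array([np.quantile(z, 0.25), np.quantile(z, 0.75)])
    var, w = np.full(2, np.var(z)), np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        logp = -0.5 * ((z[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
        r = w * np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted parameter updates
        n = r.sum(axis=0)
        w, mu = n / len(z), (r * z[:, None]).sum(axis=0) / n
        var = (r * (z[:, None] - mu) ** 2).sum(axis=0) / n + 1e-9
    return w, mu, var

def density_crossing(w, mu, var):
    """Grid-search the point between the component means where the two
    weighted component densities are equal."""
    g = np.linspace(mu.min(), mu.max(), 4000)
    d = w * np.exp(-0.5 * (g[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return g[np.argmin(np.abs(d[:, 0] - d[:, 1]))]

rng = np.random.default_rng(2)
cos = np.concatenate([rng.normal(0.90, 0.02, 3000),    # stand-in hand-signed mode
                      rng.normal(0.98, 0.004, 7000)])  # stand-in replicated mode
cos = np.clip(cos, 1e-3, 1 - 1e-3)         # keep the logit finite
z = np.log(cos / (1 - cos))                # logit transform of cosine values
w, mu, var = em_gmm2(z)
x = density_crossing(w, mu, var)
print(round(float(1 / (1 + np.exp(-x))), 3))  # crossing mapped back to cosine scale
```

The logit transform is what lets a Gaussian family approximate a bounded (0, 1) similarity score; the forced-fit diagnosis above corresponds to the case where the two fitted components do not describe distinct mechanisms and their crossings disagree across parameterizations.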
## E. Accountant-Level Gaussian Mixture

We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
BIC selects $K^* = 3$ (Table VI).

<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)

| K | BIC | AIC | Converged |
|---|-----|-----|-----------|
| 1 | −316 | −339 | ✓ |
| 2 | −545 | −595 | ✓ |
| 3 | **−792** | **−869** | ✓ (best) |
| 4 | −779 | −883 | ✓ |
| 5 | −747 | −879 | ✓ |
-->
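The BIC selection follows the usual Schwarz criterion. A minimal helper, with the free-parameter count written out for a full-covariance 2D GMM to match the two-descriptor setting; the log-likelihood values below are illustrative placeholders, not the fitted values behind Table VI:

```python
import numpy as np

def gmm2d_n_params(k):
    # (K-1) free weights + 2K mean entries + 3K free full-covariance entries
    return (k - 1) + 2 * k + 3 * k

def bic(log_lik, k, n_obs):
    # Schwarz criterion; lower is better under this sign convention
    return gmm2d_n_params(k) * np.log(n_obs) - 2.0 * log_lik

# illustrative log-likelihoods for K = 1..5 (placeholders, NOT the fitted values)
loglik = {1: 180.0, 2: 310.0, 3: 450.0, 4: 460.0, 5: 468.0}
scores = {k: bic(ll, k, n_obs=686) for k, ll in loglik.items()}
print(min(scores, key=scores.get))  # K with the lowest (best) BIC
```

The pattern in Table VI, where likelihood keeps improving past $K = 3$ but BIC worsens, is exactly the penalty term $(6K - 1)\ln 686$ outpacing the likelihood gain.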
Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.

<!-- TABLE VII: Accountant-Level 3-Component GMM

| Comp. | cos_mean | dHash_mean | weight | n | Dominant firms |
|-------|----------|------------|--------|---|----------------|
| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) |
| C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) |
| C3 (hand-signed tendency) | 0.928 | 11.17 | 0.28 | 184 | smaller domestic firms |
-->

Three empirical findings stand out.
First, of the 180 CPAs in the Firm A registry, 171 have $\geq 10$ signatures and therefore enter the accountant-level GMM (the remaining 9 have too few signatures for reliable aggregates and are excluded from this analysis only).
Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2.
This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
Third, applying the three-method framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): the KDE antimode $= 0.973$, the Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary test does not produce a significant transition at the accountant level.
For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.
Table VIII summarizes all threshold estimates produced by the three methods across the two analysis levels for a compact cross-level comparison.

<!-- TABLE VIII: Threshold Convergence Summary Across Levels

| Level / method | Cosine threshold | dHash threshold |
|----------------|-------------------|------------------|
| Signature-level, all-pairs KDE crossover | 0.837 | — |
| Signature-level, BD/McCrary transition | 0.985 | 2.0 |
| Signature-level, Beta-2 EM crossing (Firm A) | 0.977 | — |
| Signature-level, logit-GMM-2 crossing (Full) | 0.980 | — |
| Accountant-level, KDE antimode | **0.973** | **4.07** |
| Accountant-level, BD/McCrary transition | no transition | no transition |
| Accountant-level, Beta-2 EM crossing | **0.979** | **3.41** |
| Accountant-level, logit-GMM-2 crossing | **0.976** | **3.93** |
| Accountant-level, 2D-GMM 2-comp marginal crossing | 0.945 | 8.10 |
| Firm A calibration-fold cosine P5 | 0.941 | — |
| Firm A calibration-fold dHash P95 | — | 9 |
| Firm A calibration-fold dHash median | — | 2 |
-->
Methods 1 and 3 (the KDE antimode, and the Beta-2 crossing together with its logit-GMM robustness check) converge at the accountant level to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$, while Method 2 (BD/McCrary) does not produce a significant discontinuity.
This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
## F. Calibration Validation with Firm A

Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).

<!-- TABLE IX: Firm A Whole-Sample Capture Rates (consistency check, NOT external validation)

| Rule | Firm A rate | n / N |
|------|-------------|-------|
| cosine > 0.837 (all-pairs KDE crossover) | 99.93% | 60,405 / 60,448 |
| cosine > 0.941 (calibration-fold P5) | 95.08% | 57,473 / 60,448 |
| cosine > 0.945 (2D GMM marginal crossing) | 94.52% | 57,131 / 60,448 |
| cosine > 0.95 | 92.51% | 55,916 / 60,448 |
| cosine > 0.973 (accountant KDE antimode) | 80.91% | 48,910 / 60,448 |
| dHash_indep ≤ 5 (whole-sample cosine-conditional median) | 84.20% | 50,897 / 60,448 |
| dHash_indep ≤ 8 | 95.17% | 57,521 / 60,448 |
| dHash_indep ≤ 15 (whole-sample cosine-conditional P95) | 99.83% | 60,345 / 60,448 |
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | 54,373 / 60,448 |

All rates computed exactly from the full Firm A sample (N = 60,448 signatures).
The threshold 0.941 corresponds to the 5th percentile of the calibration-fold Firm A cosine distribution (see Section IV-G for the held-out validation that addresses the circularity inherent in this whole-sample table).
-->

Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H.
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix.
Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
## G. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation

We report three validation analyses corresponding to the anchors of Section III-K.

### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor

Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor and FRR against the byte-identical positive anchor in Table X; these two error rates are well defined within their respective anchor populations.
The Equal-Error-Rate point, interpolated at FAR $=$ FRR, is located at cosine $= 0.990$ with EER $\approx 0$, which is trivially small because every byte-identical positive falls at cosine very close to 1.
<!-- TABLE X: Cosine Threshold Sweep (positives = 310 byte-identical signatures; negatives = 50,000 inter-CPA pairs)

| Threshold | FAR | FAR 95% Wilson CI | FRR (byte-identical) |
|-----------|-----|-------------------|----------------------|
| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] | 0.000 |
| 0.900 | 0.0233 | [0.0221, 0.0247] | 0.000 |
| 0.945 (2D GMM marginal) | 0.0008 | [0.0006, 0.0011] | 0.000 |
| 0.950 | 0.0007 | [0.0005, 0.0009] | 0.000 |
| 0.973 (accountant KDE antimode) | 0.0003 | [0.0002, 0.0004] | 0.000 |
| 0.979 (accountant Beta-2) | 0.0002 | [0.0001, 0.0004] | 0.000 |
-->

Two caveats apply.
First, the gold-positive anchor is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
Zero FRR against this subset does not establish zero FRR against the broader positive class, and the reported FRR should therefore be interpreted as a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable miss rate.
Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X evaluate thresholds that were not chosen to optimize Table X.
The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.
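The FAR / FRR sweep with Wilson intervals reduces to a few lines. A sketch on simulated stand-in scores (the score distributions are illustrative; only the anchor sizes 50,000 and 310 echo the text above):

```python
import math
import random

def wilson_ci(k, n, z=1.96):
    """Wilson 95% score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def far_frr(neg, pos, thr):
    """FAR: negatives scoring above thr; FRR: positives scoring at or below thr."""
    fa = sum(s > thr for s in neg)
    fr = sum(s <= thr for s in pos)
    return fa / len(neg), fr / len(pos), wilson_ci(fa, len(neg))

random.seed(0)
neg = [random.gauss(0.76, 0.05) for _ in range(50_000)]  # stand-in inter-CPA pairs
pos = [random.gauss(0.995, 0.002) for _ in range(310)]   # stand-in byte-identical anchor
far, frr, (lo, hi) = far_frr(neg, pos, thr=0.95)
print(f"FAR={far:.4f} [{lo:.4f}, {hi:.4f}]  FRR={frr:.4f}")
```

The Wilson interval is preferred over the normal approximation here because FAR at the stricter thresholds is a handful of counts out of 50,000, where the Wald interval misbehaves.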
### 2) Held-Out Firm A Validation (breaks calibration-validation circularity)

We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
The total of 178 Firm A CPAs differs from the 180 in the Firm A registry because two CPAs whose signatures could not be matched to a single assigned-accountant record (disambiguation ties in the CPA registry) are excluded from both folds; this exclusion has no effect on the accountant-level mixture analysis of Section IV-E, which uses the $\geq 10$-signature subset of 171 CPAs.
Thresholds are re-derived from calibration-fold percentiles only.
Table XI reports held-out-fold capture rates with Wilson 95% confidence intervals.

<!-- TABLE XI: Held-Out Firm A Capture Rates (30% fold, N = 15,332 signatures)

| Rule | Capture rate | Wilson 95% CI | k / n |
|------|--------------|---------------|-------|
| cosine > 0.837 | 99.93% | [99.87%, 99.96%] | 15,321 / 15,332 |
| cosine > 0.9407 (calib-fold P5) | 95.64% | [95.31%, 95.95%] | 14,664 / 15,332 |
| cosine > 0.945 (2D GMM marginal) | 94.78% | [94.41%, 95.12%] | 14,532 / 15,332 |
| cosine > 0.950 | 93.61% | [93.21%, 93.98%] | 14,353 / 15,332 |
| dHash_indep ≤ 5 | 87.84% | [87.31%, 88.34%] | 13,469 / 15,332 |
| dHash_indep ≤ 8 | 96.13% | [95.82%, 96.43%] | 14,739 / 15,332 |
| dHash_indep ≤ 9 (calib-fold P95) | 97.48% | [97.22%, 97.71%] | 14,942 / 15,332 |
| dHash_indep ≤ 15 | 99.84% | [99.77%, 99.89%] | 15,308 / 15,332 |
| cosine > 0.95 AND dHash_indep ≤ 8 | 91.54% | [91.09%, 91.97%] | 14,035 / 15,332 |

Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash median = 2, P95 = 9.
-->

The held-out rates match the whole-sample rates of Table IX within each rule's Wilson confidence interval, confirming that the calibration-derived thresholds generalize to Firm A CPAs that did not contribute to calibration.
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 91.54% [91.09%, 91.97%] of the held-out Firm A population, consistent with Firm A's interview-reported signing mix and with the replication-dominated framing of Section III-H.
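A CPA-level split, as opposed to a signature-level split, guarantees that no signer contributes to both folds. A minimal sketch (record shapes and the toy data are illustrative):

```python
import random

def group_split(records, group_key, frac=0.7, seed=0):
    """Split records at the group level: every record of a given group
    (here, a CPA) lands entirely in one fold."""
    groups = sorted({group_key(r) for r in records})
    random.Random(seed).shuffle(groups)
    calib = set(groups[:round(len(groups) * frac)])
    fold_a = [r for r in records if group_key(r) in calib]
    fold_b = [r for r in records if group_key(r) not in calib]
    return fold_a, fold_b

# toy records: (cpa_id, cosine); 178 CPAs with three signatures each
records = [(cpa, 0.9) for cpa in range(178) for _ in range(3)]
a, b = group_split(records, group_key=lambda r: r[0])
print(len({r[0] for r in a}), len({r[0] for r in b}))  # 125 53: CPA counts summing to 178
```

The exact 124 / 54 split reported above depends on the rounding convention and random draw used; the invariant that matters is the empty intersection of the two CPA sets.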
### 3) Sanity Sample

A 30-signature stratified visual sanity sample (six signatures each from the pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) showed rater agreement with the classifier in all 30 cases; this sample served only as a spot check and is not used to compute reported metrics.
## H. Additional Firm A Benchmark Validation

The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.
This section reports three complementary analyses that go beyond the whole-sample capture rates.
Subsection H.1 uses a fixed 0.95 cutoff but derives information from the longitudinal stability of rates rather than from the absolute rate in any single year.
Subsection H.2 is fully threshold-independent (it uses only ordinal ranking).
Subsection H.3 applies the calibrated classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the informative quantity is the cross-firm *gap* rather than the absolute agreement rate at any one firm.
### 1) Year-by-Year Stability of the Firm A Left Tail

Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year.
Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign.
Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.

<!-- TABLE XIII: Firm A Per-Year Cosine Distribution

| Year | N sigs | mean cosine | % below 0.95 |
|------|--------|-------------|--------------|
| 2013 | 2,167 | 0.9733 | 12.78% |
| 2014 | 5,256 | 0.9781 | 8.69% |
| 2015 | 5,484 | 0.9793 | 7.46% |
| 2016 | 5,739 | 0.9811 | 6.92% |
| 2017 | 5,796 | 0.9814 | 6.69% |
| 2018 | 5,986 | 0.9808 | 6.58% |
| 2019 | 6,122 | 0.9780 | 8.71% |
| 2020 | 6,122 | 0.9770 | 9.46% |
| 2021 | 5,996 | 0.9792 | 8.37% |
| 2022 | 5,918 | 0.9819 | 6.25% |
| 2023 | 5,862 | 0.9860 | 3.75% |
-->

The left-tail share stays between roughly 4% and 13% throughout the sample period and shows no pre/post-2020 level shift: the 2013-2019 mean left-tail share is 8.26% and the 2020-2023 mean is 6.96%.
The lowest observed share is in 2023 (3.75%), consistent with firm-level electronic signing systems producing more uniform output than earlier manual scanning-and-stamping, not less.
This stability supports the replication-dominated framing: a persistent minority of hand-signing Firm A partners is consistent with a Beta left tail that is stable across production technologies, whereas a noise-only explanation would predict a shrinking share as technology improved.
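The pre/post-2020 comparison is simple (unweighted) arithmetic over the Table XIII shares:

```python
# Left-tail shares (% of Firm A signatures below cosine 0.95) from Table XIII
share = {2013: 12.78, 2014: 8.69, 2015: 7.46, 2016: 6.92, 2017: 6.69,
         2018: 6.58, 2019: 8.71, 2020: 9.46, 2021: 8.37, 2022: 6.25, 2023: 3.75}

pre = [v for y, v in share.items() if y <= 2019]   # 2013-2019 period
post = [v for y, v in share.items() if y >= 2020]  # 2020-2023 period
print(round(sum(pre) / len(pre), 2), round(sum(post) / len(post), 2))  # 8.26 6.96
```

These unweighted year means reproduce the 8.26% and 6.96% figures quoted above; a signature-weighted mean would differ slightly because yearly sample sizes vary.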
### 2) Partner-Level Similarity Ranking

If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all Big-4 auditor-years.
We test this prediction directly.

For each auditor-year (CPA $\times$ fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
Firm A accounts for 1,287 of these (27.8% baseline share).
Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.
<!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)

| Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
|-------|-------------|--------|--------|--------|--------|-----------|--------------|
| 10% | 462 | 443 | 2 | 3 | 0 | 14 | 95.9% |
| 25% | 1,157 | 1,043 | 32 | 23 | 9 | 50 | 90.1% |
| 50% | 2,314 | 1,220 | 473 | 273 | 102 | 246 | 52.7% |
-->

Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile.
Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.
<!-- TABLE XV: Firm A Share of Top-10% Similarity by Year

| Year | N auditor-years | Top-10% k | Firm A in top-10% | Firm A share | Firm A baseline |
|------|-----------------|-----------|-------------------|--------------|-----------------|
| 2013 | 324 | 32 | 32 | 100.0% | 26.2% |
| 2014 | 399 | 39 | 39 | 100.0% | 27.1% |
| 2015 | 394 | 39 | 38 | 97.4% | 27.2% |
| 2016 | 413 | 41 | 39 | 95.1% | 27.4% |
| 2017 | 415 | 41 | 41 | 100.0% | 27.9% |
| 2018 | 434 | 43 | 43 | 100.0% | 28.1% |
| 2019 | 429 | 42 | 42 | 100.0% | 28.2% |
| 2020 | 430 | 43 | 38 | 88.4% | 28.3% |
| 2021 | 450 | 45 | 44 | 97.8% | 28.4% |
| 2022 | 467 | 46 | 43 | 93.5% | 28.5% |
| 2023 | 474 | 47 | 46 | 97.9% | 28.5% |
-->

This over-representation is a direct consequence of firm-wide non-hand-signing practice and is not derived from any threshold we subsequently calibrate.
It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.
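The occupancy statistic behind Tables XIV and XV is a straightforward rank computation. A sketch on synthetic auditor-year scores (the firm labels and score distributions are illustrative stand-ins; only the 1,287 / 4,629 counts echo the text):

```python
import random

def topk_share(rows, firm, k_frac):
    """Share of the top k_frac of rows, ranked by score descending,
    that belongs to `firm`; rows are (firm, score) pairs."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    bucket = ranked[:round(len(ranked) * k_frac)]
    return sum(f == firm for f, _ in bucket) / len(bucket)

rng = random.Random(3)
rows = ([("A", rng.gauss(0.98, 0.01)) for _ in range(1287)] +     # high-similarity firm
        [("other", rng.gauss(0.95, 0.02)) for _ in range(3342)])  # remaining auditor-years
baseline = 1287 / len(rows)
print(round(baseline, 3), round(topk_share(rows, "A", 0.10), 3))
```

Because only ordinal ranks enter the statistic, it is invariant to any monotone rescaling of the similarity scores, which is what makes this test threshold-free.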
### 3) Intra-Report Consistency

Taiwanese statutory audit reports are co-signed by two engagement partners (a primary and a secondary signer).
Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification.
Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.

For each report with exactly two signatures and complete per-signature data (83,970 reports assigned to a single firm, plus 384 mixed-firm reports with one signer from each of two firms, for 84,354 in total), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.
Table XVI reports per-firm intra-report agreement (firm assignment is defined by the firm identity of both signers; mixed-firm reports are reported separately).
<!-- TABLE XVI: Intra-Report Classification Agreement by Firm
|
||
| Firm | Total 2-signer reports | Both non-hand-signed | Both uncertain | Both style | Both hand-signed | Mixed | Agreement rate |
|
||
|------|-----------------------|----------------------|----------------|------------|------------------|-------|----------------|
|
||
| Firm A | 30,222 | 26,435 | 734 | 0 | 4 | 3,049 | **89.91%** |
|
||
| Firm B | 17,121 | 9,260 | 2,159| 5 | 6 | 5,691 | 66.76% |
|
||
| Firm C | 19,112 | 8,983 | 3,035| 3 | 5 | 7,086 | 62.92% |
|
||
| Firm D | 8,375 | 3,028 | 2,376| 0 | 3 | 2,968 | 64.56% |
|
||
| Non-Big-4 | 9,140 | 1,671 | 3,945| 18| 27| 3,479 | 61.94% |
|
||
|
||
A report is "in agreement" if both signature labels fall in the same coarse bucket
|
||
(non-hand-signed = high+moderate; uncertain; style consistency; or likely hand-signed).
|
||
-->
|
||
|
||
Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.
|
||
The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
|
||
This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) non-hand-signing practice.
|
||
|
||
We note that this test uses the calibrated classifier of Section III-L rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
## I. Classification Results
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
<!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)

| Verdict | N (PDFs) | % | Firm A | Firm A % |
|---------|----------|---|--------|----------|
| High-confidence non-hand-signed | 29,529 | 35.0% | 22,970 | 76.0% |
| Moderate-confidence non-hand-signed | 36,994 | 43.8% | 6,311 | 20.9% |
| High style consistency | 5,133 | 6.1% | 183 | 0.6% |
| Uncertain | 12,683 | 15.0% | 758 | 2.5% |
| Likely hand-signed | 47 | 0.1% | 4 | 0.0% |

Per the worst-case aggregation rule of Section III-L, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
-->
Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash $\leq 5$);
36,994 (51.6%) show partial structural similarity (dHash in $[6, 15]$) consistent with replication degraded by scan variations;
and 5,133 (7.2%) show no structural corroboration (dHash $> 15$), suggesting high signing consistency rather than image reproduction.
A cosine-only classifier would treat all 71,656 identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
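The stratification just described can be sketched as a simple decision rule. The function name and return strings below are illustrative; the cutoffs (cosine $0.95$; dHash $\leq 5$, $[6,15]$, $>15$) are the calibrated values from Section III-L.

```python
def dual_descriptor_stratum(best_match_cosine: float, min_dhash: int) -> str:
    """Stratify a signature by its best same-CPA match: the cosine layer
    gates admission, and the dHash structural layer splits the admitted
    population into three strata (cutoffs from Section III-L)."""
    if best_match_cosine <= 0.95:
        return "below cosine cutoff"  # handled by other classification rules
    if min_dhash <= 5:
        return "high-confidence non-hand-signed"      # converging evidence
    if min_dhash <= 15:
        return "moderate-confidence non-hand-signed"  # partial similarity
    return "high style consistency"   # no structural corroboration
```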
### 1) Firm A Capture Profile (Consistency Check)
96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers.
The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.
### 2) Cross-Method Agreement
Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
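For readers unfamiliar with difference hashing: the structural layer reduces each signature crop to a 64-bit fingerprint whose Hamming distance is the dHash statistic thresholded above. A minimal sketch, assuming the crop has already been resized to a 9x8 grayscale grid upstream (resizing and cropping are not shown):

```python
def dhash_bits(gray_9x8):
    """64-bit difference hash: for each of 8 rows of 9 grayscale values,
    compare the 8 adjacent horizontal pixel pairs; bit = 1 if the left
    pixel is brighter than its right neighbor."""
    bits = 0
    for row in gray_9x8:            # 8 rows, 9 pixels each
        for x in range(8):
            bits = (bits << 1) | (row[x] > row[x + 1])
    return bits

def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two 64-bit hashes; this is the
    dHash distance thresholded at <=5 and <=15 in Section III-L."""
    return bin(h1 ^ h2).count("1")
```

Because the hash encodes only local brightness gradients, identical replicated images match exactly (distance 0), while scan noise perturbs a few bits, which motivates the tolerance bands rather than an exact-match rule.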
## J. Ablation Study: Feature Backbone Comparison
To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
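Under all three backbones, similarity scoring reduces to the same operation once embeddings are L2-normalized: cosine similarity becomes a plain dot product. A dimension-agnostic sketch (pure Python for clarity; the actual pipeline operates on 2048-, 4096-, or 1280-dim feature vectors):

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit length, so that cosine similarity
    between normalized vectors is just their dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity of two backbone embeddings."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))
```

Normalizing once at extraction time lets the 16.0M pairwise comparisons be computed as matrix products rather than per-pair norm evaluations.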
Table XVIII presents the comparison.
<!-- TABLE XVIII: Backbone Comparison

| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|--------|-----------|--------|-----------------|
| Feature dim | 2048 | 4096 | 1280 |
| Intra mean | 0.821 | 0.822 | 0.786 |
| Inter mean | 0.758 | 0.767 | 0.699 |
| Cohen's d | 0.669 | 0.564 | 0.707 |
| KDE crossover | 0.837 | 0.850 | 0.792 |
| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |

Note: Firm A values in this table are computed over all intra-firm pairwise similarities (16.0M pairs) for cross-backbone comparability. These differ from the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect the classification-relevant statistic: the similarity of each signature to its single closest match from the same CPA.
-->
EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), resulting in lower per-sample classification confidence.
VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.
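For reference, Cohen's $d$ here measures the standardized gap between the intra-class and inter-class similarity distributions. The sketch below assumes the standard pooled-variance convention; the comparison above does not state which pooling convention was used, so treat this as illustrative.

```python
import math

def cohens_d(intra_mean, intra_std, n_intra,
             inter_mean, inter_std, n_inter):
    """Cohen's d with pooled standard deviation: the separation between
    intra-class and inter-class similarity means, in pooled-SD units."""
    pooled_var = (((n_intra - 1) * intra_std ** 2
                   + (n_inter - 1) * inter_std ** 2)
                  / (n_intra + n_inter - 2))
    return (intra_mean - inter_mean) / math.sqrt(pooled_var)
```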
ResNet-50 provides the best overall balance:
(1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
(2) its tighter distributions yield more reliable individual classifications;
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.