# IV. Experiments and Results

Section IV reports the empirical results that calibrate and characterise the operational classifier of §III-H.1 (calibration developed in §III-L). The primary analyses (§IV-D through §IV-J, and the anchor-based ICCR calibration consolidated in §IV-M) are scoped to the Big-4 sub-corpus (Firms A–D, $n = 437$ CPAs with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. §IV-K reports a full-dataset (686 CPAs) robustness check on the K=3 mixture and per-CPA score-rank convergence; §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F.

## A. Experimental Setup

Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration.
Feature extraction used PyTorch 2.9 with torchvision model implementations.
The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
Because all steps rely on deterministic forward inference over fixed pre-trained weights (no fine-tuning) plus fixed-seed numerical procedures, reported results are platform-independent to within floating-point precision.

## B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
However, the subsequent production deployment provides a practical consistency check: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

<!-- TABLE III: Extraction Results
| Metric | Value |
|--------|-------|
| Documents processed | 86,071 |
| Documents with detections | 85,042 (98.8%) |
| Total signatures extracted | 182,328 |
| Avg. signatures per document | 2.14 |
| CPA-matched signatures | 168,755 (92.6%) |
| Processing rate | 43.1 docs/sec |
-->

The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in the primary analyses of §IV-D through §IV-J.

## C. All-Pairs Intra-vs-Inter Class Distribution Analysis

Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
Table IV summarizes the distributional statistics.

<!-- TABLE IV: Cosine Similarity Distribution Statistics
| Statistic | Intra-class | Inter-class |
|-----------|-------------|-------------|
| N (pairs) | 41,352,824 | 500,000 |
| Mean | 0.821 | 0.758 |
| Std. Dev. | 0.098 | 0.090 |
| Median | 0.836 | 0.774 |
| Skewness | −0.711 | −0.851 |
| Kurtosis | 0.550 | 1.027 |
-->

Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; the subsequent distributional diagnostics in Section IV-D are produced via the methods of Section III-I to avoid single-family distributional assumptions.

The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
Under equal prior probabilities and equal misclassification costs, this crossover is a candidate decision boundary between the two classes; we adopt it only as the operational LH/UN boundary in §III-H.1, not as a natural distributional threshold.
Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.

## D. Big-4 Accountant-Level Distributional Characterisation

This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. The accountant-level dip-test rejection reported in Table V is, per §III-I.4, fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration.

**Table V.** Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32).

| Population | $n$ CPAs | $p_{\text{cos}}$ | $p_{\text{dHash}}$ | Interpretation |
|---|---|---|---|---|
| **Big-4 pooled (primary)** | 437 | $< 5 \times 10^{-4}$ | $< 5 \times 10^{-4}$ | reject unimodality on both axes |
| Firm A pooled alone | 171 | 0.992 | 0.924 | unimodal |
| Firms B + C + D pooled | 266 | 0.998 | 0.906 | unimodal |
| All non-Firm-A pooled | 515 | 0.998 | 0.907 | unimodal |

Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by the bootstrap resolution $1 / 2000 = 5 \times 10^{-4}$ (Script 34 reports this as $p = 0.0000$; we report $p < 5 \times 10^{-4}$ to reflect the resolution). Single-firm dip statistics for Firms B, C, and D were not separately computed.

**Table VI.** Burgstahler-Dichev / McCrary density-smoothness diagnostic on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; $\alpha = 0.05$, two-sided).

| Population | Cosine: significant transition? | dHash: significant transition? |
|---|---|---|
| **Big-4 pooled (primary)** | none ($p > 0.05$) | none ($p > 0.05$) |
| Firm A pooled alone | none | none |
| Firms B + C + D pooled | none | one transition at $\overline{\text{dHash}} = 10.8$ |
| All non-Firm-A pooled | none | one transition at $\overline{\text{dHash}} = 6.6$ |

The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). These off-Big-4 dHash transitions are scope-dependent and are not used as operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes.

## E. Big-4 K=2 / K=3 Mixture Fits

This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings.

**Table VII.** Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-J) and marginal-crossing bootstrap 95% confidence intervals.

| K=2 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight |
|---|---|---|---|
| K=2-a (low-cos / high-dHash position) | 0.954 | 7.14 | 0.689 |
| K=2-b (high-cos / low-dHash position) | 0.983 | 2.41 | 0.311 |

Marginal crossings (point + bootstrap 95% CI, $n_{\text{boot}} = 500$):

| Axis | Point | Bootstrap median | 95% CI | CI half-width |
|---|---|---|---|---|
| cos | 0.9755 | 0.9754 | $[0.9742, 0.9772]$ | 0.0015 |
| dHash | 3.755 | 3.763 | $[3.476, 3.969]$ | 0.246 |

$\text{BIC}(K{=}2) = -1108.45$ (Script 34).

**Table VIII.** Big-4 K=3 mixture components (descriptive firm-compositional partition per §III-J; not mechanism clusters).

| K=3 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position |
|---|---|---|---|---|
| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner |
| C2 | 0.9558 | 6.66 | 0.536 | central region |
| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner |

$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.

## F. Convergent Internal-Consistency Checks

This section reports the empirical evidence for §III-K's three-score internal-consistency analysis. We re-emphasise the §III-K caveat: the three scores are deterministic functions of the same per-CPA descriptor pair $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ and are *not statistically independent measurements*. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth.

**Table IX.** Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, $n = 437$.

| Score pair | Spearman $\rho$ | $p$-value |
|---|---|---|
| K=3 P(C1) vs deployed box-rule less-replication-dominated rate | $+0.9627$ | $< 10^{-248}$ |
| Reverse-anchor cosine percentile vs deployed box-rule less-replication-dominated rate | $+0.8890$ | $< 10^{-149}$ |
| K=3 P(C1) vs Reverse-anchor cosine percentile | $+0.8794$ | $< 10^{-142}$ |

(Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on $n = 249$ non-Big-4 CPAs; reference centre $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$.

**Table X.** Per-firm summary across the three feature-derived scores, Big-4.

| Firm | $n$ CPAs | mean $P(\text{C1})$ | mean reverse-anchor score | mean deployed less-replication-dominated rate |
|---|---|---|---|---|
| Firm A | 171 | 0.0072 | $-0.9726$ | 0.1935 |
| Firm B | 112 | 0.1410 | $-0.8201$ | 0.6962 |
| Firm C | 102 | 0.3110 | $-0.7672$ | 0.7896 |
| Firm D | 52 | 0.2406 | $-0.7125$ | 0.7608 |

(Source: Script 38 per-firm summary; reverse-anchor score is sign-flipped so that *higher* values indicate deeper into the reference left tail = less replication-dominated relative to the non-Big-4 reference.)

The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A firms as less replication-dominated. The K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Score 1 and Score 3) place Firm C at the least-replication-dominated end of Big-4; the reverse-anchor cosine percentile (Score 2) ranks Firm D fractionally above Firm C. This residual within-Big-4-non-A disagreement is a design feature of the reverse-anchor metric: Score 2 measures only the marginal cosine percentile under the non-Big-4 reference, so a firm with a slightly higher cosine but a markedly different dHash distribution (Firm D vs Firm C) can score higher on Score 2 while scoring lower on Scores 1 and 3, both of which use both descriptors.

**Table XI.** Per-signature Cohen $\kappa$ (binary collapse, replication-dominated vs less-replication-dominated), $n = 150{,}442$ Big-4 signatures.

| Pair | Cohen $\kappa$ |
|---|---|
| deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) vs per-CPA K=3 hard label | 0.662 |
| deployed binary high-confidence box rule vs per-signature K=3 hard label | 0.559 |
| Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 |

(Source: Script 39.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) retains its prior calibration and capture-rate evidence (Appendix B; cross-referenced in §IV-J).

## G. Leave-One-Firm-Out Reproducibility

This section reports the firm-level cross-validation evidence motivating §III-J's "K=3 descriptive, not operational" framing.

**Table XII.** K=2 leave-one-firm-out across the four Big-4 folds.

| Held-out firm | $n_{\text{train}}$ | $n_{\text{held}}$ | Fold rule (cos cut, dHash cut) | Held-out classified as templated by fold rule |
|---|---|---|---|---|
| Firm A | 266 | 171 | cos $> 0.9380$ AND dHash $\leq 8.79$ | $171 / 171 = 100.00\%$ ($95\%$ Wilson $[97.80\%, 100.00\%]$) |
| Firm B | 325 | 112 | cos $> 0.9744$ AND dHash $\leq 3.98$ | $0 / 112 = 0\%$ ($95\%$ Wilson $[0\%, 3.32\%]$) |
| Firm C | 335 | 102 | cos $> 0.9752$ AND dHash $\leq 3.75$ | $0 / 102 = 0\%$ ($95\%$ Wilson $[0\%, 3.63\%]$) |
| Firm D | 385 | 52 | cos $> 0.9756$ AND dHash $\leq 3.74$ | $0 / 52 = 0\%$ ($95\%$ Wilson $[0\%, 6.88\%]$) |

(Source: Script 36.) Across-fold cosine crossing: pairwise range $[0.9380, 0.9756]$, range = $0.0376$; max absolute deviation from the across-fold mean is $0.028$. This exceeds the report's $0.005$ across-fold stability tolerance by $5.6\times$ and is much larger than the full-Big-4 bootstrap CI half-width of $0.0015$. Together with the all-or-nothing held-out classification pattern (Firm A held out $\Rightarrow$ all held-out CPAs templated; any non-Firm-A firm held out $\Rightarrow$ none templated), this indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.

**Table XIII.** K=3 leave-one-firm-out: C1 component shape and held-out membership.

| Held-out firm | C1 cos (fit) | C1 dHash (fit) | C1 weight (fit) | Held-out C1 hard-label rate | Full-Big-4 baseline C1% | Absolute difference |
|---|---|---|---|---|---|---|
| Full-Big-4 baseline | 0.9457 | 9.17 | 0.143 | — | — | — |
| Firm A held out | 0.9425 | 10.13 | 0.145 | $4.68\%$ | $0.00\%$ | $4.68$ pp |
| Firm B held out | 0.9441 | 9.16 | 0.127 | $7.14\%$ | $8.93\%$ | $1.76$ pp |
| Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp |
| Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp |

(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).

## H. Pixel-Identity Positive-Anchor Miss Rate

This section reports the only conservative hard-positive subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4).

**Table XIV.** Positive-anchor miss rate, $n = 262$ Big-4 byte-identical signatures.

| Classifier | Misclassified as less-replication-dominated | Miss rate | Wilson 95% CI |
|---|---|---|---|
| deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ |
| K=3 per-CPA hard label (C3 = high-cos / low-dHash; descriptive) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ |
| Reverse-anchor (prevalence-calibrated cut) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ |

(Source: Script 40.) Per-firm breakdown of the byte-identical subset: Firm A 145; Firm B 8; Firm C 107; Firm D 2. All three candidate scores correctly assign every byte-identical signature to the replicated class.

We caution that for the deployed box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region). The reverse-anchor cut is chosen by *prevalence calibration* against the box rule's overall replicated rate of $49.58\%$ across Big-4 signatures; this is a documented limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.

## I. Inter-CPA Pair-Level Coincidence Rate

The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It is the per-pair rate at which two signatures from different CPAs satisfy the deployed rule. We do not label it as a False Acceptance Rate because (a) FAR has a biometric-verification meaning that requires ground-truth negative labels, and (b) the inter-CPA negative-anchor assumption is partially violated by within-firm cross-CPA template-like collision structures (§III-L.4 cross-firm hit matrix).

A corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs gives a per-comparison rate of $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$) at the cosine cut $0.95$. The Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs) replicates this number, adds the structural dimension (dHash), and adds joint-rule rates; the §III-L.1 numbers are referenced rather than duplicated here, and the consolidated ICCR calibration appears in §IV-M Tables XXI–XXVI.

## J. Five-Way Per-Signature + Document-Level Classification Output

This section reports the five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. See §III-H.1 for the five-way category definitions and the cosine and dHash cuts; calibration is in §III-L.

**Table XV.** Five-way per-signature category counts, Big-4 sub-corpus, $n = 150{,}442$ classified.

| Category | Long name | $n$ signatures | % of classified |
|---|---|---|---|
| HC | High-confidence non-hand-signed | 74,593 | 49.58% |
| MC | Moderate-confidence non-hand-signed | 39,817 | 26.47% |
| HSC | High style consistency | 314 | 0.21% |
| UN | Uncertain | 35,480 | 23.58% |
| LH | Likely hand-signed | 238 | 0.16% |

(Source: Script 42; 11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded. The $150{,}442$ vs $150{,}453$ distinction — descriptor-complete vs vector-complete — recurs across §IV: descriptor-complete analyses (§IV-D through §IV-J, all using accountant-level aggregates or per-signature category counts derived from the same 150,442-signature substrate) use $n = 150{,}442$; vector- or pair-recomputed analyses (§IV-M.2 Table XXI, §IV-M.3 Table XXII, §IV-M.5 Tables XXIV–XXV; Scripts 40b, 43, 44) use $n = 150{,}453$ because their pair- or pool-level computations load all vector-complete signatures including those failing the descriptor-complete filter. See §III-G for the sample-size reconciliation.)

**Per-firm five-way breakdown (% within firm).**

| Firm | HC | MC | HSC | UN | LH | total signatures |
|---|---|---|---|---|---|---|
| Firm A | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 |
| Firm B | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 |
| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |

(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVI: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVI). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).

**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the worst-case rule (HC > MC > HSC > UN > LH; §III-L), applied to the Big-4 sub-corpus.

**Table XIX.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs.

| Category | Long name | $n$ documents | % |
|---|---|---|---|
| HC | High-confidence non-hand-signed | 46,857 | 62.28% |
| MC | Moderate-confidence non-hand-signed | 19,667 | 26.14% |
| HSC | High style consistency | 167 | 0.22% |
| UN | Uncertain | 8,524 | 11.33% |
| LH | Likely hand-signed | 18 | 0.02% |

(Source: Script 42 document-level table; 379 of 75,233 PDFs carried signatures from more than one Big-4 firm and are reported in the single-firm-PDF per-firm breakdown of the script CSV but pooled into the overall counts here.)

**Per-firm document-level breakdown (single-firm PDFs only).**

| Firm | HC | MC | HSC | UN | LH | total docs |
|---|---|---|---|---|---|---|
| Firm A | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 |
| Firm B | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 |
| Firm C | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 |
| Firm D | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 |

(Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)

The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains its prior calibration (Appendix B); it is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The capture-rate calibration evidence for the moderate band is reported in Appendix B and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).

**Table XVI.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.

| Firm | $n$ | C1 (low-cos / high-dHash) | C2 (central) | C3 (high-cos / low-dHash) | C1 % | C3 % |
|---|---|---|---|---|---|---|
| Firm A | 171 | 0 | 30 | 141 | $0.00\%$ | $82.46\%$ |
| Firm B | 112 | 10 | 102 | 0 | $8.93\%$ | $0.00\%$ |
| Firm C | 102 | 24 | 77 | 1 | $23.53\%$ | $0.98\%$ |
| Firm D | 52 | 6 | 45 | 1 | $11.54\%$ | $1.92\%$ |

(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).

**Document-level worst-case aggregation outputs are reported in Table XIX above.**

## K. Full-Dataset Robustness (light scope)

This section reports the reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + deployed operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the K=3 + deployed-rule convergence reproduces at the wider scope. The §III-H.1 five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band retains its prior calibration (Appendix B; §IV-J).

**Table XVII.** K=3 component comparison, Big-4 sub-corpus vs full dataset.

| K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full |
|---|---|---|---|
| C1 (low-cos / high-dHash) | 0.9457 / 9.17 / 0.143 | 0.9278 / 11.17 / 0.284 | $\lvert\Delta\rvert$ cos 0.018, dHash 1.99, wt 0.141 |
| C2 (central) | 0.9558 / 6.66 / 0.536 | 0.9535 / 6.99 / 0.512 | $\lvert\Delta\rvert$ cos 0.002, dHash 0.33, wt 0.024 |
| C3 (high-cos / low-dHash) | 0.9826 / 2.41 / 0.321 | 0.9826 / 2.40 / 0.205 | $\lvert\Delta\rvert$ cos 0.000, dHash 0.01, wt 0.117 |

(Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.)

**Table XVIII.** Spearman rank correlation between K=3 P(C1) and deployed operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset.

| Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs deployed less-replication-dominated rate) | $p$-value |
|---|---|---|---|
| Big-4 (primary) | 437 | $+0.9627$ | $< 10^{-248}$ |
| Full dataset | 686 | $+0.9558$ | $< 10^{-300}$ |
| $\lvert\rho_{\text{full}} - \rho_{\text{Big-4}}\rvert$ | — | $0.0069$ | — |

(Source: Script 41.)

**Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the deployed box-rule less-replication-dominated rate are preserved at the full scope. Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more non-templated CPAs (mid/small firms); C1 (low-cos / high-dHash) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + deployed-rule convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the primary methodology is restricted to Big-4 by design (§III-G item 4).

## L. Ablation Study: Feature Backbone Comparison

To support the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
The comparison summary is reported in Appendix B (the backbone-ablation table; not the same table as Table XVIII in this section, which reports Big-4 vs full-dataset Spearman drift in §IV-K).

<!-- BACKBONE ABLATION TABLE (rendered in Appendix B):
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|--------|-----------|--------|-----------------|
| Feature dim | 2048 | 4096 | 1280 |
| Intra mean | 0.821 | 0.822 | 0.786 |
| Inter mean | 0.758 | 0.767 | 0.699 |
| Cohen's d | 0.669 | 0.564 | 0.707 |
| KDE crossover | 0.837 | 0.850 | 0.792 |
| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |

Note: Firm A values in this table are computed over all intra-firm pairwise
similarities (16.0M pairs) for cross-backbone comparability. These differ from
the per-signature best-match statistic used in Section IV-D and visualized in
Table XIII (whole-sample Firm A best-match mean $\approx 0.980$), which reflects
the classification-relevant quantity: the similarity of each signature to its
single closest match from the same CPA.
-->

EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), i.e., a wider descriptor dispersion per signature.
VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.

ResNet-50 provides the best overall balance:
(1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
(2) its tighter distributions yield more stable descriptor behaviour at the per-signature level;
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that Firm A replication-dominated signatures are least likely to produce low-similarity outlier pairs under this backbone; and
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.

## M. Anchor-Based ICCR Calibration Results

This section consolidates the empirical results that support the §III-L anchor-based threshold calibration framework.

### M.1 Composition decomposition (Scripts 39b–39e)

**Table XX.** Within-firm and between-firm decomposition of the Big-4 accountant-level dip-test rejection.

| Diagnostic | Scope | Statistic | Implication |
|---|---|---|---|
| Within-firm signature-level cosine dip | Big-4 (4 firms) | $p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ | 0/4 firms reject; cosine within-firm unimodal |
| Within-firm signature-level cosine dip | non-Big-4 (10 firms $\geq 500$ sigs) | $p_{\text{cos}} \in [0.59, 0.99]$ | 0/10 firms reject; cosine within-firm unimodal |
| Within-firm jittered-dHash dip (5 seeds, median) | Big-4 (4 firms) | $p_{\text{med}} \in \{0.999, 0.996, 0.999, 0.9995\}$ | 0/4 firms reject after integer-jitter; raw rejection was integer-tie artefact |
| Big-4 pooled dHash: 2×2 factorial | firm-centred + jittered (5 seeds) | $p_{\text{med}} = 0.35$, 0/5 seeds reject | combined corrections eliminate rejection; multimodality is composition + integer artefact |
| Integer-histogram valley near $\text{dHash} \approx 5$ | within each Big-4 firm | none (0/4 firms) | no within-firm dHash antimode at the deployed HC cutoff |

(Source: Scripts 39b, 39c, 39d, 39e; bootstrap $n_{\text{boot}} = 2000$; jitter $\sim \mathrm{U}[-0.5, +0.5]$.)

### M.2 Anchor-based inter-CPA pair-level ICCR (Script 40b)

**Table XXI.** Big-4 inter-CPA per-comparison ICCR sweep, $n = 5 \times 10^5$ pairs (Big-4 scope).

| Threshold | Per-comparison ICCR | 95% Wilson CI |
|---|---|---|
| cos $> 0.945$ (prior published operating point) | $0.00081$ | $[0.00073, 0.00089]$ |
| cos $> 0.95$ (deployed operating point) | $0.00060$ | $[0.00053, 0.00067]$ |
| cos $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| cos $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ (deployed operating point) | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | $[0.00011, 0.00018]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ |

Conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$ (Wilson 95% $[0.190, 0.285]$; $70$ of $299$ pairs).

The cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I (per-comparison rate $0.0005$). The dHash row and joint row are reported here for the first time on this corpus.

### M.3 Pool-normalised per-signature ICCR (Script 43)

**Table XXII.** Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos $> 0.95$ AND dHash $\leq 5$); $n_{\text{sig}} = 150{,}453$ (vector-complete Big-4); CPA-block bootstrap $n_{\text{boot}} = 1000$.

| Scope | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|---|---|---|---|
| Big-4 pooled (any-pair, deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
| Big-4 pooled (same-pair, stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |
| Firm A (any-pair) | $0.2594$ | — | — |
| Firm B (any-pair) | $0.0147$ | — | — |
| Firm C (any-pair) | $0.0053$ | — | — |
| Firm D (any-pair) | $0.0110$ | — | — |
| Pool-size decile 1 (smallest pools) any-pair | $0.0249$ | — | — |
| Pool-size decile 10 (largest pools) any-pair | $0.1905$ | — | — |

Decile trend is broadly monotone in pool size with two minor reversals (decile 5 and decile 9 dip below their predecessors). Stricter operating point cos $> 0.95$ AND dHash $\leq 3$ (same-pair) gives per-signature ICCR $0.0449$.

### M.4 Document-level ICCR under three alarm definitions (Script 45)

**Table XXIII.** Document-level inter-CPA ICCR by alarm definition; $n_{\text{docs}} = 75{,}233$.

| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
| D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |

Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XIX's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs: all $379$ are $1{:}1$ Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XIX's per-firm rows sum to $74{,}854 = 75{,}233 - 379$.

### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)

**Table XXIV.** Logistic regression of per-signature any-pair HC hit indicator on firm dummies and centred log pool size (Firm A reference).

| Term | Odds ratio (vs Firm A) | Direction |
|---|---|---|
| Firm B | $0.053$ | $\sim 19\times$ lower odds than Firm A |
| Firm C | $0.010$ | $\sim 100\times$ lower odds than Firm A |
| Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size |

Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$ while Firm A ranges $0.0541$–$0.5958$. The firm gap survives within matched pool sizes.

**Table XXV.** Cross-firm hit matrix among Big-4 source signatures with any-pair HC hit; max-cosine partner firm (counts).

| Source firm | Firm A cand. | Firm B | Firm C | Firm D | non-Big-4 | n hits |
|---|---|---|---|---|---|---|
| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ |
| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ |
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |

Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively.

### M.6 Alert-rate sensitivity around deployed HC threshold (Script 46)

**Table XXVI.** Local-gradient / median-gradient ratio at deployed thresholds (descriptive plateau diagnostic).

| Threshold | Local / median gradient ratio | Interpretation |
|---|---|---|
| cos $= 0.95$ (HC) | $\approx 25\times$ | locally sensitive (not plateau-stable) |
| dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
| dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |

Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ ($38.6$ pp) per-signature and $0.4431$ ($44.3$ pp) per-document; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate.