# IV. Experiments and Results

The v4.0 primary analyses (§IV-D through §IV-J) are scoped to the Big-4 sub-corpus (Firms A–D, $n = 437$ CPAs with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. The §IV-K Full-Dataset Robustness section reports the full-dataset (686 CPAs) variant of the K=3 mixture + Paper A box-rule Spearman analysis as a cross-scope robustness check. §IV-A through §IV-C report inherited corpus-wide v3.x material; §IV-L (feature backbone ablation) is also inherited. §IV-M consolidates the v4-new anchor-based ICCR calibration tables.

## A. Experimental Setup

Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration.
Feature extraction used PyTorch 2.9 with torchvision model implementations.
The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
Because all steps rely on deterministic forward inference over fixed pre-trained weights (no fine-tuning) plus fixed-seed numerical procedures, reported results are platform-independent to within floating-point precision.

## B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

<!-- TABLE III: Extraction Results
| Metric | Value |
|--------|-------|
| Documents processed | 86,071 |
| Documents with detections | 85,042 (98.8%) |
| Total signatures extracted | 182,328 |
| Avg. signatures per document | 2.14 |
| CPA-matched signatures | 168,755 (92.6%) |
| Processing rate | 43.1 docs/sec |
-->

The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in all §IV v4 primary analyses (§IV-D through §IV-J).

## C. All-Pairs Intra-vs-Inter Class Distribution Analysis

Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
Table IV summarizes the distributional statistics.

<!-- TABLE IV: Cosine Similarity Distribution Statistics
| Statistic | Intra-class | Inter-class |
|-----------|-------------|-------------|
| N (pairs) | 41,352,824 | 500,000 |
| Mean | 0.821 | 0.758 |
| Std. Dev. | 0.098 | 0.090 |
| Median | 0.836 | 0.774 |
| Skewness | −0.711 | −0.851 |
| Kurtosis | 0.550 | 1.027 |
-->

Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent threshold-estimator outputs reported in Section IV-D are derived via the methods of Section III-I to avoid single-family distributional assumptions.

The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.

## D. Big-4 Accountant-Level Distributional Characterisation

This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. All numbers below are direct re-statements from Scripts 32 / 34. The accountant-level dip-test rejection reported in Table V is, per §III-I.4 (Scripts 39b–39e), fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the v4-new composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration.

**Table V.** Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32).

| Population | $n$ CPAs | $p_{\text{cos}}$ | $p_{\text{dHash}}$ | Interpretation |
|---|---|---|---|---|
| **Big-4 pooled (primary)** | 437 | $< 5 \times 10^{-4}$ | $< 5 \times 10^{-4}$ | reject unimodality on both axes |
| Firm A pooled alone | 171 | 0.992 | 0.924 | unimodal |
| Firms B + C + D pooled | 266 | 0.998 | 0.906 | unimodal |
| All non-Firm-A pooled | 515 | 0.998 | 0.907 | unimodal |

Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by the bootstrap resolution $1 / 2000 = 5 \times 10^{-4}$ (Script 34 reports this as $p = 0.0000$; we report $p < 5 \times 10^{-4}$ to reflect the resolution). Single-firm dip statistics for Firms B, C, and D were not separately computed.

**Table VI.** Burgstahler-Dichev / McCrary density-smoothness diagnostic on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; $\alpha = 0.05$, two-sided).

| Population | Cosine: significant transition? | dHash: significant transition? |
|---|---|---|
| **Big-4 pooled (primary)** | none ($p > 0.05$) | none ($p > 0.05$) |
| Firm A pooled alone | none | none |
| Firms B + C + D pooled | none | one transition at $\overline{\text{dHash}} = 10.8$ |
| All non-Firm-A pooled | none | one transition at $\overline{\text{dHash}} = 6.6$ |

The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). These off-Big-4 dHash transitions are scope-dependent and are not used as v4.0 operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes.

## E. Big-4 K=2 / K=3 Mixture Fits

This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings.

**Table VII.** Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-J) and marginal-crossing bootstrap 95% confidence intervals.

| K=2 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight |
|---|---|---|---|
| K=2-a (low-cos / high-dHash position) | 0.954 | 7.14 | 0.689 |
| K=2-b (high-cos / low-dHash position) | 0.983 | 2.41 | 0.311 |

Marginal crossings (point + bootstrap 95% CI, $n_{\text{boot}} = 500$):

| Axis | Point | Bootstrap median | 95% CI | CI half-width |
|---|---|---|---|---|
| cos | 0.9755 | 0.9754 | $[0.9742, 0.9772]$ | 0.0015 |
| dHash | 3.755 | 3.763 | $[3.476, 3.969]$ | 0.246 |

$\text{BIC}(K{=}2) = -1108.45$ (Script 34).

**Table VIII.** Big-4 K=3 mixture components (descriptive firm-compositional partition per §III-J; not mechanism clusters).

| K=3 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position |
|---|---|---|---|---|
| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner |
| C2 | 0.9558 | 6.66 | 0.536 | central region |
| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner |

$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.

## F. Convergent Internal-Consistency Checks

This section reports the empirical evidence for §III-K's three-score internal-consistency analysis. We re-emphasise the §III-K caveat: the three scores are deterministic functions of the same per-CPA descriptor pair $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ and are *not statistically independent measurements*. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth.

**Table IX.** Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, $n = 437$.

| Score pair | Spearman $\rho$ | $p$-value |
|---|---|---|
| K=3 P(C1) vs Paper A box-rule less-replication-dominated rate | $+0.9627$ | $< 10^{-248}$ |
| Reverse-anchor cosine percentile vs Paper A box-rule less-replication-dominated rate | $+0.8890$ | $< 10^{-149}$ |
| K=3 P(C1) vs Reverse-anchor cosine percentile | $+0.8794$ | $< 10^{-142}$ |

(Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on $n = 249$ non-Big-4 CPAs; reference centre $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$.

**Table X.** Per-firm summary across the three feature-derived scores, Big-4.

| Firm | $n$ CPAs | mean $P(\text{C1})$ | mean reverse-anchor score | mean Paper A less-replication-dominated rate |
|---|---|---|---|---|
| Firm A | 171 | 0.0072 | $-0.9726$ | 0.1935 |
| Firm B | 112 | 0.1410 | $-0.8201$ | 0.6962 |
| Firm C | 102 | 0.3110 | $-0.7672$ | 0.7896 |
| Firm D | 52 | 0.2406 | $-0.7125$ | 0.7608 |

(Source: Script 38 per-firm summary; reverse-anchor score is sign-flipped so that *higher* values indicate deeper into the reference left tail = less replication-dominated relative to the non-Big-4 reference.)

The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A firms as less replication-dominated. The K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Score 1 and Score 3) place Firm C at the least-replication-dominated end of Big-4; the reverse-anchor cosine percentile (Score 2) ranks Firm D fractionally above Firm C. This residual within-Big-4-non-A disagreement is a design feature of the reverse-anchor metric: Score 2 measures only the marginal cosine percentile under the non-Big-4 reference, so a firm with a slightly higher cosine but a markedly different dHash distribution (Firm D vs Firm C) can score higher on Score 2 while scoring lower on Scores 1 and 3, both of which use both descriptors.

**Table XI.** Per-signature Cohen $\kappa$ (binary collapse, replication-dominated vs less-replication-dominated), $n = 150{,}442$ Big-4 signatures.

| Pair | Cohen $\kappa$ |
|---|---|
| Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) vs per-CPA K=3 hard label | 0.662 |
| Paper A binary high-confidence box rule vs per-signature K=3 hard label | 0.559 |
| Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 |

(Source: Script 39; verdict label `SIG_CONVERGENCE_MODERATE`.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evaluation (§IV-J).

## G. Leave-One-Firm-Out Reproducibility

This section reports the firm-level cross-validation evidence motivating §III-J's "K=3 descriptive, not operational" framing.

**Table XII.** K=2 leave-one-firm-out across the four Big-4 folds.

| Held-out firm | $n_{\text{train}}$ | $n_{\text{held}}$ | Fold rule (cos cut, dHash cut) | Held-out classified as templated by fold rule |
|---|---|---|---|---|
| Firm A | 266 | 171 | cos $> 0.9380$ AND dHash $\leq 8.79$ | $171 / 171 = 100.00\%$ ($95\%$ Wilson $[97.80\%, 100.00\%]$) |
| Firm B | 325 | 112 | cos $> 0.9744$ AND dHash $\leq 3.98$ | $0 / 112 = 0\%$ ($95\%$ Wilson $[0\%, 3.32\%]$) |
| Firm C | 335 | 102 | cos $> 0.9752$ AND dHash $\leq 3.75$ | $0 / 102 = 0\%$ ($95\%$ Wilson $[0\%, 3.63\%]$) |
| Firm D | 385 | 52 | cos $> 0.9756$ AND dHash $\leq 3.74$ | $0 / 52 = 0\%$ ($95\%$ Wilson $[0\%, 6.88\%]$) |

(Source: Script 36.) Across-fold cosine crossing: pairwise range $[0.9380, 0.9756]$, range = $0.0376$; max absolute deviation from the across-fold mean is $0.028$. This exceeds the report's $0.005$ across-fold stability tolerance by $5.6\times$ and is much larger than the full-Big-4 bootstrap CI half-width of $0.0015$. Together with the all-or-nothing held-out classification pattern (Firm A held out $\Rightarrow$ all held-out CPAs templated; any non-Firm-A firm held out $\Rightarrow$ none templated), this indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.

**Table XIII.** K=3 leave-one-firm-out: C1 component shape and held-out membership.

| Held-out firm | C1 cos (fit) | C1 dHash (fit) | C1 weight (fit) | Held-out C1 hard-label rate | Full-Big-4 baseline C1% | Absolute difference |
|---|---|---|---|---|---|---|
| Full-Big-4 baseline | 0.9457 | 9.17 | 0.143 | — | — | — |
| Firm A held out | 0.9425 | 10.13 | 0.145 | $4.68\%$ | $0.00\%$ | $4.68$ pp |
| Firm B held out | 0.9441 | 9.16 | 0.127 | $7.14\%$ | $8.93\%$ | $1.76$ pp |
| Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp |
| Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp |

(Source: Script 37; verdict label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).

## H. Pixel-Identity Positive-Anchor Miss Rate

This section reports the only hard-ground-truth subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4).

**Table XIV.** Positive-anchor miss rate, $n = 262$ Big-4 byte-identical signatures.

| Classifier | Misclassified as less-replication-dominated | Miss rate | Wilson 95% CI |
|---|---|---|---|
| Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ |
| K=3 per-CPA hard label (C3 = high-cos / low-dHash; descriptive) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ |
| Reverse-anchor (prevalence-calibrated cut) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ |

(Source: Script 40.) Per-firm breakdown of the byte-identical subset: Firm A 145; Firm B 8; Firm C 107; Firm D 2. All three candidate scores correctly assign every byte-identical signature to the replicated class.

We caution that for the Paper A box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region); v3.20.0 §V-F discusses this conservative-subset caveat at length and we retain that discussion. The reverse-anchor cut is chosen by *prevalence calibration* against the inherited box rule's overall replicated rate of $49.58\%$ across Big-4 signatures; this is a documented v4.0 limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.

## I. Inter-CPA Pair-Level Coincidence Rate (Big-4 spike + inherited corpus-wide)

The signature-level inter-CPA pair-level coincidence-rate analysis (reported in v3.20.0 §IV-F.1, Table X as "FAR") is inherited and extended in v4.0. v4.0 retroactively reframes the metric as **inter-CPA pair-level coincidence rate (ICCR)** rather than "False Acceptance Rate" because the corpus does not provide signature-level ground-truth negative labels; the inter-CPA negative-anchor assumption underpinning the metric is itself partially violated by within-firm cross-CPA template-like collision structures (§III-L.4). The v3.20.0 corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs reported a per-comparison rate of $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$) at the cosine cut $0.95$.

v4.0 additionally reports the §III-L.1 Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs; Script 40b), which replicates and extends the v3 result and adds the structural dimension (dHash) and joint-rule rates. The §III-L.1 numbers are referenced rather than duplicated here; the consolidated v4-new ICCR calibration appears in §IV-M Tables XXI–XXVI.

## J. Five-Way Per-Signature + Document-Level Classification Output

This section reports the §III-L five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. The five-way category definitions are inherited unchanged from v3.20.0 §III-K (now §III-L); see §III-L for the cosine and dHash cuts.

**Table XV.** Five-way per-signature category counts, Big-4 sub-corpus, $n = 150{,}442$ classified.

| Category | Long name | $n$ signatures | % of classified |
|---|---|---|---|
| HC | High-confidence non-hand-signed | 74,593 | 49.58% |
| MC | Moderate-confidence non-hand-signed | 39,817 | 26.47% |
| HSC | High style consistency | 314 | 0.21% |
| UN | Uncertain | 35,480 | 23.58% |
| LH | Likely hand-signed | 238 | 0.16% |

(Source: Script 42; 11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded. The $150{,}442$ vs $150{,}453$ distinction — descriptor-complete vs vector-complete — recurs across §IV: descriptor-complete analyses (§IV-D through §IV-J, all using accountant-level aggregates or per-signature category counts derived from the same 150,442-signature substrate) use $n = 150{,}442$; vector- or pair-recomputed analyses (§IV-M.2 Table XXI, §IV-M.3 Table XXII, §IV-M.5 Tables XXIV–XXV; Scripts 40b, 43, 44) use $n = 150{,}453$ because their pair- or pool-level computations load all vector-complete signatures including those failing the descriptor-complete filter. See §III-G for the sample-size reconciliation.)

**Per-firm five-way breakdown (% within firm).**

| Firm | HC | MC | HSC | UN | LH | total signatures |
|---|---|---|---|---|---|---|
| Firm A | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 |
| Firm B | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 |
| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |

(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVI: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVI). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).

**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.20.0 worst-case rule (HC > MC > HSC > UN > LH; §III-L). v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset).

**Table XIX.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs.

| Category | Long name | $n$ documents | % |
|---|---|---|---|
| HC | High-confidence non-hand-signed | 46,857 | 62.28% |
| MC | Moderate-confidence non-hand-signed | 19,667 | 26.14% |
| HSC | High style consistency | 167 | 0.22% |
| UN | Uncertain | 8,524 | 11.33% |
| LH | Likely hand-signed | 18 | 0.02% |

(Source: Script 42 document-level table; 379 of 75,233 PDFs carried signatures from more than one Big-4 firm and are reported in the single-firm-PDF per-firm breakdown of the script CSV but pooled into the overall counts here.)

**Per-firm document-level breakdown (single-firm PDFs only).**

| Firm | HC | MC | HSC | UN | LH | total docs |
|---|---|---|---|---|---|---|
| Firm A | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 |
| Firm B | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 |
| Firm C | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 |
| Firm D | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 |

(Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)

The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is **not separately validated by Scripts 38–40**, which evaluated only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The v3.20.0 capture-rate calibration evidence for the moderate band (v3.20.0 Tables IX, XI, XII, XII-B) is carried into v4.0 by reference and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).

**Table XVI.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.

| Firm | $n$ | C1 (low-cos / high-dHash) | C2 (central) | C3 (high-cos / low-dHash) | C1 % | C3 % |
|---|---|---|---|---|---|---|
| Firm A | 171 | 0 | 30 | 141 | $0.00\%$ | $82.46\%$ |
| Firm B | 112 | 10 | 102 | 0 | $8.93\%$ | $0.00\%$ |
| Firm C | 102 | 24 | 77 | 1 | $23.53\%$ | $0.98\%$ |
| Firm D | 52 | 6 | 45 | 1 | $11.54\%$ | $1.92\%$ |

(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).

**Document-level worst-case aggregation outputs are reported in Table XIX above.**

## K. Full-Dataset Robustness (light scope)

This section reports the v4.0 reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + Paper A operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the v4.0 K=3 + Paper A convergence reproduces at the wider scope. The §III-L five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band is documented as inherited from v3.x calibration in §IV-J.

**Table XVII.** K=3 component comparison, Big-4 sub-corpus vs full dataset.

| K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full |
|---|---|---|---|
| C1 (low-cos / high-dHash) | 0.9457 / 9.17 / 0.143 | 0.9278 / 11.17 / 0.284 | $\lvert\Delta\rvert$ cos 0.018, dHash 1.99, wt 0.141 |
| C2 (central) | 0.9558 / 6.66 / 0.536 | 0.9535 / 6.99 / 0.512 | $\lvert\Delta\rvert$ cos 0.002, dHash 0.33, wt 0.024 |
| C3 (high-cos / low-dHash) | 0.9826 / 2.41 / 0.321 | 0.9826 / 2.40 / 0.205 | $\lvert\Delta\rvert$ cos 0.000, dHash 0.01, wt 0.117 |

(Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.)

**Table XVIII.** Spearman rank correlation between K=3 P(C1) and Paper A operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset.

| Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs Paper A less-replication-dominated rate) | $p$-value |
|---|---|---|---|
| Big-4 (primary) | 437 | $+0.9627$ | $< 10^{-248}$ |
| Full dataset | 686 | $+0.9558$ | $< 10^{-300}$ |
| $\lvert\rho_{\text{full}} - \rho_{\text{Big-4}}\rvert$ | — | $0.0069$ | — |

(Source: Script 41.)

**Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the Paper A box-rule less-replication-dominated rate are preserved at the full scope. Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more non-templated CPAs (mid/small firms); C1 (low-cos / high-dHash) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + Paper A convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the v4.0 primary methodology is restricted to Big-4 by design (§III-G item 4).

## L. Ablation Study: Feature Backbone Comparison

To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
The comparison summary is inherited unchanged from the v3.20.0 backbone-ablation table (v3.20.0 Table XVIII; not the same table as v4 Table XVIII which reports Big-4 vs full-dataset Spearman drift in §IV-K).

<!-- v3.20.0 TABLE XVIII (inherited unchanged; reproduced here in commented form for v4 splice provenance — actual rendered table is the v3.20.0 backbone-ablation table):
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|--------|-----------|--------|-----------------|
| Feature dim | 2048 | 4096 | 1280 |
| Intra mean | 0.821 | 0.822 | 0.786 |
| Inter mean | 0.758 | 0.767 | 0.699 |
| Cohen's d | 0.669 | 0.564 | 0.707 |
| KDE crossover | 0.837 | 0.850 | 0.792 |
| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |

Note: Firm A values in this table are computed over all intra-firm pairwise
similarities (16.0M pairs) for cross-backbone comparability. These differ from
the per-signature best-match statistic used in Section IV-D and visualized in
Table XIII (whole-sample Firm A best-match mean $\approx 0.980$), which reflects
the classification-relevant quantity: the similarity of each signature to its
single closest match from the same CPA.
-->

EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), resulting in lower per-sample classification confidence.
VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.

ResNet-50 provides the best overall balance:
(1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
(2) its tighter distributions yield more reliable individual classifications;
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.

## M. v4-New Anchor-Based ICCR Calibration Results

This section consolidates the v4-new empirical results that support the §III-L anchor-based threshold calibration framework. Numbers below are direct re-statements from the spike scripts cited per row; the corresponding §III provenance table entries appear in §III's provenance table.

### M.1 Composition decomposition (Scripts 39b–39e)

**Table XX.** Within-firm and between-firm decomposition of the Big-4 accountant-level dip-test rejection.

| Diagnostic | Scope | Statistic | Implication |
|---|---|---|---|
| Within-firm signature-level cosine dip | Big-4 (4 firms) | $p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ | 0/4 firms reject; cosine within-firm unimodal |
| Within-firm signature-level cosine dip | non-Big-4 (10 firms $\geq 500$ sigs) | $p_{\text{cos}} \in [0.59, 0.99]$ | 0/10 firms reject; cosine within-firm unimodal |
| Within-firm jittered-dHash dip (5 seeds, median) | Big-4 (4 firms) | $p_{\text{med}} \in \{0.999, 0.996, 0.999, 0.9995\}$ | 0/4 firms reject after integer-jitter; raw rejection was integer-tie artefact |
| Big-4 pooled dHash: 2×2 factorial | firm-centred + jittered (5 seeds) | $p_{\text{med}} = 0.35$, 0/5 seeds reject | combined corrections eliminate rejection; multimodality is composition + integer artefact |
| Integer-histogram valley near $\text{dHash} \approx 5$ | within each Big-4 firm | none (0/4 firms) | no within-firm dHash antimode at the inherited HC cutoff |

(Source: Scripts 39b, 39c, 39d, 39e; bootstrap $n_{\text{boot}} = 2000$; jitter $\sim \mathrm{U}[-0.5, +0.5]$.)

### M.2 Anchor-based inter-CPA pair-level ICCR (Script 40b)

**Table XXI.** Big-4 inter-CPA per-comparison ICCR sweep, $n = 5 \times 10^5$ pairs (Big-4 scope; v4 new).

| Threshold | Per-comparison ICCR | 95% Wilson CI |
|---|---|---|
| cos $> 0.945$ (v3.x published "natural threshold") | $0.00081$ | $[0.00073, 0.00089]$ |
| cos $> 0.95$ (inherited operating point) | $0.00060$ | $[0.00053, 0.00067]$ |
| cos $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| cos $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ (inherited operating point) | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | — |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | — |

Conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$ (Wilson 95% $[0.190, 0.285]$; $70$ of $299$ pairs).

The cos $> 0.95$ row replicates v3.20.0 §IV-F.1 Table X (v3 reported $0.0005$ under prior "FAR" terminology). The dHash row and joint row are v4 new.

### M.3 Pool-normalised per-signature ICCR (Script 43)

**Table XXII.** Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos $> 0.95$ AND dHash $\leq 5$); $n_{\text{sig}} = 150{,}453$ (vector-complete Big-4); CPA-block bootstrap $n_{\text{boot}} = 1000$.

| Scope | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|---|---|---|---|
| Big-4 pooled (any-pair, deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
| Big-4 pooled (same-pair, stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |
| Firm A (any-pair) | $0.2594$ | — | — |
| Firm B (any-pair) | $0.0147$ | — | — |
| Firm C (any-pair) | $0.0053$ | — | — |
| Firm D (any-pair) | $0.0110$ | — | — |
| Pool-size decile 1 (smallest pools) any-pair | $0.0249$ | — | — |
| Pool-size decile 10 (largest pools) any-pair | $0.1905$ | — | — |

Decile trend is broadly monotone in pool size with two minor reversals (decile 5 and decile 9 dip below their predecessors). Stricter operating point cos $> 0.95$ AND dHash $\leq 3$ (same-pair) gives per-signature ICCR $0.0449$.

### M.4 Document-level ICCR under three alarm definitions (Script 45)

**Table XXIII.** Document-level inter-CPA ICCR by alarm definition; $n_{\text{docs}} = 75{,}233$.

| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
| D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |

Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XIX's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs: all $379$ are $1{:}1$ Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XIX's per-firm rows sum to $74{,}854 = 75{,}233 - 379$.

### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)

**Table XXIV.** Logistic regression of per-signature any-pair HC hit indicator on firm dummies and centred log pool size (Firm A reference).

| Term | Odds ratio (vs Firm A) | Direction |
|---|---|---|
| Firm B | $0.053$ | $\sim 19\times$ lower odds than Firm A |
| Firm C | $0.010$ | $\sim 100\times$ lower odds than Firm A |
| Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size |

Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$ while Firm A ranges $0.0541$–$0.5958$. The firm gap survives within matched pool sizes.

**Table XXV.** Cross-firm hit matrix among Big-4 source signatures with any-pair HC hit; max-cosine partner firm (counts).

| Source firm | Firm A cand. | Firm B | Firm C | Firm D | non-Big-4 | n hits |
|---|---|---|---|---|---|---|
| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ |
| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ |
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |

Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively.

### M.6 Alert-rate sensitivity around inherited HC threshold (Script 46)

**Table XXVI.** Local-gradient / median-gradient ratio at inherited thresholds (descriptive plateau diagnostic).

| Threshold | Local / median gradient ratio | Interpretation |
|---|---|---|
| cos $= 0.95$ (HC) | $\approx 25\times$ | locally sensitive (not plateau-stable) |
| dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
| dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |

Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ pp per-signature and $0.4431$ pp per-document; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate.