# III. Methodology
## A. Pipeline Overview
We propose a six-stage pipeline for the large-scale detection of non-hand-signed auditor signatures in scanned financial documents.
Fig. 1 illustrates the overall architecture.
The pipeline takes as input a corpus of PDF audit reports and produces five-way operational screening labels (§III-H.1) whose behaviour is characterised by pixel-identity positive-anchor capture checks and inter-CPA coincidence-rate calibration (§III-L).
Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.
<!--
[Figure 1: Pipeline Architecture - clean vector diagram]
90,282 PDFs → VLM Pre-screening → 86,084 candidates → processing checks → 86,071 PDFs
→ YOLOv11 Detection → 182,328 signatures
→ ResNet-50 Features → 2048-dim embeddings
→ Dual-Descriptor Verification (Cosine + dHash)
→ Anchor-Calibrated Five-Way Classifier → Five-way classification
→ Pixel-identity Positive Anchor + Inter-CPA Coincidence-Rate Negative Anchor
-->
## B. Data Collection
The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023.
The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings.
An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period.
Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.
CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports.
Table I summarizes the dataset composition.
**Table I.** Dataset Summary.

| Attribute | Value |
|-----------|-------|
| Total PDF documents | 90,282 |
| Date range | 2013–2023 |
| Signature-page candidates (VLM-positive) | 86,084 (95.3%) |
| Processed for signature extraction | 86,071 (95.3%) |
| Unique CPAs identified | 758 |
| Accounting firms | >50 |
## C. Signature Page Identification
To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24], one of the multimodal generative models surveyed in [35], as an automated pre-screening mechanism.
Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature.
The model was configured with temperature 0 for deterministic output.
The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document.
Scanning terminated upon the first positive detection.
This process identified 86,084 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded.
An additional 13 PDFs that could not be rendered (corruption or read errors) were excluded, yielding a final set of 86,071 documents.
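The first-quartile scan with early termination can be sketched as follows; `page_contains_signature` is a hypothetical stand-in for rendering a page at 180 DPI and querying the VLM at temperature 0, not the pipeline's actual interface.

```python
import math

def find_signature_page(num_pages, page_contains_signature):
    """Scan only the first quartile of pages; stop at the first positive.

    `page_contains_signature(i)` stands in for rendering page i to JPEG at
    180 DPI and asking the VLM for a binary signature determination.
    Returns the 0-based page index of the first positive, or None.
    """
    limit = max(1, math.ceil(num_pages / 4))  # first quartile of the page count
    for i in range(limit):
        if page_contains_signature(i):
            return i  # terminate on first positive detection
    return None
```

Documents for which this returns `None` correspond to the "no signatures" exclusion class.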
Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents.
The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling.
## D. Signature Detection
We adopted YOLOv11n (nano variant) [25], a lightweight descendant of the original YOLO single-stage detector [34], for signature region localization.
A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction.
A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.
The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
**Table II.** YOLO Detection Performance.

| Metric | Value |
|--------|-------|
| Precision | 0.97–0.98 |
| Recall | 0.95–0.98 |
| mAP@0.50 | 0.98–0.99 |
| mAP@0.50:0.95 | 0.85–0.90 |
Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers).
A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
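A minimal sketch of the HSV masking step, assuming the crop has already been converted to an OpenCV-style HSV encoding (H in [0, 179], S and V in [0, 255]); the hue and saturation bounds here are illustrative, not the deployed thresholds.

```python
import numpy as np

def remove_red_stamp(rgb, hsv):
    """Whiten pixels whose HSV values fall in an illustrative red band.

    `rgb` and `hsv` are HxWx3 uint8 arrays of the same signature crop.
    The bounds below are illustrative stand-ins for the pipeline's
    calibrated thresholds.
    """
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    red_hue = (h <= 10) | (h >= 170)        # red wraps around hue 0
    mask = red_hue & (s >= 80) & (v >= 50)  # saturated and bright enough
    out = rgb.copy()
    out[mask] = 255                          # replace red regions with white
    return out
```

Dark handwritten ink has low saturation, so it survives the mask while the saturated red stamp pixels are whitened.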
Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). The matched records assume standard two-signature ordering; residual order-mismatch risk remains for nonstandard layouts.

The remaining 7.4% (13,573 signatures) could not be matched to a registered CPA name---typically because the auditor's report page format deviates from the standard two-signature layout, or because OCR of the printed CPA name on the page returns a name not present in the registry---and these signatures are excluded from all subsequent same-CPA pairwise analyses (a same-CPA best-match statistic is undefined when a signature has no assigned CPA). The 92.6% matched subset forms the candidate pool for same-CPA analyses, before the Big-4 and descriptor-completeness restrictions described in §III-G.
## E. Feature Extraction
Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning.
The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.
Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
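A minimal numpy sketch of the normalisation step and the resulting cosine-equals-dot-product identity:

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Row-wise L2 normalisation of an (n, d) feature matrix.

    After normalisation, cosine similarity between two rows reduces to
    their dot product: sim(f_A, f_B) = f_A . f_B.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)  # eps guards all-zero rows
```

In the pipeline the rows would be the 2048-dimensional ResNet-50 pooling outputs; the identity holds for any dimensionality.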
The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-H, Engineering-level caveats).
This design choice is supported by an ablation study (Section IV-L) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
## F. Dual-Method Similarity Descriptors
For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:
**Cosine similarity on deep embeddings** captures high-level visual style:
$$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$
where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized 2048-dim feature vectors.
Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].
**Perceptual hash distance (dHash)** [27] captures structural-level similarity.
Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint.
The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images.
Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
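The fingerprint and distance computations can be sketched as follows, assuming the 9×8 (9 columns × 8 rows) grayscale resize has already been performed; the image-resizing step itself is omitted.

```python
import numpy as np

def dhash_bits(gray_9x8):
    """64-bit dHash from an 8x9 grayscale array (8 rows, 9 columns).

    Each bit records whether a pixel is brighter than its right-hand
    neighbour, i.e. the sign of the horizontal gradient between adjacent
    columns: 8 comparisons per row x 8 rows = 64 bits.
    """
    g = np.asarray(gray_9x8, dtype=float)
    assert g.shape == (8, 9)
    return (g[:, 1:] > g[:, :-1]).flatten()

def hamming(bits_a, bits_b):
    """Number of differing bits between two 64-bit fingerprints."""
    return int(np.count_nonzero(bits_a != bits_b))
```

A Hamming distance of 0 corresponds to structurally identical images under this sketch, matching the distance convention used in the text.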
These descriptors provide partially independent evidence.
Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise.
Non-hand-signing is expected to yield extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise; scan-stage noise can in principle push a replicated pair off either extremum but rarely both.
Hand-signing, by contrast, often yields high dHash similarity (the overall layout of a signature is typically preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. The supplementary materials contain the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.
Cosine similarity on L2-normalised deep embeddings and dHash both remain stable across the print-scan-rasterise cycle by design [14], [19], [21], [27]; together they constitute the dual descriptor used throughout the rest of this paper.
## G. Unit of Analysis and Scope
We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K).

At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
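The accountant-level aggregation and the $n_{\text{sig}} \geq 10$ exclusion rule can be sketched as follows; the tuple-based record layout is illustrative, not the pipeline's storage format.

```python
from collections import defaultdict

def accountant_level_means(records, min_sig=10):
    """Aggregate per-signature descriptors to per-CPA means.

    `records` is an iterable of (cpa_id, cos_s, dhash_s) tuples, where
    cos_s is the per-signature best-match cosine and dhash_s the
    independent-minimum dHash.  CPAs with fewer than `min_sig` signatures
    are dropped from the accountant-level output but remain usable in the
    per-signature analyses.
    """
    by_cpa = defaultdict(list)
    for cpa, cos_s, dhash_s in records:
        by_cpa[cpa].append((cos_s, dhash_s))
    return {
        cpa: (sum(v[0] for v in vals) / len(vals),   # cos mean
              sum(v[1] for v in vals) / len(vals))   # dHash mean
        for cpa, vals in by_cpa.items() if len(vals) >= min_sig
    }
```

The returned per-CPA pairs are the $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ points used by the mixture and dip-test analyses.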
We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.
We adopt one stipulation about same-CPA pair detectability:
> **(A1) Pair-detectability.** *If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the cross-year same-CPA pool used by the max-cosine / min-dHash computation.*
A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.
**Scope: the Big-4 sub-corpus.** The primary analyses (§III-I, §III-J, §III-K, §III-L, and the corresponding §IV-D through §IV-J and §IV-M tables) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F; §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ — the threshold for accountant-level analyses — totalling 150,442 Big-4 signatures with both pre-computed descriptors available. Restricting the primary analyses to Big-4 is a methodological choice driven by four considerations:
1. **Restricted generalisability claim and Big-4 institutional comparability.** The primary claims are scoped to the Big-4 audit-report context, where the four firms share comparable institutional scale, document-production infrastructure, and CPA-volume regime; we do not assert that the same descriptive mixture structure or operational alert behaviour extends to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H.2's reverse-anchor internal-consistency check, (b) as a robustness comparison in §IV-K, and (c) as a corroborating-population check on the dHash discrete-mass-point artefact in §III-I.4. Generalisation beyond Big-4 is left as future work.
2. **Within-firm cross-CPA collision structure analysis.** §III-L.4 reports a Big-4 cross-firm hit-matrix analysis that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work.
3. **Firm A as templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-J K=3 component cross-tab; byte-level pair analysis referenced in §III-H.2). We retain Firm A within the Big-4 scope as a descriptive case study of the templated end rather than as the calibration anchor for thresholds.
4. **Leave-one-firm-out fold feasibility.** §III-K reports leave-one-firm-out (LOFO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOFO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm.
**Sample-size reconciliation.** Two Big-4 signature counts appear in this section and §IV: $n = 150{,}442$ for analyses using the pre-computed per-signature descriptors $\text{cos}_s$ (`max_similarity_to_same_accountant`) and $\text{dHash}_s$ (`min_dhash_independent`), and $n = 150{,}453$ for analyses recomputing pair-level metrics directly from the stored feature and dHash byte vectors (Scripts 40b, 43, 44). The $11$-signature difference reflects descriptor-completion status: $11$ signatures have feature vectors and dHash byte vectors stored but lack the pre-computed extrema. The $11$ signatures are negligible at population scale and do not affect any reported coincidence rate within $0.01$ percentage point. The CPA counts $468$ (all Big-4 CPAs with both vectors stored) and $437$ (Big-4 CPAs with $n_{\text{sig}} \geq 10$ for accountant-level stability) likewise reflect a single uniform exclusion rule rather than analysis-specific subsetting.
## H. Operational Classifier and Reference Populations
### H.1. Deployed Operational Rule
Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA:
1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence.
2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff.
3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration — consistent with a CPA who signs very consistently but not via image reproduction.
4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$.
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (see supplementary materials). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L).
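The five-way rule and the worst-case document aggregation can be sketched directly from the thresholds above; function and constant names are illustrative, not the deployed scripts' identifiers.

```python
# Rank order HC > MC > HSC > UN > LH (most replication-consistent first).
RANK = {"HC": 0, "MC": 1, "HSC": 2, "UN": 3, "LH": 4}

def classify_signature(cos_s, dhash_s, cos_hi=0.95, cos_lo=0.837,
                       dhash_hc=5, dhash_mc=15):
    """Five-way per-signature label from (cos_s, dhash_s)."""
    if cos_s > cos_hi:
        if dhash_s <= dhash_hc:
            return "HC"   # both descriptors converge on replication
        if dhash_s <= dhash_mc:
            return "MC"   # strong cosine, moderate structural evidence
        return "HSC"      # high cosine without structural corroboration
    if cos_s > cos_lo:
        return "UN"       # between KDE crossover and operating point
    return "LH"           # at or below the all-pairs KDE crossover

def classify_document(signature_labels):
    """Worst-case rule: a report inherits its most replication-consistent label."""
    return min(signature_labels, key=RANK.__getitem__)
```

Note the boundary handling: a cosine of exactly $0.95$ falls into UN and exactly $0.837$ into LH, matching the strict/non-strict inequalities of the category definitions.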
The remainder of this section (§III-H.2) describes the reference populations used to calibrate and cross-check this rule. §III-I demonstrates that the descriptor distributions do not provide a within-population natural threshold; §III-J–§III-K develop the descriptive partition and internal-consistency cross-checks; §III-L develops the anchor-based threshold calibration; §III-M discloses the unsupervised-setting limits.
### H.2. Reference Populations
The calibration distinguishes two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.
Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate.
**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores used as a cross-check on the per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked.
The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 low-cos / high-dHash component (cos $= 0.946$, dHash $= 9.17$; §III-J); compared to the Big-4 high-cos / low-dHash component (cos $= 0.983$, dHash $= 2.41$; §III-J) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end of the descriptor plane. This is a "deviation in the less-replication-dominated descriptor-position direction" measure, not a "deviation toward the templated descriptor-position" measure; the reference is the less-replication-dominated population.
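A minimal sketch of the reverse-anchor metric as an empirical-CDF percentile under the reference marginal cosine distribution; Script 38's actual implementation may differ in detail.

```python
import numpy as np

def reverse_anchor_percentile(cos_mean_big4, ref_cos_means):
    """Marginal-cosine CDF value of a Big-4 CPA under the non-Big-4 reference.

    Lower values place the CPA further into the reference's left tail,
    i.e. further from the templated end of the descriptor plane.
    """
    ref = np.sort(np.asarray(ref_cos_means, dtype=float))
    # Empirical CDF: fraction of reference CPAs at or below this mean cosine.
    return float(np.searchsorted(ref, cos_mean_big4, side="right")) / ref.size
```

This is the "deviation in the less-replication-dominated direction" score: it ranks a Big-4 CPA against an out-of-target population rather than against a threshold.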
## I. Distributional Diagnostics: Why the Composition Path Does Not Yield a Natural Threshold
This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G and tests whether the distribution provides distributional support — in the form of within-population bimodality — for the deployed operational thresholds. We apply four diagnostic procedures in turn: a univariate unimodality test on each accountant-level marginal; a 2D Gaussian mixture fit (developed in §III-J); a density-smoothness diagnostic; and a composition decomposition that distinguishes within-population multimodality from between-firm location-shift artefacts. The four diagnostics jointly imply that the operational thresholds are *not* anchored by distributional bimodality: §III-L develops an anchor-based calibration framework that does not require this assumption.
**1. Hartigan dip test on each accountant-level marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds in the comparison scopes tested in Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A CPAs pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). Single-firm dip tests for Firms B, C, and D were not separately computed; the comparison scopes above sufficed to establish that no narrower-than-Big-4 *tested* scope at the accountant level rejected unimodality. The accountant-level Big-4 rejection is a descriptive observation; §III-I.4 below shows that the rejection is fully explained by between-firm location-shift effects rather than within-population bimodality.
**2. K=2 / K=3 Gaussian mixture fits (descriptive partition).** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42; Script 34) recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ over $n_{\text{boot}} = 500$ resamples. The 3-component fit (§III-J) is BIC-preferred — using the convention that lower BIC is preferred, $\text{BIC}(K{=}3) - \text{BIC}(K{=}2) = -3.48$ (Script 36). The $\Delta$BIC magnitude is small in absolute terms; we do not treat $\Delta\text{BIC} = 3.5$ alone as decisive evidence for K=3 as a population mixture. Following §III-I.4 we treat both K=2 and K=3 fits as *descriptive partitions* of the joint Big-4 distribution that reflect firm-composition structure (Firm A vs others; §III-J) rather than as inferential evidence for two or three latent population modes.
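The mixture-fit configuration can be illustrated with scikit-learn on synthetic data; the real 437 accountant-level points are not reproduced here, and the cluster locations below are seeded from the reported K=2 means purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Synthetic stand-in for the 437 accountant-level (cos_mean, dhash_mean) points.
X = np.vstack([
    rng.normal([0.954, 7.14], [0.010, 1.5], size=(300, 2)),
    rng.normal([0.983, 2.41], [0.005, 0.8], size=(137, 2)),
])

# Full-covariance fits with n_init = 15 and a fixed seed, as in Script 34.
fits = {k: GaussianMixture(n_components=k, covariance_type="full",
                           n_init=15, random_state=42).fit(X)
        for k in (2, 3)}
delta_bic = fits[3].bic(X) - fits[2].bic(X)  # lower BIC preferred
```

A negative `delta_bic` would favour K=3 under the lower-is-better convention; as the text notes, a small magnitude should not be read as decisive evidence for a third latent mode.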
**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38], [39] as a *density-smoothness diagnostic* (rather than as a threshold estimator) on each accountant-level marginal axis (cosine in bins of $0.002$, dHash in integer bins). At the Big-4 scope, the diagnostic identifies no significant transition on either marginal at $\alpha = 0.05$ (Script 34). Outside Big-4, the diagnostic does flag dHash transitions in some subsets (Script 32: `big4_non_A` dHash transition at $10.8$; `all_non_A` dHash transition at $6.6$; pre-2018 and post-2020 time-stratified variants also exhibit one or more dHash transitions), but no cosine transition is identified in any subset. The Big-4-scope null on both axes is consistent with §III-I.4 below: under the composition decomposition the Big-4 marginals are unimodal once between-firm and integer-tie confounds are removed, so a local-discontinuity test correctly fails to flag a within-population transition.
**4. Composition decomposition (Scripts 39b–39e).** §III-I.1 establishes that the accountant-level marginals reject unimodality at the Big-4 sub-corpus. The remaining question is whether the rejection reflects (a) genuine within-population bimodality at the signature or accountant level, (b) between-firm location-shift artefacts (firms with different mean descriptor positions pool to a multi-peaked distribution), or (c) integer mass-point artefacts on the integer-valued dHash axis (the dHash dip statistic is sensitive to spikes at integer values). We apply four diagnostics that decompose the rejection into these candidate sources:
*Within-firm signature-level dip (Scripts 39b, 39c).* Repeating the dip test at the signature level inside each individual Big-4 firm (Script 39b) and inside each individual non-Big-4 firm with $\geq 500$ signatures (Script 39c) yields a consistent picture. The cosine marginal *fails* to reject unimodality in every single firm tested — all four Big-4 firms ($p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ for Firms A through D; Script 39b) and ten non-Big-4 firms with $\geq 500$ signatures ($p_{\text{cos}} \in [0.59, 0.99]$; Script 39c). The raw dHash marginal *does* reject unimodality in every firm tested ($p < 5 \times 10^{-4}$ in all $14$ firms), but the raw dHash values are integer-valued in $\{0, 1, \ldots, 64\}$, leaving open the possibility of an integer-tie artefact.
*Integer-jitter robustness (Scripts 39d, 39e).* Adding independent uniform jitter $\sim \mathrm{U}[-0.5, +0.5]$ to break exact dHash ties and re-running the dip test on the perturbed signature cloud (5 seeds, $n_{\text{boot}} = 2000$; Script 39d) eliminates the dHash within-firm rejection in every Big-4 firm tested (Firm A jittered $p_{\text{median}} = 0.999$; B $0.996$; C $0.999$; D $0.9995$; $0$/$5$ seeds reject at $\alpha = 0.05$ in any firm). The pooled-Big-4 dHash dip *does* survive jitter alone ($p_{\text{median}} = 0$, $5$/$5$ seeds reject), but Firm A's mean dHash ($2.73$) is substantially below Firms B/C/D's ($6.46$, $7.39$, $7.21$) — a between-firm location shift. Script 39e applies a $2 \times 2$ factorial correction (firm-mean centring $\times$ integer jitter) on the Big-4 pooled dHash:

| Condition | Firm-mean centred | Integer jitter | Median dip $p$ | Reject at $\alpha = 0.05$ |
|---|---|---|---|---|
| 1 raw | — | — | $< 5 \times 10^{-4}$ | $5/5$ |
| 2 centred only | $\checkmark$ | — | $< 5 \times 10^{-4}$ | $5/5$ |
| 3 jittered only | — | $\checkmark$ | $< 5 \times 10^{-4}$ | $5/5$ |
| 4 centred and jittered | $\checkmark$ | $\checkmark$ | $\mathbf{0.35}$ | $\mathbf{0/5}$ |

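
The condition-4 preprocessing (firm-mean centring plus integer jitter) can be sketched as follows; the firm sizes and means are synthetic stand-ins, not corpus data, and the dip test itself is left to a dedicated implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the Big-4 pooled integer dHash values: Firm A
# centred low (~2.7) and Firms B-D centred higher (~6.5-7.4), echoing the
# between-firm location shift described above.
sizes = {"A": 600, "B": 350, "C": 390, "D": 170}
lams = {"A": 2.7, "B": 6.5, "C": 7.4, "D": 7.2}
firm = np.concatenate([np.full(n, f) for f, n in sizes.items()])
dhash = np.concatenate(
    [rng.poisson(lams[f], size=n) for f, n in sizes.items()]
).astype(float)

def correct(x, firm, centre=False, jitter=False, seed=1):
    """One cell of the 2x2 factorial: optionally remove the between-firm
    location shift (firm-mean centring) and/or break exact integer ties
    with independent U[-0.5, +0.5] jitter."""
    out = np.asarray(x, dtype=float).copy()
    if centre:
        for f in np.unique(firm):
            out[firm == f] -= out[firm == f].mean()
    if jitter:
        out = out + np.random.default_rng(seed).uniform(-0.5, 0.5, size=out.shape)
    return out

condition4 = correct(dhash, firm, centre=True, jitter=True)  # centred + jittered
```

After this correction the per-firm means are aligned and the integer mass points are dissolved, which is the input on which condition 4's dip test fails to reject.
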
Removing *both* the between-firm location shift *and* the integer mass points eliminates the Big-4 dHash rejection. The Big-4 pooled dHash multimodality is therefore fully attributable to firm-composition contrast (primarily Firm A's mean $\text{dHash} = 2.73$ versus Firms B/C/D $\approx 6.5$–$7.4$) and integer-density artefacts, with no residual continuous within-firm bimodality.
*Cosine analogue.* The cosine axis follows the same pattern by construction: the within-firm signature-level cosine dip tests above (Scripts 39b, 39c) fail to reject in every Big-4 firm and in every eligible non-Big-4 firm, so any pooled cosine multimodality must arise from between-firm composition rather than from within-population bimodality.
*Integer-histogram valleys (Script 39d).* A genuine within-firm dHash antimode would appear as a strict local minimum in the count histogram with deep relative depth. Within each of the four Big-4 firms, the dHash histogram on bins $0$–$20$ exhibits no strict local minimum; the Big-4 pooled histogram exhibits one shallow valley at $\text{dHash} = 4$ with relative depth $0.021$ (a $2.1\%$ count drop). No valley near the deployed $\text{dHash} = 5$ operational boundary appears within any individual firm. The hypothesised dHash antimode near $\text{dHash} \approx 5$ is not empirically supported by the histogram analysis.
**5. Conclusion: no natural threshold from the descriptor distribution.** §III-I.4 jointly establishes that (a) the Big-4 accountant-level dip rejection is fully attributable to between-firm composition and integer mass-point artefacts; (b) within the Big-4 firms, the descriptor marginals at the signature level are unimodal once integer ties are broken (Scripts 39b, 39d); (c) eligible non-Big-4 checks provide corroborating raw-axis evidence on the cosine dimension (Script 39c) and corroborate the integer-mass-point reading of raw dHash, but are not used as calibration evidence for the deployed thresholds; and (d) no integer-histogram valley near the deployed $\text{dHash} = 5$ operational boundary exists within any Big-4 firm. The descriptor distributions therefore do not contain a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits of §III-I.2 and §III-J are retained as *descriptive partitions* that reflect firm-composition contrast, not as inferential evidence for two or three population modes. §III-L develops the anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode.
## J. K=3 as a Descriptive Partition of Firm-Composition Contrast
This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes.** §III-I.4 demonstrates that the apparent multimodality of the accountant-level marginals is fully explained by between-firm location shifts and integer mass-point artefacts, leaving no residual evidence for two or three latent within-population mechanism classes. Neither mixture is used to assign signature-level or document-level labels in the primary analysis. The operational classifier of §III-H.1 is calibrated in §III-L via inter-CPA negative-anchor coincidence rates, not via mixture-derived antimodes.
**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ (weight $0.689$) and $(0.983, 2.41)$ (weight $0.311$) (Script 34). $\text{BIC}(K{=}2) = -1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. We refer to the components by index rather than by mechanism labels, since §III-I.4 establishes that the K=2 separation is firm-compositional rather than mechanistic.
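
The marginal crossing of a two-component fit is the point between the component means where the weighted densities are equal. A minimal numpy sketch using the K=2 cosine weights and means quoted above, with illustrative standard deviations (the fitted sigmas are not quoted here, so the computed crossing will not reproduce $0.9755$):

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# K=2 weights and means on the cosine marginal from the fit above;
# the sigmas are illustrative assumptions.
w1, mu1, s1 = 0.689, 0.954, 0.012
w2, mu2, s2 = 0.311, 0.983, 0.006

def density_gap(x):
    # Positive where component 1's weighted density dominates, negative
    # where component 2's does; the crossing is the zero of this gap.
    return w1 * gauss(x, mu1, s1) - w2 * gauss(x, mu2, s2)

# Bisect between the component means, where the gap changes sign.
lo, hi = mu1, mu2
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if density_gap(lo) * density_gap(mid) <= 0.0:
        hi = mid
    else:
        lo = mid
crossing = 0.5 * (lo + hi)
```
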
**K=3 fit.** Three components, sorted by ascending cosine mean (Script 35; Script 38 reproduces):

| Component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position |
|---|---|---|---|---|
| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner |
| C2 | 0.9558 | 6.66 | 0.536 | central region |
| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner |

$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical preference for K=3 under standard BIC interpretation, but not by itself decisive). The "descriptive position" column refrains from any mechanism interpretation: §III-I.4 establishes that the cosine and dHash axes both lack within-population bimodality, so component centres are best interpreted as locations in a continuous descriptor space rather than as latent mechanism modes.
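
The BIC comparison mechanic can be sketched on synthetic stand-in data (the cluster centres loosely follow the K=2 components above; none of this reproduces the quoted BIC values):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic accountant-level (cos, dHash) stand-in: two firm-composition
# clusters placed near the K=2 component centres quoted above.
X = np.vstack([
    rng.normal([0.983, 2.4], [0.006, 0.9], size=(136, 2)),
    rng.normal([0.954, 7.1], [0.013, 1.6], size=(301, 2)),
])

# Lower BIC is preferred under the standard interpretation; a small gap
# (like the paper's 3.48) is mild evidence, not decisive.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in (1, 2, 3)}
```
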
**Per-firm component composition (Script 35 firm × cluster cross-tab).** The K=3 partition is dominated by firm membership:
- Firm A: $0\%$ C1, $17.5\%$ C2, $82.5\%$ C3
- Firm B: $8.9\%$ C1, $\sim 78\%$ C2, $\sim 13\%$ C3
- Firm C: $23.5\%$ C1, $75.5\%$ C2, $1.0\%$ C3
- Firm D: $11.5\%$ C1, $\sim 84\%$ C2, $\sim 4.5\%$ C3
Firm A accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24$ of the $40$ C1-assigned CPAs. The K=3 partition is therefore well-described as a firm-compositional decomposition: C3 is essentially "Firm A and any non-Firm-A CPA whose mean descriptors happen to land in the high-cos / low-dHash corner"; C1 is essentially "non-Firm-A CPAs whose mean descriptors land in the low-cos / high-dHash corner." The composition contrast that K=3 captures at the accountant level reappears at the deployment level in the cross-firm hit matrix of §III-L.4 (Script 44): under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). The K=3 partition and the cross-firm hit matrix therefore describe the same underlying firm-compositional structure at two different units of analysis.
**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.023$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label.
We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the operational classifier:
- The Big-4 K=2 marginal crossing $(0.975, 3.76)$ is essentially a firm-mass separator between Firm A and Firms B + C + D, not a within-Big-4 mechanism boundary.
- The Big-4 K=3 mixture exhibits a reproducible three-component component shape across LOOO folds at the descriptor-position level, with C1 reproducibly located at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$.
- Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp); K=3 is therefore not used to assign operational labels to CPAs.
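
The leave-one-firm-out refit loop behind these claims can be sketched as follows; the data are synthetic stand-ins and `GaussianMixture` is assumed as the fitting backend (Scripts 36-37's internals are not documented here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic per-firm (cos, dHash) accountant means; centres loosely echo
# the firm-composition contrast described above. Not corpus data.
centres = {"A": (0.983, 2.4), "B": (0.956, 6.7), "C": (0.946, 9.2), "D": (0.956, 7.0)}
counts = {"A": 171, "B": 90, "C": 99, "D": 77}
firms = np.concatenate([np.full(n, f) for f, n in counts.items()])
X = np.vstack([rng.normal(centres[f], [0.010, 1.2]) for f in firms])

def loo_component_means(X, firms, k=3):
    """Leave-one-firm-out refit: for each held-out firm, refit the
    K-component mixture on the remaining firms and return its component
    means sorted by ascending cosine mean (the shape-stability check)."""
    out = {}
    for f in np.unique(firms):
        gm = GaussianMixture(n_components=k, random_state=0).fit(X[firms != f])
        out[f] = gm.means_[np.argsort(gm.means_[:, 0])]
    return out

fold_means = loo_component_means(X, firms)
```

Comparing `fold_means` across held-out firms is the component-shape stability check; hard-posterior membership of the held-out firm is a separate, composition-sensitive question.
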
The operational signature-level classifier of §III-H.1 is calibrated in §III-L against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K.
## K. Convergent Internal-Consistency Checks
The descriptive partition of §III-J is supported by three feature-derived per-CPA scores and a conservative hard-positive subset analysis. We caution at the outset that the three scores are **not statistically independent measurements** — all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ — so their high pairwise rank correlations are partly a mechanical consequence of shared inputs. Per §III-I.4, none of the three scores has a within-population bimodality interpretation; they are firm-compositional position scores at the accountant level. The checks below therefore document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide).
**1. Three feature-derived per-CPA scores (Script 38).** For each Big-4 CPA we compute:
- **Score 1 (K=3 posterior on the low-cos / high-dHash component):** $P(\text{C1})$ from the K=3 fit of §III-J. Per §III-J this is a firm-compositional position score on the (cos, dHash) plane (not a probability of any latent "hand-signing mechanism") — a function of both descriptor means.
- **Score 2 (reverse-anchor cosine percentile):** the marginal cosine CDF value of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H.2, sign-flipped so that lower percentile (further into the reference's left tail) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end. This is a function of $\overline{\text{cos}}_a$ alone.
- **Score 3 (deployed binary high-confidence box rule rate):** the per-CPA fraction of signatures that do **not** satisfy the deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$). This is a per-signature-aggregated function of the same descriptors.
Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs (Script 38):

| Pair | Spearman $\rho$ | $p$-value |
|---|---|---|
| Score 1 vs Score 3 | $+0.9627$ | $< 10^{-248}$ |
| Score 2 vs Score 3 | $+0.8890$ | $< 10^{-149}$ |
| Score 1 vs Score 2 | $+0.8794$ | $< 10^{-142}$ |

We read this as the strongest internal-consistency signal in the analysis: three different summarisations of the same descriptor pair agree on the per-CPA descriptor-position ranking with $\rho > 0.87$. The three scores agree on placing Firm A as the most replication-dominated descriptor position and the three non-Firm-A Big-4 firms further from the templated end, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Scores 1 and 3) place Firm C at the less-replication-dominated end of Big-4 (mean P(C1) $= 0.311$; mean box-rule less-replication-dominated rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the deployed box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the less-replication-dominated end between the three non-A Big-4 firms.
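
The caution about mechanical correlation can be illustrated directly: two monotone summaries that share the cosine input are strongly rank-correlated even when the underlying per-CPA means are independent noise. The score forms below are illustrative, not the paper's exact Score 2 and Score 3:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 437
cos_bar = rng.uniform(0.90, 0.99, size=n)   # per-CPA mean cosine (synthetic)
dh_bar = rng.uniform(0.5, 12.0, size=n)     # per-CPA mean dHash (synthetic)

# One score uses cosine alone (Score-2-like); the other blends both axes
# (Score-3-like, illustrative weighting). They share the cosine input.
score_cos_only = -cos_bar
score_blend = -(cos_bar - 0.95) / 0.01 + (dh_bar - 5.0) / 3.0
rho, p = spearmanr(score_cos_only, score_blend)
# rho comes out high despite the inputs being independent noise, because
# the shared cosine input mechanically couples the two rankings.
```
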
**2. Per-signature consistency (Script 39).** Per-CPA aggregation could in principle reflect averaging across within-CPA heterogeneity rather than coherent within-CPA behaviour. We test this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points (Script 39) — and comparing labels. The per-CPA and per-signature K=3 fits recover a broadly similar three-component ordering; per-CPA C1 is at $\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$ vs per-signature C1 at $\overline{\text{cos}} = 0.928$, $\overline{\text{dHash}} = 9.75$ (an absolute cosine drift of $0.018$). Cohen $\kappa$ on the binary collapse (replication-dominated vs less-replication-dominated):

| Pair | Cohen $\kappa$ |
|---|---|
| Deployed binary high-confidence box rule vs per-CPA K=3 hard label | $0.662$ |
| Deployed binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ |
| Per-CPA K=3 vs per-signature K=3 | $0.870$ |

The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the five-way rule. This comparison checks only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly check the five-way rule's $5 < \text{dHash} \leq 15$ moderate-confidence band, whose calibration and capture-rate evidence is reported in the supplementary materials and not regenerated on the Big-4 subset.
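
Cohen's $\kappa$ on a binary collapse is simple enough to state in full; a minimal sketch of the statistic used in the table above:

```python
import numpy as np

def cohen_kappa_binary(a, b):
    """Cohen's kappa for two binary label vectors: observed agreement
    corrected for the agreement expected under independent marginals."""
    a, b = np.asarray(a), np.asarray(b)
    p_obs = np.mean(a == b)
    p_chance = (np.mean(a == 1) * np.mean(b == 1)
                + np.mean(a == 0) * np.mean(b == 0))
    return (p_obs - p_chance) / (1.0 - p_chance)
```

Identical label vectors give $\kappa = 1$; agreement no better than the marginals predict gives $\kappa = 0$.
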
**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference:
- *K=2 LOOO is unstable.* The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$, against the report's $0.005$ across-fold stability tolerance (Script 36; pairwise fold range $0.0376$, from $0.9380$ to $0.9756$). When Firm A is held out, the fold rule classifies $171/171$ of held-out Firm A CPAs as templated; when any non-Firm-A Big-4 firm is held out, the fold rule classifies $0$ of the held-out firm's CPAs as templated. This pattern indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
- *K=3 LOOO is partially stable.* The C1 (low-cos / high-dHash) component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.023$ in mixture weight (Script 37). Hard-posterior membership remains composition-sensitive — observed absolute differences are $1.8$–$12.8$ pp across the four folds, with the Firm C fold exceeding the report's $5$ pp viability bar; the report's own screening label is `P2_PARTIAL` ("K=3 is not predictively useful as an operational classifier"). We accordingly do not use K=3 hard-posterior membership as an operational label.
**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one conservative hard-positive subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
We report each candidate check's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions. This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 sample) and the corpus-wide version cited at §IV-I:

| Candidate check | Pixel-identity miss rate (Wilson 95% CI) |
|---|---|
| Deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0\%$ $[0\%, 1.45\%]$ |
| K=3 per-CPA hard label (C3 high-cos / low-dHash corner; descriptive only) | $0\%$ $[0\%, 1.45\%]$ |
| Reverse-anchor with prevalence-calibrated cut | $0\%$ $[0\%, 1.45\%]$ |

All three candidate scores correctly assign every byte-identical signature to the replicated class. We caution that for the box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them. The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* this check would be disqualified), not a sufficient validation of the classifier's behaviour on the non-byte-identical replicated population. The reverse-anchor cut here is chosen by prevalence calibration against the box rule's overall replicated rate ($49.58\%$ of Big-4 signatures); this is a documented limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
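
The Wilson intervals quoted throughout can be reproduced in a few lines; a minimal sketch (with $z$ the standard 95% normal quantile) that recovers the $[0\%, 1.45\%]$ bound for the $0$-of-$262$ rows above:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score 95% interval for a binomial proportion k/n; unlike the
    normal approximation it stays inside [0, 1] and is informative at k = 0."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(centre - half, 0.0), min(centre + half, 1.0)

lo, hi = wilson_ci(0, 262)   # the byte-identical miss-rate rows: 0 of 262
```

At $k = 0$ the interval reduces to $[0,\; z^2/(n + z^2)]$, which for $n = 262$ gives the $1.45\%$ upper bound in the table.
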
## L. Anchor-Based Threshold Calibration
The operational classifier defined in §III-H.1 is calibrated by characterising the deployed thresholds' inter-CPA pair-level negative-anchor coincidence behaviour and their pool-normalised per-signature and per-document alert behaviour, at multiple units of analysis. §III-I.4 establishes that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold; the K=3 mixture of §III-J is a descriptive firm-compositional partition, not a mechanism-cluster model. Throughout this section we report **inter-CPA coincidence rates** rather than "False Acceptance Rates"; we explain the terminological choice in §III-L.0.
### L.0. Calibration methodology
**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine threshold's specificity-proxy behaviour at the inter-CPA pair level and the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for those sub-bands.
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:
- *Per comparison.* For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos $>$ cos\_threshold and / or dHash $\leq$ dHash\_threshold)? This is the conventional pairwise calibration unit in biometric verification. We report it for both the cosine and dHash dimensions, marginally and jointly (§III-L.1).
- *Per signature pool.* For a Big-4 source signature $s$ with same-CPA pool of size $n_{\text{pool}}(s)$, what is the probability that the deployed rule fires *under the counterfactual* of replacing the source's same-CPA pool with $n_{\text{pool}}(s)$ random non-same-CPA candidates? This addresses the standard concern that a per-pair rate computed on independent pairs is not the deployed-rule rate at the per-signature classifier level: the deployed rule takes max-cosine and min-dHash over a pool of size $n_{\text{pool}}(s)$, so its effective coincidence rate is approximately $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit (§III-L.2).
- *Per document.* For an audit report aggregated via the worst-case rule, what fraction of documents have at least one signature whose deployed pool-normalised rule fires under the same inter-CPA candidate-replacement counterfactual? This is the operational alarm-rate unit (§III-L.3).
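
The independence-limit pool-normalised form can be sketched directly; the pool size $285$ below is an illustrative value (Firm A's median pool size from §III-L.4), and the result is a first-order approximation, not the measured per-signature ICCR:

```python
def pool_normalised_rate(p_pair, n_pool):
    """Independence-limit pool-normalised coincidence rate: probability
    that at least one of n_pool independent inter-CPA comparisons fires."""
    return 1.0 - (1.0 - p_pair) ** n_pool

# Joint any-pair per-comparison ICCR from Section III-L.1 with an
# illustrative mid-sized pool; not a corpus-derived prediction.
rate = pool_normalised_rate(0.00014, 285)
```
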
**Any-pair vs same-pair semantics.** The deployed rule uses independent extrema: a signature satisfies the HC rule if $\max_{\text{pool}} \text{cos} > 0.95$ AND $\min_{\text{pool}} \text{dHash} \leq 5$, *not* if a single candidate in the pool satisfies both. We refer to this as the **any-pair** rule. A stricter alternative — the **same-pair** rule — requires a single candidate to satisfy both inequalities; the deployed rule is any-pair, but we report same-pair as a stricter alternative classifier where useful (§III-L.2, §III-L.4).
**Terminological note on "FAR".** The biometric-verification literature speaks of "False Acceptance Rate" (FAR) for a per-pair rate computed on independent inter-CPA pairs. We adopt **inter-CPA coincidence rate (ICCR)** as the metric name and *do not* use "FAR" in the manuscript prose, for two reasons: (a) FAR has a specific biometric-verification meaning that requires ground-truth negative labels (which the corpus does not provide at the signature level); (b) §III-L.4 shows that the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures. Reading "inter-CPA coincidence rate" as a *specificity proxy* under an explicitly disclosed assumption is faithful to the evidence; reading it as a true biometric FAR would overstate the evidence.
### L.1. Per-comparison inter-CPA coincidence rate (Script 40b)
We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from Big-4 signatures, computing for each pair the cosine similarity (feature dot product) and the Hamming distance between the dHash byte vectors. Marginal and joint rates at each candidate threshold are reported with Wilson 95% confidence intervals (Script 40b).
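
The dHash Hamming distance used in the pair sampling can be computed as a popcount of the bytewise XOR; a minimal sketch assuming 64-bit dHash values stored as 8-byte vectors (the storage layout is an assumption, not documented corpus internals):

```python
import numpy as np

def dhash_hamming(a, b):
    """Hamming distance between two dHash values stored as 8-byte vectors:
    the number of differing bits, i.e. popcount of the bytewise XOR (0..64)."""
    x = np.bitwise_xor(np.asarray(a, dtype=np.uint8),
                       np.asarray(b, dtype=np.uint8))
    return int(np.unpackbits(x).sum())
```
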

| Threshold | Per-comparison inter-CPA coincidence rate | 95% Wilson CI |
|---|---|---|
| Cosine $> 0.95$ | $0.00060$ | $[0.00053, 0.00067]$ |
| Cosine $> 0.945$ (alternative operating point from supplementary calibration evidence) | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| Cosine $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| dHash $\leq 2$ | $0.00006$ | $[0.00004, 0.00008]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | $[0.00011, 0.00018]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ |

The cosine row at $\text{cos} > 0.95$ is consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I on a similarly-sized inter-CPA sample; the present $5 \times 10^5$-pair sample yields $0.00060$, within that precision. The dHash row and joint row are reported here for the first time; the corpus-wide spike did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.
The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.
**Conditional inter-CPA coincidence rate.** A natural follow-up question is whether the dHash dimension provides marginal specificity beyond the cosine gate. For pairs with cos $> 0.95$, the conditional rate of dHash $\leq 5$ is $0.234$ (Wilson 95% CI $[0.190, 0.285]$; $70$ of $299$ pairs in the Big-4 sample). At cos $> 0.95$, dHash provides $\sim 4.3\times$ further per-comparison specificity (joint $0.00014$ vs cos-only $0.00060$).
The per-comparison rate is a useful *specificity-proxy calibration* for the deployed rule's pair-level behaviour. It does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.
### L.2. Pool-normalised inter-CPA alert rate (Script 43)
The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \text{dHash}$ over the same-CPA pool of size $n_{\text{pool}}(s)$ for each signature $s$. A per-comparison rate is therefore not the rate at which the deployed classifier fires per signature. To compute the per-signature inter-CPA-equivalent rate, for each Big-4 source signature $s$ we simulate one realisation of an inter-CPA candidate pool of the same size $n_{\text{pool}}(s)$, drawn uniformly from non-same-CPA signatures across all firms, compute the deployed extrema and rule indicator, and aggregate (Script 43; $n_{\text{sig}} = 150{,}453$ vector-complete in this analysis; CPA-block bootstrap 95% CIs reported below).
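
The candidate-pool counterfactual can be sketched as follows; the array layout (precomputed similarities from the source signature to every other signature) is an illustrative assumption, not Script 43's actual data structures:

```python
import numpy as np

def counterfactual_alert(cos_to_s, dh_to_s, same_cpa, n_pool, rng):
    """One realisation of the inter-CPA candidate-pool counterfactual for a
    source signature s. cos_to_s / dh_to_s hold s's similarity to every
    other signature; same_cpa marks s's own CPA. Draw n_pool non-same-CPA
    candidates and apply the deployed any-pair HC rule, taking the
    max-cosine and min-dHash extrema independently over the pool."""
    candidates = np.flatnonzero(~same_cpa)
    pick = rng.choice(candidates, size=n_pool, replace=False)
    return bool((cos_to_s[pick].max() > 0.95) and (dh_to_s[pick].min() <= 5))

rng = np.random.default_rng(0)
same_cpa = np.zeros(100, dtype=bool)
same_cpa[:10] = True   # signature s's own CPA occupies the first 10 slots
```

Averaging this indicator over all source signatures (and seeds) yields the pool-normalised per-signature ICCR reported below.
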
**Headline rates (deployed any-pair rule, HC = cos $> 0.95$ AND dHash $\leq 5$).** Wilson 95% CIs on the point estimate, CPA-block bootstrap 95% CI on $n_{\text{boot}} = 1000$ replicates:

| Rule semantics | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|---|---|---|---|
| Any-pair (deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
| Same-pair (stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |

Per-firm rates under both semantics (no bootstrap; descriptive):

| Firm | $n_{\text{sig}}$ | Any-pair ICCR | Same-pair ICCR |
|---|---|---|---|
| Firm A | $60{,}450$ | $0.2594$ | $0.2018$ |
| Firm B | $34{,}254$ | $0.0147$ | $0.0023$ |
| Firm C | $38{,}616$ | $0.0053$ | $0.0019$ |
| Firm D | $17{,}133$ | $0.0110$ | $0.0051$ |

**Pool-size decile dependence.** The deployed rule's pool-normalised rate is broadly monotonically increasing in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$. Two minor non-monotone reversals occur (deciles 5 and 9 dip below their predecessors).
**Threshold sensitivity at per-signature unit.** Tightening the HC rule from $\text{dHash} \leq 5$ to $\text{dHash} \leq 3$ (same-pair) reduces the per-signature ICCR from $0.0827$ to $0.0449$ (Big-4 pooled); tightening to $\text{dHash} \leq 4$ gives $0.0639$ (same-pair). A stricter operating point of dHash $\leq 3$ same-pair would correspond to a per-signature ICCR of $\approx 0.05$; the deployed HC any-pair rule with $\text{dHash} \leq 5$ corresponds to $\approx 0.11$. Stakeholders requiring a tighter specificity proxy could consider the dHash $\leq 3$ same-pair variant, with the unsupervised-setting caveats of §III-M.
### L.3. Document-level inter-CPA proxy alert rate (Script 45)
The deployed worst-case aggregation classifies each document by the most-replication-consistent category among its constituent signatures (§III-H.1). Three operationally meaningful document-level alarm definitions are reported, each as the fraction of documents whose worst-case signature category falls in the alarm set under the same inter-CPA candidate-pool counterfactual as §III-L.2 (Script 45; $n_{\text{docs}} = 75{,}233$ Big-4 documents):

| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
| D2 | HC + MC ("any non-hand-signed screening label") | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |

Per-firm D2 document-level rates:

| Firm | $n_{\text{docs}}$ | D2 (HC + MC) ICCR |
|---|---|---|
| Firm A | $30{,}226$ | $0.6201$ |
| Firm B | $17{,}127$ | $0.1600$ |
| Firm C | $19{,}501$ | $0.1635$ |
| Firm D | $8{,}379$ | $0.0863$ |

The document-level D2 rate of $33.75\%$ pooled over Big-4 is the most operationally relevant alarm-rate metric: it is the fraction of audit documents that would carry at least one signature flagged HC or MC under the counterfactual of inter-CPA candidate-pool replacement. The non-trivial per-document inter-CPA alarm rate (and its concentration in Firm A at $62\%$) motivates the positioning of the operational system as a **screening framework with human-in-the-loop review**, not as an autonomous forensic classifier (§III-M).
### L.4. Firm heterogeneity (Script 44)
§III-L.2 and §III-L.3 report large per-firm variation in the deployed rule's pool-normalised behaviour: Firm A's any-pair per-signature ICCR is $0.2594$, more than an order of magnitude larger than Firm B's $0.0147$, Firm C's $0.0053$, and Firm D's $0.0110$. A natural alternative explanation is the pool-size confound: Firm A's median pool size ($\sim 285$) is larger than the other firms', and pool size broadly monotonically increases the per-signature rate (§III-L.2 decile trend). We test the firm-vs-pool confound with a logistic regression of the per-signature hit indicator (any-pair HC) on firm dummies (Firm A = reference) and centred log pool size (Script 44):

| Term | Odds ratio (vs Firm A) | Direction | Magnitude |
|---|---|---|---|
| Firm B | $0.053$ | $< 1$ | $\sim 19\times$ lower odds than Firm A |
| Firm C | $0.010$ | $< 1$ | $\sim 100\times$ lower odds than Firm A |
| Firm D | $0.027$ | $< 1$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $> 1$ | $\sim 4\times$ higher odds per unit log pool size |

The Firm B/C/D odds ratios are very small after controlling for pool size, indicating that firm membership accounts for a large multiplicative effect on the per-signature rate that is *not* explained by pool size alone. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm, and naive standard errors would be unreliable under within-cluster correlation; a cluster-robust standard error analysis is left as a robustness check.)

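The dummy-plus-centred-log-pool specification can be sketched as follows. The data here are synthetic stand-ins (the per-signature hit indicators are not reproduced in the paper), so the fitted odds ratios illustrate the mechanics of the Script 44 regression, not the reported values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic stand-in data (assumption): firm labels and log pool sizes.
firm = rng.choice(["A", "B", "C", "D"], size=n, p=[0.40, 0.23, 0.26, 0.11])
log_pool = np.where(firm == "A", np.log(285), np.log(120)) + rng.normal(0, 0.5, n)

# Design matrix: intercept, firm dummies (Firm A = reference), centred log pool.
X = np.column_stack([
    np.ones(n),
    (firm == "B").astype(float),
    (firm == "C").astype(float),
    (firm == "D").astype(float),
    log_pool - log_pool.mean(),
])
true_beta = np.array([-1.0, -3.0, -4.0, -3.5, 1.4])   # illustrative effects
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

# Unpenalised logistic fit by Newton-Raphson (IRLS).
beta = np.zeros(5)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))

# Odds ratios vs Firm A (first three) and per unit centred log pool size (last).
odds_ratios = np.exp(beta[1:])
```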
The per-decile per-firm breakdown (Script 44) confirms the pattern: within every pool-size decile, Firms B/C/D have rates of $0.0006$–$0.0358$, while Firm A's rate ranges $0.0541$–$0.5958$ across deciles. The firm gap is large within matched pool sizes, not driven by pool composition.

**Cross-firm hit matrix.** Among Big-4 source signatures whose any-pair rule fires under the inter-CPA candidate-pool counterfactual, the candidate firm of the max-cosine partner is distributed as follows (Script 44):

| Source firm | Firm A candidate | Firm B | Firm C | Firm D | non-Big-4 | hits |
|---|---|---|---|---|---|---|
| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ |
| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ |
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |

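The within-firm concentration shares quoted in the interpretation below ($98.8\%$ for Firm A; $76.7$–$83.7\%$ for Firms B/C/D) follow directly from the rows of the hit matrix; a minimal check:

```python
import numpy as np

# Cross-firm hit-matrix rows (source firm x candidate firm), from Script 44.
# Columns: Firm A, Firm B, Firm C, Firm D, non-Big-4 candidates.
hits = np.array([
    [14447,  95,  44,  19, 17],   # Firm A source
    [   92, 371,   8,   4,  9],   # Firm B source
    [   16,   7, 149,   5,  1],   # Firm C source
    [   22,   2,   6, 106,  1],   # Firm D source
])
row_totals = hits.sum(axis=1)                      # 14622, 484, 178, 137
within_firm = np.diag(hits[:, :4]) / row_totals    # within-firm share per source firm
```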
For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$).

**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$–$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (supplementary materials; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder production-related reuse patterns at Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.

This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where the within-firm collision concentration is the dominant pattern at all four Big-4 firms — most strongly at Firm A ($98.8\%$ any-pair, $99.96\%$ same-pair) and at materially lower but still majority levels at Firms B/C/D ($76.7$–$83.7\%$ any-pair; $97.0$–$98.2\%$ same-pair).
### L.5. Alert-rate sensitivity around deployed thresholds (Script 46)
To test whether the deployed cosine threshold $0.95$ and dHash threshold $5$ coincide with a low-gradient (plateau-stable) region of the deployed-rule alert-rate surface — which would be weak distributional evidence that the deployed thresholds are stable operating points — we sweep each threshold across a range and report the per-signature alert rate on actual observed Big-4 same-CPA pools (not inter-CPA-replaced pools), comparing the local gradient at the deployed threshold to the median gradient across the sweep (Script 46).

At the deployed HC operating point cos $> 0.95$ AND dHash $\leq 5$, the local gradient of the per-signature alert rate is substantially larger than the median gradient across the sweep (cosine: ratio $\approx 25\times$ at the $0.95$ point relative to median; dHash: ratio $\approx 3.8\times$ at the $5$ point relative to median; both Script 46). Reading these ratios descriptively, the deployed HC threshold is *locally sensitive* rather than plateau-stable: small threshold perturbations materially change the deployed alert rate (cosine sweep at dHash $\leq 5$ yields rates of $0.5091$ at cos $> 0.945$ vs $0.4789$ at cos $> 0.955$, a $3.0$ pp swing across a $0.01$ cosine perturbation; dHash sweep at cos $> 0.95$ yields rates of $0.4207$ at dHash $\leq 4$ vs $0.5639$ at dHash $\leq 6$, a $14.3$ pp swing across a single integer step). The local-gradient-to-median-gradient ratios are descriptive diagnostics, not formal plateau tests; the primary evidence for "no within-population bimodal antimode at these thresholds" comes from §III-I.4's composition decomposition, not from §III-L.5.

The MC/HSC boundary at dHash $= 15$, by contrast, *is* in a low-gradient region (ratio $\approx 0.08$ to the median); the plateau-like behaviour around dHash $= 15$ is corroborating evidence that the high-end structural threshold lies in a regime where the rule's alert rate is approximately saturated, consistent with the high-dHash tail behaviour expected once near-identical pairs have been exhausted. The §III-L.5 non-plateau / local-sensitivity finding therefore applies specifically to the HC cutoff (cos $= 0.95$, dHash $= 5$); the MC/HSC sub-band boundary at dHash $= 15$ exhibits the opposite behaviour and is plateau-like.

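The local-gradient-to-median-gradient diagnostic of Script 46 can be sketched on a synthetic alert-rate curve; the curve here is an illustrative stand-in, not the actual sweep values:

```python
import numpy as np

def gradient_ratio(thresholds, rates, deployed):
    """Local |gradient| at the deployed threshold relative to the median
    |gradient| across the sweep (a descriptive diagnostic, not a plateau test)."""
    grads = np.abs(np.gradient(rates, thresholds))
    i = int(np.argmin(np.abs(thresholds - deployed)))
    return grads[i] / np.median(grads)

# Illustrative stand-in curve: steep near cos = 0.95, flat elsewhere.
cos_thresholds = np.linspace(0.90, 0.99, 19)
alert_rates = 0.5 - 0.45 / (1 + np.exp(-(cos_thresholds - 0.95) * 200))
ratio = gradient_ratio(cos_thresholds, alert_rates, deployed=0.95)
# ratio >> 1 indicates a locally sensitive (non-plateau) operating point
```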
We interpret the deployed HC thresholds as **specificity-anchored operating points** chosen for the specificity-vs-alert-yield tradeoff (§III-L.1), *not* as distributional antimodes. Alternative operating points on the tradeoff curve can be characterised by inverting the per-comparison or pool-normalised ICCR curves (§III-L.1, §III-L.2) at the preferred specificity target.
### L.6. Observed deployed alert rate on actual same-CPA pools
The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the deployed HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$).

The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised inter-CPA rate ($0.4958$ vs $0.1102$); the per-document observed-deployed rate is $\sim 3.5\times$ the pool-normalised inter-CPA D1 (HC) rate ($0.6228$ vs $0.1797$). We refer to this multiplicative gap as the **deployed-rate excess over the inter-CPA proxy**:

- Per-signature: $0.4958 - 0.1102 = 0.3856$ ($38.6$ pp excess)
- Per-document HC: $0.6228 - 0.1797 = 0.4431$ ($44.3$ pp excess)

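The excess and ratio figures reduce to one-line arithmetic on the quoted rates:

```python
# Rates as quoted above: observed-deployed vs pool-normalised inter-CPA proxy.
per_sig_deployed, per_sig_proxy = 0.4958, 0.1102
per_doc_deployed, per_doc_proxy = 0.6228, 0.1797

sig_excess = per_sig_deployed - per_sig_proxy   # 0.3856 (38.6 pp)
doc_excess = per_doc_deployed - per_doc_proxy   # 0.4431 (44.3 pp)
sig_ratio = per_sig_deployed / per_sig_proxy    # ~4.5x
doc_ratio = per_doc_deployed / per_doc_proxy    # ~3.5x
```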
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as a *same-CPA repeatability signal* — a quantity that exceeds what random inter-CPA candidate replacement would produce — rather than as an estimate of true replication prevalence.
## M. Unsupervised Diagnostic Strategy and Limits
The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth.

Each diagnostic reported in this paper therefore addresses one specific failure mode of an unsupervised screening classifier; the full diagnostic-to-failure-mode-to-assumption map is given in Appendix A Table A.II. No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth.

**Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, supplementary materials) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.

**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature any-pair HC ICCR ($0.11$; Table XXII) and per-document HC+MC alarm-rate ICCR ($0.34$; Table XXIII) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.

**Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and missed-replication risk (per false negative) — neither directly observable from corpus data.
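Inverting an ICCR curve at a preferred specificity target amounts to monotone interpolation over the tabulated sweep; the curve points here are illustrative stand-ins, not the Script values:

```python
import numpy as np

# Illustrative (threshold, per-comparison ICCR) sweep points -- stand-ins for
# the real section III-L.1 curve, which decreases monotonically in the threshold.
cos_thresholds = np.array([0.90, 0.92, 0.94, 0.95, 0.96, 0.98])
iccr = np.array([5e-3, 2e-3, 9e-4, 6e-4, 3e-4, 5e-5])

def threshold_for_target(target_iccr):
    """Interpolated cosine threshold achieving the target per-comparison ICCR."""
    # np.interp needs increasing x, so interpolate threshold as a function of
    # ICCR on the reversed (increasing-ICCR) arrays.
    return float(np.interp(target_iccr, iccr[::-1], cos_thresholds[::-1]))

tighter = threshold_for_target(5e-5)   # tighter operating point on this curve
```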
## N. Data Source and Firm Anonymization
**Audit-report corpus.** The 90,282 audit-report PDFs analyzed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation.
MOPS is the statutory public-disclosure platform for Taiwan-listed companies; every audit report filed on MOPS is already a publicly accessible regulatory document.
We did not access any non-public auditor work papers, internal firm records, or personally identifying information beyond the certifying CPAs' names and signatures, which are themselves published on the face of the audit report as part of the public regulatory filing.
The CPA registry used to map signatures to CPAs is a publicly available audit-firm tenure registry (Section III-B).

**Firm-level anonymization.** Although all audit reports and CPA identities in the corpus are public, we report firm-level results under the pseudonyms Firm A / B / C / D throughout this paper to avoid naming specific accounting firms in descriptive rate comparisons.
Readers with domain familiarity may still infer Firm A from contextual descriptors (Big-4 status, replication-dominated behavior); we disclose this residual identifiability explicitly and note that none of the paper's conclusions depend on the specific firm's name.