Files
pdf_signature_extraction/paper/paper_a_v4_combined.md
T
gbanyan 1e8466f7a8 Paper A v4.3: unify Firm A positioning to out-of-sample templated-end target
Finishes the BCD re-anchor chassis (audit critique #1): Firm A was
inconsistently framed as both a "within-Big-4 case study" and an
"out-of-sample target". Harmonised to a single label, "out-of-sample
templated-end target" (held out of the calibration negative anchor;
scored against the normative Firms-B/C/D baseline), across:
- §I contribution #3 (title + body)
- §III-H.2 (opening trio BCD/Firm-A/non-Big-4; sub-header; role sentence)
- §V-C body (removed the dual case-study/out-of-sample phrasing)
(§V-C header already fixed in ac3372d.)

Zero "case study" wording remains; no numbers changed. codex gpt-5.5
focused check: all consistency items PASS, no new findings.

Also restore the BCD+non-Big-4 joint ICCR Wilson CI [0.000001, 0.000015]
to the §IV-M Table XXI note (three-scope CI symmetry; the one MINOR
completeness gap surfaced by a codex old-vs-new content diff, which
otherwise confirmed no substantive content was dropped by the trim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 02:11:09 +08:00

170 KiB
Raw Blame History

title, author
title author
Automated Screening of Digitally Replicated Signatures in Large-Scale Financial Audit Reports [Authors removed for double-blind review]

Abstract

Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen non-hand-signed signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating style consistency from image reproduction. Applied to 90,282 Taiwan audit reports (20132023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold (p=0.35 after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a clean pre-e-signature baseline (Firms B/C/D, 20132019), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a low inter-CPA coincidence rate (per-comparison ICCR 0.000010; per-signature 0.006; per-document 0.012), whereas the moderate-confidence band (dHash$\leq 15$) retains a \sim 0.175 per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A never coincides cross-firm yet fires on 82\% of its own (\sim 139\times floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.

I. Introduction

Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection. In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require certifying CPAs to affix their signature or seal (簽名或蓋章) to each audit report [1]. While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].

The digitization of financial reporting has introduced a practice that complicates this intent. As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one. This reproduction can occur either through an administrative stamping workflow — in which scanned signature images are affixed by staff as part of the report-assembly process — or through a firm-level electronic signing system that automates the same step. We refer to signatures produced by either workflow collectively as non-hand-signed. Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement. The accounting literature has examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33]. Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused, and is visually invisible to report users at scale.

The distinction between non-hand-signing detection and signature forgery detection is conceptually and technically important. The extensive body of research on offline signature verification [3][8] focuses almost exclusively on forgery detection — determining whether a questioned signature was produced by its purported author. In our context, identity is not in question; the CPA is indeed the legitimate signer. The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports. This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction.

A methodological concern shapes the research design. Many prior similarity-based classification studies rely on ad-hoc thresholds — declaring two images equivalent above a hand-picked cosine cutoff, for example — without principled statistical justification. Such thresholds are fragile in an archival-data setting. A defensible approach requires (i) explicit calibration of the operational thresholds against measurable negative-anchor evidence; (ii) diagnostic procedures that test whether the descriptor distribution itself supports a within-population threshold, including formal decomposition of apparent multimodality into between-group composition and integer-tie artefacts; (iii) annotation-free reporting of operational alarm rates at multiple analysis units (per-comparison, per-signature pool, per-document) with Wilson 95% confidence intervals; (iv) per-firm stratification of the reported rates to surface heterogeneity that aggregate metrics conceal; and (v) explicit disclosure of the unsupervised setting's limits — in particular, the inability to estimate true error rates without signature-level ground-truth labels.

Despite the significance of the problem for audit quality and regulatory oversight, to our knowledge no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation.

In this paper we present a fully automated, end-to-end pipeline for screening non-hand-signed CPA signatures in audit reports at scale, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour under explicit unsupervised assumptions. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-I); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-J.1); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-K.4); and (8) disclosure of each diagnostic's untested assumption (§III-N).

We are deliberate about what the system claims. The operating thresholds are operator-tunable rather than asserted as ground-truth decision boundaries: the contribution is not a fixed detector that pronounces a signature non-hand-signed, but (a) a dual-descriptor design that separates style consistency from image reproduction, and (b) a methodology for choosing and characterising a screening operating point in the absence of labels, so that an operator can set a specificity target and read off what each setting yields. Operationally the framework is a semi-automated triage step that surfaces a tractable set of replication candidates from hundreds of thousands of signatures for human adjudication; it does not adjudicate. The firm-level results and the byte-identical capture check are reported as demonstrations that this triage works at scale, not as forensic determinations.

A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of 2.73 versus Firms B/C/D's 6.46, 7.39, 7.21) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears (p_{\text{median}} = 0.35 across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-K.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.

In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a clean pre-e-signature baseline (Firms B/C/D, 20132019); §III-I.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR = 0.000010 (versus 0.00014 on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR = 0.0059 (CPA-block bootstrap 95% [0.0045, 0.0073]), and per-document ICCR = 0.012 — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND 5 < \text{dHash} \leq 15), by contrast, retains a per-document coincidence rate of 0.175 even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover (\text{cos} = 0.837) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves \leq 0.012 across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide.

With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR 0.0059), the deployed rule fires on each firm's actual same-CPA pools far above the inter-CPA coincidence floor: Firm A at 0.82 (\sim 139\times floor), Firms B/C/D at $0.24$0.35 ($\sim 40$59\times). Firm A scored against the clean 20132019 baseline coincides essentially never (0.0001, below the clean-baseline floor itself) — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are 0.053 (B), 0.010 (C), 0.027 (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within \sim 3.5\times (odds ratio 1.73 for B, 0.49 for C). Under the deployed any-pair rule, within-firm collision concentration is a universal Big-4 pattern — 98.8\% at Firm A and, on the clean BCD pool, $89$97\% at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos > 0.95 AND dHash \leq 5) and the advisory moderate-confidence sub-rule (cos > 0.95 AND 5 < \text{dHash} \leq 15) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels.

Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman \rho \geq 0.879: the K=3 mixture posterior (a firm-compositional position score under §III-L's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve 0\% positive-anchor miss rate (Wilson 95% upper bound 1.45\%). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.

We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. The Big-4 sub-corpus comprises 437 CPAs and 150,442 signatures with both descriptors available.

The contributions of this paper are:

  1. Problem formulation. We define non-hand-signing detection as distinct from signature forgery detection and frame it as a detection problem on intra-signer similarity distributions.

  2. End-to-end pipeline. We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention before the human-adjudication step.

  3. Dual-descriptor similarity. We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash provides complementary evidence for screening cases where style consistency and image reproduction hypotheses diverge, and we support the backbone choice through a feature-backbone ablation.

  4. Composition decomposition does not support the distributional-threshold path. We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.

  5. Anchor-based multi-level ICCR calibration on a normative non-Firm-A baseline. We characterise the deployed high-confidence (HC) sub-rule at three units of analysis against a clean Firms-B/C/D negative anchor (Firm A held out as an out-of-sample target to avoid circularity): per-comparison ICCR 0.000010, pool-normalised per-signature ICCR 0.0059, and per-document ICCR 0.012 — each roughly an order of magnitude below the contaminated all-Big-4 figures (0.00014, 0.11, 0.18). The moderate-confidence band (dHash$\leq 15$) retains a \sim 0.175 per-document coincidence rate on the clean baseline and is repositioned as a low-specificity advisory tier rather than a confident non-hand-signed label. Because the deployed thresholds are operator-tunable, the contribution is this label-free calibration methodology — a principled way to choose and characterise a screening operating point and the specificity it yields — rather than any specific threshold. We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.

  6. Firm A as a singular out-of-sample extreme; universal within-firm collision concentration. Against the clean BCD floor (per-signature HC ICCR 0.0059), the deployed rule fires on each firm's own pools far above the inter-CPA coincidence floor (Firm A 0.82, \sim 139\times; Firms B/C/D $0.24$0.35, $\sim 40$59\times), while Firm A scored cross-firm against the clean 20132019 baseline coincides essentially never cross-firm (0.0001, below the floor itself) — localising the repeatability signal to within-firm comparisons. Two logistic regressions (full-Big-4 with Firm A reference: odds ratios $0.053$/$0.010$/0.027 for B/C/D; BCD-only with Firm D reference: residual spread within \sim 3.5\times, odds ratios $1.73$/0.49 for B/C) show Firm A is the lone outlier while Firms B/C/D form an internally homogeneous baseline. Within-firm collision concentration is a universal Big-4 pattern — 98.8\% at Firm A and $89$97\% at Firms B/C/D on the clean pool — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).

  7. K=3 as descriptive firm-compositional partition; three-score convergent internal consistency. We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman \rho \geq 0.879; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.

  8. Annotation-free positive-anchor capture check and unsupervised-setting disclosure. We achieve 0\% positive-anchor miss rate (Wilson 95% upper bound 1.45\%) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. Each supporting diagnostic in §III-N addresses one specific failure mode of an unsupervised screening classifier — composition artefacts, inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold sensitivity, or positive-anchor capture — with an explicitly disclosed untested assumption. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review.

The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity positive-anchor check, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work.

II. Related Work

A. Offline Signature Verification

Offline signature verification---determining whether a static signature image is genuine or forged---has been studied extensively using deep learning. Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant. Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work. Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining. Kao and Wen [5] addressed offline verification and forgery detection using only a single known genuine signature per writer with an explainable deep-learning approach. More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results. Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives. Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer. Hafemann et al. [16] further addressed the practical challenge of adapting to new users through meta-learning, reducing the enrollment burden for signature verification systems.

A common thread in this literature is the assumption that the primary threat is identity fraud: a forger attempting to produce a convincing imitation of another person's signature. Our work addresses a fundamentally different problem---detecting whether the legitimate signer's stored signature image has been reproduced across many documents---which requires analyzing the upper tail of the intra-signer similarity distribution rather than modeling inter-signer discriminability.

Brimoh and Olisah [8] are closest in spirit in using reference evidence to discipline threshold choice. Their setting, however, uses standard verification benchmarks with known genuine references, whereas our archival setting lacks signature-level labels and therefore characterises a fixed deployed screening rule through inter-CPA coincidence-rate anchors.

B. Document Forensics and Copy Detection

Image forensics encompasses a broad range of techniques for detecting manipulated visual content [17], with recent surveys highlighting the growing role of deep learning in forgery detection [18]. Copy-move forgery detection (CMFD) identifies duplicated regions within or across images, typically targeting manipulated photographs [11]. Abramova and Böhme [10] adapted block-based CMFD to scanned text documents, noting that standard methods perform poorly in this domain because legitimate character repetitions produce high similarity scores that confound duplicate detection.

Woodruff et al. [9] developed the work most closely related to ours: a fully automated pipeline for extracting and analyzing signatures from corporate filings in the context of anti-money-laundering investigations. Their system uses connected component analysis for signature detection, GANs for noise removal, and Siamese networks for author clustering. While their pipeline shares our goal of large-scale automated signature analysis on real regulatory documents, their objective---grouping signatures by authorship---differs fundamentally from ours, which is detecting image-level reproduction within a single author's signatures across documents.

In the domain of image copy detection, Pizzi et al. [13] proposed SSCD, a self-supervised descriptor using ResNet-50 with contrastive learning for large-scale copy detection on natural images. Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach.

C. Perceptual Hashing

Perceptual hashing algorithms generate compact fingerprints that are robust to minor image transformations while remaining sensitive to substantive content changes [19]. Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process.

Jakhar and Borah [12] demonstrated that combining perceptual hashing with deep learning features significantly outperforms either approach alone for near-duplicate image detection, achieving AUROC of 0.99 on standard benchmarks. Their two-stage architecture---pHash for fast structural comparison followed by deep features for semantic verification---provides methodological precedent for our dual-descriptor approach, though applied to natural images rather than document signatures.

Our work differs from prior perceptual-hashing studies in its application context and in the specific challenge it addresses: distinguishing legitimate high visual consistency (a careful signer producing similar-looking signatures) from image-level reproduction in scanned financial documents.

D. Deep Feature Extraction for Signature Analysis

Several studies have explored pre-trained CNN features for signature comparison without metric learning or Siamese architectures. Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach. Tsourounis et al. [21] demonstrated successful transfer from handwritten text recognition to signature verification, showing that CNN features trained on related but distinct handwriting tasks generalize effectively to signature comparison. Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach.

Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach. These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical.

E. Statistical Methods for Threshold Characterisation and Calibration

Our threshold-characterisation and calibration framework combines three families of methods developed in statistics and accounting-econometrics.

Non-parametric density estimation. Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions. In idealized two-class mixture settings with equal priors and equal misclassification costs, the local density minimum (antimode) between the two modes coincides with the Bayes-optimal decision boundary. The statistical validity of the unimodality-vs-multimodality dichotomy can be tested via the Hartigan & Hartigan dip test [37], which tests the null of unimodality; we use rejection of this null as evidence consistent with (though not a direct test for) bimodality.

Discontinuity tests on empirical distributions. Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions. Under the null that the distribution is generated by a single smooth process, the expected count in any histogram bin equals the average of its two neighbours, and the standardized deviation from this expectation is approximately N(0,1). The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis. The BD/McCrary pairing provides a local-density-discontinuity diagnostic that is informative about distributional smoothness under minimal assumptions; we use it in that diagnostic role (rather than as a threshold estimator) because its transitions in our corpus are bin-width-sensitive at the signature level and rarely significant at the accountant level (Appendix A).

Finite mixture models. When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters. For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation. Under mild regularity conditions, White's quasi-MLE result [41] supports interpreting maximum-likelihood estimates under a mis-specified parametric family as consistent estimators of the pseudo-true parameter that minimizes the Kullback-Leibler divergence to the data-generating distribution within that family; we use this result to justify the Beta-mixture fit as a principled approximation rather than as a guarantee that the true distribution is Beta.

The present study uses these tools diagnostically: first to test whether the descriptor distribution supports a natural operating boundary, and then, when that support fails under composition decomposition, to motivate anchor-based ICCR calibration of a fixed deployed rule.

Cross-validation in a small-cluster scope. Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-M differs in two respects from the standard usage: (i) the hold-out unit is the firm (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a composition-sensitivity band on the candidate mixture boundary, not as a sufficiency claim for the deployed five-way operational classifier (§III-H.1; calibrated separately in §III-I). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier.

III. Methodology

A. Pipeline Overview

We propose a six-stage pipeline for large-scale screening of non-hand-signed auditor signatures in scanned financial documents. Fig. 1 illustrates the overall architecture. The pipeline takes as input a corpus of PDF audit reports and produces five-way operational screening labels (§III-H.1) whose behaviour is characterised by pixel-identity positive-anchor capture checks and inter-CPA coincidence-rate calibration (§III-I).

Throughout this paper we use the term non-hand-signed rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.

B. Data Collection

The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings. An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period. Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.

CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports. Table I summarizes the dataset composition.

Table I. Dataset Summary.

Attribute Value
Total PDF documents 90,282
Date range 20132023
Signature-page candidates (VLM-positive) 86,084 (95.3%)
Processed for signature extraction 86,071 (95.3%)
Unique CPAs identified 758
Accounting firms >50

C. Signature Page Identification

To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24], one of the multimodal generative models surveyed in [35], as an automated pre-screening mechanism. Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. The model was configured with temperature 0 for deterministic output.

The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document. Scanning terminated upon the first positive detection. This process identified 86,084 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. An additional 13 PDFs that could not be rendered (corruption or read errors) were excluded, yielding a final set of 86,071 documents.

Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents. The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling.

D. Signature Detection

We adopted YOLOv11n (nano variant) [25], a lightweight descendant of the original YOLO single-stage detector [34], for signature region localization. A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.

The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).

Table II. YOLO Detection Performance.

Metric Value
Precision 0.970.98
Recall 0.950.98
mAP@0.50 0.980.99
mAP@0.50:0.95 0.850.90

Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.

Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). The matched records assume standard two-signature ordering; residual order-mismatch risk remains for nonstandard layouts. The remaining 7.4% (13,573 signatures) could not be matched to a registered CPA name---typically because the auditor's report page format deviates from the standard two-signature layout, or because OCR of the printed CPA name on the page returns a name not present in the registry---and these signatures are excluded from all subsequent same-CPA pairwise analyses (a same-CPA best-match statistic is undefined when a signature has no assigned CPA). The 92.6% matched subset forms the candidate pool for same-CPA analyses, before the Big-4 and descriptor-completeness restrictions described in §III-G.

E. Feature Extraction

Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning. The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.

Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.

The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-H, Engineering-level caveats). This design choice is supported by an ablation study (Section IV-L) comparing ResNet-50 against VGG-16 and EfficientNet-B0.

F. Dual-Method Similarity Descriptors

For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:

Cosine similarity on deep embeddings captures high-level visual style:

\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B

where \mathbf{f}_A and \mathbf{f}_B are L2-normalized 2048-dim feature vectors. Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].

Perceptual hash distance (dHash) [27] captures structural-level similarity. Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images. Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].

These descriptors provide partially independent evidence. Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. Non-hand-signing is expected to yield extreme similarity under both descriptors, since the underlying image is identical up to reproduction noise; scan-stage noise can in principle push a replicated pair off either extremum but rarely both. One working hypothesis is that some hand-signed repetitions may preserve coarse layout while varying in fine execution, producing relatively higher dHash similarity than cosine similarity within a same-CPA pair; the classifier does not require this hypothesis to hold for all CPAs, and the descriptor-level pattern is used only as input to the deployed rule, not as a within-CPA consistency claim. Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.

We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances (L_1, L_2, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. The supplementary materials contain the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-M), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.

Cosine similarity on L2-normalised deep embeddings and dHash both remain stable across the print-scan-rasterise cycle by design [14], [19], [21], [27]; together they constitute the dual descriptor used throughout the rest of this paper.

G. Unit of Analysis and Scope

We analyse signatures at two descriptor-summary units of resolution. The signature — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The accountant — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-L), of per-CPA internal-consistency analysis (§III-M), and of the leave-one-firm-out reproducibility check (§III-M). At the accountant level we compute, for each CPA with n_{\text{sig}} \geq 10 signatures, the per-CPA mean of the per-signature best-match cosine (\overline{\text{cos}}_a) and the per-CPA mean of the independent-minimum dHash (\overline{\text{dHash}}_a). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. §III-I additionally characterises the deployed rule's behaviour at three operational reporting units (per-comparison, per-signature, per-document), which are distinct from the descriptor-summary units defined here: the descriptor-summary units summarise input descriptors; the operational reporting units summarise rule outputs.

We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a summary statistic of their observed signatures, not a claim that all of their signatures share a single mechanism.

We adopt one stipulation about same-CPA pair detectability:

(A1) Pair-detectability. If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the observed same-CPA candidate pool used by the max-cosine / min-dHash computation, pooled over the CPA's reports across years. A1 does not assume temporal stability of handwriting or scanning workflow within or across years.

A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.

Scope: the Big-4 sub-corpus. The primary analyses (§III-I through §III-M, and the corresponding §IV-D through §IV-J and §IV-M tables) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F; §IV-K reports a deliberately narrow full-dataset cross-check at n = 686 CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with n_{\text{sig}} \geq 10 — the threshold for accountant-level analyses — totalling 150,442 Big-4 signatures with both pre-computed descriptors available. Restricting the primary analyses to Big-4 is a methodological choice driven by four considerations:

  1. Restricted generalisability claim and Big-4 institutional comparability. The primary claims are scoped to the Big-4 audit-report context, where the four firms share comparable institutional scale, document-production infrastructure, and CPA-volume regime; we do not assert that the same descriptive mixture structure or operational alert behaviour extends to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H.2's reverse-anchor internal-consistency check, (b) as a robustness comparison in §IV-K, and (c) as a corroborating-population check on the dHash discrete-mass-point artefact in §III-K.4. Generalisation beyond Big-4 is left as future work.

  2. Within-firm cross-CPA collision structure analysis. §III-J.1 reports a Big-4 cross-firm hit-matrix analysis that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work.

  3. Firm A as the out-of-sample templated-end target. Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-L K=3 component cross-tab; byte-level pair analysis referenced in §III-H.2). It is held out of the calibration negative anchor and scored as an out-of-sample target against the normative Firms-B/C/D baseline (§III-I.0, §III-J), illustrating the templated end the screening surfaces rather than serving as a calibration anchor for thresholds.

  4. Leave-one-firm-out fold feasibility. §III-M reports leave-one-firm-out (LOOO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$O(30) per firm.

Sample-size reconciliation. Two Big-4 signature counts appear in this section and §IV: n = 150{,}442 for analyses using the pre-computed per-signature descriptors \text{cos}_s (max_similarity_to_same_accountant) and \text{dHash}_s (min_dhash_independent), and n = 150{,}453 for analyses recomputing pair-level metrics directly from the stored feature and dHash byte vectors (Scripts 40b, 43, 44). The $11$-signature difference reflects descriptor-completion status: 11 signatures have feature vectors and dHash byte vectors stored but lack the pre-computed extrema. The 11 signatures are negligible at population scale and do not affect any reported coincidence rate within 0.01 percentage point. The CPA counts 468 (all Big-4 CPAs with both vectors stored) and 437 (Big-4 CPAs with n_{\text{sig}} \geq 10 for accountant-level stability) likewise reflect a single uniform exclusion rule rather than analysis-specific subsetting.

H. Operational Classifier and Reference Populations

H.1. Deployed Operational Rule

Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair (\text{cos}_s, \text{dHash}_s) where \text{cos}_s is the maximum cosine similarity to another signature by the same CPA and \text{dHash}_s is the minimum independent dHash to another signature by the same CPA. The five labels below name regions of the descriptor space and are operational rule outputs, not validated ground-truth classes; the label names reflect the screening hypothesis associated with each region and are subject to the unsupervised-setting caveats of §III-N:

  1. High-confidence replication candidate (HC): Cosine > 0.95 AND \text{dHash}_{\text{indep}} \leq 5. Both descriptors converge on image-similarity evidence consistent with replication; this is the highest-priority triage bin for human review, and mechanism attribution remains subject to §III-N.
  2. Moderate-confidence advisory flag (MC): Cosine > 0.95 AND 5 < \text{dHash}_{\text{indep}} \leq 15. Feature-level similarity is strong but structural similarity is below the high-confidence cutoff; §III-I.3 shows this band carries low inter-CPA specificity even on the normative baseline, so it is a low-specificity advisory bin (review-workload-expanding) rather than a confident replication flag.
  3. High style-consistency flag (HSC): Cosine > 0.95 AND \text{dHash}_{\text{indep}} > 15. High feature-level similarity without structural corroboration; the descriptor position is operationally distinguished from HC/MC, but the underlying mechanism (within-CPA signing style, lossy image reproduction with structural drift, or a hybrid) is not resolved by descriptor data alone.
  4. Uncertain (UN): Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95.
  5. Low replication-similarity (LH): Cosine \leq 0.837. The name reflects the screening hypothesis that low maximum same-CPA cosine similarity is more consistent with hand-signing variation than with image replication; it is an operational low-priority bin, not a verified hand-signed classification, since cross-year handwriting drift, scanner-workflow change, or template variant rotation within a CPA's reports can also yield a low max-cosine within a same-CPA pool.

Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds (\text{cos} = 0.95 as the cosine operating point, \text{cos} = 0.837 as the all-pairs KDE crossover, \text{dHash} = 5 and 15 as structural-similarity sub-band cutoffs) retain their prior calibration provenance (see supplementary materials). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-I).

The remainder of this section (§III-H.2) describes the reference populations used to calibrate and cross-check this rule. §III-I then establishes the normative baseline and its inter-CPA coincidence floor; §III-J reads each firm, and Firm A in particular, as a deviation from that floor; §III-K shows that the descriptor distributions provide no within-population natural threshold; §III-L–§III-M develop the descriptive mixture partition and internal-consistency cross-checks; §III-N discloses the unsupervised-setting limits.

H.2. Reference Populations

Beyond the normative Firms-B/C/D baseline that anchors the calibration (§III-I), two further populations inform the analysis: Firm A, the out-of-sample templated-end target characterised in §III-J, and the 249 non-Big-4 CPAs, an out-of-target reverse-anchor reference for the internal-consistency checks of §III-M. Neither is the calibration negative anchor: Firm A is held out of it (§III-I.0), and the non-Big-4 reference informs the §III-M cross-checks (it also appears as a robustness scope in §III-I).

The out-of-sample target: Firm A as the templated end. Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-L; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos \approx 0.946, dHash \approx 9.17, weight \approx 0.143), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.

Firm A is not the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-L), not from Firm A alone. Firm A's role is as the out-of-sample templated-end target (§III-J): it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-M's pixel-identity positive-anchor miss rate.

External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking. The 249 non-Big-4 CPAs (n_{\text{sig}} \geq 10, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA (\overline{\text{cos}}_a, \overline{\text{dHash}}_a) distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-M's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores used as a cross-check on the per-signature classifier. The reverse-anchor reference is not a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked.

The reverse-anchor reference centre is at \overline{\text{cos}} = 0.935, \overline{\text{dHash}} = 9.77 (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 low-cos / high-dHash component (cos = 0.946, dHash = 9.17; §III-L); compared to the Big-4 high-cos / low-dHash component (cos = 0.983, dHash = 2.41; §III-L) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of \overline{\text{cos}}_a within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end of the descriptor plane. This is a "deviation in the less-replication-dominated descriptor-position direction" measure, not a "deviation toward the templated descriptor-position" measure; the reference is the less-replication-dominated population.

I. Normative Baseline and the Inter-CPA Coincidence Floor

We calibrate the operational classifier of §III-H.1 by first establishing a normative baseline — a population of independent CPAs in which the deployed rule should fire only by chance — and then measuring the rate at which it fires there. This inter-CPA coincidence floor is the reference against which every firm, Firm A included, is read in §III-J. Because the descriptor distributions contain no within-population bimodal antimode that could anchor a threshold directly (§III-K), it is this empirically measured floor, rather than a distributional cut, that gives the deployed thresholds an interpretable specificity meaning. Throughout we report inter-CPA coincidence rates (ICCR) rather than "False Acceptance Rates", for the reasons given in §III-I.0. This section develops the calibration method and reports the headline floor at each unit of analysis; the full result tables are consolidated in §IV-M (Tables XXIXXIII).

I.0. Calibration methodology

Choice of negative-anchor pool. A negative anchor must approximate a population in which the rule should not fire — independent CPAs whose signatures coincide only by chance. §III-J.1 shows that under the deployed rule 98.8\% of Firm A's inter-CPA collisions fall on other Firm-A CPAs, and byte-level evidence (§IV-H, supplementary materials) confirms image-level reuse across \sim 50 Firm-A partners. Including Firm A in the negative-anchor pool would therefore load the "coincidence" rate with structured within-firm collisions rather than chance coincidence — a circularity, since that collision structure is the phenomenon the rule targets. We adopt Firms B/C/D (BCD) as the normative negative-anchor baseline and report the all-Big-4 (ABCD) pool only as a contamination-comparison scope; Firm A enters as an out-of-sample target (§III-J.1), not as a calibration input. A still-broader baseline adding the eligible non-Big-4 firms (BCD+non-Big-4) is reported as a robustness scope.

We further restrict the calibration baseline temporally to fiscal years 20132019. Following the post-2020 acceleration of digital document workflows, Taiwan audit firms increasingly adopted electronic-signature and stamping systems for report assembly, with firm-specific timing; the pre-2020 BCD period is therefore the construct-clean hand-signing baseline, while the post-2020 period mixes genuine hand-signing with legitimate e-signing and is not a clean negative anchor. The data corroborate this: the BCD per-comparison HC floor rises from 0.000010 (20132019) to 0.000036 (20202023), and the per-signature floor from 0.0059 to 0.0105 — the gradual, non-stepped rise being consistent with staggered per-firm adoption. We therefore calibrate on BCD 20132019 and report BCD 20202023 only as a robustness scope. Firm A is scored across its full 20132023 record against this clean threshold.

Calibration role. The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. Because a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists, §III-K.4), §III-I.1 below characterises the cosine and structural (\text{dHash} \leq 5) thresholds' specificity-proxy behaviour at the inter-CPA pair level on the BCD baseline. The sub-band thresholds (\text{dHash} = 15, \text{cos} = 0.837) retain their supplementary calibration evidence; the present calibration does not provide independent rates for them. The cosine LH/UN crossover \text{cos} = 0.837 is a corpus-wide descriptor-space landmark (intra- vs inter-CPA cosine KDE crossover, §IV-C) robust to baseline choice — it moves by at most 0.012 across the corpus-wide, BCD, and BCD+non-Big-4 scopes (0.8367, 0.8489, 0.8302) — so we retain the corpus-wide value.

Three units of analysis. We report inter-CPA negative-anchor coincidence behaviour at three units, each answering a different operational question:

  • Per comparison. For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos > cos_threshold and / or dHash \leq dHash_threshold)? This is the conventional pairwise calibration unit in biometric verification, reported marginally and jointly (§III-I.1).
  • Per signature pool. For a source signature s with same-CPA pool of size n_{\text{pool}}(s), what is the probability that the deployed rule fires under the counterfactual of replacing the source's same-CPA pool with n_{\text{pool}}(s) random non-same-CPA candidates from the baseline pool? The deployed rule takes max-cosine and min-dHash over the pool, so its effective coincidence rate is \approx 1 - (1 - p_{\text{pair}})^{n_{\text{pool}}} in the independence limit (§III-I.2).
  • Per document. For an audit report aggregated via the worst-case rule, what fraction of documents have at least one signature whose pool-normalised rule fires under the same inter-CPA candidate-replacement counterfactual? This is the operational alarm-rate unit (§III-I.3).

Any-pair vs same-pair semantics. The deployed rule uses independent extrema: a signature satisfies the HC rule if \max_{\text{pool}} \text{cos} > 0.95 AND \min_{\text{pool}} \text{dHash} \leq 5, not if a single candidate satisfies both. We call this the any-pair rule, and report the stricter same-pair rule (one candidate satisfying both inequalities) as an alternative where useful (§III-I.2, §III-J.1).

Terminological note on "FAR". We adopt inter-CPA coincidence rate (ICCR) and do not use "FAR", for two reasons: (a) FAR has a specific biometric-verification meaning requiring ground-truth negative labels, which the corpus does not provide at the signature level; (b) the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures, which is precisely why we move the anchor to the BCD baseline. Even on the BCD baseline, ICCR is a specificity proxy under an explicitly disclosed assumption, not a true biometric FAR.

I.1. Per-comparison inter-CPA coincidence rate (Script 46)

We sample 5 \times 10^5 inter-CPA pairs uniformly at random from the baseline pool, computing for each the cosine similarity and the Hamming distance between dHash byte vectors, with Wilson 95% confidence intervals (Script 46; Table XXI, §IV-M).

On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule (cos > 0.95 AND dHash \leq 5, any-pair) is \mathbf{0.000010} [0.000004, 0.000023] — roughly 8\times lower than the all-Big-4 rate (0.000140) and lower still when the non-Big-4 firms are added (0.000004). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-J.1): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms (5 of 5 \times 10^5 pairs on the BCD pool), so we treat the per-comparison joint rate as an order-of-magnitude specificity proxy and let the well-powered per-signature and per-document units (§III-I.2, §III-I.3) carry the primary calibration weight. The all-Big-4 cos > 0.95 marginal (0.00060) is consistent with the corpus-wide per-comparison rate of §IV-I, and on the all-Big-4 sample the conditional rate ICCR(dHash \leq 5\mid cos > 0.95) = 0.234 — the structural dimension adds substantial per-comparison specificity beyond the cosine gate.

The per-comparison rate does not directly translate to deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size n_{\text{pool}} (§III-I.2).

I.2. Pool-normalised inter-CPA alert rate (Script 52)

For each source signature s we simulate one realisation of an inter-CPA candidate pool of the same size n_{\text{pool}}(s), drawn uniformly from non-same-CPA signatures in the baseline pool, compute the deployed extrema and rule indicator, and aggregate (Script 52, canonical retry-loop sampler matching Scripts 43/45; CPA-block bootstrap 95% CIs on n_{\text{boot}} = 1000 replicates; Table XXII, §IV-M).

On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is \mathbf{0.0059} [0.0045, 0.0073] — an order of magnitude below the all-Big-4 figure of 0.1102, which is dominated by Firm A. Once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is \approx 0.59\%. This is the specificity-proxy floor against which the deployed HC rule operates. The rate rises with pool size (the rule takes extrema over n_{\text{pool}} candidates), consistent with the 1 - (1 - p_{\text{pair}})^{n_{\text{pool}}} form expected under inter-CPA independence; the within-firm violation of that independence (§III-J.1) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash \leq 3) by inverting the ICCR curve, under the unsupervised-setting caveats of §III-N.

I.3. Document-level inter-CPA proxy alert rate (Script 52)

Each document is classified by the worst-case rule over its constituent signatures (§III-H.1) under the same inter-CPA candidate-pool counterfactual as §III-I.2 (Script 52, dominant-firm document assignment; Table XXIII, §IV-M).

The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation. On the BCD baseline the per-document HC (dHash \leq 5) rate is \mathbf{0.0117} (\sim 8\times below the all-Big-4 0.1797): a clean inter-CPA baseline almost never produces an HC document, confirming HC as a high-specificity operating point. The HC+MC (dHash \leq 15) rate, by contrast, stays high on the clean baseline — 0.1753 per document — and does not fall when Firm A is removed. Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B 0.162, Firm C 0.225, Firm D 0.089, slightly higher than under the all-Big-4 pool (B 0.160, C 0.163, D 0.088), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash \leq 15 scale. We therefore treat the HC sub-rule (dHash \leq 5) as the specificity-anchored operating point and reposition the MC band (5 < \text{dHash} \leq 15) as a low-specificity advisory tier, not a confident non-hand-signed screening label: roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement. The positioning of the operational system as a screening framework with human-in-the-loop review, not an autonomous forensic classifier, follows directly (§III-N).

J. Firm-Level Deviation from the Baseline

With the calibration anchored on the clean BCD floor (§III-I.2), every firm — and in particular Firm A — can be read as a deviation from that floor. Firm A is scored as a true out-of-sample target against the baseline.

J.1. Firm A as an out-of-sample target; firm heterogeneity

Three complementary readings establish that Firm A is the extreme case while keeping the inferential limits explicit.

(i) Observed deployed rate versus the clean floor. The deployed HC rule fires on each firm's actual same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of 0.0059 (§III-I.2):

Firm Observed per-signature HC rate Multiple of BCD floor
Firm A 0.817 \sim 139\times
Firm B 0.346 \sim 59\times
Firm C 0.238 \sim 40\times
Firm D 0.245 \sim 42\times

All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at \sim 139\times, roughly $2.4$3.4\times the other Big-4 firms in absolute rate.

(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm. Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is 0.0001 — below even the BCD-internal floor of 0.0059, i.e. Firm A's signatures essentially never resemble genuine 20132019 hand-signing from other firms. The entire elevation in Firm A's observed rate (0.817) therefore arises from matches against other Firm-A signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness.

(iii) Firm-effect regressions: Firm A singular, baseline homogeneous. Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are 0.053 (B), 0.010 (C), 0.027 (D), with log-pool-size odds ratio 4.01 — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; n = 89{,}994, hit rate 0.0059), the residual firm spread collapses to within a factor of \sim 3.5: odds ratios 1.73 (B), 0.49 (C), log-pool-size odds ratio 3.29. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.)

Cross-firm hit matrix: within-firm concentration is a universal Big-4 pattern. Under the deployed any-pair rule, inter-CPA collisions concentrate within the source firm at every Big-4 firm. On the full Big-4 candidate pool, within-firm concentration is 98.8\% at Firm A and $76.7$83.7\% at Firms B/C/D (same-pair $97.0$99.96\%; Table XXV). Restricting the candidate pool to the BCD baseline (Script 53) raises the within-firm concentration for B/C/D to $89.2$97.2\% any-pair (Firm B 97.2\%, Firm C 92.3\%, Firm D 89.2\%) and $98.5$100\% same-pair — higher than on the full pool, because there some B/C/D collisions landed on Firm A's generically copy-like signatures; removing Firm A leaves each firm's collisions concentrated within itself. Within-firm collision concentration is therefore a universal Big-4 structural pattern, not a Firm-A peculiarity: Firm A is extreme in the rate at which the rule fires (reading (i)), but all four firms exhibit the same within-firm collision signature.

J.2. Observed deployed alert rate on actual same-CPA pools

Reading (i) of §III-J.1 reported each firm's observed HC rate; reading (ii) used the inter-CPA candidate-replacement counterfactual. Here we report the pooled-Big-4 observed deployed alert rate — the rate at which the rule fires on each source's actual same-CPA pool across the real corpus — and its excess over the clean floor. For Big-4 it fires on 49.58\% of signatures and 62.28\% of documents (Script 46; Script 42 reproduces the per-signature rate). Read against the normative BCD floor rather than the contaminated all-Big-4 rate, the observed-deployed excess is large: per signature 0.4958 vs 0.0059 (49.0 pp, \sim 84\times); per document (HC) 0.6228 vs 0.0117 (61.1 pp, \sim 53\times). Anchoring the floor on the clean BCD baseline sharpens this contrast, since the all-Big-4 floor would understate it by absorbing Firm A's reuse.

Interpretation and inferential limits. The firm multiples of §III-J.1 and the observed-deployed excess above are not true-positive rates: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above that floor. We therefore read the excess as an observed same-CPA-pool excess over the normative inter-CPA floor — a quantity far exceeding what random inter-CPA candidate replacement among normative firms would produce — whose mechanism is not identifiable from descriptor-only data (§III-N); we do not attribute it to within-CPA handwriting repeatability or to image replication without further evidence. Likewise, the within-firm collision concentration is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: common form templates, shared scanning workflows, and report-generation infrastructure could all produce visually similar signature crops across CPAs within a firm. Byte-level decomposition of Firm A's 145 pixel-identical signatures across \sim 50 distinct certifying partners (§IV-H, supplementary materials) is direct evidence of image-level reuse among Firm A signatures; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We report "inter-CPA collision concentration is within-firm" as a descriptive observation about deployed-rule behaviour and refrain from inferring deliberate or systematic template sharing.

K. Why the Descriptor Distribution Provides No Threshold

The baseline calibration of §III-I is necessary because the descriptor distribution itself supplies no within-population threshold. This section establishes that negative result: the joint distribution of accountant-level descriptor means (\overline{\text{cos}}_a, \overline{\text{dHash}}_a) across the 437 Big-4 CPAs contains no within-population bimodal antimode that could anchor the deployed operational thresholds. We apply four diagnostics, decompose the one apparent rejection into its true source, and confirm that the deployed thresholds sit on a steep — not plateau-like — region of the alert-rate surface.

1. Hartigan dip test on each accountant-level marginal. The dip test [37] on each marginal \{\overline{\text{cos}}_a\} and \{\overline{\text{dHash}}_a\} (bootstrap p, n_{\text{boot}} = 2000) rejects unimodality at the Big-4 sub-corpus (p < 5 \times 10^{-4} on both, Script 34). The rejection does not hold in narrower tested scopes (Script 32): Firm A alone (p_{\text{cos}} = 0.992, p_{\text{dHash}} = 0.924), Firms B+C+D pooled (0.998, 0.906), and all non-Firm-A CPAs pooled (0.998, 0.907). The Big-4 rejection is thus a descriptive observation that item 4 below attributes entirely to between-firm composition rather than within-population bimodality.

2. K=2 / K=3 Gaussian mixture fits (descriptive partition). A 2-component 2D GMM (Script 34) recovers components at (0.954, 7.14), weight 0.689, and (0.983, 2.41), weight 0.311, with marginal crossings \overline{\text{cos}}^* = 0.9755, \overline{\text{dHash}}^* = 3.755. The 3-component fit is mildly BIC-preferred (\Delta\text{BIC} = -3.48, not decisive). Following item 4 we treat both fits as descriptive partitions reflecting firm-composition structure (Firm A vs others), not evidence for latent population modes; they are developed in §III-L.

3. Burgstahler-Dichev / McCrary density-smoothness diagnostic. Applied as a density-smoothness diagnostic [38, 39] on each marginal, the test flags no significant transition at the Big-4 scope on either axis (\alpha = 0.05, Script 34). It does flag dHash transitions in some out-of-scope subsets (Script 32) but no cosine transition in any subset — consistent with item 4: once between-firm and integer-tie confounds are removed, the Big-4 marginals are unimodal, so a local-discontinuity test correctly finds no within-population transition.

4. Composition decomposition (Scripts 39b39e). The Big-4 accountant-level rejection (item 1) could reflect (a) genuine within-population bimodality, (b) between-firm location-shift artefacts, or (c) integer mass-point artefacts on the integer-valued dHash axis. Repeating the dip test at the signature level inside each firm shows the cosine marginal fails to reject unimodality in every firm tested — all four Big-4 firms (p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}; Script 39b) and ten non-Big-4 firms with \geq 500 signatures (p_{\text{cos}} \in [0.59, 0.99]; Script 39c). The raw dHash marginal rejects in every firm, but the values are integer-valued; adding uniform jitter \sim \mathrm{U}[-0.5, +0.5] to break exact ties (5 seeds; Script 39d) eliminates the within-firm dHash rejection in every Big-4 firm (jittered p_{\text{median}} \geq 0.996, 0/5 seeds reject). The pooled-Big-4 dHash dip survives jitter alone, but Firm A's mean dHash (2.73) sits well below Firms B/C/D's (6.46, 7.39, 7.21) — a between-firm location shift. Script 39e applies a 2 \times 2 factorial correction on the pooled dHash:

Condition Firm-mean centred Integer jitter Median dip p Reject at \alpha = 0.05
1 raw < 5 \times 10^{-4} 5/5
2 centred only \checkmark < 5 \times 10^{-4} 5/5
3 jittered only \checkmark < 5 \times 10^{-4} 5/5
4 centred and jittered \checkmark \checkmark \mathbf{0.35} \mathbf{0/5}

Removing both the between-firm location shift and the integer mass points eliminates the rejection (p_{\text{median}} = 0.35): the Big-4 pooled dHash multimodality is fully attributable to firm-composition contrast and integer-density artefacts, with no residual continuous within-firm bimodality. Consistently, within each Big-4 firm the dHash histogram on bins $0$20 exhibits no strict local minimum, and the pooled histogram shows only a shallow valley at \text{dHash} = 4 (relative depth 2.1\%) — no antimode near the deployed \text{dHash} = 5 boundary in any firm.

5. Conclusion. The descriptor distributions contain no within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixtures (§III-L) are therefore descriptive firm-compositional partitions, not evidence for population modes, and the anchor-based calibration of §III-I does not require a distributional antimode.

6. Local sensitivity of the deployed thresholds (Script 46). As a final confirmation that the deployed HC thresholds are not distributional features, we sweep each threshold against the actual observed Big-4 same-CPA pools and compare the local gradient at the deployed value to the median gradient across the sweep. At cos > 0.95 AND dHash \leq 5 the local gradient is substantially larger than the median (cosine ratio \approx 25\times; dHash ratio \approx 3.8\times): the HC threshold is locally sensitive, not plateau-stable (a 0.01 cosine perturbation swings the rate 3.0 pp; a single dHash integer step swings it 14.3 pp). The MC/HSC boundary at dHash = 15, by contrast, lies in a low-gradient plateau (ratio \approx 0.08\times median) — adding alert yield without inter-CPA specificity (§III-I.3), reinforcing the MC band's demotion to an advisory tier. We therefore read the deployed HC thresholds as specificity-anchored operating points (chosen for the specificity-vs-alert-yield tradeoff, §III-I.1), not as distributional antimodes; the gradient ratios are descriptive diagnostics, and the primary "no antimode" evidence comes from item 4 above.

L. K=3 as a Descriptive Partition of Firm-Composition Contrast

This section develops the K=2 and K=3 Gaussian mixture fits and clarifies their role. Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes (§III-K.4 shows the apparent multimodality is fully explained by between-firm location shifts and integer mass-point artefacts). Neither mixture assigns signature- or document-level labels in the primary analysis; the operational classifier of §III-H.1 is calibrated in §III-I via inter-CPA coincidence rates, not mixture-derived antimodes.

K=2 fit. Two components at (\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14) (weight 0.689) and (0.983, 2.41) (weight 0.311); marginal crossings \overline{\text{cos}}^* = 0.9755, \overline{\text{dHash}}^* = 3.755 (Script 34). We refer to components by index, since §III-K.4 establishes that the separation is firm-compositional, not mechanistic.

K=3 fit. Three components, sorted by ascending cosine mean (Script 35; Script 38 reproduces):

Component \overline{\text{cos}} \overline{\text{dHash}} weight descriptive position
C1 0.9457 9.17 0.143 low-cos / high-dHash corner
C2 0.9558 6.66 0.536 central region
C3 0.9826 2.41 0.321 high-cos / low-dHash corner

\text{BIC}(K{=}3) = -1111.93, below K{=}2 by 3.48 (mild, not decisive). The component centres are locations in a continuous descriptor space, not latent mechanism modes.

The partition is dominated by firm membership (Script 35). Firm A is 82.5\% C3 and accounts for 141 of the 143 C3-assigned CPAs; Firm C accounts for 24 of the 40 C1-assigned CPAs; Firms B and D sit predominantly in the central C2. The K=3 partition is therefore a firm-compositional decomposition: C3 is essentially "Firm A," C1 essentially "non-Firm-A CPAs in the low-cos / high-dHash corner." This same firm-compositional contrast reappears at the deployment level in the cross-firm hit matrix of §III-J.1.

Leave-one-firm-out stability (Scripts 36, 37). K=2 is unstable across folds (holding Firm A out shifts the cosine crossing to 0.938 vs \sim 0.975 for the other folds; max across-fold deviation 0.028, 5.6\times the report's tolerance), confirming the K=2 boundary is essentially a Firm-A-versus-others separator. K=3 has a reproducible component shape (C1 cosine mean varies by \leq 0.005, dHash by \leq 0.96, weight by \leq 0.012 across folds) but composition-sensitive hard-posterior membership (held-out C1 rate 36.3\% vs baseline 23.5\% at Firm C — a 12.8 pp difference; legend P2_PARTIAL). We therefore do not use K=3 hard-posterior membership as an operational label; the operational classifier is calibrated in §III-I, and cross-checks between the deployed rule and the K=3 partition appear in §III-M.

M. Convergent Internal-Consistency Checks

The descriptive partition of §III-L is supported by three feature-derived per-CPA scores and a conservative hard-positive subset analysis. We caution at the outset that the three scores are not statistically independent: all three are deterministic functions of the same per-CPA descriptor means (\overline{\text{cos}}_a, \overline{\text{dHash}}_a), so their high rank correlations are partly mechanical. The checks below document internal consistency among feature-derived ranks, not external validation against an independent hand-signed ground truth (which the corpus does not provide). Full result tables are in §IV-F (Tables IX, XI) and §IV-H (Table XIV).

1. Three feature-derived per-CPA scores (Script 38). For each Big-4 CPA we compute (i) the K=3 posterior P(\text{C1}) on the low-cos / high-dHash component (§III-L); (ii) a reverse-anchor cosine percentile of \overline{\text{cos}}_a under the non-Big-4 reference Gaussian of §III-H.2; and (iii) the per-CPA fraction of signatures not satisfying the deployed HC box rule (cos > 0.95 AND dHash \leq 5). The three pairwise Spearman correlations are all \rho \geq 0.879 (+0.9627, +0.8890, +0.8794; n = 437; Table IX) — the strongest internal-consistency signal in the analysis: three summarisations of the same descriptor pair agree on the per-CPA ranking, all placing Firm A as the most replication-dominated descriptor position. They disagree only mildly at the less-replication-dominated end (Scores 1/3 place Firm C furthest from the templated end; the reverse-anchor places Firm D fractionally beyond Firm C). This is not external validation of any classifier; the deployed rule is calibrated separately (§III-I).

2. Per-signature consistency (Script 39). Refitting K=3 at the signature level (150,442 Big-4 points) and comparing binary labels gives Cohen \kappa = 0.870 between per-CPA-fit and per-signature-fit K=3 labels (Table XI), so per-CPA aggregation does not collapse the three-component ordering. The lower $\kappa = 0.56$0.66 between the binary box rule and either K=3 fit reflects different decision geometries (rectangular box vs Gaussian-mixture boundary).

3. Leave-one-firm-out reproducibility (Scripts 36, 37). As developed in §III-L, firm-level LOOO shows K=2 unstable (a Firm-A-versus-others separator) while K=3 has a reproducible C1 component shape (\leq 0.005 cosine drift) but composition-sensitive hard membership (up to 12.8 pp; P2_PARTIAL), which is why K=3 hard membership is not an operational label. Full LOOO tables are in §IV-G.

4. Positive-anchor miss rate on byte-identical signatures (Script 40). The corpus provides one conservative hard-positive subset: n = 262 Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation (145 / 8 / 107 / 2 across Firms AD). Independent hand-signing cannot produce pixel-identical images, so these are a conservative hard-positive subset for replication. All three candidate scores — the deployed HC box rule, the K=3 hard label, and the prevalence-calibrated reverse-anchor cut — assign every byte-identical signature to the replicated class (0\% miss, Wilson [0\%, 1.45\%]; Table XIV, §IV-H). We caution that for the box rule this is close to tautological (byte-identical neighbours have cos \approx 1, dHash \approx 0), so it is a necessary check a failing classifier would not pass, not a sufficiency proof on the non-byte-identical replicated population. The corresponding inter-CPA negative-anchor evidence is in §III-I.1 (Big-4) and the corpus-wide version at §IV-I.

N. Unsupervised Diagnostic Strategy and Limits

The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-M.4) is by construction near \text{cos} = 1 and \text{dHash} = 0, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth. Each diagnostic in this paper addresses one specific failure mode of an unsupervised screening classifier; the full diagnostic-to-failure-mode-to-assumption map is given in Appendix A Table A.II.

Limits. We do not claim a validated forensic detector or an autonomous classification system, and we do not interpret the deployed-rate excess of §III-J.2 as a presumed true-positive rate. That interpretation would require assuming a CPA's genuine same-CPA hand-signing produces a collision rate no higher than random inter-CPA pairs — unsafe for two reasons: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA cosine similarity; and (b) within-firm template-like reuse (§III-J.1; byte-level evidence of Firm A's pixel-identical signatures across partners) places a collision floor that itself reflects reuse rather than independent random matching. We describe the within-firm collision concentration of §III-J.1 as a descriptive observation and treat its mechanism as an open empirical question.

Scope and positioning. The deployed rule is characterised at three units against the normative Firms-B/C/D inter-CPA negative anchor, with Firm A held out as an out-of-sample target (§III-I.0). The resulting rates (§III-I, §III-J) are specificity-proxy-anchored alarm-yield indicators, not true error rates: the HC rule has a very low BCD coincidence rate at every unit, the dHash \leq 15 MC band is a low-specificity advisory tier, and the per-firm heterogeneity is read against the clean floor (Firm A the rate-extreme, its signal within-firm). The framework is positioned as a specificity-proxy-anchored screening tool with human-in-the-loop review, not a validated forensic classifier.

Specificity-alert-yield tradeoff. Because sensitivity is unobservable, operators cannot derive an operating point by optimising a ROC criterion. Instead, tighter operating points (e.g., cos > 0.98 AND dHash \leq 3) further reduce both the per-comparison ICCR and the per-signature alert yield below the deployed-HC values of §III-I.1–§III-I.2, with an unknown effect on actual replication-detection recall — so tightening is not necessarily preferable. The deployment decision depends on the relative cost of manual review per alarm and missed-replication risk per false negative, neither directly observable from corpus data.

O. Data Source and Firm Anonymization

Audit-report corpus. The 90,282 audit-report PDFs analyzed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation. MOPS is the statutory public-disclosure platform for Taiwan-listed companies; every audit report filed on MOPS is already a publicly accessible regulatory document. We did not access any non-public auditor work papers, internal firm records, or personally identifying information beyond the certifying CPAs' names and signatures, which are themselves published on the face of the audit report as part of the public regulatory filing. The CPA registry used to map signatures to CPAs is a publicly available audit-firm tenure registry (Section III-B).

Firm-level anonymization. Although all audit reports and CPA identities in the corpus are public, we report firm-level results under the pseudonyms Firm A / B / C / D throughout this paper to avoid naming specific accounting firms in descriptive rate comparisons. Readers with domain familiarity may still infer Firm A from contextual descriptors (Big-4 status, replication-dominated behavior); we disclose this residual identifiability explicitly and note that none of the paper's conclusions depend on the specific firm's name.

IV. Experiments and Results

Section IV reports the empirical results that calibrate and characterise the operational classifier of §III-H.1 (calibration developed in §III-I). The primary analyses (§IV-D through §IV-J, and the anchor-based ICCR calibration consolidated in §IV-M) are scoped to the Big-4 sub-corpus (Firms AD, n = 437 CPAs with n_{\text{sig}} \geq 10, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. §IV-K reports a full-dataset (686 CPAs) robustness check on the K=3 mixture and per-CPA score-rank convergence; §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F.

A. Experimental Setup

Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration. Feature extraction used PyTorch 2.9 with torchvision model implementations. The complete pipeline---from raw PDF processing through final classification---was implemented in Python. Because all steps rely on deterministic forward inference over fixed pre-trained weights (no fine-tuning) plus fixed-seed numerical procedures, reported results are platform-independent to within floating-point precision.

B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). However, the subsequent production deployment provides a practical consistency check: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

Table III. Extraction Results.

Metric Value
Documents processed 86,071
Documents with detections 85,042 (98.8%)
Total signatures extracted 182,328
Avg. signatures per document 2.14
CPA-matched signatures 168,755 (92.6%)
Processing rate 43.1 docs/sec

The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in the primary analyses of §IV-D through §IV-J.

C. All-Pairs Intra-vs-Inter Class Distribution Analysis

Fig. 2 presents the cosine similarity distributions computed over the full set of pairwise comparisons under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-signature classification (Section III-H.1). Table IV summarizes the distributional statistics.

Table IV. Cosine Similarity Distribution Statistics.

Statistic Intra-class Inter-class
N (pairs) 41,352,824 500,000
Mean 0.821 0.758
Std. Dev. 0.098 0.090
Median 0.836 0.774
Skewness 0.711 0.851
Kurtosis 0.550 1.027

Both distributions are left-skewed and leptokurtic. Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; the subsequent distributional diagnostics in Section IV-D are produced via the methods of Section III-K to avoid single-family distributional assumptions.

The KDE crossover---where the two density functions intersect---was located at 0.837. Under equal prior probabilities and equal misclassification costs, this crossover is a candidate decision boundary between the two classes; we adopt it only as the operational LH/UN boundary in §III-H.1, not as a natural distributional threshold. Statistical tests confirmed significant separation between the two distributions (Cohen's d = 0.669, Mann-Whitney [36] p < 0.001, K-S 2-sample p < 0.001).

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength. We therefore rely primarily on Cohen's d as an effect-size measure that is less sensitive to sample size. A Cohen's d of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.

D. Big-4 Accountant-Level Distributional Characterisation

This section reports the empirical evidence for §III-K's distributional diagnostics at the Big-4 accountant level. The accountant-level dip-test rejection reported in Table V is, per §III-K.4, fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration.

Table V. Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32).

Population n CPAs p_{\text{cos}} p_{\text{dHash}} Interpretation
Big-4 pooled (analysis scope; not the calibration anchor) 437 < 5 \times 10^{-4} < 5 \times 10^{-4} reject unimodality on both axes
Firm A pooled alone 171 0.992 0.924 unimodal
Firms B + C + D pooled 266 0.998 0.906 unimodal
All non-Firm-A pooled 515 0.998 0.907 unimodal

Bootstrap implementation: n_{\text{boot}} = 2000; for the Big-4 cells, no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by the bootstrap resolution 1 / 2000 = 5 \times 10^{-4} (Script 34 reports this as p = 0.0000; we report p < 5 \times 10^{-4} to reflect the resolution). Single-firm dip statistics for Firms B, C, and D were not separately computed.

Table VI. Burgstahler-Dichev / McCrary density-smoothness diagnostic on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; \alpha = 0.05, two-sided).

Population Cosine: significant transition? dHash: significant transition?
Big-4 pooled (primary) none (p > 0.05) none (p > 0.05)
Firm A pooled alone none none
Firms B + C + D pooled none one transition at \overline{\text{dHash}} = 10.8
All non-Firm-A pooled none one transition at \overline{\text{dHash}} = 6.6

The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). These off-Big-4 dHash transitions are scope-dependent and are not used as operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes.

E. Big-4 K=2 / K=3 Mixture Fits

This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings.

Table VII. Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-L) and marginal-crossing bootstrap 95% confidence intervals.

K=2 component \overline{\text{cos}} \overline{\text{dHash}} weight
K=2-a (low-cos / high-dHash position) 0.954 7.14 0.689
K=2-b (high-cos / low-dHash position) 0.983 2.41 0.311

Marginal crossings (point + bootstrap 95% CI, n_{\text{boot}} = 500):

Axis Point Bootstrap median 95% CI CI half-width
cos 0.9755 0.9754 [0.9742, 0.9772] 0.0015
dHash 3.755 3.763 [3.476, 3.969] 0.246

\text{BIC}(K{=}2) = -1108.45 (Script 34).

Table VIII. Big-4 K=3 mixture components (descriptive firm-compositional partition per §III-L; not mechanism clusters).

K=3 component \overline{\text{cos}} \overline{\text{dHash}} weight descriptive position
C1 0.9457 9.17 0.143 low-cos / high-dHash corner
C2 0.9558 6.66 0.536 central region
C3 0.9826 2.41 0.321 high-cos / low-dHash corner

\text{BIC}(K{=}3) = -1111.93, lower than K{=}2 by 3.48 (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-I and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.

F. Convergent Internal-Consistency Checks

This section reports the empirical evidence for §III-M's three-score internal-consistency analysis. We re-emphasise the §III-M caveat: the three scores are deterministic functions of the same per-CPA descriptor pair (\overline{\text{cos}}_a, \overline{\text{dHash}}_a) and are not statistically independent measurements. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth.

Table IX. Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, n = 437.

Score pair Spearman \rho $p$-value
K=3 P(C1) vs deployed box-rule less-replication-dominated rate +0.9627 < 10^{-248}
Reverse-anchor cosine percentile vs deployed box-rule less-replication-dominated rate +0.8890 < 10^{-149}
K=3 P(C1) vs Reverse-anchor cosine percentile +0.8794 < 10^{-142}

(Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on n = 249 non-Big-4 CPAs; reference centre \overline{\text{cos}} = 0.935, \overline{\text{dHash}} = 9.77.

Table X. Per-firm summary across the three feature-derived scores, Big-4.

Firm n CPAs mean P(\text{C1}) mean reverse-anchor score mean deployed less-replication-dominated rate
Firm A 171 0.0072 -0.9726 0.1935
Firm B 112 0.1410 -0.8201 0.6962
Firm C 102 0.3110 -0.7672 0.7896
Firm D 52 0.2406 -0.7125 0.7608

(Source: Script 38 per-firm summary; reverse-anchor score is sign-flipped so that higher values indicate deeper into the reference left tail = less replication-dominated relative to the non-Big-4 reference.)

The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A firms as less replication-dominated. The K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Score 1 and Score 3) place Firm C at the least-replication-dominated end of Big-4; the reverse-anchor cosine percentile (Score 2) ranks Firm D fractionally above Firm C. This residual within-Big-4-non-A disagreement is a design feature of the reverse-anchor metric: Score 2 measures only the marginal cosine percentile under the non-Big-4 reference, so a firm with a slightly higher cosine but a markedly different dHash distribution (Firm D vs Firm C) can score higher on Score 2 while scoring lower on Scores 1 and 3, both of which use both descriptors.

Table XI. Per-signature Cohen \kappa (binary collapse, replication-dominated vs less-replication-dominated), n = 150{,}442 Big-4 signatures.

Pair Cohen \kappa
deployed binary high-confidence box rule (cos > 0.95 AND dHash \leq 5) vs per-CPA K=3 hard label 0.662
deployed binary high-confidence box rule vs per-signature K=3 hard label 0.559
Per-CPA K=3 hard label vs per-signature K=3 hard label 0.870

(Source: Script 39.) Per-signature K=3 components (n = 150{,}442) sorted by ascending cosine: (0.928, 9.75, 0.146) / (0.963, 6.04, 0.582) / (0.989, 1.27, 0.272), an absolute cosine drift of 0.018 in C1 and 0.006 in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos > 0.95 AND dHash \leq 5); the five-way classifier's moderate-confidence band (5 < \text{dHash} \leq 15) retains its prior calibration and capture-rate evidence (supplementary materials; cross-referenced in §IV-J).

G. Leave-One-Firm-Out Reproducibility

This section reports the firm-level cross-validation evidence motivating §III-L's "K=3 descriptive, not operational" framing.

Table XII. K=2 leave-one-firm-out across the four Big-4 folds.

Held-out firm n_{\text{train}} n_{\text{held}} Fold rule (cos cut, dHash cut) Held-out classified as templated by fold rule
Firm A 266 171 cos > 0.9380 AND dHash \leq 8.79 171 / 171 = 100.00\% (95\% Wilson [97.80\%, 100.00\%])
Firm B 325 112 cos > 0.9744 AND dHash \leq 3.98 0 / 112 = 0\% (95\% Wilson [0\%, 3.32\%])
Firm C 335 102 cos > 0.9752 AND dHash \leq 3.75 0 / 102 = 0\% (95\% Wilson [0\%, 3.63\%])
Firm D 385 52 cos > 0.9756 AND dHash \leq 3.74 0 / 52 = 0\% (95\% Wilson [0\%, 6.88\%])

(Source: Script 36.) Across-fold cosine crossing: pairwise range [0.9380, 0.9756], range = 0.0376; max absolute deviation from the across-fold mean is 0.028. This exceeds the report's 0.005 across-fold stability tolerance by 5.6\times and is much larger than the full-Big-4 bootstrap CI half-width of 0.0015. Together with the all-or-nothing held-out classification pattern (Firm A held out \Rightarrow all held-out CPAs templated; any non-Firm-A firm held out \Rightarrow none templated), this indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.

Table XIII. K=3 leave-one-firm-out: C1 component shape and held-out membership.

Held-out firm C1 cos (fit) C1 dHash (fit) C1 weight (fit) Held-out C1 hard-label rate Full-Big-4 baseline C1% Absolute difference
Full-Big-4 baseline 0.9457 9.17 0.143
Firm A held out 0.9425 10.13 0.145 4.68\% 0.00\% 4.68 pp
Firm B held out 0.9441 9.16 0.127 7.14\% 8.93\% 1.76 pp
Firm C held out 0.9504 8.41 0.126 36.27\% 23.53\% 12.77 pp
Firm D held out 0.9439 9.29 0.120 17.31\% 11.54\% 5.81 pp

(Source: Script 37; screening label P2_PARTIAL.) Component shape is reproducible across folds: max deviation of C1 cosine = 0.005, C1 dHash = 0.96, C1 weight = 0.012. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is 12.77 pp at the Firm C held-out fold, exceeding the report's 5 pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-I, §III-L).

H. Pixel-Identity Positive-Anchor Miss Rate

This section reports the only conservative hard-positive subset analysis available in the corpus: the positive-anchor miss rate against n = 262 Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-M item 4).

Table XIV. Positive-anchor miss rate, n = 262 Big-4 byte-identical signatures.

Classifier Misclassified as less-replication-dominated Miss rate Wilson 95% CI
deployed binary high-confidence box rule (cos > 0.95 AND dHash \leq 5) 0 / 262 0\% [0\%, 1.45\%]
K=3 per-CPA hard label (C3 = high-cos / low-dHash; descriptive) 0 / 262 0\% [0\%, 1.45\%]
Reverse-anchor (prevalence-calibrated cut) 0 / 262 0\% [0\%, 1.45\%]

(Source: Script 40.) Per-firm breakdown of the byte-identical subset: Firm A 145; Firm B 8; Firm C 107; Firm D 2. All three candidate scores correctly assign every byte-identical signature to the replicated class.

We caution that for the deployed box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine \approx 1 and dHash \approx 0, well inside the rule's high-confidence region). The reverse-anchor cut is chosen by prevalence calibration against the box rule's overall replicated rate of 49.58\% across Big-4 signatures; this is a documented limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.

I. Inter-CPA Pair-Level Coincidence Rate

The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It is the per-pair rate at which two signatures from different CPAs satisfy the deployed rule. We do not label it as a False Acceptance Rate because (a) FAR has a biometric-verification meaning that requires ground-truth negative labels, and (b) the inter-CPA negative-anchor assumption is partially violated by within-firm cross-CPA template-like collision structures (§III-J.1 cross-firm hit matrix).

A corpus-wide spike on \sim 50{,}000 inter-CPA pairs gives a per-comparison rate of 0.0005 (Wilson 95% CI [0.0003, 0.0007]) at the cosine cut 0.95. The Big-4-scope spike at higher sample size (5 \times 10^5 inter-CPA pairs) replicates this number, adds the structural dimension (dHash), and adds joint-rule rates; the §III-I.1 numbers are referenced rather than duplicated here, and the consolidated ICCR calibration appears in §IV-M Tables XXIXXVI.

J. Five-Way Per-Signature + Document-Level Classification Output

This section reports the five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. See §III-H.1 for the five-way category definitions and the cosine and dHash cuts; calibration is in §III-I.

Table XV. Five-way per-signature category counts, Big-4 sub-corpus, n = 150{,}442 classified.

Category Long name n signatures % of classified
HC High-confidence replication candidate 74,593 49.58%
MC Moderate-confidence advisory flag 39,817 26.47%
HSC High style-consistency flag 314 0.21%
UN Uncertain 35,480 23.58%
LH Low replication-similarity 238 0.16%

(Source: Script 42; 11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded. The 150{,}442 vs 150{,}453 distinction — descriptor-complete vs vector-complete — recurs across §IV: descriptor-complete analyses (§IV-D through §IV-J, all using accountant-level aggregates or per-signature category counts derived from the same 150,442-signature substrate) use n = 150{,}442; vector- or pair-recomputed analyses (§IV-M.2 Table XXI, §IV-M.3 Table XXII, §IV-M.5 Tables XXIVXXV; Scripts 46, 52, 44, 53) use n = 150{,}453 because their pair- or pool-level computations load all vector-complete signatures including those failing the descriptor-complete filter. See §III-G for the sample-size reconciliation.)

Per-firm five-way breakdown (% within firm).

Firm HC MC HSC UN LH total signatures
Firm A 81.70% 10.76% 0.05% 7.42% 0.07% 60,448
Firm B 34.56% 35.88% 0.29% 29.09% 0.18% 34,248
Firm C 23.75% 41.44% 0.38% 34.21% 0.22% 38,613
Firm D 24.51% 29.33% 0.22% 45.65% 0.29% 17,133

(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVII: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVII). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).

Document-level worst-case aggregation. Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the worst-case rule (HC > MC > HSC > UN > LH; §III-H.1), applied to the Big-4 sub-corpus.

Table XVI. Document-level worst-case category counts, Big-4 sub-corpus, n = 75{,}233 unique PDFs.

Category Long name n documents %
HC High-confidence replication candidate 46,857 62.28%
MC Moderate-confidence advisory flag 19,667 26.14%
HSC High style-consistency flag 167 0.22%
UN Uncertain 8,524 11.33%
LH Low replication-similarity 18 0.02%

(Source: Script 42 document-level table; 379 of 75,233 PDFs carried signatures from more than one Big-4 firm and are reported in the single-firm-PDF per-firm breakdown of the script CSV but pooled into the overall counts here.)

Per-firm document-level breakdown (single-firm PDFs only).

Firm HC MC HSC UN LH total docs
Firm A 27,600 1,857 7 758 4 30,226
Firm B 8,783 6,079 57 2,202 6 17,127
Firm C 7,281 8,660 77 3,099 5 19,122
Firm D 3,100 2,838 22 2,416 3 8,379

(Source: Script 42; mixed-firm PDFs n = 379 excluded from the per-firm rows but included in the overall counts above.)

The five-way moderate-confidence advisory band (cos > 0.95 AND 5 < \text{dHash} \leq 15) retains the threshold provenance of its prior calibration (supplementary materials), but §III-I.3 supersedes its claim strength: on the normative BCD baseline this band carries a \sim 0.175 per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is not separately re-characterised by Scripts 3840, which checked only the binary high-confidence rule (cos > 0.95 AND dHash \leq 5). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-M Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).

Table XVII. Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.

Firm n C1 (low-cos / high-dHash) C2 (central) C3 (high-cos / low-dHash) C1 % C3 %
Firm A 171 0 30 141 0.00\% 82.46\%
Firm B 112 10 102 0 8.93\% 0.00\%
Firm C 102 24 77 1 23.53\% 0.98\%
Firm D 52 6 45 1 11.54\% 1.92\%

(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-L / §IV-E). It is reported here as a complement to the five-way per-signature screening rule (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction 23.5\%); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).

Document-level worst-case aggregation outputs are reported in Table XVI above.

K. Full-Dataset Robustness (light scope)

This section reports the reproducibility cross-check at the full accountant scope (n = 686 CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + deployed operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the K=3 + deployed-rule convergence reproduces at the wider scope. The §III-H.1 five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band retains its prior calibration (supplementary materials; §IV-J).

Table XVIII. K=3 component comparison, Big-4 sub-corpus vs full dataset.

K=3 component Big-4 (n=437) cos / dHash / weight Full (n=686) cos / dHash / weight Drift Big-4 → Full
C1 (low-cos / high-dHash) 0.9457 / 9.17 / 0.143 0.9278 / 11.17 / 0.284 \lvert\Delta\rvert cos 0.018, dHash 1.99, wt 0.141
C2 (central) 0.9558 / 6.66 / 0.536 0.9535 / 6.99 / 0.512 \lvert\Delta\rvert cos 0.002, dHash 0.33, wt 0.024
C3 (high-cos / low-dHash) 0.9826 / 2.41 / 0.321 0.9826 / 2.40 / 0.205 \lvert\Delta\rvert cos 0.000, dHash 0.01, wt 0.117

(Source: Script 41; full-dataset \text{BIC}(K{=}3) = -792.31 vs Big-4 \text{BIC}(K{=}3) = -1111.93; BIC values are not directly comparable across different n and are reported only for completeness.)

Table XIX. Spearman rank correlation between K=3 P(C1) and deployed operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset.

Scope n CPAs Spearman \rho (P(C1) vs deployed less-replication-dominated rate) $p$-value
Big-4 (primary) 437 +0.9627 < 10^{-248}
Full dataset 686 +0.9558 < 10^{-300}
\lvert\rho_{\text{full}} - \rho_{\text{Big-4}}\rvert 0.0069

(Source: Script 41.)

Reading. The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the deployed box-rule less-replication-dominated rate are preserved at the full scope. Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight 0.117 as the full population includes more CPAs in the low-cos / high-dHash descriptor region (mid/small firms); C1 (low-cos / high-dHash) gains weight 0.141 and shifts to lower cosine and higher dHash (centre (0.928, 11.17) vs Big-4 (0.946, 9.17)) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + deployed-rule convergence is not a Big-4-specific artefact; we do not read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the primary methodology is restricted to Big-4 by design (§III-G item 4).

L. Ablation Study: Feature Backbone Comparison

To support the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. The comparison summary is reported in the supplementary materials (backbone-ablation table; not the same table as Table XIX in this section, which reports Big-4 vs full-dataset Spearman drift in §IV-K).

EfficientNet-B0 achieves the highest Cohen's d (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions. However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), i.e., a wider descriptor dispersion per signature. VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.

ResNet-50 provides the best overall balance: (1) Cohen's d of 0.669 is competitive with EfficientNet-B0's 0.707; (2) its tighter distributions yield more stable descriptor behaviour at the per-signature level; (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that Firm A replication-dominated signatures are least likely to produce low-similarity outlier pairs under this backbone; and (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.

M. Anchor-Based ICCR Calibration Results

This section consolidates the empirical results that support the §III-I anchor-based threshold calibration framework.

M.1 Composition decomposition (Scripts 39b39e)

Table XX. Within-firm and between-firm decomposition of the Big-4 accountant-level dip-test rejection.

Diagnostic Scope Statistic Implication
Within-firm signature-level cosine dip Big-4 (4 firms) p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\} 0/4 firms reject; cosine within-firm unimodal
Within-firm signature-level cosine dip non-Big-4 (10 firms \geq 500 sigs) p_{\text{cos}} \in [0.59, 0.99] 0/10 firms reject; cosine within-firm unimodal
Within-firm jittered-dHash dip (5 seeds, median) Big-4 (4 firms) p_{\text{med}} \in \{0.999, 0.996, 0.999, 0.9995\} 0/4 firms reject after integer-jitter; raw rejection was integer-tie artefact
Big-4 pooled dHash: 2×2 factorial firm-centred + jittered (5 seeds) p_{\text{med}} = 0.35, 0/5 seeds reject combined corrections eliminate rejection; multimodality is composition + integer artefact
Integer-histogram valley near \text{dHash} \approx 5 within each Big-4 firm none (0/4 firms) no within-firm dHash antimode at the deployed HC cutoff

(Source: Scripts 39b, 39c, 39d, 39e; bootstrap n_{\text{boot}} = 2000; jitter \sim \mathrm{U}[-0.5, +0.5].)

M.2 Anchor-based inter-CPA pair-level ICCR (Script 46)

Table XXI. Inter-CPA per-comparison ICCR by negative-anchor pool, n = 5 \times 10^5 pairs each.

Threshold BCD (primary) All-Big-4 (contamination comparison) BCD+non-Big-4
cos > 0.95 0.00026 0.00060 0.00014
dHash \leq 5 0.00037 0.00129 0.00034
Joint: cos > 0.95 AND dHash \leq 5 (any-pair) \mathbf{0.000010} 0.000140 0.000004

BCD joint Wilson 95% [0.000004, 0.000023] (5 of 5 \times 10^5 pairs); all-Big-4 joint [0.000111, 0.000177]; BCD+non-Big-4 joint [0.000001, 0.000015]. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by \sim 8\times, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-J.1). On the all-Big-4 sample, conditional ICCR(dHash \leq 5 | cos > 0.95) = 0.234; the all-Big-4 cos > 0.95 row is consistent with the corpus-wide spike of §IV-I (0.0005).

M.3 Pool-normalised per-signature ICCR (Script 52)

Table XXII. Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos > 0.95 AND dHash \leq 5) by negative-anchor pool; canonical retry-loop sampler; CPA-block bootstrap n_{\text{boot}} = 1000.

Baseline pool Per-signature HC ICCR CPA-bootstrap 95% CI
BCD (primary) \mathbf{0.0059} [0.0045, 0.0073]
All-Big-4 (contamination comparison) 0.1102 [0.0908, 0.1330]
BCD+non-Big-4 0.0083 [0.0066, 0.0099]

The BCD floor is an order of magnitude below the all-Big-4 figure, which is dominated by Firm A's within-firm coincidences. The per-signature rate increases with pool size (the deployed rule takes extrema over n_{\text{pool}} candidates); the per-document HC+MC band on the clean baseline (Table XXIII) does not show the same collapse, because the dHash \leq 15 band carries little inter-CPA specificity even among normative firms.

M.4 Document-level ICCR by alarm definition and pool (Script 52)

Table XXIII. Document-level inter-CPA ICCR by alarm definition and negative-anchor pool (dominant-firm document assignment).

Alarm definition BCD (primary) All-Big-4 BCD+non-Big-4
HC (dHash \leq 5) \mathbf{0.0117} 0.1797 0.0163
HC + MC (dHash \leq 15) 0.1753 0.3375 0.1467

Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B 0.162, Firm C 0.225, Firm D 0.089 (all-Big-4 pool: Firm A 0.620, Firm B 0.160, Firm C 0.163, Firm D 0.088). The HC band collapses by \sim 8\times when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash \leq 15 adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-I.3).

M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)

Table XXIV. Logistic regression of per-signature any-pair HC hit indicator on firm dummies and centred log pool size (Firm A reference).

Term Odds ratio (vs Firm A) Direction
Firm B 0.053 \sim 19\times lower odds than Firm A
Firm C 0.010 \sim 100\times lower odds than Firm A
Firm D 0.027 \sim 37\times lower odds than Firm A
log(pool size, centred) 4.01 \sim 4\times higher odds per log unit pool size

On the BCD baseline with Firm D as reference (Script 53; n = 89{,}994, hit rate 0.0059), the residual firm spread collapses to within \sim 3.5\times — odds ratios 1.73 (Firm B), 0.49 (Firm C), log-pool-size 3.29 — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-J.1).

Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$0.0358 while Firm A ranges $0.0541$0.5958. The firm gap survives within matched pool sizes.

Table XXV. Cross-firm hit matrix among Big-4 source signatures with any-pair HC hit; max-cosine partner firm (counts).

Source firm Firm A cand. Firm B Firm C Firm D non-Big-4 n hits
Firm A 14{,}447 95 44 19 17 14{,}622
Firm B 92 371 8 4 9 484
Firm C 16 7 149 5 1 178
Firm D 22 2 6 106 1 137

Same-pair joint hits (single candidate satisfying both cos > 0.95 AND dHash \leq 5) are within-firm at rates 99.96\% / 97.7\% / 98.2\% / 97.0\% for Firms A/B/C/D respectively. Restricting the candidate pool to the BCD baseline (Script 53) raises Firms B/C/D within-firm any-pair concentration to 97.2\% / 92.3\% / 89.2\% (same-pair 100\% / 100\% / 98.5\%): within-firm concentration is a universal Big-4 pattern, not a Firm-A peculiarity (§III-J.1).

M.6 Alert-rate sensitivity around deployed HC threshold (Script 46)

Table XXVI. Local-gradient / median-gradient ratio at deployed thresholds (descriptive plateau diagnostic).

Threshold Local / median gradient ratio Interpretation
cos = 0.95 (HC) \approx 25\times locally sensitive (not plateau-stable)
dHash = 5 (HC) \approx 3.8\times locally sensitive (not plateau-stable)
dHash = 15 (MC/HSC boundary) \approx 0.08 plateau-like (saturating tail)

Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC = 0.4958; per-document HC = 0.6228. Against the normative BCD floor (per-signature 0.0059; per-document HC 0.0117), the observed same-CPA-pool excess is 0.4899 (49.0 pp, \sim 84\times) per-signature and 0.6111 (61.1 pp, \sim 53\times) per-document; this excess is reported under §III-N caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.

V. Discussion

A. Non-Hand-Signing Detection as a Distinct Problem

Non-hand-signing differs from forgery in that the questioned signature is produced by its legitimate signer's own stored image rather than by an impostor. The detection problem is therefore framed around intra-signer image reproduction rather than inter-signer imitation. This framing has analytical consequences. The within-CPA signature distribution is the analytical population of interest; the cross-CPA inter-class distribution is a reference against which intra-CPA similarity is interpreted, not the population to be modelled. This contrasts with most prior offline signature verification work, which treats genuine-versus-forged as the central two-class problem.

B. Per-Signature Similarity is a Continuous Quality Spectrum; the Accountant-Level Multimodality is Composition-Driven

The Big-4 accountant-level distribution rejects unimodality on both marginals (§IV-D), but §III-K.4 shows this is fully attributable to between-firm location shifts and integer mass-point artefacts, not within-population structure: under joint firm-mean centring and integer-tie jitter the dip test no longer rejects (p_{\text{median}} = 0.35), and within each Big-4 firm the signature-level marginals are unimodal once integer ties are broken. The distributions therefore contain no within-population bimodal antimode to anchor a threshold, and per-signature similarity is best read as a continuous quality spectrum rather than two discrete populations. The K=2 / K=3 fits are descriptive firm-compositional partitions (§III-L), not latent mechanism classes.

C. Firm A as the Templated End of Big-4 (Out-of-Sample Target, Not Calibration Anchor)

Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-L), Firm A accounts for 0\% of C1 (low-cos / high-dHash position) and 82.5\% of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at 23.5\%. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.

We treat Firm A as the out-of-sample templated-end target: held out of the calibration negative anchor and scored against the normative Firms-B/C/D baseline (§III-I.0) rather than used as a calibration input. Three readings (§III-J.1) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide essentially never (0.0001, below the BCD floor of 0.0059) — so Firm A is unremarkable, indeed sub-baseline, cross-firm; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on 0.82 of Firm A signatures, \sim 139\times the clean floor, versus $\sim 40$59\times for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: 98.8\% at Firm A and, on the clean BCD pool, $89$97\% at Firms B/C/D, with same-pair concentration $97$100\% across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of 0.62 at Firm A versus $0.09$0.16 at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-I.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across \sim 50 distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a demonstration that the screening surfaces a known templated end at scale — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish.

D. K=2 / K=3 as Descriptive Firm-Compositional Partitions

Leave-one-firm-out cross-validation (§III-L) sharply separates the two fits. K=2 is unstable — its boundary is essentially a Firm-A-versus-others separator (holding Firm A out gives a markedly looser fold rule than holding any other firm out), direct evidence that it reflects firm composition, not mechanism. K=3, by contrast, has a reproducible component shape across folds (the C1 cosine mean varies by \leq 0.005), though hard-posterior membership remains composition-sensitive. We therefore read K=3 as a reproducible three-region descriptor partition reflecting how firm-compositional weight is distributed across the descriptor plane, not a three-mechanism latent structure, and use it only as an accountant-level descriptive summary — never as operational classifier output.

E. Three-Score Convergent Internal-Consistency

Three feature-derived per-CPA scores — the K=3 firm-compositional position, the reverse-anchor cosine percentile against a non-Big-4 reference, and the deployed box-rule rate — agree on the per-CPA ranking at Spearman \rho \geq 0.879, with agreement persisting at the signature level (Cohen \kappa = 0.87). Because the three are deterministic functions of the same descriptor pair, we report this as internal consistency, not external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The only material disagreement is a Firm C / Firm D swap among the non-Firm-A firms.

F. Anchor-Based Multi-Level Calibration

The deployed HC sub-rule's specificity-proxy behaviour is characterised at three units against the normative BCD baseline (§III-I), with the all-Big-4 pool shown only as a contamination comparison. At every unit the HC inter-CPA coincidence rate is an order of magnitude below the all-Big-4 figure — the gap being Firm A's extreme within-firm collision structure (§III-J.1) — confirming HC as a high-specificity-proxy operating point. Because the deployed rule takes pool extrema, the per-comparison rate understates the per-signature rate (the 1 - (1 - p_{\text{pair}})^{n_{\text{pool}}} pool effect). The HC threshold is locally sensitive rather than plateau-stable (§III-K.6), so it is a specificity-anchored operating choice, not a distributional antimode; operators can select alternative points by inverting the ICCR curves (§III-I.2). The dHash$\leq 15$ MC band stays a low-specificity advisory tier even on the clean baseline (§III-I.3).

G. Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor

The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve 0\% positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical \Rightarrow cosine \approx 1, dHash \approx 0, well inside the HC region). The complementary negative anchor is the §III-I.1 per-comparison ICCR on the normative BCD baseline (0.000010); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-I.0).

H. Limitations

Several limitations should be transparent. We group them into primary methodological limitations, secondary scope and validation caveats, documented design features, and engineering-level caveats of the pipeline.

Primary methodological limitations.

No signature-level ground truth; no true error rates reportable. The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-I are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-N).

Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline. The cross-firm hit matrix of §III-J.1 shows that under the deployed rule, within-firm collision concentration is 98.8\% at Firm A and $76.7$97.2\% at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-I.0); on this baseline the per-comparison HC rate falls from 0.00014 to 0.000010 and the per-signature HC rate from 0.1102 to 0.0059. A residual caveat survives even on the clean baseline: the BCD floor is an inter-CPA coincidence rate, not an intra-CPA genuine-hand-signing rate, so the observed-versus-floor excess (§III-J.2) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities.

Mechanism attribution for the firm-level heterogeneity is not identifiable from descriptor-only data. The observed firm-level contrast (Firm A's per-document HC$+$MC ICCR of 0.62 versus $0.09$0.16 at Firms B/C/D; within-firm collision concentration $77$99\% under the deployed any-pair rule; byte-identical evidence of §IV-H) is consistent with at least three non-mutually-exclusive firm-level mechanisms: (i) template, stamp, or e-signature production reuse; (ii) digitisation-pipeline homogeneity — shared scanners, common PDF generation infrastructure, identical compression and form-template settings — that systematically inflates image-descriptor similarity without signature replication; and (iii) signing-style or training homogeneity that produces correlated handwritten signatures within a firm. The descriptor pair (cosine, dHash) operates at the image-similarity level and is, by construction, indifferent to which mechanism generated a given near-identical pair. We therefore report the firm contrast as a methodological observation — the framework discriminates at firm-level resolution — rather than as a mechanism finding. The byte-identical Firm A signatures across \sim 50 distinct partners (§IV-H, §V-C) provide direct evidence for (i) at Firm A specifically, but do not exclude additive contribution from (ii) or (iii); the milder within-firm collision patterns at Firms B/C/D are individually consistent with all three mechanisms. Image-acquisition metadata (scanner identifiers, PDF generator fingerprints, compression-codec markers), partner-level intent records, or controlled hand-signed baselines would be needed to attribute the contrast across (i), (ii), and (iii).

Scope. The primary analyses are scoped to the Big-4 sub-corpus. We did not perform the full per-signature pool-normalised ICCR analysis at the full n = 686 scope; the §IV-K full-dataset Spearman re-run shows the K=3 + deployed box-rule rank-convergence is preserved at n = 686 but does not establish portability of the Big-4 operational ICCRs, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope.

Secondary scope and validation caveats.

Pixel-identity is a conservative subset. Byte-identical pairs are the easiest replicated cases, and for the deployed box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical \Rightarrow cosine \approx 1, dHash \approx 0, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).

Rule components not separately re-characterised by the present diagnostic battery. The five-way classifier's moderate-confidence advisory band (cos > 0.95 AND 5 < \text{dHash} \leq 15), the style-consistency band (\text{dHash} > 15), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-I.3 supersedes the MC band's claim strength — its \sim 0.175 per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-K.6) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.

Deployed-rate excess is not a presumed true-positive rate. The per-document gap between the observed deployed alert rate (HC: 0.62 on real same-CPA pools) and the normative inter-CPA proxy floor (HC: 0.012 on the BCD baseline) — \sim 60 pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-N shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal.

A1 pair-detectability stipulation. The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.

Documented design features.

K=3 hard-posterior membership is composition-sensitive. The K=3 hard-posterior membership for any single firm varies by up to 12.8 pp across LOOO folds. This is documented as a composition-sensitivity band rather than failure, but it means K=3 hard labels are not used as operational classifier output; they are reported only as accountant-level descriptive characterisation.

No partner-level mechanism attribution. The analysis reports population-level patterns; it does not perform partner-level mechanism attribution or report-level claims of intent. The signature-level outputs are signature-level quantities throughout. The within-firm cross-CPA collision concentration of §III-J.1 is consistent with template-like reuse but is not by itself diagnostic of deliberate sharing.

Engineering-level caveats of the pipeline.

Transferred ImageNet features. The ResNet-50 feature extractor uses pre-trained ImageNet weights without signature-domain fine-tuning. While our backbone-ablation study (§IV-L) and prior literature support the effectiveness of transferred ImageNet features for signature comparison, a signature-domain fine-tuned feature extractor could improve discriminative performance.

Red-stamp HSV preprocessing artifacts. The red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. Blended pixels are replaced with white, potentially creating small gaps in signature strokes that could reduce dHash similarity. This bias would push classifications toward false negatives rather than false positives.

Longitudinal scan / PDF / compression confounds. Scanning equipment, PDF generation software, and compression algorithms may have changed over the 20132023 study period, potentially affecting similarity measurements. While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.

Source-exemplar misattribution in max/min pair logic. The max-cosine / min-dHash detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case where one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as a stamping or e-signature template, the pair correctly identifies image reuse but misattributes non-hand-signed status to the source exemplar. This affects at most one source document per template variant per CPA and is not expected to be common.

Legal and regulatory interpretation. Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a jurisdiction-specific legal question. Our technical analysis can inform such determinations but cannot resolve them.

VI. Conclusion and Future Work

We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature screening rule with worst-case document-level aggregation (§III-H.1; calibrated in §III-I). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442150,453 signatures at signature level) as the primary analytical population. We emphasise that the operating thresholds are operator-tunable and that the system performs semi-automated triage — surfacing replication candidates from hundreds of thousands of signatures for human adjudication — rather than autonomous forensic classification; its central deliverable is the label-free calibration methodology by which an operator selects and characterises a screening operating point.

Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter (p_{\text{median}} = 0.35), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D, with Firm A held out as an out-of-sample target to avoid circularity): on this clean baseline the deployed HC rule yields per-comparison ICCR 0.000010, per-signature 0.0059, and per-document 0.012 — roughly an order of magnitude below the contaminated all-Big-4 figures (0.00014, 0.11, 0.18) — while the dHash$\leq 15$ moderate-confidence band, which retains a \sim 0.175 per-document coincidence rate even on the clean baseline, is repositioned as a low-specificity advisory tier; with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: against the clean BCD floor the deployed rule fires on each firm's own pools at \sim 139\times (Firm A) and $\sim 40$59\times (Firms B/C/D), while Firm A scored cross-firm against the clean 20132019 baseline coincides essentially never cross-firm (0.0001); two logistic regressions (full-Big-4 Firm-A-reference odds ratios $0.053$/$0.010$/0.027; BCD-only Firm-D-reference residual spread within \sim 3.5\times) show Firm A is the singular outlier and Firms B/C/D an internally homogeneous baseline — reported as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that within-firm collision concentration is a universal Big-4 pattern — 98.8\% at Firm A and $89$97\% at Firms B/C/D on the clean BCD pool (same-pair $97$100\% across all four firms) — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman \rho \geq 0.879, reported as internal consistency rather than external validation; (7) 0\% positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.

Future work falls in four directions. First, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. Second, the within-firm collision concentration documented in §III-J.1 (any-pair $76.7$98.8\% across Big-4; same-pair joint $97.0$99.96\%) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. Third, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate 0.82 vs $0.24$0.35, \sim 139\times vs $\sim 40$59\times the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across \sim 50 distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. Fourth, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-K.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.

Appendix A. Supplementary Diagnostic Detail

A.1. BD/McCrary Bin-Width Sensitivity (Signature Level)

The main text (Section III-K, Section IV-D Table VI) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a density-smoothness diagnostic rather than as a threshold estimator. This subsection documents the empirical basis for that framing by sweeping the bin width across four (variant, bin-width) panels: Firm A and full-sample, each in the cosine and \text{dHash}_\text{indep} direction.

Table A.I. BD/McCrary Bin-Width Sensitivity (two-sided \alpha = 0.05, |Z| > 1.96).

Variant n Bin width Best transition z_below z_above
Firm A cosine (sig-level) 60,448 0.003 0.9870 -2.81 +9.42
Firm A cosine (sig-level) 60,448 0.005 0.9850 -9.57 +19.07
Firm A cosine (sig-level) 60,448 0.010 0.9800 -54.64 +69.96
Firm A cosine (sig-level) 60,448 0.015 0.9750 -85.86 +106.17
Firm A dHash_indep (sig-level) 60,448 1 2.0 -4.69 +10.01
Firm A dHash_indep (sig-level) 60,448 2 no transition
Firm A dHash_indep (sig-level) 60,448 3 no transition
Full-sample cosine (sig-level) 168,740 0.003 0.9870 -3.21 +8.17
Full-sample cosine (sig-level) 168,740 0.005 0.9850 -8.80 +14.32
Full-sample cosine (sig-level) 168,740 0.010 0.9800 -29.69 +44.91
Full-sample cosine (sig-level) 168,740 0.015 0.9450 -11.35 +14.85
Full-sample dHash_indep (sig-l.) 168,740 1 2.0 -6.22 +4.89
Full-sample dHash_indep (sig-l.) 168,740 2 10.0 -7.35 +3.83
Full-sample dHash_indep (sig-l.) 168,740 3 9.0 -11.05 +45.39

Two patterns are visible in Table A.I. First, the procedure consistently identifies a "transition" under every bin width, but the location of that transition drifts monotonically with bin width (Firm A cosine: 0.987 → 0.985 → 0.980 → 0.975 as bin width grows from 0.003 to 0.015; full-sample dHash: 2 → 10 → 9 as the bin width grows from 1 to 3). The Z statistics also inflate superlinearly with the bin width (Firm A cosine |Z| rises from \sim 9 at bin 0.003 to \sim 106 at bin 0.015) because wider bins aggregate more mass per bin and therefore shrink the per-bin standard error on a very large sample. Both features are characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity.

Second, the candidate transitions all locate inside the high-similarity region (cosine \geq 0.975, dHash \leq 10) rather than at a between-mode boundary, which is the location pattern we would expect of a clean within-population antimode.

Taken together, Table A.I shows that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located inside the high-similarity descriptor region rather than between modes. This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator and reinforces the joint reading of Section IV-D that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold.

Raw per-bin Z sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials.

A.2. Diagnostic Summary

Section III-N positions the unsupervised-diagnostic strategy as a set of complementary checks, each addressing one specific failure mode of an unsupervised screening classifier with an explicitly disclosed untested assumption. Table A.II maps each diagnostic to the failure mode it addresses and to the untested assumption it relies on.

Table A.II. Diagnostics, failure mode addressed, and disclosed untested assumption.

Diagnostic Failure mode addressed Disclosed untested assumption
Composition decomposition (§III-K.4; Scripts 39b39e) Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); p_{\text{median}} = 0.35 under joint firm-mean centring + integer-tie jitter Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered (n_{\text{seeds}} = 5; Script 39e)
Per-comparison inter-CPA coincidence rate (§III-I.1; Script 46) Pair-level specificity proxy under a random-pair negative anchor, on the normative BCD baseline Inter-CPA pairs are negative (i.e., not template-related); addressed by anchoring on Firms B/C/D and holding Firm A out (§III-I.0)
Pool-normalised per-signature ICCR (§III-I.2; Script 52) Deployed-rule specificity proxy at per-signature unit, accounting for pool size, on the BCD baseline Same as above + that pool replacement preserves the negative-anchor property
Document-level ICCR (§III-I.3; Script 52) Operational alarm-rate proxy at per-document unit (HC and HC+MC), on the BCD baseline Same as above
Firm-heterogeneity logistic regression (§III-J.1; Script 44) Multiplicative effect of firm membership on per-signature rate, controlling for pool size Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check
Cross-firm hit matrix (§III-J.1; Scripts 44, 53) Concentration of inter-CPA collisions within source firm (all-Big-4 and BCD-pool variants) Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$99.96\% within-firm at all four firms versus $76.7$98.8\% under any-pair; §III-J.1); per-document per-firm assignment uses Script 52's dominant-firm rule (§IV-M.4)
Alert-rate sensitivity sweep (§III-K.6; Script 46) Local sensitivity of deployed rule to threshold perturbation Gradient comparison is descriptive, not a formal plateau test
Convergent score Spearman ranking (§III-M.1; Script 38) Internal-consistency of three feature-derived per-CPA scores Scores share underlying inputs and are not statistically independent
Pixel-identical conservative positive capture (§III-M.4; Script 40) Trivial sanity check on the conservative positive anchor Anchor is tautologically captured by any reasonable threshold
LOOO firm-level reproducibility (§III-M.3; Scripts 36, 37) Algorithmic stability of K=2 / K=3 partition across firm folds Stability is necessary but not sufficient for classification validity

Appendix B. Reproducibility Materials

The full table-to-script provenance mapping, script source code, and report artefacts for every numerical table and figure in this paper are provided in the supplementary materials. Scripts run deterministically under fixed random seeds documented there; reviewer reproduction should re-emit artefacts from the listed scripts rather than rely on any local path layout.

References

[1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067

[2] S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," Res. Account. Regul., vol. 25, no. 2, pp. 230235, 2013.

[3] J. Bromley et al., "Signature verification using a Siamese time delay neural network," in Proc. NeurIPS, 1993.

[4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.

[5] H.-H. Kao and C.-Y. Wen, "An offline signature verification and forgery detection method based on a single known sample and an explainable deep learning approach," Appl. Sci., vol. 10, no. 11, p. 3716, 2020.

[6] H. Li et al., "TransOSV: Offline signature verification with transformers," Pattern Recognit., vol. 145, p. 109882, 2024.

[7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," Mathematics, vol. 12, no. 17, p. 2757, 2024.

[8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024.

[9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021.

[10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," in Proc. Electronic Imaging, 2016.

[11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," Multimedia Tools Appl., 2024.

[12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," Inf. Process. Manage., p. 104086, 2025.

[13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," in Proc. CVPR, 2022.

[14] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," Pattern Recognit., vol. 70, pp. 163176, 2017.

[15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," IEEE Trans. Inf. Forensics Security, vol. 19, pp. 13421356, 2024.

[16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 17351745, 2020.

[17] H. Farid, "Image forgery detection," IEEE Signal Process. Mag., vol. 26, no. 2, pp. 1625, 2009.

[18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," Pattern Recognit., vol. 144, art. no. 109778, 2023.

[19] J. Luo et al., "A survey of perceptual hashing for multimedia," ACM Trans. Multimedia Comput. Commun. Appl., vol. 21, no. 7, 2025.

[20] D. Engin et al., "Offline signature verification on real-world documents," in Proc. CVPRW, 2020.

[21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," Expert Syst. Appl., vol. 189, art. 116136, 2022.

[22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification across multilingual datasets," Procedia Comput. Sci., vol. 270, pp. 40244033, 2025.

[23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in Proc. ECCV, 2014, pp. 584599.

[24] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv:2502.13923, 2025. [Online]. Available: https://arxiv.org/abs/2502.13923

[25] Ultralytics, "YOLO11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/models/yolo11/

[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016.

[27] N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013. [Online]. Available: https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html

[28] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman & Hall, 1986.

[29] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.

[30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600612, 2004.

[31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," The Accounting Review, vol. 88, no. 5, pp. 15111546, 2013.

[32] A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," Int. J. Auditing, vol. 18, no. 3, pp. 172192, 2014.

[33] W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," Contemp. Account. Res., vol. 26, no. 2, pp. 359391, 2009.

[34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. CVPR, 2016, pp. 779788.

[35] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 8, pp. 56255644, 2024.

[36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," Ann. Math. Statist., vol. 18, no. 1, pp. 5060, 1947.

[37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," Ann. Statist., vol. 13, no. 1, pp. 7084, 1985.

[38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," J. Account. Econ., vol. 24, no. 1, pp. 99126, 1997.

[39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," J. Econometrics, vol. 142, no. 2, pp. 698714, 2008.

[40] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, no. 1, pp. 138, 1977.

[41] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, no. 1, pp. 125, 1982.

[42] M. Stone, "Cross-validatory choice and assessment of statistical predictions," J. R. Statist. Soc. B, vol. 36, no. 2, pp. 111147, 1974.

[43] S. Geisser, "The predictive sample reuse method with applications," J. Amer. Statist. Assoc., vol. 70, no. 350, pp. 320328, 1975.

[44] A. Vehtari, A. Gelman, and J. Gabry, "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC," Stat. Comput., vol. 27, no. 5, pp. 14131432, 2017.

Declarations

Conflict of interest. The authors declare no conflict of interest with Firm A, Firm B, Firm C, or Firm D, or with any other entity referenced in this work.

Data availability. All audit reports analysed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, a publicly accessible regulatory disclosure platform. The CPA registry used to map signatures to certifying CPAs is publicly available. Signature images, model weights, and reproducibility scripts are available in the supplementary materials.