# Copy output and paste to new Claude Code session
```
### Manual Method
Tell Claude:
> "I'm continuing the PDF signature extraction project at `/Volumes/NV2/pdf_recognize/`. Please read `SESSION_INIT.md` and `PROJECT_DOCUMENTATION.md` to understand the current state. I want to [choose option from SESSION_INIT.md]."
## Quick Commands Reference
### View Documentation
```bash
less /Volumes/NV2/pdf_recognize/SESSION_INIT.md
less /Volumes/NV2/pdf_recognize/PROJECT_DOCUMENTATION.md
```
### Run Scripts
```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_signatures_hybrid.py # Main script
```
### Check Results
```bash
ls -lh /Volumes/NV2/PDF-Processing/signature-image-output/signatures/*.png
@@ -7,9 +7,9 @@ author: "[Authors removed for double-blind review]"
<!-- IEEE Access target: <= 250 words, single paragraph -->
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a very low inter-CPA coincidence rate (per-comparison ICCR $0.000010$; per-signature $0.006$; per-document $0.012$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A never coincides cross-firm yet fires on $82\%$ of its own ($\sim 139\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a low inter-CPA coincidence rate (per-comparison ICCR $0.000010$; per-signature $0.006$; per-document $0.012$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A never coincides cross-firm yet fires on $82\%$ of its own ($\sim 139\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.
<!-- Word count: 250 (v4.1 BCD-baseline reframe) -->
<!-- Word count: 250 -->
# I. Introduction
@@ -26,17 +26,17 @@ A methodological concern shapes the research design. Many prior similarity-based
Despite the significance of the problem for audit quality and regulatory oversight, to our knowledge no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation.
In this paper we present a fully automated, end-to-end pipeline for screening non-hand-signed CPA signatures in audit reports at scale, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour under explicit unsupervised assumptions. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) disclosure of each diagnostic's untested assumption (§III-M).
In this paper we present a fully automated, end-to-end pipeline for screening non-hand-signed CPA signatures in audit reports at scale, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour under explicit unsupervised assumptions. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-I); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-J.1); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-K.4); and (8) disclosure of each diagnostic's untested assumption (§III-N).
We are deliberate about what the system claims. The operating thresholds are *operator-tunable* rather than asserted as ground-truth decision boundaries: the contribution is not a fixed detector that pronounces a signature non-hand-signed, but (a) a dual-descriptor design that separates *style consistency* from *image reproduction*, and (b) a methodology for choosing and characterising a screening operating point in the absence of labels, so that an operator can set a specificity target and read off what each setting yields. Operationally the framework is a semi-automated triage step that surfaces a tractable set of replication candidates from hundreds of thousands of signatures for human adjudication; it does not adjudicate. The firm-level results and the byte-identical capture check are reported as *demonstrations that this triage works at scale*, not as forensic determinations.
A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-I.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.
A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-K.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.
In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019); §III-L.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000010$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0059$ (CPA-block bootstrap 95% $[0.0045, 0.0073]$), and per-document ICCR $= 0.012$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.175$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide.
In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019); §III-I.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000010$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0059$ (CPA-block bootstrap 95% $[0.0045, 0.0073]$), and per-document ICCR $= 0.012$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.175$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide.
With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR $0.0059$), the deployed rule fires on each firm's *actual* same-CPA pools far above the inter-CPA coincidence floor: Firm A at $0.82$ ($\sim 139\times$ floor), Firms B/C/D at $0.24$–$0.35$ ($\sim 40$–$59\times$). Firm A scored against the clean 2013–2019 baseline coincides essentially never ($0.0001$, below the clean-baseline floor itself) — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within $\sim 3.5\times$ (odds ratio $1.73$ for B, $0.49$ for C). Under the deployed any-pair rule, within-firm collision concentration is a *universal* Big-4 pattern — $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and the advisory moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels.
Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-J's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.
Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-L's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.
We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. The Big-4 sub-corpus comprises 437 CPAs and 150,442 signatures with both descriptors available.
@@ -56,7 +56,7 @@ The contributions of this paper are:
7.**K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.
8.**Annotation-free positive-anchor capture check and unsupervised-setting disclosure.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. Each supporting diagnostic in §III-M addresses one specific failure mode of an unsupervised screening classifier — composition artefacts, inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold sensitivity, or positive-anchor capture — with an explicitly disclosed untested assumption. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review.
8.**Annotation-free positive-anchor capture check and unsupervised-setting disclosure.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. Each supporting diagnostic in §III-N addresses one specific failure mode of an unsupervised screening classifier — composition artefacts, inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold sensitivity, or positive-anchor capture — with an explicitly disclosed untested assumption. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review.
The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity positive-anchor check, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work.
@@ -137,7 +137,7 @@ Under mild regularity conditions, White's quasi-MLE result [41] supports interpr
The present study uses these tools diagnostically: first to test whether the descriptor distribution supports a natural operating boundary, and then, when that support fails under composition decomposition, to motivate anchor-based ICCR calibration of a fixed deployed rule.
*Cross-validation in a small-cluster scope.*
Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the deployed five-way operational classifier (§III-H.1; calibrated separately in §III-L). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier.
Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-M differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the deployed five-way operational classifier (§III-H.1; calibrated separately in §III-I). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier.
<!--
REFERENCES for Related Work (full list in the References section):
[3] Bromley et al. 1993 — Siamese TDNN (NeurIPS)
@@ -179,7 +179,7 @@ REFERENCES for Related Work (full list in the References section):
We propose a six-stage pipeline for large-scale screening of non-hand-signed auditor signatures in scanned financial documents.
Fig. 1 illustrates the overall architecture.
The pipeline takes as input a corpus of PDF audit reports and produces five-way operational screening labels (§III-H.1) whose behaviour is characterised by pixel-identity positive-anchor capture checks and inter-CPA coincidence-rate calibration (§III-L).
The pipeline takes as input a corpus of PDF audit reports and produces five-way operational screening labels (§III-H.1) whose behaviour is characterised by pixel-identity positive-anchor capture checks and inter-CPA coincidence-rate calibration (§III-I).
Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.
@@ -284,13 +284,13 @@ Non-hand-signing is expected to yield extreme similarity under *both* descriptor
One working hypothesis is that some hand-signed repetitions may preserve coarse layout while varying in fine execution, producing relatively higher dHash similarity than cosine similarity within a same-CPA pair; the classifier does not require this hypothesis to hold for all CPAs, and the descriptor-level pattern is used only as input to the deployed rule, not as a within-CPA consistency claim.
Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. The supplementary materials contain the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.
We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. The supplementary materials contain the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-M), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.
Cosine similarity on L2-normalised deep embeddings and dHash both remain stable across the print-scan-rasterise cycle by design [14], [19], [21], [27]; together they constitute the dual descriptor used throughout the rest of this paper.
## G. Unit of Analysis and Scope
We analyse signatures at two **descriptor-summary** units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. §III-L additionally characterises the deployed rule's behaviour at three **operational reporting** units (per-comparison, per-signature, per-document), which are distinct from the descriptor-summary units defined here: the descriptor-summary units summarise input descriptors; the operational reporting units summarise rule outputs.
We analyse signatures at two **descriptor-summary** units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-L), of per-CPA internal-consistency analysis (§III-M), and of the leave-one-firm-out reproducibility check (§III-M). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. §III-I additionally characterises the deployed rule's behaviour at three **operational reporting** units (per-comparison, per-signature, per-document), which are distinct from the descriptor-summary units defined here: the descriptor-summary units summarise input descriptors; the operational reporting units summarise rule outputs.
We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.
@@ -300,15 +300,15 @@ We adopt one stipulation about same-CPA pair detectability:
A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.
**Scope: the Big-4 sub-corpus.** The primary analyses (§III-I, §III-J, §III-K, §III-L, and the corresponding §IV-D through §IV-J and §IV-M tables) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F; §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ — the threshold for accountant-level analyses — totalling 150,442 Big-4 signatures with both pre-computed descriptors available. Restricting the primary analyses to Big-4 is a methodological choice driven by four considerations:
**Scope: the Big-4 sub-corpus.** The primary analyses (§III-I through §III-M, and the corresponding §IV-D through §IV-J and §IV-M tables) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F; §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ — the threshold for accountant-level analyses — totalling 150,442 Big-4 signatures with both pre-computed descriptors available. Restricting the primary analyses to Big-4 is a methodological choice driven by four considerations:
1.**Restricted generalisability claim and Big-4 institutional comparability.** The primary claims are scoped to the Big-4 audit-report context, where the four firms share comparable institutional scale, document-production infrastructure, and CPA-volume regime; we do not assert that the same descriptive mixture structure or operational alert behaviour extends to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H.2's reverse-anchor internal-consistency check, (b) as a robustness comparison in §IV-K, and (c) as a corroborating-population check on the dHash discrete-mass-point artefact in §III-I.4. Generalisation beyond Big-4 is left as future work.
1.**Restricted generalisability claim and Big-4 institutional comparability.** The primary claims are scoped to the Big-4 audit-report context, where the four firms share comparable institutional scale, document-production infrastructure, and CPA-volume regime; we do not assert that the same descriptive mixture structure or operational alert behaviour extends to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H.2's reverse-anchor internal-consistency check, (b) as a robustness comparison in §IV-K, and (c) as a corroborating-population check on the dHash discrete-mass-point artefact in §III-K.4. Generalisation beyond Big-4 is left as future work.
2.**Within-firm cross-CPA collision structure analysis.** §III-L.4 reports a Big-4 cross-firm hit-matrix analysis that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work.
2.**Within-firm cross-CPA collision structure analysis.** §III-J.1 reports a Big-4 cross-firm hit-matrix analysis that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work.
3.**Firm A as templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-J K=3 component cross-tab; byte-level pair analysis referenced in §III-H.2). We retain Firm A within the Big-4 scope as a descriptive case study of the templated end rather than as the calibration anchor for thresholds.
3.**Firm A as templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-L K=3 component cross-tab; byte-level pair analysis referenced in §III-H.2). We retain Firm A within the Big-4 scope as a descriptive case study of the templated end rather than as the calibration anchor for thresholds.
4.**Leave-one-firm-out fold feasibility.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm.
4.**Leave-one-firm-out fold feasibility.** §III-M reports leave-one-firm-out (LOOO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm.
**Sample-size reconciliation.** Two Big-4 signature counts appear in this section and §IV: $n = 150{,}442$ for analyses using the pre-computed per-signature descriptors $\text{cos}_s$ (`max_similarity_to_same_accountant`) and $\text{dHash}_s$ (`min_dhash_independent`), and $n = 150{,}453$ for analyses recomputing pair-level metrics directly from the stored feature and dHash byte vectors (Scripts 40b, 43, 44). The $11$-signature difference reflects descriptor-completion status: $11$ signatures have feature vectors and dHash byte vectors stored but lack the pre-computed extrema. The $11$ signatures are negligible at population scale and do not affect any reported coincidence rate within $0.01$ percentage point. The CPA counts $468$ (all Big-4 CPAs with both vectors stored) and $437$ (Big-4 CPAs with $n_{\text{sig}} \geq 10$ for accountant-level stability) likewise reflect a single uniform exclusion rule rather than analysis-specific subsetting.
@@ -316,66 +316,134 @@ A1 is plausible for high-volume stamping or firm-level electronic signing workfl
### H.1. Deployed Operational Rule
Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA. The five labels below name regions of the descriptor space and are operational rule outputs, not validated ground-truth classes; the label names reflect the screening hypothesis associated with each region and are subject to the unsupervised-setting caveats of §III-M:
Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA. The five labels below name regions of the descriptor space and are operational rule outputs, not validated ground-truth classes; the label names reflect the screening hypothesis associated with each region and are subject to the unsupervised-setting caveats of §III-N:
1.**High-confidence replication candidate (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on image-similarity evidence consistent with replication; this is the highest-priority triage bin for human review, and mechanism attribution remains subject to §III-M.
2.**Moderate-confidence advisory flag (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level similarity is strong but structural similarity is below the high-confidence cutoff; §III-L.3 shows this band carries low inter-CPA specificity even on the normative baseline, so it is a low-specificity advisory bin (review-workload-expanding) rather than a confident replication flag.
1.**High-confidence replication candidate (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on image-similarity evidence consistent with replication; this is the highest-priority triage bin for human review, and mechanism attribution remains subject to §III-N.
2.**Moderate-confidence advisory flag (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level similarity is strong but structural similarity is below the high-confidence cutoff; §III-I.3 shows this band carries low inter-CPA specificity even on the normative baseline, so it is a low-specificity advisory bin (review-workload-expanding) rather than a confident replication flag.
3.**High style-consistency flag (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration; the descriptor position is operationally distinguished from HC/MC, but the underlying mechanism (within-CPA signing style, lossy image reproduction with structural drift, or a hybrid) is not resolved by descriptor data alone.
4.**Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$.
5.**Low replication-similarity (LH):** Cosine $\leq 0.837$. The name reflects the screening hypothesis that low maximum same-CPA cosine similarity is more consistent with hand-signing variation than with image replication; it is an operational low-priority bin, not a verified hand-signed classification, since cross-year handwriting drift, scanner-workflow change, or template variant rotation within a CPA's reports can also yield a low max-cosine within a same-CPA pool.
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (see supplementary materials). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L).
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (see supplementary materials). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-I).
The remainder of this section (§III-H.2) describes the reference populations used to calibrate and cross-check this rule. §III-I demonstrates that the descriptor distributions do not provide a within-population natural threshold; §III-J–§III-K develop the descriptive partition and internal-consistency cross-checks; §III-L develops the anchor-based threshold calibration; §III-M discloses the unsupervised-setting limits.
The remainder of this section (§III-H.2) describes the reference populations used to calibrate and cross-check this rule. §III-I then establishes the normative baseline and its inter-CPA coincidence floor; §III-J reads each firm, and Firm A in particular, as a deviation from that floor; §III-K shows that the descriptor distributions provide no within-population natural threshold; §III-L–§III-M develop the descriptive mixture partition and internal-consistency cross-checks; §III-N discloses the unsupervised-setting limits.
### H.2. Reference Populations
The supporting diagnostics use two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking. Neither population is the calibration anchor for the deployed threshold; both are descriptive references that inform the cross-checks in §III-K.
The supporting diagnostics use two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking. Neither population is the calibration anchor for the deployed threshold; both are descriptive references that inform the cross-checks in §III-M.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-L; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.
Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate.
Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-L), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-M's pixel-identity positive-anchor miss rate.
**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores used as a cross-check on the per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked.
**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-M's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores used as a cross-check on the per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked.
The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 low-cos / high-dHash component (cos $= 0.946$, dHash $= 9.17$; §III-J); compared to the Big-4 high-cos / low-dHash component (cos $= 0.983$, dHash $= 2.41$; §III-J) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end of the descriptor plane. This is a "deviation in the less-replication-dominated descriptor-position direction" measure, not a "deviation toward the templated descriptor-position" measure; the reference is the less-replication-dominated population.
The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 low-cos / high-dHash component (cos $= 0.946$, dHash $= 9.17$; §III-L); compared to the Big-4 high-cos / low-dHash component (cos $= 0.983$, dHash $= 2.41$; §III-L) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end of the descriptor plane. This is a "deviation in the less-replication-dominated descriptor-position direction" measure, not a "deviation toward the templated descriptor-position" measure; the reference is the less-replication-dominated population.
## I. Distributional Diagnostics: Why the Composition Path Does Not Yield a Natural Threshold
This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G and tests whether the distribution provides distributional support — in the form of within-population bimodality — for the deployed operational thresholds. We apply four diagnostic procedures in turn: a univariate unimodality test on each accountant-level marginal; a 2D Gaussian mixture fit (developed in §III-J); a density-smoothness diagnostic; and a composition decomposition that distinguishes within-population multimodality from between-firm location-shift artefacts. The four diagnostics jointly imply that the operational thresholds are *not* anchored by distributional bimodality: §III-L develops an anchor-based calibration framework that does not require this assumption.
## I. Normative Baseline and the Inter-CPA Coincidence Floor
**1. Hartigan dip test on each accountant-level marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds in the comparison scopes tested in Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A CPAs pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). Single-firm dip tests for Firms B, C, and D were not separately computed; the comparison scopes above sufficed to establish that no narrower-than-Big-4 *tested* scope at the accountant level rejected unimodality. The accountant-level Big-4 rejection is a descriptive observation; §III-I.4 below shows that the rejection is fully explained by between-firm location-shift effects rather than within-population bimodality.
We calibrate the operational classifier of §III-H.1 by first establishing a *normative baseline* — a population of independent CPAs in which the deployed rule should fire only by chance — and then measuring the rate at which it fires there. This inter-CPA coincidence floor is the reference against which every firm, Firm A included, is read in §III-J. Because the descriptor distributions contain no within-population bimodal antimode that could anchor a threshold directly (§III-K), it is this empirically measured floor, rather than a distributional cut, that gives the deployed thresholds an interpretable specificity meaning. Throughout we report **inter-CPA coincidence rates (ICCR)** rather than "False Acceptance Rates", for the reasons given in §III-I.0. This section develops the calibration method and reports the headline floor at each unit of analysis; the full result tables are consolidated in §IV-M (Tables XXI–XXIII).
**2. K=2 / K=3 Gaussian mixture fits (descriptive partition).** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42; Script 34) recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ over $n_{\text{boot}} = 500$ resamples. The 3-component fit (§III-J) is BIC-preferred — using the convention that lower BIC is preferred, $\text{BIC}(K{=}3) - \text{BIC}(K{=}2) = -3.48$ (Script 36). The $\Delta$BIC magnitude is small in absolute terms; we do not treat $\Delta\text{BIC} = 3.5$ alone as decisive evidence for K=3 as a population mixture. Following §III-I.4 we treat both K=2 and K=3 fits as *descriptive partitions* of the joint Big-4 distribution that reflect firm-composition structure (Firm A vs others; §III-J) rather than as inferential evidence for two or three latent population modes.
### I.0. Calibration methodology
**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38, 39] as a *density-smoothness diagnostic* (rather than as a threshold estimator) on each accountant-level marginal axis (cosine in bins of $0.002$, dHash in integer bins). At the Big-4 scope, the diagnostic identifies no significant transition on either marginal at $\alpha = 0.05$ (Script 34). Outside Big-4, the diagnostic does flag dHash transitions in some subsets (Script 32: `big4_non_A` dHash transition at $10.8$; `all_non_A` dHash transition at $6.6$; pre-2018 and post-2020 time-stratified variants also exhibit one or more dHash transitions), but no cosine transition is identified in any subset. The Big-4-scope null on both axes is consistent with §III-I.4 below: under the composition decomposition the Big-4 marginals are unimodal once between-firm and integer-tie confounds are removed, so a local-discontinuity test correctly fails to flag a within-population transition.
**Choice of negative-anchor pool.** A negative anchor must approximate a populationin which the rule should *not* fire — independent CPAs whose signatures coincide only by chance. §III-J.1 shows that under the deployed rule $98.8\%$ of Firm A's inter-CPA collisions fall on other Firm-A CPAs, and byte-level evidence (§IV-H, supplementary materials) confirms image-level reuse across $\sim 50$ Firm-A partners. Including Firm A in the negative-anchor pool would therefore load the "coincidence" rate with structured within-firm collisions rather than chance coincidence — a circularity, since that collision structure is the phenomenon the rule targets. We adopt **Firms B/C/D (BCD) as the normative negative-anchor baseline** and report the all-Big-4 (ABCD) pool only as a contamination-comparison scope; Firm A enters as an **out-of-sample target** (§III-J.1), not as a calibration input. A still-broader baseline adding the eligible non-Big-4 firms (BCD+non-Big-4) is reported as a robustness scope.
**4. Composition decomposition (Scripts 39b–39e).** §III-I.1 establishes that the accountant-level marginals reject unimodality at the Big-4 sub-corpus. The remaining question is whether the rejection reflects (a) genuine within-population bimodality at the signature or accountant level, (b) between-firm location-shift artefacts (firms with different mean descriptor positions pool to a multi-peaked distribution), or (c) integer mass-point artefacts on the integer-valued dHash axis (the dHash dip statistic is sensitive to spikes at integer values). We apply four diagnostics that decompose the rejection into these candidate sources:
We further restrict the calibration baseline temporally to **fiscal years 2013–2019**. Following the post-2020 acceleration of digital document workflows, Taiwan audit firms increasingly adopted electronic-signature and stamping systems for report assembly, with firm-specific timing; the pre-2020 BCD period is therefore the construct-clean hand-signing baseline, while the post-2020 period mixes genuine hand-signing with legitimate e-signing and is not a clean negative anchor. The data corroborate this: the BCD per-comparison HC floor rises from $0.000010$ (2013–2019) to $0.000036$ (2020–2023), and the per-signature floor from $0.0059$ to $0.0105$ — the gradual, non-stepped rise being consistent with staggered per-firm adoption. We therefore calibrate on BCD 2013–2019 and report BCD 2020–2023 only as a robustness scope. Firm A is scored across its full 2013–2023 record against this clean threshold.
*Within-firm signature-level dip (Scripts 39b, 39c).* Repeating the dip test at the signature level inside each individual Big-4 firm (Script 39b) and inside each individual non-Big-4 firm with $\geq 500$ signatures (Script 39c) yields a consistent picture. The cosine marginal *fails* to reject unimodality in every single firm tested — all four Big-4 firms ($p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ for Firms A through D; Script 39b) and ten non-Big-4 firms with $\geq 500$ signatures ($p_{\text{cos}} \in [0.59, 0.99]$; Script 39c). The raw dHash marginal *does* reject unimodality in every firm tested ($p < 5 \times 10^{-4}$ in all $14$ firms), but the raw dHash values are integer-valued in $\{0, 1, \ldots, 64\}$, leaving open the possibility of an integer-tie artefact.
**Calibration role.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. Because a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists, §III-K.4), §III-I.1 below characterises the cosine and structural ($\text{dHash} \leq 5$) thresholds' specificity-proxy behaviour at the inter-CPA pair level on the BCD baseline. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for them. The cosine LH/UN crossover $\text{cos} = 0.837$ is a corpus-wide descriptor-space landmark (intra- vs inter-CPA cosine KDE crossover, §IV-C) robust to baseline choice — it moves by at most $0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes ($0.8367$, $0.8489$, $0.8302$) — so we retain the corpus-wide value.
*Integer-jitter robustness (Scripts 39d, 39e).* Adding independent uniform jitter $\sim \mathrm{U}[-0.5, +0.5]$ to break exact dHash ties and re-running the dip test on the perturbed signature cloud (5 seeds, $n_{\text{boot}} = 2000$; Script 39d) eliminates the dHash within-firm rejection in every Big-4 firm tested (Firm A jittered $p_{\text{median}} = 0.999$; B $0.996$; C $0.999$; D $0.9995$; $0$/$5$ seeds reject at $\alpha = 0.05$ in any firm). The pooled-Big-4 dHash dip *does* survive jitter alone ($p_{\text{median}} = 0$, $5$/$5$ seeds reject), but Firm A's mean dHash ($2.73$) is substantially below Firms B/C/D's ($6.46$, $7.39$, $7.21$) — a between-firm location shift. Script 39e applies a 2 \times 2 factorial correction (firm-mean centring $\times$ integer jitter) on the Big-4 pooled dHash:
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each answering a different operational question:
| Condition | Firm-mean centred | Integer jitter | Median dip $p$ | Reject at $\alpha = 0.05$ |
- *Per comparison.* For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos $>$ cos\_threshold and / or dHash $\leq$ dHash\_threshold)? This is the conventional pairwise calibration unit in biometric verification, reported marginally and jointly (§III-I.1).
- *Per signature pool.* For a source signature $s$ with same-CPA pool of size $n_{\text{pool}}(s)$, what is the probability that the deployed rule fires *under the counterfactual* of replacing the source's same-CPA pool with $n_{\text{pool}}(s)$ random non-same-CPA candidates from the baseline pool? The deployed rule takes max-cosine and min-dHash over the pool, so its effective coincidence rate is $\approx 1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit (§III-I.2).
- *Per document.* For an audit report aggregated via the worst-case rule, what fraction of documents have at least one signature whose pool-normalised rule fires under the same inter-CPA candidate-replacement counterfactual? This is the operational alarm-rate unit (§III-I.3).
**Any-pair vs same-pair semantics.** The deployed rule uses independent extrema: a signature satisfies the HC rule if $\max_{\text{pool}} \text{cos} > 0.95$ AND $\min_{\text{pool}} \text{dHash} \leq 5$, *not* if a single candidate satisfies both. We call this the **any-pair** rule, and report the stricter **same-pair** rule (one candidate satisfying both inequalities) as an alternative where useful (§III-I.2, §III-J.1).
**Terminological note on "FAR".** We adopt **inter-CPA coincidence rate (ICCR)** and *do not* use "FAR", for two reasons: (a) FAR has a specific biometric-verification meaning requiring ground-truth negative labels, which the corpus does not provide at the signature level; (b) the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures, which is precisely why we move the anchor to the BCD baseline. Even on the BCD baseline, ICCR is a *specificity proxy* under an explicitly disclosed assumption, not a true biometric FAR.
We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from the baseline pool, computing for each the cosine similarity and the Hamming distance between dHash byte vectors, with Wilson 95% confidence intervals (Script 46; Table XXI, §IV-M).
On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule (cos $> 0.95$ AND dHash $\leq 5$, any-pair) is $\mathbf{0.000010}$ $[0.000004, 0.000023]$ — roughly $8\times$ lower than the all-Big-4 rate ($0.000140$) and lower still when the non-Big-4 firms are added ($0.000004$). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-J.1): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms ($5$ of $5 \times 10^5$ pairs on the BCD pool), so we treat the per-comparison joint rate as an order-of-magnitude specificity proxy and let the well-powered per-signature and per-document units (§III-I.2, §III-I.3) carry the primary calibration weight. The all-Big-4 cos $> 0.95$ marginal ($0.00060$) is consistent with the corpus-wide per-comparison rate of §IV-I, and on the all-Big-4 sample the conditional rate ICCR(dHash $\leq 5\mid$ cos $> 0.95$) $= 0.234$ — the structural dimension adds substantial per-comparison specificity beyond the cosine gate.
The per-comparison rate does *not* directly translate to deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$ (§III-I.2).
For each source signature $s$ we simulate one realisation of an inter-CPA candidate pool of the same size $n_{\text{pool}}(s)$, drawn uniformly from non-same-CPA signatures *in the baseline pool*, compute the deployed extrema and rule indicator, and aggregate (Script 52, canonical retry-loop sampler matching Scripts 43/45; CPA-block bootstrap 95% CIs on $n_{\text{boot}} = 1000$ replicates; Table XXII, §IV-M).
On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is $\mathbf{0.0059}$ $[0.0045, 0.0073]$ — an order of magnitude below the all-Big-4 figure of $0.1102$, which is dominated by Firm A. Once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is $\approx 0.59\%$. **This is the specificity-proxy floor against which the deployed HC rule operates.** The rate rises with pool size (the rule takes extrema over $n_{\text{pool}}$ candidates), consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence; the within-firm violation of that independence (§III-J.1) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash $\leq 3$) by inverting the ICCR curve, under the unsupervised-setting caveats of §III-N.
Each document is classified by the worst-case rule over its constituent signatures (§III-H.1) under the same inter-CPA candidate-pool counterfactual as §III-I.2 (Script 52, dominant-firm document assignment; Table XXIII, §IV-M).
**The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation.** On the BCD baseline the per-document HC (dHash $\leq 5$) rate is $\mathbf{0.0117}$ ($\sim 8\times$ below the all-Big-4 $0.1797$): a clean inter-CPA baseline almost never produces an HC document, confirming HC as a high-specificity operating point. The HC+MC (dHash $\leq 15$) rate, by contrast, stays high on the clean baseline — $0.1753$ per document — and does *not* fall when Firm A is removed. Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$, slightly *higher* than under the all-Big-4 pool (B $0.160$, C $0.163$, D $0.088$), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash $\leq 15$ scale. **We therefore treat the HC sub-rule (dHash $\leq 5$) as the specificity-anchored operating point and reposition the MC band ($5 < \text{dHash} \leq 15$) as a low-specificity advisory tier, not a confident non-hand-signed screening label**: roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement. The positioning of the operational system as a **screening framework with human-in-the-loop review**, not an autonomous forensic classifier, follows directly (§III-N).
## J. Firm-Level Deviation from the Baseline
With the calibration anchored on the clean BCD floor (§III-I.2), every firm — and in particular Firm A — can be read as a deviation from that floor. Firm A is scored as a true out-of-sample target against the baseline.
### J.1. Firm A as an out-of-sample target; firm heterogeneity
Three complementary readings establish that Firm A is the extreme case while keeping the inferential limits explicit.
**(i) Observed deployed rate versus the clean floor.** The deployed HC rule fires on each firm's *actual* same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of $0.0059$ (§III-I.2):
| Firm | Observed per-signature HC rate | Multiple of BCD floor |
|---|---|---|
| Firm A | $0.817$ | $\sim 139\times$ |
| Firm B | $0.346$ | $\sim 59\times$ |
| Firm C | $0.238$ | $\sim 40\times$ |
| Firm D | $0.245$ | $\sim 42\times$ |
All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at $\sim 139\times$, roughly $2.4$–$3.4\times$ the other Big-4 firms in absolute rate.
**(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm.** Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is $0.0001$ — below even the BCD-internal floor of $0.0059$, i.e. Firm A's signatures essentially never resemble genuine 2013–2019 hand-signing from other firms. The entire elevation in Firm A's observed rate ($0.817$) therefore arises from matches against *other Firm-A* signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness.
**(iii) Firm-effect regressions: Firm A singular, baseline homogeneous.** Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D), with log-pool-size odds ratio $4.01$ — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within a factor of $\sim 3.5$: odds ratios $1.73$ (B), $0.49$ (C), log-pool-size odds ratio $3.29$. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.)
**Cross-firm hit matrix: within-firm concentration is a universal Big-4 pattern.** Under the deployed any-pair rule, inter-CPA collisions concentrate within the source firm at every Big-4 firm. On the full Big-4 candidate pool, within-firm concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (same-pair $97.0$–$99.96\%$; Table XXV). Restricting the candidate pool to the BCD baseline (Script 53) *raises* the within-firm concentration for B/C/D to $89.2$–$97.2\%$ any-pair (Firm B $97.2\%$, Firm C $92.3\%$, Firm D $89.2\%$) and $98.5$–$100\%$ same-pair — higher than on the full pool, because there some B/C/D collisions landed on Firm A's generically copy-like signatures; removing Firm A leaves each firm's collisions concentrated within itself. Within-firm collision concentration is therefore a universal Big-4 structural pattern, not a Firm-A peculiarity: Firm A is extreme in the *rate* at which the rule fires (reading (i)), but all four firms exhibit the same within-firm collision signature.
### J.2. Observed deployed alert rate on actual same-CPA pools
Reading (i) of §III-J.1 reported each firm's observed HC rate; reading (ii) used the inter-CPA candidate-replacement counterfactual. Here we report the pooled-Big-4 **observed deployed alert rate** — the rate at which the rule fires on each source's actual same-CPA pool across the real corpus — and its excess over the clean floor. For Big-4 it fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate). Read against the **normative BCD floor** rather than the contaminated all-Big-4 rate, the observed-deployed excess is large: per signature $0.4958$ vs $0.0059$ ($49.0$ pp, $\sim 84\times$); per document (HC) $0.6228$ vs $0.0117$ ($61.1$ pp, $\sim 53\times$). Anchoring the floor on the clean BCD baseline sharpens this contrast, since the all-Big-4 floor would understate it by absorbing Firm A's reuse.
**Interpretation and inferential limits.** The firm multiples of §III-J.1 and the observed-deployed excess above are *not* true-positive rates: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above that floor. We therefore read the excess as an *observed same-CPA-pool excess over the normative inter-CPA floor* — a quantity far exceeding what random inter-CPA candidate replacement among normative firms would produce — whose mechanism is not identifiable from descriptor-only data (§III-N); we do not attribute it to within-CPA handwriting repeatability or to image replication without further evidence. Likewise, the within-firm collision concentration is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: common form templates, shared scanning workflows, and report-generation infrastructure could all produce visually similar signature crops across CPAs within a firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (§IV-H, supplementary materials) is direct evidence of image-level reuse among Firm A signatures; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We report "inter-CPA collision concentration is within-firm" as a descriptive observation about deployed-rule behaviour and refrain from inferring deliberate or systematic template sharing.
## K. Why the Descriptor Distribution Provides No Threshold
The baseline calibration of §III-I is necessary because the descriptor distribution itself supplies no within-population threshold. This section establishes that negative result: the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs contains no within-population bimodal antimode that could anchor the deployed operational thresholds. We apply four diagnostics, decompose the one apparent rejection into its true source, and confirm that the deployed thresholds sit on a steep — not plateau-like — region of the alert-rate surface.
**1. Hartigan dip test on each accountant-level marginal.** The dip test [37] on each marginal $\{\overline{\text{cos}}_a\}$ and $\{\overline{\text{dHash}}_a\}$ (bootstrap $p$, $n_{\text{boot}} = 2000$) rejects unimodality at the Big-4 sub-corpus ($p < 5 \times 10^{-4}$ on both, Script 34). The rejection does *not* hold in narrower tested scopes (Script 32): Firm A alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$), Firms B+C+D pooled ($0.998$, $0.906$), and all non-Firm-A CPAs pooled ($0.998$, $0.907$). The Big-4 rejection is thus a descriptive observation that item 4 below attributes entirely to between-firm composition rather than within-population bimodality.
**2. K=2 / K=3 Gaussian mixture fits (descriptive partition).** A 2-component 2D GMM (Script 34) recovers components at $(0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$, with marginal crossings $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. The 3-component fit is mildly BIC-preferred ($\Delta\text{BIC} = -3.48$, not decisive). Following item 4 we treat both fits as *descriptive partitions* reflecting firm-composition structure (Firm A vs others), not evidence for latent population modes; they are developed in §III-L.
**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** Applied as a density-smoothness diagnostic [38, 39] on each marginal, the test flags no significant transition at the Big-4 scope on either axis ($\alpha = 0.05$, Script 34). It does flag dHash transitions in some out-of-scope subsets (Script 32) but no cosine transition in any subset — consistent with item 4: once between-firm and integer-tie confounds are removed, the Big-4 marginals are unimodal, so a local-discontinuity test correctly finds no within-population transition.
**4. Composition decomposition (Scripts 39b–39e).** The Big-4 accountant-level rejection (item 1) could reflect (a) genuine within-population bimodality, (b) between-firm location-shift artefacts, or (c) integer mass-point artefacts on the integer-valued dHash axis. Repeating the dip test at the signature level *inside* each firm shows the cosine marginal fails to reject unimodality in every firm tested — all four Big-4 firms ($p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$; Script 39b) and ten non-Big-4 firms with $\geq 500$ signatures ($p_{\text{cos}} \in [0.59, 0.99]$; Script 39c). The raw dHash marginal rejects in every firm, but the values are integer-valued; adding uniform jitter $\sim \mathrm{U}[-0.5, +0.5]$ to break exact ties (5 seeds; Script 39d) eliminates the within-firm dHash rejection in every Big-4 firm (jittered $p_{\text{median}} \geq 0.996$, $0/5$ seeds reject). The pooled-Big-4 dHash dip survives jitter alone, but Firm A's mean dHash ($2.73$) sits well below Firms B/C/D's ($6.46$, $7.39$, $7.21$) — a between-firm location shift. Script 39e applies a $2 \times 2$ factorial correction on the pooled dHash:
| Condition | Firm-mean centred | Integer jitter | Median dip $p$ | Reject at $\alpha = 0.05$ |
Removing *both* the between-firm location shift *and* the integer mass points eliminates the Big-4 dHash rejection. The Big-4 pooled dHash multimodality is therefore fully attributable to firm-composition contrast (primarily Firm A's mean $\text{dHash} = 2.73$ versus Firms B/C/D $\approx 6.5$–$7.4$) and integer-density artefacts, with no residual continuous within-firm bimodality.
Removing *both* the between-firm location shift *and* the integer mass points eliminates the rejection ($p_{\text{median}} = 0.35$): the Big-4 pooled dHash multimodality is fully attributable to firm-composition contrast and integer-density artefacts, with no residual continuous within-firm bimodality. Consistently, within each Big-4 firm the dHash histogram on bins $0$–$20$ exhibits no strict local minimum, and the pooled histogram shows only a shallow valley at $\text{dHash} = 4$ (relative depth $2.1\%$) — no antimode near the deployed $\text{dHash} = 5$ boundary in any firm.
*Cosine analogue.* The cosine axis follows the same pattern by construction: the within-firm signature-level cosine dip tests above (Scripts 39b, 39c) fail to reject in every Big-4 firm and in every eligible non-Big-4 firm, so any pooled cosine multimodality must arise from between-firm composition rather than from within-population bimodality.
**5. Conclusion.** The descriptor distributions contain no within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixtures (§III-L) are therefore descriptive firm-compositional partitions, not evidence for population modes, and the anchor-based calibration of §III-I does not require a distributional antimode.
*Integer-histogram valleys (Script 39d).* A genuine within-firm dHash antimode would appear as a strict local minimum in the count histogram with deep relative depth. Within each of the four Big-4 firms, the dHash histogram on bins $0$–$20$ exhibits no strict local minimum; the Big-4 pooled histogram exhibits one shallow valley at $\text{dHash} = 4$ with relative depth $0.021$ (a $2.1\%$ count drop). No valley near the deployed $\text{dHash} = 5$ operational boundary appears within any individual firm. The hypothesised dHash antimode near $\text{dHash} \approx 5$ is not empirically supported by the histogram analysis.
**6. Local sensitivity of the deployed thresholds (Script 46).** As a final confirmation that the deployed HC thresholds are not distributional features, we sweep each threshold against the *actual observed* Big-4 same-CPA pools and compare the local gradient at the deployed value to the median gradient across the sweep. At cos $> 0.95$ AND dHash $\leq 5$ the local gradient is substantially larger than the median (cosine ratio $\approx 25\times$; dHash ratio $\approx 3.8\times$): the HC threshold is *locally sensitive*, not plateau-stable (a $0.01$ cosine perturbation swings the rate $3.0$ pp; a single dHash integer step swings it $14.3$ pp). The MC/HSC boundary at dHash $= 15$, by contrast, lies in a low-gradient plateau (ratio $\approx 0.08\times$ median) — adding alert yield without inter-CPA specificity (§III-I.3), reinforcing the MC band's demotion to an advisory tier. We therefore read the deployed HC thresholds as **specificity-anchored operating points** (chosen for the specificity-vs-alert-yield tradeoff, §III-I.1), not as distributional antimodes; the gradient ratios are descriptive diagnostics, and the primary "no antimode" evidence comes from item 4 above.
**5. Conclusion: no natural threshold from the descriptor distribution.** The four diagnostics jointly establish that the descriptor distributions contain no within-population bimodal antimode that could anchor an operational threshold: the Big-4 accountant-level dip rejection is fully attributable to between-firm composition and integer mass-point artefacts (the 2×2 factorial restores unimodality, $p_{\text{median}} = 0.35$), the within-firm signature-level marginals are unimodal once integer ties are broken, and no integer-histogram valley exists near the deployed $\text{dHash} = 5$ boundary in any Big-4 firm. The K=2 / K=3 mixtures (§III-J) are therefore *descriptive* firm-compositional partitions, not evidence for population modes, and §III-L develops an anchor-based calibration that does not require a distributional antimode.
## L. K=3 as a Descriptive Partition of Firm-Composition Contrast
## J. K=3 as a Descriptive Partition of Firm-Composition Contrast
This section develops the K=2 and K=3 Gaussian mixture fits and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes** (§III-K.4 shows the apparent multimodality is fully explained by between-firm location shifts and integer mass-point artefacts). Neither mixture assigns signature- or document-level labels in the primary analysis; the operational classifier of §III-H.1 is calibrated in §III-I via inter-CPA coincidence rates, not mixture-derived antimodes.
This section develops the K=2 and K=3 Gaussian mixture fits and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes** (§III-I.4 shows the apparent multimodality is fully explained by between-firm location shifts and integer mass-point artefacts). Neither mixture is used to assign signature- or document-level labels in the primary analysis; the operational classifier of §III-H.1 is calibrated in §III-L via inter-CPA coincidence rates, not mixture-derived antimodes.
**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ (weight $0.689$) and $(0.983, 2.41)$ (weight $0.311$) (Script 34). $\text{BIC}(K{=}2) = -1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. We refer to the components by index rather than by mechanism labels, since §III-I.4 establishes that the K=2 separation is firm-compositional rather than mechanistic.
**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ (weight $0.689$) and $(0.983, 2.41)$ (weight $0.311$); marginal crossings $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$ (Script 34). We refer to components by index, since §III-K.4 establishes that the separation is firm-compositional, not mechanistic.
**K=3 fit.** Three components, sorted by ascending cosine mean (Script 35; Script 38 reproduces):
@@ -385,191 +453,35 @@ This section develops the K=2 and K=3 Gaussian mixture fits and clarifies their
$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical preference for K=3 under standard BIC interpretation, but not by itself decisive). The "descriptive position" column refrains from any mechanism interpretation: §III-I.4 establishes that the cosine and dHash axes both lack within-population bimodality, so component centres are best interpreted as locations in a continuous descriptor space rather than as latent mechanism modes.
$\text{BIC}(K{=}3) = -1111.93$, below $K{=}2$ by $3.48$ (mild, not decisive). The component centres are locations in a continuous descriptor space, not latent mechanism modes.
**Per-firm component composition (Script 35 firm × cluster cross-tab).** The K=3 partition is dominated by firm membership:
**The partition is dominated by firm membership (Script 35).** Firm A is $82.5\%$ C3 and accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24$ of the $40$ C1-assigned CPAs; Firms B and D sit predominantly in the central C2. The K=3 partition is therefore a firm-compositional decomposition: C3 is essentially "Firm A," C1 essentially "non-Firm-A CPAs in the low-cos / high-dHash corner." This same firm-compositional contrast reappears at the deployment level in the cross-firm hit matrix of §III-J.1.
**Leave-one-firm-out stability (Scripts 36, 37).** K=2 is unstable across folds (holding Firm A out shifts the cosine crossing to $0.938$ vs $\sim 0.975$ for the other folds; max across-fold deviation $0.028$, $5.6\times$ the report's tolerance), confirming the K=2 boundary is essentially a Firm-A-versus-others separator. K=3 has a *reproducible component shape* (C1 cosine mean varies by $\leq 0.005$, dHash by $\leq 0.96$, weight by $\leq 0.012$ across folds) but composition-sensitive hard-posterior membership (held-out C1 rate $36.3\%$ vs baseline $23.5\%$ at Firm C — a $12.8$ pp difference; legend `P2_PARTIAL`). We therefore do not use K=3 hard-posterior membership as an operational label; the operational classifier is calibrated in §III-I, and cross-checks between the deployed rule and the K=3 partition appear in §III-M.
Firm A accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24$ of the $40$ C1-assigned CPAs. The K=3 partition is therefore well-described as a firm-compositional decomposition: C3 is essentially "Firm A and any non-Firm-A CPA whose mean descriptors happen to land in the high-cos / low-dHash corner"; C1 is essentially "non-Firm-A CPAs whose mean descriptors land in the low-cos / high-dHash corner." The composition contrast that K=3 captures at the accountant level reappears at the deployment level in the cross-firm hit matrix of §III-L.4 (Script 44): under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). The K=3 partition and the cross-firm hit matrix therefore describe the same underlying firm-compositional structure at two different units of analysis.
## M. Convergent Internal-Consistency Checks
**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.012$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label.
The descriptive partition of §III-L is supported by three feature-derived per-CPA scores and a conservative hard-positive subset analysis. We caution at the outset that the three scores are **not statistically independent**: all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$, so their high rank correlations are partly mechanical. The checks below document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide). Full result tables are in §IV-F (Tables IX, XI) and §IV-H (Table XIV).
We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the operational classifier:
**1. Three feature-derived per-CPA scores (Script 38).** For each Big-4 CPA we compute (i) the K=3 posterior $P(\text{C1})$ on the low-cos / high-dHash component (§III-L); (ii) a reverse-anchor cosine percentile of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H.2; and (iii) the per-CPA fraction of signatures *not* satisfying the deployed HC box rule (cos $> 0.95$ AND dHash $\leq 5$). The three pairwise Spearman correlations are all $\rho \geq 0.879$ ($+0.9627$, $+0.8890$, $+0.8794$; $n = 437$; Table IX) — the strongest internal-consistency signal in the analysis: three summarisations of the same descriptor pair agree on the per-CPA ranking, all placing Firm A as the most replication-dominated descriptor position. They disagree only mildly at the less-replication-dominated end (Scores 1/3 place Firm C furthest from the templated end; the reverse-anchor places Firm D fractionally beyond Firm C). This is not external validation of any classifier; the deployed rule is calibrated separately (§III-I).
- The Big-4 K=2 marginal crossing $(0.975, 3.76)$ is essentially a firm-mass separator between Firm A and Firms B + C + D, not a within-Big-4 mechanism boundary.
- The Big-4 K=3 mixture exhibits a reproducible three-component component shape across LOOO folds at the descriptor-position level, with C1 reproducibly located at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$.
- Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp); K=3 is therefore not used to assign operational labels to CPAs.
**2. Per-signature consistency (Script 39).** Refitting K=3 at the signature level (150,442 Big-4 points) and comparing binary labels gives Cohen $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 labels (Table XI), so per-CPA aggregation does not collapse the three-component ordering. The lower $\kappa = 0.56$–$0.66$ between the binary box rule and either K=3 fit reflects different decision geometries (rectangular box vs Gaussian-mixture boundary).
The operational signature-level classifier of §III-H.1 is calibrated in §III-L against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K.
**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** As developed in §III-L, firm-level LOOO shows K=2 unstable (a Firm-A-versus-others separator) while K=3 has a reproducible C1 component shape ($\leq 0.005$ cosine drift) but composition-sensitive hard membership (up to $12.8$ pp; `P2_PARTIAL`), which is why K=3 hard membership is not an operational label. Full LOOO tables are in §IV-G.
## K. Convergent Internal-Consistency Checks
**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one conservative hard-positive subset: $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation ($145 / 8 / 107 / 2$ across Firms A–D). Independent hand-signing cannot produce pixel-identical images, so these are a conservative hard-positive subset for replication. All three candidate scores — the deployed HC box rule, the K=3 hard label, and the prevalence-calibrated reverse-anchor cut — assign every byte-identical signature to the replicated class ($0\%$ miss, Wilson $[0\%, 1.45\%]$; Table XIV, §IV-H). We caution that for the box rule this is close to tautological (byte-identical neighbours have cos $\approx 1$, dHash $\approx 0$), so it is a *necessary* check a failing classifier would not pass, not a sufficiency proof on the non-byte-identical replicated population. The corresponding inter-CPA negative-anchor evidence is in §III-I.1 (Big-4) and the corpus-wide version at §IV-I.
The descriptive partition of §III-J is supported by three feature-derived per-CPA scores and a conservative hard-positive subset analysis. We caution at the outset that the three scores are **not statistically independent measurements** — all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ — so their high pairwise rank correlations are partly a mechanical consequence of shared inputs. Per §III-I.4, none of the three scores has a within-population bimodality interpretation; they are firm-compositional position scores at the accountant level. The checks below therefore document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide).
## N. Unsupervised Diagnostic Strategy and Limits
**1. Three feature-derived per-CPA scores (Script 38).** For each Big-4 CPA we compute:
The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-M.4) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth. Each diagnostic in this paper addresses one specific failure mode of an unsupervised screening classifier; the full diagnostic-to-failure-mode-to-assumption map is given in Appendix A Table A.II.
- **Score 1 (K=3 posterior on the low-cos / high-dHash component):** $P(\text{C1})$ from the K=3 fit of §III-J. Per §III-J this is a firm-compositional position score on the (cos, dHash) plane (not a probability of any latent "hand-signing mechanism") — a function of both descriptor means.
- **Score 2 (reverse-anchor cosine percentile):** the marginal cosine CDF value of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H.2, sign-flipped so that lower percentile (further into the reference's left tail) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end. This is a function of $\overline{\text{cos}}_a$ alone.
- **Score 3 (deployed binary high-confidence box rule rate):** the per-CPA fraction of signatures that do **not** satisfy the deployed binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$). This is a per-signature-aggregated function of the same descriptors.
**Limits.** We do not claim a validated forensic detector or an autonomous classification system, and we do not interpret the deployed-rate excess of §III-J.2 as a presumed true-positive rate. That interpretation would require assuming a CPA's genuine same-CPA hand-signing produces a collision rate no higher than random inter-CPA pairs — unsafe for two reasons: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA cosine similarity; and (b) within-firm template-like reuse (§III-J.1; byte-level evidence of Firm A's pixel-identical signatures across partners) places a collision floor that itself reflects reuse rather than independent random matching. We describe the within-firm collision concentration of §III-J.1 as a descriptive observation and treat its mechanism as an open empirical question.
Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs (Script 38):
**Scope and positioning.** The deployed rule is characterised at three units against the normative Firms-B/C/D inter-CPA negative anchor, with Firm A held out as an out-of-sample target (§III-I.0). The resulting rates (§III-I, §III-J) are specificity-proxy-anchored **alarm-yield** indicators, not true error rates: the HC rule has a very low BCD coincidence rate at every unit, the dHash $\leq 15$ MC band is a low-specificity advisory tier, and the per-firm heterogeneity is read against the clean floor (Firm A the rate-extreme, its signal within-firm). The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not a validated forensic classifier.
**Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, operators cannot derive an operating point by optimising a ROC criterion. Instead, tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) further reduce both the per-comparison ICCR and the per-signature alert yield below the deployed-HC values of §III-I.1–§III-I.2, with an unknown effect on actual replication-detection recall — so tightening is not necessarily preferable. The deployment decision depends on the relative cost of manual review per alarm and missed-replication risk per false negative, neither directly observable from corpus data.
We read this as the strongest internal-consistency signal in the analysis: three different summarisations of the same descriptor pair agree on the per-CPA descriptor-position ranking with $\rho > 0.87$. The three scores agree on placing Firm A as the most replication-dominated descriptor position and the three non-Firm-A Big-4 firms further from the templated end, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Scores 1 and 3) place Firm C at the less-replication-dominated end of Big-4 (mean P(C1) $= 0.311$; mean box-rule less-replication-dominated rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the deployed box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the less-replication-dominated end between the three non-A Big-4 firms.
**2. Per-signature consistency (Script 39).** Per-CPA aggregation could in principle reflect averaging across within-CPA heterogeneity rather than coherent within-CPA behaviour. We test this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points (Script 39) — and comparing labels. The per-CPA and per-signature K=3 fits recover a broadly similar three-component ordering; per-CPA C1 is at $\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$ vs per-signature C1 at $\overline{\text{cos}} = 0.928$, $\overline{\text{dHash}} = 9.75$ (an absolute cosine drift of $0.018$). Cohen $\kappa$ on the binary collapse (replication-dominated vs less-replication-dominated):
| Pair | Cohen $\kappa$ |
|---|---|
| Deployed binary high-confidence box rule vs per-CPA K=3 hard label | $0.662$ |
| Deployed binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ |
| Per-CPA K=3 vs per-signature K=3 | $0.870$ |
The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the five-way rule. This comparison checks only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly check the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, whose calibration and capture-rate evidence is reported in the supplementary materials and not regenerated on the Big-4 subset.
**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Developed in §III-J: the firm-level LOOO cross-validation shows K=2 is unstable (its boundary is essentially a Firm-A-versus-others separator), while K=3 has a reproducible C1 component shape ($\leq 0.005$ cosine drift across folds) but composition-sensitive hard-posterior membership (up to $12.8$ pp; labelled `P2_PARTIAL`), which is why K=3 hard membership is not used as an operational label.
**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one conservative hard-positive subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
We report each candidate check's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions. This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 sample) and the corpus-wide version cited at §IV-I:
| Reverse-anchor with prevalence-calibrated cut | $0\%$ $[0\%, 1.45\%]$ |
All three candidate scores correctly assign every byte-identical signature to the replicated class. We caution that for the box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them. The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* this check would be disqualified), not a sufficient validation of the classifier's behaviour on the non-byte-identical replicated population. The reverse-anchor cut here is chosen by prevalence calibration against the box rule's overall replicated rate ($49.58\%$ of Big-4 signatures); this is a documented limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
## L. Anchor-Based Threshold Calibration on a Normative (Non-Firm-A) Baseline
The operational classifier defined in §III-H.1 is calibrated by characterising the deployed thresholds' inter-CPA pair-level negative-anchor coincidence behaviour and their pool-normalised per-signature and per-document alert behaviour, at multiple units of analysis. §III-I.4 establishes that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold; the K=3 mixture of §III-J is a descriptive firm-compositional partition, not a mechanism-cluster model. Throughout this section we report **inter-CPA coincidence rates (ICCR)** rather than "False Acceptance Rates"; we explain the terminological choice in §III-L.0. The calibration is anchored on a **normative non-Firm-A baseline (Firms B, C, D)** and treats Firm A as an out-of-sample target, for reasons developed in §III-L.0.
### L.0. Calibration methodology
**Choice of negative-anchor pool.** A negative anchor must approximate a population in which the rule should *not* fire — independent CPAs whose signatures coincide only by chance. §III-L.4 shows that under the deployed rule, $98.8\%$ of Firm A's inter-CPA collisions fall on other Firm-A CPAs, and byte-level evidence (§IV-H, supplementary materials) confirms image-level reuse across $\sim 50$ Firm-A partners. Including Firm A in the negative-anchor pool therefore loads the "coincidence" rate with structured within-firm collisions, not chance coincidence — a circularity, since that collision structure is the phenomenon the rule targets. We adopt **Firms B/C/D (BCD) as the normative negative-anchor baseline** and report the all-Big-4 (ABCD) pool only as a contamination-comparison scope; Firm A enters as an **out-of-sample target** (§III-L.4), not as a calibration input. A still-broader baseline adding the eligible non-Big-4 firms (BCD+non-Big-4) is reported as a robustness scope.
We further restrict the calibration baseline temporally to **fiscal years 2013–2019**. Following the post-2020 acceleration of digital document workflows, Taiwan audit firms increasingly adopted electronic-signature and stamping systems for report assembly, with firm-specific timing; the pre-2020 BCD period is therefore the construct-clean hand-signing baseline, while the post-2020 period mixes genuine hand-signing with legitimate e-signing and is not a clean negative anchor. The data corroborate this: the BCD per-comparison HC floor rises from $0.000010$ (2013–2019) to $0.000036$ (2020–2023), and the per-signature floor from $0.0059$ to $0.0105$ — the gradual, non-stepped rise being consistent with staggered per-firm adoption. We therefore calibrate on BCD 2013–2019 and report BCD 2020–2023 only as a robustness scope (it documents the e-signing contamination rather than the clean floor). Firm A is scored across its full 2013–2023 record against this clean threshold.
**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine and structural ($\text{dHash} \leq 5$) thresholds' specificity-proxy behaviour at the inter-CPA pair level on the BCD baseline. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for those sub-bands. The cosine LH/UN crossover $\text{cos} = 0.837$ is a corpus-wide descriptor-space landmark (intra- vs inter-CPA cosine KDE crossover, §IV-C) robust to baseline choice — it moves by at most $0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes ($0.8367$, $0.8489$, $0.8302$) — so we retain the corpus-wide value and do not re-anchor it on BCD.
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:
- *Per comparison.* For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos $>$ cos\_threshold and / or dHash $\leq$ dHash\_threshold)? This is the conventional pairwise calibration unit in biometric verification. We report it for both the cosine and dHash dimensions, marginally and jointly (§III-L.1).
- *Per signature pool.* For a source signature $s$ in the baseline pool with same-CPA pool of size $n_{\text{pool}}(s)$, what is the probability that the deployed rule fires *under the counterfactual* of replacing the source's same-CPA pool with $n_{\text{pool}}(s)$ random non-same-CPA candidates drawn from the baseline pool? This addresses the standard concern that a per-pair rate computed on independent pairs is not the deployed-rule rate at the per-signature classifier level: the deployed rule takes max-cosine and min-dHash over a pool of size $n_{\text{pool}}(s)$, so its effective coincidence rate is approximately $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit (§III-L.2).
- *Per document.* For an audit report aggregated via the worst-case rule, what fraction of documents have at least one signature whose deployed pool-normalised rule fires under the same inter-CPA candidate-replacement counterfactual? This is the operational alarm-rate unit (§III-L.3).
**Any-pair vs same-pair semantics.** The deployed rule uses independent extrema: a signature satisfies the HC rule if $\max_{\text{pool}} \text{cos} > 0.95$ AND $\min_{\text{pool}} \text{dHash} \leq 5$, *not* if a single candidate in the pool satisfies both. We refer to this as the **any-pair** rule. A stricter alternative — the **same-pair** rule — requires a single candidate to satisfy both inequalities; the deployed rule is any-pair, but we report same-pair as a stricter alternative classifier where useful (§III-L.2, §III-L.4).
**Terminological note on "FAR".** The biometric-verification literature speaks of "False Acceptance Rate" (FAR) for a per-pair rate computed on independent inter-CPA pairs. We adopt **inter-CPA coincidence rate (ICCR)** as the metric name and *do not* use "FAR" in the manuscript prose, for two reasons: (a) FAR has a specific biometric-verification meaning that requires ground-truth negative labels (which the corpus does not provide at the signature level); (b) the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures, which is precisely why we move the anchor to the BCD baseline (§III-L.0). Even on the BCD baseline, reading "inter-CPA coincidence rate" as a *specificity proxy* under an explicitly disclosed assumption is faithful to the evidence; reading it as a true biometric FAR would overstate the evidence.
We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from the baseline pool, computing for each pair the cosine similarity (feature dot product) and Hamming distance between the dHash byte vectors. Marginal and joint rates are reported with Wilson 95% confidence intervals (Script 46).
On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule is $0.000010$ — roughly $8\times$ lower than the all-Big-4 rate ($0.000140$), and lower still when the non-Big-4 firms are added ($0.000004$). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-L.4): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms ($5$ of $5 \times 10^5$ pairs on the BCD pool), so we report the Wilson interval and treat the per-comparison joint rate as an order-of-magnitude specificity proxy rather than a precisely estimated rate; the well-powered per-signature and per-document units (§III-L.2, §III-L.3) carry the primary calibration weight. The all-Big-4 cos $> 0.95$ row remains consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I. On the all-Big-4 sample the conditional rate ICCR(dHash $\leq 5\mid$ cos $> 0.95$) is $0.234$, indicating that the structural dimension adds substantial per-comparison specificity beyond the cosine gate.
The per-comparison rate does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.
The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \text{dHash}$ over the same-CPA pool of size $n_{\text{pool}}(s)$ for each signature $s$, so a per-comparison rate is not the rate at which the deployed classifier fires per signature. For each source signature $s$ we simulate one realisation of an inter-CPA candidate pool of the same size $n_{\text{pool}}(s)$, drawn uniformly from non-same-CPA signatures *in the baseline pool*, compute the deployed extrema and rule indicator, and aggregate (Script 52, canonical retry-loop sampler matching Scripts 43/45; CPA-block bootstrap 95% CIs on $n_{\text{boot}} = 1000$ replicates).
**Headline rate (deployed any-pair HC rule, cos $> 0.95$ AND dHash $\leq 5$):**
| Baseline pool | Per-signature HC ICCR | CPA-bootstrap 95% CI |
On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is $0.0059$ — an order of magnitude below the all-Big-4 figure of $0.1102$. The all-Big-4 figure is dominated by Firm A, whose signatures coincide with other Firm-A signatures at high rate; once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is $\approx 0.59\%$. This is the specificity-proxy floor against which the deployed HC rule operates. The rate increases with pool size (the rule takes extrema over $n_{\text{pool}}$ candidates), consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence; the within-firm violation of that independence (§III-L.4) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash $\leq 3$) by inverting the ICCR curve, with the unsupervised-setting caveats of §III-M.
Each document is classified by the worst-case rule over its constituent signatures (§III-H.1) under the same inter-CPA candidate-pool counterfactual as §III-L.2 (Script 52; dominant-firm document assignment). We report the high-confidence (HC, dHash $\leq 5$) and the HC+MC (dHash $\leq 15$) document-level alarm definitions:
| HC + MC (dHash $\leq 15$) | $0.1753$ | $0.3375$ | $0.1467$ |
**The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation.** On the BCD baseline the per-document HC rate is $0.0117$ ($\sim 8\times$ below the all-Big-4 $0.1797$), confirming that the HC (dHash $\leq 5$) rule has a very low inter-CPA coincidence rate: a clean inter-CPA baseline almost never produces an HC document. The HC+MC (dHash $\leq 15$) rate, by contrast, remains high on the clean baseline — $0.1753$ per document — and the per-firm breakdown shows it does *not* fall when Firm A is removed. **We therefore treat the HC sub-rule (dHash $\leq 5$) as the specificity-anchored operating point and reposition the MC band ($5 < \text{dHash} \leq 15$) as a low-specificity advisory tier rather than a confident non-hand-signed screening label.** Roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement, so an HC+MC alarm is not by itself evidence of image reproduction.
Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$ — slightly *higher* than under the all-Big-4 pool (B $0.160$, C $0.163$, D $0.088$), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash $\leq 15$ scale. This is direct evidence that the MC band carries little inter-CPA specificity even among normative firms, corroborating its demotion to an advisory tier. The positioning of the operational system as a **screening framework with human-in-the-loop review**, not an autonomous forensic classifier, follows directly (§III-M).
### L.4. Firm A as an out-of-sample target; firm heterogeneity (Scripts 49, 52, 44, 53)
With the calibration anchored on BCD, Firm A is scored as an out-of-sample target against the clean baseline. Three complementary readings establish that Firm A is the extreme case while keeping the inferential limits explicit.
**(i) Observed deployed rate versus the clean floor.** The deployed HC rule fires on each firm's *actual* same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of $0.0059$ (§III-L.2):
| Firm | Observed per-signature HC rate | Multiple of BCD floor |
|---|---|---|
| Firm A | $0.817$ | $\sim 139\times$ |
| Firm B | $0.346$ | $\sim 59\times$ |
| Firm C | $0.238$ | $\sim 40\times$ |
| Firm D | $0.245$ | $\sim 42\times$ |
All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at $\sim 139\times$, roughly $2.4$–$3.4\times$ the other Big-4 firms in absolute rate. We emphasise (and develop in §III-M) that this excess is *not* a true-positive rate: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above it. The multiple is a framework-discriminative observation, not a measure of image reproduction.
**(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm.** Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is $0.0001$ — below even the BCD-internal floor of $0.0059$, i.e. Firm A's signatures essentially never resemble genuine 2013–2019 hand-signing. Firm A's signatures are thus unremarkable when matched against *other firms'* signatures; the entire elevation in Firm A's observed rate ($0.817$) arises from matches against *other Firm-A* signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness.
**(iii) Firm-effect regressions: Firm A singular, baseline homogeneous.** Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D), with log-pool-size odds ratio $4.01$ — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within a factor of $\sim 3.5$: odds ratios $1.73$ (B), $0.49$ (C), log-pool-size odds ratio $3.29$. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.)
**Cross-firm hit matrix: within-firm concentration is a universal Big-4 pattern.** Under the deployed any-pair rule, inter-CPA collisions concentrate within the source firm at every Big-4 firm. On the full Big-4 candidate pool, within-firm concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (same-pair $97.0$–$99.96\%$; Table XXV). Restricting the candidate pool to the BCD baseline (Script 53) *raises* the within-firm concentration for B/C/D to $89.2$–$97.2\%$ any-pair (Firm B $97.2\%$, Firm C $92.3\%$, Firm D $89.2\%$) and $98.5$–$100\%$ same-pair — higher than on the full pool, because on the full pool some B/C/D collisions landed on Firm A's generically copy-like signatures; removing Firm A leaves each firm's collisions concentrated within itself. Within-firm collision concentration is therefore a universal Big-4 structural pattern, not a Firm-A peculiarity: Firm A is extreme in the *rate* at which the rule fires (reading (i)), but all four firms exhibit the same within-firm collision signature.
**Interpretation.** This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (§IV-H, supplementary materials) provides direct evidence of image-level reuse among Firm A signatures; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We report "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
### L.5. Alert-rate sensitivity around deployed thresholds (Script 46)
To test whether the deployed cosine threshold $0.95$ and dHash threshold $5$ coincide with a low-gradient (plateau-stable) region of the deployed-rule alert-rate surface — which would be weak distributional evidence that the deployed thresholds are stable operating points — we sweep each threshold across a range and report the per-signature alert rate on actual observed Big-4 same-CPA pools (not inter-CPA-replaced pools), comparing the local gradient at the deployed threshold to the median gradient across the sweep (Script 46).
At the deployed HC operating point cos $> 0.95$ AND dHash $\leq 5$, the local gradient of the per-signature alert rate is substantially larger than the median gradient across the sweep (cosine: ratio $\approx 25\times$ at the $0.95$ point relative to median; dHash: ratio $\approx 3.8\times$ at the $5$ point relative to median; both Script 46). Reading these ratios descriptively, the deployed HC threshold is *locally sensitive* rather than plateau-stable: small threshold perturbations materially change the deployed alert rate (cosine sweep at dHash $\leq 5$ yields rates of $0.5091$ at cos $> 0.945$ vs $0.4789$ at cos $> 0.955$, a $3.0$ pp swing across a $0.01$ cosine perturbation; dHash sweep at cos $> 0.95$ yields rates of $0.4207$ at dHash $\leq 4$ vs $0.5639$ at dHash $\leq 6$, a $14.3$ pp swing across a single integer step). The local-gradient-to-median-gradient ratios are descriptive diagnostics, not formal plateau tests; the primary evidence for "no within-population bimodal antimode at these thresholds" comes from §III-I.4's composition decomposition, not from §III-L.5.
The MC/HSC boundary at dHash $= 15$, by contrast, *is* in a low-gradient region (ratio $\approx 0.08$ to the median); the plateau-like behaviour around dHash $= 15$ is corroborating evidence that the high-end structural threshold lies in a regime where the rule's alert rate is approximately saturated, consistent with the high-dHash tail behaviour expected once near-identical pairs have been exhausted. The §III-L.5 non-plateau / local-sensitivity finding therefore applies specifically to the HC cutoff (cos $= 0.95$, dHash $= 5$); the MC/HSC sub-band boundary at dHash $= 15$ exhibits the opposite behaviour and is plateau-like. The plateau at dHash $= 15$ — added alert yield without added inter-CPA specificity (§III-L.3) — reinforces the demotion of the MC band to an advisory tier.
We interpret the deployed HC thresholds as **specificity-anchored operating points** chosen for the specificity-vs-alert-yield tradeoff (§III-L.1), *not* as distributional antimodes. Alternative operating points on the tradeoff curve can be characterised by inverting the per-comparison or pool-normalised ICCR curves (§III-L.1, §III-L.2) at the preferred specificity target.
### L.6. Observed deployed alert rate on actual same-CPA pools
The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the deployed HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$).
Read against the **normative BCD specificity-proxy floor** rather than the contaminated all-Big-4 rate, the observed-deployed excess is larger: the per-signature observed rate is $\sim 84\times$ the BCD floor ($0.4958$ vs $0.0059$), and the per-document HC observed rate is $\sim 53\times$ the BCD floor ($0.6228$ vs $0.0117$):
- Per-signature: $0.4958 - 0.0059 = 0.4899$ ($49.0$ pp excess over the clean floor)
- Per-document HC: $0.6228 - 0.0117 = 0.6111$ ($61.1$ pp excess over the clean floor)
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits are developed in §III-M. The excess is best read as an *observed same-CPA-pool excess over the normative inter-CPA floor* — a quantity that far exceeds what random inter-CPA candidate replacement among normative firms would produce — whose mechanism is not identifiable from descriptor-only data (§III-M). Anchoring the floor on the clean BCD baseline sharpens this contrast (the all-Big-4 floor would understate it by absorbing Firm A's reuse), while leaving the §III-M caveat — that the floor is an inter-CPA coincidence rate, not an intra-CPA genuine-hand-signing rate — fully in force; we do not attribute the excess to within-CPA handwriting repeatability or to image replication without further evidence.
## M. Unsupervised Diagnostic Strategy and Limits
The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth.
Each diagnostic reported in this paper therefore addresses one specific failure mode of an unsupervised screening classifier; the full diagnostic-to-failure-mode-to-assumption map is given in Appendix A Table A.II. No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth.
**Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, supplementary materials) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.
**Scope of the present analysis.** The deployed screening rule is characterised at three units (per-comparison, per-signature, per-document) against the normative Firms-B/C/D inter-CPA negative anchor, with Firm A held out as an out-of-sample target (§III-L.0). The resulting rates (§III-L.1–L.6) are specificity-proxy-anchored **alarm-yield** indicators, not true error rates: the HC rule has a very low BCD coincidence rate at every unit, the dHash $\leq 15$ MC band is a low-specificity advisory tier, and the per-firm heterogeneity is read against the clean floor (Firm A the rate-extreme, its signal within-firm). The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.
**Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and missed-replication risk (per false negative) — neither directly observable from corpus data.
## N. Data Source and Firm Anonymization
## O. Data Source and Firm Anonymization
**Audit-report corpus.** The 90,282 audit-report PDFs analyzed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation.
MOPS is the statutory public-disclosure platform for Taiwan-listed companies; every audit report filed on MOPS is already a publicly accessible regulatory document.
@@ -580,9 +492,10 @@ The CPA registry used to map signatures to CPAs is a publicly available audit-fi
Readers with domain familiarity may still infer Firm A from contextual descriptors (Big-4 status, replication-dominated behavior); we disclose this residual identifiability explicitly and note that none of the paper's conclusions depend on the specific firm's name.
# IV. Experiments and Results
Section IV reports the empirical results that calibrate and characterise the operational classifier of §III-H.1 (calibration developed in §III-L). The primary analyses (§IV-D through §IV-J, and the anchor-based ICCR calibration consolidated in §IV-M) are scoped to the Big-4 sub-corpus (Firms A–D, $n = 437$ CPAs with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. §IV-K reports a full-dataset (686 CPAs) robustness check on the K=3 mixture and per-CPA score-rank convergence; §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F.
Section IV reports the empirical results that calibrate and characterise the operational classifier of §III-H.1 (calibration developed in §III-I). The primary analyses (§IV-D through §IV-J, and the anchor-based ICCR calibration consolidated in §IV-M) are scoped to the Big-4 sub-corpus (Firms A–D, $n = 437$ CPAs with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. §IV-K reports a full-dataset (686 CPAs) robustness check on the K=3 mixture and per-CPA score-rank convergence; §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F.
## A. Experimental Setup
@@ -630,7 +543,7 @@ Table IV summarizes the distributional statistics.
Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; the subsequent distributional diagnostics in Section IV-D are produced via the methods of Section III-I to avoid single-family distributional assumptions.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; the subsequent distributional diagnostics in Section IV-D are produced via the methods of Section III-K to avoid single-family distributional assumptions.
The KDE crossover---where the two density functions intersect---was located at 0.837.
Under equal prior probabilities and equal misclassification costs, this crossover is a candidate decision boundary between the two classes; we adopt it only as the operational LH/UN boundary in §III-H.1, not as a natural distributional threshold.
@@ -642,7 +555,7 @@ A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the
## D. Big-4 Accountant-Level Distributional Characterisation
This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. The accountant-level dip-test rejection reported in Table V is, per §III-I.4, fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration.
This section reports the empirical evidence for §III-K's distributional diagnostics at the Big-4 accountant level. The accountant-level dip-test rejection reported in Table V is, per §III-K.4, fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration.
@@ -670,7 +583,7 @@ The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence
This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings.
**Table VII.** Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-J) and marginal-crossing bootstrap 95% confidence intervals.
**Table VII.** Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-L) and marginal-crossing bootstrap 95% confidence intervals.
$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.
$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-I and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.
## F. Convergent Internal-Consistency Checks
This section reports the empirical evidence for §III-K's three-score internal-consistency analysis. We re-emphasise the §III-K caveat: the three scores are deterministic functions of the same per-CPA descriptor pair $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ and are *not statistically independent measurements*. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth.
This section reports the empirical evidence for §III-M's three-score internal-consistency analysis. We re-emphasise the §III-M caveat: the three scores are deterministic functions of the same per-CPA descriptor pair $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ and are *not statistically independent measurements*. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth.
**Table IX.** Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, $n = 437$.
@@ -735,7 +648,7 @@ The three scores agree on placing Firm A as the most replication-dominated and t
## G. Leave-One-Firm-Out Reproducibility
This section reports the firm-level cross-validation evidence motivating §III-J's "K=3 descriptive, not operational" framing.
This section reports the firm-level cross-validation evidence motivating §III-L's "K=3 descriptive, not operational" framing.
**Table XII.** K=2 leave-one-firm-out across the four Big-4 folds.
@@ -758,11 +671,11 @@ This section reports the firm-level cross-validation evidence motivating §III-J
| Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp |
| Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp |
(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.012$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).
(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.012$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-I, §III-L).
## H. Pixel-Identity Positive-Anchor Miss Rate
This section reports the only conservative hard-positive subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4).
This section reports the only conservative hard-positive subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-M item 4).
@@ -778,13 +691,13 @@ We caution that for the deployed box rule this result is close to tautological (
## I. Inter-CPA Pair-Level Coincidence Rate
The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It is the per-pair rate at which two signatures from different CPAs satisfy the deployed rule. We do not label it as a False Acceptance Rate because (a) FAR has a biometric-verification meaning that requires ground-truth negative labels, and (b) the inter-CPA negative-anchor assumption is partially violated by within-firm cross-CPA template-like collision structures (§III-L.4 cross-firm hit matrix).
The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It is the per-pair rate at which two signatures from different CPAs satisfy the deployed rule. We do not label it as a False Acceptance Rate because (a) FAR has a biometric-verification meaning that requires ground-truth negative labels, and (b) the inter-CPA negative-anchor assumption is partially violated by within-firm cross-CPA template-like collision structures (§III-J.1 cross-firm hit matrix).
A corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs gives a per-comparison rate of $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$) at the cosine cut $0.95$. The Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs) replicates this number, adds the structural dimension (dHash), and adds joint-rule rates; the §III-L.1 numbers are referenced rather than duplicated here, and the consolidated ICCR calibration appears in §IV-M Tables XXI–XXVI.
A corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs gives a per-comparison rate of $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$) at the cosine cut $0.95$. The Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs) replicates this number, adds the structural dimension (dHash), and adds joint-rule rates; the §III-I.1 numbers are referenced rather than duplicated here, and the consolidated ICCR calibration appears in §IV-M Tables XXI–XXVI.
## J. Five-Way Per-Signature + Document-Level Classification Output
This section reports the five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. See §III-H.1 for the five-way category definitions and the cosine and dHash cuts; calibration is in §III-L.
This section reports the five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. See §III-H.1 for the five-way category definitions and the cosine and dHash cuts; calibration is in §III-I.
@@ -834,7 +747,7 @@ This section reports the five-way per-signature + document-level worst-case clas
(Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)
The five-way **moderate-confidence advisory** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains the threshold provenance of its prior calibration (supplementary materials), but §III-L.3 **supersedes its claim strength**: on the normative BCD baseline this band carries a $\sim 0.175$ per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
The five-way **moderate-confidence advisory** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains the threshold provenance of its prior calibration (supplementary materials), but §III-I.3 **supersedes its claim strength**: on the normative BCD baseline this band carries a $\sim 0.175$ per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-M Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature screening rule (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).
(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-L / §IV-E). It is reported here as a complement to the five-way per-signature screening rule (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).
**Document-level worst-case aggregation outputs are reported in Table XVI above.**
@@ -912,7 +825,7 @@ ResNet-50 provides the best overall balance:
## M. Anchor-Based ICCR Calibration Results
This section consolidates the empirical results that support the §III-L anchor-based threshold calibration framework.
This section consolidates the empirical results that support the §III-I anchor-based threshold calibration framework.
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000010}$ | $0.000140$ | $0.000004$ |
BCD joint Wilson 95% $[0.000004, 0.000023]$ ($5$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-L.4). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$).
BCD joint Wilson 95% $[0.000004, 0.000023]$ ($5$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-J.1). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$).
| HC + MC (dHash $\leq 15$) | $0.1753$ | $0.3375$ | $0.1467$ |
Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$ (all-Big-4 pool: Firm A $0.620$, Firm B $0.160$, Firm C $0.163$, Firm D $0.088$). The HC band collapses by $\sim 8\times$ when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash $\leq 15$ adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-L.3).
Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$ (all-Big-4 pool: Firm A $0.620$, Firm B $0.160$, Firm C $0.163$, Firm D $0.088$). The HC band collapses by $\sim 8\times$ when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash $\leq 15$ adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-I.3).
### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)
@@ -974,7 +887,7 @@ Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $
| Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size |
On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within $\sim 3.5\times$ — odds ratios $1.73$ (Firm B), $0.49$ (Firm C), log-pool-size $3.29$ — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-L.4).
On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within $\sim 3.5\times$ — odds ratios $1.73$ (Firm B), $0.49$ (Firm C), log-pool-size $3.29$ — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-J.1).
Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$ while Firm A ranges $0.0541$–$0.5958$. The firm gap survives within matched pool sizes.
@@ -987,7 +900,7 @@ Per-decile per-firm rates (Table not duplicated here; Script 44 decile table ava
Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively. Restricting the candidate pool to the BCD baseline (Script 53) raises Firms B/C/D within-firm any-pair concentration to $97.2\%$ / $92.3\%$ / $89.2\%$ (same-pair $100\%$ / $100\%$ / $98.5\%$): within-firm concentration is a universal Big-4 pattern, not a Firm-A peculiarity (§III-L.4).
Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively. Restricting the candidate pool to the BCD baseline (Script 53) raises Firms B/C/D within-firm any-pair concentration to $97.2\%$ / $92.3\%$ / $89.2\%$ (same-pair $100\%$ / $100\%$ / $98.5\%$): within-firm concentration is a universal Big-4 pattern, not a Firm-A peculiarity (§III-J.1).
### M.6 Alert-rate sensitivity around deployed HC threshold (Script 46)
@@ -999,7 +912,7 @@ Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. Against the normative BCD floor (per-signature $0.0059$; per-document HC $0.0117$), the observed same-CPA-pool excess is $0.4899$ ($49.0$ pp, $\sim 84\times$) per-signature and $0.6111$ ($61.1$ pp, $\sim 53\times$) per-document; this excess is reported under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. Against the normative BCD floor (per-signature $0.0059$; per-document HC $0.0117$), the observed same-CPA-pool excess is $0.4899$ ($49.0$ pp, $\sim 84\times$) per-signature and $0.6111$ ($61.1$ pp, $\sim 53\times$) per-document; this excess is reported under §III-N caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.
# V. Discussion
@@ -1010,17 +923,17 @@ Non-hand-signing differs from forgery in that the questioned signature is produc
## B. Per-Signature Similarity is a Continuous Quality Spectrum; the Accountant-Level Multimodality is Composition-Driven
The Big-4 accountant-level distribution rejects unimodality on both marginals (§IV-D), but §III-I.4 shows this is fully attributable to between-firm location shifts and integer mass-point artefacts, not within-population structure: under joint firm-mean centring and integer-tie jitter the dip test no longer rejects ($p_{\text{median}} = 0.35$), and within each Big-4 firm the signature-level marginals are unimodal once integer ties are broken. The distributions therefore contain no within-population bimodal antimode to anchor a threshold, and per-signature similarity is best read as a continuous quality spectrum rather than two discrete populations. The K=2 / K=3 fits are descriptive firm-compositional partitions (§III-J), not latent mechanism classes.
The Big-4 accountant-level distribution rejects unimodality on both marginals (§IV-D), but §III-K.4 shows this is fully attributable to between-firm location shifts and integer mass-point artefacts, not within-population structure: under joint firm-mean centring and integer-tie jitter the dip test no longer rejects ($p_{\text{median}} = 0.35$), and within each Big-4 firm the signature-level marginals are unimodal once integer ties are broken. The distributions therefore contain no within-population bimodal antimode to anchor a threshold, and per-signature similarity is best read as a continuous quality spectrum rather than two discrete populations. The K=2 / K=3 fits are descriptive firm-compositional partitions (§III-L), not latent mechanism classes.
## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)
## C. Firm A as the Templated End of Big-4 (Out-of-Sample Target, Not Calibration Anchor)
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-L), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.
We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-L.0). Three readings (§III-L.4) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide essentially never ($0.0001$, below the BCD floor of $0.0059$) — so Firm A is unremarkable, indeed sub-baseline, *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 139\times$ the clean floor, versus $\sim 40$–$59\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D, with same-pair concentration $97$–$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$–$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-L.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish.
We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-I.0). Three readings (§III-J.1) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide essentially never ($0.0001$, below the BCD floor of $0.0059$) — so Firm A is unremarkable, indeed sub-baseline, *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 139\times$ the clean floor, versus $\sim 40$–$59\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D, with same-pair concentration $97$–$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$–$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-I.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish.
## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
Leave-one-firm-out cross-validation (§III-J) sharply separates the two fits. K=2 is unstable — its boundary is essentially a Firm-A-versus-others separator (holding Firm A out gives a markedly looser fold rule than holding any other firm out), direct evidence that it reflects firm composition, not mechanism. K=3, by contrast, has a reproducible component shape across folds (the C1 cosine mean varies by $\leq 0.005$), though hard-posterior membership remains composition-sensitive. We therefore read K=3 as a reproducible three-region descriptor partition reflecting how firm-compositional weight is distributed across the descriptor plane, not a three-mechanism latent structure, and use it only as an accountant-level descriptive summary — never as operational classifier output.
Leave-one-firm-out cross-validation (§III-L) sharply separates the two fits. K=2 is unstable — its boundary is essentially a Firm-A-versus-others separator (holding Firm A out gives a markedly looser fold rule than holding any other firm out), direct evidence that it reflects firm composition, not mechanism. K=3, by contrast, has a reproducible component shape across folds (the C1 cosine mean varies by $\leq 0.005$), though hard-posterior membership remains composition-sensitive. We therefore read K=3 as a reproducible three-region descriptor partition reflecting how firm-compositional weight is distributed across the descriptor plane, not a three-mechanism latent structure, and use it only as an accountant-level descriptive summary — never as operational classifier output.
## E. Three-Score Convergent Internal-Consistency
@@ -1028,11 +941,11 @@ Three feature-derived per-CPA scores — the K=3 firm-compositional position, th
## F. Anchor-Based Multi-Level Calibration
The deployed HC sub-rule's specificity-proxy behaviour is characterised at three units against the normative BCD baseline (§III-L), with the all-Big-4 pool shown only as a contamination comparison. At every unit the HC inter-CPA coincidence rate is an order of magnitude below the all-Big-4 figure — the gap being Firm A's extreme within-firm collision structure (§III-L.4) — confirming HC as a high-specificity-proxy operating point. Because the deployed rule takes pool extrema, the per-comparison rate understates the per-signature rate (the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ pool effect). The HC threshold is locally sensitive rather than plateau-stable (§III-L.5), so it is a specificity-anchored operating *choice*, not a distributional antimode; operators can select alternative points by inverting the ICCR curves (§III-L.2). The dHash$\leq 15$ MC band stays a low-specificity advisory tier even on the clean baseline (§III-L.3).
The deployed HC sub-rule's specificity-proxy behaviour is characterised at three units against the normative BCD baseline (§III-I), with the all-Big-4 pool shown only as a contamination comparison. At every unit the HC inter-CPA coincidence rate is an order of magnitude below the all-Big-4 figure — the gap being Firm A's extreme within-firm collision structure (§III-J.1) — confirming HC as a high-specificity-proxy operating point. Because the deployed rule takes pool extrema, the per-comparison rate understates the per-signature rate (the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ pool effect). The HC threshold is locally sensitive rather than plateau-stable (§III-K.6), so it is a specificity-anchored operating *choice*, not a distributional antimode; operators can select alternative points by inverting the ICCR curves (§III-I.2). The dHash$\leq 15$ MC band stays a low-specificity advisory tier even on the clean baseline (§III-I.3).
## G. Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor
The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve $0\%$ positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the HC region). The complementary negative anchor is the §III-L.1 per-comparison ICCR on the normative BCD baseline ($0.000010$); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-L.0).
The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve $0\%$ positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the HC region). The complementary negative anchor is the §III-I.1 per-comparison ICCR on the normative BCD baseline ($0.000010$); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-I.0).
## H. Limitations
@@ -1040,9 +953,9 @@ Several limitations should be transparent. We group them into primary methodolog
**Primary methodological limitations.**
*No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M).
*No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-I are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-N).
*Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline.* The cross-firm hit matrix of §III-L.4 shows that under the deployed rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$97.2\%$ at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-L.0); on this baseline the per-comparison HC rate falls from $0.00014$ to $0.000010$ and the per-signature HC rate from $0.1102$ to $0.0059$. A residual caveat survives even on the clean baseline: the BCD floor is an *inter-CPA coincidence* rate, not an *intra-CPA genuine-hand-signing* rate, so the observed-versus-floor excess (§III-L.6) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities.
*Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline.* The cross-firm hit matrix of §III-J.1 shows that under the deployed rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$97.2\%$ at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-I.0); on this baseline the per-comparison HC rate falls from $0.00014$ to $0.000010$ and the per-signature HC rate from $0.1102$ to $0.0059$. A residual caveat survives even on the clean baseline: the BCD floor is an *inter-CPA coincidence* rate, not an *intra-CPA genuine-hand-signing* rate, so the observed-versus-floor excess (§III-J.2) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities.
*Mechanism attribution for the firm-level heterogeneity is not identifiable from descriptor-only data.* The observed firm-level contrast (Firm A's per-document HC$+$MC ICCR of $0.62$ versus $0.09$–$0.16$ at Firms B/C/D; within-firm collision concentration $77$–$99\%$ under the deployed any-pair rule; byte-identical evidence of §IV-H) is consistent with at least three non-mutually-exclusive firm-level mechanisms: (i) template, stamp, or e-signature production reuse; (ii) digitisation-pipeline homogeneity — shared scanners, common PDF generation infrastructure, identical compression and form-template settings — that systematically inflates image-descriptor similarity without signature replication; and (iii) signing-style or training homogeneity that produces correlated handwritten signatures within a firm. The descriptor pair (cosine, dHash) operates at the image-similarity level and is, by construction, indifferent to which mechanism generated a given near-identical pair. We therefore report the firm contrast as a methodological observation — the framework discriminates at firm-level resolution — rather than as a mechanism finding. The byte-identical Firm A signatures across $\sim 50$ distinct partners (§IV-H, §V-C) provide direct evidence for (i) at Firm A specifically, but do not exclude additive contribution from (ii) or (iii); the milder within-firm collision patterns at Firms B/C/D are individually consistent with all three mechanisms. Image-acquisition metadata (scanner identifiers, PDF generator fingerprints, compression-codec markers), partner-level intent records, or controlled hand-signed baselines would be needed to attribute the contrast across (i), (ii), and (iii).
@@ -1052,9 +965,9 @@ Several limitations should be transparent. We group them into primary methodolog
*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the deployed box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence advisory band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-L.3 supersedes the MC band's *claim strength* — its $\sim 0.175$ per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence advisory band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-I.3 supersedes the MC band's *claim strength* — its $\sim 0.175$ per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-K.6) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
*Deployed-rate excess is not a presumed true-positive rate.* The per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the normative inter-CPA proxy floor (HC: $0.012$ on the BCD baseline) — $\sim 60$ pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal.
*Deployed-rate excess is not a presumed true-positive rate.* The per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the normative inter-CPA proxy floor (HC: $0.012$ on the BCD baseline) — $\sim 60$ pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-N shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal.
*A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.
@@ -1062,7 +975,7 @@ Several limitations should be transparent. We group them into primary methodolog
*K=3 hard-posterior membership is composition-sensitive.* The K=3 hard-posterior membership for any single firm varies by up to $12.8$ pp across LOOO folds. This is documented as a composition-sensitivity band rather than failure, but it means K=3 hard labels are not used as operational classifier output; they are reported only as accountant-level descriptive characterisation.
*No partner-level mechanism attribution.* The analysis reports population-level patterns; it does not perform partner-level mechanism attribution or report-level claims of intent. The signature-level outputs are signature-level quantities throughout. The within-firm cross-CPA collision concentration of §III-L.4 is consistent with template-like reuse but is not by itself diagnostic of deliberate sharing.
*No partner-level mechanism attribution.* The analysis reports population-level patterns; it does not perform partner-level mechanism attribution or report-level claims of intent. The signature-level outputs are signature-level quantities throughout. The within-firm cross-CPA collision concentration of §III-J.1 is consistent with template-like reuse but is not by itself diagnostic of deliberate sharing.
**Engineering-level caveats of the pipeline.**
@@ -1079,18 +992,18 @@ Several limitations should be transparent. We group them into primary methodolog
# VI. Conclusion and Future Work
We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature screening rule with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population. We emphasise that the operating thresholds are operator-tunable and that the system performs semi-automated triage — surfacing replication candidates from hundreds of thousands of signatures for human adjudication — rather than autonomous forensic classification; its central deliverable is the label-free calibration methodology by which an operator selects and characterises a screening operating point.
We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature screening rule with worst-case document-level aggregation (§III-H.1; calibrated in §III-I). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population. We emphasise that the operating thresholds are operator-tunable and that the system performs semi-automated triage — surfacing replication candidates from hundreds of thousands of signatures for human adjudication — rather than autonomous forensic classification; its central deliverable is the label-free calibration methodology by which an operator selects and characterises a screening operating point.
Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D, with Firm A held out as an out-of-sample target to avoid circularity): on this clean baseline the deployed HC rule yields per-comparison ICCR $0.000010$, per-signature $0.0059$, and per-document $0.012$ — roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$) — while the dHash$\leq 15$ moderate-confidence band, which retains a $\sim 0.175$ per-document coincidence rate even on the clean baseline, is repositioned as a low-specificity advisory tier; with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: against the clean BCD floor the deployed rule fires on each firm's own pools at $\sim 139\times$ (Firm A) and $\sim 40$–$59\times$ (Firms B/C/D), while Firm A scored cross-firm against the clean 2013–2019 baseline coincides essentially never cross-firm ($0.0001$); two logistic regressions (full-Big-4 Firm-A-reference odds ratios $0.053$/$0.010$/$0.027$; BCD-only Firm-D-reference residual spread within $\sim 3.5\times$) show Firm A is the singular outlier and Firms B/C/D an internally homogeneous baseline — reported as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$–$97\%$ at Firms B/C/D on the clean BCD pool (same-pair $97$–$100\%$ across all four firms) — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.
Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate $0.82$ vs $0.24$–$0.35$, $\sim 139\times$ vs $\sim 40$–$59\times$ the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-J.1 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate $0.82$ vs $0.24$–$0.35$, $\sim 139\times$ vs $\sim 40$–$59\times$ the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-K.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
The main text (Section III-I, Section IV-D Table VI) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as a threshold estimator.
The main text (Section III-K, Section IV-D Table VI) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as a threshold estimator.
This subsection documents the empirical basis for that framing by sweeping the bin width across four (variant, bin-width) panels: Firm A and full-sample, each in the cosine and $\text{dHash}_\text{indep}$ direction.
@@ -1126,22 +1039,22 @@ Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel ar
## A.2. Diagnostic Summary
Section III-M positions the unsupervised-diagnostic strategy as a set of complementary checks, each addressing one specific failure mode of an unsupervised screening classifier with an explicitly disclosed untested assumption. Table A.II maps each diagnostic to the failure mode it addresses and to the untested assumption it relies on.
Section III-N positions the unsupervised-diagnostic strategy as a set of complementary checks, each addressing one specific failure mode of an unsupervised screening classifier with an explicitly disclosed untested assumption. Table A.II maps each diagnostic to the failure mode it addresses and to the untested assumption it relies on.
**Table A.II.** Diagnostics, failure mode addressed, and disclosed untested assumption.
| Composition decomposition (§III-I.4; Scripts 39b–39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) |
| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 46) | Pair-level specificity proxy under a random-pair negative anchor, on the normative BCD baseline | Inter-CPA pairs are negative (i.e., not template-related); addressed by anchoring on Firms B/C/D and holding Firm A out (§III-L.0) |
| Pool-normalised per-signature ICCR (§III-L.2; Script 52) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size, on the BCD baseline | Same as above + that pool replacement preserves the negative-anchor property |
| Document-level ICCR (§III-L.3; Script 52) | Operational alarm-rate proxy at per-document unit (HC and HC+MC), on the BCD baseline | Same as above |
| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check |
| Cross-firm hit matrix (§III-L.4; Scripts 44, 53) | Concentration of inter-CPA collisions within source firm (all-Big-4 and BCD-pool variants) | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$–$99.96\%$ within-firm at all four firms versus $76.7$–$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 52's dominant-firm rule (§IV-M.4) |
| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test |
| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal-consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent |
| Pixel-identical conservative positive capture (§III-K.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold |
| LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity |
| Composition decomposition (§III-K.4; Scripts 39b–39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) |
| Per-comparison inter-CPA coincidence rate (§III-I.1; Script 46) | Pair-level specificity proxy under a random-pair negative anchor, on the normative BCD baseline | Inter-CPA pairs are negative (i.e., not template-related); addressed by anchoring on Firms B/C/D and holding Firm A out (§III-I.0) |
| Pool-normalised per-signature ICCR (§III-I.2; Script 52) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size, on the BCD baseline | Same as above + that pool replacement preserves the negative-anchor property |
| Document-level ICCR (§III-I.3; Script 52) | Operational alarm-rate proxy at per-document unit (HC and HC+MC), on the BCD baseline | Same as above |
| Firm-heterogeneity logistic regression (§III-J.1; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check |
| Cross-firm hit matrix (§III-J.1; Scripts 44, 53) | Concentration of inter-CPA collisions within source firm (all-Big-4 and BCD-pool variants) | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$–$99.96\%$ within-firm at all four firms versus $76.7$–$98.8\%$ under any-pair; §III-J.1); per-document per-firm assignment uses Script 52's dominant-firm rule (§IV-M.4) |
| Alert-rate sensitivity sweep (§III-K.6; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test |
| Convergent score Spearman ranking (§III-M.1; Script 38) | Internal-consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent |
| Pixel-identical conservative positive capture (§III-M.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold |
| LOOO firm-level reproducibility (§III-M.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity |
The Abstract and Introduction are substantively strong and defensible. The current argument is clear:
- Regulations require CPA attestation, but digitized PDF workflows make stored-signature reuse operationally easy.
- The problem is not signature forgery; identity is not in dispute. The target is detecting possible image-level reproduction by the legitimate signer or firm workflow.
- The paper avoids claiming validated forensic detection and instead frames the system as an anchor-calibrated screening framework under unsupervised constraints.
- The strongest methodological move is replacing unsupported distributional "natural threshold" logic with anchor-based inter-CPA coincidence-rate (ICCR) calibration.
Recommended disposition: Minor Revision for prose and narrative complexity, not for core empirical weakness.
## Main Reviewer Concern
The Introduction currently explains the methodology shift too explicitly as a research-process or version-history pivot. This is useful internally, but in the submitted paper it may increase complexity and invite reviewers to focus on why earlier versions used a different framing.
The final manuscript should explain the final methodological choice, not the internal research journey.
Keep:
- The descriptor distribution does not support a stable within-population bimodal antimode.
- Apparent multimodality is explained by firm composition and integer mass-point artefacts.
- Mixture fits are descriptive, not threshold-generating.
- Operational rules are characterized using anchor-based ICCR at multiple units.
Reduce or remove:
- "Earlier work in this lineage..."
- "v4.0 contribution..."
- "overturns this reading..."
- "inherited Paper A v3.x..."
- Internal script-heavy provenance in the Introduction.
Detailed provenance belongs in Methodology, Results, Appendix, or reproducibility notes, not in the opening narrative.
## Suggested Rewrite Direction for Introduction Pivot Paragraph
Current issue location: around `paper/paper_a_v4_combined.md`, Introduction paragraph beginning with "The methodological reframing relative to earlier versions..."
Recommended replacement direction:
```text
A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location shifts and integer mass-point artefacts on the dHash axis. After firm-mean centring and integer-tie jitter, the pooled dHash dip-test rejection disappears. Within-firm diagnostics likewise do not reveal a stable bimodal antimode. We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.
```
This preserves the methodological defense while removing the internal v3-to-v4 story.
## Abstract-Specific Comments
The Abstract is strong but very dense. It is currently optimized for technical reviewers rather than broad readability. That may be acceptable for IEEE Access, but the first sentence has a small grammar/style issue.
Suggested edit:
```text
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports -- through administrative stamping or firm-level electronic signing -- thereby undermining individualized attestation.
```
Reason:
- Current wording: "digitization makes reusing ... undermining ..." is grammatically awkward.
- The suggested version makes the causal relation explicit.
No need to remove the final limitation sentence. The sentence "not as a validated forensic detector; no calibrated error rates..." is important and should remain.
## Introduction-Specific Comments
### 1. Keep the legal framing but avoid legal overclaiming
The sentence saying non-hand-signed workflows "may fall within the literal statutory requirement" is acceptable because it is cautious. Do not strengthen it into a legal conclusion.
Preferred style:
- "may fall within"
- "raises substantive concerns"
- "may not represent meaningful individual attestation"
Avoid:
- "violates"
- "illegal"
- "non-compliant"
- "fraudulent"
### 2. Preserve the forgery distinction
The distinction between non-hand-signing detection and signature forgery detection is one of the strongest conceptual contributions. Keep it prominent.
Key idea to preserve:
- Forgery detection asks whether the signer is genuine.
- This paper asks whether the signing act was repeated for each document or a stored image was reused.
### 3. Reduce script/provenance detail in the Introduction
Current paragraph references scripts such as Script 39c and Script 39d. This makes the Introduction read like an internal review memo.
Recommendation:
- Remove or simplify script references from Introduction.
- Keep exact script provenance in Methodology, Results, Appendix B, or supplementary material.
Specific risk:
- The current parenthetical "10 firms tested in Script 39c" is imprecise for jittered-dHash. Script 39c raw dHash tests reject unimodality; the non-Big-4 jittered-dHash no-rejection statement depends on a codex-verified read-only spike on the same substrate.
Safer Introduction wording:
```text
Within-firm diagnostics likewise fail to reveal stable bimodal structure after accounting for integer ties, including in eligible mid/small-firm checks.
```
If provenance must remain:
```text
Within-firm signature-level cosine checks fail to reject in eligible firms, and corresponding jittered-dHash checks fail to reject in Big-4 firms and in a read-only spike on the same mid/small-firm substrate.
```
### 4. Avoid presenting the Introduction as a Results section
The Introduction currently contains many detailed numbers. Some are necessary because the paper is methodological, but the v4 pivot paragraphs are numerically heavy.
- Firm heterogeneity: Firm A 0.62 vs Firms B/C/D 0.09-0.16.
Consider moving or reducing:
- Full script-specific details.
- Too many parenthetical rule semantics in the Introduction.
- Repeated mentions of inherited/v3/v4 framing.
## Recommended Minimum Patch List
1. Fix Abstract first sentence grammar:
```text
digitization makes it feasible to reuse...
```
2. Rewrite the Introduction paragraph that begins with "The methodological reframing relative to earlier versions..." so it describes the final methodological rationale rather than v3-to-v4 revision history.
3. Remove or narrow `Script 39c` provenance in the Introduction because the raw vs jittered dHash distinction is subtle and currently risky.
4. Replace internal-version language across the Introduction:
- Replace "v4.0 adopts..." with "We adopt..."
- Replace "Earlier work in this lineage..." with "A distributional-threshold approach would be inappropriate here because..."
- Replace "inherited Paper A v3.x five-way box rule" with "the deployed five-way box rule" unless historical provenance is essential.
5. Preserve limitation language:
- The paper should continue to say it is not a validated forensic detector.
- The paper should continue to say calibrated error rates cannot be reported without signature-level ground truth.
## Reviewer Bottom Line
The paper should not hide that the distributional threshold path failed; that is actually a methodological strength. But it should present this as a final empirical finding and design rationale, not as a visible research-history correction.
Recommended framing:
```text
Because the observed distribution does not provide a defensible natural threshold, we use ICCR calibration to characterize the deployed operating rules under explicit unsupervised assumptions.
```
This is cleaner, less complex, and more reviewer-facing than the current v3-to-v4 narrative.
## Additional Framing Issue: Are We Giving Thresholds or Not?
A likely reviewer confusion point is whether the paper provides a concrete classifier threshold or merely explains why no defensible threshold can be derived.
The intended answer should be explicit:
- The paper does provide a concrete, reproducible operational classifier.
- The paper does not claim that this classifier is ground-truth-optimal.
- The paper does not claim that the operating thresholds are natural antimodes in the descriptor distribution.
- The paper's calibration contribution is to characterize the deployed rule's inter-CPA coincidence behavior under unsupervised assumptions.
Recommended high-level framing:
```text
We use a fixed, pre-specified five-way operating rule. The present calibration does not derive an optimal threshold; instead, it quantifies the rule's inter-CPA coincidence behavior at per-comparison, per-signature, and per-document units under explicit unsupervised assumptions.
```
Chinese interpretation:
```text
我們有一組明確、可重現的五分類操作規則;本文不是宣稱這組門檻是最佳門檻或自然分界點,而是在沒有 signature-level ground truth 的情況下,用 ICCR 量化這組規則的 specificity-proxy 行為。
```
## Concrete Threshold Language to Make Visible
The manuscript should not bury the actual operating thresholds. Somewhere early in Methodology, and preferably summarized in Introduction, make the rule explicit:
```text
High-confidence non-hand-signed: cosine > 0.95 AND dHash <= 5.
Other outcomes follow the fixed five-way box rule.
```
If space allows, add a compact sentence:
```text
Thus, the system has explicit decision rules; what remains uncalibrated in the absence of signature-level labels is their true false-positive and false-negative error rate.
```
This directly answers the reviewer question: "Do the authors actually have a classifier?"
## Rewrite Style Recommendation
Avoid language that sounds like the authors are unable to provide thresholds:
- Avoid: "No threshold can be derived."
- Avoid: "The distribution does not support classification."
- Avoid: "We cannot determine a threshold."
Use language that distinguishes operational thresholds from statistically natural or supervised-optimal thresholds:
- Prefer: "The deployed thresholds are operational rules rather than natural antimodes."
- Prefer: "We characterize these rules with ICCR rather than claiming supervised error rates."
- Prefer: "The absence of a distributional antimode motivates anchor-based calibration, not threshold-free analysis."
- Prefer: "The system is a concrete screening classifier with explicit unsupervised calibration limits."
## Reviewer-Facing Answer to the Threshold Question
If the manuscript needs one sentence that resolves the ambiguity, use:
```text
The system therefore uses explicit operating thresholds, but the evidentiary claim attached to those thresholds is limited: they define a reproducible screening rule whose coincidence behavior can be estimated under inter-CPA anchors, not a validated forensic decision boundary with calibrated error rates.
```
This should be the guiding style for Abstract, Introduction, and the start of Methodology.
## Readability Risk: Too Many Diagnostics Can Look Like Methodological Overbuilding
The manuscript's multi-method statistical design increases rigor, but it also creates a readability risk. In the current form, some sections may feel like a defensive accumulation of diagnostics rather than a clean research design.
Reviewer risk:
- The reader may ask: "Are the authors using many methods because the core classifier is unclear?"
- The reader may miss the simple main claim because the paper introduces too many caveats and validation tools early.
- The paper may look like "we used many methods, therefore credible" instead of "each method answers one necessary question."
Recommended main-thread sentence:
```text
We deploy a fixed five-way screening rule and characterize its unsupervised reliability limits using ICCR, after showing that the descriptor distribution does not support a natural threshold.
- Composition decomposition showing why the descriptor distribution does not yield a natural threshold.
- ICCR calibration at three units: per-comparison, per-signature, per-document.
- Firm heterogeneity and within-firm collision concentration.
- Ground-truth limitation and no true error-rate claim.
Treat the following as supporting diagnostics and avoid letting them dominate the main narrative:
- K=2 / K=3 mixture fits.
- Three-score Spearman convergence.
- Leave-one-firm-out reproducibility.
- BD/McCrary sensitivity.
- Ten-tool validation table.
- Pixel-identity positive anchor, especially because it is close to tautological for the high-confidence rule.
These supporting diagnostics can stay, but they should be framed as robustness checks, assumption checks, or supplementary evidence, not as independent central contributions.
## Suggested Manuscript Structure for Clarity
Recommended structure for the Methodology / Results narrative:
1. Core Method
Describe the pipeline, descriptor construction, and five-way rule.
2. Why the Threshold Is Operational Rather Than Natural
Use the composition decomposition only. Avoid over-explaining K=3, BD/McCrary, or historical mixture logic here.
3. How the Rule Is Calibrated Without Ground Truth
Explain ICCR and the three reporting units: per-comparison, per-signature, per-document.
4. What the Calibration Reveals
Report firm heterogeneity and within-firm collision concentration.
5. Supporting Diagnostics
Place K=3, Spearman convergence, LOOO, BD/McCrary, and pixel-identity checks here as supporting evidence.
## Rewrite Style for Multi-Method Sections
Avoid:
```text
We apply a multi-tool validation framework consisting of ten diagnostics...
```
This can sound like methodological stacking.
Prefer:
```text
Each supporting diagnostic addresses a specific failure mode: composition artefacts, inter-CPA coincidence, pool-size effects, firm heterogeneity, or positive-anchor capture.
```
Avoid:
```text
The conjunction of ten tools constitutes validation...
```
Prefer:
```text
Together, these diagnostics define the limits of what can be supported without signature-level ground truth.
```
Avoid presenting auxiliary diagnostics before the reader understands the classifier.
Preferred order:
```text
Rule first. Then why not natural threshold. Then ICCR calibration. Then robustness.
```
## Reviewer-Facing Principle
The paper should not read as:
```text
We used many methods, so the result is credible.
```
It should read as:
```text
We use one explicit screening rule. Each statistical diagnostic answers one necessary question about how that rule should be interpreted under unsupervised constraints.
```
This distinction is important for readability and reviewer trust.
This handoff continues the same framing principle established for Abstract + Introduction:
> *"One explicit screening rule. Each statistical diagnostic answers one necessary question about how that rule should be interpreted under unsupervised constraints."*
If only the Abstract and Introduction are revised, the manuscript will exhibit tonal mismatch when the reader drops into the body sections, which currently retain internal-version language and a defensive-accumulation framing for the supporting diagnostics. The body must be brought into the same register.
## Overall Assessment
The body sections are substantively defensible. The core empirical results — composition decomposition, anchor-based ICCR at three units, firm heterogeneity logistic, cross-firm hit matrix, alert-rate sensitivity — are presented in adequate quantitative detail with explicit unsupervised-validation caveats. The Discussion correctly distinguishes positive and negative anchors. The Conclusion lists eight methodological contributions that map onto the v4 contribution set.
The recurring weakness across §III / §IV / §V / §VI is *not* empirical. It is two intertwined narrative tendencies:
1. The body is still written as a *revision history* relative to v3.x in many paragraphs — "v4.0 strengthens", "v4.0 retroactively reframes", "v4.0 adopts", "inherited from v3.x", "the v3.x role of Firm A". This is internally honest but, in a submitted paper, signals to the reviewer that the authors are arguing with themselves.
2. The supporting diagnostics are repeatedly presented as a *collection* ("multi-tool framework", "ten-tool unsupervised-validation collection", "Table XXVII"). This collection framing is precisely the readability risk identified in the Abstract / Introduction handoff under "Readability Risk: Too Many Diagnostics Can Look Like Methodological Overbuilding." It currently appears unmodified in §III-M.
Recommended disposition: Minor Revision for narrative voice and structural emphasis, not for empirical weakness.
## Main Reviewer Concerns
### 1. The v3-to-v4 revision narrative is pervasive in the body and must be removed
The Abstract / Introduction handoff identified "v4.0 adopts", "Earlier work in this lineage", and "inherited Paper A v3.x five-way box rule" as patterns to strip. The same patterns occur throughout the body sections. Representative instances (not exhaustive):
- §III-G: "We earlier (v4.0 first draft) listed 'statistical multimodality at the accountant level' among the scope justifications..."
- §III-H opening: "v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing."
- §III-L.0 "Why retained without v4.0 recalibration" subsection title.
- §III-L.7 closing: "The operational classifier of §III-L.0 is the inherited v3.x five-way box rule..."
- §IV opening paragraph: "The v4.0 primary analyses (§IV-D through §IV-J) are scoped to..." and "§IV-A through §IV-C report inherited corpus-wide v3.x material; §IV-L (feature backbone ablation) is also inherited. §IV-M consolidates the v4-new anchor-based ICCR calibration tables."
- §IV-I: "v4.0 retroactively reframes the metric as inter-CPA pair-level coincidence rate (ICCR) rather than 'False Acceptance Rate'..."
- §IV-J: "v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset)."
- §IV-M opening: "v4-new empirical results that support..."
- §V-B: "A central empirical finding of v3.x was that per-signature similarity does not admit a clean two-mechanism mixture... v4.0 strengthens and extends this signature-level reading."
- §V-C: "In v4.0 we treat Firm A as a templated-end case study rather than as the calibration anchor for the operational threshold."
- §V-H opening: "The first nine are v4.0-specific; the last five are inherited from v3.20.0 §V-G and still apply to the v4.0 pipeline."
The remediation principle is the same as for the Introduction pivot paragraph. The final manuscript should describe the *final methodological state* and its rationale, not the trajectory by which that state was reached. Internal provenance — "this analysis is reproduced from v3.x §IV-F.1 / Script 28" — belongs in an Appendix B reproducibility table or supplementary material, not in the main narrative arc.
A safe rewriting heuristic: every sentence that begins with "v4.0", "v3.x", "v4-new", "inherited", or "earlier work" should be candidated for either deletion or rewriting in the present tense without version labels.
### 2. The "Ten-Tool Unsupervised-Validation Collection" frame must be retired
§III-M Table XXVII is the canonical instance of the readability risk that the Abstract / Introduction handoff flagged. The current frame is:
> "v4.0 adopts a multi-tool collection of partial-evidence diagnostics (Table XXVII), each with an explicitly disclosed assumption..."
> "No single tool in this collection provides ground-truth validation. Their conjunction constitutes the unsupervised validation ceiling that the v4.0 corpus admits."
This is exactly the language the Abstract / Introduction handoff identified as risky ("We used many methods, so the result is credible"). It reappears verbatim in the §VI Conclusion as "a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope" and "(8) a ten-tool unsupervised-validation collection (§III-M Table XXVII) that explicitly discloses each tool's untested assumption."
The recommended reframe is:
```text
The corpus does not admit standard supervised classifier validation: no signature-level
ground truth exists for hand-signed versus replicated classes, so False Rejection Rate,
sensitivity, recall, EER, ROC-AUC, precision, and positive predictive value are not
reportable. Each diagnostic in this section therefore addresses one specific
failure mode of an unsupervised screening classifier: composition artefacts,
sensitivity, or positive-anchor capture. Together they characterise the limits of
what can be claimed without signature-level ground truth.
```
Keep Table XXVII as a reference table if useful, but retitle it as "Diagnostic — failure mode addressed — disclosed assumption" rather than "Ten-tool collection". The word "ten" should not appear in the manuscript.
### 3. The §V-H Limitations list is correct but defensively ordered
§V-H lists fourteen limitations. The first one — "No signature-level ground truth; no true error rates reportable" — is the load-bearing limitation that everything else in v4.0 hinges on. The next two — "Inter-CPA negative-anchor assumption is partially violated" and "Scope" — are also major. The other eleven are real but secondary. The current presentation gives every item roughly equal visual weight as a flat list.
Recommended reorganisation:
- *Primary limitations (3 items):* (a) no signature-level ground truth, (b) inter-CPA negative-anchor assumption partially violated and firm-dependent, (c) Big-4 scope (full-dataset robustness is light).
- *Secondary limitations (4 items):* pixel-identity conservative subset; inherited rule components not separately v4-validated; deployed-rate excess not a true-positive rate; A1 pair-detectability stipulation.
- *Documented features rather than limitations (2 items):* K=3 hard-posterior composition sensitivity; no partner-level mechanism attribution.
This preserves the disclosures but signals to the reviewer which limitations carry the methodological weight and which are routine engineering caveats.
### 4. §III-F SSIM and pixel-comparison justification is too long for Methodology
§III-F currently dedicates roughly 15 lines (lines 112–127 in `paper_a_methodology_v3.md`) to justifying *why* SSIM and pixel-level comparison are not used as primary descriptors. The argument is correct (design-level mismatch between SSIM's natural-image quality factors and signature-crop artefacts; sub-pixel alignment fragility of pixel L1/L2), but in its current form it reads as a defensive response to an anticipated reviewer objection rather than as forward Methodology exposition.
Recommended reduction: collapse the argument to one short paragraph (3–4 sentences) and move the full design-level discussion to Appendix B. The Methodology body should state the choice (cosine on deep features + dHash) and briefly justify it (both stable across print-scan cycles by design), with the SSIM / pixel-comparison rebuttal in an appendix or a single citation footnote.
### 5. §IV's section opener still encodes provenance not appropriate to a Results section opener
The current §IV opener:
> "The v4.0 primary analyses (§IV-D through §IV-J) are scoped to the Big-4 sub-corpus (Firms A–D, n = 437 CPAs with n_sig ≥ 10, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. The §IV-K Full-Dataset Robustness section reports the full-dataset (686 CPAs) variant of the K=3 mixture + Paper A box-rule Spearman analysis as a cross-scope robustness check. §IV-A through §IV-C report inherited corpus-wide v3.x material; §IV-L (feature backbone ablation) is also inherited. §IV-M consolidates the v4-new anchor-based ICCR calibration tables."
Recommended replacement direction:
```text
Section IV reports the empirical results that calibrate and characterise the
operational classifier of §III-L. The primary analyses (§IV-D through §IV-J,
§IV-M) are scoped to the Big-4 sub-corpus (Firms A–D, 437 CPAs, 150,442
signatures); §IV-K reports a full-dataset (686 CPAs) robustness check on the K=3
mixture and per-CPA score-rank convergence; §IV-A through §IV-C and §IV-L
report the corpus-wide pipeline performance and feature-backbone ablation that
support the descriptor choice of §III-F.
```
This preserves the scope information while removing the v3-to-v4 inheritance labels and the "v4-new" prefix on §IV-M.
## Section-by-Section Comments
### §III-A Pipeline Overview
The pipeline diagram caption (lines 12–20) describes the classifier as "Firm A P7.5-anchored", which is residual v3 language that conflicts with the v4 reframe. v4 explicitly abandons Firm A as the calibration anchor in favour of inter-CPA ICCR (§III-H, §III-L). The figure caption should be updated to read "Anchor-Calibrated Five-Way Classifier" or similar, consistent with the §III-L title "Anchor-Based Threshold Calibration and Operational Classifier".
The §III-A second paragraph ("Throughout this paper we use the term non-hand-signed rather than 'digitally replicated'...") is well-positioned and should be kept.
### §III-B Data Collection
No issues identified.
### §III-C Signature Page Identification
No issues identified. The 98.8% VLM-YOLO agreement footnote is appropriately scoped ("we do not attempt to attribute the residual").
### §III-D Signature Detection
No issues identified.
### §III-E Feature Extraction
No issues identified.
### §III-F Dual-Method Similarity Descriptors
As noted in Main Concern 4: shorten the SSIM and pixel-comparison rebuttal to ~3–4 sentences and move full design-level argument to Appendix B.
### §III-G Unit of Analysis and Scope
This section is currently long and contains the "We earlier (v4.0 first draft) listed..." paragraph that explicitly walks through the methodological revision. That paragraph (currently at the end of §III-G, before the sample-size reconciliation) should be deleted. The four-item scope rationale list above it is good and should be kept.
The sample-size reconciliation paragraph (n=150,442 vs n=150,453) is technically necessary but is repeated almost verbatim in §IV-J as a parenthetical. Consider centralising it in §III-G with a forward reference, or in an Appendix B reproducibility note.
### §III-H Reference Populations
Replace the opening sentence:
> "v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing."
with:
```text
The calibration distinguishes two reference populations: Firm A as a within-Big-4
templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference
for internal-consistency checking.
```
The remainder of §III-H is well-written; the descriptive content is fine. The "v3.x's single-anchor framing" phrase is the only internal-version language that needs removal.
### §III-I Distributional Diagnostics
This is the strongest single section in the body. The four sub-diagnostics (dip test, mixture, BD/McCrary, composition decomposition) are tightly organised around one claim: the descriptor distribution does not provide a within-population bimodal antimode. The 2x2 factorial table at §III-I.4 is the empirical centrepiece of the v4 reframe.
One small narrative issue: §III-I.5 ("Conclusion") closes with "§III-L develops the v4.0 anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode." Remove "v4.0" — write "§III-L develops the anchor-based threshold calibration framework..."
### §III-J K=3 as a Descriptive Partition of Firm-Composition Contrast
The section header is clear and the framing ("Both fits are descriptive partitions... not within-population mechanism modes") is correct.
The current closing paragraph references "§III-K" for cross-checks between the box rule and K=3, but §III-K is the next subsection — this is a within-Methodology forward reference and reads slightly oddly. Consider rephrasing as "Cross-checks between the inherited five-way box rule and the K=3 partition appear in §III-K below."
### §III-K Convergent Internal-Consistency Checks
This section is well-handled. The opening caveat — "the three scores are not statistically independent measurements... so their high pairwise rank correlations are partly a mechanical consequence of shared inputs" — is exactly the methodological honesty the v4 reframe needs.
One narrative issue: §III-K.4 (positive-anchor miss rate) and §III-K.3 (LOOO reproducibility) are *summarised* in §III-K but also reported in detail in §III-J and §IV-G respectively. Consider whether the §III-K subsections add narrative value beyond cross-referencing — if not, §III-K could shrink to just the three-score Spearman block (§III-K.1) and a one-line cross-reference to LOOO and pixel-identity, with the detail living in §III-J and §IV-G / §IV-H.
### §III-L Anchor-Based Threshold Calibration and Operational Classifier
This section has the operating-rule text that the Abstract / Introduction handoff explicitly asked for ("Cosine > 0.95 AND dHash ≤ 5" etc., §III-L.0 item 1). Good.
The "Terminological note on FAR" at the end of §III-L.0 is explicit and reviewer-facing. Keep it.
Issues:
- "Why retained without v4.0 recalibration" — replace subsection title and contents to remove v4 references. The argument ("the inherited thresholds preserve continuity with prior reporting; §III-I.4 establishes that recalibration cannot be anchored on distributional antimodes; §III-L.1 confirms the cosine threshold's specificity at the inter-CPA pair level is reproducible") is intact without the v4 label.
- §III-L.7 ("K=3 not used as classifier") restates content already in §III-J. Consider deleting §III-L.7 and adding a one-line note inside §III-L.0 ("The K=3 mixture of §III-J is used as an accountant-level descriptive summary alongside the per-signature five-way classifier; K=3 hard-posterior membership is not used to assign signature-level or document-level labels in any result table").
### §III-M Validation Strategy and Limitations under Unsupervised Setting
Replace the framing as described in Main Concern 2. Keep the underlying disclosure content. Consider whether Table XXVII is best presented as a numbered methodological table or as an Appendix B reproducibility-and-assumption summary; in either case retitle and reframe so that "ten" does not appear and the unifying principle is "each diagnostic addresses one specific unsupervised failure mode."
The "What v4.0 does not claim" and "What v4.0 does claim" subsections at the end of §III-M are strong but the framing tag "v4.0 does not claim" / "v4.0 does claim" is the problematic version-language pattern. Replace with "Limits of the present analysis" and "Scope of the present analysis."
### §III-N Data Source and Firm Anonymization
No issues. The residual-identifiability disclosure is appropriately framed.
### §IV-A Experimental Setup
No issues identified.
### §IV-B Signature Detection Performance
No issues identified.
### §IV-C All-Pairs Intra-vs-Inter Class Distribution Analysis
The pairwise-non-independence caveat ("we therefore rely primarily on Cohen's d... A Cohen's d of 0.669 indicates a medium effect size, confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count") is well-positioned. Keep.
The Table V dip-test row labels are clear. The "v4-new composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration" should drop the "v4-new" — just write "...are tabulated in §IV-M below alongside the anchor-based ICCR calibration."
### §IV-E Big-4 K=2 / K=3 Mixture Fits
The "descriptive partition; not mechanism clusters per §III-J" labels in Tables VII and VIII are consistent with the v4 reframe. Keep. Drop "(v3.x role)" anywhere it appears.
### §IV-F Convergent Internal-Consistency Checks
This is duplicate Results-side reporting of §III-K. Consider whether the duplication adds value or is redundant. If both sections must remain, then §III-K should describe the *method* (three scores, why they are not independent) and §IV-F should report the *numbers*; currently §III-K reports both the method and the numbers, leaving §IV-F as a near-duplicate. Recommendation: trim §IV-F to just the per-firm summary table and the Cohen-kappa block, with the method description living in §III-K.
### §IV-G Leave-One-Firm-Out Reproducibility
Tables XII and XIII are well-organised. The interpretation paragraph following Table XIII correctly identifies the K=2 vs K=3 contrast (K=2 unstable; K=3 component shape reproducible but hard-posterior membership composition-sensitive). Keep.
### §IV-H Pixel-Identity Positive-Anchor Miss Rate
The "close to tautological" caveat is appropriately positioned. Keep. The reverse-anchor cut by prevalence calibration disclosure is also appropriate.
### §IV-I Inter-CPA Pair-Level Coincidence Rate
Replace:
> "v4.0 retroactively reframes the metric as inter-CPA pair-level coincidence rate (ICCR) rather than 'False Acceptance Rate' because..."
with:
```text
The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It
is the per-pair rate at which two signatures from different CPAs satisfy the
deployed rule. We do not label it as a False Acceptance Rate because (a) FAR has
a biometric-verification meaning that requires ground-truth negative labels, and
(b) the inter-CPA negative-anchor assumption is partially violated by within-firm
cross-CPA template-like collision structures (§III-L.4 cross-firm hit matrix).
The sample-size reconciliation parenthetical ("11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded") is repeated from §III-G. Centralise once and forward-reference.
"v4.0 does not change this aggregation rule; only the population over which it is computed changes" should be "The aggregation rule is the inherited worst-case rule (HC > MC > HSC > UN > LH); we apply it to the Big-4 sub-corpus."
The MC band capture-rate inheritance disclosure is appropriately framed but should drop the "v4.0 does not re-derive" phrasing; rewrite as "The moderate-confidence band's calibration and capture-rate evidence is reported in [Appendix B / v3.20.0 Tables IX, XI, XII, XII-B] and is not regenerated on the Big-4 subset."
### §IV-K Full-Dataset Robustness
The scope-of-§IV-K paragraph ("The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + Paper A operational-rule per-CPA less-replication-dominated rate analysis...") is defensively framed but the substance is correct. Consider shortening the "what we do not do" enumeration and emphasising the "what we do show" finding (K=3 + Paper A box-rule Spearman convergence preserved at full scope; ρ drift = 0.007).
### §IV-L Feature Backbone Comparison
This is inherited v3.x content. The "inherited unchanged from the v3.20.0 backbone-ablation table" framing is acceptable here because it is a methodological choice (do not re-run the ablation at the Big-4 scope) rather than a narrative pivot. Keep.
Drop the "v4-new" from the section heading. Recommended replacement heading: "Anchor-Based ICCR Calibration Results".
The section is empirically dense and methodologically sound. Tables XXI–XXVI cover the four units (per-comparison, per-signature, per-document, firm logistic + hit matrix) and the alert-rate sensitivity. Keep all tables. Drop "v4 new" / "v4-new" wherever it appears as a row qualifier or section subheading.
### §V-A Non-Hand-Signing Detection as a Distinct Problem
Keep. This section preserves the forgery distinction (Main concern #2 in the Abstract / Introduction handoff).
### §V-B Per-Signature Similarity is a Continuous Quality Spectrum
Replace the v3-to-v4 opening:
> "A central empirical finding of v3.x was that per-signature similarity does not admit a clean two-mechanism mixture: dip-test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 strengthens and extends this signature-level reading."
with:
```text
The Big-4 accountant-level descriptor distribution rejects unimodality on both
marginals at p < 5 × 10⁻⁴ (§IV-D Table V). The composition decomposition of
§III-I.4 (Scripts 39b–39e) shows this rejection is fully attributable to two
non-mechanistic sources...
```
This preserves the §V-B content while removing the v3.x lineage statement.
### §V-C Firm A as the Templated End of Big-4
Replace "In v4.0 we treat Firm A as a templated-end case study rather than as the calibration anchor for the operational threshold" with "We treat Firm A as a templated-end case study within the Big-4 sub-corpus rather than as the calibration anchor for the operational threshold."
Drop the "the v3.x role of Firm A" historical sub-clause that appears in §III-G item 2.
The Firm A byte-level pixel-identity reference (145 signatures across ~50 distinct partners; 35 byte-identical matches across fiscal years) is inherited from v3.x §IV-F.1 / Script 28 — this byte-level granularity is the strongest single piece of v3.x evidence that *should* survive into v4 because it directly supports the §V-C templated-end characterisation. Keep the reference but recast as "Byte-level decomposition of these 145 signatures (Appendix B) shows..." rather than the current "The additional v3.x finding... is inherited from v3.20.0 §IV-F.1 / Script 28..."
### §V-D K=2 / K=3 as Descriptive Firm-Compositional Partitions
Keep. The contrast between K=2 instability and K=3 reproducible-component-shape-but-composition-sensitive-membership is one of the cleanest narrative arcs in the paper.
Keep. The "not statistically independent" caveat is correctly positioned. The within-Big-4 non-Firm-A disagreement between Score 2 and Scores 1/3 is correctly disclosed.
### §V-F Anchor-Based Multi-Level Calibration
Keep. This is the v4 contribution. Drop any residual "v4" labels.
### §V-G Pixel-Identity as a Hard Positive Anchor; Inherited Inter-CPA Negative Anchor
Keep. The "positive necessary but not sufficient" caveat and the "specificity proxy under a partially-violated assumption" framing are exactly right.
Drop "Inherited" from the §V-G section heading — the heading currently reads "Pixel-Identity as a Hard Positive Anchor; Inherited Inter-CPA Negative Anchor Reframed as Coincidence Rate", which encodes the v3-to-v4 history in the section title itself. Recommended: "Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor".
### §V-H Limitations
Reorganise as described in Main Concern 3: primary (3) / secondary (4) / documented features (2) / inherited engineering (5).
Drop "inherited from v3.20.0 §V-G" qualifiers — the limitation either applies to the pipeline or it does not; the version source is reproducibility metadata that belongs in Appendix B.
### §VI Conclusion
Replace the opening framing:
> "We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope."
with:
```text
We present a fully automated pipeline for detecting non-hand-signed CPA
signatures in Taiwan-listed financial audit reports, together with an
anchor-calibrated screening framework that characterises the pipeline's
operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised
assumptions.
```
The eight numbered contributions are content-correct but presented in flat-list form. Consider grouping into three thematic clusters:
- *Why the descriptor distribution does not anchor a natural threshold* (contributions 1, 5).
- *How the deployed rule is calibrated under unsupervised constraints* (contributions 2, 6, 7).
- *What the calibration reveals about firm heterogeneity* (contributions 3, 4).
- *Methodological positioning* (contribution 8 — but reframe per Main Concern 2).
The Future Work block (four items) is fine; consider trimming the second item ("a separate study to distinguish deliberate template sharing from passive firm-level production artefacts") which is the only item that involves additional fieldwork rather than methodological extension.
## Recommended Minimum Patch List
1. Strip v3-to-v4 revision language throughout §III, §IV, §V, §VI. Mechanical pass on "v4.0", "v3.x", "v4-new", "inherited", "earlier work in this lineage". Replace with present-tense descriptions of the final methodological choice and forward references to Appendix B for reproducibility provenance.
2. Retire the "ten-tool unsupervised-validation collection" framing in §III-M and the "multi-tool framework" phrase in §VI Conclusion. Replace with "each diagnostic addresses one specific unsupervised failure mode" framing. Retitle Table XXVII so that "ten" does not appear.
4. Shorten §III-F SSIM and pixel-comparison rebuttal to ~3–4 sentences; move design-level discussion to Appendix B.
5. Update Figure 1 caption (currently in §III-A commented HTML) to remove "Firm A P7.5-anchored" residual v3 language.
6. Rewrite the §IV opener paragraph to remove the inherited-vs-v4-new section labels.
7. Rewrite the §IV-I opening paragraph to remove "v4.0 retroactively reframes the metric...".
8. Drop "v4-new" from the §IV-M section heading; replace with "Anchor-Based ICCR Calibration Results".
9. Centralise the n=150,442 vs n=150,453 sample-size reconciliation in §III-G; remove the duplicate parenthetical from §IV-J.
10. Consider trimming §IV-F to numbers-only (per-firm summary table + Cohen kappa), with the method description living in §III-K.
11. Consider deleting §III-L.7 (duplicate of §III-J K=3-not-used-as-classifier claim) and adding a one-line note in §III-L.0.
## Reviewer Bottom Line
The body sections of v4 are empirically defensible and methodologically internally consistent. The required revisions are stylistic and structural rather than substantive:
- Remove the v3-to-v4 revision narrative from the present-tense exposition.
- Reframe the supporting diagnostics from "ten-tool collection" to "each diagnostic addresses one unsupervised failure mode."
- Reorganise the limitations list so that the load-bearing limitations are visibly more prominent than the routine engineering caveats.
- Move provenance and reproducibility detail to Appendix B / supplementary material.
These changes preserve every quantitative claim and every disclosure currently in the manuscript. They tighten the narrative voice so that the reader experiences the v4 methodological choices as the final state of the design rather than as an ongoing argument with an earlier version. Combined with the Abstract / Introduction patches in the companion handoff, the manuscript should read as a single coherent submission rather than as a layered revision document.
## Additional Cross-Cutting Observation: Script Provenance in Tables
Across §III, §IV, §V, and the Conclusion, tables are annotated with `(Source: Script 32 / 34 / 35 / 38 / 40b / 43 / 44 / 45 / 46)` parentheticals. This is appropriate for reproducibility but heavy at the visual level — every table footer in §IV-D through §IV-M carries one of these annotations.
Recommended consolidation: move the script-to-table mapping to a single Appendix B reproducibility table ("Table B-1. Script-to-table provenance map"), and replace the inline annotations with a single one-line note at the start of §IV ("Script-to-table provenance is summarised in Appendix B Table B-1; raw outputs are available in the supplementary repository").
This is a minor change but it materially reduces the visual signal that the paper is built on a large number of separate scripts.
## Closing Note
This review covers the body sections only. The Abstract / Introduction handoff (`paper/review_handoff_abstract_intro_20260515.md`) covers the front matter. The two handoffs should be applied together; applying only one of them will produce tonal mismatch as the reader moves from the front matter into the body.
The References and the Appendix have not been reviewed and may benefit from a separate handoff if the Appendix is to absorb the SSIM / pixel-comparison material and the reproducibility-provenance table recommended above.
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.