Paper A v4.2: re-anchor primary calibration to clean BCD 2013-2019 baseline

- Restrict the calibration negative anchor to Firms B/C/D, fiscal years
  2013-2019 (pre-electronic-signature hand-signing period); B/C/D adopted
  e-signing post-2020 at staggered times, so 2013-2019 is the construct-clean
  baseline. Firm A scored across its full 2013-2023 record against it.
- New locked numbers (codex-audited, Scripts 54/55): per-comparison HC floor
  0.000010; per-signature HC floor 0.0059 [boot 0.0045-0.0073]; per-document
  HC 0.0117 / HC+MC 0.1753; per-firm HC+MC B 0.162 / C 0.225 / D 0.089.
  Firm A observed 0.817 = ~139x the clean floor (was ~70x on all-period BCD);
  Firm A out-of-sample vs clean pool 0.0001 (below floor -> never resembles
  genuine hand-signing). BCD 2020+ robustness: per-sig 0.0105, per-comparison
  0.000036 (~2x pre-2020) quantifies the e-signing contamination.
- Propagated through abstract / Sec. I / III-L / IV-M / V / conclusion;
  0.837 crossover kept corpus-wide; ABCD retained as contamination comparison.
- Grounded the 2013-2019 choice on data (floor drift) + e-sign-adoption
  background, not on in-text interview claims (double-blind).
- Add Scripts 54 (temporal floor stability) and 55 (BCD 2013-2019 primary
  calibration + Firm A scoring).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-04 21:30:06 +08:00
parent 3c7fcc010f
commit 1eb323e959
4 changed files with 287 additions and 42 deletions
+44 -42
View File
@@ -7,7 +7,7 @@ author: "[Authors removed for double-blind review]"
<!-- IEEE Access target: <= 250 words, single paragraph -->
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (20132023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a normative non-Firm-A baseline (Firms B/C/D), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a very low inter-CPA coincidence rate (per-comparison ICCR $0.000018$; per-signature $0.012$; per-document $0.023$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.19$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A coincides at baseline rate cross-firm yet fires the rule on $82\%$ of its own signatures ($\sim 70\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (20132023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a clean pre-e-signature baseline (Firms B/C/D, 20132019), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a very low inter-CPA coincidence rate (per-comparison ICCR $0.000010$; per-signature $0.006$; per-document $0.012$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A never coincides cross-firm yet fires on $82\%$ of its own ($\sim 139\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.
<!-- Word count: 250 (v4.1 BCD-baseline reframe) -->
@@ -32,9 +32,9 @@ We are deliberate about what the system claims. The operating thresholds are *op
A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-I.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.
In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D); §III-L.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000018$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0116$ (CPA-block bootstrap 95% $[0.0094, 0.0141]$), and per-document ICCR $= 0.023$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.19$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide.
In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a clean pre-e-signature baseline (Firms B/C/D, 20132019); §III-L.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000010$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0059$ (CPA-block bootstrap 95% $[0.0045, 0.0073]$), and per-document ICCR $= 0.012$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.175$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide.
With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR $0.0116$), the deployed rule fires on each firm's *actual* same-CPA pools far above the inter-CPA coincidence floor: Firm A at $0.82$ ($\sim 70\times$ floor), Firms B/C/D at $0.24$$0.35$ ($\sim 21$$30\times$). Firm A scored against the clean baseline coincides at only $0.0102$ — essentially the floor itself — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within $\sim 3.5\times$ (odds ratio $1.73$ for B, $0.49$ for C). Under the deployed any-pair rule, within-firm collision concentration is a *universal* Big-4 pattern — $98.8\%$ at Firm A and, on the clean BCD pool, $89$$97\%$ at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and the advisory moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels.
With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR $0.0059$), the deployed rule fires on each firm's *actual* same-CPA pools far above the inter-CPA coincidence floor: Firm A at $0.82$ ($\sim 139\times$ floor), Firms B/C/D at $0.24$$0.35$ ($\sim 40$$59\times$). Firm A scored against the clean 20132019 baseline coincides essentially never ($0.0001$, below the clean-baseline floor itself) — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within $\sim 3.5\times$ (odds ratio $1.73$ for B, $0.49$ for C). Under the deployed any-pair rule, within-firm collision concentration is a *universal* Big-4 pattern — $98.8\%$ at Firm A and, on the clean BCD pool, $89$$97\%$ at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and the advisory moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels.
Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-J's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.
@@ -50,9 +50,9 @@ The contributions of this paper are:
4. **Composition decomposition does not support the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
5. **Anchor-based multi-level ICCR calibration on a normative non-Firm-A baseline.** We characterise the deployed high-confidence (HC) sub-rule at three units of analysis against a clean Firms-B/C/D negative anchor (Firm A held out as an out-of-sample target to avoid circularity): per-comparison ICCR $0.000018$, pool-normalised per-signature ICCR $0.0116$, and per-document ICCR $0.023$ — each roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$). The moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.19$ per-document coincidence rate on the clean baseline and is repositioned as a low-specificity advisory tier rather than a confident non-hand-signed label. Because the deployed thresholds are operator-tunable, the contribution is this label-free calibration methodology — a principled way to choose and characterise a screening operating point and the specificity it yields — rather than any specific threshold. We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
5. **Anchor-based multi-level ICCR calibration on a normative non-Firm-A baseline.** We characterise the deployed high-confidence (HC) sub-rule at three units of analysis against a clean Firms-B/C/D negative anchor (Firm A held out as an out-of-sample target to avoid circularity): per-comparison ICCR $0.000010$, pool-normalised per-signature ICCR $0.0059$, and per-document ICCR $0.012$ — each roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$). The moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate on the clean baseline and is repositioned as a low-specificity advisory tier rather than a confident non-hand-signed label. Because the deployed thresholds are operator-tunable, the contribution is this label-free calibration methodology — a principled way to choose and characterise a screening operating point and the specificity it yields — rather than any specific threshold. We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
6. **Firm A as a singular out-of-sample extreme; universal within-firm collision concentration.** Against the clean BCD floor (per-signature HC ICCR $0.0116$), the deployed rule fires on each firm's own pools far above the inter-CPA coincidence floor (Firm A $0.82$, $\sim 70\times$; Firms B/C/D $0.24$$0.35$, $\sim 21$$30\times$), while Firm A scored cross-firm against the baseline coincides only at the floor ($0.0102$) — localising the repeatability signal to within-firm comparisons. Two logistic regressions (full-Big-4 with Firm A reference: odds ratios $0.053$/$0.010$/$0.027$ for B/C/D; BCD-only with Firm D reference: residual spread within $\sim 3.5\times$, odds ratios $1.73$/$0.49$ for B/C) show Firm A is the lone outlier while Firms B/C/D form an internally homogeneous baseline. Within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$$97\%$ at Firms B/C/D on the clean pool — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).
6. **Firm A as a singular out-of-sample extreme; universal within-firm collision concentration.** Against the clean BCD floor (per-signature HC ICCR $0.0059$), the deployed rule fires on each firm's own pools far above the inter-CPA coincidence floor (Firm A $0.82$, $\sim 139\times$; Firms B/C/D $0.24$$0.35$, $\sim 40$$59\times$), while Firm A scored cross-firm against the clean 20132019 baseline coincides essentially never cross-firm ($0.0001$, below the floor itself) — localising the repeatability signal to within-firm comparisons. Two logistic regressions (full-Big-4 with Firm A reference: odds ratios $0.053$/$0.010$/$0.027$ for B/C/D; BCD-only with Firm D reference: residual spread within $\sim 3.5\times$, odds ratios $1.73$/$0.49$ for B/C) show Firm A is the lone outlier while Firms B/C/D form an internally homogeneous baseline. Within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$$97\%$ at Firms B/C/D on the clean pool — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).
7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.
@@ -396,7 +396,7 @@ $\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical pre
Firm A accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24$ of the $40$ C1-assigned CPAs. The K=3 partition is therefore well-described as a firm-compositional decomposition: C3 is essentially "Firm A and any non-Firm-A CPA whose mean descriptors happen to land in the high-cos / low-dHash corner"; C1 is essentially "non-Firm-A CPAs whose mean descriptors land in the low-cos / high-dHash corner." The composition contrast that K=3 captures at the accountant level reappears at the deployment level in the cross-firm hit matrix of §III-L.4 (Script 44): under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms). The K=3 partition and the cross-firm hit matrix therefore describe the same underlying firm-compositional structure at two different units of analysis.
**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.023$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label.
**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.012$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label.
We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the operational classifier:
@@ -458,6 +458,8 @@ The operational classifier defined in §III-H.1 is calibrated by characterising
**Choice of negative-anchor pool.** A negative anchor must approximate a population in which the rule should *not* fire — independent CPAs whose signatures coincide only by chance. §III-L.4 shows that under the deployed rule, $98.8\%$ of Firm A's inter-CPA collisions fall on other Firm-A CPAs, and byte-level evidence (§IV-H, supplementary materials) confirms image-level reuse across $\sim 50$ Firm-A partners. Including Firm A in the negative-anchor pool therefore loads the "coincidence" rate with structured within-firm collisions, not chance coincidence — a circularity, since that collision structure is the phenomenon the rule targets. We adopt **Firms B/C/D (BCD) as the normative negative-anchor baseline** and report the all-Big-4 (ABCD) pool only as a contamination-comparison scope; Firm A enters as an **out-of-sample target** (§III-L.4), not as a calibration input. A still-broader baseline adding the eligible non-Big-4 firms (BCD+non-Big-4) is reported as a robustness scope.
We further restrict the calibration baseline temporally to **fiscal years 20132019**. Taiwan audit firms progressively adopted electronic-signature systems after 2020 (with firm-specific timing), so the pre-2020 BCD period is the construct-clean hand-signing baseline; the post-2020 period mixes genuine hand-signing with legitimate e-signing and is therefore not a clean negative anchor. The data corroborate this: the BCD per-comparison HC floor rises from $0.000010$ (20132019) to $0.000036$ (20202023), and the per-signature floor from $0.0059$ to $0.0105$ — the gradual, non-stepped rise being consistent with staggered per-firm adoption. We therefore calibrate on BCD 20132019 and report BCD 20202023 only as a robustness scope (it documents the e-signing contamination rather than the clean floor). Firm A is scored across its full 20132023 record against this clean threshold.
**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine and structural ($\text{dHash} \leq 5$) thresholds' specificity-proxy behaviour at the inter-CPA pair level on the BCD baseline. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for those sub-bands. The cosine LH/UN crossover $\text{cos} = 0.837$ is a corpus-wide descriptor-space landmark (intra- vs inter-CPA cosine KDE crossover, §IV-C) robust to baseline choice — it moves by at most $0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes ($0.8367$, $0.8489$, $0.8302$) — so we retain the corpus-wide value and do not re-anchor it on BCD.
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:
@@ -478,9 +480,9 @@ We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from the baseline
|---|---|---|---|
| Cosine $> 0.95$ | $0.00026$ | $0.00060$ | $0.00014$ |
| dHash $\leq 5$ | $0.00037$ | $0.00129$ | $0.00034$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000018}$ $\;[0.000009, 0.000034]$ | $0.000140$ $\;[0.000111, 0.000177]$ | $0.000004$ $\;[0.000001, 0.000015]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000010}$ $\;[0.000004, 0.000023]$ | $0.000140$ $\;[0.000111, 0.000177]$ | $0.000004$ $\;[0.000001, 0.000015]$ |
On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule is $0.000018$ — roughly $8\times$ lower than the all-Big-4 rate ($0.000140$), and lower still when the non-Big-4 firms are added ($0.000004$). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-L.4): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms ($9$ of $5 \times 10^5$ pairs on the BCD pool), so we report the Wilson interval and treat the per-comparison joint rate as an order-of-magnitude specificity proxy rather than a precisely estimated rate; the well-powered per-signature and per-document units (§III-L.2, §III-L.3) carry the primary calibration weight. The all-Big-4 cos $> 0.95$ row remains consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I. On the all-Big-4 sample the conditional rate ICCR(dHash $\leq 5\mid$ cos $> 0.95$) is $0.234$, indicating that the structural dimension adds substantial per-comparison specificity beyond the cosine gate.
On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule is $0.000010$ — roughly $8\times$ lower than the all-Big-4 rate ($0.000140$), and lower still when the non-Big-4 firms are added ($0.000004$). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-L.4): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms ($5$ of $5 \times 10^5$ pairs on the BCD pool), so we report the Wilson interval and treat the per-comparison joint rate as an order-of-magnitude specificity proxy rather than a precisely estimated rate; the well-powered per-signature and per-document units (§III-L.2, §III-L.3) carry the primary calibration weight. The all-Big-4 cos $> 0.95$ row remains consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I. On the all-Big-4 sample the conditional rate ICCR(dHash $\leq 5\mid$ cos $> 0.95$) is $0.234$, indicating that the structural dimension adds substantial per-comparison specificity beyond the cosine gate.
The per-comparison rate does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.
@@ -492,11 +494,11 @@ The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \
| Baseline pool | Per-signature HC ICCR | CPA-bootstrap 95% CI |
|---|---|---|
| **BCD (primary)** | $\mathbf{0.0116}$ | $[0.0094, 0.0141]$ |
| **BCD (primary)** | $\mathbf{0.0059}$ | $[0.0045, 0.0073]$ |
| All-Big-4 (contamination comparison) | $0.1102$ | $[0.0908, 0.1330]$ |
| BCD+non-Big-4 | $0.0083$ | $[0.0066, 0.0099]$ |
On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is $0.0116$ — an order of magnitude below the all-Big-4 figure of $0.1102$. The all-Big-4 figure is dominated by Firm A, whose signatures coincide with other Firm-A signatures at high rate; once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is $\approx 1.2\%$. This is the specificity-proxy floor against which the deployed HC rule operates. The rate increases with pool size (the rule takes extrema over $n_{\text{pool}}$ candidates), consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence; the within-firm violation of that independence (§III-L.4) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash $\leq 3$) by inverting the ICCR curve, with the unsupervised-setting caveats of §III-M.
On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is $0.0059$ — an order of magnitude below the all-Big-4 figure of $0.1102$. The all-Big-4 figure is dominated by Firm A, whose signatures coincide with other Firm-A signatures at high rate; once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is $\approx 0.59\%$. This is the specificity-proxy floor against which the deployed HC rule operates. The rate increases with pool size (the rule takes extrema over $n_{\text{pool}}$ candidates), consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence; the within-firm violation of that independence (§III-L.4) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash $\leq 3$) by inverting the ICCR curve, with the unsupervised-setting caveats of §III-M.
### L.3. Document-level inter-CPA proxy alert rate (Script 52)
@@ -504,31 +506,31 @@ Each document is classified by the worst-case rule over its constituent signatur
| Alarm definition | BCD baseline (primary) | All-Big-4 | BCD+non-Big-4 |
|---|---|---|---|
| HC (dHash $\leq 5$) | $\mathbf{0.0226}$ | $0.1797$ | $0.0163$ |
| HC + MC (dHash $\leq 15$) | $0.1905$ | $0.3375$ | $0.1467$ |
| HC (dHash $\leq 5$) | $\mathbf{0.0117}$ | $0.1797$ | $0.0163$ |
| HC + MC (dHash $\leq 15$) | $0.1753$ | $0.3375$ | $0.1467$ |
**The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation.** On the BCD baseline the per-document HC rate is $0.0226$ ($\sim 8\times$ below the all-Big-4 $0.1797$), confirming that the HC (dHash $\leq 5$) rule has a very low inter-CPA coincidence rate: a clean inter-CPA baseline almost never produces an HC document. The HC+MC (dHash $\leq 15$) rate, by contrast, remains high on the clean baseline — $0.1905$ per document — and the per-firm breakdown shows it does *not* fall when Firm A is removed. **We therefore treat the HC sub-rule (dHash $\leq 5$) as the specificity-anchored operating point and reposition the MC band ($5 < \text{dHash} \leq 15$) as a low-specificity advisory tier rather than a confident non-hand-signed screening label.** Roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement, so an HC+MC alarm is not by itself evidence of image reproduction.
**The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation.** On the BCD baseline the per-document HC rate is $0.0117$ ($\sim 8\times$ below the all-Big-4 $0.1797$), confirming that the HC (dHash $\leq 5$) rule has a very low inter-CPA coincidence rate: a clean inter-CPA baseline almost never produces an HC document. The HC+MC (dHash $\leq 15$) rate, by contrast, remains high on the clean baseline — $0.1753$ per document — and the per-firm breakdown shows it does *not* fall when Firm A is removed. **We therefore treat the HC sub-rule (dHash $\leq 5$) as the specificity-anchored operating point and reposition the MC band ($5 < \text{dHash} \leq 15$) as a low-specificity advisory tier rather than a confident non-hand-signed screening label.** Roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement, so an HC+MC alarm is not by itself evidence of image reproduction.
Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $0.218$, Firm D $0.114$ — slightly *higher* than under the all-Big-4 pool (B $0.160$, C $0.163$, D $0.088$), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash $\leq 15$ scale. This is direct evidence that the MC band carries little inter-CPA specificity even among normative firms, corroborating its demotion to an advisory tier. The positioning of the operational system as a **screening framework with human-in-the-loop review**, not an autonomous forensic classifier, follows directly (§III-M).
Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$ — slightly *higher* than under the all-Big-4 pool (B $0.160$, C $0.163$, D $0.088$), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash $\leq 15$ scale. This is direct evidence that the MC band carries little inter-CPA specificity even among normative firms, corroborating its demotion to an advisory tier. The positioning of the operational system as a **screening framework with human-in-the-loop review**, not an autonomous forensic classifier, follows directly (§III-M).
### L.4. Firm A as an out-of-sample target; firm heterogeneity (Scripts 49, 52, 44, 53)
With the calibration anchored on BCD, Firm A is scored as an out-of-sample target against the clean baseline. Three complementary readings establish that Firm A is the extreme case while keeping the inferential limits explicit.
**(i) Observed deployed rate versus the clean floor.** The deployed HC rule fires on each firm's *actual* same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of $0.0116$ (§III-L.2):
**(i) Observed deployed rate versus the clean floor.** The deployed HC rule fires on each firm's *actual* same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of $0.0059$ (§III-L.2):
| Firm | Observed per-signature HC rate | Multiple of BCD floor |
|---|---|---|
| Firm A | $0.817$ | $\sim 70\times$ |
| Firm B | $0.346$ | $\sim 30\times$ |
| Firm C | $0.238$ | $\sim 21\times$ |
| Firm D | $0.245$ | $\sim 21\times$ |
| Firm A | $0.817$ | $\sim 139\times$ |
| Firm B | $0.346$ | $\sim 59\times$ |
| Firm C | $0.238$ | $\sim 40\times$ |
| Firm D | $0.245$ | $\sim 42\times$ |
All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at $\sim 70\times$, roughly $2.4$$3.4\times$ the other Big-4 firms in absolute rate. We emphasise (and develop in §III-M) that this excess is *not* a true-positive rate: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above it. The multiple is a framework-discriminative observation, not a measure of image reproduction.
All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at $\sim 139\times$, roughly $2.4$$3.4\times$ the other Big-4 firms in absolute rate. We emphasise (and develop in §III-M) that this excess is *not* a true-positive rate: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above it. The multiple is a framework-discriminative observation, not a measure of image reproduction.
**(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm.** Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is $0.0102$ — essentially identical to the BCD-internal floor of $0.0116$. Firm A's signatures are thus unremarkable when matched against *other firms'* signatures; the entire elevation in Firm A's observed rate ($0.817$) arises from matches against *other Firm-A* signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness.
**(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm.** Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is $0.0001$ — below even the BCD-internal floor of $0.0059$, i.e. Firm A's signatures essentially never resemble genuine 20132019 hand-signing. Firm A's signatures are thus unremarkable when matched against *other firms'* signatures; the entire elevation in Firm A's observed rate ($0.817$) arises from matches against *other Firm-A* signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness.
**(iii) Firm-effect regressions: Firm A singular, baseline homogeneous.** Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D), with log-pool-size odds ratio $4.01$ — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0116$), the residual firm spread collapses to within a factor of $\sim 3.5$: odds ratios $1.73$ (B), $0.49$ (C), log-pool-size odds ratio $3.29$. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.)
**(iii) Firm-effect regressions: Firm A singular, baseline homogeneous.** Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D), with log-pool-size odds ratio $4.01$ — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within a factor of $\sim 3.5$: odds ratios $1.73$ (B), $0.49$ (C), log-pool-size odds ratio $3.29$. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.)
**Cross-firm hit matrix: within-firm concentration is a universal Big-4 pattern.** Under the deployed any-pair rule, inter-CPA collisions concentrate within the source firm at every Big-4 firm. On the full Big-4 candidate pool, within-firm concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (same-pair $97.0$$99.96\%$; Table XXV). Restricting the candidate pool to the BCD baseline (Script 53) *raises* the within-firm concentration for B/C/D to $89.2$$97.2\%$ any-pair (Firm B $97.2\%$, Firm C $92.3\%$, Firm D $89.2\%$) and $98.5$$100\%$ same-pair — higher than on the full pool, because on the full pool some B/C/D collisions landed on Firm A's generically copy-like signatures; removing Firm A leaves each firm's collisions concentrated within itself. Within-firm collision concentration is therefore a universal Big-4 structural pattern, not a Firm-A peculiarity: Firm A is extreme in the *rate* at which the rule fires (reading (i)), but all four firms exhibit the same within-firm collision signature.
@@ -548,10 +550,10 @@ We interpret the deployed HC thresholds as **specificity-anchored operating poin
The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the deployed HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$).
Read against the **normative BCD specificity-proxy floor** rather than the contaminated all-Big-4 rate, the observed-deployed excess is larger: the per-signature observed rate is $\sim 43\times$ the BCD floor ($0.4958$ vs $0.0116$), and the per-document HC observed rate is $\sim 28\times$ the BCD floor ($0.6228$ vs $0.0226$):
Read against the **normative BCD specificity-proxy floor** rather than the contaminated all-Big-4 rate, the observed-deployed excess is larger: the per-signature observed rate is $\sim 84\times$ the BCD floor ($0.4958$ vs $0.0059$), and the per-document HC observed rate is $\sim 53\times$ the BCD floor ($0.6228$ vs $0.0117$):
- Per-signature: $0.4958 - 0.0116 = 0.4842$ ($48.4$ pp excess over the clean floor)
- Per-document HC: $0.6228 - 0.0226 = 0.6002$ ($60.0$ pp excess over the clean floor)
- Per-signature: $0.4958 - 0.0059 = 0.4899$ ($49.0$ pp excess over the clean floor)
- Per-document HC: $0.6228 - 0.0117 = 0.6111$ ($61.1$ pp excess over the clean floor)
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits are developed in §III-M. The excess is best read as an *observed same-CPA-pool excess over the normative inter-CPA floor* — a quantity that far exceeds what random inter-CPA candidate replacement among normative firms would produce — whose mechanism is not identifiable from descriptor-only data (§III-M). Anchoring the floor on the clean BCD baseline sharpens this contrast (the all-Big-4 floor would understate it by absorbing Firm A's reuse), while leaving the §III-M caveat — that the floor is an inter-CPA coincidence rate, not an intra-CPA genuine-hand-signing rate — fully in force; we do not attribute the excess to within-CPA handwriting repeatability or to image replication without further evidence.
@@ -756,7 +758,7 @@ This section reports the firm-level cross-validation evidence motivating §III-J
| Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp |
| Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp |
(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).
(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.012$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).
## H. Pixel-Identity Positive-Anchor Miss Rate
@@ -832,7 +834,7 @@ This section reports the five-way per-signature + document-level worst-case clas
(Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)
The five-way **moderate-confidence advisory** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains the threshold provenance of its prior calibration (supplementary materials), but §III-L.3 **supersedes its claim strength**: on the normative BCD baseline this band carries a $\sim 0.19$ per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is **not separately re-characterised by Scripts 3840**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
The five-way **moderate-confidence advisory** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains the threshold provenance of its prior calibration (supplementary materials), but §III-L.3 **supersedes its claim strength**: on the normative BCD baseline this band carries a $\sim 0.175$ per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is **not separately re-characterised by Scripts 3840**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
**Table XVII.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.
@@ -934,9 +936,9 @@ This section consolidates the empirical results that support the §III-L anchor-
|---|---|---|---|
| cos $> 0.95$ | $0.00026$ | $0.00060$ | $0.00014$ |
| dHash $\leq 5$ | $0.00037$ | $0.00129$ | $0.00034$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000018}$ | $0.000140$ | $0.000004$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000010}$ | $0.000140$ | $0.000004$ |
BCD joint Wilson 95% $[0.000009, 0.000034]$ ($9$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-L.4). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$).
BCD joint Wilson 95% $[0.000004, 0.000023]$ ($5$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-L.4). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$).
### M.3 Pool-normalised per-signature ICCR (Script 52)
@@ -944,7 +946,7 @@ BCD joint Wilson 95% $[0.000009, 0.000034]$ ($9$ of $5 \times 10^5$ pairs); all-
| Baseline pool | Per-signature HC ICCR | CPA-bootstrap 95% CI |
|---|---|---|
| BCD (primary) | $\mathbf{0.0116}$ | $[0.0094, 0.0141]$ |
| BCD (primary) | $\mathbf{0.0059}$ | $[0.0045, 0.0073]$ |
| All-Big-4 (contamination comparison) | $0.1102$ | $[0.0908, 0.1330]$ |
| BCD+non-Big-4 | $0.0083$ | $[0.0066, 0.0099]$ |
@@ -956,10 +958,10 @@ The BCD floor is an order of magnitude below the all-Big-4 figure, which is domi
| Alarm definition | BCD (primary) | All-Big-4 | BCD+non-Big-4 |
|---|---|---|---|
| HC (dHash $\leq 5$) | $\mathbf{0.0226}$ | $0.1797$ | $0.0163$ |
| HC + MC (dHash $\leq 15$) | $0.1905$ | $0.3375$ | $0.1467$ |
| HC (dHash $\leq 5$) | $\mathbf{0.0117}$ | $0.1797$ | $0.0163$ |
| HC + MC (dHash $\leq 15$) | $0.1753$ | $0.3375$ | $0.1467$ |
Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $0.218$, Firm D $0.114$ (all-Big-4 pool: Firm A $0.620$, Firm B $0.160$, Firm C $0.163$, Firm D $0.088$). The HC band collapses by $\sim 8\times$ when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash $\leq 15$ adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-L.3).
Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$ (all-Big-4 pool: Firm A $0.620$, Firm B $0.160$, Firm C $0.163$, Firm D $0.088$). The HC band collapses by $\sim 8\times$ when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash $\leq 15$ adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-L.3).
### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)
@@ -972,7 +974,7 @@ Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $
| Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size |
On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0116$), the residual firm spread collapses to within $\sim 3.5\times$ — odds ratios $1.73$ (Firm B), $0.49$ (Firm C), log-pool-size $3.29$ — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-L.4).
On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within $\sim 3.5\times$ — odds ratios $1.73$ (Firm B), $0.49$ (Firm C), log-pool-size $3.29$ — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-L.4).
Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$$0.0358$ while Firm A ranges $0.0541$$0.5958$. The firm gap survives within matched pool sizes.
@@ -997,7 +999,7 @@ Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\
| dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
| dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. Against the normative BCD floor (per-signature $0.0116$; per-document HC $0.0226$), the observed same-CPA-pool excess is $0.4842$ ($48.4$ pp, $\sim 43\times$) per-signature and $0.6002$ ($60.0$ pp, $\sim 28\times$) per-document; this excess is reported under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. Against the normative BCD floor (per-signature $0.0059$; per-document HC $0.0117$), the observed same-CPA-pool excess is $0.4899$ ($49.0$ pp, $\sim 84\times$) per-signature and $0.6111$ ($61.1$ pp, $\sim 53\times$) per-document; this excess is reported under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.
# V. Discussion
@@ -1014,7 +1016,7 @@ The Big-4 accountant-level distribution rejects unimodality on both marginals (
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.
We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-L.0). Three readings (§III-L.4) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide at only $0.0102$ — essentially the BCD floor ($0.0116$) — so Firm A is unremarkable *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 70\times$ the clean floor, versus $\sim 21$$30\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$$97\%$ at Firms B/C/D, with same-pair concentration $97$$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-L.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish.
We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-L.0). Three readings (§III-L.4) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide essentially never ($0.0001$, below the BCD floor of $0.0059$) — so Firm A is unremarkable, indeed sub-baseline, *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 139\times$ the clean floor, versus $\sim 40$$59\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$$97\%$ at Firms B/C/D, with same-pair concentration $97$$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-L.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish.
## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
@@ -1030,7 +1032,7 @@ The deployed HC sub-rule's specificity-proxy behaviour is characterised at three
## G. Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor
The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve $0\%$ positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the HC region). The complementary negative anchor is the §III-L.1 per-comparison ICCR on the normative BCD baseline ($0.000018$); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-L.0).
The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve $0\%$ positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the HC region). The complementary negative anchor is the §III-L.1 per-comparison ICCR on the normative BCD baseline ($0.000010$); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-L.0).
## H. Limitations
@@ -1040,7 +1042,7 @@ Several limitations should be transparent. We group them into primary methodolog
*No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M).
*Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline.* The cross-firm hit matrix of §III-L.4 shows that under the deployed rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$97.2\%$ at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-L.0); on this baseline the per-comparison HC rate falls from $0.00014$ to $0.000018$ and the per-signature HC rate from $0.1102$ to $0.0116$. A residual caveat survives even on the clean baseline: the BCD floor is an *inter-CPA coincidence* rate, not an *intra-CPA genuine-hand-signing* rate, so the observed-versus-floor excess (§III-L.6) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities.
*Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline.* The cross-firm hit matrix of §III-L.4 shows that under the deployed rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$97.2\%$ at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-L.0); on this baseline the per-comparison HC rate falls from $0.00014$ to $0.000010$ and the per-signature HC rate from $0.1102$ to $0.0059$. A residual caveat survives even on the clean baseline: the BCD floor is an *inter-CPA coincidence* rate, not an *intra-CPA genuine-hand-signing* rate, so the observed-versus-floor excess (§III-L.6) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities.
*Mechanism attribution for the firm-level heterogeneity is not identifiable from descriptor-only data.* The observed firm-level contrast (Firm A's per-document HC$+$MC ICCR of $0.62$ versus $0.09$$0.16$ at Firms B/C/D; within-firm collision concentration $77$$99\%$ under the deployed any-pair rule; byte-identical evidence of §IV-H) is consistent with at least three non-mutually-exclusive firm-level mechanisms: (i) template, stamp, or e-signature production reuse; (ii) digitisation-pipeline homogeneity — shared scanners, common PDF generation infrastructure, identical compression and form-template settings — that systematically inflates image-descriptor similarity without signature replication; and (iii) signing-style or training homogeneity that produces correlated handwritten signatures within a firm. The descriptor pair (cosine, dHash) operates at the image-similarity level and is, by construction, indifferent to which mechanism generated a given near-identical pair. We therefore report the firm contrast as a methodological observation — the framework discriminates at firm-level resolution — rather than as a mechanism finding. The byte-identical Firm A signatures across $\sim 50$ distinct partners (§IV-H, §V-C) provide direct evidence for (i) at Firm A specifically, but do not exclude additive contribution from (ii) or (iii); the milder within-firm collision patterns at Firms B/C/D are individually consistent with all three mechanisms. Image-acquisition metadata (scanner identifiers, PDF generator fingerprints, compression-codec markers), partner-level intent records, or controlled hand-signed baselines would be needed to attribute the contrast across (i), (ii), and (iii).
@@ -1050,9 +1052,9 @@ Several limitations should be transparent. We group them into primary methodolog
*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the deployed box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence advisory band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-L.3 supersedes the MC band's *claim strength* — its $\sim 0.19$ per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence advisory band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-L.3 supersedes the MC band's *claim strength* — its $\sim 0.175$ per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
*Deployed-rate excess is not a presumed true-positive rate.* The per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the normative inter-CPA proxy floor (HC: $0.023$ on the BCD baseline) — $\sim 60$ pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal.
*Deployed-rate excess is not a presumed true-positive rate.* The per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the normative inter-CPA proxy floor (HC: $0.012$ on the BCD baseline) — $\sim 60$ pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal.
*A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.
@@ -1079,9 +1081,9 @@ Several limitations should be transparent. We group them into primary methodolog
We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature screening rule with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442150,453 signatures at signature level) as the primary analytical population. We emphasise that the operating thresholds are operator-tunable and that the system performs semi-automated triage — surfacing replication candidates from hundreds of thousands of signatures for human adjudication — rather than autonomous forensic classification; its central deliverable is the label-free calibration methodology by which an operator selects and characterises a screening operating point.
Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D, with Firm A held out as an out-of-sample target to avoid circularity): on this clean baseline the deployed HC rule yields per-comparison ICCR $0.000018$, per-signature $0.0116$, and per-document $0.023$ — roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$) — while the dHash$\leq 15$ moderate-confidence band, which retains a $\sim 0.19$ per-document coincidence rate even on the clean baseline, is repositioned as a low-specificity advisory tier; with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: against the clean BCD floor the deployed rule fires on each firm's own pools at $\sim 70\times$ (Firm A) and $\sim 21$$30\times$ (Firms B/C/D), while Firm A scored cross-firm against the baseline coincides only at the floor ($0.0102$); two logistic regressions (full-Big-4 Firm-A-reference odds ratios $0.053$/$0.010$/$0.027$; BCD-only Firm-D-reference residual spread within $\sim 3.5\times$) show Firm A is the singular outlier and Firms B/C/D an internally homogeneous baseline — reported as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$$97\%$ at Firms B/C/D on the clean BCD pool (same-pair $97$$100\%$ across all four firms) — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.
Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D, with Firm A held out as an out-of-sample target to avoid circularity): on this clean baseline the deployed HC rule yields per-comparison ICCR $0.000010$, per-signature $0.0059$, and per-document $0.012$ — roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$) — while the dHash$\leq 15$ moderate-confidence band, which retains a $\sim 0.175$ per-document coincidence rate even on the clean baseline, is repositioned as a low-specificity advisory tier; with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: against the clean BCD floor the deployed rule fires on each firm's own pools at $\sim 139\times$ (Firm A) and $\sim 40$$59\times$ (Firms B/C/D), while Firm A scored cross-firm against the clean 20132019 baseline coincides essentially never cross-firm ($0.0001$); two logistic regressions (full-Big-4 Firm-A-reference odds ratios $0.053$/$0.010$/$0.027$; BCD-only Firm-D-reference residual spread within $\sim 3.5\times$) show Firm A is the singular outlier and Firms B/C/D an internally homogeneous baseline — reported as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$$97\%$ at Firms B/C/D on the clean BCD pool (same-pair $97$$100\%$ across all four firms) — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.
Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$$98.8\%$ across Big-4; same-pair joint $97.0$$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate $0.82$ vs $0.24$$0.35$, $\sim 70\times$ vs $\sim 21$$30\times$ the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$$98.8\%$ across Big-4; same-pair joint $97.0$$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate $0.82$ vs $0.24$$0.35$, $\sim 139\times$ vs $\sim 40$$59\times$ the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
# Appendix A. Supplementary Diagnostic Detail
@@ -0,0 +1,99 @@
#!/usr/bin/env python3
"""Script 54: temporal stability of the BCD inter-CPA floor.
Does the normative BCD per-comparison HC coincidence floor drift over time /
get contaminated by post-2020 e-signing? Compares eras full / 2013-2019 /
2020-2023 using the pool-size-independent per-comparison joint HC ICCR
(cos>0.95 & dHash<=5) on BCD inter-CPA pairs (N=500k, seed 42), plus the
observed deployed per-signature HC rate by firm by era. Read-only.
"""
import sqlite3
from collections import defaultdict
import numpy as np
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
SEED = 42
N_PAIRS = 500_000
POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
def wilson(k, n, z=1.96):
if n == 0:
return (None, None)
p = k/n; d = 1+z*z/n
c = (p+z*z/(2*n))/d
h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
return (max(0.0, c-h), min(1.0, c+h))
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""
SELECT s.assigned_accountant, a.firm, CAST(substr(s.year_month,1,4) AS INT),
s.feature_vector, s.dhash_vector,
s.max_similarity_to_same_accountant, s.min_dhash_independent
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE a.firm IN (?,?,?,?) AND s.year_month IS NOT NULL
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""", BIG4)
rows = cur.fetchall()
conn.close()
ERAS = {'full 2013-2023': lambda y: True,
'2013-2019 (pre-drift)': lambda y: 2013 <= y <= 2019,
'2020-2023': lambda y: 2020 <= y <= 2023}
def per_comparison_floor(era_fn, label):
# BCD-only (exclude Firm A), era-restricted
keep = [r for r in rows if r[1] != FIRM_A and era_fn(r[2])]
feats = np.stack([np.frombuffer(r[3], np.float32) for r in keep]).astype(np.float32)
feats /= np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-9, None)
dh = np.stack([np.frombuffer(r[4], np.uint8) for r in keep])
cpas = np.array([r[0] for r in keep])
by = defaultdict(list)
for i, c in enumerate(cpas):
by[c].append(i)
accts = list(by.keys())
rng = np.random.default_rng(SEED)
cos = np.empty(N_PAIRS, np.float32); dv = np.empty(N_PAIRS, np.int32)
na = len(accts)
for t in range(N_PAIRS):
i, j = rng.choice(na, 2, replace=False)
a1, a2 = accts[i], accts[j]
k1 = by[a1][int(rng.integers(0, len(by[a1])))]
k2 = by[a2][int(rng.integers(0, len(by[a2])))]
cos[t] = feats[k1] @ feats[k2]
dv[t] = POP[dh[k1] ^ dh[k2]].sum()
joint = int(((cos > 0.95) & (dv <= 5)).sum())
lo, hi = wilson(joint, N_PAIRS)
print(f' [{label}] BCD per-comparison HC floor = {joint/N_PAIRS:.6f} '
f'({joint}/{N_PAIRS}) Wilson95% [{lo:.6f},{hi:.6f}] '
f'(n_sig={len(keep):,}, CPAs={na})')
return joint/N_PAIRS
print('=== (1) BCD per-comparison HC floor by era (pool-size-independent) ===')
floors = {lab: per_comparison_floor(fn, lab) for lab, fn in ERAS.items()}
print('\n=== (2) Observed deployed per-signature HC rate by firm by era ===')
print(' (max_sim>0.95 & min_dh<=5 on actual same-CPA pools)')
for lab, fn in ERAS.items():
print(f' --- {lab} ---')
for fm_zh in BIG4:
sub = [r for r in rows if r[1] == fm_zh and fn(r[2])
and r[5] is not None and r[6] is not None]
if not sub:
continue
k = sum(1 for r in sub if r[5] > 0.95 and r[6] <= 5)
print(f' Firm {ALIAS[fm_zh]}: {k/len(sub):.4f} ({k}/{len(sub)})')
print('\n=== A-vs-floor multiple by era (observed A HC / BCD floor) ===')
for lab, fn in ERAS.items():
a = [r for r in rows if r[1] == FIRM_A and fn(r[2]) and r[5] is not None and r[6] is not None]
a_rate = sum(1 for r in a if r[5] > 0.95 and r[6] <= 5)/len(a) if a else 0
fl = floors[lab]
# per-comparison floor is not directly comparable to observed pooled rate;
# report ratio vs the per-signature floor proxy from Script 52 (0.0116 full).
print(f' {lab}: observed A HC = {a_rate:.3f}; per-comparison floor = {fl:.6f}')
@@ -0,0 +1,144 @@
#!/usr/bin/env python3
"""Script 55: PRIMARY calibration on the clean pre-e-signature baseline
BCD 2013-2019 (Firms B/C/D, fiscal years 2013-2019). Rationale: co-author
interviews confirm B/C/D progressively adopted e-signature systems after 2020
(staggered timing), so 2013-2019 BCD is the construct-clean hand-signing
baseline. Canonical retry-loop sampler (matches Scripts 43/45/52), any-pair.
Reports the floor + Firm A (all years) scored out-of-sample against it, and
BCD 2020+ scored against the same threshold. Read-only.
"""
import sqlite3
from collections import defaultdict, Counter
import numpy as np
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
SEED = 42
N_BOOT = 1000
POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
def wilson(k, n, z=1.96):
if n == 0:
return (None, None)
p = k/n; d = 1+z*z/n; c = (p+z*z/(2*n))/d
h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
return (max(0.0, c-h), min(1.0, c+h))
def canon_sampler(rng, n, npool, same, all_idx):
need = npool; cand = []; att = 0
while need > 0 and att < 10:
draw = rng.choice(n, size=need*2, replace=True)
ok = draw[~np.isin(draw, same)]
cand.extend(ok[:need].tolist()); need -= len(ok[:need]); att += 1
if need > 0:
pm = np.ones(n, bool); pm[same] = False
cand.extend(rng.choice(all_idx[pm], size=need, replace=False).tolist())
return np.array(cand[:npool], dtype=np.int64)
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""SELECT s.assigned_accountant,a.firm,CAST(substr(s.year_month,1,4) AS INT),
s.source_pdf,s.feature_vector,s.dhash_vector,
s.max_similarity_to_same_accountant,s.min_dhash_independent
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE a.firm IN (?,?,?,?) AND s.year_month IS NOT NULL
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""", BIG4)
rows = cur.fetchall()
conn.close()
def prep(rec):
feats = np.stack([np.frombuffer(r[4], np.float32) for r in rec]).astype(np.float32)
norms = np.linalg.norm(feats, axis=1, keepdims=True); norms[norms == 0] = 1.0
feats /= norms
dh = np.stack([np.frombuffer(r[5], np.uint8) for r in rec])
return feats, dh
def floor_on(baseline_rec, label):
"""Canonical per-sig/per-doc HC floor on a baseline population."""
feats, dh = prep(baseline_rec)
n = len(baseline_rec)
cpas = np.array([r[0] for r in baseline_rec])
firms = np.array([ALIAS[r[1]] for r in baseline_rec])
docs = np.array([r[3] for r in baseline_rec])
cidx = defaultdict(list)
for i, c in enumerate(cpas):
cidx[c].append(i)
cidx = {c: np.array(v) for c, v in cidx.items()}
psize = {c: len(v)-1 for c, v in cidx.items()}
all_idx = np.arange(n)
rng = np.random.default_rng(SEED)
mx = np.zeros(n, np.float32); mn = np.full(n, 64, np.int32)
for si in range(n):
np_ = psize[cpas[si]]
if np_ <= 0:
continue
cand = canon_sampler(rng, n, np_, cidx[cpas[si]], all_idx)
cosv = feats[cand] @ feats[si]
mx[si] = cosv.max(); mn[si] = int(POP[dh[cand] ^ dh[si]].sum(axis=1).min())
hc = (mx > 0.95) & (mn <= 5); d2 = (mx > 0.95) & (mn <= 15)
k = int(hc.sum())
rng2 = np.random.default_rng(SEED+1); cl = list(cidx.keys())
bs = np.array([hc[np.concatenate([cidx[cl[i]] for i in rng2.choice(len(cl), len(cl), True)])].mean()
for _ in range(N_BOOT)])
print(f'\n [{label}] n_sig={n:,}, CPAs={len(cidx)}')
print(f' per-sig HC floor = {k/n:.4f} ({k}/{n}) CPA-boot95% [{np.percentile(bs,2.5):.4f},{np.percentile(bs,97.5):.4f}]')
dd1 = defaultdict(bool); dd2 = defaultdict(bool); dfirm = {}
for i in range(n):
if hc[i]: dd1[docs[i]] = True
if d2[i]: dd2[docs[i]] = True
dfirm.setdefault(docs[i], []).append(firms[i])
dd1.setdefault(docs[i], False); dd2.setdefault(docs[i], False)
dl = list(dd1.keys()); nd = len(dl)
print(f' per-doc HC = {sum(dd1[d] for d in dl)/nd:.4f}; per-doc HC+MC = {sum(dd2[d] for d in dl)/nd:.4f} (n_doc={nd:,})')
dom = {d: Counter(dfirm[d]).most_common(1)[0][0] for d in dl}
for f in ['B', 'C', 'D']:
ds = [d for d in dl if dom[d] == f]
if ds:
print(f' Firm {f} per-doc HC+MC: {sum(dd2[d] for d in ds)/len(ds):.4f} ({sum(dd2[d] for d in ds)}/{len(ds)})')
return k/n
def a_vs_baseline(baseline_rec, a_rec, label):
bf, bdh = prep(baseline_rec); nb = len(baseline_rec)
a_cpa = defaultdict(list)
for i, r in enumerate(a_rec):
a_cpa[r[0]].append(i)
psize = {c: len(v)-1 for c, v in a_cpa.items()}
rng = np.random.default_rng(SEED)
hc = np.zeros(len(a_rec), bool)
for i, r in enumerate(a_rec):
np_ = psize[r[0]]
if np_ <= 0:
continue
cand = rng.integers(0, nb, size=np_)
sf = np.frombuffer(r[4], np.float32).astype(np.float32); sf /= max(np.linalg.norm(sf), 1e-9)
cosv = bf[cand] @ sf
if (cosv > 0.95).any():
dist = POP[bdh[cand] ^ np.frombuffer(r[5], np.uint8)].sum(axis=1)
hc[i] = bool(((cosv > 0.95) & (dist <= 5)).any())
k = int(hc.sum()); n = len(a_rec); lo, hi = wilson(k, n)
print(f' [{label}] Firm A (all yrs) vs BCD-2013-2019 pool: per-sig HC = {k/n:.4f} ({k}/{n}) [{lo:.5f},{hi:.5f}]')
bcd_pre = [r for r in rows if r[1] != FIRM_A and 2013 <= r[2] <= 2019]
bcd_post = [r for r in rows if r[1] != FIRM_A and r[2] >= 2020]
A_all = [r for r in rows if r[1] == FIRM_A]
print('=== PRIMARY floor: BCD 2013-2019 ===')
fl = floor_on(bcd_pre, 'BCD 2013-2019 (PRIMARY)')
print('\n=== Firm A scored against the BCD-2013-2019 threshold ===')
a_vs_baseline(bcd_pre, A_all, 'A out-of-sample')
A_obs = [r for r in A_all if r[6] is not None and r[7] is not None]
ak = sum(1 for r in A_obs if r[6] > 0.95 and r[7] <= 5)
print(f' Firm A observed (all yrs, own pools): per-sig HC = {ak/len(A_obs):.4f} -> {ak/len(A_obs)/fl:.0f}x the BCD-2013-2019 floor')
print('\n=== (optional) BCD 2020+ floor, same method (may be inflated by e-signing) ===')
floor_on(bcd_post, 'BCD 2020-2023 (post e-signing)')