Paper A v4.2: re-anchor primary calibration to clean BCD 2013-2019 baseline

- Restrict the calibration negative anchor to Firms B/C/D, fiscal years 2013-2019 (pre-electronic-signature hand-signing period); B/C/D adopted e-signing post-2020 at staggered times, so 2013-2019 is the construct-clean baseline. Firm A scored across its full 2013-2023 record against it. - New locked numbers (codex-audited, Scripts 54/55): per-comparison HC floor 0.000010; per-signature HC floor 0.0059 [boot 0.0045-0.0073]; per-document HC 0.0117 / HC+MC 0.1753; per-firm HC+MC B 0.162 / C 0.225 / D 0.089. Firm A observed 0.817 = ~139x the clean floor (was ~70x on all-period BCD); Firm A out-of-sample vs clean pool 0.0001 (below floor -> never resembles genuine hand-signing). BCD 2020+ robustness: per-sig 0.0105, per-comparison 0.000036 (~2x pre-2020) quantifies the e-signing contamination. - Propagated through abstract / Sec. I / III-L / IV-M / V / conclusion; 0.837 crossover kept corpus-wide; ABCD retained as contamination comparison. - Grounded the 2013-2019 choice on data (floor drift) + e-sign-adoption background, not on in-text interview claims (double-blind). - Add Scripts 54 (temporal floor stability) and 55 (BCD 2013-2019 primary calibration + Firm A scoring). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 21:30:06 +08:00
parent 3c7fcc010f
commit 1eb323e959
4 changed files with 287 additions and 42 deletions
@@ -7,7 +7,7 @@ author: "[Authors removed for double-blind review]"

 <!-- IEEE Access target: <= 250 words, single paragraph -->

-Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a normative non-Firm-A baseline (Firms B/C/D), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a very low inter-CPA coincidence rate (per-comparison ICCR $0.000018$; per-signature $0.012$; per-document $0.023$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.19$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A coincides at baseline rate cross-firm yet fires the rule on $82\%$ of its own signatures ($\sim 70\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.
+Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a very low inter-CPA coincidence rate (per-comparison ICCR $0.000010$; per-signature $0.006$; per-document $0.012$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A never coincides cross-firm yet fires on $82\%$ of its own ($\sim 139\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.

 <!-- Word count: 250 (v4.1 BCD-baseline reframe) -->

@@ -32,9 +32,9 @@ We are deliberate about what the system claims. The operating thresholds are *op

 A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-I.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.

-In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D); §III-L.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000018$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0116$ (CPA-block bootstrap 95% $[0.0094, 0.0141]$), and per-document ICCR $= 0.023$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.19$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide.
+In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019); §III-L.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000010$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0059$ (CPA-block bootstrap 95% $[0.0045, 0.0073]$), and per-document ICCR $= 0.012$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.175$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide.

-With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR $0.0116$), the deployed rule fires on each firm's *actual* same-CPA pools far above the inter-CPA coincidence floor: Firm A at $0.82$ ($\sim 70\times$ floor), Firms B/C/D at $0.24$–$0.35$ ($\sim 21$–$30\times$). Firm A scored against the clean baseline coincides at only $0.0102$ — essentially the floor itself — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within $\sim 3.5\times$ (odds ratio $1.73$ for B, $0.49$ for C). Under the deployed any-pair rule, within-firm collision concentration is a *universal* Big-4 pattern — $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and the advisory moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels.
+With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR $0.0059$), the deployed rule fires on each firm's *actual* same-CPA pools far above the inter-CPA coincidence floor: Firm A at $0.82$ ($\sim 139\times$ floor), Firms B/C/D at $0.24$–$0.35$ ($\sim 40$–$59\times$). Firm A scored against the clean 2013–2019 baseline coincides essentially never ($0.0001$, below the clean-baseline floor itself) — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within $\sim 3.5\times$ (odds ratio $1.73$ for B, $0.49$ for C). Under the deployed any-pair rule, within-firm collision concentration is a *universal* Big-4 pattern — $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and the advisory moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels.

 Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-J's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.

@@ -50,9 +50,9 @@ The contributions of this paper are:

 4. **Composition decomposition does not support the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.

-5. **Anchor-based multi-level ICCR calibration on a normative non-Firm-A baseline.** We characterise the deployed high-confidence (HC) sub-rule at three units of analysis against a clean Firms-B/C/D negative anchor (Firm A held out as an out-of-sample target to avoid circularity): per-comparison ICCR $0.000018$, pool-normalised per-signature ICCR $0.0116$, and per-document ICCR $0.023$ — each roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$). The moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.19$ per-document coincidence rate on the clean baseline and is repositioned as a low-specificity advisory tier rather than a confident non-hand-signed label. Because the deployed thresholds are operator-tunable, the contribution is this label-free calibration methodology — a principled way to choose and characterise a screening operating point and the specificity it yields — rather than any specific threshold. We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
+5. **Anchor-based multi-level ICCR calibration on a normative non-Firm-A baseline.** We characterise the deployed high-confidence (HC) sub-rule at three units of analysis against a clean Firms-B/C/D negative anchor (Firm A held out as an out-of-sample target to avoid circularity): per-comparison ICCR $0.000010$, pool-normalised per-signature ICCR $0.0059$, and per-document ICCR $0.012$ — each roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$). The moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate on the clean baseline and is repositioned as a low-specificity advisory tier rather than a confident non-hand-signed label. Because the deployed thresholds are operator-tunable, the contribution is this label-free calibration methodology — a principled way to choose and characterise a screening operating point and the specificity it yields — rather than any specific threshold. We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.

-6. **Firm A as a singular out-of-sample extreme; universal within-firm collision concentration.** Against the clean BCD floor (per-signature HC ICCR $0.0116$), the deployed rule fires on each firm's own pools far above the inter-CPA coincidence floor (Firm A $0.82$, $\sim 70\times$; Firms B/C/D $0.24$–$0.35$, $\sim 21$–$30\times$), while Firm A scored cross-firm against the baseline coincides only at the floor ($0.0102$) — localising the repeatability signal to within-firm comparisons. Two logistic regressions (full-Big-4 with Firm A reference: odds ratios $0.053$/$0.010$/$0.027$ for B/C/D; BCD-only with Firm D reference: residual spread within $\sim 3.5\times$, odds ratios $1.73$/$0.49$ for B/C) show Firm A is the lone outlier while Firms B/C/D form an internally homogeneous baseline. Within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$–$97\%$ at Firms B/C/D on the clean pool — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).
+6. **Firm A as a singular out-of-sample extreme; universal within-firm collision concentration.** Against the clean BCD floor (per-signature HC ICCR $0.0059$), the deployed rule fires on each firm's own pools far above the inter-CPA coincidence floor (Firm A $0.82$, $\sim 139\times$; Firms B/C/D $0.24$–$0.35$, $\sim 40$–$59\times$), while Firm A scored cross-firm against the clean 2013–2019 baseline coincides essentially never cross-firm ($0.0001$, below the floor itself) — localising the repeatability signal to within-firm comparisons. Two logistic regressions (full-Big-4 with Firm A reference: odds ratios $0.053$/$0.010$/$0.027$ for B/C/D; BCD-only with Firm D reference: residual spread within $\sim 3.5\times$, odds ratios $1.73$/$0.49$ for B/C) show Firm A is the lone outlier while Firms B/C/D form an internally homogeneous baseline. Within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$–$97\%$ at Firms B/C/D on the clean pool — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).

 7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.

@@ -396,7 +396,7 @@ $\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical pre

 Firm A accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24$ of the $40$ C1-assigned CPAs. The K=3 partition is therefore well-described as a firm-compositional decomposition: C3 is essentially "Firm A and any non-Firm-A CPA whose mean descriptors happen to land in the high-cos / low-dHash corner"; C1 is essentially "non-Firm-A CPAs whose mean descriptors land in the low-cos / high-dHash corner." The composition contrast that K=3 captures at the accountant level reappears at the deployment level in the cross-firm hit matrix of §III-L.4 (Script 44): under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). The K=3 partition and the cross-firm hit matrix therefore describe the same underlying firm-compositional structure at two different units of analysis.

-**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.023$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label.
+**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.012$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label.

 We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the operational classifier:

@@ -458,6 +458,8 @@ The operational classifier defined in §III-H.1 is calibrated by characterising

 **Choice of negative-anchor pool.** A negative anchor must approximate a population in which the rule should *not* fire — independent CPAs whose signatures coincide only by chance. §III-L.4 shows that under the deployed rule, $98.8\%$ of Firm A's inter-CPA collisions fall on other Firm-A CPAs, and byte-level evidence (§IV-H, supplementary materials) confirms image-level reuse across $\sim 50$ Firm-A partners. Including Firm A in the negative-anchor pool therefore loads the "coincidence" rate with structured within-firm collisions, not chance coincidence — a circularity, since that collision structure is the phenomenon the rule targets. We adopt **Firms B/C/D (BCD) as the normative negative-anchor baseline** and report the all-Big-4 (ABCD) pool only as a contamination-comparison scope; Firm A enters as an **out-of-sample target** (§III-L.4), not as a calibration input. A still-broader baseline adding the eligible non-Big-4 firms (BCD+non-Big-4) is reported as a robustness scope.

+We further restrict the calibration baseline temporally to **fiscal years 2013–2019**. Taiwan audit firms progressively adopted electronic-signature systems after 2020 (with firm-specific timing), so the pre-2020 BCD period is the construct-clean hand-signing baseline; the post-2020 period mixes genuine hand-signing with legitimate e-signing and is therefore not a clean negative anchor. The data corroborate this: the BCD per-comparison HC floor rises from $0.000010$ (2013–2019) to $0.000036$ (2020–2023), and the per-signature floor from $0.0059$ to $0.0105$ — the gradual, non-stepped rise being consistent with staggered per-firm adoption. We therefore calibrate on BCD 2013–2019 and report BCD 2020–2023 only as a robustness scope (it documents the e-signing contamination rather than the clean floor). Firm A is scored across its full 2013–2023 record against this clean threshold.
+
 **Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine and structural ($\text{dHash} \leq 5$) thresholds' specificity-proxy behaviour at the inter-CPA pair level on the BCD baseline. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for those sub-bands. The cosine LH/UN crossover $\text{cos} = 0.837$ is a corpus-wide descriptor-space landmark (intra- vs inter-CPA cosine KDE crossover, §IV-C) robust to baseline choice — it moves by at most $0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes ($0.8367$, $0.8489$, $0.8302$) — so we retain the corpus-wide value and do not re-anchor it on BCD.

 **Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:
@@ -478,9 +480,9 @@ We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from the baseline
 |---|---|---|---|
 | Cosine $> 0.95$ | $0.00026$ | $0.00060$ | $0.00014$ |
 | dHash $\leq 5$ | $0.00037$ | $0.00129$ | $0.00034$ |
-| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000018}$ $\;[0.000009, 0.000034]$ | $0.000140$ $\;[0.000111, 0.000177]$ | $0.000004$ $\;[0.000001, 0.000015]$ |
+| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000010}$ $\;[0.000004, 0.000023]$ | $0.000140$ $\;[0.000111, 0.000177]$ | $0.000004$ $\;[0.000001, 0.000015]$ |

-On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule is $0.000018$ — roughly $8\times$ lower than the all-Big-4 rate ($0.000140$), and lower still when the non-Big-4 firms are added ($0.000004$). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-L.4): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms ($9$ of $5 \times 10^5$ pairs on the BCD pool), so we report the Wilson interval and treat the per-comparison joint rate as an order-of-magnitude specificity proxy rather than a precisely estimated rate; the well-powered per-signature and per-document units (§III-L.2, §III-L.3) carry the primary calibration weight. The all-Big-4 cos $> 0.95$ row remains consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I. On the all-Big-4 sample the conditional rate ICCR(dHash $\leq 5\mid$ cos $> 0.95$) is $0.234$, indicating that the structural dimension adds substantial per-comparison specificity beyond the cosine gate.
+On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule is $0.000010$ — roughly $8\times$ lower than the all-Big-4 rate ($0.000140$), and lower still when the non-Big-4 firms are added ($0.000004$). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-L.4): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms ($5$ of $5 \times 10^5$ pairs on the BCD pool), so we report the Wilson interval and treat the per-comparison joint rate as an order-of-magnitude specificity proxy rather than a precisely estimated rate; the well-powered per-signature and per-document units (§III-L.2, §III-L.3) carry the primary calibration weight. The all-Big-4 cos $> 0.95$ row remains consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I. On the all-Big-4 sample the conditional rate ICCR(dHash $\leq 5\mid$ cos $> 0.95$) is $0.234$, indicating that the structural dimension adds substantial per-comparison specificity beyond the cosine gate.

 The per-comparison rate does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.

@@ -492,11 +494,11 @@ The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \

 | Baseline pool | Per-signature HC ICCR | CPA-bootstrap 95% CI |
 |---|---|---|
-| **BCD (primary)** | $\mathbf{0.0116}$ | $[0.0094, 0.0141]$ |
+| **BCD (primary)** | $\mathbf{0.0059}$ | $[0.0045, 0.0073]$ |
 | All-Big-4 (contamination comparison) | $0.1102$ | $[0.0908, 0.1330]$ |
 | BCD+non-Big-4 | $0.0083$ | $[0.0066, 0.0099]$ |

-On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is $0.0116$ — an order of magnitude below the all-Big-4 figure of $0.1102$. The all-Big-4 figure is dominated by Firm A, whose signatures coincide with other Firm-A signatures at high rate; once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is $\approx 1.2\%$. This is the specificity-proxy floor against which the deployed HC rule operates. The rate increases with pool size (the rule takes extrema over $n_{\text{pool}}$ candidates), consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence; the within-firm violation of that independence (§III-L.4) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash $\leq 3$) by inverting the ICCR curve, with the unsupervised-setting caveats of §III-M.
+On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is $0.0059$ — an order of magnitude below the all-Big-4 figure of $0.1102$. The all-Big-4 figure is dominated by Firm A, whose signatures coincide with other Firm-A signatures at high rate; once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is $\approx 0.59\%$. This is the specificity-proxy floor against which the deployed HC rule operates. The rate increases with pool size (the rule takes extrema over $n_{\text{pool}}$ candidates), consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence; the within-firm violation of that independence (§III-L.4) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash $\leq 3$) by inverting the ICCR curve, with the unsupervised-setting caveats of §III-M.

 ### L.3. Document-level inter-CPA proxy alert rate (Script 52)

@@ -504,31 +506,31 @@ Each document is classified by the worst-case rule over its constituent signatur

 | Alarm definition | BCD baseline (primary) | All-Big-4 | BCD+non-Big-4 |
 |---|---|---|---|
-| HC (dHash $\leq 5$) | $\mathbf{0.0226}$ | $0.1797$ | $0.0163$ |
-| HC + MC (dHash $\leq 15$) | $0.1905$ | $0.3375$ | $0.1467$ |
+| HC (dHash $\leq 5$) | $\mathbf{0.0117}$ | $0.1797$ | $0.0163$ |
+| HC + MC (dHash $\leq 15$) | $0.1753$ | $0.3375$ | $0.1467$ |

-**The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation.** On the BCD baseline the per-document HC rate is $0.0226$ ($\sim 8\times$ below the all-Big-4 $0.1797$), confirming that the HC (dHash $\leq 5$) rule has a very low inter-CPA coincidence rate: a clean inter-CPA baseline almost never produces an HC document. The HC+MC (dHash $\leq 15$) rate, by contrast, remains high on the clean baseline — $0.1905$ per document — and the per-firm breakdown shows it does *not* fall when Firm A is removed. **We therefore treat the HC sub-rule (dHash $\leq 5$) as the specificity-anchored operating point and reposition the MC band ($5 < \text{dHash} \leq 15$) as a low-specificity advisory tier rather than a confident non-hand-signed screening label.** Roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement, so an HC+MC alarm is not by itself evidence of image reproduction.
+**The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation.** On the BCD baseline the per-document HC rate is $0.0117$ ($\sim 8\times$ below the all-Big-4 $0.1797$), confirming that the HC (dHash $\leq 5$) rule has a very low inter-CPA coincidence rate: a clean inter-CPA baseline almost never produces an HC document. The HC+MC (dHash $\leq 15$) rate, by contrast, remains high on the clean baseline — $0.1753$ per document — and the per-firm breakdown shows it does *not* fall when Firm A is removed. **We therefore treat the HC sub-rule (dHash $\leq 5$) as the specificity-anchored operating point and reposition the MC band ($5 < \text{dHash} \leq 15$) as a low-specificity advisory tier rather than a confident non-hand-signed screening label.** Roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement, so an HC+MC alarm is not by itself evidence of image reproduction.

-Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $0.218$, Firm D $0.114$ — slightly *higher* than under the all-Big-4 pool (B $0.160$, C $0.163$, D $0.088$), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash $\leq 15$ scale. This is direct evidence that the MC band carries little inter-CPA specificity even among normative firms, corroborating its demotion to an advisory tier. The positioning of the operational system as a **screening framework with human-in-the-loop review**, not an autonomous forensic classifier, follows directly (§III-M).
+Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$ — slightly *higher* than under the all-Big-4 pool (B $0.160$, C $0.163$, D $0.088$), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash $\leq 15$ scale. This is direct evidence that the MC band carries little inter-CPA specificity even among normative firms, corroborating its demotion to an advisory tier. The positioning of the operational system as a **screening framework with human-in-the-loop review**, not an autonomous forensic classifier, follows directly (§III-M).

 ### L.4. Firm A as an out-of-sample target; firm heterogeneity (Scripts 49, 52, 44, 53)

 With the calibration anchored on BCD, Firm A is scored as an out-of-sample target against the clean baseline. Three complementary readings establish that Firm A is the extreme case while keeping the inferential limits explicit.

-**(i) Observed deployed rate versus the clean floor.** The deployed HC rule fires on each firm's *actual* same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of $0.0116$ (§III-L.2):
+**(i) Observed deployed rate versus the clean floor.** The deployed HC rule fires on each firm's *actual* same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of $0.0059$ (§III-L.2):

 | Firm | Observed per-signature HC rate | Multiple of BCD floor |
 |---|---|---|
-| Firm A | $0.817$ | $\sim 70\times$ |
-| Firm B | $0.346$ | $\sim 30\times$ |
-| Firm C | $0.238$ | $\sim 21\times$ |
-| Firm D | $0.245$ | $\sim 21\times$ |
+| Firm A | $0.817$ | $\sim 139\times$ |
+| Firm B | $0.346$ | $\sim 59\times$ |
+| Firm C | $0.238$ | $\sim 40\times$ |
+| Firm D | $0.245$ | $\sim 42\times$ |

-All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at $\sim 70\times$, roughly $2.4$–$3.4\times$ the other Big-4 firms in absolute rate. We emphasise (and develop in §III-M) that this excess is *not* a true-positive rate: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above it. The multiple is a framework-discriminative observation, not a measure of image reproduction.
+All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at $\sim 139\times$, roughly $2.4$–$3.4\times$ the other Big-4 firms in absolute rate. We emphasise (and develop in §III-M) that this excess is *not* a true-positive rate: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above it. The multiple is a framework-discriminative observation, not a measure of image reproduction.

-**(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm.** Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is $0.0102$ — essentially identical to the BCD-internal floor of $0.0116$. Firm A's signatures are thus unremarkable when matched against *other firms'* signatures; the entire elevation in Firm A's observed rate ($0.817$) arises from matches against *other Firm-A* signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness.
+**(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm.** Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is $0.0001$ — below even the BCD-internal floor of $0.0059$, i.e. Firm A's signatures essentially never resemble genuine 2013–2019 hand-signing. Firm A's signatures are thus unremarkable when matched against *other firms'* signatures; the entire elevation in Firm A's observed rate ($0.817$) arises from matches against *other Firm-A* signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness.

-**(iii) Firm-effect regressions: Firm A singular, baseline homogeneous.** Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D), with log-pool-size odds ratio $4.01$ — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0116$), the residual firm spread collapses to within a factor of $\sim 3.5$: odds ratios $1.73$ (B), $0.49$ (C), log-pool-size odds ratio $3.29$. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.)
+**(iii) Firm-effect regressions: Firm A singular, baseline homogeneous.** Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D), with log-pool-size odds ratio $4.01$ — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within a factor of $\sim 3.5$: odds ratios $1.73$ (B), $0.49$ (C), log-pool-size odds ratio $3.29$. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.)

 **Cross-firm hit matrix: within-firm concentration is a universal Big-4 pattern.** Under the deployed any-pair rule, inter-CPA collisions concentrate within the source firm at every Big-4 firm. On the full Big-4 candidate pool, within-firm concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (same-pair $97.0$–$99.96\%$; Table XXV). Restricting the candidate pool to the BCD baseline (Script 53) *raises* the within-firm concentration for B/C/D to $89.2$–$97.2\%$ any-pair (Firm B $97.2\%$, Firm C $92.3\%$, Firm D $89.2\%$) and $98.5$–$100\%$ same-pair — higher than on the full pool, because on the full pool some B/C/D collisions landed on Firm A's generically copy-like signatures; removing Firm A leaves each firm's collisions concentrated within itself. Within-firm collision concentration is therefore a universal Big-4 structural pattern, not a Firm-A peculiarity: Firm A is extreme in the *rate* at which the rule fires (reading (i)), but all four firms exhibit the same within-firm collision signature.

@@ -548,10 +550,10 @@ We interpret the deployed HC thresholds as **specificity-anchored operating poin

 The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the deployed HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$).

-Read against the **normative BCD specificity-proxy floor** rather than the contaminated all-Big-4 rate, the observed-deployed excess is larger: the per-signature observed rate is $\sim 43\times$ the BCD floor ($0.4958$ vs $0.0116$), and the per-document HC observed rate is $\sim 28\times$ the BCD floor ($0.6228$ vs $0.0226$):
+Read against the **normative BCD specificity-proxy floor** rather than the contaminated all-Big-4 rate, the observed-deployed excess is larger: the per-signature observed rate is $\sim 84\times$ the BCD floor ($0.4958$ vs $0.0059$), and the per-document HC observed rate is $\sim 53\times$ the BCD floor ($0.6228$ vs $0.0117$):

- Per-signature: $0.4958 - 0.0116 = 0.4842$ ($48.4$ pp excess over the clean floor)
- Per-document HC: $0.6228 - 0.0226 = 0.6002$ ($60.0$ pp excess over the clean floor)
+- Per-signature: $0.4958 - 0.0059 = 0.4899$ ($49.0$ pp excess over the clean floor)
+- Per-document HC: $0.6228 - 0.0117 = 0.6111$ ($61.1$ pp excess over the clean floor)

 We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits are developed in §III-M. The excess is best read as an *observed same-CPA-pool excess over the normative inter-CPA floor* — a quantity that far exceeds what random inter-CPA candidate replacement among normative firms would produce — whose mechanism is not identifiable from descriptor-only data (§III-M). Anchoring the floor on the clean BCD baseline sharpens this contrast (the all-Big-4 floor would understate it by absorbing Firm A's reuse), while leaving the §III-M caveat — that the floor is an inter-CPA coincidence rate, not an intra-CPA genuine-hand-signing rate — fully in force; we do not attribute the excess to within-CPA handwriting repeatability or to image replication without further evidence.

@@ -756,7 +758,7 @@ This section reports the firm-level cross-validation evidence motivating §III-J
 | Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp |
 | Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp |

-(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).
+(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.012$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).

 ## H. Pixel-Identity Positive-Anchor Miss Rate

@@ -832,7 +834,7 @@ This section reports the five-way per-signature + document-level worst-case clas

 (Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)

-The five-way **moderate-confidence advisory** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains the threshold provenance of its prior calibration (supplementary materials), but §III-L.3 **supersedes its claim strength**: on the normative BCD baseline this band carries a $\sim 0.19$ per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
+The five-way **moderate-confidence advisory** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains the threshold provenance of its prior calibration (supplementary materials), but §III-L.3 **supersedes its claim strength**: on the normative BCD baseline this band carries a $\sim 0.175$ per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).

 **Table XVII.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.

@@ -934,9 +936,9 @@ This section consolidates the empirical results that support the §III-L anchor-
 |---|---|---|---|
 | cos $> 0.95$ | $0.00026$ | $0.00060$ | $0.00014$ |
 | dHash $\leq 5$ | $0.00037$ | $0.00129$ | $0.00034$ |
-| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000018}$ | $0.000140$ | $0.000004$ |
+| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000010}$ | $0.000140$ | $0.000004$ |

-BCD joint Wilson 95% $[0.000009, 0.000034]$ ($9$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-L.4). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$).
+BCD joint Wilson 95% $[0.000004, 0.000023]$ ($5$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-L.4). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$).

 ### M.3 Pool-normalised per-signature ICCR (Script 52)

@@ -944,7 +946,7 @@ BCD joint Wilson 95% $[0.000009, 0.000034]$ ($9$ of $5 \times 10^5$ pairs); all-

 | Baseline pool | Per-signature HC ICCR | CPA-bootstrap 95% CI |
 |---|---|---|
-| BCD (primary) | $\mathbf{0.0116}$ | $[0.0094, 0.0141]$ |
+| BCD (primary) | $\mathbf{0.0059}$ | $[0.0045, 0.0073]$ |
 | All-Big-4 (contamination comparison) | $0.1102$ | $[0.0908, 0.1330]$ |
 | BCD+non-Big-4 | $0.0083$ | $[0.0066, 0.0099]$ |

@@ -956,10 +958,10 @@ The BCD floor is an order of magnitude below the all-Big-4 figure, which is domi

 | Alarm definition | BCD (primary) | All-Big-4 | BCD+non-Big-4 |
 |---|---|---|---|
-| HC (dHash $\leq 5$) | $\mathbf{0.0226}$ | $0.1797$ | $0.0163$ |
-| HC + MC (dHash $\leq 15$) | $0.1905$ | $0.3375$ | $0.1467$ |
+| HC (dHash $\leq 5$) | $\mathbf{0.0117}$ | $0.1797$ | $0.0163$ |
+| HC + MC (dHash $\leq 15$) | $0.1753$ | $0.3375$ | $0.1467$ |

-Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $0.218$, Firm D $0.114$ (all-Big-4 pool: Firm A $0.620$, Firm B $0.160$, Firm C $0.163$, Firm D $0.088$). The HC band collapses by $\sim 8\times$ when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash $\leq 15$ adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-L.3).
+Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$ (all-Big-4 pool: Firm A $0.620$, Firm B $0.160$, Firm C $0.163$, Firm D $0.088$). The HC band collapses by $\sim 8\times$ when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash $\leq 15$ adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-L.3).

 ### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)

@@ -972,7 +974,7 @@ Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $
 | Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A |
 | log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size |

-On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0116$), the residual firm spread collapses to within $\sim 3.5\times$ — odds ratios $1.73$ (Firm B), $0.49$ (Firm C), log-pool-size $3.29$ — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-L.4).
+On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within $\sim 3.5\times$ — odds ratios $1.73$ (Firm B), $0.49$ (Firm C), log-pool-size $3.29$ — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-L.4).

 Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$ while Firm A ranges $0.0541$–$0.5958$. The firm gap survives within matched pool sizes.

@@ -997,7 +999,7 @@ Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\
 | dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
 | dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |

-Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. Against the normative BCD floor (per-signature $0.0116$; per-document HC $0.0226$), the observed same-CPA-pool excess is $0.4842$ ($48.4$ pp, $\sim 43\times$) per-signature and $0.6002$ ($60.0$ pp, $\sim 28\times$) per-document; this excess is reported under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.
+Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. Against the normative BCD floor (per-signature $0.0059$; per-document HC $0.0117$), the observed same-CPA-pool excess is $0.4899$ ($49.0$ pp, $\sim 84\times$) per-signature and $0.6111$ ($61.1$ pp, $\sim 53\times$) per-document; this excess is reported under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.


 # V. Discussion
@@ -1014,7 +1016,7 @@ The Big-4 accountant-level distribution rejects unimodality on both marginals (

 Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.

-We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-L.0). Three readings (§III-L.4) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide at only $0.0102$ — essentially the BCD floor ($0.0116$) — so Firm A is unremarkable *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 70\times$ the clean floor, versus $\sim 21$–$30\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D, with same-pair concentration $97$–$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$–$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-L.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish.
+We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-L.0). Three readings (§III-L.4) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide essentially never ($0.0001$, below the BCD floor of $0.0059$) — so Firm A is unremarkable, indeed sub-baseline, *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 139\times$ the clean floor, versus $\sim 40$–$59\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D, with same-pair concentration $97$–$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$–$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-L.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish.

 ## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions

@@ -1030,7 +1032,7 @@ The deployed HC sub-rule's specificity-proxy behaviour is characterised at three

 ## G. Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor

-The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve $0\%$ positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the HC region). The complementary negative anchor is the §III-L.1 per-comparison ICCR on the normative BCD baseline ($0.000018$); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-L.0).
+The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve $0\%$ positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the HC region). The complementary negative anchor is the §III-L.1 per-comparison ICCR on the normative BCD baseline ($0.000010$); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-L.0).

 ## H. Limitations

@@ -1040,7 +1042,7 @@ Several limitations should be transparent. We group them into primary methodolog

 *No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M).

-*Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline.* The cross-firm hit matrix of §III-L.4 shows that under the deployed rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$97.2\%$ at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-L.0); on this baseline the per-comparison HC rate falls from $0.00014$ to $0.000018$ and the per-signature HC rate from $0.1102$ to $0.0116$. A residual caveat survives even on the clean baseline: the BCD floor is an *inter-CPA coincidence* rate, not an *intra-CPA genuine-hand-signing* rate, so the observed-versus-floor excess (§III-L.6) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities.
+*Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline.* The cross-firm hit matrix of §III-L.4 shows that under the deployed rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$97.2\%$ at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-L.0); on this baseline the per-comparison HC rate falls from $0.00014$ to $0.000010$ and the per-signature HC rate from $0.1102$ to $0.0059$. A residual caveat survives even on the clean baseline: the BCD floor is an *inter-CPA coincidence* rate, not an *intra-CPA genuine-hand-signing* rate, so the observed-versus-floor excess (§III-L.6) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities.

 *Mechanism attribution for the firm-level heterogeneity is not identifiable from descriptor-only data.* The observed firm-level contrast (Firm A's per-document HC$+$MC ICCR of $0.62$ versus $0.09$–$0.16$ at Firms B/C/D; within-firm collision concentration $77$–$99\%$ under the deployed any-pair rule; byte-identical evidence of §IV-H) is consistent with at least three non-mutually-exclusive firm-level mechanisms: (i) template, stamp, or e-signature production reuse; (ii) digitisation-pipeline homogeneity — shared scanners, common PDF generation infrastructure, identical compression and form-template settings — that systematically inflates image-descriptor similarity without signature replication; and (iii) signing-style or training homogeneity that produces correlated handwritten signatures within a firm. The descriptor pair (cosine, dHash) operates at the image-similarity level and is, by construction, indifferent to which mechanism generated a given near-identical pair. We therefore report the firm contrast as a methodological observation — the framework discriminates at firm-level resolution — rather than as a mechanism finding. The byte-identical Firm A signatures across $\sim 50$ distinct partners (§IV-H, §V-C) provide direct evidence for (i) at Firm A specifically, but do not exclude additive contribution from (ii) or (iii); the milder within-firm collision patterns at Firms B/C/D are individually consistent with all three mechanisms. Image-acquisition metadata (scanner identifiers, PDF generator fingerprints, compression-codec markers), partner-level intent records, or controlled hand-signed baselines would be needed to attribute the contrast across (i), (ii), and (iii).

@@ -1050,9 +1052,9 @@ Several limitations should be transparent. We group them into primary methodolog

 *Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the deployed box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).

-*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence advisory band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-L.3 supersedes the MC band's *claim strength* — its $\sim 0.19$ per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
+*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence advisory band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-L.3 supersedes the MC band's *claim strength* — its $\sim 0.175$ per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.

-*Deployed-rate excess is not a presumed true-positive rate.* The per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the normative inter-CPA proxy floor (HC: $0.023$ on the BCD baseline) — $\sim 60$ pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal.
+*Deployed-rate excess is not a presumed true-positive rate.* The per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the normative inter-CPA proxy floor (HC: $0.012$ on the BCD baseline) — $\sim 60$ pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal.

 *A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.

@@ -1079,9 +1081,9 @@ Several limitations should be transparent. We group them into primary methodolog

 We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature screening rule with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population. We emphasise that the operating thresholds are operator-tunable and that the system performs semi-automated triage — surfacing replication candidates from hundreds of thousands of signatures for human adjudication — rather than autonomous forensic classification; its central deliverable is the label-free calibration methodology by which an operator selects and characterises a screening operating point.

-Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D, with Firm A held out as an out-of-sample target to avoid circularity): on this clean baseline the deployed HC rule yields per-comparison ICCR $0.000018$, per-signature $0.0116$, and per-document $0.023$ — roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$) — while the dHash$\leq 15$ moderate-confidence band, which retains a $\sim 0.19$ per-document coincidence rate even on the clean baseline, is repositioned as a low-specificity advisory tier; with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: against the clean BCD floor the deployed rule fires on each firm's own pools at $\sim 70\times$ (Firm A) and $\sim 21$–$30\times$ (Firms B/C/D), while Firm A scored cross-firm against the baseline coincides only at the floor ($0.0102$); two logistic regressions (full-Big-4 Firm-A-reference odds ratios $0.053$/$0.010$/$0.027$; BCD-only Firm-D-reference residual spread within $\sim 3.5\times$) show Firm A is the singular outlier and Firms B/C/D an internally homogeneous baseline — reported as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$–$97\%$ at Firms B/C/D on the clean BCD pool (same-pair $97$–$100\%$ across all four firms) — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.
+Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D, with Firm A held out as an out-of-sample target to avoid circularity): on this clean baseline the deployed HC rule yields per-comparison ICCR $0.000010$, per-signature $0.0059$, and per-document $0.012$ — roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$) — while the dHash$\leq 15$ moderate-confidence band, which retains a $\sim 0.175$ per-document coincidence rate even on the clean baseline, is repositioned as a low-specificity advisory tier; with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: against the clean BCD floor the deployed rule fires on each firm's own pools at $\sim 139\times$ (Firm A) and $\sim 40$–$59\times$ (Firms B/C/D), while Firm A scored cross-firm against the clean 2013–2019 baseline coincides essentially never cross-firm ($0.0001$); two logistic regressions (full-Big-4 Firm-A-reference odds ratios $0.053$/$0.010$/$0.027$; BCD-only Firm-D-reference residual spread within $\sim 3.5\times$) show Firm A is the singular outlier and Firms B/C/D an internally homogeneous baseline — reported as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$–$97\%$ at Firms B/C/D on the clean BCD pool (same-pair $97$–$100\%$ across all four firms) — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.

-Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate $0.82$ vs $0.24$–$0.35$, $\sim 70\times$ vs $\sim 21$–$30\times$ the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
+Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate $0.82$ vs $0.24$–$0.35$, $\sim 139\times$ vs $\sim 40$–$59\times$ the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.


 # Appendix A. Supplementary Diagnostic Detail
@@ -0,0 +1,99 @@
+#!/usr/bin/env python3
+"""Script 54: temporal stability of the BCD inter-CPA floor.
+Does the normative BCD per-comparison HC coincidence floor drift over time /
+get contaminated by post-2020 e-signing? Compares eras full / 2013-2019 /
+2020-2023 using the pool-size-independent per-comparison joint HC ICCR
+(cos>0.95 & dHash<=5) on BCD inter-CPA pairs (N=500k, seed 42), plus the
+observed deployed per-signature HC rate by firm by era. Read-only.
+"""
+import sqlite3
+from collections import defaultdict
+import numpy as np
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+FIRM_A = '勤業眾信聯合'
+BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
+ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
+SEED = 42
+N_PAIRS = 500_000
+POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
+
+
+def wilson(k, n, z=1.96):
+    if n == 0:
+        return (None, None)
+    p = k/n; d = 1+z*z/n
+    c = (p+z*z/(2*n))/d
+    h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
+    return (max(0.0, c-h), min(1.0, c+h))
+
+
+conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
+cur = conn.cursor()
+cur.execute("""
+    SELECT s.assigned_accountant, a.firm, CAST(substr(s.year_month,1,4) AS INT),
+           s.feature_vector, s.dhash_vector,
+           s.max_similarity_to_same_accountant, s.min_dhash_independent
+    FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
+    WHERE a.firm IN (?,?,?,?) AND s.year_month IS NOT NULL
+      AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""", BIG4)
+rows = cur.fetchall()
+conn.close()
+
+ERAS = {'full 2013-2023': lambda y: True,
+        '2013-2019 (pre-drift)': lambda y: 2013 <= y <= 2019,
+        '2020-2023': lambda y: 2020 <= y <= 2023}
+
+
+def per_comparison_floor(era_fn, label):
+    # BCD-only (exclude Firm A), era-restricted
+    keep = [r for r in rows if r[1] != FIRM_A and era_fn(r[2])]
+    feats = np.stack([np.frombuffer(r[3], np.float32) for r in keep]).astype(np.float32)
+    feats /= np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-9, None)
+    dh = np.stack([np.frombuffer(r[4], np.uint8) for r in keep])
+    cpas = np.array([r[0] for r in keep])
+    by = defaultdict(list)
+    for i, c in enumerate(cpas):
+        by[c].append(i)
+    accts = list(by.keys())
+    rng = np.random.default_rng(SEED)
+    cos = np.empty(N_PAIRS, np.float32); dv = np.empty(N_PAIRS, np.int32)
+    na = len(accts)
+    for t in range(N_PAIRS):
+        i, j = rng.choice(na, 2, replace=False)
+        a1, a2 = accts[i], accts[j]
+        k1 = by[a1][int(rng.integers(0, len(by[a1])))]
+        k2 = by[a2][int(rng.integers(0, len(by[a2])))]
+        cos[t] = feats[k1] @ feats[k2]
+        dv[t] = POP[dh[k1] ^ dh[k2]].sum()
+    joint = int(((cos > 0.95) & (dv <= 5)).sum())
+    lo, hi = wilson(joint, N_PAIRS)
+    print(f'  [{label}] BCD per-comparison HC floor = {joint/N_PAIRS:.6f} '
+          f'({joint}/{N_PAIRS}) Wilson95% [{lo:.6f},{hi:.6f}]  '
+          f'(n_sig={len(keep):,}, CPAs={na})')
+    return joint/N_PAIRS
+
+
+print('=== (1) BCD per-comparison HC floor by era (pool-size-independent) ===')
+floors = {lab: per_comparison_floor(fn, lab) for lab, fn in ERAS.items()}
+
+print('\n=== (2) Observed deployed per-signature HC rate by firm by era ===')
+print('   (max_sim>0.95 & min_dh<=5 on actual same-CPA pools)')
+for lab, fn in ERAS.items():
+    print(f'  --- {lab} ---')
+    for fm_zh in BIG4:
+        sub = [r for r in rows if r[1] == fm_zh and fn(r[2])
+               and r[5] is not None and r[6] is not None]
+        if not sub:
+            continue
+        k = sum(1 for r in sub if r[5] > 0.95 and r[6] <= 5)
+        print(f'      Firm {ALIAS[fm_zh]}: {k/len(sub):.4f} ({k}/{len(sub)})')
+
+print('\n=== A-vs-floor multiple by era (observed A HC / BCD floor) ===')
+for lab, fn in ERAS.items():
+    a = [r for r in rows if r[1] == FIRM_A and fn(r[2]) and r[5] is not None and r[6] is not None]
+    a_rate = sum(1 for r in a if r[5] > 0.95 and r[6] <= 5)/len(a) if a else 0
+    fl = floors[lab]
+    # per-comparison floor is not directly comparable to observed pooled rate;
+    # report ratio vs the per-signature floor proxy from Script 52 (0.0116 full).
+    print(f'  {lab}: observed A HC = {a_rate:.3f}; per-comparison floor = {fl:.6f}')
@@ -0,0 +1,144 @@
+#!/usr/bin/env python3
+"""Script 55: PRIMARY calibration on the clean pre-e-signature baseline
+BCD 2013-2019 (Firms B/C/D, fiscal years 2013-2019). Rationale: co-author
+interviews confirm B/C/D progressively adopted e-signature systems after 2020
+(staggered timing), so 2013-2019 BCD is the construct-clean hand-signing
+baseline. Canonical retry-loop sampler (matches Scripts 43/45/52), any-pair.
+Reports the floor + Firm A (all years) scored out-of-sample against it, and
+BCD 2020+ scored against the same threshold. Read-only.
+"""
+import sqlite3
+from collections import defaultdict, Counter
+import numpy as np
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+FIRM_A = '勤業眾信聯合'
+BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
+ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
+SEED = 42
+N_BOOT = 1000
+POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
+
+
+def wilson(k, n, z=1.96):
+    if n == 0:
+        return (None, None)
+    p = k/n; d = 1+z*z/n; c = (p+z*z/(2*n))/d
+    h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
+    return (max(0.0, c-h), min(1.0, c+h))
+
+
+def canon_sampler(rng, n, npool, same, all_idx):
+    need = npool; cand = []; att = 0
+    while need > 0 and att < 10:
+        draw = rng.choice(n, size=need*2, replace=True)
+        ok = draw[~np.isin(draw, same)]
+        cand.extend(ok[:need].tolist()); need -= len(ok[:need]); att += 1
+    if need > 0:
+        pm = np.ones(n, bool); pm[same] = False
+        cand.extend(rng.choice(all_idx[pm], size=need, replace=False).tolist())
+    return np.array(cand[:npool], dtype=np.int64)
+
+
+conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
+cur = conn.cursor()
+cur.execute("""SELECT s.assigned_accountant,a.firm,CAST(substr(s.year_month,1,4) AS INT),
+    s.source_pdf,s.feature_vector,s.dhash_vector,
+    s.max_similarity_to_same_accountant,s.min_dhash_independent
+    FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
+    WHERE a.firm IN (?,?,?,?) AND s.year_month IS NOT NULL
+      AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""", BIG4)
+rows = cur.fetchall()
+conn.close()
+
+
+def prep(rec):
+    feats = np.stack([np.frombuffer(r[4], np.float32) for r in rec]).astype(np.float32)
+    norms = np.linalg.norm(feats, axis=1, keepdims=True); norms[norms == 0] = 1.0
+    feats /= norms
+    dh = np.stack([np.frombuffer(r[5], np.uint8) for r in rec])
+    return feats, dh
+
+
+def floor_on(baseline_rec, label):
+    """Canonical per-sig/per-doc HC floor on a baseline population."""
+    feats, dh = prep(baseline_rec)
+    n = len(baseline_rec)
+    cpas = np.array([r[0] for r in baseline_rec])
+    firms = np.array([ALIAS[r[1]] for r in baseline_rec])
+    docs = np.array([r[3] for r in baseline_rec])
+    cidx = defaultdict(list)
+    for i, c in enumerate(cpas):
+        cidx[c].append(i)
+    cidx = {c: np.array(v) for c, v in cidx.items()}
+    psize = {c: len(v)-1 for c, v in cidx.items()}
+    all_idx = np.arange(n)
+    rng = np.random.default_rng(SEED)
+    mx = np.zeros(n, np.float32); mn = np.full(n, 64, np.int32)
+    for si in range(n):
+        np_ = psize[cpas[si]]
+        if np_ <= 0:
+            continue
+        cand = canon_sampler(rng, n, np_, cidx[cpas[si]], all_idx)
+        cosv = feats[cand] @ feats[si]
+        mx[si] = cosv.max(); mn[si] = int(POP[dh[cand] ^ dh[si]].sum(axis=1).min())
+    hc = (mx > 0.95) & (mn <= 5); d2 = (mx > 0.95) & (mn <= 15)
+    k = int(hc.sum())
+    rng2 = np.random.default_rng(SEED+1); cl = list(cidx.keys())
+    bs = np.array([hc[np.concatenate([cidx[cl[i]] for i in rng2.choice(len(cl), len(cl), True)])].mean()
+                   for _ in range(N_BOOT)])
+    print(f'\n  [{label}] n_sig={n:,}, CPAs={len(cidx)}')
+    print(f'    per-sig HC floor = {k/n:.4f} ({k}/{n})  CPA-boot95% [{np.percentile(bs,2.5):.4f},{np.percentile(bs,97.5):.4f}]')
+    dd1 = defaultdict(bool); dd2 = defaultdict(bool); dfirm = {}
+    for i in range(n):
+        if hc[i]: dd1[docs[i]] = True
+        if d2[i]: dd2[docs[i]] = True
+        dfirm.setdefault(docs[i], []).append(firms[i])
+        dd1.setdefault(docs[i], False); dd2.setdefault(docs[i], False)
+    dl = list(dd1.keys()); nd = len(dl)
+    print(f'    per-doc HC = {sum(dd1[d] for d in dl)/nd:.4f}; per-doc HC+MC = {sum(dd2[d] for d in dl)/nd:.4f}  (n_doc={nd:,})')
+    dom = {d: Counter(dfirm[d]).most_common(1)[0][0] for d in dl}
+    for f in ['B', 'C', 'D']:
+        ds = [d for d in dl if dom[d] == f]
+        if ds:
+            print(f'      Firm {f} per-doc HC+MC: {sum(dd2[d] for d in ds)/len(ds):.4f} ({sum(dd2[d] for d in ds)}/{len(ds)})')
+    return k/n
+
+
+def a_vs_baseline(baseline_rec, a_rec, label):
+    bf, bdh = prep(baseline_rec); nb = len(baseline_rec)
+    a_cpa = defaultdict(list)
+    for i, r in enumerate(a_rec):
+        a_cpa[r[0]].append(i)
+    psize = {c: len(v)-1 for c, v in a_cpa.items()}
+    rng = np.random.default_rng(SEED)
+    hc = np.zeros(len(a_rec), bool)
+    for i, r in enumerate(a_rec):
+        np_ = psize[r[0]]
+        if np_ <= 0:
+            continue
+        cand = rng.integers(0, nb, size=np_)
+        sf = np.frombuffer(r[4], np.float32).astype(np.float32); sf /= max(np.linalg.norm(sf), 1e-9)
+        cosv = bf[cand] @ sf
+        if (cosv > 0.95).any():
+            dist = POP[bdh[cand] ^ np.frombuffer(r[5], np.uint8)].sum(axis=1)
+            hc[i] = bool(((cosv > 0.95) & (dist <= 5)).any())
+    k = int(hc.sum()); n = len(a_rec); lo, hi = wilson(k, n)
+    print(f'  [{label}] Firm A (all yrs) vs BCD-2013-2019 pool: per-sig HC = {k/n:.4f} ({k}/{n}) [{lo:.5f},{hi:.5f}]')
+
+
+bcd_pre = [r for r in rows if r[1] != FIRM_A and 2013 <= r[2] <= 2019]
+bcd_post = [r for r in rows if r[1] != FIRM_A and r[2] >= 2020]
+A_all = [r for r in rows if r[1] == FIRM_A]
+
+print('=== PRIMARY floor: BCD 2013-2019 ===')
+fl = floor_on(bcd_pre, 'BCD 2013-2019 (PRIMARY)')
+
+print('\n=== Firm A scored against the BCD-2013-2019 threshold ===')
+a_vs_baseline(bcd_pre, A_all, 'A out-of-sample')
+A_obs = [r for r in A_all if r[6] is not None and r[7] is not None]
+ak = sum(1 for r in A_obs if r[6] > 0.95 and r[7] <= 5)
+print(f'  Firm A observed (all yrs, own pools): per-sig HC = {ak/len(A_obs):.4f}  -> {ak/len(A_obs)/fl:.0f}x the BCD-2013-2019 floor')
+
+print('\n=== (optional) BCD 2020+ floor, same method (may be inflated by e-signing) ===')
+floor_on(bcd_post, 'BCD 2020-2023 (post e-signing)')