diff --git a/paper/Paper_A_IEEE_Access_Draft_v4.2_20260604.pandoc.docx b/paper/Paper_A_IEEE_Access_Draft_v4.2_20260604.pandoc.docx new file mode 100644 index 0000000..df6c20d Binary files /dev/null and b/paper/Paper_A_IEEE_Access_Draft_v4.2_20260604.pandoc.docx differ diff --git a/paper/paper_a_v4_combined.md b/paper/paper_a_v4_combined.md index abee363..1a63fb3 100644 --- a/paper/paper_a_v4_combined.md +++ b/paper/paper_a_v4_combined.md @@ -7,7 +7,7 @@ author: "[Authors removed for double-blind review]" -Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a normative non-Firm-A baseline (Firms B/C/D), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a very low inter-CPA coincidence rate (per-comparison ICCR $0.000018$; per-signature $0.012$; per-document $0.023$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.19$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A coincides at baseline rate cross-firm yet fires the rule on $82\%$ of its own signatures ($\sim 70\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth. +Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a very low inter-CPA coincidence rate (per-comparison ICCR $0.000010$; per-signature $0.006$; per-document $0.012$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A never coincides cross-firm yet fires on $82\%$ of its own ($\sim 139\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth. @@ -32,9 +32,9 @@ We are deliberate about what the system claims. The operating thresholds are *op A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-I.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors. -In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D); §III-L.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000018$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0116$ (CPA-block bootstrap 95% $[0.0094, 0.0141]$), and per-document ICCR $= 0.023$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.19$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide. +In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a clean pre-e-signature baseline (Firms B/C/D, 2013–2019); §III-L.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000010$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0059$ (CPA-block bootstrap 95% $[0.0045, 0.0073]$), and per-document ICCR $= 0.012$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.175$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide. -With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR $0.0116$), the deployed rule fires on each firm's *actual* same-CPA pools far above the inter-CPA coincidence floor: Firm A at $0.82$ ($\sim 70\times$ floor), Firms B/C/D at $0.24$–$0.35$ ($\sim 21$–$30\times$). Firm A scored against the clean baseline coincides at only $0.0102$ — essentially the floor itself — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within $\sim 3.5\times$ (odds ratio $1.73$ for B, $0.49$ for C). Under the deployed any-pair rule, within-firm collision concentration is a *universal* Big-4 pattern — $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and the advisory moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels. +With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR $0.0059$), the deployed rule fires on each firm's *actual* same-CPA pools far above the inter-CPA coincidence floor: Firm A at $0.82$ ($\sim 139\times$ floor), Firms B/C/D at $0.24$–$0.35$ ($\sim 40$–$59\times$). Firm A scored against the clean 2013–2019 baseline coincides essentially never ($0.0001$, below the clean-baseline floor itself) — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within $\sim 3.5\times$ (odds ratio $1.73$ for B, $0.49$ for C). Under the deployed any-pair rule, within-firm collision concentration is a *universal* Big-4 pattern — $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and the advisory moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels. Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-J's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G. @@ -50,9 +50,9 @@ The contributions of this paper are: 4. **Composition decomposition does not support the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported. -5. **Anchor-based multi-level ICCR calibration on a normative non-Firm-A baseline.** We characterise the deployed high-confidence (HC) sub-rule at three units of analysis against a clean Firms-B/C/D negative anchor (Firm A held out as an out-of-sample target to avoid circularity): per-comparison ICCR $0.000018$, pool-normalised per-signature ICCR $0.0116$, and per-document ICCR $0.023$ — each roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$). The moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.19$ per-document coincidence rate on the clean baseline and is repositioned as a low-specificity advisory tier rather than a confident non-hand-signed label. Because the deployed thresholds are operator-tunable, the contribution is this label-free calibration methodology — a principled way to choose and characterise a screening operating point and the specificity it yields — rather than any specific threshold. We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide. +5. **Anchor-based multi-level ICCR calibration on a normative non-Firm-A baseline.** We characterise the deployed high-confidence (HC) sub-rule at three units of analysis against a clean Firms-B/C/D negative anchor (Firm A held out as an out-of-sample target to avoid circularity): per-comparison ICCR $0.000010$, pool-normalised per-signature ICCR $0.0059$, and per-document ICCR $0.012$ — each roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$). The moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.175$ per-document coincidence rate on the clean baseline and is repositioned as a low-specificity advisory tier rather than a confident non-hand-signed label. Because the deployed thresholds are operator-tunable, the contribution is this label-free calibration methodology — a principled way to choose and characterise a screening operating point and the specificity it yields — rather than any specific threshold. We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide. -6. **Firm A as a singular out-of-sample extreme; universal within-firm collision concentration.** Against the clean BCD floor (per-signature HC ICCR $0.0116$), the deployed rule fires on each firm's own pools far above the inter-CPA coincidence floor (Firm A $0.82$, $\sim 70\times$; Firms B/C/D $0.24$–$0.35$, $\sim 21$–$30\times$), while Firm A scored cross-firm against the baseline coincides only at the floor ($0.0102$) — localising the repeatability signal to within-firm comparisons. Two logistic regressions (full-Big-4 with Firm A reference: odds ratios $0.053$/$0.010$/$0.027$ for B/C/D; BCD-only with Firm D reference: residual spread within $\sim 3.5\times$, odds ratios $1.73$/$0.49$ for B/C) show Firm A is the lone outlier while Firms B/C/D form an internally homogeneous baseline. Within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$–$97\%$ at Firms B/C/D on the clean pool — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). +6. **Firm A as a singular out-of-sample extreme; universal within-firm collision concentration.** Against the clean BCD floor (per-signature HC ICCR $0.0059$), the deployed rule fires on each firm's own pools far above the inter-CPA coincidence floor (Firm A $0.82$, $\sim 139\times$; Firms B/C/D $0.24$–$0.35$, $\sim 40$–$59\times$), while Firm A scored cross-firm against the clean 2013–2019 baseline coincides essentially never cross-firm ($0.0001$, below the floor itself) — localising the repeatability signal to within-firm comparisons. Two logistic regressions (full-Big-4 with Firm A reference: odds ratios $0.053$/$0.010$/$0.027$ for B/C/D; BCD-only with Firm D reference: residual spread within $\sim 3.5\times$, odds ratios $1.73$/$0.49$ for B/C) show Firm A is the lone outlier while Firms B/C/D form an internally homogeneous baseline. Within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$–$97\%$ at Firms B/C/D on the clean pool — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). 7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair. @@ -396,7 +396,7 @@ $\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical pre Firm A accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24$ of the $40$ C1-assigned CPAs. The K=3 partition is therefore well-described as a firm-compositional decomposition: C3 is essentially "Firm A and any non-Firm-A CPA whose mean descriptors happen to land in the high-cos / low-dHash corner"; C1 is essentially "non-Firm-A CPAs whose mean descriptors land in the low-cos / high-dHash corner." The composition contrast that K=3 captures at the accountant level reappears at the deployment level in the cross-firm hit matrix of §III-L.4 (Script 44): under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). The K=3 partition and the cross-firm hit matrix therefore describe the same underlying firm-compositional structure at two different units of analysis. -**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.023$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label. +**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.012$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label. We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the operational classifier: @@ -458,6 +458,8 @@ The operational classifier defined in §III-H.1 is calibrated by characterising **Choice of negative-anchor pool.** A negative anchor must approximate a population in which the rule should *not* fire — independent CPAs whose signatures coincide only by chance. §III-L.4 shows that under the deployed rule, $98.8\%$ of Firm A's inter-CPA collisions fall on other Firm-A CPAs, and byte-level evidence (§IV-H, supplementary materials) confirms image-level reuse across $\sim 50$ Firm-A partners. Including Firm A in the negative-anchor pool therefore loads the "coincidence" rate with structured within-firm collisions, not chance coincidence — a circularity, since that collision structure is the phenomenon the rule targets. We adopt **Firms B/C/D (BCD) as the normative negative-anchor baseline** and report the all-Big-4 (ABCD) pool only as a contamination-comparison scope; Firm A enters as an **out-of-sample target** (§III-L.4), not as a calibration input. A still-broader baseline adding the eligible non-Big-4 firms (BCD+non-Big-4) is reported as a robustness scope. +We further restrict the calibration baseline temporally to **fiscal years 2013–2019**. Taiwan audit firms progressively adopted electronic-signature systems after 2020 (with firm-specific timing), so the pre-2020 BCD period is the construct-clean hand-signing baseline; the post-2020 period mixes genuine hand-signing with legitimate e-signing and is therefore not a clean negative anchor. The data corroborate this: the BCD per-comparison HC floor rises from $0.000010$ (2013–2019) to $0.000036$ (2020–2023), and the per-signature floor from $0.0059$ to $0.0105$ — the gradual, non-stepped rise being consistent with staggered per-firm adoption. We therefore calibrate on BCD 2013–2019 and report BCD 2020–2023 only as a robustness scope (it documents the e-signing contamination rather than the clean floor). Firm A is scored across its full 2013–2023 record against this clean threshold. + **Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine and structural ($\text{dHash} \leq 5$) thresholds' specificity-proxy behaviour at the inter-CPA pair level on the BCD baseline. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for those sub-bands. The cosine LH/UN crossover $\text{cos} = 0.837$ is a corpus-wide descriptor-space landmark (intra- vs inter-CPA cosine KDE crossover, §IV-C) robust to baseline choice — it moves by at most $0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes ($0.8367$, $0.8489$, $0.8302$) — so we retain the corpus-wide value and do not re-anchor it on BCD. **Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question: @@ -478,9 +480,9 @@ We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from the baseline |---|---|---|---| | Cosine $> 0.95$ | $0.00026$ | $0.00060$ | $0.00014$ | | dHash $\leq 5$ | $0.00037$ | $0.00129$ | $0.00034$ | -| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000018}$ $\;[0.000009, 0.000034]$ | $0.000140$ $\;[0.000111, 0.000177]$ | $0.000004$ $\;[0.000001, 0.000015]$ | +| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000010}$ $\;[0.000004, 0.000023]$ | $0.000140$ $\;[0.000111, 0.000177]$ | $0.000004$ $\;[0.000001, 0.000015]$ | -On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule is $0.000018$ — roughly $8\times$ lower than the all-Big-4 rate ($0.000140$), and lower still when the non-Big-4 firms are added ($0.000004$). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-L.4): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms ($9$ of $5 \times 10^5$ pairs on the BCD pool), so we report the Wilson interval and treat the per-comparison joint rate as an order-of-magnitude specificity proxy rather than a precisely estimated rate; the well-powered per-signature and per-document units (§III-L.2, §III-L.3) carry the primary calibration weight. The all-Big-4 cos $> 0.95$ row remains consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I. On the all-Big-4 sample the conditional rate ICCR(dHash $\leq 5\mid$ cos $> 0.95$) is $0.234$, indicating that the structural dimension adds substantial per-comparison specificity beyond the cosine gate. +On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule is $0.000010$ — roughly $8\times$ lower than the all-Big-4 rate ($0.000140$), and lower still when the non-Big-4 firms are added ($0.000004$). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-L.4): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms ($5$ of $5 \times 10^5$ pairs on the BCD pool), so we report the Wilson interval and treat the per-comparison joint rate as an order-of-magnitude specificity proxy rather than a precisely estimated rate; the well-powered per-signature and per-document units (§III-L.2, §III-L.3) carry the primary calibration weight. The all-Big-4 cos $> 0.95$ row remains consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I. On the all-Big-4 sample the conditional rate ICCR(dHash $\leq 5\mid$ cos $> 0.95$) is $0.234$, indicating that the structural dimension adds substantial per-comparison specificity beyond the cosine gate. The per-comparison rate does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2. @@ -492,11 +494,11 @@ The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \ | Baseline pool | Per-signature HC ICCR | CPA-bootstrap 95% CI | |---|---|---| -| **BCD (primary)** | $\mathbf{0.0116}$ | $[0.0094, 0.0141]$ | +| **BCD (primary)** | $\mathbf{0.0059}$ | $[0.0045, 0.0073]$ | | All-Big-4 (contamination comparison) | $0.1102$ | $[0.0908, 0.1330]$ | | BCD+non-Big-4 | $0.0083$ | $[0.0066, 0.0099]$ | -On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is $0.0116$ — an order of magnitude below the all-Big-4 figure of $0.1102$. The all-Big-4 figure is dominated by Firm A, whose signatures coincide with other Firm-A signatures at high rate; once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is $\approx 1.2\%$. This is the specificity-proxy floor against which the deployed HC rule operates. The rate increases with pool size (the rule takes extrema over $n_{\text{pool}}$ candidates), consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence; the within-firm violation of that independence (§III-L.4) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash $\leq 3$) by inverting the ICCR curve, with the unsupervised-setting caveats of §III-M. +On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is $0.0059$ — an order of magnitude below the all-Big-4 figure of $0.1102$. The all-Big-4 figure is dominated by Firm A, whose signatures coincide with other Firm-A signatures at high rate; once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is $\approx 0.59\%$. This is the specificity-proxy floor against which the deployed HC rule operates. The rate increases with pool size (the rule takes extrema over $n_{\text{pool}}$ candidates), consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence; the within-firm violation of that independence (§III-L.4) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash $\leq 3$) by inverting the ICCR curve, with the unsupervised-setting caveats of §III-M. ### L.3. Document-level inter-CPA proxy alert rate (Script 52) @@ -504,31 +506,31 @@ Each document is classified by the worst-case rule over its constituent signatur | Alarm definition | BCD baseline (primary) | All-Big-4 | BCD+non-Big-4 | |---|---|---|---| -| HC (dHash $\leq 5$) | $\mathbf{0.0226}$ | $0.1797$ | $0.0163$ | -| HC + MC (dHash $\leq 15$) | $0.1905$ | $0.3375$ | $0.1467$ | +| HC (dHash $\leq 5$) | $\mathbf{0.0117}$ | $0.1797$ | $0.0163$ | +| HC + MC (dHash $\leq 15$) | $0.1753$ | $0.3375$ | $0.1467$ | -**The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation.** On the BCD baseline the per-document HC rate is $0.0226$ ($\sim 8\times$ below the all-Big-4 $0.1797$), confirming that the HC (dHash $\leq 5$) rule has a very low inter-CPA coincidence rate: a clean inter-CPA baseline almost never produces an HC document. The HC+MC (dHash $\leq 15$) rate, by contrast, remains high on the clean baseline — $0.1905$ per document — and the per-firm breakdown shows it does *not* fall when Firm A is removed. **We therefore treat the HC sub-rule (dHash $\leq 5$) as the specificity-anchored operating point and reposition the MC band ($5 < \text{dHash} \leq 15$) as a low-specificity advisory tier rather than a confident non-hand-signed screening label.** Roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement, so an HC+MC alarm is not by itself evidence of image reproduction. +**The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation.** On the BCD baseline the per-document HC rate is $0.0117$ ($\sim 8\times$ below the all-Big-4 $0.1797$), confirming that the HC (dHash $\leq 5$) rule has a very low inter-CPA coincidence rate: a clean inter-CPA baseline almost never produces an HC document. The HC+MC (dHash $\leq 15$) rate, by contrast, remains high on the clean baseline — $0.1753$ per document — and the per-firm breakdown shows it does *not* fall when Firm A is removed. **We therefore treat the HC sub-rule (dHash $\leq 5$) as the specificity-anchored operating point and reposition the MC band ($5 < \text{dHash} \leq 15$) as a low-specificity advisory tier rather than a confident non-hand-signed screening label.** Roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement, so an HC+MC alarm is not by itself evidence of image reproduction. -Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $0.218$, Firm D $0.114$ — slightly *higher* than under the all-Big-4 pool (B $0.160$, C $0.163$, D $0.088$), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash $\leq 15$ scale. This is direct evidence that the MC band carries little inter-CPA specificity even among normative firms, corroborating its demotion to an advisory tier. The positioning of the operational system as a **screening framework with human-in-the-loop review**, not an autonomous forensic classifier, follows directly (§III-M). +Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$ — slightly *higher* than under the all-Big-4 pool (B $0.160$, C $0.163$, D $0.088$), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash $\leq 15$ scale. This is direct evidence that the MC band carries little inter-CPA specificity even among normative firms, corroborating its demotion to an advisory tier. The positioning of the operational system as a **screening framework with human-in-the-loop review**, not an autonomous forensic classifier, follows directly (§III-M). ### L.4. Firm A as an out-of-sample target; firm heterogeneity (Scripts 49, 52, 44, 53) With the calibration anchored on BCD, Firm A is scored as an out-of-sample target against the clean baseline. Three complementary readings establish that Firm A is the extreme case while keeping the inferential limits explicit. -**(i) Observed deployed rate versus the clean floor.** The deployed HC rule fires on each firm's *actual* same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of $0.0116$ (§III-L.2): +**(i) Observed deployed rate versus the clean floor.** The deployed HC rule fires on each firm's *actual* same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of $0.0059$ (§III-L.2): | Firm | Observed per-signature HC rate | Multiple of BCD floor | |---|---|---| -| Firm A | $0.817$ | $\sim 70\times$ | -| Firm B | $0.346$ | $\sim 30\times$ | -| Firm C | $0.238$ | $\sim 21\times$ | -| Firm D | $0.245$ | $\sim 21\times$ | +| Firm A | $0.817$ | $\sim 139\times$ | +| Firm B | $0.346$ | $\sim 59\times$ | +| Firm C | $0.238$ | $\sim 40\times$ | +| Firm D | $0.245$ | $\sim 42\times$ | -All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at $\sim 70\times$, roughly $2.4$–$3.4\times$ the other Big-4 firms in absolute rate. We emphasise (and develop in §III-M) that this excess is *not* a true-positive rate: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above it. The multiple is a framework-discriminative observation, not a measure of image reproduction. +All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at $\sim 139\times$, roughly $2.4$–$3.4\times$ the other Big-4 firms in absolute rate. We emphasise (and develop in §III-M) that this excess is *not* a true-positive rate: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above it. The multiple is a framework-discriminative observation, not a measure of image reproduction. -**(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm.** Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is $0.0102$ — essentially identical to the BCD-internal floor of $0.0116$. Firm A's signatures are thus unremarkable when matched against *other firms'* signatures; the entire elevation in Firm A's observed rate ($0.817$) arises from matches against *other Firm-A* signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness. +**(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm.** Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is $0.0001$ — below even the BCD-internal floor of $0.0059$, i.e. Firm A's signatures essentially never resemble genuine 2013–2019 hand-signing. Firm A's signatures are thus unremarkable when matched against *other firms'* signatures; the entire elevation in Firm A's observed rate ($0.817$) arises from matches against *other Firm-A* signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness. -**(iii) Firm-effect regressions: Firm A singular, baseline homogeneous.** Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D), with log-pool-size odds ratio $4.01$ — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0116$), the residual firm spread collapses to within a factor of $\sim 3.5$: odds ratios $1.73$ (B), $0.49$ (C), log-pool-size odds ratio $3.29$. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.) +**(iii) Firm-effect regressions: Firm A singular, baseline homogeneous.** Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D), with log-pool-size odds ratio $4.01$ — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within a factor of $\sim 3.5$: odds ratios $1.73$ (B), $0.49$ (C), log-pool-size odds ratio $3.29$. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.) **Cross-firm hit matrix: within-firm concentration is a universal Big-4 pattern.** Under the deployed any-pair rule, inter-CPA collisions concentrate within the source firm at every Big-4 firm. On the full Big-4 candidate pool, within-firm concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (same-pair $97.0$–$99.96\%$; Table XXV). Restricting the candidate pool to the BCD baseline (Script 53) *raises* the within-firm concentration for B/C/D to $89.2$–$97.2\%$ any-pair (Firm B $97.2\%$, Firm C $92.3\%$, Firm D $89.2\%$) and $98.5$–$100\%$ same-pair — higher than on the full pool, because on the full pool some B/C/D collisions landed on Firm A's generically copy-like signatures; removing Firm A leaves each firm's collisions concentrated within itself. Within-firm collision concentration is therefore a universal Big-4 structural pattern, not a Firm-A peculiarity: Firm A is extreme in the *rate* at which the rule fires (reading (i)), but all four firms exhibit the same within-firm collision signature. @@ -548,10 +550,10 @@ We interpret the deployed HC thresholds as **specificity-anchored operating poin The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the deployed HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$). -Read against the **normative BCD specificity-proxy floor** rather than the contaminated all-Big-4 rate, the observed-deployed excess is larger: the per-signature observed rate is $\sim 43\times$ the BCD floor ($0.4958$ vs $0.0116$), and the per-document HC observed rate is $\sim 28\times$ the BCD floor ($0.6228$ vs $0.0226$): +Read against the **normative BCD specificity-proxy floor** rather than the contaminated all-Big-4 rate, the observed-deployed excess is larger: the per-signature observed rate is $\sim 84\times$ the BCD floor ($0.4958$ vs $0.0059$), and the per-document HC observed rate is $\sim 53\times$ the BCD floor ($0.6228$ vs $0.0117$): -- Per-signature: $0.4958 - 0.0116 = 0.4842$ ($48.4$ pp excess over the clean floor) -- Per-document HC: $0.6228 - 0.0226 = 0.6002$ ($60.0$ pp excess over the clean floor) +- Per-signature: $0.4958 - 0.0059 = 0.4899$ ($49.0$ pp excess over the clean floor) +- Per-document HC: $0.6228 - 0.0117 = 0.6111$ ($61.1$ pp excess over the clean floor) We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits are developed in §III-M. The excess is best read as an *observed same-CPA-pool excess over the normative inter-CPA floor* — a quantity that far exceeds what random inter-CPA candidate replacement among normative firms would produce — whose mechanism is not identifiable from descriptor-only data (§III-M). Anchoring the floor on the clean BCD baseline sharpens this contrast (the all-Big-4 floor would understate it by absorbing Firm A's reuse), while leaving the §III-M caveat — that the floor is an inter-CPA coincidence rate, not an intra-CPA genuine-hand-signing rate — fully in force; we do not attribute the excess to within-CPA handwriting repeatability or to image replication without further evidence. @@ -756,7 +758,7 @@ This section reports the firm-level cross-validation evidence motivating §III-J | Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp | | Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp | -(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L). +(Source: Script 37; screening label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.012$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L). ## H. Pixel-Identity Positive-Anchor Miss Rate @@ -832,7 +834,7 @@ This section reports the five-way per-signature + document-level worst-case clas (Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.) -The five-way **moderate-confidence advisory** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains the threshold provenance of its prior calibration (supplementary materials), but §III-L.3 **supersedes its claim strength**: on the normative BCD baseline this band carries a $\sim 0.19$ per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B). +The five-way **moderate-confidence advisory** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains the threshold provenance of its prior calibration (supplementary materials), but §III-L.3 **supersedes its claim strength**: on the normative BCD baseline this band carries a $\sim 0.175$ per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B). **Table XVII.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus. @@ -934,9 +936,9 @@ This section consolidates the empirical results that support the §III-L anchor- |---|---|---|---| | cos $> 0.95$ | $0.00026$ | $0.00060$ | $0.00014$ | | dHash $\leq 5$ | $0.00037$ | $0.00129$ | $0.00034$ | -| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000018}$ | $0.000140$ | $0.000004$ | +| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000010}$ | $0.000140$ | $0.000004$ | -BCD joint Wilson 95% $[0.000009, 0.000034]$ ($9$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-L.4). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$). +BCD joint Wilson 95% $[0.000004, 0.000023]$ ($5$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-L.4). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$). ### M.3 Pool-normalised per-signature ICCR (Script 52) @@ -944,7 +946,7 @@ BCD joint Wilson 95% $[0.000009, 0.000034]$ ($9$ of $5 \times 10^5$ pairs); all- | Baseline pool | Per-signature HC ICCR | CPA-bootstrap 95% CI | |---|---|---| -| BCD (primary) | $\mathbf{0.0116}$ | $[0.0094, 0.0141]$ | +| BCD (primary) | $\mathbf{0.0059}$ | $[0.0045, 0.0073]$ | | All-Big-4 (contamination comparison) | $0.1102$ | $[0.0908, 0.1330]$ | | BCD+non-Big-4 | $0.0083$ | $[0.0066, 0.0099]$ | @@ -956,10 +958,10 @@ The BCD floor is an order of magnitude below the all-Big-4 figure, which is domi | Alarm definition | BCD (primary) | All-Big-4 | BCD+non-Big-4 | |---|---|---|---| -| HC (dHash $\leq 5$) | $\mathbf{0.0226}$ | $0.1797$ | $0.0163$ | -| HC + MC (dHash $\leq 15$) | $0.1905$ | $0.3375$ | $0.1467$ | +| HC (dHash $\leq 5$) | $\mathbf{0.0117}$ | $0.1797$ | $0.0163$ | +| HC + MC (dHash $\leq 15$) | $0.1753$ | $0.3375$ | $0.1467$ | -Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $0.218$, Firm D $0.114$ (all-Big-4 pool: Firm A $0.620$, Firm B $0.160$, Firm C $0.163$, Firm D $0.088$). The HC band collapses by $\sim 8\times$ when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash $\leq 15$ adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-L.3). +Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.162$, Firm C $0.225$, Firm D $0.089$ (all-Big-4 pool: Firm A $0.620$, Firm B $0.160$, Firm C $0.163$, Firm D $0.088$). The HC band collapses by $\sim 8\times$ when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash $\leq 15$ adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-L.3). ### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44) @@ -972,7 +974,7 @@ Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $ | Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A | | log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size | -On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0116$), the residual firm spread collapses to within $\sim 3.5\times$ — odds ratios $1.73$ (Firm B), $0.49$ (Firm C), log-pool-size $3.29$ — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-L.4). +On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0059$), the residual firm spread collapses to within $\sim 3.5\times$ — odds ratios $1.73$ (Firm B), $0.49$ (Firm C), log-pool-size $3.29$ — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-L.4). Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$ while Firm A ranges $0.0541$–$0.5958$. The firm gap survives within matched pool sizes. @@ -997,7 +999,7 @@ Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\ | dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) | | dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) | -Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. Against the normative BCD floor (per-signature $0.0116$; per-document HC $0.0226$), the observed same-CPA-pool excess is $0.4842$ ($48.4$ pp, $\sim 43\times$) per-signature and $0.6002$ ($60.0$ pp, $\sim 28\times$) per-document; this excess is reported under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability. +Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. Against the normative BCD floor (per-signature $0.0059$; per-document HC $0.0117$), the observed same-CPA-pool excess is $0.4899$ ($49.0$ pp, $\sim 84\times$) per-signature and $0.6111$ ($61.1$ pp, $\sim 53\times$) per-document; this excess is reported under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability. # V. Discussion @@ -1014,7 +1016,7 @@ The Big-4 accountant-level distribution rejects unimodality on both marginals ( Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years. -We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-L.0). Three readings (§III-L.4) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide at only $0.0102$ — essentially the BCD floor ($0.0116$) — so Firm A is unremarkable *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 70\times$ the clean floor, versus $\sim 21$–$30\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D, with same-pair concentration $97$–$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$–$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-L.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish. +We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-L.0). Three readings (§III-L.4) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide essentially never ($0.0001$, below the BCD floor of $0.0059$) — so Firm A is unremarkable, indeed sub-baseline, *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 139\times$ the clean floor, versus $\sim 40$–$59\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D, with same-pair concentration $97$–$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$–$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-L.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish. ## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions @@ -1030,7 +1032,7 @@ The deployed HC sub-rule's specificity-proxy behaviour is characterised at three ## G. Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor -The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve $0\%$ positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the HC region). The complementary negative anchor is the §III-L.1 per-comparison ICCR on the normative BCD baseline ($0.000018$); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-L.0). +The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve $0\%$ positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the HC region). The complementary negative anchor is the §III-L.1 per-comparison ICCR on the normative BCD baseline ($0.000010$); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-L.0). ## H. Limitations @@ -1040,7 +1042,7 @@ Several limitations should be transparent. We group them into primary methodolog *No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M). -*Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline.* The cross-firm hit matrix of §III-L.4 shows that under the deployed rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$97.2\%$ at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-L.0); on this baseline the per-comparison HC rate falls from $0.00014$ to $0.000018$ and the per-signature HC rate from $0.1102$ to $0.0116$. A residual caveat survives even on the clean baseline: the BCD floor is an *inter-CPA coincidence* rate, not an *intra-CPA genuine-hand-signing* rate, so the observed-versus-floor excess (§III-L.6) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities. +*Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline.* The cross-firm hit matrix of §III-L.4 shows that under the deployed rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$97.2\%$ at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-L.0); on this baseline the per-comparison HC rate falls from $0.00014$ to $0.000010$ and the per-signature HC rate from $0.1102$ to $0.0059$. A residual caveat survives even on the clean baseline: the BCD floor is an *inter-CPA coincidence* rate, not an *intra-CPA genuine-hand-signing* rate, so the observed-versus-floor excess (§III-L.6) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities. *Mechanism attribution for the firm-level heterogeneity is not identifiable from descriptor-only data.* The observed firm-level contrast (Firm A's per-document HC$+$MC ICCR of $0.62$ versus $0.09$–$0.16$ at Firms B/C/D; within-firm collision concentration $77$–$99\%$ under the deployed any-pair rule; byte-identical evidence of §IV-H) is consistent with at least three non-mutually-exclusive firm-level mechanisms: (i) template, stamp, or e-signature production reuse; (ii) digitisation-pipeline homogeneity — shared scanners, common PDF generation infrastructure, identical compression and form-template settings — that systematically inflates image-descriptor similarity without signature replication; and (iii) signing-style or training homogeneity that produces correlated handwritten signatures within a firm. The descriptor pair (cosine, dHash) operates at the image-similarity level and is, by construction, indifferent to which mechanism generated a given near-identical pair. We therefore report the firm contrast as a methodological observation — the framework discriminates at firm-level resolution — rather than as a mechanism finding. The byte-identical Firm A signatures across $\sim 50$ distinct partners (§IV-H, §V-C) provide direct evidence for (i) at Firm A specifically, but do not exclude additive contribution from (ii) or (iii); the milder within-firm collision patterns at Firms B/C/D are individually consistent with all three mechanisms. Image-acquisition metadata (scanner identifiers, PDF generator fingerprints, compression-codec markers), partner-level intent records, or controlled hand-signed baselines would be needed to attribute the contrast across (i), (ii), and (iii). @@ -1050,9 +1052,9 @@ Several limitations should be transparent. We group them into primary methodolog *Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the deployed box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures). -*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence advisory band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-L.3 supersedes the MC band's *claim strength* — its $\sim 0.19$ per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery. +*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence advisory band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-L.3 supersedes the MC band's *claim strength* — its $\sim 0.175$ per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery. -*Deployed-rate excess is not a presumed true-positive rate.* The per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the normative inter-CPA proxy floor (HC: $0.023$ on the BCD baseline) — $\sim 60$ pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal. +*Deployed-rate excess is not a presumed true-positive rate.* The per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the normative inter-CPA proxy floor (HC: $0.012$ on the BCD baseline) — $\sim 60$ pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal. *A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime. @@ -1079,9 +1081,9 @@ Several limitations should be transparent. We group them into primary methodolog We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature screening rule with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population. We emphasise that the operating thresholds are operator-tunable and that the system performs semi-automated triage — surfacing replication candidates from hundreds of thousands of signatures for human adjudication — rather than autonomous forensic classification; its central deliverable is the label-free calibration methodology by which an operator selects and characterises a screening operating point. -Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D, with Firm A held out as an out-of-sample target to avoid circularity): on this clean baseline the deployed HC rule yields per-comparison ICCR $0.000018$, per-signature $0.0116$, and per-document $0.023$ — roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$) — while the dHash$\leq 15$ moderate-confidence band, which retains a $\sim 0.19$ per-document coincidence rate even on the clean baseline, is repositioned as a low-specificity advisory tier; with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: against the clean BCD floor the deployed rule fires on each firm's own pools at $\sim 70\times$ (Firm A) and $\sim 21$–$30\times$ (Firms B/C/D), while Firm A scored cross-firm against the baseline coincides only at the floor ($0.0102$); two logistic regressions (full-Big-4 Firm-A-reference odds ratios $0.053$/$0.010$/$0.027$; BCD-only Firm-D-reference residual spread within $\sim 3.5\times$) show Firm A is the singular outlier and Firms B/C/D an internally homogeneous baseline — reported as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$–$97\%$ at Firms B/C/D on the clean BCD pool (same-pair $97$–$100\%$ across all four firms) — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector. +Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D, with Firm A held out as an out-of-sample target to avoid circularity): on this clean baseline the deployed HC rule yields per-comparison ICCR $0.000010$, per-signature $0.0059$, and per-document $0.012$ — roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$) — while the dHash$\leq 15$ moderate-confidence band, which retains a $\sim 0.175$ per-document coincidence rate even on the clean baseline, is repositioned as a low-specificity advisory tier; with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: against the clean BCD floor the deployed rule fires on each firm's own pools at $\sim 139\times$ (Firm A) and $\sim 40$–$59\times$ (Firms B/C/D), while Firm A scored cross-firm against the clean 2013–2019 baseline coincides essentially never cross-firm ($0.0001$); two logistic regressions (full-Big-4 Firm-A-reference odds ratios $0.053$/$0.010$/$0.027$; BCD-only Firm-D-reference residual spread within $\sim 3.5\times$) show Firm A is the singular outlier and Firms B/C/D an internally homogeneous baseline — reported as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$–$97\%$ at Firms B/C/D on the clean BCD pool (same-pair $97$–$100\%$ across all four firms) — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector. -Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate $0.82$ vs $0.24$–$0.35$, $\sim 70\times$ vs $\sim 21$–$30\times$ the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work. +Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate $0.82$ vs $0.24$–$0.35$, $\sim 139\times$ vs $\sim 40$–$59\times$ the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work. # Appendix A. Supplementary Diagnostic Detail diff --git a/signature_analysis/54_bcd_floor_temporal.py b/signature_analysis/54_bcd_floor_temporal.py new file mode 100644 index 0000000..95dfaff --- /dev/null +++ b/signature_analysis/54_bcd_floor_temporal.py @@ -0,0 +1,99 @@ +#!/usr/bin/env python3 +"""Script 54: temporal stability of the BCD inter-CPA floor. +Does the normative BCD per-comparison HC coincidence floor drift over time / +get contaminated by post-2020 e-signing? Compares eras full / 2013-2019 / +2020-2023 using the pool-size-independent per-comparison joint HC ICCR +(cos>0.95 & dHash<=5) on BCD inter-CPA pairs (N=500k, seed 42), plus the +observed deployed per-signature HC rate by firm by era. Read-only. +""" +import sqlite3 +from collections import defaultdict +import numpy as np + +DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db' +FIRM_A = '勤業眾信聯合' +BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合') +ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'} +SEED = 42 +N_PAIRS = 500_000 +POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8) + + +def wilson(k, n, z=1.96): + if n == 0: + return (None, None) + p = k/n; d = 1+z*z/n + c = (p+z*z/(2*n))/d + h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d + return (max(0.0, c-h), min(1.0, c+h)) + + +conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True) +cur = conn.cursor() +cur.execute(""" + SELECT s.assigned_accountant, a.firm, CAST(substr(s.year_month,1,4) AS INT), + s.feature_vector, s.dhash_vector, + s.max_similarity_to_same_accountant, s.min_dhash_independent + FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name + WHERE a.firm IN (?,?,?,?) AND s.year_month IS NOT NULL + AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""", BIG4) +rows = cur.fetchall() +conn.close() + +ERAS = {'full 2013-2023': lambda y: True, + '2013-2019 (pre-drift)': lambda y: 2013 <= y <= 2019, + '2020-2023': lambda y: 2020 <= y <= 2023} + + +def per_comparison_floor(era_fn, label): + # BCD-only (exclude Firm A), era-restricted + keep = [r for r in rows if r[1] != FIRM_A and era_fn(r[2])] + feats = np.stack([np.frombuffer(r[3], np.float32) for r in keep]).astype(np.float32) + feats /= np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-9, None) + dh = np.stack([np.frombuffer(r[4], np.uint8) for r in keep]) + cpas = np.array([r[0] for r in keep]) + by = defaultdict(list) + for i, c in enumerate(cpas): + by[c].append(i) + accts = list(by.keys()) + rng = np.random.default_rng(SEED) + cos = np.empty(N_PAIRS, np.float32); dv = np.empty(N_PAIRS, np.int32) + na = len(accts) + for t in range(N_PAIRS): + i, j = rng.choice(na, 2, replace=False) + a1, a2 = accts[i], accts[j] + k1 = by[a1][int(rng.integers(0, len(by[a1])))] + k2 = by[a2][int(rng.integers(0, len(by[a2])))] + cos[t] = feats[k1] @ feats[k2] + dv[t] = POP[dh[k1] ^ dh[k2]].sum() + joint = int(((cos > 0.95) & (dv <= 5)).sum()) + lo, hi = wilson(joint, N_PAIRS) + print(f' [{label}] BCD per-comparison HC floor = {joint/N_PAIRS:.6f} ' + f'({joint}/{N_PAIRS}) Wilson95% [{lo:.6f},{hi:.6f}] ' + f'(n_sig={len(keep):,}, CPAs={na})') + return joint/N_PAIRS + + +print('=== (1) BCD per-comparison HC floor by era (pool-size-independent) ===') +floors = {lab: per_comparison_floor(fn, lab) for lab, fn in ERAS.items()} + +print('\n=== (2) Observed deployed per-signature HC rate by firm by era ===') +print(' (max_sim>0.95 & min_dh<=5 on actual same-CPA pools)') +for lab, fn in ERAS.items(): + print(f' --- {lab} ---') + for fm_zh in BIG4: + sub = [r for r in rows if r[1] == fm_zh and fn(r[2]) + and r[5] is not None and r[6] is not None] + if not sub: + continue + k = sum(1 for r in sub if r[5] > 0.95 and r[6] <= 5) + print(f' Firm {ALIAS[fm_zh]}: {k/len(sub):.4f} ({k}/{len(sub)})') + +print('\n=== A-vs-floor multiple by era (observed A HC / BCD floor) ===') +for lab, fn in ERAS.items(): + a = [r for r in rows if r[1] == FIRM_A and fn(r[2]) and r[5] is not None and r[6] is not None] + a_rate = sum(1 for r in a if r[5] > 0.95 and r[6] <= 5)/len(a) if a else 0 + fl = floors[lab] + # per-comparison floor is not directly comparable to observed pooled rate; + # report ratio vs the per-signature floor proxy from Script 52 (0.0116 full). + print(f' {lab}: observed A HC = {a_rate:.3f}; per-comparison floor = {fl:.6f}') diff --git a/signature_analysis/55_bcd_pre2020_calibration.py b/signature_analysis/55_bcd_pre2020_calibration.py new file mode 100644 index 0000000..a95cce5 --- /dev/null +++ b/signature_analysis/55_bcd_pre2020_calibration.py @@ -0,0 +1,144 @@ +#!/usr/bin/env python3 +"""Script 55: PRIMARY calibration on the clean pre-e-signature baseline +BCD 2013-2019 (Firms B/C/D, fiscal years 2013-2019). Rationale: co-author +interviews confirm B/C/D progressively adopted e-signature systems after 2020 +(staggered timing), so 2013-2019 BCD is the construct-clean hand-signing +baseline. Canonical retry-loop sampler (matches Scripts 43/45/52), any-pair. +Reports the floor + Firm A (all years) scored out-of-sample against it, and +BCD 2020+ scored against the same threshold. Read-only. +""" +import sqlite3 +from collections import defaultdict, Counter +import numpy as np + +DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db' +FIRM_A = '勤業眾信聯合' +BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合') +ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'} +SEED = 42 +N_BOOT = 1000 +POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8) + + +def wilson(k, n, z=1.96): + if n == 0: + return (None, None) + p = k/n; d = 1+z*z/n; c = (p+z*z/(2*n))/d + h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d + return (max(0.0, c-h), min(1.0, c+h)) + + +def canon_sampler(rng, n, npool, same, all_idx): + need = npool; cand = []; att = 0 + while need > 0 and att < 10: + draw = rng.choice(n, size=need*2, replace=True) + ok = draw[~np.isin(draw, same)] + cand.extend(ok[:need].tolist()); need -= len(ok[:need]); att += 1 + if need > 0: + pm = np.ones(n, bool); pm[same] = False + cand.extend(rng.choice(all_idx[pm], size=need, replace=False).tolist()) + return np.array(cand[:npool], dtype=np.int64) + + +conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True) +cur = conn.cursor() +cur.execute("""SELECT s.assigned_accountant,a.firm,CAST(substr(s.year_month,1,4) AS INT), + s.source_pdf,s.feature_vector,s.dhash_vector, + s.max_similarity_to_same_accountant,s.min_dhash_independent + FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name + WHERE a.firm IN (?,?,?,?) AND s.year_month IS NOT NULL + AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""", BIG4) +rows = cur.fetchall() +conn.close() + + +def prep(rec): + feats = np.stack([np.frombuffer(r[4], np.float32) for r in rec]).astype(np.float32) + norms = np.linalg.norm(feats, axis=1, keepdims=True); norms[norms == 0] = 1.0 + feats /= norms + dh = np.stack([np.frombuffer(r[5], np.uint8) for r in rec]) + return feats, dh + + +def floor_on(baseline_rec, label): + """Canonical per-sig/per-doc HC floor on a baseline population.""" + feats, dh = prep(baseline_rec) + n = len(baseline_rec) + cpas = np.array([r[0] for r in baseline_rec]) + firms = np.array([ALIAS[r[1]] for r in baseline_rec]) + docs = np.array([r[3] for r in baseline_rec]) + cidx = defaultdict(list) + for i, c in enumerate(cpas): + cidx[c].append(i) + cidx = {c: np.array(v) for c, v in cidx.items()} + psize = {c: len(v)-1 for c, v in cidx.items()} + all_idx = np.arange(n) + rng = np.random.default_rng(SEED) + mx = np.zeros(n, np.float32); mn = np.full(n, 64, np.int32) + for si in range(n): + np_ = psize[cpas[si]] + if np_ <= 0: + continue + cand = canon_sampler(rng, n, np_, cidx[cpas[si]], all_idx) + cosv = feats[cand] @ feats[si] + mx[si] = cosv.max(); mn[si] = int(POP[dh[cand] ^ dh[si]].sum(axis=1).min()) + hc = (mx > 0.95) & (mn <= 5); d2 = (mx > 0.95) & (mn <= 15) + k = int(hc.sum()) + rng2 = np.random.default_rng(SEED+1); cl = list(cidx.keys()) + bs = np.array([hc[np.concatenate([cidx[cl[i]] for i in rng2.choice(len(cl), len(cl), True)])].mean() + for _ in range(N_BOOT)]) + print(f'\n [{label}] n_sig={n:,}, CPAs={len(cidx)}') + print(f' per-sig HC floor = {k/n:.4f} ({k}/{n}) CPA-boot95% [{np.percentile(bs,2.5):.4f},{np.percentile(bs,97.5):.4f}]') + dd1 = defaultdict(bool); dd2 = defaultdict(bool); dfirm = {} + for i in range(n): + if hc[i]: dd1[docs[i]] = True + if d2[i]: dd2[docs[i]] = True + dfirm.setdefault(docs[i], []).append(firms[i]) + dd1.setdefault(docs[i], False); dd2.setdefault(docs[i], False) + dl = list(dd1.keys()); nd = len(dl) + print(f' per-doc HC = {sum(dd1[d] for d in dl)/nd:.4f}; per-doc HC+MC = {sum(dd2[d] for d in dl)/nd:.4f} (n_doc={nd:,})') + dom = {d: Counter(dfirm[d]).most_common(1)[0][0] for d in dl} + for f in ['B', 'C', 'D']: + ds = [d for d in dl if dom[d] == f] + if ds: + print(f' Firm {f} per-doc HC+MC: {sum(dd2[d] for d in ds)/len(ds):.4f} ({sum(dd2[d] for d in ds)}/{len(ds)})') + return k/n + + +def a_vs_baseline(baseline_rec, a_rec, label): + bf, bdh = prep(baseline_rec); nb = len(baseline_rec) + a_cpa = defaultdict(list) + for i, r in enumerate(a_rec): + a_cpa[r[0]].append(i) + psize = {c: len(v)-1 for c, v in a_cpa.items()} + rng = np.random.default_rng(SEED) + hc = np.zeros(len(a_rec), bool) + for i, r in enumerate(a_rec): + np_ = psize[r[0]] + if np_ <= 0: + continue + cand = rng.integers(0, nb, size=np_) + sf = np.frombuffer(r[4], np.float32).astype(np.float32); sf /= max(np.linalg.norm(sf), 1e-9) + cosv = bf[cand] @ sf + if (cosv > 0.95).any(): + dist = POP[bdh[cand] ^ np.frombuffer(r[5], np.uint8)].sum(axis=1) + hc[i] = bool(((cosv > 0.95) & (dist <= 5)).any()) + k = int(hc.sum()); n = len(a_rec); lo, hi = wilson(k, n) + print(f' [{label}] Firm A (all yrs) vs BCD-2013-2019 pool: per-sig HC = {k/n:.4f} ({k}/{n}) [{lo:.5f},{hi:.5f}]') + + +bcd_pre = [r for r in rows if r[1] != FIRM_A and 2013 <= r[2] <= 2019] +bcd_post = [r for r in rows if r[1] != FIRM_A and r[2] >= 2020] +A_all = [r for r in rows if r[1] == FIRM_A] + +print('=== PRIMARY floor: BCD 2013-2019 ===') +fl = floor_on(bcd_pre, 'BCD 2013-2019 (PRIMARY)') + +print('\n=== Firm A scored against the BCD-2013-2019 threshold ===') +a_vs_baseline(bcd_pre, A_all, 'A out-of-sample') +A_obs = [r for r in A_all if r[6] is not None and r[7] is not None] +ak = sum(1 for r in A_obs if r[6] > 0.95 and r[7] <= 5) +print(f' Firm A observed (all yrs, own pools): per-sig HC = {ak/len(A_obs):.4f} -> {ak/len(A_obs)/fl:.0f}x the BCD-2013-2019 floor') + +print('\n=== (optional) BCD 2020+ floor, same method (may be inflated by e-signing) ===') +floor_on(bcd_post, 'BCD 2020-2023 (post e-signing)')