diff --git a/paper/Paper_A_IEEE_Access_Draft_v4.3_20260604.pandoc.docx b/paper/Paper_A_IEEE_Access_Draft_v4.3_20260604.pandoc.docx index 8f9f701..1b00b8d 100644 Binary files a/paper/Paper_A_IEEE_Access_Draft_v4.3_20260604.pandoc.docx and b/paper/Paper_A_IEEE_Access_Draft_v4.3_20260604.pandoc.docx differ diff --git a/paper/paper_a_v4_combined.md b/paper/paper_a_v4_combined.md index 1790be5..7cd1dad 100644 --- a/paper/paper_a_v4_combined.md +++ b/paper/paper_a_v4_combined.md @@ -306,7 +306,7 @@ A1 is plausible for high-volume stamping or firm-level electronic signing workfl 2. **Within-firm cross-CPA collision structure analysis.** §III-J.1 reports a Big-4 cross-firm hit-matrix analysis that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work. -3. **Firm A as templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-L K=3 component cross-tab; byte-level pair analysis referenced in §III-H.2). We retain Firm A within the Big-4 scope as a descriptive case study of the templated end rather than as the calibration anchor for thresholds. +3. **Firm A as the out-of-sample templated-end target.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-L K=3 component cross-tab; byte-level pair analysis referenced in §III-H.2). It is held out of the calibration negative anchor and scored as an out-of-sample target against the normative Firms-B/C/D baseline (§III-I.0, §III-J), illustrating the templated end the screening surfaces rather than serving as a calibration anchor for thresholds. 4. **Leave-one-firm-out fold feasibility.** §III-M reports leave-one-firm-out (LOOO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm. @@ -330,11 +330,11 @@ The remainder of this section (§III-H.2) describes the reference populations us ### H.2. Reference Populations -The supporting diagnostics use two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking. Neither population is the calibration anchor for the deployed threshold; both are descriptive references that inform the cross-checks in §III-M. +Beyond the normative Firms-B/C/D baseline that anchors the calibration (§III-I), two further populations inform the analysis: Firm A, the **out-of-sample templated-end target** characterised in §III-J, and the 249 non-Big-4 CPAs, an out-of-target **reverse-anchor reference** for the internal-consistency checks of §III-M. Neither is the calibration negative anchor: Firm A is held out of it (§III-I.0), and the non-Big-4 reference informs the §III-M cross-checks (it also appears as a robustness scope in §III-I). -**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-L; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures. +**The out-of-sample target: Firm A as the templated end.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-L; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures. -Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-L), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-M's pixel-identity positive-anchor miss rate. +Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-L), not from Firm A alone. Firm A's role is as the out-of-sample templated-end target (§III-J): it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-M's pixel-identity positive-anchor miss rate. **External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-M's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores used as a cross-check on the per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked. @@ -851,7 +851,7 @@ This section consolidates the empirical results that support the §III-I anchor- | dHash $\leq 5$ | $0.00037$ | $0.00129$ | $0.00034$ | | Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000010}$ | $0.000140$ | $0.000004$ | -BCD joint Wilson 95% $[0.000004, 0.000023]$ ($5$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-J.1). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$). +BCD joint Wilson 95% $[0.000004, 0.000023]$ ($5$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$; BCD+non-Big-4 joint $[0.000001, 0.000015]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-J.1). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$). ### M.3 Pool-normalised per-signature ICCR (Script 52) @@ -929,7 +929,7 @@ The Big-4 accountant-level distribution rejects unimodality on both marginals ( Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-L), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years. -We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-I.0). Three readings (§III-J.1) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide essentially never ($0.0001$, below the BCD floor of $0.0059$) — so Firm A is unremarkable, indeed sub-baseline, *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 139\times$ the clean floor, versus $\sim 40$–$59\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D, with same-pair concentration $97$–$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$–$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-I.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish. +We treat Firm A as the **out-of-sample templated-end target**: held out of the calibration negative anchor and scored against the normative Firms-B/C/D baseline (§III-I.0) rather than used as a calibration input. Three readings (§III-J.1) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide essentially never ($0.0001$, below the BCD floor of $0.0059$) — so Firm A is unremarkable, indeed sub-baseline, *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 139\times$ the clean floor, versus $\sim 40$–$59\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$–$97\%$ at Firms B/C/D, with same-pair concentration $97$–$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$–$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-I.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish. ## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions