Rewrite §III v7: anchor-based ICCR framework + composition-decomp finding

Major §III restructuring after codex rounds 29-34 demolished the
distributional path to thresholds (Scripts 39b-39e prove (cos, dHash)
multimodality is composition-driven + integer-tie artefact).

v4.0 pivots to anchor-based multi-level inter-CPA coincidence-rate
(ICCR) calibration via Scripts 40b, 43, 44, 45, 46:

- §III-G: scope justification rewritten (LOOO + Firm A case study +
  within-firm collision structure; dropped "smallest scope rejects
  unimodality" rationale); added sample-size reconciliation
  (150,442 descriptor-complete vs 150,453 vector-complete; 437
  accountant-level vs 468 all)
- §III-I: new sub-section I.4 composition decomposition (2x2 factorial
  centred + jittered Big-4 pooled dh p=0.35); I.5 conclusion of no
  natural threshold
- §III-J: K=3 recast as firm-compositional descriptive partition
  (not three mechanism clusters); bridge to §III-L.4 cross-firm
  hit matrix added
- §III-K: Score 1 reframed as firm-composition position score
- §III-L: NEW major sub-section — anchor-based threshold calibration
  with L.0 methodology, L.1 per-comparison ICCR (replicates v3
  cos>0.95 -> 0.0006; new dh<=5 -> 0.0013; joint -> 0.00014),
  L.2 pool-normalised per-signature ICCR (any-pair HC 11.02%;
  per-firm A 25.94% vs B/C/D <1.5%), L.3 doc-level ICCR (HC 18%;
  HC+MC 34%), L.4 firm heterogeneity logistic OR 0.01-0.05 +
  cross-firm hit matrix (98-100% within-firm), L.5 alert-rate
  sensitivity (HC threshold locally sensitive not plateau-stable),
  L.6 observed deployed alert rate excess over inter-CPA proxy
- §III-M: NEW sub-section — multi-tool validation strategy under
  unsupervised setting; 9 partial-evidence diagnostics each with
  disclosed untested assumption; positioning as anchor-calibrated
  screening framework with human-in-the-loop review, NOT validated
  forensic detector
- Terminology: "FAR" replaced with "inter-CPA coincidence rate
  (ICCR)" throughout; primary metric name change documented in
  §III-L.0
- Provenance table: ~35 new rows for Scripts 39b-e/40b/43-46;
  "key numerical claims" instead of "every numerical claim"
- Removed v2-v6 internal changelog metadata; v7 draft note added

Codex round-32 SOUND_WITH_QUALIFICATIONS, round-33 GO_WITH_REVISIONS,
round-34 READY_WITH_NARROW_FIXES (all 8 patches applied).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 723a3f6eaf (parent 2f05d6f0c9), 2026-05-13 17:27:01 +08:00, +303 -66
# Section III. Methodology — v4.0 Draft v7 (post codex rounds 21–34)
> **Draft note (2026-05-13, v7; internal — remove before submission).** This file replaces the §III-G through §III-M block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here. The §III-G through §III-M block has been substantially restructured between v6 and v7 (2026-05-13): codex round-29 demolished the distributional path to thresholds (Scripts 39b–39e prove (cos, dHash) multimodality is composition + integer artefact); v4.0 pivots to anchor-based multi-level inter-CPA coincidence-rate calibration (Scripts 40b, 43, 44, 45, 46); §III-I is rewritten as the no-natural-threshold diagnostic; §III-J is recast as a firm-compositional descriptive partition (not three mechanism clusters); §III-L is a new major sub-section on anchor-based threshold calibration; §III-M is a new sub-section on validation strategy and limitations under the unsupervised setting. Prior internal draft notes (v2–v6 changelog) have been moved to `paper/v4/CHANGELOG.md`.
>
> Empirical anchors throughout reference Scripts 32–46 on branch `paper-a-v4-big4`; a curated provenance table appears at the end of this section listing the principal numerical claims with their script and report path.
## G. Unit of Analysis and Scope
We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inherited inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I; reported under prior "FAR" terminology in v3.x). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K).
At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.
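The per-CPA aggregation described above can be sketched as follows — a minimal stand-in assuming per-signature records arrive as `(cpa_id, cos_s, dhash_s)` tuples; the function and field names are illustrative, not the pipeline's actual API:

```python
from collections import defaultdict

MIN_SIG = 10  # stability threshold for accountant-level analyses


def per_cpa_means(records):
    """Aggregate per-signature descriptors to per-CPA means.

    `records` is an iterable of (cpa_id, cos_s, dhash_s) tuples, where
    cos_s is the per-signature best-match cosine and dhash_s the
    independent-minimum dHash. CPAs with fewer than MIN_SIG signatures
    are excluded here but remain in per-signature analyses upstream.
    """
    by_cpa = defaultdict(list)
    for cpa_id, cos_s, dhash_s in records:
        by_cpa[cpa_id].append((cos_s, dhash_s))
    means = {}
    for cpa_id, sigs in by_cpa.items():
        if len(sigs) < MIN_SIG:
            continue  # excluded from accountant-level analyses only
        n = len(sigs)
        means[cpa_id] = (sum(c for c, _ in sigs) / n,
                        sum(d for _, d in sigs) / n)
    return means
```

The returned pair per CPA corresponds to $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$; the mean is a summary statistic only, per the caveat above.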
We adopt one stipulation about same-CPA pair detectability:
A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.
**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, §III-L, and the v4-new analyses in §IV-D through §IV-J) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C, §IV-I (inter-CPA negative-anchor coincidence rate), and §IV-L (feature-backbone ablation) report inherited corpus-wide v3.x material that v4.0 does not re-scope to Big-4. §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ — the threshold for accountant-level analyses (Scripts 36, 38) — totalling 150,442 Big-4 signatures with both pre-computed descriptors available. Restricting the v4-new analyses to Big-4 is a methodological choice driven by four considerations:
1. **Leave-one-firm-out fold feasibility.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm.
2. **Firm A as templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-J K=3 component cross-tab; v3.x byte-level pair analysis referenced in §III-H). v4.0 retains Firm A within the Big-4 scope as a descriptive case study of the templated end, rather than treating Firm A as the calibration anchor for thresholds (the v3.x role of Firm A).
3. **Within-firm cross-CPA collision structure analysis.** §III-L.4 reports a Big-4 cross-firm hit-matrix analysis (Script 44) that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work.
4. **Restricted generalisability claim.** v4.0's primary claims are scoped to the Big-4 audit-report context; we do not assert that the same descriptive mixture structure or operational alert behaviour extends to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H's reverse-anchor internal-consistency check, (b) as a robustness comparison in §IV-K, and (c) as a corroborating-population check on the dHash discrete-mass-point artefact in §III-I.4 (Script 39c). Generalisation beyond Big-4 is left as future work.
We earlier (v4.0 first draft) listed "statistical multimodality at the accountant level" among the scope justifications, on the basis that the Hartigan dip test rejects unimodality on the Big-4 accountant-level marginals. §III-I.4 reports diagnostics (Scripts 39b–39e) that explain the rejection as a joint effect of between-firm composition shift and dHash integer mass points, not as evidence of within-population continuous bimodality. We therefore no longer list dip-test multimodality among the Big-4 scope rationales; the K=3 mixture is retained as a descriptive partition (§III-J), not as inferential evidence for two mechanism modes.
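The composition-shift mechanism invoked here can be illustrated with a toy decomposition (synthetic data and parameters, not the paper's fitted values): pooling two internally unimodal firm distributions whose means differ splits the pooled variance exactly into a within-firm term and a between-firm composition term, and within-firm centring removes the composition term.

```python
import random
import statistics

random.seed(42)

# Two hypothetical firms, each internally unimodal, with a between-firm
# location shift; numbers are illustrative, not the paper's fitted values.
firm_x = [random.gauss(0.95, 0.01) for _ in range(200)]
firm_y = [random.gauss(0.98, 0.01) for _ in range(200)]
pooled = firm_x + firm_y

# Law of total variance (equal group sizes): the pooled spread splits
# exactly into a within-firm term and a between-firm composition term.
within = statistics.mean([statistics.pvariance(firm_x),
                          statistics.pvariance(firm_y)])
grand = statistics.mean(pooled)
between = statistics.mean([(statistics.mean(g) - grand) ** 2
                           for g in (firm_x, firm_y)])
total = statistics.pvariance(pooled)

# Centring each firm at its own mean removes the composition shift;
# any multimodality remaining after centring would have to be
# within-population rather than compositional.
centred = ([v - statistics.mean(firm_x) for v in firm_x]
           + [v - statistics.mean(firm_y) for v in firm_y])
```

This is only the location-shift half of the §III-I.4 finding; the dHash integer-mass-point artefact is a separate discreteness effect not modelled in this sketch.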
**Sample-size reconciliation.** Two Big-4 signature counts appear in this section and §IV: $n = 150{,}442$ for analyses using the pre-computed per-signature descriptors $\text{cos}_s$ (`max_similarity_to_same_accountant`) and $\text{dHash}_s$ (`min_dhash_independent`), and $n = 150{,}453$ for analyses recomputing pair-level metrics directly from the stored feature and dHash byte vectors (Scripts 40b, 43, 44). The $11$-signature difference reflects descriptor-completion status: $11$ signatures have feature vectors and dHash byte vectors stored but lack the pre-computed extrema. The $11$ signatures are negligible at population scale and do not affect any reported coincidence rate within $0.01$ percentage point. The CPA counts $468$ (all Big-4 CPAs with both vectors stored) and $437$ (Big-4 CPAs with $n_{\text{sig}} \geq 10$ for accountant-level stability) likewise reflect a single uniform exclusion rule rather than analysis-specific subsetting.
## H. Reference Populations
v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1").
The byte-level pair analysis reported in v3.x §IV-F.1 identifies 145 Firm A pixel-identical signatures at the signature level (Script 40 verifies the 145/262 split among Big-4 pixel-identical signatures); the additional details that v3.x attributes to this analysis (50 distinct Firm A partners of 180 registered; 35 byte-identical matches spanning different fiscal years) are inherited from the Script 28 / Appendix B byte-decomposition output and were not regenerated in the v4.0 spike scripts. We retain those v3.x details by reference and mark them in the provenance table as "inherited from v3 §IV-F.1 / Script 28."
In v4.0, Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate.
**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores that v4.0 uses as a cross-check on the inherited per-signature classifier.
The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked.
The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 low-cos / high-dHash component (cos $= 0.946$, dHash $= 9.17$; §III-J); compared to the Big-4 high-cos / low-dHash component (cos $= 0.983$, dHash $= 2.41$; §III-J) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end of the descriptor plane. This is a "deviation in the less-replication-dominated descriptor-position direction" measure, not a "deviation toward the templated descriptor-position" measure; the reference is the less-replication-dominated population.
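A schematic of the reverse-anchor metric, using the empirical reference CDF as a stand-in for the fitted MCD-Gaussian marginal (function name hypothetical):

```python
from bisect import bisect_right


def reverse_anchor_percentile(cpa_mean_cos, reference_cos_means):
    """Percentile of a Big-4 CPA's mean cosine within the non-Big-4
    reference marginal cosine distribution (empirical-CDF stand-in for
    the fitted Gaussian marginal). Lower values sit further into the
    reference's left tail, i.e. further from the templated end."""
    ref = sorted(reference_cos_means)
    return bisect_right(ref, cpa_mean_cos) / len(ref)
```

Under this convention no separate sign flip is needed: low percentile already encodes "further from the templated end", matching the interpretation in the text.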
## I. Distributional Diagnostics: Why the Composition Path Does Not Yield a Natural Threshold
This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G and tests whether the distribution provides distributional support — in the form of within-population bimodality — for the operational thresholds inherited from v3.x. We apply four diagnostic procedures in turn: a univariate unimodality test on each accountant-level marginal; a 2D Gaussian mixture fit (developed in §III-J); a density-smoothness diagnostic; and a composition decomposition that distinguishes within-population multimodality from between-firm location-shift artefacts (the v4-new diagnostic battery). The four diagnostics jointly imply that the operational thresholds are *not* anchored by distributional bimodality: §III-L develops an anchor-based calibration framework that does not require this assumption.
**1. Hartigan dip test on each accountant-level marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34).
For comparison, no rejection of unimodality holds in the comparison scopes tested in Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A CPAs pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). Single-firm dip tests for Firms B, C, and D were not separately computed; the comparison scopes above sufficed to establish that no narrower-than-Big-4 *tested* scope at the accountant level rejected unimodality. The accountant-level Big-4 rejection is a descriptive observation; §III-I.4 below shows that the rejection is fully explained by between-firm location-shift effects rather than within-population bimodality.
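The zero-exceedance reporting convention can be sketched generically — the dip statistic itself is not implemented here; `null_stats` stands in for the statistic evaluated on the bootstrap null replicates:

```python
def bootstrap_p_upper_bound(obs_stat, null_stats):
    """Empirical bootstrap p-value with an explicit resolution floor.

    `null_stats` holds the statistic's values on n_boot null replicates.
    When no replicate reaches the observed value the empirical p is 0,
    which we report as the bound p < 1/n_boot rather than p = 0.
    """
    n_boot = len(null_stats)
    exceed = sum(1 for s in null_stats if s >= obs_stat)
    if exceed == 0:
        return ("<", 1.0 / n_boot)  # e.g. n_boot = 2000 gives p < 5e-4
    return ("=", exceed / n_boot)
```

With $n_{\text{boot}} = 2000$ this yields exactly the $p < 5 \times 10^{-4}$ bound reported in the tables.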
**2. K=2 / K=3 Gaussian mixture fits (descriptive partition).** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42; Script 34) recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ over $n_{\text{boot}} = 500$ resamples. The 3-component fit (§III-J) is BIC-preferred — using the convention that lower BIC is preferred, $\text{BIC}(K{=}3) - \text{BIC}(K{=}2) = -3.48$ (Script 36). The $\Delta$BIC magnitude is small in absolute terms; we do not treat $\Delta\text{BIC} = 3.5$ alone as decisive evidence for K=3 as a population mixture.
Following §III-I.4 we treat both K=2 and K=3 fits as *descriptive partitions* of the joint Big-4 distribution that reflect firm-composition structure (Firm A vs others; §III-J) rather than as inferential evidence for two or three latent population modes.
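The marginal-crossing values quoted above can be recovered from a fitted two-component mixture by bisecting for the point between the component means where the weighted component densities are equal. A minimal sketch: the component means and weights are the Script 34 values, but the standard deviations and the search bracket are assumptions for illustration (the fitted covariances are not reproduced here).

```python
import math

def gauss_pdf(x, mu, sigma):
    """Normal density N(mu, sigma) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def marginal_crossing(w1, mu1, s1, w2, mu2, s2, lo, hi, tol=1e-8):
    """Bisect for the x where the weighted marginal densities are equal:
    w1 * N(mu1, s1)(x) == w2 * N(mu2, s2)(x), assuming a sign change on [lo, hi]."""
    f = lambda x: w1 * gauss_pdf(x, mu1, s1) - w2 * gauss_pdf(x, mu2, s2)
    a, b = lo, hi
    fa = f(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        if (f(m) > 0) == (fa > 0):
            a, fa = m, f(m)
        else:
            b = m
    return 0.5 * (a + b)

# Component means/weights from Script 34; standard deviations (0.012, 0.004)
# and the bracket are assumed, so x_star only illustrates the method.
x_star = marginal_crossing(0.689, 0.954, 0.012, 0.311, 0.983, 0.004,
                           lo=0.954, hi=0.983)
```

The same routine applied to the dHash marginal (with that axis's fitted parameters) would yield the $\overline{\text{dHash}}^*$ crossing.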
**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38, 39] as a *density-smoothness diagnostic* (rather than as a threshold estimator) on each accountant-level marginal axis (cosine in bins of $0.002$, dHash in integer bins). At the Big-4 scope, the diagnostic identifies no significant transition on either marginal at $\alpha = 0.05$ (Script 34). Outside Big-4, the diagnostic does flag dHash transitions in some subsets (Script 32: `big4_non_A` dHash transition at $10.8$; `all_non_A` dHash transition at $6.6$; pre-2018 and post-2020 time-stratified variants also exhibit one or more dHash transitions), but no cosine transition is identified in any subset.
The Big-4-scope null on both axes is consistent with §III-I.4 below: under the composition decomposition the Big-4 marginals are unimodal once between-firm and integer-tie confounds are removed, so a local-discontinuity test correctly fails to flag a within-population transition.
**4. Composition decomposition (Scripts 39b–39e).** §III-I.1 establishes that the accountant-level marginals reject unimodality at the Big-4 sub-corpus. The remaining question is whether the rejection reflects (a) genuine within-population bimodality at the signature or accountant level, (b) between-firm location-shift artefacts (firms with different mean descriptor positions pool to a multi-peaked distribution), or (c) integer mass-point artefacts on the integer-valued dHash axis (the dHash dip statistic is sensitive to spikes at integer values). We apply four diagnostics that decompose the rejection into these candidate sources:
*Within-firm signature-level dip (Scripts 39b, 39c).* Repeating the dip test at the signature level inside each individual Big-4 firm (Script 39b) and inside each individual non-Big-4 firm with $\geq 500$ signatures (Script 39c) yields a consistent picture. The cosine marginal *fails* to reject unimodality in every single firm tested — all four Big-4 firms ($p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ for Firms A through D; Script 39b) and ten non-Big-4 firms with $\geq 500$ signatures ($p_{\text{cos}} \in [0.59, 0.99]$; Script 39c). The raw dHash marginal *does* reject unimodality in every firm tested ($p < 5 \times 10^{-4}$ in all $14$ firms), but the raw dHash values are integer-valued in $\{0, 1, \ldots, 64\}$, leaving open the possibility of an integer-tie artefact.
*Integer-jitter robustness (Scripts 39d, 39e).* Adding independent uniform jitter $\sim \mathrm{U}[-0.5, +0.5]$ to break exact dHash ties and re-running the dip test on the perturbed signature cloud (5 seeds, $n_{\text{boot}} = 2000$; Script 39d) eliminates the dHash within-firm rejection in every Big-4 firm tested (Firm A jittered $p_{\text{median}} = 0.999$; B $0.996$; C $0.999$; D $0.9995$; $0$/$5$ seeds reject at $\alpha = 0.05$ in any firm). All ten non-Big-4 firms similarly fail to reject after jitter ($p \in [0.71, 1.00]$). The pooled-Big-4 dHash dip *does* survive jitter alone ($p_{\text{median}} = 0$, $5$/$5$ seeds reject), but Firm A's mean dHash ($2.73$) is substantially below Firms B/C/D's ($6.46$, $7.39$, $7.21$) — a between-firm location shift. Script 39e applies a $2 \times 2$ factorial correction (firm-mean centring $\times$ integer jitter) on the Big-4 pooled dHash:
| Condition | Firm-mean centred | Integer jitter | Median dip $p$ | Reject at $\alpha = 0.05$ |
|---|---|---|---|---|
| 1 raw | — | — | $< 5 \times 10^{-4}$ | $5/5$ |
| 2 centred only | $\checkmark$ | — | $< 5 \times 10^{-4}$ | $5/5$ |
| 3 jittered only | — | $\checkmark$ | $< 5 \times 10^{-4}$ | $5/5$ |
| 4 centred and jittered | $\checkmark$ | $\checkmark$ | $\mathbf{0.35}$ | $\mathbf{0/5}$ |
Removing *both* the between-firm location shift *and* the integer mass points eliminates the Big-4 dHash rejection. The Big-4 pooled dHash multimodality is therefore fully attributable to firm-composition contrast (primarily Firm A's mean $\text{dHash} = 2.73$ versus Firms B/C/D $\approx 6.5$–$7.4$) and integer-density artefacts, with no residual continuous within-firm bimodality.
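The condition-4 preprocessing (firm-mean centring plus integer jitter) can be sketched as follows; the dip test itself, which would then run on the pooled output, is not reproduced here, and the per-firm dHash lists below are toy values rather than corpus measurements.

```python
import random

def centre_and_jitter(dhash_by_firm, seed=42):
    """Remove the between-firm location shift (subtract each firm's mean dHash)
    and break integer ties (add U[-0.5, +0.5] jitter), mirroring the
    centred-and-jittered condition of the Script 39e 2x2 factorial.
    `dhash_by_firm` maps firm label -> list of integer dHash values."""
    rng = random.Random(seed)
    pooled = []
    for firm, values in dhash_by_firm.items():
        mean = sum(values) / len(values)
        pooled.extend(v - mean + rng.uniform(-0.5, 0.5) for v in values)
    return pooled

# Toy data (assumed): a low-dHash firm and a high-dHash firm, as in the
# Firm A vs Firms B/C/D location contrast.
pooled = centre_and_jitter({
    "A": [2, 3, 3, 2, 3, 2],
    "B": [7, 7, 8, 6, 7, 8],
})
```

After this transformation the pooled values carry neither the between-firm location shift nor the integer mass points, which is exactly the condition under which the pooled Big-4 dip rejection disappears.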
*Cosine analogue.* The cosine axis exhibits the same pattern. Codex-verified read-only spike on the Big-4 pooled signature cloud yields signature-level cosine dip $p < 5 \times 10^{-4}$ on the raw data, but $p = 0.597$ after firm-mean centring; accountant-level cosine $p = 1.0$ after firm-mean centring. The cosine multimodality is therefore between-firm composition-driven, not within-population bimodality.
*Integer-histogram valleys (Script 39d).* A genuine within-firm dHash antimode would appear as a strict local minimum in the count histogram with deep relative depth. Within each of the four Big-4 firms, the dHash histogram on bins $0$–$20$ exhibits no strict local minimum; the Big-4 pooled histogram exhibits one shallow valley at $\text{dHash} = 4$ with relative depth $0.021$ (a $2.1\%$ count drop). No valley near the inherited $\text{dHash} = 5$ operational boundary appears within any individual firm. The hypothesised dHash antimode near $\text{dHash} \approx 5$ is not empirically supported by the histogram analysis.
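A minimal valley detector of the kind described above might look like the following sketch. The relative-depth definition used here (count drop relative to the lower of the two flanking peaks) and the toy histograms are assumptions for illustration, not the Script 39d implementation.

```python
def histogram_valleys(counts):
    """Strict local minima in an integer-bin count histogram, each paired with
    a relative depth defined (assumption) as
    depth = 1 - counts[i] / min(max(counts[:i]), max(counts[i+1:]))."""
    valleys = []
    for i in range(1, len(counts) - 1):
        if counts[i] < counts[i - 1] and counts[i] < counts[i + 1]:
            left_max = max(counts[:i])
            right_max = max(counts[i + 1:])
            valleys.append((i, 1.0 - counts[i] / min(left_max, right_max)))
    return valleys

# Toy histograms (assumed, not corpus counts):
unimodal = histogram_valleys([5, 9, 14, 12, 7, 3])      # no strict local minimum
two_peaked = histogram_valleys([4, 10, 8, 6, 9, 7, 2])  # shallow valley at bin 3
```

A within-firm antimode would show up as a valley with substantial relative depth; the reported Big-4 pooled valley at $\text{dHash} = 4$ has depth only $0.021$ under the report's definition.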
**5. Conclusion: no natural threshold from the descriptor distribution.** §III-I.4 jointly establishes that (a) the Big-4 accountant-level dip rejection is fully attributable to between-firm composition and integer mass-point artefacts; (b) within any individual firm, the descriptor marginals at the signature level are unimodal once integer ties are broken; and (c) no integer-histogram valley near the inherited $\text{dHash} = 5$ operational boundary exists within any firm. The descriptor distributions therefore do not contain a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits of §III-I.2 and §III-J are retained as *descriptive partitions* that reflect firm-composition contrast, not as inferential evidence for two or three population modes. §III-L develops the v4.0 anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode.
## J. K=3 as a Descriptive Partition of Firm-Composition Contrast
This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes.** §III-I.4 demonstrates that the apparent multimodality of the accountant-level marginals is fully explained by between-firm location shifts and integer mass-point artefacts, leaving no residual evidence for two or three latent within-population mechanism classes. Neither mixture is used to assign signature-level or document-level labels in the v4.0 primary analysis. The operational classifier of §III-L is calibrated via inter-CPA negative-anchor coincidence rates, not via mixture-derived antimodes.
**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ (weight $0.689$) and $(0.983, 2.41)$ (weight $0.311$) (Script 34). $\text{BIC}(K{=}2) = -1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. We refer to the components by index rather than by mechanism labels, since §III-I.4 establishes that the K=2 separation is firm-compositional rather than mechanistic.
**K=3 fit.** Three components, sorted by ascending cosine mean (Script 35; Script 38 reproduces):
| Component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position |
|---|---|---|---|---|
| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner |
| C2 | 0.9558 | 6.66 | 0.536 | central region |
| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner |
$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical preference for K=3 under standard BIC interpretation, but not by itself decisive). The "descriptive position" column replaces v3.x's "hand-leaning / mixed / replicated" mechanism labels: §III-I.4 establishes that the cosine and dHash axes both lack within-population bimodality, so component centres are best interpreted as locations in a continuous descriptor space rather than as latent mechanism modes.
**Per-firm component composition (Script 35 firm × cluster cross-tab).** The K=3 partition is dominated by firm membership:
- Firm A: $0\%$ C1, $17.5\%$ C2, $82.5\%$ C3
- Firm B: $8.9\%$ C1, $\sim 78\%$ C2, $\sim 13\%$ C3
- Firm C: $23.5\%$ C1, $75.5\%$ C2, $1.0\%$ C3
- Firm D: $11.5\%$ C1, $\sim 84\%$ C2, $\sim 4.5\%$ C3
Firm A accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24$ of the $40$ C1-assigned CPAs. The K=3 partition is therefore well-described as a firm-compositional decomposition: C3 is essentially "Firm A and any non-Firm-A CPA whose mean descriptors happen to land in the high-cos / low-dHash corner"; C1 is essentially "non-Firm-A CPAs whose mean descriptors land in the low-cos / high-dHash corner." The composition contrast that K=3 captures at the accountant level reappears at the deployment level in the cross-firm hit matrix of §III-L.4 (Script 44): nearly all (98%) of the inter-CPA-anchor hits for a Firm A source signature have a Firm A candidate, and the same within-firm concentration holds for Firms B, C, D individually. The K=3 partition and the cross-firm hit matrix therefore describe the same underlying firm-compositional structure at two different units of analysis.
**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.023$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label.
We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the v4.0 operational classifier:
- The Big-4 K=2 marginal crossing $(0.975, 3.76)$ is essentially a firm-mass separator between Firm A and Firms B + C + D, not a within-Big-4 mechanism boundary.
- The Big-4 K=3 mixture exhibits a reproducible three-component shape across LOOO folds at the descriptor-position level, with C1 consistently located at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$.
- Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp); K=3 is therefore not used to assign operational labels to CPAs in v4.0.
The operational signature-level classifier of §III-L is calibrated against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the inherited five-way box rule and the K=3 partition appear in §III-K.
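The Script 36 across-fold stability check reduces to comparing each fold's marginal crossing against the across-fold mean and a fixed tolerance. A sketch: the held-out-A and held-out-D cosine crossings are the quoted $0.9380$ and $0.9756$, while the two middle entries are placeholders near the quoted non-Firm-A fold rule (cos $> 0.975$), since those fold values are not reported individually.

```python
def fold_stability(fold_crossings, tolerance=0.005):
    """Across-fold stability summary for LOOO marginal crossings:
    max |crossing - across-fold mean|, the pairwise range, and a verdict
    against a fixed stability tolerance (Script 36 convention)."""
    mean = sum(fold_crossings) / len(fold_crossings)
    max_dev = max(abs(c - mean) for c in fold_crossings)
    return {
        "max_dev": max_dev,
        "range": max(fold_crossings) - min(fold_crossings),
        "stable": max_dev <= tolerance,
    }

# Held-out A, B, C, D cosine crossings (B and C are assumed placeholders):
summary = fold_stability([0.9380, 0.9755, 0.9756, 0.9756])
```

With these inputs the pairwise range is the quoted $0.0376$ and the max deviation is close to the quoted $0.028$, well outside the $0.005$ tolerance, so the K=2 boundary fails the stability check.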
## K. Convergent Internal-Consistency Checks
The descriptive partition of §III-J is supported by three feature-derived per-CPA scores and a hard-ground-truth subset analysis. We caution at the outset that the three scores are **not statistically independent measurements** — all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ — so their high pairwise rank correlations are partly a mechanical consequence of shared inputs. Per §III-I.4, none of the three scores has a within-population bimodality interpretation; they are firm-compositional position scores at the accountant level. The checks below therefore document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide).
**1. Three feature-derived per-CPA scores (Script 38).** For each Big-4 CPA we compute:
- **Score 1 (K=3 posterior on the low-cos / high-dHash component):** $P(\text{C1})$ from the K=3 fit of §III-J. Per §III-J this is a firm-compositional position score on the (cos, dHash) plane (not a probability of any latent "hand-signing mechanism") — a function of both descriptor means.
- **Score 2 (reverse-anchor cosine percentile):** the marginal cosine CDF value of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H, sign-flipped so that lower percentile (further into the reference's left tail) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end. This is a function of $\overline{\text{cos}}_a$ alone.
- **Score 3 (inherited binary high-confidence box rule rate):** the per-CPA fraction of signatures that do **not** satisfy the inherited binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$). This is a per-signature-aggregated function of the same descriptors.
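Score 2 reduces to a Gaussian CDF evaluation plus a sign flip, which the standard library's error function supports directly. A sketch under assumptions: the reference parameters below are illustrative (not the fitted §III-H values), and the sign convention is chosen to match the reported per-firm pattern in which a higher (less negative) score indicates a mean cosine deeper in the reference's left tail.

```python
import math

def reverse_anchor_score(cos_mean, ref_mu, ref_sigma):
    """Score 2 sketch: percentile of a CPA's mean cosine under the non-Big-4
    reference Gaussian, sign-flipped (assumed convention: deeper left tail
    -> lower percentile -> higher, i.e. less negative, score)."""
    percentile = 0.5 * (1.0 + math.erf((cos_mean - ref_mu) / (ref_sigma * math.sqrt(2.0))))
    return -percentile

# Illustrative reference parameters (assumed):
deep_left_tail = reverse_anchor_score(0.94, 0.97, 0.01)
near_centre = reverse_anchor_score(0.96, 0.97, 0.01)
```

Because the score is a monotone transform of $\overline{\text{cos}}_a$ alone, its Spearman correlation with any other monotone function of the mean cosine is mechanically high, which is part of the shared-inputs caveat above.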
Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs (Script 38):
| score pair | Spearman $\rho$ | $p$ |
|---|---|---|
| Score 2 vs Score 3 | $+0.889$ | $< 10^{-149}$ |
| Score 1 vs Score 2 | $+0.879$ | $< 10^{-142}$ |
We read this as the strongest internal-consistency signal in v4.0: three different summarisations of the same descriptor pair agree on the per-CPA descriptor-position ranking with $\rho > 0.87$.
The three scores agree on placing Firm A as the most replication-dominated descriptor position and the three non-Firm-A Big-4 firms further from the templated end, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Scores 1 and 3) place Firm C at the less-replication-dominated end of Big-4 (mean P(C1) $= 0.311$; mean box-rule less-replication-dominated rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the inherited box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the less-replication-dominated end between the three non-A Big-4 firms.
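The Spearman correlations above are Pearson correlations of the rank vectors. A self-contained sketch (with average ranks for ties, as is standard) is:

```python
def _ranks(xs):
    """Average ranks, 1-based; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Any strictly monotone transform of a score leaves its ranks, and hence these correlations, unchanged; that is why rank correlation is the right agreement measure for three differently scaled summarisations of the same descriptors.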
**2. Per-signature consistency (Script 39).** Per-CPA aggregation could in principle reflect averaging across within-CPA heterogeneity rather than coherent within-CPA behaviour. We test this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points (Script 39) — and comparing labels. The per-CPA and per-signature K=3 fits recover a broadly similar three-component ordering; per-CPA C1 is at $\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$ vs per-signature C1 at $\overline{\text{cos}} = 0.928$, $\overline{\text{dHash}} = 9.75$ (an absolute cosine drift of $0.018$). Cohen $\kappa$ on the binary collapse (replicated vs not-replicated):
The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$
- *K=2 LOOO is unstable.* The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$, against the report's $0.005$ across-fold stability tolerance (Script 36; pairwise fold range $0.0376$, from $0.9380$ to $0.9756$). When Firm A is held out, the fold rule classifies $171/171$ of held-out Firm A CPAs as templated; when any non-Firm-A Big-4 firm is held out, the fold rule classifies $0$ of the held-out firm's CPAs as templated. This pattern indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
- *K=3 LOOO is partially stable.* The C1 (low-cos / high-dHash) component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.023$ in mixture weight (Script 37). Hard-posterior membership remains composition-sensitive — observed absolute differences are $1.8$–$12.8$ pp across the four folds, with the Firm C fold exceeding the report's $5$ pp viability bar; the report's own verdict is `P2_PARTIAL` ("K=3 is not predictively useful as an operational classifier"). We accordingly do not use K=3 hard-posterior membership as an operational label.
**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one hard ground-truth subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
We report each candidate classifier's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions. This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 sample) and the v3.x §IV-I corpus-wide version (reported under prior "FAR" terminology):
| Candidate classifier | Pixel-identity miss rate (Wilson 95% CI) |
|---|---|
| Inherited Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0\%$ $[0\%, 1.45\%]$ |
| K=3 per-CPA hard label (C3 high-cos / low-dHash corner; descriptive only) | $0\%$ $[0\%, 1.45\%]$ |
| Reverse-anchor with prevalence-calibrated cut | $0\%$ $[0\%, 1.45\%]$ |
All three candidate scores correctly assign every byte-identical signature to the replicated class. We caution that for the inherited box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them; v3.x discussed this conservative-subset caveat at length (v3 §III-J item 1, V-F). The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* this check would be disqualified), not a sufficient validation of the classifier's behaviour on the non-byte-identical replicated population. The reverse-anchor cut here is chosen by prevalence calibration against the inherited box rule's overall replicated rate ($49.58\%$ of Big-4 signatures; Script 40); this is a documented v4.0 limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
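The Wilson 95% intervals in the table can be reproduced with a standard score-interval function; for $0$ misses out of $262$ the upper bound is approximately $1.45\%$, matching the quoted value.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion at ~95% coverage."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# 0 misses out of 262 byte-identical signatures (Script 40):
lo, hi = wilson_interval(0, 262)  # hi is about 0.0145, i.e. the quoted 1.45%
```

Unlike the Wald interval, the Wilson interval remains informative at zero observed misses, which is why the table can report a non-degenerate upper bound for a $0\%$ point estimate.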
## L. Anchor-Based Threshold Calibration and Operational Classifier
§III-I.4 established that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold; the K=3 mixture of §III-J is a descriptive firm-compositional partition, not a mechanism-cluster model. This section develops v4.0's anchor-based threshold calibration: the operational thresholds inherited from v3.x are characterised by their inter-CPA pair-level negative-anchor coincidence behaviour and their pool-normalised per-signature and per-document alert behaviour, at multiple units of analysis. Throughout this section we report **inter-CPA coincidence rates** rather than "False Acceptance Rates"; we explain the terminological choice in §III-L.0.
### L.0. Calibration methodology
**Operational classifier (inherited from v3.20.0 §III-K, retained unchanged).** Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$, where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA:
1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence.
2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff.
3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration — consistent with a CPA who signs very consistently but not via image reproduction.
4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$.
The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) are inherited from v3.x §III-K and retain their v3.x calibration provenance. Document-level labels are aggregated via the v3.x worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH).
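The five-way assignment reduces to a nested threshold check. A minimal sketch follows; the function name and the short return codes are illustrative conveniences, not identifiers from the analysis scripts:

```python
def classify_signature(cos_s: float, dhash_s: int) -> str:
    """Five-way per-signature box rule inherited from v3.x.

    cos_s   : max cosine similarity to another signature by the same CPA
    dhash_s : min independent dHash to another signature by the same CPA
    """
    if cos_s > 0.95:                 # cosine operating point
        if dhash_s <= 5:
            return "HC"              # high-confidence non-hand-signed
        if dhash_s <= 15:
            return "MC"              # moderate-confidence non-hand-signed
        return "HSC"                 # high style consistency
    if cos_s > 0.837:                # all-pairs intra/inter KDE crossover
        return "UN"                  # uncertain
    return "LH"                      # likely hand-signed
```

The order of checks matters only for readability; the five regions partition the $(\text{cos}, \text{dHash})$ plane without overlap.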
**Why retained without v4.0 recalibration.** The inherited thresholds preserve continuity with v3.x reporting and with the existing literature. §III-I.4 establishes that a v4.0 recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 confirms that the cosine threshold's specificity behaviour at the inter-CPA pair level (the v3.x calibration anchor) is reproducible on the v4 spike sample, and §III-L.1 newly characterises the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. Sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain v3.x's inherited calibration; v4.0 does not provide independent calibration for those sub-bands.
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:
- *Per comparison.* For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos $>$ cos\_threshold and/or dHash $\leq$ dHash\_threshold)? This is the unit at which v3.x §IV-I characterised the cosine threshold's specificity behaviour and at which threshold derivation in biometric verification is conventionally calibrated. We report it for both the cosine and dHash dimensions, marginally and jointly (§III-L.1).
- *Per signature pool.* For a Big-4 source signature $s$ with same-CPA pool of size $n_{\text{pool}}(s)$, what is the probability that the deployed rule fires *under the counterfactual* of replacing the source's same-CPA pool with $n_{\text{pool}}(s)$ random non-same-CPA candidates? This addresses the standard concern that a per-pair rate computed on independent pairs is not the deployed-rule rate at the per-signature classifier level: the deployed rule takes max-cosine and min-dHash over a pool of size $n_{\text{pool}}(s)$, so its effective coincidence rate is approximately $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit (§III-L.2).
- *Per document.* For an audit report aggregated via the worst-case rule, what fraction of documents have at least one signature whose deployed pool-normalised rule fires under the same inter-CPA candidate-replacement counterfactual? This is the operational alarm-rate unit (§III-L.3).
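The independence-limit relationship between the per-comparison and pool-normalised units can be sketched numerically; the per-pair value below is the joint rate from §III-L.1, and the pool size is an illustrative mid-range value:

```python
def pool_iccr(p_pair: float, n_pool: int) -> float:
    """Independence-limit pool-normalised coincidence rate: probability
    that at least one of n_pool independent inter-CPA candidates
    satisfies a rule that a single random pair satisfies with p_pair."""
    return 1.0 - (1.0 - p_pair) ** n_pool

# Joint per-pair rate (cos > 0.95 AND dHash <= 5) and an illustrative pool.
print(round(pool_iccr(0.00014, 285), 4))
```

The actual deployed-rule rate departs from this form to the extent that pool draws are correlated (e.g., by firm membership, §III-L.4).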
**Any-pair vs same-pair semantics.** The deployed rule uses independent extrema: a signature satisfies the HC rule if $\max_{\text{pool}} \text{cos} > 0.95$ AND $\min_{\text{pool}} \text{dHash} \leq 5$, *not* if a single candidate in the pool satisfies both. We refer to this as the **any-pair** rule. A stricter alternative — the **same-pair** rule — requires a single candidate to satisfy both inequalities; the deployed v3/v4 rule is any-pair, but we report same-pair as a stricter alternative classifier where useful (§III-L.2, §III-L.4).
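The any-pair vs same-pair distinction can be made precise in a few lines. In this sketch, a pool is a list of `(cos, dhash)` descriptor pairs against same-CPA (or counterfactual) candidates; function names are illustrative:

```python
def any_pair_hit(pool, cos_t=0.95, dh_t=5):
    """Deployed any-pair rule: independent extrema over the pool."""
    return (max(c for c, _ in pool) > cos_t
            and min(d for _, d in pool) <= dh_t)

def same_pair_hit(pool, cos_t=0.95, dh_t=5):
    """Stricter same-pair rule: one candidate satisfies both bounds."""
    return any(c > cos_t and d <= dh_t for c, d in pool)

# One candidate clears the cosine gate, a different one the dHash gate:
pool = [(0.97, 20), (0.60, 3)]
print(any_pair_hit(pool), same_pair_hit(pool))  # True False
```

Any same-pair hit is also an any-pair hit, so same-pair rates are bounded above by any-pair rates, as seen throughout §III-L.2.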
**Terminological note on "FAR".** The v3.x and biometric-verification literature speak of "False Acceptance Rate" (FAR) for a per-pair rate computed on independent inter-CPA pairs. We adopt **inter-CPA coincidence rate (ICCR)** as the v4.0 metric name and *do not* use "FAR" in the manuscript prose, for two reasons: (a) FAR has a specific biometric-verification meaning that requires ground-truth negative labels (which the corpus does not provide at the signature level); (b) §III-L.4 shows that the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures. Reading "inter-CPA coincidence rate" as a *specificity proxy* under an explicitly disclosed assumption is faithful to the evidence; reading it as a true biometric FAR would overstate the evidence. We retain the v3.x numerical results (which are quantitatively reproduced in §III-L.1) under the new terminology.
### L.1. Per-comparison inter-CPA coincidence rate (Script 40b)
We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from Big-4 signatures, computing for each pair the cosine similarity (feature dot product) and Hamming distance between the dHash byte vectors. Marginal and joint rates at each threshold are reported with Wilson 95% confidence intervals (Script 40b).
| Threshold | Per-comparison inter-CPA coincidence rate | 95% Wilson CI |
|---|---|---|
| Cosine $> 0.95$ | $0.00060$ | $[0.00053, 0.00067]$ |
| Cosine $> 0.945$ (v3.x published "natural threshold") | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| Cosine $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| dHash $\leq 2$ | $0.00006$ | $[0.00004, 0.00008]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ | $0.00014$ | (any-pair semantics) |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ | $0.00011$ | (any-pair semantics) |
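The Wilson intervals in the table follow the standard closed form. A self-contained sketch, approximately reproducing the cos $> 0.95$ row (the hit count is reconstructed as rate $\times$ sample size and is illustrative):

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.959964):
    """Wilson score 95% interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# ~0.00060 on the 5e5-pair sample implies roughly 300 hits.
lo, hi = wilson_ci(300, 500_000)
print(round(lo, 5), round(hi, 5))
```

Unlike the Wald interval, the Wilson interval remains well-behaved for the very small proportions reported here.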
The cosine row at $\text{cos} > 0.95$ replicates the v3.x §IV-I Table X result (v3.x reported the per-comparison rate as $0.0005$ under prior "FAR" terminology from a similarly-sized inter-CPA negative anchor; the v4 spike on a $5 \times 10^5$-pair sample yields $0.00060$, within the v3.x reported precision). The dHash and joint rows are v4-new: v3.x calibration did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.
The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.
**Conditional inter-CPA coincidence rate.** A natural follow-up question is whether the dHash dimension provides marginal specificity beyond the cosine gate. For pairs with cos $> 0.95$, the conditional rate of dHash $\leq 5$ is $0.234$ (Wilson 95% CI $[0.190, 0.285]$; $70$ of $299$ pairs in the Big-4 sample). At cos $> 0.95$, dHash provides $\sim 4.3\times$ further per-comparison specificity (joint $0.00014$ vs cos-only $0.00060$).
The per-comparison rate is a useful *specificity-proxy calibration* for the deployed rule's pair-level behaviour. It does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.
### L.2. Pool-normalised inter-CPA alert rate (Script 43)
The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \text{dHash}$ over the same-CPA pool of size $n_{\text{pool}}(s)$ for each signature $s$. A per-comparison rate is therefore not the rate at which the deployed classifier fires per signature. To compute the per-signature inter-CPA-equivalent rate, for each Big-4 source signature $s$ we simulate one realisation of an inter-CPA candidate pool of the same size $n_{\text{pool}}(s)$, drawn uniformly from non-same-CPA signatures across all firms, compute the deployed extrema and rule indicator, and aggregate (Script 43; $n_{\text{sig}} = 150{,}453$ vector-complete in this analysis; CPA-block bootstrap 95% CIs reported below).
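The candidate-replacement counterfactual can be sketched as follows. This toy stand-in samples precomputed `(cos, dhash)` descriptor pairs from a synthetic candidate population rather than recomputing descriptors against sampled signatures, and all names and data are hypothetical; Script 43's actual implementation is not reproduced here:

```python
import random

def simulated_iccr(pool_sizes, candidates, rng, cos_t=0.95, dh_t=5):
    """For each source (given by its real same-CPA pool size), draw an
    inter-CPA candidate pool of the same size, apply the deployed
    any-pair rule (independent extrema), and return the hit fraction.
    """
    hits = 0
    for n_pool in pool_sizes:
        pool = rng.sample(candidates, min(n_pool, len(candidates)))
        if (max(c for c, _ in pool) > cos_t
                and min(d for _, d in pool) <= dh_t):
            hits += 1
    return hits / len(pool_sizes)

rng = random.Random(0)
# Toy candidate population: mostly dissimilar, a few near-collisions.
cands = [(rng.uniform(0.2, 0.9), rng.randint(10, 40)) for _ in range(2000)]
cands += [(0.97, 3)] * 20
print(simulated_iccr([50] * 200, cands, rng))
```

A single realisation per source, as here and in Script 43, leaves Monte Carlo noise in the point estimate; the CPA-block bootstrap CIs quoted below absorb that noise alongside clustering.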
**Headline rates (deployed any-pair rule, HC = cos $> 0.95$ AND dHash $\leq 5$).** Wilson 95% CIs on the point estimate, CPA-block bootstrap 95% CI on $n_{\text{boot}} = 1000$ replicates:
| Rule semantics | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|---|---|---|---|
| Any-pair (deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
| Same-pair (stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |
Per-firm any-pair rates (no bootstrap; descriptive):
| Firm | $n_{\text{sig}}$ | Any-pair ICCR | Same-pair ICCR |
|---|---|---|---|
| Firm A | $60{,}450$ | $0.2594$ | $0.2018$ |
| Firm B | $34{,}254$ | $0.0147$ | $0.0023$ |
| Firm C | $38{,}616$ | $0.0053$ | $0.0019$ |
| Firm D | $17{,}133$ | $0.0110$ | $0.0051$ |
**Pool-size decile dependence.** The deployed rule's pool-normalised rate is broadly monotonically increasing in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$. The trend is broadly monotonic with two minor non-monotone reversals (decile 5 and decile 9 dip below their predecessors).
**Threshold sensitivity at per-signature unit.** Tightening the HC rule from $\text{dHash} \leq 5$ to $\text{dHash} \leq 3$ (same-pair) reduces the per-signature ICCR from $0.0827$ to $0.0449$ (Big-4 pooled); tightening to $\text{dHash} \leq 4$ gives $0.0639$ (same-pair). Stakeholders requiring a per-signature ICCR of $\leq 0.05$ at HC can adopt dHash $\leq 3$ same-pair as a stricter operating point; at $\leq 0.10$ the inherited HC any-pair rule with $\text{dHash} \leq 5$ at $0.1102$ is within tolerance.
### L.3. Document-level inter-CPA proxy alert rate (Script 45)
The deployed worst-case aggregation classifies each document by the most-replication-consistent category among its constituent signatures (§III-L.0). Three operationally meaningful document-level alarm definitions are reported, each as the fraction of documents whose worst-case signature category falls in the alarm set under the same inter-CPA candidate-pool counterfactual as §III-L.2 (Script 45; $n_{\text{docs}} = 75{,}233$ Big-4 documents):
| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
| D2 | HC + MC ("any non-hand-signed verdict") | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |
Per-firm D2 document-level rates:
| Firm | $n_{\text{docs}}$ | D2 (HC + MC) ICCR |
|---|---|---|
| Firm A | $30{,}226$ | $0.6201$ |
| Firm B | $17{,}127$ | $0.1600$ |
| Firm C | $19{,}501$ | $0.1635$ |
| Firm D | $8{,}379$ | $0.0863$ |
The document-level D2 rate of $33.75\%$ pooled over Big-4 is the most operationally relevant alarm-rate metric: it is the fraction of audit documents that would carry at least one signature flagged HC or MC under the counterfactual of inter-CPA candidate-pool replacement. The non-trivial per-document inter-CPA alarm rate (and its concentration in Firm A at $62\%$) motivates the positioning of the operational system as a **screening framework with human-in-the-loop review**, not as an autonomous forensic classifier (§III-M).
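The worst-case aggregation and the three alarm definitions reduce to a rank lookup over per-signature labels; a minimal sketch with illustrative names:

```python
# Rank order: more replication-consistent categories rank higher.
RANK = {"HC": 4, "MC": 3, "HSC": 2, "UN": 1, "LH": 0}

def document_label(signature_labels):
    """Worst-case rule: a document inherits its most
    replication-consistent signature category."""
    return max(signature_labels, key=RANK.__getitem__)

def alarms(label):
    """Which of the alarm definitions D1-D3 the document label triggers."""
    return {"D1": label == "HC",
            "D2": label in ("HC", "MC"),
            "D3": label in ("HC", "MC", "HSC")}

print(document_label(["LH", "MC"]))          # MC
print(alarms(document_label(["LH", "MC"])))
```

Because D1 $\subseteq$ D2 $\subseteq$ D3 as alarm sets, the document-level ICCRs are necessarily non-decreasing from D1 to D3, as the table shows.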
### L.4. Firm heterogeneity (Script 44)
§III-L.2 and §III-L.3 report large per-firm variation in the deployed rule's pool-normalised behaviour: Firm A's any-pair per-signature ICCR is $0.2594$, more than an order of magnitude larger than Firm B's $0.0147$, Firm C's $0.0053$, and Firm D's $0.0110$. A natural alternative explanation is the pool-size confound: Firm A's median pool size ($\sim 285$) is larger than other firms', and pool size broadly monotonically increases the per-signature rate (§III-L.2 decile trend). We test the firm-vs-pool confound with a logistic regression of the per-signature hit indicator (any-pair HC) on firm dummies (Firm A = reference) and centred log pool size (Script 44):
| Term | Odds ratio (vs Firm A) | Direction | Magnitude |
|---|---|---|---|
| Firm B | $0.053$ | $< 1$ | $\sim 19\times$ lower odds than Firm A |
| Firm C | $0.010$ | $< 1$ | $\sim 100\times$ lower odds than Firm A |
| Firm D | $0.027$ | $< 1$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $> 1$ | $\sim 4\times$ higher odds per unit log pool size |
The Firm B/C/D odds ratios are very small after controlling for pool size, indicating that firm membership accounts for a large multiplicative effect on the per-signature rate that is *not* explained by pool size alone. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm, and naive standard errors would be inflated by within-cluster correlation; a cluster-robust standard error analysis is left as a robustness check.)
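The fold-change readings in the table are direct transformations of the fitted log-odds coefficients ($\text{OR} = e^{\beta}$, "$x\times$ lower odds" $= 1/\text{OR}$); a sketch using the reported values, with coefficients back-derived from the table for illustration only:

```python
import math

# Odds ratios vs Firm A reported in section III-L.4 (Script 44).
odds_ratios = {"Firm B": 0.053, "Firm C": 0.010, "Firm D": 0.027}

for firm, orat in odds_ratios.items():
    beta = math.log(orat)       # implied log-odds coefficient
    fold_lower = 1.0 / orat     # "x-times lower odds than Firm A"
    print(f"{firm}: beta = {beta:.2f}, ~{fold_lower:.0f}x lower odds")
```

The same transformation applies to the pool-size term: OR $4.01$ per unit of centred log pool size corresponds to $\beta \approx 1.39$.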
The per-decile per-firm breakdown (Script 44) confirms the pattern: within every pool-size decile, Firms B/C/D have rates of $0.0006$–$0.0358$, while Firm A's rate ranges $0.0541$–$0.5958$ across deciles. The firm gap is large within matched pool sizes, not driven by pool composition.
**Cross-firm hit matrix.** Among Big-4 source signatures whose any-pair rule fires under the inter-CPA candidate-pool counterfactual, the candidate firm of the max-cosine partner is distributed as follows (Script 44):
| Source firm | Firm A candidate | Firm B | Firm C | Firm D | non-Big-4 | hits |
|---|---|---|---|---|---|---|
| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ |
| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ |
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |
For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$).
**Interpretation.** The cross-firm hit matrix shows that nearly all inter-CPA collisions under the deployed rule originate from candidates within the source firm (different CPA, same firm). This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. The byte-level evidence of v3.x §IV-F.1 (Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners) provides direct evidence that firm-level template reuse does occur at Firm A; the broader inter-CPA collision pattern in §III-L.4 is consistent with that mechanism extending in milder form to Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where nearly all collisions are within-firm. The K=3 partition and the cross-firm hit matrix describe the same underlying firm-compositional structure at two different units of analysis.
### L.5. Alert-rate sensitivity around inherited thresholds (Script 46)
To test whether the inherited cosine threshold $0.95$ and dHash threshold $5$ coincide with a low-gradient (plateau-stable) region of the deployed-rule alert-rate surface — which would be weak distributional evidence that the inherited thresholds are stable operating points — we sweep each threshold across a range and report the per-signature alert rate on actual observed Big-4 same-CPA pools (not inter-CPA-replaced pools), comparing the local gradient at the inherited threshold to the median gradient across the sweep (Script 46).
At the inherited HC operating point cos $> 0.95$ AND dHash $\leq 5$, the local gradient of the per-signature alert rate is substantially larger than the median gradient across the sweep (cosine: ratio $\approx 25\times$ at the $0.95$ point relative to median; dHash: ratio $\approx 3.8\times$ at the $5$ point relative to median; both Script 46). Reading these ratios descriptively, the inherited HC threshold is *locally sensitive* rather than plateau-stable: small threshold perturbations materially change the deployed alert rate (cosine sweep at dHash $\leq 5$ yields rates of $0.5091$ at cos $> 0.945$ vs $0.4789$ at cos $> 0.955$, a $3.0$ pp swing across a $0.01$ cosine perturbation; dHash sweep at cos $> 0.95$ yields rates of $0.4207$ at dHash $\leq 4$ vs $0.5639$ at dHash $\leq 6$, a $14.3$ pp swing across a single integer step). The local-gradient-to-median-gradient ratios are descriptive diagnostics, not formal plateau tests; the primary evidence for "no within-population bimodal antimode at these thresholds" comes from §III-I.4's composition decomposition, not from §III-L.5.
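The local-to-median gradient ratio is a simple finite-difference diagnostic over a swept alert-rate curve; the curve values in this sketch are synthetic (Script 46's actual sweep grid is not reproduced), but the construction is the one described above:

```python
def gradient_ratio(thresholds, rates, at):
    """Central-difference |gradient| at threshold `at`, divided by the
    median |gradient| across the sweep (descriptive diagnostic)."""
    grads = [abs((rates[i + 1] - rates[i - 1])
                 / (thresholds[i + 1] - thresholds[i - 1]))
             for i in range(1, len(thresholds) - 1)]
    med = sorted(grads)[len(grads) // 2]
    i = thresholds.index(at)
    local = abs((rates[i + 1] - rates[i - 1])
                / (thresholds[i + 1] - thresholds[i - 1]))
    return local / med

# Synthetic alert-rate curve: near-flat, with a steep step around 0.95.
ts = [0.90, 0.92, 0.94, 0.95, 0.96, 0.98]
rs = [0.52, 0.515, 0.51, 0.48, 0.45, 0.448]
print(round(gradient_ratio(ts, rs, 0.95), 1))
```

A ratio well above 1 flags a locally sensitive threshold, as for the HC cutoffs; a ratio well below 1 flags a plateau, as for dHash $= 15$.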
The MC/HSC boundary at dHash $= 15$, by contrast, *is* in a low-gradient region (ratio $\approx 0.08$ to the median); the plateau-like behaviour around dHash $= 15$ is corroborating evidence that the high-end structural threshold lies in a regime where the rule's alert rate is approximately saturated, consistent with the high-dHash tail behaviour expected once near-identical pairs have been exhausted. The §III-L.5 non-plateau / local-sensitivity finding therefore applies specifically to the HC cutoff (cos $= 0.95$, dHash $= 5$); the MC/HSC sub-band boundary at dHash $= 15$ exhibits the opposite behaviour and is plateau-like.
We interpret the inherited HC thresholds as **specificity-anchored operating points** chosen for the specificity-vs-alert-yield tradeoff (§III-L.1), *not* as distributional antimodes. Stakeholders requiring different operating points on the tradeoff curve can derive thresholds by inverting the per-comparison or pool-normalised ICCR curves (§III-L.1, §III-L.2) at their preferred specificity target.
### L.6. Observed deployed alert rate on actual same-CPA pools
The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the inherited HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$).
The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised inter-CPA rate ($0.4958$ vs $0.1102$); the per-document observed-deployed rate is $\sim 3.5\times$ the pool-normalised inter-CPA D1 (HC) rate ($0.6228$ vs $0.1797$). We refer to this gap as the **deployed-rate excess over the inter-CPA proxy**:
- Per-signature: $0.4958 - 0.1102 = 0.3856$ ($38.6$ pp excess)
- Per-document HC: $0.6228 - 0.1797 = 0.4431$ ($44.3$ pp excess)
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as a *same-CPA repeatability signal* — a quantity that exceeds what random inter-CPA candidate replacement would produce — rather than as an estimate of true replication prevalence.
### L.7. K=3 not used as classifier
The K=3 mixture of §III-J is reported in §IV as an accountant-level descriptive summary alongside the per-signature five-way classifier. We do not assign signature-level or document-level labels from the K=3 mixture in any v4.0 result table; the K=3 hard label is used only for the accountant-level firm × cluster cross-tabulation (§III-J; Script 35), and the K=3 *posterior* P(C1) is used (as the continuous Score 1) in the internal-consistency Spearman correlations of §III-K. The operational classifier of §III-L.0 is the inherited v3.x five-way box rule; the calibration evidence in §III-L.1 through §III-L.6 characterises its multi-level coincidence behaviour against the inter-CPA negative anchor.
## M. Validation Strategy and Limitations under Unsupervised Setting
The v4.0 corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4 and v3.x §IV-F.1) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth.
In place of supervised validation, v4.0 adopts a **multi-tool collection of partial-evidence diagnostics**, each with an explicitly disclosed assumption:
| Tool | What it measures | Untested assumption |
|---|---|---|
| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) |
| Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | Same as above + that pool replacement preserves the negative-anchor property |
| Document-level ICCR (§III-L.3; Script 45) | Operational alarm rate proxy at per-document unit under three alarm definitions | Same as above |
| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors inflated; cluster-robust analysis is a future check |
| Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within source firm | None — direct descriptive observation |
| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test |
| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal-consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent |
| Pixel-identical conservative positive capture (§III-K.4; v3.x; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold |
| LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity |
No single tool in this collection provides ground-truth validation. Their conjunction constitutes the unsupervised validation ceiling that the v4.0 corpus admits.
**What v4.0 does not claim.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; v3.x byte-level evidence of Firm A's pixel-identical signatures across partners) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.
**What v4.0 does claim.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.
**Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and missed-replication risk (per false negative) — neither directly observable from corpus data.
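Deriving an alternative operating point by inverting the per-comparison ICCR curve reduces to picking the loosest threshold that meets a stakeholder's specificity target; a sketch over the §III-L.1 cosine rows (helper name illustrative):

```python
# (threshold, per-comparison ICCR) pairs from section III-L.1 (Script 40b).
cos_curve = [(0.95, 0.00060), (0.97, 0.00024), (0.98, 0.00009)]

def operating_point(curve, target):
    """Loosest swept threshold whose per-comparison ICCR meets the
    target, or None if no swept threshold is tight enough."""
    feasible = [(t, r) for t, r in curve if r <= target]
    return min(feasible, default=None)  # smallest threshold = loosest gate

print(operating_point(cos_curve, 0.00025))  # -> (0.97, 0.00024)
```

The same inversion applies to the pool-normalised curves of §III-L.2, with the caveat that alert yield (and unobservable recall) falls as the target tightens.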
---
## Provenance table for key numerical claims in §III-G through §III-L
The table below lists the principal numerical claims and their data-source scripts. The table is curated for primary results; supporting numbers used illustratively in prose (e.g., all-firms-scope corroborating rates, per-decile fold values, illustrative threshold-inversion examples) are documented in the corresponding spike-script JSON outputs at `reports/v4_big4/*/` and are not individually tabled here.
| Claim | Value | Source | Notes |
|---|---|---|---|
| Big-4 CPA count, $n_{\text{sig}} \geq 10$ | $437$ ($171/112/102/52$) | Script 36 sample sizes; Script 38 per-firm summary | direct |
| Big-4 signature count (descriptor-complete) | $150{,}442$ | Script 39 / 39b / 39d / 39e / 45 / 46 | analyses using pre-computed descriptors |
| Big-4 signature count (vector-complete) | $150{,}453$ | Script 40b / 43 / 44 | analyses recomputing from feature + dHash vectors |
| Non-Big-4 reference CPA count | $249$ | Script 38 reference population | direct |
| Big-4 K=2 marginal crossings | $(0.9755, 3.755)$ | Script 34; Script 36 §A | direct |
| Bootstrap 95% CI cosine | $[0.9742, 0.9772]$ | Script 34; Script 36 §A | $n_{\text{boot}} = 500$ |
| Pixel-identity Big-4 subset | $n = 262$ ($145/8/107/2$) | Script 40 sample | direct |
| Full-dataset accountant count | $n = 686$ | Script 41 (`fulldataset_report.md`) | direct |
| Positive-anchor miss rate on $n = 262$ | $0\%$ (Wilson upper $1.45\%$) | Script 40 results table | direct |
| Inter-CPA cos $> 0.95$ ICCR | $0.0005$ (Wilson 95% $[0.0003, 0.0007]$) | v3 §IV-F.1 / Table X | inherited; v3 reported this as "FAR"; v4.0 reframes as inter-CPA coincidence rate per §III-L.0 |
| Firm A byte-identical signatures in Big-4 pixel-identity subset | $145$ | Script 40 sample breakdown | direct |
| Firm A byte-identical reuse | "50 distinct partners of 180; 35 cross-year" | v3 §IV-F.1 / Script 28 / Appendix B byte-decomposition output | **inherited from v3; not regenerated in v4.0 spike scripts** |
| Big-4 K=3 per-firm C1 hard-assignment | $0\%$ / $8.9\%$ / $23.5\%$ / $11.5\%$ | Script 35 firm × cluster cross-tab | direct |
| **Composition decomposition (§III-I.4):** | | | |
| Within-firm signature-level dip $p_{\text{cos}}$ Big-4 (A/B/C/D) | $0.176 / 0.991 / 0.551 / 0.976$ | Script 39b per-firm | direct, $n_{\text{boot}} = 2000$ |
| Within-firm signature-level dip $p_{\text{cos}}$ non-Big-4 (10 firms, range) | $[0.59, 0.99]$ | Script 39c per-firm | direct, firms with $\geq 500$ signatures |
| Within-firm jittered-dHash dip $p$ Big-4 (5 seeds, median) A/B/C/D | $0.999 / 0.996 / 0.999 / 0.9995$ | Script 39d multi-seed | uniform jitter $[-0.5, +0.5]$ |
| Within-firm jittered-dHash dip $p$ non-Big-4 (5 seeds, range across 10 firms) | $[0.71, 1.00]$ | Script 39d / 39c | uniform jitter $[-0.5, +0.5]$ |
| Big-4 pooled dHash dip $p$ raw / jittered (seed median) | $< 5 \times 10^{-4}$ / $< 5 \times 10^{-4}$ | Script 39d | jitter alone does not eliminate Big-4 pooled rejection |
| Big-4 pooled dHash dip $p$ firm-centred + jittered (5-seed median) | $0.35$ | Script 39e 2×2 factorial | both corrections eliminate rejection ($0/5$ seeds at $\alpha = 0.05$) |
| Big-4 firm-centred signature-level cos dip $p$ | $0.597$ | codex round-30 verification on Script 43 substrate | independent verification |
| Big-4 firm-centred accountant-level cos\_mean dip $p$ | $1.0$ | codex round-30 verification | independent verification |
| Per-firm Big-4 dHash mean (A/B/C/D) | $2.73 / 6.46 / 7.39 / 7.21$ | Script 39e per-firm summary | direct |
| Big-4 integer-histogram valley near $\text{dHash} \approx 5$ within any firm | none in any of A/B/C/D | Script 39d valley analysis | bins $0$–$20$ |
| **Anchor-based calibration (§III-L.1):** | | | |
| Per-comparison ICCR cos $> 0.95$ Big-4 | $0.00060$ (Wilson 95% $[0.00053, 0.00067]$) | Script 40b | $5 \times 10^5$ inter-CPA pairs, Big-4 scope |
| Per-comparison ICCR cos $> 0.945$ Big-4 | $0.00081$ (Wilson 95% $[0.00073, 0.00089]$) | Script 40b | direct |
| Per-comparison ICCR cos $> 0.97$ / cos $> 0.98$ Big-4 | $0.00024$ / $0.00009$ | Script 40b | direct |
| Per-comparison ICCR dHash $\leq 5$ Big-4 | $0.00129$ (Wilson 95% $[0.00120, 0.00140]$) | Script 40b | direct, v4 new |
| Per-comparison ICCR dHash $\leq 4 / 3 / 2$ Big-4 | $0.00050 / 0.00019 / 0.00006$ | Script 40b | direct |
| Per-comparison joint ICCR cos $> 0.95$ AND dHash $\leq 5$ Big-4 | $0.00014$ | Script 40b | any-pair semantics |
| Per-comparison joint ICCR cos $> 0.95$ AND dHash $\leq 4$ Big-4 | $0.00011$ | Script 40b | any-pair semantics |
| Conditional ICCR dHash $\leq 5$ given cos $> 0.95$ Big-4 | $0.234$ (Wilson 95% $[0.190, 0.285]$) | Script 40b | $70 / 299$ pairs |
| All-firms per-comparison joint ICCR | $0.00007$ | Script 40b | corroborating scope |
| **Pool-normalised per-signature alert rate (§III-L.2):** | | | |
| Per-signature any-pair ICCR HC Big-4 | $0.1102$ (Wilson 95% $[0.1086, 0.1118]$; CPA-bootstrap 95% $[0.0908, 0.1330]$) | Script 43 | $n_{\text{sig}} = 150{,}453$ (vector-complete) |
| Per-signature same-pair ICCR HC Big-4 | $0.0827$ (Wilson 95% $[0.0813, 0.0841]$; CPA-bootstrap 95% $[0.0668, 0.1021]$) | Script 43 | stricter alternative |
| Per-firm any-pair ICCR HC (A/B/C/D) | $0.2594 / 0.0147 / 0.0053 / 0.0110$ | Script 43 per-firm | direct |
| Per-firm same-pair ICCR HC (A/B/C/D) | $0.2018 / 0.0023 / 0.0019 / 0.0051$ | Script 43 per-firm | direct |
| Pool-size decile 1 / decile 10 any-pair ICCR | $0.0249 / 0.1905$ | Script 43 decile table | broadly monotone with two minor reversals |
| Per-signature tighter ICCR cos $> 0.95$ AND dHash $\leq 3$ same-pair Big-4 | $0.0449$ | Script 43 | optional stricter operating point |
| **Document-level alert rate (§III-L.3):** | | | |
| Document-level ICCR D1 (HC only) Big-4 | $0.1797$ (Wilson 95% $[0.1770, 0.1825]$) | Script 45 | $n_{\text{docs}} = 75{,}233$ |
| Document-level ICCR D2 (HC + MC) Big-4 | $0.3375$ (Wilson 95% $[0.3342, 0.3409]$) | Script 45 | operational alarm definition |
| Document-level ICCR D3 (HC + MC + HSC) Big-4 | $0.3384$ (Wilson 95% $[0.3351, 0.3418]$) | Script 45 | descriptive |
| Per-firm document-level D2 ICCR (A/B/C/D) | $0.6201 / 0.1600 / 0.1635 / 0.0863$ | Script 45 per-firm | direct |
| **Firm-heterogeneity logistic regression (§III-L.4):** | | | |
| Logistic OR (Firm B / C / D vs A) | $0.053 / 0.010 / 0.027$ | Script 44 regression | controlling for log pool size; reference $=$ Firm A |
| Logistic OR log(pool size, centred) | $4.01$ | Script 44 regression | pool-size effect after firm adjustment |
| Cross-firm hit matrix Firm A source $\to$ Firm A candidate (any-pair) | $14{,}447 / 14{,}622$ | Script 44 cross-firm matrix | $98.8\%$ within-firm |
| Cross-firm hit matrix same-pair within-firm rate (A/B/C/D) | $99.96\% / 97.7\% / 98.2\% / 97.0\%$ | Script 44 same-pair section | direct |
| **Threshold-sensitivity (§III-L.5):** | | | |
| Local / median gradient ratio cos $= 0.95$ | $\approx 25\times$ | Script 46 plateau diagnostic | descriptive, not formal plateau test |
| Local / median gradient ratio dHash $= 5$ | $\approx 3.8\times$ | Script 46 plateau diagnostic | descriptive |
| Local / median gradient ratio dHash $= 15$ | $\approx 0.08$ | Script 46 plateau diagnostic | MC/HSC boundary plateau-like |
| **Observed deployed alert rate (§III-L.6):** | | | |
| Per-signature observed-deployed HC rate Big-4 | $0.4958$ | Script 46 / Script 42 | actual same-CPA pools |
| Per-document observed-deployed HC rate Big-4 | $0.6228$ | Script 46 | actual same-CPA pools |
| Deployed-rate excess over inter-CPA proxy (per-sig HC) | $38.56$ pp | derived | $0.4958 - 0.1102$ |
| Deployed-rate excess over inter-CPA proxy (per-doc HC) | $44.31$ pp | derived | $0.6228 - 0.1797$ |
| **Sample-size reconciliation:** | | | |
| Big-4 signatures with pre-computed descriptors | $150{,}442$ | Script 39 / 39b / 39d / 39e / 45 / 46 | descriptor-complete subset |
| Big-4 signatures with feature + dHash vectors stored | $150{,}453$ | Script 40b / 43 / 44 | vector-complete subset |
| Difference between the two counts | $11$ signatures | direct (descriptor-completion lag) | negligible at population scale |
| Big-4 CPAs all (any signature count) | $468$ | Script 40b / 43 / 44 | direct |
| Big-4 CPAs with $n_{\text{sig}} \geq 10$ for accountant-level stability | $437$ | Scripts 36 / 38 / 39 | accountant-level analysis threshold |
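The §III-I.4 correction pipeline tabled above (firm centring × integer jitter) amounts to two array transforms applied before the dip test. A minimal sketch on synthetic data (firm labels, Poisson rates, and variable names are illustrative stand-ins; the dip test itself, run downstream via a dip-statistic package, is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-signature dHash distances grouped by firm;
# the real per-firm means in the corpus are roughly 2.7-7.4.
firms = np.repeat(["A", "B", "C", "D"], 1000)
dh = np.concatenate(
    [rng.poisson(lam, 1000) for lam in (2.7, 6.5, 7.4, 7.2)]
).astype(float)

# Correction 1 -- integer jitter: dHash is integer-valued, so pooled
# histograms carry mass-point artefacts; uniform jitter in [-0.5, +0.5]
# smooths ties without moving any observation past a neighbouring integer.
jittered = dh + rng.uniform(-0.5, 0.5, size=dh.size)

# Correction 2 -- firm centring: subtract each firm's mean so that a
# between-firm location shift cannot masquerade as within-firm bimodality.
firm_means = {f: dh[firms == f].mean() for f in "ABCD"}
centred = jittered - np.array([firm_means[f] for f in firms])
```

The 2×2 factorial in Script 39e applies each correction independently and jointly; only the joint cell (centred + jittered) corresponds to `centred` above.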
---
## Cross-reference index (author working checklist; remove before submission)
- **Big-4 sub-corpus definition** (§III-G) — 437 CPAs / $n_{\text{sig}} \geq 10$ at accountant-level, 468 CPAs / 150,442–150,453 signatures at signature-level (sample-size reconciliation in §III-G).
- **Reference populations** (§III-H) — Firm A as templated-end case study; non-Big-4 ($n = 249$) as reverse-anchor reference (less-replicated population).
- **Distributional diagnostics + composition decomposition** (§III-I) — Big-4 accountant-level dip-test rejection ($p < 5 \times 10^{-4}$); §III-I.4's 2×2 factorial decomposition (firm centring × integer jitter) shows the rejection is fully explained by between-firm location shift + integer mass-point artefacts; **no within-population bimodality and no natural threshold**.
- **K=3 as descriptive firm-compositional partition** (§III-J) — C1/C2/C3 are descriptive positions on the descriptor plane reflecting Firm A vs others composition; not mechanism clusters; not used as operational classifier.
- **Convergent internal-consistency** (§III-K) — three feature-derived scores ($\rho \geq 0.879$, not independent measurements); per-signature K=3 ($\kappa = 0.87$ vs per-CPA fit); K=2 LOOO unstable, K=3 LOOO partial; pixel-identity miss rate $0\%$ on $n = 262$.
- **Anchor-based threshold calibration + operational classifier** (§III-L) — inherited five-way rule retained; characterised by inter-CPA negative-anchor coincidence rates at per-comparison (§III-L.1: cos $> 0.95$ at $0.0006$, dHash $\leq 5$ at $0.0013$, joint at $0.00014$), per-signature pool (§III-L.2: $0.11$ any-pair HC), per-document (§III-L.3: HC $0.18$; HC+MC $0.34$); firm heterogeneity (§III-L.4) decisive after pool-size adjustment; within-firm cross-CPA collision concentration $\geq 97\%$; threshold-sensitivity analysis (§III-L.5) confirms HC threshold is locally sensitive, not plateau-stable; deployed-rate excess over proxy (§III-L.6) $\approx 38$ pp per-signature and $\approx 44$ pp per-document.
- **Validation strategy and limitations** (§III-M) — multi-tool diagnostic collection (9 tools, each with disclosed untested assumption); positioning as anchor-calibrated screening framework with human-in-the-loop review, not as validated forensic detector; no FRR / sensitivity / EER / ROC-AUC reportable.
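The any-pair vs same-pair semantics referenced for §III-L.2 can be made concrete. A minimal sketch, assuming any-pair lets the cosine and dHash conditions fire on different anchor pairs while same-pair requires a single anchor to satisfy both (function and argument names are illustrative, not the production rule):

```python
import numpy as np

def signature_alerts(cos_sims: np.ndarray, dh_dists: np.ndarray,
                     cos_thr: float = 0.95, dh_thr: int = 5) -> tuple[bool, bool]:
    """Two alert semantics for one query signature against an anchor pool.

    cos_sims[i] / dh_dists[i]: cosine similarity and dHash distance of the
    query against anchor i.

    any-pair:  cosine and dHash conditions may fire on *different* anchors
               (the more permissive semantics).
    same-pair: both conditions must fire on the *same* anchor (stricter).
    """
    cos_hit = cos_sims > cos_thr
    dh_hit = dh_dists <= dh_thr
    any_pair = bool(cos_hit.any() and dh_hit.any())
    same_pair = bool((cos_hit & dh_hit).any())
    return any_pair, same_pair

# Cosine fires on anchor 0 only, dHash on anchor 1 only: any-pair alerts,
# same-pair does not.
a, s = signature_alerts(np.array([0.97, 0.40]), np.array([12, 3]))
```

Same-pair alerts are a subset of any-pair alerts by construction, which is consistent with the §III-L.2 rates (same-pair $0.0827$ below any-pair $0.1102$).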
## Open questions remaining for partner / reviewer