Revise §III v4.0 draft per codex round-21 review (Major Revision -> v2)
Codex gpt-5.5 xhigh review of v1 draft returned Major Revision with
5 Major findings + 7 Minor + editorial nits. v2 addresses all of
them.
Key v2 changes:
1. Primary classifier declared: inherited v3.x five-way per-signature
box rule. K=3 mixture is demoted to accountant-level descriptive
characterisation (Script 35 / Script 38 footing), explicitly NOT
used to assign signature- or document-level labels.
2. §III-J reframed as "Mixture Model and Accountant-Level
Characterisation" (was "Mixture Model and Operational Threshold
Derivation"). K=3 LOOO P2_PARTIAL verdict surfaced in prose
including the "not predictively useful as an operational
classifier" interpretation from the Script 37 verdict legend.
3. §III-K renamed "Convergent Internal-Consistency Checks" (was
"Convergent Validation") with explicit caveat that the three
scores share underlying features and are not statistically
independent measurements.
4. §III-H reverse-anchor paragraph rewritten: the directional
error in v1 (the non-Big-4 reference described as a "more-
replicated-population baseline") is corrected -- the reference
is in fact in the LESS-replicated regime relative to Big-4,
and the score measures deviation in the hand-leaning direction.
5. Pixel-identity metric renamed from "FAR" to "positive-anchor
miss rate" with explicit conservative-subset caveat
("near-tautological for the box rule because byte-identical
=> cosine ~1 / dHash ~0").
6. §III-L title changed to "Signature- and Document-Level
Classification" (was "Per-Document Classification") and
reorganised so the per-signature five-way rule + document-level
worst-case aggregation are both clearly under this section.
7. Empirical slips corrected:
- K=2 LOOO comparison: now correctly says "5.6x the stability
tolerance 0.005" rather than "5.6x the bootstrap CI half-width";
full-Big-4 bootstrap half-width 0.0015 cited separately.
- all-non-Firm-A dip: now correctly (0.998, 0.907), not "p > 0.99".
- BD/McCrary: now narrowed to Big-4 scope (Script 34 null), with
Script 32 dHash transitions for non-Big-4 subsets noted but
not used as operational thresholds.
- Firm A byte-identical "50 partners of 180 registered, 35
cross-year" -- now explicitly inherited from v3.x §IV-F.1 /
Script 28 / Appendix B; provenance row in the new table flags
this as inherited, not v4-regenerated.
- "mid/small-firm tail actively pulling" -> "the full-sample and
Big-4-only calibrations differ" (causal language softened).
- $\Delta\text{BIC}$ sign convention: explicit "lower BIC is
preferred; BIC(K=3) - BIC(K=2) = -3.48".
8. Editorial nits applied:
- "failure rate" -> "box-rule hand-leaning rate"
- "boundary moves modestly" -> "membership remains
composition-sensitive"
- "calibration uncertainty band +/- 5-13 pp" -> "observed absolute
differences of 1.8-12.8 pp, with Firm C exceeding the 5 pp
viability bar"
- "strongest single methodology-validation signal" -> "strongest
internal-consistency signal"
- "the same component structure recovers" -> "a broadly similar
three-component ordering recovers"
- Cross-reference index marked as author checklist (remove
before submission).
9. New provenance table at end of §III mapping every numerical claim
to (script, source, direct/derived/inherited).
10. Open questions reduced from 5 to 3 (codex resolved questions 2,
3, 4 with concrete answers); remaining 3 are forward-looking
(5-way moderate band, pseudonym consistency, table numbering).
Also commits: paper/codex_review_gpt55_v4_round1.md (codex review
artifact, 143 lines).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,10 +1,10 @@
# Section III. Methodology — v4.0 Draft (Big-4 reframe)
# Section III. Methodology — v4.0 Draft v2 (post codex round-21 revision)
> **Draft note (2026-05-12).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here. The v4.0 sub-section ordering inserts a new **§III-K Convergent Validation** between the old K (per-document classification, now §III-L) and J (validation anchors, now folded into §III-K and §III-J). Empirical anchors throughout reference Scripts 32–40 on branch `paper-a-v4-big4`.
> **Draft note (2026-05-12, v2).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here. v2 incorporates codex gpt-5.5 round-21 review (`paper/codex_review_gpt55_v4_round1.md`); the key revisions are: (i) the inherited five-way per-signature box rule is restored as the **primary operational classifier** (§III-L), (ii) the K=3 Gaussian mixture is positioned as an **accountant-level descriptive characterisation**, not an operational threshold (§III-J), (iii) the "convergent validation" framing is softened to "convergent internal-consistency checks" since the three scores share underlying features (§III-K), (iv) the pixel-identity metric is renamed from FAR to positive-anchor miss rate (§III-K), (v) five empirical/wording slips flagged by the reviewer are corrected. Empirical anchors throughout reference Scripts 32–40 on branch `paper-a-v4-big4`; a provenance table appears at the end of this section listing every numerical claim with its script and report path.
## G. Unit of Analysis and Scope
We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of all signature-level capture-rate analyses (§IV-D, §IV-F, §IV-G). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), per-CPA convergent validation (§III-K), and the leave-one-firm-out reproducibility analysis (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of all signature-level capture-rate analyses (§IV-D, §IV-F, §IV-G). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.
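The per-CPA aggregation described in this sub-section can be sketched in a few lines; the record layout and toy values below are hypothetical, and only the $n_{\text{sig}} \geq 10$ cutoff is taken from the text:

```python
from collections import defaultdict

MIN_SIGS = 10  # §III-G: per-CPA means are reported only for n_sig >= 10

def per_cpa_means(records):
    """records: iterable of (cpa_id, best_match_cosine, independent_min_dhash).
    Returns {cpa_id: (mean_cos, mean_dhash)} for CPAs meeting MIN_SIGS."""
    by_cpa = defaultdict(list)
    for cpa, cos, dhash in records:
        by_cpa[cpa].append((cos, dhash))
    out = {}
    for cpa, sigs in by_cpa.items():
        if len(sigs) < MIN_SIGS:
            continue  # excluded from accountant-level analyses, kept per-signature
        n = len(sigs)
        out[cpa] = (sum(c for c, _ in sigs) / n, sum(d for _, d in sigs) / n)
    return out

# toy example: one CPA with 12 signatures, one with 4 (excluded)
records = [("cpa_1", 0.97, 3.0)] * 12 + [("cpa_2", 0.94, 9.0)] * 4
means = per_cpa_means(records)
```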
@@ -14,113 +14,117 @@ We adopt one stipulation about same-CPA pair detectability:
A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.
**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, all §IV results except §IV-K) are restricted to the four largest accounting firms in Taiwan (referred to throughout as the Big-4: 勤業眾信聯合 / 安侯建業聯合 / 資誠聯合 / 安永聯合, pseudonymously Firms A through D in the rest of the paper). The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available. Restricting the analyses to Big-4 is a methodological choice driven by three considerations:
**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, all §IV results except §IV-K) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available (Script 38 verification). Restricting the analyses to Big-4 is a methodological choice driven by four considerations:
1. **Within-pool homogeneity for mixture characterisation.** Pooling Big-4 with mid- and small-firm CPAs introduces a heterogeneous tail of $\sim$250 CPAs distributed across $\sim$30 firms with idiosyncratic signing practices and small per-firm samples. Including this tail in a single 2D-GMM mixture distorts the inferred component locations and weights — empirically, the published v3.x "natural threshold" $\overline{\text{cos}} = 0.945$, $\overline{\text{dHash}} = 8.10$ shifts to $\overline{\text{cos}} = 0.975$, $\overline{\text{dHash}} = 3.76$ when restricted to Big-4 (Script 34, bootstrap 95% CI $[0.974, 0.977]$ / $[3.48, 3.97]$, $n_{\text{boot}} = 500$). The shift is large compared to either CI half-width, indicating the mid/small-firm tail was actively pulling the v3.x crossing rather than acting as inert noise.
1. **Within-pool homogeneity for mixture characterisation.** Pooling Big-4 with mid- and small-firm CPAs introduces a heterogeneous tail of $\sim$249 CPAs distributed across multiple firms with idiosyncratic signing practices and small per-firm samples. The full-sample and Big-4-only calibrations *differ* in their fitted marginal crossings (full-sample published $\overline{\text{cos}}^* = 0.945$, $\overline{\text{dHash}}^* = 8.10$ from v3.x; Big-4-only $\overline{\text{cos}}^* = 0.975$, $\overline{\text{dHash}}^* = 3.76$ from Script 34; bootstrap 95% CIs $[0.974, 0.977]$ / $[3.48, 3.97]$, $n_{\text{boot}} = 500$); the offset is large compared to the Big-4 bootstrap CI half-width of $0.0015$. We report this as a *scope-dependent shift* rather than asserting a causal "mid/small-firm tail distorts" claim.
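The "offset is large compared to the CI half-width" comparison is simple arithmetic over the reported cosine crossings; a sketch (input values from the text, the ratio itself is derived, not a reported figure):

```python
def offset_in_halfwidths(full_sample_value, big4_value, ci_halfwidth):
    """Scope-dependent shift expressed in units of the bootstrap CI half-width."""
    return abs(big4_value - full_sample_value) / ci_halfwidth

# cosine crossings 0.945 (full-sample, v3.x) vs 0.975 (Big-4, Script 34);
# Big-4 bootstrap half-width 0.0015 as quoted above
shift = offset_in_halfwidths(0.945, 0.975, 0.0015)
```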
2. **Statistical multimodality at the accountant level.** Within the Big-4 sub-corpus, the Hartigan dip test rejects unimodality on both axes ($p < 10^{-4}$, $n = 437$ CPAs; Script 34) for the per-CPA $\overline{\text{cos}}$ and $\overline{\text{dHash}}$ distributions. No such rejection holds within any single firm pooled alone (each firm's per-CPA distribution is dip-test unimodal at $p > 0.92$; Scripts 32, 34) or within the all-non-Firm-A pool ($p > 0.99$; Script 32). Big-4 is the smallest scope at which the dip test supports applying a finite-mixture model to the per-CPA distribution.
2. **Statistical multimodality at the accountant level.** Within the Big-4 sub-corpus, the Hartigan dip test rejects unimodality on both axes (cosine $p = 0.0000$, dHash $p = 0.0000$ in the bootstrap-2000 implementation, i.e., no bootstrap replicate exceeded the observed statistic; reported here as $p < 5 \times 10^{-4}$; Script 34). No such rejection holds within any single firm pooled alone (Firm A: $p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$; Script 32). Pooling Firm A's three Big-4 peers gives $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$ (Script 32, `big4_non_A` subset). Pooling all non-Firm-A CPAs (other Big-4 plus mid/small firms) gives $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$ (Script 32, `all_non_A` subset). Big-4 is the smallest scope at which the dip test supports applying a finite-mixture model to the per-CPA distribution.
3. **Reproducibility under leave-one-firm-out cross-validation.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 mixture fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm); the held-out firm's CPAs are then classified by the rule derived from the other three firms. No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm, far below the sample sizes required for stable mixture-fit estimation on a held-out fold.
3. **Reproducibility under leave-one-firm-out cross-validation.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 mixture fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm.
The full dataset (686 CPAs across Big-4 plus mid/small firms) is analysed in a robustness section (§IV-K) so that v4.0's pipeline-reproducibility claim is demonstrable at multiple scopes, while the primary methodology operates at the scope where its statistical assumptions (dip-test multimodality, firm-level LOOO feasibility) are met.
4. **Restricted generalizability claim.** v4.0's primary claims are scoped to the Big-4 audit-report context; we do not assert that the same mixture structure or operational thresholds extend to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H's reverse-anchor internal-consistency check and (b) as a robustness comparison in §IV-K. Generalisation beyond Big-4 is left as future work.
## H. Reference Populations
v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single Firm-A-as-anchor framing.
v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing.
**Internal reference: Firm A as the templated end of Big-4 (case study).** Firm A is empirically the most digitally-replicated of the Big-4. In the Big-4 K=3 mixture (§III-J, Script 35), Firm A accounts for 0% of the C1 hand-leaning component, 17.5% of the C2 mixed component, and 82.5% of the C3 replicated component; the reverse pattern holds at PwC (23.5% C1, 75.5% C2, 1.0% C3). Automated byte-level pair analysis (§IV-F.1; reproduction artifact in Appendix B) identifies 145 Firm A signatures byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years. Byte-identity implies pixel-identity by construction, and independent hand-signing cannot produce pixel-identical images across distinct reports — these pairs establish image reuse as a concrete, threshold-free phenomenon within Firm A.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the most digitally-replicated of the Big-4. In the Big-4 K=3 mixture (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 hand-leaning component (cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 mixed component, and 82.5% of the C3 replicated component; the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3), the Big-4 firm whose CPAs are most concentrated in C1. The byte-level pair analysis reported in v3.x §IV-F.1 identifies 145 pixel-identical Firm A signatures (Script 40 verifies the 145/262 split among Big-4 pixel-identical signatures); the additional details that v3.x attributes to this analysis (50 distinct Firm A partners of 180 registered; 35 byte-identical matches spanning different fiscal years) are inherited from the Script 28 / Appendix B byte-decomposition output and were not regenerated in the v4.0 spike scripts. We retain those v3.x details by reference and mark them in the provenance table as "inherited from v3 §IV-F.1 / Script 28."
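Byte-identity detection of the kind inherited from v3.x §IV-F.1 can be sketched with a content hash; the function and record layout below are hypothetical illustrations, not the Script 28 implementation:

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def byte_identical_pairs(images):
    """images: iterable of (cpa_id, report_id, image_bytes).
    Returns same-CPA pairs from *different* reports whose image bytes are
    identical; byte-identity implies pixel-identity by construction."""
    groups = defaultdict(list)
    for cpa, report, blob in images:
        digest = hashlib.sha256(blob).hexdigest()
        groups[(cpa, digest)].append(report)
    pairs = []
    for (cpa, _), reports in groups.items():
        for r1, r2 in combinations(sorted(set(reports)), 2):
            pairs.append((cpa, r1, r2))
    return pairs

# toy example: the same image bytes reused across two reports by one CPA
blob = b"toy-signature-bytes"
imgs = [("cpa_1", "2019_r1", blob), ("cpa_1", "2020_r7", blob),
        ("cpa_2", "2019_r2", b"other-bytes")]
pairs = byte_identical_pairs(imgs)
```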
In v4.0, Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with KPMG, PwC, and EY; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the within-Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level ground truth that anchors §III-K's pixel-identity false-alarm-rate validation.
In v4.0, Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate.
**External reference: non-Big-4 as the reverse-anchor reference for cross-validation.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor convergent-validation analysis: each Big-4 CPA's deviation from the reference centre, measured as the marginal cosine left-tail percentile under the reference, is one of three independent statistical lenses that v4.0 uses to validate its hand-leaning ranking. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its sole role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be cross-checked.
**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores that v4.0 uses as a cross-check on the inherited per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked.
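The reverse-anchor score sketched below is the marginal-cosine CDF value under the reference; the centre is the reported Script 38 value, but the marginal scale is an illustrative stand-in (the paper fits the full 2D reference by MCD):

```python
import math

def reference_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x: the fraction of the reference population
    at or below x on the cosine axis; lower values = deeper in the left tail."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

REF_MU = 0.935     # reference centre on the cosine axis (Script 38)
REF_SIGMA = 0.02   # illustrative scale -- NOT a reported value

# a Big-4 CPA further into the reference's left tail is more hand-leaning
score = reference_cdf(0.905, REF_MU, REF_SIGMA)
```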
The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$. This sits below the Big-4 K=3 hand-leaning component centre ($\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$), confirming that the non-Big-4 reference population is itself in the lower-cosine / higher-dHash regime relative to Big-4 — appropriate for use as a "more-replicated-population" baseline against which Big-4 hand-leaning CPAs deviate by being further into the left tail of cosine.
The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 hand-leaning component (cos $= 0.946$, dHash $= 9.17$; §III-J); compared to the Big-4 templated component (cos $= 0.983$, dHash $= 2.41$; §III-J) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a more hand-leaning Big-4 CPA. This is a "deviation in the hand-leaning direction" measure, not a "deviation toward replication" measure; the reference is the less-replicated population.
## I. Distributional Characterisation at the Accountant Level
This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G. Three diagnostic procedures are applied: a univariate unimodality test on each marginal axis, a 2D Gaussian mixture fit (developed further in §III-J), and a density-smoothness diagnostic. These procedures play decreasing-strength roles in v4.0: the dip test directly motivates whether a mixture model is statistically warranted; the mixture is the operational generative model whose fit is reported in §III-J; the density-smoothness diagnostic complements the mixture with a non-parametric check on local discontinuity that is independent of the mixture's parametric form.
This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G. Three diagnostic procedures are applied: a univariate unimodality test on each marginal axis, a 2D Gaussian mixture fit (developed in §III-J), and a density-smoothness diagnostic.
**1. Hartigan dip test on each marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). Both marginals reject unimodality with $p < 10^{-4}$ (Script 34). For comparison, no rejection of unimodality holds at any narrower scope: Firm A alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); KPMG + PwC + EY pooled ($p_{\text{cos}} = 0.999$, $p_{\text{dHash}} = 0.906$, $n = 266$); all-non-Firm-A pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). The dip-test multimodality at the Big-4 level is the empirical justification for fitting a finite-mixture model in §III-J; without it, the mixture would be a forced fit on an essentially unimodal distribution.
**1. Hartigan dip test on each marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds at any narrower scope or in the non-Big-4 population (Script 32): Firm A alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). The dip-test multimodality at the Big-4 level is the empirical justification for fitting a finite-mixture model in §III-J; without it, the mixture would be a forced fit on an essentially unimodal distribution.
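The $p < 5 \times 10^{-4}$ reporting convention (never $p = 0$) follows directly from the bootstrap resolution; a minimal sketch:

```python
def bootstrap_p_value(observed_stat, null_stats):
    """One-sided bootstrap p-value with an explicit resolution floor: when no
    null replicate reaches the observed statistic, report a bound (< 1/n_boot)
    rather than p = 0."""
    n_boot = len(null_stats)
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    if exceed == 0:
        return ("<", 1.0 / n_boot)  # n_boot = 2000  ->  "p < 5e-4"
    return ("=", exceed / n_boot)

# toy: 2000 null replicates, none reaching the observed dip statistic
rel, p = bootstrap_p_value(0.08, [0.01] * 2000)
```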
**2. Mixture-model evidence.** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42) fit to the joint $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ respectively over $n_{\text{boot}} = 500$ resamples (Script 34). A 3-component fit (§III-J) is BIC-preferred ($-1111.93$ vs $-1108.45$), and the K=3 component structure carries different operational implications; we develop both fits in §III-J and discuss the choice between them in §III-K.
**2. Mixture-model evidence.** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42; Script 34) recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ over $n_{\text{boot}} = 500$ resamples. The 3-component fit (§III-J) is BIC-preferred — using the convention that lower BIC is preferred, $\text{BIC}(K{=}3) - \text{BIC}(K{=}2) = -3.48$ (Script 36). The $\Delta$BIC magnitude is small in absolute terms; we do not treat $\Delta\text{BIC} = 3.5$ alone as decisive evidence for K=3, and the operational role of each fit is developed in §III-J and §III-K.
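The marginal crossings quoted above are the points where the two weighted component densities intersect; a stdlib grid-scan sketch (component means and weights from the K=2 fit; the standard deviations are illustrative stand-ins, so the result is not the reported $0.9755$):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def marginal_crossing(mu1, s1, w1, mu2, s2, w2, n_grid=20_000):
    """Grid-scan for the point between mu1 < mu2 where the two weighted
    component densities w1*N(mu1, s1^2) and w2*N(mu2, s2^2) are equal."""
    best_x, best_gap = mu1, float("inf")
    for i in range(n_grid + 1):
        x = mu1 + (mu2 - mu1) * i / n_grid
        gap = abs(w1 * normal_pdf(x, mu1, s1) - w2 * normal_pdf(x, mu2, s2))
        if gap < best_gap:
            best_x, best_gap = x, gap
    return best_x

# K=2 means/weights from the text; the sds (0.012, 0.006) are stand-ins
x_star = marginal_crossing(0.954, 0.012, 0.689, 0.983, 0.006, 0.311)
```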
**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38, 39] as a *density-smoothness diagnostic* rather than as a third threshold estimator, on each marginal axis (cosine in bins of $0.002$, dHash in integer bins). The diagnostic identifies no significant transition on either axis at $\alpha = 0.05$ (Script 32 / Script 34). This null result is consistent with the mixture-model evidence: the K=3 components overlap rather than separate sharply, so a local-discontinuity test does not flag a transition. We retain BD/McCrary in v4.0 as a non-parametric robustness check — its null result indicates that no boundary location chosen by the mixture model is corroborated by sharp local discontinuity, which we read as an honest description of the data's continuous-overlap structure rather than a methodological deficiency.
**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38, 39] as a *density-smoothness diagnostic* (rather than as a threshold estimator) on each marginal axis (cosine in bins of $0.002$, dHash in integer bins). At the Big-4 scope, the diagnostic identifies no significant transition on either marginal at $\alpha = 0.05$ (Script 34). Outside Big-4, the diagnostic does flag dHash transitions in some subsets (Script 32: `big4_non_A` dHash transition at $10.8$; `all_non_A` dHash transition at $6.6$; pre-2018 and post-2020 time-stratified variants also exhibit one or more dHash transitions), but no cosine transition is identified in any subset. The Big-4-scope null on both axes is consistent with the mixture-model evidence: the K=3 components overlap rather than separate sharply, so a local-discontinuity test does not flag a transition. We retain BD/McCrary in v4.0 as a non-parametric robustness diagnostic; the dHash transitions outside Big-4 are not used as operational thresholds because they are scope-dependent and lie within rather than between modes of the corresponding density.
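As a rough illustration of the standardized-difference logic behind this diagnostic (not the Script 32/34 implementation; the variance approximation follows the spirit of [38]):

```python
import math

def standardized_differences(counts):
    """For each interior bin, compare the observed count with the mean of its
    two neighbours, standardized by an approximate binomial variance; a large
    |z| flags a local density transition, a smooth histogram yields small z."""
    n_total = sum(counts)
    zs = []
    for i in range(1, len(counts) - 1):
        p_i = counts[i] / n_total
        p_nb = (counts[i - 1] + counts[i + 1]) / n_total
        expected = (counts[i - 1] + counts[i + 1]) / 2.0
        var = n_total * p_i * (1 - p_i) + 0.25 * n_total * p_nb * (1 - p_nb)
        zs.append((counts[i] - expected) / math.sqrt(var))
    return zs

# smooth histogram -> all |z| small; a spiked bin would stand out
zs = standardized_differences([50, 52, 48, 51, 49, 50])
```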
The operational threshold derivation (§III-J) and the convergent validation (§III-K) both operate on the K=3 fit. The K=2 fit and its marginal crossings are reported here for completeness and to enable cross-paper-version comparison with v3.x, but are not used as v4.0's operational rule.
## J. Mixture Model and Accountant-Level Characterisation
## J. Mixture Model and Operational Threshold Derivation
This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive characterisations of the joint Big-4 distribution; the operational per-signature classifier remains the inherited five-way box rule of §III-L.** Neither mixture is used to assign signature-level or document-level labels in the v4.0 primary analysis.
We fit 2- and 3-component 2D Gaussian Mixture Models (full covariance, $n_{\text{init}} = 15$, $\text{max\_iter} = 500$, fixed seed 42) to the joint $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution of the 437 Big-4 CPAs. We report both fits and develop the operational threshold from the K=3 fit, with the K=2 fit retained as a reference and the choice of K=3 over K=2 justified in §III-K.
**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ ("hand-leaning"), weight $0.689$, and $(0.983, 2.41)$ ("replicated"), weight $0.311$ (Script 34). $\text{BIC}(K{=}2) = -1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$.
**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$ ("hand-leaning"), and $(0.983, 2.41)$, weight $0.311$ ("replicated"). BIC = $-1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$.
**K=3 fit.** Three components, sorted by ascending cosine mean (Script 35; Script 38 reproduces):
**K=3 fit (selected for v4.0).** Three components, sorted by ascending cosine mean:

| Component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | interpretation |
|---|---|---|---|---|
| C1 hand-leaning | 0.9457 | 9.17 | 0.143 | low-cosine, high-dHash; cross-firm membership |
| C2 mixed | 0.9558 | 6.66 | 0.536 | intermediate; spans all four Big-4 firms |
| C3 replicated | 0.9826 | 2.41 | 0.321 | high-cosine, low-dHash; dominated by Firm A |

| Component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive label |
|---|---|---|---|---|
| C1 | 0.9457 | 9.17 | 0.143 | hand-leaning |
| C2 | 0.9558 | 6.66 | 0.536 | mixed |
| C3 | 0.9826 | 2.41 | 0.321 | replicated |
$\text{BIC}(K{=}3) = -1111.93$, lower than $\text{BIC}(K{=}2)$ by $3.48$ (mild support for K=3 under standard BIC interpretation, but not by itself decisive). The K=3 components carry direct descriptive interpretations from the joint Big-4 distribution: C3 is the templated mode dominated by Firm A ($82.5\%$ of Firm A CPAs assigned to C3 by hard-posterior assignment; Script 35); C1 is the hand-leaning mode with cross-firm membership (24 Firm C, 10 Firm B, 6 Firm D, 0 Firm A; Script 35); C2 is the intermediate mode that contains the majority of all four Big-4 firms' CPAs.
**Hard-posterior assignment (descriptive).** Each CPA $a$ is assigned to the component $C_k$ that maximises $P(C_k \mid (\overline{\text{cos}}_a, \overline{\text{dHash}}_a))$ under the K=3 fit; CPAs assigned to C3 carry the descriptive label *replicated*, C1 the label *hand-leaning*, and C2 the label *mixed*. This assignment is an honest description of the joint Big-4 distribution, not an operational threshold: the dip-test rejection of unimodality (§III-I) supports treating the data as multi-mode, but the components overlap in their tails (BD/McCrary null on the marginal density), so the hard boundary carries the calibration uncertainty quantified in §III-K. No signature- or document-level label in v4.0 is derived from this assignment (§III-L).
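Under the stated hyperparameters, the fit and the hard-posterior assignment reduce to a few lines of scikit-learn. A sketch on synthetic stand-in data (component centres seeded from the table above; all counts and spreads are illustrative):

```python
# Sketch: K=2/K=3 GMM fits with the stated hyperparameters, plus the
# descriptive hard-posterior assignment. Data is SYNTHETIC, not corpus data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Synthetic stand-in for 437 per-CPA (mean cosine, mean dHash) points.
X = np.vstack([
    rng.normal([0.946, 9.2], [0.010, 1.5], size=(60, 2)),   # hand-leaning-like
    rng.normal([0.956, 6.7], [0.010, 1.5], size=(235, 2)),  # mixed-like
    rng.normal([0.983, 2.4], [0.005, 0.8], size=(142, 2)),  # replicated-like
])

fits = {
    k: GaussianMixture(n_components=k, covariance_type="full",
                       n_init=15, max_iter=500, random_state=42).fit(X)
    for k in (2, 3)
}
for k, gm in fits.items():
    print(f"K={k}: BIC={gm.bic(X):.2f}")

gm3 = fits[3]
order = np.argsort(gm3.means_[:, 0])  # sort components by cosine mean
hard = gm3.predict(X)                 # hard-posterior component index
post = gm3.predict_proba(X)           # per-CPA posterior P(C_k | x)
c1 = order[0]                         # lowest-cosine component, i.e. "C1"
print(f"C1-assigned fraction: {(hard == c1).mean():.3f}")
```

`predict` returns the argmax-posterior component, which is exactly the hard-posterior assignment described above.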
**Why we report both K=2 and K=3.** Leave-one-firm-out cross-validation (§III-K) shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The max fold-to-fold deviation of the cosine crossing is $0.028$, which is $5.6\times$ the report's $0.005$ across-fold *stability tolerance* (not bootstrap CI; the full-Big-4 bootstrap cosine half-width is the much smaller $0.0015$). K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.025$ (Script 37). K=3 hard-posterior membership for the held-out firm is more composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL`, with the explicit interpretation that "the C1 cluster exists but membership is not well-predicted by the held-out fit" and "K=3 is not predictively useful as an operational classifier" (Script 37 verdict legend).
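The leave-one-firm-out protocol itself is a short refit loop. A sketch on synthetic data (firm sizes follow §III-G; the descriptor values are illustrative):

```python
# Sketch of the leave-one-firm-out (LOOO) protocol: refit on three firms,
# evaluate the held-out firm's C1 membership. Data is SYNTHETIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
firms = np.repeat(["A", "B", "C", "D"], [171, 112, 102, 52])
X = np.column_stack([rng.normal(0.96, 0.015, firms.size),   # mean cosine
                     rng.normal(6.0, 2.5, firms.size)])      # mean dHash

for held_out in ["A", "B", "C", "D"]:
    train, test = X[firms != held_out], X[firms == held_out]
    gm = GaussianMixture(n_components=3, covariance_type="full",
                         n_init=15, max_iter=500, random_state=42).fit(train)
    c1 = np.argsort(gm.means_[:, 0])[0]  # lowest-cosine component
    c1_rate = (gm.predict(test) == c1).mean()
    print(f"held-out {held_out}: C1 rate {c1_rate:.1%}, "
          f"C1 center ({gm.means_[c1, 0]:.3f}, {gm.means_[c1, 1]:.2f})")
```

Comparing the fold C1 centres against the full-sample fit gives the component-shape drift; comparing `c1_rate` against the full-sample rate gives the membership deviation in percentage points.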
The signature-level operational rule is inherited from v3.x for continuity and is reported in §III-L: a signature is classified as *non-hand-signed* iff $\text{cos} > 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. The agreement between this signature-level box rule and the per-CPA K=3 hard label is reported in §III-K (Cohen $\kappa = 0.66$ on the binary collapse) and is one of the convergent internal-consistency signals documented there.
We take the joint K=2/K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites a v4.0 operational classifier:
- The Big-4 K=2 marginal crossing $(0.975, 3.76)$ is essentially a firm-mass separator between Firm A and Firms B + C + D, not a within-Big-4 mechanism boundary.
- The Big-4 K=3 mixture exhibits a reproducible three-component shape across LOOO folds; a hand-leaning component (C1) exists at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$ with weight $\approx 0.14$.
- Hard-posterior membership in C1 is composition-sensitive (max absolute deviation $12.8$ pp across LOOO folds, exceeding the report's $5$ pp viability bar); K=3 is therefore not used to assign operational hand-leaning labels to CPAs in v4.0.
The operational signature-level classifier remains the inherited five-way box rule of §III-L, calibrated as in v3.x. Cross-checks between the inherited rule and the K=3 mixture appear in §III-K.
## K. Convergent Internal-Consistency Checks
The mixture characterisation of §III-J is supported by three feature-derived per-CPA scores and a hard-ground-truth subset analysis. We caution at the outset that the three scores are **not statistically independent measurements** — all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ — so their high pairwise rank correlations are partly a mechanical consequence of shared inputs. The checks below therefore document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide).
**1. Three feature-derived per-CPA scores (Script 38).** For each Big-4 CPA we compute:
- **Score 1 (mixture posterior):** $P(\text{C1}_{\text{hand-leaning}})$ from the K=3 fit of §III-J — a function of both descriptor means.
- **Score 2 (reverse-anchor cosine percentile):** the marginal cosine CDF value of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H, sign-flipped so that lower percentile (further into the reference's left tail) corresponds to a more hand-leaning Big-4 CPA. This is a function of $\overline{\text{cos}}_a$ alone.
- **Score 3 (Paper A box-rule hand-leaning rate):** the per-CPA fraction of signatures that do **not** satisfy the inherited binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$). This is a per-signature-aggregated function of the same descriptors.
Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs (Script 38):
| Pair | Spearman $\rho$ | $p$-value |
|---|---|---|
| Score 1 vs Score 3 | $+0.963$ | $< 10^{-248}$ |
| Score 2 vs Score 3 | $+0.889$ | $< 10^{-149}$ |
| Score 1 vs Score 2 | $+0.879$ | $< 10^{-142}$ |
We read this as the strongest internal-consistency signal in v4.0: three different summarisations of the same descriptor pair agree on the per-CPA hand-leaning ranking with $\rho > 0.87$. The three scores agree on the per-firm ordering: by all three scores, Firm A is the most replication-dominated and Firm C is the most hand-leaning of the Big-4 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the inherited box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs.
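For concreteness, Score 2 and a simplified stand-in for Score 3 can be computed and rank-correlated as follows. Everything here is illustrative: `mu_ref` and `sigma_ref` are placeholder reference-Gaussian parameters, the descriptors are synthetic, and the real Score 3 aggregates per-signature box-rule outcomes rather than thresholding the per-CPA means:

```python
# Sketch: reverse-anchor percentile score and a simplified box-rule score,
# rank-correlated with Spearman. All data and parameters are SYNTHETIC.
import numpy as np
from scipy.stats import norm, spearmanr

rng = np.random.default_rng(1)
n = 437
mean_cos = rng.normal(0.96, 0.015, n)  # per-CPA mean cosine (synthetic)
mean_dh = rng.normal(6.0, 2.5, n)      # per-CPA mean dHash (synthetic)

# Score 2: sign-flipped reference-CDF percentile of the mean cosine under
# an ASSUMED non-Big-4 reference Gaussian.
mu_ref, sigma_ref = 0.93, 0.02
score2 = -norm.cdf(mean_cos, mu_ref, sigma_ref)

# Simplified Score 3 stand-in: box-rule failure indicator on the means.
score3 = (~((mean_cos > 0.95) & (mean_dh <= 5))).astype(float)

rho, p = spearmanr(score2, score3)
print(f"Spearman rho(score2, score3) = {rho:.3f} (p = {p:.2g})")
```

The reported $\rho$ values are computed this way on the real per-CPA scores, with Score 1 supplied by the K=3 posterior of §III-J.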
**2. Per-signature consistency (Script 39).** Per-CPA aggregation could in principle reflect averaging across within-CPA heterogeneity rather than coherent within-CPA behaviour. We test this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points (Script 39) — and comparing labels. The per-CPA and per-signature K=3 fits recover a broadly similar three-component ordering; per-CPA C1 is at $\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$ vs per-signature C1 at $\overline{\text{cos}} = 0.928$, $\overline{\text{dHash}} = 9.75$ (an absolute cosine drift of $0.018$). Cohen $\kappa$ on the binary collapse (replicated vs not-replicated):
| Pair | Cohen $\kappa$ |
|---|---|
| Paper A binary high-confidence box rule vs per-CPA K=3 hard label | $0.662$ |
| Paper A binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ |
| Per-CPA K=3 vs per-signature K=3 | $0.870$ |
The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the inherited five-way rule. We note that this comparison validates only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly validate the five-way rule's $5 < \text{dHash} \leq 15$ moderate-confidence band, which retains its v3.x interpretation and capture-rate evaluation (§IV-F).
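The binary-collapse $\kappa$ values are plain Cohen-kappa computations on paired labels. A self-contained sketch on synthetic labels (the 10% disagreement rate is illustrative):

```python
# Sketch: Cohen kappa on the binary collapse (1 = replicated) between a
# box-rule label and a K=3 hard label. Labels are SYNTHETIC.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(2)
box_label = rng.integers(0, 2, 1000)                 # box rule: 1 = replicated
flip = rng.random(1000) < 0.10                       # 10% disagreement, assumed
k3_label = np.where(flip, 1 - box_label, box_label)  # K=3 collapse: C3 -> 1

kappa = cohen_kappa_score(box_label, k3_label)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for chance agreement, which is why it is preferred here over simple percent agreement between the two label sets.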
**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference:
- *K=2 LOOO is unstable.* Max fold-to-fold cosine crossing deviation $= 0.028$ across the four folds, against the report's $0.005$ stability tolerance (Script 36). When Firm A is held out, the fold rule classifies $171/171$ of held-out Firm A CPAs as templated; when any non-Firm-A Big-4 firm is held out, the fold rule classifies $0$ of the held-out firm's CPAs as templated. This pattern indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
- *K=3 LOOO is partially stable.* The C1 hand-leaning component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.025$ in mixture weight (Script 37). Hard-posterior membership remains composition-sensitive — observed absolute differences are $1.8$–$12.8$ pp across the four folds, with the Firm C fold exceeding the report's $5$ pp viability bar; the report's own verdict is `P2_PARTIAL` ("K=3 is not predictively useful as an operational classifier"). We accordingly do not use K=3 hard-posterior membership as an operational label.
The contrast — K=2 unstable, K=3 reproducible in component shape but composition-sensitive in membership — is the empirical justification for preferring K=3 over K=2 as the descriptive characterisation of §III-J, and for confining K=3 to that descriptive role rather than operational classification.
**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one hard ground-truth subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
We report each candidate classifier's *positive-anchor miss rate* — the fraction of byte-identical signatures misclassified as hand-leaning. This is a one-sided check against a conservative positive subset, **not a false-alarm rate in the usual two-class sense**; we do not report a paired false-alarm rate because no signature-level hand-signed ground truth exists. The corresponding signature-level false-alarm-rate evidence comes from the v3.x inter-CPA negative anchor (§III-J inherited; Table X), which retains its v3.x interpretation:
| Candidate classifier | Positive-anchor miss rate (Wilson 95% CI) |
|---|---|
| Inherited Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0\%$ $[0\%, 1.45\%]$ |
| K=3 per-CPA hard label (C3 = replicated; descriptive only) | $0\%$ $[0\%, 1.45\%]$ |
| Reverse-anchor with prevalence-calibrated cut | $0\%$ $[0\%, 1.45\%]$ |
All three candidate scores correctly assign every byte-identical signature to the replicated class. We caution that for the inherited box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them; v3.x discussed this conservative-subset caveat at length (v3 §III-J item 1, V-F). The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* this check would be disqualified), not a sufficient validation of the classifier's behaviour on the non-byte-identical replicated population. The reverse-anchor cut here is chosen by prevalence calibration against the inherited box rule's overall replicated rate ($49.58\%$ of Big-4 signatures; Script 40); this is a documented v4.0 limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
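The Wilson bound quoted in the table follows from the closed-form score interval. A minimal sketch reproducing the $[0\%, 1.45\%]$ interval for $0/262$:

```python
# Sketch: Wilson 95% score interval for a binomial proportion k/n,
# reproducing the miss-rate bound for 0 misses out of 262.
import math

def wilson_interval(k, n, z=1.959964):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)

lo, hi = wilson_interval(0, 262)
print(f"Wilson 95% CI: [{lo:.4f}, {hi:.4f}]")  # upper bound ~0.0145
```

At $k = 0$ the upper bound reduces to $z^2 / (n + z^2)$, which is why the interval is non-degenerate even with zero observed misses, unlike a naive normal-approximation interval.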
## L. Signature- and Document-Level Classification
The v4.0 operational classifier is the inherited v3.x five-way per-signature box rule, retained unchanged for two reasons: (a) it preserves continuity with the v3.x literature and its established interpretation, and (b) the convergent internal-consistency checks of §III-K show that the box rule's per-CPA-aggregated outputs agree at $\rho \geq 0.96$ with a mixture-derived score and at $\rho \geq 0.89$ with a reverse-anchor score, supporting continued use without recalibration.
**Per-signature five-way classifier.** Operational thresholds are anchored on whole-sample Firm A percentile heuristics as in v3.x: cos $> 0.95$ (Firm A P7.5) for the cosine dimension and dHash$_{\text{indep}} \leq 5$ / $> 15$ for the structural dimension. All dHash references refer to the *independent-minimum* dHash defined in §III-G. We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:
1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence.
5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.
These threshold conventions (cosine 0.95 as an operating point chosen for the capture-vs-FAR tradeoff against the inter-CPA negative anchor, with inter-CPA FAR $0.0005$ at the operating point, Wilson 95% $[0.0003, 0.0007]$; cosine 0.837 as the all-pairs KDE crossover; dHash 5 and 15 as the upper tail of the high-similarity mode and the structural-similarity ceiling respectively) are inherited from v3.x §III-K and retain their v3.x calibration and capture-rate evidence (v3.x Tables IX, XI, XII, XII-B).
**Document-level aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.x worst-case rule: the document inherits the *most-replication-consistent* signature label among the two signatures (rank order: High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed). The aggregation rule reflects the detection goal of flagging any potentially non-hand-signed report.
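The worst-case aggregation is a rank-maximum over the document's signature labels. A minimal sketch (the label strings are illustrative identifiers for the five categories above):

```python
# Sketch: v3.x worst-case document-level aggregation over the (typically two)
# certifying-CPA signature labels, using the five-way rank order from the text.
RANK = {  # higher = more replication-consistent
    "likely_hand_signed": 0,
    "uncertain": 1,
    "style_consistency": 2,
    "moderate_confidence": 3,
    "high_confidence": 4,
}

def document_label(signature_labels):
    """A document inherits its most replication-consistent signature label."""
    return max(signature_labels, key=RANK.__getitem__)

print(document_label(["likely_hand_signed", "moderate_confidence"]))
# -> moderate_confidence
```

The `max` over the rank order implements the worst-case (most-replication-consistent) rule directly, and generalises unchanged to documents carrying more than two signatures.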
**K=3 as accountant-level characterisation, not classifier.** The K=3 mixture of §III-J is reported in §IV as an accountant-level descriptive summary alongside the per-signature five-way classifier. We do not assign signature-level or document-level labels from the K=3 mixture in any v4.0 result table; the K=3 hard label is used solely for the accountant-level cluster cross-tabulation (Script 35) and the internal-consistency Spearman correlations (§III-K).
---
## Provenance table for numerical claims in §III-G through §III-L
| Claim | Value | Source | Notes |
|---|---|---|---|
| Big-4 CPA count, $n_{\text{sig}} \geq 10$ | $437$ ($171/112/102/52$) | Script 36 sample sizes; Script 38 per-firm summary | direct |
| Big-4 signature count | $150{,}442$ | Script 39 / Script 38 | direct |
| Non-Big-4 reference CPA count | $249$ | Script 38 reference population | direct |
| Big-4 K=2 marginal crossings $(0.9755, 3.755)$ | direct | Script 34; Script 36 §A | direct |
| Bootstrap 95% CI cosine $[0.9742, 0.9772]$ | direct | Script 34; Script 36 §A | $n_{\text{boot}} = 500$ |
| Bootstrap 95% CI dHash $[3.48, 3.97]$ | direct | Script 34; Script 36 §A | $n_{\text{boot}} = 500$ |
| Bootstrap CI half-width $0.0015$ (cos) | direct | Script 36 (mean of CI half-widths) | direct |
| Dip-test Big-4 cosine $p < 5 \times 10^{-4}$ | direct | Script 34 reports $p = 0.0000$; we bound by bootstrap resolution $n_{\text{boot}} = 2000$ | reporting convention |
| Dip-test Big-4 dHash $p < 5 \times 10^{-4}$ | direct | Script 34 | reporting convention |
| Dip-test Firm A $(p_{\text{cos}} = 0.992, p_{\text{dHash}} = 0.924)$ | direct | Script 32 §`firm_A` | direct |
| Dip-test `big4_non_A` $(0.998, 0.906)$ | direct | Script 32 §`big4_non_A` | direct |
| Dip-test `all_non_A` $(0.998, 0.907)$ | direct | Script 32 §`all_non_A` | direct |
| K=3 component centers / weights | $(0.9457, 9.17, 0.143)$ / $(0.9558, 6.66, 0.536)$ / $(0.9826, 2.41, 0.321)$ | Script 35 / Script 38 | direct |
| $\Delta\text{BIC}(K{=}3, K{=}2) = -3.48$ | direct | Script 34 (BIC K=2 = $-1108.45$; Script 36 reports BIC K=3 = $-1111.93$) | direct (arithmetic) |
| K=2 LOOO max cosine deviation $0.028$ | direct | Script 36 stability summary | direct |
| K=2 LOOO Firm A held-out $171/171$ replicated | direct | Script 36 fold table | direct |
| K=3 C1 component shape drift (cos $0.005$, dHash $0.96$, weight $0.025$) | direct | Script 37 stability summary | direct |
| K=3 LOOO held-out C1 absolute differences $1.8$–$12.8$ pp | direct | Script 37 held-out prediction check | direct |
| Three-score pairwise Spearman ($0.963$, $0.889$, $0.879$) | direct | Script 38 correlations | direct |
| Per-CPA / per-signature K=3 Cohen $\kappa$ ($0.662$, $0.559$, $0.870$) | direct | Script 39 kappa table | direct |
| Per-CPA / per-signature K=3 C1 center drift $0.018$ (cosine) | derived | $\lvert 0.9457 - 0.9280 \rvert$; Script 39 components | direct |
| Pixel-identity Big-4 subset $n = 262$ ($145/8/107/2$) | direct | Script 40 sample | direct |
| Positive-anchor miss rate $0\%$ on $n = 262$ (Wilson upper $1.45\%$) | direct | Script 40 results table | direct |
| Inter-CPA FAR $0.0005$ at cos $> 0.95$ (Wilson 95% $[0.0003, 0.0007]$) | direct | v3 §IV-F.1 / Table X (inherited) | inherited from v3 |
| Firm A byte-identical $145$ pixel-identical signatures in Big-4 subset | direct | Script 40 sample breakdown | direct |
| Firm A byte-identical "50 distinct partners of 180; 35 cross-year" | inherited | v3 §IV-F.1 / Script 28 / Appendix B byte-decomposition output | **inherited from v3; not regenerated in v4.0 spike scripts** |
| Big-4 K=3 per-firm C1 hard-assignment ($0\%$ / $8.9\%$ / $23.5\%$ / $11.5\%$) | direct | Script 35 firm × cluster cross-tab | direct |
---
## Cross-reference index (author working checklist; remove before submission)
- **Big-4 sub-corpus definition** (§III-G) — 437 CPAs, 150,442 signatures.
- **Reference populations** (§III-H) — Firm A as templated-end case study; non-Big-4 ($n = 249$) as reverse-anchor reference (less-replicated population).
- **Distributional characterisation** (§III-I) — Big-4 dip-test multimodality ($p < 5 \times 10^{-4}$); BD/McCrary null at Big-4 scope; mixture support.
- **K=3 components, descriptive only** (§III-J) — C1 hand-leaning, C2 mixed, C3 replicated; LOOO supports descriptive use, not operational classification.
- **Convergent internal-consistency** (§III-K) — three feature-derived scores ($\rho \geq 0.879$, not independent measurements); per-signature K=3 ($\kappa = 0.87$ vs per-CPA fit); K=2 LOOO unstable, K=3 LOOO partial; pixel-identity miss rate $0\%$ on $n = 262$.
- **Per-document classifier** (§III-L) — inherited five-way rule retained as primary; K=3 demoted to characterisation only.
## Open questions remaining for partner / reviewer
1. **Five-way rule validation against the moderate-confidence band.** §III-K's $\kappa$ evidence covers only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evidence (Tables IX, XI, XII). Is this inheritance sufficient, or should v4.0 add a Big-4-specific capture-rate analysis for the moderate band in §IV-F?
2. **Anonymisation of within-Big-4 firm contrasts.** §III-H states that Firm C is the firm most concentrated in C1 hand-leaning at $23.5\%$ (Script 35). The within-Big-4 ordering by hand-leaning concentration is informative for the §V discussion. v3.x reports under pseudonyms throughout. Confirm that we maintain pseudonyms consistently in §IV–V even when discussing the specific Firm C / Firm B / Firm D hand-leaning rates.
3. **Section IV table numbering.** Defer until §III final accepted by partner / reviewer; results numbering should mirror §III flow (sample/scope → mixture characterisation → convergent checks → LOOO → pixel-identity → signature/document classification → full-dataset robustness).