diff --git a/paper/codex_review_gpt55_v4_round1.md b/paper/codex_review_gpt55_v4_round1.md new file mode 100644 index 0000000..5fcb564 --- /dev/null +++ b/paper/codex_review_gpt55_v4_round1.md @@ -0,0 +1,143 @@ +# Paper A v4.0 Methodology Section III-G through III-L Peer Review + +Reviewer: gpt-5.5 xhigh +Date: 2026-05-12 +Round number: 21 (v4 round 1) +Review target: `paper/v4/paper_a_methodology_v4_section_iii.md` + +Audit aliases used below: + +- V4: `paper/v4/paper_a_methodology_v4_section_iii.md` +- V3: `paper/paper_a_methodology_v3.md` +- Script36: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/calibration_and_loo_validation/calibration_loo_report.md` +- Script37: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/k3_loo_check/k3_loo_report.md` +- Script38: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/convergence_k3_reverse_anchor/convergence_report.md` +- Script39: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/signature_level_convergence/sig_level_report.md` +- Script40: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/pixel_identity_far/far_report.md` +- Script34 local: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/big4_only_pooled/big4_only_pooled_report.md` +- Script35 local: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/big4_k3_cluster_inspection/inspection_report.md` +- Script32 local: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/non_firm_a_calibration/non_firm_a_calibration_report.md` + +## Verdict + +Major Revision. + +## Major Findings + +1. **K=3 is not yet justified as an operational classifier.** + + V4 selects K=3 for the operational per-CPA classifier (V4:57, V4:67) and says the K=3/K=2 contrast justifies selecting K=3 (V4:107). The underlying Script37 verdict is weaker: `P2_PARTIAL`, with the explicit interpretation that the C1 cluster exists but "membership is not well-predicted by held-out fit" (Script37:92, Script37:94). The report's own legend says `P2_PARTIAL` means the cluster is "not predictively useful as an operational classifier" (Script37:97-99). + + The numbers support this concern. K=3 C1 component shape is stable (max deviations 0.0047 cosine, 0.955 dHash, 0.023 weight; Script37:77-79), but held-out C1 membership differs from baseline by up to 12.77 percentage points (Script37:83-90). For PwC, baseline C1 is 23.5% but held-out prediction is 36.27% (Script37:47-51, Script37:87). That is not a small operational error if the label is used to classify CPAs. + + The BIC evidence is also weak. K=3 is lower BIC than K=2 by only 3.48 points (Script36:9-10; Script34 local:40-41). This is acceptable as mild descriptive support, not as the load-bearing reason to replace a classifier. The draft should either (a) demote K=3 to a descriptive/convergent-validation model, or (b) make K=3 primary only with explicit LOOO membership uncertainty and soft-posterior reporting. + +2. **The "three independent lenses" framing overstates independence and validation strength.** + + V4 describes the convergent validation as three "independent statistical lenses" (V4:73-89). They are not independent empirical measurements. All three are deterministic functions of the same per-CPA or per-signature `(cos, dHash)` features: + + - Lens 1 is K=3 posterior from the same two descriptors (V4:77; Script38:6-12). + - Lens 2 is a monotone transform of the cosine marginal only (V4:78; Script38:16-18). 
+ - Lens 3 is the fraction of signatures failing the same box rule `cos > 0.95 AND dh <= 5` (V4:79; Script38:20-22). + + The high Spearman correlations are verified (0.9627, 0.8890, 0.8794; Script38:24-34), but they are partly mechanical agreement among feature-derived scores. They do not validate the classifier against an independent ground truth for hand-signed signatures. + + There is also a conceptual reversal in the reverse-anchor prose. V4 says the non-Big-4 reference has lower cosine and higher dHash than the Big-4 C1 center (V4:37), which is verified (reference center 0.9349/9.7670 in Script38:16-18; C1 0.9457/9.1715 in Script38:8-12). But V4 then calls this a "more-replicated-population" baseline (V4:37). Lower cosine and higher dHash indicate less replication / more hand-leaning, not more replication. A reviewer will likely catch this immediately. + +3. **The draft conflates at least three classifiers and then validates only one simplified binary rule.** + + V4 alternates among (i) K=3 per-CPA hard labels (V4:67), (ii) a binary Paper A box rule `cos > 0.95 AND dh <= 5` (V4:69), and (iii) the inherited five-way per-signature/document rule with `dh <= 5`, `5 < dh <= 15`, and `dh > 15` bands (V4:123-135). The Script38/39 convergence results validate only the simplified binary rule `non_hand iff cos > 0.95 AND dh <= 5` (Script38:20-22; Script39:8-12). They do not validate the full five-way classifier, especially the moderate non-hand-signed band `5 < dh <= 15`. + + This matters because V3's inherited Section III-K explicitly treated `cos > 0.95 AND 5 < dh <= 15` as "Moderate-confidence non-hand-signed" (V3:278-287). V4 keeps that category (V4:127) but cites kappa/rho evidence from a binary high-confidence-only rule (V4:121). The current prose therefore overstates what the Script39 kappa values prove. + + Recommended fix: choose a primary endpoint. If the five-way rule remains primary, validate that exact five-way rule or its declared binary collapse. If K=3 becomes primary, provide a document-level aggregation rule for K=3 and stop calling the inherited box rule the operational classifier. + +4. **The pixel-identity validation is useful, but "FAR" is the wrong metric name and the evidentiary force is overstated.** + + Script40's ground truth is a positive class: pixel-identical signatures are treated as replicated (Script40:4-8). Misclassifying them as hand-leaning is a false negative / miss rate on an easy positive-anchor subset, not a false-alarm rate in the usual classifier sense. V4 defines FAR as "probability of labelling a pixel-identical signature as hand-leaning" (V4:109), which reverses standard terminology. + + The 0/262 result is verified for all three classifiers (Script40:12-18), and the caveat that pixel-identity is necessary but not sufficient is appropriate (V4:117; Script40:29-31). But for the Paper A box rule this result is close to tautological: byte-identical nearest-neighbor signatures will have near-maximal cosine and minimal dHash. V3 was more careful, noting that FRR against byte-identical positives is trivially zero at thresholds below 1 and should be interpreted qualitatively (V3:266-268). + + Rename this metric to "pixel-identity positive-anchor miss rate" or "false-hand rate on replicated positives." Do not present it as FAR unless a true hand-signed negative anchor is evaluated. + +5. 
**Several empirical/provenance claims need correction or explicit "unverified" status.** + + - V4 says the K=2 LOOO max cosine deviation 0.028 is `5.6x` a "bootstrap CI half-width of 0.005" (V4:103). Script36 reports max deviation 0.0278 (Script36:43), but 0.005 is the stability tolerance in the verdict legend, not the bootstrap CI half-width (Script36:50-52). The full Big-4 bootstrap cosine CI half-width is 0.0015 (Script36:14-17). Correct the denominator and wording. + + - V4 says all-non-Firm-A is dip-test unimodal at `p > 0.99` (V4:21). Script32 local reports all-non-Firm-A cosine p = 0.9975 but dHash p = 0.9065 (Script32 local:56-76). The later detailed sentence in V4 correctly gives 0.998/0.907 (V4:43). Fix the earlier overstatement. + + - V4 says no BD/McCrary transition is identified on either axis and cites Script32/34 (V4:47). Script34 local supports no Big-4-only BD/McCrary threshold (Script34 local:28-31), but Script32 local reports dHash BD/McCrary thresholds for `big4_non_A` and `all_non_A` (Script32 local:36-44, Script32 local:68-76). Narrow the claim to the Big-4-only analysis or explain why Script32 subset transitions are not used. + + - The Firm A byte-identical claim is partly verified. Script40 verifies 145 Firm A pixel-identical signatures inside the 262 Big-4 total (Script40:20-27). The added details "50 distinct Firm A partners," "of 180 registered," and "35 span different fiscal years" appear in V3 (V3:165) and V4 (V4:31), but I did not find them in the supplied Script36-40 reports. Treat those details as unverified unless the Appendix B/script artifact is cited directly. + + - The "mid/small-firm tail actively pulling the v3.x crossing" statement (V4:19) is stronger than the local Script34 evidence. Script34 local verifies the Big-4-only crossing and CI (Script34 local:18-24), and it reports a large offset from the published baseline (Script34 local:51-58). It does not, by itself, prove the causal language "actively pulling" rather than "the full-sample and Big-4-only calibrations differ." + +## Minor Findings + +1. **Dip-test p-value precision needs a resolution check.** V4 says bootstrap p-value estimation uses `n_boot = 2000` and reports `p < 10^-4` (V4:43). With a finite bootstrap of 2000, the natural resolution is about 1/2000 unless the script uses a different asymptotic/calibrated p-value. Script36/34 display p = 0.0000 (Script36:6-8; Script34 local:28-31). State the reporting convention precisely, e.g., "no bootstrap replicate exceeded the observed statistic; reported as p < 0.001" if that is what happened. + +2. **The Delta BIC sign convention is confusing.** V4 reports "Delta BIC = -3.5" (V4:65). Since lower BIC is preferred, a reviewer may expect `BIC(K=2) - BIC(K=3) = 3.48` or "K=3 lower by 3.48." Use one convention and define it. + +3. **Per-signature convergence is real but only moderate for the box rule.** Script39 verifies kappas of 0.6616, 0.5586, and 0.8701 (Script39:22-30). The report verdict is `SIG_CONVERGENCE_MODERATE`, not strong (Script39:41-48). V4's statement that box-rule disagreement reflects "different decision geometries" rather than signal disagreement (V4:99) is plausible but interpretive. Add the moderate verdict and avoid making geometry the only explanation. + +4. **Per-CPA vs per-signature component centers drift more than the prose suggests.** Script39 shows per-CPA C1 at cosine 0.9457 and per-signature C1 at 0.9280 (Script39:16-20). 
Kappa is high for K=3 perCPA vs perSig labels (Script39:28), but "the same component structure recovers" (V4:99) should be softened to "a broadly similar three-component ordering recovers." + +5. **The Section III-L title is misleading.** The section is titled "Per-Document Classification" (V4:119) but most of it defines per-signature categories (V4:121-133). The document-level aggregation appears only in one paragraph (V4:135). Either rename to "Signature- and Document-Level Classification" or split the two parts. + +6. **K=3 alternative output lacks document aggregation.** V4 says the K=3 alternative assigns each signature to C1/C2/C3 (V4:137), but if Section III-L is per-document classification, the K=3 alternative also needs a document-level worst-case or posterior aggregation rule. + +7. **Firm anonymization is inconsistent.** V4 names the four firms in Chinese and then says they are pseudonymized as Firms A-D (V4:17). Later it uses PwC directly (V4:31). V3 says firm-level results are reported under pseudonyms (V3:315-316). Decide whether v4 abandons anonymization; otherwise keep the main text pseudonymous and put the mapping outside the manuscript, if at all. + +## Editorial / Prose Nits + +1. Replace "more-replicated-population baseline" (V4:37) with "less-replicated external reference" or "hand-leaning external reference." + +2. Replace "failure rate" for Lens 3 (V4:79, V4:89) with "box-rule hand-leaning rate" or "non-replicated rate." "Failure" sounds like classifier failure rather than a hand-leaning outcome. + +3. "Strongest single methodology-validation signal" (V4:89) is too strong because the lenses share features. Use "strongest internal consistency signal." + +4. "Boundary moves modestly" (V4:105) understates the PwC fold, where C1 membership rises from 23.5% to 36.3% (Script37:47-51). Use "membership remains composition-sensitive." + +5. "Calibration uncertainty band of +/- 5-13 percentage points" (V4:105) should be "observed absolute differences of 1.8-12.8 percentage points, with the largest fold exceeding the report's 5 pp viability bar" (Script37:83-90). + +6. "Operational threshold derivation" (V4:51) is not accurate if the operational per-signature classifier remains the inherited box rule. Use "mixture model and component assignment" unless K=3 is truly primary. + +7. The cross-reference index is useful, but it should be removed from the submitted manuscript or converted into an internal author checklist. + +## Responses to the Five Open Questions + +1. **Scope justification.** + + The three-point argument is directionally good but not yet sufficient. Add a fourth point explicitly restricting generalizability: primary claims are for the Big-4 audit-report context, while the 249 non-Big-4 CPAs are used only as robustness/reverse-anchor context unless Section IV-K independently validates them. Also soften "tail distorts" to "tail changes the fitted crossing" unless you cite a direct diagnostic for distortion. The Big-4 counts and crossings are verified (Script34 local:4-24; Script36:6-17), but the causal language needs restraint. + +2. **Firm A phrasing.** + + Use "templated-end case study" or "replication-heavy descriptive reference." Do not use "calibration reference, descriptively defined post-hoc" unless Firm A actually calibrates a threshold in v4. The draft correctly says Firm A is not the calibration anchor (V4:33). Calling it a calibration reference reintroduces the v3 vulnerability. + +3. **K=3 vs K=2 rationale.** + + As written, no. 
Selecting K=3 as an operational classifier on LOOO stability is not acceptable because Script37 says K=3 is only `P2_PARTIAL` and "not predictively useful as an operational classifier" (Script37:92-99). Do not strengthen the BIC argument; Delta BIC about 3.5 is mild. The defensible claim is: K=2 is clearly unstable; K=3 gives a reproducible hand-leaning component shape; hard membership remains uncertain and should be reported as calibration uncertainty. + +4. **Hybrid box rule plus K=3 alternative.** + + The hybrid can be acceptable only if roles are sharply separated: inherited five-way box rule is the primary signature/document classifier; K=3 is an accountant-level characterization and exploratory alternative. The current draft blurs this by calling K=3 "operational" (V4:67) while keeping the box rule in Section III-L (V4:121-137). Also, the validation scripts use the binary high-confidence rule `dh <= 5`, not the full five-way rule with `dh <= 15`. Fix this before deciding whether to keep the hybrid. + +5. **Section IV numbering.** + + Do not freeze table numbers yet. First settle the Methodology labels and primary classifier. Results should mirror this order: sample/scope, K=2/K=3 calibration, convergence lenses, K=2 and K=3 LOOO, pixel-identity positive-anchor check, signature/document classification outputs, then full-dataset robustness. After that, assign table numbers and verify every Section III cross-reference to Section IV-D/F/G/K. + +## Recommended Next-Step Actions + +1. Rewrite Sections III-J and III-K so K=3 is either clearly primary with uncertainty, or clearly descriptive. If descriptive, remove "operational threshold" language from the K=3 discussion. + +2. Add the Script37 `P2_PARTIAL` result directly to the prose. Do not hide the "not predictively useful as an operational classifier" implication. + +3. Decide and declare the primary classifier: inherited five-way box rule, binary high-confidence box rule, or K=3 hard/posterior labels. Align all validation text to that exact classifier. + +4. If the five-way rule remains primary, rerun or report validation for the five-way categories and the document-level worst-case aggregation, not just `cos > 0.95 AND dh <= 5`. + +5. Rename the pixel-identity metric from FAR to positive-anchor miss rate / false-hand rate. Add a separate specificity/FAR result only if a true hand-signed or inter-CPA negative anchor is evaluated. + +6. Correct the empirical slips: K=2 "0.005 bootstrap half-width," all-non-Firm-A `p > 0.99`, Script32 BD/McCrary wording, reverse-anchor "more-replicated" phrase, and any unverified Firm A byte-decomposition details. + +7. Add a short provenance table for every numerical claim in Sections III-G through III-L, including exact report path, script number, and whether the number is directly reported or inferred by arithmetic. diff --git a/paper/v4/paper_a_methodology_v4_section_iii.md b/paper/v4/paper_a_methodology_v4_section_iii.md index 31a4ae1..1614312 100644 --- a/paper/v4/paper_a_methodology_v4_section_iii.md +++ b/paper/v4/paper_a_methodology_v4_section_iii.md @@ -1,10 +1,10 @@ -# Section III. Methodology — v4.0 Draft (Big-4 reframe) +# Section III. Methodology — v4.0 Draft v2 (post codex round-21 revision) -> **Draft note (2026-05-12).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). 
Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here. The v4.0 sub-section ordering inserts a new **§III-K Convergent Validation** between the old K (per-document classification, now §III-L) and J (validation anchors, now folded into §III-K and §III-J). Empirical anchors throughout reference Scripts 32–40 on branch `paper-a-v4-big4`. +> **Draft note (2026-05-12, v2).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here. v2 incorporates codex gpt-5.5 round-21 review (`paper/codex_review_gpt55_v4_round1.md`); the key revisions are: (i) the inherited five-way per-signature box rule is restored as the **primary operational classifier** (§III-L), (ii) the K=3 Gaussian mixture is positioned as an **accountant-level descriptive characterisation**, not an operational threshold (§III-J), (iii) the "convergent validation" framing is softened to "convergent internal-consistency checks" since the three scores share underlying features (§III-K), (iv) the pixel-identity metric is renamed from FAR to positive-anchor miss rate (§III-K), (v) five empirical/wording slips flagged by the reviewer are corrected. Empirical anchors throughout reference Scripts 32–40 on branch `paper-a-v4-big4`; a provenance table appears at the end of this section listing every numerical claim with its script and report path. ## G. Unit of Analysis and Scope -We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of all signature-level capture-rate analyses (§IV-D, §IV-F, §IV-G). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), per-CPA convergent validation (§III-K), and the leave-one-firm-out reproducibility analysis (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. +We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of all signature-level capture-rate analyses (§IV-D, §IV-F, §IV-G). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). 
The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism. @@ -14,113 +14,117 @@ We adopt one stipulation about same-CPA pair detectability: A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication. -**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, all §IV results except §IV-K) are restricted to the four largest accounting firms in Taiwan (referred to throughout as the Big-4: 勤業眾信聯合 / 安侯建業聯合 / 資誠聯合 / 安永聯合, pseudonymously Firms A through D in the rest of the paper). The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available. Restricting the analyses to Big-4 is a methodological choice driven by three considerations: +**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, all §IV results except §IV-K) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available (Script 38 verification). Restricting the analyses to Big-4 is a methodological choice driven by four considerations: -1. **Within-pool homogeneity for mixture characterisation.** Pooling Big-4 with mid- and small-firm CPAs introduces a heterogeneous tail of $\sim$250 CPAs distributed across $\sim$30 firms with idiosyncratic signing practices and small per-firm samples. Including this tail in a single 2D-GMM mixture distorts the inferred component locations and weights — empirically, the published v3.x "natural threshold" $\overline{\text{cos}} = 0.945$, $\overline{\text{dHash}} = 8.10$ shifts to $\overline{\text{cos}} = 0.975$, $\overline{\text{dHash}} = 3.76$ when restricted to Big-4 (Script 34, bootstrap 95% CI $[0.974, 0.977]$ / $[3.48, 3.97]$, $n_{\text{boot}} = 500$). The shift is large compared to either CI half-width, indicating the mid/small-firm tail was actively pulling the v3.x crossing rather than acting as inert noise. +1. **Within-pool homogeneity for mixture characterisation.** Pooling Big-4 with mid- and small-firm CPAs introduces a heterogeneous tail of $\sim$249 CPAs distributed across multiple firms with idiosyncratic signing practices and small per-firm samples. 
The full-sample and Big-4-only calibrations *differ* in their fitted marginal crossings (full-sample published $\overline{\text{cos}}^* = 0.945$, $\overline{\text{dHash}}^* = 8.10$ from v3.x; Big-4-only $\overline{\text{cos}}^* = 0.975$, $\overline{\text{dHash}}^* = 3.76$ from Script 34; bootstrap 95% CIs $[0.974, 0.977]$ / $[3.48, 3.97]$, $n_{\text{boot}} = 500$); the offset is large compared to the Big-4 bootstrap CI half-width of $0.0015$. We report this as a *scope-dependent shift* rather than asserting a causal "mid/small-firm tail distorts" claim. -2. **Statistical multimodality at the accountant level.** Within the Big-4 sub-corpus, the Hartigan dip test rejects unimodality on both axes ($p < 10^{-4}$, $n = 437$ CPAs; Script 34) for the per-CPA $\overline{\text{cos}}$ and $\overline{\text{dHash}}$ distributions. No such rejection holds within any single firm pooled alone (each firm's per-CPA distribution is dip-test unimodal at $p > 0.92$; Scripts 32, 34) or within the all-non-Firm-A pool ($p > 0.99$; Script 32). Big-4 is the smallest scope at which the dip test supports applying a finite-mixture model to the per-CPA distribution. +2. **Statistical multimodality at the accountant level.** Within the Big-4 sub-corpus, the Hartigan dip test rejects unimodality on both axes (cosine $p = 0.0000$, dHash $p = 0.0000$ in the bootstrap-2000 implementation, i.e., no bootstrap replicate exceeded the observed statistic; reported here as $p < 5 \times 10^{-4}$; Script 34). No such rejection holds within any single firm pooled alone (Firm A: $p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$; Script 32). Pooling Firm A's three Big-4 peers gives $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$ (Script 32, `big4_non_A` subset). Pooling all non-Firm-A CPAs (other Big-4 plus mid/small firms) gives $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$ (Script 32, `all_non_A` subset). Big-4 is the smallest scope at which the dip test supports applying a finite-mixture model to the per-CPA distribution. -3. **Reproducibility under leave-one-firm-out cross-validation.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 mixture fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm); the held-out firm's CPAs are then classified by the rule derived from the other three firms. No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm, far below the sample sizes required for stable mixture-fit estimation on a held-out fold. +3. **Reproducibility under leave-one-firm-out cross-validation.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 mixture fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm. -The full-dataset (686 CPAs across Big-4 plus mid/small) is reported as a robustness section in §IV-K so that v4.0's pipeline reproducibility claim is demonstrable at multiple scopes, while the primary methodology operates at the scope where its statistical assumptions (dip-test multimodality, firm-level LOOO feasibility) are met. +4. **Restricted generalizability claim.** v4.0's primary claims are scoped to the Big-4 audit-report context; we do not assert that the same mixture structure or operational thresholds extend to mid/small firms. 
The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H's reverse-anchor internal-consistency check and (b) as a robustness comparison in §IV-K. Generalisation beyond Big-4 is left as future work.

## H. Reference Populations

-v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single Firm-A-as-anchor framing.

+v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing.

-**Internal reference: Firm A as the templated end of Big-4 (case study).** Firm A is empirically the most digitally-replicated of the Big-4. In the Big-4 K=3 mixture (§III-J, Script 35), Firm A accounts for 0% of the C1 hand-leaning component, 17.5% of the C2 mixed component, and 82.5% of the C3 replicated component; the reverse pattern holds at PwC (23.5% C1, 75.5% C2, 1.0% C3). Automated byte-level pair analysis (§IV-F.1; reproduction artifact in Appendix B) identifies 145 Firm A signatures byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years. Byte-identity implies pixel-identity by construction, and independent hand-signing cannot produce pixel-identical images across distinct reports — these pairs establish image reuse as a concrete, threshold-free phenomenon within Firm A.

+**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the most digitally-replicated of the Big-4. In the Big-4 K=3 mixture (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 hand-leaning component (cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 mixed component, and 82.5% of the C3 replicated component; the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3), the Big-4 firm whose CPAs are most concentrated in C1. The byte-level pair analysis reported in v3.x §IV-F.1 identifies 145 Firm A pixel-identical signatures at the signature level (Script 40 verifies the 145/262 split among Big-4 pixel-identical signatures); the additional details that v3.x attributes to this analysis (50 distinct Firm A partners of 180 registered; 35 byte-identical matches spanning different fiscal years) are inherited from the Script 28 / Appendix B byte-decomposition output and were not regenerated in the v4.0 spike scripts. We retain those v3.x details by reference and mark them in the provenance table as "inherited from v3 §IV-F.1 / Script 28."

-In v4.0, Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with KPMG, PwC, and EY; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone.

+In v4.0, Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone.
Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level ground truth that anchors §III-K's pixel-identity positive-anchor miss rate.

-**External reference: non-Big-4 as the reverse-anchor reference for cross-validation.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor convergent-validation analysis: each Big-4 CPA's deviation from the reference centre, measured as the marginal cosine left-tail percentile under the reference, is one of three independent statistical lenses that v4.0 uses to validate its hand-leaning ranking. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its sole role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be cross-checked.

+**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores that v4.0 uses as a cross-check on the inherited per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked.

-The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$. This sits below the Big-4 K=3 hand-leaning component centre ($\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$), confirming that the non-Big-4 reference population is itself in the lower-cosine / higher-dHash regime relative to Big-4 — appropriate for use as a "more-replicated-population" baseline against which Big-4 hand-leaning CPAs deviate by being further into the left tail of cosine.

+The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 hand-leaning component (cos $= 0.946$, dHash $= 9.17$; §III-J); compared to the Big-4 templated component (cos $= 0.983$, dHash $= 2.41$; §III-J) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped ($-F_{\text{ref}}(\overline{\text{cos}}_a)$, where $F_{\text{ref}}$ is the reference marginal cosine CDF) so that a higher score corresponds to a lower reference percentile, i.e., to a more hand-leaning Big-4 CPA.
This is a "deviation in the hand-leaning direction" measure, not a "deviation toward replication" measure; the reference is the less-replicated population. ## I. Distributional Characterisation at the Accountant Level -This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G. Three diagnostic procedures are applied: a univariate unimodality test on each marginal axis, a 2D Gaussian mixture fit (developed further in §III-J), and a density-smoothness diagnostic. These procedures play decreasing-strength roles in v4.0: the dip test directly motivates whether a mixture model is statistically warranted; the mixture is the operational generative model whose fit is reported in §III-J; the density-smoothness diagnostic complements the mixture with a non-parametric check on local discontinuity that is independent of the mixture's parametric form. +This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G. Three diagnostic procedures are applied: a univariate unimodality test on each marginal axis, a 2D Gaussian mixture fit (developed in §III-J), and a density-smoothness diagnostic. -**1. Hartigan dip test on each marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). Both marginals reject unimodality with $p < 10^{-4}$ (Script 34). For comparison, no rejection of unimodality holds at any narrower scope: Firm A alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); KPMG + PwC + EY pooled ($p_{\text{cos}} = 0.999$, $p_{\text{dHash}} = 0.906$, $n = 266$); all-non-Firm-A pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). The dip-test multimodality at the Big-4 level is the empirical justification for fitting a finite-mixture model in §III-J; without it, the mixture would be a forced fit on an essentially unimodal distribution. +**1. Hartigan dip test on each marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds at any narrower scope or in the non-Big-4 population (Script 32): Firm A alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). The dip-test multimodality at the Big-4 level is the empirical justification for fitting a finite-mixture model in §III-J; without it, the mixture would be a forced fit on an essentially unimodal distribution. -**2. 
Mixture-model evidence.** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42) fit to the joint $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ respectively over $n_{\text{boot}} = 500$ resamples (Script 34). A 3-component fit (§III-J) is BIC-preferred ($-1111.93$ vs $-1108.45$), and the K=3 component structure carries different operational implications; we develop both fits in §III-J and discuss the choice between them in §III-K. +**2. Mixture-model evidence.** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42; Script 34) recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ over $n_{\text{boot}} = 500$ resamples. The 3-component fit (§III-J) is BIC-preferred — using the convention that lower BIC is preferred, $\text{BIC}(K{=}3) - \text{BIC}(K{=}2) = -3.48$ (Script 36). The $\Delta$BIC magnitude is small in absolute terms; we do not treat $\Delta\text{BIC} = 3.5$ alone as decisive evidence for K=3, and the operational role of each fit is developed in §III-J and §III-K. -**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38, 39] as a *density-smoothness diagnostic* rather than as a third threshold estimator, on each marginal axis (cosine in bins of $0.002$, dHash in integer bins). The diagnostic identifies no significant transition on either axis at $\alpha = 0.05$ (Script 32 / Script 34). This null result is consistent with the mixture-model evidence: the K=3 components overlap rather than separate sharply, so a local-discontinuity test does not flag a transition. We retain BD/McCrary in v4.0 as a non-parametric robustness check — its null result indicates that no boundary location chosen by the mixture model is corroborated by sharp local discontinuity, which we read as an honest description of the data's continuous-overlap structure rather than a methodological deficiency. +**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38, 39] as a *density-smoothness diagnostic* (rather than as a threshold estimator) on each marginal axis (cosine in bins of $0.002$, dHash in integer bins). At the Big-4 scope, the diagnostic identifies no significant transition on either marginal at $\alpha = 0.05$ (Script 34). Outside Big-4, the diagnostic does flag dHash transitions in some subsets (Script 32: `big4_non_A` dHash transition at $10.8$; `all_non_A` dHash transition at $6.6$; pre-2018 and post-2020 time-stratified variants also exhibit one or more dHash transitions), but no cosine transition is identified in any subset. The Big-4-scope null on both axes is consistent with the mixture-model evidence: the K=3 components overlap rather than separate sharply, so a local-discontinuity test does not flag a transition. 
We retain BD/McCrary in v4.0 as a non-parametric robustness diagnostic; the dHash transitions outside Big-4 are not used as operational thresholds because they are scope-dependent and lie within rather than between modes of the corresponding density. -The operational threshold derivation (§III-J) and the convergent validation (§III-K) both operate on the K=3 fit. The K=2 fit and its marginal crossings are reported here for completeness and to enable cross-paper-version comparison with v3.x, but are not used as v4.0's operational rule. +## J. Mixture Model and Accountant-Level Characterisation -## J. Mixture Model and Operational Threshold Derivation +This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive characterisations of the joint Big-4 distribution; the operational per-signature classifier remains the inherited five-way box rule of §III-L.** Neither mixture is used to assign signature-level or document-level labels in the v4.0 primary analysis. -We fit 2- and 3-component 2D Gaussian Mixture Models (full covariance, $n_{\text{init}} = 15$, $\text{max\_iter} = 500$, fixed seed 42) to the joint $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution of the 437 Big-4 CPAs. We report both fits and develop the operational threshold from the K=3 fit, with the K=2 fit retained as a reference and the choice of K=3 over K=2 justified in §III-K. +**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ ("hand-leaning"), weight $0.689$, and $(0.983, 2.41)$ ("replicated"), weight $0.311$ (Script 34). $\text{BIC}(K{=}2) = -1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. -**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$ ("hand-leaning"), and $(0.983, 2.41)$, weight $0.311$ ("replicated"). BIC = $-1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. +**K=3 fit.** Three components, sorted by ascending cosine mean (Script 35; Script 38 reproduces): -**K=3 fit (selected for v4.0).** Three components, sorted by ascending cosine mean: - -| Component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | interpretation | +| Component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive label | |---|---|---|---|---| -| C1 hand-leaning | 0.9457 | 9.17 | 0.143 | low-cosine, high-dHash; cross-firm membership | -| C2 mixed | 0.9558 | 6.66 | 0.536 | intermediate; spans all four Big-4 firms | -| C3 replicated | 0.9826 | 2.41 | 0.321 | high-cosine, low-dHash; dominated by Firm A | +| C1 | 0.9457 | 9.17 | 0.143 | hand-leaning | +| C2 | 0.9558 | 6.66 | 0.536 | mixed | +| C3 | 0.9826 | 2.41 | 0.321 | replicated | -BIC = $-1111.93$, slightly preferred over K=2 ($\Delta\text{BIC} = -3.5$). The K=3 components carry direct interpretations from the joint Big-4 distribution: C3 is the templated mode dominated by Firm A (82.5% of Firm A CPAs assigned to C3 by hard-posterior assignment; Script 35); C1 is the hand-leaning mode with cross-firm membership (24 PwC, 10 KPMG, 6 EY, 0 Firm A — Script 35); C2 is the intermediate mode that contains the majority of all four Big-4 firms' CPAs. +$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support for K=3 under standard BIC interpretation but not by itself decisive). 
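For concreteness, the fit-and-compare step described above reduces to a few `scikit-learn` calls. The following is a minimal sketch, assuming a hypothetical array `X` of shape `(437, 2)` holding the per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ means; it is an illustrative reconstruction, not the Script 34/35 source:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_and_compare(X: np.ndarray):
    """Fit K=2 and K=3 GMMs with the settings stated in the text
    and compare BIC. X: (n_cpas, 2) per-CPA [cos mean, dHash mean]."""
    fits = {
        k: GaussianMixture(n_components=k, covariance_type="full",
                           n_init=15, max_iter=500, random_state=42).fit(X)
        for k in (2, 3)
    }
    # scikit-learn's BIC is lower-is-better; the text reports
    # BIC(K=3) - BIC(K=2) = -3.48 on the 437-CPA Big-4 sample.
    delta_bic = fits[3].bic(X) - fits[2].bic(X)
    # Relabel the K=3 components C1..C3 by ascending cosine mean.
    order = np.argsort(fits[3].means_[:, 0])
    components = list(zip(fits[3].means_[order], fits[3].weights_[order]))
    return fits, delta_bic, components
```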
-**Operational threshold.** v4.0's per-CPA classifier is the K=3 hard-posterior assignment: a CPA $a$ is assigned to component $C_k$ that maximises $P(C_k | (\overline{\text{cos}}_a, \overline{\text{dHash}}_a))$ under the K=3 fit. CPAs assigned to C3 are labelled *templated*; CPAs assigned to C1 or C2 are labelled *non-templated*, with C1 carrying the additional designation *hand-leaning* and C2 the designation *mixed*. The per-CPA classifier is an honest description of the joint Big-4 distribution rather than a discovered "natural threshold" — the dip-test rejection of unimodality (§III-I) supports treating the data as multi-mode, but the components overlap in their tails (BD/McCrary null on the marginal density), so the classifier's hard boundary inherits a calibration uncertainty quantified in §III-K. +**Why we report both K=2 and K=3.** Leave-one-firm-out cross-validation (§III-K) shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The max fold-to-fold deviation of the cosine crossing is $0.028$, which is $5.6\times$ the report's $0.005$ across-fold *stability tolerance* (not bootstrap CI; the full-Big-4 bootstrap cosine half-width is the much smaller $0.0015$). K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.025$ (Script 37). K=3 hard-posterior membership for the held-out firm is more composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL`, with the explicit interpretation that "the C1 cluster exists but membership is not well-predicted by the held-out fit" and "K=3 is not predictively useful as an operational classifier" (Script 37 verdict legend). -The signature-level operational rule is inherited from v3.x for continuity and is reported in §III-L: a signature is classified as *non-hand-signed* iff $\text{cos} > 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. The agreement between this signature-level box rule and the per-CPA K=3 hard label is reported in §III-K (Cohen $\kappa = 0.66$ binary collapse), and is one of three convergent-validation signals supporting v4.0's overall methodology. +We take the joint K=2/K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites a v4.0 operational classifier: -## K. Convergent Validation +- The Big-4 K=2 marginal crossing $(0.975, 3.76)$ is essentially a firm-mass separator between Firm A and Firms B + C + D, not a within-Big-4 mechanism boundary. +- The Big-4 K=3 mixture exhibits a reproducible three-component shape across LOOO folds; a hand-leaning component (C1) exists at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$ with weight $\approx 0.14$. +- Hard-posterior membership in C1 is composition-sensitive (max absolute deviation $12.8$ pp across LOOO folds, exceeding the report's $5$ pp viability bar); K=3 is therefore not used to assign operational hand-leaning labels to CPAs in v4.0. 
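A procedural sketch of one leave-one-firm-out fold, under the same assumptions as the sketch above (hypothetical `X` per-CPA feature array and a parallel `firms` label array; this mirrors the fold procedure described in the text, not the Script 36/37 source):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def held_out_c1_rate(X: np.ndarray, firms: np.ndarray, held_out: str) -> float:
    """Refit K=3 on the other three firms, then report the held-out
    firm's hard-posterior C1 membership rate (one LOOO fold)."""
    train = X[firms != held_out]
    test = X[firms == held_out]
    gmm = GaussianMixture(n_components=3, covariance_type="full",
                          n_init=15, max_iter=500, random_state=42).fit(train)
    c1 = int(np.argsort(gmm.means_[:, 0])[0])  # lowest-cosine component = C1
    # The deviation of this rate from the full-Big-4 baseline is the
    # 1.8-12.8 pp composition-sensitivity band quoted above.
    return float(np.mean(gmm.predict(test) == c1))
```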
-v4.0 validates its mixture-derived per-CPA classifier through three independent statistical lenses, supplemented by leave-one-firm-out cross-validation and a hard ground-truth false-alarm-rate check. +The operational signature-level classifier remains the inherited five-way box rule of §III-L, calibrated as in v3.x. Cross-checks between the inherited rule and the K=3 mixture appear in §III-K. -**1. Three convergent lenses (Script 38).** For each Big-4 CPA we compute three scalars: +## K. Convergent Internal-Consistency Checks -- **Lens 1 (mixture posterior):** $P(\text{C1}_{\text{hand-leaning}})$ from the K=3 fit of §III-J. -- **Lens 2 (reverse-anchor directional):** $-F_{\text{ref}}(\overline{\text{cos}}_a)$, where $F_{\text{ref}}$ is the marginal cosine CDF of the non-Big-4 reference Gaussian from §III-H. Higher score corresponds to deeper in the left tail of the reference cosine distribution, i.e., more deviated in the hand-leaning direction relative to a strictly out-of-target population. -- **Lens 3 (Paper A operational rule):** the per-CPA failure rate of the inherited signature-level rule cos $> 0.95$ AND dHash$_{\text{indep}} \leq 5$. +The mixture characterisation of §III-J is supported by three feature-derived per-CPA scores and a hard-ground-truth subset analysis. We caution at the outset that the three scores are **not statistically independent measurements** — all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ — so their high pairwise rank correlations are partly a mechanical consequence of shared inputs. The checks below therefore document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide). -Pairwise Spearman rank correlations among the three lenses, $n = 437$: +**1. Three feature-derived per-CPA scores (Script 38).** For each Big-4 CPA we compute: + +- **Score 1 (mixture posterior):** $P(\text{C1}_{\text{hand-leaning}})$ from the K=3 fit of §III-J — a function of both descriptor means. +- **Score 2 (reverse-anchor cosine percentile):** the marginal cosine CDF value of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H, sign-flipped so that lower percentile (further into the reference's left tail) corresponds to a more hand-leaning Big-4 CPA. This is a function of $\overline{\text{cos}}_a$ alone. +- **Score 3 (Paper A box-rule hand-leaning rate):** the per-CPA fraction of signatures that do **not** satisfy the inherited binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$). This is a per-signature-aggregated function of the same descriptors. + +Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs (Script 38): | Pair | Spearman $\rho$ | $p$-value | |---|---|---| -| Lens 1 vs Lens 3 | $+0.963$ | $< 10^{-248}$ | -| Lens 2 vs Lens 3 | $+0.889$ | $< 10^{-149}$ | -| Lens 1 vs Lens 2 | $+0.879$ | $< 10^{-142}$ | +| Score 1 vs Score 3 | $+0.963$ | $< 10^{-248}$ | +| Score 2 vs Score 3 | $+0.889$ | $< 10^{-149}$ | +| Score 1 vs Score 2 | $+0.879$ | $< 10^{-142}$ | -All three pairwise correlations exceed $0.7$, and the three lenses agree on the per-firm hand-leaning ranking: Firm A < KPMG < EY $\approx$ PwC by mean P(C1), mean reverse-anchor score, and mean per-CPA hand-failure rate. 
Convergence on a per-CPA ranking by three approaches that draw on different statistical machinery — finite-mixture posterior, deviation from a strictly out-of-target reference, and a published box rule — is the strongest single methodology-validation signal in v4.0. +We read this as the strongest internal-consistency signal in v4.0: three different summarisations of the same descriptor pair agree on the per-CPA hand-leaning ranking with $\rho > 0.87$. The three scores agree on the per-firm ordering: by all three scores, Firm A is the most replication-dominated and Firm C is the most hand-leaning of the Big-4 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the inherited box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs. -**2. Per-signature convergence (Script 39).** The per-CPA convergence above could in principle reflect averaging across heterogeneous within-CPA signatures rather than coherent within-CPA behaviour. We address this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points — and comparing labels: +**2. Per-signature consistency (Script 39).** Per-CPA aggregation could in principle reflect averaging across within-CPA heterogeneity rather than coherent within-CPA behaviour. We test this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points (Script 39) — and comparing labels. The per-CPA and per-signature K=3 fits recover a broadly similar three-component ordering; per-CPA C1 is at $\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$ vs per-signature C1 at $\overline{\text{cos}} = 0.928$, $\overline{\text{dHash}} = 9.75$ (an absolute cosine drift of $0.018$). Cohen $\kappa$ on the binary collapse (replicated vs not-replicated): -| Pair | Cohen $\kappa$ (binary collapse, 1 = replicated) | +| Pair | Cohen $\kappa$ | |---|---| -| Paper A box rule vs K=3 per-CPA hard label | $0.662$ | -| Paper A box rule vs K=3 per-signature hard label | $0.559$ | -| K=3 per-CPA vs K=3 per-signature hard label | $0.870$ | +| Paper A binary high-confidence box rule vs per-CPA K=3 hard label | $0.662$ | +| Paper A binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ | +| Per-CPA K=3 vs per-signature K=3 | $0.870$ | -The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 labels indicates that per-CPA aggregation does not materially distort the K=3 classifier; the same component structure recovers when fit at signature granularity. The moderate $\kappa = 0.56$–$0.66$ between Paper A's box rule and either K=3 fit reflects different decision geometries — a rectangular box vs a Gaussian-mixture posterior boundary — rather than a fundamental signal disagreement. +The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. 
The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the inherited five-way rule. We note that this comparison validates only the binary high-confidence rule ($\text{cos} > 0.95$ AND $\text{dHash} \leq 5$); §III-K does not directly validate the five-way rule's $5 < \text{dHash} \leq 15$ moderate-confidence band, which retains its v3.x interpretation and capture-rate evaluation (§IV-F). A sketch of these agreement computations appears after item 3 below.

-**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** We test cross-firm reproducibility of the mixture fit by holding out each Big-4 firm in turn, refitting the GMM on the other three firms' CPAs, and reporting the held-out firm's classifier behaviour under the fold-derived rule. We report both K=2 and K=3 LOOO results:

+**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference:

-- *K=2 LOOO is unstable.* Across the four folds the marginal cosine crossing varies between 0.938 (Firm-A held out) and 0.976 (EY held out), a max deviation of $0.028$, $5.6\times$ the bootstrap CI half-width of $0.005$. When Firm A is held out, the fold rule classifies 100% of held-out Firm A as *replicated*; when any non-Firm-A firm is held out, the fold rule classifies 0% of the held-out firm as *replicated*. This pattern indicates that the K=2 boundary is essentially a *firm-mass separator* between Firm A and the other three Big-4, not a *mechanism boundary* between hand-leaning and replicated CPAs.

+- *K=2 LOOO is unstable.* Max fold-to-fold cosine crossing deviation $= 0.028$ across the four folds, against the report's $0.005$ stability tolerance (Script 36). When Firm A is held out, the fold rule classifies $171/171$ of held-out Firm A CPAs as templated; when any non-Firm-A Big-4 firm is held out, the fold rule classifies $0$ of the held-out firm's CPAs as templated. This pattern indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.

-- *K=3 LOOO is partially stable.* The C1 hand-leaning component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.025$ in mixture weight (Script 37). The held-out firm's C1 membership rate varies more substantially (e.g., $36.3\%$ for held-out PwC vs $23.5\%$ baseline; $4.7\%$ for held-out Firm A vs $0\%$ baseline), reflecting that the fold's decision boundary moves modestly when training composition changes. We document the held-out membership rates as a *calibration uncertainty band* of $\pm 5$–$13$ percentage points around the baseline rate, rather than as a defect of the K=3 fit.

+- *K=3 LOOO is partially stable.* The C1 hand-leaning component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.025$ in mixture weight (Script 37). Hard-posterior membership remains composition-sensitive — observed absolute differences are $1.8$–$12.8$ pp across the four folds, with the Firm C fold exceeding the report's $5$ pp viability bar; the report's own verdict is `P2_PARTIAL` ("K=3 is not predictively useful as an operational classifier"). We accordingly do not use K=3 hard-posterior membership as an operational label.
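As referenced in item 2 above, the agreement statistics in items 1–3 reduce to standard library calls. A minimal sketch, assuming hypothetical precomputed per-CPA score vectors and per-signature binary label arrays (not the Script 38/39 source):

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

def pairwise_spearman(scores: dict) -> dict:
    """Pairwise Spearman rho and p among named per-CPA score vectors (item 1)."""
    return {(a, b): spearmanr(scores[a], scores[b])
            for a, b in combinations(sorted(scores), 2)}

def binary_collapse_kappa(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
    """Cohen kappa on the binary collapse (1 = replicated), as in item 2."""
    return float(cohen_kappa_score(labels_a, labels_b))
```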
-The contrast — K=2 unstable, K=3 stable in component shape with bounded membership uncertainty — is the empirical justification for selecting K=3 over K=2 as v4.0's operational mixture (referenced in §III-J).
+**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one hard ground-truth subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
-**4. Pixel-identity false-alarm rate (Script 40).** The only hard ground truth available in the corpus is the set of Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation ($n = 262$: 145 Firm A, 8 KPMG, 107 PwC, 2 EY). Byte-identical pairs cannot arise from independent hand-signing and are taken as ground-truth *replicated*. We report each classifier's false-alarm rate (probability of labelling a pixel-identical signature as hand-leaning):
+We report each candidate classifier's *positive-anchor miss rate* — the fraction of byte-identical signatures misclassified as hand-leaning. This is a one-sided check against a conservative positive subset, **not a false-alarm rate in the usual two-class sense**; we do not report a paired false-alarm rate because no signature-level hand-signed ground truth exists. The corresponding signature-level false-alarm evidence comes from the v3.x inter-CPA negative anchor (§III-J inherited; Table X), which retains its v3.x interpretation. Per-classifier miss rates:
-| Classifier | Misclassified | FAR | Wilson 95% CI |
-|---|---|---|---|
-| Paper A box rule | 0 / 262 | $0\%$ | $[0\%, 1.45\%]$ |
-| K=3 per-CPA hard label (C3 = replicated) | 0 / 262 | $0\%$ | $[0\%, 1.45\%]$ |
-| Reverse-anchor score (prevalence-calibrated cut) | 0 / 262 | $0\%$ | $[0\%, 1.45\%]$ |
+| Candidate classifier | Pixel-identity miss rate | Wilson 95% CI |
+|---|---|---|
+| Inherited Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0/262 = 0\%$ | $[0\%, 1.45\%]$ |
+| K=3 per-CPA hard label (C3 = replicated; descriptive only) | $0/262 = 0\%$ | $[0\%, 1.45\%]$ |
+| Reverse-anchor with prevalence-calibrated cut | $0/262 = 0\%$ | $[0\%, 1.45\%]$ |
-All three v4.0 classifiers achieve perfect detection on the pixel-identity ground-truth subset, with a Wilson upper bound on FAR of $1.45\%$. We acknowledge two caveats. First, pixel-identity is a *conservative subset* of true replication — only the byte-equal extreme — so a low FAR against this subset is necessary but not sufficient evidence of correct replication detection across the full non-hand-signed class. Second, no signature-level ground truth exists for the hand-leaning class, so the reverse-anchor classifier's cut is chosen by *prevalence calibration* against Paper A's overall replicated rate; this is documented as a v4.0 limitation in §V-G and would benefit from human-rated validation in a future revision.
+All three candidate scores correctly assign every byte-identical signature to the replicated class.
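+
+The quoted Wilson bound is reproducible directly; a minimal check (using statsmodels' `proportion_confint`, which is our choice of implementation, not necessarily Script 40's):
+
+```python
+from statsmodels.stats.proportion import proportion_confint
+
+# 0 misses out of 262 byte-identical positives -> Wilson 95% interval for the miss rate.
+low, high = proportion_confint(count=0, nobs=262, alpha=0.05, method="wilson")
+print(f"miss rate 0/262, Wilson 95% CI: [{low:.2%}, {high:.2%}]")  # [0.00%, 1.45%]
+```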
We caution that for the inherited box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them; v3.x discussed this conservative-subset caveat at length (v3 §III-J item 1, §V-F). The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* it would be disqualified), not a sufficient validation of behaviour on the non-byte-identical replicated population. The reverse-anchor cut here is chosen by prevalence calibration against the inherited box rule's overall replicated rate ($49.58\%$ of Big-4 signatures; Script 40); this is a documented v4.0 limitation, since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
-## L. Per-Document Classification
+## L. Signature- and Document-Level Classification
-The per-signature classifier inherits its operational thresholds from v3.x: cos $> 0.95$ for the cosine dimension and $\text{dHash}_{\text{indep}} \leq 5$ / $> 15$ for the structural dimension. v4.0 retains these cuts unchanged for two reasons: (a) per-signature classification is an operational inheritance from prior literature with established interpretation, and (b) §III-K reports that the inherited rule agrees with the v4.0 K=3 per-CPA classifier at $\rho = 0.963$ per-CPA / $\kappa = 0.66$ per-signature, supporting continued use without recalibration. We document the K=3 hard label as an *alternative output* available alongside the inherited rule for users who prefer a posterior-based assignment.
+The v4.0 operational classifier is the inherited v3.x five-way per-signature box rule, retained unchanged for two reasons: (a) it preserves continuity with the v3.x literature and its established interpretation, and (b) the convergent internal-consistency checks of §III-K show that the box rule's per-CPA-aggregated outputs agree at $\rho = 0.963$ with a mixture-derived score and at $\rho \geq 0.879$ with a reverse-anchor score, supporting continued use without recalibration.
-We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:
+**Per-signature five-way classifier.** Operational thresholds are anchored to whole-sample Firm A percentile heuristics, as in v3.x: cos $> 0.95$ (Firm A P7.5) for the cosine dimension and dHash$_{\text{indep}} \leq 5$ / $> 15$ for the structural dimension. All dHash references refer to the *independent-minimum* dHash defined in §III-G. We assign each signature to one of five signature-level categories using convergent evidence from both descriptors (a sketch of the assignment follows the list):
1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence.
@@ -132,46 +136,61 @@ We assign each signature to one of five signature-level categories using converg
5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.
-Because each audit report typically carries two certifying-CPA signatures, we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label among the two signatures (rank order: High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed). This rule reflects the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.
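+
+An illustrative sketch of the five-way assignment (only the bands restated in this section are implemented; bands 3–4 are collapsed into a placeholder because their exact v3.x definitions are not restated here):
+
+```python
+def classify_signature(cos: float, dhash_indep: float) -> str:
+    """Five-way per-signature label; thresholds from §III-L (0.95, 5, 15, 0.837)."""
+    if cos > 0.95 and dhash_indep <= 5:
+        return "1: high-confidence non-hand-signed"
+    if cos > 0.95 and 5 < dhash_indep <= 15:
+        return "2: moderate-confidence non-hand-signed"
+    if cos < 0.837:  # all-pairs KDE crossover
+        return "5: likely hand-signed"
+    return "3-4: style-consistency / uncertain (v3.x definitions, elided here)"
+
+print(classify_signature(0.97, 3.0))   # band 1
+print(classify_signature(0.97, 9.0))   # band 2
+print(classify_signature(0.80, 20.0))  # band 5
+```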
+The calibration rationale for these thresholds is inherited from v3.x §III-K and retains its v3.x calibration and capture-rate evidence (v3.x Tables IX, XI, XII, XII-B): cosine 0.95 is an operating point chosen for the capture-vs-FAR tradeoff against the inter-CPA negative anchor (inter-CPA FAR $0.0005$ at the operating point, Wilson 95% CI $[0.0003, 0.0007]$); cosine 0.837 is the all-pairs KDE crossover; and dHash 5 and 15 are the upper tail of the high-similarity mode and the structural-similarity ceiling, respectively.
-The K=3 alternative output assigns each signature directly to {C1 hand-leaning, C2 mixed, C3 replicated} via the per-signature K=3 fit of §III-K. Cross-tabulation between the five-way box-rule classifier and the K=3 hard label is reported in §IV-G (cross-tab table; Script 39 output).
+**Document-level aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.x worst-case rule: the document inherits the *most-replication-consistent* signature label among the two signatures (rank order: High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed). The aggregation rule reflects the detection goal of flagging any potentially non-hand-signed report; a sketch appears below.
+
+**K=3 as accountant-level characterisation, not classifier.** The K=3 mixture of §III-J is reported in §IV as an accountant-level descriptive summary alongside the per-signature five-way classifier. We do not assign signature-level or document-level labels from the K=3 mixture in any v4.0 result table; the K=3 hard label is used solely for the accountant-level cluster cross-tabulation (Script 35) and the internal-consistency Spearman correlations (§III-K).
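+
+A minimal sketch of the worst-case aggregation (label strings and the list-of-labels input shape are illustrative, not the production schema):
+
+```python
+# Lower rank = more replication-consistent; the document takes the minimum rank.
+RANK = {
+    "high-confidence non-hand-signed": 0,
+    "moderate-confidence non-hand-signed": 1,
+    "style-consistency": 2,
+    "uncertain": 3,
+    "likely hand-signed": 4,
+}
+
+def document_label(signature_labels: list[str]) -> str:
+    """Worst-case rule: the most-replication-consistent signature label wins."""
+    return min(signature_labels, key=RANK.__getitem__)
+
+# One replication-leaning signature is enough to flag the report:
+print(document_label(["likely hand-signed", "moderate-confidence non-hand-signed"]))
+```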
---
-## Cross-reference index
+## Provenance table for numerical claims in §III-G through §III-L
+
+| Claim | Value | Source | Notes |
+|---|---|---|---|
+| Big-4 CPA count, $n_{\text{sig}} \geq 10$ | $437$ ($171/112/102/52$) | Script 36 sample sizes; Script 38 per-firm summary | direct |
+| Big-4 signature count | $150{,}442$ | Script 39 / Script 38 | direct |
+| Non-Big-4 reference CPA count | $249$ | Script 38 reference population | direct |
+| Big-4 K=2 marginal crossings | $(0.9755, 3.755)$ | Script 34; Script 36 §A | direct |
+| Bootstrap 95% CI, cosine crossing | $[0.9742, 0.9772]$ | Script 34; Script 36 §A | $n_{\text{boot}} = 500$ |
+| Bootstrap 95% CI, dHash crossing | $[3.48, 3.97]$ | Script 34; Script 36 §A | $n_{\text{boot}} = 500$ |
+| Bootstrap CI half-width, cosine | $0.0015$ | Script 36 (mean of CI half-widths) | direct |
+| Dip-test, Big-4 cosine | $p < 5 \times 10^{-4}$ | Script 34 reports $p = 0.0000$; we bound by bootstrap resolution $n_{\text{boot}} = 2000$ | reporting convention |
+| Dip-test, Big-4 dHash | $p < 5 \times 10^{-4}$ | Script 34 | reporting convention |
+| Dip-test, Firm A | $(p_{\text{cos}} = 0.992, p_{\text{dHash}} = 0.924)$ | Script 32 §`firm_A` | direct |
+| Dip-test, `big4_non_A` | $(0.998, 0.906)$ | Script 32 §`big4_non_A` | direct |
+| Dip-test, `all_non_A` | $(0.998, 0.907)$ | Script 32 §`all_non_A` | direct |
+| K=3 component centers / weights | $(0.9457, 9.17, 0.143)$ / $(0.9558, 6.66, 0.536)$ / $(0.9826, 2.41, 0.321)$ | Script 35 / Script 38 | direct |
+| $\Delta\text{BIC}(K{=}3, K{=}2)$ | $-3.48$ | Script 34 (BIC K=2 $= -1108.45$); Script 36 (BIC K=3 $= -1111.93$) | derived (arithmetic) |
+| K=2 LOOO max cosine deviation | $0.028$ | Script 36 stability summary | direct |
+| K=2 LOOO, Firm A held out, classified replicated | $171/171$ | Script 36 fold table | direct |
+| K=3 C1 component shape drift | cos $0.005$ / dHash $0.96$ / weight $0.025$ | Script 37 stability summary | direct |
+| K=3 LOOO held-out C1 absolute differences | $1.8$–$12.8$ pp | Script 37 held-out prediction check | direct |
+| Three-score pairwise Spearman | $0.963$, $0.889$, $0.879$ | Script 38 correlations | direct |
+| Cohen $\kappa$ (box vs per-CPA K=3; box vs per-signature K=3; per-CPA vs per-signature K=3) | $0.662$ / $0.559$ / $0.870$ | Script 39 kappa table | direct |
+| Per-CPA vs per-signature K=3 C1 cosine center drift | $0.018$ | Script 39 components | derived: $\lvert 0.9457 - 0.9280 \rvert$ |
+| Pixel-identity Big-4 subset | $n = 262$ ($145/8/107/2$) | Script 40 sample | direct |
+| Positive-anchor miss rate | $0\%$ on $n = 262$ (Wilson upper $1.45\%$) | Script 40 results table | direct |
+| Inter-CPA FAR at cos $> 0.95$ | $0.0005$ (Wilson 95% $[0.0003, 0.0007]$) | v3 §IV-F.1 / Table X | inherited from v3 |
+| Firm A byte-identical signatures in Big-4 subset | $145$ | Script 40 sample breakdown | direct |
+| Firm A byte-identical partner coverage | 50 distinct partners of 180; 35 cross-year | v3 §IV-F.1 / Script 28 / Appendix B byte-decomposition output | **inherited from v3; not regenerated in v4.0 spike scripts** |
+| Big-4 K=3 per-firm C1 hard-assignment, Firms A–D | $0\%$ / $8.9\%$ / $23.5\%$ / $11.5\%$ | Script 35 firm × cluster cross-tab | direct |
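+
+The rows marked "derived" can be re-checked mechanically; a small illustration (constants copied from the table above; the Wilson upper bound at $0/262$ uses the closed form $z^2/(n+z^2)$ for zero successes):
+
+```python
+bic_k2, bic_k3 = -1108.45, -1111.93
+assert round(bic_k3 - bic_k2, 2) == -3.48        # Delta-BIC(K=3, K=2)
+
+assert round(abs(0.9457 - 0.9280), 3) == 0.018   # C1 cosine center drift
+
+z, n = 1.959964, 262                             # z for 95%, anchor size
+print(f"Wilson upper bound at 0/{n}: {z * z / (n + z * z):.2%}")  # 1.45%
+```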
+
+---
+
+## Cross-reference index (author working checklist; remove before submission)
+
- **Big-4 sub-corpus definition** (§III-G) — 437 CPAs, 150,442 signatures.
-- **Reference populations** (§III-H) — Firm A as templated-end case study; non-Big-4 ($n = 249$) as reverse-anchor reference.
-- **Distributional characterisation** (§III-I) — dip-test multimodality; BD/McCrary null; mixture support.
-- **K=3 components** (§III-J) — C1 hand-leaning $(0.946, 9.17, w = 0.143)$; C2 mixed $(0.956, 6.66, w = 0.536)$; C3 replicated $(0.983, 2.41, w = 0.321)$.
-- **Convergent validation** (§III-K) — three lenses ($\rho \geq 0.879$); $\kappa \geq 0.56$ at signature level; LOOO K=3 partial; pixel-identity FAR $0\%$ on $n = 262$.
-- **Per-document classifier** (§III-L) — five-way inherited rule + K=3 alternative output.
+- **Reference populations** (§III-H) — Firm A as templated-end case study; non-Big-4 ($n = 249$) as reverse-anchor reference (less-replicated population).
+- **Distributional characterisation** (§III-I) — Big-4 dip-test multimodality ($p < 5 \times 10^{-4}$); BD/McCrary null at Big-4 scope; mixture support.
+- **K=3 components, descriptive only** (§III-J) — C1 hand-leaning, C2 mixed, C3 replicated; LOOO supports descriptive use, not operational classification.
+- **Convergent internal-consistency** (§III-K) — three feature-derived scores ($\rho \geq 0.879$, not independent measurements); per-signature K=3 ($\kappa = 0.87$ vs per-CPA fit); K=2 LOOO unstable, K=3 LOOO partial; pixel-identity miss rate $0\%$ on $n = 262$.
+- **Per-document classifier** (§III-L) — inherited five-way rule retained as primary; K=3 demoted to characterisation only.
-## Script-to-section evidence map
+## Open questions remaining for partner / reviewer
-| Section | Empirical anchor | Script |
-|---|---|---|
-| III-G scope | mid/small-firm tail distortion | 32, 34 |
-| III-H Firm A descriptive | byte-level pair analysis | (v3 §IV-F.1 inherited) |
-| III-H non-Big-4 reference | MCD reference Gaussian | 38 |
-| III-I dip-test | Big-4 marginal $p < 10^{-4}$ | 32, 34 |
-| III-I BD/McCrary null | density-smoothness diagnostic | 32, 34 |
-| III-J K=2 components | bootstrap CI on marginals | 34 |
-| III-J K=3 components | full-Big-4 fit | 35 |
-| III-K three convergent lenses | per-CPA Spearman | 38 |
-| III-K per-signature | Cohen $\kappa$ binary | 39 |
-| III-K K=2 LOOO unstable | $\Delta\text{cos} = 0.028$ | 36 |
-| III-K K=3 LOOO partial | C1 shape stable | 37 |
-| III-K pixel-identity FAR | $0\%$ on $n = 262$ | 40 |
-| III-L five-way inherited rule | (no v4 change) | (v3 §III-K inherited) |
-| III-L K=3 alternative output | per-signature K=3 fit | 39 |
+1. **Five-way rule validation against the moderate-confidence band.** §III-K's $\kappa$ evidence covers only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evidence (Tables IX, XI, XII). Is this inheritance sufficient, or should v4.0 add a Big-4-specific capture-rate analysis for the moderate band in §IV-F? (A sketch of such an analysis appears at the end of this file.)
---
+2. **Anonymisation of within-Big-4 firm contrasts.** §III-H states that Firm C is the Big-4 firm most concentrated in C1 hand-leaning, at $23.5\%$ (Script 35). The within-Big-4 ordering by hand-leaning concentration is informative for the §V discussion, and v3.x reports under pseudonyms throughout. Confirm that we maintain pseudonyms consistently in §IV–V even when discussing the specific Firm C / Firm B / Firm D hand-leaning rates.
-> **Open questions for partner / reviewer review of this draft:**
->
-> 1. §III-G scope justification — is the three-point argument (within-pool homogeneity, dip-test multimodality, LOOO feasibility) sufficiently strong, or should we add a fourth (e.g., generalisability claim restricted to Big-4 audit context)?
-> 2. §III-H Firm A reframing — is "case study of templated end" the right phrase, or should we use "calibration reference, descriptively defined post-hoc"?
-> 3. §III-J K=3 vs K=2 selection — we choose K=3 on the basis of LOOO stability (§III-K), not BIC alone (only $\Delta = 3.5$). Is this acceptable to a reviewer, or should we strengthen the BIC argument?
-> 4. §III-L — keeping the inherited five-way rule intact preserves continuity with v3.x literature but means the operational classifier is *not* derived from v4.0's K=3 mixture. Is this hybrid (mixture for characterisation, inherited box rule for classification) acceptable, or should v4.0 commit to K=3 hard label as the primary classifier?
-> 5. The Section IV (Results) regeneration in Phase 3 will need to mirror this section's structure — particularly Tables IV–XVIII numbering. We should confirm the numbering and section-letter scheme before Phase 3 begins.
+3. **Section IV table numbering.** Defer until the final §III text is accepted by partner / reviewer; results numbering should mirror the §III flow (sample/scope → mixture characterisation → convergent checks → LOOO → pixel-identity → signature/document classification → full-dataset robustness).
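+
+If the Big-4-specific capture-rate analysis floated in question 1 is adopted, its core computation is small; a sketch under a hypothetical per-signature schema (`cos`, `dhash`, `pixel_identical`), with synthetic stand-in data:
+
+```python
+import numpy as np
+import pandas as pd
+
+# Stand-in for the Big-4 per-signature set with the Script 40 pixel-identity flag.
+rng = np.random.default_rng(2)
+sigs = pd.DataFrame({
+    "cos": rng.beta(8, 1, 10_000),
+    "dhash": rng.gamma(2.0, 3.0, 10_000),
+    "pixel_identical": rng.random(10_000) < 0.02,
+})
+
+high = (sigs.cos > 0.95) & (sigs.dhash <= 5)
+moderate = (sigs.cos > 0.95) & (sigs.dhash > 5) & (sigs.dhash <= 15)
+anchor = sigs.pixel_identical
+
+print(f"moderate-band share of all signatures: {moderate.mean():.1%}")
+print(f"anchor capture, high band only:       {high[anchor].mean():.1%}")
+print(f"anchor capture, high + moderate:      {(high | moderate)[anchor].mean():.1%}")
+```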