Apply codex round-22 corrections to §III v3 (Minor -> ready)

Codex gpt-5.5 round 22 returned Minor Revision after v2 closed
3/5 Major findings fully and 2/5 partially. Five narrow fixes
applied for v3:

  1. Per-firm ranking unanimity corrected (v2:93). The reverse-
     anchor score ranks Firm D fractionally higher than Firm C
     (-0.7125 vs -0.7672); only Scores 1 and 3 rank Firm C
     highest. The unanimity claim was wrong; v3 prose now says
     all three agree on Firm A as most replication-dominated
     and on the non-Firm-A Big-4 as more hand-leaning, with a
     modest disagreement on Firm C vs D ordering.

  2. "Smallest scope" / "any single firm" overclaim narrowed
     (v2:21, v2:43). Script 32 only tested Firm A alone, big4_non_A
     pooled, and all_non_A pooled -- not Firms B, C, D individually.
     v3 explicitly says "comparison scopes tested in Script 32"
     and notes single-firm dip tests for B, C, D were not
     separately computed.

  3. K=3 hard label vs posterior in Spearman correctly
     attributed (v2:143). Script 38 uses the K=3 posterior P(C1),
     not the hard label, in the internal-consistency Spearman
     correlations. v3 §III-L now correctly says the hard label
     is for the §IV cluster cross-tabulation while the posterior
     is the continuous Score 1 in §III-K.

  4. Provenance source for n=150,442 corrected (v2:17, v2:152).
     Script 39 directly reports this count in its per-signature
     K=3 fit; Script 38's report does not. v3 cites Script 39 for
     this number.

  5. "Max fold-to-fold deviation" wording made precise (v2:65,
     v2:107). The $0.028$ value is the max absolute deviation
     from the across-fold mean (Script 36 stability summary), not
     the pairwise across-fold range (which is $0.0376 = 0.9756 -
     0.9380$). v3 reports both statistics with explicit
     definitions.

Also: draft note at top now records v2 (round-21) and v3
(round-22) revision lineage. Cross-reference index and open-
question block retained as author working checklist (will be
removed before manuscript submission per codex e7).

Outstanding open questions reduced to 3 (codex round-22 view):
  - Five-way moderate-confidence band: validate in Big-4 specifically
    (Phase 3 §IV-F work) or document as inherited from v3.x?
  - Firm anonymisation policy in §IV-V (procedural)
  - §IV table numbering (procedural; defer until §IV done)

Phase 2 §III draft is now Minor-Revision-quality. Ready for
Phase 3 (Results regeneration §IV).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:26:02 +08:00
parent 62a22ceb83
commit c8c7656513
2 changed files with 103 additions and 10 deletions
@@ -0,0 +1,87 @@
# Paper A v4.0 Methodology Section III-G through III-L Peer Review
Reviewer: gpt-5.5 xhigh
Date: 2026-05-12
Round number: 22 (v4 round 2)
Review target: `paper/v4/paper_a_methodology_v4_section_iii.md`
## Verdict
Minor Revision.
v2 closes most of the round-21 blockers: K=3 is no longer the operational classifier, the "independent lenses" claim is softened, the pixel-identity metric is no longer called FAR in the draft, and the main empirical slips are corrected. The remaining issues are narrower but still need edits before accepting the methodology text, especially the false per-firm ordering claim in §III-K and the unresolved validation status of the five-way moderate-confidence band.
## Round-21 finding closure table
| Finding | Round-21 Severity | v2 Status | Evidence in v2 |
|---|---|---|---|
| M1. K=3 is not justified as an operational classifier. | Major | CLOSED | v2 explicitly says both K=2 and K=3 are descriptive and not used for signature/document labels (v2:51, v2:67-73, v2:143). It also reports Script 37 `P2_PARTIAL` and the "not predictively useful as an operational classifier" implication (v2:65, v2:109). |
| M2. "Three independent lenses" overstates independence and validation strength, and reverse-anchor direction was wrong. | Major | PARTIAL | The independence and reverse-anchor wording are fixed: the scores are "not statistically independent" and only internal-consistency checks (v2:75-83), and the reference is now described as less replication-dominated (v2:35-37). However, v2 adds a false per-firm ordering claim that all three scores make Firm C most hand-leaning (v2:93); Script 38's reverse-anchor mean instead ranks Firm D highest. |
| M3. Classifier conflation; only the simplified binary rule was validated. | Major | PARTIAL | v2 now declares the inherited five-way box rule as primary (v2:123-143) and K=3 as descriptive (v2:143). It also correctly notes that the kappa comparison validates only the binary high-confidence rule, not the five-way moderate band (v2:103). The unresolved moderate-band validation is still open (v2:190-192), and v2:125 still uses binary-rule correlations to support the full five-way rule without recalibration. |
| M4. Pixel-identity "FAR" naming and evidentiary force were wrong. | Major | CLOSED | v2 renames this to a positive-anchor miss rate, frames it as a one-sided replicated-positive check, and adds the tautology/conservative-subset caveat (v2:111-121). |
| M5. Empirical/provenance claims needed correction or explicit unverified status. | Major | CLOSED | The 0.005 denominator is now a stability tolerance, not a bootstrap CI (v2:65, v2:107); all-non-Firm-A dip values are corrected (v2:21, v2:43); BD/McCrary is narrowed to Big-4 null with external dHash transitions disclosed (v2:47); Firm A byte-decomposition details are marked inherited/not regenerated (v2:31, v2:176); "tail distorts" is softened to a scope-dependent shift (v2:19). |
| m1. Dip-test p-value precision needed bootstrap-resolution wording. | Minor | CLOSED | v2 states no bootstrap replicate exceeded the observed statistic and reports `p < 5 x 10^-4` for `n_boot = 2000` (v2:21, v2:43, v2:158-159). |
| m2. Delta BIC sign convention was confusing. | Minor | CLOSED | v2 defines lower BIC as preferred and reports `BIC(K=3) - BIC(K=2) = -3.48`, plus "K=3 lower by 3.48" (v2:45, v2:63). |
| m3. Per-signature convergence is only moderate for the box rule. | Minor | CLOSED | v2 includes the `SIG_CONVERGENCE_MODERATE` verdict and avoids calling the Paper A-vs-K=3 kappas strong (v2:95-103). |
| m4. Per-CPA vs per-signature component centers drift more than v1 suggested. | Minor | CLOSED | v2 says the fits recover a "broadly similar three-component ordering" and reports the C1 cosine drift of 0.018 (v2:95). |
| m5. Section III-L title was misleading. | Minor | CLOSED | The section is now titled "Signature- and Document-Level Classification" and separates per-signature categories from document aggregation (v2:123-143). |
| m6. K=3 alternative lacked document aggregation. | Minor | CLOSED | v2 no longer offers K=3 as a signature/document classifier, so a K=3 document aggregation rule is no longer required (v2:143). |
| m7. Firm anonymization was inconsistent. | Minor | CLOSED | v2 uses Firm A-D pseudonyms in the methodology text and no longer names the Big-4 firms directly in the prose (v2:17, v2:31, v2:194). |
| e1. Replace "more-replicated-population baseline." | Editorial | CLOSED | v2 now calls non-Big-4 a less-replicated external/reverse-anchor reference (v2:35-37). |
| e2. Replace "failure rate" for Lens 3. | Editorial | CLOSED | Lens 3 is now "Paper A box-rule hand-leaning rate" (v2:83). |
| e3. "Strongest single methodology-validation signal" was too strong. | Editorial | CLOSED | v2 uses "strongest internal-consistency signal" and denies external validation (v2:77, v2:93). |
| e4. "Boundary moves modestly" understated LOOO membership instability. | Editorial | CLOSED | v2 uses composition-sensitive wording and reports the 12.8 pp Firm C fold deviation (v2:65, v2:109). |
| e5. "Calibration uncertainty band of +/- 5-13 pp" wording needed correction. | Editorial | CLOSED | v2 reports observed absolute differences of 1.8-12.8 pp and the 5 pp viability bar (v2:109). |
| e6. "Operational threshold derivation" language was inaccurate. | Editorial | CLOSED | v2 consistently calls K=3 a mixture characterisation/descriptive model, not an operational threshold source (v2:49-73, v2:143). |
| e7. Cross-reference index should be removed or made internal. | Editorial | PARTIAL | v2 labels the cross-reference index as an author checklist to remove before submission (v2:181), but it remains inside the methodology draft (v2:181-188). |
## Newly introduced issues
1. **New factual/provenance error: the three scores do not agree on the most hand-leaning firm.** v2 claims that "by all three scores, Firm A is the most replication-dominated and Firm C is the most hand-leaning" (v2:93). Script 38 confirms Firm A is most replication-dominated, but not that Firm C leads on all three scores: mean P_C1 and mean hand_frac rank Firm C highest, while mean reverse-anchor ranks Firm D highest (`-0.7125` vs Firm C `-0.7672`, with higher score meaning more hand-leaning). Revise to: "P_C1 and box-rule hand_frac rank Firm C highest; the reverse-anchor score ranks Firm D highest; all three agree Firm A is most replication-dominated and the non-A firms are more hand-leaning than Firm A."
2. **Unsupported scope superlative: "any single firm" / "smallest scope" is not proven by the supplied reports.** v2 says no dip-test rejection holds "within any single firm pooled alone" and that Big-4 is the "smallest scope" supporting a finite-mixture model (v2:21; repeated more generally at v2:43). The supplied Script 32 report verifies Firm A alone, `big4_non_A`, and `all_non_A`; it does not report separate single-firm tests for Firms B, C, and D or all smaller combinations. Narrow this to "among the tested comparison scopes in Script 32" or add the missing single-firm tests.
3. **K=3 hard labels are incorrectly described as used in the Spearman correlations.** v2:143 says the "K=3 hard label" is used for the internal-consistency Spearman correlations. Script 38's Spearman table uses the K=3 posterior score `P_C1`, not hard labels. Change v2:143 to "K=3 posterior score is used for the Spearman correlations; hard labels are used for the cluster cross-tabulation."
4. **Provenance table over-cites Script 38 for the Big-4 signature count.** v2:17 and v2:152 attribute the 150,442 signature count partly/directly to Script 38. In the supplied markdown report, Script 39 directly reports the 150,442 signature-level cloud; Script 38's visible report does not directly state that count. Keep Script 39 as the direct source unless the JSON artifact is also cited.
5. **"Max fold-to-fold deviation" wording is imprecise.** v2 reports a K=2 "max fold-to-fold deviation" of 0.028 (v2:65, v2:107). Script 36's 0.0278 is the max absolute deviation across folds as reported in the stability summary, not the pairwise fold range; the fold cut range is about 0.0376 (0.9756 - 0.9380). Use the report's exact wording or explicitly define the statistic.
## Provenance re-verification
| v2 numerical claim | v2 lines | Spike-report check | Status |
|---|---:|---|---|
| Big-4 has 437 CPAs split 171 / 112 / 102 / 52. | v2:17, v2:151 | Script 36 reports 437 CPAs; Script 34 reports the four firm counts. | CONFIRMED |
| Big-4 signature-level cloud has 150,442 signatures. | v2:17, v2:95, v2:152 | Script 39 reports fitting on 150,442 signature-level points. | CONFIRMED, but source should be Script 39 rather than Script 38 in the provenance table. |
| Big-4 K=2 crossings are cos 0.9755 and dHash 3.7549, with CIs [0.9742, 0.9772] and [3.4762, 3.9689]. | v2:45, v2:53, v2:154-156 | Script 36 and Script 34 report these point estimates and bootstrap CIs. | CONFIRMED |
| K=3 components are C1 0.9457/9.1715/0.143, C2 0.9558/6.6603/0.536, C3 0.9826/2.4137/0.321. | v2:55-63, v2:163 | Scripts 35, 37, and 38 report the same centers and weights. | CONFIRMED |
| K=3 LOOO membership deviations are 1.8-12.8 pp, with `P2_PARTIAL`. | v2:65, v2:109, v2:168 | Script 37 reports diffs 1.76, 4.68, 5.81, 12.77 pp and verdict `P2_PARTIAL`. | CONFIRMED |
| Spearman correlations are 0.963, 0.889, and 0.879. | v2:85-91, v2:169 | Script 38 reports 0.9627, 0.8890, and 0.8794. | CONFIRMED |
| All three scores rank Firm C as most hand-leaning. | v2:93 | Script 38 per-firm summary ranks Firm C highest on mean P_C1 and mean hand_frac, but Firm D highest on mean reverse-anchor. | FLAGGED |
| Per-signature kappas are 0.662, 0.559, and 0.870; verdict moderate. | v2:95-103, v2:170 | Script 39 reports 0.6616, 0.5586, 0.8701 and `SIG_CONVERGENCE_MODERATE`. | CONFIRMED |
| Pixel-identical subset is n=262 split 145 / 8 / 107 / 2, with 0% miss rate and Wilson upper 1.45%. | v2:111-119, v2:172-173 | Script 40 reports total 262, the per-firm split, and 262/262 correct for all three candidate classifiers with Wilson [0.00%, 1.45%]. | CONFIRMED |
| Non-Firm-A dip values are 0.998/0.906 for `big4_non_A` and 0.998/0.907 for `all_non_A`. | v2:21, v2:43, v2:161-162 | Script 32 reports 0.9985/0.9055 and 0.9975/0.9065, matching v2 rounded values. | CONFIRMED |
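The Wilson upper bound in the pixel-identity row above is straightforward to re-derive. A minimal sketch of the standard Wilson score interval (textbook formula, not code from Scripts 38-40):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    # Clamp to [0, 1] to avoid tiny negative lower bounds at p = 0.
    return max(0.0, centre - half), min(1.0, centre + half)

# 0 misses out of 262 pixel-identical signatures (Script 40's reported count):
lo, hi = wilson_interval(0, 262)
print(f"[{lo:.2%}, {hi:.2%}]")  # [0.00%, 1.45%]
```

This reproduces the reported [0.00%, 1.45%] bound exactly, confirming the table entry is internally consistent.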
## Outstanding open questions
1. **Five-way moderate-confidence validation still needs a decision.** v2 is honest that the v4 kappa evidence covers only the high-confidence binary rule (v2:103, v2:190-192). If the five-way classifier remains primary, the cleanest next step is a Big-4-specific capture/FAR/cross-tab analysis for the moderate band and the document-level worst-case aggregation. If not rerun, the manuscript should explicitly state that the moderate band remains inherited from v3.x and is not newly validated by Scripts 38-40.
2. **Firm anonymisation policy still needs confirmation for §IV-V.** v2 itself is pseudonymous, but the open question at v2:194 remains real: once §IV-V discuss within-Big-4 contrasts, the manuscript should consistently use Firm A-D and keep any real-name mapping out of the paper body.
3. **Section IV numbering can remain deferred.** v2:196 is procedural and does not block §III acceptance; resolve after the methodology claims and result-table sequence are frozen.
## Recommended next-step actions
1. Correct v2:93's per-firm ordering claim against Script 38.
2. Decide whether to add a Big-4-specific validation for the five-way moderate band and document-level aggregation. If not, narrow v2:125 so binary-rule correlations do not appear to validate the full five-way classifier.
3. Narrow the dip-test scope language at v2:21 and v2:43, or add missing individual-firm dip tests for Firms B-D.
4. Fix v2:143 so Spearman correlations are tied to K=3 posterior scores, not K=3 hard labels.
5. Correct the provenance table entry for the 150,442 signature count to cite Script 39 as the direct markdown-report source.
6. Replace "max fold-to-fold deviation" with the exact Script 36 statistic or report the actual pairwise fold range.
7. Remove the author checklist and open-question block from the manuscript version after these decisions are resolved.
@@ -1,6 +1,12 @@
# Section III. Methodology — v4.0 Draft v2 (post codex round-21 revision)
# Section III. Methodology — v4.0 Draft v3 (post codex rounds 21 + 22)
> **Draft note (2026-05-12, v2).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here. v2 incorporates codex gpt-5.5 round-21 review (`paper/codex_review_gpt55_v4_round1.md`); the key revisions are: (i) the inherited five-way per-signature box rule is restored as the **primary operational classifier** (§III-L), (ii) the K=3 Gaussian mixture is positioned as an **accountant-level descriptive characterisation**, not an operational threshold (§III-J), (iii) the "convergent validation" framing is softened to "convergent internal-consistency checks" since the three scores share underlying features (§III-K), (iv) the pixel-identity metric is renamed from FAR to positive-anchor miss rate (§III-K), (v) five empirical/wording slips flagged by the reviewer are corrected. Empirical anchors throughout reference Scripts 32–40 on branch `paper-a-v4-big4`; a provenance table appears at the end of this section listing every numerical claim with its script and report path.
> **Draft note (2026-05-12, v3).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here.
>
> **v2** incorporated codex gpt-5.5 round-21 review (`paper/codex_review_gpt55_v4_round1.md`, Major Revision); key revisions were: (i) the inherited five-way per-signature box rule restored as the **primary operational classifier** (§III-L), (ii) the K=3 Gaussian mixture positioned as **accountant-level descriptive characterisation** (§III-J), (iii) "convergent validation" softened to "convergent internal-consistency checks" since the three scores share underlying features (§III-K), (iv) the pixel-identity metric renamed from FAR to positive-anchor miss rate (§III-K), (v) five empirical/wording slips corrected.
>
> **v3** incorporates codex gpt-5.5 round-22 review (`paper/codex_review_gpt55_v4_round2.md`, Minor Revision); five narrow fixes applied: (1) §III-K per-firm ranking corrected to acknowledge that Score 2 (reverse-anchor) ranks Firm D fractionally above Firm C while Scores 1 and 3 rank Firm C highest; (2) "smallest scope" / "any single firm" language narrowed to "comparison scopes tested in Script 32"; (3) §III-L corrected so the §III-K Spearman correlations are explicitly tied to the K=3 *posterior* P(C1) rather than the K=3 hard label; (4) provenance table source for $n = 150{,}442$ updated to cite Script 39 directly; (5) "max fold-to-fold deviation" wording made precise — the $0.028$ value is the maximum absolute deviation from the across-fold mean, not the pairwise across-fold range ($0.0376$).
>
> Empirical anchors throughout reference Scripts 32–40 on branch `paper-a-v4-big4`; a provenance table appears at the end of this section listing every numerical claim with its script and report path.
## G. Unit of Analysis and Scope
@@ -14,11 +20,11 @@ We adopt one stipulation about same-CPA pair detectability:
A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.
**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, all §IV results except §IV-K) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available (Script 38 verification). Restricting the analyses to Big-4 is a methodological choice driven by four considerations:
**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, all §IV results except §IV-K) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ (Scripts 36, 38), totalling 150,442 Big-4 signatures with both descriptors available (Script 39 reports the explicit per-signature $n$ used in the signature-level K=3 fit). Restricting the analyses to Big-4 is a methodological choice driven by four considerations:
1. **Within-pool homogeneity for mixture characterisation.** Pooling Big-4 with mid- and small-firm CPAs introduces a heterogeneous tail of $\sim$249 CPAs distributed across multiple firms with idiosyncratic signing practices and small per-firm samples. The full-sample and Big-4-only calibrations *differ* in their fitted marginal crossings (full-sample published $\overline{\text{cos}}^* = 0.945$, $\overline{\text{dHash}}^* = 8.10$ from v3.x; Big-4-only $\overline{\text{cos}}^* = 0.975$, $\overline{\text{dHash}}^* = 3.76$ from Script 34; bootstrap 95% CIs $[0.974, 0.977]$ / $[3.48, 3.97]$, $n_{\text{boot}} = 500$); the offset is large compared to the Big-4 bootstrap CI half-width of $0.0015$. We report this as a *scope-dependent shift* rather than asserting a causal "mid/small-firm tail distorts" claim.
2. **Statistical multimodality at the accountant level.** Within the Big-4 sub-corpus, the Hartigan dip test rejects unimodality on both axes (cosine $p = 0.0000$, dHash $p = 0.0000$ in the bootstrap-2000 implementation, i.e., no bootstrap replicate exceeded the observed statistic; reported here as $p < 5 \times 10^{-4}$; Script 34). No such rejection holds within any single firm pooled alone (Firm A: $p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$; Script 32). Pooling Firm A's three Big-4 peers gives $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$ (Script 32, `big4_non_A` subset). Pooling all non-Firm-A CPAs (other Big-4 plus mid/small firms) gives $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$ (Script 32, `all_non_A` subset). Big-4 is the smallest scope at which the dip test supports applying a finite-mixture model to the per-CPA distribution.
2. **Statistical multimodality at the accountant level.** Within the Big-4 sub-corpus, the Hartigan dip test rejects unimodality on both axes (cosine $p = 0.0000$, dHash $p = 0.0000$ in the bootstrap-2000 implementation, i.e., no bootstrap replicate exceeded the observed statistic; reported here as $p < 5 \times 10^{-4}$; Script 34). No such rejection holds in the comparison scopes tested by Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled (Script 32, `big4_non_A` subset: $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$); all non-Firm-A pooled (Script 32, `all_non_A` subset: other Big-4 plus mid/small firms; $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$). Among the comparison scopes we evaluated, Big-4 is the smallest scope at which the dip test supports applying a finite-mixture model to the per-CPA distribution; we did not separately test single-firm dip statistics for Firms B, C, or D.
3. **Reproducibility under leave-one-firm-out cross-validation.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 mixture fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm.
@@ -40,7 +46,7 @@ The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\ove
This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G. Three diagnostic procedures are applied: a univariate unimodality test on each marginal axis, a 2D Gaussian mixture fit (developed in §III-J), and a density-smoothness diagnostic.
**1. Hartigan dip test on each marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds at any narrower scope or in the non-Big-4 population (Script 32): Firm A alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). The dip-test multimodality at the Big-4 level is the empirical justification for fitting a finite-mixture model in §III-J; without it, the mixture would be a forced fit on an essentially unimodal distribution.
**1. Hartigan dip test on each marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds in the comparison scopes tested in Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A CPAs pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). Single-firm dip tests for Firms B, C, and D were not separately computed; the comparison scopes above sufficed to establish that no narrower-than-Big-4 *tested* scope rejected unimodality. The dip-test multimodality at the Big-4 level is the empirical justification for fitting a finite-mixture model in §III-J; without it, the mixture would be a forced fit on an essentially unimodal distribution.
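The bootstrap-resolution convention above (reporting $p < 5 \times 10^{-4}$ rather than $p = 0$ when no replicate exceeds the observed statistic) reduces to simple arithmetic. A minimal illustration, independent of any particular dip-test implementation:

```python
def bootstrap_p_value(n_exceed: int, n_boot: int) -> tuple[float, float]:
    """Empirical bootstrap p-value and its resolution floor.

    When no replicate exceeds the observed statistic (n_exceed == 0),
    the p-value is not literally 0: it is only bounded above by
    1 / n_boot, the resolution of the bootstrap.
    """
    p_hat = n_exceed / n_boot   # naive empirical p-value
    p_floor = 1.0 / n_boot      # smallest resolvable nonzero p
    return p_hat, p_floor

# Script 34's setting: n_boot = 2000, zero exceedances on both axes.
p_hat, p_floor = bootstrap_p_value(0, 2000)
print(p_hat)    # 0.0 -- reported as p < 5e-4, not p = 0
print(p_floor)  # 0.0005 == 5e-4
```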
**2. Mixture-model evidence.** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42; Script 34) recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ over $n_{\text{boot}} = 500$ resamples. The 3-component fit (§III-J) is BIC-preferred — using the convention that lower BIC is preferred, $\text{BIC}(K{=}3) - \text{BIC}(K{=}2) = -3.48$ (Script 36). The $\Delta$BIC magnitude is small in absolute terms; we do not treat $\Delta\text{BIC} = 3.5$ alone as decisive evidence for K=3, and the operational role of each fit is developed in §III-J and §III-K.
@@ -62,7 +68,7 @@ This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 account
$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support for K=3 under standard BIC interpretation but not by itself decisive).
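The K=2 vs K=3 comparison follows the standard scikit-learn mixture workflow. A sketch on synthetic stand-in data (the real inputs are the 437 per-CPA descriptor means, which are not reproduced here; the `GaussianMixture` settings mirror the stated `n_init = 15` and fixed seed):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for the 437 per-CPA (mean cos, mean dHash) points,
# drawn near the two K=2 component centres reported above.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([0.954, 7.14], [0.02, 1.5], size=(300, 2)),
    rng.normal([0.983, 2.41], [0.005, 0.8], size=(137, 2)),
])

fits = {}
for k in (2, 3):
    fits[k] = GaussianMixture(n_components=k, covariance_type="full",
                              n_init=15, random_state=42).fit(X)

# Convention: lower BIC is preferred, so a negative delta favours K=3.
delta_bic = fits[3].bic(X) - fits[2].bic(X)
print(delta_bic)
```

On the real data this delta is the reported $-3.48$; on the toy data above the value differs, which is the point of treating a small $\Delta$BIC as non-decisive.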
**Why we report both K=2 and K=3.** Leave-one-firm-out cross-validation (§III-K) shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The max fold-to-fold deviation of the cosine crossing is $0.028$, which is $5.6\times$ the report's $0.005$ across-fold *stability tolerance* (not bootstrap CI; the full-Big-4 bootstrap cosine half-width is the much smaller $0.0015$). K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.025$ (Script 37). K=3 hard-posterior membership for the held-out firm is more composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL`, with the explicit interpretation that "the C1 cluster exists but membership is not well-predicted by the held-out fit" and "K=3 is not predictively useful as an operational classifier" (Script 37 verdict legend).
**Why we report both K=2 and K=3.** Leave-one-firm-out cross-validation (§III-K) shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold *stability tolerance* (not the bootstrap CI; the full-Big-4 bootstrap cosine half-width is the much smaller $0.0015$). K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.025$ (Script 37). K=3 hard-posterior membership for the held-out firm is more composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL`, with the explicit interpretation that "the C1 cluster exists but membership is not well-predicted by the held-out fit" and "K=3 is not predictively useful as an operational classifier" (Script 37 verdict legend).
We take the joint K=2/K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites a v4.0 operational classifier:
@@ -90,7 +96,7 @@ Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs
| Score 2 vs Score 3 | $+0.889$ | $< 10^{-149}$ |
| Score 1 vs Score 2 | $+0.879$ | $< 10^{-142}$ |
We read this as the strongest internal-consistency signal in v4.0: three different summarisations of the same descriptor pair agree on the per-CPA hand-leaning ranking with $\rho > 0.87$. The three scores agree on the per-firm ordering: by all three scores, Firm A is the most replication-dominated and Firm C is the most hand-leaning of the Big-4 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the inherited box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs.
We read this as the strongest internal-consistency signal in v4.0: three different summarisations of the same descriptor pair agree on the per-CPA hand-leaning ranking with $\rho > 0.87$. The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A Big-4 firms as more hand-leaning, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule hand-leaning rate (Scores 1 and 3) place Firm C at the most-hand-leaning end of Big-4 (mean P(C1) $= 0.311$; mean box-rule hand-leaning rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the inherited box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the most-hand-leaning end among the three non-A Big-4 firms.
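The rank-agreement computation itself is standard. A toy sketch with hypothetical per-CPA scores (not Script 38's actual values) showing how a Spearman internal-consistency check of this kind is run:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical scores standing in for the three lenses: a latent
# hand-leaning tendency observed through three noisy summaries.
latent = rng.normal(size=437)
score1 = latent + 0.2 * rng.normal(size=437)  # K=3 posterior P(C1) analogue
score2 = latent + 0.5 * rng.normal(size=437)  # reverse-anchor analogue
score3 = latent + 0.3 * rng.normal(size=437)  # box-rule hand_frac analogue

rho, p = spearmanr(score1, score3)
print(f"rho = {rho:.3f}, p = {p:.2e}")
```

Because all three toy scores share the same latent variable, the pairwise rank correlations are high, which is exactly the shared-feature caveat that keeps this an internal-consistency check rather than independent validation.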
**2. Per-signature consistency (Script 39).** Per-CPA aggregation could in principle reflect averaging across within-CPA heterogeneity rather than coherent within-CPA behaviour. We test this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points (Script 39) — and comparing labels. The per-CPA and per-signature K=3 fits recover a broadly similar three-component ordering; per-CPA C1 is at $\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$ vs per-signature C1 at $\overline{\text{cos}} = 0.928$, $\overline{\text{dHash}} = 9.75$ (an absolute cosine drift of $0.018$). Cohen $\kappa$ on the binary collapse (replicated vs not-replicated):
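The re-fit-and-compare step can be sketched as below. The synthetic three-component data, the component-identification heuristic (highest mean cosine = replicated), and the second binary labelling are all illustrative assumptions, not Script 39's implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Toy stand-in for the signature-level (cos, dHash) point cloud.
X_sig = np.vstack([
    rng.normal([0.93, 9.8], [0.02, 1.5], size=(300, 2)),   # C1-like
    rng.normal([0.97, 4.0], [0.01, 1.0], size=(300, 2)),   # replicated-like
    rng.normal([0.88, 14.0], [0.03, 2.0], size=(300, 2)),  # third component
])

# Fresh K=3 GMM fit at the signature level.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_sig)
hard = gmm.predict(X_sig)

# Collapse to binary replicated / not-replicated: take the component
# with the highest mean cosine as "replicated" (illustrative rule).
replicated_comp = int(np.argmax(gmm.means_[:, 0]))
binary_a = (hard == replicated_comp).astype(int)

# Cohen kappa against a second binary labelling (here a toy cosine
# threshold) measures agreement beyond chance.
binary_b = (X_sig[:, 0] > 0.95).astype(int)
kappa = cohen_kappa_score(binary_a, binary_b)
```

Kappa on the binary collapse, rather than raw accuracy, is the natural statistic here because the replicated class is a minority at the signature level and chance agreement would otherwise inflate the score.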
**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference:
- *K=2 LOOO is unstable.* The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$, against the report's $0.005$ across-fold stability tolerance (Script 36; pairwise fold range $0.0376$, from $0.9380$ to $0.9756$). When Firm A is held out, the fold rule classifies $171/171$ of held-out Firm A CPAs as templated; when any non-Firm-A Big-4 firm is held out, the fold rule classifies $0$ of the held-out firm's CPAs as templated. This pattern indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
- *K=3 LOOO is partially stable.* The C1 hand-leaning component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.025$ in mixture weight (Script 37). Hard-posterior membership remains composition-sensitive — observed absolute differences are $1.8$–$12.8$ pp across the four folds, with the Firm C fold exceeding the report's $5$ pp viability bar; the report's own verdict is `P2_PARTIAL` ("K=3 is not predictively useful as an operational classifier"). We accordingly do not use K=3 hard-posterior membership as an operational label.
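The two fold-stability statistics above have distinct definitions, which the following sketch makes explicit. The two endpoint crossings ($0.9380$, $0.9756$) are the reported extremes from Script 36's summary; the middle two values are placeholders for illustration only.

```python
def max_dev_from_mean(xs):
    """Max absolute deviation of the fold values from their across-fold mean
    (the statistic reported as 0.028 in Script 36's stability summary)."""
    m = sum(xs) / len(xs)
    return max(abs(x - m) for x in xs)

def pairwise_range(xs):
    """Pairwise across-fold range, max minus min (reported as 0.0376)."""
    return max(xs) - min(xs)

# Endpoints are Script 36's reported extremes; middle two values are
# placeholders, NOT the actual fold crossings.
crossings = [0.9380, 0.9740, 0.9750, 0.9756]
print(round(pairwise_range(crossings), 4))  # -> 0.0376
```

The mean-deviation statistic is always no larger than the range, which is why the two numbers ($0.028$ vs $0.0376$) must be kept apart rather than used interchangeably.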
**Document-level aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.x worst-case rule: the document inherits the *most-replication-consistent* signature label among the two signatures (rank order: High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed). The aggregation rule reflects the detection goal of flagging any potentially non-hand-signed report.
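A minimal sketch of the worst-case aggregation rule; the function and dictionary names are ours, not identifiers from the v4.0 scripts, and the label strings follow the rank order given in the text.

```python
# Rank order from most- to least-replication-consistent, per the
# v3.x worst-case rule.
RANK = {
    "High-confidence": 0,
    "Moderate-confidence": 1,
    "Style-consistency": 2,
    "Uncertain": 3,
    "Likely-hand-signed": 4,
}

def document_label(signature_labels):
    """Document inherits the most-replication-consistent signature label."""
    return min(signature_labels, key=RANK.__getitem__)

# Typical case: two certifying-CPA signatures per report.
print(document_label(["Likely-hand-signed", "Moderate-confidence"]))
# -> Moderate-confidence
```

Taking the minimum over ranks implements the stated detection goal: a report is flagged as long as at least one of its signatures looks potentially non-hand-signed.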
**K=3 as accountant-level characterisation, not classifier.** The K=3 mixture of §III-J is reported in §IV as an accountant-level descriptive summary alongside the per-signature five-way classifier. We do not assign signature-level or document-level labels from the K=3 mixture in any v4.0 result table; the K=3 hard label is used for the accountant-level cluster cross-tabulation (Script 35), and the K=3 *posterior* P(C1) is used (as the continuous Score 1) in the internal-consistency Spearman correlations of §III-K.
---
| Claim | Value | Source | Notes |
|---|---|---|---|
| Big-4 CPA count, $n_{\text{sig}} \geq 10$ | $437$ ($171/112/102/52$) | Script 36 sample sizes; Script 38 per-firm summary | direct |
| Big-4 signature count | $150{,}442$ | Script 39 (per-signature K=3 fit explicitly cites this $n$) | direct |
| Non-Big-4 reference CPA count | $249$ | Script 38 reference population | direct |
| Big-4 K=2 marginal crossings | $(0.9755, 3.755)$ | Script 34; Script 36 §A | direct |
| Bootstrap 95% CI cosine | $[0.9742, 0.9772]$ | Script 34; Script 36 §A | $n_{\text{boot}} = 500$ |