Apply codex round-23 corrections: §IV v3 + §III v4

Codex round 23 returned Major Revision on §IV v2: 6 Major + 6 Minor + 5 Editorial findings. Codex confirmed the spike-script provenance is mostly sound -- no scripts needed rerunning -- so v3 applies presentation-level fixes only. Decisions baked in: - Anonymisation: maintain Firm A-D pseudonyms throughout the manuscript body; remove (Deloitte) / (KPMG) / (PwC) / (EY) parentheticals from all v4 §IV tables. - Table numbering: v4 tables use fresh V-XVIII (plus Table XV-B); inherited v3.x tables are cited only as "v3.20.0 Table N" with the original v3 number, NOT renumbered into the v4 sequence. §IV v3 changes: 1. Detection denominator rewritten: 86,072 VLM-positive / 12 corrupted / 86,071 YOLO-processed / 85,042 with-detections / 182,328 signatures (matches v3.x §IV-B exact wording). 2. All v4 table labels stripped of "(revised:" / "(NEW:" prefixes; replaced with clean "Table N. <descriptor>." form. 3. Real firm names removed from all tables: 4 replace_all edits. 4. Line 211 MC-ordering claim removed: MC occupancy is no longer described as "consistent with the §III-K Spearman convergence" because MC fraction is not monotone in per-CPA hand-leaning ranking. New language: descriptive only, with Firm D / Firm B ordering counterexample stated. 5. Line 184 81.70% vs 82.46% qualified as "qualitative alignment, not like-for-like consistency check" (different units: per-signature class vs per-CPA hard cluster). 6. Line 43 BD-transition "histogram-resolution artefacts" softened to "scope-dependent and not used operationally"; no specific bin-width artefact claim without sensitivity sweep evidence. 7. K=3 LOOO C1 weight drift corrected: 0.025 -> 0.023 (matches Script 37 max deviation 0.0235 / rounded 0.023). 8. Seed coverage in §IV-A updated: "Scripts 32-42" (was "Scripts 32-41", missed Script 42). 9. Low-cosine cutoff inclusivity: cos < 0.837 -> cos <= 0.837 (matches Script 42 rule definition). 10. "round-22 Light scope" process note removed from manuscript prose in §IV-K. 11. §IV-L ablation pointer corrected: v3.20.0 §IV-I (was §IV-H.3); v3.20.0 Table XVIII clarified as different from v4 Table XVIII. 12. Line 75 "Component recovery verified across Scripts 35, 37, 38" rewritten: "the full-fit baseline is reproduced in Scripts 35, 37, 38" with explicit note that Script 37 LOOO fold-specific components differ by design. 13. Line 110 grammar: "This convergent-checks evidence" -> "These convergence checks". 14. Draft note marked "internal -- remove before submission". §III v4 changes (cross-reference cleanup): 1. Line 13 cross-reference repaired: "§IV-D, §IV-F, §IV-G" (which are now accountant-level v4 analyses) replaced with accurate signature-level references (§IV-J for five-way counts; §IV-I for inherited inter-CPA FAR). 2. Line 23 cross-reference repaired: "all §IV results except §IV-K" replaced with explicit list of v4-new vs inherited sub-sections. 3. Line 109 cross-reference repaired: moderate-band capture- rate evidence cited as "v3.20.0 Tables IX, XI, XII, XII-B" (was "§IV-F", which is now Convergent Internal-Consistency Checks, not capture-rate). 4. Line 131 "without recalibration" claim narrowed: §III-K's convergent-checks evidence is now scoped to the binary high-confidence rule only; the moderate-confidence band, style-consistency band, and document-level aggregation are retained by reference to v3.20.0 calibration, not claimed as v4.0-validated. Outstanding open questions: 3 procedural items remain (§IV table numbering finalisation, §IV-A-C content audit, Phase 4 prose); no methodology blockers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:03:33 +08:00
parent 453f1d8768
commit ce33156238
3 changed files with 199 additions and 54 deletions
@@ -1,4 +1,4 @@
-# Section III. Methodology — v4.0 Draft v3 (post codex rounds 21 + 22)
+# Section III. Methodology — v4.0 Draft v4 (post codex rounds 21 + 22 + 23 cross-reference cleanup)

 > **Draft note (2026-05-12, v3).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here.
 >
@@ -10,7 +10,7 @@

 ## G. Unit of Analysis and Scope

-We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of all signature-level capture-rate analyses (§IV-D, §IV-F, §IV-G). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
+We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and v3.20.0's inherited inter-CPA FAR analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.

 We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.

@@ -20,7 +20,7 @@ We adopt one stipulation about same-CPA pair detectability:

 A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.

-**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, all §IV results except §IV-K) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ (Scripts 36, 38), totalling 150,442 Big-4 signatures with both descriptors available (Script 39 reports the explicit per-signature $n$ used in the signature-level K=3 fit). Restricting the analyses to Big-4 is a methodological choice driven by four considerations:
+**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, and the v4-new analyses in §IV-D through §IV-J) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C, §IV-I (inter-CPA negative-anchor FAR), and §IV-L (feature-backbone ablation) report inherited corpus-wide v3.x material that v4.0 does not re-scope to Big-4. §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ (Scripts 36, 38), totalling 150,442 Big-4 signatures with both descriptors available (Script 39 reports the explicit per-signature $n$ used in the signature-level K=3 fit). Restricting the v4-new analyses to Big-4 is a methodological choice driven by four considerations:

 1. **Within-pool homogeneity for mixture characterisation.** Pooling Big-4 with mid- and small-firm CPAs introduces a heterogeneous tail of $\sim$249 CPAs distributed across multiple firms with idiosyncratic signing practices and small per-firm samples. The full-sample and Big-4-only calibrations *differ* in their fitted marginal crossings (full-sample published $\overline{\text{cos}}^* = 0.945$, $\overline{\text{dHash}}^* = 8.10$ from v3.x; Big-4-only $\overline{\text{cos}}^* = 0.975$, $\overline{\text{dHash}}^* = 3.76$ from Script 34; bootstrap 95% CIs $[0.974, 0.977]$ / $[3.48, 3.97]$, $n_{\text{boot}} = 500$); the offset is large compared to the Big-4 bootstrap CI half-width of $0.0015$. We report this as a *scope-dependent shift* rather than asserting a causal "mid/small-firm tail distorts" claim.

@@ -106,7 +106,7 @@ We read this as the strongest internal-consistency signal in v4.0: three differe
 | Paper A binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ |
 | Per-CPA K=3 vs per-signature K=3 | $0.870$ |

-The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the inherited five-way rule. We note that this comparison validates only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly validate the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, which retains its v3.x interpretation and capture-rate evaluation (§IV-F).
+The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the inherited five-way rule. We note that this comparison validates only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly validate the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, which retains its v3.20.0 calibration and capture-rate evaluation (v3.20.0 Tables IX, XI, XII, XII-B; documented as inherited in §IV-J).

 **3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference:

@@ -128,7 +128,7 @@ All three candidate scores correctly assign every byte-identical signature to th

 ## L. Signature- and Document-Level Classification

-The v4.0 operational classifier is the inherited v3.x five-way per-signature box rule, retained unchanged for two reasons: (a) it preserves continuity with the v3.x literature and its established interpretation, and (b) the convergent internal-consistency checks of §III-K show that the box rule's per-CPA-aggregated outputs agree at $\rho \geq 0.96$ with a mixture-derived score and at $\rho \geq 0.89$ with a reverse-anchor score, supporting continued use without recalibration.
+The v4.0 operational classifier is the inherited v3.x five-way per-signature box rule, retained unchanged for two reasons: (a) it preserves continuity with the v3.x literature and its established interpretation; (b) the convergent internal-consistency checks of §III-K show that the box rule's *binary high-confidence* output (cos $> 0.95$ AND dHash $\leq 5$) agrees at $\rho \geq 0.96$ per-CPA with a K=3-posterior score and at $\rho \geq 0.89$ with a reverse-anchor score. The §III-K checks cover only the binary high-confidence rule; the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation are not separately validated by Scripts 38–42. We retain those rule components by reference to v3.20.0's calibration (v3.20.0 §III-K and Tables IX, XI, XII, XII-B); we do not claim that v4.0's convergent-checks evidence supports the inherited rule as a whole, only its binary high-confidence sub-rule.

 **Per-signature five-way classifier.** Operational thresholds are anchored on whole-sample Firm A percentile heuristics as in v3.x: cos $> 0.95$ (Firm A P7.5) for the cosine dimension and dHash$_{\text{indep}} \leq 5$ / $> 15$ for the structural dimension. All dHash references refer to the *independent-minimum* dHash defined in §III-G. We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:

@@ -1,26 +1,28 @@
-# Section IV. Results — v4.0 Draft v2
+# Section IV. Results — v4.0 Draft v3 (post codex rounds 21–23)

-> **Draft note (2026-05-12, v2).** This file replaces the §IV-A through §IV-H block of `paper/paper_a_results_v3.md` (v3.20.0) with the Big-4 reframed structure; Section IV expands from 8 sub-sections in v3.20.0 to 12 sub-sections in v4.0 (A through L) to mirror the §III-G..L lineage. **v2** fills Table XV (and adds Table XV-B for document-level counts) using Script 42's per-signature five-way categorisation on the Big-4 sub-corpus, closing the only TBD that v1 carried. Tables IV–XVIII numbering remains provisional and is finalised in Phase 3 close-out. Empirical anchors trace to Scripts 32–42 on branch `paper-a-v4-big4`; the §III provenance table covers the methodology-side citations and §IV adds new tables for the v4.0-specific results.
+> **Draft note (2026-05-12, v3; internal — remove before submission).** This file replaces the §IV-A through §IV-H block of `paper/paper_a_results_v3.md` (v3.20.0) with the Big-4 reframed structure. Section IV expands from 8 sub-sections in v3.20.0 to 12 sub-sections in v4.0 (A through L) to mirror the §III-G..L lineage. **v3** incorporates codex gpt-5.5 round-23 review (`paper/codex_review_gpt55_v4_round3.md`, Major Revision); the fixes are presentation-level rather than methodology-level. **Table-numbering scheme** (resolved in v3): the v4 manuscript uses fresh Table numbering V through XVIII for the new v4 Big-4 results; inherited v3.x tables are cited only as "v3.20.0 Table N" with the original v3 number and are *not* renumbered into the v4 sequence. **Anonymisation** (resolved in v3): the Big-4 firms remain pseudonymously labelled Firm A through Firm D throughout the manuscript body; real names are not printed in v4 tables or prose (a single mapping line, retained in v3.20.0's §III-L data-source paragraph, discloses the residual identifiability through contextual descriptors as per IEEE Access norms). Tables IV–XVIII numbering remains provisional and will be finalised at Phase 3 close-out after §III ↔ §IV cross-references are traced end-to-end. Empirical anchors trace to Scripts 32–42 on branch `paper-a-v4-big4`; the §III provenance table covers the methodology-side citations and §IV adds new tables for the v4.0-specific results.

 ## A. Experimental Setup

-The signature-detection and feature-extraction pipeline (§III-A through §III-F) was executed on the full TWSE MOPS audit-report corpus (90,282 PDFs spanning 2013–2023; §III-B). Detection and embedding ran on RTX 4090 (CUDA, deterministic forward inference, fixed seed); the v4.0 statistical analyses ran on Apple Silicon (MPS / CPU). Random seeds are fixed (`SEED = 42`) across all v4.0 spike scripts (32–41) for reproducibility. The signature_analysis SQLite snapshot at `/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db` is treated as frozen; no v4.0 result re-ingests source PDFs.
+The signature-detection and feature-extraction pipeline (§III-A through §III-F) was executed on the full TWSE MOPS audit-report corpus (90,282 PDFs spanning 2013–2023; §III-B). Detection and embedding ran on RTX 4090 (CUDA, deterministic forward inference, fixed seed); the v4.0 statistical analyses ran on Apple Silicon (MPS / CPU). Random seeds are fixed (`SEED = 42`) across the v4.0 spike scripts 32–42 for reproducibility. The signature_analysis SQLite snapshot at `/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db` is treated as frozen; no v4.0 result re-ingests source PDFs.

 The v4.0 primary analyses (§IV-D through §IV-J) are scoped to the Big-4 sub-corpus (Firms A–D, $n = 437$ CPAs with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. The §IV-K Full-Dataset Robustness section reports the full-dataset (686 CPAs) variant of the K=3 mixture + Paper A box-rule Spearman analysis as a cross-scope robustness check.

 ## B. Signature Detection Performance

-The detection metrics inherited unchanged from v3.20.0 §IV-B: 182,328 detected signatures across 86,072 prefiltered audit-report PDFs at the YOLOv11n + Qwen2.5-VL prefilter stage. Per-firm counts of detected signatures are reported in v3.20.0 Table IV (retained as Table IV here, unchanged numbers). The Big-4 subset of the detection output yields 150,442 signatures with both descriptors successfully computed.
+The detection metrics are inherited unchanged from v3.20.0 §IV-B. v3.20.0 reports: VLM screening identified 86,072 documents with signature pages; 12 corrupted PDFs were excluded; YOLOv11n batch inference processed the remaining 86,071 documents; 85,042 of these yielded at least one signature detection; the total extracted-signature count is 182,328 (v3.20.0 Table III). Per-firm counts of detected signatures are reported in v3.20.0 Table IV. v4.0 does not renumber the v3.x detection tables into the v4 sequence; v3.20.0 Tables III and IV are cited by their original numbers.
+
+The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in all §IV v4 primary analyses (§IV-D through §IV-J).

 ## C. All-Pairs Intra-vs-Inter Class Distribution Analysis

-The all-pairs intra-vs-inter class distribution analysis (KDE crossover at $\overline{\text{cos}} = 0.837$; v3.20.0 §IV-C, Table V) is inherited unchanged. This analysis was computed on the full corpus (not Big-4-restricted) and remains the source of the Uncertain / Likely-hand-signed boundary used by the §III-L five-way per-signature classifier (cosine $< 0.837 \Rightarrow$ Likely-hand-signed). v4.0 makes no scope-specific re-derivation of this boundary; the all-pairs cross-class crossover is a corpus-wide reference and is not restated as a v4.0 finding.
+The all-pairs intra-vs-inter class distribution analysis (KDE crossover at $\overline{\text{cos}} = 0.837$; v3.20.0 §IV-C, v3.20.0 Table V) is inherited unchanged. This analysis was computed on the full corpus (not Big-4-restricted) and remains the source of the Uncertain / Likely-hand-signed boundary used by the §III-L five-way per-signature classifier (cosine $\leq 0.837 \Rightarrow$ Likely-hand-signed, matching Script 42's `cos <= 0.837` rule definition). v4.0 makes no scope-specific re-derivation of this boundary; the all-pairs cross-class crossover is a corpus-wide reference and is not restated as a v4.0 finding. v3.20.0 Table V is cited by its original number and is not renumbered into the v4 sequence.

 ## D. Big-4 Accountant-Level Distributional Characterisation

 This section reports the empirical evidence for §III-I's three-diagnostic distributional characterisation at the Big-4 accountant level. All numbers below are direct re-statements from Scripts 32 / 34; cross-citations to the v3.x (signature-level) analysis are noted where the v4.0 result differs structurally from the v3.x result.

-**Table V (revised: Big-4 dip-test).** Hartigan dip-test results, accountant-level marginals.
+**Table V.** Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32).

 | Population | $n$ CPAs | $p_{\text{cos}}$ | $p_{\text{dHash}}$ | Interpretation |
 |---|---|---|---|---|
@@ -31,7 +33,7 @@ This section reports the empirical evidence for §III-I's three-diagnostic distr

 Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by the bootstrap resolution $1 / 2000 = 5 \times 10^{-4}$ (Script 34 reports this as $p = 0.0000$; we report $p < 5 \times 10^{-4}$ to reflect the resolution). Single-firm dip statistics for Firms B, C, and D were not separately computed.

-**Table VI (revised: BD/McCrary diagnostic, Big-4 marginals).** Burgstahler-Dichev / McCrary local-discontinuity test on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; $\alpha = 0.05$, two-sided).
+**Table VI.** Burgstahler-Dichev / McCrary density-smoothness diagnostic on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; $\alpha = 0.05$, two-sided).

 | Population | Cosine: significant transition? | dHash: significant transition? |
 |---|---|---|
@@ -40,13 +42,13 @@ Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no boot
 | Firms B + C + D pooled | none | one transition at $\overline{\text{dHash}} = 10.8$ |
 | All non-Firm-A pooled | none | one transition at $\overline{\text{dHash}} = 6.6$ |

-The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 / post-2020 stratified variants exhibit dHash transitions at varying locations consistent with histogram-resolution artefacts rather than population-structural boundaries). The diagnostic is reported as a non-parametric robustness check; we do not use the off-Big-4 dHash transitions as operational thresholds.
+The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). These off-Big-4 dHash transitions are scope-dependent and are not used as v4.0 operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes.

 ## E. Big-4 K=2 / K=3 Mixture Fits

 This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings.

-**Table VII (revised: Big-4 K=2 components and bootstrap CIs).**
+**Table VII.** Big-4 K=2 mixture components and marginal-crossing bootstrap 95% confidence intervals.

 | K=2 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight |
 |---|---|---|---|
@@ -62,7 +64,7 @@ Marginal crossings (point + bootstrap 95% CI, $n_{\text{boot}} = 500$):

 $\text{BIC}(K{=}2) = -1108.45$ (Script 34).

-**Table VIII (revised: Big-4 K=3 components).**
+**Table VIII.** Big-4 K=3 mixture components.

 | K=3 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive label |
 |---|---|---|---|---|
@@ -70,13 +72,13 @@ $\text{BIC}(K{=}2) = -1108.45$ (Script 34).
 | C2 | 0.9558 | 6.66 | 0.536 | mixed |
 | C3 | 0.9826 | 2.41 | 0.321 | replicated |

-$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). Component recovery is verified across Scripts 35, 37, and 38 (consistent component centres and weights to four decimal places). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.
+$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.

 ## F. Convergent Internal-Consistency Checks

 This section reports the empirical evidence for §III-K's three-score internal-consistency analysis. We re-emphasise the §III-K caveat: the three scores are deterministic functions of the same per-CPA descriptor pair $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ and are *not statistically independent measurements*. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth.

-**Table IX (revised: per-CPA Spearman among three feature-derived scores, Big-4, $n = 437$).**
+**Table IX.** Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, $n = 437$.

 | Score pair | Spearman $\rho$ | $p$-value |
 |---|---|---|
@@ -86,20 +88,20 @@ This section reports the empirical evidence for §III-K's three-score internal-c

 (Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on $n = 249$ non-Big-4 CPAs; reference centre $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$.

-**Table X (revised: per-firm summary across the three scores, Big-4).**
+**Table X.** Per-firm summary across the three feature-derived scores, Big-4.

 | Firm | $n$ CPAs | mean $P(\text{C1})$ | mean reverse-anchor score | mean Paper A hand-leaning rate |
 |---|---|---|---|---|
-| Firm A (Deloitte) | 171 | 0.0072 | $-0.9726$ | 0.1935 |
-| Firm B (KPMG) | 112 | 0.1410 | $-0.8201$ | 0.6962 |
-| Firm C (PwC) | 102 | 0.3110 | $-0.7672$ | 0.7896 |
-| Firm D (EY) | 52 | 0.2406 | $-0.7125$ | 0.7608 |
+| Firm A | 171 | 0.0072 | $-0.9726$ | 0.1935 |
+| Firm B | 112 | 0.1410 | $-0.8201$ | 0.6962 |
+| Firm C | 102 | 0.3110 | $-0.7672$ | 0.7896 |
+| Firm D | 52 | 0.2406 | $-0.7125$ | 0.7608 |

 (Source: Script 38 per-firm summary; reverse-anchor score is sign-flipped so that *higher* values indicate deeper into the reference left tail = more hand-leaning relative to the non-Big-4 reference.)

 The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A firms as more hand-leaning. The K=3 posterior P(C1) and the box-rule hand-leaning rate (Score 1 and Score 3) place Firm C at the most-hand-leaning end of Big-4; the reverse-anchor cosine percentile (Score 2) ranks Firm D fractionally above Firm C. This residual within-Big-4-non-A disagreement is a design feature of the reverse-anchor metric: Score 2 measures only the marginal cosine percentile under the non-Big-4 reference, so a firm with a slightly higher cosine but a markedly different dHash distribution (Firm D vs Firm C) can score higher on Score 2 while scoring lower on Scores 1 and 3, both of which use both descriptors.

-**Table XI (revised: per-signature Cohen $\kappa$ binary collapse, $n = 150{,}442$ Big-4 signatures).**
+**Table XI.** Per-signature Cohen $\kappa$ (binary collapse, replicated vs not-replicated), $n = 150{,}442$ Big-4 signatures.

 | Pair | Cohen $\kappa$ |
 |---|---|
@@ -107,24 +109,24 @@ The three scores agree on placing Firm A as the most replication-dominated and t
 | Paper A binary high-confidence box rule vs per-signature K=3 hard label | 0.559 |
 | Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 |

-(Source: Script 39; verdict label `SIG_CONVERGENCE_MODERATE`.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. This convergent-checks evidence covers only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evaluation (§IV-J).
+(Source: Script 39; verdict label `SIG_CONVERGENCE_MODERATE`.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evaluation (§IV-J).

 ## G. Leave-One-Firm-Out Reproducibility

 This section reports the firm-level cross-validation evidence motivating §III-J's "K=3 descriptive, not operational" framing.

-**Table XII (revised: K=2 LOOO across the four Big-4 folds).**
+**Table XII.** K=2 leave-one-firm-out across the four Big-4 folds.

 | Held-out firm | $n_{\text{train}}$ | $n_{\text{held}}$ | Fold rule (cos cut, dHash cut) | Held-out classified as templated by fold rule |
 |---|---|---|---|---|
-| Firm A (Deloitte) | 266 | 171 | cos $> 0.9380$ AND dHash $\leq 8.79$ | $171 / 171 = 100.00\%$ ($95\%$ Wilson $[97.80\%, 100.00\%]$) |
-| Firm B (KPMG) | 325 | 112 | cos $> 0.9744$ AND dHash $\leq 3.98$ | $0 / 112 = 0\%$ ($95\%$ Wilson $[0\%, 3.32\%]$) |
-| Firm C (PwC) | 335 | 102 | cos $> 0.9752$ AND dHash $\leq 3.75$ | $0 / 102 = 0\%$ ($95\%$ Wilson $[0\%, 3.63\%]$) |
-| Firm D (EY) | 385 | 52 | cos $> 0.9756$ AND dHash $\leq 3.74$ | $0 / 52 = 0\%$ ($95\%$ Wilson $[0\%, 6.88\%]$) |
+| Firm A | 266 | 171 | cos $> 0.9380$ AND dHash $\leq 8.79$ | $171 / 171 = 100.00\%$ ($95\%$ Wilson $[97.80\%, 100.00\%]$) |
+| Firm B | 325 | 112 | cos $> 0.9744$ AND dHash $\leq 3.98$ | $0 / 112 = 0\%$ ($95\%$ Wilson $[0\%, 3.32\%]$) |
+| Firm C | 335 | 102 | cos $> 0.9752$ AND dHash $\leq 3.75$ | $0 / 102 = 0\%$ ($95\%$ Wilson $[0\%, 3.63\%]$) |
+| Firm D | 385 | 52 | cos $> 0.9756$ AND dHash $\leq 3.74$ | $0 / 52 = 0\%$ ($95\%$ Wilson $[0\%, 6.88\%]$) |

 (Source: Script 36.) Across-fold cosine crossing: pairwise range $[0.9380, 0.9756]$, range = $0.0376$; max absolute deviation from the across-fold mean is $0.028$. This exceeds the report's $0.005$ across-fold stability tolerance by $5.6\times$ and is much larger than the full-Big-4 bootstrap CI half-width of $0.0015$. Together with the all-or-nothing held-out classification pattern (Firm A held out $\Rightarrow$ all held-out CPAs templated; any non-Firm-A firm held out $\Rightarrow$ none templated), this indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.

-**Table XIII (revised: K=3 LOOO C1 component shape and held-out membership).**
+**Table XIII.** K=3 leave-one-firm-out: C1 component shape and held-out membership.

 | Held-out firm | C1 cos (fit) | C1 dHash (fit) | C1 weight (fit) | Held-out C1 hard-label rate | Full-Big-4 baseline C1% | Absolute difference |
 |---|---|---|---|---|---|---|
@@ -134,13 +136,13 @@ This section reports the firm-level cross-validation evidence motivating §III-J
 | Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp |
 | Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp |

-(Source: Script 37; verdict label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.025$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).
+(Source: Script 37; verdict label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).

 ## H. Pixel-Identity Positive-Anchor Miss Rate

 This section reports the only hard-ground-truth subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4).

-**Table XIV (revised: positive-anchor miss rate, $n = 262$ Big-4 byte-identical signatures).**
+**Table XIV.** Positive-anchor miss rate, $n = 262$ Big-4 byte-identical signatures.

 | Classifier | Misclassified as hand-leaning | Miss rate | Wilson 95% CI |
 |---|---|---|---|
@@ -160,7 +162,7 @@ The signature-level inter-CPA negative-anchor FAR analysis (~50,000 random pairs

 This section reports the §III-L five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. The five-way category definitions are inherited unchanged from v3.20.0 §III-K (now §III-L); see §III-L for the cosine and dHash cuts.

-**Table XV (revised: five-way per-signature category counts, Big-4 only, $n = 150{,}442$ classified).**
+**Table XV.** Five-way per-signature category counts, Big-4 sub-corpus, $n = 150{,}442$ classified.

 | Category | Long name | $n$ signatures | % of classified |
 |---|---|---|---|
@@ -176,16 +178,16 @@ This section reports the §III-L five-way per-signature + document-level worst-c

 | Firm | HC | MC | HSC | UN | LH | total signatures |
 |---|---|---|---|---|---|---|
-| Firm A (Deloitte) | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 |
-| Firm B (KPMG) | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 |
-| Firm C (PwC) | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
-| Firm D (EY) | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |
+| Firm A | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 |
+| Firm B | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 |
+| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
+| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |

-(Source: Script 42 per-firm cross-tab.) The per-firm pattern aligns with the K=3 cluster cross-tab of Table XVI: Firm A is concentrated in the HC band (81.70% of its signatures), consistent with its 82.46% C3-replicated concentration at the accountant level; the three non-Firm-A Big-4 firms have markedly lower HC rates and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%) — consistent with §III-K Score 2 (reverse-anchor cosine percentile) ranking Firm D fractionally above Firm C in the hand-leaning direction.
+(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVI: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3-replicated component (82.46%; Table XVI). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).

 **Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.20.0 worst-case rule (HC > MC > HSC > UN > LH; §III-L). v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset).

-**Table XV-B (NEW: document-level worst-case category counts, Big-4 only, $n = 75{,}233$ unique PDFs).**
+**Table XV-B.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs.

 | Category | Long name | $n$ documents | % |
 |---|---|---|---|
@@ -201,23 +203,23 @@ This section reports the §III-L five-way per-signature + document-level worst-c

 | Firm | HC | MC | HSC | UN | LH | total docs |
 |---|---|---|---|---|---|---|
-| Firm A (Deloitte) | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 |
-| Firm B (KPMG) | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 |
-| Firm C (PwC) | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 |
-| Firm D (EY) | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 |
+| Firm A | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 |
+| Firm B | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 |
+| Firm C | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 |
+| Firm D | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 |

 (Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)

-The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is **not separately validated by Scripts 38–40**, which evaluated only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we note this inheritance status explicitly so the reader can locate the v3.x Tables IX / XI / XII calibration evidence (carried into v4.0 by reference) without expecting v4.0-spike-script confirmation of the moderate-band specifics. The Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) report the inherited rule's output on the Big-4 subset; the relative ordering of the non-Firm-A firms on MC is consistent with the §III-K Spearman convergence on the per-CPA hand-leaning ranking.
+The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is **not separately validated by Scripts 38–40**, which evaluated only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The v3.20.0 capture-rate calibration evidence for the moderate band (v3.20.0 Tables IX, XI, XII, XII-B) is carried into v4.0 by reference and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA hand-leaning ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as more hand-leaning than Firm B).

-**Table XVI (NEW: firm × K=3 cluster cross-tabulation, Big-4 only).**
+**Table XVI.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.

 | Firm | $n$ | C1 (hand-leaning) | C2 (mixed) | C3 (replicated) | C1 % | C3 % |
 |---|---|---|---|---|---|---|
-| Firm A (Deloitte) | 171 | 0 | 30 | 141 | $0.00\%$ | $82.46\%$ |
-| Firm B (KPMG) | 112 | 10 | 102 | 0 | $8.93\%$ | $0.00\%$ |
-| Firm C (PwC) | 102 | 24 | 77 | 1 | $23.53\%$ | $0.98\%$ |
-| Firm D (EY) | 52 | 6 | 45 | 1 | $11.54\%$ | $1.92\%$ |
+| Firm A | 171 | 0 | 30 | 141 | $0.00\%$ | $82.46\%$ |
+| Firm B | 112 | 10 | 102 | 0 | $8.93\%$ | $0.00\%$ |
+| Firm C | 102 | 24 | 77 | 1 | $23.53\%$ | $0.98\%$ |
+| Firm D | 52 | 6 | 45 | 1 | $11.54\%$ | $1.92\%$ |

 (Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 replicated component (no Firm A CPAs in C1); Firm C has the highest hand-leaning concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).

@@ -225,9 +227,9 @@ The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 <

 ## K. Full-Dataset Robustness (light scope)

-This section reports the v4.0 reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). Per the v4.0 author choice (codex round-22 open question 1, Light scope), we re-run only the K=3 mixture + Paper A operational-rule per-CPA hand-leaning rate analysis; the §III-L five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band is documented as inherited from v3.x calibration in §IV-J.
+This section reports the v4.0 reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + Paper A operational-rule per-CPA hand-leaning rate analysis, sufficient to demonstrate that the v4.0 K=3 + Paper A convergence reproduces at the wider scope. The §III-L five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band is documented as inherited from v3.x calibration in §IV-J.

-**Table XVII (NEW: K=3 component comparison, Big-4 vs full dataset).**
+**Table XVII.** K=3 component comparison, Big-4 sub-corpus vs full dataset.

 | K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full |
 |---|---|---|---|
@@ -237,7 +239,7 @@ This section reports the v4.0 reproducibility cross-check at the full accountant

 (Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.)

-**Table XVIII (NEW: Spearman correlation between K=3 P(C1) and Paper A operational hand-leaning rate, Big-4 vs full dataset).**
+**Table XVIII.** Spearman rank correlation between K=3 P(C1) and Paper A operational hand-leaning rate, Big-4 sub-corpus vs full dataset.

 | Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs Paper A hand-leaning rate) | $p$-value |
 |---|---|---|---|
@@ -249,9 +251,9 @@ This section reports the v4.0 reproducibility cross-check at the full accountant

 **Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the Paper A box-rule hand-leaning rate are preserved at the full scope. Component centres shift modestly: C3 (replicated) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more non-templated CPAs (mid/small firms); C1 (hand-leaning) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population includes mid/small-firm hand-leaning CPAs that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + Paper A convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the v4.0 primary methodology is restricted to Big-4 by design (§III-G item 4).

-## L. Feature Backbone Ablation (inherited from v3.x §IV-H.3)
+## L. Feature Backbone Ablation (inherited from v3.20.0 §IV-I)

-The feature-backbone ablation (Table XVIII in v3.20.0; backbone replacement of ResNet-50 with alternative ImageNet-pretrained backbones to verify the §III-E embedding choice is not load-bearing) is inherited unchanged. v4.0 makes no scope-specific re-derivation; the ablation is a methodological-stability check on the embedding stage and is corpus-wide rather than Big-4-restricted.
+The feature-backbone ablation (v3.20.0 Table XVIII; backbone replacement of ResNet-50 with alternative ImageNet-pretrained backbones to verify that the §III-E embedding choice is not load-bearing) is inherited unchanged. v3.20.0 Table XVIII is cited by its original v3 number and is **not** the same table as the v4 Table XVIII (which reports the Big-4 vs full-dataset Spearman drift in §IV-K). v4.0 makes no scope-specific re-derivation of the ablation; the analysis is a methodological-stability check on the embedding stage and is corpus-wide rather than Big-4-restricted.

 ---