From ce3315623822c8cda1be5f19c035bd1715ae9a39 Mon Sep 17 00:00:00 2001 From: gbanyan Date: Tue, 12 May 2026 17:03:33 +0800 Subject: [PATCH] =?UTF-8?q?Apply=20codex=20round-23=20corrections:=20?= =?UTF-8?q?=C2=A7IV=20v3=20+=20=C2=A7III=20v4?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex round 23 returned Major Revision on §IV v2: 6 Major + 6 Minor + 5 Editorial findings. Codex confirmed the spike-script provenance is mostly sound -- no scripts needed rerunning -- so v3 applies presentation-level fixes only. Decisions baked in: - Anonymisation: maintain Firm A-D pseudonyms throughout the manuscript body; remove (Deloitte) / (KPMG) / (PwC) / (EY) parentheticals from all v4 §IV tables. - Table numbering: v4 tables use fresh V-XVIII (plus Table XV-B); inherited v3.x tables are cited only as "v3.20.0 Table N" with the original v3 number, NOT renumbered into the v4 sequence. §IV v3 changes: 1. Detection denominator rewritten: 86,072 VLM-positive / 12 corrupted / 86,071 YOLO-processed / 85,042 with-detections / 182,328 signatures (matches v3.x §IV-B exact wording). 2. All v4 table labels stripped of "(revised:" / "(NEW:" prefixes; replaced with clean "Table N. ." form. 3. Real firm names removed from all tables: 4 replace_all edits. 4. Line 211 MC-ordering claim removed: MC occupancy is no longer described as "consistent with the §III-K Spearman convergence" because MC fraction is not monotone in per-CPA hand-leaning ranking. New language: descriptive only, with Firm D / Firm B ordering counterexample stated. 5. Line 184 81.70% vs 82.46% qualified as "qualitative alignment, not like-for-like consistency check" (different units: per-signature class vs per-CPA hard cluster). 6. Line 43 BD-transition "histogram-resolution artefacts" softened to "scope-dependent and not used operationally"; no specific bin-width artefact claim without sensitivity sweep evidence. 7. 
K=3 LOOO C1 weight drift corrected: 0.025 -> 0.023 (matches Script 37 max deviation 0.0235 / rounded 0.023). 8. Seed coverage in §IV-A updated: "Scripts 32-42" (was "Scripts 32-41", missed Script 42). 9. Low-cosine cutoff inclusivity: cos < 0.837 -> cos <= 0.837 (matches Script 42 rule definition). 10. "round-22 Light scope" process note removed from manuscript prose in §IV-K. 11. §IV-L ablation pointer corrected: v3.20.0 §IV-I (was §IV-H.3); v3.20.0 Table XVIII clarified as different from v4 Table XVIII. 12. Line 75 "Component recovery verified across Scripts 35, 37, 38" rewritten: "the full-fit baseline is reproduced in Scripts 35, 37, 38" with explicit note that Script 37 LOOO fold-specific components differ by design. 13. Line 110 grammar: "This convergent-checks evidence" -> "These convergence checks". 14. Draft note marked "internal -- remove before submission". §III v4 changes (cross-reference cleanup): 1. Line 13 cross-reference repaired: "§IV-D, §IV-F, §IV-G" (which are now accountant-level v4 analyses) replaced with accurate signature-level references (§IV-J for five-way counts; §IV-I for inherited inter-CPA FAR). 2. Line 23 cross-reference repaired: "all §IV results except §IV-K" replaced with explicit list of v4-new vs inherited sub-sections. 3. Line 109 cross-reference repaired: moderate-band capture-rate evidence cited as "v3.20.0 Tables IX, XI, XII, XII-B" (was "§IV-F", which is now Convergent Internal-Consistency Checks, not capture-rate). 4. Line 131 "without recalibration" claim narrowed: §III-K's convergent-checks evidence is now scoped to the binary high-confidence rule only; the moderate-confidence band, style-consistency band, and document-level aggregation are retained by reference to v3.20.0 calibration, not claimed as v4.0-validated. Outstanding open questions: 3 procedural items remain (§IV table numbering finalisation, §IV-A-C content audit, Phase 4 prose); no methodology blockers. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- paper/codex_review_gpt55_v4_round3.md | 143 ++++++++++++++++++ .../v4/paper_a_methodology_v4_section_iii.md | 10 +- paper/v4/paper_a_results_v4_section_iv.md | 100 ++++++------ 3 files changed, 199 insertions(+), 54 deletions(-) create mode 100644 paper/codex_review_gpt55_v4_round3.md diff --git a/paper/codex_review_gpt55_v4_round3.md b/paper/codex_review_gpt55_v4_round3.md new file mode 100644 index 0000000..e363a06 --- /dev/null +++ b/paper/codex_review_gpt55_v4_round3.md @@ -0,0 +1,143 @@ +# Paper A Round 23 Review - v4 round 3 + +Reviewer: gpt-5.5 xhigh +Date: 2026-05-12 +Target: `paper/v4/paper_a_results_v4_section_iv.md` (§IV v2) +Cross-checked against: `paper/v4/paper_a_methodology_v4_section_iii.md` (§III v3), round-21/22 reviews, `paper/paper_a_results_v3.md`, and the supplied spike reports. + +## Verdict + +Major Revision. + +The empirical core of §IV v2 is much stronger than the earlier methodology drafts: most new Big-4 numerical tables match the spike reports. The blockers are presentation and provenance risks that reviewers will catch quickly: table numbering is not coherent, several §III cross-references now point to the wrong §IV material, the inherited detection count is misstated, and the draft says firm anonymisation is maintained while repeatedly printing real firm names. + +## Major findings + +1. **Table numbering is not coherent enough for partner review.** + + §IV v2 says provisional numbering covers Tables IV-XVIII (line 3), and line 13 says v3.20.0 Table IV is "retained as Table IV here." But the file does not actually include a Table IV block; the first displayed v4 table is Table V at line 23. Line 17 also cites the inherited all-pairs analysis as "v3.20.0 §IV-C, Table V," while line 23 reuses Table V for the new Big-4 dip test. That is acceptable only if the inherited table is explicitly not a v4 table; otherwise Table V is duplicated. 
+ + The same issue recurs at the end: line 240 assigns current Table XVIII to the full-dataset Spearman robustness table, while line 254 says the inherited backbone ablation is "Table XVIII in v3.20.0." If the ablation is retained in the v4 manuscript, it cannot also be current Table XVIII. Fix by deciding which inherited v3 tables are reprinted/renumbered versus cited only as v3.x provenance. + +2. **§III v3 contains stale cross-references that §IV v2 does not support as written.** + + §III line 13 says signature-level capture-rate analyses are in §IV-D, §IV-F, and §IV-G. In §IV v2, those are accountant-level distributional characterisation, internal-consistency checks, and LOOO reproducibility. This is a direct cross-reference failure. + + §III line 23 says "all §IV results except §IV-K" are Big-4 restricted. §IV v2 itself is narrower and more accurate at line 9: §IV-D through §IV-J are Big-4 primary, while §IV-K is full-dataset robustness. But §IV-A-C are inherited full-corpus setup/detection/all-pairs material, §IV-I is inherited full-corpus inter-CPA FAR, and §IV-L is an inherited corpus-wide ablation. §III must be changed to match the actual results section. + + §III line 109 says the moderate-confidence band retains v3.x capture-rate evaluation in "§IV-F"; in current §IV, §IV-F is not the inherited v3 capture-rate section. It should cite v3.x Tables IX/XI/XII/XII-B or current §IV-J's inheritance note, not current §IV-F. + +3. **The inherited detection-count sentence is numerically wrong / ambiguous.** + + §IV line 13 says "182,328 detected signatures across 86,072 prefiltered audit-report PDFs." The v3 baseline distinguishes these counts: VLM screening identified 86,072 documents with signature pages, 12 corrupted PDFs were excluded, and batch YOLO inference ran on 86,071 documents; v3 Table III then reports 85,042 documents with detections and 182,328 extracted signatures. 
Current line 13 collapses these stages and assigns the 182,328 signatures to the wrong denominator. + + Suggested rewrite: "VLM screening identified 86,072 signature-page documents; after 12 corrupted PDFs were excluded, YOLO batch inference processed 86,071 documents, with 85,042 yielding detections and 182,328 extracted signatures." + +4. **The draft claims firm anonymisation is maintained, but the §IV tables reveal real firm names.** + + §III line 23 says the Big-4 firms are pseudonymously labelled Firm A-D. §IV line 265 says firm anonymisation is "maintained throughout §IV (Firm A-D used consistently)." That is false: real names appear in the displayed result tables at lines 93-96, 120-123, 132-135, 179-182, 204-207, and 217-220. + + Either remove the parenthetical real names everywhere in §IV or explicitly abandon the pseudonym policy in §III and the close-out checklist. Given prior review history, this should be fixed before partner review. + +5. **Some interpretive claims overstate what the spike results prove.** + + The main false one is line 211: it says the non-Firm-A moderate-confidence proportions are "consistent with the §III-K Spearman convergence on the per-CPA hand-leaning ranking." The MC ordering is C (41.44%), B (35.88%), D (29.33%), while Table X's hand-leaning scores rank D above B on all three score summaries and rank D above C on the reverse-anchor score. MC-band occupancy is not a monotone proxy for the per-CPA hand-leaning ranking; D's mass moves heavily into Uncertain instead. + + Line 184 also compares Firm A's signature-level HC rate (81.70%) to its accountant-level C3 rate (82.46%). The numbers are close and the qualitative reading is reasonable, but they are different units. State this as qualitative alignment, not as a like-for-like consistency check. + + Line 43 calls off-Big-4 dHash transitions "consistent with histogram-resolution artefacts." 
Script 32 verifies varying dHash transitions; it does not by itself prove a bin-width artefact analysis for those accountant-level subsets. "Scope-dependent and not used operationally" is safer. + +6. **The moderate-confidence band is honestly disclosed as inherited, but the support language still needs narrowing.** + + §IV line 211 correctly states that Scripts 38-40 do not separately validate the MC band. That is good. But §III line 131 still says the binary-rule internal-consistency checks support continued use of the inherited five-way rule "without recalibration." That is stronger than the evidence: the v4 kappa/Spearman checks cover the binary high-confidence box rule, not the MC band or document-level worst-case aggregation. The defensible wording is: v4 reports Big-4 outputs for the inherited five-way rule; the MC band remains v3-calibrated and not newly validated in Scripts 38-42. + +## Minor findings + +1. **K=3 LOOO C1 weight drift is rounded away from the report.** §IV line 137 reports max C1 weight deviation as 0.025. Script 37's report says 0.023, and the JSON gives 0.023489. Use 0.023 or 0.0235. + +2. **Seed coverage statement stops at Script 41.** §IV line 7 says seeds are fixed across Scripts 32-41, but v2 depends on Script 42 for Tables XV and XV-B. Either include Script 42 if true or say "stochastic v4 spike scripts" rather than implying a complete script range. + +3. **Inclusivity of the low-cosine cutoff should match Script 42.** §IV line 17 says cosine `< 0.837` implies Likely-hand-signed; Script 42 defines LH as `cos <= 0.837`. Align §III-L and §IV-C/J exactly. + +4. **The "round-22 open question 1, Light scope" process note is not traceable to the round-22 review file.** §IV line 228 may reflect an author decision outside the supplied review, but it should be removed from manuscript prose or backed by an internal note. + +5. 
**The ablation section pointer is wrong.** §IV line 252 says the inherited feature-backbone ablation is from v3.x §IV-H.3, but in `paper/paper_a_results_v3.md` it is §IV-I, beginning at line 461. + +6. **Line 73's "component recovery ... across Scripts 35, 37, and 38" can be misread.** Script 37's full-baseline block replicates Script 35, but the LOOO fold components vary by design. Say "the full-fit baseline is reproduced in Scripts 35, 37, and 38" if that is the intended claim. + +## Editorial nits + +1. Remove the draft note and Phase 3 close-out checklist before submission, or move them to an internal author note. + +2. Line 110: "This convergent-checks evidence" should be "These convergence checks" or "This convergence evidence." + +3. Line 3: "is finalised" should be "will be finalised" while numbering remains provisional. + +4. Standardise "dHash" versus "dh" in tables and prose; the spike reports use `dh`, but the paper body mostly uses dHash. + +5. Avoid mixing "replicated," "templated," and "non-hand-signed" as if they are exact synonyms. The paper's caveats rely on preserving those distinctions. + +## Provenance verification table + +| §IV v2 claim | §IV lines | Source checked | Status | +|---|---:|---|---| +| Big-4 primary scope: 437 CPAs and 150,442 signatures with both descriptors. | 9 | Script 36 report lines 6, 32-37; Script 39 report line 12. | Confirmed. | +| Detection inheritance: 182,328 signatures across 86,072 PDFs. | 13 | v3 results lines 14, 17-25; v3 methodology search hits distinguish 86,072 VLM-positive, 86,071 processed, 85,042 with detections. | Needs correction; denominator conflated. | +| All-pairs KDE crossover at 0.837. | 17 | v3 results lines 49 and 118; Script 42 rule lines 6-10 uses 0.837. | Confirmed; fix `<` vs `<=` wording. | +| Big-4 dip-test p-values reported as `< 5 x 10^-4`. | 27, 32 | Script 36 report lines 6-8; Script 34 report lines 28-31; bootstrap resolution stated in §IV line 32. 
| Confirmed with reporting convention. | +| Firm A / Big4-non-A / all-non-A dip p-values: 0.992/0.924, 0.998/0.906, 0.998/0.907. | 28-30 | Script 32 report lines 30, 40, 62, 72, 94, 104. | Confirmed after rounding. | +| BD/McCrary Big-4 null and non-A dHash transitions at 10.8 and 6.6. | 38-41 | Script 34 report lines 28-31; Script 32 report lines 40-41 and 72-73. | Confirmed; artefact interpretation not directly proven. | +| K=2 components, crossings, bootstrap CIs, and BIC. | 53-63 | Script 34 report lines 23-41; Script 36 report lines 12-28. | Confirmed. | +| K=3 component centers/weights and BIC lower by 3.48. | 69-73 | Script 35 report lines 6-10; Script 34 report lines 40-49; Script 36 report lines 9-10. | Confirmed. | +| Spearman correlations 0.9627, 0.8890, 0.8794 and non-Big-4 reference center 0.935/9.77. | 83-87 | Script 38 report lines 16-18 and 24-30. | Confirmed. | +| Per-firm score summaries in Table X. | 93-98 | Script 38 report lines 43-48. | Confirmed; anonymisation violation. | +| Cohen kappas 0.662, 0.559, 0.870 and per-signature K=3 centers. | 106-110 | Script 39 report lines 16-28. | Confirmed after rounding. | +| K=2 LOOO fold rules and all-or-none held-out classifications. | 120-125 | Script 36 report lines 32-44 and JSON stability summary. | Confirmed. | +| K=3 LOOO C1 fold rates and `P2_PARTIAL`. | 131-137 | Script 37 report lines 16-19, 25-90, 92-99; JSON exact drift values. | Confirmed, except weight drift should be 0.023/0.0235 not 0.025. | +| Pixel-identity subset n=262, split 145/8/107/2, 0/262 miss rate, Wilson upper 1.45%. | 147-153 | Script 40 report lines 8, 12-18, 22-27. | Confirmed. | +| Inter-CPA FAR 0.0005 with Wilson [0.0003, 0.0007] inherited from v3. | 157 | v3 results lines 182-190 and 263-275. | Confirmed as inherited, not v4-regenerated. | +| Five-way per-signature counts and 11 excluded signatures. | 167-173 | Script 42 report lines 14-26. | Confirmed. | +| Per-firm five-way percentages. 
| 179-184 | Script 42 report lines 30-44. | Confirmed; line 211 interpretation is not supported. | +| Document-level overall counts, n=75,233, mixed-firm PDFs n=379. | 188-198 | Script 42 report lines 46-57; JSON `document_level`. | Confirmed. | +| Single-firm per-document rows. | 204-209 | Script 42 report lines 59-66. | Confirmed. | +| Full-dataset robustness components, BIC, Spearman rho. | 234-248 | Script 41 report lines 8-31. | Confirmed. | +| Feature-backbone ablation inherited from v3.x Table XVIII. | 252-254 | v3 results lines 461-475. | Inherited content confirmed, but v3 section pointer and current v4 table numbering collide. | + +## Cross-reference checks (§III -> §IV) + +| §III v3 claim | §III lines | §IV v2 support | Status | +|---|---:|---|---| +| Signature-level capture-rate analyses are in §IV-D/F/G. | 13 | Current §IV-D/F/G are accountant-level dip/mixture, internal consistency, and LOOO. | Fails; stale v3 cross-reference. | +| All §IV results except §IV-K are Big-4 restricted. | 23 | §IV-A-C, §IV-I, and §IV-L are inherited full-corpus/corpus-wide material. | Fails; narrow to "primary v4 analyses §IV-D-J except inherited §IV-I." | +| Big-4 scope is 437 CPAs / 150,442 signatures. | 23 | §IV lines 9, 163 and Script 39. | Supported. | +| Dip-test and BD/McCrary distributional characterisation. | 47-53 | §IV Tables V-VI, lines 23-43. | Supported. | +| K=2 and K=3 mixture components and mild BIC preference. | 51, 59-73 | §IV Tables VII-VIII, lines 49-73. | Supported. | +| K=2 unstable and K=3 descriptive only under LOOO. | 71-79, 111-115 | §IV Tables XII-XIII, lines 116-137. | Supported. | +| Three-score internal consistency and per-firm ranking nuance. | 83-100 | §IV Tables IX-X, lines 79-100. | Supported. | +| Per-signature K=3 convergence kappas. | 101-109 | §IV Table XI, lines 102-110. | Supported. | +| Pixel-identity positive-anchor miss rate. | 117-127 | §IV Table XIV, lines 141-153. | Supported. 
| +| Five-way signature/document classifier retained as primary; K=3 not used for operational labels. | 131-149 | §IV-J, lines 159-224. | Mostly supported; the MC band remains inherited and current wording should not imply v4 validation. | +| Moderate-confidence band retains v3.x capture-rate evaluation. | 109, 145, 198 | §IV line 211 cites v3 Tables IX/XI/XII but not XII-B; §III line 109's "§IV-F" is now wrong. | Needs citation cleanup. | +| Firm anonymisation maintained. | 23 and open question 200 | §IV repeatedly includes real firm names in parentheses. | Fails unless policy changes. | + +## Recommended next-step actions + +1. Freeze the v4 table scheme before any prose edits: decide whether inherited v3 tables are reprinted as current v4 tables, cited only as v3 tables, or moved to appendix/supplement. Then renumber Tables IV-XVIII and remove Table XV-B if the journal style cannot handle letter suffixes. + +2. Fix §III cross-references after the table scheme is frozen, especially §III line 13, §III line 23, and §III lines 109/119/145. + +3. Correct §IV line 13's detection denominator and restate the VLM-positive / corrupted-excluded / YOLO-processed / with-detections sequence. + +4. Remove all real firm names from §IV or explicitly change the anonymisation policy. Do not leave line 265 claiming anonymisation while tables reveal names. + +5. Delete or rewrite line 211's MC-ordering claim. If the MC band remains inherited, present the per-firm MC proportions descriptively only. + +6. Narrow the support claim for the five-way rule: Scripts 38-40 validate only the binary high-confidence rule, while Script 42 reports five-way output counts. Either add a Big-4-specific MC/document validation or state plainly that MC/document validation is inherited from v3.x. + +7. Fix small numeric/provenance issues: K=3 weight drift 0.023/0.0235, Script 42 seed wording, cutoff inclusivity, v3 ablation section pointer, and the unsupported "round-22 Light scope" process note. 
+ +## Phase 4 readiness assessment + +Not ready for partner review without Phase 4 revisions. + +The spike-script provenance for the new Big-4 result tables is mostly sound, so I do not see a need to rerun the main v4 empirical scripts solely to fix §IV. But the current section would invite reviewer attacks on table identity, stale cross-references, anonymisation, and overinterpretation of the inherited MC band. After those are corrected, §IV should be close to partner-review ready; the only substantive open decision is whether to add a new Big-4-specific validation for the moderate-confidence/document-level rule or keep it explicitly inherited from v3.x. diff --git a/paper/v4/paper_a_methodology_v4_section_iii.md b/paper/v4/paper_a_methodology_v4_section_iii.md index 6b714a8..104b807 100644 --- a/paper/v4/paper_a_methodology_v4_section_iii.md +++ b/paper/v4/paper_a_methodology_v4_section_iii.md @@ -1,4 +1,4 @@ -# Section III. Methodology — v4.0 Draft v3 (post codex rounds 21 + 22) +# Section III. Methodology — v4.0 Draft v4 (post codex rounds 21 + 22 + 23 cross-reference cleanup) > **Draft note (2026-05-12, v3).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here. > @@ -10,7 +10,7 @@ ## G. Unit of Analysis and Scope -We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of all signature-level capture-rate analyses (§IV-D, §IV-F, §IV-G). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). 
At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. +We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and v3.20.0's inherited inter-CPA FAR analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. 
A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism. @@ -20,7 +20,7 @@ We adopt one stipulation about same-CPA pair detectability: A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication. -**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, all §IV results except §IV-K) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ (Scripts 36, 38), totalling 150,442 Big-4 signatures with both descriptors available (Script 39 reports the explicit per-signature $n$ used in the signature-level K=3 fit). Restricting the analyses to Big-4 is a methodological choice driven by four considerations: +**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, and the v4-new analyses in §IV-D through §IV-J) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C, §IV-I (inter-CPA negative-anchor FAR), and §IV-L (feature-backbone ablation) report inherited corpus-wide v3.x material that v4.0 does not re-scope to Big-4. §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. 
The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ (Scripts 36, 38), totalling 150,442 Big-4 signatures with both descriptors available (Script 39 reports the explicit per-signature $n$ used in the signature-level K=3 fit). Restricting the v4-new analyses to Big-4 is a methodological choice driven by four considerations: 1. **Within-pool homogeneity for mixture characterisation.** Pooling Big-4 with mid- and small-firm CPAs introduces a heterogeneous tail of $\sim$249 CPAs distributed across multiple firms with idiosyncratic signing practices and small per-firm samples. The full-sample and Big-4-only calibrations *differ* in their fitted marginal crossings (full-sample published $\overline{\text{cos}}^* = 0.945$, $\overline{\text{dHash}}^* = 8.10$ from v3.x; Big-4-only $\overline{\text{cos}}^* = 0.975$, $\overline{\text{dHash}}^* = 3.76$ from Script 34; bootstrap 95% CIs $[0.974, 0.977]$ / $[3.48, 3.97]$, $n_{\text{boot}} = 500$); the offset is large compared to the Big-4 bootstrap CI half-width of $0.0015$. We report this as a *scope-dependent shift* rather than asserting a causal "mid/small-firm tail distorts" claim. @@ -106,7 +106,7 @@ We read this as the strongest internal-consistency signal in v4.0: three differe | Paper A binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ | | Per-CPA K=3 vs per-signature K=3 | $0.870$ | -The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the inherited five-way rule. 
We note that this comparison validates only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly validate the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, which retains its v3.x interpretation and capture-rate evaluation (§IV-F). +The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the inherited five-way rule. We note that this comparison validates only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly validate the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, which retains its v3.20.0 calibration and capture-rate evaluation (v3.20.0 Tables IX, XI, XII, XII-B; documented as inherited in §IV-J). **3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference: @@ -128,7 +128,7 @@ All three candidate scores correctly assign every byte-identical signature to th ## L. Signature- and Document-Level Classification -The v4.0 operational classifier is the inherited v3.x five-way per-signature box rule, retained unchanged for two reasons: (a) it preserves continuity with the v3.x literature and its established interpretation, and (b) the convergent internal-consistency checks of §III-K show that the box rule's per-CPA-aggregated outputs agree at $\rho \geq 0.96$ with a mixture-derived score and at $\rho \geq 0.89$ with a reverse-anchor score, supporting continued use without recalibration. 
+The v4.0 operational classifier is the inherited v3.x five-way per-signature box rule, retained unchanged for two reasons: (a) it preserves continuity with the v3.x literature and its established interpretation; (b) the convergent internal-consistency checks of §III-K show that the box rule's *binary high-confidence* output (cos $> 0.95$ AND dHash $\leq 5$) agrees at $\rho \geq 0.96$ per-CPA with a K=3-posterior score and at $\rho \geq 0.89$ with a reverse-anchor score. The §III-K checks cover only the binary high-confidence rule; the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation are not separately validated by Scripts 38–42. We retain those rule components by reference to v3.20.0's calibration (v3.20.0 §III-K and Tables IX, XI, XII, XII-B); we do not claim that v4.0's convergent-checks evidence supports the inherited rule as a whole, only its binary high-confidence sub-rule. **Per-signature five-way classifier.** Operational thresholds are anchored on whole-sample Firm A percentile heuristics as in v3.x: cos $> 0.95$ (Firm A P7.5) for the cosine dimension and dHash$_{\text{indep}} \leq 5$ / $> 15$ for the structural dimension. All dHash references refer to the *independent-minimum* dHash defined in §III-G. We assign each signature to one of five signature-level categories using convergent evidence from both descriptors: diff --git a/paper/v4/paper_a_results_v4_section_iv.md b/paper/v4/paper_a_results_v4_section_iv.md index 4b4df9c..9594068 100644 --- a/paper/v4/paper_a_results_v4_section_iv.md +++ b/paper/v4/paper_a_results_v4_section_iv.md @@ -1,26 +1,28 @@ -# Section IV. Results — v4.0 Draft v2 +# Section IV. 
Results — v4.0 Draft v3 (post codex rounds 21–23) -> **Draft note (2026-05-12, v2).** This file replaces the §IV-A through §IV-H block of `paper/paper_a_results_v3.md` (v3.20.0) with the Big-4 reframed structure; Section IV expands from 8 sub-sections in v3.20.0 to 12 sub-sections in v4.0 (A through L) to mirror the §III-G..L lineage. **v2** fills Table XV (and adds Table XV-B for document-level counts) using Script 42's per-signature five-way categorisation on the Big-4 sub-corpus, closing the only TBD that v1 carried. Tables IV–XVIII numbering remains provisional and is finalised in Phase 3 close-out. Empirical anchors trace to Scripts 32–42 on branch `paper-a-v4-big4`; the §III provenance table covers the methodology-side citations and §IV adds new tables for the v4.0-specific results. +> **Draft note (2026-05-12, v3; internal — remove before submission).** This file replaces the §IV-A through §IV-H block of `paper/paper_a_results_v3.md` (v3.20.0) with the Big-4 reframed structure. Section IV expands from 8 sub-sections in v3.20.0 to 12 sub-sections in v4.0 (A through L) to mirror the §III-G..L lineage. **v3** incorporates codex gpt-5.5 round-23 review (`paper/codex_review_gpt55_v4_round3.md`, Major Revision); the fixes are presentation-level rather than methodology-level. **Table-numbering scheme** (resolved in v3): the v4 manuscript uses fresh Table numbering V through XVIII for the new v4 Big-4 results; inherited v3.x tables are cited only as "v3.20.0 Table N" with the original v3 number and are *not* renumbered into the v4 sequence. **Anonymisation** (resolved in v3): the Big-4 firms remain pseudonymously labelled Firm A through Firm D throughout the manuscript body; real names are not printed in v4 tables or prose (a single mapping line, retained in v3.20.0's §III-L data-source paragraph, discloses the residual identifiability through contextual descriptors as per IEEE Access norms). 
Tables IV–XVIII numbering remains provisional and will be finalised at Phase 3 close-out after §III ↔ §IV cross-references are traced end-to-end. Empirical anchors trace to Scripts 32–42 on branch `paper-a-v4-big4`; the §III provenance table covers the methodology-side citations and §IV adds new tables for the v4.0-specific results. ## A. Experimental Setup -The signature-detection and feature-extraction pipeline (§III-A through §III-F) was executed on the full TWSE MOPS audit-report corpus (90,282 PDFs spanning 2013–2023; §III-B). Detection and embedding ran on RTX 4090 (CUDA, deterministic forward inference, fixed seed); the v4.0 statistical analyses ran on Apple Silicon (MPS / CPU). Random seeds are fixed (`SEED = 42`) across all v4.0 spike scripts (32–41) for reproducibility. The signature_analysis SQLite snapshot at `/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db` is treated as frozen; no v4.0 result re-ingests source PDFs. +The signature-detection and feature-extraction pipeline (§III-A through §III-F) was executed on the full TWSE MOPS audit-report corpus (90,282 PDFs spanning 2013–2023; §III-B). Detection and embedding ran on RTX 4090 (CUDA, deterministic forward inference, fixed seed); the v4.0 statistical analyses ran on Apple Silicon (MPS / CPU). Random seeds are fixed (`SEED = 42`) across the v4.0 spike scripts 32–42 for reproducibility. The signature_analysis SQLite snapshot at `/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db` is treated as frozen; no v4.0 result re-ingests source PDFs. The v4.0 primary analyses (§IV-D through §IV-J) are scoped to the Big-4 sub-corpus (Firms A–D, $n = 437$ CPAs with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. The §IV-K Full-Dataset Robustness section reports the full-dataset (686 CPAs) variant of the K=3 mixture + Paper A box-rule Spearman analysis as a cross-scope robustness check. 
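
The frozen-snapshot and fixed-seed conventions in §IV-A can be illustrated with a minimal sketch. It is not taken from the spike scripts: the helper names `open_frozen_snapshot` and `seeded_rng` are hypothetical, and the use of the standard-library `random` module is an assumption about how `SEED = 42` might be consumed. Opening the database with SQLite's `mode=ro` URI flag makes the connection reject any write, which is one way to treat a snapshot as frozen.

```python
import random
import sqlite3
from pathlib import Path

SEED = 42  # value stated in §IV-A


def open_frozen_snapshot(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file read-only (hypothetical helper).

    Writes on the returned connection raise sqlite3.OperationalError.
    """
    # Build a file: URI and append SQLite's read-only mode flag.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def seeded_rng(seed: int = SEED) -> random.Random:
    """Return an independent generator seeded with the fixed value."""
    return random.Random(seed)
```

A script would call `open_frozen_snapshot(...)` with the snapshot path given in §IV-A and draw all random numbers from `seeded_rng()`, so that repeated runs see the same data and the same random stream.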
## B. Signature Detection Performance -The detection metrics inherited unchanged from v3.20.0 §IV-B: 182,328 detected signatures across 86,072 prefiltered audit-report PDFs at the YOLOv11n + Qwen2.5-VL prefilter stage. Per-firm counts of detected signatures are reported in v3.20.0 Table IV (retained as Table IV here, unchanged numbers). The Big-4 subset of the detection output yields 150,442 signatures with both descriptors successfully computed. +The detection metrics are inherited unchanged from v3.20.0 §IV-B. v3.20.0 reports: VLM screening identified 86,072 documents with signature pages; 12 corrupted PDFs were excluded; YOLOv11n batch inference processed the remaining 86,071 documents; 85,042 of these yielded at least one signature detection; the total extracted-signature count is 182,328 (v3.20.0 Table III). Per-firm counts of detected signatures are reported in v3.20.0 Table IV. v4.0 does not renumber the v3.x detection tables into the v4 sequence; v3.20.0 Tables III and IV are cited by their original numbers. + +The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in all §IV v4 primary analyses (§IV-D through §IV-J). ## C. All-Pairs Intra-vs-Inter Class Distribution Analysis -The all-pairs intra-vs-inter class distribution analysis (KDE crossover at $\overline{\text{cos}} = 0.837$; v3.20.0 §IV-C, Table V) is inherited unchanged. This analysis was computed on the full corpus (not Big-4-restricted) and remains the source of the Uncertain / Likely-hand-signed boundary used by the §III-L five-way per-signature classifier (cosine $< 0.837 \Rightarrow$ Likely-hand-signed). v4.0 makes no scope-specific re-derivation of this boundary; the all-pairs cross-class crossover is a corpus-wide reference and is not restated as a v4.0 finding. 
+The all-pairs intra-vs-inter class distribution analysis (KDE crossover at $\overline{\text{cos}} = 0.837$; v3.20.0 §IV-C, v3.20.0 Table V) is inherited unchanged. This analysis was computed on the full corpus (not Big-4-restricted) and remains the source of the Uncertain / Likely-hand-signed boundary used by the §III-L five-way per-signature classifier (cosine $\leq 0.837 \Rightarrow$ Likely-hand-signed, matching Script 42's `cos <= 0.837` rule definition). v4.0 makes no scope-specific re-derivation of this boundary; the all-pairs cross-class crossover is a corpus-wide reference and is not restated as a v4.0 finding. v3.20.0 Table V is cited by its original number and is not renumbered into the v4 sequence. ## D. Big-4 Accountant-Level Distributional Characterisation This section reports the empirical evidence for §III-I's three-diagnostic distributional characterisation at the Big-4 accountant level. All numbers below are direct re-statements from Scripts 32 / 34; cross-citations to the v3.x (signature-level) analysis are noted where the v4.0 result differs structurally from the v3.x result. -**Table V (revised: Big-4 dip-test).** Hartigan dip-test results, accountant-level marginals. +**Table V.** Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32). | Population | $n$ CPAs | $p_{\text{cos}}$ | $p_{\text{dHash}}$ | Interpretation | |---|---|---|---|---| @@ -31,7 +33,7 @@ This section reports the empirical evidence for §III-I's three-diagnostic distr Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by the bootstrap resolution $1 / 2000 = 5 \times 10^{-4}$ (Script 34 reports this as $p = 0.0000$; we report $p < 5 \times 10^{-4}$ to reflect the resolution). Single-firm dip statistics for Firms B, C, and D were not separately computed. 
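
The resolution-bounded $p$-value reporting described above can be sketched as follows. This is an illustration under assumptions, not Script 34's code: the function name `bootstrap_p_value` is hypothetical, the bootstrap replicates of the dip statistic are taken as already computed, and counting replicates with `>=` is an assumed convention. When no replicate reaches the observed statistic, the function returns the resolution $1 / n_{\text{boot}}$ and flags it as an upper bound rather than reporting zero.

```python
from typing import Sequence, Tuple


def bootstrap_p_value(observed: float,
                      replicates: Sequence[float]) -> Tuple[float, bool]:
    """Return (p, is_upper_bound) for a one-sided bootstrap test.

    p is the fraction of replicates at or above the observed statistic.
    If no replicate reaches it, p is replaced by the bootstrap
    resolution 1 / n_boot and is_upper_bound is True.
    """
    n_boot = len(replicates)
    if n_boot == 0:
        raise ValueError("at least one bootstrap replicate is required")
    n_exceed = sum(1 for r in replicates if r >= observed)
    if n_exceed == 0:
        # Report "p < 1 / n_boot" instead of a literal zero.
        return 1.0 / n_boot, True
    return n_exceed / n_boot, False
```

With $n_{\text{boot}} = 2000$ and no replicate reaching the observed statistic, the function returns `(0.0005, True)`, which corresponds to the reported $p < 5 \times 10^{-4}$.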
-**Table VI (revised: BD/McCrary diagnostic, Big-4 marginals).** Burgstahler-Dichev / McCrary local-discontinuity test on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; $\alpha = 0.05$, two-sided). +**Table VI.** Burgstahler-Dichev / McCrary density-smoothness diagnostic on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; $\alpha = 0.05$, two-sided). | Population | Cosine: significant transition? | dHash: significant transition? | |---|---|---| @@ -40,13 +42,13 @@ Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no boot | Firms B + C + D pooled | none | one transition at $\overline{\text{dHash}} = 10.8$ | | All non-Firm-A pooled | none | one transition at $\overline{\text{dHash}} = 6.6$ | -The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 / post-2020 stratified variants exhibit dHash transitions at varying locations consistent with histogram-resolution artefacts rather than population-structural boundaries). The diagnostic is reported as a non-parametric robustness check; we do not use the off-Big-4 dHash transitions as operational thresholds. +The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). 
These off-Big-4 dHash transitions are scope-dependent and are not used as v4.0 operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes. ## E. Big-4 K=2 / K=3 Mixture Fits This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings. -**Table VII (revised: Big-4 K=2 components and bootstrap CIs).** +**Table VII.** Big-4 K=2 mixture components and marginal-crossing bootstrap 95% confidence intervals. | K=2 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | |---|---|---|---| @@ -62,7 +64,7 @@ Marginal crossings (point + bootstrap 95% CI, $n_{\text{boot}} = 500$): $\text{BIC}(K{=}2) = -1108.45$ (Script 34). -**Table VIII (revised: Big-4 K=3 components).** +**Table VIII.** Big-4 K=3 mixture components. | K=3 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive label | |---|---|---|---|---| @@ -70,13 +72,13 @@ $\text{BIC}(K{=}2) = -1108.45$ (Script 34). | C2 | 0.9558 | 6.66 | 0.536 | mixed | | C3 | 0.9826 | 2.41 | 0.321 | replicated | -$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). Component recovery is verified across Scripts 35, 37, and 38 (consistent component centres and weights to four decimal places). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively. +$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). 
Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively. ## F. Convergent Internal-Consistency Checks This section reports the empirical evidence for §III-K's three-score internal-consistency analysis. We re-emphasise the §III-K caveat: the three scores are deterministic functions of the same per-CPA descriptor pair $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ and are *not statistically independent measurements*. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth. -**Table IX (revised: per-CPA Spearman among three feature-derived scores, Big-4, $n = 437$).** +**Table IX.** Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, $n = 437$. | Score pair | Spearman $\rho$ | $p$-value | |---|---|---| @@ -86,20 +88,20 @@ This section reports the empirical evidence for §III-K's three-score internal-c (Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on $n = 249$ non-Big-4 CPAs; reference centre $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$. -**Table X (revised: per-firm summary across the three scores, Big-4).** +**Table X.** Per-firm summary across the three feature-derived scores, Big-4. 
| Firm | $n$ CPAs | mean $P(\text{C1})$ | mean reverse-anchor score | mean Paper A hand-leaning rate | |---|---|---|---|---| -| Firm A (Deloitte) | 171 | 0.0072 | $-0.9726$ | 0.1935 | -| Firm B (KPMG) | 112 | 0.1410 | $-0.8201$ | 0.6962 | -| Firm C (PwC) | 102 | 0.3110 | $-0.7672$ | 0.7896 | -| Firm D (EY) | 52 | 0.2406 | $-0.7125$ | 0.7608 | +| Firm A | 171 | 0.0072 | $-0.9726$ | 0.1935 | +| Firm B | 112 | 0.1410 | $-0.8201$ | 0.6962 | +| Firm C | 102 | 0.3110 | $-0.7672$ | 0.7896 | +| Firm D | 52 | 0.2406 | $-0.7125$ | 0.7608 | (Source: Script 38 per-firm summary; reverse-anchor score is sign-flipped so that *higher* values indicate deeper into the reference left tail = more hand-leaning relative to the non-Big-4 reference.) The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A firms as more hand-leaning. The K=3 posterior P(C1) and the box-rule hand-leaning rate (Score 1 and Score 3) place Firm C at the most-hand-leaning end of Big-4; the reverse-anchor cosine percentile (Score 2) ranks Firm D fractionally above Firm C. This residual within-Big-4-non-A disagreement is a design feature of the reverse-anchor metric: Score 2 measures only the marginal cosine percentile under the non-Big-4 reference, so a firm with a slightly higher cosine but a markedly different dHash distribution (Firm D vs Firm C) can score higher on Score 2 while scoring lower on Scores 1 and 3, both of which use both descriptors. -**Table XI (revised: per-signature Cohen $\kappa$ binary collapse, $n = 150{,}442$ Big-4 signatures).** +**Table XI.** Per-signature Cohen $\kappa$ (binary collapse, replicated vs not-replicated), $n = 150{,}442$ Big-4 signatures. 
| Pair | Cohen $\kappa$ | |---|---| @@ -107,24 +109,24 @@ The three scores agree on placing Firm A as the most replication-dominated and t | Paper A binary high-confidence box rule vs per-signature K=3 hard label | 0.559 | | Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 | -(Source: Script 39; verdict label `SIG_CONVERGENCE_MODERATE`.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. This convergent-checks evidence covers only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evaluation (§IV-J). +(Source: Script 39; verdict label `SIG_CONVERGENCE_MODERATE`.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evaluation (§IV-J). ## G. Leave-One-Firm-Out Reproducibility This section reports the firm-level cross-validation evidence motivating §III-J's "K=3 descriptive, not operational" framing. -**Table XII (revised: K=2 LOOO across the four Big-4 folds).** +**Table XII.** K=2 leave-one-firm-out across the four Big-4 folds. 
| Held-out firm | $n_{\text{train}}$ | $n_{\text{held}}$ | Fold rule (cos cut, dHash cut) | Held-out classified as templated by fold rule | |---|---|---|---|---| -| Firm A (Deloitte) | 266 | 171 | cos $> 0.9380$ AND dHash $\leq 8.79$ | $171 / 171 = 100.00\%$ ($95\%$ Wilson $[97.80\%, 100.00\%]$) | -| Firm B (KPMG) | 325 | 112 | cos $> 0.9744$ AND dHash $\leq 3.98$ | $0 / 112 = 0\%$ ($95\%$ Wilson $[0\%, 3.32\%]$) | -| Firm C (PwC) | 335 | 102 | cos $> 0.9752$ AND dHash $\leq 3.75$ | $0 / 102 = 0\%$ ($95\%$ Wilson $[0\%, 3.63\%]$) | -| Firm D (EY) | 385 | 52 | cos $> 0.9756$ AND dHash $\leq 3.74$ | $0 / 52 = 0\%$ ($95\%$ Wilson $[0\%, 6.88\%]$) | +| Firm A | 266 | 171 | cos $> 0.9380$ AND dHash $\leq 8.79$ | $171 / 171 = 100.00\%$ ($95\%$ Wilson $[97.80\%, 100.00\%]$) | +| Firm B | 325 | 112 | cos $> 0.9744$ AND dHash $\leq 3.98$ | $0 / 112 = 0\%$ ($95\%$ Wilson $[0\%, 3.32\%]$) | +| Firm C | 335 | 102 | cos $> 0.9752$ AND dHash $\leq 3.75$ | $0 / 102 = 0\%$ ($95\%$ Wilson $[0\%, 3.63\%]$) | +| Firm D | 385 | 52 | cos $> 0.9756$ AND dHash $\leq 3.74$ | $0 / 52 = 0\%$ ($95\%$ Wilson $[0\%, 6.88\%]$) | (Source: Script 36.) Across-fold cosine crossing: pairwise range $[0.9380, 0.9756]$, range = $0.0376$; max absolute deviation from the across-fold mean is $0.028$. This exceeds the report's $0.005$ across-fold stability tolerance by $5.6\times$ and is much larger than the full-Big-4 bootstrap CI half-width of $0.0015$. Together with the all-or-nothing held-out classification pattern (Firm A held out $\Rightarrow$ all held-out CPAs templated; any non-Firm-A firm held out $\Rightarrow$ none templated), this indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary. -**Table XIII (revised: K=3 LOOO C1 component shape and held-out membership).** +**Table XIII.** K=3 leave-one-firm-out: C1 component shape and held-out membership. 
| Held-out firm | C1 cos (fit) | C1 dHash (fit) | C1 weight (fit) | Held-out C1 hard-label rate | Full-Big-4 baseline C1% | Absolute difference | |---|---|---|---|---|---|---| @@ -134,13 +136,13 @@ This section reports the firm-level cross-validation evidence motivating §III-J | Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp | | Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp | -(Source: Script 37; verdict label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.025$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L). +(Source: Script 37; verdict label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L). ## H. Pixel-Identity Positive-Anchor Miss Rate This section reports the only hard-ground-truth subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. 
The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4). -**Table XIV (revised: positive-anchor miss rate, $n = 262$ Big-4 byte-identical signatures).** +**Table XIV.** Positive-anchor miss rate, $n = 262$ Big-4 byte-identical signatures. | Classifier | Misclassified as hand-leaning | Miss rate | Wilson 95% CI | |---|---|---|---| @@ -160,7 +162,7 @@ The signature-level inter-CPA negative-anchor FAR analysis (~50,000 random pairs This section reports the §III-L five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. The five-way category definitions are inherited unchanged from v3.20.0 §III-K (now §III-L); see §III-L for the cosine and dHash cuts. -**Table XV (revised: five-way per-signature category counts, Big-4 only, $n = 150{,}442$ classified).** +**Table XV.** Five-way per-signature category counts, Big-4 sub-corpus, $n = 150{,}442$ classified. | Category | Long name | $n$ signatures | % of classified | |---|---|---|---| @@ -176,16 +178,16 @@ This section reports the §III-L five-way per-signature + document-level worst-c | Firm | HC | MC | HSC | UN | LH | total signatures | |---|---|---|---|---|---|---| -| Firm A (Deloitte) | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 | -| Firm B (KPMG) | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 | -| Firm C (PwC) | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 | -| Firm D (EY) | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 | +| Firm A | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 | +| Firm B | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 | +| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 | +| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 | -(Source: Script 42 per-firm cross-tab.) 
The per-firm pattern aligns with the K=3 cluster cross-tab of Table XVI: Firm A is concentrated in the HC band (81.70% of its signatures), consistent with its 82.46% C3-replicated concentration at the accountant level; the three non-Firm-A Big-4 firms have markedly lower HC rates and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%) — consistent with §III-K Score 2 (reverse-anchor cosine percentile) ranking Firm D fractionally above Firm C in the hand-leaning direction. +(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVI: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3-replicated component (82.46%; Table XVI). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%). **Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.20.0 worst-case rule (HC > MC > HSC > UN > LH; §III-L). v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset). -**Table XV-B (NEW: document-level worst-case category counts, Big-4 only, $n = 75{,}233$ unique PDFs).** +**Table XV-B.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs. 
| Category | Long name | $n$ documents | % | |---|---|---|---| @@ -201,23 +203,23 @@ This section reports the §III-L five-way per-signature + document-level worst-c | Firm | HC | MC | HSC | UN | LH | total docs | |---|---|---|---|---|---|---| -| Firm A (Deloitte) | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 | -| Firm B (KPMG) | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 | -| Firm C (PwC) | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 | -| Firm D (EY) | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 | +| Firm A | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 | +| Firm B | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 | +| Firm C | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 | +| Firm D | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 | (Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.) -The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is **not separately validated by Scripts 38–40**, which evaluated only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we note this inheritance status explicitly so the reader can locate the v3.x Tables IX / XI / XII calibration evidence (carried into v4.0 by reference) without expecting v4.0-spike-script confirmation of the moderate-band specifics. The Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) report the inherited rule's output on the Big-4 subset; the relative ordering of the non-Firm-A firms on MC is consistent with the §III-K Spearman convergence on the per-CPA hand-leaning ranking. +The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is **not separately validated by Scripts 38–40**, which evaluated only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). 
v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The v3.20.0 capture-rate calibration evidence for the moderate band (v3.20.0 Tables IX, XI, XII, XII-B) is carried into v4.0 by reference and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA hand-leaning ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as more hand-leaning than Firm B). -**Table XVI (NEW: firm × K=3 cluster cross-tabulation, Big-4 only).** +**Table XVI.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus. | Firm | $n$ | C1 (hand-leaning) | C2 (mixed) | C3 (replicated) | C1 % | C3 % | |---|---|---|---|---|---|---| -| Firm A (Deloitte) | 171 | 0 | 30 | 141 | $0.00\%$ | $82.46\%$ | -| Firm B (KPMG) | 112 | 10 | 102 | 0 | $8.93\%$ | $0.00\%$ | -| Firm C (PwC) | 102 | 24 | 77 | 1 | $23.53\%$ | $0.98\%$ | -| Firm D (EY) | 52 | 6 | 45 | 1 | $11.54\%$ | $1.92\%$ | +| Firm A | 171 | 0 | 30 | 141 | $0.00\%$ | $82.46\%$ | +| Firm B | 112 | 10 | 102 | 0 | $8.93\%$ | $0.00\%$ | +| Firm C | 102 | 24 | 77 | 1 | $23.53\%$ | $0.98\%$ | +| Firm D | 52 | 6 | 45 | 1 | $11.54\%$ | $1.92\%$ | (Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. 
Reading: Firm A's CPAs are concentrated in the C3 replicated component (no Firm A CPAs in C1); Firm C has the highest hand-leaning concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there). @@ -225,9 +227,9 @@ The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < ## K. Full-Dataset Robustness (light scope) -This section reports the v4.0 reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). Per the v4.0 author choice (codex round-22 open question 1, Light scope), we re-run only the K=3 mixture + Paper A operational-rule per-CPA hand-leaning rate analysis; the §III-L five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band is documented as inherited from v3.x calibration in §IV-J. +This section reports the v4.0 reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + Paper A operational-rule per-CPA hand-leaning rate analysis, sufficient to demonstrate that the v4.0 K=3 + Paper A convergence reproduces at the wider scope. The §III-L five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band is documented as inherited from v3.x calibration in §IV-J. -**Table XVII (NEW: K=3 component comparison, Big-4 vs full dataset).** +**Table XVII.** K=3 component comparison, Big-4 sub-corpus vs full dataset. 
| K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full | |---|---|---|---| @@ -237,7 +239,7 @@ This section reports the v4.0 reproducibility cross-check at the full accountant (Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.) -**Table XVIII (NEW: Spearman correlation between K=3 P(C1) and Paper A operational hand-leaning rate, Big-4 vs full dataset).** +**Table XVIII.** Spearman rank correlation between K=3 P(C1) and Paper A operational hand-leaning rate, Big-4 sub-corpus vs full dataset. | Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs Paper A hand-leaning rate) | $p$-value | |---|---|---|---| @@ -249,9 +251,9 @@ This section reports the v4.0 reproducibility cross-check at the full accountant **Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the Paper A box-rule hand-leaning rate are preserved at the full scope. Component centres shift modestly: C3 (replicated) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more non-templated CPAs (mid/small firms); C1 (hand-leaning) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population includes mid/small-firm hand-leaning CPAs that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + Paper A convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the v4.0 primary methodology is restricted to Big-4 by design (§III-G item 4). -## L. Feature Backbone Ablation (inherited from v3.x §IV-H.3) +## L. 
Feature Backbone Ablation (inherited from v3.20.0 §IV-I) -The feature-backbone ablation (Table XVIII in v3.20.0; backbone replacement of ResNet-50 with alternative ImageNet-pretrained backbones to verify the §III-E embedding choice is not load-bearing) is inherited unchanged. v4.0 makes no scope-specific re-derivation; the ablation is a methodological-stability check on the embedding stage and is corpus-wide rather than Big-4-restricted. +The feature-backbone ablation (v3.20.0 Table XVIII; backbone replacement of ResNet-50 with alternative ImageNet-pretrained backbones to verify that the §III-E embedding choice is not load-bearing) is inherited unchanged. v3.20.0 Table XVIII is cited by its original v3 number and is **not** the same table as the v4 Table XVIII (which reports the Big-4 vs full-dataset Spearman drift in §IV-K). v4.0 makes no scope-specific re-derivation of the ablation; the analysis is a methodological-stability check on the embedding stage and is corpus-wide rather than Big-4-restricted. ---