# Paper A Round 24 Review - v4 round 4 Reviewer: gpt-5.5 xhigh Date: 2026-05-12 Target: `paper/v4/paper_a_results_v4_section_iv.md` (§IV v3) Paired methodology: `paper/v4/paper_a_methodology_v4_section_iii.md` (§III v4) Rubric: `paper/codex_review_gpt55_v4_round3.md` (6 Major, 6 Minor, 5 Editorial) ## Verdict Minor Revision. The round-23 blockers are substantially reduced. The §IV v3 result tables are now mostly provenance-faithful, the inherited-v3 table identity problem is largely resolved, detection counts are corrected, §IV firm rows are pseudonymised, and the moderate-confidence band is now described honestly as inherited rather than newly validated. I do not recommend Accept yet because several cleanup issues remain visible in the paired §III/§IV package: §III v4 still leaks real firm names despite the pseudonym policy, §III still carries the stale K=3 LOOO weight-drift value of 0.025 where the report and §IV v3 use 0.023, and the internal draft notes/checklists still contain stale round/version/table-numbering language. ## Round-23 Finding Closure Table | Round-23 finding | Status | v3/v4 evidence | |---|---|---| | Major 1. Table numbering was incoherent and inherited v3 tables collided with current v4 tables. | PARTIAL | Core collision is fixed: §IV v3 says inherited v3.x tables are cited only as `v3.20.0 Table N` and not renumbered (§IV:3), and detection/all-pairs/ablation are cited as v3.20.0 Tables III/IV/V/XVIII (§IV:13, 19, 256). Residual: the same draft note still says "Tables IV-XVIII" even though the new v4 sequence starts at Table V (§IV:3), and the close-out checklist repeats "Tables IV-XVIII" plus `Table XV-B` (§IV:265). | | Major 2. §III v3 contained stale cross-references not supported by §IV v2. | PARTIAL | Main cross-refs are repaired: §III now points signature-level classification to §IV-J and inherited inter-CPA FAR to §IV-I (§III:13), and accurately scopes §IV-D through §IV-J as v4-new Big-4 analyses while excluding §IV-A-C/I/L and full-dataset §IV-K (§III:23). Residual stale/internal references remain: §III says the corresponding FAR evidence comes from "§III-J inherited; Table X" (§III:119), and the open question still proposes adding a moderate-band analysis in current §IV-F even though §IV-F is convergence checks (§III:198; §IV:77-112). | | Major 3. Inherited detection-count sentence was numerically wrong / ambiguous. | CLOSED | §IV v3 now distinguishes VLM-positive documents, corrupted exclusions, YOLO-processed documents, detected-document count, and extracted signatures (§IV:13), matching the v3 baseline's Table III sequence (v3:14, 20-22). | | Major 4. Draft claimed anonymisation while §IV tables revealed real firm names. | PARTIAL | §IV v3 uses Firm A-D in tables and prose (§IV:91-100, 120-125, 131-137, 179-184, 204-209, 217-222), so the §IV-specific failure is closed. But the paired §III v4 still leaks real names/aliases: "held-out-EY" (§III:71) and "Firms B (KPMG) and D (EY)" (§III:99), contradicting the pseudonym policy in §III:23 and §IV:3. | | Major 5. Interpretive claims overstated what the spike results prove. | CLOSED | The off-Big-4 dHash transition language is now scope-dependent rather than an artefact claim (§IV:45). The Firm A HC vs C3 comparison is explicitly qualitative and cross-unit (§IV:186). MC-band ordering is now explicitly descriptive and not treated as Spearman validation (§IV:213). | | Major 6. Moderate-confidence band support language needed narrowing. | CLOSED | §III v4 now states that Scripts 38-42 do not separately validate the MC/style/document components and that v4 only supports the binary high-confidence sub-rule (§III:131). §IV v3 repeats this limitation and cites v3.20.0 Tables IX/XI/XII/XII-B as inherited support (§IV:213). | | Minor 1. K=3 LOOO C1 weight drift should be 0.023/0.0235, not 0.025. | PARTIAL | §IV v3 is corrected to 0.023 (§IV:139), matching Script 37. §III v4 still says 0.025 in prose and provenance (§III:71, 115, 173). | | Minor 2. Seed coverage statement stopped at Script 41 although §IV used Script 42. | CLOSED | §IV v3 now says seeds are fixed across Scripts 32-42 (§IV:7). | | Minor 3. Low-cosine cutoff inclusivity should match Script 42 (`cos <= 0.837`). | PARTIAL | §IV v3 is explicit: cosine `<= 0.837` maps to Likely-hand-signed (§IV:19), matching Script 42. §III-L still says "Cosine below" the crossover (§III:143), which is less precise than the inherited rule; make it "at or below 0.837." | | Minor 4. "Round-22 open question 1, Light scope" process note was not traceable. | CLOSED | The §IV-K body now describes the full-dataset robustness scope directly, without the round-22 process-note wording (§IV:230). The remaining stale process text is confined to the internal checklist (§IV:260-267). | | Minor 5. Ablation section pointer was wrong. | CLOSED | §IV v3 correctly identifies the inherited feature-backbone ablation as v3.20.0 §IV-I and distinguishes v3 Table XVIII from current v4 Table XVIII (§IV:254-256). | | Minor 6. "Component recovery across Scripts 35, 37, and 38" could be misread. | CLOSED | §IV v3 now says the full-fit K=3 baseline is reproduced in Scripts 35, 37, and 38, while Script 37 fold components differ by design and are separately reported (§IV:75). | | Editorial 1. Remove draft note and Phase 3 close-out checklist before submission. | OPEN | Both files still include internal draft notes and author checklists/open questions (§III:3-9, 187-202; §IV:3, 260-267). §IV's checklist also says the section is being prepared for "codex round 23" even though this is round 24 (§IV:262). | | Editorial 2. "This convergent-checks evidence" grammar. | CLOSED | §IV v3 uses "These convergence checks" (§IV:112). | | Editorial 3. "is finalised" should be "will be finalised." | CLOSED | §IV v3 uses future/provisional wording (§IV:3, 265). | | Editorial 4. Standardise `dHash` versus `dh`. | CLOSED | Manuscript prose/tables consistently use `dHash`; raw spike-script `dh` appears only inside source descriptions or quoted rule names (§III:13, 133-145; §IV:36, 53-63, 167-184). | | Editorial 5. Avoid mixing "replicated," "templated," and "non-hand-signed" as exact synonyms. | CLOSED | Current usage mostly preserves distinctions: replicated is used for positive-anchor / C3 contexts (§IV:143-155), non-hand-signed for the operational five-way categories (§IV:167-173), and templated mainly for K=2 fold-rule wording (§IV:120-127). No remaining overclaim depends on treating them as exact synonyms. | ## Newly Introduced Or Remaining Issues 1. **§III v4 still violates the anonymisation policy.** §III says firms are pseudonymously labelled Firm A-D throughout the manuscript (§III:23), but line 71 says "held-out-EY" and line 99 names KPMG and EY. §IV v3 fixed this; §III now needs the same scrub. 2. **§III v4 has a stale K=3 LOOO weight-drift number.** Script 37 reports max C1 weight deviation 0.023, and §IV v3 uses 0.023 (§IV:139). §III still reports 0.025 in two prose locations and the provenance table (§III:71, 115, 173). 3. **Two §III internal references are stale.** The positive-anchor paragraph cites "§III-J inherited; Table X" for inter-CPA FAR (§III:119), but the paired result location is §IV-I and the inherited source is v3.20.0 §IV-F.1/Table X (§IV:157-159). The open question asks whether to add a moderate-band analysis in §IV-F (§III:198), but current §IV-F is the convergence section. 4. **Internal notes are stale enough to confuse a handoff.** §III's draft note says "(2026-05-12, v3)" although the file title is v4 (§III:1, 3). §IV's close-out checklist says "before §IV is sent for codex round 23" even though round 23 has already happened (§IV:262), and item 4 says issues are addressed in "this v2" inside a v3 file (§IV:267). 5. **§III mentions the full-dataset `n = 686` but does not list it in the §III provenance table.** §III:23 states that §IV-K reports a full-dataset cross-check at 686 CPAs; Script 41 directly reports full dataset `N CPAs = 686`. Add that row if the number remains in §III. 6. **The table-numbering note still has a small self-contradiction.** §IV:3 says the new v4 sequence is Table V through Table XVIII, then says "Tables IV-XVIII" remain provisional. Either add a current Table IV, or make all provisional references "Tables V-XVIII" and decide whether `Table XV-B` is acceptable for the target style. ## Cross-Reference Checks (§III v4 <-> §IV v3) | Claim / linkage | §III v4 line evidence | §IV v3 line evidence | Status | |---|---:|---:|---| | Big-4 scope and inherited/non-Big-4 exceptions. | §III:23 | §IV:9, 13, 19, 157-159, 230, 254-256 | Supported. | | Big-4 sample size: 437 CPAs and 150,442 classified signatures. | §III:23, 157-158 | §IV:9, 15, 165, 175 | Supported. | | Dip-test and BD/McCrary accountant-level characterisation. | §III:49-53 | §IV:25-45 | Supported. | | K=2/K=3 mixture components and mild BIC preference. | §III:59-69 | §IV:51-75 | Supported. | | K=2 unstable; K=3 descriptive, not operational, under LOOO. | §III:71-79, 111-115 | §IV:116-139 | Mostly supported; align §III's 0.025 weight drift to §IV's/report's 0.023. | | Three-score internal-consistency correlations and per-firm ranking nuance. | §III:83-99 | §IV:79-102 | Supported, except §III anonymisation leak in line 99. | | Per-signature K=3 convergence and binary kappa values. | §III:101-109 | §IV:104-112 | Supported. | | Pixel-identity positive-anchor miss rate. | §III:117-127 | §IV:141-155 | Supported, but §III:119 should cite §IV-I/v3 §IV-F.1 for inter-CPA FAR, not "§III-J inherited." | | Five-way classifier retained as primary and MC band inherited. | §III:131-149 | §IV:161-213 | Supported; make §III:143 inclusive for `cos <= 0.837`. | | K=3 hard label vs K=3 posterior roles. | §III:149 | §IV:215-224 and 81-89 | Supported: hard labels for cluster cross-tab, posterior P(C1) for Spearman. | | Full-dataset robustness is light scope only. | §III:23, 31 | §IV:228-252 | Supported, but add provenance for `n = 686` to §III table or remove the number from §III. | | Internal author/open-question checklist. | §III:187-202 | §IV:260-267 | Not manuscript-ready; stale references remain. | ## Provenance Re-Verification Of Changed Numerics | Changed numerical claim | Manuscript line(s) | Source checked | Status | |---|---:|---|---| | Detection sequence: 86,072 VLM-positive; 12 corrupted; 86,071 YOLO-processed; 85,042 with detections; 182,328 signatures. | §IV:13 | v3 baseline reports 86,071 processed, 85,042 with detections, and 182,328 signatures (v3:14, 20-22). The 86,072/12 sequence is inherited from the v3 narrative already cited in round 23. | Confirmed; round-23 denominator conflation is fixed. | | Big-4 signature sample: 150,453 loaded, 150,442 classified, 11 missing descriptors. | §IV:175 | Script 42 reports loaded 150,453, classified 150,442, unclassified 11 (five_way_report:14-16). | Confirmed. | | K=2 marginal crossings and bootstrap CIs: cos 0.9755, dHash 3.755, CIs [0.9742, 0.9772] and [3.476, 3.969]. | §IV:62-65; §III:51, 59-60 | Script 36 reports cos point 0.9755 and dHash point 3.7549 with those CIs (calibration_loo_report:14-17). | Confirmed. | | K=3 components: C1 0.9457/9.17/0.143; C2 0.9558/6.66/0.536; C3 0.9826/2.41/0.321. | §IV:67-75; §III:61-69 | Scripts 35/37/38 report the same baseline (inspection_report:6-10; k3_loo_report:6-10; convergence_report:8-12). | Confirmed. | | K=3 lower than K=2 by 3.48 BIC points. | §IV:75; §III:69 | Script 36 reports K=2 BIC -1108.45 and K=3 BIC -1111.93 (calibration_loo_report:9-10). | Confirmed by arithmetic. | | Spearman correlations: 0.9627, 0.8890, 0.8794, with p-values bounded in manuscript. | §IV:81-89; §III:91-99 | Script 38 reports 0.9627 / 3.92e-249, 0.8890 / 1.09e-149, 0.8794 / 2.73e-142 (convergence_report:26-30). | Confirmed. | | Per-firm score nuance: Firm C highest on P(C1)=0.3110 and hand_frac=0.7896; Firm D higher on reverse-anchor score -0.7125 vs Firm C -0.7672. | §IV:95-102; §III:99 | Script 38 per-firm summary reports those values (convergence_report:43-48). | Confirmed; §III should anonymise KPMG/EY parentheticals. | | K=3 LOOO C1 weight drift is 0.023, not 0.025. | §IV:139; §III:71, 115, 173 | Script 37 reports max C1 weight deviation 0.023 (k3_loo_report:77-79). | §IV confirmed; §III mismatch remains. | | Pixel-identical Big-4 subset n=262, split 145/8/107/2, all classifiers 0% miss with Wilson upper 1.45%. | §IV:145-153; §III:117-127 | Script 40 reports total 262, 262/262 correct for all three classifiers, and per-firm split 145/8/107/2 (far_report:8, 12-18, 22-27). | Confirmed. | | Five-way per-signature counts: HC 74,593; MC 39,817; HSC 314; UN 35,480; LH 238. | §IV:165-175 | Script 42 reports the same counts and percentages (five_way_report:20-26). | Confirmed. | | Per-firm five-way percentages: Firm A 81.70/10.76/0.05/7.42/0.07; Firm B 34.56/35.88/0.29/29.09/0.18; Firm C 23.75/41.44/0.38/34.21/0.22; Firm D 24.51/29.33/0.22/45.65/0.29. | §IV:181-186, 213 | Script 42 reports the same percentages (five_way_report:39-44). | Confirmed; interpretation is now appropriately descriptive. | | Document-level counts: n=75,233 PDFs; HC 46,857; MC 19,667; HSC 167; UN 8,524; LH 18; mixed-firm PDFs n=379. | §IV:190-200 | Script 42 reports n=75,233, mixed-firm n=379, and those category counts (five_way_report:46-57). | Confirmed. | | Full-dataset robustness: full n=686; component rows; full rho 0.9558; drift 0.0069. | §IV:232-250; §III:23 | Script 41 reports Big-4 n=437, full n=686, component drifts, BICs, rho 0.9558, and drift 0.0069 (fulldataset_report:8-31). | Confirmed; add §III provenance row for n=686. | ## Phase 4 Readiness Partial. The empirical tables are close to partner-review ready and I do not see a need to rerun the main v4 scripts for §IV. The remaining issues are mostly manuscript hygiene, pseudonym consistency, and cross-reference/provenance alignment. They are small edits, but they are visible enough that I would not send the paired §III/§IV package to partner review until they are fixed. ## Recommended Next-Step Actions 1. Scrub §III v4 for real firm names/aliases. Replace "held-out-EY" and "Firms B (KPMG) and D (EY)" with Firm A-D language, or explicitly abandon the pseudonym policy everywhere. 2. Align K=3 LOOO weight drift to Script 37 throughout §III: use 0.023 (or 0.0235 if exact precision is preferred), matching §IV:139. 3. Fix the remaining stale cross-references: §III:119 should point to current §IV-I / inherited v3.20.0 §IV-F.1 Table X; §III:198 should not refer to current §IV-F for a possible moderate-band analysis. 4. Make the §III-L low-cosine rule inclusive: Likely hand-signed is `cos <= 0.837`, matching Script 42 and §IV:19. 5. Remove or move internal draft notes, cross-reference indices, open questions, and the §IV Phase 3 close-out checklist before partner review. At minimum, fix stale "v2/v3/round 23" text. 6. Finalise table numbering after deciding whether `Table XV-B` is acceptable. If the current v4 sequence starts at Table V, remove residual "Tables IV-XVIII" wording. 7. Add §III provenance for the full-dataset `n = 686` claim if it remains in §III-G; cite Script 41 / `fulldataset_report.md`.