pdf_signature_extraction/paper/codex_review_gpt55_v4_round4.md
gbanyan 6d2eddb6e8 Apply codex round-24 final cleanup: §III v5 + §IV v3.1
Codex round 24 returned Minor Revision: 3 Major CLOSED + 3 Major
PARTIAL + 4 Minor CLOSED + 2 Minor PARTIAL + 4 Editorial CLOSED
+ 1 Editorial OPEN. All 7 narrow residual fixes were §III-side
(I applied §IV fixes thoroughly in v3 but didn't mirror them to
§III v4).

§III v5 fixes:

  1. Anonymisation leak repaired:
     - "held-out-EY fold" -> "held-out-Firm-D fold" (L71)
     - "Firms B (KPMG) and D (EY)" -> "Firms B and D" (L99)
  2. K=3 LOOO weight drift 0.025 -> 0.023 at three sites
     (L71, L115, L173 provenance table). Matches Script 37 max
     C1 weight deviation and §IV v3 line 139.
  3. §III-K positive-anchor paragraph cross-ref repaired:
     "v3.x inter-CPA negative anchor (§III-J inherited; Table X)"
     -> "(§IV-I, inheriting v3.20.0 §IV-F.1 Table X)".
  4. §III-L five-way Likely-hand-signed band made inclusive:
     "Cosine below the all-pairs KDE crossover threshold." ->
     "Cosine at or below the all-pairs KDE crossover threshold
     (cos <= 0.837)." Matches Script 42 and §IV:19.
  5. Open question 1's pointer changed from current §IV-F (which
     is Convergent Internal-Consistency Checks) to v3.20.0
     Tables IX/XI/XII/XII-B + §IV-J descriptive proportions.
  6. Provenance table: new row for full-dataset n=686 citing
     Script 41 fulldataset_report.md.
  7. Draft-note header refreshed: v3 -> v5; v4 -> v5 etc.;
     "internal -- remove before submission" tag added.

§IV v3.1 fixes:

  - Close-out checklist L262 stale "codex round 23" wording
    updated to "rounds 21-24 and before partner Jimmy review".
  - Close-out item 4 "in this v2" stale wording -> "in this v3".
  - New item 5 added: internal author notes (this checklist +
    §III cross-reference index + both files' draft-note headers)
    are author working artefacts and should be moved/stripped
    before partner / submission packaging.

Round 24 finding summary post-v5/v3.1:
  Major:     3 CLOSED, 3 -> CLOSED (anonymisation + cross-ref +
             table numbering note residuals)
  Minor:     4 CLOSED, 2 -> CLOSED (weight drift 0.025 -> 0.023;
             low-cosine inclusivity cos <= 0.837)
  Editorial: 4 CLOSED, 1 PARTIAL (draft notes remain visible but
             explicitly marked as internal-only "remove before
             submission")

Phase 4 readiness: pending decision on whether to do one more
codex verification round (round 25) before drafting Abstract /
Intro / Discussion / Conclusion prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 22:26:14 +08:00

Paper A Round 24 Review - v4 round 4

Reviewer: gpt-5.5 xhigh
Date: 2026-05-12
Target: paper/v4/paper_a_results_v4_section_iv.md (§IV v3)
Paired methodology: paper/v4/paper_a_methodology_v4_section_iii.md (§III v4)
Rubric: paper/codex_review_gpt55_v4_round3.md (6 Major, 6 Minor, 5 Editorial)

Verdict

Minor Revision.

The round-23 blockers are substantially reduced. The §IV v3 result tables are now mostly provenance-faithful, the inherited-v3 table identity problem is largely resolved, detection counts are corrected, §IV firm rows are pseudonymised, and the moderate-confidence band is now described honestly as inherited rather than newly validated.

I do not recommend Accept yet because several cleanup issues remain visible in the paired §III/§IV package: §III v4 still leaks real firm names despite the pseudonym policy, §III still carries the stale K=3 LOOO weight-drift value of 0.025 where the report and §IV v3 use 0.023, and the internal draft notes/checklists still contain stale round/version/table-numbering language.

Round-23 Finding Closure Table

| Round-23 finding | Status | v3/v4 evidence |
| --- | --- | --- |
| Major 1. Table numbering was incoherent and inherited v3 tables collided with current v4 tables. | PARTIAL | Core collision is fixed: §IV v3 says inherited v3.x tables are cited only as v3.20.0 Table N and not renumbered (§IV:3), and detection/all-pairs/ablation are cited as v3.20.0 Tables III/IV/V/XVIII (§IV:13, 19, 256). Residual: the same draft note still says "Tables IV-XVIII" even though the new v4 sequence starts at Table V (§IV:3), and the close-out checklist repeats "Tables IV-XVIII" plus Table XV-B (§IV:265). |
| Major 2. §III v3 contained stale cross-references not supported by §IV v2. | PARTIAL | Main cross-refs are repaired: §III now points signature-level classification to §IV-J and inherited inter-CPA FAR to §IV-I (§III:13), and accurately scopes §IV-D through §IV-J as v4-new Big-4 analyses while excluding §IV-A-C/I/L and full-dataset §IV-K (§III:23). Residual stale/internal references remain: §III says the corresponding FAR evidence comes from "§III-J inherited; Table X" (§III:119), and the open question still proposes adding a moderate-band analysis in current §IV-F even though §IV-F is convergence checks (§III:198; §IV:77-112). |
| Major 3. Inherited detection-count sentence was numerically wrong / ambiguous. | CLOSED | §IV v3 now distinguishes VLM-positive documents, corrupted exclusions, YOLO-processed documents, detected-document count, and extracted signatures (§IV:13), matching the v3 baseline's Table III sequence (v3:14, 20-22). |
| Major 4. Draft claimed anonymisation while §IV tables revealed real firm names. | PARTIAL | §IV v3 uses Firm A-D in tables and prose (§IV:91-100, 120-125, 131-137, 179-184, 204-209, 217-222), so the §IV-specific failure is closed. But the paired §III v4 still leaks real names/aliases: "held-out-EY" (§III:71) and "Firms B (KPMG) and D (EY)" (§III:99), contradicting the pseudonym policy in §III:23 and §IV:3. |
| Major 5. Interpretive claims overstated what the spike results prove. | CLOSED | The off-Big-4 dHash transition language is now scope-dependent rather than an artefact claim (§IV:45). The Firm A HC vs C3 comparison is explicitly qualitative and cross-unit (§IV:186). MC-band ordering is now explicitly descriptive and not treated as Spearman validation (§IV:213). |
| Major 6. Moderate-confidence band support language needed narrowing. | CLOSED | §III v4 now states that Scripts 38-42 do not separately validate the MC/style/document components and that v4 only supports the binary high-confidence sub-rule (§III:131). §IV v3 repeats this limitation and cites v3.20.0 Tables IX/XI/XII/XII-B as inherited support (§IV:213). |
| Minor 1. K=3 LOOO C1 weight drift should be 0.023/0.0235, not 0.025. | PARTIAL | §IV v3 is corrected to 0.023 (§IV:139), matching Script 37. §III v4 still says 0.025 in prose and provenance (§III:71, 115, 173). |
| Minor 2. Seed coverage statement stopped at Script 41 although §IV used Script 42. | CLOSED | §IV v3 now says seeds are fixed across Scripts 32-42 (§IV:7). |
| Minor 3. Low-cosine cutoff inclusivity should match Script 42 (cos <= 0.837). | PARTIAL | §IV v3 is explicit: cosine <= 0.837 maps to Likely-hand-signed (§IV:19), matching Script 42. §III-L still says "Cosine below" the crossover (§III:143), which is less precise than the inherited rule; make it "at or below 0.837." |
| Minor 4. "Round-22 open question 1, Light scope" process note was not traceable. | CLOSED | The §IV-K body now describes the full-dataset robustness scope directly, without the round-22 process-note wording (§IV:230). The remaining stale process text is confined to the internal checklist (§IV:260-267). |
| Minor 5. Ablation section pointer was wrong. | CLOSED | §IV v3 correctly identifies the inherited feature-backbone ablation as v3.20.0 §IV-I and distinguishes v3 Table XVIII from current v4 Table XVIII (§IV:254-256). |
| Minor 6. "Component recovery across Scripts 35, 37, and 38" could be misread. | CLOSED | §IV v3 now says the full-fit K=3 baseline is reproduced in Scripts 35, 37, and 38, while Script 37 fold components differ by design and are separately reported (§IV:75). |
| Editorial 1. Remove draft note and Phase 3 close-out checklist before submission. | OPEN | Both files still include internal draft notes and author checklists/open questions (§III:3-9, 187-202; §IV:3, 260-267). §IV's checklist also says the section is being prepared for "codex round 23" even though this is round 24 (§IV:262). |
| Editorial 2. "This convergent-checks evidence" grammar. | CLOSED | §IV v3 uses "These convergence checks" (§IV:112). |
| Editorial 3. "is finalised" should be "will be finalised." | CLOSED | §IV v3 uses future/provisional wording (§IV:3, 265). |
| Editorial 4. Standardise dHash versus dh. | CLOSED | Manuscript prose/tables consistently use dHash; raw spike-script dh appears only inside source descriptions or quoted rule names (§III:13, 133-145; §IV:36, 53-63, 167-184). |
| Editorial 5. Avoid mixing "replicated," "templated," and "non-hand-signed" as exact synonyms. | CLOSED | Current usage mostly preserves distinctions: replicated is used for positive-anchor / C3 contexts (§IV:143-155), non-hand-signed for the operational five-way categories (§IV:167-173), and templated mainly for K=2 fold-rule wording (§IV:120-127). No remaining overclaim depends on treating them as exact synonyms. |

Newly Introduced Or Remaining Issues

  1. §III v4 still violates the anonymisation policy. §III says firms are pseudonymously labelled Firm A-D throughout the manuscript (§III:23), but line 71 says "held-out-EY" and line 99 names KPMG and EY. §IV v3 fixed this; §III now needs the same scrub.

  2. §III v4 has a stale K=3 LOOO weight-drift number. Script 37 reports max C1 weight deviation 0.023, and §IV v3 uses 0.023 (§IV:139). §III still reports 0.025 in two prose locations and the provenance table (§III:71, 115, 173).

  3. Two §III internal references are stale. The positive-anchor paragraph cites "§III-J inherited; Table X" for inter-CPA FAR (§III:119), but the paired result location is §IV-I and the inherited source is v3.20.0 §IV-F.1/Table X (§IV:157-159). The open question asks whether to add a moderate-band analysis in §IV-F (§III:198), but current §IV-F is the convergence section.

  4. Internal notes are stale enough to confuse a handoff. §III's draft note says "(2026-05-12, v3)" although the file title is v4 (§III:1, 3). §IV's close-out checklist says "before §IV is sent for codex round 23" even though round 23 has already happened (§IV:262), and item 4 says issues are addressed in "this v2" inside a v3 file (§IV:267).

  5. §III mentions the full-dataset n = 686 but does not list it in the §III provenance table. §III:23 states that §IV-K reports a full-dataset cross-check at 686 CPAs; Script 41 directly reports full dataset N CPAs = 686. Add that row if the number remains in §III.

  6. The table-numbering note still has a small self-contradiction. §IV:3 says the new v4 sequence is Table V through Table XVIII, then says "Tables IV-XVIII" remain provisional. Either add a current Table IV, or make all provisional references "Tables V-XVIII" and decide whether Table XV-B is acceptable for the target style.
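
Issues 1, 2, and 4 above are mechanical enough to catch with a plain text scan before the next handoff. The following is a minimal sketch, not one of the reviewed scripts: the `STALE_PATTERNS` table and the `scan` helper are hypothetical, only KPMG and EY are taken from the leaks cited in this review, and the sample lines are invented for illustration.

```python
import re

# Hypothetical pattern table. Only the strings quoted in this review are
# included; extend it to match the project's own pseudonym map.
STALE_PATTERNS = {
    "real firm name": re.compile(r"\b(KPMG|EY)\b"),
    "stale K=3 LOOO weight drift": re.compile(r"\b0\.025\b"),
    "stale round reference": re.compile(r"codex round 23"),
}

def scan(text):
    """Return (line_number, label, line) for every flagged line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in STALE_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label, line.strip()))
    return hits

# Invented sample lines that mimic the wording quoted in the issues above.
sample = "\n".join([
    "K=3 weights on the held-out-EY fold drift by at most 0.025.",
    "Firms B and D rank highest on the reverse-anchor score.",
    "Close out before this section is sent for codex round 23.",
])

for lineno, label, line in scan(sample):
    print(f"line {lineno}: {label}: {line}")
```

A hit is only a prompt for manual review: a value such as 0.025 may legitimately appear elsewhere in the manuscript, so the scan narrows the search rather than deciding the fix.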

Cross-Reference Checks (§III v4 <-> §IV v3)

| Claim / linkage | §III v4 line evidence | §IV v3 line evidence | Status |
| --- | --- | --- | --- |
| Big-4 scope and inherited/non-Big-4 exceptions. | §III:23 | §IV:9, 13, 19, 157-159, 230, 254-256 | Supported. |
| Big-4 sample size: 437 CPAs and 150,442 classified signatures. | §III:23, 157-158 | §IV:9, 15, 165, 175 | Supported. |
| Dip-test and BD/McCrary accountant-level characterisation. | §III:49-53 | §IV:25-45 | Supported. |
| K=2/K=3 mixture components and mild BIC preference. | §III:59-69 | §IV:51-75 | Supported. |
| K=2 unstable; K=3 descriptive, not operational, under LOOO. | §III:71-79, 111-115 | §IV:116-139 | Mostly supported; align §III's 0.025 weight drift to §IV's/report's 0.023. |
| Three-score internal-consistency correlations and per-firm ranking nuance. | §III:83-99 | §IV:79-102 | Supported, except §III anonymisation leak in line 99. |
| Per-signature K=3 convergence and binary kappa values. | §III:101-109 | §IV:104-112 | Supported. |
| Pixel-identity positive-anchor miss rate. | §III:117-127 | §IV:141-155 | Supported, but §III:119 should cite §IV-I/v3 §IV-F.1 for inter-CPA FAR, not "§III-J inherited." |
| Five-way classifier retained as primary and MC band inherited. | §III:131-149 | §IV:161-213 | Supported; make §III:143 inclusive for cos <= 0.837. |
| K=3 hard label vs K=3 posterior roles. | §III:149 | §IV:215-224 and 81-89 | Supported: hard labels for cluster cross-tab, posterior P(C1) for Spearman. |
| Full-dataset robustness is light scope only. | §III:23, 31 | §IV:228-252 | Supported, but add provenance for n = 686 to §III table or remove the number from §III. |
| Internal author/open-question checklist. | §III:187-202 | §IV:260-267 | Not manuscript-ready; stale references remain. |

Provenance Re-Verification Of Changed Numerics

| Changed numerical claim | Manuscript line(s) | Source checked | Status |
| --- | --- | --- | --- |
| Detection sequence: 86,072 VLM-positive; 12 corrupted; 86,071 YOLO-processed; 85,042 with detections; 182,328 signatures. | §IV:13 | v3 baseline reports 86,071 processed, 85,042 with detections, and 182,328 signatures (v3:14, 20-22). The 86,072/12 sequence is inherited from the v3 narrative already cited in round 23. | Confirmed; round-23 denominator conflation is fixed. |
| Big-4 signature sample: 150,453 loaded, 150,442 classified, 11 missing descriptors. | §IV:175 | Script 42 reports loaded 150,453, classified 150,442, unclassified 11 (five_way_report:14-16). | Confirmed. |
| K=2 marginal crossings and bootstrap CIs: cos 0.9755, dHash 3.755, CIs [0.9742, 0.9772] and [3.476, 3.969]. | §IV:62-65; §III:51, 59-60 | Script 36 reports cos point 0.9755 and dHash point 3.7549 with those CIs (calibration_loo_report:14-17). | Confirmed. |
| K=3 components: C1 0.9457/9.17/0.143; C2 0.9558/6.66/0.536; C3 0.9826/2.41/0.321. | §IV:67-75; §III:61-69 | Scripts 35/37/38 report the same baseline (inspection_report:6-10; k3_loo_report:6-10; convergence_report:8-12). | Confirmed. |
| K=3 lower than K=2 by 3.48 BIC points. | §IV:75; §III:69 | Script 36 reports K=2 BIC -1108.45 and K=3 BIC -1111.93 (calibration_loo_report:9-10). | Confirmed by arithmetic. |
| Spearman correlations: 0.9627, 0.8890, 0.8794, with p-values bounded in manuscript. | §IV:81-89; §III:91-99 | Script 38 reports 0.9627 / 3.92e-249, 0.8890 / 1.09e-149, 0.8794 / 2.73e-142 (convergence_report:26-30). | Confirmed. |
| Per-firm score nuance: Firm C highest on P(C1)=0.3110 and hand_frac=0.7896; Firm D higher on reverse-anchor score -0.7125 vs Firm C -0.7672. | §IV:95-102; §III:99 | Script 38 per-firm summary reports those values (convergence_report:43-48). | Confirmed; §III should anonymise KPMG/EY parentheticals. |
| K=3 LOOO C1 weight drift is 0.023, not 0.025. | §IV:139; §III:71, 115, 173 | Script 37 reports max C1 weight deviation 0.023 (k3_loo_report:77-79). | §IV confirmed; §III mismatch remains. |
| Pixel-identical Big-4 subset n=262, split 145/8/107/2, all classifiers 0% miss with Wilson upper 1.45%. | §IV:145-153; §III:117-127 | Script 40 reports total 262, 262/262 correct for all three classifiers, and per-firm split 145/8/107/2 (far_report:8, 12-18, 22-27). | Confirmed. |
| Five-way per-signature counts: HC 74,593; MC 39,817; HSC 314; UN 35,480; LH 238. | §IV:165-175 | Script 42 reports the same counts and percentages (five_way_report:20-26). | Confirmed. |
| Per-firm five-way percentages: Firm A 81.70/10.76/0.05/7.42/0.07; Firm B 34.56/35.88/0.29/29.09/0.18; Firm C 23.75/41.44/0.38/34.21/0.22; Firm D 24.51/29.33/0.22/45.65/0.29. | §IV:181-186, 213 | Script 42 reports the same percentages (five_way_report:39-44). | Confirmed; interpretation is now appropriately descriptive. |
| Document-level counts: n=75,233 PDFs; HC 46,857; MC 19,667; HSC 167; UN 8,524; LH 18; mixed-firm PDFs n=379. | §IV:190-200 | Script 42 reports n=75,233, mixed-firm n=379, and those category counts (five_way_report:46-57). | Confirmed. |
| Full-dataset robustness: full n=686; component rows; full rho 0.9558; drift 0.0069. | §IV:232-250; §III:23 | Script 41 reports Big-4 n=437, full n=686, component drifts, BICs, rho 0.9558, and drift 0.0069 (fulldataset_report:8-31). | Confirmed; add §III provenance row for n=686. |
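
Several of the rows above can be re-derived from the quoted figures alone, without rerunning any script. The sketch below does that for the category totals, the BIC gap, and the Wilson upper bound. It assumes the bound is the standard two-sided 95% Wilson score interval (z = 1.96), which this review does not state explicitly; the function and variable names are illustrative only.

```python
import math

def wilson_upper(successes, n, z=1.96):
    """Upper limit of the Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre + half_width) / denom

# Counts copied from the table rows above (HC, MC, HSC, UN, LH order).
five_way_signatures = [74_593, 39_817, 314, 35_480, 238]
five_way_documents = [46_857, 19_667, 167, 8_524, 18]
pixel_identical_split = [145, 8, 107, 2]

# Category counts must add up to the reported totals.
assert sum(five_way_signatures) == 150_442
assert 150_453 - sum(five_way_signatures) == 11  # missing descriptors
assert sum(five_way_documents) == 75_233
assert sum(pixel_identical_split) == 262

# K=2 BIC minus K=3 BIC, using the two values quoted for Script 36.
bic_gap = round(-1108.45 - (-1111.93), 2)

# Upper bound on the miss rate when 0 of 262 pixel-identical cases are missed.
miss_upper = wilson_upper(0, 262)

print(f"BIC gap (K=2 minus K=3): {bic_gap}")
print(f"Wilson upper bound for 0/262 misses: {miss_upper:.2%}")
```

Under the z = 1.96 assumption, the two printed values agree with the 3.48 BIC points and the 1.45% Wilson upper bound reported in the table.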

Phase 4 Readiness

Partial.

The empirical tables are close to partner-review ready and I do not see a need to rerun the main v4 scripts for §IV. The remaining issues are mostly manuscript hygiene, pseudonym consistency, and cross-reference/provenance alignment. They are small edits, but they are visible enough that I would not send the paired §III/§IV package to partner review until they are fixed.

  1. Scrub §III v4 for real firm names/aliases. Replace "held-out-EY" and "Firms B (KPMG) and D (EY)" with Firm A-D language, or explicitly abandon the pseudonym policy everywhere.

  2. Align K=3 LOOO weight drift to Script 37 throughout §III: use 0.023 (or 0.0235 if exact precision is preferred), matching §IV:139.

  3. Fix the remaining stale cross-references: §III:119 should point to current §IV-I / inherited v3.20.0 §IV-F.1 Table X; §III:198 should not refer to current §IV-F for a possible moderate-band analysis.

  4. Make the §III-L low-cosine rule inclusive: Likely hand-signed is cos <= 0.837, matching Script 42 and §IV:19.

  5. Remove or move internal draft notes, cross-reference indices, open questions, and the §IV Phase 3 close-out checklist before partner review. At minimum, fix stale "v2/v3/round 23" text.

  6. Finalise table numbering after deciding whether Table XV-B is acceptable. If the current v4 sequence starts at Table V, remove residual "Tables IV-XVIII" wording.

  7. Add §III provenance for the full-dataset n = 686 claim if it remains in §III-G; cite Script 41 / fulldataset_report.md.
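
Item 4 turns on a single boundary value, so the difference between the two wordings is easy to show. This is a minimal sketch of the low-cosine band only, not the full five-way classifier; both function names are hypothetical, and the only figure taken from this review is the 0.837 crossover.

```python
KDE_CROSSOVER = 0.837  # all-pairs KDE crossover threshold quoted in this review

def low_cosine_band_inclusive(cosine):
    """Rule as stated for Script 42 and §IV:19: cos <= 0.837."""
    return cosine <= KDE_CROSSOVER

def low_cosine_band_strict(cosine):
    """Reading implied by the older §III-L wording 'Cosine below'."""
    return cosine < KDE_CROSSOVER

# The two readings differ only for a signature sitting exactly on the threshold.
for cosine in (0.836, 0.837, 0.838):
    print(cosine, low_cosine_band_inclusive(cosine), low_cosine_band_strict(cosine))
```

A signature with cosine exactly 0.837 falls in the band under the inclusive rule and outside it under the strict reading, which is why the §III-L text should say "at or below".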