pdf_signature_extraction/paper/codex_review_gpt55_v4_round3.md
gbanyan ce33156238 Apply codex round-23 corrections: §IV v3 + §III v4
Codex round 23 returned Major Revision on §IV v2: 6 Major + 6
Minor + 5 Editorial findings. Codex confirmed the spike-script
provenance is mostly sound -- no scripts needed rerunning -- so
v3 applies presentation-level fixes only.

Decisions baked in:
  - Anonymisation: maintain Firm A-D pseudonyms throughout the
    manuscript body; remove (Deloitte) / (KPMG) / (PwC) / (EY)
    parentheticals from all v4 §IV tables.
  - Table numbering: v4 tables use fresh V-XVIII (plus Table XV-B);
    inherited v3.x tables are cited only as "v3.20.0 Table N" with
    the original v3 number, NOT renumbered into the v4 sequence.

§IV v3 changes:
  1. Detection denominator rewritten: 86,072 VLM-positive / 12
     corrupted / 86,071 YOLO-processed / 85,042 with-detections /
     182,328 signatures (matches v3.x §IV-B exact wording).
  2. All v4 table labels stripped of "(revised:" / "(NEW:"
     prefixes; replaced with clean "Table N. <descriptor>." form.
  3. Real firm names removed from all tables: 4 replace_all edits.
  4. Line 211 MC-ordering claim removed: MC occupancy is no longer
     described as "consistent with the §III-K Spearman convergence"
     because MC fraction is not monotone in per-CPA hand-leaning
     ranking. New language: descriptive only, with Firm D / Firm B
     ordering counterexample stated.
  5. Line 184 81.70% vs 82.46% qualified as "qualitative
     alignment, not like-for-like consistency check" (different
     units: per-signature class vs per-CPA hard cluster).
  6. Line 43 BD-transition "histogram-resolution artefacts"
     softened to "scope-dependent and not used operationally";
     no specific bin-width artefact claim without sensitivity
     sweep evidence.
  7. K=3 LOOO C1 weight drift corrected: 0.025 -> 0.023 (matches
     Script 37 max deviation 0.0235 / rounded 0.023).
  8. Seed coverage in §IV-A updated: "Scripts 32-42" (was
     "Scripts 32-41", missed Script 42).
  9. Low-cosine cutoff inclusivity: cos < 0.837 -> cos <= 0.837
     (matches Script 42 rule definition).
  10. "round-22 Light scope" process note removed from
      manuscript prose in §IV-K.
  11. §IV-L ablation pointer corrected: v3.20.0 §IV-I (was
      §IV-H.3); v3.20.0 Table XVIII clarified as different from
      v4 Table XVIII.
  12. Line 75 "Component recovery verified across Scripts 35,
      37, 38" rewritten: "the full-fit baseline is reproduced
      in Scripts 35, 37, 38" with explicit note that Script 37
      LOOO fold-specific components differ by design.
  13. Line 110 grammar: "This convergent-checks evidence" ->
      "These convergence checks".
  14. Draft note marked "internal -- remove before submission".

§III v4 changes (cross-reference cleanup):
  1. Line 13 cross-reference repaired: "§IV-D, §IV-F, §IV-G"
     (which are now accountant-level v4 analyses) replaced with
     accurate signature-level references (§IV-J for five-way
     counts; §IV-I for inherited inter-CPA FAR).
  2. Line 23 cross-reference repaired: "all §IV results except
     §IV-K" replaced with explicit list of v4-new vs inherited
     sub-sections.
  3. Line 109 cross-reference repaired: moderate-band capture-
     rate evidence cited as "v3.20.0 Tables IX, XI, XII, XII-B"
     (was "§IV-F", which is now Convergent Internal-Consistency
     Checks, not capture-rate).
  4. Line 131 "without recalibration" claim narrowed: §III-K's
     convergent-checks evidence is now scoped to the binary
     high-confidence rule only; the moderate-confidence band,
     style-consistency band, and document-level aggregation
     are retained by reference to v3.20.0 calibration, not
     claimed as v4.0-validated.

Outstanding open questions: 3 procedural items remain (§IV
table numbering finalisation, §IV-A-C content audit, Phase 4
prose); no methodology blockers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:03:33 +08:00


Paper A Round 23 Review - v4 round 3

Reviewer: gpt-5.5 xhigh
Date: 2026-05-12
Target: paper/v4/paper_a_results_v4_section_iv.md (§IV v2)
Cross-checked against: paper/v4/paper_a_methodology_v4_section_iii.md (§III v3), round-21/22 reviews, paper/paper_a_results_v3.md, and the supplied spike reports.

Verdict

Major Revision.

The empirical core of §IV v2 is much stronger than the earlier methodology drafts: most new Big-4 numerical tables match the spike reports. The blockers are presentation and provenance risks that reviewers will catch quickly: table numbering is not coherent, several §III cross-references now point to the wrong §IV material, the inherited detection count is misstated, and the draft says firm anonymisation is maintained while repeatedly printing real firm names.

Major findings

  1. Table numbering is not coherent enough for partner review.

    §IV v2 says provisional numbering covers Tables IV-XVIII (line 3), and line 13 says v3.20.0 Table IV is "retained as Table IV here." But the file does not actually include a Table IV block; the first displayed v4 table is Table V at line 23. Line 17 also cites the inherited all-pairs analysis as "v3.20.0 §IV-C, Table V," while line 23 reuses Table V for the new Big-4 dip test. That is acceptable only if the inherited table is explicitly not a v4 table; otherwise Table V is duplicated.

    The same issue recurs at the end: line 240 assigns current Table XVIII to the full-dataset Spearman robustness table, while line 254 says the inherited backbone ablation is "Table XVIII in v3.20.0." If the ablation is retained in the v4 manuscript, it cannot also be current Table XVIII. Fix by deciding which inherited v3 tables are reprinted/renumbered versus cited only as v3.x provenance.

  2. §III v3 contains stale cross-references that §IV v2 does not support as written.

    §III line 13 says signature-level capture-rate analyses are in §IV-D, §IV-F, and §IV-G. In §IV v2, those are accountant-level distributional characterisation, internal-consistency checks, and LOOO reproducibility. This is a direct cross-reference failure.

    §III line 23 says "all §IV results except §IV-K" are Big-4 restricted. §IV v2 itself is narrower and more accurate at line 9: §IV-D through §IV-J are Big-4 primary, while §IV-K is full-dataset robustness. But §IV-A-C are inherited full-corpus setup/detection/all-pairs material, §IV-I is inherited full-corpus inter-CPA FAR, and §IV-L is an inherited corpus-wide ablation. §III must be changed to match the actual results section.

    §III line 109 says the moderate-confidence band retains v3.x capture-rate evaluation in "§IV-F"; in current §IV, §IV-F is not the inherited v3 capture-rate section. It should cite v3.x Tables IX/XI/XII/XII-B or current §IV-J's inheritance note, not current §IV-F.

  3. The inherited detection-count sentence is numerically wrong / ambiguous.

    §IV line 13 says "182,328 detected signatures across 86,072 prefiltered audit-report PDFs." The v3 baseline distinguishes these counts: VLM screening identified 86,072 documents with signature pages, 12 corrupted PDFs were excluded, and batch YOLO inference ran on 86,071 documents; v3 Table III then reports 85,042 documents with detections and 182,328 extracted signatures. Current line 13 collapses these stages and assigns the 182,328 signatures to the wrong denominator.

    Suggested rewrite: "VLM screening identified 86,072 signature-page documents; after 12 corrupted PDFs were excluded, YOLO batch inference processed 86,071 documents, with 85,042 yielding detections and 182,328 extracted signatures."

  4. The draft claims firm anonymisation is maintained, but the §IV tables reveal real firm names.

    §III line 23 says the Big-4 firms are pseudonymously labelled Firm A-D. §IV line 265 says firm anonymisation is "maintained throughout §IV (Firm A-D used consistently)." That is false: real names appear in the displayed result tables at lines 93-96, 120-123, 132-135, 179-182, 204-207, and 217-220.

    Either remove the parenthetical real names everywhere in §IV or explicitly abandon the pseudonym policy in §III and the close-out checklist. Given prior review history, this should be fixed before partner review.

  5. Some interpretive claims overstate what the spike results prove.

    The clearest overstatement is at line 211, which says the non-Firm-A moderate-confidence proportions are "consistent with the §III-K Spearman convergence on the per-CPA hand-leaning ranking." The MC ordering is C (41.44%), B (35.88%), D (29.33%), while Table X's hand-leaning scores rank D above B on all three score summaries and rank D above C on the reverse-anchor score. MC-band occupancy is therefore not a monotone proxy for the per-CPA hand-leaning ranking; D's mass moves heavily into Uncertain instead.

    Line 184 also compares Firm A's signature-level HC rate (81.70%) to its accountant-level C3 rate (82.46%). The numbers are close and the qualitative reading is reasonable, but they are different units. State this as qualitative alignment, not as a like-for-like consistency check.

    Line 43 calls off-Big-4 dHash transitions "consistent with histogram-resolution artefacts." Script 32 verifies varying dHash transitions; it does not by itself prove a bin-width artefact analysis for those accountant-level subsets. "Scope-dependent and not used operationally" is safer.

  6. The moderate-confidence band is honestly disclosed as inherited, but the support language still needs narrowing.

    §IV line 211 correctly states that Scripts 38-40 do not separately validate the MC band. That is good. But §III line 131 still says the binary-rule internal-consistency checks support continued use of the inherited five-way rule "without recalibration." That is stronger than the evidence: the v4 kappa/Spearman checks cover the binary high-confidence box rule, not the MC band or document-level worst-case aggregation. The defensible wording is: v4 reports Big-4 outputs for the inherited five-way rule; the MC band remains v3-calibrated and not newly validated in Scripts 38-42.
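
The ordering argument in finding 5 can be checked mechanically. The following is a minimal sketch, assuming only the MC-band proportions quoted there and the one pairwise relation that finding says holds on all three score summaries (Firm D ranks above Firm B on hand-leaning). The function name and data layout are hypothetical and are not taken from the spike scripts.

```python
# MC-band occupancy (%) for the non-Firm-A firms, as quoted in finding 5.
mc_share = {"C": 41.44, "B": 35.88, "D": 29.33}

# Pairwise hand-leaning relations stated in finding 5, as (higher, lower).
# Only the relation reported on all three score summaries is listed here.
hand_leaning_pairs = [("D", "B")]


def discordant_pairs(share, pairs):
    """Return the pairs where the higher-ranked firm has the lower MC share."""
    return [(hi, lo) for hi, lo in pairs if share[hi] < share[lo]]


violations = discordant_pairs(mc_share, hand_leaning_pairs)
print(violations)
```

A single discordant pair is enough to show that MC occupancy is not monotone in the hand-leaning ranking, which is why the per-firm MC proportions should be presented descriptively only.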

Minor findings

  1. K=3 LOOO C1 weight drift does not match the report. §IV line 137 reports max C1 weight deviation as 0.025. Script 37's report says 0.023, and the JSON gives 0.023489. Use 0.023 or 0.0235.

  2. Seed coverage statement stops at Script 41. §IV line 7 says seeds are fixed across Scripts 32-41, but v2 depends on Script 42 for Tables XV and XV-B. Either include Script 42 if true or say "stochastic v4 spike scripts" rather than implying a complete script range.

  3. Inclusivity of the low-cosine cutoff should match Script 42. §IV line 17 says cosine < 0.837 implies Likely-hand-signed; Script 42 defines LH as cos <= 0.837. Align §III-L and §IV-C/J exactly.

  4. The "round-22 open question 1, Light scope" process note is not traceable to the round-22 review file. §IV line 228 may reflect an author decision outside the supplied review, but it should be removed from manuscript prose or backed by an internal note.

  5. The ablation section pointer is wrong. §IV line 252 says the inherited feature-backbone ablation is from v3.x §IV-H.3, but in paper/paper_a_results_v3.md it is §IV-I, beginning at line 461.

  6. Line 73's "component recovery ... across Scripts 35, 37, and 38" can be misread. Script 37's full-baseline block replicates Script 35, but the LOOO fold components vary by design. Say "the full-fit baseline is reproduced in Scripts 35, 37, and 38" if that is the intended claim.
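
Minor findings 1 and 3 both come down to small numeric conventions, which the sketch below illustrates. It assumes the JSON value 0.023489 and the 0.837 cutoff quoted above. The helper names are hypothetical, and the half-up rounding mode is an assumption rather than something the spike scripts are known to use.

```python
from decimal import Decimal, ROUND_HALF_UP

# Finding 1: round the JSON drift value to 3 and 4 decimal places.
drift = Decimal("0.023489")
drift_3dp = drift.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
drift_4dp = drift.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
print(drift_3dp, drift_4dp)

# Finding 3: strict versus inclusive low-cosine cutoff.
CUTOFF = Decimal("0.837")


def lh_strict(cos):
    """Likely-hand-signed under the strict reading (cos < cutoff)."""
    return cos < CUTOFF


def lh_inclusive(cos):
    """Likely-hand-signed under the inclusive reading (cos <= cutoff)."""
    return cos <= CUTOFF


# The two readings disagree only for a value exactly at the cutoff.
boundary = Decimal("0.837")
print(lh_strict(boundary), lh_inclusive(boundary))
```

Neither rounding produces 0.025, and the two cutoff rules differ only at the boundary value itself, so the fix is a wording alignment rather than a change to the counts unless a signature sits exactly at 0.837.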

Editorial nits

  1. Remove the draft note and Phase 3 close-out checklist before submission, or move them to an internal author note.

  2. Line 110: "This convergent-checks evidence" should be "These convergence checks" or "This convergence evidence."

  3. Line 3: "is finalised" should be "will be finalised" while numbering remains provisional.

  4. Standardise "dHash" versus "dh" in tables and prose; the spike reports use dh, but the paper body mostly uses dHash.

  5. Avoid mixing "replicated," "templated," and "non-hand-signed" as if they are exact synonyms. The paper's caveats rely on preserving those distinctions.

Provenance verification table

| §IV v2 claim | §IV lines | Source checked | Status |
|---|---|---|---|
| Big-4 primary scope: 437 CPAs and 150,442 signatures with both descriptors. | 9 | Script 36 report lines 6, 32-37; Script 39 report line 12. | Confirmed. |
| Detection inheritance: 182,328 signatures across 86,072 PDFs. | 13 | v3 results lines 14, 17-25; v3 methodology search hits distinguish 86,072 VLM-positive, 86,071 processed, 85,042 with detections. | Needs correction; denominator conflated. |
| All-pairs KDE crossover at 0.837. | 17 | v3 results lines 49 and 118; Script 42 rule lines 6-10 uses 0.837. | Confirmed; fix < vs <= wording. |
| Big-4 dip-test p-values reported as < 5 x 10^-4. | 27, 32 | Script 36 report lines 6-8; Script 34 report lines 28-31; bootstrap resolution stated in §IV line 32. | Confirmed with reporting convention. |
| Firm A / Big4-non-A / all-non-A dip p-values: 0.992/0.924, 0.998/0.906, 0.998/0.907. | 28-30 | Script 32 report lines 30, 40, 62, 72, 94, 104. | Confirmed after rounding. |
| BD/McCrary Big-4 null and non-A dHash transitions at 10.8 and 6.6. | 38-41 | Script 34 report lines 28-31; Script 32 report lines 40-41 and 72-73. | Confirmed; artefact interpretation not directly proven. |
| K=2 components, crossings, bootstrap CIs, and BIC. | 53-63 | Script 34 report lines 23-41; Script 36 report lines 12-28. | Confirmed. |
| K=3 component centers/weights and BIC lower by 3.48. | 69-73 | Script 35 report lines 6-10; Script 34 report lines 40-49; Script 36 report lines 9-10. | Confirmed. |
| Spearman correlations 0.9627, 0.8890, 0.8794 and non-Big-4 reference center 0.935/9.77. | 83-87 | Script 38 report lines 16-18 and 24-30. | Confirmed. |
| Per-firm score summaries in Table X. | 93-98 | Script 38 report lines 43-48. | Confirmed; anonymisation violation. |
| Cohen kappas 0.662, 0.559, 0.870 and per-signature K=3 centers. | 106-110 | Script 39 report lines 16-28. | Confirmed after rounding. |
| K=2 LOOO fold rules and all-or-none held-out classifications. | 120-125 | Script 36 report lines 32-44 and JSON stability summary. | Confirmed. |
| K=3 LOOO C1 fold rates and P2_PARTIAL. | 131-137 | Script 37 report lines 16-19, 25-90, 92-99; JSON exact drift values. | Confirmed, except weight drift should be 0.023/0.0235 not 0.025. |
| Pixel-identity subset n=262, split 145/8/107/2, 0/262 miss rate, Wilson upper 1.45%. | 147-153 | Script 40 report lines 8, 12-18, 22-27. | Confirmed. |
| Inter-CPA FAR 0.0005 with Wilson [0.0003, 0.0007] inherited from v3. | 157 | v3 results lines 182-190 and 263-275. | Confirmed as inherited, not v4-regenerated. |
| Five-way per-signature counts and 11 excluded signatures. | 167-173 | Script 42 report lines 14-26. | Confirmed. |
| Per-firm five-way percentages. | 179-184 | Script 42 report lines 30-44. | Confirmed; line 211 interpretation is not supported. |
| Document-level overall counts, n=75,233, mixed-firm PDFs n=379. | 188-198 | Script 42 report lines 46-57; JSON document_level. | Confirmed. |
| Single-firm per-document rows. | 204-209 | Script 42 report lines 59-66. | Confirmed. |
| Full-dataset robustness components, BIC, Spearman rho. | 234-248 | Script 41 report lines 8-31. | Confirmed. |
| Feature-backbone ablation inherited from v3.x Table XVIII. | 252-254 | v3 results lines 461-475. | Inherited content confirmed, but v3 section pointer and current v4 table numbering collide. |
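
The Wilson upper bound in the pixel-identity row can be recomputed from the counts alone. This is a minimal sketch, assuming a two-sided 95% interval (z ≈ 1.96); the confidence level is not stated in this review, and the function below is illustrative rather than the Script 40 implementation.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile (an assumption here).
    """
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# 0 misses out of 262 pixel-identical signatures, as in the table row above.
lower, upper = wilson_interval(0, 262)
print(f"{upper:.2%}")
```

Under that assumption the upper bound comes out at about 1.45%, matching the figure confirmed against the Script 40 report.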

Cross-reference checks (§III -> §IV)

| §III v3 claim | §III lines | §IV v2 support | Status |
|---|---|---|---|
| Signature-level capture-rate analyses are in §IV-D/F/G. | 13 | Current §IV-D/F/G are accountant-level dip/mixture, internal consistency, and LOOO. | Fails; stale v3 cross-reference. |
| All §IV results except §IV-K are Big-4 restricted. | 23 | §IV-A-C, §IV-I, and §IV-L are inherited full-corpus/corpus-wide material. | Fails; narrow to "primary v4 analyses §IV-D-J except inherited §IV-I." |
| Big-4 scope is 437 CPAs / 150,442 signatures. | 23 | §IV lines 9, 163 and Script 39. | Supported. |
| Dip-test and BD/McCrary distributional characterisation. | 47-53 | §IV Tables V-VI, lines 23-43. | Supported. |
| K=2 and K=3 mixture components and mild BIC preference. | 51, 59-73 | §IV Tables VII-VIII, lines 49-73. | Supported. |
| K=2 unstable and K=3 descriptive only under LOOO. | 71-79, 111-115 | §IV Tables XII-XIII, lines 116-137. | Supported. |
| Three-score internal consistency and per-firm ranking nuance. | 83-100 | §IV Tables IX-X, lines 79-100. | Supported. |
| Per-signature K=3 convergence kappas. | 101-109 | §IV Table XI, lines 102-110. | Supported. |
| Pixel-identity positive-anchor miss rate. | 117-127 | §IV Table XIV, lines 141-153. | Supported. |
| Five-way signature/document classifier retained as primary; K=3 not used for operational labels. | 131-149 | §IV-J, lines 159-224. | Mostly supported; the MC band remains inherited and current wording should not imply v4 validation. |
| Moderate-confidence band retains v3.x capture-rate evaluation. | 109, 145, 198 | §IV line 211 cites v3 Tables IX/XI/XII but not XII-B; §III line 109's "§IV-F" is now wrong. | Needs citation cleanup. |
| Firm anonymisation maintained. | 23 and open question 200 | §IV repeatedly includes real firm names in parentheses. | Fails unless policy changes. |

Recommended revisions

  1. Freeze the v4 table scheme before any prose edits: decide whether inherited v3 tables are reprinted as current v4 tables, cited only as v3 tables, or moved to appendix/supplement. Then renumber Tables IV-XVIII and remove Table XV-B if the journal style cannot handle letter suffixes.

  2. Fix §III cross-references after the table scheme is frozen, especially §III line 13, §III line 23, and §III lines 109/119/145.

  3. Correct §IV line 13's detection denominator and restate the VLM-positive / corrupted-excluded / YOLO-processed / with-detections sequence.

  4. Remove all real firm names from §IV or explicitly change the anonymisation policy. Do not leave line 265 claiming anonymisation while tables reveal names.

  5. Delete or rewrite line 211's MC-ordering claim. If the MC band remains inherited, present the per-firm MC proportions descriptively only.

  6. Narrow the support claim for the five-way rule: Scripts 38-40 validate only the binary high-confidence rule, while Script 42 reports five-way output counts. Either add a Big-4-specific MC/document validation or state plainly that MC/document validation is inherited from v3.x.

  7. Fix small numeric/provenance issues: K=3 weight drift 0.023/0.0235, Script 42 seed wording, cutoff inclusivity, v3 ablation section pointer, and the unsupported "round-22 Light scope" process note.
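
Stale cross-references of the kind listed in item 2 can be caught with a small lint pass once the section scheme is frozen. The sketch below is hypothetical throughout: the scope map only encodes how this review characterises the current §IV sub-sections, the two sentences are illustrative paraphrases rather than manuscript text, and no such script exists in the repository as far as this review knows.

```python
import re

# Scope of each current §IV sub-section, as characterised in this review.
SECTION_SCOPE = {
    "D": "accountant-level",  # distributional characterisation
    "F": "accountant-level",  # internal-consistency checks
    "G": "accountant-level",  # LOOO reproducibility
    "I": "signature-level",   # inherited inter-CPA FAR
    "J": "signature-level",   # five-way counts
}

REF = re.compile(r"§IV-([A-L])\b")


def stale_refs(sentence, expected_scope):
    """List §IV references in `sentence` whose scope differs from expected."""
    return [
        f"§IV-{sec}"
        for sec in REF.findall(sentence)
        if SECTION_SCOPE.get(sec) != expected_scope
    ]


# Illustrative paraphrases of the stale and repaired §III line 13 wording.
old = "signature-level capture-rate analyses are in §IV-D, §IV-F, and §IV-G"
new = "five-way counts are in §IV-J; inherited inter-CPA FAR is in §IV-I"

print(stale_refs(old, "signature-level"))
print(stale_refs(new, "signature-level"))
```

Keeping the scope map next to the manuscript source would let the same check be rerun after every renumbering, although a regex pass of this kind only catches references whose target scope has been recorded.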

Phase 4 readiness assessment

Not ready for partner review without Phase 4 revisions.

The spike-script provenance for the new Big-4 result tables is mostly sound, so I do not see a need to rerun the main v4 empirical scripts solely to fix §IV. But the current section would invite reviewer attacks on table identity, stale cross-references, anonymisation, and overinterpretation of the inherited MC band. After those are corrected, §IV should be close to partner-review ready; the only substantive open decision is whether to add a new Big-4-specific validation for the moderate-confidence/document-level rule or keep it explicitly inherited from v3.x.