pdf_signature_extraction/paper/codex_review_gpt55_v4_round2.md
gbanyan c8c7656513 Apply codex round-22 corrections to §III v3 (Minor -> ready)
Codex gpt-5.5 round 22 returned Minor Revision after v2 closed
3/5 Major findings fully and 2/5 partially. Five narrow fixes
applied for v3:

  1. Per-firm ranking unanimity corrected (v2:93). The reverse-
     anchor score ranks Firm D fractionally higher than Firm C
     (-0.7125 vs -0.7672); only Scores 1 and 3 rank Firm C
     highest. The unanimity claim was wrong; v3 prose now says
     all three agree on Firm A as most replication-dominated
     and on the non-Firm-A Big-4 as more hand-leaning, with a
     modest disagreement on Firm C vs D ordering.

  2. "Smallest scope" / "any single firm" overclaim narrowed
     (v2:21, v2:43). Script 32 only tested Firm A alone, big4_non_A
     pooled, and all_non_A pooled -- not Firms B, C, D individually.
     v3 explicitly says "comparison scopes tested in Script 32"
     and notes single-firm dip tests for B, C, D were not
     separately computed.

  3. K=3 hard label vs posterior in Spearman correctly
     attributed (v2:143). Script 38 uses the K=3 posterior P(C1),
     not the hard label, in the internal-consistency Spearman
     correlations. v3 §III-L now correctly says the hard label
     is for the §IV cluster cross-tabulation while the posterior
     is the continuous Score 1 in §III-K.

  4. Provenance source for n=150,442 corrected (v2:17, v2:152).
     Script 39 directly reports this count in its per-signature
     K=3 fit; Script 38's report does not. v3 cites Script 39 for
     this number.

  5. "Max fold-to-fold deviation" wording made precise (v2:65,
     v2:107). The $0.028$ value is the max absolute deviation
     from the across-fold mean (Script 36 stability summary), not
     the pairwise across-fold range (which is $0.0376 = 0.9756 -
     0.9380$). v3 reports both statistics with explicit
     definitions.

Also: draft note at top now records v2 (round-21) and v3
(round-22) revision lineage. Cross-reference index and open-
question block retained as author working checklist (will be
removed before manuscript submission per codex e7).

Outstanding open questions reduced to 3 (codex round-22 view):
  - Five-way moderate-confidence band: validate in Big-4 specifically
    (Phase 3 §IV-F work) or document as inherited from v3.x?
  - Firm anonymisation policy in §IV-V (procedural)
  - §IV table numbering (procedural; defer until §IV done)

Phase 2 §III draft is now Minor-Revision-quality. Ready for
Phase 3 (Results regeneration §IV).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:26:02 +08:00

Paper A v4.0 Methodology Section III-G through III-L Peer Review

Reviewer: gpt-5.5 xhigh
Date: 2026-05-12
Round number: 22 (v4 round 2)
Review target: paper/v4/paper_a_methodology_v4_section_iii.md

Verdict

Minor Revision.

v2 closes most of the round-21 blockers: K=3 is no longer the operational classifier, the "independent lenses" claim is softened, the pixel-identity metric is no longer called FAR in the draft, and the main empirical slips are corrected. The remaining issues are narrower but still need edits before accepting the methodology text, especially the false per-firm ordering claim in §III-K and the unresolved validation status of the five-way moderate-confidence band.

Round-21 finding closure table

| Finding | Round-21 Severity | v2 Status | Evidence in v2 |
| --- | --- | --- | --- |
| M1. K=3 is not justified as an operational classifier. | Major | CLOSED | v2 explicitly says both K=2 and K=3 are descriptive and not used for signature/document labels (v2:51, v2:67-73, v2:143). It also reports Script 37 P2_PARTIAL and the "not predictively useful as an operational classifier" implication (v2:65, v2:109). |
| M2. "Three independent lenses" overstates independence and validation strength, and reverse-anchor direction was wrong. | Major | PARTIAL | The independence and reverse-anchor wording are fixed: the scores are "not statistically independent" and only internal-consistency checks (v2:75-83), and the reference is now described as less replication-dominated (v2:35-37). However, v2 adds a false per-firm ordering claim that all three scores make Firm C most hand-leaning (v2:93); Script 38's reverse-anchor mean instead ranks Firm D highest. |
| M3. Classifier conflation; only the simplified binary rule was validated. | Major | PARTIAL | v2 now declares the inherited five-way box rule as primary (v2:123-143) and K=3 as descriptive (v2:143). It also correctly notes that the kappa comparison validates only the binary high-confidence rule, not the five-way moderate band (v2:103). The unresolved moderate-band validation is still open (v2:190-192), and v2:125 still uses binary-rule correlations to support the full five-way rule without recalibration. |
| M4. Pixel-identity "FAR" naming and evidentiary force were wrong. | Major | CLOSED | v2 renames this to a positive-anchor miss rate, frames it as a one-sided replicated-positive check, and adds the tautology/conservative-subset caveat (v2:111-121). |
| M5. Empirical/provenance claims needed correction or explicit unverified status. | Major | CLOSED | The 0.005 denominator is now a stability tolerance, not a bootstrap CI (v2:65, v2:107); all-non-Firm-A dip values are corrected (v2:21, v2:43); BD/McCrary is narrowed to Big-4 null with external dHash transitions disclosed (v2:47); Firm A byte-decomposition details are marked inherited/not regenerated (v2:31, v2:176); "tail distorts" is softened to a scope-dependent shift (v2:19). |
| m1. Dip-test p-value precision needed bootstrap-resolution wording. | Minor | CLOSED | v2 states no bootstrap replicate exceeded the observed statistic and reports p < 5 x 10^-4 for n_boot = 2000 (v2:21, v2:43, v2:158-159). |
| m2. Delta BIC sign convention was confusing. | Minor | CLOSED | v2 defines lower BIC as preferred and reports BIC(K=3) - BIC(K=2) = -3.48, plus "K=3 lower by 3.48" (v2:45, v2:63). |
| m3. Per-signature convergence is only moderate for the box rule. | Minor | CLOSED | v2 includes the SIG_CONVERGENCE_MODERATE verdict and avoids calling the Paper A-vs-K=3 kappas strong (v2:95-103). |
| m4. Per-CPA vs per-signature component centers drift more than v1 suggested. | Minor | CLOSED | v2 says the fits recover a "broadly similar three-component ordering" and reports the C1 cosine drift of 0.018 (v2:95). |
| m5. Section III-L title was misleading. | Minor | CLOSED | The section is now titled "Signature- and Document-Level Classification" and separates per-signature categories from document aggregation (v2:123-143). |
| m6. K=3 alternative lacked document aggregation. | Minor | CLOSED | v2 no longer offers K=3 as a signature/document classifier, so a K=3 document aggregation rule is no longer required (v2:143). |
| m7. Firm anonymization was inconsistent. | Minor | CLOSED | v2 uses Firm A-D pseudonyms in the methodology text and no longer names the Big-4 firms directly in the prose (v2:17, v2:31, v2:194). |
| e1. Replace "more-replicated-population baseline." | Editorial | CLOSED | v2 now calls non-Big-4 a less-replicated external/reverse-anchor reference (v2:35-37). |
| e2. Replace "failure rate" for Lens 3. | Editorial | CLOSED | Lens 3 is now "Paper A box-rule hand-leaning rate" (v2:83). |
| e3. "Strongest single methodology-validation signal" was too strong. | Editorial | CLOSED | v2 uses "strongest internal-consistency signal" and denies external validation (v2:77, v2:93). |
| e4. "Boundary moves modestly" understated LOOO membership instability. | Editorial | CLOSED | v2 uses composition-sensitive wording and reports the 12.8 pp Firm C fold deviation (v2:65, v2:109). |
| e5. "Calibration uncertainty band of +/- 5-13 pp" wording needed correction. | Editorial | CLOSED | v2 reports observed absolute differences of 1.8-12.8 pp and the 5 pp viability bar (v2:109). |
| e6. "Operational threshold derivation" language was inaccurate. | Editorial | CLOSED | v2 consistently calls K=3 a mixture characterisation/descriptive model, not an operational threshold source (v2:49-73, v2:143). |
| e7. Cross-reference index should be removed or made internal. | Editorial | PARTIAL | v2 labels the cross-reference index as an author checklist to remove before submission (v2:181), but it remains inside the methodology draft (v2:181-188). |
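
The bootstrap-resolution wording accepted under m1 follows from simple arithmetic: when none of n_boot replicates exceeds the observed statistic, the empirical p-value can only be bounded above by 1/n_boot, not reported as zero. The sketch below is illustrative only. The function name is hypothetical and not taken from the spike scripts, and it assumes the plain exceedance-fraction convention (some analyses use (k + 1) / (n_boot + 1) instead).

```python
from typing import Tuple


def bootstrap_p_value(n_exceed: int, n_boot: int) -> Tuple[str, float]:
    """Return (relation, value) describing an empirical bootstrap p-value.

    n_exceed: number of bootstrap replicates whose statistic is at least
              as extreme as the observed one.
    n_boot:   total number of bootstrap replicates.
    """
    if n_boot <= 0 or not 0 <= n_exceed <= n_boot:
        raise ValueError("need n_boot > 0 and 0 <= n_exceed <= n_boot")
    if n_exceed == 0:
        # No replicate exceeded the observed value: the p-value is only
        # resolved down to one part in n_boot, so report an upper bound.
        return "<", 1.0 / n_boot
    return "=", n_exceed / n_boot


# Values quoted in the m1 row: no exceedances out of n_boot = 2000.
relation, value = bootstrap_p_value(0, 2000)
print(f"p {relation} {value:g}")
```

With n_boot = 2000 the bound is 1/2000 = 5 x 10^-4, which matches the figure v2 reports.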

Newly introduced issues

  1. New factual/provenance error: the three scores do not agree on the most hand-leaning firm. v2 claims that "by all three scores, Firm A is the most replication-dominated and Firm C is the most hand-leaning" (v2:93). Script 38 confirms Firm A is most replication-dominated, but not the Firm C part for all scores: mean P_C1 and mean hand_frac rank Firm C highest, while mean reverse-anchor ranks Firm D highest (-0.7125 vs Firm C -0.7672, with higher score meaning more hand-leaning). Revise to: "P_C1 and box-rule hand_frac rank Firm C highest; the reverse-anchor score ranks Firm D highest; all three agree Firm A is most replication-dominated and the non-A firms are more hand-leaning than Firm A."

  2. Unsupported scope superlative: "any single firm" / "smallest scope" is not proven by the supplied reports. v2 says no dip-test rejection holds "within any single firm pooled alone" and that Big-4 is the "smallest scope" supporting a finite-mixture model (v2:21; repeated more generally at v2:43). The supplied Script 32 report verifies Firm A alone, big4_non_A, and all_non_A; it does not report separate single-firm tests for Firms B, C, and D or all smaller combinations. Narrow this to "among the tested comparison scopes in Script 32" or add the missing single-firm tests.

  3. K=3 hard labels are incorrectly described as used in the Spearman correlations. v2:143 says the "K=3 hard label" is used for the internal-consistency Spearman correlations. Script 38's Spearman table uses the K=3 posterior score P_C1, not hard labels. Change v2:143 to "K=3 posterior score is used for the Spearman correlations; hard labels are used for the cluster cross-tabulation."

  4. Provenance table over-cites Script 38 for the Big-4 signature count. v2:17 and v2:152 attribute the 150,442 signature count partly/directly to Script 38. In the supplied markdown report, Script 39 directly reports the 150,442 signature-level cloud; Script 38's visible report does not directly state that count. Keep Script 39 as the direct source unless the JSON artifact is also cited.

  5. "Max fold-to-fold deviation" wording is imprecise. v2 reports a K=2 "max fold-to-fold deviation" of 0.028 (v2:65, v2:107). Script 36's 0.0278 is the max absolute deviation across folds as reported in the stability summary, not the pairwise fold range; the fold cut range is about 0.0376 (0.9756 - 0.9380). Use the report's exact wording or explicitly define the statistic.
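
The distinction in item 5 can be made concrete with a minimal sketch. It assumes the stability summary measures deviation from the across-fold mean, which is how the commit note for v3 reads it. The function names and the toy fold values are hypothetical; only the two endpoints 0.9380 and 0.9756 come from this review.

```python
from statistics import mean
from typing import Sequence


def max_abs_deviation_from_mean(values: Sequence[float]) -> float:
    """Largest absolute distance of any value from the mean of all values."""
    centre = mean(values)
    return max(abs(v - centre) for v in values)


def pairwise_range(values: Sequence[float]) -> float:
    """Largest pairwise difference, i.e. max minus min."""
    return max(values) - min(values)


# Hypothetical toy values, chosen only to show that the two statistics differ.
toy = [1.0, 2.0, 4.0]
print(max_abs_deviation_from_mean(toy), pairwise_range(toy))

# Fold-cut endpoints quoted in item 5; the range depends only on these two.
print(round(pairwise_range([0.9380, 0.9756]), 4))
```

The range can never be smaller than the max deviation from the mean and never more than twice it, so the reported 0.0278 and 0.0376 are compatible as two different statistics on the same folds.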

Provenance re-verification

| v2 numerical claim | v2 lines | Spike-report check | Status |
| --- | --- | --- | --- |
| Big-4 has 437 CPAs split 171 / 112 / 102 / 52. | v2:17, v2:151 | Script 36 reports 437 CPAs; Script 34 reports the four firm counts. | CONFIRMED |
| Big-4 signature-level cloud has 150,442 signatures. | v2:17, v2:95, v2:152 | Script 39 reports fitting on 150,442 signature-level points. | CONFIRMED, but source should be Script 39 rather than Script 38 in the provenance table. |
| Big-4 K=2 crossings are cos 0.9755 and dHash 3.7549, with CIs [0.9742, 0.9772] and [3.4762, 3.9689]. | v2:45, v2:53, v2:154-156 | Script 36 and Script 34 report these point estimates and bootstrap CIs. | CONFIRMED |
| K=3 components are C1 0.9457/9.1715/0.143, C2 0.9558/6.6603/0.536, C3 0.9826/2.4137/0.321. | v2:55-63, v2:163 | Scripts 35, 37, and 38 report the same centers and weights. | CONFIRMED |
| K=3 LOOO membership deviations are 1.8-12.8 pp, with P2_PARTIAL. | v2:65, v2:109, v2:168 | Script 37 reports diffs 1.76, 4.68, 5.81, 12.77 pp and verdict P2_PARTIAL. | CONFIRMED |
| Spearman correlations are 0.963, 0.889, and 0.879. | v2:85-91, v2:169 | Script 38 reports 0.9627, 0.8890, and 0.8794. | CONFIRMED |
| All three scores rank Firm C as most hand-leaning. | v2:93 | Script 38 per-firm summary ranks Firm C highest on mean P_C1 and mean hand_frac, but Firm D highest on mean reverse-anchor. | FLAGGED |
| Per-signature kappas are 0.662, 0.559, and 0.870; verdict moderate. | v2:95-103, v2:170 | Script 39 reports 0.6616, 0.5586, 0.8701 and SIG_CONVERGENCE_MODERATE. | CONFIRMED |
| Pixel-identical subset is n=262 split 145 / 8 / 107 / 2, with 0% miss rate and Wilson upper 1.45%. | v2:111-119, v2:172-173 | Script 40 reports total 262, the per-firm split, and 262/262 correct for all three candidate classifiers with Wilson [0.00%, 1.45%]. | CONFIRMED |
| Non-Firm-A dip values are 0.998/0.906 for big4_non_A and 0.998/0.907 for all_non_A. | v2:21, v2:43, v2:161-162 | Script 32 reports 0.9985/0.9055 and 0.9975/0.9065, matching v2 rounded values. | CONFIRMED |
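
The Wilson bound in the pixel-identity row can be re-derived independently of Script 40. The sketch below is a minimal check, assuming a two-sided 95% Wilson score interval; the confidence level is an assumption, since the review quotes only the resulting bounds, and the function name is hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(events: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion events / n."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need n > 0 and 0 <= events <= n")
    # Standard-normal quantile for the requested two-sided confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = events / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Miss rate on the pixel-identical subset: 0 misses out of 262 signatures.
lower, upper = wilson_interval(0, 262)
print(f"[{lower:.2%}, {upper:.2%}]")
```

For zero observed misses the upper bound reduces to z^2 / (n + z^2), roughly 3.84 / 265.84, or about 1.45%, consistent with the interval quoted from Script 40 under the 95% assumption.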

Outstanding open questions

  1. Five-way moderate-confidence validation still needs a decision. v2 is honest that the v4 kappa evidence covers only the high-confidence binary rule (v2:103, v2:190-192). If the five-way classifier remains primary, the cleanest next step is a Big-4-specific capture/FAR/cross-tab analysis for the moderate band and the document-level worst-case aggregation. If not rerun, the manuscript should explicitly state that the moderate band remains inherited from v3.x and is not newly validated by Scripts 38-40.

  2. Firm anonymisation policy still needs confirmation for §IV-V. v2 itself is pseudonymous, but the open question at v2:194 remains real: once §IV-V discuss within-Big-4 contrasts, the manuscript should consistently use Firm A-D and keep any real-name mapping out of the paper body.

  3. Section IV numbering can remain deferred. v2:196 is procedural and does not block §III acceptance; resolve after the methodology claims and result-table sequence are frozen.

Recommended actions

  1. Correct v2:93's per-firm ordering claim against Script 38.

  2. Decide whether to add a Big-4-specific validation for the five-way moderate band and document-level aggregation. If not, narrow v2:125 so binary-rule correlations do not appear to validate the full five-way classifier.

  3. Narrow the dip-test scope language at v2:21 and v2:43, or add missing individual-firm dip tests for Firms B-D.

  4. Fix v2:143 so Spearman correlations are tied to K=3 posterior scores, not K=3 hard labels.

  5. Correct the provenance table entry for the 150,442 signature count to cite Script 39 as the direct markdown-report source.

  6. Replace "max fold-to-fold deviation" with the exact Script 36 statistic or report the actual pairwise fold range.

  7. Remove the author checklist and open-question block from the manuscript version after these decisions are resolved.