gbanyan 62a22ceb83 Revise §III v4.0 draft per codex round-21 review (Major Revision -> v2)
Codex gpt-5.5 xhigh review of v1 draft returned Major Revision with
5 Major findings + 7 Minor + editorial nits. v2 addresses all of
them.

Key v2 changes:

  1. Primary classifier declared: inherited v3.x five-way per-signature
     box rule. K=3 mixture is demoted to accountant-level descriptive
     characterisation (Script 35 / Script 38 footing), explicitly NOT
     used to assign signature- or document-level labels.

  2. §III-J reframed as "Mixture Model and Accountant-Level
     Characterisation" (was "Mixture Model and Operational Threshold
     Derivation"). K=3 LOOO P2_PARTIAL verdict surfaced in prose
     including the "not predictively useful as an operational
     classifier" interpretation from the Script 37 verdict legend.

  3. §III-K renamed "Convergent Internal-Consistency Checks" (was
     "Convergent Validation") with explicit caveat that the three
     scores share underlying features and are not statistically
     independent measurements.

  4. §III-H reverse-anchor paragraph rewritten: the directional
     error in v1 (the non-Big-4 reference described as a "more-
     replicated-population baseline") is corrected -- the reference
     is in fact in the LESS-replicated regime relative to Big-4,
     and the score measures deviation in the hand-leaning direction.

  5. Pixel-identity metric renamed from "FAR" to "positive-anchor
     miss rate" with explicit conservative-subset caveat
     ("near-tautological for the box rule because byte-identical
     => cosine ~1 / dHash ~0").

  6. §III-L title changed to "Signature- and Document-Level
     Classification" (was "Per-Document Classification") and
     reorganised so the per-signature five-way rule + document-level
     worst-case aggregation are both clearly under this section.

  7. Empirical slips corrected:
     - K=2 LOOO comparison: now correctly says "5.6x the stability
       tolerance 0.005" rather than "5.6x the bootstrap CI half-width";
       full-Big-4 bootstrap half-width 0.0015 cited separately.
     - all-non-Firm-A dip: now correctly (0.998, 0.907), not "p > 0.99".
     - BD/McCrary: now narrowed to Big-4 scope (Script 34 null), with
       Script 32 dHash transitions for non-Big-4 subsets noted but
       not used as operational thresholds.
     - Firm A byte-identical "50 partners of 180 registered, 35
       cross-year" -- now explicitly inherited from v3.x §IV-F.1 /
       Script 28 / Appendix B; provenance row in the new table flags
       this as inherited, not v4-regenerated.
     - "mid/small-firm tail actively pulling" -> "the full-sample and
       Big-4-only calibrations differ" (causal language softened).
     - $\Delta\text{BIC}$ sign convention: explicit "lower BIC is
       preferred; BIC(K=3) - BIC(K=2) = -3.48".

  8. Editorial nits applied:
     - "failure rate" -> "box-rule hand-leaning rate"
     - "boundary moves modestly" -> "membership remains
       composition-sensitive"
     - "calibration uncertainty band +/- 5-13 pp" -> "observed absolute
       differences of 1.8-12.8 pp, with Firm C exceeding the 5 pp
       viability bar"
     - "strongest single methodology-validation signal" -> "strongest
       internal-consistency signal"
     - "the same component structure recovers" -> "a broadly similar
       three-component ordering recovers"
     - Cross-reference index marked as author checklist (remove
       before submission).

  9. New provenance table at end of §III mapping every numerical claim
     to (script, source, direct/derived/inherited).

  10. Open questions reduced from 5 to 3 (codex resolved questions 2,
      3, 4 with concrete answers); remaining 3 are forward-looking
      (5-way moderate band, pseudonym consistency, table numbering).

Also commits: paper/codex_review_gpt55_v4_round1.md (codex review
artifact, 143 lines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:49:59 +08:00


Paper A v4.0 Methodology Section III-G through III-L Peer Review

Reviewer: gpt-5.5 xhigh
Date: 2026-05-12
Round number: 21 (v4 round 1)
Review target: paper/v4/paper_a_methodology_v4_section_iii.md

Audit aliases used below:

  • V4: paper/v4/paper_a_methodology_v4_section_iii.md
  • V3: paper/paper_a_methodology_v3.md
  • Script36: /Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/calibration_and_loo_validation/calibration_loo_report.md
  • Script37: /Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/k3_loo_check/k3_loo_report.md
  • Script38: /Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/convergence_k3_reverse_anchor/convergence_report.md
  • Script39: /Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/signature_level_convergence/sig_level_report.md
  • Script40: /Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/pixel_identity_far/far_report.md
  • Script34 local: /Volumes/NV2/PDF-Processing/signature-analysis/reports/big4_only_pooled/big4_only_pooled_report.md
  • Script35 local: /Volumes/NV2/PDF-Processing/signature-analysis/reports/big4_k3_cluster_inspection/inspection_report.md
  • Script32 local: /Volumes/NV2/PDF-Processing/signature-analysis/reports/non_firm_a_calibration/non_firm_a_calibration_report.md

Verdict

Major Revision.

Major Findings

  1. K=3 is not yet justified as an operational classifier.

    V4 selects K=3 for the operational per-CPA classifier (V4:57, V4:67) and says the K=3/K=2 contrast justifies selecting K=3 (V4:107). The underlying Script37 verdict is weaker: P2_PARTIAL, with the explicit interpretation that the C1 cluster exists but "membership is not well-predicted by held-out fit" (Script37:92, Script37:94). The report's own legend says P2_PARTIAL means the cluster is "not predictively useful as an operational classifier" (Script37:97-99).

    The numbers support this concern. K=3 C1 component shape is stable (max deviations 0.0047 cosine, 0.955 dHash, 0.023 weight; Script37:77-79), but held-out C1 membership differs from baseline by up to 12.77 percentage points (Script37:83-90). For PwC, baseline C1 is 23.5% but held-out prediction is 36.27% (Script37:47-51, Script37:87). That is not a small operational error if the label is used to classify CPAs.

    The BIC evidence is also weak. K=3's BIC is lower than K=2's by only 3.48 points (Script36:9-10; Script34 local:40-41). That is acceptable as mild descriptive support, not as the load-bearing reason to replace a classifier. The draft should either (a) demote K=3 to a descriptive/convergent-validation model, or (b) make K=3 primary only with explicit LOOO membership uncertainty and soft-posterior reporting.
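
    To make the demotion concrete, here is a minimal sketch of the comparison at issue, assuming the per-CPA (cosine, dHash) descriptors are available as a two-column array. The data, array layout, and GaussianMixture setup are illustrative stand-ins, not the authors' pipeline; the point is that soft posteriors expose the membership uncertainty that hard C1 labels hide.

    ```python
    # Illustrative K=2 vs K=3 comparison (placeholder data, not the authors' pipeline).
    # X stands in for the (n_cpa, 2) array of per-CPA (cosine, dHash) descriptors.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = rng.normal([0.95, 5.0], [0.02, 2.0], size=(500, 2))  # placeholder data

    fits = {k: GaussianMixture(n_components=k, random_state=0).fit(X) for k in (2, 3)}
    bic = {k: m.bic(X) for k, m in fits.items()}
    print("BIC(K=3) - BIC(K=2) =", bic[3] - bic[2])  # lower BIC is preferred

    # Soft posteriors expose the membership uncertainty that hard labels hide:
    # a CPA whose max posterior is 0.55 should not be reported as a confident C1.
    post = fits[3].predict_proba(X)                  # shape (n_cpa, 3)
    frac_uncertain = (post.max(axis=1) < 0.8).mean()
    print(f"fraction with max posterior < 0.8: {frac_uncertain:.1%}")
    ```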

  2. The "three independent lenses" framing overstates independence and validation strength.

    V4 describes the convergent validation as three "independent statistical lenses" (V4:73-89). They are not independent empirical measurements. All three are deterministic functions of the same per-CPA or per-signature (cos, dHash) features:

    • Lens 1 is K=3 posterior from the same two descriptors (V4:77; Script38:6-12).
    • Lens 2 is a monotone transform of the cosine marginal only (V4:78; Script38:16-18).
    • Lens 3 is the fraction of signatures failing the same box rule cos > 0.95 AND dh <= 5 (V4:79; Script38:20-22).

    The high Spearman correlations are verified (0.9627, 0.8890, 0.8794; Script38:24-34), but they are partly mechanical agreement among feature-derived scores. They do not validate the classifier against an independent ground truth for hand-signed signatures.

    There is also a conceptual reversal in the reverse-anchor prose. V4 says the non-Big-4 reference has lower cosine and higher dHash than the Big-4 C1 center (V4:37), which is verified (reference center 0.9349/9.7670 in Script38:16-18; C1 0.9457/9.1715 in Script38:8-12). But V4 then calls this a "more-replicated-population" baseline (V4:37). Lower cosine and higher dHash indicate less replication / more hand-leaning, not more replication. A reviewer will likely catch this immediately.

  3. The draft conflates at least three classifiers and then validates only one simplified binary rule.

    V4 alternates among (i) K=3 per-CPA hard labels (V4:67), (ii) a binary Paper A box rule cos > 0.95 AND dh <= 5 (V4:69), and (iii) the inherited five-way per-signature/document rule with dh <= 5, 5 < dh <= 15, and dh > 15 bands (V4:123-135). The Script38/39 convergence results validate only the simplified binary rule non_hand iff cos > 0.95 AND dh <= 5 (Script38:20-22; Script39:8-12). They do not validate the full five-way classifier, especially the moderate non-hand-signed band 5 < dh <= 15.

    This matters because V3's inherited Section III-K explicitly treated cos > 0.95 AND 5 < dh <= 15 as "Moderate-confidence non-hand-signed" (V3:278-287). V4 keeps that category (V4:127) but cites kappa/rho evidence from a binary high-confidence-only rule (V4:121). The current prose therefore overstates what the Script39 kappa values prove.

    Recommended fix: choose a primary endpoint. If the five-way rule remains primary, validate that exact five-way rule or its declared binary collapse. If K=3 becomes primary, provide a document-level aggregation rule for K=3 and stop calling the inherited box rule the operational classifier.
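
    A sketch of what declaring the five-way rule primary would look like, including the document-level worst-case aggregation. Only the two cos > 0.95 bands quoted above are sourced from V3/V4; the remaining category names and bands are hypothetical placeholders for the inherited V3 definitions and would need to be replaced with the exact V3 §III-K categories.

    ```python
    # Hedged sketch of a per-signature five-way rule plus document-level
    # worst-case aggregation. Categories marked "placeholder" are NOT sourced.
    from enum import IntEnum

    class SigClass(IntEnum):            # ordered so max() = most hand-leaning
        HIGH_CONF_NON_HAND = 0          # cos > 0.95 and dh <= 5 (sourced)
        MODERATE_NON_HAND = 1           # cos > 0.95 and 5 < dh <= 15 (sourced)
        INDETERMINATE = 2               # placeholder name/band
        LEANING_HAND = 3                # placeholder name/band
        LIKELY_HAND = 4                 # placeholder name/band

    def classify_signature(cos: float, dh: int) -> SigClass:
        if cos > 0.95 and dh <= 5:
            return SigClass.HIGH_CONF_NON_HAND
        if cos > 0.95 and dh <= 15:
            return SigClass.MODERATE_NON_HAND
        if cos > 0.95:                  # cos high but dh > 15 (placeholder band)
            return SigClass.INDETERMINATE
        if dh <= 15:                    # placeholder band
            return SigClass.LEANING_HAND
        return SigClass.LIKELY_HAND

    def classify_document(signatures: list[tuple[float, int]]) -> SigClass:
        # Worst-case aggregation: the document takes its most hand-leaning label.
        return max(classify_signature(c, d) for c, d in signatures)
    ```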

  4. The pixel-identity validation is useful, but "FAR" is the wrong metric name and the evidentiary force is overstated.

    Script40's ground truth is a positive class: pixel-identical signatures are treated as replicated (Script40:4-8). Misclassifying them as hand-leaning is a false negative / miss rate on an easy positive-anchor subset, not a false-alarm rate in the usual classifier sense. V4 defines FAR as "probability of labelling a pixel-identical signature as hand-leaning" (V4:109), which reverses standard terminology.

    The 0/262 result is verified for all three classifiers (Script40:12-18), and the caveat that pixel-identity is necessary but not sufficient is appropriate (V4:117; Script40:29-31). But for the Paper A box rule this result is close to tautological: byte-identical nearest-neighbor signatures will have near-maximal cosine and minimal dHash. V3 was more careful, noting that FRR against byte-identical positives is trivially zero at thresholds below 1 and should be interpreted qualitatively (V3:266-268).

    Rename this metric to "pixel-identity positive-anchor miss rate" or "false-hand rate on replicated positives." Do not present it as FAR unless a true hand-signed negative anchor is evaluated.
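
    A minimal sketch of the renamed metric (function and variable names are illustrative, not the Script40 implementation):

    ```python
    # On pixel-identical (replicated) anchors, a hand-leaning label is a
    # *miss* on an easy positive subset, not a false alarm.
    def positive_anchor_miss_rate(anchors, classify_hand) -> float:
        """Fraction of pixel-identical signatures labelled hand-leaning.

        `anchors` is an iterable of (cos, dh) pairs known to be byte-identical
        replicas; `classify_hand` returns True when a signature is called
        hand-leaning. For the box rule this is near-tautological: byte-identical
        pairs have cosine ~ 1 and dHash ~ 0, so they pass cos > 0.95 and
        dh <= 5 trivially.
        """
        anchors = list(anchors)
        misses = sum(1 for cos, dh in anchors if classify_hand(cos, dh))
        return misses / len(anchors)

    box_rule_hand = lambda cos, dh: not (cos > 0.95 and dh <= 5)
    print(positive_anchor_miss_rate([(1.0, 0)] * 262, box_rule_hand))  # -> 0.0
    ```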

  5. Several empirical/provenance claims need correction or explicit "unverified" status.

    • V4 says the K=2 LOOO max cosine deviation 0.028 is 5.6x a "bootstrap CI half-width of 0.005" (V4:103). Script36 reports max deviation 0.0278 (Script36:43), but 0.005 is the stability tolerance in the verdict legend, not the bootstrap CI half-width (Script36:50-52). The full Big-4 bootstrap cosine CI half-width is 0.0015 (Script36:14-17). Correct the denominator and wording; the corrected ratios are worked out after this list.

    • V4 says all-non-Firm-A is dip-test unimodal at p > 0.99 (V4:21). Script32 local reports all-non-Firm-A cosine p = 0.9975 but dHash p = 0.9065 (Script32 local:56-76). The later detailed sentence in V4 correctly gives 0.998/0.907 (V4:43). Fix the earlier overstatement.

    • V4 says no BD/McCrary transition is identified on either axis and cites Script32/34 (V4:47). Script34 local supports no Big-4-only BD/McCrary threshold (Script34 local:28-31), but Script32 local reports dHash BD/McCrary thresholds for big4_non_A and all_non_A (Script32 local:36-44, Script32 local:68-76). Narrow the claim to the Big-4-only analysis or explain why Script32 subset transitions are not used.

    • The Firm A byte-identical claim is partly verified. Script40 verifies 145 Firm A pixel-identical signatures inside the 262 Big-4 total (Script40:20-27). The added details "50 distinct Firm A partners," "of 180 registered," and "35 span different fiscal years" appear in V3 (V3:165) and V4 (V4:31), but I did not find them in the supplied Script36-40 reports. Treat those details as unverified unless the Appendix B/script artifact is cited directly.

    • The "mid/small-firm tail actively pulling the v3.x crossing" statement (V4:19) is stronger than the local Script34 evidence. Script34 local verifies the Big-4-only crossing and CI (Script34 local:18-24), and it reports a large offset from the published baseline (Script34 local:51-58). It does not, by itself, prove the causal language "actively pulling" rather than "the full-sample and Big-4-only calibrations differ."

Minor Findings

  1. Dip-test p-value precision needs a resolution check. V4 says bootstrap p-value estimation uses n_boot = 2000 and reports p < 10^-4 (V4:43). With a finite bootstrap of 2000, the natural resolution is about 1/2000 unless the script uses a different asymptotic/calibrated p-value. Script36/34 display p = 0.0000 (Script36:6-8; Script34 local:28-31). State the reporting convention precisely, e.g., "no bootstrap replicate exceeded the observed statistic; reported as p < 0.001" if that is what happened.
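
    If the scripts use the standard add-one bootstrap estimator (an assumption; the excerpt does not show the estimator), the reporting floor is 1/(n_boot + 1):

    ```python
    # Add-one bootstrap p-value: with n_boot = 2000 the smallest reportable
    # value is 1/2001 ~ 0.0005, so "p < 1e-4" cannot come from this estimator.
    def bootstrap_p_value(t_obs: float, t_boot: list[float]) -> float:
        exceed = sum(1 for t in t_boot if t >= t_obs)
        return (1 + exceed) / (1 + len(t_boot))   # add-one rule avoids p = 0

    print(bootstrap_p_value(0.08, [0.01] * 2000))  # -> 1/2001 ~ 0.0005
    ```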

  2. The Delta BIC sign convention is confusing. V4 reports "Delta BIC = -3.5" (V4:65). Since lower BIC is preferred, a reviewer may expect BIC(K=2) - BIC(K=3) = 3.48 or "K=3 lower by 3.48." Use one convention and define it.
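
    One displayed convention would remove the ambiguity, e.g.:

    $$\Delta\mathrm{BIC} \equiv \mathrm{BIC}(K{=}3) - \mathrm{BIC}(K{=}2) = -3.48,$$

    with lower BIC preferred, so a negative $\Delta\mathrm{BIC}$ favours K=3 (mildly, per Major Finding 1).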

  3. Per-signature convergence is real but only moderate for the box rule. Script39 verifies kappas of 0.6616, 0.5586, and 0.8701 (Script39:22-30). The report verdict is SIG_CONVERGENCE_MODERATE, not strong (Script39:41-48). V4's statement that box-rule disagreement reflects "different decision geometries" rather than signal disagreement (V4:99) is plausible but interpretive. Add the moderate verdict and avoid making geometry the only explanation.

  4. Per-CPA vs per-signature component centers drift more than the prose suggests. Script39 shows per-CPA C1 at cosine 0.9457 and per-signature C1 at 0.9280 (Script39:16-20). Kappa is high for K=3 perCPA vs perSig labels (Script39:28), but "the same component structure recovers" (V4:99) should be softened to "a broadly similar three-component ordering recovers."

  5. The Section III-L title is misleading. The section is titled "Per-Document Classification" (V4:119) but most of it defines per-signature categories (V4:121-133). The document-level aggregation appears only in one paragraph (V4:135). Either rename to "Signature- and Document-Level Classification" or split the two parts.

  6. K=3 alternative output lacks document aggregation. V4 says the K=3 alternative assigns each signature to C1/C2/C3 (V4:137), but if Section III-L is per-document classification, the K=3 alternative also needs a document-level worst-case or posterior aggregation rule.
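
    One plausible aggregation, shown only as a sketch and not as the authors' rule: take the document-level C1 posterior as the worst case over its signatures, mirroring the box rule's worst-case aggregation.

    ```python
    # Sketch of a document-level rule for K=3 outputs (illustrative only):
    # a document is flagged by its most hand-leaning signature's C1 posterior.
    import numpy as np

    def document_c1_posterior(sig_posteriors: np.ndarray, c1_index: int = 0) -> float:
        """sig_posteriors: (n_signatures, 3) array of per-signature K=3 posteriors."""
        return float(sig_posteriors[:, c1_index].max())  # worst case over signatures

    doc = np.array([[0.10, 0.70, 0.20],
                    [0.55, 0.30, 0.15]])   # two signatures, hypothetical posteriors
    print(document_c1_posterior(doc))       # -> 0.55
    ```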

  7. Firm anonymization is inconsistent. V4 names the four firms in Chinese and then says they are pseudonymized as Firms A-D (V4:17). Later it uses PwC directly (V4:31). V3 says firm-level results are reported under pseudonyms (V3:315-316). Decide whether v4 abandons anonymization; otherwise keep the main text pseudonymous and, if the firm mapping must appear at all, place it outside the manuscript.

Editorial / Prose Nits

  1. Replace "more-replicated-population baseline" (V4:37) with "less-replicated external reference" or "hand-leaning external reference."

  2. Replace "failure rate" for Lens 3 (V4:79, V4:89) with "box-rule hand-leaning rate" or "non-replicated rate." "Failure" sounds like classifier failure rather than a hand-leaning outcome.

  3. "Strongest single methodology-validation signal" (V4:89) is too strong because the lenses share features. Use "strongest internal consistency signal."

  4. "Boundary moves modestly" (V4:105) understates the PwC fold, where C1 membership rises from 23.5% to 36.3% (Script37:47-51). Use "membership remains composition-sensitive."

  5. "Calibration uncertainty band of +/- 5-13 percentage points" (V4:105) should be "observed absolute differences of 1.8-12.8 percentage points, with the largest fold exceeding the report's 5 pp viability bar" (Script37:83-90).

  6. "Operational threshold derivation" (V4:51) is not accurate if the operational per-signature classifier remains the inherited box rule. Use "mixture model and component assignment" unless K=3 is truly primary.

  7. The cross-reference index is useful, but it should be removed from the submitted manuscript or converted into an internal author checklist.

Responses to the Five Open Questions

  1. Scope justification.

    The three-point argument is directionally good but not yet sufficient. Add a fourth point explicitly restricting generalizability: primary claims are for the Big-4 audit-report context, while the 249 non-Big-4 CPAs are used only as robustness/reverse-anchor context unless Section IV-K independently validates them. Also soften "tail distorts" to "tail changes the fitted crossing" unless you cite a direct diagnostic for distortion. The Big-4 counts and crossings are verified (Script34 local:4-24; Script36:6-17), but the causal language needs restraint.

  2. Firm A phrasing.

    Use "templated-end case study" or "replication-heavy descriptive reference." Do not use "calibration reference, descriptively defined post-hoc" unless Firm A actually calibrates a threshold in v4. The draft correctly says Firm A is not the calibration anchor (V4:33). Calling it a calibration reference reintroduces the v3 vulnerability.

  3. K=3 vs K=2 rationale.

    As written, no. Selecting K=3 as an operational classifier on LOOO stability is not acceptable because Script37 says K=3 is only P2_PARTIAL and "not predictively useful as an operational classifier" (Script37:92-99). Do not strengthen the BIC argument; Delta BIC about 3.5 is mild. The defensible claim is: K=2 is clearly unstable; K=3 gives a reproducible hand-leaning component shape; hard membership remains uncertain and should be reported as calibration uncertainty.

  4. Hybrid box rule plus K=3 alternative.

    The hybrid can be acceptable only if roles are sharply separated: inherited five-way box rule is the primary signature/document classifier; K=3 is an accountant-level characterization and exploratory alternative. The current draft blurs this by calling K=3 "operational" (V4:67) while keeping the box rule in Section III-L (V4:121-137). Also, the validation scripts use the binary high-confidence rule dh <= 5, not the full five-way rule with dh <= 15. Fix this before deciding whether to keep the hybrid.

  5. Section IV numbering.

    Do not freeze table numbers yet. First settle the Methodology labels and primary classifier. Results should mirror this order: sample/scope, K=2/K=3 calibration, convergence lenses, K=2 and K=3 LOOO, pixel-identity positive-anchor check, signature/document classification outputs, then full-dataset robustness. After that, assign table numbers and verify every Section III cross-reference to Section IV-D/F/G/K.

Required Changes

  1. Rewrite Sections III-J and III-K so K=3 is either clearly primary with uncertainty, or clearly descriptive. If descriptive, remove "operational threshold" language from the K=3 discussion.

  2. Add the Script37 P2_PARTIAL result directly to the prose. Do not hide the "not predictively useful as an operational classifier" implication.

  3. Decide and declare the primary classifier: inherited five-way box rule, binary high-confidence box rule, or K=3 hard/posterior labels. Align all validation text to that exact classifier.

  4. If the five-way rule remains primary, rerun or report validation for the five-way categories and the document-level worst-case aggregation, not just cos > 0.95 AND dh <= 5.

  5. Rename the pixel-identity metric from FAR to positive-anchor miss rate / false-hand rate. Add a separate specificity/FAR result only if a true hand-signed or inter-CPA negative anchor is evaluated.

  6. Correct the empirical slips: K=2 "0.005 bootstrap half-width," all-non-Firm-A p > 0.99, Script32 BD/McCrary wording, reverse-anchor "more-replicated" phrase, and any unverified Firm A byte-decomposition details.

  7. Add a short provenance table for every numerical claim in Sections III-G through III-L, including exact report path, script number, and whether the number is directly reported or inferred by arithmetic.
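
    For illustration, a three-row shape the table could take (entries drawn from numbers already cited in this review; each "derived" row should show its arithmetic):

      Claim                                    Value        Source                      Status
      Big-4 K=2 LOOO max cosine deviation      0.0278       Script36:43                 direct
      Deviation / stability tolerance          5.6x         0.0278 / 0.005              derived
      Firm A partner byte-identity details     50/180/35    V3:165 (cite Appendix B)    inherited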