pdf_signature_extraction/paper/codex_review_gpt54_v3_4.md
gbanyan 12f716ddf1 Paper A v3.5: resolve codex round-4 residual issues
Fully addresses the partial-resolution / unfixed items from codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):

Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier
  table had 1-4-unit transcription errors in k values and a fabricated
  cos > 0.9407 calibration row; both fixed by rerunning Script 24
  with cos = 0.9407 added to COS_RULES and copying exact values from
  the JSON output.
- Section III-L classifier now defined entirely in terms of the
  independent-minimum dHash statistic that the deployed code (Scripts
  21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
  language is removed. Tables IX, XI, XII, XVI are now arithmetically
  consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
  III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
  per-signature cosine distribution, matching III-L and IV-F.

Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
  limit. Removed "we break the circularity" overclaim; replaced with
  "report capture rates on both folds with Wilson 95% intervals to
  make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
  within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
  Methods/Results don't deliver; replaced with anchor-based capture /
  FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
  intra-report consistency (IV-H.3) is a different test (two co-signers
  on the same report, firm-level homogeneity) and is not a within-CPA
  year-level mixing check; the assumption is maintained as a bounded
  identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
  the partner-level ranking is threshold-free"; longitudinal-stability
  uses 0.95 cutoff, intra-report uses the operational classifier.

Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
  Regular Papers do not have a standalone Impact Statement). The file
  itself is retained as an archived non-paper note for cover-letter /
  grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
  signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
  [35] VLM survey, [36] Mann-Whitney) are now cited in-text:
    [27] in Methodology III-E (dHash definition)
    [31][32][33] in Introduction (audit-quality regulation context)
    [34][35] in Methodology III-C/III-D
    [36] in Results IV-C (Mann-Whitney result)

Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 12:23:03 +08:00


Fourth-Round Review of Paper A v3.4

Overall Verdict: Major Revision

v3.4 is materially better than v3.3. The ethics/interview blocker is genuinely fixed, the classifier-versus-accountant-threshold distinction is much clearer in the prose, Table XII now exists, and the held-out-validation story has been conceptually corrected from the false "within Wilson CI" claim to the right calibration-fold-versus-held-out comparison. I still do not recommend submission as-is, however, because two core problems remain. First, the newly added sensitivity and intra-report analyses do not appear to evaluate the classifier that Section III-L now defines: the paper says the operational five-way classifier uses cosine-conditional dHash cutoffs, but the new scripts use min_dhash_independent instead. Second, the replacement Table XI has z/p columns that do not consistently match its own reported counts under the script's published two-proportion formula. Those are fixable, but they keep the manuscript in major-revision territory.

1. v3.3 Blocker Resolution Audit

B1. Classifier vs three-method convergence misalignment. Status: PARTIALLY-RESOLVED. The prose repair is real: Section III-L now explicitly distinguishes the signature-level operational classifier from the accountant-level convergent reference band (paper_a_methodology_v3.md), and Section IV-G.3 adds a sensitivity check (paper_a_results_v3.md). The remaining problem is that III-L defines the classifier's dHash cutoffs as cosine-conditional (paper_a_methodology_v3.md), but the new sensitivity script loads only s.min_dhash_independent and then claims to "Replicate Section III-L" (24_validation_recalibration.py). The conceptual alignment is improved, but the new empirical support is still not aligned to the declared classifier.

B2. Held-out validation false within-Wilson-CI claim. Status: PARTIALLY-RESOLVED. The false claim itself is removed: Section IV-G.2 now correctly says the calibration fold, not the whole sample, is the right comparison target (paper_a_results_v3.md), and the Discussion mirrors that (paper_a_discussion_v3.md). The new script also implements the two-proportion z-test explicitly (24_validation_recalibration.py). However, several Table XI z/p entries in paper_a_results_v3.md do not match the displayed k/n counts under that formula: the cosine > 0.837 row implies about z = +0.41, p = 0.683, not +0.31 / 0.756; the cosine > 0.9407 row implies about z = -3.19, p = 0.0014, not -2.83 / 0.005; and the dHash_indep <= 15 row implies about z = -0.43, p = 0.670, not -0.31 / 0.754. The conceptual blocker is fixed; the replacement inferential table still needs numeric cleanup.

B3. Interview evidence lacks ethics statement. Status: RESOLVED. The manuscript now consistently reframes the contextual claim as practitioner / industry-practice knowledge rather than as research interviews; see paper_a_introduction_v3.md and two passages in paper_a_methodology_v3.md. I also ran a grep across the nine v3 manuscript files and found no surviving interview, IRB, or ethics strings. The evidentiary burden now sits on paper-internal analyses rather than on undeclared human-subject evidence.
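For concreteness, the kind of recomputation behind the B2 z/p discrepancies can be sketched with the standard pooled two-proportion z-test. This is a minimal sketch assuming Script 24 uses the textbook pooled formulation; the function name and the example counts below are illustrative, not values taken from the paper:

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test: returns the z statistic and the
    two-sided p-value under a normal approximation."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF via erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative counts only (not the paper's Table XI values):
z, p = two_proportion_z(50, 100, 40, 100)
```

Given displayed k/n pairs for the calibration and held-out folds, re-running each Table XI row through a function like this is enough to check whether the printed z/p columns reproduce.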

2. v3.3 Major-Issues Follow-up

dHash classifier ambiguity. Status: UNFIXED. III-L now says the classifier uses cosine-conditional dHash thresholds (paper_a_methodology_v3.md), but the Results still report only dHash_indep capture rules (paper_a_results_v3.md, two tables), despite the promise in paper_a_methodology_v3.md that both statistics would be reported. The new scripts for Table XII and Table XVI also use min_dhash_independent, not cosine-conditional dHash (24_validation_recalibration.py, 23_intra_report_consistency.py).

70/30 split overstatement. Status: PARTIALLY-FIXED. The paper is now more candid that the operational classifier still inherits whole-sample thresholds (paper_a_methodology_v3.md), and IV-G.2 properly frames the fold comparison (paper_a_results_v3.md). But the Abstract still says "we break the circularity" (paper_a_abstract_v3.md), and the Conclusion repeats that framing (paper_a_conclusion_v3.md), which overstates what the 70/30 split accomplishes for the actual deployed classifier.

Validation-metric story. Status: PARTIALLY-FIXED. Methods and Results are substantially improved: precision and F1 are now explicitly rejected as meaningless here (paper_a_methodology_v3.md, paper_a_results_v3.md). But the Introduction still promises validation with "precision, recall, F1, and equal-error-rate" (paper_a_introduction_v3.md), and the Impact Statement still overstates binary discrimination (paper_a_impact_statement_v3.md).

Within-auditor-year empirical-check confusion. Status: UNFIXED. Section III-G still says the intra-report analysis provides an empirical check on the within-auditor-year no-mixing assumption (paper_a_methodology_v3.md). But Section IV-H.3 still measures agreement between the two different signers on the same report (paper_a_results_v3.md). That is a cross-partner same-report test, not a same-CPA within-year mixing test.

BD/McCrary rigor. Status: UNFIXED. The Methods still mention KDE bandwidth sensitivity and define a fixed-bin BD/McCrary procedure (paper_a_methodology_v3.md), but the Results still give only narrative transition statements (paper_a_results_v3.md, two passages), with no alternate-bin analysis, Z-statistics table, p-values, or McCrary-style estimator output.

Reproducibility gaps. Status: PARTIALLY-FIXED. There is some improvement at the code level: the new recalibration script exposes the seed and the test formulae (24_validation_recalibration.py). But from the paper alone the work is still not reproducible: the exact VLM prompt and parse rule, the HSV thresholds, the visual-inspection sample size and protocol, and the mixture initialization / stopping / boundary handling all remain absent or under-specified in paper_a_methodology_v3.md.

Section III-H / IV-F reconciliation. Status: FIXED. The manuscript now clearly says the 92.5% Firm A figure is a within-sample consistency check, not the independent validation pillar (paper_a_methodology_v3.md, paper_a_results_v3.md). That specific circularity / role-confusion problem is repaired.

"Fixed 0.95 not calibrated to Firm A" inconsistency. Status: UNFIXED. III-H still says the fixed 0.95 cutoff "is not calibrated to Firm A" (paper_a_methodology_v3.md), but III-L says 0.95 is the whole-sample Firm A P95 heuristic (paper_a_methodology_v3.md, two places), and IV-F says the same (paper_a_results_v3.md, two places). This contradiction remains.
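Because the dHash-classifier dispute turns on which dHash statistic is being thresholded, it may help to recall what a difference hash actually computes. This is a minimal sketch of the common 9x8 dHash formulation; whether reference [27] and the paper's scripts use exactly this variant (and how the cosine-conditional cutoffs layer on top) is not reproduced here. The sketch operates on an already-resized grayscale grid rather than pulling in an imaging library:

```python
def dhash_bits(grid):
    """64-bit difference hash: grid is 8 rows x 9 columns of grayscale
    values; each bit records whether a pixel is brighter than its
    right-hand neighbour."""
    return [1 if row[c] > row[c + 1] else 0
            for row in grid for c in range(8)]

def hamming_distance(bits_a, bits_b):
    """dHash distance between two images = number of differing bits."""
    return sum(a != b for a, b in zip(bits_a, bits_b))

# Two synthetic 8x9 grids: brightness rising vs falling left-to-right
rising = [[c for c in range(9)] for _ in range(8)]
falling = [[8 - c for c in range(9)] for _ in range(8)]
```

A rule such as "dHash_indep <= 15" is then a cutoff on this Hamming distance; the unresolved question above is whether that cutoff is applied unconditionally or only within a cosine-similarity band.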

3. v3.3 Minor-Issues Follow-up

Table XII numbering. Status: FIXED. Table XII now exists (paper_a_results_v3.md), and the numbering now runs XI-XVIII without the previous jump.

dHash_indep <= 5 (calib-fold median-adjacent) label. Status: UNFIXED. The unclear label remains (paper_a_results_v3.md), even though the same table family now explicitly reports the calibration-fold independent-minimum median as 2. Calling 5 "median-adjacent" is still opaque.

References [27], [31]-[36] cleanup. Status: UNFIXED. These references remain present in paper_a_references_v3.md, but a citation sweep across the nine manuscript files found no in-text uses of [27] or [31]-[36]. The Mann-Whitney test is still reported in paper_a_results_v3.md without citing [36], and I do not see uses of [34] or [35] in the reviewed manuscript text.
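For reference, the uncited Mann-Whitney result is a rank-sum U statistic. A minimal pure-Python sketch (average ranks for ties, U reported for the first sample, no normal approximation or tie-corrected p-value) looks like:

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U for sample x against sample y, using average
    ranks for tied values. Returns U_x in [0, len(x) * len(y)]."""
    combined = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # find the run of tied values starting at position i
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    r_x = sum(ranks[i] for i in range(len(x)))
    return r_x - len(x) * (len(x) + 1) / 2
```

Whatever implementation the paper actually used (e.g. a SciPy call), citing [36] next to the reported U and p-value is the cheap fix the review asks for.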

4. New Findings in v3.4

Blockers

Major Issues

Minor Issues

5. IEEE Access Fit Check

  • Scope: Yes. The topic is a plausible IEEE Access Regular Paper fit as a methods paper spanning document forensics, computer vision, and audit/regulatory applications.

  • Abstract length: Not compliant yet. A local plain-word count of paper_a_abstract_v3.md gives about 367 words, while the IEEE Author Center guidance says the abstract should be a single paragraph of up to 250 words. The current abstract is also dense with abbreviations and symbols (KDE, EM, BIC, GMM, ~, approx) that IEEE generally prefers authors to avoid in abstracts.

  • Impact Statement section: The manuscript still includes a standalone Impact Statement (paper_a_impact_statement_v3.md). Inference from official IEEE Access / IEEE Author Center sources: I do not see a Regular Paper requirement for a standalone Impact Statement section. Unless an editor specifically requested it, I would remove it or fold its content into the abstract, conclusion, or cover letter.

  • Formatting: I cannot verify final IEEE template conformance from the markdown section files alone. Official IEEE Access guidance requires the journal template and submission of both source and PDF; that should be checked at the generated DOCX / PDF stage, not from these source snippets.

  • Review model / anonymization: IEEE Access uses single-anonymized review. The current pseudonymization of firms is therefore a confidentiality choice, not a review-blinding requirement. Within the nine reviewed section files I do not see author or institution metadata.

  • Official sources checked:

6. Statistical Rigor Audit

  • The high-level statistical story is cleaner than in v3.3. The paper now explicitly separates the primary accountant-level 1D convergence (0.973 / 0.979 / 0.976) from the secondary 2D-GMM marginal (0.945) in paper_a_results_v3.md, and III-L no longer pretends those accountant-level thresholds are themselves the deployed classifier (paper_a_methodology_v3.md).

  • The B2 statistical interpretation is substantially improved: IV-G.2 now frames fold differences as heterogeneity rather than as failed generalization (paper_a_results_v3.md), and the Discussion repeats that narrower reading (paper_a_discussion_v3.md).

  • The main remaining statistical weakness is now more specific: the paper's new classifier definition and the paper's new sensitivity evidence are not using the same dHash statistic. That is a model-definition problem, not just a wording problem.

  • BD/McCrary remains the least rigorous component. The paper's qualitative interpretation is plausible, but the reporting is still too thin for a method presented as a co-equal thresholding component.

  • The anchor-based validation is better framed than before. The manuscript now correctly treats the byte-identical positives as a conservative subset and no longer uses precision / F1 in the main validation table (paper_a_results_v3.md).
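Since the capture-rate and fold comparisons throughout lean on Wilson 95% intervals, the interval itself is worth writing down. A minimal sketch of the two-sided Wilson score interval at z = 1.96 (a generic textbook formulation, not code lifted from the paper's scripts):

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n.
    Unlike the Wald interval, it stays inside [0, 1] and behaves
    sensibly at k = 0 or k = n, which matters for small folds."""
    p_hat = k / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

lo, hi = wilson_interval(8, 10)  # e.g. 8 of 10 anchors captured
```

Checking whether a held-out capture rate falls inside the calibration fold's Wilson interval is a weaker statement than the two-proportion test IV-G.2 now uses, which is exactly why the original "within Wilson CI" claim needed replacing.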

7. Anonymization Check

8. Numerical Consistency

9. Reproducibility

Bottom Line

v3.4 clears the ethics/interview blocker and substantially improves the classifier-threshold narrative. It is much closer to a submittable paper than v3.3. But I would still require one more round before IEEE Access submission: (1) make Section III-L, Table XII, Table XVI, and the supporting scripts use the same dHash statistic, or explicitly redefine the classifier around dHash_indep; (2) recompute and correct the Table XI z/p columns from the displayed counts; (3) remove the remaining overstatements about what the 70/30 split and the validation metrics establish; and (4) cut the abstract to <= 250 words and remove or relocate the non-standard Impact Statement. If those are repaired cleanly, the paper should move into minor-revision territory.