Codex gpt-5.4 round-5 (codex_review_gpt54_v3_5.md) verdict was Minor Revision - all v3.4 round-4 PARTIAL/UNFIXED items now confirmed RESOLVED, including line-by-line recomputation of Table XI z/p matching the manuscript values. This commit cleans the remaining quick-win items:

Table IX numerical sync to Script 24 authoritative values
- Five count corrections: cos>0.837 (60,405 -> 60,408), cos>0.945 (57,131/94.52% -> 56,836/94.02%, was 295 sigs / 0.50 pp off), cos>0.973 (48,910/80.91% -> 48,028/79.45%, was 882 sigs / 1.46 pp off), cos>0.95 (55,916 -> 55,922), dh<=8 (57,521 -> 57,527), dh<=15 (60,345 -> 60,348), dual (54,373 -> 54,370).
- Threshold label cos>0.941 -> cos>0.9407 (use exact calib-fold P5 rather than rounded value).
- "dHash_indep <= 5 (calib-fold median-adjacent)" relabeled to "(whole-sample upper-tail of mode)" to match what III-L explains.
- Added "(operational dual)" / "(style-consistency boundary)" labels for unambiguous mapping into III-L category definitions.
- Removed circularity-language footnote inside the table comment.

Circularity overclaim removed paper-wide
- Methodology III-K (Section 3 anchor): "we break the resulting circularity" -> "we make the within-Firm-A sampling variance visible".
- Results IV-G.2 subsection title: "(breaks calibration-validation circularity)" -> "(within-Firm-A sampling variance disclosure)".
- Combined with the v3.5 Abstract / Conclusion edits, no surviving use of circular* anywhere in the paper.

export_v3.py title page now single-anonymized
- Removed "[Authors removed for double-blind review]" placeholder (IEEE Access uses single-anonymized review).
- Replaced with explicit "[AUTHOR NAMES - fill in before submission]" + affiliation placeholder so the requirement is unmissable.
- Subtitle now reads "single-anonymized review".

III-G stale "cosine-conditional dHash" sentence removed
- After the v3.5 III-L rewrite to dh_indep, the sentence at Methodology L131 referencing "cosine-conditional dHash used as a diagnostic elsewhere" no longer described any current paper usage.
- Replaced with a positive statement that dh_indep is the dHash statistic used throughout the operational classifier and all reported capture-rate analyses.

Abstract trimmed 247 -> 242 words for IEEE 250-word safety margin
- "an end-to-end pipeline" -> "a pipeline"; "Unlike signature forgery" -> "Unlike forgery"; "we report" passive recast; small conjunction trims.

Outstanding items deferred (require user decision / larger scope):
- BD/McCrary either substantiate (Z/p table + bin-width robustness) or demote to supplementary diagnostic.
- Visual-inspection protocol disclosure (sample size, rater count, blinding, adjudication rule).
- Reproducibility appendix (VLM prompt, HSV thresholds, seeds, EM init / stopping / boundary handling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fifth-Round Review of Paper A v3.5
Audit basis: commit 12f716d. Line numbers below refer to the current v3.5 markdown and script files.
1. Overall Verdict
Minor Revision
v3.5 clears the two issues that kept v3.4 in major-revision territory. The classifier definition in Section III-L is now arithmetically aligned with the dHash_indep implementation used by the supporting scripts and downstream tables, and Table XI's z/p columns now reproduce from the displayed k/n counts under the exact two-proportion formula in Script 24. I do not see a core scientific regression in the B1/B2/B3 logic. I would still not submit v3.5 as-is, however, because a short v3.6 cleanup is still warranted: Table IX is not fully synchronized to the current script outputs, "breaks circularity" overclaim language survives in Methods/Results, the export path still hardcodes a double-blind placeholder even though IEEE Access is single-anonymized, and the manuscript still underdocuments BD/McCrary, visual inspection, and several key reproducibility details. This is now a close paper, but not yet the cleanest version to send.
2. v3.4 Round-4 Follow-Up Audit
2.1 Round-4 Blockers
| Round-4 item | Round-4 status | v3.5 audit | Evidence |
|---|---|---|---|
| B1. Classifier vs three-method convergence misalignment | PARTIALLY-RESOLVED | RESOLVED | Section III-L now defines the operational classifier entirely in dHash_indep terms at Methodology L252-L277. The matching downstream tables also use dHash_indep: Results L165-L168, L221-L225, L246-L254, and L350-L361. Script 24 likewise loads `min_dhash_independent` and applies it in the Section III-L classifier at Script 24 L86-L99, L157-L168, and L215-L241. |
| B2. Held-out validation false within-Wilson-CI claim | PARTIALLY-RESOLVED | RESOLVED | Results L230-L237 now correctly interpret the fold comparison, and the Table XI z/p entries at Results L217-L225 reproduce from Script 24's `two_prop_z` formula at Script 24 L69-L83 and L186-L205. |
| B3. Interview evidence lacks ethics statement | RESOLVED | RESOLVED | The manuscript still treats practitioner knowledge as background context only and locates evidentiary weight in paper-internal analyses: Introduction L51-L55; Methodology L140-L156 and L282-L291. I found no regression to interview/IRB-style evidentiary claims. |
2.2 Round-4 Major and Minor Follow-Up Items
| Round-4 item | Round-4 status | v3.5 audit | Evidence |
|---|---|---|---|
| dHash classifier ambiguity | UNFIXED | RESOLVED | The classifier is now explicitly dHash_indep-based throughout III-L, not cosine-conditional: Methodology L254-L277. Results Tables IX, XI, XII, and XVI are written in the same statistic: Results L165-L168, L221-L225, L246-L254, L350-L361. |
| 70/30 split overstatement | PARTIALLY-FIXED | PARTIALLY-RESOLVED | The Abstract and Conclusion are repaired: Abstract L5 and Conclusion L19-L21 now use fold-variance language. But the overclaim survives at Methodology L238, Results L171, and the subsection title at Results L207. |
| Validation-metric story | PARTIALLY-FIXED | RESOLVED | The Introduction now promises anchor-based capture/FAR reporting rather than precision/F1/EER: Introduction L29-L30. Methods/Results remain aligned on why precision/F1 are not meaningful here: Methodology L245-L246; Results L186-L188. The archived Impact Statement is explicitly excluded from submission and self-warns against overclaim: Impact Statement L1-L12; export_v3.py L15-L25. |
| Within-auditor-year empirical-check confusion | UNFIXED | RESOLVED | Methodology L123-L128 now explicitly says IV-H.3 is a related but distinct cross-partner same-report homogeneity test, not a same-CPA within-year mixing test. Results L343-L367 matches that framing exactly. |
| BD/McCrary rigor | UNFIXED | UNRESOLVED | The paper still gives only narrative BD/McCrary outcomes without a table of Z statistics, p values, or bin-width robustness: Results L80-L83 and L126-L149. |
| Reproducibility gaps | PARTIALLY-FIXED | PARTIALLY-RESOLVED | The scripts expose some seeds and formulas, but the manuscript still omits the exact VLM prompt/parse rule, HSV thresholds, visual-inspection protocol, and EM initialization/stopping details: Methodology L44-L49, L74-L75, L145-L146, L188-L196, L222-L223, L248. |
| Section III-H / IV-F reconciliation | FIXED | RESOLVED | The 92.5% Firm A figure is still consistently framed as a within-sample consistency check, not an external validation pillar: Methodology L156-L160; Results L174-L176. |
| "0.95 not calibrated to Firm A" inconsistency | UNFIXED | RESOLVED | III-H now says the 0.95 cutoff is the whole-sample Firm A P95: Methodology L151-L154. III-L repeats that at Methodology L273-L277, and Results uses the same interpretation at L174-L176 and L241-L244. |
| Table XII numbering | FIXED | RESOLVED | Numbering remains coherent through XI-XVIII, with Table XII present at Results L246-L254. |
| `dHash_indep <= 5 (calib-fold median-adjacent)` label | UNFIXED | UNRESOLVED | The label still appears in Table IX at Results L165. III-L explains the rationale better at Methodology L275, but the table label itself remains opaque. |
| References [27], [31]-[36] cleanup | UNFIXED | RESOLVED | All seven are now cited in text: [27] at Methodology L100; [31]-[33] at Introduction L15; [34]-[35] at Methodology L44 and L58; [36] at Results L50. |
2.3 Round-4 New-Issue Audit
| Round-4 new issue | v3.5 audit | Evidence |
|---|---|---|
| IV-G.3 sensitivity evidence did not evaluate the stated classifier | RESOLVED | III-L now defines the same dHash_indep classifier that Script 24 evaluates: Methodology L252-L277; Script 24 L215-L241; Results L239-L262. |
| Table XI z/p columns did not match displayed counts | RESOLVED | Results L217-L225 now matches recomputation from Script 24 L69-L83 exactly up to rounding; details in Section 3 below. |
| Table XVI was affected by the same classifier-definition problem | RESOLVED | Table XVI is now aligned because III-L itself is dHash_indep-based. Script 23 also uses `min_dhash_independent`: Script 23 L37-L53 and L90-L92. |
| Visual-inspection pillar still lacked protocol details | UNRESOLVED | The claim remains at Methodology L145-L149, but sample size, rater count, and adjudication rule are still absent from the manuscript. |
| Threshold-free wording in III-H was inaccurate | RESOLVED | III-H now correctly says only partner-ranking is fully threshold-free: Methodology L151-L154. Results L270-L274 matches this. |
| Introduction metric promise / Impact Statement wording still overstated | RESOLVED | The Introduction is repaired at L29-L30, and the Impact Statement is archived and excluded from export: Impact Statement L1-L12; export_v3.py L15-L25. |
3. Verification of the v3.5 Critical Fixes
3.1 Table XI Recalculation
I recomputed every Table XI z/p pair from the displayed k/n counts using the exact two-proportion formula in Script 24 L69-L83. All nine rows now match the manuscript rounding at Results L217-L225.
| Rule | Exact recomputation from displayed k/n | Paper value | Audit |
|---|---|---|---|
| `cosine > 0.837` | z = +0.310601, p = 0.756104 | +0.31, 0.756 | Match |
| `cosine > 0.9407` | z = -3.184698, p = 0.001449 | -3.19, 0.001 | Match |
| `cosine > 0.945` | z = -4.541202, p = 0.00000559 | -4.54, <0.001 | Match |
| `cosine > 0.950` | z = -5.966194, p = 0.0000000024 | -5.97, <0.001 | Match |
| `dHash_indep <= 5` | z = -14.288642, p < 1e-40 | -14.29, <0.001 | Match |
| `dHash_indep <= 8` | z = -6.446423, p = 1.15e-10 | -6.45, <0.001 | Match |
| `dHash_indep <= 9` | z = -5.072930, p = 3.92e-07 | -5.07, <0.001 | Match |
| `dHash_indep <= 15` | z = -0.313744, p = 0.753716 | -0.31, 0.754 | Match |
| `cosine > 0.95 AND dHash_indep <= 8` | z = -7.603992, p = 2.86e-14 | -7.60, <0.001 | Match |
This directly resolves the main round-4 numerical blocker.
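For readers who want to repeat this check without Script 24 in front of them, the recomputation above uses the standard pooled two-proportion z statistic with a two-sided normal p value, which is what Script 24's `two_prop_z` appears to implement. The sketch below is a minimal stdlib-only version of that formula; the k/n pair in the usage line is a placeholder, not one of the paper's calibration/validation-fold counts.

```python
from math import erfc, sqrt

def two_prop_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z statistic and two-sided normal p value."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p = erfc(abs(z) / sqrt(2.0))  # two-sided tail probability under N(0, 1)
    return z, p

# Placeholder counts only; substitute the displayed k/n pairs behind Table XI.
z, p = two_prop_z(k1=4_210, n1=42_314, k2=1_830, n2=18_134)
print(f"z = {z:+.6f}, p = {p:.6g}")
```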
3.2 Section III-L Uses dh_indep Throughout
This fix is real. Section III-L now states at Methodology L254-L255 that all dHash references in the operational classifier are the independent-minimum statistic, and the five categories at L257-L277 are all written with dHash_indep. The downstream result tables are consistent with that same statistic:
- Table IX: Results L165-L168.
- Table XI: Results L221-L225.
- Table XII: Results L246-L258.
- Table XVI: Results L347-L367.
Script 24 is now consistent with that choice as well: it loads min_dhash_independent at L86-L99 and classifies with it at L215-L241.
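To make the alignment concrete, here is a minimal sketch of the operational dual rule written in dHash_indep terms. It covers only the `cosine > 0.95 AND dHash_indep <= 8` category, not the full five-category Section III-L classifier, and the field names other than `min_dhash_independent` (the column Script 24 loads by that name) are assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass

# Thresholds as used by the operational dual rule. Field names other than
# min_dhash_independent are illustrative assumptions, not the real columns.
COSINE_CUTOFF = 0.95      # whole-sample Firm A P95 heuristic
DHASH_DUAL_MAX = 8        # independent-minimum dHash Hamming distance bound

@dataclass
class SignaturePair:
    cosine: float                 # embedding cosine similarity for the pair
    min_dhash_independent: int    # independent-minimum dHash distance

def operational_dual_match(pair: SignaturePair) -> bool:
    """Dual rule only; the full Section III-L classifier has five categories."""
    return pair.cosine > COSINE_CUTOFF and pair.min_dhash_independent <= DHASH_DUAL_MAX

print(operational_dual_match(SignaturePair(0.962, 5)))   # True
print(operational_dual_match(SignaturePair(0.962, 12)))  # False
```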
3.3 "0.95 is Firm A P95" Is Now Consistent
This inconsistency is fixed across the relevant sections:
- III-H: Methodology L151-L154 states that the `0.95` cutoff is the whole-sample Firm A P95 and that the longitudinal analysis is about stability, not absolute-rate calibration.
- III-L: Methodology L273-L277 repeats that `0.95` is the whole-sample Firm A P95 heuristic.
- IV-F / IV-G.3: Results L174-L176 and L241-L244 use the same framing.
I do not see a surviving contradiction of the old "not calibrated to Firm A" type.
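As a reference point for what a "whole-sample P95" heuristic means operationally, the sketch below derives a percentile-style cutoff from an array of within-firm cosine similarities. The data here are synthetic placeholders; only the percentile logic is meant to be illustrative, not the paper's actual calibration procedure.

```python
import numpy as np

# Synthetic stand-in for whole-sample within-firm cosine similarities;
# the real distribution comes from the paper's pipeline, not from here.
rng = np.random.default_rng(seed=0)
within_firm_cosines = rng.beta(8, 2, size=10_000)

# A "whole-sample P95" style heuristic: fix the cutoff at the 95th percentile
# of the within-firm similarity distribution and reuse it as a constant.
cutoff = float(np.percentile(within_firm_cosines, 95))
print(f"P95-style cutoff: {cutoff:.4f}")
```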
4. Verification of the v3.5 Major Fixes
- Abstract length: The abstract is now one paragraph. A rendered whitespace count after stripping the header/comment gives 247 words, which is nominally under the IEEE 250-word cap. If one counts inline math markers as separate tokens, the count rises above 250, so the abstract is compliant in ordinary rendered form but still too close to the limit for comfort.
- "We break the circularity" overclaim: Removed from the Abstract and Conclusion. The current Abstract L5 and Conclusion L19-L21 use fold-level variance / heterogeneity language instead. However, the same overclaim still survives elsewhere in the paper at Methodology L238 and Results L171 and L207.
- Introduction metric language: Fixed. Introduction L29-L30 now promises per-rule capture/FAR with Wilson intervals (see the sketch after this list) and explicitly states why precision/F1 are not meaningful here. The obsolete introduction promise of precision/F1/EER is gone.
- III-G / IV-H.3 wording alignment: Fixed. Methodology L123-L128 and Results L343-L367 now describe the same cross-partner same-report homogeneity test.
- III-H threshold-free wording: Fixed. Methodology L151-L154 and Results L270-L274 now correctly say that only partner-ranking is threshold-free.
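Because the Introduction now commits to Wilson intervals for the per-rule capture/FAR numbers, a minimal sketch of the standard Wilson score interval is included here for reference. This is the textbook formula, not code lifted from the paper's scripts, and the counts in the usage line are placeholders.

```python
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (default ~95%)."""
    if n == 0:
        return 0.0, 1.0
    p_hat = k / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

# Placeholder counts; a per-rule capture rate would plug in its own k/n.
low, high = wilson_interval(k=930, n=1_000)
print(f"95% Wilson CI: [{low:.4f}, {high:.4f}]")
```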
5. Verification of the v3.5 Minor Fixes
- Impact Statement exclusion: Fixed. `export_v3.py` excludes `paper_a_impact_statement_v3.md` from `SECTIONS` at L15-L25, and the archived file itself says it is not part of the IEEE Access submission at Impact Statement L1-L12.
- Previously unused references: Fixed. [27], [31], [32], [33], [34], [35], and [36] all now have in-text citations; see the evidence in Section 2.2 above.
6. New Findings in v3.5
No core scientific regression is visible in the B1/B2/B3 logic. The remaining new findings are cleanup-level but real:
- Table IX is still not fully synchronized to the current script outputs. Using the displayed counts at Results L160-L168, three percentages are off by 0.01 under standard rounding: 57,131 / 60,448 = 94.51%, not 94.52%; 55,916 / 60,448 = 92.50%, not 92.51%; and 57,521 / 60,448 = 95.16%, not 95.17% (see the rounding check after this list). More importantly, Script 24 computes the whole-sample dual rule as 54,370 / 60,448, not 54,373 / 60,448 (Script 24 L276-L316; generated recalibration report, section 3, lines 48-52). This is small, but v3.5 explicitly positions itself as having cleaned exact table arithmetic, so it should be corrected.
- The circularity overclaim is not fully removed paper-wide. Methodology L238 still says the 70/30 split "break[s] the resulting circularity," Results L171 says the held-out analysis "addresses the circularity," and the IV-G.2 subsection title at Results L207 still says "(breaks calibration-validation circularity)." Those are stronger than the better, narrower interpretation at Results L233-L237, Discussion L44-L45, and Conclusion L20-L21.
- The export path is not submission-ready for IEEE Access single-anonymized review. `export_v3.py` correctly excludes the Impact Statement, but it still inserts "[Authors removed for double-blind review]" on the title page at L208-L218. If the manuscript were submitted literally from this export path, that would be a packaging error.
- Methodology III-G retains one stale reference to cosine-conditional dHash. Methodology L131-L132 says cosine-conditional dHash is used "as a diagnostic elsewhere," but no remaining main-text result appears to use it. After the III-L rewrite, this reads as leftover phrasing and should be either deleted or pointed to a real appendix/supplement.
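The percentage check in the first bullet is pure rounding arithmetic on the displayed counts; the sketch below re-derives it, assuming half-up rounding to two decimals as the manuscript's convention. The counts are copied from that bullet.

```python
from decimal import Decimal, ROUND_HALF_UP

# Displayed Table IX counts from the bullet above, each paired with the
# percentage the manuscript prints for that row.
N = Decimal(60_448)
rows = [
    (Decimal(57_131), "94.52"),
    (Decimal(55_916), "92.51"),
    (Decimal(57_521), "95.17"),
]

for count, paper_pct in rows:
    recomputed = (count / N * 100).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    status = "matches" if str(recomputed) == paper_pct else f"recomputes to {recomputed}%"
    print(f"{count}/{N} -> paper {paper_pct}%, {status}")
```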
7. IEEE Access Submission Readiness Check
- Scope: Yes. The topic remains a plausible IEEE Access Regular Paper fit spanning document forensics, computer vision, and audit/regulatory analytics.
- Abstract length: Nominally compliant in rendered form at 247 words, but close enough to the cap that another 5-10 words of trimming would be safer.
- Formatting / template: Not verifiable from the markdown section files alone. The paper is maintained as markdown fragments plus a custom `python-docx` exporter; I did not audit a final IEEE Access template-conformant DOCX/PDF package here.
- Review model: IEEE Access is single-anonymized. The current export path still uses a double-blind placeholder on the title page (`export_v3.py` L208-L218). That must be fixed before submission (a hedged title-block sketch follows this list).
- Anonymization: The manuscript body still consistently uses Firm A/B/C/D and does not expose explicit real firm names or author metadata in the reviewed markdown sections. As before, that is a confidentiality choice rather than a review-model requirement.
- Ethics / data-source disclosure: Adequate for this paper's current evidentiary framing. Methodology L282-L291 clearly states the corpus is public MOPS data and that no non-public records or human-subject evidence are used.
- Items that could trigger desk return if submitted literally now: the missing author/affiliation metadata from the current export path, and any unverified IEEE template / metadata nonconformance in the final DOCX/PDF. The remaining scientific issues are reviewer-risk issues rather than obvious desk-return items.
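On the review-model point, the fix is mechanical. The sketch below is not the project's `export_v3.py`; it is a hypothetical `python-docx` title block showing the kind of explicit, hard-to-miss author/affiliation placeholders a single-anonymized IEEE Access export should emit.

```python
# Hypothetical sketch, not the project's export_v3.py: a python-docx title
# block whose author/affiliation placeholders must be filled before submission.
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH

doc = Document()
doc.add_heading("Paper A (working title)", level=0)

placeholders = [
    "[AUTHOR NAMES - fill in before submission]",
    "[AFFILIATIONS - fill in before submission]",
    "[CORRESPONDING AUTHOR EMAIL - fill in before submission]",
    "Submitted to IEEE Access (single-anonymized review)",
]
for text in placeholders:
    para = doc.add_paragraph(text)
    para.alignment = WD_ALIGN_PARAGRAPH.CENTER

doc.save("title_page_draft.docx")
```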
Bottom line on readiness: not as-is. The science is close; the packaging and last-round reporting cleanup are not finished.
8. Statistical Rigor, Numerical Consistency, and Reproducibility
Statistical Rigor
- The core statistical story is now coherent. The paper cleanly separates the operational signature-level classifier from the accountant-level convergence band and treats the held-out Firm A split as heterogeneity disclosure rather than a false Wilson-CI "generalization pass": Methodology L252-L277; Results L230-L237; Discussion L44-L45.
- The anchor-based validation is better framed than in earlier rounds. The byte-identical positives are clearly treated as a conservative subset, and precision/F1 are no longer misused: Methodology L227-L248; Results L184-L205.
- The main remaining rigor weakness is still BD/McCrary. Because the paper keeps advertising a three-method convergent threshold strategy in the title/abstract/introduction, the absence of explicit BD/McCrary Z/p reporting and bin-width sensitivity still leaves one of the three methods under-reported.
- The visual-inspection pillar is still too thinly documented for the rhetorical weight it carries in III-H and the Conclusion.
Numerical Consistency
- Table XI is now repaired and reproducible from its displayed counts.
- Table XII, Table XVI, and Table XVII remain arithmetically consistent.
- Table IX still has the residual percentage/count mismatches noted in Section 6.
- The biggest numerical issue left is therefore no longer inferential-table arithmetic; it is the smaller but still avoidable transcription drift in Table IX.
Reproducibility
The paper is still not reproducible from the manuscript alone.
The most important under-specified items remain:
- Exact VLM prompt, parse rule, and page-selection failure handling: Methodology L44-L49.
- HSV thresholds for red-stamp removal: Methodology L74-L75 (a placeholder OpenCV sketch follows this list).
- Randomization / seed rules for the 500-page annotation set, the inter-CPA negative sample, the 30-signature sanity sample, and the Firm A split: Methodology L59-L62 and L232-L248.
- Visual-inspection protocol details: sample size, rater count, and decision rule are still absent around Methodology L145-L146.
- EM / mixture initialization count, stopping criteria, logit-boundary clipping, and software versions: Methodology L188-L196 and L222-L223.
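To illustrate the level of specificity the HSV item needs, here is a hedged OpenCV sketch of a two-range red mask. The bounds are placeholder values I chose, not the paper's undisclosed thresholds; a reproducibility appendix would need to state the actual numbers.

```python
import cv2
import numpy as np

# Placeholder HSV bounds -- the paper's actual red-stamp thresholds are not
# disclosed, which is exactly the gap flagged above. Red wraps around the hue
# axis in OpenCV's 0-179 scale, hence the two ranges.
LOWER_RED_1 = np.array((0, 70, 50), dtype=np.uint8)
UPPER_RED_1 = np.array((10, 255, 255), dtype=np.uint8)
LOWER_RED_2 = np.array((170, 70, 50), dtype=np.uint8)
UPPER_RED_2 = np.array((179, 255, 255), dtype=np.uint8)

def remove_red_stamp(page_bgr: np.ndarray) -> np.ndarray:
    """Whiten pixels falling inside the (assumed) red-stamp HSV ranges."""
    hsv = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.bitwise_or(
        cv2.inRange(hsv, LOWER_RED_1, UPPER_RED_1),
        cv2.inRange(hsv, LOWER_RED_2, UPPER_RED_2),
    )
    cleaned = page_bgr.copy()
    cleaned[mask > 0] = (255, 255, 255)
    return cleaned
```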
The scripts help auditability, but the manuscript still needs a short reproducibility appendix or supplement if the authors want the paper to look fully defensible on first submission.
9. What v3.6 Must Change to Clear Review
If the authors want the paper to clear this review and be genuinely submission-ready, v3.6 should do the following:
- Re-sync Table IX and mirrored prose to the authoritative script outputs. Correct the three 0.01 percentage mismatches and the whole-sample dual-rule count (54,370 / 60,448 if Script 24 is authoritative).
- Remove the surviving circularity overclaim from Methods/Results. Replace Methodology L238, Results L171, and the IV-G.2 heading at L207 with the softer fold-variance / within-Firm-A heterogeneity framing already used elsewhere.
- Fix the export path for IEEE Access single-anonymized review. Restore author/affiliation/corresponding-author metadata and audit the real final DOCX/PDF against the IEEE Access template rather than relying on the current double-blind placeholder export.
- Document the visual-inspection protocol. At minimum: sample size, sampling rule, number of raters, whether review was blinded, and how disagreements were adjudicated.
- Either substantiate BD/McCrary or demote it. If it stays as one of the three headline methods, add a compact table of Z statistics, p values, and bin-width robustness. If not, explicitly recast it as a supplementary diagnostic rather than a co-equal threshold estimator.
- Add a short reproducibility appendix or supplement. Include the VLM prompt/parse rule, HSV thresholds, key seeds/sampling rules, and mixture-model implementation details.
- Clean the stale cosine-conditional dHash sentence at Methodology L131-L132. After the III-L rewrite, that sentence now looks like leftover terminology.
If those items are addressed cleanly, I would treat the manuscript as submission-ready for IEEE Access.