Fully addresses the partially-resolved / unfixed items from the codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):
Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier
table had 1-4-unit transcription errors in k values and a fabricated
cos > 0.9407 calibration row; both fixed by rerunning Script 24
with cos = 0.9407 added to COS_RULES and copying exact values from
the JSON output.
- Section III-L classifier now defined entirely in terms of the
independent-minimum dHash statistic that the deployed code (Scripts
21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
language is removed. Tables IX, XI, XII, XVI are now arithmetically
consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
per-signature cosine distribution, matching III-L and IV-F.
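For context, the P95 convention above is just the 95th percentile of the per-signature cosine similarities. A minimal stdlib sketch, using hypothetical scores (the real embedding pipeline and data are not part of this note, and the helper names are illustrative only):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def percentile(xs, q):
    """Linear-interpolation percentile (q in [0, 100])."""
    s = sorted(xs)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q / 100.0
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Hypothetical per-signature cosine scores for one firm:
scores = [0.91, 0.93, 0.94, 0.96, 0.97, 0.98, 0.99]
p95 = percentile(scores, 95)  # whole-sample P95 heuristic cutoff
```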
Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
limit. Removed "we break the circularity" overclaim; replaced with
"report capture rates on both folds with Wilson 95% intervals to
make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
Methods/Results don't deliver; replaced with anchor-based capture /
FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
intra-report consistency (IV-H.3) is a different test (two co-signers
on the same report, firm-level homogeneity) and is not a within-CPA
year-level mixing check; the assumption is maintained as a bounded
identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
the partner-level ranking is threshold-free"; longitudinal-stability
uses 0.95 cutoff, intra-report uses the operational classifier.
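The "capture rates on both folds with Wilson 95% intervals" wording corresponds to the standard Wilson score interval for a binomial proportion; a minimal sketch, with hypothetical fold counts:

```python
import math

def wilson_ci(k, n, z=1.959964):
    """Wilson score interval for k successes in n trials (~95% coverage)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical fold: 92 of 100 anchor signatures captured.
lo, hi = wilson_ci(92, 100)
```

The Wilson interval is preferred over the naive Wald interval here because it stays inside [0, 1] and behaves sensibly at extreme capture rates.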
Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
Regular Papers do not have a standalone Impact Statement). The file
itself is retained as an archived non-paper note for cover-letter /
grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
[35] VLM survey, [36] Mann-Whitney) are now cited in-text:
[27] in Methodology III-E (dHash definition)
[31][32][33] in Introduction (audit-quality regulation context)
[34][35] in Methodology III-C/III-D
[36] in Results IV-C (Mann-Whitney result)
Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fourth-Round Review of Paper A v3.4
Overall Verdict: Major Revision
v3.4 is materially better than v3.3. The ethics/interview blocker is genuinely fixed, the classifier-versus-accountant-threshold distinction is much clearer in the prose, Table XII now exists, and the held-out-validation story has been conceptually corrected from the false "within Wilson CI" claim to the right calibration-fold-versus-held-out comparison. I still do not recommend submission as-is, however, because two core problems remain. First, the newly added sensitivity and intra-report analyses do not appear to evaluate the classifier that Section III-L now defines: the paper says the operational five-way classifier uses cosine-conditional dHash cutoffs, but the new scripts use min_dhash_independent instead. Second, the replacement Table XI has z/p columns that do not consistently match its own reported counts under the script's published two-proportion formula. Those are fixable, but they keep the manuscript in major-revision territory.
1. v3.3 Blocker Resolution Audit
| Blocker | Status | Audit |
|---|---|---|
| B1. Classifier vs three-method convergence misalignment | PARTIALLY-RESOLVED | The prose repair is real. Section III-L now explicitly distinguishes the signature-level operational classifier from the accountant-level convergent reference band at paper_a_methodology_v3.md through paper_a_methodology_v3.md, and Section IV-G.3 is added as a sensitivity check at paper_a_results_v3.md through paper_a_results_v3.md. The remaining problem is that III-L defines the classifier's dHash cutoffs as cosine-conditional at paper_a_methodology_v3.md through paper_a_methodology_v3.md, but the new sensitivity script loads only `s.min_dhash_independent` at 24_validation_recalibration.py through 24_validation_recalibration.py and then claims to "Replicate Section III-L" at 24_validation_recalibration.py through 24_validation_recalibration.py. So the conceptual alignment is improved, but the new empirical support is still not aligned to the declared classifier. |
| B2. Held-out validation false within-Wilson-CI claim | PARTIALLY-RESOLVED | The false claim itself is removed. Section IV-G.2 now correctly says the calibration fold, not the whole sample, is the right comparison target at paper_a_results_v3.md through paper_a_results_v3.md, and Discussion mirrors that at paper_a_discussion_v3.md. The new script also implements the two-proportion z-test explicitly at 24_validation_recalibration.py through 24_validation_recalibration.py and 24_validation_recalibration.py through 24_validation_recalibration.py. However, several Table XI z/p entries do not match the displayed k/n counts under that formula: the cosine > 0.837 row at paper_a_results_v3.md implies about z = +0.41, p = 0.683, not +0.31 / 0.756; the cosine > 0.9407 row at paper_a_results_v3.md implies about z = -3.19, p = 0.0014, not -2.83 / 0.005; and the dHash_indep <= 15 row at paper_a_results_v3.md implies about z = -0.43, p = 0.670, not -0.31 / 0.754. The conceptual blocker is fixed; the replacement inferential table still needs numeric cleanup. |
| B3. Interview evidence lacks ethics statement | RESOLVED | This blocker is fixed. The manuscript now consistently reframes the contextual claim as practitioner / industry-practice knowledge rather than as research interviews; see paper_a_introduction_v3.md through paper_a_introduction_v3.md, paper_a_methodology_v3.md through paper_a_methodology_v3.md, and paper_a_methodology_v3.md through paper_a_methodology_v3.md. I also ran a grep across the nine v3 manuscript files and found no surviving interview, IRB, or ethics strings. The evidentiary burden now sits on paper-internal analyses rather than on undeclared human-subject evidence. |
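For the B2 arithmetic checks above, the pooled two-proportion z-test (the usual textbook form; whether Script 24 uses exactly this variant is an assumption here) reproduces z and a two-sided p directly from displayed k/n counts:

```python
import math

def two_prop_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p)."""
    p1, p2 = k1 / n1, k2 / n2
    pool = (k1 + k2) / (n1 + n2)  # pooled success proportion
    se = math.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal tail via erfc.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p
```

Rerunning each Table XI row's k/n pair through a function like this is exactly the reproduction check the review performed.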
2. v3.3 Major-Issues Follow-up
| Prior major issue | Status | v3.4 audit |
|---|---|---|
| dHash classifier ambiguity | UNFIXED | III-L now says the classifier uses cosine-conditional dHash thresholds at paper_a_methodology_v3.md through paper_a_methodology_v3.md, but the Results still report only dHash_indep capture rules at paper_a_results_v3.md through paper_a_results_v3.md and paper_a_results_v3.md through paper_a_results_v3.md, despite the promise at paper_a_methodology_v3.md that both statistics would be reported. The new scripts for Table XII and Table XVI also use `min_dhash_independent`, not cosine-conditional dHash, at 24_validation_recalibration.py through 24_validation_recalibration.py and 23_intra_report_consistency.py through 23_intra_report_consistency.py. |
| 70/30 split overstatement | PARTIALLY-FIXED | The paper is now more candid that the operational classifier still inherits whole-sample thresholds at paper_a_methodology_v3.md through paper_a_methodology_v3.md, and IV-G.2 properly frames the fold comparison at paper_a_results_v3.md through paper_a_results_v3.md. But the Abstract still says "we break the circularity" at paper_a_abstract_v3.md, and the Conclusion repeats that framing at paper_a_conclusion_v3.md, which overstates what the 70/30 split accomplishes for the actual deployed classifier. |
| Validation-metric story | PARTIALLY-FIXED | Methods and Results are substantially improved: precision and F1 are now explicitly rejected as meaningless here at paper_a_methodology_v3.md through paper_a_methodology_v3.md and paper_a_results_v3.md through paper_a_results_v3.md. But the Introduction still promises validation with "precision, recall, F1, and equal-error-rate" at paper_a_introduction_v3.md, and the Impact Statement still overstates binary discrimination at paper_a_impact_statement_v3.md. |
| Within-auditor-year empirical-check confusion | UNFIXED | Section III-G still says the intra-report analysis provides an empirical check on the within-auditor-year no-mixing assumption at paper_a_methodology_v3.md through paper_a_methodology_v3.md. But Section IV-H.3 still measures agreement between the two different signers on the same report at paper_a_results_v3.md through paper_a_results_v3.md. That is a cross-partner same-report test, not a same-CPA within-year mixing test. |
| BD/McCrary rigor | UNFIXED | The Methods still mention KDE bandwidth sensitivity at paper_a_methodology_v3.md and define a fixed-bin BD/McCrary procedure at paper_a_methodology_v3.md through paper_a_methodology_v3.md, but the Results still give only narrative transition statements at paper_a_results_v3.md through paper_a_results_v3.md and paper_a_results_v3.md through paper_a_results_v3.md, with no alternate-bin analysis, Z-statistics table, p-values, or McCrary-style estimator output. |
| Reproducibility gaps | PARTIALLY-FIXED | There is some improvement at the code level: the new recalibration script exposes the seed and test formulae at 24_validation_recalibration.py, 24_validation_recalibration.py through 24_validation_recalibration.py, and 24_validation_recalibration.py through 24_validation_recalibration.py. But from the paper alone the work is still not reproducible: the exact VLM prompt and parse rule remain absent at paper_a_methodology_v3.md through paper_a_methodology_v3.md, HSV thresholds remain absent at paper_a_methodology_v3.md, visual-inspection sample size/protocol remain absent at paper_a_methodology_v3.md through paper_a_methodology_v3.md, and mixture initialization / stopping / boundary handling remain under-specified at paper_a_methodology_v3.md through paper_a_methodology_v3.md and paper_a_methodology_v3.md. |
| Section III-H / IV-F reconciliation | FIXED | The manuscript now clearly says the 92.5% Firm A figure is a within-sample consistency check, not the independent validation pillar, at paper_a_methodology_v3.md and paper_a_results_v3.md through paper_a_results_v3.md. That specific circularity / role-confusion problem is repaired. |
| "Fixed 0.95 not calibrated to Firm A" inconsistency | UNFIXED | III-H still says the fixed 0.95 cutoff "is not calibrated to Firm A" at paper_a_methodology_v3.md, but III-L says 0.95 is the whole-sample Firm A P95 heuristic at paper_a_methodology_v3.md and paper_a_methodology_v3.md, and IV-F says the same at paper_a_results_v3.md and paper_a_results_v3.md. This contradiction remains. |
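For readers unfamiliar with the statistic in dispute: `min_dhash_independent` presumably builds on the standard difference hash [27] — row-wise adjacent-pixel comparisons on a downsampled grayscale image, compared via Hamming distance — with the independent minimum taken over a set of exemplar signatures. A stdlib-only sketch on plain 2D arrays (the deployed implementation is not shown in the reviewed scripts, so the resize choice, hash size, and helper names here are assumptions):

```python
def resize_nn(img, w, h):
    """Nearest-neighbor downsample of a 2D grayscale array to w x h."""
    H, W = len(img), len(img[0])
    return [[img[r * H // h][c * W // w] for c in range(w)] for r in range(h)]

def dhash(img, size=8):
    """Difference hash: compare horizontally adjacent pixels on a (size+1) x size grid."""
    g = resize_nn(img, size + 1, size)
    bits = 0
    for row in g:
        for a, b in zip(row, row[1:]):
            bits = (bits << 1) | (1 if a > b else 0)
    return bits

def hamming(a, b):
    """Hamming distance between two integer-encoded hashes."""
    return bin(a ^ b).count("1")

def min_dhash_independent(query, exemplars):
    """Independent minimum: smallest Hamming distance to any exemplar hash."""
    q = dhash(query)
    return min(hamming(q, dhash(e)) for e in exemplars)
```

The B1/dHash issue in the tables above is precisely that III-L describes cosine-conditional cutoffs while the scripts threshold a statistic of this independent-minimum form.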
3. v3.3 Minor-Issues Follow-up
| Prior minor issue | Status | v3.4 audit |
|---|---|---|
| Table XII numbering | FIXED | Table XII now exists at paper_a_results_v3.md through paper_a_results_v3.md, and the numbering now runs XI-XVIII without the previous jump. |
| `dHash_indep <= 5` (calib-fold median-adjacent) label | UNFIXED | The unclear label remains at paper_a_results_v3.md, even though the same table family now explicitly reports the calibration-fold independent-minimum median as 2 at paper_a_results_v3.md. Calling 5 "median-adjacent" is still opaque. |
| References [27], [31]-[36] cleanup | UNFIXED | These references remain present at paper_a_references_v3.md through paper_a_references_v3.md, but a citation sweep across the nine manuscript files found no in-text uses of [27] or [31]-[36]. The Mann-Whitney test is still reported at paper_a_results_v3.md without citing [36]. I also do not see uses of [34] or [35] in the reviewed manuscript text. |
4. New Findings in v3.4
Blockers
- The new IV-G.3 sensitivity evidence does not appear to use the classifier that III-L now defines. III-L says the operational categories use cosine-conditional dHash cutoffs at paper_a_methodology_v3.md through paper_a_methodology_v3.md, and IV-G.3 presents itself as a sensitivity test of that classifier at paper_a_results_v3.md through paper_a_results_v3.md. But 24_validation_recalibration.py through 24_validation_recalibration.py load only `min_dhash_independent`, and the "Replicate Section III-L" classifier at 24_validation_recalibration.py through 24_validation_recalibration.py uses that statistic directly. This is currently the most important unresolved issue because the newly added evidence that is meant to support B1 is not evaluating the paper's stated classifier.
Major Issues
- Table XI's z/p columns are not consistently arithmetically compatible with the published counts. The formula in 24_validation_recalibration.py through 24_validation_recalibration.py is straightforward, but several rows in paper_a_results_v3.md through paper_a_results_v3.md do not match their own `k`/`n` inputs. The qualitative interpretation survives, but a statistical table that does not reproduce from its displayed counts is not submission-ready.
- Table XVI is affected by the same classifier-definition problem as Table XII. The paper says IV-H.3 uses the "dual-descriptor rules of Section III-L" at paper_a_results_v3.md, but 23_intra_report_consistency.py through 23_intra_report_consistency.py and 23_intra_report_consistency.py through 23_intra_report_consistency.py classify with `min_dhash_independent`. So the new "fourth pillar" consistency check is not actually tied to the classifier as specified in III-L.
- The four-pillar Firm A validation is ethically cleaner, but not stronger in evidentiary reporting than v3.3. It is stronger on internal consistency because practitioner knowledge is now background-only at paper_a_methodology_v3.md through paper_a_methodology_v3.md, and the paper states that the evidence comes from the manuscript's own analyses at paper_a_methodology_v3.md through paper_a_methodology_v3.md. But it is not stronger on empirical documentation because the visual-inspection pillar still has no sample size, randomization rule, rater count, or decision protocol at paper_a_methodology_v3.md through paper_a_methodology_v3.md. My read is: ethically stronger, scientifically cleaner, but only roughly equal in evidentiary strength unless the visual-inspection protocol is documented.
Minor Issues
- III-H says "Two of them are fully threshold-free" at paper_a_methodology_v3.md, but item (a) immediately uses a fixed `0.95` cutoff at paper_a_methodology_v3.md. The Results intro to Section IV-H is more accurate at paper_a_results_v3.md through paper_a_results_v3.md. This should be harmonized.
- The Introduction still contains an obsolete metric promise at paper_a_introduction_v3.md, and the Impact Statement still reads too strongly for a five-way classifier with no full labeled test set at paper_a_impact_statement_v3.md. These are not new conceptual flaws, but they are still visible in the current version.
5. IEEE Access Fit Check
- Scope: Yes. The topic is a plausible IEEE Access Regular Paper fit as a methods paper spanning document forensics, computer vision, and audit/regulatory applications.
- Abstract length: Not compliant yet. A local plain-word count of paper_a_abstract_v3.md through paper_a_abstract_v3.md gives about 367 words. The IEEE Author Center guidance says the abstract should be a single paragraph of up to 250 words. The current abstract is also dense with abbreviations / symbols (`KDE`, `EM`, `BIC`, `GMM`, `~`, `approx`) that IEEE generally prefers authors to avoid in abstracts.
- Impact Statement section: The manuscript still includes a standalone Impact Statement at paper_a_impact_statement_v3.md through paper_a_impact_statement_v3.md. Inference from official IEEE Access / IEEE Author Center sources: I do not see a Regular Paper requirement for a standalone `Impact Statement` section. Unless an editor specifically requested it, I would remove it or fold its content into the abstract / conclusion / cover letter.
- Formatting: I cannot verify final IEEE template conformance from the markdown section files alone. Official IEEE Access guidance requires the journal template and submission of both source and PDF; that should be checked at the generated DOCX / PDF stage, not from these source snippets.
- Review model / anonymization: IEEE Access uses single-anonymized review. The current pseudonymization of firms is therefore a confidentiality choice, not a review-blinding requirement. Within the nine reviewed section files I do not see author or institution metadata.
- Official sources checked:
  - IEEE Access submission guidelines: https://ieeeaccess.ieee.org/authors/submission-guidelines/
  - IEEE Author Center article-structure guidance: https://journals.ieeeauthorcenter.ieee.org/create-your-ieee-journal-article/create-the-text-of-your-article/structure-your-article/
  - IEEE Access reviewer guidelines / reviewer info: https://ieeeaccess.ieee.org/reviewers/reviewer-guidelines/
6. Statistical Rigor Audit
- The high-level statistical story is cleaner than in v3.3. The paper now explicitly separates the primary accountant-level 1D convergence (`0.973 / 0.979 / 0.976`) from the secondary 2D-GMM marginal (`0.945`) at paper_a_results_v3.md through paper_a_results_v3.md, and III-L no longer pretends those accountant-level thresholds are themselves the deployed classifier at paper_a_methodology_v3.md.
- The B2 statistical interpretation is substantially improved: IV-G.2 now frames fold differences as heterogeneity rather than as failed generalization at paper_a_results_v3.md through paper_a_results_v3.md, and Discussion repeats that narrower reading at paper_a_discussion_v3.md through paper_a_discussion_v3.md.
- The main remaining statistical weakness is now more specific: the paper's new classifier definition and the paper's new sensitivity evidence are not using the same dHash statistic. That is a model-definition problem, not just a wording problem.
- BD/McCrary remains the least rigorous component. The paper's qualitative interpretation is plausible, but the reporting is still too thin for a method presented as a co-equal thresholding component.
- The anchor-based validation is better framed than before. The manuscript now correctly treats the byte-identical positives as a conservative subset and no longer uses precision / `F1` in the main validation table at paper_a_results_v3.md through paper_a_results_v3.md.
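On the BD/McCrary point: the minimum the review asks for is a reported discontinuity statistic at the candidate threshold. A deliberately simplified fixed-bin version — not the paper's procedure, and far cruder than McCrary's local-linear density estimator — would look like:

```python
import math

def bin_discontinuity_z(values, cutoff, width):
    """Naive density-discontinuity check: compare counts in the single
    fixed-width bin just below vs just above the cutoff, treating the
    two counts as independent Poisson, so z = (Na - Nb) / sqrt(Na + Nb)."""
    below = sum(1 for v in values if cutoff - width <= v < cutoff)
    above = sum(1 for v in values if cutoff <= v < cutoff + width)
    if below + above == 0:
        return 0.0
    return (above - below) / math.sqrt(above + below)
```

Even a table of z-statistics from something this simple, repeated across alternate bin widths, would address the "narrative transition statements only" complaint.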
7. Anonymization Check
- Within the nine reviewed v3 manuscript files, I do not see any explicit real firm names or auditor names. The paper consistently uses `Firm A/B/C/D`; see paper_a_methodology_v3.md through paper_a_methodology_v3.md and paper_a_results_v3.md through paper_a_results_v3.md.
- The new III-M residual-identifiability disclosure at paper_a_methodology_v3.md through paper_a_methodology_v3.md is appropriate. Knowledgeable local readers may still infer Firm A, but the paper now states that risk explicitly.
8. Numerical Consistency
- Most of the large headline counts still reconcile across sections: `90,282` reports, `182,328` signatures, `758` CPAs, and the Firm A `171 + 9` accountant split remain internally consistent across paper_a_abstract_v3.md through paper_a_abstract_v3.md, paper_a_introduction_v3.md through paper_a_introduction_v3.md, paper_a_results_v3.md through paper_a_results_v3.md, and paper_a_conclusion_v3.md through paper_a_conclusion_v3.md.
- Table XII arithmetic is internally consistent: both columns sum to `168,740`, and the listed percentages match the counts. Table XVI and Table XVII arithmetic also reconcile. The new numbering XI-XVIII is coherent.
- The important remaining numerical inconsistency is Table XI's inferential columns, not its raw counts or percentages.
9. Reproducibility
- The paper is still not reproducible from the manuscript alone.
- Missing or under-specified items that should be added before submission:
  - Exact VLM prompt, parse rule, and failure-handling for page selection at paper_a_methodology_v3.md through paper_a_methodology_v3.md.
  - HSV thresholds for red-stamp removal at paper_a_methodology_v3.md.
  - Random seeds / sampling protocol for the 500-page annotation set, the 50,000 inter-CPA negatives, the 30-signature sanity sample, and the Firm A 70/30 split at paper_a_methodology_v3.md, paper_a_methodology_v3.md, paper_a_methodology_v3.md through paper_a_methodology_v3.md, and paper_a_methodology_v3.md.
  - Visual-inspection sample size, selection rule, and decision protocol at paper_a_methodology_v3.md through paper_a_methodology_v3.md.
  - EM / mixture initialization, stopping criteria, boundary clipping for the logit transform, and software versions for the mixture fits at paper_a_methodology_v3.md through paper_a_methodology_v3.md and paper_a_methodology_v3.md.
- The new scripts help the audit, but they also expose that the Results tables are currently not perfectly aligned to the Methods classifier definition. So reproducibility is not only incomplete; it is presently inconsistent in one key place.
Bottom Line
v3.4 clears the ethics/interview blocker and substantially improves the classifier-threshold narrative. It is much closer to a submittable paper than v3.3. But I would still require one more round before IEEE Access submission: (1) make Section III-L, Table XII, Table XVI, and the supporting scripts use the same dHash statistic, or explicitly redefine the classifier around dHash_indep; (2) recompute and correct the Table XI z/p columns from the displayed counts; (3) remove the remaining overstatements about what the 70/30 split and the validation metrics establish; and (4) cut the abstract to <= 250 words while cleaning the non-standard Impact Statement. If those are repaired cleanly, the paper should move into minor-revision territory.