gbanyan 0ff1845b22 Paper A v3.4: resolve codex round-3 major-revision blockers
Three blockers from codex gpt-5.4 round-3 review (codex_review_gpt54_v3_3.md):

B1 Classifier vs three-method threshold mismatch
  - Methodology III-L rewritten to make explicit that the per-signature
    classifier and the accountant-level three-method convergence operate
    at different units (signature vs accountant) and are complementary
    rather than substitutable.
  - Add Results IV-G.3 + Table XII operational-threshold sensitivity:
    cos>0.95 vs cos>0.945 shifts dual-rule capture by 1.19 pp on whole
    Firm A; ~5% of signatures flip at the Uncertain/Moderate boundary
    (see the sketch below).
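
A minimal sketch of the kind of sensitivity computation behind Table XII, assuming signature-level cosine scores and dHash distances held in NumPy arrays; the synthetic inputs and array layout are illustrative, not Script 24's actual interface:

```python
import numpy as np

def capture_rate(cos, dhash, cos_cut, dhash_cut=5):
    """Fraction of signatures captured by the dual rule at a given cosine cutoff."""
    return np.mean((cos > cos_cut) & (dhash <= dhash_cut))

def flip_fraction(cos, lo=0.945, hi=0.95):
    """Fraction of signatures whose class flips between the two cosine cutoffs."""
    return np.mean((cos > lo) != (cos > hi))

# Illustrative synthetic scores; the reported 1.19 pp / ~5% figures come from
# running the equivalent computation on the whole Firm A sample.
rng = np.random.default_rng(0)
cos = rng.beta(20, 1, 100_000)
dhash = rng.poisson(3, 100_000)

delta_pp = 100 * (capture_rate(cos, dhash, 0.95) - capture_rate(cos, dhash, 0.945))
print(f"dual-rule capture shift: {delta_pp:+.2f} pp; flips: {100 * flip_fraction(cos):.1f}%")
```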

B2 Held-out validation false "within Wilson CI" claim
  - Script 24 recomputes both calibration-fold and held-out-fold rates
    with Wilson 95% CIs and a two-proportion z-test on each rule
    (sketched below).
  - Table XI replaced with the proper fold-vs-fold comparison; prose
    in Results IV-G.2 and Discussion V-C corrected: extreme rules agree
    across folds (p>0.7); operational rules in the 85-95% band differ
    by 1-5 pp due to within-Firm-A heterogeneity (random 30% sample
    contained more high-replication C1 accountants), not generalization
    failure.
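
A minimal sketch of the fold-vs-fold comparison Script 24 performs, assuming only per-fold hit counts and totals are available; the counts below are placeholders, not the paper's numbers:

```python
import math
from scipy.stats import norm

def wilson_ci(k, n, z=1.96):
    """Wilson 95% score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def two_prop_ztest(k1, n1, k2, n2):
    """Two-sided two-proportion z-test with a pooled proportion."""
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (k1 / n1 - k2 / n2) / se
    return z, 2 * norm.sf(abs(z))

# Placeholder counts for one rule: (hits, total) per fold.
calib, held = (9_300, 10_000), (3_950, 4_200)
print(wilson_ci(*calib), wilson_ci(*held))
print("z = %.2f, p = %.3f" % two_prop_ztest(*calib, *held))
```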

B3 Interview evidence reframed as practitioner knowledge
  - The Firm A "interviews" referenced throughout v3.3 are private,
    informal professional conversations, not structured research
    interviews. Reframed accordingly: all "interview*" references in
    abstract / intro / methodology / results / discussion / conclusion
    are replaced with "domain knowledge / industry-practice knowledge".
  - This avoids overclaiming methodological formality and removes the
    human-subjects research framing that triggered the ethics-statement
    requirement.
  - Section III-H four-pillar Firm A validation now stands on visual
    inspection, signature-level statistics, accountant-level GMM, and
    the three Section IV-H analyses, with practitioner knowledge as
    background context only.
  - New Section III-M ("Data Source and Firm Anonymization") covers
    MOPS public-data provenance, Firm A/B/C/D pseudonymization, and
    conflict-of-interest declaration.

Add signature_analysis/24_validation_recalibration.py for the recomputed
calib-vs-held-out z-tests and the classifier sensitivity analysis;
output in reports/validation_recalibration/.

Pending (not in this commit): abstract length (368 -> 250 words),
Impact Statement removal, BD/McCrary sensitivity reporting, full
reproducibility appendix, references cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 11:45:24 +08:00


Third-Round Review of Paper A v3.3

Overall Verdict: Major Revision

v3.3 is substantially cleaner than v3.2. Most of the round-2 minor issues were genuinely fixed: the anonymization leak is gone, the BD/McCrary wording is now much more careful, the denominator and table-arithmetic errors were corrected, and the manuscript now explicitly distinguishes cosine-conditional from independent-minimum dHash. I do not recommend submission as-is, however, because three non-cosmetic problems remain. First, the central "three-method convergent thresholding" story is still not aligned with the operational classifier: the deployed rules in Section III-L use whole-sample Firm A heuristics (0.95, 5, 15, 0.837) rather than the convergent accountant-level thresholds reported in Section IV-E. Second, the held-out Firm A validation section makes an objectively false numerical claim that the held-out rates match the whole-sample rates within the Wilson confidence intervals. Third, the paper relies on interview evidence from Firm A partners as a key calibration pillar but provides no human-subjects/ethics statement, no consent/exemption language, and almost no protocol detail. Those are fixable, but they are still submission-blocking.

1. v3.2 Findings Follow-up Audit

| Prior v3.2 finding | Status | v3.3 audit |
| --- | --- | --- |
| Three-method convergence overclaim | FIXED | The paper now consistently states that the KDE antimode plus the two mixture-based estimators converge, while BD/McCrary does not produce an accountant-level transition; see paper_a_abstract_v3.md and paper_a_conclusion_v3.md. |
| KDE method inconsistency | FIXED | The KDE crossover vs KDE antimode distinction is now explicit in paper_a_methodology_v3.md, and the Results use the distinction correctly in paper_a_results_v3.md. |
| Unit-of-analysis clarity | PARTIALLY-FIXED | The signature/accountant distinction is much clearer in paper_a_methodology_v3.md, but Sections III-L and IV-F/IV-G still mix analysis levels and dHash statistics: the classifier is described with cosine-conditional dHash cutoffs in paper_a_methodology_v3.md, while the validation tables report only dHash_indep rules in paper_a_results_v3.md. |
| Accountant-level interpretation overstated | FIXED | The manuscript now consistently frames the accountant-level result as clustered but smoothly mixed, not sharply discrete; see paper_a_introduction_v3.md, paper_a_discussion_v3.md, and paper_a_conclusion_v3.md. |
| BD/McCrary rigor | PARTIALLY-FIXED | The overclaim is reduced and the limitation sentence is repaired in paper_a_discussion_v3.md, but the paper still reports a fixed-bin implementation (0.005 cosine bins) in paper_a_methodology_v3.md without any reported bin-width sensitivity results or actual McCrary-style density-estimator output. |
| White 1982 overclaim | FIXED | Related Work now uses the narrower pseudo-true-parameter framing in paper_a_related_work_v3.md, consistent with Methods in paper_a_methodology_v3.md. |
| Firm A circular validation | PARTIALLY-FIXED | The 70/30 CPA-level split is now explicit in paper_a_results_v3.md, but the actual classifier still uses whole-sample Firm A-derived rules in paper_a_methodology_v3.md. The manuscript therefore overstates how fully the held-out fold breaks circularity. |
| 139 + 32 vs 180 discrepancy | FIXED | The 171 + 9 = 180 accounting is now internally consistent; see paper_a_introduction_v3.md, paper_a_results_v3.md, paper_a_discussion_v3.md, and paper_a_conclusion_v3.md. |
| dHash calibration story internally inconsistent | PARTIALLY-FIXED | The distinction between cosine-conditional and independent-minimum dHash is finally stated in paper_a_methodology_v3.md, but the Results still do not "report both" as promised there: Tables IX and XI still report only dHash_indep rules in paper_a_results_v3.md. |
| Section IV-H.3 not threshold-independent | FIXED | The paper now correctly labels H.3 as a classifier-based consistency check rather than a threshold-free test; see paper_a_methodology_v3.md and paper_a_results_v3.md. |
| Table XVI numerical error | FIXED | The totals now reconcile: 83,970 single-firm reports plus 384 mixed-firm reports for 84,354 total in paper_a_results_v3.md. |
| Held-out Firm A denominator shift | FIXED | The 178-CPA held-out denominator is now explicitly explained by two excluded disambiguation-tie CPAs in paper_a_results_v3.md. |
| Table numbering / cross-reference confusion | PARTIALLY-FIXED | The duplicate "Table VIII" phrasing is gone, but the numbering still jumps from Table XI to Table XIII in paper_a_results_v3.md. |
| Real firm identities leaked in tables | FIXED | The manuscript now consistently uses Firm A/B/C/D; see paper_a_results_v3.md. |
| Table X mixed unlike units while still reporting precision / F1 | FIXED | The paper now explicitly says precision and F1 are not meaningful here and omits them; see paper_a_methodology_v3.md and paper_a_results_v3.md. |
| "Three independent statistical methods" wording | FIXED | The manuscript now uses "methodologically distinct" in paper_a_abstract_v3.md and paper_a_methodology_v3.md. |
| Abstract / conclusion / discussion still implied BD converged | FIXED | The relevant sections now explicitly separate the non-transition result from the convergent estimators; see paper_a_abstract_v3.md, paper_a_discussion_v3.md, and paper_a_conclusion_v3.md. |
| Stale "discrete behaviour" wording | FIXED | The current wording is appropriately narrowed in paper_a_methodology_v3.md and paper_a_conclusion_v3.md. |
| Related Work still overclaimed White 1982 | FIXED | The problematic sentence is gone; see paper_a_related_work_v3.md. |
| Section III-H preview said "two analyses" | FIXED | It now correctly says "three analyses" in paper_a_methodology_v3.md. |
| Incorrect limitation sentence about BD/McCrary threshold-setting role | FIXED | The limitation is now correctly framed in paper_a_discussion_v3.md. |

2. New Findings in v3.3

Blockers

  • The paper still does not document the ethics status of the interview evidence that underwrites the Firm A calibration anchor. The interviews are not incidental; they are used in the Abstract, Introduction, Methods, Discussion, and Conclusion as one of the main justifications for identifying Firm A as replication-dominated; see paper_a_abstract_v3.md, paper_a_introduction_v3.md, and paper_a_methodology_v3.md. There is no statement about IRB/review-board approval, exemption, participant consent, number of interviewees, interview dates, or anonymization protocol. For IEEE Access this is not optional if the paper reports human-subject research.

  • The operational classifier is still not the classifier implied by the paper's title and main thresholding narrative. Section III-I says the accountant-level estimates are the threshold reference used in classification at paper_a_methodology_v3.md, and Section IV-E says the primary accountant-level interpretation comes from the 0.973 / 0.979 / 0.976 convergence band (with 0.945 / 8.10 as a secondary cross-check) at paper_a_results_v3.md. But the actual five-way classifier in Section III-L uses 0.95, 0.837, and dHash cutoffs 5 / 15 from whole-sample Firm A heuristics in paper_a_methodology_v3.md. As written, the paper demonstrates convergent threshold analysis but deploys a different heuristic classifier (see the sketch after this list).

  • The "held-out fold confirms generalization" claim is numerically false as written. The manuscript states that the held-out rates "match the whole-sample rates of Table IX within each rule's Wilson confidence interval" at paper_a_results_v3.md, and repeats the same idea in Discussion at paper_a_discussion_v3.md. That is not true for several published rules. Examples: whole-sample cosine > 0.95 = 92.51% at paper_a_results_v3.md is outside the held-out CI [93.21%, 93.98%] at paper_a_results_v3.md; whole-sample dHash_indep ≤ 5 = 84.20% at paper_a_results_v3.md is outside [87.31%, 88.34%] at paper_a_results_v3.md; whole-sample dual-rule 89.95% at paper_a_results_v3.md is outside [91.09%, 91.97%] at paper_a_results_v3.md. This needs correction, not softening.

Major Issues

  • The dHash statistic used by the deployed classifier remains ambiguous (see the first sketch after this list). Section III-L says the final classifier retains the cosine-conditional dHash cutoffs for continuity in paper_a_methodology_v3.md, but Tables IX and XI report only dHash_indep rules in paper_a_results_v3.md. Section III-L also promises that anchor-level analysis reports both cosine-conditional and independent-minimum rates, but the Results do not. This is still a material reproducibility and interpretation gap.

  • The paper still overstates what the 70/30 split accomplishes. Section III-K promises that calibration-fold percentiles are derived from the 70% fold only at paper_a_methodology_v3.md, but Section III-L then says the classifier uses thresholds inherited from the whole-sample Firm A distribution at paper_a_methodology_v3.md. That means the held-out fold is not a fully external evaluation for the actual deployed classifier.

  • The validation-metric story still overpromises in the Introduction and Impact Statement. The Introduction says the design includes validation using "precision, recall, F_1, and equal-error-rate metrics" at paper_a_introduction_v3.md, but Methods and Results later state that precision and F_1 are not meaningful here and that FRR/recall is only valid for the conservative byte-identical subset at paper_a_methodology_v3.md and paper_a_results_v3.md. The Impact Statement is even stronger, claiming the system "distinguishes genuinely hand-signed signatures from reproduced ones" at paper_a_impact_statement_v3.md, which is not what a five-way confidence classifier with no full ground-truth test set has established.

  • The claimed empirical check on the within-auditor-year no-mixing assumption is not actually a check on that assumption. Section III-G says the intra-report consistency analysis "provides an empirical check on the within-auditor-year assumption" at paper_a_methodology_v3.md. But Section IV-H.3 measures agreement between two different signers on the same report at paper_a_results_v3.md; it does not test whether the same CPA mixes signing mechanisms within a fiscal year.

  • BD/McCrary is still the weakest statistical component and is not yet reported rigorously enough to sit as an equal methodological peer to the other two methods. The paper specifies a fixed bin width and mentions a KDE bandwidth sensitivity check in paper_a_methodology_v3.md, but no actual sensitivity results, Z-statistics, p-values, or alternate-bin outputs are reported anywhere in Section IV (see the second sketch after this list). The narrative conclusions are probably directionally reasonable, but the evidentiary reporting is still thin.

  • Reproducibility from the paper alone is still insufficient. Missing or under-specified items include the exact VLM prompt and parsing rules and the HSV thresholds for red-stamp removal (paper_a_methodology_v3.md); the sampling/randomization seeds for the 500-image YOLO annotation set, the 500,000 inter-class pairs, the 50,000 inter-CPA negatives, and the 70/30 Firm A split (paper_a_results_v3.md); and the initialization/convergence/clipping details for the Beta and logit-GMM fits (paper_a_methodology_v3.md).

  • Section III-H still contains one misleading sentence about H.1: it says the fixed 0.95 cutoff "is not calibrated to Firm A" at paper_a_methodology_v3.md, but Section IV-F explicitly says 0.95 and the dHash percentile rules are anchored to Firm A at paper_a_results_v3.md, and Section III-L says the classifier inherits thresholds from the whole-sample Firm A distribution at paper_a_methodology_v3.md. Those statements need to be reconciled.
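
For the dHash ambiguity above, a sketch of the two statistics whose conflation is being flagged, under my assumption (not the paper's stated definition) that "cosine-conditional" means the dHash distance at the cosine-nearest anchor and "independent-minimum" means the minimum dHash distance over all anchors:

```python
import numpy as np

def dhash_stats(cos_sims: np.ndarray, dhash_dists: np.ndarray):
    """Two per-signature dHash statistics against the same anchor set.

    cos_sims    -- shape (n_anchors,), cosine similarity to each anchor
    dhash_dists -- shape (n_anchors,), dHash Hamming distance to each anchor
    """
    conditional = int(dhash_dists[np.argmax(cos_sims)])  # dHash at the cosine-nearest anchor
    independent = int(dhash_dists.min())                 # minimum dHash over all anchors
    return conditional, independent

# The two can diverge because the cosine-nearest anchor need not be the
# dHash-nearest one, so a rule like "dHash <= 5" is a different rule under
# each statistic, and the tables must say which one they report.
```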
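
And for the BD/McCrary bullet, a sketch of the kind of bin-width sensitivity reporting being asked for, using a crude binned log-density-jump statistic as a stand-in for a full McCrary local-linear density test (this is not the paper's implementation):

```python
import numpy as np

def log_density_jump(x, cutoff, bin_width):
    """Log-ratio of bin counts just above vs just below the cutoff, with a z-statistic."""
    lo = np.sum((x >= cutoff - bin_width) & (x < cutoff))
    hi = np.sum((x >= cutoff) & (x < cutoff + bin_width))
    if lo == 0 or hi == 0:
        return float("nan"), float("nan")
    theta = np.log(hi / lo)
    se = np.sqrt(1 / hi + 1 / lo)  # delta-method SE for a log count ratio
    return theta, theta / se

# Report the estimate and z across bin widths rather than a single 0.005 choice.
rng = np.random.default_rng(1)
scores = rng.beta(8, 2, 50_000)  # synthetic stand-in for the cosine distribution
for w in (0.0025, 0.005, 0.01, 0.02):
    theta, z = log_density_jump(scores, 0.8, w)
    print(f"bin={w:.4f}  jump={theta:+.3f}  z={z:+.2f}")
```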

Minor Issues

  • The table numbering still skips Table XII; it jumps from Table XI to Table XIII in paper_a_results_v3.md.

  • The label dHash_indep ≤ 5 (calib-fold median-adjacent) at paper_a_results_v3.md is still unclear. If the calibration-fold independent-minimum median is 2, then 5 is not a transparent "median-adjacent" label.

  • The references still need cleanup. At least [27] and [31]-[36] appear unused in the manuscript text, and the Mann-Whitney test is reported at paper_a_results_v3.md without actually citing [36].

3. IEEE Access Fit Check

  • Scope: Yes. The topic fits IEEE Access well as a multidisciplinary methods paper spanning document forensics, computer vision, and audit-regulation applications.

  • Single-anonymized review: IEEE Access uses single-anonymized review according to the current reviewer information page. The manuscript's use of Firm A/B/C/D is therefore not required for author anonymity, but it is acceptable as an entity-confidentiality choice.

  • Formatting / desk-return risks: There are three concrete issues.

    • The abstract is too long for current IEEE journal guidance. The IEEE Author Center says abstracts should be a single paragraph of up to 250 words, whereas the current abstract text in paper_a_abstract_v3.md is roughly 368 words by a plain word count.
    • The paper includes a standalone Impact Statement section at paper_a_impact_statement_v3.md. That is not a standard IEEE Access Regular Paper section and should be removed or relocated unless the target article type explicitly requires it.
    • Because the manuscript relies on partner interviews, it also appears to require the human-subject research statement that IEEE journal guidance asks authors to include when applicable.
  • Official sources checked: IEEE Access submission guidelines, IEEE Author Center article-structure guidance, and IEEE Access reviewer information.

4. Statistical Rigor Audit

  • The paper's main high-level statistical narrative is now mostly coherent. The "Firm A is replication-dominated but not pure" framing is supported by the combination of the 92.5% signature-level rate, the 139 / 32 accountant-level split, and the unimodal-long-tail characterization; see paper_a_introduction_v3.md, paper_a_results_v3.md, and paper_a_discussion_v3.md.

  • The Hartigan dip test is now described correctly as a unimodality test, and the paper no longer treats non-rejection as a formal bimodality finding. That said, the text at paper_a_results_v3.md still moves quickly from "p = 0.17" to a substantive "single dominant generative mechanism" reading (see the first sketch after this list). That interpretation is plausible, but it is still an inference supported by interviews and ancillary evidence, not something the dip test itself establishes.

  • The accountant-level 1D thresholds are statistically described more carefully than before. The 0.973 / 0.979 / 0.976 cosine band is internally consistent across Abstract, Introduction, Results, Discussion, and Conclusion, and the text now correctly treats BD/McCrary non-transition as diagnostic rather than as failed thresholding.

  • The main remaining statistical weakness is the disconnect between where the methods converge and what thresholds the classifier actually uses. If the final classifier remains 0.95 / 5 / 15 / 0.837, then the three-method convergence analysis is supporting context, not operational threshold-setting. The manuscript needs to say that explicitly or change the classifier accordingly.

  • The anchor-based validation is improved, especially because precision and F_1 were removed and Wilson CIs were added. But the EER remains close to vacuous here: with 310 byte-identical positives all sitting near cosine 1.0, the reported "EER ≈ 0" at paper_a_results_v3.md is not very informative and should not be treated as a strong biometric-style performance result (see the second sketch after this list).
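
The dip-test caution in the first bullet is easy to operationalize; a sketch using the third-party `diptest` package (an assumption on my part — the paper does not document its tooling):

```python
import numpy as np
import diptest  # third-party package: pip install diptest

rng = np.random.default_rng(2)
rates = rng.beta(12, 1.5, 171)  # placeholder for the 171 accountant-level rates

dip, pval = diptest.diptest(rates)
# Non-rejection (e.g. p = 0.17) only fails to find multimodality; it does not
# by itself establish a single generative mechanism, which is the caveat above.
print(f"dip = {dip:.4f}, p = {pval:.3f}")
```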
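
And a sketch of why "EER ≈ 0" is near-vacuous in this setting: when all positives pile up near cosine 1.0 and the negatives sit well below, essentially any high threshold separates the classes. The score distributions below are synthetic:

```python
import numpy as np

def eer(pos, neg):
    """Equal-error rate: point where false-accept and false-reject rates cross."""
    thresholds = np.sort(np.concatenate([pos, neg]))
    far = np.array([(neg >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(pos < t).mean() for t in thresholds])   # false rejects
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2, thresholds[i]

rng = np.random.default_rng(3)
pos = 1 - 1e-4 * rng.random(310)  # 310 byte-identical positives near cosine 1.0
neg = rng.beta(2, 2, 5_000)       # inter-CPA negatives spread far below
print(eer(pos, neg))              # EER ~ 0 at almost any high threshold
```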

5. Anonymization Check

  • Within the reviewed manuscript sections, I do not see any explicit real firm names or real auditor names. Firms are consistently pseudonymized as Firm A/B/C/D; see paper_a_results_v3.md.

  • I also do not see author/institution metadata in the reviewed section files. From a single-anonymized IEEE Access standpoint, there is no obvious explicit anonymization leak in the manuscript text provided for review.

  • The one caveat is inferential rather than explicit: the combination of interview-based knowledge, Big-4 status, and distinctive cross-firm statistics may allow knowledgeable local readers to guess which firm is Firm A. That is not an explicit leak, but if firm confidentiality matters beyond mere pseudonymization, the authors should be aware of the residual identifiability risk.

6. Numerical Consistency

  • The major cross-section numbers are now mostly consistent:

    • 90,282 reports / 182,328 signatures / 758 CPAs are aligned across Abstract, Introduction, Methods, and Conclusion.
    • Firm A's 171 analyzable CPAs, 9 excluded CPAs, and 139 / 32 accountant-level split are aligned across Introduction, Results, Discussion, and Conclusion.
    • The partner-ranking 95.9% top-decile share and the intra-report 89.9% agreement are aligned between Methods and Results.
    • Table XVI and Table XVII arithmetic now reconciles.
  • The remaining numerical inconsistency is the held-out-validation sentence discussed above. The underlying table counts are internally consistent, but the prose interpretation at paper_a_results_v3.md and paper_a_discussion_v3.md is not.

  • A second consistency problem is metric-level rather than arithmetic: the classifier is described in Section III-L using cosine-conditional dHash cutoffs, while the validation tables are reported in independent-minimum dHash. That numerical comparison is not apples-to-apples until the paper states clearly which statistic drives Table XVII.

7. Reproducibility

  • The paper is not yet replicable from the manuscript alone.

  • Missing items that should be added before submission:

    • Exact VLM prompt, output format, and page-selection parse rule.
    • YOLO training hyperparameters beyond epoch count and split ratio, plus inference confidence/NMS thresholds.
    • HSV stamp-removal thresholds.
    • Exact matching/disambiguation rules for CPA assignment ties.
    • Random seeds and selection rules for the 500-page annotation sample, the 500,000 inter-class pairs, the 50,000 inter-CPA negatives, and the 70/30 Firm A split.
    • EM/Beta/logit-GMM initialization, stopping criteria, handling of boundary values for the logit transform, and software/library versions for the mixture fits (see the sketch after this list).
    • Sensitivity-analysis results for KDE bandwidth and any analogous robustness checks for the BD/McCrary binning choice.
    • Interview protocol details and the "independent visual inspection" sample size / decision rule.
  • I would not describe the current paper as reproducible "from the paper alone" yet. It is closer than v3.2, but it still depends on undocumented implementation choices.
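
For the mixture-fit item in the list above, a sketch of what a minimal disclosure could look like, assuming scikit-learn's GaussianMixture on logit-transformed rates with epsilon-clipping at the boundaries; the clipping constant, initialization count, and seed are exactly the kind of choices the paper currently leaves undocumented:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_logit_gmm(rates, eps=1e-4, n_init=10, seed=0):
    """Two-component GMM on logit-transformed rates in [0, 1].

    eps-clipping handles boundary values (exact 0.0 / 1.0) that the logit
    cannot take; eps itself is a reportable implementation choice.
    """
    r = np.clip(np.asarray(rates, dtype=float), eps, 1 - eps)
    z = np.log(r / (1 - r))
    return GaussianMixture(n_components=2, n_init=n_init, random_state=seed).fit(z.reshape(-1, 1))

# The accountant-level threshold is then the crossing point of the two fitted
# components, mapped back through the inverse logit (sigmoid).
```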

Bottom Line

v3.3 is close, and most of the v3.2 cleanup work landed correctly. But before IEEE Access submission, I would require: (1) a clean reconciliation between the three-method threshold story and the actual classifier, (2) correction of the false held-out-validation claim, and (3) an explicit ethics/human-subjects statement plus minimal protocol disclosure for the interview evidence. Once those are fixed, the paper is much closer to minor-revision territory.