pdf_signature_extraction/paper/codex_review_gpt55_v3_18_2.md
gbanyan 26b934c429 Add codex GPT-5.5 round-17 independent peer review artifact
paper/codex_review_gpt55_v3_18_2.md: 16.7 KB / 133 lines. Codex re-audited
v3.18.2 against its own round-16 review and the live scripts/JSON.
Verdict: Minor Revision (not upgraded to Accept merely because v3.18.2
addressed the round-16 findings; the review instead catches three new
issues introduced by the v3.18.2 edits themselves, including four
fabricated JSON paths in Appendix B and residual "single dominant
mechanism" phrasing not yet softened).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 20:45:54 +08:00


Independent Peer Review (Round 17) - Paper A v3.18.2

Reviewer role: independent peer reviewer for IEEE Access Regular Paper. Manuscript reviewed: "Replication-Dominated Calibration" - CPA signature analysis, v3.18.2, commit 7990dab on yolo-signature-pipeline. Audit basis: manuscript sections under paper/, scripts under signature_analysis/, prior round-16 review paper/codex_review_gpt55_v3_18_1.md, and generated reports under /Volumes/NV2/PDF-Processing/signature-analysis/reports/.

1. Overall Verdict: Minor Revision

I recommend Minor Revision, not unconditional Accept.

The v3.18.2 revision fixes the most important round-16 empirical problem: the cross-firm dual-descriptor convergence claim is no longer the erroneous 11.3% vs 58.7% / 5x statement. The new script signature_analysis/28_byte_identity_decomposition.py and JSON artifact reproduce the corrected values: among signatures with cosine > 0.95, non-Firm-A has 27,596 / 65,515 = 42.12% with dHash_indep <= 5, while Firm A has 49,388 / 55,921 = 88.32%, a ~2.1x gap. The byte-identity decomposition is also now reproducible: 145 Firm A byte-identical signatures, 50 distinct partners, 180 registered Firm A partners, and 35 cross-year matches.
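As a quick arithmetic check, the corrected proportions and the ~2.1x gap can be reproduced directly from the counts quoted above (a minimal sketch using the audited counts, not a re-run of script 28):

```python
# Counts from the corrected cross-firm dual-descriptor convergence claim:
# among signatures with cosine > 0.95, those with dHash_indep <= 5.
non_firm_a = 27_596 / 65_515   # non-Firm-A share
firm_a = 49_388 / 55_921       # Firm A share

print(f"non-Firm-A: {non_firm_a:.2%}")            # ~42.12%
print(f"Firm A:     {firm_a:.2%}")                # ~88.32%
print(f"gap:        {firm_a / non_firm_a:.2f}x")  # ~2.10x
```

This confirms the manuscript's 42.12% / 88.32% figures and shows the gap is ~2.1x, not the retracted 5x.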

The revision also resolves most stale section references and improves provenance. However, I found three remaining issues that should be corrected before IEEE Access submission:

  1. The Appendix B provenance map overclaims: several mapped report artifacts do not exist at the stated paths in the available report tree.
  2. Some mechanism-identification language was softened in Results but remains too strong in Methodology and Discussion, especially "consistent with a single dominant mechanism."
  3. A few exact method/performance claims remain unverifiable from packaged artifacts, especially YOLO validation metrics, VLM prompt/settings, HSV thresholds, runtime, and some extraction/document-type details.

These issues are classified as Minor because they do not overturn the central empirical findings, but they do affect reproducibility and narrative discipline.

2. Re-audit of Round-16 Findings

| Round-16 finding | v3.18.2 status | Re-audit notes |
| --- | --- | --- |
| Mechanism-identification overclaim from dip-test non-rejection | PARTIAL | Results IV-D.1 now correctly says Firm A "fails to reject unimodality." But Methodology III-H still says the distribution is "consistent with a single dominant mechanism (non-hand-signing)," and Discussion V-C says "consistent with a single dominant mechanism plus residual within-firm heterogeneity." A dip-test non-rejection plus a left tail does not identify a single mechanism; the joint evidence supports a replication-dominated benchmark, not a mechanism count. |
| Stale IV-F / IV-G references after retitling | LARGELY RESOLVED | I did not find the old round-16 pattern of IV-F references pointing to the new IV-G validation analyses. The current IV-F/IV-G references are mostly correct. Minor remaining issue: the Introduction and Conclusion still cite byte-identity as Section IV-F.1, although the detailed 145/50/180/35 decomposition itself is not reported there, only in III-H, V-C, and Appendix B. |
| Practitioner knowledge as load-bearing evidence | PARTIAL | III-H now explicitly says practitioner knowledge is "non-load-bearing," which is good. But the Introduction still says Firm A is "widely recognized within the audit profession" and III-H says "widely held within the audit profession" without a citation or source. This is acceptable only as motivation; I would soften or cite. |
| 0.941 / 0.945 / 0.9407 ambiguity | RESOLVED | III-K and IV-F.3 now correctly distinguish the operational 0.95 cut, the nearby rounded sensitivity cut 0.945, and the calibration-fold P5 = 0.9407. |
| Incorrect cross-firm dual-convergence claim | RESOLVED | The prior 11.3% vs 58.7% / 5x claim is gone from the current manuscript files. The replacement 42.12% vs 88.32% / ~2.1x matches the new JSON artifact. |
| Byte-identity decomposition was unverifiable | RESOLVED, with packaging caveat | New script and JSON reproduce 145/50/180/35. Caveat: the manuscript says reports are under the project's reports/ tree, but the actual artifact I inspected is under /Volumes/NV2/PDF-Processing/signature-analysis/reports/..., not under this repo's reports/ path. |
| Legacy EER/FRR/Precision/F1 script comments | LARGELY RESOLVED | Scripts 19 and 21 now label EER/FRR/Precision/F1 as legacy / diagnostic-only and state that the manuscript omits them. Some functions still emit those sections if run, but the conceptual warning is explicit. |

3. New Empirical-Claim Audit

Status definitions: VERIFIED = matches script/report or arithmetic; PARTIAL = broadly supported but wording/provenance needs cleanup; UNVERIFIABLE = plausible but not traceable in the available artifacts; SUSPICIOUS = overphrased or internally inconsistent. I found no new fabricated core result.

| Claim | Status | Audit basis / notes |
| --- | --- | --- |
| 90,282 PDFs, 2013-2023, Taiwan | VERIFIED | Consistent across manuscript. Raw scraping log not audited. |
| 86,072 VLM-positive documents; 12 corrupted PDFs; final 86,071 | VERIFIED | Internally consistent in III-C. |
| 182,328 extracted signatures; 168,755 CPA-matched; 13,573 unmatched | VERIFIED | Matches manuscript counts and downstream 168,740 after singleton exclusion. |
| 758 CPAs, >50 firms, 15 document types, 86.4% standard audit reports | PARTIAL | 758/>50 are stable manuscript counts. I did not find a direct packaged JSON for 15 document types / 86.4%. |
| Qwen2.5-VL 32B, 180 DPI, first-quartile scan, temperature 0 | UNVERIFIABLE | Method claim not contradicted, but prompt/config/log artifact not inspected. |
| YOLO 500 annotated pages, 425/75 split, 100 epochs | PARTIAL | Method is clear; no training log audited. |
| YOLO precision 0.97-0.98, recall 0.95-0.98, mAP metrics | UNVERIFIABLE | Table II remains unsupported by a visible training-results artifact. |
| 43.1 docs/sec with 8 workers | UNVERIFIABLE | Runtime claim still lacks a visible timing log. |
| Same-CPA best-match N = 168,740, 15 fewer than matched due to singleton CPAs | VERIFIED | Matches dip-test report and script logic. |
| ResNet-50 ImageNet-1K V2, 2048-d, L2 normalized | VERIFIED | Consistent with methods and ablation script. |
| All-pairs intra/inter distribution N = 41,352,824 / 500,000; KDE crossover 0.837; Cohen's d = 0.669 | VERIFIED | Supported by formal-statistical script/report, although Appendix B points to the wrong JSON path. |
| Firm A dip result N=60,448, dip=0.0019, p=0.169 | VERIFIED | /reports/dip_test/dip_test_results.json. |
| Firm A dHash dip result N=60,448, dip=0.1051, p<0.001 | VERIFIED | Same JSON. |
| All-CPA cosine/dHash dip results N=168,740, p<0.001 | VERIFIED | Same JSON. |
| "p = 0.17 at n >= 10 signatures" in III-H | SUSPICIOUS | The n >= 10 filter applies to accountant-level aggregates in script 15, not the Firm A signature-level dip test. The Firm A dip test uses N=60,448 signatures. |
| "single dominant mechanism" language | SUSPICIOUS | Still too mechanistic for the statistics; use "dominant high-similarity regime" or "consistent with replication-dominated framing." |
| BD/McCrary transition instability and values in Appendix A | VERIFIED | /reports/bd_sensitivity/bd_sensitivity.json; table values match. |
| Beta mixture Delta BIC = 381 for Firm A; 10,175 full sample; forced crossings 0.977/0.999 | VERIFIED | /reports/beta_mixture/beta_mixture_results.json. |
| Firm A whole-sample rates in Table IX | VERIFIED | /reports/validation_recalibration/validation_recalibration.json and pixel-validation JSON: e.g., cos>0.95 55,922/60,448 = 92.51%, dual 54,370/60,448 = 89.95%. |
| 310 byte-identical positives | VERIFIED | /reports/pixel_validation/pixel_validation_results.json. |
| Byte-identity decomposition 145 / 50 / 180 / 35 | VERIFIED | New /reports/byte_identity_decomp/byte_identity_decomposition.json. The script counts Firm A signatures whose nearest same-CPA match is byte-identical; the "35" is a cross-year nearest-match count, not necessarily a deduplicated unordered pair count. |
| Table X FAR against 50,000 inter-CPA negatives | VERIFIED | /reports/expanded_validation/expanded_validation_results.json. |
| Omission of EER/FRR/precision/F1 in manuscript | VERIFIED | Manuscript now explains why these are not meaningful for Table X. |
| Firm A 70/30 split: 124 CPAs/45,116 signatures vs 54 CPAs/15,332 | VERIFIED | /reports/validation_recalibration/validation_recalibration.json. |
| Two CPAs excluded from split due to disambiguation ties | UNVERIFIABLE | Plausible; I did not find a report field documenting those two ties. |
| Table XI rates/z-tests | VERIFIED | Values match recalibration JSON, including corrected z=-3.19 for cos>0.9407. |
| Table XII sensitivity counts and +1.19 pp Firm A shift | VERIFIED | Recalibration JSON supports counts and 0.89945 vs 0.91138. |
| Table XIII per-year Firm A left-tail rates | PARTIAL | Values are internally coherent, but Appendix B points to reports/deloitte_distribution/deloitte_distribution_results.json, which does not exist in the inspected report tree. |
| Tables XIV/XV partner ranking values | VERIFIED | /reports/partner_ranking/partner_ranking_results.json. |
| Table XVI intra-report agreement | VERIFIED | /reports/intra_report/intra_report_results.json. |
| Table XVII document-level classification counts | VERIFIED, with path caveat | Counts match manuscript arithmetic and available PDF verdict artifacts, but Appendix B points to reports/pdf_level/pdf_level_results.json, which does not exist. Existing files include pdf_signature_verdicts.json, CSV/XLSX, and report markdown at the report root. |
| Cross-firm dual-descriptor convergence 42.12% vs 88.32% | VERIFIED | New JSON: non-Firm-A 27,596/65,515, Firm A 49,388/55,921. Note this Firm A denominator differs by one from Table IX's cosine-only 55,922, so the text should specify the additional filters used by script 28. |
| Ablation Table XVIII | PARTIAL | The script exists and /Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json exists, but Appendix B incorrectly maps it to reports/ablation/ablation_results.json. |
| Appendix B claim that all report files are committed alongside scripts in the project's reports/ tree | SUSPICIOUS | In the current workspace there is no repo-root reports/ directory. Several paths named in Appendix B are missing even in the absolute report tree. |
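Several of the table rows above rest on two-proportion z-tests (e.g. the corrected z = -3.19 for cos > 0.9407 in Table XI). The underlying counts live in the recalibration JSON and are not reproduced here, so the sketch below uses placeholder counts, purely to show the pooled two-proportion z statistic the audit re-derived:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for H0: p1 == p2."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion under the null
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Placeholder counts (hypothetical, NOT the manuscript's): group 1 has a
# slightly lower positive rate than group 2, so z comes out negative.
z = two_prop_z(890, 1000, 920, 1000)
print(f"z = {z:.2f}")
```

A negative z with large |z| is what the manuscript's corrected z = -3.19 reports for the held-out fold; the point of the sketch is only that the statistic is re-derivable from four counts.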

4. Methodological Rigor

The core methodology remains credible for an IEEE Access Regular Paper. The strongest elements are:

  • The paper separates operational calibration from distributional characterization. This is essential because the per-signature diagnostics do not converge to a clean two-class threshold.
  • The dual-descriptor design is well motivated: cosine captures high-level similarity, while independent-minimum dHash provides a structural near-duplicate check.
  • The byte-identical positive anchor is a valid conservative subset, and the inter-CPA negative anchor gives meaningful specificity/FAR estimates.
  • The held-out Firm A fold is now framed as within-Firm-A sampling-variance disclosure rather than full external validation.
  • The new script 28 closes the most important prior provenance gap for byte identity and cross-firm convergence.
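The dual-descriptor idea can be illustrated with a self-contained toy sketch (stdlib only; the pipeline's actual descriptors are a 2048-d ResNet-50 embedding and a production dHash over rendered signature crops, not these miniatures):

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dhash_bits(gray):
    """Difference hash over a grayscale grid: one bit per horizontally
    adjacent pixel pair (1 if left pixel is brighter than right)."""
    return [int(row[i] > row[i + 1]) for row in gray for i in range(len(row) - 1)]

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

# Toy 4x5 grayscale grids: identical except one pixel.
img1 = [[10, 20, 30, 40, 50],
        [50, 40, 30, 20, 10],
        [10, 20, 30, 40, 50],
        [50, 40, 30, 20, 10]]
img2 = [row[:] for row in img1]
img2[0][1] = 5  # one small structural change

print("cosine (parallel vectors):", cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print("dHash Hamming distance:", hamming(dhash_bits(img1), dhash_bits(img2)))
```

The toy pair lands at a small Hamming distance, which is the near-duplicate signal the dHash_indep <= 5 criterion captures; cosine alone would also score unrelated-but-smooth embeddings highly, which is why the design pairs the two.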

Remaining rigor concerns:

  1. Provenance packaging is still inconsistent. Appendix B says scripts and reports live under the project's reports/ tree. In this workspace there is no repo-root reports/ directory, and the actual artifacts are under /Volumes/NV2/PDF-Processing/signature-analysis/reports/. More importantly, the Appendix B paths for formal statistical results, Deloitte/Firm-A distribution results, PDF-level results, and ablation results are wrong or missing.
  2. The Firm A prior remains partly socially sourced. The text says practitioner knowledge is non-load-bearing, but the Introduction still relies rhetorically on "widely recognized." The empirical case can stand without that phrase.
  3. The dip-test interpretation remains slightly overextended. Failure to reject unimodality supports "no clear multimodal split"; it does not show a single mechanism. The byte-identity and ranking evidence do more of the work.
  4. The n >= 10 parenthetical in III-H is likely misplaced. It should not be attached to the Firm A signature-level dip result unless the authors can show the exact filtering.
  5. Several engineering details remain under-specified for full reproducibility: VLM prompt/parse rule, HSV red-stamp thresholds, training log for YOLO metrics, and exact runtime environment for throughput.

5. Narrative Discipline

The narrative is substantially more disciplined than v3.18.1, but a few overclaims remain.

Recommended softening:

  • Replace "detects such non-hand-signed signatures" in the Abstract with "classifies signatures by evidence of non-hand-signing" or "detects replication-consistent signatures." The pipeline does not observe the signing workflow directly.
  • Replace "consistent with a single dominant mechanism (non-hand-signing)" in III-H and "single dominant mechanism plus residual..." in V-C with "consistent with a dominant high-similarity regime plus residual heterogeneity."
  • Replace "widely recognized / widely held within the audit profession" with either a citation or a purely methodological framing: "Firm A was selected as a candidate calibration reference; its benchmark status is evaluated using image evidence below."
  • Be careful with "known-majority-positive population." The empirical evidence supports replication-dominated, but "known" implies a source of ground truth outside the image evidence.

The corrected cross-firm claim is narratively better. The old 5x story was both wrong and too dramatic; the new ~2.1x gap is still meaningful and more defensible.

6. IEEE Access Fit

The paper fits IEEE Access well. It is application-driven, computationally substantial, and methodologically relevant to document forensics, audit analytics, and computer vision. The novelty is not a new neural architecture; it is the calibration and validation strategy for a real archival corpus with limited ground truth. That is a legitimate IEEE Access contribution.

The remaining issues are editorial/reproducibility issues rather than grounds for rejection. IEEE Access reviewers are likely to value the added Appendix B provenance map, but they will also notice if the mapped paths do not exist. Fixing those paths, or bundling the missing JSON/Markdown reports, is important before submission.

7. Specific Actionable Revisions

  1. Fix Appendix B provenance paths. In the inspected report tree, these Appendix B artifacts are missing at the stated paths:

    • reports/formal_statistical/formal_statistical_results.json (available alternative appears to be reports/formal_statistical_data.json)
    • reports/deloitte_distribution/deloitte_distribution_results.json (only figures were present)
    • reports/pdf_level/pdf_level_results.json (available alternatives include reports/pdf_signature_verdicts.json, CSV/XLSX, and markdown)
    • reports/ablation/ablation_results.json (actual path appears to be /Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json)
  2. Either commit/copy the report tree into the repo or state the absolute artifact root. The user-facing manuscript says reports/...; the current repo root has no reports/ directory.

  3. Remove the remaining "single dominant mechanism" phrasing. Use "dominant high-similarity regime" instead.

  4. Fix the III-H parenthetical "p = 0.17 at n >= 10 signatures." The signature-level dip test is N=60,448; the n >= 10 rule belongs to accountant-level aggregates.

  5. Clarify the 55,921 denominator in IV-H.2. It differs by one from Table IX's 55,922 cosine-only Firm A count. Add that script 28 conditions on assigned_accountant IS NOT NULL and min_dhash_independent IS NOT NULL, or reconcile the one-record discrepancy.

  6. Add or cite artifacts for still-unverifiable operational claims. At minimum: YOLO training metrics/logs, VLM prompt/config, HSV thresholds, throughput log, and document-type breakdown.

  7. Soften "widely recognized/widely held" practitioner wording or cite it. The current "non-load-bearing" sentence helps, but uncited professional-knowledge claims are still exposed.

  8. Keep the impact statement archived or revise before reuse. The archive note correctly warns that "distinguishes genuinely hand-signed signatures from reproduced ones" would overstate the evidence.
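A small check script would keep the Appendix B map honest by failing fast whenever a mapped artifact is absent. This is a sketch: the path list below is a hypothetical subset of Appendix B entries, and the root would be either the repo's reports/ directory or the absolute artifact root, whichever the authors commit to.

```python
import os

# Hypothetical subset of Appendix B artifact paths, relative to the report root.
APPENDIX_B_PATHS = [
    "dip_test/dip_test_results.json",
    "beta_mixture/beta_mixture_results.json",
    "pixel_validation/pixel_validation_results.json",
]

def missing_artifacts(root, rel_paths):
    """Return the relative paths that do not exist as files under root."""
    return [p for p in rel_paths if not os.path.isfile(os.path.join(root, p))]

if __name__ == "__main__":
    for p in missing_artifacts("reports", APPENDIX_B_PATHS):
        print("MISSING:", p)
```

Running this in CI against the committed report tree would have caught all four wrong Appendix B paths before submission.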

Bottom line: v3.18.2 materially improves the paper and fixes the round-16 empirical error. I would not block submission on the central results, but I would require the provenance/path cleanup and the remaining mechanism-language softening before calling it Accept.