Independent Peer Review (Round 17) - Paper A v3.18.2
Reviewer role: independent peer reviewer for IEEE Access Regular Paper.
Manuscript reviewed: "Replication-Dominated Calibration" - CPA signature analysis, v3.18.2, commit 7990dab on yolo-signature-pipeline.
Audit basis: manuscript sections under paper/, scripts under signature_analysis/, prior round-16 review paper/codex_review_gpt55_v3_18_1.md, and generated reports under /Volumes/NV2/PDF-Processing/signature-analysis/reports/.
1. Overall Verdict: Minor Revision
I recommend Minor Revision, not unconditional Accept.
The v3.18.2 revision fixes the most important round-16 empirical problem: the cross-firm dual-descriptor convergence claim is no longer the erroneous 11.3% vs 58.7% / 5x statement. The new script signature_analysis/28_byte_identity_decomposition.py and JSON artifact reproduce the corrected values: among signatures with cosine > 0.95, non-Firm-A has 27,596 / 65,515 = 42.12% with dHash_indep <= 5, while Firm A has 49,388 / 55,921 = 88.32%, a ~2.1x gap. The byte-identity decomposition is also now reproducible: 145 Firm A byte-identical signatures, 50 distinct partners, 180 registered Firm A partners, and 35 cross-year matches.
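The corrected figures reduce to straightforward arithmetic and can be re-checked directly from the counts quoted above (a minimal check using only the values reported in the script-28 JSON artifact):

```python
# Re-check the corrected cross-firm dual-descriptor convergence claim.
# Counts are the ones quoted in this review; shares are computed among
# signatures with cosine > 0.95 that also satisfy dHash_indep <= 5.
non_firm_a = 27_596 / 65_515   # non-Firm-A share with dHash_indep <= 5
firm_a = 49_388 / 55_921       # Firm A share with dHash_indep <= 5

print(f"non-Firm-A: {non_firm_a:.2%}")           # 42.12%
print(f"Firm A:     {firm_a:.2%}")               # 88.32%
print(f"gap:        {firm_a / non_firm_a:.1f}x")  # 2.1x
```

The ~2.1x gap is therefore a direct quotient of the two shares, not a separately reported statistic.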
The revision also resolves most stale section references and improves provenance. However, I found three remaining issues that should be corrected before IEEE Access submission:
- The Appendix B provenance map overclaims: several mapped report artifacts do not exist at the stated paths in the available report tree.
- Some mechanism-identification language was softened in Results but remains too strong in Methodology and Discussion, especially "consistent with a single dominant mechanism."
- A few exact method/performance claims remain unverifiable from packaged artifacts, especially YOLO validation metrics, VLM prompt/settings, HSV thresholds, runtime, and some extraction/document-type details.
These issues are classed as Minor because they do not overturn the central empirical findings, but they do affect reproducibility and narrative discipline.
2. Re-audit of Round-16 Findings
| Round-16 finding | v3.18.2 status | Re-audit notes |
|---|---|---|
| Mechanism-identification overclaim from dip-test non-rejection | PARTIAL | Results IV-D.1 now correctly says Firm A "fails to reject unimodality." But Methodology III-H still says the distribution is "consistent with a single dominant mechanism (non-hand-signing)," and Discussion V-C says "consistent with a single dominant mechanism plus residual within-firm heterogeneity." A dip-test non-rejection plus left tail does not identify a single mechanism; the joint evidence supports a replication-dominated benchmark, not a mechanism count. |
| Stale IV-F / IV-G references after retitling | LARGELY RESOLVED | I did not find the old round-16 pattern of IV-F references pointing to the new IV-G validation analyses. The current IV-F/IV-G references are mostly correct. Minor remaining issue: Introduction and conclusion still cite byte-identity as Section IV-F.1 although the detailed 145/50/180/35 decomposition itself is not reported in Section IV-F.1, only in III-H/V-C/Appendix B. |
| Practitioner knowledge as load-bearing evidence | PARTIAL | III-H now explicitly says practitioner knowledge is "non-load-bearing," which is good. But Introduction still says Firm A is "widely recognized within the audit profession" and III-H says "widely held within the audit profession" without a citation or source. This is acceptable only as motivation; I would soften or cite. |
| 0.941 / 0.945 / 0.9407 ambiguity | RESOLVED | III-K and IV-F.3 now correctly distinguish the operational 0.95 cut, the nearby rounded sensitivity cut 0.945, and calibration-fold P5 = 0.9407. |
| Incorrect cross-firm dual-convergence claim | RESOLVED | The prior 11.3% vs 58.7% / 5x claim is gone from current manuscript files. The replacement 42.12% vs 88.32% / ~2.1x matches the new JSON artifact. |
| Byte-identity decomposition was unverifiable | RESOLVED with packaging caveat | New script and JSON reproduce 145/50/180/35. Caveat: the manuscript says reports are under the project's reports/ tree, but the actual artifact I inspected is under /Volumes/NV2/PDF-Processing/signature-analysis/reports/..., not under this repo's reports/ path. |
| Legacy EER/FRR/Precision/F1 script comments | RESOLVED enough | Scripts 19 and 21 now label EER/FRR/Precision/F1 as legacy / diagnostic-only and state that the manuscript omits them. Some functions still emit those sections if run, but the conceptual warning is explicit. |
3. New Empirical-Claim Audit
Status definitions: VERIFIED = matches script/report or arithmetic; PARTIAL = broadly supported but wording/provenance needs cleanup; UNVERIFIABLE = plausible but not traceable in the available artifacts; SUSPICIOUS = overphrased or internally inconsistent. I found no new fabricated core result.
| Claim | Status | Audit basis / notes |
|---|---|---|
| 90,282 PDFs, 2013-2023, Taiwan | VERIFIED | Consistent across manuscript. Raw scraping log not audited. |
| 86,072 VLM-positive documents; 12 corrupted PDFs; final 86,071 | VERIFIED | Internally consistent in III-C. |
| 182,328 extracted signatures; 168,755 CPA-matched; 13,573 unmatched | VERIFIED | Matches manuscript counts and downstream 168,740 after singleton exclusion. |
| 758 CPAs, >50 firms, 15 document types, 86.4% standard audit reports | PARTIAL | 758/>50 are stable manuscript counts. I did not find a direct packaged JSON for 15 document types / 86.4%. |
| Qwen2.5-VL 32B, 180 DPI, first-quartile scan, temperature 0 | UNVERIFIABLE | Method claim not contradicted, but prompt/config/log artifact not inspected. |
| YOLO 500 annotated pages, 425/75 split, 100 epochs | PARTIAL | Method is clear; no training log audited. |
| YOLO precision 0.97-0.98, recall 0.95-0.98, mAP metrics | UNVERIFIABLE | Table II remains unsupported by a visible training-results artifact. |
| 43.1 docs/sec with 8 workers | UNVERIFIABLE | Runtime claim still lacks a visible timing log. |
| Same-CPA best-match N = 168,740, 15 fewer than matched due to singleton CPAs | VERIFIED | Matches dip-test report and script logic. |
| ResNet-50 ImageNet-1K V2, 2048-d, L2 normalized | VERIFIED | Consistent with methods and ablation script. |
| All-pairs intra/inter distribution N = 41,352,824 / 500,000; KDE crossover 0.837; Cohen's d = 0.669 | VERIFIED | Supported by formal-statistical script/report, although Appendix B points to the wrong JSON path. |
| Firm A dip result N=60,448, dip=0.0019, p=0.169 | VERIFIED | /reports/dip_test/dip_test_results.json. |
| Firm A dHash dip result N=60,448, dip=0.1051, p<0.001 | VERIFIED | Same JSON. |
| All-CPA cosine/dHash dip results N=168,740, p<0.001 | VERIFIED | Same JSON. |
| "p = 0.17 at n >= 10 signatures" in III-H | SUSPICIOUS | The n >= 10 filter applies to accountant-level aggregates in script 15, not the Firm A signature-level dip test. The Firm A dip test uses N=60,448 signatures. |
| "single dominant mechanism" language | SUSPICIOUS | Still too mechanistic for the statistics; use "dominant high-similarity regime" or "consistent with replication-dominated framing." |
| BD/McCrary transition instability and values in Appendix A | VERIFIED | /reports/bd_sensitivity/bd_sensitivity.json; table values match. |
| Beta mixture Delta BIC = 381 for Firm A; 10,175 full sample; forced crossings 0.977/0.999 | VERIFIED | /reports/beta_mixture/beta_mixture_results.json. |
| Firm A whole-sample rates in Table IX | VERIFIED | /reports/validation_recalibration/validation_recalibration.json and pixel-validation JSON: e.g., cos>0.95 55,922/60,448 = 92.51%, dual 54,370/60,448 = 89.95%. |
| 310 byte-identical positives | VERIFIED | /reports/pixel_validation/pixel_validation_results.json. |
| Byte-identity decomposition 145 / 50 / 180 / 35 | VERIFIED | New /reports/byte_identity_decomp/byte_identity_decomposition.json. The script counts Firm A signatures whose nearest same-CPA match is byte-identical; the "35" is a cross-year nearest-match count, not necessarily a deduplicated unordered pair count. |
| Table X FAR against 50,000 inter-CPA negatives | VERIFIED | /reports/expanded_validation/expanded_validation_results.json. |
| Omission of EER/FRR/precision/F1 in manuscript | VERIFIED | Manuscript now explains why these are not meaningful for Table X. |
| Firm A 70/30 split: 124 CPAs/45,116 signatures vs 54 CPAs/15,332 | VERIFIED | /reports/validation_recalibration/validation_recalibration.json. |
| Two CPAs excluded from split due to disambiguation ties | UNVERIFIABLE | Plausible; I did not find a report field documenting those two ties. |
| Table XI rates/z-tests | VERIFIED | Values match recalibration JSON, including corrected z=-3.19 for cos>0.9407. |
| Table XII sensitivity counts and +1.19 pp Firm A shift | VERIFIED | Recalibration JSON supports counts and 0.89945 vs 0.91138. |
| Table XIII per-year Firm A left-tail rates | PARTIAL | Values are internally coherent, but Appendix B points to reports/deloitte_distribution/deloitte_distribution_results.json, which does not exist in the inspected report tree. |
| Tables XIV/XV partner ranking values | VERIFIED | /reports/partner_ranking/partner_ranking_results.json. |
| Table XVI intra-report agreement | VERIFIED | /reports/intra_report/intra_report_results.json. |
| Table XVII document-level classification counts | VERIFIED with path caveat | Counts match manuscript arithmetic and available PDF verdict artifacts, but Appendix B points to reports/pdf_level/pdf_level_results.json, which does not exist. Existing files include pdf_signature_verdicts.json, CSV/XLSX, and report markdown at report root. |
| Cross-firm dual-descriptor convergence 42.12% vs 88.32% | VERIFIED | New JSON: non-Firm-A 27,596/65,515, Firm A 49,388/55,921. Note this Firm A denominator differs by one from Table IX's cosine-only 55,922, so the text should specify the additional filters used by script 28. |
| Ablation Table XVIII | PARTIAL | The script exists and /Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json exists, but Appendix B incorrectly maps it to reports/ablation/ablation_results.json. |
| Appendix B claim that all report files are committed alongside scripts in the project's reports/ tree | SUSPICIOUS | In the current workspace there is no repo-root reports/ directory. Several paths named in Appendix B are missing even in the absolute report tree. |
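Several of the audited ratios above are pure arithmetic; as one example, the Table IX whole-sample Firm A rates can be re-derived from the counts already quoted in this review:

```python
# Re-derive the Table IX whole-sample Firm A rates from the raw counts
# quoted in the audit table (60,448 Firm A signatures total).
total = 60_448
cos_rate = 55_922 / total    # signatures with cosine > 0.95
dual_rate = 54_370 / total   # signatures passing the dual-descriptor rule

print(f"cos > 0.95: {cos_rate:.2%}")   # 92.51%
print(f"dual:       {dual_rate:.2%}")  # 89.95%
```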
4. Methodological Rigor
The core methodology remains credible for an IEEE Access Regular Paper. The strongest elements are:
- The paper separates operational calibration from distributional characterization. This is essential because the per-signature diagnostics do not converge to a clean two-class threshold.
- The dual-descriptor design is well motivated: cosine captures high-level similarity, while independent-minimum dHash provides a structural near-duplicate check.
- The byte-identical positive anchor is a valid conservative subset, and the inter-CPA negative anchor gives meaningful specificity/FAR estimates.
- The held-out Firm A fold is now framed as within-Firm-A sampling-variance disclosure rather than full external validation.
- The new script 28 closes the most important prior provenance gap for byte identity and cross-firm convergence.
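The dual-descriptor design can be sketched as a simple conjunction (illustrative only; the function name is mine, and the thresholds are the manuscript's stated cuts, not the authors' code):

```python
def dual_descriptor_flag(cosine: float, dhash_indep: int) -> bool:
    """Flag a signature pair as replication-consistent only when both
    descriptors agree: high embedding cosine similarity AND a structural
    near-duplicate per independent-minimum dHash Hamming distance."""
    return cosine > 0.95 and dhash_indep <= 5

# A high cosine alone is not enough; the structural check must concur.
print(dual_descriptor_flag(0.97, 3))   # True
print(dual_descriptor_flag(0.97, 12))  # False
```

The conjunction is what makes the rule conservative: either descriptor alone can be fooled (cosine by stylistic similarity, dHash by coarse layout), but a joint pass requires near-duplication at both levels.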
Remaining rigor concerns:
- Provenance packaging is still inconsistent. Appendix B says scripts and reports live under the project's reports/ tree. In this workspace there is no repo-root reports/ directory, and the actual artifacts are under /Volumes/NV2/PDF-Processing/signature-analysis/reports/. More importantly, the Appendix B paths for formal statistical results, Deloitte/Firm-A distribution results, PDF-level results, and ablation results are wrong or missing.
- The Firm A prior remains partly socially sourced. The text says practitioner knowledge is non-load-bearing, but the Introduction still relies rhetorically on "widely recognized." The empirical case can stand without that phrase.
- The dip-test interpretation remains slightly overextended. Failure to reject unimodality supports "no clear multimodal split"; it does not show a single mechanism. The byte-identity and ranking evidence do more of the work.
- The n >= 10 parenthetical in III-H is likely misplaced. It should not be attached to the Firm A signature-level dip result unless the authors can show the exact filtering.
- Several engineering details remain under-specified for full reproducibility: VLM prompt/parse rule, HSV red-stamp thresholds, training log for YOLO metrics, and exact runtime environment for throughput.
5. Narrative Discipline
The narrative is substantially more disciplined than v3.18.1, but a few overclaims remain.
Recommended softening:
- Replace "detects such non-hand-signed signatures" in the Abstract with "classifies signatures by evidence of non-hand-signing" or "detects replication-consistent signatures." The pipeline does not observe the signing workflow directly.
- Replace "consistent with a single dominant mechanism (non-hand-signing)" in III-H and "single dominant mechanism plus residual..." in V-C with "consistent with a dominant high-similarity regime plus residual heterogeneity."
- Replace "widely recognized / widely held within the audit profession" with either a citation or a purely methodological framing: "Firm A was selected as a candidate calibration reference; its benchmark status is evaluated using image evidence below."
- Be careful with "known-majority-positive population." The empirical evidence supports replication-dominated, but "known" implies a source of ground truth outside the image evidence.
The corrected cross-firm claim is narratively better. The old 5x story was both wrong and too dramatic; the new ~2.1x gap is still meaningful and more defensible.
6. IEEE Access Fit
The paper fits IEEE Access well. It is application-driven, computationally substantial, and methodologically relevant to document forensics, audit analytics, and computer vision. The novelty is not a new neural architecture; it is the calibration and validation strategy for a real archival corpus with limited ground truth. That is a legitimate IEEE Access contribution.
The remaining issues are editorial/reproducibility issues rather than grounds for rejection. IEEE Access reviewers are likely to value the added Appendix B provenance map, but they will also notice if the mapped paths do not exist. Fixing those paths, or bundling the missing JSON/Markdown reports, is important before submission.
7. Specific Actionable Revisions
- Fix Appendix B provenance paths. In the inspected report tree, these Appendix B artifacts are missing at the stated paths:
  - reports/formal_statistical/formal_statistical_results.json (available alternative appears to be reports/formal_statistical_data.json)
  - reports/deloitte_distribution/deloitte_distribution_results.json (only figures were present)
  - reports/pdf_level/pdf_level_results.json (available alternatives include reports/pdf_signature_verdicts.json, CSV/XLSX, and markdown)
  - reports/ablation/ablation_results.json (actual path appears to be /Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json)
- Either commit/copy the report tree into the repo or state the absolute artifact root. The user-facing manuscript says reports/...; the current repo root has no reports/ directory.
- Remove the remaining "single dominant mechanism" phrasing. Use "dominant high-similarity regime" instead.
- Fix the III-H parenthetical "p = 0.17 at n >= 10 signatures." The signature-level dip test is N=60,448; the n >= 10 rule belongs to accountant-level aggregates.
- Clarify the 55,921 denominator in IV-H.2. It differs by one from Table IX's 55,922 cosine-only Firm A count. Add that script 28 conditions on assigned_accountant IS NOT NULL and min_dhash_independent IS NOT NULL, or reconcile the one-record discrepancy.
- Add or cite artifacts for still-unverifiable operational claims. At minimum: YOLO training metrics/logs, VLM prompt/config, HSV thresholds, throughput log, and document-type breakdown.
- Soften "widely recognized/widely held" practitioner wording or cite it. The current "non-load-bearing" sentence helps, but uncited professional-knowledge claims are still exposed.
- Keep the impact statement archived or revise before reuse. The archive note correctly warns that "distinguishes genuinely hand-signed signatures from reproduced ones" would overstate the evidence.
Bottom line: v3.18.2 materially improves the paper and fixes the round-16 empirical error. I would not block submission on the central results, but I would require the provenance/path cleanup and the remaining mechanism-language softening before calling it Accept.