pdf_signature_extraction/paper/codex_review_gpt55_v3_18_2.md
gbanyan 26b934c429 Add codex GPT-5.5 round-17 independent peer review artifact
paper/codex_review_gpt55_v3_18_2.md: 16.7 KB / 133 lines. Codex re-audited
v3.18.2 against its own round-16 review and the live scripts/JSON.
Verdict: Minor Revision (not upgraded to Accept merely because v3.18.2
addressed the round-16 findings; the review instead catches three new
issues introduced by the v3.18.2 edits themselves, including four
fabricated JSON paths in Appendix B and residual "single dominant
mechanism" phrasing not yet softened).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 20:45:54 +08:00


Independent Peer Review (Round 17) - Paper A v3.18.2

Reviewer role: independent peer reviewer for IEEE Access Regular Paper. Manuscript reviewed: "Replication-Dominated Calibration" - CPA signature analysis, v3.18.2, commit 7990dab on yolo-signature-pipeline. Audit basis: manuscript sections under paper/, scripts under signature_analysis/, prior round-16 review paper/codex_review_gpt55_v3_18_1.md, and generated reports under /Volumes/NV2/PDF-Processing/signature-analysis/reports/.

1. Overall Verdict: Minor Revision

I recommend Minor Revision, not unconditional Accept.

The v3.18.2 revision fixes the most important round-16 empirical problem: the cross-firm dual-descriptor convergence claim is no longer the erroneous 11.3% vs 58.7% / 5x statement. The new script signature_analysis/28_byte_identity_decomposition.py and JSON artifact reproduce the corrected values: among signatures with cosine > 0.95, non-Firm-A has 27,596 / 65,515 = 42.12% with dHash_indep <= 5, while Firm A has 49,388 / 55,921 = 88.32%, a ~2.1x gap. The byte-identity decomposition is also now reproducible: 145 Firm A byte-identical signatures, 50 distinct partners, 180 registered Firm A partners, and 35 cross-year matches.
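As a quick arithmetic check, the corrected proportions and the ~2.1x gap can be reproduced directly from the counts quoted above (a minimal sketch using the audited counts, not a re-run of script 28):

```python
# Counts from the corrected cross-firm dual-descriptor convergence claim:
# among signatures with cosine > 0.95, those with dHash_indep <= 5.
non_firm_a = 27_596 / 65_515   # non-Firm-A share
firm_a = 49_388 / 55_921       # Firm A share

print(f"non-Firm-A: {non_firm_a:.2%}")            # ~42.12%
print(f"Firm A:     {firm_a:.2%}")                # ~88.32%
print(f"gap:        {firm_a / non_firm_a:.2f}x")  # ~2.10x
```

This confirms the manuscript's 42.12% / 88.32% figures and shows the gap is ~2.1x, not the retracted 5x.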

The revision also resolves most stale section references and improves provenance. However, I found three remaining issues that should be corrected before IEEE Access submission:

  1. The Appendix B provenance map overclaims: several mapped report artifacts do not exist at the stated paths in the available report tree.
  2. Some mechanism-identification language was softened in Results but remains too strong in Methodology and Discussion, especially "consistent with a single dominant mechanism."
  3. A few exact method/performance claims remain unverifiable from packaged artifacts, especially YOLO validation metrics, VLM prompt/settings, HSV thresholds, runtime, and some extraction/document-type details.

These issues are classified as Minor because they do not overturn the central empirical findings, but they do affect reproducibility and narrative discipline.

2. Re-audit of Round-16 Findings

| Round-16 finding | v3.18.2 status | Re-audit notes |
| --- | --- | --- |
| Mechanism-identification overclaim from dip-test non-rejection | PARTIAL | Results IV-D.1 now correctly says Firm A "fails to reject unimodality." But Methodology III-H still says the distribution is "consistent with a single dominant mechanism (non-hand-signing)," and Discussion V-C says "consistent with a single dominant mechanism plus residual within-firm heterogeneity." A dip-test non-rejection plus a left tail does not identify a single mechanism; the joint evidence supports a replication-dominated benchmark, not a mechanism count. |
| Stale IV-F / IV-G references after retitling | LARGELY RESOLVED | I did not find the old round-16 pattern of IV-F references pointing to the new IV-G validation analyses. The current IV-F/IV-G references are mostly correct. Minor remaining issue: the Introduction and Conclusion still cite byte-identity as Section IV-F.1, although the detailed 145/50/180/35 decomposition itself is not reported there, only in III-H, V-C, and Appendix B. |
| Practitioner knowledge as load-bearing evidence | PARTIAL | III-H now explicitly says practitioner knowledge is "non-load-bearing," which is good. But the Introduction still says Firm A is "widely recognized within the audit profession" and III-H says "widely held within the audit profession" without a citation or source. This is acceptable only as motivation; I would soften or cite. |
| 0.941 / 0.945 / 0.9407 ambiguity | RESOLVED | III-K and IV-F.3 now correctly distinguish the operational 0.95 cut, the nearby rounded sensitivity cut 0.945, and the calibration-fold P5 = 0.9407. |
| Incorrect cross-firm dual-convergence claim | RESOLVED | The prior 11.3% vs 58.7% / 5x claim is gone from the current manuscript files. The replacement 42.12% vs 88.32% / ~2.1x matches the new JSON artifact. |
| Byte-identity decomposition was unverifiable | RESOLVED, with packaging caveat | New script and JSON reproduce 145/50/180/35. Caveat: the manuscript says reports are under the project's reports/ tree, but the actual artifact I inspected is under /Volumes/NV2/PDF-Processing/signature-analysis/reports/..., not under this repo's reports/ path. |
| Legacy EER/FRR/Precision/F1 script comments | LARGELY RESOLVED | Scripts 19 and 21 now label EER/FRR/Precision/F1 as legacy / diagnostic-only and state that the manuscript omits them. Some functions still emit those sections if run, but the conceptual warning is explicit. |

3. New Empirical-Claim Audit

Status definitions: VERIFIED = matches script/report or arithmetic; PARTIAL = broadly supported but wording/provenance needs cleanup; UNVERIFIABLE = plausible but not traceable in the available artifacts; SUSPICIOUS = overphrased or internally inconsistent. I found no new fabricated core result.

| Claim | Status | Audit basis / notes |
| --- | --- | --- |
| 90,282 PDFs, 2013-2023, Taiwan | VERIFIED | Consistent across manuscript. Raw scraping log not audited. |
| 86,072 VLM-positive documents; 12 corrupted PDFs; final 86,071 | VERIFIED | Internally consistent in III-C. |
| 182,328 extracted signatures; 168,755 CPA-matched; 13,573 unmatched | VERIFIED | Matches manuscript counts and downstream 168,740 after singleton exclusion. |
| 758 CPAs, >50 firms, 15 document types, 86.4% standard audit reports | PARTIAL | 758/>50 are stable manuscript counts. I did not find a direct packaged JSON for 15 document types / 86.4%. |
| Qwen2.5-VL 32B, 180 DPI, first-quartile scan, temperature 0 | UNVERIFIABLE | Method claim not contradicted, but prompt/config/log artifact not inspected. |
| YOLO 500 annotated pages, 425/75 split, 100 epochs | PARTIAL | Method is clear; no training log audited. |
| YOLO precision 0.97-0.98, recall 0.95-0.98, mAP metrics | UNVERIFIABLE | Table II remains unsupported by a visible training-results artifact. |
| 43.1 docs/sec with 8 workers | UNVERIFIABLE | Runtime claim still lacks a visible timing log. |
| Same-CPA best-match N = 168,740, 15 fewer than matched due to singleton CPAs | VERIFIED | Matches dip-test report and script logic. |
| ResNet-50 ImageNet-1K V2, 2048-d, L2 normalized | VERIFIED | Consistent with methods and ablation script. |
| All-pairs intra/inter distribution N = 41,352,824 / 500,000; KDE crossover 0.837; Cohen's d = 0.669 | VERIFIED | Supported by formal-statistical script/report, although Appendix B points to the wrong JSON path. |
| Firm A dip result N=60,448, dip=0.0019, p=0.169 | VERIFIED | /reports/dip_test/dip_test_results.json. |
| Firm A dHash dip result N=60,448, dip=0.1051, p<0.001 | VERIFIED | Same JSON. |
| All-CPA cosine/dHash dip results N=168,740, p<0.001 | VERIFIED | Same JSON. |
| "p = 0.17 at n >= 10 signatures" in III-H | SUSPICIOUS | The n >= 10 filter applies to accountant-level aggregates in script 15, not the Firm A signature-level dip test. The Firm A dip test uses N=60,448 signatures. |
| "single dominant mechanism" language | SUSPICIOUS | Still too mechanistic for the statistics; use "dominant high-similarity regime" or "consistent with replication-dominated framing." |
| BD/McCrary transition instability and values in Appendix A | VERIFIED | /reports/bd_sensitivity/bd_sensitivity.json; table values match. |
| Beta mixture Delta BIC = 381 for Firm A; 10,175 full sample; forced crossings 0.977/0.999 | VERIFIED | /reports/beta_mixture/beta_mixture_results.json. |
| Firm A whole-sample rates in Table IX | VERIFIED | /reports/validation_recalibration/validation_recalibration.json and pixel-validation JSON: e.g., cos>0.95 55,922/60,448 = 92.51%, dual 54,370/60,448 = 89.95%. |
| 310 byte-identical positives | VERIFIED | /reports/pixel_validation/pixel_validation_results.json. |
| Byte-identity decomposition 145 / 50 / 180 / 35 | VERIFIED | New /reports/byte_identity_decomp/byte_identity_decomposition.json. The script counts Firm A signatures whose nearest same-CPA match is byte-identical; the "35" is a cross-year nearest-match count, not necessarily a deduplicated unordered pair count. |
| Table X FAR against 50,000 inter-CPA negatives | VERIFIED | /reports/expanded_validation/expanded_validation_results.json. |
| Omission of EER/FRR/precision/F1 in manuscript | VERIFIED | Manuscript now explains why these are not meaningful for Table X. |
| Firm A 70/30 split: 124 CPAs/45,116 signatures vs 54 CPAs/15,332 | VERIFIED | /reports/validation_recalibration/validation_recalibration.json. |
| Two CPAs excluded from split due to disambiguation ties | UNVERIFIABLE | Plausible; I did not find a report field documenting those two ties. |
| Table XI rates/z-tests | VERIFIED | Values match recalibration JSON, including corrected z=-3.19 for cos>0.9407. |
| Table XII sensitivity counts and +1.19 pp Firm A shift | VERIFIED | Recalibration JSON supports counts and 0.89945 vs 0.91138. |
| Table XIII per-year Firm A left-tail rates | PARTIAL | Values are internally coherent, but Appendix B points to reports/deloitte_distribution/deloitte_distribution_results.json, which does not exist in the inspected report tree. |
| Tables XIV/XV partner ranking values | VERIFIED | /reports/partner_ranking/partner_ranking_results.json. |
| Table XVI intra-report agreement | VERIFIED | /reports/intra_report/intra_report_results.json. |
| Table XVII document-level classification counts | VERIFIED, with path caveat | Counts match manuscript arithmetic and available PDF verdict artifacts, but Appendix B points to reports/pdf_level/pdf_level_results.json, which does not exist. Existing files include pdf_signature_verdicts.json, CSV/XLSX, and report markdown at the report root. |
| Cross-firm dual-descriptor convergence 42.12% vs 88.32% | VERIFIED | New JSON: non-Firm-A 27,596/65,515, Firm A 49,388/55,921. Note this Firm A denominator differs by one from Table IX's cosine-only 55,922, so the text should specify the additional filters used by script 28. |
| Ablation Table XVIII | PARTIAL | The script exists and /Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json exists, but Appendix B incorrectly maps it to reports/ablation/ablation_results.json. |
| Appendix B claim that all report files are committed alongside scripts in the project's reports/ tree | SUSPICIOUS | In the current workspace there is no repo-root reports/ directory. Several paths named in Appendix B are missing even in the absolute report tree. |
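Several of the table rows above rest on two-proportion z-tests (e.g. the corrected z = -3.19 for cos > 0.9407 in Table XI). The underlying counts live in the recalibration JSON and are not reproduced here, so the sketch below uses placeholder counts, purely to show the pooled two-proportion z statistic the audit re-derived:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for H0: p1 == p2."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion under the null
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Placeholder counts (hypothetical, NOT the manuscript's): group 1 has a
# slightly lower positive rate than group 2, so z comes out negative.
z = two_prop_z(890, 1000, 920, 1000)
print(f"z = {z:.2f}")
```

A negative z with large |z| is what the manuscript's corrected z = -3.19 reports for the held-out fold; the point of the sketch is only that the statistic is re-derivable from four counts.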

4. Methodological Rigor

The core methodology remains credible for an IEEE Access Regular Paper. The strongest elements are:

  • The paper separates operational calibration from distributional characterization. This is essential because the per-signature diagnostics do not converge to a clean two-class threshold.
  • The dual-descriptor design is well motivated: cosine captures high-level similarity, while independent-minimum dHash provides a structural near-duplicate check.
  • The byte-identical positive anchor is a valid conservative subset, and the inter-CPA negative anchor gives meaningful specificity/FAR estimates.
  • The held-out Firm A fold is now framed as within-Firm-A sampling-variance disclosure rather than full external validation.
  • The new script 28 closes the most important prior provenance gap for byte identity and cross-firm convergence.
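The dual-descriptor idea can be illustrated with a self-contained toy sketch (stdlib only; the pipeline's actual descriptors are a 2048-d ResNet-50 embedding and a production dHash over rendered signature crops, not these miniatures):

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dhash_bits(gray):
    """Difference hash over a grayscale grid: one bit per horizontally
    adjacent pixel pair (1 if left pixel is brighter than right)."""
    return [int(row[i] > row[i + 1]) for row in gray for i in range(len(row) - 1)]

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

# Toy 4x5 grayscale grids: identical except one pixel.
img1 = [[10, 20, 30, 40, 50],
        [50, 40, 30, 20, 10],
        [10, 20, 30, 40, 50],
        [50, 40, 30, 20, 10]]
img2 = [row[:] for row in img1]
img2[0][1] = 5  # one small structural change

print("cosine (parallel vectors):", cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print("dHash Hamming distance:", hamming(dhash_bits(img1), dhash_bits(img2)))
```

The toy pair lands at a small Hamming distance, which is the near-duplicate signal the dHash_indep <= 5 criterion captures; cosine alone would also score unrelated-but-smooth embeddings highly, which is why the design pairs the two.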

Remaining rigor concerns:

  1. Provenance packaging is still inconsistent. Appendix B says scripts and reports live under the project's reports/ tree. In this workspace there is no repo-root reports/ directory, and the actual artifacts are under /Volumes/NV2/PDF-Processing/signature-analysis/reports/. More importantly, the Appendix B paths for formal statistical results, Deloitte/Firm-A distribution results, PDF-level results, and ablation results are wrong or missing.
  2. The Firm A prior remains partly socially sourced. The text says practitioner knowledge is non-load-bearing, but the Introduction still relies rhetorically on "widely recognized." The empirical case can stand without that phrase.
  3. The dip-test interpretation remains slightly overextended. Failure to reject unimodality supports "no clear multimodal split"; it does not show a single mechanism. The byte-identity and ranking evidence do more of the work.
  4. The n >= 10 parenthetical in III-H is likely misplaced. It should not be attached to the Firm A signature-level dip result unless the authors can show the exact filtering.
  5. Several engineering details remain under-specified for full reproducibility: VLM prompt/parse rule, HSV red-stamp thresholds, training log for YOLO metrics, and exact runtime environment for throughput.

5. Narrative Discipline

The narrative is substantially more disciplined than v3.18.1, but a few overclaims remain.

Recommended softening:

  • Replace "detects such non-hand-signed signatures" in the Abstract with "classifies signatures by evidence of non-hand-signing" or "detects replication-consistent signatures." The pipeline does not observe the signing workflow directly.
  • Replace "consistent with a single dominant mechanism (non-hand-signing)" in III-H and "single dominant mechanism plus residual..." in V-C with "consistent with a dominant high-similarity regime plus residual heterogeneity."
  • Replace "widely recognized / widely held within the audit profession" with either a citation or a purely methodological framing: "Firm A was selected as a candidate calibration reference; its benchmark status is evaluated using image evidence below."
  • Be careful with "known-majority-positive population." The empirical evidence supports replication-dominated, but "known" implies a source of ground truth outside the image evidence.

The corrected cross-firm claim is narratively better. The old 5x story was both wrong and too dramatic; the new ~2.1x gap is still meaningful and more defensible.

6. IEEE Access Fit

The paper fits IEEE Access well. It is application-driven, computationally substantial, and methodologically relevant to document forensics, audit analytics, and computer vision. The novelty is not a new neural architecture; it is the calibration and validation strategy for a real archival corpus with limited ground truth. That is a legitimate IEEE Access contribution.

The remaining issues are editorial/reproducibility issues rather than grounds for rejection. IEEE Access reviewers are likely to value the added Appendix B provenance map, but they will also notice if the mapped paths do not exist. Fixing those paths, or bundling the missing JSON/Markdown reports, is important before submission.

7. Specific Actionable Revisions

  1. Fix Appendix B provenance paths. In the inspected report tree, these Appendix B artifacts are missing at the stated paths:

    • reports/formal_statistical/formal_statistical_results.json (available alternative appears to be reports/formal_statistical_data.json)
    • reports/deloitte_distribution/deloitte_distribution_results.json (only figures were present)
    • reports/pdf_level/pdf_level_results.json (available alternatives include reports/pdf_signature_verdicts.json, CSV/XLSX, and markdown)
    • reports/ablation/ablation_results.json (actual path appears to be /Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json)
  2. Either commit/copy the report tree into the repo or state the absolute artifact root. The user-facing manuscript says reports/...; the current repo root has no reports/ directory.

  3. Remove the remaining "single dominant mechanism" phrasing. Use "dominant high-similarity regime" instead.

  4. Fix the III-H parenthetical "p = 0.17 at n >= 10 signatures." The signature-level dip test is N=60,448; the n >= 10 rule belongs to accountant-level aggregates.

  5. Clarify the 55,921 denominator in IV-H.2. It differs by one from Table IX's 55,922 cosine-only Firm A count. Add that script 28 conditions on assigned_accountant IS NOT NULL and min_dhash_independent IS NOT NULL, or reconcile the one-record discrepancy.

  6. Add or cite artifacts for still-unverifiable operational claims. At minimum: YOLO training metrics/logs, VLM prompt/config, HSV thresholds, throughput log, and document-type breakdown.

  7. Soften "widely recognized/widely held" practitioner wording or cite it. The current "non-load-bearing" sentence helps, but uncited professional-knowledge claims are still exposed.

  8. Keep the impact statement archived or revise before reuse. The archive note correctly warns that "distinguishes genuinely hand-signed signatures from reproduced ones" would overstate the evidence.
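A small check script would keep the Appendix B map honest by failing fast whenever a mapped artifact is absent. This is a sketch: the path list below is a hypothetical subset of Appendix B entries, and the root would be either the repo's reports/ directory or the absolute artifact root, whichever the authors commit to.

```python
import os

# Hypothetical subset of Appendix B artifact paths, relative to the report root.
APPENDIX_B_PATHS = [
    "dip_test/dip_test_results.json",
    "beta_mixture/beta_mixture_results.json",
    "pixel_validation/pixel_validation_results.json",
]

def missing_artifacts(root, rel_paths):
    """Return the relative paths that do not exist as files under root."""
    return [p for p in rel_paths if not os.path.isfile(os.path.join(root, p))]

if __name__ == "__main__":
    for p in missing_artifacts("reports", APPENDIX_B_PATHS):
        print("MISSING:", p)
```

Running this in CI against the committed report tree would have caught all four wrong Appendix B paths before submission.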

Bottom line: v3.18.2 materially improves the paper and fixes the round-16 empirical error. I would not block submission on the central results, but I would require the provenance/path cleanup and the remaining mechanism-language softening before calling it Accept.