From 26b934c429ff0f891b035d2f7aa2d48e101dd90b Mon Sep 17 00:00:00 2001
From: gbanyan
Date: Mon, 27 Apr 2026 20:45:54 +0800
Subject: [PATCH] Add codex GPT-5.5 round-17 independent peer review artifact

paper/codex_review_gpt55_v3_18_2.md: 16.7 KB / 133 lines. Codex re-audited v3.18.2 against its own round-16 review and the live scripts/JSON. Verdict: Minor Revision (did not regress to Accept simply because v3.18.2 addressed the round-16 findings; instead caught three new issues introduced by the v3.18.2 edits themselves, including four fabricated JSON paths in Appendix B and residual "single dominant mechanism" phrasing not yet softened).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 paper/codex_review_gpt55_v3_18_2.md | 133 ++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 paper/codex_review_gpt55_v3_18_2.md

diff --git a/paper/codex_review_gpt55_v3_18_2.md b/paper/codex_review_gpt55_v3_18_2.md
new file mode 100644
index 0000000..9d1b49e
--- /dev/null
+++ b/paper/codex_review_gpt55_v3_18_2.md
@@ -0,0 +1,133 @@
+# Independent Peer Review (Round 17) - Paper A v3.18.2
+
+Reviewer role: independent peer reviewer for IEEE Access Regular Paper.
+Manuscript reviewed: "Replication-Dominated Calibration" - CPA signature analysis, v3.18.2, commit `7990dab` on `yolo-signature-pipeline`.
+Audit basis: manuscript sections under `paper/`, scripts under `signature_analysis/`, prior round-16 review `paper/codex_review_gpt55_v3_18_1.md`, and generated reports under `/Volumes/NV2/PDF-Processing/signature-analysis/reports/`.
+
+## 1. Overall Verdict: Minor Revision
+
+I recommend **Minor Revision**, not unconditional Accept.
+
+The v3.18.2 revision fixes the most important round-16 empirical problem: the cross-firm dual-descriptor convergence claim is no longer the erroneous `11.3%` vs `58.7%` / `5x` statement. The new script `signature_analysis/28_byte_identity_decomposition.py` and JSON artifact reproduce the corrected values: among signatures with cosine `> 0.95`, non-Firm-A has `27,596 / 65,515 = 42.12%` with `dHash_indep <= 5`, while Firm A has `49,388 / 55,921 = 88.32%`, a `~2.1x` gap. The byte-identity decomposition is also now reproducible: `145` Firm A byte-identical signatures, `50` distinct partners, `180` registered Firm A partners, and `35` cross-year matches.
+
+The revision also resolves most stale section references and improves provenance. However, I found three remaining issues that should be corrected before IEEE Access submission:
+
+1. The Appendix B provenance map overclaims: several mapped report artifacts do not exist at the stated paths in the available report tree.
+2. Some mechanism-identification language was softened in Results but remains too strong in Methodology and Discussion, especially "consistent with a single dominant mechanism."
+3. A few exact method/performance claims remain unverifiable from packaged artifacts, especially YOLO validation metrics, VLM prompt/settings, HSV thresholds, runtime, and some extraction/document-type details.
+
+These are Minor because they do not overturn the central empirical findings, but they affect reproducibility and narrative discipline.
+
+## 2. Re-audit of Round-16 Findings
+
+| Round-16 finding | v3.18.2 status | Re-audit notes |
+|---|---|---|
+| Mechanism-identification overclaim from dip-test non-rejection | **PARTIAL** | Results IV-D.1 now correctly says Firm A "fails to reject unimodality." But Methodology III-H still says the distribution is "consistent with a single dominant mechanism (non-hand-signing)," and Discussion V-C says "consistent with a single dominant mechanism plus residual within-firm heterogeneity." A dip-test non-rejection plus left tail does not identify a single mechanism; the joint evidence supports a replication-dominated benchmark, not a mechanism count. |
+| Stale IV-F / IV-G references after retitling | **LARGELY RESOLVED** | I did not find the old round-16 pattern of IV-F references pointing to the new IV-G validation analyses. The current IV-F/IV-G references are mostly correct. Minor remaining issue: Introduction and conclusion still cite byte-identity as Section IV-F.1 although the detailed `145/50/180/35` decomposition itself is not reported in Section IV-F.1, only in III-H/V-C/Appendix B. |
+| Practitioner knowledge as load-bearing evidence | **PARTIAL** | III-H now explicitly says practitioner knowledge is "non-load-bearing," which is good. But Introduction still says Firm A is "widely recognized within the audit profession" and III-H says "widely held within the audit profession" without a citation or source. This is acceptable only as motivation; I would soften or cite. |
+| 0.941 / 0.945 / 0.9407 ambiguity | **RESOLVED** | III-K and IV-F.3 now correctly distinguish the operational 0.95 cut, the nearby rounded sensitivity cut 0.945, and calibration-fold P5 = 0.9407. |
+| Incorrect cross-firm dual-convergence claim | **RESOLVED** | The prior `11.3%` vs `58.7%` / `5x` claim is gone from current manuscript files. The replacement `42.12%` vs `88.32%` / `~2.1x` matches the new JSON artifact. |
+| Byte-identity decomposition was unverifiable | **RESOLVED with packaging caveat** | New script and JSON reproduce `145/50/180/35`. Caveat: the manuscript says reports are under the project's `reports/` tree, but the actual artifact I inspected is under `/Volumes/NV2/PDF-Processing/signature-analysis/reports/...`, not under this repo's `reports/` path. |
+| Legacy EER/FRR/Precision/F1 script comments | **RESOLVED enough** | Scripts 19 and 21 now label EER/FRR/Precision/F1 as legacy / diagnostic-only and state that the manuscript omits them. Some functions still emit those sections if run, but the conceptual warning is explicit. |
+
+## 3. New Empirical-Claim Audit
+
+Status definitions: **VERIFIED** = matches script/report or arithmetic; **PARTIAL** = broadly supported but wording/provenance needs cleanup; **UNVERIFIABLE** = plausible but not traceable in the available artifacts; **SUSPICIOUS** = overphrased or internally inconsistent. I found no new fabricated core result.
+
+| Claim | Status | Audit basis / notes |
+|---|---|---|
+| 90,282 PDFs, 2013-2023, Taiwan | VERIFIED | Consistent across manuscript. Raw scraping log not audited. |
+| 86,072 VLM-positive documents; 12 corrupted PDFs; final 86,071 | VERIFIED | Internally consistent in III-C. |
+| 182,328 extracted signatures; 168,755 CPA-matched; 13,573 unmatched | VERIFIED | Matches manuscript counts and downstream `168,740` after singleton exclusion. |
+| 758 CPAs, >50 firms, 15 document types, 86.4% standard audit reports | PARTIAL | 758/>50 are stable manuscript counts. I did not find a direct packaged JSON for 15 document types / 86.4%. |
+| Qwen2.5-VL 32B, 180 DPI, first-quartile scan, temperature 0 | UNVERIFIABLE | Method claim not contradicted, but prompt/config/log artifact not inspected. |
+| YOLO 500 annotated pages, 425/75 split, 100 epochs | PARTIAL | Method is clear; no training log audited. |
+| YOLO precision 0.97-0.98, recall 0.95-0.98, mAP metrics | UNVERIFIABLE | Table II remains unsupported by a visible training-results artifact. |
+| 43.1 docs/sec with 8 workers | UNVERIFIABLE | Runtime claim still lacks a visible timing log. |
+| Same-CPA best-match N = 168,740, 15 fewer than matched due to singleton CPAs | VERIFIED | Matches dip-test report and script logic. |
+| ResNet-50 ImageNet-1K V2, 2048-d, L2 normalized | VERIFIED | Consistent with methods and ablation script. |
+| All-pairs intra/inter distribution N = 41,352,824 / 500,000; KDE crossover 0.837; Cohen's d = 0.669 | VERIFIED | Supported by formal-statistical script/report, although Appendix B points to the wrong JSON path. |
+| Firm A dip result N=60,448, dip=0.0019, p=0.169 | VERIFIED | `/reports/dip_test/dip_test_results.json`. |
+| Firm A dHash dip result N=60,448, dip=0.1051, p<0.001 | VERIFIED | Same JSON. |
+| All-CPA cosine/dHash dip results N=168,740, p<0.001 | VERIFIED | Same JSON. |
+| "p = 0.17 at n >= 10 signatures" in III-H | SUSPICIOUS | The `n >= 10` filter applies to accountant-level aggregates in script 15, not the Firm A signature-level dip test. The Firm A dip test uses N=60,448 signatures. |
+| "single dominant mechanism" language | SUSPICIOUS | Still too mechanistic for the statistics; use "dominant high-similarity regime" or "consistent with replication-dominated framing." |
+| BD/McCrary transition instability and values in Appendix A | VERIFIED | `/reports/bd_sensitivity/bd_sensitivity.json`; table values match. |
+| Beta mixture Delta BIC = 381 for Firm A; 10,175 full sample; forced crossings 0.977/0.999 | VERIFIED | `/reports/beta_mixture/beta_mixture_results.json`. |
+| Firm A whole-sample rates in Table IX | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json` and pixel-validation JSON: e.g., cos>0.95 `55,922/60,448 = 92.51%`, dual `54,370/60,448 = 89.95%`. |
+| 310 byte-identical positives | VERIFIED | `/reports/pixel_validation/pixel_validation_results.json`. |
+| Byte-identity decomposition `145 / 50 / 180 / 35` | VERIFIED | New `/reports/byte_identity_decomp/byte_identity_decomposition.json`. The script counts Firm A signatures whose nearest same-CPA match is byte-identical; the "35" is a cross-year nearest-match count, not necessarily a deduplicated unordered pair count. |
+| Table X FAR against 50,000 inter-CPA negatives | VERIFIED | `/reports/expanded_validation/expanded_validation_results.json`. |
+| Omission of EER/FRR/precision/F1 in manuscript | VERIFIED | Manuscript now explains why these are not meaningful for Table X. |
+| Firm A 70/30 split: 124 CPAs/45,116 signatures vs 54 CPAs/15,332 | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`. |
+| Two CPAs excluded from split due to disambiguation ties | UNVERIFIABLE | Plausible; I did not find a report field documenting those two ties. |
+| Table XI rates/z-tests | VERIFIED | Values match recalibration JSON, including corrected `z=-3.19` for cos>0.9407. |
+| Table XII sensitivity counts and +1.19 pp Firm A shift | VERIFIED | Recalibration JSON supports counts and `0.89945` vs `0.91138`. |
+| Table XIII per-year Firm A left-tail rates | PARTIAL | Values are internally coherent, but Appendix B points to `reports/deloitte_distribution/deloitte_distribution_results.json`, which does not exist in the inspected report tree. |
+| Tables XIV/XV partner ranking values | VERIFIED | `/reports/partner_ranking/partner_ranking_results.json`. |
+| Table XVI intra-report agreement | VERIFIED | `/reports/intra_report/intra_report_results.json`. |
+| Table XVII document-level classification counts | VERIFIED with path caveat | Counts match manuscript arithmetic and available PDF verdict artifacts, but Appendix B points to `reports/pdf_level/pdf_level_results.json`, which does not exist. Existing files include `pdf_signature_verdicts.json`, CSV/XLSX, and report markdown at report root. |
+| Cross-firm dual-descriptor convergence `42.12%` vs `88.32%` | VERIFIED | New JSON: non-Firm-A `27,596/65,515`, Firm A `49,388/55,921`. Note this Firm A denominator differs by one from Table IX's cosine-only `55,922`, so the text should specify the additional filters used by script 28. |
+| Ablation Table XVIII | PARTIAL | The script exists and `/Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json` exists, but Appendix B incorrectly maps it to `reports/ablation/ablation_results.json`. |
+| Appendix B claim that all report files are committed alongside scripts in the project's `reports/` tree | SUSPICIOUS | In the current workspace there is no repo-root `reports/` directory. Several paths named in Appendix B are missing even in the absolute report tree. |
+
+## 4. Methodological Rigor
+
+The core methodology remains credible for an IEEE Access Regular Paper. The strongest elements are:
+
+- The paper separates operational calibration from distributional characterization. This is essential because the per-signature diagnostics do not converge to a clean two-class threshold.
+- The dual-descriptor design is well motivated: cosine captures high-level similarity, while independent-minimum dHash provides a structural near-duplicate check.
+- The byte-identical positive anchor is a valid conservative subset, and the inter-CPA negative anchor gives meaningful specificity/FAR estimates.
+- The held-out Firm A fold is now framed as within-Firm-A sampling-variance disclosure rather than full external validation.
+- The new script 28 closes the most important prior provenance gap for byte identity and cross-firm convergence.
+
+Remaining rigor concerns:
+
+1. **Provenance packaging is still inconsistent.** Appendix B says scripts and reports live under the project's `reports/` tree. In this workspace there is no repo-root `reports/` directory, and the actual artifacts are under `/Volumes/NV2/PDF-Processing/signature-analysis/reports/`. More importantly, the Appendix B paths for formal statistical results, Deloitte/Firm-A distribution results, PDF-level results, and ablation results are wrong or missing.
+2. **The Firm A prior remains partly socially sourced.** The text says practitioner knowledge is non-load-bearing, but the Introduction still relies rhetorically on "widely recognized." The empirical case can stand without that phrase.
+3. **The dip-test interpretation remains slightly overextended.** Failure to reject unimodality supports "no clear multimodal split"; it does not show a single mechanism. The byte-identity and ranking evidence do more of the work.
+4. **The `n >= 10` parenthetical in III-H is likely misplaced.** It should not be attached to the Firm A signature-level dip result unless the authors can show the exact filtering.
+5. **Several engineering details remain under-specified for full reproducibility:** VLM prompt/parse rule, HSV red-stamp thresholds, training log for YOLO metrics, and exact runtime environment for throughput.
+
+## 5. Narrative Discipline
+
+The narrative is substantially more disciplined than v3.18.1, but a few overclaims remain.
+
+Recommended softening:
+
+- Replace "detects such non-hand-signed signatures" in the Abstract with "classifies signatures by evidence of non-hand-signing" or "detects replication-consistent signatures." The pipeline does not observe the signing workflow directly.
+- Replace "consistent with a single dominant mechanism (non-hand-signing)" in III-H and "single dominant mechanism plus residual..." in V-C with "consistent with a dominant high-similarity regime plus residual heterogeneity."
+- Replace "widely recognized / widely held within the audit profession" with either a citation or a purely methodological framing: "Firm A was selected as a candidate calibration reference; its benchmark status is evaluated using image evidence below."
+- Be careful with "known-majority-positive population." The empirical evidence supports replication-dominated, but "known" implies a source of ground truth outside the image evidence.
+
+The corrected cross-firm claim is narratively better. The old `5x` story was both wrong and too dramatic; the new `~2.1x` gap is still meaningful and more defensible.
+
+## 6. IEEE Access Fit
+
+The paper fits IEEE Access well. It is application-driven, computationally substantial, and methodologically relevant to document forensics, audit analytics, and computer vision. The novelty is not a new neural architecture; it is the calibration and validation strategy for a real archival corpus with limited ground truth. That is a legitimate IEEE Access contribution.
+
+The remaining issues are editorial/reproducibility issues rather than grounds for rejection. IEEE Access reviewers are likely to value the added Appendix B provenance map, but they will also notice if the mapped paths do not exist. Fixing those paths, or bundling the missing JSON/Markdown reports, is important before submission.
+
+## 7. Specific Actionable Revisions
+
+1. **Fix Appendix B provenance paths.** In the inspected report tree, these Appendix B artifacts are missing at the stated paths:
+   - `reports/formal_statistical/formal_statistical_results.json` (available alternative appears to be `reports/formal_statistical_data.json`)
+   - `reports/deloitte_distribution/deloitte_distribution_results.json` (only figures were present)
+   - `reports/pdf_level/pdf_level_results.json` (available alternatives include `reports/pdf_signature_verdicts.json`, CSV/XLSX, and markdown)
+   - `reports/ablation/ablation_results.json` (actual path appears to be `/Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json`)
+
+2. **Either commit/copy the report tree into the repo or state the absolute artifact root.** The user-facing manuscript says `reports/...`; the current repo root has no `reports/` directory.
+
+3. **Remove the remaining "single dominant mechanism" phrasing.** Use "dominant high-similarity regime" instead.
+
+4. **Fix the III-H parenthetical "p = 0.17 at n >= 10 signatures."** The signature-level dip test is N=60,448; the `n >= 10` rule belongs to accountant-level aggregates.
+
+5. **Clarify the `55,921` denominator in IV-H.2.** It differs by one from Table IX's `55,922` cosine-only Firm A count. Add that script 28 conditions on `assigned_accountant IS NOT NULL` and `min_dhash_independent IS NOT NULL`, or reconcile the one-record discrepancy.
+
+6. **Add or cite artifacts for still-unverifiable operational claims.** At minimum: YOLO training metrics/logs, VLM prompt/config, HSV thresholds, throughput log, and document-type breakdown.
+
+7. **Soften "widely recognized/widely held" practitioner wording or cite it.** The current "non-load-bearing" sentence helps, but uncited professional-knowledge claims are still exposed.
+
+8. **Keep the impact statement archived or revise before reuse.** The archive note correctly warns that "distinguishes genuinely hand-signed signatures from reproduced ones" would overstate the evidence.
+
+Bottom line: v3.18.2 materially improves the paper and fixes the round-16 empirical error. I would not block submission on the central results, but I would require the provenance/path cleanup and the remaining mechanism-language softening before calling it Accept.
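
The arithmetic that the review marks as VERIFIED can be re-derived from the counts quoted in its own text. The following is a minimal sketch, not the project's script 28: it hard-codes the figures as they appear in the review (it does not read any JSON artifact), and the helper name `pct` and the constant names are illustrative assumptions.

```python
from fractions import Fraction


def pct(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to two decimals."""
    return round(float(Fraction(numerator, denominator) * 100), 2)


# Counts copied from the review text (assumption: transcribed correctly;
# nothing here is loaded from the report tree).
NON_FIRM_A_DUAL = (27_596, 65_515)  # cosine > 0.95 with dHash_indep <= 5
FIRM_A_DUAL = (49_388, 55_921)      # same filter, Firm A (script 28 denominator)
TABLE_IX_COS = (55_922, 60_448)     # Firm A, cosine > 0.95 only
TABLE_IX_DUAL = (54_370, 60_448)    # Firm A, dual rule

non_firm_a_pct = pct(*NON_FIRM_A_DUAL)
firm_a_pct = pct(*FIRM_A_DUAL)
gap_ratio = round(firm_a_pct / non_firm_a_pct, 1)

table_ix_cos_pct = pct(*TABLE_IX_COS)
table_ix_dual_pct = pct(*TABLE_IX_DUAL)

# Count identities stated or implied in the audit table.
unmatched = 182_328 - 168_755             # extracted minus CPA-matched
best_match_n = 168_755 - 15               # singleton CPAs excluded
split_total = 45_116 + 15_332             # 70/30 Firm A folds
split_cpas = 124 + 54                     # CPAs across both folds
denominator_gap = TABLE_IX_COS[0] - FIRM_A_DUAL[1]

if __name__ == "__main__":
    print(f"non-Firm-A dual rate: {non_firm_a_pct}%")
    print(f"Firm A dual rate:     {firm_a_pct}%")
    print(f"gap ratio:            ~{gap_ratio}x")
    print(f"Table IX cos>0.95:    {table_ix_cos_pct}%")
    print(f"Table IX dual:        {table_ix_dual_pct}%")
    print(f"denominator gap:      {denominator_gap} record(s)")
```

The one-record `denominator_gap` is the discrepancy that revision 5 asks the authors to explain. The sketch only confirms that the gap exists in the quoted numbers, not why.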