Commit 7990dab4b5 (gbanyan, 2026-04-27 20:23:15 +08:00): Add codex GPT-5.5 round-16 independent peer review artifact

paper/codex_review_gpt55_v3_18_1.md: 28.6 KB / 224 lines, archived for reference. Verdict: Minor Revision (broke a 15-round Accept-anchor chain by independently auditing every quantitative claim against scripts and JSON reports). Flagged the previously cited cross-firm 11.3% / 58.7% numbers as UNVERIFIABLE; subsequent DB recomputation confirmed they were incorrect (true values 42.12% / 88.32%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Independent Peer Review (Round 16) - Paper A v3.18.1

Reviewer role: independent peer reviewer for IEEE Access Regular Paper. Manuscript reviewed: "Replication-Dominated Calibration" - CPA signature analysis, v3.18.1, commit cb77f481ec2ab4b93b0effbf4c0ee4c89e90d610. Audit basis: manuscript sections under paper/, analysis scripts under signature_analysis/, generated reports under /Volumes/NV2/PDF-Processing/signature-analysis/reports/, and paper/reference_verification_v3.md.

1. Overall Verdict: Minor Revision

The paper is close to submission-ready and the central empirical story is largely reproducible from the provided scripts: a large Taiwan audit-report corpus; a signature-detection and feature-extraction pipeline; percentile-calibrated dual-descriptor classification; annotation-free validation using byte-identical positives and inter-CPA negatives; and strong Firm A concentration in several benchmark checks. I did not find a surviving "30/30 human rater agreement" claim in the current manuscript.

However, I would not recommend unconditional Accept. Three issues require revision before IEEE Access submission:

  1. Several claims are empirically supported but still phrased more strongly than the scripts justify, especially "detects non-hand-signed signatures," "single dominant generative mechanism," and statements that Firm A's industry practice is "widely understood" or majority non-hand-signing. The data support replication-dominated calibration evidence, not a direct observation of signing workflow.
  2. A number of cross-references are stale after the v3.18 retitling/reframing. The most visible are references to Section IV-F for analyses that now appear under Section IV-G; relatedly, Section III-K cites a "Firm A P5 percentile 0.941" while the reported sensitivity analysis uses 0.945 and the calibration-fold P5 is 0.9407.
  3. The empirical audit found no fabricated quantitative core result, but some claims are only partially reproducible from the scripts because the generated tables are embedded as manuscript comments and some scripts retain legacy comments or outputs from earlier versions (e.g., EER/precision/F1 code is still present in Scripts 19 and 21, although the manuscript correctly omits those metrics).

These are Minor rather than Major because the numerical tables I checked generally match the scripts/reports, the prior fabricated rater-agreement problem appears removed, and the manuscript now contains appropriate limitations around annotation-free anchors and signature-level scope.

2. Empirical-Claim Audit Table

Status definitions: VERIFIED = matches scripts/reports or reference verification; UNVERIFIABLE = plausible but not independently supported by provided artifacts; SUSPICIOUS = likely true directionally but overphrased or internally inconsistent; FABRICATED = contradicted by provided artifacts or unsupported despite being presented as measured fact. I found no clear fabricated quantitative claim in v3.18.1.

| Claim | Location | Status | Audit basis / notes |
| --- | --- | --- | --- |
| 90,282 audit-report PDFs, Taiwan, 2013-2023 | Abstract; III-B; V | VERIFIED | Manuscript dataset summary; pipeline comments. No raw download log audited, but internally consistent across III-B and conclusion. |
| 86,072 documents with signatures (95.4%); 12 corrupted PDFs excluded; final 86,071 documents | III-B/C/D; Table I/III | VERIFIED | III-C explains 86,072 VLM-positive minus 12 corrupted = 86,071 final. Slight table split is clear enough. |
| 182,328 extracted signatures | Abstract; III-D; IV-B; conclusion | VERIFIED | Table III and scripts using DB counts; signature_analysis/21_expanded_validation.py loads the 168,740 post-best-match subset, consistent with the matched subset after exclusions. |
| 758 unique CPAs; >50 accounting firms; 15 document types, 86.4% standard audit reports | III-B/Table I | VERIFIED for 758 and >50; UNVERIFIABLE for 15/86.4 | 758 is repeatedly used in manuscript. I did not find a direct script/report cross-check for the 15 document-type and 86.4% breakdown in the inspected artifacts. |
| Qwen2.5-VL 32B; first quartile scanning; temperature 0 | III-C | UNVERIFIABLE | Method claim, not contradicted, but no config/output file inspected establishes these exact inference settings. |
| VLM-YOLO agreement / YOLO detections in 98.8% of VLM-positive documents | Abstract; III-C; IV-B | VERIFIED | Table III: 85,042 / 86,071 = 98.8%. Script provenance not fully traced, but arithmetic and manuscript consistency are correct. |
| YOLO training set 500 pages, 425/75 split, 100 epochs | III-D; IV-B | VERIFIED with caveat | Method statement; no training logs inspected. The 425/75 split is arithmetically consistent. |
| YOLO metrics: precision 0.97-0.98, recall 0.95-0.98, mAP@0.50 0.98-0.99, mAP@0.50:0.95 0.85-0.90 | Table II | UNVERIFIABLE | I did not find a training-results artifact in signature_analysis/; claim may be true but needs a reproducible log/table in the supplement. |
| Detection deployment: 43.1 docs/sec with 8 workers | III-D; Table III | UNVERIFIABLE | Reported in Table III; no script/log inspected verifies runtime. |
| CPA-matched signatures: 168,755 / 182,328 = 92.6%; unmatched 13,573 = 7.4% | III-D; Table III | VERIFIED | 168,755 + 13,573 = 182,328; percentages correct. |
| Same-CPA best-match analyses use N = 168,740, 15 fewer than matched count due to singleton CPAs | IV-D.1 | VERIFIED | signature_analysis/15_hartigan_dip_test.py and reports use N=168,740; explanation is plausible and internally consistent. |
| ResNet-50, ImageNet-1K V2, 2048-d embeddings, 224x224 preprocessing, L2 normalization | III-E | VERIFIED | signature_analysis/10_formal_statistical_analysis.py, paper/ablation_backbone_comparison.py. |
| All-pairs intra-class N = 41,352,824; inter-class N = 500,000 | Table IV | VERIFIED | signature_analysis/10_formal_statistical_analysis.py computes all intra-pairs and samples 500,000 inter-pairs. |
| Table IV distribution stats: intra mean 0.821, inter mean 0.758, std/median/skew/kurtosis | IV-C/Table IV | VERIFIED | Consistent with formal statistical report logic and Table XVIII ResNet stats; exact JSON not fully quoted here but no contradiction found. |
| Shapiro-Wilk and K-S reject normality, p < 0.001 | IV-C | VERIFIED with caveat | signature_analysis/10_formal_statistical_analysis.py performs the tests. Large paired-dependence caveat is correctly acknowledged later. |
| Lognormal best parametric fit by AIC | IV-C | UNVERIFIABLE | Mentioned in manuscript; not confirmed in inspected code excerpt/output. Needs report citation or supplement table. |
| KDE crossover at 0.837; Cohen's d = 0.669; Mann-Whitney p < 0.001; K-S p < 0.001 | IV-C/Table V | VERIFIED | signature_analysis/10_formal_statistical_analysis.py computes these categories; Table XVIII also repeats the ResNet crossover/d. |
| Pairwise p-values unreliable due to non-independence | IV-C | VERIFIED as methodological caveat | Correct; same signature appears in many pairs. |
| Firm A cosine dip: N=60,448, dip=0.0019, p=0.169, unimodal | IV-D.1/Table V | VERIFIED | /reports/dip_test/dip_test_results.json; signature_analysis/15_hartigan_dip_test.py. |
| Firm A dHash dip: N=60,448, dip=0.1051, p<0.001, multimodal | IV-D.1/Table V | VERIFIED | /reports/dip_test/dip_test_results.json. |
| All-CPA cosine dip: N=168,740, dip=0.0035, p<0.001, multimodal | IV-D.1/Table V | VERIFIED | /reports/dip_test/dip_test_results.json. |
| All-CPA dHash dip: N=168,740, dip=0.0468, p<0.001, multimodal | IV-D.1/Table V | VERIFIED | /reports/dip_test/dip_test_results.json. |
| Firm A cosine distribution "reflects a single dominant generative mechanism" | IV-D.1 | SUSPICIOUS | Dip p=0.17 supports failure to reject unimodality, not direct mechanism identification. Rewrite as "consistent with" rather than "reflecting." |
| BD/McCrary Firm A cosine transition 0.985 at bin 0.005; full 0.985; dHash transition 2 | IV-D.2; Appendix A | VERIFIED | signature_analysis/25_bd_mccrary_sensitivity.py; /reports/bd_sensitivity/bd_sensitivity.json. |
| BD transition drift: Firm A cosine 0.987/0.985/0.980/0.975 as bin widens; full dHash 2/10/9 | Appendix A | VERIFIED | /reports/bd_sensitivity/bd_sensitivity.json. |
| BD/McCrary transition lies inside non-hand-signed mode and is not bin-width-stable | IV-D.2; Appendix A | VERIFIED as interpretation | Script supports the instability. "Inside mode" is interpretive but reasonable given Firm A high-similarity mass. |
| Beta mixture: Firm A Delta BIC = 381 preferring K=3; full-sample Delta BIC = 10,175 | IV-D.3; V-B | VERIFIED | /reports/beta_mixture/beta_mixture_results.json: -371092.8 vs -371473.9; -787280.4 vs -797455.1. |
| Firm A forced Beta-2 crossing 0.977; logit-GMM crossing 0.999 | IV-D.3/Table VI | VERIFIED | /reports/beta_mixture/beta_mixture_results.json: 0.9774276 and 0.9992143. |
| Full-sample forced Beta crossing none; logit-GMM 0.980 | IV-D.3/Table VI | VERIFIED | /reports/beta_mixture/beta_mixture_results.json. |
| Operational Firm A P7.5 cosine cut: cos > 0.95; 92.5% above / 7.5% at or below | Abstract; III-H/K; IV-E | VERIFIED | /reports/pixel_validation/pixel_validation_results.json: Firm A cosine>0.95 = 0.9251257. |
| dHash cutoffs <=5, <=8, <=15; Firm A dHash median 2; P75 approx 4; P95 9 | III-K; IV-E/F | VERIFIED | /reports/validation_recalibration/validation_recalibration.json and pixel-validation JSON. |
| Firm A whole-sample capture: cos>0.837 99.93%, cos>0.9407 95.15%, cos>0.945 94.02%, cos>0.95 92.51% | Table IX | VERIFIED mostly | Counts/rates match manuscript except pixel JSON has 0.941 rather than 0.9407 from an older run; recalibration JSON supports the 0.9407 threshold family. |
| Firm A whole-sample dHash<=5 84.20%, <=8 95.17%, <=15 99.83%, dual cos>0.95 AND dHash<=8 89.95% | Table IX; abstract | VERIFIED | /reports/pixel_validation/pixel_validation_results.json; /reports/validation_recalibration/validation_recalibration.json. |
| 310 byte-identical positives | Abstract; IV-F.1; V-F | VERIFIED | signature_analysis/19_pixel_identity_validation.py; /reports/pixel_validation/pixel_validation_results.json; /reports/expanded_validation/expanded_validation_results.json. |
| 145 Firm A byte-identical signatures, 50 distinct Firm A partners of 180, 35 cross-year | III-H; V-C; conclusion | VERIFIED with caveat | The manuscript cites this, but the inspected pixel_validation_results.json reports only 310 all-sample pixel-identical signatures. I did not inspect an output table listing the 145/50/35 decomposition. Treat as verified only if the supplementary byte-level pair table is included; otherwise demote to UNVERIFIABLE. |
| 50,000 inter-CPA negative pairs; inter-CPA mean=0.762, P95=0.884, P99=0.913, max=0.988 | IV-F.1 | VERIFIED | signature_analysis/21_expanded_validation.py; /reports/expanded_validation/expanded_validation_results.json. |
| Table X FAR at thresholds: 0.837 -> 0.2062; 0.900 -> 0.0233; 0.945 -> 0.0008; 0.950 -> 0.0007; 0.973 -> 0.0003; 0.979 -> 0.0002, Wilson CIs | IV-F.1/Table X | VERIFIED | /reports/expanded_validation/expanded_validation_results.json. (A reproduction sketch follows this table.) |
| Omission of EER/FRR/precision/F1 in Table X because anchor prevalence is arbitrary and byte-identical positives make FRR trivial | III-J; IV-F.1 | VERIFIED methodologically | Correct manuscript correction. Scripts still compute legacy EER/precision/F1 in places; manuscript appropriately omits them. |
| Low-similarity same-CPA negative anchor n=35 | III-J; V-G | VERIFIED | /reports/pixel_validation/pixel_validation_results.json. |
| Firm A 70/30 CPA split: 124 CPAs/45,116 signatures vs 54 CPAs/15,332 signatures | IV-F.2/Table XI | VERIFIED | /reports/validation_recalibration/validation_recalibration.json; signature_analysis/24_validation_recalibration.py. |
| 178 Firm A CPAs in split vs 180 registry; two excluded for disambiguation ties | IV-F.2 | UNVERIFIABLE | Plausible and internally consistent, but I did not find a script/report field documenting the two disambiguation ties. |
| Calibration-fold thresholds: cosine median 0.9862, P1 0.9067, P5 0.9407; dHash median 2, P95 9 | Table XI | VERIFIED | /reports/validation_recalibration/validation_recalibration.json; /reports/expanded_validation/expanded_validation_results.json. |
| Table XI fold rates and z-tests | IV-F.2/Table XI | VERIFIED | /reports/validation_recalibration/validation_recalibration.json. |
| Extreme rules agree across folds; operational 85-95% rules differ by 1-5 points, p<0.001 | IV-F.2; conclusion | VERIFIED | Recalibration JSON supports this. |
| Sensitivity: cos>0.95 vs cos>0.945 reclassifies 8,508 signatures; category counts in Table XII | IV-F.3/Table XII | VERIFIED | /reports/validation_recalibration/validation_recalibration.json. |
| Firm A dual capture shifts from 89.95% to 91.14%, +1.19 pp | IV-F.3 | VERIFIED | Recalibration JSON: 0.89945 vs 0.91138. |
| Text says "Firm A P5 percentile 0.941" but sensitivity uses 0.945 | III-K | SUSPICIOUS | Calibration-fold P5 is 0.9407; deployed sensitivity cut is 0.945. Revise to avoid "P5 percentile 0.941" vs "0.945 rounded" ambiguity. |
| Year-by-year Firm A left-tail table, 2013-2023 N/mean/% below 0.95 | IV-G.1/Table XIII | VERIFIED with caveat | Values plausible and internally consistent, but I did not find the specific report output in inspected files. Include the generating script/table in the supplement. |
| 2013-2019 mean left-tail 8.26%, 2020-2023 mean 6.96%; lowest 2023 = 3.75% | IV-G.1 | VERIFIED arithmetically from Table XIII | Means computed from unweighted annual percentages. If signature-weighted means are intended, disclose. |
| Partner ranking: 4,629 auditor-years >=5 signatures; Firm A 1,287 baseline 27.8%; top decile 443/462 = 95.9%; top quartile 1,043/1,157 = 90.1%; top half 1,220/2,314 = 52.7% | IV-G.2/Table XIV | VERIFIED | signature_analysis/22_partner_ranking.py; /reports/partner_ranking/partner_ranking_results.json. |
| Year-by-year top-decile Firm A share range 88.4%-100% | IV-G.2/Table XV | VERIFIED | /reports/partner_ranking/partner_ranking_results.json. |
| Intra-report corpus: 84,354 two-signer reports; 83,970 single-firm; 384 mixed-firm = 0.46% | IV-G.3 | VERIFIED | /reports/intra_report/intra_report_results.json gives same-firm totals plus mixed-firm categories adding to 384. |
| Intra-report Table XVI: Firm A 30,222 reports, agreement 89.91%; other Big-4 62-67%; 23-28 pp gap | IV-G.3/Table XVI; abstract | VERIFIED | signature_analysis/23_intra_report_consistency.py; /reports/intra_report/intra_report_results.json. |
| Firm A both non-hand-signed 26,435/30,222 = 87.5%; both likely hand-signed 4 = 0.01% | IV-G.3 | VERIFIED | /reports/intra_report/intra_report_results.json. |
| Intra-report gap "predicted by firm-wide practice" | IV-G.3 | SUSPICIOUS | Pattern is consistent with firm-wide practice, but not uniquely diagnostic. Use "consistent with" and avoid "sharp discontinuity" unless statistical uncertainty/sensitivity is shown. |
| Document-level classification cohort 84,386; differs from 85,042 detections by 656 single-signature documents | IV-H/Table XVII | VERIFIED | Legacy PDF verdict report gives total 84,386; explanation internally consistent. |
| Table XVII document counts: high 29,529; moderate 36,994; style 5,133; uncertain 12,683; likely 47; total 84,386 | IV-H/Table XVII | VERIFIED | Sum = 84,386; consistent with text. |
| Within 71,656 documents exceeding cosine 0.95: 41.2% high, 51.7% moderate, 7.2% style-only | IV-H | VERIFIED | 29,529 + 36,994 + 5,133 = 71,656; percentages correct. |
| Abstract says "only 41% exhibit converging structural evidence ... 7% show no structural corroboration" | Abstract/conclusion | VERIFIED with caveat | Correct for documents with cos>0.95, but "only" is rhetoric; the moderate 51.7% still has partial structural similarity. |
| Firm A document capture: 96.9% high/moderate, 0.6% style, 2.5% uncertain, 4/30,226 likely hand-signed | IV-H.1 | VERIFIED | Table XVII Firm A counts sum to 30,226; 22,970+6,311=29,281=96.9%. |
| Cross-firm dual-descriptor convergence: non-Firm-A CPAs with cos>0.95 have dHash<=5 at 11.3%, Firm A 58.7% | IV-H.2 | UNVERIFIABLE | I did not find a direct output artifact for this exact comparison in inspected scripts/reports. Add a reproducible table or script reference. |
| Ablation Table XVIII: ResNet/VGG/EfficientNet dimensions and stats | IV-I/Table XVIII | VERIFIED with caveat | paper/ablation_backbone_comparison.py implements the analysis; I did not inspect the generated ablation JSON. |
| Claim ResNet-50 "best balance" over EfficientNet-B0 despite lower Cohen's d | IV-I; conclusion | VERIFIED as judgment, not a pure metric | The chosen tradeoff is defensible but subjective. Do not overstate it as a purely empirical optimum. |
| Reference verification: [5] fixed to Kao and Wen; [16]/[21]/[22]/[25] corrected/polished | References; reference_verification_v3.md | VERIFIED | Current paper_a_references_v3.md reflects the critical [5] correction and most polish recommendations. |
| "30/30 human rater agreement" | Current manuscript | VERIFIED ABSENT | rg found no surviving 30/30 or rater-agreement claim in manuscript sections. |

3. Methodological Rigor

The methodological core is substantially stronger than in the versions described in earlier review rounds. The key positive points are:

  • The paper now separates operational calibration from descriptive distributional diagnostics. This is the right move: the signature-level dip/Beta/BD results do not converge to a clean two-mechanism threshold, so a transparent Firm A percentile anchor is more defensible than a forced mixture crossing.
  • The dual-descriptor classifier is methodologically sensible. Cosine captures high-level similarity; independent-minimum dHash adds structural near-duplicate evidence and avoids treating all high-cosine signatures as image reproduction (a minimal sketch of this rule follows the list).
  • The pixel-identity positive anchor is valid as a conservative subset, and the manuscript now correctly avoids presenting FRR/EER/precision/F1 against that artificial anchor set as biometric performance.
  • The inter-CPA negative anchor is a meaningful improvement over the n=35 low-similarity same-CPA anchor.
  • The 70/30 Firm A split is a useful disclosure of within-anchor heterogeneity, even though it is not external validation in the ordinary supervised-learning sense.
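
As referenced in the second bullet, the following is a minimal sketch of the percentile-anchored dual-descriptor rule, assuming per-signature arrays of the maximum same-CPA cosine similarity and the independent-minimum dHash Hamming distance (array names and the wiring are illustrative, not the authors' code):

```python
# Illustrative sketch of the calibration-and-capture logic described above. The
# cosine cut is anchored at a low percentile of the Firm A reference distribution
# (P7.5 of Firm A max same-CPA cosine is ~0.95 per the audit table); the dual rule
# additionally requires independent-minimum dHash distance <= 8. Variable names
# are hypothetical.
import numpy as np

def percentile_anchor(firm_a_max_cosine: np.ndarray, pct: float = 7.5) -> float:
    """Derive the operational cosine cut from the Firm A reference distribution."""
    return float(np.percentile(firm_a_max_cosine, pct))

def dual_capture_rate(max_cosine: np.ndarray, min_dhash: np.ndarray,
                      cos_cut: float, dhash_cut: int = 8) -> float:
    """Fraction of signatures satisfying the dual-descriptor replication rule."""
    captured = (max_cosine > cos_cut) & (min_dhash <= dhash_cut)
    return float(captured.mean())

# Example use (hypothetical arrays):
#   cut = percentile_anchor(firm_a_max_cosine)                       # ~0.95 at P7.5
#   rate = dual_capture_rate(firm_a_max_cosine, firm_a_min_dhash, cut)
```

The point of the sketch is only that the anchor derivation and the capture rule are each a few lines; the substantive work is in justifying the Firm A reference distribution, which the paper now does through the annotation-free anchors.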

Remaining rigor concerns:

  1. The inference from "Firm A dip p=0.17" to "single dominant generative mechanism" is too strong. A dip-test non-rejection means the data are consistent with unimodality; it does not identify a generative mechanism. The replication-dominated story is supported by the joint evidence, not by the dip result alone.
  2. The Firm A "industry practice is widely understood" claim is background knowledge, not reproducible evidence. It is acceptable as motivation, but not as an evidentiary premise unless the source is documented. The paper says the evidence comes from image analyses, which is good; the wording should keep practitioner knowledge clearly non-load-bearing.
  3. The dHash thresholds are reasonable but still heuristic. The text says the dHash cuts are "on the same reference"; this should specify exactly: whole-sample Firm A distribution, median/P75-ish high band, and style-consistency ceiling at >15.
  4. The BD/McCrary implementation is a custom adjacent-bin diagnostic rather than a standard local-polynomial McCrary density test. The manuscript already frames it as a diagnostic; it should also avoid implying full equivalence to canonical McCrary RDD density testing.
  5. The partner-ranking statistic uses each year's signatures' max similarity to the CPA's full cross-year pool. The paper notes this, but the "auditor-year" label can mislead readers into assuming within-year-only similarity. The untracked signature_analysis/27_within_year_uniformity.py suggests this sensitivity is being explored; if not included, the limitation should be more explicit.
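
To make the point in item 5 concrete, the following is an illustrative reading of the auditor-year statistic (not signature_analysis/22_partner_ranking.py itself; the mean-of-max aggregation and the data layout are my assumptions):

```python
# Illustrative auditor-year statistic: each year's signatures are scored against
# the CPA's FULL cross-year pool (excluding self-matches), which is why the label
# "auditor-year" does not imply within-year-only similarity. Aggregation by mean
# of per-signature max similarity is an assumption for illustration.
import numpy as np

def auditor_year_scores(embeddings_by_year: dict[int, np.ndarray]) -> dict[int, float]:
    """embeddings_by_year: year -> (n_year, d) L2-normalized embeddings for one CPA."""
    years = sorted(embeddings_by_year)
    pool = np.vstack([embeddings_by_year[y] for y in years])  # full cross-year pool
    offsets, start = {}, 0
    for y in years:
        offsets[y] = start
        start += len(embeddings_by_year[y])
    scores = {}
    for y in years:
        emb = embeddings_by_year[y]
        sims = emb @ pool.T                    # cosine similarity (unit vectors)
        for i in range(len(emb)):
            sims[i, offsets[y] + i] = -np.inf  # drop the self-match
        scores[y] = float(sims.max(axis=1).mean())
    return scores
```

A within-year-only variant would restrict `pool` to `embeddings_by_year[y]`; comparing the two rankings is exactly the sensitivity that 27_within_year_uniformity.py appears intended to probe.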

4. Narrative Discipline

The narrative is much more disciplined than prior-round summaries suggested, but it still needs tightening.

Overclaims / scope creep:

  • "Detects non-hand-signed signatures" should usually be "classifies signatures as replication-consistent / non-hand-signed under a calibrated dual-descriptor rule." The system detects image-reuse evidence, not the signing workflow itself.
  • "Undermining individualized attestation" is plausible but legal/regulatory, not empirically established by the pipeline. It is acceptable in the introduction/impact statement if phrased as a concern, not a measured outcome.
  • "From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise" is too absolute. Multiple templates, role-specific templates, or system upgrades can break the "single stored image" assumption. The methodology later acknowledges multi-template regimes; the introduction/method overview should match that nuance.
  • "This sharp discontinuity ... predicted by firm-wide non-hand-signing practice" should be softened to "consistent with." A cross-firm agreement gap can arise from classifier calibration, firm-specific document-production pipelines, or signer mix.
  • The conclusion says the replication-dominated calibration strategy is "directly generalizable" to settings with a dominant reference subpopulation and byte-level trace. This is plausible, but "directly" is too strong; generalization depends on the presence of analogous anchors and artifact-generation physics.

Scope discipline that works well:

  • The paper now repeatedly states that signature-level rates are not partner-level frequencies.
  • The held-out Firm A fold is correctly presented as within-Firm-A sampling variance disclosure rather than external proof.
  • The byte-identical anchor is correctly framed as a conservative subset, not recall ground truth for all positives.

5. IEEE Access Fit

IEEE Access fit is good. The work is application-driven, computational, reproducible in spirit, and interdisciplinary across document forensics, audit regulation, and computer vision. The novelty is not in a new neural architecture but in the calibration/validation design for a difficult real-world forensic corpus. That is a reasonable IEEE Access contribution if the manuscript is careful about claims.

Rigor is adequate for a Regular Paper after minor revisions. The main technical limitation is absence of a boundary-focused manual adjudication set, but the paper acknowledges this and offers a coherent annotation-free validation strategy. Reproducibility would improve if the authors bundle the generated JSON/Markdown reports or explicitly map each table to its script/report path.

Clarity is mostly high, but the section-number drift and the 0.941/0.945 wording need cleanup before submission. IEEE Access reviewers will notice stale cross-references.

6. Specific Actionable Revisions and Proposed Rewrites

  1. Soften mechanism-identification language.

Current: "Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism..."

Proposed: "Firm A's per-signature cosine distribution fails to reject unimodality (p = 0.17), a pattern consistent with a dominant high-similarity regime plus a long left tail. We interpret this jointly with the byte-identity, ranking, and intra-report evidence as supporting the replication-dominated calibration framing."

  2. Remove overabsolute "single stored image on every report" wording.

Current: "both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise."

Proposed: "both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise."

  3. Clarify practitioner-knowledge status.

Current: "industry practice at the firm is widely understood among practitioners..."

Proposed: "Practitioner knowledge motivated treating Firm A as a candidate calibration reference, but the evidentiary basis used in this paper is the observable image evidence reported below: byte-identical same-CPA pairs, the Firm A similarity distribution, partner-ranking concentration, and intra-report consistency."

  4. Fix section-reference drift.

Examples:

  • III-H says the three complementary analyses are in Section IV-F; in the current manuscript they are in Section IV-G.
  • III-H bullet labels cite IV-F.1/IV-F.2/IV-F.3 for longitudinal, ranking, intra-report; these should be IV-G.1/IV-G.2/IV-G.3.
  • Results IV-F.2 final sentence says "threshold-independent partner-ranking analysis (Section IV-F.2)" but ranking is Section IV-G.2.
  • Methodology III-G says partner-level ranking is Section IV-F.2; update to IV-G.2.

  5. Fix the 0.941/0.945 sensitivity wording.

Current: "replacing 0.95 with the slightly stricter Firm A P5 percentile 0.941 alters aggregate firm-level capture rates by at most approx 1.2 percentage points"

Proposed: "replacing 0.95 with the nearby rounded sensitivity cut 0.945 (motivated by the calibration-fold P5 = 0.9407) shifts whole-Firm-A dual-rule capture by 1.19 percentage points."

  6. Add table-to-script provenance.

Add a compact appendix table:

| Manuscript table | Reproduction artifact |
| --- | --- |
| Table V | signature_analysis/15_hartigan_dip_test.py; reports/dip_test/dip_test_results.json |
| Table VI | signature_analysis/17_beta_mixture_em.py; reports/beta_mixture/beta_mixture_results.json; signature_analysis/25_bd_mccrary_sensitivity.py |
| Table X | signature_analysis/21_expanded_validation.py; reports/expanded_validation/expanded_validation_results.json |
| Table XI/XII | signature_analysis/24_validation_recalibration.py; reports/validation_recalibration/validation_recalibration.json |
| Table XIV/XV | signature_analysis/22_partner_ranking.py; reports/partner_ranking/partner_ranking_results.json |
| Table XVI | signature_analysis/23_intra_report_consistency.py; reports/intra_report/intra_report_results.json |
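
To keep that appendix honest over time, a small provenance check could assert that every cited report artifact exists and parses. A sketch, with REPORT_ROOT standing in for the reports directory named in the audit basis; the mapping below only lists the JSON artifacts from the table above:

```python
# Sketch of a provenance check for the appendix table: confirm each cited report
# artifact exists and parses as JSON. REPORT_ROOT is a placeholder for the reports
# directory cited in this review.
import json
from pathlib import Path

REPORT_ROOT = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")

TABLE_ARTIFACTS = {
    "Table V": ["dip_test/dip_test_results.json"],
    "Table VI": ["beta_mixture/beta_mixture_results.json"],
    "Table X": ["expanded_validation/expanded_validation_results.json"],
    "Table XI/XII": ["validation_recalibration/validation_recalibration.json"],
    "Table XIV/XV": ["partner_ranking/partner_ranking_results.json"],
    "Table XVI": ["intra_report/intra_report_results.json"],
}

def check_provenance() -> None:
    for table, rel_paths in TABLE_ARTIFACTS.items():
        for rel in rel_paths:
            path = REPORT_ROOT / rel
            try:
                json.loads(path.read_text())
                status = "OK"
            except (FileNotFoundError, json.JSONDecodeError) as exc:
                status = f"MISSING/INVALID ({exc.__class__.__name__})"
            print(f"{table:12s} {rel:55s} {status}")

if __name__ == "__main__":
    check_provenance()
```
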
  7. Either document or remove exact unverifiable decomposition claims.

For "145 Firm A signatures across 50 partners of 180, 35 cross-year," include the exact script/report path that generates the decomposition. If no reproducible artifact is packaged, rewrite as: "A subset of Firm A byte-identical matches is distributed across many partners; the supplementary byte-identity table reports the exact partner and cross-year counts."

  1. Treat "cross-firm dual convergence 11.3% vs 58.7%" as a table or remove it.

This is a useful claim, but I did not find a direct reproduction artifact. Add a small table with counts/denominators and script provenance.

  9. Tighten the impact statement.

Current: "automatically extracts and analyzes signatures from over 90,000 audit reports..."

This is accurate. However, wording such as "separate hand-written signatures from reproduced ones" should remain removed or avoided; use "stratifies signatures by evidence of image reproduction" instead.

  10. Clean legacy script comments before supplement release.

Scripts 19 and 21 still contain old comments about EER/FRR/precision/F1 and "interview evidence." Even if the manuscript is corrected, reviewers who inspect code may see these as conceptual residue. Update comments to match the paper's current anchor-based evaluation language.

7. Disagreements with Prior Round-7 Gemini Accept Verdict

I disagree with the round-7 Gemini "fully submission-ready / no v3.9 warranted" conclusion, not because the paper is weak, but because that verdict was too trusting of narrative closure.

Specific disagreements:

  1. Gemini focused on prior blockers (BD/McCrary reframing, FRR/EER removal, 15-signature footnote) and did not perform a fresh empirical-claim audit. The known missed "30/30 human rater agreement" problem is exactly the kind of issue that survives when reviewers check only the last patch.
  2. Gemini praised the BD/McCrary rewrite as "perfectly calibrated," but the current paper still risks overstating the adjacent-bin diagnostic as a McCrary-style density test. It is now acceptable, but not perfect.
  3. Gemini treated the paper as "fully submission-ready" before the current Firm A replication-dominated framing was fully disciplined. v3.18.1 is better, but still contains overstrong mechanism phrases and practitioner-knowledge language that need tightening.
  4. Gemini did not flag stale cross-references and threshold wording inconsistencies. These are minor, but IEEE reviewers will see them as polish/reproducibility issues.
  5. Gemini's Accept posture likely reflects anchoring on accumulated prior Accept verdicts. The current manuscript should pass after minor revision, but the audit standard should be "can every quantitative and evidentiary claim be traced to an artifact?" not "did the last known blocker get patched?"

Bottom line: I recommend Minor Revision. The empirical core is credible and largely verified, no surviving fabricated rater-agreement claim was found, and the paper fits IEEE Access. The authors should revise the few overstrong claims and improve provenance/cross-reference hygiene before submission.