fcce58aff0
Gemini round-6 (paper/gemini_review_v3_7.md) gave Minor Revision but
flagged three issues that five rounds of codex review had missed.
This commit addresses all three.
BLOCKER: Accountant-level BD/McCrary null is a power artifact, not
proof of smoothness (Gemini Issue 1)
- At N=686 accountants the BD/McCrary test has limited statistical
power; interpreting a failure-to-reject as affirmative proof of
smoothness is a Type II error risk.
- Discussion V-B: "itself diagnostic of smoothness" replaced with
"failure-to-reject rather than a failure of the method ---
informative alongside the other evidence but subject to the power
caveat in Section V-G".
- Discussion V-G (Sixth limitation): added a power-aware paragraph
naming N=686 explicitly and clarifying that the substantive claim
of smoothly-mixed clustering rests on the JOINT weight of dip
test + BIC-selected GMM + BD null, not on BD alone.
- Results IV-D.1 and IV-E: reframe accountant-level null as
"consistent with --- not affirmative proof of" clustered-but-
smoothly-mixed, citing V-G for the power caveat.
- Appendix A interpretation paragraph: explicit inferential-asymmetry
sentence ("consistency is what the BD null delivers, not
affirmative proof"); "itself evidence for" removed.
- Conclusion: "consistent with clustered but smoothly mixed"
rephrased with explicit power caveat ("at N = 686 the test has
limited power and cannot affirmatively establish smoothness").
MAJOR: Table X FRR / EER was tautological reviewer-bait
(Gemini Issue 2)
- Byte-identical positive anchor has cosine approx 1 by construction,
so FRR against that subset is trivially 0 at every threshold
below 1 and any EER calculation is arithmetic tautology, not
biometric performance.
- Results IV-G.1: removed EER row; dropped FRR column from Table X;
added a table note explaining the omission and directing readers
to Section V-F for the conservative-subset discussion.
- Methodology III-K: removed the EER / FRR-against-byte-identical
reporting clause; clarified that FAR against inter-CPA negatives
is the primary reported quantity.
- Table X is now FAR + Wilson 95% CI only, which is the quantity
that actually carries empirical content on this anchor design.
MINOR: Document-level worst-case aggregation narrative (Gemini
Issue 3) + 15-signature delta (Gemini spot-check)
- Results IV-I: added two sentences explicitly noting that the
document-level percentages reflect the Section III-L worst-case
aggregation rule (a report with one stamped + one hand-signed
signature inherits the most-replication-consistent label), and
cross-referencing Section IV-H.3 / Table XVI for the mixed-report
composition that qualifies the headline percentages.
- Results IV-D: added a one-sentence footnote explaining that the
15-signature delta between the Table III CPA-matched count
(168,755) and the all-pairs analyzed count (168,740) is due to
CPAs with exactly one signature, for whom no same-CPA pairwise
best-match statistic exists.
Abstract remains 243 words, comfortably under the IEEE Access
250-word cap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Independent Peer Review: Paper A (v3.7)

**Target Venue:** IEEE Access (Regular Paper)

**Date:** April 21, 2026

**Reviewer:** Gemini CLI (6th Round Independent Review)

---

## 1. Overall Verdict

**Verdict: Minor Revision**

**Rationale:**
The manuscript presents a methodologically rigorous, highly sophisticated, and large-scale empirical analysis of non-hand-signed auditor signatures. Analyzing over 180,000 signatures from 90,282 audit reports is an impressive feat, and the pipeline architecture combining VLM prescreening, YOLO detection, and ResNet-50 feature extraction is fundamentally sound. The utilization of a "replication-dominated" calibration strategy—validated across both intra-firm consistency metrics and held-out cross-validation folds—represents a significant contribution to document forensics where ground-truth labeling is scarce and expensive. Furthermore, the dual-descriptor approach (using cosine similarity for semantic features and dHash for structural features) effectively resolves the ambiguity between stylistic consistency and mechanical image reproduction. The demotion of the Burgstahler-Dichev / McCrary (BD/McCrary) test to a density-smoothness diagnostic, supported by the new Appendix A, is analytically correct.
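For readers unfamiliar with the structural half of the dual-descriptor pairing, a minimal difference-hash (dHash) sketch follows. It is illustrative only, not the paper's code: it assumes the signature crop has already been resized to a 9x8 grayscale grid (real pipelines do the resize with PIL or OpenCV), and the grid values are invented.

```python
def dhash(pixels, hash_size=8):
    """Difference hash: each bit records whether a pixel is brighter
    than its right-hand neighbour on a (hash_size+1)-wide grid."""
    assert all(len(row) == hash_size + 1 for row in pixels)
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits  # hash_size * hash_size bits packed into one int

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

# Byte-identical crops hash identically; a small pixel edit flips few bits.
grid = [[(3 * r + 2 * c) % 11 for c in range(9)] for r in range(8)]
tweaked = [row[:] for row in grid]
tweaked[0][0] += 100  # brighten one corner pixel

assert hamming(dhash(grid), dhash(grid)) == 0
assert 0 < hamming(dhash(grid), dhash(tweaked)) <= 2
```

Because dHash keys on local brightness structure rather than learned semantics, it separates mechanical image reproduction from stylistic consistency, which is exactly the ambiguity the dual-descriptor design resolves.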
However, approaching this manuscript with a fresh perspective reveals three distinct methodological blind spots that previous review rounds missed. Specifically, the manuscript commits a statistical overclaim regarding the statistical power of the BD/McCrary test at the accountant level, it presents a mathematically tautological False Rejection Rate (FRR) evaluation that borders on reviewer-bait, and it lacks narrative guardrails around its document-level aggregation metrics. Resolving these localized issues will not alter the paper's conclusions but will significantly harden the manuscript against aggressive peer review, making it fully submission-ready for IEEE Access.

---

## 2. Scientific Soundness Audit

### Three-Level Framework Coherence

The separation of the analysis into signature-level, accountant-level, and auditor-year units is intellectually rigorous and highly defensible. By strictly separating the *pixel-level output quality* (signature level) from the *aggregate behavioral regime* (accountant level), the authors successfully avoid the ecological fallacy of assuming that because an individual practitioner acts in a binary fashion (hand-signing vs. stamping), the aggregate distribution of signature pixels must be neatly bimodal. The evidence compellingly demonstrates that the data forms a continuous quality degradation spectrum at the pixel level.

### Firm A 'Replication-Dominated' Framing

This is perhaps the strongest conceptual pillar of the paper. Assuming that Firm A acts as a "pure" positive class would inevitably force the thresholding model to interpret the long left tail of the cosine distribution as algorithmic noise or pipeline error. The explicit validation of Firm A as "replication-dominated but not pure"—quantified elegantly by the 139/32 split between high-replication and middle-band clusters in the accountant-level Gaussian Mixture Model (Section IV-E)—logically resolves the 92.5% capture rate without overclaiming. It is a highly defensible stance.

### BD/McCrary Demotion

Moving the BD/McCrary test from a co-equal threshold estimator to a "density-smoothness diagnostic" is the correct scientific decision. Appendix A empirically demonstrates that the test behaves exactly as one would expect when applied to a large ($N > 60,000$), smooth, heavy-tailed distribution: it detects localized non-linearities caused by histogram binning resolution rather than true mechanistic discontinuities. The theoretical tension is resolved by this demotion.

### Statistical Choices

The statistical foundations of the paper are appropriate and well-applied:
* **Beta/Logit-Gaussian Mixtures:** Fitting Beta mixtures via the EM algorithm is perfectly suited for bounded cosine similarity data $[0,1]$, and the logit-Gaussian cross-check serves as an excellent robustness measure against parametric misspecification.
* **Hartigan Dip Test:** The use of the dip test provides a rigorous, non-parametric verification of unimodality/multimodality.
* **Wilson Confidence Intervals:** Utilizing Wilson score intervals for the held-out validation metrics (Table XI) correctly models binomial variance, preventing zero-bound confidence interval collapse.
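As a concrete sketch of the last point (pure Python, not the paper's code), the Wilson score interval keeps a strictly positive upper bound even when zero events are observed, which is why it avoids the zero-bound collapse noted above. The 14,035/15,332 figure is the held-out capture count from Table XI; the 50,000 negatives follow Section IV-G.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Held-out fold from Table XI: 14,035 / 15,332 captures.
lo, hi = wilson_interval(14035, 15332)
assert lo < 14035 / 15332 < hi

# Zero observed false accepts out of 50,000 negatives: the upper
# bound stays strictly positive instead of collapsing to zero.
lo0, hi0 = wilson_interval(0, 50000)
assert abs(lo0) < 1e-9 and 0 < hi0 < 1e-4
```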

---

## 3. Numerical Consistency Cross-Check

An exhaustive spot-check of the manuscript’s arithmetic, table values, and cited numbers reveals a practically flawless internal consistency. The scripts supporting the pipeline operate exactly as claimed.

* **Table VIII:** The reported accountant-level threshold band (KDE antimode: 0.973, Beta-2: 0.979, logit-GMM-2: 0.976) matches the narrative text precisely.
* **Table IX:** The proportion of Firm A captures under the dual rule ($54,370 / 60,448 = 89.945\%$) correctly rounds to the reported $89.95\%$.
* **Table XI:** The calibration fold's operational dual rule yields $40,335 / 45,116 = 89.402\%$ (reported $89.40\%$), and the held-out fold yields $14,035 / 15,332 = 91.540\%$ (reported $91.54\%$).
* **Table XII:** The column sums for $N = 168,740$ match perfectly. Furthermore, the delta column balances precisely to zero ($+2,294 + 6,095 + 119 - 8,508 + 0 = 0$).
* **Table XIV:** Top 10% Firm A occupancy is $443 / 462 = 95.89\%$ (reported $95.9\%$), against a baseline of $1,287 / 4,629 = 27.80\%$ (reported $27.8\%$).
* **Table XVI:** Firm A's intra-report agreement is correctly calculated as $(26,435 + 734 + 4) / 30,222 = 89.91\%$.
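These spot-checks are mechanical enough to script. A minimal sketch that recomputes each reported percentage from the raw counts quoted above (values transcribed from the tables; this is the reviewer's check, not the authors' pipeline):

```python
# (numerator, denominator, decimal places, reported value)
checks = [
    (54370, 60448, 2, 89.95),            # Table IX dual-rule capture
    (40335, 45116, 2, 89.40),            # Table XI calibration fold
    (14035, 15332, 2, 91.54),            # Table XI held-out fold
    (443,   462,   1, 95.9),             # Table XIV top-decile occupancy
    (1287,  4629,  1, 27.8),             # Table XIV baseline
    (26435 + 734 + 4, 30222, 2, 89.91),  # Table XVI intra-report agreement
]
for num, den, places, reported in checks:
    assert round(100 * num / den, places) == reported

# Table XII delta column balances to zero.
assert 2294 + 6095 + 119 - 8508 + 0 == 0
```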

**Minor Narrative Clarification Required:**

In Table III, total extracted signatures are reported as $182,328$, with $168,755$ successfully matched to CPAs. However, Table V and Table XII utilize $N = 168,740$ signatures for the all-pairs best-match analysis. This delta of $15$ signatures is explained by CPAs who possess exactly *one* signature in the entire database, rendering a same-CPA pairwise comparison impossible. While logically sound to anyone analyzing the pipeline closely, this microscopic $15$-signature discrepancy is exactly the kind of arithmetic artifact that distracts meticulous reviewers.
*Recommendation:* Add a one-sentence footnote or parenthetical to Section IV-D explicitly stating this $15$-signature delta is due to single-signature CPAs lacking a pairwise match.
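A toy version of the reconciliation makes the mechanism obvious (the per-CPA counts here are invented for illustration; only the 168,755/168,740/15 figures come from the manuscript):

```python
from collections import Counter

# Hypothetical signatures-per-CPA tallies: only CPAs with two or more
# signatures can contribute a same-CPA pairwise best-match statistic.
sig_counts = Counter({"cpa_a": 3, "cpa_b": 1, "cpa_c": 2, "cpa_d": 1})

analyzable = sum(n for n in sig_counts.values() if n >= 2)
excluded = sum(n for n in sig_counts.values() if n == 1)
assert analyzable == 5 and excluded == 2

# In the manuscript: 168,755 CPA-matched signatures minus the 15 held
# by single-signature CPAs leaves the 168,740 analyzed in Tables V/XII.
assert 168755 - 15 == 168740
```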

---

## 4. Appendix A Validity

The addition of Appendix A successfully and empirically justifies the main-text demotion of the BD/McCrary test.

**Strengths:**

The argument demonstrating that the BD/McCrary transitions drift monotonically with bin width (e.g., Firm A cosine drifting across 0.987 $\rightarrow$ 0.985 $\rightarrow$ 0.980 $\rightarrow$ 0.975) is brilliant. Coupled with the observation that the Z-statistics inflate superlinearly with bin width (from $|Z| \sim 9$ at bin 0.003 to $|Z| \sim 106$ at bin 0.015), the appendix demonstrates convincingly that the test is interacting with the local curvature of a heavily populated continuous distribution rather than identifying a discrete, mechanistic boundary. Table A.I is arithmetically consistent with the script's logic.

**Weaknesses:**

The interpretation paragraph overstates the implications of the accountant-level null finding. It claims that the lack of a transition at the accountant level ($N=686$) is a "robust finding that survives the bin-width sweep." As detailed in Section 6 below, a non-finding surviving a bin-width sweep in a small sample is largely a function of low statistical power, not definitive proof of a smoothly-mixed boundary.

---

## 5. IEEE Access Submission Readiness

The manuscript is in excellent shape for submission to IEEE Access.

* **Scope Fit:** High. The paper sits perfectly at the intersection of applied AI, document forensics, and interdisciplinary data science, which is a core demographic for IEEE Access.
* **Abstract Length:** The abstract is approximately 234 words, comfortably satisfying the stringent $\leq 250$ word limit requirement.
* **Formatting & Structure:** The document adheres to standard IEEE double-column formatting conventions (Roman numeral sections, appropriate table/figure references).
* **Anonymization:** Properly handled. Author placeholders, affiliation blocks, and correspondence emails are appropriately bracketed for single-anonymized peer review.
* **Desk-Return Risks:** Very low. The inclusion of the ablation study (Table XVIII) and explicit baseline comparisons ensures the paper meets the journal's expectations for methodological validation.

---

## 6. Novel Issues and Methodological Blind Spots

While the previous review rounds improved the manuscript significantly, habituation has allowed three specific narrative and statistical blind spots to persist. These are prime targets for reviewer pushback.

### Issue 1: The Accountant-Level BD/McCrary Null is a Power Artifact, not Proof of Smoothness

In Section V-B and Appendix A, the authors claim that because the BD/McCrary test yields no significant transition at the accountant level, this "pattern is consistent with a clustered but smoothly mixed accountant-level distribution." Furthermore, Section V-B states that this non-transition is "itself diagnostic of smoothness rather than a failure of the method."
**The Critique:** The McCrary (2008) test relies on local linear regression smoothing. The variance of the estimator scales inversely with $N \cdot h$ (where $h$ is the bin width). With a sample size of only $N=686$ accountants, the test is severely underpowered and lacks the statistical capacity to reject the null of smoothness unless the discontinuity is an absolute, sheer cliff. Asserting that a failure to reject the null affirmatively *proves* the null is true (smoothness) is a fundamental statistical fallacy (Type II error risk).
*Impact:* Statistically literate reviewers will immediately flag this as an overclaim. The demotion of the test to a diagnostic is correct, but interpreting the null at $N=686$ as definitive proof of smoothness is flawed.
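The power argument can be made concrete with a small simulation (illustrative only: the Beta(8, 2) shape, the 0.95 break point, and the 0.02 bin width are arbitrary choices, not the paper's data). Drawing from one smooth, discontinuity-free density, an adjacent-bin z-statistic that looks decisive at the signature-level sample size collapses toward zero at $N = 686$.

```python
import math
import random

def bd_z(values, threshold, width):
    """Burgstahler-Dichev-style statistic: standardized difference in
    counts between the bins just left and right of a candidate break."""
    left = sum(1 for v in values if threshold - width <= v < threshold)
    right = sum(1 for v in values if threshold <= v < threshold + width)
    return (right - left) / math.sqrt(left + right) if left + right else 0.0

random.seed(42)
# One smooth density, sampled at the two sample sizes in the paper.
accountant_level = [random.betavariate(8, 2) for _ in range(686)]
signature_level = [random.betavariate(8, 2) for _ in range(60000)]

z_small = abs(bd_z(accountant_level, 0.95, 0.02))
z_large = abs(bd_z(signature_level, 0.95, 0.02))

# Identical curvature, very different verdicts: the large sample
# "detects" a transition where none exists, while the small sample's
# null says almost nothing about smoothness.
assert z_large > 3 and z_small < z_large
```

This is the Type II asymmetry the critique names: the null at $N = 686$ is consistent with smoothness but cannot affirmatively establish it.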

### Issue 2: Tautological Presentation of FRR and EER (Table X)

Table X presents a False Rejection Rate (FRR) computed against a "byte-identical" positive anchor. It reports an FRR of $0.000$ for thresholds like 0.95 and 0.973, and subsequently reports an Equal Error Rate (EER) of $\approx 0$ at cosine = 0.990.
**The Critique:** By definition, byte-identical signatures have a cosine similarity asymptotically approaching 1.0 (modulo minor float/cropping artifacts). Evaluating a similarity threshold of 0.95 against inputs that are mathematically defined to score near 1.0 yields a 0% FRR trivially. It is a tautology. While the text in Section V-F attempts to caveat this ("perfect recall against this subset therefore does not generalize"), presenting it as a formal column in Table X with an EER calculation treats it as a standard biometric evaluation. There are no crossing error distributions here to warrant an EER.
*Impact:* This is reviewer-bait. Reviewers from the biometric or forensics domains will argue that an EER of 0 is artificially constructed. The true scientific value of Table X is purely the empirical False Acceptance Rate (FAR) derived from the 50,000 inter-CPA negatives.
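The tautology is mechanical, as a few lines show (toy 4-dimensional vectors standing in for ResNet embeddings; the thresholds are the ones quoted from Table X):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A byte-identical pair embeds to the same feature vector, so every
# positive pair scores 1.0 (up to float error) by construction.
anchor = [0.12, -0.53, 0.88, 0.07]
positive_pairs = [(anchor, list(anchor)) for _ in range(1000)]

for threshold in (0.95, 0.973, 0.990):
    rejections = sum(1 for a, b in positive_pairs if cosine(a, b) < threshold)
    assert rejections == 0  # FRR = 0 by construction, not by merit
```

With the false-rejection distribution pinned at zero for every threshold below 1, the FRR and FAR curves never cross, so no genuine EER exists to report.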

### Issue 3: Document-Level Worst-Case Aggregation Narrative

Section IV-I reports that 35.0% of documents are classified as "High-confidence non-hand-signed" and 43.8% as "Moderate-confidence." This relies on the worst-case rule defined in Section III-L (if one signature on a dual-signed report is stamped, the whole document inherits that label).
**The Critique:** While this "worst-case" aggregation is highly practical for building an operational regulatory auditing tool (flagging the report for review), the narrative in IV-I presents these percentages without reminding the reader that a document might contain a mix of genuine and stamped signatures. Without immediate context, stating that nearly 80% of the market's reports are non-hand-signed invites the ecological fallacy that *both* partners are stamping.
*Impact:* A brief narrative safeguard is missing. Section IV-I must briefly cross-reference the intra-report agreement findings (Table XVI) to remind the reader of the composition of these documents, mitigating the risk that the reader misinterprets the document-level severity.
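A sketch of the worst-case rule clarifies what the percentages do and do not say (the label strings and severity ranking here are invented for illustration; Section III-L defines the paper's actual categories):

```python
# Severity ranking: higher = more replication-consistent.
SEVERITY = {"hand-signed": 0, "moderate-confidence": 1, "high-confidence": 2}

def document_label(signature_labels):
    """Worst-case aggregation: a report inherits the most
    replication-consistent label among its signatures."""
    return max(signature_labels, key=SEVERITY.__getitem__)

# A dual-signed report with one genuine and one stamped signature is
# flagged at the stamped signature's level, which is why document-level
# percentages alone cannot show that both partners are stamping.
assert document_label(["hand-signed", "high-confidence"]) == "high-confidence"
assert document_label(["hand-signed", "hand-signed"]) == "hand-signed"
```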

---

## 7. Final Recommendation and v3.8 Action Items

The manuscript is exceptionally strong but requires a few surgical narrative adjustments to remove reviewer-bait and statistical overclaims. I recommend a **Minor Revision** encompassing the following ranked action items.

### BLOCKER (Must Fix for Submission)

1. **Revise the interpretation of the accountant-level BD/McCrary null.**
* *Action:* In Section V-B, Section VI (Conclusion), and Appendix A, remove any explicit claims that the null affirmatively proves "smoothly mixed" boundaries.
* *Replacement Phrasing:* Reframe this finding to acknowledge statistical power. For example: *"We fail to find evidence of a discontinuity at the accountant level. While this is consistent with smoothly mixed clusters, it also reflects the limited statistical power of the BD/McCrary test at smaller sample sizes ($N=686$), reinforcing its role as a diagnostic rather than a definitive estimator."*

### MAJOR (Highly Recommended to Prevent Desk-Reject/Major Revision)

2. **Reframe Table X to eliminate the tautological FRR/EER presentation.**
* *Action:* Remove the Equal Error Rate (EER) calculation entirely. Add an explicit, prominent table note to Table X stating that FRR is computed against a definitionally extreme subset (byte-identical signatures), making the $0.000$ values an expected mathematical boundary check rather than an empirical discovery of real-world recall. Emphasize that the primary contribution of Table X is the FAR evaluation against the large inter-CPA negative anchor.

### MINOR (Quick Wins for Readability and Precision)

3. **Contextualize the Document-Level Aggregation (Section IV-I).**
* *Action:* When presenting the 35.0% / 43.8% document-level figures in Section IV-I, explicitly remind the reader of the worst-case aggregation rule. Add a single sentence cross-referencing Table XVI's mixed-report rates to ensure the reader understands the internal composition of these flagged documents.
4. **Clarify the 15-Signature Delta (Section IV-D / Table XII).**
* *Action:* Add a one-sentence clarification explaining that the delta between the 168,755 CPA-matched signatures (Table III) and the 168,740 signatures analyzed in the all-pairs distributions (Table V/Table XII) consists of CPAs who have exactly one signature in the corpus, making intra-CPA pairwise comparison impossible. This will preempt arithmetic nitpicking by reviewers.