fcce58aff0
Gemini round-6 (paper/gemini_review_v3_7.md) gave Minor Revision but
flagged three issues that five rounds of codex review had missed.
This commit addresses all three.
BLOCKER: Accountant-level BD/McCrary null is a power artifact, not
proof of smoothness (Gemini Issue 1)
- At N=686 accountants the BD/McCrary test has limited statistical
power; interpreting a failure-to-reject as affirmative proof of
smoothness is a Type II error risk.
- Discussion V-B: "itself diagnostic of smoothness" replaced with
"failure-to-reject rather than a failure of the method ---
informative alongside the other evidence but subject to the power
caveat in Section V-G".
- Discussion V-G (Sixth limitation): added a power-aware paragraph
naming N=686 explicitly and clarifying that the substantive claim
of smoothly-mixed clustering rests on the JOINT weight of dip
test + BIC-selected GMM + BD null, not on BD alone.
- Results IV-D.1 and IV-E: reframe accountant-level null as
"consistent with --- not affirmative proof of" clustered-but-
smoothly-mixed, citing V-G for the power caveat.
- Appendix A interpretation paragraph: explicit inferential-asymmetry
sentence ("consistency is what the BD null delivers, not
affirmative proof"); "itself evidence for" removed.
- Conclusion: "consistent with clustered but smoothly mixed"
rephrased with explicit power caveat ("at N = 686 the test has
limited power and cannot affirmatively establish smoothness").
MAJOR: Table X FRR / EER was tautological reviewer-bait
(Gemini Issue 2)
- Byte-identical positive anchor has cosine approx 1 by construction,
so FRR against that subset is trivially 0 at every threshold
below 1 and any EER calculation is arithmetic tautology, not
biometric performance.
- Results IV-G.1: removed EER row; dropped FRR column from Table X;
added a table note explaining the omission and directing readers
to Section V-F for the conservative-subset discussion.
- Methodology III-K: removed the EER / FRR-against-byte-identical
reporting clause; clarified that FAR against inter-CPA negatives
is the primary reported quantity.
- Table X is now FAR + Wilson 95% CI only, which is the quantity
that actually carries empirical content on this anchor design.
MINOR: Document-level worst-case aggregation narrative (Gemini
Issue 3) + 15-signature delta (Gemini spot-check)
- Results IV-I: added two sentences explicitly noting that the
document-level percentages reflect the Section III-L worst-case
aggregation rule (a report with one stamped + one hand-signed
signature inherits the most-replication-consistent label), and
cross-referencing Section IV-H.3 / Table XVI for the mixed-report
composition that qualifies the headline percentages.
- Results IV-D: added a one-sentence footnote explaining that the
15-signature delta between the Table III CPA-matched count
(168,755) and the all-pairs analyzed count (168,740) is due to
CPAs with exactly one signature, for whom no same-CPA pairwise
best-match statistic exists.
Abstract remains 243 words, comfortably under the IEEE Access
250-word cap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Independent Peer Review: Paper A (v3.7)

**Target Venue:** IEEE Access (Regular Paper)

**Date:** April 21, 2026

**Reviewer:** Gemini CLI (6th Round Independent Review)

---

## 1. Overall Verdict

**Verdict: Minor Revision**

**Rationale:**
The manuscript presents a methodologically rigorous, highly sophisticated, and large-scale empirical analysis of non-hand-signed auditor signatures. Analyzing over 180,000 signatures from 90,282 audit reports is an impressive feat, and the pipeline architecture combining VLM prescreening, YOLO detection, and ResNet-50 feature extraction is fundamentally sound. The utilization of a "replication-dominated" calibration strategy—validated across both intra-firm consistency metrics and held-out cross-validation folds—represents a significant contribution to document forensics where ground-truth labeling is scarce and expensive. Furthermore, the dual-descriptor approach (using cosine similarity for semantic features and dHash for structural features) effectively resolves the ambiguity between stylistic consistency and mechanical image reproduction. The demotion of the Burgstahler-Dichev / McCrary (BD/McCrary) test to a density-smoothness diagnostic, supported by the new Appendix A, is analytically correct.
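For readers unfamiliar with the structural half of the dual-descriptor pairing, a minimal difference-hash (dHash) sketch follows. It is illustrative only, not the paper's code: it assumes the signature crop has already been resized to a 9x8 grayscale grid (real pipelines do the resize with PIL or OpenCV), and the grid values are invented.

```python
def dhash(pixels, hash_size=8):
    """Difference hash: each bit records whether a pixel is brighter
    than its right-hand neighbour on a (hash_size+1)-wide grid."""
    assert all(len(row) == hash_size + 1 for row in pixels)
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits  # hash_size * hash_size bits packed into one int

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

# Byte-identical crops hash identically; a small pixel edit flips few bits.
grid = [[(3 * r + 2 * c) % 11 for c in range(9)] for r in range(8)]
tweaked = [row[:] for row in grid]
tweaked[0][0] += 100  # brighten one corner pixel

assert hamming(dhash(grid), dhash(grid)) == 0
assert 0 < hamming(dhash(grid), dhash(tweaked)) <= 2
```

Because dHash keys on local brightness structure rather than learned semantics, it separates mechanical image reproduction from stylistic consistency, which is exactly the ambiguity the dual-descriptor design resolves.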
However, approaching this manuscript with a fresh perspective reveals three distinct methodological blind spots that previous review rounds missed. Specifically, the manuscript commits a statistical overclaim regarding the statistical power of the BD/McCrary test at the accountant level, it presents a mathematically tautological False Rejection Rate (FRR) evaluation that borders on reviewer-bait, and it lacks narrative guardrails around its document-level aggregation metrics. Resolving these localized issues will not alter the paper's conclusions but will significantly harden the manuscript against aggressive peer review, making it fully submission-ready for IEEE Access.

---

## 2. Scientific Soundness Audit

### Three-Level Framework Coherence

The separation of the analysis into signature-level, accountant-level, and auditor-year units is intellectually rigorous and highly defensible. By strictly separating the *pixel-level output quality* (signature level) from the *aggregate behavioral regime* (accountant level), the authors successfully avoid the ecological fallacy of assuming that because an individual practitioner acts in a binary fashion (hand-signing vs. stamping), the aggregate distribution of signature pixels must be neatly bimodal. The evidence compellingly demonstrates that the data forms a continuous quality degradation spectrum at the pixel level.

### Firm A 'Replication-Dominated' Framing

This is perhaps the strongest conceptual pillar of the paper. Assuming that Firm A acts as a "pure" positive class would inevitably force the thresholding model to interpret the long left tail of the cosine distribution as algorithmic noise or pipeline error. The explicit validation of Firm A as "replication-dominated but not pure"—quantified elegantly by the 139/32 split between high-replication and middle-band clusters in the accountant-level Gaussian Mixture Model (Section IV-E)—logically resolves the 92.5% capture rate without overclaiming. It is a highly defensible stance.

### BD/McCrary Demotion

Moving the BD/McCrary test from a co-equal threshold estimator to a "density-smoothness diagnostic" is the correct scientific decision. Appendix A empirically demonstrates that the test behaves exactly as one would expect when applied to a large ($N > 60,000$), smooth, heavy-tailed distribution: it detects localized non-linearities caused by histogram binning resolution rather than true mechanistic discontinuities. The theoretical tension is resolved by this demotion.

### Statistical Choices

The statistical foundations of the paper are appropriate and well-applied:
* **Beta/Logit-Gaussian Mixtures:** Fitting Beta mixtures via the EM algorithm is perfectly suited for bounded cosine similarity data $[0,1]$, and the logit-Gaussian cross-check serves as an excellent robustness measure against parametric misspecification.
* **Hartigan Dip Test:** The use of the dip test provides a rigorous, non-parametric verification of unimodality/multimodality.
* **Wilson Confidence Intervals:** Utilizing Wilson score intervals for the held-out validation metrics (Table XI) correctly models binomial variance, preventing zero-bound confidence interval collapse.
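As a concrete sketch of the last point (pure Python, not the paper's code), the Wilson score interval keeps a strictly positive upper bound even when zero events are observed, which is why it avoids the zero-bound collapse noted above. The 14,035/15,332 figure is the held-out capture count from Table XI; the 50,000 negatives follow Section IV-G.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Held-out fold from Table XI: 14,035 / 15,332 captures.
lo, hi = wilson_interval(14035, 15332)
assert lo < 14035 / 15332 < hi

# Zero observed false accepts out of 50,000 negatives: the upper
# bound stays strictly positive instead of collapsing to zero.
lo0, hi0 = wilson_interval(0, 50000)
assert abs(lo0) < 1e-9 and 0 < hi0 < 1e-4
```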

---

## 3. Numerical Consistency Cross-Check

An exhaustive spot-check of the manuscript’s arithmetic, table values, and cited numbers reveals a practically flawless internal consistency. The scripts supporting the pipeline operate exactly as claimed.

* **Table VIII:** The reported accountant-level threshold band (KDE antimode: 0.973, Beta-2: 0.979, logit-GMM-2: 0.976) matches the narrative text precisely.
* **Table IX:** The proportion of Firm A captures under the dual rule ($54,370 / 60,448 = 89.945\%$) correctly rounds to the reported $89.95\%$.
* **Table XI:** The calibration fold's operational dual rule yields $40,335 / 45,116 = 89.402\%$ (reported $89.40\%$), and the held-out fold yields $14,035 / 15,332 = 91.540\%$ (reported $91.54\%$).
* **Table XII:** The column sums for $N = 168,740$ match perfectly. Furthermore, the delta column balances precisely to zero ($+2,294 + 6,095 + 119 - 8,508 + 0 = 0$).
* **Table XIV:** Top 10% Firm A occupancy is $443 / 462 = 95.89\%$ (reported $95.9\%$), against a baseline of $1,287 / 4,629 = 27.80\%$ (reported $27.8\%$).
* **Table XVI:** Firm A's intra-report agreement is correctly calculated as $(26,435 + 734 + 4) / 30,222 = 89.91\%$.
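These spot-checks are mechanical enough to script. A minimal sketch that recomputes each reported percentage from the raw counts quoted above (values transcribed from the tables; this is the reviewer's check, not the authors' pipeline):

```python
# (numerator, denominator, decimal places, reported value)
checks = [
    (54370, 60448, 2, 89.95),            # Table IX dual-rule capture
    (40335, 45116, 2, 89.40),            # Table XI calibration fold
    (14035, 15332, 2, 91.54),            # Table XI held-out fold
    (443,   462,   1, 95.9),             # Table XIV top-decile occupancy
    (1287,  4629,  1, 27.8),             # Table XIV baseline
    (26435 + 734 + 4, 30222, 2, 89.91),  # Table XVI intra-report agreement
]
for num, den, places, reported in checks:
    assert round(100 * num / den, places) == reported

# Table XII delta column balances to zero.
assert 2294 + 6095 + 119 - 8508 + 0 == 0
```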

**Minor Narrative Clarification Required:**

In Table III, total extracted signatures are reported as $182,328$, with $168,755$ successfully matched to CPAs. However, Table V and Table XII utilize $N = 168,740$ signatures for the all-pairs best-match analysis. This delta of $15$ signatures is explained by CPAs who possess exactly *one* signature in the entire database, rendering a same-CPA pairwise comparison impossible. While logically sound to anyone analyzing the pipeline closely, this microscopic $15$-signature discrepancy is exactly the kind of arithmetic artifact that distracts meticulous reviewers.
*Recommendation:* Add a one-sentence footnote or parenthetical to Section IV-D explicitly stating this $15$-signature delta is due to single-signature CPAs lacking a pairwise match.
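A toy version of the reconciliation makes the mechanism obvious (the per-CPA counts here are invented for illustration; only the 168,755/168,740/15 figures come from the manuscript):

```python
from collections import Counter

# Hypothetical signatures-per-CPA tallies: only CPAs with two or more
# signatures can contribute a same-CPA pairwise best-match statistic.
sig_counts = Counter({"cpa_a": 3, "cpa_b": 1, "cpa_c": 2, "cpa_d": 1})

analyzable = sum(n for n in sig_counts.values() if n >= 2)
excluded = sum(n for n in sig_counts.values() if n == 1)
assert analyzable == 5 and excluded == 2

# In the manuscript: 168,755 CPA-matched signatures minus the 15 held
# by single-signature CPAs leaves the 168,740 analyzed in Tables V/XII.
assert 168755 - 15 == 168740
```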

---

## 4. Appendix A Validity

The addition of Appendix A successfully and empirically justifies the main-text demotion of the BD/McCrary test.

**Strengths:**

The argument demonstrating that the BD/McCrary transitions drift monotonically with bin width (e.g., Firm A cosine drifting across 0.987 $\rightarrow$ 0.985 $\rightarrow$ 0.980 $\rightarrow$ 0.975) is brilliant. Coupled with the observation that the Z-statistics inflate superlinearly with bin width (from $|Z| \sim 9$ at bin 0.003 to $|Z| \sim 106$ at bin 0.015), the appendix demonstrates convincingly that the test is interacting with the local curvature of a heavily populated continuous distribution rather than identifying a discrete, mechanistic boundary. Table A.I is arithmetically consistent with the script's logic.

**Weaknesses:**

The interpretation paragraph overstates the implications of the accountant-level null finding. It claims that the lack of a transition at the accountant level ($N=686$) is a "robust finding that survives the bin-width sweep." As detailed in Section 6 below, a non-finding surviving a bin-width sweep in a small sample is largely a function of low statistical power, not definitive proof of a smoothly-mixed boundary.

---

## 5. IEEE Access Submission Readiness

The manuscript is in excellent shape for submission to IEEE Access.

* **Scope Fit:** High. The paper sits perfectly at the intersection of applied AI, document forensics, and interdisciplinary data science, which is a core demographic for IEEE Access.
* **Abstract Length:** The abstract is approximately 234 words, comfortably satisfying the stringent $\leq 250$ word limit requirement.
* **Formatting & Structure:** The document adheres to standard IEEE double-column formatting conventions (Roman numeral sections, appropriate table/figure references).
* **Anonymization:** Properly handled. Author placeholders, affiliation blocks, and correspondence emails are appropriately bracketed for single-anonymized peer review.
* **Desk-Return Risks:** Very low. The inclusion of the ablation study (Table XVIII) and explicit baseline comparisons ensures the paper meets the journal's expectations for methodological validation.

---

## 6. Novel Issues and Methodological Blind Spots

While the previous review rounds improved the manuscript significantly, habituation has allowed three specific narrative and statistical blind spots to persist. These are prime targets for reviewer pushback.

### Issue 1: The Accountant-Level BD/McCrary Null is a Power Artifact, not Proof of Smoothness

In Section V-B and Appendix A, the authors claim that because the BD/McCrary test yields no significant transition at the accountant level, this "pattern is consistent with a clustered but smoothly mixed accountant-level distribution." Furthermore, Section V-B states that this non-transition is "itself diagnostic of smoothness rather than a failure of the method."
**The Critique:** The McCrary (2008) test relies on local linear regression smoothing. The variance of the estimator scales inversely with $N \cdot h$ (where $h$ is the bin width). With a sample size of only $N=686$ accountants, the test is severely underpowered and lacks the statistical capacity to reject the null of smoothness unless the discontinuity is an absolute, sheer cliff. Asserting that a failure to reject the null affirmatively *proves* the null is true (smoothness) is a fundamental statistical fallacy (Type II error risk).
*Impact:* Statistically literate reviewers will immediately flag this as an overclaim. The demotion of the test to a diagnostic is correct, but interpreting the null at $N=686$ as definitive proof of smoothness is flawed.
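The power argument can be made concrete with a small simulation (illustrative only: the Beta(8, 2) shape, the 0.95 break point, and the 0.02 bin width are arbitrary choices, not the paper's data). Drawing from one smooth, discontinuity-free density, an adjacent-bin z-statistic that looks decisive at the signature-level sample size collapses toward zero at $N = 686$.

```python
import math
import random

def bd_z(values, threshold, width):
    """Burgstahler-Dichev-style statistic: standardized difference in
    counts between the bins just left and right of a candidate break."""
    left = sum(1 for v in values if threshold - width <= v < threshold)
    right = sum(1 for v in values if threshold <= v < threshold + width)
    return (right - left) / math.sqrt(left + right) if left + right else 0.0

random.seed(42)
# One smooth density, sampled at the two sample sizes in the paper.
accountant_level = [random.betavariate(8, 2) for _ in range(686)]
signature_level = [random.betavariate(8, 2) for _ in range(60000)]

z_small = abs(bd_z(accountant_level, 0.95, 0.02))
z_large = abs(bd_z(signature_level, 0.95, 0.02))

# Identical curvature, very different verdicts: the large sample
# "detects" a transition where none exists, while the small sample's
# null says almost nothing about smoothness.
assert z_large > 3 and z_small < z_large
```

This is the Type II asymmetry the critique names: the null at $N = 686$ is consistent with smoothness but cannot affirmatively establish it.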

### Issue 2: Tautological Presentation of FRR and EER (Table X)

Table X presents a False Rejection Rate (FRR) computed against a "byte-identical" positive anchor. It reports an FRR of $0.000$ for thresholds like 0.95 and 0.973, and subsequently reports an Equal Error Rate (EER) of $\approx 0$ at cosine = 0.990.
**The Critique:** By definition, byte-identical signatures have a cosine similarity asymptotically approaching 1.0 (modulo minor float/cropping artifacts). Evaluating a similarity threshold of 0.95 against inputs that are mathematically defined to score near 1.0 yields a 0% FRR trivially. It is a tautology. While the text in Section V-F attempts to caveat this ("perfect recall against this subset therefore does not generalize"), presenting it as a formal column in Table X with an EER calculation treats it as a standard biometric evaluation. There are no crossing error distributions here to warrant an EER.
*Impact:* This is reviewer-bait. Reviewers from the biometric or forensics domains will argue that an EER of 0 is artificially constructed. The true scientific value of Table X is purely the empirical False Acceptance Rate (FAR) derived from the 50,000 inter-CPA negatives.
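The tautology is mechanical, as a few lines show (toy 4-dimensional vectors standing in for ResNet embeddings; the thresholds are the ones quoted from Table X):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A byte-identical pair embeds to the same feature vector, so every
# positive pair scores 1.0 (up to float error) by construction.
anchor = [0.12, -0.53, 0.88, 0.07]
positive_pairs = [(anchor, list(anchor)) for _ in range(1000)]

for threshold in (0.95, 0.973, 0.990):
    rejections = sum(1 for a, b in positive_pairs if cosine(a, b) < threshold)
    assert rejections == 0  # FRR = 0 by construction, not by merit
```

With the false-rejection distribution pinned at zero for every threshold below 1, the FRR and FAR curves never cross, so no genuine EER exists to report.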

### Issue 3: Document-Level Worst-Case Aggregation Narrative

Section IV-I reports that 35.0% of documents are classified as "High-confidence non-hand-signed" and 43.8% as "Moderate-confidence." This relies on the worst-case rule defined in Section III-L (if one signature on a dual-signed report is stamped, the whole document inherits that label).
**The Critique:** While this "worst-case" aggregation is highly practical for building an operational regulatory auditing tool (flagging the report for review), the narrative in IV-I presents these percentages without reminding the reader that a document might contain a mix of genuine and stamped signatures. Without immediate context, stating that nearly 80% of the market's reports are non-hand-signed invites the ecological fallacy that *both* partners are stamping.
*Impact:* A brief narrative safeguard is missing. Section IV-I must briefly cross-reference the intra-report agreement findings (Table XVI) to remind the reader of the composition of these documents, mitigating the risk that the reader misinterprets the document-level severity.
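A sketch of the worst-case rule clarifies what the percentages do and do not say (the label strings and severity ranking here are invented for illustration; Section III-L defines the paper's actual categories):

```python
# Severity ranking: higher = more replication-consistent.
SEVERITY = {"hand-signed": 0, "moderate-confidence": 1, "high-confidence": 2}

def document_label(signature_labels):
    """Worst-case aggregation: a report inherits the most
    replication-consistent label among its signatures."""
    return max(signature_labels, key=SEVERITY.__getitem__)

# A dual-signed report with one genuine and one stamped signature is
# flagged at the stamped signature's level, which is why document-level
# percentages alone cannot show that both partners are stamping.
assert document_label(["hand-signed", "high-confidence"]) == "high-confidence"
assert document_label(["hand-signed", "hand-signed"]) == "hand-signed"
```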

---

## 7. Final Recommendation and v3.8 Action Items

The manuscript is exceptionally strong but requires a few surgical narrative adjustments to remove reviewer-bait and statistical overclaims. I recommend a **Minor Revision** encompassing the following ranked action items.

### BLOCKER (Must Fix for Submission)

1. **Revise the interpretation of the accountant-level BD/McCrary null.**
* *Action:* In Section V-B, Section VI (Conclusion), and Appendix A, remove any explicit claims that the null affirmatively proves "smoothly mixed" boundaries.
* *Replacement Phrasing:* Reframe this finding to acknowledge statistical power. For example: *"We fail to find evidence of a discontinuity at the accountant level. While this is consistent with smoothly mixed clusters, it also reflects the limited statistical power of the BD/McCrary test at smaller sample sizes ($N=686$), reinforcing its role as a diagnostic rather than a definitive estimator."*

### MAJOR (Highly Recommended to Prevent Desk-Reject/Major Revision)

2. **Reframe Table X to eliminate the tautological FRR/EER presentation.**
* *Action:* Remove the Equal Error Rate (EER) calculation entirely. Add an explicit, prominent table note to Table X stating that FRR is computed against a definitionally extreme subset (byte-identical signatures), making the $0.000$ values an expected mathematical boundary check rather than an empirical discovery of real-world recall. Emphasize that the primary contribution of Table X is the FAR evaluation against the large inter-CPA negative anchor.

### MINOR (Quick Wins for Readability and Precision)

3. **Contextualize the Document-Level Aggregation (Section IV-I).**
* *Action:* When presenting the 35.0% / 43.8% document-level figures in Section IV-I, explicitly remind the reader of the worst-case aggregation rule. Add a single sentence cross-referencing Table XVI's mixed-report rates to ensure the reader understands the internal composition of these flagged documents.
4. **Clarify the 15-Signature Delta (Section IV-D / Table XII).**
* *Action:* Add a one-sentence clarification explaining that the delta between the 168,755 CPA-matched signatures (Table III) and the 168,740 signatures analyzed in the all-pairs distributions (Table V/Table XII) consists of CPAs who have exactly one signature in the corpus, making intra-CPA pairwise comparison impossible. This will preempt arithmetic nitpicking by reviewers.