gbanyan fcce58aff0 Paper A v3.8: resolve Gemini 3.1 Pro round-6 independent-review findings
Gemini round-6 (paper/gemini_review_v3_7.md) gave Minor Revision but
flagged three issues that five rounds of codex review had missed.
This commit addresses all three.

BLOCKER: Accountant-level BD/McCrary null is a power artifact, not
proof of smoothness (Gemini Issue 1)
- At N=686 accountants the BD/McCrary test has limited statistical
  power; interpreting a failure-to-reject as affirmative proof of
  smoothness is a Type II error risk.
- Discussion V-B: "itself diagnostic of smoothness" replaced with
  "failure-to-reject rather than a failure of the method ---
  informative alongside the other evidence but subject to the power
  caveat in Section V-G".
- Discussion V-G (Sixth limitation): added a power-aware paragraph
  naming N=686 explicitly and clarifying that the substantive claim
  of smoothly-mixed clustering rests on the JOINT weight of dip
  test + BIC-selected GMM + BD null, not on BD alone.
- Results IV-D.1 and IV-E: reframe accountant-level null as
  "consistent with --- not affirmative proof of" clustered-but-
  smoothly-mixed, citing V-G for the power caveat.
- Appendix A interpretation paragraph: explicit inferential-asymmetry
  sentence ("consistency is what the BD null delivers, not
  affirmative proof"); "itself evidence for" removed.
- Conclusion: "consistent with clustered but smoothly mixed"
  rephrased with explicit power caveat ("at N = 686 the test has
  limited power and cannot affirmatively establish smoothness").

MAJOR: Table X FRR / EER was tautological reviewer-bait
(Gemini Issue 2)
- Byte-identical positive anchor has cosine approx 1 by construction,
  so FRR against that subset is trivially 0 at every threshold
  below 1 and any EER calculation is arithmetic tautology, not
  biometric performance.
- Results IV-G.1: removed EER row; dropped FRR column from Table X;
  added a table note explaining the omission and directing readers
  to Section V-F for the conservative-subset discussion.
- Methodology III-K: removed the EER / FRR-against-byte-identical
  reporting clause; clarified that FAR against inter-CPA negatives
  is the primary reported quantity.
- Table X is now FAR + Wilson 95% CI only, which is the quantity
  that actually carries empirical content on this anchor design.

MINOR: Document-level worst-case aggregation narrative (Gemini
Issue 3) + 15-signature delta (Gemini spot-check)
- Results IV-I: added two sentences explicitly noting that the
  document-level percentages reflect the Section III-L worst-case
  aggregation rule (a report with one stamped + one hand-signed
  signature inherits the most-replication-consistent label), and
  cross-referencing Section IV-H.3 / Table XVI for the mixed-report
  composition that qualifies the headline percentages.
- Results IV-D: added a one-sentence footnote explaining that the
  15-signature delta between the Table III CPA-matched count
  (168,755) and the all-pairs analyzed count (168,740) is due to
  CPAs with exactly one signature, for whom no same-CPA pairwise
  best-match statistic exists.

Abstract remains 243 words, comfortably under the IEEE Access
250-word cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:47:48 +08:00


Independent Peer Review: Paper A (v3.7)

Target Venue: IEEE Access (Regular Paper)
Date: April 21, 2026
Reviewer: Gemini CLI (6th Round Independent Review)


1. Overall Verdict

Verdict: Minor Revision

Rationale:
The manuscript presents a methodologically rigorous, highly sophisticated, and large-scale empirical analysis of non-hand-signed auditor signatures. Analyzing over 180,000 signatures from 90,282 audit reports is an impressive feat, and the pipeline architecture combining VLM prescreening, YOLO detection, and ResNet-50 feature extraction is fundamentally sound. The utilization of a "replication-dominated" calibration strategy—validated across both intra-firm consistency metrics and held-out cross-validation folds—represents a significant contribution to document forensics where ground-truth labeling is scarce and expensive. Furthermore, the dual-descriptor approach (using cosine similarity for semantic features and dHash for structural features) effectively resolves the ambiguity between stylistic consistency and mechanical image reproduction. The demotion of the Burgstahler-Dichev / McCrary (BD/McCrary) test to a density-smoothness diagnostic, supported by the new Appendix A, is analytically correct.

However, approaching this manuscript with a fresh perspective reveals three distinct methodological blind spots that previous review rounds missed. Specifically, the manuscript overclaims what the accountant-level BD/McCrary null can establish given the test's limited statistical power, presents a mathematically tautological False Rejection Rate (FRR) evaluation that borders on reviewer-bait, and lacks narrative guardrails around its document-level aggregation metrics. Resolving these localized issues will not alter the paper's conclusions but will significantly harden the manuscript against aggressive peer review, making it fully submission-ready for IEEE Access.


2. Scientific Soundness Audit

Three-Level Framework Coherence

The separation of the analysis into signature-level, accountant-level, and auditor-year units is intellectually rigorous and highly defensible. By strictly separating the pixel-level output quality (signature level) from the aggregate behavioral regime (accountant level), the authors successfully avoid the ecological fallacy of assuming that because an individual practitioner acts in a binary fashion (hand-signing vs. stamping), the aggregate distribution of signature pixels must be neatly bimodal. The evidence compellingly demonstrates that the data forms a continuous quality degradation spectrum at the pixel level.

Firm A 'Replication-Dominated' Framing

This is perhaps the strongest conceptual pillar of the paper. Assuming that Firm A acts as a "pure" positive class would inevitably force the thresholding model to interpret the long left tail of the cosine distribution as algorithmic noise or pipeline error. The explicit validation of Firm A as "replication-dominated but not pure"—quantified elegantly by the 139/32 split between high-replication and middle-band clusters in the accountant-level Gaussian Mixture Model (Section IV-E)—logically resolves the 92.5% capture rate without overclaiming. It is a highly defensible stance.

BD/McCrary Demotion

Moving the BD/McCrary test from a co-equal threshold estimator to a "density-smoothness diagnostic" is the correct scientific decision. Appendix A empirically demonstrates that the test behaves exactly as one would expect when applied to a large (N > 60,000), smooth, heavy-tailed distribution: it detects localized non-linearities caused by histogram binning resolution rather than true mechanistic discontinuities. The theoretical tension is resolved by this demotion.

Statistical Choices

The statistical foundations of the paper are appropriate and well-applied:

  • Beta/Logit-Gaussian Mixtures: Fitting Beta mixtures via the EM algorithm is perfectly suited for bounded cosine similarity data [0,1], and the logit-Gaussian cross-check serves as an excellent robustness measure against parametric misspecification.
  • Hartigan Dip Test: The use of the dip test provides a rigorous, non-parametric verification of unimodality/multimodality.
  • Wilson Confidence Intervals: Utilizing Wilson score intervals for the held-out validation metrics (Table XI) correctly models binomial variance, preventing zero-bound confidence interval collapse.
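For concreteness, the Wilson interval is simple enough to state as a short sketch (standard formula; the counts are the held-out-fold values quoted in Section 3, and the function name is ours):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score 95% CI for a binomial proportion; unlike the Wald
    interval it does not collapse to zero width near 0 or 1."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Held-out fold from Table XI: 14,035 of 15,332 signatures captured.
lo, hi = wilson_interval(14_035, 15_332)
print(f"capture 91.54%, 95% Wilson CI [{100*lo:.2f}%, {100*hi:.2f}%]")
```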

3. Numerical Consistency Cross-Check

An exhaustive spot-check of the manuscript's arithmetic, table values, and cited numbers reveals practically flawless internal consistency. The scripts supporting the pipeline operate exactly as claimed.

  • Table VIII: The reported accountant-level threshold band (KDE antimode: 0.973, Beta-2: 0.979, logit-GMM-2: 0.976) matches the narrative text precisely.
  • Table IX: The proportion of Firm A captures under the dual rule (54,370 / 60,448 = 89.945%) correctly rounds to the reported 89.95%.
  • Table XI: The calibration fold's operational dual rule yields 40,335 / 45,116 = 89.402% (reported 89.40%), and the held-out fold yields 14,035 / 15,332 = 91.540% (reported 91.54%).
  • Table XII: The column sums for N = 168,740 match perfectly. Furthermore, the delta column balances precisely to zero (+2,294 + 6,095 + 119 - 8,508 + 0 = 0).
  • Table XIV: Top 10% Firm A occupancy is 443 / 462 = 95.88% (reported 95.9%), against a baseline of 1,287 / 4,629 = 27.80% (reported 27.8%).
  • Table XVI: Firm A's intra-report agreement is correctly calculated as (26,435 + 734 + 4) / 30,222 = 89.91%.
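These figures are trivial to recompute; a minimal sketch re-deriving the ratios quoted above (all inputs are the table values as cited):

```python
checks = {
    "Table IX   dual-rule capture":      (54_370, 60_448),  # reported 89.95%
    "Table XI   calibration fold":       (40_335, 45_116),  # reported 89.40%
    "Table XI   held-out fold":          (14_035, 15_332),  # reported 91.54%
    "Table XIV  top-10% Firm A":         (443, 462),        # reported 95.9%
    "Table XIV  baseline occupancy":     (1_287, 4_629),    # reported 27.8%
    "Table XVI  intra-report agreement": (26_435 + 734 + 4, 30_222),  # 89.91%
}
for label, (num, den) in checks.items():
    print(f"{label}: {num}/{den} = {100 * num / den:.3f}%")

# Table XII delta column balances to zero exactly:
assert 2_294 + 6_095 + 119 - 8_508 + 0 == 0
```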

Minor Narrative Clarification Required: In Table III, total extracted signatures are reported as 182,328, with 168,755 successfully matched to CPAs. However, Table V and Table XII utilize N = 168,740 signatures for the all-pairs best-match analysis. This delta of 15 signatures is mathematically implied by CPAs who possess exactly one signature in the entire database, rendering a "same-CPA pairwise comparison" impossible. While logically sound to anyone analyzing the pipeline closely, this microscopic 15-signature discrepancy is exactly the kind of arithmetic artifact that distracts meticulous reviewers. Recommendation: Add a one-sentence footnote or parenthetical to Section IV-D explicitly stating this 15-signature delta is due to single-signature CPAs lacking a pairwise match.
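The exclusion itself is mechanical; a minimal sketch of how single-signature CPAs drop out of any same-CPA pairwise analysis (the frame and the cpa_id column are hypothetical illustrations, not the paper's actual schema):

```python
import pandas as pd

# Hypothetical: one row per extracted, CPA-matched signature.
sigs = pd.DataFrame({"cpa_id": ["A", "A", "B", "C", "C", "C", "D"]})

counts = sigs["cpa_id"].value_counts()
pairwise_eligible = sigs[sigs["cpa_id"].map(counts) >= 2]

# CPAs B and D each contribute a single signature, so no same-CPA pair
# exists for them; their rows are dropped from the pairwise analysis,
# mirroring the 168,755 -> 168,740 delta in the manuscript.
print(len(sigs) - len(pairwise_eligible))  # 2 signatures excluded here
```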


4. Appendix A Validity

The addition of Appendix A successfully and empirically justifies the main-text demotion of the BD/McCrary test.

Strengths: The argument demonstrating that the BD/McCrary transitions drift monotonically with bin width (e.g., Firm A cosine drifting across 0.987 → 0.985 → 0.980 → 0.975) is brilliant. Coupled with the observation that the Z-statistics inflate superlinearly with bin width (from |Z| ≈ 9 at bin 0.003 to |Z| ≈ 106 at bin 0.015), the appendix irrefutably proves that the test is interacting with the local curvature of a heavily-populated continuous distribution rather than identifying a discrete, mechanistic boundary. Table A.I is arithmetically consistent with the script's logic.
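To make the mechanism concrete, here is a minimal sketch of a bin-width sweep in the spirit of Appendix A, using a Burgstahler-Dichev-style standardized bin difference on synthetic smooth data (the Beta draw and the exact statistic form are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Smooth, long-left-tailed stand-in for the Firm A cosine distribution.
cosine = rng.beta(a=20.0, b=1.2, size=60_000)

def bd_max_z(x, bin_width):
    """Largest |z| of a BD-style statistic: each interior bin's count vs.
    the mean of its two neighbours, under a multinomial variance
    approximation."""
    edges = np.arange(0.0, 1.0 + bin_width, bin_width)
    n, _ = np.histogram(x, bins=edges)
    N = len(x)
    p = n / N
    best_z, best_loc = 0.0, float("nan")
    for i in range(1, len(n) - 1):
        var = (N * p[i] * (1 - p[i])
               + 0.25 * N * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        if var <= 0:
            continue
        z = (n[i] - (n[i - 1] + n[i + 1]) / 2) / np.sqrt(var)
        if abs(z) > abs(best_z):
            best_z, best_loc = z, edges[i]
    return best_z, best_loc

for w in (0.003, 0.005, 0.010, 0.015):
    z, loc = bd_max_z(cosine, w)
    print(f"bin width {w:.3f}: max |z| = {abs(z):6.1f} near {loc:.3f}")
# On smooth data the flagged location drifts and |z| inflates as the bin
# width grows, mirroring the curvature-not-discontinuity pattern above.
```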

Weaknesses: The interpretation paragraph overstates the implications of the accountant-level null finding. It claims that the lack of a transition at the accountant level (N=686) is a "robust finding that survives the bin-width sweep." As detailed in Section 6 below, a non-finding surviving a bin-width sweep in a small sample is largely a function of low statistical power, not definitive proof of a smoothly-mixed boundary.


5. IEEE Access Submission Readiness

The manuscript is in excellent shape for submission to IEEE Access.

  • Scope Fit: High. The paper sits perfectly at the intersection of applied AI, document forensics, and interdisciplinary data science, which is a core demographic for IEEE Access.
  • Abstract Length: The abstract is approximately 234 words, comfortably within the journal's ≤ 250-word limit.
  • Formatting & Structure: The document adheres to standard IEEE double-column formatting conventions (Roman numeral sections, appropriate table/figure references).
  • Anonymization: Properly handled. Author placeholders, affiliation blocks, and correspondence emails are appropriately bracketed for single-anonymized peer review.
  • Desk-Return Risks: Very low. The inclusion of the ablation study (Table XVIII) and explicit baseline comparisons ensures the paper meets the journal's expectations for methodological validation.

6. Novel Issues and Methodological Blind Spots

While the previous review rounds improved the manuscript significantly, habituation has allowed three specific narrative and statistical blind spots to persist. These are prime targets for reviewer pushback.

Issue 1: The Accountant-Level BD/McCrary Null is a Power Artifact, not Proof of Smoothness

In Section V-B and Appendix A, the authors claim that because the BD/McCrary test yields no significant transition at the accountant level, this "pattern is consistent with a clustered but smoothly mixed accountant-level distribution." Furthermore, Section V-B states that this non-transition is "itself diagnostic of smoothness rather than a failure of the method."

The Critique: The McCrary (2008) test relies on local linear regression smoothing. The variance of the estimator scales inversely with N·h (where h is the bin width). With a sample size of only N=686 accountants, the test is severely underpowered and lacks the statistical capacity to reject the null of smoothness unless the discontinuity is an extremely sharp cliff. Asserting that a failure to reject the null affirmatively proves the null is true (smoothness) is a fundamental statistical fallacy (Type II error risk).

Impact: Statistically literate reviewers will immediately flag this as an overclaim. The demotion of the test to a diagnostic is correct, but interpreting the null at N=686 as definitive proof of smoothness is flawed.
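The power problem is easy to see with a back-of-envelope calculation. A minimal sketch under stated assumptions (a crude two-adjacent-bin comparison at a candidate cutpoint, local density f, bin width h, two-sided 5% test at roughly 80% power; none of this is the paper's or McCrary's exact estimator):

```python
import math

def min_detectable_jump(N, h, f, z_alpha=1.96, z_power=0.84):
    """Rough minimum detectable density jump (as a fraction of the local
    density) when comparing counts in two adjacent bins of width h,
    using a normal approximation to the count difference."""
    expected = N * f * h                    # expected count per bin
    se = math.sqrt(2 * expected)            # SE of the count difference
    return (z_alpha + z_power) * se / expected

for N in (686, 60_000):
    print(N, round(min_detectable_jump(N, h=0.01, f=2.0), 2))
# Prints roughly 1.07 for N=686 (only a jump larger than the local density
# itself is detectable) versus roughly 0.11 for N=60,000.
```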

Issue 2: Tautological Presentation of FRR and EER (Table X)

Table X presents a False Rejection Rate (FRR) computed against a "byte-identical" positive anchor. It reports an FRR of 0.000 for thresholds like 0.95 and 0.973, and subsequently reports an Equal Error Rate (EER) of ≈ 0 at cosine = 0.990.

The Critique: By construction, byte-identical signatures have a cosine similarity of essentially 1.0 (up to minor float/cropping artifacts). Evaluating a similarity threshold of 0.95 against inputs that are mathematically guaranteed to score near 1.0 yields a 0% FRR trivially. It is a tautology. While the text in Section V-F attempts to caveat this ("perfect recall against this subset therefore does not generalize"), presenting it as a formal column in Table X with an EER calculation treats it as a standard biometric evaluation. There are no crossing error distributions here to warrant an EER.

Impact: This is reviewer-bait. Reviewers from the biometric or forensics domains will argue that an EER of 0 is artificially constructed. The true scientific value of Table X is purely the empirical False Acceptance Rate (FAR) derived from the 50,000 inter-CPA negatives.
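A minimal sketch of why the byte-identical FRR is a boundary check rather than an empirical result (all cosine values below are invented placeholders; the thresholds mirror those quoted above):

```python
import numpy as np

# Byte-identical positives sit at cosine ~1.0 by construction (up to tiny
# float/cropping noise), so any threshold meaningfully below 1 rejects none.
byte_identical_pos = np.array([0.9999, 1.0000, 0.9998, 1.0000])
# Inter-CPA negatives span a genuinely informative range.
inter_cpa_neg = np.array([0.41, 0.73, 0.88, 0.95, 0.97, 0.99])

for threshold in (0.95, 0.973, 0.990):
    frr = np.mean(byte_identical_pos < threshold)  # trivially 0.0
    far = np.mean(inter_cpa_neg >= threshold)      # the informative quantity
    print(f"t = {threshold}: FRR = {frr:.3f}, FAR = {far:.3f}")
```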

Issue 3: Document-Level Worst-Case Aggregation Narrative

Section IV-I reports that 35.0% of documents are classified as "High-confidence non-hand-signed" and 43.8% as "Moderate-confidence." This relies on the worst-case rule defined in Section III-L (if one signature on a dual-signed report is stamped, the whole document inherits that label).

The Critique: While this "worst-case" aggregation is highly practical for building an operational regulatory auditing tool (flagging the report for review), the narrative in IV-I presents these percentages without reminding the reader that a document might contain a mix of genuine and stamped signatures. Without immediate context, stating that nearly 80% of the market's reports are non-hand-signed invites the ecological fallacy that both partners are stamping.

Impact: A brief narrative safeguard is missing. Section IV-I must briefly cross-reference the intra-report agreement findings (Table XVI) to remind the reader of the composition of these documents, mitigating the risk that the reader misinterprets the document-level severity.
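A minimal sketch of the worst-case aggregation logic at issue (label names and the severity ordering are illustrative stand-ins, not the paper's exact taxonomy):

```python
# Higher severity = more replication-consistent, i.e. "worse" for the report.
SEVERITY = {"hand_signed": 0, "indeterminate": 1, "non_hand_signed": 2}

def document_label(signature_labels):
    """Worst-case rule: a report inherits its most replication-consistent
    signature label, so a single stamped signature flags the whole document."""
    return max(signature_labels, key=SEVERITY.__getitem__)

# Dual-signed report with one genuine and one stamped signature:
print(document_label(["hand_signed", "non_hand_signed"]))  # non_hand_signed
```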


7. Final Recommendation and v3.8 Action Items

The manuscript is exceptionally strong but requires a few surgical narrative adjustments to remove reviewer-bait and statistical overclaims. I recommend a Minor Revision encompassing the following ranked action items.

BLOCKER (Must Fix for Submission)

  1. Revise the interpretation of the accountant-level BD/McCrary null.
    • Action: In Section V-B, Section VI (Conclusion), and Appendix A, remove any explicit claims that the null affirmatively proves "smoothly mixed" boundaries.
    • Replacement Phrasing: Reframe this finding to acknowledge statistical power. For example: "We fail to find evidence of a discontinuity at the accountant level. While this is consistent with smoothly mixed clusters, it also reflects the limited statistical power of the BD/McCrary test at smaller sample sizes (N=686), reinforcing its role as a diagnostic rather than a definitive estimator."
  2. Reframe Table X to eliminate the tautological FRR/EER presentation.
    • Action: Remove the Equal Error Rate (EER) calculation entirely. Add an explicit, prominent table note to Table X stating that FRR is computed against a definitionally extreme subset (byte-identical signatures), making the 0.000 values an expected mathematical boundary check rather than an empirical discovery of real-world recall. Emphasize that the primary contribution of Table X is the FAR evaluation against the large inter-CPA negative anchor.

MINOR (Quick Wins for Readability and Precision)

  1. Contextualize the Document-Level Aggregation (Section IV-I).
    • Action: When presenting the 35.0% / 43.8% document-level figures in Section IV-I, explicitly remind the reader of the worst-case aggregation rule. Add a single sentence cross-referencing Table XVI's mixed-report rates to ensure the reader understands the internal composition of these flagged documents.
  2. Clarify the 15-Signature Delta (Section IV-D / Table XII).
    • Action: Add a one-sentence clarification explaining that the delta between the 168,755 CPA-matched signatures (Table III) and the 168,740 signatures analyzed in the all-pairs distributions (Table V/Table XII) consists of CPAs who have exactly one signature in the corpus, making intra-CPA pairwise comparison impossible. This will preempt arithmetic nitpicking by reviewers.