Paper A v3.8: resolve Gemini 3.1 Pro round-6 independent-review findings

Gemini round-6 (paper/gemini_review_v3_7.md) gave Minor Revision but
flagged three issues that five rounds of codex review had missed.
This commit addresses all three.

BLOCKER: Accountant-level BD/McCrary null is a power artifact, not
proof of smoothness (Gemini Issue 1)
- At N=686 accountants the BD/McCrary test has limited statistical
  power; interpreting a failure-to-reject as affirmative proof of
  smoothness is a Type II error risk.
- Discussion V-B: "itself diagnostic of smoothness" replaced with
  "failure-to-reject rather than a failure of the method ---
  informative alongside the other evidence but subject to the power
  caveat in Section V-G".
- Discussion V-G (Sixth limitation): added a power-aware paragraph
  naming N=686 explicitly and clarifying that the substantive claim
  of smoothly-mixed clustering rests on the JOINT weight of dip
  test + BIC-selected GMM + BD null, not on BD alone.
- Results IV-D.1 and IV-E: reframe accountant-level null as
  "consistent with --- not affirmative proof of" clustered-but-
  smoothly-mixed, citing V-G for the power caveat.
- Appendix A interpretation paragraph: explicit inferential-asymmetry
  sentence ("consistency is what the BD null delivers, not
  affirmative proof"); "itself evidence for" removed.
- Conclusion: "consistent with clustered but smoothly mixed"
  rephrased with explicit power caveat ("at N = 686 the test has
  limited power and cannot affirmatively establish smoothness").

MAJOR: Table X FRR / EER was tautological reviewer-bait
(Gemini Issue 2)
- Byte-identical positive anchor has cosine approx 1 by construction,
  so FRR against that subset is trivially 0 at every threshold
  below 1 and any EER calculation is an arithmetic tautology, not a
  measure of biometric performance.
- Results IV-G.1: removed EER row; dropped FRR column from Table X;
  added a table note explaining the omission and directing readers
  to Section V-F for the conservative-subset discussion.
- Methodology III-K: removed the EER / FRR-against-byte-identical
  reporting clause; clarified that FAR against inter-CPA negatives
  is the primary reported quantity.
- Table X is now FAR + Wilson 95% CI only, which is the quantity
  that actually carries empirical content on this anchor design.

MINOR: Document-level worst-case aggregation narrative (Gemini
Issue 3) + 15-signature delta (Gemini spot-check)
- Results IV-I: added two sentences explicitly noting that the
  document-level percentages reflect the Section III-L worst-case
  aggregation rule (a report with one stamped + one hand-signed
  signature inherits the most-replication-consistent label), and
  cross-referencing Section IV-H.3 / Table XVI for the mixed-report
  composition that qualifies the headline percentages.
- Results IV-D: added a one-sentence footnote explaining that the
  15-signature delta between the Table III CPA-matched count
  (168,755) and the all-pairs analyzed count (168,740) is due to
  CPAs with exactly one signature, for whom no same-CPA pairwise
  best-match statistic exists.

Abstract remains 243 words, comfortably under the IEEE Access
250-word cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:47:48 +08:00
parent 552b6b80d4
commit fcce58aff0
6 changed files with 157 additions and 25 deletions
+120
@@ -0,0 +1,120 @@
# Independent Peer Review: Paper A (v3.7)
**Target Venue:** IEEE Access (Regular Paper)
**Date:** April 21, 2026
**Reviewer:** Gemini CLI (6th Round Independent Review)
---
## 1. Overall Verdict
**Verdict: Minor Revision**
**Rationale:**
The manuscript presents a methodologically rigorous, highly sophisticated, and large-scale empirical analysis of non-hand-signed auditor signatures. Analyzing over 180,000 signatures from 90,282 audit reports is an impressive feat, and the pipeline architecture combining VLM prescreening, YOLO detection, and ResNet-50 feature extraction is fundamentally sound. The utilization of a "replication-dominated" calibration strategy—validated across both intra-firm consistency metrics and held-out cross-validation folds—represents a significant contribution to document forensics where ground-truth labeling is scarce and expensive. Furthermore, the dual-descriptor approach (using cosine similarity for semantic features and dHash for structural features) effectively resolves the ambiguity between stylistic consistency and mechanical image reproduction. The demotion of the Burgstahler-Dichev / McCrary (BD/McCrary) test to a density-smoothness diagnostic, supported by the new Appendix A, is analytically correct.
However, approaching this manuscript with a fresh perspective reveals three distinct methodological blind spots that previous review rounds missed. Specifically, the manuscript commits a statistical overclaim regarding the statistical power of the BD/McCrary test at the accountant level, it presents a mathematically tautological False Rejection Rate (FRR) evaluation that borders on reviewer-bait, and it lacks narrative guardrails around its document-level aggregation metrics. Resolving these localized issues will not alter the paper's conclusions but will significantly harden the manuscript against aggressive peer review, making it fully submission-ready for IEEE Access.
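The dual-descriptor pairing praised above (a learned cosine channel for style plus a perceptual hash for structure) is easy to make concrete. Below is a generic difference-hash (dHash) sketch in plain NumPy, not the paper's implementation; the 8x8 bit grid and block-mean downsampling are the conventional defaults for this hash family.

```python
import numpy as np

def dhash(img: np.ndarray, size: int = 8) -> int:
    """Difference hash: downsample a grayscale image to (size, size+1)
    by block means, then encode left-vs-right brightness gradients."""
    h, w = img.shape
    rows = np.array_split(np.arange(h), size)
    cols = np.array_split(np.arange(w), size + 1)
    small = np.array([[img[np.ix_(r, c)].mean() for c in cols] for r in rows])
    bits = (small[:, 1:] > small[:, :-1]).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    """Bit-level distance between two hashes."""
    return bin(a ^ b).count("1")

rng = np.random.default_rng(0)
img = rng.random((64, 72))
noisy = img + rng.normal(0, 0.01, img.shape)

print(hamming(dhash(img), dhash(img)))    # identical reproduction: 0
print(hamming(dhash(img), dhash(noisy)))  # small for near-copies
```

A mechanical reproduction keeps Hamming distance near zero even when compression perturbs pixels, while the cosine channel alone cannot distinguish "same style" from "same image", which is the ambiguity the review says the dual descriptor resolves.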
---
## 2. Scientific Soundness Audit
### Three-Level Framework Coherence
The separation of the analysis into signature-level, accountant-level, and auditor-year units is intellectually rigorous and highly defensible. By strictly separating the *pixel-level output quality* (signature level) from the *aggregate behavioral regime* (accountant level), the authors successfully avoid the ecological fallacy of assuming that because an individual practitioner acts in a binary fashion (hand-signing vs. stamping), the aggregate distribution of signature pixels must be neatly bimodal. The evidence compellingly demonstrates that the data forms a continuous quality degradation spectrum at the pixel level.
### Firm A 'Replication-Dominated' Framing
This is perhaps the strongest conceptual pillar of the paper. Assuming that Firm A acts as a "pure" positive class would inevitably force the thresholding model to interpret the long left tail of the cosine distribution as algorithmic noise or pipeline error. The explicit validation of Firm A as "replication-dominated but not pure"—quantified elegantly by the 139/32 split between high-replication and middle-band clusters in the accountant-level Gaussian Mixture Model (Section IV-E)—logically resolves the 92.5% capture rate without overclaiming. It is a highly defensible stance.
### BD/McCrary Demotion
Moving the BD/McCrary test from a co-equal threshold estimator to a "density-smoothness diagnostic" is the correct scientific decision. Appendix A empirically demonstrates that the test behaves exactly as one would expect when applied to a large ($N > 60,000$), smooth, heavy-tailed distribution: it detects localized non-linearities caused by histogram binning resolution rather than true mechanistic discontinuities. The theoretical tension is resolved by this demotion.
### Statistical Choices
The statistical foundations of the paper are appropriate and well-applied:
* **Beta/Logit-Gaussian Mixtures:** Fitting Beta mixtures via the EM algorithm is perfectly suited for bounded cosine similarity data $[0,1]$, and the logit-Gaussian cross-check serves as an excellent robustness measure against parametric misspecification.
* **Hartigan Dip Test:** The use of the dip test provides a rigorous, non-parametric verification of unimodality/multimodality.
* **Wilson Confidence Intervals:** Utilizing Wilson score intervals for the held-out validation metrics (Table XI) correctly models binomial variance, preventing zero-bound confidence interval collapse.
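The Wilson interval endorsed above has a standard closed form; a minimal sketch (textbook formula, unrelated to the paper's code), applied to the Table XI held-out fold counts quoted later in this review:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion. Unlike the Wald
    interval, it does not collapse to zero width at p-hat = 0 or 1,
    which matters for near-perfect capture rates."""
    if n == 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Held-out fold from Table XI: 14,035 / 15,332 captured.
lo, hi = wilson_ci(14035, 15332)
print(f"{14035/15332:.4f} [{lo:.4f}, {hi:.4f}]")
```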
---
## 3. Numerical Consistency Cross-Check
An exhaustive spot-check of the manuscript's arithmetic, table values, and cited numbers reveals practically flawless internal consistency. The scripts supporting the pipeline operate exactly as claimed.
* **Table VIII:** The reported accountant-level threshold band (KDE antimode: 0.973, Beta-2: 0.979, logit-GMM-2: 0.976) matches the narrative text precisely.
* **Table IX:** The proportion of Firm A captures under the dual rule ($54,370 / 60,448 = 89.945\%$) correctly rounds to the reported $89.95\%$.
* **Table XI:** The calibration fold's operational dual rule yields $40,335 / 45,116 = 89.402\%$ (reported $89.40\%$), and the held-out fold yields $14,035 / 15,332 = 91.540\%$ (reported $91.54\%$).
* **Table XII:** The column sums for $N = 168,740$ match perfectly. Furthermore, the delta column balances precisely to zero ($+2,294 + 6,095 + 119 - 8,508 + 0 = 0$).
* **Table XIV:** Top 10% Firm A occupancy is $443 / 462 = 95.88\%$ (reported $95.9\%$), against a baseline of $1,287 / 4,629 = 27.80\%$ (reported $27.8\%$).
* **Table XVI:** Firm A's intra-report agreement is correctly calculated as $(26,435 + 734 + 4) / 30,222 = 89.91\%$.
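These spot-checks are mechanical enough to script. The sketch below re-derives each reported rounding from the raw counts quoted in this review (nothing here comes from the paper's code):

```python
# (label, numerator, denominator, percentage as reported in the paper)
checks = [
    ("Table IX  dual-rule capture",      54370,            60448, "89.95"),
    ("Table XI  calibration fold",       40335,            45116, "89.40"),
    ("Table XI  held-out fold",          14035,            15332, "91.54"),
    ("Table XIV top-decile Firm A",        443,              462, "95.9"),
    ("Table XIV baseline occupancy",      1287,             4629, "27.8"),
    ("Table XVI intra-report agreement", 26435 + 734 + 4,  30222, "89.91"),
]

for name, num, den, reported in checks:
    pct = 100 * num / den
    decimals = len(reported.split(".")[1])
    assert f"{pct:.{decimals}f}" == reported, name
    print(f"{name}: {pct:.3f}% -> reported {reported}%")

# Table XII delta column balances exactly to zero.
assert 2294 + 6095 + 119 - 8508 + 0 == 0
```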
**Minor Narrative Clarification Required:**
In Table III, total extracted signatures are reported as $182,328$, with $168,755$ successfully matched to CPAs. However, Table V and Table XII utilize $N = 168,740$ signatures for the all-pairs best-match analysis. This delta of $15$ signatures is mathematically implied by CPAs who possess exactly *one* signature in the entire database, rendering a "same-CPA pairwise comparison" impossible. While logically sound to anyone analyzing the pipeline closely, this microscopic $15$-signature discrepancy is exactly the kind of arithmetic artifact that distracts meticulous reviewers.
*Recommendation:* Add a one-sentence footnote or parenthetical to Section IV-D explicitly stating this $15$-signature delta is due to single-signature CPAs lacking a pairwise match.
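The exclusion the recommended footnote describes is a standard one. A toy sketch of the logic, with hypothetical IDs standing in for the paper's (non-public) schema:

```python
from collections import Counter

# Hypothetical records: one entry per CPA-matched signature, keyed by CPA.
signatures = ["A", "A", "B", "C", "C", "C"]

sig_counts = Counter(signatures)
# A CPA with exactly one signature admits no same-CPA pair, so that
# signature drops out of every pairwise best-match analysis.
excluded = [cpa for cpa in signatures if sig_counts[cpa] == 1]
analyzed = len(signatures) - len(excluded)

print(len(signatures), analyzed, len(excluded))  # -> 6 5 1

# At the paper's scale: 168,755 CPA-matched minus 15 singletons.
assert 168_755 - 15 == 168_740
```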
---
## 4. Appendix A Validity
The addition of Appendix A successfully and empirically justifies the main-text demotion of the BD/McCrary test.
**Strengths:**
The argument demonstrating that the BD/McCrary transitions drift monotonically with bin width (e.g., Firm A cosine drifting across 0.987 $\rightarrow$ 0.985 $\rightarrow$ 0.980 $\rightarrow$ 0.975) is brilliant. Coupled with the observation that the Z-statistics inflate superlinearly with bin width (from $|Z| \sim 9$ at bin 0.003 to $|Z| \sim 106$ at bin 0.015), the appendix irrefutably proves that the test is interacting with the local curvature of a heavily-populated continuous distribution rather than identifying a discrete, mechanistic boundary. Table A.I is arithmetically consistent with the script's logic.
**Weaknesses:**
The interpretation paragraph overstates the implications of the accountant-level null finding. It claims that the lack of a transition at the accountant level ($N=686$) is a "robust finding that survives the bin-width sweep." As detailed in Section 6 below, a non-finding surviving a bin-width sweep in a small sample is largely a function of low statistical power, not definitive proof of a smoothly-mixed boundary.
---
## 5. IEEE Access Submission Readiness
The manuscript is in excellent shape for submission to IEEE Access.
* **Scope Fit:** High. The paper sits perfectly at the intersection of applied AI, document forensics, and interdisciplinary data science, which is a core demographic for IEEE Access.
* **Abstract Length:** The abstract is approximately 234 words, comfortably within the stringent $\leq 250$-word limit.
* **Formatting & Structure:** The document adheres to standard IEEE double-column formatting conventions (Roman numeral sections, appropriate table/figure references).
* **Anonymization:** Properly handled. Author placeholders, affiliation blocks, and correspondence emails are appropriately bracketed for single-anonymized peer review.
* **Desk-Return Risks:** Very low. The inclusion of the ablation study (Table XVIII) and explicit baseline comparisons ensures the paper meets the journal's expectations for methodological validation.
---
## 6. Novel Issues and Methodological Blind Spots
While the previous review rounds improved the manuscript significantly, habituation has allowed three specific narrative and statistical blind spots to persist. These are prime targets for reviewer pushback.
### Issue 1: The Accountant-Level BD/McCrary Null is a Power Artifact, not Proof of Smoothness
In Section V-B and Appendix A, the authors claim that because the BD/McCrary test yields no significant transition at the accountant level, this "pattern is consistent with a clustered but smoothly mixed accountant-level distribution." Furthermore, Section V-B states that this non-transition is "itself diagnostic of smoothness rather than a failure of the method."
**The Critique:** The McCrary (2008) test relies on local linear regression smoothing. The variance of the estimator scales inversely with $N \cdot h$ (where $h$ is the bin width). With a sample size of only $N=686$ accountants, the test is severely underpowered and lacks the statistical capacity to reject the null of smoothness unless the discontinuity is an absolute, sheer cliff. Asserting that a failure to reject the null affirmatively *proves* the null is true (smoothness) is a fundamental statistical fallacy (Type II error risk).
*Impact:* Statistically literate reviewers will immediately flag this as an overclaim. The demotion of the test to a diagnostic is correct, but interpreting the null at $N=686$ as definitive proof of smoothness is flawed.
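The power argument can be made tangible with a toy simulation. The sketch below uses one common form of the Burgstahler-Dichev standardized difference (a bin's count versus the mean of its two neighbours) on synthetic data; the normal distribution, bin grid, and injected single-bin deficit are all illustrative assumptions, not the paper's procedure:

```python
import numpy as np

rng = np.random.default_rng(42)

def bd_z(counts: np.ndarray, i: int) -> float:
    """Standardized difference for bin i against the mean of its two
    neighbours, using the usual BD variance approximation."""
    N = counts.sum()
    p = counts / N
    expected = (counts[i - 1] + counts[i + 1]) / 2
    var = (N * p[i] * (1 - p[i])
           + 0.25 * N * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
    return float((counts[i] - expected) / np.sqrt(var))

def reject_rate(n: int, deficit: float = 0.3, trials: int = 200) -> float:
    """Power proxy: how often |z| > 1.96 when a fixed fraction of the
    mass in one histogram bin is deleted before testing."""
    edges = np.arange(0.60, 1.2001, 0.01)
    i = 27  # index of the perturbed bin [0.87, 0.88)
    hits = 0
    for _ in range(trials):
        x = rng.normal(0.9, 0.05, size=n)
        in_bin = (x >= 0.87) & (x < 0.88)
        x = x[~(in_bin & (rng.random(n) < deficit))]
        counts, _ = np.histogram(x, bins=edges)
        if abs(bd_z(counts, i)) > 1.96:
            hits += 1
    return hits / trials

# The same single-bin deficit that is glaring at signature-level N
# is frequently missed at accountant-level N.
print(reject_rate(60000), reject_rate(686))
```

Under these assumptions the detection rate is near 1 at $N = 60{,}000$ and far from 1 at $N = 686$, which is exactly the asymmetry the critique says the manuscript must acknowledge before reading the accountant-level null as smoothness.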
### Issue 2: Tautological Presentation of FRR and EER (Table X)
Table X presents a False Rejection Rate (FRR) computed against a "byte-identical" positive anchor. It reports an FRR of $0.000$ for thresholds like 0.95 and 0.973, and subsequently reports an Equal Error Rate (EER) of $\approx 0$ at cosine = 0.990.
**The Critique:** By definition, byte-identical signatures have a cosine similarity asymptotically approaching 1.0 (modulo minor float/cropping artifacts). Evaluating a similarity threshold of 0.95 against inputs that are mathematically defined to score near 1.0 yields a 0% FRR trivially. It is a tautology. While the text in Section V-F attempts to caveat this ("perfect recall against this subset therefore does not generalize"), presenting it as a formal column in Table X with an EER calculation treats it as a standard biometric evaluation. There are no crossing error distributions here to warrant an EER.
*Impact:* This is reviewer-bait. Reviewers from the biometric or forensics domains will argue that an EER of 0 is artificially constructed. The true scientific value of Table X is purely the empirical False Acceptance Rate (FAR) derived from the 50,000 inter-CPA negatives.
### Issue 3: Document-Level Worst-Case Aggregation Narrative
Section IV-I reports that 35.0% of documents are classified as "High-confidence non-hand-signed" and 43.8% as "Moderate-confidence." This relies on the worst-case rule defined in Section III-L (if one signature on a dual-signed report is stamped, the whole document inherits that label).
**The Critique:** While this "worst-case" aggregation is highly practical for building an operational regulatory auditing tool (flagging the report for review), the narrative in IV-I presents these percentages without reminding the reader that a document might contain a mix of genuine and stamped signatures. Without immediate context, stating that nearly 80% of the market's reports are non-hand-signed invites the ecological fallacy that *both* partners are stamping.
*Impact:* A brief narrative safeguard is missing. Section IV-I must briefly cross-reference the intra-report agreement findings (Table XVI) to remind the reader of the composition of these documents, mitigating the risk that the reader misinterprets the document-level severity.
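The worst-case rule at issue can be sketched in a few lines; the label names and severity ordering below are hypothetical stand-ins for the Section III-L categories:

```python
# Hypothetical severity ordering: most replication-consistent = most severe.
SEVERITY = {"hand_signed": 0, "moderate_nonhand": 1, "high_conf_nonhand": 2}

def document_label(signature_labels: list[str]) -> str:
    """Worst-case aggregation: the report inherits the label of its
    most replication-consistent signature."""
    return max(signature_labels, key=SEVERITY.__getitem__)

# A dual-signed report with one hand-signed and one stamped partner
# is flagged at the stamped partner's level:
print(document_label(["hand_signed", "high_conf_nonhand"]))  # -> high_conf_nonhand
```

This is precisely why the headline document-level percentages can exceed the share of individual stamped signatures, and why a cross-reference to Table XVI's mixed-report composition is needed.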
---
## 7. Final Recommendation and v3.8 Action Items
The manuscript is exceptionally strong but requires a few surgical narrative adjustments to remove reviewer-bait and statistical overclaims. I recommend a **Minor Revision** encompassing the following ranked action items.
### BLOCKER (Must Fix for Submission)
1. **Revise the interpretation of the accountant-level BD/McCrary null.**
* *Action:* In Section V-B, Section VI (Conclusion), and Appendix A, remove any explicit claims that the null affirmatively proves "smoothly mixed" boundaries.
* *Replacement Phrasing:* Reframe this finding to acknowledge statistical power. For example: *"We fail to find evidence of a discontinuity at the accountant level. While this is consistent with smoothly mixed clusters, it also reflects the limited statistical power of the BD/McCrary test at smaller sample sizes ($N=686$), reinforcing its role as a diagnostic rather than a definitive estimator."*
### MAJOR (Highly Recommended to Prevent Desk-Reject/Major Revision)
2. **Reframe Table X to eliminate the tautological FRR/EER presentation.**
* *Action:* Remove the Equal Error Rate (EER) calculation entirely. Add an explicit, prominent table note to Table X stating that FRR is computed against a definitionally extreme subset (byte-identical signatures), making the $0.000$ values an expected mathematical boundary check rather than an empirical discovery of real-world recall. Emphasize that the primary contribution of Table X is the FAR evaluation against the large inter-CPA negative anchor.
### MINOR (Quick Wins for Readability and Precision)
3. **Contextualize the Document-Level Aggregation (Section IV-I).**
* *Action:* When presenting the 35.0% / 43.8% document-level figures in Section IV-I, explicitly remind the reader of the worst-case aggregation rule. Add a single sentence cross-referencing Table XVI's mixed-report rates to ensure the reader understands the internal composition of these flagged documents.
4. **Clarify the 15-Signature Delta (Section IV-D / Table XII).**
* *Action:* Add a one-sentence clarification explaining that the delta between the 168,755 CPA-matched signatures (Table III) and the 168,740 signatures analyzed in the all-pairs distributions (Table V/Table XII) consists of CPAs who have exactly one signature in the corpus, making intra-CPA pairwise comparison impossible. This will preempt arithmetic nitpicking by reviewers.
+5 -3
@@ -34,10 +34,12 @@ The $Z$ statistics also inflate superlinearly with the bin width (Firm A cosine
Both features are characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity.
Second, at the accountant level---the unit we rely on for primary threshold inference (Sections III-H, III-J, IV-E)---the procedure produces no significant transition at two of three cosine bin widths and two of three dHash bin widths, and the one marginal transition it does produce ($Z_\text{below} = -2.00$ in the dHash sweep at bin width $1.0$) sits exactly at the critical value for $\alpha = 0.05$.
This pattern is itself informative: it is consistent with *clustered-but-smoothly-mixed* accountant-level aggregates, in which the between-cluster boundary is gradual enough that a discontinuity-based test cannot reject the smoothness null at conventional significance.
We stress the inferential asymmetry here: *consistency* with smoothly-mixed clustering is what the BD null delivers, not *affirmative proof* of smoothness.
At $N = 686$ accountants the BD/McCrary test has limited statistical power and can typically reject only sharp cliff-type discontinuities; failure to reject the smoothness null therefore constrains the data only to distributions whose between-cluster transitions are gradual *enough* to escape the test's sensitivity at that sample size.
We read this as reinforcing---not establishing---the clustered-but-smoothly-mixed interpretation derived from the GMM fit and the dip-test evidence.
Taken together, Table A.I shows (i) that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the non-hand-signed mode rather than between modes---and (ii) that the accountant-level BD/McCrary null is a robust finding that survives the bin-width sweep.
Taken together, Table A.I shows (i) that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the non-hand-signed mode rather than between modes---and (ii) that the accountant-level BD/McCrary null persists across the bin-width sweep, consistent with but not alone sufficient to establish the clustered-but-smoothly-mixed interpretation discussed in Section V-B and limitation-caveated in Section V-G.
Both observations support the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator.
The accountant-level threshold band reported in Table VIII ($\text{cosine} \approx 0.975$ from the convergence of the KDE antimode, the Beta-2 crossing, and the logit-GMM-2 crossing) is therefore not adjusted to include any BD/McCrary location, and the absence of a BD transition at the accountant level is reported as itself evidence for the clustered-but-smooth interpretation in Section V-B.
The accountant-level threshold band reported in Table VIII ($\text{cosine} \approx 0.975$ from the convergence of the KDE antimode, the Beta-2 crossing, and the logit-GMM-2 crossing) is therefore not adjusted to include any BD/McCrary location.
Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials (`reports/bd_sensitivity/bd_sensitivity.json`) produced by `signature_analysis/25_bd_mccrary_sensitivity.py`.
+2 -2
@@ -13,8 +13,8 @@ Second, we showed that combining cosine similarity of deep embeddings with diffe
Third, we introduced a convergent threshold framework combining two methodologically distinct estimators---KDE antimode (with a Hartigan unimodality test) and an EM-fitted Beta mixture (with a logit-Gaussian robustness check)---together with a Burgstahler-Dichev / McCrary density-smoothness diagnostic.
Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the KDE antimode together with the Beta-mixture and logit-Gaussian estimators agree within $\sim 0.006$ at cosine $\approx 0.975$.
The Burgstahler-Dichev / McCrary test, by contrast, finds no significant transition at the accountant level, consistent with clustered but smoothly mixed rather than sharply discrete accountant-level heterogeneity.
The substantive reading is therefore narrower than "discrete behavior": *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered with smooth cluster boundaries.
The Burgstahler-Dichev / McCrary test, by contrast, finds no significant transition at the accountant level; at $N = 686$ accountants the test has limited power and cannot affirmatively establish smoothness, but its non-transition is consistent with the smoothly-mixed cluster boundaries implied by the accountant-level GMM.
The substantive reading is therefore narrower than "discrete behavior": *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered into three recognizable groups whose inter-cluster boundaries are gradual rather than sharp.
Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85-95% capture band differ by 1-5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure.
+5 -3
@@ -25,11 +25,12 @@ Replication quality varies continuously with scan equipment, PDF compression, st
At the per-accountant aggregate level the picture partly reverses.
The distribution of per-accountant mean cosine (and mean dHash) rejects unimodality, a BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms, and the three 1D threshold methods applied at the accountant level produce mutually consistent estimates (KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, logit-GMM-2 crossing $= 0.976$).
The BD/McCrary test, however, does not produce a significant transition at the accountant level either, in contrast to the signature level.
This pattern is consistent with a clustered *but smoothly mixed* accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the transitions between them are gradual rather than discrete at the bin resolution BD/McCrary requires.
This pattern is consistent with a clustered *but smoothly mixed* accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the test fails to reject the smoothness null at the sample size available ($N = 686$), and the GMM cluster boundaries appear gradual rather than sheer.
We caveat this interpretation appropriately in Section V-G: the BD null alone cannot affirmatively establish smoothness---only fail to falsify it---and our substantive claim of smoothly-mixed clustering rests on the joint weight of the GMM fit, the dip test, and the BD null rather than on the BD null alone.
The substantive interpretation we take from this evidence is therefore narrower than a "discrete-behaviour" claim: *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behaviour* is clustered (three recognizable groups) but not sharply discrete.
The accountant-level mixture is a useful classifier of firm-and-practitioner-level signing regimes; individual behaviour may still transition or mix over time within a practitioner, and our cross-sectional analysis does not rule this out.
Methodologically, the implication is that the three 1D methods are meaningfully applied at the accountant level for threshold estimation, while the BD/McCrary non-transition at the same level is itself diagnostic of smoothness rather than a failure of the method.
Methodologically, the implication is that the two threshold estimators (KDE antimode, Beta mixture with logit-Gaussian robustness) are meaningfully applied at the accountant level for threshold estimation, while the BD/McCrary non-transition at the same level is a failure-to-reject rather than a failure of the method---informative alongside the other evidence but subject to the power caveat recorded in Section V-G.
## C. Firm A as a Replication-Dominated, Not Pure, Population
@@ -103,7 +104,8 @@ Extending the accountant-level analysis to auditor-year units is a natural next
Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, and the test produces no significant transition at all at the accountant level.
In our application, therefore, BD/McCrary contributes diagnostic information about local density-smoothness rather than an independent accountant-level threshold estimate; that role is played by the KDE antimode and the two mixture-based estimators.
The BD/McCrary results remain informative as a robustness check---their non-transition at the accountant level is consistent with the dip-test and Beta-mixture evidence that accountant-level clustering is smooth rather than sharply discontinuous.
We emphasize that the accountant-level BD/McCrary null is *consistent with*---not affirmative proof of---smoothly mixed cluster boundaries: the BD/McCrary test is known to have limited statistical power at modest sample sizes, and with $N = 686$ accountants in our analysis the test cannot reliably detect anything less than a sharp cliff-type density discontinuity.
Failure to reject the smoothness null at this sample size therefore reinforces BD/McCrary's role as a diagnostic rather than a definitive estimator; the substantive claim of smoothly-mixed accountant-level clustering rests on the joint weight of the dip-test and Beta-mixture evidence together with the BD null, not on the BD null alone.
Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.
+2 -1
@@ -244,7 +244,8 @@ The heldout fold is used exclusively to report post-hoc capture rates with Wilso
4. **Low-similarity same-CPA anchor (supplementary negative):** signatures whose maximum same-CPA cosine similarity is below 0.70.
This anchor is retained for continuity with prior work but is small in our dataset ($n = 35$) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.
From these anchors we report FAR with Wilson 95% confidence intervals (against the inter-CPA negative anchor) and FRR (against the byte-identical positive anchor), together with the Equal Error Rate (EER) interpolated at the threshold where FAR $=$ FRR, following biometric-verification reporting conventions [3].
From these anchors we report FAR with Wilson 95% confidence intervals against the inter-CPA negative anchor.
We do not report an Equal Error Rate or FRR column against the byte-identical positive anchor, because byte-identical pairs have cosine $\approx 1$ by construction and any FRR computed against that subset is trivially $0$ at every threshold below $1$; the conservative-subset role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
Precision and $F_1$ are not meaningful in this anchor-based evaluation because the positive and negative anchors are constructed from different sampling units (intra-CPA byte-identical pairs vs random inter-CPA pairs), so their relative prevalence in the combined set is an arbitrary construction rather than a population parameter; we therefore omit precision and $F_1$ from Table X.
The 70/30 held-out Firm A fold of Section IV-G.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.
We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
+23 -16
@@ -56,6 +56,7 @@ A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the
## D. Hartigan Dip Test: Unimodality at the Signature Level
Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
The $N = 168{,}740$ count used in Table V and downstream all-pairs analyses (Tables XII, XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
<!-- TABLE V: Hartigan Dip Test Results
| Distribution | N | dip | p-value | Verdict (α=0.05) |
Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
First, the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
At the accountant level the test does not produce a significant transition in either the cosine-mean or the dHash-mean distribution, and this null persists across the Appendix-A bin-width sweep.
We read this accountant-level pattern as *consistent with*---not affirmative proof of---clustered-but-smoothly-mixed aggregates: at $N = 686$ accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness (Section V-G).
We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator, and the substantive claim of smoothly-mixed accountant clustering rests on the joint evidence of the dip test, the BIC-selected GMM, and the BD null.
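The bin-width instability that Appendix A treats as diagnostic of a histogram-resolution artifact can be illustrated on synthetic data. The sketch below is an assumption-laden simplification (a Beta sample standing in for per-accountant cosine means, and "transition" operationalized as the largest adjacent-bin count jump); it is not the paper's BD/McCrary implementation.

```python
import numpy as np

def largest_jump_location(x: np.ndarray, bin_width: float) -> float:
    """Bin boundary with the largest adjacent-bin count jump (toy 'transition')."""
    edges = np.arange(x.min(), x.max() + bin_width, bin_width)
    counts, edges = np.histogram(x, bins=edges)
    i = int(np.argmax(np.abs(np.diff(counts))))
    return float(edges[i + 1])  # boundary between bins i and i+1

rng = np.random.default_rng(0)
smooth = rng.beta(8, 2, size=5_000)  # smooth, unimodal stand-in sample

# On smooth data the apparent "transition" drifts as the bin width changes --
# the signature of a resolution artifact rather than a density discontinuity.
locations = {w: largest_jump_location(smooth, w) for w in (0.003, 0.005, 0.010, 0.015)}
print(locations)
```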
### 2) Beta Mixture at Signature Level: A Forced Fit
| Firm A calibration-fold dHash_indep median | — | 2 |
-->
At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$; the BD/McCrary density-smoothness diagnostic produces no significant transition at the same level (a null that persists across Appendix A's bin-width sweep), which is *consistent with*---though, at $N = 686$, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates.
This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
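The BIC-based component selection underlying the GMM evidence can be sketched with scikit-learn on synthetic one-dimensional data. Everything below is illustrative: the two-group sample (sizes chosen to total $N = 686$) is invented, and the paper's actual GMM is fit to real (cosine, dHash) features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Synthetic stand-in for accountant-level cosine means: two blended groups.
x = np.concatenate([rng.normal(0.90, 0.02, 400),
                    rng.normal(0.97, 0.01, 286)]).reshape(-1, 1)

# Fit k = 1..3 components and let BIC arbitrate; lower BIC is better.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(x).bic(x)
        for k in (1, 2, 3)}
best_k = min(bics, key=bics.get)
print(best_k, bics)
```

Note that BIC can prefer $k \ge 2$ even when the mixture density itself is smooth; this is precisely why a multi-component GMM is compatible with a BD/McCrary smoothness null.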
Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X; FAR here is the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold.
We do not report an Equal Error Rate: EER is meaningful only when the positive and negative error-rate curves cross in a nontrivial interior region, but byte-identical positives all sit at cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$. An EER calculation against this anchor would be arithmetic tautology rather than biometric performance, and we therefore omit it.
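The tautology is easy to see numerically. In the sketch below the positives are placed at cosine $\approx 1$, mirroring the byte-identical anchor; the negative distribution is synthetic (only its mean, 0.762, comes from the reported inter-CPA statistics) and stands in for the 50,000 inter-CPA pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
positives = np.full(310, 1.0 - 1e-9)  # byte-identical pairs: cosine ~ 1 by construction
negatives = np.clip(rng.normal(0.762, 0.07, 50_000), -1.0, 1.0)  # illustrative only

for t in (0.837, 0.900, 0.945, 0.973, 0.979):
    frr = float(np.mean(positives < t))   # rejected positives
    far = float(np.mean(negatives >= t))  # accepted negatives
    print(f"threshold={t}: FRR={frr}, FAR={far:.4f}")
# FRR is 0.0 at every threshold below 1 regardless of the negatives, so an
# EER "crossing" interpolated against this anchor carries no information.
```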
<!-- TABLE X: Cosine Threshold Sweep — FAR Against 50,000 Inter-CPA Negative Pairs
| Threshold | FAR | FAR 95% Wilson CI |
|-----------|-----|-------------------|
| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] |
| 0.900 | 0.0233 | [0.0221, 0.0247] |
| 0.945 (2D GMM marginal) | 0.0008 | [0.0006, 0.0011] |
| 0.950 | 0.0007 | [0.0005, 0.0009] |
| 0.973 (accountant KDE antimode) | 0.0003 | [0.0002, 0.0004] |
| 0.979 (accountant Beta-2) | 0.0002 | [0.0001, 0.0004] |
Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
-->
Two caveats apply.
First, the byte-identical positive anchor referenced above is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
A would-be FRR computed against this subset is definitionally $0$ at every threshold below $1$ (since byte-identical pairs have cosine $\approx 1$), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are out-of-sample evaluations of thresholds that were not chosen to optimize Table X.
The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-L: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
Document-level rates therefore bound the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-H.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
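The worst-case aggregation rule amounts to taking a maximum over signature-level verdicts under a severity ordering. The sketch below uses the stratum names from Section IV-G's spot-check sample; the specific ordering is an assumption for illustration, as Section III-L defines the actual verdict set.

```python
# Assumed severity ordering (higher = more replication-consistent); the
# verdict names come from the paper's strata, the ranks are illustrative.
SEVERITY = {"likely-genuine": 0, "style-only": 1, "borderline": 2,
            "high-confidence replication": 3}

def document_verdict(signature_verdicts: list[str]) -> str:
    """Worst-case rule: the document takes its most severe signature verdict."""
    return max(signature_verdicts, key=SEVERITY.__getitem__)

# A report with one likely-genuine and one borderline signature is labeled
# with the more replication-consistent of the two.
print(document_verdict(["likely-genuine", "borderline"]))  # → borderline
```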
<!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
| Verdict | N (PDFs) | % | Firm A | Firm A % |