Paper A v3.8: resolve Gemini 3.1 Pro round-6 independent-review findings
Gemini round-6 (paper/gemini_review_v3_7.md) gave Minor Revision but
flagged three issues that five rounds of codex review had missed.
This commit addresses all three.
BLOCKER: Accountant-level BD/McCrary null is a power artifact, not
proof of smoothness (Gemini Issue 1)
- At N=686 accountants the BD/McCrary test has limited statistical
power; interpreting a failure-to-reject as affirmative proof of
smoothness is a Type II error risk.
- Discussion V-B: "itself diagnostic of smoothness" replaced with
"failure-to-reject rather than a failure of the method ---
informative alongside the other evidence but subject to the power
caveat in Section V-G".
- Discussion V-G (Sixth limitation): added a power-aware paragraph
naming N=686 explicitly and clarifying that the substantive claim
of smoothly-mixed clustering rests on the JOINT weight of dip
test + BIC-selected GMM + BD null, not on BD alone.
- Results IV-D.1 and IV-E: reframe accountant-level null as
"consistent with --- not affirmative proof of" clustered-but-
smoothly-mixed, citing V-G for the power caveat.
- Appendix A interpretation paragraph: explicit inferential-asymmetry
sentence ("consistency is what the BD null delivers, not
affirmative proof"); "itself evidence for" removed.
- Conclusion: "consistent with clustered but smoothly mixed"
rephrased with explicit power caveat ("at N = 686 the test has
limited power and cannot affirmatively establish smoothness").
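
The Type II risk flagged here can be seen in a toy power calculation. This is an illustration only, not the paper's BD/McCrary implementation: it assumes (hypothetically) that at N=686 accountants the two histogram bins flanking a candidate transition hold on the order of 25-35 observations each, and asks how often a true 30% local density jump would be detected by a simple fair-split test on the two bin counts.

```python
import math
import random

random.seed(0)

def poisson(lam: float) -> int:
    # Knuth's multiplication method; adequate for small lambda.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

n_trials = 2000
alpha = 0.05
rejections = 0
for _ in range(n_trials):
    left = poisson(25.0)    # assumed occupancy of the bin left of the cut
    right = poisson(32.5)   # a true 30% density jump on the right
    total = left + right
    if total == 0:
        continue
    # Under the smoothness null the split is Binomial(total, 0.5);
    # normal-approximation two-sided p-value for the observed split.
    z = (right - total / 2) / math.sqrt(total / 4)
    pval = math.erfc(abs(z) / math.sqrt(2))
    if pval < alpha:
        rejections += 1

power = rejections / n_trials
print(f"estimated power to detect a 30% local jump: {power:.2f}")
```

Even with a genuine discontinuity, rejection happens well under half the time at these sample sizes, which is why the BD null cannot carry the smoothness claim alone.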
MAJOR: Table X FRR / EER was tautological reviewer-bait
(Gemini Issue 2)
- Byte-identical positive anchor has cosine approx 1 by construction,
so FRR against that subset is trivially 0 at every threshold
below 1 and any EER calculation is arithmetic tautology, not
biometric performance.
- Results IV-G.1: removed EER row; dropped FRR column from Table X;
added a table note explaining the omission and directing readers
to Section V-F for the conservative-subset discussion.
- Methodology III-K: removed the EER / FRR-against-byte-identical
reporting clause; clarified that FAR against inter-CPA negatives
is the primary reported quantity.
- Table X is now FAR + Wilson 95% CI only, which is the quantity
that actually carries empirical content on this anchor design.
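
For reference, the retained Table X interval is the standard Wilson score interval. A minimal sketch follows; the `wilson_ci` helper name and the 1,165-of-50,000 count are illustrative assumptions (the diff reports only the rounded FAR of 0.0233, which corresponds to 1,165 events out of 50,000):

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n.

    Unlike the Wald interval, it stays inside [0, 1] and behaves
    sensibly for the very small FAR values in Table X.
    """
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# e.g. an assumed 1,165 false accepts among 50,000 inter-CPA pairs
lo, hi = wilson_ci(1165, 50000)
print(f"[{lo:.4f}, {hi:.4f}]")  # an interval close to Table X's [0.0221, 0.0247] row
```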
MINOR: Document-level worst-case aggregation narrative (Gemini
Issue 3) + 15-signature delta (Gemini spot-check)
- Results IV-I: added two sentences explicitly noting that the
document-level percentages reflect the Section III-L worst-case
aggregation rule (a report with one stamped + one hand-signed
signature inherits the most-replication-consistent label), and
cross-referencing Section IV-H.3 / Table XVI for the mixed-report
composition that qualifies the headline percentages.
- Results IV-D: added a one-sentence footnote explaining that the
15-signature delta between the Table III CPA-matched count
(168,755) and the all-pairs analyzed count (168,740) is due to
CPAs with exactly one signature, for whom no same-CPA pairwise
best-match statistic exists.
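
The Section III-L worst-case aggregation rule summarized above can be sketched as follows. The label names and their precedence ordering here are assumptions for illustration; the paper defines the actual verdict set.

```python
# Hedged sketch of the worst-case aggregation rule: a document inherits
# the most-replication-consistent label among its signatures.  The
# label set and ordering below are hypothetical.
REPLICATION_ORDER = ["hand-signed", "stamped", "non-hand-signed"]  # least -> most

def document_verdict(signature_labels: list[str]) -> str:
    """Return the label furthest along the replication-consistency
    ordering among a document's signature-level verdicts."""
    if not signature_labels:
        raise ValueError("document has no classified signatures")
    return max(signature_labels, key=REPLICATION_ORDER.index)

# A mixed report with one hand-signed and one stamped signature
# inherits the more replication-consistent label:
print(document_verdict(["hand-signed", "stamped"]))  # -> stamped
```

This is why the document-level percentages bound "at least one non-hand-signed signature" rather than "all signatures non-hand-signed", and why the Table XVI mixed-report composition is needed to qualify them.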
Abstract remains 243 words, comfortably under the IEEE Access
250-word cap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -56,6 +56,7 @@ A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the
 ## D. Hartigan Dip Test: Unimodality at the Signature Level
 
 Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
+The $N = 168{,}740$ count used in Table V and downstream all-pairs analyses (Tables XII, XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
 
 <!-- TABLE V: Hartigan Dip Test Results
 | Distribution | N | dip | p-value | Verdict (α=0.05) |
@@ -81,8 +82,9 @@ Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine
 Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
 First, the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
 Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
-At the accountant level the test does not produce a significant transition in either the cosine-mean or the dHash-mean distribution, and this null is robust across the Appendix-A bin-width sweep.
-We therefore read the BD/McCrary pattern as evidence that accountant-level aggregates are clustered-but-smoothly-mixed rather than sharply discontinuous, and we use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator.
+At the accountant level the test does not produce a significant transition in either the cosine-mean or the dHash-mean distribution, and this null persists across the Appendix-A bin-width sweep.
+We read this accountant-level pattern as *consistent with*---not affirmative proof of---clustered-but-smoothly-mixed aggregates: at $N = 686$ accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness (Section V-G).
+We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator, and the substantive claim of smoothly-mixed accountant clustering rests on the joint evidence of the dip test, the BIC-selected GMM, and the BD null.
 
 ### 2) Beta Mixture at Signature Level: A Forced Fit
 
@@ -147,7 +149,7 @@ Table VIII summarizes the threshold estimates produced by the two threshold esti
 | Firm A calibration-fold dHash_indep median | — | 2 |
 -->
 
-At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$; the BD/McCrary density-smoothness diagnostic produces no significant transition at the same level (and this null is robust across Appendix A's bin-width sweep), consistent with clustered-but-smoothly-mixed accountant-level aggregates.
+At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$; the BD/McCrary density-smoothness diagnostic produces no significant transition at the same level (a null that persists across Appendix A's bin-width sweep), which is *consistent with*---though, at $N = 686$, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates.
 This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
 The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
 
@@ -185,23 +187,26 @@ We report three validation analyses corresponding to the anchors of Section III-
 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
 As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
 Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
-We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor and FRR against the byte-identical positive anchor in Table X; these two error rates are well defined within their respective anchor populations.
-The Equal-Error-Rate point, interpolated at FAR $=$ FRR, is located at cosine $= 0.990$ with EER $\approx 0$, which is trivially small because every byte-identical positive falls at cosine very close to 1.
+We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
+The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold.
+We do not report an Equal Error Rate: EER is meaningful only when the positive and negative error-rate curves cross in a nontrivial interior region, but byte-identical positives all sit at cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$. An EER calculation against this anchor would be arithmetic tautology rather than biometric performance, and we therefore omit it.
 
-<!-- TABLE X: Cosine Threshold Sweep (positives = 310 byte-identical signatures; negatives = 50,000 inter-CPA pairs)
-| Threshold | FAR | FAR 95% Wilson CI | FRR (byte-identical) |
-|-----------|-----|-------------------|----------------------|
-| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] | 0.000 |
-| 0.900 | 0.0233 | [0.0221, 0.0247] | 0.000 |
-| 0.945 (2D GMM marginal) | 0.0008 | [0.0006, 0.0011] | 0.000 |
-| 0.950 | 0.0007 | [0.0005, 0.0009] | 0.000 |
-| 0.973 (accountant KDE antimode) | 0.0003 | [0.0002, 0.0004] | 0.000 |
-| 0.979 (accountant Beta-2) | 0.0002 | [0.0001, 0.0004] | 0.000 |
+<!-- TABLE X: Cosine Threshold Sweep — FAR Against 50,000 Inter-CPA Negative Pairs
+| Threshold | FAR | FAR 95% Wilson CI |
+|-----------|-----|-------------------|
+| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] |
+| 0.900 | 0.0233 | [0.0221, 0.0247] |
+| 0.945 (2D GMM marginal) | 0.0008 | [0.0006, 0.0011] |
+| 0.950 | 0.0007 | [0.0005, 0.0009] |
+| 0.973 (accountant KDE antimode) | 0.0003 | [0.0002, 0.0004] |
+| 0.979 (accountant Beta-2) | 0.0002 | [0.0001, 0.0004] |
 
+Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
 -->
 
 Two caveats apply.
-First, the gold-positive anchor is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
-Zero FRR against this subset does not establish zero FRR against the broader positive class, and the reported FRR should therefore be interpreted as a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable miss rate.
+First, the byte-identical positive anchor referenced above is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
+A would-be FRR computed against this subset is definitionally $0$ at every threshold below $1$ (since byte-identical pairs have cosine $\approx 1$), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
 Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
 The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.
 
@@ -371,6 +376,8 @@ We note that this test uses the calibrated classifier of Section III-L rather th
 
 Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
 The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
+We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-L: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
+Document-level rates therefore bound the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-H.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
 
 <!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
 | Verdict | N (PDFs) | % | Firm A | Firm A % |