Paper A v3.8: resolve Gemini 3.1 Pro round-6 independent-review findings
Gemini round-6 (paper/gemini_review_v3_7.md) gave Minor Revision but
flagged three issues that five rounds of codex review had missed.
This commit addresses all three.
BLOCKER: Accountant-level BD/McCrary null is a power artifact, not
proof of smoothness (Gemini Issue 1)
- At N=686 accountants the BD/McCrary test has limited statistical
power; interpreting a failure-to-reject as affirmative proof of
smoothness is a Type II error risk.
- Discussion V-B: "itself diagnostic of smoothness" replaced with
"failure-to-reject rather than a failure of the method ---
informative alongside the other evidence but subject to the power
caveat in Section V-G".
- Discussion V-G (Sixth limitation): added a power-aware paragraph
naming N=686 explicitly and clarifying that the substantive claim
of smoothly-mixed clustering rests on the JOINT weight of dip
test + BIC-selected GMM + BD null, not on BD alone.
- Results IV-D.1 and IV-E: reframe accountant-level null as
"consistent with --- not affirmative proof of" clustered-but-
smoothly-mixed, citing V-G for the power caveat.
- Appendix A interpretation paragraph: explicit inferential-asymmetry
sentence ("consistency is what the BD null delivers, not
affirmative proof"); "itself evidence for" removed.
- Conclusion: "consistent with clustered but smoothly mixed"
rephrased with explicit power caveat ("at N = 686 the test has
limited power and cannot affirmatively establish smoothness").
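
The Type II risk flagged here can be seen in a toy power calculation. This is an illustration only, not the paper's BD/McCrary implementation: it assumes (hypothetically) that at N=686 accountants the two histogram bins flanking a candidate transition hold on the order of 25-35 observations each, and asks how often a true 30% local density jump would be detected by a simple fair-split test on the two bin counts.

```python
import math
import random

random.seed(0)

def poisson(lam: float) -> int:
    # Knuth's multiplication method; adequate for small lambda.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

n_trials = 2000
alpha = 0.05
rejections = 0
for _ in range(n_trials):
    left = poisson(25.0)    # assumed occupancy of the bin left of the cut
    right = poisson(32.5)   # a true 30% density jump on the right
    total = left + right
    if total == 0:
        continue
    # Under the smoothness null the split is Binomial(total, 0.5);
    # normal-approximation two-sided p-value for the observed split.
    z = (right - total / 2) / math.sqrt(total / 4)
    pval = math.erfc(abs(z) / math.sqrt(2))
    if pval < alpha:
        rejections += 1

power = rejections / n_trials
print(f"estimated power to detect a 30% local jump: {power:.2f}")
```

Even with a genuine discontinuity, rejection happens well under half the time at these sample sizes, which is why the BD null cannot carry the smoothness claim alone.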
MAJOR: Table X FRR / EER was tautological reviewer-bait
(Gemini Issue 2)
- Byte-identical positive anchor has cosine approx 1 by construction,
so FRR against that subset is trivially 0 at every threshold
below 1 and any EER calculation is arithmetic tautology, not
biometric performance.
- Results IV-G.1: removed EER row; dropped FRR column from Table X;
added a table note explaining the omission and directing readers
to Section V-F for the conservative-subset discussion.
- Methodology III-K: removed the EER / FRR-against-byte-identical
reporting clause; clarified that FAR against inter-CPA negatives
is the primary reported quantity.
- Table X is now FAR + Wilson 95% CI only, which is the quantity
that actually carries empirical content on this anchor design.
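
For reference, the retained Table X interval is the standard Wilson score interval. A minimal sketch follows; the `wilson_ci` helper name and the 1,165-of-50,000 count are illustrative assumptions (the diff reports only the rounded FAR of 0.0233, which corresponds to 1,165 events out of 50,000):

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n.

    Unlike the Wald interval, it stays inside [0, 1] and behaves
    sensibly for the very small FAR values in Table X.
    """
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# e.g. an assumed 1,165 false accepts among 50,000 inter-CPA pairs
lo, hi = wilson_ci(1165, 50000)
print(f"[{lo:.4f}, {hi:.4f}]")  # an interval close to Table X's [0.0221, 0.0247] row
```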
MINOR: Document-level worst-case aggregation narrative (Gemini
Issue 3) + 15-signature delta (Gemini spot-check)
- Results IV-I: added two sentences explicitly noting that the
document-level percentages reflect the Section III-L worst-case
aggregation rule (a report with one stamped + one hand-signed
signature inherits the most-replication-consistent label), and
cross-referencing Section IV-H.3 / Table XVI for the mixed-report
composition that qualifies the headline percentages.
- Results IV-D: added a one-sentence footnote explaining that the
15-signature delta between the Table III CPA-matched count
(168,755) and the all-pairs analyzed count (168,740) is due to
CPAs with exactly one signature, for whom no same-CPA pairwise
best-match statistic exists.
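
The Section III-L worst-case aggregation rule summarized above can be sketched as follows. The label names and their precedence ordering here are assumptions for illustration; the paper defines the actual verdict set.

```python
# Hedged sketch of the worst-case aggregation rule: a document inherits
# the most-replication-consistent label among its signatures.  The
# label set and ordering below are hypothetical.
REPLICATION_ORDER = ["hand-signed", "stamped", "non-hand-signed"]  # least -> most

def document_verdict(signature_labels: list[str]) -> str:
    """Return the label furthest along the replication-consistency
    ordering among a document's signature-level verdicts."""
    if not signature_labels:
        raise ValueError("document has no classified signatures")
    return max(signature_labels, key=REPLICATION_ORDER.index)

# A mixed report with one hand-signed and one stamped signature
# inherits the more replication-consistent label:
print(document_verdict(["hand-signed", "stamped"]))  # -> stamped
```

This is why the document-level percentages bound "at least one non-hand-signed signature" rather than "all signatures non-hand-signed", and why the Table XVI mixed-report composition is needed to qualify them.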
Abstract remains 243 words, comfortably under the IEEE Access
250-word cap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -56,6 +56,7 @@ A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the
 ## D. Hartigan Dip Test: Unimodality at the Signature Level
 
 Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
+The $N = 168{,}740$ count used in Table V and downstream all-pairs analyses (Tables XII, XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
 
 <!-- TABLE V: Hartigan Dip Test Results
 | Distribution | N | dip | p-value | Verdict (α=0.05) |
@@ -81,8 +82,9 @@ Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine
 Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
 First, the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
 Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
-At the accountant level the test does not produce a significant transition in either the cosine-mean or the dHash-mean distribution, and this null is robust across the Appendix-A bin-width sweep.
-We therefore read the BD/McCrary pattern as evidence that accountant-level aggregates are clustered-but-smoothly-mixed rather than sharply discontinuous, and we use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator.
+At the accountant level the test does not produce a significant transition in either the cosine-mean or the dHash-mean distribution, and this null persists across the Appendix-A bin-width sweep.
+We read this accountant-level pattern as *consistent with*---not affirmative proof of---clustered-but-smoothly-mixed aggregates: at $N = 686$ accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness (Section V-G).
+We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator, and the substantive claim of smoothly-mixed accountant clustering rests on the joint evidence of the dip test, the BIC-selected GMM, and the BD null.
 
 ### 2) Beta Mixture at Signature Level: A Forced Fit
 
@@ -147,7 +149,7 @@ Table VIII summarizes the threshold estimates produced by the two threshold esti
 | Firm A calibration-fold dHash_indep median | — | 2 |
 -->
 
-At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$; the BD/McCrary density-smoothness diagnostic produces no significant transition at the same level (and this null is robust across Appendix A's bin-width sweep), consistent with clustered-but-smoothly-mixed accountant-level aggregates.
+At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$; the BD/McCrary density-smoothness diagnostic produces no significant transition at the same level (a null that persists across Appendix A's bin-width sweep), which is *consistent with*---though, at $N = 686$, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates.
 This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
 The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
 
@@ -185,23 +187,26 @@ We report three validation analyses corresponding to the anchors of Section III-
 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
 As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
 Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
-We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor and FRR against the byte-identical positive anchor in Table X; these two error rates are well defined within their respective anchor populations.
-The Equal-Error-Rate point, interpolated at FAR $=$ FRR, is located at cosine $= 0.990$ with EER $\approx 0$, which is trivially small because every byte-identical positive falls at cosine very close to 1.
+We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
+The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold.
+We do not report an Equal Error Rate: EER is meaningful only when the positive and negative error-rate curves cross in a nontrivial interior region, but byte-identical positives all sit at cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$. An EER calculation against this anchor would be arithmetic tautology rather than biometric performance, and we therefore omit it.
 
-<!-- TABLE X: Cosine Threshold Sweep (positives = 310 byte-identical signatures; negatives = 50,000 inter-CPA pairs)
-| Threshold | FAR | FAR 95% Wilson CI | FRR (byte-identical) |
-|-----------|-----|-------------------|----------------------|
-| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] | 0.000 |
-| 0.900 | 0.0233 | [0.0221, 0.0247] | 0.000 |
-| 0.945 (2D GMM marginal) | 0.0008 | [0.0006, 0.0011] | 0.000 |
-| 0.950 | 0.0007 | [0.0005, 0.0009] | 0.000 |
-| 0.973 (accountant KDE antimode) | 0.0003 | [0.0002, 0.0004] | 0.000 |
-| 0.979 (accountant Beta-2) | 0.0002 | [0.0001, 0.0004] | 0.000 |
+<!-- TABLE X: Cosine Threshold Sweep — FAR Against 50,000 Inter-CPA Negative Pairs
+| Threshold | FAR | FAR 95% Wilson CI |
+|-----------|-----|-------------------|
+| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] |
+| 0.900 | 0.0233 | [0.0221, 0.0247] |
+| 0.945 (2D GMM marginal) | 0.0008 | [0.0006, 0.0011] |
+| 0.950 | 0.0007 | [0.0005, 0.0009] |
+| 0.973 (accountant KDE antimode) | 0.0003 | [0.0002, 0.0004] |
+| 0.979 (accountant Beta-2) | 0.0002 | [0.0001, 0.0004] |
 
+Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
 -->
 
 Two caveats apply.
-First, the gold-positive anchor is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
-Zero FRR against this subset does not establish zero FRR against the broader positive class, and the reported FRR should therefore be interpreted as a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable miss rate.
+First, the byte-identical positive anchor referenced above is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
+A would-be FRR computed against this subset is definitionally $0$ at every threshold below $1$ (since byte-identical pairs have cosine $\approx 1$), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
 Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
 The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.
 
@@ -371,6 +376,8 @@ We note that this test uses the calibrated classifier of Section III-L rather th
 
 Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
 The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
+We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-L: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
+Document-level rates therefore bound the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-H.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
 
 <!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
 | Verdict | N (PDFs) | % | Firm A | Firm A % |