9d19ca5a31
Major fixes per codex (gpt-5.4) review: ## Structural fixes - Fixed three-method convergence overclaim: added Script 20 to run KDE antimode, BD/McCrary, and Beta mixture EM on accountant-level means. Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979, LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at accountant level (consistent with smooth clustering, not sharp discontinuity). - Disambiguated Method 1: KDE crossover (between two labeled distributions, used at signature all-pairs level) vs KDE antimode (single-distribution local minimum, used at accountant level). - Addressed Firm A circular validation: Script 21 adds CPA-level 70/30 held-out fold. Calibration thresholds derived from 70% only; heldout rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61% [93.21%-93.98%]). - Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10 signatures (9 CPAs excluded for insufficient sample). Reconciled across intro, results, discussion, conclusion. - Added document-level classification aggregation rule (worst-case signature label determines document label). ## Pixel-identity validation strengthened - Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces the original n=35 same-CPA low-similarity negative which had untenable Wilson CIs). - Added Wilson 95% CI for every FAR in Table X. - Proper EER interpolation (FAR=FRR point) in Table X. - Softened "conservative recall" claim to "non-generalizable subset" language per codex feedback (byte-identical positives are a subset, not a representative positive class). - Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913. ## Terminology & sentence-level fixes - "statistically independent methods" -> "methodologically distinct methods" throughout (three diagnostics on the same sample are not independent). - "formal bimodality check" -> "unimodality test" (dip test tests H0 of unimodality; rejection is consistent with but not a direct test of bimodality). - "Firm A near-universally non-hand-signed" -> already corrected to "replication-dominated" in prior commit; this commit strengthens that framing with explicit held-out validation. - "discrete-behavior regimes" -> "clustered accountant-level heterogeneity" (BD/McCrary non-transition at accountant level rules out sharp discrete boundaries; the defensible claim is clustered-but-smooth). - Softened White 1982 quasi-MLE claim (no longer framed as a guarantee). - Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP or YOLO FN). - Unified "310 byte-identical signatures" language across Abstract, Results, Discussion (previously alternated between pairs/signatures). - Defined min_dhash_independent explicitly in Section III-G. - Fixed table numbering (Table XI heldout added, classification moved to XII, ablation to XIII). - Explained 84,386 vs 85,042 gap (656 docs have only one signature, no pairwise stat). - Made Table IX explicitly a "consistency check" not "validation"; paired it with Table XI held-out rates as the genuine external check. - Defined 0.941 threshold (calibration-fold Firm A cosine P5). - Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated. - Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923). ## New artifacts - Script 20: accountant-level three-method threshold analysis - Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30) - paper/codex_review_gpt54_v3.md: preserved review feedback Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1 markdown sources). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
313 lines
25 KiB
Markdown
313 lines
25 KiB
Markdown
# IV. Experiments and Results
|
||
|
||
## A. Experimental Setup
|
||
|
||
All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration.
|
||
Feature extraction used PyTorch 2.9 with torchvision model implementations.
|
||
The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
|
||
|
||
## B. Signature Detection Performance
|
||
|
||
The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
|
||
We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
|
||
However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
|
||
The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.
|
||
|
||
<!-- TABLE III: Extraction Results
|
||
| Metric | Value |
|
||
|--------|-------|
|
||
| Documents processed | 86,071 |
|
||
| Documents with detections | 85,042 (98.8%) |
|
||
| Total signatures extracted | 182,328 |
|
||
| Avg. signatures per document | 2.14 |
|
||
| CPA-matched signatures | 168,755 (92.6%) |
|
||
| Processing rate | 43.1 docs/sec |
|
||
-->
|
||
|
||
## C. All-Pairs Intra-vs-Inter Class Distribution Analysis
|
||
|
||
Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
|
||
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
|
||
Table IV summarizes the distributional statistics.
|
||
|
||
<!-- TABLE IV: Cosine Similarity Distribution Statistics
|
||
| Statistic | Intra-class | Inter-class |
|
||
|-----------|-------------|-------------|
|
||
| N (pairs) | 41,352,824 | 500,000 |
|
||
| Mean | 0.821 | 0.758 |
|
||
| Std. Dev. | 0.098 | 0.090 |
|
||
| Median | 0.836 | 0.774 |
|
||
| Skewness | −0.711 | −0.851 |
|
||
| Kurtosis | 0.550 | 1.027 |
|
||
-->
|
||
|
||
Both distributions are left-skewed and leptokurtic.
|
||
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
|
||
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
|
||
|
||
The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
|
||
Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
|
||
Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney $p < 0.001$, K-S 2-sample $p < 0.001$).
|
||
|
||
We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
|
||
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
|
||
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
|
||
|
||
## D. Hartigan Dip Test: Unimodality at the Signature Level
|
||
|
||
Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
|
||
|
||
<!-- TABLE V: Hartigan Dip Test Results
|
||
| Distribution | N | dip | p-value | Verdict (α=0.05) |
|
||
|--------------|---|-----|---------|------------------|
|
||
| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal |
|
||
| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
|
||
| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
|
||
| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
|
||
| Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal |
|
||
| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
|
||
-->
|
||
|
||
Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews.
|
||
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
|
||
At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.
|
||
|
||
This asymmetry between signature level and accountant level is itself an empirical finding.
|
||
It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
|
||
|
||
### 1) Burgstahler-Dichev / McCrary Discontinuity
|
||
|
||
Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a single significant transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample.
|
||
We note that the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
|
||
In contrast, the dHash transition at distance 2 is a substantively meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication.
|
||
At the accountant level the test does not produce a significant $Z^- \rightarrow Z^+$ transition in either the cosine-mean or the dHash-mean distribution (Section IV-E), reflecting that accountant aggregates are smooth at the bin resolution the test requires rather than exhibiting a sharp density discontinuity.
|
||
|
||
### 2) Beta Mixture at Signature Level: A Forced Fit
|
||
|
||
Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
|
||
For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
|
||
Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
|
||
Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
|
||
|
||
The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
|
||
Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.
|
||
This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual *behavior* (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks.
|
||
|
||
## E. Accountant-Level Gaussian Mixture
|
||
|
||
We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
|
||
BIC selects $K^* = 3$ (Table VI).
|
||
|
||
<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)
|
||
| K | BIC | AIC | Converged |
|
||
|---|-----|-----|-----------|
|
||
| 1 | −316 | −339 | ✓ |
|
||
| 2 | −545 | −595 | ✓ |
|
||
| 3 | **−792** | **−869** | ✓ (best) |
|
||
| 4 | −779 | −883 | ✓ |
|
||
| 5 | −747 | −879 | ✓ |
|
||
-->
|
||
|
||
Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.
|
||
|
||
<!-- TABLE VII: Accountant-Level 3-Component GMM
|
||
| Comp. | cos_mean | dHash_mean | weight | n | Dominant firms |
|
||
|-------|----------|------------|--------|---|----------------|
|
||
| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) |
|
||
| C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) |
|
||
| C3 (hand-signed tendency) | 0.928 | 11.17 | 0.28 | 184 | smaller domestic firms |
|
||
-->
|
||
|
||
Three empirical findings stand out.
|
||
First, of the 180 CPAs in the Firm A registry, 171 have $\geq 10$ signatures and therefore enter the accountant-level GMM (the remaining 9 have too few signatures for reliable aggregates and are excluded from this analysis only).
|
||
Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2.
|
||
This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
|
||
Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
|
||
Third, applying the three-method framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in Table VIII: KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary test does not produce a significant transition at the accountant level.
|
||
For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.
|
||
|
||
<!-- TABLE VIII-acct: Accountant-Level Three-Method Threshold Summary
|
||
| Level / method | Cosine threshold | dHash threshold |
|
||
|----------------|-------------------|------------------|
|
||
| Method 1 (KDE antimode) | 0.973 | 4.07 |
|
||
| Method 2 (BD/McCrary) | no transition | no transition |
|
||
| Method 3 (Beta-2 EM crossing) | 0.979 | 3.41 |
|
||
| Method 3' (logit-GMM-2 crossing) | 0.976 | 3.93 |
|
||
| 2D GMM 2-comp marginal crossing | 0.945 | 8.10 |
|
||
-->
|
||
|
||
Table VIII then summarizes all threshold estimates produced by the three methods across the two analysis levels for a compact cross-level comparison.
|
||
|
||
<!-- TABLE VIII: Threshold Convergence Summary Across Levels
|
||
| Level / method | Cosine threshold | dHash threshold |
|
||
|----------------|-------------------|------------------|
|
||
| Signature-level, all-pairs KDE crossover | 0.837 | — |
|
||
| Signature-level, BD/McCrary transition | 0.985 | 2.0 |
|
||
| Signature-level, Beta-2 EM crossing (Firm A) | 0.977 | — |
|
||
| Signature-level, logit-GMM-2 crossing (Full) | 0.980 | — |
|
||
| Accountant-level, KDE antimode | **0.973** | **4.07** |
|
||
| Accountant-level, BD/McCrary transition | no transition | no transition |
|
||
| Accountant-level, Beta-2 EM crossing | **0.979** | **3.41** |
|
||
| Accountant-level, logit-GMM-2 crossing | **0.976** | **3.93** |
|
||
| Accountant-level, 2D-GMM 2-comp marginal crossing | 0.945 | 8.10 |
|
||
| Firm A calibration-fold cosine P5 | 0.941 | — |
|
||
| Firm A calibration-fold dHash P95 | — | 9 |
|
||
| Firm A calibration-fold dHash median | — | 2 |
|
||
-->
|
||
|
||
Methods 1 and 3 (KDE antimode, Beta-2 crossing, and its logit-GMM robustness check) converge at the accountant level to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$, while Method 2 (BD/McCrary) does not produce a significant discontinuity.
|
||
This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
|
||
The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
|
||
|
||
## F. Calibration Validation with Firm A
|
||
|
||
Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
|
||
Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).
|
||
|
||
<!-- TABLE IX: Firm A Whole-Sample Capture Rates (consistency check, NOT external validation)
|
||
| Rule | Firm A rate | n / N |
|
||
|------|-------------|-------|
|
||
| cosine > 0.837 (all-pairs KDE crossover) | 99.93% | 60,405 / 60,448 |
|
||
| cosine > 0.941 (calibration-fold P5) | 95.08% | 57,473 / 60,448 |
|
||
| cosine > 0.945 (2D GMM marginal crossing) | 94.52% | 57,131 / 60,448 |
|
||
| cosine > 0.95 | 92.51% | 55,916 / 60,448 |
|
||
| cosine > 0.973 (accountant KDE antimode) | 80.91% | 48,910 / 60,448 |
|
||
| dHash_indep ≤ 5 (calib-fold median-adjacent) | 84.20% | 50,897 / 60,448 |
|
||
| dHash_indep ≤ 8 | 95.17% | 57,521 / 60,448 |
|
||
| dHash_indep ≤ 15 | 99.83% | 60,345 / 60,448 |
|
||
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | 54,373 / 60,448 |
|
||
|
||
All rates computed exactly from the full Firm A sample (N = 60,448 signatures).
|
||
The threshold 0.941 corresponds to the 5th percentile of the calibration-fold Firm A cosine distribution (see Section IV-G for the held-out validation that addresses the circularity inherent in this whole-sample table).
|
||
-->
|
||
|
||
Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H.
|
||
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix.
|
||
Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
|
||
|
||
## G. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
|
||
|
||
We report three validation analyses corresponding to the anchors of Section III-K.
|
||
|
||
### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor
|
||
|
||
Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
|
||
As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
|
||
Table X reports precision, recall, $F_1$, FAR with Wilson 95% confidence intervals, and FRR at each candidate threshold.
|
||
The Equal-Error-Rate point, interpolated at FAR $=$ FRR, is located at cosine $= 0.990$ with EER $\approx 0$, which is trivially small because pixel-identical positives are all at cosine very close to 1.
|
||
|
||
<!-- TABLE X: Cosine Threshold Sweep (positives = 310 pixel-identical signatures; negatives = 50,000 inter-CPA pairs)
|
||
| Threshold | Precision | Recall | F1 | FAR | FAR 95% Wilson CI | FRR |
|
||
|-----------|-----------|--------|----|-----|-------------------|-----|
|
||
| 0.837 (all-pairs KDE crossover) | 0.029 | 1.000 | 0.056 | 0.2062 | [0.2027, 0.2098] | 0.000 |
|
||
| 0.900 | 0.210 | 1.000 | 0.347 | 0.0233 | [0.0221, 0.0247] | 0.000 |
|
||
| 0.945 (2D GMM marginal) | 0.883 | 1.000 | 0.938 | 0.0008 | [0.0006, 0.0011] | 0.000 |
|
||
| 0.950 | 0.904 | 1.000 | 0.950 | 0.0007 | [0.0005, 0.0009] | 0.000 |
|
||
| 0.973 (accountant KDE antimode) | 0.960 | 1.000 | 0.980 | 0.0003 | [0.0002, 0.0004] | 0.000 |
|
||
| 0.979 (accountant Beta-2) | 0.969 | 1.000 | 0.984 | 0.0002 | [0.0001, 0.0004] | 0.000 |
|
||
-->
|
||
|
||
Two caveats apply.
|
||
First, the gold-positive anchor is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
|
||
Perfect recall against this subset does not establish perfect recall against the broader positive class, and the reported recall should therefore be interpreted as a lower-bound calibration check rather than a generalizable recall estimate.
|
||
Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
|
||
The very low FAR at the accountant-level thresholds is therefore informative.
|
||
|
||
### 2) Held-Out Firm A Validation (breaks calibration-validation circularity)
|
||
|
||
We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
|
||
Thresholds are re-derived from calibration-fold percentiles only.
|
||
Table XI reports heldout-fold capture rates with Wilson 95% confidence intervals.
|
||
|
||
<!-- TABLE XI: Held-Out Firm A Capture Rates (30% fold, N = 15,332 signatures)
|
||
| Rule | Capture rate | Wilson 95% CI | k / n |
|
||
|------|--------------|---------------|-------|
|
||
| cosine > 0.837 | 99.93% | [99.87%, 99.96%] | 15,321 / 15,332 |
|
||
| cosine > 0.945 (2D GMM marginal) | 94.78% | [94.41%, 95.12%] | 14,532 / 15,332 |
|
||
| cosine > 0.950 | 93.61% | [93.21%, 93.98%] | 14,353 / 15,332 |
|
||
| cosine > 0.9407 (calib-fold P5) | 95.64% | [95.31%, 95.95%] | 14,664 / 15,332 |
|
||
| dHash_indep ≤ 5 | 87.84% | [87.31%, 88.34%] | 13,469 / 15,332 |
|
||
| dHash_indep ≤ 8 | 96.13% | [95.82%, 96.43%] | 14,739 / 15,332 |
|
||
| dHash_indep ≤ 9 (calib-fold P95) | 97.48% | [97.22%, 97.71%] | 14,942 / 15,332 |
|
||
| dHash_indep ≤ 15 | 99.84% | [99.77%, 99.89%] | 15,308 / 15,332 |
|
||
| cosine > 0.95 AND dHash_indep ≤ 8 | 91.54% | [91.09%, 91.97%] | 14,035 / 15,332 |
|
||
|
||
Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash median = 2, P95 = 9.
|
||
-->
|
||
|
||
The held-out rates match the whole-sample rates of Table IX within each rule's Wilson confidence interval, confirming that the calibration-derived thresholds generalize to Firm A CPAs that did not contribute to calibration.
|
||
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 91.54% [91.09%, 91.97%] of the held-out Firm A population, consistent with Firm A's interview-reported signing mix and with the replication-dominated framing of Section III-H.
|
||
|
||
### 3) Sanity Sample
|
||
|
||
A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics.
|
||
|
||
## H. Classification Results
|
||
|
||
Table XII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
|
||
The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
|
||
|
||
<!-- TABLE XII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
|
||
| Verdict | N (PDFs) | % | Firm A | Firm A % |
|
||
|---------|----------|---|--------|----------|
|
||
| High-confidence non-hand-signed | 29,529 | 35.0% | 22,970 | 76.0% |
|
||
| Moderate-confidence non-hand-signed | 36,994 | 43.8% | 6,311 | 20.9% |
|
||
| High style consistency | 5,133 | 6.1% | 183 | 0.6% |
|
||
| Uncertain | 12,683 | 15.0% | 758 | 2.5% |
|
||
| Likely hand-signed | 47 | 0.1% | 4 | 0.0% |
|
||
|
||
Per the worst-case aggregation rule of Section III-L, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
|
||
-->
|
||
|
||
Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
|
||
29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash $\leq 5$);
|
||
36,994 (51.7%) show partial structural similarity (dHash in $[6, 15]$) consistent with replication degraded by scan variations;
|
||
and 5,133 (7.2%) show no structural corroboration (dHash $> 15$), suggesting high signing consistency rather than image reproduction.
|
||
A cosine-only classifier would treat all 71,656 identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
|
||
|
||
### 1) Firm A Capture Profile (Consistency Check)
|
||
|
||
96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
|
||
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers.
|
||
The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
|
||
We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.
|
||
|
||
### 2) Cross-Method Agreement
|
||
|
||
Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
|
||
This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
|
||
|
||
## I. Ablation Study: Feature Backbone Comparison
|
||
|
||
To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
|
||
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
|
||
Table XIII presents the comparison.
|
||
|
||
<!-- TABLE XIII: Backbone Comparison
|
||
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|
||
|--------|-----------|--------|-----------------|
|
||
| Feature dim | 2048 | 4096 | 1280 |
|
||
| Intra mean | 0.821 | 0.822 | 0.786 |
|
||
| Inter mean | 0.758 | 0.767 | 0.699 |
|
||
| Cohen's d | 0.669 | 0.564 | 0.707 |
|
||
| KDE crossover | 0.837 | 0.850 | 0.792 |
|
||
| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
|
||
| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |
|
||
|
||
Note: Firm A values in this table are computed over all intra-firm pairwise
|
||
similarities (16.0M pairs) for cross-backbone comparability. These differ from
|
||
the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
|
||
the classification-relevant statistic: the similarity of each signature to its
|
||
single closest match from the same CPA.
|
||
-->
|
||
|
||
EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
|
||
However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), resulting in lower per-sample classification confidence.
|
||
VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.
|
||
|
||
ResNet-50 provides the best overall balance:
|
||
(1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
|
||
(2) its tighter distributions yield more reliable individual classifications;
|
||
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
|
||
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.
|