pdf_signature_extraction/paper/paper_a_results_v3.md

# IV. Experiments and Results

## A. Experimental Setup

All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration.
Feature extraction used PyTorch 2.9 with torchvision model implementations.
The complete pipeline---from raw PDF processing through final classification---was implemented in Python.

## B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

<!-- TABLE III: Extraction Results
| Metric | Value |
|--------|-------|
| Documents processed | 86,071 |
| Documents with detections | 85,042 (98.8%) |
| Total signatures extracted | 182,328 |
| Avg. signatures per document | 2.14 |
| CPA-matched signatures | 168,755 (92.6%) |
| Processing rate | 43.1 docs/sec |
-->

## C. Signature-Level Distribution Analysis

Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs.
Table IV summarizes the distributional statistics.

<!-- TABLE IV: Cosine Similarity Distribution Statistics
| Statistic | Intra-class | Inter-class |
|-----------|-------------|-------------|
| N (pairs) | 41,352,824 | 500,000 |
| Mean | 0.821 | 0.758 |
| Std. Dev. | 0.098 | 0.090 |
| Median | 0.836 | 0.774 |
| Skewness | −0.711 | −0.851 |
| Kurtosis | 0.550 | 1.027 |
-->

Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.

The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney $p < 0.001$, K-S 2-sample $p < 0.001$).

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.

## D. Hartigan Dip Test: Unimodality at the Signature Level

Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).

<!-- TABLE V: Hartigan Dip Test Results
| Distribution | N | dip | p-value | Verdict (α=0.05) |
|--------------|---|-----|---------|------------------|
| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal |
| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
| Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal |
| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
-->

Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews.
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.

This asymmetry between signature level and accountant level is itself an empirical finding.
It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.

### 1) Burgstahler-Dichev / McCrary Discontinuity

Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample.
We note that the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation with the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
In contrast, the dHash transition at distance 2 is a meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication.

### 2) Beta Mixture at Signature Level: A Forced Fit

Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.

The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.
This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual *behavior* (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks.

## E. Accountant-Level Gaussian Mixture

We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
BIC selects $K^* = 3$ (Table VI).

<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)
| K | BIC | AIC | Converged |
|---|-----|-----|-----------|
| 1 | −316 | −339 | ✓ |
| 2 | −545 | −595 | ✓ |
| 3 | **−792** | **−869** | ✓  (best) |
| 4 | −779 | −883 | ✓ |
| 5 | −747 | −879 | ✓ |
-->

Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.

<!-- TABLE VII: Accountant-Level 3-Component GMM
| Comp. | cos_mean | dHash_mean | weight | n | Dominant firms |
|-------|----------|------------|--------|---|----------------|
| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) |
| C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) |
| C3 (hand-signed tendency) | 0.928 | 11.17 | 0.28 | 184 | smaller domestic firms |
-->

Three empirical findings stand out.
First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster.
The remaining 32 Firm A CPAs fall into C2, consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
Third, the 2-component fit used for threshold derivation yields marginal-density crossings at cosine $= 0.945$ and dHash $= 8.10$; these are the natural per-accountant thresholds.

Table VIII summarizes the threshold estimates produced by the three convergent methods at each analysis level.

<!-- TABLE VIII: Threshold Convergence Summary
| Level / method | Cosine threshold | dHash threshold |
|----------------|-------------------|------------------|
| Signature-level KDE crossover | 0.837 | — |
| Signature-level BD/McCrary transition | 0.985 | 2.0 |
| Signature-level Beta 2-comp (Firm A) | 0.977 | — |
| Signature-level LogGMM 2-comp (Full) | 0.980 | — |
| Accountant-level 2-comp GMM crossing | **0.945** | **8.10** |
| Firm A P95 (median/95th pct calibration) | 0.95 | 15 |
| Firm A median calibration | — | 5 |
-->

The accountant-level two-component crossing (cosine $= 0.945$, dHash $= 8.10$) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture.
The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum / discrete-behavior asymmetry rather than as primary classification boundaries.

## F. Calibration Validation with Firm A

Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).

<!-- TABLE IX: Firm A Anchor Rates Across Candidate Thresholds
| Rule | Firm A rate |
|------|-------------|
| cosine > 0.837 | 99.93% |
| cosine > 0.941 | 95.08% |
| cosine > 0.945 (accountant 2-comp) | 94.5%† |
| cosine > 0.95 | 92.51% |
| dHash_indep ≤ 5 | 84.20% |
| dHash_indep ≤ 8 | 95.17% |
| dHash_indep ≤ 15 | 99.83% |
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% |

† interpolated from adjacent rates; all other rates computed exactly.
-->

The Firm A anchor validation is consistent with the replication-dominated framing throughout: the most permissive cosine threshold (the KDE crossover at 0.837) captures nearly all Firm A signatures, while the more stringent thresholds progressively filter out the minority of hand-signing Firm A partners in the left tail.
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix.

## G. Pixel-Identity Validation

Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1).
These serve as the gold-positive anchor of Section III-K.
Using signatures with cosine $< 0.70$ ($n = 35$) as the gold-negative anchor, we derive Equal-Error-Rate points and classification metrics for the canonical thresholds (Table X).

<!-- TABLE X: Pixel-Identity Validation Metrics
| Indicator | Threshold | Precision | Recall | F1 | FAR | FRR |
|-----------|-----------|-----------|--------|----|-----|-----|
| cosine > 0.837 | KDE crossover | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| cosine > 0.945 | Accountant crossing | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| cosine > 0.95 | Canonical | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 |
| dHash_indep ≤ 8 | Accountant crossing | 0.966 | 1.000 | 0.983 | 0.314 | 0.000 |
| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 |
-->

All cosine thresholds achieve perfect classification of the pixel-identical anchor against the low-similarity anchor, which is unsurprising given the complete separation between the two anchor populations.
The dHash thresholds trade precision for recall along the expected tradeoff.
We emphasize that because the gold-positive anchor is a *subset* of the true non-hand-signing positives (only those that happen to be pixel-identical to their nearest match), recall against this anchor is conservative by construction: the classifier additionally flags many non-pixel-identical replications (low dHash but not zero) that the anchor cannot by itself validate.
The negative-anchor population ($n = 35$) is likewise small because intra-CPA pairs rarely fall below cosine 0.70, so the reported FAR values should be read as order-of-magnitude rather than tight estimates.

A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics.

## H. Classification Results

Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents.

<!-- TABLE XI: Classification Results (Dual-Method: Cosine + dHash)
| Verdict | N (PDFs) | % | Firm A | Firm A % |
|---------|----------|---|--------|----------|
| High-confidence non-hand-signed | 29,529 | 35.0% | 22,970 | 76.0% |
| Moderate-confidence non-hand-signed | 36,994 | 43.8% | 6,311 | 20.9% |
| High style consistency | 5,133 | 6.1% | 183 | 0.6% |
| Uncertain | 12,683 | 15.0% | 758 | 2.5% |
| Likely hand-signed | 47 | 0.1% | 4 | 0.0% |
-->

Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash $\leq 5$);
36,994 (51.7%) show partial structural similarity (dHash in $[6, 15]$) consistent with replication degraded by scan variations;
and 5,133 (7.2%) show no structural corroboration (dHash $> 15$), suggesting high signing consistency rather than image reproduction.
A cosine-only classifier would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations.

### 1) Firm A Validation

96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers.
The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.

### 2) Cross-Method Agreement

Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).

## I. Ablation Study: Feature Backbone Comparison

To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
Table XII presents the comparison.

<!-- TABLE XII: Backbone Comparison
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|--------|-----------|--------|-----------------|
| Feature dim | 2048 | 4096 | 1280 |
| Intra mean | 0.821 | 0.822 | 0.786 |
| Inter mean | 0.758 | 0.767 | 0.699 |
| Cohen's d | 0.669 | 0.564 | 0.707 |
| KDE crossover | 0.837 | 0.850 | 0.792 |
| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |

Note: Firm A values in this table are computed over all intra-firm pairwise
similarities (16.0M pairs) for cross-backbone comparability. These differ from
the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
the classification-relevant statistic: the similarity of each signature to its
single closest match from the same CPA.
-->

EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), resulting in lower per-sample classification confidence.
VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.

ResNet-50 provides the best overall balance:
(1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
(2) its tighter distributions yield more reliable individual classifications;
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.