pdf_signature_extraction/paper/paper_a_results_v3.md
gbanyan 9b11f03548 Paper A v3: full rewrite for IEEE Access with three-method convergence
Major changes from v2:

Terminology:
- "digitally replicated" -> "non-hand-signed" throughout (per partner v3
  feedback and to avoid implicit accusation)
- "Firm A near-universal non-hand-signing" -> "replication-dominated"
  (per interview nuance: most but not all Firm A partners use replication)

Target journal: IEEE TAI -> IEEE Access (per NCKU CSIE list)

New methodological sections (III.G-III.L + IV.D-IV.G):
- Three convergent threshold methods (KDE antimode + Hartigan dip test /
  Burgstahler-Dichev McCrary / EM-fitted Beta mixture + logit-GMM
  robustness check)
- Explicit unit-of-analysis discussion (signature vs accountant)
- Accountant-level 2D Gaussian mixture (BIC-best K=3 found empirically)
- Pixel-identity validation anchor (no manual annotation needed)
- Low-similarity negative anchor + Firm A replication-dominated anchor

New empirical findings integrated:
- Firm A signature cosine UNIMODAL (dip p=0.17) - long left tail = minority
  hand-signers
- Full-sample cosine MULTIMODAL but not cleanly bimodal (BIC prefers 3-comp
  mixture) - signature-level is continuous quality spectrum
- Accountant-level mixture trimodal (C1 Deloitte-heavy 139/141,
  C2 other Big-4, C3 smaller firms). 2-comp crossings cos=0.945, dh=8.10
- Pixel-identity anchor (310 pairs) gives perfect recall at all cosine
  thresholds
- Firm A anchor rates: cos>0.95=92.5%, dual-rule cos>0.95 AND dh<=8=89.95%

New discussion section V.B: "Continuous-quality spectrum vs discrete-
behavior regimes" - the core interpretive contribution of v3.

References added: Hartigan & Hartigan 1985, Burgstahler & Dichev 1997,
McCrary 2008, Dempster-Laird-Rubin 1977, White 1982 (refs 37-41).

export_v3.py builds Paper_A_IEEE_Access_Draft_v3.docx (462 KB, +40% vs v2
from expanded methodology + results sections).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 00:14:47 +08:00


IV. Experiments and Results

A. Experimental Setup

All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration. Feature extraction used PyTorch 2.9 with torchvision model implementations. The complete pipeline---from raw PDF processing through final classification---was implemented in Python.

B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

C. Signature-Level Distribution Analysis

Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs. Table IV summarizes the distributional statistics.

Both distributions are left-skewed and leptokurtic. Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
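The normality rejection and the descriptive AIC comparison can be reproduced in miniature as below. The sample here is synthetic (a reflected lognormal standing in for the left-skewed intra-class cosines); the distribution parameters are illustrative, not the paper's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic left-skewed similarity-like sample (reflected lognormal), n = 500.
x = 1.0 - rng.lognormal(mean=-2.5, sigma=0.6, size=500)

# Normality tests as in the text; both reject for skewed data.
_, p_sw = stats.shapiro(x)
_, p_ks = stats.kstest(x, 'norm', args=(x.mean(), x.std()))

# Descriptive family comparison via AIC = 2k - 2*loglik (lower is better).
mu, sd = stats.norm.fit(x)
aic_norm = 2 * 2 - 2 * np.sum(stats.norm.logpdf(x, mu, sd))
s, loc, scale = stats.lognorm.fit(1.0 - x, floc=0)     # lognormal fits the reflected sample
aic_lognorm = 2 * 2 - 2 * np.sum(stats.lognorm.logpdf(1.0 - x, s, loc, scale))
```

Note the change of variables x -> 1 - x has unit Jacobian, so the two log-likelihoods (and hence AICs) are directly comparable.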

The KDE crossover---the point where the two density functions intersect---was located at 0.837 (Table V). Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes. Statistical tests confirmed significant separation between the two distributions (Cohen's d = 0.669, Mann-Whitney p < 0.001, K-S 2-sample p < 0.001).
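The crossover computation itself is straightforward: fit one KDE per class and locate the sign change of their density difference between the two modes. The data below are synthetic stand-ins with a known crossover at 0.80.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
intra = rng.normal(0.90, 0.05, 2000)    # stand-in for same-CPA pair similarities
inter = rng.normal(0.70, 0.05, 2000)    # stand-in for different-CPA pair similarities

kde_intra, kde_inter = gaussian_kde(intra), gaussian_kde(inter)
grid = np.linspace(0.70, 0.90, 2001)    # search between the two modes
diff = kde_intra(grid) - kde_inter(grid)
# With equal spreads and weights there is a single crossing, here near 0.80.
cross = grid[np.argmin(np.abs(diff))]
```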

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength. We therefore rely primarily on Cohen's d as an effect-size measure that is less sensitive to sample size. A Cohen's d of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
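Cohen's d, unlike the p-values, uses only the means and pooled spread, which is why pair-count inflation does not affect it. A pooled-standard-deviation implementation:

```python
import numpy as np

def cohens_d(a, b):
    """Pooled-standard-deviation effect size; insensitive to pair-count inflation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
d = cohens_d(a + 2.0, a)   # mean shift of 2 against pooled sd ~1.58 gives d ~1.26
```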

D. Hartigan Dip Test: Unimodality at the Signature Level

Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).

Firm A's per-signature cosine distribution is consistent with unimodality (dip p = 0.17; unimodality not rejected), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews. The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is multimodal (p < 0.001). At the per-accountant aggregate level, both the cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.

This asymmetry between signature level and accountant level is itself an empirical finding. It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.

1) Burgstahler-Dichev / McCrary Discontinuity

Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for both Firm A and the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 in both cases. We note that the cosine transition at 0.985 lies inside the non-hand-signed mode rather than at the separation from the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal. In contrast, the dHash transition at distance 2 is a meaningful structural boundary, corresponding to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication.
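A simplified version of the Burgstahler-Dichev neighbor-comparison statistic illustrates how such a transition is located (the paper's actual test is specified in Section III-I.2; the counts and the variance approximation here are illustrative):

```python
import numpy as np

def bin_discontinuity_z(counts):
    """Standardized difference between each interior bin and the mean of its two
    neighbors -- a simplified Burgstahler-Dichev-style discontinuity statistic."""
    n = np.asarray(counts, dtype=float)
    expected = 0.5 * (n[:-2] + n[2:])                  # neighbor average for interior bins
    z = (n[1:-1] - expected) / np.sqrt(np.maximum(n[1:-1], 1.0))
    return z                                           # z[i] corresponds to bin i + 1

counts = [50, 52, 48, 51, 49, 200, 205, 198]           # toy histogram with a jump at bin 5
z = bin_discontinuity_z(counts)
jump = int(np.argmax(np.abs(z))) + 1                   # interior bin with the largest |z|
```

The flagged bin sits at the edge of the jump; the transition is read as the boundary between that bin and its high-count neighbor.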

2) Beta Mixture at Signature Level: A Forced Fit

Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check. For the full-sample cosine the 3-component fit is likewise strongly preferred (\Delta\text{BIC} = 10{,}175). Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data. Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
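The logit-GMM robustness check can be sketched as follows: map the (0, 1) similarities through the logit, fit Gaussian mixtures on the transformed scale, and compare BIC across component counts. The two synthetic regimes below (modes near 0.30 and 0.90) are illustrative stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two synthetic similarity regimes on (0, 1), standing in for hand-signed vs replicated.
cos = np.concatenate([
    1 / (1 + np.exp(-rng.normal(-0.85, 0.3, 500))),    # mode near 0.30
    1 / (1 + np.exp(-rng.normal(2.20, 0.3, 500))),     # mode near 0.90
])
z = np.log(cos / (1 - cos)).reshape(-1, 1)             # logit transform to the real line

fits = {k: GaussianMixture(n_components=k, random_state=0).fit(z) for k in (2, 3)}
bic = {k: m.bic(z) for k, m in fits.items()}           # lower BIC = preferred model
means = np.sort(1 / (1 + np.exp(-fits[2].means_.ravel())))  # component means back on (0, 1)
```

On cleanly bimodal data like this, the 2-component fit recovers the modes and BIC prefers it; the paper's finding is precisely that the real per-signature data do not behave this way.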

The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: at the per-signature level, no two-mechanism mixture explains the data. Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing. This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual behavior (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks.

E. Accountant-Level Gaussian Mixture

We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}. BIC selects K^* = 3 (Table VI).

Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.

Three empirical findings stand out. First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster. The remaining 41 Firm A CPAs fall into C2, consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D. Second, the three-component partition is not a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3. Third, the 2-component fit used for threshold derivation yields marginal-density crossings at cosine = 0.945 and dHash = 8.10; these are the natural per-accountant thresholds.
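The marginal-density crossing of a 2-component fit reduces to finding where the weighted component densities are equal. A sketch on one marginal, with made-up component parameters (not the paper's fitted values):

```python
import numpy as np
from scipy.stats import norm

# Illustrative 2-component marginal for the per-accountant cosine mean;
# weights, means, and sds are invented stand-ins, not the paper's fit.
w = (0.4, 0.6)
comp = (norm(0.88, 0.04), norm(0.97, 0.015))

grid = np.linspace(0.88, 0.97, 9001)               # search between the component means
d0 = w[0] * comp[0].pdf(grid)
d1 = w[1] * comp[1].pdf(grid)
# The per-accountant threshold is the point where the weighted densities cross.
crossing = grid[np.argmin(np.abs(d0 - d1))]
```

For these illustrative parameters the crossing lands near 0.937; in the paper the same construction on the fitted mixture yields the reported 0.945 (cosine) and 8.10 (dHash).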

Table VIII summarizes the threshold estimates produced by the three convergent methods at each analysis level.

The accountant-level two-component crossing (cosine = 0.945, dHash = 8.10) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture. The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum / discrete-behavior asymmetry rather than as primary classification boundaries.

F. Calibration Validation with Firm A

Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).

The Firm A anchor validation is consistent with the replication-dominated framing throughout: the most permissive cosine threshold (the KDE crossover at 0.837) captures nearly all Firm A signatures, while the more stringent thresholds progressively filter out the minority of hand-signing Firm A partners in the left tail. The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix.
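The dual-rule capture rate is a simple boolean conjunction over the per-signature measurements. A toy example (seven made-up signatures; the real computation runs over all Firm A signatures):

```python
import numpy as np

# Toy per-signature measurements; values are illustrative.
cos = np.array([0.99, 0.97, 0.96, 0.93, 0.88, 0.99, 0.97])
dh  = np.array([0,    3,    12,   2,    20,   7,    5])

dual = (cos > 0.95) & (dh <= 8)    # the dual rule: cosine > 0.95 AND dHash <= 8
rate = dual.mean()                 # fraction of the anchor population captured
```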

G. Pixel-Identity Validation

Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1). These serve as the gold-positive anchor of Section III-K. Using signatures with cosine < 0.70 (n = 35) as the gold-negative anchor, we derive Equal-Error-Rate points and classification metrics for the canonical thresholds (Table X).

All cosine thresholds achieve perfect classification of the pixel-identical anchor against the low-similarity anchor, which is unsurprising given the complete separation between the two anchor populations. The dHash thresholds trade precision for recall along the expected tradeoff. We emphasize that because the gold-positive anchor is a subset of the true non-hand-signing positives (only those that happen to be pixel-identical to their nearest match), recall against this anchor is conservative by construction: the classifier additionally flags many non-pixel-identical replications (low dHash but not zero) that the anchor cannot by itself validate. The negative-anchor population (n = 35) is likewise small because intra-CPA pairs rarely fall below cosine 0.70, so the reported FAR values should be read as order-of-magnitude rather than tight estimates.
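The Equal-Error-Rate points are derived by sweeping thresholds and locating where the false-reject rate on the gold-positive anchor meets the false-accept rate on the gold-negative anchor. A sketch with illustrative anchor scores (completely separated, so EER = 0, mirroring the perfect-classification result above):

```python
import numpy as np

def eer_point(pos, neg):
    """Sweep candidate thresholds; return (threshold, rate) where the false-reject
    rate on gold positives meets the false-accept rate on gold negatives."""
    ts = np.unique(np.concatenate([pos, neg]))
    frr = np.array([(pos < t).mean() for t in ts])    # anchored positives rejected
    far = np.array([(neg >= t).mean() for t in ts])   # anchored negatives accepted
    i = int(np.argmin(np.abs(frr - far)))
    return ts[i], max(frr[i], far[i])

pos = np.array([0.97, 0.98, 0.99, 1.00])   # stand-in pixel-identical anchor similarities
neg = np.array([0.40, 0.55, 0.62, 0.68])   # stand-in low-similarity (< 0.70) anchor
t, eer = eer_point(pos, neg)               # complete separation: EER = 0 at t = 0.97
```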

A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics.

H. Classification Results

Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents.

The dHash dimension stratifies the 71,656 documents exceeding cosine 0.95 into three distinct populations: 29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash \leq 5); 36,994 (51.6%) show partial structural similarity (dHash in [6, 15]) consistent with replication degraded by scan variations; and 5,133 (7.2%) show no structural corroboration (dHash > 15), suggesting high signing consistency rather than image reproduction. A cosine-only classifier would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations.
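The stratification logic is a cascade of threshold conditions. A sketch with toy inputs; the category labels here are shorthand for this example, not the paper's exact Table XI names:

```python
import numpy as np

# Toy (cosine, dHash) pairs; thresholds follow the stratification described above.
cos = np.array([0.99, 0.97, 0.96, 0.99, 0.90])
dh  = np.array([3,    10,   22,   5,    4])

labels = np.select(
    [(cos > 0.95) & (dh <= 5),                 # converging structural evidence
     (cos > 0.95) & (dh >= 6) & (dh <= 15),    # partial structural similarity
     (cos > 0.95) & (dh > 15)],                # no structural corroboration
    ["structural non-hand-signed", "partial structural", "style-consistency only"],
    default="below cosine threshold",
)
counts = dict(zip(*np.unique(labels, return_counts=True)))
```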

1) Firm A Validation

96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain. This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers. The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.

2) Cross-Method Agreement

Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer. This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).

I. Ablation Study: Feature Backbone Comparison

To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. Table XII presents the comparison.

EfficientNet-B0 achieves the highest Cohen's d (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions. However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), resulting in lower per-sample classification confidence. VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.

ResNet-50 provides the best overall balance: (1) Cohen's d of 0.669 is competitive with EfficientNet-B0's 0.707; (2) its tighter distributions yield more reliable individual classifications; (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.