pdf_signature_extraction/paper/paper_a_results_v3.md
gbanyan 51d15b32a5 Paper A v3.2: partner v4 feedback integration (threshold-independent benchmark validation)
Partner v4 (signature_paper_draft_v4) proposed 3 substantive improvements;
partner confirmed the 2013-2019 restriction was an error (sample stays
2013-2023). The remaining suggestions are adopted with our own data.

## New scripts
- Script 22 (partner ranking): ranks all Big-4 auditor-years by mean
  max-cosine. Firm A occupies 95.9% of top-10% (base 27.8%), 3.5x
  concentration ratio. Stable across 2013-2023 (88-100% per year).
- Script 23 (intra-report consistency): for each 2-signer report,
  classify both signatures and check agreement. Firm A agrees 89.9%
  vs 62-67% at other Big-4. 87.5% of Firm A reports have BOTH signers
  non-hand-signed; only 4 reports (0.01%) both hand-signed.

## New methodology additions
- III-G: explicit within-auditor-year no-mixing identification
  assumption (supported by Firm A interview evidence).
- III-H: 4th Firm A validation line: threshold-independent evidence
  from partner ranking + intra-report consistency.

## New results section IV-H (threshold-independent validation)
- IV-H.1: Firm A year-by-year cosine<0.95 rate. 2013-2019 mean=8.26%,
  2020-2023 mean=6.96%, 2023 lowest (3.75%). Stability contradicts
  partner's hypothesis that 2020+ electronic systems increase
  heterogeneity -- data shows opposite (electronic systems more
  consistent than physical stamping).
- IV-H.2: partner ranking top-K tables (pooled + year-by-year).
- IV-H.3: intra-report consistency per-firm table.

## Renumbering
- Section H (was Classification Results) -> I
- Section I (was Ablation) -> J
- Tables XIII-XVI new (yearly stability, top-K pooled, top-10% per-year,
  intra-report), XVII = classification (was XII), XVIII = ablation
  (was XIII).

These threshold-independent analyses address the codex review concern
about circular validation by providing benchmark evidence that does not
depend on any threshold calibrated to Firm A itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:59:49 +08:00


IV. Experiments and Results

A. Experimental Setup

All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration. Feature extraction used PyTorch 2.9 with torchvision model implementations. The complete pipeline---from raw PDF processing through final classification---was implemented in Python.

B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document with at least one detection, consistent with the standard practice of two certifying CPAs per audit report. The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

C. All-Pairs Intra-vs-Inter Class Distribution Analysis

Fig. 2 presents the cosine similarity distributions computed over the full set of pairwise comparisons under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L). Table IV summarizes the distributional statistics.

Both distributions are left-skewed and leptokurtic. Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.

The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes. Statistical tests confirmed significant separation between the two distributions (Cohen's d = 0.669, Mann-Whitney p < 0.001, K-S 2-sample p < 0.001).
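A minimal sketch of how such a KDE crossover could be located, assuming the intra- and inter-class similarity values are available as two arrays (`intra` and `inter` are hypothetical names; the paper's exact bandwidth choice is not specified here):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_crossover(intra, inter, lo=0.0, hi=1.0, n_grid=2000):
    """Locate the point where the two class density estimates intersect.

    Searches only between the two class means so that the returned
    crossing is the boundary-relevant one, not a tail artifact.
    """
    k_intra, k_inter = gaussian_kde(intra), gaussian_kde(inter)
    grid = np.linspace(lo, hi, n_grid)
    diff = k_intra(grid) - k_inter(grid)
    m_lo, m_hi = sorted([np.mean(intra), np.mean(inter)])
    mask = (grid >= m_lo) & (grid <= m_hi)
    sign_flips = np.where(np.diff(np.sign(diff[mask])) != 0)[0]
    return grid[mask][sign_flips[0]] if len(sign_flips) else None
```

Under equal priors and costs, the density intersection approximates the Bayes boundary, which is why the grid-based sign-flip search suffices here.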

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---so the nominal pair count greatly overstates the effective sample size, rendering $p$-values computed under an independence assumption unreliable as measures of evidence strength. We therefore rely primarily on Cohen's d as an effect-size measure that is less sensitive to sample size. A Cohen's d of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
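For reference, the effect-size computation is the standard pooled-standard-deviation form of Cohen's d (a generic sketch, not the paper's exact implementation):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference scaled by the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```

Unlike a $p$-value, this statistic does not shrink toward significance as the (dependent) pair count grows, which is the property relied on above.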

D. Hartigan Dip Test: Unimodality at the Signature Level

Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).

Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews. The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is multimodal (p < 0.001). At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.

This asymmetry between signature level and accountant level is itself an empirical finding. It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.

1) Burgstahler-Dichev / McCrary Discontinuity

Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a single significant transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample. We note that the cosine transition at 0.985 lies inside the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal. In contrast, the dHash transition at distance 2 is a substantively meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication. At the accountant level the test does not produce a significant Z^- \rightarrow Z^+ transition in either the cosine-mean or the dHash-mean distribution (Section IV-E), reflecting that accountant aggregates are smooth at the bin resolution the test requires rather than exhibiting a sharp density discontinuity.
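A simplified sketch of the Burgstahler-Dichev-style standardized bin difference (actual vs. expected count from neighboring bins) is shown below; this is the BD side of the test only, not the full McCrary local-linear density estimator, and the binning is an assumption rather than the paper's calibrated bin width:

```python
import numpy as np

def standardized_bin_differences(x, bins):
    """BD-style standardized difference for each interior histogram bin.

    Expected count for bin i is the mean of its two neighbours; the
    variance approximation is N*p_i*(1-p_i)
    + (1/4)*N*(p_{i-1}+p_{i+1})*(1-p_{i-1}-p_{i+1}).
    Returns (bin_centers, Z) for interior bins.
    """
    n, edges = np.histogram(x, bins=bins)
    N = n.sum()
    p = n / N
    actual = n[1:-1]
    expected = (n[:-2] + n[2:]) / 2.0
    var = (N * p[1:-1] * (1 - p[1:-1])
           + 0.25 * N * (p[:-2] + p[2:]) * (1 - p[:-2] - p[2:]))
    z = (actual - expected) / np.sqrt(np.maximum(var, 1e-12))
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[1:-1], z
```

A significant $Z^- \rightarrow Z^+$ transition corresponds to a sharp density discontinuity; smooth aggregates, as at the accountant level, produce no such transition.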

2) Beta Mixture at Signature Level: A Forced Fit

Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check. For the full-sample cosine the 3-component fit is likewise strongly preferred (\Delta\text{BIC} = 10{,}175). Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data. Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
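The logit-GMM robustness check can be sketched as follows: logit-transform the cosines so the support is unbounded, then fit a one-dimensional Gaussian mixture by plain EM (a NumPy-only sketch with deterministic quantile initialization; the paper's actual fitting routine and convergence criteria are not specified here):

```python
import numpy as np

def logit(x, eps=1e-6):
    """Map (0, 1)-valued cosines to the real line."""
    x = np.clip(x, eps, 1 - eps)
    return np.log(x / (1 - x))

def gmm1d_em(y, k=2, iters=200):
    """Plain EM for a 1-D Gaussian mixture; returns (weights, means, sds)."""
    mu = np.quantile(y, np.linspace(0.1, 0.9, k))  # spread initial means
    sd = np.full(k, y.std())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: component responsibilities
        dens = w * np.exp(-0.5 * ((y[:, None] - mu) / sd) ** 2) \
               / (sd * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: reweight parameters by responsibilities
        nk = r.sum(axis=0)
        w = nk / len(y)
        mu = (r * y[:, None]).sum(axis=0) / nk
        sd = np.sqrt((r * (y[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return w, mu, sd
```

The classification crossing reported in the text is the point where the two components' posterior probabilities are equal; under a forced 2-component fit, inconsistency between this crossing and the Beta crossing is what signals a misspecified mixture.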

The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: at the per-signature level, no two-mechanism mixture explains the data. Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing. This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual behavior (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks.

E. Accountant-Level Gaussian Mixture

We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}. BIC selects K^* = 3 (Table VI).

Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.

Three empirical findings stand out. First, of the 180 CPAs in the Firm A registry, 171 have \geq 10 signatures and therefore enter the accountant-level GMM (the remaining 9 have too few signatures for reliable aggregates and are excluded from this analysis only). Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2. This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D. Second, the three-component partition is not a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3. Third, applying the three-method framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in Table VIII: KDE antimode = 0.973, Beta-2 crossing = 0.979, and the logit-GMM-2 crossing = 0.976 converge within \sim 0.006 of each other, while the BD/McCrary test does not produce a significant transition at the accountant level. For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine = 0.945 and dHash = 8.10; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.

Table VIII then summarizes all threshold estimates produced by the three methods across the two analysis levels for a compact cross-level comparison.

Methods 1 and 3 (KDE antimode, Beta-2 crossing, and its logit-GMM robustness check) converge at the accountant level to a cosine threshold of \approx 0.975 \pm 0.003 and a dHash threshold of \approx 3.8 \pm 0.4, while Method 2 (BD/McCrary) does not produce a significant discontinuity. This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine = 0.945, dHash = 8.10) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check. The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
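The KDE antimode used in Method 1 is the density minimum separating the two main modes of a single distribution; a minimal sketch (default SciPy bandwidth assumed, which may differ from the paper's choice):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelmin

def kde_antimode(x, lo=None, hi=None, n_grid=2000):
    """Antimode: the interior KDE minimum separating the two main modes."""
    x = np.asarray(x, float)
    lo = x.min() if lo is None else lo
    hi = x.max() if hi is None else hi
    grid = np.linspace(lo, hi, n_grid)
    dens = gaussian_kde(x)(grid)
    minima = argrelmin(dens)[0]  # indices of local density minima
    if len(minima) == 0:
        return None
    # among interior minima, take the one with the lowest density
    return grid[minima[np.argmin(dens[minima])]]
```

Applied to the accountant-level cosine-mean distribution, this is the estimator that converges with the Beta-2 and logit-GMM-2 crossings to within $\sim 0.006$.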

F. Calibration Validation with Firm A

Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).

Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H. The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix. Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.

G. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation

We report three validation analyses corresponding to the anchors of Section III-K.

1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor

Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor. As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean = 0.762, P_{95} = 0.884, P_{99} = 0.913, max = 0.988). Table X reports precision, recall, F_1, FAR with Wilson 95% confidence intervals, and FRR at each candidate threshold. The Equal-Error-Rate point, interpolated at FAR = FRR, is located at cosine = 0.990 with EER \approx 0, which is trivially small because pixel-identical positives are all at cosine very close to 1.
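The Wilson interval and the FAR/FRR computation can be sketched as follows (generic formulas; the accept-if-above convention and variable names are assumptions, not the paper's code):

```python
import numpy as np

def wilson_interval(k, n, z=1.959964):
    """Wilson score 95% confidence interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

def far_frr(pos_scores, neg_scores, threshold):
    """FAR: share of negatives accepted; FRR: share of positives rejected
    (accept if score > threshold)."""
    far = float(np.mean(np.asarray(neg_scores) > threshold))
    frr = float(np.mean(np.asarray(pos_scores) <= threshold))
    return far, frr
```

The EER point is then the threshold at which the two error curves cross, interpolated over a threshold grid.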

Two caveats apply. First, the gold-positive anchor is a conservative subset of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical. Perfect recall against this subset does not establish perfect recall against the broader positive class, and the reported recall should therefore be interpreted as a lower-bound calibration check rather than a generalizable recall estimate. Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X evaluate thresholds that were not chosen to optimize Table X. The very low FAR at the accountant-level thresholds is therefore informative.

2) Held-Out Firm A Validation (breaks calibration-validation circularity)

We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures). Thresholds are re-derived from calibration-fold percentiles only. Table XI reports held-out-fold capture rates with Wilson 95% confidence intervals.

The held-out rates match the whole-sample rates of Table IX within each rule's Wilson confidence interval, confirming that the calibration-derived thresholds generalize to Firm A CPAs that did not contribute to calibration. The dual rule cosine > 0.95 AND dHash \leq 8 captures 91.54% [91.09%, 91.97%] of the held-out Firm A population, consistent with Firm A's interview-reported signing mix and with the replication-dominated framing of Section III-H.

3) Sanity Sample

A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) showed full agreement between raters and the classifier in all 30 cases; the sample served only as a spot check and is not used to compute reported metrics.

H. Firm A Benchmark Validation: Threshold-Independent Evidence

The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles. This section reports three additional analyses that are threshold-independent in the sense that their findings do not depend on any cutoff we calibrate to Firm A, and therefore constitute genuine benchmark-validation evidence rather than a circular check.

1) Year-by-Year Stability of the Firm A Left Tail

Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year. Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign. Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.

The left tail is stable at 6-13% throughout the sample period and shows no pre/post-2020 level shift: the 2013-2019 mean left-tail share is 8.26% and the 2020-2023 mean is 6.96%. The lowest observed share is in 2023 (3.75%), consistent with firm-level electronic signing systems producing more uniform output than earlier manual scanning-and-stamping, not less. This stability supports the replication-dominated framing: a persistent minority of hand-signing Firm A partners is consistent with a Beta left tail that is stable across production technologies, whereas a noise-only explanation would predict a shrinking share as technology improved.
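The per-year left-tail share is a straightforward grouped proportion; a sketch, assuming the data are available as (fiscal_year, best-match cosine) pairs (the 0.95 marker here only labels the tail and does not drive the stability finding):

```python
from collections import defaultdict

def left_tail_share_by_year(records, cutoff=0.95):
    """Per-year share of signatures with best-match cosine below `cutoff`.

    `records` is an iterable of (fiscal_year, best_match_cosine) pairs.
    """
    totals, below = defaultdict(int), defaultdict(int)
    for year, cos in records:
        totals[year] += 1
        below[year] += cos < cutoff
    return {y: below[y] / totals[y] for y in sorted(totals)}
```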

2) Partner-Level Similarity Ranking

If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all Big-4 auditor-years. We test this prediction directly.

For each auditor-year (CPA \times fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023. Firm A accounts for 1,287 of these (27.8% baseline share). Table XIV reports per-firm occupancy of the top K\% of the ranked distribution.

Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile. Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.

This over-representation is a direct consequence of firm-wide stamping practice and is not derived from any threshold we subsequently calibrate. It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.
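The top-K occupancy and concentration-ratio computation can be sketched as follows (hypothetical variable names; `scores` would hold the per-auditor-year mean best-match cosines and `firms` the firm labels):

```python
import numpy as np

def topk_occupancy(firms, scores, firm, k_frac=0.10):
    """Share of the top `k_frac` of auditor-years (ranked by score) held by
    `firm`, plus the concentration ratio versus the firm's baseline share."""
    firms, scores = np.asarray(firms), np.asarray(scores)
    k = max(1, int(round(k_frac * len(scores))))
    top_idx = np.argsort(scores)[::-1][:k]  # indices of the top-k scores
    top_share = float(np.mean(firms[top_idx] == firm))
    base_share = float(np.mean(firms == firm))
    return top_share, top_share / base_share
```

Because the statistic is a rank-based occupancy share, it is invariant to any monotone rescaling of the similarity scores, which is what makes it threshold-independent.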

3) Intra-Report Consistency

Taiwanese statutory audit reports are co-signed by two engagement partners (a primary and a secondary signer). Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification. Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.

For each report with exactly two signatures and complete per-signature data (93,979 reports), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree. Table XVI reports per-firm intra-report agreement.

Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having both signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed. The other Big-4 firms and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap. This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) stamping practice.

Like the partner-level ranking, this test does not depend on any threshold we calibrate to Firm A; the firm-vs-firm comparison is invariant to the absolute cutoff so long as the cutoff is applied uniformly.
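The agreement statistic itself is a simple per-firm grouped proportion; a sketch, assuming each 2-signer report is available as a (firm, label_signer1, label_signer2) triple with labels produced by the Section III-L rules:

```python
from collections import defaultdict

def intra_report_agreement(reports):
    """Per-firm share of 2-signer reports whose two signature-level
    classifications agree.

    `reports` is an iterable of (firm, label_signer1, label_signer2).
    """
    totals, agree = defaultdict(int), defaultdict(int)
    for firm, a, b in reports:
        totals[firm] += 1
        agree[firm] += (a == b)
    return {f: agree[f] / totals[f] for f in totals}
```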

I. Classification Results

Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents. The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.

Within the 71,656 documents exceeding cosine 0.95, the dHash dimension stratifies them into three distinct populations: 29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash \leq 5); 36,994 (51.7%) show partial structural similarity (dHash in [6, 15]) consistent with replication degraded by scan variations; and 5,133 (7.2%) show no structural corroboration (dHash > 15), suggesting high signing consistency rather than image reproduction. A cosine-only classifier would treat all 71,656 identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
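The stratification described above can be sketched as a small decision function using the cutoffs stated in the text (the string labels are hypothetical shorthand, not the paper's category names):

```python
def stratify(cosine, dhash):
    """Dual-descriptor stratification for a document's best-match cosine
    and min dHash, using the bands described in the text."""
    if cosine <= 0.95:
        return "below-cosine-cutoff"
    if dhash <= 5:
        return "non-hand-signed (structural corroboration)"
    if dhash <= 15:
        return "partial structural similarity"
    return "high style consistency, no structural corroboration"
```

A cosine-only classifier collapses the last three branches into one; the dHash bands are what pull them apart.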

1) Firm A Capture Profile (Consistency Check)

96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain. This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers. The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set. We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.

2) Cross-Method Agreement

Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer. This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).

J. Ablation Study: Feature Backbone Comparison

To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. Table XVIII presents the comparison.

EfficientNet-B0 achieves the highest Cohen's d (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions. However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), resulting in lower per-sample classification confidence. VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.

ResNet-50 provides the best overall balance: (1) Cohen's d of 0.669 is competitive with EfficientNet-B0's 0.707; (2) its tighter distributions yield more reliable individual classifications; (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.