IV. Experiments and Results
A. Experimental Setup
All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration. Feature extraction used PyTorch 2.9 with torchvision model implementations. The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
B. Signature Detection Performance
The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.
C. All-Pairs Intra-vs-Inter Class Distribution Analysis
Fig. 2 presents the cosine similarity distributions computed over the full set of pairwise comparisons under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L). Table IV summarizes the distributional statistics.
Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
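The normality rejection and the AIC-based family comparison described above can be sketched as follows. This is an illustrative sketch on synthetic left-skewed data (not the paper's similarity scores), using SciPy's generic `fit`/`logpdf` interface; the candidate family list is an assumption for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic left-skewed similarity scores standing in for one class
# (illustrative only; not the paper's data).
sims = 1.0 - rng.lognormal(mean=-2.0, sigma=0.6, size=5000)
sims = sims[(sims > 0) & (sims < 1)]

# Normality tests: both reject for skewed data at this sample size.
_, p_shapiro = stats.shapiro(sims)
_, p_ks = stats.kstest(sims, "norm", args=(sims.mean(), sims.std()))

# AIC comparison across candidate families, fitted on 1 - sims so the
# support is positive for the lognormal and gamma families.
def aic(dist, data):
    params = dist.fit(data)
    return 2 * len(params) - 2 * dist.logpdf(data, *params).sum()

y = 1.0 - sims
aics = {name: aic(getattr(stats, name), y) for name in ("lognorm", "gamma", "norm")}
```

A lower AIC indicates a better penalized fit; as the text notes, this ranking is used only descriptively, not to derive thresholds.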
The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
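Locating such a KDE crossover amounts to finding the sign change of the density difference between the two class modes. A minimal sketch on synthetic samples (the means, spreads, and grid are illustrative assumptions, not the paper's values):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Two synthetic similarity populations (illustrative, not the paper's data).
intra = np.clip(rng.normal(0.93, 0.05, 4000), 0, 1)   # same-CPA pairs
inter = np.clip(rng.normal(0.76, 0.07, 4000), 0, 1)   # cross-CPA pairs

kde_intra, kde_inter = gaussian_kde(intra), gaussian_kde(inter)
grid = np.linspace(0.5, 1.0, 2001)
diff = kde_intra(grid) - kde_inter(grid)

# Crossover: where the density difference changes sign between the modes.
lo, hi = np.searchsorted(grid, 0.76), np.searchsorted(grid, 0.93)
crossover = grid[lo + np.argmin(np.abs(diff[lo:hi]))]
```

Under equal priors and costs, scores above `crossover` are more plausibly intra-class, which is the sense in which the crossover approximates the Bayes boundary.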
Statistical tests confirmed significant separation between the two distributions (Cohen's d = 0.669, Mann-Whitney [36] p < 0.001, K-S 2-sample p < 0.001).
We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
We therefore rely primarily on Cohen's d as an effect-size measure that is less sensitive to sample size.
A Cohen's d of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
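The pooled-standard-deviation form of Cohen's d used here can be sketched directly; the sample sizes and distribution parameters below are illustrative stand-ins, chosen only to show that d is stable while p-values saturate at large n.

```python
import numpy as np

def cohens_d(a, b):
    """Effect size with pooled standard deviation (Cohen, 1988)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(2)
a = rng.normal(0.90, 0.10, 100_000)  # synthetic intra-class similarities
b = rng.normal(0.83, 0.11, 100_000)  # synthetic inter-class similarities
d = cohens_d(a, b)  # a medium effect, regardless of sample count
```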
D. Hartigan Dip Test: Unimodality at the Signature Level
Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
The N = 168{,}740 count used in Table V and in the downstream same-CPA per-signature best-match analyses (Tables V and XII, and the Firm-A per-signature rows of Tables XIII and XVIII) is 15 signatures smaller than the 168{,}755 CPA-matched count reported in Table III: these 15 signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in the accountant-level mixture (Section IV-E).
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is multimodal (p < 0.001).
At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.
This asymmetry between signature level and accountant level is itself an empirical finding. It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
1) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic
Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine distribution yields a nominally significant Z^- \rightarrow Z^+ transition at cosine 0.985 for both Firm A and the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both, under the bin widths used here (0.005 for cosine, 1 for dHash).
Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
First, the cosine transition at 0.985 lies inside the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
At the accountant level the BD/McCrary null is not rejected at two of three cosine bin widths (0.002, 0.010) and two of three dHash bin widths (0.2, 0.5); the one cosine transition that does occur (at bin width 0.005) sits at cosine 0.980---at the upper edge of the convergence band of our two threshold estimators (Section IV-E)---and the one dHash transition (at bin width 1.0, location dHash = 3.0) has |Z_{\text{below}}| exactly at the 1.96 critical value.
We read this pattern as largely but not uniformly null and consistent with---not affirmative proof of---clustered-but-smoothly-mixed aggregates: at N = 686 accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness (Section V-G), and the one bin-0.005 cosine transition, sitting at the edge rather than outside the threshold band and flanked by bin-0.002 and bin-0.010 non-rejections, is consistent with a mild histogram-resolution effect rather than a stable cross-mode density discontinuity (Appendix A).
We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator, and the substantive claim of smoothly-mixed accountant clustering rests on the joint evidence of the dip test, the BIC-selected GMM, and the BD null.
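The mechanics of a BD-style smoothness diagnostic can be sketched as follows: each interior histogram bin's count is compared with the mean of its two neighbours and standardized by an approximate binomial variance. This is a simplified sketch (the paper's exact procedure is its Section III-I.3, not reproduced here); the toy histograms are invented to contrast a smooth density with one containing a discontinuity.

```python
import numpy as np

def bd_z(counts):
    """Burgstahler-Dichev-style standardized bin deviations (simplified).

    For each interior bin, compare the observed count with the mean of its
    two neighbours, standardized by an approximate binomial variance.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts / n
    z = np.full_like(counts, np.nan)
    for i in range(1, len(counts) - 1):
        expected = 0.5 * (counts[i - 1] + counts[i + 1])
        var = (n * p[i] * (1 - p[i])
               + 0.25 * n * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        z[i] = (counts[i] - expected) / np.sqrt(var)
    return z

# A smooth (here linear) histogram yields |Z| well below 1.96; a
# discontinuity produces a negative Z just below it and a positive Z at it.
smooth = np.array([100, 120, 140, 160, 180, 200, 220])
jump   = np.array([100, 120,  60, 400, 420, 430, 440])
z_smooth, z_jump = bd_z(smooth), bd_z(jump)
```

A Z^- \rightarrow Z^+ pair past the 1.96 critical value, as in the `jump` toy histogram, is the transition pattern the diagnostic flags; bin-width sensitivity arises because `counts` itself depends on the chosen bin width.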
2) Beta Mixture at Signature Level: A Forced Fit
Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check.
For the full-sample cosine the 3-component fit is likewise strongly preferred (\Delta\text{BIC} = 10{,}175).
Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
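The logit-GMM crossing referenced above can be sketched as follows: fit a two-component Gaussian mixture to logit-transformed similarities, locate where the weighted component densities cross in logit space, and map back through the sigmoid. The data, modes, and seeds below are synthetic illustrations, not the paper's fits.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def logit(p): return np.log(p / (1 - p))
def sigmoid(t): return 1 / (1 + np.exp(-t))

rng = np.random.default_rng(5)
# Synthetic bimodal similarities near 1 (illustrative, not the paper's data).
x = np.concatenate([sigmoid(rng.normal(2.5, 0.5, 2000)),   # mode near 0.92
                    sigmoid(rng.normal(4.5, 0.5, 1000))])  # mode near 0.99

z = logit(x).reshape(-1, 1)
gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(z)
m = gm.means_.ravel()
s = np.sqrt(gm.covariances_.ravel())
w = gm.weights_

# Crossing of the weighted component densities between the two means,
# mapped back to the similarity scale.
grid = np.linspace(min(m), max(m), 10_001)
diff = w[0] * norm.pdf(grid, m[0], s[0]) - w[1] * norm.pdf(grid, m[1], s[1])
crossing = sigmoid(grid[np.argmin(np.abs(diff))])
```

When the two-component structure is genuinely present, this crossing and the direct Beta-2 crossing agree; sharply inconsistent values, as in the forced fits above, signal that the parametric structure is not supported.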
The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: at the per-signature level, no two-mechanism mixture explains the data. Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing. This motivates the pivot to the accountant-level analysis in Section IV-E, where aggregation over signatures reveals clustered (though not sharply discrete) patterns in individual-level signing practice that the signature-level analysis lacks.
E. Accountant-Level Gaussian Mixture
We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}.
BIC selects K^* = 3 (Table VI).
Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.
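The BIC-based selection of K can be sketched with scikit-learn; the three synthetic clusters below are illustrative stand-ins for the 686 accountant-level (cosine-mean, dHash-mean) aggregates, with invented centers and covariances.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Synthetic (cosine-mean, dHash-mean) aggregates with three clusters
# (illustrative only; centers/covariances are assumptions).
X = np.vstack([
    rng.multivariate_normal([0.98, 3.0],  np.diag([1e-4, 0.5]), 300),  # high replication
    rng.multivariate_normal([0.94, 9.0],  np.diag([4e-4, 2.0]), 250),  # middle band
    rng.multivariate_normal([0.88, 16.0], np.diag([9e-4, 4.0]), 150),  # hand-signing
])

bic = {k: GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X).bic(X)
       for k in range(1, 6)}
k_star = min(bic, key=bic.get)  # BIC-preferred component count
```

Lower BIC is better; the penalty term guards against the likelihood gain that extra components always provide.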
Three empirical findings stand out.
First, of the 180 CPAs in the Firm A registry, 171 have \geq 10 signatures and therefore enter the accountant-level GMM (the remaining 9 have too few signatures for reliable aggregates and are excluded from this analysis only).
Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2.
This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
Second, the three-component partition is not a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
Third, applying the threshold framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): KDE antimode = 0.973, Beta-2 crossing = 0.979, and the logit-GMM-2 crossing = 0.976 converge within \sim 0.006 of each other, while the BD/McCrary density-smoothness diagnostic is largely null at the accountant level---no significant transition at two of three cosine bin widths and two of three dHash bin widths, with the one cosine transition at bin 0.005 sitting at cosine 0.980 on the upper edge of the convergence band (Appendix A).
For completeness we also report the marginal crossings of a separately fit two-component 2D GMM (reported as a cross-check on the 1D accountant-level crossings) at cosine = 0.945 and dHash = 8.10; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.
Table VIII summarizes the threshold estimates produced by the two threshold estimators and the BD/McCrary smoothness diagnostic across the two analysis levels for a compact cross-level comparison.
At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of \approx 0.975 \pm 0.003 and a dHash threshold of \approx 3.8 \pm 0.4; the BD/McCrary density-smoothness diagnostic is largely null at the same level (two of three cosine bin widths and two of three dHash bin widths produce no significant transition; the one bin-0.005 cosine transition at 0.980 sits on the convergence-band upper edge and is flanked by non-rejections at bin 0.002 and bin 0.010, Appendix A), which is consistent with---though, at N = 686, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates.
This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine = 0.945, dHash = 8.10) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
F. Calibration Validation with Firm A
Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).
Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H.
The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent with both the accountant-level crossings (Section IV-E) and the 139/32 high-replication versus middle-band split within Firm A (Section IV-E).
Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
G. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
We report three validation analyses corresponding to the anchors of Section III-K.
1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor
Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean = 0.762, P_{95} = 0.884, P_{99} = 0.913, max = 0.988).
Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / F_1 / recall therefore have no meaningful population interpretation.
We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from different CPAs exceeds the candidate threshold.
We do not report an Equal Error Rate: EER is meaningful only when the positive and negative error-rate curves cross in a nontrivial interior region, but byte-identical positives all sit at cosine \approx 1 by construction, so FRR against that subset is trivially 0 at every threshold below 1. An EER calculation against this anchor would be arithmetic tautology rather than biometric performance, and we therefore omit it.
Two caveats apply.
First, the byte-identical positive anchor referenced above is a conservative subset of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
A would-be FRR computed against this subset is definitionally 0 at every threshold below 1 (since byte-identical pairs have cosine \approx 1), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.
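The Wilson score intervals used for the FAR estimates can be computed as in this sketch; the count 12 out of 50,000 is a hypothetical example, not a value from Table X.

```python
import numpy as np

def wilson_ci(k, n, z=1.96):
    """Wilson 95% score interval for a binomial proportion k/n.

    Unlike the Wald interval, it stays inside [0, 1] and behaves
    sensibly for the small counts typical of FAR estimation.
    """
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical example: 12 false accepts among 50,000 inter-CPA pairs.
lo, hi = wilson_ci(12, 50_000)
```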
2) Held-Out Firm A Validation (within-Firm-A sampling variance disclosure)
We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures). The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry and which we therefore exclude from both folds; this handling is made explicit here and has no effect on the accountant-level mixture analysis of Section IV-E, which uses the $\geq 10$-signature subset of 171 CPAs. Thresholds are re-derived from calibration-fold percentiles only. Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
We report fold-versus-fold comparisons rather than fold-versus-whole-sample comparisons, because the whole-sample rate is a weighted average of the two folds and therefore cannot, in general, fall inside the Wilson CI of either fold when the folds differ in rate; the correct generalization reference is the calibration fold, which produced the thresholds.
Under this proper test the two extreme rules agree across folds (cosine > 0.837 and \text{dHash}_\text{indep} \leq 15; both p > 0.7).
The operationally relevant rules in the 85–95% capture band differ between folds by 1–5 percentage points (p < 0.001 given the n \approx 45\text{k}/15\text{k} fold sizes).
Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 85–99% range has a held-out counterpart in the 87–99% range, and the operational dual rule cosine > 0.95 AND \text{dHash}_\text{indep} \leq 8 captures 89.40% of the calibration fold and 91.54% of the held-out fold.
The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity (see the 139 / 32 accountant-level split of Section IV-E): the random 30% CPA sample happened to contain proportionally more accountants from the high-replication C1 cluster.
We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to this fold variance.
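The pooled two-proportion z-test used for the fold comparison can be sketched directly. The counts below are reconstructed from the reported fold rates (89.40% of 45,116 calibration signatures vs. 91.54% of 15,332 held-out signatures) via rounding, so they are approximations rather than the exact tabulated counts.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test, two-sided p-value."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)               # pooled proportion under H0
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))

# Approximate counts from the reported fold rates of the dual rule.
z, p = two_proportion_z(round(0.8940 * 45_116), 45_116,
                        round(0.9154 * 15_332), 15_332)
```

At these fold sizes even a ~2-percentage-point rate difference is highly significant, which is why the text emphasizes the qualitative regime agreement over exact rate equality.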
3) Operational-Threshold Sensitivity: cos > 0.95 vs cos > 0.945
The per-signature classifier (Section III-L) uses cos > 0.95 as its operational cosine cut, anchored on the whole-sample Firm A P95 heuristic.
The accountant-level convergent threshold analysis (Section IV-E) places the primary accountant-level reference between 0.973 and 0.979 (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing), and the accountant-level 2D-GMM marginal at 0.945.
Because the classifier operates at the signature level while these convergent accountant-level estimates are at the accountant level, they are formally non-substitutable.
We report a sensitivity check in which the classifier's operational cut cos > 0.95 is replaced by the nearest accountant-level reference, cos > 0.945.
Table XII reports the five-way classifier output under each cut.
At the aggregate firm-level, the operational dual rule cos > 0.95 AND \text{dHash}_\text{indep} \leq 8 captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
At the per-signature categorization level, replacing 0.95 by 0.945 reclassifies 8,508 signatures (5.04% of the corpus) out of the Uncertain band; 6,095 of them migrate to Moderate-confidence non-hand-signed, 2,294 to High-confidence non-hand-signed, and 119 to High style consistency.
The Likely-hand-signed category is unaffected because it depends only on the fixed all-pairs KDE crossover cosine = 0.837.
The High-confidence non-hand-signed share grows from 45.62% to 46.98%.
We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within the accountant-level convergence band, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.
The paper therefore retains cos > 0.95 as the primary operational cut for transparency and reports the 0.945 results as a sensitivity check rather than as a deployed alternative; a future deployment requiring tighter accountant-level alignment could substitute cos > 0.945 without altering the substantive firm-level conclusions.
4) Sanity Sample
A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample served only as a spot-check and is not used to compute any reported metric.
H. Additional Firm A Benchmark Validation
The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles. This section reports three complementary analyses that go beyond the whole-sample capture rates. Subsection H.2 is fully threshold-independent (it uses only ordinal ranking). Subsection H.1 uses a fixed 0.95 cutoff but derives information from the longitudinal stability of rates rather than from the absolute rate at any single year. Subsection H.3 applies the calibrated classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the informative quantity is the cross-firm gap rather than the absolute agreement rate at any one firm.
1) Year-by-Year Stability of the Firm A Left Tail
Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year. Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign. Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.
The left tail is stable at 6-13% throughout the sample period and shows no pre/post-2020 level shift: the 2013-2019 mean left-tail share is 8.26% and the 2020-2023 mean is 6.96%. The lowest observed share is in 2023 (3.75%), consistent with firm-level electronic signing systems producing more uniform output than earlier manual scanning-and-stamping, not less. This stability supports the replication-dominated framing: a persistent minority of hand-signing Firm A partners is consistent with a Beta left tail that is stable across production technologies, whereas a noise-only explanation would predict a shrinking share as technology improved.
2) Partner-Level Similarity Ranking
If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all Big-4 auditor-years. We test this prediction directly.
For each auditor-year (CPA \times fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
Firm A accounts for 1,287 of these (27.8% baseline share).
Table XIV reports per-firm occupancy of the top K\% of the ranked distribution.
Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile. Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.
This over-representation is a direct consequence of firm-wide non-hand-signing practice and is not derived from any threshold we subsequently calibrate. It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.
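Because this analysis uses only ordinal ranking, its mechanics are simple to sketch. The firm labels and similarity draws below are synthetic (only the 1,287 / 4,629 baseline split echoes the text); the concentration ratio is top-decile share divided by baseline share.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic auditor-year mean similarities: firm "A" drawn from a
# higher-mean population (illustrative, not the paper's data).
firms = np.array(["A"] * 1287 + ["other"] * 3342)     # 27.8% baseline share
means = np.where(firms == "A",
                 rng.normal(0.97, 0.01, firms.size),
                 rng.normal(0.90, 0.03, firms.size))

order = np.argsort(-means)                            # rank high to low
top10 = order[: int(0.10 * firms.size)]
share_top10 = (firms[top10] == "A").mean()
baseline = (firms == "A").mean()
concentration = share_top10 / baseline                # >1 means over-represented
```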
3) Intra-Report Consistency
Taiwanese statutory audit reports are co-signed by two engagement partners (a primary and a secondary signer). Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification. Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.
For each report with exactly two signatures and complete per-signature data (83,970 reports whose two signers belong to the same firm, plus 384 mixed-firm reports whose two signers belong to different firms, for 84,354 total), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree. Table XVI reports per-firm intra-report agreement (firm-assignment defined by the firm identity of both signers; mixed-firm reports are reported separately).
Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having both signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed. The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap. This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) non-hand-signing practice.
We note that this test uses the calibrated classifier of Section III-L rather than a threshold-free statistic; the substantive evidence lies in the cross-firm gap between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
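The per-firm agreement computation itself is a small grouping exercise, sketched here on invented toy records (report IDs, firm labels, and classifications are all hypothetical).

```python
from collections import defaultdict

# Hypothetical (report_id, firm, signature-level label) records; each
# report carries exactly two signer labels (illustrative, not real data).
records = [
    ("r1", "A", "non-hand-signed"), ("r1", "A", "non-hand-signed"),
    ("r2", "A", "non-hand-signed"), ("r2", "A", "non-hand-signed"),
    ("r3", "B", "non-hand-signed"), ("r3", "B", "likely-hand-signed"),
    ("r4", "B", "likely-hand-signed"), ("r4", "B", "likely-hand-signed"),
]

by_report = defaultdict(list)
for report_id, firm, label in records:
    by_report[(report_id, firm)].append(label)

votes = defaultdict(list)
for (report_id, firm), labels in by_report.items():
    if len(labels) == 2:                      # two-signer reports only
        votes[firm].append(labels[0] == labels[1])

agreement = {firm: sum(v) / len(v) for firm, v in votes.items()}
```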
I. Classification Results
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents. The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here. We emphasize that the document-level proportions below reflect the worst-case aggregation rule of Section III-L: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts. Document-level rates therefore bound the share of reports in which at least one signature is non-hand-signed rather than the share in which both are; the intra-report agreement analysis of Section IV-H.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
Within the 71,656 documents exceeding cosine 0.95, the dHash dimension stratifies them into three distinct populations:
29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash \leq 5);
36,994 (51.6%) show partial structural similarity (dHash in [6, 15]) consistent with replication degraded by scan variations;
and 5,133 (7.2%) show no structural corroboration (dHash > 15), suggesting high signing consistency rather than image reproduction.
A cosine-only classifier would treat all 71,656 identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
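The dual-descriptor stratification can be sketched as a small decision function. This is a simplified paraphrase of the Section III-L rules as described in the surrounding text (0.837 all-pairs KDE crossover, 0.95 operational cosine cut, dHash bands at 5 and 15); the published decision logic may include conditions not reproduced here, so treat this as illustrative.

```python
def classify_signature(cosine, dhash):
    """Five-way dual-descriptor rule (simplified paraphrase, not the
    paper's exact Section III-L logic)."""
    if cosine <= 0.837:                 # below the all-pairs KDE crossover
        return "likely-hand-signed"
    if cosine > 0.95:                   # operational cosine cut
        if dhash <= 5:                  # converging structural evidence
            return "high-confidence-non-hand-signed"
        if dhash <= 15:                 # partial structural similarity
            return "moderate-confidence-non-hand-signed"
        return "high-style-consistency"  # no structural corroboration
    return "uncertain"

labels = [classify_signature(c, d)
          for c, d in [(0.99, 3), (0.97, 10), (0.96, 20), (0.90, 6), (0.80, 4)]]
```

The sketch makes the text's point concrete: all three `cosine > 0.95` cases would be identical to a cosine-only classifier, yet the dHash dimension assigns them different interpretations.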
1) Firm A Capture Profile (Consistency Check)
96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain. This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the 32/171 middle-band minority identified by the accountant-level mixture (Section IV-E). The absence of any meaningful "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set. We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.
2) Cross-Method Agreement
Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
This is consistent with the accountant-level convergent thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
J. Ablation Study: Feature Backbone Comparison
To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. Table XVIII presents the comparison.
EfficientNet-B0 achieves the highest Cohen's d (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), resulting in lower per-sample classification confidence.
VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.
ResNet-50 provides the best overall balance:
(1) Cohen's d of 0.669 is competitive with EfficientNet-B0's 0.707;
(2) its tighter distributions yield more reliable individual classifications;
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.