Phase 6 manuscript splice (2/2): §IV / §V / §VI spliced
Lands v4.0 §IV / §V / §VI content into v3.20.0 master sub-files. Strips internal close-out checklists, draft notes, and open-questions blocks at splice. Completes the Phase 6 manuscript-master file assembly. §IV Results (paper_a_results_v3.md): - §IV-A..C: kept v3.20.0 inherited content (experimental setup, detection performance, all-pairs distribution); added v4 scope note (Big-4 primary) at the §IV header - §IV-D..K: replaced v3.20.0 §IV-D..H with v4.0 §IV-D..K (Big-4 distributional / mixture / convergence / LOOO / pixel-identity / inter-CPA reference / five-way classification / full-dataset robustness) - §IV-L: renumbered v3.20.0 §IV-I (backbone ablation) content to match v4's "§IV-L inherited from v3.20.0 §IV-I" reframing - §IV-M: appended v4.0 ICCR calibration tables (XX-XXVI): composition decomposition, per-comparison/per-signature/ per-document ICCRs, firm heterogeneity + cross-firm hit matrix, alert-rate sensitivity - §III-K ablation cross-ref updated to §IV-L (was §IV-I) - Phase 3 close-out checklist (lines 365+) stripped §V Discussion (paper_a_discussion_v3.md): - Replaced v3.20.0 §V with v4.0 §V (8 sub-sections A-H): A. Distinct problem framing B. Continuous quality spectrum + composition-driven multimodality C. Firm A as templated end (case study, not anchor) D. K=2 / K=3 descriptive partitions E. Three-score convergent internal-consistency F. Anchor-based multi-level calibration G. Pixel-identity hard positive anchor + ICCR reframing H. Limitations (14 items: 9 v4-specific + 5 inherited from v3.x) §VI Conclusion (paper_a_conclusion_v3.md): - Replaced v3.20.0 §VI with v4.0 §VI (8 contribution items mirroring §I contributions; 4-direction future work). Known splice-time issue (deferred to typesetting): §IV table numbering is sequential by label (V, VI, ..., XXVI) but Table XIX (document-level worst-case) appears physically before Tables XVI/XVII/XVIII in §IV-J narrative flow. IEEE Access typesetters typically normalize table order during typesetting; we accept the in-file ordering quirk to preserve the §IV-J narrative arc (per-signature -> document-level worst-case -> K=3 cross-tab). Renumbering to strictly-ascending physical order would require renaming Tables XVI/XVII/XVIII -> XVII/XVIII/XIX with downstream cross-reference updates; deferred unless partner Jimmy review or IEEE Access submission portal flags it. Manuscript splice complete. Working drafts in paper/v4/ retained as archive of the round-by-round Phase 5 fix history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1,30 +1,7 @@
|
|||||||
# VI. Conclusion and Future Work
|
# VI. Conclusion and Future Work
|
||||||
|
|
||||||
## Conclusion
|
We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is an inherited Paper A five-way per-signature classifier with worst-case document-level aggregation (§III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population.
|
||||||
|
|
||||||
We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
|
Our central methodological contributions are: (1) a composition decomposition (Scripts 39b–39e) that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the inherited operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) a ten-tool unsupervised-validation collection (§III-M Table XXVII) that explicitly discloses each tool's untested assumption and positions the system as an anchor-calibrated screening framework with human-in-the-loop review, not as a validated forensic detector.
|
||||||
Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with the operational classifier's cosine cut anchored on a whole-sample Firm A percentile heuristic and the per-signature similarity distribution characterised through two threshold estimators and a density-smoothness diagnostic.
|
|
||||||
|
|
||||||
The seven numbered contributions listed in Section I can be grouped into four broader methodological themes, summarized below.
|
Future work falls in four directions. *First*, a small-scale human-rated validation set would enable direct ROC optimisation and provide signature-level ground truth that v4.0 fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with v3.x's byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the v4.0 calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
|
||||||
|
|
||||||
First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.
|
|
||||||
|
|
||||||
Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
|
|
||||||
|
|
||||||
Third, we characterised the per-signature similarity distribution using three diagnostics---a Hartigan dip test, an EM-fitted Beta mixture (with logit-Gaussian robustness check), and a Burgstahler-Dichev / McCrary density-smoothness procedure---and showed that no two-mechanism mixture cleanly explains it: the dip test fails to reject unimodality for Firm A ($p = 0.17$), BIC strongly prefers a 3-component over a 2-component Beta fit ($\Delta\text{BIC} = 381$ for Firm A), and the BD/McCrary candidate transition lies inside the non-hand-signed mode rather than between modes (and is not bin-width-stable; Appendix A).
|
|
||||||
The substantive reading is that *pixel-level output quality* is a continuous spectrum produced by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) and scan conditions, rather than a discrete class cleanly separated from hand-signing.
|
|
||||||
This reading motivates anchoring the operational classifier's cosine cut on a whole-sample Firm A P7.5 percentile heuristic (cos $> 0.95$) rather than on a mixture-fit crossing.
|
|
||||||
|
|
||||||
Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
|
|
||||||
To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85--95% capture band differ by 1--5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure.
|
|
||||||
This framing is internally consistent with the available evidence: the byte-level pair analysis finding of 145 pixel-identical calibration-firm signatures across 50 distinct partners of 180 registered (Section IV-F.1); the 92.5% / 7.5% split in signature-level cosine thresholds and the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1); and the 95.9% top-decile concentration of Firm A auditor-years in the threshold-independent partner-ranking analysis (Section IV-G.2).
|
|
||||||
|
|
||||||
An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
|
|
||||||
|
|
||||||
## Future Work
|
|
||||||
|
|
||||||
Several directions merit further investigation.
|
|
||||||
Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
|
|
||||||
The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
|
|
||||||
The replication-dominated calibration strategy and the pixel-identity anchor technique are both generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself, conditional on the availability of analogous anchors in the new domain and on artifact-generation physics that preserve the byte-level trace.
|
|
||||||
Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
|
|
||||||
|
|||||||
@@ -2,108 +2,68 @@
|
|||||||
|
|
||||||
## A. Non-Hand-Signing Detection as a Distinct Problem
|
## A. Non-Hand-Signing Detection as a Distinct Problem
|
||||||
|
|
||||||
Our results highlight the importance of distinguishing *non-hand-signing detection* from the well-studied *signature forgery detection* problem.
|
Non-hand-signing differs from forgery in that the questioned signature is produced by its legitimate signer's own stored image rather than by an impostor. The detection problem is therefore framed around *intra-signer image reproduction* rather than *inter-signer imitation*. This framing has analytical consequences. The within-CPA signature distribution is the analytical population of interest; the cross-CPA inter-class distribution is a *reference* against which intra-CPA similarity is interpreted, not the population to be modelled. This contrasts with most prior offline signature verification work, which treats genuine-versus-forged as the central two-class problem.
|
||||||
In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature.
|
|
||||||
In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).
|
|
||||||
|
|
||||||
This distinction has direct methodological consequences.
|
## B. Per-Signature Similarity is a Continuous Quality Spectrum; the Accountant-Level Multimodality is Composition-Driven
|
||||||
Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures.
|
|
||||||
Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.
|
|
||||||
The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.
|
|
||||||
|
|
||||||
## B. Per-Signature Similarity is a Continuous Quality Spectrum
|
A central empirical finding of v3.x was that *per-signature* similarity does not admit a clean two-mechanism mixture: dip-test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 strengthens and extends this signature-level reading.
|
||||||
|
|
||||||
A central empirical finding of this study is that per-signature similarity does not form a clean two-mechanism mixture (Section IV-D).
|
The Big-4 accountant-level descriptor distribution does reject unimodality on both marginals at $p < 5 \times 10^{-4}$ (Script 34). v4.0's composition decomposition (§III-I.4; Scripts 39b–39e) shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual non-Big-4 firm with $\geq 500$ signatures tested (cosine: Scripts 39b/39c; jittered-dHash: Script 39d for Big-4 plus codex-verified read-only spike for the ten non-Big-4 firms; see §III-I.4). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.
|
||||||
Firm A's signature-level cosine is formally unimodal (Hartigan dip test $p = 0.17$) with a long left tail.
|
|
||||||
The all-CPA signature-level cosine rejects unimodality ($p < 0.001$), reflecting the heterogeneity of signing practices across firms, but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit ($\Delta\text{BIC} = 381$ for Firm A; $10{,}175$ for the full sample), and the forced 2-component Beta crossing and its logit-GMM robustness counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
|
|
||||||
The BD/McCrary discontinuity test locates its transition at cosine 0.985---*inside* the non-hand-signed mode rather than at a boundary between two mechanisms---and the transition is not bin-width-stable (Appendix A).
|
|
||||||
|
|
||||||
Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class cleanly separated from hand-signing.
|
## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)
|
||||||
Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level.
|
|
||||||
|
|
||||||
The methodological implication is that the operational classifier's cosine cut should not be derived from a mixture-fit crossing.
|
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). The additional v3.x finding that the 145 Firm A pixel-identical signatures span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches across different fiscal years, is inherited from v3.20.0 §IV-F.1 / Script 28 / Appendix B byte-decomposition output and was not regenerated in v4.0's spike scripts; we retain those numbers by reference.
|
||||||
We accordingly anchor the operational cosine cut on the whole-sample Firm A P7.5 percentile (Section III-K), and treat the signature-level threshold-estimator outputs (KDE antimode, Beta and logit-Gaussian crossings) as descriptive characterisation of the similarity distribution rather than as the source of operational thresholds.
|
|
||||||
The BD/McCrary procedure plays a *density-smoothness diagnostic* role in this framing rather than that of an independent threshold estimator.
|
|
||||||
|
|
||||||
This continuous-spectrum finding also has substantive implications for downstream interpretation.
|
In v4.0 we treat Firm A as a *templated-end case study* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). Firm A's high per-document HC$+$MC alarm rate of $0.62$ (versus Firms B/C/D's $0.09$–$0.16$) reflects high inter-CPA collision concentration under the deployed rule on real same-CPA pools, consistent with firm-specific template, stamp, or document-production reuse — though the inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence of v3.x §IV-F.1 (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence that firm-level template reuse does occur at Firm A; the within-firm collision pattern at all four Big-4 firms is consistent with that mechanism extending in milder form to Firms B/C/D.
|
||||||
Because pixel-level output quality varies continuously, *signature-level rates* (such as the 92.5% / 7.5% Firm A split) reflect the share of signatures whose similarity falls above or below a chosen threshold rather than the share that came from a "non-hand-signing mechanism" versus a "hand-signing mechanism."
|
|
||||||
We accordingly report all rates as signature-level quantities and abstain from partner-level frequency claims (Section III-G).
|
|
||||||
|
|
||||||
## C. Firm A as a Replication-Dominated, Not Pure, Population
|
## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
|
||||||
|
|
||||||
A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class.
|
Leave-one-firm-out cross-validation of the Big-4 mixture fit reveals a sharp contrast between K=2 and K=3 behaviour. K=2 is unstable: across-fold cosine-crossing deviation is $0.028$, and holding Firm A out gives a fold rule (cos $> 0.938$, dHash $\leq 8.79$) that classifies $100\%$ of held-out Firm A in the upper component, while holding any non-Firm-A Big-4 firm out gives a fold rule near (cos $> 0.975$, dHash $\leq 3.76$) that classifies $0\%$ of the held-out firm in the upper component. The K=2 boundary is essentially a Firm-A-vs-others separator — direct evidence that the K=2 partition reflects firm-compositional rather than mechanistic structure.
|
||||||
Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.
|
|
||||||
|
|
||||||
Two convergent strands of evidence support the replication-dominated framing.
|
K=3 in contrast has a *reproducible component shape* at the descriptor-position level: across the four folds the C1 (low-cos / high-dHash) component cosine mean varies by at most $0.005$, the dHash mean by at most $0.96$, and the weight by at most $0.023$. Hard-posterior membership for the held-out firm is composition-sensitive (absolute differences $1.8$–$12.8$ pp across folds). Together with the §III-I.4 composition decomposition (no within-population bimodal antimode), the K=3 stability supports a descriptive reading: the Big-4 descriptor plane has a reproducible three-region partition that reflects how firm-compositional weight is distributed across the descriptor space, *not* a three-mechanism latent-class structure. We accordingly do not use K=3 hard-posterior membership as an operational classifier; we use it as the accountant-level descriptive summary that complements the deployed signature-level five-way classifier of §III-L.
|
||||||
First, the byte-level pair evidence: 145 Firm A signatures (from 50 distinct partners of 180 registered) have a byte-identical same-CPA match in a different audit report, with 35 of these matches spanning different fiscal years.
|
|
||||||
Independent hand-signing cannot produce byte-identical images across distinct reports, so these pairs directly establish image reuse within Firm A as a concrete, threshold-free phenomenon, and the 50/180 partner spread shows that replication is widespread rather than confined to a handful of CPAs.
|
|
||||||
Second, the signature-level distributional evidence: Firm A's per-signature cosine distribution is unimodal long-tail (Hartigan dip test $p = 0.17$) rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
|
|
||||||
The unimodal-long-tail *shape*, not the precise 92.5 / 7.5 split, is the structural evidence: it is consistent with a dominant high-similarity regime plus residual within-firm heterogeneity, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).
|
|
||||||
|
|
||||||
Two additional checks, reported in Section IV-G, are robust to threshold choice and complement the two primary strands:
|
## E. Three-Score Convergent Internal-Consistency
|
||||||
the held-out Firm A 70/30 validation (Section IV-F.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules (extreme rules are statistically indistinguishable; operational rules in the 85--95% band differ between folds by 1--5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure), and the threshold-independent partner-ranking analysis (Section IV-G.2) shows that Firm A auditor-years occupy 95.9% of the top decile of similarity-ranked auditor-years against a 27.8% baseline share---a 3.5$\times$ concentration ratio that uses only ordinal ranking and is independent of any absolute cutoff.
|
|
||||||
|
|
||||||
The replication-dominated framing is internally coherent with both pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise.
|
Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score, not a mechanism cluster posterior); the reverse-anchor cosine percentile under a non-Big-4 reference distribution; and the inherited Paper A box-rule less-replication-dominated rate. The three scores are *not* statistically independent measurements — they are deterministic functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency rather than external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The strength of the convergence (all pairwise $|\rho| > 0.87$) and its persistence at the signature level (Cohen $\kappa = 0.87$ between per-CPA-fit and per-signature-fit K=3 binary labels) are nevertheless informative: per-CPA aggregation does not collapse the broad three-region ordering, and three different summarisations of the descriptor space produce broadly concordant per-CPA rankings, with a residual non-Firm-A disagreement (the reverse-anchor cosine percentile ranks Firm D fractionally above Firm C, while the mixture posterior and the box-rule rate rank Firm C highest among non-Firm-A firms).
|
||||||
We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.
|
|
||||||
|
|
||||||
## D. The Style-Replication Gap
|
## F. Anchor-Based Multi-Level Calibration
|
||||||
|
|
||||||
Within the 71,656 documents exceeding cosine $0.95$, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
|
The operational specificity of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR replicates v3.x's per-comparison rate (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
|
||||||
A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
|
|
||||||
|
|
||||||
The 7.2% classified as "high style consistency" (cosine $> 0.95$ but dHash $> 15$) are particularly informative.
|
Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is $0.234$ (Wilson 95% $[0.190, 0.285]$): given the cosine gate, the structural dimension provides further per-comparison specificity at $\sim 4.3\times$ refinement. Second, the alert-rate sensitivity analysis (§III-L.5; Script 46) shows the inherited HC threshold is locally sensitive rather than plateau-stable (local gradient $\approx 25\times$ the median for cosine, $\approx 3.8\times$ for dHash); stakeholders requiring different specificity-alert-yield operating points can derive thresholds by inverting the ICCR curves (a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint gives per-signature ICCR $\approx 0.045$). The MC/HSC sub-band boundary at dHash$=15$, by contrast, *is* plateau-like (local-to-median ratio $\approx 0.08$), consistent with high-dHash-tail saturation.
|
||||||
Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions.
|
|
||||||
Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting.
|
|
||||||
Some may use signing pads or templates that further constrain variability without constituting image-level reproduction.
|
|
||||||
The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence.
|
|
||||||
|
|
||||||
## E. Value of a Replication-Dominated Calibration Group
|
## G. Pixel-Identity as a Hard Positive Anchor; Inherited Inter-CPA Negative Anchor Reframed as Coincidence Rate
|
||||||
|
|
||||||
The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
|
The only hard ground-truth subset in the corpus is pixel-identical signatures: those whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce byte-identical images, so these signatures are conservative-subset ground truth for the *replicated* class. On the Big-4 subset ($n = 262$ pixel-identical signatures), all three candidate checks — the inherited box rule, the K=3 hard label, and the reverse-anchor metric with a prevalence-calibrated cut — achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). We caution that this result is necessary but not sufficient: for the box rule it is close to tautological, because byte-identical neighbours have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region. The corresponding signature-level *negative* anchor evidence is developed in §III-L.1 above (v4 spike: cos$>0.95$ per-comparison ICCR $= 0.00060$, replicating v3.20.0's reported $0.0005$ under prior "FAR" terminology). We frame the per-comparison rate as a specificity proxy under the assumption that inter-CPA pairs constitute a clean negative anchor, and we document in §III-L.4 that this assumption is partially violated by within-firm cross-CPA template-like collision structures.
|
||||||
In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
|
|
||||||
Our approach uses practitioner background---one Big-4 firm reportedly relies predominantly on stamping or e-signing workflows---only as a *motivation* for selecting that firm as a candidate reference population; the calibration role is then established from the audit-report images themselves (byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency), so the calibration does not depend on the practitioner-background claim being externally verified (Section III-H).
|
|
||||||
|
|
||||||
This calibration strategy has broader applicability beyond signature analysis.
|
## H. Limitations
|
||||||
Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives.
|
|
||||||
The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates the within-firm heterogeneity visible in the unimodal-long-tail shape of Firm A's per-signature cosine distribution, and yields classification rates that are internally consistent with the data.
|
|
||||||
|
|
||||||
## F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation
|
Several limitations should be transparent. The first nine are v4.0-specific; the last five are inherited from v3.20.0 §V-G and still apply to the v4.0 pipeline.
|
||||||
|
|
||||||
A further methodological contribution is the combination of byte-level pixel identity as an annotation-free *conservative* gold positive and a large random-inter-CPA negative anchor.
|
*No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M).
|
||||||
Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is pair-level proof of image reuse and, modulo the narrow source-template edge case discussed in the seventh limitation below, a conservative positive for non-hand-signing without requiring human review.
|
|
||||||
In our corpus 310 signatures satisfied this condition.
|
|
||||||
We emphasize that byte-identical pairs are a *subset* of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways).
|
|
||||||
Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.
|
|
||||||
|
|
||||||
Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), which is a substantive improvement over the low-similarity same-CPA negative ($n = 35$) we originally considered.
|
*Inter-CPA negative-anchor assumption is partially violated and the violation is firm-dependent.* The cross-firm hit matrix of §III-L.4 shows that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs. Because the violation is firm-dependent, Firm A's per-firm ICCR is more contaminated by within-firm sharing than Firms B/C/D's; the per-firm B/C/D rates of $0.09$–$0.16$ are therefore closer to a clean specificity estimate than the pooled rate, and the Firm A vs Firms B/C/D contrast reflects both genuine firm heterogeneity and a firm-dependent proxy-contamination gradient.
|
||||||
The combination is a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself, provided that its generalization limits are acknowledged: FAR is informative, whereas recall is valid only for the conservative subset.
|
|
||||||
|
|
||||||
## G. Limitations
|
*Scope.* The v4.0 primary analyses are scoped to the Big-4 sub-corpus. We did not perform the full per-signature pool-normalised ICCR analysis at the full $n = 686$ scope; the §IV-K full-dataset Spearman re-run shows the K=3 $+$ box-rule rank-convergence is preserved at $n = 686$ but does not validate the Big-4 operational ICCRs, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope.
|
||||||
|
|
||||||
Several limitations should be acknowledged.
|
*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the inherited box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
|
||||||
|
|
||||||
First, comprehensive per-document ground truth labels are not available.
|
*Inherited rule components are not separately v4-validated.* The five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain their v3.20.0 calibration and capture-rate evidence; v4.0's anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-validated by v4.0's diagnostic battery.
|
||||||
The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class.
|
|
||||||
The low-similarity same-CPA anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70; we use the $\sim$50,000-pair inter-CPA negative anchor as the primary negative reference, which yields tight Wilson 95% CIs on FAR (Table X), but it too does not exhaust the set of true negatives (in particular, same-CPA hand-signed pairs with moderate cosine similarity are not sampled).
|
|
||||||
A manual-adjudication study concentrated at the decision boundary---for example 100--300 auditor-years stratified by cosine band---would further strengthen the recall estimate against the full positive class.
|
|
||||||
|
|
||||||
Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning.
|
*Deployed-rate excess is not a presumed true-positive rate.* The $\sim 44$-pp per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the inter-CPA proxy rate (HC: $0.18$) cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; within-firm template sharing inflates the inter-CPA proxy baseline). The gap is best read as a same-CPA repeatability signal.
|
||||||
While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.
|
|
||||||
|
|
||||||
Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions.
|
*A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.
|
||||||
In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity.
|
|
||||||
This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified.
|
|
||||||
|
|
||||||
Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements.
|
*K=3 hard-posterior membership is composition-sensitive.* The K=3 hard-posterior membership for any single firm varies by up to $12.8$ pp across LOOO folds. This is documented as a composition-sensitivity band rather than failure, but it means K=3 hard labels are not used as v4.0 operational classifier output; they are reported only as accountant-level descriptive characterisation.
|
||||||
While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.
|
|
||||||
|
|
||||||
Fifth, the max/min detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed.
|
*No partner-level mechanism attribution.* v4.0 reports population-level patterns; it does not perform partner-level mechanism attribution or report-level claims of intent. The signature-level outputs are signature-level quantities throughout. The within-firm cross-CPA collision concentration of §III-L.4 is consistent with template-like reuse but is not by itself diagnostic of deliberate sharing.
|
||||||
In the rare case that one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as the stamping or e-signature template, the pair correctly identifies image reuse but misattributes the non-hand-signed status to the source exemplar.
|
|
||||||
This misattribution affects at most one source document per template variant per CPA (the exemplar from which the template was produced), is not expected to be common given that stored signature templates are typically generated in a separate acquisition step rather than extracted from submitted audit reports, and does not materially affect aggregate capture rates at the firm level.
|
|
||||||
|
|
||||||
Sixth, our analyses remain at the signature level; we abstain from partner-level frequency inferences such as "X% of CPAs hand-sign in a given year."
|
*Transferred ImageNet features (inherited from v3.20.0).* The ResNet-50 feature extractor uses pre-trained ImageNet weights without signature-domain fine-tuning. While our backbone-ablation study (§IV-L, inherited from v3.20.0 §IV-I) and prior literature support the effectiveness of transferred ImageNet features for signature comparison, a signature-domain fine-tuned feature extractor could improve discriminative performance.
|
||||||
Per-signature labels in this paper are not translated to per-report or per-partner mechanism assignments (Section III-G).
|
|
||||||
The signature-level rates we report, including the 92.5% / 7.5% Firm A split and the year-by-year left-tail share of Section IV-G.1, should accordingly be read as signature-level quantities rather than partner-level frequencies.
|
|
||||||
|
|
||||||
Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
|
*Red-stamp HSV preprocessing artifacts (inherited from v3.20.0).* The red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. Blended pixels are replaced with white, potentially creating small gaps in signature strokes that could reduce dHash similarity. This bias would push classifications toward false negatives rather than false positives.
|
||||||
Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.
|
|
||||||
|
*Longitudinal scan / PDF / compression confounds (inherited from v3.20.0).* Scanning equipment, PDF generation software, and compression algorithms may have changed over the 2013–2023 study period, potentially affecting similarity measurements. While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.
|
||||||
|
|
||||||
|
*Source-exemplar misattribution in max/min pair logic (inherited from v3.20.0).* The max-cosine / min-dHash detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case where one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as a stamping or e-signature template, the pair correctly identifies image reuse but misattributes non-hand-signed status to the source exemplar. This affects at most one source document per template variant per CPA and is not expected to be common.
|
||||||
|
|
||||||
|
*Legal and regulatory interpretation (inherited from v3.20.0).* Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a jurisdiction-specific legal question. Our technical analysis can inform such determinations but cannot resolve them.
|
||||||
|
|||||||
+346
-408
@@ -1,5 +1,7 @@
|
|||||||
# IV. Experiments and Results
|
# IV. Experiments and Results
|
||||||
|
|
||||||
|
The v4.0 primary analyses (§IV-D through §IV-J) are scoped to the Big-4 sub-corpus (Firms A–D, $n = 437$ CPAs with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. The §IV-K Full-Dataset Robustness section reports the full-dataset (686 CPAs) variant of the K=3 mixture + Paper A box-rule Spearman analysis as a cross-scope robustness check. §IV-A through §IV-C report inherited corpus-wide v3.x material; §IV-L (feature backbone ablation) is also inherited. §IV-M consolidates the v4-new anchor-based ICCR calibration tables.
|
||||||
|
|
||||||
## A. Experimental Setup
|
## A. Experimental Setup
|
||||||
|
|
||||||
Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration.
|
Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration.
|
||||||
@@ -25,10 +27,12 @@ The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliabi
|
|||||||
| Processing rate | 43.1 docs/sec |
|
| Processing rate | 43.1 docs/sec |
|
||||||
-->
|
-->
|
||||||
|
|
||||||
|
The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in all §IV v4 primary analyses (§IV-D through §IV-J).
|
||||||
|
|
||||||
## C. All-Pairs Intra-vs-Inter Class Distribution Analysis
|
## C. All-Pairs Intra-vs-Inter Class Distribution Analysis
|
||||||
|
|
||||||
Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
|
Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
|
||||||
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-K).
|
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
|
||||||
Table IV summarizes the distributional statistics.
|
Table IV summarizes the distributional statistics.
|
||||||
|
|
||||||
<!-- TABLE IV: Cosine Similarity Distribution Statistics
|
<!-- TABLE IV: Cosine Similarity Distribution Statistics
|
||||||
@@ -54,417 +58,248 @@ We emphasize that pairwise observations are not independent---the same signature
|
|||||||
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
|
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
|
||||||
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
|
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
|
||||||
|
|
||||||
## D. Signature-Level Distributional Characterisation
|
## D. Big-4 Accountant-Level Distributional Characterisation
|
||||||
|
|
||||||
This section applies the threshold-estimator and density-smoothness diagnostic of Section III-I to the per-signature similarity distribution.
|
This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. All numbers below are direct re-statements from Scripts 32 / 34. The accountant-level dip-test rejection reported in Table V is, per §III-I.4 (Scripts 39b–39e), fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the v4-new composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration.
|
||||||
The joint reading is that per-signature similarity is a continuous quality spectrum rather than a clean two-mechanism mixture, which is why the operational classifier (Section III-K) anchors its cosine cut on the whole-sample Firm A P7.5 percentile rather than on any mixture-fit crossing.
|
|
||||||
|
**Table V.** Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32).
|
||||||
### 1) Hartigan Dip Test: Unimodality at the Signature Level
|
|
||||||
|
| Population | $n$ CPAs | $p_{\text{cos}}$ | $p_{\text{dHash}}$ | Interpretation |
|
||||||
Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
|
|---|---|---|---|---|
|
||||||
The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-signature best-match analyses (Tables V and XII, and the Firm-A per-signature rows of Tables XIII and XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
|
| **Big-4 pooled (primary)** | 437 | $< 5 \times 10^{-4}$ | $< 5 \times 10^{-4}$ | reject unimodality on both axes |
|
||||||
|
| Firm A pooled alone | 171 | 0.992 | 0.924 | unimodal |
|
||||||
<!-- TABLE V: Hartigan Dip Test Results
|
| Firms B + C + D pooled | 266 | 0.998 | 0.906 | unimodal |
|
||||||
| Distribution | N | dip | p-value | Verdict (α=0.05) |
|
| All non-Firm-A pooled | 515 | 0.998 | 0.907 | unimodal |
|
||||||
|--------------|---|-----|---------|------------------|
|
|
||||||
| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal |
|
Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by the bootstrap resolution $1 / 2000 = 5 \times 10^{-4}$ (Script 34 reports this as $p = 0.0000$; we report $p < 5 \times 10^{-4}$ to reflect the resolution). Single-firm dip statistics for Firms B, C, and D were not separately computed.
|
||||||
| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
|
|
||||||
| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
|
**Table VI.** Burgstahler-Dichev / McCrary density-smoothness diagnostic on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; $\alpha = 0.05$, two-sided).
|
||||||
| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
|
|
||||||
-->
|
| Population | Cosine: significant transition? | dHash: significant transition? |
|
||||||
|
|---|---|---|
|
||||||
Firm A's per-signature cosine distribution *fails to reject unimodality* ($p = 0.17$), a pattern consistent with a dominant high-similarity regime plus a long left tail attributable to within-firm heterogeneity in signing outputs (Section III-G discusses the scope of partner-level claims).
|
| **Big-4 pooled (primary)** | none ($p > 0.05$) | none ($p > 0.05$) |
|
||||||
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
|
| Firm A pooled alone | none | none |
|
||||||
The Firm A unimodal-long-tail finding is, in conjunction with the byte-identity, partner-ranking, and intra-report evidence reported below, consistent with the replication-dominated framing (Section III-H): a dominant high-similarity regime plus residual within-firm heterogeneity, rather than two cleanly separated mechanisms.
|
| Firms B + C + D pooled | none | one transition at $\overline{\text{dHash}} = 10.8$ |
|
||||||
|
| All non-Firm-A pooled | none | one transition at $\overline{\text{dHash}} = 6.6$ |
|
||||||
### 2) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic
|
|
||||||
|
The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). These off-Big-4 dHash transitions are scope-dependent and are not used as v4.0 operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes.
|
||||||
Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine distribution yields a nominally significant $Z^- \rightarrow Z^+$ transition at cosine 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample under the bin width ($0.005$ / $1$) used here.
|
|
||||||
Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
|
## E. Big-4 K=2 / K=3 Mixture Fits
|
||||||
First, the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
|
|
||||||
Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
|
This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings.
|
||||||
We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator.
|
|
||||||
|
**Table VII.** Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-J) and marginal-crossing bootstrap 95% confidence intervals.
|
||||||
### 3) Beta Mixture at Signature Level: A Forced Fit
|
|
||||||
|
| K=2 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight |
|
||||||
Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
|
|---|---|---|---|
|
||||||
For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
|
| K=2-a (low-cos / high-dHash position) | 0.954 | 7.14 | 0.689 |
|
||||||
Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
|
| K=2-b (high-cos / low-dHash position) | 0.983 | 2.41 | 0.311 |
|
||||||
Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
|
|
||||||
|
Marginal crossings (point + bootstrap 95% CI, $n_{\text{boot}} = 500$):
|
||||||
### 4) Joint Reading of the Three Diagnostics
|
|
||||||
|
| Axis | Point | Bootstrap median | 95% CI | CI half-width |
|
||||||
The three diagnostics agree that per-signature similarity does not form a clean two-mechanism mixture:
|
|---|---|---|---|---|
|
||||||
(i) the Hartigan dip test fails to reject unimodality for Firm A and rejects it for the heterogeneous-firm pooled sample;
|
| cos | 0.9755 | 0.9754 | $[0.9742, 0.9772]$ | 0.0015 |
|
||||||
(ii) BIC strongly prefers a 3-component over a 2-component Beta fit, so the 2-component crossing is a *forced fit* and the Beta-vs-logit-Gaussian disagreement (0.977 vs 0.999 for Firm A) reflects parametric-form sensitivity rather than a stable two-mechanism boundary;
|
| dHash | 3.755 | 3.763 | $[3.476, 3.969]$ | 0.246 |
|
||||||
(iii) the BD/McCrary procedure locates its candidate transition *inside* the non-hand-signed mode rather than between modes, and the transition is not bin-width-stable.
|
|
||||||
|
$\text{BIC}(K{=}2) = -1108.45$ (Script 34).
|
||||||
Table VI summarises the signature-level threshold-estimator outputs for cross-method comparison.
|
|
||||||
|
**Table VIII.** Big-4 K=3 mixture components (descriptive firm-compositional partition per §III-J; not mechanism clusters).
|
||||||
<!-- TABLE VI: Signature-Level Threshold-Estimator Summary
|
|
||||||
| Population | Method | Cosine threshold | dHash threshold | Status |
|
| K=3 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position |
|
||||||
|------------|--------|------------------|-----------------|--------|
|
|---|---|---|---|---|
|
||||||
| **Threshold estimators (signature-level distributional fits)** | | | | |
|
| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner |
|
||||||
| Firm A signature-level | KDE antimode + Hartigan dip (Section III-I.1) | undefined | — | unimodal at $\alpha=0.05$ ($p=0.169$); antimode not defined for unimodal data |
|
| C2 | 0.9558 | 6.66 | 0.536 | central region |
|
||||||
| Firm A signature-level | Beta-2 EM crossing (Section III-I.2) | 0.977 | — | forced fit; BIC strongly prefers $K{=}3$ ($\Delta\text{BIC} = 381$) |
|
| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner |
|
||||||
| Firm A signature-level | logit-Gaussian-2 crossing (robustness check) | 0.999 | — | forced fit; sharply inconsistent with Beta-2 crossing—reflects parametric-form sensitivity |
|
|
||||||
| Full-sample signature-l. | KDE antimode + Hartigan dip | (multiple modes) | — | multimodal ($p<0.001$); KDE crossover at full-sample is dominated by between-firm heterogeneity |
|
$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.
|
||||||
| Full-sample signature-l. | Beta-2 EM crossing | no crossing | — | forced fit; component densities do not cross over $[0,1]$ under recovered parameters |
|
|
||||||
| Full-sample signature-l. | logit-Gaussian-2 crossing | 0.980 | — | forced fit; BIC strongly prefers $K{=}3$ ($\Delta\text{BIC} = 10{,}175$) |
|
## F. Convergent Internal-Consistency Checks
|
||||||
| **Density-smoothness diagnostics (not threshold estimators)** | | | | |
|
|
||||||
| Firm A signature-level | BD/McCrary candidate transition (Section III-I.3) | 0.985 (bin 0.005)| 2.0 (bin 1) | bin-unstable across $\{0.003, 0.005, 0.010, 0.015\}$ (Appendix A); transition lies *inside* the non-hand-signed mode |
|
This section reports the empirical evidence for §III-K's three-score internal-consistency analysis. We re-emphasise the §III-K caveat: the three scores are deterministic functions of the same per-CPA descriptor pair $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ and are *not statistically independent measurements*. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth.
|
||||||
| Full-sample signature-l. | BD/McCrary candidate transition | 0.985 (bin 0.005) | 2.0 (bin 1) | bin-unstable across $\{0.003, 0.005, 0.010, 0.015\}$ (Appendix A) |
|
|
||||||
| **Reference: between-class KDE (different unit of analysis)** | | | | |
|
**Table IX.** Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, $n = 437$.
|
||||||
| All-pairs intra/inter (pair-level; Section IV-C) | KDE crossover | 0.837 | — | reference point for the Uncertain/Likely-hand-signed boundary in the operational classifier |
|
|
||||||
| **Operational classifier anchors and percentile cross-references** | | | | |
|
| Score pair | Spearman $\rho$ | $p$-value |
|
||||||
| Firm A whole-sample | P7.5 (operational anchor; Section III-K) | 0.95 | — | operational cosine cut for the five-way classifier |
|
|---|---|---|
|
||||||
| Firm A whole-sample | dHash$_\text{indep}$ P75 | — | 4 | informs the $\leq 5$ high-confidence band edge in the classifier |
|
| K=3 P(C1) vs Paper A box-rule less-replication-dominated rate | $+0.9627$ | $< 10^{-248}$ |
|
||||||
| Firm A whole-sample | dHash$_\text{indep}$ style-consistency ceiling | — | 15 | operational $> 15$ style-consistency boundary |
|
| Reverse-anchor cosine percentile vs Paper A box-rule less-replication-dominated rate | $+0.8890$ | $< 10^{-149}$ |
|
||||||
| Firm A calibration fold (70%) | cosine P5 (Section IV-F.2) | 0.9407 | — | calibration-fold cross-reference; held-out fold reports rates at this cut |
|
| K=3 P(C1) vs Reverse-anchor cosine percentile | $+0.8794$ | $< 10^{-142}$ |
|
||||||
| Firm A calibration fold (70%) | dHash$_\text{indep}$ P95 | — | 9 | calibration-fold cross-reference (Tables IX and XI report rates at the rounded $\leq 8$ cut for continuity) |
|
|
||||||
|
(Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on $n = 249$ non-Big-4 CPAs; reference centre $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$.
|
||||||
Read this table by *population × method*: each row reports one method applied to one population.
|
|
||||||
The first three blocks (threshold estimators; density-smoothness diagnostics; between-class KDE) are *characterisation* outputs; the bottom block is the operational anchor set used by the classifier of Section III-K.
|
**Table X.** Per-firm summary across the three feature-derived scores, Big-4.
|
||||||
The disagreement between Firm A Beta-2 (0.977) and Firm A logit-Gaussian-2 (0.999) is the parametric-form sensitivity referenced in the prose of Section IV-D.3; it cannot be resolved from the data because BIC rejects the underlying $K{=}2$ assumption itself.
|
|
||||||
-->
|
| Firm | $n$ CPAs | mean $P(\text{C1})$ | mean reverse-anchor score | mean Paper A less-replication-dominated rate |
|
||||||
|
|---|---|---|---|---|
|
||||||
Non-hand-signed replication quality is therefore best read as a continuous spectrum produced by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) acting on a common stored exemplar.
|
| Firm A | 171 | 0.0072 | $-0.9726$ | 0.1935 |
|
||||||
This finding has a direct methodological pay-off: it is *why* the operational cosine cut is anchored on the whole-sample Firm A P7.5 percentile (Section III-K), and it is *why* the byte-level pixel-identity anchor (Section IV-F.1) is the natural threshold-free positive reference for downstream validation.
|
| Firm B | 112 | 0.1410 | $-0.8201$ | 0.6962 |
|
||||||
|
| Firm C | 102 | 0.3110 | $-0.7672$ | 0.7896 |
|
||||||
## E. Calibration Validation with Firm A
|
| Firm D | 52 | 0.2406 | $-0.7125$ | 0.7608 |
|
||||||
|
|
||||||
Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
|
(Source: Script 38 per-firm summary; reverse-anchor score is sign-flipped so that *higher* values indicate deeper into the reference left tail = less replication-dominated relative to the non-Big-4 reference.)
|
||||||
Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).
|
|
||||||
|
The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A firms as less replication-dominated. The K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Score 1 and Score 3) place Firm C at the least-replication-dominated end of Big-4; the reverse-anchor cosine percentile (Score 2) ranks Firm D fractionally above Firm C. This residual within-Big-4-non-A disagreement is a design feature of the reverse-anchor metric: Score 2 measures only the marginal cosine percentile under the non-Big-4 reference, so a firm with a slightly higher cosine but a markedly different dHash distribution (Firm D vs Firm C) can score higher on Score 2 while scoring lower on Scores 1 and 3, both of which use both descriptors.
|
||||||
<!-- TABLE IX: Firm A Whole-Sample Capture Rates (consistency check, NOT external validation)
|
|
||||||
| Rule | Firm A rate | k / N |
|
**Table XI.** Per-signature Cohen $\kappa$ (binary collapse, replication-dominated vs less-replication-dominated), $n = 150{,}442$ Big-4 signatures.
|
||||||
|------|-------------|-------|
|
|
||||||
| **Cosine-only marginal rates** | | |
|
| Pair | Cohen $\kappa$ |
|
||||||
| cosine > 0.837 (all-pairs KDE crossover) | 99.93% | 60,408 / 60,448 |
|
|---|---|
|
||||||
| cosine > 0.9407 (calibration-fold P5) | 95.15% | 57,518 / 60,448 |
|
| Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) vs per-CPA K=3 hard label | 0.662 |
|
||||||
| cosine > 0.945 (calibration-fold P5 rounded) | 94.02% | 56,836 / 60,448 |
|
| Paper A binary high-confidence box rule vs per-signature K=3 hard label | 0.559 |
|
||||||
| cosine > 0.95 (operational; whole-sample Firm A P7.5) | 92.51% | 55,922 / 60,448 |
|
| Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 |
|
||||||
| **dHash-only marginal rates** | | |
|
|
||||||
| dHash_indep ≤ 5 (operational high-confidence cap) | 84.20% | 50,897 / 60,448 |
|
(Source: Script 39; verdict label `SIG_CONVERGENCE_MODERATE`.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evaluation (§IV-J).
|
||||||
| dHash_indep ≤ 8 (calibration-fold P95 rounded) | 95.17% | 57,527 / 60,448 |
|
|
||||||
| dHash_indep ≤ 15 (operational style-consistency boundary) | 99.83% | 60,348 / 60,448 |
|
## G. Leave-One-Firm-Out Reproducibility
|
||||||
| **Operational classifier dual rules (Section III-K)** | | |
|
|
||||||
| cosine > 0.95 AND dHash_indep ≤ 5 (high-confidence non-hand-signed) | 81.70% | 49,389 / 60,448 |
|
This section reports the firm-level cross-validation evidence motivating §III-J's "K=3 descriptive, not operational" framing.
|
||||||
| cosine > 0.95 AND 5 < dHash_indep ≤ 15 (moderate-confidence) | 10.76% | 6,503 / 60,448 |
|
|
||||||
| cosine > 0.95 AND dHash_indep ≤ 15 (combined non-hand-signed) | 92.46% | 55,892 / 60,448 |
|
**Table XII.** K=2 leave-one-firm-out across the four Big-4 folds.
|
||||||
| **Calibration-fold-adjacent cross-reference (not the operational classifier rule)** | | |
|
|
||||||
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | 54,370 / 60,448 |
|
| Held-out firm | $n_{\text{train}}$ | $n_{\text{held}}$ | Fold rule (cos cut, dHash cut) | Held-out classified as templated by fold rule |
|
||||||
|
|---|---|---|---|---|
|
||||||
All rates computed exactly from the full Firm A sample (N = 60,448 signatures); per-rule counts and codes are available in the supplementary materials.
|
| Firm A | 266 | 171 | cos $> 0.9380$ AND dHash $\leq 8.79$ | $171 / 171 = 100.00\%$ ($95\%$ Wilson $[97.80\%, 100.00\%]$) |
|
||||||
The two operational dHash cuts ($\leq 5$ for the high-confidence cap and $\leq 15$ for the style-consistency boundary) come from the classifier definition in Section III-K and are the rules used by the five-way classifier of Tables XII and XVII; the dHash $\leq 8$ row is *not* an operational classifier rule but a calibration-fold-adjacent reference (Section IV-F.2 calibration-fold dHash P95 = 9; we report the $\leq 8$ rate as the integer-valued threshold immediately below P95, included here so that Firm A capture in the calibration-fold-P95 neighbourhood can be read off the same table).
|
| Firm B | 325 | 112 | cos $> 0.9744$ AND dHash $\leq 3.98$ | $0 / 112 = 0\%$ ($95\%$ Wilson $[0\%, 3.32\%]$) |
|
||||||
-->
|
| Firm C | 335 | 102 | cos $> 0.9752$ AND dHash $\leq 3.75$ | $0 / 102 = 0\%$ ($95\%$ Wilson $[0\%, 3.63\%]$) |
|
||||||
|
| Firm D | 385 | 52 | cos $> 0.9756$ AND dHash $\leq 3.74$ | $0 / 52 = 0\%$ ($95\%$ Wilson $[0\%, 6.88\%]$) |
|
||||||
Table IX is a whole-sample consistency check rather than an external validation: the cosine cut $0.95$ and the operational dHash band edges ($\leq 5$ high-confidence cap and $\leq 15$ style-consistency boundary) are themselves anchored to the whole-sample Firm A distribution described in Section III-K (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
|
|
||||||
The operational dual rule used by the five-way classifier of Section III-K---cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ (the union of the high-confidence and moderate-confidence non-hand-signed buckets)---captures 92.46% of Firm A; the high-confidence component alone (cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 5$) captures 81.70%.
|
(Source: Script 36.) Across-fold cosine crossing: pairwise range $[0.9380, 0.9756]$, range = $0.0376$; max absolute deviation from the across-fold mean is $0.028$. This exceeds the report's $0.005$ across-fold stability tolerance by $5.6\times$ and is much larger than the full-Big-4 bootstrap CI half-width of $0.0015$. Together with the all-or-nothing held-out classification pattern (Firm A held out $\Rightarrow$ all held-out CPAs templated; any non-Firm-A firm held out $\Rightarrow$ none templated), this indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
|
||||||
For continuity with prior calibration-fold reporting (Section IV-F.2 reports the calibration-fold rate at the calibration-fold-P95-adjacent cut $\text{dHash}_\text{indep} \leq 8$), Table IX also lists the cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ rate of 89.95%; this is *not* the operational classifier rule but a cross-reference value.
|
|
||||||
Both operational rates are consistent with the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 92.5% / 7.5% signature-level split (Section III-H).
|
**Table XIII.** K=3 leave-one-firm-out: C1 component shape and held-out membership.
|
||||||
Section IV-F.2 reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
|
|
||||||
|
| Held-out firm | C1 cos (fit) | C1 dHash (fit) | C1 weight (fit) | Held-out C1 hard-label rate | Full-Big-4 baseline C1% | Absolute difference |
|
||||||
## F. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
|
|---|---|---|---|---|---|---|
|
||||||
|
| Full-Big-4 baseline | 0.9457 | 9.17 | 0.143 | — | — | — |
|
||||||
We report three validation analyses corresponding to the anchors of Section III-J.
|
| Firm A held out | 0.9425 | 10.13 | 0.145 | $4.68\%$ | $0.00\%$ | $4.68$ pp |
|
||||||
|
| Firm B held out | 0.9441 | 9.16 | 0.127 | $7.14\%$ | $8.93\%$ | $1.76$ pp |
|
||||||
### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor
|
| Firm C held out | 0.9504 | 8.41 | 0.126 | $36.27\%$ | $23.53\%$ | $12.77$ pp |
|
||||||
|
| Firm D held out | 0.9439 | 9.29 | 0.120 | $17.31\%$ | $11.54\%$ | $5.81$ pp |
|
||||||
Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G.
|
|
||||||
Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; reproduction artifact for this Firm A decomposition is listed in Appendix B.
|
(Source: Script 37; verdict label `P2_PARTIAL`.) Component shape is reproducible across folds: max deviation of C1 cosine = $0.005$, C1 dHash = $0.96$, C1 weight = $0.023$. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is $12.77$ pp at the Firm C held-out fold, exceeding the report's $5$ pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).
|
||||||
As the gold-negative anchor we sample 50,000 i.i.d. random cross-CPA signature pairs from the full 168,755-signature matched corpus (inter-CPA cosine: mean $= 0.763$, $P_{95} = 0.886$, $P_{99} = 0.915$, max $= 0.992$).
|
|
||||||
Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
|
## H. Pixel-Identity Positive-Anchor Miss Rate
|
||||||
We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
|
|
||||||
The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold.
|
This section reports the only hard-ground-truth subset analysis available in the corpus: the positive-anchor miss rate against $n = 262$ Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4).
|
||||||
We do not report an Equal Error Rate: EER is meaningful only when the positive and negative error-rate curves cross in a nontrivial interior region, but byte-identical positives all sit at cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$. An EER calculation against this anchor would be arithmetic tautology rather than biometric performance, and we therefore omit it.
|
|
||||||
|
**Table XIV.** Positive-anchor miss rate, $n = 262$ Big-4 byte-identical signatures.
|
||||||
<!-- TABLE X: Cosine Threshold Sweep — FAR Against 50,000 Inter-CPA Negative Pairs
|
|
||||||
| Threshold | FAR | FAR 95% Wilson CI |
|
| Classifier | Misclassified as less-replication-dominated | Miss rate | Wilson 95% CI |
|
||||||
|-----------|-----|-------------------|
|
|---|---|---|---|
|
||||||
| 0.837 (all-pairs KDE crossover) | 0.2101 | [0.2066, 0.2137] |
|
| Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ |
|
||||||
| 0.900 | 0.0250 | [0.0237, 0.0264] |
|
| K=3 per-CPA hard label (C3 = high-cos / low-dHash; descriptive) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ |
|
||||||
| 0.945 (calibration-fold P5 rounded) | 0.0008 | [0.0006, 0.0011] |
|
| Reverse-anchor (prevalence-calibrated cut) | $0 / 262$ | $0\%$ | $[0\%, 1.45\%]$ |
|
||||||
| 0.950 (whole-sample Firm A P7.5; operational cut) | 0.0005 | [0.0003, 0.0007] |
|
|
||||||
| 0.977 (Firm A Beta-2 forced-fit crossing; Section IV-D) | 0.00014 | [0.00007, 0.00029] |
|
(Source: Script 40.) Per-firm breakdown of the byte-identical subset: Firm A 145; Firm B 8; Firm C 107; Firm D 2. All three candidate scores correctly assign every byte-identical signature to the replicated class.
|
||||||
| 0.985 (BD/McCrary candidate transition; Appendix A) | 0.00004 | [0.00001, 0.00015] |
|
|
||||||
|
We caution that for the Paper A box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region); v3.20.0 §V-F discusses this conservative-subset caveat at length and we retain that discussion. The reverse-anchor cut is chosen by *prevalence calibration* against the inherited box rule's overall replicated rate of $49.58\%$ across Big-4 signatures; this is a documented v4.0 limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
|
||||||
Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
|
|
||||||
-->
|
## I. Inter-CPA Pair-Level Coincidence Rate (Big-4 spike + inherited corpus-wide)
|
||||||
|
|
||||||
Two caveats apply.
|
The signature-level inter-CPA pair-level coincidence-rate analysis (reported in v3.20.0 §IV-F.1, Table X as "FAR") is inherited and extended in v4.0. v4.0 retroactively reframes the metric as **inter-CPA pair-level coincidence rate (ICCR)** rather than "False Acceptance Rate" because the corpus does not provide signature-level ground-truth negative labels; the inter-CPA negative-anchor assumption underpinning the metric is itself partially violated by within-firm cross-CPA template-like collision structures (§III-L.4). The v3.20.0 corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs reported a per-comparison rate of $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$) at the cosine cut $0.95$.
|
||||||
First, the byte-identical positive anchor referenced above is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
|
|
||||||
A would-be FRR computed against this subset is definitionally $0$ at every threshold below $1$ (since byte-identical pairs have cosine $\approx 1$), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
|
v4.0 additionally reports the §III-L.1 Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs; Script 40b), which replicates and extends the v3 result and adds the structural dimension (dHash) and joint-rule rates. The §III-L.1 numbers are referenced rather than duplicated here; the consolidated v4-new ICCR calibration appears in §IV-M Tables XXI–XXVI.
|
||||||
Second, the 0.945 / 0.95 thresholds are derived from the Firm A whole-sample and calibration-fold percentiles rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
|
|
||||||
The very low FAR at the operational cut is therefore informative about specificity against a realistic inter-CPA negative population.
|
## J. Five-Way Per-Signature + Document-Level Classification Output
|
||||||
|
|
||||||
### 2) Held-Out Firm A Validation (within-Firm-A sampling variance disclosure)
|
This section reports the §III-L five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. The five-way category definitions are inherited unchanged from v3.20.0 §III-K (now §III-L); see §III-L for the cosine and dHash cuts.
|
||||||
|
|
||||||
We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
|
**Table XV.** Five-way per-signature category counts, Big-4 sub-corpus, $n = 150{,}442$ classified.
|
||||||
The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two registered Firm A partners whose signatures in the corpus are singletons (only one signature each, so the per-signature best-match cosine is undefined and they do not appear in the same-CPA matched-signature table that script `24_validation_recalibration.py` reads); they are therefore not represented in either fold by construction rather than by an explicit exclusion rule.
|
|
||||||
Thresholds are re-derived from calibration-fold percentiles only.
|
| Category | Long name | $n$ signatures | % of classified |
|
||||||
Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
|
|---|---|---|---|
|
||||||
|
| HC | High-confidence non-hand-signed | 74,593 | 49.58% |
|
||||||
<!-- TABLE XI: Held-Out vs Calibration Firm A Capture Rates and Generalization Test
|
| MC | Moderate-confidence non-hand-signed | 39,817 | 26.47% |
|
||||||
| Rule | Calibration 70% fold (CI) | Held-out 30% fold (CI) | 2-prop z | p | k/n calib | k/n held |
|
| HSC | High style consistency | 314 | 0.21% |
|
||||||
|------|---------------------------|-------------------------|----------|---|-----------|----------|
|
| UN | Uncertain | 35,480 | 23.58% |
|
||||||
| cosine > 0.837 | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31 | 0.756 n.s. | 45,087/45,116 | 15,321/15,332 |
|
| LH | Likely hand-signed | 238 | 0.16% |
|
||||||
| cosine > 0.9407 (calib-fold P5) | 94.99% [94.79%, 95.19%] | 95.63% [95.29%, 95.94%] | -3.19 | 0.001 | 42,856/45,116 | 14,662/15,332 |
|
|
||||||
| cosine > 0.945 (calib-fold P5 rounded) | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54 | <0.001 | 42,305/45,116 | 14,531/15,332 |
|
(Source: Script 42; 11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded. The $150{,}442$ vs $150{,}453$ distinction — descriptor-complete vs vector-complete — recurs across §IV: descriptor-complete analyses (§IV-D through §IV-J, all using accountant-level aggregates or per-signature category counts derived from the same 150,442-signature substrate) use $n = 150{,}442$; vector- or pair-recomputed analyses (§IV-M.2 Table XXI, §IV-M.3 Table XXII, §IV-M.5 Tables XXIV–XXV; Scripts 40b, 43, 44) use $n = 150{,}453$ because their pair- or pool-level computations load all vector-complete signatures including those failing the descriptor-complete filter. See §III-G for the sample-size reconciliation.)
|
||||||
| cosine > 0.950 (whole-sample P7.5; operational cut) | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97 | <0.001 | 41,570/45,116 | 14,352/15,332 |
|
|
||||||
| dHash_indep ≤ 5 | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001 | 37,430/45,116 | 13,467/15,332 |
|
**Per-firm five-way breakdown (% within firm).**
|
||||||
| dHash_indep ≤ 8 | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45 | <0.001 | 42,788/45,116 | 14,739/15,332 |
|
|
||||||
| dHash_indep ≤ 9 (calib-fold P95) | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07 | <0.001 | 43,604/45,116 | 14,945/15,332 |
|
| Firm | HC | MC | HSC | UN | LH | total signatures |
|
||||||
| dHash_indep ≤ 15 | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31 | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
|
|---|---|---|---|---|---|---|
|
||||||
| cosine > 0.95 AND dHash_indep ≤ 8 (calibration-fold P95-adjacent reference; P95 = 9) | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60 | <0.001 | 40,335/45,116 | 14,035/15,332 |
|
| Firm A | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 |
|
||||||
| cosine > 0.95 AND dHash_indep ≤ 15 (operational classifier rule, Section III-K) | 92.09% [91.84%, 92.34%] | 93.56% [93.16%, 93.93%] | -5.93 | <0.001 | 41,548/45,116 | 14,344/15,332 |
|
| Firm B | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 |
|
||||||
|
| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
|
||||||
Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. Counts and z/p values are reproducible from the supplementary materials (fixed random seed).
|
| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |
|
||||||
-->
|
|
||||||
|
(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVI: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVI). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).
|
||||||
We report fold-versus-fold comparisons rather than fold-versus-whole-sample comparisons, because the whole-sample rate is a weighted average of the two folds and therefore cannot, in general, fall inside the Wilson CI of either fold when the folds differ in rate; the correct generalization reference is the calibration fold, which produced the thresholds.
|
|
||||||
|
**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.20.0 worst-case rule (HC > MC > HSC > UN > LH; §III-L). v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset).
|
||||||
Under this proper test the two extreme rules agree across folds (cosine $> 0.837$ and $\text{dHash}_\text{indep} \leq 15$; both $p > 0.7$).
|
|
||||||
The operationally relevant rules in the 85–95% capture band differ between folds by 1–5 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).
|
**Table XIX.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs.
|
||||||
Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 85–99% range has a held-out counterpart in the 87–99% range, and the calibration-fold-adjacent reference rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ (the integer cut immediately below the calibration-fold dHash P95 of 9) captures 89.40% of the calibration fold and 91.54% of the held-out fold; the operational classifier rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K captures still higher rates in both folds (calibration 92.09%, 41,548 / 45,116; held-out 93.56%, 14,344 / 15,332).
|
|
||||||
The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity: the random 30% CPA sample evidently contained proportionally more high-replication CPAs.
|
| Category | Long name | $n$ documents | % |
|
||||||
We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-G.2) is the cross-check that is robust to this fold variance.
|
|---|---|---|---|
|
||||||
|
| HC | High-confidence non-hand-signed | 46,857 | 62.28% |
|
||||||
### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$
|
| MC | Moderate-confidence non-hand-signed | 19,667 | 26.14% |
|
||||||
|
| HSC | High style consistency | 167 | 0.22% |
|
||||||
The per-signature classifier (Section III-K) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P7.5 heuristic (i.e., 7.5% of whole-sample Firm A signatures lie at or below 0.95; see Section III-H).
|
| UN | Uncertain | 8,524 | 11.33% |
|
||||||
We report a sensitivity check in which this round-number cut is replaced by the slightly stricter calibration-fold P5 rounded value cos $> 0.945$ (calibration-fold P5 = 0.9407, see Table XI).
|
| LH | Likely hand-signed | 18 | 0.02% |
|
||||||
Table XII reports the five-way classifier output under each cut.
|
|
||||||
|
(Source: Script 42 document-level table; 379 of 75,233 PDFs carried signatures from more than one Big-4 firm and are reported in the single-firm-PDF per-firm breakdown of the script CSV but pooled into the overall counts here.)
|
||||||
<!-- TABLE XII: Classifier Sensitivity to the Operational Cosine Cut (All-Sample Five-Way Output, N = 168,740 signatures)
|
|
||||||
| Cosine cut | High-confidence | Moderate-confidence | High style consistency | Uncertain | Likely hand-signed |
|
**Per-firm document-level breakdown (single-firm PDFs only).**
|
||||||
|------------|-----------------|---------------------|------------------------|-----------|--------------------|
|
|
||||||
| cos > 0.940 | 81,069 (48.04%) | 55,308 (32.78%) | 801 (0.47%) | 31,026 (18.39%) | 536 (0.32%) |
|
| Firm | HC | MC | HSC | UN | LH | total docs |
|
||||||
| cos > 0.945 | 79,278 (46.98%) | 50,001 (29.63%) | 665 (0.39%) | 38,260 (22.67%) | 536 (0.32%) |
|
|---|---|---|---|---|---|---|
|
||||||
| cos > 0.950 (operational) | 76,984 (45.62%) | 43,906 (26.02%) | 546 (0.32%) | 46,768 (27.72%) | 536 (0.32%) |
|
| Firm A | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 |
|
||||||
| cos > 0.960 | 70,250 (41.63%) | 29,450 (17.45%) | 288 (0.17%) | 68,216 (40.43%) | 536 (0.32%) |
|
| Firm B | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 |
|
||||||
| cos > 0.970 | 60,247 (35.70%) | 14,865 ( 8.81%) | 117 (0.07%) | 92,975 (55.10%) | 536 (0.32%) |
|
| Firm C | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 |
|
||||||
| cos > 0.985 | 37,368 (22.15%) | 2,231 ( 1.32%) | 10 (0.01%) | 128,595 (76.21%) | 536 (0.32%) |
|
| Firm D | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 |
|
||||||
|
|
||||||
The dHash band edges ($\leq 5$ for high-confidence, $5 < \text{dHash}_\text{indep} \leq 15$ for moderate-confidence, $> 15$ for style) are held fixed across the grid; only the cosine cut varies. The Likely-hand-signed count is invariant across the grid because it depends only on the all-pairs KDE crossover cosine $= 0.837$.
|
(Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)
|
||||||
-->
|
|
||||||
|
The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is **not separately validated by Scripts 38–40**, which evaluated only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The v3.20.0 capture-rate calibration evidence for the moderate band (v3.20.0 Tables IX, XI, XII, XII-B) is carried into v4.0 by reference and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
|
||||||
At the aggregate firm-level, the calibration-fold-adjacent reference dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
|
|
||||||
The operational classifier rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K captures 92.46% under the 0.95 cut and 93.97% under the 0.945 cut---a shift of 1.51 percentage points.
|
**Table XVI.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.
|
||||||
|
|
||||||
Reading the wider grid in Table XII: the High-confidence and Moderate-confidence shares shift by less than 5 percentage points across the 0.940-0.950 neighbourhood, while pushing the cosine cut to 0.970 or 0.985 produces qualitatively different classifier behaviour (Moderate-confidence collapses from 26.02% at $0.95$ to 8.81% at $0.97$ and 1.32% at $0.985$, with the displaced mass landing in Uncertain rather than reclassifying out of the corpus).
|
| Firm | $n$ | C1 (low-cos / high-dHash) | C2 (central) | C3 (high-cos / low-dHash) | C1 % | C3 % |
|
||||||
The classifier output is therefore robust to small (~0.005-cosine) perturbations of the operational cut but not to wholesale reanchoring at the threshold-estimator outputs of Section IV-D, which is consistent with our reading that those outputs are not classifier thresholds.
|
|---|---|---|---|---|---|---|
|
||||||
At the per-signature categorization level, replacing 0.95 by 0.945 reclassifies 8,508 signatures (5.04% of the corpus) out of the Uncertain band; 6,095 of them migrate to Moderate-confidence non-hand-signed, 2,294 to High-confidence non-hand-signed, and 119 to High style consistency.
|
| Firm A | 171 | 0 | 30 | 141 | $0.00\%$ | $82.46\%$ |
|
||||||
The Likely-hand-signed category is unaffected because it depends only on the fixed all-pairs KDE crossover cosine $= 0.837$.
|
| Firm B | 112 | 10 | 102 | 0 | $8.93\%$ | $0.00\%$ |
|
||||||
The High-confidence non-hand-signed share grows from 45.62% to 46.98%.
|
| Firm C | 102 | 24 | 77 | 1 | $23.53\%$ | $0.98\%$ |
|
||||||
|
| Firm D | 52 | 6 | 45 | 1 | $11.54\%$ | $1.92\%$ |
|
||||||
We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within a 0.005-cosine neighbourhood of the Firm A P7.5 anchor, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.
|
|
||||||
|
(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).
|
||||||
To make the operating-point selection (Section III-K) auditable rather than presented as a single fixed value, Table XII-B reports the capture-vs-FAR tradeoff over the candidate threshold grid spanning the calibration-fold P5 (0.9407), its rounded value (0.945), the operational anchor (0.95), the Firm A Beta-2 forced-fit crossing from Section IV-D.3 (0.977), and the BD/McCrary candidate transition from Section IV-D.2 (0.985).
|
|
||||||
For each grid point we report Firm A capture (under both the cosine-only marginal and the operational dual rule cos $> t$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K), non-Firm-A capture (the cosine-only marginal in the 108,292 non-Firm-A matched signatures), and inter-CPA FAR with Wilson 95% CI against the 50,000-pair anchor of Section IV-F.1.
|
**Document-level worst-case aggregation outputs are reported in Table XIX above.**
|
||||||
|
|
||||||
<!-- TABLE XII-B: Cosine-Threshold Tradeoff: Capture vs Inter-CPA FAR
|
## K. Full-Dataset Robustness (light scope)
|
||||||
| Cosine cut t | Firm A capture (cos > t) | Firm A capture (cos > t AND dHash_indep ≤ 15) | Non-Firm-A capture (cos > t) | Inter-CPA FAR | Inter-CPA FAR Wilson 95% CI |
|
|
||||||
|--------------|--------------------------|------------------------------------------------|------------------------------|---------------|------------------------------|
|
This section reports the v4.0 reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + Paper A operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the v4.0 K=3 + Paper A convergence reproduces at the wider scope. The §III-L five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band is documented as inherited from v3.x calibration in §IV-J.
|
||||||
| 0.9407 (calibration-fold P5) | 95.15% (57,518/60,448) | 95.09% (57,482/60,448) | 72.68% (78,710/108,292) | 0.00126 | [0.00099, 0.00161] |
|
|
||||||
| 0.945 (calibration-fold P5 rounded) | 94.02% (56,836/60,448) | 93.97% (56,804/60,448) | 67.51% (73,108/108,292) | 0.00082 | [0.00061, 0.00111] |
|
**Table XVII.** K=3 component comparison, Big-4 sub-corpus vs full dataset.
|
||||||
| 0.95 (whole-sample Firm A P7.5; **operational cut**) | **92.51%** (55,922/60,448) | **92.46%** (55,892/60,448) | 60.50% (65,514/108,292) | **0.00050** | [0.00034, 0.00074] |
|
|
||||||
| 0.977 (Firm A Beta-2 forced-fit crossing) | 74.53% (45,050/60,448) | 74.51% (45,038/60,448) | 13.14% (14,233/108,292) | 0.00014 | [0.00007, 0.00029] |
|
| K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full |
|
||||||
| 0.985 (BD/McCrary candidate transition) | 55.27% (33,409/60,448) | 55.26% (33,406/60,448) | 5.73% (6,200/108,292) | 0.00004 | [0.00001, 0.00015] |
|
|---|---|---|---|
|
||||||
|
| C1 (low-cos / high-dHash) | 0.9457 / 9.17 / 0.143 | 0.9278 / 11.17 / 0.284 | $\lvert\Delta\rvert$ cos 0.018, dHash 1.99, wt 0.141 |
|
||||||
Inter-CPA FAR computed against 50,000 i.i.d. inter-CPA pairs (random seed 42, reproducing the anchor of Section IV-F.1 / Table X). Capture and FAR percentages are exact ratios of the displayed integer counts; gap arithmetic in the surrounding prose is computed from those exact counts and rounded to two decimal places. The dual-rule column is the operational classifier rule of Section III-K; for cuts above the dHash-15 saturation point (Firm A dHash$_\text{indep}$ $> 15$ rate is only 0.17%, Table IX), the dual-rule and cosine-only columns coincide to within the dHash$_\text{indep}$ $> 15$ residual.
|
| C2 (central) | 0.9558 / 6.66 / 0.536 | 0.9535 / 6.99 / 0.512 | $\lvert\Delta\rvert$ cos 0.002, dHash 0.33, wt 0.024 |
|
||||||
-->
|
| C3 (high-cos / low-dHash) | 0.9826 / 2.41 / 0.321 | 0.9826 / 2.40 / 0.205 | $\lvert\Delta\rvert$ cos 0.000, dHash 0.01, wt 0.117 |
|
||||||
|
|
||||||
Reading Table XII-B, three patterns motivate the choice of $0.95$ as the operating point.
|
(Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.)
|
||||||
First, *Firm A capture* on the operational dual rule decays smoothly from 95.09% at $t = 0.9407$ to 55.26% at $t = 0.985$.
|
|
||||||
Relaxing the cut from $0.95$ to $0.945$ buys 1.51 percentage points of additional Firm A capture, and to $0.9407$ buys 2.63 percentage points; tightening from $0.95$ to $0.977$ costs 17.96 percentage points and to $0.985$ costs 37.20 percentage points.
|
**Table XVIII.** Spearman rank correlation between K=3 P(C1) and Paper A operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset.
|
||||||
The selected cut at $0.95$ is the strictest cut on this grid at which Firm A capture remains above $90\%$ on the operational dual rule.
|
|
||||||
Second, *inter-CPA FAR* is small in absolute terms across the entire candidate grid ($0.00126$ at $0.9407$, falling to $0.00004$ at $0.985$): under any of these operating points the classifier's specificity against random cross-CPA pairs is in the per-mille range or better, so FAR alone does not determine the choice.
|
| Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs Paper A less-replication-dominated rate) | $p$-value |
|
||||||
The marginal FAR cost of relaxing from $0.95$ to $0.945$ is $+0.00032$ ($25 \to 41$ false positives per 50,000 pairs) and to $0.9407$ is $+0.00076$ ($25 \to 63$); the marginal FAR savings from tightening to $0.977$ and $0.985$ are $-0.00036$ and $-0.00046$ respectively.
|
|---|---|---|---|
|
||||||
The FAR savings from going stricter are small in absolute terms compared with the corresponding Firm A capture loss, which makes $0.95$ a balanced operating point on this grid rather than a uniquely optimal one.
|
| Big-4 (primary) | 437 | $+0.9627$ | $< 10^{-248}$ |
|
||||||
Third, *non-Firm-A capture* (the cosine-only marginal in the 108,292 non-Firm-A signatures) decays from 67.51% at $0.945$ to 60.50% at $0.95$, 13.14% at $0.977$, and 5.73% at $0.985$.
|
| Full dataset | 686 | $+0.9558$ | $< 10^{-300}$ |
|
||||||
The Firm-A-minus-non-Firm-A gap widens with strictness through $0.977$ and then contracts (22.41 percentage points at $0.9407$; 26.46 at $0.945$; 31.97 at $0.95$; 61.36 at $0.977$; 49.54 at $0.985$): on the $0.95 \to 0.977$ segment non-Firm-A capture falls faster than Firm A capture in absolute terms ($-47.35$ vs $-17.96$ percentage points), so the widening is dominated by non-Firm-A removal rather than by an intrinsic property of Firm A; on the $0.977 \to 0.985$ segment Firm A capture falls faster than non-Firm-A's already-low residual, so the gap contracts.
|
| $\lvert\rho_{\text{full}} - \rho_{\text{Big-4}}\rvert$ | — | $0.0069$ | — |
|
||||||
We do *not* read the gap pattern as evidence for a particular cut; it is reported here as cross-firm replication heterogeneity rather than as a selection criterion.
|
|
||||||
The operating point at $0.95$ is therefore a defensible---not unique---selection in this neighbourhood, motivated by (i) keeping Firm A capture above $90\%$ on the operational dual rule, (ii) achieving an FAR of $0.0005$ at which marginal further savings from tightening are small relative to the corresponding capture loss, and (iii) preserving the interpretive transparency of the whole-sample Firm A P7.5 reading.
|
(Source: Script 41.)
|
||||||
It is *not* derived from the threshold-estimator outputs of Section IV-D, which the data do not support as classifier thresholds.
|
|
||||||
|
**Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the Paper A box-rule less-replication-dominated rate are preserved at the full scope. Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more non-templated CPAs (mid/small firms); C1 (low-cos / high-dHash) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + Paper A convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the v4.0 primary methodology is restricted to Big-4 by design (§III-G item 4).
|
||||||
The paper therefore retains cos $> 0.95$ as the primary operational cut and reports the 0.945 result of Table XII as a sensitivity check rather than as a deployed alternative; downstream document-level rates (Table XVII) and intra-report agreement (Table XVI) are robust to moderate cutoff shifts within the 0.945--0.95 neighbourhood as long as the same cutoff is applied uniformly across firms.
|
|
||||||
|
## L. Ablation Study: Feature Backbone Comparison
|
||||||
## G. Additional Firm A Benchmark Validation
|
|
||||||
|
|
||||||
Before presenting the three threshold-robust analyses, Fig. 4 summarises the per-firm yearly per-signature best-match cosine distribution that motivates them.
|
|
||||||
The left panel reports the mean per-signature best-match cosine within each firm bucket and fiscal year (a threshold-free statistic); the right panel reports the share of each firm-bucket-year with per-signature best-match cosine $\geq 0.95$ (the operational cut of Section III-K).
|
|
||||||
Both panels show Firm A above the other Big-4 firms in every year of the 2013-2023 sample, with non-Big-4 firms below all four Big-4 firms throughout, and the cross-firm ordering is stable across the sample period.
|
|
||||||
The mean-cosine separation between Firm A and the other Big-4 firms is on the order of 0.02-0.04 throughout the sample (e.g., 2013: Firm A $0.9733$ vs Firm B $0.9498$, Firm C $0.9464$, Firm D $0.9395$, Non-Big-4 $0.9227$; 2023: $0.9860$ vs $0.9668$, $0.9662$, $0.9525$, $0.9346$); the share-above-0.95 separation is wider (2013: Firm A $87.2\%$ vs $61.8\%$, $56.2\%$, $38.5\%$, $27.5\%$).
|
|
||||||
This visual is the most direct cross-firm evidence in the paper that Firm A's high-similarity behaviour is firm-specific rather than corpus-wide; the three subsections below decompose this gap along three threshold-free or threshold-robust dimensions.
|
|
||||||
|
|
||||||
<!-- FIGURE 4: Per-firm yearly per-signature best-match cosine
|
|
||||||
File: reports/figures/fig_yearly_big4_comparison.png (and .pdf)
|
|
||||||
Generated by: signature_analysis/30_yearly_big4_comparison.py
|
|
||||||
Caption: Per-firm yearly per-signature best-match cosine, 2013-2023.
|
|
||||||
(a) Mean per-signature best-match cosine by firm bucket and fiscal year
|
|
||||||
(threshold-free). (b) Share of per-signature best-match cosine $\geq 0.95$
|
|
||||||
(operational cut of Section III-K). Five lines: Firm A, B, C, D, Non-Big-4.
|
|
||||||
Firm A is above the other Big-4 firms in every year; Non-Big-4 is below all
|
|
||||||
four Big-4 firms in every year. Per-firm signature counts and exact values
|
|
||||||
are in `reports/firm_yearly_comparison/firm_yearly_comparison.{json,md}`.
|
|
||||||
-->
|
|
||||||
|
|
||||||
The capture rates of Section IV-E are an *internal* consistency check: they ask "how much of Firm A does our threshold capture?", but the threshold was itself derived from Firm A's percentiles, so a high capture rate is not surprising.
|
|
||||||
To go beyond this circular check, we report three further analyses, each chosen so that the *informative quantity* does not depend on the threshold's absolute value:
|
|
||||||
|
|
||||||
- **§IV-G.1 (year-by-year stability).** Holds the cosine cutoff fixed at 0.95 and asks whether the share of Firm A below the cutoff is *stable across years*. The information is in the temporal trend, not in the absolute rate; under a noise-only explanation of the left tail, the share should shrink as scan/PDF technology matured.
|
|
||||||
- **§IV-G.2 (partner-level similarity ranking).** Uses *no threshold at all*: every auditor-year is ranked by mean similarity, and we measure Firm A's share of the top decile against its baseline share. The information is in the concentration ratio, which is invariant to the choice of cutoff.
|
|
||||||
- **§IV-G.3 (intra-report agreement).** Applies the calibrated classifier and measures whether the *two co-signing CPAs on the same Firm A report* receive the same classifier label, then compares Firm A's intra-report agreement rate to the other firms'. The information is in the *cross-firm gap*; the absolute agreement rate at any one firm depends on the cutoff, but the gap is robust to moderate cutoff shifts as long as the same cutoff is applied uniformly across firms.
|
|
||||||
|
|
||||||
Together these three analyses provide threshold-free or threshold-robust evidence that complements the within-sample capture rates of Section IV-E.
|
|
||||||
|
|
||||||
### 1) Year-by-Year Stability of the Firm A Left Tail
|
|
||||||
|
|
||||||
Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year.
|
|
||||||
Under the replication-dominated interpretation (Section III-H), this signature-level left-tail rate reflects within-firm heterogeneity in signing outputs at Firm A.
|
|
||||||
Consistent with the scope-of-claims framing in Section III-G, we report the rate as a signature-level quantity without disaggregating the underlying mechanism (which may span a minority of hand-signing partners, multi-template replication workflows within the firm, or a combination); partner-level mechanism attribution is not attempted.
|
|
||||||
Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.
|
|
||||||
|
|
||||||
<!-- TABLE XIII: Firm A Per-Year Cosine Distribution
|
|
||||||
| Year | N sigs | mean best-match cosine | % below 0.95 |
|
|
||||||
|------|--------|-------------|--------------|
|
|
||||||
| 2013 | 2,167 | 0.9733 | 12.78% |
|
|
||||||
| 2014 | 5,256 | 0.9781 | 8.69% |
|
|
||||||
| 2015 | 5,484 | 0.9793 | 7.46% |
|
|
||||||
| 2016 | 5,739 | 0.9811 | 6.92% |
|
|
||||||
| 2017 | 5,796 | 0.9814 | 6.69% |
|
|
||||||
| 2018 | 5,986 | 0.9808 | 6.58% |
|
|
||||||
| 2019 | 6,122 | 0.9780 | 8.71% |
|
|
||||||
| 2020 | 6,122 | 0.9770 | 9.46% |
|
|
||||||
| 2021 | 5,996 | 0.9792 | 8.37% |
|
|
||||||
| 2022 | 5,918 | 0.9819 | 6.25% |
|
|
||||||
| 2023 | 5,862 | 0.9860 | 3.75% |
|
|
||||||
-->
|
|
||||||
|
|
||||||
The left tail is stable at 6-13% throughout the sample period and shows no pre/post-2020 level shift: the 2013-2019 mean left-tail share is 8.26% and the 2020-2023 mean is 6.96%.
|
|
||||||
The lowest observed share is in 2023 (3.75%), consistent with firm-level electronic signing systems producing more uniform output than earlier manual scanning-and-stamping, not less.
|
|
||||||
This stability supports the replication-dominated framing: a persistent within-firm heterogeneity component is consistent with a Beta left tail that is stable across production technologies, whereas a noise-only explanation would predict a shrinking share as technology improved.
|
|
||||||
|
|
||||||
### 2) Partner-Level Similarity Ranking
|
|
||||||
|
|
||||||
If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all auditor-years (across all firms).
|
|
||||||
We test this prediction directly.
|
|
||||||
|
|
||||||
For each auditor-year (CPA $\times$ fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
|
|
||||||
Firm A accounts for 1,287 of these (27.8% baseline share).
|
|
||||||
Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.
|
|
||||||
The per-signature best-match cosine underlying each auditor-year mean is taken over the full same-CPA pool (Section III-G), consistent with the unit-of-analysis framing in Section III-G.
|
|
||||||
|
|
||||||
<!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)
|
|
||||||
| Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
|
|
||||||
|-------|-------------|--------|--------|--------|--------|-----------|--------------|
|
|
||||||
| 10% | 462 | 443 | 2 | 3 | 0 | 14 | 95.9% |
|
|
||||||
| 20% | 925 | 877 | 9 | 14 | 2 | 23 | 94.8% |
|
|
||||||
| 25% | 1,157 | 1,043 | 32 | 23 | 9 | 50 | 90.1% |
|
|
||||||
| 30% | 1,388 | 1,129 | 105 | 52 | 25 | 77 | 81.3% |
|
|
||||||
| 50% | 2,314 | 1,220 | 473 | 273 | 102 | 246 | 52.7% |
|
|
||||||
-->
|
|
||||||
|
|
||||||
Firm A occupies 95.9% of the top 10%, 94.8% of the top 20%, 90.1% of the top 25%, and 81.3% of the top 30% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of $3.5\times$ at the top decile, $3.4\times$ at the top quintile, and $2.9\times$ at the top tercile.
|
|
||||||
Firm A's share decays monotonically as the bracket widens (95.9% $\to$ 94.8% $\to$ 90.1% $\to$ 81.3% $\to$ 52.7% across top-10/20/25/30/50%), and only at the top 50% does its share approach its baseline; the over-representation is therefore concentrated in the very top of the distribution rather than spread uniformly through the upper half.
|
|
||||||
Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.
|
|
||||||
|
|
||||||
<!-- TABLE XV: Firm A Share of Top-K Similarity by Year (K = 10%, 20%, 30%)
|
|
||||||
| Year | N auditor-years | Top-10% share | Top-20% share | Top-30% share | Firm A baseline |
|
|
||||||
|------|-----------------|---------------|---------------|---------------|-----------------|
|
|
||||||
| 2013 | 324 | 100.0% (32/32) | 98.4% (63/64) | 89.7% (87/97) | 32.4% |
|
|
||||||
| 2014 | 399 | 100.0% (39/39) | 98.7% (78/79) | 82.4% (98/119) | 27.8% |
|
|
||||||
| 2015 | 394 | 97.4% (38/39) | 96.2% (75/78) | 84.7% (100/118) | 27.7% |
|
|
||||||
| 2016 | 413 | 95.1% (39/41) | 96.3% (79/82) | 81.3% (100/123) | 26.2% |
|
|
||||||
| 2017 | 415 | 100.0% (41/41) | 97.6% (81/83) | 83.9% (104/124) | 27.2% |
|
|
||||||
| 2018 | 434 | 100.0% (43/43) | 97.7% (84/86) | 80.0% (104/130) | 26.5% |
|
|
||||||
| 2019 | 429 | 100.0% (42/42) | 97.6% (83/85) | 78.9% (101/128) | 27.0% |
|
|
||||||
| 2020 | 430 | 88.4% (38/43) | 91.9% (79/86) | 76.0% (98/129) | 27.7% |
|
|
||||||
| 2021 | 450 | 97.8% (44/45) | 96.7% (87/90) | 81.5% (110/135) | 28.7% |
|
|
||||||
| 2022 | 467 | 93.5% (43/46) | 95.7% (89/93) | 84.3% (118/140) | 28.3% |
|
|
||||||
| 2023 | 474 | 97.9% (46/47) | 94.7% (89/94) | 83.8% (119/142) | 27.4% |
|
|
||||||
|
|
||||||
Per-cell entries are "share (k_FirmA / k_total)". Top-25% and top-50% pooled values are reported in Table XIV; per-year top-25/50 columns are omitted from this table to reduce visual width but are reproducible from the supplementary materials.
|
|
||||||
-->
|
|
||||||
|
|
||||||
This over-representation is consistent with firm-wide non-hand-signing practice at Firm A and is not derived from any threshold we subsequently calibrate.
|
|
||||||
It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.
|
|
||||||
|
|
||||||
### 3) Intra-Report Consistency
|
|
||||||
|
|
||||||
Taiwanese statutory audit reports are co-signed by two engagement partners (a primary and a secondary signer).
|
|
||||||
Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification.
|
|
||||||
Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.
|
|
||||||
|
|
||||||
For each report with exactly two signatures and complete per-signature data (84,354 reports total: 83,970 single-firm reports, in which both signers are at the same firm, and 384 mixed-firm reports, in which the two signers are at different firms), we classify each signature using the dual-descriptor rules of Section III-K and record whether the two classifications agree.
|
|
||||||
Table XVI reports per-firm intra-report agreement for the 83,970 single-firm reports only (firm-assignment defined by the common firm identity of both signers); the 384 mixed-firm reports (0.46% of the 2-signature corpus) are excluded from the intra-report analysis because firm-level agreement is not well defined when the two signers are at different firms.
|
|
||||||
|
|
||||||
<!-- TABLE XVI: Intra-Report Classification Agreement by Firm
|
|
||||||
| Firm | Total 2-signer reports | Both non-hand-signed | Both uncertain | Both style | Both hand-signed | Mixed | Agreement rate |
|
|
||||||
|------|-----------------------|----------------------|----------------|------------|------------------|-------|----------------|
|
|
||||||
| Firm A | 30,222 | 26,435 | 734 | 0 | 4 | 3,049 | **89.91%** |
|
|
||||||
| Firm B | 17,121 | 9,260 | 2,159| 5 | 6 | 5,691 | 66.76% |
|
|
||||||
| Firm C | 19,112 | 8,983 | 3,035| 3 | 5 | 7,086 | 62.92% |
|
|
||||||
| Firm D | 8,375 | 3,028 | 2,376| 0 | 3 | 2,968 | 64.56% |
|
|
||||||
| Non-Big-4 | 9,140 | 1,671 | 3,945| 18| 27| 3,479 | 61.94% |
|
|
||||||
|
|
||||||
A report is "in agreement" if both signature labels fall in the same coarse bucket
|
|
||||||
(non-hand-signed = high+moderate; uncertain; style consistency; or likely hand-signed).
|
|
||||||
-->
|
|
||||||
|
|
||||||
Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.
|
|
||||||
The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
|
|
||||||
This 23-28 percentage-point gap in intra-report agreement between Firm A and the other firms is consistent with firm-wide (rather than partner-specific) non-hand-signing practice; we do not claim a sharp discontinuity in the formal sense, since classifier calibration, firm-specific document-production pipelines, and signer-mix differences could each contribute to gap magnitude.
|
|
||||||
|
|
||||||
We note that this test uses the calibrated classifier of Section III-K rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
|
|
||||||
|
|
||||||
## H. Classification Results
|
|
||||||
|
|
||||||
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents (656 documents excluded from the 85,042-document YOLO-detection cohort because no signature on the document could be matched to a registered CPA; see Table XVII note).
|
|
||||||
We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
|
|
||||||
Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
|
|
||||||
|
|
||||||
<!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
|
|
||||||
| Verdict | N (PDFs) | % | Firm A | Firm A % |
|
|
||||||
|---------|----------|---|--------|----------|
|
|
||||||
| High-confidence non-hand-signed | 29,529 | 35.0% | 22,970 | 76.0% |
|
|
||||||
| Moderate-confidence non-hand-signed | 36,994 | 43.8% | 6,311 | 20.9% |
|
|
||||||
| High style consistency | 5,133 | 6.1% | 183 | 0.6% |
|
|
||||||
| Uncertain | 12,683 | 15.0% | 758 | 2.5% |
|
|
||||||
| Likely hand-signed | 47 | 0.1% | 4 | 0.0% |
|
|
||||||
|
|
||||||
Per the worst-case aggregation rule of Section III-K, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
|
|
||||||
The 84,386-document cohort excludes 656 documents (relative to the 85,042 YOLO-detected cohort of Table III) for which no signature could be matched to a registered CPA: the per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity is defined. The exclusion is definitional rather than discretionary; typical causes are auditor's-report-page formats deviating from the standard two-signature layout, or OCR returning a printed CPA name not present in the registry.
|
|
||||||
-->
|
|
||||||
|
|
||||||
Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
|
|
||||||
29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash $\leq 5$);
|
|
||||||
36,994 (51.7%) show partial structural similarity (dHash in $[6, 15]$) consistent with replication degraded by scan variations;
|
|
||||||
and 5,133 (7.2%) show no structural corroboration (dHash $> 15$), suggesting high signing consistency rather than image reproduction.
|
|
||||||
A cosine-only classifier would treat all 71,656 identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
|
|
||||||
|
|
||||||
### 1) Firm A Capture Profile (Consistency Check)
|
|
||||||
|
|
||||||
96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
|
|
||||||
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the within-firm heterogeneity implied by the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 7.5% signature-level left tail (Section III-H).
|
|
||||||
The near-zero "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 denominator is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset of Table XVI by 4 mixed-firm reports excluded from the firm-level intra-report comparison) indicates that the within-firm heterogeneity implied by the 7.5% signature-level left tail (Section IV-D) does not project into the lowest-cosine document-level category under the dual-descriptor rules; it is absorbed instead into the uncertain or high-style-consistency categories at this threshold set.
|
|
||||||
We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-F.2 is the corresponding external check.
|
|
||||||
|
|
||||||
### 2) Cross-Firm Comparison of Dual-Descriptor Convergence
|
|
||||||
|
|
||||||
Among the 65,514 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,922 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.
|
|
||||||
The Firm A denominator (55,922) matches Table IX exactly: both Table IX and the cross-firm decomposition define Firm A membership via the CPA registry (`accountants.firm`), and the cross-firm analysis additionally requires a non-null independent-min dHash record, which all 55,922 Firm A cosine-eligible signatures have in the current database.
|
|
||||||
This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-G.2) and intra-report consistency (Section IV-G.3) findings.
|
|
||||||
Reproduction artifact for these counts is listed in Appendix B.
|
|
||||||
|
|
||||||
## I. Ablation Study: Feature Backbone Comparison
|
|
||||||
|
|
||||||
To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
|
To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
|
||||||
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
|
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
|
||||||
Table XVIII presents the comparison.
|
The comparison summary is inherited unchanged from the v3.20.0 backbone-ablation table (v3.20.0 Table XVIII; not the same table as v4 Table XVIII which reports Big-4 vs full-dataset Spearman drift in §IV-K).
|
||||||
|
|
||||||
<!-- TABLE XVIII: Backbone Comparison
|
<!-- v3.20.0 TABLE XVIII (inherited unchanged; reproduced here in commented form for v4 splice provenance — actual rendered table is the v3.20.0 backbone-ablation table):
|
||||||
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|
||||||
|--------|-----------|--------|-----------------|
|
|--------|-----------|--------|-----------------|
|
||||||
| Feature dim | 2048 | 4096 | 1280 |
|
| Feature dim | 2048 | 4096 | 1280 |
|
||||||
@@ -492,3 +327,106 @@ ResNet-50 provides the best overall balance:
|
|||||||
(2) its tighter distributions yield more reliable individual classifications;
|
(2) its tighter distributions yield more reliable individual classifications;
|
||||||
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
|
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
|
||||||
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.
|
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.
|
||||||
|
|
||||||
|
## M. v4-New Anchor-Based ICCR Calibration Results
|
||||||
|
|
||||||
|
This section consolidates the v4-new empirical results that support the §III-L anchor-based threshold calibration framework. Numbers below are direct re-statements from the spike scripts cited per row; the corresponding §III provenance table entries appear in §III's provenance table.
|
||||||
|
|
||||||
|
### M.1 Composition decomposition (Scripts 39b–39e)
|
||||||
|
|
||||||
|
**Table XX.** Within-firm and between-firm decomposition of the Big-4 accountant-level dip-test rejection.
|
||||||
|
|
||||||
|
| Diagnostic | Scope | Statistic | Implication |
|
||||||
|
|---|---|---|---|
|
||||||
|
| Within-firm signature-level cosine dip | Big-4 (4 firms) | $p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ | 0/4 firms reject; cosine within-firm unimodal |
|
||||||
|
| Within-firm signature-level cosine dip | non-Big-4 (10 firms $\geq 500$ sigs) | $p_{\text{cos}} \in [0.59, 0.99]$ | 0/10 firms reject; cosine within-firm unimodal |
|
||||||
|
| Within-firm jittered-dHash dip (5 seeds, median) | Big-4 (4 firms) | $p_{\text{med}} \in \{0.999, 0.996, 0.999, 0.9995\}$ | 0/4 firms reject after integer-jitter; raw rejection was integer-tie artefact |
|
||||||
|
| Big-4 pooled dHash: 2×2 factorial | firm-centred + jittered (5 seeds) | $p_{\text{med}} = 0.35$, 0/5 seeds reject | combined corrections eliminate rejection; multimodality is composition + integer artefact |
|
||||||
|
| Integer-histogram valley near $\text{dHash} \approx 5$ | within each Big-4 firm | none (0/4 firms) | no within-firm dHash antimode at the inherited HC cutoff |
|
||||||
|
|
||||||
|
(Source: Scripts 39b, 39c, 39d, 39e; bootstrap $n_{\text{boot}} = 2000$; jitter $\sim \mathrm{U}[-0.5, +0.5]$.)
|
||||||
|
|
||||||
|
### M.2 Anchor-based inter-CPA pair-level ICCR (Script 40b)
|
||||||
|
|
||||||
|
**Table XXI.** Big-4 inter-CPA per-comparison ICCR sweep, $n = 5 \times 10^5$ pairs (Big-4 scope; v4 new).
|
||||||
|
|
||||||
|
| Threshold | Per-comparison ICCR | 95% Wilson CI |
|
||||||
|
|---|---|---|
|
||||||
|
| cos $> 0.945$ (v3.x published "natural threshold") | $0.00081$ | $[0.00073, 0.00089]$ |
|
||||||
|
| cos $> 0.95$ (inherited operating point) | $0.00060$ | $[0.00053, 0.00067]$ |
|
||||||
|
| cos $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
|
||||||
|
| cos $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
|
||||||
|
| dHash $\leq 5$ (inherited operating point) | $0.00129$ | $[0.00120, 0.00140]$ |
|
||||||
|
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
|
||||||
|
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
|
||||||
|
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | — |
|
||||||
|
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | — |
|
||||||
|
|
||||||
|
Conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$ (Wilson 95% $[0.190, 0.285]$; $70$ of $299$ pairs).
|
||||||
|
|
||||||
|
The cos $> 0.95$ row replicates v3.20.0 §IV-F.1 Table X (v3 reported $0.0005$ under prior "FAR" terminology). The dHash row and joint row are v4 new.
|
||||||
|
|
||||||
|
### M.3 Pool-normalised per-signature ICCR (Script 43)
|
||||||
|
|
||||||
|
**Table XXII.** Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos $> 0.95$ AND dHash $\leq 5$); $n_{\text{sig}} = 150{,}453$ (vector-complete Big-4); CPA-block bootstrap $n_{\text{boot}} = 1000$.
|
||||||
|
|
||||||
|
| Scope | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|
||||||
|
|---|---|---|---|
|
||||||
|
| Big-4 pooled (any-pair, deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
|
||||||
|
| Big-4 pooled (same-pair, stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |
|
||||||
|
| Firm A (any-pair) | $0.2594$ | — | — |
|
||||||
|
| Firm B (any-pair) | $0.0147$ | — | — |
|
||||||
|
| Firm C (any-pair) | $0.0053$ | — | — |
|
||||||
|
| Firm D (any-pair) | $0.0110$ | — | — |
|
||||||
|
| Pool-size decile 1 (smallest pools) any-pair | $0.0249$ | — | — |
|
||||||
|
| Pool-size decile 10 (largest pools) any-pair | $0.1905$ | — | — |
|
||||||
|
|
||||||
|
Decile trend is broadly monotone in pool size with two minor reversals (decile 5 and decile 9 dip below their predecessors). Stricter operating point cos $> 0.95$ AND dHash $\leq 3$ (same-pair) gives per-signature ICCR $0.0449$.
|
||||||
|
|
||||||
|
### M.4 Document-level ICCR under three alarm definitions (Script 45)
|
||||||
|
|
||||||
|
**Table XXIII.** Document-level inter-CPA ICCR by alarm definition; $n_{\text{docs}} = 75{,}233$.
|
||||||
|
|
||||||
|
| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|
||||||
|
|---|---|---|---|
|
||||||
|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
|
||||||
|
| D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ |
|
||||||
|
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |
|
||||||
|
|
||||||
|
Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XIX's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs: all $379$ are $1{:}1$ Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XIX's per-firm rows sum to $74{,}854 = 75{,}233 - 379$.
|
||||||
|
|
||||||
|
### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)
|
||||||
|
|
||||||
|
**Table XXIV.** Logistic regression of per-signature any-pair HC hit indicator on firm dummies and centred log pool size (Firm A reference).
|
||||||
|
|
||||||
|
| Term | Odds ratio (vs Firm A) | Direction |
|
||||||
|
|---|---|---|
|
||||||
|
| Firm B | $0.053$ | $\sim 19\times$ lower odds than Firm A |
|
||||||
|
| Firm C | $0.010$ | $\sim 100\times$ lower odds than Firm A |
|
||||||
|
| Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A |
|
||||||
|
| log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size |
|
||||||
|
|
||||||
|
Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$ while Firm A ranges $0.0541$–$0.5958$. The firm gap survives within matched pool sizes.
|
||||||
|
|
||||||
|
**Table XXV.** Cross-firm hit matrix among Big-4 source signatures with any-pair HC hit; max-cosine partner firm (counts).
|
||||||
|
|
||||||
|
| Source firm | Firm A cand. | Firm B | Firm C | Firm D | non-Big-4 | n hits |
|
||||||
|
|---|---|---|---|---|---|---|
|
||||||
|
| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ |
|
||||||
|
| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ |
|
||||||
|
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
|
||||||
|
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |
|
||||||
|
|
||||||
|
Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively.
|
||||||
|
|
||||||
|
### M.6 Alert-rate sensitivity around inherited HC threshold (Script 46)
|
||||||
|
|
||||||
|
**Table XXVI.** Local-gradient / median-gradient ratio at inherited thresholds (descriptive plateau diagnostic).
|
||||||
|
|
||||||
|
| Threshold | Local / median gradient ratio | Interpretation |
|
||||||
|
|---|---|---|
|
||||||
|
| cos $= 0.95$ (HC) | $\approx 25\times$ | locally sensitive (not plateau-stable) |
|
||||||
|
| dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
|
||||||
|
| dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |
|
||||||
|
|
||||||
|
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ pp per-signature and $0.4431$ pp per-document; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate.
|
||||||
|
|||||||
Reference in New Issue
Block a user