Paper A v4.0 Phase 4 Prose Draft v3 (post codex rounds 26–34)
Draft note (2026-05-13, Phase 4 v3; internal — remove before submission). This file replaces the v3.20.0 Abstract, §I Introduction, §II Related Work, §V Discussion, and §VI Conclusion blocks with the v4.0 prose. The methodology and results sections (§III v7 and §IV v3.2 on this branch) are the technical foundation; Phase 4 prose aligns the narrative with the post-codex-round-34 framing. v3 (2026-05-13) reflects the major restructuring driven by codex rounds 29–34: distributional path to thresholds demolished (Scripts 39b–39e); anchor-based multi-level inter-CPA coincidence-rate calibration adopted (Scripts 40b, 43, 44, 45, 46); K=3 demoted to descriptive firm-compositional partition; "FAR" terminology replaced by "inter-CPA coincidence rate (ICCR)" throughout; nine-tool unsupervised validation strategy disclosed; positioning as anchor-calibrated screening framework with human-in-the-loop review (not validated forensic detector). Empirical anchors cite Scripts 32–46 on branch paper-a-v4-big4. The prior Phase 4 v2 changelog has been moved to paper/v4/CHANGELOG.md.
Abstract
IEEE Access target: <= 250 words, single paragraph.
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes reusing a stored signature image across reports — through administrative stamping or firm-level electronic signing — technically trivial and visually invisible, undermining individualized attestation. We build an end-to-end pipeline detecting such non-hand-signed signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate style consistency from image reproduction. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter (p rises to 0.35), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison (0.0006 at cos$>0.95$; 0.0013 at dHash$\leq 5$; 0.00014 jointly), pool-normalised per-signature (0.11 under the deployed any-pair high-confidence rule), and per-document (0.34 for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is 0.62 versus $0.09$–0.16 at Firms B/C/D after pool-size adjustment, with $98$–100\% of inter-CPA collisions concentrated within the source firm — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
I. Introduction
Target: ~1.5 pages double-column IEEE format. Double-blind: no author/institution info.
Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection. In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require certifying CPAs to affix their signature or seal (簽名或蓋章) to each audit report [1]. While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].
The digitization of financial reporting has introduced a practice that complicates this intent. As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one. This reproduction can occur either through an administrative stamping workflow — in which scanned signature images are affixed by staff as part of the report-assembly process — or through a firm-level electronic signing system that automates the same step. We refer to signatures produced by either workflow collectively as non-hand-signed. Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement. The accounting literature has examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33]. Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused, and is visually invisible to report users at scale.
The distinction between non-hand-signing detection and signature forgery detection is conceptually and technically important. The extensive body of research on offline signature verification [3]–[8] focuses almost exclusively on forgery detection — determining whether a questioned signature was produced by its purported author. In our context, identity is not in question; the CPA is indeed the legitimate signer. The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports. This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction.
A methodological concern shapes the research design. Many prior similarity-based classification studies rely on ad-hoc thresholds — declaring two images equivalent above a hand-picked cosine cutoff, for example — without principled statistical justification. Such thresholds are fragile in an archival-data setting. A defensible approach requires (i) explicit calibration of the operational thresholds against measurable negative-anchor evidence; (ii) diagnostic procedures that test whether the descriptor distribution itself supports a within-population threshold, including formal decomposition of apparent multimodality into between-group composition and integer-tie artefacts; (iii) annotation-free reporting of operational alarm rates at multiple analysis units (per-comparison, per-signature pool, per-document) with Wilson 95% confidence intervals; (iv) per-firm stratification of the reported rates to surface heterogeneity that aggregate metrics conceal; and (v) explicit disclosure of the unsupervised setting's limits — in particular, the inability to estimate true error rates without signature-level ground-truth labels.
Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation.
In this paper we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale, together with a multi-tool validation framework that explicitly discloses the unsupervised setting's limits. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) a multi-tool unsupervised validation strategy with disclosed assumption-violation analysis (§III-M).
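Steps (3) and (4) reduce each signature pair to two scalar descriptors. A minimal, self-contained sketch of both quantities, assuming the image has already been cropped and resized to the standard 9×8 dHash grid and that deep features come from the ResNet-50 backbone (the grid and feature values below are toy stand-ins, not pipeline outputs):

```python
import math

def dhash_from_grid(grid):
    """Difference hash: compare horizontally adjacent pixels of a
    9-wide x 8-tall grayscale grid, packing 64 bits into an int."""
    bits = 0
    for row in grid:                      # 8 rows
        for x in range(len(row) - 1):     # 8 comparisons per row
            bits = (bits << 1) | (1 if row[x] > row[x + 1] else 0)
    return bits

def hamming(h1, h2):
    """dHash distance = number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

def cosine(u, v):
    """Cosine similarity between two deep-feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy example: a signature compared with itself must sit at the
# extreme of both descriptors (cosine 1.0, dHash distance 0).
grid = [[(x * 7 + y * 13) % 256 for x in range(9)] for y in range(8)]
feat = [0.1, 0.5, -0.3, 0.8]
print(hamming(dhash_from_grid(grid), dhash_from_grid(grid)))  # 0
```

The deployed gates (cos$>0.95$, dHash$\leq 5$) operate directly on these two quantities; the sketch omits the actual cropping, normalisation, and resize preprocessing.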
The methodological reframing relative to earlier versions of this work is central to our v4.0 contribution. Earlier work in this lineage adopted a distributional path to thresholds — fitting accountant-level finite-mixture models and treating their marginal crossings as data-derived "natural" thresholds. v4.0 reports a composition decomposition diagnostic (§III-I.4) that overturns this reading: the apparent multimodality of the Big-4 accountant-level distribution is fully explained by between-firm location-shift effects (Firm A's mean dHash of 2.73 versus Firms B/C/D's 6.46, 7.39, 7.21) and integer mass-point artefacts on the integer-valued dHash axis. Once both confounds are removed (firm-mean centring plus uniform integer jitter), the Big-4 pooled dHash dip test yields p_{\text{median}} = 0.35 across five jitter seeds, eliminating the rejection. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual mid/small firm with \geq 500 signatures (10 firms tested in Script 39c). The descriptor distributions therefore contain no within-population bimodal antimode that could anchor an operational threshold.
In place of distributional anchoring, v4.0 adopts an anchor-based inter-CPA coincidence-rate (ICCR) calibration. At the per-comparison unit, the inherited cos$>0.95$ operating point yields ICCR = 0.00060 on a $5 \times 10^5$-pair Big-4 sample (replicating v3.x's reported per-comparison rate of 0.0005 under prior "FAR" terminology); the dHash$\leq 5$ structural cutoff yields ICCR = 0.00129 (v4 new); the joint rule cos$>0.95$ AND dHash$\leq 5$ yields joint ICCR = 0.00014 (any-pair semantics, matching the deployed extrema rule). At the pool-normalised per-signature unit, the same rule's effective coincidence rate is materially higher because the deployed classifier takes max-cosine and min-dHash over a same-CPA pool: pooled Big-4 any-pair ICCR is 0.1102 (Wilson 95% CI [0.1086, 0.1118]; CPA-block bootstrap 95% [0.0908, 0.1330]). At the per-document unit, the operational HC$+$MC alarm fires on 33.75\% of Big-4 documents under the inter-CPA candidate-pool counterfactual.
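The Wilson 95% intervals quoted throughout follow the standard score interval. A minimal sketch, with the hit count back-computed from the reported 0.1102 pooled rate over 150,442 signatures (an illustrative assumption, not a number read from the scripts):

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Pooled per-signature ICCR: ~0.1102 over the Big-4 signature pool
# (16,579 hits is our back-computed illustrative count).
lo, hi = wilson_ci(16579, 150442)   # ~[0.1086, 0.1118]
# Zero misses on the 262 byte-identical positives: the one-sided
# information lives in the upper bound, ~0.0145 (1.45%).
_, ub = wilson_ci(0, 262)
```

The same function reproduces both the two-sided interval on the pooled rate and the 1.45% upper bound on the positive-anchor miss rate.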
The pooled per-signature and per-document rates conceal striking firm heterogeneity. A logistic regression of the per-signature hit indicator on firm dummies (Firm A reference) and centred log pool size yields odds ratios of 0.053 (Firm B), 0.010 (Firm C), and 0.027 (Firm D) — Firms B/C/D are an order of magnitude below Firm A even after controlling for the pool-size confound (Script 44). Cross-firm hit matrix analysis shows that $98$–100\% of inter-CPA collisions originate from candidates within the source firm (different CPA, same firm), consistent with firm-specific template, stamp, or document-production reuse mechanisms — though not by itself diagnostic of deliberate sharing. We retain the inherited Paper A v3.x five-way box rule as the operational classifier; v4.0's contribution is to characterise its multi-level coincidence behaviour against the inter-CPA negative anchor rather than to derive new thresholds.
Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman \rho \geq 0.879 (Script 38): the K=3 mixture posterior (now interpreted as a firm-compositional position score, not a mechanism cluster posterior; §III-J), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the inherited box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. Hard ground truth for the replicated class is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve 0\% positive-anchor miss rate (Wilson 95% upper bound 1.45\%). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.
We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. The Big-4 sub-corpus comprises 437 CPAs and 150,442 signatures with both descriptors available.
The contributions of this paper are:
- Problem formulation. We define non-hand-signing detection as distinct from signature forgery detection and frame it as a detection problem on intra-signer similarity distributions.
- End-to-end pipeline. We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention after initial training.
- Dual-descriptor verification. We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between style consistency and image reproduction, and we validate the backbone choice through a feature-backbone ablation.
- Composition decomposition disproves the distributional-threshold path. We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; "natural threshold" language in this lineage's prior work is not empirically supported.
- Anchor-based multi-level inter-CPA coincidence-rate calibration. We characterise the deployed five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: 0.0006; dHash$\leq 5$: 0.0013; joint: 0.00014), pool-normalised per-signature ICCR (0.11 for the deployed any-pair high-confidence rule), and per-document ICCR (0.34 for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for settings with ground-truth negative labels, which the corpus does not provide.
- Firm heterogeneity quantification and within-firm cross-CPA collision concentration. Per-firm rates differ by an order of magnitude after pool-size adjustment (Firm A's per-document HC$+$MC alarm at 0.62 versus Firms B/C/D at $0.09$–0.16). Cross-firm hit-matrix analysis shows that $98$–100\% of inter-CPA collisions originate from candidates within the source firm, consistent with firm-specific template, stamp, or document-production reuse mechanisms — a descriptive finding about deployed-rule behaviour, not a claim of deliberate template sharing.
- K=3 as descriptive firm-compositional partition; three-score convergent internal consistency. We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (no longer interpreted as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman \rho \geq 0.879; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.
- Annotation-free positive-anchor validation and unsupervised validation ceiling. We achieve a 0\% positive-anchor miss rate (Wilson 95% upper bound 1.45\%) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. We frame the overall validation strategy as a multi-tool collection of nine partial-evidence diagnostics, each with an explicitly disclosed untested assumption; their conjunction constitutes the unsupervised validation ceiling achievable on this corpus. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review.
The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity validation, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work.
II. Related Work
Note for the Phase 4 review pass: §II is inherited substantively unchanged from v3.20.0 §II in the master manuscript, with one new paragraph added below. The unchanged content is not reproduced in this Phase 4 file; readers reviewing this draft should consult paper/paper_a_related_work_v3.md for the v3.20.0 §II text covering offline signature verification, near-duplicate detection, copy-move forgery detection, perceptual hashing, deep-feature similarity, and the statistical methods adopted (Hartigan dip test, finite mixture EM, Burgstahler-Dichev / McCrary density-smoothness diagnostic). The paragraph below is the only v4.0-specific §II addition.
Addition for v4.0: leave-one-firm-out cross-validation in a small-cluster scope. Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the firm (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a composition-sensitivity band on the candidate mixture boundary, not as a sufficiency claim for the inherited five-way operational classifier (which is calibrated separately; §III-L). We treat leave-one-firm-out drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier. Numerical references [42]–[44] are placeholders in this draft and will be replaced with the project's preferred references at copy-edit time.
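Operationally, the leave-one-firm-out protocol is a grouped split whose hold-out unit is the firm. A minimal sketch under toy firm labels:

```python
def leave_one_firm_out(firm_labels):
    """Yield (held_out_firm, train_idx, test_idx) with the firm,
    not the CPA or signature, as the hold-out unit."""
    for firm in sorted(set(firm_labels)):
        test = [i for i, f in enumerate(firm_labels) if f == firm]
        train = [i for i, f in enumerate(firm_labels) if f != firm]
        yield firm, train, test

# Four Big-4 folds: each fold refits the mixture on three firms and
# inspects the held-out firm's hard-posterior membership drift.
labels = ["A", "A", "B", "C", "C", "D"]
folds = list(leave_one_firm_out(labels))
print([f for f, _, _ in folds])   # ['A', 'B', 'C', 'D']
```

The mixture refit and the composition-sensitivity band are computed per fold on `train`; the sketch covers only the split itself.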
V. Discussion
A. Non-Hand-Signing Detection as a Distinct Problem
Non-hand-signing differs from forgery in that the questioned signature is produced by its legitimate signer's own stored image rather than by an impostor. The detection problem is therefore framed around intra-signer image reproduction rather than inter-signer imitation. This framing has analytical consequences. The within-CPA signature distribution is the analytical population of interest; the cross-CPA inter-class distribution is a reference against which intra-CPA similarity is interpreted, not the population to be modelled. This contrasts with most prior offline signature verification work, which treats genuine-versus-forged as the central two-class problem.
B. Per-Signature Similarity is a Continuous Quality Spectrum; the Accountant-Level Multimodality is Composition-Driven
A central empirical finding of v3.x was that per-signature similarity does not admit a clean two-mechanism mixture: the dip test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 strengthens and extends this signature-level reading.
The Big-4 accountant-level descriptor distribution does reject unimodality on both marginals at p < 5 \times 10^{-4} (Script 34). v4.0's composition decomposition (§III-I.4; Scripts 39b–39e) shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of 2.73 versus Firms B/C/D's 6.46, 7.39, 7.21 creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter [-0.5, +0.5], 5 jitter seeds) shows that the dip test fails to reject (p_{\text{median}} = 0.35, 0/5 seeds reject) when both corrections are applied; either correction alone leaves the rejection in place. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual non-Big-4 firm with \geq 500 signatures (10 firms tested). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.
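The "both corrections" cell of the 2×2 factorial has a simple operational form: subtract each firm's mean from its values, then add Uniform(-0.5, +0.5) jitter before re-running the dip test. A minimal sketch of the transformation only; the dip test itself would then be applied to the pooled output (e.g. via the third-party diptest package, not called here), and the data are toy stand-ins:

```python
import random

def centre_and_jitter(values_by_firm, seed=0):
    """Firm-mean centring + uniform integer-tie jitter: the 'both
    corrections' cell of the 2x2 factorial diagnostic."""
    rng = random.Random(seed)
    pooled = []
    for firm, values in sorted(values_by_firm.items()):
        mean = sum(values) / len(values)
        for v in values:
            # centring removes the between-firm location shift;
            # jitter dissolves integer mass points on the dHash axis
            pooled.append((v - mean) + rng.uniform(-0.5, 0.5))
    return pooled

# Toy integer-valued dHash samples with a Firm-A-like low mean.
data = {"A": [2, 3, 3, 2], "B": [6, 7, 7, 6], "C": [7, 8, 7, 7]}
pooled = centre_and_jitter(data, seed=42)
# Each output lies within 0.5 of its firm-centred value, and the
# integer ties are gone.
```

In the paper's diagnostic this transformation is run under five jitter seeds and the dip test's median p-value across seeds is reported.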
C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for 0\% of C1 (low-cos / high-dHash position) and 82.5\% of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at 23.5\%. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). The additional v3.x finding that the 145 Firm A pixel-identical signatures span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches across different fiscal years, is inherited from v3.20.0 §IV-F.1 / Script 28 / Appendix B byte-decomposition output and was not regenerated in v4.0's spike scripts; we retain those numbers by reference.
In v4.0 we treat Firm A as a templated-end case study rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: $98$–100\% of inter-CPA collisions originate from candidates within the source firm, regardless of which Big-4 firm is the source. Firm A's high per-document HC$+$MC alarm rate of 0.62 (versus Firms B/C/D's $0.09$–0.16) reflects high inter-CPA collision concentration under the deployed rule on real same-CPA pools, consistent with firm-specific template, stamp, or document-production reuse — though the inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence of v3.x §IV-F.1 (Firm A's 145 pixel-identical signatures across \sim 50 distinct partners) provides direct evidence that firm-level template reuse does occur at Firm A; the within-firm collision pattern at all four Big-4 firms is consistent with that mechanism extending in milder form to Firms B/C/D.
D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
Leave-one-firm-out cross-validation of the Big-4 mixture fit reveals a sharp contrast between K=2 and K=3 behaviour. K=2 is unstable: across-fold cosine-crossing deviation is 0.028, and holding Firm A out gives a fold rule (cos > 0.938, dHash \leq 8.79) that classifies 100\% of held-out Firm A in the upper component, while holding any non-Firm-A Big-4 firm out gives a fold rule near (cos > 0.975, dHash \leq 3.76) that classifies 0\% of the held-out firm in the upper component. The K=2 boundary is essentially a Firm-A-vs-others separator — direct evidence that the K=2 partition reflects firm-compositional rather than mechanistic structure.
K=3 in contrast has a reproducible component shape at the descriptor-position level: across the four folds the C1 (low-cos / high-dHash) component cosine mean varies by at most 0.005, the dHash mean by at most 0.96, and the weight by at most 0.023. Hard-posterior membership for the held-out firm is composition-sensitive (absolute differences $1.8$–12.8 pp across folds). Together with the §III-I.4 composition decomposition (no within-population bimodal antimode), the K=3 stability supports a descriptive reading: the Big-4 descriptor plane has a reproducible three-region partition that reflects how firm-compositional weight is distributed across the descriptor space, not a three-mechanism latent-class structure. We accordingly do not use K=3 hard-posterior membership as an operational classifier; we use it as the accountant-level descriptive summary that complements the deployed signature-level five-way classifier of §III-L.
E. Three-Score Convergent Internal Consistency
Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman \rho \geq 0.879: the K=3 mixture posterior (a firm-compositional position score, not a mechanism cluster posterior); the reverse-anchor cosine percentile under a non-Big-4 reference distribution; and the inherited Paper A box-rule less-replication-dominated rate. The three scores are not statistically independent measurements — they are deterministic functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency rather than external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The strength of the convergence (all pairwise |\rho| > 0.87) and its persistence at the signature level (Cohen \kappa = 0.87 between per-CPA-fit and per-signature-fit K=3 binary labels) are nevertheless informative: per-CPA aggregation does not collapse the broad three-region ordering, and three different summarisations of the descriptor space produce broadly concordant per-CPA rankings, with a residual non-Firm-A disagreement (the reverse-anchor cosine percentile ranks Firm D fractionally above Firm C, while the mixture posterior and the box-rule rate rank Firm C highest among non-Firm-A firms).
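The reported agreement is ordinary Spearman correlation between per-CPA score vectors. A minimal sketch with tie-averaged ranks (the score values are toy stand-ins, not the paper's):

```python
def _ranks(xs):
    """Average ranks (1-based); tied values share the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rho = Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

posterior = [0.1, 0.4, 0.2, 0.9, 0.7]   # toy per-CPA position scores
box_rate  = [0.2, 0.5, 0.1, 0.8, 0.9]
print(spearman(posterior, box_rate))    # 0.8 for this toy pair
```

Because the three real scores are deterministic functions of the same descriptor pair, a high rho here measures internal consistency of the summarisations, not agreement with an external label.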
F. Anchor-Based Multi-Level Calibration
The operational specificity of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR replicates v3.x's per-comparison rate (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint \to 0.00014). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement (0.1102 pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size n_{\text{pool}}, so the inter-CPA-equivalent rate scales approximately as 1 - (1 - p_{\text{pair}})^{n_{\text{pool}}} in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone 0.18, the operational HC$+$MC alarm 0.34.
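The independence-limit pooling formula quoted above is elementary to evaluate. A minimal sketch; the pool size of 832 comparisons is back-computed so that the reported joint per-pair rate reproduces the reported pooled rate, and is illustrative rather than a corpus statistic:

```python
def pooled_iccr(p_pair, n_pool):
    """Any-pair rate over a pool of n_pool independent comparisons:
    1 - (1 - p_pair) ** n_pool, the independence-limit scaling."""
    return 1 - (1 - p_pair) ** n_pool

# The joint per-comparison ICCR of 0.00014 inflates to roughly the
# reported 0.1102 per-signature rate once the any-pair rule
# aggregates on the order of 800 same-CPA comparisons.
print(round(pooled_iccr(0.00014, 832), 3))   # 0.11
```

Inverting the same curve is how alternative specificity-alert-yield operating points would be derived, per the sensitivity analysis of §III-L.5.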
Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is 0.234 (Wilson 95% [0.190, 0.285]): given the cosine gate, the structural dimension provides further per-comparison specificity at \sim 4.3\times refinement. Second, the alert-rate sensitivity analysis (§III-L.5; Script 46) shows the inherited HC threshold is locally sensitive rather than plateau-stable (local gradient \approx 25\times the median for cosine, \approx 3.8\times for dHash); stakeholders requiring different specificity-alert-yield operating points can derive thresholds by inverting the ICCR curves (a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint gives per-signature ICCR \approx 0.045). The MC/HSC sub-band boundary at dHash$=15$, by contrast, is plateau-like (local-to-median ratio \approx 0.08), consistent with high-dHash-tail saturation.
Pixel-identity as a hard positive anchor; the inherited inter-CPA negative anchor reframed as a coincidence rate. The only hard ground-truth subset in the corpus is pixel-identical signatures: those whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce byte-identical images, so these signatures are conservative-subset ground truth for the replicated class. On the Big-4 subset (n = 262 pixel-identical signatures), all three candidate classifiers — the inherited box rule, the K=3 hard label, and the reverse-anchor metric with a prevalence-calibrated cut — achieve 0\% positive-anchor miss rate (Wilson 95% upper bound 1.45\%). We caution that this result is necessary but not sufficient: for the box rule it is close to tautological, because byte-identical neighbours have cosine \approx 1 and dHash \approx 0, well inside the rule's high-confidence region. The corresponding signature-level negative anchor evidence is developed in §III-L.1 above (v4 spike: cos$>0.95$ per-comparison ICCR = 0.00060, replicating v3.20.0's reported 0.0005 under prior "FAR" terminology). We frame the per-comparison rate as a specificity proxy under the assumption that inter-CPA pairs constitute a clean negative anchor, and we document in §III-L.4 that this assumption is partially violated by within-firm cross-CPA template-like collision structures.
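The quoted upper bound is reproducible from the zero-miss count with the standard Wilson score interval; the helper below is a generic sketch, nothing corpus-specific.

```python
import math

def wilson_upper(successes: int, n: int, z: float = 1.959964) -> float:
    """Upper limit of the two-sided 95% Wilson score interval for a
    binomial proportion (z is the standard-normal 97.5% quantile)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = p_hat + z * z / (2.0 * n)
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return (centre + half) / denom

# 0 positive-anchor misses out of 262 pixel-identical Big-4 signatures.
print(f"{wilson_upper(0, 262):.4%}")  # 1.4450% -> the quoted 1.45% bound
```

Note that the Wilson interval remains informative at zero observed misses, which is why it is the natural choice here over a Wald interval (whose width collapses to zero at p_hat = 0).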
G. Limitations
Several limitations should be stated transparently. The first nine are v4.0-specific; the last five are inherited from v3.20.0 §V-G and still apply to the v4.0 pipeline.
No signature-level ground truth; no true error rates reportable. The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M).
Inter-CPA negative-anchor assumption is partially violated. The cross-firm hit matrix of §III-L.4 shows that $98$–100\% of inter-CPA collisions under the deployed rule originate from candidates within the source firm, consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs.
Scope. The v4.0 primary analyses are scoped to the Big-4 sub-corpus. We did not perform the full per-signature pool-normalised ICCR analysis at the full n = 686 scope; the §IV-K full-dataset Spearman re-run shows the K=3 + box-rule rank-convergence is preserved at n = 686 but does not validate the Big-4 operational ICCRs, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope.
Pixel-identity is a conservative subset. Byte-identical pairs are the easiest replicated cases, and for the inherited box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical \Rightarrow cosine \approx 1, dHash \approx 0, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
Inherited rule components are not separately v4-validated. The five-way classifier's moderate-confidence band (cos > 0.95 AND 5 < \text{dHash} \leq 15), the style-consistency band (\text{dHash} > 15), and the document-level worst-case aggregation rule retain their v3.20.0 calibration and capture-rate evidence; v4.0's anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-validated by v4.0's diagnostic battery.
Deployed-rate excess is not a presumed true-positive rate. The $\sim 44$-pp per-document gap between the observed deployed alert rate (HC: 0.62 on real same-CPA pools) and the inter-CPA proxy rate (HC: 0.18) cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; within-firm template sharing inflates the inter-CPA proxy baseline). The gap is best read as a same-CPA repeatability signal.
A1 pair-detectability stipulation. The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.
K=3 hard-posterior membership is composition-sensitive. The K=3 hard-posterior membership for any single firm varies by up to 12.8 pp across LOOO folds. This is documented as a composition-sensitivity band rather than failure, but it means K=3 hard labels are not used as v4.0 operational classifier output; they are reported only as accountant-level descriptive characterisation.
No partner-level mechanism attribution. v4.0 reports population-level patterns; it does not perform partner-level mechanism attribution or make report-level claims of intent. Signature-level outputs are reported as signature-level quantities throughout and are not aggregated into partner-level attributions. The within-firm cross-CPA collision concentration of §III-L.4 is consistent with template-like reuse but is not by itself diagnostic of deliberate sharing.
Transferred ImageNet features (inherited from v3.20.0). The ResNet-50 feature extractor uses pre-trained ImageNet weights without signature-domain fine-tuning. While our backbone-ablation study (§IV-L, inherited from v3.20.0 §IV-I) and prior literature support the effectiveness of transferred ImageNet features for signature comparison, a signature-domain fine-tuned feature extractor could improve discriminative performance.
Red-stamp HSV preprocessing artefacts (inherited from v3.20.0). The red-stamp removal preprocessing uses simple HSV colour-space filtering, which may introduce artefacts where handwritten strokes overlap with red seal impressions. Blended pixels are replaced with white, potentially creating small gaps in signature strokes that could reduce dHash similarity. This bias would push classifications toward false negatives rather than false positives.
Longitudinal scan / PDF / compression confounds (inherited from v3.20.0). Scanning equipment, PDF generation software, and compression algorithms may have changed over the 2013–2023 study period, potentially affecting similarity measurements. While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.
Source-exemplar misattribution in max/min pair logic (inherited from v3.20.0). The max-cosine / min-dHash detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case where one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as a stamping or e-signature template, the pair correctly identifies image reuse but misattributes non-hand-signed status to the source exemplar. This affects at most one source document per template variant per CPA and is not expected to be common.
Legal and regulatory interpretation (inherited from v3.20.0). Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a jurisdiction-specific legal question. Our technical analysis can inform such determinations but cannot resolve them.
VI. Conclusion and Future Work
We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is an inherited Paper A five-way per-signature classifier with worst-case document-level aggregation (§III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population.
Our central methodological contributions are: (1) a composition decomposition (Scripts 39b–39e) that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter (p_{\text{median}} = 0.35), so distributional "natural-threshold" framings of the inherited operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison (0.0006 at cos$>0.95$; 0.0013 at dHash$\leq 5$; 0.00014 jointly), pool-normalised per-signature (0.11 for the deployed any-pair HC rule), and per-document (0.34 for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios 0.053, 0.010, 0.027 for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that $98$–100\% of inter-CPA collisions under the deployed rule originate from candidates within the source firm, consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman \rho \geq 0.879, reported as internal consistency rather than external validation; (7) 0\% positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) a nine-tool unsupervised-validation collection (§III-M) that explicitly discloses each tool's untested assumption and positions the system as an anchor-calibrated screening framework with human-in-the-loop review, not as a validated forensic detector.
Future work falls into four directions. First, a small-scale human-rated validation set would enable direct ROC optimisation and provide the signature-level ground truth that v4.0 fundamentally lacks; without such ground truth, no true error rates can be reported. Second, the within-firm collision concentration documented in §III-L.4 (98–100% same-firm partners) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. Third, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm 0.62 vs $0.09$–0.16) — together with v3.x's byte-level evidence of 145 pixel-identical signatures across \sim 50 distinct Firm A partners — motivates a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. Fourth, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the v4.0 calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
Notes for Phase 4 close-out
Items remaining for the Phase 4 close-out pass before §I, §II, §V, §VI prose can be moved into the manuscript master file:
- Abstract word count. Current draft is 243–244 words (shell wc -w on the paragraph returns 243; a one-token tokenisation difference depends on the counter); both counts satisfy IEEE Access's \leq 250-word constraint with \sim 6 words of margin.
- §I contributions list (8 items). v3.20.0's contribution list had 7 items; v4.0's has 8 to reflect the Big-4 scope, the K=3 descriptive role, and the three-score convergence as separate contributions. Confirm whether the journal style supports 8 contributions or whether items can be merged.
- §II Related Work LOOO citation. A standard cross-validation citation for the LOOO addition is flagged "[add citation]" in the draft and needs to be filled with a specific reference (Geisser 1975 / Stone 1974 / a modern survey).
- §V-G Limitations. The fourteen limitations (nine v4.0-specific, five inherited) are listed flat; the journal style may prefer them grouped (scope vs ground-truth vs methodology) — consider reorganisation at copy-edit time.
- §VI Future Work directions. Four directions are listed; the third (audit-quality companion analysis) ties to the Paper B placeholder in the project memory and should be cross-checked for consistency with the planned Paper B framing.
- Internal draft note + this close-out checklist. Strip before submission packaging, per the across-paper "internal — remove before submission" policy applied to §III v6 and §IV v3.2 draft notes.