III. Methodology
A. Pipeline Overview
We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents. Fig. 1 illustrates the overall architecture. The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three methodologically distinct statistical methods and a pixel-identity anchor.
Throughout this paper we use the term non-hand-signed rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.
B. Data Collection
The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings. An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period. Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.
CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry; the resulting corpus encompasses 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports. Table I summarizes the dataset composition.
C. Signature Page Identification
To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24], one of the multimodal generative models surveyed in [35], as an automated pre-screening mechanism. Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. The model was configured with temperature 0 for deterministic output.
The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document. Scanning terminated upon the first positive detection. This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.
Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents. The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling.
D. Signature Detection
We adopted YOLOv11n (nano variant) [25], a lightweight descendant of the original YOLO single-stage detector [34], for signature region localization. A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.
The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
E. Feature Extraction
Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning. The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.
Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
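For concreteness, the aspect-preserving resize with white padding can be sketched as follows. This is an illustrative NumPy stand-in (the function name is ours, and nearest-neighbour sampling substitutes for a proper image library's interpolation); it is not the production pipeline.

```python
import numpy as np

def pad_and_resize(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Aspect-preserving resize with white padding to size x size.
    `img` is H x W x 3 with intensities in [0, 255]; nearest-neighbour
    sampling stands in for a proper image-library resize."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    # nearest-neighbour index maps for the scaled image
    ri = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    ci = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[ri][:, ci]
    canvas = np.full((size, size, 3), 255, dtype=img.dtype)  # white padding
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas
```

ImageNet channel normalization and L2 normalization of the resulting feature vector follow this step.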
The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D). This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
F. Dual-Method Similarity Descriptors
For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:
Cosine similarity on deep embeddings captures high-level visual style:
\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B
where \mathbf{f}_A and \mathbf{f}_B are L2-normalized 2048-dim feature vectors.
Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].
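Because the vectors are L2-normalized, the similarity above reduces to a plain dot product; a minimal sketch (illustrative helper names, not the paper's code):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a feature vector to unit length so cosine = dot product."""
    return v / np.linalg.norm(v)

def cosine_sim(fa: np.ndarray, fb: np.ndarray) -> float:
    """Cosine similarity of two L2-normalized feature vectors."""
    return float(np.dot(fa, fb))
```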
Perceptual hash distance (dHash) [27] captures structural-level similarity. Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images. Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
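A minimal dHash sketch under the definitions above (block-mean downsampling stands in for a proper image-library resize; illustrative only):

```python
import numpy as np

def dhash(gray: np.ndarray) -> int:
    """64-bit difference hash: downsample to 8 rows x 9 columns, then
    compare adjacent columns. `gray` is a 2-D grayscale array with at
    least 8 rows and 9 columns."""
    h, w = gray.shape
    # crude block-mean resize (stand-in for a proper image resize)
    rows = np.array_split(np.arange(h), 8)
    cols = np.array_split(np.arange(w), 9)
    small = np.array([[gray[np.ix_(r, c)].mean() for c in cols] for r in rows])
    bits = (small[:, 1:] > small[:, :-1]).flatten()  # 8 x 8 = 64 bits
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    """Hamming distance between two 64-bit fingerprints."""
    return bin(a ^ b).count("1")
```

Structurally identical images hash to the same 64-bit value, so their Hamming distance is 0, matching the interpretation in the text.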
These descriptors provide partially independent evidence. Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. Non-hand-signing yields extreme similarity under both descriptors, since the underlying image is identical up to reproduction noise. Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content. Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle.
G. Unit of Analysis and Summary Statistics
Two unit-of-analysis choices are relevant for this study: (i) the signature---one signature image extracted from one report---and (ii) the accountant---the collection of all signatures attributed to a single CPA across the sample period. A third composite unit---the auditor-year, i.e. all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution.
For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA. The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism. Mean statistics would dilute this signal.
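The max/min formulation above can be computed per CPA in a few vectorized lines (a sketch; the function name and array layout are illustrative):

```python
import numpy as np

def per_signature_extremes(feats: np.ndarray, dh: np.ndarray):
    """Per-signature max cosine similarity and min dHash Hamming distance
    against every OTHER signature of the same CPA.
    feats: (n, d) L2-normalized embeddings for one CPA's signatures;
    dh:    (n, n) pairwise Hamming-distance matrix."""
    sims = feats @ feats.T           # cosine = dot product for unit vectors
    np.fill_diagonal(sims, -np.inf)  # exclude self-comparison
    d = dh.astype(float)
    np.fill_diagonal(d, np.inf)
    return sims.max(axis=1), d.min(axis=1)
```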
We also adopt an explicit within-auditor-year no-mixing identification assumption. Specifically, within any single fiscal year we treat a given CPA's signing mechanism as uniform: a CPA who reproduces one signature image in that year is assumed to do so for every report, and a CPA who hand-signs in that year is assumed to hand-sign every report in that year. Domain-knowledge from industry practice at Firm A is consistent with this assumption for that firm during the sample period. Under the assumption, per-auditor-year summary statistics are well defined and robust to outliers: if even one pair of same-CPA signatures in the year is near-identical, the max/min captures it. The intra-report consistency analysis in Section IV-H.3 is a related but distinct check: it tests whether the two co-signing CPAs on the same report receive the same signature-level label (firm-level signing-practice homogeneity) rather than testing whether a single CPA mixes mechanisms within a fiscal year. A direct empirical check of the within-auditor-year assumption at the same-CPA level would require labeling multiple reports of the same CPA in the same year and is left to future work; in this paper we maintain the assumption as an identification convention motivated by industry practice and bounded by the worst-case aggregation rule of Section III-L.
For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. The independent minimum dHash of a signature is defined as the minimum Hamming distance to any other signature of the same CPA (over the full same-CPA set). The independent minimum is unconditional on the cosine-nearest pair and is therefore the conservative structural-similarity statistic; it is the dHash statistic used throughout the operational classifier (Section III-L) and all reported capture-rate analyses. These accountant-level aggregates are the input to the mixture model described in Section III-J and to the accountant-level three-method analysis in Section III-I.5.
H. Calibration Reference: Firm A as a Replication-Dominated Population
A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.
The background context for this choice is practitioner knowledge about Firm A's signing practice: industry practice at the firm is widely understood among practitioners to involve reproducing a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports. We use this only as background context for why Firm A is a plausible calibration candidate; the evidence for Firm A's replication-dominated status comes entirely from the paper's own analyses, which do not depend on any claim about signing practice beyond what the audit-report images themselves show.
We establish Firm A's replication-dominated status through four independent quantitative analyses, each of which can be reproduced from the public audit-report corpus alone:
First, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
Second, whole-sample signature-level rates: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% form a long left tail consistent with a minority of hand-signers.
Third, accountant-level mixture analysis (Section IV-E): a BIC-selected three-component Gaussian mixture over per-accountant mean cosine and mean dHash places 139 of the 171 Firm A CPAs (with \geq 10 signatures) in the high-replication C1 cluster and 32 in the middle-band C2 cluster, directly quantifying the within-firm heterogeneity.
Fourth, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-H. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output.
(a) Longitudinal stability (Section IV-H.1). The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P95 of the per-signature cosine distribution (Section III-L); the substantive finding here is the temporal stability of the rate, not the absolute rate in any single year.
(b) Partner-level similarity ranking (Section IV-H.2). When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
(c) Intra-report consistency (Section IV-H.3). Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap, not the absolute rate, is the substantive finding.
We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the visual inspection, the accountant-level mixture, the three complementary analyses above, and the held-out Firm A fold described in Section III-K.
We emphasize that Firm A's replication-dominated status was not derived from the thresholds we calibrate against it. Its identification rests on visual evidence and accountant-level clustering that is independent of the statistical pipeline. The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.
I. Three-Method Convergent Threshold Determination
Direct assignment of thresholds based on prior intuition (e.g., cosine \geq 0.95 for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others.
To place threshold selection on a statistically principled and data-driven footing, we apply three methodologically distinct methods that rest on different underlying assumptions.
The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence is therefore a diagnostic of distributional structure rather than a formal statistical guarantee.
When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement itself is informative about whether the data support a single clean decision boundary at a given level.
1) Method 1: KDE Antimode / Crossover with Unimodality Test
We use two closely related KDE-based threshold estimators and apply each where it is appropriate.
When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the KDE crossover is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes.
When a single distribution is analyzed (e.g., the per-accountant cosine mean of Section IV-E) the KDE antimode is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal.
In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality (rejecting the null of unimodality is consistent with but does not directly establish bimodality specifically), and perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify threshold stability.
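The two KDE estimators can be sketched with SciPy's `gaussian_kde`, whose default bandwidth is Scott's rule (illustrative code; the dip test, which is not in SciPy, is omitted here):

```python
import numpy as np
from scipy.stats import gaussian_kde  # default bw_method is Scott's rule

def kde_crossover(intra, inter, grid=None):
    """Intersection points of two Scott's-rule KDEs: grid locations where
    the sign of (intra density - inter density) changes."""
    grid = np.linspace(0.0, 1.0, 2001) if grid is None else grid
    diff = gaussian_kde(intra)(grid) - gaussian_kde(inter)(grid)
    idx = np.where(np.diff(np.sign(diff)) != 0)[0]
    return grid[idx]

def kde_antimode(x, lo, hi, n=2001):
    """Local density minimum of a single Scott's-rule KDE within [lo, hi];
    only meaningful when the fitted density is multimodal."""
    grid = np.linspace(lo, hi, n)
    return grid[np.argmin(gaussian_kde(x)(grid))]
```

The `\pm 50\%` bandwidth sensitivity analysis corresponds to passing a scaled `bw_method` to `gaussian_kde` and re-reading the crossover/antimode.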
2) Method 2: Burgstahler-Dichev / McCrary Discontinuity
We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39].
We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin i with count n_i, the standardized deviation from the smooth-null expectation of the average of its neighbours,
Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},
which is approximately N(0,1) under the null of distributional smoothness.
A threshold is identified at the transition where Z_{i-1} is significantly negative (observed count below expectation) adjacent to Z_i significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition Z^- \rightarrow Z^+ marks the candidate decision boundary.
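The standardized deviation above translates directly into code; a minimal sketch over a vector of bin counts (illustrative; endpoints are left undefined because they lack two neighbours):

```python
import numpy as np

def bd_mccrary_z(counts):
    """Standardized bin deviations Z_i from the smooth null (average of
    the two neighbouring bins), following the formula in the text.
    `counts` is the binned histogram; endpoint bins are returned as NaN."""
    n = np.asarray(counts, dtype=float)
    N = n.sum()
    p = n / N
    z = np.full_like(n, np.nan)
    for i in range(1, len(n) - 1):
        expected = 0.5 * (n[i - 1] + n[i + 1])
        var = (N * p[i] * (1 - p[i])
               + 0.25 * N * (p[i - 1] + p[i + 1])
               * (1 - p[i - 1] - p[i + 1]))
        z[i] = (n[i] - expected) / np.sqrt(var)
    return z
```

A bin that spikes above its neighbours yields a large positive Z with negative Z at the adjacent bins, which is the transition pattern the text scans for.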
3) Method 3: Finite Mixture Model via EM
We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data). The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread). Under the fitted model the threshold is the crossing point of the two weighted component densities,
\pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2),
solved numerically via bracketed root-finding. As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the logit-transformed similarity, following standard practice for bounded proportion data. White's [41] quasi-MLE consistency result justifies interpreting the logit-Gaussian estimates as asymptotic approximations to the best Gaussian-family fit under misspecification; we use the cross-check between Beta and logit-Gaussian crossings as a diagnostic of parametric-form sensitivity rather than as a guarantee of distributional recovery.
We fit 2- and 3-component variants of each mixture and report BIC for model selection. When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.
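The two-component EM loop with method-of-moments M-steps and the bracketed root-find for the density crossing can be sketched as follows (illustrative initialization and function names; not the paper's released code):

```python
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.optimize import brentq

def mom_beta(x, w):
    """Weighted method-of-moments Beta fit: the numerically stable
    M-step for bounded proportion data."""
    m = np.average(x, weights=w)
    v = max(np.average((x - m) ** 2, weights=w), 1e-9)
    k = max(m * (1 - m) / v - 1, 1e-3)
    return max(m * k, 1e-3), max((1 - m) * k, 1e-3)

def beta_mixture_threshold(x, iters=300):
    """Two-component Beta mixture via EM; the threshold is the crossing
    of the weighted component densities, found by bracketed root-finding
    between the two component means."""
    x = np.clip(np.asarray(x, dtype=float), 1e-6, 1 - 1e-6)
    hi = (x >= np.median(x)).astype(float)     # crude init: median split
    pi1 = hi.mean()
    (a1, b1), (a2, b2) = mom_beta(x, hi), mom_beta(x, 1 - hi)
    for _ in range(iters):
        f1 = pi1 * beta_dist.pdf(x, a1, b1)    # E-step responsibilities
        f2 = (1 - pi1) * beta_dist.pdf(x, a2, b2)
        r = f1 / (f1 + f2 + 1e-300)
        pi1 = r.mean()                         # M-step by moments
        (a1, b1), (a2, b2) = mom_beta(x, r), mom_beta(x, 1 - r)
    g = lambda t: (pi1 * beta_dist.pdf(t, a1, b1)
                   - (1 - pi1) * beta_dist.pdf(t, a2, b2))
    lo, hi_m = sorted((a1 / (a1 + b1), a2 / (a2 + b2)))
    return brentq(g, lo + 1e-4, hi_m - 1e-4)
```

The parallel logit-Gaussian fit replaces the Beta pdf with a Gaussian pdf on logit-transformed similarities; the cross-check is the distance between the two crossings.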
4) Convergent Validation and Level-Shift Diagnostic
The three methods rest on qualitatively different assumptions: the KDE antimode/crossover requires only global smoothness of the underlying densities; the BD/McCrary test requires only local smoothness under the null; and the Beta mixture additionally requires a correct parametric specification. If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.
Equally informative is the level at which the methods agree or disagree. Applied to the per-signature similarity distribution the three methods yield thresholds spread across a wide range because per-signature similarity is not a cleanly bimodal population (Section IV-D). Applied to the per-accountant cosine mean, Methods 1 (KDE antimode) and 3 (Beta-mixture crossing and its logit-Gaussian counterpart) converge within a narrow band, whereas Method 2 (BD/McCrary) does not produce a significant transition because the accountant-mean distribution is smooth at the bin resolution the test requires. This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than a discrete discontinuity, and we interpret it accordingly in Section V rather than treating disagreement among methods as a failure.
5) Accountant-Level Three-Method Analysis
In addition to applying the three methods at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures.
The accountant-level estimates provide the methodologically defensible threshold reference used in the per-document classification of Section III-L.
All three methods are reported with their estimates and, where applicable, cross-method spreads.
J. Accountant-Level Mixture Model
In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash). The motivation is the expectation---consistent with industry-practice knowledge at Firm A---that an individual CPA's signing practice is clustered (typically consistent adoption of non-hand-signing or consistent hand-signing within a given year) even when the output pixel-level quality lies on a continuous spectrum.
We fit mixtures with K \in \{1, 2, 3, 4, 5\} components under full covariance, selecting K^* by BIC with 15 random initializations per K.
For the selected K^* we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
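The BIC selection over K with full covariance and multiple random initializations can be sketched with scikit-learn (an illustrative library choice; the paper does not specify its implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm(X, k_max=5, n_init=15, seed=0):
    """Fit full-covariance GMMs for K in {1..k_max} with n_init random
    initializations per K, and return the BIC-minimizing fit plus the
    full BIC trace."""
    fits = [GaussianMixture(n_components=k, covariance_type="full",
                            n_init=n_init, random_state=seed).fit(X)
            for k in range(1, k_max + 1)]
    bics = [m.bic(X) for m in fits]
    return fits[int(np.argmin(bics))], bics
```

The per-accountant input `X` would be the (mean best-match cosine, mean independent minimum dHash) pairs; component means, weights, and firm composition are then read off the selected fit.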
K. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation (No Manual Annotation)
Rather than construct a stratified manual-annotation validation set, we validate the classifier using four naturally occurring reference populations that require no human labeling:
- Pixel-identical anchor (gold positive, conservative subset): signatures whose nearest same-CPA match is byte-identical after crop and normalization. Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for the byte-identical subset of non-hand-signed signatures. We emphasize that this anchor is a subset of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).
- Inter-CPA negative anchor (large gold negative): $\sim$50,000 pairs of signatures randomly sampled from different CPAs. Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps. This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold.
- Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference whose left tail contains a minority of hand-signers, as directly evidenced by the 32/171 middle-band share in the accountant-level mixture (Section III-H). Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we make the within-Firm-A sampling variance visible by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% calibration fold and a 30% heldout fold. Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions are derived from the calibration fold only. The heldout fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals.
- Low-similarity same-CPA anchor (supplementary negative): signatures whose maximum same-CPA cosine similarity is below 0.70. This anchor is retained for continuity with prior work but is small in our dataset (n = 35) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.
From these anchors we report FAR with Wilson 95% confidence intervals (against the inter-CPA negative anchor) and FRR (against the byte-identical positive anchor), together with the Equal Error Rate (EER) interpolated at the threshold where FAR = FRR, following biometric-verification reporting conventions [3].
Precision and F_1 are not meaningful in this anchor-based evaluation because the positive and negative anchors are constructed from different sampling units (intra-CPA byte-identical pairs vs random inter-CPA pairs), so their relative prevalence in the combined set is an arbitrary construction rather than a population parameter; we therefore omit precision and F_1 from Table X.
The 70/30 held-out Firm A fold of Section IV-G.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.
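The Wilson score interval used throughout these rate reports is a standard formula; a minimal sketch:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score confidence interval for a binomial proportion
    (z = 1.96 gives the 95% level)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)
```

Unlike the normal-approximation interval, the Wilson interval remains informative at observed rates of exactly 0 or 1, which matters for the small supplementary anchors.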
We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
L. Per-Document Classification
The per-signature classifier operates at the signature level and uses whole-sample Firm A percentile heuristics as its operational thresholds, while the three-method analysis of Section IV-E operates at the accountant level and supplies a convergent external reference for the operational cuts.
Because the two analyses are at different units (signature vs accountant) we treat them as complementary rather than substitutable: the accountant-level convergence band cos \in [0.945, 0.979] anchors the signature-level operational cut cos > 0.95 used below, and Section IV-G.3 reports a sensitivity analysis in which cos > 0.95 is replaced by the accountant-level 2D-GMM marginal crossing cos > 0.945.
All dHash references in this section refer to the independent-minimum dHash defined in Section III-G---the smallest Hamming distance from a signature to any other same-CPA signature.
We use a single dHash statistic throughout the operational classifier and the supporting capture-rate analyses (Tables IX, XI, XII, XVI), which keeps the classifier definition and its empirical evaluation arithmetically consistent.
We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:
- High-confidence non-hand-signed: cosine > 0.95 AND \text{dHash}_\text{indep} \leq 5. Both descriptors converge on strong replication evidence.
- Moderate-confidence non-hand-signed: cosine > 0.95 AND 5 < \text{dHash}_\text{indep} \leq 15. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff, potentially due to scan variations.
- High style consistency: cosine > 0.95 AND \text{dHash}_\text{indep} > 15. High feature-level similarity without structural corroboration, consistent with a CPA who signs very consistently but not via image reproduction.
- Uncertain: cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95, without sufficient convergent evidence for classification in either direction.
- Likely hand-signed: cosine below the all-pairs KDE crossover threshold.
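The five-way decision rule above is a pure function of the two descriptors and can be sketched directly; the function and label names are ours, while the thresholds (0.95, 0.837, 5, 15) are the operational cuts stated in the text.

```python
def classify_signature(cos: float, dh_indep: int,
                       cos_cut: float = 0.95,
                       crossover: float = 0.837) -> str:
    """Five-way signature-level label (Sec. III-L).

    cos       : best-match cosine similarity for this signature
    dh_indep  : independent-minimum dHash (Sec. III-G)
    cos_cut   : whole-sample Firm A P95 of best-match cosine
    crossover : all-pairs intra/inter KDE crossover
    """
    if cos > cos_cut:
        if dh_indep <= 5:
            return "high_confidence_non_hand_signed"
        if dh_indep <= 15:
            return "moderate_confidence_non_hand_signed"
        return "high_style_consistency"
    if cos > crossover:
        return "uncertain"
    return "likely_hand_signed"
```

Note that the dHash statistic is only consulted above the cosine cut: below it, structural similarity carries no replication signal under this rule.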
We note three conventions about the thresholds.
First, the cosine cutoff 0.95 is the whole-sample Firm A P95 of the per-signature best-match cosine distribution, chosen for its transparent percentile interpretation, and the cosine crossover 0.837 is the all-pairs intra/inter KDE crossover; both operational cosine cuts are therefore inherited from whole-sample distributions rather than from the 70% calibration fold.
Section IV-G.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible.
Second, the dHash cutoffs \leq 5 and > 15 are chosen from the whole-sample Firm A \text{dHash}_\text{indep} distribution: \leq 5 captures the upper tail of the high-similarity mode (whole-sample Firm A median \text{dHash}_\text{indep} = 2, P75 \approx 4, so \leq 5 is the band immediately above the median), while > 15 marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction.
Third, the three accountant-level 1D estimators (KDE antimode 0.973, Beta-2 crossing 0.979, logit-GMM-2 crossing 0.976) and the accountant-level 2D GMM marginal (0.945) are not the operational thresholds of this classifier: they are the convergent external reference that supports the choice of signature-level operational cut.
Section IV-G.3 reports the classifier's five-way output under the nearby operational cut cos > 0.945 as a sensitivity check; the aggregate firm-level capture rates change by at most \approx 1.2 percentage points (e.g., the operational dual rule cos > 0.95 AND \text{dHash}_\text{indep} \leq 8 captures 89.95% of whole Firm A versus 91.14% at cos > 0.945), and category-level shifts are concentrated at the Uncertain/Moderate-confidence boundary.
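The percentile-based derivation of the operational cuts described above can be illustrated as follows. This is a sketch under our own naming, assuming per-signature arrays of best-match cosines and \text{dHash}_\text{indep} values for the whole Firm A sample; the paper reports cosine P95 = 0.95, dHash median = 2, and P75 \approx 4 for these distributions.

```python
import numpy as np

def derive_operational_cuts(best_match_cos: np.ndarray,
                            dhash_indep: np.ndarray) -> dict[str, float]:
    """Illustrative whole-sample derivation of the operational cuts:
    the cosine cut is the P95 of the per-signature best-match cosine
    distribution, and the dHash band <= 5 sits just above the P75 of
    the independent-minimum dHash distribution.
    """
    return {
        "cos_cut": float(np.percentile(best_match_cos, 95)),
        "dh_median": float(np.median(dhash_indep)),
        "dh_p75": float(np.percentile(dhash_indep, 75)),
    }
```

Because these are plain percentiles of whole-sample distributions, the derivation is deterministic given the corpus; no fold-dependent fitting is involved, which is exactly why Section IV-G.2 must report fold-level capture rates separately to expose sampling variance.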
Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to a document-level label using a worst-case rule: the document inherits the most replication-consistent of its signature labels, under the rank order High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed.
This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.
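The worst-case aggregation rule amounts to taking the minimum over a fixed label ranking; a minimal sketch (our naming, matching the five labels above):

```python
# Rank order: lower index = more replication-consistent.
LABEL_RANK = [
    "high_confidence_non_hand_signed",
    "moderate_confidence_non_hand_signed",
    "high_style_consistency",
    "uncertain",
    "likely_hand_signed",
]

def document_label(signature_labels: list[str]) -> str:
    """Worst-case document rule (Sec. III-L): the document inherits
    the most replication-consistent label among its signatures."""
    return min(signature_labels, key=LABEL_RANK.index)
```

With the typical two signatures per report, a single High-confidence signature suffices to flag the document, which matches the detection goal of flagging any potentially non-hand-signed report.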
M. Data Source and Firm Anonymization
Audit-report corpus. The 90,282 audit-report PDFs analyzed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation. MOPS is the statutory public-disclosure platform for Taiwan-listed companies; every audit report filed on MOPS is already a publicly accessible regulatory document. We did not access any non-public auditor work papers, internal firm records, or personally identifying information beyond the certifying CPAs' names and signatures, which are themselves published on the face of the audit report as part of the public regulatory filing. The CPA registry used to map signatures to CPAs is a publicly available audit-firm tenure registry (Section III-B).
Firm-level anonymization. Although all audit reports and CPA identities in the corpus are public, we report firm-level results under the pseudonyms Firm A / B / C / D throughout this paper to avoid naming specific accounting firms in descriptive rate comparisons. Readers with domain familiarity may still infer Firm A from contextual descriptors (Big-4 status, replication-dominated behavior); we disclose this residual identifiability explicitly and note that none of the paper's conclusions depend on the specific firm's name. The authors declare no conflict of interest with Firm A, Firm B, Firm C, or Firm D.