III. Methodology

A. Pipeline Overview

We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents. Fig. 1 illustrates the overall architecture. The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum anchored on whole-sample Firm A percentile heuristics and validated against a byte-level pixel-identity positive anchor and a large random inter-CPA negative anchor.

Throughout this paper we use the term non-hand-signed rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.

B. Data Collection

The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings. An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period. Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.

CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports. Table I summarizes the dataset composition.

C. Signature Page Identification

To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24], one of the multimodal generative models surveyed in [35], as an automated pre-screening mechanism. Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. The model was configured with temperature 0 for deterministic output.

The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document. Scanning terminated upon the first positive detection. This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.

Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents. The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling.

D. Signature Detection

We adopted YOLOv11n (nano variant) [25], a lightweight descendant of the original YOLO single-stage detector [34], for signature region localization. A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.

The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).

Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
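
The red stamp removal step amounts to an HSV mask-and-replace. A minimal sketch, assuming matplotlib for the colour-space conversion; the hue/saturation/value bounds here are illustrative assumptions, not the pipeline's calibrated values:

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def remove_red_stamp(rgb_uint8, sat_min=0.25, val_min=0.25):
    """Replace red-hued pixels with white in an (H, W, 3) uint8 RGB crop.

    The hue/saturation/value bounds are illustrative assumptions; the
    pipeline performs the same mask-and-replace in HSV space.
    """
    hsv = rgb_to_hsv(rgb_uint8.astype(np.float32) / 255.0)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    red_hue = (h < 0.05) | (h > 0.95)              # red wraps around hue 0
    mask = red_hue & (s > sat_min) & (v > val_min)
    out = rgb_uint8.copy()
    out[mask] = 255                                # white out the stamp pixels
    return out
```

Masking on hue rather than raw RGB channels keeps dark handwritten strokes (low value) and grey scan noise (low saturation) untouched even where they overlap the stamp region.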

Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). The remaining 7.4% (13,573 signatures) could not be matched to a registered CPA name---typically because the auditor's report page format deviates from the standard two-signature layout, or because OCR of the printed CPA name on the page returns a name not present in the registry---and these signatures are excluded from all subsequent same-CPA pairwise analyses (a same-CPA best-match statistic is undefined when a signature has no assigned CPA). The 92.6% matched subset is the sample that flows into Sections IV-D through IV-H; the unmatched 7.4% are excluded for definitional reasons rather than discarded as noise.

E. Feature Extraction

Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning. The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.

Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
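
The preprocessing contract can be sketched in a few lines. Centred placement of the crop on the white canvas is an illustrative assumption; the key invariant is the one stated above, that L2 normalisation turns cosine similarity into a plain dot product:

```python
import numpy as np

def pad_to_square(img):
    """Aspect-preserving white padding for a (H, W) grayscale crop
    (centred placement is an illustrative assumption)."""
    h, w = img.shape
    side = max(h, w)
    canvas = np.full((side, side), 255, dtype=img.dtype)  # white background
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas

def l2_normalize(f, eps=1e-12):
    """Project a feature vector onto the unit sphere."""
    return f / max(np.linalg.norm(f), eps)

def cosine(fa, fb):
    """For unit vectors, cosine similarity is exactly the dot product."""
    return float(np.dot(fa, fb))
```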

The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-G). This design choice is validated by an ablation study (Section IV-I) comparing ResNet-50 against VGG-16 and EfficientNet-B0.

F. Dual-Method Similarity Descriptors

For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:

Cosine similarity on deep embeddings captures high-level visual style:

\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B

where \mathbf{f}_A and \mathbf{f}_B are L2-normalized 2048-dim feature vectors. Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].

Perceptual hash distance (dHash) [27] captures structural-level similarity. Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images. Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
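
The dHash computation is short enough to sketch directly. Nearest-neighbour downsampling stands in here for the production resampler (an assumption), but the 9×8 grid, horizontal-gradient signs, and Hamming distance follow the description above:

```python
import numpy as np

def dhash(gray, hash_size=8):
    """Difference hash of a (H, W) grayscale image: downsample to
    hash_size rows x (hash_size+1) columns, take horizontal gradient
    signs between adjacent columns -> hash_size**2 boolean bits."""
    h, w = gray.shape
    rows = (np.arange(hash_size) * h) // hash_size
    cols = (np.arange(hash_size + 1) * w) // (hash_size + 1)
    small = gray[np.ix_(rows, cols)].astype(np.int32)
    bits = small[:, 1:] > small[:, :-1]    # horizontal gradient signs
    return bits.flatten()                  # 64 bits for hash_size=8

def hamming(bits_a, bits_b):
    """Number of differing bits between two fingerprints."""
    return int(np.count_nonzero(bits_a != bits_b))
```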

These descriptors provide partially independent evidence. Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. Non-hand-signing yields extreme similarity under both descriptors, since the underlying image is identical up to reproduction noise. Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.

We did not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors, and the reasons are specific to what each of those measures was designed to do rather than to how either happened to perform on our corpus.

SSIM was developed by Wang et al. [30] as a perceptual quality index for natural images, and it factorises local-window image statistics into three components---luminance, contrast, and structural correlation---combined multiplicatively over a sliding window. Each of these components is computed at the pixel level on the original-resolution image and is designed to be sensitive to small fluctuations in local luminance and local contrast, because that is what makes SSIM track human perception of natural-image quality. Applied to a binarised auditor's signature crop, exactly those design choices become liabilities: the JPEG block artifacts, scan-noise speckle, and faint scanner-rule ghosts that are routine in a print-scan cycle perturb local luminance and local contrast in every window they touch, and SSIM amplifies those perturbations in the structural-correlation product. A signature reproduced twice from the same stored image---the very case that defines our positive class---is therefore one in which SSIM is structurally guaranteed to penalise the easily perturbed margins around the strokes, even though the strokes themselves are identical up to rendering noise. This is a property of how SSIM is constructed, not a finding about how it scored on our data; the empirical observation that the calibration firm exhibits a mean SSIM of only 0.70 in our corpus is a confirmation of the design-level prediction rather than the basis for the rejection.

Pixel-level comparison---whether L_1, L_2, or pixel-identity counting---fails on a stricter design ground. Pixel-level distances are defined on geometrically aligned images at a common resolution, and they treat any sub-pixel translation, rotation, or rescale as a large perturbation by construction (a one-pixel uniform translation flips a fraction of foreground pixels on a thin-stroke signature crop and inflates pixel L1 distance to the same magnitude as for a different signer's signature). Two scans of the same physical document, however, do not share a common pixel grid: scanner DPI, paper-handling alignment, and PDF-page rasterisation each contribute random sub-pixel offsets, and the print-scan cycle that intervenes between the stored stamp image and the audit-report PDF additionally introduces resolution mismatch and small geometric drift. A pixel-level descriptor cannot therefore satisfy the basic stability requirement for our task: two presentations of the same stored image must score nearly identically. We retain pixel-identity counting only as a threshold-free anchor (Section III-J), because byte-identical pairs in our corpus are necessarily produced by literal file reuse rather than by repeated scanning, and so they do not interact with the alignment-fragility argument; they are not used as a primary similarity descriptor.
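
The one-pixel-translation claim in parentheses can be checked numerically; the synthetic one-pixel-wide stroke below is an illustrative assumption, not corpus data:

```python
import numpy as np

# A thin-stroke binary image and the same image shifted by one pixel,
# simulating a scanner registration offset rounded to the nearest pixel.
img = np.zeros((32, 32), dtype=np.uint8)
img[:, 15] = 255                        # one-pixel-wide vertical "stroke"
shifted = np.roll(img, 1, axis=1)

l1_identical = int(np.abs(img.astype(int) - img.astype(int)).sum())
l1_shifted = int(np.abs(img.astype(int) - shifted.astype(int)).sum())
# Every foreground pixel "moves", so the L1 distance between two
# presentations of the same stroke equals twice the total stroke mass.
```

Here `l1_identical` is 0 while `l1_shifted` equals `2 * 32 * 255`: pixel L1 treats a one-pixel registration error on a thin stroke as a maximal foreground disagreement, which is exactly the instability the text describes.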

Cosine similarity on deep embeddings and dHash, in contrast, both remain stable across the print-scan-rasterise cycle by design: cosine on L2-normalised pooled features is invariant to overall scale and bias and degrades gracefully under local-pixel noise that the convolutional backbone has been trained to absorb [14], [21], while dHash compresses the image to a 9 \times 8 grayscale grid before computing horizontal-gradient signs, which removes the resolution and sub-pixel-alignment sensitivity that breaks pixel-level comparison [19], [27]. Together they constitute the dual descriptor used throughout the rest of this paper.

G. Unit of Analysis and Summary Statistics

Two unit-of-analysis choices are relevant for this study, ordered from finest to coarsest: (i) the signature---one signature image extracted from one report; and (ii) the auditor-year---all signatures by one CPA within one fiscal year. The signature is the operational unit of classification (Section III-K) and of all primary statistical analyses (Sections IV-D, IV-F, and IV-G). The auditor-year is used in the partner-level similarity ranking of Section IV-G.2 as a within-year aggregation unit: each auditor-year's mean is computed over its own fiscal-year signatures, although the per-signature best-match cosine that feeds the mean is computed against the full same-CPA cross-year pool (the max-cosine / min-dHash definition below). We do not use a coarser CPA-level cross-year unit, because pooling a CPA's signatures across the full 2013--2023 sample period would conflate distinct signing-mechanism regimes whenever a CPA's practice changes during the sample, and we make no claim about the within-CPA stability of signing mechanisms over time.

For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA (over the full same-CPA set, not restricted to the same fiscal year). The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism. Mean statistics would dilute this signal.

For the dHash dimension we use the independent minimum dHash: the minimum Hamming distance from a signature to any other signature of the same CPA (over the full same-CPA set). The independent minimum is unconditional on the cosine-nearest pair and is therefore the conservative structural-similarity statistic; it is the dHash statistic used throughout the operational classifier (Section III-K) and all reported capture-rate analyses.
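
A vectorised sketch of the two per-signature statistics, assuming L2-normalised embeddings and 64-bit boolean dHash fingerprints for one CPA's same-CPA pool:

```python
import numpy as np

def per_signature_stats(feats, hashes):
    """feats: (n, d) L2-normalised embeddings; hashes: (n, 64) booleans.

    Returns, for each signature, the maximum cosine similarity and the
    independent minimum dHash Hamming distance over all *other* same-CPA
    signatures (self-pairs excluded).
    """
    cos = feats @ feats.T                   # cosine = dot for unit vectors
    np.fill_diagonal(cos, -np.inf)          # exclude self-pairs
    ham = (hashes[:, None, :] != hashes[None, :, :]).sum(axis=2)
    np.fill_diagonal(ham, 65)               # 65 > maximum possible distance (64)
    return cos.max(axis=1), ham.min(axis=1)
```

Note that the two extremes are taken independently per row, matching the independent-minimum definition above: the min-dHash neighbour need not be the max-cosine neighbour.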

We make one stipulation about same-CPA pair detectability.

(A1) Pair-detectability. If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the cross-year same-CPA pool used by the max-cosine / min-dHash computation above. This is plausible for high-volume stamping or firm-level electronic-signing workflows---where a stored image is typically reused many times under similar scan and compression conditions---but it is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are in use simultaneously, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is a cross-year pair-existence property, not a within-year uniformity claim, and is the only assumption the per-signature detector requires to be sensitive to replication.

We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's signing output within a single fiscal year may reflect a single replication template, multiple templates used in parallel (e.g., different stored images for different engagement positions or reporting pipelines), within-year mechanism mixing, or a combination; our signature-level analyses remain valid under all of these regimes, since they do not attempt mechanism attribution at the partner or report level.

The intra-report consistency analysis in Section IV-G.3 is a firm-level homogeneity check---whether the two co-signing CPAs on the same report receive the same signature-level label under the operational classifier---rather than a test of within-partner or within-year uniformity.

H. Calibration Reference: Firm A as a Replication-Dominated Population

A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.

Practitioner knowledge motivated treating Firm A as a candidate calibration reference: the firm is understood within the audit profession to reproduce a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports. This practitioner background motivates Firm A's selection but is not used as evidence: the evidentiary basis in the analyses below---byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency---is derived entirely from the audit-report images themselves and does not depend on any claim about firm-level signing practice.

We establish Firm A's replication-dominated status through two primary independent quantitative analyses plus a third strand comprising three complementary checks, each of which can be reproduced from the public audit-report corpus alone:

First, automated byte-level pair analysis (Section IV-F.1; reproduction artifact listed in Appendix B) identifies 145 Firm A signatures that are byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years. Byte-identity implies pixel-identity by construction, and independent hand-signing cannot produce pixel-identical images across distinct reports---these pairs therefore establish image reuse as a concrete, threshold-free phenomenon within Firm A and confirm that replication is widespread (50 of 180 registered partners) rather than confined to a handful of CPAs.
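
The byte-level pair analysis reduces to grouping each CPA's signature crops by content digest; a minimal sketch, where the `(cpa_id, report_id, image_bytes)` record shape is an assumption about how the corpus is iterated:

```python
import hashlib
from collections import defaultdict

def byte_identical_groups(records):
    """records: iterable of (cpa_id, report_id, image_bytes).

    Returns {(cpa_id, digest): [report_ids]} for every digest that recurs
    across at least two distinct reports of the same CPA. Byte-identity
    implies pixel-identity, so each group is a threshold-free replication hit.
    """
    groups = defaultdict(list)
    for cpa, report, data in records:
        groups[(cpa, hashlib.sha256(data).hexdigest())].append(report)
    return {k: v for k, v in groups.items() if len(set(v)) >= 2}
```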

Second, signature-level distributional evidence: Firm A's per-signature best-match cosine distribution fails to reject unimodality (Hartigan dip test p = 0.17, N = 60{,}448 Firm A signatures; Section IV-D) and exhibits a long left tail, consistent with a dominant high-similarity regime plus residual within-firm heterogeneity rather than two cleanly separated mechanisms. 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95 and the remaining 7.5% form the long left tail (we do not disaggregate partner-level mechanism here; see Section III-G for the scope of claims). The unimodal-long-tail shape, not the precise 92.5/7.5 split, is the structural evidence: it predicts that Firm A is replication-dominated rather than a clean two-class population, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).

Third, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-G. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output.

(a) Longitudinal stability (Section IV-G.1). The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P7.5 heuristic (Section III-K; 92.5% of whole-sample Firm A signatures exceed this cutoff); the substantive finding here is the temporal stability of the rate, not the absolute rate in any single year.

(b) Partner-level similarity ranking (Section IV-G.2). When every auditor-year is ranked globally by its per-auditor-year mean best-match cosine (across all firms: Big-4 and Non-Big-4), Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.

(c) Intra-report consistency (Section IV-G.3). Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap, not the absolute rate, is the substantive finding.

The 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the byte-level pixel-identity evidence, the unimodal-long-tail dip-test result, the three complementary analyses above, and the held-out Firm A fold (described in Section III-J; fold-level rate differences are disclosed in Section IV-F.2). Firm A's replication-dominated status itself was not derived from the thresholds we calibrate against it; it rests on the byte-level pair evidence and the dip-test-confirmed unimodal-long-tail shape, both of which are independent of any threshold choice. The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section IV-D)---and for avoiding overclaim in downstream inference.

I. Signature-Level Threshold Characterisation

This section describes how we set the operational classifier's similarity threshold and how we characterise the per-signature similarity distribution that supports it. The two roles are kept separate by design.

Operational threshold (used by the classifier). The cosine cut is anchored on the whole-sample Firm A P7.5 percentile (cos > 0.95; Section III-K).

Statistical characterisation (used to motivate the choice of anchor and to describe the distributional structure). A Hartigan dip test, an EM-fitted Beta mixture (with logit-Gaussian robustness check), and a Burgstahler-Dichev / McCrary density-smoothness procedure---all applied at the per-signature level (Section IV-D).
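
In the first role, the anchor is nothing more exotic than a percentile of the calibration population's per-signature best-match cosines (a sketch with synthetic data; the real anchor is computed on the Firm A sample):

```python
import numpy as np

def percentile_anchor(best_match_cos, pct=7.5):
    """Whole-sample percentile anchor: by construction, (100 - pct)% of the
    calibration population lies above the returned cut."""
    return float(np.percentile(best_match_cos, pct))
```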

The reason for the split is empirical. The three statistical diagnostics jointly find that per-signature similarity forms a continuous quality spectrum (Section IV-D, summarised below): the dip test fails to reject unimodality for Firm A; BIC strongly prefers a 3-component over a 2-component Beta fit, so the 2-component crossing is a forced fit; and the BD/McCrary candidate transition lies inside the non-hand-signed mode rather than between modes (and is not bin-width-stable; Appendix A). Under these conditions the natural anchor for an operational cosine cut is a transparent percentile of a replication-dominated reference population (Firm A) rather than a mixture-fit crossing whose location depends on parametric assumptions the data do not support.

We describe the three diagnostics and the assumptions underlying each in the subsections below. The two threshold estimators rest on assumptions of increasing strength: the KDE antimode/crossover requires only smoothness, while the Beta mixture additionally requires a parametric specification, with the logit-Gaussian cross-check reporting sensitivity to that form. The Burgstahler-Dichev / McCrary procedure is applied to the same distribution as a density-smoothness diagnostic: it would identify a sharp local density discontinuity if one existed at the boundary between two cleanly separated mechanisms. Because all three diagnostics are applied to the same sample rather than to independent experiments, agreement or disagreement among them is read as evidence about distributional structure rather than as a formal statistical guarantee.

1) Method 1: KDE Antimode / Crossover with Unimodality Test

We use two closely related KDE-based threshold estimators and apply each where it is appropriate. When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the KDE crossover is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes. When a single distribution is analysed (e.g., the per-signature best-match cosine distribution of Section IV-D) the KDE antimode is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal. In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality. The dip test asks one question: is the distribution single-peaked? A non-significant $p$-value means we cannot reject the single-peak null (the data are consistent with one peak); a significant $p$-value means the distribution has more than one peak (it could be two, three, or more---the test does not specify how many). We use the test to decide whether a KDE antimode is well-defined (it is, only when there is more than one peak), not to assert any particular number of components. We additionally perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify threshold stability.
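
A sketch of the two-population KDE crossover under scipy's default Scott bandwidth. The grid construction is an implementation assumption, and the dip test requires a dedicated implementation (e.g. the third-party diptest package), so it is not reproduced here:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_crossover(intra, inter, grid_size=2001):
    """First crossing of the two kernel density estimates between the
    population means (scipy's gaussian_kde uses Scott's rule by default)."""
    k_intra, k_inter = gaussian_kde(intra), gaussian_kde(inter)
    lo, hi = sorted((float(np.mean(inter)), float(np.mean(intra))))
    xs = np.linspace(lo, hi, grid_size)
    diff = k_intra(xs) - k_inter(xs)               # signed density gap
    crossings = np.nonzero(np.diff(np.sign(diff)))[0]
    return float(xs[crossings[0]]) if crossings.size else None
```

The single-population antimode variant replaces the density difference with a local-minimum search on one fitted density, and is well-defined only when the dip test rejects unimodality.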

2) Method 2: Finite Mixture Model via EM

We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data). The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread). Under the fitted model the threshold is the crossing point of the two weighted component densities,

\pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2),

solved numerically via bracketed root-finding. As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the logit-transformed similarity, following standard practice for bounded proportion data. White's [41] quasi-MLE consistency result justifies interpreting the logit-Gaussian estimates as asymptotic approximations to the best Gaussian-family fit under misspecification; we use the cross-check between Beta and logit-Gaussian crossings as a diagnostic of parametric-form sensitivity rather than as a guarantee of distributional recovery.

We fit 2- and 3-component variants of each mixture and report BIC for model selection. When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit; we report the resulting crossing only as a forced-fit descriptive reference and do not use it as an operational threshold.

3) Density-Smoothness Diagnostic: Burgstahler-Dichev / McCrary

Complementing the two threshold estimators above, we apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39], as a density-smoothness diagnostic rather than as a third threshold estimator. We discretize each distribution (cosine into bins of width 0.005; \text{dHash}_\text{indep} into integer bins) and compute, for each bin i with count n_i, the standardized deviation from the smooth-null expectation of the average of its neighbours,

Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},

which is approximately N(0,1) under the null of distributional smoothness. A candidate transition is identified at an adjacent bin pair where Z_{i-1} is significantly negative and Z_i is significantly positive (cosine) or the reverse (dHash). Appendix A reports a bin-width sensitivity sweep covering \text{bin} \in \{0.003, 0.005, 0.010, 0.015\} for cosine and \text{bin} \in \{1, 2, 3\} for dHash; the sweep shows that signature-level BD transitions are not bin-width-stable, consistent with histogram-resolution artifacts rather than a genuine cross-mode density discontinuity. We therefore do not treat the BD/McCrary procedure as a threshold estimator in our application but as diagnostic evidence about distributional smoothness.
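
The standardized deviation above can be computed directly from binned counts. A minimal sketch (interior bins only; the example counts are illustrative):

```python
import numpy as np

def bd_z_scores(counts):
    """Per-bin standardized deviation of n_i from the neighbour-average
    smooth-null expectation; approximately N(0,1) under smoothness.
    Endpoint bins lack a two-sided neighbourhood and are left NaN."""
    n = np.asarray(counts, dtype=float)
    N = n.sum()
    p = n / N
    z = np.full(n.shape, np.nan)
    for i in range(1, len(n) - 1):
        expect = 0.5 * (n[i - 1] + n[i + 1])
        var = (N * p[i] * (1.0 - p[i])
               + 0.25 * N * (p[i - 1] + p[i + 1]) * (1.0 - p[i - 1] - p[i + 1]))
        z[i] = (n[i] - expect) / np.sqrt(var)
    return z

smooth = bd_z_scores([100, 110, 120, 130, 140])   # linear histogram: z = 0 inside
spiked = bd_z_scores([100, 110, 300, 130, 140])   # discontinuity at bin 2
```

A locally linear histogram yields interior Z-scores of exactly zero, while a spike produces a large positive Z flanked by negative neighbours, which is the adjacent-bin pattern the candidate-transition rule looks for.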

4) Reading the Three Diagnostics Together

The two threshold estimators rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the Beta mixture additionally requires a parametric specification (with logit-Gaussian as a robustness cross-check against that form). If the two estimated thresholds were to differ by less than a practically meaningful margin and the BD/McCrary procedure were to identify a sharp transition at the same level, that pattern would constitute convergent evidence for a clean two-mechanism boundary at that location.

This is not the pattern we observe at the per-signature level. The two threshold estimators yield crossings spread across a wide range (Section IV-D); the BIC clearly prefers a 3-component over a 2-component Beta fit, indicating that the 2-component crossing is a forced fit reported only as a descriptive reference rather than as an operational threshold; and the BD/McCrary procedure locates its candidate transition inside the non-hand-signed mode rather than between modes (Appendix A). We interpret this jointly as evidence that per-signature similarity is a continuous quality spectrum rather than a clean two-mechanism mixture, and we accordingly anchor the operational classifier's cosine cut on whole-sample Firm A percentile heuristics (Section III-K) rather than on a mixture-fit crossing.

J. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation (No Manual Annotation)

Rather than construct a stratified manual-annotation validation set, we validate the classifier using four naturally occurring reference populations that require no human labeling:

  1. Pixel-identical anchor (gold positive, conservative subset): signatures whose nearest same-CPA match is byte-identical after crop and normalization. Handwriting physics makes byte-identity impossible under independent signing events, so a byte-identical same-CPA pair is pair-level proof of image reuse and---for the byte-identical subset---conservative ground truth for non-hand-signed signatures; the narrow exception, in which a genuinely hand-signed exemplar was subsequently reused as the stamping or e-signature template, is discussed as a Limitation in Section V-G. We further emphasize that this anchor is a subset of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).

  2. Inter-CPA negative anchor (large gold negative): $\sim$50,000 pairs of signatures randomly sampled from different CPAs. Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps. This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold.

  3. Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference with within-firm heterogeneity in the left tail, as evidenced by the 7.5% of Firm A signatures whose per-signature best-match cosine falls at or below 0.95 (Section III-H, Section IV-D). Because Firm A serves both for empirical percentile calibration in Section III-H and as a validation anchor, we make the within-Firm-A sampling variance visible by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% calibration fold and a 30% held-out fold. The calibration-fold percentiles used in thresholding---cosine median, P1, and P5 (lower-tail, since higher cosine indicates greater similarity), and \text{dHash}_\text{indep} median and P95 (upper-tail, since lower dHash indicates greater similarity)---are derived from the 70% calibration fold only. The held-out fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals.

  4. Low-similarity same-CPA anchor (supplementary negative): signatures whose maximum same-CPA cosine similarity is below 0.70. This anchor is retained for continuity with prior work but is small in our dataset (n = 35) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.
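
The 70/30 split of anchor 3 is performed at the CPA level so that every signature of a given CPA lands in the same fold. A minimal sketch, assuming each signature carries a CPA identifier; the identifiers, counts, and seed are illustrative:

```python
import numpy as np

def cpa_level_split(cpa_ids, calib_frac=0.7, seed=0):
    """Boolean mask (True = calibration fold) assigning whole CPAs,
    never individual signatures, to the calibration fold."""
    rng = np.random.default_rng(seed)
    unique = np.unique(np.asarray(cpa_ids))
    perm = rng.permutation(unique)
    calib = set(perm[: int(round(calib_frac * len(unique)))].tolist())
    return np.array([c in calib for c in np.asarray(cpa_ids)])

cpa_ids = np.repeat(np.arange(10), 5)   # 10 hypothetical CPAs x 5 signatures each
mask = cpa_level_split(cpa_ids)         # True where the signature is in the 70% fold
```

Splitting at the CPA level prevents leakage of a CPA's replication pattern from the calibration fold into the held-out fold.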

From these anchors we report FAR with Wilson 95% confidence intervals against the inter-CPA negative anchor. We do not report an Equal Error Rate or FRR column against the byte-identical positive anchor, because byte-identical pairs have cosine \approx 1 by construction and any FRR computed against that subset is trivially 0 at every threshold below 1; the conservative-subset role of the byte-identical anchor is instead discussed qualitatively in Section V-F. Precision and F_1 are not meaningful in this anchor-based evaluation because the positive and negative anchors are constructed from different sampling units (intra-CPA byte-identical pairs vs random inter-CPA pairs), so their relative prevalence in the combined set is an arbitrary construction rather than a population parameter; we therefore omit precision and F_1 from Table X. The 70/30 held-out Firm A fold of Section IV-F.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.
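
The Wilson 95% score interval used for FAR and capture-rate reporting has a closed form. A minimal sketch (the 25-in-50,000 count is an illustrative example consistent with a FAR of 0.0005, not the paper's exact tally):

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n
    at confidence level implied by the normal quantile z."""
    phat = k / n
    denom = 1.0 + z * z / n
    centre = (phat + z * z / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

lo_ci, hi_ci = wilson_ci(25, 50_000)   # hypothetical: 25 exceedances in 50,000 pairs
```

Unlike the Wald interval, the Wilson interval remains well-behaved for the very small proportions that arise at strict cosine cuts against the inter-CPA anchor.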

K. Per-Document Classification

The per-signature classifier operates at the signature level with operational thresholds anchored on whole-sample Firm A percentile heuristics: cos > 0.95 (Firm A P7.5) for the cosine dimension and \text{dHash}_\text{indep} \leq 5 / > 15 (Firm A median + P75 / style-consistency ceiling) for the structural dimension. This percentile-based anchor is the natural choice given the continuous-spectrum shape of the per-signature similarity distribution documented in Section IV-D; sensitivity to nearby alternatives is reported in Section IV-F.3. All dHash references in this section refer to the independent-minimum dHash defined in Section III-G---the smallest Hamming distance from a signature to any other same-CPA signature. We use a single dHash statistic throughout the operational classifier and the supporting capture-rate analyses (Tables IX, XI, XII, XVI), which keeps the classifier definition and its empirical evaluation arithmetically consistent.

We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:

  1. High-confidence non-hand-signed: Cosine > 0.95 AND \text{dHash}_\text{indep} \leq 5. Both descriptors converge on strong replication evidence.

  2. Moderate-confidence non-hand-signed: Cosine > 0.95 AND 5 < \text{dHash}_\text{indep} \leq 15. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff, potentially due to scan variations.

  3. High style consistency: Cosine > 0.95 AND \text{dHash}_\text{indep} > 15. High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.

  4. Uncertain: Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.

  5. Likely hand-signed: Cosine below the all-pairs KDE crossover threshold.
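
The five-way rule above can be written directly from the stated thresholds (cosine 0.95, \text{dHash}_\text{indep} \leq 5 / > 15, and the 0.837 all-pairs crossover). A minimal sketch; the category names are illustrative labels:

```python
ALL_PAIRS_KDE_CROSSOVER = 0.837   # all-pairs intra/inter KDE crossover (Section IV-C)

def classify_signature(cosine, dhash_indep,
                       cos_cut=0.95, dh_high=5, dh_ceiling=15,
                       kde_crossover=ALL_PAIRS_KDE_CROSSOVER):
    """Five-way signature-level rule from the thresholds in the text."""
    if cosine > cos_cut:
        if dhash_indep <= dh_high:
            return "high_confidence_non_hand_signed"     # category 1
        if dhash_indep <= dh_ceiling:
            return "moderate_confidence_non_hand_signed" # category 2
        return "high_style_consistency"                  # category 3
    if cosine > kde_crossover:
        return "uncertain"                               # category 4
    return "likely_hand_signed"                          # category 5
```

The two descriptors are consulted in sequence: cosine establishes feature-level similarity, and \text{dHash}_\text{indep} then grades the structural corroboration.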

We note three conventions about the thresholds. First, the cosine cutoff 0.95 is the operating point chosen for the five-way classifier from a small grid of candidate cuts, on the basis of an explicit capture-vs-FAR tradeoff against the inter-CPA negative anchor of Section III-J---not a discovered natural boundary in the per-signature distribution. The candidate grid spans the calibration-fold P5 (0.9407), its rounded value (0.945), the operational anchor (0.95), and two reference points drawn from the signature-level threshold-estimator outputs of Section IV-D (the Firm A Beta-2 forced-fit crossing 0.977 and the BD/McCrary candidate transition 0.985); for each grid point Section IV-F.3 reports the Firm A capture rate, the non-Firm-A capture rate, and the inter-CPA FAR with Wilson 95% CI (Table XII-B). Three considerations motivate the operating point at 0.95. (i) Inter-CPA specificity. At cosine > 0.95 the inter-CPA FAR against the 50,000-pair anchor of Section IV-F.1 is 0.0005 (Wilson 95% CI [0.0003, 0.0007]): one in two thousand random cross-CPA pairs exceeds the cut, an order-of-magnitude margin against the working assumption that random cross-CPA pairs do not arise from image reuse. (ii) Capture stability under nearby alternatives. Moving the cut to 0.945 raises Firm A capture by 1.51 percentage points (operational dual rule cos > t AND \text{dHash}_\text{indep} \leq 15; Section IV-F.3) and inter-CPA FAR by 0.00032, while moving it to the calibration-fold P5 of 0.9407 raises Firm A capture by 2.63 percentage points and inter-CPA FAR by 0.00076; in either direction the qualitative finding---Firm A is replication-dominated, non-Firm-A capture is much lower at the same cut, and the inter-CPA noise floor is small---is preserved. (iii) Interpretive transparency. 
The complement 7.5\% corresponds to the whole-sample Firm A P7.5 of the per-signature best-match cosine distribution---that is, 92.5\% of whole-sample Firm A signatures exceed this cutoff and 7.5\% fall at or below it (Section III-H)---which gives the operational cut a transparent reading in the replication-dominated reference population without requiring a parametric mixture fit that the data of Section IV-D do not support. The cosine crossover 0.837 is the all-pairs intra/inter KDE crossover; both 0.95 and 0.837 are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions. Section IV-F.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible; Section IV-F.3 (Table XII-B) reports the full capture-vs-FAR tradeoff at the candidate grid above. Second, the dHash cutoffs \leq 5 and > 15 are chosen from the whole-sample Firm A \text{dHash}_\text{indep} distribution: \leq 5 captures the upper tail of the high-similarity mode (whole-sample Firm A median \text{dHash}_\text{indep} = 2, P75 \approx 4, so \leq 5 is the band immediately above the median), while > 15 marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction. Third, the signature-level threshold-estimator outputs of Section IV-D (KDE antimode, Beta-mixture and logit-Gaussian crossings, BD/McCrary diagnostic) are not the operational thresholds of this classifier: they are a descriptive characterization of the per-signature similarity distribution, and Section IV-D shows they do not converge to a clean two-mechanism boundary at the per-signature level---which is why the operational cosine cut is anchored on the whole-sample Firm A percentile rather than on any mixture-fit crossing.
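
The capture-vs-FAR tradeoff of Table XII-B amounts to evaluating the dual rule and the inter-CPA exceedance rate at each candidate cosine cut. A minimal sketch on synthetic arrays (the data are invented; only the sweep logic mirrors the text):

```python
import numpy as np

def capture_far_sweep(cos_cuts, firm_cos, firm_dh, inter_cos, dh_ceiling=15):
    """For each candidate cut t: Firm A capture under the dual rule
    cos > t AND dHash_indep <= ceiling, and inter-CPA FAR = P(cos > t)."""
    rows = []
    for t in cos_cuts:
        capture = float(np.mean((firm_cos > t) & (firm_dh <= dh_ceiling)))
        far = float(np.mean(inter_cos > t))
        rows.append((t, capture, far))
    return rows

rng = np.random.default_rng(1)
firm_cos = rng.beta(30, 2, 10_000)      # synthetic replication-dominated cosines
firm_dh = rng.integers(0, 20, 10_000)   # synthetic independent-minimum dHash
inter_cos = rng.beta(8, 6, 10_000)      # synthetic inter-CPA negative cosines
rows = capture_far_sweep([0.9407, 0.945, 0.95, 0.977, 0.985],
                         firm_cos, firm_dh, inter_cos)
```

Both capture and FAR are necessarily non-increasing in the cut, so the sweep traces out the tradeoff frontier on which the 0.95 operating point is chosen.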

Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the label of its most replication-consistent signature, under the rank order High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed. This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.
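
The worst-case aggregation reduces to taking the minimum under the stated rank order. A minimal sketch with illustrative label names:

```python
# rank order: most replication-consistent first
LABEL_RANK = [
    "high_confidence_non_hand_signed",
    "moderate_confidence_non_hand_signed",
    "high_style_consistency",
    "uncertain",
    "likely_hand_signed",
]

def document_label(signature_labels):
    """Worst-case aggregation: the document inherits the label of its
    most replication-consistent signature."""
    return min(signature_labels, key=LABEL_RANK.index)
```

The rule generalizes unchanged to reports carrying more than two signatures.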

L. Data Source and Firm Anonymization

Audit-report corpus. The 90,282 audit-report PDFs analyzed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation. MOPS is the statutory public-disclosure platform for Taiwan-listed companies; every audit report filed on MOPS is already a publicly accessible regulatory document. We did not access any non-public auditor work papers, internal firm records, or personally identifying information beyond the certifying CPAs' names and signatures, which are themselves published on the face of the audit report as part of the public regulatory filing. The CPA registry used to map signatures to CPAs is a publicly available audit-firm tenure registry (Section III-B).

Firm-level anonymization. Although all audit reports and CPA identities in the corpus are public, we report firm-level results under the pseudonyms Firm A / B / C / D throughout this paper to avoid naming specific accounting firms in descriptive rate comparisons. Readers with domain familiarity may still infer Firm A from contextual descriptors (Big-4 status, replication-dominated behavior); we disclose this residual identifiability explicitly and note that none of the paper's conclusions depend on the specific firm's name.