Files

T

gbanyan 6b64eabbfb Paper A v3.18.4: address codex GPT-5.5 round-18 self-comparing review findings

Codex round-18 (paper/codex_review_gpt55_v3_18_3.md) caught a falsified
provenance claim I introduced in v3.18.3 plus four cleaner narrative items
that survived the prior 17 rounds. Verdict was Minor Revision; this
commit closes all 5 actionable items.

- Harmonize signature_analysis/28_byte_identity_decomposition.py to use
  accountants.firm (joined on signatures.assigned_accountant) for Firm A
  membership, matching the convention in 24_validation_recalibration.py.
  Regenerated reports/byte_identity_decomp/byte_identity_decomposition.json.
  Cross-firm convergence now reports Firm A 49,389 / 55,922 = 88.32% and
  Non-Firm-A 27,595 / 65,514 = 42.12% (percentages unchanged at two
  decimal places; counts now match Table IX exactly).
- Replace the Section IV-H.2 reconciliation note. The previous note
  speculated that the one-record discrepancy was a snapshot/floating-point
  artifact, which codex round-18 falsified by direct DB queries: the real
  cause was that script 28 used signatures.excel_firm while Table IX uses
  accountants.firm. With script 28 now harmonized, Table IX and the
  cross-firm artifact agree exactly at 55,922; the new note documents the
  Firm A grouping convention plus the dHash-non-null filter.
- Replace residual "known-majority-positive" wording with
  "replication-dominated" in Introduction (contributions 4 and 6) and
  Methodology III-I (anchor-rationale paragraph).
- Correct Methodology III-G's auditor-year description: the per-signature
  best-match cosine that feeds each auditor-year mean is computed against
  the full same-CPA cross-year pool, not within-year only. The aggregation
  unit is within-year, but the underlying similarity statistic is not.
- Add the 145 / 50 / 180 / 35 Firm A byte-decomposition sentence to
  Results IV-F.1 with explicit pointer to script 28 and the JSON artifact;
  this resolves the round-18 finding that several manuscript locations
  cited IV-F.1 for a decomposition that was not actually reported there.
- Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 20:59:07 +08:00

38 KiB

Raw Blame History

III. Methodology

A. Pipeline Overview

We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents. Fig. 1 illustrates the overall architecture. The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum anchored on whole-sample Firm A percentile heuristics and validated against a byte-level pixel-identity positive anchor and a large random inter-CPA negative anchor.

Throughout this paper we use the term non-hand-signed rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.

B. Data Collection

The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings. An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period. Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.

CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports. Table I summarizes the dataset composition.

C. Signature Page Identification

To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24], one of the multimodal generative models surveyed in [35], as an automated pre-screening mechanism. Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. The model was configured with temperature 0 for deterministic output.

The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document. Scanning terminated upon the first positive detection. This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.

Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents. The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling.

D. Signature Detection

We adopted YOLOv11n (nano variant) [25], a lightweight descendant of the original YOLO single-stage detector [34], for signature region localization. A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.

The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).

Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.

Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). The remaining 7.4% (13,573 signatures) could not be matched to a registered CPA name---typically because the auditor's report page format deviates from the standard two-signature layout, or because OCR of the printed CPA name on the page returns a name not present in the registry---and these signatures are excluded from all subsequent same-CPA pairwise analyses (a same-CPA best-match statistic is undefined when a signature has no assigned CPA). The 92.6% matched subset is the sample that flows into Sections IV-D through IV-H; the unmatched 7.4% are excluded for definitional reasons rather than discarded as noise.

E. Feature Extraction

Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning. The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.

Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.

The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-G). This design choice is validated by an ablation study (Section IV-I) comparing ResNet-50 against VGG-16 and EfficientNet-B0.

F. Dual-Method Similarity Descriptors

For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:

Cosine similarity on deep embeddings captures high-level visual style:

\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B

where \mathbf{f}_A and \mathbf{f}_B are L2-normalized 2048-dim feature vectors. Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].

Perceptual hash distance (dHash) [27] captures structural-level similarity. Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images. Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].

These descriptors provide partially independent evidence. Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. Non-hand-signing yields extreme similarity under both descriptors, since the underlying image is identical up to reproduction noise. Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.

We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content. Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle.

G. Unit of Analysis and Summary Statistics

Two unit-of-analysis choices are relevant for this study, ordered from finest to coarsest: (i) the signature---one signature image extracted from one report; and (ii) the auditor-year---all signatures by one CPA within one fiscal year. The signature is the operational unit of classification (Section III-K) and of all primary statistical analyses (Section IV-D, IV-F, IV-G). The auditor-year is used in the partner-level similarity ranking of Section IV-G.2 as a within-year aggregation unit: each auditor-year's mean is computed over its own fiscal-year signatures, although the per-signature best-match cosine that feeds the mean is computed against the full same-CPA cross-year pool (Section III-G's max-cosine / min-dHash definition). We do not use a coarser CPA-level cross-year unit, because pooling a CPA's signatures across the full 2013--2023 sample period would conflate distinct signing-mechanism regimes whenever a CPA's practice changes during the sample, and we make no claim about the within-CPA stability of signing mechanisms over time.

For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA (over the full same-CPA set, not restricted to the same fiscal year). The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism. Mean statistics would dilute this signal.

For the dHash dimension we use the independent minimum dHash: the minimum Hamming distance from a signature to any other signature of the same CPA (over the full same-CPA set). The independent minimum is unconditional on the cosine-nearest pair and is therefore the conservative structural-similarity statistic; it is the dHash statistic used throughout the operational classifier (Section III-K) and all reported capture-rate analyses.

We make one stipulation about same-CPA pair detectability.

(A1) Pair-detectability. If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the cross-year same-CPA pool used by the max-cosine / min-dHash computation above. This is plausible for high-volume stamping or firm-level electronic-signing workflows---where a stored image is typically reused many times under similar scan and compression conditions---but it is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are in use simultaneously, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is a cross-year pair-existence property, not a within-year uniformity claim, and is the only assumption the per-signature detector requires to be sensitive to replication.

We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's signing output within a single fiscal year may reflect a single replication template, multiple templates used in parallel (e.g., different stored images for different engagement positions or reporting pipelines), within-year mechanism mixing, or a combination; our signature-level analyses remain valid under all of these regimes, since they do not attempt mechanism attribution at the partner or report level.

The intra-report consistency analysis in Section IV-G.3 is a firm-level homogeneity check---whether the two co-signing CPAs on the same report receive the same signature-level label under the operational classifier---rather than a test of within-partner or within-year uniformity.

H. Calibration Reference: Firm A as a Replication-Dominated Population

A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.

Practitioner knowledge motivated treating Firm A as a candidate calibration reference: the firm is understood within the audit profession to reproduce a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports. This practitioner background is non-load-bearing in our analysis: the evidentiary basis used in this paper is the observable image evidence reported below---byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency---which does not depend on any claim about signing practice beyond what the audit-report images themselves show.

We establish Firm A's replication-dominated status through two primary independent quantitative analyses plus a third strand comprising three complementary checks, each of which can be reproduced from the public audit-report corpus alone:

First, automated byte-level pair analysis (Section IV-F.1; reproduced by signature_analysis/28_byte_identity_decomposition.py with output in reports/byte_identity_decomp/byte_identity_decomposition.json) identifies 145 Firm A signatures that are byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years. Byte-identity implies pixel-identity by construction, and independent hand-signing cannot produce pixel-identical images across distinct reports---these pairs therefore establish image reuse as a concrete, threshold-free phenomenon within Firm A and confirm that replication is widespread (50 of 180 registered partners) rather than confined to a handful of CPAs.

Second, signature-level distributional evidence: Firm A's per-signature best-match cosine distribution fails to reject unimodality (Hartigan dip test p = 0.17, N = 60{,}448 Firm A signatures; Section IV-D) and exhibits a long left tail, consistent with a dominant high-similarity regime plus residual within-firm heterogeneity rather than two cleanly separated mechanisms. 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95 and the remaining 7.5% form the long left tail (we do not disaggregate partner-level mechanism here; see Section III-G for the scope of claims). The unimodal-long-tail shape, not the precise 92.5/7.5 split, is the structural evidence: it predicts that Firm A is replication-dominated rather than a clean two-class population, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).

Third, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-G. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output: (a) Longitudinal stability (Section IV-G.1). The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P7.5 heuristic (Section III-K; 92.5% of whole-sample Firm A signatures exceed this cutoff); the substantive finding here is the temporal stability of the rate, not the absolute rate at any single year. (b) Partner-level similarity ranking (Section IV-G.2). When every auditor-year is ranked globally by its per-auditor-year mean best-match cosine (across all firms: Big-4 and Non-Big-4), Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff. (c) Intra-report consistency (Section IV-G.3). Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.

We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the byte-level pixel-identity evidence, the unimodal-long-tail dip-test result, the three complementary analyses above, and the held-out Firm A fold (described in Section III-J; fold-level rate differences are disclosed in Section IV-F.2).

We emphasize that Firm A's replication-dominated status was not derived from the thresholds we calibrate against it. Its identification rests on the byte-level pair evidence and the dip-test-confirmed unimodal-long-tail shape, both of which are independent of any threshold choice. The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section IV-D)---and for avoiding overclaim in downstream inference.

I. Signature-Level Threshold Characterisation

This section describes how we set the operational classifier's similarity threshold and how we characterise the per-signature similarity distribution that supports it. The two roles are kept separate by design.

Operational threshold (used by the classifier). The cosine cut is anchored on the whole-sample Firm A P7.5 percentile (cos > 0.95; Section III-K).

Statistical characterisation (used to motivate the choice of anchor and to describe the distributional structure). A Hartigan dip test, an EM-fitted Beta mixture (with logit-Gaussian robustness check), and a Burgstahler-Dichev / McCrary density-smoothness procedure---all applied at the per-signature level (Section IV-D).

The reason for the split is empirical. The three statistical diagnostics jointly find that per-signature similarity forms a continuous quality spectrum (Section IV-D, summarised below): the dip test fails to reject unimodality for Firm A; BIC strongly prefers a 3-component over a 2-component Beta fit, so the 2-component crossing is a forced fit; and the BD/McCrary candidate transition lies inside the non-hand-signed mode rather than between modes (and is not bin-width-stable; Appendix A). Under these conditions the natural anchor for an operational cosine cut is a transparent percentile of a replication-dominated reference population (Firm A) rather than a mixture-fit crossing whose location depends on parametric assumptions the data do not support.

We describe the three diagnostics and the assumptions underlying each in the subsections below. The two threshold estimators rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the Beta mixture additionally requires a parametric specification, and the logit-Gaussian cross-check reports sensitivity to that form. The Burgstahler-Dichev / McCrary procedure is applied to the same distribution as a density-smoothness diagnostic: it would identify a sharp local density discontinuity if one existed at the boundary between two cleanly separated mechanisms. Because all three diagnostics are applied to the same sample rather than to independent experiments, agreement or disagreement among them is read as evidence about distributional structure rather than as a formal statistical guarantee.

1) Method 1: KDE Antimode / Crossover with Unimodality Test

We use two closely related KDE-based threshold estimators and apply each where it is appropriate. When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the KDE crossover is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes. When a single distribution is analysed (e.g., the per-signature best-match cosine distribution of Section IV-D) the KDE antimode is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal. In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality. The dip test asks one question: is the distribution single-peaked? A non-significant $p$-value means we cannot reject the single-peak null (the data are consistent with one peak); a significant $p$-value means the distribution has more than one peak (it could be two, three, or more---the test does not specify how many). We use the test to decide whether a KDE antimode is well-defined (it is, only when there is more than one peak), not to assert any particular number of components. We additionally perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify threshold stability.

2) Method 2: Finite Mixture Model via EM

We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data). The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread). Under the fitted model the threshold is the crossing point of the two weighted component densities,

\pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2),

solved numerically via bracketed root-finding. As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the logit-transformed similarity, following standard practice for bounded proportion data. White's [41] quasi-MLE consistency result justifies interpreting the logit-Gaussian estimates as asymptotic approximations to the best Gaussian-family fit under misspecification; we use the cross-check between Beta and logit-Gaussian crossings as a diagnostic of parametric-form sensitivity rather than as a guarantee of distributional recovery.

We fit 2- and 3-component variants of each mixture and report BIC for model selection. When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.

3) Density-Smoothness Diagnostic: Burgstahler-Dichev / McCrary

Complementing the two threshold estimators above, we apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39], as a density-smoothness diagnostic rather than as a third threshold estimator. We discretize each distribution (cosine into bins of width 0.005; \text{dHash}_\text{indep} into integer bins) and compute, for each bin i with count n_i, the standardized deviation from the smooth-null expectation of the average of its neighbours,

Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},

which is approximately N(0,1) under the null of distributional smoothness. A candidate transition is identified at an adjacent bin pair where Z_{i-1} is significantly negative and Z_i is significantly positive (cosine) or the reverse (dHash). Appendix A reports a bin-width sensitivity sweep covering \text{bin} \in \{0.003, 0.005, 0.010, 0.015\} for cosine and \text{bin} \in \{1, 2, 3\} for dHash; the sweep shows that signature-level BD transitions are not bin-width-stable, consistent with histogram-resolution artifacts rather than a genuine cross-mode density discontinuity. We therefore do not treat the BD/McCrary procedure as a threshold estimator in our application but as diagnostic evidence about distributional smoothness.

4) Reading the Three Diagnostics Together

The two threshold estimators rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the Beta mixture additionally requires a parametric specification (with logit-Gaussian as a robustness cross-check against that form). If the two estimated thresholds were to differ by less than a practically meaningful margin and the BD/McCrary procedure were to identify a sharp transition at the same level, that pattern would constitute convergent evidence for a clean two-mechanism boundary at that location.

This is not the pattern we observe at the per-signature level. The two threshold estimators yield crossings spread across a wide range (Section IV-D); the BIC clearly prefers a 3-component over a 2-component Beta fit, indicating that the 2-component crossing is a forced fit and should be read as an upper bound rather than a definitive cut; and the BD/McCrary procedure locates its candidate transition inside the non-hand-signed mode rather than between modes (Appendix A). We interpret this jointly as evidence that per-signature similarity is a continuous quality spectrum rather than a clean two-mechanism mixture, and we accordingly anchor the operational classifier's cosine cut on whole-sample Firm A percentile heuristics (Section III-K) rather than on a mixture-fit crossing.

J. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation (No Manual Annotation)

Rather than construct a stratified manual-annotation validation set, we validate the classifier using four naturally occurring reference populations that require no human labeling:

Pixel-identical anchor (gold positive, conservative subset): signatures whose nearest same-CPA match is byte-identical after crop and normalization. Handwriting physics makes byte-identity impossible under independent signing events, so a byte-identical same-CPA pair is pair-level proof of image reuse and---for the byte-identical subset---conservative ground truth for non-hand-signed signatures; the narrow exception, in which a genuinely hand-signed exemplar was subsequently reused as the stamping or e-signature template, is discussed as a Limitation in Section V-G. We further emphasize that this anchor is a subset of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).
Inter-CPA negative anchor (large gold negative): $\sim$50,000 pairs of signatures randomly sampled from different CPAs. Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps. This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold.
Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference with within-firm heterogeneity in the left tail, as evidenced by the 7.5% of Firm A signatures whose per-signature best-match cosine falls at or below 0.95 (Section III-H, Section IV-D). Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we make the within-Firm-A sampling variance visible by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% calibration fold and a 30% heldout fold. The calibration-fold percentiles used in thresholding---cosine median, P1, and P5 (lower-tail, since higher cosine indicates greater similarity), and dHash_indep median and P95 (upper-tail, since lower dHash indicates greater similarity)---are derived from the 70% calibration fold only. The heldout fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals.
Low-similarity same-CPA anchor (supplementary negative): signatures whose maximum same-CPA cosine similarity is below 0.70. This anchor is retained for continuity with prior work but is small in our dataset (n = 35) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.

From these anchors we report FAR with Wilson 95% confidence intervals against the inter-CPA negative anchor. We do not report an Equal Error Rate or FRR column against the byte-identical positive anchor, because byte-identical pairs have cosine \approx 1 by construction and any FRR computed against that subset is trivially 0 at every threshold below 1; the conservative-subset role of the byte-identical anchor is instead discussed qualitatively in Section V-F. Precision and F_1 are not meaningful in this anchor-based evaluation because the positive and negative anchors are constructed from different sampling units (intra-CPA byte-identical pairs vs random inter-CPA pairs), so their relative prevalence in the combined set is an arbitrary construction rather than a population parameter; we therefore omit precision and F_1 from Table X. The 70/30 held-out Firm A fold of Section IV-F.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.

K. Per-Document Classification

The per-signature classifier operates at the signature level with operational thresholds anchored on whole-sample Firm A percentile heuristics: cos > 0.95 (Firm A P7.5) for the cosine dimension and dHash$_\text{indep} \leq 5$ / > 15 (Firm A median+P75 / style-consistency ceiling) for the structural dimension. This percentile-based anchor is the natural choice given the continuous-spectrum shape of the per-signature similarity distribution documented in Section IV-D; sensitivity to nearby alternatives is reported in Section IV-F.3. All dHash references in this section refer to the independent-minimum dHash defined in Section III-G---the smallest Hamming distance from a signature to any other same-CPA signature. We use a single dHash statistic throughout the operational classifier and the supporting capture-rate analyses (Tables IX, XI, XII, XVI), which keeps the classifier definition and its empirical evaluation arithmetically consistent.

We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:

High-confidence non-hand-signed: Cosine > 0.95 AND \text{dHash}_\text{indep} \leq 5. Both descriptors converge on strong replication evidence.
Moderate-confidence non-hand-signed: Cosine > 0.95 AND 5 < \text{dHash}_\text{indep} \leq 15. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff, potentially due to scan variations.
High style consistency: Cosine > 0.95 AND \text{dHash}_\text{indep} > 15. High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.
Uncertain: Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
Likely hand-signed: Cosine below the all-pairs KDE crossover threshold.

We note three conventions about the thresholds. First, the cosine cutoff 0.95 corresponds to approximately the whole-sample Firm A P7.5 of the per-signature best-match cosine distribution---that is, 92.5% of whole-sample Firm A signatures exceed this cutoff and 7.5% fall at or below it (Section III-H)---chosen as a round-number lower-tail boundary whose complement (92.5% above) has a transparent interpretation in the whole-sample reference distribution; the cosine crossover 0.837 is the all-pairs intra/inter KDE crossover; both are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions. Section IV-F.3 reports a sensitivity check confirming that replacing 0.95 with the nearby rounded sensitivity cut 0.945 (motivated by the calibration-fold P5 = 0.9407, see Section IV-F.2) shifts whole-Firm-A dual-rule capture by 1.19 percentage points, so the round-number heuristic is robust to nearby percentile-based alternatives. Section IV-F.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible. Second, the dHash cutoffs \leq 5 and > 15 are chosen from the whole-sample Firm A \text{dHash}_\text{indep} distribution: \leq 5 captures the upper tail of the high-similarity mode (whole-sample Firm A median \text{dHash}_\text{indep} = 2, P75 \approx 4, so \leq 5 is the band immediately above median), while > 15 marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction. Third, the signature-level threshold-estimator outputs of Section IV-D (KDE antimode, Beta-mixture and logit-Gaussian crossings, BD/McCrary diagnostic) are not the operational thresholds of this classifier: they are descriptive characterisation of the per-signature similarity distribution, and Section IV-D shows they do not converge to a clean two-mechanism boundary at the per-signature level---which is why the operational cosine cut is anchored on the whole-sample Firm A percentile rather than on any mixture-fit crossing.

Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the most-replication-consistent signature label (i.e., among the two signatures, the label rank ordered High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed determines the document's classification). This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.

L. Data Source and Firm Anonymization

Audit-report corpus. The 90,282 audit-report PDFs analyzed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation. MOPS is the statutory public-disclosure platform for Taiwan-listed companies; every audit report filed on MOPS is already a publicly accessible regulatory document. We did not access any non-public auditor work papers, internal firm records, or personally identifying information beyond the certifying CPAs' names and signatures, which are themselves published on the face of the audit report as part of the public regulatory filing. The CPA registry used to map signatures to CPAs is a publicly available audit-firm tenure registry (Section III-B).

Firm-level anonymization. Although all audit reports and CPA identities in the corpus are public, we report firm-level results under the pseudonyms Firm A / B / C / D throughout this paper to avoid naming specific accounting firms in descriptive rate comparisons. Readers with domain familiarity may still infer Firm A from contextual descriptors (Big-4 status, replication-dominated behavior); we disclose this residual identifiability explicitly and note that none of the paper's conclusions depend on the specific firm's name.

38 KiB Raw Blame History Unescape Escape