Files
pdf_signature_extraction/paper/paper_a_methodology_v3.md
T
gbanyan 9d19ca5a31 Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE
  antimode, BD/McCrary, and Beta mixture EM on accountant-level means.
  Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979,
  LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at
  accountant level (consistent with smooth clustering, not sharp
  discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions,
  used at signature all-pairs level) vs KDE antimode (single-distribution
  local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds CPA-level 70/30
  held-out fold. Calibration thresholds derived from 70% only; heldout
  rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61%
  [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10
  signatures (9 CPAs excluded for insufficient sample). Reconciled across
  intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature
  label determines document label).

## Pixel-identity validation strengthened
- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces
  the original n=35 same-CPA low-similarity negative which had untenable
  Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset"
  language per codex feedback (byte-identical positives are a subset, not
  a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods"
  throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of
  unimodality; rejection is consistent with but not a direct test of
  bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to
  "replication-dominated" in prior commit; this commit strengthens that
  framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity"
  (BD/McCrary non-transition at accountant level rules out sharp discrete
  boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP
  or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract,
  Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to
  XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no
  pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired
  it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1
markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00

26 KiB
Raw Blame History

III. Methodology

A. Pipeline Overview

We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents. Fig. 1 illustrates the overall architecture. The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor.

Throughout this paper we use the term non-hand-signed rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.

B. Data Collection

The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings. An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period. Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.

CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports. Table I summarizes the dataset composition.

C. Signature Page Identification

To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism. Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. The model was configured with temperature 0 for deterministic output.

The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document. Scanning terminated upon the first positive detection. This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.

Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents. The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling.

D. Signature Detection

We adopted YOLOv11n (nano variant) [25] for signature region localization. A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.

The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).

Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.

Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).

E. Feature Extraction

Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning. The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.

Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.

The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D). This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.

F. Dual-Method Similarity Descriptors

For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:

Cosine similarity on deep embeddings captures high-level visual style:

\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B

where \mathbf{f}_A and \mathbf{f}_B are L2-normalized 2048-dim feature vectors. Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].

Perceptual hash distance (dHash) captures structural-level similarity. Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images. Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].

These descriptors provide partially independent evidence. Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. Non-hand-signing yields extreme similarity under both descriptors, since the underlying image is identical up to reproduction noise. Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.

We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content. Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle.

G. Unit of Analysis and Summary Statistics

Two unit-of-analysis choices are relevant for this study: (i) the signature---one signature image extracted from one report---and (ii) the accountant---the collection of all signatures attributed to a single CPA across the sample period. A third composite unit---the auditor-year, i.e. all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution.

For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA. The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism. Mean statistics would dilute this signal.

For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. The independent minimum dHash of a signature is defined as the minimum Hamming distance to any other signature of the same CPA (over the full same-CPA set), in contrast to the cosine-conditional dHash used as a diagnostic elsewhere, which is the dHash to the single signature selected as the cosine-nearest match. The independent minimum avoids conditioning on the cosine choice and is therefore the conservative structural-similarity statistic for each signature. These accountant-level aggregates are the input to the mixture model described in Section III-J and to the accountant-level three-method analysis in Section III-I.5.

H. Calibration Reference: Firm A as a Replication-Dominated Population

A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class. This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration.

First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems. Crucially, the same interview evidence does not exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports.

Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners.

Third, our own quantitative analysis is consistent with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. We emphasize that this 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the interview and visual-inspection evidence enumerated above and by the held-out Firm A fold described in Section III-K.

We emphasize that Firm A's replication-dominated status was not derived from the thresholds we calibrate against it. Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline. The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.

I. Three-Method Convergent Threshold Determination

Direct assignment of thresholds based on prior intuition (e.g., cosine \geq 0.95 for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others. To place threshold selection on a statistically principled and data-driven footing, we apply three methodologically distinct methods whose underlying assumptions decrease in strength. The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence is therefore a diagnostic of distributional structure rather than a formal statistical guarantee. When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement itself is informative about whether the data support a single clean decision boundary at a given level.

1) Method 1: KDE Antimode / Crossover with Unimodality Test

We use two closely related KDE-based threshold estimators and apply each where it is appropriate. When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the KDE crossover is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes. When a single distribution is analyzed (e.g., the per-accountant cosine mean of Section IV-E) the KDE antimode is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal. In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality (rejecting the null of unimodality is consistent with but does not directly establish bimodality specifically), and perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify threshold stability.

2) Method 2: Burgstahler-Dichev / McCrary Discontinuity

We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39]. We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin i with count n_i, the standardized deviation from the smooth-null expectation of the average of its neighbours,

Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},

which is approximately N(0,1) under the null of distributional smoothness. A threshold is identified at the transition where Z_{i-1} is significantly negative (observed count below expectation) adjacent to Z_i significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition Z^- \rightarrow Z^+ marks the candidate decision boundary.

3) Method 3: Finite Mixture Model via EM

We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data). The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread). Under the fitted model the threshold is the crossing point of the two weighted component densities,

\pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2),

solved numerically via bracketed root-finding. As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the logit-transformed similarity, following standard practice for bounded proportion data. White's [41] quasi-MLE consistency result justifies interpreting the logit-Gaussian estimates as asymptotic approximations to the best Gaussian-family fit under misspecification; we use the cross-check between Beta and logit-Gaussian crossings as a diagnostic of parametric-form sensitivity rather than as a guarantee of distributional recovery.

We fit 2- and 3-component variants of each mixture and report BIC for model selection. When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.

4) Convergent Validation and Level-Shift Diagnostic

The three methods rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification. If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.

Equally informative is the level at which the methods agree or disagree. Applied to the per-signature similarity distribution the three methods yield thresholds spread across a wide range because per-signature similarity is not a cleanly bimodal population (Section IV-D). Applied to the per-accountant cosine mean, Methods 1 (KDE antimode) and 3 (Beta-mixture crossing and its logit-Gaussian counterpart) converge within a narrow band, whereas Method 2 (BD/McCrary) does not produce a significant transition because the accountant-mean distribution is smooth at the bin resolution the test requires. This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than a discrete discontinuity, and we interpret it accordingly in Section V rather than treating disagreement among methods as a failure.

5) Accountant-Level Three-Method Analysis

In addition to applying the three methods at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures. The accountant-level estimates provide the methodologically defensible threshold reference used in the per-document classification of Section III-L. All three methods are reported with their estimates and, where applicable, cross-method spreads.

J. Accountant-Level Mixture Model

In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash). The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing behavior is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level quality lies on a continuous spectrum.

We fit mixtures with K \in \{1, 2, 3, 4, 5\} components under full covariance, selecting K^* by BIC with 15 random initializations per K. For the selected K^* we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.

K. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation (No Manual Annotation)

Rather than construct a stratified manual-annotation validation set, we validate the classifier using four naturally occurring reference populations that require no human labeling:

  1. Pixel-identical anchor (gold positive, conservative subset): signatures whose nearest same-CPA match is byte-identical after crop and normalization. Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for the byte-identical subset of non-hand-signed signatures. We emphasize that this anchor is a subset of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).

  2. Inter-CPA negative anchor (large gold negative): $\sim$50,000 pairs of signatures randomly sampled from different CPAs. Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps. This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold.

  3. Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above. Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we break the resulting circularity by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% calibration fold and a 30% heldout fold. Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions are derived from the calibration fold only. The heldout fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals.

  4. Low-similarity same-CPA anchor (supplementary negative): signatures whose maximum same-CPA cosine similarity is below 0.70. This anchor is retained for continuity with prior work but is small in our dataset (n = 35) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.

From these anchors we report precision, recall, F_1, FAR with Wilson 95% confidence intervals, and the Equal Error Rate (EER) interpolated at the threshold where FAR = FRR, following biometric-verification reporting conventions [3]. We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.

L. Per-Document Classification

The final per-document classification combines the three-method thresholds with the dual-descriptor framework. Rather than rely on a single cutoff, we assign each signature to one of five signature-level categories using convergent evidence from both descriptors with thresholds derived from the Firm A calibration fold (Section III-K):

  1. High-confidence non-hand-signed: Cosine > 0.95 AND dHash \leq (calibration-fold Firm A dHash median). Both descriptors converge on strong replication evidence consistent with Firm A's median behavior.

  2. Moderate-confidence non-hand-signed: Cosine > 0.95 AND dHash between the calibration-fold dHash median and 95th percentile. Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations.

  3. High style consistency: Cosine > 0.95 AND dHash > calibration-fold Firm A dHash 95th percentile. High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.

  4. Uncertain: Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.

  5. Likely hand-signed: Cosine below the all-pairs KDE crossover threshold.

Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the most-replication-consistent signature label (i.e., among the two signatures, the label rank ordered High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed determines the document's classification). This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge. The dHash thresholds (\leq 5 and \leq 15, corresponding to the calibration-fold Firm A dHash median and 95th percentile) are derived empirically rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.