# III. Methodology

## A. Pipeline Overview

We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents. Fig. 1 illustrates the overall architecture. The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor.

Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image, so that signatures on different reports from the same partner are identical up to reproduction noise.
<!--
[Figure 1: Pipeline Architecture - clean vector diagram]
90,282 PDFs → VLM Pre-screening → 86,072 PDFs
→ YOLOv11 Detection → 182,328 signatures
→ ResNet-50 Features → 2048-dim embeddings
→ Dual-Method Verification (Cosine + dHash)
→ Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification
→ Pixel-identity + Firm A + Accountant-level GMM validation
-->
## B. Data Collection

The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings. An automated web-scraping pipeline built on Selenium WebDriver systematically downloaded all audit reports for each listed company across the study period. Each report is a multi-page PDF containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.

CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, the majority (86.4%) being standard audit reports. Table I summarizes the dataset composition.
<!-- TABLE I: Dataset Summary
| Attribute | Value |
|-----------|-------|
| Total PDF documents | 90,282 |
| Date range | 2013–2023 |
| Documents with signatures | 86,072 (95.4%) |
| Unique CPAs identified | 758 |
| Accounting firms | >50 |
-->
## C. Signature Page Identification

To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism. Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. The model was configured with temperature 0 for deterministic output.

The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports, in which the auditor's report page is consistently located in the first quarter of the document. Scanning terminated upon the first positive detection. This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.
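The scanning procedure above reduces to an early-terminating loop over the first quartile of pages. The sketch below is illustrative; `is_signature_page` is a hypothetical callback standing in for the VLM query and is not part of the released pipeline.

```python
from math import ceil

def find_signature_page(num_pages, is_signature_page):
    """Scan only the first quartile of a document's pages, stopping at
    the first positive detection.

    `is_signature_page(page_idx)` is a hypothetical stand-in for the
    VLM query described in the text; it returns True/False per page.
    """
    limit = ceil(num_pages / 4)          # first quartile of the page count
    for page in range(limit):
        if is_signature_page(page):      # early termination on first positive
            return page
    return None                          # no signature page found in range
```

A signature page beyond the first quartile is therefore never inspected, which is the intended trade-off given the regulatory page structure.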
Cross-validation between the VLM and the subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents, establishing an upper bound on the VLM false-positive rate of 1.2%.
## D. Signature Detection

We adopted YOLOv11n (nano variant) [25] for signature region localization. A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.

The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
<!-- TABLE II: YOLO Detection Performance
| Metric | Value |
|--------|-------|
| Precision | 0.97–0.98 |
| Recall | 0.95–0.98 |
| mAP@0.50 | 0.98–0.99 |
| mAP@0.50:0.95 | 0.85–0.90 |
-->
Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 parallel workers). A red-stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.

Each signature was matched to its corresponding CPA by positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
## E. Feature Extraction

Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning. The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.

Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
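A minimal NumPy sketch of this preprocessing follows; nearest-neighbor resizing stands in for the production interpolation for brevity, and the function names are ours rather than part of the released code.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(img, size=224):
    """img: HxWx3 float array in [0, 1]. Aspect-preserving resize
    (nearest-neighbor here for brevity), white padding to a square
    canvas, then ImageNet channel normalization."""
    h, w, _ = img.shape
    scale = size / max(h, w)
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    canvas = np.ones((size, size, 3))            # white padding
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return (canvas - IMAGENET_MEAN) / IMAGENET_STD

def l2_normalize(f):
    """After L2 normalization, cosine similarity reduces to a dot product."""
    return f / np.linalg.norm(f)
```

The L2 step is what licenses the dot-product formulation of cosine similarity used in Section III-F.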
The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D). This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
## F. Dual-Method Similarity Descriptors

For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:

**Cosine similarity on deep embeddings** captures high-level visual style:

$$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$

where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized 2048-dim feature vectors. Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].

**Perceptual hash distance (dHash)** captures structural-level similarity. Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images. Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
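A compact dHash implementation consistent with the description above; nearest-neighbor shrinking stands in for whatever resampling a production imaging library would apply.

```python
import numpy as np

def dhash(gray, hash_w=8):
    """64-bit difference hash of a grayscale image (2D array).

    The image is shrunk to hash_w rows by hash_w+1 columns (8x9) via
    nearest-neighbor sampling for brevity; each bit records whether a
    pixel is brighter than its left-hand neighbor.
    """
    h, w = gray.shape
    rows = np.linspace(0, h - 1, hash_w).astype(int)
    cols = np.linspace(0, w - 1, hash_w + 1).astype(int)
    small = gray[np.ix_(rows, cols)]         # 8 rows x 9 columns
    bits = small[:, 1:] > small[:, :-1]      # horizontal gradient sign
    return bits.flatten()                    # 64 booleans = 64-bit hash

def hamming(h1, h2):
    """Number of differing bits between two fingerprints."""
    return int(np.count_nonzero(h1 != h2))
```

Identical inputs yield distance 0 by construction, which is what makes the descriptor effective for near-exact duplicate detection.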
These descriptors provide partially independent evidence. Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. Non-hand-signing yields extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise. Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.

We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content. Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle.
## G. Unit of Analysis and Summary Statistics

Two unit-of-analysis choices are relevant for this study: (i) the *signature*---one signature image extracted from one report---and (ii) the *accountant*---the collection of all signatures attributed to a single CPA across the sample period. A third, composite unit---the *auditor-year*, i.e., all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest; we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution.

For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA. The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism. Mean statistics would dilute this signal.
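The max/min statistics vectorize directly; the sketch below assumes precomputed L2-normalized embeddings and a pairwise dHash distance matrix for one CPA's signatures.

```python
import numpy as np

def per_signature_stats(embeddings, dhash_dists):
    """Per-signature extreme statistics within one CPA's signature set.

    embeddings: (n, d) L2-normalized feature matrix for one CPA.
    dhash_dists: (n, n) symmetric Hamming-distance matrix.
    Returns (max cosine, min dHash) against all *other* signatures of the
    same CPA, implementing the max/min identification logic above.
    """
    sims = embeddings @ embeddings.T        # cosine = dot for unit vectors
    np.fill_diagonal(sims, -np.inf)         # exclude self-matches
    d = dhash_dists.astype(float)
    np.fill_diagonal(d, np.inf)
    return sims.max(axis=1), d.min(axis=1)
```

Excluding the diagonal is essential: a signature trivially matches itself at cosine 1 and distance 0.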
For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. These accountant-level aggregates are the input to the mixture model described in Section III-I.
## H. Calibration Reference: Firm A as a Replication-Dominated Population

A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring *replication-dominated population*: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class. This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration.

First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems. Crucially, the same interview evidence does *not* exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports.

Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners.

Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.

We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it. Its identification rests on domain knowledge and visual evidence that are independent of the statistical pipeline. The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I)---and for avoiding overclaim in downstream inference.
## I. Three-Method Convergent Threshold Determination

Direct assignment of thresholds based on prior intuition (e.g., cosine $\geq 0.95$ for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others. To place threshold selection on a statistically principled, data-driven footing, we apply *three independent* methods whose underlying assumptions decrease in strength. When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement is itself a diagnostic of distributional structure.

### 1) Method 1: KDE + Antimode with Bimodality Check

We fit a Gaussian kernel density estimate to the similarity distribution using Scott's rule for bandwidth selection [28]. A candidate threshold is taken at the local density minimum (antimode) between the modes of the fitted density. Because the antimode is meaningful only if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify stability.
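Method 1 can be sketched with SciPy's `gaussian_kde`, whose default bandwidth is Scott's rule. The dip test itself is available in third-party packages (e.g., `diptest`) and is omitted from this sketch; here a unimodal fit simply returns no antimode.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_antimode(x, grid_size=2000):
    """Candidate threshold at the KDE antimode (Method 1).

    Fits a Gaussian KDE with Scott's-rule bandwidth (SciPy's default)
    and returns the lowest-density location between the two outermost
    modes, or None when fewer than two modes are found.
    """
    kde = gaussian_kde(x)                    # Scott's rule by default
    grid = np.linspace(x.min(), x.max(), grid_size)
    dens = kde(grid)
    # interior local maxima of the fitted density are the modes
    peaks = np.where((dens[1:-1] > dens[:-2]) &
                     (dens[1:-1] > dens[2:]))[0] + 1
    if len(peaks) < 2:
        return None                          # unimodal: antimode undefined
    lo_i, hi_i = peaks[0], peaks[-1]
    return grid[lo_i + np.argmin(dens[lo_i:hi_i + 1])]
```

The $\pm 50\%$ bandwidth sensitivity analysis described above would rerun this with `gaussian_kde(x, bw_method=...)` over a grid of scaled factors.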
### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity

We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39]. We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin $i$ with count $n_i$, the standardized deviation from the smooth-null expectation of the average of its neighbours,

$$Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},$$

which is approximately $N(0,1)$ under the null of distributional smoothness. A candidate threshold is identified at the transition where a significantly negative $Z_{i-1}$ (observed count below expectation) is immediately followed by a significantly positive $Z_i$ (observed count above expectation); for distributions in which the non-hand-signed peak sits to the right of a valley, this $Z^- \rightarrow Z^+$ transition marks the candidate decision boundary.
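The statistic translates directly into code; this sketch operates on a precomputed histogram of bin counts.

```python
import numpy as np

def bin_z_scores(counts):
    """Standardized deviation of each bin count from the average of its
    neighbors, following the Burgstahler-Dichev statistic quoted above.
    Under a smooth null each interior Z_i is approximately N(0, 1);
    boundary bins have no two-sided neighborhood and are left as NaN."""
    n = np.asarray(counts, dtype=float)
    N = n.sum()
    p = n / N
    z = np.full(n.shape, np.nan)
    for i in range(1, len(n) - 1):
        expect = 0.5 * (n[i - 1] + n[i + 1])
        var = (N * p[i] * (1 - p[i])
               + 0.25 * N * (p[i - 1] + p[i + 1])
                      * (1 - p[i - 1] - p[i + 1]))
        z[i] = (n[i] - expect) / np.sqrt(var)
    return z
```

A threshold candidate is then any index where `z[i-1]` is significantly negative and `z[i]` significantly positive.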
### 3) Method 3: Finite Mixture Model via EM

We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates, which are numerically stable for bounded proportion data. The first component represents non-hand-signed signatures (high mean, narrow spread); the second represents hand-signed signatures (lower mean, wider spread). Under the fitted model the threshold is the crossing point of the two weighted component densities,

$$\pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2),$$

solved numerically via bracketed root-finding. As a robustness check against the Beta parametric form, we fit a parallel two-component Gaussian mixture to the *logit-transformed* similarity, following standard practice for bounded proportion data; White's quasi-MLE consistency result [41] guarantees asymptotic recovery of the best Beta-family approximation to the true distribution even if the true component densities are not exactly Beta.
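A minimal EM sketch under the stated assumptions. The method-of-moments M-step matches the text; the median-split initialization and parameter floors are our illustrative choices, not documented details of the pipeline.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import beta

def _mom(mean, var):
    """Method-of-moments Beta parameters from a weighted mean/variance."""
    common = mean * (1 - mean) / max(var, 1e-9) - 1
    return max(mean * common, 1e-3), max((1 - mean) * common, 1e-3)

def beta_mixture_em(x, n_iter=200):
    """Two-component Beta mixture via EM with method-of-moments M-steps.
    Returns (pi1, (a1, b1), (a2, b2)); component 1 is initialized on the
    upper half of the data, i.e. the high-mean (non-hand-signed) side."""
    x = np.clip(x, 1e-4, 1 - 1e-4)
    hi, lo = x[x >= np.median(x)], x[x < np.median(x)]
    params = [_mom(hi.mean(), hi.var()), _mom(lo.mean(), lo.var())]
    pi1 = 0.5
    for _ in range(n_iter):
        p1 = pi1 * beta.pdf(x, *params[0])
        p2 = (1 - pi1) * beta.pdf(x, *params[1])
        r = p1 / (p1 + p2)                   # E-step responsibilities
        pi1 = r.mean()
        for comp, w in ((0, r), (1, 1 - r)):  # MoM M-step per component
            m = np.average(x, weights=w)
            v = np.average((x - m) ** 2, weights=w)
            params[comp] = _mom(m, v)
    return pi1, params[0], params[1]

def crossing(pi1, p1, p2, lo=0.05, hi=0.999):
    """Threshold where the weighted component densities cross (bracketed
    root-finding, as in the equation above)."""
    f = lambda t: pi1 * beta.pdf(t, *p1) - (1 - pi1) * beta.pdf(t, *p2)
    return brentq(f, lo, hi)
```

The bracket endpoints assume the high-mean component dominates near 1 and the low-mean component near the left end; degenerate fits would require widening or checking the bracket.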
We fit 2- and 3-component variants of each mixture and report BIC for model selection. When BIC prefers the 3-component fit, the 2-component assumption is itself a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.
### 4) Convergent Validation and Level-Shift Diagnostic

The three methods rest on assumptions of decreasing strength: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification. If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.

Equally informative is the *level at which the three methods agree*. Applied to the per-signature cosine distribution, the three methods may converge on a single boundary only weakly or not at all---because, as our results show, per-signature similarity is not a cleanly bimodal population. Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum. We therefore explicitly analyze both levels and interpret their divergence as a substantive finding (Section V) rather than a statistical nuisance.
## J. Accountant-Level Mixture Model

In addition to the signature-level analysis, we fit a two-dimensional Gaussian mixture model to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash). The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *behavior* is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level *quality* lies on a continuous spectrum.

We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$. For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
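With scikit-learn the BIC selection loop can be sketched as follows; the seed is illustrative, and the per-accountant point cloud is assumed to be an `(n, 2)` array of (mean cosine, mean dHash) pairs.

```python
from sklearn.mixture import GaussianMixture

def select_gmm(points, k_range=range(1, 6), n_init=15, seed=0):
    """BIC-based selection of a full-covariance Gaussian mixture over the
    per-accountant (mean cosine, mean dHash) plane, as described above."""
    best = None
    for k in k_range:
        gm = GaussianMixture(n_components=k, covariance_type="full",
                             n_init=n_init, random_state=seed).fit(points)
        if best is None or gm.bic(points) < best.bic(points):
            best = gm                        # keep the lowest-BIC model
    return best
```

The selected model's `means_`, `weights_`, and per-component responsibilities then feed the composition tables reported in Section IV.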
## K. Pixel-Identity and Firm A Validation (No Manual Annotation)

Rather than construct a stratified manual-annotation validation set, we validate the classifier using three naturally occurring reference populations that require no human labeling:

1. **Pixel-identical anchor (gold positive):** signatures whose nearest same-CPA match is byte-identical after crop and normalization. Handwriting physics makes byte identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing.

2. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above. Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity.

3. **Low-similarity anchor (gold negative):** signatures whose maximum same-CPA cosine similarity is below a conservative cutoff ($0.70$) that cannot plausibly arise from pixel-level duplication.
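The pixel-identical anchor reduces to exact-duplicate grouping, e.g. by hashing the normalized image bytes; `sha256` here is an implementation convenience rather than a documented pipeline choice, and any collision-resistant digest would serve.

```python
import hashlib

def pixel_identity_groups(images):
    """Group signatures whose normalized crops are byte-identical.

    `images` maps signature id -> raw bytes of the cropped, normalized
    image. Byte identity across independent signing events is physically
    implausible, so any group of size >= 2 is a gold-positive anchor.
    """
    groups = {}
    for sig_id, blob in images.items():
        digest = hashlib.sha256(blob).hexdigest()
        groups.setdefault(digest, []).append(sig_id)
    return [ids for ids in groups.values() if len(ids) >= 2]
```

Hashing makes the grouping linear in the number of signatures rather than quadratic in pairwise byte comparisons.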
From these anchors we report Equal Error Rate (EER), precision, recall, $F_1$, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3]. We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only as a spot check and does not contribute to reported metrics.
## L. Per-Document Classification

The final per-document classification combines the three-method thresholds with the dual-descriptor framework. Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors, with thresholds anchored to Firm A's empirical distribution:

1. **High-confidence non-hand-signed:** cosine $> 0.95$ AND dHash $\leq 5$. Both descriptors converge on strong replication evidence consistent with Firm A's median behavior.

2. **Moderate-confidence non-hand-signed:** cosine $> 0.95$ AND dHash in $[6, 15]$. Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations.

3. **High style consistency:** cosine $> 0.95$ AND dHash $> 15$. High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.

4. **Uncertain:** cosine between the KDE crossover (0.837) and 0.95, without sufficient convergent evidence for classification in either direction.

5. **Likely hand-signed:** cosine below the KDE crossover threshold.

The dHash thresholds ($\leq 5$ and $\leq 15$) are derived directly from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.
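The five-way rule above can be written as a small decision function; the numeric cutoffs are those quoted in the text.

```python
def classify(max_cosine, min_dhash, kde_cross=0.837):
    """Per-document five-way classification combining both descriptors.

    Cutoffs follow the text: dHash 5 and 15 from Firm A's median and
    95th percentile, 0.837 from the KDE crossover.
    """
    if max_cosine > 0.95:
        if min_dhash <= 5:
            return "high-confidence non-hand-signed"
        if min_dhash <= 15:
            return "moderate-confidence non-hand-signed"
        return "high style consistency"     # strong style, no structural match
    if max_cosine >= kde_cross:
        return "uncertain"
    return "likely hand-signed"
```

Because the dHash branch is reached only when cosine exceeds 0.95, structural evidence refines rather than overrides the feature-level evidence.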