# III. Methodology

## A. Pipeline Overview

We propose a six-stage pipeline for large-scale signature-replication detection in scanned financial documents. Fig. 1 illustrates the overall architecture. The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures into one of four categories (definite replication, likely replication, uncertain, or likely genuine), along with supporting evidence from multiple verification methods.

## B. Data Collection

The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings. An automated web-scraping pipeline built on Selenium WebDriver systematically downloaded all audit reports for each listed company across the study period. Each report is a multi-page PDF containing, among other content, the auditor's report page bearing the handwritten signatures of the certifying CPAs.

CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs. The corpus spans 15 document types, the majority (86.4%) being standard audit reports. Table I summarizes the dataset composition.

## C. Signature Page Identification

To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism. Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. The model was run at temperature 0 for deterministic output.
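As an illustration, the per-page screening call might be wrapped as follows. This is a minimal sketch: `query_vlm` is a hypothetical callable around a Qwen2.5-VL deployment, and the prompt wording is an assumption, not the exact prompt used in the study.

```python
def page_has_signature(page_jpeg_bytes, query_vlm):
    """Binary pre-screening of one rendered page.

    `query_vlm` is a hypothetical wrapper around a Qwen2.5-VL
    endpoint run at temperature 0; it takes a text prompt plus
    JPEG bytes and returns the model's raw text answer.
    """
    # Structured prompt forcing a binary determination (illustrative wording).
    prompt = (
        "Does this page contain a Chinese handwritten signature? "
        "Answer with exactly one word: yes or no."
    )
    answer = query_vlm(prompt, page_jpeg_bytes)
    return answer.strip().lower().startswith("yes")
```

Parsing the answer by prefix rather than exact match tolerates minor variations such as trailing punctuation in the model's reply.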
The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports, in which the auditor's report page is consistently located in the first quarter of the document. Scanning terminated at the first positive detection. This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. A further 12 corrupted PDFs were also excluded, yielding a final set of 86,071 documents. Cross-validation between the VLM and the subsequent YOLO detector confirmed high agreement: YOLO detected signature regions in 98.8% of VLM-positive documents, placing an upper bound of 1.2% on the VLM false-positive rate.

## D. Signature Detection

We adopted YOLOv11n (the nano variant) [25] for signature-region localization. A training set of 500 randomly sampled signature pages was annotated through a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. A region was labeled "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps. The model was trained for 100 epochs on a 425/75 training/validation split from COCO pre-trained initialization, achieving strong detection performance (Table II).

Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). A red-stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content. Each signature was then matched to its corresponding CPA by positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
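The first-quartile scan with early stopping described in Section III-C can be sketched as follows; `page_is_positive` stands in for the VLM screening call, and page numbering is 1-based.

```python
import math

def find_signature_page(n_pages, page_is_positive):
    """Scan only the first quartile of pages, stopping at the first hit.

    `page_is_positive` is a stand-in for the VLM screening call.
    Returns the 1-based page number of the first positive page,
    or None if no page in the scanning range matches.
    """
    # Restrict scanning to the first quarter of the document's pages.
    limit = max(1, math.ceil(n_pages / 4))
    for page in range(1, limit + 1):
        if page_is_positive(page):
            return page  # early stop on first positive detection
    return None
```

For a 40-page report this issues at most 10 VLM calls instead of 40, which is where most of the screening cost savings come from.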
## E. Feature Extraction

Each extracted signature was encoded into a feature vector by a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning. The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer. Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. All feature vectors were L2-normalized, so that cosine similarity reduces to a dot product.

The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, so general-purpose discriminative features suffice; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though a fine-tuned model could potentially improve discriminative performance (see Section V-D). This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.

## F. Dual-Method Similarity Verification

For each signature, the most similar signature from the same CPA across all other documents was identified via cosine similarity of feature vectors. Two complementary measures were then computed against this closest match.

**Cosine similarity** captures high-level visual style similarity:

$$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$

where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized feature vectors. High cosine similarity indicates that two signatures share similar visual characteristics (stroke patterns, spatial layout, and overall appearance) but does not distinguish consistent handwriting style from digital duplication.
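Because the feature vectors are L2-normalized, the similarity above is just a dot product. A minimal plain-Python sketch (the actual pipeline applies this to the 2048-dimensional ResNet-50 embeddings):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(fa, fb):
    """Dot product of two L2-normalized feature vectors,
    which equals their cosine similarity."""
    return sum(a * b for a, b in zip(fa, fb))
```

Identical signatures yield a similarity of 1.0; orthogonal feature vectors yield 0.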
**Perceptual hash distance** captures structural-level similarity. Specifically, we employ the difference hash (dHash) [27], a perceptual-hashing variant that encodes relative intensity gradients rather than absolute pixel values. Each signature image is resized to 9×8 pixels and converted to grayscale; intensity differences between horizontally adjacent pixels (8 comparisons per row across 8 rows) produce a 64-bit binary fingerprint. The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images. Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective at detecting near-exact duplicates with minor scan-induced variations [19].

The complementarity of these two measures is the key to resolving the style-versus-replication ambiguity:

- High cosine similarity + low dHash distance → converging evidence of digital replication
- High cosine similarity + high dHash distance → consistent handwriting style, not replication

This dual-method design was preferred over SSIM (Structural Similarity Index), which proved unreliable for scanned documents: a known-replication firm exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content. Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle, making them more suitable for this application.
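As a sketch, the dHash fingerprint and Hamming distance can be computed as follows. Resizing to 9×8 grayscale is assumed to happen upstream, so the input here is simply 8 rows of 9 intensity values.

```python
def dhash(gray_9x8):
    """64-bit difference hash from an 8-row x 9-column grayscale grid.

    Each of the 8 rows yields 8 bits: a bit is 1 if the left pixel
    is brighter than its right neighbor (a horizontal gradient).
    """
    bits = 0
    for row in gray_9x8:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits between two 64-bit fingerprints."""
    return bin(h1 ^ h2).count("1")
```

Because only relative brightness between neighbors is encoded, uniform changes in scan brightness or contrast leave the fingerprint unchanged, which is what makes dHash robust to print-scan noise.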
## G. Threshold Selection and Calibration

### Distribution-Free Thresholds

To establish classification thresholds, we computed cosine-similarity distributions for two groups:

- **Intra-class (same CPA):** all pairwise similarities among signatures attributed to the same CPA (41.3M pairs from the 728 CPAs with ≥3 signatures)
- **Inter-class (different CPAs):** 500,000 randomly sampled cross-CPA pairs

Shapiro-Wilk tests rejected normality for both distributions ($p < 0.001$), motivating distribution-free, percentile-based thresholds rather than parametric ($\mu \pm k\sigma$) approaches. The primary threshold was derived via kernel density estimation (KDE) [28] as the crossover point where the intra-class and inter-class density functions intersect. Under equal prior probabilities and symmetric misclassification costs, this crossover approximates the optimal decision boundary between the two classes.

### Known-Replication Calibration

A distinctive aspect of our methodology is the use of Firm A (a major Big-4 accounting firm whose use of digitally replicated signatures was established through independent visual inspection and domain knowledge prior to threshold calibration; see Section I) as a calibration reference. Firm A's signature-similarity distribution provides two critical anchors:

1. **Lower-bound validation:** Any detection threshold must classify the vast majority of Firm A's signatures as replicated; a threshold that fails this criterion is too conservative.
2. **Replication-floor estimation:** Firm A's 1st percentile of cosine similarity establishes how low similarity scores can fall even among confirmed replicated signatures, owing to scan noise and PDF compression artifacts. This lower bound on replication similarity informs the minimum sensitivity required of any detection threshold.

This calibration strategy addresses a persistent challenge in document forensics: comprehensive ground-truth labels are unavailable.
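The KDE crossover can be sketched with a plain Gaussian kernel and a grid scan between the two distributions; the bandwidth and grid resolution here are illustrative assumptions, not the values used in the study.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a Gaussian kernel density estimate over `samples`."""
    z = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / z
    return density

def kde_crossover(inter, intra, bandwidth=0.02, steps=2000):
    """Grid-scan for the point where the inter-class density
    (dominant at low similarity) first falls below the
    intra-class density (dominant at high similarity)."""
    f_inter = gaussian_kde(inter, bandwidth)
    f_intra = gaussian_kde(intra, bandwidth)
    lo, hi = min(inter), max(intra)  # scan across both distributions
    prev_x = lo
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        if f_intra(x) >= f_inter(x):
            return 0.5 * (prev_x + x)  # midpoint of the bracketing step
        prev_x = x
    return None
```

Under equal priors and costs, classifying a pair as intra-class whenever its similarity exceeds this crossover minimizes the expected misclassification rate, which is why the crossover serves as the primary threshold.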
## H. Classification

The final per-document classification uses exclusively the dual-method framework (cosine similarity + dHash distance), with thresholds calibrated against Firm A's known-replication distribution. Firm A's dHash distances show a median of 5 and a 95th percentile of 15; we use these empirical values to define confidence tiers:

1. **High-confidence replication:** cosine similarity > 0.95 AND dHash distance ≤ 5. Feature-level and structural-level evidence converge, consistent with Firm A's median behavior.
2. **Moderate-confidence replication:** cosine similarity > 0.95 AND dHash distance 6–15. Feature-level evidence is strong; structural similarity is present but below the Firm A median, possibly due to scan variations.
3. **High style consistency:** cosine similarity > 0.95 AND dHash distance > 15. High feature-level similarity without structural corroboration, consistent with a CPA who signs very consistently but not digitally.
4. **Uncertain:** cosine similarity between the KDE crossover (0.837) and 0.95, without sufficient evidence for classification in either direction.
5. **Likely genuine:** cosine similarity below the KDE crossover threshold.

The dHash cut-offs (≤ 5 and ≤ 15) are derived directly from Firm A's calibration distribution rather than set ad hoc, ensuring that the classification boundaries are empirically grounded.
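The five tiers above reduce to a simple decision rule. A minimal sketch using the stated thresholds (0.95, the 0.837 KDE crossover, and Firm A's dHash cut-offs); the constant and function names are our own:

```python
KDE_CROSSOVER = 0.837  # intra/inter KDE crossover (Section III-G)
COS_HIGH = 0.95        # high feature-level similarity
DHASH_MEDIAN = 5       # Firm A median dHash distance
DHASH_P95 = 15         # Firm A 95th-percentile dHash distance

def classify(cos_sim, dhash_dist):
    """Map (cosine similarity, dHash distance) to a confidence tier."""
    if cos_sim > COS_HIGH:
        if dhash_dist <= DHASH_MEDIAN:
            return "high-confidence replication"
        if dhash_dist <= DHASH_P95:
            return "moderate-confidence replication"
        return "high style consistency"
    if cos_sim >= KDE_CROSSOVER:
        return "uncertain"
    return "likely genuine"
```

Note that the dHash distance only matters once the cosine similarity exceeds 0.95; below that, the structural evidence is never consulted.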