
# I. Introduction

<!-- Target: ~1.5 pages double-column IEEE format. Double-blind: no author/institution info. -->

Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection.
In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require that certifying CPAs affix their signature or seal (簽名或蓋章) to each audit report [1].
While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].

The digitization of financial reporting, however, has introduced a practice that challenges this intent.
As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically trivial for a CPA to digitally replicate a single scanned signature image and paste it across multiple reports.
Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful attestation of individual professional judgment for each engagement.
Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, signature replication involves the legitimate signer reusing a digital copy of their own genuine signature.
This practice, while potentially widespread, is virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of digital duplication.

The distinction between signature *replication* and signature *forgery* is both conceptually and technically important.
The extensive body of research on offline signature verification [3]--[8] has focused almost exclusively on forgery detection---determining whether a questioned signature was produced by its purported author or by an impostor.
This framing presupposes that the central threat is identity fraud.
In our context, identity is not in question; the CPA is indeed the legitimate signer.
The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was digitally propagated across many reports.
This replication detection problem differs fundamentally from forgery detection.
It does not require modeling the variability of skilled forgers, but it introduces the distinct challenge of separating legitimate intra-signer consistency from digital duplication, which calls for an analytical framework focused on detecting abnormally high similarity across documents.

Despite the significance of this problem for audit quality and regulatory oversight, no prior work has specifically addressed the detection of same-signer digital replication in financial audit documents at scale.
Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of digital copies.
Copy-move forgery detection methods [10], [11] address duplicated regions within or across images, but are designed largely for natural images or scanned text and do not account for the specific characteristics of scanned document signatures, where legitimate visual similarity between a signer's authentic signatures is expected and must be distinguished from digital duplication.
Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations, but has not been applied to document forensics or signature analysis.

In this paper, we present a fully automated, end-to-end pipeline for detecting digitally replicated CPA signatures in audit reports at scale.
Our approach processes raw PDF documents through six sequential stages: (1) signature page identification using a Vision-Language Model (VLM), (2) signature region detection using a trained YOLOv11 object detector, (3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network, (4) dual-method similarity verification combining cosine similarity of deep features with difference hash (dHash) distance, (5) distribution-free threshold calibration using a known-replication reference group, and (6) statistical classification with cross-method validation.

The dual-method verification is central to our contribution.
Cosine similarity of deep feature embeddings captures high-level visual style similarity---it can identify signatures that share similar stroke patterns and spatial layouts---but cannot distinguish between a CPA who signs consistently and one who reuses a digital copy.
Perceptual hashing (specifically, difference hashing), by contrast, encodes structural-level image gradients into compact binary fingerprints that are robust to scan noise but sensitive to substantive content differences.
By requiring convergent evidence from both methods, we can differentiate *style consistency* (high cosine similarity but a large dHash distance) from *digital replication* (high cosine similarity together with a small dHash distance), resolving an ambiguity that neither method can address alone.
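
The decision logic described in this paragraph can be illustrated with a minimal sketch. It is not the implementation used in the pipeline: the cut-off values `COSINE_MIN` and `DHASH_MAX_BITS`, the label strings, and all function names are hypothetical, and the nearest-neighbour resampling is a simplification of the smoothed grayscale resizing that a production dHash would normally use.

```python
import math

# Hypothetical cut-offs for illustration only; not calibrated values.
COSINE_MIN = 0.95
DHASH_MAX_BITS = 5


def resize_nearest(img, width, height):
    """Nearest-neighbour resample of a 2D grayscale image (list of rows)."""
    src_h, src_w = len(img), len(img[0])
    return [
        [img[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def dhash(img, hash_size=8):
    """Difference hash: one bit per horizontal neighbour comparison."""
    small = resize_nearest(img, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def classify_pair(cos_sim, dhash_bits,
                  cosine_min=COSINE_MIN, dhash_max_bits=DHASH_MAX_BITS):
    """Combine both signals into one of three illustrative labels."""
    if cos_sim < cosine_min:
        return "dissimilar"
    if dhash_bits <= dhash_max_bits:
        return "digital_replication"  # both methods agree
    return "style_consistency"        # similar style, different image structure
```

In this sketch a pair is labeled as replication only when both signals agree, so a consistent signer whose signatures share a style but differ in image structure falls into the `style_consistency` branch.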

A distinctive feature of our approach is the use of a known-replication calibration group for threshold validation.
One Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as using digitally replicated signatures across its audit reports.
This status rests on two lines of evidence that predate our analysis: (1) visual inspection of a random sample of Firm A's reports reveals pixel-identical signature images across different audit engagements and fiscal years; and (2) the practice is acknowledged as common knowledge among audit practitioners in Taiwan.
Our subsequent quantitative analysis corroborates this status: 92.5% of Firm A's signatures exhibit best-match cosine similarity exceeding 0.95, consistent with digital replication rather than handwriting.
Importantly, Firm A's known-replication status was not derived from the thresholds we calibrate against it; the identification is based on domain knowledge and visual evidence that is independent of the statistical pipeline.
This provides an empirical anchor for calibrating detection thresholds: any threshold that fails to classify the vast majority of Firm A's signatures as replicated is demonstrably too conservative, while Firm A's distributional characteristics establish the range of similarity values achievable through replication in real-world scanned documents.
This calibration strategy---using a known-positive subpopulation to validate detection thresholds---addresses a persistent challenge in document forensics, where comprehensive ground truth labels are scarce.
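
One way to picture such a calibration is as an order-statistic rule on the reference group's similarity scores. The sketch below is an assumption-laden illustration rather than the procedure detailed later in the paper: the function names, the `target_recall` parameter, and the toy scores are all hypothetical.

```python
import math


def calibrate_threshold(reference_scores, target_recall=0.95):
    """Largest cut-off t such that at least `target_recall` of the
    known-replication reference scores satisfy score >= t.

    Uses only order statistics, so no distributional form is assumed.
    """
    if not reference_scores:
        raise ValueError("reference_scores must be non-empty")
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    ordered = sorted(reference_scores, reverse=True)
    k = math.ceil(target_recall * len(ordered))  # scores that must pass
    return ordered[k - 1]


def recall_at(scores, threshold):
    """Fraction of scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Synthetic toy scores, not data from the study.
toy_reference = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.91, 0.60]
threshold = calibrate_threshold(toy_reference, target_recall=0.85)
```

With these toy values the rule tolerates the single low outlier instead of lowering the cut-off to accommodate it, which mirrors the idea that a usable threshold should capture the vast majority, though not necessarily all, of the reference group.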

We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants.
To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature.

The contributions of this paper are summarized as follows:

1. **Problem formulation:** We formally define the signature replication detection problem as distinct from signature forgery detection, and argue that it requires a different analytical framework focused on intra-signer similarity distributions rather than genuine-versus-forged classification.

2. **End-to-end pipeline:** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-method similarity verification, with automated inference requiring no manual intervention after initial training and annotation.

3. **Dual-method verification:** We demonstrate that combining deep feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and digital replication, supported by an ablation study comparing three feature extraction backbones.

4. **Calibration methodology:** We introduce a threshold calibration approach using a known-replication reference group, providing empirical validation in a domain where labeled ground truth is scarce.

5. **Large-scale empirical analysis:** We report findings from the analysis of 90,282 audit reports filed between 2013 and 2023, providing, to our knowledge, the first large-scale empirical evidence on signature replication practices in financial reporting.

The remainder of this paper is organized as follows.
Section II reviews related work on signature verification, document forensics, and perceptual hashing.
Section III describes the proposed methodology.
Section IV presents experimental results including the ablation study and calibration group analysis.
Section V discusses the implications and limitations of our findings.
Section VI concludes with directions for future work.

<!--
REFERENCES used in Introduction:
[1] Taiwan CPA Act §4 (會計師法第4條) + FSC Attestation Regulations §6 (查核簽證核准準則第6條)
    - CPA Act: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067
    - FSC Regs: https://law.moj.gov.tw/LawClass/LawAll.aspx?pcode=G0400013
[2] Yen, Chang & Chen 2013: Does the signature of a CPA matter? (Res. Account. Regul., vol. 25, no. 2)
[3] Bromley et al. 1993: Siamese time delay neural network for signature verification (NeurIPS)
[4] Dey et al. 2017: SigNet, Siamese CNN for writer-independent offline SV (arXiv:1707.02131)
[5] Hadjadj et al. 2020: Single known sample offline SV (Applied Sciences)
[6] Li et al. 2024: TransOSV, Transformer for offline SV (Pattern Recognition)
[7] Tehsin et al. 2024: Triplet Siamese for digital documents (Mathematics)
[8] Brimoh & Olisah 2024: Consensus threshold for offline SV (arXiv:2401.03085)
[9] Woodruff et al. 2021: Fully automatic pipeline for document signature analysis / money laundering (arXiv:2107.14091)
[10] Abramova & Böhme 2016: Copy-move forgery detection in scanned text documents (Electronic Imaging)
[11] Copy-move forgery detection survey (MTAP 2024)
[12] Jakhar & Borah 2025: Near-duplicate detection using pHash + deep learning (Info. Processing & Management)
[13] Pizzi et al. 2022: SSCD, Self-supervised copy detection (CVPR)
-->