Paper A v3: full rewrite for IEEE Access with three-method convergence
Major changes from v2:

Terminology:
- "digitally replicated" -> "non-hand-signed" throughout (per partner v3 feedback and to avoid implicit accusation)
- "Firm A near-universal non-hand-signing" -> "replication-dominated" (per interview nuance: most but not all Firm A partners use replication)

Target journal: IEEE TAI -> IEEE Access (per NCKU CSIE list)

New methodological sections (III.G-III.L + IV.D-IV.G):
- Three convergent threshold methods (KDE antimode + Hartigan dip test / Burgstahler-Dichev McCrary / EM-fitted Beta mixture + logit-GMM robustness check)
- Explicit unit-of-analysis discussion (signature vs accountant)
- Accountant-level 2D Gaussian mixture (BIC-best K=3 found empirically)
- Pixel-identity validation anchor (no manual annotation needed)
- Low-similarity negative anchor + Firm A replication-dominated anchor

New empirical findings integrated:
- Firm A signature cosine UNIMODAL (dip p=0.17) - long left tail = minority hand-signers
- Full-sample cosine MULTIMODAL but not cleanly bimodal (BIC prefers 3-comp mixture) - signature-level is continuous quality spectrum
- Accountant-level mixture trimodal (C1 Deloitte-heavy 139/141, C2 other Big-4, C3 smaller firms). 2-comp crossings cos=0.945, dh=8.10
- Pixel-identity anchor (310 pairs) gives perfect recall at all cosine thresholds
- Firm A anchor rates: cos>0.95=92.5%, dual-rule cos>0.95 AND dh<=8=89.95%

New discussion section V.B: "Continuous-quality spectrum vs discrete-behavior regimes" - the core interpretive contribution of v3.

References added: Hartigan & Hartigan 1985, Burgstahler & Dichev 1997, McCrary 2008, Dempster-Laird-Rubin 1977, White 1982 (refs 37-41).

export_v3.py builds Paper_A_IEEE_Access_Draft_v3.docx (462 KB, +40% vs v2 from expanded methodology + results sections).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Binary file not shown.
@@ -0,0 +1,233 @@
#!/usr/bin/env python3
"""Export Paper A v3 (IEEE Access target) to Word, reading from v3 md section files."""

from docx import Document
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from pathlib import Path
import re

PAPER_DIR = Path("/Volumes/NV2/pdf_recognize/paper")
FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures")
EXTRA_FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
OUTPUT = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"

SECTIONS = [
    "paper_a_abstract_v3.md",
    "paper_a_impact_statement_v3.md",
    "paper_a_introduction_v3.md",
    "paper_a_related_work_v3.md",
    "paper_a_methodology_v3.md",
    "paper_a_results_v3.md",
    "paper_a_discussion_v3.md",
    "paper_a_conclusion_v3.md",
    "paper_a_references_v3.md",
]

# Figure insertion hooks (trigger phrase -> (file, caption, width inches)).
# New figures for v3: dip test, BD/McCrary overlays, accountant GMM 2D + marginals.
FIGURES = {
    "Fig. 1 illustrates": (
        FIG_DIR / "fig1_pipeline.png",
        "Fig. 1. Pipeline architecture for automated non-hand-signed signature detection.",
        6.5,
    ),
    "Fig. 2 presents the cosine similarity distributions for intra-class": (
        FIG_DIR / "fig2_intra_inter_kde.png",
        "Fig. 2. Cosine similarity distributions: intra-class vs. inter-class with KDE crossover at 0.837.",
        3.5,
    ),
    "Fig. 3 presents the per-signature cosine and dHash distributions of Firm A": (
        FIG_DIR / "fig3_firm_a_calibration.png",
        "Fig. 3. Firm A per-signature cosine and dHash distributions against the overall CPA population.",
        3.5,
    ),
    "Fig. 4 visualizes the accountant-level clusters": (
        EXTRA_FIG_DIR / "accountant_mixture" / "accountant_mixture_2d.png",
        "Fig. 4. Accountant-level 3-component Gaussian mixture in the (cosine-mean, dHash-mean) plane.",
        4.5,
    ),
    "conducted an ablation study comparing three": (
        FIG_DIR / "fig4_ablation.png",
        "Fig. 5. Ablation study comparing three feature extraction backbones.",
        6.5,
    ),
}


def strip_comments(text):
    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)


def add_md_table(doc, table_lines):
    rows_data = []
    for line in table_lines:
        cells = [c.strip() for c in line.strip("|").split("|")]
        if not re.match(r"^[-: ]+$", cells[0]):
            rows_data.append(cells)
    if len(rows_data) < 2:
        return
    ncols = len(rows_data[0])
    table = doc.add_table(rows=len(rows_data), cols=ncols)
    table.style = "Table Grid"
    for r_idx, row in enumerate(rows_data):
        for c_idx in range(min(len(row), ncols)):
            cell = table.rows[r_idx].cells[c_idx]
            cell.text = row[c_idx]
            for p in cell.paragraphs:
                p.alignment = WD_ALIGN_PARAGRAPH.CENTER
                for run in p.runs:
                    run.font.size = Pt(8)
                    run.font.name = "Times New Roman"
                    if r_idx == 0:
                        run.bold = True
    doc.add_paragraph()


def _insert_figures(doc, para_text):
    for trigger, (fig_path, caption, width) in FIGURES.items():
        if trigger in para_text and fig_path.exists():
            fp = doc.add_paragraph()
            fp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            fr = fp.add_run()
            fr.add_picture(str(fig_path), width=Inches(width))
            cp = doc.add_paragraph()
            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            cr = cp.add_run(caption)
            cr.font.size = Pt(9)
            cr.font.name = "Times New Roman"
            cr.italic = True


def process_section(doc, filepath):
    text = filepath.read_text(encoding="utf-8")
    text = strip_comments(text)
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        if not stripped:
            i += 1
            continue
        if stripped.startswith("# "):
            h = doc.add_heading(stripped[2:], level=1)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("## "):
            h = doc.add_heading(stripped[3:], level=2)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("### "):
            h = doc.add_heading(stripped[4:], level=3)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if "|" in stripped and i + 1 < len(lines) and re.match(r"\s*\|[-|: ]+\|", lines[i + 1]):
            table_lines = []
            while i < len(lines) and "|" in lines[i]:
                table_lines.append(lines[i])
                i += 1
            add_md_table(doc, table_lines)
            continue
        if re.match(r"^\d+\.\s", stripped):
            p = doc.add_paragraph(style="List Number")
            content = re.sub(r"^\d+\.\s", "", stripped)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            run = p.add_run(content)
            run.font.size = Pt(10)
            run.font.name = "Times New Roman"
            i += 1
            continue
        if stripped.startswith("- "):
            p = doc.add_paragraph(style="List Bullet")
            content = stripped[2:]
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            run = p.add_run(content)
            run.font.size = Pt(10)
            run.font.name = "Times New Roman"
            i += 1
            continue
        # Regular paragraph: accumulate lines until a blank line or block-level marker.
        para_lines = [stripped]
        i += 1
        while i < len(lines):
            nxt = lines[i].strip()
            if (
                not nxt
                or nxt.startswith("#")
                or nxt.startswith("|")
                or nxt.startswith("- ")
                or re.match(r"^\d+\.\s", nxt)
            ):
                break
            para_lines.append(nxt)
            i += 1
        para_text = " ".join(para_lines)
        para_text = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", para_text)
        para_text = re.sub(r"\*\*(.+?)\*\*", r"\1", para_text)
        para_text = re.sub(r"\*(.+?)\*", r"\1", para_text)
        para_text = re.sub(r"`(.+?)`", r"\1", para_text)
        para_text = para_text.replace("$$", "")
        para_text = para_text.replace("---", "\u2014")

        p = doc.add_paragraph()
        p.paragraph_format.space_after = Pt(6)
        run = p.add_run(para_text)
        run.font.size = Pt(10)
        run.font.name = "Times New Roman"

        _insert_figures(doc, para_text)


def main():
    doc = Document()
    style = doc.styles["Normal"]
    style.font.name = "Times New Roman"
    style.font.size = Pt(10)

    # Title page
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(12)
    run = p.add_run(
        "Automated Identification of Non-Hand-Signed Auditor Signatures\n"
        "in Large-Scale Financial Audit Reports:\n"
        "A Dual-Descriptor Framework with Three-Method Convergent Thresholding"
    )
    run.font.size = Pt(16)
    run.font.name = "Times New Roman"
    run.bold = True

    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(6)
    run = p.add_run("[Authors removed for double-blind review]")
    run.font.size = Pt(10)
    run.italic = True

    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(20)
    run = p.add_run("Target journal: IEEE Access (Regular Paper)")
    run.font.size = Pt(10)
    run.italic = True

    for section_file in SECTIONS:
        filepath = PAPER_DIR / section_file
        if filepath.exists():
            process_section(doc, filepath)
        else:
            print(f"WARNING: missing section file: {filepath}")

    doc.save(str(OUTPUT))
    print(f"Saved: {OUTPUT}")


if __name__ == "__main__":
    main()
@@ -0,0 +1,17 @@
# Abstract

<!-- 200-270 words -->

Regulations in many jurisdictions require Certified Public Accountants (CPAs) to attest to each audit report they certify, typically by affixing a signature or seal.
However, the digitization of financial reporting makes it straightforward to reuse a stored signature image across multiple reports---whether by administrative stamping or firm-level electronic signing systems---potentially undermining the intent of individualized attestation.
Unlike signature forgery, where an impostor imitates another person's handwriting, *non-hand-signed* reproduction involves the legitimate signer's own stored signature image being reproduced on each report, a practice that is visually invisible to report users and infeasible to audit at scale through manual inspection.
We present an end-to-end AI pipeline that automatically detects non-hand-signed auditor signatures in financial audit reports.
The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash).
For threshold determination, we apply three statistically independent methods---kernel density antimode with a Hartigan dip test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level.
Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs), the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best $K = 3$), reflecting that individual signing *behavior* is close to discrete even when pixel-level output *quality* is continuous.
The accountant-level 2-component crossings yield principled thresholds (cosine $= 0.945$, dHash $= 8.10$).
A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with interview and visual evidence supporting majority non-hand-signing and a minority of hand-signers.
Validation against 310 pixel-identical signature pairs and a low-similarity negative anchor yields perfect recall at all candidate thresholds.
To our knowledge, this represents the largest-scale forensic analysis of auditor signature authenticity reported in the literature.

<!-- Word count: ~290 -->
@@ -0,0 +1,30 @@
# VI. Conclusion and Future Work

## Conclusion

We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels.

Our contributions are fourfold.

First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.

Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing, while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.

Third, we introduced a three-method convergent threshold framework combining KDE antimode (with a Hartigan dip test as formal bimodality check), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check).
Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.
The substantive reading is that *pixel-level output quality* is continuous while *individual signing behavior* is close to discrete.

Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor that requires no manual annotation.
This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters.

An ablation study comparing ResNet-50, VGG-16, and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.

## Future Work

Several directions merit further investigation.
Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
Extending the accountant-level analysis to auditor-year units---using the same three-method convergent framework but at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade.
The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself.
Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
@@ -0,0 +1,101 @@
# V. Discussion

## A. Non-Hand-Signing Detection as a Distinct Problem

Our results highlight the importance of distinguishing *non-hand-signing detection* from the well-studied *signature forgery detection* problem.
In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature.
In non-hand-signing detection, the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).

This distinction has direct methodological consequences.
Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures.
Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.
The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.
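
As a minimal sketch of this pairing, the functions below compute a simplified dHash (block-mean downsampling stands in for the interpolation-based resize that dHash implementations typically use) and the cosine similarity of embedding vectors; the function names and the 8-bit-per-row hash size are illustrative, not the paper's implementation.

```python
import numpy as np

def dhash_bits(gray, hash_size=8):
    """Simplified difference hash: block-mean downsample a grayscale
    array to hash_size x (hash_size + 1), then emit 1 wherever
    brightness increases between horizontally adjacent cells."""
    h, w = gray.shape
    rows = np.array_split(np.arange(h), hash_size)
    cols = np.array_split(np.arange(w), hash_size + 1)
    small = np.array([[gray[np.ix_(r, c)].mean() for c in cols] for r in rows])
    return (small[:, 1:] > small[:, :-1]).ravel()

def hamming(a, b):
    """Structural-level distance between two dHash bit vectors."""
    return int(np.count_nonzero(a != b))

def cosine(u, v):
    """Semantic-level similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

A pair is flagged only when both descriptors agree (cosine above its threshold and Hamming distance below its threshold), which is the conjunction the dual-rule figures in the text refer to.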

## B. Continuous-Quality Spectrum vs. Discrete-Behavior Regimes

The most consequential empirical finding of this study is the asymmetry between the signature level and the accountant level revealed by the three-method convergent analysis and the Hartigan dip test (Sections IV-D and IV-E).

At the per-signature level, the distribution of best-match cosine similarity is *not* cleanly bimodal.
Firm A's signature-level cosine is formally unimodal (dip test $p = 0.17$) with a long left tail.
The all-CPA signature-level cosine is multimodal ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary.
Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class.
Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains.

At the per-accountant aggregate level the picture reverses.
The distribution of per-accountant mean cosine (and mean dHash) is unambiguously multimodal.
A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms.
The two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.

The substantive interpretation is simple: *pixel-level output quality* is continuous, but *individual signing behavior* is close to discrete.
A given CPA tends to be either a consistent user of non-hand-signing or a consistent hand-signer; it is the mixing of these discrete behavioral types at the firm and population levels that produces the quality spectrum observed at the signature level.
Methodologically, this implies that the three-method convergent framework (KDE antimode / BD-McCrary / Beta mixture) is meaningful at the level where the mixture structure is statistically supported---namely the accountant level---and serves as diagnostic evidence rather than a primary classification boundary at the signature level.
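
The mixture-plus-BIC model selection at the heart of this comparison can be sketched in pure NumPy: a 1D Gaussian mixture fitted by EM and scored across component counts. This is a toy stand-in under simplified assumptions (one dimension, deterministic quantile initialization, the $3k-1$ parameter count appropriate to 1D); the paper's accountant-level analysis is 2D over (cosine-mean, dHash-mean), and its signature-level fits use Beta rather than Gaussian components.

```python
import numpy as np

def fit_gmm_1d(x, k, n_iter=200):
    """Fit a k-component 1D Gaussian mixture by EM.
    Deterministic quantile-based initialization; returns
    (weights, means, variances, log-likelihood)."""
    n = len(x)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    var = np.full(k, np.var(x))
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: per-observation log responsibilities.
        log_pdf = (-0.5 * (x[:, None] - mu) ** 2 / var
                   - 0.5 * np.log(2 * np.pi * var) + np.log(w))
        log_norm = np.logaddexp.reduce(log_pdf, axis=1, keepdims=True)
        r = np.exp(log_pdf - log_norm)
        # M-step: re-estimate weights, means, variances.
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = np.maximum((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk, 1e-8)
    return w, mu, var, float(log_norm.sum())

def bic(loglik, k, n):
    # A 1D k-component GMM has 3k - 1 free parameters.
    return (3 * k - 1) * np.log(n) - 2 * loglik
```

When a multi-component fit wins on BIC, a threshold can then be read off where adjacent component densities cross, which is the role the accountant-level crossings (cosine $= 0.945$, dHash $= 8.10$) play in the text.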

## C. Firm A as a Replication-Dominated, Not Pure, Population

A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class.
Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.

Three convergent strands of evidence support the replication-dominated framing.
First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign.
Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal with a long tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers.

The replication-dominated framing is internally coherent with all three pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise.
We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.

## D. The Style-Replication Gap

Within the 71,656 documents exceeding cosine $0.95$, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.

The 7.2% classified as "high style consistency" (cosine $> 0.95$ but dHash $> 15$) are particularly informative.
Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions.
Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting.
Some may use signing pads or templates that further constrain variability without constituting image-level reproduction.
The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence.
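
The partition described above can be sketched as a simple banding rule over the two descriptors. The dHash band edges used here (8 and 15) follow the dual-rule threshold and the style-consistency cutoff quoted in the text; the exact banding behind the 41.2% / 51.7% / 7.2% split is an assumption of this sketch, not a statement of the paper's implementation.

```python
def classify_pair(cos_sim, dhash_dist,
                  cos_thresh=0.95, dh_strong=8, dh_weak=15):
    """Partition high-cosine pairs by degree of structural corroboration.
    Band edges are illustrative, taken from thresholds quoted in the text."""
    if cos_sim <= cos_thresh:
        return "below cosine threshold"
    if dhash_dist <= dh_strong:
        return "high-confidence replication"
    if dhash_dist <= dh_weak:
        return "moderate structural similarity"
    return "high style consistency"
```

A cosine-only classifier collapses the last three bands into one; the banding makes the style-replication gap explicit.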

## E. Value of a Replication-Dominated Calibration Group

The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset.

This calibration strategy has broader applicability beyond signature analysis.
Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal, so that non-parametric or mixture-based thresholds are preferred over parametric alternatives.
The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates interview-acknowledged heterogeneity, and yields classification rates that are internally consistent with the data.

## F. Pixel-Identity as Annotation-Free Ground Truth

A further methodological contribution is the use of byte-level pixel identity as an annotation-free gold positive.
Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review.
In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly.
Paired with a conservative low-similarity anchor, this yields EER-style validation without the stratified manual annotation that would otherwise be the default methodological choice.
We regard this as a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself.
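
The anchor construction can be sketched as digesting each signature crop after a deterministic normalization: two same-CPA signatures with equal digests are byte-identical positives. The binarize-and-crop-to-ink-bounding-box step below is a toy stand-in for the paper's crop and normalization; the function name and ink threshold are illustrative assumptions.

```python
import hashlib
import numpy as np

def pixel_identity_key(gray, ink_thresh=128):
    """Digest of a signature crop after deterministic normalization:
    binarize, crop to the ink bounding box, then hash shape + bytes.
    Toy stand-in for the paper's crop/normalization step."""
    mask = gray < ink_thresh           # ink = dark pixels
    if not mask.any():
        return None                    # blank crop: no key
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    crop = mask[r0:r1 + 1, c0:c1 + 1].astype(np.uint8)
    h = hashlib.sha256()
    h.update(np.asarray(crop.shape, dtype=np.int64).tobytes())
    h.update(crop.tobytes())
    return h.hexdigest()
```

The same stored image pasted at different page offsets yields equal keys, while independent hand-signings essentially never do, which is what makes the anchor annotation-free.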

## G. Limitations

Several limitations should be acknowledged.

First, comprehensive per-document ground truth labels are not available.
The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), and the low-similarity anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70.
The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work.

Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning.
While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.

Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions.
In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity.
This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified.

Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements.
While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions.

Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted non-hand-signing later).
Extending the accountant-level analysis to auditor-year units is a natural next step.

Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation.
This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes.

Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.
@@ -0,0 +1,9 @@
# Impact Statement

<!-- 100-150 words. Non-specialist readable. No jargon. Specific, not vague. -->

Auditor signatures on financial reports are a key safeguard of corporate accountability.
When the signature on an audit report is produced by reproducing a stored image instead of by the partner's own hand---whether through an administrative stamping workflow or a firm-level electronic signing system---this safeguard is weakened, yet detecting the practice through manual inspection is infeasible at the scale of modern financial markets.
We developed an artificial intelligence system that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan.
By combining deep-learning visual features with perceptual hashing and three statistically independent threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time.
After further validation, the technology could support financial regulators in automating signature-authenticity screening at national scale.
@@ -0,0 +1,86 @@
|
|||||||
|
# I. Introduction

<!-- Target: ~1.5 pages double-column IEEE format. Double-blind: no author/institution info. -->

Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection.
In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require that certifying CPAs affix their signature or seal (簽名或蓋章) to each audit report [1].
While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].

The digitization of financial reporting has introduced a practice that complicates this intent.
As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one.
This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step.
From the perspective of the output image, the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise.
We refer to signatures produced by either workflow collectively as *non-hand-signed*.
Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement.
Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused.
This practice, while potentially widespread, is visually invisible to report users and virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of image reproduction.

The distinction between *non-hand-signing detection* and *signature forgery detection* is both conceptually and technically important.
The extensive body of research on offline signature verification [3]--[8] has focused almost exclusively on forgery detection---determining whether a questioned signature was produced by its purported author or by an impostor.
This framing presupposes that the central threat is identity fraud.
In our context, identity is not in question; the CPA is indeed the legitimate signer.
The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports.
This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction, requiring an analytical framework focused on detecting abnormally high similarity across documents.

A secondary methodological concern shapes the research design.
Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification.
Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference.
A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, $F_1$, and equal-error-rate metrics that prevail in the biometric-verification literature.

Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards.
Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image.
Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures, where legitimate visual similarity between a signer's authentic signatures is expected and must be distinguished from image reproduction.
Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis.
From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection.

In this paper, we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale.
Our approach processes raw PDF documents through the following stages:
(1) signature page identification using a Vision-Language Model (VLM);
(2) signature region detection using a trained YOLOv11 object detector;
(3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network;
(4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance;
(5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and
(6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm.

The dual-descriptor verification is central to our contribution.
Cosine similarity of deep feature embeddings captures high-level visual style similarity---it can identify signatures that share similar stroke patterns and spatial layouts---but cannot distinguish between a CPA who signs consistently and one whose signature is reproduced from a stored image.
Perceptual hashing (specifically, difference hashing) encodes structural-level image gradients into compact binary fingerprints that are robust to scan noise but sensitive to substantive content differences.
By requiring convergent evidence from both descriptors, we can differentiate *style consistency* (high cosine but divergent dHash) from *image reproduction* (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone.

A second distinctive feature is our framing of the calibration reference.
One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing.
Structured interviews with multiple Firm A partners confirm that *most* certifying partners produce their audit-report signatures by reproducing a stored image, while leaving open the possibility that a *minority* continue to hand-sign some reports.
We therefore treat Firm A as a *replication-dominated* calibration reference rather than a pure positive class.
This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode.
Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence between the interview evidence and the statistical results.

A third distinctive feature is our unit-of-analysis treatment.
Our three-method convergent framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are cleanly trimodal (BIC-best $K = 3$).
The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, but *individual signing behavior* is close to discrete---a given CPA is either a consistent user of non-hand-signing or a consistent hand-signer.
The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level.

We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants.
To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature.

The contributions of this paper are summarized as follows:

1. **Problem formulation.** We formally define non-hand-signing detection as distinct from signature forgery detection and argue that it requires an analytical framework focused on intra-signer similarity distributions rather than genuine-versus-forged classification.

2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity computation, with automated inference requiring no manual intervention after initial training and annotation.

3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures.

4. **Three-method convergent threshold framework.** We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence.

5. **Continuous-quality / discrete-behavior finding.** We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates are cleanly trimodal---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics.

6. **Replication-dominated calibration methodology.** We introduce a calibration strategy using a known-majority-positive reference group, distinguishing *replication-dominated* from *replication-pure* anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.

7. **Large-scale empirical analysis.** We report findings from the analysis of over 90,000 audit reports spanning a decade, providing the first large-scale empirical evidence on non-hand-signing practices in financial reporting under a methodology designed for peer-review defensibility.

The remainder of this paper is organized as follows.
Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination.
Section III describes the proposed methodology.
Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study.
Section V discusses the implications and limitations of our findings.
Section VI concludes with directions for future work.

# III. Methodology

## A. Pipeline Overview

We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
Fig. 1 illustrates the overall architecture.
The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor.

Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
From the perspective of the output image, the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.

<!--
[Figure 1: Pipeline Architecture - clean vector diagram]
90,282 PDFs → VLM Pre-screening → 86,072 PDFs
→ YOLOv11 Detection → 182,328 signatures
→ ResNet-50 Features → 2048-dim embeddings
→ Dual-Method Verification (Cosine + dHash)
→ Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification
→ Pixel-identity + Firm A + Accountant-level GMM validation
-->

## B. Data Collection

The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023.
The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings.
An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period.
Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.

CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry covering 758 unique CPAs; the corpus spans 15 document types, with the majority (86.4%) being standard audit reports.
Table I summarizes the dataset composition.

<!-- TABLE I: Dataset Summary
| Attribute | Value |
|-----------|-------|
| Total PDF documents | 90,282 |
| Date range | 2013–2023 |
| Documents with signatures | 86,072 (95.4%) |
| Unique CPAs identified | 758 |
| Accounting firms | >50 |
-->

## C. Signature Page Identification

To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism.
Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature.
The model was configured with temperature 0 for deterministic output.

The scanning range was restricted to the first quarter of each document's page count, reflecting the regulatory structure of Taiwanese audit reports, in which the auditor's report page is consistently located in the first quarter of the document.
Scanning terminated upon the first positive detection.
This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded.
A further 12 corrupted PDFs could not be processed at all, accounting for the remainder of the corpus (86,072 + 4,198 + 12 = 90,282); the final analysis set therefore comprises 86,072 documents.
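The quartile-restricted, stop-at-first-positive scan is simple enough to state as code. The sketch below abstracts the actual Qwen2.5-VL call behind a `page_has_signature` callback, which is a hypothetical stub rather than the paper's implementation:

```python
import math

def find_signature_page(num_pages, page_has_signature):
    """Scan only the first quarter of a document's pages, stopping at the
    first page the screening model flags as containing a signature.

    `page_has_signature(page_index)` abstracts the VLM call (stub here);
    it should return True/False for one rendered page image.
    """
    limit = max(1, math.ceil(num_pages / 4))  # first-quartile restriction
    for page in range(limit):
        if page_has_signature(page):
            return page  # early termination on the first positive
    return None  # document classified as having no signature page

# Example with a stub predicate: signature appears on page 2 of a 40-page PDF.
hit = find_signature_page(40, lambda p: p == 2)  # -> 2
```

Because scanning stops at the first positive, the expected number of VLM calls per document is far below the quartile cap, which is what makes national-scale screening tractable.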
Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents, establishing an upper bound on the VLM false-positive rate of 1.2%.
## D. Signature Detection

We adopted YOLOv11n (nano variant) [25] for signature region localization.
A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction.
A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.

The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
<!-- TABLE II: YOLO Detection Performance
| Metric | Value |
|--------|-------|
| Precision | 0.97–0.98 |
| Recall | 0.95–0.98 |
| mAP@0.50 | 0.98–0.99 |
| mAP@0.50:0.95 | 0.85–0.90 |
-->

Batch inference on all 86,072 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers).
A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
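The stamp-removal step can be sketched as follows. The pipeline thresholds in HSV color space; this dependency-free illustration approximates the same mask with an RGB red-dominance test, which is our simplification, not the paper's exact filter:

```python
import numpy as np

def remove_red_stamp(rgb: np.ndarray) -> np.ndarray:
    """Replace stamp-like red pixels with white, keeping dark ink strokes.

    The production pipeline filters in HSV space; this sketch approximates
    the same mask with an RGB red-dominance test. `rgb` is (H, W, 3) uint8.
    """
    r = rgb[..., 0].astype(np.int16)
    g = rgb[..., 1].astype(np.int16)
    b = rgb[..., 2].astype(np.int16)
    # Saturated red: bright red channel that clearly dominates green and blue.
    red_mask = (r > 120) & (r - g > 60) & (r - b > 60)
    out = rgb.copy()
    out[red_mask] = 255  # paint stamp regions white
    return out
```

Black or dark-gray handwriting fails the dominance test and survives, while the near-pure red of official seals is painted out, isolating the handwritten content for feature extraction.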
Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
## E. Feature Extraction

Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning.
The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.

Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
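The preprocessing just described can be sketched in plain NumPy. Nearest-neighbor resampling is used here only for self-containment (the paper does not state its interpolation mode); the channel statistics are the standard ImageNet values used by torchvision's pretrained weights:

```python
import numpy as np

# Standard ImageNet channel statistics (as used by torchvision's ResNet-50).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def pad_and_resize(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Pad an (H, W, 3) image to a white square, then resize to size x size.

    Nearest-neighbor resampling keeps this sketch dependency-free; a real
    pipeline would typically use bilinear interpolation instead.
    """
    h, w = img.shape[:2]
    s = max(h, w)
    canvas = np.full((s, s, 3), 255, dtype=img.dtype)  # white padding
    y0, x0 = (s - h) // 2, (s - w) // 2
    canvas[y0:y0 + h, x0:x0 + w] = img
    idx = np.arange(size) * s // size  # nearest-neighbor index map
    return canvas[idx][:, idx]

def normalize(img224: np.ndarray) -> np.ndarray:
    """Scale to [0, 1] and apply ImageNet channel normalization."""
    x = img224.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```

The normalized tensor is then passed through ResNet-50 with the classification head removed; L2-normalizing the resulting 2048-dimensional vector makes cosine similarity a plain dot product, as noted above.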
The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D).
This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
## F. Dual-Method Similarity Descriptors

For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:

**Cosine similarity on deep embeddings** captures high-level visual style:
$$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$
where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized 2048-dim feature vectors.
Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].
**Perceptual hash distance (dHash)** captures structural-level similarity.
Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint.
The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images.
Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
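A minimal dHash sketch, using nearest-neighbor shrinking as a self-contained stand-in for proper area resampling (an assumption of this sketch, not the exact resampling used in the pipeline):

```python
import numpy as np

def dhash_bits(gray: np.ndarray) -> np.ndarray:
    """Compute a 64-bit difference hash of a 2-D grayscale image.

    The image is shrunk to 9 columns x 8 rows (nearest-neighbor here, as a
    stand-in for proper area resampling); each bit records whether intensity
    increases between horizontally adjacent pixels.
    """
    h, w = gray.shape
    rows = np.arange(8) * h // 8
    cols = np.arange(9) * w // 9
    small = gray[rows][:, cols].astype(np.int16)
    bits = small[:, 1:] > small[:, :-1]  # 8 x 8 horizontal-gradient bits
    return bits.ravel()

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    """Hamming distance between two 64-bit fingerprints (0 = identical)."""
    return int(np.count_nonzero(h1 != h2))
```

Because only relative intensities between neighbors matter, a pixel-identical reproduction hashes to distance 0 even under a uniform brightness shift, whereas a re-executed signature perturbs enough local gradients to move the distance away from 0.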
These descriptors provide partially independent evidence.
Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise.
Non-hand-signing yields extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise.
Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
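The resulting decision logic reduces to a conjunction plus a disagreement flag. The default cutoffs below are the accountant-level crossings reported later in the paper (cosine 0.945, dHash 8.10) and are used here for illustration only:

```python
import numpy as np

def classify_dual(max_cos, min_dh, cos_thr=0.945, dh_thr=8.10):
    """Dual-descriptor decision: non-hand-signed only when BOTH descriptors
    indicate reproduction; disagreement between them flags a borderline case.

    Threshold defaults are the accountant-level crossings from the paper,
    shown here purely for illustration.
    """
    max_cos = np.asarray(max_cos, dtype=float)
    min_dh = np.asarray(min_dh, dtype=float)
    cos_hit = max_cos >= cos_thr   # deep-feature evidence of reproduction
    dh_hit = min_dh <= dh_thr      # structural evidence of reproduction
    return np.where(cos_hit & dh_hit, "non-hand-signed",
                    np.where(cos_hit ^ dh_hit, "borderline", "hand-signed"))
```

The exclusive-or branch is what operationalizes the style-consistency ambiguity: a consistently hand-signing CPA can reach high cosine without a matching low dHash, and lands in the borderline band rather than the positive class.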
We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content.
Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle.
## G. Unit of Analysis and Summary Statistics

Two unit-of-analysis choices are relevant for this study: (i) the *signature*---one signature image extracted from one report---and (ii) the *accountant*---the collection of all signatures attributed to a single CPA across the sample period.
A third composite unit---the *auditor-year*, i.e., all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution.

For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA.
The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism.
Mean statistics would dilute this signal.
For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA.
These accountant-level aggregates are the input to the mixture model described in Section III-I.
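With one CPA's L2-normalized embeddings stacked row-wise and dHash distances precomputed as a pairwise matrix (an assumption of this sketch), both the per-signature statistics and the accountant-level aggregate reduce to a few array operations:

```python
import numpy as np

def per_signature_stats(feats: np.ndarray, dh: np.ndarray):
    """Per-signature max cosine and min dHash distance within one CPA.

    feats: (n, d) L2-normalized embeddings for one CPA's n signatures.
    dh:    (n, n) pairwise dHash Hamming distances (assumed precomputed).
    """
    sims = feats @ feats.T              # cosine = dot product after L2-norm
    np.fill_diagonal(sims, -np.inf)     # exclude self-matches
    max_cos = sims.max(axis=1)          # best match per signature
    dist = dh.astype(float).copy()
    np.fill_diagonal(dist, np.inf)
    min_dh = dist.min(axis=1)           # closest structural match
    return max_cos, min_dh

def accountant_aggregate(max_cos, min_dh):
    """Accountant-level aggregates feeding the mixture model (Sec. III-I)."""
    return float(np.mean(max_cos)), float(np.mean(min_dh))
```

Masking the diagonal with −∞/+∞ implements the "every *other* signature" comparison exactly, so a single pixel-level duplicate pair is enough to push both extremes of that signature to their limits.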
## H. Calibration Reference: Firm A as a Replication-Dominated Population

A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference.
Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring *replication-dominated population*: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.
This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration.

First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems.
Crucially, the same interview evidence does *not* exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports.
Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners.
Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.
We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline.
The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.

## I. Three-Method Convergent Threshold Determination

Direct assignment of thresholds based on prior intuition (e.g., cosine $\geq 0.95$ for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others.
To place threshold selection on a statistically principled and data-driven footing, we apply *three independent* methods whose underlying assumptions decrease in strength.
When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement is itself a diagnostic of distributional structure.

### 1) Method 1: KDE + Antimode with Bimodality Check

We fit a Gaussian kernel density estimate to the similarity distribution using Scott's rule for bandwidth selection [28].
A candidate threshold is taken at the location of the local density minimum (antimode) between modes of the fitted density.
Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify stability.
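A dependency-light sketch of the antimode search, with the Scott's-rule Gaussian KDE written out directly (the dip test itself would be run separately, e.g. with the third-party `diptest` package, which we do not assume here):

```python
import numpy as np

def gaussian_kde(x, grid, bw=None):
    """Evaluate a Gaussian KDE on `grid`; Scott's-rule bandwidth by default."""
    x = np.asarray(x, dtype=float)
    if bw is None:
        bw = x.std(ddof=1) * len(x) ** (-1 / 5)  # Scott's rule, 1-D
    z = (grid[:, None] - x[None, :]) / bw
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x) * bw * np.sqrt(2 * np.pi))

def kde_antimode(x, n_grid=512):
    """Density minimum between the two highest KDE modes, or None if the
    fitted density has fewer than two local maxima (antimode undefined)."""
    x = np.asarray(x, dtype=float)
    grid = np.linspace(x.min(), x.max(), n_grid)
    d = gaussian_kde(x, grid)
    peaks = [i for i in range(1, n_grid - 1)
             if d[i] > d[i - 1] and d[i] > d[i + 1]]
    if len(peaks) < 2:
        return None  # unimodal: defer to the dip test's verdict
    lo, hi = sorted(sorted(peaks, key=lambda i: d[i])[-2:])
    return grid[lo + int(np.argmin(d[lo:hi + 1]))]
```

Returning `None` in the unimodal case mirrors the methodological point above: the antimode is only reported when the bimodality check licenses it, and the bandwidth sensitivity analysis simply reruns `kde_antimode` with `bw` scaled by ±50%.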
### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity

We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39].
We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin $i$ with count $n_i$, the standardized deviation of $n_i$ from its smooth-null expectation (the average of the two neighbouring bin counts),

$$Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},$$
which is approximately $N(0,1)$ under the null of distributional smoothness.
A threshold is identified at the transition where $Z_{i-1}$ is significantly negative (observed count below expectation) and the adjacent $Z_i$ is significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition $Z^- \rightarrow Z^+$ marks the candidate decision boundary.
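The bin statistic can be transcribed directly from the formula above (bin counts in, standardized deviations for the interior bins out):

```python
import numpy as np

def bd_mccrary_z(counts):
    """Burgstahler-Dichev standardized bin statistics.

    counts: histogram bin counts. Returns Z_i for each interior bin;
    approximately N(0,1) under the null of a locally smooth density.
    """
    n = np.asarray(counts, dtype=float)
    N = n.sum()
    p = n / N
    num = n[1:-1] - 0.5 * (n[:-2] + n[2:])          # deviation from neighbours
    var = (N * p[1:-1] * (1 - p[1:-1])
           + 0.25 * N * (p[:-2] + p[2:]) * (1 - p[:-2] - p[2:]))
    return num / np.sqrt(var)
```

A candidate boundary is then read off as the bin index where the Z sequence flips from significantly negative to significantly positive.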
### 3) Method 3: Finite Mixture Model via EM

We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data).
The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread).
Under the fitted model the threshold is the crossing point of the two weighted component densities,
$$\pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2),$$
solved numerically via bracketed root-finding.
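A compact, SciPy-free sketch of the EM fit with method-of-moments M-steps and the weighted-density crossing; the median-split initialization, the small numerical guards, and the grid-bracketed root search are illustration choices of this sketch, not the paper's exact settings:

```python
import numpy as np
from math import lgamma, exp

def beta_pdf(x, a, b):
    """Beta(a, b) density, built from log-gamma for self-containment."""
    c = exp(lgamma(a + b) - lgamma(a) - lgamma(b))
    return c * np.power(x, a - 1) * np.power(1 - x, b - 1)

def mom_params(m, v):
    """Method-of-moments Beta parameters from a (weighted) mean/variance."""
    t = max(m * (1 - m) / max(v, 1e-8) - 1, 1e-3)
    return m * t, (1 - m) * t

def fit_beta_mixture(x, iters=200):
    """Two-component Beta mixture via EM; component 1 is the high mode."""
    x = np.clip(np.asarray(x, dtype=float), 1e-4, 1 - 1e-4)
    hi = x >= np.median(x)                      # crude initial split
    pi = hi.mean()
    a1, b1 = mom_params(x[hi].mean(), x[hi].var() + 1e-8)
    a2, b2 = mom_params(x[~hi].mean(), x[~hi].var() + 1e-8)
    for _ in range(iters):
        p1 = pi * beta_pdf(x, a1, b1)
        p2 = (1 - pi) * beta_pdf(x, a2, b2)
        r = p1 / (p1 + p2 + 1e-300)             # E-step responsibilities
        pi = r.mean()
        m1 = (r * x).sum() / r.sum()            # M-step: weighted moments
        v1 = (r * (x - m1) ** 2).sum() / r.sum()
        a1, b1 = mom_params(m1, v1)
        w = 1 - r
        m2 = (w * x).sum() / w.sum()
        v2 = (w * (x - m2) ** 2).sum() / w.sum()
        a2, b2 = mom_params(m2, v2)
    return pi, (a1, b1), (a2, b2)

def crossing_threshold(pi, p1, p2, n_grid=4000):
    """First point between the component means where the weighted high-mode
    density overtakes the low-mode density (bracketed grid search)."""
    m_lo = p2[0] / (p2[0] + p2[1])
    m_hi = p1[0] / (p1[0] + p1[1])
    grid = np.linspace(m_lo, m_hi, n_grid)
    d = pi * beta_pdf(grid, *p1) - (1 - pi) * beta_pdf(grid, *p2)
    above = np.where(d > 0)[0]
    return float(grid[above[0]]) if above.size else None
```

The parallel logit-Gaussian robustness check mentioned below is structurally identical: transform the data with the logit, run the same EM with Gaussian component densities, and map the crossing back through the inverse logit.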
|
||||||
|
As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the *logit-transformed* similarity, following standard practice for bounded proportion data; White's [41] quasi-MLE consistency result guarantees asymptotic recovery of the best Beta-family approximation to the true distribution even if the true component densities are not exactly Beta.
|
||||||
|
|
||||||
|
We fit 2- and 3-component variants of each mixture and report BIC for model selection.
|
||||||
|
When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.
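The EM-with-moments recipe and the density-crossing root-find can be sketched as follows. This is a simplified illustration with invented names, not the paper's code; bracketing the root between the two component means is an assumption that holds when the components are well separated.

```python
# Sketch of Method 3: two-component Beta mixture via EM with
# method-of-moments M-steps, plus the weighted-density crossing via brentq.
import numpy as np
from scipy.stats import beta
from scipy.optimize import brentq

def mom_beta(x, w):
    """Weighted method-of-moments estimates of Beta(a, b)."""
    m = np.average(x, weights=w)
    v = np.average((x - m) ** 2, weights=w)
    common = m * (1 - m) / v - 1
    return max(m * common, 1e-3), max((1 - m) * common, 1e-3)

def fit_beta_mixture(x, n_iter=200):
    hi = x > np.median(x)                       # crude split to initialize
    pi1 = 0.5
    a1, b1 = mom_beta(x[hi], np.ones(hi.sum()))
    a2, b2 = mom_beta(x[~hi], np.ones((~hi).sum()))
    for _ in range(n_iter):
        d1 = pi1 * beta.pdf(x, a1, b1)
        d2 = (1 - pi1) * beta.pdf(x, a2, b2)
        r = d1 / (d1 + d2)                      # E-step: responsibilities
        pi1 = r.mean()
        a1, b1 = mom_beta(x, r)                 # M-step: weighted moments
        a2, b2 = mom_beta(x, 1 - r)
    return pi1, (a1, b1), (a2, b2)

def crossing(pi1, p1, p2):
    f = lambda t: pi1 * beta.pdf(t, *p1) - (1 - pi1) * beta.pdf(t, *p2)
    m1, m2 = p1[0] / sum(p1), p2[0] / sum(p2)   # bracket between component means
    return brentq(f, min(m1, m2), max(m1, m2))
```

The logit-GMM robustness check follows the same pattern with Gaussian components on logit-transformed similarities.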

### 4) Convergent Validation and Level-Shift Diagnostic

The three methods rest on assumptions of differing strength: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification.
If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.

Equally informative is the *level at which the three methods agree*.
Applied to the per-signature cosine distribution the three methods may converge on a single boundary only weakly or not at all---because, as our results show, per-signature similarity is not a cleanly bimodal population.
Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum.
We therefore explicitly analyze both levels and interpret their divergence as a substantive finding (Section V) rather than a statistical nuisance.

## J. Accountant-Level Mixture Model

In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *behavior* is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level *quality* lies on a continuous spectrum.

We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
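The BIC-based selection step can be sketched with scikit-learn (an assumption; the paper does not name its library, and the helper name is illustrative):

```python
# Sketch of the accountant-level 2-D Gaussian mixture with BIC selection
# over K in {1..5}, full covariance, 15 random initializations per K.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm(features, k_range=(1, 2, 3, 4, 5), n_init=15, seed=0):
    """features: (n_accountants, 2) array of (mean cosine, mean min dHash)."""
    fits = {k: GaussianMixture(n_components=k, covariance_type="full",
                               n_init=n_init, random_state=seed).fit(features)
            for k in k_range}
    bic = {k: m.bic(features) for k, m in fits.items()}
    k_star = min(bic, key=bic.get)
    return k_star, fits[k_star], bic
```

Component means, weights, and per-component composition then come from the selected model's `means_` and `weights_` attributes.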

## K. Pixel-Identity and Firm A Validation (No Manual Annotation)

Rather than construct a stratified manual-annotation validation set, we validate the classifier using three naturally occurring reference populations that require no human labeling:

1. **Pixel-identical anchor (gold positive):** signatures whose nearest same-CPA match is byte-identical after crop and normalization.
Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing.

2. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above.
Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity.
3. **Low-similarity anchor (gold negative):** signatures whose maximum same-CPA cosine similarity is below a conservative cutoff ($0.70$) that cannot plausibly arise from pixel-level duplication.
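The pixel-identity anchor (item 1) reduces to exact hashing of the normalized crops; a minimal sketch, with illustrative names and array-valued images:

```python
# Sketch of pixel-identity anchor detection: group normalized signature
# arrays by exact byte digest within each CPA; groups of size > 1 are
# byte-identical and serve as gold positives.
import hashlib
import numpy as np

def pixel_identity_groups(images, cpa_ids):
    buckets = {}
    for idx, (img, cpa) in enumerate(zip(images, cpa_ids)):
        digest = hashlib.sha256(np.ascontiguousarray(img).tobytes()).hexdigest()
        buckets.setdefault((cpa, img.shape, digest), []).append(idx)
    return [grp for grp in buckets.values() if len(grp) > 1]
```

Keying on the CPA identifier keeps the anchor within same-CPA comparisons, matching the anchor's definition.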

From these anchors we report Equal Error Rate (EER), precision, recall, $F_1$, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3].
We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
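The FAR/FRR/EER computation over the anchor populations is standard; a sketch with illustrative names, where label 1 marks the positive (replication) anchor and 0 the gold-negative anchor:

```python
# Sketch of FAR/FRR and EER over anchor scores, following
# biometric-verification conventions: FRR = positives below the cut,
# FAR = negatives at or above it; EER is where the two rates meet.
import numpy as np

def far_frr(scores, labels, threshold):
    pos, neg = scores[labels == 1], scores[labels == 0]
    frr = np.mean(pos < threshold)
    far = np.mean(neg >= threshold)
    return far, frr

def eer(scores, labels):
    grid = np.unique(scores)
    rates = [far_frr(scores, labels, t) for t in grid]
    i = int(np.argmin([abs(fa - fr) for fa, fr in rates]))
    return float(grid[i]), rates[i]
```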

## L. Per-Document Classification

The final per-document classification combines the three-method thresholds with the dual-descriptor framework.
Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution:

1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq 5$.
Both descriptors converge on strong replication evidence consistent with Firm A's median behavior.

2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash in $[6, 15]$.
Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations.

3. **High style consistency:** Cosine $> 0.95$ AND dHash $> 15$.
High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.

4. **Uncertain:** Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.

5. **Likely hand-signed:** Cosine below the KDE crossover threshold.

The dHash thresholds ($\leq 5$ and $\leq 15$) are directly derived from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.
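The five-way rule set above can be written as a single decision function; the cutoffs (0.95, dHash 5 and 15, KDE crossover 0.837) are the paper's values, while the function and label names are illustrative:

```python
# Sketch of the per-document five-category rule (Section III.L).
def classify_document(cosine, dhash, kde_crossover=0.837):
    if cosine > 0.95:
        if dhash <= 5:
            return "high-confidence non-hand-signed"
        if dhash <= 15:
            return "moderate-confidence non-hand-signed"
        return "high style consistency"          # cosine high, dHash > 15
    if cosine >= kde_crossover:
        return "uncertain"                       # between crossover and 0.95
    return "likely hand-signed"                  # below the KDE crossover
```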

# References

<!-- IEEE numbered style, sequential by first appearance in text. v3 adds statistical-method refs (37–41). -->

[1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067

[2] S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," *Res. Account. Regul.*, vol. 25, no. 2, pp. 230–235, 2013.

[3] J. Bromley et al., "Signature verification using a Siamese time delay neural network," in *Proc. NeurIPS*, 1993.

[4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.

[5] I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," *Appl. Sci.*, vol. 10, no. 11, p. 3716, 2020.

[6] H. Li et al., "TransOSV: Offline signature verification with transformers," *Pattern Recognit.*, vol. 145, p. 109882, 2024.

[7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," *Mathematics*, vol. 12, no. 17, p. 2757, 2024.

[8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024.

[9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021.

[10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," in *Proc. Electronic Imaging*, 2016.

[11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," *Multimedia Tools Appl.*, 2024.

[12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," *Inf. Process. Manage.*, p. 104086, 2025.

[13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," in *Proc. CVPR*, 2022.

[14] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," *Pattern Recognit.*, vol. 70, pp. 163–176, 2017.

[15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," *IEEE Trans. Inf. Forensics Security*, vol. 19, pp. 1342–1356, 2024.

[16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," *IEEE Trans. Inf. Forensics Security*, vol. 15, pp. 1735–1745, 2019.

[17] H. Farid, "Image forgery detection," *IEEE Signal Process. Mag.*, vol. 26, no. 2, pp. 16–25, 2009.

[18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," *Pattern Recognit.*, vol. 144, art. no. 109778, 2023.

[19] J. Luo et al., "A survey of perceptual hashing for multimedia," *ACM Trans. Multimedia Comput. Commun. Appl.*, vol. 21, no. 7, 2025.

[20] D. Engin et al., "Offline signature verification on real-world documents," in *Proc. CVPRW*, 2020.

[21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," *Expert Syst. Appl.*, 2022.

[22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification," *Procedia Comput. Sci.*, vol. 270, 2025.

[23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in *Proc. ECCV*, 2014, pp. 584–599.

[24] Qwen2.5-VL Technical Report, Alibaba Group, 2025.

[25] Ultralytics, "YOLOv11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/

[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. CVPR*, 2016.

[27] N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013. [Online]. Available: https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html

[28] B. W. Silverman, *Density Estimation for Statistics and Data Analysis*. London: Chapman & Hall, 1986.

[29] J. Cohen, *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.

[30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," *IEEE Trans. Image Process.*, vol. 13, no. 4, pp. 600–612, 2004.

[31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," *The Accounting Review*, vol. 88, no. 5, pp. 1511–1546, 2013.

[32] A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," *Int. J. Auditing*, vol. 18, no. 3, pp. 172–192, 2014.

[33] W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," *Contemp. Account. Res.*, vol. 26, no. 2, pp. 359–391, 2009.

[34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in *Proc. CVPR*, 2016, pp. 779–788.

[35] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 46, no. 8, pp. 5625–5644, 2024.

[36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," *Ann. Math. Statist.*, vol. 18, no. 1, pp. 50–60, 1947.

[37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," *Ann. Statist.*, vol. 13, no. 1, pp. 70–84, 1985.

[38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," *J. Account. Econ.*, vol. 24, no. 1, pp. 99–126, 1997.

[39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," *J. Econometrics*, vol. 142, no. 2, pp. 698–714, 2008.

[40] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," *J. R. Statist. Soc. B*, vol. 39, no. 1, pp. 1–38, 1977.

[41] H. White, "Maximum likelihood estimation of misspecified models," *Econometrica*, vol. 50, no. 1, pp. 1–25, 1982.

<!-- Total: 41 references (v2: 36 + 5 new statistical methods refs) -->

# II. Related Work

## A. Offline Signature Verification

Offline signature verification---determining whether a static signature image is genuine or forged---has been studied extensively using deep learning.
Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant.
Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work.
Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining.
Hadjadj et al. [5] addressed the practical constraint of limited reference samples, achieving competitive verification accuracy using only a single known genuine signature per writer.
More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results.
Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives.
Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer.
Hafemann et al. [16] further addressed the practical challenge of adapting to new users through meta-learning, reducing the enrollment burden for signature verification systems.

A common thread in this literature is the assumption that the primary threat is *identity fraud*: a forger attempting to produce a convincing imitation of another person's signature.
Our work addresses a fundamentally different problem---detecting whether the *legitimate signer's* stored signature image has been reproduced across many documents---which requires analyzing the upper tail of the intra-signer similarity distribution rather than modeling inter-signer discriminability.

Brimoh and Olisah [8] proposed a consensus-threshold approach that derives classification boundaries from known genuine reference pairs, the methodology most closely related to our calibration strategy.
However, their method operates on standard verification benchmarks with laboratory-collected signatures, whereas our approach applies threshold calibration using a replication-dominated subpopulation identified through domain expertise in real-world regulatory documents.

## B. Document Forensics and Copy Detection

Image forensics encompasses a broad range of techniques for detecting manipulated visual content [17], with recent surveys highlighting the growing role of deep learning in forgery detection [18].
Copy-move forgery detection (CMFD) identifies duplicated regions within or across images, typically targeting manipulated photographs [11].
Abramova and Böhme [10] adapted block-based CMFD to scanned text documents, noting that standard methods perform poorly in this domain because legitimate character repetitions produce high similarity scores that confound duplicate detection.

Woodruff et al. [9] developed the work most closely related to ours: a fully automated pipeline for extracting and analyzing signatures from corporate filings in the context of anti-money-laundering investigations.
Their system uses connected component analysis for signature detection, GANs for noise removal, and Siamese networks for author clustering.
While their pipeline shares our goal of large-scale automated signature analysis on real regulatory documents, their objective---grouping signatures by authorship---differs fundamentally from ours, which is detecting image-level reproduction within a single author's signatures across documents.

In the domain of image copy detection, Pizzi et al. [13] proposed SSCD, a self-supervised descriptor using ResNet-50 with contrastive learning for large-scale copy detection on natural images.
Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach.

## C. Perceptual Hashing

Perceptual hashing algorithms generate compact fingerprints that are robust to minor image transformations while remaining sensitive to substantive content changes [19].
Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process.
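For intuition, a minimal difference hash (dHash) fits in a few lines. The sketch below follows the common 8×9 block recipe [27] but substitutes a crude pure-NumPy block-mean downsampling for a real image resize; the names are illustrative.

```python
# Minimal dHash sketch: 64-bit difference hash comparing horizontally
# adjacent cells of a block-mean-downsampled grayscale image.
import numpy as np

def dhash_bits(gray, hash_size=8):
    h, w = gray.shape
    rows = np.array_split(np.arange(h), hash_size)
    cols = np.array_split(np.arange(w), hash_size + 1)
    small = np.array([[gray[np.ix_(r, c)].mean() for c in cols] for r in rows])
    return (small[:, 1:] > small[:, :-1]).flatten()  # 8 x 8 = 64 bits

def hamming(bits_a, bits_b):
    return int(np.sum(bits_a != bits_b))
```

Small scan perturbations move only a few of the 64 bits, which is why Hamming distance on these bits tolerates scanning noise while remaining sensitive to structural change.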

Jakhar and Borah [12] demonstrated that combining perceptual hashing with deep learning features significantly outperforms either approach alone for near-duplicate image detection, achieving AUROC of 0.99 on standard benchmarks.
Their two-stage architecture---pHash for fast structural comparison followed by deep features for semantic verification---provides methodological precedent for our dual-descriptor approach, though applied to natural images rather than document signatures.

Our work differs from prior perceptual-hashing studies in its application context and in the specific challenge it addresses: distinguishing legitimate high visual consistency (a careful signer producing similar-looking signatures) from image-level reproduction in scanned financial documents.

## D. Deep Feature Extraction for Signature Analysis

Several studies have explored pre-trained CNN features for signature comparison without metric learning or Siamese architectures.
Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach.
Tsourounis et al. [21] demonstrated successful transfer from handwritten text recognition to signature verification, showing that CNN features trained on related but distinct handwriting tasks generalize effectively to signature comparison.
Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach.

Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach.
These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical.

## E. Statistical Methods for Threshold Determination

Our threshold-determination framework combines three families of methods developed in statistics and accounting econometrics.

*Non-parametric density estimation.*
Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions.
Where the distribution is bimodal, the local density minimum (antimode) between the two modes is the Bayes-optimal decision boundary under equal priors.
The statistical validity of the bimodality itself can be tested independently via the Hartigan & Hartigan dip test [37], which we use as a formal bimodality diagnostic.

*Discontinuity tests on empirical distributions.*
Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions.
Under the null that the distribution is generated by a single smooth process, the expected count in any histogram bin equals the average of its two neighbours, and the standardized deviation from this expectation is approximately $N(0,1)$.
The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis.
The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions.

*Finite mixture models.*
When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters.
For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation.
Under mild regularity conditions, White's quasi-MLE result [41] shows that the maximum-likelihood estimator of a misspecified model converges to the parameter value minimizing the Kullback-Leibler divergence from the true distribution; the fitted mixture therefore recovers the best Beta-family approximation even when the true distribution is not exactly Beta.

The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported.

<!--
REFERENCES for Related Work (see paper_a_references_v3.md for full list):
[3] Bromley et al. 1993 — Siamese TDNN (NeurIPS)
[4] Dey et al. 2017 — SigNet
[5] Hadjadj et al. 2020 — Single sample SV
[6] Li et al. 2024 — TransOSV
[7] Tehsin et al. 2024 — Triplet Siamese
[8] Brimoh & Olisah 2024 — Consensus threshold
[9] Woodruff et al. 2021 — AML signature pipeline
[10] Abramova & Böhme 2016 — CMFD in scanned docs
[11] Copy-move forgery detection survey — MTAP 2024
[12] Jakhar & Borah 2025 — pHash + DL
[13] Pizzi et al. 2022 — SSCD
[14] Hafemann et al. 2017 — CNN features for SV
[15] Zois et al. 2024 — SPD manifold SV
[16] Hafemann et al. 2019 — Meta-learning for SV
[17] Farid 2009 — Image forgery detection survey
[18] Mehrjardi et al. 2023 — DL-based image forgery detection survey
[19] Luo et al. 2025 — Perceptual hashing survey
[20] Engin et al. 2020 — ResNet + cosine on real docs
[21] Tsourounis et al. 2022 — Transfer from text to signatures
[22] Chamakh & Bounouh 2025 — ResNet18 unified SV
[23] Babenko et al. 2014 — Neural codes for image retrieval
[28] Silverman 1986 — Density estimation
[37] Hartigan & Hartigan 1985 — dip test of unimodality
[38] Burgstahler & Dichev 1997 — earnings management discontinuity
[39] McCrary 2008 — density discontinuity test
[40] Dempster, Laird & Rubin 1977 — EM algorithm
[41] White 1982 — quasi-MLE consistency
-->

# IV. Experiments and Results

## A. Experimental Setup

All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration.
Feature extraction used PyTorch 2.9 with torchvision model implementations.
The complete pipeline---from raw PDF processing through final classification---was implemented in Python.

## B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), an average of 2.14 signatures per document with detections, consistent with the standard practice of two certifying CPAs per audit report.
The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

<!-- TABLE III: Extraction Results
| Metric | Value |
|--------|-------|
| Documents processed | 86,071 |
| Documents with detections | 85,042 (98.8%) |
| Total signatures extracted | 182,328 |
| Avg. signatures per document | 2.14 |
| CPA-matched signatures | 168,755 (92.6%) |
| Processing rate | 43.1 docs/sec |
-->

## C. Signature-Level Distribution Analysis

Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs.
Table IV summarizes the distributional statistics.

<!-- TABLE IV: Cosine Similarity Distribution Statistics
| Statistic | Intra-class | Inter-class |
|-----------|-------------|-------------|
| N (pairs) | 41,352,824 | 500,000 |
| Mean | 0.821 | 0.758 |
| Std. Dev. | 0.098 | 0.090 |
| Median | 0.836 | 0.774 |
| Skewness | −0.711 | −0.851 |
| Kurtosis | 0.550 | 1.027 |
-->

Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.

The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney $p < 0.001$, K-S 2-sample $p < 0.001$).

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---so the nominal pair count greatly overstates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
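The pooled-standard-deviation form of Cohen's $d$ used above [29] is, as a quick sketch:

```python
# Pooled-SD Cohen's d for two similarity samples (standard formula).
import numpy as np

def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)
```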
|
||||||
|
|
||||||
|
## D. Hartigan Dip Test: Unimodality at the Signature Level
|
||||||
|
|
||||||
|
Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
|
||||||
|
|
||||||
|
<!-- TABLE V: Hartigan Dip Test Results
|
||||||
|
| Distribution | N | dip | p-value | Verdict (α=0.05) |
|
||||||
|
|--------------|---|-----|---------|------------------|
|
||||||
|
| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal |
|
||||||
|
| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
|
||||||
|
| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
|
||||||
|
| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
|
||||||
|
| Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal |
|
||||||
|
| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
|
||||||
|
-->
|
||||||
|
|
||||||
|
Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews.
|
||||||
|
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
|
||||||
|
At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.
|
||||||
|
|
||||||
|
This asymmetry between signature level and accountant level is itself an empirical finding.
|
||||||
|
It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
|
||||||
|
|
||||||
|
### 1) Burgstahler-Dichev / McCrary Discontinuity

Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for both Firm A and the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 in both cases.
We note that the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation from the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
In contrast, the dHash transition at distance 2 is a meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication.
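
As an illustrative sketch of the BD-style scan (not the paper's exact implementation), each interior histogram bin count is compared with the mean of its two neighbours via the Burgstahler-Dichev standardized difference [38]; the bin count and the example data are hypothetical:

```python
import numpy as np

def bd_standardized_differences(x, bins=40):
    """Burgstahler-Dichev-style discontinuity scan: compare each interior
    histogram bin count with the mean of its two neighbours; a large |z|
    marks a candidate density discontinuity (transition point)."""
    n, edges = np.histogram(x, bins=bins)
    n = n.astype(float)
    total = n.sum()
    z = np.full(len(n), np.nan)
    for i in range(1, len(n) - 1):
        expected = 0.5 * (n[i - 1] + n[i + 1])
        p_i = n[i] / total
        p_nb = (n[i - 1] + n[i + 1]) / total
        # Variance of (observed - expected) under the BD approximation
        var = total * p_i * (1 - p_i) + 0.25 * total * p_nb * (1 - p_nb)
        z[i] = (n[i] - expected) / np.sqrt(var) if var > 0 else np.nan
    return z, edges
```

The reported transition is then the bin boundary with the largest |z| in the scanned range.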
### 2) Beta Mixture at Signature Level: A Forced Fit

Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
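
As a minimal sketch of the logit-GMM robustness check, the cosine scores can be mapped through the logit and fit with a hand-rolled 1-D EM [40], comparing BIC across component counts. The quantile initialization is an implementation choice for reproducibility, not the paper's fitting code:

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def em_gmm_1d(x, k, iters=200):
    """Plain EM (Dempster-Laird-Rubin) for a 1-D Gaussian mixture."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))   # deterministic init
    var = np.full(k, np.var(x))
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        dens = (np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2.0 * np.pi * var))        # (n, k) densities
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)      # E-step
        nk = resp.sum(axis=0)                        # M-step
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = np.maximum((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk,
                         1e-9)
    ll = np.log((w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                 / np.sqrt(2.0 * np.pi * var)).sum(axis=1)).sum()
    # free parameters: k means, k variances, k-1 weights
    bic = -2.0 * ll + (3 * k - 1) * np.log(len(x))
    return mu, var, w, bic
```

Comparing `bic` across $k$ reproduces the model-selection logic; the crossing of the two fitted component densities gives the 2-component threshold when that fit is forced.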
The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.
This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual *behavior* (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks.

## E. Accountant-Level Gaussian Mixture

We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
BIC selects $K^* = 3$ (Table VI).
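
The model-selection step can be sketched with scikit-learn's `GaussianMixture`; this is an illustrative sketch of the procedure, not the paper's exact fitting code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_max=5, seed=0):
    """Fit full-covariance Gaussian mixtures for K = 1..k_max on the
    per-accountant (cos_mean, dHash_mean) points; lowest BIC wins."""
    fits = [GaussianMixture(n_components=k, covariance_type="full",
                            n_init=5, random_state=seed).fit(X)
            for k in range(1, k_max + 1)]
    bics = [m.bic(X) for m in fits]
    best = int(np.argmin(bics))
    return best + 1, fits[best], bics
```
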

<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)
| K | BIC | AIC | Converged |
|---|-----|-----|-----------|
| 1 | −316 | −339 | ✓ |
| 2 | −545 | −595 | ✓ |
| 3 | **−792** | **−869** | ✓ (best) |
| 4 | −779 | −883 | ✓ |
| 5 | −747 | −879 | ✓ |
-->

Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.

<!-- TABLE VII: Accountant-Level 3-Component GMM
| Comp. | cos_mean | dHash_mean | weight | n | Dominant firms |
|-------|----------|------------|--------|---|----------------|
| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) |
| C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) |
| C3 (hand-signed tendency) | 0.928 | 11.17 | 0.28 | 184 | smaller domestic firms |
-->

Three empirical findings stand out.
First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster.
Of the remaining 41 Firm A CPAs, 32 fall into C2, consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
Third, the 2-component fit used for threshold derivation yields marginal-density crossings at cosine $= 0.945$ and dHash $= 8.10$; these are the natural per-accountant thresholds.
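
Equating two weighted Gaussian component densities and taking logs yields a quadratic whose roots are the crossing points; a closed-form sketch:

```python
import numpy as np

def gaussian_crossings(w1, m1, s1, w2, m2, s2):
    """Solve w1*N(x; m1, s1^2) = w2*N(x; m2, s2^2).
    Taking logs of both sides gives a*x^2 + b*x + c = 0."""
    a = 0.5 * (1.0 / s2**2 - 1.0 / s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = (0.5 * (m2**2 / s2**2 - m1**2 / s1**2)
         + np.log(w1 * s2 / (w2 * s1)))
    if abs(a) < 1e-12:               # equal variances: single linear root
        return [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []                    # densities never cross
    r = np.sqrt(disc)
    return sorted([(-b - r) / (2.0 * a), (-b + r) / (2.0 * a)])
```

With unequal variances two roots exist; the root lying between the two component means is the classification boundary.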
Table VIII summarizes the threshold estimates produced by the three convergent methods at each analysis level.

<!-- TABLE VIII: Threshold Convergence Summary
| Level / method | Cosine threshold | dHash threshold |
|----------------|------------------|-----------------|
| Signature-level KDE crossover | 0.837 | — |
| Signature-level BD/McCrary transition | 0.985 | 2.0 |
| Signature-level Beta 2-comp (Firm A) | 0.977 | — |
| Signature-level LogGMM 2-comp (Full) | 0.980 | — |
| Accountant-level 2-comp GMM crossing | **0.945** | **8.10** |
| Firm A P95 (median/95th pct calibration) | 0.95 | 15 |
| Firm A median calibration | — | 5 |
-->

The accountant-level two-component crossing (cosine $= 0.945$, dHash $= 8.10$) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture.
The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum / discrete-behavior asymmetry rather than as primary classification boundaries.
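
The signature-level KDE crossover in Table VIII is the antimode of a kernel density fit; a minimal sketch using `scipy.stats.gaussian_kde` (the default bandwidth and the grid size are assumptions, not the paper's settings):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_antimode(x, grid_size=2048):
    """KDE antimode: the density minimum between the outermost modes of
    a Gaussian KDE fit; returns None when only one mode is found."""
    kde = gaussian_kde(x)
    g = np.linspace(x.min(), x.max(), grid_size)
    d = kde(g)
    # interior local maxima of the density estimate
    peaks = np.where((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:]))[0] + 1
    if len(peaks) < 2:
        return None
    lo, hi = peaks[0], peaks[-1]
    return float(g[lo + np.argmin(d[lo:hi + 1])])
```
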
## F. Calibration Validation with Firm A

Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).

<!-- TABLE IX: Firm A Anchor Rates Across Candidate Thresholds
| Rule | Firm A rate |
|------|-------------|
| cosine > 0.837 | 99.93% |
| cosine > 0.941 | 95.08% |
| cosine > 0.945 (accountant 2-comp) | 94.5%† |
| cosine > 0.95 | 92.51% |
| dHash_indep ≤ 5 | 84.20% |
| dHash_indep ≤ 8 | 95.17% |
| dHash_indep ≤ 15 | 99.83% |
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% |

† interpolated from adjacent rates; all other rates computed exactly.
-->

The Firm A anchor validation is consistent with the replication-dominated framing throughout: the most permissive cosine threshold (the KDE crossover at 0.837) captures nearly all Firm A signatures, while the more stringent thresholds progressively filter out the minority of hand-signing Firm A partners in the left tail.
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix.
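
The anchor rates in Table IX are simple capture fractions over the Firm A population; a sketch with hypothetical arrays:

```python
import numpy as np

def anchor_capture_rates(cos_best, dhash_min):
    """Fraction of an anchor population captured by each candidate rule
    (Table IX-style calibration-validation rates)."""
    rules = {
        "cos>0.95": cos_best > 0.95,
        "dh<=8": dhash_min <= 8,
        "cos>0.95 AND dh<=8": (cos_best > 0.95) & (dhash_min <= 8),
    }
    return {name: float(mask.mean()) for name, mask in rules.items()}
```
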
## G. Pixel-Identity Validation

Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1).
These serve as the gold-positive anchor of Section III-K.
Using signatures with cosine $< 0.70$ ($n = 35$) as the gold-negative anchor, we derive Equal-Error-Rate points and classification metrics for the canonical thresholds (Table X).

<!-- TABLE X: Pixel-Identity Validation Metrics
| Indicator | Threshold | Precision | Recall | F1 | FAR | FRR |
|-----------|-----------|-----------|--------|----|-----|-----|
| cosine > 0.837 | KDE crossover | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| cosine > 0.945 | Accountant crossing | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| cosine > 0.95 | Canonical | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 |
| dHash_indep ≤ 8 | Accountant crossing | 0.966 | 1.000 | 0.983 | 0.314 | 0.000 |
| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 |
-->

All cosine thresholds achieve perfect classification of the pixel-identical anchor against the low-similarity anchor, which is unsurprising given the complete separation between the two anchor populations.
The dHash thresholds trade precision for recall along the expected tradeoff.
We emphasize that because the gold-positive anchor is a *subset* of the true non-hand-signing positives (only those that happen to be pixel-identical to their nearest match), recall against this anchor is conservative by construction: the classifier additionally flags many non-pixel-identical replications (low dHash but not zero) that the anchor cannot by itself validate.
The negative-anchor population ($n = 35$) is likewise small because intra-CPA pairs rarely fall below cosine 0.70, so the reported FAR values should be read as order-of-magnitude rather than tight estimates.
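
The validation metrics reduce to counts over the two anchor sets; a sketch in plain Python (FAR is the false-accept rate on the negative anchor, FRR the false-reject rate on the positive anchor):

```python
def anchor_metrics(flags_pos, flags_neg):
    """Precision/recall/F1 plus FAR and FRR from gold anchors.
    flags_pos: classifier decisions on the gold-positive anchor;
    flags_neg: classifier decisions on the gold-negative anchor."""
    tp = sum(flags_pos)
    fn = len(flags_pos) - tp
    fp = sum(flags_neg)
    tn = len(flags_neg) - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    far = fp / (fp + tn) if fp + tn else 0.0   # false-accept rate
    frr = fn / (fn + tp) if fn + tp else 0.0   # false-reject rate
    return dict(precision=precision, recall=recall, f1=f1, far=far, frr=frr)
```
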
A 30-signature stratified visual sanity sample (six signatures each from the pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) showed rater agreement with the classifier verdict in all 30 cases; this sample served only as a spot check and is not used to compute the reported metrics.

## H. Classification Results

Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents.

<!-- TABLE XI: Classification Results (Dual-Method: Cosine + dHash)
| Verdict | N (PDFs) | % | Firm A | Firm A % |
|---------|----------|---|--------|----------|
| High-confidence non-hand-signed | 29,529 | 35.0% | 22,970 | 76.0% |
| Moderate-confidence non-hand-signed | 36,994 | 43.8% | 6,311 | 20.9% |
| High style consistency | 5,133 | 6.1% | 183 | 0.6% |
| Uncertain | 12,683 | 15.0% | 758 | 2.5% |
| Likely hand-signed | 47 | 0.1% | 4 | 0.0% |
-->

Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash $\leq 5$);
36,994 (51.6%) show partial structural similarity (dHash in $[6, 15]$) consistent with replication degraded by scan variations;
and 5,133 (7.2%) show no structural corroboration (dHash $> 15$), suggesting high signing consistency rather than image reproduction.
A cosine-only classifier would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations.
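
The stratification above can be sketched as a small decision rule. The cosine-side split between "uncertain" and "likely hand-signed" is not spelled out in this section, so the 0.70 cut (the negative-anchor bound) is an assumption here:

```python
def classify_verdict(cos_best, dhash_min, low_cos=0.70):
    """Dual-method verdict assignment (a sketch of the Table XI rules)."""
    if cos_best > 0.95:
        if dhash_min <= 5:
            return "high-confidence non-hand-signed"
        if dhash_min <= 15:
            return "moderate-confidence non-hand-signed"
        return "high style consistency"
    # ASSUMPTION: below the cosine threshold, use the negative-anchor
    # bound (0.70) to split "likely hand-signed" from "uncertain".
    return "likely hand-signed" if cos_best < low_cos else "uncertain"
```
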
### 1) Firm A Validation

96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers.
The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.

### 2) Cross-Method Agreement

Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).

## I. Ablation Study: Feature Backbone Comparison

To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
Table XII presents the comparison.
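
The shared post-processing for all three backbones, row-wise L2 normalization after which cosine similarity reduces to a dot product, can be sketched as:

```python
import numpy as np

def l2_normalize(feats, eps=1e-12):
    """Row-wise L2 normalization of backbone embeddings."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, eps)

def cosine_matrix(feats):
    """All-pairs cosine similarity: unit-normalize, then dot products."""
    f = l2_normalize(feats)
    return f @ f.T
```
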

<!-- TABLE XII: Backbone Comparison
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|--------|-----------|--------|-----------------|
| Feature dim | 2048 | 4096 | 1280 |
| Intra mean | 0.821 | 0.822 | 0.786 |
| Inter mean | 0.758 | 0.767 | 0.699 |
| Cohen's d | 0.669 | 0.564 | 0.707 |
| KDE crossover | 0.837 | 0.850 | 0.792 |
| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |

Note: Firm A values in this table are computed over all intra-firm pairwise
similarities (16.0M pairs) for cross-backbone comparability. These differ from
the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
the classification-relevant statistic: the similarity of each signature to its
single closest match from the same CPA.
-->

EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), resulting in lower per-sample classification confidence.
VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.
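
Cohen's $d$ in Table XII is the pooled-standard-deviation effect size between the intra-class and inter-class similarity samples; a minimal sketch:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with the pooled standard deviation:
    d = (mean(a) - mean(b)) / s_pooled."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled
```
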
ResNet-50 provides the best overall balance:
(1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
(2) its tighter distributions yield more reliable individual classifications;
(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.