Paper A v13: filled submission draft (rev7) + reproducible build bundle

Fill all 18 placeholders in the condensed v13 submission draft with data verified against the analysis DB and LOCKED canonical scripts; close 12/13 co-author review items (only #8b protocol first-run open).

Key changes (need co-author sign-off; see handoff doc):
- Firm A out-of-sample HC 0.01% -> 0.42% (buggy 0.0001 from Script 49 same-pair bug, propagated v4.2->v13; never reuse 0.0001)
- §III-D empty cell ~=0 -> 7,681 honest reframe (not degenerate crops)
- low cosine cut 0.837 -> 0.8547 primary (BCD 2013-2019 closed-world, held-out discipline; 0.8489 confirmed = BCD all-period); HC/MC/HSC unchanged, UN/LH move <=0.4pp

Adds Figures 1-5 (real-data plots + schematics), full references, Appendix A/B, UN/HSC ICCR, n-reconciliation, #13 MOPS-metadata survival verification, "參" set-level feasibility probe (negative).

Two codex (gpt-5.5) adversarial rounds applied; no fabrication found. Bundle: paper/v13_build/ (markdown source, harvest/figure scripts, figures) for reproducibility. Handoff note for co-author included.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 03:24:50 +08:00
parent 1e8466f7a8
commit 66c9194fcf
13 changed files with 749 additions and 0 deletions
@@ -0,0 +1,418 @@
 # Anchor-Calibrated, Label-Free Screening of Non-Hand-Signed Signatures in Large-Scale Audit Reports
 *(Authors removed for double-blind review)*
 ## Abstract
 Audit reports must carry the certifying accountant's signature as the mark of an individual act of endorsement, but once reports are produced and stored digitally, a saved image of that signature can be pasted onto many reports instead — by manual stamping or by an automated signing system. We call such signatures non-hand-signed. This is not forgery: the signer is genuine, and the question is whether an act of signing took place for each report. We present a screening system that asks this question at archive scale: 86,071 Taiwanese statutory audit reports (2013–2023), within which the four largest audit firms contribute 150,442 analyzable signatures. The system finds the signature page, detects each signature, extracts deep features, and computes two similarities against the same accountant's other signatures: a cosine similarity that reflects style, and a perceptual-hash (dHash) distance that reflects pixel-level structure — a consistent hand keeps style similarity high while structure varies, whereas a reused image keeps both extreme. The archive carries no signature-level labels, and the data contain no natural gap (a unimodality test gives median p = 0.35 once firm effects and the hash's integer steps are removed), so no cutoff can be learned or read off the data. Instead, we calibrate a five-way rule by how often it fires by chance among unrelated accountants in a clean reference group (the non-Firm-A firms, 2013–2019): the strict high-confidence rule fires on about 1.2% of clean-group reports and anchors a high-specificity tier; a looser band fires on about 17.5% and is demoted to advisory. 
Held out as a test, Firm A fires the strict rule on 82% of its own signatures — about 139 times the chance rate — while its cross-firm match rate sits at or below the clean reference rate and is negligible beside its within-firm matching, so the signal is entirely inside the firm; 262 byte-identical signatures (145 at Firm A) are direct evidence of reuse, and anonymized interviews independently describe Firm A as a stamping firm since at least 2013. Operationally, the screen discovers where reuse concentrates without being told where to look, and it keeps human review at the scale of exceptions — at the signature level where reuse dominates, and at the accountant level where practices are mixed, through calibrated demotion of the low-specificity band, accountant-level ranking, and byte-identity confirmation — withholding only per-signature verdicts for the ambiguous middle. Calibrated on a large Chinese-signature corpus with script-agnostic image descriptors, the high-confidence rule and its measured specificity serve as a concrete, operator-tunable reference point for other Chinese-signature settings. We report specificity rather than a true error rate, and we label no single signature.
 **Keywords:** signature analysis, document forensics, perceptual hashing, deep features, unsupervised calibration, audit reports, anchor-based screening.
 ## I. Introduction
 An audit report is one of the main ways a company is held accountable to investors, and the certifying accountant's signature is the visible sign that a named professional takes responsibility for it. In Taiwan, the Certified Public Accountant Act and the attestation rules of the Financial Supervisory Commission require certifying CPAs to put their signature or seal on each audit report [1]. The law accepts either a handwritten signature or a seal, but the point of the requirement is the same in both cases: the mark on each report should stand for a deliberate, individual act of endorsement for that particular engagement [2].
 Going digital makes that harder to guarantee. Because reports are now created, sent, and stored as electronic files, it is easy to copy an accountant's saved signature image onto many reports instead of signing each one. This can happen in two ways: a staff member can overlay a scanned signature onto the finished report (a stamping workflow), or a firm-wide electronic-signing system can do the same step automatically. We call signatures produced either way non-hand-signed. The worry is not about legality; it is about meaning. A single image pasted onto hundreds of reports may not carry the individual endorsement the rule assumes — a concern the literature on signatures connects to behaviour, and, in auditing specifically, to rules that name and identify the engagement partner [31], [32], [33]. This is also why the problem is not forgery: a non-hand-signed signature reuses the real signer's own image, and at scale no reader can see the difference.
 That difference matters for the method too. Almost all work on offline signature analysis is about forgery — deciding whether a questioned signature was really written by the person it claims to be [3]–[8]. In our setting the identity is not in doubt; the accountant is genuine. What we want to know is whether the person actually signed each report, or whether one signing was copied as an image. This removes the need to model clever forgers, but it adds a new difficulty: we must separate a person who signs consistently from a reused image. Someone who signs in a very steady hand will produce signatures that look alike year after year; a process that reuses one stored file will produce signatures that are structurally identical. The method has to tell these two cases apart.
 Two facts make the obvious approach — pick a similarity cutoff and call everything above it a copy — unworkable, and they shape our design. First, archives like ours have no labels at the level of individual signatures: no signature is marked as "definitely hand-signed" or "definitely reused." Without such labels, any cutoff we choose has unknown error rates; we cannot measure how often it would wrongly flag a genuine signature or miss a reused one. Second, even setting labels aside, the data themselves do not contain a natural cutoff. As we show in §V, the raw numbers look at first as if they split into two groups, but that appearance comes from differences between firms and from the fact that the hash takes only whole-number values; once we remove those two effects, the distribution is a single smooth spread, not two clusters. You cannot read a dividing line off a distribution that has no gap, and you cannot test a line against labels that do not exist. So the method must get its cutoff some other way.
 Our two similarity measures are chosen precisely to expose the distinction the problem turns on. For each signature we compute two numbers against the same accountant's other signatures: a cosine similarity on deep ResNet-50 features, and an independent perceptual hash (dHash) distance. They carry different information. Cosine similarity measures overall style, and it is high both when an image is reused and when a person signs consistently. The dHash distance measures structure almost pixel by pixel, and a very small distance is the sign most specific to a reused image. But neither measure is enough on its own. Cosine alone over-flags a steady hand, because consistent signing also keeps it high. dHash alone has the opposite weakness: it is brittle to how an image is captured — a reused signature that has been re-scaled, re-cropped, or re-compressed can show a larger dHash distance and slip past a structure-only test — and a small dHash distance carries no meaning between two signatures whose styles do not match in the first place. The two are complementary precisely because they fail in different directions: cosine first establishes that the styles match, which catches reuse even when the image has been mildly altered, and dHash then asks whether the match is also near-identical in structure, which is what separates a reused image from a merely steady hand. A single similarity number blurs these two cases; two measures keep them apart. The implication between them runs one way only: a near-identical structure (a tiny dHash) forces a high cosine, but a high cosine in no way implies a near-identical structure — which is why the two-measure plane cannot be collapsed onto either single axis. This complementarity also shapes the rule (§III-D): because a small dHash distance is only meaningful once cosine is already high, the structural cut subdivides the high-cosine cases rather than the low-cosine ones. This is the heart of the design.
 On this basis we build and study a complete screening system. The pipeline takes raw PDF reports through four steps — find the signature page, detect each signature, turn it into features, and compute the two similarities — and sorts each signature into one of five categories. Because there is no natural cutoff to read off the data and no labels to learn one from, we instead measure how often the rule fires by chance between unrelated accountants in a clean reference group. That chance rate is the rule's specificity: it gives us a principled way to choose an operating point, and — just as important — it tells us exactly what each category's flag is worth.
 What is the screen for? Two things. Run over a large archive, it discovers where reuse concentrates — which firms, which periods — without being told where to look. And it keeps human review at the scale of exceptions. In a reuse-dominated population (a stamping firm, a firm with an electronic-signing system), the high-confidence tier routes most signatures directly to a high-specificity candidate list, and the small residual goes through a defined review protocol (specified in §IV-B) — side-by-side overlay inspection, secondary image-artifact checks, and bounded per-accountant sampling — that also accumulates labels for later calibration. In a mixed population, where hand-signing and informal stamping coexist, the ambiguous middle is larger, and the same disposition machinery delivers the same promise one level up, at the accountant: the low-specificity advisory band is demoted rather than worked, accountant-level scores concentrate attention on the few high-ranked or mixed cases, and byte-identity hits supply proof where proof exists, confirming that an accountant's stored image is in circulation. What the screen does not deliver there — and we say so plainly when we report the category proportions (§IV-B) — is a per-signature verdict for the ambiguous middle. In every case the output is bounded triage, not a verdict on any single signature.
 The Taiwan setting suits this study well. The Market Observation Post System offers a large, standardized, public collection of statutory audit reports, each with the same two-signature format, which makes large-scale extraction practical. In addition, anonymized interviews with certifying partners and signing-system staff at all four firms give us institutional facts about how each firm signs and about when each firm adopted a formal electronic-signing system — adoptions that were staggered from 2020 onward. This gives the study a natural before-and-after structure in time, and outside information against which to read the firm-level results (§III-A).
 We make four contributions:
 1. An end-to-end screening pipeline that turns raw audit-report PDFs into operational labels for hundreds of thousands of signatures.
 2. A dual descriptor that separates style consistency from image reproduction — a distinction a single similarity measure blurs.
 3. A label-free, anchor-calibrated operating point that is both a method and a concrete, reusable rule. With neither a natural cutoff in the data nor labels to learn one from, we set a tunable rule by measuring how often it fires by chance in a clean reference group, and we say plainly what that measure can and cannot support. The result is not only a calibration method but a concrete operating point — the high-confidence rule and its measured specificity — that practitioners working with Chinese-signature corpora can adopt directly or use as a starting reference, together with a defined disposition path for the ambiguous middle (calibrated demotion of the low-specificity band, aggregation to the accountant level, byte-identity escalation, and a bounded manual protocol) that keeps human review at the scale of exceptions.
 4. A demonstration on Chinese signatures, a structurally complex and comparatively under-served script for signature analysis. Because our descriptors work on the image rather than on script-specific strokes, the approach does not depend on Latin-script assumptions and is a candidate for other scripts.
 The paper is organized to move from the problem to the evidence. Section II reviews related work and states the gap. Section III describes the study design — the data split, the pipeline, the five-way rule, and the calibration logic — and explains why each piece is built the way it is. Section IV reports the results: the calibration baseline, which category needs human review, and the held-out test on Firm A. Section V collects supporting analyses, including the diagnostic showing that no natural cutoff exists. Section VI concludes.
 ## II. Related Work and Research Gap
 Why reproduction matters: signatures carry symbolic weight. A signature is valuable mainly as a symbol — it stands for the signer's identity and intent. Recent experiments show that this symbolism does not survive a change in how one signs. In studies that take the reader's point of view, Chou [41] finds that electronic signatures give a weaker sense of the signer's presence than handwritten ones, and that readers therefore judge an e-signed document as less valid and expect more non-compliance; across five kinds of e-signature (a checked box, a PIN, an avatar, a typed name, and a software-generated signature), the software-generated kind felt the most "present" of the electronic options but still less than a handwritten signature. In studies that take the signer's point of view, Chou [42] finds that electronic signatures give a weaker sense of self-presence — the signer's felt attachment to the mark — and that this, in turn, makes people more willing to cheat; the work singles out signing by proxy (an autopen) as cutting the tie between the document and the signer. These results matter for us because the practice we detect — a stored signature image laid onto a report by staff or by software — is, in this scheme, one of the lowest-presence modes: it looks like a software-generated signature and is executed like a proxy signature, because the accountant performs no signing act for the report. These effects are robust rather than one-off: in a pre-registered, multi-study replication with meta-analysis, Tzelios and Williams [43] reproduce Chou's reader-side result — an avatar e-signature lowers the sense of the signer's presence and raises the expectation that the contract will be breached. 
In their general discussion the same authors point to accounting as a next setting — noting the spread of online tax filing and asking how digital signatures affect an evaluator's assessment of the legitimacy of claims, while cautioning that accounting documents may prove less sensitive to signature form than legal ones. We read that call precisely: their "auditors" are the readers of digitally signed filings — those who evaluate the claims — not the certifying accountants who sign. The signer-side question in auditing — what it means when the certifying professional's own signature is reproduced rather than performed — is not addressed in that literature. Both questions, reader-side and signer-side, presuppose the same missing capability: a way to measure non-hand-signing at scale. The lesson we draw is not that non-hand-signing harms audit quality — that is a separate question we leave to a companion study (§VI) — but that whether it matters is a real question, and one nobody can study without first being able to measure non-hand-signing at scale.
 Signature analysis to date is about forgery, not reuse. The obvious toolkit for that measurement is signature analysis, but its main concern is the wrong one for us. Bromley et al. [3] introduced the Siamese network that still anchors the field; SigNet [4] extended it to compare writers it had never seen; Kao and Wen [5] worked from a single genuine sample; TransOSV [6] brought in a Vision Transformer; and meta-learning has been used to cut the effort of enrolling new signers [16]. All of this targets imitation by another hand, so it learns to tell different people apart. Our task is the opposite: spotting reuse of the genuine signer's own image, which lives in the most-similar tail of one person's signatures. The closest idea uses reference examples to set a sensible cutoff [8], but on benchmark data with known genuine references — whereas our archive has no signature-level labels at all. This body of work is also overwhelmingly built on Western, Latin-script signatures; non-Latin scripts such as Chinese are comparatively under-served, and reported accuracies for them are lower [44]. Chinese signatures are structurally distinctive — many strokes, with wide variation between writers — and the forensic literature on them is thin; the closest precedent, Chen [45], analyses Chinese signatures with a maximum-similarity-to-same-class statistic that directly parallels our use of the maximum cosine to the same accountant. Our descriptors, however, work on the image rather than on script-specific strokes, so the method itself does not depend on the script.
 Image-duplication and document forensics: useful parts, different setting. A second line of work looks directly at duplicated images. Copy-move detection finds regions copied within an image [11], and Abramova and Böhme [10] adapted it to scanned documents, noting that ordinary repeated characters confuse the standard methods. Self-supervised copy detection on everyday photos [13] shows that pretrained CNN features with cosine similarity make a strong baseline for spotting near-duplicates. Closest in pipeline terms, Woodruff et al. [9] pull signatures from corporate filings for anti-money-laundering work — but to group signatures by who signed them, not to detect one signer's image being reused across documents. The building blocks exist; the specific setting — one signer's image reused across many scanned financial reports — does not seem to have been addressed.
 Deep features and perceptual hashing as ready-made parts. Features from a pretrained CNN transfer well to document images without any retraining [20], [21], and perceptual hashes are built to survive the print–scan–rasterize cycle [27]. Jakhar and Borah [12] show that combining a perceptual hash with deep features beats either one alone for near-duplicate detection — a direct precedent for our two-measure design, though they work on natural images rather than signatures.
 The recurring obstacle is the missing label. None of these lines solves the problem we face, because real archives carry no signature-level ground truth, and a similarity screen without it falls back on a hand-chosen cutoff whose error behaviour is unknown. (The statistical tools we use to test for a natural cutoff and to describe the rule once we find none are introduced where they are used, in §III and §V, since they are part of our method rather than prior work on this problem.)
 The gap, and our contribution. Two gaps follow. First, large-scale screening for non-hand-signed auditor signatures has not been done, even though there is good reason (above) to think it matters. Second, and more broadly, similarity-based screening has no principled way to set and describe an operating point when labels are missing. Our contribution sits exactly here: a label-free calibration that replaces both the arbitrary cutoff and the unavailable labelled validation with a chance-rate measured in a clean reference group, together with the pipeline and dual descriptor that make the screening possible (contributions listed in §I).
 ## III. Research Background and Study Design
 This section explains how the study is built and why. Apart from a few counts and rates quoted in §III-D to justify design choices, we report no computed numbers here; the results themselves appear in §IV.
 ### A. Institutional Background
 To pin down the signing practices that we need in order to interpret the results, we held semi-structured interviews with certifying partners and signing-system staff at all four firms in the study.¹ Three points do real work later. First, all four firms allow handwritten signing but none require it. Second, formal firm-wide electronic signing or sealing systems were adopted on staggered dates from 2020 onward. Third, one firm — which we call Firm A throughout — has used scanned-image overlay stamping as its usual practice since at least 2013. We use these facts only as background, not as labels for individual signatures: they guide how we split the data below and how we read the firm-level results in §IV-C, but they do not tell us the status of any single signature. The practical implication is that the years before the formal systems (before 2020) are the right "normal" period to use for calibration.
 > ¹ Footnote — institutional detail. The interviews were conducted under institutional research-ethics approval and are reported in anonymized, aggregated form; firms are labelled A–D and no individual can be identified. The formal systems were reported to have been adopted at roughly one firm in early 2020, one in 2021, and one in late 2022 (exact firm-level dates are withheld for anonymity; see supplementary materials). Interviewees attributed this timing partly to the COVID-19 pandemic, which forced remote review and signing, and to firm-wide paperless and environmental (ESG) initiatives — both of which accelerated the move to formal electronic signing at Firms B/C/D. For Firm A, the reported workflow is that the certifying accountant approves the finished report electronically, after which the print room overlays the accountant's stored seal or signature image onto the PDF and prints it; the stored image is rarely changed, and although handwritten signing is allowed it is reported to be very rare, and rarer over time. Before the formal systems, the other firms' practice varied: some used informal scan- or photocopy-based stamping alongside handwritten signing, and at least one reported mostly handwritten signing before its system. The property the calibration relies on (§III-E) is that, in the pre-2020 baseline firms, different accountants did not share a common template — not that every signature was handwritten.
 ### B. Data and Analysis Design
 The corpus is all retrievable Taiwan statutory audit reports for fiscal years 2013–2023 from the four largest firms (A–D); signatures are extracted from them as described in §III-C. We then split this corpus by firm and by period, giving each part a distinct job (Fig. 1):
 - Calibration (the clean reference group): Firms B/C/D, 2013–2019.
 - Held-out test 1: Firm A, 2013–2023.
 - Held-out test 2 (secondary): Firms B/C/D, 2020–2023.
 We explain the reason for each part in §III-E. The key idea is simple: we calibrate only on the clean cell — the non-Firm-A firms in the years before formal systems — and test everything else against it. No numbers appear here; the calibration results start in §IV-A.
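The split above can be written as a small lookup. The sketch below is illustrative only: the function name `assign_role` and the role strings are hypothetical, and it assumes the fiscal year is available as an integer.

```python
def assign_role(firm: str, fiscal_year: int) -> str:
    """Map a (firm, fiscal year) cell to its job in the study design."""
    if firm not in {"A", "B", "C", "D"}:
        raise ValueError(f"unknown firm: {firm!r}")
    if not 2013 <= fiscal_year <= 2023:
        raise ValueError(f"fiscal year out of range: {fiscal_year}")
    if firm == "A":
        return "heldout_1"      # Firm A, 2013-2023
    if fiscal_year <= 2019:
        return "calibration"    # Firms B/C/D, 2013-2019 (clean reference group)
    return "heldout_2"          # Firms B/C/D, 2020-2023 (secondary)
```

Keeping the assignment in one function makes it harder for a calibration script to read a held-out cell by accident.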
 ![Figure 1](figures/fig1.png)
 *Figure 1. The data split. Rows are Firms A–D; columns are 2013–2019 and 2020–2023. The B/C/D × 2013–2019 cells are the clean calibration group; Firm A (both periods) is held-out test 1; B/C/D × 2020–2023 is the secondary held-out test. We calibrate only on the clean cell and test everything else against it.*
 ### C. Pipeline
 The pipeline turns a raw PDF report into labelled signatures in five steps (Fig. 2).
 Finding the signature page. A vision-language model [24], [35] scans only the first quarter of each document — where the auditor's report page reliably sits — and stops as soon as it finds the page.
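The early-stopping scan can be sketched as follows. This is a minimal illustration, not the deployed code: `is_signature_page` is a hypothetical stand-in for the vision-language model call, and rounding the "first quarter" up with `math.ceil` is an assumption.

```python
import math
from typing import Callable, Optional


def find_signature_page(
    page_count: int,
    is_signature_page: Callable[[int], bool],
) -> Optional[int]:
    """Scan only the first quarter of the document and stop at the first hit.

    `is_signature_page` is a hypothetical callable wrapping the page classifier;
    it receives a zero-based page index.
    """
    limit = math.ceil(page_count / 4)  # assumption: round the quarter up
    for page_index in range(limit):
        if is_signature_page(page_index):
            return page_index  # stop as soon as the page is found
    return None  # no signature page in the scanned range
```

Because the loop returns at the first hit, the classifier is called at most once per scanned page and never on pages after the match.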
 Detecting signatures. A YOLOv11n detector [25], [34], trained on 500 hand-labelled signature pages (425 for training, 75 for validation; 100 epochs; started from COCO weights), draws a box around each signature. A region counts as a signature if it holds handwritten content that belongs to a personal signature, even where it overlaps an official stamp. A red-stamp removal step (filtering in HSV colour space) then strips away overlapping red seals, leaving the handwritten part.
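As a rough illustration of the HSV filtering idea, the sketch below whitens pixels whose hue is close to red and whose saturation and brightness are high enough to be stamp ink. The thresholds (`hue_tol`, `min_sat`, `min_val`) are placeholder assumptions, not the values used in the pipeline, and the pure-Python pixel loop stands in for whatever image library the real step uses.

```python
import colorsys


def is_stamp_red(r, g, b, hue_tol=0.05, min_sat=0.4, min_val=0.3):
    """True if an RGB pixel (0-255 channels) looks like red stamp ink.

    All three thresholds are illustrative assumptions.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    near_red = h <= hue_tol or h >= 1.0 - hue_tol  # red wraps around hue 0
    return near_red and s >= min_sat and v >= min_val


def remove_red_stamp(pixels):
    """Replace red-stamp pixels with white; `pixels` is a list of rows of (r, g, b)."""
    white = (255, 255, 255)
    return [[white if is_stamp_red(*px) else px for px in row] for row in pixels]
```

Dark ink has near-zero saturation, so it fails the saturation test and is left untouched even though its hue is reported as 0.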
 Turning signatures into features. Each detected signature is passed through an ImageNet-pretrained ResNet-50 [26] used as a fixed feature extractor — we take the 2,048-number output of its global-average-pooling layer and drop the classification head. We resize each image to 224×224 while keeping its aspect ratio (padding with white), apply the standard ImageNet normalization, and scale the feature vector to unit length, so that cosine similarity is just the dot product. We use these off-the-shelf features rather than fine-tuning the network, for three reasons: the task is comparing similarity, not classifying; ImageNet features are known to transfer well to document images [20], [21]; and not fine-tuning avoids the risk of learning quirks of our particular dataset. The backbone choice is checked in §V-C.
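Two details of this step are easy to get wrong, so a small sketch may help: the aspect-preserving resize geometry, and the fact that unit-length vectors make cosine similarity a plain dot product. The helper names are hypothetical, and centring the resized image inside the white canvas is an assumption (the text only says the padding is white).

```python
import math


def letterbox_geometry(width, height, target=224):
    """Size and offset of an image resized to fit a target x target canvas.

    Returns (new_w, new_h, pad_left, pad_top). Centring is an assumption.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


def l2_normalize(vec):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))
```

For example, a 448×112 crop is scaled to 224×56 and padded by 84 rows of white above and below, so a wide signature is not stretched vertically.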
 Assigning each signature to an accountant. Each signature is matched to a registered accountant by its position on the page (first or second) against the official registry. Signatures we cannot match are left out of the same-accountant comparisons, because the "most similar signature by the same accountant" measure has no meaning without an assigned accountant.
 (Detection accuracy, signature counts, match rates, and the resulting analysis sample are reported in §IV-A.)
 ![Figure 2](figures/fig2.png)
 *Figure 2. The screening pipeline. A raw PDF passes through page-finding (a vision-language model), signature detection (YOLOv11) with red-stamp removal, feature extraction (ResNet-50), the two per-signature similarities (cosine for style; the smallest dHash to the same accountant for structure), and a five-way label.*
 ### D. The Two Similarity Measures and the Five-Way Rule
 For each signature we compute two numbers, both against the same accountant's other signatures: cos, its highest cosine similarity to another of that accountant's signatures, and dHash, its smallest perceptual-hash distance to another of them. As explained in §I, the point of using two measures is to separate two things that one measure blurs. A high cos means the signatures look alike in style, which happens both when an image is reused and when a person signs consistently. A small dHash means the signatures are alike almost pixel for pixel, which is the sign most specific to a reused image. Together they are far more telling than either alone: a steady hand gives a high cos but a dHash that still varies, while a reused image gives a high cos and a tiny dHash.
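The two per-signature values can be sketched as below. This is an illustration under stated assumptions: the hash is taken to be the common 64-bit difference hash computed from an 8-row by 9-column grayscale grid (the hash size is not stated in this section), the features are assumed to be already unit-normalized, and all function names are hypothetical.

```python
def dhash_bits(gray_rows):
    """64-bit difference hash from an 8-row x 9-column grayscale grid.

    Each bit records whether a pixel is brighter than its right-hand neighbour.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def per_signature_descriptors(features, hashes):
    """For each signature of ONE accountant: (max cosine, min dHash) over the others.

    Returns None for a signature with no same-accountant partner.
    """
    n = len(features)
    out = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        if not others:
            out.append(None)
            continue
        max_cos = max(
            sum(x * y for x, y in zip(features[i], features[j])) for j in others
        )
        min_dh = min(hamming(hashes[i], hashes[j]) for j in others)
        out.append((max_cos, min_dh))
    return out
```

Note that the maximum and the minimum are taken independently, so the two values for a signature may come from different partner signatures.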
 The rule places each signature in one of five categories, with cosine acting as the primary gate and the structural (dHash) distance refining only the cases where cosine is already high. Each name states the screening hypothesis its region suggests — a candidate reading, not a confirmed determination:
 - HC — high-confidence reuse candidate: cosine above the high cut and structure at or below the near-identical cut. Both measures point to a reused image.
 - MC — moderate-confidence, advisory: cosine above the high cut and structure above the near-identical cut but at or below the upper structural cut. Style is very similar, but the structural match falls short of the strict bar.
 - HSC — high style-consistency: cosine above the high cut and structure above the upper structural cut. Style is similar with no structural support.
 - UN — uncertain: cosine above the low cut (the same-vs-different-accountant crossover) and at or below the high cut.
 - LH — low reuse-similarity: cosine at or below the low cut.
 A report takes the strongest label among its signatures (HC > MC > HSC > UN > LH).
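The five-way rule and the report-level roll-up can be sketched as a short function. The names (`Cuts`, `label_signature`, `label_report`) are hypothetical. In `EXAMPLE_CUTS`, the low cut 0.8547, the high cut 0.95, and the near-identical cut 5 are the values quoted later in this subsection; the upper structural cut of 15 is a placeholder assumption, since that value is given only in §IV-A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cuts:
    cos_low: float     # same-vs-different-accountant crossover
    cos_high: float    # high style cut
    dhash_near: int    # near-identical structural cut
    dhash_upper: int   # upper structural cut


# dhash_upper=15 is a PLACEHOLDER, not a value taken from the paper.
EXAMPLE_CUTS = Cuts(cos_low=0.8547, cos_high=0.95, dhash_near=5, dhash_upper=15)

PRIORITY = ["HC", "MC", "HSC", "UN", "LH"]  # strongest first


def label_signature(max_cos, min_dhash, cuts):
    """Cosine is the primary gate; dHash subdivides only the high-cosine band."""
    if max_cos > cuts.cos_high:
        if min_dhash <= cuts.dhash_near:
            return "HC"
        if min_dhash <= cuts.dhash_upper:
            return "MC"
        return "HSC"
    if max_cos > cuts.cos_low:
        return "UN"
    return "LH"


def label_report(signature_labels):
    """A report takes the strongest label among its signatures."""
    return min(signature_labels, key=PRIORITY.index)
```

A signature with cosine 0.93 and dHash 2 is labelled UN here, not HC: a near-identical structure below the high style cut does not open a separate cell.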
 Why the partition has this shape (five categories, not nine). As explained in §I, a near-identical structure is decision-relevant only once the styles already match, so the two cosine cuts come first — splitting signatures into three style bands (low, uncertain, high) — and the two structural cuts subdivide only the high band. Three facts pin this shape down. First, structure carries little standalone decision weight in the two lower bands: between signatures whose styles do not clearly match, a moderate structural distance is hash noise, not evidence of reproduction — and even the near-identical structural matches that do appear below the style cut (quantified next) are not assigned HC; their structural information re-enters only through accountant-level aggregation and byte-identity review (§IV-B), not through a separate cell. Second, the cells of the full 3×3 grid that pair a lower style band with a near-identical structure are sparsely populated rather than ignored — and the empirical reading is more precise than a simple "they are empty." An explicit count makes this exact: of the 150,442 Big-4 signatures, 7,681 (5.1%) combine a near-identical structural match (dHash ≤ 5) with a sub-0.95 cosine, so the one-way implication of §I (a tiny dHash forces a high cosine) holds approximately, not strictly. But the residents' mass sits immediately below the high-cosine cut — 7,311 of them (95.2%) fall in cosine 0.90–0.95, and only 370 signatures (0.25% of the corpus) reach the genuinely low-cosine bands, of which just 38 lie below the LH/UN crossover (cosine ≤ 0.8547). These residents are not degenerate crops: their image size (mean 33k px) and detection confidence (0.875) match the rest of the corpus (28k px, 0.877). Under the coherent same-pair definition — style and structure satisfied on the same partner signature — the count falls further to 874 (0.58%). 
The point is therefore not that these cells are empty but that subdividing the lower style bands by structure changes no disposition: because cosine is the primary gate, a near-identical structural match beneath the style cut is already handled as UN, and the residual structural information re-enters through the accountant-level aggregation and byte-identity escalation of §IV-B rather than through a separate cell. Third, a partition should cut only where the resulting actions differ: subdividing the two lower bands by structure would create cells whose dispositions (§IV-B) are identical — all demoted or aggregated the same way — adding calibration burden without operational consequence, whereas the three structural cells inside the high band exist precisely because their dispositions differ. (Count from the deployed-rule descriptor columns; any-pair definition, full Big-4 corpus.)
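The percentages in this paragraph follow directly from the quoted counts; the short check below recomputes them. The variable names and the `pct` helper are illustrative only, and the counts are copied from the text.

```python
TOTAL_BIG4 = 150_442          # analyzable Big-4 signatures
NEAR_IDENTICAL_SUB_HIGH = 7_681   # dHash <= 5 with cosine below 0.95 (any-pair)
IN_090_TO_095 = 7_311         # of those, cosine in 0.90-0.95
LOW_COSINE_BANDS = 370        # of those, in the genuinely low-cosine bands
SAME_PAIR = 874               # count under the same-pair definition


def pct(part, whole, digits):
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)


# The two sub-groups partition the 7,681 residents.
assert IN_090_TO_095 + LOW_COSINE_BANDS == NEAR_IDENTICAL_SUB_HIGH

share_of_corpus = pct(NEAR_IDENTICAL_SUB_HIGH, TOTAL_BIG4, 1)    # 5.1
share_just_below = pct(IN_090_TO_095, NEAR_IDENTICAL_SUB_HIGH, 1)  # 95.2
share_low_bands = pct(LOW_COSINE_BANDS, TOTAL_BIG4, 2)           # 0.25
share_same_pair = pct(SAME_PAIR, TOTAL_BIG4, 2)                  # 0.58
```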
 The cuts are operator-tunable operating points, not learned boundaries: there is no natural gap to read off the data (§V-A) and no signature-level labels to learn one from, so the cuts are chosen and their specificity is measured, not learned. The four cut values and the basis of each (one, the low cosine cut, is read directly from this study's data) are given in §IV-A, alongside the chance-rate calibration that characterizes them and the figure of the two-measure plane (Fig. 3).
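The partition described above can be written down as a short decision function. The following is a minimal sketch, not the deployed script: the cut values are those given in §IV-A, while the function name, the argument names, and the handling of values that fall exactly on a cosine cut are assumptions made for illustration.

```python
# Cut values as stated in §IV-A; all names here are illustrative.
LOW_COS = 0.8547      # calibration-cell same-vs-different-accountant crossover
HIGH_COS = 0.95       # high-similarity operating point
DHASH_NEAR = 5        # near-identical structural cut
DHASH_SIMILAR = 15    # upper bound of the looser "structurally similar" band


def classify(max_cosine: float, min_dhash: int) -> str:
    """Map a signature's two extrema to one of HC, MC, HSC, UN, LH."""
    # Cosine is the primary gate: the two lower bands never look at dHash.
    if max_cosine <= LOW_COS:       # assumption: the boundary value goes to LH
        return "LH"
    if max_cosine <= HIGH_COS:      # assumption: exactly 0.95 stays in UN
        return "UN"
    # Only the high-cosine band is subdivided by structure.
    if min_dhash <= DHASH_NEAR:
        return "HC"
    if min_dhash <= DHASH_SIMILAR:
        return "MC"
    return "HSC"


if __name__ == "__main__":
    for cos, dh in [(0.99, 2), (0.99, 10), (0.99, 30), (0.93, 2), (0.80, 2)]:
        print(cos, dh, classify(cos, dh))
```

Note that `classify(0.93, 2)` returns UN: a near-identical structural match beneath the high style cut does not open a separate cell, which is the five-rather-than-nine shape argued for above.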
 Any-pair versus same-pair: how the two extrema combine. One construction detail deserves to be explicit, because a careful reader will ask. The two per-signature values are independent extrema over the same accountant's other signatures — the highest cosine and the smallest dHash, each taken on its own — so the two values may come from different partner signatures. We call this the any-pair rule, and the choice is deliberate, for three reasons. First, the two descriptors have different invariances: cosine survives re-scaling and re-compression; dHash does not. For a genuinely reused image that crossed different scan or compression pipelines, the style-nearest copy and the pixel-nearest copy can therefore legitimately be different reports — forcing both extrema onto one pair would miss exactly that most realistic positive case. Second, dHash takes whole-number values and ties are massive in duplicate-heavy pools: which tied copy wins the minimum is essentially arbitrary, so whether the two extrema land on the same file is largely tie-breaking noise — both point into the same duplicate cluster. Third, the chance-rate calibration of §IV-A applies the same any-pair rule to the clean reference group, so the high-specificity claim rests on the absolute clean-group rate (the HC rule fires by chance on only ~1.2% of clean-group reports), not on any firm-versus-floor ratio; the same rule is applied to every firm and to the reference group alike. 
The stricter same-pair variant, in which a single partner signature must satisfy both inequalities at once, is reported as a robustness check (§V-C) and leaves every conclusion unchanged — the within-firm concentration of cross-accountant matches is in fact *higher* under same-pair (97.0–99.96% across the four firms) than under the deployed any-pair rule (76.7–98.8%) — because in the high-confidence region the two rules nearly coincide: a partner within the near-identical structural cut is pixel-near-identical and therefore clears the high style cut by itself.
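The difference between the two constructions is easiest to see on a small pool. The sketch below assumes each signature comes with a list of (cosine, dHash) values against the same accountant's other signatures; the function names and the toy pool are hypothetical, and only the HC cuts (cosine > 0.95, dHash ≤ 5) are taken from the text.

```python
def any_pair_hc(partners, cos_cut=0.95, dhash_cut=5):
    """Any-pair: the two extrema are taken independently over the pool."""
    best_cos = max(c for c, _ in partners)
    best_dhash = min(d for _, d in partners)
    return best_cos > cos_cut and best_dhash <= dhash_cut


def same_pair_hc(partners, cos_cut=0.95, dhash_cut=5):
    """Same-pair: one partner must satisfy both inequalities at once."""
    return any(c > cos_cut and d <= dhash_cut for c, d in partners)


if __name__ == "__main__":
    # Hypothetical pool: one style-near copy that went through a different
    # compression pipeline (high cosine, larger dHash) and one pixel-near
    # copy whose cosine falls just under the high cut.
    pool = [(0.97, 9), (0.94, 3)]
    print(any_pair_hc(pool), same_pair_hc(pool))
```

By construction same-pair implies any-pair, so the same-pair count can only be equal or smaller; the two agree whenever a single partner is both style-near and pixel-near.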
 Limitations, stated up front. Three follow directly from the design. (i) Because the cutoffs are chosen rather than learned, the system has a tunable operating point, not an optimal one; a reviewer who wants higher specificity can tighten it (§V). (ii) The chance rate we report is a measure of specificity, not a true false-acceptance rate, because we have no labelled negatives (§III-E). (iii) For any single signature, the two measures cannot tell us why it is so similar to another: reuse of an image, a shared scanning pipeline, and a very uniform house style all push the numbers the same way, and we do not try to choose between them at the level of one signature. These limits apply to every claim that follows.
 ### E. The Clean Reference Group and the Chance Rate
 With no labelled negatives to learn from, the calibration uses a stand-in: a group in which the rule should fire only by chance — unrelated accountants whose signatures happen to look alike now and then. Choosing this group well is the central design decision, and two requirements force the choice.
 Why not all four firms. As §IV-C will show, almost all of one firm's between-accountant matches fall on other accountants of the same firm, and we have byte-level proof of image reuse across about fifty of that firm's partners. If we put Firm A into the reference group, we would be filling the "by chance" rate with exactly the within-firm matches the rule is supposed to catch — a circular calibration. So we use Firms B/C/D as the clean reference group and keep Firm A as a test case; we report the all-four-firm number only to show how much Firm A contaminates it.
 Why 2013–2019. We further limit the reference group to the years before formal firm-wide electronic-signing systems (adopted from 2020 onward; §III-A). What this buys us is the absence of a shared template across accountants — not a guarantee that every signature was handwritten. The interviews say some baseline firms used informal individual stamping before 2020, but each accountant's stored image was their own, so different accountants' signatures still match only by chance; the chance rate is about matches between accountants, which individual stamping does not inflate. After 2020, formal systems standardize how reports are assembled, so that period is not a clean reference — and indeed the chance rate rises after 2020 (§V-B). We therefore calibrate on the Firms-B/C/D 2013–2019 cell and score every held-out cell against it.
 We report the rule's chance rate at three levels, because the rule takes the best match over a pool and so the per-signature rate is not the same as the per-pair rate: per comparison (sampled pairs of different accountants), per signature, and per report, each with a confidence interval. We call this the inter-CPA coincidence rate (ICCR) rather than a "false-acceptance rate," which we reserve for settings that have labelled negatives. Read as a measure of specificity under the stated assumption (no shared template across accountants), the ICCR is faithful to the evidence; read as a true error rate, it would claim more than we can show.
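To make the three levels concrete, the sketch below computes them on a toy cell. It is one plausible reading of the definitions above rather than the study's script: the records, the pair values, and the function name are hypothetical, and every pair is enumerated instead of sampled.

```python
from collections import defaultdict

COS_CUT, DHASH_CUT = 0.95, 5  # HC cuts from the text

# Hypothetical toy cell: signature id -> (accountant, report).
SIGS = {"s1": ("A1", "r1"), "s2": ("A2", "r1"),
        "s3": ("A3", "r2"), "s4": ("A1", "r3")}

# Hypothetical (cosine, dHash) values for different-accountant pairs only;
# s1 and s4 share an accountant, so that pair is not an inter-CPA comparison.
PAIRS = {("s1", "s2"): (0.97, 3), ("s1", "s3"): (0.80, 30),
         ("s2", "s3"): (0.96, 20), ("s2", "s4"): (0.90, 12),
         ("s3", "s4"): (0.70, 4)}


def iccr(sigs, pairs):
    # Per comparison: share of inter-CPA pairs that satisfy both cuts.
    hits = sum(c > COS_CUT and d <= DHASH_CUT for c, d in pairs.values())
    per_comparison = hits / len(pairs)

    # Per signature: any-pair extrema over each signature's inter-CPA pool.
    best_cos = defaultdict(float)
    best_dhash = defaultdict(lambda: 64)
    for (a, b), (c, d) in pairs.items():
        for s in (a, b):
            best_cos[s] = max(best_cos[s], c)
            best_dhash[s] = min(best_dhash[s], d)
    flagged = {s for s in sigs
               if best_cos[s] > COS_CUT and best_dhash[s] <= DHASH_CUT}
    per_signature = len(flagged) / len(sigs)

    # Per report: a report fires if any of its signatures fires.
    reports = {report for _, report in sigs.values()}
    per_report = len({sigs[s][1] for s in flagged}) / len(reports)
    return per_comparison, per_signature, per_report


if __name__ == "__main__":
    print(iccr(SIGS, PAIRS))
```

In the toy cell only one of five pairs fires, yet three of four signatures do, because taking the best match over a pool gives each signature several chances; this is why the three levels are reported separately.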
 ## IV. Findings
 This section reports the numbers. It starts with the calibration baseline (Firms B/C/D, 2013–2019), then says which category needs human review, then presents the held-out test on Firm A.
 ### A. Detection Sample (Whole Corpus) and the Calibration Baseline (Firms B/C/D, 2013–2019)
 Detection and the analysis sample (whole corpus). Two scopes appear in this section and must not be confused: detection and the analysis sample here are computed on the whole corpus, whereas both data-derived calibration quantities — the chance-rate ICCR and the low cosine cut (§IV-C) — are computed only on the clean Firms-B/C/D 2013–2019 cell. Of the 90,282 reports, the page-finder flagged 86,084 as having a signature page (the other 4,198, or 4.6%, had none); 13 of those 86,084 could not be rendered, leaving 86,071 documents processed. On the validation set, the YOLOv11n detector reached precision 0.97–0.98, recall 0.95–0.98, mAP@0.50 0.98–0.99, and mAP@0.50:0.95 0.85–0.90. Across the corpus it extracted 182,328 signatures — 2.14 per document with detections, where two certifying accountants per report implies 2.00. The ≈6.7% excess is explained by extra detections rather than missed accountants: of the 13,573 detections (7.4%) that could not be matched to a registered accountant and were excluded, 8,901 (66%) are third-or-later detections on a page — boxes beyond the two certifying signatures — and the unmatched set as a whole carries lower detection confidence than the matched set (mean 0.826 vs 0.874), consistent with these being extra boxes and low-confidence noise; the remaining 4,672 are first/second-position detections that failed registry matching. Throughput was 43.1 documents per second, and the detector agreed with the vision-language model on 98.8% of documents. Matching by position assigned 92.6% of signatures (168,755 of 182,328) to a registered accountant. The four-firm analysis sample is 437 accountants (171/112/102/52 across Firms A–D) and 150,442 signatures with both measures computed (Table I).
 **Table I — Detection and extraction summary.**
 | Quantity | Value |
 |---|---|
 | Documents processed (signature page found and rendered) | 86,071 |
 | Detector precision / recall | 0.97–0.98 / 0.95–0.98 |
 | Detector mAP@0.50 / mAP@0.50:0.95 | 0.98–0.99 / 0.85–0.90 |
 | Signatures extracted | 182,328 (2.14 per document) |
 | VLM–detector agreement | 98.8% |
 | Signatures matched to an accountant | 168,755 (92.6%) |
 | Four-firm analysis sample | 437 accountants; 150,442 signatures |
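The counts in the paragraph and table above can be reconciled with a few lines of arithmetic. The sketch below uses only figures quoted in the text; the variable names are illustrative.

```python
# All inputs are figures quoted in §IV-A; names are illustrative.
reports_total = 90_282
no_signature_page = 4_198
unrenderable = 13
signatures = 182_328
unmatched = 13_573
unmatched_third_or_later = 8_901
accountants_by_firm = {"A": 171, "B": 112, "C": 102, "D": 52}

flagged = reports_total - no_signature_page            # page-finder positives
processed = flagged - unrenderable                     # documents processed
matched = signatures - unmatched                       # assigned to an accountant
unmatched_first_second = unmatched - unmatched_third_or_later

if __name__ == "__main__":
    print("flagged:", flagged, "processed:", processed)
    print("matched:", matched,
          f"({100 * matched / signatures:.1f}% of signatures)")
    print("unmatched in position 1 or 2:", unmatched_first_second)
    print("accountants:", sum(accountants_by_firm.values()))
```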
 The calibrated operating point: the four cut values and their bases. The five-way rule of §III-D uses four cut values; we state them here because one of them, the low cosine cut, is read directly from this study's data. The low cosine cut, 0.8547, is the crossover of the same-accountant and different-accountant cosine distributions computed on the calibration cell alone (Firms B/C/D, 2013–2019, closed-world: both the source signatures and their comparison set are drawn from that cell; §IV-C). We use this closed-world value as the primary cut rather than the corpus-wide crossover, so that the one data-derived threshold in the rule is estimated on the Firms-B/C/D 2013–2019 calibration cell alone, held out from Firm A and from post-2020 scoring. The cut is stable across scopes: 0.8547 (calibration closed-world), 0.8367 corpus-wide, 0.8489 on the all-period baseline firms, and 0.8302 with the non-Big-4 firms added. It moves by at most 0.025 across all four scopes (0.018 from the corpus-wide value), so the choice of scope is immaterial and the broader-scope values stand as robustness checks (§V-C). The high cosine cut, 0.95, is the high-similarity operating point: it sits in the region where genuine reuse concentrates (the byte-identical anchor of §IV-C lies at cosine 1), and a recalibration cannot move it onto a distributional antimode because none exists (no within-population bimodality, §V-A). The near-identical structural cut, dHash ≤ 5, is the perceptual-hash distance below which two rasters are pixel-equivalent up to mild recompression, and dHash ≤ 15 bounds the looser "structurally similar" band; both follow the standard 64-bit dHash distance scale [27]. 
We therefore do not re-derive these three as optimal cutoffs but characterize their chance-of-firing behaviour directly (the full prior-calibration provenance is in the supplementary materials). We deliver the four values as a concrete, calibrated operating point, in particular the high-confidence (HC) rule (cosine > 0.95 and dHash ≤ 5), and we treat them as operator-tunable: the calibration below shows what each setting yields, so an operator can retune for a different specificity target by inverting the ICCR curve (for example, dHash ≤ 3 for a tighter floor). Because the rule is calibrated on a large Chinese-signature corpus, practitioners working with other Chinese-signature corpora can adopt the HC values directly or use them as a starting point.
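Retuning by inverting the ICCR curve amounts to a table lookup. The sketch below assumes the operator has tabulated the per-signature chance rate at several dHash cuts with the cosine cut held at 0.95. Only the 0.59% value at dHash ≤ 5 comes from this paper; every other rate in the table is a placeholder invented for illustration, and the function name is hypothetical.

```python
# dHash cut -> per-signature chance rate (cosine > 0.95 held fixed).
# Only the entry at 5 (0.59%) is from the text; the rest are PLACEHOLDERS.
ICCR_BY_DHASH_CUT = {1: 0.0010, 3: 0.0030, 5: 0.0059, 10: 0.0300, 15: 0.0900}


def loosest_cut_within(target_rate, curve=ICCR_BY_DHASH_CUT):
    """Largest dHash cut whose chance rate does not exceed target_rate.

    Assumes the curve is non-decreasing in the cut, which holds by
    construction because a looser cut can only add chance hits.
    Returns None when no tabulated cut meets the target.
    """
    admissible = [cut for cut, rate in curve.items() if rate <= target_rate]
    return max(admissible) if admissible else None


if __name__ == "__main__":
    for target in (0.006, 0.003, 0.0001):
        print(target, "->", loosest_cut_within(target))
```

An operator would replace the placeholder table with the measured curve for their own clean reference group before reading a cut from it.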
 ![Figure 3](figures/fig3.png)
 *Figure 3. The two measures and the five regions. The cosine axis is split at the low cut 0.8547 (the calibration-cell same-vs-different-accountant crossover) and the high cut 0.95; within the high-cosine band the dHash axis is split at 5 and 15. The bottom-right corner — high cosine with near-identical structure — is the high-confidence reuse region.*
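For readers unfamiliar with the vertical axis of Fig. 3, a dHash distance is the Hamming distance between two 64-bit difference hashes. The sketch below shows the usual construction under stated assumptions: the image is taken to be already reduced to an 8-row by 9-column grayscale grid (the resizing step is omitted), and the function names are illustrative rather than those of any particular library or of the study's pipeline.

```python
def dhash_bits(gray_rows):
    """64-bit difference hash from an 8-row x 9-column grayscale grid.

    Each bit records whether a pixel is brighter than its right-hand
    neighbour, giving 8 bits per row and 64 bits in total.
    """
    if len(gray_rows) != 8 or any(len(row) != 9 for row in gray_rows):
        raise ValueError("expected an 8 x 9 grid")
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(left > right)
    return bits


def dhash_distance(h1, h2):
    """Hamming distance between two hashes: the number of differing bits."""
    return bin(h1 ^ h2).count("1")


if __name__ == "__main__":
    grid_a = [[9 - c for c in range(9)] for _ in range(8)]  # every row descending
    grid_b = [list(range(9))] + grid_a[1:]                  # first row ascending
    print(dhash_distance(dhash_bits(grid_a), dhash_bits(grid_b)))
```

Because the distance counts differing bits, it takes only whole-number values from 0 to 64, which is the integer-step property the calibration and the unimodality diagnostic both have to account for.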
 The calibration sample itself (Firms B/C/D, 2013–2019). The chance-rate calibration that follows is computed on the clean cell only, and the reader should be able to see the calibration base directly rather than infer it from the full-period totals above. The Firms-B/C/D 2013–2019 cell contains 226 accountants, 52,071 signatures with both measures computed, and 26,042 reports; the per-comparison ICCR below is estimated from 5×10⁵ inter-CPA signature pairs sampled uniformly from this cell. Every ICCR source signature is restricted to this cell — the headline per-signature and per-document rates reproduce on the 52,071-signature 2013–2019 cell, not on the full-period BCD record (~90,000 signatures), which is used only where a robustness figure is explicitly quoted — so no post-2020 or Firm-A signature enters the calibration.
 How often the strict rule fires by chance (pooled). In the Firms-B/C/D 2013–2019 group, the strict (HC) rule fires by chance very rarely at every level (Table II): about 1 in 100,000 per comparison (Wilson 95% CI [4×10⁻⁶, 2.3×10⁻⁵]), 0.59% per signature ([0.45%, 0.73%]), and 1.2% per report. These are roughly 14 to 19 times lower than the contaminated all-four-firm figures (1.4×10⁻⁴, 11.0%, 18.0%); the difference is exactly the within-firm matching that the clean group leaves out. So a clean group of unrelated accountants almost never produces an HC report, which makes HC a high-specificity operating point. (The per-comparison figure rests on a small number of chance hits, 5 of 5×10⁵ pairs, and is best read as an order-of-magnitude value; the per-signature and per-report figures, which are well powered, carry the weight.)
 **Table II — Chance-firing rates (ICCR) by level and group: the strict HC rule (top two rows), with the looser MC band's per-report rate shown for contrast (bottom row).**
 | Group / rule | Per comparison | Per signature | Per report |
 |---|---|---|---|
 | HC rule — B/C/D 2013–2019 (calibration) | 1.0×10⁻⁵ [4e-6, 2.3e-5] | 0.59% [0.45%, 0.73%] | 1.2% |
 | HC rule — all four firms (contamination check) | 1.4×10⁻⁴ | 11.0% | 18.0% |
 | MC band (HC+MC) — B/C/D 2013–2019, per report | — | — | ≈17.5% |
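The Wilson interval quoted for the per-comparison rate can be reproduced from the two counts given in the text (5 chance hits in 5×10⁵ sampled pairs). The sketch below is a standard Wilson score interval with z = 1.96; the function name is illustrative.

```python
from math import sqrt


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion hits / n."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    lo, hi = wilson_interval(5, 500_000)
    print(f"per-comparison ICCR: 1.0e-05, 95% CI [{lo:.1e}, {hi:.1e}]")
```

With so few hits the interval spans most of an order of magnitude, which is why the per-comparison figure is read as an order-of-magnitude value only.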
 Each baseline firm on its own (B, C, D). Reported separately, the three baseline firms are alike and uniformly low. A logistic regression of the per-signature HC flag on firm (with Firm D as the reference) over the baseline cell puts Firms B and C within about 3.5× of each other (odds ratios 1.73 and 0.49), and none of them comes close to the high rates we see for Firm A in §IV-C. The 2013–2019 five-way breakdown for each of Firms B/C/D (counts and within-firm percentages) is reported in Table II-b; the full-period (2013–2023) breakdown is in Table IV for reference.
 **Table II-b — Five-way breakdown for each baseline firm, calibration period (B/C/D, 2013–2019).**
 | Firm | HC | MC | HSC | UN | LH | signatures |
 |---|---|---|---|---|---|---|
 | Firm B | 29.04% | 39.31% | 0.39% | 30.91% | 0.35% | 19,677 |
 | Firm C | 21.59% | 42.09% | 0.37% | 35.53% | 0.43% | 22,449 |
 | Firm D | 22.01% | 29.67% | 0.20% | 47.35% | 0.76% | 9,945 |
 ### B. From Categories to Actions: Review as Exception Management
 The proportions first, stated plainly. Before saying what to do with each category, we show how much of the data falls into each (Table IV, full corpus): HC 49.6%, MC 26.5%, UN 23.4%, HSC 0.2%, LH 0.3%. The ambiguous middle (MC + UN) is therefore not a fringe: about half of all four-firm signatures, and 65–75% at Firms B/C/D individually, against 18.1% at Firm A. Read against the institutional background (§III-A), this is exactly the expected shape. Firm A is a reuse-dominated population, where the screen settles most signatures outright; Firms B/C/D in this period are mixed populations in which hand-signing and informal individual stamping coexist, so per-signature similarity is genuinely ambiguous there. The right response to a large middle is not to hide it but to give it a disposition path that does not require a per-signature verdict. Four moves do this.
 Move 1 — calibrate each band's evidential weight, and demote what fails. The calibration tells us what each flag is worth. The HC band fires by chance on only about 1.2% of reports in the clean reference group, so an HC flag is close to self-certifying: it needs essentially no verification effort, and it goes straight onto the action list — findings to count, report, or investigate — rather than onto a list of flags still to be checked. The MC band fires by chance on about 17.5% of reports in the clean reference group — roughly one clean-group report in six — and, unlike HC, this rate does not drop when Firm A's accountants are excluded from the cross-accountant comparison pool (it edges up, because removing Firm A's distinctive template leaves a pool whose members resemble one another a little more at the coarse dHash ≤ 15 scale); the boundary at dHash = 15 also sits in a flat region of the sensitivity sweep, adding flagged cases without adding specificity (§V-C). An MC flag on its own therefore carries almost no information and does not justify verification effort; it matters only in combination with other evidence. The UN band is ambiguous in the same spirit and is treated alongside MC; on the clean baseline the UN cosine band is reached by chance about 88% of the time per signature (98.2% per report), confirming that a UN flag is essentially uninformative about reuse on its own, whereas the HSC band is reached by chance only about 0.13% of the time per signature (0.25% per report) and in any case points away from reuse (style match without structural support). The HSC band is tiny (0.2%), so it warrants only a light spot-check. The LH band needs no action. 
Demotion, however, only says what an MC or UN flag is not — standalone evidence; what becomes of these signatures is the business of the next three moves: their information flows into the accountant-level scores (Move 2), which byte-identity hits then sharpen by proving that an accountant's stored image is in circulation (Move 3); the residual's data needs are named rather than guessed at (Move 4); and where a human does look at individual cases, the bounded protocol specified below applies.
 Move 2 — lift the unit of decision from the signature to the accountant. The middle categories rarely need to be resolved one signature at a time, because the operational question is almost always about an accountant or a firm, and the ambiguous signatures still carry information at that level. Three accountant-level scores — a mixture-model position score on the two-measure plane, a percentile relative to an external non-Big-4 reference population, and the accountant's own rate of replication-consistent labels — rank the 437 accountants in close agreement (Spearman ρ ≥ 0.879; reported as internal consistency among scores built on the same descriptors, not as external validation). A signature that is individually undecidable still moves its accountant's position; several hundred per-signature questions collapse into one per-accountant judgment.
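The agreement statistic used here is a rank correlation, so it depends only on how two scores order the accountants. The sketch below is a minimal implementation for untied data with hypothetical scores; it is not the study's code, and real scores with ties would need average ranks (as `scipy.stats.spearmanr` provides).

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman_rho(x, y):
    """Spearman rho for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical accountant-level scores from two of the three methods.
    mixture_position = [0.10, 0.40, 0.35, 0.80, 0.90]
    reference_percentile = [10, 30, 45, 70, 95]
    print(spearman_rho(mixture_position, reference_percentile))
```

In the toy data the two scores swap one adjacent pair of accountants, which is the kind of disagreement a high but imperfect ρ reflects.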
 Move 3 — anchor with byte-identity, the one check that yields certainty. An exact byte-level comparison costs little, and what it finds is proof rather than evidence: independent hand-signing cannot produce byte-identical images, so every byte-identical pair is confirmed reuse with no human judgment required (the corpus contains 262 such signatures; §IV-C). To be precise about where this bites: a byte-identical pair has cosine 1 and dHash 0, so these signatures sit in HC by construction — byte-identity rescues no case from the ambiguous middle. Its role is twofold. Within HC, it upgrades a subset from high-confidence candidate to logical certainty, removing even the pipeline-and-house-style caveat of §III-D for those cases — the difference between a statistical screen and an exhibit one can act on without qualification. And at the accountant level, a byte-identical hit proves that a stored image of that accountant is in circulation, which raises the prior on the rest of that accountant's near-identical cluster — including its MC and UN members — and thereby sharpens the per-accountant judgment of Move 2. That the rule captures 100% of the byte-identical set is also the system's one threshold-free sanity check.
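A byte-identity check of this kind can be sketched by grouping signature images on a cryptographic digest of their raw bytes. The function name and the toy inputs below are hypothetical, and whether the study hashes cropped signature images or compares them byte by byte is not stated, so this is an assumption about one workable implementation.

```python
import hashlib
from collections import defaultdict


def byte_identical_groups(images):
    """Group signature ids whose image bytes are exactly equal.

    images: dict mapping signature id -> raw image bytes.
    Returns a list of sorted id lists, one per group of size >= 2.
    """
    by_digest = defaultdict(list)
    for sig_id, data in images.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(sig_id)
    return sorted(sorted(ids) for ids in by_digest.values() if len(ids) > 1)


if __name__ == "__main__":
    toy = {"s1": b"\x89PNG-a", "s2": b"\x89PNG-a", "s3": b"\x89PNG-b"}
    print(byte_identical_groups(toy))
```

No threshold appears anywhere in the check, which is what makes its result independent of the operating point.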
 Move 4 — state what the residual needs, instead of classifying it anyway. After the three moves, a residual middle remains whose mechanism the two measures genuinely cannot identify: reuse through a noisy pipeline, a very steady hand, and a homogeneous scanning infrastructure can occupy the same spot on the plane. We name the data that would resolve it — a proposed resolution path, not one executed in this study. Image-acquisition metadata is machine-readable provenance that could be extracted automatically rather than judged by eye: scanner identifiers and PDF-generator strings recorded in the files themselves, and compression markers such as JPEG quantization tables, which encode the processing history an image has been through. This adds the axis the two similarity measures lack — two near-identical images that arrived through different production pipelines are hard to explain except by reuse, while two that shared one pipeline may owe their similarity to the pipeline itself. (Whether this provenance survives the upload platform is itself an empirical question, and we checked: we verified across a stratified sample of MOPS reports (all four firms, 2014–2022) that producer/creator strings, PDF versions, and image encodings are heterogeneous report-to-report — distinct scanner models (Fuji Xerox D125, ApeosPort-III/IV/V), born-digital producers (Microsoft Word, Adobe, Acrobat Distiller), and a mix of CCITT-grayscale and JPEG-RGB encodings at differing resolutions — so the platform does not flatten uploads to a uniform template and the acquisition history is recoverable here; firms' own internal archives would retain at least as much.) A small labelled set of known hand-signed examples — certified by the firms, or accumulated case by case as a by-product of the review protocol below — would turn the chance-rate calibration into directly estimated error rates. Naming these is the honest alternative to pretending the residual can be classified from similarity alone.
 Where a human does look, the review follows a defined and bounded protocol. We specify the protocol here as a design deliverable of the method: the discriminating behaviors stated below are design expectations, following from the artifact properties of reused versus independently signed images, and the protocol's first execution, on a bounded sample, is listed as future work (§VI). (1) Side-by-side overlay inspection: the reviewer is shown the flagged signature next to the same-accountant signature(s) that produced its score, with a pixel-difference overlay and an edge-aligned superposition; a reused image is expected to overlay almost exactly, whereas two independent signings show natural variation in pressure, ink, and baseline. (2) Secondary artifact checks not used by the rule — exact registration, JPEG and scan-noise fingerprints (the compression and anti-aliasing traces a reused raster carries with it), and scaling traces — are designed to separate a reused raster from a re-scanned genuine signature at low cost. (3) Document and time context: the reviewer checks whether the matched signatures come from reports of different dates or engagements (reuse across time is more telling than within a single filing) and whether the surrounding layout shows a standard template or stamp. (4) Bounded per-accountant sampling: because the operational question is usually at the accountant or firm level, the reviewer judges a bounded random sample per accountant rather than every flagged signature, keeping the effort proportional to the number of accountants, not the number of signatures. (5) Feedback into calibration: each adjudicated case yields a label — reuse, hand-signed, or undetermined — and these accumulate into the small ground-truth set the setting otherwise lacks, which can later tighten the operating point or support supervised validation. 
The protocol's relation to Move 4 is one of scale: steps 1–3 apply per-case versions of the same artifact evidence that Move 4 would collect corpus-wide, step 4 bounds how many cases a human ever sees, and step 5 accumulates the labelled set Move 4 asks for. What the protocol cannot do — and is not claimed to do — is resolve the residual at scale; that is exactly what the corpus-wide metadata collection of Move 4 would add.
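Step 4 of the protocol, bounded per-accountant sampling, can be sketched as follows. The function name, the default bound of three cases per accountant, and the fixed seed are assumptions for illustration; the protocol itself does not fix these values.

```python
import random
from collections import defaultdict


def bounded_review_sample(flagged, per_accountant=3, seed=0):
    """Draw at most per_accountant flagged signatures for each accountant.

    flagged: iterable of (accountant_id, signature_id) pairs.
    A fixed seed keeps the review queue reproducible.
    """
    rng = random.Random(seed)
    by_accountant = defaultdict(list)
    for accountant, sig in flagged:
        by_accountant[accountant].append(sig)
    return {
        accountant: sorted(rng.sample(sigs, min(per_accountant, len(sigs))))
        for accountant, sigs in sorted(by_accountant.items())
    }


if __name__ == "__main__":
    toy = [("A1", f"s{i}") for i in range(10)] + [("A2", "t0"), ("A2", "t1")]
    queue = bounded_review_sample(toy)
    print(queue, sum(len(v) for v in queue.values()))
```

The queue length is bounded by the number of accountants times the per-accountant bound, regardless of how many signatures were flagged.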
 Why this is exception management rather than caseload. In a reuse-dominated population the high-confidence tier settles most signatures directly (82% at Firm A), and the four moves reduce the remaining 18% to per-accountant judgments and a review queue bounded by the number of accountants rather than the number of signatures — exception management at the signature level. In a mixed population the same machinery delivers the same promise one level up, at the accountant. Move 1 removes the bulk of the apparent caseload outright: an MC flag alone does not justify verification, which at Firms B/C/D takes 29–41% of signatures off the worklist before anyone is assigned to anything (the UN band carries no flag in the first place, so the demotion bites on MC alone). Move 2 positions every accountant on the replication-dominance spectrum, so attention concentrates on the few high-ranked or mixed cases rather than on tens of thousands of signatures. Move 3 supplies proof where proof exists: 117 of the 262 byte-identical signatures sit at Firms B/C/D, demonstrating that stored-image reuse is a real practice at the mixed firms too, and anchoring the accountant-level judgments there. And the staggered post-2020 adoption of formal systems gives the mixed firms a readable time axis (§V-B). What is not delivered in a mixed population is a per-signature verdict for the ambiguous middle — a limit of identification, not of workload. Exception management therefore holds in both settings; what changes is only the level at which exceptions are defined — the signature where reuse dominates, the accountant where practices are mixed. Because the cutoffs are tunable, a reviewer who wants higher specificity can tighten them (for example, cosine > 0.98 and dHash ≤ 3), trading a lighter caseload against the risk of missing noisier reused signatures — a trade-off we cannot tune against recall, since recall is unobservable here.
 ### C. Held-Out Test: Firm A
 Firm A — described by the interviews as a mainly-stamping firm, and kept out of the calibration — is our main test.
 (1) Firm A's two measures against the baseline. Comparing Firm A's within-accountant similarities to those of Firms B/C/D (full record, 2013–2023²), Firm A's cosine values are shifted toward 1.0 and its dHash distances toward 0, the direction we would expect if a stored image is reused rather than re-signed. Concretely, Firm A's within-accountant cosine is centred at a median of 0.986 (mean 0.980) versus 0.959 (mean 0.954) for Firms B/C/D, and its smallest-dHash distance at a median of 2 (mean 2.7) versus 7 (mean 7.0); both shifts are in the reuse direction and overwhelmingly significant (Mann–Whitney U, p < 10⁻³⁰⁰ for each; two-sample Kolmogorov–Smirnov D = 0.60 for cosine and 0.57 for dHash). The decisive number is this: scored as a true held-out case, with Firm A's signatures matched against unrelated accountants drawn from the clean 2013–2019 group, Firm A's per-signature HC chance rate is 0.42% (154/36,552; Wilson 95% CI [0.36%, 0.49%]), at or below the clean reference ICCR of 0.59%. In other words, Firm A's cross-firm match rate sits at the level a clean inter-CPA comparison produces by chance: it is not elevated relative to the reference, and it is negligible beside the within-firm rate below. The entire rise in Firm A's rate therefore comes from matches with other Firm-A signatures, not from resemblance to other firms. The signal is inside the firm, not across firms. (Against the full-period BCD pool the same across-firm rate is 1.0%; the small difference reflects the post-2020 rise in baseline similarity of §V-B. Both lie near the clean floor, two orders of magnitude below the within-firm rate that follows.)
 > ² Restricting both groups to 2013–2019 gives essentially the same picture (Firm A cosine median 0.986, dHash 2; Firms B/C/D 0.957 and 7; Mann–Whitney p < 10⁻³⁰⁰ for each), confirming the contrast is not a post-2020 artefact.
 Firm A's within-firm repeatability, against the other firms. Under the deployed rule, applied to each firm's own signatures, the HC rule fires on 82% of Firm A's signatures (about 139 times the clean floor) versus 24–35% for Firms B/C/D (about 40–59 times). A logistic regression of the per-signature HC flag on firm and pool size, with Firm A as the reference, gives odds ratios of 0.053, 0.010, and 0.027 for Firms B/C/D, one to two orders of magnitude lower (the odds ratio for log pool size is 4.01). Firm A stands alone, against a baseline of three firms that look alike.
 ![Figure 4](figures/fig4.png)
 *Figure 4 — Within-accountant similarities, Firm A vs Firms B/C/D: (a) cosine, (b) dHash. Firm A's mass sits near cosine = 1 and dHash = 0 (medians 0.986 / 2) against Firms B/C/D's 0.959 / 7; dashed lines mark the cuts (cosine 0.95; dHash 5), the dotted line the LH/UN crossover (0.8547). The held-out cross-firm HC rate (0.42%) sits at/below the clean reference ICCR (0.59%), while the within-Firm-A HC rate is 82% — the signal is inside the firm (annotation below panels).*
 (2) Ranking accountants by similarity, in each period. Ranking every accountant in Firms A–D by a single within-accountant similarity score, separately for 2013–2019 and for 2020–2023, Firm A's accountants sit at the high-similarity (templated) end. A descriptive three-group summary of the two-measure space tells the same story: its high-cosine/low-dHash group holds 82.5% of Firm A's accountants and almost none of the others' (Table III). The period split confirms the expected pattern: Firm A's per-signature HC rate is at the top in both periods (80.3% in 2013–2019, 83.8% in 2020–2023), while Firms B/C/D move upward after 2020 as the formal systems came in — Firm B from 29.0% to 42.0%, Firm C from 21.6% to 26.7%, Firm D from 22.0% to 28.0% (see §V-B).
 **Table III — Firm by descriptive-group membership (whole corpus). The "high-cosine/low-dHash group" is the templated-end cluster of the three-group (K = 3) descriptive Gaussian-mixture partition of the accountant-level two-measure plane (§V-C); membership is the cluster of maximum posterior probability for each accountant. The groups are used for description only, never as operational labels.**
 | Firm | Accountants | Share in the high-cosine/low-dHash group |
 |---|---|---|
 | Firm A | 171 | 82.5% |
 | Firm B | 112 | 0.0% |
 | Firm C | 102 | 1.0% |
 | Firm D | 52 | 1.9% |
 ![Figure 5](figures/fig5.png)
 *Figure 5 — Per-accountant HC rate, ranked, one panel per period (2013–2019; 2020–2023), points coloured by firm (accountants with ≥ 5 signatures in the period). Firm A (red) occupies the templated top of the ranking in both periods; Firms B/C/D rise after 2020 (HC rate B 29.0→42.0%, C 21.6→26.7%, D 22.0→28.0%; Firm A 80.3→83.8%), consistent with staggered formal-system adoption (§V-B).*
 (3) Applying the calibrated rule to Firm A, 2013–2023. Taking the operating point calibrated on Firms B/C/D in 2013–2019 and applying it across Firm A's full record, 81.70% of Firm A's signatures (82% rounded) land in HC (per signature; the full five-way breakdown is in Table IV). Read together with the interview fact that Firm A mainly uses overlay stamping, the system's firm-level output matches the practice the firm itself describes. We say this carefully: it is a match at the firm level, not a label on any single signature. We do not classify the individual signatures as non-hand-signed, because for any one signature the two measures cannot separate reuse from a shared scanning pipeline or a uniform house style (§III-D).
 **Table IV — Five-way breakdown by firm (whole corpus, 2013–2023; for reference; n = 150,442).**
 | Firm | HC | MC | HSC | UN | LH | signatures |
 |---|---|---|---|---|---|---|
 | Firm A | 81.70% | 10.76% | 0.05% | 7.35% | 0.14% | 60,448 |
 | Firm B | 34.56% | 35.88% | 0.29% | 28.95% | 0.32% | 34,248 |
 | Firm C | 23.75% | 41.44% | 0.38% | 33.97% | 0.47% | 38,613 |
 | Firm D | 24.51% | 29.33% | 0.22% | 45.28% | 0.66% | 17,133 |
 | Overall | 49.58% | 26.47% | 0.21% | 23.42% | 0.32% | 150,442 |
 Reading the five-way mix across firms. Table IV is also the quantitative basis for the positioning in §IV-B. At Firm A the ambiguous middle (MC + UN) is 18.1%: the screen reads a reuse-dominated population almost cleanly, with four signatures in five settled outright. At Firms B/C/D the middle is 65–75%, the signature of a mixed population in which hand-signing and informal stamping coexist (§III-A), where per-signature similarity is genuinely ambiguous. There the screen's deliverables move up one level (§IV-B): the MC share (29–41% of these firms' signatures, against the 26.5% corpus-wide MC share) is demoted off the worklist, the accountant-level scores rank these firms' accountants alongside everyone else's, and the byte-identical signatures at these firms (117 of the 262) are threshold-free proof that reuse occurs there too. The per-signature mix stays ambiguous; the disposition does not.
 (+) Byte-identical signatures: direct evidence of reuse. Beyond the screening numbers, 262 signatures across the four firms are byte-for-byte identical to another signature — 145 of them at Firm A, spread across about fifty partners. Identical files cannot come from independent hand-signing, so their existence is direct, hard evidence that image reuse happens and that it concentrates at Firm A. The rule catches 100% of them, which confirms it misses no clear-cut case of reuse; we note this only as a sanity check and a lower bound on recall for the clearest cases, since for our rule a byte-identical pair (cosine = 1, dHash = 0) must land in HC by definition.
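Finding byte-identical signatures needs no threshold at all. A minimal sketch, assuming each signature crop is available as raw bytes keyed by a hypothetical signature id, and treating equal SHA-256 digests as equal bytes:

```python
import hashlib
from collections import defaultdict

def byte_identical_groups(images):
    """Group signature ids whose stored bytes are identical.

    images : mapping of signature id -> raw image bytes (hypothetical layout)
    Returns a list of id groups, each with at least two members.
    """
    by_digest = defaultdict(list)
    for sig_id, blob in images.items():
        by_digest[hashlib.sha256(blob).hexdigest()].append(sig_id)
    # Keep only digests shared by two or more signatures.
    return sorted(sorted(ids) for ids in by_digest.values() if len(ids) > 1)
```

Every member of a returned group is byte-identical to another signature, so counting group members (not groups) gives the kind of total reported above.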
 ## V. Other Analyses
 This section gathers analyses that support the design and test its robustness: (a) the diagnostic showing the data contain no natural cutoff — the premise the whole calibration rests on; (b) how the baseline behaves after 2020; and (c) sensitivity checks.
 ### A. Why the Data Contain No Natural Cutoff
 This diagnostic backs the design choice announced in §III-D and §III-E: that no cutoff can be read off the data, so the operating point has to be set from an outside reference. The Hartigan dip test [37] rejects a single-peak shape for both measures at the Big-4-pooled accountant level (p < 5×10⁻⁴), which might look like a clean split into two groups. But that rejection comes from two side-effects. Once we remove the differences between firms (by centring each firm on its own mean) and the effect of the hash taking only whole-number values (by adding a small jitter to dHash), the single-peak shape comes back (median p = 0.35 over jitter seeds). Tested firm by firm, each Big-4 firm is already unimodal on both axes (Firm A p_cos = 0.99, p_dHash = 0.92; B/C/D pooled p_cos = 0.998, p_dHash = 0.91), so the pooled rejection is a between-firm location-shift artefact, not within-population bimodality. A density-smoothness test in the Burgstahler–Dichev / McCrary style [38], [39] finds no real break in either measure at the Big-4 scope (Appendix A.1 shows the apparent signature-level breaks drift with histogram bin width and sit inside the high-similarity region — a resolution artefact, not an antimode). So the data hold no real gap; per-signature similarity is best read as one continuous spread of quality, not two separate classes. This is exactly why the operating point is set from an outside reference (§III-E) rather than read off the data, and why the three groups used for description in §IV-C are treated as a summary of composition, not as real mechanisms.
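The two adjustments described above, firm-mean centring and integer-tie jitter, are simple to state in code. The sketch below covers only that preprocessing. The jitter half-width of 0.5 is an assumption (the text says only "a small jitter"), the function names are hypothetical, and the dip test itself is left to whichever implementation the reader prefers.

```python
import random
from collections import defaultdict
from statistics import fmean

def centre_by_firm(values, firms):
    """Subtract each firm's own mean, removing between-firm location shifts."""
    groups = defaultdict(list)
    for v, f in zip(values, firms):
        groups[f].append(v)
    means = {f: fmean(vs) for f, vs in groups.items()}
    return [v - means[f] for v, f in zip(values, firms)]

def jitter_integers(values, seed, half_width=0.5):
    """Break integer ties by adding uniform noise of at most half_width."""
    rng = random.Random(seed)  # one seed per replicate keeps the run reproducible
    return [v + rng.uniform(-half_width, half_width) for v in values]
```

A median p over jitter seeds would then come from applying a unimodality test to `centre_by_firm(jitter_integers(dhash, seed), firms)` for each seed and taking the median of the resulting p-values.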
 ### B. The Baseline After 2020 (Secondary)
 Looking only at Firms B/C/D, the strict rule's chance rate rises after 2020 (per comparison from 1.0×10⁻⁵ to 3.6×10⁻⁵; per signature from 0.59% to 1.05%). The rise is gradual rather than a sudden step, which fits the staggered, firm-by-firm adoption of formal systems described in §III-A; the same gradual rise shows in the deployed HC rate by firm (Firm B 29.0→42.0%, Firm C 21.6→26.7%, Firm D 22.0→28.0% across the two periods, §IV-C). We read this as the system registering a shift over time in the similarity numbers, and we use it to justify limiting the calibration to the pre-2020 years. We do not claim to have detected the adoption of electronic signatures as such, because a post-2020 rise in image similarity could just as easily come from changes in scanning and document-production pipelines at the same time (§III-D). This analysis is secondary and could move to an appendix.
 ### C. Sensitivity and Robustness
 We summarize the robustness checks here; full detail is in the supplementary materials.
 How sensitive the operating point is. Right around the HC cutoff the per-signature firing rate changes quickly — its local slope is about 25× the median across a cosine sweep and about 3.8× across a dHash sweep — which confirms that the HC point is a chosen, specificity-anchored operating point rather than a natural gap. The MC/HSC boundary at dHash = 15 sits in a flat (saturating) region, where moving the line adds flagged cases without adding specificity; this is a further reason to treat the MC band as advisory (§IV-B).
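One way to read "local slope relative to the median" is as a finite-difference comparison along a threshold sweep. The sketch below is a one-dimensional illustration under that reading, not the sweep script itself: it varies only a cosine threshold, and all names are hypothetical.

```python
from statistics import median

def firing_rate(scores, threshold):
    """Share of per-signature scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

def relative_local_slope(scores, grid, operating_point):
    """Absolute slope of the firing rate in the grid step that contains the
    operating point, divided by the median absolute slope over the sweep."""
    rates = [firing_rate(scores, t) for t in grid]
    slopes = [abs((rates[i + 1] - rates[i]) / (grid[i + 1] - grid[i]))
              for i in range(len(grid) - 1)]
    idx = next(i for i in range(len(grid) - 1)
               if grid[i] <= operating_point <= grid[i + 1])
    mid = median(slopes)
    return float("inf") if mid == 0 else slopes[idx] / mid
```

A ratio well above 1 means the firing rate is changing faster at the operating point than it typically does along the sweep, which is the descriptive pattern the paragraph reports.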
 Leaving out one firm at a time. A two-group fit is unstable across firms — its boundary is basically a "Firm A versus the rest" divider — while a three-group fit keeps a stable shape (its low-cosine/high-dHash group drifts by at most 0.005 in cosine) but a membership that shifts with the mix of firms (by up to 12.8 percentage points). So we use the groups only as descriptions, never as operational labels.
 Crossover scope. The low cosine cut is the same-vs-different-accountant cosine crossover; recomputing it across scopes moves it by at most 0.025 — 0.8547 on the calibration cell (the primary value; §IV-A), 0.8367 corpus-wide, 0.8489 on the all-period baseline firms, 0.8302 with the non-Big-4 firms added — and because the cut affects only the UN/LH boundary, switching among these scopes changes no HC/MC/HSC result and shifts the UN/LH split by at most 0.4 percentage points per firm. We use the calibration-cell value as primary for held-out discipline and report the others as robustness.
 The same-pair variant. Recomputing the rule so that a single partner signature must satisfy both inequalities at once (the same-pair rule of §III-D) leaves every conclusion unchanged. The within-firm concentration of cross-accountant matches is in fact higher under same-pair (97.0–99.96% across the four firms) than under the deployed any-pair rule (76.7–98.8%), so the headline structure does not depend on the any-pair construction — pushed to the stricter event, it gets stronger.
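The difference between the two constructions is easiest to see side by side. In this sketch each signature carries a list of (cosine, dHash) comparisons against partner signatures; the HC thresholds are the ones used in this paper, and the function names are hypothetical.

```python
def any_pair_hc(pairs, cos_cut=0.95, dhash_cut=5):
    """Any-pair: the best cosine and the best dHash may come from different partners."""
    if not pairs:
        return False
    max_cos = max(c for c, _ in pairs)
    min_dhash = min(d for _, d in pairs)
    return max_cos > cos_cut and min_dhash <= dhash_cut

def same_pair_hc(pairs, cos_cut=0.95, dhash_cut=5):
    """Same-pair: a single partner must satisfy both inequalities at once."""
    return any(c > cos_cut and d <= dhash_cut for c, d in pairs)

# One partner is close in style only, another close in structure only.
split = [(0.97, 12), (0.80, 3)]
print(any_pair_hc(split), same_pair_hc(split))
```

Every same-pair hit is also an any-pair hit, so the same-pair event is the stricter of the two.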
 Each gate adds specificity. On the all-four-firm pool the cosine gate alone fires per comparison at 6.0×10⁻⁴; adding the structural gate multiplies this by 0.234 (the conditional ICCR of dHash ≤ 5 given cos > 0.95), giving the joint 1.4×10⁻⁴. Each axis contributes specificity beyond the other — quantitative support for the two-gate design over either measure alone (§I, §III-D).
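The joint rate above is the chain rule applied to the two gates. A minimal check using the three figures quoted in the paragraph (variable names are hypothetical):

```python
def joint_rate(cos_gate_rate, dhash_given_cos):
    """P(cos gate and dHash gate) = P(cos gate) * P(dHash gate | cos gate)."""
    return cos_gate_rate * dhash_given_cos

cos_only = 6.0e-4         # per-comparison rate of cos > 0.95 alone
dhash_given_cos = 0.234   # conditional rate of dHash <= 5 given cos > 0.95
joint = joint_rate(cos_only, dhash_given_cos)

print(f"{joint:.2e}")  # 1.40e-04
```

Read the other way, the structural gate removes roughly three in four of the chance firings that pass the cosine gate.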
 Which network we use. We compare ResNet-50 against VGG-16 and EfficientNet-B0 under the same preprocessing and L2 normalization (Appendix A; supplementary backbone-ablation table). EfficientNet-B0 gives the largest intra/inter separation (Cohen's d = 0.707) but also the widest descriptor spread (intra std 0.123 vs ResNet-50's 0.098); VGG-16 is worst on every key metric despite its larger 4096-dim features. ResNet-50 is the best overall balance: its Cohen's d (0.669) is competitive, its tighter distributions give more stable per-signature behaviour, it yields the highest Firm A all-pairs 1st-percentile similarity (0.543), and its 2048-dim features are a practical compromise for processing 182K+ signatures. The comparison supports ResNet-50.
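For reference, the separation measure quoted above can be computed as follows. This sketch assumes the pooled-standard-deviation form of Cohen's d [29] applied to intra-accountant and inter-accountant similarity samples. Whether the ablation table uses exactly this variant is an assumption, and the function name is hypothetical.

```python
from math import sqrt
from statistics import fmean, variance

def cohens_d(intra, inter):
    """Cohen's d between two samples, using a pooled standard deviation."""
    n1, n2 = len(intra), len(inter)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((n1 - 1) * variance(intra) + (n2 - 1) * variance(inter)) / (n1 + n2 - 2)
    return (fmean(intra) - fmean(inter)) / sqrt(pooled_var)
```

A larger d means the two similarity distributions sit further apart relative to their spread, which is why the paragraph weighs it against the intra-class standard deviation instead of reading it alone.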
 ## VI. Conclusion
 We have presented a label-free, anchor-calibrated way to screen for non-hand-signed signatures in large numbers of audit reports. It has three working parts — a pipeline that takes raw PDFs through page-finding, detection, feature extraction, and a two-measure similarity step; a pair of measures that separate style consistency from image reproduction; and, in place of a natural cutoff we do not have and labelled data we cannot get, a calibration based on how often the rule fires by chance in a clean reference group. That calibration yields both a measure of specificity and a concrete operating point: the high-confidence rule almost never fires by chance on the clean group, so it is a usable, highly specific screen, with a defined, bounded human-review protocol (§IV-B) for the advisory and uncertain cases. Operationally the screen earns its keep in two ways: run over an archive, it discovers where reuse concentrates; and it keeps human review at the scale of exceptions in both kinds of population — settling most signatures directly where reuse dominates, and, where practices are mixed, demoting the low-specificity band, ranking accountants, and confirming the byte-identical cases, withholding only the per-signature verdict for the ambiguous middle. We report the category proportions that make that distinction concrete. Because it is calibrated on a large Chinese-signature corpus and uses script-agnostic image descriptors, the rule transfers as a practical reference point for other Chinese-signature settings and, in principle, to other scripts. Held out as a test, one firm stands alone in how alike its own signatures are, its output matches the stamping practice the firm itself describes, and byte-identical signatures give direct evidence that reuse happens and concentrates there.
 The limits are built into working without labels, and we have stated them alongside the design. There is no signature-level ground truth, so we report no false-rejection rate, recall, ROC-AUC, or precision; every rate we give is a chance rate read as a measure of specificity, not a true false-acceptance rate. The contrast between firms is something the method can see, not a finding about why the signatures look alike: for any single signature, the two measures cannot separate reuse from a shared scanning pipeline or a uniform house style, and for Firms B/C/D we make no claim about firm practice at all. Whether firm-level signing patterns matter for audit quality is a question for a dedicated companion study — one this screening points toward, together with the low-presence character of proxy-executed stamping shown in the behavioural literature, but one that similarity alone cannot settle.
 Four directions follow. First, a set-level reading of each accountant: judging the shape of an accountant's whole signature set — a tight cluster that recurs near-identically across reports and years (the signature of a stored image) versus a dispersed cloud (the signature of a hand) — instead of per-signature extrema. This would collapse much of the remaining middle into a few per-accountant cluster decisions, and it is the natural tool for separating the mixed signers of the baseline firms, whose sets may contain both a tight recurring sub-cluster and a dispersed remainder if both practices are present. We view this as the highest-value methodological extension, while noting honestly that it narrows but does not remove the fundamental ambiguity: a very steady hand and a noisy reused image can still meet in the middle of any set-level statistic. A first-pass probe on the calibration cell is consistent with this caution — across the 206 Firms-B/C/D 2013–2019 accountants with sufficient signatures, the within-accountant similarity forms a continuum that piles up just below the high-similarity cut rather than splitting into a tight reused cluster and a dispersed hand-signed cloud (no accountant shows a tight-versus-remainder cosine gap above 0.10), so the no-natural-cutoff structure of §V-A recurs at the accountant level; we therefore treat set-level adjudication as a research direction rather than a ready robustness result. Second, executing the review protocol of §IV-B on a bounded sample — its first run — would both test the protocol's expected discriminating behavior and accumulate the small human-labelled set that permits supervised validation and direct error rates. 
Third, image-acquisition metadata (scanner identifiers, PDF-generator fingerprints, compression markers) adds a provenance axis that could help resolve the pipeline-versus-reuse ambiguity similarity alone cannot; we confirmed this metadata survives in the present corpus rather than being flattened by the platform, though its discriminative power remains to be validated (§IV-B, Move 4). Fourth, the audit-quality question itself: whether firm-level signing patterns correlate with audit outcomes, for which this screening supplies the measurement layer.
 ## Appendix A. Supplementary Diagnostic Detail
 ### A.1. BD/McCrary Bin-Width Sensitivity (Signature Level)
 The main text (§III-D, §V-A) treats the Burgstahler–Dichev / McCrary discontinuity procedure [38], [39] as a density-smoothness diagnostic rather than as a threshold estimator. This subsection documents the empirical basis for that framing by sweeping the bin width within four variants (Firm A and full-sample, each in the cosine and dHash direction), for fourteen (variant, bin-width) panels in total.
 **Table A.I. BD/McCrary bin-width sensitivity (two-sided α = 0.05, |Z| > 1.96).**
 | Variant | n | Bin width | Best transition | z_below | z_above |
 |---|---|---|---|---|---|
 | Firm A cosine (sig-level) | 60,448 | 0.003 | 0.9870 | −2.81 | +9.42 |
 | Firm A cosine (sig-level) | 60,448 | 0.005 | 0.9850 | −9.57 | +19.07 |
 | Firm A cosine (sig-level) | 60,448 | 0.010 | 0.9800 | −54.64 | +69.96 |
 | Firm A cosine (sig-level) | 60,448 | 0.015 | 0.9750 | −85.86 | +106.17 |
 | Firm A dHash (sig-level) | 60,448 | 1 | 2.0 | −4.69 | +10.01 |
 | Firm A dHash (sig-level) | 60,448 | 2 | no transition | — | — |
 | Firm A dHash (sig-level) | 60,448 | 3 | no transition | — | — |
 | Full-sample cosine (sig-level) | 168,740 | 0.003 | 0.9870 | −3.21 | +8.17 |
 | Full-sample cosine (sig-level) | 168,740 | 0.005 | 0.9850 | −8.80 | +14.32 |
 | Full-sample cosine (sig-level) | 168,740 | 0.010 | 0.9800 | −29.69 | +44.91 |
 | Full-sample cosine (sig-level) | 168,740 | 0.015 | 0.9450 | −11.35 | +14.85 |
 | Full-sample dHash (sig-level) | 168,740 | 1 | 2.0 | −6.22 | +4.89 |
 | Full-sample dHash (sig-level) | 168,740 | 2 | 10.0 | −7.35 | +3.83 |
 | Full-sample dHash (sig-level) | 168,740 | 3 | 9.0 | −11.05 | +45.39 |
 Two patterns are visible. First, where the procedure identifies a "transition" (every cosine panel and every full-sample dHash panel; Firm A dHash only at bin width 1), its location moves with bin width: Firm A cosine drifts 0.987 → 0.985 → 0.980 → 0.975 as bin width grows from 0.003 to 0.015, full-sample cosine goes 0.987 → 0.985 → 0.980 → 0.945 over the same sweep, and full-sample dHash goes 2 → 10 → 9 as bin width grows from 1 to 3. In the Firm A cosine panels the Z statistics also inflate superlinearly with bin width (from −2.81/+9.42 at 0.003 to −85.86/+106.17 at 0.015), because wider bins aggregate more mass and shrink the per-bin standard error on a very large sample. Both features are characteristic of a histogram-resolution artifact rather than a genuine density discontinuity. Second, the candidate transitions locate inside the high-similarity region (cosine ≥ 0.975, dHash ≤ 10) rather than at a between-mode boundary, with one exception: the full-sample cosine panel at bin width 0.015, whose transition falls at 0.945. Taken together, the signature-level BD/McCrary transitions are not a threshold in the usual sense: they are histogram-resolution-dependent local density anomalies, which supports using BD/McCrary as a density-smoothness diagnostic, not a threshold estimator (§V-A).
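For readers unfamiliar with the statistic behind Table A.I, a minimal sketch of the Burgstahler–Dichev standardised difference follows. It assumes the common form in which each interior bin is compared with the mean of its two neighbours; the function name is hypothetical, and this is not the script that produced the table.

```python
from math import sqrt

def bd_standardised_diff(counts):
    """Z-like statistic per bin: (observed - mean of neighbours) / approx. std.

    counts : histogram counts for consecutive, equal-width bins
    Returns a list aligned with counts; the two end bins are None.
    """
    total = sum(counts)
    z = [None] * len(counts)
    for i in range(1, len(counts) - 1):
        p_prev = counts[i - 1] / total
        p_i = counts[i] / total
        p_next = counts[i + 1] / total
        expected = (counts[i - 1] + counts[i + 1]) / 2
        # Approximate variance of (observed - expected) under a smooth density.
        var = (total * p_i * (1 - p_i)
               + 0.25 * total * (p_prev + p_next) * (1 - p_prev - p_next))
        z[i] = (counts[i] - expected) / sqrt(var) if var > 0 else 0.0
    return z
```

Because the counts scale with bin width while the statistic standardises by roughly their square root, re-binning the same data into wider bins tends to enlarge |Z|, which is the resolution dependence discussed above.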
 ### A.2. Diagnostic Summary
 The unsupervised-diagnostic strategy is a set of complementary checks, each addressing one specific failure mode of an unsupervised screening classifier under an explicitly disclosed untested assumption.
 **Table A.II. Diagnostics, failure mode addressed, and disclosed untested assumption (abridged).**
 | Diagnostic | Failure mode addressed | Disclosed untested assumption |
 |---|---|---|
 | Composition decomposition (§V-A) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); p_median = 0.35 under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support |
 | Per-comparison ICCR (§IV-A) | Pair-level specificity proxy under a random-pair negative anchor, on the BCD baseline | Inter-CPA pairs are negative; addressed by anchoring on B/C/D and holding Firm A out |
 | Pool-normalised per-signature ICCR (§IV-A) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | As above + pool replacement preserves the negative-anchor property |
 | Document-level ICCR (§IV-A) | Operational alarm-rate proxy at per-document unit (HC and HC+MC) | As above |
 | Firm-heterogeneity logistic regression (§IV-C) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Observations clustered by CPA/firm; cluster-robust SEs are a future check |
 | Cross-firm hit matrix (§IV-C, §V-C) | Concentration of inter-CPA collisions within source firm | Concentration depends on deployed-rule semantics (same-pair 97.0–99.96% vs any-pair 76.7–98.8%) |
 | Alert-rate sensitivity sweep (§V-C) | Local sensitivity of the deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test |
 | Convergent score Spearman ranking (§IV-B) | Internal consistency of three feature-derived per-CPA scores | Scores share inputs; not statistically independent |
 | Pixel-identical positive capture (§IV-C) | Sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold |
 ## Appendix B. Reproducibility Materials
 The full table-to-script provenance mapping, script source code, and report artefacts for every numerical table and figure in this paper are provided in the supplementary materials. Scripts run deterministically under fixed random seeds documented there (the inter-CPA candidate sampler uses seed 42 and a retry-loop matching the canonical samplers; CPA-block bootstraps use 1,000 replicates); reviewer reproduction should re-emit artefacts from the listed scripts rather than rely on any local path layout. The calibration baseline (BCD 2013–2019), the contamination-comparison scope (all-Big-4), the Firm-A out-of-sample scoring, and the five-way classification are all emitted by the same canonical pipeline so that the headline numbers in Tables I, II, II-b, and IV reproduce bit-for-bit.
 ## References
 *References follow IEEE numeric style; entries [41]–[45] are the behavioural-science and Chinese-script works added in this draft.*
 [1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6.
 [2] S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," *Res. Account. Regul.*, vol. 25, no. 2, pp. 230–235, 2013.
 [3] J. Bromley et al., "Signature verification using a Siamese time delay neural network," in *Proc. NeurIPS*, 1993.
 [4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.
 [5] H.-H. Kao and C.-Y. Wen, "An offline signature verification and forgery detection method based on a single known sample and an explainable deep learning approach," *Appl. Sci.*, vol. 10, no. 11, p. 3716, 2020.
 [6] H. Li et al., "TransOSV: Offline signature verification with transformers," *Pattern Recognit.*, vol. 145, p. 109882, 2024.
 [7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," *Mathematics*, vol. 12, no. 17, p. 2757, 2024.
 [8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024.
 [9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021.
 [10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," in *Proc. Electronic Imaging*, 2016.
 [11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," *Multimedia Tools Appl.*, 2024.
 [12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," *Inf. Process. Manage.*, p. 104086, 2025.
 [13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," in *Proc. CVPR*, 2022.
 [14] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," *Pattern Recognit.*, vol. 70, pp. 163–176, 2017.
 [15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," *IEEE Trans. Inf. Forensics Security*, vol. 19, pp. 1342–1356, 2024.
 [16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," *IEEE Trans. Inf. Forensics Security*, vol. 15, pp. 1735–1745, 2020.
 [17] H. Farid, "Image forgery detection," *IEEE Signal Process. Mag.*, vol. 26, no. 2, pp. 16–25, 2009.
 [18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," *Pattern Recognit.*, vol. 144, art. 109778, 2023.
 [19] J. Luo et al., "A survey of perceptual hashing for multimedia," *ACM Trans. Multimedia Comput. Commun. Appl.*, vol. 21, no. 7, 2025.
 [20] D. Engin et al., "Offline signature verification on real-world documents," in *Proc. CVPRW*, 2020.
 [21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," *Expert Syst. Appl.*, vol. 189, art. 116136, 2022.
 [22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification across multilingual datasets," *Procedia Comput. Sci.*, vol. 270, pp. 4024–4033, 2025.
 [23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in *Proc. ECCV*, 2014, pp. 584–599.
 [24] S. Bai et al., "Qwen2.5-VL technical report," arXiv:2502.13923, 2025.
 [25] Ultralytics, "YOLO11 documentation," 2024.
 [26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. CVPR*, 2016.
 [27] N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013.
 [28] B. W. Silverman, *Density Estimation for Statistics and Data Analysis*. London: Chapman & Hall, 1986.
 [29] J. Cohen, *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.
 [30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," *IEEE Trans. Image Process.*, vol. 13, no. 4, pp. 600–612, 2004.
 [31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," *The Accounting Review*, vol. 88, no. 5, pp. 1511–1546, 2013.
 [32] A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," *Int. J. Auditing*, vol. 18, no. 3, pp. 172–192, 2014.
 [33] W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," *Contemp. Account. Res.*, vol. 26, no. 2, pp. 359–391, 2009.
 [34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in *Proc. CVPR*, 2016, pp. 779–788.
 [35] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 46, no. 8, pp. 5625–5644, 2024.
 [36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," *Ann. Math. Statist.*, vol. 18, no. 1, pp. 50–60, 1947.
 [37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," *Ann. Statist.*, vol. 13, no. 1, pp. 70–84, 1985.
 [38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," *J. Account. Econ.*, vol. 24, no. 1, pp. 99–126, 1997.
 [39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," *J. Econometrics*, vol. 142, no. 2, pp. 698–714, 2008.
 [40] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," *J. R. Statist. Soc. B*, vol. 39, no. 1, pp. 1–38, 1977.
 [41] E. Y. Chou, "Paperless and soulless: E-signatures diminish the signer's presence and decrease acceptance," *Social Psychological and Personality Science*, vol. 6, no. 3, pp. 343–351, 2015.
 [42] E. Y. Chou, "What's in a name? The toll e-signatures take on individual honesty," *Journal of Experimental Social Psychology*, vol. 61, pp. 84–95, 2015.
 [43] K. Tzelios and L. A. Williams, "The psychological impact of digital signatures: A multistudy replication," *Technology, Mind, and Behavior*, vol. 1, no. 2, 2020.
 [44] S. Pal, M. Blumenstein, and U. Pal, "Non-English and non-Latin signature verification systems: A survey," in *Proc. 1st Int. Workshop on Automated Forensic Handwriting Analysis (AFHA)*, 2011, pp. 1–5.
 [45] X. Chen, "Extraction and analysis of the width, gray scale and radian in Chinese signature handwriting," *Forensic Science International*, vol. 255, pp. 123–132, 2015.
 ## Declarations
 **Conflict of interest.** The authors declare no conflict of interest with Firm A, Firm B, Firm C, or Firm D, or with any other entity referenced in this work.
 **Data availability.** All audit reports analysed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, a publicly accessible regulatory disclosure platform. The CPA registry used to map signatures to certifying CPAs is publicly available. The reproducibility scripts and trained model weights are provided in the supplementary materials; signature-image release is subject to the firm-anonymization constraints of §III-A (a de-identified subset and the per-table provenance mapping are included, with the full image set available to reviewers under the platform's public-data terms).
@@ -0,0 +1,79 @@
 import sqlite3, numpy as np
 from collections import defaultdict
 from scipy.stats import gaussian_kde
 DB='/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 FIRM_A='勤業眾信聯合'; BIG4=('勤業眾信聯合','安侯建業聯合','資誠聯合','安永聯合')
 SEED=42; POP=np.array([bin(i).count('1') for i in range(256)],dtype=np.uint8)  # POP[b]: set-bit count of byte b, used for dHash Hamming distance
 def load():
    c=sqlite3.connect(f'file:{DB}?mode=ro',uri=True)
    r=c.execute("""SELECT s.assigned_accountant,a.firm,s.source_pdf,s.feature_vector,s.dhash_vector,
      CAST(substr(s.year_month,1,4) AS INT) FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
      WHERE s.assigned_accountant IS NOT NULL AND a.firm IS NOT NULL AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""").fetchall()
    c.close(); return r
 def crossover(keep,label):  # KDE crossover between same-accountant and different-accountant cosine densities
    feats=np.stack([np.frombuffer(r[3],np.float32) for r in keep]).astype(np.float32)
    feats/=np.clip(np.linalg.norm(feats,axis=1,keepdims=True),1e-9,None)
    cpas=np.array([r[0] for r in keep]); by=defaultdict(list)
    for i,c in enumerate(cpas): by[c].append(i)
    by={c:np.array(v) for c,v in by.items() if len(v)>=3}; accts=list(by.keys())
    pw=np.array([len(by[c])*(len(by[c])-1)/2 for c in accts],float); pw/=pw.sum()
    rng=np.random.default_rng(SEED); M=100_000
    intra=np.empty(M,np.float32); ci=rng.choice(len(accts),M,p=pw)
    for t in range(M):
        a,b=rng.choice(by[accts[ci[t]]],2,replace=False); intra[t]=feats[a]@feats[b]
    inter=np.empty(M,np.float32)
    for t in range(M):
        i,j=rng.choice(len(accts),2,replace=False); inter[t]=feats[rng.choice(by[accts[i]])]@feats[rng.choice(by[accts[j]])]
    xs=np.linspace(0.3,1.0,10000); diff=gaussian_kde(intra)(xs)-gaussian_kde(inter)(xs)
    cr=[float(x) for x in xs[np.where(np.diff(np.sign(diff)))[0]] if 0.6<x<0.99]
    print(f'  [{label}] crossover {[f"{x:.4f}" for x in cr]} (n={len(keep)}, accts>=3={len(accts)})')
 def percomp_bands(keep,label,M=500_000):  # per-comparison band rates over random different-accountant pairs
    feats=np.stack([np.frombuffer(r[3],np.float32) for r in keep]).astype(np.float32)
    feats/=np.clip(np.linalg.norm(feats,axis=1,keepdims=True),1e-9,None)
    dh=np.stack([np.frombuffer(r[4],np.uint8) for r in keep]); cpas=np.array([r[0] for r in keep])
    by=defaultdict(list)
    for i,c in enumerate(cpas): by[c].append(i)
    accts=[c for c,v in by.items() if len(v)>=1]; rng=np.random.default_rng(SEED)
    n=len(keep); ii=rng.integers(0,n,M*2); jj=rng.integers(0,n,M*2)
    keepm=cpas[ii]!=cpas[jj]; ii=ii[keepm][:M]; jj=jj[keepm][:M]
    cos=np.einsum('ij,ij->i',feats[ii],feats[jj]); d=POP[dh[ii]^dh[jj]].sum(1)
    hc=(cos>0.95)&(d<=5); mc=(cos>0.95)&(d>5)&(d<=15); hsc=(cos>0.95)&(d>15)
    un=(cos>0.837)&(cos<=0.95); lh=cos<=0.837
    print(f'  [{label}] per-COMPARISON ICCR (M={len(ii)}): HC {hc.mean():.6f}  MC {mc.mean():.6f}  HSC {hsc.mean():.6f}  UN {un.mean():.4f}  LH {lh.mean():.4f}')
 def persig_perdoc_bands(keep,label):  # per-signature and per-report UN/HSC rates against random different-accountant pools
    n=len(keep); feats=np.stack([np.frombuffer(r[3],np.float32) for r in keep]).astype(np.float32)
    feats/=np.clip(np.linalg.norm(feats,axis=1,keepdims=True),1e-9,None)
    dh=np.stack([np.frombuffer(r[4],np.uint8) for r in keep]); cpas=np.array([r[0] for r in keep]); docs=np.array([r[2] for r in keep])
    ci=defaultdict(list)
    for i,c in enumerate(cpas): ci[c].append(i)
    ci={c:np.array(v) for c,v in ci.items()}; ps={c:len(v)-1 for c,v in ci.items()}
    allidx=np.arange(n); rng=np.random.default_rng(SEED); mc=np.zeros(n,np.float32); md=np.full(n,64,np.int32)
    for si in range(n):
        p=ps[cpas[si]]
        if p<=0: continue
        same=ci[cpas[si]]; need=p; cand=[]; att=0
        while need>0 and att<10:
            dr=rng.choice(n,size=need*2,replace=True); ok=dr[~np.isin(dr,same)]; cand.extend(ok[:need].tolist()); need-=len(ok[:need]); att+=1
        cand=np.array(cand[:p],dtype=np.int64)
        mc[si]=(feats[cand]@feats[si]).max(); md[si]=int(POP[dh[cand]^dh[si]].sum(1).min())
    un=(mc>0.837)&(mc<=0.95); hsc=(mc>0.95)&(md>15)
    # per-doc: any signature in band
    dd=defaultdict(list)
    for i in range(n): dd[docs[i]].append(i)
    docs_un=np.mean([un[v].any() for v in dd.values()]); docs_hsc=np.mean([hsc[v].any() for v in dd.values()])
    print(f'  [{label}] per-SIGNATURE ICCR: UN {un.mean():.4f}  HSC {hsc.mean():.6f}')
    print(f'  [{label}] per-REPORT ICCR:    UN {docs_un:.4f}  HSC {docs_hsc:.6f}  (n_doc={len(dd)})')
 rows=load()
 bcd_all=[r for r in rows if r[1] in BIG4 and r[1]!=FIRM_A]
 bcd_19=[r for r in bcd_all if 2013<=r[5]<=2019]
 print("=== ITEM 11: KDE crossover (verify corpus 0.837 / BCD-all 0.8489, then closed-world 2013-2019) ===")
 crossover(rows,'corpus-wide (verify ~0.8367)')
 crossover(bcd_all,'BCD-only ALL period (verify 0.8489)')
 crossover(bcd_19,'BCD 2013-2019 CLOSED-WORLD (NEW primary candidate)')
 print("\n=== ITEM 3: UN / HSC full ICCR on BCD 2013-2019 ===")
 percomp_bands(bcd_19,'BCD 2013-2019')
 persig_perdoc_bands(bcd_19,'BCD 2013-2019')
 print("\n=== ITEM 12: n reconciliation ===")
 print(f"  BCD full-period (2013-2023) signatures = {len(bcd_all)}  <- Script53 logged n=89,994")
 print(f"  BCD 2013-2019 signatures              = {len(bcd_19)}  <- headline ICCR base (reproduces 0.0059)")
@@ -0,0 +1,70 @@
 import sqlite3
 from collections import defaultdict, Counter
 import numpy as np
 DB='/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 FIRM_A='勤業眾信聯合'; BIG4=('勤業眾信聯合','安侯建業聯合','資誠聯合','安永聯合')
 ALIAS={'勤業眾信聯合':'A','安侯建業聯合':'B','資誠聯合':'C','安永聯合':'D'}
 SEED=42; POP=np.array([bin(i).count('1') for i in range(256)],dtype=np.uint8)
 def wilson(k,n,z=1.96):  # Wilson score interval for k hits in n trials (z=1.96 is roughly 95%)
    if n==0: return (None,None)
    p=k/n; d=1+z*z/n; c=(p+z*z/(2*n))/d; h=z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
    return (max(0,c-h),min(1,c+h))
 def load():
    c=sqlite3.connect(f'file:{DB}?mode=ro',uri=True); cur=c.cursor()
    cur.execute("""SELECT s.assigned_accountant,a.firm,s.source_pdf,s.feature_vector,s.dhash_vector,
        CAST(substr(s.year_month,1,4) AS INT) FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
        WHERE s.assigned_accountant IS NOT NULL AND a.firm IS NOT NULL AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""")
    r=cur.fetchall(); c.close(); return r
 def canonical_sampler(rng,n,n_pool,same_cpa,all_idx):  # n_pool indices outside same_cpa: rejection draws with up to 10 retries, then an exact fallback
    need=n_pool; cand=[]; att=0
    while need>0 and att<10:
        draw=rng.choice(n,size=need*2,replace=True); ok=draw[~np.isin(draw,same_cpa)]
        cand.extend(ok[:need].tolist()); need-=len(ok[:need]); att+=1
    if need>0:
        pm=np.ones(n,bool); pm[same_cpa]=False
        cand.extend(rng.choice(all_idx[pm],size=need,replace=False).tolist())
    return np.array(cand[:n_pool],dtype=np.int64)
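 # For each signature, draw a pool of other-accountant signatures whose size equals that
 # accountant's signature count minus one, then record the max cosine (mc) and the min dHash
 # Hamming distance (md) against the pool. Accountants with a single signature keep mc=0, md=64.
 def simulate(keep):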
    n=len(keep); feats=np.stack([np.frombuffer(r[3],np.float32) for r in keep]).astype(np.float32)
    nr=np.linalg.norm(feats,axis=1,keepdims=True); nr[nr==0]=1; feats=feats/nr
    dh=np.stack([np.frombuffer(r[4],np.uint8) for r in keep]); cpas=np.array([r[0] for r in keep])
    cpa_idx=defaultdict(list)
    for i,c in enumerate(cpas): cpa_idx[c].append(i)
    cpa_idx={c:np.array(v) for c,v in cpa_idx.items()}; ps={c:len(v)-1 for c,v in cpa_idx.items()}
    all_idx=np.arange(n); rng=np.random.default_rng(SEED)
    mc=np.zeros(n,np.float32); md=np.full(n,64,np.int32)
    for si in range(n):
        p=ps[cpas[si]]
        if p<=0: continue
        cand=canonical_sampler(rng,n,p,cpa_idx[cpas[si]],all_idx)
        mc[si]=(feats[cand]@feats[si]).max(); md[si]=int(POP[dh[cand]^dh[si]].sum(axis=1).min())
    return mc,md
 def iccr(keep,label):
    mc,md=simulate(keep); n=len(keep)
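    # Bands: HC = cos>0.95 & dHash<=5; HC+MC = cos>0.95 & dHash<=15; HSC = cos>0.95 & dHash>15.
    # The UN band's lower edge here is 0.837 (the corpus-wide crossover), not the 0.8547 cut drawn in Figure 3.
    hc=(mc>0.95)&(md<=5); d2=(mc>0.95)&(md<=15)
    un=(mc>0.837)&(mc<=0.95); hsc=(mc>0.95)&(md>15)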
    print(f"\n== {label} (n_sig={n:,}) ==")
    for nm,a in [('HC',hc),('HC+MC',d2),('UN-band',un),('HSC-band',hsc)]:
        k=int(a.sum()); lo,hi=wilson(k,n); print(f"  ICCR per-sig {nm}: {k/n:.6f} ({k}/{n}) [{lo:.5f},{hi:.5f}]")
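 # Firm A out-of-sample check: score each Firm A signature against a random pool drawn (with
 # replacement) from the B/C/D signatures in `rows`, pool size = its same-accountant count minus one,
 # and report the per-signature HC rate with a Wilson interval.
 def a_oos(rows,label):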
    A=[r for r in rows if r[1]==FIRM_A]; BCD=[r for r in rows if r[1] in BIG4 and r[1]!=FIRM_A]
    bf=np.stack([np.frombuffer(r[3],np.float32) for r in BCD]).astype(np.float32)
    bn=np.linalg.norm(bf,axis=1,keepdims=True); bn[bn==0]=1; bf=bf/bn
    bdh=np.stack([np.frombuffer(r[4],np.uint8) for r in BCD]); nb=bf.shape[0]
    ac=defaultdict(list)
    for i,r in enumerate(A): ac[r[0]].append(i)
    ps={c:len(v)-1 for c,v in ac.items()}; rng=np.random.default_rng(SEED); hc=np.zeros(len(A),bool)
    for i,r in enumerate(A):
        p=ps[r[0]]
        if p<=0: continue
        cand=rng.choice(nb,size=p,replace=True); sf=np.frombuffer(r[3],np.float32).astype(np.float32); sf=sf/max(np.linalg.norm(sf),1e-9)
        mc=(bf[cand]@sf).max(); mdv=int(POP[bdh[cand]^np.frombuffer(r[4],np.uint8)].sum(axis=1).min())
        hc[i]=(mc>0.95)and(mdv<=5)
    k=int(hc.sum()); n=len(A); lo,hi=wilson(k,n)
    print(f"\n== Firm A OOS vs {label} BCD pool == per-sig HC: {k/n:.6f} ({k}/{n}) [{lo:.6f},{hi:.6f}]")
 rows=load()
 bcd_all=[r for r in rows if r[1] in BIG4 and r[1]!=FIRM_A]
 bcd_19=[r for r in bcd_all if 2013<=r[5]<=2019]
 iccr(bcd_19,'BCD 2013-2019 (verify per-sig HC~0.0059)')
 a_oos([r for r in rows if 2013<=r[5]<=2019],'2013-2019')
 a_oos(rows,'full-period')
@@ -0,0 +1,58 @@
 import sqlite3, numpy as np
 import matplotlib
 matplotlib.use('Agg')
 import matplotlib.pyplot as plt
 DB='/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 ALIAS={'勤業眾信聯合':'A','安侯建業聯合':'B','資誠聯合':'C','安永聯合':'D'}
 COL={'A':'#c0392b','B':'#2980b9','C':'#27ae60','D':'#8e44ad'}
 c=sqlite3.connect(f'file:{DB}?mode=ro',uri=True)
 rows=c.execute("""SELECT a.firm, s.max_similarity_to_same_accountant, s.min_dhash_independent,
  s.assigned_accountant, CAST(substr(s.year_month,1,4) AS INT)
  FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
  WHERE s.max_similarity_to_same_accountant IS NOT NULL AND s.min_dhash_independent IS NOT NULL
    AND a.firm IN ('勤業眾信聯合','安侯建業聯合','資誠聯合','安永聯合')""").fetchall()
 firm=np.array([ALIAS[r[0]] for r in rows]); cos=np.array([r[1] for r in rows],float)
 dh=np.array([r[2] for r in rows],float); acc=np.array([r[3] for r in rows]); yr=np.array([r[4] for r in rows])
 A=(firm=='A'); BCD=np.isin(firm,['B','C','D'])
 # ---- Figure 4: two panels, Firm A vs BCD ----
 fig,ax=plt.subplots(1,2,figsize=(9,3.4))
 ax[0].hist(cos[A],bins=np.linspace(0.7,1.0,60),density=True,alpha=0.6,color='#c0392b',label='Firm A')
 ax[0].hist(cos[BCD],bins=np.linspace(0.7,1.0,60),density=True,alpha=0.5,color='#34495e',label='Firms B/C/D')
 ax[0].axvline(0.95,ls='--',c='k',lw=0.8); ax[0].axvline(0.8547,ls=':',c='gray',lw=0.8)
 ax[0].set_title('(a) Within-accountant cosine',fontsize=10)
 ax[0].set_xlabel('max cosine to same accountant'); ax[0].set_ylabel('density')
 ax[0].text(0.952,ax[0].get_ylim()[1]*0.9,'0.95',fontsize=7); ax[0].legend(fontsize=8,frameon=False)
 ax[0].annotate('A median 0.986',(0.986,0),(0.80,ax[0].get_ylim()[1]*0.55),fontsize=7,color='#c0392b',arrowprops=dict(arrowstyle='->',color='#c0392b',lw=0.7))
 ax[0].annotate('B/C/D median 0.959',(0.959,0),(0.72,ax[0].get_ylim()[1]*0.35),fontsize=7,color='#34495e',arrowprops=dict(arrowstyle='->',color='#34495e',lw=0.7))
 bins=np.arange(0,21)-0.5
 ax[1].hist(np.clip(dh[A],0,20),bins=bins,density=True,alpha=0.6,color='#c0392b',label='Firm A')
 ax[1].hist(np.clip(dh[BCD],0,20),bins=bins,density=True,alpha=0.5,color='#34495e',label='Firms B/C/D')
 ax[1].axvline(5,ls='--',c='k',lw=0.8)
 ax[1].set_title('(b) Within-accountant dHash',fontsize=10)
 ax[1].set_xlabel('min dHash to same accountant'); ax[1].set_ylabel('density')
 ax[1].text(5.1,ax[1].get_ylim()[1]*0.9,'5',fontsize=7); ax[1].legend(fontsize=8,frameon=False)
 ax[1].text(0.50,0.62,'A median 2 / B,C,D median 7',transform=ax[1].transAxes,fontsize=7)
 fig.text(0.5,-0.02,'Cross-firm held-out HC rate 0.42% sits at/below the clean reference ICCR 0.59%; within-Firm-A HC rate is 82%.',ha='center',fontsize=7,style='italic')
 fig.tight_layout(); fig.savefig('/tmp/fig4.png',dpi=200,bbox_inches='tight'); plt.close(fig)
 # ---- Figure 5: per-accountant HC rate, ranked, per period ----
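 # Per-accountant HC rate within `mask`; accountants with fewer than 5 signatures are skipped.
 # Returns {accountant: (HC rate, firm alias)}.
 def hc_by_acc(mask):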
    out={}
    a=acc[mask]; h=((cos[mask]>0.95)&(dh[mask]<=5)).astype(float); f=firm[mask]
    for ai in np.unique(a):
        m=a==ai
        if m.sum()>=5: out[ai]=(h[m].mean(),f[m][0])
    return out
 fig,ax=plt.subplots(1,2,figsize=(9,3.4),sharey=True)
 for j,(lo,hi,ttl) in enumerate([(2013,2019,'(a) 2013–2019'),(2020,2023,'(b) 2020–2023')]):
    # All four firms within the period (the SQL query already restricts rows to the Big-4).
    d=hc_by_acc((yr>=lo)&(yr<=hi))
    items=sorted(d.items(),key=lambda kv:-kv[1][0])
    xs=np.arange(len(items)); ys=[v[0]*100 for _,v in items]; cs=[COL[v[1]] for _,v in items]
    ax[j].scatter(xs,ys,c=cs,s=10)
    ax[j].set_title(ttl,fontsize=10); ax[j].set_xlabel('accountant rank')
    if j==0: ax[j].set_ylabel('per-accountant HC rate (%)')
 from matplotlib.lines import Line2D
 ax[1].legend([Line2D([0],[0],marker='o',ls='',color=COL[k]) for k in 'ABCD'],['Firm A','Firm B','Firm C','Firm D'],fontsize=7,frameon=False,loc='upper right')
 fig.tight_layout(); fig.savefig('/tmp/fig5.png',dpi=200,bbox_inches='tight'); plt.close(fig)
 import os
 print('figs OK', os.path.getsize('/tmp/fig4.png'), os.path.getsize('/tmp/fig5.png'))
@@ -0,0 +1,75 @@
 import matplotlib
 matplotlib.use('Agg')
 import matplotlib.pyplot as plt
 from matplotlib.patches import FancyBboxPatch, FancyArrowPatch, Rectangle
 import numpy as np
 # ============ Figure 1: data split grid ============
 fig, ax = plt.subplots(figsize=(7, 3.2))
 firms = ['Firm A', 'Firm B', 'Firm C', 'Firm D']
 periods = ['2013–2019', '2020–2023']
 # role per (row firm, col period)
 def role(f, p):
    if f == 'Firm A':
        return ('Held-out test 1\n(Firm A, full record)', '#c0392b')
    if p == '2013–2019':
        return ('Calibration\n(clean reference)', '#27ae60')
    return ('Held-out test 2\n(secondary)', '#2980b9')
 for i, f in enumerate(firms):
    for j, p in enumerate(periods):
        txt, col = role(f, p)
        ax.add_patch(Rectangle((j, len(firms)-1-i), 1, 1, facecolor=col, alpha=0.30, edgecolor='black', lw=1))
        ax.text(j+0.5, len(firms)-1-i+0.5, txt, ha='center', va='center', fontsize=6.5)
 ax.set_xlim(0, 2); ax.set_ylim(0, 4)
 ax.set_xticks([0.5, 1.5]); ax.set_xticklabels(periods, fontsize=9)
 ax.set_yticks([3.5, 2.5, 1.5, 0.5]); ax.set_yticklabels(firms, fontsize=9)
 ax.tick_params(length=0)
 for s in ax.spines.values(): s.set_visible(False)
 ax.set_title('Figure 1. Data split: calibrate on the clean cell, test everything else', fontsize=9)
 fig.tight_layout(); fig.savefig('/tmp/fig1.png', dpi=200, bbox_inches='tight'); plt.close(fig)
 # ============ Figure 2: pipeline ============
 fig, ax = plt.subplots(figsize=(9, 2.5))
 steps = ['Raw PDF\nreport', 'Find signature\npage (VLM)', 'Detect signatures\n(YOLOv11)\n+ red-stamp removal',
         'Feature extraction\n(ResNet-50, 2048-d)', 'Two similarities\ncosine (style)\nmin dHash (structure)', 'Five-way\nlabel']
 n = len(steps); w = 1.0/n
 cols = ['#ecf0f1', '#d6eaf8', '#d5f5e3', '#fcf3cf', '#fadbd8', '#e8daef']
 for i, (s, c) in enumerate(zip(steps, cols)):
    x = i*w + 0.01
    ax.add_patch(FancyBboxPatch((x, 0.30), w-0.02, 0.40, boxstyle='round,pad=0.005,rounding_size=0.02',
                                facecolor=c, edgecolor='black', lw=1, transform=ax.transAxes))
    ax.text(x+(w-0.02)/2, 0.50, s, ha='center', va='center', fontsize=6.8, transform=ax.transAxes)
    if i < n-1:
        ax.add_patch(FancyArrowPatch((x+w-0.012, 0.50), (x+w+0.002, 0.50), transform=ax.transAxes,
                                     arrowstyle='-|>', mutation_scale=10, lw=1.2, color='black'))
 ax.axis('off')
 ax.set_title('Figure 2. The screening pipeline', fontsize=9, y=0.92)
 fig.savefig('/tmp/fig2.png', dpi=200, bbox_inches='tight'); plt.close(fig)
 # ============ Figure 3: two-measure plane, five regions ============
 fig, ax = plt.subplots(figsize=(5.2, 4.2))
 LO, HI = 0.8547, 0.95
 DH1, DH2 = 5, 15
 xmin, xmax = 0.70, 1.005
 ymin, ymax = -1, 30
 # LH (cos<=LO): whole column
 ax.add_patch(Rectangle((xmin, ymin), LO-xmin, ymax-ymin, facecolor='#bdc3c7', alpha=0.5))
 # UN (LO<cos<=HI)
 ax.add_patch(Rectangle((LO, ymin), HI-LO, ymax-ymin, facecolor='#f7dc6f', alpha=0.5))
 # high-cosine band subdivided by dHash
 ax.add_patch(Rectangle((HI, ymin), xmax-HI, DH1-ymin, facecolor='#cb4335', alpha=0.55))   # HC dHash<=5
 ax.add_patch(Rectangle((HI, DH1), xmax-HI, DH2-DH1, facecolor='#eb984e', alpha=0.55))      # MC 5<dHash<=15
 ax.add_patch(Rectangle((HI, DH2), xmax-HI, ymax-DH2, facecolor='#aed6f1', alpha=0.6))      # HSC dHash>15
 ax.axvline(LO, color='gray', ls=':', lw=1); ax.axvline(HI, color='black', ls='--', lw=1)
 ax.plot([HI, xmax], [DH1, DH1], 'k--', lw=0.8); ax.plot([HI, xmax], [DH2, DH2], 'k--', lw=0.8)
 ax.text((xmin+LO)/2, 22, 'LH', ha='center', fontsize=11, weight='bold')
 ax.text((LO+HI)/2, 22, 'UN', ha='center', fontsize=11, weight='bold')
 ax.text((HI+xmax)/2, 2, 'HC', ha='center', fontsize=11, weight='bold', color='white')
 ax.text((HI+xmax)/2, 9.5, 'MC', ha='center', fontsize=11, weight='bold')
 ax.text((HI+xmax)/2, 22, 'HSC', ha='center', fontsize=10, weight='bold')
 ax.text(LO, ymin-1.5, '0.8547', ha='center', fontsize=7); ax.text(HI, ymin-1.5, '0.95', ha='center', fontsize=7)
 ax.set_xlim(xmin, xmax); ax.set_ylim(ymin, ymax)
 ax.set_xlabel('cosine similarity (style)'); ax.set_ylabel('dHash distance (structure)')
 ax.set_title('Figure 3. The two measures and the five regions', fontsize=9)
 fig.tight_layout(); fig.savefig('/tmp/fig3.png', dpi=200, bbox_inches='tight'); plt.close(fig)
 print('figs 1/2/3 OK')
@@ -0,0 +1,49 @@
 import sqlite3, numpy as np
 DB='/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 BCD=('安侯建業聯合','資誠聯合','安永聯合')
 c=sqlite3.connect(f'file:{DB}?mode=ro',uri=True)
 rows=c.execute("""SELECT s.assigned_accountant, s.max_similarity_to_same_accountant, s.min_dhash_independent
  FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
  WHERE a.firm IN ('安侯建業聯合','資誠聯合','安永聯合')
    AND CAST(substr(s.year_month,1,4) AS INT) BETWEEN 2013 AND 2019
    AND s.max_similarity_to_same_accountant IS NOT NULL AND s.min_dhash_independent IS NOT NULL""").fetchall()
 from collections import defaultdict
 by=defaultdict(list)
 for a,cos,dh in rows: by[a].append((cos,dh))
 accs={a:np.array(v) for a,v in by.items() if len(v)>=15}
 print(f"BCD 2013-2019: {len(accs)} accountants with >=15 signatures (of {len(by)} total)")
 rep=[]; tight=[]; rem_med=[]; klass=[]
 for a,v in accs.items():
    cos=v[:,0]; dh=v[:,1]
    hc=(cos>0.95)&(dh<=5)
    rf=hc.mean(); tf=(cos>0.95).mean()
    isolated=cos[cos<=0.95]
    rm=np.median(isolated) if len(isolated)>=3 else np.nan
    rep.append(rf); tight.append(tf); rem_med.append(rm)
    klass.append('pure-hand' if rf<0.10 else ('pure-stamp' if rf>0.90 else 'mixed'))
 rep=np.array(rep); tight=np.array(tight); rem_med=np.array(rem_med); klass=np.array(klass)
 import collections
 print("\n=== Per-accountant replication-fraction (HC share) distribution ===")
 for lo,hi in [(0,0.1),(0.1,0.3),(0.3,0.5),(0.5,0.7),(0.7,0.9),(0.9,1.01)]:
    n=((rep>=lo)&(rep<hi)).sum(); print(f"  rep_frac [{lo:.1f},{hi:.1f}): {n:3d} accountants")
 print("  class counts:", dict(collections.Counter(klass)))
 mixed=klass=='mixed'
 print(f"\n=== MIXED accountants (n={mixed.sum()}): is the non-tight remainder dispersed (separable)? ===")
 rm_mixed=rem_med[mixed & ~np.isnan(rem_med)]
 print(f"  remainder (cos<=0.95) median cosine across mixed accountants: median={np.median(rm_mixed):.3f}, IQR[{np.percentile(rm_mixed,25):.3f},{np.percentile(rm_mixed,75):.3f}]")
 print(f"  fraction of mixed accountants whose remainder median < 0.90 (clearly dispersed): {(rm_mixed<0.90).mean():.2f}")
 print(f"  fraction with remainder median < 0.85 (very dispersed): {(rm_mixed<0.85).mean():.2f}")
 # gap between tight group (cos>0.95) and remainder (cos<=0.95): per accountant with >=3 signatures in each part
 gaps=[]
 for a,v in accs.items():
    cos=v[:,0]
    t=cos[cos>0.95]; r=cos[cos<=0.95]
    if len(t)>=3 and len(r)>=3:
        gaps.append(np.median(t)-np.median(r))
 gaps=np.array(gaps)
 print(f"\n=== Tight-vs-remainder cosine gap (all accountants with both parts, n={len(gaps)}) ===")
 print(f"  median gap = {np.median(gaps):.3f}  (large gap => two-component structure is real & separable)")
 print(f"  fraction with gap > 0.10: {(gaps>0.10).mean():.2f}")