Files

T

gbanyan becce857e1 Phase 6 round-7 codex 3-axis review fixes: 11 MAJOR + 5 MINOR

Codex GPT-5.5 3-axis peer review (paper/codex_review_gpt55_v4_round_3axis.md)
identified 11 MAJOR + 5 MINOR + 0 BLOCKER on three axes: (1) abstract/body
tone consistency, (2) methodology clarity / v3 residue, (3) no implicit
within-CPA or cross-year signature-consistency assumptions. 13 patches
applied across 4 source files; mirrored in paper_a_v4_combined.md.

Axis 1 (tone consistency between abstract and body):
- S I L33: "resolves the ambiguity" -> "provides complementary evidence
  for screening cases where ... hypotheses diverge"
- S I L35: "disproves the distributional-threshold path" -> "does not
  support the distributional-threshold path"
- S I L37 / S V-F L29: "characterise the deployed five-way classifier
  at three units" -> "characterise the deployed HC sub-rule and
  document-level HC+MC alarm derived from the five-way classifier at
  three units" (consistent with S V-H which says only HC sub-rule and
  HC+MC alarm are re-characterised by the present ICCR battery)
- S I L39 / S V-C / S III-L.4: "consistent with firm-specific template,
  stamp, or document-production reuse mechanisms" -> "consistent with --
  but does not independently establish -- firm-level template-like
  reuse, digitisation-pipeline homogeneity, or signing-style
  homogeneity, which descriptor-only data cannot separate (S V-H)"
  (mirrors abstract)

Axis 2 (methodology clarity / v3 residue):
- S III-G: added unit-bridge sentence distinguishing "descriptor-summary
  units" (signature/accountant) from "operational reporting units"
  (per-comparison/per-signature/per-document, S III-L)
- S III-H.2: "The calibration distinguishes two reference populations"
  -> "The supporting diagnostics use two reference populations" with
  explicit "neither is the calibration anchor"
- S III-L.1: "specificity" -> "ICCR refinement"
- S III-L.2: added "descriptive intuition, not an independence
  assumption used for estimation" caveat after the 1-(1-p)^n form

Axis 3 (no implicit signature-consistency assumptions):
- S III-F: hand-signing motivation rewritten as working hypothesis that
  "the classifier does not require ... to hold for all CPAs"
- S III-G A1: added "A1 does not assume temporal stability of
  handwriting or scanning workflow within or across years"
- S III-H.1: added label-caveat paragraph (operational rule outputs,
  not validated ground-truth classes); HC "strong replication evidence"
  -> "image-similarity evidence consistent with replication"; HSC
  "consistent with a CPA who signs very consistently" -> "mechanism not
  resolved by descriptor data alone"; LH explicitly owns that
  cross-year handwriting drift, scanner workflow change, or template
  variant rotation can also yield low max-cosine within a same-CPA pool
- S III-L.6 / S IV-M.6: "same-CPA repeatability signal" -> "observed
  same-CPA-pool excess ... not attributed to within-CPA handwriting
  repeatability"

Deferred (structural, not single-sentence patch): codex S III-I.2 /
S III-J K=2/K=3 deduplication; codex S III-K LOOO / S III-J duplication.
Both are MINOR stylistic redundancies, not reviewer-rejection risks.

DOCX rebuilt via export_v3.py; v4.0_20260515 file refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 03:11:53 +08:00

16 KiB

Raw Blame History

I. Introduction

Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection. In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require certifying CPAs to affix their signature or seal (簽名或蓋章) to each audit report [1]. While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].

The digitization of financial reporting has introduced a practice that complicates this intent. As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one. This reproduction can occur either through an administrative stamping workflow — in which scanned signature images are affixed by staff as part of the report-assembly process — or through a firm-level electronic signing system that automates the same step. We refer to signatures produced by either workflow collectively as non-hand-signed. Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement. The accounting literature has examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33]. Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused, and is visually invisible to report users at scale.

The distinction between non-hand-signing detection and signature forgery detection is conceptually and technically important. The extensive body of research on offline signature verification [3]–[8] focuses almost exclusively on forgery detection — determining whether a questioned signature was produced by its purported author. In our context, identity is not in question; the CPA is indeed the legitimate signer. The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports. This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction.

A methodological concern shapes the research design. Many prior similarity-based classification studies rely on ad-hoc thresholds — declaring two images equivalent above a hand-picked cosine cutoff, for example — without principled statistical justification. Such thresholds are fragile in an archival-data setting. A defensible approach requires (i) explicit calibration of the operational thresholds against measurable negative-anchor evidence; (ii) diagnostic procedures that test whether the descriptor distribution itself supports a within-population threshold, including formal decomposition of apparent multimodality into between-group composition and integer-tie artefacts; (iii) annotation-free reporting of operational alarm rates at multiple analysis units (per-comparison, per-signature pool, per-document) with Wilson 95% confidence intervals; (iv) per-firm stratification of the reported rates to surface heterogeneity that aggregate metrics conceal; and (v) explicit disclosure of the unsupervised setting's limits — in particular, the inability to estimate true error rates without signature-level ground-truth labels.

Despite the significance of the problem for audit quality and regulatory oversight, to our knowledge no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation.

In this paper we present a fully automated, end-to-end pipeline for screening non-hand-signed CPA signatures in audit reports at scale, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour under explicit unsupervised assumptions. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) disclosure of each diagnostic's untested assumption (§III-M).

A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of 2.73 versus Firms B/C/D's 6.46, 7.39, 7.21) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears (p_{\text{median}} = 0.35 across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-I.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.

In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration. At the per-comparison unit, the cos$>0.95$ operating point yields ICCR = 0.00060 on a $5 \times 10^5$-pair Big-4 sample; the dHash$\leq 5$ structural cutoff yields ICCR = 0.00129; the joint rule cos$>0.95$ AND dHash$\leq 5$ yields joint ICCR = 0.00014 (any-pair semantics, matching the deployed extrema rule). At the pool-normalised per-signature unit, the same rule's effective coincidence rate is materially higher because the deployed classifier takes max-cosine and min-dHash over a same-CPA pool: pooled Big-4 any-pair ICCR is 0.1102 (Wilson 95% CI [0.1086, 0.1118]; CPA-block bootstrap 95% [0.0908, 0.1330]). At the per-document unit, the operational HC$+$MC alarm fires on 33.75\% of Big-4 documents under the inter-CPA candidate-pool counterfactual.

The pooled per-signature and per-document rates conceal striking firm heterogeneity. A logistic regression of the per-signature hit indicator on firm dummies (Firm A reference) and centred log pool size yields odds ratios of 0.053 (Firm B), 0.010 (Firm C), and 0.027 (Firm D) — Firms B/C/D are an order of magnitude below Firm A even after controlling for the pool-size confound. Cross-firm hit matrix analysis under the deployed any-pair rule shows within-firm collision concentrations of 98.8\% at Firm A and $76.7$–83.7\% at Firms B/C/D (Table XXV; the stricter same-pair joint event saturates at $97.0$–99.96\% within-firm across all four firms). The pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour rather than to derive new thresholds. The high-confidence sub-rule (cos > 0.95 AND dHash \leq 5) and moderate-confidence sub-rule (cos > 0.95 AND 5 < \text{dHash} \leq 15) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels.

Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman \rho \geq 0.879: the K=3 mixture posterior (a firm-compositional position score under §III-J's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve 0\% positive-anchor miss rate (Wilson 95% upper bound 1.45\%). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.

We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. The Big-4 sub-corpus comprises 437 CPAs and 150,442 signatures with both descriptors available.

The contributions of this paper are:

Problem formulation. We define non-hand-signing detection as distinct from signature forgery detection and frame it as a detection problem on intra-signer similarity distributions.
End-to-end pipeline. We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention after initial training.
Dual-descriptor similarity. We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash provides complementary evidence for screening cases where style consistency and image reproduction hypotheses diverge, and we support the backbone choice through a feature-backbone ablation.
Composition decomposition does not support the distributional-threshold path. We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
Anchor-based multi-level inter-CPA coincidence-rate calibration. We characterise the deployed high-confidence (HC) sub-rule and document-level HC$+$MC alarm derived from the five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: 0.0006; dHash$\leq 5$: 0.0013; joint: 0.00014), pool-normalised per-signature ICCR (0.11 for the deployed any-pair high-confidence rule), and per-document ICCR (0.34 for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
Firm heterogeneity quantification and within-firm cross-CPA collision concentration. Per-document D2 inter-CPA proxy ICCRs differ by an order of magnitude across firms (Firm A: 0.62 versus Firms B/C/D: $0.09$–0.16); a per-signature logistic regression of the any-pair HC hit indicator on firm dummies and centred log pool size confirms the firm gap persists after pool-size control. Cross-firm hit matrix analysis shows within-firm collision concentrations of 98.8\% at Firm A and $76.7$–83.7\% at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$–99.96\% within-firm across all four firms); the pattern is consistent with — but does not independently establish — firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).
K=3 as descriptive firm-compositional partition; three-score convergent internal consistency. We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman \rho \geq 0.879; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.
Annotation-free positive-anchor capture check and unsupervised-setting disclosure. We achieve 0\% positive-anchor miss rate (Wilson 95% upper bound 1.45\%) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. Each supporting diagnostic in §III-M addresses one specific failure mode of an unsupervised screening classifier — composition artefacts, inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold sensitivity, or positive-anchor capture — with an explicitly disclosed untested assumption. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review.

The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity positive-anchor check, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work.

16 KiB Raw Blame History Unescape Escape

I. Introduction

16 KiB

Raw Blame History