Phase 6 round-3 codex-review fixes: blockers + majors + minors

Resolved Codex review (gpt-5.5 xhigh) findings against b6913d2.

BLOCKERS:
- Appendix B reference mismatch: rewrote all main-text "Appendix B" references
  to "supplementary materials" since Appendix B is now a redirect stub. Affected
  the SSIM design-argument pointer, threshold provenance, byte-level
  decomposition, MC band capture-rate, and backbone-ablation table references
  across §III-F / §III-H.1 / §III-H.2 / §III-K / §III-L.4 / §III-M / §IV-F /
  §IV-J / §IV-K / §IV-L / §V-C / §V-H.
- Table rendering: un-commented Tables I-IV (Dataset Summary, YOLO Detection,
  Extraction Results, Cosine Distribution Statistics) which were inside HTML
  comment blocks and would not have rendered in the submission.
- Table numbering out of order: Table XIX appeared before Tables XVI-XVIII.
  Renumbered XIX -> XVI (document-level worst-case counts), XVI -> XVII (Firm ×
  K=3 cross-tab), XVII -> XVIII (K=3 component comparison), XVIII -> XIX
  (Spearman correlation). Cross-references updated in §IV-J / §IV-K and §V-C.
- Table V mis-citation: §IV-C said "KDE crossover ... (Table V)" but Table V is
  the dip test. Dropped the (Table V) tag; crossover is a textual finding.
- Submission cleanup: wrapped the archived Impact Statement section heading and
  body inside the existing HTML comment (was rendering). Funding placeholder
  wrapped in HTML comment with a TO-DO note (won't render but is preserved as a
  reminder).

MAJORS:
- Line 1077 numerical conflation: rewrote the §V-C / §III-L.4 paragraph that
  labelled Firm A's per-document HC+MC inter-CPA proxy ICCR of 0.6201 as a rate
  "on real same-CPA pools." 0.6201 is a counterfactual proxy under inter-CPA
  candidate-pool replacement, not the observed rate. Added explicit disambiguation:
  the corresponding observed rate from Table XVI (formerly XIX) is 97.5%
  HC+MC for Firm A; the proxy and observed rates measure different quantities.
- Residual "validation" language softened: "Dual-descriptor verification" ->
  "Dual-descriptor similarity"; "we validate the backbone choice" -> "we
  support the backbone choice"; "pixel-identity validation" -> "pixel-identity
  positive-anchor check"; "## M. Validation Strategy and Limitations under
  Unsupervised Setting" -> "## M. Unsupervised Diagnostic Strategy and Limits".
- "Specificity behaviour" overclaim: "characterises the cosine threshold's
  specificity behaviour" -> "specificity-proxy behaviour" (methodology §III-L.0
  and discussion §V-F).
- "Prior published / prior calibration" ambiguity: replaced "prior published
  per-comparison rate" with "the corpus-wide rate reported in §IV-I"; replaced
  "(prior published operating point)" with "(alternative operating point from
  supplementary calibration evidence)" in Table XXI; replaced "prior reporting
  and the existing literature" with "the existing literature and the
  supplementary calibration evidence."

MINORS:
- Line 116 Bayes-optimal qualifier: "the local density minimum ... is the
  Bayes-optimal decision boundary under equal priors" -> "In idealized
  two-class mixture settings with equal priors and equal misclassification
  costs, the local density minimum ... coincides with the Bayes-optimal
  decision boundary."
- Stale section refs: §V-G for the fine-tuning caveat retargeted to §V-H
  Engineering-level caveats (where it lives after the §V-H reorganisation);
  §III-L for the worst-case rule retargeted to §III-H.1; "Section IV-D.2"
  (nonexistent) retargeted to "Section IV-D Table VI."
- Abstract / Introduction "after pool-size adjustment": separated the
  document-level D2 proxy ICCR claim from the per-signature logistic regression
  claim. Now: "Per-document D2 inter-CPA proxy ICCRs differ by an order of
  magnitude across firms ... a per-signature logistic regression confirms the
  firm gap persists after pool-size control."

NIT:
- Related Work HTML comment "(see paper_a_references_v3.md for full list)"
  -> "(full list in the References section)"; removes the version-coded
  filename reference from the source.

Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1312 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 18:28:14 +08:00
parent b6913d2f93
commit 9e68f2e1d3
10 changed files with 108 additions and 112 deletions
@@ -2,7 +2,7 @@
<!-- IEEE Access target: <= 250 words, single paragraph -->
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports — through administrative stamping or firm-level electronic signing — thereby undermining individualized attestation. We build an end-to-end pipeline for screening such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is $0.62$ versus $0.09$–$0.16$ at Firms B/C/D after pool-size adjustment, and under the deployed any-pair rule $77$–$99\%$ of inter-CPA collisions concentrate within the source firm — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports — through administrative stamping or firm-level electronic signing — thereby undermining individualized attestation. We build an end-to-end pipeline for screening such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC inter-CPA proxy ICCR is $0.62$ versus $0.09$–$0.16$ at Firms B/C/D, and a per-signature logistic regression confirms the firm gap persists after controlling for pool size; under the deployed any-pair rule $77$–$99\%$ of inter-CPA collisions concentrate within the source firm — consistent with firm-level template-like reuse.
We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
<!-- Word count: 247 -->
@@ -39,19 +39,19 @@ The contributions of this paper are:
2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention after initial training.
3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between *style consistency* and *image reproduction*, and we validate the backbone choice through a feature-backbone ablation.
3. **Dual-descriptor similarity.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between *style consistency* and *image reproduction*, and we support the backbone choice through a feature-backbone ablation.
4. **Composition decomposition disproves the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-firm rates differ by an order of magnitude after pool-size adjustment (Firm A's per-document HC$+$MC alarm at $0.62$ versus Firms B/C/D at $0.09$–$0.16$). Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms); the pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — a descriptive finding about deployed-rule behaviour, not a claim of deliberate template sharing.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-document D2 inter-CPA proxy ICCRs differ by an order of magnitude across firms (Firm A: $0.62$ versus Firms B/C/D: $0.09$–$0.16$); a per-signature logistic regression of the any-pair HC hit indicator on firm dummies and centred log pool size confirms the firm gap persists after pool-size control. Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms); the pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — a descriptive finding about deployed-rule behaviour, not a claim of deliberate template sharing.
7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.
8. **Annotation-free positive-anchor capture check and unsupervised-setting disclosure.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. Each supporting diagnostic in §III-M addresses one specific failure mode of an unsupervised screening classifier — composition artefacts, inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold sensitivity, or positive-anchor capture — with an explicitly disclosed untested assumption. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review.
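The three-score Spearman agreement in contribution 7 is a standard rank-correlation check; a minimal sketch of how such pairwise agreement can be computed with `scipy.stats.spearmanr`, on synthetic score vectors standing in for the paper's feature-derived scores:

```python
# Sketch of a pairwise Spearman rank-agreement check. The scores below are
# synthetic stand-ins (a shared latent position plus noise), NOT the paper's
# actual feature-derived scores.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
base = rng.normal(size=200)                  # latent per-CPA descriptor position
scores = {
    "score_a": base + 0.1 * rng.normal(size=200),
    "score_b": base + 0.1 * rng.normal(size=200),
    "score_c": base + 0.1 * rng.normal(size=200),
}

names = list(scores)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        rho, _ = spearmanr(scores[names[i]], scores[names[j]])
        print(f"{names[i]} vs {names[j]}: rho = {rho:.3f}")
```

As the contribution itself notes, high agreement here is internal consistency, not external validation, since all scores share the underlying descriptor pair.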
The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity validation, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work.
The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity positive-anchor check, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work.
# II. Related Work
@@ -113,7 +113,7 @@ Our threshold-characterisation and calibration framework combines three families
*Non-parametric density estimation.*
Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions.
Where the distribution is bimodal, the local density minimum (antimode) between the two modes is the Bayes-optimal decision boundary under equal priors.
In idealized two-class mixture settings with equal priors and equal misclassification costs, the local density minimum (antimode) between the two modes coincides with the Bayes-optimal decision boundary.
The statistical validity of the unimodality-vs-multimodality dichotomy can be tested via the Hartigan & Hartigan dip test [37], which tests the null of unimodality; we use rejection of this null as evidence consistent with (though not a direct test for) bimodality.
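The antimode idea above can be illustrated in a few lines: for a clearly bimodal sample, the KDE's local minimum between the two modes approximates the equal-priors boundary. The data below are synthetic (the paper's point is precisely that its real descriptor distribution supplies no such antimode):

```python
# Antimode estimation via KDE on a synthetic bimodal sample. Mode locations
# and spreads are illustrative assumptions, not the paper's fitted values.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
sample = np.concatenate([
    rng.normal(0.70, 0.05, 2000),            # lower-similarity mode
    rng.normal(0.98, 0.01, 2000),            # near-identical mode
])
kde = gaussian_kde(sample)
grid = np.linspace(0.75, 0.95, 400)          # search strictly between the modes
antimode = grid[np.argmin(kde(grid))]
print(f"estimated antimode ~ {antimode:.3f}")
```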
*Discontinuity tests on empirical distributions.*
@@ -132,7 +132,7 @@ The present study uses these tools diagnostically: first to test whether the des
*Cross-validation in a small-cluster scope.*
Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the deployed five-way operational classifier (§III-H.1; calibrated separately in §III-L). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier.
<!--
REFERENCES for Related Work (see paper_a_references_v3.md for full list):
REFERENCES for Related Work (full list in the References section):
[3] Bromley et al. 1993 — Siamese TDNN (NeurIPS)
[4] Dey et al. 2017 — SigNet
[5] Kao & Wen 2020 — Single-sample SV with forgery detection
@@ -197,7 +197,8 @@ Each report is a multi-page PDF document containing, among other content, the au
CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports.
Table I summarizes the dataset composition.
<!-- TABLE I: Dataset Summary
**Table I.** Dataset Summary.
| Attribute | Value |
|-----------|-------|
| Total PDF documents | 90,282 |
@@ -206,7 +207,6 @@ Table I summarizes the dataset composition.
| Processed for signature extraction | 86,071 (95.3%) |
| Unique CPAs identified | 758 |
| Accounting firms | >50 |
-->
## C. Signature Page Identification
@@ -230,14 +230,14 @@ A region was labeled as "signature" if it contained any Chinese handwritten cont
The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
<!-- TABLE II: YOLO Detection Performance
**Table II.** YOLO Detection Performance.
| Metric | Value |
|--------|-------|
| Precision | 0.97–0.98 |
| Recall | 0.95–0.98 |
| mAP@0.50 | 0.98–0.99 |
| mAP@0.50:0.95 | 0.85–0.90 |
-->
Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers).
A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
@@ -252,7 +252,7 @@ The final classification layer was removed, yielding the 2048-dimensional output
Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-G).
The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-H, Engineering-level caveats).
This design choice is supported by an ablation study (Section IV-L) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
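The L2-normalisation claim in §III-E (cosine similarity equals the dot product after normalisation) can be verified directly; random 2048-d vectors below stand in for ResNet-50 embeddings:

```python
# After L2-normalisation, cosine similarity reduces to a plain dot product.
# The vectors are random stand-ins for 2048-d ResNet-50 features.
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.normal(size=2048), rng.normal(size=2048)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cosine, np.dot(a, b))      # identical once both norms are 1
```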
## F. Dual-Method Similarity Descriptors
@@ -277,7 +277,7 @@ Non-hand-signing is expected to yield extreme similarity under *both* descriptor
Hand-signing, by contrast, often yields high dHash similarity (the overall layout of a signature is typically preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. Appendix B contains the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.
We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. The supplementary materials contain the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.
Cosine similarity on L2-normalised deep embeddings and dHash both remain stable across the print-scan-rasterise cycle by design [14], [19], [21], [27]; together they constitute the dual descriptor used throughout the rest of this paper.
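A minimal sketch of the dHash half of the dual descriptor. It assumes the signature crop has already been downsampled to an 8-row by 9-column grayscale array (a real implementation would resize with PIL or OpenCV first); each of the 64 bits records whether a pixel is brighter than its right-hand neighbour, and the dHash distance is the Hamming distance between two such bit arrays:

```python
# Difference-hash (dHash) sketch on a pre-downsampled 8x9 grayscale array.
# The random array is a synthetic stand-in for a signature crop.
import numpy as np

def dhash_bits(gray_8x9: np.ndarray) -> np.ndarray:
    assert gray_8x9.shape == (8, 9)
    # Compare each pixel with its right neighbour: 8 rows x 8 comparisons = 64 bits.
    return (gray_8x9[:, 1:] > gray_8x9[:, :-1]).flatten()

def dhash_distance(img1: np.ndarray, img2: np.ndarray) -> int:
    return int(np.sum(dhash_bits(img1) != dhash_bits(img2)))

rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(8, 9))
assert dhash_distance(img, img) == 0          # literal reproduction scores 0
noisy = img + rng.integers(-2, 3, size=(8, 9))
print("distance under mild noise:", dhash_distance(img, noisy))
```

The gradient-based comparison is what makes dHash stable under the local-luminance perturbations of a print-scan cycle that defeat SSIM and pixel distances.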
@@ -317,7 +317,7 @@ Each Big-4 signature is assigned to one of five categories using the per-signatu
4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$.
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (Appendix B). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L).
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (see supplementary materials). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L).
The remainder of this section (§III-H.2) describes the reference populations used to calibrate and cross-check this rule. §III-I demonstrates that the descriptor distributions do not provide a within-population natural threshold; §III-J–§III-K develop the descriptive partition and internal-consistency cross-checks; §III-L develops the anchor-based threshold calibration; §III-M discloses the unsupervised-setting limits.
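The five-way rule and worst-case aggregation can be sketched as follows. The UN and LH branches follow the quoted definitions; the HC/MC/HSC split of the cos > 0.95 region by dHash (≤5, ≤15, >15) is reconstructed from thresholds quoted elsewhere in this diff and should be checked against the full §III-H.1 text:

```python
# Hedged sketch of the deployed five-way classifier and the worst-case
# document rule. The HC/MC/HSC split is a reconstruction, not quoted verbatim.
RANK = {"HC": 0, "MC": 1, "HSC": 2, "UN": 3, "LH": 4}  # most replication-consistent first

def classify(cos: float, dhash: int) -> str:
    if cos > 0.95:
        if dhash <= 5:
            return "HC"    # high confidence: both descriptors extreme
        if dhash <= 15:
            return "MC"    # moderate confidence: structural sub-band
        return "HSC"       # hand-signed consistent: style match, layout differs
    if cos > 0.837:        # all-pairs intra/inter KDE crossover
        return "UN"
    return "LH"

def document_label(signature_labels):
    """Worst-case rule: the report inherits its most replication-consistent label."""
    return min(signature_labels, key=RANK.__getitem__)

assert classify(0.97, 3) == "HC"
assert classify(0.97, 9) == "MC"
assert classify(0.90, 2) == "UN"
assert document_label(["LH", "UN", "MC"]) == "MC"
```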
@@ -325,7 +325,7 @@ The remainder of this section (§III-H.2) describes the reference populations us
The calibration distinguishes two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (Appendix B) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.
Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate.
@@ -427,7 +427,7 @@ We read this as the strongest internal-consistency signal in the analysis: three
| Deployed binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ |
| Per-CPA K=3 vs per-signature K=3 | $0.870$ |
The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the five-way rule. This comparison checks only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly check the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, whose calibration and capture-rate evidence is reported in Appendix B and not regenerated on the Big-4 subset.
The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the five-way rule. This comparison checks only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly check the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, whose calibration and capture-rate evidence is reported in the supplementary materials and not regenerated on the Big-4 subset.
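The κ values above are Cohen's kappa between two label assignments; a minimal self-contained computation, on synthetic binary labels standing in for the box-rule / K=3 assignments:

```python
# Cohen's kappa: observed agreement corrected for chance agreement.
# The label vectors are synthetic stand-ins, not the paper's assignments.
import numpy as np

def cohens_kappa(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    po = np.mean(a == b)                                          # observed agreement
    pe = sum(np.mean(a == l) * np.mean(b == l) for l in labels)   # chance agreement
    return (po - pe) / (1 - pe)

a = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
b = [0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
print(f"kappa = {cohens_kappa(a, b):.3f}")   # -> kappa = 0.600
```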
**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference:
@@ -453,7 +453,7 @@ The operational classifier defined in §III-H.1 is calibrated by characterising
### L.0. Calibration methodology
**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with prior reporting and with the existing literature. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine threshold's specificity behaviour at the inter-CPA pair level and the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain the prior calibration; the present calibration does not provide independent rates for those sub-bands.
**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine threshold's specificity-proxy behaviour at the inter-CPA pair level and the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for those sub-bands.
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:
@@ -472,7 +472,7 @@ We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from Big-4 signatu
| Threshold | Per-comparison inter-CPA coincidence rate | 95% Wilson CI |
|---|---|---|
| Cosine $> 0.95$ | $0.00060$ | $[0.00053, 0.00067]$ |
| Cosine $> 0.945$ (prior published operating point) | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.945$ (alternative operating point from supplementary calibration evidence) | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| Cosine $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ | $0.00129$ | $[0.00120, 0.00140]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | $[0.00011, 0.00018]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ |
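The Wilson intervals in the table above can be reproduced from the raw counts. A minimal sketch, assuming the hit count equals the reported rate times the $5 \times 10^5$-pair sample (e.g. roughly 300 hits for the cos $> 0.95$ row; the exact counts are not stated in the text):

```python
import math

def wilson_ci(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# cos > 0.95 row: rate 0.00060 on 5e5 inter-CPA pairs -> ~300 hits (assumed count)
lo, hi = wilson_ci(300, 500_000)
```

The upper bound matches the tabulated $0.00067$; the lower bound lands at $\approx 0.00054$, within rounding of the tabulated $0.00053$.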
The cosine row at $\text{cos} > 0.95$ is consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I on a similarly-sized inter-CPA sample; the present $5 \times 10^5$-pair sample yields $0.00060$, within that precision. The dHash row and joint row are reported here for the first time; the corpus-wide spike did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.
The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.
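For reference, the structural dimension thresholded throughout ($\text{dHash} \leq 5$) is a difference hash. A minimal sketch of the standard 64-bit dHash construction; block-averaged downsampling here is an assumption standing in for whatever resampling the production pipeline uses, which is not specified in the text:

```python
import numpy as np

def dhash_bits(gray: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """64-bit difference hash: downsample a grayscale image to
    hash_size x (hash_size + 1) cells, then compare horizontally
    adjacent cells to produce one bit per comparison."""
    rows = np.array_split(np.arange(gray.shape[0]), hash_size)
    cols = np.array_split(np.arange(gray.shape[1]), hash_size + 1)
    small = np.array([[gray[np.ix_(r, c)].mean() for c in cols] for r in rows])
    return (small[:, 1:] > small[:, :-1]).ravel()  # shape (64,)

def dhash_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two dHash bit vectors (0..64)."""
    return int(np.count_nonzero(dhash_bits(a) != dhash_bits(b)))
```

A pixel-identical pair yields distance 0, consistent with the positive-anchor behaviour noted in §III-K.4; the deployed structural cut then thresholds this distance at 5.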
For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$).
**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$–$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (supplementary materials; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder production-related reuse patterns at Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where the within-firm collision concentration is the dominant pattern at all four Big-4 firms — most strongly at Firm A ($98.8\%$ any-pair, $99.96\%$ same-pair) and at materially lower but still majority levels at Firms B/C/D ($76.7$–$83.7\%$ any-pair; $97.0$–$98.2\%$ same-pair).
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as a *same-CPA repeatability signal* — a quantity that exceeds what random inter-CPA candidate replacement would produce — rather than as an estimate of true replication prevalence.
## M. Unsupervised Diagnostic Strategy and Limits
The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth.
No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth.
**Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, supplementary materials) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.
**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC inter-CPA proxy ICCR of $0.62$ vs Firms B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.
However, the subsequent production deployment provides a practical consistency check: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.
**Table III.** Extraction Results.
| Metric | Value |
|--------|-------|
| Documents processed | 86,071 |
| Avg. signatures per document | 2.14 |
| CPA-matched signatures | 168,755 (92.6%) |
| Processing rate | 43.1 docs/sec |
The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in the primary analyses of §IV-D through §IV-J.
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
Table IV summarizes the distributional statistics.
**Table IV.** Cosine Similarity Distribution Statistics.
| Statistic | Intra-class | Inter-class |
|-----------|-------------|-------------|
| N (pairs) | 41,352,824 | 500,000 |
| Median | 0.836 | 0.774 |
| Skewness | 0.711 | 0.851 |
| Kurtosis | 0.550 | 1.027 |
Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; the subsequent distributional diagnostics in Section IV-D are produced via the methods of Section III-I to avoid single-family distributional assumptions.
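The AIC-based family selection can be sketched as follows, on a synthetic right-skewed stand-in sample rather than the corpus similarities; the candidate families and sample parameters here are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# stand-in skewed sample (not the corpus cosine data)
x = stats.lognorm.rvs(0.4, size=5000, random_state=np.random.default_rng(1))

def aic(dist, data) -> float:
    """AIC = 2k - 2*log-likelihood at the MLE fit (loc/scale included)."""
    params = dist.fit(data)
    loglik = np.sum(dist.logpdf(data, *params))
    return 2 * len(params) - 2 * loglik

scores = {name: aic(getattr(stats, name), x) for name in ("lognorm", "norm", "gamma")}
best_family = min(scores, key=scores.get)  # lowest AIC wins
```

As in the text, the selection is used only descriptively: a lowest-AIC family is a convenient summary, not a licence for parametric thresholds.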
The KDE crossover---where the two density functions intersect---was located at 0.837.
Under equal prior probabilities and equal misclassification costs, this crossover is a candidate decision boundary between the two classes; we adopt it only as the operational LH/UN boundary in §III-H.1, not as a natural distributional threshold.
Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).
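The crossover-location procedure can be sketched with Gaussian KDEs on synthetic intra-/inter-class samples; the means, spreads, and sample sizes below are illustrative assumptions, and the corpus value of 0.837 comes from the real densities, not from this sketch:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
intra = rng.normal(0.84, 0.05, 20_000)  # stand-in intra-class cosines
inter = rng.normal(0.77, 0.05, 20_000)  # stand-in inter-class cosines

grid = np.linspace(0.5, 1.0, 2001)
diff = gaussian_kde(intra)(grid) - gaussian_kde(inter)(grid)

# crossover: where the density difference changes sign between the two modes
between = (grid > 0.77) & (grid < 0.84)
crossover = grid[between][np.argmin(np.abs(diff[between]))]
```

For these equal-spread synthetic classes the crossover sits near the midpoint of the two means, illustrating why the crossover is a candidate equal-prior, equal-cost boundary rather than a natural distributional threshold.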
| deployed binary high-confidence box rule vs per-signature K=3 hard label | 0.559 |
| Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 |
(Source: Script 39.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) retains its prior calibration and capture-rate evidence (supplementary materials; cross-referenced in §IV-J).
## G. Leave-One-Firm-Out Reproducibility
| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |
(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVII: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVII). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).
**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the worst-case rule (HC > MC > HSC > UN > LH; §III-H.1), applied to the Big-4 sub-corpus.
**Table XVI.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs.
| Category | Long name | $n$ documents | % |
|---|---|---|---|
(Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)
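The worst-case document-level rule (HC > MC > HSC > UN > LH) reduces to a severity-ordered selection over each document's signature labels. A minimal sketch, with hypothetical two-signature documents:

```python
# severity order of the five-way labels; lower value = worse outcome
SEVERITY = {"HC": 0, "MC": 1, "HSC": 2, "UN": 3, "LH": 4}

def document_label(signature_labels: list[str]) -> str:
    """Worst-case aggregation: a document inherits its worst signature label."""
    return min(signature_labels, key=SEVERITY.__getitem__)

# hypothetical documents, each carrying two certifying-CPA signatures
assert document_label(["LH", "MC"]) == "MC"
assert document_label(["UN", "LH"]) == "UN"
assert document_label(["HC", "LH"]) == "HC"
```

The rule is deliberately pessimistic: one high-confidence signature suffices to label the whole document HC, which is why document-level alarm rates exceed per-signature rates.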
The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains its prior calibration (supplementary materials); it is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The capture-rate calibration evidence for the moderate band is reported in the supplementary materials and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
**Table XVII.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.
| Firm | $n$ | C1 (low-cos / high-dHash) | C2 (central) | C3 (high-cos / low-dHash) | C1 % | C3 % |
|---|---|---|---|---|---|---|
(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).
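The Firm × K=3 cross-tab is a plain contingency table over per-CPA hard labels. A sketch on synthetic assignments; the firm and cluster draws below are random stand-ins, not the Script 35 output:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
firm = rng.choice(["Firm A", "Firm B", "Firm C", "Firm D"], size=437)  # hypothetical CPAs
cluster = rng.choice(["C1", "C2", "C3"], size=437)                     # hypothetical hard labels

counts = pd.crosstab(pd.Series(firm, name="firm"), pd.Series(cluster, name="cluster"))
shares = pd.crosstab(firm, cluster, normalize="index")  # per-firm C1%/C3% as in the table
```

`normalize="index"` yields the row-percentage view (each firm's C1/C2/C3 shares summing to one), which is the form quoted in the text (e.g. Firm A's C3 concentration).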
**Document-level worst-case aggregation outputs are reported in Table XVI above.**
## K. Full-Dataset Robustness (light scope)
This section reports the reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + deployed operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the K=3 + deployed-rule convergence reproduces at the wider scope. The §III-H.1 five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band retains its prior calibration (supplementary materials; §IV-J).
**Table XVIII.** K=3 component comparison, Big-4 sub-corpus vs full dataset.
| K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full |
|---|---|---|---|
(Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.)
**Table XIX.** Spearman rank correlation between K=3 P(C1) and deployed operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset.
| Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs deployed less-replication-dominated rate) | $p$-value |
|---|---|---|---|
To support the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
The comparison summary is reported in the supplementary materials (backbone-ablation table; not the same table as Table XIX in this section, which reports Big-4 vs full-dataset Spearman drift in §IV-K).
<!-- BACKBONE ABLATION TABLE (rendered in supplementary materials):
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|--------|-----------|--------|-----------------|
| Feature dim | 2048 | 4096 | 1280 |
| Threshold | Per-comparison ICCR | 95% Wilson CI |
|---|---|---|
| cos $> 0.945$ (alternative operating point from supplementary calibration evidence) | $0.00081$ | $[0.00073, 0.00089]$ |
| cos $> 0.95$ (deployed operating point) | $0.00060$ | $[0.00053, 0.00067]$ |
| cos $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| cos $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |
Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XVI's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs: all $379$ are $1{:}1$ Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XVI's per-firm rows sum to $74{,}854 = 75{,}233 - 379$.
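The Firm C / Firm D tie assignment described above follows directly from NumPy semantics: `np.unique` returns its values sorted, and `np.argmax` returns the first index on ties. A minimal reproduction:

```python
import numpy as np

# a 1:1 mixed-firm PDF: one signature from each firm, in arbitrary order
firms_on_pdf = np.array(["Firm D", "Firm C"])

values, counts = np.unique(firms_on_pdf, return_counts=True)  # values sorted: C before D
mode_firm = values[np.argmax(counts)]  # argmax picks index 0 on the [1, 1] tie
# mode_firm == "Firm C", regardless of input order
```

Hence all 379 tied documents land on the alphabetically-first firm, explaining the Firm C denominator excess over the Table XVI single-firm count.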
### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)
## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.
We treat Firm A as a *templated-end case study within the Big-4 sub-corpus* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms). Firm A's per-document D2 inter-CPA proxy ICCR of $0.6201$ (versus Firms B/C/D's $0.09$–$0.16$) — the counterfactual rate at which Firm A documents would fire HC$+$MC if same-CPA pools were replaced by random inter-CPA candidates — reflects high inter-CPA collision concentration under the deployed rule, consistent with firm-specific template, stamp, or document-production reuse. (The corresponding observed rate on real same-CPA pools, from Table XVI, is substantially higher: $97.5\%$ HC$+$MC for Firm A; the proxy and observed rates measure different quantities and are not directly comparable.) The inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the within-firm collision pattern at all four Big-4 firms is consistent with similar, milder production-related reuse patterns at Firms B/C/D.
## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
## F. Anchor-Based Multi-Level Calibration
-The operational specificity of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR matches a prior published per-comparison rate (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
+The operational specificity-proxy behaviour of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR is consistent with the corpus-wide rate reported in §IV-I (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is $0.234$ (Wilson 95% $[0.190, 0.285]$): given the cosine gate, the structural dimension provides further per-comparison specificity at $\sim 4.3\times$ refinement. Second, the alert-rate sensitivity analysis (§III-L.5) shows the deployed HC threshold is locally sensitive rather than plateau-stable (local gradient $\approx 25\times$ the median for cosine, $\approx 3.8\times$ for dHash); alternative operating points can be characterised by inverting the ICCR curves (e.g., a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint corresponds to per-signature ICCR $\approx 0.045$). The MC/HSC sub-band boundary at dHash$=15$, by contrast, *is* plateau-like (local-to-median ratio $\approx 0.08$), consistent with high-dHash-tail saturation.
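The pool-size scaling and its inversion can be sketched in a few lines. This is an illustrative computation only: `pool_iccr` and `implied_pool_size` are hypothetical helper names, the pool size of 100 is an arbitrary example (real same-CPA pool sizes vary per CPA), and the independence limit is the same approximation stated above.

```python
import math

def pool_iccr(p_pair: float, n_pool: int) -> float:
    """Inter-CPA-equivalent per-signature rate for a pooled max-cosine /
    min-dHash rule: 1 - (1 - p_pair)^n_pool, in the independence limit."""
    return 1.0 - (1.0 - p_pair) ** n_pool

def implied_pool_size(p_pair: float, p_sig: float) -> float:
    """Invert the scaling: the effective number of independent comparisons
    that would map a per-comparison rate onto a per-signature rate."""
    return math.log(1.0 - p_sig) / math.log(1.0 - p_pair)

# Per-comparison joint ICCR quoted above (cos > 0.95 AND dHash <= 5).
p_pair = 0.00014
print(pool_iccr(p_pair, 100))             # per-signature rate at a
                                          # hypothetical pool of 100
print(implied_pool_size(p_pair, 0.1102))  # effective pool implied by the
                                          # pooled Big-4 any-pair HC rate
```

Under independence, the pooled rate of $0.1102$ corresponds to an effective pool of several hundred inter-CPA comparisons per signature, which is why the per-comparison and per-signature ICCRs differ by orders of magnitude.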
## G. Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor
-The only conservative hard-positive subset in the corpus is pixel-identical signatures: those whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce byte-identical images, so these signatures are a conservative hard-positive subset for image replication. On the Big-4 subset ($n = 262$ pixel-identical signatures), all three candidate checks — the deployed box rule, the K=3 hard label, and the reverse-anchor metric with a prevalence-calibrated cut — achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). We caution that this result is necessary but not sufficient: for the deployed box rule it is close to tautological, because byte-identical neighbours have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region. The corresponding signature-level *negative* anchor evidence is developed in §III-L.1 above (per-comparison ICCR $= 0.00060$ at cos$>0.95$, consistent with the prior published rate of $0.0005$). We frame the per-comparison rate as a specificity proxy under the assumption that inter-CPA pairs constitute a clean negative anchor, and we document in §III-L.4 that this assumption is partially violated by within-firm cross-CPA template-like collision structures.
+The only conservative hard-positive subset in the corpus is pixel-identical signatures: those whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce byte-identical images, so these signatures are a conservative hard-positive subset for image replication. On the Big-4 subset ($n = 262$ pixel-identical signatures), all three candidate checks — the deployed box rule, the K=3 hard label, and the reverse-anchor metric with a prevalence-calibrated cut — achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). We caution that this result is necessary but not sufficient: for the deployed box rule it is close to tautological, because byte-identical neighbours have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region. The corresponding signature-level *negative* anchor evidence is developed in §III-L.1 above (per-comparison ICCR $= 0.00060$ at cos$>0.95$, consistent with the corpus-wide rate of $0.0005$ reported in §IV-I). We frame the per-comparison rate as a specificity proxy under the assumption that inter-CPA pairs constitute a clean negative anchor, and we document in §III-L.4 that this assumption is partially violated by within-firm cross-CPA template-like collision structures.
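The quoted Wilson upper bound for the zero-miss result can be reproduced directly. A minimal sketch (the function name is ours; at $\hat{p}=0$ the upper bound collapses to the closed form $z^2/(n+z^2)$):

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n at ~95% (z=1.96)."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Zero positive-anchor misses among the 262 pixel-identical signatures:
lo, hi = wilson_interval(0, 262)
print(f"95% upper bound: {hi:.4f}")  # ~0.0145, matching the 1.45% above
```

The Wilson interval is preferable to the normal approximation here because the latter degenerates to a zero-width interval at $\hat{p}=0$.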
## H. Limitations
@@ -1112,7 +1112,7 @@ Several limitations should be transparent. We group them into primary methodolog
*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the deployed box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
-*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain their prior calibration and capture-rate evidence (Appendix B); the anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
+*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain their prior calibration and capture-rate evidence (supplementary materials); the anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
*Deployed-rate excess is not a presumed true-positive rate.* The $\sim 44$-pp per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the inter-CPA proxy rate (HC: $0.18$) cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; within-firm template sharing inflates the inter-CPA proxy baseline). The gap is best read as a same-CPA repeatability signal.
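The band boundaries named in this section can be made concrete as a decision sketch. Assumptions labelled loudly: only the HC, MC, and style-consistency (HSC) bands are specified in this excerpt, so the remaining categories of the five-way classifier are collapsed to `"other"` here, and `confidence_band` is a hypothetical name, not the deployed implementation.

```python
def confidence_band(max_cos: float, min_dhash: int) -> str:
    """Banding sketch for the sub-rules named in this section.
    max_cos / min_dhash are the pooled max-cosine and min-dHash
    over a signature's same-CPA candidate pool."""
    if max_cos > 0.95 and min_dhash <= 5:
        return "HC"     # high-confidence box: cos > 0.95 AND dHash <= 5
    if max_cos > 0.95 and 5 < min_dhash <= 15:
        return "MC"     # moderate-confidence band
    if min_dhash > 15:
        return "HSC"    # style-consistency band
    return "other"      # remaining categories not detailed in this excerpt

print(confidence_band(0.97, 3))   # HC
print(confidence_band(0.97, 9))   # MC
```

The sketch makes the limitation above visible: only the HC boundary is exercised by the ICCR calibration and sensitivity analysis, while the MC/HSC cut at dHash $= 15$ is taken from prior evidence.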
@@ -1243,7 +1243,7 @@ Future work falls in four directions. *First*, a small-scale human-rated labelle
# Appendix A. BD/McCrary Bin-Width Sensitivity (Signature Level)
-The main text (Section III-I, Section IV-D.2) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as a threshold estimator.
+The main text (Section III-I, Section IV-D Table VI) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as a threshold estimator.
This appendix documents the empirical basis for that framing by sweeping the bin width across four (variant, bin-width) panels: Firm A and full-sample, each in the cosine and $\text{dHash}_\text{indep}$ direction.
<!-- TABLE A.I: BD/McCrary Bin-Width Sensitivity (two-sided alpha = 0.05, |Z| > 1.96)
@@ -1288,22 +1288,19 @@ The full table-to-script provenance mapping, script source code, and report arte
**Data availability.** All audit reports analysed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, a publicly accessible regulatory disclosure platform. The CPA registry used to map signatures to certifying CPAs is publicly available. Signature images, model weights, and reproducibility scripts are available in the supplementary materials.
-**Funding.** [To be filled in before submission.]
+<!-- Funding statement to be inserted before submission:
+**Funding.** [acknowledge any grants, awards, or institutional support here]
+-->
<!--
-ARCHIVED. Not part of the IEEE Access submission.
-IEEE Access Regular Papers do not include a separate Impact Statement
-section. The text below is retained for possible reuse in a cover
-letter, grant report, or non-IEEE venue. It is excluded from the
-assembled paper by the manuscript export script.
-If reused, note that the wording "distinguishes genuinely hand-signed
+ARCHIVED. Not part of the IEEE Access submission. The block below is wrapped
+in an HTML comment so it does not render in the assembled paper. It is
+retained for possible reuse in a cover letter, grant report, or non-IEEE
+venue. If reused, note that the wording "distinguishes genuinely hand-signed
signatures from reproduced ones" overstates what a five-way confidence
classifier without a fully labeled test set establishes; soften before
external use.
--->
# Impact Statement (archived; not in IEEE Access submission)
@@ -1312,3 +1309,4 @@ When the signature on an audit report is produced by reproducing a stored image
We developed a pipeline that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan.
Combining deep-learning visual features with perceptual hashing, distributional diagnostics, and anchor-based inter-CPA coincidence-rate calibration, the system stratifies signatures into a five-way confidence-graded classification and quantifies how the practice varies across firms and over time.
With a future labelled evaluation set, the technology could support financial regulators in screening candidate non-hand-signed signatures at national scale.
+-->