Paper A v3.13: resolve Opus 4.7 round-12 + codex gpt-5.5 round-13 findings

Opus 4.7 max-effort round-12 on v3.12 found 1 MAJOR + 7 MINOR residues; codex gpt-5.5 xhigh round-13 cross-verified 11/11 RESOLVED and caught one additional cosine-P95 ambiguity Opus missed (methodology L255). Total 12 text-only edits across 5 files. MAJOR M1 - Cosine P95→P7.5 terminology residue at two sites that cite the v3.12-corrected Section III-L but still wrote "P95" (self- contradiction). Fix: methodology L165 and results L247 both restated as "whole-sample Firm A P7.5 heuristic" with the 92.5%/7.5% complement spelled out. MINOR findings and fixes: - m1 Big-4 scope slip: methodology III-H(b) L166 and results IV-H.2 L311 said "every Big-4 auditor-year" but IV-H.2 ranking actually pools all 4,629 auditor-years across Big-4 and Non-Big-4. Both sites now say "every auditor-year ... across all firms." - m2 178 vs 180 Firm A CPA breakdown: intro L54 and conclusion L21 now add "of 180 registered CPAs; 178 after excluding two with disambiguation ties, Section IV-G.2" parenthetical to avoid the misleading 180−171=9 reading. - m3 IV-H.1 A2 citation: results L286 now explicitly invokes the A2 within-year label-uniformity convention (Section III-G) when reading the left-tail share as a partner-level "minority of hand- signers." - m4 IV-F L177 cross-ref / fold distinction: corrected Section III-H → Section III-L anchor, and added explicit note that the 0.95 heuristic is a whole-sample anchor while Table XI thresholds are calibration-fold-derived (cosine P5 = 0.9407). - m5 Table XVI (30,222) vs Table XVII (30,226) Firm A count gap: results L406 now explains the 4-report difference (XVI restricts to both-signers-Firm-A single-firm two-signer reports; XVII counts at-least-one-Firm-A signer under the 84,386-document cohort). - m6 Methodology L156 "four independent quantitative analyses" actually enumerated 6 items: rephrased as "three primary independent quantitative analyses plus a fourth strand comprising three complementary checks." - m7 Abstract "cluster into three groups" restored the "smoothly- mixed" qualifier to match Discussion V-B and Conclusion L17. - Codex-caught residue at methodology L255 ("Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions") grammatically applied P95 to cosine too. Rewrote as "cosine median, P1, and P5 (lower-tail) and dHash_indep median and P95 (upper-tail)" matching Table XI L233 exactly. No re-computation. All tables (IV-XVIII) and Appendix A numbers unchanged. Abstract at 249/250 words after smoothly-mixed qualifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paper A v3.12: resolve Gemini 3.1 Pro round-11 full-paper review findings
2026-04-24 21:21:37 +08:00 · 2026-04-24 20:10:20 +08:00 · 2026-04-24 19:52:45 +08:00
6 changed files with 57 additions and 38 deletions
@@ -2,6 +2,6 @@

 <!-- IEEE Access target: <= 250 words, single paragraph -->

-Regulations require Certified Public Accountants (CPAs) to attest to each audit report by affixing a signature. Digitization makes reusing a stored signature image across reports trivial---through administrative stamping or firm-level electronic signing---potentially undermining individualized attestation. Unlike forgery, *non-hand-signed* reproduction reuses the legitimate signer's own stored image, making it visually invisible to report users and infeasible to audit at scale manually. We present a pipeline integrating a Vision-Language Model for signature-page identification, YOLOv11 for signature detection, and ResNet-50 for feature extraction, followed by dual-descriptor verification combining cosine similarity and difference hashing. For threshold determination we apply two estimators---kernel-density antimode with a Hartigan unimodality test and an EM-fitted Beta mixture with a logit-Gaussian robustness check---plus a Burgstahler-Dichev/McCrary density-smoothness diagnostic, at the signature and accountant levels. Applied to 90,282 audit reports filed in Taiwan over 2013-2023 (182,328 signatures from 758 CPAs), the methods reveal a level asymmetry: signature-level similarity is a continuous quality spectrum that no two-component mixture separates, while accountant-level aggregates cluster into three groups with the antimode and two mixture estimators converging within $\sim$0.006 at cosine $\approx 0.975$. A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with visual inspection and accountant-level mixture evidence supporting majority non-hand-signing and a minority of hand-signers; capture rates on both 70/30 calibration and held-out folds are reported with Wilson 95% intervals to make fold-level variance visible. Validation against 310 byte-identical positives and a $\sim$50,000-pair inter-CPA negative anchor yields FAR $\leq$ 0.001 at all accountant-level thresholds.
+Regulations require Certified Public Accountants (CPAs) to attest to each audit report by affixing a signature. Digitization makes reusing a stored signature image across reports trivial---through administrative stamping or firm-level electronic signing---potentially undermining individualized attestation. Unlike forgery, *non-hand-signed* reproduction reuses the legitimate signer's own stored image, making it visually invisible to report users and infeasible to audit at scale manually. We present a pipeline integrating a Vision-Language Model for signature-page identification, YOLOv11 for signature detection, and ResNet-50 for feature extraction, followed by dual-descriptor verification combining cosine similarity and difference hashing. For threshold determination we apply two estimators---kernel-density antimode with a Hartigan unimodality test and an EM-fitted Beta mixture with a logit-Gaussian robustness check---plus a Burgstahler-Dichev/McCrary density-smoothness diagnostic, at the signature and accountant levels. Applied to 90,282 audit reports filed in Taiwan over 2013-2023 (182,328 signatures from 758 CPAs), the methods reveal a level asymmetry: signature-level similarity is a continuous quality spectrum that no two-component mixture separates, while accountant-level aggregates cluster into three smoothly-mixed groups with the antimode and two mixture estimators converging within $\sim$0.006 at cosine $\approx 0.975$. A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with visual inspection and accountant-level mixture evidence supporting majority non-hand-signing alongside within-firm heterogeneity consistent with a minority of hand-signers; capture rates on both 70/30 calibration and held-out folds are reported with Wilson 95% intervals to make fold-level variance visible. Validation against 310 byte-identical positives and a $\sim$50,000-pair inter-CPA negative anchor yields FAR $\leq$ 0.001 at all accountant-level thresholds.

 <!-- Target word count: 240 -->
@@ -5,7 +5,7 @@
 We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
 Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through two methodologically distinct threshold estimators and a density-smoothness diagnostic applied at two analysis levels.

-Our contributions are fourfold.
+The seven numbered contributions listed in Section I can be grouped into four broader methodological themes, summarized below.

 First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.

@@ -18,7 +18,7 @@ The substantive reading is therefore narrower than "discrete behavior": *pixel-l

 Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
 To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85-95% capture band differ by 1-5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure.
-This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
+This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 registered CPAs; 178 after excluding two with disambiguation ties, Section IV-G.2), the 139 / 32 split between the high-replication and middle-band clusters.

 An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.

@@ -40,8 +40,8 @@ Our evidence across multiple analyses rules out that assumption for Firm A while
 Three convergent strands of evidence support the replication-dominated framing.
 First, the visual-inspection evidence: randomly sampled Firm A reports exhibit pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
 Second, the signature-level statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
-Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures ($\geq 10$) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---directly quantifying the within-firm minority of hand-signers.
-Nine additional Firm A CPAs are excluded from the GMM for having fewer than 10 signatures, so we cannot place them in a cluster from the cross-sectional analysis alone.
+Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures ($\geq 10$) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---consistent with within-firm heterogeneity in signing practice (spanning a minority of hand-signers, CPAs undergoing mid-sample mechanism transitions, and CPAs whose pooled coordinates reflect mixed-quality replication) rather than a pure replication population.
+Of the 178 valid Firm A CPAs (the 180 registered CPAs minus two excluded for disambiguation ties in the registry; Section IV-G.2), seven are outside the GMM for having fewer than 10 signatures, so we cannot place them in a cluster from the cross-sectional analysis alone.
 The held-out Firm A 70/30 validation (Section IV-G.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules (extreme rules are statistically indistinguishable; operational rules in the 85–95% band differ between folds by 1–5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure).
 The accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to fold-level sampling variance.

@@ -72,7 +72,7 @@ The framing we adopt---replication-dominated rather than replication-pure---is a
 ## F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation

 A further methodological contribution is the combination of byte-level pixel identity as an annotation-free *conservative* gold positive and a large random-inter-CPA negative anchor.
-Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review.
+Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is pair-level proof of image reuse and, modulo the narrow source-template edge case discussed in the seventh limitation below, a conservative positive for non-hand-signing without requiring human review.
 In our corpus 310 signatures satisfied this condition.
 We emphasize that byte-identical pairs are a *subset* of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways).
 Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.
@@ -99,13 +99,17 @@ This effect would bias classification toward false negatives rather than false p
 Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements.
 While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions.

-Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted non-hand-signing later).
-Extending the accountant-level analysis to auditor-year units is a natural next step.
+Fifth, the accountant-level summary (Section III-J) is a cross-year pooled statistic by construction, so a CPA whose signing mechanism changed mid-sample is placed at a weighted mix of component means rather than at a single regime centroid.
+Extending the accountant-level analysis to auditor-year units---using the same convergent threshold framework at finer temporal resolution---is the natural next step for resolving such within-accountant transitions.

 Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, and the test produces no significant transition at all at the accountant level.
 In our application, therefore, BD/McCrary contributes diagnostic information about local density-smoothness rather than an independent accountant-level threshold estimate; that role is played by the KDE antimode and the two mixture-based estimators.
 We emphasize that the accountant-level BD/McCrary null is *consistent with*---not affirmative proof of---smoothly mixed cluster boundaries: the BD/McCrary test is known to have limited statistical power at modest sample sizes, and with $N = 686$ accountants in our analysis the test cannot reliably detect anything less than a sharp cliff-type density discontinuity.
 Failure to reject the smoothness null at this sample size therefore reinforces BD/McCrary's role as a diagnostic rather than a definitive estimator; the substantive claim of smoothly-mixed accountant-level clustering rests on the joint weight of the dip-test and Beta-mixture evidence together with the BD null, not on the BD null alone.

+Seventh, the max/min detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed.
+In the rare case that one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as the stamping or e-signature template, the pair correctly identifies image reuse but misattributes the non-hand-signed status to the source exemplar.
+This misattribution affects at most one source document per template variant per CPA (the exemplar from which the template was produced), is not expected to be common given that stored signature templates are typically generated in a separate acquisition step rather than extracted from submitted audit reports, and does not materially affect aggregate capture rates at the firm level.
+
 Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
 Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.
@@ -51,12 +51,12 @@ By requiring convergent evidence from both descriptors, we can differentiate *st
 A second distinctive feature is our framing of the calibration reference.
 One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing for the majority of its certifying partners, while not ruling out that a minority may continue to hand-sign some reports.
 We therefore treat Firm A as a *replication-dominated* calibration reference rather than a pure positive class.
-This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in total) cluster into an accountant-level "middle band" rather than the high-replication mode.
+This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in the registry; 178 after excluding two with disambiguation ties, see Section IV-G.2) cluster into an accountant-level "middle band" rather than the high-replication mode.
 Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence among the visual-inspection evidence, the signature-level statistics, and the accountant-level mixture.

 A third distinctive feature is our unit-of-analysis treatment.
 Our threshold-framework analysis reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are clustered into three recognizable groups (BIC-best $K = 3$).
-The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, while *accountant-level aggregate behaviour* is clustered but not sharply discrete---a given CPA tends to cluster into a dominant regime (high-replication, middle-band, or hand-signed-tendency), though the boundaries between regimes are smooth rather than discontinuous.
+The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, while *accountant-level aggregate behaviour* is clustered but not sharply discrete: each CPA's cross-year-pooled coordinates sit closest to one of three recognizable groups (high-replication, middle-band, or hand-signed-tendency), reflecting a pooled observed tendency rather than a time-invariant regime, with smooth rather than discontinuous boundaries between groups.
 At the accountant level, the KDE antimode and the two mixture-based estimators (Beta-2 crossing and its logit-Gaussian robustness counterpart) converge within $\sim 0.006$ on a cosine threshold of approximately $0.975$, while the Burgstahler-Dichev / McCrary density-smoothness diagnostic finds no significant transition---an outcome (robust across a bin-width sweep, Appendix A) consistent with smoothly mixed clusters.
 The two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are reported as a complementary cross-check rather than as the primary accountant-level threshold.

@@ -113,19 +113,32 @@ Cosine similarity and dHash are both robust to the noise introduced by the print

 ## G. Unit of Analysis and Summary Statistics

-Two unit-of-analysis choices are relevant for this study: (i) the *signature*---one signature image extracted from one report---and (ii) the *accountant*---the collection of all signatures attributed to a single CPA across the sample period.
-A third composite unit---the *auditor-year*, i.e. all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution.
+Three unit-of-analysis choices are relevant for this study, ordered from finest to coarsest: (i) the *signature*---one signature image extracted from one report; (ii) the *auditor-year*---all signatures by one CPA within one fiscal year; and (iii) the *accountant*---the collection of all signatures attributed to a single CPA across the full sample period.
+All three are well-defined as descriptive groupings without additional assumptions; the distinction that matters for *regime interpretation*---i.e., reading a unit's summary as "this CPA's signing mechanism for that unit"---is that the auditor-year is the smallest CPA-level aggregation that is coherent under the stipulations below without additional across-year homogeneity, whereas the accountant unit is a deliberate cross-year pooling that may blend distinct signing-mechanism regimes when a CPA's practice changes over the sample period.
+We use all three units in the paper and specify the role of each at the point of use.

-For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA.
+For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA (over the full same-CPA set, not restricted to the same fiscal year).
 The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism.
 Mean statistics would dilute this signal.

-We also adopt an explicit *within-auditor-year no-mixing* identification assumption.
-Specifically, within any single fiscal year we treat a given CPA's signing mechanism as uniform: a CPA who reproduces one signature image in that year is assumed to do so for every report, and a CPA who hand-signs in that year is assumed to hand-sign every report in that year.
-Domain-knowledge from industry practice at Firm A is consistent with this assumption for that firm during the sample period.
-Under the assumption, per-auditor-year summary statistics are well defined and robust to outliers: if even one pair of same-CPA signatures in the year is near-identical, the max/min captures it.
-The intra-report consistency analysis in Section IV-H.3 is a related but distinct check: it tests whether the *two co-signing CPAs on the same report* receive the same signature-level label (firm-level signing-practice homogeneity) rather than testing whether a single CPA mixes mechanisms within a fiscal year.
-A direct empirical check of the within-auditor-year assumption at the same-CPA level would require labeling multiple reports of the same CPA in the same year and is left to future work; in this paper we maintain the assumption as an identification convention motivated by industry practice and bounded by the worst-case aggregation rule of Section III-L.
+We distinguish two stipulations by the role each plays, in order to avoid overstating the paper's reliance on them.
+
+**(A1) Pair-detectability** is a statistical assumption scoped to the same-CPA pool (pooled across fiscal years, matching the max/min computation above): if a CPA uses image replication anywhere in the corpus, at least one pair of same-CPA signatures is near-identical after reproduction noise, so that max cosine / min dHash detects the replication.
+This is plausible for high-volume stamping or firm-level electronic-signing workflows---where a stored image is typically reused many times under similar scan and compression conditions---but is not guaranteed in sparse CPA-corpora with only one observed replicated report, when multiple template variants are in use, or when scan-stage noise pushes a replicated pair outside the detection regime.
+A1 is what the per-signature detector requires to be sensitive to replication; it is a cross-year pair-existence property, not a within-year uniformity claim.
+
+**(A2) Within-year label uniformity** is an interpretive convention used when a signature-level label is *read as* "this CPA's signing mechanism for that fiscal year": within any single fiscal year we treat the CPA's mechanism as uniform, i.e., a CPA who replicates any signature image in that year is treated as doing so for every report in that year, and a hand-signer is treated as hand-signing every report in that year.
+A2 is consistent with industry practice at Firm A during the sample period, but may weaken at other Big-4 firms during the 2019--2021 digitalization-transition years, in which a CPA's mechanism could in principle shift mid-year as firm-level electronic-signing systems were rolled out.
+We therefore read A2 as a domain-motivated default rather than a universally validated empirical claim.
+The arithmetic statistics reported in this paper do not require A2 for their definition or computation: the per-signature classifier (Section III-L) operates at signature level, the accountant-level mixture (Section III-J) uses mean statistics over the full same-CPA pool, and the partner-level ranking (Section IV-H.2) uses a per-auditor-year mean---none of which require within-year uniformity to be well-defined.
+A2 does, however, underwrite certain *interpretive* readings---most notably, the framing in Section IV-H.1 of Firm A's yearly left-tail share as a partner-level "minority of hand-signers" rather than a bare signature-level rate---and the downstream use of per-signature or per-auditor-year labels as regime labels for auditor-behavior research.
+
+We explicitly *do not* assume across-year homogeneity.
+A CPA's mechanism may change across fiscal years---the 2019--2021 Big-4 digitalization trends documented in Section IV-H are consistent with such changes---and accountant-level summary statistics (Section III-J) therefore represent a cross-year pooled summary that may blend multiple regimes for the same CPA.
+We treat this as a design choice: the accountant-level aggregates characterize each CPA's overall distribution over the full sample period, not a single time-invariant regime.
+
+The intra-report consistency analysis in Section IV-H.3 is a related but distinct check: it tests whether the *two co-signing CPAs on the same report* receive the same signature-level label (firm-level signing-practice homogeneity) rather than testing A2 at the same-CPA level.
+A direct empirical check of A2 would require labeling multiple reports of the same CPA in the same year and is left to future work; as noted above, no reported statistic relies on A2, and A2's interpretive scope is further bounded by the worst-case aggregation rule of Section III-L.

 For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean *independent minimum dHash* across all signatures of that CPA.
 The *independent minimum dHash* of a signature is defined as the minimum Hamming distance to *any* other signature of the same CPA (over the full same-CPA set).
@@ -140,7 +153,7 @@ Rather than treating Firm A as a synthetic or laboratory positive control, we tr
 The background context for this choice is practitioner knowledge about Firm A's signing practice: industry practice at the firm is widely understood among practitioners to involve reproducing a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports.
 We use this only as background context for why Firm A is a plausible calibration candidate; the *evidence* for Firm A's replication-dominated status comes entirely from the paper's own analyses, which do not depend on any claim about signing practice beyond what the audit-report images themselves show.

-We establish Firm A's replication-dominated status through four independent quantitative analyses, each of which can be reproduced from the public audit-report corpus alone:
+We establish Firm A's replication-dominated status through three primary independent quantitative analyses plus a fourth strand comprising three complementary checks, each of which can be reproduced from the public audit-report corpus alone:

 First, *independent visual inspection* of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.

@@ -149,8 +162,8 @@ Second, *whole-sample signature-level rates*: 92.5% of Firm A's per-signature be
 Third, *accountant-level mixture analysis* (Section IV-E): a BIC-selected three-component Gaussian mixture over per-accountant mean cosine and mean dHash places 139 of the 171 Firm A CPAs (with $\geq 10$ signatures) in the high-replication C1 cluster and 32 in the middle-band C2 cluster, directly quantifying the within-firm heterogeneity.

 Fourth, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-H. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
-  (a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P95 of the per-signature cosine distribution (Section III-L); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
-  (b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
+  (a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P7.5 heuristic (Section III-L; 92.5% of whole-sample Firm A signatures exceed this cutoff); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
+  (b) *Partner-level similarity ranking (Section IV-H.2).* When every auditor-year is ranked globally by its per-auditor-year mean best-match cosine (across all firms: Big-4 and Non-Big-4), Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
  (c) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.

 We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the visual inspection, the accountant-level mixture, the three complementary analyses above, and the held-out Firm A fold (which confirms the qualitative replication-dominated framing; fold-level rate differences are disclosed in Section IV-G.2) described in Section III-K.
@@ -219,7 +232,8 @@ The accountant-level estimates from the two threshold estimators (together with
 ## J. Accountant-Level Mixture Model

 In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
-The motivation is the expectation---consistent with industry-practice knowledge at Firm A---that an individual CPA's signing *practice* is clustered (typically consistent adoption of non-hand-signing or consistent hand-signing within a given year) even when the output pixel-level *quality* lies on a continuous spectrum.
+The motivation is that an individual CPA's cross-year-pooled signing *tendency*---their full-sample distribution of best-match statistics---is expected to cluster with other CPAs of similar tendency, even when the output pixel-level *quality* at the signature level lies on a continuous spectrum.
+Cluster membership in the accountant-level fit is accordingly best read as a *pooled observed tendency* over the CPA's full sample-period signature set rather than as a time-invariant signing regime; where a CPA switched mechanisms during the sample period, their accountant-level coordinates reflect a weighted mix of the corresponding regimes.

 We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
 For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
@@ -229,16 +243,16 @@ For the selected $K^*$ we report component means, weights, per-component firm co
 Rather than construct a stratified manual-annotation validation set, we validate the classifier using four naturally occurring reference populations that require no human labeling:

 1. **Pixel-identical anchor (gold positive, conservative subset):** signatures whose nearest same-CPA match is byte-identical after crop and normalization.
-Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth *for the byte-identical subset* of non-hand-signed signatures.
-We emphasize that this anchor is a *subset* of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).
+Handwriting physics makes byte-identity impossible under independent signing events, so a byte-identical same-CPA pair is pair-level proof of image reuse and---for the byte-identical subset---conservative ground truth for non-hand-signed signatures; the narrow exception, in which a genuinely hand-signed exemplar was subsequently reused as the stamping or e-signature template, is discussed as a Limitation in Section V-G.
+We further emphasize that this anchor is a *subset* of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).

 2. **Inter-CPA negative anchor (large gold negative):** $\sim$50,000 pairs of signatures randomly sampled from *different* CPAs.
 Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps.
 This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold.

-3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail contains a minority of hand-signers, as directly evidenced by the 32/171 middle-band share in the accountant-level mixture (Section III-H).
+3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference with within-firm heterogeneity in the left tail (consistent with a minority of hand-signers), as evidenced by the 32/171 middle-band share in the accountant-level mixture (Section III-H).
 Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we make the within-Firm-A sampling variance visible by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% *calibration* fold and a 30% *heldout* fold.
-Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions are derived from the calibration fold only.
+The calibration-fold percentiles used in thresholding---cosine median, P1, and P5 (lower-tail, since higher cosine indicates greater similarity), and dHash_indep median and P95 (upper-tail, since lower dHash indicates greater similarity)---are derived from the 70% calibration fold only.
 The heldout fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals.

 4. **Low-similarity same-CPA anchor (supplementary negative):** signatures whose maximum same-CPA cosine similarity is below 0.70.
@@ -273,7 +287,7 @@ High feature-level similarity without structural corroboration---consistent with
 5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.

 We note three conventions about the thresholds.
-First, the cosine cutoff $0.95$ is the whole-sample Firm A P95 of the per-signature best-match cosine distribution (chosen for its transparent percentile interpretation in the whole-sample reference distribution), and the cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; both are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions.
+First, the cosine cutoff $0.95$ corresponds to approximately the whole-sample Firm A P7.5 of the per-signature best-match cosine distribution---that is, 92.5% of whole-sample Firm A signatures exceed this cutoff and 7.5% fall at or below it (Section III-H)---chosen as a round-number lower-tail boundary whose complement (92.5% above) has a transparent interpretation in the whole-sample reference distribution; the cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; both are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions.
 Section IV-G.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible.
 Second, the dHash cutoffs $\leq 5$ and $> 15$ are chosen from the whole-sample Firm A $\text{dHash}_\text{indep}$ distribution: $\leq 5$ captures the upper tail of the high-similarity mode (whole-sample Firm A median $\text{dHash}_\text{indep} = 2$, P75 $\approx 4$, so $\leq 5$ is the band immediately above median), while $> 15$ marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction.
 Third, the three accountant-level 1D estimators (KDE antimode $0.973$, Beta-2 crossing $0.979$, logit-GMM-2 crossing $0.976$) and the accountant-level 2D GMM marginal ($0.945$) are *not* the operational thresholds of this classifier: they are the *convergent external reference* that supports the choice of signature-level operational cut.
@@ -69,7 +69,7 @@ The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-sig
 | Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
 -->

-Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in the accountant-level mixture (Section IV-E).
+Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to within-firm heterogeneity---consistent with a minority of hand-signing Firm A partners---as identified in the accountant-level mixture (Section IV-E).
 The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
 At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.

@@ -174,7 +174,7 @@ Table IX reports the proportion of Firm A signatures crossing each candidate thr
 All rates computed exactly from the full Firm A sample (N = 60,448 signatures); counts reproduce from `signature_analysis/24_validation_recalibration.py` (whole_firm_a section).
 -->

-Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H.
+Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to the whole-sample Firm A distribution described in Section III-L (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
 The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with both the accountant-level crossings (Section IV-E) and the 139/32 high-replication versus middle-band split within Firm A (Section IV-E).
 Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.

@@ -184,7 +184,7 @@ We report three validation analyses corresponding to the anchors of Section III-

 ### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor

-Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
+Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G.
 As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
 Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
 We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
@@ -244,7 +244,7 @@ We therefore interpret the held-out fold as confirming the qualitative finding (

 ### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$

-The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P95 heuristic.
+The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P7.5 heuristic (i.e., 7.5% of whole-sample Firm A signatures lie at or below 0.95; see Section III-L).
 The accountant-level convergent threshold analysis (Section IV-E) places the primary accountant-level reference between $0.973$ and $0.979$ (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing), and the accountant-level 2D-GMM marginal at $0.945$.
 Because the classifier operates at the signature level while these convergent accountant-level estimates are at the accountant level, they are formally non-substitutable.
 We report a sensitivity check in which the classifier's operational cut cos $> 0.95$ is replaced by the nearest accountant-level reference, cos $> 0.945$.
@@ -283,7 +283,7 @@ Subsection H.3 applies the calibrated classifier and is therefore a consistency
 ### 1) Year-by-Year Stability of the Firm A Left Tail

 Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year.
-Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign.
+Under the replication-dominated interpretation (Section III-H) and the within-year label-uniformity convention A2 (Section III-G), this left-tail share is read as a partner-level minority of Firm A CPAs who continue to hand-sign rather than as a bare signature-level rate.
 Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.

 <!-- TABLE XIII: Firm A Per-Year Cosine Distribution
@@ -308,12 +308,13 @@ This stability supports the replication-dominated framing: a persistent minority

 ### 2) Partner-Level Similarity Ranking

-If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all Big-4 auditor-years.
+If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all auditor-years (across all firms).
 We test this prediction directly.

 For each auditor-year (CPA $\times$ fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
 Firm A accounts for 1,287 of these (27.8% baseline share).
 Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.
+The per-signature best-match cosine underlying each auditor-year mean is taken over the full same-CPA pool (Section III-G) and may match against signatures from other fiscal years, so the auditor-year mean reflects the year's signatures' position within the CPA's full-sample similarity structure rather than purely within-year similarity; a within-year-restricted sensitivity replication is a natural robustness check and is left to future work.

 <!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)
 | Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
@@ -342,7 +343,7 @@ Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 10
 | 2023 | 474 | 47 | 46 | 97.9% | 27.4% |
 -->

-This over-representation is a direct consequence of firm-wide non-hand-signing practice and is not derived from any threshold we subsequently calibrate.
+This over-representation is consistent with firm-wide non-hand-signing practice at Firm A and is not derived from any threshold we subsequently calibrate.
 It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.

 ### 3) Intra-Report Consistency
@@ -351,8 +352,8 @@ Taiwanese statutory audit reports are co-signed by two engagement partners (a pr
 Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification.
 Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.

-For each report with exactly two signatures and complete per-signature data (83,970 reports assigned to a single firm, plus 384 reports with one signer per firm in the mixed-firm buckets for 84,354 total), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.
-Table XVI reports per-firm intra-report agreement (firm-assignment defined by the firm identity of both signers; mixed-firm reports are reported separately).
+For each report with exactly two signatures and complete per-signature data (84,354 reports total: 83,970 single-firm reports, in which both signers are at the same firm, and 384 mixed-firm reports, in which the two signers are at different firms), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.
+Table XVI reports per-firm intra-report agreement for the 83,970 single-firm reports only (firm-assignment defined by the common firm identity of both signers); the 384 mixed-firm reports (0.46% of the 2-signature corpus) are excluded from the intra-report analysis because firm-level agreement is not well defined when the two signers are at different firms.

 <!-- TABLE XVI: Intra-Report Classification Agreement by Firm
 | Firm | Total 2-signer reports | Both non-hand-signed | Both uncertain | Both style | Both hand-signed | Mixed | Agreement rate |
@@ -402,7 +403,7 @@ A cosine-only classifier would treat all 71,656 identically; the dual-descriptor

 96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
 This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the 32/171 middle-band minority identified by the accountant-level mixture (Section IV-E).
-The absence of any meaningful "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
+The absence of any meaningful "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 count here is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset in Table XVI by 4 reports) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
 We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.

 ### 2) Cross-Method Agreement