Paper A v3.3: apply codex v3.2 peer-review fixes
Codex (gpt-5.4) second-round review recommended 'minor revision'. This commit addresses all issues flagged in that review.

## Structural fixes

- dHash calibration inconsistency (codex #1, most important): Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come from the whole-sample Firm A cosine-conditional dHash distribution (median=5, P95=15), not from the calibration-fold independent-minimum dHash distribution (median=2, P95=9), which we report elsewhere as descriptive anchors. Added an explicit note about the two dHash conventions and their relationship.
- Section IV-H framing (codex #2): Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence" to "Additional Firm A Benchmark Validation" and clarified in the section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully threshold-free, and H.3 uses the calibrated classifier. H.3's concluding sentence now says "the substantive evidence lies in the cross-firm gap" rather than claiming the test is threshold-free.
- Table XVI 93,979 typo fixed (codex #3): Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm).
- Held-out Firm A denominator 124+54=178 vs 180 (codex #4): Added an explicit note that 2 CPAs were excluded due to disambiguation ties in the CPA registry.
- Table VIII duplication (codex #5): Removed the duplicate accountant-level-only Table VIII comment; the comprehensive cross-level Table VIII subsumes it. Text now says "accountant-level rows of Table VIII (below)".
- Anonymization broken in Tables XIV-XVI (codex #6): Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/"Firm D" across Tables XIV, XV, and XVI. Table and caption language updated accordingly.
- Table X unit mismatch (codex #7): Dropped the precision, recall, and F1 columns. The table now reports FAR (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR (against the byte-identical positive anchor). III-K and IV-G.1 text updated to justify the change.
## Sentence-level fixes

- "three independent statistical methods" in Methodology III-A -> "three methodologically distinct statistical methods".
- "three independent methods" in Conclusion -> "three methodologically distinct methods".
- Abstract "~0.006 converging" now explicitly acknowledges that BD/McCrary produces no significant accountant-level discontinuity. Conclusion ditto.
- Discussion limitation sentence "BD/McCrary should be interpreted at the accountant level for threshold-setting purposes" rewritten to reflect the v3.3 result that BD/McCrary is a diagnostic, not a threshold estimator, at the accountant level.
- III-H "two analyses" -> "three analyses" (H.1 longitudinal stability, H.2 partner ranking, H.3 intra-report consistency).
- Related Work White 1982 overclaim rewritten: "consistent estimators of the pseudo-true parameter that minimizes KL divergence" replaces "guarantees asymptotic recovery".
- III-J "behavior is close to discrete" -> "practice is clustered".
- IV-D.2 pivot sentence "discreteness of individual behavior yields bimodality" -> "aggregation over signatures reveals clustered (though not sharply discrete) patterns".

Target journal remains IEEE Access. Output: Paper_A_IEEE_Access_Draft_v3.docx (395 KB). Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Binary file not shown.
File diff suppressed because it is too large
@@ -8,7 +8,7 @@ Unlike signature forgery, where an impostor imitates another person's handwritin
 We present an end-to-end AI pipeline that automatically detects non-hand-signed auditor signatures in financial audit reports.
 The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash).
 For threshold determination we apply three methodologically distinct methods---Kernel Density antimode with a Hartigan unimodality test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level.
-Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum that no two-component mixture cleanly separates, while accountant-level aggregates are clustered into three recognizable groups (BIC-best $K = 3$) with the three 1D methods converging within $\sim$0.006 of each other at cosine $\approx 0.975$.
+Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs), the methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum that no two-component mixture cleanly separates, while accountant-level aggregates are clustered into three recognizable groups (BIC-best $K = 3$), with the KDE antimode and the two mixture-based estimators converging within $\sim$0.006 of each other at cosine $\approx 0.975$; the Burgstahler-Dichev/McCrary test produces no significant discontinuity at the accountant level, consistent with clustered-but-smooth rather than sharply discrete accountant-level heterogeneity.
 A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with interview and visual evidence supporting majority non-hand-signing and a minority of hand-signers; we break the circularity of using the same firm for calibration and validation by a 70/30 CPA-level held-out fold.
 Validation against 310 byte-identical positive signatures and a $\sim$50,000-pair inter-CPA negative anchor yields FAR $\leq$ 0.001 with Wilson 95% confidence intervals at all accountant-level thresholds.
 To our knowledge, this represents the largest-scale forensic analysis of auditor signature authenticity reported in the literature.
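For reference, the Wilson 95% intervals cited throughout (abstract, Table X fix) follow the standard score-interval formula. A minimal sketch; the counts below are illustrative stand-ins (0 false accepts among ~50,000 inter-CPA pairs is the order of magnitude the abstract reports, not the paper's exact numbers):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 ~ 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# Illustrative: 0 false accepts among ~50,000 inter-CPA pairs still yields a
# nonzero (but tight) upper bound, unlike a naive Wald interval.
lo, hi = wilson_interval(0, 50_000)
```

The Wilson interval is the right choice here precisely because observed FARs are at or near zero, where Wald intervals collapse to a degenerate [0, 0].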
@@ -3,7 +3,7 @@
 ## Conclusion
 
 We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
-Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels.
+Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three methodologically distinct methods applied at two analysis levels.
 
 Our contributions are fourfold.
 
@@ -12,8 +12,8 @@ First, we argued that non-hand-signing detection is a distinct problem from sign
 Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
 
 Third, we introduced a three-method threshold framework combining KDE antimode (with a Hartigan unimodality test), Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check).
-Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the three 1D methods agree within $\sim 0.006$ at cosine $\approx 0.975$.
+Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the KDE antimode together with the Beta-mixture and logit-Gaussian estimators agree within $\sim 0.006$ at cosine $\approx 0.975$.
-The Burgstahler-Dichev/McCrary test finds no significant transition at the accountant level, consistent with clustered but smoothly mixed rather than sharply discrete accountant-level behavior.
+The Burgstahler-Dichev/McCrary test, by contrast, finds no significant transition at the accountant level, consistent with clustered but smoothly mixed rather than sharply discrete accountant-level heterogeneity.
 The substantive reading is therefore narrower than "discrete behavior": *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered with smooth cluster boundaries.
 
 Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
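The KDE-antimode component of the three-method framework can be sketched in a few lines: estimate the density, then take the minimum between the two modes. Everything below is illustrative (invented cluster locations, default Scott bandwidth, a hand-chosen search window between the modes); the paper's actual bandwidth and mode-search procedure may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic stand-in for accountant-level mean best-match cosine values:
# a lower "hand-signed" cluster and a near-1 "replicated" cluster.
x = np.concatenate([rng.normal(0.88, 0.02, 400),
                    rng.normal(0.99, 0.005, 600)])
x = np.clip(x, 0.0, 1.0)

kde = gaussian_kde(x)                      # Scott's-rule bandwidth by default
grid = np.linspace(0.80, 1.00, 2001)
dens = kde(grid)

# Antimode: the density minimum between the two modes (window chosen here
# from the known synthetic cluster centers; real code would locate the modes).
window = (grid > 0.88) & (grid < 0.99)
antimode = grid[window][np.argmin(dens[window])]
```

In the paper's framework this antimode is one of the three converging accountant-level threshold estimates; the Hartigan dip test then guards against reading an antimode into a unimodal density.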
@@ -100,8 +100,9 @@ While cosine similarity and dHash are designed to be robust to such variations,
 Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted non-hand-signing later).
 Extending the accountant-level analysis to auditor-year units is a natural next step.
 
-Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation.
+Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, and the test produces no significant transition at all at the accountant level.
-This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes.
+In our application, therefore, BD/McCrary contributes diagnostic information about local density-smoothness rather than an independent accountant-level threshold estimate; that role is played by the KDE antimode and the two mixture-based estimators.
+The BD/McCrary results remain informative as a robustness check---their non-transition at the accountant level is consistent with the dip-test and Beta-mixture evidence that accountant-level clustering is smooth rather than sharply discontinuous.
 
 Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
 Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.
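The "density-smoothness diagnostic" role of BD/McCrary can be illustrated with a deliberately simplified stand-in: bin the data around a candidate cutpoint, fit a line to the counts on each side, and compare the two fits at the cutpoint. This is NOT the McCrary estimator (which uses bias-corrected local-linear density estimation with a data-driven bandwidth); every choice below (bin width, window, Poisson-style scaling) is an assumption made for illustration.

```python
import numpy as np

def density_jump_z(x: np.ndarray, cut: float, bw: float = 0.01, k: int = 4) -> float:
    """Crude McCrary-style diagnostic: k bins of width bw on each side of
    `cut`, a line fit to the bin counts per side, extrapolated to the cut,
    scaled by a rough Poisson standard error.  Sketch only."""
    def side_pred(edges: np.ndarray) -> tuple[float, float]:
        counts = np.array([np.sum((x >= e) & (x < e + bw)) for e in edges])
        centers = edges + bw / 2
        slope, intercept = np.polyfit(centers, counts, 1)
        return slope * cut + intercept, counts.mean()

    pred_l, mean_l = side_pred(cut - bw * np.arange(k, 0, -1))
    pred_r, mean_r = side_pred(cut + bw * np.arange(0, k))
    se = np.sqrt(max(mean_l + mean_r, 1.0))
    return (pred_r - pred_l) / se

rng = np.random.default_rng(1)
smooth = rng.uniform(0.80, 1.00, 20_000)                   # no density jump
jumpy = np.concatenate([rng.uniform(0.80, 0.95, 5_000),    # density jumps ~9x
                        rng.uniform(0.95, 1.00, 15_000)])  # at cosine = 0.95
```

A smooth density yields a small statistic and a genuine jump a large one, which is exactly why a non-significant BD/McCrary result at the accountant level is evidence of smooth rather than sharply discrete clustering.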
@@ -4,7 +4,7 @@
 
 We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
 Fig. 1 illustrates the overall architecture.
-The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor.
+The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three methodologically distinct statistical methods and a pixel-identity anchor.
 
 Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
 From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.
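The two descriptors in the dual-descriptor verification reduce to compact operations. A minimal sketch: `cosine_sim` would operate on ResNet-50 embeddings in the actual pipeline, and `dhash_bits` here uses block-mean downsampling on a raw array rather than the usual library image resize, so the exact bit patterns are an assumption of this sketch.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dhash_bits(img: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Difference hash: shrink to (hash_size, hash_size + 1) via block means,
    then compare horizontally adjacent cells -> hash_size^2 bits."""
    rows = np.array_split(np.arange(img.shape[0]), hash_size)
    cols = np.array_split(np.arange(img.shape[1]), hash_size + 1)
    small = np.array([[img[np.ix_(r, c)].mean() for c in cols] for r in rows])
    return (small[:, 1:] > small[:, :-1]).astype(np.uint8).ravel()

def dhash_distance(img1: np.ndarray, img2: np.ndarray) -> int:
    """Hamming distance between the two 64-bit difference hashes."""
    return int(np.sum(dhash_bits(img1) != dhash_bits(img2)))

rng = np.random.default_rng(5)
img = rng.random((64, 72))            # stand-in for a cropped signature region
identical = dhash_distance(img, img)  # -> 0: a byte-identical reproduction
```

Cosine similarity captures style-level resemblance of deep features, while the dHash Hamming distance captures pixel-structure reproduction; the paper's point is that only their conjunction distinguishes consistent hand-signing from image replication.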
@@ -144,11 +144,12 @@ Second, independent visual inspection of randomly sampled Firm A reports reveals
 
 Third, our own quantitative analysis is consistent with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.
 
-Fourth, we additionally validate the Firm A benchmark through two analyses that do not depend on any threshold we subsequently calibrate:
+Fourth, we additionally validate the Firm A benchmark through three analyses reported in Section IV-H. Two of them are fully threshold-free, and one uses the downstream classifier as an internal consistency check:
-(a) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013--2023.
-(b) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firmwide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label. Firm A exhibits 89.9% intra-report agreement against 62--67% at the other Big-4 firms, consistent with firm-wide rather than partner-specific practice.
+(a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6--13% across 2013--2023, with the lowest share in 2023. The fixed 0.95 cutoff is not calibrated to Firm A; the stability itself is the finding.
+(b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013--2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
+(c) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62--67% at the other Big-4 firms. This test uses the calibrated classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.
 
-We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the interview and visual-inspection evidence, by the two threshold-independent analyses above, and by the held-out Firm A fold described in Section III-K.
+We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the interview and visual-inspection evidence, by the complementary analyses above, and by the held-out Firm A fold described in Section III-K.
 
 We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
 Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline.
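The ordinal ranking check in (b) reduces to a few lines: rank auditor-years by mean best-match cosine, take the top decile, and compare Firm A's share there with its baseline share. The firm counts below mirror the reported 27.8% baseline, but the similarity values are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative auditor-year table: firm label and mean best-match cosine.
firms = np.array(["A"] * 278 + ["B"] * 250 + ["C"] * 240 + ["D"] * 232)
mean_cos = np.where(firms == "A",
                    rng.normal(0.98, 0.01, firms.size),   # replication-dominated
                    rng.normal(0.90, 0.03, firms.size))   # other firms

order = np.argsort(mean_cos)[::-1]       # rank all auditor-years, descending
top = order[: firms.size // 10]          # top decile
share_top = np.mean(firms[top] == "A")   # Firm A's share of the top decile
baseline = np.mean(firms == "A")         # Firm A's overall share
concentration = share_top / baseline     # the paper reports ~3.5x
```

Because only the ordering of `mean_cos` enters, any monotone transformation of the similarity scale leaves the result unchanged, which is what makes this check threshold-free.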
@@ -212,7 +213,7 @@ All three methods are reported with their estimates and, where applicable, cross
 ## J. Accountant-Level Mixture Model
 
 In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
-The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *behavior* is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level *quality* lies on a continuous spectrum.
+The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *practice* is clustered (typically consistent adoption of non-hand-signing or consistent hand-signing within a given year) even when the output pixel-level *quality* lies on a continuous spectrum.
 
 We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
 For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
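A minimal version of the BIC-selected 2D mixture fit described in Section J, using scikit-learn on synthetic per-accountant aggregates. The three cluster locations and covariances are invented and made well-separated so that BIC recovers them cleanly; the real data need not separate this sharply.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Synthetic (mean best-match cosine, mean independent-minimum dHash) aggregates,
# three illustrative clusters (replicated / mixed / hand-signed).
X = np.vstack([
    rng.multivariate_normal([0.99, 2.0], np.diag([1e-5, 0.2]), 250),
    rng.multivariate_normal([0.95, 10.0], np.diag([1e-4, 2.0]), 250),
    rng.multivariate_normal([0.85, 25.0], np.diag([1e-3, 9.0]), 250),
])

# BIC selection over K in {1,...,5}, full covariance, 15 restarts per K,
# mirroring the procedure stated in the section.
bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         n_init=15, random_state=0).fit(X)
    bics[k] = gm.bic(X)
best_k = min(bics, key=bics.get)   # lowest BIC wins
```

With genuinely Gaussian, well-separated clusters the BIC penalty (six parameters per extra full-covariance 2D component) reliably stops the fit at the true component count.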
@@ -237,27 +238,34 @@ The heldout fold is used exclusively to report post-hoc capture rates with Wilso
|
|||||||
4. **Low-similarity same-CPA anchor (supplementary negative):** signatures whose maximum same-CPA cosine similarity is below 0.70.
|
4. **Low-similarity same-CPA anchor (supplementary negative):** signatures whose maximum same-CPA cosine similarity is below 0.70.
|
||||||
This anchor is retained for continuity with prior work but is small in our dataset ($n = 35$) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.
|
This anchor is retained for continuity with prior work but is small in our dataset ($n = 35$) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.
|
||||||
|
|
||||||
From these anchors we report precision, recall, $F_1$, FAR with Wilson 95% confidence intervals, and the Equal Error Rate (EER) interpolated at the threshold where FAR $=$ FRR, following biometric-verification reporting conventions [3].
|
From these anchors we report FAR with Wilson 95% confidence intervals (against the inter-CPA negative anchor) and FRR (against the byte-identical positive anchor), together with the Equal Error Rate (EER) interpolated at the threshold where FAR $=$ FRR, following biometric-verification reporting conventions [3].
|
||||||
|
Precision and $F_1$ are not meaningful in this anchor-based evaluation because the positive and negative anchors are constructed from different sampling units (intra-CPA byte-identical pairs vs random inter-CPA pairs), so their relative prevalence in the combined set is an arbitrary construction rather than a population parameter; we therefore omit precision and $F_1$ from Table X.
|
||||||
|
The 70/30 held-out Firm A fold of Section IV-G.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.
|
||||||
We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
|
We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
|
||||||
|
|
||||||
## L. Per-Document Classification
|
## L. Per-Document Classification
|
||||||
|
|
||||||
The final per-document classification combines the three-method thresholds with the dual-descriptor framework.
|
The final per-document classification combines the accountant-level cosine reference from Section IV-E with dHash-based structural stratification.
|
||||||
Rather than rely on a single cutoff, we assign each signature to one of five signature-level categories using convergent evidence from both descriptors with thresholds derived from the Firm A calibration fold (Section III-K):
|
We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:
|
||||||
|
|
||||||
1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq$ (calibration-fold Firm A dHash median).
|
1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq 5$.
|
||||||
Both descriptors converge on strong replication evidence consistent with Firm A's median behavior.
|
Both descriptors converge on strong replication evidence.
|
||||||
|
|
||||||
2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash between the calibration-fold dHash median and 95th percentile.
|
2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND $5 < $ dHash $\leq 15$.
|
||||||
Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations.
|
Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff, potentially due to scan variations.
|
||||||
|
|
||||||
3. **High style consistency:** Cosine $> 0.95$ AND dHash $>$ calibration-fold Firm A dHash 95th percentile.
|
3. **High style consistency:** Cosine $> 0.95$ AND dHash $> 15$.
|
||||||
High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.
|
High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.
|
||||||
|
|
||||||
4. **Uncertain:** Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
|
4. **Uncertain:** Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
|
||||||
|
|
||||||
5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.
|
5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.
|
||||||
|
|
||||||
Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed determines the document's classification).
|
We note two conventions about the dHash cutoffs.
|
||||||
|
First, the cutoffs $\leq 5$ and $\leq 15$ correspond to the whole-sample Firm A *cosine-conditional* dHash distribution's median and 95th percentile (the dHash to the cosine-nearest same-CPA match), not to the *independent-minimum* dHash distribution we use elsewhere.
|
||||||
|
The two dHash statistics are related but not identical: the whole-sample cosine-conditional distribution has median $= 5$ and 95th percentile $= 15$, while the calibration-fold independent-minimum distribution has median $= 2$ and 95th percentile $= 9$.
|
||||||
|
The classifier retains the cosine-conditional cutoffs for continuity with the preceding version of this work while the anchor-level capture-rate analysis reports both cosine-conditional and independent-minimum rates for comparability.
|
||||||
|
Second, because the cosine cutoff $0.95$ and the cosine crossover $0.837$ have simple percentile interpretations and are not calibrated *to the calibration fold specifically*, the classifier rules inherit thresholds derived from the whole-sample Firm A distribution rather than the 70% calibration fold; the held-out fold of Section IV-G.2 is the corresponding external validation.
|
||||||
|
|
||||||
|
Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed determines the document's classification).
|
||||||
This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.
|
This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.
|
The dHash thresholds ($\leq 5$ and $\leq 15$, corresponding to the whole-sample Firm A cosine-conditional dHash median and 95th percentile) are derived empirically rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated Firm A population.

*Finite mixture models.*

When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters.

For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation.

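The normalization from a dHash Hamming distance to a similarity on $[0,1]$ can be sketched as follows (a standard construction, assuming the conventional 64-bit dHash; function names are illustrative):

```python
def hamming_distance(h1: int, h2: int) -> int:
    """Number of differing bits between two dHash values."""
    return bin(h1 ^ h2).count("1")

def dhash_similarity(h1: int, h2: int, bits: int = 64) -> float:
    """Normalized Hamming-based similarity on [0, 1]: identical
    hashes score 1.0; hashes differing in every bit score 0.0."""
    return 1.0 - hamming_distance(h1, h2) / bits
```

Under this normalization, two 64-bit hashes at the whole-sample median distance of 5 bits have similarity $1 - 5/64 \approx 0.92$.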
Under mild regularity conditions, White's quasi-MLE result [41] supports interpreting maximum-likelihood estimates under a mis-specified parametric family as consistent estimators of the pseudo-true parameter that minimizes the Kullback-Leibler divergence to the data-generating distribution within that family; we use this result to justify the Beta-mixture fit as a principled approximation rather than as a guarantee that the true distribution is Beta.

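To illustrate the Beta-2 crossing used later, the following sketch locates the point where two weighted Beta component densities intersect, via bisection on the density difference. The weights and shape parameters below are invented for the example, not the fitted values:

```python
import math

def beta_pdf(x: float, a: float, b: float) -> float:
    """Beta(a, b) density, computed via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def mixture_crossing(w1, a1, b1, w2, a2, b2, lo, hi, tol=1e-10):
    """Bisection on g(x) = w1*f1(x) - w2*f2(x) over [lo, hi],
    assuming g changes sign exactly once on the bracket."""
    g = lambda x: w1 * beta_pdf(x, a1, b1) - w2 * beta_pdf(x, a2, b2)
    assert g(lo) * g(hi) < 0, "bracket must straddle the crossing"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Hypothetical components: a broad low-similarity component and a
# sharp high-similarity component; the crossing is the point where
# posterior responsibility tips from one component to the other.
x_star = mixture_crossing(0.4, 2.0, 5.0, 0.6, 50.0, 2.0, 0.05, 0.95)
```

The same bracketing logic applies to any pair of fitted components whose densities cross exactly once on the unit interval.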
The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported.

<!--

The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.

Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.

This motivates the pivot to the accountant-level analysis in Section IV-E, where aggregation over signatures reveals clustered (though not sharply discrete) patterns in individual-level signing *practice* that the signature-level analysis lacks.

## E. Accountant-Level Gaussian Mixture

Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2.

This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.

Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.

Third, applying the three-method framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary test does not produce a significant transition at the accountant level.

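The $\sim 0.006$ convergence figure is simply the maximum pairwise spread of the three estimates; as a quick check (values copied from the sentence above):

```python
from itertools import combinations

# Accountant-level cosine-mean threshold estimates from the three methods.
estimates = {
    "kde_antimode": 0.973,
    "beta2_crossing": 0.979,
    "logit_gmm2_crossing": 0.976,
}

# Maximum pairwise absolute difference across the three estimates.
spread = max(abs(a - b) for a, b in combinations(estimates.values(), 2))
```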
For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.

Table VIII summarizes all threshold estimates produced by the three methods across the two analysis levels for a compact cross-level comparison.

<!-- TABLE VIII: Threshold Convergence Summary Across Levels
| Level / method | Cosine threshold | dHash threshold |
Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.

As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).

Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.

We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor and FRR against the byte-identical positive anchor in Table X; these two error rates are well defined within their respective anchor populations.

The Equal-Error-Rate point, interpolated at FAR $=$ FRR, is located at cosine $= 0.990$ with EER $\approx 0$, which is trivially small because every byte-identical positive falls at cosine very close to 1.

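The EER interpolation can be sketched generically as a linear interpolation at the sign change of FAR $-$ FRR along the threshold sweep. The sweep values below are toy numbers for illustration, not Table X's:

```python
def interpolate_eer(thresholds, far, frr):
    """Find the threshold where FAR = FRR by linear interpolation
    between the two sweep points where FAR - FRR changes sign."""
    for i in range(len(thresholds) - 1):
        d0 = far[i] - frr[i]
        d1 = far[i + 1] - frr[i + 1]
        if d0 == 0:
            return thresholds[i], far[i]
        if d0 * d1 < 0:
            t = d0 / (d0 - d1)  # fraction of the way to the next point
            thr = thresholds[i] + t * (thresholds[i + 1] - thresholds[i])
            eer = far[i] + t * (far[i + 1] - far[i])
            return thr, eer
    raise ValueError("FAR and FRR curves do not cross on this sweep")

# Toy sweep: FAR falls and FRR rises as the threshold tightens.
thr, eer = interpolate_eer(
    [0.90, 0.95, 0.99],
    [0.020, 0.010, 0.000],
    [0.000, 0.000, 0.020],
)
```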
<!-- TABLE X: Cosine Threshold Sweep (positives = 310 byte-identical signatures; negatives = 50,000 inter-CPA pairs)
| Threshold | FAR | FAR 95% Wilson CI | FRR (byte-identical) |
|-----------|-----|-------------------|----------------------|
| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] | 0.000 |
| 0.900 | 0.0233 | [0.0221, 0.0247] | 0.000 |
| 0.945 (2D GMM marginal) | 0.0008 | [0.0006, 0.0011] | 0.000 |
| 0.950 | 0.0007 | [0.0005, 0.0009] | 0.000 |
| 0.973 (accountant KDE antimode) | 0.0003 | [0.0002, 0.0004] | 0.000 |
| 0.979 (accountant Beta-2) | 0.0002 | [0.0001, 0.0004] | 0.000 |
-->

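The Wilson intervals in Table X can be reproduced from the underlying counts. A sketch, assuming the first row's FAR of 0.2062 corresponds to 10,310 false accepts among the 50,000 negative pairs and taking $z = 1.96$:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96):
    """Wilson score confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# First Table X row: assumed FAR = 10,310 / 50,000 = 0.2062.
lo, hi = wilson_ci(10_310, 50_000)
```

Rounded to four decimals, `lo` and `hi` reproduce the tabulated [0.2027, 0.2098].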
Two caveats apply.

First, the gold-positive anchor is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
Zero FRR against this subset does not establish zero FRR against the broader positive class, and the reported FRR should therefore be interpreted as a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable miss rate.

Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.

### 2) Held-Out Firm A Validation (breaks calibration-validation circularity)

We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).

The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry; these two CPAs are excluded from both folds.
This exclusion has no effect on the accountant-level mixture analysis of Section IV-E, which uses the $\geq 10$-signature subset of 171 CPAs.

Thresholds are re-derived from calibration-fold percentiles only.

Table XI reports heldout-fold capture rates with Wilson 95% confidence intervals.

A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics.

## H. Additional Firm A Benchmark Validation

The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.

This section reports three complementary analyses that go beyond the whole-sample capture rates.
Subsection H.1 uses a fixed 0.95 cutoff but derives its information from the longitudinal stability of rates rather than from the absolute rate in any single year.
Subsection H.2 is fully threshold-independent (it uses only ordinal ranking).
Subsection H.3 applies the calibrated classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the informative quantity is the cross-firm *gap* rather than the absolute agreement rate at any one firm.

### 1) Year-by-Year Stability of the Firm A Left Tail

Firm A accounts for 1,287 of these (27.8% baseline share).

Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.

<!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)
| Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
|-------|-------------|--------|--------|--------|--------|-----------|--------------|
| 10% | 462 | 443 | 2 | 3 | 0 | 14 | 95.9% |
| 25% | 1,157 | 1,043 | 32 | 23 | 9 | 50 | 90.1% |
| 50% | 2,314 | 1,220 | 473 | 273 | 102 | 246 | 52.7% |
-->

Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile.

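The concentration ratios follow directly from Table XIV's counts and the 27.8% baseline share; a quick check (numbers copied from the table and text):

```python
# Top-10% bucket of Table XIV: 443 of 462 auditor-years are Firm A.
firm_a_top_decile_share = 443 / 462   # ~0.959, i.e. 95.9%
baseline_share = 0.278                # Firm A's overall share of auditor-years

# Over-representation of Firm A in the top decile relative to baseline.
concentration_ratio = firm_a_top_decile_share / baseline_share
# ~3.45x, i.e. roughly 3.5x at the top decile.
```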
Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.

<!-- TABLE XV: Firm A Share of Top-10% Similarity by Year
| Year | N auditor-years | Top-10% k | Firm A in top-10% | Firm A share | Firm A baseline |
|------|-----------------|-----------|-------------------|--------------|-----------------|
| 2013 | 324 | 32 | 32 | 100.0% | 26.2% |
| 2014 | 399 | 39 | 39 | 100.0% | 27.1% |
| 2015 | 394 | 39 | 38 | 97.4% | 27.2% |
| 2023 | 474 | 47 | 46 | 97.9% | 28.5% |
-->

This over-representation is a direct consequence of firm-wide non-hand-signing practice and is not derived from any threshold we subsequently calibrate.

It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.

### 3) Intra-Report Consistency

Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification.

Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.

For each report with exactly two signatures and complete per-signature data (83,970 reports whose two signers belong to the same firm, plus 384 mixed-firm reports whose two signers belong to different firms, for 84,354 reports in total), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.

Table XVI reports per-firm intra-report agreement (firm assignment is defined by the firm identity of both signers; mixed-firm reports are reported separately).

<!-- TABLE XVI: Intra-Report Classification Agreement by Firm
| Firm | Total 2-signer reports | Both non-hand-signed | Both uncertain | Both style | Both hand-signed | Mixed | Agreement rate |
|------|-----------------------|----------------------|----------------|------------|------------------|-------|----------------|
| Firm A | 30,222 | 26,435 | 734 | 0 | 4 | 3,049 | **89.91%** |
| Firm B | 17,121 | 9,260 | 2,159 | 5 | 6 | 5,691 | 66.76% |
| Firm C | 19,112 | 8,983 | 3,035 | 3 | 5 | 7,086 | 62.92% |
| Firm D | 8,375 | 3,028 | 2,376 | 0 | 3 | 2,968 | 64.56% |
| Non-Big-4 | 9,140 | 1,671 | 3,945 | 18 | 27 | 3,479 | 61.94% |
A report is "in agreement" if both signature labels fall in the same coarse bucket
(non-hand-signed = high+moderate; uncertain; style consistency; or likely hand-signed).
-->

Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.

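These figures can be recomputed from Table XVI's row counts, since agreement is the sum of the "Both ..." columns over the row total; a sketch using the Firm A row:

```python
def agreement_rate(both_counts, total):
    """Share of 2-signer reports whose two labels fall in the same
    coarse bucket (the 'Both ...' columns of Table XVI)."""
    return sum(both_counts) / total

# Firm A row: both non-hand-signed, both uncertain, both style,
# both hand-signed, out of 30,222 two-signer reports.
firm_a_rate = agreement_rate([26_435, 734, 0, 4], 30_222)  # ~0.8991
both_nhs_share = 26_435 / 30_222                           # ~0.8747
```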
The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.

This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) non-hand-signing practice.

We note that this test uses the calibrated classifier of Section III-L rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.

## I. Classification Results
