Paper A v3.3: apply codex v3.2 peer-review fixes

Codex (gpt-5.4) second-round review recommended 'minor revision'. This commit addresses all issues flagged in that review. ## Structural fixes - dHash calibration inconsistency (codex #1, most important): Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come from the whole-sample Firm A cosine-conditional dHash distribution (median=5, P95=15), not from the calibration-fold independent-minimum dHash distribution (median=2, P95=9) which we report elsewhere as descriptive anchors. Added explicit note about the two dHash conventions and their relationship. - Section IV-H framing (codex #2): Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence" to "Additional Firm A Benchmark Validation" and clarified in the section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully threshold-free, H.3 uses the calibrated classifier. H.3's concluding sentence now says "the substantive evidence lies in the cross-firm gap" rather than claiming the test is threshold-free. - Table XVI 93,979 typo fixed (codex #3): Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm). - Held-out Firm A denominator 124+54=178 vs 180 (codex #4): Added explicit note that 2 CPAs were excluded due to disambiguation ties in the CPA registry. - Table VIII duplication (codex #5): Removed the duplicate accountant-level-only Table VIII comment; the comprehensive cross-level Table VIII subsumes it. Text now says "accountant-level rows of Table VIII (below)". - Anonymization broken in Tables XIV-XVI (codex #6): Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/ "Firm D" across Tables XIV, XV, XVI. Table and caption language updated accordingly. - Table X unit mismatch (codex #7): Dropped precision, recall, F1 columns. Table now reports FAR (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR (against the byte-identical positive anchor). III-K and IV-G.1 text updated to justify the change. ## Sentence-level fixes - "three independent statistical methods" in Methodology III-A -> "three methodologically distinct statistical methods". - "three independent methods" in Conclusion -> "three methodologically distinct methods". - Abstract "~0.006 converging" now explicitly acknowledges that BD/McCrary produces no significant accountant-level discontinuity. - Conclusion ditto. - Discussion limitation sentence "BD/McCrary should be interpreted at the accountant level for threshold-setting purposes" rewritten to reflect v3.3 result that BD/McCrary is a diagnostic, not a threshold estimator, at the accountant level. - III-H "two analyses" -> "three analyses" (H.1 longitudinal stability, H.2 partner ranking, H.3 intra-report consistency). - Related Work White 1982 overclaim rewritten: "consistent estimators of the pseudo-true parameter that minimizes KL divergence" replaces "guarantees asymptotic recovery". - III-J "behavior is close to discrete" -> "practice is clustered". - IV-D.2 pivot sentence "discreteness of individual behavior yields bimodality" -> "aggregation over signatures reveals clustered (though not sharply discrete) patterns". Target journal remains IEEE Access. Output: Paper_A_IEEE_Access_Draft_v3.docx (395 KB). Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 02:32:17 +08:00
parent 51d15b32a5
commit 5717d61dd4
8 changed files with 2352 additions and 71 deletions
@@ -4,7 +4,7 @@

 We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
 Fig. 1 illustrates the overall architecture.
-The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor.
+The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three methodologically distinct statistical methods and a pixel-identity anchor.

 Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
 From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.
@@ -144,11 +144,12 @@ Second, independent visual inspection of randomly sampled Firm A reports reveals

 Third, our own quantitative analysis is consistent with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.

-Fourth, we additionally validate the Firm A benchmark through two analyses that do not depend on any threshold we subsequently calibrate:
-  (a) *Partner-level similarity ranking (Section IV-H.2).*  When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023.
-  (b) *Intra-report consistency (Section IV-H.3).*  Because each Taiwanese statutory audit report is co-signed by two engagement partners, firmwide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label.  Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms, consistent with firm-wide rather than partner-specific practice.
+Fourth, we additionally validate the Firm A benchmark through three analyses reported in Section IV-H.  Two of them are fully threshold-free, and one uses the downstream classifier as an internal consistency check:
+  (a) *Longitudinal stability (Section IV-H.1).*  The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023.  The fixed 0.95 cutoff is not calibrated to Firm A; the stability itself is the finding.
+  (b) *Partner-level similarity ranking (Section IV-H.2).*  When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023.  This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
+  (c) *Intra-report consistency (Section IV-H.3).*  Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier.  Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms.  This test uses the calibrated classifier and therefore is a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.

-We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the interview and visual-inspection evidence, by the two threshold-independent analyses above, and by the held-out Firm A fold described in Section III-K.
+We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the interview and visual-inspection evidence, by the complementary analyses above, and by the held-out Firm A fold described in Section III-K.

 We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
 Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline.
@@ -212,7 +213,7 @@ All three methods are reported with their estimates and, where applicable, cross
 ## J. Accountant-Level Mixture Model

 In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
-The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *behavior* is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level *quality* lies on a continuous spectrum.
+The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *practice* is clustered (typically consistent adoption of non-hand-signing or consistent hand-signing within a given year) even when the output pixel-level *quality* lies on a continuous spectrum.

 We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
 For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
@@ -237,27 +238,34 @@ The heldout fold is used exclusively to report post-hoc capture rates with Wilso
 4. **Low-similarity same-CPA anchor (supplementary negative):** signatures whose maximum same-CPA cosine similarity is below 0.70.
 This anchor is retained for continuity with prior work but is small in our dataset ($n = 35$) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.

-From these anchors we report precision, recall, $F_1$, FAR with Wilson 95% confidence intervals, and the Equal Error Rate (EER) interpolated at the threshold where FAR $=$ FRR, following biometric-verification reporting conventions [3].
+From these anchors we report FAR with Wilson 95% confidence intervals (against the inter-CPA negative anchor) and FRR (against the byte-identical positive anchor), together with the Equal Error Rate (EER) interpolated at the threshold where FAR $=$ FRR, following biometric-verification reporting conventions [3].
+Precision and $F_1$ are not meaningful in this anchor-based evaluation because the positive and negative anchors are constructed from different sampling units (intra-CPA byte-identical pairs vs random inter-CPA pairs), so their relative prevalence in the combined set is an arbitrary construction rather than a population parameter; we therefore omit precision and $F_1$ from Table X.
+The 70/30 held-out Firm A fold of Section IV-G.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.
 We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.

 ## L. Per-Document Classification

-The final per-document classification combines the three-method thresholds with the dual-descriptor framework.
-Rather than rely on a single cutoff, we assign each signature to one of five signature-level categories using convergent evidence from both descriptors with thresholds derived from the Firm A calibration fold (Section III-K):
+The final per-document classification combines the accountant-level cosine reference from Section IV-E with dHash-based structural stratification.
+We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:

-1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq$ (calibration-fold Firm A dHash median).
-Both descriptors converge on strong replication evidence consistent with Firm A's median behavior.
+1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq 5$.
+Both descriptors converge on strong replication evidence.

-2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash between the calibration-fold dHash median and 95th percentile.
-Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations.
+2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND $5 < $ dHash $\leq 15$.
+Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff, potentially due to scan variations.

-3. **High style consistency:** Cosine $> 0.95$ AND dHash $>$ calibration-fold Firm A dHash 95th percentile.
+3. **High style consistency:** Cosine $> 0.95$ AND dHash $> 15$.
 High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.

 4. **Uncertain:** Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.

 5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.

-Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed determines the document's classification).
+We note two conventions about the dHash cutoffs.
+First, the cutoffs $\leq 5$ and $\leq 15$ correspond to the whole-sample Firm A *cosine-conditional* dHash distribution's median and 95th percentile (the dHash to the cosine-nearest same-CPA match), not to the *independent-minimum* dHash distribution we use elsewhere.
+The two dHash statistics are related but not identical: the whole-sample cosine-conditional distribution has median $= 5$ and 95th percentile $= 15$, while the calibration-fold independent-minimum distribution has median $= 2$ and 95th percentile $= 9$.
+The classifier retains the cosine-conditional cutoffs for continuity with the preceding version of this work while the anchor-level capture-rate analysis reports both cosine-conditional and independent-minimum rates for comparability.
+Second, because the cosine cutoff $0.95$ and the cosine crossover $0.837$ have simple percentile interpretations and are not calibrated *to the calibration fold specifically*, the classifier rules inherit thresholds derived from the whole-sample Firm A distribution rather than the 70% calibration fold; the held-out fold of Section IV-G.2 is the corresponding external validation.
+
+Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed determines the document's classification).
 This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.
-The dHash thresholds ($\leq 5$ and $\leq 15$, corresponding to the calibration-fold Firm A dHash median and 95th percentile) are derived empirically rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.