Paper A v3.3: apply codex v3.2 peer-review fixes

Codex (gpt-5.4) second-round review recommended 'minor revision'. This
commit addresses all issues flagged in that review.

## Structural fixes

- dHash calibration inconsistency (codex #1, most important):
  Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come
  from the whole-sample Firm A cosine-conditional dHash distribution
  (median=5, P95=15), not from the calibration-fold independent-minimum
  dHash distribution (median=2, P95=9) which we report elsewhere as
  descriptive anchors. Added explicit note about the two dHash
  conventions and their relationship.

- Section IV-H framing (codex #2):
  Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence"
  to "Additional Firm A Benchmark Validation" and clarified in the
  section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully
  threshold-free, H.3 uses the calibrated classifier. H.3's concluding
  sentence now says "the substantive evidence lies in the cross-firm
  gap" rather than claiming the test is threshold-free.

- Table XVI 93,979 typo fixed (codex #3):
  Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm).

- Held-out Firm A denominator 124+54=178 vs 180 (codex #4):
  Added explicit note that 2 CPAs were excluded due to disambiguation
  ties in the CPA registry.

- Table VIII duplication (codex #5):
  Removed the duplicate accountant-level-only Table VIII comment; the
  comprehensive cross-level Table VIII subsumes it. Text now says
  "accountant-level rows of Table VIII (below)".

- Anonymization broken in Tables XIV-XVI (codex #6):
  Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/
  "Firm D" across Tables XIV, XV, XVI. Table and caption language
  updated accordingly.

- Table X unit mismatch (codex #7):
  Dropped precision, recall, F1 columns. Table now reports FAR
  (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR
  (against the byte-identical positive anchor). III-K and IV-G.1 text
  updated to justify the change.

## Sentence-level fixes

- "three independent statistical methods" in Methodology III-A ->
  "three methodologically distinct statistical methods".
- "three independent methods" in Conclusion -> "three methodologically
  distinct methods".
- Abstract "~0.006 converging" now explicitly acknowledges that
  BD/McCrary produces no significant accountant-level discontinuity.
- Conclusion ditto.
- Discussion limitation sentence "BD/McCrary should be interpreted at
  the accountant level for threshold-setting purposes" rewritten to
  reflect v3.3 result that BD/McCrary is a diagnostic, not a threshold
  estimator, at the accountant level.
- III-H "two analyses" -> "three analyses" (H.1 longitudinal stability,
  H.2 partner ranking, H.3 intra-report consistency).
- Related Work White 1982 overclaim rewritten: "consistent estimators
  of the pseudo-true parameter that minimizes KL divergence" replaces
  "guarantees asymptotic recovery".
- III-J "behavior is close to discrete" -> "practice is clustered".
- IV-D.2 pivot sentence "discreteness of individual behavior yields
  bimodality" -> "aggregation over signatures reveals clustered (though
  not sharply discrete) patterns".

Target journal remains IEEE Access. Output:
Paper_A_IEEE_Access_Draft_v3.docx (395 KB).

Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-21 02:32:17 +08:00
parent 51d15b32a5
commit 5717d61dd4
8 changed files with 2352 additions and 71 deletions
+24 -16
View File
@@ -4,7 +4,7 @@
We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
Fig. 1 illustrates the overall architecture.
The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor.
The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three methodologically distinct statistical methods and a pixel-identity anchor.
Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.
@@ -144,11 +144,12 @@ Second, independent visual inspection of randomly sampled Firm A reports reveals
Third, our own quantitative analysis is consistent with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.
Fourth, we additionally validate the Firm A benchmark through two analyses that do not depend on any threshold we subsequently calibrate:
(a) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023.
(b) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firmwide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms, consistent with firm-wide rather than partner-specific practice.
Fourth, we additionally validate the Firm A benchmark through three analyses reported in Section IV-H. Two of them are fully threshold-free, and one uses the downstream classifier as an internal consistency check:
(a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The fixed 0.95 cutoff is not calibrated to Firm A; the stability itself is the finding.
(b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
(c) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the calibrated classifier and therefore is a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.
We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the interview and visual-inspection evidence, by the two threshold-independent analyses above, and by the held-out Firm A fold described in Section III-K.
We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the interview and visual-inspection evidence, by the complementary analyses above, and by the held-out Firm A fold described in Section III-K.
We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline.
@@ -212,7 +213,7 @@ All three methods are reported with their estimates and, where applicable, cross
## J. Accountant-Level Mixture Model
In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *behavior* is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level *quality* lies on a continuous spectrum.
The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *practice* is clustered (typically consistent adoption of non-hand-signing or consistent hand-signing within a given year) even when the output pixel-level *quality* lies on a continuous spectrum.
We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
@@ -237,27 +238,34 @@ The heldout fold is used exclusively to report post-hoc capture rates with Wilso
4. **Low-similarity same-CPA anchor (supplementary negative):** signatures whose maximum same-CPA cosine similarity is below 0.70.
This anchor is retained for continuity with prior work but is small in our dataset ($n = 35$) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.
From these anchors we report precision, recall, $F_1$, FAR with Wilson 95% confidence intervals, and the Equal Error Rate (EER) interpolated at the threshold where FAR $=$ FRR, following biometric-verification reporting conventions [3].
From these anchors we report FAR with Wilson 95% confidence intervals (against the inter-CPA negative anchor) and FRR (against the byte-identical positive anchor), together with the Equal Error Rate (EER) interpolated at the threshold where FAR $=$ FRR, following biometric-verification reporting conventions [3].
Precision and $F_1$ are not meaningful in this anchor-based evaluation because the positive and negative anchors are constructed from different sampling units (intra-CPA byte-identical pairs vs random inter-CPA pairs), so their relative prevalence in the combined set is an arbitrary construction rather than a population parameter; we therefore omit precision and $F_1$ from Table X.
The 70/30 held-out Firm A fold of Section IV-G.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.
We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
## L. Per-Document Classification
The final per-document classification combines the three-method thresholds with the dual-descriptor framework.
Rather than rely on a single cutoff, we assign each signature to one of five signature-level categories using convergent evidence from both descriptors with thresholds derived from the Firm A calibration fold (Section III-K):
The final per-document classification combines the accountant-level cosine reference from Section IV-E with dHash-based structural stratification.
We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:
1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq$ (calibration-fold Firm A dHash median).
Both descriptors converge on strong replication evidence consistent with Firm A's median behavior.
1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq 5$.
Both descriptors converge on strong replication evidence.
2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash between the calibration-fold dHash median and 95th percentile.
Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations.
2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND $5 < $ dHash $\leq 15$.
Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff, potentially due to scan variations.
3. **High style consistency:** Cosine $> 0.95$ AND dHash $>$ calibration-fold Firm A dHash 95th percentile.
3. **High style consistency:** Cosine $> 0.95$ AND dHash $> 15$.
High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.
4. **Uncertain:** Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.
Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed determines the document's classification).
We note two conventions about the dHash cutoffs.
First, the cutoffs $\leq 5$ and $\leq 15$ correspond to the whole-sample Firm A *cosine-conditional* dHash distribution's median and 95th percentile (the dHash to the cosine-nearest same-CPA match), not to the *independent-minimum* dHash distribution we use elsewhere.
The two dHash statistics are related but not identical: the whole-sample cosine-conditional distribution has median $= 5$ and 95th percentile $= 15$, while the calibration-fold independent-minimum distribution has median $= 2$ and 95th percentile $= 9$.
The classifier retains the cosine-conditional cutoffs for continuity with the preceding version of this work while the anchor-level capture-rate analysis reports both cosine-conditional and independent-minimum rates for comparability.
Second, because the cosine cutoff $0.95$ and the cosine crossover $0.837$ have simple percentile interpretations and are not calibrated *to the calibration fold specifically*, the classifier rules inherit thresholds derived from the whole-sample Firm A distribution rather than the 70% calibration fold; the held-out fold of Section IV-G.2 is the corresponding external validation.
Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed determines the document's classification).
This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.
The dHash thresholds ($\leq 5$ and $\leq 15$, corresponding to the calibration-fold Firm A dHash median and 95th percentile) are derived empirically rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.