Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21

Major fixes per codex (gpt-5.4) review: ## Structural fixes - Fixed three-method convergence overclaim: added Script 20 to run KDE antimode, BD/McCrary, and Beta mixture EM on accountant-level means. Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979, LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at accountant level (consistent with smooth clustering, not sharp discontinuity). - Disambiguated Method 1: KDE crossover (between two labeled distributions, used at signature all-pairs level) vs KDE antimode (single-distribution local minimum, used at accountant level). - Addressed Firm A circular validation: Script 21 adds CPA-level 70/30 held-out fold. Calibration thresholds derived from 70% only; heldout rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61% [93.21%-93.98%]). - Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10 signatures (9 CPAs excluded for insufficient sample). Reconciled across intro, results, discussion, conclusion. - Added document-level classification aggregation rule (worst-case signature label determines document label). ## Pixel-identity validation strengthened - Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces the original n=35 same-CPA low-similarity negative which had untenable Wilson CIs). - Added Wilson 95% CI for every FAR in Table X. - Proper EER interpolation (FAR=FRR point) in Table X. - Softened "conservative recall" claim to "non-generalizable subset" language per codex feedback (byte-identical positives are a subset, not a representative positive class). - Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913. ## Terminology & sentence-level fixes - "statistically independent methods" -> "methodologically distinct methods" throughout (three diagnostics on the same sample are not independent). - "formal bimodality check" -> "unimodality test" (dip test tests H0 of unimodality; rejection is consistent with but not a direct test of bimodality). - "Firm A near-universally non-hand-signed" -> already corrected to "replication-dominated" in prior commit; this commit strengthens that framing with explicit held-out validation. - "discrete-behavior regimes" -> "clustered accountant-level heterogeneity" (BD/McCrary non-transition at accountant level rules out sharp discrete boundaries; the defensible claim is clustered-but-smooth). - Softened White 1982 quasi-MLE claim (no longer framed as a guarantee). - Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP or YOLO FN). - Unified "310 byte-identical signatures" language across Abstract, Results, Discussion (previously alternated between pairs/signatures). - Defined min_dhash_independent explicitly in Section III-G. - Fixed table numbering (Table XI heldout added, classification moved to XII, ablation to XIII). - Explained 84,386 vs 85,042 gap (656 docs have only one signature, no pairwise stat). - Made Table IX explicitly a "consistency check" not "validation"; paired it with Table XI held-out rates as the genuine external check. - Defined 0.941 threshold (calibration-fold Firm A cosine P5). - Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated. - Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923). ## New artifacts - Script 20: accountant-level three-method threshold analysis - Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30) - paper/codex_review_gpt54_v3.md: preserved review feedback Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1 markdown sources). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00
parent 9b11f03548
commit 9d19ca5a31
13 changed files with 2915 additions and 127 deletions
@@ -39,7 +39,7 @@ Our approach processes raw PDF documents through the following stages:
 (2) signature region detection using a trained YOLOv11 object detector;
 (3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network;
 (4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance;
-(5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and
+(5) threshold determination using three methodologically distinct methods---KDE antimode with a Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and
 (6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm.

 The dual-descriptor verification is central to our contribution.
@@ -51,13 +51,14 @@ A second distinctive feature is our framing of the calibration reference.
 One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing.
 Structured interviews with multiple Firm A partners confirm that *most* certifying partners produce their audit-report signatures by reproducing a stored image while not excluding that a *minority* may continue to hand-sign some reports.
 We therefore treat Firm A as a *replication-dominated* calibration reference rather than a pure positive class.
-This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode.
+This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in total) cluster into an accountant-level "middle band" rather than the high-replication mode.
 Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence between the interview evidence and the statistical results.

 A third distinctive feature is our unit-of-analysis treatment.
-Our three-method convergent framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are cleanly trimodal (BIC-best $K = 3$).
-The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, but *individual signing behavior* is close to discrete---a given CPA is either a consistent user of non-hand-signing or a consistent hand-signer.
-The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level.
+Our three-method framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are clustered into three recognizable groups (BIC-best $K = 3$).
+The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, while *accountant-level aggregate behaviour* is clustered but not sharply discrete---a given CPA tends to cluster into a dominant regime (high-replication, middle-band, or hand-signed-tendency), though the boundaries between regimes are smooth rather than discontinuous.
+Among the three accountant-level methods, KDE antimode and the two mixture-based estimators converge within $\sim 0.006$ on a cosine threshold of approximately $0.975$, while the Burgstahler-Dichev / McCrary discontinuity test finds no significant transition at the accountant level---an outcome consistent with smoothly mixed clusters rather than a failure of the method.
+The two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are reported as a complementary cross-check rather than as the primary accountant-level threshold.

 We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants.
 To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature.
@@ -70,9 +71,9 @@ The contributions of this paper are summarized as follows:

 3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures.

-4. **Three-method convergent threshold framework.** We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence.
+4. **Three-method convergent threshold framework.** We introduce a threshold-selection framework that applies three methodologically distinct methods---KDE antimode with Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or principled divergence) across methods as diagnostic evidence about the mixture structure of the data.

-5. **Continuous-quality / discrete-behavior finding.** We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates are cleanly trimodal---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics.
+5. **Continuous-quality / clustered-accountant finding.** We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates cluster into three recognizable groups with smoothly mixed rather than sharply discrete boundaries---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics.

 6. **Replication-dominated calibration methodology.** We introduce a calibration strategy using a known-majority-positive reference group, distinguishing *replication-dominated* from *replication-pure* anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.