Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21

Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE antimode, BD/McCrary, and Beta mixture EM on accountant-level means. Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979, LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at accountant level (consistent with smooth clustering, not sharp discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions, used at signature all-pairs level) vs KDE antimode (single-distribution local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds CPA-level 70/30 held-out fold. Calibration thresholds derived from 70% only; heldout rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61% [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10 signatures (9 CPAs excluded for insufficient sample). Reconciled across intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature label determines document label).

## Pixel-identity validation strengthened
- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces the original n=35 same-CPA low-similarity negative which had untenable Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset" language per codex feedback (byte-identical positives are a subset, not a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods" throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of unimodality; rejection is consistent with but not a direct test of bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to "replication-dominated" in prior commit; this commit strengthens that framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity" (BD/McCrary non-transition at accountant level rules out sharp discrete boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract, Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1 markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00
parent 9b11f03548
commit 9d19ca5a31
13 changed files with 2915 additions and 127 deletions
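
Note on the CPA-level 70/30 held-out fold named in the commit message: the point of splitting by CPA rather than by signature is that every signature of a given accountant lands in exactly one fold, so held-out rates are not computed on accountants who also informed the thresholds. The following is a minimal sketch of such a grouped split, not the contents of Script 21. The function names `split_cpas` and `assign_folds`, the seed, and the toy IDs are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_cpas(cpa_ids, calib_frac=0.70, seed=0):
    """Partition unique CPA IDs into (calibration, held-out) sets.

    Hypothetical helper: the split is over accountants, not signatures.
    """
    unique = sorted(set(cpa_ids))      # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_calib = round(calib_frac * len(unique))
    return set(unique[:n_calib]), set(unique[n_calib:])


def assign_folds(signatures, calib_frac=0.70, seed=0):
    """Route (cpa_id, signature_id) records to the fold of their CPA."""
    calib, _heldout = split_cpas([cpa for cpa, _ in signatures], calib_frac, seed)
    folds = defaultdict(list)
    for cpa_id, sig_id in signatures:
        fold = "calibration" if cpa_id in calib else "heldout"
        folds[fold].append((cpa_id, sig_id))
    return folds


# Toy data: 10 hypothetical CPAs with 3 signatures each.
toy = [(f"cpa{i:02d}", f"sig{i:02d}_{j}") for i in range(10) for j in range(3)]
folds = assign_folds(toy)
calib_cpas = {cpa for cpa, _ in folds["calibration"]}
heldout_cpas = {cpa for cpa, _ in folds["heldout"]}
print(len(calib_cpas), len(heldout_cpas), calib_cpas & heldout_cpas)
```

Thresholds would then be derived from the calibration fold only and evaluated on the held-out fold; fixing the seed keeps the fold assignment reproducible across reruns.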
@@ -11,25 +11,25 @@ Forgery detection systems optimize for inter-class discriminability---maximizing
 Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.
 The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.

-## B. Continuous-Quality Spectrum vs. Discrete-Behavior Regimes
+## B. Continuous-Quality Spectrum vs. Clustered Accountant-Level Heterogeneity

-The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method convergent analysis and the Hartigan dip test (Sections IV-D and IV-E).
+The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method framework and the Hartigan dip test (Sections IV-D and IV-E).

 At the per-signature level, the distribution of best-match cosine similarity is *not* cleanly bimodal.
 Firm A's signature-level cosine is formally unimodal (dip test $p = 0.17$) with a long left tail.
-The all-CPA signature-level cosine is multimodal ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
-The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary.
+The all-CPA signature-level cosine rejects unimodality ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
+The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary between two mechanisms.
 Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class.
-Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains.
+Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level.

-At the per-accountant aggregate level the picture reverses.
-The distribution of per-accountant mean cosine (and mean dHash) is unambiguously multimodal.
-A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms.
-The two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.
+At the per-accountant aggregate level the picture partly reverses.
+The distribution of per-accountant mean cosine (and mean dHash) rejects unimodality; a BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms; and the three 1D threshold methods applied at the accountant level produce mutually consistent estimates (KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, logit-GMM-2 crossing $= 0.976$).
+The BD/McCrary test, however, does not produce a significant transition at the accountant level, in contrast to the signature level.
+This pattern is consistent with a clustered *but smoothly mixed* accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the transitions between them are gradual rather than discrete at the bin resolution BD/McCrary requires.

-The substantive interpretation is simple: *pixel-level output quality* is continuous, but *individual signing behavior* is close to discrete.
-A given CPA tends to be either a consistent user of non-hand-signing or a consistent hand-signer; it is the mixing of these discrete behavioral types at the firm and population levels that produces the quality spectrum observed at the signature level.
-Methodologically, this implies that the three-method convergent framework (KDE antimode / BD-McCrary / Beta mixture) is meaningful at the level where the mixture structure is statistically supported---namely the accountant level---and serves as diagnostic evidence rather than a primary classification boundary at the signature level.
+The substantive interpretation we take from this evidence is therefore narrower than a "discrete-behavior" claim: *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered (three recognizable groups) but not sharply discrete.
+The accountant-level mixture is a useful classifier of firm-and-practitioner-level signing regimes; individual behavior may still transition or mix over time within a practitioner, and our cross-sectional analysis does not rule this out.
+Methodologically, the implication is that the three 1D methods are meaningfully applied at the accountant level for threshold estimation, while the BD/McCrary non-transition at the same level is itself diagnostic of smoothness rather than a failure of the method.

 ## C. Firm A as a Replication-Dominated, Not Pure, Population

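
Note on the KDE antimode used at the accountant level in Section B of the hunk above: the antimode is the local minimum of a single estimated density between its two main modes, as opposed to the crossover of two separately labeled densities. The sketch below illustrates that definition only and is not Script 20. The fixed bandwidth, the grid size, and the synthetic per-accountant means are assumptions; a real analysis would choose the bandwidth by a stated rule.

```python
import math
from statistics import NormalDist


def kde_density(samples, bandwidth):
    """Return a callable evaluating a Gaussian KDE of `samples` at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def kde_antimode(samples, bandwidth, grid_size=512):
    """Deepest KDE minimum between the two highest modes.

    Returns None when the density has fewer than two local maxima on the grid.
    """
    density = kde_density(samples, bandwidth)
    lo = min(samples) - 3.0 * bandwidth   # pad so edge modes are interior
    hi = max(samples) + 3.0 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]
    ys = [density(x) for x in xs]
    # Interior grid points that are local maxima of the estimated density.
    peaks = [i for i in range(1, grid_size - 1) if ys[i - 1] < ys[i] >= ys[i + 1]]
    if len(peaks) < 2:
        return None
    # Keep the two tallest peaks, ordered left to right, and scan between them.
    left, right = sorted(sorted(peaks, key=lambda i: ys[i])[-2:])
    valley = min(range(left, right + 1), key=lambda i: ys[i])
    return xs[valley]


def quantile_sample(mu, sigma, n):
    """Deterministic stand-in for a normal sample (evenly spaced quantiles)."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Synthetic per-accountant means: two clusters, purely illustrative.
toy_means = quantile_sample(0.90, 0.02, 40) + quantile_sample(0.98, 0.005, 60)
antimode = kde_antimode(toy_means, bandwidth=0.01)
print(f"antimode on toy data: {antimode:.3f}")
```

Because the antimode depends on the bandwidth, reporting the bandwidth alongside the threshold makes the estimate reproducible.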
@@ -39,7 +39,9 @@ Our evidence across multiple analyses rules out that assumption for Firm A while
 Three convergent strands of evidence support the replication-dominated framing.
 First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign.
 Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
-Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers.
+Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures ($\geq 10$) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---consistent with the interview-acknowledged minority of hand-signers.
+Nine additional Firm A CPAs are excluded from the GMM for having fewer than 10 signatures, so we cannot place them in a cluster from the cross-sectional analysis alone.
+The held-out Firm A 70/30 validation (Section IV-G.2) gives capture rates on a non-calibration Firm A subset that are within the Wilson 95% CIs of the whole-sample rates, indicating that the statistical signature of the replication-dominated framing is stable to the CPA sub-sample used for calibration.

 The replication-dominated framing is internally coherent with all three pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise.
 We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.
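
Note on the Wilson 95% CIs referenced for the held-out rates in the hunk above: the Wilson score interval for an observed proportion $\hat{p} = k/n$ is centered at $(\hat{p} + z^2/2n)/(1 + z^2/n)$, with half-width $z\sqrt{\hat{p}(1-\hat{p})/n + z^2/4n^2}\,/\,(1 + z^2/n)$. The sketch below is a plain-Python rendering of that formula and is not taken from Script 21. The counts passed to it are hypothetical and are not the paper's counts.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts, for illustration only.
low, high = wilson_interval(successes=50, n=100)

# Effect of anchor size if zero false accepts were observed (an assumption):
# the upper bound reduces to z^2 / (n + z^2), so it shrinks roughly as 1/n.
_, upper_small = wilson_interval(successes=0, n=35)
_, upper_large = wilson_interval(successes=0, n=50_000)
print(f"{low:.4f} {high:.4f} {upper_small:.4f} {upper_large:.6f}")
```

The zero-count comparison shows why an anchor of $n = 35$ cannot support a tight FAR bound while an anchor of roughly 50,000 pairs can, independent of the classifier being evaluated.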
@@ -65,21 +67,25 @@ This calibration strategy has broader applicability beyond signature analysis.
 Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives.
 The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates interview-acknowledged heterogeneity, and yields classification rates that are internally consistent with the data.

-## F. Pixel-Identity as Annotation-Free Ground Truth
+## F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation

-A further methodological contribution is the use of byte-level pixel identity as an annotation-free gold positive.
+A further methodological contribution is the combination of byte-level pixel identity as an annotation-free *conservative* gold positive and a large random-inter-CPA negative anchor.
 Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review.
-In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly.
-Paired with a conservative low-similarity anchor, this yields EER-style validation without the stratified manual annotation that would otherwise be the default methodological choice.
-We regard this as a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself.
+In our corpus 310 signatures satisfied this condition.
+We emphasize that byte-identical pairs are a *subset* of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be byte-identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways).
+Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.
+
+Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), which is a substantive improvement over the low-similarity same-CPA negative ($n = 35$) we originally considered.
+The combination is a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself, provided that its generalization limits are acknowledged: FAR is informative, whereas recall is valid only for the conservative subset.

 ## G. Limitations

 Several limitations should be acknowledged.

 First, comprehensive per-document ground truth labels are not available.
-The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), and the low-similarity anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70.
-The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work.
+The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class.
+The low-similarity same-CPA anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70; we use the $\sim$50,000-pair inter-CPA negative anchor as the primary negative reference, which yields tight Wilson 95% CIs on FAR (Table X), but it too does not exhaust the set of true negatives (in particular, same-CPA hand-signed pairs with moderate cosine similarity are not sampled).
+A manual-adjudication study concentrated at the decision boundary---for example 100--300 auditor-years stratified by cosine band---would further strengthen the recall estimate against the full positive class.

 Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning.
 While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.