Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review:

## Structural fixes

- Fixed three-method convergence overclaim: added Script 20 to run KDE antimode, BD/McCrary, and Beta mixture EM on accountant-level means. Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979, LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at accountant level (consistent with smooth clustering, not sharp discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions, used at signature all-pairs level) vs KDE antimode (single-distribution local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds CPA-level 70/30 held-out fold. Calibration thresholds derived from 70% only; heldout rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61% [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10 signatures (9 CPAs excluded for insufficient sample). Reconciled across intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature label determines document label).

## Pixel-identity validation strengthened

- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces the original n=35 same-CPA low-similarity negative which had untenable Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset" language per codex feedback (byte-identical positives are a subset, not a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes

- "statistically independent methods" -> "methodologically distinct methods" throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of unimodality; rejection is consistent with but not a direct test of bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to "replication-dominated" in prior commit; this commit strengthens that framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity" (BD/McCrary non-transition at accountant level rules out sharp discrete boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract, Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).

## New artifacts

- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1 markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -50,7 +50,8 @@ Scanning terminated upon the first positive detection.
This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded.
An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.

-Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents, establishing an upper bound on the VLM false-positive rate of 1.2%.
+Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents.
+The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling.

## D. Signature Detection

@@ -119,8 +120,10 @@ For per-signature classification we compute, for each signature, the maximum pai
The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism.
Mean statistics would dilute this signal.

-For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA.
-These accountant-level aggregates are the input to the mixture model described in Section III-I.
+For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean *independent minimum dHash* across all signatures of that CPA.
+The *independent minimum dHash* of a signature is defined as the minimum Hamming distance to *any* other signature of the same CPA (over the full same-CPA set), in contrast to the *cosine-conditional dHash* used as a diagnostic elsewhere, which is the dHash to the single signature selected as the cosine-nearest match.
+The independent minimum avoids conditioning on the cosine choice and is therefore the conservative structural-similarity statistic for each signature.
+These accountant-level aggregates are the input to the mixture model described in Section III-J and to the accountant-level three-method analysis in Section III-I.5.

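To make the distinction concrete, a minimal sketch of the *independent minimum dHash* and its accountant-level aggregate (helper names are illustrative, not Script code; dHash values are assumed to be 64-bit integers):

```python
def hamming(a, b):
    """Bit-level Hamming distance between two 64-bit dHash values."""
    return bin(a ^ b).count("1")

def independent_min_dhash(sig_hashes):
    """Per signature: minimum dHash distance to ANY other signature of the
    same CPA, independent of which match the cosine descriptor selected."""
    return [min(hamming(h, o) for j, o in enumerate(sig_hashes) if j != i)
            for i, h in enumerate(sig_hashes)]

def cpa_mean_independent_min(sig_hashes):
    """Accountant-level aggregate: mean of the independent minima."""
    mins = independent_min_dhash(sig_hashes)
    return sum(mins) / len(mins)
```

The cosine-conditional dHash, by contrast, would look up the distance only to the cosine-nearest signature, which can understate structural similarity when the cosine match and the structural match differ.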
## H. Calibration Reference: Firm A as a Replication-Dominated Population

@@ -133,7 +136,8 @@ Crucially, the same interview evidence does *not* exclude the possibility that a

Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners.

-Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.
+Third, our own quantitative analysis is consistent with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.
+We emphasize that this 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the interview and visual-inspection evidence enumerated above and by the held-out Firm A fold described in Section III-K.

We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline.
@@ -142,14 +146,16 @@ The "replication-dominated, not pure" framing is important both for internal con
## I. Three-Method Convergent Threshold Determination

Direct assignment of thresholds based on prior intuition (e.g., cosine $\geq 0.95$ for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others.
-To place threshold selection on a statistically principled and data-driven footing, we apply *three independent* methods whose underlying assumptions decrease in strength.
-When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement is itself a diagnostic of distributional structure.
+To place threshold selection on a statistically principled and data-driven footing, we apply *three methodologically distinct* methods whose underlying assumptions decrease in strength.
+The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence is therefore a diagnostic of distributional structure rather than a formal statistical guarantee.
+When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement itself is informative about whether the data support a single clean decision boundary at a given level.

-### 1) Method 1: KDE + Antimode with Bimodality Check
+### 1) Method 1: KDE Antimode / Crossover with Unimodality Test

-We fit a Gaussian kernel density estimate to the similarity distribution using Scott's rule for bandwidth selection [28].
-A candidate threshold is taken at the location of the local density minimum (antimode) between modes of the fitted density.
-Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify stability.
+We use two closely related KDE-based threshold estimators and apply each where it is appropriate.
+When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the *KDE crossover* is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes.
+When a single distribution is analyzed (e.g., the per-accountant cosine mean of Section IV-E) the *KDE antimode* is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal.
+In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality (rejecting the null of unimodality is consistent with but does not directly establish bimodality specifically), and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify threshold stability.

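A minimal sketch of the two estimators added in this hunk, using scipy's `gaussian_kde` (Scott's rule is scipy's default bandwidth); function names and grid handling are illustrative, not Script 20's actual code:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_antimode(x, n_grid=512, bw_scale=1.0):
    """Local density minimum between the two tallest modes of a 1-D KDE.
    bw_scale rescales the Scott's-rule bandwidth for the +/-50% sweep.
    Returns None when the fitted density has fewer than two modes."""
    x = np.asarray(x, float)
    kde = gaussian_kde(x)
    kde.set_bandwidth(kde.factor * bw_scale)
    grid = np.linspace(x.min(), x.max(), n_grid)
    d = kde(grid)
    peaks = [i for i in range(1, n_grid - 1) if d[i] >= d[i - 1] and d[i] >= d[i + 1]]
    if len(peaks) < 2:
        return None  # effectively unimodal: antimode undefined
    lo, hi = sorted(sorted(peaks, key=lambda i: d[i])[-2:])
    return float(grid[lo + int(np.argmin(d[lo:hi + 1]))])

def kde_crossover(pos, neg, grid):
    """Intersection of two labeled KDEs: the point where the density
    difference changes sign.  Under equal priors this approximates the
    Bayes boundary; the grid should span the region between the two
    class modes so only the central crossing is found."""
    diff = gaussian_kde(pos)(grid) - gaussian_kde(neg)(grid)
    idx = np.where(np.diff(np.sign(diff)) != 0)[0]
    return float(grid[idx[0]]) if len(idx) else None
```

The dip test is not in scipy; in practice it would come from a separate package and is omitted from this sketch.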
### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity

@@ -170,20 +176,27 @@ Under the fitted model the threshold is the crossing point of the two weighted c
$$\pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2),$$

solved numerically via bracketed root-finding.
-As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the *logit-transformed* similarity, following standard practice for bounded proportion data; White's [41] quasi-MLE consistency result guarantees asymptotic recovery of the best Beta-family approximation to the true distribution even if the true component densities are not exactly Beta.
+As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the *logit-transformed* similarity, following standard practice for bounded proportion data.
+White's [41] quasi-MLE consistency result justifies interpreting the logit-Gaussian estimates as asymptotic approximations to the best Gaussian-family fit under misspecification; we use the cross-check between Beta and logit-Gaussian crossings as a diagnostic of parametric-form sensitivity rather than as a guarantee of distributional recovery.

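The crossing equation above can be solved as described, e.g. with scipy's `brentq` once a sign-changing bracket is located. A sketch under illustrative parameter values, not the fitted EM estimates:

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import brentq

def beta_mixture_crossing(pi1, a1, b1, a2, b2, lo=1e-4, hi=1 - 1e-4):
    """Solve pi1*Beta(x;a1,b1) = (1-pi1)*Beta(x;a2,b2) by bracketed
    root-finding.  Assumes the weighted densities cross in (lo, hi);
    parameters are whatever the EM fit produced."""
    def f(x):
        return pi1 * beta.pdf(x, a1, b1) - (1 - pi1) * beta.pdf(x, a2, b2)
    grid = np.linspace(lo, hi, 1000)
    sign_changes = np.where(np.diff(np.sign(f(grid))) != 0)[0]
    if len(sign_changes) == 0:
        return None  # the weighted densities never cross in the interior
    i = int(sign_changes[-1])  # right-most crossing, nearest the upper mode
    return float(brentq(f, grid[i], grid[i + 1]))
```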
We fit 2- and 3-component variants of each mixture and report BIC for model selection.
When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.

### 4) Convergent Validation and Level-Shift Diagnostic

-The three methods rest on decreasing-in-strength assumptions: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification.
+The three methods rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification.
If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.

-Equally informative is the *level at which the three methods agree*.
-Applied to the per-signature cosine distribution the three methods may converge on a single boundary only weakly or not at all---because, as our results show, per-signature similarity is not a cleanly bimodal population.
-Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum.
-We therefore explicitly analyze both levels and interpret their divergence as a substantive finding (Section V) rather than a statistical nuisance.
+Equally informative is the *level at which the methods agree or disagree*.
+Applied to the per-signature similarity distribution the three methods yield thresholds spread across a wide range because per-signature similarity is not a cleanly bimodal population (Section IV-D).
+Applied to the per-accountant cosine mean, Methods 1 (KDE antimode) and 3 (Beta-mixture crossing and its logit-Gaussian counterpart) converge within a narrow band, whereas Method 2 (BD/McCrary) does not produce a significant transition because the accountant-mean distribution is smooth at the bin resolution the test requires.
+This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than a discrete discontinuity, and we interpret it accordingly in Section V rather than treating disagreement among methods as a failure.

+### 5) Accountant-Level Three-Method Analysis

+In addition to applying the three methods at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures.
+The accountant-level estimates provide the methodologically defensible threshold reference used in the per-document classification of Section III-L.
+All three methods are reported with their estimates and, where applicable, cross-method spreads.

## J. Accountant-Level Mixture Model

@@ -193,37 +206,47 @@ The motivation is the expectation---supported by Firm A's interview evidence---t
We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.

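The BIC-based selection described above can be sketched with scikit-learn's `GaussianMixture`; this is an illustrative re-implementation, not the paper's script (`n_init=15` mirrors the stated 15 random initializations):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_mixture_by_bic(X, k_range=range(1, 6), n_init=15, seed=0):
    """Fit full-covariance Gaussian mixtures for each K in k_range and
    keep the BIC-minimizing fit.  Returns (bic, K*, fitted model)."""
    best = None
    for k in k_range:
        gm = GaussianMixture(n_components=k, covariance_type="full",
                             n_init=n_init, random_state=seed).fit(X)
        bic = gm.bic(X)
        if best is None or bic < best[0]:
            best = (bic, k, gm)
    return best
```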
-## K. Pixel-Identity and Firm A Validation (No Manual Annotation)
+## K. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation (No Manual Annotation)

-Rather than construct a stratified manual-annotation validation set, we validate the classifier using three naturally occurring reference populations that require no human labeling:
+Rather than construct a stratified manual-annotation validation set, we validate the classifier using four naturally occurring reference populations that require no human labeling:

-1. **Pixel-identical anchor (gold positive):** signatures whose nearest same-CPA match is byte-identical after crop and normalization.
-Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing.
+1. **Pixel-identical anchor (gold positive, conservative subset):** signatures whose nearest same-CPA match is byte-identical after crop and normalization.
+Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth *for the byte-identical subset* of non-hand-signed signatures.
+We emphasize that this anchor is a *subset* of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).

-2. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above.
-Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity.
+2. **Inter-CPA negative anchor (large gold negative):** $\sim$50,000 pairs of signatures randomly sampled from *different* CPAs.
+Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps.
+This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold.

-3. **Low-similarity anchor (gold negative):** signatures whose maximum same-CPA cosine similarity is below a conservative cutoff ($0.70$) that cannot plausibly arise from pixel-level duplication.
+3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above.
+Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we break the resulting circularity by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% *calibration* fold and a 30% *heldout* fold.
+Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions are derived from the calibration fold only.
+The heldout fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals.

-From these anchors we report Equal Error Rate (EER), precision, recall, $F_1$, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3].
+4. **Low-similarity same-CPA anchor (supplementary negative):** signatures whose maximum same-CPA cosine similarity is below 0.70.
+This anchor is retained for continuity with prior work but is small in our dataset ($n = 35$) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.

+From these anchors we report precision, recall, $F_1$, FAR with Wilson 95% confidence intervals, and the Equal Error Rate (EER) interpolated at the threshold where FAR $=$ FRR, following biometric-verification reporting conventions [3].
We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.

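The reporting machinery referenced in this section (CPA-level split, Wilson intervals, interpolated EER) can be sketched as follows; the formulas are the standard Wilson score interval and linear FAR/FRR interpolation, and all function names are illustrative rather than Script 21's:

```python
import math
import numpy as np

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 -> 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def eer_interpolated(thresholds, far, frr):
    """EER as the linearly interpolated FAR=FRR crossing of a threshold
    sweep; assumes FAR is non-increasing and FRR non-decreasing."""
    d = np.asarray(far, float) - np.asarray(frr, float)
    i = int(np.where(np.diff(np.sign(d)) != 0)[0][0])
    t = d[i] / (d[i] - d[i + 1])  # fractional position of the crossing
    eer = (1 - t) * far[i] + t * far[i + 1]
    thr = (1 - t) * thresholds[i] + t * thresholds[i + 1]
    return float(eer), float(thr)

def cpa_level_split(cpa_ids, frac=0.7, seed=0):
    """Random 70/30 split at the CPA level: every signature of a given
    CPA falls in exactly one fold.  True = calibration, False = heldout."""
    rng = np.random.default_rng(seed)
    cpas = np.unique(cpa_ids)
    calib = set(rng.permutation(cpas)[: int(frac * len(cpas))].tolist())
    return np.array([c in calib for c in cpa_ids])
```

Splitting at the CPA level rather than the signature level is what prevents leakage: signatures of one partner never straddle the calibration and heldout folds.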
## L. Per-Document Classification

The final per-document classification combines the three-method thresholds with the dual-descriptor framework.
-Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution:
+Rather than rely on a single cutoff, we assign each signature to one of five signature-level categories using convergent evidence from both descriptors with thresholds derived from the Firm A calibration fold (Section III-K):

-1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq 5$.
+1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq$ (calibration-fold Firm A dHash median).
Both descriptors converge on strong replication evidence consistent with Firm A's median behavior.

-2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash in $[6, 15]$.
+2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash between the calibration-fold dHash median and 95th percentile.
Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations.

-3. **High style consistency:** Cosine $> 0.95$ AND dHash $> 15$.
+3. **High style consistency:** Cosine $> 0.95$ AND dHash $>$ calibration-fold Firm A dHash 95th percentile.
High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.

-4. **Uncertain:** Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
+4. **Uncertain:** Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.

-5. **Likely hand-signed:** Cosine below the KDE crossover threshold.
+5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.

-The dHash thresholds ($\leq 5$ and $\leq 15$) are directly derived from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.
+Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed determines the document's classification).
+This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.
+The dHash thresholds ($\leq 5$ and $\leq 15$, corresponding to the calibration-fold Firm A dHash median and 95th percentile) are derived empirically rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.

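The five-category rule and the worst-case document aggregation read directly as code. This sketch hard-codes the quoted calibration values (dHash median 5, P95 15, crossover 0.837) as defaults and uses illustrative label names:

```python
# Lower index = more replication-consistent (the "worst case" for detection).
LABELS = ["high_confidence", "moderate_confidence", "style_consistency",
          "uncertain", "likely_hand_signed"]

def classify_signature(cosine, dhash, dhash_median=5, dhash_p95=15,
                       kde_crossover=0.837):
    """Five-category signature-level rule from Section III-L."""
    if cosine > 0.95:
        if dhash <= dhash_median:
            return "high_confidence"
        if dhash <= dhash_p95:
            return "moderate_confidence"
        return "style_consistency"
    if cosine >= kde_crossover:
        return "uncertain"
    return "likely_hand_signed"

def classify_document(signature_labels):
    """Worst-case aggregation: the document inherits its
    most-replication-consistent signature label."""
    return min(signature_labels, key=LABELS.index)
```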