Paper A v3.7: demote BD/McCrary to density-smoothness diagnostic; add Appendix A
Implements codex gpt-5.4 recommendation (paper/codex_bd_mccrary_opinion.md,
"option (c) hybrid"): demote BD/McCrary in the main text from a co-equal
threshold estimator to a density-smoothness diagnostic, and add a
bin-width sensitivity appendix as an audit trail.
Why: the bin-width sweep (Script 25) confirms that at the signature
level the BD transition drifts monotonically with bin width (Firm A
cosine: 0.987 -> 0.985 -> 0.980 -> 0.975 as bin width widens 0.003 ->
0.015; full-sample dHash transitions drift from 2 to 10 to 9 across
bin widths 1 / 2 / 3) and Z statistics inflate superlinearly with bin
width, both characteristic of a histogram-resolution artifact. At the
accountant level the BD null is robust across the sweep. The paper's
earlier "three methodologically distinct estimators" framing therefore
could not be defended to an IEEE Access reviewer once the sweep was
run.
Added
- signature_analysis/25_bd_mccrary_sensitivity.py: bin-width sweep
across 6 variants (Firm A / full-sample / accountant-level, each
cosine + dHash_indep) and 3-4 bin widths per variant. Reports
Z_below, Z_above, p-values, and number of significant transitions
per cell. Writes reports/bd_sensitivity/bd_sensitivity.{json,md}.
- paper/paper_a_appendix_v3.md: new "Appendix A. BD/McCrary Bin-Width
Sensitivity" with Table A.I (all 20 sensitivity cells) and
interpretation linking the empirical pattern to the main-text
framing decision.
- export_v3.py: appendix inserted into SECTIONS between conclusion
and references.
- paper/codex_bd_mccrary_opinion.md: codex gpt-5.4 recommendation
captured verbatim for audit trail.
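For readers of this log, the core computation behind the bin-width sweep can be sketched as follows. This is a minimal illustration, not the actual `25_bd_mccrary_sensitivity.py`: the Z statistic follows the BD/McCrary form given in Methodology III-I, but the helper names `bd_z_scores` and `significant_transitions`, and the synthetic data, are ours.

```python
import numpy as np

def bd_z_scores(values, bin_width, lo=0.0, hi=1.0):
    """Standardized deviation of each bin count from the smooth-null
    expectation (the average of its two neighbours), per the
    Burgstahler-Dichev / McCrary statistic in Methodology III-I."""
    edges = np.arange(lo, hi + bin_width, bin_width)
    counts, _ = np.histogram(values, bins=edges)
    N = counts.sum()
    p = counts / N
    z = np.full(len(counts), np.nan)  # boundary bins have no two neighbours
    for i in range(1, len(counts) - 1):
        expected = 0.5 * (counts[i - 1] + counts[i + 1])
        var = (N * p[i] * (1 - p[i])
               + 0.25 * N * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        if var > 0:
            z[i] = (counts[i] - expected) / np.sqrt(var)
    return z

def significant_transitions(z, crit=1.96):
    """Count adjacent (Z-, Z+) pairs: a significantly below-expectation bin
    immediately followed by a significantly above-expectation bin."""
    return sum(1 for i in range(1, len(z))
               if z[i - 1] < -crit and z[i] > crit)

# Sweep the same cosine bin widths as Appendix A over smooth toy data;
# a bin-width-robust transition count would stay stable across the sweep.
rng = np.random.default_rng(0)
cosines = np.clip(rng.normal(0.93, 0.03, 5000), 0.0, 1.0)
for bw in (0.003, 0.005, 0.010, 0.015):
    print(bw, significant_transitions(bd_z_scores(cosines, bw)))
```

The actual script additionally reports Z_below, Z_above, and p-values per sensitivity cell; the sketch only shows why a transition count that drifts with bin width points to a histogram-resolution artifact rather than a real discontinuity.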
Main-text reframing
- Abstract: "three methodologically distinct estimators" ->
"two estimators plus a Burgstahler-Dichev/McCrary density-
smoothness diagnostic". Trimmed to 243 words.
- Introduction: related-work summary, pipeline step 5, accountant-
level convergence sentence, contribution 4, and section-outline
line all updated. Contribution 4 renamed to "Convergent threshold
framework with a smoothness diagnostic".
- Methodology III-I: section renamed to "Convergent Threshold
Determination with a Density-Smoothness Diagnostic". "Method 2:
BD/McCrary Discontinuity" converted to "Density-Smoothness
Diagnostic" in a new subsection; Method 3 (Beta mixture) renumbered
to Method 2. Subsections 4 and 5 updated to refer to "two threshold
estimators" with BD as diagnostic.
- Methodology III-A pipeline overview: "three methodologically
distinct statistical methods" -> "two methodologically distinct
threshold estimators complemented by a density-smoothness
diagnostic".
- Methodology III-L: "three-method analysis" -> "accountant-level
threshold analysis (KDE antimode, Beta-2 crossing, logit-Gaussian
robustness crossing)".
- Results IV-D.1 heading: "BD/McCrary Discontinuity" ->
"BD/McCrary Density-Smoothness Diagnostic". Prose now notes the
Appendix-A bin-width instability explicitly.
- Results IV-E: Table VIII restructured to label BD rows
"(diagnostic only; bin-unstable)" and "(diagnostic; null across
Appendix A)". Summary sentence rewritten to frame BD null as
evidence for clustered-but-smoothly-mixed rather than as a
convergence failure. Table cosine P5 row corrected from 0.941 to
0.9407 to match III-K.
- Results IV-G.3 and IV-I.2: "three-method convergence/thresholds"
-> "accountant-level convergent thresholds" (clarifies the 3
converging estimates are KDE antimode, Beta-2, logit-Gaussian,
not KDE/BD/Beta).
- Discussion V-B: "three-method framework" -> "convergent threshold
framework".
- Conclusion: "three methodologically distinct methods" -> "two
threshold estimators and a density-smoothness diagnostic";
contribution 3 restated; future-work sentence updated.
- Impact Statement (archived): "three methodologically distinct
threshold-selection methods" -> "two methodologically distinct
threshold estimators plus a density-smoothness diagnostic" so the
archived text is internally consistent if reused.
Discussion V-B / V-G already framed BD as a diagnostic in v3.5
(unchanged in this commit). The reframing therefore brings Abstract /
Introduction / Methodology / Results / Conclusion into alignment with
the Discussion framing that codex had already endorsed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -4,7 +4,7 @@
 We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
 Fig. 1 illustrates the overall architecture.
-The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three methodologically distinct statistical methods and a pixel-identity anchor.
+The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from two methodologically distinct threshold estimators complemented by a density-smoothness diagnostic and a pixel-identity anchor.

 Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
 From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.
@@ -130,7 +130,7 @@ A direct empirical check of the within-auditor-year assumption at the same-CPA l
 For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean *independent minimum dHash* across all signatures of that CPA.
 The *independent minimum dHash* of a signature is defined as the minimum Hamming distance to *any* other signature of the same CPA (over the full same-CPA set).
 The independent minimum is unconditional on the cosine-nearest pair and is therefore the conservative structural-similarity statistic; it is the dHash statistic used throughout the operational classifier (Section III-L) and all reported capture-rate analyses.
-These accountant-level aggregates are the input to the mixture model described in Section III-J and to the accountant-level three-method analysis in Section III-I.5.
+These accountant-level aggregates are the input to the mixture model described in Section III-J and to the accountant-level threshold analysis in Section III-I.5.

 ## H. Calibration Reference: Firm A as a Replication-Dominated Population
@@ -159,12 +159,14 @@ We emphasize that Firm A's replication-dominated status was *not* derived from t
 Its identification rests on visual evidence and accountant-level clustering that is independent of the statistical pipeline.
 The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.

-## I. Three-Method Convergent Threshold Determination
+## I. Convergent Threshold Determination with a Density-Smoothness Diagnostic

 Direct assignment of thresholds based on prior intuition (e.g., cosine $\geq 0.95$ for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others.
-To place threshold selection on a statistically principled and data-driven footing, we apply *three methodologically distinct* methods whose underlying assumptions decrease in strength.
-The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence is therefore a diagnostic of distributional structure rather than a formal statistical guarantee.
-When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement itself is informative about whether the data support a single clean decision boundary at a given level.
+To place threshold selection on a statistically principled and data-driven footing, we apply *two methodologically distinct* threshold estimators---KDE antimode with a Hartigan dip test, and a finite Beta mixture (with a logit-Gaussian robustness check)---whose underlying assumptions decrease in strength (KDE antimode requires only smoothness; the Beta mixture additionally requires a parametric specification, and the logit-Gaussian cross-check reports sensitivity to that form).
+We complement these estimators with a Burgstahler-Dichev / McCrary density-smoothness diagnostic applied to the same distributions.
+The BD/McCrary procedure is *not* a third threshold estimator in our application---we show in Appendix A that the signature-level BD transitions are not bin-width-robust and that the accountant-level BD null survives a bin-width sweep---but it is informative about *how* the accountant-level distribution fails to exhibit a sharp density discontinuity even though it is clustered.
+The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence between the two threshold estimators is therefore a diagnostic of distributional structure rather than a formal statistical guarantee.
+When the two estimates agree, the decision boundary is robust to the choice of method; when the BD/McCrary diagnostic finds no significant transition at the same level, that pattern is evidence for clustered-but-smoothly-mixed rather than sharply discontinuous distributional structure.

 ### 1) Method 1: KDE Antimode / Crossover with Unimodality Test
@@ -173,17 +175,7 @@ When two labeled populations are available (e.g., the all-pairs intra-class and
 When a single distribution is analyzed (e.g., the per-accountant cosine mean of Section IV-E) the *KDE antimode* is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal.
 In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality (rejecting the null of unimodality is consistent with but does not directly establish bimodality specifically), and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify threshold stability.

-### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity
-
-We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39].
-We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin $i$ with count $n_i$, the standardized deviation from the smooth-null expectation of the average of its neighbours,
-
-$$Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},$$
-
-which is approximately $N(0,1)$ under the null of distributional smoothness.
-A threshold is identified at the transition where $Z_{i-1}$ is significantly negative (observed count below expectation) adjacent to $Z_i$ significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition $Z^- \rightarrow Z^+$ marks the candidate decision boundary.
-
-### 3) Method 3: Finite Mixture Model via EM
+### 2) Method 2: Finite Mixture Model via EM

 We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data).
 The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread).
@@ -198,21 +190,31 @@ White's [41] quasi-MLE consistency result justifies interpreting the logit-Gauss
 We fit 2- and 3-component variants of each mixture and report BIC for model selection.
 When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.

-### 4) Convergent Validation and Level-Shift Diagnostic
-
-The three methods rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification.
-If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.
+### 3) Density-Smoothness Diagnostic: Burgstahler-Dichev / McCrary
+
+Complementing the two threshold estimators above, we apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39], as a *density-smoothness diagnostic* rather than as a third threshold estimator.
+We discretize each distribution (cosine into bins of width 0.005; $\text{dHash}_\text{indep}$ into integer bins) and compute, for each bin $i$ with count $n_i$, the standardized deviation from the smooth-null expectation of the average of its neighbours,
+
+$$Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},$$
+
+which is approximately $N(0,1)$ under the null of distributional smoothness.
+A candidate transition is identified at an adjacent bin pair where $Z_{i-1}$ is significantly negative and $Z_i$ is significantly positive (cosine) or the reverse (dHash).
+Appendix A reports a bin-width sensitivity sweep covering $\text{bin} \in \{0.003, 0.005, 0.010, 0.015\}$ for cosine and $\text{bin} \in \{1, 2, 3\}$ for dHash; the sweep shows that signature-level BD transitions are not bin-width-stable and that accountant-level BD transitions are largely absent, consistent with clustered-but-smoothly-mixed accountant-level aggregates.
+We therefore do not treat the BD/McCrary procedure as a threshold estimator in our application but as diagnostic evidence about distributional smoothness.
+
+### 4) Convergent Validation and Level-Shift Framing
+
+The two threshold estimators rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the Beta mixture additionally requires a parametric specification (with logit-Gaussian as a robustness cross-check against that form).
+If the two estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.

 Equally informative is the *level at which the methods agree or disagree*.
-Applied to the per-signature similarity distribution the three methods yield thresholds spread across a wide range because per-signature similarity is not a cleanly bimodal population (Section IV-D).
-Applied to the per-accountant cosine mean, Methods 1 (KDE antimode) and 3 (Beta-mixture crossing and its logit-Gaussian counterpart) converge within a narrow band, whereas Method 2 (BD/McCrary) does not produce a significant transition because the accountant-mean distribution is smooth at the bin resolution the test requires.
-This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than a discrete discontinuity, and we interpret it accordingly in Section V rather than treating disagreement among methods as a failure.
+Applied to the per-signature similarity distribution the two estimators yield thresholds spread across a wide range because per-signature similarity is not a cleanly bimodal population (Section IV-D).
+Applied to the per-accountant cosine mean, the KDE antimode and the Beta-mixture crossing (together with its logit-Gaussian counterpart) converge within a narrow band, while the BD/McCrary diagnostic finds no significant transition at the same level; this pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than a sharply discrete discontinuity, and we interpret it accordingly in Section V rather than treating the BD null as a failure of the test.

-### 5) Accountant-Level Three-Method Analysis
+### 5) Accountant-Level Application

-In addition to applying the three methods at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures.
-The accountant-level estimates provide the methodologically defensible threshold reference used in the per-document classification of Section III-L.
-All three methods are reported with their estimates and, where applicable, cross-method spreads.
+In addition to applying the two threshold estimators and the BD/McCrary diagnostic at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures.
+The accountant-level estimates from the two threshold estimators (together with their convergence) provide the methodologically defensible threshold reference used in the per-document classification of Section III-L; the BD/McCrary accountant-level null is reported alongside as a smoothness diagnostic.

 ## J. Accountant-Level Mixture Model
@@ -249,7 +251,7 @@ We additionally draw a small stratified sample (30 signatures across high-confid
 ## L. Per-Document Classification

-The per-signature classifier operates at the signature level and uses whole-sample Firm A percentile heuristics as its operational thresholds, while the three-method analysis of Section IV-E operates at the accountant level and supplies a *convergent* external reference for the operational cuts.
+The per-signature classifier operates at the signature level and uses whole-sample Firm A percentile heuristics as its operational thresholds, while the accountant-level threshold analysis of Section IV-E (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing) supplies a *convergent* external reference for the operational cuts.
 Because the two analyses are at different units (signature vs accountant) we treat them as complementary rather than substitutable: the accountant-level convergence band cos $\in [0.945, 0.979]$ anchors the signature-level operational cut cos $> 0.95$ used below, and Section IV-G.3 reports a sensitivity analysis in which cos $> 0.95$ is replaced by the accountant-level 2D-GMM marginal crossing cos $> 0.945$.
 All dHash references in this section refer to the *independent-minimum* dHash defined in Section III-G---the smallest Hamming distance from a signature to any other same-CPA signature.
 We use a single dHash statistic throughout the operational classifier and the supporting capture-rate analyses (Tables IX, XI, XII, XVI), which keeps the classifier definition and its empirical evaluation arithmetically consistent.