Paper A v3.10: resolve Opus 4.7 round-9 paper-vs-Appendix-A contradiction

Opus round-9 review (paper/opus_final_review_v3_9.md) dissented from
Gemini round-7 Accept and aligned with codex round-8 Minor, but for a
DIFFERENT issue all prior reviewers missed: the paper's main text in
four locations flatly claimed the BD/McCrary accountant-level null
"persists across the Appendix-A bin-width sweep", yet Appendix A
Table A.I itself documents a significant accountant-level cosine
transition at bin 0.005 with |Z_below|=3.23, |Z_above|=5.18 (both
past 1.96) located at cosine 0.980 --- on the upper edge of our two
threshold estimators' convergence band [0.973, 0.979]. This is a
paper-to-appendix contradiction that a careful reviewer would catch
in 30 seconds.

BLOCKER B1: BD/McCrary accountant-level claim softened across all
four locations to match what Appendix A Table A.I actually reports:
- Results IV-D.1 (lines 85-86): rewritten to say the null is not
  rejected at 2/3 cosine bin widths and 2/3 dHash bin widths, with
  the one cosine transition at bin 0.005 sitting on the upper edge
  of the convergence band and the one dHash transition at |Z|=1.96.
- Results IV-E Table VIII row (line 145): "no transition / no
  transition" changed to "0.980 at bin 0.005 only; null at 0.002,
  0.010" / "3.0 at bin 1.0 only ( |Z|=1.96); null at 0.2, 0.5".
- Results IV-E line 130 (Third finding): "does not produce a
  significant transition (robust across bin-width sweep)" replaced
  with "largely null at the accountant level --- no significant
  transition at 2/3 cosine bin widths and 2/3 dHash bin widths,
  with the one cosine transition at bin 0.005 sitting at cosine
  0.980 on the upper edge of the convergence band".
- Results IV-E line 152 (Table VIII synthesis paragraph): matched
  reframing.
- Discussion V-B (line 27): "does not produce a significant
  transition at the accountant level either" -> "largely null at
  the accountant level ... with the one cosine transition on the
  upper edge of the convergence band".
- Conclusion (line 16): matched reframing with power caveat
  retained.

MAJOR M1: Related Work L67 stale "well suited to detecting the
boundary between two generative mechanisms" framing (residue from
pre-demotion drafts) replaced with a local-density-discontinuity
diagnostic framing that matches the rest of the paper and flags
the signature-level bin-width sensitivity + accountant-level rarity
as documented in Appendix A.

MAJOR M2: Table XII orphaned in-text anchor --- Table XII is defined
inside IV-G.3 but had no in-text "Table XII reports ..." pointer at
its presentation location. Added a single sentence before the table
comment.

MINOR m1: Section IV-I.1 "4 of 30,000+ Firm A documents, 0.01%"
replaced with the exact "4 of 30,226 Firm A documents, 0.013%".

MINOR m2: Section IV-E "the two-dimensional two-component GMM"
wording ambiguity (reader might confuse with the already-selected
K*=3 GMM from BIC) replaced with explicit "a separately fit
two-component 2D GMM (reported as a cross-check on the 1D
accountant-level crossings)".

MINOR m3: Section IV-D L59 "downstream all-pairs analyses
(Tables XII, XVIII)" misnomer --- Table XII is per-signature
classifier output not all-pairs; Table XVIII's all-pairs are over
~16M pairs not 168,740. Replaced with an accurate list:
"same-CPA per-signature best-match analyses (Tables V and XII, and
the Firm-A per-signature rows of Tables XIII and XVIII)".

MINOR m4: Methodology III-H L156 "the validation role is played by
... the held-out Firm A fold" slightly overclaims what the held-out
fold establishes (the fold-level rates differ by 1-5 pp with
p<0.001). Parenthetical hedge added: "(which confirms the qualitative
replication-dominated framing; fold-level rate differences are
disclosed in Section IV-G.2)".

Also add:
- paper/opus_final_review_v3_9.md (Opus 4.7 max-effort review)
- paper/gemini_review_v3_8.md (Gemini round-7 Accept verdict, was
  missing from prior commit)

Abstract remains 243 words (under IEEE Access 250 limit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-21 15:25:04 +08:00
parent 85cfefe49f
commit 615059a2c1
7 changed files with 327 additions and 12 deletions
+9 -8
View File
@@ -56,7 +56,7 @@ A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the
## D. Hartigan Dip Test: Unimodality at the Signature Level
Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
The $N = 168{,}740$ count used in Table V and downstream all-pairs analyses (Tables XII, XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-signature best-match analyses (Tables V and XII, and the Firm-A per-signature rows of Tables XIII and XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
<!-- TABLE V: Hartigan Dip Test Results
| Distribution | N | dip | p-value | Verdict (α=0.05) |
@@ -82,8 +82,8 @@ Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine
Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
First, the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
At the accountant level the test does not produce a significant transition in either the cosine-mean or the dHash-mean distribution, and this null persists across the Appendix-A bin-width sweep.
We read this accountant-level pattern as *consistent with*---not affirmative proof of---clustered-but-smoothly-mixed aggregates: at $N = 686$ accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness (Section V-G).
At the accountant level the BD/McCrary null is not rejected at two of three cosine bin widths (0.002, 0.010) and two of three dHash bin widths (0.2, 0.5); the one cosine transition that does occur (at bin width 0.005) sits at cosine 0.980---*at the upper edge* of the convergence band of our two threshold estimators (Section IV-E)---and the one dHash transition (at bin width 1.0, location dHash = 3.0) has $|Z_{\text{below}}|$ exactly at the 1.96 critical value.
We read this pattern as *largely but not uniformly* null and *consistent with*---not affirmative proof of---clustered-but-smoothly-mixed aggregates: at $N = 686$ accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness (Section V-G), and the one bin-0.005 cosine transition, sitting at the edge rather than outside the threshold band and flanked by bin-0.002 and bin-0.010 non-rejections, is consistent with a mild histogram-resolution effect rather than a stable cross-mode density discontinuity (Appendix A).
We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator, and the substantive claim of smoothly-mixed accountant clustering rests on the joint evidence of the dip test, the BIC-selected GMM, and the BD null.
### 2) Beta Mixture at Signature Level: A Forced Fit
@@ -127,8 +127,8 @@ First, of the 180 CPAs in the Firm A registry, 171 have $\geq 10$ signatures and
Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2.
This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
Third, applying the threshold framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary density-smoothness diagnostic does not produce a significant transition at the accountant level (robust across the bin-width sweep in Appendix A).
For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.
Third, applying the threshold framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary density-smoothness diagnostic is largely null at the accountant level---no significant transition at two of three cosine bin widths and two of three dHash bin widths, with the one cosine transition at bin 0.005 sitting at cosine 0.980 on the upper edge of the convergence band (Appendix A).
For completeness we also report the marginal crossings of a *separately fit* two-component 2D GMM (reported as a cross-check on the 1D accountant-level crossings) at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.
Table VIII summarizes the threshold estimates produced by the two threshold estimators and the BD/McCrary smoothness diagnostic across the two analysis levels for a compact cross-level comparison.
@@ -142,14 +142,14 @@ Table VIII summarizes the threshold estimates produced by the two threshold esti
| Accountant-level, KDE antimode (threshold estimator) | **0.973** | **4.07** |
| Accountant-level, Beta-2 EM crossing (threshold estimator) | **0.979** | **3.41** |
| Accountant-level, logit-GMM-2 crossing (robustness) | **0.976** | **3.93** |
| Accountant-level, BD/McCrary transition (diagnostic; null across Appendix A) | no transition | no transition |
| Accountant-level, BD/McCrary transition (diagnostic; largely null, Appendix A) | 0.980 at bin 0.005 only; null at 0.002, 0.010 | 3.0 at bin 1.0 only (\|Z\|=1.96); null at 0.2, 0.5 |
| Accountant-level, 2D-GMM 2-comp marginal crossing (secondary) | 0.945 | 8.10 |
| Firm A calibration-fold cosine P5 | 0.9407 | — |
| Firm A calibration-fold dHash_indep P95 | — | 9 |
| Firm A calibration-fold dHash_indep median | — | 2 |
-->
At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$; the BD/McCrary density-smoothness diagnostic produces no significant transition at the same level (a null that persists across Appendix A's bin-width sweep), which is *consistent with*---though, at $N = 686$, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates.
At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$; the BD/McCrary density-smoothness diagnostic is largely null at the same level (two of three cosine bin widths and two of three dHash bin widths produce no significant transition; the one bin-0.005 cosine transition at 0.980 sits on the convergence-band upper edge and is flanked by non-rejections at bin 0.002 and bin 0.010, Appendix A), which is *consistent with*---though, at $N = 686$, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates.
This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
@@ -248,6 +248,7 @@ The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operationa
The accountant-level convergent threshold analysis (Section IV-E) places the primary accountant-level reference between $0.973$ and $0.979$ (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing), and the accountant-level 2D-GMM marginal at $0.945$.
Because the classifier operates at the signature level while these convergent accountant-level estimates are at the accountant level, they are formally non-substitutable.
We report a sensitivity check in which the classifier's operational cut cos $> 0.95$ is replaced by the nearest accountant-level reference, cos $> 0.945$.
Table XII reports the five-way classifier output under each cut.
<!-- TABLE XII: Classifier Sensitivity to the Operational Cosine Cut (All-Sample Five-Way Output, N = 168,740 signatures)
| Category | cos > 0.95 count (%) | cos > 0.945 count (%) | Δ count |
@@ -401,7 +402,7 @@ A cosine-only classifier would treat all 71,656 identically; the dual-descriptor
96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the 32/171 middle-band minority identified by the accountant-level mixture (Section IV-E).
The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
The absence of any meaningful "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.
### 2) Cross-Method Agreement