Paper A v3.5: resolve codex round-4 residual issues

Fully addresses the partial-resolution / unfixed items from codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):

Critical
- Table XI z/p columns now reproduce from the displayed counts. The earlier
table had transcription errors of 1-4 units in the k values and a fabricated
cos > 0.9407 calibration row; both are fixed by rerunning Script 24
with cos = 0.9407 added to COS_RULES and copying exact values from
the JSON output.
- Section III-L classifier now defined entirely in terms of the
independent-minimum dHash statistic that the deployed code (Scripts
21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
language is removed. Tables IX, XI, XII, XVI are now arithmetically
consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
per-signature cosine distribution, matching III-L and IV-F.

Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
limit. Removed "we break the circularity" overclaim; replaced with
"report capture rates on both folds with Wilson 95% intervals to
make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
Methods/Results don't deliver; replaced with anchor-based capture /
FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
intra-report consistency (IV-H.3) is a different test (two co-signers
on the same report, firm-level homogeneity) and is not a within-CPA
year-level mixing check; the assumption is maintained as a bounded
identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
the partner-level ranking is threshold-free"; longitudinal-stability
uses 0.95 cutoff, intra-report uses the operational classifier.

Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
Regular Papers do not have a standalone Impact Statement). The file
itself is retained as an archived non-paper note for cover-letter /
grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
[35] VLM survey, [36] Mann-Whitney) are now cited in-text:
[27] in Methodology III-E (dHash definition)
[31][32][33] in Introduction (audit-quality regulation context)
[34][35] in Methodology III-C/III-D
[36] in Results IV-C (Mann-Whitney result)

Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+11
-11
@@ -47,7 +47,7 @@ Distribution fitting identified the lognormal distribution as the best parametri
 
 The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
 Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
-Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney $p < 0.001$, K-S 2-sample $p < 0.001$).
+Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).
 
 We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
 We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
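As a reading aid for the hunk above, here is a minimal sketch of Cohen's $d$, assuming the common pooled-standard-deviation form; the diff does not show which variant the paper's scripts use. The function name `cohens_d` and the toy inputs are illustrative only and do not come from the repository.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    # Pool the two unbiased sample variances, weighted by degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Toy data, not taken from the paper: both groups have variance 1 and
# the means differ by 1, so the standardized difference is -1.
d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(d, 3))
```

Because $d$ standardizes the mean difference by spread rather than by standard error, it does not shrink toward "significance" as the number of dependent pairs grows, which is the property the context lines rely on.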
@@ -214,17 +214,17 @@ Table XI reports both calibration-fold and held-out-fold capture rates with Wils
 <!-- TABLE XI: Held-Out vs Calibration Firm A Capture Rates and Generalization Test
 | Rule | Calibration 70% fold (CI) | Held-out 30% fold (CI) | 2-prop z | p | k/n calib | k/n held |
 |------|---------------------------|-------------------------|----------|---|-----------|----------|
-| cosine > 0.837 | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31 | 0.756 n.s. | 45,088/45,116 | 15,321/15,332 |
-| cosine > 0.945 (2D GMM marginal) | 93.77% [93.55%, 93.98%] | 94.78% [94.41%, 95.12%] | -4.54 | <0.001 | 42,304/45,116 | 14,531/15,332 |
-| cosine > 0.950 | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97 | <0.001 | 41,571/45,116 | 14,352/15,332 |
-| cosine > 0.9407 (calib-fold P5) | 95.00% [94.80%, 95.20%] | 95.64% [95.31%, 95.95%] | -2.83 | 0.005 | 42,862/45,116 | 14,664/15,332 |
-| dHash_indep ≤ 5 | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001 | 37,434/45,116 | 13,467/15,332 |
-| dHash_indep ≤ 8 | 94.84% [94.63%, 95.05%] | 96.13% [95.82%, 96.43%] | -6.45 | <0.001 | 42,791/45,116 | 14,739/15,332 |
-| dHash_indep ≤ 9 (calib-fold P95) | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07 | <0.001 | 43,603/45,116 | 14,945/15,332 |
-| dHash_indep ≤ 15 | 99.83% [99.79%, 99.86%] | 99.84% [99.77%, 99.89%] | -0.31 | 0.754 n.s. | 45,038/45,116 | 15,308/15,332 |
-| cosine > 0.95 AND dHash_indep ≤ 8 | 89.40% [89.12%, 89.69%] | 91.54% [91.09%, 91.97%] | -7.60 | <0.001 | 40,335/45,116 | 14,035/15,332 |
+| cosine > 0.837 | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31 | 0.756 n.s. | 45,087/45,116 | 15,321/15,332 |
+| cosine > 0.9407 (calib-fold P5) | 94.99% [94.79%, 95.19%] | 95.63% [95.29%, 95.94%] | -3.19 | 0.001 | 42,856/45,116 | 14,662/15,332 |
+| cosine > 0.945 (2D GMM marginal) | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54 | <0.001 | 42,305/45,116 | 14,531/15,332 |
+| cosine > 0.950 | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97 | <0.001 | 41,570/45,116 | 14,352/15,332 |
+| dHash_indep ≤ 5 | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001 | 37,430/45,116 | 13,467/15,332 |
+| dHash_indep ≤ 8 | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45 | <0.001 | 42,788/45,116 | 14,739/15,332 |
+| dHash_indep ≤ 9 (calib-fold P95) | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07 | <0.001 | 43,604/45,116 | 14,945/15,332 |
+| dHash_indep ≤ 15 | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31 | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
+| cosine > 0.95 AND dHash_indep ≤ 8 | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60 | <0.001 | 40,335/45,116 | 14,035/15,332 |
 
-Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash median = 2, P95 = 9.
+Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. All counts and z/p values are reproducible from `signature_analysis/24_validation_recalibration.py` (seed = 42).
 -->
 
 Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
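
The commit message states that the z/p columns of Table XI now reproduce from the displayed counts. A reader can spot-check that claim with the standard-library sketch below. It assumes a pooled two-proportion $z$-test with a two-sided normal p-value and the usual Wilson score interval; Script 24 itself is not shown in this diff, so these formulas and the helper names `wilson_ci` and `two_prop_z` are assumptions, not the script's actual code. The counts are copied from the updated "cosine > 0.950" and "cosine > 0.837" rows.

```python
import math


def wilson_ci(k, n, z=1.959964):
    """Wilson score interval for k successes in n trials (95% by default)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def two_prop_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic and two-sided normal p-value."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Updated "cosine > 0.950" row: calibration fold vs held-out fold.
z_950, p_950 = two_prop_z(41570, 45116, 14352, 15332)
lo_950, hi_950 = wilson_ci(41570, 45116)

# Updated "cosine > 0.837" row, the one reported as not significant.
z_837, p_837 = two_prop_z(45087, 45116, 15321, 15332)

print(f"cosine > 0.950: z = {z_950:+.2f}, p = {p_950:.3g}, "
      f"calib CI = [{lo_950:.2%}, {hi_950:.2%}]")
print(f"cosine > 0.837: z = {z_837:+.2f}, p = {p_837:.3f}")
```

Under these assumptions the z values should agree with the table to the displayed precision. Interval endpoints that fall near a rounding boundary can differ in the last digit depending on the normal quantile used.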