Paper A v3.5: resolve codex round-4 residual issues

Fully addresses the partial-resolution / unfixed items from codex gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):

Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier table had 1-4-unit transcription errors in k values and a fabricated cos > 0.9407 calibration row; both fixed by rerunning Script 24 with cos = 0.9407 added to COS_RULES and copying exact values from the JSON output.
- Section III-L classifier now defined entirely in terms of the independent-minimum dHash statistic that the deployed code (Scripts 21, 23, 24) actually uses; the legacy "cosine-conditional dHash" language is removed. Tables IX, XI, XII, XVI are now arithmetically consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section III-H now correctly says 0.95 is the whole-sample Firm A P95 of the per-signature cosine distribution, matching III-L and IV-F.

Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word limit. Removed "we break the circularity" overclaim; replaced with "report capture rates on both folds with Wilson 95% intervals to make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that Methods/Results don't deliver; replaced with anchor-based capture / FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected: intra-report consistency (IV-H.3) is a different test (two co-signers on the same report, firm-level homogeneity) and is not a within-CPA year-level mixing check; the assumption is maintained as a bounded identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only the partner-level ranking is threshold-free"; longitudinal-stability uses 0.95 cutoff, intra-report uses the operational classifier.

Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access Regular Papers do not have a standalone Impact Statement). The file itself is retained as an archived non-paper note for cover-letter / grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-signature mandates, [33] Taiwan partner rotation, [34] YOLO original, [35] VLM survey, [36] Mann-Whitney) are now cited in-text:
    [27] in Methodology III-E (dHash definition)
    [31][32][33] in Introduction (audit-quality regulation context)
    [34][35] in Methodology III-C/III-D
    [36] in Results IV-C (Mann-Whitney result)

Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's calibration-fold P5 row is computed from the same data file as the other rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 12:23:03 +08:00
parent 0ff1845b22
commit 12f716ddf1
9 changed files with 172 additions and 48 deletions
@@ -47,7 +47,7 @@ Distribution fitting identified the lognormal distribution as the best parametri

 The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
 Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
-Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney $p < 0.001$, K-S 2-sample $p < 0.001$).
+Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).

 We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
 We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
@@ -214,17 +214,17 @@ Table XI reports both calibration-fold and held-out-fold capture rates with Wils
 <!-- TABLE XI: Held-Out vs Calibration Firm A Capture Rates and Generalization Test
 | Rule | Calibration 70% fold (CI) | Held-out 30% fold (CI) | 2-prop z | p | k/n calib | k/n held |
 |------|---------------------------|-------------------------|----------|---|-----------|----------|
-| cosine > 0.837                      | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31 | 0.756 n.s. | 45,088/45,116 | 15,321/15,332 |
-| cosine > 0.945 (2D GMM marginal)    | 93.77% [93.55%, 93.98%] | 94.78% [94.41%, 95.12%] | -4.54 | <0.001     | 42,304/45,116 | 14,531/15,332 |
-| cosine > 0.950                      | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97 | <0.001     | 41,571/45,116 | 14,352/15,332 |
-| cosine > 0.9407 (calib-fold P5)     | 95.00% [94.80%, 95.20%] | 95.64% [95.31%, 95.95%] | -2.83 | 0.005      | 42,862/45,116 | 14,664/15,332 |
-| dHash_indep ≤ 5                      | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001    | 37,434/45,116 | 13,467/15,332 |
-| dHash_indep ≤ 8                      | 94.84% [94.63%, 95.05%] | 96.13% [95.82%, 96.43%] | -6.45  | <0.001    | 42,791/45,116 | 14,739/15,332 |
-| dHash_indep ≤ 9 (calib-fold P95)     | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07  | <0.001    | 43,603/45,116 | 14,945/15,332 |
-| dHash_indep ≤ 15                     | 99.83% [99.79%, 99.86%] | 99.84% [99.77%, 99.89%] | -0.31  | 0.754 n.s. | 45,038/45,116 | 15,308/15,332 |
-| cosine > 0.95 AND dHash_indep ≤ 8    | 89.40% [89.12%, 89.69%] | 91.54% [91.09%, 91.97%] | -7.60  | <0.001    | 40,335/45,116 | 14,035/15,332 |
+| cosine > 0.837                       | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31  | 0.756 n.s. | 45,087/45,116 | 15,321/15,332 |
+| cosine > 0.9407 (calib-fold P5)      | 94.99% [94.79%, 95.19%] | 95.63% [95.29%, 95.94%] | -3.19  | 0.001      | 42,856/45,116 | 14,662/15,332 |
+| cosine > 0.945 (2D GMM marginal)     | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54  | <0.001     | 42,305/45,116 | 14,531/15,332 |
+| cosine > 0.950                       | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97  | <0.001     | 41,570/45,116 | 14,352/15,332 |
+| dHash_indep ≤ 5                      | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001     | 37,430/45,116 | 13,467/15,332 |
+| dHash_indep ≤ 8                      | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45  | <0.001     | 42,788/45,116 | 14,739/15,332 |
+| dHash_indep ≤ 9 (calib-fold P95)     | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07  | <0.001     | 43,604/45,116 | 14,945/15,332 |
+| dHash_indep ≤ 15                     | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31  | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
+| cosine > 0.95 AND dHash_indep ≤ 8    | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60  | <0.001     | 40,335/45,116 | 14,035/15,332 |

-Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash median = 2, P95 = 9.
+Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. All counts and z/p values are reproducible from `signature_analysis/24_validation_recalibration.py` (seed = 42).
 -->

 Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
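
The claim that Table XI's z/p columns reproduce from the displayed counts can be spot-checked with a short script. The following is a minimal sketch, not the code in `24_validation_recalibration.py`: the function names `wilson_ci` and `two_prop_z` are hypothetical, and it assumes the table uses the pooled-variance two-proportion z-test with a two-sided p-value and the standard Wilson score interval.

```python
import math
from statistics import NormalDist


def wilson_ci(k, n, conf=0.95):
    """Wilson score interval for a binomial proportion k/n."""
    z = NormalDist().inv_cdf(0.5 + conf / 2.0)  # about 1.96 for 95%
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


def two_prop_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic and two-sided p-value (assumed form)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Counts copied from the "cosine > 0.950" row of the updated Table XI.
z_stat, p_val = two_prop_z(41570, 45116, 14352, 15332)
calib_lo, calib_hi = wilson_ci(41570, 45116)
print(f"z = {z_stat:+.2f}, p = {p_val:.3g}")
print(f"calibration CI = [{calib_lo:.2%}, {calib_hi:.2%}]")
```

Under these assumptions the `cosine > 0.950` row gives z of about -5.97, in line with the table; the other rows can be checked the same way by substituting their k/n pairs.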