Paper A v3.18.2: address codex GPT-5.5 round-16 Minor-Revision findings

Codex independent peer review (paper/codex_review_gpt55_v3_18_1.md) audited empirical claims against scripts/JSON reports rather than rubber-stamping prior Accept verdicts. Verdict: Minor Revision. This commit addresses every flagged item.

- Soften mechanism-identification language (Results IV-D.1, Discussion B): per-signature cosine "fails to reject unimodality" rather than "reflects a single dominant generative mechanism"; framing tied to joint evidence.
- Replace overabsolute "single stored image" with multi-template phrasing in Introduction and Methodology III-A.
- Reframe Methodology III-H so practitioner knowledge is non-load-bearing; evidentiary basis is the paper's own image evidence.
- Fix stale section cross-references after the v3.18 retitling: IV-F.* -> IV-G.* in 11 locations across methodology and results.
- Fix 0.941 / 0.945 / 0.9407 wording in Methodology III-K to use the calibration-fold P5 = 0.9407 and the rounded sensitivity cut 0.945.
- Soften "sharp discontinuity" in Results IV-G.3 to "23-28 percentage-point gap consistent with firm-wide non-hand-signing practice".
- Soften Conclusion's "directly generalizable" with explicit conditions on analogous anchors and artifact-generation physics.
- Add Appendix B: table-to-script provenance map (15 manuscript tables mapped to generating scripts and JSON report artifacts).
- New script signature_analysis/28_byte_identity_decomposition.py produces reproducible artifacts for two previously-unverified claims:
  (a) 145 / 50 / 180 / 35 Firm A byte-identity decomposition (verified);
  (b) cross-firm dual-descriptor convergence -- corrected from the previous manuscript text "non-Firm-A 11.3% vs Firm A 58.7% (5x)" to the database-verified "non-Firm-A 42.12% vs Firm A 88.32% (~2.1x)".
- Clarify scripts 19 / 21 docstrings: legacy EER / FRR / Precision / F1 helpers are retained for diagnostic use only and are NOT cited as biometric performance in the paper. Remove "interview evidence" wording.
- Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 20:23:08 +08:00
parent cb77f481ec
commit 4bb7aa9189
9 changed files with 299 additions and 53 deletions
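Reviewer note on the Script 21 docstring in the diff that follows: its step 3 reports 95% Wilson confidence intervals for FAR at canonical thresholds. As a minimal sketch of that interval, assuming the standard Wilson score formula for a binomial proportion, the calculation could look like the block below. The function name `wilson_interval` and its arguments are hypothetical and are not taken from Script 21; the example counts are illustrative only and are not results from the paper.

```python
import math


def wilson_interval(false_accepts, n_pairs, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (hypothetical helper).

    false_accepts: number of negative-anchor pairs above the threshold
    n_pairs:       total number of negative-anchor pairs
    z:             standard normal quantile (default: two-sided 95%)
    """
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    if not 0 <= false_accepts <= n_pairs:
        raise ValueError("false_accepts must lie in [0, n_pairs]")

    p_hat = false_accepts / n_pairs
    z2 = z * z
    denom = 1.0 + z2 / n_pairs
    center = (p_hat + z2 / (2.0 * n_pairs)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n_pairs + z2 / (4.0 * n_pairs ** 2))
        / denom
    )
    # Clamp to [0, 1] to absorb floating-point round-off at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Illustrative counts only: zero false accepts among 50,000 negative pairs.
    low, high = wilson_interval(0, 50_000)
    print(f"FAR 95% Wilson CI: [{low:.6f}, {high:.6f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays informative when the observed count is zero, which is the usual motivation for preferring it with a large negative anchor and very small FAR values.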
@@ -2,26 +2,39 @@
 """
 Script 21: Expanded Validation with Larger Negative Anchor + Held-out Firm A
 ============================================================================
-Addresses codex review weaknesses of Script 19's pixel-identity validation:
+Addresses three weaknesses of Script 19's pixel-identity validation:

  (a) Negative anchor of n=35 (cosine<0.70) is too small to give
      meaningful FAR confidence intervals.
-  (b) Pixel-identical positive anchor is an easy subset, not
-      representative of the broader positive class.
-  (c) Firm A is both the calibration anchor and the validation anchor
-      (circular).
+  (b) Pixel-identical positive anchor is a CONSERVATIVE SUBSET of the
+      true non-hand-signed class, not representative of the broader
+      positive class. Recall against this subset is therefore a
+      lower-bound calibration check, not a generalizable recall
+      estimate.
+  (c) Firm A is both the calibration anchor and a validation anchor
+      (circular). The 70/30 fold split makes within-Firm-A sampling
+      variance visible without claiming external validation.

 This script:
  1. Constructs a large inter-CPA negative anchor (~50,000 pairs) by
     randomly sampling pairs from different CPAs. Inter-CPA high
     similarity is highly unlikely to arise from legitimate signing.
  2. Splits Firm A CPAs 70/30 into CALIBRATION and HELDOUT folds.
-     Re-derives signature-level / accountant-level thresholds from the
-     calibration fold only, then reports all metrics (including Firm A
-     anchor rates) on the heldout fold.
-  3. Computes proper EER (FAR = FRR interpolated) in addition to
-     metrics at canonical thresholds.
-  4. Computes 95% Wilson confidence intervals for each FAR/FRR.
+     Re-derives signature-level thresholds from the calibration fold
+     only, then reports capture rates on the heldout fold.
+  3. Computes 95% Wilson confidence intervals for FAR at canonical
+     thresholds (Table X in the manuscript).
+
+Legacy / diagnostic-only metrics:
+  Helper functions for EER, Precision, Recall, F1, and FRR remain in
+  this script for backward compatibility. The manuscript intentionally
+  OMITS these metrics from Table X because the byte-identical positive
+  anchor has cosine ~= 1 by construction (so FRR / EER are arithmetic
+  tautologies) and because positive and negative anchors are
+  constructed from different sampling units, making prevalence
+  arbitrary (so Precision and F1 have no meaningful population
+  interpretation). Only FAR against the large inter-CPA negative
+  anchor is reported as a biometric metric in the paper.

 Output:
  reports/expanded_validation/expanded_validation_report.md