Phase 6 round-4 codex-review fixes: blocker + 2 majors + minors
Codex round-3 verification (commit 9e68f2e) flagged remaining items:
BLOCKER:
- Appendix A Table A.I was inside an HTML comment but visible prose at L1268
and L1275 referenced it as if it rendered. Un-commented Table A.I (same fix
pattern as Tables I-IV in round-3).
MAJOR:
- Table XXVII (diagnostic summary) appeared in §III-M at L593 before Tables
III-XXVI in §IV, breaking sequential IEEE-style numbering. Moved the table
to Appendix A as Table A.II (new §A.2 "Diagnostic Summary"). §III-M now
references "Appendix A Table A.II" instead of hosting the table inline;
§VI Conclusion contribution (8) updated similarly. Appendix A heading
generalised to "Supplementary Diagnostic Detail" with §A.1 (BD/McCrary)
and §A.2 (Diagnostic Summary).
- §III-M L614 rate-definition conflation: the sentence "per-signature and
per-document rates ($0.11$ and $0.34$ respectively under the deployed
any-pair HC + MC alarm)" mixed unit labels. Rewrote to label each rate by
its actual unit and source table: per-signature any-pair HC ICCR ($0.11$;
Table XXII) and per-document HC+MC alarm-rate ICCR ($0.34$; Table XXIII).
MINORS:
- L400 §III-L stale ref retargeted: "operational signature-level classifier
of §III-L" -> "of §III-H.1 calibrated in §III-L".
- L663 §III-L stale ref retargeted: "KDE crossover used in per-document
classification (Section III-L)" -> "(Section III-H.1)".
- L1146 Conclusion "corpus-universal" overstated generality -> "across the
tested eligible scopes" (non-Big-4 evidence is cosine-axis only, not full
calibration scope).
- L1024 Results §IV-M.4 footnote: np.argmax / np.unique implementation
detail moved to "alphabetically-ordered tie-break ... full implementation
detail in the supplementary materials" (less script-internal prose).
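The alphabetical tie-break falls out of NumPy semantics: `np.unique` returns labels in sorted order, and `np.argmax` returns the first index of the maximum. A minimal sketch (the helper name and data are hypothetical; Script 45's internals are not reproduced here):

```python
import numpy as np

def mode_of_firms(firm_labels):
    # np.unique returns labels in sorted (alphabetical) order;
    # np.argmax returns the FIRST index attaining the max count,
    # so count ties resolve to the alphabetically earliest firm.
    labels, counts = np.unique(firm_labels, return_counts=True)
    return labels[np.argmax(counts)]

# Two-way count tie between "FirmB" and "FirmA" resolves to "FirmA".
print(mode_of_firms(["FirmB", "FirmA", "FirmB", "FirmA"]))  # FirmA
```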
Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1316 lines.
- Final main-text table sequence: I, II, III, ..., XXVI (26 tables, all
sequential in rendered order). Appendix tables: A.I, A.II.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -229,7 +229,7 @@ We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptiv
 - The Big-4 K=3 mixture exhibits a reproducible three-component component shape across LOOO folds at the descriptor-position level, with C1 reproducibly located at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$.
 - Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp); K=3 is therefore not used to assign operational labels to CPAs.
 
-The operational signature-level classifier of §III-L is calibrated against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K.
+The operational signature-level classifier of §III-H.1 is calibrated in §III-L against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K.
 
 ## K. Convergent Internal-Consistency Checks
 
@@ -422,28 +422,11 @@ We *do not* interpret the deployed-rate excess as a presumed true-positive rate;
 
 The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth.
 
-Each diagnostic reported in this paper therefore addresses one specific failure mode of an unsupervised screening classifier (Table XXVII), with an explicitly disclosed untested assumption:
-
-**Table XXVII.** Diagnostics, failure mode addressed, and disclosed untested assumption.
-
-| Diagnostic | Failure mode addressed | Disclosed untested assumption |
-|---|---|---|
-| Composition decomposition (§III-I.4; Scripts 39b–39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) |
-| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) |
-| Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | Same as above + that pool replacement preserves the negative-anchor property |
-| Document-level ICCR (§III-L.3; Script 45) | Operational alarm rate proxy at per-document unit under three alarm definitions | Same as above |
-| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check |
-| Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within source firm | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$–$99.96\%$ within-firm at all four firms versus $76.7$–$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 45's mode-of-firms tie-break (§IV-M.4 footnote) |
-| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test |
-| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal-consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent |
-| Pixel-identical conservative positive capture (§III-K.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold |
-| LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity |
-
-No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth.
+Each diagnostic reported in this paper therefore addresses one specific failure mode of an unsupervised screening classifier; the full diagnostic-to-failure-mode-to-assumption map is given in Appendix A Table A.II. No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth.
 
 **Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, supplementary materials) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.
 
-**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.
+**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature any-pair HC ICCR ($0.11$; Table XXII) and per-document HC+MC alarm-rate ICCR ($0.34$; Table XXIII) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.
 
 **Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and missed-replication risk (per false negative) — neither directly observable from corpus data.
 
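The any-pair versus same-pair-joint alarm semantics contrasted in the cross-firm hit-matrix row and the Scope paragraph can be sketched as below. This is a hypothetical reading, not the manuscript's scripts: function names and the `(cos, dHash)` pair representation are invented; only the deployed thresholds (cos $> 0.95$, dHash $\leq 5$) and the tightened operating point (cos $> 0.98$, dHash $\leq 3$) come from the text.

```python
def any_pair_alarm(pairs, cos_thr=0.95, dhash_thr=5):
    # Fires if SOME pair clears the cosine gate and SOME (possibly
    # different) pair clears the dHash gate.
    return (any(c > cos_thr for c, _ in pairs)
            and any(d <= dhash_thr for _, d in pairs))

def same_pair_alarm(pairs, cos_thr=0.95, dhash_thr=5):
    # Stricter: a SINGLE pair must clear both gates jointly.
    return any(c > cos_thr and d <= dhash_thr for c, d in pairs)

# No single pair clears both gates, so only the any-pair rule fires.
pairs = [(0.97, 9), (0.90, 3)]
print(any_pair_alarm(pairs), same_pair_alarm(pairs))  # True False
```

The tightened operating point from the tradeoff paragraph is just a parameter change, e.g. `same_pair_alarm(pairs, cos_thr=0.98, dhash_thr=3)`, which can only shrink the alarm set.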