Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul

Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1): Must-fix items (6/6): - §III-F SSIM/pixel rejection rewritten from first principles (design-level argument from luminance/contrast/structure local-window product, not the prior empirical 0.70 result) - Table VI restructured by population × method; added missing Firm A logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary marked bin-unstable (Appendix A) - Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted from "operational dual" to "calibration-fold-adjacent reference"; the actual classifier rule cos>0.95 AND dH≤15 = 92.46% added throughout - New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A on top); script 30_yearly_big4_comparison.py - Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets - §III-K reframed P7.5 from "round-number lower-tail boundary" to operating point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds: 0.9407 / 0.945 / 0.95 / 0.977 / 0.985) Nice-to-have items (3/3): - Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985) - Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to table notes; cut "invite reviewer skepticism" and "non-load-bearing" Codex 3-pass verification cleanup: - Stale 0.973/0.977/0.979 references unified on canonical 0.977 (Firm A Beta-2 forced-fit crossing from beta_mixture_results.json) - dHash≤8 wording corrected to P95-adjacent (P95 = 9, ≤8 is the integer immediately below) instead of misleading "rounded down" - Table XII-B prose corrected: per-segment qualification of "non-Firm-A capture falls faster" (true on 0.95→0.977 segment but contracts on 0.977→0.985 segment); arithmetic now from exact counts Within-year analyses removed: - Within-year ranking robustness check (Class A) was added in nice-to-have pass but contradicts v3.14 A2-removal stance; removed from §IV-G.2 + the Appendix B provenance row - Within-CPA future-work disclosures (Class B) removed from Discussion limitation #5 and Conclusion future-work paragraph; subsequent limitations renumbered Sixth → Fifth, Seventh → Sixth DOCX rendering pipeline overhaul (paper/export_v3.py): Critical fix - every v3 DOCX since v3.0 was shipping WITHOUT TABLES: strip_comments() was wholesale-deleting HTML comments, but every numerical table is wrapped in , so the table body was deleted alongside the wrapper. Now unwraps TABLE comments (emit synthetic __TABLE_CAPTION__: marker + table body) while still stripping non-TABLE editorial comments. Result: 19 tables now render in the DOCX. Other rendering fixes: - LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤≥, ×·≈, →↔⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,}) - Math-context-scoped sub/superscript via PUA sentinels (/): no more underscore-eating in identifiers like signature_analysis - Display equations rendered via matplotlib mathtext to PNG (3 equations: cosine sim, mixture crossing, BD/McCrary Z statistic), embedded as numbered equation blocks (1), (2), (3); content-addressed cache at paper/equations/ (gitignored, regenerable) - Manual numbered/bulleted list rendering with hanging indent (replaces python-docx style="List Number" which silently drops the number prefix when no numbering definition is bound) - Markdown blockquote (> ...) defensively stripped - Pandoc footnote ([^name]) markers no longer leak (inlined at source) - Heading text cleaned of LaTeX residue + PUA sentinels - File paths in body text (signature_analysis/X.py, reports/Y.json) trimmed to "(reproduction artifact in Appendix B)" pointers New leak linter: paper/lint_paper_v3.py - two-pass markdown source + rendered DOCX leak detector; auto-runs at end of export_v3.py. Script changes: - 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to canonical FAR threshold list so Table XII-B is reproducible from persisted JSON - 30_yearly_big4_comparison.py: NEW; generates Fig. 4 + per-firm yearly data (writes to reports/figures/ and reports/firm_yearly_comparison/) - 31_within_year_ranking_robustness.py: NEW; supports the within-year robustness check (no longer cited in paper but kept as repo-internal due-diligence artifact) Partner handoff DOCX shipped to ~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB: 19 tables + 4 figures + 3 equation images). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:44:49 +08:00
parent 623eb4cd4b
commit 53125d11d9
13 changed files with 1554 additions and 112 deletions
@@ -102,17 +102,30 @@ The three diagnostics agree that per-signature similarity does not form a clean
 Table VI summarises the signature-level threshold-estimator outputs for cross-method comparison.

 <!-- TABLE VI: Signature-Level Threshold-Estimator Summary
-| Method | Cosine | dHash |
-|--------|--------|-------|
-| All-pairs KDE crossover (Section IV-C)                                          | 0.837              | —      |
-| Beta-2 EM crossing (Firm A; forced fit, BIC prefers $K{=}3$)                    | 0.977              | —      |
-| logit-GMM-2 crossing (Full sample; forced fit)                                  | 0.980              | —      |
-| BD/McCrary transition (diagnostic; bin-unstable, Appendix A)                    | 0.985              | 2.0    |
-| Firm A whole-sample P7.5 (operational anchor; Section III-K)                    | 0.95               | —      |
-| Firm A whole-sample dHash_indep P75 (operational $\leq 5$ band lower edge)      | —                  | 4      |
-| Firm A whole-sample dHash_indep ceiling (style-consistency boundary)            | —                  | 15     |
-| Firm A calibration-fold cosine P5 (held-out validation; Section IV-F.2)         | 0.9407             | —      |
-| Firm A calibration-fold dHash_indep P95 (held-out validation)                   | —                  | 9      |
+| Population | Method | Cosine threshold | dHash threshold | Status |
+|------------|--------|------------------|-----------------|--------|
+| **Threshold estimators (signature-level distributional fits)** | | | | |
+| Firm A signature-level    | KDE antimode + Hartigan dip (Section III-I.1)  | undefined           | —    | unimodal at $\alpha=0.05$ ($p=0.169$); antimode not defined for unimodal data |
+| Firm A signature-level    | Beta-2 EM crossing (Section III-I.2)           | 0.977               | —    | forced fit; BIC strongly prefers $K{=}3$ ($\Delta\text{BIC} = 381$) |
+| Firm A signature-level    | logit-Gaussian-2 crossing (robustness check)   | 0.999               | —    | forced fit; sharply inconsistent with Beta-2 crossing—reflects parametric-form sensitivity |
+| Full-sample signature-l.  | KDE antimode + Hartigan dip                    | (multiple modes)    | —    | multimodal ($p<0.001$); KDE crossover at full-sample is dominated by between-firm heterogeneity |
+| Full-sample signature-l.  | Beta-2 EM crossing                             | no crossing         | —    | forced fit; component densities do not cross over $[0,1]$ under recovered parameters |
+| Full-sample signature-l.  | logit-Gaussian-2 crossing                      | 0.980               | —    | forced fit; BIC strongly prefers $K{=}3$ ($\Delta\text{BIC} = 10{,}175$) |
+| **Density-smoothness diagnostics (not threshold estimators)** | | | | |
+| Firm A signature-level    | BD/McCrary candidate transition (Section III-I.3) | 0.985 (bin 0.005)| 2.0 (bin 1) | bin-unstable across $\{0.003, 0.005, 0.010, 0.015\}$ (Appendix A); transition lies *inside* the non-hand-signed mode |
+| Full-sample signature-l.  | BD/McCrary candidate transition                | 0.985 (bin 0.005)   | 2.0 (bin 1) | bin-unstable across $\{0.003, 0.005, 0.010, 0.015\}$ (Appendix A) |
+| **Reference: between-class KDE (different unit of analysis)** | | | | |
+| All-pairs intra/inter (pair-level; Section IV-C) | KDE crossover | 0.837 | — | reference point for the Uncertain/Likely-hand-signed boundary in the operational classifier |
+| **Operational classifier anchors and percentile cross-references** | | | | |
+| Firm A whole-sample       | P7.5 (operational anchor; Section III-K)       | 0.95                | —    | operational cosine cut for the five-way classifier |
+| Firm A whole-sample       | dHash$_\text{indep}$ P75                       | —                   | 4    | informs the $\leq 5$ high-confidence band edge in the classifier |
+| Firm A whole-sample       | dHash$_\text{indep}$ style-consistency ceiling | —                   | 15   | operational $> 15$ style-consistency boundary |
+| Firm A calibration fold (70%) | cosine P5 (Section IV-F.2)                  | 0.9407              | —    | calibration-fold cross-reference; held-out fold reports rates at this cut |
+| Firm A calibration fold (70%) | dHash$_\text{indep}$ P95                    | —                   | 9    | calibration-fold cross-reference (Tables IX and XI report rates at the rounded $\leq 8$ cut for continuity) |
+
+Read this table by *population × method*: each row reports one method applied to one population.
+The first three blocks (threshold estimators; density-smoothness diagnostics; between-class KDE) are *characterisation* outputs; the bottom block is the operational anchor set used by the classifier of Section III-K.
+The disagreement between Firm A Beta-2 (0.977) and Firm A logit-Gaussian-2 (0.999) is the parametric-form sensitivity referenced in the prose of Section IV-D.3; it cannot be resolved from the data because BIC rejects the underlying $K{=}2$ assumption itself.
 -->

 Non-hand-signed replication quality is therefore best read as a continuous spectrum produced by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) acting on a common stored exemplar.
@@ -126,20 +139,30 @@ Table IX reports the proportion of Firm A signatures crossing each candidate thr
 <!-- TABLE IX: Firm A Whole-Sample Capture Rates (consistency check, NOT external validation)
 | Rule | Firm A rate | k / N |
 |------|-------------|-------|
-| cosine > 0.837 (all-pairs KDE crossover)              | 99.93% | 60,408 / 60,448 |
-| cosine > 0.9407 (calibration-fold P5)                 | 95.15% | 57,518 / 60,448 |
-| cosine > 0.945 (calibration-fold P5 rounded)          | 94.02% | 56,836 / 60,448 |
-| cosine > 0.95                                         | 92.51% | 55,922 / 60,448 |
-| dHash_indep ≤ 5 (whole-sample upper-tail of mode)     | 84.20% | 50,897 / 60,448 |
-| dHash_indep ≤ 8                                       | 95.17% | 57,527 / 60,448 |
-| dHash_indep ≤ 15 (style-consistency boundary)         | 99.83% | 60,348 / 60,448 |
-| cosine > 0.95 AND dHash_indep ≤ 8 (operational dual)  | 89.95% | 54,370 / 60,448 |
+| **Cosine-only marginal rates** | | |
+| cosine > 0.837 (all-pairs KDE crossover)                                  | 99.93% | 60,408 / 60,448 |
+| cosine > 0.9407 (calibration-fold P5)                                     | 95.15% | 57,518 / 60,448 |
+| cosine > 0.945 (calibration-fold P5 rounded)                              | 94.02% | 56,836 / 60,448 |
+| cosine > 0.95 (operational; whole-sample Firm A P7.5)                     | 92.51% | 55,922 / 60,448 |
+| **dHash-only marginal rates** | | |
+| dHash_indep ≤ 5 (operational high-confidence cap)                         | 84.20% | 50,897 / 60,448 |
+| dHash_indep ≤ 8 (calibration-fold P95 rounded)                            | 95.17% | 57,527 / 60,448 |
+| dHash_indep ≤ 15 (operational style-consistency boundary)                 | 99.83% | 60,348 / 60,448 |
+| **Operational classifier dual rules (Section III-K)** | | |
+| cosine > 0.95 AND dHash_indep ≤ 5 (high-confidence non-hand-signed)       | 81.70% | 49,389 / 60,448 |
+| cosine > 0.95 AND 5 < dHash_indep ≤ 15 (moderate-confidence)              | 10.76% | 6,503 / 60,448  |
+| cosine > 0.95 AND dHash_indep ≤ 15 (combined non-hand-signed)             | 92.46% | 55,892 / 60,448 |
+| **Calibration-fold-adjacent cross-reference (not the operational classifier rule)** | | |
+| cosine > 0.95 AND dHash_indep ≤ 8                                         | 89.95% | 54,370 / 60,448 |

 All rates computed exactly from the full Firm A sample (N = 60,448 signatures); per-rule counts and codes are available in the supplementary materials.
+The two operational dHash cuts ($\leq 5$ for the high-confidence cap and $\leq 15$ for the style-consistency boundary) come from the classifier definition in Section III-K and are the rules used by the five-way classifier of Tables XII and XVII; the dHash $\leq 8$ row is *not* an operational classifier rule but a calibration-fold-adjacent reference (Section IV-F.2 calibration-fold dHash P95 = 9; we report the $\leq 8$ rate as the integer-valued threshold immediately below P95, included here so that Firm A capture in the calibration-fold-P95 neighbourhood can be read off the same table).
 -->

-Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to the whole-sample Firm A distribution described in Section III-K (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
-The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 92.5% / 7.5% signature-level split (Section III-H).
+Table IX is a whole-sample consistency check rather than an external validation: the cosine cut $0.95$ and the operational dHash band edges ($\leq 5$ high-confidence cap and $\leq 15$ style-consistency boundary) are themselves anchored to the whole-sample Firm A distribution described in Section III-K (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
+The operational dual rule used by the five-way classifier of Section III-K---cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ (the union of the high-confidence and moderate-confidence non-hand-signed buckets)---captures 92.46% of Firm A; the high-confidence component alone (cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 5$) captures 81.70%.
+For continuity with prior calibration-fold reporting (Section IV-F.2 reports the calibration-fold rate at the calibration-fold-P95-adjacent cut $\text{dHash}_\text{indep} \leq 8$), Table IX also lists the cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ rate of 89.95%; this is *not* the operational classifier rule but a cross-reference value.
+Both operational rates are consistent with the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 92.5% / 7.5% signature-level split (Section III-H).
 Section IV-F.2 reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.

 ## F. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
@@ -149,7 +172,7 @@ We report three validation analyses corresponding to the anchors of Section III-
 ### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor

 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G.
-Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; this Firm A decomposition is reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (Appendix B).
+Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; reproduction artifact for this Firm A decomposition is listed in Appendix B.
 As the gold-negative anchor we sample 50,000 i.i.d. random cross-CPA signature pairs from the full 168,755-signature matched corpus (inter-CPA cosine: mean $= 0.763$, $P_{95} = 0.886$, $P_{99} = 0.915$, max $= 0.992$).
 Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
 We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
@@ -163,8 +186,8 @@ We do not report an Equal Error Rate: EER is meaningful only when the positive a
 | 0.900                            | 0.0250 | [0.0237, 0.0264] |
 | 0.945 (calibration-fold P5 rounded) | 0.0008 | [0.0006, 0.0011] |
 | 0.950 (whole-sample Firm A P7.5; operational cut)  | 0.0005 | [0.0003, 0.0007] |
-| 0.973 (signature-level Beta/KDE forced-fit reference) | 0.0002 | [0.0001, 0.0004] |
-| 0.979 (signature-level Beta-2 forced-fit crossing)  | 0.0001 | [0.0001, 0.0003] |
+| 0.977 (Firm A Beta-2 forced-fit crossing; Section IV-D)  | 0.00014 | [0.00007, 0.00029] |
+| 0.985 (BD/McCrary candidate transition; Appendix A) | 0.00004 | [0.00001, 0.00015] |

 Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
 -->
@@ -193,17 +216,17 @@ Table XI reports both calibration-fold and held-out-fold capture rates with Wils
 | dHash_indep ≤ 8                      | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45  | <0.001     | 42,788/45,116 | 14,739/15,332 |
 | dHash_indep ≤ 9 (calib-fold P95)     | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07  | <0.001     | 43,604/45,116 | 14,945/15,332 |
 | dHash_indep ≤ 15                     | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31  | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
-| cosine > 0.95 AND dHash_indep ≤ 8    | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60  | <0.001     | 40,335/45,116 | 14,035/15,332 |
+| cosine > 0.95 AND dHash_indep ≤ 8 (calibration-fold P95-adjacent reference; P95 = 9) | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60  | <0.001     | 40,335/45,116 | 14,035/15,332 |
+| cosine > 0.95 AND dHash_indep ≤ 15 (operational classifier rule, Section III-K) | 92.09% [91.84%, 92.34%] | 93.56% [93.16%, 93.93%] | -5.93  | <0.001     | 41,548/45,116 | 14,344/15,332 |

 Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. Counts and z/p values are reproducible from the supplementary materials (fixed random seed).
 -->

-Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
 We report fold-versus-fold comparisons rather than fold-versus-whole-sample comparisons, because the whole-sample rate is a weighted average of the two folds and therefore cannot, in general, fall inside the Wilson CI of either fold when the folds differ in rate; the correct generalization reference is the calibration fold, which produced the thresholds.

 Under this proper test the two extreme rules agree across folds (cosine $> 0.837$ and $\text{dHash}_\text{indep} \leq 15$; both $p > 0.7$).
 The operationally relevant rules in the 85–95% capture band differ between folds by 1–5 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).
-Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 85–99% range has a held-out counterpart in the 87–99% range, and the operational dual rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.40% of the calibration fold and 91.54% of the held-out fold.
+Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 85–99% range has a held-out counterpart in the 87–99% range, and the calibration-fold-adjacent reference rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ (the integer cut immediately below the calibration-fold dHash P95 of 9) captures 89.40% of the calibration fold and 91.54% of the held-out fold; the operational classifier rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K captures still higher rates in both folds (calibration 92.09%, 41,548 / 45,116; held-out 93.56%, 14,344 / 15,332).
 The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity: the random 30% CPA sample evidently contained proportionally more high-replication CPAs.
 We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-G.2) is the cross-check that is robust to this fold variance.

@@ -214,25 +237,79 @@ We report a sensitivity check in which this round-number cut is replaced by the
 Table XII reports the five-way classifier output under each cut.

 <!-- TABLE XII: Classifier Sensitivity to the Operational Cosine Cut (All-Sample Five-Way Output, N = 168,740 signatures)
-| Category                                   | cos > 0.95 count (%) | cos > 0.945 count (%) | Δ count |
-|--------------------------------------------|----------------------|-----------------------|---------|
-| High-confidence non-hand-signed            | 76,984 (45.62%)      | 79,278 (46.98%)       | +2,294  |
-| Moderate-confidence non-hand-signed        | 43,906 (26.02%)      | 50,001 (29.63%)       | +6,095  |
-| High style consistency                     |    546 ( 0.32%)      |    665 ( 0.39%)       |   +119  |
-| Uncertain                                  | 46,768 (27.72%)      | 38,260 (22.67%)       | -8,508  |
-| Likely hand-signed                         |    536 ( 0.32%)      |    536 ( 0.32%)       |     +0  |
+| Cosine cut | High-confidence | Moderate-confidence | High style consistency | Uncertain | Likely hand-signed |
+|------------|-----------------|---------------------|------------------------|-----------|--------------------|
+| cos > 0.940                | 81,069 (48.04%) | 55,308 (32.78%) | 801 (0.47%) |  31,026 (18.39%) | 536 (0.32%) |
+| cos > 0.945                | 79,278 (46.98%) | 50,001 (29.63%) | 665 (0.39%) |  38,260 (22.67%) | 536 (0.32%) |
+| cos > 0.950 (operational)  | 76,984 (45.62%) | 43,906 (26.02%) | 546 (0.32%) |  46,768 (27.72%) | 536 (0.32%) |
+| cos > 0.960                | 70,250 (41.63%) | 29,450 (17.45%) | 288 (0.17%) |  68,216 (40.43%) | 536 (0.32%) |
+| cos > 0.970                | 60,247 (35.70%) | 14,865 ( 8.81%) | 117 (0.07%) |  92,975 (55.10%) | 536 (0.32%) |
+| cos > 0.985                | 37,368 (22.15%) |  2,231 ( 1.32%) |  10 (0.01%) | 128,595 (76.21%) | 536 (0.32%) |
+
+The dHash band edges ($\leq 5$ for high-confidence, $5 < \text{dHash}_\text{indep} \leq 15$ for moderate-confidence, $> 15$ for style) are held fixed across the grid; only the cosine cut varies. The Likely-hand-signed count is invariant across the grid because it depends only on the all-pairs KDE crossover cosine $= 0.837$.
 -->

-At the aggregate firm-level, the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
+At the aggregate firm-level, the calibration-fold-adjacent reference dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
+The operational classifier rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K captures 92.46% under the 0.95 cut and 93.97% under the 0.945 cut---a shift of 1.51 percentage points.
+
+Reading the wider grid in Table XII: the High-confidence and Moderate-confidence shares shift by less than 5 percentage points across the 0.940-0.950 neighbourhood, while pushing the cosine cut to 0.970 or 0.985 produces qualitatively different classifier behaviour (Moderate-confidence collapses from 26.02% at $0.95$ to 8.81% at $0.97$ and 1.32% at $0.985$, with the displaced mass landing in Uncertain rather than reclassifying out of the corpus).
+The classifier output is therefore robust to small (~0.005-cosine) perturbations of the operational cut but not to wholesale reanchoring at the threshold-estimator outputs of Section IV-D, which is consistent with our reading that those outputs are not classifier thresholds.
 At the per-signature categorization level, replacing 0.95 by 0.945 reclassifies 8,508 signatures (5.04% of the corpus) out of the Uncertain band; 6,095 of them migrate to Moderate-confidence non-hand-signed, 2,294 to High-confidence non-hand-signed, and 119 to High style consistency.
 The Likely-hand-signed category is unaffected because it depends only on the fixed all-pairs KDE crossover cosine $= 0.837$.
 The High-confidence non-hand-signed share grows from 45.62% to 46.98%.

 We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within a 0.005-cosine neighbourhood of the Firm A P7.5 anchor, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.
-The paper therefore retains cos $> 0.95$ as the primary operational cut for transparency (round-number P7.5 of the whole-sample Firm A reference distribution) and reports the 0.945 results as a sensitivity check rather than as a deployed alternative.
+
+To make the operating-point selection (Section III-K) auditable rather than presented as a single fixed value, Table XII-B reports the capture-vs-FAR tradeoff over the candidate threshold grid spanning the calibration-fold P5 (0.9407), its rounded value (0.945), the operational anchor (0.95), the Firm A Beta-2 forced-fit crossing from Section IV-D.3 (0.977), and the BD/McCrary candidate transition from Section IV-D.2 (0.985).
+For each grid point we report Firm A capture (under both the cosine-only marginal and the operational dual rule cos $> t$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K), non-Firm-A capture (the cosine-only marginal in the 108,292 non-Firm-A matched signatures), and inter-CPA FAR with Wilson 95% CI against the 50,000-pair anchor of Section IV-F.1.
+
+<!-- TABLE XII-B: Cosine-Threshold Tradeoff: Capture vs Inter-CPA FAR
+| Cosine cut t | Firm A capture (cos > t) | Firm A capture (cos > t AND dHash_indep ≤ 15) | Non-Firm-A capture (cos > t) | Inter-CPA FAR | Inter-CPA FAR Wilson 95% CI |
+|--------------|--------------------------|------------------------------------------------|------------------------------|---------------|------------------------------|
+| 0.9407 (calibration-fold P5)               | 95.15% (57,518/60,448) | 95.09% (57,482/60,448) | 72.68% (78,710/108,292) | 0.00126 | [0.00099, 0.00161] |
+| 0.945 (calibration-fold P5 rounded)        | 94.02% (56,836/60,448) | 93.97% (56,804/60,448) | 67.51% (73,108/108,292) | 0.00082 | [0.00061, 0.00111] |
+| 0.95 (whole-sample Firm A P7.5; **operational cut**) | **92.51%** (55,922/60,448) | **92.46%** (55,892/60,448) | 60.50% (65,514/108,292) | **0.00050** | [0.00034, 0.00074] |
+| 0.977 (Firm A Beta-2 forced-fit crossing)  | 74.53% (45,050/60,448) | 74.51% (45,038/60,448) | 13.14% (14,233/108,292) | 0.00014 | [0.00007, 0.00029] |
+| 0.985 (BD/McCrary candidate transition)    | 55.27% (33,409/60,448) | 55.26% (33,406/60,448) |  5.73%  (6,200/108,292) | 0.00004 | [0.00001, 0.00015] |
+
+Inter-CPA FAR computed against 50,000 i.i.d. inter-CPA pairs (random seed 42, reproducing the anchor of Section IV-F.1 / Table X). Capture and FAR percentages are exact ratios of the displayed integer counts; gap arithmetic in the surrounding prose is computed from those exact counts and rounded to two decimal places. The dual-rule column is the operational classifier rule of Section III-K; for cuts above the dHash-15 saturation point (Firm A dHash$_\text{indep}$ $> 15$ rate is only 0.17%, Table IX), the dual-rule and cosine-only columns coincide to within the dHash$_\text{indep}$ $> 15$ residual.
+-->
+
+Reading Table XII-B, three patterns motivate the choice of $0.95$ as the operating point.
+First, *Firm A capture* on the operational dual rule decays smoothly from 95.09% at $t = 0.9407$ to 55.26% at $t = 0.985$.
+Relaxing the cut from $0.95$ to $0.945$ buys 1.51 percentage points of additional Firm A capture, and to $0.9407$ buys 2.63 percentage points; tightening from $0.95$ to $0.977$ costs 17.96 percentage points and to $0.985$ costs 37.20 percentage points.
+The selected cut at $0.95$ is the strictest cut on this grid at which Firm A capture remains above $90\%$ on the operational dual rule.
+Second, *inter-CPA FAR* is small in absolute terms across the entire candidate grid ($0.00126$ at $0.9407$, falling to $0.00004$ at $0.985$): under any of these operating points the classifier's specificity against random cross-CPA pairs is in the per-mille range or better, so FAR alone does not determine the choice.
+The marginal FAR cost of relaxing from $0.95$ to $0.945$ is $+0.00032$ ($25 \to 41$ false positives per 50,000 pairs) and to $0.9407$ is $+0.00076$ ($25 \to 63$); the marginal FAR savings from tightening to $0.977$ and $0.985$ are $-0.00036$ and $-0.00046$ respectively.
+The FAR savings from going stricter are small in absolute terms compared with the corresponding Firm A capture loss, which makes $0.95$ a balanced operating point on this grid rather than a uniquely optimal one.
+Third, *non-Firm-A capture* (the cosine-only marginal in the 108,292 non-Firm-A signatures) decays from 67.51% at $0.945$ to 60.50% at $0.95$, 13.14% at $0.977$, and 5.73% at $0.985$.
+The Firm-A-minus-non-Firm-A gap widens with strictness through $0.977$ and then contracts (22.41 percentage points at $0.9407$; 26.46 at $0.945$; 31.97 at $0.95$; 61.36 at $0.977$; 49.54 at $0.985$): on the $0.95 \to 0.977$ segment non-Firm-A capture falls faster than Firm A capture in absolute terms ($-47.35$ vs $-17.96$ percentage points), so the widening is dominated by non-Firm-A removal rather than by an intrinsic property of Firm A; on the $0.977 \to 0.985$ segment Firm A capture falls faster than non-Firm-A's already-low residual, so the gap contracts.
+We do *not* read the gap pattern as evidence for a particular cut; it is reported here as cross-firm replication heterogeneity rather than as a selection criterion.
+The operating point at $0.95$ is therefore a defensible---not unique---selection in this neighbourhood, motivated by (i) keeping Firm A capture above $90\%$ on the operational dual rule, (ii) achieving an FAR of $0.0005$ at which marginal further savings from tightening are small relative to the corresponding capture loss, and (iii) preserving the interpretive transparency of the whole-sample Firm A P7.5 reading.
+It is *not* derived from the threshold-estimator outputs of Section IV-D, which the data do not support as classifier thresholds.
+
+The paper therefore retains cos $> 0.95$ as the primary operational cut and reports the 0.945 result of Table XII as a sensitivity check rather than as a deployed alternative; downstream document-level rates (Table XVII) and intra-report agreement (Table XVI) are robust to moderate cutoff shifts within the 0.945--0.95 neighbourhood as long as the same cutoff is applied uniformly across firms.

 ## G. Additional Firm A Benchmark Validation

+Before presenting the three threshold-robust analyses, Fig. 4 summarises the per-firm yearly per-signature best-match cosine distribution that motivates them.
+The left panel reports the mean per-signature best-match cosine within each firm bucket and fiscal year (a threshold-free statistic); the right panel reports the share of each firm-bucket-year with per-signature best-match cosine $\geq 0.95$ (the operational cut of Section III-K).
+Both panels show Firm A above the other Big-4 firms in every year of the 2013-2023 sample, with non-Big-4 firms below all four Big-4 firms throughout, and the cross-firm ordering is stable across the sample period.
+The mean-cosine separation between Firm A and the other Big-4 firms is on the order of 0.02-0.04 throughout the sample (e.g., 2013: Firm A $0.9733$ vs Firm B $0.9498$, Firm C $0.9464$, Firm D $0.9395$, Non-Big-4 $0.9227$; 2023: $0.9860$ vs $0.9668$, $0.9662$, $0.9525$, $0.9346$); the share-above-0.95 separation is wider (2013: Firm A $87.2\%$ vs $61.8\%$, $56.2\%$, $38.5\%$, $27.5\%$).
+This visual is the most direct cross-firm evidence in the paper that Firm A's high-similarity behaviour is firm-specific rather than corpus-wide; the three subsections below decompose this gap along three threshold-free or threshold-robust dimensions.
+
+<!-- FIGURE 4: Per-firm yearly per-signature best-match cosine
+File: reports/figures/fig_yearly_big4_comparison.png (and .pdf)
+Generated by: signature_analysis/30_yearly_big4_comparison.py
+Caption: Per-firm yearly per-signature best-match cosine, 2013-2023.
+(a) Mean per-signature best-match cosine by firm bucket and fiscal year
+(threshold-free). (b) Share of per-signature best-match cosine $\geq 0.95$
+(operational cut of Section III-K). Five lines: Firm A, B, C, D, Non-Big-4.
+Firm A is above the other Big-4 firms in every year; Non-Big-4 is below all
+four Big-4 firms in every year. Per-firm signature counts and exact values
+are in `reports/firm_yearly_comparison/firm_yearly_comparison.{json,md}`.
+-->
+
 The capture rates of Section IV-E are an *internal* consistency check: they ask "how much of Firm A does our threshold capture?", but the threshold was itself derived from Firm A's percentiles, so a high capture rate is not surprising.
 To go beyond this circular check, we report three further analyses, each chosen so that the *informative quantity* does not depend on the threshold's absolute value:

@@ -277,33 +354,38 @@ We test this prediction directly.
 For each auditor-year (CPA $\times$ fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
 Firm A accounts for 1,287 of these (27.8% baseline share).
 Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.
-The per-signature best-match cosine underlying each auditor-year mean is taken over the full same-CPA pool (Section III-G) and may match against signatures from other fiscal years, so the auditor-year mean reflects the year's signatures' position within the CPA's full-sample similarity structure rather than purely within-year similarity; a within-year-restricted sensitivity replication is a natural robustness check and is left to future work.
+The per-signature best-match cosine underlying each auditor-year mean is taken over the full same-CPA pool (Section III-G), consistent with the unit-of-analysis framing in Section III-G.

 <!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)
 | Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
 |-------|-------------|--------|--------|--------|--------|-----------|--------------|
 | 10%   | 462         | 443    | 2      | 3      | 0      | 14        | 95.9% |
+| 20%   | 925         | 877    | 9      | 14     | 2      | 23        | 94.8% |
 | 25%   | 1,157       | 1,043  | 32     | 23     | 9      | 50        | 90.1% |
+| 30%   | 1,388       | 1,129  | 105    | 52     | 25     | 77        | 81.3% |
 | 50%   | 2,314       | 1,220  | 473    | 273    | 102    | 246       | 52.7% |
 -->

-Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile.
+Firm A occupies 95.9% of the top 10%, 94.8% of the top 20%, 90.1% of the top 25%, and 81.3% of the top 30% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of $3.5\times$ at the top decile, $3.4\times$ at the top quintile, and $2.9\times$ at the top tercile.
+Firm A's share decays monotonically as the bracket widens (95.9% $\to$ 94.8% $\to$ 90.1% $\to$ 81.3% $\to$ 52.7% across top-10/20/25/30/50%), and only at the top 50% does its share approach its baseline; the over-representation is therefore concentrated in the very top of the distribution rather than spread uniformly through the upper half.
 Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.

-<!-- TABLE XV: Firm A Share of Top-10% Similarity by Year
-| Year | N auditor-years | Top-10% k | Firm A in top-10% | Firm A share | Firm A baseline |
-|------|-----------------|-----------|-------------------|--------------|-----------------|
-| 2013 | 324 | 32 | 32 | 100.0% | 32.4% |
-| 2014 | 399 | 39 | 39 | 100.0% | 27.8% |
-| 2015 | 394 | 39 | 38 | 97.4% | 27.7% |
-| 2016 | 413 | 41 | 39 | 95.1% | 26.2% |
-| 2017 | 415 | 41 | 41 | 100.0% | 27.2% |
-| 2018 | 434 | 43 | 43 | 100.0% | 26.5% |
-| 2019 | 429 | 42 | 42 | 100.0% | 27.0% |
-| 2020 | 430 | 43 | 38 | 88.4% | 27.7% |
-| 2021 | 450 | 45 | 44 | 97.8% | 28.7% |
-| 2022 | 467 | 46 | 43 | 93.5% | 28.3% |
-| 2023 | 474 | 47 | 46 | 97.9% | 27.4% |
+<!-- TABLE XV: Firm A Share of Top-K Similarity by Year (K = 10%, 20%, 30%)
+| Year | N auditor-years | Top-10% share | Top-20% share | Top-30% share | Firm A baseline |
+|------|-----------------|---------------|---------------|---------------|-----------------|
+| 2013 | 324 | 100.0% (32/32) | 98.4% (63/64) | 89.7% (87/97) | 32.4% |
+| 2014 | 399 | 100.0% (39/39) | 98.7% (78/79) | 82.4% (98/119) | 27.8% |
+| 2015 | 394 | 97.4% (38/39) | 96.2% (75/78) | 84.7% (100/118) | 27.7% |
+| 2016 | 413 | 95.1% (39/41) | 96.3% (79/82) | 81.3% (100/123) | 26.2% |
+| 2017 | 415 | 100.0% (41/41) | 97.6% (81/83) | 83.9% (104/124) | 27.2% |
+| 2018 | 434 | 100.0% (43/43) | 97.7% (84/86) | 80.0% (104/130) | 26.5% |
+| 2019 | 429 | 100.0% (42/42) | 97.6% (83/85) | 78.9% (101/128) | 27.0% |
+| 2020 | 430 | 88.4% (38/43)  | 91.9% (79/86) | 76.0% (98/129)  | 27.7% |
+| 2021 | 450 | 97.8% (44/45)  | 96.7% (87/90) | 81.5% (110/135) | 28.7% |
+| 2022 | 467 | 93.5% (43/46)  | 95.7% (89/93) | 84.3% (118/140) | 28.3% |
+| 2023 | 474 | 97.9% (46/47)  | 94.7% (89/94) | 83.8% (119/142) | 27.4% |
+
+Per-cell entries are "share (k_FirmA / k_total)". Top-25% and top-50% pooled values are reported in Table XIV; per-year top-25/50 columns are omitted from this table to reduce visual width but are reproducible from the supplementary materials.
 -->

 This over-representation is consistent with firm-wide non-hand-signing practice at Firm A and is not derived from any threshold we subsequently calibrate.
@@ -339,8 +421,7 @@ We note that this test uses the calibrated classifier of Section III-K rather th

 ## H. Classification Results

-Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
-The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents have no signature whose extracted handwriting could be matched to a registered CPA name (every such signature has `assigned_accountant IS NULL` in the database, typically because the auditor's report page deviates from the standard two-signature layout or the OCRed printed CPA name was not present in the registry); the per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity exists, and these documents are therefore excluded from the classification reported here.
+Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents (656 documents excluded from the 85,042-document YOLO-detection cohort because no signature on the document could be matched to a registered CPA; see Table XVII note).
 We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
 Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.

@@ -354,6 +435,7 @@ Document-level rates therefore represent the share of reports in which *at least
 | Likely hand-signed | 47 | 0.1% | 4 | 0.0% |

 Per the worst-case aggregation rule of Section III-K, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
+The 84,386-document cohort excludes 656 documents (relative to the 85,042 YOLO-detected cohort of Table III) for which no signature could be matched to a registered CPA: the per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity is defined. The exclusion is definitional rather than discretionary; typical causes are auditor's-report-page formats deviating from the standard two-signature layout, or OCR returning a printed CPA name not present in the registry.
 -->

 Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
@@ -366,7 +448,7 @@ A cosine-only classifier would treat all 71,656 identically; the dual-descriptor

 96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
 This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the within-firm heterogeneity implied by the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 7.5% signature-level left tail (Section III-H).
-The near-zero "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 count here is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset in Table XVI by 4 reports) indicates that the within-firm heterogeneity implied by the 7.5% signature-level left tail (Section IV-D) does not project into the lowest-cosine document-level category under the dual-descriptor rules; it is absorbed instead into the uncertain or high-style-consistency categories at this threshold set.
+The near-zero "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 denominator is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset of Table XVI by 4 mixed-firm reports excluded from the firm-level intra-report comparison) indicates that the within-firm heterogeneity implied by the 7.5% signature-level left tail (Section IV-D) does not project into the lowest-cosine document-level category under the dual-descriptor rules; it is absorbed instead into the uncertain or high-style-consistency categories at this threshold set.
 We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-F.2 is the corresponding external check.

 ### 2) Cross-Firm Comparison of Dual-Descriptor Convergence
@@ -374,7 +456,7 @@ We note that because the non-hand-signed thresholds are themselves calibrated to
 Among the 65,514 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,922 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.
 The Firm A denominator (55,922) matches Table IX exactly: both Table IX and the cross-firm decomposition define Firm A membership via the CPA registry (`accountants.firm`), and the cross-firm analysis additionally requires a non-null independent-min dHash record, which all 55,922 Firm A cosine-eligible signatures have in the current database.
 This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-G.2) and intra-report consistency (Section IV-G.3) findings.
-Counts and percentages are reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (see Appendix B for the table-to-script provenance map).
+Reproduction artifact for these counts is listed in Appendix B.

 ## I. Ablation Study: Feature Backbone Comparison