Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul

Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1):

Must-fix items (6/6):
- §III-F SSIM/pixel rejection rewritten from first principles (design-level
  argument from luminance/contrast/structure local-window product, not the
  prior empirical 0.70 result)
- Table VI restructured by population × method; added missing Firm A
  logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary
  marked bin-unstable (Appendix A)
- Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted
  from "operational dual" to "calibration-fold-adjacent reference"; the
  actual classifier rule cos>0.95 AND dH≤15 = 92.46% added throughout
- New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A
  on top); script 30_yearly_big4_comparison.py
- Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets
- §III-K reframed P7.5 from "round-number lower-tail boundary" to operating
  point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds:
  0.9407 / 0.945 / 0.95 / 0.977 / 0.985)

Nice-to-have items (3/3):
- Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985)
- Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to
  table notes; cut "invite reviewer skepticism" and "non-load-bearing"

Codex 3-pass verification cleanup:
- Stale 0.973/0.977/0.979 references unified on canonical 0.977 (Firm A
  Beta-2 forced-fit crossing from beta_mixture_results.json)
- dHash≤8 wording corrected to P95-adjacent (P95 = 9, ≤8 is the integer
  immediately below) instead of misleading "rounded down"
- Table XII-B prose corrected: per-segment qualification of "non-Firm-A
  capture falls faster" (true on 0.95→0.977 segment but contracts on
  0.977→0.985 segment); arithmetic now from exact counts

Within-year analyses removed:
- Within-year ranking robustness check (Class A) was added in nice-to-have
  pass but contradicts v3.14 A2-removal stance; removed from §IV-G.2 + the
  Appendix B provenance row
- Within-CPA future-work disclosures (Class B) removed from Discussion
  limitation #5 and Conclusion future-work paragraph; subsequent limitations
  renumbered Sixth → Fifth, Seventh → Sixth

DOCX rendering pipeline overhaul (paper/export_v3.py):

Critical fix - every v3 DOCX since v3.0 was shipping WITHOUT TABLES:
strip_comments() was wholesale-deleting HTML comments, but every numerical
table is wrapped in <!-- TABLE X: ... -->, so the table body was deleted
alongside the wrapper. Now unwraps TABLE comments (emit synthetic
__TABLE_CAPTION__: marker + table body) while still stripping non-TABLE
editorial comments. Result: 19 tables now render in the DOCX.

Other rendering fixes:
- LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤≥,
  ×·≈, →↔⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,})
- Math-context-scoped sub/superscript via PUA sentinels (/):
  no more underscore-eating in identifiers like signature_analysis
- Display equations rendered via matplotlib mathtext to PNG (3 equations:
  cosine sim, mixture crossing, BD/McCrary Z statistic), embedded as
  numbered equation blocks (1), (2), (3); content-addressed cache at
  paper/equations/ (gitignored, regenerable)
- Manual numbered/bulleted list rendering with hanging indent (replaces
  python-docx style="List Number" which silently drops the number prefix
  when no numbering definition is bound)
- Markdown blockquote (> ...) defensively stripped
- Pandoc footnote ([^name]) markers no longer leak (inlined at source)
- Heading text cleaned of LaTeX residue + PUA sentinels
- File paths in body text (signature_analysis/X.py, reports/Y.json)
  trimmed to "(reproduction artifact in Appendix B)" pointers

New leak linter: paper/lint_paper_v3.py - two-pass markdown source +
rendered DOCX leak detector; auto-runs at end of export_v3.py.

Script changes:
- 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to canonical FAR
  threshold list so Table XII-B is reproducible from persisted JSON
- 30_yearly_big4_comparison.py: NEW; generates Fig. 4 + per-firm yearly
  data (writes to reports/figures/ and reports/firm_yearly_comparison/)
- 31_within_year_ranking_robustness.py: NEW; supports the within-year
  robustness check (no longer cited in paper but kept as repo-internal
  due-diligence artifact)

Partner handoff DOCX shipped to
~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB:
19 tables + 4 figures + 3 equation images).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-06 13:44:49 +08:00
parent 623eb4cd4b
commit 53125d11d9
13 changed files with 1554 additions and 112 deletions
+138 -56
View File
@@ -102,17 +102,30 @@ The three diagnostics agree that per-signature similarity does not form a clean
Table VI summarises the signature-level threshold-estimator outputs for cross-method comparison.
<!-- TABLE VI: Signature-Level Threshold-Estimator Summary
| Method | Cosine | dHash |
|--------|--------|-------|
| All-pairs KDE crossover (Section IV-C) | 0.837 | — |
| Beta-2 EM crossing (Firm A; forced fit, BIC prefers $K{=}3$) | 0.977 | — |
| logit-GMM-2 crossing (Full sample; forced fit) | 0.980 | — |
| BD/McCrary transition (diagnostic; bin-unstable, Appendix A) | 0.985 | 2.0 |
| Firm A whole-sample P7.5 (operational anchor; Section III-K) | 0.95 | — |
| Firm A whole-sample dHash_indep P75 (operational $\leq 5$ band lower edge) | — | 4 |
| Firm A whole-sample dHash_indep ceiling (style-consistency boundary) | | 15 |
| Firm A calibration-fold cosine P5 (held-out validation; Section IV-F.2) | 0.9407 | — |
| Firm A calibration-fold dHash_indep P95 (held-out validation) | — | 9 |
| Population | Method | Cosine threshold | dHash threshold | Status |
|------------|--------|------------------|-----------------|--------|
| **Threshold estimators (signature-level distributional fits)** | | | | |
| Firm A signature-level | KDE antimode + Hartigan dip (Section III-I.1) | undefined | — | unimodal at $\alpha=0.05$ ($p=0.169$); antimode not defined for unimodal data |
| Firm A signature-level | Beta-2 EM crossing (Section III-I.2) | 0.977 | — | forced fit; BIC strongly prefers $K{=}3$ ($\Delta\text{BIC} = 381$) |
| Firm A signature-level | logit-Gaussian-2 crossing (robustness check) | 0.999 | — | forced fit; sharply inconsistent with Beta-2 crossing—reflects parametric-form sensitivity |
| Full-sample signature-l. | KDE antimode + Hartigan dip | (multiple modes) | — | multimodal ($p<0.001$); KDE crossover at full-sample is dominated by between-firm heterogeneity |
| Full-sample signature-l. | Beta-2 EM crossing | no crossing | — | forced fit; component densities do not cross over $[0,1]$ under recovered parameters |
| Full-sample signature-l. | logit-Gaussian-2 crossing | 0.980 | — | forced fit; BIC strongly prefers $K{=}3$ ($\Delta\text{BIC} = 10{,}175$) |
| **Density-smoothness diagnostics (not threshold estimators)** | | | | |
| Firm A signature-level | BD/McCrary candidate transition (Section III-I.3) | 0.985 (bin 0.005)| 2.0 (bin 1) | bin-unstable across $\{0.003, 0.005, 0.010, 0.015\}$ (Appendix A); transition lies *inside* the non-hand-signed mode |
| Full-sample signature-l. | BD/McCrary candidate transition | 0.985 (bin 0.005) | 2.0 (bin 1) | bin-unstable across $\{0.003, 0.005, 0.010, 0.015\}$ (Appendix A) |
| **Reference: between-class KDE (different unit of analysis)** | | | | |
| All-pairs intra/inter (pair-level; Section IV-C) | KDE crossover | 0.837 | — | reference point for the Uncertain/Likely-hand-signed boundary in the operational classifier |
| **Operational classifier anchors and percentile cross-references** | | | | |
| Firm A whole-sample | P7.5 (operational anchor; Section III-K) | 0.95 | — | operational cosine cut for the five-way classifier |
| Firm A whole-sample | dHash$_\text{indep}$ P75 | — | 4 | informs the $\leq 5$ high-confidence band edge in the classifier |
| Firm A whole-sample | dHash$_\text{indep}$ style-consistency ceiling | — | 15 | operational $> 15$ style-consistency boundary |
| Firm A calibration fold (70%) | cosine P5 (Section IV-F.2) | 0.9407 | — | calibration-fold cross-reference; held-out fold reports rates at this cut |
| Firm A calibration fold (70%) | dHash$_\text{indep}$ P95 | — | 9 | calibration-fold cross-reference (Tables IX and XI report rates at the rounded $\leq 8$ cut for continuity) |
Read this table by *population × method*: each row reports one method applied to one population.
The first three blocks (threshold estimators; density-smoothness diagnostics; between-class KDE) are *characterisation* outputs; the bottom block is the operational anchor set used by the classifier of Section III-K.
The disagreement between Firm A Beta-2 (0.977) and Firm A logit-Gaussian-2 (0.999) is the parametric-form sensitivity referenced in the prose of Section IV-D.3; it cannot be resolved from the data because BIC rejects the underlying $K{=}2$ assumption itself.
-->
Non-hand-signed replication quality is therefore best read as a continuous spectrum produced by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) acting on a common stored exemplar.
@@ -126,20 +139,30 @@ Table IX reports the proportion of Firm A signatures crossing each candidate thr
<!-- TABLE IX: Firm A Whole-Sample Capture Rates (consistency check, NOT external validation)
| Rule | Firm A rate | k / N |
|------|-------------|-------|
| cosine > 0.837 (all-pairs KDE crossover) | 99.93% | 60,408 / 60,448 |
| cosine > 0.9407 (calibration-fold P5) | 95.15% | 57,518 / 60,448 |
| cosine > 0.945 (calibration-fold P5 rounded) | 94.02% | 56,836 / 60,448 |
| cosine > 0.95 | 92.51% | 55,922 / 60,448 |
| dHash_indep ≤ 5 (whole-sample upper-tail of mode) | 84.20% | 50,897 / 60,448 |
| dHash_indep ≤ 8 | 95.17% | 57,527 / 60,448 |
| dHash_indep ≤ 15 (style-consistency boundary) | 99.83% | 60,348 / 60,448 |
| cosine > 0.95 AND dHash_indep ≤ 8 (operational dual) | 89.95% | 54,370 / 60,448 |
| **Cosine-only marginal rates** | | |
| cosine > 0.837 (all-pairs KDE crossover) | 99.93% | 60,408 / 60,448 |
| cosine > 0.9407 (calibration-fold P5) | 95.15% | 57,518 / 60,448 |
| cosine > 0.945 (calibration-fold P5 rounded) | 94.02% | 56,836 / 60,448 |
| cosine > 0.95 (operational; whole-sample Firm A P7.5) | 92.51% | 55,922 / 60,448 |
| **dHash-only marginal rates** | | |
| dHash_indep ≤ 5 (operational high-confidence cap) | 84.20% | 50,897 / 60,448 |
| dHash_indep ≤ 8 (calibration-fold P95 rounded) | 95.17% | 57,527 / 60,448 |
| dHash_indep ≤ 15 (operational style-consistency boundary) | 99.83% | 60,348 / 60,448 |
| **Operational classifier dual rules (Section III-K)** | | |
| cosine > 0.95 AND dHash_indep ≤ 5 (high-confidence non-hand-signed) | 81.70% | 49,389 / 60,448 |
| cosine > 0.95 AND 5 < dHash_indep ≤ 15 (moderate-confidence) | 10.76% | 6,503 / 60,448 |
| cosine > 0.95 AND dHash_indep ≤ 15 (combined non-hand-signed) | 92.46% | 55,892 / 60,448 |
| **Calibration-fold-adjacent cross-reference (not the operational classifier rule)** | | |
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | 54,370 / 60,448 |
All rates computed exactly from the full Firm A sample (N = 60,448 signatures); per-rule counts and codes are available in the supplementary materials.
The two operational dHash cuts ($\leq 5$ for the high-confidence cap and $\leq 15$ for the style-consistency boundary) come from the classifier definition in Section III-K and are the rules used by the five-way classifier of Tables XII and XVII; the dHash $\leq 8$ row is *not* an operational classifier rule but a calibration-fold-adjacent reference (Section IV-F.2 calibration-fold dHash P95 = 9; we report the $\leq 8$ rate as the integer-valued threshold immediately below P95, included here so that Firm A capture in the calibration-fold-P95 neighbourhood can be read off the same table).
-->
Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to the whole-sample Firm A distribution described in Section III-K (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 92.5% / 7.5% signature-level split (Section III-H).
Table IX is a whole-sample consistency check rather than an external validation: the cosine cut $0.95$ and the operational dHash band edges ($\leq 5$ high-confidence cap and $\leq 15$ style-consistency boundary) are themselves anchored to the whole-sample Firm A distribution described in Section III-K (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
The operational dual rule used by the five-way classifier of Section III-K---cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ (the union of the high-confidence and moderate-confidence non-hand-signed buckets)---captures 92.46% of Firm A; the high-confidence component alone (cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 5$) captures 81.70%.
For continuity with prior calibration-fold reporting (Section IV-F.2 reports the calibration-fold rate at the calibration-fold-P95-adjacent cut $\text{dHash}_\text{indep} \leq 8$), Table IX also lists the cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ rate of 89.95%; this is *not* the operational classifier rule but a cross-reference value.
Both operational rates are consistent with the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 92.5% / 7.5% signature-level split (Section III-H).
Section IV-F.2 reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
## F. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
@@ -149,7 +172,7 @@ We report three validation analyses corresponding to the anchors of Section III-
### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor
Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G.
Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; this Firm A decomposition is reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (Appendix B).
Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; reproduction artifact for this Firm A decomposition is listed in Appendix B.
As the gold-negative anchor we sample 50,000 i.i.d. random cross-CPA signature pairs from the full 168,755-signature matched corpus (inter-CPA cosine: mean $= 0.763$, $P_{95} = 0.886$, $P_{99} = 0.915$, max $= 0.992$).
Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
@@ -163,8 +186,8 @@ We do not report an Equal Error Rate: EER is meaningful only when the positive a
| 0.900 | 0.0250 | [0.0237, 0.0264] |
| 0.945 (calibration-fold P5 rounded) | 0.0008 | [0.0006, 0.0011] |
| 0.950 (whole-sample Firm A P7.5; operational cut) | 0.0005 | [0.0003, 0.0007] |
| 0.973 (signature-level Beta/KDE forced-fit reference) | 0.0002 | [0.0001, 0.0004] |
| 0.979 (signature-level Beta-2 forced-fit crossing) | 0.0001 | [0.0001, 0.0003] |
| 0.977 (Firm A Beta-2 forced-fit crossing; Section IV-D) | 0.00014 | [0.00007, 0.00029] |
| 0.985 (BD/McCrary candidate transition; Appendix A) | 0.00004 | [0.00001, 0.00015] |
Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
-->
@@ -193,17 +216,17 @@ Table XI reports both calibration-fold and held-out-fold capture rates with Wils
| dHash_indep ≤ 8 | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45 | <0.001 | 42,788/45,116 | 14,739/15,332 |
| dHash_indep ≤ 9 (calib-fold P95) | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07 | <0.001 | 43,604/45,116 | 14,945/15,332 |
| dHash_indep ≤ 15 | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31 | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60 | <0.001 | 40,335/45,116 | 14,035/15,332 |
| cosine > 0.95 AND dHash_indep ≤ 8 (calibration-fold P95-adjacent reference; P95 = 9) | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60 | <0.001 | 40,335/45,116 | 14,035/15,332 |
| cosine > 0.95 AND dHash_indep ≤ 15 (operational classifier rule, Section III-K) | 92.09% [91.84%, 92.34%] | 93.56% [93.16%, 93.93%] | -5.93 | <0.001 | 41,548/45,116 | 14,344/15,332 |
Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. Counts and z/p values are reproducible from the supplementary materials (fixed random seed).
-->
Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
We report fold-versus-fold comparisons rather than fold-versus-whole-sample comparisons, because the whole-sample rate is a weighted average of the two folds and therefore cannot, in general, fall inside the Wilson CI of either fold when the folds differ in rate; the correct generalization reference is the calibration fold, which produced the thresholds.
Under this proper test the two extreme rules agree across folds (cosine $> 0.837$ and $\text{dHash}_\text{indep} \leq 15$; both $p > 0.7$).
The operationally relevant rules in the 8595% capture band differ between folds by 15 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).
Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 8599% range has a held-out counterpart in the 8799% range, and the operational dual rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.40% of the calibration fold and 91.54% of the held-out fold.
Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 8599% range has a held-out counterpart in the 8799% range, and the calibration-fold-adjacent reference rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ (the integer cut immediately below the calibration-fold dHash P95 of 9) captures 89.40% of the calibration fold and 91.54% of the held-out fold; the operational classifier rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K captures still higher rates in both folds (calibration 92.09%, 41,548 / 45,116; held-out 93.56%, 14,344 / 15,332).
The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity: the random 30% CPA sample evidently contained proportionally more high-replication CPAs.
We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-G.2) is the cross-check that is robust to this fold variance.
@@ -214,25 +237,79 @@ We report a sensitivity check in which this round-number cut is replaced by the
Table XII reports the five-way classifier output under each cut.
<!-- TABLE XII: Classifier Sensitivity to the Operational Cosine Cut (All-Sample Five-Way Output, N = 168,740 signatures)
| Category | cos > 0.95 count (%) | cos > 0.945 count (%) | Δ count |
|--------------------------------------------|----------------------|-----------------------|---------|
| High-confidence non-hand-signed | 76,984 (45.62%) | 79,278 (46.98%) | +2,294 |
| Moderate-confidence non-hand-signed | 43,906 (26.02%) | 50,001 (29.63%) | +6,095 |
| High style consistency | 546 ( 0.32%) | 665 ( 0.39%) | +119 |
| Uncertain | 46,768 (27.72%) | 38,260 (22.67%) | -8,508 |
| Likely hand-signed | 536 ( 0.32%) | 536 ( 0.32%) | +0 |
| Cosine cut | High-confidence | Moderate-confidence | High style consistency | Uncertain | Likely hand-signed |
|------------|-----------------|---------------------|------------------------|-----------|--------------------|
| cos > 0.940 | 81,069 (48.04%) | 55,308 (32.78%) | 801 (0.47%) | 31,026 (18.39%) | 536 (0.32%) |
| cos > 0.945 | 79,278 (46.98%) | 50,001 (29.63%) | 665 (0.39%) | 38,260 (22.67%) | 536 (0.32%) |
| cos > 0.950 (operational) | 76,984 (45.62%) | 43,906 (26.02%) | 546 (0.32%) | 46,768 (27.72%) | 536 (0.32%) |
| cos > 0.960 | 70,250 (41.63%) | 29,450 (17.45%) | 288 (0.17%) | 68,216 (40.43%) | 536 (0.32%) |
| cos > 0.970 | 60,247 (35.70%) | 14,865 ( 8.81%) | 117 (0.07%) | 92,975 (55.10%) | 536 (0.32%) |
| cos > 0.985 | 37,368 (22.15%) | 2,231 ( 1.32%) | 10 (0.01%) | 128,595 (76.21%) | 536 (0.32%) |
The dHash band edges ($\leq 5$ for high-confidence, $5 < \text{dHash}_\text{indep} \leq 15$ for moderate-confidence, $> 15$ for style) are held fixed across the grid; only the cosine cut varies. The Likely-hand-signed count is invariant across the grid because it depends only on the all-pairs KDE crossover cosine $= 0.837$.
-->
At the aggregate firm-level, the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
At the aggregate firm-level, the calibration-fold-adjacent reference dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
The operational classifier rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K captures 92.46% under the 0.95 cut and 93.97% under the 0.945 cut---a shift of 1.51 percentage points.
Reading the wider grid in Table XII: the High-confidence and Moderate-confidence shares shift by less than 5 percentage points across the 0.940-0.950 neighbourhood, while pushing the cosine cut to 0.970 or 0.985 produces qualitatively different classifier behaviour (Moderate-confidence collapses from 26.02% at $0.95$ to 8.81% at $0.97$ and 1.32% at $0.985$, with the displaced mass landing in Uncertain rather than reclassifying out of the corpus).
The classifier output is therefore robust to small (~0.005-cosine) perturbations of the operational cut but not to wholesale reanchoring at the threshold-estimator outputs of Section IV-D, which is consistent with our reading that those outputs are not classifier thresholds.
At the per-signature categorization level, replacing 0.95 by 0.945 reclassifies 8,508 signatures (5.04% of the corpus) out of the Uncertain band; 6,095 of them migrate to Moderate-confidence non-hand-signed, 2,294 to High-confidence non-hand-signed, and 119 to High style consistency.
The Likely-hand-signed category is unaffected because it depends only on the fixed all-pairs KDE crossover cosine $= 0.837$.
The High-confidence non-hand-signed share grows from 45.62% to 46.98%.
We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within a 0.005-cosine neighbourhood of the Firm A P7.5 anchor, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.
The paper therefore retains cos $> 0.95$ as the primary operational cut for transparency (round-number P7.5 of the whole-sample Firm A reference distribution) and reports the 0.945 results as a sensitivity check rather than as a deployed alternative.
To make the operating-point selection (Section III-K) auditable rather than presented as a single fixed value, Table XII-B reports the capture-vs-FAR tradeoff over the candidate threshold grid spanning the calibration-fold P5 (0.9407), its rounded value (0.945), the operational anchor (0.95), the Firm A Beta-2 forced-fit crossing from Section IV-D.3 (0.977), and the BD/McCrary candidate transition from Section IV-D.2 (0.985).
For each grid point we report Firm A capture (under both the cosine-only marginal and the operational dual rule cos $> t$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K), non-Firm-A capture (the cosine-only marginal in the 108,292 non-Firm-A matched signatures), and inter-CPA FAR with Wilson 95% CI against the 50,000-pair anchor of Section IV-F.1.
<!-- TABLE XII-B: Cosine-Threshold Tradeoff: Capture vs Inter-CPA FAR
| Cosine cut t | Firm A capture (cos > t) | Firm A capture (cos > t AND dHash_indep ≤ 15) | Non-Firm-A capture (cos > t) | Inter-CPA FAR | Inter-CPA FAR Wilson 95% CI |
|--------------|--------------------------|------------------------------------------------|------------------------------|---------------|------------------------------|
| 0.9407 (calibration-fold P5) | 95.15% (57,518/60,448) | 95.09% (57,482/60,448) | 72.68% (78,710/108,292) | 0.00126 | [0.00099, 0.00161] |
| 0.945 (calibration-fold P5 rounded) | 94.02% (56,836/60,448) | 93.97% (56,804/60,448) | 67.51% (73,108/108,292) | 0.00082 | [0.00061, 0.00111] |
| 0.95 (whole-sample Firm A P7.5; **operational cut**) | **92.51%** (55,922/60,448) | **92.46%** (55,892/60,448) | 60.50% (65,514/108,292) | **0.00050** | [0.00034, 0.00074] |
| 0.977 (Firm A Beta-2 forced-fit crossing) | 74.53% (45,050/60,448) | 74.51% (45,038/60,448) | 13.14% (14,233/108,292) | 0.00014 | [0.00007, 0.00029] |
| 0.985 (BD/McCrary candidate transition) | 55.27% (33,409/60,448) | 55.26% (33,406/60,448) | 5.73% (6,200/108,292) | 0.00004 | [0.00001, 0.00015] |
Inter-CPA FAR computed against 50,000 i.i.d. inter-CPA pairs (random seed 42, reproducing the anchor of Section IV-F.1 / Table X). Capture and FAR percentages are exact ratios of the displayed integer counts; gap arithmetic in the surrounding prose is computed from those exact counts and rounded to two decimal places. The dual-rule column is the operational classifier rule of Section III-K; for cuts above the dHash-15 saturation point (Firm A dHash$_\text{indep}$ $> 15$ rate is only 0.17%, Table IX), the dual-rule and cosine-only columns coincide to within the dHash$_\text{indep}$ $> 15$ residual.
-->
Reading Table XII-B, three patterns motivate the choice of $0.95$ as the operating point.
First, *Firm A capture* on the operational dual rule decays smoothly from 95.09% at $t = 0.9407$ to 55.26% at $t = 0.985$.
Relaxing the cut from $0.95$ to $0.945$ buys 1.51 percentage points of additional Firm A capture, and to $0.9407$ buys 2.63 percentage points; tightening from $0.95$ to $0.977$ costs 17.96 percentage points and to $0.985$ costs 37.20 percentage points.
The selected cut at $0.95$ is the strictest cut on this grid at which Firm A capture remains above $90\%$ on the operational dual rule.
Second, *inter-CPA FAR* is small in absolute terms across the entire candidate grid ($0.00126$ at $0.9407$, falling to $0.00004$ at $0.985$): under any of these operating points the classifier's specificity against random cross-CPA pairs is in the per-mille range or better, so FAR alone does not determine the choice.
The marginal FAR cost of relaxing from $0.95$ to $0.945$ is $+0.00032$ ($25 \to 41$ false positives per 50,000 pairs) and to $0.9407$ is $+0.00076$ ($25 \to 63$); the marginal FAR savings from tightening to $0.977$ and $0.985$ are $-0.00036$ and $-0.00046$ respectively.
The FAR savings from going stricter are small in absolute terms compared with the corresponding Firm A capture loss, which makes $0.95$ a balanced operating point on this grid rather than a uniquely optimal one.
Third, *non-Firm-A capture* (the cosine-only marginal in the 108,292 non-Firm-A signatures) decays from 67.51% at $0.945$ to 60.50% at $0.95$, 13.14% at $0.977$, and 5.73% at $0.985$.
The Firm-A-minus-non-Firm-A gap widens with strictness through $0.977$ and then contracts (22.41 percentage points at $0.9407$; 26.46 at $0.945$; 31.97 at $0.95$; 61.36 at $0.977$; 49.54 at $0.985$): on the $0.95 \to 0.977$ segment non-Firm-A capture falls faster than Firm A capture in absolute terms ($-47.35$ vs $-17.96$ percentage points), so the widening is dominated by non-Firm-A removal rather than by an intrinsic property of Firm A; on the $0.977 \to 0.985$ segment Firm A capture falls faster than non-Firm-A's already-low residual, so the gap contracts.
We do *not* read the gap pattern as evidence for a particular cut; it is reported here as cross-firm replication heterogeneity rather than as a selection criterion.
The operating point at $0.95$ is therefore a defensible---not unique---selection in this neighbourhood, motivated by (i) keeping Firm A capture above $90\%$ on the operational dual rule, (ii) achieving an FAR of $0.0005$ at which marginal further savings from tightening are small relative to the corresponding capture loss, and (iii) preserving the interpretive transparency of the whole-sample Firm A P7.5 reading.
It is *not* derived from the threshold-estimator outputs of Section IV-D, which the data do not support as classifier thresholds.
The paper therefore retains cos $> 0.95$ as the primary operational cut and reports the 0.945 result of Table XII as a sensitivity check rather than as a deployed alternative; downstream document-level rates (Table XVII) and intra-report agreement (Table XVI) are robust to moderate cutoff shifts within the 0.945--0.95 neighbourhood as long as the same cutoff is applied uniformly across firms.
## G. Additional Firm A Benchmark Validation
Before presenting the three threshold-robust analyses, Fig. 4 summarises the per-firm yearly per-signature best-match cosine distribution that motivates them.
The left panel reports the mean per-signature best-match cosine within each firm bucket and fiscal year (a threshold-free statistic); the right panel reports the share of each firm-bucket-year with per-signature best-match cosine $\geq 0.95$ (the operational cut of Section III-K).
Both panels show Firm A above the other Big-4 firms in every year of the 2013-2023 sample, with non-Big-4 firms below all four Big-4 firms throughout, and the cross-firm ordering is stable across the sample period.
The mean-cosine separation between Firm A and the other Big-4 firms is on the order of 0.02-0.04 throughout the sample (e.g., 2013: Firm A $0.9733$ vs Firm B $0.9498$, Firm C $0.9464$, Firm D $0.9395$, Non-Big-4 $0.9227$; 2023: $0.9860$ vs $0.9668$, $0.9662$, $0.9525$, $0.9346$); the share-above-0.95 separation is wider (2013: Firm A $87.2\%$ vs $61.8\%$, $56.2\%$, $38.5\%$, $27.5\%$).
This visual is the most direct cross-firm evidence in the paper that Firm A's high-similarity behaviour is firm-specific rather than corpus-wide; the three subsections below decompose this gap along three threshold-free or threshold-robust dimensions.
<!-- FIGURE 4: Per-firm yearly per-signature best-match cosine
File: reports/figures/fig_yearly_big4_comparison.png (and .pdf)
Generated by: signature_analysis/30_yearly_big4_comparison.py
Caption: Per-firm yearly per-signature best-match cosine, 2013-2023.
(a) Mean per-signature best-match cosine by firm bucket and fiscal year
(threshold-free). (b) Share of per-signature best-match cosine $\geq 0.95$
(operational cut of Section III-K). Five lines: Firm A, B, C, D, Non-Big-4.
Firm A is above the other Big-4 firms in every year; Non-Big-4 is below all
four Big-4 firms in every year. Per-firm signature counts and exact values
are in `reports/firm_yearly_comparison/firm_yearly_comparison.{json,md}`.
-->
The capture rates of Section IV-E are an *internal* consistency check: they ask "how much of Firm A does our threshold capture?", but the threshold was itself derived from Firm A's percentiles, so a high capture rate is not surprising.
To go beyond this circular check, we report three further analyses, each chosen so that the *informative quantity* does not depend on the threshold's absolute value:
@@ -277,33 +354,38 @@ We test this prediction directly.
For each auditor-year (CPA $\times$ fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
Firm A accounts for 1,287 of these (27.8% baseline share).
Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.
The per-signature best-match cosine underlying each auditor-year mean is taken over the full same-CPA pool (Section III-G) and may match against signatures from other fiscal years, so the auditor-year mean reflects the year's signatures' position within the CPA's full-sample similarity structure rather than purely within-year similarity; a within-year-restricted sensitivity replication is a natural robustness check and is left to future work.
The per-signature best-match cosine underlying each auditor-year mean is taken over the full same-CPA pool (Section III-G), consistent with the unit-of-analysis framing in Section III-G.
<!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)
| Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
|-------|-------------|--------|--------|--------|--------|-----------|--------------|
| 10% | 462 | 443 | 2 | 3 | 0 | 14 | 95.9% |
| 20% | 925 | 877 | 9 | 14 | 2 | 23 | 94.8% |
| 25% | 1,157 | 1,043 | 32 | 23 | 9 | 50 | 90.1% |
| 30% | 1,388 | 1,129 | 105 | 52 | 25 | 77 | 81.3% |
| 50% | 2,314 | 1,220 | 473 | 273 | 102 | 246 | 52.7% |
-->
Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile.
Firm A occupies 95.9% of the top 10%, 94.8% of the top 20%, 90.1% of the top 25%, and 81.3% of the top 30% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of $3.5\times$ at the top decile, $3.4\times$ at the top quintile, and $2.9\times$ at the top tercile.
Firm A's share decays monotonically as the bracket widens (95.9% $\to$ 94.8% $\to$ 90.1% $\to$ 81.3% $\to$ 52.7% across top-10/20/25/30/50%), and only at the top 50% does its share approach its baseline; the over-representation is therefore concentrated in the very top of the distribution rather than spread uniformly through the upper half.
Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.
<!-- TABLE XV: Firm A Share of Top-10% Similarity by Year
| Year | N auditor-years | Top-10% k | Firm A in top-10% | Firm A share | Firm A baseline |
|------|-----------------|-----------|-------------------|--------------|-----------------|
| 2013 | 324 | 32 | 32 | 100.0% | 32.4% |
| 2014 | 399 | 39 | 39 | 100.0% | 27.8% |
| 2015 | 394 | 39 | 38 | 97.4% | 27.7% |
| 2016 | 413 | 41 | 39 | 95.1% | 26.2% |
| 2017 | 415 | 41 | 41 | 100.0% | 27.2% |
| 2018 | 434 | 43 | 43 | 100.0% | 26.5% |
| 2019 | 429 | 42 | 42 | 100.0% | 27.0% |
| 2020 | 430 | 43 | 38 | 88.4% | 27.7% |
| 2021 | 450 | 45 | 44 | 97.8% | 28.7% |
| 2022 | 467 | 46 | 43 | 93.5% | 28.3% |
| 2023 | 474 | 47 | 46 | 97.9% | 27.4% |
<!-- TABLE XV: Firm A Share of Top-K Similarity by Year (K = 10%, 20%, 30%)
| Year | N auditor-years | Top-10% share | Top-20% share | Top-30% share | Firm A baseline |
|------|-----------------|---------------|---------------|---------------|-----------------|
| 2013 | 324 | 100.0% (32/32) | 98.4% (63/64) | 89.7% (87/97) | 32.4% |
| 2014 | 399 | 100.0% (39/39) | 98.7% (78/79) | 82.4% (98/119) | 27.8% |
| 2015 | 394 | 97.4% (38/39) | 96.2% (75/78) | 84.7% (100/118) | 27.7% |
| 2016 | 413 | 95.1% (39/41) | 96.3% (79/82) | 81.3% (100/123) | 26.2% |
| 2017 | 415 | 100.0% (41/41) | 97.6% (81/83) | 83.9% (104/124) | 27.2% |
| 2018 | 434 | 100.0% (43/43) | 97.7% (84/86) | 80.0% (104/130) | 26.5% |
| 2019 | 429 | 100.0% (42/42) | 97.6% (83/85) | 78.9% (101/128) | 27.0% |
| 2020 | 430 | 88.4% (38/43) | 91.9% (79/86) | 76.0% (98/129) | 27.7% |
| 2021 | 450 | 97.8% (44/45) | 96.7% (87/90) | 81.5% (110/135) | 28.7% |
| 2022 | 467 | 93.5% (43/46) | 95.7% (89/93) | 84.3% (118/140) | 28.3% |
| 2023 | 474 | 97.9% (46/47) | 94.7% (89/94) | 83.8% (119/142) | 27.4% |
Per-cell entries are "share (k_FirmA / k_total)". Top-25% and top-50% pooled values are reported in Table XIV; per-year top-25/50 columns are omitted from this table to reduce visual width but are reproducible from the supplementary materials.
-->
This over-representation is consistent with firm-wide non-hand-signing practice at Firm A and is not derived from any threshold we subsequently calibrate.
@@ -339,8 +421,7 @@ We note that this test uses the calibrated classifier of Section III-K rather th
## H. Classification Results
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents have no signature whose extracted handwriting could be matched to a registered CPA name (every such signature has `assigned_accountant IS NULL` in the database, typically because the auditor's report page deviates from the standard two-signature layout or the OCRed printed CPA name was not present in the registry); the per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity exists, and these documents are therefore excluded from the classification reported here.
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents (656 documents excluded from the 85,042-document YOLO-detection cohort because no signature on the document could be matched to a registered CPA; see Table XVII note).
We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
@@ -354,6 +435,7 @@ Document-level rates therefore represent the share of reports in which *at least
| Likely hand-signed | 47 | 0.1% | 4 | 0.0% |
Per the worst-case aggregation rule of Section III-K, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
The 84,386-document cohort excludes 656 documents (relative to the 85,042 YOLO-detected cohort of Table III) for which no signature could be matched to a registered CPA: the per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity is defined. The exclusion is definitional rather than discretionary; typical causes are auditor's-report-page formats deviating from the standard two-signature layout, or OCR returning a printed CPA name not present in the registry.
-->
Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
@@ -366,7 +448,7 @@ A cosine-only classifier would treat all 71,656 identically; the dual-descriptor
96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the within-firm heterogeneity implied by the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 7.5% signature-level left tail (Section III-H).
The near-zero "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 count here is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset in Table XVI by 4 reports) indicates that the within-firm heterogeneity implied by the 7.5% signature-level left tail (Section IV-D) does not project into the lowest-cosine document-level category under the dual-descriptor rules; it is absorbed instead into the uncertain or high-style-consistency categories at this threshold set.
The near-zero "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 denominator is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset of Table XVI by 4 mixed-firm reports excluded from the firm-level intra-report comparison) indicates that the within-firm heterogeneity implied by the 7.5% signature-level left tail (Section IV-D) does not project into the lowest-cosine document-level category under the dual-descriptor rules; it is absorbed instead into the uncertain or high-style-consistency categories at this threshold set.
We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-F.2 is the corresponding external check.
### 2) Cross-Firm Comparison of Dual-Descriptor Convergence
@@ -374,7 +456,7 @@ We note that because the non-hand-signed thresholds are themselves calibrated to
Among the 65,514 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,922 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.
The Firm A denominator (55,922) matches Table IX exactly: both Table IX and the cross-firm decomposition define Firm A membership via the CPA registry (`accountants.firm`), and the cross-firm analysis additionally requires a non-null independent-min dHash record, which all 55,922 Firm A cosine-eligible signatures have in the current database.
This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-G.2) and intra-report consistency (Section IV-G.3) findings.
Counts and percentages are reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (see Appendix B for the table-to-script provenance map).
Reproduction artifact for these counts is listed in Appendix B.
## I. Ablation Study: Feature Backbone Comparison