pdf_signature_extraction/paper/paper_a_results_v3.md
gbanyan 12637cd413 Phase 6 manuscript splice (2/2): §IV / §V / §VI spliced
Lands v4.0 §IV / §V / §VI content into v3.20.0 master sub-files.
Strips internal close-out checklists, draft notes, and open-questions
blocks at splice. Completes the Phase 6 manuscript-master file
assembly.

§IV Results (paper_a_results_v3.md):
- §IV-A..C: kept v3.20.0 inherited content (experimental setup,
  detection performance, all-pairs distribution); added v4 scope
  note (Big-4 primary) at the §IV header
- §IV-D..K: replaced v3.20.0 §IV-D..H with v4.0 §IV-D..K (Big-4
  distributional / mixture / convergence / LOOO / pixel-identity /
  inter-CPA reference / five-way classification / full-dataset
  robustness)
- §IV-L: renumbered v3.20.0 §IV-I (backbone ablation) content to
  match v4's "§IV-L inherited from v3.20.0 §IV-I" reframing
- §IV-M: appended v4.0 ICCR calibration tables (XX-XXVI):
  composition decomposition, per-comparison/per-signature/
  per-document ICCRs, firm heterogeneity + cross-firm hit matrix,
  alert-rate sensitivity
- §III-K ablation cross-ref updated to §IV-L (was §IV-I)
- Phase 3 close-out checklist (lines 365+) stripped

§V Discussion (paper_a_discussion_v3.md):
- Replaced v3.20.0 §V with v4.0 §V (8 sub-sections A-H):
  A. Distinct problem framing
  B. Continuous quality spectrum + composition-driven multimodality
  C. Firm A as templated end (case study, not anchor)
  D. K=2 / K=3 descriptive partitions
  E. Three-score convergent internal-consistency
  F. Anchor-based multi-level calibration
  G. Pixel-identity hard positive anchor + ICCR reframing
  H. Limitations (14 items: 9 v4-specific + 5 inherited from v3.x)

§VI Conclusion (paper_a_conclusion_v3.md):
- Replaced v3.20.0 §VI with v4.0 §VI (8 contribution items mirroring
  §I contributions; 4-direction future work).

Known splice-time issue (deferred to typesetting): §IV table numbering
is sequential by label (V, VI, ..., XXVI) but Table XIX (document-level
worst-case) appears physically before Tables XVI/XVII/XVIII in §IV-J
narrative flow. IEEE Access typesetters typically normalize table order
during typesetting; we accept the in-file ordering quirk to preserve
the §IV-J narrative arc (per-signature -> document-level worst-case ->
K=3 cross-tab). Renumbering to strictly-ascending physical order would
require renaming Tables XVI/XVII/XVIII -> XVII/XVIII/XIX with
downstream cross-reference updates; deferred unless partner Jimmy
review or IEEE Access submission portal flags it.

Manuscript splice complete. Working drafts in paper/v4/ retained as
archive of the round-by-round Phase 5 fix history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 18:43:41 +08:00


IV. Experiments and Results

The v4.0 primary analyses (§IV-D through §IV-J) are scoped to the Big-4 sub-corpus (Firms A through D, $n = 437$ CPAs with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available), per the methodology choice articulated in §III-G. §IV-K (Full-Dataset Robustness) reports the full-dataset (686 CPAs) variant of the K=3 mixture + Paper A box-rule Spearman analysis as a cross-scope robustness check. §IV-A through §IV-C report inherited corpus-wide v3.x material; §IV-L (feature backbone ablation) is also inherited. §IV-M consolidates the v4-new anchor-based ICCR calibration tables.

A. Experimental Setup

Experiments used mixed hardware. YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an NVIDIA RTX 4090 (CUDA). The downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration. Feature extraction used PyTorch 2.9 with torchvision model implementations. The complete pipeline, from raw PDF processing through final classification, was implemented in Python. Because all steps rely on deterministic forward inference over fixed pre-trained weights (no fine-tuning) plus fixed-seed numerical procedures, reported results are platform-independent to within floating-point precision.

B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in all §IV v4 primary analyses (§IV-D through §IV-J).

C. All-Pairs Intra-vs-Inter Class Distribution Analysis

Fig. 2 presents the cosine similarity distributions computed over the full set of pairwise comparisons under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L). Table IV summarizes the distributional statistics.

Both distributions are left-skewed and leptokurtic. Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent threshold-estimator outputs reported in Section IV-D are derived via the methods of Section III-I to avoid single-family distributional assumptions.

The KDE crossover, the point where the two density functions intersect, was located at 0.837 (Table V). Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes. Statistical tests confirmed significant separation between the two distributions (Cohen's d = 0.669, Mann-Whitney [36] p < 0.001, K-S 2-sample p < 0.001).
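As a minimal sketch of how such a crossover can be located, the block below builds two Gaussian kernel density estimates and scans the interval between the class means for the point where the densities intersect. The function names, the Silverman-style bandwidth, and the grid scan are illustrative assumptions; the pipeline's actual KDE settings are not restated in this section.

```python
import math
from statistics import mean, stdev

def gaussian_kde(sample, bandwidth=None):
    """Return a density function for `sample` (Gaussian kernel).

    The default bandwidth is a Silverman-style rule of thumb; this is an
    assumption of the sketch, not the paper's documented setting.
    """
    n = len(sample)
    h = bandwidth if bandwidth is not None else 1.06 * stdev(sample) * n ** (-1 / 5)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)

    return density

def kde_crossover(intra, inter, steps=2000):
    """Scan between the two class means for the first sign change of
    f_intra(x) - f_inter(x); return the interpolated crossing or None."""
    f_intra, f_inter = gaussian_kde(intra), gaussian_kde(inter)
    lo, hi = sorted((mean(inter), mean(intra)))
    prev_x, prev_diff = lo, f_intra(lo) - f_inter(lo)
    for i in range(1, steps + 1):
        if prev_diff == 0:
            return prev_x
        x = lo + (hi - lo) * i / steps
        diff = f_intra(x) - f_inter(x)
        if prev_diff * diff < 0:
            # Linear interpolation between the two grid points.
            return prev_x + (x - prev_x) * prev_diff / (prev_diff - diff)
        prev_x, prev_diff = x, diff
    return None

# Toy, mirror-symmetric samples (not corpus data): the crossing sits at the midpoint.
toy_inter = [0.1, 0.2, 0.3, 0.4, 0.5]
toy_intra = [0.5, 0.6, 0.7, 0.8, 0.9]
crossing = kde_crossover(toy_intra, toy_inter)
```

With equal priors and equal costs, classifying by whichever density is larger switches decision exactly at this point, which is why the crossover serves as the approximate Bayes-optimal boundary.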

We emphasize that pairwise observations are not independent (the same signature participates in multiple pairs), so the nominal pair count overstates the effective sample size and renders $p$-values unreliable as measures of evidence strength. We therefore rely primarily on Cohen's d as an effect-size measure that is less sensitive to sample size. A Cohen's d of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
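For reference, a minimal sketch of the effect-size computation follows. It assumes the common pooled-standard-deviation form of Cohen's d; the section does not state which variant the analysis scripts use, and the function name is illustrative.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Toy samples (not corpus data): means 4 and 3, both sample variances 4.
toy_d = cohens_d([2, 4, 6], [1, 3, 5])
```

Because d is a ratio of a mean difference to a spread, it does not shrink or grow merely because more pairs are added, which is the property relied on above.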

D. Big-4 Accountant-Level Distributional Characterisation

This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. All numbers below are direct re-statements from Scripts 32 / 34. The accountant-level dip-test rejection reported in Table V is, per §III-I.4 (Scripts 39b through 39e), fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the v4-new composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration.

Table V. Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32).

| Population | $n$ CPAs | $p_{\text{cos}}$ | $p_{\text{dHash}}$ | Interpretation |
|---|---|---|---|---|
| Big-4 pooled (primary) | 437 | $< 5 \times 10^{-4}$ | $< 5 \times 10^{-4}$ | reject unimodality on both axes |
| Firm A pooled alone | 171 | 0.992 | 0.924 | unimodal |
| Firms B + C + D pooled | 266 | 0.998 | 0.906 | unimodal |
| All non-Firm-A pooled | 515 | 0.998 | 0.907 | unimodal |

Bootstrap implementation: $n_{\text{boot}} = 2000$; for the Big-4 cells, no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by the bootstrap resolution $1 / 2000 = 5 \times 10^{-4}$ (Script 34 reports this as $p = 0.0000$; we report $p < 5 \times 10^{-4}$ to reflect the resolution). Single-firm dip statistics for Firms B, C, and D were not separately computed.
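The resolution argument can be illustrated with a short sketch. The helper below is hypothetical (it is not Script 34) and takes the observed statistic and the bootstrap replicates as plain numbers; the dip statistic itself is not implemented here.

```python
def bootstrap_p_value(observed, replicates):
    """Empirical p-value from bootstrap replicates of a test statistic.

    When no replicate reaches the observed value, the estimate is 0 and
    can only be reported as bounded above by the resolution 1 / n_boot.
    """
    n_boot = len(replicates)
    exceed = sum(1 for r in replicates if r >= observed)
    p = exceed / n_boot
    label = f"p < {1 / n_boot:g}" if exceed == 0 else f"p = {p:g}"
    return p, label

# Toy replicates (not corpus data): none reaches the observed statistic.
p_hat, reported = bootstrap_p_value(0.10, [0.01] * 2000)
```

Reporting the bound rather than a literal zero avoids implying more precision than 2,000 replicates can support.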

Table VI. Burgstahler-Dichev / McCrary density-smoothness diagnostic on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; $\alpha = 0.05$, two-sided).

| Population | Cosine: significant transition? | dHash: significant transition? |
|---|---|---|
| Big-4 pooled (primary) | none ($p > 0.05$) | none ($p > 0.05$) |
| Firm A pooled alone | none | none |
| Firms B + C + D pooled | none | one transition at $\overline{\text{dHash}} = 10.8$ |
| All non-Firm-A pooled | none | one transition at $\overline{\text{dHash}} = 6.6$ |

The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). These off-Big-4 dHash transitions are scope-dependent and are not used as v4.0 operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes.

E. Big-4 K=2 / K=3 Mixture Fits

This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings.

Table VII. Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-J) and marginal-crossing bootstrap 95% confidence intervals.

| K=2 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight |
|---|---|---|---|
| K=2-a (low-cos / high-dHash position) | 0.954 | 7.14 | 0.689 |
| K=2-b (high-cos / low-dHash position) | 0.983 | 2.41 | 0.311 |

Marginal crossings (point + bootstrap 95% CI, $n_{\text{boot}} = 500$):

| Axis | Point | Bootstrap median | 95% CI | CI half-width |
|---|---|---|---|---|
| cos | 0.9755 | 0.9754 | [0.9742, 0.9772] | 0.0015 |
| dHash | 3.755 | 3.763 | [3.476, 3.969] | 0.246 |

$\text{BIC}(K{=}2) = -1108.45$ (Script 34).
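To make the notion of a marginal crossing concrete, the sketch below finds the point between two component means where their weighted one-dimensional Gaussian densities are equal, using bisection. This definition of the crossing, the function names, and the toy parameters are assumptions for illustration; the component variances are not reported in Table VII, so no attempt is made to reproduce the tabulated crossings.

```python
import math

def weighted_normal_pdf(x, weight, mu, sigma):
    """Weighted 1D Gaussian density of one mixture component."""
    return weight * math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def marginal_crossing(comp_a, comp_b, tol=1e-10):
    """Bisection for the crossing of two weighted marginals.

    Each component is a (weight, mean, sigma) tuple. Raises ValueError
    when the density difference does not change sign between the means.
    """
    (wa, ma, sa), (wb, mb, sb) = sorted([comp_a, comp_b], key=lambda c: c[1])

    def diff(x):
        return weighted_normal_pdf(x, wa, ma, sa) - weighted_normal_pdf(x, wb, mb, sb)

    lo, hi = ma, mb
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("no sign change between the component means")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Toy symmetric components (hypothetical values): the crossing is the midpoint.
toy_crossing = marginal_crossing((0.5, 0.0, 1.0), (0.5, 2.0, 1.0))
```

A bootstrap confidence interval such as the one in Table VII would repeat the fit and this crossing search on resampled data and take percentiles of the resulting crossings.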

Table VIII. Big-4 K=3 mixture components (descriptive firm-compositional partition per §III-J; not mechanism clusters).

| K=3 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position |
|---|---|---|---|---|
| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner |
| C2 | 0.9558 | 6.66 | 0.536 | central region |
| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner |

$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by 3.48 (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.

F. Convergent Internal-Consistency Checks

This section reports the empirical evidence for §III-K's three-score internal-consistency analysis. We re-emphasise the §III-K caveat: the three scores are deterministic functions of the same per-CPA descriptor pair $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ and are not statistically independent measurements. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth.

Table IX. Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, n = 437.

| Score pair | Spearman $\rho$ | $p$-value |
|---|---|---|
| K=3 P(C1) vs Paper A box-rule less-replication-dominated rate | $+0.9627$ | $< 10^{-248}$ |
| Reverse-anchor cosine percentile vs Paper A box-rule less-replication-dominated rate | $+0.8890$ | $< 10^{-149}$ |
| K=3 P(C1) vs Reverse-anchor cosine percentile | $+0.8794$ | $< 10^{-142}$ |

(Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on $n = 249$ non-Big-4 CPAs; reference centre $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$.
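The rank correlations in Table IX follow the standard Spearman construction: rank each score across CPAs, then take the Pearson correlation of the ranks. The sketch below is a dependency-free illustration with average ranks for ties; the function names are illustrative and it is not a reproduction of Script 38.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy per-CPA scores (not corpus data): a perfectly monotone relation.
toy_rho = spearman_rho([0.1, 0.4, 0.2, 0.9], [10, 30, 20, 70])
```

Because only ranks enter the statistic, any monotone transformation of a score leaves the correlation unchanged, which is why deterministic functions of the same descriptor pair can correlate this strongly without constituting independent evidence.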

Table X. Per-firm summary across the three feature-derived scores, Big-4.

| Firm | $n$ CPAs | mean $P(\text{C1})$ | mean reverse-anchor score | mean Paper A less-replication-dominated rate |
|---|---|---|---|---|
| Firm A | 171 | 0.0072 | $-0.9726$ | 0.1935 |
| Firm B | 112 | 0.1410 | $-0.8201$ | 0.6962 |
| Firm C | 102 | 0.3110 | $-0.7672$ | 0.7896 |
| Firm D | 52 | 0.2406 | $-0.7125$ | 0.7608 |

(Source: Script 38 per-firm summary; reverse-anchor score is sign-flipped so that higher values indicate deeper into the reference left tail = less replication-dominated relative to the non-Big-4 reference.)

The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A firms as less replication-dominated. The K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Score 1 and Score 3) place Firm C at the least-replication-dominated end of Big-4; the reverse-anchor cosine percentile (Score 2) ranks Firm D fractionally above Firm C. This residual within-Big-4-non-A disagreement is a design feature of the reverse-anchor metric: Score 2 measures only the marginal cosine percentile under the non-Big-4 reference, so a firm with a slightly higher cosine but a markedly different dHash distribution (Firm D vs Firm C) can score higher on Score 2 while scoring lower on Scores 1 and 3, both of which use both descriptors.

Table XI. Per-signature Cohen $\kappa$ (binary collapse, replication-dominated vs less-replication-dominated), $n = 150{,}442$ Big-4 signatures.

| Pair | Cohen $\kappa$ |
|---|---|
| Paper A binary high-confidence box rule (cos > 0.95 AND dHash $\leq$ 5) vs per-CPA K=3 hard label | 0.662 |
| Paper A binary high-confidence box rule vs per-signature K=3 hard label | 0.559 |
| Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 |

(Source: Script 39; verdict label SIG_CONVERGENCE_MODERATE.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: (0.928, 9.75, 0.146) / (0.963, 6.04, 0.582) / (0.989, 1.27, 0.272), an absolute cosine drift of 0.018 in C1 and 0.006 in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos > 0.95 AND dHash $\leq$ 5); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) inherits its v3.x calibration and capture-rate evaluation (§IV-J).
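As a reminder of what the agreement statistic in Table XI measures, the following sketch computes Cohen's kappa for two label vectors as observed agreement corrected for the agreement expected by chance from the marginal label frequencies. The function name and toy labels are illustrative; this is not Script 39.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Undefined (division by zero) when chance agreement equals 1, i.e.
    when both raters use a single identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Toy binary labels (1 = replication-dominated); not corpus data.
toy_kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In the toy case the raters agree on 3 of 4 items while chance agreement is 0.5, giving a kappa of 0.5.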

G. Leave-One-Firm-Out Reproducibility

This section reports the firm-level cross-validation evidence motivating §III-J's "K=3 descriptive, not operational" framing.

Table XII. K=2 leave-one-firm-out across the four Big-4 folds.

| Held-out firm | $n_{\text{train}}$ | $n_{\text{held}}$ | Fold rule (cos cut, dHash cut) | Held-out classified as templated by fold rule |
|---|---|---|---|---|
| Firm A | 266 | 171 | cos > 0.9380 AND dHash $\leq$ 8.79 | 171 / 171 = 100.00% (95% Wilson [97.80%, 100.00%]) |
| Firm B | 325 | 112 | cos > 0.9744 AND dHash $\leq$ 3.98 | 0 / 112 = 0% (95% Wilson [0%, 3.32%]) |
| Firm C | 335 | 102 | cos > 0.9752 AND dHash $\leq$ 3.75 | 0 / 102 = 0% (95% Wilson [0%, 3.63%]) |
| Firm D | 385 | 52 | cos > 0.9756 AND dHash $\leq$ 3.74 | 0 / 52 = 0% (95% Wilson [0%, 6.88%]) |

(Source: Script 36.) Across-fold cosine crossing: pairwise range [0.9380, 0.9756], range = 0.0376; max absolute deviation from the across-fold mean is 0.028. This exceeds the report's 0.005 across-fold stability tolerance by $5.6\times$ and is much larger than the full-Big-4 bootstrap CI half-width of 0.0015. Together with the all-or-nothing held-out classification pattern (Firm A held out $\Rightarrow$ all held-out CPAs templated; any non-Firm-A firm held out $\Rightarrow$ none templated), this indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
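The across-fold figures quoted above follow directly from the four cosine cuts in Table XII. The short check below recomputes them; the variable names are illustrative, and the 0.005 tolerance is the value stated in the text.

```python
# Fold-specific cosine cuts from Table XII (held-out firm -> cut).
fold_cos_cuts = {"Firm A": 0.9380, "Firm B": 0.9744, "Firm C": 0.9752, "Firm D": 0.9756}
tolerance = 0.005  # across-fold stability tolerance stated in the text

cuts = list(fold_cos_cuts.values())
mean_cut = sum(cuts) / len(cuts)
span = max(cuts) - min(cuts)                    # pairwise range width
max_dev = max(abs(c - mean_cut) for c in cuts)  # largest deviation from the mean
ratio = max_dev / tolerance                     # multiples of the tolerance

print(f"mean={mean_cut:.4f} span={span:.4f} max_dev={max_dev:.3f} ratio={ratio:.1f}x")
```

The largest deviation comes from the Firm A held-out fold, the only fold whose training set excludes Firm A.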

Table XIII. K=3 leave-one-firm-out: C1 component shape and held-out membership.

| Held-out firm | C1 cos (fit) | C1 dHash (fit) | C1 weight (fit) | Held-out C1 hard-label rate | Full-Big-4 baseline C1% | Absolute difference |
|---|---|---|---|---|---|---|
| Full-Big-4 baseline | 0.9457 | 9.17 | 0.143 | | | |
| Firm A held out | 0.9425 | 10.13 | 0.145 | 4.68% | 0.00% | 4.68 pp |
| Firm B held out | 0.9441 | 9.16 | 0.127 | 7.14% | 8.93% | 1.76 pp |
| Firm C held out | 0.9504 | 8.41 | 0.126 | 36.27% | 23.53% | 12.77 pp |
| Firm D held out | 0.9439 | 9.29 | 0.120 | 17.31% | 11.54% | 5.81 pp |

(Source: Script 37; verdict label P2_PARTIAL.) Component shape is reproducible across folds: max deviation of C1 cosine = 0.005, C1 dHash = 0.96, C1 weight = 0.023. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is 12.77 pp at the Firm C held-out fold, exceeding the report's 5 pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).

H. Pixel-Identity Positive-Anchor Miss Rate

This section reports the only hard-ground-truth subset analysis available in the corpus: the positive-anchor miss rate against n = 262 Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the replicated class. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4).

Table XIV. Positive-anchor miss rate, n = 262 Big-4 byte-identical signatures.

| Classifier | Misclassified as less-replication-dominated | Miss rate | Wilson 95% CI |
|---|---|---|---|
| Paper A binary high-confidence box rule (cos > 0.95 AND dHash $\leq$ 5) | 0 / 262 | 0% | [0%, 1.45%] |
| K=3 per-CPA hard label (C3 = high-cos / low-dHash; descriptive) | 0 / 262 | 0% | [0%, 1.45%] |
| Reverse-anchor (prevalence-calibrated cut) | 0 / 262 | 0% | [0%, 1.45%] |

(Source: Script 40.) Per-firm breakdown of the byte-identical subset: Firm A 145; Firm B 8; Firm C 107; Firm D 2. All three candidate scores correctly assign every byte-identical signature to the replicated class.
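The Wilson intervals reported for zero-miss and all-hit cells (Tables XII and XIV) can be checked with the standard Wilson score formula. The sketch below is a generic implementation with an illustrative function name, not an excerpt from Script 40.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Two-sided Wilson score interval for a binomial proportion.

    z defaults to the normal quantile for a 95% interval.
    """
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# 0 misses out of 262 byte-identical signatures (Table XIV).
miss_lo, miss_hi = wilson_interval(0, 262)
# 171 of 171 held-out Firm A CPAs classified as templated (Table XII).
hit_lo, hit_hi = wilson_interval(171, 171)
```

Unlike the normal-approximation interval, the Wilson interval stays informative at observed proportions of exactly 0 or 1, which is the situation in every cell of Table XIV.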

We caution that for the Paper A box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region); v3.20.0 §V-F discusses this conservative-subset caveat at length and we retain that discussion. The reverse-anchor cut is chosen by prevalence calibration against the inherited box rule's overall replicated rate of 49.58% across Big-4 signatures; this is a documented v4.0 limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.

I. Inter-CPA Pair-Level Coincidence Rate (Big-4 spike + inherited corpus-wide)

The signature-level inter-CPA pair-level coincidence-rate analysis (reported in v3.20.0 §IV-F.1, Table X as "FAR") is inherited and extended in v4.0. v4.0 retroactively reframes the metric as inter-CPA pair-level coincidence rate (ICCR) rather than "False Acceptance Rate" because the corpus does not provide signature-level ground-truth negative labels; the inter-CPA negative-anchor assumption underpinning the metric is itself partially violated by within-firm cross-CPA template-like collision structures (§III-L.4). The v3.20.0 corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs reported a per-comparison rate of 0.0005 (Wilson 95% CI [0.0003, 0.0007]) at the cosine cut 0.95.

v4.0 additionally reports the §III-L.1 Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs; Script 40b), which replicates and extends the v3 result and adds the structural dimension (dHash) and joint-rule rates. The §III-L.1 numbers are referenced rather than duplicated here; the consolidated v4-new ICCR calibration appears in §IV-M Tables XXI through XXVI.

J. Five-Way Per-Signature + Document-Level Classification Output

This section reports the §III-L five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. The five-way category definitions are inherited unchanged from v3.20.0 §III-K (now §III-L); see §III-L for the cosine and dHash cuts.

Table XV. Five-way per-signature category counts, Big-4 sub-corpus, $n = 150{,}442$ classified.

| Category | Long name | $n$ signatures | % of classified |
|---|---|---|---|
| HC | High-confidence non-hand-signed | 74,593 | 49.58% |
| MC | Moderate-confidence non-hand-signed | 39,817 | 26.47% |
| HSC | High style consistency | 314 | 0.21% |
| UN | Uncertain | 35,480 | 23.58% |
| LH | Likely hand-signed | 238 | 0.16% |

(Source: Script 42; 11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded. The 150,442 vs 150,453 distinction (descriptor-complete vs vector-complete) recurs across §IV. Descriptor-complete analyses (§IV-D through §IV-J, all using accountant-level aggregates or per-signature category counts derived from the same 150,442-signature substrate) use $n = 150{,}442$. Vector- or pair-recomputed analyses (§IV-M.2 Table XXI, §IV-M.3 Table XXII, §IV-M.5 Tables XXIV and XXV; Scripts 40b, 43, 44) use $n = 150{,}453$ because their pair- or pool-level computations load all vector-complete signatures, including those failing the descriptor-complete filter. See §III-G for the sample-size reconciliation.)

Per-firm five-way breakdown (% within firm).

| Firm | HC | MC | HSC | UN | LH | total signatures |
|---|---|---|---|---|---|---|
| Firm A | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 |
| Firm B | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 |
| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |

(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVI: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVI). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).

Document-level worst-case aggregation. Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.20.0 worst-case rule (HC > MC > HSC > UN > LH; §III-L). v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset).
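The worst-case rule can be sketched in a few lines: each document takes the highest-priority category found among its signatures, using the ordering HC > MC > HSC > UN > LH stated above. The function and constant names are illustrative and not taken from Script 42.

```python
# Priority order of the worst-case rule, highest priority first.
PRIORITY = ["HC", "MC", "HSC", "UN", "LH"]
RANK = {category: position for position, category in enumerate(PRIORITY)}

def document_label(signature_labels):
    """Return the highest-priority category among a document's signatures.

    Raises ValueError for a document with no classified signatures.
    """
    return min(signature_labels, key=RANK.__getitem__)

# Toy documents (not corpus data), each with two certifying-CPA signatures.
labels = [document_label(doc) for doc in (["UN", "HC"], ["LH", "UN"], ["HSC", "MC"])]
```

Under this rule a single high-confidence signature determines the document label, which is why the document-level HC share (Table XIX) exceeds the per-signature HC share (Table XV).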

Table XIX. Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs.

| Category | Long name | $n$ documents | % |
|---|---|---|---|
| HC | High-confidence non-hand-signed | 46,857 | 62.28% |
| MC | Moderate-confidence non-hand-signed | 19,667 | 26.14% |
| HSC | High style consistency | 167 | 0.22% |
| UN | Uncertain | 8,524 | 11.33% |
| LH | Likely hand-signed | 18 | 0.02% |

(Source: Script 42 document-level table; 379 of 75,233 PDFs carried signatures from more than one Big-4 firm. These mixed-firm PDFs are pooled into the overall counts here but excluded from the single-firm per-firm breakdown below.)

Per-firm document-level breakdown (single-firm PDFs only).

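| Firm | HC | MC | HSC | UN | LH | total docs |
|---|---|---|---|---|---|---|
| Firm A | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 |
| Firm B | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 |
| Firm C | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 |
| Firm D | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 |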

(Source: Script 42; mixed-firm PDFs n = 379 excluded from the per-firm rows but included in the overall counts above.)

The five-way moderate-confidence non-hand-signed band (cos > 0.95 AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is not separately validated by Scripts 38 through 40, which evaluated only the binary high-confidence rule (cos > 0.95 AND dHash $\leq$ 5). v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The v3.20.0 capture-rate calibration evidence for the moderate band (v3.20.0 Tables IX, XI, XII, XII-B) is carried into v4.0 by reference and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
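The two bands whose cuts are restated in this section can be written as a small rule. The sketch below covers only the HC and MC bands; the HSC, UN, and LH definitions live in §III-L and are deliberately left as a single placeholder label rather than guessed. The function name is illustrative.

```python
def box_rule_band(cos, dhash):
    """Assign the HC or MC band from a signature's best-match descriptors.

    HC: cos > 0.95 and dHash <= 5.
    MC: cos > 0.95 and 5 < dHash <= 15.
    Everything else is returned as "other" (HSC / UN / LH are defined in
    §III-L and are not reconstructed in this sketch).
    """
    if cos > 0.95 and dhash <= 5:
        return "HC"
    if cos > 0.95 and 5 < dhash <= 15:
        return "MC"
    return "other"

# Toy descriptor pairs (not corpus data).
bands = [box_rule_band(c, d) for c, d in [(0.99, 1), (0.97, 9), (0.97, 16), (0.90, 2)]]
```

The binary high-confidence rule evaluated in §IV-F and §IV-H corresponds to testing whether this function returns "HC".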

Table XVI. Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.

| Firm | $n$ | C1 (low-cos / high-dHash) | C2 (central) | C3 (high-cos / low-dHash) | C1 % | C3 % |
|---|---|---|---|---|---|---|
| Firm A | 171 | 0 | 30 | 141 | 0.00% | 82.46% |
| Firm B | 112 | 10 | 102 | 0 | 8.93% | 0.00% |
| Firm C | 102 | 24 | 77 | 1 | 23.53% | 0.98% |
| Firm D | 52 | 6 | 45 | 1 | 11.54% | 1.92% |

(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction 23.5%); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).

Document-level worst-case aggregation outputs are reported in Table XIX above.

K. Full-Dataset Robustness (light scope)

This section reports the v4.0 reproducibility cross-check at the full accountant scope (n = 686 CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + Paper A operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the v4.0 K=3 + Paper A convergence reproduces at the wider scope. The §III-L five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band is documented as inherited from v3.x calibration in §IV-J.

Table XVII. K=3 component comparison, Big-4 sub-corpus vs full dataset.

| K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full |
|---|---|---|---|
| C1 (low-cos / high-dHash) | 0.9457 / 9.17 / 0.143 | 0.9278 / 11.17 / 0.284 | $\lvert\Delta\rvert$ cos 0.018, dHash 1.99, wt 0.141 |
| C2 (central) | 0.9558 / 6.66 / 0.536 | 0.9535 / 6.99 / 0.512 | $\lvert\Delta\rvert$ cos 0.002, dHash 0.33, wt 0.024 |
| C3 (high-cos / low-dHash) | 0.9826 / 2.41 / 0.321 | 0.9826 / 2.40 / 0.205 | $\lvert\Delta\rvert$ cos 0.000, dHash 0.01, wt 0.117 |

(Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.)

Table XVIII. Spearman rank correlation between K=3 P(C1) and Paper A operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset.

| Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs Paper A less-replication-dominated rate) | $p$-value |
|---|---|---|---|
| Big-4 (primary) | 437 | $+0.9627$ | $< 10^{-248}$ |
| Full dataset | 686 | $+0.9558$ | $< 10^{-300}$ |
| $\lvert\rho_{\text{full}} - \rho_{\text{Big-4}}\rvert$ | | 0.0069 | |

(Source: Script 41.)

Reading. The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the Paper A box-rule less-replication-dominated rate are preserved at the full scope. Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight 0.117 as the full population includes more non-templated CPAs (mid/small firms); C1 (low-cos / high-dHash) gains weight 0.141 and shifts to lower cosine and higher dHash (centre (0.928, 11.17) vs Big-4 (0.946, 9.17)) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + Paper A convergence is not a Big-4-specific artefact; we do not read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the v4.0 primary methodology is restricted to Big-4 by design (§III-G item 4).

L. Ablation Study: Feature Backbone Comparison

To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. The comparison summary is inherited unchanged from the v3.20.0 backbone-ablation table (v3.20.0 Table XVIII; not the same table as v4 Table XVIII which reports Big-4 vs full-dataset Spearman drift in §IV-K).

EfficientNet-B0 achieves the highest Cohen's d (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions. However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), resulting in lower per-sample classification confidence. VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.

ResNet-50 provides the best overall balance: (1) Cohen's d of 0.669 is competitive with EfficientNet-B0's 0.707; (2) its tighter distributions yield more reliable individual classifications; (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.

M. v4-New Anchor-Based ICCR Calibration Results

This section consolidates the v4-new empirical results that support the §III-L anchor-based threshold calibration framework. Numbers below are direct re-statements from the spike scripts cited per row; the corresponding provenance entries appear in the §III provenance table.

M.1 Composition decomposition (Scripts 39b through 39e)

Table XX. Within-firm and between-firm decomposition of the Big-4 accountant-level dip-test rejection.

| Diagnostic | Scope | Statistic | Implication |
|---|---|---|---|
| Within-firm signature-level cosine dip | Big-4 (4 firms) | $p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ | 0/4 firms reject; cosine within-firm unimodal |
| Within-firm signature-level cosine dip | non-Big-4 (10 firms $\geq 500$ sigs) | $p_{\text{cos}} \in [0.59, 0.99]$ | 0/10 firms reject; cosine within-firm unimodal |
| Within-firm jittered-dHash dip (5 seeds, median) | Big-4 (4 firms) | $p_{\text{med}} \in \{0.999, 0.996, 0.999, 0.9995\}$ | 0/4 firms reject after integer-jitter; raw rejection was integer-tie artefact |
| Big-4 pooled dHash: 2×2 factorial | firm-centred + jittered (5 seeds) | $p_{\text{med}} = 0.35$, 0/5 seeds reject | combined corrections eliminate rejection; multimodality is composition + integer artefact |
| Integer-histogram valley near $\text{dHash} \approx 5$ | within each Big-4 firm | none (0/4 firms) | no within-firm dHash antimode at the inherited HC cutoff |

(Source: Scripts 39b, 39c, 39d, 39e; bootstrap $n_{\text{boot}} = 2000$; jitter $\sim \mathrm{U}[-0.5, +0.5]$.)
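The two corrections in the 2×2 factorial can be sketched as simple preprocessing steps applied before a dip test. The block below shows uniform jitter on integer dHash values and per-firm centring; it assumes centring subtracts each firm's mean (the section does not say whether the mean or median is used), the function names are hypothetical, and the dip test itself is not implemented.

```python
import random
from statistics import mean

def jitter_integers(values, seed):
    """Add U[-0.5, +0.5] noise to integer-valued data to break ties."""
    rng = random.Random(seed)
    return [v + rng.uniform(-0.5, 0.5) for v in values]

def centre_by_firm(values_by_firm):
    """Subtract each firm's own mean (assumed centring statistic) so that
    between-firm location shifts are removed before pooling."""
    centred = {}
    for firm, values in values_by_firm.items():
        firm_mean = mean(values)
        centred[firm] = [v - firm_mean for v in values]
    return centred

# Toy dHash values for two hypothetical firms (not corpus data).
toy = {"firm_x": [2, 2, 3, 3], "firm_y": [8, 9, 9, 10]}
jittered = {firm: jitter_integers(vals, seed=0) for firm, vals in toy.items()}
centred = centre_by_firm(jittered)
pooled = [v for vals in centred.values() for v in vals]
```

Repeating the jitter under several seeds, as in Table XX, guards against a conclusion that depends on one particular random draw.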

M.2 Anchor-based inter-CPA pair-level ICCR (Script 40b)

Table XXI. Big-4 inter-CPA per-comparison ICCR sweep, $n = 5 \times 10^5$ pairs (Big-4 scope; v4 new).

| Threshold | Per-comparison ICCR | 95% Wilson CI |
|---|---|---|
| cos > 0.945 (v3.x published "natural threshold") | 0.00081 | [0.00073, 0.00089] |
| cos > 0.95 (inherited operating point) | 0.00060 | [0.00053, 0.00067] |
| cos > 0.97 | 0.00024 | [0.00020, 0.00029] |
| cos > 0.98 | 0.00009 | [0.00007, 0.00012] |
| dHash $\leq 5$ (inherited operating point) | 0.00129 | [0.00120, 0.00140] |
| dHash $\leq 4$ | 0.00050 | [0.00044, 0.00057] |
| dHash $\leq 3$ | 0.00019 | [0.00015, 0.00023] |
| Joint: cos > 0.95 AND dHash $\leq 5$ (any-pair semantics) | 0.00014 | |
| Joint: cos > 0.95 AND dHash $\leq 4$ (any-pair) | 0.00011 | |

Conditional ICCR(dHash $\leq 5$ | cos > 0.95) = 0.234 (Wilson 95% [0.190, 0.285]; 70 of 299 pairs).
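As a worked check of the interval arithmetic, the Wilson score interval can be recomputed from the stated counts. This is a minimal sketch using the standard Wilson formula; the function name `wilson_interval` is illustrative, and the count of 299 pairs out of $5 \times 10^5$ for the cos > 0.95 row is inferred here from the "70 of 299 pairs" statement rather than quoted from Script 40b.

```python
import math

def wilson_interval(successes, n, z=1.959963984540054):
    """Two-sided Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Conditional ICCR: 70 dHash <= 5 hits among the 299 pairs with cos > 0.95.
lo, hi = wilson_interval(70, 299)
print(f"conditional ICCR = {70 / 299:.3f}, 95% Wilson [{lo:.3f}, {hi:.3f}]")

# Per-comparison ICCR at cos > 0.95, assuming 299 hits in 500,000 pairs.
lo2, hi2 = wilson_interval(299, 500_000)
print(f"per-comparison ICCR = {299 / 500_000:.5f}, 95% Wilson [{lo2:.5f}, {hi2:.5f}]")
```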

The cos > 0.95 row replicates v3.20.0 §IV-F.1 Table X (v3 reported 0.0005 under the earlier "FAR" terminology). The dHash rows and joint rows are new in v4.

M.3 Pool-normalised per-signature ICCR (Script 43)

Table XXII. Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos > 0.95 AND dHash $\leq 5$); $n_{\text{sig}} = 150{,}453$ (vector-complete Big-4); CPA-block bootstrap $n_{\text{boot}} = 1000$.

| Scope | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|---|---|---|---|
| Big-4 pooled (any-pair, deployed) | 0.1102 | [0.1086, 0.1118] | [0.0908, 0.1330] |
| Big-4 pooled (same-pair, stricter alternative) | 0.0827 | [0.0813, 0.0841] | [0.0668, 0.1021] |
| Firm A (any-pair) | 0.2594 | | |
| Firm B (any-pair) | 0.0147 | | |
| Firm C (any-pair) | 0.0053 | | |
| Firm D (any-pair) | 0.0110 | | |
| Pool-size decile 1 (smallest pools), any-pair | 0.0249 | | |
| Pool-size decile 10 (largest pools), any-pair | 0.1905 | | |

The decile trend is broadly monotone in pool size, with two minor reversals (deciles 5 and 9 dip below their predecessors). The stricter operating point cos > 0.95 AND dHash $\leq 3$ (same-pair) gives a per-signature ICCR of 0.0449.
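The CPA-block bootstrap column of Table XXII is wider than the Wilson column because resampling whole CPAs preserves within-CPA dependence among signatures. A minimal sketch of such a block (cluster) percentile bootstrap is given below; it is an illustration under assumptions, not Script 43's implementation. The function name, the percentile method, and the per-CPA counts are all hypothetical.

```python
import random

def cpa_block_bootstrap(per_cpa, n_boot=1000, seed=0, alpha=0.05):
    """Percentile CI for the pooled hit rate, resampling CPAs with replacement.

    per_cpa: list of (hits, n_signatures) tuples, one tuple per CPA.
    """
    rng = random.Random(seed)
    k = len(per_cpa)
    rates = []
    for _ in range(n_boot):
        # Draw k CPAs with replacement; each draw brings all of that CPA's signatures.
        sample = [per_cpa[rng.randrange(k)] for _ in range(k)]
        hits = sum(h for h, _ in sample)
        total = sum(n for _, n in sample)
        rates.append(hits / total)
    rates.sort()
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Hypothetical per-CPA (hits, signatures) counts, illustrative only.
toy = [(40, 120), (2, 90), (0, 60), (15, 200), (1, 80), (30, 110), (3, 150), (0, 70)]
lo, hi = cpa_block_bootstrap(toy)
print(f"pooled rate = {sum(h for h, _ in toy) / sum(n for _, n in toy):.4f}, "
      f"block-bootstrap 95% CI = [{lo:.4f}, {hi:.4f}]")
```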

M.4 Document-level ICCR under three alarm definitions (Script 45)

Table XXIII. Document-level inter-CPA ICCR by alarm definition; $n_{\text{docs}} = 75{,}233$.

| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | 0.1797 | [0.1770, 0.1825] |
| D2 (operational) | HC + MC | 0.3375 | [0.3342, 0.3409] |
| D3 | HC + MC + HSC | 0.3384 | [0.3351, 0.3418] |

Per-firm D2 document-level ICCR: Firm A 0.6201 ($n = 30{,}226$); Firm B 0.1600 ($n = 17{,}127$); Firm C 0.1635 ($n = 19{,}501$); Firm D 0.0863 ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XIX's single-firm Firm C count of $19{,}122$ by exactly the 379 mixed-firm PDFs: all 379 are 1:1 Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XIX's per-firm rows sum to $74{,}854 = 75{,}233 - 379$.
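The tie-break behaviour and the denominator bookkeeping described above can be checked with a short sketch. It is a pure-Python emulation rather than Script 45 itself: `mode_first_sorted` is a hypothetical helper that mimics taking the first index of the maximum count over sorted unique labels, which is what `np.argmax` over `np.unique(..., return_counts=True)` does. The final line re-aggregates the four-decimal per-firm D2 rates, so it should match the pooled 0.3375 only up to rounding.

```python
from collections import Counter

def mode_first_sorted(firms):
    """Most frequent label; ties go to the label that sorts first."""
    counts = Counter(firms)
    labels = sorted(counts)  # sorted unique labels, as np.unique returns them
    # max() keeps the first maximal index, matching np.argmax on ties.
    best = max(range(len(labels)), key=lambda i: counts[labels[i]])
    return labels[best]

print(mode_first_sorted(["Firm D", "Firm C"]))  # 1:1 tie resolves to Firm C

# Denominators and rates as reported in the text.
table_xix_firm_c = 19_122
mixed_c_d_docs = 379
d2_denoms = {"Firm A": 30_226, "Firm B": 17_127, "Firm C": 19_501, "Firm D": 8_379}
d2_rates = {"Firm A": 0.6201, "Firm B": 0.1600, "Firm C": 0.1635, "Firm D": 0.0863}

assert table_xix_firm_c + mixed_c_d_docs == d2_denoms["Firm C"]
assert sum(d2_denoms.values()) == 75_233
assert sum(d2_denoms.values()) - mixed_c_d_docs == 74_854

# Document-weighted average of the per-firm rates (approximate, rounded inputs).
pooled = sum(d2_rates[f] * d2_denoms[f] for f in d2_denoms) / sum(d2_denoms.values())
print(f"re-aggregated D2 rate = {pooled:.4f}")
```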

M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)

Table XXIV. Logistic regression of per-signature any-pair HC hit indicator on firm dummies and centred log pool size (Firm A reference).

| Term | Odds ratio (vs Firm A) | Direction |
|---|---|---|
| Firm B | 0.053 | $\sim 19\times$ lower odds than Firm A |
| Firm C | 0.010 | $\sim 100\times$ lower odds than Firm A |
| Firm D | 0.027 | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | 4.01 | $\sim 4\times$ higher odds per log unit pool size |
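The "times lower odds" entries in the Direction column are the reciprocals of the odds ratios. The sketch below reproduces that arithmetic and shows how an odds ratio moves a probability. The helper `apply_odds_ratio` and the 0.25 baseline are hypothetical and chosen only for illustration; they are not fitted values from Script 44.

```python
odds_ratios = {"Firm B": 0.053, "Firm C": 0.010, "Firm D": 0.027}

# Reciprocal of the odds ratio gives the fold reduction in odds relative to Firm A.
fold_lower = {firm: 1.0 / ratio for firm, ratio in odds_ratios.items()}
for firm, fold in fold_lower.items():
    print(f"{firm}: about {fold:.0f}x lower odds than Firm A")

def apply_odds_ratio(p_ref, odds_ratio):
    """Convert a reference probability to the probability implied by an odds ratio."""
    odds = p_ref / (1.0 - p_ref) * odds_ratio
    return odds / (1.0 + odds)

# Hypothetical reference probability of 0.25 at a fixed pool size (illustrative only).
for firm, ratio in odds_ratios.items():
    print(f"{firm}: implied probability = {apply_odds_ratio(0.25, ratio):.4f}")
```

Odds ratios act on odds rather than on probabilities, so a 19-fold drop in odds is not a 19-fold drop in hit rate unless the reference rate is small.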

Per-decile per-firm rates (table not duplicated here; the Script 44 decile table is available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$, while Firm A ranges over $0.0541$–$0.5958$. The firm gap survives within matched pool sizes.

Table XXV. Cross-firm hit matrix among Big-4 source signatures with any-pair HC hit; max-cosine partner firm (counts).

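| Source firm | Firm A cand. | Firm B | Firm C | Firm D | non-Big-4 | n hits |
|---|---|---|---|---|---|---|
| Firm A | $14{,}447$ | 95 | 44 | 19 | 17 | $14{,}622$ |
| Firm B | 92 | 371 | 8 | 4 | 9 | 484 |
| Firm C | 16 | 7 | 149 | 5 | 1 | 178 |
| Firm D | 22 | 2 | 6 | 106 | 1 | 137 |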

Same-pair joint hits (single candidate satisfying both cos > 0.95 AND dHash $\leq 5$) are within-firm at rates 99.96% / 97.7% / 98.2% / 97.0% for Firms A/B/C/D respectively.
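The row totals of Table XXV, and the share of any-pair hits whose max-cosine partner is in the source firm, can be recomputed from the counts. This is a minimal sketch with hypothetical variable names. The diagonal shares it prints describe the max-cosine partner under any-pair semantics, so they are a different quantity from the same-pair within-firm rates quoted above and need not match them.

```python
firms = ["Firm A", "Firm B", "Firm C", "Firm D"]

# Columns: Firm A, Firm B, Firm C, Firm D, non-Big-4 (counts from Table XXV).
hits = {
    "Firm A": [14_447, 95, 44, 19, 17],
    "Firm B": [92, 371, 8, 4, 9],
    "Firm C": [16, 7, 149, 5, 1],
    "Firm D": [22, 2, 6, 106, 1],
}
n_hits = {"Firm A": 14_622, "Firm B": 484, "Firm C": 178, "Firm D": 137}

within_share = {}
for i, firm in enumerate(firms):
    row = hits[firm]
    assert sum(row) == n_hits[firm]  # each row sums to its reported n hits
    within_share[firm] = row[i] / sum(row)  # diagonal cell over row total
    print(f"{firm}: max-cosine partner in same firm for {within_share[firm]:.1%} of hits")
```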

M.6 Alert-rate sensitivity around inherited HC threshold (Script 46)

Table XXVI. Local-gradient / median-gradient ratio at inherited thresholds (descriptive plateau diagnostic).

| Threshold | Local / median gradient ratio | Interpretation |
|---|---|---|
| cos = 0.95 (HC) | $\approx 25\times$ | locally sensitive (not plateau-stable) |
| dHash = 5 (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
| dHash = 15 (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |
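One way to read the plateau diagnostic of Table XXVI is as a finite-difference gradient of the alert-rate curve at the threshold, divided by the median gradient over the whole sweep. The sketch below illustrates that reading only. Script 46's exact definition is not reproduced in this section, so the function, the grid, and the alert-rate values are all assumptions.

```python
from statistics import median

def local_to_median_gradient(thresholds, rates, at):
    """Ratio of the absolute gradient nearest `at` to the median absolute gradient."""
    grads, mids = [], []
    for i in range(len(rates) - 1):
        step = thresholds[i + 1] - thresholds[i]
        grads.append(abs((rates[i + 1] - rates[i]) / step))
        mids.append((thresholds[i] + thresholds[i + 1]) / 2)
    # Pick the grid interval whose midpoint is closest to the threshold of interest.
    nearest = min(range(len(mids)), key=lambda i: abs(mids[i] - at))
    return grads[nearest] / median(grads)

# Hypothetical alert-rate curve over integer dHash cutoffs (illustrative values).
cutoffs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
alert_rate = [0.00, 0.01, 0.02, 0.03, 0.05, 0.20, 0.22, 0.23, 0.24]

ratio = local_to_median_gradient(cutoffs, alert_rate, at=4.5)
print(f"local / median gradient ratio near dHash = 4.5: {ratio:.1f}")
```

A ratio well above 1 marks a locally sensitive threshold and a ratio well below 1 a plateau-like one, which is how the Interpretation column reads.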

Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC = 0.4958; per-document HC = 0.6228. The deployed-rate excess over the inter-CPA proxy is 0.3856 per-signature (38.56 pp, against the Table XXII proxy of 0.1102) and 0.4431 per-document (44.31 pp, against the Table XXIII D1 proxy of 0.1797); this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate.
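The excess figures are differences between the deployed rates and the inter-CPA proxy rates reported earlier in this section (0.1102 in Table XXII and the D1 value 0.1797 in Table XXIII). The pairing of each deployed rate with its proxy is inferred here from the arithmetic. A minimal check, expressed both as a difference of proportions and in percentage points:

```python
deployed = {"per-signature": 0.4958, "per-document": 0.6228}
inter_cpa_proxy = {"per-signature": 0.1102, "per-document": 0.1797}

# Excess of the deployed alert rate over the inter-CPA proxy, as a proportion.
excess = {level: round(deployed[level] - inter_cpa_proxy[level], 4) for level in deployed}

for level, value in excess.items():
    print(f"{level}: excess = {value:.4f} ({100 * value:.2f} percentage points)")
```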