# IV. Experiments and Results

Section IV reports the empirical results that calibrate and characterise the operational classifier of §III-H.1 (calibration developed in §III-L). The primary analyses (§IV-D through §IV-J, and the anchor-based ICCR calibration consolidated in §IV-M) are scoped to the Big-4 sub-corpus (Firms A–D, n = 437 CPAs with n_{\text{sig}} \geq 10, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. §IV-K reports a full-dataset (686 CPAs) robustness check on the K=3 mixture and per-CPA score-rank convergence; §IV-A through §IV-C and §IV-L report the corpus-wide pipeline performance and feature-backbone ablation that support the descriptor choice of §III-F.

## A. Experimental Setup

Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration. Feature extraction used PyTorch 2.9 with torchvision model implementations. The complete pipeline---from raw PDF processing through final classification---was implemented in Python. Because all steps rely on deterministic forward inference over fixed pre-trained weights (no fine-tuning) plus fixed-seed numerical procedures, reported results are platform-independent to within floating-point precision.
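
For concreteness, the dHash structural descriptor mentioned in the pairwise computations above admits a compact sketch. The following is an illustrative pure-Python version, not the pipeline's exact implementation: it assumes a grayscale crop supplied as a 2D list of intensities and uses nearest-neighbour resizing, where production code would operate on decoded image arrays.

```python
# Illustrative dHash sketch (assumption: grayscale crop as a 2D list of ints).

def _resize_nearest(img, w, h):
    """Nearest-neighbour resize of a 2D grayscale grid to w x h."""
    H, W = len(img), len(img[0])
    return [[img[r * H // h][c * W // w] for c in range(w)] for r in range(h)]

def dhash(img, hash_w=8, hash_h=8):
    """64-bit difference hash: compare each pixel to its right neighbour
    on a (hash_w + 1) x hash_h grid; bit = 1 where the left pixel is brighter."""
    small = _resize_nearest(img, hash_w + 1, hash_h)
    bits = 0
    for row in small:
        for c in range(hash_w):
            bits = (bits << 1) | (1 if row[c] > row[c + 1] else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes (the dHash distance)."""
    return bin(a ^ b).count("1")
```

The Hamming distance between two such 64-bit hashes is the dHash distance used throughout (e.g., in the deployed dHash \leq 5 cut); byte-identical crops give distance 0 by construction.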

## B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). However, the subsequent production deployment provides a practical consistency check: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

Table III. Extraction Results.

| Metric | Value |
| --- | --- |
| Documents processed | 86,071 |
| Documents with detections | 85,042 (98.8%) |
| Total signatures extracted | 182,328 |
| Avg. signatures per document | 2.14 |
| CPA-matched signatures | 168,755 (92.6%) |
| Processing rate | 43.1 docs/sec |

The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in the primary analyses of §IV-D through §IV-J.

## C. All-Pairs Intra-vs-Inter Class Distribution Analysis

Fig. 2 presents the cosine similarity distributions computed over the full set of pairwise comparisons under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L). Table IV summarizes the distributional statistics.

Table IV. Cosine Similarity Distribution Statistics.

| Statistic | Intra-class | Inter-class |
| --- | --- | --- |
| N (pairs) | 41,352,824 | 500,000 |
| Mean | 0.821 | 0.758 |
| Std. Dev. | 0.098 | 0.090 |
| Median | 0.836 | 0.774 |
| Skewness | 0.711 | 0.851 |
| Kurtosis | 0.550 | 1.027 |

Both distributions are left-skewed and leptokurtic. Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; the subsequent distributional diagnostics in Section IV-D are produced via the methods of Section III-I to avoid single-family distributional assumptions.

The KDE crossover---where the two density functions intersect---was located at 0.837. Under equal prior probabilities and equal misclassification costs, this crossover is a candidate decision boundary between the two classes; we adopt it only as the operational LH/UN boundary in §III-H.1, not as a natural distributional threshold. Statistical tests confirmed significant separation between the two distributions (Cohen's d = 0.669, Mann-Whitney [36] p < 0.001, K-S 2-sample p < 0.001).
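
Locating such a crossover can be sketched as follows: estimate both densities with a fixed-bandwidth Gaussian KDE, then scan a grid for the sign change of their difference. This is an illustrative hand-rolled version on small synthetic samples; the bandwidth and grid are arbitrary choices for the sketch, not the paper's settings.

```python
import math

def gaussian_kde(samples, bw):
    """Fixed-bandwidth Gaussian kernel density estimate."""
    n = len(samples)
    norm = n * bw * math.sqrt(2 * math.pi)
    return lambda x: sum(math.exp(-0.5 * ((x - s) / bw) ** 2) for s in samples) / norm

def kde_crossover(intra, inter, lo, hi, bw=0.02, grid=1000):
    """First point in [lo, hi] where the two estimated densities intersect."""
    f_intra, f_inter = gaussian_kde(intra, bw), gaussian_kde(inter, bw)
    xs = [lo + (hi - lo) * i / grid for i in range(grid + 1)]
    diff = [f_intra(x) - f_inter(x) for x in xs]
    for i in range(grid):
        if diff[i] == 0 or diff[i] * diff[i + 1] < 0:  # sign change => crossing
            return 0.5 * (xs[i] + xs[i + 1])
    return None
```

On mirror-symmetric toy samples centred at 0.84 and 0.76 the crossover lands at the midpoint 0.80; the analogous computation on the real all-pairs samples yields the 0.837 reported above.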

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength. We therefore rely primarily on Cohen's d as an effect-size measure that is less sensitive to sample size. A Cohen's d of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
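
As a worked check of the effect-size arithmetic, Cohen's d here is the mean difference over the pooled (n-1) standard deviation; a minimal sketch:

```python
import math

def cohens_d(x, y):
    """Cohen's d: difference of sample means over the pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd
```

Unlike a p-value, this quantity does not grow with the pair count, which is why it is the preferred evidence measure under the dependence caveat above.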

## D. Big-4 Accountant-Level Distributional Characterisation

This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. The accountant-level dip-test rejection reported in Table V is, per §III-I.4, fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration.

Table V. Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32).

| Population | n CPAs | p_{\text{cos}} | p_{\text{dHash}} | Interpretation |
| --- | --- | --- | --- | --- |
| Big-4 pooled (primary) | 437 | < 5 \times 10^{-4} | < 5 \times 10^{-4} | reject unimodality on both axes |
| Firm A pooled alone | 171 | 0.992 | 0.924 | unimodal |
| Firms B + C + D pooled | 266 | 0.998 | 0.906 | unimodal |
| All non-Firm-A pooled | 515 | 0.998 | 0.907 | unimodal |

Bootstrap implementation: n_{\text{boot}} = 2000; for the Big-4 cells, no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by the bootstrap resolution 1 / 2000 = 5 \times 10^{-4} (Script 34 reports this as p = 0.0000; we report p < 5 \times 10^{-4} to reflect the resolution). Single-firm dip statistics for Firms B, C, and D were not separately computed.
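
The resolution-bound reporting convention (p < 1 / n_{\text{boot}} when no replicate exceeds the observed statistic) can be sketched generically. In this sketch, `stat_fn` and `resample_fn` are placeholders for the reader's statistic and null resampler; the dip statistic itself is omitted.

```python
import random

def bootstrap_pvalue(observed, stat_fn, resample_fn, n_boot=2000, seed=0):
    """One-sided bootstrap p-value plus its resolution floor.

    Counts null replicates whose statistic meets or exceeds the observed
    value. When the count is zero, the empirical p-value is 0.0 and the
    honest report is the bound p < 1 / n_boot (here 1/2000 = 5e-4).
    """
    rng = random.Random(seed)
    exceed = sum(stat_fn(resample_fn(rng)) >= observed for _ in range(n_boot))
    return exceed / n_boot, max(exceed, 1) / n_boot  # (point estimate, upper bound)
```

Reporting the second return value rather than "p = 0.0000" is exactly the convention adopted for the Big-4 dip cells above.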

Table VI. Burgstahler-Dichev / McCrary density-smoothness diagnostic on accountant-level marginals (cosine in 0.002 bins; dHash in integer bins; \alpha = 0.05, two-sided).

| Population | Cosine: significant transition? | dHash: significant transition? |
| --- | --- | --- |
| Big-4 pooled (primary) | none (p > 0.05) | none (p > 0.05) |
| Firm A pooled alone | none | none |
| Firms B + C + D pooled | none | one transition at \overline{\text{dHash}} = 10.8 |
| All non-Firm-A pooled | none | one transition at \overline{\text{dHash}} = 6.6 |

The Big-4-scope null on both axes is consistent with the §IV-E mixture evidence: the K=3 components overlap in their tails rather than separating sharply, so a local-discontinuity test does not flag a transition. Outside Big-4, dHash transitions appear in some subsets but no cosine transition is identified in any tested subset (Script 32 sweeps; pre-2018 and post-2020 stratified variants exhibit dHash transitions at varying locations). These off-Big-4 dHash transitions are scope-dependent and are not used as operational thresholds; we do not claim a specific structural interpretation for them without an explicit bin-width sensitivity sweep at those scopes.

## E. Big-4 K=2 / K=3 Mixture Fits

This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings.

Table VII. Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-J) and marginal-crossing bootstrap 95% confidence intervals.

| K=2 component | \overline{\text{cos}} | \overline{\text{dHash}} | weight |
| --- | --- | --- | --- |
| K=2-a (low-cos / high-dHash position) | 0.954 | 7.14 | 0.689 |
| K=2-b (high-cos / low-dHash position) | 0.983 | 2.41 | 0.311 |

Marginal crossings (point + bootstrap 95% CI, n_{\text{boot}} = 500):

| Axis | Point | Bootstrap median | 95% CI | CI half-width |
| --- | --- | --- | --- | --- |
| cos | 0.9755 | 0.9754 | [0.9742, 0.9772] | 0.0015 |
| dHash | 3.755 | 3.763 | [3.476, 3.969] | 0.246 |

\text{BIC}(K{=}2) = -1108.45 (Script 34).
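
A marginal crossing is the point where the two weighted component densities are equal on one axis. Under the Gaussian-component assumption it can be located by bisection on the log-density difference; this is an illustrative sketch, not Script 34's code.

```python
import math

def weighted_normal_logpdf(x, w, m, s):
    """log of w * N(x; m, s)."""
    return math.log(w) - math.log(s * math.sqrt(2 * math.pi)) - 0.5 * ((x - m) / s) ** 2

def crossing(w1, m1, s1, w2, m2, s2, lo, hi, tol=1e-10):
    """Bisection on the weighted log-density difference; assumes exactly
    one sign change inside [lo, hi]."""
    def g(x):
        return (weighted_normal_logpdf(x, w1, m1, s1)
                - weighted_normal_logpdf(x, w2, m2, s2))
    a, b = lo, hi
    ga = g(a)
    while b - a > tol:
        mid = 0.5 * (a + b)
        if g(mid) * ga > 0:
            a, ga = mid, g(mid)
        else:
            b = mid
    return 0.5 * (a + b)
```

A bootstrap CI like the [0.9742, 0.9772] band above would then be obtained by recomputing this crossing across refitted bootstrap replicates.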

Table VIII. Big-4 K=3 mixture components (descriptive firm-compositional partition per §III-J; not mechanism clusters).

| K=3 component | \overline{\text{cos}} | \overline{\text{dHash}} | weight | descriptive position |
| --- | --- | --- | --- | --- |
| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner |
| C2 | 0.9558 | 6.66 | 0.536 | central region |
| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner |

\text{BIC}(K{=}3) = -1111.93, lower than K{=}2 by 3.48 (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.

## F. Convergent Internal-Consistency Checks

This section reports the empirical evidence for §III-K's three-score internal-consistency analysis. We re-emphasise the §III-K caveat: the three scores are deterministic functions of the same per-CPA descriptor pair (\overline{\text{cos}}_a, \overline{\text{dHash}}_a) and are not statistically independent measurements. The pairwise correlations document internal consistency among feature-derived ranks rather than external validation against an independent ground truth.

Table IX. Per-CPA Spearman rank correlations among three feature-derived scores, Big-4, n = 437.

| Score pair | Spearman \rho | $p$-value |
| --- | --- | --- |
| K=3 P(C1) vs deployed box-rule less-replication-dominated rate | +0.9627 | < 10^{-248} |
| Reverse-anchor cosine percentile vs deployed box-rule less-replication-dominated rate | +0.8890 | < 10^{-149} |
| K=3 P(C1) vs Reverse-anchor cosine percentile | +0.8794 | < 10^{-142} |

(Source: Script 38.) Reverse-anchor reference: 2D Gaussian fit by MCD (support fraction 0.85) on n = 249 non-Big-4 CPAs; reference centre \overline{\text{cos}} = 0.935, \overline{\text{dHash}} = 9.77.
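
Spearman's \rho as used in Table IX is the Pearson correlation of the two rank vectors (average ranks for ties); a dependency-free sketch:

```python
def _ranks(xs):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because only ranks enter, the statistic is invariant to any monotone rescaling of the three scores, which is why it is the right agreement measure for non-independent, differently-scaled scores.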

Table X. Per-firm summary across the three feature-derived scores, Big-4.

| Firm | n CPAs | mean P(\text{C1}) | mean reverse-anchor score | mean deployed less-replication-dominated rate |
| --- | --- | --- | --- | --- |
| Firm A | 171 | 0.0072 | -0.9726 | 0.1935 |
| Firm B | 112 | 0.1410 | -0.8201 | 0.6962 |
| Firm C | 102 | 0.3110 | -0.7672 | 0.7896 |
| Firm D | 52 | 0.2406 | -0.7125 | 0.7608 |

(Source: Script 38 per-firm summary; reverse-anchor score is sign-flipped so that higher values indicate deeper into the reference left tail = less replication-dominated relative to the non-Big-4 reference.)

The three scores agree on placing Firm A as the most replication-dominated and the three non-Firm-A firms as less replication-dominated. The K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Score 1 and Score 3) place Firm C at the least-replication-dominated end of Big-4; the reverse-anchor cosine percentile (Score 2) ranks Firm D fractionally above Firm C. This residual within-Big-4-non-A disagreement is a design feature of the reverse-anchor metric: Score 2 measures only the marginal cosine percentile under the non-Big-4 reference, so a firm with a slightly higher cosine but a markedly different dHash distribution (Firm D vs Firm C) can score higher on Score 2 while scoring lower on Scores 1 and 3, both of which use both descriptors.

Table XI. Per-signature Cohen \kappa (binary collapse, replication-dominated vs less-replication-dominated), n = 150{,}442 Big-4 signatures.

| Pair | Cohen \kappa |
| --- | --- |
| deployed binary high-confidence box rule (cos > 0.95 AND dHash \leq 5) vs per-CPA K=3 hard label | 0.662 |
| deployed binary high-confidence box rule vs per-signature K=3 hard label | 0.559 |
| Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 |

(Source: Script 39.) Per-signature K=3 components (n = 150{,}442) sorted by ascending cosine: (0.928, 9.75, 0.146) / (0.963, 6.04, 0.582) / (0.989, 1.27, 0.272), an absolute cosine drift of 0.018 in C1 and 0.006 in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos > 0.95 AND dHash \leq 5); the five-way classifier's moderate-confidence band (5 < \text{dHash} \leq 15) retains its prior calibration and capture-rate evidence (supplementary materials; cross-referenced in §IV-J).
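
Cohen's \kappa in Table XI is observed agreement corrected for the chance agreement implied by the two labelers' marginal frequencies; a minimal sketch:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two parallel label sequences.

    po = observed agreement rate; pe = chance agreement under independent
    raters with the same marginal label frequencies.
    """
    n = len(a)
    labels = set(a) | set(b)
    po = sum(x == y for x, y in zip(a, b)) / n
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)
```

The chance correction matters here because the replication-dominated class is roughly half the Big-4 signatures, so raw agreement alone would overstate concordance.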

## G. Leave-One-Firm-Out Reproducibility

This section reports the firm-level cross-validation evidence motivating §III-J's "K=3 descriptive, not operational" framing.

Table XII. K=2 leave-one-firm-out across the four Big-4 folds.

| Held-out firm | n_{\text{train}} | n_{\text{held}} | Fold rule (cos cut, dHash cut) | Held-out classified as templated by fold rule |
| --- | --- | --- | --- | --- |
| Firm A | 266 | 171 | cos > 0.9380 AND dHash \leq 8.79 | 171 / 171 = 100.00\% (95\% Wilson [97.80\%, 100.00\%]) |
| Firm B | 325 | 112 | cos > 0.9744 AND dHash \leq 3.98 | 0 / 112 = 0\% (95\% Wilson [0\%, 3.32\%]) |
| Firm C | 335 | 102 | cos > 0.9752 AND dHash \leq 3.75 | 0 / 102 = 0\% (95\% Wilson [0\%, 3.63\%]) |
| Firm D | 385 | 52 | cos > 0.9756 AND dHash \leq 3.74 | 0 / 52 = 0\% (95\% Wilson [0\%, 6.88\%]) |

(Source: Script 36.) Across-fold cosine crossing: pairwise range [0.9380, 0.9756], range = 0.0376; max absolute deviation from the across-fold mean is 0.028. This exceeds the report's 0.005 across-fold stability tolerance by 5.6\times and is much larger than the full-Big-4 bootstrap CI half-width of 0.0015. Together with the all-or-nothing held-out classification pattern (Firm A held out \Rightarrow all held-out CPAs templated; any non-Firm-A firm held out \Rightarrow none templated), this indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
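
The Wilson intervals quoted here (and in Tables XIV and XXI) follow the standard score-interval formula; a sketch:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval (default 95%) for k successes in n trials."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

Unlike the Wald interval, it stays informative at the boundary proportions 0/n and n/n that dominate these all-or-nothing fold results.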

Table XIII. K=3 leave-one-firm-out: C1 component shape and held-out membership.

| Held-out firm | C1 cos (fit) | C1 dHash (fit) | C1 weight (fit) | Held-out C1 hard-label rate | Full-Big-4 baseline C1% | Absolute difference |
| --- | --- | --- | --- | --- | --- | --- |
| Full-Big-4 baseline | 0.9457 | 9.17 | 0.143 | | | |
| Firm A held out | 0.9425 | 10.13 | 0.145 | 4.68\% | 0.00\% | 4.68 pp |
| Firm B held out | 0.9441 | 9.16 | 0.127 | 7.14\% | 8.93\% | 1.76 pp |
| Firm C held out | 0.9504 | 8.41 | 0.126 | 36.27\% | 23.53\% | 12.77 pp |
| Firm D held out | 0.9439 | 9.29 | 0.120 | 17.31\% | 11.54\% | 5.81 pp |

(Source: Script 37; screening label P2_PARTIAL.) Component shape is reproducible across folds: max deviation of C1 cosine = 0.005, C1 dHash = 0.96, C1 weight = 0.023. Hard-posterior membership for the held-out firm varies: max absolute difference from the full-Big-4 baseline is 12.77 pp at the Firm C held-out fold, exceeding the report's 5 pp viability bar. We accordingly do not use K=3 hard-posterior membership as an operational classifier label (§III-J, §III-L).

## H. Pixel-Identity Positive-Anchor Miss Rate

This section reports the only conservative hard-positive subset analysis available in the corpus: the positive-anchor miss rate against n = 262 Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The analysis is one-sided (positive-anchor only); a paired false-alarm rate against a hand-signed negative anchor is not available because no signature-level hand-signed ground truth exists in the corpus (§III-K item 4).

Table XIV. Positive-anchor miss rate, n = 262 Big-4 byte-identical signatures.

| Classifier | Misclassified as less-replication-dominated | Miss rate | Wilson 95% CI |
| --- | --- | --- | --- |
| deployed binary high-confidence box rule (cos > 0.95 AND dHash \leq 5) | 0 / 262 | 0\% | [0\%, 1.45\%] |
| K=3 per-CPA hard label (C3 = high-cos / low-dHash; descriptive) | 0 / 262 | 0\% | [0\%, 1.45\%] |
| Reverse-anchor (prevalence-calibrated cut) | 0 / 262 | 0\% | [0\%, 1.45\%] |

(Source: Script 40.) Per-firm breakdown of the byte-identical subset: Firm A 145; Firm B 8; Firm C 107; Firm D 2. All three candidate scores correctly assign every byte-identical signature to the replicated class.

We caution that for the deployed box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine \approx 1 and dHash \approx 0, well inside the rule's high-confidence region). The reverse-anchor cut is chosen by prevalence calibration against the box rule's overall replicated rate of 49.58\% across Big-4 signatures; this is a documented limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
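
The byte-identical anchor subset can be identified by exact hashing of the crop bytes; a sketch, assuming the crops have already been cropped and normalised as in §III (SHA-256 collisions are negligible at this scale):

```python
import hashlib
from collections import defaultdict

def byte_identical_groups(crops):
    """Group normalised signature crops that are byte-identical.

    crops: iterable of (signature_id, raw_bytes). Returns the groups with
    more than one member, i.e. the byte-identical anchor candidates.
    """
    buckets = defaultdict(list)
    for sig_id, blob in crops:
        buckets[hashlib.sha256(blob).hexdigest()].append(sig_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Signatures whose nearest same-CPA match falls in such a group form the conservative hard-positive subset used above.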

## I. Inter-CPA Pair-Level Coincidence Rate

The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It is the per-pair rate at which two signatures from different CPAs satisfy the deployed rule. We do not label it as a False Acceptance Rate because (a) FAR has a biometric-verification meaning that requires ground-truth negative labels, and (b) the inter-CPA negative-anchor assumption is partially violated by within-firm cross-CPA template-like collision structures (§III-L.4 cross-firm hit matrix).

A corpus-wide spike on \sim 50{,}000 inter-CPA pairs gives a per-comparison rate of 0.0005 (Wilson 95% CI [0.0003, 0.0007]) at the cosine cut 0.95. The Big-4-scope spike at higher sample size (5 \times 10^5 inter-CPA pairs) replicates this number, adds the structural dimension (dHash), and adds joint-rule rates; the §III-L.1 numbers are referenced rather than duplicated here, and the consolidated ICCR calibration appears in §IV-M Tables XXI–XXVI.

## J. Five-Way Per-Signature + Document-Level Classification Output

This section reports the five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. See §III-H.1 for the five-way category definitions and the cosine and dHash cuts; calibration is in §III-L.

Table XV. Five-way per-signature category counts, Big-4 sub-corpus, n = 150{,}442 classified.

| Category | Long name | n signatures | % of classified |
| --- | --- | --- | --- |
| HC | High-confidence non-hand-signed | 74,593 | 49.58% |
| MC | Moderate-confidence non-hand-signed | 39,817 | 26.47% |
| HSC | High style consistency | 314 | 0.21% |
| UN | Uncertain | 35,480 | 23.58% |
| LH | Likely hand-signed | 238 | 0.16% |
LH Likely hand-signed 238 0.16%

(Source: Script 42; 11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded. The 150{,}442 vs 150{,}453 distinction — descriptor-complete vs vector-complete — recurs across §IV: descriptor-complete analyses (§IV-D through §IV-J, all using accountant-level aggregates or per-signature category counts derived from the same 150,442-signature substrate) use n = 150{,}442; vector- or pair-recomputed analyses (§IV-M.2 Table XXI, §IV-M.3 Table XXII, §IV-M.5 Tables XXIV–XXV; Scripts 40b, 43, 44) use n = 150{,}453 because their pair- or pool-level computations load all vector-complete signatures including those failing the descriptor-complete filter. See §III-G for the sample-size reconciliation.)

Per-firm five-way breakdown (% within firm).

| Firm | HC | MC | HSC | UN | LH | total signatures |
| --- | --- | --- | --- | --- | --- | --- |
| Firm A | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 |
| Firm B | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 |
| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |

(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVII: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVII). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).

Document-level worst-case aggregation. Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the worst-case rule (HC > MC > HSC > UN > LH; §III-H.1), applied to the Big-4 sub-corpus.
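
The worst-case rule reduces to taking the most severe per-signature category present in the document; a minimal sketch:

```python
# Severity order per §III-H.1: HC dominates MC, which dominates HSC, then UN, then LH.
SEVERITY = ["HC", "MC", "HSC", "UN", "LH"]

def document_label(signature_labels):
    """Document-level worst-case label: the most severe per-signature
    category among the document's signatures."""
    return min(signature_labels, key=SEVERITY.index)
```

On a typical two-signature report, one HC signature suffices to label the whole document HC, which is why the document-level HC share (62.28%, Table XVI) exceeds the per-signature HC share (49.58%, Table XV).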

Table XVI. Document-level worst-case category counts, Big-4 sub-corpus, n = 75{,}233 unique PDFs.

| Category | Long name | n documents | % |
| --- | --- | --- | --- |
| HC | High-confidence non-hand-signed | 46,857 | 62.28% |
| MC | Moderate-confidence non-hand-signed | 19,667 | 26.14% |
| HSC | High style consistency | 167 | 0.22% |
| UN | Uncertain | 8,524 | 11.33% |
| LH | Likely hand-signed | 18 | 0.02% |

(Source: Script 42 document-level table; 379 of 75,233 PDFs carried signatures from more than one Big-4 firm and are omitted from the single-firm-PDF per-firm breakdown of the script CSV but pooled into the overall counts here.)

Per-firm document-level breakdown (single-firm PDFs only).

| Firm | HC | MC | HSC | UN | LH | total docs |
| --- | --- | --- | --- | --- | --- | --- |
| Firm A | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 |
| Firm B | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 |
| Firm C | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 |
| Firm D | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 |

(Source: Script 42; mixed-firm PDFs n = 379 excluded from the per-firm rows but included in the overall counts above.)

The five-way moderate-confidence non-hand-signed band (cos > 0.95 AND 5 < \text{dHash} \leq 15) retains its prior calibration (supplementary materials); it is not separately re-characterised by Scripts 38–40, which checked only the binary high-confidence rule (cos > 0.95 AND dHash \leq 5). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The capture-rate calibration evidence for the moderate band is reported in the supplementary materials and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).

Table XVII. Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.

| Firm | n | C1 (low-cos / high-dHash) | C2 (central) | C3 (high-cos / low-dHash) | C1 % | C3 % |
| --- | --- | --- | --- | --- | --- | --- |
| Firm A | 171 | 0 | 30 | 141 | 0.00\% | 82.46\% |
| Firm B | 112 | 10 | 102 | 0 | 8.93\% | 0.00\% |
| Firm C | 102 | 24 | 77 | 1 | 23.53\% | 0.98\% |
| Firm D | 52 | 6 | 45 | 1 | 11.54\% | 1.92\% |

(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction 23.5\%); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).

Document-level worst-case aggregation outputs are reported in Table XVI above.

## K. Full-Dataset Robustness (light scope)

This section reports the reproducibility cross-check at the full accountant scope (n = 686 CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + deployed operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the K=3 + deployed-rule convergence reproduces at the wider scope. The §III-H.1 five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band retains its prior calibration (supplementary materials; §IV-J).

Table XVIII. K=3 component comparison, Big-4 sub-corpus vs full dataset.

| K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full |
| --- | --- | --- | --- |
| C1 (low-cos / high-dHash) | 0.9457 / 9.17 / 0.143 | 0.9278 / 11.17 / 0.284 | \lvert\Delta\rvert cos 0.018, dHash 1.99, wt 0.141 |
| C2 (central) | 0.9558 / 6.66 / 0.536 | 0.9535 / 6.99 / 0.512 | \lvert\Delta\rvert cos 0.002, dHash 0.33, wt 0.024 |
| C3 (high-cos / low-dHash) | 0.9826 / 2.41 / 0.321 | 0.9826 / 2.40 / 0.205 | \lvert\Delta\rvert cos 0.000, dHash 0.01, wt 0.117 |

(Source: Script 41; full-dataset \text{BIC}(K{=}3) = -792.31 vs Big-4 \text{BIC}(K{=}3) = -1111.93; BIC values are not directly comparable across different n and are reported only for completeness.)

Table XIX. Spearman rank correlation between K=3 P(C1) and deployed operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset.

| Scope | n CPAs | Spearman \rho (P(C1) vs deployed less-replication-dominated rate) | $p$-value |
| --- | --- | --- | --- |
| Big-4 (primary) | 437 | +0.9627 | < 10^{-248} |
| Full dataset | 686 | +0.9558 | < 10^{-300} |
| \lvert\rho_{\text{full}} - \rho_{\text{Big-4}}\rvert | | 0.0069 | |

(Source: Script 41.)

Reading. The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the deployed box-rule less-replication-dominated rate are preserved at the full scope. Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight 0.117 as the full population includes more non-templated CPAs (mid/small firms); C1 (low-cos / high-dHash) gains weight 0.141 and shifts to lower cosine and higher dHash (centre (0.928, 11.17) vs Big-4 (0.946, 9.17)) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + deployed-rule convergence is not a Big-4-specific artefact; we do not read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the primary methodology is restricted to Big-4 by design (§III-G item 4).

## L. Ablation Study: Feature Backbone Comparison

To support the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. The comparison summary is reported in the supplementary materials (backbone-ablation table; distinct from Table XIX of §IV-K, which reports the Big-4 vs full-dataset Spearman comparison).

EfficientNet-B0 achieves the highest Cohen's d (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions. However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), i.e., a wider descriptor dispersion per signature. VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.

ResNet-50 provides the best overall balance: (1) Cohen's d of 0.669 is competitive with EfficientNet-B0's 0.707; (2) its tighter distributions yield more stable descriptor behaviour at the per-signature level; (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that Firm A replication-dominated signatures are least likely to produce low-similarity outlier pairs under this backbone; and (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.
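
The descriptor comparison underlying every cosine number in this section reduces, after L2 normalisation, to a plain dot product; a minimal sketch (the backbone forward pass that produces the feature vectors is omitted):

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity(u, v):
    """Cosine similarity; on L2-normalised descriptors this is a dot product."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))
```

Storing descriptors pre-normalised is what keeps the 182K-signature pairwise computation a matrix of dot products rather than repeated norm evaluations.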

M. Anchor-Based ICCR Calibration Results

This section consolidates the empirical results that support the §III-L anchor-based threshold calibration framework.

M.1 Composition decomposition (Scripts 39b-39e)

Table XX. Within-firm and between-firm decomposition of the Big-4 accountant-level dip-test rejection.

| Diagnostic | Scope | Statistic | Implication |
|---|---|---|---|
| Within-firm signature-level cosine dip | Big-4 (4 firms) | $p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ | 0/4 firms reject; cosine within-firm unimodal |
| Within-firm signature-level cosine dip | non-Big-4 (10 firms $\geq 500$ sigs) | $p_{\text{cos}} \in [0.59, 0.99]$ | 0/10 firms reject; cosine within-firm unimodal |
| Within-firm jittered-dHash dip (5 seeds, median) | Big-4 (4 firms) | $p_{\text{med}} \in \{0.999, 0.996, 0.999, 0.9995\}$ | 0/4 firms reject after integer jitter; raw rejection was an integer-tie artefact |
| Big-4 pooled dHash: 2×2 factorial | firm-centred + jittered (5 seeds) | $p_{\text{med}} = 0.35$; 0/5 seeds reject | combined corrections eliminate the rejection; multimodality is composition + integer artefact |
| Integer-histogram valley near dHash $\approx 5$ | within each Big-4 firm | none (0/4 firms) | no within-firm dHash antimode at the deployed HC cutoff |

(Source: Scripts 39b, 39c, 39d, 39e; bootstrap n_{\text{boot}} = 2000; jitter \sim \mathrm{U}[-0.5, +0.5].)
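The integer-jitter correction in the table can be sketched as below; `dip_pvalue` is a placeholder for any Hartigan dip-test implementation (e.g. the `diptest` package) and is not part of the paper's code:

```python
import numpy as np

def jittered_median_p(dhash_ints, dip_pvalue, n_seeds=5):
    """Median dip-test p-value over integer-jitter seeds (sketch).

    Integer-valued dHash distances produce heavy ties that the dip test
    reads as spurious multimodality; U[-0.5, +0.5] jitter breaks the ties
    without moving any value across an integer boundary.
    """
    p_values = []
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        jittered = np.asarray(dhash_ints, float) + rng.uniform(-0.5, 0.5, len(dhash_ints))
        p_values.append(dip_pvalue(jittered))
    return float(np.median(p_values))
```

Reporting the median over five seeds, as in the table, guards against any single jitter draw accidentally creating or destroying a mode.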

M.2 Anchor-based inter-CPA pair-level ICCR (Script 40b)

Table XXI. Big-4 inter-CPA per-comparison ICCR sweep, n = 5 \times 10^5 pairs (Big-4 scope).

| Threshold | Per-comparison ICCR | 95% Wilson CI |
|---|---|---|
| cos > 0.945 (alternative operating point from supplementary calibration evidence) | 0.00081 | [0.00073, 0.00089] |
| cos > 0.95 (deployed operating point) | 0.00060 | [0.00053, 0.00067] |
| cos > 0.97 | 0.00024 | [0.00020, 0.00029] |
| cos > 0.98 | 0.00009 | [0.00007, 0.00012] |
| dHash $\leq$ 5 (deployed operating point) | 0.00129 | [0.00120, 0.00140] |
| dHash $\leq$ 4 | 0.00050 | [0.00044, 0.00057] |
| dHash $\leq$ 3 | 0.00019 | [0.00015, 0.00023] |
| Joint: cos > 0.95 AND dHash $\leq$ 5 (any-pair semantics) | 0.00014 | [0.00011, 0.00018] |
| Joint: cos > 0.95 AND dHash $\leq$ 4 (any-pair) | 0.00011 | [0.00008, 0.00014] |

Conditional ICCR(dHash \leq 5 | cos > 0.95) = 0.234 (Wilson 95% [0.190, 0.285]; 70 of 299 pairs).

The cos > 0.95 row is consistent with the corpus-wide spike of §IV-I (per-comparison rate 0.0005). The dHash row and joint row are reported here for the first time on this corpus.
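The Wilson intervals quoted throughout this section follow the standard closed form; for example, the conditional-ICCR row (70 of 299 pairs) reproduces as:

```python
import math

def wilson_ci(k, n, z=1.959964):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half) / denom, (centre + half) / denom

lo, hi = wilson_ci(70, 299)
# lo, hi round to 0.190 and 0.285, matching the interval reported above
```

Unlike the Wald interval, the Wilson form remains well-behaved at the very small proportions in Table XXI (rates near 10^-4).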

M.3 Pool-normalised per-signature ICCR (Script 43)

Table XXII. Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos > 0.95 AND dHash \leq 5); n_{\text{sig}} = 150{,}453 (vector-complete Big-4); CPA-block bootstrap n_{\text{boot}} = 1000.

| Scope | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|---|---|---|---|
| Big-4 pooled (any-pair, deployed) | 0.1102 | [0.1086, 0.1118] | [0.0908, 0.1330] |
| Big-4 pooled (same-pair, stricter alternative) | 0.0827 | [0.0813, 0.0841] | [0.0668, 0.1021] |
| Firm A (any-pair) | 0.2594 | | |
| Firm B (any-pair) | 0.0147 | | |
| Firm C (any-pair) | 0.0053 | | |
| Firm D (any-pair) | 0.0110 | | |
| Pool-size decile 1 (smallest pools), any-pair | 0.0249 | | |
| Pool-size decile 10 (largest pools), any-pair | 0.1905 | | |

The decile trend is broadly monotone in pool size, with two minor reversals (deciles 5 and 9 dip below their predecessors). The stricter operating point cos > 0.95 AND dHash $\leq$ 3 (same-pair) gives a per-signature ICCR of 0.0449.
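The CPA-block bootstrap behind the right-hand column of Table XXII resamples whole same-CPA pools rather than individual signatures, preserving within-CPA dependence. A minimal sketch (synthetic structure, not the paper's data):

```python
import numpy as np

def cpa_block_bootstrap_ci(hits_by_cpa, n_boot=1000, seed=0):
    """CPA-block bootstrap 95% CI for a pooled per-signature hit rate.

    hits_by_cpa: list of 0/1 arrays, one per CPA (that CPA's signatures).
    Whole CPAs are resampled with replacement, so the dependence between
    signatures sharing a candidate pool is kept intact.
    """
    rng = np.random.default_rng(seed)
    n_cpa = len(hits_by_cpa)
    rates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_cpa, size=n_cpa)
        resampled = np.concatenate([hits_by_cpa[i] for i in idx])
        rates.append(resampled.mean())
    return np.percentile(rates, [2.5, 97.5])
```

The CPA-bootstrap intervals in Table XXII are wider than the Wilson intervals precisely because the Wilson form treats each signature as independent, which within-CPA correlation violates.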

M.4 Document-level ICCR under three alarm definitions (Script 45)

Table XXIII. Document-level inter-CPA ICCR by alarm definition; n_{\text{docs}} = 75{,}233.

| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | 0.1797 | [0.1770, 0.1825] |
| D2 (operational) | HC + MC | 0.3375 | [0.3342, 0.3409] |
| D3 | HC + MC + HSC | 0.3384 | [0.3351, 0.3418] |

Per-firm D2 document-level ICCR: Firm A 0.6201 (n = 30{,}226); Firm B 0.1600 (n = 17{,}127); Firm C 0.1635 (n = 19{,}501); Firm D 0.0863 (n = 8{,}379). The Firm C denominator n = 19{,}501 exceeds Table XVI's single-firm Firm C count of 19{,}122 by exactly the 379 mixed-firm PDFs: all 379 are 1{:}1 Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (np.argmax over np.unique's alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full 75{,}233, whereas Table XVI's per-firm rows sum to 74{,}854 = 75{,}233 - 379.
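The tie-break behaviour described above follows directly from NumPy semantics: `np.unique` returns labels in sorted order and `np.argmax` returns the first index on ties, so a 1:1 Firm C / Firm D document resolves to the alphabetically first firm:

```python
import numpy as np

# A 1:1 mixed-firm document: two Firm C and two Firm D signatures.
firms = np.array(["Firm D", "Firm C", "Firm D", "Firm C"])

labels, counts = np.unique(firms, return_counts=True)  # labels sorted: C before D
mode_firm = labels[np.argmax(counts)]                  # first index wins the tie
# mode_firm == "Firm C"
```

This is why all 379 tied Firm C / Firm D documents land in the Firm C denominator rather than being split or dropped.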

M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)

Table XXIV. Logistic regression of per-signature any-pair HC hit indicator on firm dummies and centred log pool size (Firm A reference).

| Term | Odds ratio (vs Firm A) | Direction |
|---|---|---|
| Firm B | 0.053 | $\sim 19\times$ lower odds than Firm A |
| Firm C | 0.010 | $\sim 100\times$ lower odds than Firm A |
| Firm D | 0.027 | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | 4.01 | $\sim 4\times$ higher odds per log unit of pool size |

Per-decile per-firm rates (not duplicated here; the Script 44 decile table is available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$, while Firm A ranges over $0.0541$–$0.5958$. The firm gap survives within matched pool sizes.
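The Table XXIV specification (firm dummies with Firm A as reference, plus centred log pool size) can be sketched with a plain Newton-Raphson logistic fit; the data and coefficients below are synthetic, chosen only to mimic the direction of the reported effects:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8000
firm = rng.integers(0, 4, size=n)         # 0 = Firm A (reference), 1-3 = B/C/D
log_pool = rng.normal(0.0, 1.0, size=n)   # centred log pool size
X = np.column_stack([np.ones(n),
                     firm == 1, firm == 2, firm == 3,
                     log_pool]).astype(float)
beta_true = np.array([-1.0, -1.5, -2.0, -1.8, np.log(4.0)])  # synthetic
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def fit_logit(X, y, n_iter=25):
    """Newton-Raphson maximum likelihood for logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - mu)
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

odds_ratios = np.exp(fit_logit(X, y)[1:])  # B, C, D vs Firm A; then pool-size OR
```

Exponentiated coefficients are the odds ratios reported in Table XXIV: below 1 for the B/C/D dummies, above 1 for the pool-size term.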

Table XXV. Cross-firm hit matrix among Big-4 source signatures with any-pair HC hit; max-cosine partner firm (counts).

| Source firm | Firm A cand. | Firm B | Firm C | Firm D | non-Big-4 | n hits |
|---|---|---|---|---|---|---|
| Firm A | 14,447 | 95 | 44 | 19 | 17 | 14,622 |
| Firm B | 92 | 371 | 8 | 4 | 9 | 484 |
| Firm C | 16 | 7 | 149 | 5 | 1 | 178 |
| Firm D | 22 | 2 | 6 | 106 | 1 | 137 |

Same-pair joint hits (a single candidate satisfying both cos > 0.95 AND dHash $\leq$ 5) are within-firm at rates of 99.96% / 97.7% / 98.2% / 97.0% for Firms A/B/C/D respectively.
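The any-pair / same-pair distinction used throughout §IV-M reduces to where the conjunction is evaluated: across the candidate pool versus within a single candidate. A minimal sketch over one signature's per-candidate scores (synthetic values):

```python
import numpy as np

# Per-candidate scores for one source signature against its pool (synthetic).
cos = np.array([0.96, 0.80, 0.70])
dhash = np.array([12, 4, 20])

hc_cos = cos > 0.95
hc_dhash = dhash <= 5

any_pair = hc_cos.any() and hc_dhash.any()  # conditions may hold on different candidates
same_pair = (hc_cos & hc_dhash).any()       # both conditions must hold on one candidate
# Here any_pair is True (candidates 0 and 1) but same_pair is False.
```

Same-pair semantics is strictly stricter, which is why its pooled per-signature ICCR (0.0827) sits below the any-pair rate (0.1102) in Table XXII.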

M.6 Alert-rate sensitivity around deployed HC threshold (Script 46)

Table XXVI. Local-gradient / median-gradient ratio at deployed thresholds (descriptive plateau diagnostic).

| Threshold | Local / median gradient ratio | Interpretation |
|---|---|---|
| cos = 0.95 (HC) | $\approx 25\times$ | locally sensitive (not plateau-stable) |
| dHash = 5 (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
| dHash = 15 (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |

Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC = 0.4958; per-document HC = 0.6228. The deployed-rate excess over the inter-CPA proxy is 0.3856 (38.6 pp) per-signature and 0.4431 (44.3 pp) per-document; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate.
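The plateau diagnostic of Table XXVI can be sketched as the ratio of the local alert-rate gradient at the deployed threshold to the median gradient over the sweep; synthetic scores with mass concentrated near the cutoff reproduce the "locally sensitive" reading:

```python
import numpy as np

def gradient_ratio(scores, thresholds, deployed):
    """Local / median alert-rate gradient ratio at a deployed threshold."""
    rates = np.array([(scores > t).mean() for t in thresholds])
    grads = np.abs(np.gradient(rates, thresholds))
    i = int(np.argmin(np.abs(thresholds - deployed)))
    return grads[i] / np.median(grads)

rng = np.random.default_rng(0)
# Uniform background plus a spike of scores concentrated near the cutoff.
scores = np.concatenate([rng.uniform(0.5, 1.0, 20000),
                         np.clip(rng.normal(0.95, 0.01, 20000), 0.0, 1.0)])
ratio = gradient_ratio(scores, np.linspace(0.5, 1.0, 101), deployed=0.95)
# ratio >> 1: the alert rate moves sharply with small threshold changes
```

A ratio well above 1 (as at cos = 0.95 and dHash = 5) marks a threshold sitting on a steep part of the alert-rate curve; a ratio well below 1 (dHash = 15) marks a saturating plateau.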