Paper A v4.1: BCD-baseline reframe + screening positioning + trim

- Re-anchor inter-CPA coincidence-rate (ICCR) calibration on a normative
  non-Firm-A baseline (Firms B/C/D); Firm A held out as an out-of-sample
  target. Locked canonical numbers (codex-audited; Scripts 46/52/53):
  per-comparison HC 0.00014->0.000018, per-signature HC 0.0116, per-document
  HC+MC 0.34->0.1905; KDE crossover 0.837 retained corpus-wide.
- Reposition as an operator-tunable, semi-automated screening/triage framework
  (title -> "Automated Screening..."): HC = high-specificity operating point;
  MC band demoted to low-specificity advisory; Firm A = demonstration that the
  screening surfaces a templated end, audit-quality implications deferred.
- Apply codex prose-review fixes: triage-neutral five-way labels, soften
  mechanism/specificity wording, supersede MC claim-strength, update stale
  Appendix script references (40b/43/45 -> 46/52/53).
- Trim pass: compress Sec. V discussion + Sec. III echoes (27.7k -> 26.8k
  words); no substantive content removed.
- Add analysis scripts 45-53 (firm-year trends; BCD-only ICCR recompute;
  canonical-sampler locked numbers; Firm-A out-of-sample; BCD regression +
  cross-firm hit matrix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-04 19:35:10 +08:00
parent becce857e1
commit 3c7fcc010f
11 changed files with 1225 additions and 184 deletions
+137 -184
View File
@@ -1,10 +1,15 @@
---
title: "Automated Screening of Digitally Replicated Signatures in Large-Scale Financial Audit Reports"
author: "[Authors removed for double-blind review]"
---
# Abstract
<!-- IEEE Access target: <= 250 words, single paragraph -->
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports — through administrative stamping or firm-level electronic signing — thereby undermining individualized attestation. We build an end-to-end pipeline for screening such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (20132023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). The framework surfaces pronounced firm-level heterogeneity: Firm A's per-document HC+MC ICCR is $0.62$ versus $0.09$$0.16$ at Firms B/C/D (gap persists after pool-size adjustment), and $77$$99\%$ of inter-CPA collisions concentrate within the source firm. This contrast is consistent with firm-level template-like reuse but not independently diagnostic, since descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity within a firm; we report it as a scope limitation rather than a mechanism finding. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports, undermining individualized attestation. We build an end-to-end pipeline to screen *non-hand-signed* signatures: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash), separating *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (20132023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses cover the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Diagnostics show no within-population antimode anchors a threshold ($p=0.35$ after firm-mean centring and integer-tie jitter). We instead calibrate via an inter-CPA coincidence-rate (ICCR) anchored on a normative non-Firm-A baseline (Firms B/C/D), as Firm A's extreme within-firm collision structure would contaminate an all-firm anchor. On this clean baseline the high-confidence rule (cos$>0.95$, dHash$\leq 5$) has a very low inter-CPA coincidence rate (per-comparison ICCR $0.000018$; per-signature $0.012$; per-document $0.023$), whereas the moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.19$ per-document coincidence rate and is reported as advisory. Scored out-of-sample, Firm A coincides at baseline rate cross-firm yet fires the rule on $82\%$ of its own signatures ($\sim 70\times$ floor); its signal is within-firm. We read this as consistent with firm-level template-like reuse but not independently diagnostic: descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity. We position it as a specificity-proxy screening framework with human-in-the-loop review, not a validated forensic detector; no calibrated error rates are reportable without ground truth.
<!-- Word count: 247 -->
<!-- Word count: 250 (v4.1 BCD-baseline reframe) -->
# I. Introduction
@@ -23,11 +28,13 @@ Despite the significance of the problem for audit quality and regulatory oversig
In this paper we present a fully automated, end-to-end pipeline for screening non-hand-signed CPA signatures in audit reports at scale, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour under explicit unsupervised assumptions. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) disclosure of each diagnostic's untested assumption (§III-M).
We are deliberate about what the system claims. The operating thresholds are *operator-tunable* rather than asserted as ground-truth decision boundaries: the contribution is not a fixed detector that pronounces a signature non-hand-signed, but (a) a dual-descriptor design that separates *style consistency* from *image reproduction*, and (b) a methodology for choosing and characterising a screening operating point in the absence of labels, so that an operator can set a specificity target and read off what each setting yields. Operationally the framework is a semi-automated triage step that surfaces a tractable set of replication candidates from hundreds of thousands of signatures for human adjudication; it does not adjudicate. The firm-level results and the byte-identical capture check are reported as *demonstrations that this triage works at scale*, not as forensic determinations.
A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. After joint firm-mean centring and uniform integer-tie jitter, the pooled dHash dip-test rejection disappears ($p_{\text{median}} = 0.35$ across five seeds). Within-firm diagnostics in every Big-4 firm fail to reveal stable bimodal structure after accounting for integer ties; eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (§III-I.4). We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.
In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration. At the per-comparison unit, the cos$>0.95$ operating point yields ICCR $= 0.00060$ on a $5 \times 10^5$-pair Big-4 sample; the dHash$\leq 5$ structural cutoff yields ICCR $= 0.00129$; the joint rule cos$>0.95$ AND dHash$\leq 5$ yields joint ICCR $= 0.00014$ (any-pair semantics, matching the deployed extrema rule). At the pool-normalised per-signature unit, the same rule's effective coincidence rate is materially higher because the deployed classifier takes max-cosine and min-dHash over a same-CPA pool: pooled Big-4 any-pair ICCR is $0.1102$ (Wilson 95% CI $[0.1086, 0.1118]$; CPA-block bootstrap 95% $[0.0908, 0.1330]$). At the per-document unit, the operational HC$+$MC alarm fires on $33.75\%$ of Big-4 documents under the inter-CPA candidate-pool counterfactual.
In place of distributional anchoring, we adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D); §III-L.0 explains why an all-Big-4 negative anchor is partially circular — Firm A's extreme within-firm cross-CPA collision structure loads the all-firm pool with the very structure the rule targets. On this BCD baseline the deployed high-confidence rule (cos$>0.95$ AND dHash$\leq 5$) yields per-comparison ICCR $= 0.000018$ (versus $0.00014$ on the contaminated all-Big-4 pool), pool-normalised per-signature ICCR $= 0.0116$ (CPA-block bootstrap 95% $[0.0094, 0.0141]$), and per-document ICCR $= 0.023$ — roughly an order of magnitude below the all-Big-4 figures, confirming that the HC rule has a very low inter-CPA coincidence rate against an uncontaminated baseline. The moderate-confidence band (cos$>0.95$ AND $5 < \text{dHash} \leq 15$), by contrast, retains a per-document coincidence rate of $0.19$ even on the clean baseline (and rises slightly when Firm A is removed), so we treat HC as the specificity-anchored operating point and reposition the MC band as a low-specificity advisory tier rather than a confident non-hand-signed label. The cosine LH/UN crossover ($\text{cos} = 0.837$) is a corpus-wide descriptor-space landmark robust to baseline choice (it moves $\leq 0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes) and is retained corpus-wide.
The pooled per-signature and per-document rates conceal striking firm heterogeneity. A logistic regression of the per-signature hit indicator on firm dummies (Firm A reference) and centred log pool size yields odds ratios of $0.053$ (Firm B), $0.010$ (Firm C), and $0.027$ (Firm D) — Firms B/C/D are an order of magnitude below Firm A even after controlling for the pool-size confound. Cross-firm hit matrix analysis under the deployed any-pair rule shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (Table XXV; the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms). The pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour rather than to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels.
With Firm A treated as an out-of-sample target rather than a calibration input, the heterogeneity reads cleanly. Against the BCD floor (per-signature HC ICCR $0.0116$), the deployed rule fires on each firm's *actual* same-CPA pools far above the inter-CPA coincidence floor: Firm A at $0.82$ ($\sim 70\times$ floor), Firms B/C/D at $0.24$$0.35$ ($\sim 21$$30\times$). Firm A scored against the clean baseline coincides at only $0.0102$ — essentially the floor itself — so its elevation is entirely a within-firm phenomenon, not cross-firm distinctiveness. Two logistic regressions confirm Firm A is the singular extreme while the baseline is internally homogeneous: with Firm A as reference on the full Big-4 pool, odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D); restricted to the BCD baseline with Firm D as reference, the residual spread collapses to within $\sim 3.5\times$ (odds ratio $1.73$ for B, $0.49$ for C). Under the deployed any-pair rule, within-firm collision concentration is a *universal* Big-4 pattern — $98.8\%$ at Firm A and, on the clean BCD pool, $89$$97\%$ at Firms B/C/D (Table XXV) — consistent with firm-specific template, stamp, or document-production reuse, though not by itself diagnostic of deliberate sharing. The deployed five-way box rule defines a reproducible screening classifier; the calibration contribution is to characterise its multi-level inter-CPA coincidence behaviour, not to derive new thresholds. The high-confidence sub-rule (cos $> 0.95$ AND dHash $\leq 5$) and the advisory moderate-confidence sub-rule (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) are explicit decision rules whose calibrated false-positive and false-negative error rates remain unknown in the absence of signature-level labels.
Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score under §III-J's reading, not a mechanism cluster posterior), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. A conservative hard-positive subset for image replication is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.
@@ -37,15 +44,15 @@ The contributions of this paper are:
1. **Problem formulation.** We define non-hand-signing detection as distinct from signature forgery detection and frame it as a detection problem on intra-signer similarity distributions.
2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention after initial training.
2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention before the human-adjudication step.
3. **Dual-descriptor similarity.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash provides complementary evidence for screening cases where *style consistency* and *image reproduction* hypotheses diverge, and we support the backbone choice through a feature-backbone ablation.
4. **Composition decomposition does not support the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed high-confidence (HC) sub-rule and document-level HC$+$MC alarm derived from the five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
5. **Anchor-based multi-level ICCR calibration on a normative non-Firm-A baseline.** We characterise the deployed high-confidence (HC) sub-rule at three units of analysis against a clean Firms-B/C/D negative anchor (Firm A held out as an out-of-sample target to avoid circularity): per-comparison ICCR $0.000018$, pool-normalised per-signature ICCR $0.0116$, and per-document ICCR $0.023$ — each roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$). The moderate-confidence band (dHash$\leq 15$) retains a $\sim 0.19$ per-document coincidence rate on the clean baseline and is repositioned as a low-specificity advisory tier rather than a confident non-hand-signed label. Because the deployed thresholds are operator-tunable, the contribution is this label-free calibration methodology — a principled way to choose and characterise a screening operating point and the specificity it yields — rather than any specific threshold. We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-document D2 inter-CPA proxy ICCRs differ by an order of magnitude across firms (Firm A: $0.62$ versus Firms B/C/D: $0.09$$0.16$); a per-signature logistic regression of the any-pair HC hit indicator on firm dummies and centred log pool size confirms the firm gap persists after pool-size control. Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms); the pattern is consistent with but does not independently establish firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).
6. **Firm A as a singular out-of-sample extreme; universal within-firm collision concentration.** Against the clean BCD floor (per-signature HC ICCR $0.0116$), the deployed rule fires on each firm's own pools far above the inter-CPA coincidence floor (Firm A $0.82$, $\sim 70\times$; Firms B/C/D $0.24$$0.35$, $\sim 21$$30\times$), while Firm A scored cross-firm against the baseline coincides only at the floor ($0.0102$) — localising the repeatability signal to within-firm comparisons. Two logistic regressions (full-Big-4 with Firm A reference: odds ratios $0.053$/$0.010$/$0.027$ for B/C/D; BCD-only with Firm D reference: residual spread within $\sim 3.5\times$, odds ratios $1.73$/$0.49$ for B/C) show Firm A is the lone outlier while Firms B/C/D form an internally homogeneous baseline. Within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$$97\%$ at Firms B/C/D on the clean pool — consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).
7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.
@@ -170,7 +177,7 @@ REFERENCES for Related Work (full list in the References section):
## A. Pipeline Overview
We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
We propose a six-stage pipeline for large-scale screening of non-hand-signed auditor signatures in scanned financial documents.
Fig. 1 illustrates the overall architecture.
The pipeline takes as input a corpus of PDF audit reports and produces five-way operational screening labels (§III-H.1) whose behaviour is characterised by pixel-identity positive-anchor capture checks and inter-CPA coincidence-rate calibration (§III-L).
@@ -311,11 +318,11 @@ A1 is plausible for high-volume stamping or firm-level electronic signing workfl
Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA. The five labels below name regions of the descriptor space and are operational rule outputs, not validated ground-truth classes; the label names reflect the screening hypothesis associated with each region and are subject to the unsupervised-setting caveats of §III-M:
1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on image-similarity evidence consistent with replication; mechanism attribution remains subject to §III-M.
2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level similarity is strong; structural similarity is present but below the high-confidence cutoff.
3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration; the descriptor signature is operationally distinguished from HC/MC, but the underlying mechanism (within-CPA signing style, lossy image reproduction with structural drift, or a hybrid) is not resolved by descriptor data alone.
1. **High-confidence replication candidate (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on image-similarity evidence consistent with replication; this is the highest-priority triage bin for human review, and mechanism attribution remains subject to §III-M.
2. **Moderate-confidence advisory flag (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level similarity is strong but structural similarity is below the high-confidence cutoff; §III-L.3 shows this band carries low inter-CPA specificity even on the normative baseline, so it is a low-specificity advisory bin (review-workload-expanding) rather than a confident replication flag.
3. **High style-consistency flag (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration; the descriptor position is operationally distinguished from HC/MC, but the underlying mechanism (within-CPA signing style, lossy image reproduction with structural drift, or a hybrid) is not resolved by descriptor data alone.
4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$. The "Likely hand-signed" name reflects the screening hypothesis that low maximum same-CPA cosine similarity is more consistent with hand-signing variation than with image replication; the label is operational, not a verified hand-signed classification, since cross-year handwriting drift, scanner-workflow change, or template variant rotation within a CPA's reports can also yield a low max-cosine within a same-CPA pool.
5. **Low replication-similarity (LH):** Cosine $\leq 0.837$. The name reflects the screening hypothesis that low maximum same-CPA cosine similarity is more consistent with hand-signing variation than with image replication; it is an operational low-priority bin, not a verified hand-signed classification, since cross-year handwriting drift, scanner-workflow change, or template variant rotation within a CPA's reports can also yield a low max-cosine within a same-CPA pool.
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (see supplementary materials). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L).
@@ -362,11 +369,11 @@ Removing *both* the between-firm location shift *and* the integer mass points el
*Integer-histogram valleys (Script 39d).* A genuine within-firm dHash antimode would appear as a strict local minimum in the count histogram with deep relative depth. Within each of the four Big-4 firms, the dHash histogram on bins $0$$20$ exhibits no strict local minimum; the Big-4 pooled histogram exhibits one shallow valley at $\text{dHash} = 4$ with relative depth $0.021$ (a $2.1\%$ count drop). No valley near the deployed $\text{dHash} = 5$ operational boundary appears within any individual firm. The hypothesised dHash antimode near $\text{dHash} \approx 5$ is not empirically supported by the histogram analysis.
**5. Conclusion: no natural threshold from the descriptor distribution.** §III-I.4 jointly establishes that (a) the Big-4 accountant-level dip rejection is fully attributable to between-firm composition and integer mass-point artefacts; (b) within the Big-4 firms, the descriptor marginals at the signature level are unimodal once integer ties are broken (Scripts 39b, 39d); (c) eligible non-Big-4 checks provide corroborating raw-axis evidence on the cosine dimension (Script 39c) and corroborate the integer-mass-point reading of raw dHash, but are not used as calibration evidence for the deployed thresholds; and (d) no integer-histogram valley near the deployed $\text{dHash} = 5$ operational boundary exists within any Big-4 firm. The descriptor distributions therefore do not contain a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits of §III-I.2 and §III-J are retained as *descriptive partitions* that reflect firm-composition contrast, not as inferential evidence for two or three population modes. §III-L develops the anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode.
**5. Conclusion: no natural threshold from the descriptor distribution.** The four diagnostics jointly establish that the descriptor distributions contain no within-population bimodal antimode that could anchor an operational threshold: the Big-4 accountant-level dip rejection is fully attributable to between-firm composition and integer mass-point artefacts (the 2×2 factorial restores unimodality, $p_{\text{median}} = 0.35$), the within-firm signature-level marginals are unimodal once integer ties are broken, and no integer-histogram valley exists near the deployed $\text{dHash} = 5$ boundary in any Big-4 firm. The K=2 / K=3 mixtures (§III-J) are therefore *descriptive* firm-compositional partitions, not evidence for population modes, and §III-L develops an anchor-based calibration that does not require a distributional antimode.
## J. K=3 as a Descriptive Partition of Firm-Composition Contrast
This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes.** §III-I.4 demonstrates that the apparent multimodality of the accountant-level marginals is fully explained by between-firm location shifts and integer mass-point artefacts, leaving no residual evidence for two or three latent within-population mechanism classes. Neither mixture is used to assign signature-level or document-level labels in the primary analysis. The operational classifier of §III-H.1 is calibrated in §III-L via inter-CPA negative-anchor coincidence rates, not via mixture-derived antimodes.
This section develops the K=2 and K=3 Gaussian mixture fits and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes** (§III-I.4 shows the apparent multimodality is fully explained by between-firm location shifts and integer mass-point artefacts). Neither mixture is used to assign signature- or document-level labels in the primary analysis; the operational classifier of §III-H.1 is calibrated in §III-L via inter-CPA coincidence rates, not mixture-derived antimodes.
**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ (weight $0.689$) and $(0.983, 2.41)$ (weight $0.311$) (Script 34). $\text{BIC}(K{=}2) = -1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. We refer to the components by index rather than by mechanism labels, since §III-I.4 establishes that the K=2 separation is firm-compositional rather than mechanistic.
@@ -429,11 +436,7 @@ We read this as the strongest internal-consistency signal in the analysis: three
The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the five-way rule. This comparison checks only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly check the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, whose calibration and capture-rate evidence is reported in the supplementary materials and not regenerated on the Big-4 subset.
**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference:
- *K=2 LOOO is unstable.* The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$, against the report's $0.005$ across-fold stability tolerance (Script 36; pairwise fold range $0.0376$, from $0.9380$ to $0.9756$). When Firm A is held out, the fold rule classifies $171/171$ of held-out Firm A CPAs as templated; when any non-Firm-A Big-4 firm is held out, the fold rule classifies $0$ of the held-out firm's CPAs as templated. This pattern indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
- *K=3 LOOO is partially stable.* The C1 (low-cos / high-dHash) component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.023$ in mixture weight (Script 37). Hard-posterior membership remains composition-sensitive — observed absolute differences are $1.8$$12.8$ pp across the four folds, with the Firm C fold exceeding the report's $5$ pp viability bar; the report's own screening label is `P2_PARTIAL` ("K=3 is not predictively useful as an operational classifier"). We accordingly do not use K=3 hard-posterior membership as an operational label.
**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Developed in §III-J: the firm-level LOOO cross-validation shows K=2 is unstable (its boundary is essentially a Firm-A-versus-others separator), while K=3 has a reproducible C1 component shape ($\leq 0.005$ cosine drift across folds) but composition-sensitive hard-posterior membership (up to $12.8$ pp; labelled `P2_PARTIAL`), which is why K=3 hard membership is not used as an operational label.
**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one conservative hard-positive subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are a conservative hard-positive subset for image replication. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
@@ -447,123 +450,89 @@ We report each candidate check's *positive-anchor miss rate* — the fraction of
All three candidate scores correctly assign every byte-identical signature to the replicated class. We caution that for the box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them. The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* this check would be disqualified), not a sufficient validation of the classifier's behaviour on the non-byte-identical replicated population. The reverse-anchor cut here is chosen by prevalence calibration against the box rule's overall replicated rate ($49.58\%$ of Big-4 signatures); this is a documented limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
## L. Anchor-Based Threshold Calibration
## L. Anchor-Based Threshold Calibration on a Normative (Non-Firm-A) Baseline
The operational classifier defined in §III-H.1 is calibrated by characterising the deployed thresholds' inter-CPA pair-level negative-anchor coincidence behaviour and their pool-normalised per-signature and per-document alert behaviour, at multiple units of analysis. §III-I.4 establishes that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold; the K=3 mixture of §III-J is a descriptive firm-compositional partition, not a mechanism-cluster model. Throughout this section we report **inter-CPA coincidence rates** rather than "False Acceptance Rates"; we explain the terminological choice in §III-L.0.
The operational classifier defined in §III-H.1 is calibrated by characterising the deployed thresholds' inter-CPA pair-level negative-anchor coincidence behaviour and their pool-normalised per-signature and per-document alert behaviour, at multiple units of analysis. §III-I.4 establishes that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold; the K=3 mixture of §III-J is a descriptive firm-compositional partition, not a mechanism-cluster model. Throughout this section we report **inter-CPA coincidence rates (ICCR)** rather than "False Acceptance Rates"; we explain the terminological choice in §III-L.0. The calibration is anchored on a **normative non-Firm-A baseline (Firms B, C, D)** and treats Firm A as an out-of-sample target, for reasons developed in §III-L.0.
### L.0. Calibration methodology
**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine threshold's specificity-proxy behaviour at the inter-CPA pair level and the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for those sub-bands.
**Choice of negative-anchor pool.** A negative anchor must approximate a population in which the rule should *not* fire — independent CPAs whose signatures coincide only by chance. §III-L.4 shows that under the deployed rule, $98.8\%$ of Firm A's inter-CPA collisions fall on other Firm-A CPAs, and byte-level evidence (§IV-H, supplementary materials) confirms image-level reuse across $\sim 50$ Firm-A partners. Including Firm A in the negative-anchor pool therefore loads the "coincidence" rate with structured within-firm collisions, not chance coincidence — a circularity, since that collision structure is the phenomenon the rule targets. We adopt **Firms B/C/D (BCD) as the normative negative-anchor baseline** and report the all-Big-4 (ABCD) pool only as a contamination-comparison scope; Firm A enters as an **out-of-sample target** (§III-L.4), not as a calibration input. A still-broader baseline adding the eligible non-Big-4 firms (BCD+non-Big-4) is reported as a robustness scope.
**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine and structural ($\text{dHash} \leq 5$) thresholds' specificity-proxy behaviour at the inter-CPA pair level on the BCD baseline. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for those sub-bands. The cosine LH/UN crossover $\text{cos} = 0.837$ is a corpus-wide descriptor-space landmark (intra- vs inter-CPA cosine KDE crossover, §IV-C) robust to baseline choice — it moves by at most $0.012$ across the corpus-wide, BCD, and BCD+non-Big-4 scopes ($0.8367$, $0.8489$, $0.8302$) — so we retain the corpus-wide value and do not re-anchor it on BCD.
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:
- *Per comparison.* For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos $>$ cos\_threshold and / or dHash $\leq$ dHash\_threshold)? This is the conventional pairwise calibration unit in biometric verification. We report it for both the cosine and dHash dimensions, marginally and jointly (§III-L.1).
- *Per signature pool.* For a Big-4 source signature $s$ with same-CPA pool of size $n_{\text{pool}}(s)$, what is the probability that the deployed rule fires *under the counterfactual* of replacing the source's same-CPA pool with $n_{\text{pool}}(s)$ random non-same-CPA candidates? This addresses the standard concern that a per-pair rate computed on independent pairs is not the deployed-rule rate at the per-signature classifier level: the deployed rule takes max-cosine and min-dHash over a pool of size $n_{\text{pool}}(s)$, so its effective coincidence rate is approximately $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit (§III-L.2).
- *Per signature pool.* For a source signature $s$ in the baseline pool with same-CPA pool of size $n_{\text{pool}}(s)$, what is the probability that the deployed rule fires *under the counterfactual* of replacing the source's same-CPA pool with $n_{\text{pool}}(s)$ random non-same-CPA candidates drawn from the baseline pool? This addresses the standard concern that a per-pair rate computed on independent pairs is not the deployed-rule rate at the per-signature classifier level: the deployed rule takes max-cosine and min-dHash over a pool of size $n_{\text{pool}}(s)$, so its effective coincidence rate is approximately $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit (§III-L.2).
- *Per document.* For an audit report aggregated via the worst-case rule, what fraction of documents have at least one signature whose deployed pool-normalised rule fires under the same inter-CPA candidate-replacement counterfactual? This is the operational alarm-rate unit (§III-L.3).
**Any-pair vs same-pair semantics.** The deployed rule uses independent extrema: a signature satisfies the HC rule if $\max_{\text{pool}} \text{cos} > 0.95$ AND $\min_{\text{pool}} \text{dHash} \leq 5$, *not* if a single candidate in the pool satisfies both. We refer to this as the **any-pair** rule. A stricter alternative — the **same-pair** rule — requires a single candidate to satisfy both inequalities; the deployed rule is any-pair, but we report same-pair as a stricter alternative classifier where useful (§III-L.2, §III-L.4).
**Terminological note on "FAR".** The biometric-verification literature speaks of "False Acceptance Rate" (FAR) for a per-pair rate computed on independent inter-CPA pairs. We adopt **inter-CPA coincidence rate (ICCR)** as the metric name and *do not* use "FAR" in the manuscript prose, for two reasons: (a) FAR has a specific biometric-verification meaning that requires ground-truth negative labels (which the corpus does not provide at the signature level); (b) §III-L.4 shows that the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures. Reading "inter-CPA coincidence rate" as a *specificity proxy* under an explicitly disclosed assumption is faithful to the evidence; reading it as a true biometric FAR would overstate the evidence.
**Terminological note on "FAR".** The biometric-verification literature speaks of "False Acceptance Rate" (FAR) for a per-pair rate computed on independent inter-CPA pairs. We adopt **inter-CPA coincidence rate (ICCR)** as the metric name and *do not* use "FAR" in the manuscript prose, for two reasons: (a) FAR has a specific biometric-verification meaning that requires ground-truth negative labels (which the corpus does not provide at the signature level); (b) the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures, which is precisely why we move the anchor to the BCD baseline (§III-L.0). Even on the BCD baseline, reading "inter-CPA coincidence rate" as a *specificity proxy* under an explicitly disclosed assumption is faithful to the evidence; reading it as a true biometric FAR would overstate the evidence.
### L.1. Per-comparison inter-CPA coincidence rate (Script 40b)
### L.1. Per-comparison inter-CPA coincidence rate (Script 46)
We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from Big-4 signatures, computing for each pair the cosine similarity (feature dot product) and Hamming distance between the dHash byte vectors. Marginal and joint rates at threshold $k$ are reported with Wilson 95% confidence intervals (Script 40b).
We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from the baseline pool, computing for each pair the cosine similarity (feature dot product) and Hamming distance between the dHash byte vectors. Marginal and joint rates are reported with Wilson 95% confidence intervals (Script 46).
| Threshold | Per-comparison inter-CPA coincidence rate | 95% Wilson CI |
| Threshold | BCD baseline (primary) | All-Big-4 (contamination comparison) | BCD+non-Big-4 |
|---|---|---|---|
| Cosine $> 0.95$ | $0.00026$ | $0.00060$ | $0.00014$ |
| dHash $\leq 5$ | $0.00037$ | $0.00129$ | $0.00034$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000018}$ $\;[0.000009, 0.000034]$ | $0.000140$ $\;[0.000111, 0.000177]$ | $0.000004$ $\;[0.000001, 0.000015]$ |
On the normative BCD baseline the joint per-comparison coincidence rate for the deployed HC rule is $0.000018$ — roughly $8\times$ lower than the all-Big-4 rate ($0.000140$), and lower still when the non-Big-4 firms are added ($0.000004$). The all-Big-4 figure is inflated by Firm A's within-firm collision structure (§III-L.4): removing Firm A from the negative anchor strips out the structured reuse that an honest specificity proxy must exclude. The joint-rule hit count is small in absolute terms ($9$ of $5 \times 10^5$ pairs on the BCD pool), so we report the Wilson interval and treat the per-comparison joint rate as an order-of-magnitude specificity proxy rather than a precisely estimated rate; the well-powered per-signature and per-document units (§III-L.2, §III-L.3) carry the primary calibration weight. The all-Big-4 cos $> 0.95$ row remains consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I. On the all-Big-4 sample the conditional rate ICCR(dHash $\leq 5\mid$ cos $> 0.95$) is $0.234$, indicating that the structural dimension adds substantial per-comparison specificity beyond the cosine gate.
The per-comparison rate does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.
### L.2. Pool-normalised inter-CPA alert rate (Script 52)
The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \text{dHash}$ over the same-CPA pool of size $n_{\text{pool}}(s)$ for each signature $s$, so a per-comparison rate is not the rate at which the deployed classifier fires per signature. For each source signature $s$ we simulate one realisation of an inter-CPA candidate pool of the same size $n_{\text{pool}}(s)$, drawn uniformly from non-same-CPA signatures *in the baseline pool*, compute the deployed extrema and rule indicator, and aggregate (Script 52, canonical retry-loop sampler matching Scripts 43/45; CPA-block bootstrap 95% CIs on $n_{\text{boot}} = 1000$ replicates).
**Headline rate (deployed any-pair HC rule, cos $> 0.95$ AND dHash $\leq 5$):**
| Baseline pool | Per-signature HC ICCR | CPA-bootstrap 95% CI |
|---|---|---|
| Cosine $> 0.95$ | $0.00060$ | $[0.00053, 0.00067]$ |
| Cosine $> 0.945$ (alternative operating point from supplementary calibration evidence) | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| Cosine $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| dHash $\leq 2$ | $0.00006$ | $[0.00004, 0.00008]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | $[0.00011, 0.00018]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ |
| **BCD (primary)** | $\mathbf{0.0116}$ | $[0.0094, 0.0141]$ |
| All-Big-4 (contamination comparison) | $0.1102$ | $[0.0908, 0.1330]$ |
| BCD+non-Big-4 | $0.0083$ | $[0.0066, 0.0099]$ |
The cosine row at $\text{cos} > 0.95$ is consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I on a similarly-sized inter-CPA sample; the present $5 \times 10^5$-pair sample yields $0.00060$, within that precision. The dHash row and joint row are reported here for the first time; the corpus-wide spike did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.
On the normative BCD baseline the deployed HC rule's pool-normalised per-signature coincidence rate is $0.0116$ — an order of magnitude below the all-Big-4 figure of $0.1102$. The all-Big-4 figure is dominated by Firm A, whose signatures coincide with other Firm-A signatures at high rate; once Firm A is removed from both the source set and the candidate pool, the residual per-signature coincidence among independent normative-baseline CPAs is $\approx 1.2\%$. This is the specificity-proxy floor against which the deployed HC rule operates. The rate increases with pool size (the rule takes extrema over $n_{\text{pool}}$ candidates), consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence; the within-firm violation of that independence (§III-L.4) bounds how literally the closed form can be read. Stakeholders requiring a tighter specificity proxy can characterise alternative operating points (e.g., dHash $\leq 3$) by inverting the ICCR curve, with the unsupervised-setting caveats of §III-M.
The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.
### L.3. Document-level inter-CPA proxy alert rate (Script 52)
**Conditional inter-CPA coincidence rate.** A natural follow-up question is whether the dHash dimension provides marginal specificity beyond the cosine gate. For pairs with cos $> 0.95$, the conditional rate of dHash $\leq 5$ is $0.234$ (Wilson 95% CI $[0.190, 0.285]$; $70$ of $299$ pairs in the Big-4 sample). At cos $> 0.95$, dHash provides $\sim 4.3\times$ further per-comparison ICCR refinement (joint $0.00014$ vs cos-only $0.00060$).
Each document is classified by the worst-case rule over its constituent signatures (§III-H.1) under the same inter-CPA candidate-pool counterfactual as §III-L.2 (Script 52; dominant-firm document assignment). We report the high-confidence (HC, dHash $\leq 5$) and the HC+MC (dHash $\leq 15$) document-level alarm definitions:
The per-comparison rate is a useful *specificity-proxy calibration* for the deployed rule's pair-level behaviour. It does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.
### L.2. Pool-normalised inter-CPA alert rate (Script 43)
The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \text{dHash}$ over the same-CPA pool of size $n_{\text{pool}}(s)$ for each signature $s$. A per-comparison rate is therefore not the rate at which the deployed classifier fires per signature. To compute the per-signature inter-CPA-equivalent rate, for each Big-4 source signature $s$ we simulate one realisation of an inter-CPA candidate pool of the same size $n_{\text{pool}}(s)$, drawn uniformly from non-same-CPA signatures across all firms, compute the deployed extrema and rule indicator, and aggregate (Script 43; $n_{\text{sig}} = 150{,}453$ vector-complete in this analysis; CPA-block bootstrap 95% CIs reported below).
**Headline rates (deployed any-pair rule, HC = cos $> 0.95$ AND dHash $\leq 5$).** Wilson 95% CIs on the point estimate, CPA-block bootstrap 95% CI on $n_{\text{boot}} = 1000$ replicates:
| Rule semantics | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
| Alarm definition | BCD baseline (primary) | All-Big-4 | BCD+non-Big-4 |
|---|---|---|---|
| Any-pair (deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
| Same-pair (stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |
| HC (dHash $\leq 5$) | $\mathbf{0.0226}$ | $0.1797$ | $0.0163$ |
| HC + MC (dHash $\leq 15$) | $0.1905$ | $0.3375$ | $0.1467$ |
Per-firm any-pair rates (no bootstrap; descriptive):
**The HC and HC+MC bands behave very differently on a clean baseline, which sharpens the operating-point recommendation.** On the BCD baseline the per-document HC rate is $0.0226$ ($\sim 8\times$ below the all-Big-4 $0.1797$), confirming that the HC (dHash $\leq 5$) rule has a very low inter-CPA coincidence rate: a clean inter-CPA baseline almost never produces an HC document. The HC+MC (dHash $\leq 15$) rate, by contrast, remains high on the clean baseline — $0.1905$ per document — and the per-firm breakdown shows it does *not* fall when Firm A is removed. **We therefore treat the HC sub-rule (dHash $\leq 5$) as the specificity-anchored operating point and reposition the MC band ($5 < \text{dHash} \leq 15$) as a low-specificity advisory tier rather than a confident non-hand-signed screening label.** Roughly one normative-baseline document in five would coincidentally carry an HC+MC flag under random inter-CPA candidate replacement, so an HC+MC alarm is not by itself evidence of image reproduction.
| Firm | $n_{\text{sig}}$ | Any-pair ICCR | Same-pair ICCR |
|---|---|---|---|
| Firm A | $60{,}450$ | $0.2594$ | $0.2018$ |
| Firm B | $34{,}254$ | $0.0147$ | $0.0023$ |
| Firm C | $38{,}616$ | $0.0053$ | $0.0019$ |
| Firm D | $17{,}133$ | $0.0110$ | $0.0051$ |
Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $0.218$, Firm D $0.114$ — slightly *higher* than under the all-Big-4 pool (B $0.160$, C $0.163$, D $0.088$), because removing Firm A's idiosyncratic template leaves a candidate pool whose members resemble one another more closely at the coarse dHash $\leq 15$ scale. This is direct evidence that the MC band carries little inter-CPA specificity even among normative firms, corroborating its demotion to an advisory tier. The positioning of the operational system as a **screening framework with human-in-the-loop review**, not an autonomous forensic classifier, follows directly (§III-M).
**Pool-size decile dependence.** The deployed rule's pool-normalised rate is monotonically (broadly) increasing in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). This functional form is used as descriptive intuition for the broad monotone trend, not as an independence assumption used for estimation; the within-firm violation of inter-CPA independence (§III-L.4) bounds how literally the closed form can be read. Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$. The trend is broadly monotonic with two minor non-monotone reversals (decile 5 and decile 9 dip below their predecessors).
### L.4. Firm A as an out-of-sample target; firm heterogeneity (Scripts 49, 52, 44, 53)
**Threshold sensitivity at per-signature unit.** Tightening the HC rule from $\text{dHash} \leq 5$ to $\text{dHash} \leq 3$ (same-pair) reduces the per-signature ICCR from $0.0827$ to $0.0449$ (Big-4 pooled); tightening to $\text{dHash} \leq 4$ gives $0.0639$ (same-pair). A stricter operating point of dHash $\leq 3$ same-pair would correspond to a per-signature ICCR of $\approx 0.05$; the deployed HC any-pair rule with $\text{dHash} \leq 5$ corresponds to $\approx 0.11$. Stakeholders requiring a tighter specificity proxy could consider the dHash $\leq 3$ same-pair variant, with the unsupervised-setting caveats of §III-M.
With the calibration anchored on BCD, Firm A is scored as an out-of-sample target against the clean baseline. Three complementary readings establish that Firm A is the extreme case while keeping the inferential limits explicit.
### L.3. Document-level inter-CPA proxy alert rate (Script 45)
**(i) Observed deployed rate versus the clean floor.** The deployed HC rule fires on each firm's *actual* same-CPA pools at the following per-signature rates (observed, not counterfactual; Script 49), against the BCD specificity-proxy floor of $0.0116$ (§III-L.2):
The deployed worst-case aggregation classifies each document by the most-replication-consistent category among its constituent signatures (§III-H.1). Three operationally meaningful document-level alarm definitions are reported, each as the fraction of documents whose worst-case signature category falls in the alarm set under the same inter-CPA candidate-pool counterfactual as §III-L.2 (Script 45; $n_{\text{docs}} = 75{,}233$ Big-4 documents):
| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
| D2 | HC + MC ("any non-hand-signed screening label") | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |
Per-firm D2 document-level rates:
| Firm | $n_{\text{docs}}$ | D2 (HC + MC) ICCR |
| Firm | Observed per-signature HC rate | Multiple of BCD floor |
|---|---|---|
| Firm A | $30{,}226$ | $0.6201$ |
| Firm B | $17{,}127$ | $0.1600$ |
| Firm C | $19{,}501$ | $0.1635$ |
| Firm D | $8{,}379$ | $0.0863$ |
| Firm A | $0.817$ | $\sim 70\times$ |
| Firm B | $0.346$ | $\sim 30\times$ |
| Firm C | $0.238$ | $\sim 21\times$ |
| Firm D | $0.245$ | $\sim 21\times$ |
The document-level D2 rate of $33.75\%$ pooled over Big-4 is the most operationally relevant alarm-rate metric: it is the fraction of audit documents that would carry at least one signature flagged HC or MC under the counterfactual of inter-CPA candidate-pool replacement. The non-trivial per-document inter-CPA alarm rate (and its concentration in Firm A at $62\%$) motivates the positioning of the operational system as a **screening framework with human-in-the-loop review**, not as an autonomous forensic classifier (§III-M).
All four Big-4 firms fire the HC rule on their own pools far above the inter-CPA coincidence floor; Firm A is the extreme at $\sim 70\times$, roughly $2.4$$3.4\times$ the other Big-4 firms in absolute rate. We emphasise (and develop in §III-M) that this excess is *not* a true-positive rate: the floor is an inter-CPA coincidence rate, whereas a CPA who hand-signs consistently can also produce same-pool repeatability above it. The multiple is a framework-discriminative observation, not a measure of image reproduction.
### L.4. Firm heterogeneity (Script 44)
**(ii) Firm A against the clean baseline behaves like the floor — its signal is within-firm.** Scored as a true out-of-sample target (Firm A source signatures, candidate pool drawn from the clean BCD baseline, any-pair, Script 52), Firm A's per-signature HC coincidence rate is $0.0102$ — essentially identical to the BCD-internal floor of $0.0116$. Firm A's signatures are thus unremarkable when matched against *other firms'* signatures; the entire elevation in Firm A's observed rate ($0.817$) arises from matches against *other Firm-A* signatures, localising the repeatability signal to within-firm comparisons rather than cross-firm distinctiveness.
§III-L.2 and §III-L.3 report large per-firm variation in the deployed rule's pool-normalised behaviour: Firm A's any-pair per-signature ICCR is $0.2594$, an order of magnitude larger than Firm B's $0.0147$, Firm C's $0.0053$, Firm D's $0.0110$. A natural alternative explanation is the pool-size confound: Firm A's median pool size ($\sim 285$) is larger than other firms', and pool size monotonically (broadly) increases the per-signature rate (§III-L.2 decile trend). We test the firm-vs-pool confound with a logistic regression of the per-signature hit indicator (any-pair HC) on firm dummies (Firm A = reference) and centred log pool size (Script 44):
**(iii) Firm-effect regressions: Firm A singular, baseline homogeneous.** Two logistic regressions of the per-signature any-pair HC hit indicator on firm dummies and centred log pool size jointly establish that Firm A is the singular extreme while Firms B/C/D form an internally homogeneous baseline. On the full Big-4 pool with Firm A as reference (Script 44), the odds ratios are $0.053$ (B), $0.010$ (C), $0.027$ (D), with log-pool-size odds ratio $4.01$ — Firms B/C/D sit one to two orders of magnitude below Firm A after pool-size control. On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0116$), the residual firm spread collapses to within a factor of $\sim 3.5$: odds ratios $1.73$ (B), $0.49$ (C), log-pool-size odds ratio $3.29$. The normative-baseline firms are therefore comparable to one another, with Firm A the lone outlier — supporting treating B/C/D as a coherent calibration baseline and Firm A as an out-of-sample target. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm; cluster-robust inference is left as a robustness check.)
| Term | Odds ratio (vs Firm A) | Direction | Magnitude |
|---|---|---|---|
| Firm B | $0.053$ | $< 1$ | $\sim 19\times$ lower odds than Firm A |
| Firm C | $0.010$ | $< 1$ | $\sim 100\times$ lower odds than Firm A |
| Firm D | $0.027$ | $< 1$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $> 1$ | $\sim 4\times$ higher odds per unit log pool size |
**Cross-firm hit matrix: within-firm concentration is a universal Big-4 pattern.** Under the deployed any-pair rule, inter-CPA collisions concentrate within the source firm at every Big-4 firm. On the full Big-4 candidate pool, within-firm concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (same-pair $97.0$$99.96\%$; Table XXV). Restricting the candidate pool to the BCD baseline (Script 53) *raises* the within-firm concentration for B/C/D to $89.2$$97.2\%$ any-pair (Firm B $97.2\%$, Firm C $92.3\%$, Firm D $89.2\%$) and $98.5$$100\%$ same-pair — higher than on the full pool, because on the full pool some B/C/D collisions landed on Firm A's generically copy-like signatures; removing Firm A leaves each firm's collisions concentrated within itself. Within-firm collision concentration is therefore a universal Big-4 structural pattern, not a Firm-A peculiarity: Firm A is extreme in the *rate* at which the rule fires (reading (i)), but all four firms exhibit the same within-firm collision signature.
The Firm B/C/D odds ratios are very small after controlling for pool size, indicating that firm membership accounts for a large multiplicative effect on the per-signature rate that is *not* explained by pool size alone. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm, and naive standard errors would be unreliable under within-cluster correlation; a cluster-robust standard error analysis is left as a robustness check.)
The per-decile per-firm breakdown (Script 44) confirms the pattern: within every pool-size decile, Firms B/C/D have rates of $0.0006$$0.0358$, while Firm A's rate ranges $0.0541$$0.5958$ across deciles. The firm gap is large within matched pool sizes, not driven by pool composition.
**Cross-firm hit matrix.** Among Big-4 source signatures whose any-pair rule fires under the inter-CPA candidate-pool counterfactual, the candidate firm of the max-cosine partner is distributed as follows (Script 44):
| Source firm | Firm A candidate | Firm B | Firm C | Firm D | non-Big-4 | hits |
|---|---|---|---|---|---|---|
| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ |
| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ |
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |
For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$).
**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (supplementary materials; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder within-firm collision patterns at Firms B/C/D, whose mechanisms may include template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity (§V-H). We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where the within-firm collision concentration is the dominant pattern at all four Big-4 firms — most strongly at Firm A ($98.8\%$ any-pair, $99.96\%$ same-pair) and at materially lower but still majority levels at Firms B/C/D ($76.7$$83.7\%$ any-pair; $97.0$$98.2\%$ same-pair).
**Interpretation.** This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (§IV-H, supplementary materials) provides direct evidence of image-level reuse among Firm A signatures; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We report "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
### L.5. Alert-rate sensitivity around deployed thresholds (Script 46)
@@ -571,7 +540,7 @@ To test whether the deployed cosine threshold $0.95$ and dHash threshold $5$ coi
At the deployed HC operating point cos $> 0.95$ AND dHash $\leq 5$, the local gradient of the per-signature alert rate is substantially larger than the median gradient across the sweep (cosine: ratio $\approx 25\times$ at the $0.95$ point relative to median; dHash: ratio $\approx 3.8\times$ at the $5$ point relative to median; both Script 46). Reading these ratios descriptively, the deployed HC threshold is *locally sensitive* rather than plateau-stable: small threshold perturbations materially change the deployed alert rate (cosine sweep at dHash $\leq 5$ yields rates of $0.5091$ at cos $> 0.945$ vs $0.4789$ at cos $> 0.955$, a $3.0$ pp swing across a $0.01$ cosine perturbation; dHash sweep at cos $> 0.95$ yields rates of $0.4207$ at dHash $\leq 4$ vs $0.5639$ at dHash $\leq 6$, a $14.3$ pp swing across a single integer step). The local-gradient-to-median-gradient ratios are descriptive diagnostics, not formal plateau tests; the primary evidence for "no within-population bimodal antimode at these thresholds" comes from §III-I.4's composition decomposition, not from §III-L.5.
The MC/HSC boundary at dHash $= 15$, by contrast, *is* in a low-gradient region (ratio $\approx 0.08$ to the median); the plateau-like behaviour around dHash $= 15$ is corroborating evidence that the high-end structural threshold lies in a regime where the rule's alert rate is approximately saturated, consistent with the high-dHash tail behaviour expected once near-identical pairs have been exhausted. The §III-L.5 non-plateau / local-sensitivity finding therefore applies specifically to the HC cutoff (cos $= 0.95$, dHash $= 5$); the MC/HSC sub-band boundary at dHash $= 15$ exhibits the opposite behaviour and is plateau-like.
The MC/HSC boundary at dHash $= 15$, by contrast, *is* in a low-gradient region (ratio $\approx 0.08$ to the median); the plateau-like behaviour around dHash $= 15$ is corroborating evidence that the high-end structural threshold lies in a regime where the rule's alert rate is approximately saturated, consistent with the high-dHash tail behaviour expected once near-identical pairs have been exhausted. The §III-L.5 non-plateau / local-sensitivity finding therefore applies specifically to the HC cutoff (cos $= 0.95$, dHash $= 5$); the MC/HSC sub-band boundary at dHash $= 15$ exhibits the opposite behaviour and is plateau-like. The plateau at dHash $= 15$ — added alert yield without added inter-CPA specificity (§III-L.3) — reinforces the demotion of the MC band to an advisory tier.
We interpret the deployed HC thresholds as **specificity-anchored operating points** chosen for the specificity-vs-alert-yield tradeoff (§III-L.1), *not* as distributional antimodes. Alternative operating points on the tradeoff curve can be characterised by inverting the per-comparison or pool-normalised ICCR curves (§III-L.1, §III-L.2) at the preferred specificity target.
@@ -579,12 +548,12 @@ We interpret the deployed HC thresholds as **specificity-anchored operating poin
The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the deployed HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$).
The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised inter-CPA rate ($0.4958$ vs $0.1102$); the per-document observed-deployed rate is $\sim 3.5\times$ the pool-normalised inter-CPA D1 (HC) rate ($0.6228$ vs $0.1797$). We refer to this multiplicative gap as the **deployed-rate excess over the inter-CPA proxy**:
Read against the **normative BCD specificity-proxy floor** rather than the contaminated all-Big-4 rate, the observed-deployed excess is larger: the per-signature observed rate is $\sim 43\times$ the BCD floor ($0.4958$ vs $0.0116$), and the per-document HC observed rate is $\sim 28\times$ the BCD floor ($0.6228$ vs $0.0226$):
- Per-signature: $0.4958 - 0.1102 = 0.3856$ ($38.6$ pp excess)
- Per-document HC: $0.6228 - 0.1797 = 0.4431$ ($44.3$ pp excess)
- Per-signature: $0.4958 - 0.0116 = 0.4842$ ($48.4$ pp excess over the clean floor)
- Per-document HC: $0.6228 - 0.0226 = 0.6002$ ($60.0$ pp excess over the clean floor)
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as an *observed same-CPA-pool excess* — a quantity that exceeds what random inter-CPA candidate replacement would produce — whose mechanism is not identifiable from descriptor-only data (§III-M); we do not attribute it to within-CPA handwriting repeatability or to image replication without further evidence.
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits are developed in §III-M. The excess is best read as an *observed same-CPA-pool excess over the normative inter-CPA floor* — a quantity that far exceeds what random inter-CPA candidate replacement among normative firms would produce — whose mechanism is not identifiable from descriptor-only data (§III-M). Anchoring the floor on the clean BCD baseline sharpens this contrast (the all-Big-4 floor would understate it by absorbing Firm A's reuse), while leaving the §III-M caveat — that the floor is an inter-CPA coincidence rate, not an intra-CPA genuine-hand-signing rate — fully in force; we do not attribute the excess to within-CPA handwriting repeatability or to image replication without further evidence.
## M. Unsupervised Diagnostic Strategy and Limits
@@ -594,7 +563,7 @@ Each diagnostic reported in this paper therefore addresses one specific failure
**Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, supplementary materials) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.
**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature any-pair HC ICCR ($0.11$; Table XXII) and per-document HC+MC alarm-rate ICCR ($0.34$; Table XXIII) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.
**Scope of the present analysis.** The deployed screening rule is characterised at three units (per-comparison, per-signature, per-document) against the normative Firms-B/C/D inter-CPA negative anchor, with Firm A held out as an out-of-sample target (§III-L.0). The resulting rates (§III-L.1L.6) are specificity-proxy-anchored **alarm-yield** indicators, not true error rates: the HC rule has a very low BCD coincidence rate at every unit, the dHash $\leq 15$ MC band is a low-specificity advisory tier, and the per-firm heterogeneity is read against the clean floor (Firm A the rate-extreme, its signal within-firm). The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.
**Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and missed-replication risk (per false negative) — neither directly observable from corpus data.
@@ -677,7 +646,7 @@ This section reports the empirical evidence for §III-I's distributional diagnos
| Population | $n$ CPAs | $p_{\text{cos}}$ | $p_{\text{dHash}}$ | Interpretation |
|---|---|---|---|---|
| **Big-4 pooled (primary)** | 437 | $< 5 \times 10^{-4}$ | $< 5 \times 10^{-4}$ | reject unimodality on both axes |
| **Big-4 pooled (analysis scope; not the calibration anchor)** | 437 | $< 5 \times 10^{-4}$ | $< 5 \times 10^{-4}$ | reject unimodality on both axes |
| Firm A pooled alone | 171 | 0.992 | 0.924 | unimodal |
| Firms B + C + D pooled | 266 | 0.998 | 0.906 | unimodal |
| All non-Firm-A pooled | 515 | 0.998 | 0.907 | unimodal |
@@ -819,13 +788,13 @@ This section reports the five-way per-signature + document-level worst-case clas
| Category | Long name | $n$ signatures | % of classified |
|---|---|---|---|
| HC | High-confidence non-hand-signed | 74,593 | 49.58% |
| MC | Moderate-confidence non-hand-signed | 39,817 | 26.47% |
| HSC | High style consistency | 314 | 0.21% |
| HC | High-confidence replication candidate | 74,593 | 49.58% |
| MC | Moderate-confidence advisory flag | 39,817 | 26.47% |
| HSC | High style-consistency flag | 314 | 0.21% |
| UN | Uncertain | 35,480 | 23.58% |
| LH | Likely hand-signed | 238 | 0.16% |
| LH | Low replication-similarity | 238 | 0.16% |
(Source: Script 42; 11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded. The $150{,}442$ vs $150{,}453$ distinction — descriptor-complete vs vector-complete — recurs across §IV: descriptor-complete analyses (§IV-D through §IV-J, all using accountant-level aggregates or per-signature category counts derived from the same 150,442-signature substrate) use $n = 150{,}442$; vector- or pair-recomputed analyses (§IV-M.2 Table XXI, §IV-M.3 Table XXII, §IV-M.5 Tables XXIVXXV; Scripts 40b, 43, 44) use $n = 150{,}453$ because their pair- or pool-level computations load all vector-complete signatures including those failing the descriptor-complete filter. See §III-G for the sample-size reconciliation.)
(Source: Script 42; 11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded. The $150{,}442$ vs $150{,}453$ distinction — descriptor-complete vs vector-complete — recurs across §IV: descriptor-complete analyses (§IV-D through §IV-J, all using accountant-level aggregates or per-signature category counts derived from the same 150,442-signature substrate) use $n = 150{,}442$; vector- or pair-recomputed analyses (§IV-M.2 Table XXI, §IV-M.3 Table XXII, §IV-M.5 Tables XXIVXXV; Scripts 46, 52, 44, 53) use $n = 150{,}453$ because their pair- or pool-level computations load all vector-complete signatures including those failing the descriptor-complete filter. See §III-G for the sample-size reconciliation.)
**Per-firm five-way breakdown (% within firm).**
@@ -844,11 +813,11 @@ This section reports the five-way per-signature + document-level worst-case clas
| Category | Long name | $n$ documents | % |
|---|---|---|---|
| HC | High-confidence non-hand-signed | 46,857 | 62.28% |
| MC | Moderate-confidence non-hand-signed | 19,667 | 26.14% |
| HSC | High style consistency | 167 | 0.22% |
| HC | High-confidence replication candidate | 46,857 | 62.28% |
| MC | Moderate-confidence advisory flag | 19,667 | 26.14% |
| HSC | High style-consistency flag | 167 | 0.22% |
| UN | Uncertain | 8,524 | 11.33% |
| LH | Likely hand-signed | 18 | 0.02% |
| LH | Low replication-similarity | 18 | 0.02% |
(Source: Script 42 document-level table; 379 of 75,233 PDFs carried signatures from more than one Big-4 firm and are reported in the single-firm-PDF per-firm breakdown of the script CSV but pooled into the overall counts here.)
@@ -863,7 +832,7 @@ This section reports the five-way per-signature + document-level worst-case clas
(Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)
The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains its prior calibration (supplementary materials); it is **not separately re-characterised by Scripts 3840**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The capture-rate calibration evidence for the moderate band is reported in the supplementary materials and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
The five-way **moderate-confidence advisory** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains the threshold provenance of its prior calibration (supplementary materials), but §III-L.3 **supersedes its claim strength**: on the normative BCD baseline this band carries a $\sim 0.19$ per-document inter-CPA coincidence rate, so it is a low-specificity advisory (review-workload-expanding) bin, not calibrated evidence of replication. It is **not separately re-characterised by Scripts 3840**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively only. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
**Table XVII.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.
@@ -874,7 +843,7 @@ The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 <
| Firm C | 102 | 24 | 77 | 1 | $23.53\%$ | $0.98\%$ |
| Firm D | 52 | 6 | 45 | 1 | $11.54\%$ | $1.92\%$ |
(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).
(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature screening rule (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).
**Document-level worst-case aggregation outputs are reported in Table XVI above.**
@@ -902,7 +871,7 @@ This section reports the reproducibility cross-check at the full accountant scop
(Source: Script 41.)
**Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the deployed box-rule less-replication-dominated rate are preserved at the full scope. Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more non-templated CPAs (mid/small firms); C1 (low-cos / high-dHash) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + deployed-rule convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the primary methodology is restricted to Big-4 by design (§III-G item 4).
**Reading.** The K=3 component ordering and the strong Spearman convergence between K=3 P(C1) and the deployed box-rule less-replication-dominated rate are preserved at the full scope. Component centres shift modestly: C3 (high-cos / low-dHash) is essentially unchanged in centre but loses weight $0.117$ as the full population includes more CPAs in the low-cos / high-dHash descriptor region (mid/small firms); C1 (low-cos / high-dHash) gains weight $0.141$ and shifts to lower cosine and higher dHash (centre $(0.928, 11.17)$ vs Big-4 $(0.946, 9.17)$) as the broader population includes mid/small-firm CPAs landing toward the low-cos / high-dHash region that the Big-4-primary scope deliberately excludes. We read this as evidence that the Big-4-primary K=3 + deployed-rule convergence is not a Big-4-specific artefact; we do **not** read it as an endorsement of using full-dataset K=3 component centres or operational thresholds in place of the Big-4-primary analysis. Mid/small-firm composition shifts the component centres meaningfully and the primary methodology is restricted to Big-4 by design (§III-G item 4).
## L. Ablation Study: Feature Backbone Comparison
@@ -957,54 +926,40 @@ This section consolidates the empirical results that support the §III-L anchor-
(Source: Scripts 39b, 39c, 39d, 39e; bootstrap $n_{\text{boot}} = 2000$; jitter $\sim \mathrm{U}[-0.5, +0.5]$.)
### M.2 Anchor-based inter-CPA pair-level ICCR (Script 40b)
### M.2 Anchor-based inter-CPA pair-level ICCR (Script 46)
**Table XXI.** Big-4 inter-CPA per-comparison ICCR sweep, $n = 5 \times 10^5$ pairs (Big-4 scope).
**Table XXI.** Inter-CPA per-comparison ICCR by negative-anchor pool, $n = 5 \times 10^5$ pairs each.
| Threshold | Per-comparison ICCR | 95% Wilson CI |
| Threshold | BCD (primary) | All-Big-4 (contamination comparison) | BCD+non-Big-4 |
|---|---|---|---|
| cos $> 0.95$ | $0.00026$ | $0.00060$ | $0.00014$ |
| dHash $\leq 5$ | $0.00037$ | $0.00129$ | $0.00034$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair) | $\mathbf{0.000018}$ | $0.000140$ | $0.000004$ |
BCD joint Wilson 95% $[0.000009, 0.000034]$ ($9$ of $5 \times 10^5$ pairs); all-Big-4 joint $[0.000111, 0.000177]$. Removing Firm A from the negative anchor lowers the joint HC coincidence rate by $\sim 8\times$, confirming that the all-Big-4 rate is inflated by Firm A's within-firm template reuse (§III-L.4). On the all-Big-4 sample, conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$; the all-Big-4 cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I ($0.0005$).
### M.3 Pool-normalised per-signature ICCR (Script 52)
**Table XXII.** Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos $> 0.95$ AND dHash $\leq 5$) by negative-anchor pool; canonical retry-loop sampler; CPA-block bootstrap $n_{\text{boot}} = 1000$.
| Baseline pool | Per-signature HC ICCR | CPA-bootstrap 95% CI |
|---|---|---|
| cos $> 0.945$ (alternative operating point from supplementary calibration evidence) | $0.00081$ | $[0.00073, 0.00089]$ |
| cos $> 0.95$ (deployed operating point) | $0.00060$ | $[0.00053, 0.00067]$ |
| cos $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| cos $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ (deployed operating point) | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | $[0.00011, 0.00018]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ |
| BCD (primary) | $\mathbf{0.0116}$ | $[0.0094, 0.0141]$ |
| All-Big-4 (contamination comparison) | $0.1102$ | $[0.0908, 0.1330]$ |
| BCD+non-Big-4 | $0.0083$ | $[0.0066, 0.0099]$ |
Conditional ICCR(dHash $\leq 5$ | cos $> 0.95$) $= 0.234$ (Wilson 95% $[0.190, 0.285]$; $70$ of $299$ pairs).
The BCD floor is an order of magnitude below the all-Big-4 figure, which is dominated by Firm A's within-firm coincidences. The per-signature rate increases with pool size (the deployed rule takes extrema over $n_{\text{pool}}$ candidates); the per-document HC+MC band on the clean baseline (Table XXIII) does not show the same collapse, because the dHash $\leq 15$ band carries little inter-CPA specificity even among normative firms.
The cos $> 0.95$ row is consistent with the corpus-wide spike of §IV-I (per-comparison rate $0.0005$). The dHash row and joint row are reported here for the first time on this corpus.
### M.4 Document-level ICCR by alarm definition and pool (Script 52)
### M.3 Pool-normalised per-signature ICCR (Script 43)
**Table XXIII.** Document-level inter-CPA ICCR by alarm definition and negative-anchor pool (dominant-firm document assignment).
**Table XXII.** Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos $> 0.95$ AND dHash $\leq 5$); $n_{\text{sig}} = 150{,}453$ (vector-complete Big-4); CPA-block bootstrap $n_{\text{boot}} = 1000$.
| Scope | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
| Alarm definition | BCD (primary) | All-Big-4 | BCD+non-Big-4 |
|---|---|---|---|
| Big-4 pooled (any-pair, deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
| Big-4 pooled (same-pair, stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |
| Firm A (any-pair) | $0.2594$ | — | — |
| Firm B (any-pair) | $0.0147$ | — | — |
| Firm C (any-pair) | $0.0053$ | — | — |
| Firm D (any-pair) | $0.0110$ | — | — |
| Pool-size decile 1 (smallest pools) any-pair | $0.0249$ | — | — |
| Pool-size decile 10 (largest pools) any-pair | $0.1905$ | — | — |
| HC (dHash $\leq 5$) | $\mathbf{0.0226}$ | $0.1797$ | $0.0163$ |
| HC + MC (dHash $\leq 15$) | $0.1905$ | $0.3375$ | $0.1467$ |
Decile trend is broadly monotone in pool size with two minor reversals (decile 5 and decile 9 dip below their predecessors). Stricter operating point cos $> 0.95$ AND dHash $\leq 3$ (same-pair) gives per-signature ICCR $0.0449$.
### M.4 Document-level ICCR under three alarm definitions (Script 45)
**Table XXIII.** Document-level inter-CPA ICCR by alarm definition; $n_{\text{docs}} = 75{,}233$.
| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
| D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |
Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XVI's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs (all $379$ are $1{:}1$ Firm C / Firm D documents that an alphabetically-ordered tie-break assigns to Firm C; full implementation detail in the supplementary materials). The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XVI's per-firm rows sum to $74{,}854 = 75{,}233 - 379$.
Per-firm per-document HC+MC ICCR on the BCD baseline is Firm B $0.197$, Firm C $0.218$, Firm D $0.114$ (all-Big-4 pool: Firm A $0.620$, Firm B $0.160$, Firm C $0.163$, Firm D $0.088$). The HC band collapses by $\sim 8\times$ when Firm A is removed from the anchor (high specificity), whereas the HC+MC band is essentially unchanged — slightly higher for B/C/D — confirming that dHash $\leq 15$ adds alert yield without inter-CPA specificity and motivating the MC band's repositioning as an advisory tier (§III-L.3).
### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)
@@ -1017,6 +972,8 @@ Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.160
| Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size |
On the BCD baseline with Firm D as reference (Script 53; $n = 89{,}994$, hit rate $0.0116$), the residual firm spread collapses to within $\sim 3.5\times$ — odds ratios $1.73$ (Firm B), $0.49$ (Firm C), log-pool-size $3.29$ — confirming that Firm A is the singular outlier while Firms B/C/D form an internally homogeneous baseline (§III-L.4).
Per-decile per-firm rates (Table not duplicated here; Script 44 decile table available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$$0.0358$ while Firm A ranges $0.0541$$0.5958$. The firm gap survives within matched pool sizes.
**Table XXV.** Cross-firm hit matrix among Big-4 source signatures with any-pair HC hit; max-cosine partner firm (counts).
@@ -1028,7 +985,7 @@ Per-decile per-firm rates (Table not duplicated here; Script 44 decile table ava
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |
Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively.
Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively. Restricting the candidate pool to the BCD baseline (Script 53) raises Firms B/C/D within-firm any-pair concentration to $97.2\%$ / $92.3\%$ / $89.2\%$ (same-pair $100\%$ / $100\%$ / $98.5\%$): within-firm concentration is a universal Big-4 pattern, not a Firm-A peculiarity (§III-L.4).
### M.6 Alert-rate sensitivity around deployed HC threshold (Script 46)
@@ -1040,7 +997,7 @@ Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\
| dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
| dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ ($38.6$ pp) per-signature and $0.4431$ ($44.3$ pp) per-document; this excess is reported as an observed same-CPA-pool excess under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. Against the normative BCD floor (per-signature $0.0116$; per-document HC $0.0226$), the observed same-CPA-pool excess is $0.4842$ ($48.4$ pp, $\sim 43\times$) per-signature and $0.6002$ ($60.0$ pp, $\sim 28\times$) per-document; this excess is reported under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.
# V. Discussion
@@ -1051,33 +1008,29 @@ Non-hand-signing differs from forgery in that the questioned signature is produc
## B. Per-Signature Similarity is a Continuous Quality Spectrum; the Accountant-Level Multimodality is Composition-Driven
The Big-4 accountant-level descriptor distribution rejects unimodality on both marginals at $p < 5 \times 10^{-4}$ (§IV-D Table V). The composition decomposition of §III-I.4 shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within the Big-4 firms, the descriptor marginals at the signature level are unimodal once integer ties are broken (Scripts 39b, 39d); eligible non-Big-4 firms provide corroborating raw-axis evidence on the cosine dimension (Script 39c) but are not used as calibration evidence (§III-I.4). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.
The Big-4 accountant-level distribution rejects unimodality on both marginals (§IV-D), but §III-I.4 shows this is fully attributable to between-firm location shifts and integer mass-point artefacts, not within-population structure: under joint firm-mean centring and integer-tie jitter the dip test no longer rejects ($p_{\text{median}} = 0.35$), and within each Big-4 firm the signature-level marginals are unimodal once integer ties are broken. The distributions therefore contain no within-population bimodal antimode to anchor a threshold, and per-signature similarity is best read as a continuous quality spectrum rather than two discrete populations. The K=2 / K=3 fits are descriptive firm-compositional partitions (§III-J), not latent mechanism classes.
## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.
We treat Firm A as a *templated-end case study within the Big-4 sub-corpus* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms). Firm A's per-document D2 inter-CPA proxy ICCR of $0.6201$ (versus Firms B/C/D's $0.09$$0.16$) — the counterfactual rate at which Firm A documents would fire HC$+$MC if same-CPA pools were replaced by random inter-CPA candidates — reflects high inter-CPA collision concentration under the deployed rule, consistent with firm-specific template, stamp, or document-production reuse. (The corresponding observed rate on real same-CPA pools, from Table XVI, is substantially higher: $97.5\%$ HC$+$MC for Firm A; the proxy and observed rates measure different quantities and are not directly comparable.) The inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the within-firm collision pattern at all four Big-4 firms is consistent with similar, milder within-firm collision patterns at Firms B/C/D, whose mechanisms may include template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity (§V-H).
We treat Firm A as a *templated-end case study* and, in the calibration, as an **out-of-sample target** scored against the normative Firms-B/C/D baseline rather than as a calibration input (§III-L.0). Three readings (§III-L.4) make Firm A's status precise. First, scored against the clean BCD baseline, Firm A's signatures coincide at only $0.0102$ — essentially the BCD floor ($0.0116$) — so Firm A is unremarkable *cross-firm*; its signal is entirely within-firm. Second, on its own same-CPA pools the deployed HC rule fires on $0.82$ of Firm A signatures, $\sim 70\times$ the clean floor, versus $\sim 21$$30\times$ for Firms B/C/D — Firm A is the rate-extreme, but every Big-4 firm sits far above the floor. Third, within-firm collision concentration is universal: $98.8\%$ at Firm A and, on the clean BCD pool, $89$$97\%$ at Firms B/C/D, with same-pair concentration $97$$100\%$ across all four firms. The firm contrast is sharpest and most defensible in the high-confidence bin (the observed per-signature HC rates above); the per-document HC+MC proxy ICCR of $0.62$ at Firm A versus $0.09$$0.16$ at Firms B/C/D is reported only as advisory review burden, since the MC band carries low inter-CPA specificity even on the normative baseline (§III-L.3). None of this is by itself diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures, consistent with a firm-level template or production workflow; the milder within-firm patterns at Firms B/C/D may reflect template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H). We present Firm A as a *demonstration that the screening surfaces a known templated end at scale* — corroborated by the byte-identical capture check (§IV-H) — not as a forensic determination about the firm. Whether firm-level signing patterns bear on audit quality is a question for a dedicated companion study (§VI), beyond what descriptor-only screening can establish.
## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
Leave-one-firm-out cross-validation of the Big-4 mixture fit reveals a sharp contrast between K=2 and K=3 behaviour. K=2 is unstable: across-fold cosine-crossing deviation is $0.028$, and holding Firm A out gives a fold rule (cos $> 0.938$, dHash $\leq 8.79$) that classifies $100\%$ of held-out Firm A in the upper component, while holding any non-Firm-A Big-4 firm out gives a fold rule near (cos $> 0.975$, dHash $\leq 3.76$) that classifies $0\%$ of the held-out firm in the upper component. The K=2 boundary is essentially a Firm-A-vs-others separator — direct evidence that the K=2 partition reflects firm-compositional rather than mechanistic structure.
K=3 in contrast has a *reproducible component shape* at the descriptor-position level: across the four folds the C1 (low-cos / high-dHash) component cosine mean varies by at most $0.005$, the dHash mean by at most $0.96$, and the weight by at most $0.023$. Hard-posterior membership for the held-out firm is composition-sensitive (absolute differences $1.8$$12.8$ pp across folds). Together with the §III-I.4 composition decomposition (no within-population bimodal antimode), the K=3 stability supports a descriptive reading: the Big-4 descriptor plane has a reproducible three-region partition that reflects how firm-compositional weight is distributed across the descriptor space, *not* a three-mechanism latent-class structure. We accordingly do not use K=3 hard-posterior membership as an operational classifier; we use it as the accountant-level descriptive summary that complements the deployed signature-level five-way classifier of §III-H.1.
Leave-one-firm-out cross-validation (§III-J) sharply separates the two fits. K=2 is unstable — its boundary is essentially a Firm-A-versus-others separator (holding Firm A out gives a markedly looser fold rule than holding any other firm out), direct evidence that it reflects firm composition, not mechanism. K=3, by contrast, has a reproducible component shape across folds (the C1 cosine mean varies by $\leq 0.005$), though hard-posterior membership remains composition-sensitive. We therefore read K=3 as a reproducible three-region descriptor partition reflecting how firm-compositional weight is distributed across the descriptor plane, not a three-mechanism latent structure, and use it only as an accountant-level descriptive summary — never as operational classifier output.
## E. Three-Score Convergent Internal-Consistency
Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score, not a mechanism cluster posterior); the reverse-anchor cosine percentile under a non-Big-4 reference distribution; and the deployed box-rule less-replication-dominated rate. The three scores are *not* statistically independent measurements — they are deterministic functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency rather than external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The strength of the convergence (all pairwise $|\rho| > 0.87$) and its persistence at the signature level (Cohen $\kappa = 0.87$ between per-CPA-fit and per-signature-fit K=3 binary labels) are nevertheless informative: per-CPA aggregation does not collapse the broad three-region ordering, and three different summarisations of the descriptor space produce broadly concordant per-CPA rankings, with a residual non-Firm-A disagreement (the reverse-anchor cosine percentile ranks Firm D fractionally above Firm C, while the mixture posterior and the deployed box-rule rate rank Firm C highest among non-Firm-A firms).
Three feature-derived per-CPA scores — the K=3 firm-compositional position, the reverse-anchor cosine percentile against a non-Big-4 reference, and the deployed box-rule rate — agree on the per-CPA ranking at Spearman $\rho \geq 0.879$, with agreement persisting at the signature level (Cohen $\kappa = 0.87$). Because the three are deterministic functions of the same descriptor pair, we report this as internal consistency, not external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The only material disagreement is a Firm C / Firm D swap among the non-Firm-A firms.
## F. Anchor-Based Multi-Level Calibration
The operational specificity-proxy behaviour of the deployed HC sub-rule and document-level HC$+$MC alarm derived from the five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR is consistent with the corpus-wide rate reported in §IV-I (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is $0.234$ (Wilson 95% $[0.190, 0.285]$): given the cosine gate, the structural dimension provides further per-comparison specificity at $\sim 4.3\times$ refinement. Second, the alert-rate sensitivity analysis (§III-L.5) shows the deployed HC threshold is locally sensitive rather than plateau-stable (local gradient $\approx 25\times$ the median for cosine, $\approx 3.8\times$ for dHash); alternative operating points can be characterised by inverting the ICCR curves (e.g., a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint corresponds to per-signature ICCR $\approx 0.045$). The MC/HSC sub-band boundary at dHash$=15$, by contrast, *is* plateau-like (local-to-median ratio $\approx 0.08$), consistent with high-dHash-tail saturation.
The deployed HC sub-rule's specificity-proxy behaviour is characterised at three units against the normative BCD baseline (§III-L), with the all-Big-4 pool shown only as a contamination comparison. At every unit the HC inter-CPA coincidence rate is an order of magnitude below the all-Big-4 figure — the gap being Firm A's extreme within-firm collision structure (§III-L.4) — confirming HC as a high-specificity-proxy operating point. Because the deployed rule takes pool extrema, the per-comparison rate understates the per-signature rate (the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ pool effect). The HC threshold is locally sensitive rather than plateau-stable (§III-L.5), so it is a specificity-anchored operating *choice*, not a distributional antimode; operators can select alternative points by inverting the ICCR curves (§III-L.2). The dHash$\leq 15$ MC band stays a low-specificity advisory tier even on the clean baseline (§III-L.3).
## G. Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor
The only conservative hard-positive subset in the corpus is pixel-identical signatures: those whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce byte-identical images, so these signatures are a conservative hard-positive subset for image replication. On the Big-4 subset ($n = 262$ pixel-identical signatures), all three candidate checks — the deployed box rule, the K=3 hard label, and the reverse-anchor metric with a prevalence-calibrated cut — achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). We caution that this result is necessary but not sufficient: for the deployed box rule it is close to tautological, because byte-identical neighbours have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region. The corresponding signature-level *negative* anchor evidence is developed in §III-L.1 above (per-comparison ICCR $= 0.00060$ at cos$>0.95$, consistent with the corpus-wide rate of $0.0005$ reported in §IV-I). We frame the per-comparison rate as a specificity proxy under the assumption that inter-CPA pairs constitute a clean negative anchor, and we document in §III-L.4 that this assumption is partially violated by within-firm cross-CPA template-like collision structures.
The only conservative hard-positive subset is pixel-identical (byte-identical) signatures, which independent hand-signing cannot produce. All three candidate checks achieve $0\%$ positive-anchor miss on the 262 Big-4 byte-identical signatures (§IV-H) — a necessary check, though close to tautological for the box rule (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the HC region). The complementary negative anchor is the §III-L.1 per-comparison ICCR on the normative BCD baseline ($0.000018$); we frame it as a specificity proxy, and because the inter-CPA-as-negative assumption is violated by within-firm collisions concentrated at Firm A, we anchor on Firms B/C/D with Firm A held out as an out-of-sample target (§III-L.0).
## H. Limitations
@@ -1087,7 +1040,7 @@ Several limitations should be transparent. We group them into primary methodolog
*No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M).
*Inter-CPA negative-anchor assumption is partially violated and the violation is firm-dependent.* The cross-firm hit matrix of §III-L.4 shows that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs. Because the violation is firm-dependent, Firm A's per-firm ICCR is more contaminated by within-firm sharing than Firms B/C/D's; the per-firm B/C/D rates of $0.09$$0.16$ may therefore be less contaminated than the pooled rate, and the Firm A vs Firms B/C/D contrast reflects both genuine firm heterogeneity and a firm-dependent proxy-contamination gradient.
*Inter-CPA negative-anchor assumption, and why we anchor on the BCD baseline.* The cross-firm hit matrix of §III-L.4 shows that under the deployed rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$97.2\%$ at Firms B/C/D, consistent with firm-specific template, stamp, or document-production reuse. An all-Big-4 inter-CPA pool is therefore not a clean negative anchor — some inter-CPA pairs share firm-level templates rather than being independent random matches, and the contamination is dominated by Firm A. We address this directly by anchoring the calibration on the Firms-B/C/D baseline and holding Firm A out as an out-of-sample target (§III-L.0); on this baseline the per-comparison HC rate falls from $0.00014$ to $0.000018$ and the per-signature HC rate from $0.1102$ to $0.0116$. A residual caveat survives even on the clean baseline: the BCD floor is an *inter-CPA coincidence* rate, not an *intra-CPA genuine-hand-signing* rate, so the observed-versus-floor excess (§III-L.6) cannot be read as a true-positive rate — a consistently hand-signing CPA can exceed the inter-CPA floor. All reported ICCRs are therefore specificity proxies, not calibrated FARs or specificities.
*Mechanism attribution for the firm-level heterogeneity is not identifiable from descriptor-only data.* The observed firm-level contrast (Firm A's per-document HC$+$MC ICCR of $0.62$ versus $0.09$$0.16$ at Firms B/C/D; within-firm collision concentration $77$$99\%$ under the deployed any-pair rule; byte-identical evidence of §IV-H) is consistent with at least three non-mutually-exclusive firm-level mechanisms: (i) template, stamp, or e-signature production reuse; (ii) digitisation-pipeline homogeneity — shared scanners, common PDF generation infrastructure, identical compression and form-template settings — that systematically inflates image-descriptor similarity without signature replication; and (iii) signing-style or training homogeneity that produces correlated handwritten signatures within a firm. The descriptor pair (cosine, dHash) operates at the image-similarity level and is, by construction, indifferent to which mechanism generated a given near-identical pair. We therefore report the firm contrast as a methodological observation — the framework discriminates at firm-level resolution — rather than as a mechanism finding. The byte-identical Firm A signatures across $\sim 50$ distinct partners (§IV-H, §V-C) provide direct evidence for (i) at Firm A specifically, but do not exclude additive contribution from (ii) or (iii); the milder within-firm collision patterns at Firms B/C/D are individually consistent with all three mechanisms. Image-acquisition metadata (scanner identifiers, PDF generator fingerprints, compression-codec markers), partner-level intent records, or controlled hand-signed baselines would be needed to attribute the contrast across (i), (ii), and (iii).
@@ -1097,9 +1050,9 @@ Several limitations should be transparent. We group them into primary methodolog
*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the deployed box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain their prior calibration and capture-rate evidence (supplementary materials); the anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
*Rule components not separately re-characterised by the present diagnostic battery.* The five-way classifier's moderate-confidence advisory band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain the threshold provenance of their prior calibration (supplementary materials); however, §III-L.3 supersedes the MC band's *claim strength* — its $\sim 0.19$ per-document inter-CPA coincidence on the normative baseline makes it a low-specificity advisory bin, not calibrated evidence of replication. The anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-characterised by the present diagnostic battery.
*Deployed-rate excess is not a presumed true-positive rate.* The $\sim 44$-pp per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the inter-CPA proxy rate (HC: $0.18$) cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; within-firm template sharing inflates the inter-CPA proxy baseline). The gap is best read as a same-CPA repeatability signal.
*Deployed-rate excess is not a presumed true-positive rate.* The per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the normative inter-CPA proxy floor (HC: $0.023$ on the BCD baseline) — $\sim 60$ pp — cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; the inter-CPA floor is not an intra-CPA genuine-hand-signing rate). The gap is best read as an observed same-CPA-pool repeatability signal.
*A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.
@@ -1124,11 +1077,11 @@ Several limitations should be transparent. We group them into primary methodolog
# VI. Conclusion and Future Work
We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature classifier with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442150,453 signatures at signature level) as the primary analytical population.
We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature screening rule with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442150,453 signatures at signature level) as the primary analytical population. We emphasise that the operating thresholds are operator-tunable and that the system performs semi-automated triage — surfacing replication candidates from hundreds of thousands of signatures for human adjudication — rather than autonomous forensic classification; its central deliverable is the label-free calibration methodology by which an operator selects and characterises a screening operating point.
Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, a contrast that pool-size differences do not explain and that we report as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms), consistent with but not independently establishing firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.
Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration on a normative non-Firm-A baseline (Firms B/C/D, with Firm A held out as an out-of-sample target to avoid circularity): on this clean baseline the deployed HC rule yields per-comparison ICCR $0.000018$, per-signature $0.0116$, and per-document $0.023$ — roughly an order of magnitude below the contaminated all-Big-4 figures ($0.00014$, $0.11$, $0.18$) — while the dHash$\leq 15$ moderate-confidence band, which retains a $\sim 0.19$ per-document coincidence rate even on the clean baseline, is repositioned as a low-specificity advisory tier; with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm-level heterogeneity surfaced by the framework: against the clean BCD floor the deployed rule fires on each firm's own pools at $\sim 70\times$ (Firm A) and $\sim 21$$30\times$ (Firms B/C/D), while Firm A scored cross-firm against the baseline coincides only at the floor ($0.0102$); two logistic regressions (full-Big-4 Firm-A-reference odds ratios $0.053$/$0.010$/$0.027$; BCD-only Firm-D-reference residual spread within $\sim 3.5\times$) show Firm A is the singular outlier and Firms B/C/D an internally homogeneous baseline — reported as a framework-discriminative observation rather than a mechanism finding (§V-H); (4) cross-firm hit matrix evidence that within-firm collision concentration is a universal Big-4 pattern — $98.8\%$ at Firm A and $89$$97\%$ at Firms B/C/D on the clean BCD pool (same-pair $97$$100\%$ across all four firms) consistent with, but not independently establishing, firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style similarity, which descriptor-only data cannot separate (§V-H); (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.
Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$$98.8\%$ across Big-4; same-pair joint $97.0$$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$$0.16$) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$$98.8\%$ across Big-4; same-pair joint $97.0$$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (observed per-signature high-confidence rate $0.82$ vs $0.24$$0.35$, $\sim 70\times$ vs $\sim 21$$30\times$ the clean BCD floor) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
# Appendix A. Supplementary Diagnostic Detail
@@ -1164,7 +1117,7 @@ Both features are characteristic of a histogram-resolution artifact rather than
Second, the candidate transitions all locate *inside* the high-similarity region (cosine $\geq 0.975$, dHash $\leq 10$) rather than at a between-mode boundary, which is the location pattern we would expect of a clean within-population antimode.
Taken together, Table A.I shows that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the non-hand-signed mode rather than between modes.
Taken together, Table A.I shows that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the high-similarity descriptor region rather than between modes.
This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator and reinforces the joint reading of Section IV-D that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold.
Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials.
@@ -1178,11 +1131,11 @@ Section III-M positions the unsupervised-diagnostic strategy as a set of complem
| Diagnostic | Failure mode addressed | Disclosed untested assumption |
|---|---|---|
| Composition decomposition (§III-I.4; Scripts 39b39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) |
| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) |
| Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | Same as above + that pool replacement preserves the negative-anchor property |
| Document-level ICCR (§III-L.3; Script 45) | Operational alarm rate proxy at per-document unit under three alarm definitions | Same as above |
| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 46) | Pair-level specificity proxy under a random-pair negative anchor, on the normative BCD baseline | Inter-CPA pairs are negative (i.e., not template-related); addressed by anchoring on Firms B/C/D and holding Firm A out (§III-L.0) |
| Pool-normalised per-signature ICCR (§III-L.2; Script 52) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size, on the BCD baseline | Same as above + that pool replacement preserves the negative-anchor property |
| Document-level ICCR (§III-L.3; Script 52) | Operational alarm-rate proxy at per-document unit (HC and HC+MC), on the BCD baseline | Same as above |
| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check |
| Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within source firm | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$$99.96\%$ within-firm at all four firms versus $76.7$$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 45's mode-of-firms tie-break (§IV-M.4) |
| Cross-firm hit matrix (§III-L.4; Scripts 44, 53) | Concentration of inter-CPA collisions within source firm (all-Big-4 and BCD-pool variants) | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$$99.96\%$ within-firm at all four firms versus $76.7$$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 52's dominant-firm rule (§IV-M.4) |
| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test |
| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal-consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent |
| Pixel-identical conservative positive capture (§III-K.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold |
+60
View File
@@ -0,0 +1,60 @@
#!/usr/bin/env python3
"""Firm x year descriptor trends (B-gate diagnostic).
Plots per-firm yearly mean cosine, mean dHash, and HC-box hit share to test
whether Firms B/C/D show a 2020 structural break converging toward Firm A.
Read-only against the production DB.
"""
import sqlite3
import matplotlib.pyplot as plt
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRMS = [('勤業眾信聯合', 'Firm A (Deloitte)', '#d62728'),
('安侯建業聯合', 'Firm B (KPMG)', '#1f77b4'),
('資誠聯合', 'Firm C (PwC)', '#2ca02c'),
('安永聯合', 'Firm D (EY)', '#ff7f0e')]
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
def series(firm_zh):
cur.execute("""
SELECT CAST(substr(s.year_month,1,4) AS INT) AS yr,
AVG(s.max_similarity_to_same_accountant),
AVG(s.min_dhash_independent),
AVG(CASE WHEN s.max_similarity_to_same_accountant>0.95
AND s.min_dhash_independent<=5 THEN 1.0 ELSE 0.0 END),
COUNT(*)
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE a.firm=? AND s.year_month IS NOT NULL
AND s.max_similarity_to_same_accountant IS NOT NULL
AND s.min_dhash_independent IS NOT NULL
GROUP BY yr ORDER BY yr""", (firm_zh,))
return cur.fetchall()
fig, axes = plt.subplots(1, 3, figsize=(16, 4.8))
for firm_zh, label, color in FIRMS:
rows = series(firm_zh)
yrs = [r[0] for r in rows]
axes[0].plot(yrs, [r[1] for r in rows], 'o-', color=color, label=label)
axes[1].plot(yrs, [r[2] for r in rows], 'o-', color=color, label=label)
axes[2].plot(yrs, [r[3] for r in rows], 'o-', color=color, label=label)
for ax in axes:
ax.axvline(2020, ls='--', color='grey', alpha=0.6)
ax.text(2020.05, ax.get_ylim()[0], ' 2020', color='grey', fontsize=8, va='bottom')
ax.set_xlabel('Fiscal year')
ax.grid(alpha=0.3)
axes[0].set_title('Mean best-match cosine'); axes[0].axhline(0.95, ls=':', color='k', alpha=0.4)
axes[1].set_title('Mean independent-min dHash'); axes[1].axhline(5, ls=':', color='k', alpha=0.4)
axes[2].set_title('HC-box share (cos>0.95 & dHash$\\leq$5)')
axes[0].legend(fontsize=8, loc='lower right')
fig.suptitle('Big-4 descriptor trends 20132023 (2023 = partial, to Apr) — no 2020 break, no convergence to A',
fontsize=11)
fig.tight_layout()
out = '/Volumes/NV2/pdf_recognize/signature_analysis/firm_year_trends.png'
fig.savefig(out, dpi=130, bbox_inches='tight')
print('saved', out)
conn.close()
+87
View File
@@ -0,0 +1,87 @@
#!/usr/bin/env python3
"""Script 46: BCD-only (exclude Firm A) per-comparison ICCR recompute.
Replicates 40b's inter-CPA negative-anchor pair sampling (N=500k, seed=42)
but compares three negative-anchor pool compositions:
- ABCD : all Big-4 (current paper baseline)
- BCD : Big-4 excluding Firm A (normative-baseline proposal)
- BCD+nonB4 : BCD plus all non-Big-4 firms
Reports marginal cos>0.95, dHash<=5, and the joint HC rule cos>0.95 & dHash<=5.
Read-only.
"""
import sqlite3
from collections import defaultdict
import numpy as np
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
N_PAIRS = 500_000
SEED = 42
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
def hamming(a, b):
return (int.from_bytes(a, 'big') ^ int.from_bytes(b, 'big')).bit_count()
def load():
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""
SELECT s.assigned_accountant, a.firm, s.feature_vector, s.dhash_vector
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE s.assigned_accountant IS NOT NULL AND a.firm IS NOT NULL
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""")
rows = cur.fetchall()
conn.close()
return rows
def wilson(k, n, z=1.96):
if n == 0:
return (None, None)
p = k / n
d = 1 + z*z/n
c = (p + z*z/(2*n)) / d
h = z*np.sqrt(p*(1-p)/n + z*z/(4*n*n)) / d
return (max(0.0, c-h), min(1.0, c+h))
def iccr(rows, label):
by = defaultdict(list)
for acct, firm, fv, dh in rows:
by[acct].append((fv, dh))
accts = list(by.keys())
feats = {a: np.stack([np.frombuffer(r[0], dtype=np.float32) for r in by[a]]) for a in accts}
dhs = {a: [r[1] for r in by[a]] for a in accts}
rng = np.random.default_rng(SEED)
cos = np.empty(N_PAIRS, np.float32)
dv = np.empty(N_PAIRS, np.int32)
na = len(accts)
for t in range(N_PAIRS):
i, j = rng.choice(na, 2, replace=False)
a1, a2 = accts[i], accts[j]
k1 = int(rng.integers(0, len(by[a1])))
k2 = int(rng.integers(0, len(by[a2])))
cos[t] = float(feats[a1][k1] @ feats[a2][k2])
dv[t] = hamming(dhs[a1][k1], dhs[a2][k2])
n = N_PAIRS
m_cos = int((cos > 0.95).sum())
m_dh = int((dv <= 5).sum())
joint = int(((cos > 0.95) & (dv <= 5)).sum())
jlo, jhi = wilson(joint, n)
print(f'\n== {label} ==')
print(f' signatures={len(rows):,} accountants={na} pairs={n:,}')
print(f' cos>0.95 ICCR = {m_cos/n:.5f} ({m_cos})')
print(f' dHash<=5 ICCR = {m_dh/n:.5f} ({m_dh})')
print(f' JOINT (HC rule) ICCR = {joint/n:.6f} ({joint}) Wilson95% [{jlo:.6f},{jhi:.6f}]')
return joint/n
rows = load()
abcd = [r for r in rows if r[1] in BIG4]
bcd = [r for r in rows if r[1] in BIG4 and r[1] != FIRM_A]
bcd_non = [r for r in rows if r[1] != FIRM_A]
iccr(abcd, 'ABCD (current paper baseline)')
iccr(bcd, 'BCD only (exclude Firm A)')
iccr(bcd_non, 'BCD + non-Big-4')
@@ -0,0 +1,160 @@
#!/usr/bin/env python3
"""Script 47: BCD-only recompute of (1) KDE crossover, (2) per-signature
pool-normalized any-pair ICCR (cos>0.95 & dHash<=5), (3) per-document HC+MC
inter-CPA ICCR (cos>0.95 & dHash<=15), each for ABCD vs BCD-only negative-anchor
pools. Replicates Scripts 10/43/44 methodology. Document-level subsampling used
for the pool simulation (exact same-CPA pool sizes retained). Read-only.
"""
import sqlite3
from collections import defaultdict
import numpy as np
from scipy.stats import gaussian_kde
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
SEED = 42
N_INTRA = 200_000
N_INTER = 500_000
N_DOC_SUBSAMPLE = 9000 # documents processed in pool simulation per scope
def load():
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""
SELECT s.signature_id, s.assigned_accountant, a.firm, s.source_pdf,
s.feature_vector, s.dhash_vector
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE s.assigned_accountant IS NOT NULL AND a.firm IN (?,?,?,?)
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""", BIG4)
rows = cur.fetchall()
conn.close()
return rows
def hamming1(q, c):
return (int.from_bytes(q, 'big') ^ int.from_bytes(c, 'big')).bit_count()
def wilson(k, n, z=1.96):
if n == 0:
return (None, None)
p = k/n; d = 1+z*z/n
c = (p+z*z/(2*n))/d
h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
return (max(0.0, c-h), min(1.0, c+h))
def kde_crossover(feats, cpas, label):
by = defaultdict(list)
for i, c in enumerate(cpas):
by[c].append(i)
by = {c: np.array(v) for c, v in by.items() if len(v) >= 2}
accts = list(by.keys())
rng = np.random.default_rng(SEED)
# intra: two sigs from same random CPA
intra = np.empty(N_INTRA, np.float32)
ks = rng.integers(0, len(accts), N_INTRA)
for t in range(N_INTRA):
idx = by[accts[ks[t]]]
a, b = rng.choice(idx, 2, replace=False)
intra[t] = feats[a] @ feats[b]
# inter: two sigs from different CPAs
inter = np.empty(N_INTER, np.float32)
for t in range(N_INTER):
i, j = rng.choice(len(accts), 2, replace=False)
a = rng.choice(by[accts[i]]); b = rng.choice(by[accts[j]])
inter[t] = feats[a] @ feats[b]
xs = np.linspace(0.3, 1.0, 10000)
ki = gaussian_kde(intra[:100000]); ke = gaussian_kde(inter[:100000])
diff = ki(xs) - ke(xs)
cross = xs[np.where(np.diff(np.sign(diff)))[0]]
cross = [float(x) for x in cross if 0.6 < x < 0.99]
print(f' [{label}] intra mean={intra.mean():.4f} inter mean={inter.mean():.4f}'
f' KDE crossover(s): {[f"{x:.4f}" for x in cross]}')
return cross
def pool_sim(rows, scope_firms, label):
"""Per-signature & per-document inter-CPA any-pair ICCR over a doc subsample."""
keep = [r for r in rows if ALIAS[r[2]] in scope_firms]
feats = np.stack([np.frombuffer(r[4], np.float32) for r in keep]).astype(np.float32)
feats /= np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-9, None)
cpas = [r[1] for r in keep]
firms = [ALIAS[r[2]] for r in keep]
docs = [r[3] for r in keep]
dh = [r[5] for r in keep]
n = len(keep)
cpa_idx = defaultdict(list)
for i, c in enumerate(cpas):
cpa_idx[c].append(i)
cpa_idx = {c: np.array(v) for c, v in cpa_idx.items()}
pool_size = {c: len(v)-1 for c, v in cpa_idx.items()}
doc_idx = defaultdict(list)
for i, d in enumerate(docs):
doc_idx[d].append(i)
rng = np.random.default_rng(SEED)
all_docs = list(doc_idx.keys())
sub = rng.choice(len(all_docs), min(N_DOC_SUBSAMPLE, len(all_docs)), replace=False)
sel_docs = [all_docs[i] for i in sub]
sig_hc = [] # per-signature: any-pair cos>0.95 & dh<=5
sig_firm = []
doc_hcmc = {} # per-document worst-case: any sig with cos>0.95 & dh<=15
doc_firm = {}
for d in sel_docs:
dhit = False
for si in doc_idx[d]:
c = cpas[si]; npool = pool_size[c]
if npool <= 0:
sig_hc.append(False); sig_firm.append(firms[si]); continue
same = cpa_idx[c]
draw = rng.choice(n, size=min(npool*2+10, n), replace=True)
cand = draw[~np.isin(draw, same)][:npool]
cosv = feats[cand] @ feats[si]
dhv = np.fromiter((hamming1(dh[si], dh[c2]) for c2 in cand), np.int32, len(cand))
cg = cosv > 0.95
hc = bool((cg & (dhv <= 5)).any())
hcmc = bool((cg & (dhv <= 15)).any())
sig_hc.append(hc); sig_firm.append(firms[si])
if hcmc:
dhit = True
doc_hcmc[d] = dhit
doc_firm[d] = firms[doc_idx[d][0]]
sig_hc = np.array(sig_hc); sig_firm = np.array(sig_firm)
k = int(sig_hc.sum()); m = len(sig_hc)
lo, hi = wilson(k, m)
print(f'\n [{label}] per-SIGNATURE any-pair HC ICCR (cos>0.95 & dh<=5): '
f'{k/m:.4f} ({k}/{m}) Wilson95% [{lo:.4f},{hi:.4f}]')
for f in sorted(set(sig_firm)):
msk = sig_firm == f
kk = int(sig_hc[msk].sum()); mm = int(msk.sum())
print(f' Firm {f}: {kk/mm:.4f} ({kk}/{mm})')
dvals = np.array(list(doc_hcmc.values())); dfirm = np.array(list(doc_firm.values()))
dk = int(dvals.sum()); dm = len(dvals)
dlo, dhi = wilson(dk, dm)
print(f' [{label}] per-DOCUMENT HC+MC ICCR (cos>0.95 & dh<=15): '
f'{dk/dm:.4f} ({dk}/{dm}) Wilson95% [{dlo:.4f},{dhi:.4f}]')
for f in sorted(set(dfirm)):
msk = dfirm == f
kk = int(dvals[msk].sum()); mm = int(msk.sum())
print(f' Firm {f}: {kk/mm:.4f} ({kk}/{mm})')
rows = load()
allf = np.stack([np.frombuffer(r[4], np.float32) for r in rows]).astype(np.float32)
allf /= np.clip(np.linalg.norm(allf, axis=1, keepdims=True), 1e-9, None)
allc = [r[1] for r in rows]
abcd_mask = [True]*len(rows)
bcd_mask = [r[2] != FIRM_A for r in rows]
print('=== (1) KDE crossover (intra vs inter cosine) ===')
kde_crossover(allf, allc, 'ABCD')
kde_crossover(allf[bcd_mask], [allc[i] for i in range(len(rows)) if bcd_mask[i]], 'BCD-only')
print('\n=== (2)(3) per-signature & per-document inter-CPA ICCR ===')
pool_sim(rows, {'A', 'B', 'C', 'D'}, 'ABCD (reproduce)')
pool_sim(rows, {'B', 'C', 'D'}, 'BCD-only')
+141
View File
@@ -0,0 +1,141 @@
#!/usr/bin/env python3
"""Script 48: full-fidelity (no subsample) BCD-only recompute of per-signature
and per-document inter-CPA any-pair ICCR, plus corpus-style KDE crossover.
Vectorized popcount. Scopes: ABCD, BCD-only, BCD+non-Big-4. Read-only.
"""
import sqlite3
from collections import defaultdict
import numpy as np
from scipy.stats import gaussian_kde
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
SEED = 42
POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
def load():
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""
SELECT s.assigned_accountant, a.firm, s.source_pdf,
s.feature_vector, s.dhash_vector
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE s.assigned_accountant IS NOT NULL AND a.firm IS NOT NULL
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""")
rows = cur.fetchall()
conn.close()
return rows
def wilson(k, n, z=1.96):
if n == 0:
return (None, None)
p = k/n; d = 1+z*z/n
c = (p+z*z/(2*n))/d
h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
return (max(0.0, c-h), min(1.0, c+h))
def prep(rows, keep_fn):
keep = [r for r in rows if keep_fn(r[1])]
feats = np.stack([np.frombuffer(r[3], np.float32) for r in keep]).astype(np.float32)
feats /= np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-9, None)
dh = np.stack([np.frombuffer(r[4], np.uint8) for r in keep]) # (n,8)
cpas = np.array([r[0] for r in keep])
firms = np.array([ALIAS.get(r[1], 'X') for r in keep])
docs = np.array([r[2] for r in keep])
return feats, dh, cpas, firms, docs
def crossover(feats, cpas, label):
by = defaultdict(list)
for i, c in enumerate(cpas):
by[c].append(i)
by = {c: np.array(v) for c, v in by.items() if len(v) >= 2}
accts = list(by.keys())
rng = np.random.default_rng(SEED)
N = 100_000
intra = np.empty(N, np.float32); inter = np.empty(N, np.float32)
ks = rng.integers(0, len(accts), N)
for t in range(N):
idx = by[accts[ks[t]]]
a, b = rng.choice(idx, 2, replace=False)
intra[t] = feats[a] @ feats[b]
i, j = rng.choice(len(accts), 2, replace=False)
inter[t] = feats[rng.choice(by[accts[i]])] @ feats[rng.choice(by[accts[j]])]
xs = np.linspace(0.3, 1.0, 10000)
diff = gaussian_kde(intra)(xs) - gaussian_kde(inter)(xs)
cross = [float(x) for x in xs[np.where(np.diff(np.sign(diff)))[0]] if 0.6 < x < 0.99]
print(f' [{label}] crossover {[f"{x:.4f}" for x in cross]} '
f'(intra {intra.mean():.4f} / inter {inter.mean():.4f})')
def pool_sim(feats, dh, cpas, firms, docs, label):
n = len(cpas)
cpa_idx = defaultdict(list)
for i, c in enumerate(cpas):
cpa_idx[c].append(i)
cpa_idx = {c: np.array(v) for c, v in cpa_idx.items()}
pool_size = {c: len(v)-1 for c, v in cpa_idx.items()}
rng = np.random.default_rng(SEED)
sig_hc = np.zeros(n, bool)
doc_hcmc = defaultdict(bool)
for si in range(n):
c = cpas[si]; npool = pool_size[c]
if npool <= 0:
continue
same = cpa_idx[c]
draw = rng.integers(0, n, size=npool + same.size + 20)
cand = draw[~np.isin(draw, same)][:npool]
cosv = feats[cand] @ feats[si]
cg = cosv > 0.95
if cg.any():
dist = POP[dh[cand] ^ dh[si]].sum(axis=1)
sig_hc[si] = bool((cg & (dist <= 5)).any())
if (cg & (dist <= 15)).any():
doc_hcmc[docs[si]] = True
else:
doc_hcmc.setdefault(docs[si], doc_hcmc[docs[si]] if docs[si] in doc_hcmc else False)
# ensure every doc present
for d in docs:
doc_hcmc.setdefault(d, False)
k = int(sig_hc.sum())
lo, hi = wilson(k, n)
print(f'\n [{label}] per-SIGNATURE any-pair HC (cos>0.95 & dh<=5): '
f'{k/n:.4f} ({k}/{n}) Wilson95% [{lo:.4f},{hi:.4f}]')
for f in sorted(set(firms)):
m = firms == f
print(f' Firm {f}: {sig_hc[m].sum()/m.sum():.4f} ({int(sig_hc[m].sum())}/{int(m.sum())})')
# per-doc, with firm of first sig
dfirm = {}
for i, d in enumerate(docs):
dfirm.setdefault(d, firms[i])
dl = list(doc_hcmc.keys())
dv = np.array([doc_hcmc[d] for d in dl])
df = np.array([dfirm[d] for d in dl])
dk = int(dv.sum()); dm = len(dv)
dlo, dhi = wilson(dk, dm)
print(f' [{label}] per-DOCUMENT HC+MC (cos>0.95 & dh<=15): '
f'{dk/dm:.4f} ({dk}/{dm}) Wilson95% [{dlo:.4f},{dhi:.4f}]')
for f in sorted(set(df)):
m = df == f
print(f' Firm {f}: {dv[m].sum()/m.sum():.4f} ({int(dv[m].sum())}/{int(m.sum())})')
rows = load()
SCOPES = [('ABCD', lambda fm: fm in BIG4),
('BCD-only', lambda fm: fm in BIG4 and fm != FIRM_A),
('BCD+nonBig4', lambda fm: fm != FIRM_A)]
print('=== KDE crossover ===')
for name, fn in SCOPES[:2]:
f, _, c, _, _ = prep(rows, fn)
crossover(f, c, name)
print('\n=== per-signature & per-document inter-CPA ICCR (full) ===')
for name, fn in SCOPES:
f, dh, c, fm, dc = prep(rows, fn)
pool_sim(f, dh, c, fm, dc, name)
+104
View File
@@ -0,0 +1,104 @@
#!/usr/bin/env python3
"""Script 49: Firm A as out-of-sample target against a clean BCD baseline.
(1) A signatures scored against a BCD-only candidate pool (true out-of-sample
inter-firm coincidence).
(2) Observed deployed rate on ACTUAL same-CPA pools, per firm (the real fired
rate, from precomputed deployed descriptors), to juxtapose against the
clean BCD inter-CPA coincidence floor. Read-only.
"""
import sqlite3
from collections import defaultdict
import numpy as np
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
SEED = 42
POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
def wilson(k, n, z=1.96):
if n == 0:
return (None, None)
p = k/n; d = 1+z*z/n
c = (p+z*z/(2*n))/d
h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
return (max(0.0, c-h), min(1.0, c+h))
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""
SELECT s.assigned_accountant, a.firm, s.source_pdf, s.feature_vector,
s.dhash_vector, s.max_similarity_to_same_accountant, s.min_dhash_independent
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE s.assigned_accountant IS NOT NULL AND a.firm IN (?,?,?,?)
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""", BIG4)
rows = cur.fetchall()
conn.close()
# ---- (1) Firm A source vs BCD-only candidate pool ----
print('=== (1) Firm A out-of-sample vs clean BCD candidate pool ===')
A = [r for r in rows if r[1] == FIRM_A]
BCD = [r for r in rows if r[1] in BIG4 and r[1] != FIRM_A]
bcd_feat = np.stack([np.frombuffer(r[3], np.float32) for r in BCD]).astype(np.float32)
bcd_feat /= np.clip(np.linalg.norm(bcd_feat, axis=1, keepdims=True), 1e-9, None)
bcd_dh = np.stack([np.frombuffer(r[4], np.uint8) for r in BCD])
nb = len(BCD)
# A CPA pool sizes (their own same-CPA count - 1), to match negative-anchor construction
a_cpa_idx = defaultdict(list)
for i, r in enumerate(A):
a_cpa_idx[r[0]].append(i)
pool_size = {c: len(v)-1 for c, v in a_cpa_idx.items()}
rng = np.random.default_rng(SEED)
sig_hc = np.zeros(len(A), bool)
doc_hcmc = defaultdict(bool)
for i, r in enumerate(A):
npool = max(pool_size[r[0]], 1)
cand = rng.integers(0, nb, size=npool)
sf = np.frombuffer(r[3], np.float32).astype(np.float32)
sf /= max(np.linalg.norm(sf), 1e-9)
cosv = bcd_feat[cand] @ sf
cg = cosv > 0.95
doc_hcmc.setdefault(r[2], False)
if cg.any():
dist = POP[bcd_dh[cand] ^ np.frombuffer(r[4], np.uint8)].sum(axis=1)
sig_hc[i] = bool((cg & (dist <= 5)).any())
if (cg & (dist <= 15)).any():
doc_hcmc[r[2]] = True
k = int(sig_hc.sum()); n = len(A); lo, hi = wilson(k, n)
print(f' A-source vs BCD-pool per-SIGNATURE HC (cos>0.95 & dh<=5): '
f'{k/n:.4f} ({k}/{n}) Wilson95% [{lo:.4f},{hi:.4f}]')
dv = np.array(list(doc_hcmc.values())); dk = int(dv.sum()); dm = len(dv)
dlo, dhi = wilson(dk, dm)
print(f' A-source vs BCD-pool per-DOCUMENT HC+MC (cos>0.95 & dh<=15): '
f'{dk/dm:.4f} ({dk}/{dm}) Wilson95% [{dlo:.4f},{dhi:.4f}]')
# ---- (2) Observed deployed rate on ACTUAL same-CPA pools, per firm ----
print('\n=== (2) Observed deployed rate on actual same-CPA pools (real fired rate) ===')
print(' per-signature HC = max_sim>0.95 & min_dh<=5 ; per-doc HC+MC worst-case dh<=15')
by_firm_sig = defaultdict(lambda: [0, 0])
doc_obs = {}
doc_firm = {}
for r in rows:
fm = ALIAS[r[1]]
ms, md = r[5], r[6]
if ms is None or md is None:
continue
hc = (ms > 0.95) and (md <= 5)
hcmc = (ms > 0.95) and (md <= 15)
by_firm_sig[fm][0] += int(hc); by_firm_sig[fm][1] += 1
doc_firm.setdefault(r[2], fm)
doc_obs[r[2]] = doc_obs.get(r[2], False) or hcmc
for fm in sorted(by_firm_sig):
k, n = by_firm_sig[fm]
lo, hi = wilson(k, n)
print(f' Firm {fm} per-SIGNATURE HC: {k/n:.4f} ({k}/{n}) [{lo:.4f},{hi:.4f}]')
dd = defaultdict(lambda: [0, 0])
for d, hit in doc_obs.items():
fm = doc_firm[d]; dd[fm][0] += int(hit); dd[fm][1] += 1
for fm in sorted(dd):
k, n = dd[fm]
print(f' Firm {fm} per-DOCUMENT HC+MC: {k/n:.4f} ({k}/{n})')
print(f'\n Clean BCD inter-CPA coincidence FLOOR: per-sig HC=0.0048, per-doc HC+MC=0.1281')
@@ -0,0 +1,107 @@
#!/usr/bin/env python3
"""Script 50: publication-grade scoped inter-CPA anchor recompute.
Faithfully reproduces Script 45's any-pair five-way pool simulation
(max_cos & min_dh over a random same-size inter-CPA pool, excl. same-CPA),
then reports for scopes ABCD / BCD / BCD+nonBig4:
- per-signature HC (D1) and HC+MC (D2) any-pair FAR
- per-document HC (D1) and HC+MC (D2) any-pair FAR
- per-firm per-document D2
ABCD is printed first to verify reproduction of published values
(per-sig HC~0.1102, per-doc D2~0.3375, Firm A~0.62). Read-only.
"""
import sqlite3
from collections import defaultdict
import numpy as np
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
SEED = 42
POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
def wilson(k, n, z=1.96):
if n == 0:
return (None, None)
p = k/n; d = 1+z*z/n
c = (p+z*z/(2*n))/d
h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
return (max(0.0, c-h), min(1.0, c+h))
def load():
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""
SELECT s.assigned_accountant, a.firm, s.source_pdf,
s.feature_vector, s.dhash_vector
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE s.assigned_accountant IS NOT NULL AND a.firm IS NOT NULL
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""")
rows = cur.fetchall()
conn.close()
return rows
def run(rows, keep_fn, label):
keep = [r for r in rows if keep_fn(r[1])]
n = len(keep)
feats = np.stack([np.frombuffer(r[3], np.float32) for r in keep]).astype(np.float32)
feats /= np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-9, None)
dh = np.stack([np.frombuffer(r[4], np.uint8) for r in keep])
cpas = np.array([r[0] for r in keep])
firms = np.array([ALIAS.get(r[1], 'NonB4') for r in keep])
docs = np.array([r[2] for r in keep])
cpa_idx = defaultdict(list)
for i, c in enumerate(cpas):
cpa_idx[c].append(i)
cpa_idx = {c: np.array(v) for c, v in cpa_idx.items()}
pool_size = {c: len(v)-1 for c, v in cpa_idx.items()}
rng = np.random.default_rng(SEED)
max_cos = np.zeros(n, np.float32)
min_dh = np.full(n, 64, np.int32)
for si in range(n):
c = cpas[si]; npool = pool_size[c]
if npool <= 0:
continue
same = cpa_idx[c]
draw = rng.integers(0, n, size=npool + same.size + 20)
cand = draw[~np.isin(draw, same)][:npool]
cosv = feats[cand] @ feats[si]
dist = POP[dh[cand] ^ dh[si]].sum(axis=1)
max_cos[si] = cosv.max()
min_dh[si] = int(dist.min())
# any-pair classification
hc = (max_cos > 0.95) & (min_dh <= 5)
mc = (max_cos > 0.95) & (min_dh > 5) & (min_dh <= 15)
d1 = hc
d2 = hc | mc
print(f'\n===== {label} (n_sig={n:,}) =====')
for nm, arr in [('per-sig HC (D1)', d1), ('per-sig HC+MC (D2)', d2)]:
k = int(arr.sum()); lo, hi = wilson(k, n)
print(f' {nm}: {k/n:.4f} ({k}/{n}) [{lo:.4f},{hi:.4f}]')
# per-document worst-case
doc_d1 = defaultdict(bool); doc_d2 = defaultdict(bool); doc_firm = {}
for i in range(n):
if d1[i]: doc_d1[docs[i]] = True
if d2[i]: doc_d2[docs[i]] = True
doc_firm.setdefault(docs[i], firms[i])
doc_d1.setdefault(docs[i], False); doc_d2.setdefault(docs[i], False)
dl = list(doc_d2.keys())
nd = len(dl)
k1 = sum(doc_d1[d] for d in dl); k2 = sum(doc_d2[d] for d in dl)
l1 = wilson(k1, nd); l2 = wilson(k2, nd)
print(f' per-doc HC (D1): {k1/nd:.4f} ({k1}/{nd}) [{l1[0]:.4f},{l1[1]:.4f}]')
print(f' per-doc HC+MC (D2):{k2/nd:.4f} ({k2}/{nd}) [{l2[0]:.4f},{l2[1]:.4f}]')
df = np.array([doc_firm[d] for d in dl])
dv = np.array([doc_d2[d] for d in dl])
for f in sorted(set(df)):
m = df == f
print(f' Firm {f} per-doc D2: {dv[m].sum()/m.sum():.4f} ({int(dv[m].sum())}/{int(m.sum())})')
rows = load()
run(rows, lambda fm: fm in BIG4, 'ABCD (verify vs published: HC~0.110 / D2~0.338 / A~0.62)')
run(rows, lambda fm: fm in BIG4 and fm != FIRM_A, 'BCD-only')
run(rows, lambda fm: fm != FIRM_A, 'BCD + non-Big4')
@@ -0,0 +1,135 @@
#!/usr/bin/env python3
"""Script 51: publication polish.
Part A: CPA-block bootstrap (1000 reps) on per-signature HC any-pair rate, and
document-level bootstrap on per-document HC+MC, for ABCD & BCD.
Part B: corpus-wide KDE crossover (pair-weighted intra, reproduce 0.837) plus
BCD-only and BCD+nonBig4 variants.
Read-only.
"""
import sqlite3
from collections import defaultdict
import numpy as np
from scipy.stats import gaussian_kde
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
SEED = 42
N_BOOT = 1000
POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
def load():
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""
SELECT s.assigned_accountant, a.firm, s.source_pdf,
s.feature_vector, s.dhash_vector
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE s.assigned_accountant IS NOT NULL AND a.firm IS NOT NULL
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""")
rows = cur.fetchall()
conn.close()
return rows
# ============ Part A: bootstrap on anchor rates ============
def simulate(keep):
n = len(keep)
feats = np.stack([np.frombuffer(r[3], np.float32) for r in keep]).astype(np.float32)
feats /= np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-9, None)
dh = np.stack([np.frombuffer(r[4], np.uint8) for r in keep])
cpas = np.array([r[0] for r in keep])
docs = np.array([r[2] for r in keep])
cpa_idx = defaultdict(list)
for i, c in enumerate(cpas):
cpa_idx[c].append(i)
cpa_idx = {c: np.array(v) for c, v in cpa_idx.items()}
pool_size = {c: len(v)-1 for c, v in cpa_idx.items()}
rng = np.random.default_rng(SEED)
max_cos = np.zeros(n, np.float32); min_dh = np.full(n, 64, np.int32)
for si in range(n):
c = cpas[si]; npool = pool_size[c]
if npool <= 0:
continue
same = cpa_idx[c]
draw = rng.integers(0, n, size=npool + same.size + 20)
cand = draw[~np.isin(draw, same)][:npool]
cosv = feats[cand] @ feats[si]
dist = POP[dh[cand] ^ dh[si]].sum(axis=1)
max_cos[si] = cosv.max(); min_dh[si] = int(dist.min())
hc = (max_cos > 0.95) & (min_dh <= 5)
d2 = (max_cos > 0.95) & (min_dh <= 15)
return hc, d2, cpa_idx, docs
def boot_part(keep, label):
hc, d2, cpa_idx, docs = simulate(keep)
n = len(hc)
rng = np.random.default_rng(SEED + 1)
cpa_list = list(cpa_idx.keys())
# CPA-block bootstrap on per-signature HC
bs = np.empty(N_BOOT)
for b in range(N_BOOT):
cs = rng.choice(len(cpa_list), len(cpa_list), replace=True)
idx = np.concatenate([cpa_idx[cpa_list[i]] for i in cs])
bs[b] = hc[idx].mean()
# document-level bootstrap on per-doc D2
doc_d2 = defaultdict(bool)
for i in range(n):
doc_d2[docs[i]] = doc_d2[docs[i]] or bool(d2[i])
dl = np.array(list(doc_d2.keys())); dvals = np.array([doc_d2[d] for d in dl])
nd = len(dl); bd = np.empty(N_BOOT)
for b in range(N_BOOT):
s = rng.integers(0, nd, nd)
bd[b] = dvals[s].mean()
print(f'\n [{label}] per-sig HC point={hc.mean():.4f} '
f'CPA-block boot95% [{np.percentile(bs,2.5):.4f}, {np.percentile(bs,97.5):.4f}]')
print(f' [{label}] per-doc HC+MC point={dvals.mean():.4f} '
f'doc boot95% [{np.percentile(bd,2.5):.4f}, {np.percentile(bd,97.5):.4f}]')
# ============ Part B: pair-weighted KDE crossover ============
def crossover(keep, label):
feats = np.stack([np.frombuffer(r[3], np.float32) for r in keep]).astype(np.float32)
feats /= np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-9, None)
cpas = np.array([r[0] for r in keep])
by = defaultdict(list)
for i, c in enumerate(cpas):
by[c].append(i)
by = {c: np.array(v) for c, v in by.items() if len(v) >= 3}
accts = list(by.keys())
pair_w = np.array([len(by[c])*(len(by[c])-1)/2 for c in accts], float)
pair_w /= pair_w.sum()
rng = np.random.default_rng(SEED)
M = 100_000
# intra: CPA sampled proportional to pair count (= uniform over all intra pairs)
intra = np.empty(M, np.float32)
ci = rng.choice(len(accts), M, p=pair_w)
for t in range(M):
a, b = rng.choice(by[accts[ci[t]]], 2, replace=False)
intra[t] = feats[a] @ feats[b]
inter = np.empty(M, np.float32)
for t in range(M):
i, j = rng.choice(len(accts), 2, replace=False)
inter[t] = feats[rng.choice(by[accts[i]])] @ feats[rng.choice(by[accts[j]])]
xs = np.linspace(0.3, 1.0, 10000)
diff = gaussian_kde(intra)(xs) - gaussian_kde(inter)(xs)
cr = [float(x) for x in xs[np.where(np.diff(np.sign(diff)))[0]] if 0.6 < x < 0.99]
print(f' [{label}] crossover {[f"{x:.4f}" for x in cr]} '
f'(intra {intra.mean():.4f}/{np.median(intra):.4f} inter {inter.mean():.4f}/{np.median(inter):.4f})')
rows = load()
abcd = [r for r in rows if r[1] in BIG4]
bcd = [r for r in rows if r[1] in BIG4 and r[1] != FIRM_A]
print('=== Part A: bootstrap CIs on anchor rates ===')
boot_part(abcd, 'ABCD (verify ~0.109 / ~0.338)')
boot_part(bcd, 'BCD-only')
print('\n=== Part B: KDE crossover (pair-weighted intra, corpus-wide reproduces 0.837) ===')
crossover(rows, 'corpus-wide (all firms)')
crossover(bcd, 'BCD-only')
crossover([r for r in rows if r[1] != FIRM_A], 'BCD + non-Big4')
@@ -0,0 +1,169 @@
#!/usr/bin/env python3
"""Script 52: canonical-correct locked publication numbers (supersedes 48/50,
fixes the per-firm assignment and A-out-of-sample any-pair issues codex flagged).
Uses the EXACT canonical candidate sampler of Scripts 43/45 (retry-loop to
collect exactly n_pool non-same-CPA candidates, rng.choice, default_rng(42)),
any-pair max-cos/min-dHash five-way classification, and dominant-firm document
assignment. Scopes: ABCD / BCD / BCD+nonBig4, plus Firm-A out-of-sample vs a
clean BCD candidate pool. Read-only.
"""
import sqlite3
from collections import defaultdict, Counter
import numpy as np
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
SEED = 42
N_BOOT = 1000
POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
def wilson(k, n, z=1.96):
if n == 0:
return (None, None)
p = k/n; d = 1+z*z/n
c = (p+z*z/(2*n))/d
h = z*np.sqrt(p*(1-p)/n+z*z/(4*n*n))/d
return (max(0.0, c-h), min(1.0, c+h))
def load():
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""
SELECT s.assigned_accountant, a.firm, s.source_pdf,
s.feature_vector, s.dhash_vector
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE s.assigned_accountant IS NOT NULL AND a.firm IS NOT NULL
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""")
rows = cur.fetchall()
conn.close()
return rows
def canonical_sampler(rng, n, n_pool, same_cpa, all_idx):
"""EXACT Scripts 43/45 sampler: retry-loop to exactly n_pool non-same."""
need = n_pool
cand = []
attempts = 0
while need > 0 and attempts < 10:
draw = rng.choice(n, size=need * 2, replace=True)
ok = draw[~np.isin(draw, same_cpa)]
cand.extend(ok[:need].tolist())
need -= len(ok[:need])
attempts += 1
if need > 0:
pool_mask = np.ones(n, dtype=bool)
pool_mask[same_cpa] = False
fb = rng.choice(all_idx[pool_mask], size=need, replace=False)
cand.extend(fb.tolist())
return np.array(cand[:n_pool], dtype=np.int64)
def simulate(keep):
n = len(keep)
feats = np.stack([np.frombuffer(r[3], np.float32) for r in keep]).astype(np.float32)
norms = np.linalg.norm(feats, axis=1, keepdims=True); norms[norms == 0] = 1.0
feats = feats / norms
dh = np.stack([np.frombuffer(r[4], np.uint8) for r in keep])
cpas = np.array([r[0] for r in keep])
firms = np.array([ALIAS.get(r[1], 'NonB4') for r in keep])
docs = np.array([r[2] for r in keep])
cpa_idx = defaultdict(list)
for i, c in enumerate(cpas):
cpa_idx[c].append(i)
cpa_idx = {c: np.array(v) for c, v in cpa_idx.items()}
pool_size = {c: len(v)-1 for c, v in cpa_idx.items()}
all_idx = np.arange(n)
rng = np.random.default_rng(SEED)
max_cos = np.zeros(n, np.float32); min_dh = np.full(n, 64, np.int32)
for si in range(n):
np_ = pool_size[cpas[si]]
if np_ <= 0:
continue
cand = canonical_sampler(rng, n, np_, cpa_idx[cpas[si]], all_idx)
cosv = feats[cand] @ feats[si]
dist = POP[dh[cand] ^ dh[si]].sum(axis=1)
max_cos[si] = cosv.max(); min_dh[si] = int(dist.min())
return max_cos, min_dh, cpas, firms, docs, cpa_idx
def report(keep, label):
max_cos, min_dh, cpas, firms, docs, cpa_idx = simulate(keep)
n = len(cpas)
hc = (max_cos > 0.95) & (min_dh <= 5)
d2 = (max_cos > 0.95) & (min_dh <= 15)
print(f'\n===== {label} (n_sig={n:,}) =====')
for nm, a in [('per-sig HC', hc), ('per-sig HC+MC', d2)]:
k = int(a.sum()); lo, hi = wilson(k, n)
print(f' {nm}: {k/n:.6f} ({k}/{n}) [{lo:.4f},{hi:.4f}]')
# CPA-block bootstrap on per-sig HC
rng = np.random.default_rng(SEED + 1)
cl = list(cpa_idx.keys())
bs = np.empty(N_BOOT)
for b in range(N_BOOT):
cs = rng.choice(len(cl), len(cl), replace=True)
idx = np.concatenate([cpa_idx[cl[i]] for i in cs])
bs[b] = hc[idx].mean()
print(f' per-sig HC CPA-block boot95% [{np.percentile(bs,2.5):.4f},{np.percentile(bs,97.5):.4f}]')
# per-doc, dominant-firm assignment (canonical)
doc_sigs = defaultdict(list)
for i in range(n):
doc_sigs[docs[i]].append(i)
dl = list(doc_sigs.keys()); nd = len(dl)
doc_d1 = np.array([hc[doc_sigs[d]].any() for d in dl])
doc_d2 = np.array([d2[doc_sigs[d]].any() for d in dl])
doc_firm = np.array([Counter(firms[doc_sigs[d]]).most_common(1)[0][0] for d in dl])
print(f' per-doc HC: {doc_d1.mean():.6f} ({int(doc_d1.sum())}/{nd})')
print(f' per-doc HC+MC: {doc_d2.mean():.6f} ({int(doc_d2.sum())}/{nd})')
for f in sorted(set(doc_firm)):
m = doc_firm == f
print(f' Firm {f} per-doc D2: {doc_d2[m].mean():.4f} ({int(doc_d2[m].sum())}/{int(m.sum())})')
def a_out_of_sample(rows):
"""Firm A source vs clean BCD candidate pool, any-pair, pool=count-1."""
A = [r for r in rows if r[1] == FIRM_A]
BCD = [r for r in rows if r[1] in BIG4 and r[1] != FIRM_A]
bf = np.stack([np.frombuffer(r[3], np.float32) for r in BCD]).astype(np.float32)
nb = bf.shape[0]
bn = np.linalg.norm(bf, axis=1, keepdims=True); bn[bn == 0] = 1.0; bf = bf/bn
bdh = np.stack([np.frombuffer(r[4], np.uint8) for r in BCD])
a_cpa = defaultdict(list)
for i, r in enumerate(A):
a_cpa[r[0]].append(i)
pool_size = {c: len(v)-1 for c, v in a_cpa.items()}
rng = np.random.default_rng(SEED)
hc = np.zeros(len(A), bool); d2 = np.zeros(len(A), bool)
docs = np.array([r[2] for r in A])
for i, r in enumerate(A):
np_ = pool_size[r[0]]
if np_ <= 0: # singleton CPA: no same-CPA pool, skip (canonical)
continue
cand = rng.choice(nb, size=np_, replace=True) # A not in BCD pool
sf = np.frombuffer(r[3], np.float32).astype(np.float32)
sf = sf/max(np.linalg.norm(sf), 1e-9)
cosv = bf[cand] @ sf
dist = POP[bdh[cand] ^ np.frombuffer(r[4], np.uint8)].sum(axis=1)
mc, md = cosv.max(), int(dist.min())
hc[i] = (mc > 0.95) and (md <= 5)
d2[i] = (mc > 0.95) and (md <= 15)
k = int(hc.sum()); n = len(A); lo, hi = wilson(k, n)
print(f'\n===== Firm A out-of-sample vs clean BCD pool (any-pair) =====')
print(f' per-sig HC: {k/n:.6f} ({k}/{n}) [{lo:.5f},{hi:.5f}]')
ds = defaultdict(list)
for i in range(n):
ds[docs[i]].append(i)
dl = list(ds.keys())
dd2 = np.array([d2[ds[d]].any() for d in dl])
print(f' per-doc HC+MC: {dd2.mean():.6f} ({int(dd2.sum())}/{len(dl)})')
rows = load()
report([r for r in rows if r[1] in BIG4], 'ABCD (verify: per-sig HC~0.1102 / per-doc D2~0.3375)')
report([r for r in rows if r[1] in BIG4 and r[1] != FIRM_A], 'BCD-only (verify codex: HC~0.0116 / doc HC~0.0226 / doc D2~0.1905)')
report([r for r in rows if r[1] != FIRM_A], 'BCD + non-Big4')
a_out_of_sample(rows)
@@ -0,0 +1,125 @@
#!/usr/bin/env python3
"""Script 53: BCD-only firm-effect logistic regression (Firm D reference) and
BCD-only cross-firm hit matrix. Candidate pool = BCD (exclude Firm A and
same-CPA). Canonical retry-loop sampler, any-pair + same-pair. Read-only.
Replicates Script 44's logistic_fit and matrix logic, restricted to BCD.
"""
import sqlite3
from collections import defaultdict
import numpy as np
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
FIRM_A = '勤業眾信聯合'
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
SEED = 42
POP = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)
def logistic_fit(X, y, max_iter=200, l2=0.001):
n, k = X.shape
beta = np.zeros(k)
for _ in range(max_iter):
eta = np.clip(X @ beta, -30, 30)
p = 1.0/(1.0+np.exp(-eta))
grad = X.T @ (y-p) - l2*beta
W = p*(1-p)
H = -(X.T*W) @ X - l2*np.eye(k)
try:
delta = np.linalg.solve(H, grad)
except np.linalg.LinAlgError:
delta = 0.3*grad
nb = beta - delta
if np.max(np.abs(nb-beta)) < 1e-8:
beta = nb; break
beta = nb
eta = np.clip(X @ beta, -30, 30)
p = 1.0/(1.0+np.exp(-eta)); W = p*(1-p)
cov = np.linalg.inv((X.T*W) @ X + l2*np.eye(k))
return beta, np.sqrt(np.diag(cov))
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
cur = conn.cursor()
cur.execute("""SELECT s.assigned_accountant, a.firm, s.feature_vector, s.dhash_vector
FROM signatures s JOIN accountants a ON s.assigned_accountant=a.name
WHERE s.assigned_accountant IS NOT NULL AND a.firm IN (?,?,?)
AND s.feature_vector IS NOT NULL AND s.dhash_vector IS NOT NULL""",
('安侯建業聯合', '資誠聯合', '安永聯合'))
rows = cur.fetchall()
conn.close()
n = len(rows)
feats = np.stack([np.frombuffer(r[2], np.float32) for r in rows]).astype(np.float32)
norms = np.linalg.norm(feats, axis=1, keepdims=True); norms[norms == 0] = 1.0
feats = feats / norms
dh = np.stack([np.frombuffer(r[3], np.uint8) for r in rows])
cpas = np.array([r[0] for r in rows])
firms = np.array([ALIAS[r[1]] for r in rows])
cpa_idx = defaultdict(list)
for i, c in enumerate(cpas):
cpa_idx[c].append(i)
cpa_idx = {c: np.array(v) for c, v in cpa_idx.items()}
pool_size = {c: len(v)-1 for c, v in cpa_idx.items()}
all_idx = np.arange(n)
print(f'BCD signatures: {n:,}; CPAs: {len(cpa_idx)}')
rng = np.random.default_rng(SEED)
hit_any = np.zeros(n, bool)
hit_same = np.zeros(n, bool)
cand_firm_maxcos = np.empty(n, dtype=object) # any-pair partner firm
cand_firm_same = np.empty(n, dtype=object)
psize = np.zeros(n, np.int32)
for si in range(n):
np_ = pool_size[cpas[si]]; psize[si] = np_
if np_ <= 0:
continue
same = cpa_idx[cpas[si]]
need = np_; cand = []; att = 0
while need > 0 and att < 10:
draw = rng.choice(n, size=need*2, replace=True)
ok = draw[~np.isin(draw, same)]
cand.extend(ok[:need].tolist()); need -= len(ok[:need]); att += 1
if need > 0:
pm = np.ones(n, bool); pm[same] = False
cand.extend(rng.choice(all_idx[pm], size=need, replace=False).tolist())
cand = np.array(cand[:np_], dtype=np.int64)
cosv = feats[cand] @ feats[si]
dist = POP[dh[cand] ^ dh[si]].sum(axis=1)
mc = int(np.argmax(cosv)); md = int(np.argmin(dist))
if cosv[mc] > 0.95 and dist[md] <= 5:
hit_any[si] = True
cand_firm_maxcos[si] = firms[cand[mc]]
spm = (cosv > 0.95) & (dist <= 5)
if spm.any():
hit_same[si] = True
cand_firm_same[si] = firms[cand[int(np.argmax(spm))]]
# ---- Logistic regression: hit_any ~ FirmB + FirmC + log(pool), Firm D reference ----
hp = psize > 0
y = hit_any[hp].astype(np.float64)
fa = firms[hp]
lp = np.log(psize[hp].astype(np.float64)); lp = lp - lp.mean()
X = np.column_stack([np.ones(y.shape), (fa == 'B').astype(float), (fa == 'C').astype(float), lp])
beta, se = logistic_fit(X, y)
print(f'\n[BCD logistic: hit_any ~ FirmB + FirmC + log(pool); Firm D = reference] n={len(y):,}, y_mean={y.mean():.4f}')
for nm, b, s in zip(['intercept(FirmD)', 'FirmB', 'FirmC', 'log(pool,centred)'], beta, se):
print(f' {nm}: beta={b:+.4f} SE={s:.4f} OR={np.exp(b):.4f} z~{abs(b)/s if s>0 else float("inf"):.2f}')
# ---- Cross-firm hit matrix (any-pair max-cos partner) ----
print('\n[BCD cross-firm hit matrix: any-pair, source firm x max-cos partner firm]')
print(' src -> B C D | within-firm% (n hits)')
for sf in ['B', 'C', 'D']:
m = (firms == sf) & hit_any
parts = cand_firm_maxcos[m]
tot = len(parts)
cnt = {f: int((parts == f).sum()) for f in ['B', 'C', 'D']}
wf = cnt[sf]/tot if tot else 0
print(f' {sf} {cnt["B"]:5d} {cnt["C"]:5d} {cnt["D"]:5d} | {wf:6.1%} ({tot})')
print('\n[BCD same-pair within-firm concentration]')
for sf in ['B', 'C', 'D']:
m = (firms == sf) & hit_same
parts = cand_firm_same[m]; tot = len(parts)
wf = int((parts == sf).sum())/tot if tot else 0
print(f' Firm {sf}: {wf:.1%} within-firm ({int((parts==sf).sum())}/{tot})')