diff --git a/paper/paper_a_appendix_v3.md b/paper/paper_a_appendix_v3.md index 76143ed..0d27e92 100644 --- a/paper/paper_a_appendix_v3.md +++ b/paper/paper_a_appendix_v3.md @@ -1,9 +1,12 @@ -# Appendix A. BD/McCrary Bin-Width Sensitivity (Signature Level) +# Appendix A. Supplementary Diagnostic Detail + +## A.1. BD/McCrary Bin-Width Sensitivity (Signature Level) The main text (Section III-I, Section IV-D Table VI) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as a threshold estimator. -This appendix documents the empirical basis for that framing by sweeping the bin width across four (variant, bin-width) panels: Firm A and full-sample, each in the cosine and $\text{dHash}_\text{indep}$ direction. +This subsection documents the empirical basis for that framing by sweeping the bin width within four variant panels: Firm A and full-sample, each in the cosine and $\text{dHash}_\text{indep}$ direction. + +**Table A.I.** BD/McCrary Bin-Width Sensitivity (two-sided $\alpha = 0.05$, $|Z| > 1.96$). - Two patterns are visible in Table A.I. First, the procedure consistently identifies a "transition" under every bin width, but the *location* of that transition drifts with bin width (monotonically for Firm A cosine: 0.987 → 0.985 → 0.980 → 0.975 as bin width grows from 0.003 to 0.015; non-monotonically for full-sample dHash: 2 → 10 → 9 as the bin width grows from 1 to 3). @@ -34,6 +36,25 @@ This observation supports the main-text decision to use BD/McCrary as a density- Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials. +## A.2. Diagnostic Summary + +Section III-M positions the unsupervised-diagnostic strategy as a set of complementary checks, each addressing one specific failure mode of an unsupervised screening classifier and each relying on an explicitly disclosed untested assumption. Table A.II maps each diagnostic to the failure mode it addresses and to the untested assumption it relies on. + +**Table A.II.** Diagnostics, failure mode addressed, and disclosed untested assumption.
+ +| Diagnostic | Failure mode addressed | Disclosed untested assumption | +|---|---|---| +| Composition decomposition (§III-I.4; Scripts 39b–39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) | +| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) | +| Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | Same as above, plus the assumption that pool replacement preserves the negative-anchor property | +| Document-level ICCR (§III-L.3; Script 45) | Operational alarm rate proxy at per-document unit under three alarm definitions | Same as above | +| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check | +| Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within source firm | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$–$99.96\%$ within-firm at all four firms versus $76.7$–$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 45's mode-of-firms tie-break (§IV-M.4 footnote) | +| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test | +| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent | +| Pixel-identical conservative positive capture (§III-K.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold | +| LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity | + # Appendix B. Reproducibility Materials The full table-to-script provenance mapping, script source code, and report artefacts for every numerical table and figure in this paper are provided in the supplementary materials. Scripts run deterministically under fixed random seeds documented there; reviewer reproduction should re-emit artefacts from the listed scripts rather than rely on any local path layout.
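To make the firm-mean centring and integer-tie jitter in the first row of Table A.II concrete, here is a minimal sketch of that transform. It is not Scripts 39b–39e: the function name and data layout are hypothetical, and it assumes the PyPI `diptest` package for Hartigan's dip test.

```python
import numpy as np
import diptest  # PyPI package for Hartigan's dip test (an assumed dependency, not the paper's script)

def centred_jittered_p_median(dhash, firm, n_seeds=5):
    """Median dip-test p-value after firm-mean centring and integer-tie jitter.

    dhash: integer-valued dHash descriptor per signature (tied at integers).
    firm:  firm label per signature, used for firm-mean centring.
    """
    dhash = np.asarray(dhash, dtype=float)
    firm = np.asarray(firm)

    # Firm-mean centring removes between-firm location shifts
    # (the composition component of the apparent multimodality).
    centred = dhash.copy()
    for f in np.unique(firm):
        centred[firm == f] -= dhash[firm == f].mean()

    # Integer-tie jitter breaks the discrete-support ties (the integer
    # artefact); repeating over seeds and taking the median mirrors the
    # n_seeds = 5 design noted in the table.
    pvals = []
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        _, p = diptest.diptest(centred + rng.uniform(-0.5, 0.5, size=centred.shape))
        pvals.append(p)
    return float(np.median(pvals))
```

A large median p-value from this style of procedure is read, as in §III-I.4, as the absence of a detectable within-population antimode, not as proof of unimodality.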
diff --git a/paper/paper_a_conclusion_v3.md b/paper/paper_a_conclusion_v3.md index 6ce9541..1ef22a4 100644 --- a/paper/paper_a_conclusion_v3.md +++ b/paper/paper_a_conclusion_v3.md @@ -2,6 +2,6 @@ We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature classifier with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population. -Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (§III-M Table XXVII), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector. 
+Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector. -Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. 
*Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work. +Future work proceeds in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work. diff --git a/paper/paper_a_methodology_v3.md b/paper/paper_a_methodology_v3.md index 58a04b8..38ce8bb 100644 --- a/paper/paper_a_methodology_v3.md +++ b/paper/paper_a_methodology_v3.md @@ -229,7 +229,7 @@ We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptiv - The Big-4 K=3 mixture exhibits a reproducible three-component shape across LOOO folds at the descriptor-position level, with C1 reproducibly located at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$. - Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp); K=3 is therefore not used to assign operational labels to CPAs. -The operational signature-level classifier of §III-L is calibrated against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K. +The operational signature-level classifier of §III-H.1 is calibrated in §III-L against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K. ## K. Convergent Internal-Consistency Checks @@ -422,28 +422,11 @@ We *do not* interpret the deployed-rate excess as a presumed true-positive rate; The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated.
The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth. -Each diagnostic reported in this paper therefore addresses one specific failure mode of an unsupervised screening classifier (Table XXVII), with an explicitly disclosed untested assumption: - -**Table XXVII.** Diagnostics, failure mode addressed, and disclosed untested assumption. - -| Diagnostic | Failure mode addressed | Disclosed untested assumption | -|---|---|---| -| Composition decomposition (§III-I.4; Scripts 39b–39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) | -| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) | -| Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | Same as above + that pool replacement preserves the negative-anchor property | -| Document-level ICCR (§III-L.3; Script 45) | Operational alarm rate proxy at per-document unit under three alarm definitions | Same as above | -| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check | -| Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within source firm | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$–$99.96\%$ within-firm at all four firms versus $76.7$–$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 45's mode-of-firms tie-break (§IV-M.4 footnote) | -| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test | -| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal-consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent | -| Pixel-identical conservative positive capture (§III-K.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold | -| LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity | - -No single diagnostic provides ground-truth validation; together they define the limits of what can 
be supported in this corpus without signature-level ground truth. +Each diagnostic reported in this paper therefore addresses one specific failure mode of an unsupervised screening classifier; the full diagnostic-to-failure-mode-to-assumption map is given in Appendix A Table A.II. No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth. **Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity on the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, supplementary materials) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question. -**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier. +**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M.
The per-signature any-pair HC ICCR ($0.11$; Table XXII) and per-document HC+MC alarm-rate ICCR ($0.34$; Table XXIII) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier. **Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and missed-replication risk (per false negative) — neither directly observable from corpus data. diff --git a/paper/paper_a_results_v3.md b/paper/paper_a_results_v3.md index b3d9116..e439ed6 100644 --- a/paper/paper_a_results_v3.md +++ b/paper/paper_a_results_v3.md @@ -32,7 +32,7 @@ The Big-4 subset of the detection output yields 150,442 signatures with both des ## C. All-Pairs Intra-vs-Inter Class Distribution Analysis Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). -This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L). +This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-signature classification (Section III-H.1). Table IV summarizes the distributional statistics. **Table IV.** Cosine Similarity Distribution Statistics. @@ -393,7 +393,7 @@ Decile trend is broadly monotone in pool size with two minor reversals (decile 5 | D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ | | D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ | -Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XVI's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs: all $379$ are $1{:}1$ Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. 
The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XVI's per-firm rows sum to $74{,}854 = 75{,}233 - 379$. +Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XVI's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs (all $379$ are $1{:}1$ Firm C / Firm D documents that an alphabetically-ordered tie-break assigns to Firm C; full implementation detail in the supplementary materials). The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XVI's per-firm rows sum to $74{,}854 = 75{,}233 - 379$. ### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44) diff --git a/paper/paper_a_v4_combined.md b/paper/paper_a_v4_combined.md index 72e0692..4954b68 100644 --- a/paper/paper_a_v4_combined.md +++ b/paper/paper_a_v4_combined.md @@ -397,7 +397,7 @@ We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptiv - The Big-4 K=3 mixture exhibits a reproducible three-component shape across LOOO folds at the descriptor-position level, with C1 reproducibly located at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$. - Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp); K=3 is therefore not used to assign operational labels to CPAs. -The operational signature-level classifier of §III-L is calibrated against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K. +The operational signature-level classifier of §III-H.1 is calibrated in §III-L against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the deployed five-way box rule and the K=3 partition appear in §III-K. ## K. Convergent Internal-Consistency Checks @@ -590,28 +590,11 @@ We *do not* interpret the deployed-rate excess as a presumed true-positive rate;
- -| Diagnostic | Failure mode addressed | Disclosed untested assumption | -|---|---|---| -| Composition decomposition (§III-I.4; Scripts 39b–39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) | -| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) | -| Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | Same as above + that pool replacement preserves the negative-anchor property | -| Document-level ICCR (§III-L.3; Script 45) | Operational alarm rate proxy at per-document unit under three alarm definitions | Same as above | -| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check | -| Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within source firm | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$–$99.96\%$ within-firm at all four firms versus $76.7$–$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 45's mode-of-firms tie-break (§IV-M.4 footnote) | -| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test | -| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal-consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent | -| Pixel-identical conservative positive capture (§III-K.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold | -| LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity | - -No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth. +Each diagnostic reported in this paper therefore addresses one specific failure mode of an unsupervised screening classifier; the full diagnostic-to-failure-mode-to-assumption map is given in Appendix A Table A.II. No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth. **Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. 
We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity on the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, supplementary materials) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question. -**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier. +**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature any-pair HC ICCR ($0.11$; Table XXII) and per-document HC+MC alarm-rate ICCR ($0.34$; Table XXIII) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.
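To fix ideas on how the per-comparison operating points above are read, the following is a minimal sketch of an inter-CPA coincidence-rate computation over negative-anchor pairs. The array names are hypothetical and this is not Script 40b; the inputs are assumed to be sampled inter-CPA comparisons, each carrying one cosine similarity and one dHash distance.

```python
import numpy as np

def per_comparison_iccr(cos_sim, dhash_dist, cos_thr=0.95, dhash_thr=5):
    """Inter-CPA coincidence rates over sampled negative-anchor pairs.

    cos_sim / dhash_dist: per-pair cosine similarity and dHash distance for
    inter-CPA signature pairs, assumed negative (i.e., not template-related).
    """
    cos_sim = np.asarray(cos_sim, dtype=float)
    dhash_dist = np.asarray(dhash_dist)
    hit_cos = cos_sim > cos_thr          # cos > 0.95 operating point
    hit_dhash = dhash_dist <= dhash_thr  # dHash <= 5 operating point
    return {
        "cos": float(hit_cos.mean()),
        "dhash": float(hit_dhash.mean()),
        "joint": float((hit_cos & hit_dhash).mean()),  # joint AND event
    }
```

The three returned fractions correspond to the $\leq 0.0006$, $\leq 0.0013$, and $\leq 0.00014$ operating points quoted above; because the anchor is a proxy rather than ground truth, they are ICCRs, not false-accept rates.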
**Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and missed-replication risk (per false negative) — neither directly observable from corpus data. @@ -660,7 +643,7 @@ The Big-4 subset of the detection output yields 150,442 signatures with both des ## C. All-Pairs Intra-vs-Inter Class Distribution Analysis Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). -This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L). +This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-signature classification (Section III-H.1). Table IV summarizes the distributional statistics. **Table IV.** Cosine Similarity Distribution Statistics. @@ -1021,7 +1004,7 @@ Decile trend is broadly monotone in pool size with two minor reversals (decile 5 | D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ | | D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ | -Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XVI's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs: all $379$ are $1{:}1$ Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XVI's per-firm rows sum to $74{,}854 = 75{,}233 - 379$. +Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XVI's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs (all $379$ are $1{:}1$ Firm C / Firm D documents that an alphabetically-ordered tie-break assigns to Firm C; full implementation detail in the supplementary materials). The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XVI's per-firm rows sum to $74{,}854 = 75{,}233 - 379$. 
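Because the revised per-firm D2 text compresses the tie-break to a single clause, a minimal behavioural sketch may help. The function name is hypothetical and Script 45's actual implementation lives in the supplementary materials, but the mechanism matches the `np.argmax`-over-`np.unique` description removed above.

```python
import numpy as np

def document_firm(firms_of_signatures):
    """Mode-of-firms per-document firm assignment with an alphabetical tie-break.

    np.unique returns firm labels in sorted (alphabetical) order, and np.argmax
    returns the first index attaining the maximal count, so a 1:1 Firm C / Firm D
    tie resolves to the alphabetically first label.
    """
    labels, counts = np.unique(np.asarray(firms_of_signatures), return_counts=True)
    return labels[np.argmax(counts)]

# A 1:1 mixed-firm document is assigned to Firm C, never to Firm D:
assert document_firm(["Firm C", "Firm D"]) == "Firm C"
```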
### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44) @@ -1141,9 +1124,9 @@ Several limitations should be transparent. We group them into primary methodolog We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature classifier with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population. -Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (§III-M Table XXVII), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector. 
+Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (Appendix A Table A.II), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector. -Future work falls in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. 
*Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work. +Future work proceeds in four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality holds across the tested eligible scopes, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work. # References @@ -1241,12 +1224,15 @@ -# Appendix A. BD/McCrary Bin-Width Sensitivity (Signature Level) +# Appendix A. Supplementary Diagnostic Detail + +## A.1. BD/McCrary Bin-Width Sensitivity (Signature Level) The main text (Section III-I, Section IV-D Table VI) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as a threshold estimator. -This appendix documents the empirical basis for that framing by sweeping the bin width across four (variant, bin-width) panels: Firm A and full-sample, each in the cosine and $\text{dHash}_\text{indep}$ direction. +This subsection documents the empirical basis for that framing by sweeping the bin width within four variant panels: Firm A and full-sample, each in the cosine and $\text{dHash}_\text{indep}$ direction. + +**Table A.I.** BD/McCrary Bin-Width Sensitivity (two-sided $\alpha = 0.05$, $|Z| > 1.96$). - Two patterns are visible in Table A.I. First, the procedure consistently identifies a "transition" under every bin width, but the *location* of that transition drifts with bin width (monotonically for Firm A cosine: 0.987 → 0.985 → 0.980 → 0.975 as bin width grows from 0.003 to 0.015; non-monotonically for full-sample dHash: 2 → 10 → 9 as the bin width grows from 1 to 3). @@ -1277,6 +1262,25 @@ This observation supports the main-text decision to use BD/McCrary as a density- Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials. +## A.2.
Diagnostic Summary + +Section III-M positions the unsupervised-diagnostic strategy as a set of complementary checks, each addressing one specific failure mode of an unsupervised screening classifier and each relying on an explicitly disclosed untested assumption. Table A.II maps each diagnostic to the failure mode it addresses and to the untested assumption it relies on. + +**Table A.II.** Diagnostics, failure mode addressed, and disclosed untested assumption. + +| Diagnostic | Failure mode addressed | Disclosed untested assumption | +|---|---|---| +| Composition decomposition (§III-I.4; Scripts 39b–39e) | Whether descriptor multimodality is within-population (mechanism) or between-group (composition + integer artefact); $p_{\text{median}} = 0.35$ under joint firm-mean centring + integer-tie jitter | Integer-tie jitter and firm-mean centring are unbiased over the descriptor support; corroborated by Big-4 per-firm jitter (Script 39d; per-firm dHash rejection disappears under jitter at every Big-4 firm) and Big-4 pooled centred + jittered ($n_{\text{seeds}} = 5$; Script 39e) | +| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) | +| Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | Same as above, plus the assumption that pool replacement preserves the negative-anchor property | +| Document-level ICCR (§III-L.3; Script 45) | Operational alarm rate proxy at per-document unit under three alarm definitions | Same as above | +| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors unreliable; cluster-robust analysis is a future check | +| Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within source firm | Concentration depends on deployed-rule semantics (the stricter same-pair joint event yields $97.0$–$99.96\%$ within-firm at all four firms versus $76.7$–$98.8\%$ under any-pair; §III-L.4); per-document per-firm assignment uses Script 45's mode-of-firms tie-break (§IV-M.4 footnote) | +| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test | +| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent | +| Pixel-identical conservative positive capture (§III-K.4; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold | +| LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity | + # Appendix B. Reproducibility Materials The full table-to-script provenance mapping, script source code, and report artefacts for every numerical table and figure in this paper are provided in the supplementary materials.
Scripts run deterministically under fixed random seeds documented there; reviewer reproduction should re-emit artefacts from the listed scripts rather than rely on any local path layout.
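As a closing illustration of the alert-rate sensitivity sweep listed in Table A.II (§III-L.5; Script 46), here is a minimal sketch of a threshold-perturbation sweep. The inputs and grid values are assumptions for illustration, not Script 46 itself: `best_cos` and `best_dhash` are taken to be each signature's best-match cosine similarity and dHash distance.

```python
import numpy as np

def alert_rate_sweep(best_cos, best_dhash, cos_grid, dhash_grid):
    """Fraction of signatures whose best match clears each (cos, dHash)
    operating point; a descriptive perturbation sweep, not a plateau test."""
    best_cos = np.asarray(best_cos, dtype=float)
    best_dhash = np.asarray(best_dhash)
    rates = np.empty((len(cos_grid), len(dhash_grid)))
    for i, c in enumerate(cos_grid):
        for j, d in enumerate(dhash_grid):
            # Alert yield at the perturbed operating point (cos > c, dHash <= d).
            rates[i, j] = np.mean((best_cos > c) & (best_dhash <= d))
    return rates

# Example grid around the deployed operating point (cos > 0.95, dHash <= 5):
# alert_rate_sweep(best_cos, best_dhash, cos_grid=[0.94, 0.95, 0.96], dhash_grid=[4, 5, 6])
```

Comparing the gradient of `rates` around the deployed cell with neighbouring cells gives the descriptive sensitivity reading of §III-L.5.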