From a06e9456e67194082cc9abd77cb647d0c246db85 Mon Sep 17 00:00:00 2001
From: gbanyan
Date: Tue, 12 May 2026 15:15:36 +0800
Subject: [PATCH] Add Phase 2 §III-G..L methodology rewrite (v4.0 draft)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Single consolidated draft of Section III sub-sections G through L, replacing the v3.20.0 §III-G..L block with the Big-4 reframe.

Sub-sections (note: G/H/I/J/K/L written together to keep cross-references coherent; user originally requested G/I/J/L only, but the H rewrite and the new K were required for cohesion):

G Unit of Analysis and Scope -- accountant unit defined; Big-4 scope justified by within-pool homogeneity, dip-test multimodality, LOOO feasibility.
H Reference Populations -- Firm A pivots from "calibration anchor" to "templated-end case study"; non-Big-4 added as reverse-anchor reference.
I Distributional Characterisation -- dip-test multimodality at Big-4 level (p < 1e-4 on both axes); BD/McCrary null retained as an honest density-smoothness diagnostic.
J Mixture Model and Operational Threshold Derivation -- K=2 vs K=3 fits reported; K=3 selected, with rationale deferred to §III-K LOOO evidence.
K Convergent Validation (NEW in v4.0) -- three-lens Spearman convergence (rho >= 0.879); per-signature K=3 fit (kappa = 0.870 vs per-CPA); K=2 LOOO UNSTABLE / K=3 LOOO PARTIAL; pixel-identity FAR 0% on 262 ground-truth signatures.
L Per-Document Classification -- inherits the v3.x five-way box rule for continuity; K=3 alternative output documented.

Includes: cross-reference index, script-to-section evidence map (linking each empirical claim to the Script 32-40 spike commit), and 5 open questions flagged at the end for partner / reviewer review of this draft.

Output: paper/v4/paper_a_methodology_v4_section_iii.md (single file replacing the v3.20.0 §III-G..L block on this branch only; the v3.20.0 paper/paper_a_methodology_v3.md is left untouched).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../v4/paper_a_methodology_v4_section_iii.md | 177 ++++++++++++++++++
 1 file changed, 177 insertions(+)
 create mode 100644 paper/v4/paper_a_methodology_v4_section_iii.md

diff --git a/paper/v4/paper_a_methodology_v4_section_iii.md b/paper/v4/paper_a_methodology_v4_section_iii.md
new file mode 100644
index 0000000..31a4ae1
--- /dev/null
+++ b/paper/v4/paper_a_methodology_v4_section_iii.md
@@ -0,0 +1,177 @@

# Section III. Methodology — v4.0 Draft (Big-4 reframe)

> **Draft note (2026-05-12).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and are not reproduced here. The v4.0 sub-section ordering inserts a new **§III-K Convergent Validation** between the old §III-K (per-document classification, now §III-L) and the old §III-J (validation anchors, now folded into §III-K and §III-J). Empirical anchors throughout reference Scripts 32–40 on branch `paper-a-v4-big4`.

## G. Unit of Analysis and Scope

We analyse signatures at two levels of aggregation. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of all signature-level capture-rate analyses (§IV-D, §IV-F, §IV-G). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), per-CPA convergent validation (§III-K), and the leave-one-firm-out reproducibility analysis (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum of 10 signatures per CPA is imposed so that the per-CPA mean is a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
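For concreteness, a minimal sketch of this accountant-level aggregation is given below, assuming the per-signature descriptors live in a pandas DataFrame with one row per signature. The column names (`cpa_id`, `best_match_cosine`, `dhash_indep_min`) are illustrative placeholders, not the actual Script 32–40 schema.

```python
import pandas as pd

def per_cpa_means(sig_df: pd.DataFrame, min_sig: int = 10) -> pd.DataFrame:
    """Aggregate per-signature descriptors to per-CPA means (Sec. III-G).

    Drops CPAs with fewer than `min_sig` signatures; they remain in the
    per-signature analyses but are excluded at the accountant level.
    """
    per_cpa = sig_df.groupby("cpa_id").agg(
        n_sig=("best_match_cosine", "size"),
        mean_cos=("best_match_cosine", "mean"),   # per-CPA mean best-match cosine
        mean_dhash=("dhash_indep_min", "mean"),   # per-CPA mean independent-min dHash
    )
    return per_cpa[per_cpa["n_sig"] >= min_sig]
```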
We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Classification labels are signature-level quantities throughout this paper; we do not translate them into per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.

We adopt one stipulation about same-CPA pair detectability:

> **(A1) Pair-detectability.** *If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the cross-year same-CPA pool used by the max-cosine / min-dHash computation.*

A1 is plausible for high-volume stamping or firm-level electronic signing workflows, but it is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires in order to be sensitive to replication.

**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, and all §IV results except §IV-K) are restricted to the four largest accounting firms in Taiwan (referred to throughout as the Big-4: 勤業眾信聯合 / 安侯建業聯合 / 資誠聯合 / 安永聯合, pseudonymised as Firms A through D, in that order, for the rest of the paper). The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$, totalling 150,442 signatures with both descriptors available. Restricting the analyses to the Big-4 is a methodological choice driven by three considerations:

1. **Within-pool homogeneity for mixture characterisation.** Pooling Big-4 with mid- and small-firm CPAs introduces a heterogeneous tail of $\sim$250 CPAs distributed across $\sim$30 firms with idiosyncratic signing practices and small per-firm samples. Including this tail in a single 2D-GMM fit distorts the inferred component locations and weights — empirically, the published v3.x "natural threshold" $\overline{\text{cos}} = 0.945$, $\overline{\text{dHash}} = 8.10$ shifts to $\overline{\text{cos}} = 0.975$, $\overline{\text{dHash}} = 3.76$ when restricted to Big-4 (Script 34, bootstrap 95% CI $[0.974, 0.977]$ / $[3.48, 3.97]$, $n_{\text{boot}} = 500$). The shift is large relative to either CI half-width, indicating that the mid/small-firm tail was actively pulling the v3.x crossing rather than acting as inert noise.

2. **Statistical multimodality at the accountant level.** Within the Big-4 sub-corpus, the Hartigan dip test rejects unimodality on both axes ($p < 10^{-4}$, $n = 437$ CPAs; Script 34) for the per-CPA $\overline{\text{cos}}$ and $\overline{\text{dHash}}$ distributions. No such rejection holds within any single firm taken alone (the dip test fails to reject unimodality for each firm's per-CPA distribution, $p > 0.92$; Scripts 32, 34) or within the all-non-Firm-A pool ($p_{\text{cos}} > 0.99$; Script 32). The Big-4 is the smallest scope at which the dip test supports applying a finite-mixture model to the per-CPA distribution.

3. **Reproducibility under leave-one-firm-out cross-validation.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 mixture fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm); the held-out firm's CPAs are then classified by the rule derived from the other three firms. No analogous firm-level fold is available outside the Big-4, because mid/small firms have roughly 1–30 CPAs each, far below the sample sizes required for stable mixture-fit estimation on a held-out fold.
The full corpus (686 CPAs across Big-4 plus mid/small firms) is reported as a robustness analysis in §IV-K, so that v4.0's pipeline-reproducibility claim is demonstrable at multiple scopes while the primary methodology operates at the scope where its statistical assumptions (dip-test multimodality, firm-level LOOO feasibility) are met.

## H. Reference Populations

v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single Firm-A-as-anchor framing.

**Internal reference: Firm A as the templated end of the Big-4 (case study).** Firm A is empirically the most digitally replicated of the Big-4. In the Big-4 K=3 mixture (§III-J, Script 35), 0% of Firm A CPAs fall in the C1 hand-leaning component, 17.5% in the C2 mixed component, and 82.5% in the C3 replicated component; the pattern is nearly inverted at Firm C (23.5% C1, 75.5% C2, 1.0% C3). Automated byte-level pair analysis (§IV-F.1; reproduction artifact in Appendix B) identifies 145 Firm A signatures byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years. Byte-identity implies pixel-identity by construction, and independent hand-signing cannot produce pixel-identical images across distinct reports — these pairs establish image reuse as a concrete, threshold-free phenomenon within Firm A.

In v4.0, Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B, C, and D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the within-Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level ground truth that anchors §III-K's pixel-identity false-alarm-rate validation.

**External reference: non-Big-4 as the reverse-anchor reference for cross-validation.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-sized firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor convergent-validation analysis: each Big-4 CPA's deviation from the reference centre, measured as the marginal cosine left-tail percentile under the reference, is one of three independent statistical lenses that v4.0 uses to validate its hand-leaning ranking. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its sole role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be cross-checked. The construction is sketched below.
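As an illustration of the reverse-anchor construction, the following sketch fits the robust reference Gaussian and computes the Lens-2 score of §III-K. It is a schematic reading of the text, not the Script 38 implementation, and the array layout (column 0 = mean cosine, column 1 = mean dHash) is an assumption.

```python
import numpy as np
from scipy.stats import norm
from sklearn.covariance import MinCovDet

def reverse_anchor_scores(X_nonbig4: np.ndarray, X_big4: np.ndarray) -> np.ndarray:
    """Lens-2 sketch: depth in the left cosine tail of the non-Big-4 reference.

    Fits a robust 2D Gaussian to the non-Big-4 per-CPA points via Minimum
    Covariance Determinant (support fraction 0.85, as in the text), then
    scores each Big-4 CPA by the negated marginal-cosine CDF, so a higher
    score means further into the hand-leaning left tail.
    """
    mcd = MinCovDet(support_fraction=0.85, random_state=42).fit(X_nonbig4)
    mu_cos = mcd.location_[0]                       # reference centre, cosine axis
    sd_cos = float(np.sqrt(mcd.covariance_[0, 0]))  # reference marginal SD, cosine axis
    return -norm.cdf(X_big4[:, 0], loc=mu_cos, scale=sd_cos)
```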
The reverse-anchor reference centre sits at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ — below the Big-4 K=3 hand-leaning component centre ($\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$) in cosine and above it in dHash. The non-Big-4 reference population is therefore itself in the lower-cosine / higher-dHash (more hand-signed-leaning) regime relative to the Big-4, which suits its role: under the reference's marginal cosine CDF, the more hand-leaning a Big-4 CPA is, the further it sits toward the left tail, so the lens orders Big-4 CPAs in the hand-leaning direction against a strictly out-of-target yardstick.

## I. Distributional Characterisation at the Accountant Level

This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G. Three diagnostic procedures are applied: a univariate unimodality test on each marginal axis, a 2D Gaussian mixture fit (developed further in §III-J), and a density-smoothness diagnostic. These procedures play successively weaker evidential roles in v4.0: the dip test decides whether a mixture model is statistically warranted at all; the mixture is the operational generative model whose fit is reported in §III-J; and the density-smoothness diagnostic complements the mixture with a non-parametric check on local discontinuity that is independent of the mixture's parametric form.

**1. Hartigan dip test on each marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). Both marginals reject unimodality with $p < 10^{-4}$ (Script 34). For comparison, no rejection of unimodality holds at any other scope examined: Firm A alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.999$, $p_{\text{dHash}} = 0.906$, $n = 266$); all-non-Firm-A pooled, including non-Big-4 ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). The dip-test multimodality at the Big-4 level is the empirical justification for fitting a finite-mixture model in §III-J; without it, the mixture would be a forced fit on an essentially unimodal distribution. A sketch of the marginal test follows.
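The sketch below uses the community `diptest` package for Hartigan's test with bootstrap p-values; whether Scripts 32/34 use this package or another implementation is an assumption here, and `per_cpa` refers to the DataFrame from the §III-G sketch.

```python
import numpy as np
import diptest  # community implementation of Hartigan & Hartigan's dip test

def marginal_dip_pvalues(per_cpa, n_boot: int = 2000) -> dict:
    """Bootstrap dip-test p-values for the two per-CPA marginals."""
    pvals = {}
    for col in ("mean_cos", "mean_dhash"):
        x = np.asarray(per_cpa[col], dtype=float)
        _dip, pval = diptest.diptest(x, boot_pval=True, n_boots=n_boot)
        pvals[col] = pval
    return pvals
```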
**2. Mixture-model evidence.** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42) fit to the joint $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ respectively over $n_{\text{boot}} = 500$ resamples (Script 34). A 3-component fit (§III-J) is BIC-preferred ($-1111.93$ vs $-1108.45$), and the K=3 component structure carries different operational implications; we develop both fits in §III-J and discuss the choice between them in §III-K.

**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38, 39] as a *density-smoothness diagnostic* rather than as an additional threshold estimator, on each marginal axis (cosine in bins of $0.002$, dHash in integer bins). The diagnostic identifies no significant transition on either axis at $\alpha = 0.05$ (Scripts 32, 34). This null result is consistent with the mixture-model evidence: the K=3 components overlap rather than separate sharply, so a local-discontinuity test does not flag a transition. We retain BD/McCrary in v4.0 as a non-parametric robustness check — its null result indicates that no boundary location chosen by the mixture model is corroborated by a sharp local discontinuity, which we read as an honest description of the data's continuous-overlap structure rather than as a methodological deficiency.

The operational threshold derivation (§III-J) and the convergent validation (§III-K) both operate on the K=3 fit. The K=2 fit and its marginal crossings are reported here for completeness and to enable comparison with v3.x across paper versions, but they are not used as v4.0's operational rule.

## J. Mixture Model and Operational Threshold Derivation

We fit 2- and 3-component 2D Gaussian Mixture Models (full covariance, $n_{\text{init}} = 15$, $\text{max\_iter} = 500$, fixed seed 42) to the joint $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution of the 437 Big-4 CPAs. We report both fits and develop the operational threshold from the K=3 fit, with the K=2 fit retained as a reference; the choice of K=3 over K=2 is justified in §III-K.

**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$ ("hand-leaning"), and $(0.983, 2.41)$, weight $0.311$ ("replicated"). BIC = $-1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$.

**K=3 fit (selected for v4.0).** Three components, sorted by ascending cosine mean:

| Component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | Weight | Interpretation |
|---|---|---|---|---|
| C1 hand-leaning | 0.9457 | 9.17 | 0.143 | low cosine, high dHash; cross-firm membership |
| C2 mixed | 0.9558 | 6.66 | 0.536 | intermediate; spans all four Big-4 firms |
| C3 replicated | 0.9826 | 2.41 | 0.321 | high cosine, low dHash; dominated by Firm A |

BIC = $-1111.93$, slightly preferred over K=2 ($\Delta\text{BIC} = -3.5$). The K=3 components carry direct interpretations from the joint Big-4 distribution: C3 is the templated mode dominated by Firm A (82.5% of Firm A CPAs are assigned to C3 by hard-posterior assignment; Script 35); C1 is the hand-leaning mode with cross-firm membership (24 Firm C, 10 Firm B, 6 Firm D, 0 Firm A; Script 35); C2 is the intermediate mode, containing a majority (53.6%) of Big-4 CPAs overall and members from all four firms. A code sketch of the fit and the component relabelling follows.
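The sketch below mirrors the reported fit settings and the hard-posterior relabelling by ascending cosine mean; `X` is assumed to be the (437, 2) array of per-CPA points, and the helper names are illustrative, not the Script 34/35 code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture(X: np.ndarray, k: int) -> GaussianMixture:
    """K-component 2D GMM with the settings quoted in the text."""
    return GaussianMixture(
        n_components=k, covariance_type="full",
        n_init=15, max_iter=500, random_state=42,
    ).fit(X)

def k3_hard_labels(X: np.ndarray) -> np.ndarray:
    """Hard-posterior K=3 labels, re-indexed so that 0 = C1 (lowest cosine
    mean), 1 = C2, and 2 = C3 (highest cosine mean)."""
    gmm = fit_mixture(X, 3)
    order = np.argsort(gmm.means_[:, 0])  # raw component index -> cosine rank
    remap = np.empty(3, dtype=int)
    remap[order] = np.arange(3)
    return remap[gmm.predict(X)]

# BIC comparison as reported above (lower is better):
#   fit_mixture(X, 2).bic(X)  vs  fit_mixture(X, 3).bic(X)
```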
**Operational threshold.** v4.0's per-CPA classifier is the K=3 hard-posterior assignment: a CPA $a$ is assigned to the component $C_k$ that maximises $P(C_k \mid (\overline{\text{cos}}_a, \overline{\text{dHash}}_a))$ under the K=3 fit. CPAs assigned to C3 are labelled *templated*; CPAs assigned to C1 or C2 are labelled *non-templated*, with C1 carrying the additional designation *hand-leaning* and C2 the designation *mixed*. The per-CPA classifier is an honest description of the joint Big-4 distribution rather than a discovered "natural threshold" — the dip-test rejection of unimodality (§III-I) supports treating the data as multi-modal, but the components overlap in their tails (BD/McCrary null on the marginal densities), so the classifier's hard boundary inherits a calibration uncertainty that is quantified in §III-K.

The signature-level operational rule is inherited from v3.x for continuity and is reported in §III-L: a signature is classified as *non-hand-signed* iff $\text{cos} > 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. The agreement between this signature-level box rule and the per-CPA K=3 hard label is reported in §III-K (Cohen $\kappa = 0.66$ under binary collapse) and is one of the convergent-validation signals supporting v4.0's overall methodology.

## K. Convergent Validation

v4.0 validates its mixture-derived per-CPA classifier through three independent statistical lenses, supplemented by leave-one-firm-out cross-validation and a hard ground-truth false-alarm-rate check.

**1. Three convergent lenses (Script 38).** For each Big-4 CPA we compute three scalars:

- **Lens 1 (mixture posterior):** $P(\text{C1}_{\text{hand-leaning}})$ from the K=3 fit of §III-J.
- **Lens 2 (reverse-anchor directional):** $-F_{\text{ref}}(\overline{\text{cos}}_a)$, where $F_{\text{ref}}$ is the marginal cosine CDF of the non-Big-4 reference Gaussian from §III-H. A higher score corresponds to a position deeper in the left tail of the reference cosine distribution, i.e., more deviated in the hand-leaning direction relative to a strictly out-of-target population.
- **Lens 3 (Paper A operational rule):** the per-CPA rate at which the inherited signature-level rule ($\text{cos} > 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$) fails to fire — the CPA's hand-leaning share under the v3.x box rule.

Pairwise Spearman rank correlations among the three lenses, $n = 437$:

| Pair | Spearman $\rho$ | $p$-value |
|---|---|---|
| Lens 1 vs Lens 3 | $+0.963$ | $< 10^{-248}$ |
| Lens 2 vs Lens 3 | $+0.889$ | $< 10^{-149}$ |
| Lens 1 vs Lens 2 | $+0.879$ | $< 10^{-142}$ |

All three pairwise correlations are at least $0.879$, well above the $0.7$ convergence bar, and the three lenses agree on the per-firm hand-leaning ranking — Firm A < Firm B < Firm D $\approx$ Firm C — by mean P(C1), mean reverse-anchor score, and mean per-CPA hand-leaning rate. Convergence on a per-CPA ranking by three approaches that draw on different statistical machinery (finite-mixture posterior, deviation from a strictly out-of-target reference, and a published box rule) is the strongest single methodology-validation signal in v4.0. The computation is sketched below.
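A sketch of the lens-convergence computation follows; the three arrays are assumed to be aligned by CPA (Lens 1 from the K=3 posterior, Lens 2 from the reverse-anchor sketch in §III-H, Lens 3 as the per-CPA box-rule hand-leaning rate).

```python
from itertools import combinations
from scipy.stats import spearmanr

def lens_convergence(lens1, lens2, lens3) -> None:
    """Print pairwise Spearman rank correlations among the three lenses."""
    lenses = {
        "Lens 1 (posterior)": lens1,
        "Lens 2 (reverse-anchor)": lens2,
        "Lens 3 (box rule)": lens3,
    }
    for (name_a, a), (name_b, b) in combinations(lenses.items(), 2):
        rho, p = spearmanr(a, b)
        print(f"{name_a} vs {name_b}: rho = {rho:+.3f}, p = {p:.2e}")
```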
**2. Per-signature convergence (Script 39).** The per-CPA convergence above could in principle reflect averaging across heterogeneous within-CPA signatures rather than coherent within-CPA behaviour. We address this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points — and comparing labels:

| Pair | Cohen $\kappa$ (binary collapse, 1 = replicated) |
|---|---|
| Paper A box rule vs K=3 per-CPA hard label | $0.662$ |
| Paper A box rule vs K=3 per-signature hard label | $0.559$ |
| K=3 per-CPA vs K=3 per-signature hard label | $0.870$ |

The $\kappa = 0.870$ between the per-CPA-fit and per-signature-fit K=3 labels indicates that per-CPA aggregation does not materially distort the K=3 classifier; the same component structure is recovered when the mixture is fit at signature granularity. The moderate $\kappa = 0.56$–$0.66$ between Paper A's box rule and either K=3 fit reflects different decision geometries — a rectangular box versus a Gaussian-mixture posterior boundary — rather than a fundamental disagreement about the signal.

**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** We test cross-firm reproducibility of the mixture fit by holding out each Big-4 firm in turn, refitting the GMM on the other three firms' CPAs, and reporting the held-out firm's classifier behaviour under the fold-derived rule. We report both K=2 and K=3 LOOO results:

- *K=2 LOOO is unstable.* Across the four folds the marginal cosine crossing varies between 0.938 (Firm A held out) and 0.976 (Firm D held out); Script 36 reports a maximum fold deviation of $0.028$, $5.6\times$ the bootstrap CI half-width of $0.005$. When Firm A is held out, the fold rule classifies 100% of held-out Firm A as *replicated*; when any other firm is held out, the fold rule classifies 0% of the held-out firm as *replicated*. This pattern indicates that the K=2 boundary is essentially a *firm-mass separator* between Firm A and the other three Big-4 firms, not a *mechanism boundary* between hand-leaning and replicated CPAs.

- *K=3 LOOO is partially stable.* The C1 hand-leaning component shape is reproducible across folds: the maximum deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.025$ in mixture weight (Script 37). The held-out firm's C1 membership rate varies more substantially (e.g., $36.3\%$ for held-out Firm C vs $23.5\%$ baseline; $4.7\%$ for held-out Firm A vs $0\%$ baseline), reflecting that the fold's decision boundary moves modestly when the training composition changes. We document the held-out membership rates as a *calibration uncertainty band* of $\pm 5$–$13$ percentage points around the baseline rate, rather than as a defect of the K=3 fit.

The contrast — K=2 unstable, K=3 stable in component shape with bounded membership uncertainty — is the empirical justification for selecting K=3 over K=2 as v4.0's operational mixture (referenced in §III-J). A schematic of the fold loop follows.
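The schematic below reuses `fit_mixture` from the §III-J sketch; the `firm` column and the DataFrame layout are assumptions for exposition, not the Scripts 36/37 code.

```python
import numpy as np

DESCRIPTOR_COLS = ["mean_cos", "mean_dhash"]

def loo_firm_c1_rates(per_cpa) -> dict:
    """Leave-one-firm-out: refit K=3 on three firms, then report the
    held-out firm's C1 (hand-leaning) membership rate under the fold fit."""
    rates = {}
    for firm in sorted(per_cpa["firm"].unique()):
        train = per_cpa.loc[per_cpa["firm"] != firm, DESCRIPTOR_COLS].to_numpy()
        held = per_cpa.loc[per_cpa["firm"] == firm, DESCRIPTOR_COLS].to_numpy()
        gmm = fit_mixture(train, 3)                    # Sec. III-J sketch
        c1_raw = int(np.argsort(gmm.means_[:, 0])[0])  # lowest-cosine component
        rates[firm] = float(np.mean(gmm.predict(held) == c1_raw))
    return rates
```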
**4. Pixel-identity false-alarm rate (Script 40).** The only hard ground truth available in the corpus is the set of Big-4 signatures whose nearest same-CPA match is byte-identical after crop and normalisation ($n = 262$: 145 Firm A, 8 Firm B, 107 Firm C, 2 Firm D). Byte-identical pairs cannot arise from independent hand-signing and are taken as ground-truth *replicated*. We report each classifier's false-alarm rate — here, the rate at which a ground-truth replicated, pixel-identical signature is mislabelled as hand-leaning:

| Classifier | Misclassified | FAR | Wilson 95% CI |
|---|---|---|---|
| Paper A box rule | 0 / 262 | $0\%$ | $[0\%, 1.45\%]$ |
| K=3 per-CPA hard label (C3 = replicated) | 0 / 262 | $0\%$ | $[0\%, 1.45\%]$ |
| Reverse-anchor score (prevalence-calibrated cut) | 0 / 262 | $0\%$ | $[0\%, 1.45\%]$ |

All three v4.0 classifiers achieve perfect detection on the pixel-identity ground-truth subset, with a Wilson 95% upper bound on the FAR of $1.45\%$. We acknowledge two caveats. First, pixel-identity is a *conservative subset* of true replication — only the byte-equal extreme — so a low FAR on this subset is necessary but not sufficient evidence of correct replication detection across the full non-hand-signed class. Second, no signature-level ground truth exists for the hand-leaning class, so the reverse-anchor classifier's cut is chosen by *prevalence calibration* against Paper A's overall replicated rate; this is documented as a v4.0 limitation in §V-G and would benefit from human-rated validation in a future revision.

## L. Per-Document Classification

The per-signature classifier inherits its operational thresholds from v3.x: $\text{cos} > 0.95$ on the cosine dimension and $\text{dHash}_{\text{indep}} \leq 5$ / $> 15$ on the structural dimension. v4.0 retains these cuts unchanged for two reasons: (a) per-signature classification is an operational inheritance from the prior literature with an established interpretation, and (b) §III-K reports that the inherited rule agrees with the v4.0 K=3 classifier at $\rho = 0.963$ (per-CPA ranking) and $\kappa = 0.66$ (signature-level binary collapse), supporting continued use without recalibration. We document the K=3 hard label as an *alternative output* available alongside the inherited rule for users who prefer a posterior-based assignment.

We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:

1. **High-confidence non-hand-signed:** cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence.

2. **Moderate-confidence non-hand-signed:** cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff.

3. **High style consistency:** cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration — consistent with a CPA who signs very consistently but not via image reproduction.

4. **Uncertain:** cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95, where neither descriptor provides sufficient convergent evidence in either direction.

5. **Likely hand-signed:** cosine below the all-pairs KDE crossover threshold (0.837).

Because each audit report typically carries two certifying-CPA signatures, we aggregate signature-level outcomes to a document-level label using a worst-case rule: the document inherits the *most replication-consistent* signature label among its signatures (rank order: High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed). This rule reflects the detection goal of flagging any potentially non-hand-signed report, rather than requiring all signatures on the report to converge. A sketch of the five-way rule and the document roll-up follows.
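A minimal sketch of the five-way rule and the worst-case document roll-up, using the thresholds quoted above; the shortened label strings are illustrative.

```python
# Most- to least-replication-consistent, per the document-level rank order.
LABEL_ORDER = ["high_conf_nonhand", "mod_conf_nonhand",
               "style_consistent", "uncertain", "likely_hand"]
KDE_CROSSOVER = 0.837  # all-pairs intra/inter KDE crossover from the text

def classify_signature(cos: float, dhash_indep: float) -> str:
    """Five-way signature rule (Sec. III-L)."""
    if cos > 0.95:
        if dhash_indep <= 5:
            return "high_conf_nonhand"
        if dhash_indep <= 15:
            return "mod_conf_nonhand"
        return "style_consistent"
    return "uncertain" if cos >= KDE_CROSSOVER else "likely_hand"

def classify_document(signature_labels: list) -> str:
    """Document inherits its most replication-consistent signature label."""
    return min(signature_labels, key=LABEL_ORDER.index)
```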
The K=3 alternative output assigns each signature directly to {C1 hand-leaning, C2 mixed, C3 replicated} via the per-signature K=3 fit of §III-K. The cross-tabulation between the five-way box-rule classifier and the K=3 hard label is reported in §IV-G (cross-tab table; Script 39 output).

---

## Cross-reference index

- **Big-4 sub-corpus definition** (§III-G) — 437 CPAs, 150,442 signatures.
- **Reference populations** (§III-H) — Firm A as templated-end case study; non-Big-4 ($n = 249$) as reverse-anchor reference.
- **Distributional characterisation** (§III-I) — dip-test multimodality; BD/McCrary null; mixture support.
- **K=3 components** (§III-J) — C1 hand-leaning $(0.946, 9.17, w = 0.143)$; C2 mixed $(0.956, 6.66, w = 0.536)$; C3 replicated $(0.983, 2.41, w = 0.321)$.
- **Convergent validation** (§III-K) — three lenses ($\rho \geq 0.879$); $\kappa \geq 0.56$ at signature level; K=3 LOOO partially stable; pixel-identity FAR $0\%$ on $n = 262$.
- **Per-document classifier** (§III-L) — five-way inherited rule plus K=3 alternative output.

## Script-to-section evidence map

| Section | Empirical anchor | Script |
|---|---|---|
| III-G scope | mid/small-firm tail distortion | 32, 34 |
| III-H Firm A descriptive | byte-level pair analysis | (v3 §IV-F.1 inherited) |
| III-H non-Big-4 reference | MCD reference Gaussian | 38 |
| III-I dip test | Big-4 marginal $p < 10^{-4}$ | 32, 34 |
| III-I BD/McCrary null | density-smoothness diagnostic | 32, 34 |
| III-J K=2 components | bootstrap CI on marginal crossings | 34 |
| III-J K=3 components | full-Big-4 fit | 35 |
| III-K three convergent lenses | per-CPA Spearman | 38 |
| III-K per-signature convergence | Cohen $\kappa$, binary collapse | 39 |
| III-K K=2 LOOO unstable | $\Delta\text{cos} = 0.028$ | 36 |
| III-K K=3 LOOO partial | C1 shape stable | 37 |
| III-K pixel-identity FAR | $0\%$ on $n = 262$ | 40 |
| III-L five-way inherited rule | (no v4 change) | (v3 §III-K inherited) |
| III-L K=3 alternative output | per-signature K=3 fit | 39 |

---

> **Open questions for partner / reviewer review of this draft:**
>
> 1. §III-G scope justification — is the three-point argument (within-pool homogeneity, dip-test multimodality, LOOO feasibility) sufficiently strong, or should we add a fourth point (e.g., a generalisability claim restricted to the Big-4 audit context)?
> 2. §III-H Firm A reframing — is "case study of the templated end" the right phrase, or should we use "calibration reference, descriptively defined post hoc"?
> 3. §III-J K=3 vs K=2 selection — we choose K=3 on the basis of LOOO stability (§III-K), not BIC alone ($\Delta\text{BIC}$ of only 3.5). Is this acceptable to a reviewer, or should we strengthen the BIC argument?
> 4. §III-L — keeping the inherited five-way rule intact preserves continuity with the v3.x literature, but it means the operational classifier is *not* derived from v4.0's K=3 mixture. Is this hybrid (mixture for characterisation, inherited box rule for classification) acceptable, or should v4.0 commit to the K=3 hard label as the primary classifier?
> 5. The Section IV (Results) regeneration in Phase 3 will need to mirror this section's structure — particularly the Tables IV–XVIII numbering. We should confirm the numbering and section-letter scheme before Phase 3 begins.