Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21

Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE
  antimode, BD/McCrary, and Beta mixture EM on accountant-level means.
  Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979,
  LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at
  accountant level (consistent with smooth clustering, not sharp
  discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions,
  used at signature all-pairs level) vs KDE antimode (single-distribution
  local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds a CPA-level 70/30
  held-out fold. Calibration thresholds are derived from the 70% fold only;
  held-out rates are reported with Wilson 95% CIs (e.g. cos>0.95
  held-out rate=93.61% [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10
  signatures (9 CPAs excluded for insufficient sample). Reconciled across
  intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature
  label determines document label).

## Pixel-identity validation strengthened
- Script 21: built a ~50,000-pair inter-CPA random negative anchor (replacing
  the original n=35 same-CPA low-similarity negative set, whose Wilson CIs
  were untenably wide).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset"
  language per codex feedback (byte-identical positives are a subset, not
  a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods"
  throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of
  unimodality; rejection is consistent with but not a direct test of
  bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to
  "replication-dominated" in prior commit; this commit strengthens that
  framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity"
  (BD/McCrary non-transition at accountant level rules out sharp discrete
  boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP
  or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract,
  Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to
  XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no
  pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired
  it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).
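
For reference, the interior-bin statistic behind the BD/McCrary
non-transition claim (as implemented in Script 20's method_bd_mccrary)
compares each bin count with the mean of its two neighbors, standardized
by the Burgstahler-Dichev variance; a minimal numpy sketch:

```python
import numpy as np

def bd_z(counts):
    """Standardized difference between each interior bin count and the
    mean of its two neighbors (Burgstahler-Dichev test statistic).
    Boundary bins have no two-sided neighborhood and stay NaN."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p = counts / N
    z = np.full(len(counts), np.nan)
    for i in range(1, len(counts) - 1):
        expected = 0.5 * (counts[i - 1] + counts[i + 1])
        var = (N * p[i] * (1 - p[i])
               + 0.25 * N * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        if var > 0:
            z[i] = (counts[i] - expected) / np.sqrt(var)
    return z
```

A smooth histogram yields |z| near zero everywhere; a sharp dip such as
[100, 100, 5, 100] produces a strongly negative z in the depleted bin,
which is the signature BD/McCrary looks for and does not find at the
accountant level.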

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1
markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00
parent 9b11f03548
commit 9d19ca5a31
13 changed files with 2915 additions and 127 deletions
@@ -0,0 +1,526 @@
#!/usr/bin/env python3
"""
Script 20: Three-Method Threshold Determination at the Accountant Level
=======================================================================
Completes the three-method convergent framework at the analysis level
where the mixture structure is statistically supported (per Script 15
dip test: accountant cos_mean p<0.001).
Runs on the per-accountant aggregates (mean best-match cosine, mean
independent minimum dHash) for 686 CPAs with >=10 signatures:
Method 1: KDE antimode with Hartigan dip test (formal unimodality test)
Method 2: Burgstahler-Dichev / McCrary discontinuity
Method 3: 2-component Beta mixture via EM + parallel logit-GMM
Also re-runs the accountant-level 2-component GMM crossings from
Script 18 for completeness and side-by-side comparison.
Output:
reports/accountant_three_methods/accountant_three_methods_report.md
reports/accountant_three_methods/accountant_three_methods_results.json
reports/accountant_three_methods/accountant_cos_panel.png
reports/accountant_three_methods/accountant_dhash_panel.png
"""
import sqlite3
import json
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.signal import find_peaks
from scipy.optimize import brentq
from sklearn.mixture import GaussianMixture
import diptest
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
'accountant_three_methods')
OUT.mkdir(parents=True, exist_ok=True)
EPS = 1e-6
Z_CRIT = 1.96
MIN_SIGS = 10
def load_accountant_means(min_sigs=MIN_SIGS):
conn = sqlite3.connect(DB)
cur = conn.cursor()
cur.execute('''
SELECT s.assigned_accountant,
AVG(s.max_similarity_to_same_accountant) AS cos_mean,
AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
COUNT(*) AS n
FROM signatures s
WHERE s.assigned_accountant IS NOT NULL
AND s.max_similarity_to_same_accountant IS NOT NULL
AND s.min_dhash_independent IS NOT NULL
GROUP BY s.assigned_accountant
HAVING n >= ?
''', (min_sigs,))
rows = cur.fetchall()
conn.close()
cos = np.array([r[1] for r in rows])
dh = np.array([r[2] for r in rows])
return cos, dh
# ---------- Method 1: KDE antimode with dip test ----------
def method_kde_antimode(values, name):
arr = np.asarray(values, dtype=float)
arr = arr[np.isfinite(arr)]
dip, pval = diptest.diptest(arr, boot_pval=True, n_boot=2000)
kde = stats.gaussian_kde(arr, bw_method='silverman')
xs = np.linspace(arr.min(), arr.max(), 2000)
density = kde(xs)
# Find modes (local maxima) and antimodes (local minima)
peaks, _ = find_peaks(density, prominence=density.max() * 0.02)
# Antimodes = local minima between peaks
antimodes = []
for i in range(len(peaks) - 1):
seg = density[peaks[i]:peaks[i + 1]]
if len(seg) == 0:
continue
local = peaks[i] + int(np.argmin(seg))
antimodes.append(float(xs[local]))
# Sensitivity analysis across bandwidth factors
sens = {}
for bwf in [0.5, 0.75, 1.0, 1.5, 2.0]:
kde_s = stats.gaussian_kde(arr, bw_method=kde.factor * bwf)
d_s = kde_s(xs)
p_s, _ = find_peaks(d_s, prominence=d_s.max() * 0.02)
sens[f'bw_x{bwf}'] = int(len(p_s))
return {
'name': name,
'n': int(len(arr)),
'dip': float(dip),
'dip_pvalue': float(pval),
'unimodal_alpha05': bool(pval > 0.05),
'kde_bandwidth_silverman': float(kde.factor),
'n_modes': int(len(peaks)),
'mode_locations': [float(xs[p]) for p in peaks],
'antimodes': antimodes,
'primary_antimode': (antimodes[0] if antimodes else None),
'bandwidth_sensitivity_n_modes': sens,
}
# ---------- Method 2: Burgstahler-Dichev / McCrary ----------
def method_bd_mccrary(values, bin_width, direction, name):
arr = np.asarray(values, dtype=float)
arr = arr[np.isfinite(arr)]
lo = float(np.floor(arr.min() / bin_width) * bin_width)
hi = float(np.ceil(arr.max() / bin_width) * bin_width)
edges = np.arange(lo, hi + bin_width, bin_width)
counts, _ = np.histogram(arr, bins=edges)
centers = (edges[:-1] + edges[1:]) / 2.0
N = counts.sum()
p = counts / N if N else counts.astype(float)
n_bins = len(counts)
z = np.full(n_bins, np.nan)
expected = np.full(n_bins, np.nan)
for i in range(1, n_bins - 1):
p_lo = p[i - 1]
p_hi = p[i + 1]
exp_i = 0.5 * (counts[i - 1] + counts[i + 1])
var_i = (N * p[i] * (1 - p[i])
+ 0.25 * N * (p_lo + p_hi) * (1 - p_lo - p_hi))
if var_i <= 0:
continue
z[i] = (counts[i] - exp_i) / np.sqrt(var_i)
expected[i] = exp_i
# Identify transitions
transitions = []
for i in range(1, len(z)):
if np.isnan(z[i - 1]) or np.isnan(z[i]):
continue
ok = False
if direction == 'neg_to_pos' and z[i - 1] < -Z_CRIT and z[i] > Z_CRIT:
ok = True
elif direction == 'pos_to_neg' and z[i - 1] > Z_CRIT and z[i] < -Z_CRIT:
ok = True
if ok:
transitions.append({
'threshold_between': float((centers[i - 1] + centers[i]) / 2.0),
'z_before': float(z[i - 1]),
'z_after': float(z[i]),
})
best = None
if transitions:
best = max(transitions,
key=lambda t: abs(t['z_before']) + abs(t['z_after']))
return {
'name': name,
'n': int(len(arr)),
'bin_width': float(bin_width),
'direction': direction,
'n_transitions': len(transitions),
'transitions': transitions,
'best_transition': best,
'threshold': (best['threshold_between'] if best else None),
'bin_centers': [float(c) for c in centers],
'counts': [int(c) for c in counts],
'z_scores': [None if np.isnan(zi) else float(zi) for zi in z],
}
# ---------- Method 3: Beta mixture + logit-GMM ----------
def fit_beta_mixture_em(x, K=2, max_iter=300, tol=1e-6, seed=42):
rng = np.random.default_rng(seed)
x = np.clip(np.asarray(x, dtype=float), EPS, 1 - EPS)
n = len(x)
q = np.linspace(0, 1, K + 1)
thresh = np.quantile(x, q[1:-1])
labels = np.digitize(x, thresh)
resp = np.zeros((n, K))
resp[np.arange(n), labels] = 1.0
ll_hist = []
for it in range(max_iter):
nk = resp.sum(axis=0) + 1e-12
weights = nk / nk.sum()
mus = (resp * x[:, None]).sum(axis=0) / nk
var_num = (resp * (x[:, None] - mus) ** 2).sum(axis=0)
vars_ = var_num / nk
upper = mus * (1 - mus) - 1e-9
vars_ = np.minimum(vars_, upper)
vars_ = np.maximum(vars_, 1e-9)
factor = np.maximum(mus * (1 - mus) / vars_ - 1, 1e-6)
alphas = mus * factor
betas = (1 - mus) * factor
log_pdfs = np.column_stack([
stats.beta.logpdf(x, alphas[k], betas[k]) + np.log(weights[k])
for k in range(K)
])
m = log_pdfs.max(axis=1, keepdims=True)
ll = (m.ravel() + np.log(np.exp(log_pdfs - m).sum(axis=1))).sum()
ll_hist.append(float(ll))
new_resp = np.exp(log_pdfs - m)
new_resp = new_resp / new_resp.sum(axis=1, keepdims=True)
if it > 0 and abs(ll_hist[-1] - ll_hist[-2]) < tol:
resp = new_resp
break
resp = new_resp
order = np.argsort(mus)
alphas, betas, weights, mus = alphas[order], betas[order], weights[order], mus[order]
k_params = 3 * K - 1
ll_final = ll_hist[-1]
return {
'K': K,
'alphas': [float(a) for a in alphas],
'betas': [float(b) for b in betas],
'weights': [float(w) for w in weights],
'mus': [float(m) for m in mus],
'log_likelihood': float(ll_final),
'aic': float(2 * k_params - 2 * ll_final),
'bic': float(k_params * np.log(n) - 2 * ll_final),
'n_iter': it + 1,
}
def beta_crossing(fit):
if fit['K'] != 2:
return None
a1, b1, w1 = fit['alphas'][0], fit['betas'][0], fit['weights'][0]
a2, b2, w2 = fit['alphas'][1], fit['betas'][1], fit['weights'][1]
def diff(x):
return (w2 * stats.beta.pdf(x, a2, b2)
- w1 * stats.beta.pdf(x, a1, b1))
xs = np.linspace(EPS, 1 - EPS, 2000)
ys = diff(xs)
changes = np.where(np.diff(np.sign(ys)) != 0)[0]
if not len(changes):
return None
mid = 0.5 * (fit['mus'][0] + fit['mus'][1])
crossings = []
for i in changes:
try:
crossings.append(brentq(diff, xs[i], xs[i + 1]))
except ValueError:
continue
if not crossings:
return None
return float(min(crossings, key=lambda c: abs(c - mid)))
def fit_logit_gmm(x, K=2, seed=42):
x = np.clip(np.asarray(x, dtype=float), EPS, 1 - EPS)
z = np.log(x / (1 - x)).reshape(-1, 1)
gmm = GaussianMixture(n_components=K, random_state=seed,
max_iter=500).fit(z)
order = np.argsort(gmm.means_.ravel())
means = gmm.means_.ravel()[order]
stds = np.sqrt(gmm.covariances_.ravel())[order]
weights = gmm.weights_[order]
crossing = None
if K == 2:
m1, s1, w1 = means[0], stds[0], weights[0]
m2, s2, w2 = means[1], stds[1], weights[1]
def diff(z0):
return (w2 * stats.norm.pdf(z0, m2, s2)
- w1 * stats.norm.pdf(z0, m1, s1))
zs = np.linspace(min(m1, m2) - 2, max(m1, m2) + 2, 2000)
ys = diff(zs)
ch = np.where(np.diff(np.sign(ys)) != 0)[0]
if len(ch):
try:
z_cross = brentq(diff, zs[ch[0]], zs[ch[0] + 1])
crossing = float(1 / (1 + np.exp(-z_cross)))
except ValueError:
pass
return {
'K': K,
'means_logit': [float(m) for m in means],
'stds_logit': [float(s) for s in stds],
'weights': [float(w) for w in weights],
'means_original_scale': [float(1 / (1 + np.exp(-m))) for m in means],
'aic': float(gmm.aic(z)),
'bic': float(gmm.bic(z)),
'crossing_original': crossing,
}
def method_beta_mixture(values, name, is_cosine=True):
arr = np.asarray(values, dtype=float)
arr = arr[np.isfinite(arr)]
if not is_cosine:
# normalize dHash into [0,1] by dividing by 64 (max Hamming)
x = arr / 64.0
else:
x = arr
beta2 = fit_beta_mixture_em(x, K=2)
beta3 = fit_beta_mixture_em(x, K=3)
cross_beta2 = beta_crossing(beta2)
# Transform back to original scale for dHash
if not is_cosine and cross_beta2 is not None:
cross_beta2 = cross_beta2 * 64.0
gmm2 = fit_logit_gmm(x, K=2)
gmm3 = fit_logit_gmm(x, K=3)
if not is_cosine and gmm2.get('crossing_original') is not None:
gmm2['crossing_original'] = gmm2['crossing_original'] * 64.0
return {
'name': name,
'n': int(len(x)),
'scale_transform': ('identity' if is_cosine else 'dhash/64'),
'beta_2': beta2,
'beta_3': beta3,
'bic_preferred_K': (2 if beta2['bic'] < beta3['bic'] else 3),
'beta_2_crossing_original': cross_beta2,
'logit_gmm_2': gmm2,
'logit_gmm_3': gmm3,
}
# ---------- Plot helpers ----------
def plot_panel(values, methods, title, out_path, bin_width=None,
is_cosine=True):
arr = np.asarray(values, dtype=float)
fig, axes = plt.subplots(2, 1, figsize=(11, 7),
gridspec_kw={'height_ratios': [3, 1]})
ax = axes[0]
if bin_width is None:
bins = 40
else:
lo = float(np.floor(arr.min() / bin_width) * bin_width)
hi = float(np.ceil(arr.max() / bin_width) * bin_width)
bins = np.arange(lo, hi + bin_width, bin_width)
ax.hist(arr, bins=bins, density=True, alpha=0.55, color='steelblue',
edgecolor='white')
# KDE overlay
kde = stats.gaussian_kde(arr, bw_method='silverman')
xs = np.linspace(arr.min(), arr.max(), 500)
ax.plot(xs, kde(xs), 'g-', lw=1.5, label='KDE (Silverman)')
# Annotate thresholds from each method
colors = {'kde': 'green', 'bd': 'red', 'beta': 'purple', 'gmm2': 'orange'}
for key, (val, lbl) in methods.items():
if val is None:
continue
ax.axvline(val, color=colors.get(key, 'gray'), lw=2, ls='--',
label=f'{lbl} = {val:.4f}')
ax.set_xlabel(title + ' value')
ax.set_ylabel('Density')
ax.set_title(title)
ax.legend(fontsize=8)
ax2 = axes[1]
ax2.set_title('Thresholds across methods')
ax2.set_xlim(ax.get_xlim())
for i, (key, (val, lbl)) in enumerate(methods.items()):
if val is None:
continue
ax2.scatter([val], [i], color=colors.get(key, 'gray'), s=100, zorder=3)
ax2.annotate(f' {lbl}: {val:.4f}', (val, i), fontsize=8,
va='center')
ax2.set_yticks(range(len(methods)))
ax2.set_yticklabels([m for m in methods.keys()])
ax2.set_xlabel(title + ' value')
ax2.grid(alpha=0.3)
plt.tight_layout()
fig.savefig(out_path, dpi=150)
plt.close(fig)
# ---------- GMM 2-comp crossing from Script 18 ----------
def marginal_2comp_crossing(X, dim):
gmm = GaussianMixture(n_components=2, covariance_type='full',
random_state=42, n_init=15, max_iter=500).fit(X)
means = gmm.means_
covs = gmm.covariances_
weights = gmm.weights_
m1, m2 = means[0][dim], means[1][dim]
s1 = np.sqrt(covs[0][dim, dim])
s2 = np.sqrt(covs[1][dim, dim])
w1, w2 = weights[0], weights[1]
def diff(x):
return (w2 * stats.norm.pdf(x, m2, s2)
- w1 * stats.norm.pdf(x, m1, s1))
xs = np.linspace(float(X[:, dim].min()), float(X[:, dim].max()), 2000)
ys = diff(xs)
ch = np.where(np.diff(np.sign(ys)) != 0)[0]
if not len(ch):
return None
mid = 0.5 * (m1 + m2)
crossings = []
for i in ch:
try:
crossings.append(brentq(diff, xs[i], xs[i + 1]))
except ValueError:
continue
if not crossings:
return None
return float(min(crossings, key=lambda c: abs(c - mid)))
def main():
print('=' * 70)
print('Script 20: Three-Method Threshold at Accountant Level')
print('=' * 70)
cos, dh = load_accountant_means()
print(f'\nN accountants (>={MIN_SIGS} sigs) = {len(cos)}')
results = {}
for desc, arr, bin_width, direction, is_cosine in [
('cos_mean', cos, 0.002, 'neg_to_pos', True),
('dh_mean', dh, 0.2, 'pos_to_neg', False),
]:
print(f'\n[{desc}]')
m1 = method_kde_antimode(arr, f'{desc} KDE')
print(f' Method 1 (KDE + dip): dip={m1["dip"]:.4f} '
f'p={m1["dip_pvalue"]:.4f} '
f'n_modes={m1["n_modes"]} '
f'antimode={m1["primary_antimode"]}')
m2 = method_bd_mccrary(arr, bin_width, direction, f'{desc} BD')
print(f' Method 2 (BD/McCrary): {m2["n_transitions"]} transitions, '
f'threshold={m2["threshold"]}')
m3 = method_beta_mixture(arr, f'{desc} Beta', is_cosine=is_cosine)
print(f' Method 3 (Beta mixture): BIC-preferred K={m3["bic_preferred_K"]}, '
f'Beta-2 crossing={m3["beta_2_crossing_original"]}, '
f'LogGMM-2 crossing={m3["logit_gmm_2"].get("crossing_original")}')
# GMM 2-comp crossing (for completeness / reproduce Script 18)
X = np.column_stack([cos, dh])
dim = 0 if desc == 'cos_mean' else 1
gmm2_crossing = marginal_2comp_crossing(X, dim)
print(f' (Script 18 2-comp GMM marginal crossing = {gmm2_crossing})')
results[desc] = {
'method_1_kde_antimode': m1,
'method_2_bd_mccrary': m2,
'method_3_beta_mixture': m3,
'script_18_gmm_2comp_crossing': gmm2_crossing,
}
methods_for_plot = {
'kde': (m1.get('primary_antimode'), 'KDE antimode'),
'bd': (m2.get('threshold'), 'BD/McCrary'),
'beta': (m3.get('beta_2_crossing_original'), 'Beta-2 crossing'),
'gmm2': (gmm2_crossing, 'GMM 2-comp crossing'),
}
png = OUT / f'accountant_{desc}_panel.png'
plot_panel(arr, methods_for_plot,
f'Accountant-level {desc}: three-method thresholds',
png, bin_width=bin_width, is_cosine=is_cosine)
print(f' plot: {png}')
# Write JSON
with open(OUT / 'accountant_three_methods_results.json', 'w') as f:
json.dump({'generated_at': datetime.now().isoformat(),
'n_accountants': int(len(cos)),
'min_signatures': MIN_SIGS,
'results': results}, f, indent=2, ensure_ascii=False)
print(f'\nJSON: {OUT / "accountant_three_methods_results.json"}')
# Markdown
md = [
'# Accountant-Level Three-Method Threshold Report',
f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
'',
f'N accountants (>={MIN_SIGS} signatures): {len(cos)}',
'',
'## Accountant-level cosine mean',
'',
'| Method | Threshold | Supporting statistic |',
'|--------|-----------|----------------------|',
]
r = results['cos_mean']
md.append(f"| Method 1: KDE antimode (with dip test) | "
f"{r['method_1_kde_antimode']['primary_antimode']} | "
f"dip={r['method_1_kde_antimode']['dip']:.4f}, "
f"p={r['method_1_kde_antimode']['dip_pvalue']:.4f} "
f"({'unimodal' if r['method_1_kde_antimode']['unimodal_alpha05'] else 'multimodal'}) |")
md.append(f"| Method 2: Burgstahler-Dichev / McCrary | "
f"{r['method_2_bd_mccrary']['threshold']} | "
f"{r['method_2_bd_mccrary']['n_transitions']} transition(s) "
f"at α=0.05 |")
md.append(f"| Method 3: 2-component Beta mixture | "
f"{r['method_3_beta_mixture']['beta_2_crossing_original']} | "
f"Beta-2 BIC={r['method_3_beta_mixture']['beta_2']['bic']:.2f}, "
f"Beta-3 BIC={r['method_3_beta_mixture']['beta_3']['bic']:.2f} "
f"(BIC-preferred K={r['method_3_beta_mixture']['bic_preferred_K']}) |")
md.append(f"| Method 3': LogGMM-2 on logit-transformed | "
f"{r['method_3_beta_mixture']['logit_gmm_2'].get('crossing_original')} | "
f"White 1982 quasi-MLE robustness check |")
md.append(f"| Script 18 GMM 2-comp marginal crossing | "
f"{r['script_18_gmm_2comp_crossing']} | full 2D mixture |")
md += ['', '## Accountant-level dHash mean', '',
'| Method | Threshold | Supporting statistic |',
'|--------|-----------|----------------------|']
r = results['dh_mean']
md.append(f"| Method 1: KDE antimode | "
f"{r['method_1_kde_antimode']['primary_antimode']} | "
f"dip={r['method_1_kde_antimode']['dip']:.4f}, "
f"p={r['method_1_kde_antimode']['dip_pvalue']:.4f} |")
md.append(f"| Method 2: BD/McCrary | "
f"{r['method_2_bd_mccrary']['threshold']} | "
f"{r['method_2_bd_mccrary']['n_transitions']} transition(s) |")
md.append(f"| Method 3: 2-component Beta mixture | "
f"{r['method_3_beta_mixture']['beta_2_crossing_original']} | "
f"Beta-2 BIC={r['method_3_beta_mixture']['beta_2']['bic']:.2f}, "
f"Beta-3 BIC={r['method_3_beta_mixture']['beta_3']['bic']:.2f} |")
md.append(f"| Method 3': LogGMM-2 | "
f"{r['method_3_beta_mixture']['logit_gmm_2'].get('crossing_original')} | |")
md.append(f"| Script 18 GMM 2-comp crossing | "
f"{r['script_18_gmm_2comp_crossing']} | |")
(OUT / 'accountant_three_methods_report.md').write_text('\n'.join(md),
encoding='utf-8')
print(f'Report: {OUT / "accountant_three_methods_report.md"}')
if __name__ == '__main__':
main()
@@ -0,0 +1,421 @@
#!/usr/bin/env python3
"""
Script 21: Expanded Validation with Larger Negative Anchor + Held-out Firm A
============================================================================
Addresses codex review weaknesses of Script 19's pixel-identity validation:
(a) Negative anchor of n=35 (cosine<0.70) is too small to give
meaningful FAR confidence intervals.
(b) Pixel-identical positive anchor is an easy subset, not
representative of the broader positive class.
(c) Firm A is both the calibration anchor and the validation anchor
(circular).
This script:
1. Constructs a large inter-CPA negative anchor (~50,000 pairs) by
randomly sampling pairs from different CPAs. Inter-CPA high
similarity is highly unlikely to arise from legitimate signing.
2. Splits Firm A CPAs 70/30 into CALIBRATION and HELDOUT folds.
Re-derives signature-level / accountant-level thresholds from the
calibration fold only, then reports all metrics (including Firm A
anchor rates) on the heldout fold.
3. Computes proper EER (FAR = FRR interpolated) in addition to
metrics at canonical thresholds.
4. Computes 95% Wilson confidence intervals for each FAR/FRR.
Output:
reports/expanded_validation/expanded_validation_report.md
reports/expanded_validation/expanded_validation_results.json
"""
import sqlite3
import json
import numpy as np
from pathlib import Path
from datetime import datetime
from scipy.stats import norm
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
'expanded_validation')
OUT.mkdir(parents=True, exist_ok=True)
FIRM_A = '勤業眾信聯合'
N_INTER_PAIRS = 50_000
SEED = 42
def wilson_ci(k, n, alpha=0.05):
if n == 0:
return (0.0, 1.0)
z = norm.ppf(1 - alpha / 2)
phat = k / n
denom = 1 + z * z / n
center = (phat + z * z / (2 * n)) / denom
pm = z * np.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
return (max(0.0, center - pm), min(1.0, center + pm))
def load_signatures():
conn = sqlite3.connect(DB)
cur = conn.cursor()
cur.execute('''
SELECT s.signature_id, s.assigned_accountant, a.firm,
s.max_similarity_to_same_accountant,
s.min_dhash_independent, s.pixel_identical_to_closest
FROM signatures s
LEFT JOIN accountants a ON s.assigned_accountant = a.name
WHERE s.max_similarity_to_same_accountant IS NOT NULL
''')
rows = cur.fetchall()
conn.close()
return rows
def load_feature_vectors_sample(n=2000):
"""Load feature vectors for inter-CPA negative-anchor sampling."""
conn = sqlite3.connect(DB)
cur = conn.cursor()
cur.execute('''
SELECT signature_id, assigned_accountant, feature_vector
FROM signatures
WHERE feature_vector IS NOT NULL
AND assigned_accountant IS NOT NULL
ORDER BY RANDOM()
LIMIT ?
''', (n,))
rows = cur.fetchall()
conn.close()
out = []
for r in rows:
vec = np.frombuffer(r[2], dtype=np.float32)
out.append({'sig_id': r[0], 'accountant': r[1], 'feature': vec})
return out
def build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS, seed=SEED):
"""Sample random cross-CPA pairs; return their cosine similarities."""
rng = np.random.default_rng(seed)
n = len(sample)
feats = np.stack([s['feature'] for s in sample])
# Normalize rows so the dot product below is a true cosine similarity
# (stored feature vectors are not guaranteed to be unit-norm).
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
accts = np.array([s['accountant'] for s in sample])
sims = []
tries = 0
while len(sims) < n_pairs and tries < n_pairs * 10:
i = rng.integers(n)
j = rng.integers(n)
if i == j or accts[i] == accts[j]:
tries += 1
continue
sim = float(feats[i] @ feats[j])
sims.append(sim)
tries += 1
return np.array(sims)
def classification_metrics(y_true, y_pred):
y_true = np.asarray(y_true).astype(int)
y_pred = np.asarray(y_pred).astype(int)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
p_den = max(tp + fp, 1)
r_den = max(tp + fn, 1)
far_den = max(fp + tn, 1)
frr_den = max(fn + tp, 1)
precision = tp / p_den
recall = tp / r_den
f1 = (2 * precision * recall / (precision + recall)
if (precision + recall) > 0 else 0.0)
far = fp / far_den
frr = fn / frr_den
far_ci = wilson_ci(fp, far_den)
frr_ci = wilson_ci(fn, frr_den)
return {
'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
'precision': float(precision),
'recall': float(recall),
'f1': float(f1),
'far': float(far),
'frr': float(frr),
'far_ci95': [float(x) for x in far_ci],
'frr_ci95': [float(x) for x in frr_ci],
'n_pos': int(tp + fn),
'n_neg': int(tn + fp),
}
def sweep_threshold(scores, y, direction, thresholds):
out = []
for t in thresholds:
if direction == 'above':
y_pred = (scores > t).astype(int)
else:
y_pred = (scores < t).astype(int)
m = classification_metrics(y, y_pred)
m['threshold'] = float(t)
out.append(m)
return out
def find_eer(sweep):
thr = np.array([s['threshold'] for s in sweep])
far = np.array([s['far'] for s in sweep])
frr = np.array([s['frr'] for s in sweep])
diff = far - frr
signs = np.sign(diff)
changes = np.where(np.diff(signs) != 0)[0]
if len(changes) == 0:
idx = int(np.argmin(np.abs(diff)))
return {'threshold': float(thr[idx]), 'far': float(far[idx]),
'frr': float(frr[idx]),
'eer': float(0.5 * (far[idx] + frr[idx]))}
i = int(changes[0])
w = abs(diff[i]) / (abs(diff[i]) + abs(diff[i + 1]) + 1e-12)
thr_i = (1 - w) * thr[i] + w * thr[i + 1]
far_i = (1 - w) * far[i] + w * far[i + 1]
frr_i = (1 - w) * frr[i] + w * frr[i + 1]
return {'threshold': float(thr_i), 'far': float(far_i),
'frr': float(frr_i),
'eer': float(0.5 * (far_i + frr_i))}
def main():
print('=' * 70)
print('Script 21: Expanded Validation')
print('=' * 70)
rows = load_signatures()
print(f'\nLoaded {len(rows):,} signatures')
sig_ids = [r[0] for r in rows]
accts = [r[1] for r in rows]
firms = [r[2] or '(unknown)' for r in rows]
cos = np.array([r[3] for r in rows], dtype=float)
dh = np.array([-1 if r[4] is None else r[4] for r in rows], dtype=float)
pix = np.array([r[5] or 0 for r in rows], dtype=int)
firm_a_mask = np.array([f == FIRM_A for f in firms])
print(f'Firm A signatures: {int(firm_a_mask.sum()):,}')
# --- (1) INTER-CPA NEGATIVE ANCHOR ---
print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} pairs)...')
sample = load_feature_vectors_sample(n=3000)
inter_cos = build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS)
print(f' inter-CPA cos: mean={inter_cos.mean():.4f}, '
f'p95={np.percentile(inter_cos, 95):.4f}, '
f'p99={np.percentile(inter_cos, 99):.4f}, '
f'max={inter_cos.max():.4f}')
# --- (2) POSITIVES ---
# Pixel-identical (gold) + optional Firm A extension
pos_pix_mask = pix == 1
n_pix = int(pos_pix_mask.sum())
print(f'\n[2] Positive anchors:')
print(f' pixel-identical signatures: {n_pix}')
# Build negative anchor scores = inter-CPA cosine distribution
# Positive anchor scores = pixel-identical signatures' max same-CPA cosine
# NB: the two distributions are not drawn from the same random variable
# (one is intra-CPA max, the other is inter-CPA random), so we treat the
# inter-CPA distribution as a negative reference for threshold sweep.
# Combined labeled set: positives=pixel-identical sigs' max cosine,
# negatives=inter-CPA random pair cosines.
pos_scores = cos[pos_pix_mask]
neg_scores = inter_cos
y = np.concatenate([np.ones(len(pos_scores)),
np.zeros(len(neg_scores))])
scores = np.concatenate([pos_scores, neg_scores])
# Sweep thresholds
thr = np.linspace(0.30, 1.00, 141)
sweep = sweep_threshold(scores, y, 'above', thr)
eer = find_eer(sweep)
print(f'\n[3] Cosine EER (pos=pixel-identical, neg=inter-CPA n={len(inter_cos)}):')
print(f" threshold={eer['threshold']:.4f}, EER={eer['eer']:.4f}")
# Canonical threshold evaluations with Wilson CIs
canonical = {}
for tt in [0.70, 0.80, 0.837, 0.90, 0.945, 0.95, 0.973, 0.979]:
y_pred = (scores > tt).astype(int)
m = classification_metrics(y, y_pred)
m['threshold'] = float(tt)
canonical[f'cos>{tt:.3f}'] = m
print(f" @ {tt:.3f}: P={m['precision']:.3f}, R={m['recall']:.3f}, "
f"FAR={m['far']:.4f} (CI95={m['far_ci95'][0]:.4f}-"
f"{m['far_ci95'][1]:.4f}), FRR={m['frr']:.4f}")
# --- (3) HELD-OUT FIRM A ---
print('\n[4] Held-out Firm A 70/30 split:')
rng = np.random.default_rng(SEED)
firm_a_accts = sorted(set(a for a, f in zip(accts, firms) if f == FIRM_A))
rng.shuffle(firm_a_accts)
n_calib = int(0.7 * len(firm_a_accts))
calib_accts = set(firm_a_accts[:n_calib])
heldout_accts = set(firm_a_accts[n_calib:])
print(f' Calibration fold CPAs: {len(calib_accts)}, '
f'heldout fold CPAs: {len(heldout_accts)}')
calib_mask = np.array([a in calib_accts for a in accts])
heldout_mask = np.array([a in heldout_accts for a in accts])
print(f' Calibration sigs: {int(calib_mask.sum())}, '
f'heldout sigs: {int(heldout_mask.sum())}')
# Derive per-signature thresholds from calibration fold:
# - Firm A cos median, 1st-pct, 5th-pct
# - Firm A dHash median, 95th-pct
calib_cos = cos[calib_mask]
calib_dh = dh[calib_mask]
calib_dh = calib_dh[calib_dh >= 0]
cal_cos_med = float(np.median(calib_cos))
cal_cos_p1 = float(np.percentile(calib_cos, 1))
cal_cos_p5 = float(np.percentile(calib_cos, 5))
cal_dh_med = float(np.median(calib_dh))
cal_dh_p95 = float(np.percentile(calib_dh, 95))
print(f' Calib Firm A cos: median={cal_cos_med:.4f}, P1={cal_cos_p1:.4f}, P5={cal_cos_p5:.4f}')
print(f' Calib Firm A dHash: median={cal_dh_med:.2f}, P95={cal_dh_p95:.2f}')
# Apply canonical rules to heldout fold
held_cos = cos[heldout_mask]
held_dh = dh[heldout_mask]
held_dh_valid = held_dh >= 0
held_rates = {}
for tt in [0.837, 0.945, 0.95, cal_cos_p5]:
rate = float(np.mean(held_cos > tt))
k = int(np.sum(held_cos > tt))
lo, hi = wilson_ci(k, len(held_cos))
held_rates[f'cos>{tt:.4f}'] = {
'rate': rate, 'k': k, 'n': int(len(held_cos)),
'wilson95': [float(lo), float(hi)],
}
for tt in [5, 8, 15, cal_dh_p95]:
rate = float(np.mean(held_dh[held_dh_valid] <= tt))
k = int(np.sum(held_dh[held_dh_valid] <= tt))
lo, hi = wilson_ci(k, int(held_dh_valid.sum()))
held_rates[f'dh_indep<={tt:.2f}'] = {
'rate': rate, 'k': k, 'n': int(held_dh_valid.sum()),
'wilson95': [float(lo), float(hi)],
}
# Dual rule
dual_mask = (held_cos > 0.95) & (held_dh >= 0) & (held_dh <= 8)
rate = float(np.mean(dual_mask))
k = int(dual_mask.sum())
lo, hi = wilson_ci(k, len(dual_mask))
held_rates['cos>0.95 AND dh<=8'] = {
'rate': rate, 'k': k, 'n': int(len(dual_mask)),
'wilson95': [float(lo), float(hi)],
}
print(' Heldout Firm A rates:')
for k, v in held_rates.items():
print(f' {k}: {v["rate"]*100:.2f}% '
f'[{v["wilson95"][0]*100:.2f}, {v["wilson95"][1]*100:.2f}]')
# --- Save ---
summary = {
'generated_at': datetime.now().isoformat(),
'n_signatures': len(rows),
'n_firm_a': int(firm_a_mask.sum()),
'n_pixel_identical': n_pix,
'n_inter_cpa_negatives': len(inter_cos),
'inter_cpa_cos_stats': {
'mean': float(inter_cos.mean()),
'p95': float(np.percentile(inter_cos, 95)),
'p99': float(np.percentile(inter_cos, 99)),
'max': float(inter_cos.max()),
},
'cosine_eer': eer,
'canonical_thresholds': canonical,
'held_out_firm_a': {
'calibration_cpas': len(calib_accts),
'heldout_cpas': len(heldout_accts),
'calibration_sig_count': int(calib_mask.sum()),
'heldout_sig_count': int(heldout_mask.sum()),
'calib_cos_median': cal_cos_med,
'calib_cos_p1': cal_cos_p1,
'calib_cos_p5': cal_cos_p5,
'calib_dh_median': cal_dh_med,
'calib_dh_p95': cal_dh_p95,
'heldout_rates': held_rates,
},
}
with open(OUT / 'expanded_validation_results.json', 'w') as f:
json.dump(summary, f, indent=2, ensure_ascii=False)
print(f'\nJSON: {OUT / "expanded_validation_results.json"}')
# Markdown
md = [
'# Expanded Validation Report',
f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
'',
'## 1. Inter-CPA Negative Anchor',
'',
f'* N random cross-CPA pairs sampled: {len(inter_cos):,}',
f'* Inter-CPA cosine: mean={inter_cos.mean():.4f}, '
f'P95={np.percentile(inter_cos, 95):.4f}, '
f'P99={np.percentile(inter_cos, 99):.4f}, max={inter_cos.max():.4f}',
'',
'This anchor is a meaningful negative set because inter-CPA pairs',
'cannot arise from legitimate reuse of a single signer\'s image.',
'',
'## 2. Cosine Threshold Sweep (pos=pixel-identical, neg=inter-CPA)',
'',
f"EER threshold: {eer['threshold']:.4f}, EER: {eer['eer']:.4f}",
'',
'| Threshold | Precision | Recall | F1 | FAR | FAR 95% CI | FRR |',
'|-----------|-----------|--------|----|-----|------------|-----|',
]
for k, m in canonical.items():
md.append(
f"| {m['threshold']:.3f} | {m['precision']:.3f} | "
f"{m['recall']:.3f} | {m['f1']:.3f} | {m['far']:.4f} | "
f"[{m['far_ci95'][0]:.4f}, {m['far_ci95'][1]:.4f}] | "
f"{m['frr']:.4f} |"
)
md += [
'',
'## 3. Held-out Firm A 70/30 Validation',
'',
f'* Firm A CPAs randomly split by CPA (not by signature) into',
f' calibration (n={len(calib_accts)}) and heldout (n={len(heldout_accts)}).',
f'* Calibration Firm A signatures: {int(calib_mask.sum()):,}. '
f'Heldout signatures: {int(heldout_mask.sum()):,}.',
'',
'### Calibration-fold anchor statistics (for thresholds)',
'',
f'* Firm A cosine: median = {cal_cos_med:.4f}, '
f'P1 = {cal_cos_p1:.4f}, P5 = {cal_cos_p5:.4f}',
f'* Firm A dHash (independent min): median = {cal_dh_med:.2f}, '
f'P95 = {cal_dh_p95:.2f}',
'',
'### Heldout-fold capture rates (with Wilson 95% CIs)',
'',
'| Rule | Heldout rate | Wilson 95% CI | k / n |',
'|------|--------------|---------------|-------|',
]
for k, v in held_rates.items():
md.append(
f"| {k} | {v['rate']*100:.2f}% | "
f"[{v['wilson95'][0]*100:.2f}%, {v['wilson95'][1]*100:.2f}%] | "
f"{v['k']}/{v['n']} |"
)
md += [
'',
'## Interpretation',
'',
'The inter-CPA negative anchor (N ~50,000) gives tight confidence',
'intervals on FAR at each threshold, addressing the small-negative',
'anchor limitation of Script 19 (n=35).',
'',
'The 70/30 Firm A split breaks the circular-validation concern of',
'using the same calibration anchor for threshold derivation and',
'validation. Calibration-fold percentiles derive the thresholds;',
'heldout-fold rates with Wilson 95% CIs show how those thresholds',
'generalize to Firm A CPAs that did not contribute to calibration.',
]
(OUT / 'expanded_validation_report.md').write_text('\n'.join(md),
encoding='utf-8')
print(f'Report: {OUT / "expanded_validation_report.md"}')
if __name__ == '__main__':
main()