Add three-convergent-method threshold scripts + pixel-identity validation
Implements Partner v3's statistical rigor requirements at the level of signature vs. accountant analysis units:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`. Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population); full-sample cosine MULTIMODAL (p<0.001, mix of two regimes); accountant-level aggregates MULTIMODAL on both cos and dHash.
- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition detection. Firm A and full-sample cosine transitions at 0.985; dHash at 2.0.
- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM with MoM M-step, plus parallel Gaussian mixture on logit transform as White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature level confirms 2-component is a forced fit -- supporting the pivot to accountant-level mixture.
- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis that was done inline and not saved. BIC-best K=3 with components matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%, Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928, 11.17, 28%, small firms). 2-component natural thresholds: cos=0.9450, dh=8.10.
- Script 19 (Pixel-identity validation): no human annotation needed. Uses pixel_identical_to_closest (310 sigs) as gold positive and Firm A as anchor positive. Confirms Firm A cosine>0.95 = 92.51% (matches prior 2026-04-08 finding of 92.5%), dual rule cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.

Python deps added: diptest, scikit-learn (installed into venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
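The Script 16 transition detection referenced above follows the Burgstahler-Dichev standardized-difference logic: compare each histogram bin's count against the mean of its two neighbours and scale by an approximate multinomial variance. A minimal sketch of that idea (illustrative only; `standardized_differences`, the bin grid, and the synthetic spike are assumptions, not the committed script):

```python
import numpy as np

def standardized_differences(x, bins):
    """Z-score of each interior bin vs. the mean of its two neighbours
    (Burgstahler-Dichev-style standardized difference)."""
    counts, _ = np.histogram(x, bins=bins)
    n = counts.sum()
    p = counts / n
    z = np.full(len(counts), np.nan)  # endpoints have no two neighbours
    for i in range(1, len(counts) - 1):
        expected = 0.5 * (counts[i - 1] + counts[i + 1])
        # approximate variance of (observed - expected) under the multinomial
        var = (n * p[i] * (1 - p[i])
               + 0.25 * n * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        z[i] = (counts[i] - expected) / np.sqrt(max(var, 1e-12))
    return z

# A density spike should surface as the bin with the largest |Z|.
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0, 1, 5000), np.full(2000, 0.51)])
z = standardized_differences(x, bins=np.linspace(0, 1, 41))
print(int(np.nanargmax(np.abs(z))))  # index of the bin containing 0.51
```

The same scan over the cosine (or dHash) histogram flags the bin where the regime transition sits; the reported 0.985 / 2.0 thresholds come from the committed script, not this sketch.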
@@ -0,0 +1,406 @@
#!/usr/bin/env python3
"""
Script 17: Beta Mixture Model via EM + Gaussian Mixture on Logit Transform
==========================================================================
Fits a 2-component Beta mixture to cosine similarity, plus parallel
Gaussian mixture on logit-transformed data as robustness check.

Theory:
- Cosine similarity is bounded [0,1] so Beta is the natural parametric
  family for the component distributions.
- EM algorithm (Dempster, Laird & Rubin 1977) provides ML estimates.
- If the mixture gives a crossing point, that is the Bayes-optimal
  threshold under the fitted model.
- Robustness: logit(x) maps (0,1) to the real line, where Gaussian
  mixture is standard; White (1982) quasi-MLE guarantees asymptotic
  recovery of the best Beta-family approximation even under
  mis-specification.

Parametrization of Beta via method-of-moments inside the M-step:
    alpha = mu * ((mu*(1-mu))/var - 1)
    beta  = (1-mu) * ((mu*(1-mu))/var - 1)

Expected outcome (per memory 2026-04-16):
    Signature-level Beta mixture FAILS to separate hand-signed vs
    non-hand-signed because the distribution is unimodal long-tail.
    Report this as a formal result -- it motivates the pivot to
    accountant-level mixture (Script 18).

Output:
    reports/beta_mixture/beta_mixture_report.md
    reports/beta_mixture/beta_mixture_results.json
    reports/beta_mixture/beta_mixture_<case>.png
"""

import sqlite3
import json
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.optimize import brentq
from sklearn.mixture import GaussianMixture

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/beta_mixture')
OUT.mkdir(parents=True, exist_ok=True)

FIRM_A = '勤業眾信聯合'
EPS = 1e-6
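

# The method-of-moments inversion used in the EM M-step below can be checked
# in isolation: for any feasible (mu, var) with var < mu*(1-mu), the implied
# (alpha, beta) reproduce the target moments exactly. Illustrative helper only
# (hypothetical name, not called by the pipeline):
def _mom_beta_roundtrip(mu=0.9, var=0.005):
    from scipy import stats as _stats  # local import keeps the check standalone
    factor = mu * (1 - mu) / var - 1
    alpha, beta = mu * factor, (1 - mu) * factor
    d = _stats.beta(alpha, beta)
    return d.mean(), d.var()  # should equal (mu, var) up to rounding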


def fit_beta_mixture_em(x, n_components=2, max_iter=300, tol=1e-6, seed=42):
    """
    Fit a K-component Beta mixture via EM using MoM M-step estimates for
    alpha/beta of each component. MoM works because Beta is fully determined
    by its mean and variance under the moment equations.
    """
    rng = np.random.default_rng(seed)
    x = np.clip(np.asarray(x, dtype=float), EPS, 1 - EPS)
    n = len(x)
    K = n_components

    # Initialise responsibilities by quantile-based split
    q = np.linspace(0, 1, K + 1)
    thresh = np.quantile(x, q[1:-1])
    labels = np.digitize(x, thresh)
    resp = np.zeros((n, K))
    resp[np.arange(n), labels] = 1.0

    params = []  # list of dicts with alpha, beta, weight
    log_like_hist = []
    for it in range(max_iter):
        # M-step
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        mus = (resp * x[:, None]).sum(axis=0) / nk
        var_num = (resp * (x[:, None] - mus) ** 2).sum(axis=0)
        vars_ = var_num / nk
        # Ensure validity for Beta: var < mu*(1-mu)
        upper = mus * (1 - mus) - 1e-9
        vars_ = np.minimum(vars_, upper)
        vars_ = np.maximum(vars_, 1e-9)
        factor = mus * (1 - mus) / vars_ - 1
        factor = np.maximum(factor, 1e-6)
        alphas = mus * factor
        betas = (1 - mus) * factor
        params = [{'alpha': float(alphas[k]), 'beta': float(betas[k]),
                   'weight': float(weights[k]), 'mu': float(mus[k]),
                   'var': float(vars_[k])} for k in range(K)]

        # E-step
        log_pdfs = np.column_stack([
            stats.beta.logpdf(x, alphas[k], betas[k]) + np.log(weights[k])
            for k in range(K)
        ])
        m = log_pdfs.max(axis=1, keepdims=True)
        log_like = (m.ravel() + np.log(np.exp(log_pdfs - m).sum(axis=1))).sum()
        log_like_hist.append(float(log_like))
        new_resp = np.exp(log_pdfs - m)
        new_resp = new_resp / new_resp.sum(axis=1, keepdims=True)

        if it > 0 and abs(log_like_hist[-1] - log_like_hist[-2]) < tol:
            resp = new_resp
            break
        resp = new_resp

    # Order components by mean ascending (so C1 = low mean, CK = high mean)
    order = np.argsort([p['mu'] for p in params])
    params = [params[i] for i in order]
    resp = resp[:, order]

    # AIC/BIC (k = 3K - 1 free parameters: alpha, beta, weight each component;
    # weights sum to 1 removes one df)
    k = 3 * K - 1
    aic = 2 * k - 2 * log_like_hist[-1]
    bic = k * np.log(n) - 2 * log_like_hist[-1]

    return {
        'components': params,
        'log_likelihood': log_like_hist[-1],
        'aic': float(aic),
        'bic': float(bic),
        'n_iter': it + 1,
        'responsibilities': resp,
    }


def mixture_crossing(params, x_range):
    """Find crossing point of two weighted component densities (K=2)."""
    if len(params) != 2:
        return None
    a1, b1, w1 = params[0]['alpha'], params[0]['beta'], params[0]['weight']
    a2, b2, w2 = params[1]['alpha'], params[1]['beta'], params[1]['weight']

    def diff(x):
        return (w2 * stats.beta.pdf(x, a2, b2)
                - w1 * stats.beta.pdf(x, a1, b1))

    # Search for sign change inside the overlap region
    xs = np.linspace(x_range[0] + 1e-4, x_range[1] - 1e-4, 2000)
    ys = diff(xs)
    sign_changes = np.where(np.diff(np.sign(ys)) != 0)[0]
    if len(sign_changes) == 0:
        return None
    # Pick crossing closest to midpoint of component means
    mid = 0.5 * (params[0]['mu'] + params[1]['mu'])
    crossings = []
    for i in sign_changes:
        try:
            x0 = brentq(diff, xs[i], xs[i + 1])
            crossings.append(x0)
        except ValueError:
            continue
    if not crossings:
        return None
    return min(crossings, key=lambda c: abs(c - mid))
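

# At a crossing point the weighted component densities are equal, so each
# component's posterior responsibility is exactly 0.5 there: the crossing is
# the Bayes decision boundary under equal misclassification costs. Standalone
# illustration on hypothetical component parameters (not pipeline output):
def _crossing_is_bayes_boundary():
    from scipy import stats          # local imports keep the snippet standalone
    from scipy.optimize import brentq
    a1, b1, w1 = 30.0, 3.0, 0.4   # low-mean component (mean ~0.91)
    a2, b2, w2 = 80.0, 2.0, 0.6   # high-mean component (mean ~0.98)

    def diff(x):
        return w2 * stats.beta.pdf(x, a2, b2) - w1 * stats.beta.pdf(x, a1, b1)

    x0 = brentq(diff, 0.85, 0.99)  # diff changes sign on this bracket
    d1 = w1 * stats.beta.pdf(x0, a1, b1)
    d2 = w2 * stats.beta.pdf(x0, a2, b2)
    return x0, d1 / (d1 + d2)  # posterior of component 1 at the crossing ~0.5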


def logit(x):
    x = np.clip(x, EPS, 1 - EPS)
    return np.log(x / (1 - x))


def invlogit(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_gmm_logit(x, n_components=2, seed=42):
    """GMM on logit-transformed values. Returns crossing point in original scale."""
    z = logit(x).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=seed,
                          max_iter=500).fit(z)
    means = gmm.means_.ravel()
    covs = gmm.covariances_.ravel()
    weights = gmm.weights_
    order = np.argsort(means)
    comps = [{
        'mu_logit': float(means[i]),
        'sigma_logit': float(np.sqrt(covs[i])),
        'weight': float(weights[i]),
        'mu_original': float(invlogit(means[i])),
    } for i in order]

    result = {
        'components': comps,
        'log_likelihood': float(gmm.score(z) * len(z)),
        'aic': float(gmm.aic(z)),
        'bic': float(gmm.bic(z)),
        'n_iter': int(gmm.n_iter_),
    }

    if n_components == 2:
        m1, s1, w1 = means[order[0]], np.sqrt(covs[order[0]]), weights[order[0]]
        m2, s2, w2 = means[order[1]], np.sqrt(covs[order[1]]), weights[order[1]]

        def diff(z0):
            return (w2 * stats.norm.pdf(z0, m2, s2)
                    - w1 * stats.norm.pdf(z0, m1, s1))
        zs = np.linspace(min(m1, m2) - 1, max(m1, m2) + 1, 2000)
        ys = diff(zs)
        changes = np.where(np.diff(np.sign(ys)) != 0)[0]
        if len(changes):
            try:
                z_cross = brentq(diff, zs[changes[0]], zs[changes[0] + 1])
                result['crossing_logit'] = float(z_cross)
                result['crossing_original'] = float(invlogit(z_cross))
            except ValueError:
                pass
    return result
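

# Why the logit transform is a fair robustness check: a Gaussian mixture on
# logit(x) is a standard, well-conditioned model for (0,1)-bounded data, and
# component means map back through invlogit. Standalone sketch on synthetic
# two-component data (arbitrary locations, not pipeline output):
def _logit_gmm_demo(seed=0):
    import numpy as np                               # local imports keep the
    from sklearn.mixture import GaussianMixture      # snippet standalone
    rng = np.random.default_rng(seed)
    z = np.concatenate([rng.normal(2.0, 0.3, 1000),   # invlogit(2) ~ 0.881
                        rng.normal(4.0, 0.3, 1000)])  # invlogit(4) ~ 0.982
    gm = GaussianMixture(n_components=2, random_state=seed).fit(z.reshape(-1, 1))
    return np.sort(1.0 / (1.0 + np.exp(-gm.means_.ravel())))  # (0,1)-scale means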


def plot_mixture(x, beta_res, title, out_path, gmm_res=None):
    x = np.asarray(x, dtype=float).ravel()
    x = x[np.isfinite(x)]
    fig, ax = plt.subplots(figsize=(10, 5))
    bin_edges = np.linspace(float(x.min()), float(x.max()), 81)
    ax.hist(x, bins=bin_edges, density=True, alpha=0.45, color='steelblue',
            edgecolor='white')
    xs = np.linspace(max(0.0, x.min() - 0.01), min(1.0, x.max() + 0.01), 500)
    total = np.zeros_like(xs)
    for i, p in enumerate(beta_res['components']):
        comp_pdf = p['weight'] * stats.beta.pdf(xs, p['alpha'], p['beta'])
        total = total + comp_pdf
        ax.plot(xs, comp_pdf, '--', lw=1.5,
                label=f"C{i+1}: α={p['alpha']:.2f}, β={p['beta']:.2f}, "
                      f"w={p['weight']:.2f}")
    ax.plot(xs, total, 'r-', lw=2, label='Beta mixture (sum)')

    crossing = mixture_crossing(beta_res['components'], (xs[0], xs[-1]))
    if crossing is not None:
        ax.axvline(crossing, color='green', ls='--', lw=2,
                   label=f'Beta crossing = {crossing:.4f}')

    if gmm_res and 'crossing_original' in gmm_res:
        ax.axvline(gmm_res['crossing_original'], color='purple', ls=':',
                   lw=2, label=f"Logit-GMM crossing = "
                               f"{gmm_res['crossing_original']:.4f}")

    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.set_title(title)
    ax.legend(fontsize=8)
    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return crossing


def fetch(label):
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    if label == 'firm_a_cosine':
        cur.execute('''
            SELECT s.max_similarity_to_same_accountant
            FROM signatures s
            JOIN accountants a ON s.assigned_accountant = a.name
            WHERE a.firm = ? AND s.max_similarity_to_same_accountant IS NOT NULL
        ''', (FIRM_A,))
    elif label == 'full_cosine':
        cur.execute('''
            SELECT max_similarity_to_same_accountant FROM signatures
            WHERE max_similarity_to_same_accountant IS NOT NULL
        ''')
    else:
        raise ValueError(label)
    vals = [r[0] for r in cur.fetchall() if r[0] is not None]
    conn.close()
    return np.array(vals, dtype=float)


def main():
    print('='*70)
    print('Script 17: Beta Mixture EM + Logit-GMM Robustness Check')
    print('='*70)

    cases = [
        ('firm_a_cosine', 'Firm A cosine max-similarity'),
        ('full_cosine', 'Full-sample cosine max-similarity'),
    ]

    summary = {}
    for key, label in cases:
        print(f'\n[{label}]')
        x = fetch(key)
        print(f'  N = {len(x):,}')

        # Subsample for full sample to keep EM tractable but still stable
        if len(x) > 200000:
            rng = np.random.default_rng(42)
            x_fit = rng.choice(x, 200000, replace=False)
            print(f'  Subsampled to {len(x_fit):,} for EM fitting')
        else:
            x_fit = x

        beta2 = fit_beta_mixture_em(x_fit, n_components=2)
        beta3 = fit_beta_mixture_em(x_fit, n_components=3)
        print(f'  Beta-2 AIC={beta2["aic"]:.1f}, BIC={beta2["bic"]:.1f}')
        print(f'  Beta-3 AIC={beta3["aic"]:.1f}, BIC={beta3["bic"]:.1f}')

        gmm2 = fit_gmm_logit(x_fit, n_components=2)
        gmm3 = fit_gmm_logit(x_fit, n_components=3)
        print(f'  LogGMM2 AIC={gmm2["aic"]:.1f}, BIC={gmm2["bic"]:.1f}')
        print(f'  LogGMM3 AIC={gmm3["aic"]:.1f}, BIC={gmm3["bic"]:.1f}')

        # Report crossings
        crossing_beta = mixture_crossing(beta2['components'], (x.min(), x.max()))
        # Explicit None check: a crossing of exactly 0.0 is falsy but valid.
        print(f'  Beta-2 crossing: '
              f"{('%.4f' % crossing_beta) if crossing_beta is not None else '—'}")
        print(f'  LogGMM-2 crossing (original scale): '
              f"{gmm2.get('crossing_original', '—')}")

        # Plot
        png = OUT / f'beta_mixture_{key}.png'
        plot_mixture(x_fit, beta2, f'{label}: Beta mixture (2 comp)', png,
                     gmm_res=gmm2)
        print(f'  plot: {png}')

        # Strip responsibilities for JSON compactness
        beta2_out = {k: v for k, v in beta2.items() if k != 'responsibilities'}
        beta3_out = {k: v for k, v in beta3.items() if k != 'responsibilities'}

        summary[key] = {
            'label': label,
            'n': int(len(x)),
            'n_fit': int(len(x_fit)),
            'beta_2': beta2_out,
            'beta_3': beta3_out,
            'beta_2_crossing': (float(crossing_beta)
                                if crossing_beta is not None else None),
            'logit_gmm_2': gmm2,
            'logit_gmm_3': gmm3,
            'bic_best': ('beta_2' if beta2['bic'] < beta3['bic']
                         else 'beta_3'),
        }

    # Write JSON
    json_path = OUT / 'beta_mixture_results.json'
    with open(json_path, 'w') as f:
        json.dump({
            'generated_at': datetime.now().isoformat(),
            'results': summary,
        }, f, indent=2, ensure_ascii=False, default=float)
    print(f'\nJSON: {json_path}')

    # Markdown
    md = [
        '# Beta Mixture EM Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Method',
        '',
        '* 2- and 3-component Beta mixture fit by EM with method-of-moments',
        '  M-step (stable for bounded data).',
        '* Parallel 2/3-component Gaussian mixture on logit-transformed',
        '  values as robustness check (White 1982 quasi-MLE consistency).',
        '* Crossing point of the 2-component mixture densities is reported',
        '  as the Bayes-optimal threshold under equal misclassification cost.',
        '',
        '## Results',
        '',
        '| Dataset | N (fit) | Beta-2 BIC | Beta-3 BIC | LogGMM-2 BIC | LogGMM-3 BIC | BIC-best |',
        '|---------|---------|------------|------------|--------------|--------------|----------|',
    ]
    for r in summary.values():
        md.append(
            f"| {r['label']} | {r['n_fit']:,} | "
            f"{r['beta_2']['bic']:.1f} | {r['beta_3']['bic']:.1f} | "
            f"{r['logit_gmm_2']['bic']:.1f} | {r['logit_gmm_3']['bic']:.1f} | "
            f"{r['bic_best']} |"
        )

    md += ['', '## Threshold estimates (2-component)', '',
           '| Dataset | Beta-2 crossing | LogGMM-2 crossing (orig) |',
           '|---------|-----------------|--------------------------|']
    for r in summary.values():
        beta_str = (f"{r['beta_2_crossing']:.4f}"
                    if r['beta_2_crossing'] is not None else '—')
        gmm_str = (f"{r['logit_gmm_2']['crossing_original']:.4f}"
                   if 'crossing_original' in r['logit_gmm_2'] else '—')
        md.append(f"| {r['label']} | {beta_str} | {gmm_str} |")

    md += [
        '',
        '## Interpretation',
        '',
        'A successful 2-component fit with a clear crossing point would',
        'indicate two underlying generative mechanisms (hand-signed vs',
        'non-hand-signed) with a principled Bayes-optimal boundary.',
        '',
        'If Beta-3 BIC is meaningfully smaller than Beta-2, or if the',
        'components of Beta-2 largely overlap (similar means, wide spread),',
        'this is consistent with a unimodal distribution poorly approximated',
        'by two components. Prior finding (2026-04-16) suggested this is',
        'the case at signature level; the accountant-level mixture',
        '(Script 18) is where the bimodality emerges.',
    ]
    md_path = OUT / 'beta_mixture_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'Report: {md_path}')


if __name__ == '__main__':
    main()