fbfab1fa68
Implements Partner v3's statistical rigor requirements at the level of signature vs. accountant analysis units:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`. Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population); full-sample cosine MULTIMODAL (p<0.001, mix of two regimes); accountant-level aggregates MULTIMODAL on both cos and dHash.
- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition detection. Firm A and full-sample cosine transitions at 0.985; dHash at 2.0.
- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM with MoM M-step, plus parallel Gaussian mixture on logit transform as White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature level confirms 2-component is a forced fit -- supporting the pivot to accountant-level mixture.
- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis that was done inline and not saved. BIC-best K=3 with components matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%, Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928, 11.17, 28%, small firms). 2-component natural thresholds: cos=0.9450, dh=8.10.
- Script 19 (Pixel-identity validation): no human annotation needed. Uses pixel_identical_to_closest (310 sigs) as gold positive and Firm A as anchor positive. Confirms Firm A cosine>0.95 = 92.51% (matches prior 2026-04-08 finding of 92.5%), dual rule cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.

Python deps added: diptest, scikit-learn (installed into venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
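The MoM M-step referenced for Script 17 comes down to the two moment equations for the Beta distribution. A minimal standalone sketch (plain Python; the helper name `beta_mom` is illustrative, not from the script): starting from a known Beta(2, 5), computing its mean and variance and plugging them back in recovers the shape parameters exactly.

```python
def beta_mom(mu, var):
    """Method-of-moments Beta shape parameters from mean and variance.

    Valid when 0 < var < mu * (1 - mu), which holds for any Beta.
    """
    factor = mu * (1 - mu) / var - 1
    return mu * factor, (1 - mu) * factor


# Round-trip check: moments of Beta(2, 5), then invert.
a, b = 2.0, 5.0
mu = a / (a + b)                              # 2/7
var = a * b / ((a + b) ** 2 * (a + b + 1))    # 10/392
alpha_hat, beta_hat = beta_mom(mu, var)
print(alpha_hat, beta_hat)  # 2.0 5.0 (up to floating point)
```

Inside the EM loop the same equations are applied per component with responsibility-weighted means and variances, which is why the M-step needs the `var < mu*(1-mu)` clamp: weighted sample moments can violate Beta validity.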
407 lines
15 KiB
Python
#!/usr/bin/env python3
"""
Script 17: Beta Mixture Model via EM + Gaussian Mixture on Logit Transform
==========================================================================
Fits a 2-component Beta mixture to cosine similarity, plus a parallel
Gaussian mixture on logit-transformed data as a robustness check.

Theory:
- Cosine similarity is bounded [0,1], so Beta is the natural parametric
  family for the component distributions.
- The EM algorithm (Dempster, Laird & Rubin 1977) provides ML estimates.
- If the mixture gives a crossing point, that is the Bayes-optimal
  threshold under the fitted model.
- Robustness: logit(x) maps (0,1) to the real line, where a Gaussian
  mixture is standard; White (1982) quasi-MLE guarantees asymptotic
  recovery of the best Beta-family approximation even under
  mis-specification.

Parametrization of Beta via method-of-moments inside the M-step:
    alpha = mu * ((mu*(1-mu))/var - 1)
    beta  = (1-mu) * ((mu*(1-mu))/var - 1)

Expected outcome (per memory 2026-04-16):
    The signature-level Beta mixture FAILS to separate hand-signed vs
    non-hand-signed because the distribution is unimodal long-tail.
    Report this as a formal result -- it motivates the pivot to the
    accountant-level mixture (Script 18).

Output:
    reports/beta_mixture/beta_mixture_report.md
    reports/beta_mixture/beta_mixture_results.json
    reports/beta_mixture/beta_mixture_<case>.png
"""
import sqlite3
import json
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.optimize import brentq
from sklearn.mixture import GaussianMixture

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/beta_mixture')
OUT.mkdir(parents=True, exist_ok=True)

FIRM_A = '勤業眾信聯合'
EPS = 1e-6

def fit_beta_mixture_em(x, n_components=2, max_iter=300, tol=1e-6, seed=42):
    """
    Fit a K-component Beta mixture via EM, using MoM M-step estimates for
    alpha/beta of each component. MoM works because a Beta distribution is
    fully determined by its mean and variance under the moment equations.
    """
    rng = np.random.default_rng(seed)
    x = np.clip(np.asarray(x, dtype=float), EPS, 1 - EPS)
    n = len(x)
    K = n_components

    # Initialise responsibilities by quantile-based split
    q = np.linspace(0, 1, K + 1)
    thresh = np.quantile(x, q[1:-1])
    labels = np.digitize(x, thresh)
    resp = np.zeros((n, K))
    resp[np.arange(n), labels] = 1.0

    params = []  # list of dicts with alpha, beta, weight
    log_like_hist = []
    for it in range(max_iter):
        # M-step
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        mus = (resp * x[:, None]).sum(axis=0) / nk
        var_num = (resp * (x[:, None] - mus) ** 2).sum(axis=0)
        vars_ = var_num / nk
        # Ensure validity for Beta: var < mu*(1-mu)
        upper = mus * (1 - mus) - 1e-9
        vars_ = np.minimum(vars_, upper)
        vars_ = np.maximum(vars_, 1e-9)
        factor = mus * (1 - mus) / vars_ - 1
        factor = np.maximum(factor, 1e-6)
        alphas = mus * factor
        betas = (1 - mus) * factor
        params = [{'alpha': float(alphas[k]), 'beta': float(betas[k]),
                   'weight': float(weights[k]), 'mu': float(mus[k]),
                   'var': float(vars_[k])} for k in range(K)]

        # E-step (log-sum-exp for numerical stability)
        log_pdfs = np.column_stack([
            stats.beta.logpdf(x, alphas[k], betas[k]) + np.log(weights[k])
            for k in range(K)
        ])
        m = log_pdfs.max(axis=1, keepdims=True)
        log_like = (m.ravel() + np.log(np.exp(log_pdfs - m).sum(axis=1))).sum()
        log_like_hist.append(float(log_like))
        new_resp = np.exp(log_pdfs - m)
        new_resp = new_resp / new_resp.sum(axis=1, keepdims=True)

        if it > 0 and abs(log_like_hist[-1] - log_like_hist[-2]) < tol:
            resp = new_resp
            break
        resp = new_resp

    # Order components by mean ascending (so C1 = low mean, CK = high mean)
    order = np.argsort([p['mu'] for p in params])
    params = [params[i] for i in order]
    resp = resp[:, order]

    # AIC/BIC (k = 3K - 1 free parameters: alpha, beta, weight per component;
    # weights summing to 1 removes one df)
    k = 3 * K - 1
    aic = 2 * k - 2 * log_like_hist[-1]
    bic = k * np.log(n) - 2 * log_like_hist[-1]

    return {
        'components': params,
        'log_likelihood': log_like_hist[-1],
        'aic': float(aic),
        'bic': float(bic),
        'n_iter': it + 1,
        'responsibilities': resp,
    }


def mixture_crossing(params, x_range):
    """Find the crossing point of two weighted component densities (K=2)."""
    if len(params) != 2:
        return None
    a1, b1, w1 = params[0]['alpha'], params[0]['beta'], params[0]['weight']
    a2, b2, w2 = params[1]['alpha'], params[1]['beta'], params[1]['weight']

    def diff(x):
        return (w2 * stats.beta.pdf(x, a2, b2)
                - w1 * stats.beta.pdf(x, a1, b1))

    # Search for sign changes inside the overlap region
    xs = np.linspace(x_range[0] + 1e-4, x_range[1] - 1e-4, 2000)
    ys = diff(xs)
    sign_changes = np.where(np.diff(np.sign(ys)) != 0)[0]
    if len(sign_changes) == 0:
        return None
    # Pick the crossing closest to the midpoint of the component means
    mid = 0.5 * (params[0]['mu'] + params[1]['mu'])
    crossings = []
    for i in sign_changes:
        try:
            x0 = brentq(diff, xs[i], xs[i + 1])
            crossings.append(x0)
        except ValueError:
            continue
    if not crossings:
        return None
    return min(crossings, key=lambda c: abs(c - mid))


def logit(x):
    x = np.clip(x, EPS, 1 - EPS)
    return np.log(x / (1 - x))


def invlogit(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_gmm_logit(x, n_components=2, seed=42):
    """GMM on logit-transformed values. Returns the crossing point in the original scale."""
    z = logit(x).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=seed,
                          max_iter=500).fit(z)
    means = gmm.means_.ravel()
    covs = gmm.covariances_.ravel()
    weights = gmm.weights_
    order = np.argsort(means)
    comps = [{
        'mu_logit': float(means[i]),
        'sigma_logit': float(np.sqrt(covs[i])),
        'weight': float(weights[i]),
        'mu_original': float(invlogit(means[i])),
    } for i in order]

    result = {
        'components': comps,
        'log_likelihood': float(gmm.score(z) * len(z)),
        'aic': float(gmm.aic(z)),
        'bic': float(gmm.bic(z)),
        'n_iter': int(gmm.n_iter_),
    }

    if n_components == 2:
        m1, s1, w1 = means[order[0]], np.sqrt(covs[order[0]]), weights[order[0]]
        m2, s2, w2 = means[order[1]], np.sqrt(covs[order[1]]), weights[order[1]]

        def diff(z0):
            return (w2 * stats.norm.pdf(z0, m2, s2)
                    - w1 * stats.norm.pdf(z0, m1, s1))

        zs = np.linspace(min(m1, m2) - 1, max(m1, m2) + 1, 2000)
        ys = diff(zs)
        changes = np.where(np.diff(np.sign(ys)) != 0)[0]
        if len(changes):
            try:
                z_cross = brentq(diff, zs[changes[0]], zs[changes[0] + 1])
                result['crossing_logit'] = float(z_cross)
                result['crossing_original'] = float(invlogit(z_cross))
            except ValueError:
                pass
    return result


def plot_mixture(x, beta_res, title, out_path, gmm_res=None):
    x = np.asarray(x, dtype=float).ravel()
    x = x[np.isfinite(x)]
    fig, ax = plt.subplots(figsize=(10, 5))
    bin_edges = np.linspace(float(x.min()), float(x.max()), 81)
    ax.hist(x, bins=bin_edges, density=True, alpha=0.45, color='steelblue',
            edgecolor='white')
    xs = np.linspace(max(0.0, x.min() - 0.01), min(1.0, x.max() + 0.01), 500)
    total = np.zeros_like(xs)
    for i, p in enumerate(beta_res['components']):
        comp_pdf = p['weight'] * stats.beta.pdf(xs, p['alpha'], p['beta'])
        total = total + comp_pdf
        ax.plot(xs, comp_pdf, '--', lw=1.5,
                label=f"C{i+1}: α={p['alpha']:.2f}, β={p['beta']:.2f}, "
                      f"w={p['weight']:.2f}")
    ax.plot(xs, total, 'r-', lw=2, label='Beta mixture (sum)')

    crossing = mixture_crossing(beta_res['components'], (xs[0], xs[-1]))
    if crossing is not None:
        ax.axvline(crossing, color='green', ls='--', lw=2,
                   label=f'Beta crossing = {crossing:.4f}')

    if gmm_res and 'crossing_original' in gmm_res:
        ax.axvline(gmm_res['crossing_original'], color='purple', ls=':',
                   lw=2, label=f"Logit-GMM crossing = "
                               f"{gmm_res['crossing_original']:.4f}")

    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.set_title(title)
    ax.legend(fontsize=8)
    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return crossing


def fetch(label):
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    if label == 'firm_a_cosine':
        cur.execute('''
            SELECT s.max_similarity_to_same_accountant
            FROM signatures s
            JOIN accountants a ON s.assigned_accountant = a.name
            WHERE a.firm = ? AND s.max_similarity_to_same_accountant IS NOT NULL
        ''', (FIRM_A,))
    elif label == 'full_cosine':
        cur.execute('''
            SELECT max_similarity_to_same_accountant FROM signatures
            WHERE max_similarity_to_same_accountant IS NOT NULL
        ''')
    else:
        raise ValueError(label)
    vals = [r[0] for r in cur.fetchall() if r[0] is not None]
    conn.close()
    return np.array(vals, dtype=float)


def main():
    print('=' * 70)
    print('Script 17: Beta Mixture EM + Logit-GMM Robustness Check')
    print('=' * 70)

    cases = [
        ('firm_a_cosine', 'Firm A cosine max-similarity'),
        ('full_cosine', 'Full-sample cosine max-similarity'),
    ]

    summary = {}
    for key, label in cases:
        print(f'\n[{label}]')
        x = fetch(key)
        print(f'  N = {len(x):,}')

        # Subsample the full sample to keep EM tractable but still stable
        if len(x) > 200000:
            rng = np.random.default_rng(42)
            x_fit = rng.choice(x, 200000, replace=False)
            print(f'  Subsampled to {len(x_fit):,} for EM fitting')
        else:
            x_fit = x

        beta2 = fit_beta_mixture_em(x_fit, n_components=2)
        beta3 = fit_beta_mixture_em(x_fit, n_components=3)
        print(f'  Beta-2 AIC={beta2["aic"]:.1f}, BIC={beta2["bic"]:.1f}')
        print(f'  Beta-3 AIC={beta3["aic"]:.1f}, BIC={beta3["bic"]:.1f}')

        gmm2 = fit_gmm_logit(x_fit, n_components=2)
        gmm3 = fit_gmm_logit(x_fit, n_components=3)
        print(f'  LogGMM2 AIC={gmm2["aic"]:.1f}, BIC={gmm2["bic"]:.1f}')
        print(f'  LogGMM3 AIC={gmm3["aic"]:.1f}, BIC={gmm3["bic"]:.1f}')

        # Report crossings (explicit None check: a crossing of 0.0 is valid)
        crossing_beta = mixture_crossing(beta2['components'], (x.min(), x.max()))
        print(f'  Beta-2 crossing: '
              f"{('%.4f' % crossing_beta) if crossing_beta is not None else '—'}")
        print(f'  LogGMM-2 crossing (original scale): '
              f"{gmm2.get('crossing_original', '—')}")

        # Plot
        png = OUT / f'beta_mixture_{key}.png'
        plot_mixture(x_fit, beta2, f'{label}: Beta mixture (2 comp)', png,
                     gmm_res=gmm2)
        print(f'  plot: {png}')

        # Strip responsibilities for JSON compactness
        beta2_out = {k: v for k, v in beta2.items() if k != 'responsibilities'}
        beta3_out = {k: v for k, v in beta3.items() if k != 'responsibilities'}

        summary[key] = {
            'label': label,
            'n': int(len(x)),
            'n_fit': int(len(x_fit)),
            'beta_2': beta2_out,
            'beta_3': beta3_out,
            'beta_2_crossing': (float(crossing_beta)
                                if crossing_beta is not None else None),
            'logit_gmm_2': gmm2,
            'logit_gmm_3': gmm3,
            'bic_best': ('beta_2' if beta2['bic'] < beta3['bic']
                         else 'beta_3'),
        }
    # Write JSON
    json_path = OUT / 'beta_mixture_results.json'
    with open(json_path, 'w') as f:
        json.dump({
            'generated_at': datetime.now().isoformat(),
            'results': summary,
        }, f, indent=2, ensure_ascii=False, default=float)
    print(f'\nJSON: {json_path}')

    # Markdown
    md = [
        '# Beta Mixture EM Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Method',
        '',
        '* 2- and 3-component Beta mixtures fit by EM with method-of-moments',
        '  M-step (stable for bounded data).',
        '* Parallel 2/3-component Gaussian mixture on logit-transformed',
        '  values as robustness check (White 1982 quasi-MLE consistency).',
        '* The crossing point of the 2-component mixture densities is reported',
        '  as the Bayes-optimal threshold under equal misclassification cost.',
        '',
        '## Results',
        '',
        '| Dataset | N (fit) | Beta-2 BIC | Beta-3 BIC | LogGMM-2 BIC | LogGMM-3 BIC | BIC-best |',
        '|---------|---------|------------|------------|--------------|--------------|----------|',
    ]
    for r in summary.values():
        md.append(
            f"| {r['label']} | {r['n_fit']:,} | "
            f"{r['beta_2']['bic']:.1f} | {r['beta_3']['bic']:.1f} | "
            f"{r['logit_gmm_2']['bic']:.1f} | {r['logit_gmm_3']['bic']:.1f} | "
            f"{r['bic_best']} |"
        )

    md += ['', '## Threshold estimates (2-component)', '',
           '| Dataset | Beta-2 crossing | LogGMM-2 crossing (orig) |',
           '|---------|-----------------|--------------------------|']
    for r in summary.values():
        beta_str = (f"{r['beta_2_crossing']:.4f}"
                    if r['beta_2_crossing'] is not None else '—')
        gmm_str = (f"{r['logit_gmm_2']['crossing_original']:.4f}"
                   if 'crossing_original' in r['logit_gmm_2'] else '—')
        md.append(f"| {r['label']} | {beta_str} | {gmm_str} |")

    md += [
        '',
        '## Interpretation',
        '',
        'A successful 2-component fit with a clear crossing point would',
        'indicate two underlying generative mechanisms (hand-signed vs',
        'non-hand-signed) with a principled Bayes-optimal boundary.',
        '',
        'If Beta-3 BIC is meaningfully smaller than Beta-2, or if the',
        'components of Beta-2 largely overlap (similar means, wide spread),',
        'this is consistent with a unimodal distribution poorly approximated',
        'by two components. The prior finding (2026-04-16) suggested this is',
        'the case at signature level; the accountant-level mixture',
        '(Script 18) is where the bimodality emerges.',
    ]
    md_path = OUT / 'beta_mixture_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'Report: {md_path}')


if __name__ == '__main__':
    main()
|