Files
pdf_signature_extraction/signature_analysis/17_beta_mixture_em.py
gbanyan fbfab1fa68 Add three-convergent-method threshold scripts + pixel-identity validation
Implements Partner v3's statistical-rigor requirements at both analysis
levels (signature-level and accountant-level):

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`.
  Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population);
  full-sample cosine MULTIMODAL (p<0.001, mix of two regimes);
  accountant-level aggregates MULTIMODAL on both cos and dHash.

- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition
  detection. Firm A and full-sample cosine transitions at 0.985; dHash
  at 2.0.

- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM
  with MoM M-step, plus parallel Gaussian mixture on logit transform
  as White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature
  level confirms 2-component is a forced fit -- supporting the pivot
  to accountant-level mixture.

- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis
  that was done inline and not saved. BIC-best K=3 with components
  matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%,
  Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928,
  11.17, 28%, small firms). 2-component natural thresholds:
  cos=0.9450, dh=8.10.

- Script 19 (Pixel-identity validation): no human annotation needed.
  Uses pixel_identical_to_closest (310 sigs) as gold positive and
  Firm A as anchor positive. Confirms Firm A cosine>0.95 = 92.51%
  (matches prior 2026-04-08 finding of 92.5%), dual rule
  cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.
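
The discretised Z-score transition detection described for Script 16 can be sketched with plain NumPy. This is a minimal illustration under assumed conventions; the helper name `standardized_bin_diffs` and the simulated data are illustrative, not taken from the scripts:

```python
import numpy as np

def standardized_bin_diffs(x, edges):
    """Burgstahler-Dichev style statistic: for each interior histogram bin,
    the standardized difference between its count and the mean of its two
    neighbours. A large |z| marks a density discontinuity / transition."""
    n, _ = np.histogram(x, bins=edges)
    N = n.sum()
    p = n / N
    expected = (n[:-2] + n[2:]) / 2.0
    # Variance approximation in the style of Burgstahler & Dichev (1997)
    var = (N * p[1:-1] * (1 - p[1:-1])
           + 0.25 * N * (p[:-2] + p[2:]) * (1 - p[:-2] - p[2:]))
    return (n[1:-1] - expected) / np.sqrt(np.maximum(var, 1e-12))

# A density step at 0.9 should produce a |z| spike at the adjacent bins
rng = np.random.default_rng(42)
x = np.concatenate([rng.uniform(0.0, 0.9, 2000), rng.uniform(0.9, 1.0, 3000)])
z = standardized_bin_diffs(x, np.linspace(0.0, 1.0, 21))
```

With 20 bins of width 0.05, the |z| spike lands at the bins straddling 0.9; the reported transitions at cos = 0.985 and dHash = 2.0 would surface the same way in this framing.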

Python deps added: diptest, scikit-learn (installed into venv).
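
The MoM M-step mentioned for Script 17 inverts the Beta moment equations (alpha and beta from mean and variance). A quick self-contained sanity check of that inversion, with illustrative names not taken from the script:

```python
import numpy as np

def beta_mom(mu, var):
    # Invert the Beta moment equations:
    #   alpha = mu * (mu * (1 - mu) / var - 1)
    #   beta  = (1 - mu) * (mu * (1 - mu) / var - 1)
    factor = mu * (1 - mu) / var - 1
    return mu * factor, (1 - mu) * factor

# Recover the parameters of a simulated Beta(5, 2) sample from its moments
rng = np.random.default_rng(0)
x = rng.beta(5.0, 2.0, size=200_000)
alpha_hat, beta_hat = beta_mom(x.mean(), x.var())
```

At this sample size the estimates land close to (5, 2), which is why the moment inversion is a stable M-step for bounded data.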

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:51:41 +08:00

407 lines
15 KiB
Python

#!/usr/bin/env python3
"""
Script 17: Beta Mixture Model via EM + Gaussian Mixture on Logit Transform
==========================================================================
Fits a 2-component Beta mixture to cosine similarity, plus parallel
Gaussian mixture on logit-transformed data as robustness check.
Theory:
- Cosine similarity is bounded [0,1] so Beta is the natural parametric
family for the component distributions.
- EM algorithm (Dempster, Laird & Rubin 1977) provides ML estimates.
- If the mixture gives a crossing point, that is the Bayes-optimal
threshold under the fitted model.
- Robustness: logit(x) maps (0,1) to the real line, where Gaussian
mixture is standard; White (1982) quasi-MLE guarantees asymptotic
recovery of the best Beta-family approximation even under
mis-specification.
Parametrization of Beta via method-of-moments inside the M-step:
alpha = mu * ((mu*(1-mu))/var - 1)
beta = (1-mu) * ((mu*(1-mu))/var - 1)
Expected outcome (per memory 2026-04-16):
Signature-level Beta mixture FAILS to separate hand-signed vs
non-hand-signed because the distribution is unimodal long-tail.
Report this as a formal result -- it motivates the pivot to
accountant-level mixture (Script 18).
Output:
reports/beta_mixture/beta_mixture_report.md
reports/beta_mixture/beta_mixture_results.json
reports/beta_mixture/beta_mixture_<case>.png
"""
import sqlite3
import json
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.optimize import brentq
from sklearn.mixture import GaussianMixture
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/beta_mixture')
OUT.mkdir(parents=True, exist_ok=True)
FIRM_A = '勤業眾信聯合'
EPS = 1e-6
def fit_beta_mixture_em(x, n_components=2, max_iter=300, tol=1e-6, seed=42):
    """
    Fit a K-component Beta mixture via EM using MoM M-step estimates for
    alpha/beta of each component. MoM works because Beta is fully determined
    by its mean and variance under the moment equations.
    """
    rng = np.random.default_rng(seed)
    x = np.clip(np.asarray(x, dtype=float), EPS, 1 - EPS)
    n = len(x)
    K = n_components
    # Initialise responsibilities by quantile-based split
    q = np.linspace(0, 1, K + 1)
    thresh = np.quantile(x, q[1:-1])
    labels = np.digitize(x, thresh)
    resp = np.zeros((n, K))
    resp[np.arange(n), labels] = 1.0
    params = []  # list of dicts with alpha, beta, weight
    log_like_hist = []
    for it in range(max_iter):
        # M-step
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        mus = (resp * x[:, None]).sum(axis=0) / nk
        var_num = (resp * (x[:, None] - mus) ** 2).sum(axis=0)
        vars_ = var_num / nk
        # Ensure validity for Beta: var < mu*(1-mu)
        upper = mus * (1 - mus) - 1e-9
        vars_ = np.minimum(vars_, upper)
        vars_ = np.maximum(vars_, 1e-9)
        factor = mus * (1 - mus) / vars_ - 1
        factor = np.maximum(factor, 1e-6)
        alphas = mus * factor
        betas = (1 - mus) * factor
        params = [{'alpha': float(alphas[k]), 'beta': float(betas[k]),
                   'weight': float(weights[k]), 'mu': float(mus[k]),
                   'var': float(vars_[k])} for k in range(K)]
        # E-step
        log_pdfs = np.column_stack([
            stats.beta.logpdf(x, alphas[k], betas[k]) + np.log(weights[k])
            for k in range(K)
        ])
        m = log_pdfs.max(axis=1, keepdims=True)
        log_like = (m.ravel() + np.log(np.exp(log_pdfs - m).sum(axis=1))).sum()
        log_like_hist.append(float(log_like))
        new_resp = np.exp(log_pdfs - m)
        new_resp = new_resp / new_resp.sum(axis=1, keepdims=True)
        if it > 0 and abs(log_like_hist[-1] - log_like_hist[-2]) < tol:
            resp = new_resp
            break
        resp = new_resp
    # Order components by mean ascending (so C1 = low mean, CK = high mean)
    order = np.argsort([p['mu'] for p in params])
    params = [params[i] for i in order]
    resp = resp[:, order]
    # AIC/BIC (k = 3K - 1 free parameters: alpha, beta, weight per component;
    # the weights summing to 1 removes one df)
    k = 3 * K - 1
    aic = 2 * k - 2 * log_like_hist[-1]
    bic = k * np.log(n) - 2 * log_like_hist[-1]
    return {
        'components': params,
        'log_likelihood': log_like_hist[-1],
        'aic': float(aic),
        'bic': float(bic),
        'n_iter': it + 1,
        'responsibilities': resp,
    }
def mixture_crossing(params, x_range):
    """Find crossing point of two weighted component densities (K=2)."""
    if len(params) != 2:
        return None
    a1, b1, w1 = params[0]['alpha'], params[0]['beta'], params[0]['weight']
    a2, b2, w2 = params[1]['alpha'], params[1]['beta'], params[1]['weight']

    def diff(x):
        return (w2 * stats.beta.pdf(x, a2, b2)
                - w1 * stats.beta.pdf(x, a1, b1))

    # Search for sign changes inside the overlap region
    xs = np.linspace(x_range[0] + 1e-4, x_range[1] - 1e-4, 2000)
    ys = diff(xs)
    sign_changes = np.where(np.diff(np.sign(ys)) != 0)[0]
    if len(sign_changes) == 0:
        return None
    # Pick the crossing closest to the midpoint of the component means
    mid = 0.5 * (params[0]['mu'] + params[1]['mu'])
    crossings = []
    for i in sign_changes:
        try:
            x0 = brentq(diff, xs[i], xs[i + 1])
            crossings.append(x0)
        except ValueError:
            continue
    if not crossings:
        return None
    return min(crossings, key=lambda c: abs(c - mid))
def logit(x):
    x = np.clip(x, EPS, 1 - EPS)
    return np.log(x / (1 - x))


def invlogit(z):
    return 1.0 / (1.0 + np.exp(-z))
def fit_gmm_logit(x, n_components=2, seed=42):
    """GMM on logit-transformed values. Returns crossing point in original scale."""
    z = logit(x).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=seed,
                          max_iter=500).fit(z)
    means = gmm.means_.ravel()
    covs = gmm.covariances_.ravel()
    weights = gmm.weights_
    order = np.argsort(means)
    comps = [{
        'mu_logit': float(means[i]),
        'sigma_logit': float(np.sqrt(covs[i])),
        'weight': float(weights[i]),
        'mu_original': float(invlogit(means[i])),
    } for i in order]
    result = {
        'components': comps,
        'log_likelihood': float(gmm.score(z) * len(z)),
        'aic': float(gmm.aic(z)),
        'bic': float(gmm.bic(z)),
        'n_iter': int(gmm.n_iter_),
    }
    if n_components == 2:
        m1, s1, w1 = means[order[0]], np.sqrt(covs[order[0]]), weights[order[0]]
        m2, s2, w2 = means[order[1]], np.sqrt(covs[order[1]]), weights[order[1]]

        def diff(z0):
            return (w2 * stats.norm.pdf(z0, m2, s2)
                    - w1 * stats.norm.pdf(z0, m1, s1))

        zs = np.linspace(min(m1, m2) - 1, max(m1, m2) + 1, 2000)
        ys = diff(zs)
        changes = np.where(np.diff(np.sign(ys)) != 0)[0]
        if len(changes):
            try:
                z_cross = brentq(diff, zs[changes[0]], zs[changes[0] + 1])
                result['crossing_logit'] = float(z_cross)
                result['crossing_original'] = float(invlogit(z_cross))
            except ValueError:
                pass
    return result
def plot_mixture(x, beta_res, title, out_path, gmm_res=None):
    x = np.asarray(x, dtype=float).ravel()
    x = x[np.isfinite(x)]
    fig, ax = plt.subplots(figsize=(10, 5))
    bin_edges = np.linspace(float(x.min()), float(x.max()), 81)
    ax.hist(x, bins=bin_edges, density=True, alpha=0.45, color='steelblue',
            edgecolor='white')
    xs = np.linspace(max(0.0, x.min() - 0.01), min(1.0, x.max() + 0.01), 500)
    total = np.zeros_like(xs)
    for i, p in enumerate(beta_res['components']):
        comp_pdf = p['weight'] * stats.beta.pdf(xs, p['alpha'], p['beta'])
        total = total + comp_pdf
        ax.plot(xs, comp_pdf, '--', lw=1.5,
                label=f"C{i+1}: α={p['alpha']:.2f}, β={p['beta']:.2f}, "
                      f"w={p['weight']:.2f}")
    ax.plot(xs, total, 'r-', lw=2, label='Beta mixture (sum)')
    crossing = mixture_crossing(beta_res['components'], (xs[0], xs[-1]))
    if crossing is not None:
        ax.axvline(crossing, color='green', ls='--', lw=2,
                   label=f'Beta crossing = {crossing:.4f}')
    if gmm_res and 'crossing_original' in gmm_res:
        ax.axvline(gmm_res['crossing_original'], color='purple', ls=':',
                   lw=2, label=f"Logit-GMM crossing = "
                               f"{gmm_res['crossing_original']:.4f}")
    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.set_title(title)
    ax.legend(fontsize=8)
    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return crossing
def fetch(label):
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    if label == 'firm_a_cosine':
        cur.execute('''
            SELECT s.max_similarity_to_same_accountant
            FROM signatures s
            JOIN accountants a ON s.assigned_accountant = a.name
            WHERE a.firm = ? AND s.max_similarity_to_same_accountant IS NOT NULL
        ''', (FIRM_A,))
    elif label == 'full_cosine':
        cur.execute('''
            SELECT max_similarity_to_same_accountant FROM signatures
            WHERE max_similarity_to_same_accountant IS NOT NULL
        ''')
    else:
        raise ValueError(label)
    vals = [r[0] for r in cur.fetchall() if r[0] is not None]
    conn.close()
    return np.array(vals, dtype=float)
def main():
    print('=' * 70)
    print('Script 17: Beta Mixture EM + Logit-GMM Robustness Check')
    print('=' * 70)
    cases = [
        ('firm_a_cosine', 'Firm A cosine max-similarity'),
        ('full_cosine', 'Full-sample cosine max-similarity'),
    ]
    summary = {}
    for key, label in cases:
        print(f'\n[{label}]')
        x = fetch(key)
        print(f' N = {len(x):,}')
        # Subsample the full sample to keep EM tractable but still stable
        if len(x) > 200000:
            rng = np.random.default_rng(42)
            x_fit = rng.choice(x, 200000, replace=False)
            print(f' Subsampled to {len(x_fit):,} for EM fitting')
        else:
            x_fit = x

        beta2 = fit_beta_mixture_em(x_fit, n_components=2)
        beta3 = fit_beta_mixture_em(x_fit, n_components=3)
        print(f' Beta-2 AIC={beta2["aic"]:.1f}, BIC={beta2["bic"]:.1f}')
        print(f' Beta-3 AIC={beta3["aic"]:.1f}, BIC={beta3["bic"]:.1f}')

        gmm2 = fit_gmm_logit(x_fit, n_components=2)
        gmm3 = fit_gmm_logit(x_fit, n_components=3)
        print(f' LogGMM2 AIC={gmm2["aic"]:.1f}, BIC={gmm2["bic"]:.1f}')
        print(f' LogGMM3 AIC={gmm3["aic"]:.1f}, BIC={gmm3["bic"]:.1f}')

        # Report crossings ("is not None" so a crossing of exactly 0.0 would
        # not be silently dropped by falsiness)
        crossing_beta = mixture_crossing(beta2['components'], (x.min(), x.max()))
        print(f' Beta-2 crossing: '
              f"{('%.4f' % crossing_beta) if crossing_beta is not None else ''}")
        print(f' LogGMM-2 crossing (original scale): '
              f"{gmm2.get('crossing_original', '')}")

        # Plot
        png = OUT / f'beta_mixture_{key}.png'
        plot_mixture(x_fit, beta2, f'{label}: Beta mixture (2 comp)', png,
                     gmm_res=gmm2)
        print(f' plot: {png}')

        # Strip responsibilities for JSON compactness
        beta2_out = {k: v for k, v in beta2.items() if k != 'responsibilities'}
        beta3_out = {k: v for k, v in beta3.items() if k != 'responsibilities'}
        summary[key] = {
            'label': label,
            'n': int(len(x)),
            'n_fit': int(len(x_fit)),
            'beta_2': beta2_out,
            'beta_3': beta3_out,
            'beta_2_crossing': (float(crossing_beta)
                                if crossing_beta is not None else None),
            'logit_gmm_2': gmm2,
            'logit_gmm_3': gmm3,
            'bic_best': ('beta_2' if beta2['bic'] < beta3['bic']
                         else 'beta_3'),
        }

    # Write JSON
    json_path = OUT / 'beta_mixture_results.json'
    with open(json_path, 'w') as f:
        json.dump({
            'generated_at': datetime.now().isoformat(),
            'results': summary,
        }, f, indent=2, ensure_ascii=False, default=float)
    print(f'\nJSON: {json_path}')

    # Markdown
    md = [
        '# Beta Mixture EM Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Method',
        '',
        '* 2- and 3-component Beta mixture fit by EM with method-of-moments',
        '  M-step (stable for bounded data).',
        '* Parallel 2/3-component Gaussian mixture on logit-transformed',
        '  values as robustness check (White 1982 quasi-MLE consistency).',
        '* Crossing point of the 2-component mixture densities is reported',
        '  as the Bayes-optimal threshold under equal misclassification cost.',
        '',
        '## Results',
        '',
        '| Dataset | N (fit) | Beta-2 BIC | Beta-3 BIC | LogGMM-2 BIC | LogGMM-3 BIC | BIC-best |',
        '|---------|---------|------------|------------|--------------|--------------|----------|',
    ]
    for r in summary.values():
        md.append(
            f"| {r['label']} | {r['n_fit']:,} | "
            f"{r['beta_2']['bic']:.1f} | {r['beta_3']['bic']:.1f} | "
            f"{r['logit_gmm_2']['bic']:.1f} | {r['logit_gmm_3']['bic']:.1f} | "
            f"{r['bic_best']} |"
        )
    md += ['', '## Threshold estimates (2-component)', '',
           '| Dataset | Beta-2 crossing | LogGMM-2 crossing (orig) |',
           '|---------|-----------------|--------------------------|']
    for r in summary.values():
        beta_str = (f"{r['beta_2_crossing']:.4f}"
                    if r['beta_2_crossing'] is not None else '')
        gmm_str = (f"{r['logit_gmm_2']['crossing_original']:.4f}"
                   if 'crossing_original' in r['logit_gmm_2'] else '')
        md.append(f"| {r['label']} | {beta_str} | {gmm_str} |")
    md += [
        '',
        '## Interpretation',
        '',
        'A successful 2-component fit with a clear crossing point would',
        'indicate two underlying generative mechanisms (hand-signed vs',
        'non-hand-signed) with a principled Bayes-optimal boundary.',
        '',
        'If Beta-3 BIC is meaningfully smaller than Beta-2, or if the',
        'components of Beta-2 largely overlap (similar means, wide spread),',
        'this is consistent with a unimodal distribution poorly approximated',
        'by two components. Prior finding (2026-04-16) suggested this is',
        'the case at signature level; the accountant-level mixture',
        '(Script 18) is where the bimodality emerges.',
    ]
    md_path = OUT / 'beta_mixture_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'Report: {md_path}')
if __name__ == '__main__':
    main()