Add three-convergent-method threshold scripts + pixel-identity validation
Implements Partner v3's statistical rigor requirements at the level of signature vs. accountant analysis units:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`. Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population); full-sample cosine MULTIMODAL (p<0.001, mix of two regimes); accountant-level aggregates MULTIMODAL on both cos and dHash.
- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition detection. Firm A and full-sample cosine transitions at 0.985; dHash at 2.0.
- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM with MoM M-step, plus parallel Gaussian mixture on logit transform as White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature level confirms 2-component is a forced fit -- supporting the pivot to accountant-level mixture.
- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis that was done inline and not saved. BIC-best K=3 with components matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%, Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928, 11.17, 28%, small firms). 2-component natural thresholds: cos=0.9450, dh=8.10.
- Script 19 (Pixel-identity validation): no human annotation needed. Uses pixel_identical_to_closest (310 sigs) as gold positive and Firm A as anchor positive. Confirms Firm A cosine>0.95 = 92.51% (matches prior 2026-04-08 finding of 92.5%), dual rule cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.

Python deps added: diptest, scikit-learn (installed into venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
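The Script 16 transition detection referenced above follows the Burgstahler-Dichev standardized-difference logic: compare each histogram bin's count against the mean of its two neighbours and scale by an approximate multinomial variance. A minimal sketch of that idea (illustrative only; `standardized_differences`, the bin grid, and the synthetic spike are assumptions, not the committed script):

```python
import numpy as np

def standardized_differences(x, bins):
    """Z-score of each interior bin vs. the mean of its two neighbours
    (Burgstahler-Dichev-style standardized difference)."""
    counts, _ = np.histogram(x, bins=bins)
    n = counts.sum()
    p = counts / n
    z = np.full(len(counts), np.nan)  # endpoints have no two neighbours
    for i in range(1, len(counts) - 1):
        expected = 0.5 * (counts[i - 1] + counts[i + 1])
        # approximate variance of (observed - expected) under the multinomial
        var = (n * p[i] * (1 - p[i])
               + 0.25 * n * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        z[i] = (counts[i] - expected) / np.sqrt(max(var, 1e-12))
    return z

# A density spike should surface as the bin with the largest |Z|.
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0, 1, 5000), np.full(2000, 0.51)])
z = standardized_differences(x, bins=np.linspace(0, 1, 41))
print(int(np.nanargmax(np.abs(z))))  # index of the bin containing 0.51
```

The same scan over the cosine (or dHash) histogram flags the bin where the regime transition sits; the reported 0.985 / 2.0 thresholds come from the committed script, not this sketch.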
@@ -0,0 +1,406 @@
#!/usr/bin/env python3
"""
Script 17: Beta Mixture Model via EM + Gaussian Mixture on Logit Transform
==========================================================================
Fits a 2-component Beta mixture to cosine similarity, plus parallel
Gaussian mixture on logit-transformed data as robustness check.

Theory:
- Cosine similarity is bounded [0,1] so Beta is the natural parametric
  family for the component distributions.
- EM algorithm (Dempster, Laird & Rubin 1977) provides ML estimates.
- If the mixture gives a crossing point, that is the Bayes-optimal
  threshold under the fitted model.
- Robustness: logit(x) maps (0,1) to the real line, where Gaussian
  mixture is standard; White (1982) quasi-MLE guarantees asymptotic
  recovery of the best Beta-family approximation even under
  mis-specification.

Parametrization of Beta via method-of-moments inside the M-step:
    alpha = mu * ((mu*(1-mu))/var - 1)
    beta  = (1-mu) * ((mu*(1-mu))/var - 1)

Expected outcome (per memory 2026-04-16):
    Signature-level Beta mixture FAILS to separate hand-signed vs
    non-hand-signed because the distribution is unimodal long-tail.
    Report this as a formal result -- it motivates the pivot to
    accountant-level mixture (Script 18).

Output:
    reports/beta_mixture/beta_mixture_report.md
    reports/beta_mixture/beta_mixture_results.json
    reports/beta_mixture/beta_mixture_<case>.png
"""

import sqlite3
import json
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.optimize import brentq
from sklearn.mixture import GaussianMixture

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/beta_mixture')
OUT.mkdir(parents=True, exist_ok=True)

FIRM_A = '勤業眾信聯合'
EPS = 1e-6
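

# The method-of-moments inversion used in the EM M-step below can be checked
# in isolation: for any feasible (mu, var) with var < mu*(1-mu), the implied
# (alpha, beta) reproduce the target moments exactly. Illustrative helper only
# (hypothetical name, not called by the pipeline):
def _mom_beta_roundtrip(mu=0.9, var=0.005):
    from scipy import stats as _stats  # local import keeps the check standalone
    factor = mu * (1 - mu) / var - 1
    alpha, beta = mu * factor, (1 - mu) * factor
    d = _stats.beta(alpha, beta)
    return d.mean(), d.var()  # should equal (mu, var) up to rounding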


def fit_beta_mixture_em(x, n_components=2, max_iter=300, tol=1e-6, seed=42):
    """
    Fit a K-component Beta mixture via EM using MoM M-step estimates for
    alpha/beta of each component. MoM works because Beta is fully determined
    by its mean and variance under the moment equations.
    """
    rng = np.random.default_rng(seed)
    x = np.clip(np.asarray(x, dtype=float), EPS, 1 - EPS)
    n = len(x)
    K = n_components

    # Initialise responsibilities by quantile-based split
    q = np.linspace(0, 1, K + 1)
    thresh = np.quantile(x, q[1:-1])
    labels = np.digitize(x, thresh)
    resp = np.zeros((n, K))
    resp[np.arange(n), labels] = 1.0

    params = []  # list of dicts with alpha, beta, weight
    log_like_hist = []
    for it in range(max_iter):
        # M-step
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        mus = (resp * x[:, None]).sum(axis=0) / nk
        var_num = (resp * (x[:, None] - mus) ** 2).sum(axis=0)
        vars_ = var_num / nk
        # Ensure validity for Beta: var < mu*(1-mu)
        upper = mus * (1 - mus) - 1e-9
        vars_ = np.minimum(vars_, upper)
        vars_ = np.maximum(vars_, 1e-9)
        factor = mus * (1 - mus) / vars_ - 1
        factor = np.maximum(factor, 1e-6)
        alphas = mus * factor
        betas = (1 - mus) * factor
        params = [{'alpha': float(alphas[k]), 'beta': float(betas[k]),
                   'weight': float(weights[k]), 'mu': float(mus[k]),
                   'var': float(vars_[k])} for k in range(K)]

        # E-step
        log_pdfs = np.column_stack([
            stats.beta.logpdf(x, alphas[k], betas[k]) + np.log(weights[k])
            for k in range(K)
        ])
        m = log_pdfs.max(axis=1, keepdims=True)
        log_like = (m.ravel() + np.log(np.exp(log_pdfs - m).sum(axis=1))).sum()
        log_like_hist.append(float(log_like))
        new_resp = np.exp(log_pdfs - m)
        new_resp = new_resp / new_resp.sum(axis=1, keepdims=True)

        if it > 0 and abs(log_like_hist[-1] - log_like_hist[-2]) < tol:
            resp = new_resp
            break
        resp = new_resp

    # Order components by mean ascending (so C1 = low mean, CK = high mean)
    order = np.argsort([p['mu'] for p in params])
    params = [params[i] for i in order]
    resp = resp[:, order]

    # AIC/BIC (k = 3K - 1 free parameters: alpha, beta, weight each component;
    # weights sum to 1 removes one df)
    k = 3 * K - 1
    aic = 2 * k - 2 * log_like_hist[-1]
    bic = k * np.log(n) - 2 * log_like_hist[-1]

    return {
        'components': params,
        'log_likelihood': log_like_hist[-1],
        'aic': float(aic),
        'bic': float(bic),
        'n_iter': it + 1,
        'responsibilities': resp,
    }


def mixture_crossing(params, x_range):
    """Find crossing point of two weighted component densities (K=2)."""
    if len(params) != 2:
        return None
    a1, b1, w1 = params[0]['alpha'], params[0]['beta'], params[0]['weight']
    a2, b2, w2 = params[1]['alpha'], params[1]['beta'], params[1]['weight']

    def diff(x):
        return (w2 * stats.beta.pdf(x, a2, b2)
                - w1 * stats.beta.pdf(x, a1, b1))

    # Search for sign change inside the overlap region
    xs = np.linspace(x_range[0] + 1e-4, x_range[1] - 1e-4, 2000)
    ys = diff(xs)
    sign_changes = np.where(np.diff(np.sign(ys)) != 0)[0]
    if len(sign_changes) == 0:
        return None
    # Pick crossing closest to midpoint of component means
    mid = 0.5 * (params[0]['mu'] + params[1]['mu'])
    crossings = []
    for i in sign_changes:
        try:
            x0 = brentq(diff, xs[i], xs[i + 1])
            crossings.append(x0)
        except ValueError:
            continue
    if not crossings:
        return None
    return min(crossings, key=lambda c: abs(c - mid))
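

# At a crossing point the weighted component densities are equal, so each
# component's posterior responsibility is exactly 0.5 there: the crossing is
# the Bayes decision boundary under equal misclassification costs. Standalone
# illustration on hypothetical component parameters (not pipeline output):
def _crossing_is_bayes_boundary():
    from scipy import stats          # local imports keep the snippet standalone
    from scipy.optimize import brentq
    a1, b1, w1 = 30.0, 3.0, 0.4   # low-mean component (mean ~0.91)
    a2, b2, w2 = 80.0, 2.0, 0.6   # high-mean component (mean ~0.98)

    def diff(x):
        return w2 * stats.beta.pdf(x, a2, b2) - w1 * stats.beta.pdf(x, a1, b1)

    x0 = brentq(diff, 0.85, 0.99)  # diff changes sign on this bracket
    d1 = w1 * stats.beta.pdf(x0, a1, b1)
    d2 = w2 * stats.beta.pdf(x0, a2, b2)
    return x0, d1 / (d1 + d2)  # posterior of component 1 at the crossing ~0.5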


def logit(x):
    x = np.clip(x, EPS, 1 - EPS)
    return np.log(x / (1 - x))


def invlogit(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_gmm_logit(x, n_components=2, seed=42):
    """GMM on logit-transformed values. Returns crossing point in original scale."""
    z = logit(x).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=seed,
                          max_iter=500).fit(z)
    means = gmm.means_.ravel()
    covs = gmm.covariances_.ravel()
    weights = gmm.weights_
    order = np.argsort(means)
    comps = [{
        'mu_logit': float(means[i]),
        'sigma_logit': float(np.sqrt(covs[i])),
        'weight': float(weights[i]),
        'mu_original': float(invlogit(means[i])),
    } for i in order]

    result = {
        'components': comps,
        'log_likelihood': float(gmm.score(z) * len(z)),
        'aic': float(gmm.aic(z)),
        'bic': float(gmm.bic(z)),
        'n_iter': int(gmm.n_iter_),
    }

    if n_components == 2:
        m1, s1, w1 = means[order[0]], np.sqrt(covs[order[0]]), weights[order[0]]
        m2, s2, w2 = means[order[1]], np.sqrt(covs[order[1]]), weights[order[1]]

        def diff(z0):
            return (w2 * stats.norm.pdf(z0, m2, s2)
                    - w1 * stats.norm.pdf(z0, m1, s1))
        zs = np.linspace(min(m1, m2) - 1, max(m1, m2) + 1, 2000)
        ys = diff(zs)
        changes = np.where(np.diff(np.sign(ys)) != 0)[0]
        if len(changes):
            try:
                z_cross = brentq(diff, zs[changes[0]], zs[changes[0] + 1])
                result['crossing_logit'] = float(z_cross)
                result['crossing_original'] = float(invlogit(z_cross))
            except ValueError:
                pass
    return result
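

# Why the logit transform is a fair robustness check: a Gaussian mixture on
# logit(x) is a standard, well-conditioned model for (0,1)-bounded data, and
# component means map back through invlogit. Standalone sketch on synthetic
# two-component data (arbitrary locations, not pipeline output):
def _logit_gmm_demo(seed=0):
    import numpy as np                               # local imports keep the
    from sklearn.mixture import GaussianMixture      # snippet standalone
    rng = np.random.default_rng(seed)
    z = np.concatenate([rng.normal(2.0, 0.3, 1000),   # invlogit(2) ~ 0.881
                        rng.normal(4.0, 0.3, 1000)])  # invlogit(4) ~ 0.982
    gm = GaussianMixture(n_components=2, random_state=seed).fit(z.reshape(-1, 1))
    return np.sort(1.0 / (1.0 + np.exp(-gm.means_.ravel())))  # (0,1)-scale means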


def plot_mixture(x, beta_res, title, out_path, gmm_res=None):
    x = np.asarray(x, dtype=float).ravel()
    x = x[np.isfinite(x)]
    fig, ax = plt.subplots(figsize=(10, 5))
    bin_edges = np.linspace(float(x.min()), float(x.max()), 81)
    ax.hist(x, bins=bin_edges, density=True, alpha=0.45, color='steelblue',
            edgecolor='white')
    xs = np.linspace(max(0.0, x.min() - 0.01), min(1.0, x.max() + 0.01), 500)
    total = np.zeros_like(xs)
    for i, p in enumerate(beta_res['components']):
        comp_pdf = p['weight'] * stats.beta.pdf(xs, p['alpha'], p['beta'])
        total = total + comp_pdf
        ax.plot(xs, comp_pdf, '--', lw=1.5,
                label=f"C{i+1}: α={p['alpha']:.2f}, β={p['beta']:.2f}, "
                      f"w={p['weight']:.2f}")
    ax.plot(xs, total, 'r-', lw=2, label='Beta mixture (sum)')

    crossing = mixture_crossing(beta_res['components'], (xs[0], xs[-1]))
    if crossing is not None:
        ax.axvline(crossing, color='green', ls='--', lw=2,
                   label=f'Beta crossing = {crossing:.4f}')

    if gmm_res and 'crossing_original' in gmm_res:
        ax.axvline(gmm_res['crossing_original'], color='purple', ls=':',
                   lw=2, label=f"Logit-GMM crossing = "
                               f"{gmm_res['crossing_original']:.4f}")

    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.set_title(title)
    ax.legend(fontsize=8)
    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return crossing


def fetch(label):
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    if label == 'firm_a_cosine':
        cur.execute('''
            SELECT s.max_similarity_to_same_accountant
            FROM signatures s
            JOIN accountants a ON s.assigned_accountant = a.name
            WHERE a.firm = ? AND s.max_similarity_to_same_accountant IS NOT NULL
        ''', (FIRM_A,))
    elif label == 'full_cosine':
        cur.execute('''
            SELECT max_similarity_to_same_accountant FROM signatures
            WHERE max_similarity_to_same_accountant IS NOT NULL
        ''')
    else:
        raise ValueError(label)
    vals = [r[0] for r in cur.fetchall() if r[0] is not None]
    conn.close()
    return np.array(vals, dtype=float)


def main():
    print('='*70)
    print('Script 17: Beta Mixture EM + Logit-GMM Robustness Check')
    print('='*70)

    cases = [
        ('firm_a_cosine', 'Firm A cosine max-similarity'),
        ('full_cosine', 'Full-sample cosine max-similarity'),
    ]

    summary = {}
    for key, label in cases:
        print(f'\n[{label}]')
        x = fetch(key)
        print(f'  N = {len(x):,}')

        # Subsample for full sample to keep EM tractable but still stable
        if len(x) > 200000:
            rng = np.random.default_rng(42)
            x_fit = rng.choice(x, 200000, replace=False)
            print(f'  Subsampled to {len(x_fit):,} for EM fitting')
        else:
            x_fit = x

        beta2 = fit_beta_mixture_em(x_fit, n_components=2)
        beta3 = fit_beta_mixture_em(x_fit, n_components=3)
        print(f'  Beta-2 AIC={beta2["aic"]:.1f}, BIC={beta2["bic"]:.1f}')
        print(f'  Beta-3 AIC={beta3["aic"]:.1f}, BIC={beta3["bic"]:.1f}')

        gmm2 = fit_gmm_logit(x_fit, n_components=2)
        gmm3 = fit_gmm_logit(x_fit, n_components=3)
        print(f'  LogGMM2 AIC={gmm2["aic"]:.1f}, BIC={gmm2["bic"]:.1f}')
        print(f'  LogGMM3 AIC={gmm3["aic"]:.1f}, BIC={gmm3["bic"]:.1f}')

        # Report crossings
        crossing_beta = mixture_crossing(beta2['components'], (x.min(), x.max()))
        # Explicit None check: a crossing of exactly 0.0 is falsy but valid.
        print(f'  Beta-2 crossing: '
              f"{('%.4f' % crossing_beta) if crossing_beta is not None else '—'}")
        print(f'  LogGMM-2 crossing (original scale): '
              f"{gmm2.get('crossing_original', '—')}")

        # Plot
        png = OUT / f'beta_mixture_{key}.png'
        plot_mixture(x_fit, beta2, f'{label}: Beta mixture (2 comp)', png,
                     gmm_res=gmm2)
        print(f'  plot: {png}')

        # Strip responsibilities for JSON compactness
        beta2_out = {k: v for k, v in beta2.items() if k != 'responsibilities'}
        beta3_out = {k: v for k, v in beta3.items() if k != 'responsibilities'}

        summary[key] = {
            'label': label,
            'n': int(len(x)),
            'n_fit': int(len(x_fit)),
            'beta_2': beta2_out,
            'beta_3': beta3_out,
            'beta_2_crossing': (float(crossing_beta)
                                if crossing_beta is not None else None),
            'logit_gmm_2': gmm2,
            'logit_gmm_3': gmm3,
            'bic_best': ('beta_2' if beta2['bic'] < beta3['bic']
                         else 'beta_3'),
        }

    # Write JSON
    json_path = OUT / 'beta_mixture_results.json'
    with open(json_path, 'w') as f:
        json.dump({
            'generated_at': datetime.now().isoformat(),
            'results': summary,
        }, f, indent=2, ensure_ascii=False, default=float)
    print(f'\nJSON: {json_path}')

    # Markdown
    md = [
        '# Beta Mixture EM Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Method',
        '',
        '* 2- and 3-component Beta mixture fit by EM with method-of-moments',
        '  M-step (stable for bounded data).',
        '* Parallel 2/3-component Gaussian mixture on logit-transformed',
        '  values as robustness check (White 1982 quasi-MLE consistency).',
        '* Crossing point of the 2-component mixture densities is reported',
        '  as the Bayes-optimal threshold under equal misclassification cost.',
        '',
        '## Results',
        '',
        '| Dataset | N (fit) | Beta-2 BIC | Beta-3 BIC | LogGMM-2 BIC | LogGMM-3 BIC | BIC-best |',
        '|---------|---------|------------|------------|--------------|--------------|----------|',
    ]
    for r in summary.values():
        md.append(
            f"| {r['label']} | {r['n_fit']:,} | "
            f"{r['beta_2']['bic']:.1f} | {r['beta_3']['bic']:.1f} | "
            f"{r['logit_gmm_2']['bic']:.1f} | {r['logit_gmm_3']['bic']:.1f} | "
            f"{r['bic_best']} |"
        )

    md += ['', '## Threshold estimates (2-component)', '',
           '| Dataset | Beta-2 crossing | LogGMM-2 crossing (orig) |',
           '|---------|-----------------|--------------------------|']
    for r in summary.values():
        beta_str = (f"{r['beta_2_crossing']:.4f}"
                    if r['beta_2_crossing'] is not None else '—')
        gmm_str = (f"{r['logit_gmm_2']['crossing_original']:.4f}"
                   if 'crossing_original' in r['logit_gmm_2'] else '—')
        md.append(f"| {r['label']} | {beta_str} | {gmm_str} |")

    md += [
        '',
        '## Interpretation',
        '',
        'A successful 2-component fit with a clear crossing point would',
        'indicate two underlying generative mechanisms (hand-signed vs',
        'non-hand-signed) with a principled Bayes-optimal boundary.',
        '',
        'If Beta-3 BIC is meaningfully smaller than Beta-2, or if the',
        'components of Beta-2 largely overlap (similar means, wide spread),',
        'this is consistent with a unimodal distribution poorly approximated',
        'by two components. Prior finding (2026-04-16) suggested this is',
        'the case at signature level; the accountant-level mixture',
        '(Script 18) is where the bimodality emerges.',
    ]
    md_path = OUT / 'beta_mixture_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'Report: {md_path}')


if __name__ == '__main__':
    main()