Add three-convergent-method threshold scripts + pixel-identity validation

Implements Partner v3's statistical rigor requirements at the level of signature vs. accountant analysis units:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`. Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population); full-sample cosine MULTIMODAL (p<0.001, mix of two regimes); accountant-level aggregates MULTIMODAL on both cos and dHash.
- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition detection. Firm A and full-sample cosine transitions at 0.985; dHash at 2.0.
- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM with a method-of-moments M-step, plus a parallel Gaussian mixture on the logit transform as a White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at the signature level confirms the 2-component model is a forced fit, supporting the pivot to an accountant-level mixture.
- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis that was done inline and not saved. BIC-best K=3 with components matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%, Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928, 11.17, 28%, small firms). 2-component natural thresholds: cos=0.9450, dh=8.10.
- Script 19 (Pixel-identity validation): no human annotation needed. Uses pixel_identical_to_closest (310 sigs) as the gold positive set and Firm A as the anchor positive. Confirms Firm A cosine>0.95 = 92.51% (matches the prior 2026-04-08 finding of 92.5%); the dual rule cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.

Python deps added: diptest, scikit-learn (installed into venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
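Scripts 15-18 all reduce to one pattern: fit candidate mixture models and let an information criterion decide how many regimes the data supports. For readers without the repo, here is a minimal self-contained sketch of that BIC model-selection step (the shape of Script 18's `fit_gmm_range`) on simulated two-regime data; the cluster means and counts are illustrative stand-ins, not the real per-accountant aggregates.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Simulated (cos_mean, dh_mean) pairs: two well-separated synthetic regimes
# standing in for the real accountant-level aggregates (values illustrative).
X = np.vstack([
    rng.multivariate_normal([0.98, 2.5], [[1e-4, 0], [0, 0.5]], 150),
    rng.multivariate_normal([0.93, 11.0], [[4e-4, 0], [0, 4.0]], 150),
])

# Fit K = 1..4 full-covariance Gaussian mixtures and score each by BIC.
bics = {}
for k in (1, 2, 3, 4):
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=42, n_init=5).fit(X)
    bics[k] = gmm.bic(X)

best_k = min(bics, key=bics.get)  # BIC should favour the true 2-regime structure
print(best_k)
```

BIC's per-parameter penalty (each extra full-covariance 2D component costs 6 covariance + 2 mean + 1 weight parameters) is what keeps the selection from always preferring more components, which is the same argument the commit uses to reject the forced 2-component signature-level fit.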
@@ -0,0 +1,396 @@
#!/usr/bin/env python3
"""
Script 18: Accountant-Level 3-Component Gaussian Mixture
========================================================
Rebuild the GMM analysis from memory 2026-04-16: at the accountant level
(not signature level), the joint distribution of (cosine_mean, dhash_mean)
separates into three components corresponding to signing-behaviour
regimes:

    C1  High-replication      cos_mean ≈ 0.983, dh_mean ≈ 2.4,  ~20%, Deloitte-heavy
    C2  Middle band           cos_mean ≈ 0.954, dh_mean ≈ 7.0,  ~52%, KPMG/PwC/EY
    C3  Hand-signed tendency  cos_mean ≈ 0.928, dh_mean ≈ 11.2, ~28%, small firms

The script:
1. Aggregates per-accountant means from the signature table.
2. Fits 1- to 5-component 2D Gaussian mixtures and selects by BIC.
3. Reports component parameters, cluster assignments, and per-firm
   breakdown.
4. For the 2-component fit, derives the natural threshold (crossing of
   marginal densities in cosine-mean and dhash-mean).

Output:
    reports/accountant_mixture/accountant_mixture_report.md
    reports/accountant_mixture/accountant_mixture_results.json
    reports/accountant_mixture/accountant_mixture_2d.png
    reports/accountant_mixture/accountant_mixture_marginals.png
"""

import sqlite3
import json
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.optimize import brentq
from sklearn.mixture import GaussianMixture

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'accountant_mixture')
OUT.mkdir(parents=True, exist_ok=True)

MIN_SIGS = 10


def load_accountant_aggregates():
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.assigned_accountant,
               a.firm,
               AVG(s.max_similarity_to_same_accountant) AS cos_mean,
               AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
               COUNT(*) AS n
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.min_dhash_independent IS NOT NULL
        GROUP BY s.assigned_accountant
        HAVING n >= ?
    ''', (MIN_SIGS,))
    rows = cur.fetchall()
    conn.close()
    return [
        {'accountant': r[0], 'firm': r[1] or '(unknown)',
         'cos_mean': float(r[2]), 'dh_mean': float(r[3]), 'n': int(r[4])}
        for r in rows
    ]


def fit_gmm_range(X, ks=(1, 2, 3, 4, 5), seed=42, n_init=10):
    results = []
    best_bic = np.inf
    best = None
    for k in ks:
        gmm = GaussianMixture(
            n_components=k, covariance_type='full',
            random_state=seed, n_init=n_init, max_iter=500,
        ).fit(X)
        bic = gmm.bic(X)
        aic = gmm.aic(X)
        results.append({
            'k': int(k), 'bic': float(bic), 'aic': float(aic),
            'converged': bool(gmm.converged_), 'n_iter': int(gmm.n_iter_),
        })
        if bic < best_bic:
            best_bic = bic
            best = gmm
    return results, best


def summarize_components(gmm, X, df):
    """Assign clusters, return per-component stats + per-firm breakdown."""
    labels = gmm.predict(X)
    means = gmm.means_
    # In prior memory C1 was the HIGH-replication regime (highest cos_mean),
    # so order components DESCENDING by cos_mean and relabel so C1 = high.
    order = np.argsort(-means[:, 0])
    relabel = {int(old): new + 1 for new, old in enumerate(order)}
    new_labels = np.array([relabel[int(l)] for l in labels])

    components = []
    for rank, old_idx in enumerate(order, start=1):
        mu = means[old_idx]
        cov = gmm.covariances_[old_idx]
        w = gmm.weights_[old_idx]
        mask = new_labels == rank
        firms = {}
        for row, in_cluster in zip(df, mask):
            if not in_cluster:
                continue
            firms[row['firm']] = firms.get(row['firm'], 0) + 1
        firms_sorted = sorted(firms.items(), key=lambda kv: -kv[1])
        components.append({
            'component': rank,
            'mu_cos': float(mu[0]),
            'mu_dh': float(mu[1]),
            'cov_00': float(cov[0, 0]),
            'cov_11': float(cov[1, 1]),
            'cov_01': float(cov[0, 1]),
            'corr': float(cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])),
            'weight': float(w),
            'n_accountants': int(mask.sum()),
            'top_firms': firms_sorted[:5],
        })
    return components, new_labels


def marginal_crossing(means, covs, weights, dim, search_lo, search_hi):
    """Find the crossing of two weighted marginal Gaussians along `dim`."""
    m1, m2 = means[0][dim], means[1][dim]
    s1 = np.sqrt(covs[0][dim, dim])
    s2 = np.sqrt(covs[1][dim, dim])
    w1, w2 = weights[0], weights[1]

    def diff(x):
        return (w2 * stats.norm.pdf(x, m2, s2)
                - w1 * stats.norm.pdf(x, m1, s1))

    xs = np.linspace(search_lo, search_hi, 2000)
    ys = diff(xs)
    changes = np.where(np.diff(np.sign(ys)) != 0)[0]
    if not len(changes):
        return None
    mid = 0.5 * (m1 + m2)
    crossings = []
    for i in changes:
        try:
            crossings.append(brentq(diff, xs[i], xs[i + 1]))
        except ValueError:
            continue
    if not crossings:
        return None
    # If there are multiple sign changes, keep the crossing nearest the
    # midpoint of the two component means.
    return float(min(crossings, key=lambda c: abs(c - mid)))


def plot_2d(df, labels, means, title, out_path):
    colors = ['#d62728', '#1f77b4', '#2ca02c', '#9467bd', '#ff7f0e']
    fig, ax = plt.subplots(figsize=(9, 7))
    for k in sorted(set(labels)):
        mask = labels == k
        xs = [r['cos_mean'] for r, m in zip(df, mask) if m]
        ys = [r['dh_mean'] for r, m in zip(df, mask) if m]
        ax.scatter(xs, ys, s=20, alpha=0.55, color=colors[(k - 1) % 5],
                   label=f'C{k} (n={int(mask.sum())})')
    for i, mu in enumerate(means):
        ax.plot(mu[0], mu[1], 'k*', ms=18, mec='white', mew=1.5)
        ax.annotate(f' μ{i+1}', (mu[0], mu[1]), fontsize=10)
    ax.set_xlabel('Per-accountant mean cosine max-similarity')
    ax.set_ylabel('Per-accountant mean independent min dHash')
    ax.set_title(title)
    ax.legend()
    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


def plot_marginals(df, labels, gmm_2, out_path, cos_cross=None, dh_cross=None):
    cos = np.array([r['cos_mean'] for r in df])
    dh = np.array([r['dh_mean'] for r in df])

    fig, axes = plt.subplots(1, 2, figsize=(13, 5))

    # Cosine marginal
    ax = axes[0]
    ax.hist(cos, bins=40, density=True, alpha=0.5, color='steelblue',
            edgecolor='white')
    xs = np.linspace(cos.min(), cos.max(), 400)
    means_2 = gmm_2.means_
    covs_2 = gmm_2.covariances_
    weights_2 = gmm_2.weights_
    order = np.argsort(-means_2[:, 0])
    for rank, i in enumerate(order, start=1):
        ys = weights_2[i] * stats.norm.pdf(xs, means_2[i, 0],
                                           np.sqrt(covs_2[i, 0, 0]))
        ax.plot(xs, ys, '--', label=f'C{rank} μ={means_2[i,0]:.3f}')
    if cos_cross is not None:
        ax.axvline(cos_cross, color='green', lw=2,
                   label=f'Crossing = {cos_cross:.4f}')
    ax.set_xlabel('Per-accountant mean cosine')
    ax.set_ylabel('Density')
    ax.set_title('Cosine marginal (2-component fit)')
    ax.legend(fontsize=8)

    # dHash marginal
    ax = axes[1]
    ax.hist(dh, bins=40, density=True, alpha=0.5, color='coral',
            edgecolor='white')
    xs = np.linspace(dh.min(), dh.max(), 400)
    for rank, i in enumerate(order, start=1):
        ys = weights_2[i] * stats.norm.pdf(xs, means_2[i, 1],
                                           np.sqrt(covs_2[i, 1, 1]))
        ax.plot(xs, ys, '--', label=f'C{rank} μ={means_2[i,1]:.2f}')
    if dh_cross is not None:
        ax.axvline(dh_cross, color='green', lw=2,
                   label=f'Crossing = {dh_cross:.4f}')
    ax.set_xlabel('Per-accountant mean dHash')
    ax.set_ylabel('Density')
    ax.set_title('dHash marginal (2-component fit)')
    ax.legend(fontsize=8)

    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


def main():
    print('=' * 70)
    print('Script 18: Accountant-Level Gaussian Mixture')
    print('=' * 70)

    df = load_accountant_aggregates()
    print(f'\nAccountants with >= {MIN_SIGS} signatures: {len(df)}')
    X = np.array([[r['cos_mean'], r['dh_mean']] for r in df])

    # Fit K=1..5 and select by BIC
    print('\nFitting GMMs with K=1..5...')
    bic_results, _ = fit_gmm_range(X, ks=(1, 2, 3, 4, 5), seed=42, n_init=15)
    for r in bic_results:
        print(f"  K={r['k']}: BIC={r['bic']:.2f} AIC={r['aic']:.2f} "
              f"converged={r['converged']}")
    best_k = min(bic_results, key=lambda r: r['bic'])['k']
    print(f'\nBIC-best K = {best_k}')

    # Fit 3-component specifically (the target model from the 2026-04-16 analysis)
    gmm_3 = GaussianMixture(n_components=3, covariance_type='full',
                            random_state=42, n_init=15, max_iter=500).fit(X)
    comps_3, labels_3 = summarize_components(gmm_3, X, df)

    print('\n--- 3-component summary ---')
    for c in comps_3:
        tops = ', '.join(f"{f}({n})" for f, n in c['top_firms'])
        print(f"  C{c['component']}: cos={c['mu_cos']:.3f}, "
              f"dh={c['mu_dh']:.2f}, w={c['weight']:.2f}, "
              f"n={c['n_accountants']} -> {tops}")

    # Fit 2-component for threshold derivation
    gmm_2 = GaussianMixture(n_components=2, covariance_type='full',
                            random_state=42, n_init=15, max_iter=500).fit(X)
    comps_2, labels_2 = summarize_components(gmm_2, X, df)

    # Natural thresholds: crossings of the weighted marginal densities
    cos_cross = marginal_crossing(gmm_2.means_, gmm_2.covariances_,
                                  gmm_2.weights_, dim=0,
                                  search_lo=X[:, 0].min(),
                                  search_hi=X[:, 0].max())
    dh_cross = marginal_crossing(gmm_2.means_, gmm_2.covariances_,
                                 gmm_2.weights_, dim=1,
                                 search_lo=X[:, 1].min(),
                                 search_hi=X[:, 1].max())
    print(f'\n2-component crossings: cos={cos_cross}, dh={dh_cross}')

    # Plots
    plot_2d(df, labels_3, gmm_3.means_,
            '3-component accountant-level GMM',
            OUT / 'accountant_mixture_2d.png')
    plot_marginals(df, labels_2, gmm_2,
                   OUT / 'accountant_mixture_marginals.png',
                   cos_cross=cos_cross, dh_cross=dh_cross)

    # Per-accountant CSV (for downstream use)
    csv_path = OUT / 'accountant_clusters.csv'
    with open(csv_path, 'w', encoding='utf-8') as f:
        f.write('accountant,firm,n_signatures,cos_mean,dh_mean,'
                'cluster_k3,cluster_k2\n')
        for r, k3, k2 in zip(df, labels_3, labels_2):
            f.write(f"{r['accountant']},{r['firm']},{r['n']},"
                    f"{r['cos_mean']:.6f},{r['dh_mean']:.6f},{k3},{k2}\n")
    print(f'CSV: {csv_path}')

    # Summary JSON
    summary = {
        'generated_at': datetime.now().isoformat(),
        'n_accountants': len(df),
        'min_signatures': MIN_SIGS,
        'bic_model_selection': bic_results,
        'best_k_by_bic': best_k,
        'gmm_3': {
            'components': comps_3,
            'aic': float(gmm_3.aic(X)),
            'bic': float(gmm_3.bic(X)),
            'log_likelihood': float(gmm_3.score(X) * len(X)),
        },
        'gmm_2': {
            'components': comps_2,
            'aic': float(gmm_2.aic(X)),
            'bic': float(gmm_2.bic(X)),
            'log_likelihood': float(gmm_2.score(X) * len(X)),
            'cos_crossing': cos_cross,
            'dh_crossing': dh_cross,
        },
    }
    with open(OUT / 'accountant_mixture_results.json', 'w') as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    print(f'JSON: {OUT / "accountant_mixture_results.json"}')

    # Markdown report
    md = [
        '# Accountant-Level Gaussian Mixture Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Data',
        '',
        '* Per-accountant aggregates: mean cosine max-similarity, '
        'mean independent min dHash.',
        f'* Minimum signatures per accountant: {MIN_SIGS}.',
        f'* Accountants included: **{len(df)}**.',
        '',
        '## Model selection (BIC)',
        '',
        '| K | BIC | AIC | Converged |',
        '|---|-----|-----|-----------|',
    ]
    for r in bic_results:
        mark = ' ←best' if r['k'] == best_k else ''
        md.append(
            f"| {r['k']} | {r['bic']:.2f} | {r['aic']:.2f} | "
            f"{r['converged']}{mark} |"
        )

    md += ['', '## 3-component fit', '',
           '| Component | cos_mean | dh_mean | weight | n_accountants | top firms |',
           '|-----------|----------|---------|--------|---------------|-----------|']
    for c in comps_3:
        tops = ', '.join(f"{f}:{n}" for f, n in c['top_firms'])
        md.append(
            f"| C{c['component']} | {c['mu_cos']:.3f} | {c['mu_dh']:.2f} | "
            f"{c['weight']:.3f} | {c['n_accountants']} | {tops} |"
        )

    md += ['', '## 2-component fit (threshold derivation)', '',
           '| Component | cos_mean | dh_mean | weight | n_accountants |',
           '|-----------|----------|---------|--------|---------------|']
    for c in comps_2:
        md.append(
            f"| C{c['component']} | {c['mu_cos']:.3f} | {c['mu_dh']:.2f} | "
            f"{c['weight']:.3f} | {c['n_accountants']} |"
        )

    # Guard with `is not None`: a crossing of exactly 0.0 is falsy but valid.
    md += ['', '### Natural thresholds from 2-component crossings', '',
           f'* Cosine: **{cos_cross:.4f}**' if cos_cross is not None
           else '* Cosine: no crossing found',
           f'* dHash: **{dh_cross:.4f}**' if dh_cross is not None
           else '* dHash: no crossing found',
           '',
           '## Interpretation',
           '',
           'The accountant-level mixture separates signing-behaviour regimes,',
           'while the signature-level distribution is a continuous spectrum',
           '(see Scripts 15 and 17). The BIC-best model chooses how many',
           'discrete regimes the data supports. The 2-component crossings',
           'are the natural per-accountant thresholds for classifying a',
           "CPA's signing behaviour.",
           '',
           '## Artifacts',
           '',
           '* `accountant_mixture_2d.png` - 2D scatter with 3-component fit',
           '* `accountant_mixture_marginals.png` - 1D marginals with 2-component fit',
           '* `accountant_clusters.csv` - per-accountant cluster assignments',
           '* `accountant_mixture_results.json` - full numerical results',
           ]
    (OUT / 'accountant_mixture_report.md').write_text('\n'.join(md),
                                                      encoding='utf-8')
    print(f'Report: {OUT / "accountant_mixture_report.md"}')


if __name__ == '__main__':
    main()