pdf_signature_extraction/signature_analysis/18_accountant_mixture.py
gbanyan fbfab1fa68 Add three-convergent-method threshold scripts + pixel-identity validation
Implements Partner v3's statistical rigor requirements at the level of
signature vs. accountant analysis units:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`.
  Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population);
  full-sample cosine MULTIMODAL (p<0.001, mix of two regimes);
  accountant-level aggregates MULTIMODAL on both cos and dHash.

- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition
  detection. Firm A and full-sample cosine transitions at 0.985; dHash
  at 2.0.

- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM
  with MoM M-step, plus parallel Gaussian mixture on logit transform
  as White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature
  level confirms 2-component is a forced fit -- supporting the pivot
  to accountant-level mixture.

- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis
  that was done inline and not saved. BIC-best K=3 with components
  matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%,
  Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928,
  11.17, 28%, small firms). 2-component natural thresholds:
  cos=0.9450, dh=8.10.

- Script 19 (Pixel-identity validation): no human annotation needed.
  Uses pixel_identical_to_closest (310 sigs) as gold positive and
  Firm A as anchor positive. Confirms Firm A cosine>0.95 = 92.51%
  (matches prior 2026-04-08 finding of 92.5%), dual rule
  cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.
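
The 2-component natural thresholds quoted for Script 18 are the points where the two weighted marginal Gaussian densities are equal. A minimal sketch of that calculation in closed form follows. The names `weighted_pdf` and `gaussian_crossing` and every parameter value are illustrative assumptions, not the fitted values; Script 18 itself locates the crossing numerically with `brentq`.

```python
import math


def weighted_pdf(x, m, s, w):
    """Weighted 1-D Gaussian density: w * N(x | mean m, std s)."""
    return w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))


def gaussian_crossing(m1, s1, w1, m2, s2, w2):
    """Solve w1*N(x|m1,s1) = w2*N(x|m2,s2) for x.

    Taking logs of both sides gives a quadratic a*x^2 + b*x + c = 0.
    Returns the root nearest the midpoint of the two means, or None
    if the weighted densities never cross.
    """
    a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = (m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2)
         + math.log((w1 * s2) / (w2 * s1)))
    scale = max(1 / (2 * s1**2), 1 / (2 * s2**2))
    if abs(a) < 1e-12 * scale:
        # Equal variances: the quadratic degenerates to a linear equation.
        if b == 0:
            return None
        roots = [-c / b]
    else:
        disc = b * b - 4 * a * c
        if disc < 0:
            return None
        r = math.sqrt(disc)
        roots = [(-b + r) / (2 * a), (-b - r) / (2 * a)]
    mid = 0.5 * (m1 + m2)
    return min(roots, key=lambda x: abs(x - mid))


# Made-up parameters for illustration only (not the fitted mixture).
threshold = gaussian_crossing(0.98, 0.008, 0.4, 0.93, 0.02, 0.6)
print(f'illustrative crossing: {threshold}')
```

With unequal variances there are generally two crossings; keeping the one closest to the midpoint of the means mirrors the choice made in `marginal_crossing`.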

Python deps added: diptest, scikit-learn (installed into venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:51:41 +08:00

397 lines
15 KiB
Python

#!/usr/bin/env python3
"""
Script 18: Accountant-Level 3-Component Gaussian Mixture
========================================================
Rebuild the GMM analysis from memory 2026-04-16: at the accountant level
(not signature level), the joint distribution of (cosine_mean, dhash_mean)
separates into three components corresponding to signing-behaviour
regimes:
C1 High-replication cos_mean ≈ 0.983, dh_mean ≈ 2.4, ~20%, Deloitte-heavy
C2 Middle band cos_mean ≈ 0.954, dh_mean ≈ 7.0, ~52%, KPMG/PwC/EY
C3 Hand-signed tendency cos_mean ≈ 0.928, dh_mean ≈ 11.2, ~28%, small firms
The script:
1. Aggregates per-accountant means from the signature table.
2. Fits 1-, 2-, 3-, 4-component 2D Gaussian mixtures and selects by BIC.
3. Reports component parameters, cluster assignments, and per-firm
breakdown.
4. For the 2-component fit derives the natural threshold (crossing of
marginal densities in cosine-mean and dhash-mean).
Output:
reports/accountant_mixture/accountant_mixture_report.md
reports/accountant_mixture/accountant_mixture_results.json
reports/accountant_mixture/accountant_mixture_2d.png
reports/accountant_mixture/accountant_mixture_marginals.png
"""
import json
import sqlite3
from datetime import datetime
from pathlib import Path

import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; must be set before importing pyplot
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import brentq
from sklearn.mixture import GaussianMixture

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'accountant_mixture')
OUT.mkdir(parents=True, exist_ok=True)

# Accountants with fewer signatures than this are excluded from the fit.
MIN_SIGS = 10


def load_accountant_aggregates():
    """Return one dict per accountant with mean cosine, mean dHash, and count."""
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.assigned_accountant,
               a.firm,
               AVG(s.max_similarity_to_same_accountant) AS cos_mean,
               AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
               COUNT(*) AS n
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.min_dhash_independent IS NOT NULL
        GROUP BY s.assigned_accountant
        HAVING n >= ?
    ''', (MIN_SIGS,))
    rows = cur.fetchall()
    conn.close()
    return [
        {'accountant': r[0], 'firm': r[1] or '(unknown)',
         'cos_mean': float(r[2]), 'dh_mean': float(r[3]), 'n': int(r[4])}
        for r in rows
    ]


def fit_gmm_range(X, ks=(1, 2, 3, 4, 5), seed=42, n_init=10):
    """Fit a full-covariance GMM for each K in `ks`; return (stats, BIC-best model)."""
    results = []
    best_bic = np.inf
    best = None
    for k in ks:
        gmm = GaussianMixture(
            n_components=k, covariance_type='full',
            random_state=seed, n_init=n_init, max_iter=500,
        ).fit(X)
        bic = gmm.bic(X)
        aic = gmm.aic(X)
        results.append({
            'k': int(k), 'bic': float(bic), 'aic': float(aic),
            'converged': bool(gmm.converged_), 'n_iter': int(gmm.n_iter_),
        })
        if bic < best_bic:
            best_bic = bic
            best = gmm
    return results, best


def summarize_components(gmm, X, df):
    """Assign clusters, return per-component stats + per-firm breakdown."""
    labels = gmm.predict(X)
    means = gmm.means_
    # Relabel components in DESCENDING order of cos_mean so that C1 is the
    # high-replication component, consistent with the prior analysis.
    order = np.argsort(-means[:, 0])
    relabel = {int(old): new + 1 for new, old in enumerate(order)}
    new_labels = np.array([relabel[int(l)] for l in labels])
    components = []
    for rank, old_idx in enumerate(order, start=1):
        mu = means[old_idx]
        cov = gmm.covariances_[old_idx]
        w = gmm.weights_[old_idx]
        mask = new_labels == rank
        # Count accountants per firm within this component.
        firms = {}
        for row, in_cluster in zip(df, mask):
            if not in_cluster:
                continue
            firms[row['firm']] = firms.get(row['firm'], 0) + 1
        firms_sorted = sorted(firms.items(), key=lambda kv: -kv[1])
        components.append({
            'component': rank,
            'mu_cos': float(mu[0]),
            'mu_dh': float(mu[1]),
            'cov_00': float(cov[0, 0]),
            'cov_11': float(cov[1, 1]),
            'cov_01': float(cov[0, 1]),
            'corr': float(cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])),
            'weight': float(w),
            'n_accountants': int(mask.sum()),
            'top_firms': firms_sorted[:5],
        })
    return components, new_labels


def marginal_crossing(means, covs, weights, dim, search_lo, search_hi):
    """Find crossing of two weighted marginal Gaussians along dimension `dim`."""
    m1, m2 = means[0][dim], means[1][dim]
    s1 = np.sqrt(covs[0][dim, dim])
    s2 = np.sqrt(covs[1][dim, dim])
    w1, w2 = weights[0], weights[1]

    def diff(x):
        return (w2 * stats.norm.pdf(x, m2, s2)
                - w1 * stats.norm.pdf(x, m1, s1))

    # Bracket sign changes on a grid, then refine each bracket with brentq.
    xs = np.linspace(search_lo, search_hi, 2000)
    ys = diff(xs)
    changes = np.where(np.diff(np.sign(ys)) != 0)[0]
    if not len(changes):
        return None
    mid = 0.5 * (m1 + m2)
    crossings = []
    for i in changes:
        try:
            crossings.append(brentq(diff, xs[i], xs[i + 1]))
        except ValueError:
            continue
    if not crossings:
        return None
    # Of several crossings, keep the one nearest the midpoint of the means.
    return float(min(crossings, key=lambda c: abs(c - mid)))


def plot_2d(df, labels, means, title, out_path):
    colors = ['#d62728', '#1f77b4', '#2ca02c', '#9467bd', '#ff7f0e']
    fig, ax = plt.subplots(figsize=(9, 7))
    for k in sorted(set(labels)):
        mask = labels == k
        xs = [r['cos_mean'] for r, m in zip(df, mask) if m]
        ys = [r['dh_mean'] for r, m in zip(df, mask) if m]
        ax.scatter(xs, ys, s=20, alpha=0.55, color=colors[(k - 1) % 5],
                   label=f'C{k} (n={int(mask.sum())})')
    # Number the means by descending cos_mean so that μk marks cluster Ck,
    # matching the relabelling done in summarize_components().
    means = np.asarray(means)
    for rank, i in enumerate(np.argsort(-means[:, 0]), start=1):
        mu = means[i]
        ax.plot(mu[0], mu[1], 'k*', ms=18, mec='white', mew=1.5)
        ax.annotate(f'  μ{rank}', (mu[0], mu[1]), fontsize=10)
    ax.set_xlabel('Per-accountant mean cosine max-similarity')
    ax.set_ylabel('Per-accountant mean independent min dHash')
    ax.set_title(title)
    ax.legend()
    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


def plot_marginals(df, labels, gmm_2, out_path, cos_cross=None, dh_cross=None):
    cos = np.array([r['cos_mean'] for r in df])
    dh = np.array([r['dh_mean'] for r in df])
    fig, axes = plt.subplots(1, 2, figsize=(13, 5))

    means_2 = gmm_2.means_
    covs_2 = gmm_2.covariances_
    weights_2 = gmm_2.weights_
    order = np.argsort(-means_2[:, 0])  # C1 = highest cos_mean

    # Cosine marginal
    ax = axes[0]
    ax.hist(cos, bins=40, density=True, alpha=0.5, color='steelblue',
            edgecolor='white')
    xs = np.linspace(cos.min(), cos.max(), 400)
    for rank, i in enumerate(order, start=1):
        ys = weights_2[i] * stats.norm.pdf(xs, means_2[i, 0],
                                           np.sqrt(covs_2[i, 0, 0]))
        ax.plot(xs, ys, '--', label=f'C{rank} μ={means_2[i, 0]:.3f}')
    if cos_cross is not None:
        ax.axvline(cos_cross, color='green', lw=2,
                   label=f'Crossing = {cos_cross:.4f}')
    ax.set_xlabel('Per-accountant mean cosine')
    ax.set_ylabel('Density')
    ax.set_title('Cosine marginal (2-component fit)')
    ax.legend(fontsize=8)

    # dHash marginal
    ax = axes[1]
    ax.hist(dh, bins=40, density=True, alpha=0.5, color='coral',
            edgecolor='white')
    xs = np.linspace(dh.min(), dh.max(), 400)
    for rank, i in enumerate(order, start=1):
        ys = weights_2[i] * stats.norm.pdf(xs, means_2[i, 1],
                                           np.sqrt(covs_2[i, 1, 1]))
        ax.plot(xs, ys, '--', label=f'C{rank} μ={means_2[i, 1]:.2f}')
    if dh_cross is not None:
        ax.axvline(dh_cross, color='green', lw=2,
                   label=f'Crossing = {dh_cross:.4f}')
    ax.set_xlabel('Per-accountant mean dHash')
    ax.set_ylabel('Density')
    ax.set_title('dHash marginal (2-component fit)')
    ax.legend(fontsize=8)

    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


def main():
    print('=' * 70)
    print('Script 18: Accountant-Level Gaussian Mixture')
    print('=' * 70)

    df = load_accountant_aggregates()
    print(f'\nAccountants with >= {MIN_SIGS} signatures: {len(df)}')
    X = np.array([[r['cos_mean'], r['dh_mean']] for r in df])

    # Fit K=1..5
    print('\nFitting GMMs with K=1..5...')
    bic_results, _ = fit_gmm_range(X, ks=(1, 2, 3, 4, 5), seed=42, n_init=15)
    for r in bic_results:
        print(f"  K={r['k']}: BIC={r['bic']:.2f} AIC={r['aic']:.2f} "
              f"converged={r['converged']}")
    best_k = min(bic_results, key=lambda r: r['bic'])['k']
    print(f'\nBIC-best K = {best_k}')

    # Fit 3-component specifically (target)
    gmm_3 = GaussianMixture(n_components=3, covariance_type='full',
                            random_state=42, n_init=15, max_iter=500).fit(X)
    comps_3, labels_3 = summarize_components(gmm_3, X, df)
    print('\n--- 3-component summary ---')
    for c in comps_3:
        tops = ', '.join(f"{f}({n})" for f, n in c['top_firms'])
        print(f"  C{c['component']}: cos={c['mu_cos']:.3f}, "
              f"dh={c['mu_dh']:.2f}, w={c['weight']:.2f}, "
              f"n={c['n_accountants']} -> {tops}")

    # Fit 2-component for threshold derivation
    gmm_2 = GaussianMixture(n_components=2, covariance_type='full',
                            random_state=42, n_init=15, max_iter=500).fit(X)
    comps_2, labels_2 = summarize_components(gmm_2, X, df)

    # Crossings
    cos_cross = marginal_crossing(gmm_2.means_, gmm_2.covariances_,
                                  gmm_2.weights_, dim=0,
                                  search_lo=X[:, 0].min(),
                                  search_hi=X[:, 0].max())
    dh_cross = marginal_crossing(gmm_2.means_, gmm_2.covariances_,
                                 gmm_2.weights_, dim=1,
                                 search_lo=X[:, 1].min(),
                                 search_hi=X[:, 1].max())
    print(f'\n2-component crossings: cos={cos_cross}, dh={dh_cross}')

    # Plots
    plot_2d(df, labels_3, gmm_3.means_,
            '3-component accountant-level GMM',
            OUT / 'accountant_mixture_2d.png')
    plot_marginals(df, labels_2, gmm_2,
                   OUT / 'accountant_mixture_marginals.png',
                   cos_cross=cos_cross, dh_cross=dh_cross)

    # Per-accountant CSV (for downstream use)
    csv_path = OUT / 'accountant_clusters.csv'
    with open(csv_path, 'w', encoding='utf-8') as f:
        f.write('accountant,firm,n_signatures,cos_mean,dh_mean,'
                'cluster_k3,cluster_k2\n')
        for r, k3, k2 in zip(df, labels_3, labels_2):
            f.write(f"{r['accountant']},{r['firm']},{r['n']},"
                    f"{r['cos_mean']:.6f},{r['dh_mean']:.6f},{k3},{k2}\n")
    print(f'CSV: {csv_path}')

    # Summary JSON
    summary = {
        'generated_at': datetime.now().isoformat(),
        'n_accountants': len(df),
        'min_signatures': MIN_SIGS,
        'bic_model_selection': bic_results,
        'best_k_by_bic': best_k,
        'gmm_3': {
            'components': comps_3,
            'aic': float(gmm_3.aic(X)),
            'bic': float(gmm_3.bic(X)),
            # score() is the mean per-sample log-likelihood, so scale by n.
            'log_likelihood': float(gmm_3.score(X) * len(X)),
        },
        'gmm_2': {
            'components': comps_2,
            'aic': float(gmm_2.aic(X)),
            'bic': float(gmm_2.bic(X)),
            'log_likelihood': float(gmm_2.score(X) * len(X)),
            'cos_crossing': cos_cross,
            'dh_crossing': dh_cross,
        },
    }
    # ensure_ascii=False writes non-ASCII names verbatim, so fix the encoding.
    with open(OUT / 'accountant_mixture_results.json', 'w',
              encoding='utf-8') as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    print(f'JSON: {OUT / "accountant_mixture_results.json"}')

    # Markdown
    md = [
        '# Accountant-Level Gaussian Mixture Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Data',
        '',
        '* Per-accountant aggregates: mean cosine max-similarity, '
        'mean independent min dHash.',
        f'* Minimum signatures per accountant: {MIN_SIGS}.',
        f'* Accountants included: **{len(df)}**.',
        '',
        '## Model selection (BIC)',
        '',
        '| K | BIC | AIC | Converged |',
        '|---|-----|-----|-----------|',
    ]
    for r in bic_results:
        mark = ' ←best' if r['k'] == best_k else ''
        md.append(
            f"| {r['k']} | {r['bic']:.2f} | {r['aic']:.2f} | "
            f"{r['converged']}{mark} |"
        )
    md += ['', '## 3-component fit', '',
           '| Component | cos_mean | dh_mean | weight | n_accountants | top firms |',
           '|-----------|----------|---------|--------|----------------|-----------|']
    for c in comps_3:
        tops = ', '.join(f"{f}:{n}" for f, n in c['top_firms'])
        md.append(
            f"| C{c['component']} | {c['mu_cos']:.3f} | {c['mu_dh']:.2f} | "
            f"{c['weight']:.3f} | {c['n_accountants']} | {tops} |"
        )
    md += ['', '## 2-component fit (threshold derivation)', '',
           '| Component | cos_mean | dh_mean | weight | n_accountants |',
           '|-----------|----------|---------|--------|----------------|']
    for c in comps_2:
        md.append(
            f"| C{c['component']} | {c['mu_cos']:.3f} | {c['mu_dh']:.2f} | "
            f"{c['weight']:.3f} | {c['n_accountants']} |"
        )
    # Compare against None so that a crossing at exactly 0.0 is still reported.
    md += ['', '### Natural thresholds from 2-component crossings', '',
           (f'* Cosine: **{cos_cross:.4f}**' if cos_cross is not None
            else '* Cosine: no crossing found'),
           (f'* dHash: **{dh_cross:.4f}**' if dh_cross is not None
            else '* dHash: no crossing found'),
           '',
           '## Interpretation',
           '',
           'The accountant-level mixture separates signing-behaviour regimes,',
           'while the signature-level distribution is a continuous spectrum',
           '(see Scripts 15 and 17). BIC selects the number of discrete',
           'regimes the data support. The 2-component crossings',
           'are the natural per-accountant thresholds for classifying a',
           "CPA's signing behaviour.",
           '',
           '## Artifacts',
           '',
           '* `accountant_mixture_2d.png` - 2D scatter with 3-component fit',
           '* `accountant_mixture_marginals.png` - 1D marginals with 2-component fit',
           '* `accountant_clusters.csv` - per-accountant cluster assignments',
           '* `accountant_mixture_results.json` - full numerical results',
           ]
    (OUT / 'accountant_mixture_report.md').write_text('\n'.join(md),
                                                      encoding='utf-8')
    print(f'Report: {OUT / "accountant_mixture_report.md"}')


if __name__ == '__main__':
    main()