Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21

Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE
  antimode, BD/McCrary, and Beta mixture EM on accountant-level means.
  Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979,
  LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at
  accountant level (consistent with smooth clustering, not sharp
  discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions,
  used at signature all-pairs level) vs KDE antimode (single-distribution
  local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds a CPA-level 70/30
  held-out fold. Calibration thresholds are derived from the 70% fold only;
  held-out rates are reported with Wilson 95% CIs (e.g. cos>0.95
  held-out rate=93.61% [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10
  signatures (9 CPAs excluded for insufficient sample). Reconciled across
  intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature
  label determines document label).

## Pixel-identity validation strengthened
- Script 21: built a ~50,000-pair inter-CPA random negative anchor (replacing
  the original n=35 same-CPA low-similarity negative set, whose Wilson CIs
  were untenably wide).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset"
  language per codex feedback (byte-identical positives are a subset, not
  a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods"
  throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of
  unimodality; rejection is consistent with but not a direct test of
  bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to
  "replication-dominated" in prior commit; this commit strengthens that
  framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity"
  (BD/McCrary non-transition at accountant level rules out sharp discrete
  boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP
  or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract,
  Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to
  XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no
  pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired
  it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).
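
For reference, the interior-bin statistic behind the BD/McCrary
non-transition claim (as implemented in Script 20's method_bd_mccrary)
compares each bin count with the mean of its two neighbors, standardized
by the Burgstahler-Dichev variance; a minimal numpy sketch:

```python
import numpy as np

def bd_z(counts):
    """Standardized difference between each interior bin count and the
    mean of its two neighbors (Burgstahler-Dichev test statistic).
    Boundary bins have no two-sided neighborhood and stay NaN."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p = counts / N
    z = np.full(len(counts), np.nan)
    for i in range(1, len(counts) - 1):
        expected = 0.5 * (counts[i - 1] + counts[i + 1])
        var = (N * p[i] * (1 - p[i])
               + 0.25 * N * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        if var > 0:
            z[i] = (counts[i] - expected) / np.sqrt(var)
    return z
```

A smooth histogram yields |z| near zero everywhere; a sharp dip such as
[100, 100, 5, 100] produces a strongly negative z in the depleted bin,
which is the signature BD/McCrary looks for and does not find at the
accountant level.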

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1
markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00
parent 9b11f03548
commit 9d19ca5a31
13 changed files with 2915 additions and 127 deletions
@@ -0,0 +1,526 @@
#!/usr/bin/env python3
"""
Script 20: Three-Method Threshold Determination at the Accountant Level
=======================================================================
Completes the three-method convergent framework at the analysis level
where the mixture structure is statistically supported (per Script 15
dip test: accountant cos_mean p<0.001).
Runs on the per-accountant aggregates (mean best-match cosine, mean
independent minimum dHash) for 686 CPAs with >=10 signatures:
Method 1: KDE antimode with Hartigan dip test (formal unimodality test)
Method 2: Burgstahler-Dichev / McCrary discontinuity
Method 3: 2-component Beta mixture via EM + parallel logit-GMM
Also re-runs the accountant-level 2-component GMM crossings from
Script 18 for completeness and side-by-side comparison.
Output:
reports/accountant_three_methods/accountant_three_methods_report.md
reports/accountant_three_methods/accountant_three_methods_results.json
reports/accountant_three_methods/accountant_cos_panel.png
reports/accountant_three_methods/accountant_dhash_panel.png
"""
import sqlite3
import json
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.signal import find_peaks
from scipy.optimize import brentq
from sklearn.mixture import GaussianMixture
import diptest
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
'accountant_three_methods')
OUT.mkdir(parents=True, exist_ok=True)
EPS = 1e-6
Z_CRIT = 1.96
MIN_SIGS = 10
def load_accountant_means(min_sigs=MIN_SIGS):
conn = sqlite3.connect(DB)
cur = conn.cursor()
cur.execute('''
SELECT s.assigned_accountant,
AVG(s.max_similarity_to_same_accountant) AS cos_mean,
AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
COUNT(*) AS n
FROM signatures s
WHERE s.assigned_accountant IS NOT NULL
AND s.max_similarity_to_same_accountant IS NOT NULL
AND s.min_dhash_independent IS NOT NULL
GROUP BY s.assigned_accountant
HAVING n >= ?
''', (min_sigs,))
rows = cur.fetchall()
conn.close()
cos = np.array([r[1] for r in rows])
dh = np.array([r[2] for r in rows])
return cos, dh
# ---------- Method 1: KDE antimode with dip test ----------
def method_kde_antimode(values, name):
arr = np.asarray(values, dtype=float)
arr = arr[np.isfinite(arr)]
dip, pval = diptest.diptest(arr, boot_pval=True, n_boot=2000)
kde = stats.gaussian_kde(arr, bw_method='silverman')
xs = np.linspace(arr.min(), arr.max(), 2000)
density = kde(xs)
# Find modes (local maxima) and antimodes (local minima)
peaks, _ = find_peaks(density, prominence=density.max() * 0.02)
# Antimodes = local minima between peaks
antimodes = []
for i in range(len(peaks) - 1):
seg = density[peaks[i]:peaks[i + 1]]
if len(seg) == 0:
continue
local = peaks[i] + int(np.argmin(seg))
antimodes.append(float(xs[local]))
# Sensitivity analysis across bandwidth factors
sens = {}
for bwf in [0.5, 0.75, 1.0, 1.5, 2.0]:
kde_s = stats.gaussian_kde(arr, bw_method=kde.factor * bwf)
d_s = kde_s(xs)
p_s, _ = find_peaks(d_s, prominence=d_s.max() * 0.02)
sens[f'bw_x{bwf}'] = int(len(p_s))
return {
'name': name,
'n': int(len(arr)),
'dip': float(dip),
'dip_pvalue': float(pval),
'unimodal_alpha05': bool(pval > 0.05),
'kde_bandwidth_silverman': float(kde.factor),
'n_modes': int(len(peaks)),
'mode_locations': [float(xs[p]) for p in peaks],
'antimodes': antimodes,
'primary_antimode': (antimodes[0] if antimodes else None),
'bandwidth_sensitivity_n_modes': sens,
}
# ---------- Method 2: Burgstahler-Dichev / McCrary ----------
def method_bd_mccrary(values, bin_width, direction, name):
arr = np.asarray(values, dtype=float)
arr = arr[np.isfinite(arr)]
lo = float(np.floor(arr.min() / bin_width) * bin_width)
hi = float(np.ceil(arr.max() / bin_width) * bin_width)
edges = np.arange(lo, hi + bin_width, bin_width)
counts, _ = np.histogram(arr, bins=edges)
centers = (edges[:-1] + edges[1:]) / 2.0
N = counts.sum()
p = counts / N if N else counts.astype(float)
n_bins = len(counts)
z = np.full(n_bins, np.nan)
expected = np.full(n_bins, np.nan)
for i in range(1, n_bins - 1):
p_lo = p[i - 1]
p_hi = p[i + 1]
exp_i = 0.5 * (counts[i - 1] + counts[i + 1])
var_i = (N * p[i] * (1 - p[i])
+ 0.25 * N * (p_lo + p_hi) * (1 - p_lo - p_hi))
if var_i <= 0:
continue
z[i] = (counts[i] - exp_i) / np.sqrt(var_i)
expected[i] = exp_i
# Identify transitions
transitions = []
for i in range(1, len(z)):
if np.isnan(z[i - 1]) or np.isnan(z[i]):
continue
ok = False
if direction == 'neg_to_pos' and z[i - 1] < -Z_CRIT and z[i] > Z_CRIT:
ok = True
elif direction == 'pos_to_neg' and z[i - 1] > Z_CRIT and z[i] < -Z_CRIT:
ok = True
if ok:
transitions.append({
'threshold_between': float((centers[i - 1] + centers[i]) / 2.0),
'z_before': float(z[i - 1]),
'z_after': float(z[i]),
})
best = None
if transitions:
best = max(transitions,
key=lambda t: abs(t['z_before']) + abs(t['z_after']))
return {
'name': name,
'n': int(len(arr)),
'bin_width': float(bin_width),
'direction': direction,
'n_transitions': len(transitions),
'transitions': transitions,
'best_transition': best,
'threshold': (best['threshold_between'] if best else None),
'bin_centers': [float(c) for c in centers],
'counts': [int(c) for c in counts],
'z_scores': [None if np.isnan(zi) else float(zi) for zi in z],
}
# ---------- Method 3: Beta mixture + logit-GMM ----------
def fit_beta_mixture_em(x, K=2, max_iter=300, tol=1e-6, seed=42):
rng = np.random.default_rng(seed)
x = np.clip(np.asarray(x, dtype=float), EPS, 1 - EPS)
n = len(x)
q = np.linspace(0, 1, K + 1)
thresh = np.quantile(x, q[1:-1])
labels = np.digitize(x, thresh)
resp = np.zeros((n, K))
resp[np.arange(n), labels] = 1.0
ll_hist = []
for it in range(max_iter):
nk = resp.sum(axis=0) + 1e-12
weights = nk / nk.sum()
mus = (resp * x[:, None]).sum(axis=0) / nk
var_num = (resp * (x[:, None] - mus) ** 2).sum(axis=0)
vars_ = var_num / nk
upper = mus * (1 - mus) - 1e-9
vars_ = np.minimum(vars_, upper)
vars_ = np.maximum(vars_, 1e-9)
factor = np.maximum(mus * (1 - mus) / vars_ - 1, 1e-6)
alphas = mus * factor
betas = (1 - mus) * factor
log_pdfs = np.column_stack([
stats.beta.logpdf(x, alphas[k], betas[k]) + np.log(weights[k])
for k in range(K)
])
m = log_pdfs.max(axis=1, keepdims=True)
ll = (m.ravel() + np.log(np.exp(log_pdfs - m).sum(axis=1))).sum()
ll_hist.append(float(ll))
new_resp = np.exp(log_pdfs - m)
new_resp = new_resp / new_resp.sum(axis=1, keepdims=True)
if it > 0 and abs(ll_hist[-1] - ll_hist[-2]) < tol:
resp = new_resp
break
resp = new_resp
order = np.argsort(mus)
alphas, betas, weights, mus = alphas[order], betas[order], weights[order], mus[order]
k_params = 3 * K - 1
ll_final = ll_hist[-1]
return {
'K': K,
'alphas': [float(a) for a in alphas],
'betas': [float(b) for b in betas],
'weights': [float(w) for w in weights],
'mus': [float(m) for m in mus],
'log_likelihood': float(ll_final),
'aic': float(2 * k_params - 2 * ll_final),
'bic': float(k_params * np.log(n) - 2 * ll_final),
'n_iter': it + 1,
}
def beta_crossing(fit):
if fit['K'] != 2:
return None
a1, b1, w1 = fit['alphas'][0], fit['betas'][0], fit['weights'][0]
a2, b2, w2 = fit['alphas'][1], fit['betas'][1], fit['weights'][1]
def diff(x):
return (w2 * stats.beta.pdf(x, a2, b2)
- w1 * stats.beta.pdf(x, a1, b1))
xs = np.linspace(EPS, 1 - EPS, 2000)
ys = diff(xs)
changes = np.where(np.diff(np.sign(ys)) != 0)[0]
if not len(changes):
return None
mid = 0.5 * (fit['mus'][0] + fit['mus'][1])
crossings = []
for i in changes:
try:
crossings.append(brentq(diff, xs[i], xs[i + 1]))
except ValueError:
continue
if not crossings:
return None
return float(min(crossings, key=lambda c: abs(c - mid)))
def fit_logit_gmm(x, K=2, seed=42):
x = np.clip(np.asarray(x, dtype=float), EPS, 1 - EPS)
z = np.log(x / (1 - x)).reshape(-1, 1)
gmm = GaussianMixture(n_components=K, random_state=seed,
max_iter=500).fit(z)
order = np.argsort(gmm.means_.ravel())
means = gmm.means_.ravel()[order]
stds = np.sqrt(gmm.covariances_.ravel())[order]
weights = gmm.weights_[order]
crossing = None
if K == 2:
m1, s1, w1 = means[0], stds[0], weights[0]
m2, s2, w2 = means[1], stds[1], weights[1]
def diff(z0):
return (w2 * stats.norm.pdf(z0, m2, s2)
- w1 * stats.norm.pdf(z0, m1, s1))
zs = np.linspace(min(m1, m2) - 2, max(m1, m2) + 2, 2000)
ys = diff(zs)
ch = np.where(np.diff(np.sign(ys)) != 0)[0]
if len(ch):
try:
z_cross = brentq(diff, zs[ch[0]], zs[ch[0] + 1])
crossing = float(1 / (1 + np.exp(-z_cross)))
except ValueError:
pass
return {
'K': K,
'means_logit': [float(m) for m in means],
'stds_logit': [float(s) for s in stds],
'weights': [float(w) for w in weights],
'means_original_scale': [float(1 / (1 + np.exp(-m))) for m in means],
'aic': float(gmm.aic(z)),
'bic': float(gmm.bic(z)),
'crossing_original': crossing,
}
def method_beta_mixture(values, name, is_cosine=True):
arr = np.asarray(values, dtype=float)
arr = arr[np.isfinite(arr)]
if not is_cosine:
# normalize dHash into [0,1] by dividing by 64 (max Hamming)
x = arr / 64.0
else:
x = arr
beta2 = fit_beta_mixture_em(x, K=2)
beta3 = fit_beta_mixture_em(x, K=3)
cross_beta2 = beta_crossing(beta2)
# Transform back to original scale for dHash
if not is_cosine and cross_beta2 is not None:
cross_beta2 = cross_beta2 * 64.0
gmm2 = fit_logit_gmm(x, K=2)
gmm3 = fit_logit_gmm(x, K=3)
if not is_cosine and gmm2.get('crossing_original') is not None:
gmm2['crossing_original'] = gmm2['crossing_original'] * 64.0
return {
'name': name,
'n': int(len(x)),
'scale_transform': ('identity' if is_cosine else 'dhash/64'),
'beta_2': beta2,
'beta_3': beta3,
'bic_preferred_K': (2 if beta2['bic'] < beta3['bic'] else 3),
'beta_2_crossing_original': cross_beta2,
'logit_gmm_2': gmm2,
'logit_gmm_3': gmm3,
}
# ---------- Plot helpers ----------
def plot_panel(values, methods, title, out_path, bin_width=None,
is_cosine=True):
arr = np.asarray(values, dtype=float)
fig, axes = plt.subplots(2, 1, figsize=(11, 7),
gridspec_kw={'height_ratios': [3, 1]})
ax = axes[0]
if bin_width is None:
bins = 40
else:
lo = float(np.floor(arr.min() / bin_width) * bin_width)
hi = float(np.ceil(arr.max() / bin_width) * bin_width)
bins = np.arange(lo, hi + bin_width, bin_width)
ax.hist(arr, bins=bins, density=True, alpha=0.55, color='steelblue',
edgecolor='white')
# KDE overlay
kde = stats.gaussian_kde(arr, bw_method='silverman')
xs = np.linspace(arr.min(), arr.max(), 500)
ax.plot(xs, kde(xs), 'g-', lw=1.5, label='KDE (Silverman)')
# Annotate thresholds from each method
colors = {'kde': 'green', 'bd': 'red', 'beta': 'purple', 'gmm2': 'orange'}
for key, (val, lbl) in methods.items():
if val is None:
continue
ax.axvline(val, color=colors.get(key, 'gray'), lw=2, ls='--',
label=f'{lbl} = {val:.4f}')
ax.set_xlabel(title + ' value')
ax.set_ylabel('Density')
ax.set_title(title)
ax.legend(fontsize=8)
ax2 = axes[1]
ax2.set_title('Thresholds across methods')
ax2.set_xlim(ax.get_xlim())
for i, (key, (val, lbl)) in enumerate(methods.items()):
if val is None:
continue
ax2.scatter([val], [i], color=colors.get(key, 'gray'), s=100, zorder=3)
ax2.annotate(f' {lbl}: {val:.4f}', (val, i), fontsize=8,
va='center')
ax2.set_yticks(range(len(methods)))
ax2.set_yticklabels([m for m in methods.keys()])
ax2.set_xlabel(title + ' value')
ax2.grid(alpha=0.3)
plt.tight_layout()
fig.savefig(out_path, dpi=150)
plt.close(fig)
# ---------- GMM 2-comp crossing from Script 18 ----------
def marginal_2comp_crossing(X, dim):
gmm = GaussianMixture(n_components=2, covariance_type='full',
random_state=42, n_init=15, max_iter=500).fit(X)
means = gmm.means_
covs = gmm.covariances_
weights = gmm.weights_
m1, m2 = means[0][dim], means[1][dim]
s1 = np.sqrt(covs[0][dim, dim])
s2 = np.sqrt(covs[1][dim, dim])
w1, w2 = weights[0], weights[1]
def diff(x):
return (w2 * stats.norm.pdf(x, m2, s2)
- w1 * stats.norm.pdf(x, m1, s1))
xs = np.linspace(float(X[:, dim].min()), float(X[:, dim].max()), 2000)
ys = diff(xs)
ch = np.where(np.diff(np.sign(ys)) != 0)[0]
if not len(ch):
return None
mid = 0.5 * (m1 + m2)
crossings = []
for i in ch:
try:
crossings.append(brentq(diff, xs[i], xs[i + 1]))
except ValueError:
continue
if not crossings:
return None
return float(min(crossings, key=lambda c: abs(c - mid)))
def main():
print('=' * 70)
print('Script 20: Three-Method Threshold at Accountant Level')
print('=' * 70)
cos, dh = load_accountant_means()
print(f'\nN accountants (>={MIN_SIGS} sigs) = {len(cos)}')
results = {}
for desc, arr, bin_width, direction, is_cosine in [
('cos_mean', cos, 0.002, 'neg_to_pos', True),
('dh_mean', dh, 0.2, 'pos_to_neg', False),
]:
print(f'\n[{desc}]')
m1 = method_kde_antimode(arr, f'{desc} KDE')
print(f' Method 1 (KDE + dip): dip={m1["dip"]:.4f} '
f'p={m1["dip_pvalue"]:.4f} '
f'n_modes={m1["n_modes"]} '
f'antimode={m1["primary_antimode"]}')
m2 = method_bd_mccrary(arr, bin_width, direction, f'{desc} BD')
print(f' Method 2 (BD/McCrary): {m2["n_transitions"]} transitions, '
f'threshold={m2["threshold"]}')
m3 = method_beta_mixture(arr, f'{desc} Beta', is_cosine=is_cosine)
print(f' Method 3 (Beta mixture): BIC-preferred K={m3["bic_preferred_K"]}, '
f'Beta-2 crossing={m3["beta_2_crossing_original"]}, '
f'LogGMM-2 crossing={m3["logit_gmm_2"].get("crossing_original")}')
# GMM 2-comp crossing (for completeness / reproduce Script 18)
X = np.column_stack([cos, dh])
dim = 0 if desc == 'cos_mean' else 1
gmm2_crossing = marginal_2comp_crossing(X, dim)
print(f' (Script 18 2-comp GMM marginal crossing = {gmm2_crossing})')
results[desc] = {
'method_1_kde_antimode': m1,
'method_2_bd_mccrary': m2,
'method_3_beta_mixture': m3,
'script_18_gmm_2comp_crossing': gmm2_crossing,
}
methods_for_plot = {
'kde': (m1.get('primary_antimode'), 'KDE antimode'),
'bd': (m2.get('threshold'), 'BD/McCrary'),
'beta': (m3.get('beta_2_crossing_original'), 'Beta-2 crossing'),
'gmm2': (gmm2_crossing, 'GMM 2-comp crossing'),
}
png = OUT / f'accountant_{desc}_panel.png'
plot_panel(arr, methods_for_plot,
f'Accountant-level {desc}: three-method thresholds',
png, bin_width=bin_width, is_cosine=is_cosine)
print(f' plot: {png}')
# Write JSON
with open(OUT / 'accountant_three_methods_results.json', 'w') as f:
json.dump({'generated_at': datetime.now().isoformat(),
'n_accountants': int(len(cos)),
'min_signatures': MIN_SIGS,
'results': results}, f, indent=2, ensure_ascii=False)
print(f'\nJSON: {OUT / "accountant_three_methods_results.json"}')
# Markdown
md = [
'# Accountant-Level Three-Method Threshold Report',
f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
'',
f'N accountants (>={MIN_SIGS} signatures): {len(cos)}',
'',
'## Accountant-level cosine mean',
'',
'| Method | Threshold | Supporting statistic |',
'|--------|-----------|----------------------|',
]
r = results['cos_mean']
md.append(f"| Method 1: KDE antimode (with dip test) | "
f"{r['method_1_kde_antimode']['primary_antimode']} | "
f"dip={r['method_1_kde_antimode']['dip']:.4f}, "
f"p={r['method_1_kde_antimode']['dip_pvalue']:.4f} "
f"({'unimodal' if r['method_1_kde_antimode']['unimodal_alpha05'] else 'multimodal'}) |")
md.append(f"| Method 2: Burgstahler-Dichev / McCrary | "
f"{r['method_2_bd_mccrary']['threshold']} | "
f"{r['method_2_bd_mccrary']['n_transitions']} transition(s) "
f"at α=0.05 |")
md.append(f"| Method 3: 2-component Beta mixture | "
f"{r['method_3_beta_mixture']['beta_2_crossing_original']} | "
f"Beta-2 BIC={r['method_3_beta_mixture']['beta_2']['bic']:.2f}, "
f"Beta-3 BIC={r['method_3_beta_mixture']['beta_3']['bic']:.2f} "
f"(BIC-preferred K={r['method_3_beta_mixture']['bic_preferred_K']}) |")
md.append(f"| Method 3': LogGMM-2 on logit-transformed | "
f"{r['method_3_beta_mixture']['logit_gmm_2'].get('crossing_original')} | "
f"White 1982 quasi-MLE robustness check |")
md.append(f"| Script 18 GMM 2-comp marginal crossing | "
f"{r['script_18_gmm_2comp_crossing']} | full 2D mixture |")
md += ['', '## Accountant-level dHash mean', '',
'| Method | Threshold | Supporting statistic |',
'|--------|-----------|----------------------|']
r = results['dh_mean']
md.append(f"| Method 1: KDE antimode | "
f"{r['method_1_kde_antimode']['primary_antimode']} | "
f"dip={r['method_1_kde_antimode']['dip']:.4f}, "
f"p={r['method_1_kde_antimode']['dip_pvalue']:.4f} |")
md.append(f"| Method 2: BD/McCrary | "
f"{r['method_2_bd_mccrary']['threshold']} | "
f"{r['method_2_bd_mccrary']['n_transitions']} transition(s) |")
md.append(f"| Method 3: 2-component Beta mixture | "
f"{r['method_3_beta_mixture']['beta_2_crossing_original']} | "
f"Beta-2 BIC={r['method_3_beta_mixture']['beta_2']['bic']:.2f}, "
f"Beta-3 BIC={r['method_3_beta_mixture']['beta_3']['bic']:.2f} |")
md.append(f"| Method 3': LogGMM-2 | "
f"{r['method_3_beta_mixture']['logit_gmm_2'].get('crossing_original')} | |")
md.append(f"| Script 18 GMM 2-comp crossing | "
f"{r['script_18_gmm_2comp_crossing']} | |")
(OUT / 'accountant_three_methods_report.md').write_text('\n'.join(md),
encoding='utf-8')
print(f'Report: {OUT / "accountant_three_methods_report.md"}')
if __name__ == '__main__':
main()
@@ -0,0 +1,421 @@
#!/usr/bin/env python3
"""
Script 21: Expanded Validation with Larger Negative Anchor + Held-out Firm A
============================================================================
Addresses codex review weaknesses of Script 19's pixel-identity validation:
(a) Negative anchor of n=35 (cosine<0.70) is too small to give
meaningful FAR confidence intervals.
(b) Pixel-identical positive anchor is an easy subset, not
representative of the broader positive class.
(c) Firm A is both the calibration anchor and the validation anchor
(circular).
This script:
1. Constructs a large inter-CPA negative anchor (~50,000 pairs) by
randomly sampling pairs from different CPAs. Inter-CPA high
similarity is highly unlikely to arise from legitimate signing.
2. Splits Firm A CPAs 70/30 into CALIBRATION and HELDOUT folds.
Re-derives signature-level / accountant-level thresholds from the
calibration fold only, then reports all metrics (including Firm A
anchor rates) on the heldout fold.
3. Computes proper EER (FAR = FRR interpolated) in addition to
metrics at canonical thresholds.
4. Computes 95% Wilson confidence intervals for each FAR/FRR.
Output:
reports/expanded_validation/expanded_validation_report.md
reports/expanded_validation/expanded_validation_results.json
"""
import sqlite3
import json
import numpy as np
from pathlib import Path
from datetime import datetime
from scipy.stats import norm
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
'expanded_validation')
OUT.mkdir(parents=True, exist_ok=True)
FIRM_A = '勤業眾信聯合'
N_INTER_PAIRS = 50_000
SEED = 42
def wilson_ci(k, n, alpha=0.05):
if n == 0:
return (0.0, 1.0)
z = norm.ppf(1 - alpha / 2)
phat = k / n
denom = 1 + z * z / n
center = (phat + z * z / (2 * n)) / denom
pm = z * np.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
return (max(0.0, center - pm), min(1.0, center + pm))
def load_signatures():
conn = sqlite3.connect(DB)
cur = conn.cursor()
cur.execute('''
SELECT s.signature_id, s.assigned_accountant, a.firm,
s.max_similarity_to_same_accountant,
s.min_dhash_independent, s.pixel_identical_to_closest
FROM signatures s
LEFT JOIN accountants a ON s.assigned_accountant = a.name
WHERE s.max_similarity_to_same_accountant IS NOT NULL
''')
rows = cur.fetchall()
conn.close()
return rows
def load_feature_vectors_sample(n=2000):
"""Load feature vectors for inter-CPA negative-anchor sampling."""
conn = sqlite3.connect(DB)
cur = conn.cursor()
cur.execute('''
SELECT signature_id, assigned_accountant, feature_vector
FROM signatures
WHERE feature_vector IS NOT NULL
AND assigned_accountant IS NOT NULL
ORDER BY RANDOM()
LIMIT ?
''', (n,))
rows = cur.fetchall()
conn.close()
out = []
for r in rows:
vec = np.frombuffer(r[2], dtype=np.float32)
out.append({'sig_id': r[0], 'accountant': r[1], 'feature': vec})
return out
def build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS, seed=SEED):
"""Sample random cross-CPA pairs; return their cosine similarities."""
rng = np.random.default_rng(seed)
n = len(sample)
feats = np.stack([s['feature'] for s in sample])
# Normalize rows so the dot product below is a true cosine similarity
# (stored feature vectors are not guaranteed to be unit-norm).
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
accts = np.array([s['accountant'] for s in sample])
sims = []
tries = 0
while len(sims) < n_pairs and tries < n_pairs * 10:
i = rng.integers(n)
j = rng.integers(n)
if i == j or accts[i] == accts[j]:
tries += 1
continue
sim = float(feats[i] @ feats[j])
sims.append(sim)
tries += 1
return np.array(sims)
def classification_metrics(y_true, y_pred):
y_true = np.asarray(y_true).astype(int)
y_pred = np.asarray(y_pred).astype(int)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
p_den = max(tp + fp, 1)
r_den = max(tp + fn, 1)
far_den = max(fp + tn, 1)
frr_den = max(fn + tp, 1)
precision = tp / p_den
recall = tp / r_den
f1 = (2 * precision * recall / (precision + recall)
if (precision + recall) > 0 else 0.0)
far = fp / far_den
frr = fn / frr_den
far_ci = wilson_ci(fp, far_den)
frr_ci = wilson_ci(fn, frr_den)
return {
'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
'precision': float(precision),
'recall': float(recall),
'f1': float(f1),
'far': float(far),
'frr': float(frr),
'far_ci95': [float(x) for x in far_ci],
'frr_ci95': [float(x) for x in frr_ci],
'n_pos': int(tp + fn),
'n_neg': int(tn + fp),
}
def sweep_threshold(scores, y, direction, thresholds):
out = []
for t in thresholds:
if direction == 'above':
y_pred = (scores > t).astype(int)
else:
y_pred = (scores < t).astype(int)
m = classification_metrics(y, y_pred)
m['threshold'] = float(t)
out.append(m)
return out
def find_eer(sweep):
thr = np.array([s['threshold'] for s in sweep])
far = np.array([s['far'] for s in sweep])
frr = np.array([s['frr'] for s in sweep])
diff = far - frr
signs = np.sign(diff)
changes = np.where(np.diff(signs) != 0)[0]
if len(changes) == 0:
idx = int(np.argmin(np.abs(diff)))
return {'threshold': float(thr[idx]), 'far': float(far[idx]),
'frr': float(frr[idx]),
'eer': float(0.5 * (far[idx] + frr[idx]))}
i = int(changes[0])
w = abs(diff[i]) / (abs(diff[i]) + abs(diff[i + 1]) + 1e-12)
thr_i = (1 - w) * thr[i] + w * thr[i + 1]
far_i = (1 - w) * far[i] + w * far[i + 1]
frr_i = (1 - w) * frr[i] + w * frr[i + 1]
return {'threshold': float(thr_i), 'far': float(far_i),
'frr': float(frr_i),
'eer': float(0.5 * (far_i + frr_i))}
def main():
print('=' * 70)
print('Script 21: Expanded Validation')
print('=' * 70)
rows = load_signatures()
print(f'\nLoaded {len(rows):,} signatures')
sig_ids = [r[0] for r in rows]
accts = [r[1] for r in rows]
firms = [r[2] or '(unknown)' for r in rows]
cos = np.array([r[3] for r in rows], dtype=float)
dh = np.array([-1 if r[4] is None else r[4] for r in rows], dtype=float)
pix = np.array([r[5] or 0 for r in rows], dtype=int)
firm_a_mask = np.array([f == FIRM_A for f in firms])
print(f'Firm A signatures: {int(firm_a_mask.sum()):,}')
# --- (1) INTER-CPA NEGATIVE ANCHOR ---
print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} pairs)...')
sample = load_feature_vectors_sample(n=3000)
inter_cos = build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS)
print(f' inter-CPA cos: mean={inter_cos.mean():.4f}, '
f'p95={np.percentile(inter_cos, 95):.4f}, '
f'p99={np.percentile(inter_cos, 99):.4f}, '
f'max={inter_cos.max():.4f}')
# --- (2) POSITIVES ---
# Pixel-identical (gold) + optional Firm A extension
pos_pix_mask = pix == 1
n_pix = int(pos_pix_mask.sum())
print(f'\n[2] Positive anchors:')
print(f' pixel-identical signatures: {n_pix}')
# Build negative anchor scores = inter-CPA cosine distribution
# Positive anchor scores = pixel-identical signatures' max same-CPA cosine
# NB: the two distributions are not drawn from the same random variable
# (one is intra-CPA max, the other is inter-CPA random), so we treat the
# inter-CPA distribution as a negative reference for threshold sweep.
# Combined labeled set: positives=pixel-identical sigs' max cosine,
# negatives=inter-CPA random pair cosines.
pos_scores = cos[pos_pix_mask]
neg_scores = inter_cos
y = np.concatenate([np.ones(len(pos_scores)),
np.zeros(len(neg_scores))])
scores = np.concatenate([pos_scores, neg_scores])
# Sweep thresholds
thr = np.linspace(0.30, 1.00, 141)
sweep = sweep_threshold(scores, y, 'above', thr)
eer = find_eer(sweep)
print(f'\n[3] Cosine EER (pos=pixel-identical, neg=inter-CPA n={len(inter_cos)}):')
print(f" threshold={eer['threshold']:.4f}, EER={eer['eer']:.4f}")
# Canonical threshold evaluations with Wilson CIs
canonical = {}
for tt in [0.70, 0.80, 0.837, 0.90, 0.945, 0.95, 0.973, 0.979]:
y_pred = (scores > tt).astype(int)
m = classification_metrics(y, y_pred)
m['threshold'] = float(tt)
canonical[f'cos>{tt:.3f}'] = m
print(f" @ {tt:.3f}: P={m['precision']:.3f}, R={m['recall']:.3f}, "
f"FAR={m['far']:.4f} (CI95={m['far_ci95'][0]:.4f}-"
f"{m['far_ci95'][1]:.4f}), FRR={m['frr']:.4f}")
# --- (3) HELD-OUT FIRM A ---
print('\n[4] Held-out Firm A 70/30 split:')
rng = np.random.default_rng(SEED)
firm_a_accts = sorted(set(a for a, f in zip(accts, firms) if f == FIRM_A))
rng.shuffle(firm_a_accts)
n_calib = int(0.7 * len(firm_a_accts))
calib_accts = set(firm_a_accts[:n_calib])
heldout_accts = set(firm_a_accts[n_calib:])
print(f' Calibration fold CPAs: {len(calib_accts)}, '
f'heldout fold CPAs: {len(heldout_accts)}')
calib_mask = np.array([a in calib_accts for a in accts])
heldout_mask = np.array([a in heldout_accts for a in accts])
print(f' Calibration sigs: {int(calib_mask.sum())}, '
f'heldout sigs: {int(heldout_mask.sum())}')
# Derive per-signature thresholds from calibration fold:
# - Firm A cos median, 1st-pct, 5th-pct
# - Firm A dHash median, 95th-pct
calib_cos = cos[calib_mask]
calib_dh = dh[calib_mask]
calib_dh = calib_dh[calib_dh >= 0]
cal_cos_med = float(np.median(calib_cos))
cal_cos_p1 = float(np.percentile(calib_cos, 1))
cal_cos_p5 = float(np.percentile(calib_cos, 5))
cal_dh_med = float(np.median(calib_dh))
cal_dh_p95 = float(np.percentile(calib_dh, 95))
print(f' Calib Firm A cos: median={cal_cos_med:.4f}, P1={cal_cos_p1:.4f}, P5={cal_cos_p5:.4f}')
print(f' Calib Firm A dHash: median={cal_dh_med:.2f}, P95={cal_dh_p95:.2f}')
# Apply canonical rules to heldout fold
held_cos = cos[heldout_mask]
held_dh = dh[heldout_mask]
held_dh_valid = held_dh >= 0
held_rates = {}
for tt in [0.837, 0.945, 0.95, cal_cos_p5]:
rate = float(np.mean(held_cos > tt))
k = int(np.sum(held_cos > tt))
lo, hi = wilson_ci(k, len(held_cos))
held_rates[f'cos>{tt:.4f}'] = {
'rate': rate, 'k': k, 'n': int(len(held_cos)),
'wilson95': [float(lo), float(hi)],
}
for tt in [5, 8, 15, cal_dh_p95]:
rate = float(np.mean(held_dh[held_dh_valid] <= tt))
k = int(np.sum(held_dh[held_dh_valid] <= tt))
lo, hi = wilson_ci(k, int(held_dh_valid.sum()))
held_rates[f'dh_indep<={tt:.2f}'] = {
'rate': rate, 'k': k, 'n': int(held_dh_valid.sum()),
'wilson95': [float(lo), float(hi)],
}
# Dual rule
dual_mask = (held_cos > 0.95) & (held_dh >= 0) & (held_dh <= 8)
rate = float(np.mean(dual_mask))
k = int(dual_mask.sum())
lo, hi = wilson_ci(k, len(dual_mask))
held_rates['cos>0.95 AND dh<=8'] = {
'rate': rate, 'k': k, 'n': int(len(dual_mask)),
'wilson95': [float(lo), float(hi)],
}
print(' Heldout Firm A rates:')
for k, v in held_rates.items():
print(f' {k}: {v["rate"]*100:.2f}% '
f'[{v["wilson95"][0]*100:.2f}, {v["wilson95"][1]*100:.2f}]')
# --- Save ---
summary = {
'generated_at': datetime.now().isoformat(),
'n_signatures': len(rows),
'n_firm_a': int(firm_a_mask.sum()),
'n_pixel_identical': n_pix,
'n_inter_cpa_negatives': len(inter_cos),
'inter_cpa_cos_stats': {
'mean': float(inter_cos.mean()),
'p95': float(np.percentile(inter_cos, 95)),
'p99': float(np.percentile(inter_cos, 99)),
'max': float(inter_cos.max()),
},
'cosine_eer': eer,
'canonical_thresholds': canonical,
'held_out_firm_a': {
'calibration_cpas': len(calib_accts),
'heldout_cpas': len(heldout_accts),
'calibration_sig_count': int(calib_mask.sum()),
'heldout_sig_count': int(heldout_mask.sum()),
'calib_cos_median': cal_cos_med,
'calib_cos_p1': cal_cos_p1,
'calib_cos_p5': cal_cos_p5,
'calib_dh_median': cal_dh_med,
'calib_dh_p95': cal_dh_p95,
'heldout_rates': held_rates,
},
}
with open(OUT / 'expanded_validation_results.json', 'w') as f:
json.dump(summary, f, indent=2, ensure_ascii=False)
print(f'\nJSON: {OUT / "expanded_validation_results.json"}')
# Markdown
md = [
'# Expanded Validation Report',
f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
'',
'## 1. Inter-CPA Negative Anchor',
'',
f'* N random cross-CPA pairs sampled: {len(inter_cos):,}',
f'* Inter-CPA cosine: mean={inter_cos.mean():.4f}, '
f'P95={np.percentile(inter_cos, 95):.4f}, '
f'P99={np.percentile(inter_cos, 99):.4f}, max={inter_cos.max():.4f}',
'',
'This anchor is a meaningful negative set because inter-CPA pairs',
'cannot arise from legitimate reuse of a single signer\'s image.',
'',
'## 2. Cosine Threshold Sweep (pos=pixel-identical, neg=inter-CPA)',
'',
f"EER threshold: {eer['threshold']:.4f}, EER: {eer['eer']:.4f}",
'',
'| Threshold | Precision | Recall | F1 | FAR | FAR 95% CI | FRR |',
'|-----------|-----------|--------|----|-----|------------|-----|',
]
for k, m in canonical.items():
md.append(
f"| {m['threshold']:.3f} | {m['precision']:.3f} | "
f"{m['recall']:.3f} | {m['f1']:.3f} | {m['far']:.4f} | "
f"[{m['far_ci95'][0]:.4f}, {m['far_ci95'][1]:.4f}] | "
f"{m['frr']:.4f} |"
)
md += [
'',
'## 3. Held-out Firm A 70/30 Validation',
'',
f'* Firm A CPAs randomly split by CPA (not by signature) into',
f' calibration (n={len(calib_accts)}) and heldout (n={len(heldout_accts)}).',
f'* Calibration Firm A signatures: {int(calib_mask.sum()):,}. '
f'Heldout signatures: {int(heldout_mask.sum()):,}.',
'',
'### Calibration-fold anchor statistics (for thresholds)',
'',
f'* Firm A cosine: median = {cal_cos_med:.4f}, '
f'P1 = {cal_cos_p1:.4f}, P5 = {cal_cos_p5:.4f}',
f'* Firm A dHash (independent min): median = {cal_dh_med:.2f}, '
f'P95 = {cal_dh_p95:.2f}',
'',
'### Heldout-fold capture rates (with Wilson 95% CIs)',
'',
'| Rule | Heldout rate | Wilson 95% CI | k / n |',
'|------|--------------|---------------|-------|',
]
for k, v in held_rates.items():
md.append(
f"| {k} | {v['rate']*100:.2f}% | "
f"[{v['wilson95'][0]*100:.2f}%, {v['wilson95'][1]*100:.2f}%] | "
f"{v['k']}/{v['n']} |"
)
md += [
'',
'## Interpretation',
'',
'The inter-CPA negative anchor (N ~50,000) gives tight confidence',
'intervals on FAR at each threshold, addressing the small-negative',
'anchor limitation of Script 19 (n=35).',
'',
'The 70/30 Firm A split breaks the circular-validation concern of',
'using the same calibration anchor for threshold derivation and',
'validation. Calibration-fold percentiles derive the thresholds;',
'heldout-fold rates with Wilson 95% CIs show how those thresholds',
'generalize to Firm A CPAs that did not contribute to calibration.',
]
(OUT / 'expanded_validation_report.md').write_text('\n'.join(md),
encoding='utf-8')
print(f'Report: {OUT / "expanded_validation_report.md"}')
if __name__ == '__main__':
main()