Add three-convergent-method threshold scripts + pixel-identity validation
Implements Partner v3's statistical rigor requirements at both the signature-level and accountant-level units of analysis:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`. Result: Firm A cosine UNIMODAL (p=0.17, a pure non-hand-signed population); full-sample cosine MULTIMODAL (p<0.001, a mix of two regimes); accountant-level aggregates MULTIMODAL on both cosine and dHash.
- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition detection. Firm A and full-sample cosine transitions at 0.985; dHash at 2.0.
- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta mixture via EM with a method-of-moments M-step, plus a parallel Gaussian mixture on the logit transform as a White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at the signature level confirms the 2-component model is a forced fit, supporting the pivot to an accountant-level mixture.
- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis that was done inline and not saved. BIC-best K=3 with components matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%, Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928, 11.17, 28%, small firms). 2-component natural thresholds: cos=0.9450, dh=8.10.
- Script 19 (Pixel-identity validation): no human annotation needed. Uses pixel_identical_to_closest (310 sigs) as the gold positive and Firm A as an anchor positive. Confirms Firm A cosine>0.95 = 92.51% (matches the prior 2026-04-08 finding of 92.5%); the dual rule cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.

Python deps added: diptest, scikit-learn (installed into the venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
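The Script 18 step above (BIC-best K for a Gaussian mixture over per-accountant cosine/dHash aggregates) follows the standard BIC-over-K recipe; here is a minimal sketch using scikit-learn on synthetic data. The cluster centres and sizes are invented stand-ins that only loosely echo the C1/C2/C3 pattern, not the study's data:

```python
# Sketch: pick the number of mixture components by minimum BIC,
# as Script 18 does for (mean cosine, mean dHash) per accountant.
# All values below are synthetic and illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.98, 2.0], [0.010, 0.6], size=(120, 2)),   # tight, low dHash
    rng.normal([0.95, 7.0], [0.015, 1.0], size=(300, 2)),   # middle regime
    rng.normal([0.93, 13.0], [0.020, 1.5], size=(180, 2)),  # diffuse, high dHash
])

bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bics[k] = gm.bic(X)           # lower BIC = better fit/complexity trade-off
best_k = min(bics, key=bics.get)
print('BIC-best K:', best_k)
```

With well-separated synthetic clusters like these, BIC recovers the generating K=3; on real aggregates the K=2 vs K=3 comparison is exactly the "forced fit" question the commit describes.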
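Script 17 itself is not part of this diff; as a rough illustration of its EM-with-method-of-moments M-step (data, component count, and the quantile-slice initialisation here are invented for the sketch), each component's responsibility-weighted mean m and variance v are converted back to Beta parameters via alpha = m*c, beta = (1-m)*c with c = m(1-m)/v - 1:

```python
# Sketch: K-component Beta mixture fitted by EM with a method-of-moments
# M-step. Synthetic bimodal data; not the study's similarity scores.
import numpy as np
from scipy.stats import beta as beta_dist

def mom_beta(x):
    # Method-of-moments Beta fit from mean/variance, floored for stability.
    m, v = x.mean(), x.var()
    c = m * (1 - m) / max(v, 1e-9) - 1
    return max(m * c, 1e-3), max((1 - m) * c, 1e-3)

def fit_beta_mixture(x, k=2, n_iter=200):
    # Init: MoM fit on k equal-size quantile slices of the sorted data.
    a = np.empty(k)
    b = np.empty(k)
    for j, s in enumerate(np.array_split(np.sort(x), k)):
        a[j], b[j] = mom_beta(s)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = np.stack([pi[j] * beta_dist.pdf(x, a[j], b[j]) for j in range(k)])
        dens = np.clip(dens, 1e-300, None)
        r = dens / dens.sum(axis=0)
        # M-step: weighted moments -> (alpha, beta) per component.
        for j in range(k):
            w = r[j] / max(r[j].sum(), 1e-12)
            m = np.sum(w * x)
            v = np.sum(w * (x - m) ** 2)
            c = m * (1 - m) / max(v, 1e-9) - 1
            a[j] = max(m * c, 1e-3)
            b[j] = max((1 - m) * c, 1e-3)
        pi = r.mean(axis=1)
    return pi, a, b

rng = np.random.default_rng(1)
x = np.concatenate([rng.beta(2, 20, 400), rng.beta(20, 2, 600)])
pi, a, b = fit_beta_mixture(x, k=2)
means = np.sort(a / (a + b))
print(np.round(pi, 2), np.round(means, 3))  # means should land near 0.09 / 0.91
```

The parallel logit-Gaussian fit mentioned in the commit serves as a robustness check because a misspecified Beta family can distort EM estimates, which is the point of the White (1982) reference.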
@@ -0,0 +1,413 @@
#!/usr/bin/env python3
"""
Script 19: Pixel-Identity Validation (No Human Annotation Required)
===================================================================

Validates the cosine + dHash dual classifier using three naturally
occurring reference populations instead of manual labels:

Positive anchor 1: pixel_identical_to_closest = 1
    Two signature images byte-identical after crop/resize.
    Mathematically impossible to arise from independent hand-signing
    => absolute ground truth for replication.

Positive anchor 2: Firm A (Deloitte) signatures
    Interview + visual evidence establishes near-universal non-hand-
    signing across 2013-2023 (see memories 2026-04-08, 2026-04-14).
    We treat Firm A as a strong prior positive.

Negative anchor: signatures with cosine <= low threshold
    Pairs with very low cosine similarity cannot plausibly be pixel
    duplicates, so they serve as absolute negatives.

Metrics reported:
- FAR/FRR/EER using the pixel-identity anchor as the gold positive
  and low-similarity pairs as the gold negative.
- Precision/Recall/F1 at cosine and dHash thresholds from Scripts
  15/16/17/18.
- Convergence with the Firm A anchor (what fraction of Firm A
  signatures are correctly classified at each threshold).

A small visual sanity sample (30 pairs) is exported for spot-checks,
but metrics are derived entirely from pixel and Firm A evidence.

Output:
    reports/pixel_validation/pixel_validation_report.md
    reports/pixel_validation/pixel_validation_results.json
    reports/pixel_validation/roc_cosine.png, roc_dhash.png
    reports/pixel_validation/sanity_sample.csv
"""
import sqlite3
import json

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'pixel_validation')
OUT.mkdir(parents=True, exist_ok=True)

FIRM_A = '勤業眾信聯合'
NEGATIVE_COSINE_UPPER = 0.70  # pairs with max-cosine < 0.70 assumed not replicated
SANITY_SAMPLE_SIZE = 30

def load_signatures():
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.signature_id, s.image_filename, s.assigned_accountant,
               a.firm, s.max_similarity_to_same_accountant,
               s.phash_distance_to_closest, s.min_dhash_independent,
               s.pixel_identical_to_closest, s.closest_match_file
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.max_similarity_to_same_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    data = []
    for r in rows:
        data.append({
            'sig_id': r[0], 'filename': r[1], 'accountant': r[2],
            'firm': r[3] or '(unknown)',
            'cosine': float(r[4]),
            'dhash_cond': None if r[5] is None else int(r[5]),
            'dhash_indep': None if r[6] is None else int(r[6]),
            'pixel_identical': int(r[7] or 0),
            'closest_match': r[8],
        })
    return data

def confusion(y_true, y_pred):
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    denom_p = max(tp + fp, 1)
    denom_r = max(tp + fn, 1)
    precision = tp / denom_p
    recall = tp / denom_r
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    far = fp / max(fp + tn, 1)  # false acceptance rate (over negatives)
    frr = fn / max(fn + tp, 1)  # false rejection rate (over positives)
    return {
        'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
        'precision': float(precision),
        'recall': float(recall),
        'f1': float(f1),
        'far': float(far),
        'frr': float(frr),
    }

def sweep_threshold(scores, y, direction, thresholds):
    """For direction 'above' a prediction is positive if score > threshold;
    for 'below' it is positive if score < threshold."""
    out = []
    for t in thresholds:
        if direction == 'above':
            y_pred = (scores > t).astype(int)
        else:
            y_pred = (scores < t).astype(int)
        m = classification_metrics(y, y_pred)
        m['threshold'] = float(t)
        out.append(m)
    return out

def find_eer(sweep):
    """EER = point where FAR ≈ FRR; interpolated from the nearest pair."""
    thr = np.array([s['threshold'] for s in sweep])
    far = np.array([s['far'] for s in sweep])
    frr = np.array([s['frr'] for s in sweep])
    diff = far - frr
    signs = np.sign(diff)
    changes = np.where(np.diff(signs) != 0)[0]
    if len(changes) == 0:
        idx = int(np.argmin(np.abs(diff)))
        return {'threshold': float(thr[idx]), 'far': float(far[idx]),
                'frr': float(frr[idx]),
                'eer': float(0.5 * (far[idx] + frr[idx]))}
    i = int(changes[0])
    w = abs(diff[i]) / (abs(diff[i]) + abs(diff[i + 1]) + 1e-12)
    thr_i = (1 - w) * thr[i] + w * thr[i + 1]
    far_i = (1 - w) * far[i] + w * far[i + 1]
    frr_i = (1 - w) * frr[i] + w * frr[i + 1]
    return {'threshold': float(thr_i), 'far': float(far_i),
            'frr': float(frr_i), 'eer': float(0.5 * (far_i + frr_i))}

def plot_roc(sweep, title, out_path):
    far = np.array([s['far'] for s in sweep])
    frr = np.array([s['frr'] for s in sweep])
    thr = np.array([s['threshold'] for s in sweep])
    fig, axes = plt.subplots(1, 2, figsize=(13, 5))

    ax = axes[0]
    ax.plot(far, 1 - frr, 'b-', lw=2)
    ax.plot([0, 1], [0, 1], 'k--', alpha=0.4)
    ax.set_xlabel('FAR')
    ax.set_ylabel('1 - FRR (True Positive Rate)')
    ax.set_title(f'{title} - ROC')
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.grid(alpha=0.3)

    ax = axes[1]
    ax.plot(thr, far, 'r-', lw=2, label='FAR')
    ax.plot(thr, frr, 'b-', lw=2, label='FRR')
    ax.set_xlabel('Threshold')
    ax.set_ylabel('Error rate')
    ax.set_title(f'{title} - FAR / FRR vs threshold')
    ax.legend()
    ax.grid(alpha=0.3)

    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)

def main():
    print('=' * 70)
    print('Script 19: Pixel-Identity Validation (No Annotation)')
    print('=' * 70)

    data = load_signatures()
    print(f'\nTotal signatures loaded: {len(data):,}')
    cos = np.array([d['cosine'] for d in data])
    dh_indep = np.array([d['dhash_indep'] if d['dhash_indep'] is not None
                         else -1 for d in data])
    pix = np.array([d['pixel_identical'] for d in data])
    firm = np.array([d['firm'] for d in data])

    print(f'Pixel-identical: {int(pix.sum()):,} signatures')
    print(f'Firm A signatures: {int((firm == FIRM_A).sum()):,}')
    print(f'Negative anchor (cosine < {NEGATIVE_COSINE_UPPER}): '
          f'{int((cos < NEGATIVE_COSINE_UPPER).sum()):,}')

    # Build labelled set:
    #   positive = pixel_identical == 1
    #   negative = cosine < NEGATIVE_COSINE_UPPER (and not pixel_identical)
    pos_mask = pix == 1
    neg_mask = (cos < NEGATIVE_COSINE_UPPER) & (~pos_mask)
    labelled_mask = pos_mask | neg_mask
    y = pos_mask[labelled_mask].astype(int)
    cos_l = cos[labelled_mask]
    dh_l = dh_indep[labelled_mask]

    # --- Sweep cosine threshold
    cos_thresh = np.linspace(0.50, 1.00, 101)
    cos_sweep = sweep_threshold(cos_l, y, 'above', cos_thresh)
    cos_eer = find_eer(cos_sweep)
    print(f'\nCosine EER: threshold={cos_eer["threshold"]:.4f}, '
          f'EER={cos_eer["eer"]:.4f}')

    # --- Sweep dHash threshold (independent)
    dh_l_valid = dh_l >= 0
    y_dh = y[dh_l_valid]
    dh_valid = dh_l[dh_l_valid]
    dh_thresh = np.arange(0, 40)
    dh_sweep = sweep_threshold(dh_valid, y_dh, 'below', dh_thresh)
    dh_eer = find_eer(dh_sweep)
    print(f'dHash EER: threshold={dh_eer["threshold"]:.4f}, '
          f'EER={dh_eer["eer"]:.4f}')

    # Plots
    plot_roc(cos_sweep, 'Cosine (pixel-identity anchor)',
             OUT / 'roc_cosine.png')
    plot_roc(dh_sweep, 'Independent dHash (pixel-identity anchor)',
             OUT / 'roc_dhash.png')

    # --- Evaluate canonical thresholds
    canonical = [
        ('cosine', 0.837, 'above', cos, pos_mask, neg_mask),
        ('cosine', 0.941, 'above', cos, pos_mask, neg_mask),
        ('cosine', 0.95, 'above', cos, pos_mask, neg_mask),
        ('dhash_indep', 5, 'below', dh_indep, pos_mask,
         neg_mask & (dh_indep >= 0)),
        ('dhash_indep', 8, 'below', dh_indep, pos_mask,
         neg_mask & (dh_indep >= 0)),
        ('dhash_indep', 15, 'below', dh_indep, pos_mask,
         neg_mask & (dh_indep >= 0)),
    ]
    canonical_results = []
    for name, thr, direction, scores, p_mask, n_mask in canonical:
        labelled = p_mask | n_mask
        if 'dhash' in name:
            valid = labelled & (scores >= 0)
        else:
            valid = labelled
        y_local = p_mask[valid].astype(int)
        s = scores[valid]
        if direction == 'above':
            y_pred = (s > thr).astype(int)
        else:
            y_pred = (s < thr).astype(int)
        m = classification_metrics(y_local, y_pred)
        m.update({'indicator': name, 'threshold': float(thr),
                  'direction': direction})
        canonical_results.append(m)
        print(f"  {name} @ {thr:>5} ({direction}): "
              f"P={m['precision']:.3f}, R={m['recall']:.3f}, "
              f"F1={m['f1']:.3f}, FAR={m['far']:.4f}, FRR={m['frr']:.4f}")

    # --- Firm A anchor validation
    firm_a_mask = firm == FIRM_A
    firm_a_cos = cos[firm_a_mask]
    firm_a_dh = dh_indep[firm_a_mask]

    firm_a_rates = {}
    for thr in [0.837, 0.941, 0.95]:
        firm_a_rates[f'cosine>{thr}'] = float(np.mean(firm_a_cos > thr))
    for thr in [5, 8, 15]:
        valid = firm_a_dh >= 0
        firm_a_rates[f'dhash_indep<={thr}'] = float(
            np.mean(firm_a_dh[valid] <= thr))
    # Dual thresholds
    firm_a_rates['cosine>0.95 AND dhash_indep<=8'] = float(
        np.mean((firm_a_cos > 0.95) &
                (firm_a_dh >= 0) & (firm_a_dh <= 8)))

    print('\nFirm A anchor validation:')
    for k, v in firm_a_rates.items():
        print(f'  {k}: {v*100:.2f}%')

    # --- Stratified sanity sample (30 signatures across 5 strata)
    rng = np.random.default_rng(42)
    strata = [
        ('pixel_identical', pix == 1),
        ('high_cos_low_dh',
         (cos > 0.95) & (dh_indep >= 0) & (dh_indep <= 5) & (pix == 0)),
        ('borderline',
         (cos > 0.837) & (cos < 0.95) & (dh_indep >= 0) & (dh_indep <= 15)),
        ('style_consistency_only',
         (cos > 0.95) & (dh_indep >= 0) & (dh_indep > 15)),
        ('likely_genuine', cos < NEGATIVE_COSINE_UPPER),
    ]
    sanity_sample = []
    per_stratum = SANITY_SAMPLE_SIZE // len(strata)
    for stratum_name, m in strata:
        idx = np.where(m)[0]
        pick = rng.choice(idx, size=min(per_stratum, len(idx)), replace=False)
        for i in pick:
            d = data[i]
            sanity_sample.append({
                'stratum': stratum_name, 'sig_id': d['sig_id'],
                'filename': d['filename'], 'accountant': d['accountant'],
                'firm': d['firm'], 'cosine': d['cosine'],
                'dhash_indep': d['dhash_indep'],
                'pixel_identical': d['pixel_identical'],
                'closest_match': d['closest_match'],
            })

    csv_path = OUT / 'sanity_sample.csv'
    with open(csv_path, 'w', encoding='utf-8') as f:
        keys = ['stratum', 'sig_id', 'filename', 'accountant', 'firm',
                'cosine', 'dhash_indep', 'pixel_identical', 'closest_match']
        f.write(','.join(keys) + '\n')
        for row in sanity_sample:
            f.write(','.join(str(row[k]) if row[k] is not None else ''
                             for k in keys) + '\n')
    print(f'\nSanity sample CSV: {csv_path}')

    # --- Save results
    summary = {
        'generated_at': datetime.now().isoformat(),
        'n_signatures': len(data),
        'n_pixel_identical': int(pos_mask.sum()),
        'n_firm_a': int(firm_a_mask.sum()),
        'n_negative_anchor': int(neg_mask.sum()),
        'negative_cosine_upper': NEGATIVE_COSINE_UPPER,
        'eer_cosine': cos_eer,
        'eer_dhash_indep': dh_eer,
        'canonical_thresholds': canonical_results,
        'firm_a_anchor_rates': firm_a_rates,
        'cosine_sweep': cos_sweep,
        'dhash_sweep': dh_sweep,
    }
    with open(OUT / 'pixel_validation_results.json', 'w',
              encoding='utf-8') as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    print(f'JSON: {OUT / "pixel_validation_results.json"}')

    # --- Markdown
    md = [
        '# Pixel-Identity Validation Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Anchors (no human annotation required)',
        '',
        f'* **Pixel-identical anchor (gold positive):** '
        f'{int(pos_mask.sum()):,} signatures whose closest same-accountant',
        '  match is byte-identical after crop/normalise. Under handwriting',
        '  physics this can only arise from image duplication.',
        '* **Negative anchor:** signatures whose maximum same-accountant',
        f'  cosine is below {NEGATIVE_COSINE_UPPER} '
        f'({int(neg_mask.sum()):,} signatures). Treated as',
        '  confirmed not-replicated.',
        f'* **Firm A anchor:** Deloitte ({int(firm_a_mask.sum()):,} signatures),',
        '  near-universally non-hand-signed per partner interviews.',
        '',
        '## Equal Error Rate (EER)',
        '',
        '| Indicator | Direction | EER threshold | EER |',
        '|-----------|-----------|---------------|-----|',
        f"| Cosine max-similarity | > t | {cos_eer['threshold']:.4f} | "
        f"{cos_eer['eer']:.4f} |",
        f"| Independent min dHash | < t | {dh_eer['threshold']:.4f} | "
        f"{dh_eer['eer']:.4f} |",
        '',
        '## Canonical thresholds',
        '',
        '| Indicator | Threshold | Precision | Recall | F1 | FAR | FRR |',
        '|-----------|-----------|-----------|--------|----|-----|-----|',
    ]
    for c in canonical_results:
        md.append(
            f"| {c['indicator']} | {c['threshold']} "
            f"({c['direction']}) | {c['precision']:.3f} | "
            f"{c['recall']:.3f} | {c['f1']:.3f} | "
            f"{c['far']:.4f} | {c['frr']:.4f} |"
        )

    md += ['', '## Firm A anchor validation', '',
           '| Rule | Firm A rate |',
           '|------|-------------|']
    for k, v in firm_a_rates.items():
        md.append(f'| {k} | {v*100:.2f}% |')

    md += ['', '## Sanity sample', '',
           f'A stratified sample of {len(sanity_sample)} signatures '
           '(pixel-identical, high-cos/low-dh, borderline, style-only, '
           'likely-genuine) is exported to `sanity_sample.csv` for visual',
           'spot-check. These are **not** used to compute metrics.',
           '',
           '## Interpretation',
           '',
           'Because the gold positive is a *subset* of the true replication',
           'positives (only those that happen to be pixel-identical to their',
           'nearest match), recall is conservative: the classifier should',
           'catch pixel-identical pairs reliably and will additionally flag',
           'many non-pixel-identical replications (low dHash but not zero).',
           'FAR against the low-cosine negative anchor is the meaningful',
           'upper bound on spurious replication flags.',
           '',
           'Convergence of thresholds across Scripts 15 (dip test), 16 (BD),',
           '17 (Beta mixture), 18 (accountant mixture) and the EER here',
           'should be reported in the paper as multi-method validation.',
           ]
    (OUT / 'pixel_validation_report.md').write_text('\n'.join(md),
                                                    encoding='utf-8')
    print(f'Report: {OUT / "pixel_validation_report.md"}')


if __name__ == '__main__':
    main()