Add three-convergent-method threshold scripts + pixel-identity validation

Implements Partner v3's statistical rigor requirements at both the
signature level and the accountant level of analysis:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`.
  Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population);
  full-sample cosine MULTIMODAL (p<0.001, mix of two regimes);
  accountant-level aggregates MULTIMODAL on both cos and dHash.

- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition
  detection. Firm A and full-sample cosine transitions at 0.985; dHash
  at 2.0.
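
In essence, Script 16 applies the Burgstahler–Dichev standardized difference to similarity bins; a minimal sketch (bin width, variance approximation, and all names here are illustrative): each bin count is compared with the mean of its two neighbours, so |z| spikes in the bins flanking a discontinuity.

```python
import numpy as np

def bd_z_scores(values, bins):
    """Burgstahler-Dichev standardized difference per histogram bin:
    z_i = (n_i - (n_{i-1} + n_{i+1}) / 2) / sd_i. Large |z| marks a
    transition in an otherwise smooth density."""
    n, _ = np.histogram(values, bins=bins)
    N = n.sum()
    p = n / N
    z = np.full(len(n), np.nan)  # endpoints have no two-sided neighbour
    for i in range(1, len(n) - 1):
        expected = (n[i - 1] + n[i + 1]) / 2.0
        # Variance of n_i - (n_{i-1}+n_{i+1})/2 under a smooth density
        # (Burgstahler & Dichev 1997 approximation).
        var = (N * p[i] * (1 - p[i])
               + 0.25 * N * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        z[i] = (n[i] - expected) / np.sqrt(max(var, 1e-12))
    return z

# Synthetic cosine scores with a density step at 0.985.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.uniform(0.900, 0.985, 3000),
                         rng.uniform(0.985, 1.000, 6000)])
bins = np.arange(0.90, 1.0001, 0.005)
z = bd_z_scores(scores, bins)
peak_bin = np.nanargmax(np.abs(z))
print(f'largest |z| at bin starting {bins[peak_bin]:.3f}')
```

The bin with the largest |z| sits at the step, mirroring the 0.985 cosine transition the bullet reports.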

- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM
  with MoM M-step, plus parallel Gaussian mixture on logit transform
as a White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at the
signature level confirms that the 2-component model is a forced fit --
supporting the pivot to an accountant-level mixture.
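
The EM-with-method-of-moments M-step can be sketched as follows (a simplified 2-component version with illustrative names and synthetic data; the actual script also fits 3 components, compares BIC, and runs the parallel logit-GMM):

```python
import numpy as np
from scipy.stats import beta

def beta_mom(m, v):
    """Method-of-moments Beta(a, b) from a weighted mean m and variance v."""
    common = m * (1 - m) / max(v, 1e-9) - 1
    return max(m * common, 1e-3), max((1 - m) * common, 1e-3)

def beta_mixture_em(x, n_iter=200):
    """2-component Beta mixture: exact E-step, moment-matching M-step."""
    w = np.array([0.5, 0.5])
    params = [beta_mom(0.3, 0.02), beta_mom(0.8, 0.02)]  # rough init
    for _ in range(n_iter):
        # E-step: responsibilities from the current component densities.
        dens = np.stack([w[k] * beta.pdf(x, *params[k]) for k in range(2)])
        r = dens / np.maximum(dens.sum(axis=0), 1e-300)
        # M-step: weighted moments -> Beta parameters per component.
        w = r.mean(axis=1)
        for k in range(2):
            m = np.average(x, weights=r[k])
            v = np.average((x - m) ** 2, weights=r[k])
            params[k] = beta_mom(m, v)
    return w, params

rng = np.random.default_rng(3)
x = np.concatenate([rng.beta(8, 20, 700),    # low-similarity component
                    rng.beta(60, 5, 300)])   # high-similarity component
w, params = beta_mixture_em(x)
means = [a / (a + b) for a, b in params]
print('weights:', np.round(w, 2), 'means:', np.round(means, 2))
```

On well-separated components like these, the weights and component means recover the generating mixture; on the real signature-level data, the 3-component Beta winning on BIC is what flags the 2-component fit as forced.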

- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis
  that was done inline and not saved. BIC-best K=3 with components
  matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%,
  Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928,
  11.17, 28%, small firms). 2-component natural thresholds:
  cos=0.9450, dh=8.10.
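
BIC-driven selection of K presumably follows the standard scikit-learn pattern; a sketch on synthetic two-feature aggregates loosely shaped like C1/C2/C3 above (the cluster locations and counts here are illustrative, not the real accountant data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic accountant-level aggregates: (mean cosine, mean dHash).
rng = np.random.default_rng(7)
X = np.vstack([
    rng.normal([0.983, 2.4],  [0.005, 0.5], size=(140, 2)),
    rng.normal([0.954, 7.0],  [0.008, 1.0], size=(360, 2)),
    rng.normal([0.928, 11.2], [0.010, 1.5], size=(200, 2)),
])

# Fit K = 1..6 and keep the model with the lowest BIC.
models = {k: GaussianMixture(n_components=k, n_init=5,
                             random_state=0).fit(X)
          for k in range(1, 7)}
best_k = min(models, key=lambda k: models[k].bic(X))
print('best K by BIC:', best_k)
print('component means:\n', np.round(models[best_k].means_, 3))
```

With clusters this well separated, BIC selects K=3 and the fitted means land on the cluster centres; on the real data the same sweep is what reproduces the K=3 solution from memory.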

- Script 19 (Pixel-identity validation): no human annotation needed.
  Uses pixel_identical_to_closest (310 sigs) as gold positive and
Firm A as anchor positive. Confirms the Firm A cosine>0.95 rate at
92.51% (matching the prior 2026-04-08 finding of 92.5%); the dual rule
cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.

Python deps added: diptest, scikit-learn (installed into venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:51:41 +08:00
parent 158f63efb2
commit fbfab1fa68
5 changed files with 1760 additions and 0 deletions
@@ -0,0 +1,413 @@
#!/usr/bin/env python3
"""
Script 19: Pixel-Identity Validation (No Human Annotation Required)
===================================================================
Validates the cosine + dHash dual classifier using three naturally
occurring reference populations instead of manual labels:
Positive anchor 1: pixel_identical_to_closest = 1
Two signature images byte-identical after crop/resize.
Mathematically impossible to arise from independent hand-signing
=> absolute ground truth for replication.
Positive anchor 2: Firm A (Deloitte) signatures
Interview + visual evidence establishes near-universal non-hand-
signing across 2013-2023 (see memories 2026-04-08, 2026-04-14).
We treat Firm A as a strong prior positive.
Negative anchor: signatures with cosine <= low threshold
Pairs with very low cosine similarity cannot plausibly be pixel
duplicates, so they serve as absolute negatives.
Metrics reported:
- FAR/FRR/EER using the pixel-identity anchor as the gold positive
and low-similarity pairs as the gold negative.
- Precision/Recall/F1 at cosine and dHash thresholds from Scripts
15/16/17/18.
- Convergence with Firm A anchor (what fraction of Firm A signatures
are correctly classified at each threshold).
Small visual sanity sample (30 pairs) is exported for spot-check, but
metrics are derived entirely from pixel and Firm A evidence.
Output:
reports/pixel_validation/pixel_validation_report.md
reports/pixel_validation/pixel_validation_results.json
reports/pixel_validation/roc_cosine.png, roc_dhash.png
reports/pixel_validation/sanity_sample.csv
"""
import sqlite3
import json
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'pixel_validation')
OUT.mkdir(parents=True, exist_ok=True)
FIRM_A = '勤業眾信聯合'
NEGATIVE_COSINE_UPPER = 0.70 # pairs with max-cosine < 0.70 assumed not replicated
SANITY_SAMPLE_SIZE = 30
def load_signatures():
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.signature_id, s.image_filename, s.assigned_accountant,
               a.firm, s.max_similarity_to_same_accountant,
               s.phash_distance_to_closest, s.min_dhash_independent,
               s.pixel_identical_to_closest, s.closest_match_file
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.max_similarity_to_same_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    data = []
    for r in rows:
        data.append({
            'sig_id': r[0], 'filename': r[1], 'accountant': r[2],
            'firm': r[3] or '(unknown)',
            'cosine': float(r[4]),
            'dhash_cond': None if r[5] is None else int(r[5]),
            'dhash_indep': None if r[6] is None else int(r[6]),
            'pixel_identical': int(r[7] or 0),
            'closest_match': r[8],
        })
    return data
def confusion(y_true, y_pred):
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn
def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    denom_p = max(tp + fp, 1)
    denom_r = max(tp + fn, 1)
    precision = tp / denom_p
    recall = tp / denom_r
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    far = fp / max(fp + tn, 1)  # false acceptance rate (over negatives)
    frr = fn / max(fn + tp, 1)  # false rejection rate (over positives)
    return {
        'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
        'precision': float(precision),
        'recall': float(recall),
        'f1': float(f1),
        'far': float(far),
        'frr': float(frr),
    }
def sweep_threshold(scores, y, direction, thresholds):
    """For direction 'above' a prediction is positive if score > threshold;
    for 'below' it is positive if score < threshold."""
    out = []
    for t in thresholds:
        if direction == 'above':
            y_pred = (scores > t).astype(int)
        else:
            y_pred = (scores < t).astype(int)
        m = classification_metrics(y, y_pred)
        m['threshold'] = float(t)
        out.append(m)
    return out
def find_eer(sweep):
    """EER = point where FAR ≈ FRR; interpolated from nearest pair."""
    thr = np.array([s['threshold'] for s in sweep])
    far = np.array([s['far'] for s in sweep])
    frr = np.array([s['frr'] for s in sweep])
    diff = far - frr
    signs = np.sign(diff)
    changes = np.where(np.diff(signs) != 0)[0]
    if len(changes) == 0:
        idx = int(np.argmin(np.abs(diff)))
        return {'threshold': float(thr[idx]), 'far': float(far[idx]),
                'frr': float(frr[idx]),
                'eer': float(0.5 * (far[idx] + frr[idx]))}
    i = int(changes[0])
    w = abs(diff[i]) / (abs(diff[i]) + abs(diff[i + 1]) + 1e-12)
    thr_i = (1 - w) * thr[i] + w * thr[i + 1]
    far_i = (1 - w) * far[i] + w * far[i + 1]
    frr_i = (1 - w) * frr[i] + w * frr[i + 1]
    return {'threshold': float(thr_i), 'far': float(far_i),
            'frr': float(frr_i), 'eer': float(0.5 * (far_i + frr_i))}
def plot_roc(sweep, title, out_path):
    far = np.array([s['far'] for s in sweep])
    frr = np.array([s['frr'] for s in sweep])
    thr = np.array([s['threshold'] for s in sweep])
    fig, axes = plt.subplots(1, 2, figsize=(13, 5))
    ax = axes[0]
    ax.plot(far, 1 - frr, 'b-', lw=2)
    ax.plot([0, 1], [0, 1], 'k--', alpha=0.4)
    ax.set_xlabel('FAR')
    ax.set_ylabel('1 - FRR (True Positive Rate)')
    ax.set_title(f'{title} - ROC')
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.grid(alpha=0.3)
    ax = axes[1]
    ax.plot(thr, far, 'r-', lw=2, label='FAR')
    ax.plot(thr, frr, 'b-', lw=2, label='FRR')
    ax.set_xlabel('Threshold')
    ax.set_ylabel('Error rate')
    ax.set_title(f'{title} - FAR / FRR vs threshold')
    ax.legend()
    ax.grid(alpha=0.3)
    plt.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
def main():
    print('=' * 70)
    print('Script 19: Pixel-Identity Validation (No Annotation)')
    print('=' * 70)
    data = load_signatures()
    print(f'\nTotal signatures loaded: {len(data):,}')
    cos = np.array([d['cosine'] for d in data])
    dh_indep = np.array([d['dhash_indep'] if d['dhash_indep'] is not None
                         else -1 for d in data])
    pix = np.array([d['pixel_identical'] for d in data])
    firm = np.array([d['firm'] for d in data])
    print(f'Pixel-identical: {int(pix.sum()):,} signatures')
    print(f'Firm A signatures: {int((firm == FIRM_A).sum()):,}')
    print(f'Negative anchor (cosine < {NEGATIVE_COSINE_UPPER}): '
          f'{int((cos < NEGATIVE_COSINE_UPPER).sum()):,}')
    # Build labelled set:
    #   positive = pixel_identical == 1
    #   negative = cosine < NEGATIVE_COSINE_UPPER (and not pixel_identical)
    pos_mask = pix == 1
    neg_mask = (cos < NEGATIVE_COSINE_UPPER) & (~pos_mask)
    labelled_mask = pos_mask | neg_mask
    y = pos_mask[labelled_mask].astype(int)
    cos_l = cos[labelled_mask]
    dh_l = dh_indep[labelled_mask]
    # --- Sweep cosine threshold
    cos_thresh = np.linspace(0.50, 1.00, 101)
    cos_sweep = sweep_threshold(cos_l, y, 'above', cos_thresh)
    cos_eer = find_eer(cos_sweep)
    print(f'\nCosine EER: threshold={cos_eer["threshold"]:.4f}, '
          f'EER={cos_eer["eer"]:.4f}')
    # --- Sweep dHash threshold (independent)
    dh_l_valid = dh_l >= 0
    y_dh = y[dh_l_valid]
    dh_valid = dh_l[dh_l_valid]
    dh_thresh = np.arange(0, 40)
    dh_sweep = sweep_threshold(dh_valid, y_dh, 'below', dh_thresh)
    dh_eer = find_eer(dh_sweep)
    print(f'dHash EER: threshold={dh_eer["threshold"]:.4f}, '
          f'EER={dh_eer["eer"]:.4f}')
    # Plots
    plot_roc(cos_sweep, 'Cosine (pixel-identity anchor)',
             OUT / 'roc_cosine.png')
    plot_roc(dh_sweep, 'Independent dHash (pixel-identity anchor)',
             OUT / 'roc_dhash.png')
    # --- Evaluate canonical thresholds
    canonical = [
        ('cosine', 0.837, 'above', cos, pos_mask, neg_mask),
        ('cosine', 0.941, 'above', cos, pos_mask, neg_mask),
        ('cosine', 0.95, 'above', cos, pos_mask, neg_mask),
        ('dhash_indep', 5, 'below', dh_indep, pos_mask,
         neg_mask & (dh_indep >= 0)),
        ('dhash_indep', 8, 'below', dh_indep, pos_mask,
         neg_mask & (dh_indep >= 0)),
        ('dhash_indep', 15, 'below', dh_indep, pos_mask,
         neg_mask & (dh_indep >= 0)),
    ]
    canonical_results = []
    for name, thr, direction, scores, p_mask, n_mask in canonical:
        labelled = p_mask | n_mask
        valid = labelled & ((scores >= 0) if 'dhash' in name
                            else np.ones_like(labelled, dtype=bool))
        y_local = p_mask[valid].astype(int)
        s = scores[valid]
        if direction == 'above':
            y_pred = (s > thr).astype(int)
        else:
            y_pred = (s < thr).astype(int)
        m = classification_metrics(y_local, y_pred)
        m.update({'indicator': name, 'threshold': float(thr),
                  'direction': direction})
        canonical_results.append(m)
        print(f"  {name} @ {thr:>5} ({direction}): "
              f"P={m['precision']:.3f}, R={m['recall']:.3f}, "
              f"F1={m['f1']:.3f}, FAR={m['far']:.4f}, FRR={m['frr']:.4f}")
    # --- Firm A anchor validation
    firm_a_mask = firm == FIRM_A
    firm_a_cos = cos[firm_a_mask]
    firm_a_dh = dh_indep[firm_a_mask]
    firm_a_rates = {}
    for thr in [0.837, 0.941, 0.95]:
        firm_a_rates[f'cosine>{thr}'] = float(np.mean(firm_a_cos > thr))
    for thr in [5, 8, 15]:
        valid = firm_a_dh >= 0
        firm_a_rates[f'dhash_indep<={thr}'] = float(
            np.mean(firm_a_dh[valid] <= thr))
    # Dual thresholds
    firm_a_rates['cosine>0.95 AND dhash_indep<=8'] = float(
        np.mean((firm_a_cos > 0.95) &
                (firm_a_dh >= 0) & (firm_a_dh <= 8)))
    print('\nFirm A anchor validation:')
    for k, v in firm_a_rates.items():
        print(f'  {k}: {v*100:.2f}%')
    # --- Stratified sanity sample (30 signatures across 5 strata)
    rng = np.random.default_rng(42)
    strata = [
        ('pixel_identical', pix == 1),
        ('high_cos_low_dh',
         (cos > 0.95) & (dh_indep >= 0) & (dh_indep <= 5) & (pix == 0)),
        ('borderline',
         (cos > 0.837) & (cos < 0.95) & (dh_indep >= 0) & (dh_indep <= 15)),
        ('style_consistency_only',
         (cos > 0.95) & (dh_indep >= 0) & (dh_indep > 15)),
        ('likely_genuine', cos < NEGATIVE_COSINE_UPPER),
    ]
    sanity_sample = []
    per_stratum = SANITY_SAMPLE_SIZE // len(strata)
    for stratum_name, m in strata:
        idx = np.where(m)[0]
        if len(idx) == 0:
            continue  # rng.choice raises on an empty population
        pick = rng.choice(idx, size=min(per_stratum, len(idx)), replace=False)
        for i in pick:
            d = data[i]
            sanity_sample.append({
                'stratum': stratum_name, 'sig_id': d['sig_id'],
                'filename': d['filename'], 'accountant': d['accountant'],
                'firm': d['firm'], 'cosine': d['cosine'],
                'dhash_indep': d['dhash_indep'],
                'pixel_identical': d['pixel_identical'],
                'closest_match': d['closest_match'],
            })
    import csv  # stdlib; proper quoting for fields that may contain commas
    csv_path = OUT / 'sanity_sample.csv'
    keys = ['stratum', 'sig_id', 'filename', 'accountant', 'firm',
            'cosine', 'dhash_indep', 'pixel_identical', 'closest_match']
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(sanity_sample)
    print(f'\nSanity sample CSV: {csv_path}')
    # --- Save results
    summary = {
        'generated_at': datetime.now().isoformat(),
        'n_signatures': len(data),
        'n_pixel_identical': int(pos_mask.sum()),
        'n_firm_a': int(firm_a_mask.sum()),
        'n_negative_anchor': int(neg_mask.sum()),
        'negative_cosine_upper': NEGATIVE_COSINE_UPPER,
        'eer_cosine': cos_eer,
        'eer_dhash_indep': dh_eer,
        'canonical_thresholds': canonical_results,
        'firm_a_anchor_rates': firm_a_rates,
        'cosine_sweep': cos_sweep,
        'dhash_sweep': dh_sweep,
    }
    with open(OUT / 'pixel_validation_results.json', 'w',
              encoding='utf-8') as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    print(f'JSON: {OUT / "pixel_validation_results.json"}')
    # --- Markdown
    md = [
        '# Pixel-Identity Validation Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Anchors (no human annotation required)',
        '',
        f'* **Pixel-identical anchor (gold positive):** '
        f'{int(pos_mask.sum()):,} signatures whose closest same-accountant',
        '  match is byte-identical after crop/normalise. Under handwriting',
        '  physics this can only arise from image duplication.',
        '* **Negative anchor:** signatures whose maximum same-accountant',
        f'  cosine is below {NEGATIVE_COSINE_UPPER} '
        f'({int(neg_mask.sum()):,} signatures). Treated as',
        '  confirmed not-replicated.',
        f'* **Firm A anchor:** Deloitte ({int(firm_a_mask.sum()):,} signatures),',
        '  near-universally non-hand-signed per partner interviews.',
        '',
        '## Equal Error Rate (EER)',
        '',
        '| Indicator | Direction | EER threshold | EER |',
        '|-----------|-----------|---------------|-----|',
        f"| Cosine max-similarity | > t | {cos_eer['threshold']:.4f} | "
        f"{cos_eer['eer']:.4f} |",
        f"| Independent min dHash | < t | {dh_eer['threshold']:.4f} | "
        f"{dh_eer['eer']:.4f} |",
        '',
        '## Canonical thresholds',
        '',
        '| Indicator | Threshold | Precision | Recall | F1 | FAR | FRR |',
        '|-----------|-----------|-----------|--------|----|-----|-----|',
    ]
    for c in canonical_results:
        md.append(
            f"| {c['indicator']} | {c['threshold']} "
            f"({c['direction']}) | {c['precision']:.3f} | "
            f"{c['recall']:.3f} | {c['f1']:.3f} | "
            f"{c['far']:.4f} | {c['frr']:.4f} |"
        )
    md += ['', '## Firm A anchor validation', '',
           '| Rule | Firm A rate |',
           '|------|-------------|']
    for k, v in firm_a_rates.items():
        md.append(f'| {k} | {v*100:.2f}% |')
    md += ['', '## Sanity sample', '',
           f'A stratified sample of {len(sanity_sample)} signatures '
           '(pixel-identical, high-cos/low-dh, borderline, style-only, '
           'likely-genuine) is exported to `sanity_sample.csv` for visual',
           'spot-check. These are **not** used to compute metrics.',
           '',
           '## Interpretation',
           '',
           'Because the gold positive is a *subset* of the true replication',
           'positives (only those that happen to be pixel-identical to their',
           'nearest match), recall is conservative: the classifier should',
           'catch pixel-identical pairs reliably and will additionally flag',
           'many non-pixel-identical replications (low dHash but not zero).',
           'FAR against the low-cosine negative anchor is the meaningful',
           'upper bound on spurious replication flags.',
           '',
           'Convergence of thresholds across Scripts 15 (dip test), 16 (BD),',
           '17 (Beta mixture), 18 (accountant mixture) and the EER here',
           'should be reported in the paper as multi-method validation.',
           ]
    (OUT / 'pixel_validation_report.md').write_text('\n'.join(md),
                                                    encoding='utf-8')
    print(f'Report: {OUT / "pixel_validation_report.md"}')
if __name__ == '__main__':
    main()