12f716ddf1
Fully addresses the partial-resolution / unfixed items from codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):
Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier
table had 1-4-unit transcription errors in k values and a fabricated
cos > 0.9407 calibration row; both fixed by rerunning Script 24
with cos = 0.9407 added to COS_RULES and copying exact values from
the JSON output.
- Section III-L classifier now defined entirely in terms of the
independent-minimum dHash statistic that the deployed code (Scripts
21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
language is removed. Tables IX, XI, XII, XVI are now arithmetically
consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
per-signature cosine distribution, matching III-L and IV-F.
Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
limit. Removed "we break the circularity" overclaim; replaced with
"report capture rates on both folds with Wilson 95% intervals to
make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
Methods/Results don't deliver; replaced with anchor-based capture /
FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
intra-report consistency (IV-H.3) is a different test (two co-signers
on the same report, firm-level homogeneity) and is not a within-CPA
year-level mixing check; the assumption is maintained as a bounded
identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
the partner-level ranking is threshold-free"; longitudinal-stability
uses 0.95 cutoff, intra-report uses the operational classifier.
Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
Regular Papers do not have a standalone Impact Statement). The file
itself is retained as an archived non-paper note for cover-letter /
grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
[35] VLM survey, [36] Mann-Whitney) are now cited in-text:
[27] in Methodology III-E (dHash definition)
[31][32][33] in Introduction (audit-quality regulation context)
[34][35] in Methodology III-C/III-D
[36] in Results IV-C (Mann-Whitney result)
Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
420 lines
16 KiB
Python
420 lines
16 KiB
Python
#!/usr/bin/env python3
|
|
"""
|
|
Script 24: Validation Recalibration (addresses codex v3.3 blockers)
|
|
====================================================================
|
|
Fixes three issues flagged by codex gpt-5.4 round-3 review of Paper A v3.3:
|
|
|
|
Blocker 2: held-out validation prose claims "held-out rates match
|
|
whole-sample within Wilson CI", which is numerically false
|
|
(e.g., whole 92.51% vs held-out CI [93.21%, 93.98%]).
|
|
The correct reference for generalization is the calibration
|
|
fold (70%), not the whole sample.
|
|
|
|
Blocker 1: the deployed per-signature classifier uses whole-sample
|
|
Firm A percentile heuristics (0.95, 0.837, dHash 5/15),
|
|
while the accountant-level three-method convergence sits at
|
|
cos ~0.973-0.979. This script adds a sensitivity check of
|
|
the classifier's five-way output under cos>0.945 and
|
|
cos>0.95 so the paper can report how the category
|
|
distribution shifts when the operational threshold is
|
|
replaced with the accountant-level 2D GMM marginal.
|
|
|
|
This script reads Script 21's output JSON for the 70/30 fold, recomputes
|
|
both calibration-fold and held-out-fold capture rates (with Wilson 95%
|
|
CIs), and runs a two-proportion z-test between calib and held-out for
|
|
each rule. It also computes the full-sample five-way classifier output
|
|
under cos>0.95 vs cos>0.945 for sensitivity.
|
|
|
|
Output:
|
|
reports/validation_recalibration/validation_recalibration.md
|
|
reports/validation_recalibration/validation_recalibration.json
|
|
"""
|
|
|
|
import json
|
|
import sqlite3
|
|
import numpy as np
|
|
from pathlib import Path
|
|
from datetime import datetime
|
|
from scipy.stats import norm
|
|
|
|
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
|
|
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
|
|
'validation_recalibration')
|
|
OUT.mkdir(parents=True, exist_ok=True)
|
|
|
|
FIRM_A = '勤業眾信聯合'
|
|
SEED = 42
|
|
|
|
# Rules of interest for held-out vs calib comparison.
|
|
# 0.9407 = calibration-fold P5 of the Firm A cosine distribution
|
|
# (see Script 21 / Section III-K) and is included so Table XI of the
|
|
# paper can report calib- and held-fold rates for the same rule set.
|
|
COS_RULES = [0.837, 0.9407, 0.945, 0.95]
|
|
DH_RULES = [5, 8, 9, 15]
|
|
# Dual rule (the paper's classifier's operational dual).
|
|
DUAL_RULES = [(0.95, 8), (0.945, 8)]
|
|
|
|
|
|
def wilson_ci(k, n, alpha=0.05):
|
|
if n == 0:
|
|
return (0.0, 1.0)
|
|
z = norm.ppf(1 - alpha / 2)
|
|
phat = k / n
|
|
denom = 1 + z * z / n
|
|
center = (phat + z * z / (2 * n)) / denom
|
|
pm = z * np.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
|
|
return (max(0.0, center - pm), min(1.0, center + pm))
|
|
|
|
|
|
def two_prop_z(k1, n1, k2, n2):
|
|
"""Two-proportion z-test (two-sided). Returns (z, p)."""
|
|
if n1 == 0 or n2 == 0:
|
|
return (float('nan'), float('nan'))
|
|
p1 = k1 / n1
|
|
p2 = k2 / n2
|
|
p_pool = (k1 + k2) / (n1 + n2)
|
|
if p_pool == 0 or p_pool == 1:
|
|
return (0.0, 1.0)
|
|
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
|
|
if se == 0:
|
|
return (0.0, 1.0)
|
|
z = (p1 - p2) / se
|
|
p = 2 * (1 - norm.cdf(abs(z)))
|
|
return (float(z), float(p))
|
|
|
|
|
|
def load_signatures():
|
|
conn = sqlite3.connect(DB)
|
|
cur = conn.cursor()
|
|
cur.execute('''
|
|
SELECT s.signature_id, s.assigned_accountant, a.firm,
|
|
s.max_similarity_to_same_accountant,
|
|
s.min_dhash_independent
|
|
FROM signatures s
|
|
LEFT JOIN accountants a ON s.assigned_accountant = a.name
|
|
WHERE s.max_similarity_to_same_accountant IS NOT NULL
|
|
''')
|
|
rows = cur.fetchall()
|
|
conn.close()
|
|
return rows
|
|
|
|
|
|
def fmt_pct(x):
|
|
return f'{x * 100:.2f}%'
|
|
|
|
|
|
def rate_with_ci(k, n):
|
|
lo, hi = wilson_ci(k, n)
|
|
return {
|
|
'rate': float(k / n) if n else 0.0,
|
|
'k': int(k),
|
|
'n': int(n),
|
|
'wilson95': [float(lo), float(hi)],
|
|
}
|
|
|
|
|
|
def main():
|
|
print('=' * 70)
|
|
print('Script 24: Validation Recalibration')
|
|
print('=' * 70)
|
|
|
|
rows = load_signatures()
|
|
accts = [r[1] for r in rows]
|
|
firms = [r[2] or '(unknown)' for r in rows]
|
|
cos = np.array([r[3] for r in rows], dtype=float)
|
|
dh = np.array([-1 if r[4] is None else r[4] for r in rows], dtype=float)
|
|
|
|
firm_a_mask = np.array([f == FIRM_A for f in firms])
|
|
print(f'\nLoaded {len(rows):,} signatures')
|
|
print(f'Firm A signatures: {int(firm_a_mask.sum()):,}')
|
|
|
|
# --- Reproduce Script 21's 70/30 split (same SEED=42) ---
|
|
rng = np.random.default_rng(SEED)
|
|
firm_a_accts = sorted(set(a for a, f in zip(accts, firms) if f == FIRM_A))
|
|
rng.shuffle(firm_a_accts)
|
|
n_calib = int(0.7 * len(firm_a_accts))
|
|
calib_accts = set(firm_a_accts[:n_calib])
|
|
heldout_accts = set(firm_a_accts[n_calib:])
|
|
print(f'\n70/30 split: calib CPAs={len(calib_accts)}, '
|
|
f'heldout CPAs={len(heldout_accts)}')
|
|
|
|
calib_mask = np.array([a in calib_accts for a in accts])
|
|
heldout_mask = np.array([a in heldout_accts for a in accts])
|
|
whole_mask = firm_a_mask
|
|
|
|
def summarize_fold(mask, label):
|
|
mcos = cos[mask]
|
|
mdh = dh[mask]
|
|
dh_valid = mdh >= 0
|
|
out = {
|
|
'fold': label,
|
|
'n_sigs': int(mask.sum()),
|
|
'n_dh_valid': int(dh_valid.sum()),
|
|
'cos_rules': {},
|
|
'dh_rules': {},
|
|
'dual_rules': {},
|
|
}
|
|
for t in COS_RULES:
|
|
k = int(np.sum(mcos > t))
|
|
n = int(len(mcos))
|
|
out['cos_rules'][f'cos>{t:.4f}'] = rate_with_ci(k, n)
|
|
for t in DH_RULES:
|
|
k = int(np.sum((mdh >= 0) & (mdh <= t)))
|
|
n = int(dh_valid.sum())
|
|
out['dh_rules'][f'dh_indep<={t}'] = rate_with_ci(k, n)
|
|
for ct, dt in DUAL_RULES:
|
|
k = int(np.sum((mcos > ct) & (mdh >= 0) & (mdh <= dt)))
|
|
n = int(len(mcos))
|
|
out['dual_rules'][f'cos>{ct:.3f}_AND_dh<={dt}'] = rate_with_ci(k, n)
|
|
return out
|
|
|
|
calib = summarize_fold(calib_mask, 'calibration_70pct')
|
|
held = summarize_fold(heldout_mask, 'heldout_30pct')
|
|
whole = summarize_fold(whole_mask, 'whole_firm_a')
|
|
print(f'\nCalib sigs: {calib["n_sigs"]:,} (dh valid: {calib["n_dh_valid"]:,})')
|
|
print(f'Held sigs: {held["n_sigs"]:,} (dh valid: {held["n_dh_valid"]:,})')
|
|
print(f'Whole sigs: {whole["n_sigs"]:,} (dh valid: {whole["n_dh_valid"]:,})')
|
|
|
|
# --- 2-proportion z-tests: calib vs held-out ---
|
|
print('\n=== Calib vs Held-out: 2-proportion z-test ===')
|
|
tests = {}
|
|
all_rules = (
|
|
[(f'cos>{t:.4f}', 'cos_rules') for t in COS_RULES] +
|
|
[(f'dh_indep<={t}', 'dh_rules') for t in DH_RULES] +
|
|
[(f'cos>{ct:.3f}_AND_dh<={dt}', 'dual_rules') for ct, dt in DUAL_RULES]
|
|
)
|
|
for rule, group in all_rules:
|
|
c = calib[group][rule]
|
|
h = held[group][rule]
|
|
z, p = two_prop_z(c['k'], c['n'], h['k'], h['n'])
|
|
in_calib_ci = c['wilson95'][0] <= h['rate'] <= c['wilson95'][1]
|
|
in_held_ci = h['wilson95'][0] <= c['rate'] <= h['wilson95'][1]
|
|
tests[rule] = {
|
|
'calib_rate': c['rate'],
|
|
'calib_ci': c['wilson95'],
|
|
'held_rate': h['rate'],
|
|
'held_ci': h['wilson95'],
|
|
'z': z,
|
|
'p': p,
|
|
'held_within_calib_ci': bool(in_calib_ci),
|
|
'calib_within_held_ci': bool(in_held_ci),
|
|
}
|
|
sig = '***' if p < 0.001 else '**' if p < 0.01 else \
|
|
'*' if p < 0.05 else 'n.s.'
|
|
print(f' {rule:40s} calib={fmt_pct(c["rate"])} '
|
|
f'held={fmt_pct(h["rate"])} z={z:+.3f} p={p:.4f} {sig}')
|
|
|
|
# --- Classifier sensitivity: cos>0.95 vs cos>0.945 ---
|
|
print('\n=== Classifier sensitivity: 0.95 vs 0.945 ===')
|
|
# All whole-sample signatures (not just Firm A) for the classifier.
|
|
# Reproduces the Section III-L five-way classifier categorization.
|
|
dh_all_valid = dh >= 0
|
|
all_cos = cos
|
|
all_dh = dh
|
|
|
|
def classify(cos_arr, dh_arr, dh_valid, cos_hi, dh_hi_high=5,
|
|
dh_hi_mod=15, cos_lo=0.837):
|
|
"""Replicate Section III-L five-way classifier.
|
|
|
|
Categories (signature-level):
|
|
1 high-confidence non-hand-signed: cos>cos_hi AND dh<=dh_hi_high
|
|
2 moderate-confidence: cos>cos_hi AND dh_hi_high<dh<=dh_hi_mod
|
|
3 style-only: cos>cos_hi AND dh>dh_hi_mod
|
|
4 uncertain: cos_lo<cos<=cos_hi
|
|
5 likely hand-signed: cos<=cos_lo
|
|
Signatures with missing dHash fall into a sixth bucket (dh-missing).
|
|
"""
|
|
cats = np.full(len(cos_arr), 6, dtype=int) # 6 = dh-missing default
|
|
above_hi = cos_arr > cos_hi
|
|
above_lo_only = (cos_arr > cos_lo) & (~above_hi)
|
|
below_lo = cos_arr <= cos_lo
|
|
cats[above_lo_only] = 4
|
|
cats[below_lo] = 5
|
|
# For dh-valid subset that exceeds cos_hi, subdivide.
|
|
has_dh = dh_valid & above_hi
|
|
cats[has_dh & (dh_arr <= dh_hi_high)] = 1
|
|
cats[has_dh & (dh_arr > dh_hi_high) & (dh_arr <= dh_hi_mod)] = 2
|
|
cats[has_dh & (dh_arr > dh_hi_mod)] = 3
|
|
# Signatures with above_hi but dh missing -> default cat 2 (moderate)
|
|
# for continuity with the classifier's whole-sample behavior.
|
|
cats[above_hi & ~dh_valid] = 2
|
|
return cats
|
|
|
|
cats_95 = classify(all_cos, all_dh, dh_all_valid, cos_hi=0.95)
|
|
cats_945 = classify(all_cos, all_dh, dh_all_valid, cos_hi=0.945)
|
|
# 5 + dh-missing bucket
|
|
labels = {
|
|
1: 'high_confidence_non_hand_signed',
|
|
2: 'moderate_confidence_non_hand_signed',
|
|
3: 'high_style_consistency',
|
|
4: 'uncertain',
|
|
5: 'likely_hand_signed',
|
|
6: 'dh_missing',
|
|
}
|
|
sens = {'0.95': {}, '0.945': {}, 'diff': {}}
|
|
total = len(cats_95)
|
|
for c, name in labels.items():
|
|
n95 = int((cats_95 == c).sum())
|
|
n945 = int((cats_945 == c).sum())
|
|
sens['0.95'][name] = {'n': n95, 'pct': n95 / total * 100}
|
|
sens['0.945'][name] = {'n': n945, 'pct': n945 / total * 100}
|
|
sens['diff'][name] = n945 - n95
|
|
print(f' {name:40s} 0.95: {n95:>7,} ({n95/total*100:5.2f}%) '
|
|
f'0.945: {n945:>7,} ({n945/total*100:5.2f}%) '
|
|
f'diff: {n945 - n95:+,}')
|
|
# Transition matrix (how many signatures change category)
|
|
transitions = {}
|
|
for from_c in range(1, 7):
|
|
for to_c in range(1, 7):
|
|
if from_c == to_c:
|
|
continue
|
|
n = int(((cats_95 == from_c) & (cats_945 == to_c)).sum())
|
|
if n > 0:
|
|
key = f'{labels[from_c]}->{labels[to_c]}'
|
|
transitions[key] = n
|
|
|
|
# Dual rule capture on whole Firm A (not just heldout)
|
|
# under 0.95 AND dh<=8 vs 0.945 AND dh<=8
|
|
fa_cos = cos[firm_a_mask]
|
|
fa_dh = dh[firm_a_mask]
|
|
dual_95_8 = int(((fa_cos > 0.95) & (fa_dh >= 0) & (fa_dh <= 8)).sum())
|
|
dual_945_8 = int(((fa_cos > 0.945) & (fa_dh >= 0) & (fa_dh <= 8)).sum())
|
|
n_fa = int(firm_a_mask.sum())
|
|
print(f'\nDual rule on whole Firm A (n={n_fa:,}):')
|
|
print(f' cos>0.950 AND dh<=8: {dual_95_8:,} ({dual_95_8/n_fa*100:.2f}%)')
|
|
print(f' cos>0.945 AND dh<=8: {dual_945_8:,} ({dual_945_8/n_fa*100:.2f}%)')
|
|
|
|
# --- Save ---
|
|
summary = {
|
|
'generated_at': datetime.now().isoformat(),
|
|
'firm_a_name_redacted': 'Firm A (real name redacted)',
|
|
'seed': SEED,
|
|
'n_signatures': len(rows),
|
|
'n_firm_a': int(firm_a_mask.sum()),
|
|
'split': {
|
|
'calib_cpas': len(calib_accts),
|
|
'heldout_cpas': len(heldout_accts),
|
|
'calib_sigs': int(calib_mask.sum()),
|
|
'heldout_sigs': int(heldout_mask.sum()),
|
|
},
|
|
'calibration_fold': calib,
|
|
'heldout_fold': held,
|
|
'whole_firm_a': whole,
|
|
'generalization_tests': tests,
|
|
'classifier_sensitivity': sens,
|
|
'classifier_transitions_95_to_945': transitions,
|
|
'dual_rule_whole_firm_a': {
|
|
'cos_gt_0.95_AND_dh_le_8': {
|
|
'k': dual_95_8, 'n': n_fa,
|
|
'rate': dual_95_8 / n_fa,
|
|
'wilson95': list(wilson_ci(dual_95_8, n_fa)),
|
|
},
|
|
'cos_gt_0.945_AND_dh_le_8': {
|
|
'k': dual_945_8, 'n': n_fa,
|
|
'rate': dual_945_8 / n_fa,
|
|
'wilson95': list(wilson_ci(dual_945_8, n_fa)),
|
|
},
|
|
},
|
|
}
|
|
|
|
with open(OUT / 'validation_recalibration.json', 'w') as f:
|
|
json.dump(summary, f, indent=2, ensure_ascii=False)
|
|
print(f'\nJSON: {OUT / "validation_recalibration.json"}')
|
|
|
|
# --- Markdown ---
|
|
md = [
|
|
'# Validation Recalibration Report',
|
|
f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
|
|
'',
|
|
'Addresses codex gpt-5.4 v3.3 round-3 review Blockers 1 and 2.',
|
|
'',
|
|
'## 1. Calibration vs Held-out Firm A Generalization Test',
|
|
'',
|
|
f'* Seed {SEED}; 70/30 CPA-level split.',
|
|
f'* Calibration fold: {calib["n_sigs"]:,} signatures '
|
|
f'({len(calib_accts)} CPAs).',
|
|
f'* Held-out fold: {held["n_sigs"]:,} signatures '
|
|
f'({len(heldout_accts)} CPAs).',
|
|
'',
|
|
'**Reference comparison.** The correct generalization test compares '
|
|
'calib-fold vs held-out-fold rates, not whole-sample vs held-out-fold. '
|
|
'The whole-sample rate is a weighted average of the two folds and '
|
|
'therefore cannot lie inside the held-out CI when the folds differ in '
|
|
'rate.',
|
|
'',
|
|
'| Rule | Calib rate (CI) | Held-out rate (CI) | z | p | Held within calib CI? |',
|
|
'|------|-----------------|---------------------|---|---|------------------------|',
|
|
]
|
|
for rule, group in all_rules:
|
|
c = calib[group][rule]
|
|
h = held[group][rule]
|
|
t = tests[rule]
|
|
md.append(
|
|
f'| `{rule}` | {fmt_pct(c["rate"])} '
|
|
f'[{fmt_pct(c["wilson95"][0])}, {fmt_pct(c["wilson95"][1])}] '
|
|
f'| {fmt_pct(h["rate"])} '
|
|
f'[{fmt_pct(h["wilson95"][0])}, {fmt_pct(h["wilson95"][1])}] '
|
|
f'| {t["z"]:+.3f} | {t["p"]:.4f} | '
|
|
f'{"yes" if t["held_within_calib_ci"] else "no"} |'
|
|
)
|
|
md += [
|
|
'',
|
|
'## 2. Classifier Sensitivity: cos > 0.95 vs cos > 0.945',
|
|
'',
|
|
f'All-sample five-way classifier output (N = {total:,} signatures).',
|
|
'The 0.945 cutoff is the accountant-level 2D GMM marginal crossing; ',
|
|
'the 0.95 cutoff is the whole-sample Firm A P95 heuristic.',
|
|
'',
|
|
'| Category | cos>0.95 count (%) | cos>0.945 count (%) | Δ |',
|
|
'|----------|---------------------|-----------------------|---|',
|
|
]
|
|
for c, name in labels.items():
|
|
a = sens['0.95'][name]
|
|
b = sens['0.945'][name]
|
|
md.append(
|
|
f'| {name} | {a["n"]:,} ({a["pct"]:.2f}%) '
|
|
f'| {b["n"]:,} ({b["pct"]:.2f}%) '
|
|
f'| {sens["diff"][name]:+,} |'
|
|
)
|
|
md += [
|
|
'',
|
|
'### Category transitions (0.95 -> 0.945)',
|
|
'',
|
|
]
|
|
for k, v in sorted(transitions.items(), key=lambda x: -x[1]):
|
|
md.append(f'* `{k}`: {v:,}')
|
|
|
|
md += [
|
|
'',
|
|
'## 3. Dual-Rule Capture on Whole Firm A',
|
|
'',
|
|
f'* cos > 0.950 AND dh_indep <= 8: {dual_95_8:,}/{n_fa:,} '
|
|
f'({dual_95_8/n_fa*100:.2f}%)',
|
|
f'* cos > 0.945 AND dh_indep <= 8: {dual_945_8:,}/{n_fa:,} '
|
|
f'({dual_945_8/n_fa*100:.2f}%)',
|
|
'',
|
|
'## 4. Interpretation',
|
|
'',
|
|
'* The calib-vs-held-out 2-proportion z-test is the correct '
|
|
'generalization check. If `p >= 0.05` the two folds are not '
|
|
'statistically distinguishable at 5% level.',
|
|
'* Where the two folds differ significantly, the paper should say the '
|
|
'held-out fold happens to be slightly more replication-dominated than '
|
|
'the calibration fold (i.e., a sampling-variance effect, not a '
|
|
'generalization failure), and still discloses the rates for both '
|
|
'folds.',
|
|
'* The sensitivity analysis shows how many signatures flip categories '
|
|
'under the accountant-level convergence threshold (0.945) versus the '
|
|
'whole-sample heuristic (0.95). Small shifts support the paper\'s '
|
|
'claim that the operational classifier is robust to the threshold '
|
|
'choice; larger shifts would require either changing the classifier '
|
|
'or reporting results under both cuts.',
|
|
]
|
|
(OUT / 'validation_recalibration.md').write_text('\n'.join(md),
|
|
encoding='utf-8')
|
|
print(f'Report: {OUT / "validation_recalibration.md"}')
|
|
|
|
|
|
if __name__ == '__main__':
|
|
main()
|