Add script 36: v4.0 calibration + LOOO validation (UNSTABLE verdict)

Phase 1 foundation script for Paper A v4.0 Big-4 reframe. Sections: A. Big-4 calibration recap (replicates Script 34: K=2 marginal crossings cos=0.9755, dh=3.7549; bootstrap 95% CI tight; dip-test cos p<0.0001, dh p<0.0001). B. Leave-one-firm-out (LOOO) cross-validation: refit K=2 on the other 3 firms, predict the held-out firm's CPAs. C. Cross-fold stability verdict. Result: UNSTABLE. Held-out firm Fold rule Replicated rate Firm A cos>0.9380 AND dh<=8.7902 171/171 = 100% KPMG cos>0.9744 AND dh<=3.9783 0/112 = 0% PwC cos>0.9752 AND dh<=3.7470 0/102 = 0% EY cos>0.9756 AND dh<=3.7409 0/52 = 0% Max |dev_cos| from fold-mean = 0.028 (5.6x over 0.005 stability bar). Methodological implication: The Big-4 K=2 bimodality that Script 34 celebrated (dip p<0.0001) is firm-mass driven, not mechanism driven. K=2 separates Firm A from the other three Big-4, then mis-applies to held-out non-Firm-A firms (everyone falls below the cosine cut). Same conceptual problem as Paper A v3.x's between-firm threshold, just at smaller scope. v4.0 narrative as currently planned does not survive a reviewer who runs LOOO. Forward options under discussion: P1 firm-templatedness reframe, P2 K=3 primary (next: Script 37 = K=3 LOOO), P3 rollback to v3.20.0, P4 reverse-anchor as v4.0 core. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:54:54 +08:00
parent e429e4eed1
commit ccd9f23635
1 changed files with 599 additions and 0 deletions
@@ -0,0 +1,599 @@
+#!/usr/bin/env python3
+"""
+Script 36: Paper A v4.0 Calibration + Leave-One-Firm-Out Validation
+=====================================================================
+Phase 1 foundation script for the v4.0 Big-4 reframe.
+
+Inputs (DB):
+  /Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db
+
+Output:
+  /Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/
+    calibration_and_loo_validation/
+      calibration_loo_results.json
+      calibration_loo_report.md
+      panel_calibration.png
+      panel_loo_<firm>.png
+
+Sections:
+  A. Big-4 calibration recap
+     - Pool Firm A + KPMG + PwC + EY accountant means (n=437 CPAs).
+     - Fit 2D GMM K=2 (primary) and K=3 (secondary).
+     - Bootstrap 500 resamples for marginal crossings (cos and dh).
+     - Derive operational classifier rule:
+         R_v4 := cos > c_cut AND dh <= d_cut
+       where (c_cut, d_cut) = (Big-4 2D-GMM K=2 marginal crossings).
+
+  B. Leave-one-firm-out (LOOO) cross-validation
+     - For each of 4 Big-4 firms F:
+         * Refit K=2 on the other 3 firms only.
+         * Bootstrap 500 resamples for the held-out fit's marginal crossings.
+         * Predict the held-out F CPAs' cluster assignments using the
+           held-out-derived rule.
+         * Compute:
+            - n_F, n_F_classified_replicated (cluster C_high_cos),
+              n_F_classified_handleaning  (cluster C_low_cos)
+            - Wilson 95% CI on the replicated rate for F
+            - Compare derived rule (c_cut, d_cut) across folds: is it stable?
+
+  C. Cross-fold stability table
+     - For each fold, report (c_cut, d_cut), and the replicated rate the
+       held-out firm receives.
+     - Verdict (printed and saved):
+       STABLE        max |c_cut - mean| <= 0.005 AND max |d_cut - mean| <= 0.5
+                     across the 4 folds
+       UNSTABLE      otherwise
+
+Methodology decisions (flag for partner / reviewer feedback):
+  * Held-out unit = firm (not 30% of accountants within firm).
+    Rationale: v4.0 makes a methodology-paper claim that the
+    pipeline reproduces across firms.  Within-firm 70/30 only tests
+    sampling variance within one firm; LOOO tests cross-firm
+    generalization, which is the stronger and more honest claim.
+  * Bootstrap n=500 (vs Script 34's 500 — kept consistent).
+  * GMM hyperparameters (n_init=15, max_iter=500, random_state=42)
+    kept consistent with Scripts 32/34/35.
+"""
+
+import sqlite3
+import json
+import numpy as np
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+from pathlib import Path
+from datetime import datetime
+from scipy import stats
+from scipy.optimize import brentq
+from scipy.stats import norm
+from sklearn.mixture import GaussianMixture
+import diptest
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+           'v4_big4/calibration_and_loo_validation')
+OUT.mkdir(parents=True, exist_ok=True)
+
+MIN_SIGS = 10
+N_BOOT = 500
+SEED = 42
+
+BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
+FIRM_A_LABEL = '勤業眾信聯合'  # Deloitte
+
+
+def load_big4_accountants():
+    """Return list of dicts: {cpa, firm, cos_mean, dh_mean, n_sigs}."""
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    cur.execute('''
+        SELECT s.assigned_accountant,
+               a.firm,
+               AVG(s.max_similarity_to_same_accountant) AS cos_mean,
+               AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
+               COUNT(*) AS n
+        FROM signatures s
+        JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE s.assigned_accountant IS NOT NULL
+          AND s.max_similarity_to_same_accountant IS NOT NULL
+          AND s.min_dhash_independent IS NOT NULL
+          AND a.firm IN (?, ?, ?, ?)
+        GROUP BY s.assigned_accountant
+        HAVING n >= ?
+    ''', BIG4 + (MIN_SIGS,))
+    rows = cur.fetchall()
+    conn.close()
+    return [{'cpa': r[0], 'firm': r[1],
+             'cos_mean': float(r[2]), 'dh_mean': float(r[3]),
+             'n_sigs': int(r[4])} for r in rows]
+
+
+def fit_gmm_2d(X, K, seed=SEED):
+    return GaussianMixture(n_components=K, covariance_type='full',
+                           random_state=seed, n_init=15, max_iter=500).fit(X)
+
+
+def marginal_crossing(gmm, X, dim):
+    """2-comp 2D GMM -> crossing on the specified marginal dim."""
+    means = gmm.means_
+    covs = gmm.covariances_
+    weights = gmm.weights_
+    if gmm.n_components != 2:
+        raise ValueError('marginal_crossing requires K=2')
+    m1, m2 = means[0][dim], means[1][dim]
+    s1 = np.sqrt(covs[0][dim, dim])
+    s2 = np.sqrt(covs[1][dim, dim])
+    w1, w2 = weights[0], weights[1]
+
+    def diff(x):
+        return (w2 * stats.norm.pdf(x, m2, s2)
+                - w1 * stats.norm.pdf(x, m1, s1))
+    xs = np.linspace(float(X[:, dim].min()), float(X[:, dim].max()), 2000)
+    ys = diff(xs)
+    ch = np.where(np.diff(np.sign(ys)) != 0)[0]
+    if not len(ch):
+        return None
+    mid = 0.5 * (m1 + m2)
+    crossings = []
+    for i in ch:
+        try:
+            crossings.append(brentq(diff, xs[i], xs[i + 1]))
+        except ValueError:
+            continue
+    if not crossings:
+        return None
+    return float(min(crossings, key=lambda c: abs(c - mid)))
+
+
+def bootstrap_crossings(X, n_boot=N_BOOT, seed=SEED):
+    rng = np.random.default_rng(seed)
+    n = len(X)
+    cos_cs, dh_cs = [], []
+    for _ in range(n_boot):
+        idx = rng.integers(0, n, size=n)
+        Xb = X[idx]
+        gmm = fit_gmm_2d(Xb, 2)
+        c = marginal_crossing(gmm, Xb, 0)
+        d = marginal_crossing(gmm, Xb, 1)
+        if c is not None:
+            cos_cs.append(c)
+        if d is not None:
+            dh_cs.append(d)
+    cos_cs = np.asarray(cos_cs)
+    dh_cs = np.asarray(dh_cs)
+
+    def summarize(arr):
+        if len(arr) < n_boot * 0.5:
+            return None
+        return {
+            'n_successful': int(len(arr)),
+            'mean': float(np.mean(arr)),
+            'median': float(np.median(arr)),
+            'std': float(np.std(arr, ddof=1)),
+            'ci95': [float(np.quantile(arr, 0.025)),
+                     float(np.quantile(arr, 0.975))],
+            'ci_halfwidth': float(0.5 * (np.quantile(arr, 0.975)
+                                          - np.quantile(arr, 0.025))),
+        }
+    return summarize(cos_cs), summarize(dh_cs)
+
+
+def derive_rule(c_cut, d_cut):
+    """Operational classifier rule: a signature is replicated iff
+    cos > c_cut AND dh <= d_cut."""
+    return {
+        'cos_threshold': float(c_cut) if c_cut is not None else None,
+        'dh_threshold': float(d_cut) if d_cut is not None else None,
+        'rule': (f'replicated iff cos > {c_cut:.4f} AND dh <= {d_cut:.4f}'
+                 if c_cut is not None and d_cut is not None
+                 else 'rule undefined'),
+    }
+
+
+def wilson_ci(k, n, alpha=0.05):
+    if n == 0:
+        return (0.0, 1.0)
+    z = norm.ppf(1 - alpha / 2)
+    phat = k / n
+    denom = 1 + z * z / n
+    center = (phat + z * z / (2 * n)) / denom
+    pm = z * np.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
+    return (max(0.0, center - pm), min(1.0, center + pm))
+
+
+def classify_cpa(cos_mean, dh_mean, c_cut, d_cut):
+    """At the accountant level, a CPA is 'replicated' if their MEAN
+    coordinates satisfy the rule. (Note: this is a CPA-level
+    summarisation; a per-signature classifier would apply the same
+    rule signature-by-signature.)"""
+    if c_cut is None or d_cut is None:
+        return 'undefined'
+    if cos_mean > c_cut and dh_mean <= d_cut:
+        return 'replicated'
+    return 'hand_leaning'
+
+
+def kde_dip(values):
+    arr = np.asarray(values, dtype=float)
+    arr = arr[np.isfinite(arr)]
+    if len(arr) < 8:
+        return None
+    dip, pval = diptest.diptest(arr, boot_pval=True, n_boot=2000)
+    return {'dip': float(dip), 'dip_pvalue': float(pval),
+            'unimodal_alpha05': bool(pval > 0.05),
+            'n': int(len(arr))}
+
+
+def run_calibration(cpas):
+    cos = np.array([c['cos_mean'] for c in cpas])
+    dh = np.array([c['dh_mean'] for c in cpas])
+    X = np.column_stack([cos, dh])
+    print(f'\n[A] Calibration on {len(cpas)} Big-4 CPAs')
+
+    dip_cos = kde_dip(cos)
+    dip_dh = kde_dip(dh)
+    print(f'  dip-test (cos): p={dip_cos["dip_pvalue"]:.4g}')
+    print(f'  dip-test (dh) : p={dip_dh["dip_pvalue"]:.4g}')
+
+    gmm2 = fit_gmm_2d(X, 2)
+    gmm3 = fit_gmm_2d(X, 3)
+    c_cut = marginal_crossing(gmm2, X, 0)
+    d_cut = marginal_crossing(gmm2, X, 1)
+    print(f'  K=2 marginal crossings: cos={c_cut:.4f}, dh={d_cut:.4f}')
+    print(f'  K=2 BIC={gmm2.bic(X):.2f}; K=3 BIC={gmm3.bic(X):.2f}')
+
+    boot_cos, boot_dh = bootstrap_crossings(X)
+    if boot_cos:
+        print(f'  bootstrap (cos): median={boot_cos["median"]:.4f}, '
+              f'95% CI=[{boot_cos["ci95"][0]:.4f}, {boot_cos["ci95"][1]:.4f}]')
+    if boot_dh:
+        print(f'  bootstrap (dh) : median={boot_dh["median"]:.4f}, '
+              f'95% CI=[{boot_dh["ci95"][0]:.4f}, {boot_dh["ci95"][1]:.4f}]')
+
+    rule = derive_rule(c_cut, d_cut)
+    print(f'  Derived rule: {rule["rule"]}')
+
+    return {
+        'n_cpas': len(cpas),
+        'dip_test_cos': dip_cos,
+        'dip_test_dh': dip_dh,
+        'k2_crossings': {'cos': c_cut, 'dh': d_cut},
+        'k2_bic': float(gmm2.bic(X)),
+        'k3_bic': float(gmm3.bic(X)),
+        'k2_components': {
+            'means': gmm2.means_.tolist(),
+            'weights': gmm2.weights_.tolist(),
+        },
+        'bootstrap_cos': boot_cos,
+        'bootstrap_dh': boot_dh,
+        'rule': rule,
+    }
+
+
+def run_loo(cpas):
+    """Leave-one-firm-out cross-validation."""
+    by_firm = {}
+    for c in cpas:
+        by_firm.setdefault(c['firm'], []).append(c)
+
+    fold_results = {}
+    for held_firm in BIG4:
+        train_cpas = [c for c in cpas if c['firm'] != held_firm]
+        held_cpas = by_firm.get(held_firm, [])
+        n_train = len(train_cpas)
+        n_held = len(held_cpas)
+        print(f'\n[B] LOOO fold: held-out = {held_firm} '
+              f'(n_train={n_train}, n_held={n_held})')
+
+        X_train = np.column_stack([
+            [c['cos_mean'] for c in train_cpas],
+            [c['dh_mean'] for c in train_cpas],
+        ])
+        gmm = fit_gmm_2d(X_train, 2)
+        c_cut = marginal_crossing(gmm, X_train, 0)
+        d_cut = marginal_crossing(gmm, X_train, 1)
+        boot_cos, boot_dh = bootstrap_crossings(X_train)
+
+        # Apply derived rule to held-out firm
+        replicated = 0
+        hand_leaning = 0
+        for c in held_cpas:
+            cls = classify_cpa(c['cos_mean'], c['dh_mean'], c_cut, d_cut)
+            if cls == 'replicated':
+                replicated += 1
+            else:
+                hand_leaning += 1
+        rep_rate = replicated / n_held if n_held else 0.0
+        wlo, whi = wilson_ci(replicated, n_held)
+        print(f'  fold rule: cos>{c_cut:.4f} AND dh<={d_cut:.4f}')
+        print(f'  held-out replicated: {replicated}/{n_held} = '
+              f'{rep_rate*100:.2f}% [{wlo*100:.2f}%, {whi*100:.2f}%]')
+
+        fold_results[held_firm] = {
+            'n_train': n_train,
+            'n_held': n_held,
+            'fold_rule': derive_rule(c_cut, d_cut),
+            'fold_crossings': {'cos': c_cut, 'dh': d_cut},
+            'bootstrap_cos': boot_cos,
+            'bootstrap_dh': boot_dh,
+            'held_out_classification': {
+                'n_replicated': replicated,
+                'n_hand_leaning': hand_leaning,
+                'replicated_rate': rep_rate,
+                'wilson95': [float(wlo), float(whi)],
+            },
+        }
+    return fold_results
+
+
+def cross_fold_stability(fold_results, full_calib):
+    cs = [fold_results[f]['fold_crossings']['cos'] for f in BIG4
+          if fold_results[f]['fold_crossings']['cos'] is not None]
+    ds = [fold_results[f]['fold_crossings']['dh'] for f in BIG4
+          if fold_results[f]['fold_crossings']['dh'] is not None]
+    full_c = full_calib['k2_crossings']['cos']
+    full_d = full_calib['k2_crossings']['dh']
+    summary = {
+        'fold_cos_crossings': cs,
+        'fold_dh_crossings': ds,
+        'mean_cos': float(np.mean(cs)) if cs else None,
+        'mean_dh': float(np.mean(ds)) if ds else None,
+        'max_dev_cos_from_mean': (float(max(abs(np.array(cs) - np.mean(cs))))
+                                   if cs else None),
+        'max_dev_dh_from_mean': (float(max(abs(np.array(ds) - np.mean(ds))))
+                                  if ds else None),
+        'max_dev_cos_from_full': (float(max(abs(np.array(cs) - full_c)))
+                                   if cs and full_c else None),
+        'max_dev_dh_from_full': (float(max(abs(np.array(ds) - full_d)))
+                                  if ds and full_d else None),
+    }
+    cos_stable = (summary['max_dev_cos_from_mean'] is not None
+                  and summary['max_dev_cos_from_mean'] <= 0.005)
+    dh_stable = (summary['max_dev_dh_from_mean'] is not None
+                 and summary['max_dev_dh_from_mean'] <= 0.5)
+    summary['verdict'] = ('STABLE' if (cos_stable and dh_stable)
+                          else 'UNSTABLE')
+    return summary
+
+
+def render_panels(cpas, full_calib, fold_results):
+    by_firm = {}
+    for c in cpas:
+        by_firm.setdefault(c['firm'], []).append(c)
+
+    # Calibration panel
+    fig, ax = plt.subplots(figsize=(9, 7))
+    colors = {'勤業眾信聯合': 'crimson', '安侯建業聯合': 'royalblue',
+              '資誠聯合': 'forestgreen', '安永聯合': 'darkorange'}
+    labels = {'勤業眾信聯合': 'Firm A (Deloitte)', '安侯建業聯合': 'KPMG',
+              '資誠聯合': 'PwC', '安永聯合': 'EY'}
+    for firm in BIG4:
+        pts = by_firm[firm]
+        ax.scatter([p['cos_mean'] for p in pts], [p['dh_mean'] for p in pts],
+                   s=30, alpha=0.6, color=colors[firm],
+                   label=f'{labels[firm]} (n={len(pts)})')
+    c_cut = full_calib['k2_crossings']['cos']
+    d_cut = full_calib['k2_crossings']['dh']
+    ax.axvline(c_cut, color='black', ls='--', lw=1.5,
+               label=f'cos cut = {c_cut:.4f}')
+    ax.axhline(d_cut, color='black', ls=':', lw=1.5,
+               label=f'dh cut = {d_cut:.4f}')
+    ax.set_xlabel('Accountant cos_mean')
+    ax.set_ylabel('Accountant dh_mean')
+    ax.set_title('Big-4 calibration: 437 CPAs + K=2 marginal crossings')
+    ax.legend(fontsize=9)
+    ax.grid(alpha=0.3)
+    fig.tight_layout()
+    fig.savefig(OUT / 'panel_calibration.png', dpi=150)
+    plt.close(fig)
+
+    # LOOO panels
+    for held_firm in BIG4:
+        held = by_firm[held_firm]
+        train_pts = [c for c in cpas if c['firm'] != held_firm]
+        fr = fold_results[held_firm]
+        c_cut_f = fr['fold_crossings']['cos']
+        d_cut_f = fr['fold_crossings']['dh']
+        fig, ax = plt.subplots(figsize=(9, 7))
+        ax.scatter([p['cos_mean'] for p in train_pts],
+                   [p['dh_mean'] for p in train_pts],
+                   s=20, alpha=0.4, color='lightgray',
+                   label=f'Train (other Big-3, n={len(train_pts)})')
+        ax.scatter([p['cos_mean'] for p in held],
+                   [p['dh_mean'] for p in held],
+                   s=40, alpha=0.85, color=colors[held_firm],
+                   edgecolor='white',
+                   label=f'Held-out: {labels[held_firm]} (n={len(held)})')
+        if c_cut_f is not None:
+            ax.axvline(c_cut_f, color='black', ls='--', lw=1.5,
+                       label=f'fold cos cut = {c_cut_f:.4f}')
+        if d_cut_f is not None:
+            ax.axhline(d_cut_f, color='black', ls=':', lw=1.5,
+                       label=f'fold dh cut = {d_cut_f:.4f}')
+        rep = fr['held_out_classification']['n_replicated']
+        nh = fr['n_held']
+        rate = fr['held_out_classification']['replicated_rate']
+        wlo, whi = fr['held_out_classification']['wilson95']
+        ax.set_title(
+            f'LOOO: held-out {labels[held_firm]} ({rep}/{nh} = '
+            f'{rate*100:.1f}% replicated, Wilson 95% '
+            f'[{wlo*100:.1f}%, {whi*100:.1f}%])')
+        ax.set_xlabel('Accountant cos_mean')
+        ax.set_ylabel('Accountant dh_mean')
+        ax.legend(fontsize=9)
+        ax.grid(alpha=0.3)
+        fig.tight_layout()
+        firm_slug = ('FirmA' if held_firm == FIRM_A_LABEL
+                     else {'安侯建業聯合': 'KPMG', '資誠聯合': 'PwC',
+                           '安永聯合': 'EY'}.get(held_firm, held_firm))
+        fig.savefig(OUT / f'panel_loo_{firm_slug}.png', dpi=150)
+        plt.close(fig)
+
+
+def render_md(full_calib, fold_results, stability, sample_sizes):
+    md = [
+        '# Paper A v4.0 Phase 1 — Calibration + LOOO Validation',
+        f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
+        '',
+        '## A. Big-4 Calibration',
+        '',
+        f'- N CPAs: {full_calib["n_cpas"]}',
+        f'- dip-test cos: p = {full_calib["dip_test_cos"]["dip_pvalue"]:.4g} '
+        f'({"unimodal" if full_calib["dip_test_cos"]["unimodal_alpha05"] else "multimodal"})',
+        f'- dip-test dh : p = {full_calib["dip_test_dh"]["dip_pvalue"]:.4g} '
+        f'({"unimodal" if full_calib["dip_test_dh"]["unimodal_alpha05"] else "multimodal"})',
+        f'- 2D GMM K=2 BIC = {full_calib["k2_bic"]:.2f}',
+        f'- 2D GMM K=3 BIC = {full_calib["k3_bic"]:.2f}',
+        '',
+        '### Marginal crossings (point + bootstrap 95% CI, n=500)',
+        '',
+        '| Axis | Point | Bootstrap median | 95% CI | CI half-width |',
+        '|---|---|---|---|---|',
+    ]
+    for axis_label, key in [('cos', 'bootstrap_cos'), ('dh', 'bootstrap_dh')]:
+        b = full_calib[key]
+        point = full_calib['k2_crossings'][axis_label]
+        if b is None:
+            md.append(f'| {axis_label} | {point} | n/a | n/a | n/a |')
+        else:
+            md.append(f'| {axis_label} | {point:.4f} | {b["median"]:.4f} | '
+                      f'[{b["ci95"][0]:.4f}, {b["ci95"][1]:.4f}] | '
+                      f'{b["ci_halfwidth"]:.4f} |')
+    md += ['',
+           f'### Operational classifier rule',
+           '',
+           f'> {full_calib["rule"]["rule"]}',
+           '',
+           '### K=2 components',
+           '',
+           '| Component | mean cos | mean dh | weight |',
+           '|---|---|---|---|']
+    for i, (m, w) in enumerate(zip(full_calib['k2_components']['means'],
+                                   full_calib['k2_components']['weights'])):
+        md.append(f'| C{i+1} | {m[0]:.4f} | {m[1]:.4f} | {w:.3f} |')
+
+    md += ['', '## B. Leave-One-Firm-Out Validation', '',
+           '| Held-out firm | n_train | n_held | Fold cos cut | Fold dh cut | '
+           'Replicated rate | Wilson 95% |',
+           '|---|---|---|---|---|---|---|']
+    label_map = {'勤業眾信聯合': 'Firm A (Deloitte)',
+                 '安侯建業聯合': 'KPMG',
+                 '資誠聯合': 'PwC',
+                 '安永聯合': 'EY'}
+    for f in BIG4:
+        fr = fold_results[f]
+        c = fr['fold_crossings']['cos']
+        d = fr['fold_crossings']['dh']
+        rep = fr['held_out_classification']
+        c_str = f'{c:.4f}' if c is not None else 'n/a'
+        d_str = f'{d:.4f}' if d is not None else 'n/a'
+        md.append(f'| {label_map[f]} | {fr["n_train"]} | {fr["n_held"]} | '
+                  f'{c_str} | {d_str} | {rep["replicated_rate"]*100:.2f}% | '
+                  f'[{rep["wilson95"][0]*100:.2f}%, '
+                  f'{rep["wilson95"][1]*100:.2f}%] |')
+
+    md += ['', '## C. Cross-fold stability', '',
+           f'- Mean fold cos crossing: '
+           f'{stability["mean_cos"]:.4f}' if stability["mean_cos"] is not None
+           else '- Mean fold cos crossing: n/a',
+           f'- Mean fold dh crossing : '
+           f'{stability["mean_dh"]:.4f}' if stability["mean_dh"] is not None
+           else '- Mean fold dh crossing: n/a',
+           f'- Max |dev_cos| across folds: '
+           f'{stability["max_dev_cos_from_mean"]:.4f}'
+           if stability["max_dev_cos_from_mean"] is not None
+           else '- Max |dev_cos|: n/a',
+           f'- Max |dev_dh| across folds : '
+           f'{stability["max_dev_dh_from_mean"]:.4f}'
+           if stability["max_dev_dh_from_mean"] is not None
+           else '- Max |dev_dh|: n/a',
+           f'- Max |dev_cos| vs full-calib: '
+           f'{stability["max_dev_cos_from_full"]:.4f}'
+           if stability["max_dev_cos_from_full"] is not None
+           else '- Max |dev_cos| vs full: n/a',
+           f'- Max |dev_dh| vs full-calib : '
+           f'{stability["max_dev_dh_from_full"]:.4f}'
+           if stability["max_dev_dh_from_full"] is not None
+           else '- Max |dev_dh| vs full: n/a',
+           '',
+           f'**Verdict: {stability["verdict"]}**',
+           '',
+           '### Verdict legend',
+           '- **STABLE**: max |dev_cos| <= 0.005 AND max |dev_dh| <= 0.5 '
+           'across the 4 LOOO folds; the threshold is reproducible across '
+           'firms.',
+           '- **UNSTABLE**: at least one fold deviates beyond the tolerance; '
+           'the threshold is sensitive to which firm is held out, which '
+           'would invite reviewer questions about generalizability.',
+           '',
+           '## Methodology notes',
+           '',
+           '- Held-out unit is the firm (not within-firm 70/30) -- this '
+           'tests the v4.0 methodology-paper claim that the pipeline '
+           'reproduces across firms, not just within a calibration sample.',
+           '- Bootstrap n=500 (consistent with Script 34); '
+           'GMM hyperparameters n_init=15, max_iter=500, random_state=42 '
+           '(consistent with Scripts 32/34/35).',
+           '- CPA-level classification uses the rule applied to the '
+           'accountant\'s mean (cos_mean, dh_mean).  A per-signature '
+           'classifier would apply the same rule signature-by-signature '
+           '(deferred to Script 38 for sensitivity analysis).',
+           '',
+           '## Files',
+           '- `panel_calibration.png` -- 437 Big-4 CPAs + K=2 cuts',
+           '- `panel_loo_<firm>.png` -- LOOO fold panels (4 firms)',
+           '- `calibration_loo_results.json` -- machine-readable full output',
+           ]
+    return '\n'.join(md)
+
+
+def main():
+    print('=' * 72)
+    print('Script 36: v4.0 Calibration + Leave-One-Firm-Out Validation')
+    print('=' * 72)
+    cpas = load_big4_accountants()
+    sample_sizes = {}
+    for c in cpas:
+        sample_sizes.setdefault(c['firm'], 0)
+        sample_sizes[c['firm']] += 1
+    print(f'\nTotal Big-4 CPAs (n_sigs >= {MIN_SIGS}): {len(cpas)}')
+    for f in BIG4:
+        print(f'  {f}: {sample_sizes.get(f, 0)}')
+
+    full_calib = run_calibration(cpas)
+    fold_results = run_loo(cpas)
+    stability = cross_fold_stability(fold_results, full_calib)
+    print(f'\n[C] Cross-fold stability verdict: {stability["verdict"]}')
+    print(f'    Max |dev_cos| from mean = '
+          f'{stability["max_dev_cos_from_mean"]}; '
+          f'from full-calib = {stability["max_dev_cos_from_full"]}')
+    print(f'    Max |dev_dh|  from mean = '
+          f'{stability["max_dev_dh_from_mean"]}; '
+          f'from full-calib = {stability["max_dev_dh_from_full"]}')
+
+    render_panels(cpas, full_calib, fold_results)
+    print(f'\nPanels: {OUT}/panel_calibration.png + 4 LOOO panels')
+
+    payload = {
+        'generated_at': datetime.now().isoformat(),
+        'min_sigs_per_accountant': MIN_SIGS,
+        'n_bootstrap': N_BOOT,
+        'random_seed': SEED,
+        'sample_sizes': sample_sizes,
+        'big4_calibration': full_calib,
+        'loo_folds': fold_results,
+        'cross_fold_stability': stability,
+    }
+    json_path = OUT / 'calibration_loo_results.json'
+    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
+                         encoding='utf-8')
+    print(f'JSON: {json_path}')
+
+    md = render_md(full_calib, fold_results, stability, sample_sizes)
+    md_path = OUT / 'calibration_loo_report.md'
+    md_path.write_text(md, encoding='utf-8')
+    print(f'Report: {md_path}')
+
+
+if __name__ == '__main__':
+    main()