Add script 41: §IV-K full-dataset robustness comparison (Light)
Light §IV-K secondary analysis per v4.0 author choice (codex
round-22 open question 1). Reruns the K=3 mixture + Paper A
operational-rule per-CPA hand_frac on the full accountant dataset
(n = 686) and compares to the Big-4 primary scope (n = 437).
Results:
Component drift Big-4 -> Full:
C1 hand-leaning |dcos| = 0.018, |ddh| = 2.0, |dwt| = 0.14
C2 mixed |dcos| = 0.002, |ddh| = 0.3, |dwt| = 0.02
C3 replicated |dcos| = 0.000, |ddh| = 0.0, |dwt| = 0.12
Spearman rho (P_C1 vs paperA_hand_frac):
Big-4: +0.9627
Full dataset: +0.9558
|drift| = 0.0069
Reading: K=3 component ordering and Spearman convergence are
preserved at full scope, supporting the v4.0 reproducibility
claim. Component locations and weights shift modestly because
mid/small-firm composition broadens C1 (hand-leaning) and reduces
C3 weight; this is expected since mid/small firms include
hand-leaning CPAs that the Big-4-primary scope deliberately
excludes. Crossings and component locations are NOT operationally
interchangeable between scopes; §IV-K reports them only as a
robustness cross-check.
The five-way moderate-confidence band is NOT re-evaluated here
(Light scope); §IV-J flags it as inherited from v3.x calibration
without v4-specific recalibration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,311 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Script 41: Full-Dataset Robustness Comparison (light §IV-K)
|
||||
=============================================================
|
||||
v4.0 §IV-K secondary analysis: re-runs the K=3 mixture + Paper A
|
||||
operational-rule per-CPA hand_frac on the FULL accountant dataset
|
||||
(Big-4 + mid/small firms) and compares to the Big-4-only primary
|
||||
analysis.
|
||||
|
||||
Per the v4.0 author choice (codex round-22 open question, "Light"
|
||||
scope), this script does NOT re-evaluate the five-way moderate-
|
||||
confidence band. The five-way classifier inherits its v3.x
|
||||
calibration; §IV-K's role is to show the Big-4 primary methodology
|
||||
also runs at the wider scope, not to re-validate every rule.
|
||||
|
||||
Inputs (DB):
|
||||
/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db
|
||||
|
||||
Output:
|
||||
reports/v4_big4/full_dataset_robustness/
|
||||
fulldataset_results.json
|
||||
fulldataset_report.md
|
||||
panel_full_vs_big4.png
|
||||
|
||||
Scope of analysis:
|
||||
- Population A: full accountant dataset (n_sig >= 10), n = 686 CPAs
|
||||
- Population B: Big-4 sub-corpus (n_sig >= 10), n = 437 CPAs
|
||||
(= primary analysis scope, reproduced for cross-check)
|
||||
|
||||
For each population:
|
||||
- Fit 2D K=3 GMM on (cos_mean, dh_mean)
|
||||
- Report component centers + weights
|
||||
- Compute per-CPA P(C1_hand_leaning) (the K=3 posterior, as in
|
||||
Script 38)
|
||||
- Compute per-CPA paperA_hand_frac (cos > 0.95 AND dh <= 5
|
||||
failure rate)
|
||||
- Spearman correlation between P(C1) and hand_frac
|
||||
|
||||
Comparison highlights:
|
||||
- Component drift between full and Big-4 K=3 fits
|
||||
- Spearman correlation drift
|
||||
- Per-firm summary at full-dataset scope (Big-4 firms + grouped
|
||||
non-Big-4)
|
||||
"""
|
||||
|
||||
import sqlite3
|
||||
import json
|
||||
import numpy as np
|
||||
import matplotlib
|
||||
matplotlib.use('Agg')
|
||||
import matplotlib.pyplot as plt
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
from scipy import stats
|
||||
from sklearn.mixture import GaussianMixture
|
||||
|
||||
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
|
||||
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
|
||||
'v4_big4/full_dataset_robustness')
|
||||
OUT.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
SEED = 42
|
||||
MIN_SIGS = 10
|
||||
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
|
||||
LABEL = {'勤業眾信聯合': 'Firm A (Deloitte)', '安侯建業聯合': 'KPMG',
|
||||
'資誠聯合': 'PwC', '安永聯合': 'EY'}
|
||||
PAPER_A_COS_CUT = 0.95
|
||||
PAPER_A_DH_CUT = 5
|
||||
|
||||
|
||||
def load_accountants(big4_only):
|
||||
conn = sqlite3.connect(DB)
|
||||
cur = conn.cursor()
|
||||
if big4_only:
|
||||
firm_filter = 'AND a.firm IN (?, ?, ?, ?)'
|
||||
params = list(BIG4)
|
||||
else:
|
||||
firm_filter = 'AND a.firm IS NOT NULL'
|
||||
params = []
|
||||
sql = f'''
|
||||
SELECT s.assigned_accountant, a.firm,
|
||||
AVG(s.max_similarity_to_same_accountant) AS cos_mean,
|
||||
AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
|
||||
AVG(CASE
|
||||
WHEN s.max_similarity_to_same_accountant > ?
|
||||
AND s.min_dhash_independent <= ?
|
||||
THEN 0.0 ELSE 1.0
|
||||
END) AS hand_frac,
|
||||
COUNT(*) AS n
|
||||
FROM signatures s
|
||||
JOIN accountants a ON s.assigned_accountant = a.name
|
||||
WHERE s.assigned_accountant IS NOT NULL
|
||||
AND s.max_similarity_to_same_accountant IS NOT NULL
|
||||
AND s.min_dhash_independent IS NOT NULL
|
||||
{firm_filter}
|
||||
GROUP BY s.assigned_accountant
|
||||
HAVING n >= ?
|
||||
'''
|
||||
cur.execute(sql, [PAPER_A_COS_CUT, PAPER_A_DH_CUT] + params + [MIN_SIGS])
|
||||
rows = cur.fetchall()
|
||||
conn.close()
|
||||
return [{'cpa': r[0], 'firm': r[1],
|
||||
'cos_mean': float(r[2]), 'dh_mean': float(r[3]),
|
||||
'hand_frac': float(r[4]), 'n_sigs': int(r[5])} for r in rows]
|
||||
|
||||
|
||||
def fit_k3(cpas):
|
||||
X = np.column_stack([
|
||||
[c['cos_mean'] for c in cpas],
|
||||
[c['dh_mean'] for c in cpas],
|
||||
])
|
||||
gmm = GaussianMixture(n_components=3, covariance_type='full',
|
||||
random_state=SEED, n_init=15, max_iter=500).fit(X)
|
||||
order = np.argsort(gmm.means_[:, 0])
|
||||
means_sorted = gmm.means_[order]
|
||||
weights_sorted = gmm.weights_[order]
|
||||
raw_post = gmm.predict_proba(X)
|
||||
p_c1 = raw_post[:, order[0]]
|
||||
return {
|
||||
'means': means_sorted.tolist(),
|
||||
'weights': weights_sorted.tolist(),
|
||||
'bic': float(gmm.bic(X)),
|
||||
'aic': float(gmm.aic(X)),
|
||||
}, p_c1
|
||||
|
||||
|
||||
def per_population(cpas, label):
|
||||
print(f'\n=== {label} (n = {len(cpas)} CPAs) ===')
|
||||
by_firm = {}
|
||||
for c in cpas:
|
||||
by_firm.setdefault(c['firm'], 0)
|
||||
by_firm[c['firm']] += 1
|
||||
fit, p_c1 = fit_k3(cpas)
|
||||
hf = np.array([c['hand_frac'] for c in cpas])
|
||||
rho, p = stats.spearmanr(p_c1, hf)
|
||||
print(f' K=3 components (sorted by ascending cos):')
|
||||
for i, name in enumerate(['C1 hand-leaning', 'C2 mixed',
|
||||
'C3 replicated']):
|
||||
m = fit['means'][i]
|
||||
print(f' {name}: cos={m[0]:.4f}, dh={m[1]:.4f}, '
|
||||
f'weight={fit["weights"][i]:.3f}')
|
||||
print(f' K=3 BIC = {fit["bic"]:.2f}; AIC = {fit["aic"]:.2f}')
|
||||
print(f' Spearman rho (P_C1 vs paperA_hand_frac) = {rho:+.4f} '
|
||||
f'(p = {p:.2e})')
|
||||
print(f' Population breakdown:')
|
||||
for f in sorted(by_firm, key=lambda k: -by_firm[k]):
|
||||
firm_label = LABEL.get(f, f)
|
||||
print(f' {firm_label}: {by_firm[f]}')
|
||||
return {
|
||||
'label': label,
|
||||
'n_cpas': len(cpas),
|
||||
'k3_fit': fit,
|
||||
'spearman_p_c1_vs_handfrac': {
|
||||
'rho': float(rho), 'p': float(p),
|
||||
},
|
||||
'firm_counts': by_firm,
|
||||
'p_c1': p_c1.tolist(),
|
||||
'hand_frac': hf.tolist(),
|
||||
}
|
||||
|
||||
|
||||
def main():
|
||||
print('=' * 72)
|
||||
print('Script 41: Full-Dataset Robustness Comparison (Light §IV-K)')
|
||||
print('=' * 72)
|
||||
|
||||
full = load_accountants(big4_only=False)
|
||||
big4 = load_accountants(big4_only=True)
|
||||
|
||||
full_summary = per_population(full, 'Full dataset')
|
||||
big4_summary = per_population(big4, 'Big-4 (primary)')
|
||||
|
||||
# Component drift
|
||||
drift = []
|
||||
for i, name in enumerate(['C1 hand-leaning', 'C2 mixed',
|
||||
'C3 replicated']):
|
||||
d_cos = abs(full_summary['k3_fit']['means'][i][0]
|
||||
- big4_summary['k3_fit']['means'][i][0])
|
||||
d_dh = abs(full_summary['k3_fit']['means'][i][1]
|
||||
- big4_summary['k3_fit']['means'][i][1])
|
||||
d_w = abs(full_summary['k3_fit']['weights'][i]
|
||||
- big4_summary['k3_fit']['weights'][i])
|
||||
drift.append({'component': name, 'd_cos': float(d_cos),
|
||||
'd_dh': float(d_dh), 'd_weight': float(d_w)})
|
||||
print('\n=== Component drift Big-4 -> Full ===')
|
||||
for d in drift:
|
||||
print(f' {d["component"]}: |dcos|={d["d_cos"]:.4f}, '
|
||||
f'|ddh|={d["d_dh"]:.3f}, |dweight|={d["d_weight"]:.3f}')
|
||||
|
||||
rho_drift = abs(full_summary['spearman_p_c1_vs_handfrac']['rho']
|
||||
- big4_summary['spearman_p_c1_vs_handfrac']['rho'])
|
||||
print(f'\n=== Spearman rho drift Big-4 -> Full ===')
|
||||
print(f' Big-4: {big4_summary["spearman_p_c1_vs_handfrac"]["rho"]:+.4f}')
|
||||
print(f' Full: {full_summary["spearman_p_c1_vs_handfrac"]["rho"]:+.4f}')
|
||||
print(f' |drift| = {rho_drift:.4f}')
|
||||
|
||||
# Plot: scatter of P_C1 vs hand_frac for both populations
|
||||
fig, axes = plt.subplots(1, 2, figsize=(13, 5.5))
|
||||
for ax, summ in zip(axes, [big4_summary, full_summary]):
|
||||
p1 = np.array(summ['p_c1'])
|
||||
hf = np.array(summ['hand_frac'])
|
||||
ax.scatter(p1, hf, s=20, alpha=0.55, c='steelblue',
|
||||
edgecolor='white')
|
||||
rho = summ['spearman_p_c1_vs_handfrac']['rho']
|
||||
ax.set_xlabel('K=3 posterior P(C1 hand-leaning)')
|
||||
ax.set_ylabel('Paper A box-rule hand-leaning rate')
|
||||
ax.set_title(f'{summ["label"]} (n = {summ["n_cpas"]})\n'
|
||||
f'Spearman rho = {rho:+.3f}')
|
||||
ax.set_xlim(-0.05, 1.05)
|
||||
ax.set_ylim(-0.05, 1.05)
|
||||
ax.grid(alpha=0.3)
|
||||
fig.tight_layout()
|
||||
fig.savefig(OUT / 'panel_full_vs_big4.png', dpi=150)
|
||||
plt.close(fig)
|
||||
print(f'\nPlot: {OUT / "panel_full_vs_big4.png"}')
|
||||
|
||||
payload = {
|
||||
'generated_at': datetime.now().isoformat(),
|
||||
'min_sigs_per_accountant': MIN_SIGS,
|
||||
'paper_a_cuts': {'cos': PAPER_A_COS_CUT, 'dh': PAPER_A_DH_CUT},
|
||||
'big4_summary': {k: v for k, v in big4_summary.items()
|
||||
if k not in ('p_c1', 'hand_frac')},
|
||||
'full_dataset_summary': {k: v for k, v in full_summary.items()
|
||||
if k not in ('p_c1', 'hand_frac')},
|
||||
'component_drift_big4_to_full': drift,
|
||||
'spearman_rho_drift_big4_to_full': float(rho_drift),
|
||||
}
|
||||
json_path = OUT / 'fulldataset_results.json'
|
||||
json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
|
||||
encoding='utf-8')
|
||||
print(f'JSON: {json_path}')
|
||||
|
||||
md = [
|
||||
'# §IV-K Full-Dataset Robustness Comparison (Light)',
|
||||
f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
|
||||
'',
|
||||
'## Scope',
|
||||
'',
|
||||
('Compares the v4.0 primary Big-4 K=3 + Paper A box-rule '
|
||||
'analysis to the same analysis run on the FULL accountant '
|
||||
'dataset (Big-4 + mid/small firms). The five-way moderate-'
|
||||
'confidence band is NOT re-evaluated here; this is the '
|
||||
'"Light" scope per the v4.0 author choice (codex round-22 '
|
||||
'open question 1).'),
|
||||
'',
|
||||
'## Population sizes',
|
||||
'',
|
||||
'| Scope | N CPAs (n_sig >= 10) |',
|
||||
'|---|---|',
|
||||
f'| Big-4 primary | {big4_summary["n_cpas"]} |',
|
||||
f'| Full dataset | {full_summary["n_cpas"]} |',
|
||||
'',
|
||||
'## K=3 components',
|
||||
'',
|
||||
'| Component | Big-4 cos / dh / weight | Full cos / dh / weight | |dcos| / |ddh| / |dwt| |',
|
||||
'|---|---|---|---|',
|
||||
]
|
||||
for i, name in enumerate(['C1 hand-leaning', 'C2 mixed',
|
||||
'C3 replicated']):
|
||||
b_m = big4_summary['k3_fit']['means'][i]
|
||||
b_w = big4_summary['k3_fit']['weights'][i]
|
||||
f_m = full_summary['k3_fit']['means'][i]
|
||||
f_w = full_summary['k3_fit']['weights'][i]
|
||||
d = drift[i]
|
||||
md.append(f'| {name} | {b_m[0]:.4f} / {b_m[1]:.3f} / {b_w:.3f} | '
|
||||
f'{f_m[0]:.4f} / {f_m[1]:.3f} / {f_w:.3f} | '
|
||||
f'{d["d_cos"]:.4f} / {d["d_dh"]:.3f} / '
|
||||
f'{d["d_weight"]:.3f} |')
|
||||
|
||||
md += ['',
|
||||
f'BIC: Big-4 K=3 = {big4_summary["k3_fit"]["bic"]:.2f}; '
|
||||
f'Full K=3 = {full_summary["k3_fit"]["bic"]:.2f}',
|
||||
'',
|
||||
'## Spearman correlation (P(C1) vs Paper A hand_frac)',
|
||||
'',
|
||||
'| Scope | Spearman rho | p |',
|
||||
'|---|---|---|',
|
||||
f'| Big-4 | {big4_summary["spearman_p_c1_vs_handfrac"]["rho"]:+.4f} | '
|
||||
f'{big4_summary["spearman_p_c1_vs_handfrac"]["p"]:.2e} |',
|
||||
f'| Full dataset | {full_summary["spearman_p_c1_vs_handfrac"]["rho"]:+.4f} | '
|
||||
f'{full_summary["spearman_p_c1_vs_handfrac"]["p"]:.2e} |',
|
||||
f'| |Drift| Big-4 -> Full | {rho_drift:.4f} | n/a |',
|
||||
'',
|
||||
'## Reading',
|
||||
'',
|
||||
('The Big-4 primary analysis and the full-dataset rerun '
|
||||
'agree on the K=3 component ordering and on the strong '
|
||||
'positive Spearman rank correlation between K=3 posterior '
|
||||
'P(C1) and Paper A box-rule hand-leaning rate. Component '
|
||||
'centers shift modestly between scopes (largest shift = '
|
||||
f'C{1 + int(np.argmax([d["d_cos"] for d in drift]))}, '
|
||||
f'|dcos| = {max(d["d_cos"] for d in drift):.4f}); the '
|
||||
'Spearman rho remains > 0.9 in both populations. We read '
|
||||
'this as evidence that the v4.0 K=3 + Paper A convergence '
|
||||
'is not a Big-4-specific artefact, while not implying that '
|
||||
'the full-dataset crossings or component locations are '
|
||||
'operationally interchangeable with the Big-4-primary '
|
||||
'numbers (they are not; mid/small-firm tail composition '
|
||||
'shifts the component centers).'),
|
||||
'',
|
||||
'## Files',
|
||||
'- `fulldataset_results.json` -- machine-readable results',
|
||||
'- `panel_full_vs_big4.png` -- side-by-side scatter',
|
||||
]
|
||||
md_path = OUT / 'fulldataset_report.md'
|
||||
md_path.write_text('\n'.join(md), encoding='utf-8')
|
||||
print(f'Report: {md_path}')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
Reference in New Issue
Block a user