Add three-convergent-method threshold scripts + pixel-identity validation
Implements Partner v3's statistical rigor requirements at the level of signature vs. accountant analysis units: - Script 15 (Hartigan dip test): formal unimodality test via `diptest`. Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population); full-sample cosine MULTIMODAL (p<0.001, mix of two regimes); accountant-level aggregates MULTIMODAL on both cos and dHash. - Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition detection. Firm A and full-sample cosine transitions at 0.985; dHash at 2.0. - Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM with MoM M-step, plus parallel Gaussian mixture on logit transform as White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature level confirms 2-component is a forced fit -- supporting the pivot to accountant-level mixture. - Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis that was done inline and not saved. BIC-best K=3 with components matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%, Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928, 11.17, 28%, small firms). 2-component natural thresholds: cos=0.9450, dh=8.10. - Script 19 (Pixel-identity validation): no human annotation needed. Uses pixel_identical_to_closest (310 sigs) as gold positive and Firm A as anchor positive. Confirms Firm A cosine>0.95 = 92.51% (matches prior 2026-04-08 finding of 92.5%), dual rule cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A. Python deps added: diptest, scikit-learn (installed into venv). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,227 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Script 15: Hartigan Dip Test for Unimodality
|
||||
=============================================
|
||||
Runs the proper Hartigan & Hartigan (1985) dip test via the `diptest` package
|
||||
on the empirical signature-similarity distributions.
|
||||
|
||||
Purpose:
|
||||
Confirm/refute bimodality assumption underpinning threshold-selection methods.
|
||||
Prior finding (2026-04-16): signature-level distribution is unimodal long-tail;
|
||||
the story is that bimodality only emerges at the accountant level.
|
||||
|
||||
Tests:
|
||||
1. Firm A (Deloitte) cosine max-similarity -> expected UNIMODAL
|
||||
2. Firm A (Deloitte) independent min dHash -> expected UNIMODAL
|
||||
3. Full-sample cosine max-similarity -> test
|
||||
4. Full-sample independent min dHash -> test
|
||||
5. Accountant-level cosine mean (per-accountant) -> expected BIMODAL / MULTIMODAL
|
||||
6. Accountant-level dhash mean (per-accountant) -> expected BIMODAL / MULTIMODAL
|
||||
|
||||
Output:
|
||||
reports/dip_test/dip_test_report.md
|
||||
reports/dip_test/dip_test_results.json
|
||||
"""
|
||||
|
||||
import sqlite3
|
||||
import json
|
||||
import numpy as np
|
||||
import diptest
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
|
||||
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
|
||||
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/dip_test')
|
||||
OUT.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
FIRM_A = '勤業眾信聯合'
|
||||
|
||||
|
||||
def run_dip(values, label, n_boot=2000):
|
||||
"""Run Hartigan dip test and return structured result."""
|
||||
arr = np.asarray(values, dtype=float)
|
||||
arr = arr[~np.isnan(arr)]
|
||||
if len(arr) < 4:
|
||||
return {'label': label, 'n': int(len(arr)), 'error': 'too few observations'}
|
||||
|
||||
dip, pval = diptest.diptest(arr, boot_pval=True, n_boot=n_boot)
|
||||
verdict = 'UNIMODAL (accept H0)' if pval > 0.05 else 'MULTIMODAL (reject H0)'
|
||||
return {
|
||||
'label': label,
|
||||
'n': int(len(arr)),
|
||||
'mean': float(np.mean(arr)),
|
||||
'std': float(np.std(arr)),
|
||||
'min': float(np.min(arr)),
|
||||
'max': float(np.max(arr)),
|
||||
'dip': float(dip),
|
||||
'p_value': float(pval),
|
||||
'n_boot': int(n_boot),
|
||||
'verdict_alpha_05': verdict,
|
||||
}
|
||||
|
||||
|
||||
def fetch_firm_a():
|
||||
conn = sqlite3.connect(DB)
|
||||
cur = conn.cursor()
|
||||
cur.execute('''
|
||||
SELECT s.max_similarity_to_same_accountant,
|
||||
s.min_dhash_independent
|
||||
FROM signatures s
|
||||
JOIN accountants a ON s.assigned_accountant = a.name
|
||||
WHERE a.firm = ?
|
||||
AND s.max_similarity_to_same_accountant IS NOT NULL
|
||||
''', (FIRM_A,))
|
||||
rows = cur.fetchall()
|
||||
conn.close()
|
||||
cos = [r[0] for r in rows if r[0] is not None]
|
||||
dh = [r[1] for r in rows if r[1] is not None]
|
||||
return np.array(cos), np.array(dh)
|
||||
|
||||
|
||||
def fetch_full_sample():
|
||||
conn = sqlite3.connect(DB)
|
||||
cur = conn.cursor()
|
||||
cur.execute('''
|
||||
SELECT max_similarity_to_same_accountant, min_dhash_independent
|
||||
FROM signatures
|
||||
WHERE max_similarity_to_same_accountant IS NOT NULL
|
||||
''')
|
||||
rows = cur.fetchall()
|
||||
conn.close()
|
||||
cos = np.array([r[0] for r in rows if r[0] is not None])
|
||||
dh = np.array([r[1] for r in rows if r[1] is not None])
|
||||
return cos, dh
|
||||
|
||||
|
||||
def fetch_accountant_aggregates(min_sigs=10):
|
||||
"""Per-accountant mean cosine and mean independent dHash."""
|
||||
conn = sqlite3.connect(DB)
|
||||
cur = conn.cursor()
|
||||
cur.execute('''
|
||||
SELECT s.assigned_accountant,
|
||||
AVG(s.max_similarity_to_same_accountant) AS cos_mean,
|
||||
AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
|
||||
COUNT(*) AS n
|
||||
FROM signatures s
|
||||
WHERE s.assigned_accountant IS NOT NULL
|
||||
AND s.max_similarity_to_same_accountant IS NOT NULL
|
||||
AND s.min_dhash_independent IS NOT NULL
|
||||
GROUP BY s.assigned_accountant
|
||||
HAVING n >= ?
|
||||
''', (min_sigs,))
|
||||
rows = cur.fetchall()
|
||||
conn.close()
|
||||
cos_means = np.array([r[1] for r in rows])
|
||||
dh_means = np.array([r[2] for r in rows])
|
||||
return cos_means, dh_means, len(rows)
|
||||
|
||||
|
||||
def main():
|
||||
print('='*70)
|
||||
print('Script 15: Hartigan Dip Test for Unimodality')
|
||||
print('='*70)
|
||||
|
||||
results = {}
|
||||
|
||||
# Firm A
|
||||
print('\n[1/3] Firm A (Deloitte)...')
|
||||
fa_cos, fa_dh = fetch_firm_a()
|
||||
print(f' Firm A cosine N={len(fa_cos):,}, dHash N={len(fa_dh):,}')
|
||||
results['firm_a_cosine'] = run_dip(fa_cos, 'Firm A cosine max-similarity')
|
||||
results['firm_a_dhash'] = run_dip(fa_dh, 'Firm A independent min dHash')
|
||||
|
||||
# Full sample
|
||||
print('\n[2/3] Full sample...')
|
||||
all_cos, all_dh = fetch_full_sample()
|
||||
print(f' Full cosine N={len(all_cos):,}, dHash N={len(all_dh):,}')
|
||||
# Dip test on >=10k obs can be slow with 2000 boot; use 500 for full sample
|
||||
results['full_cosine'] = run_dip(all_cos, 'Full-sample cosine max-similarity',
|
||||
n_boot=500)
|
||||
results['full_dhash'] = run_dip(all_dh, 'Full-sample independent min dHash',
|
||||
n_boot=500)
|
||||
|
||||
# Accountant-level aggregates
|
||||
print('\n[3/3] Accountant-level aggregates (min 10 sigs)...')
|
||||
acct_cos, acct_dh, n_acct = fetch_accountant_aggregates(min_sigs=10)
|
||||
print(f' Accountants analyzed: {n_acct}')
|
||||
results['accountant_cos_mean'] = run_dip(acct_cos,
|
||||
'Per-accountant cosine mean')
|
||||
results['accountant_dh_mean'] = run_dip(acct_dh,
|
||||
'Per-accountant dHash mean')
|
||||
|
||||
# Print summary
|
||||
print('\n' + '='*70)
|
||||
print('RESULTS SUMMARY')
|
||||
print('='*70)
|
||||
print(f"{'Test':<40} {'N':>8} {'dip':>8} {'p':>10} Verdict")
|
||||
print('-'*90)
|
||||
for key, r in results.items():
|
||||
if 'error' in r:
|
||||
continue
|
||||
print(f"{r['label']:<40} {r['n']:>8,} {r['dip']:>8.4f} "
|
||||
f"{r['p_value']:>10.4f} {r['verdict_alpha_05']}")
|
||||
|
||||
# Write JSON
|
||||
json_path = OUT / 'dip_test_results.json'
|
||||
with open(json_path, 'w') as f:
|
||||
json.dump({
|
||||
'generated_at': datetime.now().isoformat(),
|
||||
'db': DB,
|
||||
'results': results,
|
||||
}, f, indent=2, ensure_ascii=False)
|
||||
print(f'\nJSON saved: {json_path}')
|
||||
|
||||
# Write Markdown report
|
||||
md = [
|
||||
'# Hartigan Dip Test Report',
|
||||
f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
|
||||
'',
|
||||
'## Method',
|
||||
'',
|
||||
'Hartigan & Hartigan (1985) dip test via `diptest` Python package.',
|
||||
'H0: distribution is unimodal. H1: multimodal (two or more modes).',
|
||||
'p-value computed by bootstrap against a uniform null (2000 reps for',
|
||||
'Firm A/accountant-level, 500 reps for full-sample due to size).',
|
||||
'',
|
||||
'## Results',
|
||||
'',
|
||||
'| Test | N | dip | p-value | Verdict (α=0.05) |',
|
||||
'|------|---|-----|---------|------------------|',
|
||||
]
|
||||
for r in results.values():
|
||||
if 'error' in r:
|
||||
md.append(f"| {r['label']} | {r['n']} | — | — | {r['error']} |")
|
||||
continue
|
||||
md.append(
|
||||
f"| {r['label']} | {r['n']:,} | {r['dip']:.4f} | "
|
||||
f"{r['p_value']:.4f} | {r['verdict_alpha_05']} |"
|
||||
)
|
||||
md += [
|
||||
'',
|
||||
'## Interpretation',
|
||||
'',
|
||||
'* **Signature level** (Firm A + full sample): the dip test indicates',
|
||||
' whether a single mode explains the max-cosine/min-dHash distribution.',
|
||||
' Prior finding (2026-04-16) suggested unimodal long-tail; this script',
|
||||
' provides the formal test.',
|
||||
'',
|
||||
'* **Accountant level** (per-accountant mean): if multimodal here but',
|
||||
' unimodal at the signature level, this confirms the interpretation',
|
||||
" that signing-behaviour is discrete across accountants (replication",
|
||||
' vs hand-signing), while replication quality itself is a continuous',
|
||||
' spectrum.',
|
||||
'',
|
||||
'## Downstream implication',
|
||||
'',
|
||||
'Methods that assume bimodality (KDE antimode, 2-component Beta mixture)',
|
||||
'should be applied at the level where dip test rejects H0. If the',
|
||||
"signature-level dip test fails to reject, the paper should report this",
|
||||
'and shift the mixture analysis to the accountant level (see Script 18).',
|
||||
]
|
||||
md_path = OUT / 'dip_test_report.md'
|
||||
md_path.write_text('\n'.join(md), encoding='utf-8')
|
||||
print(f'Report saved: {md_path}')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
Reference in New Issue
Block a user