pdf_signature_extraction/signature_analysis/15_hartigan_dip_test.py
gbanyan fbfab1fa68 Add three-convergent-method threshold scripts + pixel-identity validation
Implements Partner v3's statistical rigor requirements at the level of
signature vs. accountant analysis units:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`.
  Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population);
  full-sample cosine MULTIMODAL (p<0.001, mix of two regimes);
  accountant-level aggregates MULTIMODAL on both cos and dHash.

- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition
  detection. Firm A and full-sample cosine transitions at 0.985; dHash
  at 2.0.

- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM
  with MoM M-step, plus parallel Gaussian mixture on logit transform
  as White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature
  level confirms 2-component is a forced fit -- supporting the pivot
  to accountant-level mixture.

- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis
  that was done inline and not saved. BIC-best K=3 with components
  matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%,
  Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928,
  11.17, 28%, small firms). 2-component natural thresholds:
  cos=0.9450, dh=8.10.

- Script 19 (Pixel-identity validation): no human annotation needed.
  Uses pixel_identical_to_closest (310 sigs) as gold positive and
  Firm A as anchor positive. Confirms Firm A cosine>0.95 = 92.51%
  (matches prior 2026-04-08 finding of 92.5%), dual rule
  cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.
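The dual rule in the Script 19 entry can be illustrated with a minimal sketch. This is not the code of Script 19, which is not shown here; the function name `capture_rate` and the sample pairs are hypothetical and are not project data. It only shows how a share such as the 89.95% figure would be computed from (cosine, dHash) pairs.

```python
def capture_rate(records, cos_thr=0.95, dh_thr=8):
    """Share of (cosine, dHash) pairs with cosine > cos_thr AND dHash <= dh_thr."""
    if not records:
        raise ValueError("records is empty")
    hits = sum(1 for cos, dh in records if cos > cos_thr and dh <= dh_thr)
    return hits / len(records)


# Hypothetical (cosine, dHash) pairs for illustration only.
sample = [(0.99, 1), (0.97, 9), (0.96, 8), (0.94, 2)]
rate = capture_rate(sample)
print(f"dual-rule capture rate: {rate:.2%}")
```

Note that the cosine bound is strict and the dHash bound is inclusive, matching the rule as written in the commit message.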

Python deps added: diptest, scikit-learn (installed into venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:51:41 +08:00
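
The discretised Z-score named in the Script 16 entry can be sketched as follows. Script 16 itself is not shown on this page, so this is only the commonly cited Burgstahler-Dichev form, in which each interior histogram bin is compared with the mean of its two neighbours; the function name `bd_z_scores` and the bin counts are hypothetical.

```python
import math


def bd_z_scores(counts):
    """Standardised difference between each interior bin and its neighbours' mean.

    Assumes the usual Burgstahler-Dichev variance approximation:
    N*p_i*(1-p_i) + (1/4)*N*(p_prev+p_next)*(1-p_prev-p_next).
    """
    n_total = sum(counts)
    if n_total == 0:
        raise ValueError("histogram is empty")
    z = []
    for i in range(1, len(counts) - 1):
        p_prev, p_i, p_next = (counts[j] / n_total for j in (i - 1, i, i + 1))
        diff = counts[i] - (counts[i - 1] + counts[i + 1]) / 2
        var = (n_total * p_i * (1 - p_i)
               + 0.25 * n_total * (p_prev + p_next) * (1 - p_prev - p_next))
        z.append(diff / math.sqrt(var) if var > 0 else 0.0)
    return z


# Hypothetical bin counts: a smooth ramp and a histogram with one spike.
smooth = bd_z_scores([10, 20, 30, 40, 50])
spike = bd_z_scores([10, 10, 40, 10, 10])
print(smooth, spike)
```

A transition would then be read off as the bin boundary where a significantly positive Z sits next to a significantly negative one; the thresholds reported above (0.985 for cosine, 2.0 for dHash) come from the commit message, not from this sketch.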

228 lines
8.2 KiB
Python
#!/usr/bin/env python3
"""
Script 15: Hartigan Dip Test for Unimodality
=============================================

Runs the proper Hartigan & Hartigan (1985) dip test via the `diptest` package
on the empirical signature-similarity distributions.

Purpose:
    Confirm/refute bimodality assumption underpinning threshold-selection methods.
    Prior finding (2026-04-16): signature-level distribution is unimodal long-tail;
    the story is that bimodality only emerges at the accountant level.

Tests:
    1. Firm A (Deloitte) cosine max-similarity -> expected UNIMODAL
    2. Firm A (Deloitte) independent min dHash -> expected UNIMODAL
    3. Full-sample cosine max-similarity -> test
    4. Full-sample independent min dHash -> test
    5. Accountant-level cosine mean (per-accountant) -> expected BIMODAL / MULTIMODAL
    6. Accountant-level dhash mean (per-accountant) -> expected BIMODAL / MULTIMODAL

Output:
    reports/dip_test/dip_test_report.md
    reports/dip_test/dip_test_results.json
"""

import sqlite3
import json
import numpy as np
import diptest
from pathlib import Path
from datetime import datetime

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/dip_test')
OUT.mkdir(parents=True, exist_ok=True)

# Firm A (Deloitte), matched against accountants.firm
FIRM_A = '勤業眾信聯合'


def run_dip(values, label, n_boot=2000):
    """Run Hartigan dip test and return structured result."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]  # drop NaNs before testing
    if len(arr) < 4:
        return {'label': label, 'n': int(len(arr)), 'error': 'too few observations'}
    # boot_pval=True requests a bootstrap p-value based on n_boot resamples
    dip, pval = diptest.diptest(arr, boot_pval=True, n_boot=n_boot)
    # A non-significant result only fails to reject unimodality (H0);
    # it does not establish it.
    verdict = ('UNIMODAL (fail to reject H0)' if pval > 0.05
               else 'MULTIMODAL (reject H0)')
    return {
        'label': label,
        'n': int(len(arr)),
        'mean': float(np.mean(arr)),
        'std': float(np.std(arr)),
        'min': float(np.min(arr)),
        'max': float(np.max(arr)),
        'dip': float(dip),
        'p_value': float(pval),
        'n_boot': int(n_boot),
        'verdict_alpha_05': verdict,
    }


def fetch_firm_a():
    """Cosine max-similarity and independent min dHash for Firm A signatures."""
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.max_similarity_to_same_accountant,
               s.min_dhash_independent
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE a.firm = ?
          AND s.max_similarity_to_same_accountant IS NOT NULL
    ''', (FIRM_A,))
    rows = cur.fetchall()
    conn.close()
    # Each metric is filtered on its own, so the two arrays may differ in length.
    cos = [r[0] for r in rows if r[0] is not None]
    dh = [r[1] for r in rows if r[1] is not None]
    return np.array(cos), np.array(dh)


def fetch_full_sample():
    """Cosine max-similarity and independent min dHash for all signatures."""
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT max_similarity_to_same_accountant, min_dhash_independent
        FROM signatures
        WHERE max_similarity_to_same_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    cos = np.array([r[0] for r in rows if r[0] is not None])
    dh = np.array([r[1] for r in rows if r[1] is not None])
    return cos, dh


def fetch_accountant_aggregates(min_sigs=10):
    """Per-accountant mean cosine and mean independent dHash."""
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.assigned_accountant,
               AVG(s.max_similarity_to_same_accountant) AS cos_mean,
               AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
               COUNT(*) AS n
        FROM signatures s
        WHERE s.assigned_accountant IS NOT NULL
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.min_dhash_independent IS NOT NULL
        GROUP BY s.assigned_accountant
        HAVING n >= ?
    ''', (min_sigs,))
    rows = cur.fetchall()
    conn.close()
    cos_means = np.array([r[1] for r in rows])
    dh_means = np.array([r[2] for r in rows])
    return cos_means, dh_means, len(rows)


def main():
    print('='*70)
    print('Script 15: Hartigan Dip Test for Unimodality')
    print('='*70)

    results = {}

    # Firm A
    print('\n[1/3] Firm A (Deloitte)...')
    fa_cos, fa_dh = fetch_firm_a()
    print(f' Firm A cosine N={len(fa_cos):,}, dHash N={len(fa_dh):,}')
    results['firm_a_cosine'] = run_dip(fa_cos, 'Firm A cosine max-similarity')
    results['firm_a_dhash'] = run_dip(fa_dh, 'Firm A independent min dHash')

    # Full sample
    print('\n[2/3] Full sample...')
    all_cos, all_dh = fetch_full_sample()
    print(f' Full cosine N={len(all_cos):,}, dHash N={len(all_dh):,}')
    # Dip test on >=10k obs can be slow with 2000 boot; use 500 for full sample
    results['full_cosine'] = run_dip(all_cos, 'Full-sample cosine max-similarity',
                                     n_boot=500)
    results['full_dhash'] = run_dip(all_dh, 'Full-sample independent min dHash',
                                    n_boot=500)

    # Accountant-level aggregates
    print('\n[3/3] Accountant-level aggregates (min 10 sigs)...')
    acct_cos, acct_dh, n_acct = fetch_accountant_aggregates(min_sigs=10)
    print(f' Accountants analyzed: {n_acct}')
    results['accountant_cos_mean'] = run_dip(acct_cos,
                                             'Per-accountant cosine mean')
    results['accountant_dh_mean'] = run_dip(acct_dh,
                                            'Per-accountant dHash mean')

    # Print summary (tests that returned an error are skipped here)
    print('\n' + '='*70)
    print('RESULTS SUMMARY')
    print('='*70)
    print(f"{'Test':<40} {'N':>8} {'dip':>8} {'p':>10} Verdict")
    print('-'*90)
    for key, r in results.items():
        if 'error' in r:
            continue
        print(f"{r['label']:<40} {r['n']:>8,} {r['dip']:>8.4f} "
              f"{r['p_value']:>10.4f} {r['verdict_alpha_05']}")

    # Write JSON
    json_path = OUT / 'dip_test_results.json'
    # ensure_ascii=False can emit non-ASCII text, so write UTF-8 explicitly
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump({
            'generated_at': datetime.now().isoformat(),
            'db': DB,
            'results': results,
        }, f, indent=2, ensure_ascii=False)
    print(f'\nJSON saved: {json_path}')

    # Write Markdown report
    md = [
        '# Hartigan Dip Test Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Method',
        '',
        'Hartigan & Hartigan (1985) dip test via `diptest` Python package.',
        'H0: distribution is unimodal. H1: multimodal (two or more modes).',
        'p-value computed by bootstrap against a uniform null (2000 reps for',
        'Firm A/accountant-level, 500 reps for full-sample due to size).',
        '',
        '## Results',
        '',
        '| Test | N | dip | p-value | Verdict (α=0.05) |',
        '|------|---|-----|---------|------------------|',
    ]
    for r in results.values():
        if 'error' in r:
            md.append(f"| {r['label']} | {r['n']} | — | — | {r['error']} |")
            continue
        md.append(
            f"| {r['label']} | {r['n']:,} | {r['dip']:.4f} | "
            f"{r['p_value']:.4f} | {r['verdict_alpha_05']} |"
        )
    md += [
        '',
        '## Interpretation',
        '',
        '* **Signature level** (Firm A + full sample): the dip test indicates',
        ' whether a single mode explains the max-cosine/min-dHash distribution.',
        ' Prior finding (2026-04-16) suggested unimodal long-tail; this script',
        ' provides the formal test.',
        '',
        '* **Accountant level** (per-accountant mean): if multimodal here but',
        ' unimodal at the signature level, this confirms the interpretation',
        " that signing-behaviour is discrete across accountants (replication",
        ' vs hand-signing), while replication quality itself is a continuous',
        ' spectrum.',
        '',
        '## Downstream implication',
        '',
        'Methods that assume bimodality (KDE antimode, 2-component Beta mixture)',
        'should be applied at the level where dip test rejects H0. If the',
        "signature-level dip test fails to reject, the paper should report this",
        'and shift the mixture analysis to the accountant level (see Script 18).',
    ]
    md_path = OUT / 'dip_test_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'Report saved: {md_path}')


if __name__ == '__main__':
    main()
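
The per-accountant aggregation in `fetch_accountant_aggregates` can be tried without the production database. The sketch below runs an equivalent query against an in-memory SQLite table; the table definition, the helper name `accountant_means`, and the rows are hypothetical, with only the column names taken from the script. It writes `HAVING COUNT(*) >= ?` rather than referencing the `n` alias, which is the more portable spelling.

```python
import sqlite3


def accountant_means(conn, min_sigs):
    """Mean cosine and mean dHash per accountant with at least min_sigs rows."""
    cur = conn.execute('''
        SELECT assigned_accountant,
               AVG(max_similarity_to_same_accountant),
               AVG(CAST(min_dhash_independent AS REAL)),
               COUNT(*)
        FROM signatures
        WHERE assigned_accountant IS NOT NULL
          AND max_similarity_to_same_accountant IS NOT NULL
          AND min_dhash_independent IS NOT NULL
        GROUP BY assigned_accountant
        HAVING COUNT(*) >= ?
    ''', (min_sigs,))
    return cur.fetchall()


# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE signatures (
        assigned_accountant TEXT,
        max_similarity_to_same_accountant REAL,
        min_dhash_independent INTEGER
    )
''')
conn.executemany('INSERT INTO signatures VALUES (?, ?, ?)', [
    ('acct_1', 0.5, 2),
    ('acct_1', 1.0, 4),
    ('acct_2', 0.9, 3),
    ('acct_2', 0.8, None),  # dropped by the IS NOT NULL filter
])
rows = accountant_means(conn, min_sigs=2)
conn.close()
print(rows)
```

Because rows with a missing dHash are filtered before grouping, an accountant can fall below `min_sigs` even when more signatures exist in the table, as `acct_2` does here.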