pdf_signature_extraction/signature_analysis/15_hartigan_dip_test.py
gbanyan 68689c9f9b Correct Firm A framing: replication-dominated, not pure
Interview evidence from multiple Firm A accountants confirms that MOST
use replication (stamping / firm-level e-signing) but a MINORITY may
still hand-sign. Firm A is therefore a "replication-dominated" population,
not a "pure" one. This framing is consistent with:

- 92.5% of Firm A signatures exceed cosine 0.95 (majority replication)
- The long left tail (~7%) captures the minority hand-signers, not scan
  noise or preprocessing artifacts
- Hartigan dip test: Firm A cosine unimodal long-tail (p=0.17)
- Accountant-level GMM: of 180 Firm A accountants, 139 cluster in C1
  (high-replication) and 32 in C2 (middle band = minority hand-signers)

Updates docstrings and report text in Scripts 15, 16, 18, 19 to match.
Partner v3's "near-universal non-hand-signing" language corrected.

Script 19 regenerated with the updated text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:57:16 +08:00

#!/usr/bin/env python3
"""
Script 15: Hartigan Dip Test for Unimodality
=============================================
Runs the Hartigan & Hartigan (1985) dip test via the `diptest` package
on the empirical signature-similarity distributions.

Purpose:
  Confirm or refute the bimodality assumption underpinning the
  threshold-selection methods.

Prior finding (2026-04-16): the signature-level distribution is unimodal
with a long tail; the working hypothesis is that bimodality emerges only
at the accountant level.

Firm A framing (2026-04-20, corrected):
  Interviews with multiple Firm A accountants confirm that MOST use
  replication (stamping / firm-level e-signing) but do NOT exclude a
  minority of hand-signers. Firm A is therefore a "replication-dominated"
  population, NOT a "pure" one. This framing is consistent with:
    - 92.5% of Firm A signatures exceed cosine 0.95
    - The long left tail (7.5% below 0.95) captures the minority
      hand-signers, not scan noise
    - Script 18: of 180 Firm A accountants, 139 cluster in C1
      (high-replication) and 32 in C2 (middle band = minority hand-signers)

Tests:
  1. Firm A (Deloitte) cosine max-similarity        -> expected UNIMODAL
  2. Firm A (Deloitte) independent min dHash        -> expected UNIMODAL
  3. Full-sample cosine max-similarity              -> test
  4. Full-sample independent min dHash              -> test
  5. Accountant-level cosine mean (per-accountant)  -> expected BIMODAL / MULTIMODAL
  6. Accountant-level dHash mean (per-accountant)   -> expected BIMODAL / MULTIMODAL

Output:
  reports/dip_test/dip_test_report.md
  reports/dip_test/dip_test_results.json
"""

import json
import sqlite3
from datetime import datetime
from pathlib import Path

import diptest
import numpy as np

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/dip_test')
OUT.mkdir(parents=True, exist_ok=True)

# Firm A (Deloitte) as stored in the accountants.firm column.
FIRM_A = '勤業眾信聯合'


def run_dip(values, label, n_boot=2000):
    """Run Hartigan dip test and return structured result."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]  # drop missing values
    if len(arr) < 4:
        return {'label': label, 'n': int(len(arr)), 'error': 'too few observations'}
    # Bootstrap p-value against the uniform null (H0: unimodal).
    dip, pval = diptest.diptest(arr, boot_pval=True, n_boot=n_boot)
    # Strictly, p > 0.05 only fails to reject H0; the label text is kept
    # unchanged so that existing reports stay comparable.
    verdict = 'UNIMODAL (accept H0)' if pval > 0.05 else 'MULTIMODAL (reject H0)'
    return {
        'label': label,
        'n': int(len(arr)),
        'mean': float(np.mean(arr)),
        'std': float(np.std(arr)),
        'min': float(np.min(arr)),
        'max': float(np.max(arr)),
        'dip': float(dip),
        'p_value': float(pval),
        'n_boot': int(n_boot),
        'verdict_alpha_05': verdict,
    }


def fetch_firm_a():
    """Signature-level cosine and independent dHash values for Firm A."""
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.max_similarity_to_same_accountant,
               s.min_dhash_independent
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE a.firm = ?
          AND s.max_similarity_to_same_accountant IS NOT NULL
    ''', (FIRM_A,))
    rows = cur.fetchall()
    conn.close()
    # dHash may be NULL even when cosine is present, so filter each column.
    cos = [r[0] for r in rows if r[0] is not None]
    dh = [r[1] for r in rows if r[1] is not None]
    return np.array(cos), np.array(dh)


def fetch_full_sample():
    """Signature-level cosine and independent dHash values for all firms."""
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT max_similarity_to_same_accountant, min_dhash_independent
        FROM signatures
        WHERE max_similarity_to_same_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    cos = np.array([r[0] for r in rows if r[0] is not None])
    dh = np.array([r[1] for r in rows if r[1] is not None])
    return cos, dh


def fetch_accountant_aggregates(min_sigs=10):
    """Per-accountant mean cosine and mean independent dHash."""
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.assigned_accountant,
               AVG(s.max_similarity_to_same_accountant) AS cos_mean,
               AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
               COUNT(*) AS n
        FROM signatures s
        WHERE s.assigned_accountant IS NOT NULL
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.min_dhash_independent IS NOT NULL
        GROUP BY s.assigned_accountant
        HAVING n >= ?
    ''', (min_sigs,))
    rows = cur.fetchall()
    conn.close()
    cos_means = np.array([r[1] for r in rows])
    dh_means = np.array([r[2] for r in rows])
    return cos_means, dh_means, len(rows)


def main():
    print('='*70)
    print('Script 15: Hartigan Dip Test for Unimodality')
    print('='*70)

    results = {}

    # Firm A
    print('\n[1/3] Firm A (Deloitte)...')
    fa_cos, fa_dh = fetch_firm_a()
    print(f'  Firm A cosine N={len(fa_cos):,}, dHash N={len(fa_dh):,}')
    results['firm_a_cosine'] = run_dip(fa_cos, 'Firm A cosine max-similarity')
    results['firm_a_dhash'] = run_dip(fa_dh, 'Firm A independent min dHash')

    # Full sample
    print('\n[2/3] Full sample...')
    all_cos, all_dh = fetch_full_sample()
    print(f'  Full cosine N={len(all_cos):,}, dHash N={len(all_dh):,}')
    # Dip test on >=10k obs can be slow with 2000 boot; use 500 for full sample
    results['full_cosine'] = run_dip(all_cos, 'Full-sample cosine max-similarity',
                                     n_boot=500)
    results['full_dhash'] = run_dip(all_dh, 'Full-sample independent min dHash',
                                    n_boot=500)

    # Accountant-level aggregates
    print('\n[3/3] Accountant-level aggregates (min 10 sigs)...')
    acct_cos, acct_dh, n_acct = fetch_accountant_aggregates(min_sigs=10)
    print(f'  Accountants analyzed: {n_acct}')
    results['accountant_cos_mean'] = run_dip(acct_cos,
                                             'Per-accountant cosine mean')
    results['accountant_dh_mean'] = run_dip(acct_dh,
                                            'Per-accountant dHash mean')

    # Print summary (results that errored out are skipped here but still
    # appear in the JSON and Markdown outputs)
    print('\n' + '='*70)
    print('RESULTS SUMMARY')
    print('='*70)
    print(f"{'Test':<40} {'N':>8} {'dip':>8} {'p':>10} Verdict")
    print('-'*90)
    for key, r in results.items():
        if 'error' in r:
            continue
        print(f"{r['label']:<40} {r['n']:>8,} {r['dip']:>8.4f} "
              f"{r['p_value']:>10.4f} {r['verdict_alpha_05']}")

    # Write JSON (explicit UTF-8 because ensure_ascii=False may emit
    # non-ASCII characters that the platform default encoding cannot write)
    json_path = OUT / 'dip_test_results.json'
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump({
            'generated_at': datetime.now().isoformat(),
            'db': DB,
            'results': results,
        }, f, indent=2, ensure_ascii=False)
    print(f'\nJSON saved: {json_path}')

    # Write Markdown report
    md = [
        '# Hartigan Dip Test Report',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        '## Method',
        '',
        'Hartigan & Hartigan (1985) dip test via `diptest` Python package.',
        'H0: distribution is unimodal. H1: multimodal (two or more modes).',
        'p-value computed by bootstrap against a uniform null (2000 reps for',
        'Firm A/accountant-level, 500 reps for full-sample due to size).',
        '',
        '## Results',
        '',
        '| Test | N | dip | p-value | Verdict (α=0.05) |',
        '|------|---|-----|---------|------------------|',
    ]
    for r in results.values():
        if 'error' in r:
            # \u2014 is the dash placeholder used for empty cells
            md.append(f"| {r['label']} | {r['n']} | \u2014 | \u2014 | {r['error']} |")
            continue
        md.append(
            f"| {r['label']} | {r['n']:,} | {r['dip']:.4f} | "
            f"{r['p_value']:.4f} | {r['verdict_alpha_05']} |"
        )
    md += [
        '',
        '## Interpretation',
        '',
        '* **Signature level** (Firm A + full sample): the dip test indicates',
        '  whether a single mode explains the max-cosine/min-dHash distribution.',
        '  Prior finding (2026-04-16) suggested unimodal long-tail; this script',
        '  provides the formal test.',
        '',
        '* **Accountant level** (per-accountant mean): if multimodal here but',
        '  unimodal at the signature level, this confirms the interpretation',
        "  that signing-behaviour is discrete across accountants (replication",
        '  vs hand-signing), while replication quality itself is a continuous',
        '  spectrum.',
        '',
        '## Downstream implication',
        '',
        'Methods that assume bimodality (KDE antimode, 2-component Beta mixture)',
        'should be applied at the level where dip test rejects H0. If the',
        "signature-level dip test fails to reject, the paper should report this",
        'and shift the mixture analysis to the accountant level (see Script 18).',
    ]
    md_path = OUT / 'dip_test_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'Report saved: {md_path}')


if __name__ == '__main__':
    main()