Add three-convergent-method threshold scripts + pixel-identity validation

Implements Partner v3's statistical rigor requirements at the level of
signature vs. accountant analysis units:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`.
  Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population);
  full-sample cosine MULTIMODAL (p<0.001, mix of two regimes);
  accountant-level aggregates MULTIMODAL on both cos and dHash.
- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition
  detection. Firm A and full-sample cosine transitions at 0.985; dHash at 2.0.
- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM with
  MoM M-step, plus parallel Gaussian mixture on logit transform as
  White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature level
  confirms 2-component is a forced fit -- supporting the pivot to
  accountant-level mixture.
- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis that was
  done inline and not saved. BIC-best K=3 with components matching prior
  memory almost exactly:
    C1 (cos=0.983, dh=2.41, 20%, Deloitte 139/141),
    C2 (0.954, 6.99, 51%, KPMG/PwC/EY),
    C3 (0.928, 11.17, 28%, small firms).
  2-component natural thresholds: cos=0.9450, dh=8.10.
- Script 19 (Pixel-identity validation): no human annotation needed. Uses
  pixel_identical_to_closest (310 sigs) as gold positive and Firm A as
  anchor positive. Confirms Firm A cosine>0.95 = 92.51% (matches prior
  2026-04-08 finding of 92.5%), dual rule cos>0.95 AND dhash_indep<=8
  captures 89.95% of Firm A.

Python deps added: diptest, scikit-learn (installed into venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:51:41 +08:00
parent 158f63efb2
commit fbfab1fa68
5 changed files with 1760 additions and 0 deletions
@@ -0,0 +1,227 @@
+#!/usr/bin/env python3
+"""
+Script 15: Hartigan Dip Test for Unimodality
+=============================================
+Runs the proper Hartigan & Hartigan (1985) dip test via the `diptest` package
+on the empirical signature-similarity distributions.
+
+Purpose:
+  Confirm or refute the bimodality assumption underpinning threshold selection.
+  Prior finding (2026-04-16): signature-level distribution is unimodal long-tail;
+  the working hypothesis is that bimodality emerges only at the accountant level.
+
+Tests:
+  1. Firm A (Deloitte) cosine max-similarity       -> expected UNIMODAL
+  2. Firm A (Deloitte) independent min dHash       -> expected UNIMODAL
+  3. Full-sample cosine max-similarity             -> test
+  4. Full-sample independent min dHash             -> test
+  5. Accountant-level cosine mean (per-accountant) -> expected BIMODAL / MULTIMODAL
+  6. Accountant-level dhash mean (per-accountant)  -> expected BIMODAL / MULTIMODAL
+
+Output:
+  reports/dip_test/dip_test_report.md
+  reports/dip_test/dip_test_results.json
+"""
+
+import sqlite3
+import json
+import numpy as np
+import diptest
+from pathlib import Path
+from datetime import datetime
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/dip_test')
+OUT.mkdir(parents=True, exist_ok=True)
+
+FIRM_A = '勤業眾信聯合'
+
+
+def run_dip(values, label, n_boot=2000):
+    """Run Hartigan dip test and return structured result."""
+    arr = np.asarray(values, dtype=float)
+    arr = arr[~np.isnan(arr)]
+    if len(arr) < 4:
+        return {'label': label, 'n': int(len(arr)), 'error': 'too few observations'}
+
+    dip, pval = diptest.diptest(arr, boot_pval=True, n_boot=n_boot)
+    verdict = 'UNIMODAL (fail to reject H0)' if pval > 0.05 else 'MULTIMODAL (reject H0)'
+    return {
+        'label': label,
+        'n': int(len(arr)),
+        'mean': float(np.mean(arr)),
+        'std': float(np.std(arr)),
+        'min': float(np.min(arr)),
+        'max': float(np.max(arr)),
+        'dip': float(dip),
+        'p_value': float(pval),
+        'n_boot': int(n_boot),
+        'verdict_alpha_05': verdict,
+    }
+
+
+def fetch_firm_a():
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    cur.execute('''
+        SELECT s.max_similarity_to_same_accountant,
+               s.min_dhash_independent
+        FROM signatures s
+        JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE a.firm = ?
+          AND s.max_similarity_to_same_accountant IS NOT NULL
+    ''', (FIRM_A,))
+    rows = cur.fetchall()
+    conn.close()
+    cos = [r[0] for r in rows if r[0] is not None]
+    dh = [r[1] for r in rows if r[1] is not None]
+    return np.array(cos), np.array(dh)
+
+
+def fetch_full_sample():
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    cur.execute('''
+        SELECT max_similarity_to_same_accountant, min_dhash_independent
+        FROM signatures
+        WHERE max_similarity_to_same_accountant IS NOT NULL
+    ''')
+    rows = cur.fetchall()
+    conn.close()
+    cos = np.array([r[0] for r in rows if r[0] is not None])
+    dh = np.array([r[1] for r in rows if r[1] is not None])
+    return cos, dh
+
+
+def fetch_accountant_aggregates(min_sigs=10):
+    """Per-accountant mean cosine and mean independent dHash."""
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    cur.execute('''
+        SELECT s.assigned_accountant,
+               AVG(s.max_similarity_to_same_accountant) AS cos_mean,
+               AVG(CAST(s.min_dhash_independent AS REAL)) AS dh_mean,
+               COUNT(*) AS n
+        FROM signatures s
+        WHERE s.assigned_accountant IS NOT NULL
+          AND s.max_similarity_to_same_accountant IS NOT NULL
+          AND s.min_dhash_independent IS NOT NULL
+        GROUP BY s.assigned_accountant
+        HAVING n >= ?
+    ''', (min_sigs,))
+    rows = cur.fetchall()
+    conn.close()
+    cos_means = np.array([r[1] for r in rows])
+    dh_means = np.array([r[2] for r in rows])
+    return cos_means, dh_means, len(rows)
+
+
+def main():
+    print('='*70)
+    print('Script 15: Hartigan Dip Test for Unimodality')
+    print('='*70)
+
+    results = {}
+
+    # Firm A
+    print('\n[1/3] Firm A (Deloitte)...')
+    fa_cos, fa_dh = fetch_firm_a()
+    print(f'  Firm A cosine N={len(fa_cos):,}, dHash N={len(fa_dh):,}')
+    results['firm_a_cosine'] = run_dip(fa_cos, 'Firm A cosine max-similarity')
+    results['firm_a_dhash'] = run_dip(fa_dh, 'Firm A independent min dHash')
+
+    # Full sample
+    print('\n[2/3] Full sample...')
+    all_cos, all_dh = fetch_full_sample()
+    print(f'  Full cosine N={len(all_cos):,}, dHash N={len(all_dh):,}')
+    # Dip test on >=10k obs can be slow with 2000 boot; use 500 for full sample
+    results['full_cosine'] = run_dip(all_cos, 'Full-sample cosine max-similarity',
+                                     n_boot=500)
+    results['full_dhash'] = run_dip(all_dh, 'Full-sample independent min dHash',
+                                    n_boot=500)
+
+    # Accountant-level aggregates
+    print('\n[3/3] Accountant-level aggregates (min 10 sigs)...')
+    acct_cos, acct_dh, n_acct = fetch_accountant_aggregates(min_sigs=10)
+    print(f'  Accountants analyzed: {n_acct}')
+    results['accountant_cos_mean'] = run_dip(acct_cos,
+                                             'Per-accountant cosine mean')
+    results['accountant_dh_mean'] = run_dip(acct_dh,
+                                            'Per-accountant dHash mean')
+
+    # Print summary
+    print('\n' + '='*70)
+    print('RESULTS SUMMARY')
+    print('='*70)
+    print(f"{'Test':<40} {'N':>8} {'dip':>8} {'p':>10} Verdict")
+    print('-'*90)
+    for key, r in results.items():
+        if 'error' in r:
+            continue
+        print(f"{r['label']:<40} {r['n']:>8,} {r['dip']:>8.4f} "
+              f"{r['p_value']:>10.4f} {r['verdict_alpha_05']}")
+
+    # Write JSON
+    json_path = OUT / 'dip_test_results.json'
+    with open(json_path, 'w') as f:
+        json.dump({
+            'generated_at': datetime.now().isoformat(),
+            'db': DB,
+            'results': results,
+        }, f, indent=2, ensure_ascii=False)
+    print(f'\nJSON saved: {json_path}')
+
+    # Write Markdown report
+    md = [
+        '# Hartigan Dip Test Report',
+        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
+        '',
+        '## Method',
+        '',
+        'Hartigan & Hartigan (1985) dip test via `diptest` Python package.',
+        'H0: distribution is unimodal. H1: multimodal (two or more modes).',
+        'p-value computed by bootstrap against a uniform null (2000 reps for',
+        'Firm A/accountant-level, 500 reps for full-sample due to size).',
+        '',
+        '## Results',
+        '',
+        '| Test | N | dip | p-value | Verdict (α=0.05) |',
+        '|------|---|-----|---------|------------------|',
+    ]
+    for r in results.values():
+        if 'error' in r:
+            md.append(f"| {r['label']} | {r['n']} | — | — | {r['error']} |")
+            continue
+        md.append(
+            f"| {r['label']} | {r['n']:,} | {r['dip']:.4f} | "
+            f"{r['p_value']:.4f} | {r['verdict_alpha_05']} |"
+        )
+    md += [
+        '',
+        '## Interpretation',
+        '',
+        '* **Signature level** (Firm A + full sample): the dip test indicates',
+        '  whether a single mode explains the max-cosine/min-dHash distribution.',
+        '  Prior finding (2026-04-16) suggested unimodal long-tail; this script',
+        '  provides the formal test.',
+        '',
+        '* **Accountant level** (per-accountant mean): if multimodal here but',
+        '  unimodal at the signature level, this supports the interpretation',
+        '  that signing behaviour is discrete across accountants (replication',
+        '  vs. hand-signing), while replication quality itself is a continuous',
+        '  spectrum.',
+        '',
+        '## Downstream implication',
+        '',
+        'Methods that assume bimodality (KDE antimode, 2-component Beta mixture)',
+        'should be applied at the level where the dip test rejects H0. If the',
+        "signature-level dip test fails to reject, the paper should report this",
+        'and shift the mixture analysis to the accountant level (see Script 18).',
+    ]
+    md_path = OUT / 'dip_test_report.md'
+    md_path.write_text('\n'.join(md), encoding='utf-8')
+    print(f'Report saved: {md_path}')
+
+
+if __name__ == '__main__':
+    main()
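
The per-accountant aggregation in `fetch_accountant_aggregates` can be checked without the production database. The following is a minimal sketch, not part of this commit: it copies the query from the script and runs it against an in-memory SQLite table. The helper names `build_toy_db` and `accountant_means`, the accountant labels `acct_x` and `acct_y`, and all sample values are hypothetical; only the table and column names come from the script above.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory table holding only the columns the query reads."""
    conn = sqlite3.connect(':memory:')
    conn.execute('''
        CREATE TABLE signatures (
            assigned_accountant TEXT,
            max_similarity_to_same_accountant REAL,
            min_dhash_independent INTEGER
        )
    ''')
    rows = (
        [('acct_x', 0.98, 2)] * 3      # three complete rows
        + [('acct_y', 0.90, 12)] * 2   # two rows: below min_sigs=3
        + [('acct_x', None, 4)]        # NULL cosine: dropped by WHERE
    )
    conn.executemany('INSERT INTO signatures VALUES (?, ?, ?)', rows)
    return conn


def accountant_means(conn, min_sigs=10):
    """Same aggregation as the script: per-accountant means with a size floor."""
    cur = conn.execute('''
        SELECT assigned_accountant,
               AVG(max_similarity_to_same_accountant) AS cos_mean,
               AVG(CAST(min_dhash_independent AS REAL)) AS dh_mean,
               COUNT(*) AS n
        FROM signatures
        WHERE assigned_accountant IS NOT NULL
          AND max_similarity_to_same_accountant IS NOT NULL
          AND min_dhash_independent IS NOT NULL
        GROUP BY assigned_accountant
        HAVING n >= ?
    ''', (min_sigs,))
    return cur.fetchall()


if __name__ == '__main__':
    for name, cos_mean, dh_mean, n in accountant_means(build_toy_db(), min_sigs=3):
        print(f'{name}: cos_mean={cos_mean:.3f}, dh_mean={dh_mean:.2f}, n={n}')
```

Rows with a NULL in either metric are removed before grouping, so `n` counts only complete observations and the `min_sigs` floor applies to that count rather than to the accountant's total number of signatures.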