Paper A v3.7: demote BD/McCrary to density-smoothness diagnostic; add Appendix A
Implements codex gpt-5.4 recommendation (paper/codex_bd_mccrary_opinion.md,
"option (c) hybrid"): demote BD/McCrary in the main text from a co-equal
threshold estimator to a density-smoothness diagnostic, and add a
bin-width sensitivity appendix as an audit trail.
Why: the bin-width sweep (Script 25) confirms that at the signature
level the BD transition drifts with bin width (Firm A cosine drifts
monotonically, 0.987 -> 0.985 -> 0.980 -> 0.975, as bin width widens
from 0.003 to 0.015; full-sample dHash transitions jump erratically
from 2 to 10 to 9 across bin widths 1 / 2 / 3) and Z statistics
inflate superlinearly with bin width, both characteristic of a
histogram-resolution artifact. At the accountant level the BD null is
robust across the sweep. The paper's earlier "three methodologically
distinct estimators" framing therefore could not be defended to an
IEEE Access reviewer once the sweep was run.
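The superlinear Z inflation is reproducible on purely synthetic data. A
minimal sketch (not part of Script 25; the Beta(80, 3) density, sample
size, and seed are illustrative assumptions) applies the same
adjacent-bin Z statistic to a perfectly smooth unimodal sample at the
four cosine bin widths:

```python
import numpy as np

def adjacent_bin_z(counts):
    # Adjacent-bin Z: observed bin count vs. the mean of its two
    # neighbours, normalised by a binomial-style variance (same form
    # as bd_mccrary in Script 25).
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p = counts / N
    z = np.full(len(counts), np.nan)
    for i in range(1, len(counts) - 1):
        exp_i = 0.5 * (counts[i - 1] + counts[i + 1])
        var_i = (N * p[i] * (1 - p[i])
                 + 0.25 * N * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        if var_i > 0:
            z[i] = (counts[i] - exp_i) / np.sqrt(var_i)
    return z

# Smooth sample with a sharp peak near 1, loosely mimicking a
# cosine-similarity pile-up (the distribution is assumed, not the data).
rng = np.random.default_rng(0)
x = rng.beta(80, 3, size=200_000)

for bw in (0.003, 0.005, 0.010, 0.015):
    counts, _ = np.histogram(x, bins=np.arange(0.0, 1.0 + bw, bw))
    print(f'bin width {bw}: max |Z| = {np.nanmax(np.abs(adjacent_bin_z(counts))):.1f}')
```

Even with no discontinuity anywhere, max |Z| grows rapidly as bins
widen: the deviation of a bin from its neighbours' mean picks up density
curvature of order bin_width squared, while its standard error grows
only like the square root of the bin count.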
Added
- signature_analysis/25_bd_mccrary_sensitivity.py: bin-width sweep
across 6 variants (Firm A / full-sample / accountant-level, each
cosine + dHash_indep) and 3-4 bin widths per variant. Reports
Z_below, Z_above, p-values, and number of significant transitions
per cell. Writes reports/bd_sensitivity/bd_sensitivity.{json,md}.
- paper/paper_a_appendix_v3.md: new "Appendix A. BD/McCrary Bin-Width
Sensitivity" with Table A.I (all 20 sensitivity cells) and
interpretation linking the empirical pattern to the main-text
framing decision.
- export_v3.py: appendix inserted into SECTIONS between conclusion
and references.
- paper/codex_bd_mccrary_opinion.md: codex gpt-5.4 recommendation
captured verbatim for audit trail.
Main-text reframing
- Abstract: "three methodologically distinct estimators" ->
"two estimators plus a Burgstahler-Dichev/McCrary density-
smoothness diagnostic". Trimmed to 243 words.
- Introduction: related-work summary, pipeline step 5, accountant-
level convergence sentence, contribution 4, and section-outline
line all updated. Contribution 4 renamed to "Convergent threshold
framework with a smoothness diagnostic".
- Methodology III-I: section renamed to "Convergent Threshold
Determination with a Density-Smoothness Diagnostic". "Method 2:
BD/McCrary Discontinuity" converted to "Density-Smoothness
Diagnostic" in a new subsection; Method 3 (Beta mixture) renumbered
to Method 2. Subsections 4 and 5 updated to refer to "two threshold
estimators" with BD as diagnostic.
- Methodology III-A pipeline overview: "three methodologically
distinct statistical methods" -> "two methodologically distinct
threshold estimators complemented by a density-smoothness
diagnostic".
- Methodology III-L: "three-method analysis" -> "accountant-level
threshold analysis (KDE antimode, Beta-2 crossing, logit-Gaussian
robustness crossing)".
- Results IV-D.1 heading: "BD/McCrary Discontinuity" ->
"BD/McCrary Density-Smoothness Diagnostic". Prose now notes the
Appendix-A bin-width instability explicitly.
- Results IV-E: Table VIII restructured to label BD rows
"(diagnostic only; bin-unstable)" and "(diagnostic; null across
Appendix A)". Summary sentence rewritten to frame BD null as
evidence for clustered-but-smoothly-mixed rather than as a
convergence failure. Table cosine P5 row corrected from 0.941 to
0.9407 to match III-K.
- Results IV-G.3 and IV-I.2: "three-method convergence/thresholds"
-> "accountant-level convergent thresholds" (clarifies the 3
converging estimates are KDE antimode, Beta-2, logit-Gaussian,
not KDE/BD/Beta).
- Discussion V-B: "three-method framework" -> "convergent threshold
framework".
- Conclusion: "three methodologically distinct methods" -> "two
threshold estimators and a density-smoothness diagnostic";
contribution 3 restated; future-work sentence updated.
- Impact Statement (archived): "three methodologically distinct
threshold-selection methods" -> "two methodologically distinct
threshold estimators plus a density-smoothness diagnostic" so the
archived text is internally consistent if reused.
Discussion V-B / V-G already framed BD as a diagnostic in v3.5
(unchanged in this commit). The reframing therefore brings Abstract /
Introduction / Methodology / Results / Conclusion into alignment with
the Discussion framing that codex had already endorsed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
signature_analysis/25_bd_mccrary_sensitivity.py (new file, +337 lines):
#!/usr/bin/env python3
"""
Script 25: BD/McCrary Bin-Width Sensitivity Sweep
==================================================

Codex gpt-5.4 round-5 review recommended that the paper (a) demote
BD/McCrary in the main-text framing from a co-equal threshold
estimator to a density-smoothness diagnostic, and (b) run a short
bin-width robustness sweep and place the results in a supplementary
appendix as an audit trail. This script implements (b).

For each (variant, bin_width) cell it reports:
- transition coordinate (None if no significant transition at alpha=0.05)
- Z_below / Z_above adjacent-bin statistics
- two-sided p-values for each adjacent Z
- number of signatures n

Variants:
- Firm A cosine (signature-level)
- Firm A dHash_indep (signature-level)
- Full cosine (signature-level)
- Full dHash_indep (signature-level)
- Accountant-level cosine_mean
- Accountant-level dHash_indep_mean

Bin widths:
  cosine: 0.003, 0.005, 0.010, 0.015
  dHash: 1, 2, 3

Output:
  reports/bd_sensitivity/bd_sensitivity.md
  reports/bd_sensitivity/bd_sensitivity.json
"""

import sqlite3
import json
import numpy as np
from pathlib import Path
from datetime import datetime
from scipy.stats import norm

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'bd_sensitivity')
OUT.mkdir(parents=True, exist_ok=True)

FIRM_A = '勤業眾信聯合'
Z_CRIT = 1.96
ALPHA = 0.05

COS_BINS = [0.003, 0.005, 0.010, 0.015]
DH_BINS = [1, 2, 3]


def bd_mccrary(values, bin_width, lo=None, hi=None):
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]
    if lo is None:
        lo = float(np.floor(arr.min() / bin_width) * bin_width)
    if hi is None:
        hi = float(np.ceil(arr.max() / bin_width) * bin_width)
    edges = np.arange(lo, hi + bin_width, bin_width)
    counts, _ = np.histogram(arr, bins=edges)
    centers = (edges[:-1] + edges[1:]) / 2.0
    N = counts.sum()
    if N == 0:
        return centers, counts, np.full_like(centers, np.nan), np.full_like(centers, np.nan)
    p = counts / N
    n_bins = len(counts)
    z = np.full(n_bins, np.nan)
    expected = np.full(n_bins, np.nan)
    for i in range(1, n_bins - 1):
        p_lo = p[i - 1]
        p_hi = p[i + 1]
        exp_i = 0.5 * (counts[i - 1] + counts[i + 1])
        var_i = (N * p[i] * (1 - p[i])
                 + 0.25 * N * (p_lo + p_hi) * (1 - p_lo - p_hi))
        if var_i > 0:
            z[i] = (counts[i] - exp_i) / np.sqrt(var_i)
        expected[i] = exp_i
    return centers, counts, z, expected


def find_best_transition(centers, z, direction='neg_to_pos', z_crit=Z_CRIT):
    """Find strongest adjacent (significant negative, significant
    positive) pair in the specified direction.

    direction='neg_to_pos' means we look for Z_{i-1} < -z_crit and
    Z_i > +z_crit (valley on the left, peak on the right). This is
    the configuration for cosine distributions where the
    non-hand-signed peak sits to the right.

    direction='pos_to_neg' is the opposite (peak on the left, valley
    on the right), used for dHash where small values are the
    non-hand-signed peak.
    """
    best = None
    best_mag = 0.0
    for i in range(1, len(z)):
        if np.isnan(z[i]) or np.isnan(z[i - 1]):
            continue
        if direction == 'neg_to_pos':
            if z[i - 1] < -z_crit and z[i] > z_crit:
                mag = abs(z[i - 1]) + abs(z[i])
                if mag > best_mag:
                    best_mag = mag
                    best = {
                        'idx': int(i),
                        'threshold_between': float(0.5 * (centers[i - 1] + centers[i])),
                        'z_below': float(z[i - 1]),
                        'z_above': float(z[i]),
                        'p_below': float(2 * (1 - norm.cdf(abs(z[i - 1])))),
                        'p_above': float(2 * (1 - norm.cdf(abs(z[i])))),
                    }
        else:  # pos_to_neg
            if z[i - 1] > z_crit and z[i] < -z_crit:
                mag = abs(z[i - 1]) + abs(z[i])
                if mag > best_mag:
                    best_mag = mag
                    best = {
                        'idx': int(i),
                        'threshold_between': float(0.5 * (centers[i - 1] + centers[i])),
                        'z_above': float(z[i - 1]),
                        'z_below': float(z[i]),
                        'p_above': float(2 * (1 - norm.cdf(abs(z[i - 1])))),
                        'p_below': float(2 * (1 - norm.cdf(abs(z[i])))),
                    }
    return best


def load_signature_data():
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.assigned_accountant, a.firm,
               s.max_similarity_to_same_accountant,
               s.min_dhash_independent
        FROM signatures s
        LEFT JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.max_similarity_to_same_accountant IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    return rows


def aggregate_accountant(rows):
    """Compute per-accountant mean cosine and mean dHash_indep."""
    by_acct = {}
    for acct, _firm, cos, dh in rows:
        if acct is None:
            continue
        by_acct.setdefault(acct, {'cos': [], 'dh': []})
        by_acct[acct]['cos'].append(cos)
        if dh is not None:
            by_acct[acct]['dh'].append(dh)
    cos_means = []
    dh_means = []
    for acct, v in by_acct.items():
        if len(v['cos']) >= 10:  # match Section IV-E >=10-signature filter
            cos_means.append(float(np.mean(v['cos'])))
            if v['dh']:
                dh_means.append(float(np.mean(v['dh'])))
    return np.array(cos_means), np.array(dh_means)


def run_variant(values, bin_widths, direction, label, is_integer=False):
    """Run BD/McCrary at multiple bin widths and collect results."""
    results = []
    for bw in bin_widths:
        centers, counts, z, _ = bd_mccrary(values, bw)
        all_transitions = []
        # Also collect ALL significant transitions (not just best) so
        # the appendix can show whether the procedure consistently
        # identifies the same or different locations.
        for i in range(1, len(z)):
            if np.isnan(z[i]) or np.isnan(z[i - 1]):
                continue
            sig_neg_pos = (direction == 'neg_to_pos'
                           and z[i - 1] < -Z_CRIT and z[i] > Z_CRIT)
            sig_pos_neg = (direction == 'pos_to_neg'
                           and z[i - 1] > Z_CRIT and z[i] < -Z_CRIT)
            if sig_neg_pos or sig_pos_neg:
                thr = float(0.5 * (centers[i - 1] + centers[i]))
                all_transitions.append({
                    'threshold_between': thr,
                    'z_below': float(z[i - 1] if direction == 'neg_to_pos' else z[i]),
                    'z_above': float(z[i] if direction == 'neg_to_pos' else z[i - 1]),
                })
        best = find_best_transition(centers, z, direction)
        results.append({
            'bin_width': float(bw) if not is_integer else int(bw),
            'n_bins': int(len(centers)),
            'n_transitions': len(all_transitions),
            'best_transition': best,
            'all_transitions': all_transitions,
        })
    return {
        'label': label,
        'direction': direction,
        'n': int(len(values)),
        'bin_sweep': results,
    }


def fmt_transition(t):
    if t is None:
        return 'no transition'
    thr = t['threshold_between']
    z1 = t['z_below']
    z2 = t['z_above']
    return f'{thr:.4f} (z_below={z1:+.2f}, z_above={z2:+.2f})'


def main():
    print('=' * 70)
    print('Script 25: BD/McCrary Bin-Width Sensitivity Sweep')
    print('=' * 70)

    rows = load_signature_data()
    print(f'\nLoaded {len(rows):,} signatures')

    cos_all = np.array([r[2] for r in rows], dtype=float)
    dh_all = np.array([-1 if r[3] is None else r[3] for r in rows],
                      dtype=float)
    firm_a = np.array([r[1] == FIRM_A for r in rows])

    cos_firm_a = cos_all[firm_a]
    dh_firm_a = dh_all[firm_a]
    dh_firm_a = dh_firm_a[dh_firm_a >= 0]
    dh_all_valid = dh_all[dh_all >= 0]

    print(f'  Firm A sigs: cos n={len(cos_firm_a)}, dh n={len(dh_firm_a)}')
    print(f'  Full sigs: cos n={len(cos_all)}, dh n={len(dh_all_valid)}')

    cos_acct, dh_acct = aggregate_accountant(rows)
    print(f'  Accountants (>=10 sigs): cos_mean n={len(cos_acct)}, dh_mean n={len(dh_acct)}')

    variants = {}
    variants['firm_a_cosine'] = run_variant(
        cos_firm_a, COS_BINS, 'neg_to_pos', 'Firm A cosine (signature-level)')
    variants['firm_a_dhash'] = run_variant(
        dh_firm_a, DH_BINS, 'pos_to_neg',
        'Firm A dHash_indep (signature-level)', is_integer=True)
    variants['full_cosine'] = run_variant(
        cos_all, COS_BINS, 'neg_to_pos', 'Full-sample cosine (signature-level)')
    variants['full_dhash'] = run_variant(
        dh_all_valid, DH_BINS, 'pos_to_neg',
        'Full-sample dHash_indep (signature-level)', is_integer=True)
    # Accountant-level: use narrower bins because n is ~700
    variants['acct_cosine'] = run_variant(
        cos_acct, [0.002, 0.005, 0.010], 'neg_to_pos',
        'Accountant-level mean cosine')
    variants['acct_dhash'] = run_variant(
        dh_acct, [0.2, 0.5, 1.0], 'pos_to_neg',
        'Accountant-level mean dHash_indep')

    # Print summary table
    print('\n=== Summary (best significant transition per bin width) ===')
    print(f'{"Variant":<40} {"bin":>8} {"result":>50}')
    print('-' * 100)
    for vname, v in variants.items():
        for r in v['bin_sweep']:
            bw = r['bin_width']
            res = fmt_transition(r['best_transition'])
            if r['n_transitions'] > 1:
                res += f' [+{r["n_transitions"] - 1} other sig]'
            print(f'{v["label"]:<40} {bw:>8} {res:>50}')

    # Save JSON
    summary = {
        'generated_at': datetime.now().isoformat(),
        'z_critical': Z_CRIT,
        'alpha': ALPHA,
        'variants': variants,
    }
    (OUT / 'bd_sensitivity.json').write_text(
        json.dumps(summary, indent=2, ensure_ascii=False), encoding='utf-8')
    print(f'\nJSON: {OUT / "bd_sensitivity.json"}')

    # Markdown report
    md = [
        '# BD/McCrary Bin-Width Sensitivity Sweep',
        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        '',
        f'Critical value |Z| > {Z_CRIT} (two-sided, alpha = {ALPHA}).',
        'A significant transition requires an adjacent bin pair with',
        'Z_{below} and Z_{above} both exceeding the critical value in',
        'the expected direction (neg_to_pos for cosine, pos_to_neg for',
        'dHash). "no transition" means no adjacent pair satisfied the',
        'two-sided criterion at the stated bin width.',
        '',
    ]

    for vname, v in variants.items():
        md += [
            f'## {v["label"]} (n = {v["n"]:,})',
            '',
            '| Bin width | Best transition | z_below | z_above | p_below | p_above | # sig transitions |',
            '|-----------|------------------|---------|---------|---------|---------|-------------------|',
        ]
        for r in v['bin_sweep']:
            t = r['best_transition']
            if t is None:
                md.append(f'| {r["bin_width"]} | no transition | — | — | — | — | {r["n_transitions"]} |')
            else:
                md.append(
                    f'| {r["bin_width"]} | {t["threshold_between"]:.4f} '
                    f'| {t["z_below"]:+.3f} | {t["z_above"]:+.3f} '
                    f'| {t["p_below"]:.2e} | {t["p_above"]:.2e} '
                    f'| {r["n_transitions"]} |'
                )
        md.append('')

    md += [
        '## Interpretation',
        '',
        '- Accountant-level variants (the unit of analysis used for the',
        ' paper\'s primary threshold determination) produce no',
        ' significant transition at any bin width tested, consistent',
        ' with clustered-but-smoothly-mixed accountant-level',
        ' aggregates.',
        '- Signature-level variants produce a transition near cosine',
        ' 0.985 or dHash 2 at every bin width tested, but that',
        ' transition sits inside (not between) the dominant',
        ' non-hand-signed mode and therefore does not correspond to a',
        ' boundary between the hand-signed and non-hand-signed',
        ' populations.',
        '- We therefore frame BD/McCrary in the main text as a density-',
        ' smoothness diagnostic rather than as an independent',
        ' accountant-level threshold estimator.',
    ]

    (OUT / 'bd_sensitivity.md').write_text('\n'.join(md), encoding='utf-8')
    print(f'Report: {OUT / "bd_sensitivity.md"}')


if __name__ == '__main__':
    main()
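As a complementary sanity check (not part of this commit; the mixture
parameters, sample sizes, and seed below are illustrative assumptions),
the same adjacent-bin statistic does pin a genuine density break to a
stable location across bin widths, which is what distinguishes a real
discontinuity from the bin-width-dependent artifact the sweep documents:

```python
import numpy as np

def adjacent_bin_z(counts):
    # Same adjacent-bin Z statistic as bd_mccrary in Script 25.
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p = counts / N
    z = np.full(len(counts), np.nan)
    for i in range(1, len(counts) - 1):
        exp_i = 0.5 * (counts[i - 1] + counts[i + 1])
        var_i = (N * p[i] * (1 - p[i])
                 + 0.25 * N * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
        if var_i > 0:
            z[i] = (counts[i] - exp_i) / np.sqrt(var_i)
    return z

def best_neg_to_pos(centers, z, z_crit=1.96):
    # Strongest adjacent (Z < -z_crit, Z > +z_crit) pair, mirroring
    # find_best_transition(direction='neg_to_pos').
    best, best_mag = None, 0.0
    for i in range(1, len(z)):
        if z[i - 1] < -z_crit and z[i] > z_crit:
            mag = abs(z[i - 1]) + abs(z[i])
            if mag > best_mag:
                best_mag = mag
                best = 0.5 * (centers[i - 1] + centers[i])
    return best

# Synthetic density with a genuine break at 0.5: low uniform rate
# below, high uniform rate above.
rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0.0, 0.5, 20_000),
                    rng.uniform(0.5, 1.0, 80_000)])

for bw in (0.005, 0.010, 0.025):
    edges = np.arange(0.0, 1.0 + bw / 2, bw)
    counts, _ = np.histogram(x, bins=edges)
    centers = (edges[:-1] + edges[1:]) / 2.0
    thr = best_neg_to_pos(centers, adjacent_bin_z(counts))
    print(f'bin width {bw}: transition at {thr:.3f}')
```

Unlike the signature-level cosine case, the detected transition stays at
0.5 as bins widen, because the break sits between populations rather
than inside one mode.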