Phase 3 close-out: Script 42 + §IV draft v2 (Table XV filled)

Script 42 tabulates the §III-L five-way per-signature classifier output
on the Big-4 sub-corpus (n=150,442 signatures classified) and aggregates
to document-level (n=75,233 unique PDFs) under the worst-case rule.

Per-signature five-way overall (Table XV):
  HC   74,593  49.58%  high-confidence non-hand-signed
  MC   39,817  26.47%  moderate-confidence non-hand-signed
  HSC     314   0.21%  high style consistency
  UN   35,480  23.58%  uncertain
  LH      238   0.16%  likely hand-signed

Per-firm five-way (% within firm):
  Firm A (Deloitte)  HC 81.70%, MC 10.76%, UN  7.42%
  Firm B (KPMG)      HC 34.56%, MC 35.88%, UN 29.09%
  Firm C (PwC)       HC 23.75%, MC 41.44%, UN 34.21%
  Firm D (EY)        HC 24.51%, MC 29.33%, UN 45.65%

Document-level (Table XV-B, NEW):
  HC   46,857  62.28%
  MC   19,667  26.14%
  HSC     167   0.22%
  UN    8,524  11.33%
  LH       18   0.02%
  Total 75,233 unique Big-4 PDFs (single-firm 74,854; mixed-firm 379)

§IV v2 changes vs v1:
- Table XV populated with Script 42 counts
- Table XV-B (NEW): document-level worst-case counts
- Per-firm five-way breakdown (% within firm) added
- Per-firm document-level breakdown added
- Document-level paragraph in §IV-J updated to reference Table XV-B
- Phase 3 close-out checklist: item 1 (Table XV TBD) and item 4
  (document-level counts) marked RESOLVED; remaining items reduced
  from 5 to 3 (renumbering, content audit, codex open-questions)

The per-firm pattern is consistent with the §III-K Spearman-and-cluster
ordering: Firm A's signatures concentrate in HC (81.7%), the three
non-Firm-A firms have markedly lower HC and substantially higher
Uncertain rates (29-46%), with Firm D having the highest Uncertain rate
of the Big-4 -- consistent with the reverse-anchor score (§III-K
Score 2) ranking Firm D fractionally above Firm C in the hand-leaning
direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:45:22 +08:00
parent 165b3ab384
commit 453f1d8768
2 changed files with 413 additions and 10 deletions
@@ -0,0 +1,360 @@
+#!/usr/bin/env python3
+"""
+Script 42: Five-Way Per-Signature Categorisation on Big-4 (Table XV fill)
+==========================================================================
+Phase 3 close-out. Tabulates the §III-L five-way per-signature
+classifier output on the Big-4 sub-corpus and aggregates to
+document-level (per-PDF) labels under the worst-case rule.
+
+Five-way rule (inherited from v3.20.0 §III-K, retained as v4 §III-L):
+
+  cos > 0.95 AND dHash_indep <= 5       -> HC  High-confidence non-hand-signed
+  cos > 0.95 AND 5 < dHash_indep <= 15  -> MC  Moderate-confidence non-hand-signed
+  cos > 0.95 AND dHash_indep > 15       -> HSC High style consistency
+  0.837 < cos <= 0.95                   -> UN  Uncertain
+  cos <= 0.837                          -> LH  Likely hand-signed
+
+Document-level worst-case rule (one PDF can carry up to 2 certifying-
+CPA signatures; the document inherits the most-replication-consistent
+signature label among the signatures present):
+
+  HC > MC > HSC > UN > LH
+
+Output:
+  reports/v4_big4/five_way_categorisation/
+    per_signature_counts.csv
+    per_firm_category_crosstab.csv
+    per_document_counts.csv
+    five_way_results.json
+    five_way_report.md
+"""
+
+import sqlite3
+import csv
+import json
+import numpy as np
+from pathlib import Path
+from datetime import datetime
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+           'v4_big4/five_way_categorisation')
+OUT.mkdir(parents=True, exist_ok=True)
+
+BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
+LABEL = {'勤業眾信聯合': 'Firm A (Deloitte)',
+         '安侯建業聯合': 'Firm B (KPMG)',
+         '資誠聯合': 'Firm C (PwC)',
+         '安永聯合': 'Firm D (EY)'}
+
+COS_HIGH = 0.95
+COS_LOW = 0.837
+DH_HIGH = 5
+DH_MOD = 15
+
+# Worst-case priority (HC most-replication-consistent, LH most hand-signed)
+PRIORITY = {'HC': 0, 'MC': 1, 'HSC': 2, 'UN': 3, 'LH': 4}
+CATEGORIES = ['HC', 'MC', 'HSC', 'UN', 'LH']
+CAT_LONG = {
+    'HC': 'High-confidence non-hand-signed',
+    'MC': 'Moderate-confidence non-hand-signed',
+    'HSC': 'High style consistency',
+    'UN': 'Uncertain',
+    'LH': 'Likely hand-signed',
+}
+
+
+def classify(cos, dh):
+    if cos is None:
+        return None  # cannot classify
+    if cos > COS_HIGH:
+        if dh is None:
+            return None  # require dh for HC/MC/HSC distinction
+        if dh <= DH_HIGH:
+            return 'HC'
+        if dh <= DH_MOD:
+            return 'MC'
+        return 'HSC'
+    if cos > COS_LOW:
+        return 'UN'
+    return 'LH'
+
+
+def load_big4_signatures():
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    cur.execute('''
+        SELECT s.signature_id, s.source_pdf, s.assigned_accountant, a.firm,
+               s.max_similarity_to_same_accountant,
+               s.min_dhash_independent
+        FROM signatures s
+        JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE s.assigned_accountant IS NOT NULL
+          AND a.firm IN (?, ?, ?, ?)
+    ''', BIG4)
+    rows = cur.fetchall()
+    conn.close()
+    return rows
+
+
+def main():
+    print('=' * 72)
+    print('Script 42: Five-Way Per-Signature Categorisation (Big-4)')
+    print('=' * 72)
+    rows = load_big4_signatures()
+    print(f'\nN Big-4 signatures (loaded, including missing-descriptor): '
+          f'{len(rows):,}')
+
+    # Per-signature classification
+    per_sig = []
+    n_unclassified = 0
+    for r in rows:
+        sig_id, pdf, cpa, firm, cos, dh = r
+        cos_f = None if cos is None else float(cos)
+        dh_f = None if dh is None else float(dh)
+        cat = classify(cos_f, dh_f)
+        if cat is None:
+            n_unclassified += 1
+            continue
+        per_sig.append({
+            'sig_id': sig_id, 'pdf': pdf, 'cpa': cpa, 'firm': firm,
+            'cos': cos_f, 'dh': dh_f, 'cat': cat,
+        })
+    n_classified = len(per_sig)
+    print(f'  Classified: {n_classified:,}')
+    print(f'  Unclassified (no cos, or no dh at cos > {COS_HIGH}): {n_unclassified:,}')
+
+    # Overall per-signature counts
+    overall = {c: 0 for c in CATEGORIES}
+    for s in per_sig:
+        overall[s['cat']] += 1
+    print('\n=== Overall per-signature counts (Big-4 classified) ===')
+    print(f'  {"cat":<5} {"long":<40} {"n":>8} {"%":>7}')
+    for c in CATEGORIES:
+        n = overall[c]
+        pct = 100 * n / n_classified if n_classified else 0.0
+        print(f'  {c:<5} {CAT_LONG[c]:<40} {n:>8,} {pct:>6.2f}%')
+
+    # Per-firm × category cross-tab
+    by_firm = {f: {c: 0 for c in CATEGORIES} for f in BIG4}
+    for s in per_sig:
+        by_firm[s['firm']][s['cat']] += 1
+    print('\n=== Per-firm × category cross-tab (counts) ===')
+    print(f'  {"Firm":<22} ' + ' '.join(f'{c:>8}' for c in CATEGORIES)
+          + f'  {"total":>8}')
+    for f in BIG4:
+        cells = [by_firm[f][c] for c in CATEGORIES]
+        total = sum(cells)
+        print(f'  {LABEL[f]:<22} '
+              + ' '.join(f'{n:>8,}' for n in cells)
+              + f'  {total:>8,}')
+    print('\n=== Per-firm × category cross-tab (% within firm) ===')
+    for f in BIG4:
+        cells = [by_firm[f][c] for c in CATEGORIES]
+        total = sum(cells) or 1
+        print(f'  {LABEL[f]:<22} '
+              + ' '.join(f'{100*n/total:>7.2f}%' for n in cells)
+              + f'  total {total:>6,}')
+
+    # Document-level (per-PDF) aggregation under worst-case rule
+    by_pdf = {}
+    for s in per_sig:
+        pdf = s['pdf']
+        if pdf not in by_pdf:
+            by_pdf[pdf] = {'firm_set': set(), 'best_cat': None,
+                           'best_priority': 99, 'n_sigs': 0}
+        bp = by_pdf[pdf]
+        bp['n_sigs'] += 1
+        bp['firm_set'].add(s['firm'])
+        prio = PRIORITY[s['cat']]
+        if prio < bp['best_priority']:
+            bp['best_priority'] = prio
+            bp['best_cat'] = s['cat']
+
+    n_docs = len(by_pdf)
+    docs_overall = {c: 0 for c in CATEGORIES}
+    for pdf, bp in by_pdf.items():
+        docs_overall[bp['best_cat']] += 1
+    print(f'\n=== Document-level (n={n_docs:,} unique Big-4 PDFs) ===')
+    print(f'  {"cat":<5} {"long":<40} {"n_docs":>8} {"%":>7}')
+    for c in CATEGORIES:
+        n = docs_overall[c]
+        pct = 100 * n / n_docs if n_docs else 0.0
+        print(f'  {c:<5} {CAT_LONG[c]:<40} {n:>8,} {pct:>6.2f}%')
+
+    # Document-level by firm: only single-firm PDFs are attributed to a
+    # firm; the rare mixed-firm PDFs are tallied separately
+    docs_by_firm = {f: {c: 0 for c in CATEGORIES} for f in BIG4}
+    docs_mixed_firm = {c: 0 for c in CATEGORIES}
+    n_mixed_firm = 0
+    for pdf, bp in by_pdf.items():
+        if len(bp['firm_set']) == 1:
+            firm = next(iter(bp['firm_set']))
+            if firm in BIG4:
+                docs_by_firm[firm][bp['best_cat']] += 1
+        else:
+            n_mixed_firm += 1
+            docs_mixed_firm[bp['best_cat']] += 1
+    print(f'\n=== Document-level per-firm (single-firm PDFs only; '
+          f'mixed-firm = {n_mixed_firm}) ===')
+    print(f'  {"Firm":<22} ' + ' '.join(f'{c:>8}' for c in CATEGORIES)
+          + f'  {"total":>8}')
+    for f in BIG4:
+        cells = [docs_by_firm[f][c] for c in CATEGORIES]
+        total = sum(cells)
+        print(f'  {LABEL[f]:<22} '
+              + ' '.join(f'{n:>8,}' for n in cells)
+              + f'  {total:>8,}')
+
+    # Persist CSVs
+    sig_csv = OUT / 'per_signature_counts.csv'
+    with open(sig_csv, 'w', newline='', encoding='utf-8') as f:
+        w = csv.writer(f)
+        w.writerow(['cat', 'long_name', 'n', 'pct_of_classified'])
+        for c in CATEGORIES:
+            w.writerow([c, CAT_LONG[c], overall[c],
+                        f'{100*overall[c]/n_classified:.2f}'
+                        if n_classified else '0'])
+
+    firm_csv = OUT / 'per_firm_category_crosstab.csv'
+    with open(firm_csv, 'w', newline='', encoding='utf-8') as f:
+        w = csv.writer(f)
+        w.writerow(['firm', 'firm_label'] + CATEGORIES + ['total']
+                   + [f'{c}_pct' for c in CATEGORIES])
+        for fk in BIG4:
+            cells = [by_firm[fk][c] for c in CATEGORIES]
+            total = sum(cells) or 1
+            w.writerow([fk, LABEL[fk]] + cells + [sum(cells)]
+                       + [f'{100*n/total:.2f}' for n in cells])
+
+    doc_csv = OUT / 'per_document_counts.csv'
+    with open(doc_csv, 'w', newline='', encoding='utf-8') as f:
+        w = csv.writer(f)
+        w.writerow(['scope', 'cat', 'long_name', 'n', 'pct'])
+        for c in CATEGORIES:
+            w.writerow(['overall', c, CAT_LONG[c], docs_overall[c],
+                        f'{100*docs_overall[c]/n_docs:.2f}' if n_docs
+                        else '0'])
+        for fk in BIG4:
+            firm_total = sum(docs_by_firm[fk][c] for c in CATEGORIES) or 1
+            for c in CATEGORIES:
+                w.writerow([LABEL[fk], c, CAT_LONG[c],
+                            docs_by_firm[fk][c],
+                            f'{100*docs_by_firm[fk][c]/firm_total:.2f}'])
+        for c in CATEGORIES:
+            w.writerow(['mixed_firm', c, CAT_LONG[c], docs_mixed_firm[c],
+                        f'{100*docs_mixed_firm[c]/n_mixed_firm:.2f}'
+                        if n_mixed_firm else '0'])
+
+    payload = {
+        'generated_at': datetime.now().isoformat(),
+        'rule': {
+            'cos_high': COS_HIGH, 'cos_low': COS_LOW,
+            'dh_high': DH_HIGH, 'dh_mod': DH_MOD,
+        },
+        'priority': PRIORITY,
+        'n_loaded': len(rows),
+        'n_classified': n_classified,
+        'n_unclassified': n_unclassified,
+        'per_signature_overall': {c: overall[c] for c in CATEGORIES},
+        'per_signature_by_firm': {fk: by_firm[fk] for fk in BIG4},
+        'document_level': {
+            'n_docs': n_docs,
+            'overall': docs_overall,
+            'by_firm_single_firm_docs_only': {
+                fk: docs_by_firm[fk] for fk in BIG4
+            },
+            'n_mixed_firm_docs': n_mixed_firm,
+            'mixed_firm_overall': docs_mixed_firm,
+        },
+    }
+    json_path = OUT / 'five_way_results.json'
+    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
+                         encoding='utf-8')
+    print(f'\nJSON: {json_path}')
+
+    # Markdown
+    md = [
+        '# §IV-J Five-Way Per-Signature Categorisation on Big-4 (Table XV fill)',
+        f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
+        '',
+        '## Rule (inherited from v3.20.0 §III-K)',
+        '',
+        f'- HC : cos > {COS_HIGH} AND dHash_indep <= {DH_HIGH}',
+        f'- MC : cos > {COS_HIGH} AND {DH_HIGH} < dHash_indep <= {DH_MOD}',
+        f'- HSC: cos > {COS_HIGH} AND dHash_indep > {DH_MOD}',
+        f'- UN : {COS_LOW} < cos <= {COS_HIGH}',
+        f'- LH : cos <= {COS_LOW}',
+        '',
+        '## Sample',
+        '',
+        f'- Loaded Big-4 signatures: {len(rows):,}',
+        f'- Classified (cos available; dh also available when cos > {COS_HIGH}): '
+        f'{n_classified:,}',
+        f'- Unclassified (cos missing, or dh missing when cos > {COS_HIGH}): {n_unclassified:,}',
+        '',
+        '## Per-signature overall counts (Table XV — Big-4 subset)',
+        '',
+        '| Category | Long name | $n$ signatures | % of classified |',
+        '|---|---|---|---|',
+    ]
+    for c in CATEGORIES:
+        n = overall[c]
+        pct = 100 * n / n_classified if n_classified else 0.0
+        md.append(f'| {c} | {CAT_LONG[c]} | {n:,} | {pct:.2f}% |')
+
+    md += ['', '## Per-firm × category cross-tab (counts)', '',
+           '| Firm | HC | MC | HSC | UN | LH | total |',
+           '|---|---|---|---|---|---|---|']
+    for fk in BIG4:
+        cells = [by_firm[fk][c] for c in CATEGORIES]
+        total = sum(cells)
+        md.append(f'| {LABEL[fk]} | '
+                  + ' | '.join(f'{n:,}' for n in cells)
+                  + f' | {total:,} |')
+
+    md += ['', '## Per-firm × category cross-tab (% within firm)', '',
+           '| Firm | HC % | MC % | HSC % | UN % | LH % |',
+           '|---|---|---|---|---|---|']
+    for fk in BIG4:
+        cells = [by_firm[fk][c] for c in CATEGORIES]
+        total = sum(cells) or 1
+        md.append(f'| {LABEL[fk]} | '
+                  + ' | '.join(f'{100*n/total:.2f}%' for n in cells)
+                  + ' |')
+
+    md += ['', '## Document-level (worst-case rule, per Big-4 PDF)', '',
+           f'- N unique Big-4 PDFs: {n_docs:,}',
+           f'- Mixed-firm PDFs (signatures from >1 Big-4 firm; excluded '
+           f'from the per-firm table below): {n_mixed_firm:,}',
+           '',
+           '| Category | Long name | $n$ documents | % |',
+           '|---|---|---|---|']
+    for c in CATEGORIES:
+        n = docs_overall[c]
+        pct = 100 * n / n_docs if n_docs else 0.0
+        md.append(f'| {c} | {CAT_LONG[c]} | {n:,} | {pct:.2f}% |')
+
+    md += ['', '## Document-level per-firm (single-firm PDFs only)', '',
+           '| Firm | HC | MC | HSC | UN | LH | total |',
+           '|---|---|---|---|---|---|---|']
+    for fk in BIG4:
+        cells = [docs_by_firm[fk][c] for c in CATEGORIES]
+        total = sum(cells)
+        md.append(f'| {LABEL[fk]} | '
+                  + ' | '.join(f'{n:,}' for n in cells)
+                  + f' | {total:,} |')
+
+    md += ['', '## Files',
+           '- `per_signature_counts.csv` -- overall five-way per-signature counts',
+           '- `per_firm_category_crosstab.csv` -- per-firm cross-tab',
+           '- `per_document_counts.csv` -- document-level aggregation',
+           '- `five_way_results.json` -- machine-readable full output',
+           ]
+    md_path = OUT / 'five_way_report.md'
+    md_path.write_text('\n'.join(md), encoding='utf-8')
+    print(f'Report: {md_path}')
+
+
+if __name__ == '__main__':
+    main()
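
Note on the rule boundaries and the document-level aggregation: the sketch below is a minimal, database-free restatement of the five-way rule and the worst-case (lowest `PRIORITY` value wins) aggregation from the diff above, run on made-up rows. The thresholds and category codes are copied from Script 42; the toy data and the helper name `document_label` are assumptions for illustration and are not part of the commit. Unlike Script 42, which skips unclassified signatures before grouping, the sketch keeps them as `None` so the missing-descriptor case stays visible.

```python
# Illustrative sketch only: thresholds copied from Script 42; the toy rows
# and the helper name `document_label` are hypothetical.
COS_HIGH, COS_LOW = 0.95, 0.837
DH_HIGH, DH_MOD = 5, 15
PRIORITY = {'HC': 0, 'MC': 1, 'HSC': 2, 'UN': 3, 'LH': 4}


def classify(cos, dh):
    """Five-way rule; returns None when a required descriptor is missing."""
    if cos is None:
        return None
    if cos > COS_HIGH:
        if dh is None:
            return None  # dHash is needed to separate HC / MC / HSC
        if dh <= DH_HIGH:
            return 'HC'
        return 'MC' if dh <= DH_MOD else 'HSC'
    return 'UN' if cos > COS_LOW else 'LH'


def document_label(categories):
    """Return the label with the lowest PRIORITY value, or None if empty."""
    labelled = [c for c in categories if c is not None]
    return min(labelled, key=PRIORITY.__getitem__) if labelled else None


# Toy (pdf, cos, dHash) rows, not taken from the corpus.
toy = [
    ('a.pdf', 0.97, 3),     # HC
    ('a.pdf', 0.80, None),  # LH: dHash is not consulted at or below COS_HIGH
    ('b.pdf', 0.95, 2),     # UN: cos must be strictly greater than 0.95 for HC
    ('b.pdf', 0.837, 40),   # LH: cos must be strictly greater than 0.837 for UN
    ('c.pdf', 0.99, None),  # unclassified: dHash required above COS_HIGH
]

by_pdf = {}
for pdf, cos, dh in toy:
    by_pdf.setdefault(pdf, []).append(classify(cos, dh))

doc_labels = {pdf: document_label(cats) for pdf, cats in by_pdf.items()}
print(doc_labels)
```

Both cosine thresholds are strict, so a signature at exactly 0.95 is Uncertain and one at exactly 0.837 is Likely hand-signed. A signature with cosine at or below 0.95 is classified even when its dHash is missing, which is why "unclassified" covers only a missing cosine or a missing dHash above 0.95.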