Phase 3 close-out: Script 42 + §IV draft v2 (Table XV filled)

Script 42 tabulates the §III-L five-way per-signature classifier output on the Big-4 sub-corpus (n=150,442 signatures classified) and aggregates to document-level (n=75,233 unique PDFs) under the worst-case rule.

Per-signature five-way overall (Table XV):
  HC   74,593  49.58%  high-confidence non-hand-signed
  MC   39,817  26.47%  moderate-confidence non-hand-signed
  HSC     314   0.21%  high style consistency
  UN   35,480  23.58%  uncertain
  LH      238   0.16%  likely hand-signed

Per-firm five-way (% within firm):
  Firm A (Deloitte)  HC 81.70%, MC 10.76%, UN  7.42%
  Firm B (KPMG)      HC 34.56%, MC 35.88%, UN 29.09%
  Firm C (PwC)       HC 23.75%, MC 41.44%, UN 34.21%
  Firm D (EY)        HC 24.51%, MC 29.33%, UN 45.65%

Document-level (Table XV-B, NEW):
  HC   46,857  62.28%
  MC   19,667  26.14%
  HSC     167   0.22%
  UN    8,524  11.33%
  LH       18   0.02%
  Total 75,233 unique Big-4 PDFs (single-firm 74,854; mixed-firm 379)

§IV v2 changes vs v1:
- Table XV populated with Script 42 counts
- Table XV-B (NEW): document-level worst-case counts
- Per-firm five-way breakdown (% within firm) added
- Per-firm document-level breakdown added
- Document-level paragraph in §IV-J updated to reference Table XV-B
- Phase 3 close-out checklist: item 1 (Table XV TBD) and item 4 (document-level counts) marked RESOLVED; remaining items reduced from 5 to 3 (renumbering, content audit, codex open-questions)

The per-firm pattern is consistent with the §III-K Spearman-and-cluster ordering: Firm A's signatures concentrate in HC (81.7%), the three non-Firm-A firms have markedly lower HC and substantially higher Uncertain rates (29-46%), with Firm D having the highest Uncertain rate of the Big-4 -- consistent with the reverse-anchor score (§III-K Score 2) ranking Firm D fractionally above Firm C in the hand-leaning direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:45:22 +08:00
parent 165b3ab384
commit 453f1d8768
2 changed files with 413 additions and 10 deletions
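The counts quoted in the commit message can be cross-checked arithmetically before reading the diff. The sketch below is not part of the commit; it copies the Table XV and Table XV-B figures from the message above and recomputes the totals and percentages. The names `PER_SIGNATURE`, `PER_DOCUMENT`, and `shares` are illustrative assumptions.

```python
# Counts copied from the commit message (Table XV and Table XV-B).
PER_SIGNATURE = {"HC": 74_593, "MC": 39_817, "HSC": 314, "UN": 35_480, "LH": 238}
PER_DOCUMENT = {"HC": 46_857, "MC": 19_667, "HSC": 167, "UN": 8_524, "LH": 18}

N_SIGNATURES = 150_442  # classified Big-4 signatures
N_DOCUMENTS = 75_233    # unique Big-4 PDFs
N_SINGLE_FIRM, N_MIXED_FIRM = 74_854, 379


def shares(counts):
    """Return each category's percentage of the total, rounded to 2 dp."""
    total = sum(counts.values())
    return {cat: round(100 * n / total, 2) for cat, n in counts.items()}


# Category counts must add up to the stated totals.
assert sum(PER_SIGNATURE.values()) == N_SIGNATURES
assert sum(PER_DOCUMENT.values()) == N_DOCUMENTS
assert N_SINGLE_FIRM + N_MIXED_FIRM == N_DOCUMENTS

sig_pct = shares(PER_SIGNATURE)
doc_pct = shares(PER_DOCUMENT)
print(sig_pct)
print(doc_pct)
```

If any assertion fails after a future re-run of Script 42, the commit message and the tables in §IV-J have drifted apart and should be regenerated together.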
@@ -1,6 +1,6 @@
-# Section IV. Results — v4.0 Draft v1
+# Section IV. Results — v4.0 Draft v2
-> **Draft note (2026-05-12).** This file replaces the §IV-A through §IV-H block of `paper/paper_a_results_v3.md` (v3.20.0) with the Big-4 reframed structure; Section IV expands from 8 sub-sections in v3.20.0 to 12 sub-sections in v4.0 to mirror the §III-G..L lineage. Tables IV–XVIII numbering is **provisional** in this draft and finalised in Phase 3 close-out per codex round-22 open question 3. Empirical anchors trace to Scripts 32–41 on branch `paper-a-v4-big4`; the §III provenance table covers the methodology-side citations and §IV adds new tables for the v4.0-specific results.
+> **Draft note (2026-05-12, v2).** This file replaces the §IV-A through §IV-H block of `paper/paper_a_results_v3.md` (v3.20.0) with the Big-4 reframed structure; Section IV expands from 8 sub-sections in v3.20.0 to 12 sub-sections in v4.0 (A through L) to mirror the §III-G..L lineage. **v2** fills Table XV (and adds Table XV-B for document-level counts) using Script 42's per-signature five-way categorisation on the Big-4 sub-corpus, closing the only TBD that v1 carried. Tables IV–XVIII numbering remains provisional and is finalised in Phase 3 close-out. Empirical anchors trace to Scripts 32–42 on branch `paper-a-v4-big4`; the §III provenance table covers the methodology-side citations and §IV adds new tables for the v4.0-specific results.
 ## A. Experimental Setup
@@ -160,11 +160,55 @@ The signature-level inter-CPA negative-anchor FAR analysis (~50,000 random pairs
 This section reports the §III-L five-way per-signature + document-level worst-case classifier output on the Big-4 sub-corpus. The five-way category definitions are inherited unchanged from v3.20.0 §III-K (now §III-L); see §III-L for the cosine and dHash cuts.
-**Table XV (revised: five-way per-signature category counts, Big-4 only, $n = 150{,}442$).**
+**Table XV (revised: five-way per-signature category counts, Big-4 only, $n = 150{,}442$ classified).**
-We adopt the v3.20.0 Tables IX / XI / XII methodology for the per-signature category counts and re-compute on the Big-4 subset for v4.0; the resulting proportions are reported in this table when the Phase 3 tabulation script is wired to consume the existing per-signature category-assignment output. *[Phase 3 close-out task: regenerate per-signature category counts on Big-4 subset by adapting the v3.x classifier output. Numbers held in v4.0 v1 draft as TBD; the inherited v3.x signature-level rule does not change, only the Big-4 scope of the population over which it is tabulated.]*
+| Category | Long name | $n$ signatures | % of classified |
 |---|---|---|---|
 | HC | High-confidence non-hand-signed | 74,593 | 49.58% |
 | MC | Moderate-confidence non-hand-signed | 39,817 | 26.47% |
 | HSC | High style consistency | 314 | 0.21% |
 | UN | Uncertain | 35,480 | 23.58% |
 | LH | Likely hand-signed | 238 | 0.16% |
-The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is **not separately validated by Scripts 38–40**, which evaluated only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we note this inheritance status explicitly so the reader can locate the v3.x Tables IX / XI / XII calibration evidence (carried into v4.0 by reference) without expecting v4.0-spike-script confirmation of the moderate-band specifics.
+(Source: Script 42; 11 of 150,453 loaded Big-4 signatures were excluded because a descriptor required for classification was missing: the cosine, or the dHash for signatures with cos $> 0.95$.)
 **Per-firm five-way breakdown (% within firm).**
 | Firm | HC | MC | HSC | UN | LH | total signatures |
 |---|---|---|---|---|---|---|
 | Firm A (Deloitte) | 81.70% | 10.76% | 0.05% | 7.42% | 0.07% | 60,448 |
 | Firm B (KPMG) | 34.56% | 35.88% | 0.29% | 29.09% | 0.18% | 34,248 |
 | Firm C (PwC) | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
 | Firm D (EY) | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |
 (Source: Script 42 per-firm cross-tab.) The per-firm pattern aligns with the K=3 cluster cross-tab of Table XVI: Firm A is concentrated in the HC band (81.70% of its signatures), consistent with its 82.46% C3-replicated concentration at the accountant level; the three non-Firm-A Big-4 firms have markedly lower HC rates and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%) — consistent with §III-K Score 2 (reverse-anchor cosine percentile) ranking Firm D fractionally above Firm C in the hand-leaning direction.
 **Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.20.0 worst-case rule (HC > MC > HSC > UN > LH; §III-L). v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset).
 **Table XV-B (NEW: document-level worst-case category counts, Big-4 only, $n = 75{,}233$ unique PDFs).**
 | Category | Long name | $n$ documents | % |
 |---|---|---|---|
 | HC | High-confidence non-hand-signed | 46,857 | 62.28% |
 | MC | Moderate-confidence non-hand-signed | 19,667 | 26.14% |
 | HSC | High style consistency | 167 | 0.22% |
 | UN | Uncertain | 8,524 | 11.33% |
 | LH | Likely hand-signed | 18 | 0.02% |
 (Source: Script 42 document-level table; 379 of 75,233 PDFs carried signatures from more than one Big-4 firm. These are reported under a separate `mixed_firm` scope in the script CSV, excluded from the single-firm per-firm breakdown, and pooled into the overall counts here.)
 **Per-firm document-level breakdown (single-firm PDFs only).**
 | Firm | HC | MC | HSC | UN | LH | total docs |
 |---|---|---|---|---|---|---|
 | Firm A (Deloitte) | 27,600 | 1,857 | 7 | 758 | 4 | 30,226 |
 | Firm B (KPMG) | 8,783 | 6,079 | 57 | 2,202 | 6 | 17,127 |
 | Firm C (PwC) | 7,281 | 8,660 | 77 | 3,099 | 5 | 19,122 |
 | Firm D (EY) | 3,100 | 2,838 | 22 | 2,416 | 3 | 8,379 |
 (Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)
 The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) inherits its v3.x calibration; it is **not separately validated by Scripts 38–40**, which evaluated only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). v4.0 does not re-derive the moderate-band cuts on the Big-4 subset; we note this inheritance status explicitly so the reader can locate the v3.x Tables IX / XI / XII calibration evidence (carried into v4.0 by reference) without expecting v4.0-spike-script confirmation of the moderate-band specifics. The Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) report the inherited rule's output on the Big-4 subset; the relative ordering of the non-Firm-A firms on MC is consistent with the §III-K Spearman convergence on the per-CPA hand-leaning ranking.
 **Table XVI (NEW: firm × K=3 cluster cross-tabulation, Big-4 only).**
@@ -177,7 +221,7 @@ The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 <
 (Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 replicated component (no Firm A CPAs in C1); Firm C has the highest hand-leaning concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).
-**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. Document-level outputs use the v3.20.0 worst-case rule (§III-L; v3.20.0 §III-K); v4.0 does not change this aggregation. Document-level proportions on the Big-4 subset are reported when the Phase 3 tabulation script is wired (see Table XV TBD note).
+**Document-level worst-case aggregation outputs are reported in Table XV-B above.**
 ## K. Full-Dataset Robustness (light scope)
@@ -215,8 +259,7 @@ The feature-backbone ablation (Table XVIII in v3.20.0; backbone replacement of R
 The following items are flagged for resolution before §IV is sent for codex round 23 / partner Jimmy review:
-1. **Table XV per-signature category counts on Big-4 subset.** The five-way classifier's per-signature counts on the Big-4 subset need to be re-tabulated by adapting the v3.x category-assignment script. Numbers are TBD in v1; the inherited cosine/dHash cuts do not change.
+1. **Table XV per-signature category counts** — RESOLVED (v2 of §IV draft, Script 42 output). Per-signature, per-firm, document-level, and per-firm-document tables now populated.
-2. **Table renumbering finalisation.** The provisional Tables IV–XVIII numbering should be confirmed once §IV is read end-to-end; some v3.x table positions (e.g., capture-rate tables Tables IX, XI, XII) are kept by reference rather than reproduced as v4.0-numbered tables.
+2. **Table renumbering finalisation.** The provisional Tables IV–XVIII numbering (with Table XV-B added in v2) should be confirmed once §IV is read end-to-end and §III–§IV cross-references are traced; some v3.x table positions (e.g., capture-rate tables Tables IX, XI, XII) are kept by reference rather than reproduced as v4.0-numbered tables.
 3. **§IV-A to §IV-C content audit.** Verify that the inherited prose for Experimental Setup, Detection Performance, and All-Pairs analysis remains accurate after the §III-G scope change to Big-4 primary.
-4. **Document-level worst-case aggregation counts.** Companion to item 1; the Big-4 subset document-level proportions need to be regenerated in the same Phase 3 close-out tabulation pass.
+4. **Open question carry-over from §III v3.** Codex round-22 open questions on five-way moderate-band validation, firm anonymisation policy, and §IV table numbering are addressed in this v2: (a) five-way moderate band documented as inherited from v3.x in §IV-J with Big-4 per-firm proportions reported descriptively (Table XV); (b) firm anonymisation maintained throughout §IV (Firm A–D used consistently); (c) §IV table numbering set provisionally and to be finalised at Phase 3 close-out.
-5. **Open question carry-over from §III v3.** Codex round-22 open questions on five-way moderate-band validation, firm anonymisation policy, and §IV table numbering are now addressed: (a) five-way moderate band documented as inherited from v3.x in §IV-J; (b) firm anonymisation maintained throughout §IV (Firm A–D used consistently); (c) §IV table numbering set provisionally and to be finalised at Phase 3 close-out.
@@ -0,0 +1,360 @@
 #!/usr/bin/env python3
 """
 Script 42: Five-Way Per-Signature Categorisation on Big-4 (Table XV fill)
 ==========================================================================
 Phase 3 close-out. Tabulates the §III-L five-way per-signature
 classifier output on the Big-4 sub-corpus and aggregates to
 document-level (per-PDF) labels under the worst-case rule.
 Five-way rule (inherited from v3.20.0 §III-K, retained as v4 §III-L):
  cos > 0.95 AND dHash_indep <= 5     -> HC  High-confidence non-hand-signed
  cos > 0.95 AND 5 < dHash <= 15      -> MC  Moderate-confidence non-hand-signed
  cos > 0.95 AND dHash > 15           -> HSC High style consistency
  0.837 < cos <= 0.95                 -> UN  Uncertain
  cos <= 0.837                        -> LH  Likely hand-signed
 Document-level worst-case rule (one PDF can carry up to 2 certifying-
 CPA signatures; the document inherits the most-replication-consistent
 signature label among the signatures present):
  HC > MC > HSC > UN > LH
 Output:
  reports/v4_big4/five_way_categorisation/
    per_signature_counts.csv
    per_firm_category_crosstab.csv
    per_document_counts.csv
    five_way_results.json
    five_way_report.md
 """
 import sqlite3
 import csv
 import json
 import numpy as np
 from pathlib import Path
 from datetime import datetime
 DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
 OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'v4_big4/five_way_categorisation')
 OUT.mkdir(parents=True, exist_ok=True)
 BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
 LABEL = {'勤業眾信聯合': 'Firm A (Deloitte)',
         '安侯建業聯合': 'Firm B (KPMG)',
         '資誠聯合': 'Firm C (PwC)',
         '安永聯合': 'Firm D (EY)'}
 COS_HIGH = 0.95
 COS_LOW = 0.837
 DH_HIGH = 5
 DH_MOD = 15
 # Worst-case priority (HC most-replication-consistent, LH most hand-signed)
 PRIORITY = {'HC': 0, 'MC': 1, 'HSC': 2, 'UN': 3, 'LH': 4}
 CATEGORIES = ['HC', 'MC', 'HSC', 'UN', 'LH']
 CAT_LONG = {
    'HC': 'High-confidence non-hand-signed',
    'MC': 'Moderate-confidence non-hand-signed',
    'HSC': 'High style consistency',
    'UN': 'Uncertain',
    'LH': 'Likely hand-signed',
 }
 def classify(cos, dh):
    if cos is None:
        return None  # cannot classify
    if cos > COS_HIGH:
        if dh is None:
            return None  # require dh for HC/MC/HSC distinction
        if dh <= DH_HIGH:
            return 'HC'
        if dh <= DH_MOD:
            return 'MC'
        return 'HSC'
    if cos > COS_LOW:
        return 'UN'
    return 'LH'
 def load_big4_signatures():
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.signature_id, s.source_pdf, s.assigned_accountant, a.firm,
               s.max_similarity_to_same_accountant,
               s.min_dhash_independent
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND a.firm IN (?, ?, ?, ?)
    ''', BIG4)
    rows = cur.fetchall()
    conn.close()
    return rows
 def main():
    print('=' * 72)
    print('Script 42: Five-Way Per-Signature Categorisation (Big-4)')
    print('=' * 72)
    rows = load_big4_signatures()
    print(f'\nN Big-4 signatures (loaded, including missing-descriptor): '
          f'{len(rows):,}')
    # Per-signature classification
    per_sig = []
    n_unclassified = 0
    for r in rows:
        sig_id, pdf, cpa, firm, cos, dh = r
        cos_f = None if cos is None else float(cos)
        dh_f = None if dh is None else float(dh)
        cat = classify(cos_f, dh_f)
        if cat is None:
            n_unclassified += 1
            continue
        per_sig.append({
            'sig_id': sig_id, 'pdf': pdf, 'cpa': cpa, 'firm': firm,
            'cos': cos_f, 'dh': dh_f, 'cat': cat,
        })
    n_classified = len(per_sig)
    print(f'  Classified: {n_classified:,}')
    print(f'  Unclassified (missing cos/dh): {n_unclassified:,}')
    # Overall per-signature counts
    overall = {c: 0 for c in CATEGORIES}
    for s in per_sig:
        overall[s['cat']] += 1
    print('\n=== Overall per-signature counts (Big-4 classified) ===')
    print(f'  {"cat":<5} {"long":<40} {"n":>8} {"%":>7}')
    for c in CATEGORIES:
        n = overall[c]
        pct = 100 * n / n_classified if n_classified else 0.0
        print(f'  {c:<5} {CAT_LONG[c]:<40} {n:>8,} {pct:>6.2f}%')
    # Per-firm × category cross-tab
    by_firm = {f: {c: 0 for c in CATEGORIES} for f in BIG4}
    for s in per_sig:
        by_firm[s['firm']][s['cat']] += 1
    print('\n=== Per-firm × category cross-tab (counts) ===')
    print(f'  {"Firm":<22} ' + ' '.join(f'{c:>8}' for c in CATEGORIES)
          + f'  {"total":>8}')
    for f in BIG4:
        cells = [by_firm[f][c] for c in CATEGORIES]
        total = sum(cells)
        print(f'  {LABEL[f]:<22} '
              + ' '.join(f'{n:>8,}' for n in cells)
              + f'  {total:>8,}')
    print('\n=== Per-firm × category cross-tab (% within firm) ===')
    for f in BIG4:
        cells = [by_firm[f][c] for c in CATEGORIES]
        total = sum(cells) or 1
        print(f'  {LABEL[f]:<22} '
              + ' '.join(f'{100*n/total:>7.2f}%' for n in cells)
              + f'  total {total:>6,}')
    # Document-level (per-PDF) aggregation under worst-case rule
    by_pdf = {}
    for s in per_sig:
        pdf = s['pdf']
        if pdf not in by_pdf:
            by_pdf[pdf] = {'firm_set': set(), 'best_cat': None,
                           'best_priority': 99, 'n_sigs': 0}
        bp = by_pdf[pdf]
        bp['n_sigs'] += 1
        bp['firm_set'].add(s['firm'])
        prio = PRIORITY[s['cat']]
        if prio < bp['best_priority']:
            bp['best_priority'] = prio
            bp['best_cat'] = s['cat']
    n_docs = len(by_pdf)
    docs_overall = {c: 0 for c in CATEGORIES}
    for pdf, bp in by_pdf.items():
        docs_overall[bp['best_cat']] += 1
    print(f'\n=== Document-level (n={n_docs:,} unique Big-4 PDFs) ===')
    print(f'  {"cat":<5} {"long":<40} {"n_docs":>8} {"%":>7}')
    for c in CATEGORIES:
        n = docs_overall[c]
        pct = 100 * n / n_docs if n_docs else 0.0
        print(f'  {c:<5} {CAT_LONG[c]:<40} {n:>8,} {pct:>6.2f}%')
    # Document-level by firm: a PDF counts toward a firm only when all of
    # its signatures come from that firm; mixed-firm PDFs are rare and
    # tallied separately in docs_mixed_firm
    docs_by_firm = {f: {c: 0 for c in CATEGORIES} for f in BIG4}
    docs_mixed_firm = {c: 0 for c in CATEGORIES}
    n_mixed_firm = 0
    for pdf, bp in by_pdf.items():
        if len(bp['firm_set']) == 1:
            firm = next(iter(bp['firm_set']))
            if firm in BIG4:
                docs_by_firm[firm][bp['best_cat']] += 1
        else:
            n_mixed_firm += 1
            docs_mixed_firm[bp['best_cat']] += 1
    print(f'\n=== Document-level per-firm (single-firm PDFs only; '
          f'mixed-firm = {n_mixed_firm}) ===')
    print(f'  {"Firm":<22} ' + ' '.join(f'{c:>8}' for c in CATEGORIES)
          + f'  {"total":>8}')
    for f in BIG4:
        cells = [docs_by_firm[f][c] for c in CATEGORIES]
        total = sum(cells)
        print(f'  {LABEL[f]:<22} '
              + ' '.join(f'{n:>8,}' for n in cells)
              + f'  {total:>8,}')
    # Persist CSVs
    sig_csv = OUT / 'per_signature_counts.csv'
    with open(sig_csv, 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(['cat', 'long_name', 'n', 'pct_of_classified'])
        for c in CATEGORIES:
            w.writerow([c, CAT_LONG[c], overall[c],
                        f'{100*overall[c]/n_classified:.2f}'
                        if n_classified else '0'])
    firm_csv = OUT / 'per_firm_category_crosstab.csv'
    with open(firm_csv, 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(['firm', 'firm_label'] + CATEGORIES + ['total']
                   + [f'{c}_pct' for c in CATEGORIES])
        for fk in BIG4:
            cells = [by_firm[fk][c] for c in CATEGORIES]
            total = sum(cells) or 1
            w.writerow([fk, LABEL[fk]] + cells + [sum(cells)]
                       + [f'{100*n/total:.2f}' for n in cells])
    doc_csv = OUT / 'per_document_counts.csv'
    with open(doc_csv, 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(['scope', 'cat', 'long_name', 'n', 'pct'])
        for c in CATEGORIES:
            w.writerow(['overall', c, CAT_LONG[c], docs_overall[c],
                        f'{100*docs_overall[c]/n_docs:.2f}' if n_docs
                        else '0'])
        for fk in BIG4:
            firm_total = sum(docs_by_firm[fk][c] for c in CATEGORIES) or 1
            for c in CATEGORIES:
                w.writerow([LABEL[fk], c, CAT_LONG[c],
                            docs_by_firm[fk][c],
                            f'{100*docs_by_firm[fk][c]/firm_total:.2f}'])
        for c in CATEGORIES:
            w.writerow(['mixed_firm', c, CAT_LONG[c], docs_mixed_firm[c],
                        f'{100*docs_mixed_firm[c]/n_mixed_firm:.2f}'
                        if n_mixed_firm else '0'])
    payload = {
        'generated_at': datetime.now().isoformat(),
        'rule': {
            'cos_high': COS_HIGH, 'cos_low': COS_LOW,
            'dh_high': DH_HIGH, 'dh_mod': DH_MOD,
        },
        'priority': PRIORITY,
        'n_loaded': len(rows),
        'n_classified': n_classified,
        'n_unclassified': n_unclassified,
        'per_signature_overall': {c: overall[c] for c in CATEGORIES},
        'per_signature_by_firm': {fk: by_firm[fk] for fk in BIG4},
        'document_level': {
            'n_docs': n_docs,
            'overall': docs_overall,
            'by_firm_single_firm_docs_only': {
                fk: docs_by_firm[fk] for fk in BIG4
            },
            'n_mixed_firm_docs': n_mixed_firm,
            'mixed_firm_overall': docs_mixed_firm,
        },
    }
    json_path = OUT / 'five_way_results.json'
    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
                         encoding='utf-8')
    print(f'\nJSON: {json_path}')
    # Markdown
    md = [
        '# §IV-J Five-Way Per-Signature Categorisation on Big-4 (Table XV fill)',
        f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
        '',
        '## Rule (inherited from v3.20.0 §III-K)',
        '',
        f'- HC : cos > {COS_HIGH} AND dHash_indep <= {DH_HIGH}',
        f'- MC : cos > {COS_HIGH} AND {DH_HIGH} < dHash <= {DH_MOD}',
        f'- HSC: cos > {COS_HIGH} AND dHash > {DH_MOD}',
        f'- UN : {COS_LOW} < cos <= {COS_HIGH}',
        f'- LH : cos <= {COS_LOW}',
        '',
        '## Sample',
        '',
        f'- Loaded Big-4 signatures: {len(rows):,}',
        f'- Classified (both descriptors available): '
        f'{n_classified:,}',
        f'- Unclassified (missing cos or dh): {n_unclassified:,}',
        '',
        '## Per-signature overall counts (Table XV — Big-4 subset)',
        '',
        '| Category | Long name | $n$ signatures | % of classified |',
        '|---|---|---|---|',
    ]
    for c in CATEGORIES:
        n = overall[c]
        pct = 100 * n / n_classified if n_classified else 0.0
        md.append(f'| {c} | {CAT_LONG[c]} | {n:,} | {pct:.2f}% |')
    md += ['', '## Per-firm × category cross-tab (counts)', '',
           '| Firm | HC | MC | HSC | UN | LH | total |',
           '|---|---|---|---|---|---|---|']
    for fk in BIG4:
        cells = [by_firm[fk][c] for c in CATEGORIES]
        total = sum(cells)
        md.append(f'| {LABEL[fk]} | '
                  + ' | '.join(f'{n:,}' for n in cells)
                  + f' | {total:,} |')
    md += ['', '## Per-firm × category cross-tab (% within firm)', '',
           '| Firm | HC % | MC % | HSC % | UN % | LH % |',
           '|---|---|---|---|---|---|']
    for fk in BIG4:
        cells = [by_firm[fk][c] for c in CATEGORIES]
        total = sum(cells) or 1
        md.append(f'| {LABEL[fk]} | '
                  + ' | '.join(f'{100*n/total:.2f}%' for n in cells)
                  + ' |')
    md += ['', '## Document-level (worst-case rule, per Big-4 PDF)', '',
           f'- N unique Big-4 PDFs: {n_docs:,}',
           f'- Mixed-firm PDFs (signatures from >1 Big-4 firm; reported '
           f'separately in the CSV and JSON outputs): {n_mixed_firm:,}',
           '',
           '| Category | Long name | $n$ documents | % |',
           '|---|---|---|---|']
    for c in CATEGORIES:
        n = docs_overall[c]
        pct = 100 * n / n_docs if n_docs else 0.0
        md.append(f'| {c} | {CAT_LONG[c]} | {n:,} | {pct:.2f}% |')
    md += ['', '## Document-level per-firm (single-firm PDFs only)', '',
           '| Firm | HC | MC | HSC | UN | LH | total |',
           '|---|---|---|---|---|---|---|']
    for fk in BIG4:
        cells = [docs_by_firm[fk][c] for c in CATEGORIES]
        total = sum(cells)
        md.append(f'| {LABEL[fk]} | '
                  + ' | '.join(f'{n:,}' for n in cells)
                  + f' | {total:,} |')
    md += ['', '## Files',
           '- `per_signature_counts.csv` -- overall five-way per-signature counts',
           '- `per_firm_category_crosstab.csv` -- per-firm cross-tab',
           '- `per_document_counts.csv` -- document-level aggregation',
           '- `five_way_results.json` -- machine-readable full output',
           ]
    md_path = OUT / 'five_way_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'Report: {md_path}')
 if __name__ == '__main__':
    main()
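The boundary behaviour of the five-way rule and the worst-case aggregation can be exercised without the database. The following is a minimal sketch, not part of the commit: it restates `classify` with the thresholds from Script 42 and adds a hypothetical helper, `document_label`, that mirrors the per-PDF priority logic. The input values are invented test points, not corpus data.

```python
COS_HIGH, COS_LOW = 0.95, 0.837  # cosine cuts, as in Script 42
DH_HIGH, DH_MOD = 5, 15          # dHash cuts, as in Script 42
PRIORITY = {"HC": 0, "MC": 1, "HSC": 2, "UN": 3, "LH": 4}


def classify(cos, dh):
    """Restatement of Script 42's five-way rule; None means unclassifiable."""
    if cos is None:
        return None
    if cos > COS_HIGH:
        if dh is None:
            return None
        if dh <= DH_HIGH:
            return "HC"
        return "MC" if dh <= DH_MOD else "HSC"
    return "UN" if cos > COS_LOW else "LH"


def document_label(signature_labels):
    """Worst-case rule: keep the label with the lowest PRIORITY value."""
    return min(signature_labels, key=PRIORITY.__getitem__)


# Hypothetical (cos, dHash) pairs chosen to sit on or next to each cut.
cases = {
    (0.96, 5): "HC",     # dHash exactly at the HC cut stays HC
    (0.96, 6): "MC",
    (0.96, 15): "MC",    # dHash exactly at the MC cut stays MC
    (0.96, 16): "HSC",
    (0.95, 0): "UN",     # cos exactly 0.95 is not "> 0.95"
    (0.837, 0): "LH",    # cos exactly 0.837 is not "> 0.837"
    (0.90, None): "UN",  # dHash is only required above the cosine cut
    (0.96, None): None,  # missing dHash above the cut: excluded
}
for (cos, dh), expected in cases.items():
    assert classify(cos, dh) == expected, (cos, dh)

assert document_label(["UN", "HC"]) == "HC"
assert document_label(["LH", "UN"]) == "UN"
```

One consequence visible here is that a signature with a missing dHash is still classified as UN or LH when its cosine is at or below 0.95, so the unclassified count covers only signatures missing the cosine, or missing the dHash above the cosine cut.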