Paper A v13 rev8: fusion-review revision (29 items) + verified data analysis

Address all 29 items from the fused reviewer report (Gemini 3.1 Pro + ChatGPT 5.5 + Opus 4.8): 3 fatal, 4 severe, arbitration A/B, 5 fusion-new, 15 minor. All new numbers computed from signature_analysis.db; nothing fabricated.

Claim honesty (F1/F3/F4/F7/G3):
- Retract all "139x the floor" comparisons; ICCR -> between-accountant specificity proxy throughout; state within-accountant FPR is not estimable and ICCR is not even a bound (anti-conservative direction).
- Firm A reframed as quasi-positive known-positive benchmark (not blinded).
- byte-identity recast as prevalence signal, not a recall/sanity check.
- tunable -> single-direction conservativeness dial (no P-R frontier).

New data analysis (verified, bit-reproducible via committed scripts):
- F2/G1 (Sec V-B): 880-PDF imaging-pipeline audit (Table V) - plain scans 82% (2013) -> 1% (2021); producer strings name scanner hardware (Fuji Xerox D125 etc.); substrate transforms at 2020/21 = named confound.
- F5 (Sec IV-C): four robustness checks - pool-size stratification, accountant-clustered bootstrap (gap 53.7pp [49.5,57.5]), firm+year FE logistic (B/C/D OR 0.06-0.12), leave-one-year-out (gap 53.1-54.9pp).
- byte-identity era split: 30 scan-era (18 Firm A, pipeline-robust) vs 232 digital-era (detectability-inflated, hedged).
- G5: archive-wide 888 expected chance HC flags [677,1098].
- M4: Figure 3 replaced with real 2D density (n=150,441).

Structure/minor: abstract restructured (M1); operational definition (M2); interview disclaimer (M3); Threats to Validity subsection (M8); review protocol framed as design not evidence (M9); N reconciliations (M10/M11); Table II-c 2020-23 five-way (M12); Section refs, American spelling, notation table (M5/M13/M15); reference URLs verified (M14).

Open (author-only): placeholders (M13), II-b/IV table merge (M15).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Qn59FdF9JMyfFg3sjcUNNG
2026-06-23 14:36:51 +08:00
parent 61dd2dcaad
commit da455791de
7 changed files with 438 additions and 81 deletions
@@ -0,0 +1,63 @@
+"""Imaging-pipeline audit (Table V) + byte-identity era split (Section V-B).
+Classifies a stratified sample of report PDFs as scanned / OCR'd / digital-native
+from embedded metadata + extractable-text heuristic, and tabulates by year and firm.
+Also reports the scan-era vs digital-era split of the 262 byte-identical signatures.
+
+Requires: PyMuPDF (fitz); signature_analysis.db; original PDFs under total-pdf/.
+"""
+import fitz, os, glob, sqlite3
+from collections import defaultdict
+
+fitz.TOOLS.mupdf_display_errors(False)
+DB = "/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db"
+PDF_ROOT = "/Volumes/NV2/PDF-Processing/total-pdf"
+BIG4 = ('勤業眾信聯合', '資誠聯合', '安侯建業聯合', '安永聯合')
+FMAP = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
+
+con = sqlite3.connect(DB); cur = con.cursor()
+
+# --- stratified sample: 20 distinct PDFs per firm-year (DENSE_RANK so rows from one PDF share a rank) ---
+cur.execute(f"""
+WITH d AS (SELECT DISTINCT excel_firm, substr(year_month,1,4) yr, source_pdf,
+  DENSE_RANK() OVER (PARTITION BY excel_firm, substr(year_month,1,4) ORDER BY source_pdf) rn
+  FROM signatures WHERE excel_firm IN ({','.join(['?']*4)}) AND source_pdf IS NOT NULL)
+SELECT excel_firm, yr, source_pdf FROM d WHERE rn<=20 ORDER BY yr""", BIG4)
+rows = cur.fetchall()
+idx = {os.path.basename(p): p for p in glob.glob(PDF_ROOT + '/*/*.pdf')}
+
+def classify(path):
+    # Total extractable characters over the first four pages; None if the PDF cannot be read.
+    try:
+        with fitz.open(path) as doc:
+            text = sum(len(doc[i].get_text().strip()) for i in range(min(len(doc), 4)))
+    except Exception:
+        return None
+    return 'DIGITAL' if text > 2000 else ('OCR' if text > 200 else 'SCAN')
+
+byyear = defaultdict(lambda: defaultdict(int))
+for firm, yr, fn in rows:
+    p = idx.get(fn)
+    if not p:
+        continue
+    k = classify(p)
+    if k:
+        byyear[yr][k] += 1
+
+print("year | n | scan% | ocr% | digital%")
+for yr in sorted(byyear):
+    d = byyear[yr]; n = sum(d.values())
+    print(f"{yr} | {n} | {100*d['SCAN']//n} | {100*d['OCR']//n} | {100*d['DIGITAL']//n}")
+
+# --- byte-identity era split ---
+cur.execute(f"""
+SELECT CASE WHEN substr(year_month,1,4)<'2021' THEN 'scan-era' ELSE 'digital-era' END era,
+  CASE excel_firm WHEN '勤業眾信聯合' THEN 'A' WHEN '安侯建業聯合' THEN 'B'
+                  WHEN '資誠聯合' THEN 'C' WHEN '安永聯合' THEN 'D' END firm,
+  COUNT(*) n
+FROM signatures WHERE is_valid=1 AND pixel_identical_to_closest=1
+  AND excel_firm IN ({','.join(['?']*4)})
+GROUP BY era, firm ORDER BY era, firm""", BIG4)
+print("\nbyte-identical by era x firm:")
+for era, firm, n in cur.fetchall():
+    print(f"  {era} | {firm} | {n}")
+con.close()
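
The stratified sample above picks PDFs per firm-year with a window function over the signatures table. How that behaves when one PDF contributes several signature rows depends on the ranking function, so a minimal sketch on an in-memory database may help when checking the sample size. The toy rows, the helper name `sample`, and the reduced three-column schema are assumptions for illustration only; they are not taken from signature_analysis.db.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical, reduced schema: only the columns the sampling query touches.
con.execute("CREATE TABLE signatures (excel_firm TEXT, year_month TEXT, source_pdf TEXT)")
# Toy data: one firm-year, three PDFs, two signature rows per PDF.
toy_rows = [("A", "201301", f"r{i}.pdf") for i in (1, 1, 2, 2, 3, 3)]
con.executemany("INSERT INTO signatures VALUES (?, ?, ?)", toy_rows)


def sample(rank_fn, k):
    """Return the source_pdf values kept when ranking with rank_fn and keeping rn <= k."""
    sql = f"""
    WITH d AS (SELECT DISTINCT excel_firm, substr(year_month,1,4) yr, source_pdf,
      {rank_fn}() OVER (PARTITION BY excel_firm, substr(year_month,1,4)
                        ORDER BY source_pdf) rn
      FROM signatures)
    SELECT source_pdf FROM d WHERE rn <= ? ORDER BY source_pdf"""
    return [r[0] for r in con.execute(sql, (k,))]


# ROW_NUMBER numbers every signature row, so DISTINCT has nothing to collapse.
row_number_sample = sample("ROW_NUMBER", 2)
# DENSE_RANK gives rows from the same PDF the same rank, so DISTINCT merges them.
dense_rank_sample = sample("DENSE_RANK", 2)

print(row_number_sample)  # ['r1.pdf', 'r1.pdf']
print(dense_rank_sample)  # ['r1.pdf', 'r2.pdf']
con.close()
```

With two rows per PDF, a row-numbered cut-off of k yields about k/2 distinct files, while a dense-ranked cut-off yields k. Window functions require SQLite 3.25 or later.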