gbanyan da455791de Paper A v13 rev8: fusion-review revision (29 items) + verified data analysis
Address all 29 items from the fused reviewer report (Gemini 3.1 Pro +
ChatGPT 5.5 + Opus 4.8): 3 fatal, 4 severe, arbitration A/B, 5 fusion-new,
15 minor. All new numbers computed from signature_analysis.db; nothing
fabricated.

Claim honesty (F1/F3/F4/F7/G3):
- Retract all "139x the floor" comparisons; ICCR -> between-accountant
  specificity proxy throughout; state within-accountant FPR is not
  estimable and ICCR is not even a bound (anti-conservative direction).
- Firm A reframed as quasi-positive known-positive benchmark (not blinded).
- byte-identity recast as prevalence signal, not a recall/sanity check.
- tunable -> single-direction conservativeness dial (no P-R frontier).

New data analysis (verified, bit-reproducible via committed scripts):
- F2/G1 (Sec V-B): 880-PDF imaging-pipeline audit (Table V) - plain scans
  82% (2013) -> 1% (2021); producer strings name scanner hardware
  (Fuji Xerox D125 etc.); substrate transforms at 2020/21 = named confound.
- F5 (Sec IV-C): four robustness checks - pool-size stratification,
  accountant-clustered bootstrap (gap 53.7pp [49.5,57.5]), firm+year FE
  logistic (B/C/D OR 0.06-0.12), leave-one-year-out (gap 53.1-54.9pp).
- byte-identity era split: 30 scan-era (18 Firm A, pipeline-robust) vs
  232 digital-era (detectability-inflated, hedged).
- G5: archive-wide 888 expected chance HC flags [677,1098].
- M4: Figure 3 replaced with real 2D density (n=150,441).

Structure/minor: abstract restructured (M1); operational definition (M2);
interview disclaimer (M3); Threats to Validity subsection (M8); review
protocol framed as design not evidence (M9); N reconciliations (M10/M11);
Table II-c 2020-23 five-way (M12); Section refs, American spelling,
notation table (M5/M13/M15); reference URLs verified (M14).

Open (author-only): placeholders (M13), II-b/IV table merge (M15).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Qn59FdF9JMyfFg3sjcUNNG
2026-06-23 14:36:51 +08:00
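
The accountant-clustered bootstrap named under F5 can be illustrated with a minimal sketch. It is not the committed analysis script. It assumes each record is a hypothetical `(accountant_id, group, flagged)` tuple, that the "gap" is the flag-rate difference between group `'A'` and group `'other'` in percentage points, and that a percentile interval is used; all function names are illustrative.

```python
import random
from collections import defaultdict


def gap_pp(records):
    """Flag-rate difference (group 'A' minus 'other') in percentage points.

    Returns None when either group is absent from the records.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for _acct, group, flagged in records:
        total[group] += 1
        hits[group] += flagged
    if not total['A'] or not total['other']:
        return None
    return 100.0 * (hits['A'] / total['A'] - hits['other'] / total['other'])


def clustered_bootstrap_ci(records, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for gap_pp, resampling whole accountants (clusters)."""
    by_acct = defaultdict(list)
    for rec in records:
        by_acct[rec[0]].append(rec)
    accts = sorted(by_acct)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = []
    for _ in range(n_boot):
        sample = []
        # draw accountants with replacement; keep all of each one's records
        for acct in rng.choices(accts, k=len(accts)):
            sample.extend(by_acct[acct])
        g = gap_pp(sample)
        if g is not None:  # skip resamples that lost one of the two groups
            stats.append(g)
    if not stats:
        raise ValueError("no bootstrap resample contained both groups")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return gap_pp(records), lo, hi
```

Resampling accountants rather than individual signatures keeps each accountant's records together, so within-accountant correlation is reflected in the width of the interval.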


"""Imaging-pipeline audit (Table V) + byte-identity era split (Section V-B).
Classifies a stratified sample of report PDFs (20 per firm-year) as scanned / OCR'd /
digital-native using an extractable-text heuristic, and tabulates the result by year.
Also reports the scan-era vs digital-era split of the 262 byte-identical signatures.
Requires: PyMuPDF (fitz); signature_analysis.db; original PDFs under total-pdf/.
"""
import glob
import os
import sqlite3

import fitz  # PyMuPDF
from collections import defaultdict
fitz.TOOLS.mupdf_display_errors(False)
DB = "/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db"
PDF_ROOT = "/Volumes/NV2/PDF-Processing/total-pdf"
BIG4 = ('勤業眾信聯合', '資誠聯合', '安侯建業聯合', '安永聯合')
FMAP = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}
con = sqlite3.connect(DB)
cur = con.cursor()
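# --- stratified sample: 20 distinct PDFs per firm-year ---
# DENSE_RANK gives every signature row of the same PDF the same rank, so DISTINCT
# can collapse them; ROW_NUMBER would number each signature row separately and
# the same PDF could be counted more than once.
cur.execute(f"""
    WITH d AS (
        SELECT DISTINCT excel_firm, substr(year_month,1,4) yr, source_pdf,
               DENSE_RANK() OVER (PARTITION BY excel_firm, substr(year_month,1,4)
                                  ORDER BY source_pdf) rn
        FROM signatures
        WHERE excel_firm IN ({','.join(['?']*4)}) AND source_pdf IS NOT NULL)
    SELECT excel_firm, yr, source_pdf FROM d WHERE rn<=20 ORDER BY yr""", BIG4)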
rows = cur.fetchall()
idx = {os.path.basename(p): p for p in glob.glob(PDF_ROOT + '/*/*.pdf')}
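def classify(path):
    """Return 'DIGITAL', 'OCR', or 'SCAN' for a PDF; None if it cannot be opened."""
    try:
        doc = fitz.open(path)
    except Exception:
        return None
    try:
        # total stripped characters of extractable text on the first (up to) four pages
        text = sum(len(doc[i].get_text().strip()) for i in range(min(len(doc), 4)))
    finally:
        doc.close()
    return 'DIGITAL' if text > 2000 else ('OCR' if text > 200 else 'SCAN')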
# byyear[year][class] -> number of sampled PDFs
byyear = defaultdict(lambda: defaultdict(int))
for firm, yr, fn in rows:
    p = idx.get(fn)
    if not p:
        continue  # sampled PDF not found under PDF_ROOT
    k = classify(p)
    if k:
        byyear[yr][k] += 1
print("year | n | scan% | ocr% | digital%")
for yr in sorted(byyear):
    d = byyear[yr]
    n = sum(d.values())
    # percentages are floored to whole numbers
    print(f"{yr} | {n} | {100*d['SCAN']//n} | {100*d['OCR']//n} | {100*d['DIGITAL']//n}")
# --- byte-identity era split ---
cur.execute(f"""
SELECT CASE WHEN year_month<'202101' THEN 'scan-era' ELSE 'digital-era' END era,
CASE excel_firm WHEN '勤業眾信聯合' THEN 'A' WHEN '安侯建業聯合' THEN 'B'
WHEN '資誠聯合' THEN 'C' WHEN '安永聯合' THEN 'D' END firm,
COUNT(*) n
FROM signatures WHERE is_valid=1 AND pixel_identical_to_closest=1
AND excel_firm IN ({','.join(['?']*4)})
GROUP BY era, firm ORDER BY era, firm""", BIG4)
print("\nbyte-identical by era x firm:")
for era, firm, n in cur.fetchall():
    print(f" {era} | {firm} | {n}")
con.close()
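
The text-length rule inside `classify` can be isolated as a pure function, which makes the two thresholds checkable without any PDFs on disk. The following is a minimal sketch under that assumption; `classify_by_text_length` and its parameter names are hypothetical and not part of the script, and the defaults simply mirror the 2000 and 200 character cut-offs used above.

```python
def classify_by_text_length(n_chars, digital_min=2000, ocr_min=200):
    """Map an extractable-character count to a pipeline class.

    Both comparisons are strict, matching the script's rule:
    more than digital_min -> 'DIGITAL', more than ocr_min -> 'OCR',
    otherwise 'SCAN'.
    """
    if n_chars > digital_min:
        return 'DIGITAL'
    if n_chars > ocr_min:
        return 'OCR'
    return 'SCAN'


def floor_percentages(counts):
    """Floored whole-number percentages per class, as printed in the year table."""
    n = sum(counts.values())
    if n == 0:
        return {}
    return {k: 100 * v // n for k, v in counts.items()}
```

Because the comparisons are strict, a document with exactly 2000 extractable characters falls in the OCR class and one with exactly 200 falls in the SCAN class.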