Paper A v13 rev8: fusion-review revision (29 items) + verified data analysis
Address all 29 items from the fused reviewer report (Gemini 3.1 Pro + ChatGPT 5.5 + Opus 4.8): 3 fatal, 4 severe, arbitration A/B, 5 fusion-new, 15 minor. All new numbers computed from signature_analysis.db; nothing fabricated.

Claim honesty (F1/F3/F4/F7/G3):
- Retract all "139x the floor" comparisons; ICCR -> between-accountant specificity proxy throughout; state within-accountant FPR is not estimable and ICCR is not even a bound (anti-conservative direction).
- Firm A reframed as quasi-positive known-positive benchmark (not blinded).
- byte-identity recast as prevalence signal, not a recall/sanity check.
- tunable -> single-direction conservativeness dial (no P-R frontier).

New data analysis (verified, bit-reproducible via committed scripts):
- F2/G1 (Sec V-B): 880-PDF imaging-pipeline audit (Table V) - plain scans 82% (2013) -> 1% (2021); producer strings name scanner hardware (Fuji Xerox D125 etc.); substrate transforms at 2020/21 = named confound.
- F5 (Sec IV-C): four robustness checks - pool-size stratification, accountant-clustered bootstrap (gap 53.7pp [49.5,57.5]), firm+year FE logistic (B/C/D OR 0.06-0.12), leave-one-year-out (gap 53.1-54.9pp).
- byte-identity era split: 30 scan-era (18 Firm A, pipeline-robust) vs 232 digital-era (detectability-inflated, hedged).
- G5: archive-wide 888 expected chance HC flags [677,1098].
- M4: Figure 3 replaced with real 2D density (n=150,441).

Structure/minor: abstract restructured (M1); operational definition (M2); interview disclaimer (M3); Threats to Validity subsection (M8); review protocol framed as design not evidence (M9); N reconciliations (M10/M11); Table II-c 2020-23 five-way (M12); Section refs, American spelling, notation table (M5/M13/M15); reference URLs verified (M14).

Open (author-only): placeholders (M13), II-b/IV table merge (M15).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Qn59FdF9JMyfFg3sjcUNNG
@@ -0,0 +1,63 @@
"""Imaging-pipeline audit (Table V) + byte-identity era split (Section V-B).
Classifies a stratified sample of report PDFs as scanned / OCR'd / digital-native
using an extractable-text heuristic (characters on the first four pages), and tabulates by year.
Also reports the scan-era vs digital-era split of the 262 byte-identical signatures.

Requires: PyMuPDF (fitz); signature_analysis.db; original PDFs under total-pdf/.
"""
import fitz, os, glob, sqlite3
from collections import defaultdict

fitz.TOOLS.mupdf_display_errors(False)
DB = "/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db"
PDF_ROOT = "/Volumes/NV2/PDF-Processing/total-pdf"
BIG4 = ('勤業眾信聯合', '資誠聯合', '安侯建業聯合', '安永聯合')
FMAP = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}

con = sqlite3.connect(DB); cur = con.cursor()

# --- stratified sample: 20 distinct PDFs per firm-year ---
cur.execute(f"""
    WITH d AS (SELECT DISTINCT excel_firm, substr(year_month,1,4) yr, source_pdf,
                 DENSE_RANK() OVER (PARTITION BY excel_firm, substr(year_month,1,4) ORDER BY source_pdf) rn  -- ties share a rank, so DISTINCT can collapse repeated PDFs
               FROM signatures WHERE excel_firm IN ({','.join(['?']*4)}) AND source_pdf IS NOT NULL)
    SELECT excel_firm, yr, source_pdf FROM d WHERE rn<=20 ORDER BY yr""", BIG4)
rows = cur.fetchall()
idx = {os.path.basename(p): p for p in glob.glob(PDF_ROOT + '/*/*.pdf')}

def classify(path):
    try:
        doc = fitz.open(path)
    except Exception:
        return None
    text = sum(len(doc[i].get_text().strip()) for i in range(min(len(doc), 4)))
    doc.close()
    return 'DIGITAL' if text > 2000 else ('OCR' if text > 200 else 'SCAN')

byyear = defaultdict(lambda: defaultdict(int))
for firm, yr, fn in rows:
    p = idx.get(fn)
    if not p:
        continue
    k = classify(p)
    if k:
        byyear[yr][k] += 1

print("year | n | scan% | ocr% | digital%")
for yr in sorted(byyear):
    d = byyear[yr]; n = sum(d.values())
    print(f"{yr} | {n} | {100*d['SCAN']//n} | {100*d['OCR']//n} | {100*d['DIGITAL']//n}")

# --- byte-identity era split ---
cur.execute(f"""
    SELECT CASE WHEN year_month<'202101' THEN 'scan-era' ELSE 'digital-era' END era,
           CASE excel_firm WHEN '勤業眾信聯合' THEN 'A' WHEN '安侯建業聯合' THEN 'B'
                           WHEN '資誠聯合' THEN 'C' WHEN '安永聯合' THEN 'D' END firm,
           COUNT(*) n
    FROM signatures WHERE is_valid=1 AND pixel_identical_to_closest=1
      AND excel_firm IN ({','.join(['?']*4)})
    GROUP BY era, firm ORDER BY era, firm""", BIG4)
print("\nbyte-identical by era x firm:")
for era, firm, n in cur.fetchall():
    print(f" {era} | {firm} | {n}")
con.close()
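
The text-length rule in classify() can be exercised without PyMuPDF or the PDF archive. The following is a minimal sketch, assuming the same two thresholds as the script (more than 2000 characters is digital-native, more than 200 is OCR'd, otherwise a plain scan). The names `classify_text_length` and `share_table` are hypothetical helpers and not part of this commit, and the sample character counts are illustrative, not values from the audited PDFs.

```python
from collections import Counter

# Thresholds copied from classify() in the committed script. Everything else
# here (function names, sample inputs) is a hypothetical illustration.
DIGITAL_MIN_CHARS = 2000
OCR_MIN_CHARS = 200


def classify_text_length(n_chars):
    """Map an extractable-character count to SCAN / OCR / DIGITAL."""
    if n_chars > DIGITAL_MIN_CHARS:
        return "DIGITAL"
    if n_chars > OCR_MIN_CHARS:
        return "OCR"
    return "SCAN"


def share_table(char_counts):
    """Return {class: percent}, rounded to one decimal place."""
    labels = Counter(classify_text_length(n) for n in char_counts)
    total = sum(labels.values())
    if total == 0:
        # No classifiable documents: report zero shares instead of dividing by zero.
        return {k: 0.0 for k in ("SCAN", "OCR", "DIGITAL")}
    return {k: round(100 * labels[k] / total, 1) for k in ("SCAN", "OCR", "DIGITAL")}


if __name__ == "__main__":
    # Boundary values 200 and 2000 fall in the lower class (strict '>').
    print(share_table([0, 150, 200, 201, 2000, 2001, 5000, 12000]))
```

Unlike the script's integer division (`//`), which truncates each share so a row may sum to less than 100, this sketch rounds to one decimal place. That choice belongs to the sketch and would change the printed table if adopted.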