Apply Phase 5 round-3 splice-blocker fixes from codex round-8
Closes the three concrete splice blockers codex round-8 surfaced in the post-round-2 drafts, plus the binary-collapse terminology residue. No empirical changes. - Abstract trimmed 261 -> 247 words (3 under IEEE Access <=250 target). Cut "technically trivial and visually invisible," (S1 motivational redundancy) and the within-firm-rate parenthetical "(Firm A 98.8%; Firms B/C/D 76.7-83.7%)" plus "between" connector; preserved the corrected 77-99% any-pair headline so the M3 substance survives. - §IV-J Table XV sample-size footnote (line 177) corrected: round-2 misclassified §IV-M.5 as descriptor-complete n=150,442; Script 44 / Tables XXIV-XXV actually use vector-complete n=150,453, same as §IV-M.2 Table XXI (Script 40b) and §IV-M.3 Table XXII (Script 43). New footnote distinguishes descriptor-complete (§IV-D through §IV-J) from vector/pair-recomputed (§IV-M.2/M.3/M.5; Scripts 40b/43/44). - §IV-I (line 161) stale cross-reference: "§IV-M Table XVI" was the K=3 firm cross-tab (descriptive), not the v4-new ICCR calibration. Replaced with "§IV-M Tables XXI-XXVI" — the full ICCR calibration block. Pre-existing error exposed by the round-2 cascade. - §III line 131 + §IV Table XI line 104 binary-collapse label: "replicated vs not-replicated" -> "replication-dominated vs less-replication-dominated" for consistency with the K=3 descriptor-position framing. "Replicated class" preserved where it refers to byte-identical positive-anchor ground truth (§III-K.4, §IV-H lines 143/153/155). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -8,7 +8,7 @@
|
||||
|
||||
> *IEEE Access target: <= 250 words, single paragraph.*
|
||||
|
||||
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes reusing a stored signature image across reports — through administrative stamping or firm-level electronic signing — technically trivial and visually invisible, undermining individualized attestation. We build an end-to-end pipeline detecting such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is $0.62$ versus $0.09$–$0.16$ at Firms B/C/D after pool-size adjustment, and under the deployed any-pair rule between $77\%$ and $99\%$ of inter-CPA collisions concentrate within the source firm (Firm A $98.8\%$; Firms B/C/D $76.7$–$83.7\%$) — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
|
||||
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes reusing a stored signature image across reports — through administrative stamping or firm-level electronic signing — undermining individualized attestation. We build an end-to-end pipeline detecting such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is $0.62$ versus $0.09$–$0.16$ at Firms B/C/D after pool-size adjustment, and under the deployed any-pair rule $77$–$99\%$ of inter-CPA collisions concentrate within the source firm — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
|
||||
|
||||
---
|
||||
|
||||
|
||||
Reference in New Issue
Block a user