Apply Phase 5 round-3 splice-blocker fixes from codex round-8

Closes the three concrete splice blockers codex round-8 surfaced
in the post-round-2 drafts, plus the binary-collapse terminology
residue. No empirical changes.

- Abstract trimmed 261 -> 247 words (3 under IEEE Access <=250
  target). Cut "technically trivial and visually invisible,"
  (S1 motivational redundancy) and the within-firm-rate
  parenthetical "(Firm A 98.8%; Firms B/C/D 76.7-83.7%)" plus
  "between" connector; preserved the corrected 77-99% any-pair
  headline so the M3 substance survives.

- §IV-J Table XV sample-size footnote (line 177) corrected:
  round-2 misclassified §IV-M.5 as descriptor-complete n=150,442;
  Script 44 / Tables XXIV-XXV actually use vector-complete
  n=150,453, same as §IV-M.2 Table XXI (Script 40b) and §IV-M.3
  Table XXII (Script 43). New footnote distinguishes
  descriptor-complete (§IV-D through §IV-J) from
  vector/pair-recomputed (§IV-M.2/M.3/M.5; Scripts 40b/43/44).

- §IV-I (line 161) stale cross-reference: "§IV-M Table XVI" was
  the K=3 firm cross-tab (descriptive), not the v4-new ICCR
  calibration. Replaced with "§IV-M Tables XXI-XXVI" — the full
  ICCR calibration block. Pre-existing error exposed by the
  round-2 cascade.

- §III line 131 + §IV Table XI line 104 binary-collapse label:
  "replicated vs not-replicated" -> "replication-dominated vs
  less-replication-dominated" for consistency with the K=3
  descriptor-position framing. "Replicated class" preserved
  where it refers to byte-identical positive-anchor ground
  truth (§III-K.4, §IV-H lines 143/153/155).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-14 17:17:30 +08:00
parent 4ee2efb5bb
commit 4a6f9c5c98
3 changed files with 5 additions and 5 deletions
+1 -1
View File
@@ -8,7 +8,7 @@
> *IEEE Access target: <= 250 words, single paragraph.*
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes reusing a stored signature image across reports — through administrative stamping or firm-level electronic signing — technically trivial and visually invisible, undermining individualized attestation. We build an end-to-end pipeline detecting such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (20132023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is $0.62$ versus $0.09$$0.16$ at Firms B/C/D after pool-size adjustment, and under the deployed any-pair rule between $77\%$ and $99\%$ of inter-CPA collisions concentrate within the source firm (Firm A $98.8\%$; Firms B/C/D $76.7$$83.7\%$) — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes reusing a stored signature image across reports — through administrative stamping or firm-level electronic signing — undermining individualized attestation. We build an end-to-end pipeline detecting such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (20132023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is $0.62$ versus $0.09$$0.16$ at Firms B/C/D after pool-size adjustment, and under the deployed any-pair rule $77$$99\%$ of inter-CPA collisions concentrate within the source firm — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
---