Files
pdf_signature_extraction/paper/paper_a_abstract_v3.md
T
gbanyan 3672c9343e Phase 6 round-6: soften firm-heterogeneity framing, fix DOCX table render
Framing softening (per partner tone decision: own the limitation rather
than defend the strong claim).

Abstract: "Firm heterogeneity is decisive ... consistent with firm-level
template-like reuse" -> "The framework surfaces pronounced firm-level
heterogeneity ... consistent with firm-level template-like reuse but not
independently diagnostic, since descriptor-only data cannot separate
reuse from digitisation-pipeline or signing-style homogeneity within a
firm; we report it as a scope limitation rather than a mechanism finding."

S V-H Limitations: new bullet "Mechanism attribution for the firm-level
heterogeneity is not identifiable from descriptor-only data." enumerates
three non-mutually-exclusive firm-level mechanisms (template-like reuse /
digitisation-pipeline homogeneity / signing-style homogeneity), notes the
(cosine, dHash) descriptor pair is by construction indifferent to which
mechanism generated a near-identical pair, and lists what additional data
would be needed for attribution.

S VI Conclusion items (3) and (4): "firm heterogeneity quantification" ->
"firm-level heterogeneity surfaced by the framework ... reported as a
framework-discriminative observation rather than a mechanism finding";
item (4) expanded from template/stamp/document-production reuse alone to
the three-mechanism scope, with explicit "not independently establishing"
and S V-H cross-reference.

DOCX export fix (export_v3.py): add missing LaTeX-to-Unicode tokens
(\checkmark, \lvert/\rvert, \lVert/\rVert, \in, \notin, \max, \min,
\log, \ln, \exp, \bullet) that were silently dropping content from
Table III rows 2-4 (integer-jitter robustness check marks empty) and
Table XVIII drift column (|Delta| empty).

Rebuild Paper_A_IEEE_Access_Draft_v3.docx via export_v3.py and install
copy as Paper_A_IEEE_Access_Draft_v4.0_20260515.docx (replaces prior
pandoc-built v4 DOCX which had empty cells in every table header with
LaTeX math and inconsistent column widths). All 43 tables now have
non-empty cells with sub/superscript runs.

Mirrored in paper_a_v4_combined.md for consistency with the single-file
combined source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:16:05 +08:00

8 lines
2.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Abstract
<!-- IEEE Access target: <= 250 words, single paragraph -->
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports — through administrative stamping or firm-level electronic signing — thereby undermining individualized attestation. We build an end-to-end pipeline for screening such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (20132023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). The framework surfaces pronounced firm-level heterogeneity: Firm A's per-document HC+MC ICCR is $0.62$ versus $0.09$$0.16$ at Firms B/C/D (gap persists after pool-size adjustment), and $77$$99\%$ of inter-CPA collisions concentrate within the source firm. This contrast is consistent with firm-level template-like reuse but not independently diagnostic, since descriptor-only data cannot separate reuse from digitisation-pipeline or signing-style homogeneity within a firm; we report it as a scope limitation rather than a mechanism finding. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
<!-- Word count: 281 -->