Files
pdf_signature_extraction/paper/codex_review_gpt54_v3.md
T
gbanyan 9d19ca5a31 Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE
  antimode, BD/McCrary, and Beta mixture EM on accountant-level means.
  Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979,
  LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at
  accountant level (consistent with smooth clustering, not sharp
  discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions,
  used at signature all-pairs level) vs KDE antimode (single-distribution
  local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds CPA-level 70/30
  held-out fold. Calibration thresholds derived from 70% only; heldout
  rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61%
  [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10
  signatures (9 CPAs excluded for insufficient sample). Reconciled across
  intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature
  label determines document label).

## Pixel-identity validation strengthened
- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces
  the original n=35 same-CPA low-similarity negative which had untenable
  Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset"
  language per codex feedback (byte-identical positives are a subset, not
  a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods"
  throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of
  unimodality; rejection is consistent with but not a direct test of
  bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to
  "replication-dominated" in prior commit; this commit strengthens that
  framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity"
  (BD/McCrary non-transition at accountant level rules out sharp discrete
  boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP
  or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract,
  Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to
  XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no
  pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired
  it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1
markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00

1751 lines
242 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
Reading additional input from stdin...
OpenAI Codex v0.121.0 (research preview)
--------
workdir: /Volumes/NV2/pdf_recognize
model: gpt-5.4
provider: openai
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019dabc7-b94b-7d71-81f6-764d687c3d5a
--------
user
You are conducting a peer review for IEEE Access of my paper v3.
Read these files in /Volumes/NV2/pdf_recognize/paper/ :
paper_a_abstract_v3.md, paper_a_impact_statement_v3.md, paper_a_introduction_v3.md, paper_a_related_work_v3.md, paper_a_methodology_v3.md, paper_a_results_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md, paper_a_references_v3.md.
Paper title: "Automated Identification of Non-Hand-Signed Auditor Signatures in Large-Scale Financial Audit Reports: A Dual-Descriptor Framework with Three-Method Convergent Thresholding"
Key facts to keep in mind while reviewing:
- Pipeline: VLM (Qwen2.5-VL 32B) page-ID, YOLOv11n detection, ResNet-50 2048-dim features (L2-norm, no fine-tune), dual descriptors = cosine + dHash (9x8, 64-bit, Hamming distance).
- Dataset: 90,282 PDF audit reports from TWSE MOPS 2013-2023, 182,328 signatures, 758 CPAs.
- Three convergent threshold methods: (a) KDE antimode with Hartigan dip test, (b) Burgstahler-Dichev / McCrary discontinuity, (c) 2/3-comp Beta mixture via EM + parallel logit-Gaussian mixture as White 1982 quasi-MLE robustness check.
- Accountant-level: Gaussian mixture on (mean cosine, mean dHash), BIC picks K=3. C1 cos=0.983 dh=2.41 w=0.21 n=141 (Firm A 139/141). C2 cos=0.954 dh=7.00 w=0.51 n=361 (other Big-4). C3 cos=0.928 dh=11.17 w=0.28 n=184 (smaller firms). 2-comp marginal crossings cos=0.945 dh=8.10.
- Firm A framed as "replication-dominated" not pure: interviews confirm most partners use replication but a minority may still hand-sign. Consistent with 92.5% Firm A cos>0.95 and 32/180 Firm A CPAs in C2.
- Pixel-identity validation: 310 sigs with byte-identical nearest same-CPA match as gold positives. Low-cosine anchor (cos<0.70) is only n=35 gold negatives so FAR is order-of-magnitude only. Firm A anchor rates: cos>0.95=92.5%, dual rule cos>0.95 AND dh<=8=89.95%.
- Classification: 5 tiers (high/moderate-confidence non-hand-signed, high style consistency, uncertain, likely hand-signed), thresholds from Firm A (dHash median=5, P95=15).
- Ablation: ResNet-50 vs VGG-16 vs EfficientNet-B0; ResNet-50 chosen on balance.
Produce a structured IEEE-Access-style peer review covering, in priority order (most critical first):
1) Overall assessment and recommendation (accept / minor revision / major revision / reject).
2) Technical soundness of the three-method threshold framework and the accountant-level GMM; focus on whether the "continuous signature-level spectrum vs discrete accountant-level behaviour" interpretation is supported by the reported evidence.
3) Whether the "replication-dominated Firm A" framing is justified AND whether the paper does enough to prevent circular validation given Firm A is used both to calibrate thresholds and to anchor validation.
4) Weaknesses in the pixel-identity validation: is n=35 negative anchor sufficient, does using pixel-identity as gold positive bias recall estimates, what would strengthen it.
5) Clarity and organization of Sections III / IV / V. Flag terminology inconsistency, undefined results, and any table/figure referenced but not clearly explained.
6) Specific sentence-level, table-level, numerical issues to fix before submission. Include exact quotes or table-cell references.
Be direct and specific. Cite exact sections, table numbers, and line quotes. Do not restate the paper content; focus on review opinions and concrete fixes.
codex
Im reviewing the manuscript files directly and will anchor the review to the text as written. First Im loading the section drafts with line numbers and scanning for tables, figures, and threshold/validation claims so I can cite specific locations rather than paraphrase.
→ Load all manuscript section files with line numbers and identify tables/figures/results claims
• Evaluate technical claims around thresholding, accountant-level GMM, and Firm A framing
• Draft IEEE-Access-style review with prioritized findings and exact citations
exec
/bin/zsh -lc 'wc -l /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md' in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
17 /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md
30 /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md
101 /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md
9 /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md
86 /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md
229 /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md
87 /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md
104 /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md
252 /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md
915 total
exec
/bin/zsh -lc "rg --files /Volumes/NV2/pdf_recognize/paper | rg 'paper_a_.*_v3\\.md"'$'"'" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md
/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md
/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md
exec
/bin/zsh -lc "rg -n 'Table|Figure|Fig\\.|Section|threshold|Firm A|pixel|identity|Hartigan|dip|Burgstahler|McCrary|Beta mixture|Gaussian mixture|GMM|dHash|cosine|validation|anchor|Tier|tier|KDE|antimode|quasi-MLE|White 1982' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:6:Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:12:Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:14:Third, we introduced a three-method convergent threshold framework combining KDE antimode (with a Hartigan dip test as formal bimodality check), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check).
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:15:Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:16:The substantive reading is that *pixel-level output quality* is continuous while *individual signing behavior* is close to discrete.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:18:Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor that requires no manual annotation.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19:This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:29:The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:30:Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:9:The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash).
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10:For threshold determination we apply three statistically independent methods---Kernel Density antimode with a Hartigan dip test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level.
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best $K = 3$), reflecting that individual signing *behavior* is close to discrete even when pixel-level output *quality* is continuous.
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12:The accountant-level 2-component crossings yield principled thresholds (cosine $= 0.945$, dHash $= 8.10$).
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:13:A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with interview and visual evidence supporting majority non-hand-signing and a minority of hand-signers.
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14:Validation against 310 pixel-identical signature pairs and a low-similarity negative anchor yields perfect recall at all candidate thresholds.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:7:In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:12:The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:16:The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method convergent analysis and the Hartigan dip test (Sections IV-D and IV-E).
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:18:At the per-signature level, the distribution of best-match cosine similarity is *not* cleanly bimodal.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:19:Firm A's signature-level cosine is formally unimodal (dip test $p = 0.17$) with a long left tail.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:20:The all-CPA signature-level cosine is multimodal ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:21:The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:26:The distribution of per-accountant mean cosine (and mean dHash) is unambiguously multimodal.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:27:A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:28:The two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:30:The substantive interpretation is simple: *pixel-level output quality* is continuous, but *individual signing behavior* is close to discrete.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:32:Methodologically, this implies that the three-method convergent framework (KDE antimode / BD-McCrary / Beta mixture) is meaningful at the level where the mixture structure is statistically supported---namely the accountant level---and serves as diagnostic evidence rather than a primary classification boundary at the signature level.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:34:## C. Firm A as a Replication-Dominated, Not Pure, Population
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:36:A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:37:Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:40:First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:41:Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42:Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:45:We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:49:Within the 71,656 documents exceeding cosine $0.95$, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:50:A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:52:The 7.2% classified as "high style consistency" (cosine $> 0.95$ but dHash $> 15$) are particularly informative.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:53:Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:60:The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:65:Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:70:A further methodological contribution is the use of byte-level pixel identity as an annotation-free gold positive.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:71:Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:72:In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:73:Paired with a conservative low-similarity anchor, this yields EER-style validation without the stratified manual annotation that would otherwise be the default methodological choice.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:81:The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), and the low-similarity anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:82:The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:88:In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:92:While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:97:Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:98:This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:12:From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:20:This framing presupposes that the central threat is identity fraud.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:21:In our context, identity is not in question; the CPA is indeed the legitimate signer.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:26:Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:27:Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28:A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, $F_1$, and equal-error-rate metrics that prevail in the biometric-verification literature.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:31:Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:34:From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:41:(4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance;
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:42:(5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:43:(6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:48:By requiring convergent evidence from both descriptors, we can differentiate *style consistency* (high cosine but divergent dHash) from *image reproduction* (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:51:One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:52:Structured interviews with multiple Firm A partners confirm that *most* certifying partners produce their audit-report signatures by reproducing a stored image while not excluding that a *minority* may continue to hand-sign some reports.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:53:We therefore treat Firm A as a *replication-dominated* calibration reference rather than a pure positive class.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54:This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:59:The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, but *individual signing behavior* is close to discrete---a given CPA is either a consistent user of non-hand-signing or a consistent hand-signer.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:71:3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:73:4. **Three-method convergent threshold framework.** We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:75:5. **Continuous-quality / discrete-behavior finding.** We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates are cleanly trimodal---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:77:6. **Replication-dominated calibration methodology.** We introduce a calibration strategy using a known-majority-positive reference group, distinguishing *replication-dominated* from *replication-pure* anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:82:Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:83:Section III describes the proposed methodology.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:84:Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:85:Section V discusses the implications and limitations of our findings.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:86:Section VI concludes with directions for future work.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:6:Fig. 1 illustrates the overall architecture.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:7:The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:13:[Figure 1: Pipeline Architecture - clean vector diagram]
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:17:→ Dual-Method Verification (Cosine + dHash)
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:18:→ Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:19:→ Pixel-identity + Firm A + Accountant-level GMM validation
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:30:Table I summarizes the dataset composition.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:53:Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents, establishing an upper bound on the VLM false-positive rate of 1.2%.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:61:The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:73:A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:82:Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:83:All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:85:The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D).
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:86:This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:97:Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:99:**Perceptual hash distance (dHash)** captures structural-level similarity.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:100:Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:102:Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:105:Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:107:Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:110:We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:111:Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:118:For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:119:The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:123:These accountant-level aggregates are the input to the mixture model described in Section III-I.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:125:## H. Calibration Reference: Firm A as a Replication-Dominated Population
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:127:A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:128:Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring *replication-dominated population*: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:129:This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:131:First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:132:Crucially, the same interview evidence does *not* exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:134:Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:138:We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:140:The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144:Direct assignment of thresholds based on prior intuition (e.g., cosine $\geq 0.95$ for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145:To place threshold selection on a statistically principled and data-driven footing, we apply *three independent* methods whose underlying assumptions decrease in strength.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:148:### 1) Method 1: KDE + Antimode with Bimodality Check
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:151:A candidate threshold is taken at the location of the local density minimum (antimode) between modes of the fitted density.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:152:Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify stability.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:154:### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:156:We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39].
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:157:We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin $i$ with count $n_i$, the standardized deviation from the smooth-null expectation of the average of its neighbours,
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:162:A threshold is identified at the transition where $Z_{i-1}$ is significantly negative (observed count below expectation) adjacent to $Z_i$ significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition $Z^- \rightarrow Z^+$ marks the candidate decision boundary.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:166:We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data).
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:168:Under the fitted model the threshold is the crossing point of the two weighted component densities,
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:173:As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the *logit-transformed* similarity, following standard practice for bounded proportion data; White's [41] quasi-MLE consistency result guarantees asymptotic recovery of the best Beta-family approximation to the true distribution even if the true component densities are not exactly Beta.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:176:When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:180:The three methods rest on decreasing-in-strength assumptions: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:181:If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:184:Applied to the per-signature cosine distribution the three methods may converge on a single boundary only weakly or not at all---because, as our results show, per-signature similarity is not a cleanly bimodal population.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:185:Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:186:We therefore explicitly analyze both levels and interpret their divergence as a substantive finding (Section V) rather than a statistical nuisance.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:190:In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:191:The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *behavior* is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level *quality* lies on a continuous spectrum.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:194:For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:196:## K. Pixel-Identity and Firm A Validation (No Manual Annotation)
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:198:Rather than construct a stratified manual-annotation validation set, we validate the classifier using three naturally occurring reference populations that require no human labeling:
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:200:1. **Pixel-identical anchor (gold positive):** signatures whose nearest same-CPA match is byte-identical after crop and normalization.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:201:Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:203:2. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:204:Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:206:3. **Low-similarity anchor (gold negative):** signatures whose maximum same-CPA cosine similarity is below a conservative cutoff ($0.70$) that cannot plausibly arise from pixel-level duplication.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:208:From these anchors we report Equal Error Rate (EER), precision, recall, $F_1$, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3].
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:209:We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:213:The final per-document classification combines the three-method thresholds with the dual-descriptor framework.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214:Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution:
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:216:1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq 5$.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:217:Both descriptors converge on strong replication evidence consistent with Firm A's median behavior.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:219:2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash in $[6, 15]$.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:220:Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:222:3. **High style consistency:** Cosine $> 0.95$ AND dHash $> 15$.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225:4. **Uncertain:** Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:227:5. **Likely hand-signed:** Cosine below the KDE crossover threshold.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:229:The dHash thresholds ($\leq 5$ and $\leq 15$) are directly derived from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.
/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8:By combining deep-learning visual features with perceptual hashing and three statistically independent threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time.
/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:9:After further validation, the technology could support financial regulators in automating signature-authenticity screening at national scale.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:11:The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:12:We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:13:However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:29:Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:30:Table IV summarizes the distributional statistics.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:44:Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:45:Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47:The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:55:## D. Hartigan Dip Test: Unimodality at the Signature Level
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:59:<!-- TABLE V: Hartigan Dip Test Results
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:60:| Distribution | N | dip | p-value | Verdict (α=0.05) |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:62:| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:64:| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:65:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:67:| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:70:Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:71:The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:72:At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:75:It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:77:### 1) Burgstahler-Dichev / McCrary Discontinuity
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:79:Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:80:We note that the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation with the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:81:In contrast, the dHash transition at distance 2 is a meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:85:Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:86:For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:87:Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:88:Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:90:The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:92:This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual *behavior* (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96:We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:97:BIC selects $K^* = 3$ (Table VI).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:99:<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:109:Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:111:<!-- TABLE VII: Accountant-Level 3-Component GMM
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:112:| Comp. | cos_mean | dHash_mean | weight | n | Dominant firms |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:114:| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120:First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:121:The remaining 32 Firm A CPAs fall into C2, consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:122:Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123:Third, the 2-component fit used for threshold derivation yields marginal-density crossings at cosine $= 0.945$ and dHash $= 8.10$; these are the natural per-accountant thresholds.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:125:Table VIII summarizes the threshold estimates produced by the three convergent methods at each analysis level.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:128:| Level / method | Cosine threshold | dHash threshold |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:130:| Signature-level KDE crossover | 0.837 | — |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:131:| Signature-level BD/McCrary transition | 0.985 | 2.0 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:132:| Signature-level Beta 2-comp (Firm A) | 0.977 | — |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:133:| Signature-level LogGMM 2-comp (Full) | 0.980 | — |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:134:| Accountant-level 2-comp GMM crossing | **0.945** | **8.10** |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:135:| Firm A P95 (median/95th pct calibration) | 0.95 | 15 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:136:| Firm A median calibration | — | 5 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:139:The accountant-level two-component crossing (cosine $= 0.945$, dHash $= 8.10$) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:142:## F. Calibration Validation with Firm A
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:144:Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:145:Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:147:<!-- TABLE IX: Firm A Anchor Rates Across Candidate Thresholds
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:148:| Rule | Firm A rate |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:150:| cosine > 0.837 | 99.93% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:151:| cosine > 0.941 | 95.08% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:152:| cosine > 0.945 (accountant 2-comp) | 94.5%† |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:153:| cosine > 0.95 | 92.51% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:154:| dHash_indep ≤ 5 | 84.20% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:155:| dHash_indep ≤ 8 | 95.17% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:156:| dHash_indep ≤ 15 | 99.83% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:157:| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:162:The Firm A anchor validation is consistent with the replication-dominated framing throughout: the most permissive cosine threshold (the KDE crossover at 0.837) captures nearly all Firm A signatures, while the more stringent thresholds progressively filter out the minority of hand-signing Firm A partners in the left tail.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:163:The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167:Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:168:These serve as the gold-positive anchor of Section III-K.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:169:Using signatures with cosine $< 0.70$ ($n = 35$) as the gold-negative anchor, we derive Equal-Error-Rate points and classification metrics for the canonical thresholds (Table X).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174:| cosine > 0.837 | KDE crossover | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:175:| cosine > 0.945 | Accountant crossing | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:176:| cosine > 0.95 | Canonical | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:177:| dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:178:| dHash_indep ≤ 8 | Accountant crossing | 0.966 | 1.000 | 0.983 | 0.314 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:179:| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:182:All cosine thresholds achieve perfect classification of the pixel-identical anchor against the low-similarity anchor, which is unsurprising given the complete separation between the two anchor populations.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:183:The dHash thresholds trade precision for recall along the expected tradeoff.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184:We emphasize that because the gold-positive anchor is a *subset* of the true non-hand-signing positives (only those that happen to be pixel-identical to their nearest match), recall against this anchor is conservative by construction: the classifier additionally flags many non-pixel-identical replications (low dHash but not zero) that the anchor cannot by itself validate.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:185:The negative-anchor population ($n = 35$) is likewise small because intra-CPA pairs rarely fall below cosine 0.70, so the reported FAR values should be read as order-of-magnitude rather than tight estimates.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:187:A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:193:<!-- TABLE XI: Classification Results (Dual-Method: Cosine + dHash)
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:194:| Verdict | N (PDFs) | % | Firm A | Firm A % |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:203:Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:204:29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash $\leq 5$);
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:205:36,994 (51.7%) show partial structural similarity (dHash in $[6, 15]$) consistent with replication degraded by scan variations;
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:206:and 5,133 (7.2%) show no structural corroboration (dHash $> 15$), suggesting high signing consistency rather than image reproduction.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:207:A cosine-only classifier would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:209:### 1) Firm A Validation
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:211:96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:212:This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:213:The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217:Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:218:This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:224:Table XII presents the comparison.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:233:| KDE crossover | 0.837 | 0.850 | 0.792 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:234:| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:235:| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:237:Note: Firm A values in this table are computed over all intra-firm pairwise
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:239:the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:251:(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:19:[8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024.
/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:77:[37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," *Ann. Statist.*, vol. 13, no. 1, pp. 7084, 1985.
/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:79:[38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," *J. Account. Econ.*, vol. 24, no. 1, pp. 99126, 1997.
/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:81:[39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," *J. Econometrics*, vol. 142, no. 2, pp. 698714, 2008.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:11:Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:15:A common thread in this literature is the assumption that the primary threat is *identity fraud*: a forger attempting to produce a convincing imitation of another person's signature.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:18:Brimoh and Olisah [8] proposed a consensus-threshold approach that derives classification boundaries from known genuine reference pairs, the methodology most closely related to our calibration strategy.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:19:However, their method operates on standard verification benchmarks with laboratory-collected signatures, whereas our approach applies threshold calibration using a replication-dominated subpopulation identified through domain expertise in real-world regulatory documents.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:32:Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:37:Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:47:Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:49:Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:51:Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:52:These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:56:Our threshold-determination framework combines three families of methods developed in statistics and accounting-econometrics.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:60:Where the distribution is bimodal, the local density minimum (antimode) between the two modes is the Bayes-optimal decision boundary under equal priors.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:61:The statistical validity of the bimodality itself can be tested independently via the Hartigan & Hartigan dip test [37], which we use as a formal bimodality diagnostic.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:64:Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:66:The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:67:The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:71:For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:72:Under mild regularity conditions, White's quasi-MLE consistency result [41] guarantees asymptotic recovery of the best Beta-family approximation to the true distribution, even when the true distribution is not exactly Beta, provided the model is correctly specified in the broader exponential-family sense.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:74:The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:82:[8] Brimoh & Olisah 2024 — Consensus threshold
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:94:[20] Engin et al. 2020 — ResNet + cosine on real docs
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:99:[37] Hartigan & Hartigan 1985 — dip test of unimodality
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:100:[38] Burgstahler & Dichev 1997 — earnings management discontinuity
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:101:[39] McCrary 2008 — density discontinuity test
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:103:[41] White 1982 — quasi-MLE consistency
codex
Ive isolated the sections where the core claims live. Next Im reading `Methodology`, `Results`, and `Discussion` line-by-line because most of the review will turn on whether the thresholding logic, validation design, and interpretation are actually supported there.
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md | sed -n '1,260p'" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
1 # III. Methodology
2
3 ## A. Pipeline Overview
4
5 We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
6 Fig. 1 illustrates the overall architecture.
7 The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor.
8
9 Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
10 From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.
11
12 <!--
13 [Figure 1: Pipeline Architecture - clean vector diagram]
14 90,282 PDFs → VLM Pre-screening → 86,072 PDFs
15 → YOLOv11 Detection → 182,328 signatures
16 → ResNet-50 Features → 2048-dim embeddings
17 → Dual-Method Verification (Cosine + dHash)
18 → Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification
19 → Pixel-identity + Firm A + Accountant-level GMM validation
20 -->
21
22 ## B. Data Collection
23
24 The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023.
25 The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings.
26 An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period.
27 Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.
28
29 CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports.
30 Table I summarizes the dataset composition.
31
32 <!-- TABLE I: Dataset Summary
33 | Attribute | Value |
34 |-----------|-------|
35 | Total PDF documents | 90,282 |
36 | Date range | 20132023 |
37 | Documents with signatures | 86,072 (95.4%) |
38 | Unique CPAs identified | 758 |
39 | Accounting firms | >50 |
40 -->
41
42 ## C. Signature Page Identification
43
44 To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism.
45 Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature.
46 The model was configured with temperature 0 for deterministic output.
47
48 The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document.
49 Scanning terminated upon the first positive detection.
50 This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded.
51 An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.
52
53 Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents, establishing an upper bound on the VLM false-positive rate of 1.2%.
54
55 ## D. Signature Detection
56
57 We adopted YOLOv11n (nano variant) [25] for signature region localization.
58 A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction.
59 A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.
60
61 The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
62
63 <!-- TABLE II: YOLO Detection Performance
64 | Metric | Value |
65 |--------|-------|
66 | Precision | 0.970.98 |
67 | Recall | 0.950.98 |
68 | mAP@0.50 | 0.980.99 |
69 | mAP@0.50:0.95 | 0.850.90 |
70 -->
71
72 Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers).
73 A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
74
75 Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
76
77 ## E. Feature Extraction
78
79 Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning.
80 The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.
81
82 Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
83 All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
84
85 The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D).
86 This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
87
88 ## F. Dual-Method Similarity Descriptors
89
90 For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:
91
92 **Cosine similarity on deep embeddings** captures high-level visual style:
93
94 $$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$
95
96 where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized 2048-dim feature vectors.
97 Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].
98
99 **Perceptual hash distance (dHash)** captures structural-level similarity.
100 Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint.
101 The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images.
102 Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
103
104 These descriptors provide partially independent evidence.
105 Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise.
106 Non-hand-signing yields extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise.
107 Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
108 Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
109
110 We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content.
111 Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle.
112
113 ## G. Unit of Analysis and Summary Statistics
114
115 Two unit-of-analysis choices are relevant for this study: (i) the *signature*---one signature image extracted from one report---and (ii) the *accountant*---the collection of all signatures attributed to a single CPA across the sample period.
116 A third composite unit---the *auditor-year*, i.e. all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution.
117
118 For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA.
119 The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism.
120 Mean statistics would dilute this signal.
121
122 For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA.
123 These accountant-level aggregates are the input to the mixture model described in Section III-I.
124
125 ## H. Calibration Reference: Firm A as a Replication-Dominated Population
126
127 A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference.
128 Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring *replication-dominated population*: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.
129 This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration.
130
131 First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems.
132 Crucially, the same interview evidence does *not* exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports.
133
134 Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners.
135
136 Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.
137
138 We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
139 Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline.
140 The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.
141
142 ## I. Three-Method Convergent Threshold Determination
143
144 Direct assignment of thresholds based on prior intuition (e.g., cosine $\geq 0.95$ for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others.
145 To place threshold selection on a statistically principled and data-driven footing, we apply *three independent* methods whose underlying assumptions decrease in strength.
146 When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement is itself a diagnostic of distributional structure.
147
148 ### 1) Method 1: KDE + Antimode with Bimodality Check
149
150 We fit a Gaussian kernel density estimate to the similarity distribution using Scott's rule for bandwidth selection [28].
151 A candidate threshold is taken at the location of the local density minimum (antimode) between modes of the fitted density.
152 Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify stability.
153
154 ### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity
155
156 We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39].
157 We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin $i$ with count $n_i$, the standardized deviation from the smooth-null expectation of the average of its neighbours,
158
159 $$Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},$$
160
161 which is approximately $N(0,1)$ under the null of distributional smoothness.
162 A threshold is identified at the transition where $Z_{i-1}$ is significantly negative (observed count below expectation) adjacent to $Z_i$ significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition $Z^- \rightarrow Z^+$ marks the candidate decision boundary.
163
164 ### 3) Method 3: Finite Mixture Model via EM
165
166 We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data).
167 The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread).
168 Under the fitted model the threshold is the crossing point of the two weighted component densities,
169
170 $$\pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2),$$
171
172 solved numerically via bracketed root-finding.
173 As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the *logit-transformed* similarity, following standard practice for bounded proportion data; White's [41] quasi-MLE consistency result guarantees asymptotic recovery of the best Beta-family approximation to the true distribution even if the true component densities are not exactly Beta.
174
175 We fit 2- and 3-component variants of each mixture and report BIC for model selection.
176 When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.
177
178 ### 4) Convergent Validation and Level-Shift Diagnostic
179
180 The three methods rest on decreasing-in-strength assumptions: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification.
181 If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.
182
183 Equally informative is the *level at which the three methods agree*.
184 Applied to the per-signature cosine distribution the three methods may converge on a single boundary only weakly or not at all---because, as our results show, per-signature similarity is not a cleanly bimodal population.
185 Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum.
186 We therefore explicitly analyze both levels and interpret their divergence as a substantive finding (Section V) rather than a statistical nuisance.
187
188 ## J. Accountant-Level Mixture Model
189
190 In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
191 The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing *behavior* is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level *quality* lies on a continuous spectrum.
192
193 We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
194 For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
195
196 ## K. Pixel-Identity and Firm A Validation (No Manual Annotation)
197
198 Rather than construct a stratified manual-annotation validation set, we validate the classifier using three naturally occurring reference populations that require no human labeling:
199
200 1. **Pixel-identical anchor (gold positive):** signatures whose nearest same-CPA match is byte-identical after crop and normalization.
201 Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing.
202
203 2. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above.
204 Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity.
205
206 3. **Low-similarity anchor (gold negative):** signatures whose maximum same-CPA cosine similarity is below a conservative cutoff ($0.70$) that cannot plausibly arise from pixel-level duplication.
207
208 From these anchors we report Equal Error Rate (EER), precision, recall, $F_1$, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3].
209 We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
210
211 ## L. Per-Document Classification
212
213 The final per-document classification combines the three-method thresholds with the dual-descriptor framework.
214 Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution:
215
216 1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq 5$.
217 Both descriptors converge on strong replication evidence consistent with Firm A's median behavior.
218
219 2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash in $[6, 15]$.
220 Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations.
221
222 3. **High style consistency:** Cosine $> 0.95$ AND dHash $> 15$.
223 High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.
224
225 4. **Uncertain:** Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
226
227 5. **Likely hand-signed:** Cosine below the KDE crossover threshold.
228
229 The dHash thresholds ($\leq 5$ and $\leq 15$) are directly derived from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md | sed -n '1,160p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md | sed -n '1,280p'" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
1 # IV. Experiments and Results
2
3 ## A. Experimental Setup
4
5 All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration.
6 Feature extraction used PyTorch 2.9 with torchvision model implementations.
7 The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
8
9 ## B. Signature Detection Performance
10
11 The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
12 We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
13 However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
14 The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.
15
16 <!-- TABLE III: Extraction Results
17 | Metric | Value |
18 |--------|-------|
19 | Documents processed | 86,071 |
20 | Documents with detections | 85,042 (98.8%) |
21 | Total signatures extracted | 182,328 |
22 | Avg. signatures per document | 2.14 |
23 | CPA-matched signatures | 168,755 (92.6%) |
24 | Processing rate | 43.1 docs/sec |
25 -->
26
27 ## C. Signature-Level Distribution Analysis
28
29 Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs.
30 Table IV summarizes the distributional statistics.
31
32 <!-- TABLE IV: Cosine Similarity Distribution Statistics
33 | Statistic | Intra-class | Inter-class |
34 |-----------|-------------|-------------|
35 | N (pairs) | 41,352,824 | 500,000 |
36 | Mean | 0.821 | 0.758 |
37 | Std. Dev. | 0.098 | 0.090 |
38 | Median | 0.836 | 0.774 |
39 | Skewness | 0.711 | 0.851 |
40 | Kurtosis | 0.550 | 1.027 |
41 -->
42
43 Both distributions are left-skewed and leptokurtic.
44 Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
45 Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
46
47 The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
48 Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
49 Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney $p < 0.001$, K-S 2-sample $p < 0.001$).
50
51 We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
52 We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
53 A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
54
55 ## D. Hartigan Dip Test: Unimodality at the Signature Level
56
57 Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
58
59 <!-- TABLE V: Hartigan Dip Test Results
60 | Distribution | N | dip | p-value | Verdict (α=0.05) |
61 |--------------|---|-----|---------|------------------|
62 | Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal |
63 | Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
64 | All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
65 | All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
66 | Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal |
67 | Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
68 -->
69
70 Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews.
71 The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
72 At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.
73
74 This asymmetry between signature level and accountant level is itself an empirical finding.
75 It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
76
77 ### 1) Burgstahler-Dichev / McCrary Discontinuity
78
79 Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample.
80 We note that the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation with the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
81 In contrast, the dHash transition at distance 2 is a meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication.
82
83 ### 2) Beta Mixture at Signature Level: A Forced Fit
84
85 Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
86 For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
87 Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
88 Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
89
90 The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
91 Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.
92 This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual *behavior* (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks.
93
94 ## E. Accountant-Level Gaussian Mixture
95
96 We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
97 BIC selects $K^* = 3$ (Table VI).
98
99 <!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)
100 | K | BIC | AIC | Converged |
101 |---|-----|-----|-----------|
102 | 1 | 316 | 339 | ✓ |
103 | 2 | 545 | 595 | ✓ |
104 | 3 | **792** | **869** | ✓ (best) |
105 | 4 | 779 | 883 | ✓ |
106 | 5 | 747 | 879 | ✓ |
107 -->
108
109 Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.
110
111 <!-- TABLE VII: Accountant-Level 3-Component GMM
112 | Comp. | cos_mean | dHash_mean | weight | n | Dominant firms |
113 |-------|----------|------------|--------|---|----------------|
114 | C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) |
115 | C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) |
116 | C3 (hand-signed tendency) | 0.928 | 11.17 | 0.28 | 184 | smaller domestic firms |
117 -->
118
119 Three empirical findings stand out.
120 First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster.
121 The remaining 32 Firm A CPAs fall into C2, consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
122 Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
123 Third, the 2-component fit used for threshold derivation yields marginal-density crossings at cosine $= 0.945$ and dHash $= 8.10$; these are the natural per-accountant thresholds.
124
125 Table VIII summarizes the threshold estimates produced by the three convergent methods at each analysis level.
126
127 <!-- TABLE VIII: Threshold Convergence Summary
128 | Level / method | Cosine threshold | dHash threshold |
129 |----------------|-------------------|------------------|
130 | Signature-level KDE crossover | 0.837 | — |
131 | Signature-level BD/McCrary transition | 0.985 | 2.0 |
132 | Signature-level Beta 2-comp (Firm A) | 0.977 | — |
133 | Signature-level LogGMM 2-comp (Full) | 0.980 | — |
134 | Accountant-level 2-comp GMM crossing | **0.945** | **8.10** |
135 | Firm A P95 (median/95th pct calibration) | 0.95 | 15 |
136 | Firm A median calibration | — | 5 |
137 -->
138
139 The accountant-level two-component crossing (cosine $= 0.945$, dHash $= 8.10$) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture.
140 The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum / discrete-behavior asymmetry rather than as primary classification boundaries.
141
142 ## F. Calibration Validation with Firm A
143
144 Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
145 Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).
146
147 <!-- TABLE IX: Firm A Anchor Rates Across Candidate Thresholds
148 | Rule | Firm A rate |
149 |------|-------------|
150 | cosine > 0.837 | 99.93% |
151 | cosine > 0.941 | 95.08% |
152 | cosine > 0.945 (accountant 2-comp) | 94.5%† |
153 | cosine > 0.95 | 92.51% |
154 | dHash_indep ≤ 5 | 84.20% |
155 | dHash_indep ≤ 8 | 95.17% |
156 | dHash_indep ≤ 15 | 99.83% |
157 | cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% |
158
159 † interpolated from adjacent rates; all other rates computed exactly.
160 -->
161
162 The Firm A anchor validation is consistent with the replication-dominated framing throughout: the most permissive cosine threshold (the KDE crossover at 0.837) captures nearly all Firm A signatures, while the more stringent thresholds progressively filter out the minority of hand-signing Firm A partners in the left tail.
163 The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix.
164
165 ## G. Pixel-Identity Validation
166
167 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1).
168 These serve as the gold-positive anchor of Section III-K.
169 Using signatures with cosine $< 0.70$ ($n = 35$) as the gold-negative anchor, we derive Equal-Error-Rate points and classification metrics for the canonical thresholds (Table X).
170
171 <!-- TABLE X: Pixel-Identity Validation Metrics
172 | Indicator | Threshold | Precision | Recall | F1 | FAR | FRR |
173 |-----------|-----------|-----------|--------|----|-----|-----|
174 | cosine > 0.837 | KDE crossover | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
175 | cosine > 0.945 | Accountant crossing | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
176 | cosine > 0.95 | Canonical | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
177 | dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 |
178 | dHash_indep ≤ 8 | Accountant crossing | 0.966 | 1.000 | 0.983 | 0.314 | 0.000 |
179 | dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 |
180 -->
181
182 All cosine thresholds achieve perfect classification of the pixel-identical anchor against the low-similarity anchor, which is unsurprising given the complete separation between the two anchor populations.
183 The dHash thresholds trade precision for recall along the expected tradeoff.
184 We emphasize that because the gold-positive anchor is a *subset* of the true non-hand-signing positives (only those that happen to be pixel-identical to their nearest match), recall against this anchor is conservative by construction: the classifier additionally flags many non-pixel-identical replications (low dHash but not zero) that the anchor cannot by itself validate.
185 The negative-anchor population ($n = 35$) is likewise small because intra-CPA pairs rarely fall below cosine 0.70, so the reported FAR values should be read as order-of-magnitude rather than tight estimates.
186
187 A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics.
188
189 ## H. Classification Results
190
191 Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents.
192
193 <!-- TABLE XI: Classification Results (Dual-Method: Cosine + dHash)
194 | Verdict | N (PDFs) | % | Firm A | Firm A % |
195 |---------|----------|---|--------|----------|
196 | High-confidence non-hand-signed | 29,529 | 35.0% | 22,970 | 76.0% |
197 | Moderate-confidence non-hand-signed | 36,994 | 43.8% | 6,311 | 20.9% |
198 | High style consistency | 5,133 | 6.1% | 183 | 0.6% |
199 | Uncertain | 12,683 | 15.0% | 758 | 2.5% |
200 | Likely hand-signed | 47 | 0.1% | 4 | 0.0% |
201 -->
202
203 Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
204 29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash $\leq 5$);
205 36,994 (51.7%) show partial structural similarity (dHash in $[6, 15]$) consistent with replication degraded by scan variations;
206 and 5,133 (7.2%) show no structural corroboration (dHash $> 15$), suggesting high signing consistency rather than image reproduction.
207 A cosine-only classifier would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations.
208
209 ### 1) Firm A Validation
210
211 96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
212 This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers.
213 The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
214
215 ### 2) Cross-Method Agreement
216
217 Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
218 This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
219
220 ## I. Ablation Study: Feature Backbone Comparison
221
222 To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
223 All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
224 Table XII presents the comparison.
225
226 <!-- TABLE XII: Backbone Comparison
227 | Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
228 |--------|-----------|--------|-----------------|
229 | Feature dim | 2048 | 4096 | 1280 |
230 | Intra mean | 0.821 | 0.822 | 0.786 |
231 | Inter mean | 0.758 | 0.767 | 0.699 |
232 | Cohen's d | 0.669 | 0.564 | 0.707 |
233 | KDE crossover | 0.837 | 0.850 | 0.792 |
234 | Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
235 | Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |
236
237 Note: Firm A values in this table are computed over all intra-firm pairwise
238 similarities (16.0M pairs) for cross-backbone comparability. These differ from
239 the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
240 the classification-relevant statistic: the similarity of each signature to its
241 single closest match from the same CPA.
242 -->
243
244 EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
245 However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), resulting in lower per-sample classification confidence.
246 VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.
247
248 ResNet-50 provides the best overall balance:
249 (1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
250 (2) its tighter distributions yield more reliable individual classifications;
251 (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
252 (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.
succeeded in 0ms:
1 # V. Discussion
2
3 ## A. Non-Hand-Signing Detection as a Distinct Problem
4
5 Our results highlight the importance of distinguishing *non-hand-signing detection* from the well-studied *signature forgery detection* problem.
6 In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature.
7 In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).
8
9 This distinction has direct methodological consequences.
10 Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures.
11 Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.
12 The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.
13
14 ## B. Continuous-Quality Spectrum vs. Discrete-Behavior Regimes
15
16 The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method convergent analysis and the Hartigan dip test (Sections IV-D and IV-E).
17
18 At the per-signature level, the distribution of best-match cosine similarity is *not* cleanly bimodal.
19 Firm A's signature-level cosine is formally unimodal (dip test $p = 0.17$) with a long left tail.
20 The all-CPA signature-level cosine is multimodal ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
21 The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary.
22 Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class.
23 Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains.
24
25 At the per-accountant aggregate level the picture reverses.
26 The distribution of per-accountant mean cosine (and mean dHash) is unambiguously multimodal.
27 A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms.
28 The two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.
29
30 The substantive interpretation is simple: *pixel-level output quality* is continuous, but *individual signing behavior* is close to discrete.
31 A given CPA tends to be either a consistent user of non-hand-signing or a consistent hand-signer; it is the mixing of these discrete behavioral types at the firm and population levels that produces the quality spectrum observed at the signature level.
32 Methodologically, this implies that the three-method convergent framework (KDE antimode / BD-McCrary / Beta mixture) is meaningful at the level where the mixture structure is statistically supported---namely the accountant level---and serves as diagnostic evidence rather than a primary classification boundary at the signature level.
33
34 ## C. Firm A as a Replication-Dominated, Not Pure, Population
35
36 A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class.
37 Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.
38
39 Three convergent strands of evidence support the replication-dominated framing.
40 First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign.
41 Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
42 Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers.
43
44 The replication-dominated framing is internally coherent with all three pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise.
45 We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.
46
47 ## D. The Style-Replication Gap
48
49 Within the 71,656 documents exceeding cosine $0.95$, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
50 A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
51
52 The 7.2% classified as "high style consistency" (cosine $> 0.95$ but dHash $> 15$) are particularly informative.
53 Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions.
54 Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting.
55 Some may use signing pads or templates that further constrain variability without constituting image-level reproduction.
56 The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence.
57
58 ## E. Value of a Replication-Dominated Calibration Group
59
60 The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
61 In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
62 Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset.
63
64 This calibration strategy has broader applicability beyond signature analysis.
65 Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives.
66 The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates interview-acknowledged heterogeneity, and yields classification rates that are internally consistent with the data.
67
68 ## F. Pixel-Identity as Annotation-Free Ground Truth
69
70 A further methodological contribution is the use of byte-level pixel identity as an annotation-free gold positive.
71 Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review.
72 In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly.
73 Paired with a conservative low-similarity anchor, this yields EER-style validation without the stratified manual annotation that would otherwise be the default methodological choice.
74 We regard this as a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself.
75
76 ## G. Limitations
77
78 Several limitations should be acknowledged.
79
80 First, comprehensive per-document ground truth labels are not available.
81 The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), and the low-similarity anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70.
82 The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work.
83
84 Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning.
85 While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.
86
87 Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions.
88 In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity.
89 This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified.
90
91 Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements.
92 While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions.
93
94 Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted non-hand-signing later).
95 Extending the accountant-level analysis to auditor-year units is a natural next step.
96
97 Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation.
98 This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes.
99
100 Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
101 Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md | sed -n '1,120p'" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
1 # Abstract
2
3 <!-- 200-270 words -->
4
5 Regulations in many jurisdictions require Certified Public Accountants (CPAs) to attest to each audit report they certify, typically by affixing a signature or seal.
6 However, the digitization of financial reporting makes it straightforward to reuse a stored signature image across multiple reports---whether by administrative stamping or firm-level electronic signing systems---potentially undermining the intent of individualized attestation.
7 Unlike signature forgery, where an impostor imitates another person's handwriting, *non-hand-signed* reproduction involves the legitimate signer's own stored signature image being reproduced on each report, a practice that is visually invisible to report users and infeasible to audit at scale through manual inspection.
8 We present an end-to-end AI pipeline that automatically detects non-hand-signed auditor signatures in financial audit reports.
9 The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash).
10 For threshold determination we apply three statistically independent methods---Kernel Density antimode with a Hartigan dip test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level.
11 Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best $K = 3$), reflecting that individual signing *behavior* is close to discrete even when pixel-level output *quality* is continuous.
12 The accountant-level 2-component crossings yield principled thresholds (cosine $= 0.945$, dHash $= 8.10$).
13 A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with interview and visual evidence supporting majority non-hand-signing and a minority of hand-signers.
14 Validation against 310 pixel-identical signature pairs and a low-similarity negative anchor yields perfect recall at all candidate thresholds.
15 To our knowledge, this represents the largest-scale forensic analysis of auditor signature authenticity reported in the literature.
16
17 <!-- Word count: ~290 -->
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md | sed -n '1,140p'" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
1 # I. Introduction
2
3 <!-- Target: ~1.5 pages double-column IEEE format. Double-blind: no author/institution info. -->
4
5 Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection.
6 In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require that certifying CPAs affix their signature or seal (簽名或蓋章) to each audit report [1].
7 While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].
8
9 The digitization of financial reporting has introduced a practice that complicates this intent.
10 As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one.
11 This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step.
12 From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise.
13 We refer to signatures produced by either workflow collectively as *non-hand-signed*.
14 Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement.
15 Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused.
16 This practice, while potentially widespread, is visually invisible to report users and virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of image reproduction.
17
18 The distinction between *non-hand-signing detection* and *signature forgery detection* is both conceptually and technically important.
19 The extensive body of research on offline signature verification [3]--[8] has focused almost exclusively on forgery detection---determining whether a questioned signature was produced by its purported author or by an impostor.
20 This framing presupposes that the central threat is identity fraud.
21 In our context, identity is not in question; the CPA is indeed the legitimate signer.
22 The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports.
23 This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction, requiring an analytical framework focused on detecting abnormally high similarity across documents.
24
25 A secondary methodological concern shapes the research design.
26 Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification.
27 Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference.
28 A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, $F_1$, and equal-error-rate metrics that prevail in the biometric-verification literature.
29
30 Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards.
31 Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image.
32 Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures, where legitimate visual similarity between a signer's authentic signatures is expected and must be distinguished from image reproduction.
33 Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis.
34 From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection.
35
36 In this paper, we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale.
37 Our approach processes raw PDF documents through the following stages:
38 (1) signature page identification using a Vision-Language Model (VLM);
39 (2) signature region detection using a trained YOLOv11 object detector;
40 (3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network;
41 (4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance;
42 (5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and
43 (6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm.
44
45 The dual-descriptor verification is central to our contribution.
46 Cosine similarity of deep feature embeddings captures high-level visual style similarity---it can identify signatures that share similar stroke patterns and spatial layouts---but cannot distinguish between a CPA who signs consistently and one whose signature is reproduced from a stored image.
47 Perceptual hashing (specifically, difference hashing) encodes structural-level image gradients into compact binary fingerprints that are robust to scan noise but sensitive to substantive content differences.
48 By requiring convergent evidence from both descriptors, we can differentiate *style consistency* (high cosine but divergent dHash) from *image reproduction* (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone.
49
50 A second distinctive feature is our framing of the calibration reference.
51 One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing.
52 Structured interviews with multiple Firm A partners confirm that *most* certifying partners produce their audit-report signatures by reproducing a stored image while not excluding that a *minority* may continue to hand-sign some reports.
53 We therefore treat Firm A as a *replication-dominated* calibration reference rather than a pure positive class.
54 This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode.
55 Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence between the interview evidence and the statistical results.
56
57 A third distinctive feature is our unit-of-analysis treatment.
58 Our three-method convergent framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are cleanly trimodal (BIC-best $K = 3$).
59 The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, but *individual signing behavior* is close to discrete---a given CPA is either a consistent user of non-hand-signing or a consistent hand-signer.
60 The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level.
61
62 We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants.
63 To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature.
64
65 The contributions of this paper are summarized as follows:
66
67 1. **Problem formulation.** We formally define non-hand-signing detection as distinct from signature forgery detection and argue that it requires an analytical framework focused on intra-signer similarity distributions rather than genuine-versus-forged classification.
68
69 2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity computation, with automated inference requiring no manual intervention after initial training and annotation.
70
71 3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures.
72
73 4. **Three-method convergent threshold framework.** We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence.
74
75 5. **Continuous-quality / discrete-behavior finding.** We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates are cleanly trimodal---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics.
76
77 6. **Replication-dominated calibration methodology.** We introduce a calibration strategy using a known-majority-positive reference group, distinguishing *replication-dominated* from *replication-pure* anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.
78
79 7. **Large-scale empirical analysis.** We report findings from the analysis of over 90,000 audit reports spanning a decade, providing the first large-scale empirical evidence on non-hand-signing practices in financial reporting under a methodology designed for peer-review defensibility.
80
81 The remainder of this paper is organized as follows.
82 Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination.
83 Section III describes the proposed methodology.
84 Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study.
85 Section V discusses the implications and limitations of our findings.
86 Section VI concludes with directions for future work.
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md | sed -n '1,160p'" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
1 # II. Related Work
2
3 ## A. Offline Signature Verification
4
5 Offline signature verification---determining whether a static signature image is genuine or forged---has been studied extensively using deep learning.
6 Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant.
7 Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work.
8 Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining.
9 Hadjadj et al. [5] addressed the practical constraint of limited reference samples, achieving competitive verification accuracy using only a single known genuine signature per writer.
10 More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results.
11 Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives.
12 Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer.
13 Hafemann et al. [16] further addressed the practical challenge of adapting to new users through meta-learning, reducing the enrollment burden for signature verification systems.
14
15 A common thread in this literature is the assumption that the primary threat is *identity fraud*: a forger attempting to produce a convincing imitation of another person's signature.
16 Our work addresses a fundamentally different problem---detecting whether the *legitimate signer's* stored signature image has been reproduced across many documents---which requires analyzing the upper tail of the intra-signer similarity distribution rather than modeling inter-signer discriminability.
17
18 Brimoh and Olisah [8] proposed a consensus-threshold approach that derives classification boundaries from known genuine reference pairs, the methodology most closely related to our calibration strategy.
19 However, their method operates on standard verification benchmarks with laboratory-collected signatures, whereas our approach applies threshold calibration using a replication-dominated subpopulation identified through domain expertise in real-world regulatory documents.
20
21 ## B. Document Forensics and Copy Detection
22
23 Image forensics encompasses a broad range of techniques for detecting manipulated visual content [17], with recent surveys highlighting the growing role of deep learning in forgery detection [18].
24 Copy-move forgery detection (CMFD) identifies duplicated regions within or across images, typically targeting manipulated photographs [11].
25 Abramova and Böhme [10] adapted block-based CMFD to scanned text documents, noting that standard methods perform poorly in this domain because legitimate character repetitions produce high similarity scores that confound duplicate detection.
26
27 Woodruff et al. [9] developed the work most closely related to ours: a fully automated pipeline for extracting and analyzing signatures from corporate filings in the context of anti-money-laundering investigations.
28 Their system uses connected component analysis for signature detection, GANs for noise removal, and Siamese networks for author clustering.
29 While their pipeline shares our goal of large-scale automated signature analysis on real regulatory documents, their objective---grouping signatures by authorship---differs fundamentally from ours, which is detecting image-level reproduction within a single author's signatures across documents.
30
31 In the domain of image copy detection, Pizzi et al. [13] proposed SSCD, a self-supervised descriptor using ResNet-50 with contrastive learning for large-scale copy detection on natural images.
32 Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach.
33
34 ## C. Perceptual Hashing
35
36 Perceptual hashing algorithms generate compact fingerprints that are robust to minor image transformations while remaining sensitive to substantive content changes [19].
37 Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process.
38
39 Jakhar and Borah [12] demonstrated that combining perceptual hashing with deep learning features significantly outperforms either approach alone for near-duplicate image detection, achieving AUROC of 0.99 on standard benchmarks.
40 Their two-stage architecture---pHash for fast structural comparison followed by deep features for semantic verification---provides methodological precedent for our dual-descriptor approach, though applied to natural images rather than document signatures.
41
42 Our work differs from prior perceptual-hashing studies in its application context and in the specific challenge it addresses: distinguishing legitimate high visual consistency (a careful signer producing similar-looking signatures) from image-level reproduction in scanned financial documents.
43
44 ## D. Deep Feature Extraction for Signature Analysis
45
46 Several studies have explored pre-trained CNN features for signature comparison without metric learning or Siamese architectures.
47 Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach.
48 Tsourounis et al. [21] demonstrated successful transfer from handwritten text recognition to signature verification, showing that CNN features trained on related but distinct handwriting tasks generalize effectively to signature comparison.
49 Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach.
50
51 Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach.
52 These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical.
53
54 ## E. Statistical Methods for Threshold Determination
55
56 Our threshold-determination framework combines three families of methods developed in statistics and accounting-econometrics.
57
58 *Non-parametric density estimation.*
59 Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions.
60 Where the distribution is bimodal, the local density minimum (antimode) between the two modes is the Bayes-optimal decision boundary under equal priors.
61 The statistical validity of the bimodality itself can be tested independently via the Hartigan & Hartigan dip test [37], which we use as a formal bimodality diagnostic.
62
63 *Discontinuity tests on empirical distributions.*
64 Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions.
65 Under the null that the distribution is generated by a single smooth process, the expected count in any histogram bin equals the average of its two neighbours, and the standardized deviation from this expectation is approximately $N(0,1)$.
66 The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis.
67 The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions.
68
69 *Finite mixture models.*
70 When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters.
71 For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation.
72 Under mild regularity conditions, White's quasi-MLE consistency result [41] guarantees asymptotic recovery of the best Beta-family approximation to the true distribution, even when the true distribution is not exactly Beta, provided the model is correctly specified in the broader exponential-family sense.
73
74 The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported.
75 <!--
76 REFERENCES for Related Work (see paper_a_references_v3.md for full list):
77 [3] Bromley et al. 1993 — Siamese TDNN (NeurIPS)
78 [4] Dey et al. 2017 — SigNet
79 [5] Hadjadj et al. 2020 — Single sample SV
80 [6] Li et al. 2024 — TransOSV
81 [7] Tehsin et al. 2024 — Triplet Siamese
82 [8] Brimoh & Olisah 2024 — Consensus threshold
83 [9] Woodruff et al. 2021 — AML signature pipeline
84 [10] Abramova & Böhme 2016 — CMFD in scanned docs
85 [11] Copy-move forgery detection survey — MTAP 2024
86 [12] Jakhar & Borah 2025 — pHash + DL
87 [13] Pizzi et al. 2022 — SSCD
88 [14] Hafemann et al. 2017 — CNN features for SV
89 [15] Zois et al. 2024 — SPD manifold SV
90 [16] Hafemann et al. 2019 — Meta-learning for SV
91 [17] Farid 2009 — Image forgery detection survey
92 [18] Mehrjardi et al. 2023 — DL-based image forgery detection survey
93 [19] Luo et al. 2025 — Perceptual hashing survey
94 [20] Engin et al. 2020 — ResNet + cosine on real docs
95 [21] Tsourounis et al. 2022 — Transfer from text to signatures
96 [22] Chamakh & Bounouh 2025 — ResNet18 unified SV
97 [23] Babenko et al. 2014 — Neural codes for image retrieval
98 [28] Silverman 1986 — Density estimation
99 [37] Hartigan & Hartigan 1985 — dip test of unimodality
100 [38] Burgstahler & Dichev 1997 — earnings management discontinuity
101 [39] McCrary 2008 — density discontinuity test
102 [40] Dempster, Laird & Rubin 1977 — EM algorithm
103 [41] White 1982 — quasi-MLE consistency
104 -->
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md | sed -n '1,80p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md | sed -n '1,120p'" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
1 # Impact Statement
2
3 <!-- 100-150 words. Non-specialist readable. No jargon. Specific, not vague. -->
4
5 Auditor signatures on financial reports are a key safeguard of corporate accountability.
6 When the signature on an audit report is produced by reproducing a stored image instead of by the partner's own hand---whether through an administrative stamping workflow or a firm-level electronic signing system---this safeguard is weakened, yet detecting the practice through manual inspection is infeasible at the scale of modern financial markets.
7 We developed an artificial intelligence system that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan.
8 By combining deep-learning visual features with perceptual hashing and three statistically independent threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time.
9 After further validation, the technology could support financial regulators in automating signature-authenticity screening at national scale.
succeeded in 0ms:
1 # VI. Conclusion and Future Work
2
3 ## Conclusion
4
5 We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
6 Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels.
7
8 Our contributions are fourfold.
9
10 First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.
11
12 Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
13
14 Third, we introduced a three-method convergent threshold framework combining KDE antimode (with a Hartigan dip test as formal bimodality check), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check).
15 Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.
16 The substantive reading is that *pixel-level output quality* is continuous while *individual signing behavior* is close to discrete.
17
18 Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor that requires no manual annotation.
19 This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters.
20
21 An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
22
23 ## Future Work
24
25 Several directions merit further investigation.
26 Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
27 Extending the accountant-level analysis to auditor-year units---using the same three-method convergent framework but at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade.
28 The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
29 The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself.
30 Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md | sed -n '1,140p'" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
1 # References
2
3 <!-- IEEE numbered style, sequential by first appearance in text. v3 adds statistical-method refs (3741). -->
4
5 [1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067
6
7 [2] S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," *Res. Account. Regul.*, vol. 25, no. 2, pp. 230235, 2013.
8
9 [3] J. Bromley et al., "Signature verification using a Siamese time delay neural network," in *Proc. NeurIPS*, 1993.
10
11 [4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.
12
13 [5] I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," *Appl. Sci.*, vol. 10, no. 11, p. 3716, 2020.
14
15 [6] H. Li et al., "TransOSV: Offline signature verification with transformers," *Pattern Recognit.*, vol. 145, p. 109882, 2024.
16
17 [7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," *Mathematics*, vol. 12, no. 17, p. 2757, 2024.
18
19 [8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024.
20
21 [9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021.
22
23 [10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," in *Proc. Electronic Imaging*, 2016.
24
25 [11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," *Multimedia Tools Appl.*, 2024.
26
27 [12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," *Inf. Process. Manage.*, p. 104086, 2025.
28
29 [13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," in *Proc. CVPR*, 2022.
30
31 [14] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," *Pattern Recognit.*, vol. 70, pp. 163176, 2017.
32
33 [15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," *IEEE Trans. Inf. Forensics Security*, vol. 19, pp. 13421356, 2024.
34
35 [16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," *IEEE Trans. Inf. Forensics Security*, vol. 15, pp. 17351745, 2019.
36
37 [17] H. Farid, "Image forgery detection," *IEEE Signal Process. Mag.*, vol. 26, no. 2, pp. 1625, 2009.
38
39 [18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," *Pattern Recognit.*, vol. 144, art. no. 109778, 2023.
40
41 [19] J. Luo et al., "A survey of perceptual hashing for multimedia," *ACM Trans. Multimedia Comput. Commun. Appl.*, vol. 21, no. 7, 2025.
42
43 [20] D. Engin et al., "Offline signature verification on real-world documents," in *Proc. CVPRW*, 2020.
44
45 [21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," *Expert Syst. Appl.*, 2022.
46
47 [22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification," *Procedia Comput. Sci.*, vol. 270, 2025.
48
49 [23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in *Proc. ECCV*, 2014, pp. 584599.
50
51 [24] Qwen2.5-VL Technical Report, Alibaba Group, 2025.
52
53 [25] Ultralytics, "YOLOv11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/
54
55 [26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. CVPR*, 2016.
56
57 [27] N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013. [Online]. Available: https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
58
59 [28] B. W. Silverman, *Density Estimation for Statistics and Data Analysis*. London: Chapman & Hall, 1986.
60
61 [29] J. Cohen, *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.
62
63 [30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," *IEEE Trans. Image Process.*, vol. 13, no. 4, pp. 600612, 2004.
64
65 [31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," *The Accounting Review*, vol. 88, no. 5, pp. 15111546, 2013.
66
67 [32] A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," *Int. J. Auditing*, vol. 18, no. 3, pp. 172192, 2014.
68
69 [33] W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," *Contemp. Account. Res.*, vol. 26, no. 2, pp. 359391, 2009.
70
71 [34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in *Proc. CVPR*, 2016, pp. 779788.
72
73 [35] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 46, no. 8, pp. 56255644, 2024.
74
75 [36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," *Ann. Math. Statist.*, vol. 18, no. 1, pp. 5060, 1947.
76
77 [37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," *Ann. Statist.*, vol. 13, no. 1, pp. 7084, 1985.
78
79 [38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," *J. Account. Econ.*, vol. 24, no. 1, pp. 99126, 1997.
80
81 [39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," *J. Econometrics*, vol. 142, no. 2, pp. 698714, 2008.
82
83 [40] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," *J. R. Statist. Soc. B*, vol. 39, no. 1, pp. 138, 1977.
84
85 [41] H. White, "Maximum likelihood estimation of misspecified models," *Econometrica*, vol. 50, no. 1, pp. 125, 1982.
86
87 <!-- Total: 41 references (v2: 36 + 5 new statistical methods refs) -->
exec
/bin/zsh -lc "rg -n 'Table [IVX]+|Fig\\. [0-9]+' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "rg -n '0\\.941|0\\.945|0\\.95|0\\.985|8\\.10|6\\.99|7\\.00|71,656|84,386|180|182,328|168,755|168,740|686|758|90,282|86,071|86,072|4,198|12 corrupted|30,000\\+' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:11:The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:12:We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:13:However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:29:Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:30:Table IV summarizes the distributional statistics.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47:The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:97:BIC selects $K^* = 3$ (Table VI).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:109:Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:125:Table VIII summarizes the threshold estimates produced by the three convergent methods at each analysis level.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:144:Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:145:Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:169:Using signatures with cosine $< 0.70$ ($n = 35$) as the gold-negative anchor, we derive Equal-Error-Rate points and classification metrics for the canonical thresholds (Table X).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:218:This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:224:Table XII presents the comparison.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:6:Fig. 1 illustrates the overall architecture.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:30:Table I summarizes the dataset composition.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:61:The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
succeeded in 0ms:
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:13:However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:19:| Documents processed | 86,071 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:21:| Total signatures extracted | 182,328 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:23:| CPA-matched signatures | 168,755 (92.6%) |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:36:| Mean | 0.821 | 0.758 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:64:| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:65:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:66:| Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:67:| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:79:Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:80:We note that the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation with the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96:We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:115:| C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120:First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123:Third, the 2-component fit used for threshold derivation yields marginal-density crossings at cosine $= 0.945$ and dHash $= 8.10$; these are the natural per-accountant thresholds.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:131:| Signature-level BD/McCrary transition | 0.985 | 2.0 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:134:| Accountant-level 2-comp GMM crossing | **0.945** | **8.10** |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:135:| Firm A P95 (median/95th pct calibration) | 0.95 | 15 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:139:The accountant-level two-component crossing (cosine $= 0.945$, dHash $= 8.10$) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:151:| cosine > 0.941 | 95.08% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:152:| cosine > 0.945 (accountant 2-comp) | 94.5%† |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:153:| cosine > 0.95 | 92.51% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:157:| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:163:The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167:Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:175:| cosine > 0.945 | Accountant crossing | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:176:| cosine > 0.95 | Canonical | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:179:| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:199:| Uncertain | 12,683 | 15.0% | 758 | 2.5% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:203:Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:207:A cosine-only classifier would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:213:The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217:Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:231:| Inter mean | 0.758 | 0.767 | 0.699 |
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:14:90,282 PDFs → VLM Pre-screening → 86,072 PDFs
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:15:→ YOLOv11 Detection → 182,328 signatures
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:24:The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:29:CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:35:| Total PDF documents | 90,282 |
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:37:| Documents with signatures | 86,072 (95.4%) |
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:38:| Unique CPAs identified | 758 |
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:45:Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:50:This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:51:An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:67:| Recall | 0.950.98 |
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:69:| mAP@0.50:0.95 | 0.850.90 |
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:72:Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers).
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:75:Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144:Direct assignment of thresholds based on prior intuition (e.g., cosine $\geq 0.95$ for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:216:1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq 5$.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:219:2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash in $[6, 15]$.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:222:3. **High style consistency:** Cosine $> 0.95$ AND dHash $> 15$.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225:4. **Uncertain:** Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54:This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:62:We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:6:Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:12:Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:15:Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19:This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:21:The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:28:The two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:41:Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42:Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:49:Within the 71,656 documents exceeding cosine $0.95$, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:50:A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:52:The 7.2% classified as "high style consistency" (cosine $> 0.95$ but dHash $> 15$) are particularly informative.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:72:In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly.
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best $K = 3$), reflecting that individual signing *behavior* is close to discrete even when pixel-level output *quality* is continuous.
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12:The accountant-level 2-component crossings yield principled thresholds (cosine $= 0.945$, dHash $= 8.10$).
exec
/bin/zsh -lc "rg -n 'undefined|not shown|not reported|not used|interpolated|all-pairs|best-match|independent|pixel-identical-to-closest|Equal Error Rate|EER|FAR|FRR|precision|recall|F1|Shapiro|Kolmogorov|BIC|AIC|Scott' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:11:[4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.
/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:33:[15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," *IEEE Trans. Inf. Forensics Security*, vol. 19, pp. 13421356, 2024.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:44:Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:45:Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:51:We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:65:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:85:Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:86:For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96:We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:97:BIC selects $K^* = 3$ (Table VI).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:99:<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:100:| K | BIC | AIC | Converged |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:139:The accountant-level two-component crossing (cosine $= 0.945$, dHash $= 8.10$) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:159:† interpolated from adjacent rates; all other rates computed exactly.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167:Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1).
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:172:| Indicator | Threshold | Precision | Recall | F1 | FAR | FRR |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:183:The dHash thresholds trade precision for recall along the expected tradeoff.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184:We emphasize that because the gold-positive anchor is a *subset* of the true non-hand-signing positives (only those that happen to be pixel-identical to their nearest match), recall against this anchor is conservative by construction: the classifier additionally flags many non-pixel-identical replications (low dHash but not zero) that the anchor cannot by itself validate.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:185:The negative-anchor population ($n = 35$) is likewise small because intra-CPA pairs rarely fall below cosine 0.70, so the reported FAR values should be read as order-of-magnitude rather than tight estimates.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:187:A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:234:| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:235:| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:239:the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:251:(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:7:Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:8:Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:12:Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:61:The statistical validity of the bimodality itself can be tested independently via the Hartigan & Hartigan dip test [37], which we use as a formal bimodality diagnostic.
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:74:The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:6:Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels.
/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:15:Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are sharp and mutually consistent.
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10:For threshold determination we apply three statistically independent methods---Kernel Density antimode with a Hartigan dip test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level.
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best $K = 3$), reflecting that individual signing *behavior* is close to discrete even when pixel-level output *quality* is continuous.
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14:Validation against 310 pixel-identical signature pairs and a low-similarity negative anchor yields perfect recall at all candidate thresholds.
/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8:By combining deep-learning visual features with perceptual hashing and three statistically independent threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:18:At the per-signature level, the distribution of best-match cosine similarity is *not* cleanly bimodal.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:20:The all-CPA signature-level cosine is multimodal ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:27:A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:71:Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:73:Paired with a conservative low-similarity anchor, this yields EER-style validation without the stratified manual annotation that would otherwise be the default methodological choice.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:82:The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28:A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, $F_1$, and equal-error-rate metrics that prevail in the biometric-verification literature.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:42:(5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:58:Our three-method convergent framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are cleanly trimodal (BIC-best $K = 3$).
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:73:4. **Three-method convergent threshold framework.** We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:7:The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:58:A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:104:These descriptors provide partially independent evidence.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:129:This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:134:Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:139:Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145:To place threshold selection on a statistically principled and data-driven footing, we apply *three independent* methods whose underlying assumptions decrease in strength.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:150:We fit a Gaussian kernel density estimate to the similarity distribution using Scott's rule for bandwidth selection [28].
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:152:Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify stability.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:175:We fit 2- and 3-component variants of each mixture and report BIC for model selection.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:176:When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:190:In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:193:We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:201:Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:208:From these anchors we report Equal Error Rate (EER), precision, recall, $F_1$, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3].
codex
Ive finished the core read. The main problems are not about the topic; they are about internal evidentiary standards: the accountant-level “three-method convergence” is not actually shown, some validation is circular, and a few numerical statements are inconsistent enough that they need to be fixed before submission.
exec
/bin/zsh -lc "rg -n 'sensitivity|bandwidth|\\+/- 50|± 50|Scott' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "rg -n 'independent minimum dHash|dHash_indep|independent\\)' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:66:The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:150:We fit a Gaussian kernel density estimate to the similarity distribution using Scott's rule for bandwidth selection [28].
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:152:Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify stability.
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:11:Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.
succeeded in 0ms:
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:65:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96:We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:154:| dHash_indep ≤ 5 | 84.20% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:155:| dHash_indep ≤ 8 | 95.17% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:156:| dHash_indep ≤ 15 | 99.83% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:157:| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:177:| dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:178:| dHash_indep ≤ 8 | Accountant crossing | 0.966 | 1.000 | 0.983 | 0.314 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:179:| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 |
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:190:In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
exec
/bin/zsh -lc "rg -n '84,386|85,042|per-document|per document|document-level|each document into one of five categories|first or second signature|two certifying CPAs' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:13:However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:20:| Documents with detections | 85,042 (98.8%) |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:22:| Avg. signatures per document | 2.14 |
/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:75:Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:213:The final per-document classification combines the three-method thresholds with the dual-descriptor framework.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214:Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution:
/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:80:First, comprehensive per-document ground truth labels are not available.
exec
/bin/zsh -lc "rg -n 'accountant-level.*KDE|accountant-level.*BD|accountant-level.*Beta|per-accountant.*KDE|per-accountant.*Beta|per-accountant.*BD|two-component fit used for threshold derivation|three methods yield a sharp and consistent boundary|three methods reveal|three-method thresholds are therefore statistically supported at the accountant level' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best $K = 3$), reflecting that individual signing *behavior* is close to discrete even when pixel-level output *quality* is continuous.
/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level.
/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:185:Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum.
codex
Im checking a couple of quantitative implications before writing the review, mainly to avoid hand-waving about the negative anchor. With only `n=35` negatives, the tabled FAR values and even the `0.000` entries have very wide uncertainty, and I want to state that precisely.
2026-04-20T16:47:47.550868Z ERROR codex_core::tools::router: error=exec_command failed for `/bin/zsh -lc "python - <<'PY'
import math
from statsmodels.stats.proportion import proportion_confint
n=35
for x,label in [(0,'0/35'),(6,'6/35'),(11,'11/35'),(24,'24/35')]:
lo,hi=proportion_confint(x,n,alpha=0.05,method='beta')
print(label, round(x/n,3), (round(lo,3), round(hi,3)))
PY"`: CreateProcess { message: "Codex(Sandbox(Denied { output: ExecToolCallOutput { exit_code: 1, stdout: StreamOutput { text: \"\", truncated_after_lines: None }, stderr: StreamOutput { text: \"zsh:1: can't create temp file for here document: operation not permitted\\n\", truncated_after_lines: None }, aggregated_output: StreamOutput { text: \"zsh:1: can't create temp file for here document: operation not permitted\\n\", truncated_after_lines: None }, duration: 0ns, timed_out: false }, network_policy_decision: None }))" }
2026-04-20T16:47:47.593389Z ERROR codex_core::tools::router: error=exec_command failed for `/bin/zsh -lc "python - <<'PY'
vals=[29529,36994,5133,12683,47]
print(sum(vals))
print(139+32,180-(139+32))
PY"`: CreateProcess { message: "Codex(Sandbox(Denied { output: ExecToolCallOutput { exit_code: 1, stdout: StreamOutput { text: \"\", truncated_after_lines: None }, stderr: StreamOutput { text: \"zsh:1: can't create temp file for here document: operation not permitted\\n\", truncated_after_lines: None }, aggregated_output: StreamOutput { text: \"zsh:1: can't create temp file for here document: operation not permitted\\n\", truncated_after_lines: None }, duration: 0ns, timed_out: false }, network_policy_decision: None }))" }
exec
/bin/zsh -lc 'python -c "vals=[29529,36994,5133,12683,47]; print(sum(vals)); print(139+32, 180-(139+32))"' in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "python -c \"from math import isnan; from statsmodels.stats.proportion import proportion_confint; n=35
for x,label in [(0,'0/35'),(6,'6/35'),(11,'11/35'),(24,'24/35')]:
lo,hi=proportion_confint(x,n,alpha=0.05,method='beta'); print(label, round(x/n,3), (round(lo,3), round(hi,3)))\"" in /Volumes/NV2/pdf_recognize
exited 127 in 0ms:
zsh:1: command not found: python
exited 127 in 0ms:
zsh:1: command not found: python
exec
/bin/zsh -lc 'python3 -c "vals=[29529,36994,5133,12683,47]; print(sum(vals)); print(139+32, 180-(139+32))"' in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
84386
171 9
exec
/bin/zsh -lc "python3 -c \"from statsmodels.stats.proportion import proportion_confint; n=35
for x,label in [(0,'0/35'),(6,'6/35'),(11,'11/35'),(24,'24/35')]:
lo,hi=proportion_confint(x,n,alpha=0.05,method='beta'); print(label, round(x/n,3), (round(lo,3), round(hi,3)))\"" in /Volumes/NV2/pdf_recognize
exited 1 in 0ms:
Traceback (most recent call last):
File "<string>", line 1, in <module>
from statsmodels.stats.proportion import proportion_confint; n=35
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'statsmodels'
exec
/bin/zsh -lc 'python3 -c "import scipy; print(scipy.__version__)"' in /Volumes/NV2/pdf_recognize
succeeded in 0ms:
1.17.1
exec
/bin/zsh -lc "python3 -c \"from scipy.stats import beta
n=35
for x,label in [(0,'0/35'),(6,'6/35'),(11,'11/35'),(24,'24/35')]:
a=0.05
lo=0.0 if x==0 else beta.ppf(a/2, x, n-x+1)
hi=1.0 if x==n else beta.ppf(1-a/2, x+1, n-x)
print(label, round(x/n,3), round(lo,3), round(hi,3))\"" in /Volumes/NV2/pdf_recognize
succeeded in 590ms:
0/35 0.0 0.0 0.1
6/35 0.171 0.066 0.336
11/35 0.314 0.169 0.493
24/35 0.686 0.507 0.831
exec
/bin/zsh -lc 'python3 -c "from scipy.stats import beta
x,n=310,310
a=0.05
lo=0.0 if x==0 else beta.ppf(a/2,x,n-x+1)
hi=1.0 if x==n else beta.ppf(1-a/2,x+1,n-x)
print(round(lo,3), round(hi,3))"' in /Volumes/NV2/pdf_recognize
succeeded in 418ms:
0.988 1.0
Completed the manuscript read and synthesized the main technical and presentation issues into a reviewer-style assessment.
✓ Load all manuscript section files with line numbers and identify tables/figures/results claims
✓ Evaluate technical claims around thresholding, accountant-level GMM, and Firm A framing
✓ Draft IEEE-Access-style review with prioritized findings and exact citations
codex
**1. Overall Assessment**
Recommendation: `major revision`.
The paper tackles an interesting and important problem with an unusually strong corpus, but the current draft overstates what the evidence establishes. The main concern is not the extraction pipeline; it is that the thresholding and validation narrative is stronger than the analyses actually reported. In its present form, I would not recommend acceptance without substantial revision to the claims, reporting, and validation design.
**2. Technical Soundness of the Threshold Framework and Accountant-Level GMM**
- The largest problem is that the manuscript repeatedly claims accountant-level convergence of *three* methods, but Section IV does not actually show that. Section III-I says the three methods "yield a sharp and consistent boundary" at the accountant level [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:185), and the Introduction says the "convergent three-method thresholds are therefore statistically supported at the accountant level" [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60). Yet Section IV-E/Table VIII report only one accountant-level threshold source: the 2-component accountant GMM crossing at `0.945 / 8.10` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:127). I do not see accountant-level KDE antimodes, accountant-level BD/McCrary thresholds, or accountant-level Beta/logit-mixture thresholds. As written, the core "three-method convergence" claim is unsupported.
- Method 1 is internally inconsistent. Section III-I.1 defines the KDE method as the `local density minimum (antimode)` of a single distribution [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:151), but Section IV-C reports a `KDE crossover` where `the two density functions intersect` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47). Those are different procedures with different decision-theoretic meanings. The manuscript needs to choose one and report it consistently.
- The paper mixes three different units of analysis without sufficient justification: all-pairs similarities (Table IV), per-signature best-match statistics (Table V), and per-accountant means (Tables VI-VIII). The `0.837` cutoff comes from pairwise intra/inter distributions [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:29), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47); the `0.945 / 8.10` cutoffs come from accountant means [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123]; then all are reused in per-signature/per-document classification [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174). That transfer across units is not justified.
- The accountant-level interpretation is overstated. The papers strongest result is that the signature-level data are not well summarized by a 2-component model, while accountant-level means cluster under a BIC-selected 3-component GMM. That is not the same as proving that "individual signing behavior is close to discrete" [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:30), especially when the selected accountant model itself has a very large middle component `C2` with weight `0.51` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:115). The evidence supports "clustered accountant-level heterogeneity" more clearly than "discrete behavior."
- There is also a model-selection tension that needs to be confronted directly. Section IV-E says `BIC selects K*=3` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:97), but the paper then derives thresholds from a 2-component fit [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123). That can be acceptable for a coarse dichotomy, but it must be justified explicitly, especially because the selected 3-component model contains a substantive "middle band" rather than a negligible bridge.
- The BD/McCrary treatment is too loose technically. Section III-I.2 scans histogram bins for a `Z^- -> Z^+` transition [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:156), but no uncertainty intervals, multiplicity correction, or sensitivity analysis are reported. I would present this as a heuristic discontinuity diagnostic unless the authors can supply a more rigorous justification.
- The White (1982) justification is overstated. The manuscript implies quasi-MLE consistency validates the Beta-mixture/logit-GMM robustness argument [paper_a_related_work_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:71), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:173). That is much stronger than Whites result warrants.
**3. Firm A Framing and Circular Validation**
- The "replication-dominated, not pure" framing is plausible and, in my view, the right wording. The problem is not the label; the problem is the way independence is claimed. The genuinely exogenous evidence is the interview evidence and pre-threshold visual inspection [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:131), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:134), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:139). The `92.5% > 0.95` statistic and the GMM cluster assignments are in-sample outputs of the same analytical pipeline [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120). They are consistency checks, not independent validation.
- Table IX is circular if presented as validation. The paper says these rates "play the role of calibration-validation metrics" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:145), but several thresholds are themselves anchored to Firm A, and the text explicitly says the final classes use thresholds "anchored to Firm A's empirical distribution" [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:229). This is calibration, not external validation.
- The manuscript should break this loop with cross-fitting or a true holdout. For example: calibrate on one subset of Firm A CPAs and evaluate on held-out Firm A CPAs; or calibrate without Firm A and then use Firm A only as an external check.
- There is a numerical inconsistency that must be fixed. Section IV-E says `139 of 180 Firm A CPAs` are in C1 and `the remaining 32` are in C2 [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:121). The same `139/32` split is repeated in the Introduction, Discussion, and Conclusion [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54), [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42), [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19). `139 + 32 != 180`. The denominator is being mixed or nine CPAs are unaccounted for.
**4. Weaknesses in the Pixel-Identity Validation**
- The negative anchor is too small for the claims being made. Table X uses only `n = 35` negatives [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:169), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:185). With 35 negatives, even `FAR = 0.000` only implies a very weak upper bound; the exact 95% upper confidence limit is about `0.10`. The dHash FARs are also extremely imprecise.
- The positive anchor is an easy subset, not a representative positive set. Byte-identical nearest matches are the most obvious replications [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:200), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167). Perfect recall on this subset does not establish recall on near-duplicate but non-identical replications. The manuscript currently says this makes recall "conservative" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184]; that is not correct. It makes the recall estimate non-generalizable and likely optimistic for the broader positive class.
- The EER language should be removed or corrected. Section III-K promises EER reporting [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:208), and Section IV-G says "we derive Equal-Error-Rate points" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:169), but Table X reports metrics at selected thresholds, not the threshold where `FAR = FRR`.
- This section would be materially stronger with a much larger negative set concentrated near the decision boundary, blocked inter-CPA comparisons, manually adjudicated borderline cases, and ideally a held-out temporal or firm-stratified validation set.
**5. Clarity and Organization of Sections III / IV / V**
- Section IV-C is titled "Signature-Level Distribution Analysis," but Table IV is all-pairs analysis (`41,352,824` intra-class pairs) rather than per-signature best-match statistics [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:32). Section IV-D then switches to best-match distributions. This unit shift is confusing.
- `dHash_indep` / `mean independent minimum dHash` is never defined clearly in Section III, yet it appears throughout Tables V, VII, IX, and X [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:154). The reader cannot infer what "independent" means operationally.
- Section III-L defines five document-level verdicts [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:213), but the rule that maps one or two signature-level outcomes into one PDF-level label is missing. That omission becomes critical in Table XI, which reports `84,386` classified PDFs [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191) even though Table III reports `85,042` documents with detections [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:20).
- Figure/table explanation is uneven. Fig. 4 is said to show the 3-component accountant clusters "alongside the marginal-density crossings of the two-component fit" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:109), but the text never explains how the 2-component crossings relate visually or substantively to the BIC-best 3-component model.
**6. Specific Sentence-Level, Table-Level, and Numerical Fixes**
- Table IX contains an undefined threshold: `cosine > 0.941` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:151). I could not find where `0.941` is introduced.
- Table IX reports `cosine > 0.945 ... 94.5%†` and says it is `interpolated from adjacent rates` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:152), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:159). An empirical exceedance rate at a fixed threshold should be computed exactly from the data, not interpolated.
- The manuscript alternates between `310 pixel-identical signature pairs` in the Abstract [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14), `310 have a same-CPA nearest match` in Results [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167), and `310 signatures satisfied this condition` in Discussion [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:72). Choose one unit and keep it consistent.
- The statement `establishing an upper bound on the VLM false-positive rate of 1.2%` [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:53) is not justified, because YOLO misses can arise from YOLO false negatives, not only VLM false positives.
- The manuscript repeatedly calls the three methods `statistically independent` [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145). They are different diagnostics applied to the same sample, not independent experiments. "Methodologically distinct" or "complementary" would be more accurate.
- The dip test is described as a `formal bimodality check` [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:152). It is a test against unimodality, not a direct test for bimodality.
- Reference [24] is incomplete for IEEE style: `Qwen2.5-VL Technical Report, Alibaba Group, 2025` [paper_a_references_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:51). It needs authors, title, venue/repository, and an accessible identifier.
In short: the manuscript has publishable potential, but only after the authors either supply the missing accountant-level analyses and stronger validation, or narrow the claims and present the current evidence as exploratory calibration plus large-scale descriptive analysis rather than definitive threshold validation.
tokens used
85,743
**1. Overall Assessment**
Recommendation: `major revision`.
The paper tackles an interesting and important problem with an unusually strong corpus, but the current draft overstates what the evidence establishes. The main concern is not the extraction pipeline; it is that the thresholding and validation narrative is stronger than the analyses actually reported. In its present form, I would not recommend acceptance without substantial revision to the claims, reporting, and validation design.
**2. Technical Soundness of the Threshold Framework and Accountant-Level GMM**
- The largest problem is that the manuscript repeatedly claims accountant-level convergence of *three* methods, but Section IV does not actually show that. Section III-I says the three methods "yield a sharp and consistent boundary" at the accountant level [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:185), and the Introduction says the "convergent three-method thresholds are therefore statistically supported at the accountant level" [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60). Yet Section IV-E/Table VIII report only one accountant-level threshold source: the 2-component accountant GMM crossing at `0.945 / 8.10` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:127). I do not see accountant-level KDE antimodes, accountant-level BD/McCrary thresholds, or accountant-level Beta/logit-mixture thresholds. As written, the core "three-method convergence" claim is unsupported.
- Method 1 is internally inconsistent. Section III-I.1 defines the KDE method as the `local density minimum (antimode)` of a single distribution [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:151), but Section IV-C reports a `KDE crossover` where `the two density functions intersect` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47). Those are different procedures with different decision-theoretic meanings. The manuscript needs to choose one and report it consistently.
- The paper mixes three different units of analysis without sufficient justification: all-pairs similarities (Table IV), per-signature best-match statistics (Table V), and per-accountant means (Tables VI-VIII). The `0.837` cutoff comes from pairwise intra/inter distributions [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:29), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47); the `0.945 / 8.10` cutoffs come from accountant means [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123]; then all are reused in per-signature/per-document classification [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174). That transfer across units is not justified.
- The accountant-level interpretation is overstated. The papers strongest result is that the signature-level data are not well summarized by a 2-component model, while accountant-level means cluster under a BIC-selected 3-component GMM. That is not the same as proving that "individual signing behavior is close to discrete" [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:30), especially when the selected accountant model itself has a very large middle component `C2` with weight `0.51` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:115). The evidence supports "clustered accountant-level heterogeneity" more clearly than "discrete behavior."
- There is also a model-selection tension that needs to be confronted directly. Section IV-E says `BIC selects K*=3` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:97), but the paper then derives thresholds from a 2-component fit [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123). That can be acceptable for a coarse dichotomy, but it must be justified explicitly, especially because the selected 3-component model contains a substantive "middle band" rather than a negligible bridge.
- The BD/McCrary treatment is too loose technically. Section III-I.2 scans histogram bins for a `Z^- -> Z^+` transition [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:156), but no uncertainty intervals, multiplicity correction, or sensitivity analysis are reported. I would present this as a heuristic discontinuity diagnostic unless the authors can supply a more rigorous justification.
- The White (1982) justification is overstated. The manuscript implies quasi-MLE consistency validates the Beta-mixture/logit-GMM robustness argument [paper_a_related_work_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:71), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:173). That is much stronger than Whites result warrants.
**3. Firm A Framing and Circular Validation**
- The "replication-dominated, not pure" framing is plausible and, in my view, the right wording. The problem is not the label; the problem is the way independence is claimed. The genuinely exogenous evidence is the interview evidence and pre-threshold visual inspection [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:131), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:134), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:139). The `92.5% > 0.95` statistic and the GMM cluster assignments are in-sample outputs of the same analytical pipeline [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120). They are consistency checks, not independent validation.
- Table IX is circular if presented as validation. The paper says these rates "play the role of calibration-validation metrics" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:145), but several thresholds are themselves anchored to Firm A, and the text explicitly says the final classes use thresholds "anchored to Firm A's empirical distribution" [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:229). This is calibration, not external validation.
- The manuscript should break this loop with cross-fitting or a true holdout. For example: calibrate on one subset of Firm A CPAs and evaluate on held-out Firm A CPAs; or calibrate without Firm A and then use Firm A only as an external check.
- There is a numerical inconsistency that must be fixed. Section IV-E says `139 of 180 Firm A CPAs` are in C1 and `the remaining 32` are in C2 [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:121). The same `139/32` split is repeated in the Introduction, Discussion, and Conclusion [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54), [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42), [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19). `139 + 32 != 180`. The denominator is being mixed or nine CPAs are unaccounted for.
**4. Weaknesses in the Pixel-Identity Validation**
- The negative anchor is too small for the claims being made. Table X uses only `n = 35` negatives [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:169), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:185). With 35 negatives, even `FAR = 0.000` only implies a very weak upper bound; the exact 95% upper confidence limit is about `0.10`. The dHash FARs are also extremely imprecise.
- The positive anchor is an easy subset, not a representative positive set. Byte-identical nearest matches are the most obvious replications [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:200), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167). Perfect recall on this subset does not establish recall on near-duplicate but non-identical replications. The manuscript currently says this makes recall "conservative" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184]; that is not correct. It makes the recall estimate non-generalizable and likely optimistic for the broader positive class.
- The EER language should be removed or corrected. Section III-K promises EER reporting [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:208), and Section IV-G says "we derive Equal-Error-Rate points" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:169), but Table X reports metrics at selected thresholds, not the threshold where `FAR = FRR`.
- This section would be materially stronger with a much larger negative set concentrated near the decision boundary, blocked inter-CPA comparisons, manually adjudicated borderline cases, and ideally a held-out temporal or firm-stratified validation set.
**5. Clarity and Organization of Sections III / IV / V**
- Section IV-C is titled "Signature-Level Distribution Analysis," but Table IV is all-pairs analysis (`41,352,824` intra-class pairs) rather than per-signature best-match statistics [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:32). Section IV-D then switches to best-match distributions. This unit shift is confusing.
- `dHash_indep` / `mean independent minimum dHash` is never defined clearly in Section III, yet it appears throughout Tables V, VII, IX, and X [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:154). The reader cannot infer what "independent" means operationally.
- Section III-L defines five document-level verdicts [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:213), but the rule that maps one or two signature-level outcomes into one PDF-level label is missing. That omission becomes critical in Table XI, which reports `84,386` classified PDFs [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191) even though Table III reports `85,042` documents with detections [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:20).
- Figure/table explanation is uneven. Fig. 4 is said to show the 3-component accountant clusters "alongside the marginal-density crossings of the two-component fit" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:109), but the text never explains how the 2-component crossings relate visually or substantively to the BIC-best 3-component model.
**6. Specific Sentence-Level, Table-Level, and Numerical Fixes**
- Table IX contains an undefined threshold: `cosine > 0.941` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:151). I could not find where `0.941` is introduced.
- Table IX reports `cosine > 0.945 ... 94.5%†` and says it is `interpolated from adjacent rates` [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:152), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:159). An empirical exceedance rate at a fixed threshold should be computed exactly from the data, not interpolated.
- The manuscript alternates between `310 pixel-identical signature pairs` in the Abstract [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14), `310 have a same-CPA nearest match` in Results [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167), and `310 signatures satisfied this condition` in Discussion [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:72). Choose one unit and keep it consistent.
- The statement `establishing an upper bound on the VLM false-positive rate of 1.2%` [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:53) is not justified, because YOLO misses can arise from YOLO false negatives, not only VLM false positives.
- The manuscript repeatedly calls the three methods `statistically independent` [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145). They are different diagnostics applied to the same sample, not independent experiments. "Methodologically distinct" or "complementary" would be more accurate.
- The dip test is described as a `formal bimodality check` [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:152). It is a test against unimodality, not a direct test for bimodality.
- Reference [24] is incomplete for IEEE style: `Qwen2.5-VL Technical Report, Alibaba Group, 2025` [paper_a_references_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:51). It needs authors, title, venue/repository, and an accessible identifier.
In short: the manuscript has publishable potential, but only after the authors either supply the missing accountant-level analyses and stronger validation, or narrow the claims and present the current evidence as exploratory calibration plus large-scale descriptive analysis rather than definitive threshold validation.