Files
pdf_signature_extraction/paper/codex_review_gpt54_v3.md
T
gbanyan 9d19ca5a31 Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE
  antimode, BD/McCrary, and Beta mixture EM on accountant-level means.
  Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979,
  LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at
  accountant level (consistent with smooth clustering, not sharp
  discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions,
  used at signature all-pairs level) vs KDE antimode (single-distribution
  local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds CPA-level 70/30
  held-out fold. Calibration thresholds derived from 70% only; heldout
  rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61%
  [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10
  signatures (9 CPAs excluded for insufficient sample). Reconciled across
  intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature
  label determines document label).

## Pixel-identity validation strengthened
- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces
  the original n=35 same-CPA low-similarity negative which had untenable
  Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset"
  language per codex feedback (byte-identical positives are a subset, not
  a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods"
  throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of
  unimodality; rejection is consistent with but not a direct test of
  bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to
  "replication-dominated" in prior commit; this commit strengthens that
  framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity"
  (BD/McCrary non-transition at accountant level rules out sharp discrete
  boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP
  or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract,
  Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to
  XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no
  pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired
  it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1
markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00

242 KiB
Raw Blame History

Reading additional input from stdin... OpenAI Codex v0.121.0 (research preview)

workdir: /Volumes/NV2/pdf_recognize model: gpt-5.4 provider: openai approval: never sandbox: read-only reasoning effort: xhigh reasoning summaries: none session id: 019dabc7-b94b-7d71-81f6-764d687c3d5a

user You are conducting a peer review for IEEE Access of my paper v3.

Read these files in /Volumes/NV2/pdf_recognize/paper/ : paper_a_abstract_v3.md, paper_a_impact_statement_v3.md, paper_a_introduction_v3.md, paper_a_related_work_v3.md, paper_a_methodology_v3.md, paper_a_results_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md, paper_a_references_v3.md.

Paper title: "Automated Identification of Non-Hand-Signed Auditor Signatures in Large-Scale Financial Audit Reports: A Dual-Descriptor Framework with Three-Method Convergent Thresholding"

Key facts to keep in mind while reviewing:

  • Pipeline: VLM (Qwen2.5-VL 32B) page-ID, YOLOv11n detection, ResNet-50 2048-dim features (L2-norm, no fine-tune), dual descriptors = cosine + dHash (9x8, 64-bit, Hamming distance).
  • Dataset: 90,282 PDF audit reports from TWSE MOPS 2013-2023, 182,328 signatures, 758 CPAs.
  • Three convergent threshold methods: (a) KDE antimode with Hartigan dip test, (b) Burgstahler-Dichev / McCrary discontinuity, (c) 2/3-comp Beta mixture via EM + parallel logit-Gaussian mixture as White 1982 quasi-MLE robustness check.
  • Accountant-level: Gaussian mixture on (mean cosine, mean dHash), BIC picks K=3. C1 cos=0.983 dh=2.41 w=0.21 n=141 (Firm A 139/141). C2 cos=0.954 dh=7.00 w=0.51 n=361 (other Big-4). C3 cos=0.928 dh=11.17 w=0.28 n=184 (smaller firms). 2-comp marginal crossings cos=0.945 dh=8.10.
  • Firm A framed as "replication-dominated" not pure: interviews confirm most partners use replication but a minority may still hand-sign. Consistent with 92.5% Firm A cos>0.95 and 32/180 Firm A CPAs in C2.
  • Pixel-identity validation: 310 sigs with byte-identical nearest same-CPA match as gold positives. Low-cosine anchor (cos<0.70) is only n=35 gold negatives so FAR is order-of-magnitude only. Firm A anchor rates: cos>0.95=92.5%, dual rule cos>0.95 AND dh<=8=89.95%.
  • Classification: 5 tiers (high/moderate-confidence non-hand-signed, high style consistency, uncertain, likely hand-signed), thresholds from Firm A (dHash median=5, P95=15).
  • Ablation: ResNet-50 vs VGG-16 vs EfficientNet-B0; ResNet-50 chosen on balance.

Produce a structured IEEE-Access-style peer review covering, in priority order (most critical first):

  1. Overall assessment and recommendation (accept / minor revision / major revision / reject).

  2. Technical soundness of the three-method threshold framework and the accountant-level GMM; focus on whether the "continuous signature-level spectrum vs discrete accountant-level behaviour" interpretation is supported by the reported evidence.

  3. Whether the "replication-dominated Firm A" framing is justified AND whether the paper does enough to prevent circular validation given Firm A is used both to calibrate thresholds and to anchor validation.

  4. Weaknesses in the pixel-identity validation: is n=35 negative anchor sufficient, does using pixel-identity as gold positive bias recall estimates, what would strengthen it.

  5. Clarity and organization of Sections III / IV / V. Flag terminology inconsistency, undefined results, and any table/figure referenced but not clearly explained.

  6. Specific sentence-level, table-level, numerical issues to fix before submission. Include exact quotes or table-cell references.

Be direct and specific. Cite exact sections, table numbers, and line quotes. Do not restate the paper content; focus on review opinions and concrete fixes. codex Im reviewing the manuscript files directly and will anchor the review to the text as written. First Im loading the section drafts with line numbers and scanning for tables, figures, and threshold/validation claims so I can cite specific locations rather than paraphrase. → Load all manuscript section files with line numbers and identify tables/figures/results claims • Evaluate technical claims around thresholding, accountant-level GMM, and Firm A framing • Draft IEEE-Access-style review with prioritized findings and exact citations exec /bin/zsh -lc 'wc -l /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 17 /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md 30 /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md 101 /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md 9 /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md 86 /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md 229 /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md 87 /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md 104 /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md 252 /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md 915 total

exec /bin/zsh -lc "rg --files /Volumes/NV2/pdf_recognize/paper | rg 'paper_a_.*_v3\.md"'$'"'" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md

exec /bin/zsh -lc "rg -n 'Table|Figure|Fig\.|Section|threshold|Firm A|pixel|identity|Hartigan|dip|Burgstahler|McCrary|Beta mixture|Gaussian mixture|GMM|dHash|cosine|validation|anchor|Tier|tier|KDE|antimode|quasi-MLE|White 1982' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:6:Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:12:Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:14:Third, we introduced a three-method convergent threshold framework combining KDE antimode (with a Hartigan dip test as formal bimodality check), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check). /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:15:Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine = 0.945, dHash = 8.10) are sharp and mutually consistent. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:16:The substantive reading is that pixel-level output quality is continuous while individual signing behavior is close to discrete. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:18:Fourth, we introduced a replication-dominated calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor that requires no manual annotation. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19:This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:29:The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:30:Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:9:The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash). /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10:For threshold determination we apply three statistically independent methods---Kernel Density antimode with a Hartigan dip test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12:The accountant-level 2-component crossings yield principled thresholds (cosine = 0.945, dHash = 8.10). /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:13:A major Big-4 firm is used as a replication-dominated (not pure) calibration anchor, with interview and visual evidence supporting majority non-hand-signing and a minority of hand-signers. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14:Validation against 310 pixel-identical signature pairs and a low-similarity negative anchor yields perfect recall at all candidate thresholds. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:7:In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:12:The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:16:The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method convergent analysis and the Hartigan dip test (Sections IV-D and IV-E). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:18:At the per-signature level, the distribution of best-match cosine similarity is not cleanly bimodal. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:19:Firm A's signature-level cosine is formally unimodal (dip test p = 0.17) with a long left tail. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:20:The all-CPA signature-level cosine is multimodal (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:21:The BD/McCrary discontinuity test locates its transition at 0.985---inside the non-hand-signed mode rather than at a boundary. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:26:The distribution of per-accountant mean cosine (and mean dHash) is unambiguously multimodal. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:27:A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:28:The two-component marginal crossings (cosine = 0.945, dHash = 8.10) are sharp and mutually consistent. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:30:The substantive interpretation is simple: pixel-level output quality is continuous, but individual signing behavior is close to discrete. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:32:Methodologically, this implies that the three-method convergent framework (KDE antimode / BD-McCrary / Beta mixture) is meaningful at the level where the mixture structure is statistically supported---namely the accountant level---and serves as diagnostic evidence rather than a primary classification boundary at the signature level. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:34:## C. Firm A as a Replication-Dominated, Not Pure, Population /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:36:A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:37:Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:40:First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:41:Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42:Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:45:We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:49:Within the 71,656 documents exceeding cosine 0.95, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:50:A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:52:The 7.2% classified as "high style consistency" (cosine > 0.95 but dHash > 15) are particularly informative. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:53:Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:60:The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:65:Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:70:A further methodological contribution is the use of byte-level pixel identity as an annotation-free gold positive. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:71:Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:72:In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:73:Paired with a conservative low-similarity anchor, this yields EER-style validation without the stratified manual annotation that would otherwise be the default methodological choice. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:81:The pixel-identity anchor is a strict subset of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), and the low-similarity anchor (n = 35) is small because intra-CPA pairs rarely fall below cosine 0.70. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:82:The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:88:In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:92:While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:97:Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:98:This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:12:From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:20:This framing presupposes that the central threat is identity fraud. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:21:In our context, identity is not in question; the CPA is indeed the legitimate signer. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:26:Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:27:Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28:A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, F_1, and equal-error-rate metrics that prevail in the biometric-verification literature. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:31:Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:34:From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:41:(4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance; /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:42:(5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:43:(6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:48:By requiring convergent evidence from both descriptors, we can differentiate style consistency (high cosine but divergent dHash) from image reproduction (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:51:One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:52:Structured interviews with multiple Firm A partners confirm that most certifying partners produce their audit-report signatures by reproducing a stored image while not excluding that a minority may continue to hand-sign some reports. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:53:We therefore treat Firm A as a replication-dominated calibration reference rather than a pure positive class. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54:This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:59:The substantive reading is that pixel-level output quality is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, but individual signing behavior is close to discrete---a given CPA is either a consistent user of non-hand-signing or a consistent hand-signer. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:71:3. Dual-descriptor verification. We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:73:4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:75:5. Continuous-quality / discrete-behavior finding. We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates are cleanly trimodal---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:77:6. Replication-dominated calibration methodology. We introduce a calibration strategy using a known-majority-positive reference group, distinguishing replication-dominated from replication-pure anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:82:Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:83:Section III describes the proposed methodology. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:84:Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:85:Section V discusses the implications and limitations of our findings. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:86:Section VI concludes with directions for future work. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:6:Fig. 1 illustrates the overall architecture. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:7:The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:13:[Figure 1: Pipeline Architecture - clean vector diagram] /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:17:→ Dual-Method Verification (Cosine + dHash) /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:18:→ Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:19:→ Pixel-identity + Firm A + Accountant-level GMM validation /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:30:Table I summarizes the dataset composition. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:53:Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents, establishing an upper bound on the VLM false-positive rate of 1.2%. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:61:The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:73:A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:82:Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:83:All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:85:The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:86:This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:97:Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:99:Perceptual hash distance (dHash) captures structural-level similarity. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:100:Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:102:Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:105:Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:107:Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:110:We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:111:Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:118:For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:119:The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:123:These accountant-level aggregates are the input to the mixture model described in Section III-I. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:125:## H. Calibration Reference: Firm A as a Replication-Dominated Population /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:127:A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:128:Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:129:This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:131:First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:132:Crucially, the same interview evidence does not exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:134:Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:138:We emphasize that Firm A's replication-dominated status was not derived from the thresholds we calibrate against it. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:140:The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144:Direct assignment of thresholds based on prior intuition (e.g., cosine \geq 0.95 for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145:To place threshold selection on a statistically principled and data-driven footing, we apply three independent methods whose underlying assumptions decrease in strength. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:148:### 1) Method 1: KDE + Antimode with Bimodality Check /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:151:A candidate threshold is taken at the location of the local density minimum (antimode) between modes of the fitted density. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:152:Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify stability. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:154:### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:156:We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:157:We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin i with count n_i, the standardized deviation from the smooth-null expectation of the average of its neighbours, /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:162:A threshold is identified at the transition where Z_{i-1} is significantly negative (observed count below expectation) adjacent to Z_i significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition Z^- \rightarrow Z^+ marks the candidate decision boundary. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:166:We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:168:Under the fitted model the threshold is the crossing point of the two weighted component densities, /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:173:As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the logit-transformed similarity, following standard practice for bounded proportion data; White's [41] quasi-MLE consistency result guarantees asymptotic recovery of the best Beta-family approximation to the true distribution even if the true component densities are not exactly Beta. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:176:When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:180:The three methods rest on decreasing-in-strength assumptions: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:181:If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:184:Applied to the per-signature cosine distribution the three methods may converge on a single boundary only weakly or not at all---because, as our results show, per-signature similarity is not a cleanly bimodal population. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:185:Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:186:We therefore explicitly analyze both levels and interpret their divergence as a substantive finding (Section V) rather than a statistical nuisance. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:190:In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:191:The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing behavior is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level quality lies on a continuous spectrum. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:194:For the selected K^* we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:196:## K. Pixel-Identity and Firm A Validation (No Manual Annotation) /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:198:Rather than construct a stratified manual-annotation validation set, we validate the classifier using three naturally occurring reference populations that require no human labeling: /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:200:1. Pixel-identical anchor (gold positive): signatures whose nearest same-CPA match is byte-identical after crop and normalization. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:201:Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:203:2. Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:204:Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:206:3. Low-similarity anchor (gold negative): signatures whose maximum same-CPA cosine similarity is below a conservative cutoff (0.70) that cannot plausibly arise from pixel-level duplication. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:208:From these anchors we report Equal Error Rate (EER), precision, recall, F_1, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:209:We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:213:The final per-document classification combines the three-method thresholds with the dual-descriptor framework. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214:Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution: /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:216:1. High-confidence non-hand-signed: Cosine > 0.95 AND dHash \leq 5. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:217:Both descriptors converge on strong replication evidence consistent with Firm A's median behavior. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:219:2. Moderate-confidence non-hand-signed: Cosine > 0.95 AND dHash in [6, 15]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:220:Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:222:3. High style consistency: Cosine > 0.95 AND dHash > 15. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225:4. Uncertain: Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:227:5. Likely hand-signed: Cosine below the KDE crossover threshold. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:229:The dHash thresholds (\leq 5 and \leq 15) are directly derived from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population. /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8:By combining deep-learning visual features with perceptual hashing and three statistically independent threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time. /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:9:After further validation, the technology could support financial regulators in automating signature-authenticity screening at national scale. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:11:The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:12:We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:13:However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:29:Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:30:Table IV summarizes the distributional statistics. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:44:Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:45:Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47:The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:55:## D. Hartigan Dip Test: Unimodality at the Signature Level /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:59:<!-- TABLE V: Hartigan Dip Test Results /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:60:| Distribution | N | dip | p-value | Verdict (α=0.05) | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:62:| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:64:| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:65:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:67:| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:70:Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:71:The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is multimodal (p < 0.001). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:72:At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:75:It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:77:### 1) Burgstahler-Dichev / McCrary Discontinuity /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:79:Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:80:We note that the cosine transition at 0.985 lies inside the non-hand-signed mode rather than at the separation with the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:81:In contrast, the dHash transition at distance 2 is a meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:85:Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:86:For the full-sample cosine the 3-component fit is likewise strongly preferred (\Delta\text{BIC} = 10{,}175). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:87:Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:88:Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:90:The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: at the per-signature level, no two-mechanism mixture explains the data. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:92:This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual behavior (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96:We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:97:BIC selects K^* = 3 (Table VI). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:99:<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC) /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:109:Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:111:<!-- TABLE VII: Accountant-Level 3-Component GMM /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:112:| Comp. | cos_mean | dHash_mean | weight | n | Dominant firms | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:114:| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120:First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:121:The remaining 32 Firm A CPAs fall into C2, consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:122:Second, the three-component partition is not a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123:Third, the 2-component fit used for threshold derivation yields marginal-density crossings at cosine = 0.945 and dHash = 8.10; these are the natural per-accountant thresholds. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:125:Table VIII summarizes the threshold estimates produced by the three convergent methods at each analysis level. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:128:| Level / method | Cosine threshold | dHash threshold | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:130:| Signature-level KDE crossover | 0.837 | — | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:131:| Signature-level BD/McCrary transition | 0.985 | 2.0 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:132:| Signature-level Beta 2-comp (Firm A) | 0.977 | — | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:133:| Signature-level LogGMM 2-comp (Full) | 0.980 | — | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:134:| Accountant-level 2-comp GMM crossing | 0.945 | 8.10 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:135:| Firm A P95 (median/95th pct calibration) | 0.95 | 15 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:136:| Firm A median calibration | — | 5 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:139:The accountant-level two-component crossing (cosine = 0.945, dHash = 8.10) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:142:## F. Calibration Validation with Firm A /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:144:Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:145:Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:147:<!-- TABLE IX: Firm A Anchor Rates Across Candidate Thresholds /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:148:| Rule | Firm A rate | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:150:| cosine > 0.837 | 99.93% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:151:| cosine > 0.941 | 95.08% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:152:| cosine > 0.945 (accountant 2-comp) | 94.5%† | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:153:| cosine > 0.95 | 92.51% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:154:| dHash_indep ≤ 5 | 84.20% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:155:| dHash_indep ≤ 8 | 95.17% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:156:| dHash_indep ≤ 15 | 99.83% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:157:| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:162:The Firm A anchor validation is consistent with the replication-dominated framing throughout: the most permissive cosine threshold (the KDE crossover at 0.837) captures nearly all Firm A signatures, while the more stringent thresholds progressively filter out the minority of hand-signing Firm A partners in the left tail. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:163:The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167:Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:168:These serve as the gold-positive anchor of Section III-K. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:169:Using signatures with cosine < 0.70 (n = 35) as the gold-negative anchor, we derive Equal-Error-Rate points and classification metrics for the canonical thresholds (Table X). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174:| cosine > 0.837 | KDE crossover | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:175:| cosine > 0.945 | Accountant crossing | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:176:| cosine > 0.95 | Canonical | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:177:| dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:178:| dHash_indep ≤ 8 | Accountant crossing | 0.966 | 1.000 | 0.983 | 0.314 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:179:| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:182:All cosine thresholds achieve perfect classification of the pixel-identical anchor against the low-similarity anchor, which is unsurprising given the complete separation between the two anchor populations. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:183:The dHash thresholds trade precision for recall along the expected tradeoff. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184:We emphasize that because the gold-positive anchor is a subset of the true non-hand-signing positives (only those that happen to be pixel-identical to their nearest match), recall against this anchor is conservative by construction: the classifier additionally flags many non-pixel-identical replications (low dHash but not zero) that the anchor cannot by itself validate. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:185:The negative-anchor population (n = 35) is likewise small because intra-CPA pairs rarely fall below cosine 0.70, so the reported FAR values should be read as order-of-magnitude rather than tight estimates. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:187:A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:193:<!-- TABLE XI: Classification Results (Dual-Method: Cosine + dHash) /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:194:| Verdict | N (PDFs) | % | Firm A | Firm A % | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:203:Within the 71,656 documents exceeding cosine 0.95, the dHash dimension stratifies them into three distinct populations: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:204:29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash \leq 5); /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:205:36,994 (51.7%) show partial structural similarity (dHash in [6, 15]) consistent with replication degraded by scan variations; /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:206:and 5,133 (7.2%) show no structural corroboration (dHash > 15), suggesting high signing consistency rather than image reproduction. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:207:A cosine-only classifier would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:209:### 1) Firm A Validation /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:211:96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:212:This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:213:The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217:Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:218:This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:224:Table XII presents the comparison. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:233:| KDE crossover | 0.837 | 0.850 | 0.792 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:234:| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:235:| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:237:Note: Firm A values in this table are computed over all intra-firm pairwise /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:239:the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:251:(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:19:[8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024. /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:77:[37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," Ann. Statist., vol. 13, no. 1, pp. 7084, 1985. /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:79:[38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," J. Account. Econ., vol. 24, no. 1, pp. 99126, 1997. /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:81:[39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," J. Econometrics, vol. 142, no. 2, pp. 698714, 2008. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:11:Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:15:A common thread in this literature is the assumption that the primary threat is identity fraud: a forger attempting to produce a convincing imitation of another person's signature. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:18:Brimoh and Olisah [8] proposed a consensus-threshold approach that derives classification boundaries from known genuine reference pairs, the methodology most closely related to our calibration strategy. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:19:However, their method operates on standard verification benchmarks with laboratory-collected signatures, whereas our approach applies threshold calibration using a replication-dominated subpopulation identified through domain expertise in real-world regulatory documents. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:32:Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:37:Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:47:Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:49:Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:51:Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:52:These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:56:Our threshold-determination framework combines three families of methods developed in statistics and accounting-econometrics. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:60:Where the distribution is bimodal, the local density minimum (antimode) between the two modes is the Bayes-optimal decision boundary under equal priors. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:61:The statistical validity of the bimodality itself can be tested independently via the Hartigan & Hartigan dip test [37], which we use as a formal bimodality diagnostic. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:64:Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:66:The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:67:The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:71:For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:72:Under mild regularity conditions, White's quasi-MLE consistency result [41] guarantees asymptotic recovery of the best Beta-family approximation to the true distribution, even when the true distribution is not exactly Beta, provided the model is correctly specified in the broader exponential-family sense. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:74:The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:82:[8] Brimoh & Olisah 2024 — Consensus threshold /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:94:[20] Engin et al. 2020 — ResNet + cosine on real docs /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:99:[37] Hartigan & Hartigan 1985 — dip test of unimodality /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md💯[38] Burgstahler & Dichev 1997 — earnings management discontinuity /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:101:[39] McCrary 2008 — density discontinuity test /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:103:[41] White 1982 — quasi-MLE consistency

codex Ive isolated the sections where the core claims live. Next Im reading Methodology, Results, and Discussion line-by-line because most of the review will turn on whether the thresholding logic, validation design, and interpretation are actually supported there. exec /bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md | sed -n '1,260p'" in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # III. Methodology 2 3 ## A. Pipeline Overview 4 5 We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents. 6 Fig. 1 illustrates the overall architecture. 7 The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor. 8 9 Throughout this paper we use the term non-hand-signed rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). 10 From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise. 11 12 21 22 ## B. Data Collection 23 24 The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. 25 The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings. 26 An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period. 27 Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs. 28 29 CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports. 30 Table I summarizes the dataset composition. 31 32 41 42 ## C. Signature Page Identification 43 44 To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism. 45 Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. 46 The model was configured with temperature 0 for deterministic output. 47 48 The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document. 49 Scanning terminated upon the first positive detection. 50 This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. 51 An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents. 52 53 Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents, establishing an upper bound on the VLM false-positive rate of 1.2%. 54 55 ## D. Signature Detection 56 57 We adopted YOLOv11n (nano variant) [25] for signature region localization. 58 A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. 59 A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps. 60 61 The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II). 62 63 71 72 Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). 73 A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content. 74 75 Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). 76 77 ## E. Feature Extraction 78 79 Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning. 80 The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer. 81 82 Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. 83 All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product. 84 85 The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D). 86 This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0. 87 88 ## F. Dual-Method Similarity Descriptors 89 90 For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA: 91 92 Cosine similarity on deep embeddings captures high-level visual style: 93 94 \text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B 95 96 where \mathbf{f}_A and \mathbf{f}_B are L2-normalized 2048-dim feature vectors. 97 Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14]. 98 99 Perceptual hash distance (dHash) captures structural-level similarity. 100 Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. 101 The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images. 102 Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19]. 103 104 These descriptors provide partially independent evidence. 105 Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. 106 Non-hand-signing yields extreme similarity under both descriptors, since the underlying image is identical up to reproduction noise. 107 Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). 108 Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline. 109 110 We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content. 111 Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle. 112 113 ## G. Unit of Analysis and Summary Statistics 114 115 Two unit-of-analysis choices are relevant for this study: (i) the signature---one signature image extracted from one report---and (ii) the accountant---the collection of all signatures attributed to a single CPA across the sample period. 116 A third composite unit---the auditor-year, i.e. all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution. 117 118 For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA. 119 The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism. 120 Mean statistics would dilute this signal. 121 122 For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. 123 These accountant-level aggregates are the input to the mixture model described in Section III-I. 124 125 ## H. Calibration Reference: Firm A as a Replication-Dominated Population 126 127 A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. 128 Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class. 129 This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration. 130 131 First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems. 132 Crucially, the same interview evidence does not exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports. 133 134 Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners. 135 136 Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. 137 138 We emphasize that Firm A's replication-dominated status was not derived from the thresholds we calibrate against it. 139 Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline. 140 The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference. 141 142 ## I. Three-Method Convergent Threshold Determination 143 144 Direct assignment of thresholds based on prior intuition (e.g., cosine \geq 0.95 for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others. 145 To place threshold selection on a statistically principled and data-driven footing, we apply three independent methods whose underlying assumptions decrease in strength. 146 When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement is itself a diagnostic of distributional structure. 147 148 ### 1) Method 1: KDE + Antimode with Bimodality Check 149 150 We fit a Gaussian kernel density estimate to the similarity distribution using Scott's rule for bandwidth selection [28]. 151 A candidate threshold is taken at the location of the local density minimum (antimode) between modes of the fitted density. 152 Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify stability. 153 154 ### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity 155 156 We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39]. 157 We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin i with count n_i, the standardized deviation from the smooth-null expectation of the average of its neighbours, 158 159 Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}}, 160 161 which is approximately N(0,1) under the null of distributional smoothness. 162 A threshold is identified at the transition where Z_{i-1} is significantly negative (observed count below expectation) adjacent to Z_i significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition Z^- \rightarrow Z^+ marks the candidate decision boundary. 163 164 ### 3) Method 3: Finite Mixture Model via EM 165 166 We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data). 167 The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread). 168 Under the fitted model the threshold is the crossing point of the two weighted component densities, 169 170 \pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2), 171 172 solved numerically via bracketed root-finding. 173 As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the logit-transformed similarity, following standard practice for bounded proportion data; White's [41] quasi-MLE consistency result guarantees asymptotic recovery of the best Beta-family approximation to the true distribution even if the true component densities are not exactly Beta. 174 175 We fit 2- and 3-component variants of each mixture and report BIC for model selection. 176 When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut. 177 178 ### 4) Convergent Validation and Level-Shift Diagnostic 179 180 The three methods rest on decreasing-in-strength assumptions: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification. 181 If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method. 182 183 Equally informative is the level at which the three methods agree. 184 Applied to the per-signature cosine distribution the three methods may converge on a single boundary only weakly or not at all---because, as our results show, per-signature similarity is not a cleanly bimodal population. 185 Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum. 186 We therefore explicitly analyze both levels and interpret their divergence as a substantive finding (Section V) rather than a statistical nuisance. 187 188 ## J. Accountant-Level Mixture Model 189 190 In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash). 191 The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing behavior is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level quality lies on a continuous spectrum. 192 193 We fit mixtures with K \in \{1, 2, 3, 4, 5\} components under full covariance, selecting K^* by BIC with 15 random initializations per K. 194 For the selected K^* we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds. 195 196 ## K. Pixel-Identity and Firm A Validation (No Manual Annotation) 197 198 Rather than construct a stratified manual-annotation validation set, we validate the classifier using three naturally occurring reference populations that require no human labeling: 199 200 1. Pixel-identical anchor (gold positive): signatures whose nearest same-CPA match is byte-identical after crop and normalization. 201 Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing. 202 203 2. Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above. 204 Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity. 205 206 3. Low-similarity anchor (gold negative): signatures whose maximum same-CPA cosine similarity is below a conservative cutoff (0.70) that cannot plausibly arise from pixel-level duplication. 207 208 From these anchors we report Equal Error Rate (EER), precision, recall, F_1, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3]. 209 We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics. 210 211 ## L. Per-Document Classification 212 213 The final per-document classification combines the three-method thresholds with the dual-descriptor framework. 214 Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution: 215 216 1. High-confidence non-hand-signed: Cosine > 0.95 AND dHash \leq 5. 217 Both descriptors converge on strong replication evidence consistent with Firm A's median behavior. 218 219 2. Moderate-confidence non-hand-signed: Cosine > 0.95 AND dHash in [6, 15]. 220 Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations. 221 222 3. High style consistency: Cosine > 0.95 AND dHash > 15. 223 High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction. 224 225 4. Uncertain: Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction. 226 227 5. Likely hand-signed: Cosine below the KDE crossover threshold. 228 229 The dHash thresholds (\leq 5 and \leq 15) are directly derived from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.

exec /bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md | sed -n '1,160p'" in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md | sed -n '1,280p'" in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # IV. Experiments and Results 2 3 ## A. Experimental Setup 4 5 All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration. 6 Feature extraction used PyTorch 2.9 with torchvision model implementations. 7 The complete pipeline---from raw PDF processing through final classification---was implemented in Python. 8 9 ## B. Signature Detection Performance 10 11 The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). 12 We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). 13 However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. 14 The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale. 15 16 26 27 ## C. Signature-Level Distribution Analysis 28 29 Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs. 30 Table IV summarizes the distributional statistics. 31 32 42 43 Both distributions are left-skewed and leptokurtic. 44 Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. 45 Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions. 46 47 The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). 48 Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes. 49 Statistical tests confirmed significant separation between the two distributions (Cohen's d = 0.669, Mann-Whitney p < 0.001, K-S 2-sample p < 0.001). 50 51 We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength. 52 We therefore rely primarily on Cohen's d as an effect-size measure that is less sensitive to sample size. 53 A Cohen's d of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count. 54 55 ## D. Hartigan Dip Test: Unimodality at the Signature Level 56 57 Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). 58 59 69 70 Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews. 71 The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is multimodal (p < 0.001). 72 At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E. 73 74 This asymmetry between signature level and accountant level is itself an empirical finding. 75 It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses. 76 77 ### 1) Burgstahler-Dichev / McCrary Discontinuity 78 79 Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample. 80 We note that the cosine transition at 0.985 lies inside the non-hand-signed mode rather than at the separation with the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal. 81 In contrast, the dHash transition at distance 2 is a meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication. 82 83 ### 2) Beta Mixture at Signature Level: A Forced Fit 84 85 Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check. 86 For the full-sample cosine the 3-component fit is likewise strongly preferred (\Delta\text{BIC} = 10{,}175). 87 Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data. 88 Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980. 89 90 The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: at the per-signature level, no two-mechanism mixture explains the data. 91 Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing. 92 This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual behavior (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks. 93 94 ## E. Accountant-Level Gaussian Mixture 95 96 We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}. 97 BIC selects K^* = 3 (Table VI). 98 99 108 109 Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit. 110 111 118 119 Three empirical findings stand out. 120 First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster. 121 The remaining 32 Firm A CPAs fall into C2, consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D. 122 Second, the three-component partition is not a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3. 123 Third, the 2-component fit used for threshold derivation yields marginal-density crossings at cosine = 0.945 and dHash = 8.10; these are the natural per-accountant thresholds. 124 125 Table VIII summarizes the threshold estimates produced by the three convergent methods at each analysis level. 126 127 138 139 The accountant-level two-component crossing (cosine = 0.945, dHash = 8.10) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture. 140 The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum / discrete-behavior asymmetry rather than as primary classification boundaries. 141 142 ## F. Calibration Validation with Firm A 143 144 Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. 145 Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?). 146 147 161 162 The Firm A anchor validation is consistent with the replication-dominated framing throughout: the most permissive cosine threshold (the KDE crossover at 0.837) captures nearly all Firm A signatures, while the more stringent thresholds progressively filter out the minority of hand-signing Firm A partners in the left tail. 163 The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix. 164 165 ## G. Pixel-Identity Validation 166 167 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1). 168 These serve as the gold-positive anchor of Section III-K. 169 Using signatures with cosine < 0.70 (n = 35) as the gold-negative anchor, we derive Equal-Error-Rate points and classification metrics for the canonical thresholds (Table X). 170 171 181 182 All cosine thresholds achieve perfect classification of the pixel-identical anchor against the low-similarity anchor, which is unsurprising given the complete separation between the two anchor populations. 183 The dHash thresholds trade precision for recall along the expected tradeoff. 184 We emphasize that because the gold-positive anchor is a subset of the true non-hand-signing positives (only those that happen to be pixel-identical to their nearest match), recall against this anchor is conservative by construction: the classifier additionally flags many non-pixel-identical replications (low dHash but not zero) that the anchor cannot by itself validate. 185 The negative-anchor population (n = 35) is likewise small because intra-CPA pairs rarely fall below cosine 0.70, so the reported FAR values should be read as order-of-magnitude rather than tight estimates. 186 187 A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics. 188 189 ## H. Classification Results 190 191 Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. 192 193 202 203 Within the 71,656 documents exceeding cosine 0.95, the dHash dimension stratifies them into three distinct populations: 204 29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash \leq 5); 205 36,994 (51.7%) show partial structural similarity (dHash in [6, 15]) consistent with replication degraded by scan variations; 206 and 5,133 (7.2%) show no structural corroboration (dHash > 15), suggesting high signing consistency rather than image reproduction. 207 A cosine-only classifier would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations. 208 209 ### 1) Firm A Validation 210 211 96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain. 212 This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers. 213 The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set. 214 215 ### 2) Cross-Method Agreement 216 217 Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer. 218 This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII). 219 220 ## I. Ablation Study: Feature Backbone Comparison 221 222 To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). 223 All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. 224 Table XII presents the comparison. 225 226 243 244 EfficientNet-B0 achieves the highest Cohen's d (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions. 245 However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), resulting in lower per-sample classification confidence. 246 VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task. 247 248 ResNet-50 provides the best overall balance: 249 (1) Cohen's d of 0.669 is competitive with EfficientNet-B0's 0.707; 250 (2) its tighter distributions yield more reliable individual classifications; 251 (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and 252 (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.

succeeded in 0ms: 1 # V. Discussion 2 3 ## A. Non-Hand-Signing Detection as a Distinct Problem 4 5 Our results highlight the importance of distinguishing non-hand-signing detection from the well-studied signature forgery detection problem. 6 In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature. 7 In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image). 8 9 This distinction has direct methodological consequences. 10 Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures. 11 Non-hand-signing detection, by contrast, requires sensitivity to the upper tail of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous. 12 The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot. 13 14 ## B. Continuous-Quality Spectrum vs. Discrete-Behavior Regimes 15 16 The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method convergent analysis and the Hartigan dip test (Sections IV-D and IV-E). 17 18 At the per-signature level, the distribution of best-match cosine similarity is not cleanly bimodal. 19 Firm A's signature-level cosine is formally unimodal (dip test p = 0.17) with a long left tail. 20 The all-CPA signature-level cosine is multimodal (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). 21 The BD/McCrary discontinuity test locates its transition at 0.985---inside the non-hand-signed mode rather than at a boundary. 22 Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class. 23 Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains. 24 25 At the per-accountant aggregate level the picture reverses. 26 The distribution of per-accountant mean cosine (and mean dHash) is unambiguously multimodal. 27 A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms. 28 The two-component marginal crossings (cosine = 0.945, dHash = 8.10) are sharp and mutually consistent. 29 30 The substantive interpretation is simple: pixel-level output quality is continuous, but individual signing behavior is close to discrete. 31 A given CPA tends to be either a consistent user of non-hand-signing or a consistent hand-signer; it is the mixing of these discrete behavioral types at the firm and population levels that produces the quality spectrum observed at the signature level. 32 Methodologically, this implies that the three-method convergent framework (KDE antimode / BD-McCrary / Beta mixture) is meaningful at the level where the mixture structure is statistically supported---namely the accountant level---and serves as diagnostic evidence rather than a primary classification boundary at the signature level. 33 34 ## C. Firm A as a Replication-Dominated, Not Pure, Population 35 36 A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class. 37 Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference. 38 39 Three convergent strands of evidence support the replication-dominated framing. 40 First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign. 41 Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail. 42 Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers. 43 44 The replication-dominated framing is internally coherent with all three pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise. 45 We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors. 46 47 ## D. The Style-Replication Gap 48 49 Within the 71,656 documents exceeding cosine 0.95, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance. 50 A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations. 51 52 The 7.2% classified as "high style consistency" (cosine > 0.95 but dHash > 15) are particularly informative. 53 Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions. 54 Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting. 55 Some may use signing pads or templates that further constrain variability without constituting image-level reproduction. 56 The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence. 57 58 ## E. Value of a Replication-Dominated Calibration Group 59 60 The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels. 61 In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance. 62 Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset. 63 64 This calibration strategy has broader applicability beyond signature analysis. 65 Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives. 66 The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates interview-acknowledged heterogeneity, and yields classification rates that are internally consistent with the data. 67 68 ## F. Pixel-Identity as Annotation-Free Ground Truth 69 70 A further methodological contribution is the use of byte-level pixel identity as an annotation-free gold positive. 71 Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review. 72 In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly. 73 Paired with a conservative low-similarity anchor, this yields EER-style validation without the stratified manual annotation that would otherwise be the default methodological choice. 74 We regard this as a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself. 75 76 ## G. Limitations 77 78 Several limitations should be acknowledged. 79 80 First, comprehensive per-document ground truth labels are not available. 81 The pixel-identity anchor is a strict subset of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), and the low-similarity anchor (n = 35) is small because intra-CPA pairs rarely fall below cosine 0.70. 82 The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work. 83 84 Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning. 85 While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance. 86 87 Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. 88 In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity. 89 This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified. 90 91 Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements. 92 While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions. 93 94 Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted non-hand-signing later). 95 Extending the accountant-level analysis to auditor-year units is a natural next step. 96 97 Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation. 98 This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes. 99 100 Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing." 101 Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.

exec /bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md | sed -n '1,120p'" in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # Abstract 2 3 4 5 Regulations in many jurisdictions require Certified Public Accountants (CPAs) to attest to each audit report they certify, typically by affixing a signature or seal. 6 However, the digitization of financial reporting makes it straightforward to reuse a stored signature image across multiple reports---whether by administrative stamping or firm-level electronic signing systems---potentially undermining the intent of individualized attestation. 7 Unlike signature forgery, where an impostor imitates another person's handwriting, non-hand-signed reproduction involves the legitimate signer's own stored signature image being reproduced on each report, a practice that is visually invisible to report users and infeasible to audit at scale through manual inspection. 8 We present an end-to-end AI pipeline that automatically detects non-hand-signed auditor signatures in financial audit reports. 9 The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash). 10 For threshold determination we apply three statistically independent methods---Kernel Density antimode with a Hartigan dip test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level. 11 Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. 12 The accountant-level 2-component crossings yield principled thresholds (cosine = 0.945, dHash = 8.10). 13 A major Big-4 firm is used as a replication-dominated (not pure) calibration anchor, with interview and visual evidence supporting majority non-hand-signing and a minority of hand-signers. 14 Validation against 310 pixel-identical signature pairs and a low-similarity negative anchor yields perfect recall at all candidate thresholds. 15 To our knowledge, this represents the largest-scale forensic analysis of auditor signature authenticity reported in the literature. 16 17

exec /bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md | sed -n '1,140p'" in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # I. Introduction 2 3 4 5 Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection. 6 In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require that certifying CPAs affix their signature or seal (簽名或蓋章) to each audit report [1]. 7 While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2]. 8 9 The digitization of financial reporting has introduced a practice that complicates this intent. 10 As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one. 11 This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step. 12 From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise. 13 We refer to signatures produced by either workflow collectively as non-hand-signed. 14 Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement. 15 Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused. 16 This practice, while potentially widespread, is visually invisible to report users and virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of image reproduction. 17 18 The distinction between non-hand-signing detection and signature forgery detection is both conceptually and technically important. 19 The extensive body of research on offline signature verification [3]--[8] has focused almost exclusively on forgery detection---determining whether a questioned signature was produced by its purported author or by an impostor. 20 This framing presupposes that the central threat is identity fraud. 21 In our context, identity is not in question; the CPA is indeed the legitimate signer. 22 The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports. 23 This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction, requiring an analytical framework focused on detecting abnormally high similarity across documents. 24 25 A secondary methodological concern shapes the research design. 26 Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification. 27 Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference. 28 A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, F_1, and equal-error-rate metrics that prevail in the biometric-verification literature. 29 30 Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. 31 Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image. 32 Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures, where legitimate visual similarity between a signer's authentic signatures is expected and must be distinguished from image reproduction. 33 Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. 34 From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection. 35 36 In this paper, we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale. 37 Our approach processes raw PDF documents through the following stages: 38 (1) signature page identification using a Vision-Language Model (VLM); 39 (2) signature region detection using a trained YOLOv11 object detector; 40 (3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network; 41 (4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance; 42 (5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and 43 (6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm. 44 45 The dual-descriptor verification is central to our contribution. 46 Cosine similarity of deep feature embeddings captures high-level visual style similarity---it can identify signatures that share similar stroke patterns and spatial layouts---but cannot distinguish between a CPA who signs consistently and one whose signature is reproduced from a stored image. 47 Perceptual hashing (specifically, difference hashing) encodes structural-level image gradients into compact binary fingerprints that are robust to scan noise but sensitive to substantive content differences. 48 By requiring convergent evidence from both descriptors, we can differentiate style consistency (high cosine but divergent dHash) from image reproduction (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone. 49 50 A second distinctive feature is our framing of the calibration reference. 51 One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing. 52 Structured interviews with multiple Firm A partners confirm that most certifying partners produce their audit-report signatures by reproducing a stored image while not excluding that a minority may continue to hand-sign some reports. 53 We therefore treat Firm A as a replication-dominated calibration reference rather than a pure positive class. 54 This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode. 55 Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence between the interview evidence and the statistical results. 56 57 A third distinctive feature is our unit-of-analysis treatment. 58 Our three-method convergent framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are cleanly trimodal (BIC-best K = 3). 59 The substantive reading is that pixel-level output quality is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, but individual signing behavior is close to discrete---a given CPA is either a consistent user of non-hand-signing or a consistent hand-signer. 60 The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level. 61 62 We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. 63 To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature. 64 65 The contributions of this paper are summarized as follows: 66 67 1. Problem formulation. We formally define non-hand-signing detection as distinct from signature forgery detection and argue that it requires an analytical framework focused on intra-signer similarity distributions rather than genuine-versus-forged classification. 68 69 2. End-to-end pipeline. We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity computation, with automated inference requiring no manual intervention after initial training and annotation. 70 71 3. Dual-descriptor verification. We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures. 72 73 4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence. 74 75 5. Continuous-quality / discrete-behavior finding. We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates are cleanly trimodal---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics. 76 77 6. Replication-dominated calibration methodology. We introduce a calibration strategy using a known-majority-positive reference group, distinguishing replication-dominated from replication-pure anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling. 78 79 7. Large-scale empirical analysis. We report findings from the analysis of over 90,000 audit reports spanning a decade, providing the first large-scale empirical evidence on non-hand-signing practices in financial reporting under a methodology designed for peer-review defensibility. 80 81 The remainder of this paper is organized as follows. 82 Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination. 83 Section III describes the proposed methodology. 84 Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study. 85 Section V discusses the implications and limitations of our findings. 86 Section VI concludes with directions for future work.

exec /bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md | sed -n '1,160p'" in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # II. Related Work 2 3 ## A. Offline Signature Verification 4 5 Offline signature verification---determining whether a static signature image is genuine or forged---has been studied extensively using deep learning. 6 Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant. 7 Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work. 8 Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining. 9 Hadjadj et al. [5] addressed the practical constraint of limited reference samples, achieving competitive verification accuracy using only a single known genuine signature per writer. 10 More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results. 11 Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives. 12 Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer. 13 Hafemann et al. [16] further addressed the practical challenge of adapting to new users through meta-learning, reducing the enrollment burden for signature verification systems. 14 15 A common thread in this literature is the assumption that the primary threat is identity fraud: a forger attempting to produce a convincing imitation of another person's signature. 16 Our work addresses a fundamentally different problem---detecting whether the legitimate signer's stored signature image has been reproduced across many documents---which requires analyzing the upper tail of the intra-signer similarity distribution rather than modeling inter-signer discriminability. 17 18 Brimoh and Olisah [8] proposed a consensus-threshold approach that derives classification boundaries from known genuine reference pairs, the methodology most closely related to our calibration strategy. 19 However, their method operates on standard verification benchmarks with laboratory-collected signatures, whereas our approach applies threshold calibration using a replication-dominated subpopulation identified through domain expertise in real-world regulatory documents. 20 21 ## B. Document Forensics and Copy Detection 22 23 Image forensics encompasses a broad range of techniques for detecting manipulated visual content [17], with recent surveys highlighting the growing role of deep learning in forgery detection [18]. 24 Copy-move forgery detection (CMFD) identifies duplicated regions within or across images, typically targeting manipulated photographs [11]. 25 Abramova and Böhme [10] adapted block-based CMFD to scanned text documents, noting that standard methods perform poorly in this domain because legitimate character repetitions produce high similarity scores that confound duplicate detection. 26 27 Woodruff et al. [9] developed the work most closely related to ours: a fully automated pipeline for extracting and analyzing signatures from corporate filings in the context of anti-money-laundering investigations. 28 Their system uses connected component analysis for signature detection, GANs for noise removal, and Siamese networks for author clustering. 29 While their pipeline shares our goal of large-scale automated signature analysis on real regulatory documents, their objective---grouping signatures by authorship---differs fundamentally from ours, which is detecting image-level reproduction within a single author's signatures across documents. 30 31 In the domain of image copy detection, Pizzi et al. [13] proposed SSCD, a self-supervised descriptor using ResNet-50 with contrastive learning for large-scale copy detection on natural images. 32 Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach. 33 34 ## C. Perceptual Hashing 35 36 Perceptual hashing algorithms generate compact fingerprints that are robust to minor image transformations while remaining sensitive to substantive content changes [19]. 37 Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process. 38 39 Jakhar and Borah [12] demonstrated that combining perceptual hashing with deep learning features significantly outperforms either approach alone for near-duplicate image detection, achieving AUROC of 0.99 on standard benchmarks. 40 Their two-stage architecture---pHash for fast structural comparison followed by deep features for semantic verification---provides methodological precedent for our dual-descriptor approach, though applied to natural images rather than document signatures. 41 42 Our work differs from prior perceptual-hashing studies in its application context and in the specific challenge it addresses: distinguishing legitimate high visual consistency (a careful signer producing similar-looking signatures) from image-level reproduction in scanned financial documents. 43 44 ## D. Deep Feature Extraction for Signature Analysis 45 46 Several studies have explored pre-trained CNN features for signature comparison without metric learning or Siamese architectures. 47 Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach. 48 Tsourounis et al. [21] demonstrated successful transfer from handwritten text recognition to signature verification, showing that CNN features trained on related but distinct handwriting tasks generalize effectively to signature comparison. 49 Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach. 50 51 Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach. 52 These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical. 53 54 ## E. Statistical Methods for Threshold Determination 55 56 Our threshold-determination framework combines three families of methods developed in statistics and accounting-econometrics. 57 58 Non-parametric density estimation. 59 Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions. 60 Where the distribution is bimodal, the local density minimum (antimode) between the two modes is the Bayes-optimal decision boundary under equal priors. 61 The statistical validity of the bimodality itself can be tested independently via the Hartigan & Hartigan dip test [37], which we use as a formal bimodality diagnostic. 62 63 Discontinuity tests on empirical distributions. 64 Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions. 65 Under the null that the distribution is generated by a single smooth process, the expected count in any histogram bin equals the average of its two neighbours, and the standardized deviation from this expectation is approximately N(0,1). 66 The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis. 67 The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions. 68 69 Finite mixture models. 70 When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters. 71 For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation. 72 Under mild regularity conditions, White's quasi-MLE consistency result [41] guarantees asymptotic recovery of the best Beta-family approximation to the true distribution, even when the true distribution is not exactly Beta, provided the model is correctly specified in the broader exponential-family sense. 73 74 The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported. 75

exec /bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md | sed -n '1,80p'" in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md | sed -n '1,120p'" in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # Impact Statement 2 3 4 5 Auditor signatures on financial reports are a key safeguard of corporate accountability. 6 When the signature on an audit report is produced by reproducing a stored image instead of by the partner's own hand---whether through an administrative stamping workflow or a firm-level electronic signing system---this safeguard is weakened, yet detecting the practice through manual inspection is infeasible at the scale of modern financial markets. 7 We developed an artificial intelligence system that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan. 8 By combining deep-learning visual features with perceptual hashing and three statistically independent threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time. 9 After further validation, the technology could support financial regulators in automating signature-authenticity screening at national scale.

succeeded in 0ms: 1 # VI. Conclusion and Future Work 2 3 ## Conclusion 4 5 We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale. 6 Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels. 7 8 Our contributions are fourfold. 9 10 First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability. 11 12 Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction. 13 14 Third, we introduced a three-method convergent threshold framework combining KDE antimode (with a Hartigan dip test as formal bimodality check), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check). 15 Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine = 0.945, dHash = 8.10) are sharp and mutually consistent. 16 The substantive reading is that pixel-level output quality is continuous while individual signing behavior is close to discrete. 17 18 Fourth, we introduced a replication-dominated calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor that requires no manual annotation. 19 This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters. 20 21 An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task. 22 23 ## Future Work 24 25 Several directions merit further investigation. 26 Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study. 27 Extending the accountant-level analysis to auditor-year units---using the same three-method convergent framework but at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade. 28 The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration. 29 The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself. 30 Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.

exec /bin/zsh -lc "nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md | sed -n '1,140p'" in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # References 2 3 4 5 [1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067 6 7 [2] S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," Res. Account. Regul., vol. 25, no. 2, pp. 230235, 2013. 8 9 [3] J. Bromley et al., "Signature verification using a Siamese time delay neural network," in Proc. NeurIPS, 1993. 10 11 [4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017. 12 13 [5] I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," Appl. Sci., vol. 10, no. 11, p. 3716, 2020. 14 15 [6] H. Li et al., "TransOSV: Offline signature verification with transformers," Pattern Recognit., vol. 145, p. 109882, 2024. 16 17 [7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," Mathematics, vol. 12, no. 17, p. 2757, 2024. 18 19 [8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024. 20 21 [9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021. 22 23 [10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," in Proc. Electronic Imaging, 2016. 24 25 [11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," Multimedia Tools Appl., 2024. 26 27 [12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," Inf. Process. Manage., p. 104086, 2025. 28 29 [13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," in Proc. CVPR, 2022. 30 31 [14] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," Pattern Recognit., vol. 70, pp. 163176, 2017. 32 33 [15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," IEEE Trans. Inf. Forensics Security, vol. 19, pp. 13421356, 2024. 34 35 [16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 17351745, 2019. 36 37 [17] H. Farid, "Image forgery detection," IEEE Signal Process. Mag., vol. 26, no. 2, pp. 1625, 2009. 38 39 [18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," Pattern Recognit., vol. 144, art. no. 109778, 2023. 40 41 [19] J. Luo et al., "A survey of perceptual hashing for multimedia," ACM Trans. Multimedia Comput. Commun. Appl., vol. 21, no. 7, 2025. 42 43 [20] D. Engin et al., "Offline signature verification on real-world documents," in Proc. CVPRW, 2020. 44 45 [21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," Expert Syst. Appl., 2022. 46 47 [22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification," Procedia Comput. Sci., vol. 270, 2025. 48 49 [23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in Proc. ECCV, 2014, pp. 584599. 50 51 [24] Qwen2.5-VL Technical Report, Alibaba Group, 2025. 52 53 [25] Ultralytics, "YOLOv11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/ 54 55 [26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016. 56 57 [27] N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013. [Online]. Available: https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html 58 59 [28] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman & Hall, 1986. 60 61 [29] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988. 62 63 [30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600612, 2004. 64 65 [31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," The Accounting Review, vol. 88, no. 5, pp. 15111546, 2013. 66 67 [32] A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," Int. J. Auditing, vol. 18, no. 3, pp. 172192, 2014. 68 69 [33] W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," Contemp. Account. Res., vol. 26, no. 2, pp. 359391, 2009. 70 71 [34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. CVPR, 2016, pp. 779788. 72 73 [35] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 8, pp. 56255644, 2024. 74 75 [36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," Ann. Math. Statist., vol. 18, no. 1, pp. 5060, 1947. 76 77 [37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," Ann. Statist., vol. 13, no. 1, pp. 7084, 1985. 78 79 [38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," J. Account. Econ., vol. 24, no. 1, pp. 99126, 1997. 80 81 [39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," J. Econometrics, vol. 142, no. 2, pp. 698714, 2008. 82 83 [40] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, no. 1, pp. 138, 1977. 84 85 [41] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, no. 1, pp. 125, 1982. 86 87

exec /bin/zsh -lc "rg -n 'Table [IVX]+|Fig\. [0-9]+' /Volumes/NV2/pdf_recognize/paper/paper_a_v3.md" in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc "rg -n '0\.941|0\.945|0\.95|0\.985|8\.10|6\.99|7\.00|71,656|84,386|180|182,328|168,755|168,740|686|758|90,282|86,071|86,072|4,198|12 corrupted|30,000\+' /Volumes/NV2/pdf_recognize/paper/paper_a_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:11:The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:12:We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:13:However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:29:Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:30:Table IV summarizes the distributional statistics. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47:The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:97:BIC selects K^* = 3 (Table VI). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:109:Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:125:Table VIII summarizes the threshold estimates produced by the three convergent methods at each analysis level. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:144:Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:145:Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:169:Using signatures with cosine < 0.70 (n = 35) as the gold-negative anchor, we derive Equal-Error-Rate points and classification metrics for the canonical thresholds (Table X). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:218:This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:224:Table XII presents the comparison. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:6:Fig. 1 illustrates the overall architecture. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:30:Table I summarizes the dataset composition. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:61:The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).

succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:13:However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:19:| Documents processed | 86,071 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:21:| Total signatures extracted | 182,328 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:23:| CPA-matched signatures | 168,755 (92.6%) | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:36:| Mean | 0.821 | 0.758 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:64:| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:65:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:66:| Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:67:| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:79:Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:80:We note that the cosine transition at 0.985 lies inside the non-hand-signed mode rather than at the separation with the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96:We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:115:| C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120:First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123:Third, the 2-component fit used for threshold derivation yields marginal-density crossings at cosine = 0.945 and dHash = 8.10; these are the natural per-accountant thresholds. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:131:| Signature-level BD/McCrary transition | 0.985 | 2.0 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:134:| Accountant-level 2-comp GMM crossing | 0.945 | 8.10 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:135:| Firm A P95 (median/95th pct calibration) | 0.95 | 15 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:139:The accountant-level two-component crossing (cosine = 0.945, dHash = 8.10) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:151:| cosine > 0.941 | 95.08% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:152:| cosine > 0.945 (accountant 2-comp) | 94.5%† | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:153:| cosine > 0.95 | 92.51% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:157:| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:163:The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167:Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:175:| cosine > 0.945 | Accountant crossing | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:176:| cosine > 0.95 | Canonical | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:179:| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:199:| Uncertain | 12,683 | 15.0% | 758 | 2.5% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:203:Within the 71,656 documents exceeding cosine 0.95, the dHash dimension stratifies them into three distinct populations: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:207:A cosine-only classifier would treat all 71,656 identically; the dual-method framework separates them into populations with fundamentally different interpretations. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:213:The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217:Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:231:| Inter mean | 0.758 | 0.767 | 0.699 | /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:14:90,282 PDFs → VLM Pre-screening → 86,072 PDFs /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:15:→ YOLOv11 Detection → 182,328 signatures /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:24:The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:29:CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:35:| Total PDF documents | 90,282 | /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:37:| Documents with signatures | 86,072 (95.4%) | /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:38:| Unique CPAs identified | 758 | /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:45:Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:50:This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:51:An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:67:| Recall | 0.950.98 | /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:69:| mAP@0.50:0.95 | 0.850.90 | /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:72:Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:75:Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144:Direct assignment of thresholds based on prior intuition (e.g., cosine \geq 0.95 for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:216:1. High-confidence non-hand-signed: Cosine > 0.95 AND dHash \leq 5. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:219:2. Moderate-confidence non-hand-signed: Cosine > 0.95 AND dHash in [6, 15]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:222:3. High style consistency: Cosine > 0.95 AND dHash > 15. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225:4. Uncertain: Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54:This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:62:We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:6:Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:12:Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:15:Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine = 0.945, dHash = 8.10) are sharp and mutually consistent. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19:This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:21:The BD/McCrary discontinuity test locates its transition at 0.985---inside the non-hand-signed mode rather than at a boundary. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:28:The two-component marginal crossings (cosine = 0.945, dHash = 8.10) are sharp and mutually consistent. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:41:Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42:Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:49:Within the 71,656 documents exceeding cosine 0.95, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:50:A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:52:The 7.2% classified as "high style consistency" (cosine > 0.95 but dHash > 15) are particularly informative. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:72:In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12:The accountant-level 2-component crossings yield principled thresholds (cosine = 0.945, dHash = 8.10).

exec /bin/zsh -lc "rg -n 'undefined|not shown|not reported|not used|interpolated|all-pairs|best-match|independent|pixel-identical-to-closest|Equal Error Rate|EER|FAR|FRR|precision|recall|F1|Shapiro|Kolmogorov|BIC|AIC|Scott' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:11:[4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017. /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:33:[15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," IEEE Trans. Inf. Forensics Security, vol. 19, pp. 13421356, 2024. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:44:Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:45:Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:51:We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:65:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:85:Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:86:For the full-sample cosine the 3-component fit is likewise strongly preferred (\Delta\text{BIC} = 10{,}175). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96:We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:97:BIC selects K^* = 3 (Table VI). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:99:<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC) /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md💯| K | BIC | AIC | Converged | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:139:The accountant-level two-component crossing (cosine = 0.945, dHash = 8.10) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:159:† interpolated from adjacent rates; all other rates computed exactly. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:167:Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:172:| Indicator | Threshold | Precision | Recall | F1 | FAR | FRR | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:183:The dHash thresholds trade precision for recall along the expected tradeoff. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184:We emphasize that because the gold-positive anchor is a subset of the true non-hand-signing positives (only those that happen to be pixel-identical to their nearest match), recall against this anchor is conservative by construction: the classifier additionally flags many non-pixel-identical replications (low dHash but not zero) that the anchor cannot by itself validate. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:185:The negative-anchor population (n = 35) is likewise small because intra-CPA pairs rarely fall below cosine 0.70, so the reported FAR values should be read as order-of-magnitude rather than tight estimates. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:187:A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:234:| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:235:| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:239:the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:251:(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:7:Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:8:Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:12:Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:61:The statistical validity of the bimodality itself can be tested independently via the Hartigan & Hartigan dip test [37], which we use as a formal bimodality diagnostic. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:74:The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:6:Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:15:Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine = 0.945, dHash = 8.10) are sharp and mutually consistent. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10:For threshold determination we apply three statistically independent methods---Kernel Density antimode with a Hartigan dip test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14:Validation against 310 pixel-identical signature pairs and a low-similarity negative anchor yields perfect recall at all candidate thresholds. /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8:By combining deep-learning visual features with perceptual hashing and three statistically independent threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:18:At the per-signature level, the distribution of best-match cosine similarity is not cleanly bimodal. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:20:The all-CPA signature-level cosine is multimodal (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:27:A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:71:Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:73:Paired with a conservative low-similarity anchor, this yields EER-style validation without the stratified manual annotation that would otherwise be the default methodological choice. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:82:The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28:A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, F_1, and equal-error-rate metrics that prevail in the biometric-verification literature. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:42:(5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:58:Our three-method convergent framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are cleanly trimodal (BIC-best K = 3). /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:73:4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:7:The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:58:A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:104:These descriptors provide partially independent evidence. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:129:This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:134:Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:139:Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145:To place threshold selection on a statistically principled and data-driven footing, we apply three independent methods whose underlying assumptions decrease in strength. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:150:We fit a Gaussian kernel density estimate to the similarity distribution using Scott's rule for bandwidth selection [28]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:152:Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify stability. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:175:We fit 2- and 3-component variants of each mixture and report BIC for model selection. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:176:When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:190:In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:193:We fit mixtures with K \in \{1, 2, 3, 4, 5\} components under full covariance, selecting K^* by BIC with 15 random initializations per K. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:201:Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:208:From these anchors we report Equal Error Rate (EER), precision, recall, F_1, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3].

codex Ive finished the core read. The main problems are not about the topic; they are about internal evidentiary standards: the accountant-level “three-method convergence” is not actually shown, some validation is circular, and a few numerical statements are inconsistent enough that they need to be fixed before submission. exec /bin/zsh -lc "rg -n 'sensitivity|bandwidth|\+/- 50|± 50|Scott' /Volumes/NV2/pdf_recognize/paper/paper_a_v3.md" in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc "rg -n 'independent minimum dHash|dHash_indep|independent\)' /Volumes/NV2/pdf_recognize/paper/paper_a_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:66:The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:150:We fit a Gaussian kernel density estimate to the similarity distribution using Scott's rule for bandwidth selection [28]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:152:Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify stability. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:11:Non-hand-signing detection, by contrast, requires sensitivity to the upper tail of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.

succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:65:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:96:We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:154:| dHash_indep ≤ 5 | 84.20% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:155:| dHash_indep ≤ 8 | 95.17% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:156:| dHash_indep ≤ 15 | 99.83% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:157:| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:177:| dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:178:| dHash_indep ≤ 8 | Accountant crossing | 0.966 | 1.000 | 0.983 | 0.314 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:179:| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:190:In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).

exec /bin/zsh -lc "rg -n '84,386|85,042|per-document|per document|document-level|each document into one of five categories|first or second signature|two certifying CPAs' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:13:However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:20:| Documents with detections | 85,042 (98.8%) | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:22:| Avg. signatures per document | 2.14 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:75:Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:213:The final per-document classification combines the three-method thresholds with the dual-descriptor framework. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214:Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution: /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:80:First, comprehensive per-document ground truth labels are not available.

exec /bin/zsh -lc "rg -n 'accountant-level.*KDE|accountant-level.*BD|accountant-level.*Beta|per-accountant.*KDE|per-accountant.*Beta|per-accountant.BD|two-component fit used for threshold derivation|three methods yield a sharp and consistent boundary|three methods reveal|three-method thresholds are therefore statistically supported at the accountant level' /Volumes/NV2/pdf_recognize/paper/paper_a__v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:185:Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum.

codex Im checking a couple of quantitative implications before writing the review, mainly to avoid hand-waving about the negative anchor. With only n=35 negatives, the tabled FAR values and even the 0.000 entries have very wide uncertainty, and I want to state that precisely. 2026-04-20T16:47:47.550868Z ERROR codex_core::tools::router: error=exec_command failed for /bin/zsh -lc "python - <<'PY' import math from statsmodels.stats.proportion import proportion_confint n=35 for x,label in [(0,'0/35'),(6,'6/35'),(11,'11/35'),(24,'24/35')]: lo,hi=proportion_confint(x,n,alpha=0.05,method='beta') print(label, round(x/n,3), (round(lo,3), round(hi,3))) PY": CreateProcess { message: "Codex(Sandbox(Denied { output: ExecToolCallOutput { exit_code: 1, stdout: StreamOutput { text: "", truncated_after_lines: None }, stderr: StreamOutput { text: "zsh:1: can't create temp file for here document: operation not permitted\n", truncated_after_lines: None }, aggregated_output: StreamOutput { text: "zsh:1: can't create temp file for here document: operation not permitted\n", truncated_after_lines: None }, duration: 0ns, timed_out: false }, network_policy_decision: None }))" } 2026-04-20T16:47:47.593389Z ERROR codex_core::tools::router: error=exec_command failed for /bin/zsh -lc "python - <<'PY' vals=[29529,36994,5133,12683,47] print(sum(vals)) print(139+32,180-(139+32)) PY": CreateProcess { message: "Codex(Sandbox(Denied { output: ExecToolCallOutput { exit_code: 1, stdout: StreamOutput { text: "", truncated_after_lines: None }, stderr: StreamOutput { text: "zsh:1: can't create temp file for here document: operation not permitted\n", truncated_after_lines: None }, aggregated_output: StreamOutput { text: "zsh:1: can't create temp file for here document: operation not permitted\n", truncated_after_lines: None }, duration: 0ns, timed_out: false }, network_policy_decision: None }))" } exec /bin/zsh -lc 'python -c "vals=[29529,36994,5133,12683,47]; print(sum(vals)); print(139+32, 180-(139+32))"' in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc "python -c "from math import isnan; from statsmodels.stats.proportion import proportion_confint; n=35 for x,label in [(0,'0/35'),(6,'6/35'),(11,'11/35'),(24,'24/35')]: lo,hi=proportion_confint(x,n,alpha=0.05,method='beta'); print(label, round(x/n,3), (round(lo,3), round(hi,3)))"" in /Volumes/NV2/pdf_recognize exited 127 in 0ms: zsh:1: command not found: python

exited 127 in 0ms: zsh:1: command not found: python

exec /bin/zsh -lc 'python3 -c "vals=[29529,36994,5133,12683,47]; print(sum(vals)); print(139+32, 180-(139+32))"' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 84386 171 9

exec /bin/zsh -lc "python3 -c "from statsmodels.stats.proportion import proportion_confint; n=35 for x,label in [(0,'0/35'),(6,'6/35'),(11,'11/35'),(24,'24/35')]: lo,hi=proportion_confint(x,n,alpha=0.05,method='beta'); print(label, round(x/n,3), (round(lo,3), round(hi,3)))"" in /Volumes/NV2/pdf_recognize exited 1 in 0ms: Traceback (most recent call last): File "", line 1, in from statsmodels.stats.proportion import proportion_confint; n=35 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ModuleNotFoundError: No module named 'statsmodels'

exec /bin/zsh -lc 'python3 -c "import scipy; print(scipy.version)"' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1.17.1

exec /bin/zsh -lc "python3 -c "from scipy.stats import beta n=35 for x,label in [(0,'0/35'),(6,'6/35'),(11,'11/35'),(24,'24/35')]: a=0.05 lo=0.0 if x==0 else beta.ppf(a/2, x, n-x+1) hi=1.0 if x==n else beta.ppf(1-a/2, x+1, n-x) print(label, round(x/n,3), round(lo,3), round(hi,3))"" in /Volumes/NV2/pdf_recognize succeeded in 590ms: 0/35 0.0 0.0 0.1 6/35 0.171 0.066 0.336 11/35 0.314 0.169 0.493 24/35 0.686 0.507 0.831

exec /bin/zsh -lc 'python3 -c "from scipy.stats import beta x,n=310,310 a=0.05 lo=0.0 if x==0 else beta.ppf(a/2,x,n-x+1) hi=1.0 if x==n else beta.ppf(1-a/2,x+1,n-x) print(round(lo,3), round(hi,3))"' in /Volumes/NV2/pdf_recognize succeeded in 418ms: 0.988 1.0

Completed the manuscript read and synthesized the main technical and presentation issues into a reviewer-style assessment. ✓ Load all manuscript section files with line numbers and identify tables/figures/results claims ✓ Evaluate technical claims around thresholding, accountant-level GMM, and Firm A framing ✓ Draft IEEE-Access-style review with prioritized findings and exact citations codex 1. Overall Assessment

Recommendation: major revision.

The paper tackles an interesting and important problem with an unusually strong corpus, but the current draft overstates what the evidence establishes. The main concern is not the extraction pipeline; it is that the thresholding and validation narrative is stronger than the analyses actually reported. In its present form, I would not recommend acceptance without substantial revision to the claims, reporting, and validation design.

2. Technical Soundness of the Threshold Framework and Accountant-Level GMM

  • The largest problem is that the manuscript repeatedly claims accountant-level convergence of three methods, but Section IV does not actually show that. Section III-I says the three methods "yield a sharp and consistent boundary" at the accountant level paper_a_methodology_v3.md, and the Introduction says the "convergent three-method thresholds are therefore statistically supported at the accountant level" paper_a_introduction_v3.md. Yet Section IV-E/Table VIII report only one accountant-level threshold source: the 2-component accountant GMM crossing at 0.945 / 8.10 paper_a_results_v3.md, paper_a_results_v3.md. I do not see accountant-level KDE antimodes, accountant-level BD/McCrary thresholds, or accountant-level Beta/logit-mixture thresholds. As written, the core "three-method convergence" claim is unsupported.

  • Method 1 is internally inconsistent. Section III-I.1 defines the KDE method as the local density minimum (antimode) of a single distribution paper_a_methodology_v3.md, but Section IV-C reports a KDE crossover where the two density functions intersect paper_a_results_v3.md. Those are different procedures with different decision-theoretic meanings. The manuscript needs to choose one and report it consistently.

  • The paper mixes three different units of analysis without sufficient justification: all-pairs similarities (Table IV), per-signature best-match statistics (Table V), and per-accountant means (Tables VI-VIII). The 0.837 cutoff comes from pairwise intra/inter distributions paper_a_results_v3.md, paper_a_results_v3.md; the 0.945 / 8.10 cutoffs come from accountant means paper_a_results_v3.md, [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123]; then all are reused in per-signature/per-document classification paper_a_methodology_v3.md, paper_a_results_v3.md. That transfer across units is not justified.

  • The accountant-level interpretation is overstated. The papers strongest result is that the signature-level data are not well summarized by a 2-component model, while accountant-level means cluster under a BIC-selected 3-component GMM. That is not the same as proving that "individual signing behavior is close to discrete" paper_a_discussion_v3.md, especially when the selected accountant model itself has a very large middle component C2 with weight 0.51 paper_a_results_v3.md. The evidence supports "clustered accountant-level heterogeneity" more clearly than "discrete behavior."

  • There is also a model-selection tension that needs to be confronted directly. Section IV-E says BIC selects K*=3 paper_a_results_v3.md, but the paper then derives thresholds from a 2-component fit paper_a_results_v3.md. That can be acceptable for a coarse dichotomy, but it must be justified explicitly, especially because the selected 3-component model contains a substantive "middle band" rather than a negligible bridge.

  • The BD/McCrary treatment is too loose technically. Section III-I.2 scans histogram bins for a Z^- -> Z^+ transition paper_a_methodology_v3.md, but no uncertainty intervals, multiplicity correction, or sensitivity analysis are reported. I would present this as a heuristic discontinuity diagnostic unless the authors can supply a more rigorous justification.

  • The White (1982) justification is overstated. The manuscript implies quasi-MLE consistency validates the Beta-mixture/logit-GMM robustness argument paper_a_related_work_v3.md, paper_a_methodology_v3.md. That is much stronger than Whites result warrants.

3. Firm A Framing and Circular Validation

  • The "replication-dominated, not pure" framing is plausible and, in my view, the right wording. The problem is not the label; the problem is the way independence is claimed. The genuinely exogenous evidence is the interview evidence and pre-threshold visual inspection paper_a_methodology_v3.md, paper_a_methodology_v3.md, paper_a_methodology_v3.md. The 92.5% > 0.95 statistic and the GMM cluster assignments are in-sample outputs of the same analytical pipeline paper_a_methodology_v3.md, paper_a_results_v3.md. They are consistency checks, not independent validation.

  • Table IX is circular if presented as validation. The paper says these rates "play the role of calibration-validation metrics" paper_a_results_v3.md, but several thresholds are themselves anchored to Firm A, and the text explicitly says the final classes use thresholds "anchored to Firm A's empirical distribution" paper_a_methodology_v3.md, paper_a_methodology_v3.md. This is calibration, not external validation.

  • The manuscript should break this loop with cross-fitting or a true holdout. For example: calibrate on one subset of Firm A CPAs and evaluate on held-out Firm A CPAs; or calibrate without Firm A and then use Firm A only as an external check.

  • There is a numerical inconsistency that must be fixed. Section IV-E says 139 of 180 Firm A CPAs are in C1 and the remaining 32 are in C2 paper_a_results_v3.md, paper_a_results_v3.md. The same 139/32 split is repeated in the Introduction, Discussion, and Conclusion paper_a_introduction_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md. 139 + 32 != 180. The denominator is being mixed or nine CPAs are unaccounted for.

4. Weaknesses in the Pixel-Identity Validation

  • The negative anchor is too small for the claims being made. Table X uses only n = 35 negatives paper_a_results_v3.md, paper_a_results_v3.md. With 35 negatives, even FAR = 0.000 only implies a very weak upper bound; the exact 95% upper confidence limit is about 0.10. The dHash FARs are also extremely imprecise.

  • The positive anchor is an easy subset, not a representative positive set. Byte-identical nearest matches are the most obvious replications paper_a_methodology_v3.md, paper_a_results_v3.md. Perfect recall on this subset does not establish recall on near-duplicate but non-identical replications. The manuscript currently says this makes recall "conservative" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184]; that is not correct. It makes the recall estimate non-generalizable and likely optimistic for the broader positive class.

  • The EER language should be removed or corrected. Section III-K promises EER reporting paper_a_methodology_v3.md, and Section IV-G says "we derive Equal-Error-Rate points" paper_a_results_v3.md, but Table X reports metrics at selected thresholds, not the threshold where FAR = FRR.

  • This section would be materially stronger with a much larger negative set concentrated near the decision boundary, blocked inter-CPA comparisons, manually adjudicated borderline cases, and ideally a held-out temporal or firm-stratified validation set.

5. Clarity and Organization of Sections III / IV / V

  • Section IV-C is titled "Signature-Level Distribution Analysis," but Table IV is all-pairs analysis (41,352,824 intra-class pairs) rather than per-signature best-match statistics paper_a_results_v3.md. Section IV-D then switches to best-match distributions. This unit shift is confusing.

  • dHash_indep / mean independent minimum dHash is never defined clearly in Section III, yet it appears throughout Tables V, VII, IX, and X paper_a_methodology_v3.md, paper_a_results_v3.md. The reader cannot infer what "independent" means operationally.

  • Section III-L defines five document-level verdicts paper_a_methodology_v3.md, but the rule that maps one or two signature-level outcomes into one PDF-level label is missing. That omission becomes critical in Table XI, which reports 84,386 classified PDFs paper_a_results_v3.md even though Table III reports 85,042 documents with detections paper_a_results_v3.md.

  • Figure/table explanation is uneven. Fig. 4 is said to show the 3-component accountant clusters "alongside the marginal-density crossings of the two-component fit" paper_a_results_v3.md, but the text never explains how the 2-component crossings relate visually or substantively to the BIC-best 3-component model.

6. Specific Sentence-Level, Table-Level, and Numerical Fixes

  • Table IX contains an undefined threshold: cosine > 0.941 paper_a_results_v3.md. I could not find where 0.941 is introduced.

  • Table IX reports cosine > 0.945 ... 94.5%† and says it is interpolated from adjacent rates paper_a_results_v3.md, paper_a_results_v3.md. An empirical exceedance rate at a fixed threshold should be computed exactly from the data, not interpolated.

  • The manuscript alternates between 310 pixel-identical signature pairs in the Abstract paper_a_abstract_v3.md, 310 have a same-CPA nearest match in Results paper_a_results_v3.md, and 310 signatures satisfied this condition in Discussion paper_a_discussion_v3.md. Choose one unit and keep it consistent.

  • The statement establishing an upper bound on the VLM false-positive rate of 1.2% paper_a_methodology_v3.md is not justified, because YOLO misses can arise from YOLO false negatives, not only VLM false positives.

  • The manuscript repeatedly calls the three methods statistically independent paper_a_abstract_v3.md, paper_a_methodology_v3.md. They are different diagnostics applied to the same sample, not independent experiments. "Methodologically distinct" or "complementary" would be more accurate.

  • The dip test is described as a formal bimodality check paper_a_methodology_v3.md. It is a test against unimodality, not a direct test for bimodality.

  • Reference [24] is incomplete for IEEE style: Qwen2.5-VL Technical Report, Alibaba Group, 2025 paper_a_references_v3.md. It needs authors, title, venue/repository, and an accessible identifier.

In short: the manuscript has publishable potential, but only after the authors either supply the missing accountant-level analyses and stronger validation, or narrow the claims and present the current evidence as exploratory calibration plus large-scale descriptive analysis rather than definitive threshold validation. tokens used 85,743 1. Overall Assessment

Recommendation: major revision.

The paper tackles an interesting and important problem with an unusually strong corpus, but the current draft overstates what the evidence establishes. The main concern is not the extraction pipeline; it is that the thresholding and validation narrative is stronger than the analyses actually reported. In its present form, I would not recommend acceptance without substantial revision to the claims, reporting, and validation design.

2. Technical Soundness of the Threshold Framework and Accountant-Level GMM

  • The largest problem is that the manuscript repeatedly claims accountant-level convergence of three methods, but Section IV does not actually show that. Section III-I says the three methods "yield a sharp and consistent boundary" at the accountant level paper_a_methodology_v3.md, and the Introduction says the "convergent three-method thresholds are therefore statistically supported at the accountant level" paper_a_introduction_v3.md. Yet Section IV-E/Table VIII report only one accountant-level threshold source: the 2-component accountant GMM crossing at 0.945 / 8.10 paper_a_results_v3.md, paper_a_results_v3.md. I do not see accountant-level KDE antimodes, accountant-level BD/McCrary thresholds, or accountant-level Beta/logit-mixture thresholds. As written, the core "three-method convergence" claim is unsupported.

  • Method 1 is internally inconsistent. Section III-I.1 defines the KDE method as the local density minimum (antimode) of a single distribution paper_a_methodology_v3.md, but Section IV-C reports a KDE crossover where the two density functions intersect paper_a_results_v3.md. Those are different procedures with different decision-theoretic meanings. The manuscript needs to choose one and report it consistently.

  • The paper mixes three different units of analysis without sufficient justification: all-pairs similarities (Table IV), per-signature best-match statistics (Table V), and per-accountant means (Tables VI-VIII). The 0.837 cutoff comes from pairwise intra/inter distributions paper_a_results_v3.md, paper_a_results_v3.md; the 0.945 / 8.10 cutoffs come from accountant means paper_a_results_v3.md, [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123]; then all are reused in per-signature/per-document classification paper_a_methodology_v3.md, paper_a_results_v3.md. That transfer across units is not justified.

  • The accountant-level interpretation is overstated. The papers strongest result is that the signature-level data are not well summarized by a 2-component model, while accountant-level means cluster under a BIC-selected 3-component GMM. That is not the same as proving that "individual signing behavior is close to discrete" paper_a_discussion_v3.md, especially when the selected accountant model itself has a very large middle component C2 with weight 0.51 paper_a_results_v3.md. The evidence supports "clustered accountant-level heterogeneity" more clearly than "discrete behavior."

  • There is also a model-selection tension that needs to be confronted directly. Section IV-E says BIC selects K*=3 paper_a_results_v3.md, but the paper then derives thresholds from a 2-component fit paper_a_results_v3.md. That can be acceptable for a coarse dichotomy, but it must be justified explicitly, especially because the selected 3-component model contains a substantive "middle band" rather than a negligible bridge.

  • The BD/McCrary treatment is too loose technically. Section III-I.2 scans histogram bins for a Z^- -> Z^+ transition paper_a_methodology_v3.md, but no uncertainty intervals, multiplicity correction, or sensitivity analysis are reported. I would present this as a heuristic discontinuity diagnostic unless the authors can supply a more rigorous justification.

  • The White (1982) justification is overstated. The manuscript implies quasi-MLE consistency validates the Beta-mixture/logit-GMM robustness argument paper_a_related_work_v3.md, paper_a_methodology_v3.md. That is much stronger than Whites result warrants.

3. Firm A Framing and Circular Validation

  • The "replication-dominated, not pure" framing is plausible and, in my view, the right wording. The problem is not the label; the problem is the way independence is claimed. The genuinely exogenous evidence is the interview evidence and pre-threshold visual inspection paper_a_methodology_v3.md, paper_a_methodology_v3.md, paper_a_methodology_v3.md. The 92.5% > 0.95 statistic and the GMM cluster assignments are in-sample outputs of the same analytical pipeline paper_a_methodology_v3.md, paper_a_results_v3.md. They are consistency checks, not independent validation.

  • Table IX is circular if presented as validation. The paper says these rates "play the role of calibration-validation metrics" paper_a_results_v3.md, but several thresholds are themselves anchored to Firm A, and the text explicitly says the final classes use thresholds "anchored to Firm A's empirical distribution" paper_a_methodology_v3.md, paper_a_methodology_v3.md. This is calibration, not external validation.

  • The manuscript should break this loop with cross-fitting or a true holdout. For example: calibrate on one subset of Firm A CPAs and evaluate on held-out Firm A CPAs; or calibrate without Firm A and then use Firm A only as an external check.

  • There is a numerical inconsistency that must be fixed. Section IV-E says 139 of 180 Firm A CPAs are in C1 and the remaining 32 are in C2 paper_a_results_v3.md, paper_a_results_v3.md. The same 139/32 split is repeated in the Introduction, Discussion, and Conclusion paper_a_introduction_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md. 139 + 32 != 180. The denominator is being mixed or nine CPAs are unaccounted for.

4. Weaknesses in the Pixel-Identity Validation

  • The negative anchor is too small for the claims being made. Table X uses only n = 35 negatives paper_a_results_v3.md, paper_a_results_v3.md. With 35 negatives, even FAR = 0.000 only implies a very weak upper bound; the exact 95% upper confidence limit is about 0.10. The dHash FARs are also extremely imprecise.

  • The positive anchor is an easy subset, not a representative positive set. Byte-identical nearest matches are the most obvious replications paper_a_methodology_v3.md, paper_a_results_v3.md. Perfect recall on this subset does not establish recall on near-duplicate but non-identical replications. The manuscript currently says this makes recall "conservative" [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184]; that is not correct. It makes the recall estimate non-generalizable and likely optimistic for the broader positive class.

  • The EER language should be removed or corrected. Section III-K promises EER reporting paper_a_methodology_v3.md, and Section IV-G says "we derive Equal-Error-Rate points" paper_a_results_v3.md, but Table X reports metrics at selected thresholds, not the threshold where FAR = FRR.

  • This section would be materially stronger with a much larger negative set concentrated near the decision boundary, blocked inter-CPA comparisons, manually adjudicated borderline cases, and ideally a held-out temporal or firm-stratified validation set.

5. Clarity and Organization of Sections III / IV / V

  • Section IV-C is titled "Signature-Level Distribution Analysis," but Table IV is all-pairs analysis (41,352,824 intra-class pairs) rather than per-signature best-match statistics paper_a_results_v3.md. Section IV-D then switches to best-match distributions. This unit shift is confusing.

  • dHash_indep / mean independent minimum dHash is never defined clearly in Section III, yet it appears throughout Tables V, VII, IX, and X paper_a_methodology_v3.md, paper_a_results_v3.md. The reader cannot infer what "independent" means operationally.

  • Section III-L defines five document-level verdicts paper_a_methodology_v3.md, but the rule that maps one or two signature-level outcomes into one PDF-level label is missing. That omission becomes critical in Table XI, which reports 84,386 classified PDFs paper_a_results_v3.md even though Table III reports 85,042 documents with detections paper_a_results_v3.md.

  • Figure/table explanation is uneven. Fig. 4 is said to show the 3-component accountant clusters "alongside the marginal-density crossings of the two-component fit" paper_a_results_v3.md, but the text never explains how the 2-component crossings relate visually or substantively to the BIC-best 3-component model.

6. Specific Sentence-Level, Table-Level, and Numerical Fixes

  • Table IX contains an undefined threshold: cosine > 0.941 paper_a_results_v3.md. I could not find where 0.941 is introduced.

  • Table IX reports cosine > 0.945 ... 94.5%† and says it is interpolated from adjacent rates paper_a_results_v3.md, paper_a_results_v3.md. An empirical exceedance rate at a fixed threshold should be computed exactly from the data, not interpolated.

  • The manuscript alternates between 310 pixel-identical signature pairs in the Abstract paper_a_abstract_v3.md, 310 have a same-CPA nearest match in Results paper_a_results_v3.md, and 310 signatures satisfied this condition in Discussion paper_a_discussion_v3.md. Choose one unit and keep it consistent.

  • The statement establishing an upper bound on the VLM false-positive rate of 1.2% paper_a_methodology_v3.md is not justified, because YOLO misses can arise from YOLO false negatives, not only VLM false positives.

  • The manuscript repeatedly calls the three methods statistically independent paper_a_abstract_v3.md, paper_a_methodology_v3.md. They are different diagnostics applied to the same sample, not independent experiments. "Methodologically distinct" or "complementary" would be more accurate.

  • The dip test is described as a formal bimodality check paper_a_methodology_v3.md. It is a test against unimodality, not a direct test for bimodality.

  • Reference [24] is incomplete for IEEE style: Qwen2.5-VL Technical Report, Alibaba Group, 2025 paper_a_references_v3.md. It needs authors, title, venue/repository, and an accessible identifier.

In short: the manuscript has publishable potential, but only after the authors either supply the missing accountant-level analyses and stronger validation, or narrow the claims and present the current evidence as exploratory calibration plus large-scale descriptive analysis rather than definitive threshold validation.