Resolved Codex review (gpt-5.5 xhigh) findings against b6913d2.
BLOCKERS:
- Appendix B reference mismatch: rewrote all main-text "Appendix B" references
to "supplementary materials" since Appendix B is now a redirect stub. Affected
the SSIM design-argument pointer, threshold provenance, byte-level
decomposition, MC band capture-rate, and backbone-ablation table references
across §III-F / §III-H.1 / §III-H.2 / §III-K / §III-L.4 / §III-M / §IV-F /
§IV-J / §IV-K / §IV-L / §V-C / §V-H.
- Table rendering: un-commented Tables I-IV (Dataset Summary, YOLO Detection,
Extraction Results, Cosine Distribution Statistics) which were inside HTML
comment blocks and would not have rendered in the submission.
- Table numbering out of order: Table XIX appeared before Tables XVI-XVIII.
Renumbered XIX -> XVI (document-level worst-case counts), XVI -> XVII (Firm x
K=3 cross-tab), XVII -> XVIII (K=3 component comparison), XVIII -> XIX
(Spearman correlation). Cross-references updated in §IV-J / §IV-K and §V-C.
- Table V mis-citation: §IV-C said "KDE crossover ... (Table V)" but Table V is
the dip test. Dropped the (Table V) tag; crossover is a textual finding.
- Submission cleanup: wrapped the archived Impact Statement section heading and
body inside the existing HTML comment (was rendering). Funding placeholder
wrapped in HTML comment with a TO-DO note (won't render but is preserved as
reminder).
MAJORS:
- Line 1077 numerical conflation: rewrote the §V-C / §III-L.4 paragraph that
labelled Firm A's per-document HC+MC inter-CPA proxy ICCR of 0.6201 as a rate
"on real same-CPA pools." 0.6201 is a counterfactual proxy under inter-CPA
candidate-pool replacement, not the observed rate. Added an explicit disambiguation:
the corresponding observed rate from Table XVI (formerly XIX) is 97.5%
HC+MC for Firm A; the proxy and observed rates measure different quantities.
- Residual "validation" language softened: "Dual-descriptor verification" ->
"Dual-descriptor similarity"; "we validate the backbone choice" -> "we
support the backbone choice"; "pixel-identity validation" -> "pixel-identity
positive-anchor check"; "## M. Validation Strategy and Limitations under
Unsupervised Setting" -> "## M. Unsupervised Diagnostic Strategy and Limits".
- "Specificity behaviour" overclaim: "characterises the cosine threshold's
specificity behaviour" -> "specificity-proxy behaviour" (methodology §III-L.0
and discussion §V-F).
- "Prior published / prior calibration" ambiguity: replaced "prior published
per-comparison rate" with "the corpus-wide rate reported in §IV-I"; replaced
"(prior published operating point)" with "(alternative operating point from
supplementary calibration evidence)" in Table XXI; replaced "prior reporting
and the existing literature" with "the existing literature and the
supplementary calibration evidence."
MINORS:
- Line 116 Bayes-optimal qualifier: "the local density minimum ... is the
Bayes-optimal decision boundary under equal priors" -> "In idealized
two-class mixture settings with equal priors and equal misclassification
costs, the local density minimum ... coincides with the Bayes-optimal
decision boundary."
- Stale section refs: §V-G for the fine-tuning caveat retargeted to §V-H
Engineering-level caveats (where it lives after the §V-H reorganisation);
§III-L for the worst-case rule retargeted to §III-H.1; "Section IV-D.2"
(nonexistent) retargeted to "Section IV-D Table VI."
- Abstract / Introduction "after pool-size adjustment": separated the
document-level D2 proxy ICCR claim from the per-signature logistic regression
claim. Now: "Per-document D2 inter-CPA proxy ICCRs differ by an order of
magnitude across firms ... a per-signature logistic regression confirms the
firm gap persists after pool-size control."
NIT:
- Related Work HTML comment "(see paper_a_references_v3.md for full list)"
-> "(full list in the References section)"; removes the version-coded
filename reference from the source.
Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1312 lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
II. Related Work
A. Offline Signature Verification
Offline signature verification---determining whether a static signature image is genuine or forged---has been studied extensively using deep learning. Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant. Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work. Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining. Kao and Wen [5] addressed offline verification and forgery detection using only a single known genuine signature per writer with an explainable deep-learning approach. More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results. Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives. Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer. Hafemann et al. [16] further addressed the practical challenge of adapting to new users through meta-learning, reducing the enrollment burden for signature verification systems.
A common thread in this literature is the assumption that the primary threat is identity fraud: a forger attempting to produce a convincing imitation of another person's signature. Our work addresses a fundamentally different problem---detecting whether the legitimate signer's stored signature image has been reproduced across many documents---which requires analyzing the upper tail of the intra-signer similarity distribution rather than modeling inter-signer discriminability.
Brimoh and Olisah [8] are closest in spirit to our work in using reference evidence to discipline threshold choice. Their setting, however, uses standard verification benchmarks with known genuine references, whereas our archival setting lacks signature-level labels and therefore characterises a fixed deployed screening rule through inter-CPA coincidence-rate anchors.
B. Document Forensics and Copy Detection
Image forensics encompasses a broad range of techniques for detecting manipulated visual content [17], with recent surveys highlighting the growing role of deep learning in forgery detection [18]. Copy-move forgery detection (CMFD) identifies duplicated regions within or across images, typically targeting manipulated photographs [11]. Abramova and Böhme [10] adapted block-based CMFD to scanned text documents, noting that standard methods perform poorly in this domain because legitimate character repetitions produce high similarity scores that confound duplicate detection.
Woodruff et al. [9] developed the work most closely related to ours: a fully automated pipeline for extracting and analyzing signatures from corporate filings in the context of anti-money-laundering investigations. Their system uses connected component analysis for signature detection, GANs for noise removal, and Siamese networks for author clustering. While their pipeline shares our goal of large-scale automated signature analysis on real regulatory documents, their objective---grouping signatures by authorship---differs fundamentally from ours, which is detecting image-level reproduction within a single author's signatures across documents.
In the domain of image copy detection, Pizzi et al. [13] proposed SSCD, a self-supervised descriptor using ResNet-50 with contrastive learning for large-scale copy detection on natural images. Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach.
C. Perceptual Hashing
Perceptual hashing algorithms generate compact fingerprints that are robust to minor image transformations while remaining sensitive to substantive content changes [19]. Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process.
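To make the mechanism concrete, the following minimal sketch implements a difference hash (dHash) and its normalized Hamming similarity. The 8-row-by-9-column grayscale grid is assumed to be produced by a prior resize step, which is omitted here; real pipelines and hash sizes vary.

```python
def dhash_bits(gray):
    """Difference hash: compare horizontally adjacent pixels.

    `gray` is an 8x9 grid of grayscale values (8 rows, 9 columns);
    production code resizes the input image to this shape first.
    Returns a 64-element tuple of 0/1 bits.
    """
    return tuple(
        1 if row[c] > row[c + 1] else 0
        for row in gray
        for c in range(len(row) - 1)
    )


def dhash_similarity(bits_a, bits_b):
    """Normalized Hamming similarity in [0, 1]: 1.0 means identical hashes."""
    matches = sum(a == b for a, b in zip(bits_a, bits_b))
    return matches / len(bits_a)
```

Because the bits encode only the sign of adjacent-pixel gradients, the hash is stable under the brightness and contrast shifts typical of rescanning, which is what makes it suitable for near-duplicate screening.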
Jakhar and Borah [12] demonstrated that combining perceptual hashing with deep learning features significantly outperforms either approach alone for near-duplicate image detection, achieving AUROC of 0.99 on standard benchmarks. Their two-stage architecture---pHash for fast structural comparison followed by deep features for semantic verification---provides methodological precedent for our dual-descriptor approach, though applied to natural images rather than document signatures.
Our work differs from prior perceptual-hashing studies in its application context and in the specific challenge it addresses: distinguishing legitimate high visual consistency (a careful signer producing similar-looking signatures) from image-level reproduction in scanned financial documents.
D. Deep Feature Extraction for Signature Analysis
Several studies have explored pre-trained CNN features for signature comparison without metric learning or Siamese architectures. Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach. Tsourounis et al. [21] demonstrated successful transfer from handwritten text recognition to signature verification, showing that CNN features trained on related but distinct handwriting tasks generalize effectively to signature comparison. Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach.
Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach. These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical.
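Once features are L2-normalized, cosine similarity reduces to a dot product, so all pairwise comparisons become a single matrix multiply. A minimal sketch, assuming the feature vectors have already been extracted by the backbone:

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm (eps guards zero vectors)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_similarity_matrix(feats):
    """Pairwise cosine similarities for a stack of feature vectors.

    After L2 normalization, cosine similarity is a dot product, so
    the full pairwise matrix is z @ z.T.
    """
    z = l2_normalize(np.asarray(feats, dtype=np.float64))
    return z @ z.T
```

This is why the approach scales to archival corpora: the expensive step (feature extraction) is done once per image, and comparison cost is a dense matrix product.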
E. Statistical Methods for Threshold Characterisation and Calibration
Our threshold-characterisation and calibration framework combines three families of methods developed in statistics and accounting-econometrics.
Non-parametric density estimation. Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions. In idealized two-class mixture settings with equal priors and equal misclassification costs, the local density minimum (antimode) between the two modes coincides with the Bayes-optimal decision boundary. The statistical validity of the unimodality-vs-multimodality dichotomy can be tested via the Hartigan & Hartigan dip test [37], which tests the null of unimodality; we use rejection of this null as evidence consistent with (though not a direct test for) bimodality.
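The antimode search can be sketched as follows, using a hand-rolled Gaussian KDE with Scott's-rule bandwidth. The grid size and bandwidth rule are illustrative choices, and a production pipeline would use a vetted KDE implementation rather than this sketch:

```python
import numpy as np


def kde_antimode(samples, n_grid=512):
    """Gaussian KDE on a grid; return the local density minimum between
    the two tallest modes, or None if the estimate is unimodal.

    Bandwidth follows Scott's rule (sigma * n**(-1/5)).
    """
    x = np.asarray(samples, dtype=float)
    h = x.std(ddof=1) * len(x) ** (-1 / 5)
    grid = np.linspace(x.min(), x.max(), n_grid)
    # sum of Gaussian kernels centred on each sample, evaluated on the grid
    dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
    dens /= len(x) * h * np.sqrt(2 * np.pi)
    # interior local maxima of the density estimate
    peaks = [i for i in range(1, n_grid - 1)
             if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]
    if len(peaks) < 2:
        return None
    lo, hi = sorted(sorted(peaks, key=lambda i: dens[i])[-2:])
    return grid[lo + int(np.argmin(dens[lo:hi + 1]))]
```

The `None` branch matters in practice: when the dip test fails to reject unimodality, no antimode-based boundary is defensible, which is the diagnostic role described above.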
Discontinuity tests on empirical distributions.
Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions.
Under the null that the distribution is generated by a single smooth process, the expected count in any histogram bin equals the average of its two neighbours, and the standardized deviation from this expectation is approximately N(0,1).
The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis.
The BD/McCrary pairing provides a local-density-discontinuity diagnostic that is informative about distributional smoothness under minimal assumptions; we use it in that diagnostic role (rather than as a threshold estimator) because its transitions in our corpus are bin-width-sensitive at the signature level and rarely significant at the accountant level (Appendix A).
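The BD standardized difference can be sketched as follows. The variance term below is the common approximation treating bin memberships as multinomial; it illustrates the statistic's form rather than reproducing the cited implementations exactly:

```python
import math


def bd_standardized_difference(counts, i):
    """Burgstahler-Dichev style smoothness statistic for histogram bin i.

    Compares the observed count in bin i to the average of its two
    neighbours; under a smooth null the statistic is approximately N(0, 1).
    Requires 0 < i < len(counts) - 1.
    """
    n = sum(counts)
    p = [c / n for c in counts]
    expected = (counts[i - 1] + counts[i + 1]) / 2
    var = (n * p[i] * (1 - p[i])
           + 0.25 * n * (p[i - 1] + p[i + 1]) * (1 - p[i - 1] - p[i + 1]))
    return (counts[i] - expected) / math.sqrt(var)
```

A bin that sits exactly on the neighbour average scores 0; a pronounced spike or trough yields a large |z|, which is what "smoothness violation" means operationally.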
Finite mixture models. When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters. For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation. Under mild regularity conditions, White's quasi-MLE result [41] supports interpreting maximum-likelihood estimates under a mis-specified parametric family as consistent estimators of the pseudo-true parameter that minimizes the Kullback-Leibler divergence to the data-generating distribution within that family; we use this result to justify the Beta-mixture fit as a principled approximation rather than as a guarantee that the true distribution is Beta.
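An illustrative EM sketch for a two-component Beta mixture follows. The M-step uses weighted method-of-moments updates in place of the exact weighted Beta MLE, which has no closed form, so this approximates rather than reproduces the estimator described above; initialization and iteration counts are arbitrary choices:

```python
import numpy as np
from math import lgamma


def _beta_logpdf(x, a, b):
    """Log density of Beta(a, b) on (0, 1)."""
    ln_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    return (a - 1) * np.log(x) + (b - 1) * np.log1p(-x) - ln_beta


def _moment_match(x, w):
    """Weighted method-of-moments Beta parameters (a, b)."""
    m = np.average(x, weights=w)
    v = np.average((x - m) ** 2, weights=w)
    v = min(v, m * (1 - m) * 0.999)  # keep a, b strictly positive
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def fit_beta_mixture(x, n_iter=100):
    """EM for a two-component Beta mixture on (0, 1); returns (weights, params)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    # crude initialization: split at the median
    params = [_moment_match(x, (x <= med).astype(float) + 1e-3),
              _moment_match(x, (x > med).astype(float) + 1e-3)]
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        logp = np.stack([np.log(pi[k]) + _beta_logpdf(x, *params[k])
                         for k in range(2)])
        logp -= logp.max(axis=0)      # numerical stabilization
        resp = np.exp(logp)
        resp /= resp.sum(axis=0)      # E-step: posterior responsibilities
        pi = resp.mean(axis=1)        # M-step: mixing weights
        params = [_moment_match(x, resp[k]) for k in range(2)]
    return pi, params
```

The quasi-MLE reading above applies here as well: even if neither latent component is truly Beta, the fitted parameters target the closest Beta mixture in KL divergence.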
The present study uses these tools diagnostically: first to test whether the descriptor distribution supports a natural operating boundary, and then, when that support fails under composition decomposition, to motivate anchor-based ICCR calibration of a fixed deployed rule.
Cross-validation in a small-cluster scope. Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the firm (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a composition-sensitivity band on the candidate mixture boundary, not as a sufficiency claim for the deployed five-way operational classifier (§III-H.1; calibrated separately in §III-L). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier.
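The firm-level hold-out scheme can be sketched as a simple split generator; the `records` layout here is an illustrative schema, not the paper's data structure:

```python
def leave_one_group_out(records):
    """Leave-one-firm-out splits: each firm's rows are held out exactly once.

    `records` is an iterable of (firm_id, observation) pairs. Yields
    (held_out_firm, train_observations, test_observations) tuples.
    """
    records = list(records)
    firms = sorted({firm for firm, _ in records})
    for held_out in firms:
        train = [obs for firm, obs in records if firm != held_out]
        test = [obs for firm, obs in records if firm == held_out]
        yield held_out, train, test
```

Holding out the firm rather than the CPA or signature is what makes the resulting drift a cross-firm composition-sensitivity measure: every signature from the held-out firm is excluded from the fit simultaneously.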