0ff1845b22
Three blockers from codex gpt-5.4 round-3 review (codex_review_gpt54_v3_3.md):
B1 Classifier vs three-method threshold mismatch
- Methodology III-L rewritten to make explicit that the per-signature
classifier and the accountant-level three-method convergence operate
at different units (signature vs accountant) and are complementary
rather than substitutable.
- Add Results IV-G.3 + Table XII operational-threshold sensitivity:
cos>0.95 vs cos>0.945 shifts dual-rule capture by 1.19 pp on whole
Firm A; ~5% of signatures flip at the Uncertain/Moderate boundary.
B2 Held-out validation false "within Wilson CI" claim
- Script 24 recomputes both calibration-fold and held-out-fold rates
with Wilson 95% CIs and a two-proportion z-test on each rule.
- Table XI replaced with the proper fold-vs-fold comparison; prose
in Results IV-G.2 and Discussion V-C corrected: extreme rules agree
across folds (p>0.7); operational rules in the 85-95% band differ
by 1-5 pp due to within-Firm-A heterogeneity (random 30% sample
contained more high-replication C1 accountants), not generalization
failure.
B3 Interview evidence reframed as practitioner knowledge
- The Firm A "interviews" referenced throughout v3.3 are private,
informal professional conversations, not structured research
interviews. Reframed accordingly: all "interview*" references in
abstract / intro / methodology / results / discussion / conclusion
are replaced with "domain knowledge / industry-practice knowledge".
- This avoids overclaiming methodological formality and removes the
human-subjects research framing that triggered the ethics-statement
requirement.
- Section III-H four-pillar Firm A validation now stands on visual
inspection, signature-level statistics, accountant-level GMM, and
the three Section IV-H analyses, with practitioner knowledge as
background context only.
- New Section III-M ("Data Source and Firm Anonymization") covers
MOPS public-data provenance, Firm A/B/C/D pseudonymization, and
conflict-of-interest declaration.
Add signature_analysis/24_validation_recalibration.py for the recomputed
calib-vs-held-out z-tests and the classifier sensitivity analysis;
output in reports/validation_recalibration/.
Pending (not in this commit): abstract length (368 -> 250 words),
Impact Statement removal, BD/McCrary sensitivity reporting, full
reproducibility appendix, references cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
110 lines
15 KiB
Markdown
110 lines
15 KiB
Markdown
# V. Discussion
|
||
|
||
## A. Non-Hand-Signing Detection as a Distinct Problem
|
||
|
||
Our results highlight the importance of distinguishing *non-hand-signing detection* from the well-studied *signature forgery detection* problem.
|
||
In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature.
|
||
In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).
|
||
|
||
This distinction has direct methodological consequences.
|
||
Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures.
|
||
Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.
|
||
The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.
|
||
|
||
## B. Continuous-Quality Spectrum vs. Clustered Accountant-Level Heterogeneity
|
||
|
||
The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method framework and the Hartigan dip test (Sections IV-D and IV-E).
|
||
|
||
At the per-signature level, the distribution of best-match cosine similarity is *not* cleanly bimodal.
|
||
Firm A's signature-level cosine is formally unimodal (dip test $p = 0.17$) with a long left tail.
|
||
The all-CPA signature-level cosine rejects unimodality ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
|
||
The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary between two mechanisms.
|
||
Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class.
|
||
Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level.
|
||
|
||
At the per-accountant aggregate level the picture partly reverses.
|
||
The distribution of per-accountant mean cosine (and mean dHash) rejects unimodality, a BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms, and the three 1D threshold methods applied at the accountant level produce mutually consistent estimates (KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, logit-GMM-2 crossing $= 0.976$).
|
||
The BD/McCrary test, however, does not produce a significant transition at the accountant level either, in contrast to the signature level.
|
||
This pattern is consistent with a clustered *but smoothly mixed* accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the transitions between them are gradual rather than discrete at the bin resolution BD/McCrary requires.
|
||
|
||
The substantive interpretation we take from this evidence is therefore narrower than a "discrete-behaviour" claim: *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behaviour* is clustered (three recognizable groups) but not sharply discrete.
|
||
The accountant-level mixture is a useful classifier of firm-and-practitioner-level signing regimes; individual behaviour may still transition or mix over time within a practitioner, and our cross-sectional analysis does not rule this out.
|
||
Methodologically, the implication is that the three 1D methods are meaningfully applied at the accountant level for threshold estimation, while the BD/McCrary non-transition at the same level is itself diagnostic of smoothness rather than a failure of the method.
|
||
|
||
## C. Firm A as a Replication-Dominated, Not Pure, Population
|
||
|
||
A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class.
|
||
Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.
|
||
|
||
Three convergent strands of evidence support the replication-dominated framing.
|
||
First, the visual-inspection evidence: randomly sampled Firm A reports exhibit pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
|
||
Second, the signature-level statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
|
||
Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures ($\geq 10$) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---directly quantifying the within-firm minority of hand-signers.
|
||
Nine additional Firm A CPAs are excluded from the GMM for having fewer than 10 signatures, so we cannot place them in a cluster from the cross-sectional analysis alone.
|
||
The held-out Firm A 70/30 validation (Section IV-G.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules (extreme rules are statistically indistinguishable; operational rules in the 85–95% band differ between folds by 1–5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure).
|
||
The accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to fold-level sampling variance.
|
||
|
||
The replication-dominated framing is internally coherent with all three pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise.
|
||
We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.
|
||
|
||
## D. The Style-Replication Gap
|
||
|
||
Within the 71,656 documents exceeding cosine $0.95$, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
|
||
A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
|
||
|
||
The 7.2% classified as "high style consistency" (cosine $> 0.95$ but dHash $> 15$) are particularly informative.
|
||
Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions.
|
||
Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting.
|
||
Some may use signing pads or templates that further constrain variability without constituting image-level reproduction.
|
||
The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence.
|
||
|
||
## E. Value of a Replication-Dominated Calibration Group
|
||
|
||
The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
|
||
In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
|
||
Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset.
|
||
|
||
This calibration strategy has broader applicability beyond signature analysis.
|
||
Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives.
|
||
The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates the within-firm heterogeneity quantified by the accountant-level mixture, and yields classification rates that are internally consistent with the data.
|
||
|
||
## F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation
|
||
|
||
A further methodological contribution is the combination of byte-level pixel identity as an annotation-free *conservative* gold positive and a large random-inter-CPA negative anchor.
|
||
Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review.
|
||
In our corpus 310 signatures satisfied this condition.
|
||
We emphasize that byte-identical pairs are a *subset* of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways).
|
||
Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.
|
||
|
||
Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), which is a substantive improvement over the low-similarity same-CPA negative ($n = 35$) we originally considered.
|
||
The combination is a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself, provided that its generalization limits are acknowledged: FAR is informative, whereas recall is valid only for the conservative subset.
|
||
|
||
## G. Limitations
|
||
|
||
Several limitations should be acknowledged.
|
||
|
||
First, comprehensive per-document ground truth labels are not available.
|
||
The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class.
|
||
The low-similarity same-CPA anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70; we use the $\sim$50,000-pair inter-CPA negative anchor as the primary negative reference, which yields tight Wilson 95% CIs on FAR (Table X), but it too does not exhaust the set of true negatives (in particular, same-CPA hand-signed pairs with moderate cosine similarity are not sampled).
|
||
A manual-adjudication study concentrated at the decision boundary---for example 100--300 auditor-years stratified by cosine band---would further strengthen the recall estimate against the full positive class.
|
||
|
||
Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning.
|
||
While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.
|
||
|
||
Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions.
|
||
In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity.
|
||
This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified.
|
||
|
||
Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements.
|
||
While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions.
|
||
|
||
Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted non-hand-signing later).
|
||
Extending the accountant-level analysis to auditor-year units is a natural next step.
|
||
|
||
Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, and the test produces no significant transition at all at the accountant level.
|
||
In our application, therefore, BD/McCrary contributes diagnostic information about local density-smoothness rather than an independent accountant-level threshold estimate; that role is played by the KDE antimode and the two mixture-based estimators.
|
||
The BD/McCrary results remain informative as a robustness check---their non-transition at the accountant level is consistent with the dip-test and Beta-mixture evidence that accountant-level clustering is smooth rather than sharply discontinuous.
|
||
|
||
Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
|
||
Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.
|