pdf_signature_extraction/paper/paper_a_discussion_v3.md
gbanyan 53125d11d9 Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul
Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1):

Must-fix items (6/6):
- §III-F SSIM/pixel rejection rewritten from first principles (design-level
  argument from luminance/contrast/structure local-window product, not the
  prior empirical 0.70 result)
- Table VI restructured by population × method; added missing Firm A
  logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary
  marked bin-unstable (Appendix A)
- Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted
  from "operational dual" to "calibration-fold-adjacent reference"; the
  actual classifier rule cos>0.95 AND dH≤15 = 92.46% added throughout
- New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A
  on top); script 30_yearly_big4_comparison.py
- Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets
- §III-K reframed P7.5 from "round-number lower-tail boundary" to operating
  point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds:
  0.9407 / 0.945 / 0.95 / 0.977 / 0.985)

Nice-to-have items (3/3):
- Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985)
- Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to
  table notes; cut "invite reviewer skepticism" and "non-load-bearing"

Codex 3-pass verification cleanup:
- Stale 0.973/0.977/0.979 references unified on canonical 0.977 (Firm A
  Beta-2 forced-fit crossing from beta_mixture_results.json)
- dHash≤8 wording corrected to P95-adjacent (P95 = 9, ≤8 is the integer
  immediately below) instead of misleading "rounded down"
- Table XII-B prose corrected: per-segment qualification of "non-Firm-A
  capture falls faster" (true on 0.95→0.977 segment but contracts on
  0.977→0.985 segment); arithmetic now from exact counts

Within-year analyses removed:
- Within-year ranking robustness check (Class A) was added in nice-to-have
  pass but contradicts v3.14 A2-removal stance; removed from §IV-G.2 + the
  Appendix B provenance row
- Within-CPA future-work disclosures (Class B) removed from Discussion
  limitation #5 and Conclusion future-work paragraph; subsequent limitations
  renumbered Sixth → Fifth, Seventh → Sixth

DOCX rendering pipeline overhaul (paper/export_v3.py):

Critical fix - every v3 DOCX since v3.0 was shipping WITHOUT TABLES:
strip_comments() was wholesale-deleting HTML comments, but every numerical
table is wrapped in <!-- TABLE X: ... -->, so the table body was deleted
alongside the wrapper. Now unwraps TABLE comments (emit synthetic
__TABLE_CAPTION__: marker + table body) while still stripping non-TABLE
editorial comments. Result: 19 tables now render in the DOCX.

Other rendering fixes:
- LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤≥,
  ×·≈, →↔⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,})
- Math-context-scoped sub/superscript via PUA sentinels:
  no more underscore-eating in identifiers like signature_analysis
- Display equations rendered via matplotlib mathtext to PNG (3 equations:
  cosine sim, mixture crossing, BD/McCrary Z statistic), embedded as
  numbered equation blocks (1), (2), (3); content-addressed cache at
  paper/equations/ (gitignored, regenerable)
- Manual numbered/bulleted list rendering with hanging indent (replaces
  python-docx style="List Number" which silently drops the number prefix
  when no numbering definition is bound)
- Markdown blockquote (> ...) defensively stripped
- Pandoc footnote ([^name]) markers no longer leak (inlined at source)
- Heading text cleaned of LaTeX residue + PUA sentinels
- File paths in body text (signature_analysis/X.py, reports/Y.json)
  trimmed to "(reproduction artifact in Appendix B)" pointers

New leak linter: paper/lint_paper_v3.py - two-pass markdown source +
rendered DOCX leak detector; auto-runs at end of export_v3.py.

Script changes:
- 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to canonical FAR
  threshold list so Table XII-B is reproducible from persisted JSON
- 30_yearly_big4_comparison.py: NEW; generates Fig. 4 + per-firm yearly
  data (writes to reports/figures/ and reports/firm_yearly_comparison/)
- 31_within_year_ranking_robustness.py: NEW; supports the within-year
  robustness check (no longer cited in paper but kept as repo-internal
  due-diligence artifact)

Partner handoff DOCX shipped to
~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB:
19 tables + 4 figures + 3 equation images).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:44:49 +08:00


V. Discussion

A. Non-Hand-Signing Detection as a Distinct Problem

Our results highlight the importance of distinguishing non-hand-signing detection from the well-studied signature forgery detection problem. In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature. In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).

This distinction has direct methodological consequences. Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures. Non-hand-signing detection, by contrast, requires sensitivity to the upper tail of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous. The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.
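As a concrete illustration, the dual-descriptor rule reduces to a conjunction of two scores. The sketch below assumes feature extraction happens upstream (ResNet-50 embeddings for the cosine side, 64-bit dHash integers for the structural side); function names are ours, and the cuts shown are the paper's operational values (cosine > 0.95, dHash Hamming distance ≤ 15):

```python
import math

def cosine_sim(u, v):
    # cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dhash_distance(h1, h2):
    # Hamming distance between two 64-bit dHash integers
    return bin(h1 ^ h2).count("1")

def dual_descriptor_flag(cos, dh, cos_cut=0.95, dh_cut=15):
    # semantic similarity above the cosine cut AND structural
    # similarity within the dHash Hamming budget
    return cos > cos_cut and dh <= dh_cut
```

A cosine-only rule would collapse the structurally-corroborated and style-consistent populations discussed below; the conjunction keeps them separate.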

B. Per-Signature Similarity is a Continuous Quality Spectrum

A central empirical finding of this study is that per-signature similarity does not form a clean two-mechanism mixture (Section IV-D). Firm A's signature-level cosine is formally unimodal (Hartigan dip test p = 0.17) with a long left tail. The all-CPA signature-level cosine rejects unimodality (p < 0.001), reflecting the heterogeneity of signing practices across firms, but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit ($\Delta\text{BIC} = 381$ for Firm A; $10{,}175$ for the full sample), and the forced 2-component Beta crossing and its logit-GMM robustness counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). The BD/McCrary discontinuity test locates its transition at cosine 0.985---inside the non-hand-signed mode rather than at a boundary between two mechanisms---and the transition is not bin-width-stable (Appendix A).
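For reference, the model-selection criterion behind the $\Delta$BIC comparison is the standard one; a minimal sketch follows, where the log-likelihoods and free-parameter counts would come from the fitted two- and three-component mixtures (the values here are placeholders, not the paper's fits):

```python
import math

def bic(log_likelihood, n_params, n_obs):
    # Bayesian information criterion: k * ln(n) - 2 * ln(L-hat);
    # lower is better
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def delta_bic(ll_k2, params_k2, ll_k3, params_k3, n_obs):
    # positive value => the three-component fit is preferred
    return bic(ll_k2, params_k2, n_obs) - bic(ll_k3, params_k3, n_obs)
```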

Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class cleanly separated from hand-signing. Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level.

The methodological implication is that the operational classifier's cosine cut should not be derived from a mixture-fit crossing. We accordingly anchor the operational cosine cut on the whole-sample Firm A P7.5 percentile (Section III-K), and treat the signature-level threshold-estimator outputs (KDE antimode, Beta and logit-Gaussian crossings) as descriptive characterisation of the similarity distribution rather than as the source of operational thresholds. The BD/McCrary procedure plays a density-smoothness diagnostic role in this framing rather than that of an independent threshold estimator.
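The percentile anchoring itself is deliberately simple. A minimal sketch (pure-Python linear-interpolation percentile; `firm_a_cosines` is a hypothetical list of Firm A per-signature best-match cosines):

```python
def percentile(xs, p):
    # linear-interpolation percentile, p in [0, 100]
    # (matches numpy.percentile's default "linear" method)
    s = sorted(xs)
    k = (len(s) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def operational_cosine_cut(firm_a_cosines, p=7.5):
    # anchor the operational cut on the Firm A P7.5 percentile,
    # rather than on any mixture-fit crossing
    return percentile(firm_a_cosines, p)
```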

This continuous-spectrum finding also has substantive implications for downstream interpretation. Because pixel-level output quality varies continuously, signature-level rates (such as the 92.5% / 7.5% Firm A split) reflect the share of signatures whose similarity falls above or below a chosen threshold rather than the share that came from a "non-hand-signing mechanism" versus a "hand-signing mechanism." We accordingly report all rates as signature-level quantities and abstain from partner-level frequency claims (Section III-G).

C. Firm A as a Replication-Dominated, Not Pure, Population

A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class. Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.

Two convergent strands of evidence support the replication-dominated framing. First, the byte-level pair evidence: 145 Firm A signatures (from 50 distinct partners of 180 registered) have a byte-identical same-CPA match in a different audit report, with 35 of these matches spanning different fiscal years. Independent hand-signing cannot produce byte-identical images across distinct reports, so these pairs directly establish image reuse within Firm A as a concrete, threshold-free phenomenon, and the 50/180 partner spread shows that replication is widespread rather than confined to a handful of CPAs. Second, the signature-level distributional evidence: Firm A's per-signature cosine distribution is unimodal long-tail (Hartigan dip test p = 0.17) rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail. The unimodal-long-tail shape, not the precise 92.5 / 7.5 split, is the structural evidence: it is consistent with a dominant high-similarity regime plus residual within-firm heterogeneity, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).

Two additional checks, reported in Section IV-G, are robust to threshold choice and complement the two primary strands. First, the held-out Firm A 70/30 validation (Section IV-F.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules: extreme rules are statistically indistinguishable, and operational rules in the 85--95% band differ between folds by 1--5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure. Second, the threshold-independent partner-ranking analysis (Section IV-G.2) shows that Firm A auditor-years occupy 95.9% of the top decile of similarity-ranked auditor-years against a 27.8% baseline share---a 3.5$\times$ concentration ratio that uses only ordinal ranking and is independent of any absolute cutoff.
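The ordinal-only character of the partner-ranking check can be made explicit in a few lines. The sketch below (names are ours) computes the Firm A share of the top decile of similarity-ranked auditor-years, the baseline share, and their ratio, using only rank order and no absolute threshold:

```python
def top_decile_concentration(scores, is_firm_a):
    # scores: per-auditor-year similarity statistic (higher = more similar)
    # is_firm_a: parallel booleans marking Firm A auditor-years
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, len(order) // 10)          # top decile
    top_share = sum(1 for i in order[:k] if is_firm_a[i]) / k
    base_share = sum(is_firm_a) / len(is_firm_a)
    return top_share, base_share, top_share / base_share
```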

The replication-dominated framing is internally coherent with both pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise. We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.

D. The Style-Replication Gap

The dHash descriptor partitions the 71,656 documents exceeding cosine 0.95 into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.6%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance. A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
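A difference hash can be sketched compactly; resizing the cropped signature to a small grayscale grid with one extra column happens upstream. The band cuts in `structural_band` are illustrative placeholders for the partition just described, not a restatement of the paper's operational definitions:

```python
def dhash_bits(gray):
    # difference hash: gray is a row-major grid of grayscale rows,
    # each of length w + 1; emits w bits per row (left > right)
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def structural_band(dh, high_cut=8, moderate_cut=15):
    # illustrative banding of the dHash Hamming distance
    if dh <= high_cut:
        return "high-confidence structural"
    if dh <= moderate_cut:
        return "moderate structural"
    return "style-consistent only"
```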

The 7.2% classified as "high style consistency" (cosine > 0.95 but dHash > 15) are particularly informative. Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions. Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting. Some may use signing pads or templates that further constrain variability without constituting image-level reproduction. The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence.

E. Value of a Replication-Dominated Calibration Group

The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels. In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance. Our approach uses practitioner background---one Big-4 firm reportedly relies predominantly on stamping or e-signing workflows---only as a motivation for selecting that firm as a candidate reference population; the calibration role is then established from the audit-report images themselves (byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency), so the calibration does not depend on the practitioner-background claim being externally verified (Section III-H).

This calibration strategy has broader applicability beyond signature analysis. Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal, and non-parametric or mixture-based thresholds are preferred over parametric alternatives. The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates the within-firm heterogeneity visible in the unimodal-long-tail shape of Firm A's per-signature cosine distribution, and yields classification rates that are internally consistent with the data.

F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation

A further methodological contribution is the combination of byte-level pixel identity as an annotation-free conservative gold positive and a large random-inter-CPA negative anchor. Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is pair-level proof of image reuse and, modulo the narrow source-template edge case discussed in the fifth limitation below, a conservative positive for non-hand-signing without requiring human review. In our corpus, 310 signatures satisfied this condition. We emphasize that byte-identical pairs are a subset of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways). Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.
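Operationally, byte-identity detection needs no similarity score at all: hashing the normalized crop bytes and grouping by CPA suffices. A minimal sketch under that assumption (input tuples and function name are ours):

```python
import hashlib
from collections import defaultdict

def byte_identical_same_cpa(signatures):
    # signatures: iterable of (cpa_id, report_id, image_bytes), where
    # image_bytes are the cropped-and-normalised signature image bytes.
    # Returns {(cpa_id, digest): sorted report ids} for groups spanning
    # at least two distinct reports -- threshold-free evidence of reuse.
    buckets = defaultdict(set)
    for cpa_id, report_id, blob in signatures:
        buckets[(cpa_id, hashlib.sha256(blob).hexdigest())].add(report_id)
    return {key: sorted(reports)
            for key, reports in buckets.items() if len(reports) >= 2}
```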

Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), which is a substantive improvement over the low-similarity same-CPA negative (n = 35) we originally considered. The combination is a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself, provided that its generalization limits are acknowledged: FAR is informative, whereas recall is valid only for the conservative subset.
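The Wilson score interval used for the FAR estimates is standard; a minimal sketch (clamping to [0, 1] for the boundary cases that arise when few or no false alarms are observed):

```python
import math

def wilson_ci(false_alarms, n_pairs, z=1.96):
    # Wilson score interval for a binomial proportion (95% at z = 1.96)
    p = false_alarms / n_pairs
    denom = 1.0 + z * z / n_pairs
    centre = (p + z * z / (2.0 * n_pairs)) / denom
    half = (z / denom) * math.sqrt(
        p * (1.0 - p) / n_pairs + z * z / (4.0 * n_pairs * n_pairs))
    return max(0.0, centre - half), min(1.0, centre + half)
```

With a $\sim$50,000-pair negative anchor and zero false alarms, the upper bound stays below $10^{-4}$, which is why the large inter-CPA anchor yields the tight Table X intervals that the n = 35 anchor could not.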

G. Limitations

Several limitations should be acknowledged.

First, comprehensive per-document ground truth labels are not available. The pixel-identity anchor is a strict subset of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class. The low-similarity same-CPA anchor (n = 35) is small because intra-CPA pairs rarely fall below cosine 0.70; we use the $\sim$50,000-pair inter-CPA negative anchor as the primary negative reference, which yields tight Wilson 95% CIs on FAR (Table X), but it too does not exhaust the set of true negatives (in particular, same-CPA hand-signed pairs with moderate cosine similarity are not sampled). A manual-adjudication study concentrated at the decision boundary---for example 100--300 auditor-years stratified by cosine band---would further strengthen the recall estimate against the full positive class.

Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning. While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.

Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity. This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified.
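The filtering step this limitation refers to can be sketched per-pixel; the hue window and saturation/value thresholds below are illustrative, not the pipeline's calibrated values:

```python
import colorsys

def remove_red_stamp(pixels, sat_min=0.30, val_min=0.20):
    # pixels: row-major list of (r, g, b) tuples in 0..255.
    # Whitens sufficiently saturated red-hue pixels; where a stroke
    # overlaps the seal, the blended pixel is whitened too, which is
    # the gap-creation artifact described above.
    cleaned = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        red_hue = h <= 10.0 / 360.0 or h >= 350.0 / 360.0
        if red_hue and s >= sat_min and v >= val_min:
            cleaned.append((255, 255, 255))
        else:
            cleaned.append((r, g, b))
    return cleaned
```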

Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements. While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.

Fifth, the max/min detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case that one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as the stamping or e-signature template, the pair correctly identifies image reuse but misattributes the non-hand-signed status to the source exemplar. This misattribution affects at most one source document per template variant per CPA (the exemplar from which the template was produced), is not expected to be common given that stored signature templates are typically generated in a separate acquisition step rather than extracted from submitted audit reports, and does not materially affect aggregate capture rates at the firm level.

Sixth, our analyses remain at the signature level; we abstain from partner-level frequency inferences such as "X% of CPAs hand-sign in a given year." Per-signature labels in this paper are not translated to per-report or per-partner mechanism assignments (Section III-G). The signature-level rates we report, including the 92.5% / 7.5% Firm A split and the year-by-year left-tail share of Section IV-G.1, should accordingly be read as signature-level quantities rather than partner-level frequencies.

Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing." Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.