pdf_signature_extraction/paper/paper_a_conclusion_v3.md
gbanyan 9d19ca5a31 Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE
  antimode, BD/McCrary, and Beta mixture EM on accountant-level means.
  Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979,
  LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at
  accountant level (consistent with smooth clustering, not sharp
  discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions,
  used at signature all-pairs level) vs KDE antimode (single-distribution
  local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds a CPA-level 70/30
  held-out fold. Calibration thresholds are derived from the 70% fold only;
  held-out rates are reported with Wilson 95% CIs (e.g. cos>0.95
  held-out=93.61% [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10
  signatures (9 CPAs excluded for insufficient sample). Reconciled across
  intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature
  label determines document label).
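The worst-case aggregation rule can be sketched as follows; the label names and severity ordering here are illustrative assumptions, not the paper's exact schema.

```python
# Document-level aggregation: the worst-case signature label determines
# the document label. Label set and severity ordering are assumed for
# illustration (higher severity = worse).
SEVERITY = {"hand_signed": 0, "ambiguous": 1, "non_hand_signed": 2}

def document_label(signature_labels):
    """Return the most severe label among a document's signature labels."""
    if not signature_labels:
        raise ValueError("document has no signatures")
    return max(signature_labels, key=SEVERITY.__getitem__)
```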

## Pixel-identity validation strengthened
- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces
  the original n=35 same-CPA low-similarity negative which had untenable
  Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset"
  language per codex feedback (byte-identical positives are a subset, not
  a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.
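The EER interpolation above (locating the FAR=FRR crossing between tabulated threshold points) can be sketched as below; this assumes FAR and FRR are tabulated on a shared threshold grid with the usual monotone shapes, and is not the paper's exact implementation.

```python
def interpolate_eer(far, frr):
    """Linearly interpolate the equal-error rate: the point where FAR
    and FRR cross, given both tabulated on the same threshold grid.

    Assumes FAR is non-increasing and FRR non-decreasing along the grid,
    the usual shape for similarity-threshold verification tables.
    """
    for i in range(len(far) - 1):
        d0 = far[i] - frr[i]
        d1 = far[i + 1] - frr[i + 1]
        if d0 == 0:
            return far[i]  # exact crossing at a tabulated point
        if d0 * d1 < 0:  # sign change: crossing lies inside this segment
            t = d0 / (d0 - d1)  # fraction of the way to the next point
            return far[i] + t * (far[i + 1] - far[i])
    if far[-1] == frr[-1]:
        return far[-1]
    return None  # no crossing within the tabulated range
```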

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods"
  throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of
  unimodality; rejection is consistent with but not a direct test of
  bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to
  "replication-dominated" in prior commit; this commit strengthens that
  framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity"
  (BD/McCrary non-transition at accountant level rules out sharp discrete
  boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP
  or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract,
  Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (held-out results added as Table XI,
  classification moved to XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no
  pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired
  it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed the 0.945 Firm A rate exactly (94.52%) rather than by interpolation.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1
markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00


# VI. Conclusion and Future Work
## Conclusion
We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three methodologically distinct methods applied at two analysis levels.
Our contributions are fourfold.
First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.
Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification.
Among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing, while 7% show no structural corroboration despite near-identical feature-level appearance; a single-descriptor approach therefore conflates style consistency with image reproduction.
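The two descriptors can be illustrated with a minimal sketch: cosine similarity over embedding vectors, and a difference hash (dhash) over grayscale crops compared by Hamming distance. The hash size and nearest-neighbour downsampling below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two deep-feature embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dhash_bits(gray, hash_size=8):
    """Difference hash: threshold horizontally adjacent pixels of a
    coarsely downsampled grayscale image into hash_size**2 bits."""
    h, w = gray.shape
    rows = (np.arange(hash_size) * h) // hash_size
    cols = (np.arange(hash_size + 1) * w) // (hash_size + 1)
    small = gray[np.ix_(rows, cols)]  # nearest-neighbour downsample
    return (small[:, 1:] > small[:, :-1]).ravel()

def dhash_hamming(gray_a, gray_b, hash_size=8):
    """Number of differing dhash bits between two grayscale crops."""
    return int(np.count_nonzero(
        dhash_bits(gray_a, hash_size) != dhash_bits(gray_b, hash_size)))
```

A low Hamming distance corroborates image reproduction structurally, whereas high cosine similarity alone may only reflect a consistent signing style.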
Third, we introduced a three-method threshold framework combining KDE antimode (with a Hartigan dip test of unimodality), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check).
Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the three 1D methods agree within $\sim 0.006$ at cosine $\approx 0.975$.
The Burgstahler-Dichev / McCrary test finds no significant transition at the accountant level, consistent with clustered but smoothly mixed rather than sharply discrete accountant-level behavior.
The substantive reading is therefore narrower than "discrete behavior": *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered with smooth cluster boundaries.
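The KDE antimode used at the accountant level can be sketched as a search for the deepest interior local minimum of a one-dimensional kernel density estimate. The implementation below, with Silverman's rule-of-thumb bandwidth, is a minimal illustration of the idea, not the paper's exact estimator.

```python
import numpy as np

def kde_antimode(samples, grid_size=512, bandwidth=None):
    """Locate the deepest interior local minimum (antimode) of a 1D
    Gaussian kernel density estimate; returns None if none is found."""
    x = np.asarray(samples, dtype=float)
    if bandwidth is None:
        bandwidth = 1.06 * x.std() * len(x) ** (-0.2)  # Silverman's rule
    grid = np.linspace(x.min(), x.max(), grid_size)
    # Unnormalised Gaussian KDE evaluated on the grid.
    dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2).sum(axis=1)
    # Strict interior local minima of the density curve.
    is_min = (dens[1:-1] < dens[:-2]) & (dens[1:-1] < dens[2:])
    idx = np.flatnonzero(is_min) + 1
    if idx.size == 0:
        return None  # effectively unimodal at this bandwidth
    return float(grid[idx[np.argmin(dens[idx])]])
```

On a clearly bimodal sample the antimode lands between the two modes; the accountant-level agreement of this estimate with the mixture-model cutpoints is what the ~0.006 convergence figure quantifies.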
Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
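Constructing a byte-level pixel-identity anchor amounts to grouping signature crops whose encoded bytes are exactly identical; a minimal sketch follows, where the use of SHA-256 and the dict-of-bytes input are illustrative assumptions rather than the paper's exact pipeline.

```python
import hashlib
from collections import defaultdict

def byte_identical_groups(signature_bytes):
    """Group signature crops whose raw encoded bytes are exactly identical.

    `signature_bytes` maps a signature ID to the raw bytes of its cropped
    image file; only groups with at least two members are returned.
    """
    groups = defaultdict(list)
    for sig_id, raw in signature_bytes.items():
        groups[hashlib.sha256(raw).hexdigest()].append(sig_id)
    return [ids for ids in groups.values() if len(ids) >= 2]
```

Byte identity is a sufficient but not necessary condition for image reproduction, which is why the anchor is framed as a non-generalizable positive subset rather than a representative positive class.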
To break the circularity of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report post-hoc capture rates on the held-out fold with Wilson 95% confidence intervals.
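The Wilson score interval used for the held-out capture rates can be computed in a few lines; this is the standard closed form, shown here as a self-contained sketch.

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half
```

Unlike the Wald interval, the Wilson interval remains well-behaved near 0% and 100%, which matters for the high capture rates reported here.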
This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split of signatures at the signature-level cosine threshold; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
An ablation study comparing ResNet-50, VGG-16, and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
## Future Work
Several directions merit further investigation.
Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
Extending the accountant-level analysis to auditor-year units---using the same three-method convergent framework but at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade.
The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself.
Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.