Phase 6 manuscript splice (2/2): §IV / §V / §VI spliced

Lands v4.0 §IV / §V / §VI content into v3.20.0 master sub-files. Strips internal close-out checklists, draft notes, and open-questions blocks at splice. Completes the Phase 6 manuscript-master file assembly. §IV Results (paper_a_results_v3.md): - §IV-A..C: kept v3.20.0 inherited content (experimental setup, detection performance, all-pairs distribution); added v4 scope note (Big-4 primary) at the §IV header - §IV-D..K: replaced v3.20.0 §IV-D..H with v4.0 §IV-D..K (Big-4 distributional / mixture / convergence / LOOO / pixel-identity / inter-CPA reference / five-way classification / full-dataset robustness) - §IV-L: renumbered v3.20.0 §IV-I (backbone ablation) content to match v4's "§IV-L inherited from v3.20.0 §IV-I" reframing - §IV-M: appended v4.0 ICCR calibration tables (XX-XXVI): composition decomposition, per-comparison/per-signature/ per-document ICCRs, firm heterogeneity + cross-firm hit matrix, alert-rate sensitivity - §III-K ablation cross-ref updated to §IV-L (was §IV-I) - Phase 3 close-out checklist (lines 365+) stripped §V Discussion (paper_a_discussion_v3.md): - Replaced v3.20.0 §V with v4.0 §V (8 sub-sections A-H): A. Distinct problem framing B. Continuous quality spectrum + composition-driven multimodality C. Firm A as templated end (case study, not anchor) D. K=2 / K=3 descriptive partitions E. Three-score convergent internal-consistency F. Anchor-based multi-level calibration G. Pixel-identity hard positive anchor + ICCR reframing H. Limitations (14 items: 9 v4-specific + 5 inherited from v3.x) §VI Conclusion (paper_a_conclusion_v3.md): - Replaced v3.20.0 §VI with v4.0 §VI (8 contribution items mirroring §I contributions; 4-direction future work). Known splice-time issue (deferred to typesetting): §IV table numbering is sequential by label (V, VI, ..., XXVI) but Table XIX (document-level worst-case) appears physically before Tables XVI/XVII/XVIII in §IV-J narrative flow. IEEE Access typesetters typically normalize table order during typesetting; we accept the in-file ordering quirk to preserve the §IV-J narrative arc (per-signature -> document-level worst-case -> K=3 cross-tab). Renumbering to strictly-ascending physical order would require renaming Tables XVI/XVII/XVIII -> XVII/XVIII/XIX with downstream cross-reference updates; deferred unless partner Jimmy review or IEEE Access submission portal flags it. Manuscript splice complete. Working drafts in paper/v4/ retained as archive of the round-by-round Phase 5 fix history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 18:43:41 +08:00
parent c79329457a
commit 12637cd413
3 changed files with 385 additions and 510 deletions
@@ -1,30 +1,7 @@
 # VI. Conclusion and Future Work

-## Conclusion
+We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is an inherited Paper A five-way per-signature classifier with worst-case document-level aggregation (§III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population.

-We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
-Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with the operational classifier's cosine cut anchored on a whole-sample Firm A percentile heuristic and the per-signature similarity distribution characterised through two threshold estimators and a density-smoothness diagnostic.
+Our central methodological contributions are: (1) a composition decomposition (Scripts 39b–39e) that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the inherited operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) a ten-tool unsupervised-validation collection (§III-M Table XXVII) that explicitly discloses each tool's untested assumption and positions the system as an anchor-calibrated screening framework with human-in-the-loop review, not as a validated forensic detector.

-The seven numbered contributions listed in Section I can be grouped into four broader methodological themes, summarized below.
-
-First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.
-
-Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
-
-Third, we characterised the per-signature similarity distribution using three diagnostics---a Hartigan dip test, an EM-fitted Beta mixture (with logit-Gaussian robustness check), and a Burgstahler-Dichev / McCrary density-smoothness procedure---and showed that no two-mechanism mixture cleanly explains it: the dip test fails to reject unimodality for Firm A ($p = 0.17$), BIC strongly prefers a 3-component over a 2-component Beta fit ($\Delta\text{BIC} = 381$ for Firm A), and the BD/McCrary candidate transition lies inside the non-hand-signed mode rather than between modes (and is not bin-width-stable; Appendix A).
-The substantive reading is that *pixel-level output quality* is a continuous spectrum produced by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) and scan conditions, rather than a discrete class cleanly separated from hand-signing.
-This reading motivates anchoring the operational classifier's cosine cut on a whole-sample Firm A P7.5 percentile heuristic (cos $> 0.95$) rather than on a mixture-fit crossing.
-
-Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
-To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85--95% capture band differ by 1--5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure.
-This framing is internally consistent with the available evidence: the byte-level pair analysis finding of 145 pixel-identical calibration-firm signatures across 50 distinct partners of 180 registered (Section IV-F.1); the 92.5% / 7.5% split in signature-level cosine thresholds and the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1); and the 95.9% top-decile concentration of Firm A auditor-years in the threshold-independent partner-ranking analysis (Section IV-G.2).
-
-An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
-
-## Future Work
-
-Several directions merit further investigation.
-Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
-The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
-The replication-dominated calibration strategy and the pixel-identity anchor technique are both generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself, conditional on the availability of analogous anchors in the new domain and on artifact-generation physics that preserve the byte-level trace.
-Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
+Future work falls in four directions. *First*, a small-scale human-rated validation set would enable direct ROC optimisation and provide signature-level ground truth that v4.0 fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across Big-4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with v3.x's byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the v4.0 calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.