Files
pdf_signature_extraction/paper/paper_a_conclusion_v3.md
T
gbanyan b6913d2f93 Phase 6 round-2 reviewer revisions: §III-H.1 promotion + framing alignment
Structural:
- Promote operational classifier definition from §III-L.0 to new §III-H.1, so
  the reader meets the five-way HC/MC/HSC/UN/LH rule before the §III-I/J/K
  diagnostic chain instead of ~130 lines after. §III-L renamed to
  "Anchor-Based Threshold Calibration"; §III-L.0 retains only calibration
  methodology, three units of analysis, any-pair semantics, and the FAR
  terminological note. §III-L.7 deleted (redundant with §III-J).
- Reorganise §V-H Limitations into Primary / Secondary / Documented features /
  Engineering groupings (was a flat 14-item list).
- Reframe §III-M from "ten-tool unsupervised-validation collection" to
  "each diagnostic addresses one specific unsupervised failure mode";
  rename "What v4.0 does/does not claim" → "Limits / Scope of the present
  analysis"; retitle Table XXVII.

Framing alignment (cross-section):
- Strip all v3.x / v4.0 / v3.20 / v4-new / inherited lineage labels from
  rendered text (Abstract, Intro, §II, §III, §IV, §V, §VI, Appendix, Impact).
- Replace "Paper A" rule references with "deployed" rule references.
- Soften "validation" to "characterise" / "check" / "screening label" /
  "consistency check" / "support"; "verdict" → "screening label".
- Remove codex-verified spike claims (non-Big-4 jittered dHash, Big-4 pooled
  cosine after firm-mean centring). Only formally scripted evidence
  (Scripts 39b–39e) retained; non-Big-4 evidence framed as corroborating
  raw-axis cosine, not as calibration evidence.
- Strip script-provenance parentheticals from Introduction; defer Script 39c
  internal references and similar to Methodology / Appendix.

Numerical / table fixes:
- §III-C document-count arithmetic: 12 corrupted → 13 corrupted/unreadable,
  verified against sqlite DB and total-pdf/ folder counts (90,282 - 4,198
  no-sig - 13 corrupted = 86,071 → 85,042 with detections → 182,328 sigs →
  168,755 CPA-matched). Table I shows VLM-positive (86,084) and
  processed-for-extraction (86,071) as separate rows.
- Wilson 95% CIs added for joint-rule ICCR rows in Table XXI / methodology
  table ([0.00011, 0.00018] and [0.00008, 0.00014]).
- Unit error fixed: 0.3856 pp / 0.4431 pp → 0.3856 (38.6 pp) / 0.4431 (44.3 pp).

Smaller revisions:
- Pipeline framing: "detecting" → "screening" in Abstract / Intro / Conclusion
  for consistency with the unsupervised-screening positioning.
- "hard ground-truth subset" → "conservative hard-positive subset" throughout.
- §III-F SSIM / pixel-comparison rebuttal compressed from ~15 lines to 4;
  design-level argument deferred to supplementary materials.
- "stakeholders can adopt / can derive thresholds" → "alternative operating
  points can be characterised by inverting" (less prescriptive).
- "the same mechanism extending in milder form to Firms B/C/D" → "similar,
  milder production-related reuse patterns at Firms B/C/D" (mechanism claim
  softened).
- Appendix A "non-hand-signed mode" / "two-mechanism mixture" lineage language
  aligned with v4 framing.

Appendix B:
- Rebuilt as a redirect-only stub. The HTML-commented obsolete table mapping
  (Table IX–XVIII labels with FAR / capture-rate / validation language) is
  removed; replaced with a short paragraph pointing to supplementary
  materials for full table-to-script provenance.

Cross-references:
- All §III-L references for the rule definition retargeted to §III-H.1;
  references for calibration still point to §III-L.
- §III-H references for byte-level Firm A evidence / non-Big-4 reverse anchor
  retargeted to §III-H.2.

Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1314 lines
  (was 1346 pre-review).
- Two review handoff documents added:
  paper/review_handoff_abstract_intro_20260515.md
  paper/review_handoff_body_20260515.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 18:07:31 +08:00

4.4 KiB
Raw Blame History

VI. Conclusion and Future Work

We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature classifier with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442150,453 signatures at signature level) as the primary analytical population.

Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter (p_{\text{median}} = 0.35), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison (0.0006 at cos$>0.95$; 0.0013 at dHash$\leq 5$; 0.00014 jointly), pool-normalised per-signature (0.11 for the deployed any-pair HC rule), and per-document (0.34 for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios 0.053, 0.010, 0.027 for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is 98.8\% at Firm A and $76.7$83.7\% at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$99.96\% within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman \rho \geq 0.879, reported as internal consistency rather than external validation; (7) 0\% positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (§III-M Table XXVII), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.

Future work falls in four directions. First, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. Second, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$98.8\% across Big-4; same-pair joint $97.0$99.96\%) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. Third, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm 0.62 vs $0.09$0.16) — together with the byte-level evidence of 145 pixel-identical signatures across \sim 50 distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. Fourth, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.