Files
pdf_signature_extraction/paper/paper_a_conclusion_v3.md
gbanyan b6913d2f93 Phase 6 round-2 reviewer revisions: §III-H.1 promotion + framing alignment
Structural:
- Promote operational classifier definition from §III-L.0 to new §III-H.1, so
  the reader meets the five-way HC/MC/HSC/UN/LH rule before the §III-I/J/K
  diagnostic chain instead of ~130 lines after. §III-L renamed to
  "Anchor-Based Threshold Calibration"; §III-L.0 retains only calibration
  methodology, three units of analysis, any-pair semantics, and the FAR
  terminological note. §III-L.7 deleted (redundant with §III-J).
- Reorganise §V-H Limitations into Primary / Secondary / Documented features /
  Engineering groupings (was a flat 14-item list).
- Reframe §III-M from "ten-tool unsupervised-validation collection" to
  "each diagnostic addresses one specific unsupervised failure mode";
  rename "What v4.0 does/does not claim" → "Limits / Scope of the present
  analysis"; retitle Table XXVII.

Framing alignment (cross-section):
- Strip all v3.x / v4.0 / v3.20 / v4-new / inherited lineage labels from
  rendered text (Abstract, Intro, §II, §III, §IV, §V, §VI, Appendix, Impact).
- Replace "Paper A" rule references with "deployed" rule references.
- Soften "validation" to "characterise" / "check" / "screening label" /
  "consistency check" / "support"; "verdict" → "screening label".
- Remove codex-verified spike claims (non-Big-4 jittered dHash, Big-4 pooled
  cosine after firm-mean centring). Only formally scripted evidence
  (Scripts 39b–39e) retained; non-Big-4 evidence framed as corroborating
  raw-axis cosine, not as calibration evidence.
- Strip script-provenance parentheticals from Introduction; defer Script 39c
  internal references and similar to Methodology / Appendix.

Numerical / table fixes:
- §III-C document-count arithmetic: 12 corrupted → 13 corrupted/unreadable,
  verified against sqlite DB and total-pdf/ folder counts (90,282 - 4,198
  no-sig - 13 corrupted = 86,071 → 85,042 with detections → 182,328 sigs →
  168,755 CPA-matched). Table I shows VLM-positive (86,084) and
  processed-for-extraction (86,071) as separate rows.
- Wilson 95% CIs added for joint-rule ICCR rows in Table XXI / methodology
  table ([0.00011, 0.00018] and [0.00008, 0.00014]).
- Unit error fixed: 0.3856 pp / 0.4431 pp → 0.3856 (38.6 pp) / 0.4431 (44.3 pp).
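The corrected document-count arithmetic can be checked mechanically with the figures quoted above:

```python
total_pdfs = 90_282          # all audit reports, 2013-2023
no_signature_pages = 4_198   # reports with no signature page
corrupted = 13               # corrupted/unreadable (previously mis-counted as 12)

processed_for_extraction = total_pdfs - no_signature_pages - corrupted
assert processed_for_extraction == 86_071

# Table I's two separate rows differ by exactly the corrupted count:
vlm_positive = 86_084
assert vlm_positive - corrupted == processed_for_extraction
```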
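The added Wilson 95% CIs follow the standard score-interval formula. A generic implementation for reference; the `k`/`n` counts in the example are hypothetical placeholders chosen to give a proportion near the joint-rule ICCR, not the paper's actual tallies:

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical counts giving p = 0.00014; the interval tightens as n grows.
lo, hi = wilson_ci(140, 1_000_000)
```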

Smaller revisions:
- Pipeline framing: "detecting" → "screening" in Abstract / Intro / Conclusion
  for consistency with the unsupervised-screening positioning.
- "hard ground-truth subset" → "conservative hard-positive subset" throughout.
- §III-F SSIM / pixel-comparison rebuttal compressed from ~15 lines to 4;
  design-level argument deferred to supplementary materials.
- "stakeholders can adopt / can derive thresholds" → "alternative operating
  points can be characterised by inverting" (less prescriptive).
- "the same mechanism extending in milder form to Firms B/C/D" → "similar,
  milder production-related reuse patterns at Firms B/C/D" (mechanism claim
  softened).
- Appendix A "non-hand-signed mode" / "two-mechanism mixture" lineage language
  aligned with v4 framing.

Appendix B:
- Rebuilt as a redirect-only stub. The HTML-commented obsolete table mapping
  (Table IX–XVIII labels with FAR / capture-rate / validation language) is
  removed; replaced with a short paragraph pointing to supplementary
  materials for full table-to-script provenance.

Cross-references:
- All §III-L references for the rule definition retargeted to §III-H.1;
  references for calibration still point to §III-L.
- §III-H references for byte-level Firm A evidence / non-Big-4 reverse anchor
  retargeted to §III-H.2.

Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1314 lines
  (was 1346 pre-review).
- Two review handoff documents added:
  paper/review_handoff_abstract_intro_20260515.md
  paper/review_handoff_body_20260515.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 18:07:31 +08:00

# VI. Conclusion and Future Work
We present a fully automated pipeline for screening non-hand-signed CPA signatures in Taiwan-listed financial audit reports, together with an anchor-calibrated screening framework that characterises the pipeline's operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised assumptions. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is the deployed five-way per-signature classifier with worst-case document-level aggregation (§III-H.1; calibrated in §III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population.
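The dual-descriptor comparison can be sketched at the per-pair level. The thresholds (cos $>0.95$, dHash Hamming $\leq 5$) are the deployed operating points quoted below; everything else, including the 64-bit dHash representation and all function names, is a hypothetical illustration, and the "independent-minimum" selection over candidate dHash variants is not modelled here:

```python
import math

COS_THRESHOLD = 0.95    # cosine similarity on ResNet-50 feature vectors
DHASH_THRESHOLD = 5     # Hamming distance on (assumed) 64-bit dHash values

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hamming(h1, h2):
    """Hamming distance between two dHash values held as integers."""
    return bin(h1 ^ h2).count("1")

def joint_hit(feat_a, feat_b, dh_a, dh_b):
    """Joint per-comparison rule: cos > 0.95 AND dHash distance <= 5."""
    return (cosine(feat_a, feat_b) > COS_THRESHOLD
            and hamming(dh_a, dh_b) <= DHASH_THRESHOLD)
```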
Our central methodological contributions are: (1) a composition decomposition that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the deployed operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm), with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to the Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) the K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) a $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures, with the conservative-subset caveat; and (8) explicit disclosure of each diagnostic's untested assumption (§III-M, Table XXVII), positioning the system as an anchor-calibrated screening framework with human-in-the-loop review rather than as a validated forensic detector.
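The pool-normalised per-signature ICCR in contribution (2) counts, under any-pair semantics, the fraction of signatures involved in at least one inter-CPA hit. A minimal sketch under that assumed interpretation (names and data shapes are hypothetical):

```python
def per_signature_iccr(hit_pairs, n_signatures):
    """Pool-normalised per-signature ICCR under any-pair semantics:
    the fraction of the signature pool participating in at least one
    inter-CPA hit. `hit_pairs` is an iterable of (sig_id_a, sig_id_b)."""
    hit_sigs = {sig for pair in hit_pairs for sig in pair}
    return len(hit_sigs) / n_signatures
```

Because a single signature can appear in many pairs, this per-signature rate can sit orders of magnitude above the per-comparison rate, which is why the paper reports the two units separately.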
Future work falls into four directions. *First*, a small-scale human-rated labelled set would enable direct ROC optimisation and provide the signature-level ground truth that the present analysis fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (any-pair $76.7$–$98.8\%$ across the Big 4; same-pair joint $97.0$–$99.96\%$) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure), a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$), together with the byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners, invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
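The first future-work item, direct ROC optimisation once a labelled set exists, reduces operationally to a threshold sweep over a score. A minimal sketch on toy data (the function, labels, and scores below are all hypothetical):

```python
def roc_points(scores, labels, thresholds):
    """(FPR, TPR) pairs at each score threshold.
    labels: 1 = non-hand-signed (positive), 0 = hand-signed (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points
```

With such a curve in hand, the deployed operating points could be compared directly against label-optimal thresholds instead of being characterised only through anchor-based ICCR calibration.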