Files
pdf_signature_extraction/paper/codex_review_gpt54_v3_2.md
T
gbanyan 5717d61dd4 Paper A v3.3: apply codex v3.2 peer-review fixes
Codex (gpt-5.4) second-round review recommended 'minor revision'. This
commit addresses all issues flagged in that review.

## Structural fixes

- dHash calibration inconsistency (codex #1, most important):
  Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come
  from the whole-sample Firm A cosine-conditional dHash distribution
  (median=5, P95=15), not from the calibration-fold independent-minimum
  dHash distribution (median=2, P95=9) which we report elsewhere as
  descriptive anchors. Added explicit note about the two dHash
  conventions and their relationship.

- Section IV-H framing (codex #2):
  Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence"
  to "Additional Firm A Benchmark Validation" and clarified in the
  section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully
  threshold-free, H.3 uses the calibrated classifier. H.3's concluding
  sentence now says "the substantive evidence lies in the cross-firm
  gap" rather than claiming the test is threshold-free.

- Table XVI 93,979 typo fixed (codex #3):
  Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm).

- Held-out Firm A denominator 124+54=178 vs 180 (codex #4):
  Added explicit note that 2 CPAs were excluded due to disambiguation
  ties in the CPA registry.

- Table VIII duplication (codex #5):
  Removed the duplicate accountant-level-only Table VIII comment; the
  comprehensive cross-level Table VIII subsumes it. Text now says
  "accountant-level rows of Table VIII (below)".

- Anonymization broken in Tables XIV-XVI (codex #6):
  Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/
  "Firm D" across Tables XIV, XV, XVI. Table and caption language
  updated accordingly.

- Table X unit mismatch (codex #7):
  Dropped precision, recall, F1 columns. Table now reports FAR
  (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR
  (against the byte-identical positive anchor). III-K and IV-G.1 text
  updated to justify the change.

## Sentence-level fixes

- "three independent statistical methods" in Methodology III-A ->
  "three methodologically distinct statistical methods".
- "three independent methods" in Conclusion -> "three methodologically
  distinct methods".
- Abstract "~0.006 converging" now explicitly acknowledges that
  BD/McCrary produces no significant accountant-level discontinuity.
- Conclusion ditto.
- Discussion limitation sentence "BD/McCrary should be interpreted at
  the accountant level for threshold-setting purposes" rewritten to
  reflect v3.3 result that BD/McCrary is a diagnostic, not a threshold
  estimator, at the accountant level.
- III-H "two analyses" -> "three analyses" (H.1 longitudinal stability,
  H.2 partner ranking, H.3 intra-report consistency).
- Related Work White 1982 overclaim rewritten: "consistent estimators
  of the pseudo-true parameter that minimizes KL divergence" replaces
  "guarantees asymptotic recovery".
- III-J "behavior is close to discrete" -> "practice is clustered".
- IV-D.2 pivot sentence "discreteness of individual behavior yields
  bimodality" -> "aggregation over signatures reveals clustered (though
  not sharply discrete) patterns".

Target journal remains IEEE Access. Output:
Paper_A_IEEE_Access_Draft_v3.docx (395 KB).

Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 02:32:17 +08:00

322 KiB
Raw Blame History

Reading additional input from stdin... OpenAI Codex v0.121.0 (research preview)

workdir: /Volumes/NV2/pdf_recognize model: gpt-5.4 provider: openai approval: never sandbox: read-only reasoning effort: xhigh reasoning summaries: none session id: 019dac12-60e9-7f03-9ade-171583b34220

user You are conducting a second-round peer review for IEEE Access of my paper v3.2.

Read the v3.2 paper section markdowns in /Volumes/NV2/pdf_recognize/paper/ : paper_a_abstract_v3.md, paper_a_impact_statement_v3.md, paper_a_introduction_v3.md, paper_a_related_work_v3.md, paper_a_methodology_v3.md, paper_a_results_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md, paper_a_references_v3.md.

Your own previous first-round review is saved at /Volumes/NV2/pdf_recognize/paper/codex_review_gpt54_v3.md. Read it first to remember your own earlier findings, then evaluate whether v3.2 addresses them.

Paper title: "Automated Identification of Non-Hand-Signed Auditor Signatures in Large-Scale Financial Audit Reports: A Dual-Descriptor Framework with Three-Method Convergent Thresholding"

v3.1 -> v3.2 changes to verify:

  1. Three-method convergence overclaim (your previous critical issue): Now reports accountant-level 1D methods converging at cos ~= 0.975 (KDE antimode = 0.973, Beta-2 = 0.979, LogGMM-2 = 0.976) and explicitly acknowledges BD/McCrary does NOT produce a transition at the accountant level (consistent with smooth-clustered, not discontinuous, aggregates). See Table VIII in results and Section III-I.4.

  2. Firm A circular validation (your previous critical issue): Now addressed via (a) 70/30 CPA-level held-out fold (Section IV-G.2, Table XI) and (b) a new Section IV-H "Firm A Benchmark Validation: Threshold-Independent Evidence" with three threshold-independent analyses: yearly stability (H.1), partner-level similarity ranking (H.2, 95.9% of top-10% is Deloitte vs 27.8% baseline), intra-report consistency (H.3, Deloitte agreement 89.9% vs other Big-4 62-67%).

  3. 139+32 vs 180 discrepancy: Resolved to 139+32 of 171 Firm A CPAs with >=10 signatures (9 CPAs excluded for insufficient sample). Reconciled across intro, results, discussion, conclusion.

  4. Method 1 internal inconsistency: Now explicitly distinguished: KDE crossover (two-labeled-population procedure, signature all-pairs) vs KDE antimode (single-distribution local minimum, accountant-level). See Section III-I.1.

  5. Pixel-identity validation weaknesses:

    • Added ~50,000-pair inter-CPA negative anchor (Section IV-G.1, Table X) with Wilson 95% CIs, replacing the n=35 low-cosine anchor.
    • Softened "conservative recall" language.
    • Proper EER interpolation.
  6. "statistically independent" -> "methodologically distinct" throughout.

  7. "formal bimodality check" -> "unimodality test" (dip test tests H0 of unimodality).

  8. Within-auditor-year no-mixing identification assumption now explicit in Section III-G.

  9. PDF-level signature-to-document mapping rule now explicit (worst-case aggregation) in Section III-L.

  10. VLM 1.2% overclaim: softened to acknowledge VLM FP + YOLO FN mixture.

  11. 310 terminology unified as "byte-identical signatures" across Abstract, Results, Discussion.

  12. 0.941, 0.945 thresholds: 0.941 is calibration-fold Firm A cosine P5; 0.945 rate computed exactly (94.52%) not interpolated.

  13. Ref [24] Qwen2.5-VL: expanded to full IEEE format with arXiv:2502.13923.

Please produce a second-round review in priority order covering:

  1. Overall assessment and recommendation for this v3.2 (accept / minor revision / major revision / reject).

  2. For each of your previous critical issues (items 2-4 in your v3.1 review: three-method convergence, KDE method inconsistency, unit-of- analysis clarity, accountant-level interpretation overstated, BD/McCrary rigor, White 1982 overclaim, Firm A circular validation, 139+32 discrepancy), state whether v3.2 adequately addresses the issue, and if not, what specifically still needs fixing.

  3. Any NEW issues introduced by the v3.2 additions (new Section IV-H threshold-independent validation, new tables XIII-XVIII).

  4. Clarity and organization of the revised Sections III-V and the new Section IV-H.

  5. Any remaining sentence-level, table-level, or numerical issues with exact quotes.

Be direct and specific. Cite exact sections, table numbers, and line quotes where possible. Focus on review opinions and concrete fixes. codex Reading the prior v3.1 review first, then Ill inspect the revised section markdowns and map each claimed v3.2 fix against the earlier findings. After that Ill produce a priority-ordered second-round review with exact section/table references and any remaining numerical or wording problems. → Read prior v3.1 review and extract critical issues to re-check • Inspect revised v3.2 section markdowns and locate claimed fixes • Evaluate whether each prior issue is resolved and identify any new issues • Draft second-round review in priority order with exact references and quotes codex Im pulling the paper structure and line numbers now so I can cite exact locations rather than paraphrasing from memory. After that Ill compare the revised claims against the old review item by item. exec /bin/zsh -lc 'ls -1 /Volumes/NV2/pdf_recognize/paper' in /Volumes/NV2/pdf_recognize succeeded in 0ms: Paper_A_IEEE_Access_Draft_v3.docx Paper_A_IEEE_TAI_Draft_20260403.docx Paper_A_IEEE_TAI_Draft_v2.docx ablation_backbone_comparison.py build_docx.sh codex_review_gpt54_v3.md export_paper_to_docx.py export_v2.py export_v3.py generate_paper_figures.py generate_recalibrated_report.py paper_a_abstract.md paper_a_abstract_v3.md paper_a_conclusion.md paper_a_conclusion_v3.md paper_a_discussion.md paper_a_discussion_v3.md paper_a_impact_statement.md paper_a_impact_statement_v3.md paper_a_introduction.md paper_a_introduction_v3.md paper_a_methodology.md paper_a_methodology_v3.md paper_a_outline.md paper_a_references.md paper_a_references_v3.md paper_a_related_work.md paper_a_related_work_v3.md paper_a_results.md paper_a_results_v3.md recalibrate_classification.py renumber_citations.py

exec /bin/zsh -lc 'wc -l /Volumes/NV2/pdf_recognize/paper/codex_review_gpt54_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1750 /Volumes/NV2/pdf_recognize/paper/codex_review_gpt54_v3.md 16 /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md 9 /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md 87 /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md 104 /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md 263 /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md 410 /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md 107 /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md 32 /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md 87 /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md 2865 total

exec /bin/zsh -lc "sed -n '1,260p' /Volumes/NV2/pdf_recognize/paper/codex_review_gpt54_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: Reading additional input from stdin... OpenAI Codex v0.121.0 (research preview)

workdir: /Volumes/NV2/pdf_recognize model: gpt-5.4 provider: openai approval: never sandbox: read-only reasoning effort: xhigh reasoning summaries: none session id: 019dabc7-b94b-7d71-81f6-764d687c3d5a

user You are conducting a peer review for IEEE Access of my paper v3.

Read these files in /Volumes/NV2/pdf_recognize/paper/ : paper_a_abstract_v3.md, paper_a_impact_statement_v3.md, paper_a_introduction_v3.md, paper_a_related_work_v3.md, paper_a_methodology_v3.md, paper_a_results_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md, paper_a_references_v3.md.

Paper title: "Automated Identification of Non-Hand-Signed Auditor Signatures in Large-Scale Financial Audit Reports: A Dual-Descriptor Framework with Three-Method Convergent Thresholding"

Key facts to keep in mind while reviewing:

  • Pipeline: VLM (Qwen2.5-VL 32B) page-ID, YOLOv11n detection, ResNet-50 2048-dim features (L2-norm, no fine-tune), dual descriptors = cosine + dHash (9x8, 64-bit, Hamming distance).
  • Dataset: 90,282 PDF audit reports from TWSE MOPS 2013-2023, 182,328 signatures, 758 CPAs.
  • Three convergent threshold methods: (a) KDE antimode with Hartigan dip test, (b) Burgstahler-Dichev / McCrary discontinuity, (c) 2/3-comp Beta mixture via EM + parallel logit-Gaussian mixture as White 1982 quasi-MLE robustness check.
  • Accountant-level: Gaussian mixture on (mean cosine, mean dHash), BIC picks K=3. C1 cos=0.983 dh=2.41 w=0.21 n=141 (Firm A 139/141). C2 cos=0.954 dh=7.00 w=0.51 n=361 (other Big-4). C3 cos=0.928 dh=11.17 w=0.28 n=184 (smaller firms). 2-comp marginal crossings cos=0.945 dh=8.10.
  • Firm A framed as "replication-dominated" not pure: interviews confirm most partners use replication but a minority may still hand-sign. Consistent with 92.5% Firm A cos>0.95 and 32/180 Firm A CPAs in C2.
  • Pixel-identity validation: 310 sigs with byte-identical nearest same-CPA match as gold positives. Low-cosine anchor (cos<0.70) is only n=35 gold negatives so FAR is order-of-magnitude only. Firm A anchor rates: cos>0.95=92.5%, dual rule cos>0.95 AND dh<=8=89.95%.
  • Classification: 5 tiers (high/moderate-confidence non-hand-signed, high style consistency, uncertain, likely hand-signed), thresholds from Firm A (dHash median=5, P95=15).
  • Ablation: ResNet-50 vs VGG-16 vs EfficientNet-B0; ResNet-50 chosen on balance.

Produce a structured IEEE-Access-style peer review covering, in priority order (most critical first):

  1. Overall assessment and recommendation (accept / minor revision / major revision / reject).

  2. Technical soundness of the three-method threshold framework and the accountant-level GMM; focus on whether the "continuous signature-level spectrum vs discrete accountant-level behaviour" interpretation is supported by the reported evidence.

  3. Whether the "replication-dominated Firm A" framing is justified AND whether the paper does enough to prevent circular validation given Firm A is used both to calibrate thresholds and to anchor validation.

  4. Weaknesses in the pixel-identity validation: is n=35 negative anchor sufficient, does using pixel-identity as gold positive bias recall estimates, what would strengthen it.

  5. Clarity and organization of Sections III / IV / V. Flag terminology inconsistency, undefined results, and any table/figure referenced but not clearly explained.

  6. Specific sentence-level, table-level, numerical issues to fix before submission. Include exact quotes or table-cell references.

Be direct and specific. Cite exact sections, table numbers, and line quotes. Do not restate the paper content; focus on review opinions and concrete fixes. codex Im reviewing the manuscript files directly and will anchor the review to the text as written. First Im loading the section drafts with line numbers and scanning for tables, figures, and threshold/validation claims so I can cite specific locations rather than paraphrase. → Load all manuscript section files with line numbers and identify tables/figures/results claims • Evaluate technical claims around thresholding, accountant-level GMM, and Firm A framing • Draft IEEE-Access-style review with prioritized findings and exact citations exec /bin/zsh -lc 'wc -l /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 17 /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md 30 /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md 101 /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md 9 /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md 86 /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md 229 /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md 87 /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md 104 /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md 252 /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md 915 total

exec /bin/zsh -lc "rg --files /Volumes/NV2/pdf_recognize/paper | rg 'paper_a_.*_v3\.md"'$'"'" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md

exec /bin/zsh -lc "rg -n 'Table|Figure|Fig\.|Section|threshold|Firm A|pixel|identity|Hartigan|dip|Burgstahler|McCrary|Beta mixture|Gaussian mixture|GMM|dHash|cosine|validation|anchor|Tier|tier|KDE|antimode|quasi-MLE|White 1982' /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:6:Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:12:Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:14:Third, we introduced a three-method convergent threshold framework combining KDE antimode (with a Hartigan dip test as formal bimodality check), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check). /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:15:Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture whose two-component marginal crossings (cosine = 0.945, dHash = 8.10) are sharp and mutually consistent. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:16:The substantive reading is that pixel-level output quality is continuous while individual signing behavior is close to discrete. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:18:Fourth, we introduced a replication-dominated calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor that requires no manual annotation. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19:This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:29:The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:30:Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:9:The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash). /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10:For threshold determination we apply three statistically independent methods---Kernel Density antimode with a Hartigan dip test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12:The accountant-level 2-component crossings yield principled thresholds (cosine = 0.945, dHash = 8.10). /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:13:A major Big-4 firm is used as a replication-dominated (not pure) calibration anchor, with interview and visual evidence supporting majority non-hand-signing and a minority of hand-signers. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14:Validation against 310 pixel-identical signature pairs and a low-similarity negative anchor yields perfect recall at all candidate thresholds. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:7:In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:12:The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:16:The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method convergent analysis and the Hartigan dip test (Sections IV-D and IV-E). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:18:At the per-signature level, the distribution of best-match cosine similarity is not cleanly bimodal. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:19:Firm A's signature-level cosine is formally unimodal (dip test p = 0.17) with a long left tail. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:20:The all-CPA signature-level cosine is multimodal (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:21:The BD/McCrary discontinuity test locates its transition at 0.985---inside the non-hand-signed mode rather than at a boundary. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:26:The distribution of per-accountant mean cosine (and mean dHash) is unambiguously multimodal. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:27:A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:28:The two-component marginal crossings (cosine = 0.945, dHash = 8.10) are sharp and mutually consistent. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:30:The substantive interpretation is simple: pixel-level output quality is continuous, but individual signing behavior is close to discrete. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:32:Methodologically, this implies that the three-method convergent framework (KDE antimode / BD-McCrary / Beta mixture) is meaningful at the level where the mixture structure is statistically supported---namely the accountant level---and serves as diagnostic evidence rather than a primary classification boundary at the signature level. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:34:## C. Firm A as a Replication-Dominated, Not Pure, Population /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:36:A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:37:Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:40:First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:41:Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42:Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:45:We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:49:Within the 71,656 documents exceeding cosine 0.95, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:50:A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:52:The 7.2% classified as "high style consistency" (cosine > 0.95 but dHash > 15) are particularly informative. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:53:Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:60:The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:65:Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:70:A further methodological contribution is the use of byte-level pixel identity as an annotation-free gold positive. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:71:Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:72:In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:73:Paired with a conservative low-similarity anchor, this yields EER-style validation without the stratified manual annotation that would otherwise be the default methodological choice. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:81:The pixel-identity anchor is a strict subset of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), and the low-similarity anchor (n = 35) is small because intra-CPA pairs rarely fall below cosine 0.70. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:82:The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:88:In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:92:While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:97:Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:98:This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:12:From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:20:This framing presupposes that the central threat is identity fraud. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:21:In our context, identity is not in question; the CPA is indeed the legitimate signer. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:26:Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:27:Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28:A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, F_1, and equal-error-rate metrics that prevail in the biometric-verification literature. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:31:Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:34:From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:41:(4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance; /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:42:(5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:43:(6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:48:By requiring convergent evidence from both descriptors, we can differentiate style consistency (high cosine but divergent dHash) from image reproduction (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:51:One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:52:Structured interviews with multiple Firm A partners confirm that most certifying partners produce their audit-report signatures by reproducing a stored image while not excluding that a minority may continue to hand-sign some reports. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:53:We therefore treat Firm A as a replication-dominated calibration reference rather than a pure positive class. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54:This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:59:The substantive reading is that pixel-level output quality is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, but individual signing behavior is close to discrete---a given CPA is either a consistent user of non-hand-signing or a consistent hand-signer. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:71:3. Dual-descriptor verification. We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:73:4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:75:5. Continuous-quality / discrete-behavior finding. We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates are cleanly trimodal---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:77:6. Replication-dominated calibration methodology. We introduce a calibration strategy using a known-majority-positive reference group, distinguishing replication-dominated from replication-pure anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:82:Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:83:Section III describes the proposed methodology. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:84:Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:85:Section V discusses the implications and limitations of our findings. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:86:Section VI concludes with directions for future work. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:6:Fig. 1 illustrates the overall architecture. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:7:The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:13:[Figure 1: Pipeline Architecture - clean vector diagram] /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:17:→ Dual-Method Verification (Cosine + dHash) /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:18:→ Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:19:→ Pixel-identity + Firm A + Accountant-level GMM validation /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:30:Table I summarizes the dataset composition. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:53:Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents, establishing an upper bound on the VLM false-positive rate of 1.2%. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:61:The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:73:A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:82:Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:83:All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:85:The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:86:This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:97:Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:99:Perceptual hash distance (dHash) captures structural-level similarity. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:100:Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:102:Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:105:Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:107:Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:110:We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:111:Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:118:For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:119:The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:123:These accountant-level aggregates are the input to the mixture model described in Section III-I. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:125:## H. Calibration Reference: Firm A as a Replication-Dominated Population /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:127:A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:128:Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:129:This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:131:First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:132:Crucially, the same interview evidence does not exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:134:Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:138:We emphasize that Firm A's replication-dominated status was not derived from the thresholds we calibrate against it. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:140:The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144:Direct assignment of thresholds based on prior intuition (e.g., cosine \geq 0.95 for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145:To place threshold selection on a statistically principled and data-driven footing, we apply three independent methods whose underlying assumptions decrease in strength. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:148:### 1) Method 1: KDE + Antimode with Bimodality Check /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:151:A candidate threshold is taken at the location of the local density minimum (antimode) between modes of the fitted density. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:152:Because the antimode is only meaningful if the distribution is indeed bimodal, we apply the Hartigan & Hartigan dip test [37] as a formal bimodality check at conventional significance levels, and perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify stability. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:154:### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:156:We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:157:We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin i with count n_i, the standardized deviation from the smooth-null expectation of the average of its neighbours, /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:162:A threshold is identified at the transition where Z_{i-1} is significantly negative (observed count below expectation) adjacent to Z_i significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition Z^- \rightarrow Z^+ marks the candidate decision boundary. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:166:We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:168:Under the fitted model the threshold is the crossing point of the two weighted component densities, /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:173:As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the logit-transformed similarity, following standard practice for bounded proportion data; White's [41] quasi-MLE consistency result guarantees asymptotic recovery of the best Beta-family approximation to the true distribution even if the true component densities are not exactly Beta. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:176:When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:180:The three methods rest on decreasing-in-strength assumptions: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:181:If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:184:Applied to the per-signature cosine distribution the three methods may converge on a single boundary only weakly or not at all---because, as our results show, per-signature similarity is not a cleanly bimodal population. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:185:Applied to the per-accountant cosine mean, by contrast, the three methods yield a sharp and consistent boundary, reflecting that individual accountants tend to be either consistent users of non-hand-signing or consistent hand-signers, even if the signature-level output is a continuous quality spectrum. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:186:We therefore explicitly analyze both levels and interpret their divergence as a substantive finding (Section V) rather than a statistical nuisance. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:190:In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:191:The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing behavior is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level quality lies on a continuous spectrum. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:194:For the selected K^* we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:196:## K. Pixel-Identity and Firm A Validation (No Manual Annotation) /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:198:Rather than construct a stratified manual-annotation validation set, we validate the classifier using three naturally occurring reference populations that require no human labeling: /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:200:1. Pixel-identical anchor (gold positive): signatures whose nearest same-CPA match is byte-identical after crop and normalization. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:201:Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for non-hand-signing. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:203:2. Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:204:Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:206:3. Low-similarity anchor (gold negative): signatures whose maximum same-CPA cosine similarity is below a conservative cutoff (0.70) that cannot plausibly arise from pixel-level duplication. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:208:From these anchors we report Equal Error Rate (EER), precision, recall, F_1, and per-threshold False Acceptance / False Rejection Rates (FAR/FRR), following biometric-verification reporting conventions [3]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:209:We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:213:The final per-document classification combines the three-method thresholds with the dual-descriptor framework. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214:Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution: /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:216:1. High-confidence non-hand-signed: Cosine > 0.95 AND dHash \leq 5. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:217:Both descriptors converge on strong replication evidence consistent with Firm A's median behavior. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:219:2. Moderate-confidence non-hand-signed: Cosine > 0.95 AND dHash in [6, 15]. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:220:Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:222:3. High style consistency: Cosine > 0.95 AND dHash > 15. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225:4. Uncertain: Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:227:5. Likely hand-signed: Cosine below the KDE crossover threshold. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:229:The dHash thresholds (\leq 5 and \leq 15) are directly derived from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population. /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8:By combining deep-learning visual features with perceptual hashing and three statistically independent threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time. /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:9:After further validation, the technology could support financial regulators in automating signature-authenticity screening at national scale. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:11:The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:12:We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:13:However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:29:Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:30:Table IV summarizes the distributional statistics. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:44:Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:45:Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47:The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:55:## D. Hartigan Dip Test: Unimodality at the Signature Level /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:59:<!-- TABLE V: Hartigan Dip Test Results /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:60:| Distribution | N | dip | p-value | Verdict (α=0.05) | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:62:| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:64:| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:65:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:67:| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:70:Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:71:The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is multimodal (p < 0.001). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:72:At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:75:It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:77:### 1) Burgstahler-Dichev / McCrary Discontinuity /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:79:Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:80:We note that the cosine transition at 0.985 lies inside the non-hand-signed mode rather than at the separation with the hand-signed mode, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:81:In contrast, the dHash transition at distance 2 is a meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication.

exec /bin/zsh -lc "rg -n "three-method|KDE|unit-of-analysis|accountant-level|BD/McCrary|White 1982|Firm A|139\+32|critical|major revision|recommendation" /Volumes/NV2/pdf_recognize/paper/codex_review_gpt54_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: 24:- Three convergent threshold methods: (a) KDE antimode with Hartigan dip test, (b) Burgstahler-Dichev / McCrary discontinuity, (c) 2/3-comp Beta mixture via EM + parallel logit-Gaussian mixture as White 1982 quasi-MLE robustness check. 25:- Accountant-level: Gaussian mixture on (mean cosine, mean dHash), BIC picks K=3. C1 cos=0.983 dh=2.41 w=0.21 n=141 (Firm A 139/141). C2 cos=0.954 dh=7.00 w=0.51 n=361 (other Big-4). C3 cos=0.928 dh=11.17 w=0.28 n=184 (smaller firms). 2-comp marginal crossings cos=0.945 dh=8.10. 26:- Firm A framed as "replication-dominated" not pure: interviews confirm most partners use replication but a minority may still hand-sign. Consistent with 92.5% Firm A cos>0.95 and 32/180 Firm A CPAs in C2. 27:- Pixel-identity validation: 310 sigs with byte-identical nearest same-CPA match as gold positives. Low-cosine anchor (cos<0.70) is only n=35 gold negatives so FAR is order-of-magnitude only. Firm A anchor rates: cos>0.95=92.5%, dual rule cos>0.95 AND dh<=8=89.95%. 28:- Classification: 5 tiers (high/moderate-confidence non-hand-signed, high style consistency, uncertain, likely hand-signed), thresholds from Firm A (dHash median=5, P95=15). 31:Produce a structured IEEE-Access-style peer review covering, in priority order (most critical first): 33:1) Overall assessment and recommendation (accept / minor revision / major revision / reject). 35:2) Technical soundness of the three-method threshold framework and the accountant-level GMM; focus on whether the "continuous signature-level spectrum vs discrete accountant-level behaviour" interpretation is supported by the reported evidence. 37:3) Whether the "replication-dominated Firm A" framing is justified AND whether the paper does enough to prevent circular validation given Firm A is used both to calibrate thresholds and to anchor validation. 49: • Evaluate technical claims around thresholding, accountant-level GMM, and Firm A framing 79:/bin/zsh -lc "rg -n 'Table|Figure|Fig\.|Section|threshold|Firm A|pixel|identity|Hartigan|dip|Burgstahler|McCrary|Beta mixture|Gaussian mixture|GMM|dHash|cosine|validation|anchor|Tier|tier|KDE|antimode|quasi-MLE|White 1982' /Volumes/NV2/pdf_recognize/paper/paper_a_v3.md" in /Volumes/NV2/pdf_recognize 83:/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:14:Third, we introduced a three-method convergent threshold framework combining KDE antimode (with a Hartigan dip test as formal bimodality check), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check). 87:/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19:This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters. 89:/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:30:Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach. 92:/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. 93:/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12:The accountant-level 2-component crossings yield principled thresholds (cosine = 0.945, dHash = 8.10). 98:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:16:The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method convergent analysis and the Hartigan dip test (Sections IV-D and IV-E). 100:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:19:Firm A's signature-level cosine is formally unimodal (dip test p = 0.17) with a long left tail. 101:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:20:The all-CPA signature-level cosine is multimodal (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). 102:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:21:The BD/McCrary discontinuity test locates its transition at 0.985---inside the non-hand-signed mode rather than at a boundary. 104:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:27:A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms. 107:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:32:Methodologically, this implies that the three-method convergent framework (KDE antimode / BD-McCrary / Beta mixture) is meaningful at the level where the mixture structure is statistically supported---namely the accountant level---and serves as diagnostic evidence rather than a primary classification boundary at the signature level. 108:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:34:## C. Firm A as a Replication-Dominated, Not Pure, Population 109:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:36:A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class. 110:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:37:Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference. 111:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:40:First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign. 112:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:41:Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail. 113:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42:Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers. 119:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:60:The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels. 123:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:72:In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly. 126:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:82:The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work. 128:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:92:While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions. 129:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:97:Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation. 130:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:98:This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes. 138:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:34:From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection. 140:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:42:(5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and 143:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:51:One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing. 144:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:52:Structured interviews with multiple Firm A partners confirm that most certifying partners produce their audit-report signatures by reproducing a stored image while not excluding that a minority may continue to hand-sign some reports. 145:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:53:We therefore treat Firm A as a replication-dominated calibration reference rather than a pure positive class. 146:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54:This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode. 148:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level. 150:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:73:4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence. 155:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:84:Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study. 162:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:18:→ Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification 163:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:19:→ Pixel-identity + Firm A + Accountant-level GMM validation 182:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. 183:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:123:These accountant-level aggregates are the input to the mixture model described in Section III-I. 184:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:125:## H. Calibration Reference: Firm A as a Replication-Dominated Population 185:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:127:A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. 186:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:128:Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class. 188:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:131:First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems. 189:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:132:Crucially, the same interview evidence does not exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports. 190:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:134:Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners. 191:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. 192:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:138:We emphasize that Firm A's replication-dominated status was not derived from the thresholds we calibrate against it. 193:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:140:The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference. 196:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:148:### 1) Method 1: KDE + Antimode with Bimodality Check 207:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:180:The three methods rest on decreasing-in-strength assumptions: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification. 213:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:191:The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing behavior is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level quality lies on a continuous spectrum. 215:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:196:## K. Pixel-Identity and Firm A Validation (No Manual Annotation) 219:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:203:2. Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above. 220:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:204:Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity. 224:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:213:The final per-document classification combines the three-method thresholds with the dual-descriptor framework. 225:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214:Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution: 227:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:217:Both descriptors converge on strong replication evidence consistent with Firm A's median behavior. 229:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:220:Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations. 231:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225:4. Uncertain: Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction. 232:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:227:5. Likely hand-signed: Cosine below the KDE crossover threshold. 233:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:229:The dHash thresholds (\leq 5 and \leq 15) are directly derived from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population. 243:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47:The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). 245:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). 248:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:62:| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal | 249:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | 253:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:70:Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews. 258:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:79:Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample. 261:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:85:Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check. 263:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:87:Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data. 266:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:92:This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual behavior (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks. 270:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:109:Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit. 273:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:114:| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) | 274:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120:First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster. 275:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:121:The remaining 32 Firm A CPAs fall into C2, consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D. 280:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:130:| Signature-level KDE crossover | 0.837 | — | 281:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:131:| Signature-level BD/McCrary transition | 0.985 | 2.0 | 282:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:132:| Signature-level Beta 2-comp (Firm A) | 0.977 | — | 285:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:135:| Firm A P95 (median/95th pct calibration) | 0.95 | 15 | 286:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:136:| Firm A median calibration | — | 5 | 287:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:139:The accountant-level two-component crossing (cosine = 0.945, dHash = 8.10) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture. 288:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:142:## F. Calibration Validation with Firm A 289:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:144:Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. 290:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:145:Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?). 291:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:147:<!-- TABLE IX: Firm A Anchor Rates Across Candidate Thresholds 292:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:148:| Rule | Firm A rate | 301:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:162:The Firm A anchor validation is consistent with the replication-dominated framing throughout: the most permissive cosine threshold (the KDE crossover at 0.837) captures nearly all Firm A signatures, while the more stringent thresholds progressively filter out the minority of hand-signing Firm A partners in the left tail. 302:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:163:The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix. 306:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174:| cosine > 0.837 | KDE crossover | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | 309:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:177:| dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 | 311:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:179:| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 | 317:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. 319:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:194:| Verdict | N (PDFs) | % | Firm A | Firm A % | 325:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:209:### 1) Firm A Validation 326:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:211:96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain. 327:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:212:This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers. 328:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:213:The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set. 329:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217:Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer. 330:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:218:This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII). 332:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:233:| KDE crossover | 0.837 | 0.850 | 0.792 | 333:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:234:| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 | 334:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:235:| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 | 335:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:237:Note: Firm A values in this table are computed over all intra-firm pairwise 337:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:251:(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and 357:/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:67:The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions. 366:/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:103:[41] White 1982 — quasi-MLE consistency 390: 18 → Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification 391: 19 → Pixel-identity + Firm A + Accountant-level GMM validation 487: 115 Two unit-of-analysis choices are relevant for this study: (i) the signature---one signature image extracted from one report---and (ii) the accountant---the collection of all signatures attributed to a single CPA across the sample period. 488: 116 A third composite unit---the auditor-year, i.e. all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution. 494: 122 For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. 495: 123 These accountant-level aggregates are the input to the mixture model described in Section III-I. 497: 125 ## H. Calibration Reference: Firm A as a Replication-Dominated Population 499: 127 A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. 500: 128 Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class. 503: 131 First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems. 504: 132 Crucially, the same interview evidence does not exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports. 506: 134 Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners. 508: 136 Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. 510: 138 We emphasize that Firm A's replication-dominated status was not derived from the thresholds we calibrate against it. 512: 140 The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference. 520: 148 ### 1) Method 1: KDE + Antimode with Bimodality Check 552: 180 The three methods rest on decreasing-in-strength assumptions: the KDE antimode requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification. 563: 191 The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing behavior is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level quality lies on a continuous spectrum. 568: 196 ## K. Pixel-Identity and Firm A Validation (No Manual Annotation) 575: 203 2. Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above. 576: 204 Classification rates for Firm A provide convergent evidence with the three-method thresholds without overclaiming purity. 585: 213 The final per-document classification combines the three-method thresholds with the dual-descriptor framework. 586: 214 Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution: 589: 217 Both descriptors converge on strong replication evidence consistent with Firm A's median behavior. 592: 220 Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations. 597: 225 4. Uncertain: Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction. 599: 227 5. Likely hand-signed: Cosine below the KDE crossover threshold. 601: 229 The dHash thresholds (\leq 5 and \leq 15) are directly derived from Firm A's empirical median and 95th percentile rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population. 654: 47 The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). 664: 57 Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). 669: 62 | Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal | 670: 63 | Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | 677: 70 Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews. 686: 79 Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample. 692: 85 Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check. 694: 87 Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data. 699: 92 This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual behavior (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks. 716: 109 Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit. 721: 114 | C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) | 727: 120 First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster. 728: 121 The remaining 32 Firm A CPAs fall into C2, consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D. 737: 130 | Signature-level KDE crossover | 0.837 | — | 738: 131 | Signature-level BD/McCrary transition | 0.985 | 2.0 | 739: 132 | Signature-level Beta 2-comp (Firm A) | 0.977 | — | 742: 135 | Firm A P95 (median/95th pct calibration) | 0.95 | 15 | 743: 136 | Firm A median calibration | — | 5 | 746: 139 The accountant-level two-component crossing (cosine = 0.945, dHash = 8.10) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture. 749: 142 ## F. Calibration Validation with Firm A 751: 144 Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. 752: 145 Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?). 754: 147 <!-- TABLE IX: Firm A Anchor Rates Across Candidate Thresholds 755: 148 | Rule | Firm A rate | 769: 162 The Firm A anchor validation is consistent with the replication-dominated framing throughout: the most permissive cosine threshold (the KDE crossover at 0.837) captures nearly all Firm A signatures, while the more stringent thresholds progressively filter out the minority of hand-signing Firm A partners in the left tail. 770: 163 The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix. 781: 174 | cosine > 0.837 | KDE crossover | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | 784: 177 | dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 | 786: 179 | dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 | 798: 191 Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. 801: 194 | Verdict | N (PDFs) | % | Firm A | Firm A % | 816: 209 ### 1) Firm A Validation 818: 211 96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain. 819: 212 This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers. 820: 213 The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set. 824: 217 Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer. 825: 218 This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII). 840: 233 | KDE crossover | 0.837 | 0.850 | 0.792 | 841: 234 | Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 | 842: 235 | Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 | 844: 237 Note: Firm A values in this table are computed over all intra-firm pairwise 858: 251 (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and 877: 16 The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method convergent analysis and the Hartigan dip test (Sections IV-D and IV-E). 880: 19 Firm A's signature-level cosine is formally unimodal (dip test p = 0.17) with a long left tail. 881: 20 The all-CPA signature-level cosine is multimodal (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). 882: 21 The BD/McCrary discontinuity test locates its transition at 0.985---inside the non-hand-signed mode rather than at a boundary. 888: 27 A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms. 893: 32 Methodologically, this implies that the three-method convergent framework (KDE antimode / BD-McCrary / Beta mixture) is meaningful at the level where the mixture structure is statistically supported---namely the accountant level---and serves as diagnostic evidence rather than a primary classification boundary at the signature level. 895: 34 ## C. Firm A as a Replication-Dominated, Not Pure, Population 897: 36 A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class. 898: 37 Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference. 901: 40 First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign. 902: 41 Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail. 903: 42 Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers. 921: 60 The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels. 933: 72 In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly. 943: 82 The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work. 953: 92 While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions. 956: 95 Extending the accountant-level analysis to auditor-year units is a natural next step. 958: 97 Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation. 959: 98 This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes. 977: 11 Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. 978: 12 The accountant-level 2-component crossings yield principled thresholds (cosine = 0.945, dHash = 8.10). 992: 5 Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection. 1021: 34 From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection. 1029: 42 (5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and 1038: 51 One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing. 1039: 52 Structured interviews with multiple Firm A partners confirm that most certifying partners produce their audit-report signatures by reproducing a stored image while not excluding that a minority may continue to hand-sign some reports. 1040: 53 We therefore treat Firm A as a replication-dominated calibration reference rather than a pure positive class. 1041: 54 This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode. 1044: 57 A third distinctive feature is our unit-of-analysis treatment. 1045: 58 Our three-method convergent framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are cleanly trimodal (BIC-best K = 3). 1047: 60 The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level. 1060: 73 4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence. 1071: 84 Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study. 1144: 67 The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions. 1180: 103 [41] White 1982 — quasi-MLE consistency 1212: 14 Third, we introduced a three-method convergent threshold framework combining KDE antimode (with a Hartigan dip test as formal bimodality check), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check). 1217: 19 This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters. 1225: 27 Extending the accountant-level analysis to auditor-year units---using the same three-method convergent framework but at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade. 1228: 30 Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach. 1331:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:47:The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). 1332:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). 1334:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:109:Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit. 1336:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:144:Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. 1337:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:145:Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?). 1339:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. 1340:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:218:This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII). 1356:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:79:Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample. 1360:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:120:First, component C1 captures 139 of 180 Firm A CPAs (77%) in a tight high-cosine / low-dHash cluster. 1362:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:131:| Signature-level BD/McCrary transition | 0.985 | 2.0 | 1364:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:135:| Firm A P95 (median/95th pct calibration) | 0.95 | 15 | 1365:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:139:The accountant-level two-component crossing (cosine = 0.945, dHash = 8.10) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture. 1370:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:163:The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix. 1374:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:179:| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 | 1375:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. 1379:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:213:The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set. 1380:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217:Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer. 1396:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. 1401:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225:4. Uncertain: Cosine between the KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction. 1402:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54:This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of 180 Firm A CPAs cluster into an accountant-level "middle band" rather than the high-replication mode. 1403:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level. 1408:/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19:This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and the 139/32 split of the calibration firm's 180 CPAs across the accountant-level mixture's high-replication and middle-band clusters. 1409:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:21:The BD/McCrary discontinuity test locates its transition at 0.985---inside the non-hand-signed mode rather than at a boundary. 1411:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:41:Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail. 1412:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42:Third, the accountant-level evidence: 32 of the 180 Firm A CPAs (18%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster of the accountant GMM---consistent with the interview-acknowledged minority of hand-signers. 1416:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:72:In our corpus 310 signatures satisfied this condition, and all candidate thresholds (KDE crossover 0.837; accountant-crossing 0.945; canonical 0.95) classify this population perfectly. 1417:/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. 1418:/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12:The accountant-level 2-component crossings yield principled thresholds (cosine = 0.945, dHash = 8.10). 1428:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:57:Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). 1429:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | 1431:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:85:Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check. 1437:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:139:The accountant-level two-component crossing (cosine = 0.945, dHash = 8.10) is the most defensible of the three-method thresholds because it is derived at the level where dip-test bimodality is statistically supported and the BIC model-selection criterion prefers a non-degenerate mixture. 1445:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:234:| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 | 1446:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:235:| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 | 1448:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:251:(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and 1457:/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. 1461:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:20:The all-CPA signature-level cosine is multimodal (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). 1462:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:27:A BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms. 1465:/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:82:The reported FAR values should therefore be read as order-of-magnitude estimates rather than tight bounds, and a modest expansion of the negative anchor---for example by sampling inter-CPA pairs under explicit accountant-level blocking---would strengthen future work. 1467:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:42:(5) threshold determination using three statistically independent methods---KDE antimode with a Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and 1468:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:58:Our three-method convergent framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are cleanly trimodal (BIC-best K = 3). 1469:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:73:4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three statistically independent methods---KDE antimode with Hartigan dip test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or divergence) across methods as diagnostic evidence. 1473:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. 1475:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:134:Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners. 1476:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:136:Third, our own quantitative analysis converges with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. 1489:Ive finished the core read. The main problems are not about the topic; they are about internal evidentiary standards: the accountant-level “three-method convergence” is not actually shown, some validation is circular, and a few numerical statements are inconsistent enough that they need to be fixed before submission. 1501:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | 1508:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:177:| dHash_indep ≤ 5 | Firm A median | 0.981 | 1.000 | 0.990 | 0.171 | 0.000 | 1510:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:179:| dHash_indep ≤ 15 | Firm A P95 | 0.928 | 1.000 | 0.963 | 0.686 | 0.000 | 1511:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:122:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. 1520:/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:191:Table XI presents the final classification results under the dual-method framework with Firm A-calibrated thresholds for 84,386 documents. 1522:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:213:The final per-document classification combines the three-method thresholds with the dual-descriptor framework. 1523:/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214:Rather than rely on a single cutoff, we partition each document into one of five categories using convergent evidence from both descriptors with thresholds anchored to Firm A's empirical distribution: 1527:/bin/zsh -lc "rg -n 'accountant-level.*KDE|accountant-level.*BD|accountant-level.*Beta|per-accountant.*KDE|per-accountant.*Beta|per-accountant.*BD|two-component fit used for threshold derivation|three methods yield a sharp and consistent boundary|three methods reveal|three-method thresholds are therefore statistically supported at the accountant level' /Volumes/NV2/pdf_recognize/paper/paper_a_v3.md" in /Volumes/NV2/pdf_recognize 1529:/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum with no clean two-mechanism bimodality, while accountant-level aggregates are cleanly trimodal (BIC-best K = 3), reflecting that individual signing behavior is close to discrete even when pixel-level output quality is continuous. 1530:/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:60:The convergent three-method thresholds are therefore statistically supported at the accountant level (cosine crossing at 0.945, dHash crossing at 8.10) and serve as diagnostic evidence rather than primary classification boundaries at the signature level. 1546:print(139+32,180-(139+32)) 1549:/bin/zsh -lc 'python -c "vals=[29529,36994,5133,12683,47]; print(sum(vals)); print(139+32, 180-(139+32))"' in /Volumes/NV2/pdf_recognize 1561:/bin/zsh -lc 'python3 -c "vals=[29529,36994,5133,12683,47]; print(sum(vals)); print(139+32, 180-(139+32))"' in /Volumes/NV2/pdf_recognize 1608: ✓ Evaluate technical claims around thresholding, accountant-level GMM, and Firm A framing 1613:Recommendation: major revision. 1619:- The largest problem is that the manuscript repeatedly claims accountant-level convergence of three methods, but Section IV does not actually show that. Section III-I says the three methods "yield a sharp and consistent boundary" at the accountant level paper_a_methodology_v3.md, and the Introduction says the "convergent three-method thresholds are therefore statistically supported at the accountant level" paper_a_introduction_v3.md. Yet Section IV-E/Table VIII report only one accountant-level threshold source: the 2-component accountant GMM crossing at 0.945 / 8.10 paper_a_results_v3.md, paper_a_results_v3.md. I do not see accountant-level KDE antimodes, accountant-level BD/McCrary thresholds, or accountant-level Beta/logit-mixture thresholds. As written, the core "three-method convergence" claim is unsupported. 1621:- Method 1 is internally inconsistent. Section III-I.1 defines the KDE method as the local density minimum (antimode) of a single distribution paper_a_methodology_v3.md, but Section IV-C reports a KDE crossover where the two density functions intersect paper_a_results_v3.md. Those are different procedures with different decision-theoretic meanings. The manuscript needs to choose one and report it consistently. 1625:- The accountant-level interpretation is overstated. The papers strongest result is that the signature-level data are not well summarized by a 2-component model, while accountant-level means cluster under a BIC-selected 3-component GMM. That is not the same as proving that "individual signing behavior is close to discrete" paper_a_discussion_v3.md, especially when the selected accountant model itself has a very large middle component C2 with weight 0.51 paper_a_results_v3.md. The evidence supports "clustered accountant-level heterogeneity" more clearly than "discrete behavior." 1629:- The BD/McCrary treatment is too loose technically. Section III-I.2 scans histogram bins for a Z^- -> Z^+ transition paper_a_methodology_v3.md, but no uncertainty intervals, multiplicity correction, or sensitivity analysis are reported. I would present this as a heuristic discontinuity diagnostic unless the authors can supply a more rigorous justification. 1633:3. Firm A Framing and Circular Validation 1637:- Table IX is circular if presented as validation. The paper says these rates "play the role of calibration-validation metrics" paper_a_results_v3.md, but several thresholds are themselves anchored to Firm A, and the text explicitly says the final classes use thresholds "anchored to Firm A's empirical distribution" paper_a_methodology_v3.md, paper_a_methodology_v3.md. This is calibration, not external validation. 1639:- The manuscript should break this loop with cross-fitting or a true holdout. For example: calibrate on one subset of Firm A CPAs and evaluate on held-out Firm A CPAs; or calibrate without Firm A and then use Firm A only as an external check. 1641:- There is a numerical inconsistency that must be fixed. Section IV-E says 139 of 180 Firm A CPAs are in C1 and the remaining 32 are in C2 paper_a_results_v3.md, paper_a_results_v3.md. The same 139/32 split is repeated in the Introduction, Discussion, and Conclusion paper_a_introduction_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md. 139 + 32 != 180. The denominator is being mixed or nine CPAs are unaccounted for. 1659:- Section III-L defines five document-level verdicts paper_a_methodology_v3.md, but the rule that maps one or two signature-level outcomes into one PDF-level label is missing. That omission becomes critical in Table XI, which reports 84,386 classified PDFs paper_a_results_v3.md even though Table III reports 85,042 documents with detections paper_a_results_v3.md. 1679:In short: the manuscript has publishable potential, but only after the authors either supply the missing accountant-level analyses and stronger validation, or narrow the claims and present the current evidence as exploratory calibration plus large-scale descriptive analysis rather than definitive threshold validation. 1684:Recommendation: major revision. 1690:- The largest problem is that the manuscript repeatedly claims accountant-level convergence of three methods, but Section IV does not actually show that. Section III-I says the three methods "yield a sharp and consistent boundary" at the accountant level paper_a_methodology_v3.md, and the Introduction says the "convergent three-method thresholds are therefore statistically supported at the accountant level" paper_a_introduction_v3.md. Yet Section IV-E/Table VIII report only one accountant-level threshold source: the 2-component accountant GMM crossing at 0.945 / 8.10 paper_a_results_v3.md, paper_a_results_v3.md. I do not see accountant-level KDE antimodes, accountant-level BD/McCrary thresholds, or accountant-level Beta/logit-mixture thresholds. As written, the core "three-method convergence" claim is unsupported. 1692:- Method 1 is internally inconsistent. Section III-I.1 defines the KDE method as the local density minimum (antimode) of a single distribution paper_a_methodology_v3.md, but Section IV-C reports a KDE crossover where the two density functions intersect paper_a_results_v3.md. Those are different procedures with different decision-theoretic meanings. The manuscript needs to choose one and report it consistently. 1696:- The accountant-level interpretation is overstated. The papers strongest result is that the signature-level data are not well summarized by a 2-component model, while accountant-level means cluster under a BIC-selected 3-component GMM. That is not the same as proving that "individual signing behavior is close to discrete" paper_a_discussion_v3.md, especially when the selected accountant model itself has a very large middle component C2 with weight 0.51 paper_a_results_v3.md. The evidence supports "clustered accountant-level heterogeneity" more clearly than "discrete behavior." 1700:- The BD/McCrary treatment is too loose technically. Section III-I.2 scans histogram bins for a Z^- -> Z^+ transition paper_a_methodology_v3.md, but no uncertainty intervals, multiplicity correction, or sensitivity analysis are reported. I would present this as a heuristic discontinuity diagnostic unless the authors can supply a more rigorous justification. 1704:3. Firm A Framing and Circular Validation 1708:- Table IX is circular if presented as validation. The paper says these rates "play the role of calibration-validation metrics" paper_a_results_v3.md, but several thresholds are themselves anchored to Firm A, and the text explicitly says the final classes use thresholds "anchored to Firm A's empirical distribution" paper_a_methodology_v3.md, paper_a_methodology_v3.md. This is calibration, not external validation. 1710:- The manuscript should break this loop with cross-fitting or a true holdout. For example: calibrate on one subset of Firm A CPAs and evaluate on held-out Firm A CPAs; or calibrate without Firm A and then use Firm A only as an external check. 1712:- There is a numerical inconsistency that must be fixed. Section IV-E says 139 of 180 Firm A CPAs are in C1 and the remaining 32 are in C2 paper_a_results_v3.md, paper_a_results_v3.md. The same 139/32 split is repeated in the Introduction, Discussion, and Conclusion paper_a_introduction_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md. 139 + 32 != 180. The denominator is being mixed or nine CPAs are unaccounted for. 1730:- Section III-L defines five document-level verdicts paper_a_methodology_v3.md, but the rule that maps one or two signature-level outcomes into one PDF-level label is missing. That omission becomes critical in Table XI, which reports 84,386 classified PDFs paper_a_results_v3.md even though Table III reports 85,042 documents with detections paper_a_results_v3.md. 1750:In short: the manuscript has publishable potential, but only after the authors either supply the missing accountant-level analyses and stronger validation, or narrow the claims and present the current evidence as exploratory calibration plus large-scale descriptive analysis rather than definitive threshold validation.

codex Ive recovered the earlier review points. Next Im reading the revised methodology, results, discussion, and conclusion with line numbers to check whether the new claims are actually supported in-text, especially around Section IV-H and the new accountant-level thresholding. exec /bin/zsh -lc 'nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # Abstract 2 3 4 5 Regulations in many jurisdictions require Certified Public Accountants (CPAs) to attest to each audit report they certify, typically by affixing a signature or seal. 6 However, the digitization of financial reporting makes it straightforward to reuse a stored signature image across multiple reports---whether by administrative stamping or firm-level electronic signing systems---potentially undermining the intent of individualized attestation. 7 Unlike signature forgery, where an impostor imitates another person's handwriting, non-hand-signed reproduction involves the legitimate signer's own stored signature image being reproduced on each report, a practice that is visually invisible to report users and infeasible to audit at scale through manual inspection. 8 We present an end-to-end AI pipeline that automatically detects non-hand-signed auditor signatures in financial audit reports. 9 The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash). 10 For threshold determination we apply three methodologically distinct methods---Kernel Density antimode with a Hartigan unimodality test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level. 11 Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum that no two-component mixture cleanly separates, while accountant-level aggregates are clustered into three recognizable groups (BIC-best K = 3) with the three 1D methods converging within $\sim$0.006 of each other at cosine \approx 0.975. 12 A major Big-4 firm is used as a replication-dominated (not pure) calibration anchor, with interview and visual evidence supporting majority non-hand-signing and a minority of hand-signers; we break the circularity of using the same firm for calibration and validation by a 70/30 CPA-level held-out fold. 13 Validation against 310 byte-identical positive signatures and a $\sim$50,000-pair inter-CPA negative anchor yields FAR \leq 0.001 with Wilson 95% confidence intervals at all accountant-level thresholds. 14 To our knowledge, this represents the largest-scale forensic analysis of auditor signature authenticity reported in the literature. 15 16

exec /bin/zsh -lc 'nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md' in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc 'nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # I. Introduction 2 3 4 5 Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection. 6 In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require that certifying CPAs affix their signature or seal (簽名或蓋章) to each audit report [1]. 7 While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2]. 8 9 The digitization of financial reporting has introduced a practice that complicates this intent. 10 As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one. 11 This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step. 12 From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise. 13 We refer to signatures produced by either workflow collectively as non-hand-signed. 14 Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement. 15 Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused. 16 This practice, while potentially widespread, is visually invisible to report users and virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of image reproduction. 17 18 The distinction between non-hand-signing detection and signature forgery detection is both conceptually and technically important. 19 The extensive body of research on offline signature verification [3]--[8] has focused almost exclusively on forgery detection---determining whether a questioned signature was produced by its purported author or by an impostor. 20 This framing presupposes that the central threat is identity fraud. 21 In our context, identity is not in question; the CPA is indeed the legitimate signer. 22 The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports. 23 This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction, requiring an analytical framework focused on detecting abnormally high similarity across documents. 24 25 A secondary methodological concern shapes the research design. 26 Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification. 27 Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference. 28 A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, F_1, and equal-error-rate metrics that prevail in the biometric-verification literature. 29 30 Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. 31 Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image. 32 Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures, where legitimate visual similarity between a signer's authentic signatures is expected and must be distinguished from image reproduction. 33 Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. 34 From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection. 35 36 In this paper, we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale. 37 Our approach processes raw PDF documents through the following stages: 38 (1) signature page identification using a Vision-Language Model (VLM); 39 (2) signature region detection using a trained YOLOv11 object detector; 40 (3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network; 41 (4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance; 42 (5) threshold determination using three methodologically distinct methods---KDE antimode with a Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and 43 (6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm. 44 45 The dual-descriptor verification is central to our contribution. 46 Cosine similarity of deep feature embeddings captures high-level visual style similarity---it can identify signatures that share similar stroke patterns and spatial layouts---but cannot distinguish between a CPA who signs consistently and one whose signature is reproduced from a stored image. 47 Perceptual hashing (specifically, difference hashing) encodes structural-level image gradients into compact binary fingerprints that are robust to scan noise but sensitive to substantive content differences. 48 By requiring convergent evidence from both descriptors, we can differentiate style consistency (high cosine but divergent dHash) from image reproduction (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone. 49 50 A second distinctive feature is our framing of the calibration reference. 51 One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing. 52 Structured interviews with multiple Firm A partners confirm that most certifying partners produce their audit-report signatures by reproducing a stored image while not excluding that a minority may continue to hand-sign some reports. 53 We therefore treat Firm A as a replication-dominated calibration reference rather than a pure positive class. 54 This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in total) cluster into an accountant-level "middle band" rather than the high-replication mode. 55 Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence between the interview evidence and the statistical results. 56 57 A third distinctive feature is our unit-of-analysis treatment. 58 Our three-method framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are clustered into three recognizable groups (BIC-best K = 3). 59 The substantive reading is that pixel-level output quality is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, while accountant-level aggregate behaviour is clustered but not sharply discrete---a given CPA tends to cluster into a dominant regime (high-replication, middle-band, or hand-signed-tendency), though the boundaries between regimes are smooth rather than discontinuous. 60 Among the three accountant-level methods, KDE antimode and the two mixture-based estimators converge within \sim 0.006 on a cosine threshold of approximately 0.975, while the Burgstahler-Dichev / McCrary discontinuity test finds no significant transition at the accountant level---an outcome consistent with smoothly mixed clusters rather than a failure of the method. 61 The two-dimensional GMM marginal crossings (cosine = 0.945, dHash = 8.10) are reported as a complementary cross-check rather than as the primary accountant-level threshold. 62 63 We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. 64 To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature. 65 66 The contributions of this paper are summarized as follows: 67 68 1. Problem formulation. We formally define non-hand-signing detection as distinct from signature forgery detection and argue that it requires an analytical framework focused on intra-signer similarity distributions rather than genuine-versus-forged classification. 69 70 2. End-to-end pipeline. We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity computation, with automated inference requiring no manual intervention after initial training and annotation. 71 72 3. Dual-descriptor verification. We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures. 73 74 4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three methodologically distinct methods---KDE antimode with Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or principled divergence) across methods as diagnostic evidence about the mixture structure of the data. 75 76 5. Continuous-quality / clustered-accountant finding. We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates cluster into three recognizable groups with smoothly mixed rather than sharply discrete boundaries---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics. 77 78 6. Replication-dominated calibration methodology. We introduce a calibration strategy using a known-majority-positive reference group, distinguishing replication-dominated from replication-pure anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling. 79 80 7. Large-scale empirical analysis. We report findings from the analysis of over 90,000 audit reports spanning a decade, providing the first large-scale empirical evidence on non-hand-signing practices in financial reporting under a methodology designed for peer-review defensibility. 81 82 The remainder of this paper is organized as follows. 83 Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination. 84 Section III describes the proposed methodology. 85 Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study. 86 Section V discusses the implications and limitations of our findings. 87 Section VI concludes with directions for future work.

succeeded in 0ms: 1 # III. Methodology 2 3 ## A. Pipeline Overview 4 5 We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents. 6 Fig. 1 illustrates the overall architecture. 7 The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor. 8 9 Throughout this paper we use the term non-hand-signed rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). 10 From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise. 11 12 21 22 ## B. Data Collection 23 24 The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023. 25 The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings. 26 An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period. 27 Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs. 28 29 CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports. 30 Table I summarizes the dataset composition. 31 32 41 42 ## C. Signature Page Identification 43 44 To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism. 45 Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature. 46 The model was configured with temperature 0 for deterministic output. 47 48 The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document. 49 Scanning terminated upon the first positive detection. 50 This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded. 51 An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents. 52 53 Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents. 54 The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling. 55 56 ## D. Signature Detection 57 58 We adopted YOLOv11n (nano variant) [25] for signature region localization. 59 A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction. 60 A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps. 61 62 The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II). 63 64 72 73 Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers). 74 A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content. 75 76 Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures). 77 78 ## E. Feature Extraction 79 80 Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning. 81 The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer. 82 83 Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization. 84 All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product. 85 86 The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D). 87 This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0. 88 89 ## F. Dual-Method Similarity Descriptors 90 91 For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA: 92 93 Cosine similarity on deep embeddings captures high-level visual style: 94 95 \text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B 96 97 where \mathbf{f}_A and \mathbf{f}_B are L2-normalized 2048-dim feature vectors. 98 Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14]. 99 100 Perceptual hash distance (dHash) captures structural-level similarity. 101 Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint. 102 The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images. 103 Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19]. 104 105 These descriptors provide partially independent evidence. 106 Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise. 107 Non-hand-signing yields extreme similarity under both descriptors, since the underlying image is identical up to reproduction noise. 108 Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies). 109 Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline. 110 111 We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content. 112 Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle. 113 114 ## G. Unit of Analysis and Summary Statistics 115 116 Two unit-of-analysis choices are relevant for this study: (i) the signature---one signature image extracted from one report---and (ii) the accountant---the collection of all signatures attributed to a single CPA across the sample period. 117 A third composite unit---the auditor-year, i.e. all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution. 118 119 For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA. 120 The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism. 121 Mean statistics would dilute this signal. 122 123 We also adopt an explicit within-auditor-year no-mixing identification assumption. 124 Specifically, within any single fiscal year we treat a given CPA's signing mechanism as uniform: a CPA who reproduces one signature image in that year is assumed to do so for every report, and a CPA who hand-signs in that year is assumed to hand-sign every report in that year. 125 Interview evidence from Firm A partners supports this assumption for their firm during the sample period. 126 Under the assumption, per-auditor-year summary statistics are well defined and robust to outliers: if even one pair of same-CPA signatures in the year is near-identical, the max/min captures it. 127 The intra-report consistency analysis in Section IV-H.3 provides an empirical check on the within-auditor-year assumption at the report level. 128 129 For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. 130 The independent minimum dHash of a signature is defined as the minimum Hamming distance to any other signature of the same CPA (over the full same-CPA set), in contrast to the cosine-conditional dHash used as a diagnostic elsewhere, which is the dHash to the single signature selected as the cosine-nearest match. 131 The independent minimum avoids conditioning on the cosine choice and is therefore the conservative structural-similarity statistic for each signature. 132 These accountant-level aggregates are the input to the mixture model described in Section III-J and to the accountant-level three-method analysis in Section III-I.5. 133 134 ## H. Calibration Reference: Firm A as a Replication-Dominated Population 135 136 A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference. 137 Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring replication-dominated population: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class. 138 This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration. 139 140 First, structured interviews with multiple Firm A partners confirm that most certifying partners at Firm A produce their audit-report signatures by reproducing a stored signature image---originally via administrative stamping workflows and later via firm-level electronic signing systems. 141 Crucially, the same interview evidence does not exclude the possibility that a minority of Firm A partners continue to hand-sign some or all of their reports. 142 143 Second, independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners. 144 145 Third, our own quantitative analysis is consistent with the above: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% exhibit lower best-match values consistent with the minority of hand-signers identified in the interviews. 146 147 Fourth, we additionally validate the Firm A benchmark through two analyses that do not depend on any threshold we subsequently calibrate: 148 (a) Partner-level similarity ranking (Section IV-H.2). When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. 149 (b) Intra-report consistency (Section IV-H.3). Because each Taiwanese statutory audit report is co-signed by two engagement partners, firmwide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms, consistent with firm-wide rather than partner-specific practice. 150 151 We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the interview and visual-inspection evidence, by the two threshold-independent analyses above, and by the held-out Firm A fold described in Section III-K. 152 153 We emphasize that Firm A's replication-dominated status was not derived from the thresholds we calibrate against it. 154 Its identification rests on domain knowledge and visual evidence that is independent of the statistical pipeline. 155 The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference. 156 157 ## I. Three-Method Convergent Threshold Determination 158 159 Direct assignment of thresholds based on prior intuition (e.g., cosine \geq 0.95 for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others. 160 To place threshold selection on a statistically principled and data-driven footing, we apply three methodologically distinct methods whose underlying assumptions decrease in strength. 161 The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence is therefore a diagnostic of distributional structure rather than a formal statistical guarantee. 162 When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement itself is informative about whether the data support a single clean decision boundary at a given level. 163 164 ### 1) Method 1: KDE Antimode / Crossover with Unimodality Test 165 166 We use two closely related KDE-based threshold estimators and apply each where it is appropriate. 167 When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the KDE crossover is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes. 168 When a single distribution is analyzed (e.g., the per-accountant cosine mean of Section IV-E) the KDE antimode is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal. 169 In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality (rejecting the null of unimodality is consistent with but does not directly establish bimodality specifically), and perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify threshold stability. 170 171 ### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity 172 173 We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39]. 174 We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin i with count n_i, the standardized deviation from the smooth-null expectation of the average of its neighbours, 175 176 Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}}, 177 178 which is approximately N(0,1) under the null of distributional smoothness. 179 A threshold is identified at the transition where Z_{i-1} is significantly negative (observed count below expectation) adjacent to Z_i significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition Z^- \rightarrow Z^+ marks the candidate decision boundary. 180 181 ### 3) Method 3: Finite Mixture Model via EM 182 183 We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data). 184 The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread). 185 Under the fitted model the threshold is the crossing point of the two weighted component densities, 186 187 \pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2), 188 189 solved numerically via bracketed root-finding. 190 As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the logit-transformed similarity, following standard practice for bounded proportion data. 191 White's [41] quasi-MLE consistency result justifies interpreting the logit-Gaussian estimates as asymptotic approximations to the best Gaussian-family fit under misspecification; we use the cross-check between Beta and logit-Gaussian crossings as a diagnostic of parametric-form sensitivity rather than as a guarantee of distributional recovery. 192 193 We fit 2- and 3-component variants of each mixture and report BIC for model selection. 194 When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut. 195 196 ### 4) Convergent Validation and Level-Shift Diagnostic 197 198 The three methods rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification. 199 If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method. 200 201 Equally informative is the level at which the methods agree or disagree. 202 Applied to the per-signature similarity distribution the three methods yield thresholds spread across a wide range because per-signature similarity is not a cleanly bimodal population (Section IV-D). 203 Applied to the per-accountant cosine mean, Methods 1 (KDE antimode) and 3 (Beta-mixture crossing and its logit-Gaussian counterpart) converge within a narrow band, whereas Method 2 (BD/McCrary) does not produce a significant transition because the accountant-mean distribution is smooth at the bin resolution the test requires. 204 This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than a discrete discontinuity, and we interpret it accordingly in Section V rather than treating disagreement among methods as a failure. 205 206 ### 5) Accountant-Level Three-Method Analysis 207 208 In addition to applying the three methods at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures. 209 The accountant-level estimates provide the methodologically defensible threshold reference used in the per-document classification of Section III-L. 210 All three methods are reported with their estimates and, where applicable, cross-method spreads. 211 212 ## J. Accountant-Level Mixture Model 213 214 In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash). 215 The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing behavior is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level quality lies on a continuous spectrum. 216 217 We fit mixtures with K \in \{1, 2, 3, 4, 5\} components under full covariance, selecting K^* by BIC with 15 random initializations per K. 218 For the selected K^* we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds. 219 220 ## K. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation (No Manual Annotation) 221 222 Rather than construct a stratified manual-annotation validation set, we validate the classifier using four naturally occurring reference populations that require no human labeling: 223 224 1. Pixel-identical anchor (gold positive, conservative subset): signatures whose nearest same-CPA match is byte-identical after crop and normalization. 225 Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for the byte-identical subset of non-hand-signed signatures. 226 We emphasize that this anchor is a subset of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further). 227 228 2. Inter-CPA negative anchor (large gold negative): $\sim$50,000 pairs of signatures randomly sampled from different CPAs. 229 Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps. 230 This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold. 231 232 3. Firm A anchor (replication-dominated prior positive): Firm A signatures, treated as a majority-positive reference whose left tail is known to contain a minority of hand-signers per the interview evidence above. 233 Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we break the resulting circularity by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% calibration fold and a 30% heldout fold. 234 Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions are derived from the calibration fold only. 235 The heldout fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals. 236 237 4. Low-similarity same-CPA anchor (supplementary negative): signatures whose maximum same-CPA cosine similarity is below 0.70. 238 This anchor is retained for continuity with prior work but is small in our dataset (n = 35) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference. 239 240 From these anchors we report precision, recall, F_1, FAR with Wilson 95% confidence intervals, and the Equal Error Rate (EER) interpolated at the threshold where FAR = FRR, following biometric-verification reporting conventions [3]. 241 We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics. 242 243 ## L. Per-Document Classification 244 245 The final per-document classification combines the three-method thresholds with the dual-descriptor framework. 246 Rather than rely on a single cutoff, we assign each signature to one of five signature-level categories using convergent evidence from both descriptors with thresholds derived from the Firm A calibration fold (Section III-K): 247 248 1. High-confidence non-hand-signed: Cosine > 0.95 AND dHash \leq (calibration-fold Firm A dHash median). 249 Both descriptors converge on strong replication evidence consistent with Firm A's median behavior. 250 251 2. Moderate-confidence non-hand-signed: Cosine > 0.95 AND dHash between the calibration-fold dHash median and 95th percentile. 252 Feature-level evidence is strong; structural similarity is present but below the Firm A median, potentially due to scan variations. 253 254 3. High style consistency: Cosine > 0.95 AND dHash > calibration-fold Firm A dHash 95th percentile. 255 High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction. 256 257 4. Uncertain: Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction. 258 259 5. Likely hand-signed: Cosine below the all-pairs KDE crossover threshold. 260 261 Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the most-replication-consistent signature label (i.e., among the two signatures, the label rank ordered High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed determines the document's classification). 262 This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge. 263 The dHash thresholds (\leq 5 and \leq 15, corresponding to the calibration-fold Firm A dHash median and 95th percentile) are derived empirically rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population.

exec /bin/zsh -lc 'nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # V. Discussion 2 3 ## A. Non-Hand-Signing Detection as a Distinct Problem 4 5 Our results highlight the importance of distinguishing non-hand-signing detection from the well-studied signature forgery detection problem. 6 In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature. 7 In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image). 8 9 This distinction has direct methodological consequences. 10 Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures. 11 Non-hand-signing detection, by contrast, requires sensitivity to the upper tail of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous. 12 The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot. 13 14 ## B. Continuous-Quality Spectrum vs. Clustered Accountant-Level Heterogeneity 15 16 The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method framework and the Hartigan dip test (Sections IV-D and IV-E). 17 18 At the per-signature level, the distribution of best-match cosine similarity is not cleanly bimodal. 19 Firm A's signature-level cosine is formally unimodal (dip test p = 0.17) with a long left tail. 20 The all-CPA signature-level cosine rejects unimodality (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). 21 The BD/McCrary discontinuity test locates its transition at 0.985---inside the non-hand-signed mode rather than at a boundary between two mechanisms. 22 Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class. 23 Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level. 24 25 At the per-accountant aggregate level the picture partly reverses. 26 The distribution of per-accountant mean cosine (and mean dHash) rejects unimodality, a BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms, and the three 1D threshold methods applied at the accountant level produce mutually consistent estimates (KDE antimode = 0.973, Beta-2 crossing = 0.979, logit-GMM-2 crossing = 0.976). 27 The BD/McCrary test, however, does not produce a significant transition at the accountant level either, in contrast to the signature level. 28 This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the transitions between them are gradual rather than discrete at the bin resolution BD/McCrary requires. 29 30 The substantive interpretation we take from this evidence is therefore narrower than a "discrete-behaviour" claim: pixel-level output quality is continuous and heavy-tailed, and accountant-level aggregate behaviour is clustered (three recognizable groups) but not sharply discrete. 31 The accountant-level mixture is a useful classifier of firm-and-practitioner-level signing regimes; individual behaviour may still transition or mix over time within a practitioner, and our cross-sectional analysis does not rule this out. 32 Methodologically, the implication is that the three 1D methods are meaningfully applied at the accountant level for threshold estimation, while the BD/McCrary non-transition at the same level is itself diagnostic of smoothness rather than a failure of the method. 33 34 ## C. Firm A as a Replication-Dominated, Not Pure, Population 35 36 A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class. 37 Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference. 38 39 Three convergent strands of evidence support the replication-dominated framing. 40 First, the interview evidence itself: Firm A partners report that most certifying partners at the firm use non-hand-signing, without excluding the possibility that a minority continue to hand-sign. 41 Second, the statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail. 42 Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures (\geq 10) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---consistent with the interview-acknowledged minority of hand-signers. 43 Nine additional Firm A CPAs are excluded from the GMM for having fewer than 10 signatures, so we cannot place them in a cluster from the cross-sectional analysis alone. 44 The held-out Firm A 70/30 validation (Section IV-G.2) gives capture rates on a non-calibration Firm A subset that are within the Wilson 95% CIs of the whole-sample rates, indicating that the statistical signature of the replication-dominated framing is stable to the CPA sub-sample used for calibration. 45 46 The replication-dominated framing is internally coherent with all three pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise. 47 We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors. 48 49 ## D. The Style-Replication Gap 50 51 Within the 71,656 documents exceeding cosine 0.95, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance. 52 A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations. 53 54 The 7.2% classified as "high style consistency" (cosine > 0.95 but dHash > 15) are particularly informative. 55 Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions. 56 Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting. 57 Some may use signing pads or templates that further constrain variability without constituting image-level reproduction. 58 The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence. 59 60 ## E. Value of a Replication-Dominated Calibration Group 61 62 The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels. 63 In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance. 64 Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset. 65 66 This calibration strategy has broader applicability beyond signature analysis. 67 Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives. 68 The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates interview-acknowledged heterogeneity, and yields classification rates that are internally consistent with the data. 69 70 ## F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation 71 72 A further methodological contribution is the combination of byte-level pixel identity as an annotation-free conservative gold positive and a large random-inter-CPA negative anchor. 73 Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review. 74 In our corpus 310 signatures satisfied this condition. 75 We emphasize that byte-identical pairs are a subset of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways). 76 Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate. 77 78 Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), which is a substantive improvement over the low-similarity same-CPA negative (n = 35) we originally considered. 79 The combination is a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself, provided that its generalization limits are acknowledged: FAR is informative, whereas recall is valid only for the conservative subset. 80 81 ## G. Limitations 82 83 Several limitations should be acknowledged. 84 85 First, comprehensive per-document ground truth labels are not available. 86 The pixel-identity anchor is a strict subset of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class. 87 The low-similarity same-CPA anchor (n = 35) is small because intra-CPA pairs rarely fall below cosine 0.70; we use the $\sim$50,000-pair inter-CPA negative anchor as the primary negative reference, which yields tight Wilson 95% CIs on FAR (Table X), but it too does not exhaust the set of true negatives (in particular, same-CPA hand-signed pairs with moderate cosine similarity are not sampled). 88 A manual-adjudication study concentrated at the decision boundary---for example 100--300 auditor-years stratified by cosine band---would further strengthen the recall estimate against the full positive class. 89 90 Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning. 91 While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance. 92 93 Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. 94 In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity. 95 This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified. 96 97 Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements. 98 While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions. 99 100 Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted non-hand-signing later). 101 Extending the accountant-level analysis to auditor-year units is a natural next step. 102 103 Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, reflecting that this test is a diagnostic of local density-smoothness violation rather than of two-mechanism separation. 104 This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes. 105 106 Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing." 107 Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.

exec /bin/zsh -lc 'nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # IV. Experiments and Results 2 3 ## A. Experimental Setup 4 5 All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration. 6 Feature extraction used PyTorch 2.9 with torchvision model implementations. 7 The complete pipeline---from raw PDF processing through final classification---was implemented in Python. 8 9 ## B. Signature Detection Performance 10 11 The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). 12 We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). 13 However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. 14 The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale. 15 16 26 27 ## C. All-Pairs Intra-vs-Inter Class Distribution Analysis 28 29 Fig. 2 presents the cosine similarity distributions computed over the full set of pairwise comparisons under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). 30 This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L). 31 Table IV summarizes the distributional statistics. 32 33 43 44 Both distributions are left-skewed and leptokurtic. 45 Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. 46 Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions. 47 48 The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V). 49 Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes. 50 Statistical tests confirmed significant separation between the two distributions (Cohen's d = 0.669, Mann-Whitney p < 0.001, K-S 2-sample p < 0.001). 51 52 We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength. 53 We therefore rely primarily on Cohen's d as an effect-size measure that is less sensitive to sample size. 54 A Cohen's d of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count. 55 56 ## D. Hartigan Dip Test: Unimodality at the Signature Level 57 58 Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V). 59 60 70 71 Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in interviews. 72 The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is multimodal (p < 0.001). 73 At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E. 74 75 This asymmetry between signature level and accountant level is itself an empirical finding. 76 It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses. 77 78 ### 1) Burgstahler-Dichev / McCrary Discontinuity 79 80 Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a single significant transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample. 81 We note that the cosine transition at 0.985 lies inside the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal. 82 In contrast, the dHash transition at distance 2 is a substantively meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication. 83 At the accountant level the test does not produce a significant Z^- \rightarrow Z^+ transition in either the cosine-mean or the dHash-mean distribution (Section IV-E), reflecting that accountant aggregates are smooth at the bin resolution the test requires rather than exhibiting a sharp density discontinuity. 84 85 ### 2) Beta Mixture at Signature Level: A Forced Fit 86 87 Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit (\Delta\text{BIC} = 381), with a parallel preference under the logit-GMM robustness check. 88 For the full-sample cosine the 3-component fit is likewise strongly preferred (\Delta\text{BIC} = 10{,}175). 89 Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data. 90 Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980. 91 92 The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: at the per-signature level, no two-mechanism mixture explains the data. 93 Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing. 94 This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual behavior (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks. 95 96 ## E. Accountant-Level Gaussian Mixture 97 98 We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}. 99 BIC selects K^* = 3 (Table VI). 100 101 110 111 Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit. 112 113 120 121 Three empirical findings stand out. 122 First, of the 180 CPAs in the Firm A registry, 171 have \geq 10 signatures and therefore enter the accountant-level GMM (the remaining 9 have too few signatures for reliable aggregates and are excluded from this analysis only). 123 Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2. 124 This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D. 125 Second, the three-component partition is not a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3. 126 Third, applying the three-method framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in Table VIII: KDE antimode = 0.973, Beta-2 crossing = 0.979, and the logit-GMM-2 crossing = 0.976 converge within \sim 0.006 of each other, while the BD/McCrary test does not produce a significant transition at the accountant level. 127 For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine = 0.945 and dHash = 8.10; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation. 128 129 138 139 Table VIII then summarizes all threshold estimates produced by the three methods across the two analysis levels for a compact cross-level comparison. 140 141 157 158 Methods 1 and 3 (KDE antimode, Beta-2 crossing, and its logit-GMM robustness check) converge at the accountant level to a cosine threshold of \approx 0.975 \pm 0.003 and a dHash threshold of \approx 3.8 \pm 0.4, while Method 2 (BD/McCrary) does not produce a significant discontinuity. 159 This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine = 0.945, dHash = 8.10) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check. 160 The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries. 161 162 ## F. Calibration Validation with Firm A 163 164 Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. 165 Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?). 166 167 183 184 Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H. 185 The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix. 186 Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot. 187 188 ## G. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation 189 190 We report three validation analyses corresponding to the anchors of Section III-K. 191 192 ### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor 193 194 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor. 195 As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean = 0.762, P_{95} = 0.884, P_{99} = 0.913, max = 0.988). 196 Table X reports precision, recall, F_1, FAR with Wilson 95% confidence intervals, and FRR at each candidate threshold. 197 The Equal-Error-Rate point, interpolated at FAR = FRR, is located at cosine = 0.990 with EER \approx 0, which is trivially small because pixel-identical positives are all at cosine very close to 1. 198 199 209 210 Two caveats apply. 211 First, the gold-positive anchor is a conservative subset of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical. 212 Perfect recall against this subset does not establish perfect recall against the broader positive class, and the reported recall should therefore be interpreted as a lower-bound calibration check rather than a generalizable recall estimate. 213 Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X. 214 The very low FAR at the accountant-level thresholds is therefore informative. 215 216 ### 2) Held-Out Firm A Validation (breaks calibration-validation circularity) 217 218 We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures). 219 Thresholds are re-derived from calibration-fold percentiles only. 220 Table XI reports heldout-fold capture rates with Wilson 95% confidence intervals. 221 222 237 238 The held-out rates match the whole-sample rates of Table IX within each rule's Wilson confidence interval, confirming that the calibration-derived thresholds generalize to Firm A CPAs that did not contribute to calibration. 239 The dual rule cosine > 0.95 AND dHash \leq 8 captures 91.54% [91.09%, 91.97%] of the held-out Firm A population, consistent with Firm A's interview-reported signing mix and with the replication-dominated framing of Section III-H. 240 241 ### 3) Sanity Sample 242 243 A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics. 244 245 ## H. Firm A Benchmark Validation: Threshold-Independent Evidence 246 247 The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles. 248 This section reports three additional analyses that are threshold-independent in the sense that their findings do not depend on any cutoff we calibrate to Firm A, and therefore constitute genuine benchmark-validation evidence rather than a circular check. 249 250 ### 1) Year-by-Year Stability of the Firm A Left Tail 251 252 Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year. 253 Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign. 254 Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023. 255 256 271 272 The left tail is stable at 6-13% throughout the sample period and shows no pre/post-2020 level shift: the 2013-2019 mean left-tail share is 8.26% and the 2020-2023 mean is 6.96%. 273 The lowest observed share is in 2023 (3.75%), consistent with firm-level electronic signing systems producing more uniform output than earlier manual scanning-and-stamping, not less. 274 This stability supports the replication-dominated framing: a persistent minority of hand-signing Firm A partners is consistent with a Beta left tail that is stable across production technologies, whereas a noise-only explanation would predict a shrinking share as technology improved. 275 276 ### 2) Partner-Level Similarity Ranking 277 278 If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all Big-4 auditor-years. 279 We test this prediction directly. 280 281 For each auditor-year (CPA \times fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023. 282 Firm A accounts for 1,287 of these (27.8% baseline share). 283 Table XIV reports per-firm occupancy of the top K\% of the ranked distribution. 284 285 292 293 Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile. 294 Year-by-year (Table XV), the top-10% Deloitte share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period. 295 296 311 312 This over-representation is a direct consequence of firm-wide stamping practice and is not derived from any threshold we subsequently calibrate. 313 It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status. 314 315 ### 3) Intra-Report Consistency 316 317 Taiwanese statutory audit reports are co-signed by two engagement partners (a primary and a secondary signer). 318 Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification. 319 Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific. 320 321 For each report with exactly two signatures and complete per-signature data (93,979 reports), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree. 322 Table XVI reports per-firm intra-report agreement. 323 324 336 337 Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having both signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed. 338 The other Big-4 firms and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap. 339 This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) stamping practice. 340 341 Like the partner-level ranking, this test does not depend on any threshold we calibrate to Firm A; the firm-vs-firm comparison is invariant to the absolute cutoff so long as the cutoff is applied uniformly. 342 343 ## I. Classification Results 344 345 Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents. 346 The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here. 347 348 359 360 Within the 71,656 documents exceeding cosine 0.95, the dHash dimension stratifies them into three distinct populations: 361 29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash \leq 5); 362 36,994 (51.7%) show partial structural similarity (dHash in [6, 15]) consistent with replication degraded by scan variations; 363 and 5,133 (7.2%) show no structural corroboration (dHash > 15), suggesting high signing consistency rather than image reproduction. 364 A cosine-only classifier would treat all 71,656 identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations. 365 366 ### 1) Firm A Capture Profile (Consistency Check) 367 368 96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain. 369 This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with Firm A's interview-acknowledged minority of hand-signers. 370 The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set. 371 We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check. 372 373 ### 2) Cross-Method Agreement 374 375 Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer. 376 This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII). 377 378 ## J. Ablation Study: Feature Backbone Comparison 379 380 To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). 381 All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. 382 Table XVIII presents the comparison. 383 384 401 402 EfficientNet-B0 achieves the highest Cohen's d (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions. 403 However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), resulting in lower per-sample classification confidence. 404 VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task. 405 406 ResNet-50 provides the best overall balance: 407 (1) Cohen's d of 0.669 is competitive with EfficientNet-B0's 0.707; 408 (2) its tighter distributions yield more reliable individual classifications; 409 (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and 410 (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.

exec /bin/zsh -lc 'nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # VI. Conclusion and Future Work 2 3 ## Conclusion 4 5 We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale. 6 Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels. 7 8 Our contributions are fourfold. 9 10 First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability. 11 12 Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction. 13 14 Third, we introduced a three-method threshold framework combining KDE antimode (with a Hartigan unimodality test), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check). 15 Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the three 1D methods agree within \sim 0.006 at cosine \approx 0.975. 16 The Burgstahler-Dichev / McCrary test finds no significant transition at the accountant level, consistent with clustered but smoothly mixed rather than sharply discrete accountant-level behavior. 17 The substantive reading is therefore narrower than "discrete behavior": pixel-level output quality is continuous and heavy-tailed, and accountant-level aggregate behavior is clustered with smooth cluster boundaries. 18 19 Fourth, we introduced a replication-dominated calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor. 20 To break the circularity of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report post-hoc capture rates on the held-out fold with Wilson 95% confidence intervals. 21 This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters. 22 23 An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task. 24 25 ## Future Work 26 27 Several directions merit further investigation. 28 Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study. 29 Extending the accountant-level analysis to auditor-year units---using the same three-method convergent framework but at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade. 30 The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration. 31 The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself. 32 Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.

exec /bin/zsh -lc 'nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md' in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc 'nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # Impact Statement 2 3 4 5 Auditor signatures on financial reports are a key safeguard of corporate accountability. 6 When the signature on an audit report is produced by reproducing a stored image instead of by the partner's own hand---whether through an administrative stamping workflow or a firm-level electronic signing system---this safeguard is weakened, yet detecting the practice through manual inspection is infeasible at the scale of modern financial markets. 7 We developed an artificial intelligence system that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan. 8 By combining deep-learning visual features with perceptual hashing and three methodologically distinct threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time. 9 After further validation, the technology could support financial regulators in automating signature-authenticity screening at national scale.

succeeded in 0ms: 1 # II. Related Work 2 3 ## A. Offline Signature Verification 4 5 Offline signature verification---determining whether a static signature image is genuine or forged---has been studied extensively using deep learning. 6 Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant. 7 Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work. 8 Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining. 9 Hadjadj et al. [5] addressed the practical constraint of limited reference samples, achieving competitive verification accuracy using only a single known genuine signature per writer. 10 More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results. 11 Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives. 12 Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer. 13 Hafemann et al. [16] further addressed the practical challenge of adapting to new users through meta-learning, reducing the enrollment burden for signature verification systems. 14 15 A common thread in this literature is the assumption that the primary threat is identity fraud: a forger attempting to produce a convincing imitation of another person's signature. 16 Our work addresses a fundamentally different problem---detecting whether the legitimate signer's stored signature image has been reproduced across many documents---which requires analyzing the upper tail of the intra-signer similarity distribution rather than modeling inter-signer discriminability. 17 18 Brimoh and Olisah [8] proposed a consensus-threshold approach that derives classification boundaries from known genuine reference pairs, the methodology most closely related to our calibration strategy. 19 However, their method operates on standard verification benchmarks with laboratory-collected signatures, whereas our approach applies threshold calibration using a replication-dominated subpopulation identified through domain expertise in real-world regulatory documents. 20 21 ## B. Document Forensics and Copy Detection 22 23 Image forensics encompasses a broad range of techniques for detecting manipulated visual content [17], with recent surveys highlighting the growing role of deep learning in forgery detection [18]. 24 Copy-move forgery detection (CMFD) identifies duplicated regions within or across images, typically targeting manipulated photographs [11]. 25 Abramova and Böhme [10] adapted block-based CMFD to scanned text documents, noting that standard methods perform poorly in this domain because legitimate character repetitions produce high similarity scores that confound duplicate detection. 26 27 Woodruff et al. [9] developed the work most closely related to ours: a fully automated pipeline for extracting and analyzing signatures from corporate filings in the context of anti-money-laundering investigations. 28 Their system uses connected component analysis for signature detection, GANs for noise removal, and Siamese networks for author clustering. 29 While their pipeline shares our goal of large-scale automated signature analysis on real regulatory documents, their objective---grouping signatures by authorship---differs fundamentally from ours, which is detecting image-level reproduction within a single author's signatures across documents. 30 31 In the domain of image copy detection, Pizzi et al. [13] proposed SSCD, a self-supervised descriptor using ResNet-50 with contrastive learning for large-scale copy detection on natural images. 32 Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach. 33 34 ## C. Perceptual Hashing 35 36 Perceptual hashing algorithms generate compact fingerprints that are robust to minor image transformations while remaining sensitive to substantive content changes [19]. 37 Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process. 38 39 Jakhar and Borah [12] demonstrated that combining perceptual hashing with deep learning features significantly outperforms either approach alone for near-duplicate image detection, achieving AUROC of 0.99 on standard benchmarks. 40 Their two-stage architecture---pHash for fast structural comparison followed by deep features for semantic verification---provides methodological precedent for our dual-descriptor approach, though applied to natural images rather than document signatures. 41 42 Our work differs from prior perceptual-hashing studies in its application context and in the specific challenge it addresses: distinguishing legitimate high visual consistency (a careful signer producing similar-looking signatures) from image-level reproduction in scanned financial documents. 43 44 ## D. Deep Feature Extraction for Signature Analysis 45 46 Several studies have explored pre-trained CNN features for signature comparison without metric learning or Siamese architectures. 47 Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach. 48 Tsourounis et al. [21] demonstrated successful transfer from handwritten text recognition to signature verification, showing that CNN features trained on related but distinct handwriting tasks generalize effectively to signature comparison. 49 Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach. 50 51 Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach. 52 These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical. 53 54 ## E. Statistical Methods for Threshold Determination 55 56 Our threshold-determination framework combines three families of methods developed in statistics and accounting-econometrics. 57 58 Non-parametric density estimation. 59 Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions. 60 Where the distribution is bimodal, the local density minimum (antimode) between the two modes is the Bayes-optimal decision boundary under equal priors. 61 The statistical validity of the unimodality-vs-multimodality dichotomy can be tested via the Hartigan & Hartigan dip test [37], which tests the null of unimodality; we use rejection of this null as evidence consistent with (though not a direct test for) bimodality. 62 63 Discontinuity tests on empirical distributions. 64 Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions. 65 Under the null that the distribution is generated by a single smooth process, the expected count in any histogram bin equals the average of its two neighbours, and the standardized deviation from this expectation is approximately N(0,1). 66 The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis. 67 The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions. 68 69 Finite mixture models. 70 When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters. 71 For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation. 72 Under mild regularity conditions, White's quasi-MLE consistency result [41] guarantees asymptotic recovery of the best Beta-family approximation to the true distribution, even when the true distribution is not exactly Beta, provided the model is correctly specified in the broader exponential-family sense. 73 74 The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported. 75

exec /bin/zsh -lc 'nl -ba /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 1 # References 2 3 4 5 [1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067 6 7 [2] S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," Res. Account. Regul., vol. 25, no. 2, pp. 230235, 2013. 8 9 [3] J. Bromley et al., "Signature verification using a Siamese time delay neural network," in Proc. NeurIPS, 1993. 10 11 [4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017. 12 13 [5] I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," Appl. Sci., vol. 10, no. 11, p. 3716, 2020. 14 15 [6] H. Li et al., "TransOSV: Offline signature verification with transformers," Pattern Recognit., vol. 145, p. 109882, 2024. 16 17 [7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," Mathematics, vol. 12, no. 17, p. 2757, 2024. 18 19 [8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024. 20 21 [9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021. 22 23 [10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," in Proc. Electronic Imaging, 2016. 24 25 [11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," Multimedia Tools Appl., 2024. 26 27 [12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," Inf. Process. Manage., p. 104086, 2025. 28 29 [13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," in Proc. CVPR, 2022. 30 31 [14] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," Pattern Recognit., vol. 70, pp. 163176, 2017. 32 33 [15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," IEEE Trans. Inf. Forensics Security, vol. 19, pp. 13421356, 2024. 34 35 [16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 17351745, 2019. 36 37 [17] H. Farid, "Image forgery detection," IEEE Signal Process. Mag., vol. 26, no. 2, pp. 1625, 2009. 38 39 [18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," Pattern Recognit., vol. 144, art. no. 109778, 2023. 40 41 [19] J. Luo et al., "A survey of perceptual hashing for multimedia," ACM Trans. Multimedia Comput. Commun. Appl., vol. 21, no. 7, 2025. 42 43 [20] D. Engin et al., "Offline signature verification on real-world documents," in Proc. CVPRW, 2020. 44 45 [21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," Expert Syst. Appl., 2022. 46 47 [22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification," Procedia Comput. Sci., vol. 270, 2025. 48 49 [23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in Proc. ECCV, 2014, pp. 584599. 50 51 [24] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv:2502.13923, 2025. [Online]. Available: https://arxiv.org/abs/2502.13923 52 53 [25] Ultralytics, "YOLOv11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/ 54 55 [26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016. 56 57 [27] N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013. [Online]. Available: https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html 58 59 [28] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman & Hall, 1986. 60 61 [29] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988. 62 63 [30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600612, 2004. 64 65 [31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," The Accounting Review, vol. 88, no. 5, pp. 15111546, 2013. 66 67 [32] A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," Int. J. Auditing, vol. 18, no. 3, pp. 172192, 2014. 68 69 [33] W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," Contemp. Account. Res., vol. 26, no. 2, pp. 359391, 2009. 70 71 [34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. CVPR, 2016, pp. 779788. 72 73 [35] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 8, pp. 56255644, 2024. 74 75 [36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," Ann. Math. Statist., vol. 18, no. 1, pp. 5060, 1947. 76 77 [37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," Ann. Statist., vol. 13, no. 1, pp. 7084, 1985. 78 79 [38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," J. Account. Econ., vol. 24, no. 1, pp. 99126, 1997. 80 81 [39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," J. Econometrics, vol. 142, no. 2, pp. 698714, 2008. 82 83 [40] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, no. 1, pp. 138, 1977. 84 85 [41] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, no. 1, pp. 125, 1982. 86 87

codex Ive read the revised core sections. The remaining work is targeted: Im checking for residual overstatements, inconsistent terminology/numbers, and whether the new Section IV-H genuinely provides independent evidence rather than repackaged calibration results. exec /bin/zsh -lc "rg -n "independent methods|statistically independent|methodologically distinct|discrete|clustered|White|0\.941|0\.9407|0\.945|formal bimodality|unimodality|139 / 32|139/32|171 Firm A|180 Firm A|BD/McCrary should be interpreted at the accountant level|does not produce a significant transition|worst-case rule|byte-identical|digital|digitally replicated" /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:93:Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:94:This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual behavior (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123:Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:126:Third, applying the three-method framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in Table VIII: KDE antimode = 0.973, Beta-2 crossing = 0.979, and the logit-GMM-2 crossing = 0.976 converge within \sim 0.006 of each other, while the BD/McCrary test does not produce a significant transition at the accountant level. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:127:For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine = 0.945 and dHash = 8.10; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:136:| 2D GMM 2-comp marginal crossing | 0.945 | 8.10 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:152:| Accountant-level, 2D-GMM 2-comp marginal crossing | 0.945 | 8.10 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:153:| Firm A calibration-fold cosine P5 | 0.941 | — | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:159:This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine = 0.945, dHash = 8.10) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:171:| cosine > 0.941 (calibration-fold P5) | 95.08% | 57,473 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:172:| cosine > 0.945 (2D GMM marginal crossing) | 94.52% | 57,131 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:181:The threshold 0.941 corresponds to the 5th percentile of the calibration-fold Firm A cosine distribution (see Section IV-G for the held-out validation that addresses the circularity inherent in this whole-sample table). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:194:Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:204:| 0.945 (2D GMM marginal) | 0.883 | 1.000 | 0.938 | 0.0008 | [0.0006, 0.0011] | 0.000 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:211:First, the gold-positive anchor is a conservative subset of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:213:Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:226:| cosine > 0.945 (2D GMM marginal) | 94.78% | [94.41%, 95.12%] | 14,532 / 15,332 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:228:| cosine > 0.9407 (calib-fold P5) | 95.64% | [95.31%, 95.95%] | 14,664 / 15,332 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:235:Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash median = 2, P95 = 9. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:6:Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:14:Third, we introduced a three-method threshold framework combining KDE antimode (with a Hartigan unimodality test), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check). /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:16:The Burgstahler-Dichev / McCrary test finds no significant transition at the accountant level, consistent with clustered but smoothly mixed rather than sharply discrete accountant-level behavior. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:17:The substantive reading is therefore narrower than "discrete behavior": pixel-level output quality is continuous and heavy-tailed, and accountant-level aggregate behavior is clustered with smooth cluster boundaries. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19:Fourth, we introduced a replication-dominated calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:21:This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10:For threshold determination we apply three methodologically distinct methods---Kernel Density antimode with a Hartigan unimodality test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11:Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the three methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum that no two-component mixture cleanly separates, while accountant-level aggregates are clustered into three recognizable groups (BIC-best K = 3) with the three 1D methods converging within $\sim$0.006 of each other at cosine \approx 0.975. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:13:Validation against 310 byte-identical positive signatures and a $\sim$50,000-pair inter-CPA negative anchor yields FAR \leq 0.001 with Wilson 95% confidence intervals at all accountant-level thresholds. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:42:(5) threshold determination using three methodologically distinct methods---KDE antimode with a Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54:This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in total) cluster into an accountant-level "middle band" rather than the high-replication mode. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:58:Our three-method framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are clustered into three recognizable groups (BIC-best K = 3). /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:59:The substantive reading is that pixel-level output quality is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, while accountant-level aggregate behaviour is clustered but not sharply discrete---a given CPA tends to cluster into a dominant regime (high-replication, middle-band, or hand-signed-tendency), though the boundaries between regimes are smooth rather than discontinuous. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:61:The two-dimensional GMM marginal crossings (cosine = 0.945, dHash = 8.10) are reported as a complementary cross-check rather than as the primary accountant-level threshold. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:74:4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three methodologically distinct methods---KDE antimode with Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or principled divergence) across methods as diagnostic evidence about the mixture structure of the data. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:76:5. Continuous-quality / clustered-accountant finding. We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates cluster into three recognizable groups with smoothly mixed rather than sharply discrete boundaries---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:61:The statistical validity of the unimodality-vs-multimodality dichotomy can be tested via the Hartigan & Hartigan dip test [37], which tests the null of unimodality; we use rejection of this null as evidence consistent with (though not a direct test for) bimodality. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:72:Under mild regularity conditions, White's quasi-MLE consistency result [41] guarantees asymptotic recovery of the best Beta-family approximation to the true distribution, even when the true distribution is not exactly Beta, provided the model is correctly specified in the broader exponential-family sense. /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:99:[37] Hartigan & Hartigan 1985 — dip test of unimodality /Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:103:[41] White 1982 — quasi-MLE consistency /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:9:Throughout this paper we use the term non-hand-signed rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:160:To place threshold selection on a statistically principled and data-driven footing, we apply three methodologically distinct methods whose underlying assumptions decrease in strength. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:161:The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence is therefore a diagnostic of distributional structure rather than a formal statistical guarantee. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:169:In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality (rejecting the null of unimodality is consistent with but does not directly establish bimodality specifically), and perform a sensitivity analysis varying the bandwidth over \pm 50\% of the Scott's-rule value to verify threshold stability. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:191:White's [41] quasi-MLE consistency result justifies interpreting the logit-Gaussian estimates as asymptotic approximations to the best Gaussian-family fit under misspecification; we use the cross-check between Beta and logit-Gaussian crossings as a diagnostic of parametric-form sensitivity rather than as a guarantee of distributional recovery. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:203:Applied to the per-accountant cosine mean, Methods 1 (KDE antimode) and 3 (Beta-mixture crossing and its logit-Gaussian counterpart) converge within a narrow band, whereas Method 2 (BD/McCrary) does not produce a significant transition because the accountant-mean distribution is smooth at the bin resolution the test requires. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:204:This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than a discrete discontinuity, and we interpret it accordingly in Section V rather than treating disagreement among methods as a failure. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:215:The motivation is the expectation---supported by Firm A's interview evidence---that an individual CPA's signing behavior is close to discrete (either adopt non-hand-signing or not) even when the output pixel-level quality lies on a continuous spectrum. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:224:1. Pixel-identical anchor (gold positive, conservative subset): signatures whose nearest same-CPA match is byte-identical after crop and normalization. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:225:Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth for the byte-identical subset of non-hand-signed signatures. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:226:We emphasize that this anchor is a subset of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:261:Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the most-replication-consistent signature label (i.e., among the two signatures, the label rank ordered High-confidence > Moderate-confidence > Style-consistency > Uncertain > Likely-hand-signed determines the document's classification). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:20:The all-CPA signature-level cosine rejects unimodality (p < 0.001), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:22:Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:26:The distribution of per-accountant mean cosine (and mean dHash) rejects unimodality, a BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms, and the three 1D threshold methods applied at the accountant level produce mutually consistent estimates (KDE antimode = 0.973, Beta-2 crossing = 0.979, logit-GMM-2 crossing = 0.976). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:27:The BD/McCrary test, however, does not produce a significant transition at the accountant level either, in contrast to the signature level. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:28:This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the transitions between them are gradual rather than discrete at the bin resolution BD/McCrary requires. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:30:The substantive interpretation we take from this evidence is therefore narrower than a "discrete-behaviour" claim: pixel-level output quality is continuous and heavy-tailed, and accountant-level aggregate behaviour is clustered (three recognizable groups) but not sharply discrete. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42:Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures (\geq 10) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---consistent with the interview-acknowledged minority of hand-signers. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:73:Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:75:We emphasize that byte-identical pairs are a subset of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways). /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:78:Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), which is a substantive improvement over the low-similarity same-CPA negative (n = 35) we originally considered. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:86:The pixel-identity anchor is a strict subset of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:104:This result is itself an informative diagnostic---consistent with the dip-test and Beta-mixture evidence that signature-level cosine is not cleanly bimodal---but it means BD/McCrary should be interpreted at the accountant level for threshold-setting purposes. /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8:By combining deep-learning visual features with perceptual hashing and three methodologically distinct threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time. /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:17:[7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," Mathematics, vol. 12, no. 17, p. 2757, 2024. /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:25:[11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," Multimedia Tools Appl., 2024. /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:77:[37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," Ann. Statist., vol. 13, no. 1, pp. 7084, 1985. /Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:85:[41] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, no. 1, pp. 125, 1982.

2026-04-20T18:07:26.512065Z ERROR codex_core::tools::router: error=exec_command failed for `/bin/zsh -lc "python3 - <<'PY'

Check year totals consistency in Table XIII and XV and Firm A signatures

sigs=[2167,5256,5484,5739,5796,5986,6122,6122,5996,5918,5862] print('Firm A sig sum Table XIII',sum(sigs)) audyrs=[324,399,394,413,415,434,429,430,450,467,474] print('Auditor-years Table XV sum',sum(audyrs))

compute Deloitte in top10 total

vals=[32,39,38,39,41,43,42,38,44,43,46] print('Deloitte top10 total',sum(vals))

compute total top10 k

ks=[32,39,39,41,41,43,42,43,45,46,47] print('Top10 k total',sum(ks)) PY": CreateProcess { message: "Codex(Sandbox(Denied { output: ExecToolCallOutput { exit_code: 1, stdout: StreamOutput { text: \"\", truncated_after_lines: None }, stderr: StreamOutput { text: \"zsh:1: can't create temp file for here document: operation not permitted\\n\", truncated_after_lines: None }, aggregated_output: StreamOutput { text: \"zsh:1: can't create temp file for here document: operation not permitted\\n\", truncated_after_lines: None }, duration: 0ns, timed_out: false }, network_policy_decision: None }))" } 2026-04-20T18:07:26.512065Z ERROR codex_core::tools::router: error=exec_command failed for /bin/zsh -lc "python3 - <<'PY' from math import isclose

compute percentages from tables

vals=[29529,36994,5133,12683,47] print('Table XVII total',sum(vals)) firmA=[22970,6311,183,758,4] print('Firm A total',sum(firmA)) print('71,656 split',29529/71656,36994/71656,5133/71656) print('Firm A doc non-hand share', (22970+6311)/sum(firmA)) print('Deloitte both non-hand share',26435/30222) print('Deloitte agreement', (26435+734+0+4)/30222) print('KPMG agreement', (9260+2159+5+6)/17121) print('PwC agreement', (8983+3035+3+5)/19112) print('EY agreement', (3028+2376+0+3)/8375) print('Other agreement', (1671+3945+18+27)/9140) print('2013-2019 mean left-tail', sum([12.78,8.69,7.46,6.92,6.69,6.58,8.71])/7) print('2020-2023 mean left-tail', sum([9.46,8.37,6.25,3.75])/4) print('top10 ratio',95.9/27.8) print('top25 ratio',90.1/27.8) PY"`: CreateProcess { message: "Codex(Sandbox(Denied { output: ExecToolCallOutput { exit_code: 1, stdout: StreamOutput { text: "", truncated_after_lines: None }, stderr: StreamOutput { text: "zsh:1: can't create temp file for here document: operation not permitted\n", truncated_after_lines: None }, aggregated_output: StreamOutput { text: "zsh:1: can't create temp file for here document: operation not permitted\n", truncated_after_lines: None }, duration: 0ns, timed_out: false }, network_policy_decision: None }))" } exec /bin/zsh -lc 'rg -n "XII|Table XII|TABLE XII" /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:252:Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:256:<!-- TABLE XIII: Firm A Per-Year Cosine Distribution

exec /bin/zsh -lc "rg -n "Fig\. [0-9]|Figure [0-9]" /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md" in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:29:Fig. 2 presents the cosine similarity distributions computed over the full set of pairwise comparisons under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:111:Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:164:Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:6:Fig. 1 illustrates the overall architecture. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:13:[Figure 1: Pipeline Architecture - clean vector diagram]

exec /bin/zsh -lc "python3 -c "# Wilson CI rough not needed; verify exact rate from counts\nprint(round(57131/60448100,2)); print(round(54373/60448100,2)); print(round(14532/15332100,2)); print(round(14035/15332100,2))"" in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc "python3 -c "vals=[29529,36994,5133,12683,47]; firmA=[22970,6311,183,758,4]; print('TableXVII_total',sum(vals)); print('FirmA_total',sum(firmA)); print('split',[round(v/71656100,2) for v in vals[:3]]); print('FirmA_nonhand',round((22970+6311)/sum(firmA)100,2)); print('Deloitte_both_nonhand',round(26435/30222100,2)); print('Deloitte_agreement',round((26435+734+0+4)/30222100,2)); print('KPMG_agreement',round((9260+2159+5+6)/17121100,2)); print('PwC_agreement',round((8983+3035+3+5)/19112100,2)); print('EY_agreement',round((3028+2376+0+3)/8375100,2)); print('Other_agreement',round((1671+3945+18+27)/9140100,2)); print('mean_2013_2019',round(sum([12.78,8.69,7.46,6.92,6.69,6.58,8.71])/7,2)); print('mean_2020_2023',round(sum([9.46,8.37,6.25,3.75])/4,2)); print('top10_ratio',round(95.9/27.8,2)); print('top25_ratio',round(90.1/27.8,2)); print('TableXIII_sum',sum([2167,5256,5484,5739,5796,5986,6122,6122,5996,5918,5862])); print('TableXV_audyr_sum',sum([324,399,394,413,415,434,429,430,450,467,474])); print('TableXV_DeloitteTop10',sum([32,39,38,39,41,43,42,38,44,43,46])); print('TableXV_k',sum([32,39,39,41,41,43,42,43,45,46,47]))"" in /Volumes/NV2/pdf_recognize succeeded in 0ms: succeeded in 0ms: TableXVII_total 84386 FirmA_total 30226 split [41.21, 51.63, 7.16] FirmA_nonhand 96.87 Deloitte_both_nonhand 87.47 Deloitte_agreement 89.91 KPMG_agreement 66.76 PwC_agreement 62.92 EY_agreement 64.56 Other_agreement 61.94 mean_2013_2019 8.26 mean_2020_2023 6.96 top10_ratio 3.45 top25_ratio 3.24 TableXIII_sum 60448 TableXV_audyr_sum 4629 TableXV_DeloitteTop10 445 TableXV_k 458

exec /bin/zsh -lc 'rg -n "median = 5|P95 = 15|95th percentile.*15|dHash.*5|dHash.15|dHash median|dHash P95|median-adjacent|independent minimum dHash|cosine-conditional dHash" /Volumes/NV2/pdf_recognize/paper/paper_a__v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:51:Within the 71,656 documents exceeding cosine 0.95, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance. /Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:54:The 7.2% classified as "high style consistency" (cosine > 0.95 but dHash > 15) are particularly informative. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:129:For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean independent minimum dHash across all signatures of that CPA. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:130:The independent minimum dHash of a signature is defined as the minimum Hamming distance to any other signature of the same CPA (over the full same-CPA set), in contrast to the cosine-conditional dHash used as a diagnostic elsewhere, which is the dHash to the single signature selected as the cosine-nearest match. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:208:In addition to applying the three methods at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:214:In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:248:1. High-confidence non-hand-signed: Cosine > 0.95 AND dHash \leq (calibration-fold Firm A dHash median). /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:251:2. Moderate-confidence non-hand-signed: Cosine > 0.95 AND dHash between the calibration-fold dHash median and 95th percentile. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:254:3. High style consistency: Cosine > 0.95 AND dHash > calibration-fold Firm A dHash 95th percentile. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:263:The dHash thresholds (\leq 5 and \leq 15, corresponding to the calibration-fold Firm A dHash median and 95th percentile) are derived empirically rather than set ad hoc, ensuring that the classification boundaries are grounded in the replication-dominated calibration population. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:64:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:98:We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with \geq 10 signatures and fit Gaussian mixtures in two dimensions with K \in \{1, \ldots, 5\}. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:154:| Firm A calibration-fold dHash P95 | — | 9 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:155:| Firm A calibration-fold dHash median | — | 2 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:175:| dHash_indep ≤ 5 (calib-fold median-adjacent) | 84.20% | 50,897 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:176:| dHash_indep ≤ 8 | 95.17% | 57,521 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:177:| dHash_indep ≤ 15 | 99.83% | 60,345 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:178:| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | 54,373 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184:Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:185:The dual rule cosine > 0.95 AND dHash \leq 8 captures 89.95% of Firm A, a value that is consistent both with the accountant-level crossings (Section IV-E) and with Firm A's interview-reported signing mix. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:229:| dHash_indep ≤ 5 | 87.84% | [87.31%, 88.34%] | 13,469 / 15,332 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:230:| dHash_indep ≤ 8 | 96.13% | [95.82%, 96.43%] | 14,739 / 15,332 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:231:| dHash_indep ≤ 9 (calib-fold P95) | 97.48% | [97.22%, 97.71%] | 14,942 / 15,332 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:232:| dHash_indep ≤ 15 | 99.84% | [99.77%, 99.89%] | 15,308 / 15,332 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:233:| cosine > 0.95 AND dHash_indep ≤ 8 | 91.54% | [91.09%, 91.97%] | 14,035 / 15,332 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:235:Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash median = 2, P95 = 9. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:239:The dual rule cosine > 0.95 AND dHash \leq 8 captures 91.54% [91.09%, 91.97%] of the held-out Firm A population, consistent with Firm A's interview-reported signing mix and with the replication-dominated framing of Section III-H. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:361:29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash \leq 5); /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:362:36,994 (51.7%) show partial structural similarity (dHash in [6, 15]) consistent with replication degraded by scan variations; /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:363:and 5,133 (7.2%) show no structural corroboration (dHash > 15), suggesting high signing consistency rather than image reproduction. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:375:Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash \leq 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.

exec /bin/zsh -lc 'rg -n "Deloitte|KPMG|PwC|EY" /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:286:| Top-K | k in bucket | Deloitte (Firm A) | KPMG | PwC | EY | Other/Non-Big-4 | Deloitte share | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:294:Year-by-year (Table XV), the top-10% Deloitte share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period. /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:296:<!-- TABLE XV: Deloitte Share of Top-10% Similarity by Year /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:297:| Year | N auditor-years | Top-10% k | Deloitte in top-10% | Deloitte share | Deloitte baseline | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:327:| Deloitte (Firm A) | 30,222 | 26,435 | 734 | 0 | 4 | 3,049 | 89.91% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:328:| KPMG | 17,121 | 9,260 | 2,159| 5 | 6 | 5,691 | 66.76% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:329:| PwC | 19,112 | 8,983 | 3,035| 3 | 5 | 7,086 | 62.92% | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:330:| EY | 8,375 | 3,028 | 2,376| 0 | 3 | 2,968 | 64.56% |

exec /bin/zsh -lc 'rg -n "93,979|83,970|10009|exactly two signatures" /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:321:For each report with exactly two signatures and complete per-signature data (93,979 reports), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.

exec /bin/zsh -lc 'rg -n "124 CPAs|54 CPAs|178|180 Firm A CPAs|171 have >=10|9 have too few|60,448" /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:63:| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:64:| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:122:First, of the 180 CPAs in the Firm A registry, 171 have \geq 10 signatures and therefore enter the accountant-level GMM (the remaining 9 have too few signatures for reliable aggregates and are excluded from this analysis only). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:170:| cosine > 0.837 (all-pairs KDE crossover) | 99.93% | 60,405 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:171:| cosine > 0.941 (calibration-fold P5) | 95.08% | 57,473 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:172:| cosine > 0.945 (2D GMM marginal crossing) | 94.52% | 57,131 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:173:| cosine > 0.95 | 92.51% | 55,916 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174:| cosine > 0.973 (accountant KDE antimode) | 80.91% | 48,910 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:175:| dHash_indep ≤ 5 (calib-fold median-adjacent) | 84.20% | 50,897 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:176:| dHash_indep ≤ 8 | 95.17% | 57,521 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:177:| dHash_indep ≤ 15 | 99.83% | 60,345 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:178:| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | 54,373 / 60,448 | /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:180:All rates computed exactly from the full Firm A sample (N = 60,448 signatures). /Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:218:We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures). /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54:This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in total) cluster into an accountant-level "middle band" rather than the high-replication mode.

codex Ive finished the cross-check. Im doing one last pass on the tables now because several of the new validation sections appear to have internal arithmetic or threshold-definition mismatches, and those will determine whether I land at minor or major revision. exec /bin/zsh -lc 'python3 -c "print(171+9)"' in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc 'python3 -c "print(124+54)"' in /Volumes/NV2/pdf_recognize exec /bin/zsh -lc 'python3 -c "print(30222+17121+19112+8375+9140)"' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 180

succeeded in 0ms: 178

succeeded in 0ms: 83970

exec /bin/zsh -lc 'python3 -c "print(85042-656)"' in /Volumes/NV2/pdf_recognize succeeded in 0ms: 84386

exec /bin/zsh -lc 'rg -n "three independent|three methodologically distinct|statistically independent" /Volumes/NV2/pdf_recognize/paper/paper_a_*_v3.md' in /Volumes/NV2/pdf_recognize succeeded in 0ms: /Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8:By combining deep-learning visual features with perceptual hashing and three methodologically distinct threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:7:The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three independent statistical methods and a pixel-identity anchor. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:138:This status rests on three independent lines of qualitative and quantitative evidence available prior to threshold calibration. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:160:To place threshold selection on a statistically principled and data-driven footing, we apply three methodologically distinct methods whose underlying assumptions decrease in strength. /Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:161:The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence is therefore a diagnostic of distributional structure rather than a formal statistical guarantee. /Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:6:Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three independent methods applied at two analysis levels. /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:42:(5) threshold determination using three methodologically distinct methods---KDE antimode with a Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and /Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:74:4. Three-method convergent threshold framework. We introduce a threshold-selection framework that applies three methodologically distinct methods---KDE antimode with Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or principled divergence) across methods as diagnostic evidence about the mixture structure of the data. /Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10:For threshold determination we apply three methodologically distinct methods---Kernel Density antimode with a Hartigan unimodality test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level.

I read the prior review, inspected all revised v3.2 sections with line numbers, checked the claimed fixes against the manuscript text, and identified both resolved points and remaining/new issues, including several new numerical and threshold-definition inconsistencies in the added validation material. ✓ Read prior v3.1 review and extract critical issues to re-check ✓ Inspect revised v3.2 section markdowns and locate claimed fixes ✓ Evaluate whether each prior issue is resolved and identify any new issues → Draft second-round review in priority order with exact references and quotes codex Overall Assessment

Recommendation: minor revision. Relative to v3.1, v3.2 is materially stronger. The two biggest first-round problems are largely fixed: the accountant-level threshold story is now actually reported in the Results, and the Firm A circularity problem is substantially reduced by the CPA-level 70/30 holdout plus the new benchmark analyses. I no longer see the paper as conceptually under-supported in the way v3.1 was. I do not recommend acceptance as-is, however, because the revised Sections III-V now contain several internal inconsistencies in threshold definitions, validation framing, and table arithmetic that need to be corrected before submission.

Status of the Previous Critical Issues

  • Three-method convergence overclaim: mostly addressed, but not fully cleaned up. The core fix is real in Section IV-E/Table VIII: accountant-level KDE antimode 0.973, Beta-2 0.979, logit-GMM-2 0.976, and BD/McCrary no transition paper_a_results_v3.md, paper_a_results_v3.md, paper_a_results_v3.md. That addresses the main first-round objection. But the manuscript still overstates this in several places: the Abstract says "the three 1D methods converging within ~0.006" paper_a_abstract_v3.md, and the Conclusion says "the three 1D methods agree within ~0.006" paper_a_conclusion_v3.md. Those are not correct if BD/McCrary produces no transition.

  • KDE method inconsistency: adequately addressed. Section III-I.1 now explicitly distinguishes KDE crossover for two labeled populations from KDE antimode for a single multimodal distribution paper_a_methodology_v3.md. Section IV-C then uses the all-pairs crossover, while Sections IV-D/E use unimodality/antimode logic appropriately paper_a_results_v3.md, paper_a_results_v3.md.

  • Unit-of-analysis clarity: largely addressed. Section III-G is much clearer on signature vs accountant vs auditor-year units, and Section IV-C now explicitly warns that the all-pairs analysis is a different unit from the per-signature best-match analysis paper_a_methodology_v3.md, paper_a_results_v3.md. The remaining confusion is no longer conceptual; it is now in how different thresholds are mixed in Sections III-L, IV-F, and IV-I.

  • Accountant-level interpretation overstated: partially addressed. The Introduction, Discussion, and Conclusion are now much better and mostly use the narrower formulation "clustered but not sharply discrete" paper_a_introduction_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md. But two residual sentences still revert to the old stronger claim: "the discreteness of individual behavior yields the bimodality" paper_a_results_v3.md, and "signing behavior is close to discrete (either adopt non-hand-signing or not)" paper_a_methodology_v3.md.

  • BD/McCrary rigor: partially addressed. The paper now uses it much more responsibly as a diagnostic and explicitly reports non-transition at the accountant level paper_a_methodology_v3.md, paper_a_results_v3.md. That is a substantial improvement. What is still missing is any real sensitivity analysis for the bin width / histogram construction, and one sentence is now plainly inconsistent with the revised results: "BD/McCrary should be interpreted at the accountant level for threshold-setting purposes" paper_a_discussion_v3.md. In v3.2, it does not yield an accountant-level threshold.

  • White 1982 overclaim: partially addressed. Section III-I.3 is now acceptable: it treats the logit-Gaussian fit as a robustness check for parametric-form sensitivity rather than a guarantee of recovery paper_a_methodology_v3.md. But Related Work still contains the older overclaim: "White's quasi-MLE consistency result [41] guarantees asymptotic recovery of the best Beta-family approximation" paper_a_related_work_v3.md. That sentence should be rewritten to match Section III-I.3.

  • Firm A circular validation: substantially addressed. The 70/30 CPA-level split in Section III-K / IV-G.2 is the right response to the first-round criticism paper_a_methodology_v3.md, paper_a_results_v3.md. The new Section IV-H also adds genuinely useful supporting evidence, especially the partner-level ranking in H.2. The remaining issue is that H.3 is mislabeled as threshold-independent even though it uses the Section III-L classifier.

  • 139+32 vs 180 discrepancy: adequately addressed. The paper now consistently states that 171 Firm A CPAs enter the accountant-level GMM and 9 are excluded for <10 signatures paper_a_introduction_v3.md, paper_a_results_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md.

New Issues Introduced by the v3.2 Additions

  • The dHash calibration story is now internally inconsistent. Section III-L says the document classifier uses calibration-fold Firm A dHash median and 95th percentile paper_a_methodology_v3.md, then states these are <=5 and <=15 paper_a_methodology_v3.md. But Table VIII / Table XI report calibration-fold dHash median = 2 and P95 = 9 paper_a_results_v3.md, paper_a_results_v3.md. Table IX then uses <=5 as "calib-fold median-adjacent" and the dual rule uses <=8 paper_a_results_v3.md. This is the most important remaining issue because it affects the meaning of Table XVII and the Section IV-H.3 agreement analysis.

  • Section IV-H.3 is not threshold-independent as claimed. Section IV-H defines the new subsection as evidence whose findings "do not depend on any cutoff we calibrate to Firm A" paper_a_results_v3.md. But H.3 explicitly classifies signatures "using the dual-descriptor rules of Section III-L" paper_a_results_v3.md, and Section III-L uses calibration-fold thresholds. The follow-up claim that this test "does not depend on any threshold we calibrate to Firm A" paper_a_results_v3.md is therefore not correct.

  • There is a new numerical error in Table XVI. The text says the intra-report analysis uses "93,979 reports" paper_a_results_v3.md, but the table rows sum to 83,970, and 93,979 would exceed the total corpus size of 90,282 PDFs. This must be corrected.

  • The held-out Firm A split has an unexplained denominator shift. Section IV-G.2 says the 70/30 split is 124 + 54 = 178 CPAs paper_a_results_v3.md, while Section IV-E says Firm A has 180 CPAs total and 171 + 9 = 180 in the GMM accounting paper_a_results_v3.md. If the holdout split is over only the 178 Firm A CPAs with at least one matched signature, state that explicitly.

  • Table numbering and cross-references are now confusing. Section IV-E first says the accountant-level estimates are "summarized in Table VIII" paper_a_results_v3.md, then later says "Table VIII then summarizes all threshold estimates" paper_a_results_v3.md. In the current draft there is also no Table XII. This needs clean renumbering before submission.

  • The new tables reveal the firm identity while the paper otherwise uses pseudonyms. Tables XIV-XVI explicitly name Deloitte, KPMG, PwC, and EY paper_a_results_v3.md, paper_a_results_v3.md, even though the manuscript otherwise uses Firm A and the source files still note double-blind drafting paper_a_introduction_v3.md. That is at least a consistency problem and may also be an anonymization problem.

  • Table X mixes unlike units. The positive anchor is defined at the signature-with-nearest-match level, while the negative anchor is 50,000 random inter-CPA pairs paper_a_results_v3.md, paper_a_results_v3.md. FAR/FRR are still useful, but precision and F1 are much harder to interpret when the positive and negative classes are constructed from different sampling units and arbitrary prevalence. I would keep FAR/FRR with CIs and either drop precision/F1 or define a single common validation unit.

Clarity and Organization of Revised Sections III-V

  • Sections III-G, III-I.1, III-K, and III-L are much improved. The manuscript now does a better job separating all-pairs KDE crossover, per-signature best-match analysis, accountant-level means, and held-out Firm A validation.

  • The new Section IV-H is directionally useful, especially H.2. It gives the paper a more convincing non-circular validation story. But the section title and internal framing are too broad. H.2 is genuinely threshold-free; H.1 is threshold-external but uses a fixed 0.95 cutoff; H.3 uses the calibrated classifier and should not be presented as threshold-independent.

  • Sections IV-E and V-B/C are now much better aligned than in v3.1, but a few stale sentences still reintroduce the older, stronger interpretation. The paper should adopt one consistent formulation everywhere: continuous signature-level quality plus clustered, smoothly mixed accountant-level heterogeneity.

Remaining Sentence-Level / Table-Level Fixes

Net assessment: v3.2 addresses the major first-round conceptual objections well enough that I would move this to minor revision. The remaining blockers are no longer about whether the core story exists; they are about making the revised story internally consistent, numerically clean, and honestly labeled. tokens used 102,922 Overall Assessment

Recommendation: minor revision. Relative to v3.1, v3.2 is materially stronger. The two biggest first-round problems are largely fixed: the accountant-level threshold story is now actually reported in the Results, and the Firm A circularity problem is substantially reduced by the CPA-level 70/30 holdout plus the new benchmark analyses. I no longer see the paper as conceptually under-supported in the way v3.1 was. I do not recommend acceptance as-is, however, because the revised Sections III-V now contain several internal inconsistencies in threshold definitions, validation framing, and table arithmetic that need to be corrected before submission.

Status of the Previous Critical Issues

  • Three-method convergence overclaim: mostly addressed, but not fully cleaned up. The core fix is real in Section IV-E/Table VIII: accountant-level KDE antimode 0.973, Beta-2 0.979, logit-GMM-2 0.976, and BD/McCrary no transition paper_a_results_v3.md, paper_a_results_v3.md, paper_a_results_v3.md. That addresses the main first-round objection. But the manuscript still overstates this in several places: the Abstract says "the three 1D methods converging within ~0.006" paper_a_abstract_v3.md, and the Conclusion says "the three 1D methods agree within ~0.006" paper_a_conclusion_v3.md. Those are not correct if BD/McCrary produces no transition.

  • KDE method inconsistency: adequately addressed. Section III-I.1 now explicitly distinguishes KDE crossover for two labeled populations from KDE antimode for a single multimodal distribution paper_a_methodology_v3.md. Section IV-C then uses the all-pairs crossover, while Sections IV-D/E use unimodality/antimode logic appropriately paper_a_results_v3.md, paper_a_results_v3.md.

  • Unit-of-analysis clarity: largely addressed. Section III-G is much clearer on signature vs accountant vs auditor-year units, and Section IV-C now explicitly warns that the all-pairs analysis is a different unit from the per-signature best-match analysis paper_a_methodology_v3.md, paper_a_results_v3.md. The remaining confusion is no longer conceptual; it is now in how different thresholds are mixed in Sections III-L, IV-F, and IV-I.

  • Accountant-level interpretation overstated: partially addressed. The Introduction, Discussion, and Conclusion are now much better and mostly use the narrower formulation "clustered but not sharply discrete" paper_a_introduction_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md. But two residual sentences still revert to the old stronger claim: "the discreteness of individual behavior yields the bimodality" paper_a_results_v3.md, and "signing behavior is close to discrete (either adopt non-hand-signing or not)" paper_a_methodology_v3.md.

  • BD/McCrary rigor: partially addressed. The paper now uses it much more responsibly as a diagnostic and explicitly reports non-transition at the accountant level paper_a_methodology_v3.md, paper_a_results_v3.md. That is a substantial improvement. What is still missing is any real sensitivity analysis for the bin width / histogram construction, and one sentence is now plainly inconsistent with the revised results: "BD/McCrary should be interpreted at the accountant level for threshold-setting purposes" paper_a_discussion_v3.md. In v3.2, it does not yield an accountant-level threshold.

  • White 1982 overclaim: partially addressed. Section III-I.3 is now acceptable: it treats the logit-Gaussian fit as a robustness check for parametric-form sensitivity rather than a guarantee of recovery paper_a_methodology_v3.md. But Related Work still contains the older overclaim: "White's quasi-MLE consistency result [41] guarantees asymptotic recovery of the best Beta-family approximation" paper_a_related_work_v3.md. That sentence should be rewritten to match Section III-I.3.

  • Firm A circular validation: substantially addressed. The 70/30 CPA-level split in Section III-K / IV-G.2 is the right response to the first-round criticism paper_a_methodology_v3.md, paper_a_results_v3.md. The new Section IV-H also adds genuinely useful supporting evidence, especially the partner-level ranking in H.2. The remaining issue is that H.3 is mislabeled as threshold-independent even though it uses the Section III-L classifier.

  • 139+32 vs 180 discrepancy: adequately addressed. The paper now consistently states that 171 Firm A CPAs enter the accountant-level GMM and 9 are excluded for <10 signatures paper_a_introduction_v3.md, paper_a_results_v3.md, paper_a_discussion_v3.md, paper_a_conclusion_v3.md.

New Issues Introduced by the v3.2 Additions

  • The dHash calibration story is now internally inconsistent. Section III-L says the document classifier uses calibration-fold Firm A dHash median and 95th percentile paper_a_methodology_v3.md, then states these are <=5 and <=15 paper_a_methodology_v3.md. But Table VIII / Table XI report calibration-fold dHash median = 2 and P95 = 9 paper_a_results_v3.md, paper_a_results_v3.md. Table IX then uses <=5 as "calib-fold median-adjacent" and the dual rule uses <=8 paper_a_results_v3.md. This is the most important remaining issue because it affects the meaning of Table XVII and the Section IV-H.3 agreement analysis.

  • Section IV-H.3 is not threshold-independent as claimed. Section IV-H defines the new subsection as evidence whose findings "do not depend on any cutoff we calibrate to Firm A" paper_a_results_v3.md. But H.3 explicitly classifies signatures "using the dual-descriptor rules of Section III-L" paper_a_results_v3.md, and Section III-L uses calibration-fold thresholds. The follow-up claim that this test "does not depend on any threshold we calibrate to Firm A" paper_a_results_v3.md is therefore not correct.

  • There is a new numerical error in Table XVI. The text says the intra-report analysis uses "93,979 reports" paper_a_results_v3.md, but the table rows sum to 83,970, and 93,979 would exceed the total corpus size of 90,282 PDFs. This must be corrected.

  • The held-out Firm A split has an unexplained denominator shift. Section IV-G.2 says the 70/30 split is 124 + 54 = 178 CPAs paper_a_results_v3.md, while Section IV-E says Firm A has 180 CPAs total and 171 + 9 = 180 in the GMM accounting paper_a_results_v3.md. If the holdout split is over only the 178 Firm A CPAs with at least one matched signature, state that explicitly.

  • Table numbering and cross-references are now confusing. Section IV-E first says the accountant-level estimates are "summarized in Table VIII" paper_a_results_v3.md, then later says "Table VIII then summarizes all threshold estimates" paper_a_results_v3.md. In the current draft there is also no Table XII. This needs clean renumbering before submission.

  • The new tables reveal the firm identity while the paper otherwise uses pseudonyms. Tables XIV-XVI explicitly name Deloitte, KPMG, PwC, and EY paper_a_results_v3.md, paper_a_results_v3.md, even though the manuscript otherwise uses Firm A and the source files still note double-blind drafting paper_a_introduction_v3.md. That is at least a consistency problem and may also be an anonymization problem.

  • Table X mixes unlike units. The positive anchor is defined at the signature-with-nearest-match level, while the negative anchor is 50,000 random inter-CPA pairs paper_a_results_v3.md, paper_a_results_v3.md. FAR/FRR are still useful, but precision and F1 are much harder to interpret when the positive and negative classes are constructed from different sampling units and arbitrary prevalence. I would keep FAR/FRR with CIs and either drop precision/F1 or define a single common validation unit.

Clarity and Organization of Revised Sections III-V

  • Sections III-G, III-I.1, III-K, and III-L are much improved. The manuscript now does a better job separating all-pairs KDE crossover, per-signature best-match analysis, accountant-level means, and held-out Firm A validation.

  • The new Section IV-H is directionally useful, especially H.2. It gives the paper a more convincing non-circular validation story. But the section title and internal framing are too broad. H.2 is genuinely threshold-free; H.1 is threshold-external but uses a fixed 0.95 cutoff; H.3 uses the calibrated classifier and should not be presented as threshold-independent.

  • Sections IV-E and V-B/C are now much better aligned than in v3.1, but a few stale sentences still reintroduce the older, stronger interpretation. The paper should adopt one consistent formulation everywhere: continuous signature-level quality plus clustered, smoothly mixed accountant-level heterogeneity.

Remaining Sentence-Level / Table-Level Fixes

Net assessment: v3.2 addresses the major first-round conceptual objections well enough that I would move this to minor revision. The remaining blockers are no longer about whether the core story exists; they are about making the revised story internally consistent, numerically clean, and honestly labeled.