diff --git a/paper/codex_review_gpt55_v4_round6.md b/paper/codex_review_gpt55_v4_round6.md
new file mode 100644
index 0000000..2080414
--- /dev/null
+++ b/paper/codex_review_gpt55_v4_round6.md
@@ -0,0 +1,157 @@
+# Paper A Round 26 Review - v4 round 6
+
+Reviewer: gpt-5.5 xhigh
+Date: 2026-05-12
+Target: `paper/v4/paper_a_prose_v4_phase4.md` (Phase 4 prose draft v1)
+Foundation checked: `paper/v4/paper_a_methodology_v4_section_iii.md` (§III v6) and `paper/v4/paper_a_results_v4_section_iv.md` (§IV v3.2)
+Trajectory checked: rounds 21-25 plus v3.20.0 Abstract / §I / §II / §V / §VI baselines
+
+## Verdict
+
+Major Revision.
+
+The technical core in §III v6 and §IV v3.2 is stable, but the new Phase 4 prose introduces several reviewer-visible regressions. The most important are: (i) the Abstract and Introduction revive the "independent scores" overclaim even though §III/§IV repeatedly say the three scores are not statistically independent; (ii) §I and §V overstate the Big-4 scope evidence by claiming unsupported single-firm and full-dataset dip-test non-rejections; (iii) §II is still a placeholder with `[add citation]`, not a submission-ready related-work section; and (iv) §V-G drops several inherited limitations from v3.20.0.
+
+## Section-By-Section Findings
+
+### Abstract
+
+1. **Major - line 11: "Three independent feature-derived scores" contradicts the converged methodology.** §III-K states that the three scores are "not statistically independent measurements" because all are deterministic functions of the same descriptor means (§III:90), and §IV-F repeats the caveat (§IV:79). The Abstract should say "three feature-derived scores" or "three non-identical feature-derived summaries" and, if space allows, add the shared-feature caveat.
+
+2. **Minor - line 11: "candidate classifiers" can be read as operational-classifier language.** One of the three "candidate classifiers" is the K=3 per-CPA hard label, which §III-J and §III-L explicitly demote to descriptive characterisation, not operational signature/document classification (§III:64, §III:156). Use "candidate rules/scores" or explicitly reserve "operational classifier" for the inherited five-way box rule.
+
+3. **Minor - line 11: the Abstract meets the IEEE Access form requirements but leaves no margin.** It is one paragraph and `wc -w` counts 247 words, so it satisfies the <=250-word target. Any added caveat will require trimming elsewhere.
+
+4. **Minor - line 11: the Abstract does not name the primary operational output.** The abstract describes the pipeline and the K=3 / convergence / anchor checks, but it does not state that the primary operational output remains the inherited five-way per-signature classifier with worst-case document aggregation (§III-L; §IV-J). This omission makes the K=3 and reverse-anchor checks look more central operationally than §III/§IV allow.
+
+### §I Introduction
+
+1. **Major - line 31: the Big-4 scope claim is overbroad and partly unsupported.** The sentence says "neither any single firm pooled alone nor the broader full-dataset variant rejects unimodality." §III and §IV only report comparison dip tests for Firm A alone, Firms B+C+D pooled, and all non-Firm-A pooled (§III:34, §III:56; §IV:27-34). They explicitly state that single-firm dip tests for Firms B, C, and D were not separately computed (§III:34, §III:56; §IV:34). §IV-K is a light full-dataset K=3 + Spearman robustness check and does not report a full-dataset dip test (§IV:230-252). Rewrite this as "no narrower comparison scope tested in Script 32..." and remove the full-dataset dip-test claim unless a spike report is added.
+
+2. **Major - line 29: the section cross-reference for accountant-level distributional characterisation is wrong.** The prose points to "§III-D" for the Big-4 accountant-level distributional characterisation. In the converged methodology, this material is §III-G through §III-J, especially §III-I and §III-J (§III:18-86). §IV-D/§IV-E are correct.
+
+3. **Major - line 35: the Introduction repeats the "independent feature-derived scores" error.** The next sentence correctly says the scores are not statistically independent, but the opening clause still hands reviewers an avoidable contradiction. This was a central round-21/22 issue and should not reappear in the front matter.
+
+4. **Minor - line 47: contribution 4 again overstates "not at narrower scopes."** The defensible phrase is "not in the narrower comparison scopes tested" because B/C/D single-firm dip tests were not computed.
+
+5. **Minor - line 55: contribution 8 overclaims the full-dataset check.** §IV-K deliberately re-runs only K=3 + Paper A box-rule Spearman convergence at full `n = 686`; it does not re-run LOOO, five-way moderate-band validation, or operational threshold calibration (§IV:230). "Pipeline reproducibility at multiple scopes" should be narrowed to "the K=3 + box-rule rank-convergence check reproduces at the full-CPA scope."
+
+6. **Minor - line 25: the methodological safeguards paragraph uses "external validation" too broadly.** The pixel-identity anchor is a conservative positive-subset check, the inter-CPA FAR is inherited corpus-wide, and LOOO is descriptive composition-sensitivity evidence. The paragraph should avoid implying full external validation of the operational classifier.
+
+### §II Related Work
+
+1. **Major - lines 63-65: §II is not submission-ready prose if inserted as written.** The section says v3.20.0 §II is retained "without substantive change," but the target Phase 4 file is supposed to replace the §II block. As written, it is a meta-summary rather than an actual Related Work section. Either the master manuscript must keep the full v3.20.0 §II text and splice in the LOOO paragraph, or this file must contain the full revised §II.
+
+2. **Major - line 67: unresolved citation placeholder.** "`[add citation]`" is still present. This must be replaced before Phase 5; otherwise a reviewer can attack the only new Related Work content as uncited.
+
+3. **Minor - line 67: "calibration uncertainty band on the operational rule" conflicts with the converged classifier framing.** §III-J says neither K=2 nor K=3 is used as an operational classifier (§III:64), and §III-L reserves operational classification for the inherited five-way box rule (§III:138-156). If the LOOO paragraph is about K=2/K=3 mixture fits, call it a composition-sensitivity or calibration-uncertainty check on the candidate mixture boundary/characterisation, not on "the operational rule."
+
+### §V Discussion
+
+1. **Major - line 81: the prose reifies mechanism labels at the CPA level.** "Some CPAs are templated, some are hand-leaning, some are mixed" is stronger than §III allows. §III-G says a per-CPA mean is a summary statistic, not a claim that all signatures for that CPA share a mechanism (§III:22). Use component-membership wording: "some CPAs' observed signatures place their per-CPA means in the templated, mixed, or hand-leaning regions."
+
+2. **Major - line 81: the within-CPA unimodality explanation is speculative.** The claim that occasional template reuse "produces a unimodal per-signature distribution within the CPA but a multimodal per-CPA distribution across CPAs" is not directly tested in §III/§IV. v3.x tested Firm A and all-CPA signature-level distributions, and v4.0 adds per-signature K=3 consistency (§IV-F), but there is no per-CPA distributional test for individual CPAs.
+
+3. **Major - lines 103-119: limitations are incomplete relative to v3.20.0 and the inherited pipeline.** The v4 limitations keep the Big-4 scope, missing hand-signed ground truth, pixel-identity subset, inherited-rule, A1, K=3 composition, and no-intent caveats. They drop v3 limitations that still apply: ImageNet-pretrained ResNet-50 without signature-domain fine-tuning (v3 §V:90-92), HSV red-stamp removal artifacts (v3 §V:93-95), longitudinal scanning/PDF/compression confounds (v3 §V:97-99), source-exemplar misattribution in max/min pair logic (v3 §V:100-102), and legal/regulatory interpretation limits (v3 §V:108-109). If these are intentionally retired, the draft needs a reason; otherwise they should be restored.
+
+4. **Major - line 107: the scope limitation repeats the unsupported full-dataset dip-test implication.** The sentence says dip-test multimodality is "not available at narrower or broader scopes." §III/§IV do not report full-dataset dip-test results; §IV-K is explicitly a light Spearman robustness check (§IV:230-252). Keep the LOOO broader-scope caveat, but do not claim full-dataset dip-test non-availability without evidence.
+
+5. **Minor - line 79: "v4.0 inherits and confirms" is too strong for the per-signature continuous-spectrum reading.** The exact v3 per-signature diagnostic package is inherited; v4.0's new per-signature evidence is mostly the K=3 consistency check (§IV-F) and five-way output (§IV-J). Safer: "v4.0 inherits this signature-level reading and remains consistent with it."
+
+6. **Minor - line 85: inherited Firm A byte-level details need provenance language.** The 145 Firm A pixel-identical signatures are verified in Script 40, but the "50 distinct partners" and "35 cross-year" details are explicitly inherited from v3 / Script 28 and not regenerated in v4.0 (§III:44, §III:190). The discussion should mark that provenance, especially because the spike reports provided for v4 only verify the 145 count.
+
+7. **Minor - line 87: Firm A does not alone anchor §IV-H.** §IV-H's positive-anchor subset is all Big-4 byte-identical signatures, `n = 262`, split 145 / 8 / 107 / 2 across Firms A-D (§IV:145-153). Firm A is the largest subset and the case-study evidence, but not the whole anchor.
+
+8. **Minor - line 97: "published box rule" is not traceable.** §III/§IV call this the inherited Paper A / v3.x box rule, not a published external rule (§III:96, §III:138; §IV:85-87). Use "inherited box rule" unless there is a publication citation.
+
+9. **Minor - line 97: "produce the same per-CPA ranking" is stronger than the evidence.** The scores are highly correlated, but §III/§IV note a residual non-Firm-A disagreement: reverse-anchor ranks Firm D fractionally above Firm C while P(C1) and box-rule hand-leaning rate rank Firm C highest (§III:106; §IV:102). Say "broadly concordant ranking."
+
+10. **Minor - line 101: "candidate classifiers" again blurs operational status.** K=3 hard labels remain descriptive. This can be fixed together with the Abstract wording.
+
+### §VI Conclusion And Future Work
+
+1. **Major - line 127: "cross-scope pipeline reproducibility" overstates §IV-K.** The full-dataset result verifies only that K=3 P(C1) and Paper A hand-leaning-rate Spearman convergence remains high at `n = 686` with drift `0.0069` (§IV:242-250; full-dataset report:25-31). It does not reproduce the pipeline, the five-way classifier, the moderate-confidence band, LOOO, or operational thresholds at full scope.
+
+2. **Minor - line 129: the future-work audit-quality contrast must stay explicitly descriptive.** "Firm A's 82% templated concentration vs Firm C's 23.5% hand-leaning concentration" comes from K=3 hard-posterior accountant-level assignment (§IV:215-224), whose membership is composition-sensitive (§IV:129-139). The future-work sentence is acceptable if it says these are descriptive component concentrations and that current Paper A provides no audit-quality correlation evidence.
+
+3. **Minor - lines 125-127: the conclusion underplays the actual operational output.** It names the pipeline and methodological checks, but it does not mention the inherited five-way per-signature/document-level classifier that §III-L and §IV-J define as the operational output. This is not a numerical error, but it leaves the operational-vs-descriptive distinction less clear at closure.
+
+## Reviewer-Attack Vulnerabilities Specific To The Prose
+
+1. A reviewer can quote line 11 or line 35 ("independent feature-derived scores") against §III-K/§IV-F's non-independence caveat and argue that the paper exaggerates validation strength.
+
+2. A reviewer can attack the Big-4 scope claim because the prose says "any single firm" and "full-dataset variant" even though B/C/D single-firm dip tests and full-dataset dip tests are not reported.
+
+3. The current §II can be rejected as incomplete because it is a placeholder, not a related-work section, and includes `[add citation]`.
+
+4. "Published box rule" invites a citation challenge. The body only supports "inherited Paper A / v3.x box rule."
+
+5. The discussion sometimes turns descriptive component labels into apparent mechanism claims about CPAs. This conflicts with the §III-G rule that per-CPA means are summaries, not partner-level mechanism assignments.
+
+6. The phrase "candidate classifiers" for K=3 and reverse-anchor checks can be read as walking back the round-21 convergence that K=3 is descriptive and the five-way box rule is operational.
+
+7. The limitations section is vulnerable because it drops inherited limitations that still apply to the pipeline: feature backbone transfer, red-stamp preprocessing, longitudinal document-generation shifts, source-exemplar misattribution, and legal interpretation limits.
+
+8. The full-dataset robustness claim is easy to overread. §IV-K is intentionally "light scope"; calling it pipeline reproducibility or cross-scope operational reproducibility exceeds the evidence.
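Several of the interval claims audited in the provenance table are closed-form arithmetic and can be spot-checked independently of the spike reports. A minimal sketch (standalone Python, not part of the paper's pipeline; the 0-of-262 call reproduces the quoted Wilson 95% upper bound of 1.45%):

```python
import math

def wilson_upper(successes: int, n: int, z: float = 1.96) -> float:
    """Upper limit of the Wilson score 95% interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return centre + half

# 0 misses over the n = 262 byte-identical positive anchor
print(f"{wilson_upper(0, 262):.4f}")  # prints 0.0145
```

With zero observed misses the upper bound collapses to z^2 / (n + z^2), which is why the 1.45% figure depends only on the anchor size, not on any model output.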
+## Provenance Verification Table
+
+| # | Phase 4 numerical claim | Phase line(s) | Provenance checked | Status |
+|---:|---|---:|---|---|
+| 1 | Abstract is <=250 words | 11 | `sed -n '11p' ... \| wc -w` returned 247 | Confirmed, but close to limit |
+| 2 | 90,282 reports, 182,328 signatures, 758 CPAs | 11, 37, 125 | §IV:7 gives 90,282 PDFs; §IV:13 gives 182,328 extracted signatures; v3 §I:62 gives 758 CPAs | Confirmed with inherited full-corpus CPA source |
+| 3 | Big-4 sub-corpus: 437 CPAs, 150,442 signatures | 11, 37, 125 | §III:30; §IV:9, §IV:15; five-way report:14-15 | Confirmed |
+| 4 | Big-4 dip-test multimodality, `p < 5 x 10^-4` on both axes | 11, 31, 81, 127 | §III:34, §III:56, §III:171-172; §IV:27-34 | Confirmed for Big-4 |
+| 5 | "Neither any single firm pooled alone nor broader full-dataset variant rejects" | 31 | §III:34/56 and §IV:34 say only Firm A alone was tested among single firms; §IV-K has no full-dataset dip test | Not verified / overclaimed |
+| 6 | K=2 crossings `cos*=0.9755`, `dHash*=3.755`, cosine CI half-width 0.0015 | 31 | calibration report:16-17; §III:58, §III:166-170; §IV:60-63 | Confirmed |
+| 7 | K=2 LOOO max cosine-crossing deviation `0.028`, `5.6x` tolerance, Firm A held-out 100% vs non-A 0% | 31, 91 | calibration report:34-44; §III:78, §III:120; §IV:122-127 | Confirmed, with 0.0278 rounded to 0.028 |
+| 8 | K=3 components: C3 `0.983/2.41/0.321`, C2 `0.956/6.66/0.536`, C1 `0.946/9.17/0.143` | 33 | k3 LOOO report:8-10; convergence report:8-12; §III:70-76; §IV:69-75 | Confirmed after rounding |
+| 9 | K=3 C1 LOOO shape drift: cos <=0.005, dHash <=0.96, weight <=0.023 | 11, 33, 93, 127 | k3 LOOO report:77-79; §III:78, §III:122; §IV:139 | Confirmed |
+| 10 | K=3 held-out hard-posterior differences `1.8-12.8 pp` | 33, 93, 117 | k3 LOOO report:83-90; §III:122; §IV:134-139 | Confirmed after rounding |
+| 11 | Three-score Spearman convergence `rho >= 0.879` | 11, 35, 51, 97, 127 | convergence report:28-30; §III:100-104; §IV:83-87 | Confirmed numerically; wording must not say independent |
+| 12 | Per-signature K=3 consistency `Cohen kappa = 0.87` | 97 | §III:108-116; §IV:104-112 | Confirmed |
+| 13 | Pixel-identity subset `n = 262`, all three checks 0% miss, Wilson upper 1.45% | 11, 35, 53, 101, 127 | pixel-identity report:8, 14-16; §III:124-132; §IV:145-153 | Confirmed |
+| 14 | Firm A pixel-identical `145`, plus `50 partners` and `35 cross-year` | 85 | pixel-identity report:24 confirms 145; §III:44 and §III:190 mark 50/35 as inherited from v3 / Script 28, not regenerated in v4 spikes | Partially confirmed; provenance caveat needed |
+| 15 | Inter-CPA FAR `0.0005`, Wilson `[0.0003, 0.0007]` | 53, 101 | §III:188; §IV:157-159; inherited v3.20.0 §IV-F.1 Table X | Confirmed as inherited |
+| 16 | Full-dataset robustness `n = 686`, full rho `0.9558`, drift `0.007` | 11, 55, 107, 127 | full-dataset report:10-13, 25-31; §III:186; §IV:242-250 | Confirmed numerically, but interpretive scope is light |
+| 17 | Firm A `82%/82.5%` templated and Firm C `23.5%` hand-leaning | 85, 129 | convergence report:43-48; §IV:217-224 | Confirmed as descriptive K=3 hard assignment |
+
+## Cross-Reference Checks (Phase 4 <-> §III v6 / §IV v3.2)
+
+| Linkage | Phase 4 evidence | §III / §IV evidence | Status |
+|---|---:|---:|---|
+| Big-4 primary scope and sample size | Lines 11, 31, 37, 107, 125 | §III:30; §IV:9, §IV:15 | Numerically tight, but scope-test wording overbroad |
+| Accountant-level distributional characterisation refs | Line 29 | §III-I/J are the relevant methodology sections (§III:52-86); §IV-D/E correct (§IV:21-75) | Fail: `§III-D` is stale/wrong |
+| K=2 as firm-mass separator, not operational | Lines 31, 91 | §III:78-86, §III:120; §IV:118-127 | Tight |
+| K=3 descriptive only | Lines 33, 49, 93 | §III:64, §III:80-86, §III:156; §IV:75, §IV:139, §IV:224 | Tight, except "candidate classifier" wording |
+| Three-score internal consistency | Lines 11, 35, 51, 97, 127 | §III:90-106; §IV:79-102 | Numerically tight; independence wording fails |
+| Reverse-anchor reference as non-Big-4 | Lines 35, 97 | §III:48-50; §IV:89 | Tight |
+| Pixel-identity positive anchor | Lines 35, 101 | §III:124-134; §IV:141-155 | Tight; Firm A-only anchoring phrase should be narrowed |
+| Inter-CPA negative-anchor FAR | Lines 53, 101 | §III:126, §III:188; §IV:157-159 | Tight as inherited |
+| Five-way classifier primary / MC band inherited | Lines 33, 113 | §III:136-156; §IV:161-224 | Mostly tight; Abstract/Conclusion should name operational output more clearly |
+| Full-dataset robustness | Lines 55, 107, 127 | §IV:228-252 | Numerically tight; "pipeline reproducibility" overclaims light scope |
+| Internal notes and close-out artifacts | Lines 3, 133-142 | Round-25 review kept this open; §III and §IV also retain internal notes | Not partner/Phase-5 ready |
+
+## Phase 5 Readiness
+
+Partial.
+
+The §III/§IV technical foundation would likely survive cross-AI peer review, but the current Phase 4 prose would draw a Major Revision because it reintroduces known overclaims and has an incomplete §II. With the targeted prose repairs below, Phase 5 readiness should move to Yes.
+
+## Recommended Next-Step Actions
+
+1. Replace every "independent feature-derived scores" phrase with "three feature-derived scores" or "three feature-derived summaries," and preserve the shared-feature caveat in Abstract/§I/§V/§VI.
+
+2. Rewrite the Big-4 scope language at lines 31, 47, 81, 107, and 127 to match §III exactly: Big-4 is the smallest scope among the comparison scopes tested; B/C/D single-firm dip tests were not computed; no full-dataset dip-test result is reported.
+
+3. Fix the stale cross-reference at line 29: use §III-G/I/J/K as appropriate instead of §III-D.
+
+4. Turn §II into a real revised Related Work section: retain the v3.20.0 subsections in the master, splice in the LOOO paragraph, and replace `[add citation]` with a specific cross-validation citation.
+
+5. Rebuild §V-G limitations by merging the v4-specific limitations with still-valid v3 limitations: transferred ResNet-50 features, HSV stamp-removal artifacts, longitudinal scan/PDF confounds, source-exemplar misattribution, and legal/regulatory interpretation.
+
+6. Replace "published box rule" with "inherited Paper A box rule" unless an external publication citation is added.
+
+7. Narrow full-dataset language: say "K=3 + box-rule rank-convergence reproduces at full `n = 686`" rather than "pipeline reproducibility at multiple scopes."
+
+8. Before Phase 5, strip the Phase 4 draft note and close-out checklist (lines 3 and 133-142), and continue the same cleanup for §III/§IV internal notes flagged in round 25.
diff --git a/paper/v4/paper_a_prose_v4_phase4.md b/paper/v4/paper_a_prose_v4_phase4.md
index 80a6598..a7069bf 100644
--- a/paper/v4/paper_a_prose_v4_phase4.md
+++ b/paper/v4/paper_a_prose_v4_phase4.md
@@ -1,6 +1,6 @@
-# Paper A v4.0 Phase 4 Prose Draft v1
+# Paper A v4.0 Phase 4 Prose Draft v2 (post codex round 26)
 
-> **Draft note (2026-05-12, Phase 4 v1; internal — remove before submission).** This file replaces the v3.20.0 Abstract, §I Introduction, §II Related Work, §V Discussion, and §VI Conclusion blocks with the Big-4 reframed v4.0 prose. The methodology and results sections (§III v6 and §IV v3.2 on this branch, committed 2026-05-12) are the technical foundation; Phase 4 prose aligns the narrative with the convergent post-codex-rounds 21–25 framing. Inherited corpus-wide setup material (§III-A through §III-F; §IV-A through §IV-C; §IV-I; §IV-L) is referenced rather than restated. Empirical anchors cite Scripts 32–42 on branch `paper-a-v4-big4`.
+> **Draft note (2026-05-12, Phase 4 v2; internal — remove before submission).** This file replaces the v3.20.0 Abstract, §I Introduction, §II Related Work, §V Discussion, and §VI Conclusion blocks with the Big-4 reframed v4.0 prose.
The methodology and results sections (§III v6 and §IV v3.2 on this branch, committed 2026-05-12) are the technical foundation; Phase 4 prose aligns the narrative with the convergent post-codex-rounds 21–25 framing. v2 incorporates codex round-26 review (`paper/codex_review_gpt55_v4_round6.md`, Major Revision): strip "independent feature-derived scores" overclaim from Abstract/§I/§V/§VI; narrow Big-4 scope language to "comparison scopes tested in Script 32"; repair §III-D → §III-G/I/J/K cross-reference; turn §II into a real Related Work section that explicitly defers to v3.20.0 plus a LOOO splice with a real citation; restore 5 v3.x limitations (backbone transfer, HSV preprocessing, longitudinal confounds, source-exemplar misattribution, legal interpretation); narrow §IV-K "pipeline reproducibility" to "K=3 + box-rule rank-convergence reproduces"; replace "published box rule" with "inherited Paper A box rule"; reserve "operational" for the five-way classifier. Inherited corpus-wide setup material (§III-A through §III-F; §IV-A through §IV-C; §IV-I; §IV-L) is referenced rather than restated. Empirical anchors cite Scripts 32–42 on branch `paper-a-v4-big4`.
 
 ---
 
@@ -8,7 +8,7 @@
 
 > *IEEE Access target: <= 250 words, single paragraph.*
 
-Regulations require Certified Public Accountants (CPAs) to attest to each audit report with a signature, but digitization makes reusing a stored signature image across reports — through administrative stamping or firm-level electronic signing — technically trivial and visually invisible to report users, undermining individualized attestation. We build an end-to-end pipeline that detects such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, a YOLOv11 detector localizes signature regions, ResNet-50 supplies deep features, and a dual-descriptor verification layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 audit reports filed in Taiwan over 2013–2023, the pipeline yields 182,328 signatures from 758 CPAs. Our primary statistical analysis is scoped to the Big-4 sub-corpus (437 CPAs, 150,442 signatures with both descriptors), where the accountant-level $(\overline{\text{cos}}, \overline{\text{dHash}})$ distribution exhibits dip-test multimodality on both axes ($p < 5 \times 10^{-4}$). A 3-component Gaussian mixture recovers a stable cross-firm hand-leaning component (weight 0.143; LOOO drift $\leq 0.005$ cos / 0.96 dh / 0.023 weight) alongside a templated component dominated by Firm A. Three independent feature-derived scores — mixture posterior, reverse-anchor cosine percentile against a non-Big-4 reference, and a published box-rule hand-leaning rate — converge on the per-CPA ranking at Spearman $\rho \geq 0.879$. Against 262 byte-identical signatures (conservative-subset ground truth for replication), all three candidate classifiers achieve 0% positive-anchor miss rate (Wilson 95% upper bound 1.45%). A full-dataset robustness check ($n = 686$ CPAs) preserves the Spearman convergence with rho drift 0.007.
+Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes reusing a stored signature image across reports — through administrative stamping or firm-level electronic signing — technically trivial and visually invisible, undermining individualized attestation. We build an end-to-end pipeline detecting such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, a YOLOv11 detector localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. The operational output is an inherited five-way per-signature classifier with worst-case document-level aggregation. Applied to 90,282 audit reports filed in Taiwan over 2013–2023, the pipeline yields 182,328 signatures from 758 CPAs. Our primary statistical analysis is scoped to the Big-4 sub-corpus (437 CPAs, 150,442 signatures), where the accountant-level $(\overline{\text{cos}}, \overline{\text{dHash}})$ distribution exhibits dip-test multimodality ($p < 5 \times 10^{-4}$, both axes) — the smallest scope tested at which this holds. A 3-component mixture recovers a stable cross-firm hand-leaning component (weight 0.143; leave-one-firm-out drift $\leq 0.005$ cos / 0.96 dh / 0.023 weight) alongside a Firm-A-concentrated templated component. Three feature-derived scores (not statistically independent) — mixture posterior, reverse-anchor cosine percentile against a non-Big-4 reference, and the inherited box-rule hand-leaning rate — agree on the per-CPA ranking at Spearman $\rho \geq 0.879$. Against 262 byte-identical Big-4 signatures, all three achieve 0% positive-anchor miss rate (Wilson 95% upper bound 1.45%). A light full-dataset cross-scope check ($n = 686$) preserves the K=3 + box-rule Spearman convergence with drift 0.007.
 
 ---
 
@@ -22,17 +22,17 @@ The digitization of financial reporting has introduced a practice that complicat
 
 The distinction between *non-hand-signing detection* and *signature forgery detection* is conceptually and technically important. The extensive body of research on offline signature verification [3]–[8] focuses almost exclusively on forgery detection — determining whether a questioned signature was produced by its purported author. In our context, identity is not in question; the CPA is indeed the legitimate signer. The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports. This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction.
 
-A methodological concern shapes the research design. Many prior similarity-based classification studies rely on ad-hoc thresholds — declaring two images equivalent above a hand-picked cosine cutoff, for example — without principled statistical justification. Such thresholds are fragile in an archival-data setting. A defensible approach requires (i) a transparent threshold whose calibration scope is statistically supported; (ii) statistical diagnostics that characterise the *shape* of the underlying similarity distribution at the chosen scope; (iii) external validation against naturally-occurring anchor populations, with Wilson 95% confidence intervals on per-rule capture and miss rates; and (iv) cross-validation that demonstrates the calibration reproduces rather than collapses when the training composition changes.
+A methodological concern shapes the research design. Many prior similarity-based classification studies rely on ad-hoc thresholds — declaring two images equivalent above a hand-picked cosine cutoff, for example — without principled statistical justification. Such thresholds are fragile in an archival-data setting. A defensible approach requires (i) a transparent threshold whose calibration scope is statistically supported; (ii) statistical diagnostics that characterise the *shape* of the underlying similarity distribution at the chosen scope; (iii) annotation-free validation against naturally-occurring anchor populations, with Wilson 95% confidence intervals on per-rule capture and miss rates; and (iv) leave-one-firm-out cross-validation that exposes whether the calibration boundary remains stable when the training composition changes, or collapses into a firm-mass separator.
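Safeguard (iv) can be made concrete with a small sketch. The following is illustrative only (synthetic per-CPA values, a simple midpoint boundary standing in for the fitted mixture crossing; none of these numbers are the paper's): refit the boundary with each firm held out and track how far it moves.

```python
from statistics import mean

# Toy per-CPA mean-cosine values grouped by firm (synthetic, illustrative only;
# firm "A" sits much higher, mimicking a Firm-A-dominated templated mass).
firms = {
    "A": [0.985, 0.982, 0.988, 0.984],
    "B": [0.949, 0.953, 0.944],
    "C": [0.941, 0.947, 0.950],
    "D": [0.946, 0.943, 0.952],
}

def fit_boundary(values):
    """Stand-in for a mixture crossing: midpoint between the means of the
    lower and upper halves of the pooled, sorted values."""
    v = sorted(values)
    lo, hi = v[: len(v) // 2], v[len(v) // 2 :]
    return (mean(lo) + mean(hi)) / 2

pooled = [x for xs in firms.values() for x in xs]
full = fit_boundary(pooled)

# Leave-one-firm-out: refit with each firm held out, record boundary drift.
drift = {
    held: abs(fit_boundary([x for f, xs in firms.items() if f != held for x in xs]) - full)
    for held in firms
}
worst = max(drift, key=drift.get)
print(worst, round(drift[worst], 4))  # the Firm-A fold dominates the drift
```

In this toy setting, holding out the high-mass firm moves the boundary far more than any other fold, which is exactly the composition-sensitivity pattern the LOOO check is designed to expose.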
 
 Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation.
 
-In this paper we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) accountant-level distributional characterisation at the Big-4 sub-corpus scope (§III-D, §IV-D / §IV-E); (6) convergent internal-consistency analysis using three feature-derived scores (§III-K, §IV-F); (7) leave-one-firm-out reproducibility (§III-J, §IV-G); and (8) annotation-free validation against a pixel-identity positive anchor, an inter-CPA negative anchor, and a full-dataset robustness re-run (§IV-H, §IV-I, §IV-K).
+In this paper we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) accountant-level distributional characterisation at the Big-4 sub-corpus scope (§III-G through §III-J; §IV-D and §IV-E); (6) convergent internal-consistency analysis using three feature-derived scores (§III-K; §IV-F); (7) leave-one-firm-out reproducibility (§III-J and §III-K; §IV-G); and (8) annotation-free validation against a pixel-identity positive anchor, an inter-CPA negative anchor inherited from v3.x, and a light full-dataset cross-scope re-run (§IV-H, §IV-I, §IV-K).
 
-The methodological reframing relative to earlier versions of this work is central to our v4.0 contribution. The accountant-level $(\overline{\text{cos}}, \overline{\text{dHash}})$ distribution at the Big-4 sub-corpus scope is the smallest tested scope at which the Hartigan dip test rejects unimodality on both axes ($p < 5 \times 10^{-4}$, $n = 437$); neither any single firm pooled alone nor the broader full-dataset variant rejects unimodality at conventional levels. A 2-component Gaussian mixture fit to this distribution recovers marginal crossings $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with tight bootstrap 95% confidence intervals (cosine half-width $0.0015$). However, leave-one-firm-out cross-validation reveals that the K=2 boundary is essentially a Firm-A-vs-others separator: the maximum across-fold cosine-crossing deviation is $0.028$, $5.6\times$ the report's stability tolerance, and holding Firm A out flips the held-out classification rate from 0% to 100% replicated. We therefore do not use the K=2 fit as the operational classifier.
+The methodological reframing relative to earlier versions of this work is central to our v4.0 contribution. The accountant-level $(\overline{\text{cos}}, \overline{\text{dHash}})$ distribution at the Big-4 sub-corpus scope is the smallest scope tested in Script 32 at which the Hartigan dip test rejects unimodality on both axes ($p < 5 \times 10^{-4}$, $n = 437$); none of the narrower comparison scopes tested (Firm A pooled alone, Firms B + C + D pooled, all non-Firm-A pooled) reject unimodality at conventional levels. A 2-component Gaussian mixture fit to this distribution recovers marginal crossings $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with tight bootstrap 95% confidence intervals (cosine half-width $0.0015$). However, leave-one-firm-out cross-validation reveals that the K=2 boundary is essentially a Firm-A-vs-others separator: the maximum across-fold cosine-crossing deviation is $0.028$, $5.6\times$ the across-fold stability tolerance of $0.005$, and holding Firm A out flips the held-out classification rate from 0% to 100% replicated. We therefore do not use the K=2 fit as the operational classifier.
 
-A 3-component fit (BIC lower by $3.48$) instead identifies three accountant-level components: a templated cluster (cos $\approx 0.983$, dHash $\approx 2.41$, weight $0.321$), a mixed cluster (cos $\approx 0.956$, dHash $\approx 6.66$, weight $0.536$), and a cross-firm hand-leaning cluster (cos $\approx 0.946$, dHash $\approx 9.17$, weight $0.143$). The hand-leaning component shape is reproducible across LOOO folds (max drift $0.005$ in cos, $0.96$ in dHash, $0.023$ in weight); hard-posterior membership remains composition-sensitive (absolute differences $1.8$–$12.8$ pp), so we use K=3 as an *accountant-level descriptive characterisation*, not as an operational classifier. The signature-level operational classifier remains the inherited five-way box rule from v3.x literature, retained for continuity.
+A 3-component fit (BIC lower by $3.48$) instead identifies three accountant-level components: a templated cluster (cos $\approx 0.983$, dHash $\approx 2.41$, weight $0.321$), a mixed cluster (cos $\approx 0.956$, dHash $\approx 6.66$, weight $0.536$), and a cross-firm hand-leaning cluster (cos $\approx 0.946$, dHash $\approx 9.17$, weight $0.143$). The hand-leaning component shape is reproducible across LOOO folds (max drift $0.005$ in cos, $0.96$ in dHash, $0.023$ in weight); hard-posterior membership remains composition-sensitive (absolute differences $1.8$–$12.8$ pp), so we use K=3 as an *accountant-level descriptive characterisation*, not as an operational classifier.
The signature-level operational classifier remains the inherited Paper A v3.x five-way box rule, retained for continuity with the prior literature. -Three independent feature-derived scores converge on the per-CPA hand-leaning ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior; a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference distribution; and the inherited box-rule hand-leaning rate. The three scores are not statistically independent measurements (they are functions of the same descriptor pair), so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. External hard ground truth is provided by 262 byte-identical signatures in the Big-4 subset, against which all three candidate classifiers achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). +Three feature-derived scores converge on the per-CPA hand-leaning ranking with Spearman $\rho \geq 0.879$: the K=3 mixture posterior; a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference distribution; and the inherited box-rule hand-leaning rate. The three scores are not statistically independent — they are deterministic functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency among feature-derived ranks rather than as external validation. Hard ground truth for the *replicated* class is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate scores achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity, and we discuss the conservative-subset caveat in §V-G. 
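As a quick check on the quoted interval: with zero observed misses, the 95% Wilson score upper limit reduces algebraically to $z^2/(n+z^2)$. A minimal sketch, independent of the paper's own evaluation code:

```python
import math

def wilson_upper(successes: int, n: int, z: float = 1.96) -> float:
    """Upper limit of the Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / denom

# 0 misses out of 262 byte-identical Big-4 signatures
print(round(wilson_upper(0, 262) * 100, 2))  # -> 1.45
```

With $z = 1.96$ the zero-miss case gives $3.8416 / 265.8416 \approx 0.01445$, consistent with the $1.45\%$ bound reported in the draft.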
We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. The Big-4 sub-corpus comprises 437 CPAs and 150,442 signatures with both descriptors available.
@@ -44,15 +44,15 @@ The contributions of this paper are:
3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between *style consistency* and *image reproduction*, and we validate the backbone choice through a feature-backbone ablation.

-4. **Big-4 sub-corpus scope as the methodologically privileged calibration unit.** We show that the accountant-level mixture model becomes statistically supportable (dip-test rejects unimodality) at the Big-4 scope and not at narrower scopes, and we report the scope choice as a substantive methodological finding rather than an arbitrary subset.
+4. **Big-4 sub-corpus scope as the methodologically privileged calibration unit.** We show that the accountant-level mixture model becomes statistically supportable (dip-test rejects unimodality) at the Big-4 scope and not in the narrower comparison scopes tested (Firm A pooled alone; Firms B + C + D pooled; all non-Firm-A pooled). We report the scope choice as a substantive methodological finding rather than an arbitrary subset.

5. **Accountant-level K=3 mixture as descriptive characterisation.** We fit and document a K=3 mixture whose hand-leaning component shape is LOOO-reproducible while its hard-posterior membership remains composition-sensitive. We do not treat K=3 as an operational classifier; we treat it as the descriptive accountant-level summary of the Big-4 distribution.

-6. **Three-score convergent internal-consistency.** We measure agreement among three feature-derived scores at $\rho \geq 0.879$ and report this as internal consistency rather than external validation, given the shared-feature caveat.
+6. **Three-score convergent internal-consistency.** We measure agreement among three feature-derived scores at $\rho \geq 0.879$ and report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.

7. **Annotation-free validation against a hard positive anchor.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, complementing v3.x's inherited inter-CPA negative-anchor FAR ($0.0005$ at the operational cosine cut).

-8. **Full-dataset robustness as light secondary scope.** We re-run the K=3 + box-rule convergence at the full $n = 686$ CPA scope and observe Spearman $\rho = 0.9558$ (drift $0.007$ from Big-4), demonstrating pipeline reproducibility at multiple scopes while preserving Big-4 as the primary methodological scope.
+8. **Light full-dataset cross-scope check.** We re-run the K=3 + box-rule rank-convergence check at the full $n = 686$ CPA scope and observe Spearman $\rho = 0.9558$ (drift $0.007$ from Big-4), demonstrating that this specific convergence reproduces at the broader scope. We do not interpret this as a re-validation of the operational thresholds or of the LOOO / pixel-identity / five-way components, all of which remain scoped to Big-4.

The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity validation, and full-dataset robustness.
Section V discusses the implications and limitations. Section VI concludes with directions for future work.
@@ -60,11 +60,9 @@ The remainder of the paper is organized as follows. Section II reviews related w

# II. Related Work

-> *Inherited largely unchanged from v3.20.0; v4.0 updates only the framing-relevant sentences.*
+> *Note for the Phase 4 review pass: §II is inherited substantively unchanged from v3.20.0 §II in the master manuscript, with one new paragraph added below. The unchanged content is not reproduced in this Phase 4 file; readers reviewing this draft should consult `paper/paper_a_related_work_v3.md` for the v3.20.0 §II text covering offline signature verification, near-duplicate detection, copy-move forgery detection, perceptual hashing, deep-feature similarity, and the statistical methods adopted (Hartigan dip test, finite mixture EM, Burgstahler-Dichev / McCrary density-smoothness diagnostic). The paragraph below is the only v4.0-specific §II addition.*

-The literature reviewed in v3.20.0 §II covers offline signature verification [3]–[8], near-duplicate image detection [12], [13], copy-move forgery detection [10], [11], the Hartigan dip test [37], finite mixture EM methods [40], [41], and the Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39]. The v3.20.0 §II content is retained without substantive change in v4.0.
-
-One v4.0-specific addition belongs in §II: the methodological motivation for *leave-one-firm-out* cross-validation in a small-cluster scope. LOOO is standard in cross-validation literature [add citation] but has been used selectively in document-forensics calibration. Our application differs from the standard validation usage in two respects: (i) the hold-out unit is the firm (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *calibration uncertainty band* on the operational rule, not as a sufficiency claim for the rule itself. We treat LOOO drift as descriptive information about the rule's composition sensitivity rather than as a pass/fail test.
+**Addition for v4.0: leave-one-firm-out cross-validation in a small-cluster scope.** Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the inherited five-way operational classifier (which is calibrated separately; §III-L). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier. Numerical references [42]–[44] are placeholders in this draft and will be replaced with the project's preferred references at copy-edit time.

---
@@ -76,15 +74,15 @@ Non-hand-signing differs from forgery in that the questioned signature is produc

## B. Per-Signature Similarity is a Continuous Quality Spectrum; the Accountant-Level Distribution is Multimodal

-A central empirical finding of v3.x was that *per-signature* similarity does not admit a clean two-mechanism mixture: dip-test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 inherits and confirms this signature-level reading.
+A central empirical finding of v3.x was that *per-signature* similarity does not admit a clean two-mechanism mixture: dip-test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 inherits this signature-level reading and remains consistent with it (no signature-level diagnostic was newly run in v4.0 to overturn the v3.x finding; the v4.0 per-signature evidence is the K=3 fit consistency check of §IV-F).

-The v4.0 *accountant-level* analysis at the Big-4 scope tells a complementary story. When per-CPA means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ are computed for the 437 Big-4 CPAs, the Hartigan dip test rejects unimodality on both marginals at $p < 5 \times 10^{-4}$. This is the smallest tested scope at which a finite-mixture model is statistically supported. The accountant-level multimodality reflects between-CPA heterogeneity in signing practice (some CPAs are templated, some are hand-leaning, some are mixed) — not a two-mechanism boundary within any single CPA's signing pattern. The two readings are consistent: a CPA can be hand-leaning on average while still occasionally reusing template signatures, which produces a unimodal per-signature distribution within the CPA but a multimodal per-CPA distribution across CPAs.
+The v4.0 *accountant-level* analysis at the Big-4 scope tells a complementary story.
When per-CPA means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ are computed for the 437 Big-4 CPAs, the Hartigan dip test rejects unimodality on both marginals at $p < 5 \times 10^{-4}$. Among the comparison scopes tested in Script 32 (Firm A pooled alone; Firms B + C + D pooled; all non-Firm-A pooled), this is the smallest scope at which a finite-mixture model is statistically supported. The accountant-level multimodality reflects between-CPA heterogeneity in *per-CPA-mean location* across the descriptor plane — some CPAs' observed signatures place their per-CPA means in the templated region of the plane, others in the mixed region, others in the hand-leaning region — not a within-CPA two-mechanism boundary. The two readings can be jointly consistent: a CPA whose per-CPA mean places them in a mixed or hand-leaning region of the accountant-level plane may still have a unimodal within-CPA signature distribution, because the per-CPA mean is a summary statistic and is not a claim that all of that CPA's signatures share a single signing mechanism (§III-G).

## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)

-Firm A is empirically the most digitally-replicated of the four Big-4 firms in our corpus. In the Big-4 K=3 hard-posterior assignment, Firm A accounts for $0\%$ of the hand-leaning component and $82.5\%$ of the templated component; the opposite pattern holds at one of the non-Firm-A Big-4 firms (Firm C, which has the highest hand-leaning concentration at $23.5\%$). Byte-level pair analysis (inherited from v3.x §IV-F.1) identifies 145 Firm A pixel-identical signatures across 50 distinct Firm A partners, with 35 byte-identical matches spanning different fiscal years.
+Firm A is empirically the most digitally-replicated of the four Big-4 firms in our corpus. In the Big-4 K=3 hard-posterior assignment, Firm A accounts for $0\%$ of the hand-leaning component and $82.5\%$ of the templated component; the opposite pattern holds at one of the non-Firm-A Big-4 firms (Firm C, which has the highest hand-leaning concentration at $23.5\%$). Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). The additional v3.x finding that the 145 Firm A pixel-identical signatures span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches across different fiscal years, is inherited from v3.20.0 §IV-F.1 / Script 28 / Appendix B byte-decomposition output and was not regenerated in v4.0's spike scripts; we retain those numbers by reference.

-In v4.0 we treat Firm A as a *templated-end case study* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with the other three Big-4 firms; the K=3 components are derived from the joint Big-4 distribution. Firm A's role in the methodology is descriptive: it is the within-Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner, and the byte-level pair evidence provides the firm-level signature-reuse ground truth that anchors §IV-H's positive-anchor miss-rate analysis. This is a deliberate departure from v3.x's "Firm A as replication-dominated calibration reference" framing, motivated by the LOOO evidence below.
+In v4.0 we treat Firm A as a *templated-end case study* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with the other three Big-4 firms; the K=3 components are derived from the joint Big-4 distribution. Firm A's role in the methodology is descriptive: it is the within-Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane. The byte-level evidence above is a firm-level signature-reuse case study; the Big-4 byte-identical positive anchor of §IV-H pools all four firms' byte-identical signatures and is not anchored on Firm A alone. This is a deliberate departure from v3.x's "Firm A as replication-dominated calibration reference" framing, motivated by the LOOO evidence below.

## D. K=2 is Firm-Mass Driven; K=3 is Reproducible in Shape
@@ -94,7 +92,7 @@ K=3 in contrast has a *reproducible component shape*: across the four folds the

## E. Three-Score Convergent Internal-Consistency

-Three feature-derived scores agree on the per-CPA hand-leaning ranking at Spearman $\rho \geq 0.879$: the K=3 mixture posterior; the reverse-anchor cosine percentile under a non-Big-4 reference distribution; and the inherited box-rule hand-leaning rate. The three scores are *not* statistically independent measurements — they are functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency rather than external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The strength of the convergence (all pairwise $|\rho| > 0.87$) and its persistence at the signature level (Cohen $\kappa = 0.87$ between per-CPA-fit and per-signature-fit K=3 binary labels) are nevertheless informative: per-CPA aggregation does not collapse the broad three-component ordering, and three different summarisations of the descriptor space — mixture posterior, deviation from an external reference, and a published box rule — produce the same per-CPA ranking.
+Three feature-derived scores agree on the per-CPA hand-leaning ranking at Spearman $\rho \geq 0.879$: the K=3 mixture posterior; the reverse-anchor cosine percentile under a non-Big-4 reference distribution; and the inherited Paper A box-rule hand-leaning rate.
The three scores are *not* statistically independent measurements — they are deterministic functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency rather than external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The strength of the convergence (all pairwise $|\rho| > 0.87$) and its persistence at the signature level (Cohen $\kappa = 0.87$ between per-CPA-fit and per-signature-fit K=3 binary labels) are nevertheless informative: per-CPA aggregation does not collapse the broad three-component ordering, and three different summarisations of the descriptor space — mixture posterior, deviation from an external reference, and the inherited Paper A box rule — produce broadly concordant per-CPA rankings, with a residual non-Firm-A disagreement (the reverse-anchor cosine percentile ranks Firm D fractionally above Firm C, while the mixture posterior and the box-rule rate rank Firm C highest among non-Firm-A firms).

## F. Pixel-Identity as a Hard Positive Anchor, Inherited Inter-CPA Negative-Anchor FAR
@@ -102,31 +100,41 @@ The only hard ground-truth subset in the corpus is pixel-identical signatures: t

## G. Limitations

-Several limitations should be transparent.
+Several limitations should be transparent. The first seven are v4.0-specific; the last five are inherited from v3.20.0 §V-G and still apply to the v4.0 pipeline.

-*Scope.* The v4.0 primary analyses are scoped to the Big-4 sub-corpus, restricted from the full $n = 686$ accountant population because the dip-test multimodality and the LOOO firm-fold structure are not available at narrower or broader scopes in our setting. Generalisation to mid- and small-firm contexts is left as future work; the §IV-K full-dataset robustness check shows the Spearman convergence is preserved at $n = 686$ but does not validate the Big-4 operational thresholds at the broader scope.
+*Scope.* The v4.0 primary analyses are scoped to the Big-4 sub-corpus. Among the comparison scopes tested in Script 32, the Big-4 sub-corpus is the smallest at which the accountant-level dip test rejects unimodality and is the only scope at which the four-fold firm-level LOOO structure exists; we did not test the full $n = 686$ accountant population's dip-test marginals or perform LOOO at that scope. Generalisation to mid- and small-firm contexts is left as future work; the §IV-K full-dataset Spearman re-run shows the K=3 + box-rule rank-convergence is preserved at $n = 686$ but does not validate the Big-4 operational thresholds, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope.

-*No signature-level hand-signed ground truth.* The corpus does not contain a labelled hand-signed class at the signature level, so the reverse-anchor classifier's cut is chosen by prevalence calibration against the inherited box rule's overall replicated rate rather than by direct ROC optimisation. Future work could supplement v4.0 with a small human-rated validation set.
+*No signature-level hand-signed ground truth.* The corpus does not contain a labelled hand-signed class at the signature level, so the reverse-anchor score's cut is chosen by prevalence calibration against the inherited box rule's overall replicated rate rather than by direct ROC optimisation. Future work could supplement v4.0 with a small human-rated validation set.

-*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases. A classifier that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
+*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the inherited box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).

*Inherited rule components are not separately v4-validated.* The five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain their v3.20.0 calibration and capture-rate evidence; v4.0's convergent-checks evidence (§III-K, §IV-F) covers only the binary high-confidence sub-rule.

*A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.

-*Composition sensitivity of K=3.* The K=3 hard-posterior membership for any single firm varies by up to $12.8$ pp across LOOO folds. This is documented as calibration uncertainty rather than failure, but it means K=3 hard labels are not used as v4.0 operational classifier output.
+*Composition sensitivity of K=3.* The K=3 hard-posterior membership for any single firm varies by up to $12.8$ pp across LOOO folds. This is documented as a composition-sensitivity band rather than failure, but it means K=3 hard labels are not used as v4.0 operational classifier output; they are reported only as accountant-level descriptive characterisation.

-*Domain ground truth.* v4.0 reports population-level patterns; it does not perform partner-level mechanism attribution or report-level claims of intent. The signature-level outputs are signature-level quantities throughout.
+*No partner-level mechanism attribution.* v4.0 reports population-level patterns; it does not perform partner-level mechanism attribution or report-level claims of intent. The signature-level outputs are signature-level quantities throughout.
+
+*Transferred ImageNet features (inherited from v3.20.0).* The ResNet-50 feature extractor uses pre-trained ImageNet weights without signature-domain fine-tuning. While our backbone-ablation study (§IV-L, inherited from v3.20.0 §IV-I) and prior literature support the effectiveness of transferred ImageNet features for signature comparison, a signature-domain fine-tuned feature extractor could improve discriminative performance.
+
+*Red-stamp HSV preprocessing artifacts (inherited from v3.20.0).* The red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. Blended pixels are replaced with white, potentially creating small gaps in signature strokes that could reduce dHash similarity. This bias would push classifications toward false negatives rather than false positives.
+
+*Longitudinal scan / PDF / compression confounds (inherited from v3.20.0).* Scanning equipment, PDF generation software, and compression algorithms may have changed over the 2013–2023 study period, potentially affecting similarity measurements. While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.
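To make the red-stamp caveat concrete, the kind of HSV filtering the limitation describes can be sketched with the standard library alone; the hue, saturation, and value thresholds below are illustrative stand-ins, not the pipeline's calibrated parameters:

```python
import colorsys

def remove_red_pixels(rgb_pixels, sat_min=0.4, val_min=0.3):
    """Replace red-ish pixels (hue near 0 degrees, saturated, bright) with white.

    rgb_pixels: list of (r, g, b) tuples in 0-255. Thresholds are illustrative.
    """
    out = []
    for r, g, b in rgb_pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        is_red = (h <= 10 / 360 or h >= 350 / 360) and s >= sat_min and v >= val_min
        out.append((255, 255, 255) if is_red else (r, g, b))
    return out

# a saturated seal pixel is whitened; a dark ink pixel is kept
print(remove_red_pixels([(200, 30, 30), (40, 40, 40)]))  # -> [(255, 255, 255), (40, 40, 40)]
```

The failure mode in the limitation follows directly: a pixel where dark ink blends with the red seal can fall inside the red band, get whitened, and leave a gap in the stroke, which biases the dHash comparison toward false negatives rather than false positives.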
+
+*Source-exemplar misattribution in max/min pair logic (inherited from v3.20.0).* The max-cosine / min-dHash detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case where one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as a stamping or e-signature template, the pair correctly identifies image reuse but misattributes non-hand-signed status to the source exemplar. This affects at most one source document per template variant per CPA and is not expected to be common.
+
+*Legal and regulatory interpretation (inherited from v3.20.0).* Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a jurisdiction-specific legal question. Our technical analysis can inform such determinations but cannot resolve them.

---

# VI. Conclusion and Future Work

-We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a methodological framework for characterising and validating it at the Big-4 sub-corpus scope. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. Applied to 90,282 audit reports filed between 2013 and 2023, it extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs / 150,442 signatures) as the primary analytical population.
+We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a methodological framework for characterising and validating it at the Big-4 sub-corpus scope. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation.
The operational output is an inherited Paper A five-way per-signature classifier with worst-case document-level aggregation (§III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs / 150,442 signatures) as the primary analytical population.

-Our central methodological contributions are: (1) the identification of the Big-4 sub-corpus as the smallest tested scope at which an accountant-level finite-mixture model is statistically supportable (dip-test $p < 5 \times 10^{-4}$); (2) the recognition that a 2-component mixture at this scope is a firm-mass separator rather than a mechanism boundary, evidenced by leave-one-firm-out instability; (3) a 3-component mixture whose hand-leaning component shape is LOOO-reproducible (cos drift $\leq 0.005$, dHash drift $\leq 0.96$, weight drift $\leq 0.023$) but whose hard-posterior membership is composition-sensitive ($\pm 5$–$13$ pp across folds), supporting K=3 as descriptive characterisation rather than operational classifier; (4) three feature-derived scores that converge on the per-CPA hand-leaning ranking at Spearman $\rho \geq 0.879$, with the explicit caveat that the scores are not statistically independent; (5) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures (Wilson 95% upper bound $1.45\%$); and (6) preservation of the K=3 + box-rule Spearman convergence at the full $n = 686$ CPA scope (drift $0.007$), demonstrating cross-scope pipeline reproducibility.
+Our central methodological contributions are: (1) the identification of the Big-4 sub-corpus as the smallest scope tested at which an accountant-level finite-mixture model is statistically supportable (dip-test $p < 5 \times 10^{-4}$); (2) the recognition that a 2-component mixture at this scope is a firm-mass separator rather than a mechanism boundary, evidenced by leave-one-firm-out instability; (3) a 3-component mixture whose hand-leaning component *shape* is LOOO-reproducible (cos drift $\leq 0.005$, dHash drift $\leq 0.96$, weight drift $\leq 0.023$) while its hard-posterior membership remains composition-sensitive ($1.8$–$12.8$ pp across folds), supporting K=3 as descriptive characterisation rather than operational classifier; (4) three feature-derived scores that converge on the per-CPA hand-leaning ranking at Spearman $\rho \geq 0.879$, with the explicit caveat that the scores share the underlying descriptor pair and are not statistically independent; (5) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures (Wilson 95% upper bound $1.45\%$); and (6) at the full $n = 686$ CPA scope, the K=3 + box-rule rank-convergence reproduces (Spearman drift $0.007$ from Big-4), demonstrating that this specific convergence is robust to a wider scope while leaving the operational thresholds, the LOOO structure, the five-way classifier, and the pixel-identity check scoped to Big-4.

-Future work falls in four directions. *First*, a small-scale human-rated validation set would enable direct ROC optimisation of the reverse-anchor cut and provide a signature-level hand-signed ground truth that v4.0 currently lacks. *Second*, the within-firm composition sensitivity of K=3 hard labels could be addressed with hierarchical mixture models or partial-pooling formulations that explicitly model firm-level random effects. *Third*, the policy and audit-quality implications of within-Big-4 contrast — Firm A's $82\%$ templated concentration vs Firm C's $23.5\%$ hand-leaning concentration — invite a substantive companion analysis examining whether differences in firm signing practice correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires either new methodological scope choices (since mid/small-firm samples do not support the dip-test or LOOO structure we use at Big-4) or a different validation framework altogether.
+Future work falls in four directions. *First*, a small-scale human-rated validation set would enable direct ROC optimisation of the reverse-anchor cut and provide a signature-level hand-signed ground truth that v4.0 currently lacks. *Second*, the within-firm composition sensitivity of K=3 hard labels could be addressed with hierarchical mixture models or partial-pooling formulations that explicitly model firm-level random effects. *Third*, the descriptive within-Big-4 component-membership contrast — Firm A's $82\%$ C3-templated hard-posterior concentration and Firm C's $23.5\%$ C1-hand-leaning hard-posterior concentration are accountant-level descriptive features of the K=3 mixture, not validated mechanism-level claims and not currently linked to audit-quality outcomes — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires either new methodological scope choices (since mid/small-firm samples do not support the dip-test or LOOO structure we use at Big-4) or a different validation framework altogether.

---
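For reference, the rank-convergence statistic quoted throughout ($\rho \geq 0.879$ at Big-4; $\rho = 0.9558$ at full scope) is ordinary Spearman correlation, i.e. Pearson correlation of average ranks. A self-contained sketch on hypothetical per-CPA scores (not the project's data):

```python
def average_ranks(xs):
    """1-based ranks; tied values receive the average of their rank positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend the group while the sorted values remain equal
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation applied to the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# hypothetical hand-leaning scores from two of the three per-CPA summaries
posterior = [0.91, 0.12, 0.55, 0.88, 0.05]
box_rate = [0.80, 0.10, 0.40, 0.85, 0.02]
print(round(spearman_rho(posterior, box_rate), 3))  # -> 0.9
```

Because the statistic depends only on ranks, it compares the three non-commensurate scores without assuming any common scale, which is the appropriate notion of agreement for the internal-consistency reading the draft adopts.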