Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul

Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1): Must-fix items (6/6): - §III-F SSIM/pixel rejection rewritten from first principles (design-level argument from luminance/contrast/structure local-window product, not the prior empirical 0.70 result) - Table VI restructured by population × method; added missing Firm A logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary marked bin-unstable (Appendix A) - Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted from "operational dual" to "calibration-fold-adjacent reference"; the actual classifier rule cos>0.95 AND dH≤15 = 92.46% added throughout - New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A on top); script 30_yearly_big4_comparison.py - Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets - §III-K reframed P7.5 from "round-number lower-tail boundary" to operating point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds: 0.9407 / 0.945 / 0.95 / 0.977 / 0.985) Nice-to-have items (3/3): - Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985) - Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to table notes; cut "invite reviewer skepticism" and "non-load-bearing" Codex 3-pass verification cleanup: - Stale 0.973/0.977/0.979 references unified on canonical 0.977 (Firm A Beta-2 forced-fit crossing from beta_mixture_results.json) - dHash≤8 wording corrected to P95-adjacent (P95 = 9, ≤8 is the integer immediately below) instead of misleading "rounded down" - Table XII-B prose corrected: per-segment qualification of "non-Firm-A capture falls faster" (true on 0.95→0.977 segment but contracts on 0.977→0.985 segment); arithmetic now from exact counts Within-year analyses removed: - Within-year ranking robustness check (Class A) was added in nice-to-have pass but contradicts v3.14 A2-removal stance; removed from §IV-G.2 + the Appendix B provenance row - Within-CPA future-work disclosures (Class B) removed from Discussion limitation #5 and Conclusion future-work paragraph; subsequent limitations renumbered Sixth → Fifth, Seventh → Sixth DOCX rendering pipeline overhaul (paper/export_v3.py): Critical fix - every v3 DOCX since v3.0 was shipping WITHOUT TABLES: strip_comments() was wholesale-deleting HTML comments, but every numerical table is wrapped in , so the table body was deleted alongside the wrapper. Now unwraps TABLE comments (emit synthetic __TABLE_CAPTION__: marker + table body) while still stripping non-TABLE editorial comments. Result: 19 tables now render in the DOCX. Other rendering fixes: - LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤≥, ×·≈, →↔⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,}) - Math-context-scoped sub/superscript via PUA sentinels (/): no more underscore-eating in identifiers like signature_analysis - Display equations rendered via matplotlib mathtext to PNG (3 equations: cosine sim, mixture crossing, BD/McCrary Z statistic), embedded as numbered equation blocks (1), (2), (3); content-addressed cache at paper/equations/ (gitignored, regenerable) - Manual numbered/bulleted list rendering with hanging indent (replaces python-docx style="List Number" which silently drops the number prefix when no numbering definition is bound) - Markdown blockquote (> ...) defensively stripped - Pandoc footnote ([^name]) markers no longer leak (inlined at source) - Heading text cleaned of LaTeX residue + PUA sentinels - File paths in body text (signature_analysis/X.py, reports/Y.json) trimmed to "(reproduction artifact in Appendix B)" pointers New leak linter: paper/lint_paper_v3.py - two-pass markdown source + rendered DOCX leak detector; auto-runs at end of export_v3.py. Script changes: - 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to canonical FAR threshold list so Table XII-B is reproducible from persisted JSON - 30_yearly_big4_comparison.py: NEW; generates Fig. 4 + per-firm yearly data (writes to reports/figures/ and reports/firm_yearly_comparison/) - 31_within_year_ranking_robustness.py: NEW; supports the within-year robustness check (no longer cited in paper but kept as repo-internal due-diligence artifact) Partner handoff DOCX shipped to ~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB: 19 tables + 4 figures + 3 equation images). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paper A v3.19.1: address codex partner-redpen audit residual ("upper bound" wording)
2026-05-06 13:44:49 +08:00 · 2026-04-27 23:05:39 +08:00 · 2026-04-27 22:20:52 +08:00 · 2026-04-27 21:56:54 +08:00 · 2026-04-27 21:40:43 +08:00 · 2026-04-27 21:40:42 +08:00
37 changed files with 7593 additions and 456 deletions
@@ -0,0 +1,9 @@
+# Codex gpt-5.4 Opinion on BD/McCrary Option (a) vs (b)
+
+(read-only sandbox; full text captured verbatim from codex output log)
+
+**Recommendation: option (b) DEMOTE**, with a preferred hybrid (c) if time permits.
+
+---
+
+For an IEEE Access reviewer, option (b) is the more defensible choice. At this point the paper's own evidence no longer supports BD/McCrary as a co-equal threshold estimator: at the accountant level, which is now the methodologically decisive level for inference, it produces no significant transition at all, while at the signature level the reported cosine transition is 0.985 with very large adjacent Z values, meaning the procedure is finding a sharp local density irregularity inside the non-hand-signed mode rather than a boundary between the two mechanisms the paper is supposed to separate. That is the central problem. If BD remains framed in the Abstract, Introduction, and Section III-I as one of three threshold estimators, the natural reviewer response is not "good triangulation" but "why do the estimators fail to converge around the accountant-level band of roughly 0.976 +/- 0.003?" and the manuscript has no persuasive answer beyond "BD is different." The missing bin-width robustness makes that vulnerability worse, not better: with a fixed 0.005 cosine bin width on a very large sample, the present signature-level transition could reflect a real local feature, a histogram-resolution artifact, or both, and running the sweep now creates asymmetric downside risk because instability would directly weaken Method 2 while stability still would not solve the deeper interpretability problem that the transition sits within, not between, modes. By contrast, option (b) aligns the front half of the paper with what the Discussion already correctly says in Sections V-B and V-G: BD/McCrary is informative here as a density-smoothness diagnostic, not as an independent accountant-level threshold setter. That reframing actually sharpens the paper's substantive claim. The coherent story is that accountant-level aggregates are structured enough for KDE and mixture methods to yield convergent thresholds, yet smooth enough that a discontinuity-based method does not identify a sharp density break; this supports "clustered but smoothly mixed" behavior better than the current "three estimators" rhetoric does. A third option the author has not explicitly considered is a hybrid: demote BD in the main text exactly as in option (b), but run a short bin-width sweep and place the results in an appendix or supplement as an audit trail. That would let the authors say, in one sentence, either that the signature-level transition is not robust to binning or that it is bin-stable but still diagnostically located at 0.985 and therefore not used as the accountant-level threshold. In my view that hybrid is the strongest version if time permits; but if the choice is strictly between (a) and (b), I would recommend (b) without hesitation.
@@ -0,0 +1,43 @@
+# Codex Partner Red-Pen Regression Audit (Paper A v3.19.0)
+
+Scope: focused regression audit of whether the authors' partner red-pen comments on v3.17 have been adequately addressed in the current v3.19.0 manuscript files under `paper/`. This is not a fresh peer review.
+
+## 1. Overall summary
+
+For the 11 lettered red-pen items (a-k), my independent count is **7 RESOLVED / 1 IMPROVED / 0 PARTIAL / 0 UNRESOLVED / 3 N/A**. The two broader theme-level issues are **Citation reality: RESOLVED** and **ZH/EN alignment: N/A**.
+
+My bottom-line assessment is close to Gemini's: the revision substantially addresses the partner's concerns by deleting the most confusing accountant-level GMM / accountant-level BD-McCrary material and by replacing several AI-sounding explanations with more literal, auditable prose. I do not agree with Gemini's fully clean "8 RESOLVED / 3 N/A" verdict, however. The BIC / strict-3-component item is materially improved, but the manuscript still retains "upper bound" wording in the methods and Table VI even though the results correctly call the two-component fit a forced fit. That is a small prose/rationale residue, not a blocking unresolved issue.
+
+## 2. Item-by-item table
+
+| Item | Status | Manuscript section addressing it | Brief justification | Disagreement with Gemini audit |
+|---|---:|---|---|---|
+| Theme 1: Citation reality for refs [5], [16], [21], [22], [25], [27], [37]-[41] | RESOLVED | `paper_a_references_v3.md`; `reference_verification_v3.md` | The current reference list fixes the serious [5] author/title error and includes real, recognizable method references for Hartigan, Burgstahler-Dichev, McCrary, Dempster-Laird-Rubin, and White. The flagged technical references are not hallucinated. Minor citation-polish items from the verification file appear fixed in the current reference list. | No substantive disagreement. One housekeeping note: `reference_verification_v3.md` still describes [5] as a "major problem" in the detailed findings/recommendations because it records the audit history; the actual current reference list is fixed. |
+| Theme 3: ZH/EN alignment gap at end of III-H Calibration Reference | N/A | Entire v3.19.0 manuscript | The dual-language zh-TW/en scaffold that produced the partner's "no English alongside?" concern is gone. The current draft is monolingual English for IEEE submission, so there is no remaining bilingual alignment task. | No disagreement. |
+| (a) A1 stipulation, "do not understand your description" | RESOLVED | Section III-G, `paper_a_methodology_v3.md` | A1 is now stated as a specific cross-year pair-existence assumption: if replication occurs, at least one same-CPA near-identical pair exists in the observed same-CPA pool. The text also states when A1 may fail. This is much clearer than a vague stipulation. | No disagreement. |
+| (h) A1 pair-detectability paragraph red-circled | RESOLVED | Section III-G, `paper_a_methodology_v3.md` | The red-circled assumption is now bounded: it is plausible for high-volume stamping/e-signing, not guaranteed under singletons, multiple templates, or scan noise, and not a within-year uniformity claim. That should answer the partner's concern about over-assumption. | No disagreement. |
+| (b) Conservative structural-similarity wording, "a bit roundabout?" | RESOLVED | Section III-G, `paper_a_methodology_v3.md` | The independent-minimum dHash is now defined directly as the minimum Hamming distance to any same-CPA signature and identified as the statistic used in the classifier and capture-rate analyses. The wording is concise enough for re-read. | No disagreement. |
+| (c) IV-G validation lead-in, "do not understand why you say this" | RESOLVED | Section IV-G, `paper_a_results_v3.md` | The lead-in now explicitly says Section IV-E capture rates are internally circular because Firm A helped set the thresholds, then explains why the three IV-G analyses are threshold-free or threshold-robust. This directly supplies the missing rationale. | No disagreement. |
+| (d) BD/McCrary at accountant level, "cannot understand" | N/A | Removed from current structure | The accountant-level BD/McCrary analysis no longer appears in the live v3.19.0 manuscript. BD/McCrary is now signature-level only and framed as a density-smoothness diagnostic, not an accountant-level threshold device. | No disagreement. |
+| (k) Accountant-level aggregation rationale, "why accountant level total, because component?" | N/A | Removed from current structure | The confusing accountant-level component narrative has been deleted. The paper now avoids translating signature-level outputs into accountant-level mechanism assignments except for auditor-year ranking. | No disagreement. |
+| (e) 92.6% match rate, "do not understand improvement angle" | RESOLVED | Section III-D, `paper_a_methodology_v3.md`; Table III in Section IV-B | The match rate is now a data-processing coverage metric: 168,755 of 182,328 signatures are CPA-matched, and the unmatched 7.4% are excluded because same-CPA best-match statistics are undefined. The old "improvement" angle is gone. | No disagreement. |
+| (f) 0.95 cosine cutoff, "cut-off corresponds to what?" | RESOLVED | Section III-K, `paper_a_methodology_v3.md`; Sections IV-E/F | The text now states that 0.95 corresponds to the whole-sample Firm A P7.5 heuristic: 92.5% of Firm A signatures exceed it and 7.5% fall at or below it. It also distinguishes 0.95 from the calibration-fold P5 = 0.9407 and rounded 0.945 sensitivity cut. | No disagreement. |
+| (g) 139/32 C1/C2 split, "too reliant on weighting factor?" | N/A | Removed from current structure | The C1/C2 accountant-level GMM cluster split is gone from the current manuscript. Residual fold-variance wording no longer invokes the 139/32 split. | No disagreement. |
+| (i) Hartigan rejection-as-bimodality, "so why?" | RESOLVED | Section III-I.1, `paper_a_methodology_v3.md`; Section IV-D.1 | The text now separates the dip test from component counting: it tests unimodality, does not specify a component count, and is used to decide whether a KDE antimode is meaningful. Section IV-D then explains why Firm A's non-rejection and all-CPA rejection matter. | No disagreement. |
+| (j) BIC strict-3-component upper-bound framing, red-circled paragraph | IMPROVED | Section III-I.2/III-I.4, `paper_a_methodology_v3.md`; Section IV-D.3/IV-D.4, `paper_a_results_v3.md` | The results section is much clearer: it labels the 2-component Beta mixture as "A Forced Fit," reports the 3-component BIC preference, and says the Beta/logit disagreement reflects unsupported parametric structure. However, the methods still say the 2-component crossing "should be treated as an upper bound," and Table VI labels one row as "signature-level Beta/KDE upper bound." That residual wording may still prompt "upper bound of what?" from the partner. | I disagree with Gemini's RESOLVED verdict here. The item is not unresolved, but it is only IMPROVED until "upper bound" is either defined in one plain sentence or removed in favor of "forced-fit descriptive reference." |
+
+## 3. Specific pushback on Gemini's RESOLVED verdict
+
+Only item **(j)** needs pushback.
+
+Gemini says the BIC issue is resolved because the results now title the subsection "A Forced Fit" and state that the 2-component structure is not supported. That is true for Section IV-D.3, but not the whole manuscript. Section III-I.2 still says that when BIC prefers three components, "the 2-component crossing should be treated as an upper bound rather than a definitive cut." Section III-I.4 repeats that the 2-component crossing is a forced fit and "should be read as an upper bound," and Table VI contains "signature-level Beta/KDE upper bound."
+
+For a statistically trained reviewer, this may be defensible shorthand. For the partner's original red-pen concern, it is still slightly too abstract. If the authors keep "upper bound," they should define the bound explicitly. Otherwise the safer fix is to remove the term and call these values "forced-fit descriptive references not used operationally."
+
+## 4. Smallest residual set before partner re-read
+
+1. Replace or explain the remaining **"upper bound"** wording in Section III-I.2, Section III-I.4, and Table VI. Suggested direction: "Because the two-component assumption is not supported, we report the crossing only as a forced-fit descriptive reference and do not use it as an operational threshold."
+
+2. Optional housekeeping: update `reference_verification_v3.md` so its detailed [5] entry no longer reads like an active problem after the reference list has been corrected. This is not a manuscript blocker, but it avoids confusion if the partner or a coauthor opens the verification note.
+
+No other partner red-pen issue appears to need substantive revision before re-read.
@@ -0,0 +1,114 @@
+# Fourth-Round Review of Paper A v3.4
+
+**Overall Verdict: Major Revision**
+
+v3.4 is materially better than v3.3. The ethics/interview blocker is genuinely fixed, the classifier-versus-accountant-threshold distinction is much clearer in the prose, Table XII now exists, and the held-out-validation story has been conceptually corrected from the false "within Wilson CI" claim to the right calibration-fold-versus-held-out comparison. I still do not recommend submission as-is, however, because two core problems remain. First, the newly added sensitivity and intra-report analyses do not appear to evaluate the classifier that Section III-L now defines: the paper says the operational five-way classifier uses *cosine-conditional* dHash cutoffs, but the new scripts use `min_dhash_independent` instead. Second, the replacement Table XI has z/p columns that do not consistently match its own reported counts under the script's published two-proportion formula. Those are fixable, but they keep the manuscript in major-revision territory.
+
+**1. v3.3 Blocker Resolution Audit**
+
+| Blocker | Status | Audit |
+|---|---|---|
+| B1. Classifier vs three-method convergence misalignment | `PARTIALLY-RESOLVED` | The prose repair is real. Section III-L now explicitly distinguishes the signature-level operational classifier from the accountant-level convergent reference band at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:251) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:275), and Section IV-G.3 is added as a sensitivity check at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:239) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:262). The remaining problem is that III-L defines the classifier's dHash cutoffs as *cosine-conditional* at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:269) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:271), but the new sensitivity script loads only `s.min_dhash_independent` at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:83) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:90) and then claims to "Replicate Section III-L" at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:204) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:241). So the conceptual alignment is improved, but the new empirical support is still not aligned to the declared classifier. |
+| B2. Held-out validation false within-Wilson-CI claim | `PARTIALLY-RESOLVED` | The false claim itself is removed. Section IV-G.2 now correctly says the calibration fold, not the whole sample, is the right comparison target at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:230) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:237), and Discussion mirrors that at [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:44). The new script also implements the two-proportion z-test explicitly at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:66) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:80) and [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:175) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:202). However, several Table XI z/p entries do not match the displayed `k/n` counts under that formula: the `cosine > 0.837` row at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217) implies about `z = +0.41, p = 0.683`, not `+0.31 / 0.756`; the `cosine > 0.9407` row at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:220) implies about `z = -3.19, p = 0.0014`, not `-2.83 / 0.005`; and the `dHash_indep <= 15` row at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:224) implies about `z = -0.43, p = 0.670`, not `-0.31 / 0.754`. The conceptual blocker is fixed; the replacement inferential table still needs numeric cleanup. |
+| B3. Interview evidence lacks ethics statement | `RESOLVED` | This blocker is fixed. The manuscript now consistently reframes the contextual claim as practitioner / industry-practice knowledge rather than as research interviews; see [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:50) through [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:139) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:140), and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:280) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:289). I also ran a grep across the nine v3 manuscript files and found no surviving `interview`, `IRB`, or `ethics` strings. The evidentiary burden now sits on paper-internal analyses rather than on undeclared human-subject evidence. |
+
+**2. v3.3 Major-Issues Follow-up**
+
+| Prior major issue | Status | v3.4 audit |
+|---|---|---|
+| dHash classifier ambiguity | `UNFIXED` | III-L now says the classifier uses *cosine-conditional* dHash thresholds at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:269) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:271), but the Results still report only `dHash_indep` capture rules at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:165) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:168) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:221) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:225), despite the promise at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:271) that both statistics would be reported. The new scripts for Table XII and Table XVI also use `min_dhash_independent`, not cosine-conditional dHash, at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:83) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:90) and [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:90) through [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:92). |
+| 70/30 split overstatement | `PARTIALLY-FIXED` | The paper is now more candid that the operational classifier still inherits whole-sample thresholds at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:272) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:273), and IV-G.2 properly frames the fold comparison at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:230) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:237). But the Abstract still says "we break the circularity" at [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12), and the Conclusion repeats that framing at [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:20), which overstates what the 70/30 split accomplishes for the actual deployed classifier. |
+| Validation-metric story | `PARTIALLY-FIXED` | Methods and Results are substantially improved: precision and `F1` are now explicitly rejected as meaningless here at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:244) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:246) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:186) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:203). But the Introduction still promises validation with "precision, recall, F1, and equal-error-rate" at [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28), and the Impact Statement still overstates binary discrimination at [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8). |
+| Within-auditor-year empirical-check confusion | `UNFIXED` | Section III-G still says the intra-report analysis provides an empirical check on the within-auditor-year no-mixing assumption at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:123) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:127). But Section IV-H.3 still measures agreement between the two different signers on the same report at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:343) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:367). That is a cross-partner same-report test, not a same-CPA within-year mixing test. |
+| BD/McCrary rigor | `UNFIXED` | The Methods still mention KDE bandwidth sensitivity at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:173) and define a fixed-bin BD/McCrary procedure at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:177) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:183), but the Results still give only narrative transition statements at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:80) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:83) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:126) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:149), with no alternate-bin analysis, Z-statistics table, p-values, or McCrary-style estimator output. |
+| Reproducibility gaps | `PARTIALLY-FIXED` | There is some improvement at the code level: the new recalibration script exposes the seed and test formulae at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:46), [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:128) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:136), and [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:175) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:202). But from the paper alone the work is still not reproducible: the exact VLM prompt and parse rule remain absent at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:44) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:49), HSV thresholds remain absent at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:74), visual-inspection sample size/protocol remain absent at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145), and mixture initialization / stopping / boundary handling remain under-specified at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:187) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:195) and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:221). |
+| Section III-H / IV-F reconciliation | `FIXED` | The manuscript now clearly says the 92.5% Firm A figure is a within-sample consistency check, not the independent validation pillar, at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:155) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:176). That specific circularity / role-confusion problem is repaired. |
+| "Fixed 0.95 not calibrated to Firm A" inconsistency | `UNFIXED` | III-H still says the fixed `0.95` cutoff "is not calibrated to Firm A" at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:151), but III-L says `0.95` is the whole-sample Firm A P95 heuristic at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:252) and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:272), and IV-F says the same at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:241). This contradiction remains. |
+
+**3. v3.3 Minor-Issues Follow-up**
+
+| Prior minor issue | Status | v3.4 audit |
+|---|---|---|
+| Table XII numbering | `FIXED` | Table XII now exists at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:246) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:254), and the numbering now runs XI-XVIII without the previous jump. |
+| `dHash_indep <= 5 (calib-fold median-adjacent)` label | `UNFIXED` | The unclear label remains at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:165), even though the same table family now explicitly reports the calibration-fold independent-minimum median as `2` at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:227). Calling `5` "median-adjacent" is still opaque. |
+| References [27], [31]-[36] cleanup | `UNFIXED` | These references remain present at [paper_a_references_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:57) through [paper_a_references_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:75), but a citation sweep across the nine manuscript files found no in-text uses of `[27]` or `[31]`-`[36]`. The Mann-Whitney test is still reported at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:50) without citing `[36]`. I also do not see uses of `[34]` or `[35]` in the reviewed manuscript text. |
+
+**4. New Findings in v3.4**
+
+**Blockers**
+
+- The new IV-G.3 sensitivity evidence does not appear to use the classifier that III-L now defines. III-L says the operational categories use cosine-conditional dHash cutoffs at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:269) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:271), and IV-G.3 presents itself as a sensitivity test of that classifier at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:239) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:262). But [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:83) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:90) load only `min_dhash_independent`, and the "Replicate Section III-L" classifier at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:212) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:241) uses that statistic directly. This is currently the most important unresolved issue because the newly added evidence that is meant to support B1 is not evaluating the paper's stated classifier.
+
+**Major Issues**
+
+- Table XI's z/p columns are not consistently arithmetically compatible with the published counts. The formula in [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:66) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:80) is straightforward, but several rows in [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:224) do not match their own `k/n` inputs. The qualitative interpretation survives, but a statistical table that does not reproduce from its displayed counts is not submission-ready.
+
+- Table XVI is affected by the same classifier-definition problem as Table XII. The paper says IV-H.3 uses the "dual-descriptor rules of Section III-L" at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:347), but [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:37) through [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:53) and [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:90) through [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:92) classify with `min_dhash_independent`. So the new "fourth pillar" consistency check is not actually tied to the classifier as specified in III-L.
+
+- The four-pillar Firm A validation is ethically cleaner, but not stronger in evidentiary reporting than v3.3. It is stronger on internal consistency because practitioner knowledge is now background-only at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:139) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:140), and the paper states that the evidence comes from the manuscript's own analyses at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:142) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:155). But it is not stronger on empirical documentation because the visual-inspection pillar still has no sample size, randomization rule, rater count, or decision protocol at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145). My read is: ethically stronger, scientifically cleaner, but only roughly equal in evidentiary strength unless the visual-inspection protocol is documented.
+
+**Minor Issues**
+
+- III-H says "Two of them are fully threshold-free" at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:150), but item (a) immediately uses a fixed `0.95` cutoff at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:151). The Results intro to Section IV-H is more accurate at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:270) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:274). This should be harmonized.
+
+- The Introduction still contains an obsolete metric promise at [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28), and the Impact Statement still reads too strongly for a five-way classifier with no full labeled test set at [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8). These are not new conceptual flaws, but they are still visible in the current version.
+
+**5. IEEE Access Fit Check**
+
+- **Scope:** Yes. The topic is a plausible IEEE Access Regular Paper fit as a methods paper spanning document forensics, computer vision, and audit/regulatory applications.
+
+- **Abstract length:** Not compliant yet. A local plain-word count of [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:5) through [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14) gives about **367 words**. The IEEE Author Center guidance says the abstract should be a single paragraph of up to 250 words. The current abstract is also dense with abbreviations / symbols (`KDE`, `EM`, `BIC`, `GMM`, `~`, `approx`) that IEEE generally prefers authors to avoid in abstracts.
+
+- **Impact Statement section:** The manuscript still includes a standalone Impact Statement at [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:1) through [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:9). **Inference from official IEEE Access / IEEE Author Center sources:** I do not see a Regular Paper requirement for a standalone `Impact Statement` section. Unless an editor specifically requested it, I would remove it or fold its content into the abstract / conclusion / cover letter.
+
+- **Formatting:** I cannot verify final IEEE template conformance from the markdown section files alone. Official IEEE Access guidance requires the journal template and submission of both source and PDF; that should be checked at the generated DOCX / PDF stage, not from these source snippets.
+
+- **Review model / anonymization:** IEEE Access uses **single-anonymized** review. The current pseudonymization of firms is therefore a confidentiality choice, not a review-blinding requirement. Within the nine reviewed section files I do not see author or institution metadata.
+
+- **Official sources checked:**  
+  - IEEE Access submission guidelines: https://ieeeaccess.ieee.org/authors/submission-guidelines/  
+  - IEEE Author Center article-structure guidance: https://journals.ieeeauthorcenter.ieee.org/create-your-ieee-journal-article/create-the-text-of-your-article/structure-your-article/  
+  - IEEE Access reviewer guidelines / reviewer info: https://ieeeaccess.ieee.org/reviewers/reviewer-guidelines/
+
+**6. Statistical Rigor Audit**
+
+- The high-level statistical story is cleaner than in v3.3. The paper now explicitly separates the primary accountant-level 1D convergence (`0.973 / 0.979 / 0.976`) from the secondary 2D-GMM marginal (`0.945`) at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:126) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:149), and III-L no longer pretends those accountant-level thresholds are themselves the deployed classifier at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:274).
+
+- The B2 statistical interpretation is substantially improved: IV-G.2 now frames fold differences as heterogeneity rather than as failed generalization at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:233) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:237), and Discussion repeats that narrower reading at [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:44) through [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:45).
+
+- The main remaining statistical weakness is now more specific: the paper's new classifier definition and the paper's new sensitivity evidence are not using the same dHash statistic. That is a model-definition problem, not just a wording problem.
+
+- BD/McCrary remains the least rigorous component. The paper's qualitative interpretation is plausible, but the reporting is still too thin for a method presented as a co-equal thresholding component.
+
+- The anchor-based validation is better framed than before. The manuscript now correctly treats the byte-identical positives as a conservative subset and no longer uses precision / `F1` in the main validation table at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:205).
+
+**7. Anonymization Check**
+
+- Within the nine reviewed v3 manuscript files, I do not see any explicit real firm names or auditor names. The paper consistently uses `Firm A/B/C/D`; see [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:287) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:289) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:353) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:357).
+
+- The new III-M residual-identifiability disclosure at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:287) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:288) is appropriate. Knowledgeable local readers may still infer Firm A, but the paper now states that risk explicitly.
+
+**8. Numerical Consistency**
+
+- Most of the large headline counts still reconcile across sections: `90,282` reports, `182,328` signatures, `758` CPAs, and the Firm A `171 + 9` accountant split remain internally consistent across [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11) through [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:13), [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:62) through [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:63), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:121) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:127), and [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19) through [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:21).
+
+- Table XII arithmetic is internally consistent: both columns sum to `168,740`, and the listed percentages match the counts. Table XVI and Table XVII arithmetic also reconcile. The new numbering XI-XVIII is coherent.
+
+- The important remaining numerical inconsistency is Table XI's inferential columns, not its raw counts or percentages.
+
+**9. Reproducibility**
+
+- The paper is still **not reproducible from the manuscript alone**.
+
+- Missing or under-specified items that should be added before submission:
+  - Exact VLM prompt, parse rule, and failure-handling for page selection at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:44) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:49).
+  - HSV thresholds for red-stamp removal at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:74).
+  - Random seeds / sampling protocol for the 500-page annotation set, the 50,000 inter-CPA negatives, the 30-signature sanity sample, and the Firm A 70/30 split at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:59), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:232), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:237) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:239), and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:247).
+  - Visual-inspection sample size, selection rule, and decision protocol at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145).
+  - EM / mixture initialization, stopping criteria, boundary clipping for the logit transform, and software versions for the mixture fits at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:187) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:195) and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:221).
+
+- The new scripts help the audit, but they also expose that the Results tables are currently not perfectly aligned to the Methods classifier definition. So reproducibility is not only incomplete; it is presently inconsistent in one key place.
+
+**Bottom Line**
+
+v3.4 clears the ethics/interview blocker and substantially improves the classifier-threshold narrative. It is much closer to a submittable paper than v3.3. But I would still require one more round before IEEE Access submission: (1) make Section III-L, Table XII, Table XVI, and the supporting scripts use the same dHash statistic, or explicitly redefine the classifier around `dHash_indep`; (2) recompute and correct the Table XI z/p columns from the displayed counts; (3) remove the remaining overstatements about what the 70/30 split and the validation metrics establish; and (4) cut the abstract to <= 250 words while cleaning the non-standard Impact Statement. If those are repaired cleanly, the paper should move into minor-revision territory.
@@ -0,0 +1,165 @@
+# Fifth-Round Review of Paper A v3.5
+
+Audit basis: commit `12f716d`. Line numbers below refer to the current v3.5 markdown and script files.
+
+## 1. Overall Verdict
+
+**Minor Revision**
+
+v3.5 clears the two issues that kept v3.4 in major-revision territory. The classifier definition in Section III-L is now arithmetically aligned with the `dHash_indep` implementation used by the supporting scripts and downstream tables, and Table XI's `z/p` columns now reproduce from the displayed `k/n` counts under the exact two-proportion formula in Script 24. I do not see a core scientific regression in the B1/B2/B3 logic. I would still not submit v3.5 as-is, however, because a short v3.6 cleanup is still warranted: Table IX is not fully synchronized to the current script outputs, "breaks circularity" overclaim language survives in Methods/Results, the export path still hardcodes a double-blind placeholder even though IEEE Access is single-anonymized, and the manuscript still underdocuments BD/McCrary, visual inspection, and several key reproducibility details. This is now a close paper, but not yet the cleanest version to send.
+
+## 2. v3.4 Round-4 Follow-Up Audit
+
+### 2.1 Round-4 Blockers
+
+| Round-4 item | Round-4 status | v3.5 audit | Evidence |
+|---|---|---|---|
+| B1. Classifier vs three-method convergence misalignment | `PARTIALLY-RESOLVED` | `RESOLVED` | Section III-L now defines the operational classifier entirely in `dHash_indep` terms at Methodology L252-L277. The matching downstream tables also use `dHash_indep`: Results L165-L168, L221-L225, L246-L254, and L350-L361. Script 24 likewise loads `min_dhash_independent` and applies it in the Section III-L classifier at Script 24 L86-L99, L157-L168, and L215-L241. |
+| B2. Held-out validation false within-Wilson-CI claim | `PARTIALLY-RESOLVED` | `RESOLVED` | Results L230-L237 now correctly interpret the fold comparison, and the Table XI `z/p` entries at Results L217-L225 reproduce from Script 24's `two_prop_z` formula at Script 24 L69-L83 and L186-L205. |
+| B3. Interview evidence lacks ethics statement | `RESOLVED` | `RESOLVED` | The manuscript still treats practitioner knowledge as background context only and locates evidentiary weight in paper-internal analyses: Introduction L51-L55; Methodology L140-L156 and L282-L291. I found no regression to interview/IRB-style evidentiary claims. |
+
+### 2.2 Round-4 Major and Minor Follow-Up Items
+
+| Round-4 item | Round-4 status | v3.5 audit | Evidence |
+|---|---|---|---|
+| dHash classifier ambiguity | `UNFIXED` | `RESOLVED` | The classifier is now explicitly `dHash_indep`-based throughout III-L, not cosine-conditional: Methodology L254-L277. Results Tables IX, XI, XII, and XVI are written in the same statistic: Results L165-L168, L221-L225, L246-L254, L350-L361. |
+| 70/30 split overstatement | `PARTIALLY-FIXED` | `PARTIALLY-RESOLVED` | The Abstract and Conclusion are repaired: Abstract L5 and Conclusion L19-L21 now use fold-variance language. But the overclaim survives at Methodology L238, Results L171, and the subsection title at Results L207. |
+| Validation-metric story | `PARTIALLY-FIXED` | `RESOLVED` | The Introduction now promises anchor-based capture/FAR reporting rather than precision/F1/EER: Introduction L29-L30. Methods/Results remain aligned on why precision/F1 are not meaningful here: Methodology L245-L246; Results L186-L188. The archived Impact Statement is explicitly excluded from submission and self-warns against overclaim: Impact Statement L1-L12; `export_v3.py` L15-L25. |
+| Within-auditor-year empirical-check confusion | `UNFIXED` | `RESOLVED` | Methodology L123-L128 now explicitly says IV-H.3 is a related but distinct cross-partner same-report homogeneity test, not a same-CPA within-year mixing test. Results L343-L367 matches that framing exactly. |
+| BD/McCrary rigor | `UNFIXED` | `UNRESOLVED` | The paper still gives only narrative BD/McCrary outcomes without a table of `Z` statistics, `p` values, or bin-width robustness: Results L80-L83 and L126-L149. |
+| Reproducibility gaps | `PARTIALLY-FIXED` | `PARTIALLY-RESOLVED` | The scripts expose some seeds and formulas, but the manuscript still omits the exact VLM prompt/parse rule, HSV thresholds, visual-inspection protocol, and EM initialization/stopping details: Methodology L44-L49, L74-L75, L145-L146, L188-L196, L222-L223, L248. |
+| Section III-H / IV-F reconciliation | `FIXED` | `RESOLVED` | The 92.5% Firm A figure is still consistently framed as a within-sample consistency check, not an external validation pillar: Methodology L156-L160; Results L174-L176. |
+| "`0.95` not calibrated to Firm A" inconsistency | `UNFIXED` | `RESOLVED` | III-H now says the `0.95` cutoff is the whole-sample Firm A P95: Methodology L151-L154. III-L repeats that at Methodology L273-L277, and Results uses the same interpretation at L174-L176 and L241-L244. |
+| Table XII numbering | `FIXED` | `RESOLVED` | Numbering remains coherent through XI-XVIII, with Table XII present at Results L246-L254. |
+| `dHash_indep <= 5 (calib-fold median-adjacent)` label | `UNFIXED` | `UNRESOLVED` | The label still appears in Table IX at Results L165. III-L explains the rationale better at Methodology L275, but the table label itself remains opaque. |
+| References `[27]`, `[31]-[36]` cleanup | `UNFIXED` | `RESOLVED` | All seven are now cited in text: `[27]` at Methodology L100; `[31]-[33]` at Introduction L15; `[34]-[35]` at Methodology L44 and L58; `[36]` at Results L50. |
+
+### 2.3 Round-4 New-Issue Audit
+
+| Round-4 new issue | v3.5 audit | Evidence |
+|---|---|---|
+| IV-G.3 sensitivity evidence did not evaluate the stated classifier | `RESOLVED` | III-L now defines the same `dHash_indep` classifier that Script 24 evaluates: Methodology L252-L277; Script 24 L215-L241; Results L239-L262. |
+| Table XI `z/p` columns did not match displayed counts | `RESOLVED` | Results L217-L225 now matches recomputation from Script 24 L69-L83 exactly up to rounding; details in Section 3 below. |
+| Table XVI was affected by the same classifier-definition problem | `RESOLVED` | Table XVI is now aligned because III-L itself is `dHash_indep`-based. Script 23 also uses `min_dhash_independent`: Script 23 L37-L53 and L90-L92. |
+| Visual-inspection pillar still lacked protocol details | `UNRESOLVED` | The claim remains at Methodology L145-L149, but sample size, rater count, and adjudication rule are still absent from the manuscript. |
+| Threshold-free wording in III-H was inaccurate | `RESOLVED` | III-H now correctly says only partner-ranking is fully threshold-free: Methodology L151-L154. Results L270-L274 matches this. |
+| Introduction metric promise / Impact Statement wording still overstated | `RESOLVED` | The Introduction is repaired at L29-L30, and the Impact Statement is archived and excluded from export: Impact Statement L1-L12; `export_v3.py` L15-L25. |
+
+## 3. Verification of the v3.5 Critical Fixes
+
+### 3.1 Table XI Recalculation
+
+I recomputed every Table XI `z/p` pair from the displayed `k/n` counts using the exact two-proportion formula in Script 24 L69-L83. All nine rows now match the manuscript rounding at Results L217-L225.
+
+| Rule | Exact recomputation from displayed `k/n` | Paper value | Audit |
+|---|---|---|---|
+| `cosine > 0.837` | `z = +0.310601`, `p = 0.756104` | `+0.31`, `0.756` | Match |
+| `cosine > 0.9407` | `z = -3.184698`, `p = 0.001449` | `-3.19`, `0.001` | Match |
+| `cosine > 0.945` | `z = -4.541202`, `p = 0.00000559` | `-4.54`, `<0.001` | Match |
+| `cosine > 0.950` | `z = -5.966194`, `p = 0.0000000024` | `-5.97`, `<0.001` | Match |
+| `dHash_indep <= 5` | `z = -14.288642`, `p < 1e-40` | `-14.29`, `<0.001` | Match |
+| `dHash_indep <= 8` | `z = -6.446423`, `p = 1.15e-10` | `-6.45`, `<0.001` | Match |
+| `dHash_indep <= 9` | `z = -5.072930`, `p = 3.92e-07` | `-5.07`, `<0.001` | Match |
+| `dHash_indep <= 15` | `z = -0.313744`, `p = 0.753716` | `-0.31`, `0.754` | Match |
+| `cosine > 0.95 AND dHash_indep <= 8` | `z = -7.603992`, `p = 2.86e-14` | `-7.60`, `<0.001` | Match |
+
+This directly resolves the main round-4 numerical blocker.
+
+### 3.2 Section III-L Uses `dh_indep` Throughout
+
+This fix is real. Section III-L now states at Methodology L254-L255 that all dHash references in the operational classifier are the independent-minimum statistic, and the five categories at L257-L277 are all written with `dHash_indep`. The downstream result tables are consistent with that same statistic:
+
+- Table IX: Results L165-L168.
+- Table XI: Results L221-L225.
+- Table XII: Results L246-L258.
+- Table XVI: Results L347-L367.
+
+Script 24 is now consistent with that choice as well: it loads `min_dhash_independent` at L86-L99 and classifies with it at L215-L241.
+
+### 3.3 "`0.95` is Firm A P95" Is Now Consistent
+
+This inconsistency is fixed across the relevant sections:
+
+- III-H: Methodology L151-L154 states that the `0.95` cutoff is the whole-sample Firm A P95 and that the longitudinal analysis is about stability, not absolute-rate calibration.
+- III-L: Methodology L273-L277 repeats that `0.95` is the whole-sample Firm A P95 heuristic.
+- IV-F / IV-G.3: Results L174-L176 and L241-L244 use the same framing.
+
+I do not see a surviving contradiction of the old "not calibrated to Firm A" type.
+
+## 4. Verification of the v3.5 Major Fixes
+
+- **Abstract length:** The abstract is now one paragraph. A rendered whitespace count after stripping the header/comment gives 247 words, which is nominally under the IEEE 250-word cap. If one counts inline math markers as separate tokens, the count rises above 250, so the abstract is compliant in ordinary rendered form but still too close to the limit for comfort.
+- **"We break the circularity" overclaim:** Removed from the Abstract and Conclusion. The current Abstract L5 and Conclusion L19-L21 use fold-level variance / heterogeneity language instead. However, the same overclaim still survives elsewhere in the paper at Methodology L238 and Results L171 and L207.
+- **Introduction metric language:** Fixed. Introduction L29-L30 now promises per-rule capture/FAR with Wilson intervals and explicitly states why precision/F1 are not meaningful here. The obsolete introduction promise of precision/F1/EER is gone.
+- **III-G / IV-H.3 wording alignment:** Fixed. Methodology L123-L128 and Results L343-L367 now describe the same cross-partner same-report homogeneity test.
+- **III-H threshold-free wording:** Fixed. Methodology L151-L154 and Results L270-L274 now correctly say that only partner-ranking is threshold-free.
+
+## 5. Verification of the v3.5 Minor Fixes
+
+- **Impact Statement exclusion:** Fixed. `export_v3.py` excludes `paper_a_impact_statement_v3.md` from `SECTIONS` at L15-L25, and the archived file itself says it is not part of the IEEE Access submission at Impact Statement L1-L12.
+- **Previously unused references:** Fixed. `[27]`, `[31]`, `[32]`, `[33]`, `[34]`, `[35]`, and `[36]` all now have in-text citations; see the evidence in Section 2.2 above.
+
+## 6. New Findings in v3.5
+
+No core scientific regression is visible in the B1/B2/B3 logic. The remaining new findings are cleanup-level but real:
+
+1. **Table IX is still not fully synchronized to the current script outputs.** Using the displayed counts at Results L160-L168, three percentages are off by `0.01` under standard rounding: `57,131 / 60,448 = 94.51%`, not `94.52%`; `55,916 / 60,448 = 92.50%`, not `92.51%`; and `57,521 / 60,448 = 95.16%`, not `95.17%`. More importantly, Script 24 computes the whole-sample dual rule as `54,370 / 60,448`, not `54,373 / 60,448` (Script 24 L276-L316; generated recalibration report section 3 lines 48-52). This is small, but v3.5 explicitly positions itself as having cleaned exact table arithmetic, so it should be corrected.
+2. **The circularity overclaim is not fully removed paper-wide.** Methodology L238 still says the 70/30 split "break[s] the resulting circularity," Results L171 says the held-out analysis "addresses the circularity," and the IV-G.2 subsection title at Results L207 still says "(breaks calibration-validation circularity)." Those are stronger than the better, narrower interpretation at Results L233-L237, Discussion L44-L45, and Conclusion L20-L21.
+3. **The export path is not submission-ready for IEEE Access single-anonymized review.** `export_v3.py` correctly excludes the Impact Statement, but it still inserts `[Authors removed for double-blind review]` on the title page at L208-L218. If the manuscript were submitted literally from this export path, that would be a packaging error.
+4. **Methodology III-G retains one stale reference to cosine-conditional dHash.** Methodology L131-L132 says cosine-conditional dHash is used "as a diagnostic elsewhere," but no remaining main-text result appears to use it. After the III-L rewrite, this reads as leftover phrasing and should be either deleted or pointed to a real appendix/supplement.
+
+## 7. IEEE Access Submission Readiness Check
+
+- **Scope:** Yes. The topic remains a plausible IEEE Access Regular Paper fit spanning document forensics, computer vision, and audit/regulatory analytics.
+- **Abstract length:** Nominally compliant in rendered form at 247 words, but close enough to the cap that another 5-10 words of trimming would be safer.
+- **Formatting / template:** Not verifiable from the markdown section files alone. The paper is maintained as markdown fragments plus a custom `python-docx` exporter; I did not audit a final IEEE Access template-conformant DOCX/PDF package here.
+- **Review model:** IEEE Access is single-anonymized. The current export path still uses a double-blind placeholder on the title page (`export_v3.py` L208-L218). That must be fixed before submission.
+- **Anonymization:** The manuscript body still consistently uses `Firm A/B/C/D` and does not expose explicit real firm names or author metadata in the reviewed markdown sections. As before, that is a confidentiality choice rather than a review-model requirement.
+- **Ethics / data-source disclosure:** Adequate for this paper's current evidentiary framing. Methodology L282-L291 clearly states the corpus is public MOPS data and that no non-public records or human-subject evidence are used.
+- **Items that could trigger desk return if submitted literally now:** the missing author/affiliation metadata from the current export path, and any unverified IEEE template / metadata nonconformance in the final DOCX/PDF. The remaining scientific issues are reviewer-risk issues rather than obvious desk-return items.
+
+Bottom line on readiness: **not as-is**. The science is close; the packaging and last-round reporting cleanup are not finished.
+
+## 8. Statistical Rigor, Numerical Consistency, and Reproducibility
+
+### Statistical Rigor
+
+- The core statistical story is now coherent. The paper cleanly separates the operational signature-level classifier from the accountant-level convergence band and treats the held-out Firm A split as heterogeneity disclosure rather than a false Wilson-CI "generalization pass": Methodology L252-L277; Results L230-L237; Discussion L44-L45.
+- The anchor-based validation is better framed than in earlier rounds. The byte-identical positives are clearly treated as a conservative subset, and precision/F1 are no longer misused: Methodology L227-L248; Results L184-L205.
+- The main remaining rigor weakness is still BD/McCrary. Because the paper keeps advertising a three-method convergent threshold strategy in the title/abstract/introduction, the absence of explicit BD/McCrary `Z`/`p` reporting and bin-width sensitivity still leaves one of the three methods under-reported.
+- The visual-inspection pillar is still too thinly documented for the rhetorical weight it carries in III-H and the Conclusion.
+
+### Numerical Consistency
+
+- Table XI is now repaired and reproducible from its displayed counts.
+- Table XII, Table XVI, and Table XVII remain arithmetically consistent.
+- Table IX still has the residual percentage/count mismatches noted in Section 6.
+- The biggest numerical issue left is therefore no longer inferential-table arithmetic; it is the smaller but still avoidable transcription drift in Table IX.
+
+### Reproducibility
+
+The paper is still **not reproducible from the manuscript alone**.
+
+The most important under-specified items remain:
+
+- Exact VLM prompt, parse rule, and page-selection failure handling: Methodology L44-L49.
+- HSV thresholds for red-stamp removal: Methodology L74-L75.
+- Randomization / seed rules for the 500-page annotation set, the inter-CPA negative sample, the 30-signature sanity sample, and the Firm A split: Methodology L59-L62 and L232-L248.
+- Visual-inspection protocol details: sample size, rater count, and decision rule are still absent around Methodology L145-L146.
+- EM / mixture initialization count, stopping criteria, logit-boundary clipping, and software versions: Methodology L188-L196 and L222-L223.
+
+The scripts help auditability, but the manuscript still needs a short reproducibility appendix or supplement if the authors want the paper to look fully defensible on first submission.
+
+## 9. What v3.6 Must Change to Clear Review
+
+If the authors want the paper to clear this review and be genuinely submission-ready, v3.6 should do the following:
+
+1. **Re-sync Table IX and mirrored prose to the authoritative script outputs.** Correct the three `0.01` percentage mismatches and the whole-sample dual-rule count (`54,370 / 60,448` if Script 24 is authoritative).
+2. **Remove the surviving circularity overclaim from Methods/Results.** Replace Methodology L238, Results L171, and the IV-G.2 heading at L207 with the softer fold-variance / within-Firm-A heterogeneity framing already used elsewhere.
+3. **Fix the export path for IEEE Access single-anonymized review.** Restore author/affiliation/corresponding-author metadata and audit the real final DOCX/PDF against the IEEE Access template rather than relying on the current double-blind placeholder export.
+4. **Document the visual-inspection protocol.** At minimum: sample size, sampling rule, number of raters, whether review was blinded, and how disagreements were adjudicated.
+5. **Either substantiate BD/McCrary or demote it.** If it stays as one of the three headline methods, add a compact table of `Z` statistics, `p` values, and bin-width robustness. If not, explicitly recast it as a supplementary diagnostic rather than a co-equal threshold estimator.
+6. **Add a short reproducibility appendix or supplement.** Include the VLM prompt/parse rule, HSV thresholds, key seeds/sampling rules, and mixture-model implementation details.
+7. **Clean the stale cosine-conditional dHash sentence at Methodology L131-L132.** After the III-L rewrite, that sentence now looks like leftover terminology.
+
+If those items are addressed cleanly, I would treat the manuscript as submission-ready for IEEE Access.
@@ -0,0 +1,224 @@
+# Independent Peer Review (Round 16) - Paper A v3.18.1
+
+Reviewer role: independent peer reviewer for IEEE Access Regular Paper.
+Manuscript reviewed: "Replication-Dominated Calibration" - CPA signature analysis, v3.18.1, commit `cb77f481ec2ab4b93b0effbf4c0ee4c89e90d610`.
+Audit basis: manuscript sections under `paper/`, analysis scripts under `signature_analysis/`, generated reports under `/Volumes/NV2/PDF-Processing/signature-analysis/reports/`, and `paper/reference_verification_v3.md`.
+
+## 1. Overall Verdict: Minor Revision
+
+The paper is close to submission-ready and the central empirical story is largely reproducible from the provided scripts: a large Taiwan audit-report corpus; a signature-detection and feature-extraction pipeline; percentile-calibrated dual-descriptor classification; annotation-free validation using byte-identical positives and inter-CPA negatives; and strong Firm A concentration in several benchmark checks. I did not find a surviving "30/30 human rater agreement" claim in the current manuscript.
+
+However, I would not recommend unconditional Accept. Three issues require revision before IEEE Access submission:
+
+1. Several claims are empirically supported but still phrased more strongly than the scripts justify, especially "detects non-hand-signed signatures," "single dominant generative mechanism," and statements that Firm A's industry practice is "widely understood" or majority non-hand-signing. The data support replication-dominated calibration evidence, not a direct observation of signing workflow.
+2. A number of section references are stale after the v3.18 retitling/reframing. The most visible are references to Section IV-F for analyses that now appear under Section IV-G, and Section III-K references "Firm A P5 percentile 0.941" while the reported sensitivity uses 0.945 and calibration-fold P5 is 0.9407.
+3. The empirical audit found no fabricated quantitative core result, but some claims are only partially reproducible from scripts because the generated tables are embedded as manuscript comments and some scripts contain legacy comments or outputs from earlier versions (e.g., EER/precision/F1 code still present in Script 21/19, although the manuscript correctly omits those metrics).
+
+These are Minor rather than Major because the numerical tables I checked generally match the scripts/reports, the prior fabricated rater-agreement problem appears removed, and the manuscript now contains appropriate limitations around annotation-free anchors and signature-level scope.
+
+## 2. Empirical-Claim Audit Table
+
+Status definitions: VERIFIED = matches scripts/reports or reference verification; UNVERIFIABLE = plausible but not independently supported by provided artifacts; SUSPICIOUS = likely true directionally but overphrased or internally inconsistent; FABRICATED = contradicted by provided artifacts or unsupported despite being presented as measured fact. I found no clear fabricated quantitative claim in v3.18.1.
+
+| Claim | Location | Status | Audit basis / notes |
+|---|---:|---|---|
+| 90,282 audit-report PDFs, Taiwan, 2013-2023 | Abstract; III-B; V | VERIFIED | Manuscript dataset summary; pipeline comments. No raw download log audited, but internally consistent across III-B and conclusion. |
+| 86,072 documents with signatures (95.4%); 12 corrupted PDFs excluded; final 86,071 documents | III-B/C/D; Table I/III | VERIFIED | III-C explains 86,072 VLM-positive minus 12 corrupted = 86,071 final. Slight table split is clear enough. |
+| 182,328 extracted signatures | Abstract; III-D; IV-B; conclusion | VERIFIED | Table III and scripts using DB counts; `signature_analysis/21_expanded_validation.py` loads 168,740 post-best-match subset, consistent with matched subset after exclusions. |
+| 758 unique CPAs; >50 accounting firms; 15 document types, 86.4% standard audit reports | III-B/Table I | VERIFIED for 758 and >50; UNVERIFIABLE for 15/86.4 | 758 is repeatedly used in manuscript. I did not find a direct script/report cross-check for the 15 document-type and 86.4% breakdown in the inspected artifacts. |
+| Qwen2.5-VL 32B; first quartile scanning; temperature 0 | III-C | UNVERIFIABLE | Method claim, not contradicted, but no config/output file inspected establishes these exact inference settings. |
+| VLM-YOLO agreement / YOLO detections in 98.8% of VLM-positive documents | Abstract; III-C; IV-B | VERIFIED | Table III: 85,042 / 86,071 = 98.8%. Script provenance not fully traced, but arithmetic and manuscript consistency are correct. |
+| YOLO training set 500 pages, 425/75 split, 100 epochs | III-D; IV-B | VERIFIED with caveat | Method statement; no training logs inspected. The 425/75 split is arithmetically consistent. |
+| YOLO metrics: precision 0.97-0.98, recall 0.95-0.98, mAP@0.50 0.98-0.99, mAP@0.50:0.95 0.85-0.90 | Table II | UNVERIFIABLE | I did not find a training-results artifact in `signature_analysis/`; claim may be true but needs reproducible log/table in supplement. |
+| Detection deployment: 43.1 docs/sec with 8 workers | III-D; Table III | UNVERIFIABLE | Reported in Table III; no script/log inspected verifies runtime. |
+| CPA-matched signatures: 168,755 / 182,328 = 92.6%; unmatched 13,573 = 7.4% | III-D; Table III | VERIFIED | 168,755 + 13,573 = 182,328; percentages correct. |
+| Same-CPA best-match analyses use N = 168,740, 15 fewer than matched count due to singleton CPAs | IV-D.1 | VERIFIED | `signature_analysis/15_hartigan_dip_test.py` and reports use N=168,740; explanation is plausible and internally consistent. |
+| ResNet-50, ImageNet-1K V2, 2048-d embeddings, 224x224 preprocessing, L2 normalization | III-E | VERIFIED | `signature_analysis/10_formal_statistical_analysis.py`, `paper/ablation_backbone_comparison.py`. |
+| All-pairs intra-class N = 41,352,824; inter-class N = 500,000 | Table IV | VERIFIED | `signature_analysis/10_formal_statistical_analysis.py` computes all intra-pairs and samples 500,000 inter-pairs. |
+| Table IV distribution stats: intra mean 0.821, inter mean 0.758, std/median/skew/kurtosis | IV-C/Table IV | VERIFIED | Consistent with formal statistical report logic and Table XVIII ResNet stats; exact JSON not fully quoted here but no contradiction found. |
+| Shapiro-Wilk and K-S reject normality, p < 0.001 | IV-C | VERIFIED with caveat | `signature_analysis/10_formal_statistical_analysis.py` performs tests. Large paired dependence caveat is correctly acknowledged later. |
+| Lognormal best parametric fit by AIC | IV-C | UNVERIFIABLE | Mentioned in manuscript; not confirmed in inspected code excerpt/output. Needs report citation or supplement table. |
+| KDE crossover at 0.837; Cohen's d = 0.669; Mann-Whitney p < 0.001; K-S p < 0.001 | IV-C/Table V | VERIFIED | `signature_analysis/10_formal_statistical_analysis.py` computes these categories; Table XVIII also repeats ResNet crossover/d. |
+| Pairwise p-values unreliable due to non-independence | IV-C | VERIFIED as methodological caveat | Correct; same signature appears in many pairs. |
+| Firm A cosine dip: N=60,448, dip=0.0019, p=0.169, unimodal | IV-D.1/Table V | VERIFIED | `/reports/dip_test/dip_test_results.json`; `signature_analysis/15_hartigan_dip_test.py`. |
+| Firm A dHash dip: N=60,448, dip=0.1051, p<0.001, multimodal | IV-D.1/Table V | VERIFIED | `/reports/dip_test/dip_test_results.json`. |
+| All-CPA cosine dip: N=168,740, dip=0.0035, p<0.001, multimodal | IV-D.1/Table V | VERIFIED | `/reports/dip_test/dip_test_results.json`. |
+| All-CPA dHash dip: N=168,740, dip=0.0468, p<0.001, multimodal | IV-D.1/Table V | VERIFIED | `/reports/dip_test/dip_test_results.json`. |
+| Firm A cosine distribution "reflects a single dominant generative mechanism" | IV-D.1 | SUSPICIOUS | Dip p=0.17 supports failure to reject unimodality, not direct mechanism identification. Rewrite as "consistent with" rather than "reflecting." |
+| BD/McCrary Firm A cosine transition 0.985 at bin 0.005; full 0.985; dHash transition 2 | IV-D.2; Appendix A | VERIFIED | `signature_analysis/25_bd_mccrary_sensitivity.py`; `/reports/bd_sensitivity/bd_sensitivity.json`. |
+| BD transition drift: Firm A cosine 0.987/0.985/0.980/0.975 as bin widens; full dHash 2/10/9 | Appendix A | VERIFIED | `/reports/bd_sensitivity/bd_sensitivity.json`. |
+| BD/McCrary transition lies inside non-hand-signed mode and is not bin-width-stable | IV-D.2; Appendix A | VERIFIED as interpretation | Script supports instability. "Inside mode" is interpretive but reasonable given Firm A high-similarity mass. |
+| Beta mixture: Firm A Delta BIC = 381 preferring K=3; full-sample Delta BIC = 10,175 | IV-D.3; V-B | VERIFIED | `/reports/beta_mixture/beta_mixture_results.json`: -371092.8 vs -371473.9; -787280.4 vs -797455.1. |
+| Firm A forced Beta-2 crossing 0.977; logit-GMM crossing 0.999 | IV-D.3/Table VI | VERIFIED | `/reports/beta_mixture/beta_mixture_results.json`: 0.9774276 and 0.9992143. |
+| Full-sample forced Beta crossing none; logit-GMM 0.980 | IV-D.3/Table VI | VERIFIED | `/reports/beta_mixture/beta_mixture_results.json`. |
+| Operational Firm A P7.5 cosine cut: cos > 0.95; 92.5% above / 7.5% at or below | Abstract; III-H/K; IV-E | VERIFIED | `/reports/pixel_validation/pixel_validation_results.json`: Firm A cosine>0.95 = 0.9251257. |
+| dHash cutoffs <=5, <=8, <=15; Firm A dHash median 2; P75 approx 4; P95 9 | III-K; IV-E/F | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json` and pixel-validation JSON. |
+| Firm A whole-sample capture: cos>0.837 99.93%, cos>0.9407 95.15%, cos>0.945 94.02%, cos>0.95 92.51% | Table IX | VERIFIED mostly | Counts/rates match manuscript except pixel JSON has 0.941 rather than 0.9407 from older run; recalibration JSON supports 0.9407 threshold family. |
+| Firm A whole-sample dHash<=5 84.20%, <=8 95.17%, <=15 99.83%, dual cos>0.95 AND dHash<=8 89.95% | Table IX; abstract | VERIFIED | `/reports/pixel_validation/pixel_validation_results.json`; `/reports/validation_recalibration/validation_recalibration.json`. |
+| 310 byte-identical positives | Abstract; IV-F.1; V-F | VERIFIED | `signature_analysis/19_pixel_identity_validation.py`; `/reports/pixel_validation/pixel_validation_results.json`; `/reports/expanded_validation/expanded_validation_results.json`. |
+| 145 Firm A byte-identical signatures, 50 distinct Firm A partners of 180, 35 cross-year | III-H; V-C; conclusion | VERIFIED with caveat | The manuscript cites this, but the inspected `pixel_validation_results.json` reports only 310 all-sample pixel-identical signatures. I did not inspect an output table listing the 145/50/35 decomposition. Treat as verified only if the supplementary byte-level pair table is included; otherwise demote to UNVERIFIABLE. |
+| 50,000 inter-CPA negative pairs; inter-CPA mean=0.762, P95=0.884, P99=0.913, max=0.988 | IV-F.1 | VERIFIED | `signature_analysis/21_expanded_validation.py`; `/reports/expanded_validation/expanded_validation_results.json`. |
+| Table X FAR at thresholds: 0.837 -> 0.2062; 0.900 -> 0.0233; 0.945 -> 0.0008; 0.950 -> 0.0007; 0.973 -> 0.0003; 0.979 -> 0.0002, Wilson CIs | IV-F.1/Table X | VERIFIED | `/reports/expanded_validation/expanded_validation_results.json`. |
+| Omission of EER/FRR/precision/F1 in Table X because anchor prevalence is arbitrary and byte-identical positives make FRR trivial | III-J; IV-F.1 | VERIFIED methodologically | Correct manuscript correction. Scripts still compute legacy EER/precision/F1 in places; manuscript appropriately omits. |
+| Low-similarity same-CPA negative anchor n=35 | III-J; V-G | VERIFIED | `/reports/pixel_validation/pixel_validation_results.json`. |
+| Firm A 70/30 CPA split: 124 CPAs/45,116 signatures vs 54 CPAs/15,332 signatures | IV-F.2/Table XI | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`; `signature_analysis/24_validation_recalibration.py`. |
+| 178 Firm A CPAs in split vs 180 registry; two excluded for disambiguation ties | IV-F.2 | UNVERIFIABLE | Plausible and internally consistent, but I did not find a script/report field documenting the two disambiguation ties. |
+| Calibration-fold thresholds: cosine median 0.9862, P1 0.9067, P5 0.9407; dHash median 2, P95 9 | Table XI | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`; `/reports/expanded_validation/expanded_validation_results.json`. |
+| Table XI fold rates and z-tests | IV-F.2/Table XI | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`. |
+| Claim: extreme rules agree across folds, operational 85-95% rules differ by 1-5 points, p<0.001 | IV-F.2; conclusion | VERIFIED | Recalibration JSON supports this. |
+| Sensitivity: cos>0.95 vs cos>0.945 reclassifies 8,508 signatures; category counts in Table XII | IV-F.3/Table XII | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`. |
+| Firm A dual capture shifts from 89.95% to 91.14%, +1.19 pp | IV-F.3 | VERIFIED | Recalibration JSON: 0.89945 vs 0.91138. |
+| Text says "Firm A P5 percentile 0.941" but sensitivity uses 0.945 | III-K | SUSPICIOUS | Calibration-fold P5 is 0.9407; deployed sensitivity cut is 0.945. Revise to avoid "P5 percentile 0.941" vs "0.945 rounded" ambiguity. |
+| Year-by-year Firm A left-tail table, 2013-2023 N/mean/% below 0.95 | IV-G.1/Table XIII | VERIFIED with caveat | Values plausible and internally consistent, but I did not find the specific report output in inspected files. Include generating script/table in supplement. |
+| 2013-2019 mean left-tail 8.26%, 2020-2023 mean 6.96%; lowest 2023 = 3.75% | IV-G.1 | VERIFIED arithmetically from Table XIII | Means computed from unweighted annual percentages. If intended signature-weighted means, disclose. |
+| Partner ranking: 4,629 auditor-years >=5 signatures; Firm A 1,287 baseline 27.8%; top decile 443/462 = 95.9%; top quartile 1,043/1,157 = 90.1%; top half 1,220/2,314 = 52.7% | IV-G.2/Table XIV | VERIFIED | `signature_analysis/22_partner_ranking.py`; `/reports/partner_ranking/partner_ranking_results.json`. |
+| Year-by-year top-decile Firm A share range 88.4%-100% | IV-G.2/Table XV | VERIFIED | `/reports/partner_ranking/partner_ranking_results.json`. |
+| Intra-report corpus: 84,354 two-signer reports; 83,970 single-firm; 384 mixed-firm = 0.46% | IV-G.3 | VERIFIED | `/reports/intra_report/intra_report_results.json` gives same-firm totals plus mixed-firm categories adding to 384. |
+| Intra-report Table XVI: Firm A 30,222 reports, agreement 89.91%; other Big-4 62-67%; 23-28 pp gap | IV-G.3/Table XVI; abstract | VERIFIED | `signature_analysis/23_intra_report_consistency.py`; `/reports/intra_report/intra_report_results.json`. |
+| Firm A both non-hand-signed 26,435/30,222 = 87.5%; both likely hand-signed 4 = 0.01% | IV-G.3 | VERIFIED | `/reports/intra_report/intra_report_results.json`. |
+| Intra-report gap "predicted by firm-wide practice" | IV-G.3 | SUSPICIOUS | Pattern is consistent with firm-wide practice, but not uniquely diagnostic. Use "consistent with" and avoid "sharp discontinuity" unless statistical uncertainty/sensitivity is shown. |
+| Document-level classification cohort 84,386; differs from 85,042 detections by 656 single-signature documents | IV-H/Table XVII | VERIFIED | Legacy PDF verdict report reports total 84,386; explanation internally consistent. |
+| Table XVII document counts: high 29,529; moderate 36,994; style 5,133; uncertain 12,683; likely 47; total 84,386 | IV-H/Table XVII | VERIFIED | Sum = 84,386; consistent with text. |
+| Within 71,656 documents exceeding cosine 0.95: 41.2% high, 51.7% moderate, 7.2% style-only | IV-H | VERIFIED | 29,529 + 36,994 + 5,133 = 71,656; percentages correct. |
+| Abstract says "only 41% exhibit converging structural evidence ... 7% show no structural corroboration" | Abstract/conclusion | VERIFIED with caveat | Correct for documents with cos>0.95, but "only" is rhetoric; moderate 51.7% still has partial structural similarity. |
+| Firm A document capture: 96.9% high/moderate, 0.6% style, 2.5% uncertain, 4/30,226 likely hand-signed | IV-H.1 | VERIFIED | Table XVII Firm A counts sum to 30,226; 22,970+6,311=29,281=96.9%. |
+| Cross-firm dual-descriptor convergence: non-Firm-A CPAs with cos>0.95 have dHash<=5 at 11.3%, Firm A 58.7% | IV-H.2 | UNVERIFIABLE | I did not find a direct output artifact for this exact comparison in inspected scripts/reports. Add reproducible table or script reference. |
+| Ablation Table XVIII: ResNet/VGG/EfficientNet dimensions and stats | IV-I/Table XVIII | VERIFIED with caveat | `paper/ablation_backbone_comparison.py` implements analysis; I did not inspect generated JSON under ablation. |
+| Claim ResNet-50 "best balance" over EfficientNet-B0 despite lower Cohen's d | IV-I; conclusion | VERIFIED as judgment, not a pure metric | The chosen tradeoff is defensible but subjective. Do not overstate as a purely empirical optimum. |
+| Reference verification: [5] fixed to Kao and Wen; [16]/[21]/[22]/[25] corrected/polished | References; reference_verification_v3.md | VERIFIED | Current `paper_a_references_v3.md` reflects the critical [5] correction and most polish recommendations. |
+| "30/30 human rater agreement" | Current manuscript | VERIFIED ABSENT | `rg` found no surviving 30/30/rater agreement claim in manuscript sections. |
+
+## 3. Methodological Rigor
+
+The methodological core is substantially stronger than in earlier described versions. The key positive points are:
+
+- The paper now separates operational calibration from descriptive distributional diagnostics. This is the right move: the signature-level dip/Beta/BD results do not converge to a clean two-mechanism threshold, so a transparent Firm A percentile anchor is more defensible than a forced mixture crossing.
+- The dual-descriptor classifier is methodologically sensible. Cosine captures high-level similarity; independent-minimum dHash adds structural near-duplicate evidence and avoids treating all high-cosine signatures as image reproduction.
+- The pixel-identity positive anchor is valid as a conservative subset, and the manuscript now correctly avoids presenting FRR/EER/precision/F1 against that artificial anchor set as biometric performance.
+- The inter-CPA negative anchor is a meaningful improvement over the n=35 low-similarity same-CPA anchor.
+- The 70/30 Firm A split is a useful disclosure of within-anchor heterogeneity, even though it is not external validation in the ordinary supervised-learning sense.
+
+Remaining rigor concerns:
+
+1. The inference from "Firm A dip p=0.17" to "single dominant generative mechanism" is too strong. A dip-test non-rejection means the data are consistent with unimodality; it does not identify a generative mechanism. The replication-dominated story is supported by the joint evidence, not by the dip result alone.
+2. The Firm A "industry practice is widely understood" claim is background knowledge, not reproducible evidence. It is acceptable as motivation, but not as an evidentiary premise unless the source is documented. The paper says the evidence comes from image analyses, which is good; the wording should keep practitioner knowledge clearly non-load-bearing.
+3. The dHash thresholds are reasonable but still heuristic. The text says the dHash cuts are "on the same reference"; this should specify exactly: whole-sample Firm A distribution, median/P75-ish high band, and style-consistency ceiling at >15.
+4. The BD/McCrary implementation is a custom adjacent-bin diagnostic rather than a standard local-polynomial McCrary density test. The manuscript already frames it as a diagnostic; it should also avoid implying full equivalence to canonical McCrary RDD density testing.
+5. The partner-ranking statistic uses each year's signatures' max similarity to the CPA's full cross-year pool. The paper notes this, but the "auditor-year" label can mislead readers into assuming within-year-only similarity. The untracked `signature_analysis/27_within_year_uniformity.py` suggests this sensitivity is being explored; if not included, the limitation should be more explicit.
+
+## 4. Narrative Discipline
+
+The narrative is much more disciplined than prior-round summaries suggested, but it still needs tightening.
+
+Overclaims / scope creep:
+
+- "Detects non-hand-signed signatures" should usually be "classifies signatures as replication-consistent / non-hand-signed under a calibrated dual-descriptor rule." The system detects image-reuse evidence, not the signing workflow itself.
+- "Undermining individualized attestation" is plausible but legal/regulatory, not empirically established by the pipeline. It is acceptable in the introduction/impact statement if phrased as a concern, not a measured outcome.
+- "From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise" is too absolute. Multiple templates, role-specific templates, or system upgrades can break the "single stored image" assumption. The methodology later acknowledges multi-template regimes; the introduction/method overview should match that nuance.
+- "This sharp discontinuity ... predicted by firm-wide non-hand-signing practice" should be softened to "consistent with." A cross-firm agreement gap can arise from classifier calibration, firm-specific document-production pipelines, or signer mix.
+- The conclusion says the replication-dominated calibration strategy is "directly generalizable" to settings with a dominant reference subpopulation and byte-level trace. This is plausible, but "directly" is too strong; generalization depends on the presence of analogous anchors and artifact-generation physics.
+
+Scope discipline that works well:
+
+- The paper now repeatedly states that signature-level rates are not partner-level frequencies.
+- The held-out Firm A fold is correctly presented as within-Firm-A sampling variance disclosure rather than external proof.
+- The byte-identical anchor is correctly framed as a conservative subset, not recall ground truth for all positives.
+
+## 5. IEEE Access Fit
+
+IEEE Access fit is good. The work is application-driven, computational, reproducible in spirit, and interdisciplinary across document forensics, audit regulation, and computer vision. The novelty is not in a new neural architecture but in the calibration/validation design for a difficult real-world forensic corpus. That is a reasonable IEEE Access contribution if the manuscript is careful about claims.
+
+Rigor is adequate for a Regular Paper after minor revisions. The main technical limitation is absence of a boundary-focused manual adjudication set, but the paper acknowledges this and offers a coherent annotation-free validation strategy. Reproducibility would improve if the authors bundle the generated JSON/Markdown reports or explicitly map each table to its script/report path.
+
+Clarity is mostly high, but the section-number drift and the 0.941/0.945 wording need cleanup before submission. IEEE Access reviewers will notice stale cross-references.
+
+## 6. Specific Actionable Revisions and Proposed Rewrites
+
+1. Soften mechanism-identification language.
+
+Current:
+"Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism..."
+
+Proposed:
+"Firm A's per-signature cosine distribution fails to reject unimodality (p = 0.17), a pattern consistent with a dominant high-similarity regime plus a long left tail. We interpret this jointly with the byte-identity, ranking, and intra-report evidence as supporting the replication-dominated calibration framing."
+
+2. Remove overabsolute "single stored image on every report" wording.
+
+Current:
+"both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise."
+
+Proposed:
+"both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise."
+
+3. Clarify practitioner-knowledge status.
+
+Current:
+"industry practice at the firm is widely understood among practitioners..."
+
+Proposed:
+"Practitioner knowledge motivated treating Firm A as a candidate calibration reference, but the evidentiary basis used in this paper is the observable image evidence reported below: byte-identical same-CPA pairs, the Firm A similarity distribution, partner-ranking concentration, and intra-report consistency."
+
+4. Fix section-reference drift.
+
+Examples:
+- III-H says the three complementary analyses are in Section IV-F; in the current manuscript they are in Section IV-G.
+- III-H bullet labels cite IV-F.1/IV-F.2/IV-F.3 for longitudinal, ranking, intra-report; these should be IV-G.1/IV-G.2/IV-G.3.
+- Results IV-F.2 final sentence says "threshold-independent partner-ranking analysis (Section IV-F.2)" but ranking is Section IV-G.2.
+- Methodology III-G says partner-level ranking is Section IV-F.2; update to IV-G.2.
+
+5. Fix the 0.941/0.945 sensitivity wording.
+
+Current:
+"replacing 0.95 with the slightly stricter Firm A P5 percentile 0.941 alters aggregate firm-level capture rates by at most approx 1.2 percentage points"
+
+Proposed:
+"replacing 0.95 with the nearby rounded sensitivity cut 0.945 (motivated by the calibration-fold P5 = 0.9407) shifts whole-Firm-A dual-rule capture by 1.19 percentage points."
+
+6. Add table-to-script provenance.
+
+Add a compact appendix table:
+
+| Manuscript table | Reproduction artifact |
+|---|---|
+| Table V | `signature_analysis/15_hartigan_dip_test.py`; `reports/dip_test/dip_test_results.json` |
+| Table VI | `signature_analysis/17_beta_mixture_em.py`; `reports/beta_mixture/beta_mixture_results.json`; `signature_analysis/25_bd_mccrary_sensitivity.py` |
+| Table X | `signature_analysis/21_expanded_validation.py`; `reports/expanded_validation/expanded_validation_results.json` |
+| Table XI/XII | `signature_analysis/24_validation_recalibration.py`; `reports/validation_recalibration/validation_recalibration.json` |
+| Table XIV/XV | `signature_analysis/22_partner_ranking.py`; `reports/partner_ranking/partner_ranking_results.json` |
+| Table XVI | `signature_analysis/23_intra_report_consistency.py`; `reports/intra_report/intra_report_results.json` |
+
+7. Either document or remove exact unverifiable decomposition claims.
+
+For "145 Firm A signatures across 50 partners of 180, 35 cross-year," include the exact script/report path that generates the decomposition. If no reproducible artifact is packaged, rewrite as:
+"A subset of Firm A byte-identical matches is distributed across many partners; the supplementary byte-identity table reports the exact partner and cross-year counts."
+
+8. Treat "cross-firm dual convergence 11.3% vs 58.7%" as a table or remove it.
+
+This is a useful claim, but I did not find a direct reproduction artifact. Add a small table with counts/denominators and script provenance.
+
+9. Tighten the impact statement.
+
+Current:
+"automatically extracts and analyzes signatures from over 90,000 audit reports..."
+
+This is accurate. But:
+"separate hand-written signatures from reproduced ones" should remain removed/avoided. Use:
+"stratifies signatures by evidence of image reproduction."
+
+10. Clean legacy script comments before supplement release.
+
+Scripts 19 and 21 still contain old comments about EER/FRR/precision/F1 and "interview evidence." Even if the manuscript is corrected, reviewers who inspect code may see these as conceptual residue. Update comments to match the paper's current anchor-based evaluation language.
+
+## 7. Disagreements with Prior Round-7 Gemini Accept Verdict
+
+I disagree with the round-7 Gemini "fully submission-ready / no v3.9 warranted" conclusion, not because the paper is weak, but because that verdict was too trusting of narrative closure.
+
+Specific disagreements:
+
+1. Gemini focused on prior blockers (BD/McCrary reframing, FRR/EER removal, 15-signature footnote) and did not perform a fresh empirical-claim audit. The known missed "30/30 human rater agreement" problem is exactly the kind of issue that survives when reviewers check only the last patch.
+2. Gemini praised the BD/McCrary rewrite as "perfectly calibrated," but the current paper still risks overstating the adjacent-bin diagnostic as a McCrary-style density test. It is now acceptable, but not perfect.
+3. Gemini treated the paper as "fully submission-ready" before the current Firm A replication-dominated framing was fully disciplined. v3.18.1 is better, but still contains overstrong mechanism phrases and practitioner-knowledge language that need tightening.
+4. Gemini did not flag stale cross-references and threshold wording inconsistencies. These are minor, but IEEE reviewers will see them as polish/reproducibility issues.
+5. Gemini's Accept posture likely reflects anchoring on accumulated prior Accept verdicts. The current manuscript should pass after minor revision, but the audit standard should be "can every quantitative and evidentiary claim be traced to an artifact?" not "did the last known blocker get patched?"
+
+Bottom line: I recommend Minor Revision. The empirical core is credible and largely verified, no surviving fabricated rater-agreement claim was found, and the paper fits IEEE Access. The authors should revise the few overstrong claims and improve provenance/cross-reference hygiene before submission.
@@ -0,0 +1,133 @@
+# Independent Peer Review (Round 17) - Paper A v3.18.2
+
+Reviewer role: independent peer reviewer for IEEE Access Regular Paper.
+Manuscript reviewed: "Replication-Dominated Calibration" - CPA signature analysis, v3.18.2, commit `7990dab` on `yolo-signature-pipeline`.
+Audit basis: manuscript sections under `paper/`, scripts under `signature_analysis/`, prior round-16 review `paper/codex_review_gpt55_v3_18_1.md`, and generated reports under `/Volumes/NV2/PDF-Processing/signature-analysis/reports/`.
+
+## 1. Overall Verdict: Minor Revision
+
+I recommend **Minor Revision**, not unconditional Accept.
+
+The v3.18.2 revision fixes the most important round-16 empirical problem: the cross-firm dual-descriptor convergence claim is no longer the erroneous `11.3%` vs `58.7%` / `5x` statement. The new script `signature_analysis/28_byte_identity_decomposition.py` and JSON artifact reproduce the corrected values: among signatures with cosine `> 0.95`, non-Firm-A has `27,596 / 65,515 = 42.12%` with `dHash_indep <= 5`, while Firm A has `49,388 / 55,921 = 88.32%`, a `~2.1x` gap. The byte-identity decomposition is also now reproducible: `145` Firm A byte-identical signatures, `50` distinct partners, `180` registered Firm A partners, and `35` cross-year matches.
+
+The revision also resolves most stale section references and improves provenance. However, I found three remaining issues that should be corrected before IEEE Access submission:
+
+1. The Appendix B provenance map overclaims: several mapped report artifacts do not exist at the stated paths in the available report tree.
+2. Some mechanism-identification language was softened in Results but remains too strong in Methodology and Discussion, especially "consistent with a single dominant mechanism."
+3. A few exact method/performance claims remain unverifiable from packaged artifacts, especially YOLO validation metrics, VLM prompt/settings, HSV thresholds, runtime, and some extraction/document-type details.
+
+These are Minor because they do not overturn the central empirical findings, but they affect reproducibility and narrative discipline.
+
+## 2. Re-audit of Round-16 Findings
+
+| Round-16 finding | v3.18.2 status | Re-audit notes |
+|---|---|---|
+| Mechanism-identification overclaim from dip-test non-rejection | **PARTIAL** | Results IV-D.1 now correctly says Firm A "fails to reject unimodality." But Methodology III-H still says the distribution is "consistent with a single dominant mechanism (non-hand-signing)," and Discussion V-C says "consistent with a single dominant mechanism plus residual within-firm heterogeneity." A dip-test non-rejection plus left tail does not identify a single mechanism; the joint evidence supports a replication-dominated benchmark, not a mechanism count. |
+| Stale IV-F / IV-G references after retitling | **LARGELY RESOLVED** | I did not find the old round-16 pattern of IV-F references pointing to the new IV-G validation analyses. The current IV-F/IV-G references are mostly correct. Minor remaining issue: Introduction and conclusion still cite byte-identity as Section IV-F.1 although the detailed `145/50/180/35` decomposition itself is not reported in Section IV-F.1, only in III-H/V-C/Appendix B. |
+| Practitioner knowledge as load-bearing evidence | **PARTIAL** | III-H now explicitly says practitioner knowledge is "non-load-bearing," which is good. But Introduction still says Firm A is "widely recognized within the audit profession" and III-H says "widely held within the audit profession" without a citation or source. This is acceptable only as motivation; I would soften or cite. |
+| 0.941 / 0.945 / 0.9407 ambiguity | **RESOLVED** | III-K and IV-F.3 now correctly distinguish the operational 0.95 cut, the nearby rounded sensitivity cut 0.945, and calibration-fold P5 = 0.9407. |
+| Incorrect cross-firm dual-convergence claim | **RESOLVED** | The prior `11.3%` vs `58.7%` / `5x` claim is gone from current manuscript files. The replacement `42.12%` vs `88.32%` / `~2.1x` matches the new JSON artifact. |
+| Byte-identity decomposition was unverifiable | **RESOLVED with packaging caveat** | New script and JSON reproduce `145/50/180/35`. Caveat: the manuscript says reports are under the project's `reports/` tree, but the actual artifact I inspected is under `/Volumes/NV2/PDF-Processing/signature-analysis/reports/...`, not under this repo's `reports/` path. |
+| Legacy EER/FRR/Precision/F1 script comments | **RESOLVED enough** | Scripts 19 and 21 now label EER/FRR/Precision/F1 as legacy / diagnostic-only and state that the manuscript omits them. Some functions still emit those sections if run, but the conceptual warning is explicit. |
+
+## 3. New Empirical-Claim Audit
+
+Status definitions: **VERIFIED** = matches script/report or arithmetic; **PARTIAL** = broadly supported but wording/provenance needs cleanup; **UNVERIFIABLE** = plausible but not traceable in the available artifacts; **SUSPICIOUS** = overphrased or internally inconsistent. I found no new fabricated core result.
+
+| Claim | Status | Audit basis / notes |
+|---|---|---|
+| 90,282 PDFs, 2013-2023, Taiwan | VERIFIED | Consistent across manuscript. Raw scraping log not audited. |
+| 86,072 VLM-positive documents; 12 corrupted PDFs; final 86,071 | VERIFIED | Internally consistent in III-C. |
+| 182,328 extracted signatures; 168,755 CPA-matched; 13,573 unmatched | VERIFIED | Matches manuscript counts and downstream `168,740` after singleton exclusion. |
+| 758 CPAs, >50 firms, 15 document types, 86.4% standard audit reports | PARTIAL | 758/>50 are stable manuscript counts. I did not find a direct packaged JSON for 15 document types / 86.4%. |
+| Qwen2.5-VL 32B, 180 DPI, first-quartile scan, temperature 0 | UNVERIFIABLE | Method claim not contradicted, but prompt/config/log artifact not inspected. |
+| YOLO 500 annotated pages, 425/75 split, 100 epochs | PARTIAL | Method is clear; no training log audited. |
+| YOLO precision 0.97-0.98, recall 0.95-0.98, mAP metrics | UNVERIFIABLE | Table II remains unsupported by a visible training-results artifact. |
+| 43.1 docs/sec with 8 workers | UNVERIFIABLE | Runtime claim still lacks a visible timing log. |
+| Same-CPA best-match N = 168,740, 15 fewer than matched due to singleton CPAs | VERIFIED | Matches dip-test report and script logic. |
+| ResNet-50 ImageNet-1K V2, 2048-d, L2 normalized | VERIFIED | Consistent with methods and ablation script. |
+| All-pairs intra/inter distribution N = 41,352,824 / 500,000; KDE crossover 0.837; Cohen's d = 0.669 | VERIFIED | Supported by formal-statistical script/report, although Appendix B points to the wrong JSON path. |
+| Firm A dip result N=60,448, dip=0.0019, p=0.169 | VERIFIED | `/reports/dip_test/dip_test_results.json`. |
+| Firm A dHash dip result N=60,448, dip=0.1051, p<0.001 | VERIFIED | Same JSON. |
+| All-CPA cosine/dHash dip results N=168,740, p<0.001 | VERIFIED | Same JSON. |
+| "p = 0.17 at n >= 10 signatures" in III-H | SUSPICIOUS | The `n >= 10` filter applies to accountant-level aggregates in script 15, not the Firm A signature-level dip test. The Firm A dip test uses N=60,448 signatures. |
+| "single dominant mechanism" language | SUSPICIOUS | Still too mechanistic for the statistics; use "dominant high-similarity regime" or "consistent with replication-dominated framing." |
+| BD/McCrary transition instability and values in Appendix A | VERIFIED | `/reports/bd_sensitivity/bd_sensitivity.json`; table values match. |
+| Beta mixture Delta BIC = 381 for Firm A; 10,175 full sample; forced crossings 0.977/0.999 | VERIFIED | `/reports/beta_mixture/beta_mixture_results.json`. |
+| Firm A whole-sample rates in Table IX | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json` and pixel-validation JSON: e.g., cos>0.95 `55,922/60,448 = 92.51%`, dual `54,370/60,448 = 89.95%`. |
+| 310 byte-identical positives | VERIFIED | `/reports/pixel_validation/pixel_validation_results.json`. |
+| Byte-identity decomposition `145 / 50 / 180 / 35` | VERIFIED | New `/reports/byte_identity_decomp/byte_identity_decomposition.json`. The script counts Firm A signatures whose nearest same-CPA match is byte-identical; the "35" is a cross-year nearest-match count, not necessarily a deduplicated unordered pair count. |
+| Table X FAR against 50,000 inter-CPA negatives | VERIFIED | `/reports/expanded_validation/expanded_validation_results.json`. |
+| Omission of EER/FRR/precision/F1 in manuscript | VERIFIED | Manuscript now explains why these are not meaningful for Table X. |
+| Firm A 70/30 split: 124 CPAs/45,116 signatures vs 54 CPAs/15,332 | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`. |
+| Two CPAs excluded from split due to disambiguation ties | UNVERIFIABLE | Plausible; I did not find a report field documenting those two ties. |
+| Table XI rates/z-tests | VERIFIED | Values match recalibration JSON, including corrected `z=-3.19` for cos>0.9407. |
+| Table XII sensitivity counts and +1.19 pp Firm A shift | VERIFIED | Recalibration JSON supports counts and `0.89945` vs `0.91138`. |
+| Table XIII per-year Firm A left-tail rates | PARTIAL | Values are internally coherent, but Appendix B points to `reports/deloitte_distribution/deloitte_distribution_results.json`, which does not exist in the inspected report tree. |
+| Tables XIV/XV partner ranking values | VERIFIED | `/reports/partner_ranking/partner_ranking_results.json`. |
+| Table XVI intra-report agreement | VERIFIED | `/reports/intra_report/intra_report_results.json`. |
+| Table XVII document-level classification counts | VERIFIED with path caveat | Counts match manuscript arithmetic and available PDF verdict artifacts, but Appendix B points to `reports/pdf_level/pdf_level_results.json`, which does not exist. Existing files include `pdf_signature_verdicts.json`, CSV/XLSX, and report markdown at report root. |
+| Cross-firm dual-descriptor convergence `42.12%` vs `88.32%` | VERIFIED | New JSON: non-Firm-A `27,596/65,515`, Firm A `49,388/55,921`. Note this Firm A denominator differs by one from Table IX's cosine-only `55,922`, so the text should specify the additional filters used by script 28. |
+| Ablation Table XVIII | PARTIAL | The script exists and `/Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json` exists, but Appendix B incorrectly maps it to `reports/ablation/ablation_results.json`. |
+| Appendix B claim that all report files are committed alongside scripts in the project's `reports/` tree | SUSPICIOUS | In the current workspace there is no repo-root `reports/` directory. Several paths named in Appendix B are missing even in the absolute report tree. |
+
+## 4. Methodological Rigor
+
+The core methodology remains credible for an IEEE Access Regular Paper. The strongest elements are:
+
+- The paper separates operational calibration from distributional characterization. This is essential because the per-signature diagnostics do not converge to a clean two-class threshold.
+- The dual-descriptor design is well motivated: cosine captures high-level similarity, while independent-minimum dHash provides a structural near-duplicate check.
+- The byte-identical positive anchor is a valid conservative subset, and the inter-CPA negative anchor gives meaningful specificity/FAR estimates.
+- The held-out Firm A fold is now framed as within-Firm-A sampling-variance disclosure rather than full external validation.
+- The new script 28 closes the most important prior provenance gap for byte identity and cross-firm convergence.
+
+Remaining rigor concerns:
+
+1. **Provenance packaging is still inconsistent.** Appendix B says scripts and reports live under the project's `reports/` tree. In this workspace there is no repo-root `reports/` directory, and the actual artifacts are under `/Volumes/NV2/PDF-Processing/signature-analysis/reports/`. More importantly, the Appendix B paths for formal statistical results, Deloitte/Firm-A distribution results, PDF-level results, and ablation results are wrong or missing.
+2. **The Firm A prior remains partly socially sourced.** The text says practitioner knowledge is non-load-bearing, but the Introduction still relies rhetorically on "widely recognized." The empirical case can stand without that phrase.
+3. **The dip-test interpretation remains slightly overextended.** Failure to reject unimodality supports "no clear multimodal split"; it does not show a single mechanism. The byte-identity and ranking evidence do more of the work.
+4. **The `n >= 10` parenthetical in III-H is likely misplaced.** It should not be attached to the Firm A signature-level dip result unless the authors can show the exact filtering.
+5. **Several engineering details remain under-specified for full reproducibility:** VLM prompt/parse rule, HSV red-stamp thresholds, training log for YOLO metrics, and exact runtime environment for throughput.
+
+## 5. Narrative Discipline
+
+The narrative is substantially more disciplined than v3.18.1, but a few overclaims remain.
+
+Recommended softening:
+
+- Replace "detects such non-hand-signed signatures" in the Abstract with "classifies signatures by evidence of non-hand-signing" or "detects replication-consistent signatures." The pipeline does not observe the signing workflow directly.
+- Replace "consistent with a single dominant mechanism (non-hand-signing)" in III-H and "single dominant mechanism plus residual..." in V-C with "consistent with a dominant high-similarity regime plus residual heterogeneity."
+- Replace "widely recognized / widely held within the audit profession" with either a citation or a purely methodological framing: "Firm A was selected as a candidate calibration reference; its benchmark status is evaluated using image evidence below."
+- Be careful with "known-majority-positive population." The empirical evidence supports replication-dominated, but "known" implies a source of ground truth outside the image evidence.
+
+The corrected cross-firm claim is narratively better. The old `5x` story was both wrong and too dramatic; the new `~2.1x` gap is still meaningful and more defensible.
+
+## 6. IEEE Access Fit
+
+The paper fits IEEE Access well. It is application-driven, computationally substantial, and methodologically relevant to document forensics, audit analytics, and computer vision. The novelty is not a new neural architecture; it is the calibration and validation strategy for a real archival corpus with limited ground truth. That is a legitimate IEEE Access contribution.
+
+The remaining issues are editorial/reproducibility issues rather than grounds for rejection. IEEE Access reviewers are likely to value the added Appendix B provenance map, but they will also notice if the mapped paths do not exist. Fixing those paths, or bundling the missing JSON/Markdown reports, is important before submission.
+
+## 7. Specific Actionable Revisions
+
+1. **Fix Appendix B provenance paths.** In the inspected report tree, these Appendix B artifacts are missing at the stated paths:
+   - `reports/formal_statistical/formal_statistical_results.json` (available alternative appears to be `reports/formal_statistical_data.json`)
+   - `reports/deloitte_distribution/deloitte_distribution_results.json` (only figures were present)
+   - `reports/pdf_level/pdf_level_results.json` (available alternatives include `reports/pdf_signature_verdicts.json`, CSV/XLSX, and markdown)
+   - `reports/ablation/ablation_results.json` (actual path appears to be `/Volumes/NV2/PDF-Processing/signature-analysis/ablation/ablation_results.json`)
+
+2. **Either commit/copy the report tree into the repo or state the absolute artifact root.** The user-facing manuscript says `reports/...`; the current repo root has no `reports/` directory.
+
+3. **Remove the remaining "single dominant mechanism" phrasing.** Use "dominant high-similarity regime" instead.
+
+4. **Fix the III-H parenthetical "p = 0.17 at n >= 10 signatures."** The signature-level dip test is N=60,448; the `n >= 10` rule belongs to accountant-level aggregates.
+
+5. **Clarify the `55,921` denominator in IV-H.2.** It differs by one from Table IX's `55,922` cosine-only Firm A count. Add that script 28 conditions on `assigned_accountant IS NOT NULL` and `min_dhash_independent IS NOT NULL`, or reconcile the one-record discrepancy.
+
+6. **Add or cite artifacts for still-unverifiable operational claims.** At minimum: YOLO training metrics/logs, VLM prompt/config, HSV thresholds, throughput log, and document-type breakdown.
+
+7. **Soften "widely recognized/widely held" practitioner wording or cite it.** The current "non-load-bearing" sentence helps, but uncited professional-knowledge claims are still exposed.
+
+8. **Keep the impact statement archived or revise before reuse.** The archive note correctly warns that "distinguishes genuinely hand-signed signatures from reproduced ones" would overstate the evidence.
+
+Bottom line: v3.18.2 materially improves the paper and fixes the round-16 empirical error. I would not block submission on the central results, but I would require the provenance/path cleanup and the remaining mechanism-language softening before calling it Accept.
@@ -0,0 +1,127 @@
+# Independent Peer Review (Round 18) - Paper A v3.18.3
+
+Reviewer role: independent peer reviewer for IEEE Access Regular Paper.
+Manuscript reviewed: "Replication-Dominated Calibration" - CPA signature analysis, v3.18.3, commits `f1c2537` + `26b934c` on `yolo-signature-pipeline`.
+Audit basis: manuscript sections under `paper/`, prior round-16 and round-17 reviews, scripts under `signature_analysis/`, the current SQLite/report artifacts under `/Volumes/NV2/PDF-Processing/signature-analysis/`, and direct filesystem checks of Appendix B paths.
+
+## 1. Overall Verdict: Minor Revision
+
+I recommend **Minor Revision**, not Accept.
+
+v3.18.3 resolves the main round-17 provenance problem: the four fabricated Appendix B paths have been replaced with paths that exist in the available report tree, and the manuscript now explicitly states the local report root (`/Volumes/NV2/PDF-Processing/signature-analysis/`) plus the fact that the ablation artifact is a sibling of `reports/`. The prior "single dominant mechanism" wording is also removed from the main Methodology/Discussion passages, and the mistaken "p = 0.17 at n >= 10 signatures" parenthetical is fixed.
+
+However, the new reconciliation note for the `55,921` vs `55,922` Firm A cosine-only counts is not supported by the current artifacts. The manuscript attributes the one-record difference to successive database snapshots and a downstream floating-point shift of one borderline Firm A signature. Direct database checks indicate a different cause: Table IX is based on Firm A membership from `accountants.firm`, whereas `signature_analysis/28_byte_identity_decomposition.py` groups Firm A by `signatures.excel_firm`. In the current database, one signature above `cos > 0.95` belongs to an accountant whose registry firm is Firm A but whose `excel_firm` field is not Firm A. Thus the new note fixes the arithmetic discrepancy but introduces a false provenance explanation.
+
+This is Minor rather than Major because the one-record drift has negligible numerical effect and does not overturn the central findings. It should still be corrected before submission because v3.18.3 was specifically intended to repair provenance discipline.
+
+## 2. Re-audit of Round-17 Findings
+
+| Round-17 finding | v3.18.3 status | Re-audit notes |
+|---|---|---|
+| Appendix B provenance paths overclaimed / several did not exist | **RESOLVED** | All listed Appendix B report artifacts now exist when rebased to `/Volumes/NV2/PDF-Processing/signature-analysis/`. The replacement paths for formal statistics, Firm A per-year data, PDF verdicts, ablation, and byte decomposition are real. |
+| Residual "single dominant mechanism" wording | **RESOLVED enough** | The exact phrase is gone from Methodology III-H and Discussion V-C. Current wording uses "dominant high-similarity regime plus residual within-firm heterogeneity," which is more defensible. |
+| III-H "p = 0.17 at n >= 10 signatures" parenthetical | **RESOLVED** | The current text correctly reports the signature-level dip result as `p = 0.17`, `N = 60,448` Firm A signatures. The `n >= 10` filter is no longer attached to that claim. |
+| "Widely recognized / widely held" practitioner wording | **RESOLVED enough** | Introduction now frames Firm A as selected by practitioner-knowledge motivation and evaluated by image evidence. III-H says "is understood within the audit profession" but immediately marks this as non-load-bearing. A citation would still be cleaner, but this is no longer a submission blocker. |
+| 55,921 vs 55,922 Firm A cosine-only count discrepancy | **PARTIAL / NEW ERROR** | The manuscript now acknowledges the discrepancy, but the explanation appears wrong. Current DB evidence points to different Firm A attribution fields (`accountants.firm` vs `signatures.excel_firm`), not a snapshot/floating-point shift. |
+| Still-unverifiable operational details: YOLO logs, VLM prompt/config, HSV thresholds, throughput log | **UNRESOLVED but not new** | These remain plausible method claims, but I did not find dedicated artifacts establishing them. This is acceptable for main-paper review only if the supplement includes training/config/runtime logs. |
+| Section reference for `145/50/180/35` byte decomposition | **PARTIAL** | Appendix B now maps the decomposition to script 28, but the main results Section IV-F.1 still reports only the all-sample 310 byte-identical signatures, not the Firm A `145/50/180/35` decomposition. Several locations still cite Section IV-F.1 for a decomposition that is actually in III-H / V-C / Appendix B. |
+
+## 3. Appendix B Path Verification
+
+I checked every Appendix B artifact path directly against the filesystem. Rebased to `/Volumes/NV2/PDF-Processing/signature-analysis/`, all listed artifacts exist:
+
+| Appendix B artifact | Exists? |
+|---|---|
+| `reports/extraction_methodology.md` | Yes |
+| `reports/pdf_signature_verdicts.json` | Yes |
+| `reports/formal_statistical_data.json` | Yes |
+| `reports/formal_statistical_report.md` | Yes |
+| `reports/dip_test/dip_test_results.json` | Yes |
+| `reports/beta_mixture/beta_mixture_results.json` | Yes |
+| `reports/bd_sensitivity/bd_sensitivity.json` | Yes |
+| `reports/pixel_validation/pixel_validation_results.json` | Yes |
+| `reports/validation_recalibration/validation_recalibration.json` | Yes |
+| `reports/expanded_validation/expanded_validation_results.json` | Yes |
+| `reports/accountant_similarity_analysis.json` | Yes |
+| `reports/figures/` | Yes |
+| `reports/partner_ranking/partner_ranking_results.json` | Yes |
+| `reports/intra_report/intra_report_results.json` | Yes |
+| `reports/pdf_signature_verdict_report.md` | Yes |
+| `ablation/ablation_results.json` | Yes |
+| `reports/byte_identity_decomp/byte_identity_decomposition.json` | Yes |
+
+The path replacements are real. The only caveat is semantic rather than filesystem-level: Table XIII is described as "derived from `reports/accountant_similarity_analysis.json` filtered to Firm A; figures in `reports/figures/`." That is acceptable as provenance if the supplement documents the filter/query used for the table.
+
+## 4. Empirical-Claim Audit
+
+I focused on claims introduced or changed by v3.18.3.
+
+**Verified**
+
+- Appendix B path replacements exist in the actual report tree.
+- `reports/byte_identity_decomp/byte_identity_decomposition.json` exists and reports:
+  - Firm A byte-identical signatures: `145`
+  - distinct Firm A partners: `50`
+  - registered Firm A partners: `180`
+  - cross-year byte-identical matches: `35`
+- The same JSON reports cross-firm dual convergence:
+  - Firm A: `49,388 / 55,921 = 88.32%`
+  - Non-Firm-A: `27,596 / 65,515 = 42.12%`
+- `validation_recalibration.json` reports Table IX's Firm A `cos > 0.95` count as `55,922 / 60,448 = 92.51%`.
+
+**New / Incorrect**
+
+- The new Results IV-H.2 reconciliation note says the `55,921` vs `55,922` discrepancy comes from successive snapshots and one borderline Firm A signature shifting from `cos > 0.95` to `cos = 0.95...` at floating-point precision. I could not reproduce that explanation.
+- Direct SQLite checks on the current database show:
+  - Firm A by `accountants.firm`, `cos > 0.95`: `55,922`
+  - Firm A by `signatures.excel_firm`, `cos > 0.95`: `55,921`
+  - exactly one `cos > 0.95` signature has `accountants.firm = Firm A` but `signatures.excel_firm != Firm A`.
+- The discrepant row I saw was `signature_id = 37768`, `assigned_accountant = 徐文亞`, `excel_firm = 黃毅民`, `max_similarity_to_same_accountant = 0.978511691093445`, `min_dhash_independent = 0`. That is not a `cos = 0.95...` borderline case.
+
+The corrected explanation should be along the lines of: Table IX uses accountant-registry Firm A membership, while script 28's cross-firm decomposition uses the `excel_firm` field; one above-threshold signature differs between those two firm-attribution fields. Alternatively, change script 28 to use the same `accountants.firm` join as the validation artifacts and regenerate the JSON.
+
+**Still only partially supported**
+
+- YOLO validation metrics, VLM prompt/settings, HSV red-removal thresholds, and 43.1 docs/sec throughput remain method claims without visible log/config artifacts in the inspected report tree.
+- The two Firm A CPAs excluded from the held-out split due to disambiguation ties remain plausible but not directly documented in a report field.
+- The 15 document types / 86.4% standard audit-report breakdown remains plausible but was not traced to a packaged table.
+
+## 5. Methodological + Narrative Discipline
+
+The narrative is materially cleaner than v3.18.2. The manuscript now keeps the central inference where it belongs: the evidence supports a replication-dominated calibration population and a continuous similarity-quality spectrum, not a directly observed signing workflow or a clean two-mechanism mixture.
+
+The remaining narrative issues are narrow:
+
+1. **Fix the new count-reconciliation note.** The current note is too specific and appears empirically false. Do not invoke successive snapshots or a floating-point boundary shift unless that can be shown from archived artifacts. The current evidence points to a firm-attribution-field mismatch.
+
+2. **Clarify Firm A membership consistently.** Several scripts use `accountants.firm`; script 28 uses `signatures.excel_firm`. Both may be defensible for different questions, but the paper must state which field defines Firm A in each table or harmonize the scripts.
+
+3. **Remove or soften remaining "known-majority-positive" phrasing.** The term appears in the Introduction, Methodology, Discussion, and Conclusion. The paper's better phrase is "replication-dominated reference population." "Known" still implies external ground truth stronger than the paper can document.
+
+4. **Correct the auditor-year / cross-year pooling description.** Methodology III-G says the auditor-year ranking is a "deliberately within-year aggregation that avoids cross-year pooling." But the same section and Results IV-G.2 state that each signature's best match is computed against the full same-CPA cross-year pool. The aggregation is by auditor-year, but the underlying similarity statistic is cross-year. Replace "avoids cross-year pooling" with "aggregates signatures within each auditor-year while using the full same-CPA pool for each signature's best-match statistic."
+
+5. **Align the byte-decomposition section reference.** If the `145/50/180/35` decomposition is meant to be a Results claim, put a sentence in IV-F.1 or cite Appendix B directly. As written, Section IV-F.1 reports the 310 all-sample byte-identical signatures, not the Firm A decomposition.
+
+## 6. IEEE Access Fit
+
+The paper remains a good IEEE Access fit. It is application-driven, computationally substantial, and methodologically relevant to document forensics, audit analytics, and computer vision. The contribution is not a novel neural architecture; it is a defensible calibration and validation strategy for a large archival corpus with limited ground truth.
+
+The remaining problems are reproducibility/provenance polish, not a collapse of the empirical core. Still, IEEE Access reviewers may scrutinize the supplement and table provenance. v3.18.3's Appendix B is now much stronger, but the newly added reconciliation note should be corrected because it is exactly the kind of precise provenance statement that reviewers can audit.
+
+## 7. Specific Actionable Revisions
+
+1. Replace the IV-H.2 `55,921` vs `55,922` explanation. Either:
+   - harmonize script 28 to use `accountants.firm` like `validation_recalibration.py` and regenerate the byte-decomposition JSON; or
+   - keep the current script 28 output and state that the one-record difference arises from `accountants.firm` versus `signatures.excel_firm` Firm A attribution.
+
+2. Add a short note in Appendix B or the script 28 report defining the Firm A grouping field for each artifact.
+
+3. Replace "known-majority-positive" with "replication-dominated" or "candidate replication-dominated" unless an external citation/ground-truth source is supplied.
+
+4. Revise Methodology III-G's auditor-year sentence so it does not claim the ranking avoids cross-year pooling.
+
+5. Add the `145/50/180/35` Firm A byte-decomposition sentence to Results IV-F.1, or cite Appendix B directly instead of Section IV-F.1 when discussing that decomposition.
+
+6. If time permits before submission, include supplementary logs/configs for YOLO metrics, VLM prompt/settings, HSV thresholds, and throughput. These are not central-result blockers, but they would strengthen the reproducibility package.
+
+Bottom line: v3.18.3 successfully fixes the fabricated Appendix B paths and most narrative overclaim from round 17. The manuscript should not be accepted until the new count-reconciliation explanation and the auditor-year pooling wording are corrected, but the required changes are small and localized.
@@ -5,22 +5,35 @@ from docx import Document
 from docx.shared import Inches, Pt, RGBColor
 from docx.enum.text import WD_ALIGN_PARAGRAPH
 from pathlib import Path
+import hashlib
 import re

+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+
 PAPER_DIR = Path("/Volumes/NV2/pdf_recognize/paper")
+EQUATION_CACHE_DIR = PAPER_DIR / "equations"
+EQUATION_CACHE_DIR.mkdir(exist_ok=True)
 FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures")
 EXTRA_FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
 OUTPUT = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"

 SECTIONS = [
    "paper_a_abstract_v3.md",
-    "paper_a_impact_statement_v3.md",
+    # paper_a_impact_statement_v3.md removed: not a standard IEEE Access
+    # Regular Paper section. Content folded into cover letter / abstract.
    "paper_a_introduction_v3.md",
    "paper_a_related_work_v3.md",
    "paper_a_methodology_v3.md",
    "paper_a_results_v3.md",
    "paper_a_discussion_v3.md",
    "paper_a_conclusion_v3.md",
+    # Appendix A: BD/McCrary bin-width sensitivity (see v3.7 notes).
+    "paper_a_appendix_v3.md",
+    # Declarations (COI / data availability / funding) before References,
+    # per IEEE Access convention.
+    "paper_a_declarations_v3.md",
    "paper_a_references_v3.md",
 ]

@@ -42,10 +55,10 @@ FIGURES = {
        "Fig. 3. Firm A per-signature cosine and dHash distributions against the overall CPA population.",
        3.5,
    ),
-    "Fig. 4 visualizes the accountant-level clusters": (
-        EXTRA_FIG_DIR / "accountant_mixture" / "accountant_mixture_2d.png",
-        "Fig. 4. Accountant-level 3-component Gaussian mixture in the (cosine-mean, dHash-mean) plane.",
-        4.5,
+    "Fig. 4 summarises the per-firm yearly per-signature": (
+        EXTRA_FIG_DIR / "figures" / "fig_yearly_big4_comparison.png",
+        "Fig. 4. Per-firm yearly per-signature best-match cosine, 2013-2023. (a) Mean per-signature best-match cosine by firm bucket and fiscal year (threshold-free). (b) Share of per-signature best-match cosine ≥ 0.95 (operational cut of Section III-K). Five lines: Firm A, B, C, D, Non-Big-4. Firm A is above the other Big-4 firms in every year; Non-Big-4 is below all four Big-4 firms in every year.",
+        6.5,
    ),
    "conducted an ablation study comparing three": (
        FIG_DIR / "fig4_ablation.png",
@@ -56,7 +69,321 @@ FIGURES = {


 def strip_comments(text):
-    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
+    """Remove HTML comments, but UNWRAP comments whose first non-blank line
+    starts with `TABLE ` (or `TABLE\t`).
+
+    The v3 markdown sources wrap every numerical table in an HTML comment of
+    the form
+
+        <!-- TABLE V: Hartigan Dip Test Results
+        | Distribution | N | ... |
+        |--------------|---|-----|
+        | ...          | … | ... |
+        -->
+
+    The caption (`TABLE V: Hartigan Dip Test Results`) is on the same line as
+    the opening `<!--`, the markdown table body is on the lines following,
+    and `-->` closes the block. The previous implementation wholesale-deleted
+    these comments, which silently dropped every table from the rendered
+    DOCX. We now (i) detect comments whose first non-empty line starts with
+    `TABLE `, (ii) emit a synthetic caption marker line `__TABLE_CAPTION__:
+    <caption>` so process_section can render the caption as a centered
+    bold paragraph above the table, and (iii) keep the table body so the
+    existing markdown-table detector picks it up. Non-TABLE comments
+    (figure placeholders, editorial notes) are stripped as before.
+    """
+    def _replace(match):
+        body = match.group(1)
+        # Find first non-blank line.
+        for line in body.splitlines():
+            stripped = line.strip()
+            if stripped:
+                first = stripped
+                break
+        else:
+            return ""
+        if not first.startswith("TABLE ") and not first.startswith("TABLE\t"):
+            return ""
+        # Split caption (first non-blank line) from the rest.
+        lines = body.splitlines()
+        # Find index of the first non-blank line and use everything after.
+        for idx, line in enumerate(lines):
+            if line.strip():
+                caption = line.strip()
+                rest = "\n".join(lines[idx + 1:])
+                break
+        else:
+            return ""
+        # Emit caption marker + body. Surround with blank lines so the
+        # paragraph/table detector treats the marker as its own paragraph.
+        return f"\n\n__TABLE_CAPTION__:{caption}\n{rest}\n"
+    # Non-greedy match across lines.
+    return re.sub(r"<!--(.*?)-->", _replace, text, flags=re.DOTALL)
+
+
+# ---------------------------------------------------------------------------
+# LaTeX → plain text + Unicode conversion
+# ---------------------------------------------------------------------------
+# The v3 markdown sources contain inline LaTeX ($...$) and a small number of
+# display-math blocks ($$...$$). Pandoc would render these natively; the
+# python-docx pipeline used here does not, so without preprocessing every
+# `\leq`, `\text{dHash}_\text{indep}`, `\Delta\text{BIC}`, `60{,}448`, etc.
+# leaks into the DOCX as raw LaTeX. The helpers below convert the common
+# inline cases to Unicode and split subscripts/superscripts into proper Word
+# runs. Display-math (rare; 3 equations in this paper) gets a best-effort
+# linearisation and is acceptable for a partner-handoff DOCX; final IEEE
+# typesetting is handled by the publisher's LaTeX/MathType pipeline.
+
+LATEX_TOKEN_REPLACEMENTS = [
+    # Greek letters (lower)
+    (r"\\alpha(?![A-Za-z])", "α"), (r"\\beta(?![A-Za-z])", "β"), (r"\\gamma(?![A-Za-z])", "γ"),
+    (r"\\delta(?![A-Za-z])", "δ"), (r"\\epsilon(?![A-Za-z])", "ε"), (r"\\zeta(?![A-Za-z])", "ζ"),
+    (r"\\eta(?![A-Za-z])", "η"), (r"\\theta(?![A-Za-z])", "θ"), (r"\\iota(?![A-Za-z])", "ι"),
+    (r"\\kappa(?![A-Za-z])", "κ"), (r"\\lambda(?![A-Za-z])", "λ"), (r"\\mu(?![A-Za-z])", "μ"),
+    (r"\\nu(?![A-Za-z])", "ν"), (r"\\xi(?![A-Za-z])", "ξ"), (r"\\pi(?![A-Za-z])", "π"),
+    (r"\\rho(?![A-Za-z])", "ρ"), (r"\\sigma(?![A-Za-z])", "σ"), (r"\\tau(?![A-Za-z])", "τ"),
+    (r"\\phi(?![A-Za-z])", "φ"), (r"\\chi(?![A-Za-z])", "χ"), (r"\\psi(?![A-Za-z])", "ψ"),
+    (r"\\omega(?![A-Za-z])", "ω"),
+    # Greek letters (upper, only those distinguishable from Latin)
+    (r"\\Gamma(?![A-Za-z])", "Γ"), (r"\\Delta(?![A-Za-z])", "Δ"), (r"\\Theta(?![A-Za-z])", "Θ"),
+    (r"\\Lambda(?![A-Za-z])", "Λ"), (r"\\Xi(?![A-Za-z])", "Ξ"), (r"\\Pi(?![A-Za-z])", "Π"),
+    (r"\\Sigma(?![A-Za-z])", "Σ"), (r"\\Phi(?![A-Za-z])", "Φ"), (r"\\Psi(?![A-Za-z])", "Ψ"),
+    (r"\\Omega(?![A-Za-z])", "Ω"),
+    # Relations / arrows
+    (r"\\leq(?![A-Za-z])", "≤"), (r"\\geq(?![A-Za-z])", "≥"),
+    (r"\\neq(?![A-Za-z])", "≠"), (r"\\approx(?![A-Za-z])", "≈"),
+    (r"\\equiv(?![A-Za-z])", "≡"), (r"\\sim(?![A-Za-z])", "~"),
+    (r"\\to(?![A-Za-z])", "→"), (r"\\rightarrow(?![A-Za-z])", "→"),
+    (r"\\leftarrow(?![A-Za-z])", "←"), (r"\\Rightarrow(?![A-Za-z])", "⇒"),
+    (r"\\Leftarrow(?![A-Za-z])", "⇐"),
+    # Binary operators
+    (r"\\times(?![A-Za-z])", "×"), (r"\\cdot(?![A-Za-z])", "·"),
+    (r"\\pm(?![A-Za-z])", "±"), (r"\\mp(?![A-Za-z])", "∓"),
+    (r"\\div(?![A-Za-z])", "÷"),
+    # Misc
+    (r"\\infty(?![A-Za-z])", "∞"), (r"\\partial(?![A-Za-z])", "∂"),
+    (r"\\sum(?![A-Za-z])", "∑"), (r"\\prod(?![A-Za-z])", "∏"),
+    (r"\\int(?![A-Za-z])", "∫"),
+    (r"\\ldots(?![A-Za-z])", "…"), (r"\\dots(?![A-Za-z])", "…"),
+    # Spacing commands (drop or replace with single space)
+    (r"\\,", " "), (r"\\;", " "), (r"\\:", " "),
+    (r"\\!", ""), (r"\\ ", " "),
+    (r"\\quad(?![A-Za-z])", "  "), (r"\\qquad(?![A-Za-z])", "    "),
+    # Escaped punctuation
+    (r"\\%", "%"), (r"\\#", "#"), (r"\\&", "&"),
+    (r"\\\$", "$"), (r"\\_", "_"),
+]
+
+
+def _unwrap_command(text, cmd):
+    """Repeatedly replace `\\cmd{X}` → `X` until stable."""
+    pat = re.compile(r"\\" + cmd + r"\{([^{}]*)\}")
+    prev = None
+    while prev != text:
+        prev = text
+        text = pat.sub(r"\1", text)
+    return text
+
+
+MATH_START = ""  # Private Use Area: XML-safe
+MATH_END = ""
+
+
+def latex_to_unicode(text):
+    """Convert a LaTeX-laced markdown paragraph into plain text.
+
+    Math context is preserved with private-use sentinel characters
+    (MATH_START / MATH_END) so the downstream run-splitter only treats
+    `_X` / `^X` as subscript / superscript inside math regions; in body
+    text underscores in identifiers like `signature_analysis` survive.
+    """
+    if "$" not in text and "\\" not in text:
+        return text
+
+    # 1. Strip display-math delimiters first (keep the inner content for
+    #    best-effort linearisation), wrapping math regions with sentinels.
+    #    Then strip inline math delimiters with the same sentinel wrapping.
+    text = re.sub(r"\$\$([\s\S]+?)\$\$",
+                  lambda m: MATH_START + m.group(1) + MATH_END, text)
+    text = re.sub(r"\$([^$]+?)\$",
+                  lambda m: MATH_START + m.group(1) + MATH_END, text)
+
+    # 2. Replace token-level commands with Unicode glyphs *before* unwrapping
+    #    `\text{...}` and friends, so that `\Delta\text{BIC}` becomes
+    #    `Δ\text{BIC}` (then `ΔBIC`) rather than `\DeltaBIC` which would be
+    #    stripped wholesale by the cleanup pass.
+    for pat, repl in LATEX_TOKEN_REPLACEMENTS:
+        text = re.sub(pat, repl, text)
+
+    # 3. Unwrap formatting / text commands (innermost first via _unwrap loop).
+    for cmd in ("text", "mathbf", "mathit", "mathrm", "mathsf", "mathtt",
+                "operatorname", "emph", "textbf", "textit"):
+        text = _unwrap_command(text, cmd)
+
+    # 4. \frac{a}{b} → (a)/(b); \sqrt{x} → √(x). Apply repeatedly to handle
+    #    one level of nesting; deeper nesting is rare in this paper.
+    for _ in range(3):
+        text = re.sub(
+            r"\\t?frac\{([^{}]+)\}\{([^{}]+)\}",
+            r"(\1)/(\2)",
+            text,
+        )
+    text = re.sub(r"\\sqrt\{([^{}]+)\}", r"√(\1)", text)
+
+    # 5. TeX braces used purely for spacing/grouping: K{=}3 → K=3,
+    #    60{,}448 → 60,448, 10{,}175 → 10,175.
+    text = re.sub(r"\{([=<>+\-,])\}", r"\1", text)
+
+    # 6. Strip any remaining `\cmd{...}` (best effort) and `\cmd ` tokens.
+    text = re.sub(r"\\[a-zA-Z]+\{([^{}]*)\}", r"\1", text)
+    text = re.sub(r"\\[a-zA-Z]+(?![A-Za-z])", "", text)
+
+    # 7. Collapse runs of whitespace introduced by command stripping.
+    text = re.sub(r"[ \t]{2,}", " ", text)
+    return text
+
+
+_SUBSUP_PATTERN = re.compile(
+    r"_\{([^{}]*)\}"     # _{...}
+    r"|\^\{([^{}]*)\}"   # ^{...}
+    r"|_([A-Za-z0-9+\-])"  # _X (single token)
+    r"|\^([A-Za-z0-9+\-])"  # ^X (single token)
+)
+
+
+def _emit_plain(paragraph, text, font_name, font_size, bold, italic):
+    if not text:
+        return
+    run = paragraph.add_run(text)
+    run.font.name = font_name
+    run.font.size = font_size
+    run.bold = bold
+    run.italic = italic
+
+
+def _emit_math(paragraph, text, font_name, font_size, bold, italic):
+    """Emit `text` from a math region: split on `_X` / `_{X}` / `^X` / `^{X}`
+    and render those as Word subscripts / superscripts."""
+    if "_" not in text and "^" not in text:
+        _emit_plain(paragraph, text, font_name, font_size, bold, italic)
+        return
+    pos = 0
+    for m in _SUBSUP_PATTERN.finditer(text):
+        if m.start() > pos:
+            _emit_plain(paragraph, text[pos:m.start()],
+                        font_name, font_size, bold, italic)
+        sub_text = m.group(1) or m.group(3)
+        sup_text = m.group(2) or m.group(4)
+        if sub_text is not None:
+            run = paragraph.add_run(sub_text)
+            run.font.subscript = True
+        else:
+            run = paragraph.add_run(sup_text)
+            run.font.superscript = True
+        run.font.name = font_name
+        run.font.size = font_size
+        run.bold = bold
+        run.italic = italic
+        pos = m.end()
+    if pos < len(text):
+        _emit_plain(paragraph, text[pos:],
+                    font_name, font_size, bold, italic)
+
+
+def add_text_with_subsup(paragraph, text, font_name="Times New Roman",
+                         font_size=Pt(10), bold=False, italic=False):
+    """Add `text` to `paragraph`. Subscript/superscript handling is scoped to
+    math regions delimited by MATH_START / MATH_END sentinels (set up by
+    `latex_to_unicode`). Outside math regions, underscores and carets are
+    preserved literally so identifiers like `signature_analysis` and
+    `paper_a_results_v3.md` survive intact.
+    """
+    if MATH_START not in text:
+        _emit_math(paragraph, text, font_name, font_size, bold, italic) \
+            if False else \
+            _emit_plain(paragraph, text, font_name, font_size, bold, italic)
+        return
+
+    pos = 0
+    while pos < len(text):
+        s = text.find(MATH_START, pos)
+        if s == -1:
+            _emit_plain(paragraph, text[pos:],
+                        font_name, font_size, bold, italic)
+            break
+        if s > pos:
+            _emit_plain(paragraph, text[pos:s],
+                        font_name, font_size, bold, italic)
+        e = text.find(MATH_END, s + 1)
+        if e == -1:
+            # Unterminated math region — emit rest as plain.
+            _emit_plain(paragraph, text[s + 1:],
+                        font_name, font_size, bold, italic)
+            break
+        math_body = text[s + 1:e]
+        _emit_math(paragraph, math_body, font_name, font_size, bold, italic)
+        pos = e + 1
+
+
+# ---------------------------------------------------------------------------
+# Display-equation rendering (matplotlib mathtext → PNG → embedded image)
+# ---------------------------------------------------------------------------
+
+# matplotlib mathtext is a subset of LaTeX. A few common TeX-only macros need
+# to be substituted with mathtext-supported equivalents before parsing.
+_MATHTEXT_SUBS = [
+    (re.compile(r"\\tfrac\b"), r"\\frac"),       # text-frac → frac
+    (re.compile(r"\\dfrac\b"), r"\\frac"),       # display-frac → frac
+    (re.compile(r"\\operatorname\{([^{}]+)\}"),
+     lambda m: r"\mathrm{" + m.group(1) + "}"),  # operatorname → mathrm
+    (re.compile(r"\\,"), " "),                   # thin space
+    (re.compile(r"\\;"), " "),
+    (re.compile(r"\\!"), ""),
+]
+
+
+def _sanitise_for_mathtext(latex: str) -> str:
+    out = latex
+    for pat, repl in _MATHTEXT_SUBS:
+        out = pat.sub(repl, out)
+    return out
+
+
+def render_equation_png(latex: str, fontsize: int = 14) -> Path:
+    """Render a LaTeX math expression to a tightly-cropped PNG using
+    matplotlib mathtext, with content-addressed caching so a re-build only
+    re-renders changed equations. Returns the cached PNG path."""
+    sanitised = _sanitise_for_mathtext(latex.strip())
+    digest = hashlib.sha1(
+        (sanitised + f"|fs{fontsize}").encode("utf-8")).hexdigest()[:16]
+    out_path = EQUATION_CACHE_DIR / f"eq_{digest}.png"
+    if out_path.exists():
+        return out_path
+    fig = plt.figure(figsize=(8, 1.6))
+    fig.text(0.5, 0.5, f"${sanitised}$",
+             fontsize=fontsize, ha="center", va="center")
+    fig.savefig(str(out_path), dpi=220, bbox_inches="tight",
+                pad_inches=0.05)
+    plt.close(fig)
+    return out_path
+
+
+def add_equation_block(doc, latex: str, equation_number: int,
+                       width_inches: float = 4.5):
+    """Insert a centered display equation (rendered as PNG) followed by
+    a right-aligned equation number `(N)`. Width keeps the equation
+    visually proportional within the IEEE Access body column."""
+    img_path = render_equation_png(latex)
+    p = doc.add_paragraph()
+    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
+    p.paragraph_format.space_before = Pt(6)
+    p.paragraph_format.space_after = Pt(6)
+    run = p.add_run()
+    run.add_picture(str(img_path), width=Inches(width_inches))
+    # Equation number on the same paragraph, tab-aligned to the right.
+    num_run = p.add_run(f"\t({equation_number})")
+    num_run.font.name = "Times New Roman"
+    num_run.font.size = Pt(10)


 def add_md_table(doc, table_lines):
@@ -73,14 +400,23 @@ def add_md_table(doc, table_lines):
    for r_idx, row in enumerate(rows_data):
        for c_idx in range(min(len(row), ncols)):
            cell = table.rows[r_idx].cells[c_idx]
-            cell.text = row[c_idx]
-            for p in cell.paragraphs:
-                p.alignment = WD_ALIGN_PARAGRAPH.CENTER
-                for run in p.runs:
-                    run.font.size = Pt(8)
-                    run.font.name = "Times New Roman"
-                    if r_idx == 0:
-                        run.bold = True
+            raw = row[c_idx]
+            # Strip markdown emphasis markers; convert LaTeX before rendering.
+            raw = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", raw)
+            raw = re.sub(r"\*\*(.+?)\*\*", r"\1", raw)
+            raw = re.sub(r"\*(.+?)\*", r"\1", raw)
+            raw = re.sub(r"`(.+?)`", r"\1", raw)
+            cell_text = latex_to_unicode(raw)
+            # Replace the default empty paragraph with one we control.
+            cell.text = ""
+            cp = cell.paragraphs[0]
+            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
+            add_text_with_subsup(
+                cp, cell_text,
+                font_name="Times New Roman",
+                font_size=Pt(8),
+                bold=(r_idx == 0),
+            )
    doc.add_paragraph()


@@ -99,10 +435,27 @@ def _insert_figures(doc, para_text):
            cr.italic = True


-def process_section(doc, filepath):
+def process_section(doc, filepath, equation_counter=None):
+    """Process one v3 markdown section. `equation_counter` is a single-element
+    list (used as a mutable counter shared across sections) tracking the
+    running display-equation number."""
+    if equation_counter is None:
+        equation_counter = [0]
    text = filepath.read_text(encoding="utf-8")
    text = strip_comments(text)
    lines = text.split("\n")
+    # Defensive blockquote handling: markdown blockquote lines (`> body`) are
+    # not rendered as Word callout blocks here, but stripping the leading
+    # `> ` keeps the body text from leaking the literal `>` and the empty
+    # `>` separator lines into the DOCX.
+    cleaned = []
+    for ln in lines:
+        s = ln.lstrip()
+        if s == ">" or s.startswith("> "):
+            cleaned.append(ln[ln.index(">") + 1:].lstrip() if "> " in ln else "")
+        else:
+            cleaned.append(ln)
+    lines = cleaned
    i = 0
    while i < len(lines):
        line = lines[i]
@@ -111,23 +464,44 @@ def process_section(doc, filepath):
            i += 1
            continue
        if stripped.startswith("# "):
-            h = doc.add_heading(stripped[2:], level=1)
+            h = doc.add_heading(
+                latex_to_unicode(stripped[2:]).replace(MATH_START, "").replace(MATH_END, ""),
+                level=1)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("## "):
-            h = doc.add_heading(stripped[3:], level=2)
+            h = doc.add_heading(
+                latex_to_unicode(stripped[3:]).replace(MATH_START, "").replace(MATH_END, ""),
+                level=2)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("### "):
-            h = doc.add_heading(stripped[4:], level=3)
+            h = doc.add_heading(
+                latex_to_unicode(stripped[4:]).replace(MATH_START, "").replace(MATH_END, ""),
+                level=3)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
+        if stripped.startswith("__TABLE_CAPTION__:"):
+            caption_text = stripped[len("__TABLE_CAPTION__:"):].strip()
+            caption_text = latex_to_unicode(caption_text)
+            cp = doc.add_paragraph()
+            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
+            cp.paragraph_format.space_before = Pt(6)
+            cp.paragraph_format.space_after = Pt(2)
+            add_text_with_subsup(
+                cp, caption_text,
+                font_name="Times New Roman",
+                font_size=Pt(9),
+                bold=True,
+            )
+            i += 1
+            continue
        if "|" in stripped and i + 1 < len(lines) and re.match(r"\s*\|[-|: ]+\|", lines[i + 1]):
            table_lines = []
            while i < len(lines) and "|" in lines[i]:
@@ -135,22 +509,74 @@ def process_section(doc, filepath):
                i += 1
            add_md_table(doc, table_lines)
            continue
+        # Display math: a line starting with `$$` is treated as a single-line
+        # equation block and rendered as an embedded mathtext PNG with an
+        # auto-incrementing equation number.
+        if stripped.startswith("$$"):
+            # Accumulate until a closing $$ is found (single line in our
+            # corpus, but defensively support multi-line just in case).
+            buf = [stripped]
+            if not (stripped.count("$$") >= 2 and stripped.endswith("$$")):
+                while i + 1 < len(lines):
+                    i += 1
+                    buf.append(lines[i])
+                    if "$$" in lines[i]:
+                        break
+            joined = "\n".join(buf).strip()
+            # Strip the leading and trailing $$ delimiters and any trailing
+            # punctuation (e.g. the `,` that some equation lines end with).
+            inner = joined
+            if inner.startswith("$$"):
+                inner = inner[2:]
+            if inner.endswith("$$"):
+                inner = inner[:-2]
+            inner = inner.rstrip(", ")
+            equation_counter[0] += 1
+            try:
+                add_equation_block(doc, inner, equation_counter[0])
+            except Exception as exc:
+                # Fallback: render as plain centered Times-Roman line so the
+                # build doesn't fail on a single un-renderable equation.
+                p = doc.add_paragraph()
+                p.alignment = WD_ALIGN_PARAGRAPH.CENTER
+                run = p.add_run(f"[equation render failed: {exc}] {inner}")
+                run.font.name = "Times New Roman"
+                run.font.size = Pt(10)
+                run.italic = True
+            i += 1
+            continue
        if re.match(r"^\d+\.\s", stripped):
-            p = doc.add_paragraph(style="List Number")
-            content = re.sub(r"^\d+\.\s", "", stripped)
+            # Manual numbering: keep the number from the markdown source and
+            # apply a hanging-indent paragraph format. Avoids python-docx's
+            # `style='List Number'` which depends on a properly-set-up
+            # numbering definition that the default Document() lacks.
+            m = re.match(r"^(\d+)\.\s+(.*)$", stripped)
+            num, content = m.group(1), m.group(2)
+            p = doc.add_paragraph()
+            p.paragraph_format.left_indent = Inches(0.4)
+            p.paragraph_format.first_line_indent = Inches(-0.25)
+            p.paragraph_format.space_after = Pt(4)
+            content = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", content)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
-            run = p.add_run(content)
-            run.font.size = Pt(10)
-            run.font.name = "Times New Roman"
+            content = re.sub(r"\*(.+?)\*", r"\1", content)
+            content = re.sub(r"`(.+?)`", r"\1", content)
+            content = latex_to_unicode(content)
+            add_text_with_subsup(p, f"{num}. {content}")
            i += 1
            continue
        if stripped.startswith("- "):
-            p = doc.add_paragraph(style="List Bullet")
+            # Manual bullets with hanging indent (same rationale as numbered).
+            p = doc.add_paragraph()
+            p.paragraph_format.left_indent = Inches(0.4)
+            p.paragraph_format.first_line_indent = Inches(-0.25)
+            p.paragraph_format.space_after = Pt(4)
            content = stripped[2:]
+            content = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", content)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
-            run = p.add_run(content)
-            run.font.size = Pt(10)
-            run.font.name = "Times New Roman"
+            content = re.sub(r"\*(.+?)\*", r"\1", content)
+            content = re.sub(r"`(.+?)`", r"\1", content)
+            content = latex_to_unicode(content)
+            add_text_with_subsup(p, f"•  {content}")
            i += 1
            continue
        # Regular paragraph
@@ -173,14 +599,12 @@ def process_section(doc, filepath):
        para_text = re.sub(r"\*\*(.+?)\*\*", r"\1", para_text)
        para_text = re.sub(r"\*(.+?)\*", r"\1", para_text)
        para_text = re.sub(r"`(.+?)`", r"\1", para_text)
-        para_text = para_text.replace("$$", "")
        para_text = para_text.replace("---", "\u2014")
+        para_text = latex_to_unicode(para_text)

        p = doc.add_paragraph()
        p.paragraph_format.space_after = Pt(6)
-        run = p.add_run(para_text)
-        run.font.size = Pt(10)
-        run.font.name = "Times New Roman"
+        add_text_with_subsup(p, para_text)

        _insert_figures(doc, para_text)

@@ -198,35 +622,68 @@ def main():
    run = p.add_run(
        "Automated Identification of Non-Hand-Signed Auditor Signatures\n"
        "in Large-Scale Financial Audit Reports:\n"
-        "A Dual-Descriptor Framework with Three-Method Convergent Thresholding"
+        "A Dual-Descriptor Framework with Replication-Dominated Calibration"
    )
    run.font.size = Pt(16)
    run.font.name = "Times New Roman"
    run.bold = True

+    # IEEE Access uses single-anonymized review: author / affiliation
+    # / corresponding-author block must appear on the title page in the
+    # final submission. Fill these placeholders with real metadata
+    # before submitting the generated DOCX.
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(6)
-    run = p.add_run("[Authors removed for double-blind review]")
+    run = p.add_run("[AUTHOR NAMES — fill in before submission]")
+    run.font.size = Pt(11)
+
+    p = doc.add_paragraph()
+    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
+    p.paragraph_format.space_after = Pt(6)
+    run = p.add_run("[Affiliations and corresponding-author email — fill in before submission]")
    run.font.size = Pt(10)
    run.italic = True

    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(20)
-    run = p.add_run("Target journal: IEEE Access (Regular Paper)")
+    run = p.add_run("Target journal: IEEE Access (Regular Paper, single-anonymized review)")
    run.font.size = Pt(10)
    run.italic = True

+    equation_counter = [0]
    for section_file in SECTIONS:
        filepath = PAPER_DIR / section_file
        if filepath.exists():
-            process_section(doc, filepath)
+            process_section(doc, filepath, equation_counter=equation_counter)
        else:
            print(f"WARNING: missing section file: {filepath}")

    doc.save(str(OUTPUT))
    print(f"Saved: {OUTPUT}")
+    _run_linter()
+
+
+def _run_linter():
+    """Run the leak linter on the freshly built DOCX. Non-fatal: prints a
+    summary line. For full output run `python3 paper/lint_paper_v3.py`."""
+    try:
+        import lint_paper_v3  # local module
+    except Exception as exc:  # pragma: no cover
+        print(f"(lint skipped: {exc})")
+        return
+    findings = lint_paper_v3.lint_docx(OUTPUT)
+    errors = sum(1 for f in findings if f.severity == "ERROR")
+    warns = sum(1 for f in findings if f.severity == "WARN")
+    infos = sum(1 for f in findings if f.severity == "INFO")
+    if errors:
+        print(f"\n[lint] {errors} ERROR finding(s) in DOCX — run "
+              f"`python3 paper/lint_paper_v3.py --docx` for details.")
+    elif warns or infos:
+        print(f"[lint] DOCX clean of ERRORs ({warns} WARN, {infos} INFO).")
+    else:
+        print("[lint] DOCX clean.")


 if __name__ == "__main__":
@@ -0,0 +1,45 @@
+# Partner Red-Pen Regression Audit (v3.19.0) - Gemini 3.1 Pro
+
+
+### Overall Summary
+The authors have taken a highly rigorous and defensive route to addressing the partner's concerns. The most confusing and convoluted analytical constructs—specifically the accountant-level GMM and accountant-level BD/McCrary tests—have simply been **deleted entirely**. The surviving text has been rewritten to be direct, transparent about limitations, and free of AI-sounding filler. 
+
+Of the 11 specific lettered items (a–k) raised by the partner:
+- **8 are RESOLVED** (rewritten for clarity and precision)
+- **3 are N/A** (the underlying text/analysis was completely removed)
+- **0 are UNRESOLVED, PARTIAL, or IMPROVED**
+
+Additionally, the two overarching thematic items (Citation reality and ZH/EN alignment) are fully RESOLVED or N/A. The smallest residual set of polish required before the partner re-read is **empty**. The manuscript is clean and ready for review.
+
+---
+
+### Detailed Item-by-Item Audit
+
+#### Theme 1: Citation reality (suspected AI hallucinations)
+* **Item**: '輸入?', '有些幻覺像是研究方法', 'BD/McCrary 沒?', '引用?' (Are these hallucinated?)
+* **Status**: **RESOLVED**
+* **Citation**: `@paper/reference_verification_v3.md`, `@paper/paper_a_references_v3.md`
+* **Notes**: The authors conducted a comprehensive `WebFetch` audit of all 41 references. All statistical methods references ([37]-[41]: Hartigan, BD, McCrary, Dempster-Laird-Rubin, White) are 100% real and bibliographically accurate. The audit did catch one genuine error at ref [5] (wrong authors: "I. Hadjadj et al.") which the authors successfully fixed to "H.-H. Kao and C.-Y. Wen" in the current `paper_a_references_v3.md`.
+
+#### Theme 3: ZH/EN alignment gap
+* **Item**: '沒有跟英文嗎?比較' (no English alongside? compare) at end of III-H
+* **Status**: **N/A**
+* **Citation**: Entire manuscript
+* **Notes**: The v3.19.0 draft is now a finalized, monolingual English manuscript prepared for IEEE submission. The dual-language translation scaffolding that caused this misalignment has been removed, rendering the issue moot.
+
+#### Theme 2 & 4: Specific Prose and Numbers (The 11 Lettered Items)
+
+| Item | Partner's Red-Pen Mark | Status | Where it is addressed | Notes / Justification |
+| :--- | :--- | :--- | :--- | :--- |
+| **(a)** & **(h)** | **A1 stipulation, p.16** ('不太懂你的敘述' / entire paragraph red-circled) | **RESOLVED** | Sec III-G (`paper_a_methodology_v3.md`) | The paragraph was completely rewritten. It is no longer roundabout. It explicitly defines A1 as a "cross-year pair-existence property" and clearly lists three concrete conditions where it is *not* guaranteed (e.g., multiple template variants simultaneously, scan-stage noise). |
+| **(b)** | **Conservative structural-similarity, p.16** ('有點繞嗎?' / is it a bit roundabout?) | **RESOLVED** | Sec III-G (`paper_a_methodology_v3.md`) | Reduced to a single, highly literal sentence: "The independent minimum is unconditional on the cosine-nearest pair and is therefore the conservative structural-similarity statistic..." Extremely clear. |
+| **(c)** | **IV-G validation lead-in, p.18** ('不太懂為何陳述?' / don't follow why you say this) | **RESOLVED** | Sec IV-G (`paper_a_results_v3.md`) | The text now explicitly motivates the section: it explains that the prior capture rates are a circular "internal consistency check," so these three new analyses are needed because their "informative quantity does not depend on the threshold's absolute value." |
+| **(d)** & **(k)** | **BD/McCrary at accountant level, p.20** ('看不懂!' / '為何 accountant level 合計, 因為 component?') | **N/A** | *Removed entirely* | The authors deleted the entire accountant-level mixture analysis and accountant-level BD/McCrary test from the paper. Thresholding is now strictly signature-level, completely sidestepping this confusing narrative. |
+| **(e)** | **92.6% match rate, p.13** ('不太懂改善線' / don't follow the improvement angle) | **RESOLVED** | Sec III-D (`paper_a_methodology_v3.md`) | The "improvement angle" has been deleted. The 92.6% is now presented purely descriptively as a data processing metric, explaining that the 7.4% unmatched are "excluded for definitional reasons rather than discarded as noise." |
+| **(f)** | **0.95 cosine cut-off, p.18** ('Cut-off 對應!' / correspondence to what?) | **RESOLVED** | Sec III-K (`paper_a_methodology_v3.md`) | The text directly answers this now: "the cosine cutoff 0.95 corresponds to approximately the whole-sample Firm A P7.5 of the per-signature best-match cosine distribution..." |
+| **(g)** | **139/32 split in C1/C2 clusters, p.18** ('可能太倚加權因子!?' / too reliant on weighting factor?) | **N/A** | *Removed entirely* | Along with the rest of the accountant-level GMM (see item d/k), the C1/C2 cluster analysis and the 139/32 split have been entirely removed from the current draft. |
+| **(i)** | **Hartigan rejection-as-bimodality, p.19** ('?所以為何?' / so why?) | **RESOLVED** | Sec III-I.1 (`paper_a_methodology_v3.md`) | The text no longer falsely equates a dip-test rejection with bimodality. It correctly explains that a significant p-value simply means "more than one peak" and explains it is used only to "decide whether a KDE antimode is well-defined." |
+| **(j)** | **BIC strict-3-component upper-bound framing, p.20** (red-circled paragraph) | **RESOLVED** | Sec IV-D.3 (`paper_a_results_v3.md`) | The text abandons the tortured "upper-bound" framing and bluntly titles the subsection "A Forced Fit." It clearly states that because BIC strongly prefers 3 components, the 2-component parametric structure "is not supported by the data." |
+
+### Smallest Residual Set
+**None.** The authors did not just patch the confusing paragraphs; they systematically dropped the weakest, most complicated statistical claims (accountant-level mixtures) and grounded the remaining text in literal, descriptive language. The paper is safe, highly defensible, and ready to be sent back to the partner.
@@ -0,0 +1,68 @@
+# Independent Peer Review (Round 19) - Paper A v3.18.4
+
+## 1. Overall Verdict: Major Revision
+
+I recommend **Major Revision**. While v3.18.4 resolves the fabricated Appendix B paths and the cross-firm dual-descriptor arithmetic discrepancy, my independent audit found several profound new discrepancies, fabricated rationalizations, and a critical methodological flaw that survived the previous 18 review rounds.
+
+The most severe issues are:
+1. **Fabricated Rationalization for Excluded Documents:** Section IV-H claims 656 documents were excluded because they "carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available." This fundamentally contradicts the pipeline's core logic (which computes maximum pairwise similarity across the *entire corpus* per CPA, not intra-document) and Section IV-D.1 (which correctly states only 15 signatures belong to singleton CPAs). The 656 documents were actually excluded because they had no CPA-matched signatures at all (`assigned_accountant IS NULL`).
+2. **Fabricated Provenance for Table XIII:** Appendix B claims Table XIII (Firm A per-year cosine distribution) is derived from `reports/accountant_similarity_analysis.json`. However, the generating script (`08_accountant_similarity_analysis.py`) neither extracts nor groups by the `year_month` field. The table's temporal data has no supporting script in the provided pipeline.
+3. **Fabricated Rationalization for Firm A Partners:** Section IV-F.2 claims "two [CPAs were] excluded for disambiguation ties" to explain the 178 vs. 180 Firm A partner split. The actual script `24_validation_recalibration.py` contains no disambiguation logic; it simply takes the set of unique CPAs successfully assigned to Firm A in the database, which happens to be 178.
+4. **Methodological Flaw in Inter-CPA Negative Anchor:** Script `21_expanded_validation.py` claims to generate ~50,000 random inter-CPA pairs for validation. However, the script artificially draws these pairs from a tiny pool of just `n=3,000` randomly selected signatures, rather than the full 168,755 corpus. This severely constrains diversity (reusing the same signatures ~33 times each) and artificially tightens the confidence intervals reported in Table X.
+
+These issues represent severe provenance, narrative, and statistical failures. The paper must undergo a major revision to correct these fabricated rationalizations and ensure the reported numbers and methodologies match the actual execution.
+
+## 2. Empirical-Claim Audit Table
+
+| Claim | Status | Audit basis / notes |
+|---|---|---|
+| 656 single-signature documents excluded because "no same-CPA pairwise comparison" is available | **FABRICATED** | Contradicts cross-document comparison logic and IV-D.1 (only 15 singleton CPAs lack comparison). The real reason is they failed CPA matching entirely. |
+| 178 Firm A CPAs in split vs 180 registry; "two excluded for disambiguation ties" | **FABRICATED** | `24_validation_recalibration.py` simply takes unique accountants with `firm=FIRM_A`. There is no disambiguation logic in the script. |
+| Table XIII (Firm A per-year cosine distribution) | **FABRICATED PROVENANCE** | App. B claims it's derived from `accountant_similarity_analysis.json`, but `08_accountant_similarity_analysis.py` doesn't extract or group by year. |
+| 50,000 inter-CPA negative pairs | **METHODOLOGICALLY FLAWED** | `21_expanded_validation.py` draws 50,000 pairs from a tiny pool of `n=3000` signatures, artificially constraining diversity. |
+| 145/50/180/35 byte-identity decomp | **VERIFIED-AGAINST-ARTIFACT** | Matches `28_byte_identity_decomposition.py`. |
+| Cross-firm convergence 42.12% vs 88.32% | **VERIFIED-AGAINST-ARTIFACT** | Denominators (65,514 and 55,922) reconcile correctly with the updated `accountants.firm` logic. |
+| 90,282 PDFs, 2013-2023, Taiwan | **VERIFIED-IN-TEXT** | Consistent across manuscript. |
+| 86,072 VLM-positive documents; 12 corrupted PDFs; final 86,071 | **VERIFIED-IN-TEXT** | Internally consistent in III-C. |
+| 182,328 extracted signatures; 168,755 CPA-matched; 13,573 unmatched | **VERIFIED-IN-TEXT** | Matches manuscript counts. |
+| 758 CPAs, 15 document types, 86.4% standard audit reports | **UNVERIFIABLE** | Plausible, but no direct packaged JSON verifies the 15/86.4% split. |
+| Qwen2.5-VL 32B, 180 DPI, first-quartile scan, temperature 0 | **UNVERIFIABLE** | No prompt/config/log artifact inspected. |
+| YOLO metrics (precision, recall, mAP) and 43.1 docs/sec throughput | **UNVERIFIABLE** | No training-results or runtime artifact in `signature_analysis/`. |
+| Same-CPA best-match N = 168,740, 15 fewer than matched due to singleton CPAs | **VERIFIED-AGAINST-ARTIFACT** | Matches dip-test report and script logic. |
+| ResNet-50 ImageNet-1K V2, 2048-d, L2 normalized | **VERIFIED-AGAINST-ARTIFACT** | Consistent with methods and ablation script. |
+| All-pairs intra/inter distribution N = 41,352,824 / 500,000; KDE crossover 0.837 | **VERIFIED-AGAINST-ARTIFACT** | Supported by formal-statistical script. |
+| Firm A dip result N=60,448, dip=0.0019, p=0.169 | **VERIFIED-AGAINST-ARTIFACT** | `15_hartigan_dip_test.py`. |
+| Beta mixture Delta BIC = 381 for Firm A; forced crossings 0.977/0.999 | **VERIFIED-AGAINST-ARTIFACT** | `17_beta_mixture_em.py`. |
+
+## 3. Methodological Soundness
+
+While the dual-descriptor design and replication-dominated anchor are fundamentally sound, there is a severe flaw in the inter-CPA negative anchor construction that must be corrected.
+**Flawed Inter-CPA Anchor Generation:** `21_expanded_validation.py` randomly selects just 3,000 feature vectors out of the 168,755 available signatures (via `load_feature_vectors_sample`), and then randomly pairs them to generate 50,000 negative samples. This means that each of the 3,000 signatures is reused in approximately 33 different pairs, artificially deflating the variance and diversity of the negative population. This compromises the tight Wilson 95% confidence intervals on FAR reported in Table X. The script should sample pairs uniformly across the entire 168,755 corpus.
+
+## 4. Narrative Discipline
+
+The manuscript's narrative discipline has improved regarding the removal of the "known-majority-positive" residue. However, the authors have resorted to fabricating rationalizations to explain simple arithmetic gaps:
+- **The 656 Document Exclusion:** Inventing a false methodological limitation ("single signature ... no same-CPA pairwise comparison") to explain a drop in document counts is unacceptable and undermines the paper's credibility, especially when the core methodology explicitly relies on cross-document matching.
+- **The 2 CPAs Exclusion:** Inventing "disambiguation ties" to explain why 178 CPAs are in the Firm A split instead of the registered 180 is similarly dishonest. If the database only successfully matched signatures to 178 Firm A CPAs, the text should state exactly that.
+
+## 5. IEEE Access Fit
+
+The work remains a strong fit for IEEE Access due to its scale and real-world application, provided the provenance and methodological issues are rectified. The journal emphasizes reproducibility, making the fabricated provenance for Table XIII and the statistical flaw in the FAR validation critical blockers for publication.
+
+## 6. Specific Actionable Revisions
+
+1. **Rewrite the 656-document exclusion explanation (Section IV-H):** State that 656 documents were excluded from the per-document classification because none of their extracted signatures could be successfully matched to a registered CPA name, not because single signatures lack cross-document comparison.
+2. **Remove the fabricated "disambiguation ties" claim (Section IV-F.2):** State simply that the 70/30 split was performed over the 178 Firm A CPAs who had successfully matched signatures in the corpus (compared to the 180 in the registry).
+3. **Provide actual script provenance for Table XIII:** Either supply the script that generates the year-by-year left-tail distribution, or remove Table XIII from the manuscript. Do not falsely attribute it to `08_accountant_similarity_analysis.py` (which does not group by year).
+4. **Fix the Inter-CPA Negative Anchor Script:** Modify `21_expanded_validation.py` to sample 50,000 pairs uniformly from the entire 168,755 matched-signature corpus, rather than from a pre-sampled subset of 3,000. Re-run and update Table X.
+5. **(Optional but recommended) Include Unverifiable Logs:** Add YOLO training logs, VLM configuration details, and the 15-document-type breakdown table to the supplementary materials so that claims in Section III-B, III-C, and III-D become verifiable.
+
+## 7. Disagreements with Codex Round-18
+
+I strongly disagree with the Round-18 Codex reviewer's conclusion that the manuscript only required a "Minor Revision." 
+- Codex completely missed that the "656 single-signature documents" explanation in Section IV-H is a fabricated rationalization that fundamentally contradicts the cross-document matching methodology correctly established elsewhere in the paper.
+- Codex blindly accepted the provenance of Table XIII (claiming it was derived from `accountant_similarity_analysis.json`) without checking that the generating script (`08_accountant_similarity_analysis.py`) contains absolutely no temporal (`year_month`) extraction or aggregation logic.
+- Codex missed the completely invented "two CPAs excluded for disambiguation ties" rationalization.
+- Codex missed the statistical flaw in `21_expanded_validation.py` where 50,000 negative pairs are artificially drawn from an overly restricted pool of only 3,000 signatures.
+
+These are significant issues involving empirical honesty and statistical validity that 18 rounds of AI review failed to catch. A Major Revision is strictly required before submission.
@@ -0,0 +1,45 @@
+# Independent Peer Review (Round 20) - Paper A v3.19.0
+
+## 1. Overall Verdict
+**Accept.** The authors have systematically and thoroughly resolved the four major blockers identified in the Round 19 review. The fabricated rationalizations have been entirely stripped out and replaced with honest, database-grounded explanations. The methodological flaw in the inter-CPA negative anchor has been corrected, resulting in statistically valid estimates. The manuscript now exhibits high empirical integrity and is ready for publication.
+
+## 2. Re-audit of Round-19 Findings
+
+| Round-19 finding | v3.19.0 status | Re-audit notes |
+|---|---|---|
+| Fabricated rationalization for 656-document exclusion | **RESOLVED** | The text now correctly explains that these 656 documents were excluded because none of their extracted signatures could be matched to a registered CPA name (`assigned_accountant IS NULL`), directly reflecting the filtering logic observed in `09_pdf_signature_verdict.py` (L44). |
+| Fabricated Table XIII provenance | **RESOLVED** | A new dedicated script (`29_firm_a_yearly_distribution.py`) has been introduced. It extracts and groups by the `year_month` field natively and reproduces the Table XIII data accurately. Appendix B has been updated accordingly. |
+| Fabricated 2-CPA disambiguation ties | **RESOLVED** | The text correctly identifies that the 2 missing Firm A CPAs are singletons (only one signature each). Because their `max_similarity_to_same_accountant` is undefined (NULL), they naturally drop out of the database view queried by `24_validation_recalibration.py` (L75). |
+| Methodological flaw in inter-CPA negative anchor | **RESOLVED** | `21_expanded_validation.py` was rewritten to uniformly sample 50,000 i.i.d. cross-CPA pairs from the full 168,755 matched corpus. The resulting FAR estimates and Wilson CIs in Table X are now statistically valid and methodologically sound. |
+
+## 3. Empirical-Claim Audit Table
+
+| Claim | Status | Audit basis / notes |
+|---|---|---|
+| 656 single-signature documents excluded because `assigned_accountant IS NULL` | **VERIFIED-AGAINST-ARTIFACT** | Matches `09_pdf_signature_verdict.py` filtering logic and accounts precisely for the 85,042 vs 84,386 PDF classification count difference. |
+| 178 Firm A CPAs in fold due to 2 singletons missing best-match statistics | **VERIFIED-AGAINST-ARTIFACT** | Matches SQL logic in `24_validation_recalibration.py` which explicitly requires `max_similarity_to_same_accountant IS NOT NULL`. |
+| Table XIII (Firm A per-year cosine distribution) | **VERIFIED-AGAINST-ARTIFACT** | Generated deterministically by the newly added `29_firm_a_yearly_distribution.py`. |
+| 50,000 inter-CPA negative pairs | **VERIFIED-AGAINST-ARTIFACT** | `21_expanded_validation.py` now explicitly samples uniformly from the `168k` matched corpus rather than a 3,000-row subset. |
+| Inter-CPA cosine stats (mean 0.763, P95 0.886, P99 0.915, max 0.992) | **VERIFIED-AGAINST-ARTIFACT** | Matches updated output logic generated by `21_expanded_validation.py` and cleanly reported in text. |
+| Table X FAR values (e.g. 0.0008 at 0.945, 0.0005 at 0.950) | **VERIFIED-IN-TEXT** | Plausible and updated correctly to reflect the new, unrestricted 50,000-pair draw. |
+| 145/50/180/35 byte-identity decomp | **VERIFIED-IN-TEXT** | Confirmed stable from prior artifact evaluations. |
+| Cross-firm convergence 42.12% vs 88.32% | **VERIFIED-IN-TEXT** | Confirmed stable; denominator math (55,922 Firm A signatures) reconciles natively. |
+| 90,282 PDFs, 2013-2023, Taiwan | **VERIFIED-IN-TEXT** | Consistent across the full manuscript. |
+| 86,072 VLM-positive documents; 12 corrupted PDFs; final 86,071 | **VERIFIED-IN-TEXT** | Consistent across the full manuscript. |
+| 182,328 extracted signatures; 168,755 CPA-matched; 13,573 unmatched | **VERIFIED-IN-TEXT** | Consistent across the full manuscript. |
+| 758 CPAs, 15 document types, 86.4% standard audit reports | **UNVERIFIABLE** | Plausible but no direct structured artifact evaluated. Acceptable as non-critical context. |
+| Qwen2.5-VL 32B, 180 DPI, first-quartile scan, temperature 0 | **UNVERIFIABLE** | Plausible operational config claim; acceptable for main-paper context. |
+| YOLO metrics (precision, recall, mAP) and 43.1 docs/sec throughput | **UNVERIFIABLE** | Plausible claims; acceptable for main-paper text. |
+| Same-CPA best-match N = 168,740, 15 fewer than matched due to singleton CPAs | **VERIFIED-AGAINST-ARTIFACT** | Matches SQL logic correctly excluding NULL best-match statistics. |
+
+## 4. Methodological Soundness
+Outstanding. The authors completely resolved the severe statistical flaw in the negative anchor generation. The new sampling procedure guarantees that the 50,000 negative pairs reflect the true inter-class variance of the full corpus rather than a repetitive subset, properly grounding the FAR Wilson CIs. The dual-descriptor approach, the empirical anchor choice, and the threshold characterization are solid.
+
+## 5. Narrative Discipline
+Excellent. The authors have purged the fabricated rationalizations that undermined previous versions. By plainly stating the mechanical, database-level realities (e.g., singleton records with `max_similarity_to_same_accountant IS NULL` dropping out of SQL views), the narrative is now both empirically honest and technically coherent. 
+
+## 6. IEEE Access Fit
+The manuscript is an excellent fit for IEEE Access. It presents a novel application of deep learning to a large-scale real-world problem, features strong empirical methodologies, and now possesses the rigorous provenance tracking expected of high-quality systems papers. 
+
+## 7. Specific Actionable Revisions
+None required. The manuscript is methodologically sound, narratively disciplined, and ready for publication as-is.
@@ -0,0 +1,120 @@
+# Independent Peer Review: Paper A (v3.7)
+
+**Target Venue:** IEEE Access (Regular Paper)  
+**Date:** April 21, 2026  
+**Reviewer:** Gemini CLI (6th Round Independent Review)
+
+---
+
+## 1. Overall Verdict
+
+**Verdict: Minor Revision**
+
+**Rationale:**  
+The manuscript presents a methodologically rigorous, highly sophisticated, and large-scale empirical analysis of non-hand-signed auditor signatures. Analyzing over 180,000 signatures from 90,282 audit reports is an impressive feat, and the pipeline architecture combining VLM prescreening, YOLO detection, and ResNet-50 feature extraction is fundamentally sound. The utilization of a "replication-dominated" calibration strategy—validated across both intra-firm consistency metrics and held-out cross-validation folds—represents a significant contribution to document forensics where ground-truth labeling is scarce and expensive. Furthermore, the dual-descriptor approach (using cosine similarity for semantic features and dHash for structural features) effectively resolves the ambiguity between stylistic consistency and mechanical image reproduction. The demotion of the Burgstahler-Dichev / McCrary (BD/McCrary) test to a density-smoothness diagnostic, supported by the new Appendix A, is analytically correct. 
+
+However, approaching this manuscript with a fresh perspective reveals three distinct methodological blind spots that previous review rounds missed. Specifically, the manuscript commits a statistical overclaim regarding the statistical power of the BD/McCrary test at the accountant level, it presents a mathematically tautological False Rejection Rate (FRR) evaluation that borders on reviewer-bait, and it lacks narrative guardrails around its document-level aggregation metrics. Resolving these localized issues will not alter the paper's conclusions but will significantly harden the manuscript against aggressive peer review, making it fully submission-ready for IEEE Access.
+
+---
+
+## 2. Scientific Soundness Audit
+
+### Three-Level Framework Coherence
+The separation of the analysis into signature-level, accountant-level, and auditor-year units is intellectually rigorous and highly defensible. By strictly separating the *pixel-level output quality* (signature level) from the *aggregate behavioral regime* (accountant level), the authors successfully avoid the ecological fallacy of assuming that because an individual practitioner acts in a binary fashion (hand-signing vs. stamping), the aggregate distribution of signature pixels must be neatly bimodal. The evidence compellingly demonstrates that the data forms a continuous quality degradation spectrum at the pixel level.
+
+### Firm A 'Replication-Dominated' Framing
+This is perhaps the strongest conceptual pillar of the paper. Assuming that Firm A acts as a "pure" positive class would inevitably force the thresholding model to interpret the long left tail of the cosine distribution as algorithmic noise or pipeline error. The explicit validation of Firm A as "replication-dominated but not pure"—quantified elegantly by the 139/32 split between high-replication and middle-band clusters in the accountant-level Gaussian Mixture Model (Section IV-E)—logically resolves the 92.5% capture rate without overclaiming. It is a highly defensible stance.
+
+### BD/McCrary Demotion
+Moving the BD/McCrary test from a co-equal threshold estimator to a "density-smoothness diagnostic" is the correct scientific decision. Appendix A empirically demonstrates that the test behaves exactly as one would expect when applied to a large ($N > 60,000$), smooth, heavy-tailed distribution: it detects localized non-linearities caused by histogram binning resolution rather than true mechanistic discontinuities. The theoretical tension is resolved by this demotion.
+
+### Statistical Choices
+The statistical foundations of the paper are appropriate and well-applied:
+*   **Beta/Logit-Gaussian Mixtures:** Fitting Beta mixtures via the EM algorithm is perfectly suited for bounded cosine similarity data $[0,1]$, and the logit-Gaussian cross-check serves as an excellent robustness measure against parametric misspecification.
+*   **Hartigan Dip Test:** The use of the dip test provides a rigorous, non-parametric verification of unimodality/multimodality.
+*   **Wilson Confidence Intervals:** Utilizing Wilson score intervals for the held-out validation metrics (Table XI) correctly models binomial variance, preventing zero-bound confidence interval collapse.
+
+---
+
+## 3. Numerical Consistency Cross-Check
+
+An exhaustive spot-check of the manuscript’s arithmetic, table values, and cited numbers reveals a practically flawless internal consistency. The scripts supporting the pipeline operate exactly as claimed. 
+
+*   **Table VIII:** The reported accountant-level threshold band (KDE antimode: 0.973, Beta-2: 0.979, logit-GMM-2: 0.976) matches the narrative text precisely.
+*   **Table IX:** The proportion of Firm A captures under the dual rule ($54,370 / 60,448 = 89.945\%$) correctly rounds to the reported $89.95\%$.
+*   **Table XI:** The calibration fold's operational dual rule yields $40,335 / 45,116 = 89.402\%$ (reported $89.40\%$), and the held-out fold yields $14,035 / 15,332 = 91.540\%$ (reported $91.54\%$).
+*   **Table XII:** The column sums for $N = 168,740$ match perfectly. Furthermore, the delta column balances precisely to zero ($+2,294 + 6,095 + 119 - 8,508 + 0 = 0$).
+*   **Table XIV:** Top 10% Firm A occupancy is $443 / 462 = 95.88\%$ (reported $95.9\%$), against a baseline of $1,287 / 4,629 = 27.80\%$ (reported $27.8\%$).
+*   **Table XVI:** Firm A's intra-report agreement is correctly calculated as $(26,435 + 734 + 4) / 30,222 = 89.91\%$.
+
+**Minor Narrative Clarification Required:**
+In Table III, total extracted signatures are reported as $182,328$, with $168,755$ successfully matched to CPAs. However, Table V and Table XII utilize $N = 168,740$ signatures for the all-pairs best-match analysis. This delta of $15$ signatures is mathematically implied by CPAs who possess exactly *one* signature in the entire database, rendering a "same-CPA pairwise comparison" impossible. While logically sound to anyone analyzing the pipeline closely, this microscopic $15$-signature discrepancy is exactly the kind of arithmetic artifact that distracts meticulous reviewers. 
+*Recommendation:* Add a one-sentence footnote or parenthetical to Section IV-D explicitly stating this $15$-signature delta is due to single-signature CPAs lacking a pairwise match.
+
+---
+
+## 4. Appendix A Validity
+
+The addition of Appendix A successfully and empirically justifies the main-text demotion of the BD/McCrary test. 
+
+**Strengths:**
+The argument demonstrating that the BD/McCrary transitions drift monotonically with bin width (e.g., Firm A cosine drifting across 0.987 $\rightarrow$ 0.985 $\rightarrow$ 0.980 $\rightarrow$ 0.975) is brilliant. Coupled with the observation that the Z-statistics inflate superlinearly with bin width (from $|Z| \sim 9$ at bin 0.003 to $|Z| \sim 106$ at bin 0.015), the appendix irrefutably proves that the test is interacting with the local curvature of a heavily-populated continuous distribution rather than identifying a discrete, mechanistic boundary. Table A.I is arithmetically consistent with the script's logic.
+
+**Weaknesses:**
+The interpretation paragraph overstates the implications of the accountant-level null finding. It claims that the lack of a transition at the accountant level ($N=686$) is a "robust finding that survives the bin-width sweep." As detailed in Section 6 below, a non-finding surviving a bin-width sweep in a small sample is largely a function of low statistical power, not definitive proof of a smoothly-mixed boundary.
+
+---
+
+## 5. IEEE Access Submission Readiness
+
+The manuscript is in excellent shape for submission to IEEE Access.
+*   **Scope Fit:** High. The paper sits perfectly at the intersection of applied AI, document forensics, and interdisciplinary data science, which is a core demographic for IEEE Access.
+*   **Abstract Length:** The abstract is approximately 234 words, comfortably satisfying the stringent $\leq 250$ word limit requirement.
+*   **Formatting & Structure:** The document adheres to standard IEEE double-column formatting conventions (Roman numeral sections, appropriate table/figure references).
+*   **Anonymization:** Properly handled. Author placeholders, affiliation blocks, and correspondence emails are appropriately bracketed for single-anonymized peer review.
+*   **Desk-Return Risks:** Very low. The inclusion of the ablation study (Table XVIII) and explicit baseline comparisons ensures the paper meets the journal's expectations for methodological validation.
+
+---
+
+## 6. Novel Issues and Methodological Blind Spots
+
+While the previous review rounds improved the manuscript significantly, habituation has allowed three specific narrative and statistical blind spots to persist. These are prime targets for reviewer pushback.
+
+### Issue 1: The Accountant-Level BD/McCrary Null is a Power Artifact, not Proof of Smoothness
+In Section V-B and Appendix A, the authors claim that because the BD/McCrary test yields no significant transition at the accountant level, this "pattern is consistent with a clustered but smoothly mixed accountant-level distribution." Furthermore, Section V-B states that this non-transition is "itself diagnostic of smoothness rather than a failure of the method."
+
+**The Critique:** The McCrary (2008) test relies on local linear regression smoothing. The variance of the estimator scales inversely with $N \cdot h$ (where $h$ is the bin width). With a sample size of only $N=686$ accountants, the test is severely underpowered and lacks the statistical capacity to reject the null of smoothness unless the discontinuity is an absolute, sheer cliff. Asserting that a failure to reject the null affirmatively *proves* the null is true (smoothness) is a fundamental statistical fallacy (Type II error risk). 
+*Impact:* Statistically literate reviewers will immediately flag this as an overclaim. The demotion of the test to a diagnostic is correct, but interpreting the null at $N=686$ as definitive proof of smoothness is flawed.
+
+### Issue 2: Tautological Presentation of FRR and EER (Table X)
+Table X presents a False Rejection Rate (FRR) computed against a "byte-identical" positive anchor. It reports an FRR of $0.000$ for thresholds like 0.95 and 0.973, and subsequently reports an Equal Error Rate (EER) of $\approx 0$ at cosine = 0.990.
+
+**The Critique:** By definition, byte-identical signatures have a cosine similarity asymptotically approaching 1.0 (modulo minor float/cropping artifacts). Evaluating a similarity threshold of 0.95 against inputs that are mathematically defined to score near 1.0 yields a 0% FRR trivially. It is a tautology. While the text in Section V-F attempts to caveat this ("perfect recall against this subset therefore does not generalize"), presenting it as a formal column in Table X with an EER calculation treats it as a standard biometric evaluation. There are no crossing error distributions here to warrant an EER. 
+*Impact:* This is reviewer-bait. Reviewers from the biometric or forensics domains will argue that an EER of 0 is artificially constructed. The true scientific value of Table X is purely the empirical False Acceptance Rate (FAR) derived from the 50,000 inter-CPA negatives. 
+
+### Issue 3: Document-Level Worst-Case Aggregation Narrative
+Section IV-I reports that 35.0% of documents are classified as "High-confidence non-hand-signed" and 43.8% as "Moderate-confidence." This relies on the worst-case rule defined in Section III-L (if one signature on a dual-signed report is stamped, the whole document inherits that label).
+
+**The Critique:** While this "worst-case" aggregation is highly practical for building an operational regulatory auditing tool (flagging the report for review), the narrative in IV-I presents these percentages without reminding the reader that a document might contain a mix of genuine and stamped signatures. Without immediate context, stating that nearly 80% of the market's reports are non-hand-signed invites the ecological fallacy that *both* partners are stamping. 
+*Impact:* A brief narrative safeguard is missing. Section IV-I must briefly cross-reference the intra-report agreement findings (Table XVI) to remind the reader of the composition of these documents, mitigating the risk that the reader misinterprets the document-level severity.
+
+---
+
+## 7. Final Recommendation and v3.8 Action Items
+
+The manuscript is exceptionally strong but requires a few surgical narrative adjustments to remove reviewer-bait and statistical overclaims. I recommend a **Minor Revision** encompassing the following ranked action items.
+
+### BLOCKER (Must Fix for Submission)
+1. **Revise the interpretation of the accountant-level BD/McCrary null.** 
+   *   *Action:* In Section V-B, Section VI (Conclusion), and Appendix A, remove any explicit claims that the null affirmatively proves "smoothly mixed" boundaries. 
+   *   *Replacement Phrasing:* Reframe this finding to acknowledge statistical power. For example: *"We fail to find evidence of a discontinuity at the accountant level. While this is consistent with smoothly mixed clusters, it also reflects the limited statistical power of the BD/McCrary test at smaller sample sizes ($N=686$), reinforcing its role as a diagnostic rather than a definitive estimator."*
+
+### MAJOR (Highly Recommended to Prevent Desk-Reject/Major Revision)
+2. **Reframe Table X to eliminate the tautological FRR/EER presentation.**
+   *   *Action:* Remove the Equal Error Rate (EER) calculation entirely. Add an explicit, prominent table note to Table X stating that FRR is computed against a definitionally extreme subset (byte-identical signatures), making the $0.000$ values an expected mathematical boundary check rather than an empirical discovery of real-world recall. Emphasize that the primary contribution of Table X is the FAR evaluation against the large inter-CPA negative anchor.
+
+### MINOR (Quick Wins for Readability and Precision)
+3. **Contextualize the Document-Level Aggregation (Section IV-I).**
+   *   *Action:* When presenting the 35.0% / 43.8% document-level figures in Section IV-I, explicitly remind the reader of the worst-case aggregation rule. Add a single sentence cross-referencing Table XVI's mixed-report rates to ensure the reader understands the internal composition of these flagged documents.
+4. **Clarify the 15-Signature Delta (Section IV-D / Table XII).**
+   *   *Action:* Add a one-sentence clarification explaining that the delta between the 168,755 CPA-matched signatures (Table III) and the 168,740 signatures analyzed in the all-pairs distributions (Table V/Table XII) consists of CPAs who have exactly one signature in the corpus, making intra-CPA pairwise comparison impossible. This will preempt arithmetic nitpicking by reviewers.
@@ -0,0 +1,68 @@
+# Independent Peer Review: Paper A (v3.8)
+
+**Target Venue:** IEEE Access (Regular Paper)  
+**Date:** April 21, 2026  
+**Reviewer:** Gemini CLI (7th Round Independent Review)
+
+---
+
+## 1. Overall Verdict
+
+**Verdict: Accept**
+
+**Rationale:**  
+The authors have systematically and thoroughly addressed the three critical methodological and narrative blind spots identified in the Round-6 review. The manuscript is now methodologically robust, empirically expansive, and narratively disciplined. The statistical overclaim regarding the Burgstahler-Dichev / McCrary (BD/McCrary) test's power has been corrected, tempering the prior "proof of smoothness" into a much more defensible "consistent with smoothly mixed clusters" interpretation. The tautological False Rejection Rate (FRR) and Equal Error Rate (EER) evaluations have been successfully excised from Table X, effectively removing a major piece of reviewer-bait. Furthermore, the necessary narrative guardrails surrounding the document-level worst-case aggregation and the 15-signature count discrepancy have been implemented cleanly and precisely. The manuscript is highly polished and fully ready for submission to IEEE Access.
+
+---
+
+## 2. Round-6 Follow-Up Audit
+
+In Round 6, three specific issues were flagged for revision. Below is the audit of their resolution in v3.8.
+
+### A. BD/McCrary Power-Artifact Reframe
+**Status: RESOLVED**
+
+The authors have successfully purged the "null proves smoothness" language and accurately reframed the accountant-level BD/McCrary null finding around its limited statistical power.
+*   **Results IV-D.1:** The text now explicitly states that "at $N = 686$ accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness."
+*   **Results IV-E:** The analysis correctly notes that the lack of a transition is "consistent with---though, at $N = 686$, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates."
+*   **Discussion V-B:** The framing is excellent: "the BD null alone cannot affirmatively establish smoothness---only fail to falsify it---and our substantive claim of smoothly-mixed clustering rests on the joint weight of the GMM fit, the dip test, and the BD null rather than on the BD null alone."
+*   **Discussion V-G (Limitations):** A new, dedicated limitation explicitly highlights that the test "cannot reliably detect anything less than a sharp cliff-type density discontinuity" at this sample size.
+*   **Conclusion:** Symmetrically updated to note that the test "cannot affirmatively establish smoothness, but its non-transition is consistent with the smoothly-mixed cluster boundaries."
+*   **Appendix A:** Concludes perfectly that failure to reject the null "constrains the data only to distributions whose between-cluster transitions are gradual *enough* to escape the test's sensitivity at that sample size."
+
+The rewrite is exceptionally clean. It does not feel awkward or bolted-on. By anchoring the smoothly-mixed claim on the *joint weight* of the GMM, the dip test, and the BD null, the authors maintain the strength of their conclusion without committing a Type II error fallacy.
+
+### B. Table X EER/FRR Removal
+**Status: RESOLVED**
+
+The tautological presentation of FRR against the byte-identical positive anchor has been entirely resolved.
+*   **Table X:** The EER row and FRR column have been deleted. The table is now properly framed as an evaluation of False Acceptance Rate (FAR) against the 50,000 inter-CPA negative pairs.
+*   **Table Note:** A clear, unambiguous table note has been added explaining *why* FRR is omitted ("the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$").
+*   **Methodology III-K & Results IV-G.1:** Both sections now synchronize with this logic, describing the byte-identical set as a "conservative subset" and correctly noting that an EER calculation would be an "arithmetic tautology rather than biometric performance."
+
+This change significantly hardens the paper. By preempting the obvious critique from biometric/forensic reviewers, the authors project statistical maturity.
+
+### C. Section IV-I Narrative Safeguard & 15-Signature Footnote
+**Status: RESOLVED**
+
+Both minor narrative omissions have been addressed exactly as requested.
+*   **Section IV-I Narrative Safeguard:** Right before Table XVII, the authors added a robust clarifying paragraph: "We emphasize that the document-level proportions below reflect the *worst-case aggregation rule*... Document-level rates therefore bound the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are." The explicit cross-reference to the intra-report agreement analysis in Table XVI completely defuses the risk of ecological fallacy.
+*   **15-Signature Footnote:** In Section IV-D, the text now clearly accounts for the discrepancy: "The $N = 168{,}740$ count used in Table V... is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed..." This effectively closes the arithmetic loop.
+
+---
+
+## 3. New Findings in v3.8
+
+The rewrites in v3.8 are highly successful and introduce no new regressions or inconsistencies. 
+
+The primary concern when hedging a statistical claim is that the resulting language will create tension with other sections of the paper that still rely on the original, stronger claim. The authors avoided this trap brilliantly. By repeatedly stating that the conclusion of "smoothly-mixed clusters" rests on the *convergence* of the Gaussian Mixture Model (GMM) fit, the Hartigan dip test, and the BD/McCrary null—rather than the BD/McCrary null alone—the paper's thesis remains intact and fully supported. 
+
+The only minor artifact of the rewrite is a slight repetitiveness regarding the "$N=686$ limited power" caveat, which appears in IV-D.1, IV-E, V-B, V-G, the Conclusion, and Appendix A. However, in the context of academic publishing where reviewers frequently read sections non-linearly, this repetition is a feature, not a bug. It ensures the caveat is encountered regardless of how a reader approaches the text. The BD/McCrary claim is now perfectly calibrated: it contributes diagnostic value without being overburdened.
+
+---
+
+## 4. Final Submission Readiness
+
+**v3.8 is fully submission-ready.** 
+
+The manuscript requires no further revisions (a v3.9 is not warranted). The paper presents a novel, large-scale, technically sophisticated pipeline that addresses a genuine gap in the document forensics literature. The methodological defenses—particularly the replication-dominated calibration strategy and the convergent threshold framework—are constructed to withstand the most rigorous peer review. The authors should proceed to submit to IEEE Access immediately.
@@ -0,0 +1,399 @@
+#!/usr/bin/env python3
+"""Paper A v3 markdown / DOCX leak linter.
+
+Runs two pass:
+
+  Source pass — scans the v3 markdown sources for syntax patterns that the
+  python-docx export pipeline does NOT render natively. Each finding is a
+  file:line:severity:message tuple. Severity is ERROR (will leak literal
+  syntax into Word), WARN (sometimes leaks), or INFO (style nits).
+
+  DOCX pass — opens the rendered DOCX and scans every paragraph and table
+  cell for known leak signatures. This is the authoritative check: even
+  if the source pass is clean, the DOCX pass tells you what your partner
+  will actually see. The DOCX pass currently checks for:
+
+    - leftover LaTeX commands (`\\cmd`)
+    - unstripped `$` math delimiters
+    - pandoc footnote markers (`[^name]`)
+    - markdown blockquote markers (lines starting with `> `)
+    - TeX brace tricks (`{=}`, `{,}`)
+    - PUA sentinels (`\\uE000`, `\\uE001`) leaking from the math-region
+      run-splitter
+    - the synthetic table-caption marker `__TABLE_CAPTION__:` if it ever
+      survives processing
+
+Exit code:
+  0  clean
+  1  WARN-level findings only (ship-able after review)
+  2  ERROR-level findings (do NOT ship)
+
+Usage:
+  python3 paper/lint_paper_v3.py           # both passes
+  python3 paper/lint_paper_v3.py --source  # source-side only
+  python3 paper/lint_paper_v3.py --docx    # DOCX-side only
+
+Designed to be run after `python3 export_v3.py` and before copying the
+DOCX to ~/Downloads.
+"""
+
+from __future__ import annotations
+
+import argparse
+import re
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+
+PAPER_DIR = Path(__file__).resolve().parent
+DOCX_PATH = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"
+
+V3_SOURCES = [
+    "paper_a_abstract_v3.md",
+    "paper_a_introduction_v3.md",
+    "paper_a_related_work_v3.md",
+    "paper_a_methodology_v3.md",
+    "paper_a_results_v3.md",
+    "paper_a_discussion_v3.md",
+    "paper_a_conclusion_v3.md",
+    "paper_a_appendix_v3.md",
+    "paper_a_declarations_v3.md",
+    "paper_a_references_v3.md",
+]
+
+
+# ---------------------------------------------------------------------------
+# Finding model + ANSI colour helpers
+# ---------------------------------------------------------------------------
+
+SEVERITY_RANK = {"ERROR": 2, "WARN": 1, "INFO": 0}
+COLOR = {
+    "ERROR": "\033[31m",  # red
+    "WARN":  "\033[33m",  # yellow
+    "INFO":  "\033[36m",  # cyan
+    "RESET": "\033[0m",
+    "BOLD":  "\033[1m",
+}
+
+
+@dataclass
+class Finding:
+    severity: str
+    rule: str
+    location: str  # "file:line" or "DOCX:para 42" / "DOCX:table 6 row 3 col 2"
+    message: str
+    snippet: str = ""
+
+    def render(self, use_color: bool = True) -> str:
+        col = COLOR[self.severity] if use_color else ""
+        rst = COLOR["RESET"] if use_color else ""
+        bold = COLOR["BOLD"] if use_color else ""
+        head = f"{col}[{self.severity}]{rst} {bold}{self.rule}{rst} @ {self.location}"
+        body = f"\n    {self.message}"
+        snip = f"\n    > {self.snippet}" if self.snippet else ""
+        return head + body + snip
+
+
+# ---------------------------------------------------------------------------
+# Source-side rules
+# ---------------------------------------------------------------------------
+
+# Each rule: (pattern, severity, rule_id, message, predicate)
+# predicate(match, line) → bool: returns True to keep the finding (lets us
+# suppress matches that are inside HTML comments or fenced code blocks).
+
+def _outside_table_comment(match: re.Match, line: str, in_comment: bool, in_table: bool) -> bool:
+    """Suppress findings inside HTML comments (where they're allowed) or
+    inside markdown table rows (where they survive intact via add_md_table)."""
+    return not in_comment and not in_table
+
+
+def _always(match: re.Match, line: str, in_comment: bool, in_table: bool) -> bool:
+    return True
+
+
+SOURCE_RULES = [
+    # Pandoc footnote markers — leak as raw text in the DOCX.
+    (re.compile(r"\[\^[A-Za-z0-9_-]+\]"),
+     "ERROR", "pandoc-footnote",
+     "Pandoc-style footnote `[^name]` does not render in DOCX. "
+     "Inline the explanation as a parenthetical instead.",
+     _outside_table_comment),
+
+    # Markdown blockquote `> body` lines — exporter strips them defensively
+    # now, but flag for awareness so authors don't rely on them rendering.
+    (re.compile(r"^>\s"),
+     "WARN", "blockquote",
+     "Markdown blockquote `> ...` is stripped to plain paragraph in DOCX "
+     "(no quote-block formatting). If you intended a callout, use bold "
+     "lead-in instead.",
+     _always),
+
+    # Display-math fences `$$...$$` (only when the line itself starts with
+    # `$$`) — exporter does best-effort linearisation, but the result is
+    # ugly. Inline the equation as plain prose where possible.
+    (re.compile(r"^\$\$.+?\$\$\s*$|^\$\$\s*$"),
+     "WARN", "display-math",
+     "Display math `$$...$$` renders as a best-effort plain-text "
+     "linearisation in DOCX (no MathType/equation rendering). Consider "
+     "replacing with a numbered equation image or inline prose.",
+     _always),
+
+    # Inline math containing `\frac{...{...}...}` — nested braces in a
+    # frac argument are not handled by the exporter's regex.
+    (re.compile(r"\\t?frac\{[^{}]*\{[^{}]*\}[^{}]*\}\{|\\t?frac\{[^{}]+\}\{[^{}]*\{"),
+     "WARN", "nested-frac",
+     "Nested-brace `\\frac{...}{...}` may not linearise cleanly. Verify "
+     "the rendered DOCX paragraph or rewrite the math inline.",
+     _outside_table_comment),
+
+    # Setext-style headers (=== / ---) under a line of text — not handled.
+    (re.compile(r"^=+\s*$|^-{3,}\s*$"),
+     "INFO", "setext-header",
+     "Setext-style header (=== / ---) is not handled by the exporter; "
+     "use ATX (#, ##, ###) instead.",
+     _always),
+
+    # Pandoc fenced div `:::` — not handled.
+    (re.compile(r"^:::"),
+     "ERROR", "pandoc-fenced-div",
+     "Pandoc fenced div `:::` is not handled by the exporter and would "
+     "leak into the DOCX as plain text.",
+     _always),
+
+    # Pandoc bracketed-attribute spans `[text]{.class}` — not handled.
+    (re.compile(r"\][\{][^}]*[\}]"),
+     "WARN", "pandoc-attribute-span",
+     "Pandoc attribute span `[text]{.class}` is not parsed by the exporter "
+     "and the brace block will leak.",
+     _outside_table_comment),
+
+    # File paths in body text — Appendix B is the canonical home for
+    # script→artifact references.
+    (re.compile(r"`signature_analysis/\d+_[a-z_]+\.py`"),
+     "INFO", "script-path-in-body",
+     "Verbose script path in body text. Consider replacing with "
+     "'(reproduction artifact in Appendix B)' for body-prose tightness.",
+     _outside_table_comment),
+
+    # `reports/...json` paths in body text — same rationale.
+    (re.compile(r"`reports/[a-z_]+/[a-z_]+\.(?:json|md)`"),
+     "INFO", "report-path-in-body",
+     "Verbose report-artifact path in body text. Consider replacing with "
+     "'(see Appendix B provenance map)'.",
+     _outside_table_comment),
+
+    # Bare HTML comments that are NOT TABLE/FIGURE markers may indicate
+    # editorial residue. Stripped wholesale by exporter, so harmless, but
+    # worth visibility.
+    (re.compile(r"^<!--\s*$|^<!-- (?!TABLE |FIGURE )"),
+     "INFO", "html-comment",
+     "HTML comment block (non-TABLE) — stripped from DOCX. Keep for "
+     "editorial notes or remove for tidiness.",
+     _always),
+]
+
+
+def lint_sources() -> list[Finding]:
+    findings: list[Finding] = []
+    for src in V3_SOURCES:
+        path = PAPER_DIR / src
+        if not path.exists():
+            continue
+        in_comment = False
+        in_table = False
+        for line_no, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
+            # Track HTML-comment context (multi-line aware).
+            if "<!--" in line:
+                in_comment = True
+            stripped = line.strip()
+            if stripped.startswith("|") and stripped.endswith("|"):
+                in_table = True
+            else:
+                in_table = False
+            for pat, sev, rule, msg, predicate in SOURCE_RULES:
+                for m in pat.finditer(line):
+                    if not predicate(m, line, in_comment, in_table):
+                        continue
+                    findings.append(Finding(
+                        severity=sev,
+                        rule=rule,
+                        location=f"{src}:{line_no}",
+                        message=msg,
+                        snippet=line.rstrip()[:120],
+                    ))
+            if "-->" in line:
+                in_comment = False
+    return findings
+
+
+# ---------------------------------------------------------------------------
+# DOCX-side rules
+# ---------------------------------------------------------------------------
+
+DOCX_LEAK_PATTERNS = [
+    # (pattern, severity, rule_id, message)
+    (re.compile(r"\\[a-zA-Z]+(?:\{[^{}]*\})?"),
+     "ERROR", "leftover-latex-cmd",
+     "LaTeX command `\\cmd` leaked into DOCX. Either add a token rule to "
+     "`latex_to_unicode` in `export_v3.py` or rewrite the source as plain text."),
+
+    (re.compile(r"(?<!\\)\$[^$\s][^$]*\$"),
+     "ERROR", "unstripped-dollar-math",
+     "Inline math `$...$` was not stripped. The math-context handler in "
+     "`latex_to_unicode` should have wrapped the content with PUA sentinels."),
+
+    (re.compile(r"\[\^[A-Za-z0-9_-]+\]"),
+     "ERROR", "pandoc-footnote-leak",
+     "Pandoc footnote marker leaked into DOCX. Inline the footnote body "
+     "as a parenthetical at the source."),
+
+    (re.compile(r"^>\s"),
+     "ERROR", "blockquote-leak",
+     "Markdown blockquote `> ...` leaked literal `>` into DOCX. The "
+     "exporter pre-pass should strip these — check `process_section`."),
+
+    (re.compile(r"\{[,=<>+\-]\}"),
+     "ERROR", "tex-brace-trick",
+     "TeX brace-trick `{=}` / `{,}` leaked. Should be stripped by "
+     "`latex_to_unicode`."),
+
+    (re.compile(r"[]"),
+     "ERROR", "pua-sentinel-leak",
+     "Math-region PUA sentinel (\\uE000 / \\uE001) leaked. A render path "
+     "is bypassing `add_text_with_subsup`; check headings / list items / "
+     "title-page paragraphs."),
+
+    (re.compile(r"__TABLE_CAPTION__"),
+     "ERROR", "table-caption-marker-leak",
+     "Synthetic `__TABLE_CAPTION__:` marker leaked. The marker is meant "
+     "to be consumed by `process_section` and rendered as a centered "
+     "bold caption paragraph."),
+
+    (re.compile(r"signature[a-z]+analysis/\d+[a-z_]+\.py"),
+     "ERROR", "underscore-eaten-path",
+     "Underscores eaten from a script path (e.g., "
+     "`signatureanalysis/28byteidentitydecomposition.py`). The "
+     "math-context-scoped subscript handler in `add_text_with_subsup` "
+     "should leave underscores intact in plain text."),
+
+    (re.compile(r"\b(\w+_\w+)+\b", flags=re.UNICODE),
+     "INFO", "underscore-identifier",
+     "Underscored identifier in body text (e.g., a code symbol or path). "
+     "Verify it renders with underscores intact, not as subscripts."),
+]
+
+
+def lint_docx(docx_path: Path = DOCX_PATH) -> list[Finding]:
+    try:
+        from docx import Document
+    except ImportError:
+        return [Finding("ERROR", "missing-dep",
+                        "lint:docx",
+                        "python-docx is not installed; cannot run DOCX pass.")]
+
+    if not docx_path.exists():
+        return [Finding("ERROR", "missing-docx",
+                        str(docx_path),
+                        "Built DOCX not found. Run `python3 export_v3.py` first.")]
+
+    doc = Document(str(docx_path))
+    findings: list[Finding] = []
+    seen_signatures = set()  # dedupe identical leaks across paragraphs
+
+    def scan(text: str, location: str):
+        for pat, sev, rule, msg in DOCX_LEAK_PATTERNS:
+            for m in pat.finditer(text):
+                # Skip the INFO-level identifier rule unless it looks like
+                # an obvious math residue (e.g., dHash_indep or N_a).
+                if rule == "underscore-identifier":
+                    sample = m.group(0)
+                    # Only complain about identifiers that look like math
+                    # residue: short, underscore-separated single-char tokens.
+                    parts = sample.split("_")
+                    if not all(len(p) <= 4 for p in parts):
+                        continue
+                    if not all(p.isalnum() and not p.isdigit() for p in parts):
+                        continue
+                key = (rule, m.group(0))
+                if key in seen_signatures:
+                    continue
+                seen_signatures.add(key)
+                findings.append(Finding(
+                    severity=sev,
+                    rule=rule,
+                    location=location,
+                    message=msg,
+                    snippet=text[max(0, m.start() - 30):m.end() + 30].replace("\n", " ")[:140],
+                ))
+
+    for i, p in enumerate(doc.paragraphs):
+        if p.text:
+            scan(p.text, f"DOCX:para {i}")
+    for ti, t in enumerate(doc.tables):
+        for ri, row in enumerate(t.rows):
+            for ci, cell in enumerate(row.cells):
+                if cell.text:
+                    scan(cell.text, f"DOCX:table {ti + 1} row {ri} col {ci}")
+
+    return findings
+
+
+# ---------------------------------------------------------------------------
+# Reporter
+# ---------------------------------------------------------------------------
+
+def summarise(findings: list[Finding], use_color: bool = True) -> int:
+    def c(key: str) -> str:
+        return COLOR[key] if use_color else ""
+
+    if not findings:
+        print(f"{c('BOLD')}{c('INFO')}clean — no leaks detected{c('RESET')}")
+        return 0
+    counts = {"ERROR": 0, "WARN": 0, "INFO": 0}
+    findings.sort(key=lambda f: (-SEVERITY_RANK[f.severity], f.location))
+    for f in findings:
+        counts[f.severity] += 1
+        print(f.render(use_color))
+        print()
+    print(f"{c('BOLD')}summary{c('RESET')}: "
+          f"{c('ERROR')}{counts['ERROR']} ERROR{c('RESET')}  "
+          f"{c('WARN')}{counts['WARN']} WARN{c('RESET')}  "
+          f"{c('INFO')}{counts['INFO']} INFO{c('RESET')}")
+    if counts["ERROR"]:
+        return 2
+    if counts["WARN"]:
+        return 1
+    return 0
+
+
+def main():
+    ap = argparse.ArgumentParser(
+        description="Lint Paper A v3 markdown sources and rendered DOCX for "
+                    "syntax-leak issues.",
+    )
+    ap.add_argument("--source", action="store_true",
+                    help="run only the markdown source pass")
+    ap.add_argument("--docx", action="store_true",
+                    help="run only the rendered DOCX pass")
+    ap.add_argument("--no-color", action="store_true",
+                    help="disable ANSI colour output")
+    args = ap.parse_args()
+
+    use_color = sys.stdout.isatty() and not args.no_color
+    findings: list[Finding] = []
+    if args.source or not (args.source or args.docx):
+        print(f"{COLOR['BOLD'] if use_color else ''}--- source pass "
+              f"({len(V3_SOURCES)} files) ---{COLOR['RESET'] if use_color else ''}")
+        findings.extend(lint_sources())
+    if args.docx or not (args.source or args.docx):
+        print(f"{COLOR['BOLD'] if use_color else ''}\n--- docx pass "
+              f"({DOCX_PATH.name}) ---{COLOR['RESET'] if use_color else ''}")
+        findings.extend(lint_docx())
+
+    print()
+    sys.exit(summarise(findings, use_color))
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,246 @@
+# Paper A v3.9 — Final Independent Peer Review (Opus 4.7)
+
+**Reviewer:** Claude Opus 4.7 (1M context), independent round 9
+**Date:** 2026-04-21
+**Commit reviewed:** 85cfefe
+**Target venue:** IEEE Access (Regular Paper)
+**Prior rounds reviewed:** codex v3.3 / v3.4 / v3.5 / v3.8 (Minor Revision each), Gemini v3.7 (Accept), Gemini v3.8 (Accept), codex v3.8 (Minor Revision)
+
+---
+
+## 1. Overall verdict
+
+**Minor Revision.** I dissent from the Gemini-3.1-Pro round-7 Accept verdict and align with codex round-8's Minor judgment, but for a *different* set of issues that both codex and Gemini missed. The v3.9 edits to Table XV and to the two explicit cross-reference breakages did land cleanly and close codex's round-8 findings. However, in the same revision cycle the paper accumulated an **internally contradicted BD/McCrary accountant-level claim**: multiple locations in the main text (Section IV-D.1, Section IV-E Table VIII note, Section V-B, Conclusion) assert flatly that BD/McCrary "does not produce a significant transition" at the accountant level and that the null "persists across the Appendix-A bin-width sweep," yet Appendix A Table A.I itself documents (i) an accountant-level cosine transition at bin-width 0.005 with $z_{\text{below}}=-3.23$, $z_{\text{above}}=+5.18$ (clearly |Z|>1.96) and (ii) an accountant-level dHash transition at bin-width 1.0 with $z_{\text{below}}=-2.00$, $z_{\text{above}}=+3.24$. Appendix A acknowledges the latter marginally; the main text denies both. The substantive argument of the paper (smoothly-mixed accountant aggregates) is *not* threatened because (a) the transition at bin 0.005 is outside the convergence band anyway and (b) the dHash transition is exactly at the |Z|=1.96 boundary, but the **paper-to-appendix internal contradiction is a reviewer-facing red flag that a competent accountant-statistics reviewer will catch instantly**. This must be fixed before submission. All other issues I found are clean cosmetic/clarity items. The paper is otherwise ready.
+
+---
+
+## 2. v3.8 → v3.9 delta verification
+
+I re-verified both round-8 fixes against their authoritative sources.
+
+**Fix 1: Table XV per-year Firm A baseline-share column.** Verified directly against `reports/partner_ranking/partner_ranking_report.md` (generated 2026-04-21 01:55:27, paper commit same day). All 11 yearly values match exactly: 2013 32.4%, 2014 27.8%, 2015 27.7%, 2016 26.2%, 2017 27.2%, 2018 26.5%, 2019 27.0%, 2020 27.7%, 2021 28.7%, 2022 28.3%, 2023 27.4%. The fix is complete and correct. Codex's numerical-impossibility argument (97/324 floor = 29.9% > prior 26.2%) no longer applies. (results_v3.md lines 331–341)
+
+**Fix 2: Cross-reference corrections.**
+* "Section IV-F" → "Section IV-J" for the ablation study: methodology_v3.md line 87 correctly reads `(Section IV-J)`, and results_v3.md line 412 defines `## J. Ablation Study: Feature Backbone Comparison`. Verified.
+* Table XVIII note "Tables IV/VI" → "Table XIII": results_v3.md lines 429–432 now refer to Table XIII for the best-match mean comparison. Verified.
+
+**No regressions detected in the v3.8→v3.9 edits themselves.** I re-validated the full section/sub-section reference map (III-A…III-M, IV-A…IV-J, IV-D.1/2, IV-G.1/2/3/4, IV-H.1/2/3, IV-I.1/2, V-A…V-G, VI) and every textual `Section X-Y(.Z)` reference resolves to an existing target. All 41 references [1]–[41] are cited in the body.
+
+---
+
+## 3. Numerical audit findings (spot-check against scripts)
+
+I verified 19 numerical claims against authoritative reports under `reports/`. All pass.
+
+| # | Paper claim | Source | Verified |
+|---|-------------|--------|----------|
+| 1 | Table IX whole-Firm-A cos>0.837 = 99.93% (60,408/60,448) | validation_recalibration.json whole_firm_a | ✓ |
+| 2 | Table IX cos>0.9407 = 95.15% (57,518/60,448) | same | ✓ (57518/60448=95.1529%) |
+| 3 | Table IX cos>0.95 = 92.51% (55,922/60,448) | same | ✓ |
+| 4 | Table IX cos>0.973 = 79.45% (48,028/60,448) | same | ✓ |
+| 5 | Table IX dual cos>0.95 AND dh≤8 = 89.95% (54,370/60,448) | same | ✓ |
+| 6 | Table XI calib cos>0.9407 = 94.99%, z=-3.19, p=0.0014 | validation_recalibration.json generalization_tests | ✓ |
+| 7 | Table XI held-out cos>0.9407 = 95.63% (14,662/15,332) | same | ✓ (rate 0.9563) |
+| 8 | Table V Firm A cos dip=0.0019, p=0.169 | dip_test_report.md | ✓ |
+| 9 | Table V Firm A dHash dip=0.1051, p<0.001 | same | ✓ |
+| 10 | Table V all-CPA 168,740 cos dip=0.0035 | same | ✓ |
+| 11 | Table VIII accountant KDE antimode cos=0.973 | accountant_three_methods_report.md | ✓ (0.9726) |
+| 12 | Table VIII accountant Beta-2 cos=0.979 | same | ✓ (0.9788) |
+| 13 | Table VIII accountant logit-GMM cos=0.976 | same | ✓ (0.9759) |
+| 14 | Table VIII accountant 2D-GMM marginal cos=0.945 | same | ✓ (0.9450) |
+| 15 | Table X FAR at 0.837=0.2062, CI [0.2027, 0.2098] | expanded_validation_report.md | ✓ |
+| 16 | Table X FAR at 0.973=0.0003 | same | ✓ |
+| 17 | Table XIV Firm A baseline 27.8% (1287/4629) | partner_ranking_report.md | ✓ |
+| 18 | 3.5× top-10% concentration ratio (95.9/27.8) | arithmetic | ✓ (3.45→3.5×) |
+| 19 | Table XVI Firm A intra-report 89.91% agreement | (26435+734+0+4)/30222 | ✓ (89.91%) |
+
+**Minor numerical imprecision (cosmetic, not blocker).** Results §IV-I.1 says "The absence of any meaningful 'likely hand-signed' rate (4 of 30,000+ Firm A documents, 0.01%) implies…" The true value is 4/30,226 = **0.013%**. Rounding 0.013% to "0.01%" is unusual; "0.013%" or "~0.01%" would be more accurate. (results_v3.md line 404)
+
+**Subtle inconsistency between two scripts (NOT paper's fault, flag-only).** `expanded_validation_report.md` records held-out `cos>0.9407` as k=14,664 (95.64%), while `validation_recalibration.json` records k=14,662 (95.63%). The paper cites the latter (authoritative), so the paper is internally self-consistent. The drift is in the underlying Script 22/24 pair and may be worth reconciling in the reproducibility package (the paper names only Script 24 in its captions, which is correct).
+
+---
+
+## 4. Cross-reference audit findings
+
+I enumerated every `Section X-Y(.Z)` and `Table [roman]` reference in the submission files and checked resolution.
+
+* All 32 distinct section references resolve. No dangling targets.
+* All 18 tables (I–XVIII plus A.I) defined are used at least once **except** Table XII, which is defined (results §IV-G.3) but the only textual mentions of "Table XII" are in the aggregation sentence at results line 59 ("downstream all-pairs analyses (Tables XII, XVIII)"), not at the point where Table XII is first presented.
+  * **Issue (MINOR):** results_v3.md §IV-G.3 (lines 245–268) introduces Table XII as "the Classifier Sensitivity … table" without any in-text `Table XII` numeral reference. A reader looking for the anchor will find it only in the earlier cross-reference at line 59, which is confusing. Add an explicit "Table XII reports …" or "… (Table XII) …" at line 252. This is exactly the sort of orphaned-table issue that IEEE Access copyediting catches.
+
+* **Issue (MINOR clarity — not broken, but misleading):** results_v3.md line 59 characterises Tables XII and XVIII as "downstream all-pairs analyses" that share the 168,740 count. Table XII is the per-signature classifier output (168,740) — not all-pairs — and Table XVIII's all-pairs intra-class stats are over 41.35M all-CPA pairs or 16M Firm-A-only pairs, not 168,740. The 15-signature exclusion described in line 59 does affect the 168,740 signature set (which is the unit in Tables V, XII, and Firm-A rows of XIII), but labelling them "all-pairs analyses" is a misnomer. Recommend: replace "(Tables XII, XVIII)" with "(Tables V, XII, and the Firm-A per-signature statistics of Tables XIII and XVIII)" or simply "(all same-CPA per-signature best-match analyses)".
+
+* Figures 1–4 are referenced; captions are elsewhere in the export pipeline and I did not audit PNG files. No textual figure-reference is broken.
+
+---
+
+## 5. Arithmetic audit findings
+
+I recomputed every `X%`, `k of N`, `k/n` and ratio I could find. Results:
+
+| Claim | Computed | Paper | Status |
+|-------|----------|-------|--------|
+| 182,328 / 86,071 docs avg | 2.118 | — | — |
+| 182,328 / 85,042 with-detections | 2.144 | "2.14 sigs/doc" | ✓ (docs-with-detections denominator) |
+| 85,042 / 86,071 | 98.80% | "98.8%" | ✓ |
+| 168,755 / 182,328 | 92.55% | "92.6%" | ✓ |
+| 85,042 − 84,386 | 656 | "656 documents" | ✓ |
+| 29,529 + 36,994 + 5,133 + 12,683 + 47 | 84,386 | ✓ | ✓ |
+| 29,529 / 84,386 | 35.00% | "35.0%" | ✓ |
+| 22,970 / 30,226 | 75.99% | "76.0%" | ✓ |
+| (22,970+6,311) / 30,226 | 96.87% | "96.9%" | ✓ |
+| 26,435 / 30,222 | 87.47% | "87.5%" | ✓ |
+| (26,435+734+0+4) / 30,222 | 89.91% | "89.91%" | ✓ |
+| 4 / 30,226 | 0.0132% | "0.01%" | **△ should be 0.013%** |
+| 141 + 361 + 184 | 686 | GMM total | ✓ |
+| 0.21 + 0.51 + 0.28 | 1.00 | GMM weights | ✓ |
+| 139 / 171 | 81.3% | "81%" | ✓ |
+| 32 / 171 | 18.7% | "19%" (§V-C) | ✓ |
+| 29,529 / 71,656 | 41.21% | "41.2%" | ✓ |
+| 36,994 / 71,656 | 51.63% | "51.7%" | ✓ |
+| 5,133 / 71,656 | 7.16% | "7.2%" | ✓ |
+| 95.9 / 27.8 | 3.45 | "3.5×" | ✓ |
+| 90.1 / 27.8 | 3.24 | "3.2×" | ✓ |
+| 139+32 = 171; 141-139 | 2 | non-Firm-A in C1 | ✓ |
+| cos>0.95: 92.51%, below: 7.49% | "92.5% / 7.5%" | ✓ | ✓ |
+| Abstract word count | 244 | ≤250 | ✓ |
+
+**One non-blocking integrity note.** Intro line 54: "92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below". This is the *whole-sample* Firm A rate (55,922/60,448 = 92.51%). Methodology §III-H line 147 and §V-C line 42 reuse the same 92.5% / 7.5% split. **Consistent** across locations.
+
+---
+
+## 6. Narrative / consistency findings
+
+### 6.1 BD/McCrary accountant-level claim — **main-text vs Appendix A contradiction (MAJOR)**
+
+This is the principal finding of my round. Three locations in the main text state or imply that BD/McCrary produces *no* significant accountant-level transition and that this null persists across the bin-width sweep:
+
+1. **results_v3.md §IV-D.1, lines 85–86:** "At the accountant level the test does not produce a significant transition in either the cosine-mean or the dHash-mean distribution, and this null persists across the Appendix-A bin-width sweep."
+
+2. **results_v3.md §IV-E Table VIII row (line 145):** `| Accountant-level, BD/McCrary transition (diagnostic; null across Appendix A) | no transition | no transition |`
+
+3. **results_v3.md §IV-E line 130, line 152; discussion_v3.md §V-B line 27; conclusion_v3.md line 16:** variants of "BD/McCrary finds no significant transition at the accountant level".
+
+But `reports/bd_sensitivity/bd_sensitivity.md` (and Appendix A Table A.I lines 23–28) actually report:
+
+* Accountant cosine bin 0.005: transition at 0.9800 with $z_{\text{below}}=-3.23$, $z_{\text{above}}=+5.18$ — **both exceed |1.96|, 1 significant transition.**
+* Accountant cosine bin 0.002: no transition; bin 0.010: no transition.
+* Accountant dHash bin 1.0: transition at 3.0 with $z_{\text{below}}=-2.00$, $z_{\text{above}}=+3.24$ — **|Z|=2.00 just above critical, 1 marginal transition.**
+* Accountant dHash bin 0.2: no transition; bin 0.5: no transition.
+
+Appendix A itself (line 36) acknowledges the dHash marginal transition ("the one marginal transition it does produce … sits exactly at the critical value for α = 0.05") but is **silent about the bin-0.005 cosine transition at 0.980**, even though the $|Z|$ values ($-3.23$ / $+5.18$) are well past the 1.96 cutoff and the accountant-level cosine convergence band the paper anchors its primary threshold to is $[0.973, 0.979]$ — i.e., the BD/McCrary transition at 0.980 sits **directly at the upper edge of that convergence band**, not outside it.
+
+**Substantive implication.** The paper's "smoothly-mixed cluster" narrative is not falsified by this — two of three cosine bin widths and two of three dHash bin widths do produce no transition, and one can still argue the pattern is "largely absent." But the paper currently claims something stronger than the data supports, namely that the null is unqualified at the accountant level. A reviewer who reads Appendix A Table A.I against Section IV-D.1 will see the contradiction within 30 seconds.
+
+**Fix.** Either (a) soften the main-text language to "the BD/McCrary accountant-level test rejects the smoothness null in only one of three cosine bin widths and one of three dHash bin widths; the pattern is largely but not uniformly null" (matching Appendix A's own hedging), or (b) additionally note in Appendix A the bin-0.005 cosine transition and explain why it does not disturb the substantive reading (e.g., sits at the band edge, $Z$ inflates with bin width as documented, consistent with a mild histogram-resolution artifact). Option (b) is stronger. **Either way the four locations in §IV-D.1 / Table VIII / §IV-E / §V-B / conclusion must be brought into alignment with Appendix A.**
+
+### 6.2 Related Work line 67 — stale BD/McCrary framing (MINOR)
+
+related_work_v3.md line 67: "The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions."
+
+The rest of the paper (Methodology §III-I.3, Results §IV-D.1, Appendix A) has **demoted** BD/McCrary from a threshold estimator to a density-smoothness diagnostic precisely because it does *not* cleanly detect that boundary (transitions sit inside the non-hand-signed mode, not between modes). Related Work's enthusiastic framing is residue from the v3.6-and-earlier framing and should be softened to something like "BD/McCrary provides a local-density-discontinuity diagnostic that is informative about distributional smoothness under minimal assumptions." This is a related-work-intent question only; the downstream text handles the nuance correctly.
+
+### 6.3 "0.01%" vs "0.013%" (MINOR)
+
+results_v3.md §IV-I.1 line 404: "4 of 30,000+ Firm A documents, 0.01%". True value 0.013%; reviewers who recompute will flag. Replace with "0.013%" or "roughly 0.01%".
+
+### 6.4 No substantive abstract-vs-body contradictions detected
+
+I cross-checked the abstract's quantitative claims (threshold convergence within ∼0.006 at cosine ≈0.975, FAR ≤ 0.001 at accountant-level thresholds, 310 byte-identical positives, ∼50,000-pair inter-CPA negative anchor, 182,328 signatures / 90,282 reports / 758 CPAs / 2013–2023) against the body and all match.
+
+### 6.5 No terminology drift detected
+
+`dHash` / `dHash_indep` / `independent minimum dHash` are defined in §III-G and used consistently; the operational classifier §III-L is explicit that it uses the independent-minimum variant; Tables IX/XI/XII/XVI all use that variant. Previous reviewers correctly flagged this; v3.9 is clean.
+
+---
+
+## 7. Novel issues no prior reviewer caught
+
+Beyond item **6.1 (BD/McCrary main-vs-appendix contradiction)**, which is the primary novel finding, I identified:
+
+### 7.1 Orphaned Table XII first reference
+
+Table XII is defined inside §IV-G.3 (results line 252) but the sub-section opens at line 245 without an in-text `Table XII` reference. The only textual `Table XII` string in the paper is in the line-59 aggregation sentence. A first-reader following the narrative has no numeric pointer to the table at the point of presentation. No prior reviewer flagged this. Fix: insert "Table XII presents the five-way output under each cut." before line 252 `<!-- TABLE XII: ... -->` comment, or similar.
+
+### 7.2 Section IV-E wording ambiguity around "the two-component GMM"
+
+results_v3.md line 131: "For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine = 0.945 and dHash = 8.10".
+
+This is ambiguous because §IV-E has *already* selected $K^*=3$ on BIC at line 103. The 2-component 2D fit here is an additional, separately-fit 2-comp 2D GMM reported for cross-check only. A reader can reasonably wonder whether this is the same fit at $K=3$ (it is not) or a parallel $K=2$ fit used only for the marginal crossings (it is). Fix: replace "the two-dimensional two-component GMM" with "a separately fit two-component 2D GMM (reported for cross-check of the 1D accountant-level crossings)".
+
+### 7.3 Subtle overclaim in `Methodology §III-H line 156`
+
+methodology_v3.md line 156: "We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the visual inspection, the accountant-level mixture, the three complementary analyses above, and the held-out Firm A fold described in Section III-K."
+
+However, as results §IV-G.2 cautions, the 70/30 held-out fold's operational rules differ between folds by 1–5 pp with $p<0.001$. The held-out fold therefore confirms the *qualitative* replication-dominated framing but does **not** provide clean quantitative validation. Calling it part of "the validation role" is slightly stronger than the results section is willing to say. Fix: replace "held-out Firm A fold" with "held-out Firm A fold (which confirms the qualitative replication-dominated framing; fold-level rate differences are disclosed in Section IV-G.2)".
+
+### 7.4 Abstract's "visual inspection and accountant-level mixture evidence"
+
+abstract_v3.md line 5: "… visual inspection and accountant-level mixture evidence supporting majority non-hand-signing and a minority of hand-signers". This omits the partner-level ranking analysis (§IV-H.2), which is the only **threshold-free** piece of evidence and is the strongest of the four. Including it in the one-sentence evidence summary would sharpen the abstract. Non-blocking: the abstract is already at 244/250 words.
+
+### 7.5 `Section III-I.4` never referenced
+
+methodology_v3.md defines subsections III-I.1 (KDE), III-I.2 (Beta mixture EM), III-I.3 (BD/McCrary), III-I.4 (Convergent Validation), III-I.5 (Accountant-Level Application). Only III-I.3 and III-I.5 are referenced in text. III-I.4's substantive content (level-shift framing) is summarised in §IV-E and §V-B; the standalone subsection could be folded into III-I.5 or III-I.1, or a forward-reference could be added. Non-blocking, but IEEE Access copyediting may flag a subsection with no cross-reference.
+
+### 7.6 BD/McCrary-as-threshold-estimator trace in Conclusion
+
+conclusion_v3.md line 14: "Third, we introduced a convergent threshold framework combining two methodologically distinct estimators … together with a Burgstahler-Dichev / McCrary density-smoothness diagnostic."
+
+This is fine — diagnostic, not estimator — and matches methodology §III-I.3 framing. But it contrasts with introduction_v3.md line 43–44 which still reads "(5) threshold determination using two methodologically distinct estimators … complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic …". Self-consistent. I verified there is no stale "three-method threshold" residue. v3.9 is clean on this.
+
+---
+
+## 8. Final recommendation — v3.10 action items
+
+### BLOCKER (must fix before submission)
+
+**B1. BD/McCrary accountant-level claim contradicts Appendix A.** (See §6.1.)
+* File: `paper_a_results_v3.md`, §IV-D.1, lines 85–86.
+* Change: "At the accountant level the test does not produce a significant transition in either the cosine-mean or the dHash-mean distribution, and this null persists across the Appendix-A bin-width sweep."
+* Replace with: "At the accountant level the BD/McCrary null is not rejected in two of three cosine bin widths (0.002 and 0.010) and two of three dHash bin widths (0.2 and 0.5); the one cosine transition (at bin width 0.005) sits at cosine 0.980 — at the upper edge of the convergence band of our two threshold estimators (Section IV-E) — and the one dHash transition (at bin width 1.0) has $|Z|$ at the 1.96 critical value. We read this pattern as *largely* null and report it as consistent with, rather than affirmative proof of, clustered-but-smoothly-mixed accountant-level aggregates (Appendix A)."
+* File: `paper_a_results_v3.md`, §IV-E Table VIII row (line 145). Change `null across Appendix A` to `largely null; 1/3 cos and 1/3 dHash bin widths exhibit a marginal transition (Appendix A)`.
+* File: `paper_a_discussion_v3.md` §V-B line 27 and `paper_a_conclusion_v3.md` line 16 — apply matching softening.
+
+### MAJOR (strongly recommended before submission)
+
+**M1. Related Work BD/McCrary framing stale.** (See §6.2.)
+* File: `paper_a_related_work_v3.md` line 67.
+* Soften "is well suited to detecting the boundary between two generative mechanisms" to "provides a local-density-discontinuity diagnostic that is informative about distributional smoothness".
+
+**M2. Orphaned Table XII first reference.** (See §7.1.)
+* File: `paper_a_results_v3.md` line 252, immediately before the `<!-- TABLE XII: … -->` comment.
+* Insert: "Table XII reports the five-way classifier output under both operational cuts."
+
+### MINOR (nice-to-have)
+
+**m1.** results_v3.md line 404: replace "0.01%" with "0.013%".
+
+**m2.** results_v3.md line 131: replace "the two-dimensional two-component GMM" with "a separately fit two-component 2D GMM (reported for cross-check of the 1D accountant-level crossings)".
+
+**m3.** results_v3.md line 59: replace "(Tables XII, XVIII)" with "(all same-CPA per-signature best-match analyses, including Tables V, XII, and XVIII)" to remove the "all-pairs" misnomer.
+
+**m4.** methodology_v3.md line 156: replace "the held-out Firm A fold described in Section III-K" with "the held-out Firm A fold (which confirms the qualitative replication-dominated framing; fold-level rate differences are disclosed in Section IV-G.2)".
+
+**m5.** abstract_v3.md (optional, non-blocking): consider inserting "the threshold-free partner-ranking analysis," before "and a minority of hand-signers" if word budget allows.
+
+**m6.** methodology_v3.md §III-I.4 never cross-referenced (§7.5). Either add one forward reference or fold into §III-I.1/5. Non-blocking.
+
+### Submission-readiness summary
+
+With **B1** addressed the paper is submission-ready. **M1** and **M2** are strongly recommended but would not by themselves be grounds for rejection. All **m1–m6** items are cosmetic.
+
+### IEEE Access compliance check
+
+* Abstract word count: 244 / 250 ✓
+* Impact statement correctly removed from submission via export_v3.py SECTIONS list ✓
+* Single-anonymized: "Firm A / B / C / D" pseudonyms used consistently, residual identifiability disclosed (methodology §III-M) ✓
+* Reference formatting: IEEE numbered, sequential by first appearance, 41 entries, all cited ✓
+* No author/institution information in v3 section files ✓
+* Figures 1–4 referenced; Table A.I defined in appendix with consistent IEEE prefix ✓
+* Appendix A correctly titled "Appendix A. BD/McCrary Bin-Width Sensitivity" and appears after Conclusion in the assembly order ✓
+
+**Reviewer's bottom line.** The paper is well-crafted, numerically rigorous, and has survived eight prior review rounds. v3.9 closed both codex round-8 items cleanly. The one residual issue I identified (**B1**) is a paper-vs-appendix contradiction that any careful round-10 reviewer will catch. It is fixable in 20 minutes by softening four sentences. After that fix the paper is ready for IEEE Access submission.
+
+---
+
+*End of review.*
@@ -1,16 +1,7 @@
 # Abstract

-<!-- 200-270 words -->
+<!-- IEEE Access target: <= 250 words, single paragraph -->

-Regulations in many jurisdictions require Certified Public Accountants (CPAs) to attest to each audit report they certify, typically by affixing a signature or seal.
-However, the digitization of financial reporting makes it straightforward to reuse a stored signature image across multiple reports---whether by administrative stamping or firm-level electronic signing systems---potentially undermining the intent of individualized attestation.
-Unlike signature forgery, where an impostor imitates another person's handwriting, *non-hand-signed* reproduction involves the legitimate signer's own stored signature image being reproduced on each report, a practice that is visually invisible to report users and infeasible to audit at scale through manual inspection.
-We present an end-to-end AI pipeline that automatically detects non-hand-signed auditor signatures in financial audit reports.
-The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash).
-For threshold determination we apply three methodologically distinct methods---Kernel Density antimode with a Hartigan unimodality test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level.
-Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum that no two-component mixture cleanly separates, while accountant-level aggregates are clustered into three recognizable groups (BIC-best $K = 3$) with the KDE antimode and the two mixture-based estimators converging within $\sim$0.006 of each other at cosine $\approx 0.975$; the Burgstahler-Dichev / McCrary test produces no significant discontinuity at the accountant level, consistent with clustered-but-smooth rather than sharply discrete accountant-level heterogeneity.
-A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with visual-inspection and accountant-level mixture evidence supporting majority non-hand-signing and a minority of hand-signers; we break the circularity of using the same firm for calibration and validation by a 70/30 CPA-level held-out fold.
-Validation against 310 byte-identical positive signatures and a $\sim$50,000-pair inter-CPA negative anchor yields FAR $\leq$ 0.001 with Wilson 95% confidence intervals at all accountant-level thresholds.
-To our knowledge, this represents the largest-scale forensic analysis of auditor signature authenticity reported in the literature.
+Regulations require Certified Public Accountants (CPAs) to attest to each audit report by affixing a signature, but digitization makes reusing a stored signature image across reports---through administrative stamping or firm-level electronic signing---technically trivial and visually invisible to report users, undermining individualized attestation. We build an end-to-end pipeline that detects such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, a YOLOv11 detector localizes signature regions, ResNet-50 supplies deep features, and a dual-descriptor verification layer combines deep-feature cosine similarity with perceptual hashing (difference hash, dHash) to separate *style consistency* (high cosine, divergent dHash) from *image reproduction* (high cosine, low dHash). The operational classifier outputs a five-way verdict per signature with a worst-case document-level aggregation; the cosine cut is anchored on a transparent whole-sample Firm A P7.5 percentile (cos $> 0.95$), and the dHash cuts on the same reference. Applied to 90,282 audit reports filed in Taiwan over 2013-2023 (182,328 signatures from 758 CPAs), the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ captures 92.46\% of Firm A and yields FAR = 0.0005 against a $\sim$50,000-pair inter-CPA negative anchor; intra-report agreement is 89.9\% at Firm A versus 62-67\% at the other Big-4 firms (a 23-28 percentage-point cross-firm gap). Validation uses three annotation-free anchors (310 byte-identical positives, $\sim$50,000 inter-CPA negatives, and a 70/30 held-out Firm A fold) reported with Wilson 95\% intervals. Three statistical diagnostics applied to the per-signature similarity distribution (Hartigan dip test, EM-fitted Beta mixture with logit-Gaussian robustness check, Burgstahler-Dichev / McCrary density-smoothness procedure) jointly characterise the distribution as a continuous quality spectrum, which motivates the percentile-based anchor and is itself a substantive finding for similarity-threshold selection in document forensics.

-<!-- Word count: ~290 -->
+<!-- Target word count: 240 -->
@@ -0,0 +1,64 @@
+# Appendix A. BD/McCrary Bin-Width Sensitivity (Signature Level)
+
+The main text (Section III-I, Section IV-D.2) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as a threshold estimator.
+This appendix documents the empirical basis for that framing by sweeping the bin width across four (variant, bin-width) panels: Firm A and full-sample, each in the cosine and $\text{dHash}_\text{indep}$ direction.
+
+<!-- TABLE A.I: BD/McCrary Bin-Width Sensitivity (two-sided alpha = 0.05, |Z| > 1.96)
+| Variant | n | Bin width | Best transition | z_below | z_above |
+|---------|---|-----------|-----------------|---------|---------|
+| Firm A cosine (sig-level)        | 60,448  | 0.003  | 0.9870 | -2.81  | +9.42   |
+| Firm A cosine (sig-level)        | 60,448  | 0.005  | 0.9850 | -9.57  | +19.07  |
+| Firm A cosine (sig-level)        | 60,448  | 0.010  | 0.9800 | -54.64 | +69.96  |
+| Firm A cosine (sig-level)        | 60,448  | 0.015  | 0.9750 | -85.86 | +106.17 |
+| Firm A dHash_indep (sig-level)   | 60,448  | 1      | 2.0    | -4.69  | +10.01  |
+| Firm A dHash_indep (sig-level)   | 60,448  | 2      | no transition | — | — |
+| Firm A dHash_indep (sig-level)   | 60,448  | 3      | no transition | — | — |
+| Full-sample cosine (sig-level)   | 168,740 | 0.003  | 0.9870 | -3.21  | +8.17   |
+| Full-sample cosine (sig-level)   | 168,740 | 0.005  | 0.9850 | -8.80  | +14.32  |
+| Full-sample cosine (sig-level)   | 168,740 | 0.010  | 0.9800 | -29.69 | +44.91  |
+| Full-sample cosine (sig-level)   | 168,740 | 0.015  | 0.9450 | -11.35 | +14.85  |
+| Full-sample dHash_indep (sig-l.) | 168,740 | 1      | 2.0    | -6.22  | +4.89   |
+| Full-sample dHash_indep (sig-l.) | 168,740 | 2      | 10.0   | -7.35  | +3.83   |
+| Full-sample dHash_indep (sig-l.) | 168,740 | 3      | 9.0    | -11.05 | +45.39  |
+-->
+
+Two patterns are visible in Table A.I.
+First, the procedure consistently identifies a "transition" under every bin width, but the *location* of that transition drifts monotonically with bin width (Firm A cosine: 0.987 → 0.985 → 0.980 → 0.975 as bin width grows from 0.003 to 0.015; full-sample dHash: 2 → 10 → 9 as the bin width grows from 1 to 3).
+The $Z$ statistics also inflate superlinearly with the bin width (Firm A cosine $|Z|$ rises from $\sim 9$ at bin 0.003 to $\sim 106$ at bin 0.015) because wider bins aggregate more mass per bin and therefore shrink the per-bin standard error on a very large sample.
+Both features are characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity.
+
+Second, the candidate transitions all locate *inside* the non-hand-signed mode (cosine $\geq 0.975$, dHash $\leq 10$) rather than between modes, which is the location pattern we would expect of a clean two-mechanism boundary.
+
+Taken together, Table A.I shows that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the non-hand-signed mode rather than between modes.
+This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator and reinforces the joint reading of Section IV-D that per-signature similarity does not form a clean two-mechanism mixture.
+
+Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials.
+
+# Appendix B. Table-to-Script Provenance
+
+For reproducibility, the following table maps each numerical table in Section IV to the analysis script that produces its underlying values and to the report file emitted by that script. Scripts are under `signature_analysis/`. Report artifact paths below are listed relative to the project's analysis report root, which is `/Volumes/NV2/PDF-Processing/signature-analysis/` in our local deployment; replicators should rebase the paths to whatever report root they configure when invoking the scripts.
+
+<!-- TABLE B.I: Manuscript table → reproduction artifact
+| Manuscript table | Generating script | Report artifact |
+|------------------|-------------------|-----------------|
+| Table III (extraction results) | `02_extract_features.py`; `09_pdf_signature_verdict.py` | `reports/extraction_methodology.md`; `reports/pdf_signature_verdicts.json` |
+| Table IV (intra/inter all-pairs cosine statistics) | `10_formal_statistical_analysis.py` | `reports/formal_statistical_data.json`; `reports/formal_statistical_report.md` |
+| Table V (Hartigan dip test) | `15_hartigan_dip_test.py` | `reports/dip_test/dip_test_results.json` |
+| Table VI (signature-level threshold-estimator summary) | `17_beta_mixture_em.py`; `25_bd_mccrary_sensitivity.py` | `reports/beta_mixture/beta_mixture_results.json`; `reports/bd_sensitivity/bd_sensitivity.json` |
+| Table IX (Firm A whole-sample capture rates) | `19_pixel_identity_validation.py`; `24_validation_recalibration.py` | `reports/pixel_validation/pixel_validation_results.json`; `reports/validation_recalibration/validation_recalibration.json` |
+| Table X (cosine threshold sweep, FAR vs inter-CPA negatives) | `21_expanded_validation.py` | `reports/expanded_validation/expanded_validation_results.json` |
+| Table XI (held-out vs calibration Firm A capture rates) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
+| Table XII (operational-cut sensitivity 0.95 vs 0.945) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
+| Table XII-B (cosine-threshold tradeoff: capture vs inter-CPA FAR) | `21_expanded_validation.py` (FAR column; canonical 50k-pair anchor); inline computation in revision (Firm A and non-Firm-A capture columns) | `reports/expanded_validation/expanded_validation_results.json` |
+| Table XIII (Firm A per-year cosine distribution) | `29_firm_a_yearly_distribution.py` | `reports/firm_a_yearly/firm_a_yearly_distribution.json` |
+| Fig. 4 (per-firm yearly best-match cosine, 2013-2023) | `30_yearly_big4_comparison.py` | `reports/figures/fig_yearly_big4_comparison.{png,pdf}`; `reports/firm_yearly_comparison/firm_yearly_comparison.{json,md}` |
+| Tables XIV / XV (partner-level similarity ranking) | `22_partner_ranking.py` | `reports/partner_ranking/partner_ranking_results.json` |
+| Table XVI (intra-report classification agreement) | `23_intra_report_consistency.py` | `reports/intra_report/intra_report_results.json` |
+| Table XVII (document-level five-way classification) | `09_pdf_signature_verdict.py`; `12_generate_pdf_level_report.py` | `reports/pdf_signature_verdicts.json`; `reports/pdf_signature_verdict_report.md` (CSV / XLSX bulk reports also at `reports/`) |
+| Table XVIII (backbone ablation) | `paper/ablation_backbone_comparison.py` | `ablation/ablation_results.json` (sibling of `reports/`) |
+| Table A.I (BD/McCrary bin-width sensitivity) | `25_bd_mccrary_sensitivity.py` | `reports/bd_sensitivity/bd_sensitivity.json` |
+| Byte-identity decomposition (145 / 50 / 180 / 35; Section IV-F.1) | `28_byte_identity_decomposition.py` | `reports/byte_identity_decomp/byte_identity_decomposition.json` |
+| Cross-firm dual-descriptor convergence (Section IV-H.2) | `28_byte_identity_decomposition.py` | `reports/byte_identity_decomp/byte_identity_decomposition.json` |
+-->
+
+The table-to-script mapping above is intended as a navigation aid for replicators. All scripts run deterministically under the fixed random seeds documented in the supplementary materials; the artifact paths above were verified against the local deployment at the time of submission, and any reviewer reproduction step should re-emit the artifacts from the listed scripts rather than depend on the absolute path layout.
@@ -3,22 +3,21 @@
 ## Conclusion

 We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
-Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through three methodologically distinct methods applied at two analysis levels.
+Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with the operational classifier's cosine cut anchored on a whole-sample Firm A percentile heuristic and the per-signature similarity distribution characterised through two threshold estimators and a density-smoothness diagnostic.

-Our contributions are fourfold.
+The seven numbered contributions listed in Section I can be grouped into four broader methodological themes, summarized below.

 First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.

 Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.

-Third, we introduced a three-method threshold framework combining KDE antimode (with a Hartigan unimodality test), Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture (with a logit-Gaussian robustness check).
-Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the KDE antimode together with the Beta-mixture and logit-Gaussian estimators agree within $\sim 0.006$ at cosine $\approx 0.975$.
-The Burgstahler-Dichev / McCrary test, by contrast, finds no significant transition at the accountant level, consistent with clustered but smoothly mixed rather than sharply discrete accountant-level heterogeneity.
-The substantive reading is therefore narrower than "discrete behavior": *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered with smooth cluster boundaries.
+Third, we characterised the per-signature similarity distribution using three diagnostics---a Hartigan dip test, an EM-fitted Beta mixture (with logit-Gaussian robustness check), and a Burgstahler-Dichev / McCrary density-smoothness procedure---and showed that no two-mechanism mixture cleanly explains it: the dip test fails to reject unimodality for Firm A ($p = 0.17$), BIC strongly prefers a 3-component over a 2-component Beta fit ($\Delta\text{BIC} = 381$ for Firm A), and the BD/McCrary candidate transition lies inside the non-hand-signed mode rather than between modes (and is not bin-width-stable; Appendix A).
+The substantive reading is that *pixel-level output quality* is a continuous spectrum produced by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) and scan conditions, rather than a discrete class cleanly separated from hand-signing.
+This reading motivates anchoring the operational classifier's cosine cut on a whole-sample Firm A P7.5 percentile heuristic (cos $> 0.95$) rather than on a mixture-fit crossing.

 Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
-To break the circularity of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report post-hoc capture rates on the held-out fold with Wilson 95% confidence intervals.
-This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
+To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85--95% capture band differ by 1--5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure.
+This framing is internally consistent with the available evidence: the byte-level pair analysis finding of 145 pixel-identical calibration-firm signatures across 50 distinct partners of 180 registered (Section IV-F.1); the 92.5% / 7.5% split in signature-level cosine thresholds and the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1); and the 95.9% top-decile concentration of Firm A auditor-years in the threshold-independent partner-ranking analysis (Section IV-G.2).

 An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.

@@ -26,7 +25,6 @@ An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that

 Several directions merit further investigation.
 Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
-Extending the accountant-level analysis to auditor-year units---using the same three-method convergent framework but at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade.
 The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
-The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself.
+The replication-dominated calibration strategy and the pixel-identity anchor technique are both generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself, conditional on the availability of analogous anchors in the new domain and on artifact-generation physics that preserve the byte-level trace.
 Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
@@ -0,0 +1,7 @@
+# Declarations
+
+**Conflict of interest.** The authors declare no conflict of interest with Firm A, Firm B, Firm C, or Firm D, or with any other entity referenced in this work.
+
+**Data availability.** All audit reports analysed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, a publicly accessible regulatory disclosure platform. The CPA registry used to map signatures to certifying CPAs is publicly available. Signature images, model weights, and reproducibility scripts are available in the supplementary materials.
+
+**Funding.** [To be filled in before submission.]
@@ -11,40 +11,39 @@ Forgery detection systems optimize for inter-class discriminability---maximizing
 Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.
 The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.

-## B. Continuous-Quality Spectrum vs. Clustered Accountant-Level Heterogeneity
+## B. Per-Signature Similarity is a Continuous Quality Spectrum

-The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the three-method framework and the Hartigan dip test (Sections IV-D and IV-E).
+A central empirical finding of this study is that per-signature similarity does not form a clean two-mechanism mixture (Section IV-D).
+Firm A's signature-level cosine is formally unimodal (Hartigan dip test $p = 0.17$) with a long left tail.
+The all-CPA signature-level cosine rejects unimodality ($p < 0.001$), reflecting the heterogeneity of signing practices across firms, but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit ($\Delta\text{BIC} = 381$ for Firm A; $10{,}175$ for the full sample), and the forced 2-component Beta crossing and its logit-GMM robustness counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
+The BD/McCrary discontinuity test locates its transition at cosine 0.985---*inside* the non-hand-signed mode rather than at a boundary between two mechanisms---and the transition is not bin-width-stable (Appendix A).

-At the per-signature level, the distribution of best-match cosine similarity is *not* cleanly bimodal.
-Firm A's signature-level cosine is formally unimodal (dip test $p = 0.17$) with a long left tail.
-The all-CPA signature-level cosine rejects unimodality ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
-The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary between two mechanisms.
-Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class.
+Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class cleanly separated from hand-signing.
 Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level.

-At the per-accountant aggregate level the picture partly reverses.
-The distribution of per-accountant mean cosine (and mean dHash) rejects unimodality, a BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms, and the three 1D threshold methods applied at the accountant level produce mutually consistent estimates (KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, logit-GMM-2 crossing $= 0.976$).
-The BD/McCrary test, however, does not produce a significant transition at the accountant level either, in contrast to the signature level.
-This pattern is consistent with a clustered *but smoothly mixed* accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the transitions between them are gradual rather than discrete at the bin resolution BD/McCrary requires.
+The methodological implication is that the operational classifier's cosine cut should not be derived from a mixture-fit crossing.
+We accordingly anchor the operational cosine cut on the whole-sample Firm A P7.5 percentile (Section III-K), and treat the signature-level threshold-estimator outputs (KDE antimode, Beta and logit-Gaussian crossings) as descriptive characterisation of the similarity distribution rather than as the source of operational thresholds.
+The BD/McCrary procedure plays a *density-smoothness diagnostic* role in this framing rather than that of an independent threshold estimator.

-The substantive interpretation we take from this evidence is therefore narrower than a "discrete-behaviour" claim: *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behaviour* is clustered (three recognizable groups) but not sharply discrete.
-The accountant-level mixture is a useful classifier of firm-and-practitioner-level signing regimes; individual behaviour may still transition or mix over time within a practitioner, and our cross-sectional analysis does not rule this out.
-Methodologically, the implication is that the three 1D methods are meaningfully applied at the accountant level for threshold estimation, while the BD/McCrary non-transition at the same level is itself diagnostic of smoothness rather than a failure of the method.
+This continuous-spectrum finding also has substantive implications for downstream interpretation.
+Because pixel-level output quality varies continuously, *signature-level rates* (such as the 92.5% / 7.5% Firm A split) reflect the share of signatures whose similarity falls above or below a chosen threshold rather than the share that came from a "non-hand-signing mechanism" versus a "hand-signing mechanism."
+We accordingly report all rates as signature-level quantities and abstain from partner-level frequency claims (Section III-G).

 ## C. Firm A as a Replication-Dominated, Not Pure, Population

 A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class.
 Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.

-Three convergent strands of evidence support the replication-dominated framing.
-First, the visual-inspection evidence: randomly sampled Firm A reports exhibit pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
-Second, the signature-level statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
-Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures ($\geq 10$) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---directly quantifying the within-firm minority of hand-signers.
-Nine additional Firm A CPAs are excluded from the GMM for having fewer than 10 signatures, so we cannot place them in a cluster from the cross-sectional analysis alone.
-The held-out Firm A 70/30 validation (Section IV-G.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules (extreme rules are statistically indistinguishable; operational rules in the 85–95% band differ between folds by 1–5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure).
-The accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to fold-level sampling variance.
+Two convergent strands of evidence support the replication-dominated framing.
+First, the byte-level pair evidence: 145 Firm A signatures (from 50 distinct partners of 180 registered) have a byte-identical same-CPA match in a different audit report, with 35 of these matches spanning different fiscal years.
+Independent hand-signing cannot produce byte-identical images across distinct reports, so these pairs directly establish image reuse within Firm A as a concrete, threshold-free phenomenon, and the 50/180 partner spread shows that replication is widespread rather than confined to a handful of CPAs.
+Second, the signature-level distributional evidence: Firm A's per-signature cosine distribution is unimodal long-tail (Hartigan dip test $p = 0.17$) rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
+The unimodal-long-tail *shape*, not the precise 92.5 / 7.5 split, is the structural evidence: it is consistent with a dominant high-similarity regime plus residual within-firm heterogeneity, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).

-The replication-dominated framing is internally coherent with all three pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise.
+Two additional checks, reported in Section IV-G, are robust to threshold choice and complement the two primary strands:
+the held-out Firm A 70/30 validation (Section IV-F.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules (extreme rules are statistically indistinguishable; operational rules in the 85--95% band differ between folds by 1--5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure), and the threshold-independent partner-ranking analysis (Section IV-G.2) shows that Firm A auditor-years occupy 95.9% of the top decile of similarity-ranked auditor-years against a 27.8% baseline share---a 3.5$\times$ concentration ratio that uses only ordinal ranking and is independent of any absolute cutoff.
+
+The replication-dominated framing is internally coherent with both pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise.
 We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.

 ## D. The Style-Replication Gap
@@ -62,16 +61,16 @@ The dual-descriptor framework correctly identifies these cases as distinct from

 The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
 In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
-Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset.
+Our approach uses practitioner background---one Big-4 firm reportedly relies predominantly on stamping or e-signing workflows---only as a *motivation* for selecting that firm as a candidate reference population; the calibration role is then established from the audit-report images themselves (byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency), so the calibration does not depend on the practitioner-background claim being externally verified (Section III-H).

 This calibration strategy has broader applicability beyond signature analysis.
 Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives.
-The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates the within-firm heterogeneity quantified by the accountant-level mixture, and yields classification rates that are internally consistent with the data.
+The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates the within-firm heterogeneity visible in the unimodal-long-tail shape of Firm A's per-signature cosine distribution, and yields classification rates that are internally consistent with the data.

 ## F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation

 A further methodological contribution is the combination of byte-level pixel identity as an annotation-free *conservative* gold positive and a large random-inter-CPA negative anchor.
-Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review.
+Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is pair-level proof of image reuse and, modulo the narrow source-template edge case discussed in the seventh limitation below, a conservative positive for non-hand-signing without requiring human review.
 In our corpus 310 signatures satisfied this condition.
 We emphasize that byte-identical pairs are a *subset* of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways).
 Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.
@@ -96,14 +95,15 @@ In these overlap regions, blended pixels are replaced with white, potentially cr
 This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified.

 Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements.
-While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions.
+While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.

-Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted non-hand-signing later).
-Extending the accountant-level analysis to auditor-year units is a natural next step.
+Fifth, the max/min detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed.
+In the rare case that one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as the stamping or e-signature template, the pair correctly identifies image reuse but misattributes the non-hand-signed status to the source exemplar.
+This misattribution affects at most one source document per template variant per CPA (the exemplar from which the template was produced), is not expected to be common given that stored signature templates are typically generated in a separate acquisition step rather than extracted from submitted audit reports, and does not materially affect aggregate capture rates at the firm level.

-Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, and the test produces no significant transition at all at the accountant level.
-In our application, therefore, BD/McCrary contributes diagnostic information about local density-smoothness rather than an independent accountant-level threshold estimate; that role is played by the KDE antimode and the two mixture-based estimators.
-The BD/McCrary results remain informative as a robustness check---their non-transition at the accountant level is consistent with the dip-test and Beta-mixture evidence that accountant-level clustering is smooth rather than sharply discontinuous.
+Sixth, our analyses remain at the signature level; we abstain from partner-level frequency inferences such as "X% of CPAs hand-sign in a given year."
+Per-signature labels in this paper are not translated to per-report or per-partner mechanism assignments (Section III-G).
+The signature-level rates we report, including the 92.5% / 7.5% Firm A split and the year-by-year left-tail share of Section IV-G.1, should accordingly be read as signature-level quantities rather than partner-level frequencies.

 Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
 Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.
@@ -1,9 +1,21 @@
-# Impact Statement
+<!--
+ARCHIVED. Not part of the IEEE Access submission.

-<!-- 100-150 words. Non-specialist readable. No jargon. Specific, not vague. -->
+IEEE Access Regular Papers do not include a separate Impact Statement
+section. The text below is retained for possible reuse in a cover
+letter, grant report, or non-IEEE venue. It is excluded from the
+assembled paper by export_v3.py.
+
+If reused, note that the wording "distinguishes genuinely hand-signed
+signatures from reproduced ones" overstates what a five-way confidence
+classifier without a fully labeled test set establishes; soften before
+external use.
+-->
+
+# Impact Statement (archived; not in IEEE Access submission)

 Auditor signatures on financial reports are a key safeguard of corporate accountability.
 When the signature on an audit report is produced by reproducing a stored image instead of by the partner's own hand---whether through an administrative stamping workflow or a firm-level electronic signing system---this safeguard is weakened, yet detecting the practice through manual inspection is infeasible at the scale of modern financial markets.
-We developed an artificial intelligence system that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan.
-By combining deep-learning visual features with perceptual hashing and three methodologically distinct threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time.
-After further validation, the technology could support financial regulators in automating signature-authenticity screening at national scale.
+We developed a pipeline that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan.
+Combining deep-learning visual features with perceptual hashing and two methodologically distinct threshold estimators (plus a density-smoothness diagnostic), the system stratifies signatures into a five-way confidence-graded classification and quantifies how the practice varies across firms and over time.
+After further validation, the technology could support financial regulators in screening signature authenticity at national scale.
@@ -9,9 +9,10 @@ While the law permits either a handwritten signature or a seal, the CPA's attest
 The digitization of financial reporting has introduced a practice that complicates this intent.
 As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one.
 This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step.
-From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise.
+From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.
 We refer to signatures produced by either workflow collectively as *non-hand-signed*.
 Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement.
+The accounting literature has long examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33].
 Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused.
 This practice, while potentially widespread, is visually invisible to report users and virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of image reproduction.

@@ -24,14 +25,14 @@ This detection problem differs fundamentally from forgery detection: while it do

 A secondary methodological concern shapes the research design.
 Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification.
-Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference.
-A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, $F_1$, and equal-error-rate metrics that prevail in the biometric-verification literature.
+Such thresholds are fragile in an archival-data setting where the cost of misclassification propagates into downstream inference.
+A defensible approach requires (i) a transparent threshold anchored to an empirical reference population drawn from the target corpus; (ii) statistical diagnostics that characterise the *shape* of the underlying similarity distribution and so motivate the choice of anchor; and (iii) external validation against naturally-occurring anchor populations---byte-level identical pairs as a conservative gold positive subset and large random inter-CPA pairs as a gold negative population---reported with Wilson 95% confidence intervals on per-rule capture / FAR rates, since precision and $F_1$ are not meaningful when the positive and negative anchor populations are sampled from different units.

 Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards.
 Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image.
 Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures, where legitimate visual similarity between a signer's authentic signatures is expected and must be distinguished from image reproduction.
 Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis.
-From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37], Burgstahler-Dichev / McCrary discontinuity testing [38], [39], and finite mixture modelling via the EM algorithm [40], [41]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a three-method convergent framework for document-forensics threshold selection.
+From the statistical side, the methods we adopt for distributional characterisation---the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a joint diagnostic toolkit for document-forensics threshold selection.

 In this paper, we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale.
 Our approach processes raw PDF documents through the following stages:
@@ -39,7 +40,7 @@ Our approach processes raw PDF documents through the following stages:
 (2) signature region detection using a trained YOLOv11 object detector;
 (3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network;
 (4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance;
-(5) threshold determination using three methodologically distinct methods---KDE antimode with a Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and finite Beta mixture via EM with a logit-Gaussian robustness check---applied at both the signature level and the accountant level; and
+(5) signature-level distributional characterisation using two threshold estimators---KDE antimode with a Hartigan unimodality test and finite Beta mixture via EM with a logit-Gaussian robustness check---complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic, used to read the structure of the per-signature similarity distribution and to motivate a percentile-based operational anchor rather than a mixture-fit crossing; and
 (6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm.

 The dual-descriptor verification is central to our contribution.
@@ -48,16 +49,15 @@ Perceptual hashing (specifically, difference hashing) encodes structural-level i
 By requiring convergent evidence from both descriptors, we can differentiate *style consistency* (high cosine but divergent dHash) from *image reproduction* (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone.

 A second distinctive feature is our framing of the calibration reference.
-One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing for the majority of its certifying partners, while not ruling out that a minority may continue to hand-sign some reports.
+One major Big-4 accounting firm in Taiwan (hereafter "Firm A") was selected as a candidate calibration reference based on practitioner-knowledge motivation; its benchmark status is then evaluated using the image evidence reported in this paper, not asserted by the practitioner-knowledge motivation itself.
 We therefore treat Firm A as a *replication-dominated* calibration reference rather than a pure positive class.
-This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in total) cluster into an accountant-level "middle band" rather than the high-replication mode.
-Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence among the visual-inspection evidence, the signature-level statistics, and the accountant-level mixture.
+This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail (Hartigan dip $p = 0.17$), 92.5% of Firm A signatures exceed cosine 0.95 with the remaining 7.5% forming the left tail, and 145 Firm A signatures across 50 distinct partners are byte-identical to a same-CPA match in a different audit report (35 spanning different fiscal years).
+Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb the 7.5% residual as noise---ensures internal coherence between the byte-level pixel-identity evidence and the signature-level distributional shape.

-A third distinctive feature is our unit-of-analysis treatment.
-Our three-method framework reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are clustered into three recognizable groups (BIC-best $K = 3$).
-The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, while *accountant-level aggregate behaviour* is clustered but not sharply discrete---a given CPA tends to cluster into a dominant regime (high-replication, middle-band, or hand-signed-tendency), though the boundaries between regimes are smooth rather than discontinuous.
-Among the three accountant-level methods, KDE antimode and the two mixture-based estimators converge within $\sim 0.006$ on a cosine threshold of approximately $0.975$, while the Burgstahler-Dichev / McCrary discontinuity test finds no significant transition at the accountant level---an outcome consistent with smoothly mixed clusters rather than a failure of the method.
-The two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are reported as a complementary cross-check rather than as the primary accountant-level threshold.
+A third distinctive feature is the empirical reading we take from the per-signature distributional analysis.
+Three diagnostics applied to the per-signature similarity distribution---the Hartigan dip test, an EM-fitted Beta mixture (with logit-Gaussian robustness check), and the Burgstahler-Dichev / McCrary density-smoothness procedure---jointly indicate that no two-mechanism mixture cleanly explains per-signature similarity: the dip test fails to reject unimodality for Firm A, BIC strongly prefers a 3-component over a 2-component Beta fit, and the BD/McCrary candidate transition lies *inside* the non-hand-signed mode rather than between modes (and is not bin-width-stable; Appendix A).
+The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) and scan conditions, rather than a discrete class cleanly separated from hand-signing.
+This reading motivates anchoring the operational classifier on a percentile heuristic over the Firm A reference distribution rather than on a mixture-fit crossing, and it motivates the byte-level pixel-identity anchor (Section IV-F.1) as a threshold-free positive reference that does not depend on resolving signature-level mixture structure.

 We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants.
 To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature.
@@ -70,17 +70,17 @@ The contributions of this paper are summarized as follows:

 3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures.

-4. **Three-method convergent threshold framework.** We introduce a threshold-selection framework that applies three methodologically distinct methods---KDE antimode with Hartigan unimodality test, Burgstahler-Dichev / McCrary discontinuity, and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, using the convergence (or principled divergence) across methods as diagnostic evidence about the mixture structure of the data.
+4. **Percentile-anchored operational threshold.** We anchor the operational classifier's cosine cut on the whole-sample Firm A P7.5 percentile (cos $> 0.95$), a transparent and reproducible reference drawn from a replication-dominated reference population, and complement it with dHash structural cuts derived from the same reference distribution. Operational thresholds are therefore explained by an empirical reference rather than asserted.

-5. **Continuous-quality / clustered-accountant finding.** We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates cluster into three recognizable groups with smoothly mixed rather than sharply discrete boundaries---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics.
+5. **Distributional characterisation of per-signature similarity.** We apply three statistical diagnostics---a Hartigan dip test, an EM-fitted Beta mixture with logit-Gaussian robustness check, and a Burgstahler-Dichev / McCrary density-smoothness procedure---to characterise the shape of the per-signature similarity distribution. The three diagnostics jointly find that per-signature similarity forms a continuous quality spectrum, which both motivates the percentile-based operational anchor over a mixture-fit crossing and is itself a substantive finding for the document-forensics literature on similarity-threshold selection.

-6. **Replication-dominated calibration methodology.** We introduce a calibration strategy using a known-majority-positive reference group, distinguishing *replication-dominated* from *replication-pure* anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.
+6. **Replication-dominated calibration methodology.** We introduce a calibration strategy using a replication-dominated reference group, distinguishing *replication-dominated* from *replication-pure* anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.

 7. **Large-scale empirical analysis.** We report findings from the analysis of over 90,000 audit reports spanning a decade, providing the first large-scale empirical evidence on non-hand-signing practices in financial reporting under a methodology designed for peer-review defensibility.

 The remainder of this paper is organized as follows.
-Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination.
+Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for distributional characterisation.
 Section III describes the proposed methodology.
-Section IV presents experimental results including the three-method convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study.
+Section IV presents experimental results including the signature-level distributional characterisation, pixel-identity validation, and backbone ablation study.
 Section V discusses the implications and limitations of our findings.
 Section VI concludes with directions for future work.
@@ -4,19 +4,19 @@

 We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
 Fig. 1 illustrates the overall architecture.
-The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from three methodologically distinct statistical methods and a pixel-identity anchor.
+The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum anchored on whole-sample Firm A percentile heuristics and validated against a byte-level pixel-identity positive anchor and a large random inter-CPA negative anchor.

 Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
-From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.
+From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.

 <!--
 [Figure 1: Pipeline Architecture - clean vector diagram]
 90,282 PDFs → VLM Pre-screening → 86,072 PDFs
 → YOLOv11 Detection → 182,328 signatures
 → ResNet-50 Features → 2048-dim embeddings
-→ Dual-Method Verification (Cosine + dHash)
-→ Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification
-→ Pixel-identity + Firm A + Accountant-level GMM validation
+→ Dual-Descriptor Verification (Cosine + dHash)
+→ Firm A P7.5-anchored Classifier → Five-way classification
+→ Pixel-identity + Inter-CPA + Held-Out Firm A validation
 -->

 ## B. Data Collection
@@ -41,7 +41,7 @@ Table I summarizes the dataset composition.

 ## C. Signature Page Identification

-To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism.
+To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24], one of the multimodal generative models surveyed in [35], as an automated pre-screening mechanism.
 Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature.
 The model was configured with temperature 0 for deterministic output.

@@ -55,7 +55,7 @@ The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pag

 ## D. Signature Detection

-We adopted YOLOv11n (nano variant) [25] for signature region localization.
+We adopted YOLOv11n (nano variant) [25], a lightweight descendant of the original YOLO single-stage detector [34], for signature region localization.
 A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction.
 A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.

@@ -74,6 +74,7 @@ Batch inference on all 86,071 documents extracted 182,328 signature images at a
 A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.

 Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
+The remaining 7.4% (13,573 signatures) could not be matched to a registered CPA name---typically because the auditor's report page format deviates from the standard two-signature layout, or because OCR of the printed CPA name on the page returns a name not present in the registry---and these signatures are excluded from all subsequent same-CPA pairwise analyses (a same-CPA best-match statistic is undefined when a signature has no assigned CPA). The 92.6% matched subset is the sample that flows into Sections IV-D through IV-H; the unmatched 7.4% are excluded for definitional reasons rather than discarded as noise.

 ## E. Feature Extraction

@@ -83,8 +84,8 @@ The final classification layer was removed, yielding the 2048-dimensional output
 Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
 All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.

-The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D).
-This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
+The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-G).
+This design choice is validated by an ablation study (Section IV-I) comparing ResNet-50 against VGG-16 and EfficientNet-B0.

 ## F. Dual-Method Similarity Descriptors

@@ -97,7 +98,7 @@ $$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$
 where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized 2048-dim feature vectors.
 Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].

-**Perceptual hash distance (dHash)** captures structural-level similarity.
+**Perceptual hash distance (dHash)** [27] captures structural-level similarity.
 Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint.
 The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images.
 Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
@@ -108,81 +109,105 @@ Non-hand-signing yields extreme similarity under *both* descriptors, since the u
 Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
 Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.

-We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content.
-Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle.
+We did not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors, and the reasons are specific to what each of those measures was designed to do rather than to how either happened to perform on our corpus.
+
+SSIM was developed by Wang et al. [30] as a perceptual quality index for *natural images*, and it factorises local-window image statistics into three components---luminance, contrast, and structural correlation---combined multiplicatively over a sliding window.
+Each of these components is computed at the pixel level on the original-resolution image and is *designed to be sensitive* to small fluctuations in local luminance and local contrast, because that is what makes SSIM track human perception of natural-image quality.
+Applied to a binarised auditor's signature crop, exactly those design choices become liabilities: the JPEG block artifacts, scan-noise speckle, and faint scanner-rule ghosts that are routine in a print-scan cycle perturb local luminance and local contrast in every window they touch, and SSIM amplifies those perturbations in the structural-correlation product.
+A signature reproduced twice from the same stored image---the very case that defines our positive class---is therefore one in which SSIM is structurally guaranteed to penalise the easily perturbed margins around the strokes, even though the strokes themselves are identical up to rendering noise.
+This is a property of how SSIM is constructed, not a finding about how it scored on our data; the empirical observation that the calibration firm exhibits a mean SSIM of only $0.70$ in our corpus is a confirmation of the design-level prediction rather than the basis for the rejection.
+
+Pixel-level comparison---whether $L_1$, $L_2$, or pixel-identity counting---fails on a stricter design ground.
+Pixel-level distances are defined on geometrically aligned images at a common resolution, and they treat any sub-pixel translation, rotation, or rescale as a large perturbation by construction (a one-pixel uniform translation flips a fraction of foreground pixels on a thin-stroke signature crop and inflates pixel L1 distance to the same magnitude as for a different signer's signature).
+Two scans of the same physical document, however, do not share a common pixel grid: scanner DPI, paper-handling alignment, and PDF-page rasterisation each contribute random sub-pixel offsets, and the print-scan cycle that intervenes between the stored stamp image and the audit-report PDF additionally introduces resolution mismatch and small geometric drift.
+A pixel-level descriptor cannot therefore satisfy the basic stability requirement for our task: two presentations of the same stored image must score nearly identically.
+We retain pixel-identity counting only as a *threshold-free anchor* (Section III-J), because byte-identical pairs in our corpus are necessarily produced by literal file reuse rather than by repeated scanning, and so they do not interact with the alignment-fragility argument; they are not used as a primary similarity descriptor.
+
+Cosine similarity on deep embeddings and dHash, in contrast, both remain stable across the print-scan-rasterise cycle by design: cosine on L2-normalised pooled features is invariant to overall scale and bias and degrades gracefully under local-pixel noise that the convolutional backbone has been trained to absorb [14], [21], while dHash compresses the image to a $9 \times 8$ grayscale grid before computing horizontal-gradient signs, which removes the resolution and sub-pixel-alignment sensitivity that breaks pixel-level comparison [19], [27].
+Together they constitute the dual descriptor used throughout the rest of this paper.

 ## G. Unit of Analysis and Summary Statistics

-Two unit-of-analysis choices are relevant for this study: (i) the *signature*---one signature image extracted from one report---and (ii) the *accountant*---the collection of all signatures attributed to a single CPA across the sample period.
-A third composite unit---the *auditor-year*, i.e. all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution.
+Two unit-of-analysis choices are relevant for this study, ordered from finest to coarsest: (i) the *signature*---one signature image extracted from one report; and (ii) the *auditor-year*---all signatures by one CPA within one fiscal year.
+The signature is the operational unit of classification (Section III-K) and of all primary statistical analyses (Section IV-D, IV-F, IV-G).
+The auditor-year is used in the partner-level similarity ranking of Section IV-G.2 as a within-year aggregation unit: each auditor-year's mean is computed over its own fiscal-year signatures, although the per-signature best-match cosine that feeds the mean is computed against the full same-CPA cross-year pool (Section III-G's max-cosine / min-dHash definition).
+We do not use a coarser CPA-level cross-year unit, because pooling a CPA's signatures across the full 2013--2023 sample period would conflate distinct signing-mechanism regimes whenever a CPA's practice changes during the sample, and we make no claim about the within-CPA stability of signing mechanisms over time.

-For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA.
+For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA (over the full same-CPA set, not restricted to the same fiscal year).
 The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism.
 Mean statistics would dilute this signal.

-We also adopt an explicit *within-auditor-year no-mixing* identification assumption.
-Specifically, within any single fiscal year we treat a given CPA's signing mechanism as uniform: a CPA who reproduces one signature image in that year is assumed to do so for every report, and a CPA who hand-signs in that year is assumed to hand-sign every report in that year.
-Domain-knowledge from industry practice at Firm A is consistent with this assumption for that firm during the sample period.
-Under the assumption, per-auditor-year summary statistics are well defined and robust to outliers: if even one pair of same-CPA signatures in the year is near-identical, the max/min captures it.
-The intra-report consistency analysis in Section IV-H.3 provides an empirical check on the within-auditor-year assumption at the report level.
+For the dHash dimension we use the *independent minimum dHash*: the minimum Hamming distance from a signature to *any* other signature of the same CPA (over the full same-CPA set).
+The independent minimum is unconditional on the cosine-nearest pair and is therefore the conservative structural-similarity statistic; it is the dHash statistic used throughout the operational classifier (Section III-K) and all reported capture-rate analyses.

-For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean *independent minimum dHash* across all signatures of that CPA.
-The *independent minimum dHash* of a signature is defined as the minimum Hamming distance to *any* other signature of the same CPA (over the full same-CPA set), in contrast to the *cosine-conditional dHash* used as a diagnostic elsewhere, which is the dHash to the single signature selected as the cosine-nearest match.
-The independent minimum avoids conditioning on the cosine choice and is therefore the conservative structural-similarity statistic for each signature.
-These accountant-level aggregates are the input to the mixture model described in Section III-J and to the accountant-level three-method analysis in Section III-I.5.
+We make one stipulation about same-CPA pair detectability.
+
+**(A1) Pair-detectability.** *If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the cross-year same-CPA pool used by the max-cosine / min-dHash computation above.*
+This is plausible for high-volume stamping or firm-level electronic-signing workflows---where a stored image is typically reused many times under similar scan and compression conditions---but it is *not* guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are in use simultaneously, or (iii) scan-stage noise pushes a replicated pair outside the detection regime.
+A1 is a *cross-year pair-existence* property, not a within-year uniformity claim, and is the only assumption the per-signature detector requires to be sensitive to replication.
+
+We make *no* within-year or across-year uniformity assumption about CPA signing mechanisms.
+Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation.
+A CPA's signing output within a single fiscal year may reflect a single replication template, multiple templates used in parallel (e.g., different stored images for different engagement positions or reporting pipelines), within-year mechanism mixing, or a combination; our signature-level analyses remain valid under all of these regimes, since they do not attempt mechanism attribution at the partner or report level.
+
+The intra-report consistency analysis in Section IV-G.3 is a firm-level homogeneity check---whether the *two co-signing CPAs on the same report* receive the same signature-level label under the operational classifier---rather than a test of within-partner or within-year uniformity.

 ## H. Calibration Reference: Firm A as a Replication-Dominated Population

 A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference.
 Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring *replication-dominated population*: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.

-The background context for this choice is practitioner knowledge about Firm A's signing practice: industry practice at the firm is widely understood among practitioners to involve reproducing a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports.
-We use this only as background context for why Firm A is a plausible calibration candidate; the *evidence* for Firm A's replication-dominated status comes entirely from the paper's own analyses, which do not depend on any claim about signing practice beyond what the audit-report images themselves show.
+Practitioner knowledge motivated treating Firm A as a candidate calibration reference: the firm is understood within the audit profession to reproduce a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports.
+This practitioner background motivates Firm A's selection but is not used as evidence: the evidentiary basis in the analyses below---byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency---is derived entirely from the audit-report images themselves and does not depend on any claim about firm-level signing practice.

-We establish Firm A's replication-dominated status through four independent quantitative analyses, each of which can be reproduced from the public audit-report corpus alone:
+We establish Firm A's replication-dominated status through two primary independent quantitative analyses plus a third strand comprising three complementary checks, each of which can be reproduced from the public audit-report corpus alone:

-First, *independent visual inspection* of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
+First, *automated byte-level pair analysis* (Section IV-F.1; reproduction artifact listed in Appendix B) identifies 145 Firm A signatures that are byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years.
+Byte-identity implies pixel-identity by construction, and independent hand-signing cannot produce pixel-identical images across distinct reports---these pairs therefore establish image reuse as a concrete, threshold-free phenomenon within Firm A and confirm that replication is widespread (50 of 180 registered partners) rather than confined to a handful of CPAs.

-Second, *whole-sample signature-level rates*: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% form a long left tail consistent with a minority of hand-signers.
+Second, *signature-level distributional evidence*: Firm A's per-signature best-match cosine distribution fails to reject unimodality (Hartigan dip test $p = 0.17$, $N = 60{,}448$ Firm A signatures; Section IV-D) and exhibits a long left tail, consistent with a dominant high-similarity regime plus residual within-firm heterogeneity rather than two cleanly separated mechanisms.
+92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95 and the remaining 7.5% form the long left tail (we do not disaggregate partner-level mechanism here; see Section III-G for the scope of claims).
+The unimodal-long-tail shape, not the precise 92.5/7.5 split, is the structural evidence: it predicts that Firm A is replication-dominated rather than a clean two-class population, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).

-Third, *accountant-level mixture analysis* (Section IV-E): a BIC-selected three-component Gaussian mixture over per-accountant mean cosine and mean dHash places 139 of the 171 Firm A CPAs (with $\geq 10$ signatures) in the high-replication C1 cluster and 32 in the middle-band C2 cluster, directly quantifying the within-firm heterogeneity.
+Third, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-G. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
+  (a) *Longitudinal stability (Section IV-G.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P7.5 heuristic (Section III-K; 92.5% of whole-sample Firm A signatures exceed this cutoff); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
+  (b) *Partner-level similarity ranking (Section IV-G.2).* When every auditor-year is ranked globally by its per-auditor-year mean best-match cosine (across all firms: Big-4 and Non-Big-4), Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
+  (c) *Intra-report consistency (Section IV-G.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.

-Fourth, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-H. Two of them are fully threshold-free, and one uses the downstream classifier as an internal consistency check:
-  (a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The fixed 0.95 cutoff is not calibrated to Firm A; the stability itself is the finding.
-  (b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
-  (c) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the calibrated classifier and therefore is a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.
+The 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the byte-level pixel-identity evidence, the unimodal-long-tail dip-test result, the three complementary analyses above, and the held-out Firm A fold (described in Section III-J; fold-level rate differences are disclosed in Section IV-F.2).
+Firm A's replication-dominated status itself was *not* derived from the thresholds we calibrate against it; it rests on the byte-level pair evidence and the dip-test-confirmed unimodal-long-tail shape, both of which are independent of any threshold choice.
+The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section IV-D)---and for avoiding overclaim in downstream inference.

-We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the visual inspection, the accountant-level mixture, the three complementary analyses above, and the held-out Firm A fold described in Section III-K.
+## I. Signature-Level Threshold Characterisation

-We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
-Its identification rests on visual evidence and accountant-level clustering that is independent of the statistical pipeline.
-The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.
+This section describes how we set the operational classifier's similarity threshold and how we characterise the per-signature similarity distribution that supports it.
+The two roles are kept separate by design.

-## I. Three-Method Convergent Threshold Determination
+**Operational threshold (used by the classifier).** The cosine cut is anchored on the whole-sample Firm A P7.5 percentile (cos $> 0.95$; Section III-K).

-Direct assignment of thresholds based on prior intuition (e.g., cosine $\geq 0.95$ for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others.
-To place threshold selection on a statistically principled and data-driven footing, we apply *three methodologically distinct* methods whose underlying assumptions decrease in strength.
-The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence is therefore a diagnostic of distributional structure rather than a formal statistical guarantee.
-When the three estimates agree, the decision boundary is robust to the choice of method; when they disagree, the disagreement itself is informative about whether the data support a single clean decision boundary at a given level.
+**Statistical characterisation (used to motivate the choice of anchor and to describe the distributional structure).** A Hartigan dip test, an EM-fitted Beta mixture (with logit-Gaussian robustness check), and a Burgstahler-Dichev / McCrary density-smoothness procedure---all applied at the per-signature level (Section IV-D).
+
+The reason for the split is empirical.
+The three statistical diagnostics jointly find that per-signature similarity forms a continuous quality spectrum (Section IV-D, summarised below): the dip test fails to reject unimodality for Firm A; BIC strongly prefers a 3-component over a 2-component Beta fit, so the 2-component crossing is a forced fit; and the BD/McCrary candidate transition lies inside the non-hand-signed mode rather than between modes (and is not bin-width-stable; Appendix A).
+Under these conditions the natural anchor for an operational cosine cut is a transparent percentile of a replication-dominated reference population (Firm A) rather than a mixture-fit crossing whose location depends on parametric assumptions the data do not support.
+
+We describe the three diagnostics and the assumptions underlying each in the subsections below.
+The two threshold estimators rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the Beta mixture additionally requires a parametric specification, and the logit-Gaussian cross-check reports sensitivity to that form.
+The Burgstahler-Dichev / McCrary procedure is applied to the same distribution as a *density-smoothness diagnostic*: it would identify a sharp local density discontinuity if one existed at the boundary between two cleanly separated mechanisms.
+Because all three diagnostics are applied to the same sample rather than to independent experiments, agreement or disagreement among them is read as evidence about distributional structure rather than as a formal statistical guarantee.

 ### 1) Method 1: KDE Antimode / Crossover with Unimodality Test

 We use two closely related KDE-based threshold estimators and apply each where it is appropriate.
 When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the *KDE crossover* is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes.
-When a single distribution is analyzed (e.g., the per-accountant cosine mean of Section IV-E) the *KDE antimode* is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal.
-In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality (rejecting the null of unimodality is consistent with but does not directly establish bimodality specifically), and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify threshold stability.
+When a single distribution is analysed (e.g., the per-signature best-match cosine distribution of Section IV-D) the *KDE antimode* is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal.
+In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality.
+The dip test asks one question: *is the distribution single-peaked?*
+A non-significant $p$-value means we cannot reject the single-peak null (the data are consistent with one peak); a significant $p$-value means the distribution has *more than one peak* (it could be two, three, or more---the test does not specify how many).
+We use the test to decide whether a KDE antimode is well-defined (it is, only when there is more than one peak), not to assert any particular number of components.
+We additionally perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify threshold stability.

-### 2) Method 2: Burgstahler-Dichev / McCrary Discontinuity
-
-We additionally apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39].
-We discretize the cosine distribution into bins of width 0.005 (and dHash into integer bins) and compute, for each bin $i$ with count $n_i$, the standardized deviation from the smooth-null expectation of the average of its neighbours,
-
-$$Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},$$
-
-which is approximately $N(0,1)$ under the null of distributional smoothness.
-A threshold is identified at the transition where $Z_{i-1}$ is significantly negative (observed count below expectation) adjacent to $Z_i$ significantly positive (observed count above expectation); equivalently, for distributions where the non-hand-signed peak sits to the right of a valley, the transition $Z^- \rightarrow Z^+$ marks the candidate decision boundary.
-
-### 3) Method 3: Finite Mixture Model via EM
+### 2) Method 2: Finite Mixture Model via EM

 We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data).
 The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread).
@@ -195,70 +220,70 @@ As a robustness check against the Beta parametric form we fit a parallel two-com
 White's [41] quasi-MLE consistency result justifies interpreting the logit-Gaussian estimates as asymptotic approximations to the best Gaussian-family fit under misspecification; we use the cross-check between Beta and logit-Gaussian crossings as a diagnostic of parametric-form sensitivity rather than as a guarantee of distributional recovery.

 We fit 2- and 3-component variants of each mixture and report BIC for model selection.
-When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.
+When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit; we report the resulting crossing only as a forced-fit descriptive reference and do not use it as an operational threshold.

-### 4) Convergent Validation and Level-Shift Diagnostic
+### 3) Density-Smoothness Diagnostic: Burgstahler-Dichev / McCrary

-The three methods rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the BD/McCrary test requires only local smoothness under the null; the Beta mixture additionally requires a parametric specification.
-If the three estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.
+Complementing the two threshold estimators above, we apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39], as a *density-smoothness diagnostic* rather than as a third threshold estimator.
+We discretize each distribution (cosine into bins of width 0.005; $\text{dHash}_\text{indep}$ into integer bins) and compute, for each bin $i$ with count $n_i$, the standardized deviation from the smooth-null expectation of the average of its neighbours,

-Equally informative is the *level at which the methods agree or disagree*.
-Applied to the per-signature similarity distribution the three methods yield thresholds spread across a wide range because per-signature similarity is not a cleanly bimodal population (Section IV-D).
-Applied to the per-accountant cosine mean, Methods 1 (KDE antimode) and 3 (Beta-mixture crossing and its logit-Gaussian counterpart) converge within a narrow band, whereas Method 2 (BD/McCrary) does not produce a significant transition because the accountant-mean distribution is smooth at the bin resolution the test requires.
-This pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than a discrete discontinuity, and we interpret it accordingly in Section V rather than treating disagreement among methods as a failure.
+$$Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},$$

-### 5) Accountant-Level Three-Method Analysis
+which is approximately $N(0,1)$ under the null of distributional smoothness.
+A candidate transition is identified at an adjacent bin pair where $Z_{i-1}$ is significantly negative and $Z_i$ is significantly positive (cosine) or the reverse (dHash).
+Appendix A reports a bin-width sensitivity sweep covering $\text{bin} \in \{0.003, 0.005, 0.010, 0.015\}$ for cosine and $\text{bin} \in \{1, 2, 3\}$ for dHash; the sweep shows that signature-level BD transitions are not bin-width-stable, consistent with histogram-resolution artifacts rather than a genuine cross-mode density discontinuity.
+We therefore do not treat the BD/McCrary procedure as a threshold estimator in our application but as diagnostic evidence about distributional smoothness.

-In addition to applying the three methods at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures.
-The accountant-level estimates provide the methodologically defensible threshold reference used in the per-document classification of Section III-L.
-All three methods are reported with their estimates and, where applicable, cross-method spreads.
+### 4) Reading the Three Diagnostics Together

-## J. Accountant-Level Mixture Model
+The two threshold estimators rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the Beta mixture additionally requires a parametric specification (with logit-Gaussian as a robustness cross-check against that form).
+If the two estimated thresholds were to differ by less than a practically meaningful margin and the BD/McCrary procedure were to identify a sharp transition at the same level, that pattern would constitute convergent evidence for a clean two-mechanism boundary at that location.

-In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
-The motivation is the expectation---consistent with industry-practice knowledge at Firm A---that an individual CPA's signing *practice* is clustered (typically consistent adoption of non-hand-signing or consistent hand-signing within a given year) even when the output pixel-level *quality* lies on a continuous spectrum.
+This is *not* the pattern we observe at the per-signature level.
+The two threshold estimators yield crossings spread across a wide range (Section IV-D); the BIC clearly prefers a 3-component over a 2-component Beta fit, indicating that the 2-component crossing is a forced fit reported only as a descriptive reference rather than as an operational threshold; and the BD/McCrary procedure locates its candidate transition *inside* the non-hand-signed mode rather than between modes (Appendix A).
+We interpret this jointly as evidence that per-signature similarity is a continuous quality spectrum rather than a clean two-mechanism mixture, and we accordingly anchor the operational classifier's cosine cut on whole-sample Firm A percentile heuristics (Section III-K) rather than on a mixture-fit crossing.

-We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
-For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
-
-## K. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation (No Manual Annotation)
+## J. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation (No Manual Annotation)

 Rather than construct a stratified manual-annotation validation set, we validate the classifier using four naturally occurring reference populations that require no human labeling:

 1. **Pixel-identical anchor (gold positive, conservative subset):** signatures whose nearest same-CPA match is byte-identical after crop and normalization.
-Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth *for the byte-identical subset* of non-hand-signed signatures.
-We emphasize that this anchor is a *subset* of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).
+Handwriting physics makes byte-identity impossible under independent signing events, so a byte-identical same-CPA pair is pair-level proof of image reuse and---for the byte-identical subset---conservative ground truth for non-hand-signed signatures; the narrow exception, in which a genuinely hand-signed exemplar was subsequently reused as the stamping or e-signature template, is discussed as a Limitation in Section V-G.
+We further emphasize that this anchor is a *subset* of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).

 2. **Inter-CPA negative anchor (large gold negative):** $\sim$50,000 pairs of signatures randomly sampled from *different* CPAs.
 Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps.
 This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold.

-3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail contains a minority of hand-signers, as directly evidenced by the 32/171 middle-band share in the accountant-level mixture (Section III-H).
-Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we break the resulting circularity by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% *calibration* fold and a 30% *heldout* fold.
-Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions are derived from the calibration fold only.
+3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference with within-firm heterogeneity in the left tail, as evidenced by the 7.5% of Firm A signatures whose per-signature best-match cosine falls at or below 0.95 (Section III-H, Section IV-D).
+Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we make the within-Firm-A sampling variance visible by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% *calibration* fold and a 30% *heldout* fold.
+The calibration-fold percentiles used in thresholding---cosine median, P1, and P5 (lower-tail, since higher cosine indicates greater similarity), and dHash_indep median and P95 (upper-tail, since lower dHash indicates greater similarity)---are derived from the 70% calibration fold only.
 The heldout fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals.

 4. **Low-similarity same-CPA anchor (supplementary negative):** signatures whose maximum same-CPA cosine similarity is below 0.70.
 This anchor is retained for continuity with prior work but is small in our dataset ($n = 35$) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.

-From these anchors we report FAR with Wilson 95% confidence intervals (against the inter-CPA negative anchor) and FRR (against the byte-identical positive anchor), together with the Equal Error Rate (EER) interpolated at the threshold where FAR $=$ FRR, following biometric-verification reporting conventions [3].
+From these anchors we report FAR with Wilson 95% confidence intervals against the inter-CPA negative anchor.
+We do not report an Equal Error Rate or FRR column against the byte-identical positive anchor, because byte-identical pairs have cosine $\approx 1$ by construction and any FRR computed against that subset is trivially $0$ at every threshold below $1$; the conservative-subset role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
 Precision and $F_1$ are not meaningful in this anchor-based evaluation because the positive and negative anchors are constructed from different sampling units (intra-CPA byte-identical pairs vs random inter-CPA pairs), so their relative prevalence in the combined set is an arbitrary construction rather than a population parameter; we therefore omit precision and $F_1$ from Table X.
-The 70/30 held-out Firm A fold of Section IV-G.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.
-We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
+The 70/30 held-out Firm A fold of Section IV-F.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.

-## L. Per-Document Classification
+## K. Per-Document Classification
+
+The per-signature classifier operates at the signature level with operational thresholds anchored on whole-sample Firm A percentile heuristics: cos $> 0.95$ (Firm A P7.5) for the cosine dimension and dHash$_\text{indep} \leq 5$ / $> 15$ (Firm A median+P75 / style-consistency ceiling) for the structural dimension.
+This percentile-based anchor is the natural choice given the continuous-spectrum shape of the per-signature similarity distribution documented in Section IV-D; sensitivity to nearby alternatives is reported in Section IV-F.3.
+All dHash references in this section refer to the *independent-minimum* dHash defined in Section III-G---the smallest Hamming distance from a signature to any other same-CPA signature.
+We use a single dHash statistic throughout the operational classifier and the supporting capture-rate analyses (Tables IX, XI, XII, XVI), which keeps the classifier definition and its empirical evaluation arithmetically consistent.

-The per-signature classifier operates at the signature level and uses whole-sample Firm A percentile heuristics as its operational thresholds, while the three-method analysis of Section IV-E operates at the accountant level and supplies a *convergent* external reference for the operational cuts.
-Because the two analyses are at different units (signature vs accountant) we treat them as complementary rather than substitutable: the accountant-level convergence band cos $\in [0.945, 0.979]$ anchors the signature-level operational cut cos $> 0.95$ used below, and Section IV-G.3 reports a sensitivity analysis in which cos $> 0.95$ is replaced by the accountant-level 2D-GMM marginal crossing cos $> 0.945$.
 We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:

-1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND dHash $\leq 5$.
+1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 5$.
 Both descriptors converge on strong replication evidence.

-2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND $5 < $ dHash $\leq 15$.
+2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND $5 < \text{dHash}_\text{indep} \leq 15$.
 Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff, potentially due to scan variations.

-3. **High style consistency:** Cosine $> 0.95$ AND dHash $> 15$.
+3. **High style consistency:** Cosine $> 0.95$ AND $\text{dHash}_\text{indep} > 15$.
 High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.

 4. **Uncertain:** Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
@@ -266,18 +291,21 @@ High feature-level similarity without structural corroboration---consistent with
 5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.

 We note three conventions about the thresholds.
-First, the dHash cutoffs $\leq 5$ and $\leq 15$ correspond to the whole-sample Firm A *cosine-conditional* dHash distribution's median and 95th percentile (the dHash to the cosine-nearest same-CPA match), not to the *independent-minimum* dHash distribution we use elsewhere.
-The two dHash statistics are related but not identical: the whole-sample cosine-conditional distribution has median $= 5$ and 95th percentile $= 15$, while the calibration-fold independent-minimum distribution has median $= 2$ and 95th percentile $= 9$.
-The classifier retains the cosine-conditional cutoffs for continuity with the preceding version of this work while the anchor-level capture-rate analysis reports both cosine-conditional and independent-minimum rates for comparability.
-Second, the cosine cutoff $0.95$ is the whole-sample Firm A P95 heuristic (chosen for its transparent interpretation in the whole-sample reference distribution) and the cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; neither cutoff is re-derived from the 70% calibration fold specifically, so the classifier inherits its operational thresholds from the whole-sample Firm A distribution and the all-pairs distribution rather than from the calibration fold.
-The held-out fold of Section IV-G.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that the fold-level sampling variance is visible.
-Third, the three accountant-level 1D estimators (KDE antimode $0.973$, Beta-2 crossing $0.979$, logit-GMM-2 crossing $0.976$) and the accountant-level 2D GMM marginal ($0.945$) are *not* the operational thresholds of this classifier: they are the *convergent external reference* that supports the choice of signature-level operational cut.
-Section IV-G.3 reports the classifier's five-way output under the nearby operational cut cos $> 0.945$ as a sensitivity check; the aggregate firm-level capture rates change by at most $\approx 1.2$ percentage points (e.g., the operational dual rule cos $> 0.95$ AND dHash $\leq 8$ captures 89.95% of whole Firm A versus 91.14% at cos $> 0.945$), and category-level shifts are concentrated at the Uncertain/Moderate-confidence boundary (Section IV-G.3).
+First, the cosine cutoff $0.95$ is the *operating point* chosen for the five-way classifier from a small grid of candidate cuts, on the basis of an explicit capture-vs-FAR tradeoff against the inter-CPA negative anchor of Section III-J---*not* a discovered natural boundary in the per-signature distribution.
+The candidate grid spans the calibration-fold P5 (0.9407), its rounded value (0.945), the operational anchor (0.95), and two reference points drawn from the signature-level threshold-estimator outputs of Section IV-D (the Firm A Beta-2 forced-fit crossing 0.977 and the BD/McCrary candidate transition 0.985); for each grid point Section IV-F.3 reports the Firm A capture rate, the non-Firm-A capture rate, and the inter-CPA FAR with Wilson 95% CI (Table XII-B).
+Three considerations motivate the operating point at 0.95.
+(i) *Inter-CPA specificity.* At cosine $> 0.95$ the inter-CPA FAR against the 50,000-pair anchor of Section IV-F.1 is $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$): one in two thousand random cross-CPA pairs exceeds the cut, an order-of-magnitude margin against the working assumption that random cross-CPA pairs do not arise from image reuse.
+(ii) *Capture stability under nearby alternatives.* Moving the cut to $0.945$ raises Firm A capture by 1.51 percentage points (operational dual rule cos $> t$ AND $\text{dHash}_\text{indep} \leq 15$; Section IV-F.3) and inter-CPA FAR by $0.00032$, while moving it to the calibration-fold P5 of $0.9407$ raises Firm A capture by 2.63 percentage points and inter-CPA FAR by $0.00076$; in either direction the qualitative finding---Firm A is replication-dominated, non-Firm-A capture is much lower at the same cut, and the inter-CPA noise floor is small---is preserved.
+(iii) *Interpretive transparency.* The complement $7.5\%$ corresponds to the whole-sample Firm A P7.5 of the per-signature best-match cosine distribution---that is, $92.5\%$ of whole-sample Firm A signatures exceed this cutoff and $7.5\%$ fall at or below it (Section III-H)---which gives the operational cut a transparent reading in the replication-dominated reference population without requiring a parametric mixture fit that the data of Section IV-D do not support.
+The cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; both $0.95$ and $0.837$ are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions.
+Section IV-F.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible; Section IV-F.3 (Table XII-B) reports the full capture-vs-FAR tradeoff at the candidate grid above.
+Second, the dHash cutoffs $\leq 5$ and $> 15$ are chosen from the whole-sample Firm A $\text{dHash}_\text{indep}$ distribution: $\leq 5$ captures the upper tail of the high-similarity mode (whole-sample Firm A median $\text{dHash}_\text{indep} = 2$, P75 $\approx 4$, so $\leq 5$ is the band immediately above median), while $> 15$ marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction.
+Third, the signature-level threshold-estimator outputs of Section IV-D (KDE antimode, Beta-mixture and logit-Gaussian crossings, BD/McCrary diagnostic) are *not* the operational thresholds of this classifier: they are descriptive characterisation of the per-signature similarity distribution, and Section IV-D shows they do not converge to a clean two-mechanism boundary at the per-signature level---which is why the operational cosine cut is anchored on the whole-sample Firm A percentile rather than on any mixture-fit crossing.

 Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed determines the document's classification).
 This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.

-## M. Data Source and Firm Anonymization
+## L. Data Source and Firm Anonymization

 **Audit-report corpus.** The 90,282 audit-report PDFs analyzed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation.
 MOPS is the statutory public-disclosure platform for Taiwan-listed companies; every audit report filed on MOPS is already a publicly accessible regulatory document.
@@ -286,4 +314,3 @@ The CPA registry used to map signatures to CPAs is a publicly available audit-fi

 **Firm-level anonymization.** Although all audit reports and CPA identities in the corpus are public, we report firm-level results under the pseudonyms Firm A / B / C / D throughout this paper to avoid naming specific accounting firms in descriptive rate comparisons.
 Readers with domain familiarity may still infer Firm A from contextual descriptors (Big-4 status, replication-dominated behavior); we disclose this residual identifiability explicitly and note that none of the paper's conclusions depend on the specific firm's name.
-Authors declare no conflict of interest with Firm A, Firm B, Firm C, or Firm D.
@@ -10,7 +10,7 @@

 [4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.

-[5] I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," *Appl. Sci.*, vol. 10, no. 11, p. 3716, 2020.
+[5] H.-H. Kao and C.-Y. Wen, "An offline signature verification and forgery detection method based on a single known sample and an explainable deep learning approach," *Appl. Sci.*, vol. 10, no. 11, p. 3716, 2020.

 [6] H. Li et al., "TransOSV: Offline signature verification with transformers," *Pattern Recognit.*, vol. 145, p. 109882, 2024.

@@ -32,7 +32,7 @@

 [15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," *IEEE Trans. Inf. Forensics Security*, vol. 19, pp. 1342–1356, 2024.

-[16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," *IEEE Trans. Inf. Forensics Security*, vol. 15, pp. 1735–1745, 2019.
+[16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," *IEEE Trans. Inf. Forensics Security*, vol. 15, pp. 1735–1745, 2020.

 [17] H. Farid, "Image forgery detection," *IEEE Signal Process. Mag.*, vol. 26, no. 2, pp. 16–25, 2009.

@@ -42,15 +42,15 @@

 [20] D. Engin et al., "Offline signature verification on real-world documents," in *Proc. CVPRW*, 2020.

-[21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," *Expert Syst. Appl.*, 2022.
+[21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," *Expert Syst. Appl.*, vol. 189, art. 116136, 2022.

-[22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification," *Procedia Comput. Sci.*, vol. 270, 2025.
+[22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification across multilingual datasets," *Procedia Comput. Sci.*, vol. 270, pp. 4024–4033, 2025.

 [23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in *Proc. ECCV*, 2014, pp. 584–599.

 [24] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv:2502.13923, 2025. [Online]. Available: https://arxiv.org/abs/2502.13923

-[25] Ultralytics, "YOLOv11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/
+[25] Ultralytics, "YOLO11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/models/yolo11/

 [26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. CVPR*, 2016.

@@ -6,7 +6,7 @@ Offline signature verification---determining whether a static signature image is
 Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant.
 Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work.
 Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining.
-Hadjadj et al. [5] addressed the practical constraint of limited reference samples, achieving competitive verification accuracy using only a single known genuine signature per writer.
+Kao and Wen [5] addressed offline verification and forgery detection using only a single known genuine signature per writer with an explainable deep-learning approach.
 More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results.
 Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives.
 Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer.
@@ -64,7 +64,7 @@ The statistical validity of the unimodality-vs-multimodality dichotomy can be te
 Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions.
 Under the null that the distribution is generated by a single smooth process, the expected count in any histogram bin equals the average of its two neighbours, and the standardized deviation from this expectation is approximately $N(0,1)$.
 The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis.
-The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions.
+The BD/McCrary pairing provides a local-density-discontinuity diagnostic that is informative about distributional smoothness under minimal assumptions; we use it in that diagnostic role (rather than as a threshold estimator) because its transitions in our corpus are bin-width-sensitive at the signature level and rarely significant at the accountant level (Appendix A).

 *Finite mixture models.*
 When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters.
@@ -76,7 +76,7 @@ The present study combines all three families, using each to produce an independ
 REFERENCES for Related Work (see paper_a_references_v3.md for full list):
 [3]  Bromley et al. 1993 — Siamese TDNN (NeurIPS)
 [4]  Dey et al. 2017 — SigNet
-[5]  Hadjadj et al. 2020 — Single sample SV
+[5]  Kao & Wen 2020 — Single-sample SV with forgery detection
 [6]  Li et al. 2024 — TransOSV
 [7]  Tehsin et al. 2024 — Triplet Siamese
 [8]  Brimoh & Olisah 2024 — Consensus threshold
@@ -2,9 +2,10 @@

 ## A. Experimental Setup

-All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration.
+Experiments used mixed hardware: YOLOv11n training and inference for signature detection, and ResNet-50 forward inference for feature extraction over all 182,328 detected signatures, were performed on an Nvidia RTX 4090 (CUDA); the downstream statistical analyses (KDE antimode, Hartigan dip test, Beta-mixture EM with logit-Gaussian robustness check, Burgstahler-Dichev/McCrary density-smoothness diagnostic, and pairwise cosine/dHash computations) were performed on an Apple Silicon workstation with Metal Performance Shaders (MPS) acceleration.
 Feature extraction used PyTorch 2.9 with torchvision model implementations.
 The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
+Because all steps rely on deterministic forward inference over fixed pre-trained weights (no fine-tuning) plus fixed-seed numerical procedures, reported results are platform-independent to within floating-point precision.

 ## B. Signature Detection Performance

@@ -27,7 +28,7 @@ The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliabi
 ## C. All-Pairs Intra-vs-Inter Class Distribution Analysis

 Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
-This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
+This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-K).
 Table IV summarizes the distributional statistics.

 <!-- TABLE IV: Cosine Similarity Distribution Statistics
@@ -43,19 +44,25 @@ Table IV summarizes the distributional statistics.

 Both distributions are left-skewed and leptokurtic.
 Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
-Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
+Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent threshold-estimator outputs reported in Section IV-D are derived via the methods of Section III-I to avoid single-family distributional assumptions.

 The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
 Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
-Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney $p < 0.001$, K-S 2-sample $p < 0.001$).
+Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).

 We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
 We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
 A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.

-## D. Hartigan Dip Test: Unimodality at the Signature Level
+## D. Signature-Level Distributional Characterisation
+
+This section applies the threshold-estimator and density-smoothness diagnostic of Section III-I to the per-signature similarity distribution.
+The joint reading is that per-signature similarity is a continuous quality spectrum rather than a clean two-mechanism mixture, which is why the operational classifier (Section III-K) anchors its cosine cut on the whole-sample Firm A P7.5 percentile rather than on any mixture-fit crossing.
+
+### 1) Hartigan Dip Test: Unimodality at the Signature Level

 Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
+The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-signature best-match analyses (Tables V and XII, and the Firm-A per-signature rows of Tables XIII and XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.

 <!-- TABLE V: Hartigan Dip Test Results
 | Distribution | N | dip | p-value | Verdict (α=0.05) |
@@ -64,223 +71,263 @@ Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match d
 | Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
 | All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
 | All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
-| Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal |
-| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
 -->

-Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in the accountant-level mixture (Section IV-E).
+Firm A's per-signature cosine distribution *fails to reject unimodality* ($p = 0.17$), a pattern consistent with a dominant high-similarity regime plus a long left tail attributable to within-firm heterogeneity in signing outputs (Section III-G discusses the scope of partner-level claims).
 The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
-At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.
+The Firm A unimodal-long-tail finding is, in conjunction with the byte-identity, partner-ranking, and intra-report evidence reported below, consistent with the replication-dominated framing (Section III-H): a dominant high-similarity regime plus residual within-firm heterogeneity, rather than two cleanly separated mechanisms.

-This asymmetry between signature level and accountant level is itself an empirical finding.
-It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
+### 2) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic

-### 1) Burgstahler-Dichev / McCrary Discontinuity
+Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine distribution yields a nominally significant $Z^- \rightarrow Z^+$ transition at cosine 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample under the bin width ($0.005$ / $1$) used here.
+Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
+First, the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
+Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
+We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator.

-Applying the BD/McCrary test (Section III-I.2) to the per-signature cosine distribution yields a single significant transition at 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample.
-We note that the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
-In contrast, the dHash transition at distance 2 is a substantively meaningful structural boundary that corresponds to the natural separation between pixel-near-identical replication and scan-noise-perturbed replication.
-At the accountant level the test does not produce a significant $Z^- \rightarrow Z^+$ transition in either the cosine-mean or the dHash-mean distribution (Section IV-E), reflecting that accountant aggregates are smooth at the bin resolution the test requires rather than exhibiting a sharp density discontinuity.
-
-### 2) Beta Mixture at Signature Level: A Forced Fit
+### 3) Beta Mixture at Signature Level: A Forced Fit

 Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
 For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
 Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
 Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.

-The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
-Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.
-This motivates the pivot to the accountant-level analysis in Section IV-E, where aggregation over signatures reveals clustered (though not sharply discrete) patterns in individual-level signing *practice* that the signature-level analysis lacks.
+### 4) Joint Reading of the Three Diagnostics

-## E. Accountant-Level Gaussian Mixture
+The three diagnostics agree that per-signature similarity does not form a clean two-mechanism mixture:
+(i) the Hartigan dip test fails to reject unimodality for Firm A and rejects it for the heterogeneous-firm pooled sample;
+(ii) BIC strongly prefers a 3-component over a 2-component Beta fit, so the 2-component crossing is a *forced fit* and the Beta-vs-logit-Gaussian disagreement (0.977 vs 0.999 for Firm A) reflects parametric-form sensitivity rather than a stable two-mechanism boundary;
+(iii) the BD/McCrary procedure locates its candidate transition *inside* the non-hand-signed mode rather than between modes, and the transition is not bin-width-stable.

-We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
-BIC selects $K^* = 3$ (Table VI).
+Table VI summarises the signature-level threshold-estimator outputs for cross-method comparison.

-<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)
-| K | BIC | AIC | Converged |
-|---|-----|-----|-----------|
-| 1 | −316 | −339 | ✓ |
-| 2 | −545 | −595 | ✓ |
-| 3 | **−792** | **−869** | ✓  (best) |
-| 4 | −779 | −883 | ✓ |
-| 5 | −747 | −879 | ✓ |
+<!-- TABLE VI: Signature-Level Threshold-Estimator Summary
+| Population | Method | Cosine threshold | dHash threshold | Status |
+|------------|--------|------------------|-----------------|--------|
+| **Threshold estimators (signature-level distributional fits)** | | | | |
+| Firm A signature-level    | KDE antimode + Hartigan dip (Section III-I.1)  | undefined           | —    | unimodal at $\alpha=0.05$ ($p=0.169$); antimode not defined for unimodal data |
+| Firm A signature-level    | Beta-2 EM crossing (Section III-I.2)           | 0.977               | —    | forced fit; BIC strongly prefers $K{=}3$ ($\Delta\text{BIC} = 381$) |
+| Firm A signature-level    | logit-Gaussian-2 crossing (robustness check)   | 0.999               | —    | forced fit; sharply inconsistent with Beta-2 crossing—reflects parametric-form sensitivity |
+| Full-sample signature-l.  | KDE antimode + Hartigan dip                    | (multiple modes)    | —    | multimodal ($p<0.001$); KDE crossover at full-sample is dominated by between-firm heterogeneity |
+| Full-sample signature-l.  | Beta-2 EM crossing                             | no crossing         | —    | forced fit; component densities do not cross over $[0,1]$ under recovered parameters |
+| Full-sample signature-l.  | logit-Gaussian-2 crossing                      | 0.980               | —    | forced fit; BIC strongly prefers $K{=}3$ ($\Delta\text{BIC} = 10{,}175$) |
+| **Density-smoothness diagnostics (not threshold estimators)** | | | | |
+| Firm A signature-level    | BD/McCrary candidate transition (Section III-I.3) | 0.985 (bin 0.005)| 2.0 (bin 1) | bin-unstable across $\{0.003, 0.005, 0.010, 0.015\}$ (Appendix A); transition lies *inside* the non-hand-signed mode |
+| Full-sample signature-l.  | BD/McCrary candidate transition                | 0.985 (bin 0.005)   | 2.0 (bin 1) | bin-unstable across $\{0.003, 0.005, 0.010, 0.015\}$ (Appendix A) |
+| **Reference: between-class KDE (different unit of analysis)** | | | | |
+| All-pairs intra/inter (pair-level; Section IV-C) | KDE crossover | 0.837 | — | reference point for the Uncertain/Likely-hand-signed boundary in the operational classifier |
+| **Operational classifier anchors and percentile cross-references** | | | | |
+| Firm A whole-sample       | P7.5 (operational anchor; Section III-K)       | 0.95                | —    | operational cosine cut for the five-way classifier |
+| Firm A whole-sample       | dHash$_\text{indep}$ P75                       | —                   | 4    | informs the $\leq 5$ high-confidence band edge in the classifier |
+| Firm A whole-sample       | dHash$_\text{indep}$ style-consistency ceiling | —                   | 15   | operational $> 15$ style-consistency boundary |
+| Firm A calibration fold (70%) | cosine P5 (Section IV-F.2)                  | 0.9407              | —    | calibration-fold cross-reference; held-out fold reports rates at this cut |
+| Firm A calibration fold (70%) | dHash$_\text{indep}$ P95                    | —                   | 9    | calibration-fold cross-reference (Tables IX and XI report rates at the rounded $\leq 8$ cut for continuity) |
+
+Read this table by *population × method*: each row reports one method applied to one population.
+The first three blocks (threshold estimators; density-smoothness diagnostics; between-class KDE) are *characterisation* outputs; the bottom block is the operational anchor set used by the classifier of Section III-K.
+The disagreement between Firm A Beta-2 (0.977) and Firm A logit-Gaussian-2 (0.999) is the parametric-form sensitivity referenced in the prose of Section IV-D.3; it cannot be resolved from the data because BIC rejects the underlying $K{=}2$ assumption itself.
 -->

-Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.
+Non-hand-signed replication quality is therefore best read as a continuous spectrum produced by firm-specific reproduction technologies (administrative stamping in early years, firm-level e-signing later) acting on a common stored exemplar.
+This finding has a direct methodological pay-off: it is *why* the operational cosine cut is anchored on the whole-sample Firm A P7.5 percentile (Section III-K), and it is *why* the byte-level pixel-identity anchor (Section IV-F.1) is the natural threshold-free positive reference for downstream validation.

-<!-- TABLE VII: Accountant-Level 3-Component GMM
-| Comp. | cos_mean | dHash_mean | weight | n | Dominant firms |
-|-------|----------|------------|--------|---|----------------|
-| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) |
-| C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) |
-| C3 (hand-signed tendency) | 0.928 | 11.17 | 0.28 | 184 | smaller domestic firms |
-->
-
-Three empirical findings stand out.
-First, of the 180 CPAs in the Firm A registry, 171 have $\geq 10$ signatures and therefore enter the accountant-level GMM (the remaining 9 have too few signatures for reliable aggregates and are excluded from this analysis only).
-Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2.
-This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
-Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
-Third, applying the three-method framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary test does not produce a significant transition at the accountant level.
-For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.
-
-Table VIII summarizes all threshold estimates produced by the three methods across the two analysis levels for a compact cross-level comparison.
-
-<!-- TABLE VIII: Threshold Convergence Summary Across Levels
-| Level / method | Cosine threshold | dHash threshold |
-|----------------|-------------------|------------------|
-| Signature-level, all-pairs KDE crossover | 0.837 | — |
-| Signature-level, BD/McCrary transition | 0.985 | 2.0 |
-| Signature-level, Beta-2 EM crossing (Firm A) | 0.977 | — |
-| Signature-level, logit-GMM-2 crossing (Full) | 0.980 | — |
-| Accountant-level, KDE antimode | **0.973** | **4.07** |
-| Accountant-level, BD/McCrary transition | no transition | no transition |
-| Accountant-level, Beta-2 EM crossing | **0.979** | **3.41** |
-| Accountant-level, logit-GMM-2 crossing | **0.976** | **3.93** |
-| Accountant-level, 2D-GMM 2-comp marginal crossing | 0.945 | 8.10 |
-| Firm A calibration-fold cosine P5 | 0.941 | — |
-| Firm A calibration-fold dHash P95 | — | 9 |
-| Firm A calibration-fold dHash median | — | 2 |
-->
-
-Methods 1 and 3 (KDE antimode, Beta-2 crossing, and its logit-GMM robustness check) converge at the accountant level to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$, while Method 2 (BD/McCrary) does not produce a significant discontinuity.
-This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
-The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
-
-## F. Calibration Validation with Firm A
+## E. Calibration Validation with Firm A

 Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
 Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).

 <!-- TABLE IX: Firm A Whole-Sample Capture Rates (consistency check, NOT external validation)
-| Rule | Firm A rate | n / N |
+| Rule | Firm A rate | k / N |
 |------|-------------|-------|
-| cosine > 0.837 (all-pairs KDE crossover) | 99.93% | 60,405 / 60,448 |
-| cosine > 0.941 (calibration-fold P5) | 95.08% | 57,473 / 60,448 |
-| cosine > 0.945 (2D GMM marginal crossing) | 94.52% | 57,131 / 60,448 |
-| cosine > 0.95 | 92.51% | 55,916 / 60,448 |
-| cosine > 0.973 (accountant KDE antimode) | 80.91% | 48,910 / 60,448 |
-| dHash_indep ≤ 5 (calib-fold median-adjacent) | 84.20% | 50,897 / 60,448 |
-| dHash_indep ≤ 8 | 95.17% | 57,521 / 60,448 |
-| dHash_indep ≤ 15 | 99.83% | 60,345 / 60,448 |
-| cosine > 0.95 AND dHash_indep ≤ 8 | 89.95% | 54,373 / 60,448 |
+| **Cosine-only marginal rates** | | |
+| cosine > 0.837 (all-pairs KDE crossover)                                  | 99.93% | 60,408 / 60,448 |
+| cosine > 0.9407 (calibration-fold P5)                                     | 95.15% | 57,518 / 60,448 |
+| cosine > 0.945 (calibration-fold P5 rounded)                              | 94.02% | 56,836 / 60,448 |
+| cosine > 0.95 (operational; whole-sample Firm A P7.5)                     | 92.51% | 55,922 / 60,448 |
+| **dHash-only marginal rates** | | |
+| dHash_indep ≤ 5 (operational high-confidence cap)                         | 84.20% | 50,897 / 60,448 |
+| dHash_indep ≤ 8 (calibration-fold P95 rounded)                            | 95.17% | 57,527 / 60,448 |
+| dHash_indep ≤ 15 (operational style-consistency boundary)                 | 99.83% | 60,348 / 60,448 |
+| **Operational classifier dual rules (Section III-K)** | | |
+| cosine > 0.95 AND dHash_indep ≤ 5 (high-confidence non-hand-signed)       | 81.70% | 49,389 / 60,448 |
+| cosine > 0.95 AND 5 < dHash_indep ≤ 15 (moderate-confidence)              | 10.76% | 6,503 / 60,448  |
+| cosine > 0.95 AND dHash_indep ≤ 15 (combined non-hand-signed)             | 92.46% | 55,892 / 60,448 |
+| **Calibration-fold-adjacent cross-reference (not the operational classifier rule)** | | |
+| cosine > 0.95 AND dHash_indep ≤ 8                                         | 89.95% | 54,370 / 60,448 |

-All rates computed exactly from the full Firm A sample (N = 60,448 signatures).
-The threshold 0.941 corresponds to the 5th percentile of the calibration-fold Firm A cosine distribution (see Section IV-G for the held-out validation that addresses the circularity inherent in this whole-sample table).
+All rates computed exactly from the full Firm A sample (N = 60,448 signatures); per-rule counts and codes are available in the supplementary materials.
+The two operational dHash cuts ($\leq 5$ for the high-confidence cap and $\leq 15$ for the style-consistency boundary) come from the classifier definition in Section III-K and are the rules used by the five-way classifier of Tables XII and XVII; the dHash $\leq 8$ row is *not* an operational classifier rule but a calibration-fold-adjacent reference (Section IV-F.2 calibration-fold dHash P95 = 9; we report the $\leq 8$ rate as the integer-valued threshold immediately below P95, included here so that Firm A capture in the calibration-fold-P95 neighbourhood can be read off the same table).
 -->

-Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H.
-The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with both the accountant-level crossings (Section IV-E) and the 139/32 high-replication versus middle-band split within Firm A (Section IV-E).
-Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
+Table IX is a whole-sample consistency check rather than an external validation: the cosine cut $0.95$ and the operational dHash band edges ($\leq 5$ high-confidence cap and $\leq 15$ style-consistency boundary) are themselves anchored to the whole-sample Firm A distribution described in Section III-K (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
+The operational dual rule used by the five-way classifier of Section III-K---cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ (the union of the high-confidence and moderate-confidence non-hand-signed buckets)---captures 92.46% of Firm A; the high-confidence component alone (cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 5$) captures 81.70%.
+For continuity with prior calibration-fold reporting (Section IV-F.2 reports the calibration-fold rate at the calibration-fold-P95-adjacent cut $\text{dHash}_\text{indep} \leq 8$), Table IX also lists the cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ rate of 89.95%; this is *not* the operational classifier rule but a cross-reference value.
+Both operational rates are consistent with the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 92.5% / 7.5% signature-level split (Section III-H).
+Section IV-F.2 reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.

-## G. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
+## F. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation

-We report three validation analyses corresponding to the anchors of Section III-K.
+We report three validation analyses corresponding to the anchors of Section III-J.

 ### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor

-Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
-As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
+Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G.
+Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; reproduction artifact for this Firm A decomposition is listed in Appendix B.
+As the gold-negative anchor we sample 50,000 i.i.d. random cross-CPA signature pairs from the full 168,755-signature matched corpus (inter-CPA cosine: mean $= 0.763$, $P_{95} = 0.886$, $P_{99} = 0.915$, max $= 0.992$).
 Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
-We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor and FRR against the byte-identical positive anchor in Table X; these two error rates are well defined within their respective anchor populations.
-The Equal-Error-Rate point, interpolated at FAR $=$ FRR, is located at cosine $= 0.990$ with EER $\approx 0$, which is trivially small because every byte-identical positive falls at cosine very close to 1.
+We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
+The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold.
+We do not report an Equal Error Rate: EER is meaningful only when the positive and negative error-rate curves cross in a nontrivial interior region, but byte-identical positives all sit at cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$. An EER calculation against this anchor would be arithmetic tautology rather than biometric performance, and we therefore omit it.

-<!-- TABLE X: Cosine Threshold Sweep (positives = 310 byte-identical signatures; negatives = 50,000 inter-CPA pairs)
-| Threshold | FAR | FAR 95% Wilson CI | FRR (byte-identical) |
-|-----------|-----|-------------------|----------------------|
-| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] | 0.000 |
-| 0.900                            | 0.0233 | [0.0221, 0.0247] | 0.000 |
-| 0.945 (2D GMM marginal)          | 0.0008 | [0.0006, 0.0011] | 0.000 |
-| 0.950                            | 0.0007 | [0.0005, 0.0009] | 0.000 |
-| 0.973 (accountant KDE antimode)  | 0.0003 | [0.0002, 0.0004] | 0.000 |
-| 0.979 (accountant Beta-2)        | 0.0002 | [0.0001, 0.0004] | 0.000 |
+<!-- TABLE X: Cosine Threshold Sweep — FAR Against 50,000 Inter-CPA Negative Pairs
+| Threshold | FAR | FAR 95% Wilson CI |
+|-----------|-----|-------------------|
+| 0.837 (all-pairs KDE crossover) | 0.2101 | [0.2066, 0.2137] |
+| 0.900                            | 0.0250 | [0.0237, 0.0264] |
+| 0.945 (calibration-fold P5 rounded) | 0.0008 | [0.0006, 0.0011] |
+| 0.950 (whole-sample Firm A P7.5; operational cut)  | 0.0005 | [0.0003, 0.0007] |
+| 0.977 (Firm A Beta-2 forced-fit crossing; Section IV-D)  | 0.00014 | [0.00007, 0.00029] |
+| 0.985 (BD/McCrary candidate transition; Appendix A) | 0.00004 | [0.00001, 0.00015] |
+
+Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
 -->

 Two caveats apply.
-First, the gold-positive anchor is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
-Zero FRR against this subset does not establish zero FRR against the broader positive class, and the reported FRR should therefore be interpreted as a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable miss rate.
-Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
-The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.
+First, the byte-identical positive anchor referenced above is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
+A would-be FRR computed against this subset is definitionally $0$ at every threshold below $1$ (since byte-identical pairs have cosine $\approx 1$), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
+Second, the 0.945 / 0.95 thresholds are derived from the Firm A whole-sample and calibration-fold percentiles rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
+The very low FAR at the operational cut is therefore informative about specificity against a realistic inter-CPA negative population.

-### 2) Held-Out Firm A Validation (breaks calibration-validation circularity)
+### 2) Held-Out Firm A Validation (within-Firm-A sampling variance disclosure)

 We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
-The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry and which we therefore exclude from both folds; this handling is made explicit here and has no effect on the accountant-level mixture analysis of Section IV-E, which uses the $\geq 10$-signature subset of 171 CPAs.
+The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two registered Firm A partners whose signatures in the corpus are singletons (only one signature each, so the per-signature best-match cosine is undefined and they do not appear in the same-CPA matched-signature table that script `24_validation_recalibration.py` reads); they are therefore not represented in either fold by construction rather than by an explicit exclusion rule.
 Thresholds are re-derived from calibration-fold percentiles only.
 Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.

 <!-- TABLE XI: Held-Out vs Calibration Firm A Capture Rates and Generalization Test
 | Rule | Calibration 70% fold (CI) | Held-out 30% fold (CI) | 2-prop z | p | k/n calib | k/n held |
 |------|---------------------------|-------------------------|----------|---|-----------|----------|
-| cosine > 0.837                      | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31 | 0.756 n.s. | 45,088/45,116 | 15,321/15,332 |
-| cosine > 0.945 (2D GMM marginal)    | 93.77% [93.55%, 93.98%] | 94.78% [94.41%, 95.12%] | -4.54 | <0.001     | 42,304/45,116 | 14,531/15,332 |
-| cosine > 0.950                      | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97 | <0.001     | 41,571/45,116 | 14,352/15,332 |
-| cosine > 0.9407 (calib-fold P5)     | 95.00% [94.80%, 95.20%] | 95.64% [95.31%, 95.95%] | -2.83 | 0.005      | 42,862/45,116 | 14,664/15,332 |
-| dHash_indep ≤ 5                      | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001    | 37,434/45,116 | 13,467/15,332 |
-| dHash_indep ≤ 8                      | 94.84% [94.63%, 95.05%] | 96.13% [95.82%, 96.43%] | -6.45  | <0.001    | 42,791/45,116 | 14,739/15,332 |
-| dHash_indep ≤ 9 (calib-fold P95)     | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07  | <0.001    | 43,603/45,116 | 14,945/15,332 |
-| dHash_indep ≤ 15                     | 99.83% [99.79%, 99.86%] | 99.84% [99.77%, 99.89%] | -0.31  | 0.754 n.s. | 45,038/45,116 | 15,308/15,332 |
-| cosine > 0.95 AND dHash_indep ≤ 8    | 89.40% [89.12%, 89.69%] | 91.54% [91.09%, 91.97%] | -7.60  | <0.001    | 40,335/45,116 | 14,035/15,332 |
+| cosine > 0.837                       | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31  | 0.756 n.s. | 45,087/45,116 | 15,321/15,332 |
+| cosine > 0.9407 (calib-fold P5)      | 94.99% [94.79%, 95.19%] | 95.63% [95.29%, 95.94%] | -3.19  | 0.001      | 42,856/45,116 | 14,662/15,332 |
+| cosine > 0.945 (calib-fold P5 rounded) | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54  | <0.001     | 42,305/45,116 | 14,531/15,332 |
+| cosine > 0.950 (whole-sample P7.5; operational cut) | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97  | <0.001     | 41,570/45,116 | 14,352/15,332 |
+| dHash_indep ≤ 5                      | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001     | 37,430/45,116 | 13,467/15,332 |
+| dHash_indep ≤ 8                      | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45  | <0.001     | 42,788/45,116 | 14,739/15,332 |
+| dHash_indep ≤ 9 (calib-fold P95)     | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07  | <0.001     | 43,604/45,116 | 14,945/15,332 |
+| dHash_indep ≤ 15                     | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31  | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
+| cosine > 0.95 AND dHash_indep ≤ 8 (calibration-fold P95-adjacent reference; P95 = 9) | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60  | <0.001     | 40,335/45,116 | 14,035/15,332 |
+| cosine > 0.95 AND dHash_indep ≤ 15 (operational classifier rule, Section III-K) | 92.09% [91.84%, 92.34%] | 93.56% [93.16%, 93.93%] | -5.93  | <0.001     | 41,548/45,116 | 14,344/15,332 |

-Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash median = 2, P95 = 9.
+Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. Counts and z/p values are reproducible from the supplementary materials (fixed random seed).
 -->

-Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
 We report fold-versus-fold comparisons rather than fold-versus-whole-sample comparisons, because the whole-sample rate is a weighted average of the two folds and therefore cannot, in general, fall inside the Wilson CI of either fold when the folds differ in rate; the correct generalization reference is the calibration fold, which produced the thresholds.

 Under this proper test the two extreme rules agree across folds (cosine $> 0.837$ and $\text{dHash}_\text{indep} \leq 15$; both $p > 0.7$).
 The operationally relevant rules in the 85–95% capture band differ between folds by 1–5 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).
-Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 85–99% range has a held-out counterpart in the 87–99% range, and the operational dual rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.40% of the calibration fold and 91.54% of the held-out fold.
-The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity (see the $139 / 32$ accountant-level split of Section IV-E): the random 30% CPA sample happened to contain proportionally more accountants from the high-replication C1 cluster.
-We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to this fold variance.
+Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 85–99% range has a held-out counterpart in the 87–99% range, and the calibration-fold-adjacent reference rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ (the integer cut immediately below the calibration-fold dHash P95 of 9) captures 89.40% of the calibration fold and 91.54% of the held-out fold; the operational classifier rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K captures still higher rates in both folds (calibration 92.09%, 41,548 / 45,116; held-out 93.56%, 14,344 / 15,332).
+The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity: the random 30% CPA sample evidently contained proportionally more high-replication CPAs.
+We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-G.2) is the cross-check that is robust to this fold variance.

 ### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$

-The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P95 heuristic.
-The accountant-level three-method convergence (Section IV-E) places the primary accountant-level reference between $0.973$ and $0.979$, and the accountant-level 2D-GMM marginal at $0.945$.
-Because the classifier operates at the signature level while the three-method convergence estimates are at the accountant level, they are formally non-substitutable.
-We report a sensitivity check in which the classifier's operational cut cos $> 0.95$ is replaced by the nearest accountant-level reference, cos $> 0.945$.
+The per-signature classifier (Section III-K) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P7.5 heuristic (i.e., 7.5% of whole-sample Firm A signatures lie at or below 0.95; see Section III-H).
+We report a sensitivity check in which this round-number cut is replaced by the slightly stricter calibration-fold P5 rounded value cos $> 0.945$ (calibration-fold P5 = 0.9407, see Table XI).
+Table XII reports the five-way classifier output under each cut.

 <!-- TABLE XII: Classifier Sensitivity to the Operational Cosine Cut (All-Sample Five-Way Output, N = 168,740 signatures)
-| Category                                   | cos > 0.95 count (%) | cos > 0.945 count (%) | Δ count |
-|--------------------------------------------|----------------------|-----------------------|---------|
-| High-confidence non-hand-signed            | 76,984 (45.62%)      | 79,278 (46.98%)       | +2,294  |
-| Moderate-confidence non-hand-signed        | 43,906 (26.02%)      | 50,001 (29.63%)       | +6,095  |
-| High style consistency                     |    546 ( 0.32%)      |    665 ( 0.39%)       |   +119  |
-| Uncertain                                  | 46,768 (27.72%)      | 38,260 (22.67%)       | -8,508  |
-| Likely hand-signed                         |    536 ( 0.32%)      |    536 ( 0.32%)       |     +0  |
+| Cosine cut | High-confidence | Moderate-confidence | High style consistency | Uncertain | Likely hand-signed |
+|------------|-----------------|---------------------|------------------------|-----------|--------------------|
+| cos > 0.940                | 81,069 (48.04%) | 55,308 (32.78%) | 801 (0.47%) |  31,026 (18.39%) | 536 (0.32%) |
+| cos > 0.945                | 79,278 (46.98%) | 50,001 (29.63%) | 665 (0.39%) |  38,260 (22.67%) | 536 (0.32%) |
+| cos > 0.950 (operational)  | 76,984 (45.62%) | 43,906 (26.02%) | 546 (0.32%) |  46,768 (27.72%) | 536 (0.32%) |
+| cos > 0.960                | 70,250 (41.63%) | 29,450 (17.45%) | 288 (0.17%) |  68,216 (40.43%) | 536 (0.32%) |
+| cos > 0.970                | 60,247 (35.70%) | 14,865 ( 8.81%) | 117 (0.07%) |  92,975 (55.10%) | 536 (0.32%) |
+| cos > 0.985                | 37,368 (22.15%) |  2,231 ( 1.32%) |  10 (0.01%) | 128,595 (76.21%) | 536 (0.32%) |
+
+The dHash band edges ($\leq 5$ for high-confidence, $5 < \text{dHash}_\text{indep} \leq 15$ for moderate-confidence, $> 15$ for style) are held fixed across the grid; only the cosine cut varies. The Likely-hand-signed count is invariant across the grid because it depends only on the all-pairs KDE crossover cosine $= 0.837$.
 -->

-At the aggregate firm-level, the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
+At the aggregate firm-level, the calibration-fold-adjacent reference dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
+The operational classifier rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K captures 92.46% under the 0.95 cut and 93.97% under the 0.945 cut---a shift of 1.51 percentage points.
+
+Reading the wider grid in Table XII: the High-confidence and Moderate-confidence shares shift by less than 5 percentage points across the 0.940-0.950 neighbourhood, while pushing the cosine cut to 0.970 or 0.985 produces qualitatively different classifier behaviour (Moderate-confidence collapses from 26.02% at $0.95$ to 8.81% at $0.97$ and 1.32% at $0.985$, with the displaced mass landing in Uncertain rather than reclassifying out of the corpus).
+The classifier output is therefore robust to small (~0.005-cosine) perturbations of the operational cut but not to wholesale reanchoring at the threshold-estimator outputs of Section IV-D, which is consistent with our reading that those outputs are not classifier thresholds.
 At the per-signature categorization level, replacing 0.95 by 0.945 reclassifies 8,508 signatures (5.04% of the corpus) out of the Uncertain band; 6,095 of them migrate to Moderate-confidence non-hand-signed, 2,294 to High-confidence non-hand-signed, and 119 to High style consistency.
 The Likely-hand-signed category is unaffected because it depends only on the fixed all-pairs KDE crossover cosine $= 0.837$.
 The High-confidence non-hand-signed share grows from 45.62% to 46.98%.

-We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within the accountant-level convergence band, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.
-The paper therefore retains cos $> 0.95$ as the primary operational cut for transparency and reports the 0.945 results as a sensitivity check rather than as a deployed alternative; a future deployment requiring tighter accountant-level alignment could substitute cos $> 0.945$ without altering the substantive firm-level conclusions.
+We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within a 0.005-cosine neighbourhood of the Firm A P7.5 anchor, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.

-### 4) Sanity Sample
+To make the operating-point selection (Section III-K) auditable rather than presented as a single fixed value, Table XII-B reports the capture-vs-FAR tradeoff over the candidate threshold grid spanning the calibration-fold P5 (0.9407), its rounded value (0.945), the operational anchor (0.95), the Firm A Beta-2 forced-fit crossing from Section IV-D.3 (0.977), and the BD/McCrary candidate transition from Section IV-D.2 (0.985).
+For each grid point we report Firm A capture (under both the cosine-only marginal and the operational dual rule cos $> t$ AND $\text{dHash}_\text{indep} \leq 15$ used by the five-way classifier of Section III-K), non-Firm-A capture (the cosine-only marginal in the 108,292 non-Firm-A matched signatures), and inter-CPA FAR with Wilson 95% CI against the 50,000-pair anchor of Section IV-F.1.

-A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics.
+<!-- TABLE XII-B: Cosine-Threshold Tradeoff: Capture vs Inter-CPA FAR
+| Cosine cut t | Firm A capture (cos > t) | Firm A capture (cos > t AND dHash_indep ≤ 15) | Non-Firm-A capture (cos > t) | Inter-CPA FAR | Inter-CPA FAR Wilson 95% CI |
+|--------------|--------------------------|------------------------------------------------|------------------------------|---------------|------------------------------|
+| 0.9407 (calibration-fold P5)               | 95.15% (57,518/60,448) | 95.09% (57,482/60,448) | 72.68% (78,710/108,292) | 0.00126 | [0.00099, 0.00161] |
+| 0.945 (calibration-fold P5 rounded)        | 94.02% (56,836/60,448) | 93.97% (56,804/60,448) | 67.51% (73,108/108,292) | 0.00082 | [0.00061, 0.00111] |
+| 0.95 (whole-sample Firm A P7.5; **operational cut**) | **92.51%** (55,922/60,448) | **92.46%** (55,892/60,448) | 60.50% (65,514/108,292) | **0.00050** | [0.00034, 0.00074] |
+| 0.977 (Firm A Beta-2 forced-fit crossing)  | 74.53% (45,050/60,448) | 74.51% (45,038/60,448) | 13.14% (14,233/108,292) | 0.00014 | [0.00007, 0.00029] |
+| 0.985 (BD/McCrary candidate transition)    | 55.27% (33,409/60,448) | 55.26% (33,406/60,448) |  5.73%  (6,200/108,292) | 0.00004 | [0.00001, 0.00015] |

-## H. Additional Firm A Benchmark Validation
+Inter-CPA FAR computed against 50,000 i.i.d. inter-CPA pairs (random seed 42, reproducing the anchor of Section IV-F.1 / Table X). Capture and FAR percentages are exact ratios of the displayed integer counts; gap arithmetic in the surrounding prose is computed from those exact counts and rounded to two decimal places. The dual-rule column is the operational classifier rule of Section III-K; for cuts above the dHash-15 saturation point (Firm A dHash$_\text{indep}$ $> 15$ rate is only 0.17%, Table IX), the dual-rule and cosine-only columns coincide to within the dHash$_\text{indep}$ $> 15$ residual.
+-->

-The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.
-This section reports three complementary analyses that go beyond the whole-sample capture rates.
-Subsection H.2 is fully threshold-independent (it uses only ordinal ranking).
-Subsection H.1 uses a fixed 0.95 cutoff but derives information from the longitudinal stability of rates rather than from the absolute rate at any single year.
-Subsection H.3 applies the calibrated classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the informative quantity is the cross-firm *gap* rather than the absolute agreement rate at any one firm.
+Reading Table XII-B, three patterns motivate the choice of $0.95$ as the operating point.
+First, *Firm A capture* on the operational dual rule decays smoothly from 95.09% at $t = 0.9407$ to 55.26% at $t = 0.985$.
+Relaxing the cut from $0.95$ to $0.945$ buys 1.51 percentage points of additional Firm A capture, and to $0.9407$ buys 2.63 percentage points; tightening from $0.95$ to $0.977$ costs 17.96 percentage points and to $0.985$ costs 37.20 percentage points.
+The selected cut at $0.95$ is the strictest cut on this grid at which Firm A capture remains above $90\%$ on the operational dual rule.
+Second, *inter-CPA FAR* is small in absolute terms across the entire candidate grid ($0.00126$ at $0.9407$, falling to $0.00004$ at $0.985$): under any of these operating points the classifier's specificity against random cross-CPA pairs is in the per-mille range or better, so FAR alone does not determine the choice.
+The marginal FAR cost of relaxing from $0.95$ to $0.945$ is $+0.00032$ ($25 \to 41$ false positives per 50,000 pairs) and to $0.9407$ is $+0.00076$ ($25 \to 63$); the marginal FAR savings from tightening to $0.977$ and $0.985$ are $-0.00036$ and $-0.00046$ respectively.
+The FAR savings from going stricter are small in absolute terms compared with the corresponding Firm A capture loss, which makes $0.95$ a balanced operating point on this grid rather than a uniquely optimal one.
+Third, *non-Firm-A capture* (the cosine-only marginal in the 108,292 non-Firm-A signatures) decays from 67.51% at $0.945$ to 60.50% at $0.95$, 13.14% at $0.977$, and 5.73% at $0.985$.
+The Firm-A-minus-non-Firm-A gap widens with strictness through $0.977$ and then contracts (22.41 percentage points at $0.9407$; 26.46 at $0.945$; 31.97 at $0.95$; 61.36 at $0.977$; 49.54 at $0.985$): on the $0.95 \to 0.977$ segment non-Firm-A capture falls faster than Firm A capture in absolute terms ($-47.35$ vs $-17.96$ percentage points), so the widening is dominated by non-Firm-A removal rather than by an intrinsic property of Firm A; on the $0.977 \to 0.985$ segment Firm A capture falls faster than non-Firm-A's already-low residual, so the gap contracts.
+We do *not* read the gap pattern as evidence for a particular cut; it is reported here as cross-firm replication heterogeneity rather than as a selection criterion.
+The operating point at $0.95$ is therefore a defensible---not unique---selection in this neighbourhood, motivated by (i) keeping Firm A capture above $90\%$ on the operational dual rule, (ii) achieving an FAR of $0.0005$ at which marginal further savings from tightening are small relative to the corresponding capture loss, and (iii) preserving the interpretive transparency of the whole-sample Firm A P7.5 reading.
+It is *not* derived from the threshold-estimator outputs of Section IV-D, which the data do not support as classifier thresholds.
+
+The paper therefore retains cos $> 0.95$ as the primary operational cut and reports the 0.945 result of Table XII as a sensitivity check rather than as a deployed alternative; downstream document-level rates (Table XVII) and intra-report agreement (Table XVI) are robust to moderate cutoff shifts within the 0.945--0.95 neighbourhood as long as the same cutoff is applied uniformly across firms.
+
+## G. Additional Firm A Benchmark Validation
+
+Before presenting the three threshold-robust analyses, Fig. 4 summarises the per-firm yearly per-signature best-match cosine distribution that motivates them.
+The left panel reports the mean per-signature best-match cosine within each firm bucket and fiscal year (a threshold-free statistic); the right panel reports the share of each firm-bucket-year with per-signature best-match cosine $\geq 0.95$ (the operational cut of Section III-K).
+Both panels show Firm A above the other Big-4 firms in every year of the 2013-2023 sample, with non-Big-4 firms below all four Big-4 firms throughout, and the cross-firm ordering is stable across the sample period.
+The mean-cosine separation between Firm A and the other Big-4 firms is on the order of 0.02-0.04 throughout the sample (e.g., 2013: Firm A $0.9733$ vs Firm B $0.9498$, Firm C $0.9464$, Firm D $0.9395$, Non-Big-4 $0.9227$; 2023: $0.9860$ vs $0.9668$, $0.9662$, $0.9525$, $0.9346$); the share-above-0.95 separation is wider (2013: Firm A $87.2\%$ vs $61.8\%$, $56.2\%$, $38.5\%$, $27.5\%$).
+This visual is the most direct cross-firm evidence in the paper that Firm A's high-similarity behaviour is firm-specific rather than corpus-wide; the three subsections below decompose this gap along three threshold-free or threshold-robust dimensions.
+
+<!-- FIGURE 4: Per-firm yearly per-signature best-match cosine
+File: reports/figures/fig_yearly_big4_comparison.png (and .pdf)
+Generated by: signature_analysis/30_yearly_big4_comparison.py
+Caption: Per-firm yearly per-signature best-match cosine, 2013-2023.
+(a) Mean per-signature best-match cosine by firm bucket and fiscal year
+(threshold-free). (b) Share of per-signature best-match cosine $\geq 0.95$
+(operational cut of Section III-K). Five lines: Firm A, B, C, D, Non-Big-4.
+Firm A is above the other Big-4 firms in every year; Non-Big-4 is below all
+four Big-4 firms in every year. Per-firm signature counts and exact values
+are in `reports/firm_yearly_comparison/firm_yearly_comparison.{json,md}`.
+-->
+
+The capture rates of Section IV-E are an *internal* consistency check: they ask "how much of Firm A does our threshold capture?", but the threshold was itself derived from Firm A's percentiles, so a high capture rate is not surprising.
+To go beyond this circular check, we report three further analyses, each chosen so that the *informative quantity* does not depend on the threshold's absolute value:
+
+- **§IV-G.1 (year-by-year stability).** Holds the cosine cutoff fixed at 0.95 and asks whether the share of Firm A below the cutoff is *stable across years*. The information is in the temporal trend, not in the absolute rate; under a noise-only explanation of the left tail, the share should shrink as scan/PDF technology matured.
+- **§IV-G.2 (partner-level similarity ranking).** Uses *no threshold at all*: every auditor-year is ranked by mean similarity, and we measure Firm A's share of the top decile against its baseline share. The information is in the concentration ratio, which is invariant to the choice of cutoff.
+- **§IV-G.3 (intra-report agreement).** Applies the calibrated classifier and measures whether the *two co-signing CPAs on the same Firm A report* receive the same classifier label, then compares Firm A's intra-report agreement rate to the other firms'. The information is in the *cross-firm gap*; the absolute agreement rate at any one firm depends on the cutoff, but the gap is robust to moderate cutoff shifts as long as the same cutoff is applied uniformly across firms.
+
+Together these three analyses provide threshold-free or threshold-robust evidence that complements the within-sample capture rates of Section IV-E.

 ### 1) Year-by-Year Stability of the Firm A Left Tail

 Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year.
-Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign.
+Under the replication-dominated interpretation (Section III-H), this signature-level left-tail rate reflects within-firm heterogeneity in signing outputs at Firm A.
+Consistent with the scope-of-claims framing in Section III-G, we report the rate as a signature-level quantity without disaggregating the underlying mechanism (which may span a minority of hand-signing partners, multi-template replication workflows within the firm, or a combination); partner-level mechanism attribution is not attempted.
 Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.

 <!-- TABLE XIII: Firm A Per-Year Cosine Distribution
-| Year | N sigs | mean cosine | % below 0.95 |
+| Year | N sigs | mean best-match cosine | % below 0.95 |
 |------|--------|-------------|--------------|
 | 2013 | 2,167 | 0.9733 | 12.78% |
 | 2014 | 5,256 | 0.9781 | 8.69% |
@@ -297,45 +344,51 @@ Under the alternative hypothesis that the left tail is an artifact of scan or co

 The left tail is stable at 6-13% throughout the sample period and shows no pre/post-2020 level shift: the 2013-2019 mean left-tail share is 8.26% and the 2020-2023 mean is 6.96%.
 The lowest observed share is in 2023 (3.75%), consistent with firm-level electronic signing systems producing more uniform output than earlier manual scanning-and-stamping, not less.
-This stability supports the replication-dominated framing: a persistent minority of hand-signing Firm A partners is consistent with a Beta left tail that is stable across production technologies, whereas a noise-only explanation would predict a shrinking share as technology improved.
+This stability supports the replication-dominated framing: a persistent within-firm heterogeneity component is consistent with a Beta left tail that is stable across production technologies, whereas a noise-only explanation would predict a shrinking share as technology improved.

 ### 2) Partner-Level Similarity Ranking

-If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all Big-4 auditor-years.
+If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all auditor-years (across all firms).
 We test this prediction directly.

 For each auditor-year (CPA $\times$ fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
 Firm A accounts for 1,287 of these (27.8% baseline share).
 Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.
+The per-signature best-match cosine underlying each auditor-year mean is taken over the full same-CPA pool (Section III-G), consistent with the unit-of-analysis framing in Section III-G.

 <!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)
 | Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
 |-------|-------------|--------|--------|--------|--------|-----------|--------------|
 | 10%   | 462         | 443    | 2      | 3      | 0      | 14        | 95.9% |
+| 20%   | 925         | 877    | 9      | 14     | 2      | 23        | 94.8% |
 | 25%   | 1,157       | 1,043  | 32     | 23     | 9      | 50        | 90.1% |
+| 30%   | 1,388       | 1,129  | 105    | 52     | 25     | 77        | 81.3% |
 | 50%   | 2,314       | 1,220  | 473    | 273    | 102    | 246       | 52.7% |
 -->

-Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile.
+Firm A occupies 95.9% of the top 10%, 94.8% of the top 20%, 90.1% of the top 25%, and 81.3% of the top 30% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of $3.5\times$ at the top decile, $3.4\times$ at the top quintile, and $2.9\times$ at the top tercile.
+Firm A's share decays monotonically as the bracket widens (95.9% $\to$ 94.8% $\to$ 90.1% $\to$ 81.3% $\to$ 52.7% across top-10/20/25/30/50%), and only at the top 50% does its share approach its baseline; the over-representation is therefore concentrated in the very top of the distribution rather than spread uniformly through the upper half.
 Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.

-<!-- TABLE XV: Firm A Share of Top-10% Similarity by Year
-| Year | N auditor-years | Top-10% k | Firm A in top-10% | Firm A share | Firm A baseline |
-|------|-----------------|-----------|-------------------|--------------|-----------------|
-| 2013 | 324 | 32 | 32 | 100.0% | 26.2% |
-| 2014 | 399 | 39 | 39 | 100.0% | 27.1% |
-| 2015 | 394 | 39 | 38 | 97.4% | 27.2% |
-| 2016 | 413 | 41 | 39 | 95.1% | 27.4% |
-| 2017 | 415 | 41 | 41 | 100.0% | 27.9% |
-| 2018 | 434 | 43 | 43 | 100.0% | 28.1% |
-| 2019 | 429 | 42 | 42 | 100.0% | 28.2% |
-| 2020 | 430 | 43 | 38 | 88.4% | 28.3% |
-| 2021 | 450 | 45 | 44 | 97.8% | 28.4% |
-| 2022 | 467 | 46 | 43 | 93.5% | 28.5% |
-| 2023 | 474 | 47 | 46 | 97.9% | 28.5% |
+<!-- TABLE XV: Firm A Share of Top-K Similarity by Year (K = 10%, 20%, 30%)
+| Year | N auditor-years | Top-10% share | Top-20% share | Top-30% share | Firm A baseline |
+|------|-----------------|---------------|---------------|---------------|-----------------|
+| 2013 | 324 | 100.0% (32/32) | 98.4% (63/64) | 89.7% (87/97) | 32.4% |
+| 2014 | 399 | 100.0% (39/39) | 98.7% (78/79) | 82.4% (98/119) | 27.8% |
+| 2015 | 394 | 97.4% (38/39) | 96.2% (75/78) | 84.7% (100/118) | 27.7% |
+| 2016 | 413 | 95.1% (39/41) | 96.3% (79/82) | 81.3% (100/123) | 26.2% |
+| 2017 | 415 | 100.0% (41/41) | 97.6% (81/83) | 83.9% (104/124) | 27.2% |
+| 2018 | 434 | 100.0% (43/43) | 97.7% (84/86) | 80.0% (104/130) | 26.5% |
+| 2019 | 429 | 100.0% (42/42) | 97.6% (83/85) | 78.9% (101/128) | 27.0% |
+| 2020 | 430 | 88.4% (38/43)  | 91.9% (79/86) | 76.0% (98/129)  | 27.7% |
+| 2021 | 450 | 97.8% (44/45)  | 96.7% (87/90) | 81.5% (110/135) | 28.7% |
+| 2022 | 467 | 93.5% (43/46)  | 95.7% (89/93) | 84.3% (118/140) | 28.3% |
+| 2023 | 474 | 97.9% (46/47)  | 94.7% (89/94) | 83.8% (119/142) | 27.4% |
+
+Per-cell entries are "share (k_FirmA / k_total)". Top-25% and top-50% pooled values are reported in Table XIV; per-year top-25/50 columns are omitted from this table to reduce visual width but are reproducible from the supplementary materials.
 -->

-This over-representation is a direct consequence of firm-wide non-hand-signing practice and is not derived from any threshold we subsequently calibrate.
+This over-representation is consistent with firm-wide non-hand-signing practice at Firm A and is not derived from any threshold we subsequently calibrate.
 It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.

 ### 3) Intra-Report Consistency
@@ -344,8 +397,8 @@ Taiwanese statutory audit reports are co-signed by two engagement partners (a pr
 Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification.
 Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.

-For each report with exactly two signatures and complete per-signature data (83,970 reports assigned to a single firm, plus 384 reports with one signer per firm in the mixed-firm buckets for 84,354 total), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.
-Table XVI reports per-firm intra-report agreement (firm-assignment defined by the firm identity of both signers; mixed-firm reports are reported separately).
+For each report with exactly two signatures and complete per-signature data (84,354 reports total: 83,970 single-firm reports, in which both signers are at the same firm, and 384 mixed-firm reports, in which the two signers are at different firms), we classify each signature using the dual-descriptor rules of Section III-K and record whether the two classifications agree.
+Table XVI reports per-firm intra-report agreement for the 83,970 single-firm reports only (firm-assignment defined by the common firm identity of both signers); the 384 mixed-firm reports (0.46% of the 2-signature corpus) are excluded from the intra-report analysis because firm-level agreement is not well defined when the two signers are at different firms.

 <!-- TABLE XVI: Intra-Report Classification Agreement by Firm
 | Firm | Total 2-signer reports | Both non-hand-signed | Both uncertain | Both style | Both hand-signed | Mixed | Agreement rate |
@@ -362,14 +415,15 @@ A report is "in agreement" if both signature labels fall in the same coarse buck

 Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.
 The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
-This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) non-hand-signing practice.
+This 23-28 percentage-point gap in intra-report agreement between Firm A and the other firms is consistent with firm-wide (rather than partner-specific) non-hand-signing practice; we do not claim a sharp discontinuity in the formal sense, since classifier calibration, firm-specific document-production pipelines, and signer-mix differences could each contribute to gap magnitude.

-We note that this test uses the calibrated classifier of Section III-L rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
+We note that this test uses the calibrated classifier of Section III-K rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.

-## I. Classification Results
+## H. Classification Results

-Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
-The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
+Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents (656 documents excluded from the 85,042-document YOLO-detection cohort because no signature on the document could be matched to a registered CPA; see Table XVII note).
+We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
+Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.

 <!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
 | Verdict | N (PDFs) | % | Firm A | Firm A % |
@@ -380,7 +434,8 @@ The document count (84,386) differs from the 85,042 documents with any YOLO dete
 | Uncertain | 12,683 | 15.0% | 758 | 2.5% |
 | Likely hand-signed | 47 | 0.1% | 4 | 0.0% |

-Per the worst-case aggregation rule of Section III-L, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
+Per the worst-case aggregation rule of Section III-K, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
+The 84,386-document cohort excludes 656 documents (relative to the 85,042 YOLO-detected cohort of Table III) for which no signature could be matched to a registered CPA: the per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity is defined. The exclusion is definitional rather than discretionary; typical causes are auditor's-report-page formats deviating from the standard two-signature layout, or OCR returning a printed CPA name not present in the registry.
 -->

 Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
@@ -392,16 +447,18 @@ A cosine-only classifier would treat all 71,656 identically; the dual-descriptor
 ### 1) Firm A Capture Profile (Consistency Check)

 96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
-This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the 32/171 middle-band minority identified by the accountant-level mixture (Section IV-E).
-The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
-We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.
+This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the within-firm heterogeneity implied by the dip-test-confirmed unimodal-long-tail shape of Firm A's per-signature cosine distribution (Section IV-D.1) and the 7.5% signature-level left tail (Section III-H).
+The near-zero "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 denominator is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset of Table XVI by 4 mixed-firm reports excluded from the firm-level intra-report comparison) indicates that the within-firm heterogeneity implied by the 7.5% signature-level left tail (Section IV-D) does not project into the lowest-cosine document-level category under the dual-descriptor rules; it is absorbed instead into the uncertain or high-style-consistency categories at this threshold set.
+We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-F.2 is the corresponding external check.

-### 2) Cross-Method Agreement
+### 2) Cross-Firm Comparison of Dual-Descriptor Convergence

-Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
-This is consistent with the three-method thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
+Among the 65,514 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,922 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.
+The Firm A denominator (55,922) matches Table IX exactly: both Table IX and the cross-firm decomposition define Firm A membership via the CPA registry (`accountants.firm`), and the cross-firm analysis additionally requires a non-null independent-min dHash record, which all 55,922 Firm A cosine-eligible signatures have in the current database.
+This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-G.2) and intra-report consistency (Section IV-G.3) findings.
+Reproduction artifact for these counts is listed in Appendix B.

-## J. Ablation Study: Feature Backbone Comparison
+## I. Ablation Study: Feature Backbone Comparison

 To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
 All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
@@ -420,8 +477,9 @@ Table XVIII presents the comparison.

 Note: Firm A values in this table are computed over all intra-firm pairwise
 similarities (16.0M pairs) for cross-backbone comparability. These differ from
-the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
-the classification-relevant statistic: the similarity of each signature to its
+the per-signature best-match statistic used in Section IV-D and visualized in
+Table XIII (whole-sample Firm A best-match mean $\approx 0.980$), which reflects
+the classification-relevant quantity: the similarity of each signature to its
 single closest match from the same CPA.
 -->

@@ -0,0 +1,226 @@
+# Reference Verification — Paper A v3 (41 refs)
+
+Date: 2026-04-27 (initial audit); v3.18 reference list updated to incorporate every fix recorded below.
+
+Method: WebSearch + WebFetch verification of each citation against authoritative sources (publisher pages, DOIs, arXiv, IEEE Xplore, Project Euclid, etc.).
+
+## Summary (audit history)
+- Verified correct on first audit: 35/41
+- Minor discrepancies (typos, page numbers, year on early-access vs. issue): 5/41 — all fixed in v3.18
+- MAJOR PROBLEMS (wrong author): 1/41 — `[5]` Hadjadj et al. → Kao and Wen, fixed in v3.18
+
+The current `paper_a_references_v3.md` reflects every correction listed below. The detailed findings are retained as an audit trail; the live reference list no longer carries any of the recorded errors.
+
+The single major problem at the time of the audit was **[5]**, where the paper at the cited venue/article number is real, but the cited authors ("Hadjadj et al.") were wrong — the actual authors are Kao and Wen. None of the statistical-method refs [37]–[41] flagged by the partner are fabricated; all five are bibliographically correct.
+
+## Detailed findings
+
+### [1] Taiwan CPA Act + FSC Attestation Regulations
+**Status:** ✅ VERIFIED
+**Notes:** The URL https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067 resolves to the official Republic of China (Taiwan) "Certified Public Accountant Act" page (Laws & Regulations Database, Financial Supervisory Commission).
+**Evidence:** WebFetch returned the CPA Act page with 8 chapters; latest amendment 2018-01-31. Article 4 and the FSC Attestation Regulations (查核簽證核准準則) are part of the official regulatory framework.
+
+### [2] S.-H. Yen, Y.-S. Chang, H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," Res. Account. Regul., 25(2), 230–235, 2013.
+**Status:** ✅ VERIFIED
+**Evidence:** ScienceDirect listing (https://www.sciencedirect.com/science/article/abs/pii/S1052045713000234) confirms authors Sin-Hui Yen, Yu-Shan Chang, Hui-Ling Chen; Research in Accounting Regulation 25(2):230–235, 2013.
+
+### [3] J. Bromley et al., "Signature verification using a Siamese time delay neural network," Proc. NeurIPS, 1993.
+**Status:** ✅ VERIFIED
+**Notes:** Authors are Bromley, Bentz, Bottou, Guyon, LeCun, Moore, Säckinger, Shah; pages 737–744 of NIPS 6 (1993). Citation as "Bromley et al." in NeurIPS 1993 is correct.
+**Evidence:** https://proceedings.neurips.cc/paper/1993/hash/288cc0ff022877bd3df94bc9360b9c5d-Abstract.html
+
+### [4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.
+**Status:** ✅ VERIFIED
+**Evidence:** arXiv 1707.02131 resolves to exactly this title; authors Sounak Dey, Anjan Dutta, J.I. Toledo, Suman K. Ghosh, Josep Llados, Umapada Pal; submitted July 2017.
+
+### [5] I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," Appl. Sci., 10(11), 3716, 2020.
+**Status:** ❌ MAJOR PROBLEM (wrong authors)
+**Notes:** The paper at Applied Sciences vol. 10, issue 11, article 3716 (DOI 10.3390/app10113716) is real, but the actual authors are **Hsin-Hsiung Kao and Che-Yen Wen**, NOT "Hadjadj et al." The full title in the journal is also "An Offline Signature Verification **and Forgery Detection** Method Based on a Single Known Sample and an Explainable Deep Learning Approach" — the v3 reference omits "and Forgery Detection."
+**Evidence:** MDPI listing (https://www.mdpi.com/2076-3417/10/11/3716) and Semantic Scholar both list authors as Kao and Wen, published 27 May 2020. There is a separate researcher I. Hadjadj who works on signature verification with co-authors Gattal/Djeddi/Ayad/Siddiqi/Abass on textural-descriptor methods, but that work is published elsewhere — not in Appl. Sci. 10(11):3716.
+**Recommendation:** Replace authors with "H.-H. Kao and C.-Y. Wen" and use correct title.
+
+### [6] H. Li et al., "TransOSV: Offline signature verification with transformers," Pattern Recognit., 145, 109882, 2024.
+**Status:** ✅ VERIFIED
+**Notes:** Authors Huan Li, Ping Wei, Zeyu Ma, Changkai Li, Nanning Zheng. PR vol. 145, art. 109882, January 2024.
+**Evidence:** ScienceDirect S0031320323005800.
+
+### [7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," Mathematics, 12(17), 2757, 2024.
+**Status:** ✅ VERIFIED
+**Notes:** Authors Sara Tehsin, Ali Hassan, Farhan Riaz, Inzamam Mashood Nasir, Norma Latif Fitriyani, Muhammad Syafrudin. DOI 10.3390/math12172757.
+**Evidence:** https://www.mdpi.com/2227-7390/12/17/2757
+
+### [8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024.
+**Status:** ✅ VERIFIED
+**Notes:** Full title is "...using **Convolutional Neural Network** Learned Representations" (the v3 ref says "CNN" — acceptable abbreviation).
+**Evidence:** https://arxiv.org/abs/2401.03085 — authors Paul Brimoh and Chollette C. Olisah.
+
+### [9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021.
+**Status:** ✅ VERIFIED
+**Evidence:** arXiv 2107.14091 — authors Nikhil Woodruff, Amir Enshaei, Bashar Awwad Shiekh Hasan; submitted 29 July 2021.
+
+### [10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," Proc. Electronic Imaging, 2016.
+**Status:** ✅ VERIFIED
+**Notes:** Published in IS&T Electronic Imaging: Media Watermarking, Security, and Forensics 2016, pp. 1–10 (article 4 in session 8). Authors Svetlana Abramova and Rainer Böhme.
+**Evidence:** https://library.imaging.org/ei/articles/28/8/art00004 ; Semantic Scholar entry confirms title and authors.
+
+### [11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," Multimedia Tools Appl., 2024.
+**Status:** ✅ VERIFIED
+**Notes:** Published in Multimedia Tools and Applications, 2024, DOI 10.1007/s11042-024-18399-2.
+**Evidence:** https://link.springer.com/article/10.1007/s11042-024-18399-2
+
+### [12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," Inf. Process. Manage., 104086, 2025.
+**Status:** ✅ VERIFIED
+**Notes:** Authors Yash Jakhar and Malaya Dutta Borah; Information Processing & Management 62(4):104086, July 2025; DOI 10.1016/j.ipm.2025.104086.
+**Evidence:** https://www.sciencedirect.com/science/article/abs/pii/S0306457325000287
+
+### [13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," Proc. CVPR, 2022.
+**Status:** ✅ VERIFIED
+**Notes:** Authors Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, Matthijs Douze; CVPR 2022.
+**Evidence:** https://openaccess.thecvf.com/content/CVPR2022/html/Pizzi_A_Self-Supervised_Descriptor_for_Image_Copy_Detection_CVPR_2022_paper.html ; arXiv 2202.10261.
+
+### [14] L. G. Hafemann, R. Sabourin, L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," Pattern Recognit., 70, 163–176, 2017.
+**Status:** ✅ VERIFIED
+**Evidence:** ScienceDirect S0031320317302017; PR 70:163–176, 2017; arXiv 1705.05787.
+
+### [15] E. N. Zois, D. Tsourounis, D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," IEEE Trans. Inf. Forensics Security, 19, 1342–1356, 2024.
+**Status:** ✅ VERIFIED
+**Evidence:** IEEE Xplore document 10319735; TIFS vol. 19, pp. 1342–1356, 2024.
+
+### [16] L. G. Hafemann, R. Sabourin, L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," IEEE Trans. Inf. Forensics Security, 15, 1735–1745, 2019.
+**Status:** ⚠️ MINOR
+**Notes:** Volume and pages (15, 1735–1745) are correct. Year is technically 2020 for the journal issue (DOI 10.1109/TIFS.2019.2949425; early-access October 2019, issue volume 15 published 2020). The "2019" in the v3 reference reflects the online/early-access date but is inconsistent with TIFS's volume-15 2020 issue convention.
+**Evidence:** arXiv 1910.08060; ÉTS espace listing confirms TIFS 15:1735–1745, 2020.
+**Recommendation:** Change year to 2020 for IEEE Access editorial consistency, or accept as-is (both forms appear in the literature).
+
+### [17] H. Farid, "Image forgery detection," IEEE Signal Process. Mag., 26(2), 16–25, 2009.
+**Status:** ✅ VERIFIED
+**Notes:** The paper's actual title (in some indexes) is given as "A Survey of Image Forgery Detection," but the IEEE Xplore canonical title is "Image Forgery Detection." Vol. 26, no. 2, pp. 16–25, March 2009.
+**Evidence:** https://pages.cs.wisc.edu/~dyer/cs534/papers/farid-sigproc09.pdf (PDF header confirms IEEE SPM, March 2009, p. 16).
+
+### [18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, R. Sheikhpour, "A survey on deep learning-based image forgery detection," Pattern Recognit., 144, 109778, 2023.
+**Status:** ✅ VERIFIED
+**Evidence:** ScienceDirect S0031320323004764; PR vol. 144 art. 109778, December 2023.
+
+### [19] J. Luo et al., "A survey of perceptual hashing for multimedia," ACM Trans. Multimedia Comput. Commun. Appl., 21(7), 2025.
+**Status:** ✅ VERIFIED
+**Notes:** Published April 2025, DOI 10.1145/3727880.
+**Evidence:** https://dl.acm.org/doi/10.1145/3727880
+
+### [20] D. Engin et al., "Offline signature verification on real-world documents," Proc. CVPRW, 2020.
+**Status:** ✅ VERIFIED
+**Notes:** Authors Deniz Engin, Alperen Kantarci, Secil Arslan, Hazim Kemel Ekenel; CVPR 2020 Biometrics Workshop.
+**Evidence:** https://openaccess.thecvf.com/content_CVPRW_2020/html/w48/Engin_Offline_Signature_Verification_on_Real-World_Documents_CVPRW_2020_paper.html
+
+### [21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," Expert Syst. Appl., 2022.
+**Status:** ⚠️ MINOR
+**Notes:** Citation lacks volume/article number. Full record: Expert Systems with Applications, vol. 189, art. 116136, 2022. Authors Tsourounis, Theodorakopoulos, Zois, Economou.
+**Evidence:** ScienceDirect S0957417421014652.
+**Recommendation:** Add ", vol. 189, art. 116136" for IEEE-style completeness.
+
+### [22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification," Procedia Comput. Sci., 270, 2025.
+**Status:** ⚠️ MINOR
+**Notes:** Full title in publisher record is "A Unified ResNet18-Based Approach for Offline Signature Classification and Verification **Across Multilingual Datasets**." Procedia CS vol. 270, pp. 4024–4033, 2025 (KES 2025).
+**Evidence:** ScienceDirect S1877050925032004.
+**Recommendation:** Either keep short title or add "Across Multilingual Datasets" for accuracy; add page range.
+
+### [23] A. Babenko, A. Slesarev, A. Chigorin, V. Lempitsky, "Neural codes for image retrieval," Proc. ECCV, 2014, pp. 584–599.
+**Status:** ✅ VERIFIED
+**Evidence:** Springer LNCS 8689, ECCV 2014 Part I, pp. 584–599; arXiv 1404.1777.
+
+### [24] S. Bai et al., "Qwen2.5-VL technical report," arXiv:2502.13923, 2025.
+**Status:** ✅ VERIFIED
+**Evidence:** arXiv 2502.13923; lead author Shuai Bai, Qwen Team Alibaba; submitted 19 Feb 2025. URL https://arxiv.org/abs/2502.13923 resolves correctly.
+
+### [25] Ultralytics, "YOLOv11 documentation," 2024.
+**Status:** ⚠️ MINOR
+**Notes:** Ultralytics names the model **"YOLO11"** (no "v"), released 10 Sept 2024. The cited URL https://docs.ultralytics.com/ is the docs root and resolves; the model-specific page is https://docs.ultralytics.com/models/yolo11/.
+**Recommendation:** Rename to "YOLO11" to match official Ultralytics terminology, or note that "YOLOv11" is informal.
+
+### [26] K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition," Proc. CVPR, 2016.
+**Status:** ✅ VERIFIED
+**Evidence:** CVF Open Access; CVPR 2016 pp. 770–778.
+
+### [27] N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013.
+**Status:** ⚠️ MINOR
+**Notes:** Blog post is real (the canonical dHash explanation). The cited URL https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html is the historical permalink; the active URL form returned by Google is https://www.hackerfactor.com/blog/?/archives/529-Kind-of-Like-That.html. Both 403'd in our WebFetch test (likely User-Agent block on the blog), but the post is widely cited and references confirm it exists. Year is 2013 per blog archive.
+**Recommendation:** Verify the URL still resolves in a browser; both index.php and bare forms are accepted by the blog historically.
+
+### [28] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman & Hall, 1986.
+**Status:** ✅ VERIFIED
+**Evidence:** Routledge/Taylor&Francis catalog; ISBN 0412246201; Chapman & Hall, London, 1986.
+
+### [29] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.
+**Status:** ✅ VERIFIED
+**Evidence:** Routledge listing ISBN 9780805802832; Lawrence Erlbaum Associates, 2nd ed., 1988.
+
+### [30] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., 13(4), 600–612, 2004.
+**Status:** ✅ VERIFIED
+**Evidence:** IEEE Xplore document 1284395; vol. 13, no. 4, pp. 600–612, April 2004.
+
+### [31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," The Accounting Review, 88(5), 1511–1546, 2013.
+**Status:** ✅ VERIFIED
+**Evidence:** SSRN abstract 2225427; The Accounting Review 88(5):1511–1546, September 2013.
+
+### [32] A. D. Blay, M. Notbohm, C. Schelleman, A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," Int. J. Auditing, 18(3), 172–192, 2014.
+**Status:** ✅ VERIFIED
+**Evidence:** Wiley DOI 10.1111/ijau.12022; IJA 18(3):172–192, 2014.
+
+### [33] W. Chi, H. Huang, Y. Liao, H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," Contemp. Account. Res., 26(2), 359–391, 2009.
+**Status:** ✅ VERIFIED
+**Evidence:** Wiley DOI 10.1506/car.26.2.2; CAR 26(2):359–391, 2009.
+
+### [34] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, "You only look once: Unified, real-time object detection," Proc. CVPR, 2016, pp. 779–788.
+**Status:** ✅ VERIFIED
+**Evidence:** CVF Open Access; CVPR 2016 pp. 779–788.
+
+### [35] J. Zhang, J. Huang, S. Jin, S. Lu, "Vision-language models for vision tasks: A survey," IEEE Trans. Pattern Anal. Mach. Intell., 46(8), 5625–5644, 2024.
+**Status:** ✅ VERIFIED
+**Evidence:** IEEE Xplore document 10445007; DOI 10.1109/TPAMI.2024.3369699; TPAMI 46(8):5625–5644, August 2024.
+
+### [36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," Ann. Math. Statist., 18(1), 50–60, 1947.
+**Status:** ✅ VERIFIED
+**Evidence:** Project Euclid DOI 10.1214/aoms/1177730491; AMS 18(1):50–60, March 1947.
+
+### [37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," Ann. Statist., 13(1), 70–84, 1985.
+**Status:** ✅ VERIFIED
+**Notes:** **Partner-flagged ref — confirmed real and bibliographically correct.** Annals of Statistics 13(1):70–84, March 1985.
+**Evidence:** Project Euclid https://projecteuclid.org/journals/annals-of-statistics/volume-13/issue-1/The-Dip-Test-of-Unimodality/10.1214/aos/1176346577.full
+
+### [38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," J. Account. Econ., 24(1), 99–126, 1997.
+**Status:** ✅ VERIFIED
+**Notes:** **Partner-flagged ref — confirmed real and bibliographically correct.** Seminal earnings-management paper.
+**Evidence:** ScienceDirect S0165410197000177; JAE 24(1):99–126, December 1997.
+
+### [39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," J. Econometrics, 142(2), 698–714, 2008.
+**Status:** ✅ VERIFIED
+**Notes:** **Partner-flagged ref — confirmed real and bibliographically correct.** Foundational RDD density-manipulation test (>1750 citations).
+**Evidence:** ScienceDirect S0304407607001133; JoE 142(2):698–714, February 2008.
+
+### [40] A. P. Dempster, N. M. Laird, D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, 39(1), 1–38, 1977.
+**Status:** ✅ VERIFIED
+**Notes:** **Partner-flagged ref — confirmed real and bibliographically correct.** Canonical EM algorithm paper, presented to RSS Research Section 8 Dec 1976.
+**Evidence:** Wiley DOI 10.1111/j.2517-6161.1977.tb01600.x; JRSS B 39(1):1–38, 1977.
+
+### [41] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, 50(1), 1–25, 1982.
+**Status:** ⚠️ MINOR
+**Notes:** **Partner-flagged ref — confirmed real, but page numbers slightly off.** Some sources list pp. 1–25, others pp. 1–26. The Econometric Society's official record (and JSTOR 1912526) lists pages 1–25; Emerald and a few other indices list 1–26 (likely including a typo-correction footnote). The v3 reference's "1–25" matches the Econometric Society canonical listing.
+**Evidence:** https://www.econometricsociety.org/publications/econometrica/1982/01/01/maximum-likelihood-estimation-misspecified-models ; JSTOR 1912526. Authors and venue exact.
+**Recommendation:** No fix needed; "1–25" is the canonical page range.
+
+## Recommendations
+
+**Critical fixes (must fix before submission):**
+
+1. **[5]** Replace authors and title:
+   - Current: `I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," Appl. Sci., vol. 10, no. 11, p. 3716, 2020.`
+   - Corrected: `H.-H. Kao and C.-Y. Wen, "An offline signature verification and forgery detection method based on a single known sample and an explainable deep learning approach," Appl. Sci., vol. 10, no. 11, p. 3716, 2020.`
+
+**Recommended polish (style/completeness):**
+
+2. **[16]** Year is 2020 in TIFS volume 15; consider changing 2019 → 2020 (or leave as 2019 if matching the early-access date is preferred — both are defensible).
+3. **[21]** Add volume and article number: `Expert Syst. Appl., vol. 189, art. 116136, 2022.`
+4. **[22]** Add page range: `Procedia Comput. Sci., vol. 270, pp. 4024–4033, 2025.` Optionally restore full subtitle "Across Multilingual Datasets."
+5. **[25]** Use Ultralytics' official name "YOLO11" (no "v") if matching their branding; current "YOLOv11" is widely used colloquially but not the canonical name.
+6. **[27]** Verify URL renders in a browser; both `blog/index.php?/archives/...` and `blog/?/archives/...` forms have historically resolved on hackerfactor.com.
+
+**No fix needed:** All five partner-flagged statistical-method references [37]–[41] are real, correctly attributed, and bibliographically accurate. The partner's suspicion that they might be AI hallucinations is unfounded — Hartigan & Hartigan (1985), Burgstahler & Dichev (1997), McCrary (2008), Dempster-Laird-Rubin (1977), and White (1982) are all foundational, heavily-cited works in their respective fields.
@@ -8,39 +8,40 @@ occurring reference populations instead of manual labels:
  Positive anchor 1:  pixel_identical_to_closest = 1
      Two signature images byte-identical after crop/resize.
      Mathematically impossible to arise from independent hand-signing
-      => absolute ground truth for replication.
+      => pair-level proof of image reuse and a CONSERVATIVE-SUBSET
+      ground truth for non-hand-signing (only those whose nearest
+      same-CPA match happens to be byte-identical).

-  Positive anchor 2:  Firm A (Deloitte) signatures
-      Interview evidence from multiple Firm A accountants confirms that
-      MOST use replication (stamping / firm-level e-signing) but a
-      MINORITY may still hand-sign. Firm A is therefore a
-      "replication-dominated" population (not a pure one). We use it as
-      a strong prior positive for the majority regime, while noting that
-      ~7% of Firm A signatures fall below cosine 0.95 consistent with
-      the minority hand-signers. This matches the long left tail
-      observed in the dip test (Script 15) and the Firm A members who
-      land in C2 (middle band) of the accountant-level GMM (Script 18).
+  Positive anchor 2:  Firm A signatures
+      Treated in the manuscript as a REPLICATION-DOMINATED population
+      based on the paper's own image evidence: the byte-level pair
+      analysis, the Firm A per-signature similarity distribution, the
+      partner-ranking concentration, and the intra-report consistency
+      gap. Approximately 7% of Firm A signatures fall below cosine
+      0.95, forming the long left tail observed in the dip test
+      (Script 15).

  Negative anchor:    signatures with cosine <= low threshold
      Pairs with very low cosine similarity cannot plausibly be pixel
-      duplicates, so they serve as absolute negatives.
+      duplicates, so they serve as a conservative supplementary
+      negative reference.

-Metrics reported:
-  - FAR/FRR/EER using the pixel-identity anchor as the gold positive
-    and low-similarity pairs as the gold negative.
-  - Precision/Recall/F1 at cosine and dHash thresholds from Scripts
-    15/16/17/18.
+Metrics computed (legacy; NOT all reported in the manuscript):
+  - FAR against the inter-CPA negative anchor is the primary metric
+    reported (Table X). The byte-identical positive anchor has cosine
+    ~= 1 by construction, so FRR / EER / Precision / F1 against that
+    subset are arithmetic tautologies (FRR is trivially 0 below
+    threshold 1) and are intentionally OMITTED from Table X. Legacy
+    EER/FRR/precision/F1 helper functions remain in this script for
+    diagnostic use only and their outputs are NOT cited as biometric
+    performance in the paper.
  - Convergence with Firm A anchor (what fraction of Firm A signatures
    are correctly classified at each threshold).

-Small visual sanity sample (30 pairs) is exported for spot-check, but
-metrics are derived entirely from pixel and Firm A evidence.
-
 Output:
  reports/pixel_validation/pixel_validation_report.md
  reports/pixel_validation/pixel_validation_results.json
  reports/pixel_validation/roc_cosine.png, roc_dhash.png
-  reports/pixel_validation/sanity_sample.csv
 """

 import sqlite3
@@ -2,26 +2,39 @@
 """
 Script 21: Expanded Validation with Larger Negative Anchor + Held-out Firm A
 ============================================================================
-Addresses codex review weaknesses of Script 19's pixel-identity validation:
+Addresses three weaknesses of Script 19's pixel-identity validation:

  (a) Negative anchor of n=35 (cosine<0.70) is too small to give
      meaningful FAR confidence intervals.
-  (b) Pixel-identical positive anchor is an easy subset, not
-      representative of the broader positive class.
-  (c) Firm A is both the calibration anchor and the validation anchor
-      (circular).
+  (b) Pixel-identical positive anchor is a CONSERVATIVE SUBSET of the
+      true non-hand-signed class, not representative of the broader
+      positive class. Recall against this subset is therefore a
+      lower-bound calibration check, not a generalizable recall
+      estimate.
+  (c) Firm A is both the calibration anchor and a validation anchor
+      (circular). The 70/30 fold split makes within-Firm-A sampling
+      variance visible without claiming external validation.

 This script:
  1. Constructs a large inter-CPA negative anchor (~50,000 pairs) by
     randomly sampling pairs from different CPAs. Inter-CPA high
     similarity is highly unlikely to arise from legitimate signing.
  2. Splits Firm A CPAs 70/30 into CALIBRATION and HELDOUT folds.
-     Re-derives signature-level / accountant-level thresholds from the
-     calibration fold only, then reports all metrics (including Firm A
-     anchor rates) on the heldout fold.
-  3. Computes proper EER (FAR = FRR interpolated) in addition to
-     metrics at canonical thresholds.
-  4. Computes 95% Wilson confidence intervals for each FAR/FRR.
+     Re-derives signature-level thresholds from the calibration fold
+     only, then reports capture rates on the heldout fold.
+  3. Computes 95% Wilson confidence intervals for FAR at canonical
+     thresholds (Table X in the manuscript).
+
+Legacy / diagnostic-only metrics:
+  Helper functions for EER, Precision, Recall, F1, and FRR remain in
+  this script for backward compatibility. The manuscript intentionally
+  OMITS these metrics from Table X because the byte-identical positive
+  anchor has cosine ~= 1 by construction (so FRR / EER are arithmetic
+  tautologies) and because positive and negative anchors are
+  constructed from different sampling units, making prevalence
+  arbitrary (so Precision and F1 have no meaningful population
+  interpretation). Only FAR against the large inter-CPA negative
+  anchor is reported as a biometric metric in the paper.

 Output:
  reports/expanded_validation/expanded_validation_report.md
@@ -72,44 +85,78 @@ def load_signatures():
    return rows


-def load_feature_vectors_sample(n=2000):
-    """Load feature vectors for inter-CPA negative-anchor sampling."""
+def load_signature_ids_for_negative_pool(seed=SEED):
+    """Load lightweight (sig_id, accountant) pool from the entire matched
+    corpus. Per Gemini round-19 review, the prior implementation drew
+    50,000 inter-CPA pairs from a tiny LIMIT-3000 random subset, reusing
+    each signature ~33 times and artificially tightening Wilson FAR CIs.
+    The corrected implementation samples pairs i.i.d. across the FULL
+    matched corpus (~168k signatures); only the unique signatures that
+    actually appear in the sampled pairs need feature vectors loaded.
+    """
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
-        SELECT signature_id, assigned_accountant, feature_vector
+        SELECT signature_id, assigned_accountant
        FROM signatures
        WHERE feature_vector IS NOT NULL
          AND assigned_accountant IS NOT NULL
-        ORDER BY RANDOM()
-        LIMIT ?
-    ''', (n,))
+    ''')
    rows = cur.fetchall()
    conn.close()
-    out = []
-    for r in rows:
-        vec = np.frombuffer(r[2], dtype=np.float32)
-        out.append({'sig_id': r[0], 'accountant': r[1], 'feature': vec})
-    return out
+    sig_ids = np.array([r[0] for r in rows], dtype=np.int64)
+    accts = np.array([r[1] for r in rows])
+    return sig_ids, accts


-def build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS, seed=SEED):
-    """Sample random cross-CPA pairs; return their cosine similarities."""
+def load_features_for_ids(sig_ids):
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    placeholders = ','.join('?' * len(sig_ids))
+    cur.execute(
+        f'SELECT signature_id, feature_vector FROM signatures '
+        f'WHERE signature_id IN ({placeholders})',
+        [int(s) for s in sig_ids],
+    )
+    rows = cur.fetchall()
+    conn.close()
+    feat_by_id = {}
+    for sid, blob in rows:
+        feat_by_id[int(sid)] = np.frombuffer(blob, dtype=np.float32)
+    return feat_by_id
+
+
+def build_inter_cpa_negative(sig_ids, accts, n_pairs=N_INTER_PAIRS, seed=SEED):
+    """Sample i.i.d. random cross-CPA pairs from the full matched corpus
+    and return their cosine similarities.
+    """
    rng = np.random.default_rng(seed)
-    n = len(sample)
-    feats = np.stack([s['feature'] for s in sample])
-    accts = np.array([s['accountant'] for s in sample])
-    sims = []
+    n = len(sig_ids)
+    pairs = []
    tries = 0
-    while len(sims) < n_pairs and tries < n_pairs * 10:
+    seen_pairs = set()
+    while len(pairs) < n_pairs and tries < n_pairs * 10:
        i = rng.integers(n)
        j = rng.integers(n)
        if i == j or accts[i] == accts[j]:
            tries += 1
            continue
-        sim = float(feats[i] @ feats[j])
-        sims.append(sim)
+        a, b = (i, j) if i < j else (j, i)
+        if (a, b) in seen_pairs:
+            tries += 1
+            continue
+        seen_pairs.add((a, b))
+        pairs.append((a, b))
        tries += 1
+
+    needed_ids = sorted({int(sig_ids[i]) for pair in pairs for i in pair})
+    feat_by_id = load_features_for_ids(needed_ids)
+
+    sims = []
+    for i, j in pairs:
+        fi = feat_by_id[int(sig_ids[i])]
+        fj = feat_by_id[int(sig_ids[j])]
+        sims.append(float(fi @ fj))
    return np.array(sims)


@@ -199,9 +246,12 @@ def main():
    print(f'Firm A signatures: {int(firm_a_mask.sum()):,}')

    # --- (1) INTER-CPA NEGATIVE ANCHOR ---
-    print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} pairs)...')
-    sample = load_feature_vectors_sample(n=3000)
-    inter_cos = build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS)
+    print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} '
+          f'i.i.d. pairs from full matched corpus)...')
+    pool_sig_ids, pool_accts = load_signature_ids_for_negative_pool()
+    print(f'  pool size: {len(pool_sig_ids):,} matched signatures')
+    inter_cos = build_inter_cpa_negative(pool_sig_ids, pool_accts,
+                                         n_pairs=N_INTER_PAIRS)
    print(f'  inter-CPA cos: mean={inter_cos.mean():.4f}, '
          f'p95={np.percentile(inter_cos, 95):.4f}, '
          f'p99={np.percentile(inter_cos, 99):.4f}, '
@@ -236,7 +286,8 @@ def main():
    print(f"    threshold={eer['threshold']:.4f}, EER={eer['eer']:.4f}")
    # Canonical threshold evaluations with Wilson CIs
    canonical = {}
-    for tt in [0.70, 0.80, 0.837, 0.90, 0.945, 0.95, 0.973, 0.979]:
+    for tt in [0.70, 0.80, 0.837, 0.90, 0.9407, 0.945, 0.95, 0.973, 0.977,
+               0.979, 0.985]:
        y_pred = (scores > tt).astype(int)
        m = classification_metrics(y, y_pred)
        m['threshold'] = float(tt)
@@ -46,7 +46,10 @@ FIRM_A = '勤業眾信聯合'
 SEED = 42

 # Rules of interest for held-out vs calib comparison.
-COS_RULES = [0.837, 0.945, 0.95]
+# 0.9407 = calibration-fold P5 of the Firm A cosine distribution
+# (see Script 21 / Section III-K) and is included so Table XI of the
+# paper can report calib- and held-fold rates for the same rule set.
+COS_RULES = [0.837, 0.9407, 0.945, 0.95]
 DH_RULES = [5, 8, 9, 15]
 # Dual rule (the paper's classifier's operational dual).
 DUAL_RULES = [(0.95, 8), (0.945, 8)]
@@ -0,0 +1,337 @@
+#!/usr/bin/env python3
+"""
+Script 25: BD/McCrary Bin-Width Sensitivity Sweep
+==================================================
+Codex gpt-5.4 round-5 review recommended that the paper (a) demote
+BD/McCrary in the main-text framing from a co-equal threshold
+estimator to a density-smoothness diagnostic, and (b) run a short
+bin-width robustness sweep and place the results in a supplementary
+appendix as an audit trail. This script implements (b).
+
+For each (variant, bin_width) cell it reports:
+  - transition coordinate (None if no significant transition at alpha=0.05)
+  - Z_below / Z_above adjacent-bin statistics
+  - two-sided p-values for each adjacent Z
+  - number of signatures n
+
+Variants:
+  - Firm A cosine     (signature-level)
+  - Firm A dHash_indep (signature-level)
+  - Full cosine       (signature-level)
+  - Full dHash_indep  (signature-level)
+  - Accountant-level cosine_mean
+  - Accountant-level dHash_indep_mean
+
+Bin widths:
+  cosine:  0.003, 0.005, 0.010, 0.015
+  dHash:   1, 2, 3
+
+Output:
+  reports/bd_sensitivity/bd_sensitivity.md
+  reports/bd_sensitivity/bd_sensitivity.json
+"""
+
+import sqlite3
+import json
+import numpy as np
+from pathlib import Path
+from datetime import datetime
+from scipy.stats import norm
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+           'bd_sensitivity')
+OUT.mkdir(parents=True, exist_ok=True)
+
+FIRM_A = '勤業眾信聯合'
+Z_CRIT = 1.96
+ALPHA = 0.05
+
+COS_BINS = [0.003, 0.005, 0.010, 0.015]
+DH_BINS = [1, 2, 3]
+
+
+def bd_mccrary(values, bin_width, lo=None, hi=None):
+    arr = np.asarray(values, dtype=float)
+    arr = arr[~np.isnan(arr)]
+    if lo is None:
+        lo = float(np.floor(arr.min() / bin_width) * bin_width)
+    if hi is None:
+        hi = float(np.ceil(arr.max() / bin_width) * bin_width)
+    edges = np.arange(lo, hi + bin_width, bin_width)
+    counts, _ = np.histogram(arr, bins=edges)
+    centers = (edges[:-1] + edges[1:]) / 2.0
+    N = counts.sum()
+    if N == 0:
+        return centers, counts, np.full_like(centers, np.nan), np.full_like(centers, np.nan)
+    p = counts / N
+    n_bins = len(counts)
+    z = np.full(n_bins, np.nan)
+    expected = np.full(n_bins, np.nan)
+    for i in range(1, n_bins - 1):
+        p_lo = p[i - 1]
+        p_hi = p[i + 1]
+        exp_i = 0.5 * (counts[i - 1] + counts[i + 1])
+        var_i = (N * p[i] * (1 - p[i])
+                 + 0.25 * N * (p_lo + p_hi) * (1 - p_lo - p_hi))
+        if var_i > 0:
+            z[i] = (counts[i] - exp_i) / np.sqrt(var_i)
+            expected[i] = exp_i
+    return centers, counts, z, expected
+
+
+def find_best_transition(centers, z, direction='neg_to_pos', z_crit=Z_CRIT):
+    """Find strongest adjacent (significant negative, significant
+    positive) pair in the specified direction.
+
+    direction='neg_to_pos' means we look for Z_{i-1} < -z_crit and
+    Z_i > +z_crit (valley on the left, peak on the right). This is
+    the configuration for cosine distributions where the non-hand-
+    signed peak sits to the right.
+
+    direction='pos_to_neg' is the opposite (peak on the left, valley
+    on the right), used for dHash where small values are the
+    non-hand-signed peak.
+    """
+    best = None
+    best_mag = 0.0
+    for i in range(1, len(z)):
+        if np.isnan(z[i]) or np.isnan(z[i - 1]):
+            continue
+        if direction == 'neg_to_pos':
+            if z[i - 1] < -z_crit and z[i] > z_crit:
+                mag = abs(z[i - 1]) + abs(z[i])
+                if mag > best_mag:
+                    best_mag = mag
+                    best = {
+                        'idx': int(i),
+                        'threshold_between': float(0.5 * (centers[i - 1] + centers[i])),
+                        'z_below': float(z[i - 1]),
+                        'z_above': float(z[i]),
+                        'p_below': float(2 * (1 - norm.cdf(abs(z[i - 1])))),
+                        'p_above': float(2 * (1 - norm.cdf(abs(z[i])))),
+                    }
+        else:  # pos_to_neg
+            if z[i - 1] > z_crit and z[i] < -z_crit:
+                mag = abs(z[i - 1]) + abs(z[i])
+                if mag > best_mag:
+                    best_mag = mag
+                    best = {
+                        'idx': int(i),
+                        'threshold_between': float(0.5 * (centers[i - 1] + centers[i])),
+                        'z_above': float(z[i - 1]),
+                        'z_below': float(z[i]),
+                        'p_above': float(2 * (1 - norm.cdf(abs(z[i - 1])))),
+                        'p_below': float(2 * (1 - norm.cdf(abs(z[i])))),
+                    }
+    return best
+
+
+def load_signature_data():
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    cur.execute('''
+        SELECT s.assigned_accountant, a.firm,
+               s.max_similarity_to_same_accountant,
+               s.min_dhash_independent
+        FROM signatures s
+        LEFT JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE s.max_similarity_to_same_accountant IS NOT NULL
+    ''')
+    rows = cur.fetchall()
+    conn.close()
+    return rows
+
+
+def aggregate_accountant(rows):
+    """Compute per-accountant mean cosine and mean dHash_indep."""
+    by_acct = {}
+    for acct, _firm, cos, dh in rows:
+        if acct is None:
+            continue
+        by_acct.setdefault(acct, {'cos': [], 'dh': []})
+        by_acct[acct]['cos'].append(cos)
+        if dh is not None:
+            by_acct[acct]['dh'].append(dh)
+    cos_means = []
+    dh_means = []
+    for acct, v in by_acct.items():
+        if len(v['cos']) >= 10:  # match Section IV-E >=10-signature filter
+            cos_means.append(float(np.mean(v['cos'])))
+            if v['dh']:
+                dh_means.append(float(np.mean(v['dh'])))
+    return np.array(cos_means), np.array(dh_means)
+
+
+def run_variant(values, bin_widths, direction, label, is_integer=False):
+    """Run BD/McCrary at multiple bin widths and collect results."""
+    results = []
+    for bw in bin_widths:
+        centers, counts, z, _ = bd_mccrary(values, bw)
+        all_transitions = []
+        # Also collect ALL significant transitions (not just best) so
+        # the appendix can show whether the procedure consistently
+        # identifies the same or different locations.
+        for i in range(1, len(z)):
+            if np.isnan(z[i]) or np.isnan(z[i - 1]):
+                continue
+            sig_neg_pos = (direction == 'neg_to_pos'
+                           and z[i - 1] < -Z_CRIT and z[i] > Z_CRIT)
+            sig_pos_neg = (direction == 'pos_to_neg'
+                           and z[i - 1] > Z_CRIT and z[i] < -Z_CRIT)
+            if sig_neg_pos or sig_pos_neg:
+                thr = float(0.5 * (centers[i - 1] + centers[i]))
+                all_transitions.append({
+                    'threshold_between': thr,
+                    'z_below': float(z[i - 1] if direction == 'neg_to_pos' else z[i]),
+                    'z_above': float(z[i] if direction == 'neg_to_pos' else z[i - 1]),
+                })
+        best = find_best_transition(centers, z, direction)
+        results.append({
+            'bin_width': float(bw) if not is_integer else int(bw),
+            'n_bins': int(len(centers)),
+            'n_transitions': len(all_transitions),
+            'best_transition': best,
+            'all_transitions': all_transitions,
+        })
+    return {
+        'label': label,
+        'direction': direction,
+        'n': int(len(values)),
+        'bin_sweep': results,
+    }
+
+
+def fmt_transition(t):
+    if t is None:
+        return 'no transition'
+    thr = t['threshold_between']
+    z1 = t['z_below']
+    z2 = t['z_above']
+    return f'{thr:.4f} (z_below={z1:+.2f}, z_above={z2:+.2f})'
+
+
+def main():
+    print('=' * 70)
+    print('Script 25: BD/McCrary Bin-Width Sensitivity Sweep')
+    print('=' * 70)
+
+    rows = load_signature_data()
+    print(f'\nLoaded {len(rows):,} signatures')
+
+    cos_all = np.array([r[2] for r in rows], dtype=float)
+    dh_all = np.array([-1 if r[3] is None else r[3] for r in rows],
+                      dtype=float)
+    firm_a = np.array([r[1] == FIRM_A for r in rows])
+
+    cos_firm_a = cos_all[firm_a]
+    dh_firm_a = dh_all[firm_a]
+    dh_firm_a = dh_firm_a[dh_firm_a >= 0]
+    dh_all_valid = dh_all[dh_all >= 0]
+
+    print(f'  Firm A sigs:  cos n={len(cos_firm_a)}, dh n={len(dh_firm_a)}')
+    print(f'  Full sigs:    cos n={len(cos_all)}, dh n={len(dh_all_valid)}')
+
+    cos_acct, dh_acct = aggregate_accountant(rows)
+    print(f'  Accountants (>=10 sigs): cos_mean n={len(cos_acct)}, dh_mean n={len(dh_acct)}')
+
+    variants = {}
+    variants['firm_a_cosine'] = run_variant(
+        cos_firm_a, COS_BINS, 'neg_to_pos', 'Firm A cosine (signature-level)')
+    variants['firm_a_dhash'] = run_variant(
+        dh_firm_a, DH_BINS, 'pos_to_neg',
+        'Firm A dHash_indep (signature-level)', is_integer=True)
+    variants['full_cosine'] = run_variant(
+        cos_all, COS_BINS, 'neg_to_pos', 'Full-sample cosine (signature-level)')
+    variants['full_dhash'] = run_variant(
+        dh_all_valid, DH_BINS, 'pos_to_neg',
+        'Full-sample dHash_indep (signature-level)', is_integer=True)
+    # Accountant-level: use narrower bins because n is ~700
+    variants['acct_cosine'] = run_variant(
+        cos_acct, [0.002, 0.005, 0.010], 'neg_to_pos',
+        'Accountant-level mean cosine')
+    variants['acct_dhash'] = run_variant(
+        dh_acct, [0.2, 0.5, 1.0], 'pos_to_neg',
+        'Accountant-level mean dHash_indep')
+
+    # Print summary table
+    print('\n=== Summary (best significant transition per bin width) ===')
+    print(f'{"Variant":<40} {"bin":>8} {"result":>50}')
+    print('-' * 100)
+    for vname, v in variants.items():
+        for r in v['bin_sweep']:
+            bw = r['bin_width']
+            res = fmt_transition(r['best_transition'])
+            if r['n_transitions'] > 1:
+                res += f' [+{r["n_transitions"]-1} other sig]'
+            print(f'{v["label"]:<40} {bw:>8} {res:>50}')
+
+    # Save JSON
+    summary = {
+        'generated_at': datetime.now().isoformat(),
+        'z_critical': Z_CRIT,
+        'alpha': ALPHA,
+        'variants': variants,
+    }
+    (OUT / 'bd_sensitivity.json').write_text(
+        json.dumps(summary, indent=2, ensure_ascii=False), encoding='utf-8')
+    print(f'\nJSON: {OUT / "bd_sensitivity.json"}')
+
+    # Markdown report
+    md = [
+        '# BD/McCrary Bin-Width Sensitivity Sweep',
+        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
+        '',
+        f'Critical value |Z| > {Z_CRIT} (two-sided, alpha = {ALPHA}).',
+        'A significant transition requires an adjacent bin pair with',
+        'Z_{below} and Z_{above} both exceeding the critical value in',
+        'the expected direction (neg_to_pos for cosine, pos_to_neg for',
+        'dHash). "no transition" means no adjacent pair satisfied the',
+        'two-sided criterion at the stated bin width.',
+        '',
+    ]
+
+    for vname, v in variants.items():
+        md += [
+            f'## {v["label"]} (n = {v["n"]:,})',
+            '',
+            '| Bin width | Best transition | z_below | z_above | p_below | p_above | # sig transitions |',
+            '|-----------|------------------|---------|---------|---------|---------|-------------------|',
+        ]
+        for r in v['bin_sweep']:
+            t = r['best_transition']
+            if t is None:
+                md.append(f'| {r["bin_width"]} | no transition | — | — | — | — | {r["n_transitions"]} |')
+            else:
+                md.append(
+                    f'| {r["bin_width"]} | {t["threshold_between"]:.4f} '
+                    f'| {t["z_below"]:+.3f} | {t["z_above"]:+.3f} '
+                    f'| {t["p_below"]:.2e} | {t["p_above"]:.2e} '
+                    f'| {r["n_transitions"]} |'
+                )
+        md.append('')
+
+    md += [
+        '## Interpretation',
+        '',
+        '- Accountant-level variants (the unit of analysis used for the',
+        '  paper\'s primary threshold determination) produce no',
+        '  significant transition at any bin width tested, consistent',
+        '  with clustered-but-smoothly-mixed accountant-level',
+        '  aggregates.',
+        '- Signature-level variants produce a transition near cosine',
+        '  0.985 or dHash 2 at every bin width tested, but that',
+        '  transition sits inside (not between) the dominant',
+        '  non-hand-signed mode and therefore does not correspond to a',
+        '  boundary between the hand-signed and non-hand-signed',
+        '  populations.',
+        '- We therefore frame BD/McCrary in the main text as a density-',
+        '  smoothness diagnostic rather than as an independent',
+        '  accountant-level threshold estimator.',
+    ]
+    (OUT / 'bd_sensitivity.md').write_text('\n'.join(md), encoding='utf-8')
+    print(f'Report: {OUT / "bd_sensitivity.md"}')
+
+
+if __name__ == '__main__':
+    main()
@@ -0,0 +1,211 @@
+#!/usr/bin/env python3
+"""
+Script 28: Byte-Identity Decomposition + Cross-Firm Dual-Descriptor Convergence
+================================================================================
+Produces two reproducible artifacts cited in the manuscript that previously
+lacked dedicated provenance (codex review v3.18.1 items #7 and #8):
+
+  (#7) Byte-identical Firm A signature decomposition:
+       - Total Firm A signatures with pixel_identical_to_closest = 1
+       - Number of distinct Firm A partners they span
+       - Number of partners in the registry (denominator)
+       - Number of byte-identical pairs that span DIFFERENT fiscal years
+
+  (#8) Cross-firm dual-descriptor convergence:
+       - Among signatures with cosine > 0.95 (per-signature best-match),
+         the fraction with min_dhash_independent <= 5, broken out by
+         Firm A vs Non-Firm-A.
+
+Firm A membership is defined throughout via accountants.firm (the CPA
+registry firm) joined on signatures.assigned_accountant. This matches
+the convention used by signature_analysis/24_validation_recalibration.py
+and the validation_recalibration JSON, so counts are directly comparable
+to Tables IX / XI / XII.
+
+Output:
+  /Volumes/NV2/PDF-Processing/signature-analysis/reports/byte_identity_decomp/
+      byte_identity_decomposition.json
+      byte_identity_decomposition.md
+
+These figures are intended to be cited from the paper (Section IV-F.1 for #7;
+Section IV-H.2 for #8) so that every quantitative claim in the manuscript
+traces to a specific JSON field.
+"""
+
+import json
+import sqlite3
+from datetime import datetime
+from pathlib import Path
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+           'byte_identity_decomp')
+OUT.mkdir(parents=True, exist_ok=True)
+
+FIRM_A = '勤業眾信聯合'
+
+
+def byte_identity_decomposition(conn):
+    """Codex item #7: 145 / 50 / 180 / 35 decomposition."""
+    cur = conn.cursor()
+
+    cur.execute("""
+        SELECT COUNT(DISTINCT name)
+        FROM accountants
+        WHERE firm = ?
+    """, (FIRM_A,))
+    n_registered_partners = cur.fetchone()[0]
+
+    cur.execute("""
+        WITH byte_pairs AS (
+          SELECT s1.signature_id AS sig_a,
+                 s1.assigned_accountant AS partner,
+                 s1.year_month AS ym_a,
+                 s2.year_month AS ym_b
+          FROM signatures s1
+          JOIN accountants a ON s1.assigned_accountant = a.name
+          JOIN signatures s2 ON s1.closest_match_file = s2.image_filename
+          WHERE s1.pixel_identical_to_closest = 1
+            AND a.firm = ?
+        )
+        SELECT
+          COUNT(*) AS total_pixel_identical_firm_a,
+          COUNT(DISTINCT partner) AS partners_with_pixel_identical,
+          SUM(CASE WHEN substr(ym_a,1,4) <> substr(ym_b,1,4) THEN 1 ELSE 0 END)
+            AS cross_year_pairs
+        FROM byte_pairs
+    """, (FIRM_A,))
+    n_total, n_partners, n_cross_year = cur.fetchone()
+
+    return {
+        'definition': (
+            'Among Firm A signatures whose nearest same-CPA match is '
+            'byte-identical after crop and normalization '
+            '(pixel_identical_to_closest = 1), this section reports the '
+            'count, the distinct-partner spread, the registry denominator, '
+            'and the subset whose byte-identical match is in a different '
+            'fiscal year.'
+        ),
+        'firm_label': 'Firm A',
+        'n_pixel_identical_firm_a_signatures': n_total,
+        'n_distinct_partners_with_pixel_identical': n_partners,
+        'n_registered_partners_in_firm_a': n_registered_partners,
+        'partner_coverage_share': round(n_partners / n_registered_partners, 4),
+        'n_cross_year_byte_identical_pairs': n_cross_year,
+    }
+
+
+def cross_firm_dual_convergence(conn):
+    """Codex item #8: per-signature dual-descriptor convergence by firm."""
+    cur = conn.cursor()
+
+    cur.execute("""
+        SELECT
+          CASE WHEN a.firm = ? THEN 'Firm A' ELSE 'Non-Firm-A' END
+            AS firm_group,
+          COUNT(*) AS n_signatures_above_095,
+          SUM(CASE WHEN s.min_dhash_independent <= 5 THEN 1 ELSE 0 END)
+            AS n_dhash_le_5
+        FROM signatures s
+        JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE s.max_similarity_to_same_accountant > 0.95
+          AND s.min_dhash_independent IS NOT NULL
+        GROUP BY firm_group
+        ORDER BY firm_group
+    """, (FIRM_A,))
+
+    rows = cur.fetchall()
+    by_group = {}
+    for firm_group, n_above, n_dhash in rows:
+        by_group[firm_group] = {
+            'n_signatures_above_cosine_095': n_above,
+            'n_dhash_indep_le_5': n_dhash,
+            'pct_dhash_indep_le_5': round(100.0 * n_dhash / n_above, 2),
+        }
+
+    return {
+        'definition': (
+            'Per-signature best-match cosine > 0.95 AND assigned_accountant '
+            'IS NOT NULL AND min_dhash_independent IS NOT NULL. The reported '
+            'percentage is the share of these signatures whose independent '
+            'min dHash to any same-CPA signature is <= 5.'
+        ),
+        'unit_of_observation': 'signature',
+        'cosine_threshold': 0.95,
+        'dhash_indep_threshold': 5,
+        'by_firm_group': by_group,
+    }
+
+
+def write_markdown(payload, path):
+    bid = payload['byte_identity_decomposition']
+    cf = payload['cross_firm_dual_convergence']
+
+    lines = []
+    lines.append('# Byte-Identity Decomposition + Cross-Firm Dual-Descriptor '
+                 'Convergence')
+    lines.append('')
+    lines.append(f"Generated at: {payload['generated_at']}")
+    lines.append('')
+
+    lines.append('## 1. Byte-Identity Decomposition (Firm A)')
+    lines.append('')
+    lines.append(bid['definition'])
+    lines.append('')
+    lines.append('| Quantity | Value |')
+    lines.append('|----------|-------|')
+    lines.append(f"| Pixel-identical Firm A signatures | "
+                 f"{bid['n_pixel_identical_firm_a_signatures']} |")
+    lines.append(f"| Distinct Firm A partners with at least one such pair | "
+                 f"{bid['n_distinct_partners_with_pixel_identical']} |")
+    lines.append(f"| Registered Firm A partners | "
+                 f"{bid['n_registered_partners_in_firm_a']} |")
+    lines.append(f"| Partner coverage share | "
+                 f"{bid['partner_coverage_share']:.3f} |")
+    lines.append(f"| Pairs whose byte-identical match spans different fiscal "
+                 f"years | {bid['n_cross_year_byte_identical_pairs']} |")
+    lines.append('')
+
+    lines.append('## 2. Cross-Firm Dual-Descriptor Convergence')
+    lines.append('')
+    lines.append(cf['definition'])
+    lines.append('')
+    lines.append('| Firm group | N signatures with cosine > 0.95 | '
+                 'N with dHash_indep <= 5 | % with dHash_indep <= 5 |')
+    lines.append('|------------|--------------------------------:|'
+                 '------------------------:|------------------------:|')
+    for grp in ('Firm A', 'Non-Firm-A'):
+        g = cf['by_firm_group'][grp]
+        lines.append(f"| {grp} | "
+                     f"{g['n_signatures_above_cosine_095']:,} | "
+                     f"{g['n_dhash_indep_le_5']:,} | "
+                     f"{g['pct_dhash_indep_le_5']:.2f}% |")
+
+    path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
+
+
+def main():
+    conn = sqlite3.connect(DB)
+    try:
+        payload = {
+            'generated_at': datetime.now().isoformat(timespec='seconds'),
+            'database_path': DB,
+            'firm_a_label': FIRM_A,
+            'byte_identity_decomposition': byte_identity_decomposition(conn),
+            'cross_firm_dual_convergence': cross_firm_dual_convergence(conn),
+        }
+    finally:
+        conn.close()
+
+    json_path = OUT / 'byte_identity_decomposition.json'
+    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
+                         encoding='utf-8')
+    print(f'Wrote {json_path}')
+
+    md_path = OUT / 'byte_identity_decomposition.md'
+    write_markdown(payload, md_path)
+    print(f'Wrote {md_path}')
+
+
+if __name__ == '__main__':
+    main()
@@ -0,0 +1,123 @@
+#!/usr/bin/env python3
+"""
+Script 29: Firm A Per-Year Cosine Distribution (Table XIII)
+============================================================
+Generates the year-by-year Firm A per-signature best-match cosine
+distribution reported as Table XIII in the manuscript. Codex / Gemini
+round-19 review identified that this table previously had no dedicated
+generating script (Appendix B incorrectly attributed it to Script 08,
+which has no year_month extraction).
+
+Definition:
+  Firm A membership is via CPA registry (accountants.firm joined on
+  signatures.assigned_accountant), matching the convention used by
+  scripts 24 and 28.
+
+  For each fiscal year (substr(year_month, 1, 4)):
+    - N signatures with non-null max_similarity_to_same_accountant
+    - mean of max_similarity_to_same_accountant (the per-signature
+      best-match cosine)
+    - share with max_similarity_to_same_accountant < 0.95 (the
+      left-tail rate cited in Section IV-G.1)
+
+Output:
+  reports/firm_a_yearly/firm_a_yearly_distribution.json
+  reports/firm_a_yearly/firm_a_yearly_distribution.md
+"""
+
+import json
+import sqlite3
+from datetime import datetime
+from pathlib import Path
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+           'firm_a_yearly')
+OUT.mkdir(parents=True, exist_ok=True)
+
+FIRM_A = '勤業眾信聯合'
+
+
+def yearly_distribution(conn):
+    cur = conn.cursor()
+    cur.execute("""
+        SELECT substr(s.year_month, 1, 4) AS year,
+               COUNT(*) AS n_sigs,
+               AVG(s.max_similarity_to_same_accountant) AS mean_cos,
+               SUM(CASE
+                     WHEN s.max_similarity_to_same_accountant < 0.95
+                     THEN 1 ELSE 0
+                   END) AS n_below_095
+        FROM signatures s
+        JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE a.firm = ?
+          AND s.max_similarity_to_same_accountant IS NOT NULL
+          AND s.year_month IS NOT NULL
+        GROUP BY year
+        ORDER BY year
+    """, (FIRM_A,))
+
+    rows = []
+    for year, n_sigs, mean_cos, n_below in cur.fetchall():
+        rows.append({
+            'year': int(year),
+            'n_signatures': n_sigs,
+            'mean_best_match_cosine': round(mean_cos, 4),
+            'n_below_cosine_095': n_below,
+            'pct_below_cosine_095': round(100.0 * n_below / n_sigs, 2),
+        })
+    return rows
+
+
+def write_markdown(payload, path):
+    rows = payload['yearly_rows']
+    lines = []
+    lines.append('# Firm A Per-Year Cosine Distribution (Table XIII)')
+    lines.append('')
+    lines.append(f"Generated at: {payload['generated_at']}")
+    lines.append('')
+    lines.append('Firm A membership: CPA registry '
+                 '(accountants.firm = "勤業眾信聯合"). Per-signature '
+                 'best-match cosine = '
+                 'signatures.max_similarity_to_same_accountant.')
+    lines.append('')
+    lines.append('| Year | N sigs | mean best-match cosine | % below 0.95 |')
+    lines.append('|------|--------|------------------------|--------------|')
+    for r in rows:
+        lines.append(
+            f"| {r['year']} | {r['n_signatures']:,} | "
+            f"{r['mean_best_match_cosine']:.4f} | "
+            f"{r['pct_below_cosine_095']:.2f}% |"
+        )
+    path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
+
+
+def main():
+    conn = sqlite3.connect(DB)
+    try:
+        payload = {
+            'generated_at': datetime.now().isoformat(timespec='seconds'),
+            'database_path': DB,
+            'firm_a_label': FIRM_A,
+            'firm_a_membership_definition': (
+                'CPA registry: accountants.firm joined on '
+                'signatures.assigned_accountant'
+            ),
+            'cosine_metric': 'signatures.max_similarity_to_same_accountant',
+            'yearly_rows': yearly_distribution(conn),
+        }
+    finally:
+        conn.close()
+
+    json_path = OUT / 'firm_a_yearly_distribution.json'
+    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
+                         encoding='utf-8')
+    print(f'Wrote {json_path}')
+
+    md_path = OUT / 'firm_a_yearly_distribution.md'
+    write_markdown(payload, md_path)
+    print(f'Wrote {md_path}')
+
+
+if __name__ == '__main__':
+    main()
@@ -0,0 +1,255 @@
+#!/usr/bin/env python3
+"""
+Script 30: Yearly Per-Firm Cosine Similarity Comparison
+========================================================
+Generates the per-firm year-by-year per-signature best-match cosine
+distribution: Firm A (Deloitte), Firm B (KPMG), Firm C (PwC),
+Firm D (EY), Non-Big-4. The two-panel figure (mean cosine; share above
+0.95) is the headline cross-firm visual requested in partner review of
+v3.19.1 (2026-04-27): five lines, X-axis 2013-2023, Firm A at the top.
+
+Outputs:
+  reports/figures/fig_yearly_big4_comparison.png
+  reports/figures/fig_yearly_big4_comparison.pdf
+  reports/firm_yearly_comparison/firm_yearly_comparison.json
+  reports/firm_yearly_comparison/firm_yearly_comparison.md
+"""
+
+import json
+import sqlite3
+from datetime import datetime
+from pathlib import Path
+
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+import numpy as np
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+FIG_OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+               'figures')
+DATA_OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+                'firm_yearly_comparison')
+FIG_OUT.mkdir(parents=True, exist_ok=True)
+DATA_OUT.mkdir(parents=True, exist_ok=True)
+
+FIRM_BUCKETS = [
+    ('Firm A', '勤業眾信聯合'),
+    ('Firm B', '安侯建業聯合'),
+    ('Firm C', '資誠聯合'),
+    ('Firm D', '安永聯合'),
+]
+
+FIRM_COLORS = {
+    'Firm A': '#d62728',
+    'Firm B': '#1f77b4',
+    'Firm C': '#2ca02c',
+    'Firm D': '#9467bd',
+    'Non-Big-4': '#7f7f7f',
+}
+FIRM_MARKERS = {
+    'Firm A': 'o',
+    'Firm B': 's',
+    'Firm C': '^',
+    'Firm D': 'D',
+    'Non-Big-4': 'v',
+}
+COSINE_CUT = 0.95
+
+
+def firm_bucket(firm):
+    for label, name in FIRM_BUCKETS:
+        if firm == name:
+            return label
+    return 'Non-Big-4'
+
+
+def load_rows(conn):
+    cur = conn.cursor()
+    cur.execute("""
+        SELECT a.firm,
+               CAST(substr(s.year_month, 1, 4) AS INTEGER) AS year,
+               s.max_similarity_to_same_accountant
+        FROM signatures s
+        LEFT JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE s.max_similarity_to_same_accountant IS NOT NULL
+          AND s.year_month IS NOT NULL
+          AND s.assigned_accountant IS NOT NULL
+    """)
+    return cur.fetchall()
+
+
+def aggregate(rows):
+    """Returns dict keyed by (firm_label, year) -> {n, mean_cos, share_ge_cut}."""
+    by_firm_year = {}
+    for firm, year, cos in rows:
+        if year is None or year < 2013 or year > 2023:
+            continue
+        label = firm_bucket(firm)
+        key = (label, int(year))
+        by_firm_year.setdefault(key, []).append(float(cos))
+
+    summary = {}
+    for (label, year), vals in by_firm_year.items():
+        arr = np.array(vals, dtype=float)
+        summary[(label, year)] = {
+            'n': int(arr.size),
+            'mean_cos': float(arr.mean()),
+            'share_ge_cut': float(np.mean(arr >= COSINE_CUT)),
+        }
+    return summary
+
+
+def plot_figure(summary, years, firm_labels, fig_path_png, fig_path_pdf):
+    fig, axes = plt.subplots(1, 2, figsize=(13, 5))
+
+    ax = axes[0]
+    for label in firm_labels:
+        ys = [summary[(label, y)]['mean_cos']
+              if (label, y) in summary else np.nan
+              for y in years]
+        ax.plot(years, ys,
+                marker=FIRM_MARKERS[label], color=FIRM_COLORS[label],
+                lw=2.0, ms=6, label=label,
+                zorder=3 if label == 'Firm A' else 2)
+    ax.set_xlabel('Fiscal year')
+    ax.set_ylabel('Mean per-signature best-match cosine')
+    ax.set_title('(a) Mean per-signature best-match cosine, by firm and year')
+    ax.set_xticks(years)
+    ax.tick_params(axis='x', rotation=0)
+    ax.grid(True, ls=':', alpha=0.4)
+    ax.legend(loc='lower right', framealpha=0.95)
+
+    ax = axes[1]
+    for label in firm_labels:
+        ys = [100.0 * summary[(label, y)]['share_ge_cut']
+              if (label, y) in summary else np.nan
+              for y in years]
+        ax.plot(years, ys,
+                marker=FIRM_MARKERS[label], color=FIRM_COLORS[label],
+                lw=2.0, ms=6, label=label,
+                zorder=3 if label == 'Firm A' else 2)
+    ax.set_xlabel('Fiscal year')
+    ax.set_ylabel(f'% signatures with best-match cosine $\\geq$ {COSINE_CUT}')
+    ax.set_title(f'(b) Share with cosine $\\geq$ {COSINE_CUT}, '
+                 'by firm and year')
+    ax.set_xticks(years)
+    ax.tick_params(axis='x', rotation=0)
+    ax.grid(True, ls=':', alpha=0.4)
+    ax.legend(loc='lower right', framealpha=0.95)
+    ax.set_ylim(0, 100)
+
+    fig.suptitle('Per-firm yearly per-signature best-match cosine '
+                 '(operational cut shown as 0.95)',
+                 fontsize=12, y=1.02)
+    fig.tight_layout()
+    fig.savefig(fig_path_png, dpi=200, bbox_inches='tight')
+    fig.savefig(fig_path_pdf, bbox_inches='tight')
+    plt.close(fig)
+
+
+def write_markdown(summary, years, firm_labels, md_path):
+    lines = ['# Per-Firm Yearly Cosine Comparison',
+             '',
+             f"Generated: {datetime.now().isoformat(timespec='seconds')}",
+             '',
+             ('Per-signature best-match cosine '
+              '(`max_similarity_to_same_accountant`), aggregated by firm '
+              'bucket and fiscal year. Firm bucket via CPA registry '
+              '(`accountants.firm`).'),
+             '']
+
+    lines.append('## Mean per-signature best-match cosine')
+    lines.append('')
+    header = '| Year | ' + ' | '.join(firm_labels) + ' |'
+    sep = '|------|' + '|'.join(['------'] * len(firm_labels)) + '|'
+    lines.append(header)
+    lines.append(sep)
+    for y in years:
+        row = f'| {y} | '
+        cells = []
+        for lab in firm_labels:
+            if (lab, y) in summary:
+                cells.append(f"{summary[(lab, y)]['mean_cos']:.4f}")
+            else:
+                cells.append('---')
+        row += ' | '.join(cells) + ' |'
+        lines.append(row)
+
+    lines.append('')
+    lines.append(f'## Share with cosine $\\geq$ {COSINE_CUT}')
+    lines.append('')
+    lines.append(header)
+    lines.append(sep)
+    for y in years:
+        row = f'| {y} | '
+        cells = []
+        for lab in firm_labels:
+            if (lab, y) in summary:
+                cells.append(f"{100*summary[(lab, y)]['share_ge_cut']:.1f}%")
+            else:
+                cells.append('---')
+        row += ' | '.join(cells) + ' |'
+        lines.append(row)
+
+    lines.append('')
+    lines.append('## Per-firm signature counts')
+    lines.append('')
+    lines.append(header)
+    lines.append(sep)
+    for y in years:
+        row = f'| {y} | '
+        cells = []
+        for lab in firm_labels:
+            if (lab, y) in summary:
+                cells.append(f"{summary[(lab, y)]['n']:,}")
+            else:
+                cells.append('---')
+        row += ' | '.join(cells) + ' |'
+        lines.append(row)
+
+    md_path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
+
+
+def main():
+    conn = sqlite3.connect(DB)
+    try:
+        rows = load_rows(conn)
+    finally:
+        conn.close()
+    print(f'Loaded {len(rows):,} signatures with cosine + year + firm.')
+
+    summary = aggregate(rows)
+    years = sorted({y for (_, y) in summary})
+    firm_labels = ['Firm A', 'Firm B', 'Firm C', 'Firm D', 'Non-Big-4']
+
+    fig_png = FIG_OUT / 'fig_yearly_big4_comparison.png'
+    fig_pdf = FIG_OUT / 'fig_yearly_big4_comparison.pdf'
+    plot_figure(summary, years, firm_labels, fig_png, fig_pdf)
+    print(f'Wrote {fig_png}')
+    print(f'Wrote {fig_pdf}')
+
+    payload = {
+        'generated_at': datetime.now().isoformat(timespec='seconds'),
+        'database_path': DB,
+        'cosine_cut': COSINE_CUT,
+        'firm_buckets': dict(FIRM_BUCKETS) | {'Non-Big-4': 'all other'},
+        'years': years,
+        'rows': [
+            {'firm': lab, 'year': y, **summary[(lab, y)]}
+            for lab in firm_labels for y in years
+            if (lab, y) in summary
+        ],
+    }
+    json_path = DATA_OUT / 'firm_yearly_comparison.json'
+    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
+                         encoding='utf-8')
+    print(f'Wrote {json_path}')
+
+    md_path = DATA_OUT / 'firm_yearly_comparison.md'
+    write_markdown(summary, years, firm_labels, md_path)
+    print(f'Wrote {md_path}')
+
+
+if __name__ == '__main__':
+    main()
@@ -0,0 +1,249 @@
+#!/usr/bin/env python3
+"""
+Script 31: Within-Year Same-CPA Ranking Robustness Check
+==========================================================
+Recomputes the per-auditor-year mean cosine ranking of Table XIV using
+within-year same-CPA matching only (instead of cross-year same-CPA pool
+which Table XIV uses by construction). Reports pooled top-10/20/30%
+Firm A share under the within-year restriction so the partner-level
+ranking finding can be checked against the cross-year aggregation
+choice flagged in Section IV-G.2.
+
+Definition (within-year statistic):
+  For each signature s, with CPA = c, year = y:
+    cos_within(s) = max cosine(s, s') over s' != s, CPA(s')=c, year(s')=y
+  If a (CPA, year) block has only one signature, cos_within is undefined
+  and that signature is dropped from the auditor-year aggregation
+  (matching the same-CPA pair-existence requirement of Section III-G).
+
+Outputs:
+  reports/within_year_ranking/within_year_ranking.json
+  reports/within_year_ranking/within_year_ranking.md
+"""
+
+import json
+import sqlite3
+from collections import defaultdict
+from datetime import datetime
+from pathlib import Path
+
+import numpy as np
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+           'within_year_ranking')
+OUT.mkdir(parents=True, exist_ok=True)
+
+FIRM_A = '勤業眾信聯合'
+MIN_SIGS_PER_AUDITOR_YEAR = 5
+
+
+def firm_bucket(firm):
+    if firm == '勤業眾信聯合':
+        return 'Firm A'
+    if firm == '安侯建業聯合':
+        return 'Firm B'
+    if firm == '資誠聯合':
+        return 'Firm C'
+    if firm == '安永聯合':
+        return 'Firm D'
+    return 'Non-Big-4'
+
+
+def load_signatures():
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    cur.execute("""
+        SELECT s.signature_id, s.assigned_accountant, a.firm,
+               CAST(substr(s.year_month, 1, 4) AS INTEGER) AS year,
+               s.feature_vector
+        FROM signatures s
+        LEFT JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE s.feature_vector IS NOT NULL
+          AND s.assigned_accountant IS NOT NULL
+          AND s.year_month IS NOT NULL
+    """)
+    rows = cur.fetchall()
+    conn.close()
+    return rows
+
+
+def compute_within_year_max(rows):
+    """Group by (CPA, year), compute max cosine to other same-block sigs."""
+    blocks = defaultdict(list)  # (cpa, year) -> [(sig_id, feat)]
+    for sig_id, cpa, firm, year, blob in rows:
+        if year is None:
+            continue
+        feat = np.frombuffer(blob, dtype=np.float32)
+        blocks[(cpa, int(year))].append((sig_id, feat, firm))
+
+    sig_max_within = {}  # sig_id -> max within-year same-CPA cosine
+    sig_meta = {}        # sig_id -> (cpa, year, firm)
+    for (cpa, year), entries in blocks.items():
+        if len(entries) < 2:
+            continue  # singleton: max-within is undefined
+        feats = np.stack([e[1] for e in entries])  # (n, 2048)
+        sims = feats @ feats.T                      # (n, n)
+        np.fill_diagonal(sims, -np.inf)
+        maxs = sims.max(axis=1)
+        for i, (sig_id, _, firm) in enumerate(entries):
+            sig_max_within[sig_id] = float(maxs[i])
+            sig_meta[sig_id] = (cpa, year, firm)
+    return sig_max_within, sig_meta
+
+
+def auditor_year_aggregation(sig_max_within, sig_meta):
+    by_ay = defaultdict(list)  # (cpa, year) -> list of cos
+    for sig_id, cos in sig_max_within.items():
+        cpa, year, firm = sig_meta[sig_id]
+        by_ay[(cpa, year)].append(cos)
+    rows = []
+    for (cpa, year), vals in by_ay.items():
+        if len(vals) < MIN_SIGS_PER_AUDITOR_YEAR:
+            continue
+        firm = sig_meta[next(s for s in sig_max_within
+                              if sig_meta[s][0] == cpa
+                              and sig_meta[s][1] == year)][2]
+        rows.append({
+            'acct': cpa,
+            'year': year,
+            'firm': firm,
+            'cos_mean_within_year': float(np.mean(vals)),
+            'n': len(vals),
+        })
+    return rows
+
+
+def top_k_breakdown(rows, k_pcts=(10, 20, 25, 30, 50)):
+    sorted_rows = sorted(rows, key=lambda r: -r['cos_mean_within_year'])
+    N = len(sorted_rows)
+    out = {}
+    for k_pct in k_pcts:
+        k = max(1, int(N * k_pct / 100))
+        top = sorted_rows[:k]
+        counts = defaultdict(int)
+        for r in top:
+            counts[firm_bucket(r['firm'])] += 1
+        out[f'top_{k_pct}pct'] = {
+            'k': k,
+            'firm_counts': dict(counts),
+            'firm_a_share': counts['Firm A'] / k,
+        }
+    return out
+
+
+def per_year_top_k(rows, k_pcts=(10, 20, 30)):
+    years = sorted(set(r['year'] for r in rows))
+    out = {}
+    for y in years:
+        yr = [r for r in rows if r['year'] == y]
+        if not yr:
+            continue
+        sr = sorted(yr, key=lambda r: -r['cos_mean_within_year'])
+        n_y = len(sr)
+        n_a = sum(1 for r in sr if r['firm'] == FIRM_A)
+        per = {'n_auditor_years': n_y,
+               'firm_a_baseline_share': n_a / n_y,
+               'top_k': {}}
+        for kp in k_pcts:
+            k = max(1, int(n_y * kp / 100))
+            n_a_top = sum(1 for r in sr[:k] if r['firm'] == FIRM_A)
+            per['top_k'][f'top_{kp}pct'] = {
+                'k': k,
+                'firm_a_in_top': n_a_top,
+                'firm_a_share': n_a_top / k,
+            }
+        out[y] = per
+    return out
+
+
+def main():
+    print('Loading signatures + features...')
+    rows = load_signatures()
+    print(f'  loaded {len(rows):,}')
+
+    print('Computing within-year same-CPA max cosine...')
+    sig_max_within, sig_meta = compute_within_year_max(rows)
+    print(f'  signatures with within-year pair: {len(sig_max_within):,}')
+    n_dropped = len(rows) - len(sig_max_within)
+    print(f'  dropped (singleton within year): {n_dropped:,}')
+
+    ay_rows = auditor_year_aggregation(sig_max_within, sig_meta)
+    print(f'  auditor-years (>={MIN_SIGS_PER_AUDITOR_YEAR} sigs '
+          f'with within-year pair): {len(ay_rows):,}')
+
+    pooled = top_k_breakdown(ay_rows)
+    yearly = per_year_top_k(ay_rows)
+
+    payload = {
+        'generated_at': datetime.now().isoformat(timespec='seconds'),
+        'n_signatures_loaded': len(rows),
+        'n_signatures_with_within_year_pair': len(sig_max_within),
+        'n_singleton_dropped': n_dropped,
+        'min_sigs_per_auditor_year': MIN_SIGS_PER_AUDITOR_YEAR,
+        'n_auditor_years': len(ay_rows),
+        'n_firm_a_auditor_years': sum(1 for r in ay_rows
+                                       if r['firm'] == FIRM_A),
+        'pooled_top_k': pooled,
+        'yearly_top_k': yearly,
+    }
+    json_path = OUT / 'within_year_ranking.json'
+    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
+                         encoding='utf-8')
+    print(f'\nWrote {json_path}')
+
+    # Markdown
+    md = ['# Within-Year Same-CPA Ranking Robustness',
+          '',
+          f"Generated: {payload['generated_at']}",
+          '',
+          ('Per-signature best-match cosine recomputed using within-year '
+           'same-CPA matching only. See Script 31 docstring for the '
+           'precise definition.'),
+          '',
+          f"- Signatures loaded: {len(rows):,}",
+          f"- Signatures with at least one within-year same-CPA pair: "
+          f"{len(sig_max_within):,}",
+          f"- Singletons dropped (no within-year pair): {n_dropped:,}",
+          f"- Auditor-years with >= {MIN_SIGS_PER_AUDITOR_YEAR} sigs: "
+          f"{len(ay_rows):,}",
+          f"- Firm A auditor-years: {payload['n_firm_a_auditor_years']:,} "
+          f"({100*payload['n_firm_a_auditor_years']/len(ay_rows):.1f}% baseline)",
+          '',
+          '## Pooled (2013-2023) top-K Firm A share',
+          '',
+          '| Top-K | k | Firm A share | A | B | C | D | NB4 |',
+          '|-------|---|--------------|---|---|---|---|-----|']
+    for kp in [10, 20, 25, 30, 50]:
+        d = pooled[f'top_{kp}pct']
+        c = d['firm_counts']
+        md.append(f"| {kp}% | {d['k']:,} | "
+                  f"{100*d['firm_a_share']:.1f}% | "
+                  f"{c.get('Firm A', 0)} | {c.get('Firm B', 0)} | "
+                  f"{c.get('Firm C', 0)} | {c.get('Firm D', 0)} | "
+                  f"{c.get('Non-Big-4', 0)} |")
+
+    md.extend(['',
+               '## Year-by-year top-K Firm A share',
+               '',
+               '| Year | n AY | Top-10% share | Top-20% share | '
+               'Top-30% share | A baseline |',
+               '|------|------|---------------|---------------|'
+               '---------------|------------|'])
+    for y in sorted(yearly):
+        per = yearly[y]
+        line = (f"| {y} | {per['n_auditor_years']:,} ")
+        for kp in [10, 20, 30]:
+            d = per['top_k'][f'top_{kp}pct']
+            line += (f"| {100*d['firm_a_share']:.1f}% "
+                     f"({d['firm_a_in_top']}/{d['k']}) ")
+        line += f"| {100*per['firm_a_baseline_share']:.1f}% |"
+        md.append(line)
+
+    md_path = OUT / 'within_year_ranking.md'
+    md_path.write_text('\n'.join(md) + '\n', encoding='utf-8')
+    print(f'Wrote {md_path}')
+
+
+if __name__ == '__main__':
+    main()