Reading additional input from stdin...
OpenAI Codex v0.122.0 (research preview)
--------
workdir: /Volumes/NV2/pdf_recognize
model: gpt-5.4
provider: openai
approval: never
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/gbanyan/.codex/memories]
reasoning effort: xhigh
reasoning summaries: none
session id: 019daecf-4878-7b80-8698-5ab11491949a
--------
user
You are performing an eighth-round peer review of Paper A v3.8 (commit fcce58a), targeted at IEEE Access (Regular Paper). This is the CPA signature authenticity paper you have reviewed in rounds 3 through 5 (codex_review_gpt54_v3_{3,4,5}.md). Rounds 6 and 7 were performed by Gemini 3.1 Pro (gemini_review_v3_{7,8}.md).

Context - full review trajectory:
- v3.3: Major revision (codex round-3, 3 blockers)
- v3.4: Major revision (codex round-4, partial resolution)
- v3.5: Minor revision (codex round-5)
- v3.6: quick-win cleanup
- v3.7: Gemini round-6 Minor revision (flagged 3 issues codex had missed: BD/McCrary power-artifact Type II overclaim, Table X EER/FRR tautology, IV-I worst-case narrative + 15-signature delta)
- v3.8 (current): addressed all 3 Gemini findings; Gemini round-7 returned Accept verdict

Your task: perform the EIGHTH and hopefully final round of review. Confirm (or challenge) Gemini's Accept verdict from YOUR perspective. You gave v3.5 Minor Revision; since then v3.6/v3.7/v3.8 have addressed both your residual items AND Gemini's new findings. Is v3.8 now submission-ready from the codex-gpt-5.4 perspective too?

Be independent, not rubber-stamp. If you disagree with Gemini's Accept (e.g., you see remaining issues Gemini missed), say so clearly.

Read ALL 10 paper markdown files:
- @paper/paper_a_abstract_v3.md
- @paper/paper_a_introduction_v3.md
- @paper/paper_a_related_work_v3.md
- @paper/paper_a_methodology_v3.md
- @paper/paper_a_results_v3.md
- @paper/paper_a_discussion_v3.md
- @paper/paper_a_conclusion_v3.md
- @paper/paper_a_appendix_v3.md
- @paper/paper_a_references_v3.md
- @paper/paper_a_impact_statement_v3.md (archived, NOT in submission)

Cross-reference prior reviews:
- @paper/codex_review_gpt54_v3_5.md (your round-5 Minor Revision, with 7-item v3.6 action list)
- @paper/gemini_review_v3_7.md (Gemini round-6 Minor Revision, with 3 new findings)
- @paper/gemini_review_v3_8.md (Gemini round-7 Accept verdict)

Your review MUST include:

1. Overall verdict: Accept / Minor Revision / Major Revision / Reject, with one-paragraph rationale. This is the decision the author wants: submit or v3.9.

2. Round-5 codex-action-list audit: your round-5 review had 7 items for v3.6. Confirm each as RESOLVED in the current v3.8 (or flag if any regressed). Items were:
   (1) Table IX re-sync to script outputs
   (2) Circularity overclaim removal
   (3) export_v3.py single-anonymized fix
   (4) Visual-inspection protocol documentation [substantive, deferred]
   (5) BD/McCrary either substantiate or demote [now demoted + Appendix A + power caveat]
   (6) Reproducibility appendix [substantive, still deferred]
   (7) Stale cosine-conditional dHash sentence

3. Round-6 Gemini findings verification: independently verify the three Gemini issues are resolved in v3.8 (BD/McCrary power reframe; Table X EER/FRR removal; IV-I worst-case narrative + 15-signature footnote). Flag any tension you see with earlier codex framings.

4. Regression check: the v3.7/v3.8 rewrites touched many sections. Look for (a) internal contradictions between old and new prose, (b) broken cross-references, (c) orphaned sentences that now lack context after an adjacent rewrite, (d) statistical claims that still don't match script outputs.

5. Remaining outstanding items: codex round-5 flagged visual-inspection protocol and reproducibility appendix as v3.6 action items 4 and 6. These have been explicitly deferred to post-submission. Is that deferral defensible for an IEEE Access first submission, or do you still consider them blocking?

6. Final recommendation:
   - If Accept: confirm and list any very-last-mile items the author should verify manually before the real DOCX submission (e.g., author name fill-in, figure rendering, reference list sanity check).
   - If Minor Revision: list the specific v3.9 items, ranked.

Be concise (800-1500 words). This is a verification round, not a first read. Do not rehash your round-3-5 findings. Focus on whether v3.8 clears review.

IMPORTANT: be independent. If you think Gemini's Accept was generous and a Minor is still warranted, say so. If you concur with Accept, say so clearly and explicitly.
deprecated: `[features].web_search_request` is deprecated because web search is enabled by default.
Set `web_search` to `"live"`, `"cached"`, or `"disabled"` at the top level (or under a profile) in config.toml if you want to override it.
codex
I’m reviewing the current manuscript and the prior round notes side by side first, then I’ll verify whether the specific round-5 and Gemini issues are actually closed in `v3.8` or if anything regressed.
exec
/bin/zsh -lc 'wc -l paper/paper_a_abstract_v3.md paper/paper_a_introduction_v3.md paper/paper_a_related_work_v3.md paper/paper_a_methodology_v3.md paper/paper_a_results_v3.md paper/paper_a_discussion_v3.md paper/paper_a_conclusion_v3.md paper/paper_a_appendix_v3.md paper/paper_a_references_v3.md paper/paper_a_impact_statement_v3.md paper/codex_review_gpt54_v3_5.md paper/gemini_review_v3_7.md paper/gemini_review_v3_8.md' in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
       7 paper/paper_a_abstract_v3.md
      87 paper/paper_a_introduction_v3.md
     104 paper/paper_a_related_work_v3.md
     294 paper/paper_a_methodology_v3.md
     444 paper/paper_a_results_v3.md
     111 paper/paper_a_discussion_v3.md
      32 paper/paper_a_conclusion_v3.md
      45 paper/paper_a_appendix_v3.md
      87 paper/paper_a_references_v3.md
      21 paper/paper_a_impact_statement_v3.md
     165 paper/codex_review_gpt54_v3_5.md
     120 paper/gemini_review_v3_7.md
      67 paper/gemini_review_v3_8.md
    1584 total

exec
/bin/zsh -lc 'pwd && git rev-parse --short HEAD && git status --short && rg --files paper' in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
/Volumes/NV2/pdf_recognize
fcce58a
?? paper/gemini_review_v3_8.md
paper/paper_a_methodology_v3.md
paper/renumber_citations.py
paper/codex_review_gpt54_v3_4.md
paper/paper_a_impact_statement.md
paper/paper_a_outline.md
paper/Paper_A_IEEE_TAI_Draft_v2.docx
paper/paper_a_discussion.md
paper/ablation_backbone_comparison.py
paper/export_paper_to_docx.py
paper/Paper_A_IEEE_TAI_Draft_20260403.docx
paper/codex_review_gpt54_v3_5.md
paper/build_docx.sh
paper/codex_review_gpt54_v3.md
paper/export_v2.py
paper/codex_bd_mccrary_opinion.md
paper/generate_recalibrated_report.py
paper/paper_a_appendix_v3.md
paper/paper_a_abstract_v3.md
paper/export_v3.py
paper/paper_a_conclusion.md
paper/paper_a_methodology.md
paper/paper_a_abstract.md
paper/paper_a_introduction_v3.md
paper/generate_paper_figures.py
paper/gemini_review_v3_7.md
paper/paper_a_conclusion_v3.md
paper/paper_a_related_work_v3.md
paper/paper_a_results.md
paper/paper_a_results_v3.md
paper/paper_a_related_work.md
paper/gemini_review_v3_8.md
paper/paper_a_references_v3.md
paper/paper_a_discussion_v3.md
paper/Paper_A_IEEE_Access_Draft_v3.docx
paper/codex_review_gpt54_v3_2.md
paper/paper_a_impact_statement_v3.md
paper/paper_a_references.md
paper/recalibrate_classification.py
paper/paper_a_introduction.md
paper/codex_review_gpt54_v3_3.md

codex
I’ve confirmed the workspace is at `fcce58a`. Next I’m reading the prior codex and Gemini reviews together with the revised Results/Discussion/Appendix sections, since that’s where the late-cycle fixes and the likely regression risk sit.
exec
/bin/zsh -lc "nl -ba paper/codex_review_gpt54_v3_5.md | sed -n '1,220p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	# Fifth-Round Review of Paper A v3.5
     2	
     3	Audit basis: commit `12f716d`. Line numbers below refer to the current v3.5 markdown and script files.
     4	
     5	## 1. Overall Verdict
     6	
     7	**Minor Revision**
     8	
     9	v3.5 clears the two issues that kept v3.4 in major-revision territory. The classifier definition in Section III-L is now arithmetically aligned with the `dHash_indep` implementation used by the supporting scripts and downstream tables, and Table XI's `z/p` columns now reproduce from the displayed `k/n` counts under the exact two-proportion formula in Script 24. I do not see a core scientific regression in the B1/B2/B3 logic. I would still not submit v3.5 as-is, however, because a short v3.6 cleanup is still warranted: Table IX is not fully synchronized to the current script outputs, "breaks circularity" overclaim language survives in Methods/Results, the export path still hardcodes a double-blind placeholder even though IEEE Access is single-anonymized, and the manuscript still underdocuments BD/McCrary, visual inspection, and several key reproducibility details. This is now a close paper, but not yet the cleanest version to send.
    10	
    11	## 2. v3.4 Round-4 Follow-Up Audit
    12	
    13	### 2.1 Round-4 Blockers
    14	
    15	| Round-4 item | Round-4 status | v3.5 audit | Evidence |
    16	|---|---|---|---|
    17	| B1. Classifier vs three-method convergence misalignment | `PARTIALLY-RESOLVED` | `RESOLVED` | Section III-L now defines the operational classifier entirely in `dHash_indep` terms at Methodology L252-L277. The matching downstream tables also use `dHash_indep`: Results L165-L168, L221-L225, L246-L254, and L350-L361. Script 24 likewise loads `min_dhash_independent` and applies it in the Section III-L classifier at Script 24 L86-L99, L157-L168, and L215-L241. |
    18	| B2. Held-out validation false within-Wilson-CI claim | `PARTIALLY-RESOLVED` | `RESOLVED` | Results L230-L237 now correctly interpret the fold comparison, and the Table XI `z/p` entries at Results L217-L225 reproduce from Script 24's `two_prop_z` formula at Script 24 L69-L83 and L186-L205. |
    19	| B3. Interview evidence lacks ethics statement | `RESOLVED` | `RESOLVED` | The manuscript still treats practitioner knowledge as background context only and locates evidentiary weight in paper-internal analyses: Introduction L51-L55; Methodology L140-L156 and L282-L291. I found no regression to interview/IRB-style evidentiary claims. |
    20	
    21	### 2.2 Round-4 Major and Minor Follow-Up Items
    22	
    23	| Round-4 item | Round-4 status | v3.5 audit | Evidence |
    24	|---|---|---|---|
    25	| dHash classifier ambiguity | `UNFIXED` | `RESOLVED` | The classifier is now explicitly `dHash_indep`-based throughout III-L, not cosine-conditional: Methodology L254-L277. Results Tables IX, XI, XII, and XVI are written in the same statistic: Results L165-L168, L221-L225, L246-L254, L350-L361. |
    26	| 70/30 split overstatement | `PARTIALLY-FIXED` | `PARTIALLY-RESOLVED` | The Abstract and Conclusion are repaired: Abstract L5 and Conclusion L19-L21 now use fold-variance language. But the overclaim survives at Methodology L238, Results L171, and the subsection title at Results L207. |
    27	| Validation-metric story | `PARTIALLY-FIXED` | `RESOLVED` | The Introduction now promises anchor-based capture/FAR reporting rather than precision/F1/EER: Introduction L29-L30. Methods/Results remain aligned on why precision/F1 are not meaningful here: Methodology L245-L246; Results L186-L188. The archived Impact Statement is explicitly excluded from submission and self-warns against overclaim: Impact Statement L1-L12; `export_v3.py` L15-L25. |
    28	| Within-auditor-year empirical-check confusion | `UNFIXED` | `RESOLVED` | Methodology L123-L128 now explicitly says IV-H.3 is a related but distinct cross-partner same-report homogeneity test, not a same-CPA within-year mixing test. Results L343-L367 matches that framing exactly. |
    29	| BD/McCrary rigor | `UNFIXED` | `UNRESOLVED` | The paper still gives only narrative BD/McCrary outcomes without a table of `Z` statistics, `p` values, or bin-width robustness: Results L80-L83 and L126-L149. |
    30	| Reproducibility gaps | `PARTIALLY-FIXED` | `PARTIALLY-RESOLVED` | The scripts expose some seeds and formulas, but the manuscript still omits the exact VLM prompt/parse rule, HSV thresholds, visual-inspection protocol, and EM initialization/stopping details: Methodology L44-L49, L74-L75, L145-L146, L188-L196, L222-L223, L248. |
    31	| Section III-H / IV-F reconciliation | `FIXED` | `RESOLVED` | The 92.5% Firm A figure is still consistently framed as a within-sample consistency check, not an external validation pillar: Methodology L156-L160; Results L174-L176. |
    32	| "`0.95` not calibrated to Firm A" inconsistency | `UNFIXED` | `RESOLVED` | III-H now says the `0.95` cutoff is the whole-sample Firm A P95: Methodology L151-L154. III-L repeats that at Methodology L273-L277, and Results uses the same interpretation at L174-L176 and L241-L244. |
    33	| Table XII numbering | `FIXED` | `RESOLVED` | Numbering remains coherent through XI-XVIII, with Table XII present at Results L246-L254. |
    34	| `dHash_indep <= 5 (calib-fold median-adjacent)` label | `UNFIXED` | `UNRESOLVED` | The label still appears in Table IX at Results L165. III-L explains the rationale better at Methodology L275, but the table label itself remains opaque. |
    35	| References `[27]`, `[31]-[36]` cleanup | `UNFIXED` | `RESOLVED` | All seven are now cited in text: `[27]` at Methodology L100; `[31]-[33]` at Introduction L15; `[34]-[35]` at Methodology L44 and L58; `[36]` at Results L50. |
    36	
    37	### 2.3 Round-4 New-Issue Audit
    38	
    39	| Round-4 new issue | v3.5 audit | Evidence |
    40	|---|---|---|
    41	| IV-G.3 sensitivity evidence did not evaluate the stated classifier | `RESOLVED` | III-L now defines the same `dHash_indep` classifier that Script 24 evaluates: Methodology L252-L277; Script 24 L215-L241; Results L239-L262. |
    42	| Table XI `z/p` columns did not match displayed counts | `RESOLVED` | Results L217-L225 now matches recomputation from Script 24 L69-L83 exactly up to rounding; details in Section 3 below. |
    43	| Table XVI was affected by the same classifier-definition problem | `RESOLVED` | Table XVI is now aligned because III-L itself is `dHash_indep`-based. Script 23 also uses `min_dhash_independent`: Script 23 L37-L53 and L90-L92. |
    44	| Visual-inspection pillar still lacked protocol details | `UNRESOLVED` | The claim remains at Methodology L145-L149, but sample size, rater count, and adjudication rule are still absent from the manuscript. |
    45	| Threshold-free wording in III-H was inaccurate | `RESOLVED` | III-H now correctly says only partner-ranking is fully threshold-free: Methodology L151-L154. Results L270-L274 matches this. |
    46	| Introduction metric promise / Impact Statement wording still overstated | `RESOLVED` | The Introduction is repaired at L29-L30, and the Impact Statement is archived and excluded from export: Impact Statement L1-L12; `export_v3.py` L15-L25. |
    47	
    48	## 3. Verification of the v3.5 Critical Fixes
    49	
    50	### 3.1 Table XI Recalculation
    51	
    52	I recomputed every Table XI `z/p` pair from the displayed `k/n` counts using the exact two-proportion formula in Script 24 L69-L83. All nine rows now match the manuscript rounding at Results L217-L225.
    53	
    54	| Rule | Exact recomputation from displayed `k/n` | Paper value | Audit |
    55	|---|---|---|---|
    56	| `cosine > 0.837` | `z = +0.310601`, `p = 0.756104` | `+0.31`, `0.756` | Match |
    57	| `cosine > 0.9407` | `z = -3.184698`, `p = 0.001449` | `-3.19`, `0.001` | Match |
    58	| `cosine > 0.945` | `z = -4.541202`, `p = 0.00000559` | `-4.54`, `<0.001` | Match |
    59	| `cosine > 0.950` | `z = -5.966194`, `p = 0.0000000024` | `-5.97`, `<0.001` | Match |
    60	| `dHash_indep <= 5` | `z = -14.288642`, `p < 1e-40` | `-14.29`, `<0.001` | Match |
    61	| `dHash_indep <= 8` | `z = -6.446423`, `p = 1.15e-10` | `-6.45`, `<0.001` | Match |
    62	| `dHash_indep <= 9` | `z = -5.072930`, `p = 3.92e-07` | `-5.07`, `<0.001` | Match |
    63	| `dHash_indep <= 15` | `z = -0.313744`, `p = 0.753716` | `-0.31`, `0.754` | Match |
    64	| `cosine > 0.95 AND dHash_indep <= 8` | `z = -7.603992`, `p = 2.86e-14` | `-7.60`, `<0.001` | Match |
    65	
    66	This directly resolves the main round-4 numerical blocker.
    67	
    68	### 3.2 Section III-L Uses `dh_indep` Throughout
    69	
    70	This fix is real. Section III-L now states at Methodology L254-L255 that all dHash references in the operational classifier are the independent-minimum statistic, and the five categories at L257-L277 are all written with `dHash_indep`. The downstream result tables are consistent with that same statistic:
    71	
    72	- Table IX: Results L165-L168.
    73	- Table XI: Results L221-L225.
    74	- Table XII: Results L246-L258.
    75	- Table XVI: Results L347-L367.
    76	
    77	Script 24 is now consistent with that choice as well: it loads `min_dhash_independent` at L86-L99 and classifies with it at L215-L241.
    78	
    79	### 3.3 "`0.95` is Firm A P95" Is Now Consistent
    80	
    81	This inconsistency is fixed across the relevant sections:
    82	
    83	- III-H: Methodology L151-L154 states that the `0.95` cutoff is the whole-sample Firm A P95 and that the longitudinal analysis is about stability, not absolute-rate calibration.
    84	- III-L: Methodology L273-L277 repeats that `0.95` is the whole-sample Firm A P95 heuristic.
    85	- IV-F / IV-G.3: Results L174-L176 and L241-L244 use the same framing.
    86	
    87	I do not see a surviving contradiction of the old "not calibrated to Firm A" type.
    88	
    89	## 4. Verification of the v3.5 Major Fixes
    90	
    91	- **Abstract length:** The abstract is now one paragraph. A rendered whitespace count after stripping the header/comment gives 247 words, which is nominally under the IEEE 250-word cap. If one counts inline math markers as separate tokens, the count rises above 250, so the abstract is compliant in ordinary rendered form but still too close to the limit for comfort.
    92	- **"We break the circularity" overclaim:** Removed from the Abstract and Conclusion. The current Abstract L5 and Conclusion L19-L21 use fold-level variance / heterogeneity language instead. However, the same overclaim still survives elsewhere in the paper at Methodology L238 and Results L171 and L207.
    93	- **Introduction metric language:** Fixed. Introduction L29-L30 now promises per-rule capture/FAR with Wilson intervals and explicitly states why precision/F1 are not meaningful here. The obsolete introduction promise of precision/F1/EER is gone.
    94	- **III-G / IV-H.3 wording alignment:** Fixed. Methodology L123-L128 and Results L343-L367 now describe the same cross-partner same-report homogeneity test.
    95	- **III-H threshold-free wording:** Fixed. Methodology L151-L154 and Results L270-L274 now correctly say that only partner-ranking is threshold-free.
    96	
    97	## 5. Verification of the v3.5 Minor Fixes
    98	
    99	- **Impact Statement exclusion:** Fixed. `export_v3.py` excludes `paper_a_impact_statement_v3.md` from `SECTIONS` at L15-L25, and the archived file itself says it is not part of the IEEE Access submission at Impact Statement L1-L12.
   100	- **Previously unused references:** Fixed. `[27]`, `[31]`, `[32]`, `[33]`, `[34]`, `[35]`, and `[36]` all now have in-text citations; see the evidence in Section 2.2 above.
   101	
   102	## 6. New Findings in v3.5
   103	
   104	No core scientific regression is visible in the B1/B2/B3 logic. The remaining new findings are cleanup-level but real:
   105	
   106	1. **Table IX is still not fully synchronized to the current script outputs.** Using the displayed counts at Results L160-L168, three percentages are off by `0.01` under standard rounding: `57,131 / 60,448 = 94.51%`, not `94.52%`; `55,916 / 60,448 = 92.50%`, not `92.51%`; and `57,521 / 60,448 = 95.16%`, not `95.17%`. More importantly, Script 24 computes the whole-sample dual rule as `54,370 / 60,448`, not `54,373 / 60,448` (Script 24 L276-L316; generated recalibration report section 3 lines 48-52). This is small, but v3.5 explicitly positions itself as having cleaned exact table arithmetic, so it should be corrected.
   107	2. **The circularity overclaim is not fully removed paper-wide.** Methodology L238 still says the 70/30 split "break[s] the resulting circularity," Results L171 says the held-out analysis "addresses the circularity," and the IV-G.2 subsection title at Results L207 still says "(breaks calibration-validation circularity)." Those are stronger than the better, narrower interpretation at Results L233-L237, Discussion L44-L45, and Conclusion L20-L21.
   108	3. **The export path is not submission-ready for IEEE Access single-anonymized review.** `export_v3.py` correctly excludes the Impact Statement, but it still inserts `[Authors removed for double-blind review]` on the title page at L208-L218. If the manuscript were submitted literally from this export path, that would be a packaging error.
   109	4. **Methodology III-G retains one stale reference to cosine-conditional dHash.** Methodology L131-L132 says cosine-conditional dHash is used "as a diagnostic elsewhere," but no remaining main-text result appears to use it. After the III-L rewrite, this reads as leftover phrasing and should be either deleted or pointed to a real appendix/supplement.
   110	
   111	## 7. IEEE Access Submission Readiness Check
   112	
   113	- **Scope:** Yes. The topic remains a plausible IEEE Access Regular Paper fit spanning document forensics, computer vision, and audit/regulatory analytics.
   114	- **Abstract length:** Nominally compliant in rendered form at 247 words, but close enough to the cap that another 5-10 words of trimming would be safer.
   115	- **Formatting / template:** Not verifiable from the markdown section files alone. The paper is maintained as markdown fragments plus a custom `python-docx` exporter; I did not audit a final IEEE Access template-conformant DOCX/PDF package here.
   116	- **Review model:** IEEE Access is single-anonymized. The current export path still uses a double-blind placeholder on the title page (`export_v3.py` L208-L218). That must be fixed before submission.
   117	- **Anonymization:** The manuscript body still consistently uses `Firm A/B/C/D` and does not expose explicit real firm names or author metadata in the reviewed markdown sections. As before, that is a confidentiality choice rather than a review-model requirement.
   118	- **Ethics / data-source disclosure:** Adequate for this paper's current evidentiary framing. Methodology L282-L291 clearly states the corpus is public MOPS data and that no non-public records or human-subject evidence are used.
   119	- **Items that could trigger desk return if submitted literally now:** the missing author/affiliation metadata from the current export path, and any unverified IEEE template / metadata nonconformance in the final DOCX/PDF. The remaining scientific issues are reviewer-risk issues rather than obvious desk-return items.
   120	
   121	Bottom line on readiness: **not as-is**. The science is close; the packaging and last-round reporting cleanup are not finished.
   122	
   123	## 8. Statistical Rigor, Numerical Consistency, and Reproducibility
   124	
   125	### Statistical Rigor
   126	
   127	- The core statistical story is now coherent. The paper cleanly separates the operational signature-level classifier from the accountant-level convergence band and treats the held-out Firm A split as heterogeneity disclosure rather than a false Wilson-CI "generalization pass": Methodology L252-L277; Results L230-L237; Discussion L44-L45.
   128	- The anchor-based validation is better framed than in earlier rounds. The byte-identical positives are clearly treated as a conservative subset, and precision/F1 are no longer misused: Methodology L227-L248; Results L184-L205.
   129	- The main remaining rigor weakness is still BD/McCrary. Because the paper keeps advertising a three-method convergent threshold strategy in the title/abstract/introduction, the absence of explicit BD/McCrary `Z`/`p` reporting and bin-width sensitivity still leaves one of the three methods under-reported.
   130	- The visual-inspection pillar is still too thinly documented for the rhetorical weight it carries in III-H and the Conclusion.
   131	
   132	### Numerical Consistency
   133	
   134	- Table XI is now repaired and reproducible from its displayed counts.
   135	- Table XII, Table XVI, and Table XVII remain arithmetically consistent.
   136	- Table IX still has the residual percentage/count mismatches noted in Section 6.
   137	- The biggest numerical issue left is therefore no longer inferential-table arithmetic; it is the smaller but still avoidable transcription drift in Table IX.
   138	
   139	### Reproducibility
   140	
   141	The paper is still **not reproducible from the manuscript alone**.
   142	
   143	The most important under-specified items remain:
   144	
   145	- Exact VLM prompt, parse rule, and page-selection failure handling: Methodology L44-L49.
   146	- HSV thresholds for red-stamp removal: Methodology L74-L75.
   147	- Randomization / seed rules for the 500-page annotation set, the inter-CPA negative sample, the 30-signature sanity sample, and the Firm A split: Methodology L59-L62 and L232-L248.
   148	- Visual-inspection protocol details: sample size, rater count, and decision rule are still absent around Methodology L145-L146.
   149	- EM / mixture initialization count, stopping criteria, logit-boundary clipping, and software versions: Methodology L188-L196 and L222-L223.
   150	
   151	The scripts help auditability, but the manuscript still needs a short reproducibility appendix or supplement if the authors want the paper to look fully defensible on first submission.
   152	
   153	## 9. What v3.6 Must Change to Clear Review
   154	
   155	If the authors want the paper to clear this review and be genuinely submission-ready, v3.6 should do the following:
   156	
   157	1. **Re-sync Table IX and mirrored prose to the authoritative script outputs.** Correct the three `0.01` percentage mismatches and the whole-sample dual-rule count (`54,370 / 60,448` if Script 24 is authoritative).
   158	2. **Remove the surviving circularity overclaim from Methods/Results.** Replace Methodology L238, Results L171, and the IV-G.2 heading at L207 with the softer fold-variance / within-Firm-A heterogeneity framing already used elsewhere.
   159	3. **Fix the export path for IEEE Access single-anonymized review.** Restore author/affiliation/corresponding-author metadata and audit the real final DOCX/PDF against the IEEE Access template rather than relying on the current double-blind placeholder export.
   160	4. **Document the visual-inspection protocol.** At minimum: sample size, sampling rule, number of raters, whether review was blinded, and how disagreements were adjudicated.
   161	5. **Either substantiate BD/McCrary or demote it.** If it stays as one of the three headline methods, add a compact table of `Z` statistics, `p` values, and bin-width robustness. If not, explicitly recast it as a supplementary diagnostic rather than a co-equal threshold estimator.
   162	6. **Add a short reproducibility appendix or supplement.** Include the VLM prompt/parse rule, HSV thresholds, key seeds/sampling rules, and mixture-model implementation details.
   163	7. **Clean the stale cosine-conditional dHash sentence at Methodology L131-L132.** After the III-L rewrite, that sentence now looks like leftover terminology.
   164	
   165	If those items are addressed cleanly, I would treat the manuscript as submission-ready for IEEE Access.

exec
/bin/zsh -lc "nl -ba paper/gemini_review_v3_7.md | sed -n '1,220p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba paper/gemini_review_v3_8.md | sed -n '1,180p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba paper/paper_a_results_v3.md | sed -n '1,260p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	# Independent Peer Review: Paper A (v3.8)
     2	
     3	**Target Venue:** IEEE Access (Regular Paper)  
     4	**Date:** April 21, 2026  
     5	**Reviewer:** Gemini CLI (7th Round Independent Review)
     6	
     7	---
     8	
     9	## 1. Overall Verdict
    10	
    11	**Verdict: Accept**
    12	
    13	**Rationale:**  
    14	The authors have systematically and thoroughly addressed the three critical methodological and narrative blind spots identified in the Round-6 review. The manuscript is now methodologically robust, empirically expansive, and narratively disciplined. The statistical overclaim regarding the Burgstahler-Dichev / McCrary (BD/McCrary) test's power has been corrected, tempering the prior "proof of smoothness" into a much more defensible "consistent with smoothly mixed clusters" interpretation. The tautological False Rejection Rate (FRR) and Equal Error Rate (EER) evaluations have been successfully excised from Table X, effectively removing a major piece of reviewer-bait. Furthermore, the necessary narrative guardrails surrounding the document-level worst-case aggregation and the 15-signature count discrepancy have been implemented cleanly and precisely. The manuscript is highly polished and fully ready for submission to IEEE Access.
    15	
    16	---
    17	
    18	## 2. Round-6 Follow-Up Audit
    19	
    20	In Round 6, three specific issues were flagged for revision. Below is the audit of their resolution in v3.8.
    21	
    22	### A. BD/McCrary Power-Artifact Reframe
    23	**Status: RESOLVED**
    24	
    25	The authors have successfully purged the "null proves smoothness" language and accurately reframed the accountant-level BD/McCrary null finding around its limited statistical power.
    26	*   **Results IV-D.1:** The text now explicitly states that "at $N = 686$ accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness."
    27	*   **Results IV-E:** The analysis correctly notes that the lack of a transition is "consistent with---though, at $N = 686$, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates."
    28	*   **Discussion V-B:** The framing is excellent: "the BD null alone cannot affirmatively establish smoothness---only fail to falsify it---and our substantive claim of smoothly-mixed clustering rests on the joint weight of the GMM fit, the dip test, and the BD null rather than on the BD null alone."
    29	*   **Discussion V-G (Limitations):** A new, dedicated limitation explicitly highlights that the test "cannot reliably detect anything less than a sharp cliff-type density discontinuity" at this sample size.
    30	*   **Conclusion:** Symmetrically updated to note that the test "cannot affirmatively establish smoothness, but its non-transition is consistent with the smoothly-mixed cluster boundaries."
    31	*   **Appendix A:** Concludes perfectly that failure to reject the null "constrains the data only to distributions whose between-cluster transitions are gradual *enough* to escape the test's sensitivity at that sample size."
    32	
    33	The rewrite is exceptionally clean. It does not feel awkward or bolted-on. By anchoring the smoothly-mixed claim on the *joint weight* of the GMM, the dip test, and the BD null, the authors maintain the strength of their conclusion without committing a Type II error fallacy.
    34	
    35	### B. Table X EER/FRR Removal
    36	**Status: RESOLVED**
    37	
    38	The tautological presentation of FRR against the byte-identical positive anchor has been entirely resolved.
    39	*   **Table X:** The EER row and FRR column have been deleted. The table is now properly framed as an evaluation of False Acceptance Rate (FAR) against the 50,000 inter-CPA negative pairs.
    40	*   **Table Note:** A clear, unambiguous table note has been added explaining *why* FRR is omitted ("the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$").
    41	*   **Methodology III-K & Results IV-G.1:** Both sections now synchronize with this logic, describing the byte-identical set as a "conservative subset" and correctly noting that an EER calculation would be an "arithmetic tautology rather than biometric performance."
    42	
    43	This change significantly hardens the paper. By preempting the obvious critique from biometric/forensic reviewers, the authors project statistical maturity.
    44	
    45	### C. Section IV-I Narrative Safeguard & 15-Signature Footnote
    46	**Status: RESOLVED**
    47	
    48	Both minor narrative omissions have been addressed exactly as requested.
    49	*   **Section IV-I Narrative Safeguard:** Right before Table XVII, the authors added a robust clarifying paragraph: "We emphasize that the document-level proportions below reflect the *worst-case aggregation rule*... Document-level rates therefore bound the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are." The explicit cross-reference to the intra-report agreement analysis in Table XVI completely defuses the risk of ecological fallacy.
    50	*   **15-Signature Footnote:** In Section IV-D, the text now clearly accounts for the discrepancy: "The $N = 168{,}740$ count used in Table V... is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed..." This effectively closes the arithmetic loop.
    51	
    52	---
    53	
    54	## 3. New Findings in v3.8
    55	
    56	The rewrites in v3.8 are highly successful and introduce no new regressions or inconsistencies. 
    57	
    58	The primary concern when hedging a statistical claim is that the resulting language will create tension with other sections of the paper that still rely on the original, stronger claim. The authors avoided this trap brilliantly. By repeatedly stating that the conclusion of "smoothly-mixed clusters" rests on the *convergence* of the Gaussian Mixture Model (GMM) fit, the Hartigan dip test, and the BD/McCrary null—rather than the BD/McCrary null alone—the paper's thesis remains intact and fully supported. 
    59	
    60	The only minor artifact of the rewrite is a slight repetitiveness regarding the "$N=686$ limited power" caveat, which appears in IV-D.1, IV-E, V-B, V-G, the Conclusion, and Appendix A. However, in the context of academic publishing where reviewers frequently read sections non-linearly, this repetition is a feature, not a bug. It ensures the caveat is encountered regardless of how a reader approaches the text. The BD/McCrary claim is now perfectly calibrated: it contributes diagnostic value without being overburdened.
    61	
    62	---
    63	
    64	## 4. Final Submission Readiness
    65	
    66	**v3.8 is fully submission-ready.** 
    67	
    68	The manuscript requires no further revisions (a v3.9 is not warranted). The paper presents a novel, large-scale, technically sophisticated pipeline that addresses a genuine gap in the document forensics literature. The methodological defenses—particularly the replication-dominated calibration strategy and the convergent threshold framework—are constructed to withstand the most rigorous peer review. The authors should proceed to submit to IEEE Access immediately.
 succeeded in 0ms:
     1	# Independent Peer Review: Paper A (v3.7)
     2	
     3	**Target Venue:** IEEE Access (Regular Paper)  
     4	**Date:** April 21, 2026  
     5	**Reviewer:** Gemini CLI (6th Round Independent Review)
     6	
     7	---
     8	
     9	## 1. Overall Verdict
    10	
    11	**Verdict: Minor Revision**
    12	
    13	**Rationale:**  
    14	The manuscript presents a methodologically rigorous, highly sophisticated, and large-scale empirical analysis of non-hand-signed auditor signatures. Analyzing over 180,000 signatures from 90,282 audit reports is an impressive feat, and the pipeline architecture combining VLM prescreening, YOLO detection, and ResNet-50 feature extraction is fundamentally sound. The utilization of a "replication-dominated" calibration strategy—validated across both intra-firm consistency metrics and held-out cross-validation folds—represents a significant contribution to document forensics where ground-truth labeling is scarce and expensive. Furthermore, the dual-descriptor approach (using cosine similarity for semantic features and dHash for structural features) effectively resolves the ambiguity between stylistic consistency and mechanical image reproduction. The demotion of the Burgstahler-Dichev / McCrary (BD/McCrary) test to a density-smoothness diagnostic, supported by the new Appendix A, is analytically correct. 
    15	
    16	However, approaching this manuscript with a fresh perspective reveals three distinct methodological blind spots that previous review rounds missed. Specifically, the manuscript commits a statistical overclaim regarding the statistical power of the BD/McCrary test at the accountant level, it presents a mathematically tautological False Rejection Rate (FRR) evaluation that borders on reviewer-bait, and it lacks narrative guardrails around its document-level aggregation metrics. Resolving these localized issues will not alter the paper's conclusions but will significantly harden the manuscript against aggressive peer review, making it fully submission-ready for IEEE Access.
    17	
    18	---
    19	
    20	## 2. Scientific Soundness Audit
    21	
    22	### Three-Level Framework Coherence
    23	The separation of the analysis into signature-level, accountant-level, and auditor-year units is intellectually rigorous and highly defensible. By strictly separating the *pixel-level output quality* (signature level) from the *aggregate behavioral regime* (accountant level), the authors successfully avoid the ecological fallacy of assuming that because an individual practitioner acts in a binary fashion (hand-signing vs. stamping), the aggregate distribution of signature pixels must be neatly bimodal. The evidence compellingly demonstrates that the data forms a continuous quality degradation spectrum at the pixel level.
    24	
    25	### Firm A 'Replication-Dominated' Framing
    26	This is perhaps the strongest conceptual pillar of the paper. Assuming that Firm A acts as a "pure" positive class would inevitably force the thresholding model to interpret the long left tail of the cosine distribution as algorithmic noise or pipeline error. The explicit validation of Firm A as "replication-dominated but not pure"—quantified elegantly by the 139/32 split between high-replication and middle-band clusters in the accountant-level Gaussian Mixture Model (Section IV-E)—logically resolves the 92.5% capture rate without overclaiming. It is a highly defensible stance.
    27	
    28	### BD/McCrary Demotion
    29	Moving the BD/McCrary test from a co-equal threshold estimator to a "density-smoothness diagnostic" is the correct scientific decision. Appendix A empirically demonstrates that the test behaves exactly as one would expect when applied to a large ($N > 60,000$), smooth, heavy-tailed distribution: it detects localized non-linearities caused by histogram binning resolution rather than true mechanistic discontinuities. The theoretical tension is resolved by this demotion.
    30	
    31	### Statistical Choices
    32	The statistical foundations of the paper are appropriate and well-applied:
    33	*   **Beta/Logit-Gaussian Mixtures:** Fitting Beta mixtures via the EM algorithm is perfectly suited for bounded cosine similarity data $[0,1]$, and the logit-Gaussian cross-check serves as an excellent robustness measure against parametric misspecification.
    34	*   **Hartigan Dip Test:** The use of the dip test provides a rigorous, non-parametric verification of unimodality/multimodality.
    35	*   **Wilson Confidence Intervals:** Utilizing Wilson score intervals for the held-out validation metrics (Table XI) correctly models binomial variance, preventing zero-bound confidence interval collapse.
    36	
    37	---
    38	
    39	## 3. Numerical Consistency Cross-Check
    40	
    41	An exhaustive spot-check of the manuscript’s arithmetic, table values, and cited numbers reveals a practically flawless internal consistency. The scripts supporting the pipeline operate exactly as claimed. 
    42	
    43	*   **Table VIII:** The reported accountant-level threshold band (KDE antimode: 0.973, Beta-2: 0.979, logit-GMM-2: 0.976) matches the narrative text precisely.
    44	*   **Table IX:** The proportion of Firm A captures under the dual rule ($54,370 / 60,448 = 89.945\%$) correctly rounds to the reported $89.95\%$.
    45	*   **Table XI:** The calibration fold's operational dual rule yields $40,335 / 45,116 = 89.402\%$ (reported $89.40\%$), and the held-out fold yields $14,035 / 15,332 = 91.540\%$ (reported $91.54\%$).
    46	*   **Table XII:** The column sums for $N = 168,740$ match perfectly. Furthermore, the delta column balances precisely to zero ($+2,294 + 6,095 + 119 - 8,508 + 0 = 0$).
    47	*   **Table XIV:** Top 10% Firm A occupancy is $443 / 462 = 95.88\%$ (reported $95.9\%$), against a baseline of $1,287 / 4,629 = 27.80\%$ (reported $27.8\%$).
    48	*   **Table XVI:** Firm A's intra-report agreement is correctly calculated as $(26,435 + 734 + 4) / 30,222 = 89.91\%$.
    49	
    50	**Minor Narrative Clarification Required:**
    51	In Table III, total extracted signatures are reported as $182,328$, with $168,755$ successfully matched to CPAs. However, Table V and Table XII utilize $N = 168,740$ signatures for the all-pairs best-match analysis. This delta of $15$ signatures is mathematically implied by CPAs who possess exactly *one* signature in the entire database, rendering a "same-CPA pairwise comparison" impossible. While logically sound to anyone analyzing the pipeline closely, this microscopic $15$-signature discrepancy is exactly the kind of arithmetic artifact that distracts meticulous reviewers. 
    52	*Recommendation:* Add a one-sentence footnote or parenthetical to Section IV-D explicitly stating this $15$-signature delta is due to single-signature CPAs lacking a pairwise match.
    53	
    54	---
    55	
    56	## 4. Appendix A Validity
    57	
    58	The addition of Appendix A successfully and empirically justifies the main-text demotion of the BD/McCrary test. 
    59	
    60	**Strengths:**
    61	The argument demonstrating that the BD/McCrary transitions drift monotonically with bin width (e.g., Firm A cosine drifting across 0.987 $\rightarrow$ 0.985 $\rightarrow$ 0.980 $\rightarrow$ 0.975) is brilliant. Coupled with the observation that the Z-statistics inflate superlinearly with bin width (from $|Z| \sim 9$ at bin 0.003 to $|Z| \sim 106$ at bin 0.015), the appendix irrefutably proves that the test is interacting with the local curvature of a heavily-populated continuous distribution rather than identifying a discrete, mechanistic boundary. Table A.I is arithmetically consistent with the script's logic.
    62	
    63	**Weaknesses:**
    64	The interpretation paragraph overstates the implications of the accountant-level null finding. It claims that the lack of a transition at the accountant level ($N=686$) is a "robust finding that survives the bin-width sweep." As detailed in Section 6 below, a non-finding surviving a bin-width sweep in a small sample is largely a function of low statistical power, not definitive proof of a smoothly-mixed boundary.
    65	
    66	---
    67	
    68	## 5. IEEE Access Submission Readiness
    69	
    70	The manuscript is in excellent shape for submission to IEEE Access.
    71	*   **Scope Fit:** High. The paper sits perfectly at the intersection of applied AI, document forensics, and interdisciplinary data science, which is a core demographic for IEEE Access.
    72	*   **Abstract Length:** The abstract is approximately 234 words, comfortably satisfying the stringent $\leq 250$ word limit requirement.
    73	*   **Formatting & Structure:** The document adheres to standard IEEE double-column formatting conventions (Roman numeral sections, appropriate table/figure references).
    74	*   **Anonymization:** Properly handled. Author placeholders, affiliation blocks, and correspondence emails are appropriately bracketed for single-anonymized peer review.
    75	*   **Desk-Return Risks:** Very low. The inclusion of the ablation study (Table XVIII) and explicit baseline comparisons ensures the paper meets the journal's expectations for methodological validation.
    76	
    77	---
    78	
    79	## 6. Novel Issues and Methodological Blind Spots
    80	
    81	While the previous review rounds improved the manuscript significantly, habituation has allowed three specific narrative and statistical blind spots to persist. These are prime targets for reviewer pushback.
    82	
    83	### Issue 1: The Accountant-Level BD/McCrary Null is a Power Artifact, not Proof of Smoothness
    84	In Section V-B and Appendix A, the authors claim that because the BD/McCrary test yields no significant transition at the accountant level, this "pattern is consistent with a clustered but smoothly mixed accountant-level distribution." Furthermore, Section V-B states that this non-transition is "itself diagnostic of smoothness rather than a failure of the method."
    85	
    86	**The Critique:** The McCrary (2008) test relies on local linear regression smoothing. The variance of the estimator scales inversely with $N \cdot h$ (where $h$ is the bin width). With a sample size of only $N=686$ accountants, the test is severely underpowered and lacks the statistical capacity to reject the null of smoothness unless the discontinuity is an absolute, sheer cliff. Asserting that a failure to reject the null affirmatively *proves* the null is true (smoothness) is a fundamental statistical fallacy (Type II error risk). 
    87	*Impact:* Statistically literate reviewers will immediately flag this as an overclaim. The demotion of the test to a diagnostic is correct, but interpreting the null at $N=686$ as definitive proof of smoothness is flawed.
    88	
    89	### Issue 2: Tautological Presentation of FRR and EER (Table X)
    90	Table X presents a False Rejection Rate (FRR) computed against a "byte-identical" positive anchor. It reports an FRR of $0.000$ for thresholds like 0.95 and 0.973, and subsequently reports an Equal Error Rate (EER) of $\approx 0$ at cosine = 0.990.
    91	
    92	**The Critique:** By definition, byte-identical signatures have a cosine similarity asymptotically approaching 1.0 (modulo minor float/cropping artifacts). Evaluating a similarity threshold of 0.95 against inputs that are mathematically defined to score near 1.0 yields a 0% FRR trivially. It is a tautology. While the text in Section V-F attempts to caveat this ("perfect recall against this subset therefore does not generalize"), presenting it as a formal column in Table X with an EER calculation treats it as a standard biometric evaluation. There are no crossing error distributions here to warrant an EER. 
    93	*Impact:* This is reviewer-bait. Reviewers from the biometric or forensics domains will argue that an EER of 0 is artificially constructed. The true scientific value of Table X is purely the empirical False Acceptance Rate (FAR) derived from the 50,000 inter-CPA negatives. 
    94	
    95	### Issue 3: Document-Level Worst-Case Aggregation Narrative
    96	Section IV-I reports that 35.0% of documents are classified as "High-confidence non-hand-signed" and 43.8% as "Moderate-confidence." This relies on the worst-case rule defined in Section III-L (if one signature on a dual-signed report is stamped, the whole document inherits that label).
    97	
    98	**The Critique:** While this "worst-case" aggregation is highly practical for building an operational regulatory auditing tool (flagging the report for review), the narrative in IV-I presents these percentages without reminding the reader that a document might contain a mix of genuine and stamped signatures. Without immediate context, stating that nearly 80% of the market's reports are non-hand-signed invites the ecological fallacy that *both* partners are stamping. 
    99	*Impact:* A brief narrative safeguard is missing. Section IV-I must briefly cross-reference the intra-report agreement findings (Table XVI) to remind the reader of the composition of these documents, mitigating the risk that the reader misinterprets the document-level severity.
   100	
   101	---
   102	
   103	## 7. Final Recommendation and v3.8 Action Items
   104	
   105	The manuscript is exceptionally strong but requires a few surgical narrative adjustments to remove reviewer-bait and statistical overclaims. I recommend a **Minor Revision** encompassing the following ranked action items.
   106	
   107	### BLOCKER (Must Fix for Submission)
   108	1. **Revise the interpretation of the accountant-level BD/McCrary null.** 
   109	   *   *Action:* In Section V-B, Section VI (Conclusion), and Appendix A, remove any explicit claims that the null affirmatively proves "smoothly mixed" boundaries. 
   110	   *   *Replacement Phrasing:* Reframe this finding to acknowledge statistical power. For example: *"We fail to find evidence of a discontinuity at the accountant level. While this is consistent with smoothly mixed clusters, it also reflects the limited statistical power of the BD/McCrary test at smaller sample sizes ($N=686$), reinforcing its role as a diagnostic rather than a definitive estimator."*
   111	
   112	### MAJOR (Highly Recommended to Prevent Desk-Reject/Major Revision)
   113	2. **Reframe Table X to eliminate the tautological FRR/EER presentation.**
   114	   *   *Action:* Remove the Equal Error Rate (EER) calculation entirely. Add an explicit, prominent table note to Table X stating that FRR is computed against a definitionally extreme subset (byte-identical signatures), making the $0.000$ values an expected mathematical boundary check rather than an empirical discovery of real-world recall. Emphasize that the primary contribution of Table X is the FAR evaluation against the large inter-CPA negative anchor.
   115	
   116	### MINOR (Quick Wins for Readability and Precision)
   117	3. **Contextualize the Document-Level Aggregation (Section IV-I).**
   118	   *   *Action:* When presenting the 35.0% / 43.8% document-level figures in Section IV-I, explicitly remind the reader of the worst-case aggregation rule. Add a single sentence cross-referencing Table XVI's mixed-report rates to ensure the reader understands the internal composition of these flagged documents.
   119	4. **Clarify the 15-Signature Delta (Section IV-D / Table XII).**
   120	   *   *Action:* Add a one-sentence clarification explaining that the delta between the 168,755 CPA-matched signatures (Table III) and the 168,740 signatures analyzed in the all-pairs distributions (Table V/Table XII) consists of CPAs who have exactly one signature in the corpus, making intra-CPA pairwise comparison impossible. This will preempt arithmetic nitpicking by reviewers.

 succeeded in 0ms:
     1	# IV. Experiments and Results
     2	
     3	## A. Experimental Setup
     4	
     5	All experiments were conducted on a workstation equipped with an Apple Silicon processor with Metal Performance Shaders (MPS) GPU acceleration.
     6	Feature extraction used PyTorch 2.9 with torchvision model implementations.
     7	The complete pipeline---from raw PDF processing through final classification---was implemented in Python.
     8	
     9	## B. Signature Detection Performance
    10	
    11	The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images).
    12	We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total).
    13	However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
    14	The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.
    15	
    16	<!-- TABLE III: Extraction Results
    17	| Metric | Value |
    18	|--------|-------|
    19	| Documents processed | 86,071 |
    20	| Documents with detections | 85,042 (98.8%) |
    21	| Total signatures extracted | 182,328 |
    22	| Avg. signatures per document | 2.14 |
    23	| CPA-matched signatures | 168,755 (92.6%) |
    24	| Processing rate | 43.1 docs/sec |
    25	-->
    26	
    27	## C. All-Pairs Intra-vs-Inter Class Distribution Analysis
    28	
    29	Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
    30	This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
    31	Table IV summarizes the distributional statistics.
    32	
    33	<!-- TABLE IV: Cosine Similarity Distribution Statistics
    34	| Statistic | Intra-class | Inter-class |
    35	|-----------|-------------|-------------|
    36	| N (pairs) | 41,352,824 | 500,000 |
    37	| Mean | 0.821 | 0.758 |
    38	| Std. Dev. | 0.098 | 0.090 |
    39	| Median | 0.836 | 0.774 |
    40	| Skewness | −0.711 | −0.851 |
    41	| Kurtosis | 0.550 | 1.027 |
    42	-->
    43	
    44	Both distributions are left-skewed and leptokurtic.
    45	Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
    46	Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
    47	
    48	The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
    49	Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.
    50	Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).
    51	
    52	We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.
    53	We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.
    54	A Cohen's $d$ of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count.
    55	
    56	## D. Hartigan Dip Test: Unimodality at the Signature Level
    57	
    58	Applying the Hartigan & Hartigan dip test [37] to the per-signature best-match distributions reveals a critical structural finding (Table V).
    59	The $N = 168{,}740$ count used in Table V and downstream all-pairs analyses (Tables XII, XVIII) is $15$ signatures smaller than the $168{,}755$ CPA-matched count reported in Table III: these $15$ signatures belong to CPAs with exactly one signature in the entire corpus, for whom no same-CPA pairwise best-match statistic can be computed, and are therefore excluded from all same-CPA similarity analyses.
    60	
    61	<!-- TABLE V: Hartigan Dip Test Results
    62	| Distribution | N | dip | p-value | Verdict (α=0.05) |
    63	|--------------|---|-----|---------|------------------|
    64	| Firm A cosine (max-sim) | 60,448 | 0.0019 | 0.169 | Unimodal |
    65	| Firm A min dHash (independent) | 60,448 | 0.1051 | <0.001 | Multimodal |
    66	| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
    67	| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
    68	| Per-accountant cos mean | 686 | 0.0339 | <0.001 | Multimodal |
    69	| Per-accountant dHash mean | 686 | 0.0277 | <0.001 | Multimodal |
    70	-->
    71	
    72	Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in the accountant-level mixture (Section IV-E).
    73	The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
    74	At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.
    75	
    76	This asymmetry between signature level and accountant level is itself an empirical finding.
    77	It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
    78	
    79	### 1) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic
    80	
    81	Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine distribution yields a nominally significant $Z^- \rightarrow Z^+$ transition at cosine 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample under the bin width ($0.005$ / $1$) used here.
    82	Two cautions, however, prevent us from treating these signature-level transitions as thresholds.
    83	First, the cosine transition at 0.985 lies *inside* the non-hand-signed mode rather than at the separation between two mechanisms, consistent with the dip-test finding that per-signature cosine is not cleanly bimodal.
    84	Second, Appendix A documents that the signature-level transition locations are not bin-width-stable (Firm A cosine drifts across 0.987, 0.985, 0.980, 0.975 as the bin width is widened from 0.003 to 0.015, and full-sample dHash transitions drift across 2, 10, 9 as bin width grows from 1 to 3), which is characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity between two mechanisms.
    85	At the accountant level the test does not produce a significant transition in either the cosine-mean or the dHash-mean distribution, and this null persists across the Appendix-A bin-width sweep.
    86	We read this accountant-level pattern as *consistent with*---not affirmative proof of---clustered-but-smoothly-mixed aggregates: at $N = 686$ accountants the BD/McCrary test has limited statistical power, so a non-rejection of the smoothness null does not by itself establish smoothness (Section V-G).
    87	We therefore use BD/McCrary as a density-smoothness diagnostic rather than as an independent threshold estimator, and the substantive claim of smoothly-mixed accountant clustering rests on the joint evidence of the dip test, the BIC-selected GMM, and the BD null.
    88	
    89	### 2) Beta Mixture at Signature Level: A Forced Fit
    90	
    91	Fitting 2- and 3-component Beta mixtures to Firm A's per-signature cosine via EM yields a clear BIC preference for the 3-component fit ($\Delta\text{BIC} = 381$), with a parallel preference under the logit-GMM robustness check.
    92	For the full-sample cosine the 3-component fit is likewise strongly preferred ($\Delta\text{BIC} = 10{,}175$).
    93	Under the forced 2-component fit the Firm A Beta crossing lies at 0.977 and the logit-GMM crossing at 0.999---values sharply inconsistent with each other, indicating that the 2-component parametric structure is not supported by the data.
    94	Under the full-sample 2-component forced fit no Beta crossing is identified; the logit-GMM crossing is at 0.980.
    95	
    96	The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
    97	Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.
    98	This motivates the pivot to the accountant-level analysis in Section IV-E, where aggregation over signatures reveals clustered (though not sharply discrete) patterns in individual-level signing *practice* that the signature-level analysis lacks.
    99	
   100	## E. Accountant-Level Gaussian Mixture
   101	
   102	We aggregated per-signature descriptors to the CPA level (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures and fit Gaussian mixtures in two dimensions with $K \in \{1, \ldots, 5\}$.
   103	BIC selects $K^* = 3$ (Table VI).
   104	
   105	<!-- TABLE VI: Accountant-Level GMM Model Selection (BIC)
   106	| K | BIC | AIC | Converged |
   107	|---|-----|-----|-----------|
   108	| 1 | −316 | −339 | ✓ |
   109	| 2 | −545 | −595 | ✓ |
   110	| 3 | **−792** | **−869** | ✓  (best) |
   111	| 4 | −779 | −883 | ✓ |
   112	| 5 | −747 | −879 | ✓ |
   113	-->
   114	
   115	Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.
   116	
   117	<!-- TABLE VII: Accountant-Level 3-Component GMM
   118	| Comp. | cos_mean | dHash_mean | weight | n | Dominant firms |
   119	|-------|----------|------------|--------|---|----------------|
   120	| C1 (high-replication) | 0.983 | 2.41 | 0.21 | 141 | Firm A (139/141) |
   121	| C2 (middle band) | 0.954 | 6.99 | 0.51 | 361 | three other Big-4 firms (Firms B/C/D, ~256 together) |
   122	| C3 (hand-signed tendency) | 0.928 | 11.17 | 0.28 | 184 | smaller domestic firms |
   123	-->
   124	
   125	Three empirical findings stand out.
   126	First, of the 180 CPAs in the Firm A registry, 171 have $\geq 10$ signatures and therefore enter the accountant-level GMM (the remaining 9 have too few signatures for reliable aggregates and are excluded from this analysis only).
   127	Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2.
   128	This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
   129	Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
   130	Third, applying the threshold framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary density-smoothness diagnostic does not produce a significant transition at the accountant level (robust across the bin-width sweep in Appendix A).
   131	For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.
   132	
   133	Table VIII summarizes the threshold estimates produced by the two threshold estimators and the BD/McCrary smoothness diagnostic across the two analysis levels for a compact cross-level comparison.
   134	
   135	<!-- TABLE VIII: Threshold Convergence Summary Across Levels
   136	| Level / method | Cosine threshold | dHash threshold |
   137	|----------------|-------------------|------------------|
   138	| Signature-level, all-pairs KDE crossover                                 | 0.837             | —                 |
   139	| Signature-level, Beta-2 EM crossing (Firm A)                              | 0.977             | —                 |
   140	| Signature-level, logit-GMM-2 crossing (Full)                              | 0.980             | —                 |
   141	| Signature-level, BD/McCrary transition (diagnostic only; bin-unstable, Appendix A) | 0.985     | 2.0               |
   142	| Accountant-level, KDE antimode (threshold estimator)                      | **0.973**         | **4.07**          |
   143	| Accountant-level, Beta-2 EM crossing (threshold estimator)                | **0.979**         | **3.41**          |
   144	| Accountant-level, logit-GMM-2 crossing (robustness)                       | **0.976**         | **3.93**          |
   145	| Accountant-level, BD/McCrary transition (diagnostic; null across Appendix A) | no transition  | no transition     |
   146	| Accountant-level, 2D-GMM 2-comp marginal crossing (secondary)             | 0.945             | 8.10              |
   147	| Firm A calibration-fold cosine P5                                         | 0.9407            | —                 |
   148	| Firm A calibration-fold dHash_indep P95                                   | —                 | 9                 |
   149	| Firm A calibration-fold dHash_indep median                                | —                 | 2                 |
   150	-->
   151	
   152	At the accountant level the two threshold estimators (KDE antimode and Beta-2 crossing) together with the logit-Gaussian robustness crossing converge to a cosine threshold of $\approx 0.975 \pm 0.003$ and a dHash threshold of $\approx 3.8 \pm 0.4$; the BD/McCrary density-smoothness diagnostic produces no significant transition at the same level (a null that persists across Appendix A's bin-width sweep), which is *consistent with*---though, at $N = 686$, not sufficient to affirmatively establish---clustered-but-smoothly-mixed accountant-level aggregates.
   153	This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
   154	The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
   155	
   156	## F. Calibration Validation with Firm A
   157	
   158	Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
   159	Table IX reports the proportion of Firm A signatures crossing each candidate threshold; these rates play the role of calibration-validation metrics (what fraction of a known replication-dominated population does each threshold capture?).
   160	
   161	<!-- TABLE IX: Firm A Whole-Sample Capture Rates (consistency check, NOT external validation)
   162	| Rule | Firm A rate | k / N |
   163	|------|-------------|-------|
   164	| cosine > 0.837 (all-pairs KDE crossover)              | 99.93% | 60,408 / 60,448 |
   165	| cosine > 0.9407 (calibration-fold P5)                 | 95.15% | 57,518 / 60,448 |
   166	| cosine > 0.945 (2D GMM marginal crossing)             | 94.02% | 56,836 / 60,448 |
   167	| cosine > 0.95                                         | 92.51% | 55,922 / 60,448 |
   168	| cosine > 0.973 (accountant-level KDE antimode)        | 79.45% | 48,028 / 60,448 |
   169	| dHash_indep ≤ 5 (whole-sample upper-tail of mode)     | 84.20% | 50,897 / 60,448 |
   170	| dHash_indep ≤ 8                                       | 95.17% | 57,527 / 60,448 |
   171	| dHash_indep ≤ 15 (style-consistency boundary)         | 99.83% | 60,348 / 60,448 |
   172	| cosine > 0.95 AND dHash_indep ≤ 8 (operational dual)  | 89.95% | 54,370 / 60,448 |
   173	
   174	All rates computed exactly from the full Firm A sample (N = 60,448 signatures); counts reproduce from `signature_analysis/24_validation_recalibration.py` (whole_firm_a section).
   175	-->
   176	
   177	Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H.
   178	The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with both the accountant-level crossings (Section IV-E) and the 139/32 high-replication versus middle-band split within Firm A (Section IV-E).
   179	Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
   180	
   181	## G. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation
   182	
   183	We report three validation analyses corresponding to the anchors of Section III-K.
   184	
   185	### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor
   186	
   187	Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
   188	As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
   189	Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
   190	We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
   191	The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold.
   192	We do not report an Equal Error Rate: EER is meaningful only when the positive and negative error-rate curves cross in a nontrivial interior region, but byte-identical positives all sit at cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$. An EER calculation against this anchor would be arithmetic tautology rather than biometric performance, and we therefore omit it.
   193	
   194	<!-- TABLE X: Cosine Threshold Sweep — FAR Against 50,000 Inter-CPA Negative Pairs
   195	| Threshold | FAR | FAR 95% Wilson CI |
   196	|-----------|-----|-------------------|
   197	| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] |
   198	| 0.900                            | 0.0233 | [0.0221, 0.0247] |
   199	| 0.945 (2D GMM marginal)          | 0.0008 | [0.0006, 0.0011] |
   200	| 0.950                            | 0.0007 | [0.0005, 0.0009] |
   201	| 0.973 (accountant KDE antimode)  | 0.0003 | [0.0002, 0.0004] |
   202	| 0.979 (accountant Beta-2)        | 0.0002 | [0.0001, 0.0004] |
   203	
   204	Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
   205	-->
   206	
   207	Two caveats apply.
   208	First, the byte-identical positive anchor referenced above is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
   209	A would-be FRR computed against this subset is definitionally $0$ at every threshold below $1$ (since byte-identical pairs have cosine $\approx 1$), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
   210	Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
   211	The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.
   212	
   213	### 2) Held-Out Firm A Validation (within-Firm-A sampling variance disclosure)
   214	
   215	We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
   216	The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry and which we therefore exclude from both folds; this handling is made explicit here and has no effect on the accountant-level mixture analysis of Section IV-E, which uses the $\geq 10$-signature subset of 171 CPAs.
   217	Thresholds are re-derived from calibration-fold percentiles only.
   218	Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
   219	
   220	<!-- TABLE XI: Held-Out vs Calibration Firm A Capture Rates and Generalization Test
   221	| Rule | Calibration 70% fold (CI) | Held-out 30% fold (CI) | 2-prop z | p | k/n calib | k/n held |
   222	|------|---------------------------|-------------------------|----------|---|-----------|----------|
   223	| cosine > 0.837                       | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31  | 0.756 n.s. | 45,087/45,116 | 15,321/15,332 |
   224	| cosine > 0.9407 (calib-fold P5)      | 94.99% [94.79%, 95.19%] | 95.63% [95.29%, 95.94%] | -3.19  | 0.001      | 42,856/45,116 | 14,662/15,332 |
   225	| cosine > 0.945 (2D GMM marginal)     | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54  | <0.001     | 42,305/45,116 | 14,531/15,332 |
   226	| cosine > 0.950                       | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97  | <0.001     | 41,570/45,116 | 14,352/15,332 |
   227	| dHash_indep ≤ 5                      | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001     | 37,430/45,116 | 13,467/15,332 |
   228	| dHash_indep ≤ 8                      | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45  | <0.001     | 42,788/45,116 | 14,739/15,332 |
   229	| dHash_indep ≤ 9 (calib-fold P95)     | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07  | <0.001     | 43,604/45,116 | 14,945/15,332 |
   230	| dHash_indep ≤ 15                     | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31  | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
   231	| cosine > 0.95 AND dHash_indep ≤ 8    | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60  | <0.001     | 40,335/45,116 | 14,035/15,332 |
   232	
   233	Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. All counts and z/p values are reproducible from `signature_analysis/24_validation_recalibration.py` (seed = 42).
   234	-->
   235	
   236	Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
   237	We report fold-versus-fold comparisons rather than fold-versus-whole-sample comparisons, because the whole-sample rate is a weighted average of the two folds and therefore cannot, in general, fall inside the Wilson CI of either fold when the folds differ in rate; the correct generalization reference is the calibration fold, which produced the thresholds.
   238	
   239	Under this proper test the two extreme rules agree across folds (cosine $> 0.837$ and $\text{dHash}_\text{indep} \leq 15$; both $p > 0.7$).
   240	The operationally relevant rules in the 85–95% capture band differ between folds by 1–5 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).
   241	Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 85–99% range has a held-out counterpart in the 87–99% range, and the operational dual rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.40% of the calibration fold and 91.54% of the held-out fold.
   242	The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity (see the $139 / 32$ accountant-level split of Section IV-E): the random 30% CPA sample happened to contain proportionally more accountants from the high-replication C1 cluster.
   243	We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to this fold variance.
   244	
   245	### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$
   246	
   247	The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P95 heuristic.
   248	The accountant-level convergent threshold analysis (Section IV-E) places the primary accountant-level reference between $0.973$ and $0.979$ (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing), and the accountant-level 2D-GMM marginal at $0.945$.
   249	Because the classifier operates at the signature level while these convergent accountant-level estimates are at the accountant level, they are formally non-substitutable.
   250	We report a sensitivity check in which the classifier's operational cut cos $> 0.95$ is replaced by the nearest accountant-level reference, cos $> 0.945$.
   251	
   252	<!-- TABLE XII: Classifier Sensitivity to the Operational Cosine Cut (All-Sample Five-Way Output, N = 168,740 signatures)
   253	| Category                                   | cos > 0.95 count (%) | cos > 0.945 count (%) | Δ count |
   254	|--------------------------------------------|----------------------|-----------------------|---------|
   255	| High-confidence non-hand-signed            | 76,984 (45.62%)      | 79,278 (46.98%)       | +2,294  |
   256	| Moderate-confidence non-hand-signed        | 43,906 (26.02%)      | 50,001 (29.63%)       | +6,095  |
   257	| High style consistency                     |    546 ( 0.32%)      |    665 ( 0.39%)       |   +119  |
   258	| Uncertain                                  | 46,768 (27.72%)      | 38,260 (22.67%)       | -8,508  |
   259	| Likely hand-signed                         |    536 ( 0.32%)      |    536 ( 0.32%)       |     +0  |
   260	-->

exec
/bin/zsh -lc "nl -ba paper/paper_a_results_v3.md | sed -n '261,520p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
   261	
   262	At the aggregate firm-level, the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
   263	At the per-signature categorization level, replacing 0.95 by 0.945 reclassifies 8,508 signatures (5.04% of the corpus) out of the Uncertain band; 6,095 of them migrate to Moderate-confidence non-hand-signed, 2,294 to High-confidence non-hand-signed, and 119 to High style consistency.
   264	The Likely-hand-signed category is unaffected because it depends only on the fixed all-pairs KDE crossover cosine $= 0.837$.
   265	The High-confidence non-hand-signed share grows from 45.62% to 46.98%.
   266	
   267	We interpret this sensitivity pattern as indicating that the classifier's aggregate and high-confidence output is robust to the choice of operational cut within the accountant-level convergence band, and that the movement is concentrated at the Uncertain/Moderate-confidence boundary.
   268	The paper therefore retains cos $> 0.95$ as the primary operational cut for transparency and reports the 0.945 results as a sensitivity check rather than as a deployed alternative; a future deployment requiring tighter accountant-level alignment could substitute cos $> 0.945$ without altering the substantive firm-level conclusions.
   269	
   270	### 4) Sanity Sample
   271	
   272	A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample contributed only to spot-check and is not used to compute reported metrics.
   273	
   274	## H. Additional Firm A Benchmark Validation
   275	
   276	The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.
   277	This section reports three complementary analyses that go beyond the whole-sample capture rates.
   278	Subsection H.2 is fully threshold-independent (it uses only ordinal ranking).
   279	Subsection H.1 uses a fixed 0.95 cutoff but derives information from the longitudinal stability of rates rather than from the absolute rate at any single year.
   280	Subsection H.3 applies the calibrated classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the informative quantity is the cross-firm *gap* rather than the absolute agreement rate at any one firm.
   281	
   282	### 1) Year-by-Year Stability of the Firm A Left Tail
   283	
   284	Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year.
   285	Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign.
   286	Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.
   287	
   288	<!-- TABLE XIII: Firm A Per-Year Cosine Distribution
   289	| Year | N sigs | mean cosine | % below 0.95 |
   290	|------|--------|-------------|--------------|
   291	| 2013 | 2,167 | 0.9733 | 12.78% |
   292	| 2014 | 5,256 | 0.9781 | 8.69% |
   293	| 2015 | 5,484 | 0.9793 | 7.46% |
   294	| 2016 | 5,739 | 0.9811 | 6.92% |
   295	| 2017 | 5,796 | 0.9814 | 6.69% |
   296	| 2018 | 5,986 | 0.9808 | 6.58% |
   297	| 2019 | 6,122 | 0.9780 | 8.71% |
   298	| 2020 | 6,122 | 0.9770 | 9.46% |
   299	| 2021 | 5,996 | 0.9792 | 8.37% |
   300	| 2022 | 5,918 | 0.9819 | 6.25% |
   301	| 2023 | 5,862 | 0.9860 | 3.75% |
   302	-->
   303	
   304	The left tail is stable at 6-13% throughout the sample period and shows no pre/post-2020 level shift: the 2013-2019 mean left-tail share is 8.26% and the 2020-2023 mean is 6.96%.
   305	The lowest observed share is in 2023 (3.75%), consistent with firm-level electronic signing systems producing more uniform output than earlier manual scanning-and-stamping, not less.
   306	This stability supports the replication-dominated framing: a persistent minority of hand-signing Firm A partners is consistent with a Beta left tail that is stable across production technologies, whereas a noise-only explanation would predict a shrinking share as technology improved.
   307	
   308	### 2) Partner-Level Similarity Ranking
   309	
   310	If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all Big-4 auditor-years.
   311	We test this prediction directly.
   312	
   313	For each auditor-year (CPA $\times$ fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
   314	Firm A accounts for 1,287 of these (27.8% baseline share).
   315	Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.
   316	
   317	<!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)
   318	| Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
   319	|-------|-------------|--------|--------|--------|--------|-----------|--------------|
   320	| 10%   | 462         | 443    | 2      | 3      | 0      | 14        | 95.9% |
   321	| 25%   | 1,157       | 1,043  | 32     | 23     | 9      | 50        | 90.1% |
   322	| 50%   | 2,314       | 1,220  | 473    | 273    | 102    | 246       | 52.7% |
   323	-->
   324	
   325	Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile.
   326	Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.
   327	
   328	<!-- TABLE XV: Firm A Share of Top-10% Similarity by Year
   329	| Year | N auditor-years | Top-10% k | Firm A in top-10% | Firm A share | Firm A baseline |
   330	|------|-----------------|-----------|-------------------|--------------|-----------------|
   331	| 2013 | 324 | 32 | 32 | 100.0% | 26.2% |
   332	| 2014 | 399 | 39 | 39 | 100.0% | 27.1% |
   333	| 2015 | 394 | 39 | 38 | 97.4% | 27.2% |
   334	| 2016 | 413 | 41 | 39 | 95.1% | 27.4% |
   335	| 2017 | 415 | 41 | 41 | 100.0% | 27.9% |
   336	| 2018 | 434 | 43 | 43 | 100.0% | 28.1% |
   337	| 2019 | 429 | 42 | 42 | 100.0% | 28.2% |
   338	| 2020 | 430 | 43 | 38 | 88.4% | 28.3% |
   339	| 2021 | 450 | 45 | 44 | 97.8% | 28.4% |
   340	| 2022 | 467 | 46 | 43 | 93.5% | 28.5% |
   341	| 2023 | 474 | 47 | 46 | 97.9% | 28.5% |
   342	-->
   343	
   344	This over-representation is a direct consequence of firm-wide non-hand-signing practice and is not derived from any threshold we subsequently calibrate.
   345	It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.
   346	
   347	### 3) Intra-Report Consistency
   348	
   349	Taiwanese statutory audit reports are co-signed by two engagement partners (a primary and a secondary signer).
   350	Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification.
   351	Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.
   352	
   353	For each report with exactly two signatures and complete per-signature data (83,970 reports assigned to a single firm, plus 384 reports with one signer per firm in the mixed-firm buckets for 84,354 total), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.
   354	Table XVI reports per-firm intra-report agreement (firm-assignment defined by the firm identity of both signers; mixed-firm reports are reported separately).
   355	
   356	<!-- TABLE XVI: Intra-Report Classification Agreement by Firm
   357	| Firm | Total 2-signer reports | Both non-hand-signed | Both uncertain | Both style | Both hand-signed | Mixed | Agreement rate |
   358	|------|-----------------------|----------------------|----------------|------------|------------------|-------|----------------|
   359	| Firm A  | 30,222 | 26,435 | 734 | 0 | 4 | 3,049 | **89.91%** |
   360	| Firm B  | 17,121 | 9,260  | 2,159| 5 | 6 | 5,691 | 66.76% |
   361	| Firm C  | 19,112 | 8,983  | 3,035| 3 | 5 | 7,086 | 62.92% |
   362	| Firm D  | 8,375  | 3,028  | 2,376| 0 | 3 | 2,968 | 64.56% |
   363	| Non-Big-4 | 9,140  | 1,671  | 3,945| 18| 27| 3,479 | 61.94% |
   364	
   365	A report is "in agreement" if both signature labels fall in the same coarse bucket
   366	(non-hand-signed = high+moderate; uncertain; style consistency; or likely hand-signed).
   367	-->
   368	
   369	Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.
   370	The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
   371	This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) non-hand-signing practice.
   372	
   373	We note that this test uses the calibrated classifier of Section III-L rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
   374	
   375	## I. Classification Results
   376	
   377	Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
   378	The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
   379	We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-L: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
   380	Document-level rates therefore bound the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-H.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
   381	
   382	<!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
   383	| Verdict | N (PDFs) | % | Firm A | Firm A % |
   384	|---------|----------|---|--------|----------|
   385	| High-confidence non-hand-signed | 29,529 | 35.0% | 22,970 | 76.0% |
   386	| Moderate-confidence non-hand-signed | 36,994 | 43.8% | 6,311 | 20.9% |
   387	| High style consistency | 5,133 | 6.1% | 183 | 0.6% |
   388	| Uncertain | 12,683 | 15.0% | 758 | 2.5% |
   389	| Likely hand-signed | 47 | 0.1% | 4 | 0.0% |
   390	
   391	Per the worst-case aggregation rule of Section III-L, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
   392	-->
   393	
   394	Within the 71,656 documents exceeding cosine $0.95$, the dHash dimension stratifies them into three distinct populations:
   395	29,529 (41.2%) show converging structural evidence of non-hand-signing (dHash $\leq 5$);
   396	36,994 (51.7%) show partial structural similarity (dHash in $[6, 15]$) consistent with replication degraded by scan variations;
   397	and 5,133 (7.2%) show no structural corroboration (dHash $> 15$), suggesting high signing consistency rather than image reproduction.
   398	A cosine-only classifier would treat all 71,656 identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
   399	
   400	### 1) Firm A Capture Profile (Consistency Check)
   401	
   402	96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
   403	This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the 32/171 middle-band minority identified by the accountant-level mixture (Section IV-E).
   404	The absence of any meaningful "likely hand-signed" rate (4 of 30,000+ Firm A documents, 0.01%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
   405	We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.
   406	
   407	### 2) Cross-Method Agreement
   408	
   409	Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
   410	This is consistent with the accountant-level convergent thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
   411	
   412	## J. Ablation Study: Feature Backbone Comparison
   413	
   414	To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
   415	All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
   416	Table XVIII presents the comparison.
   417	
   418	<!-- TABLE XVIII: Backbone Comparison
   419	| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
   420	|--------|-----------|--------|-----------------|
   421	| Feature dim | 2048 | 4096 | 1280 |
   422	| Intra mean | 0.821 | 0.822 | 0.786 |
   423	| Inter mean | 0.758 | 0.767 | 0.699 |
   424	| Cohen's d | 0.669 | 0.564 | 0.707 |
   425	| KDE crossover | 0.837 | 0.850 | 0.792 |
   426	| Firm A mean (all-pairs) | 0.826 | 0.820 | 0.810 |
   427	| Firm A 1st pct (all-pairs) | 0.543 | 0.520 | 0.454 |
   428	
   429	Note: Firm A values in this table are computed over all intra-firm pairwise
   430	similarities (16.0M pairs) for cross-backbone comparability. These differ from
   431	the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
   432	the classification-relevant statistic: the similarity of each signature to its
   433	single closest match from the same CPA.
   434	-->
   435	
   436	EfficientNet-B0 achieves the highest Cohen's $d$ (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions.
   437	However, it also exhibits the widest distributional spread (intra std $= 0.123$ vs. ResNet-50's $0.098$), resulting in lower per-sample classification confidence.
   438	VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.
   439	
   440	ResNet-50 provides the best overall balance:
   441	(1) Cohen's $d$ of 0.669 is competitive with EfficientNet-B0's 0.707;
   442	(2) its tighter distributions yield more reliable individual classifications;
   443	(3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and
   444	(4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.

codex
The late-stage fixes in Results look deliberate. I’m now checking Methods/Discussion/Conclusion plus the exporter and appendix language, because that’s where any residual overclaim, deferred reproducibility gap, or packaging regression would still show up.
exec
/bin/zsh -lc "nl -ba paper/paper_a_methodology_v3.md | sed -n '1,340p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	# III. Methodology
     2	
     3	## A. Pipeline Overview
     4	
     5	We propose a six-stage pipeline for large-scale non-hand-signed auditor signature detection in scanned financial documents.
     6	Fig. 1 illustrates the overall architecture.
     7	The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum supported by convergent evidence from two methodologically distinct threshold estimators complemented by a density-smoothness diagnostic and a pixel-identity anchor.
     8	
     9	Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
    10	From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.
    11	
    12	<!--
    13	[Figure 1: Pipeline Architecture - clean vector diagram]
    14	90,282 PDFs → VLM Pre-screening → 86,072 PDFs
    15	→ YOLOv11 Detection → 182,328 signatures
    16	→ ResNet-50 Features → 2048-dim embeddings
    17	→ Dual-Method Verification (Cosine + dHash)
    18	→ Three-Method Threshold (KDE / BD-McCrary / Beta mixture) → Classification
    19	→ Pixel-identity + Firm A + Accountant-level GMM validation
    20	-->
    21	
    22	## B. Data Collection
    23	
    24	The dataset comprises 90,282 annual financial audit reports filed by publicly listed companies in Taiwan, covering fiscal years 2013 to 2023.
    25	The reports were collected from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation, the official repository for mandatory corporate filings.
    26	An automated web-scraping pipeline using Selenium WebDriver was developed to systematically download all audit reports for each listed company across the study period.
    27	Each report is a multi-page PDF document containing, among other content, the auditor's report page bearing the signatures of the certifying CPAs.
    28	
    29	CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports.
    30	Table I summarizes the dataset composition.
    31	
    32	<!-- TABLE I: Dataset Summary
    33	| Attribute | Value |
    34	|-----------|-------|
    35	| Total PDF documents | 90,282 |
    36	| Date range | 2013–2023 |
    37	| Documents with signatures | 86,072 (95.4%) |
    38	| Unique CPAs identified | 758 |
    39	| Accounting firms | >50 |
    40	-->
    41	
    42	## C. Signature Page Identification
    43	
    44	To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24], one of the multimodal generative models surveyed in [35], as an automated pre-screening mechanism.
    45	Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature.
    46	The model was configured with temperature 0 for deterministic output.
    47	
    48	The scanning range was restricted to the first quartile of each document's page count, reflecting the regulatory structure of Taiwanese audit reports in which the auditor's report page is consistently located in the first quarter of the document.
    49	Scanning terminated upon the first positive detection.
    50	This process identified 86,072 documents with signature pages; the remaining 4,198 documents (4.6%) were classified as having no signatures and excluded.
    51	An additional 12 corrupted PDFs were excluded, yielding a final set of 86,071 documents.
    52	
    53	Cross-validation between the VLM and subsequent YOLO detection confirmed high agreement: YOLO successfully detected signature regions in 98.8% of VLM-positive documents.
    54	The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pages incorrectly flagged as containing signatures) and (ii) YOLO false negatives (signature regions missed by the detector), and we do not attempt to attribute the residual to either source without further labeling.
    55	
    56	## D. Signature Detection
    57	
    58	We adopted YOLOv11n (nano variant) [25], a lightweight descendant of the original YOLO single-stage detector [34], for signature region localization.
    59	A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction.
    60	A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.
    61	
    62	The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
    63	
    64	<!-- TABLE II: YOLO Detection Performance
    65	| Metric | Value |
    66	|--------|-------|
    67	| Precision | 0.97–0.98 |
    68	| Recall | 0.95–0.98 |
    69	| mAP@0.50 | 0.98–0.99 |
    70	| mAP@0.50:0.95 | 0.85–0.90 |
    71	-->
    72	
    73	Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers).
    74	A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
    75	
    76	Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
    77	
    78	## E. Feature Extraction
    79	
    80	Each extracted signature was encoded into a feature vector using a pre-trained ResNet-50 convolutional neural network [26] with ImageNet-1K V2 weights, used as a fixed feature extractor without fine-tuning.
    81	The final classification layer was removed, yielding the 2048-dimensional output of the global average pooling layer.
    82	
    83	Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
    84	All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
    85	
    86	The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D).
    87	This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
    88	
    89	## F. Dual-Method Similarity Descriptors
    90	
    91	For each signature, we compute two complementary similarity measures against other signatures attributed to the same CPA:
    92	
    93	**Cosine similarity on deep embeddings** captures high-level visual style:
    94	
    95	$$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$
    96	
    97	where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized 2048-dim feature vectors.
    98	Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].
    99	
   100	**Perceptual hash distance (dHash)** [27] captures structural-level similarity.
   101	Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint.
   102	The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images.
   103	Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
   104	
   105	These descriptors provide partially independent evidence.
   106	Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise.
   107	Non-hand-signing yields extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise.
   108	Hand-signing, by contrast, yields high dHash similarity (the overall layout of a signature is preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
   109	Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
   110	
   111	We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content.
   112	Cosine similarity and dHash are both robust to the noise introduced by the print-scan cycle.
   113	
   114	## G. Unit of Analysis and Summary Statistics
   115	
   116	Two unit-of-analysis choices are relevant for this study: (i) the *signature*---one signature image extracted from one report---and (ii) the *accountant*---the collection of all signatures attributed to a single CPA across the sample period.
   117	A third composite unit---the *auditor-year*, i.e. all signatures by one CPA within one fiscal year---is also natural when longitudinal behavior is of interest, and we treat auditor-year analyses as a direct extension of accountant-level analysis at finer temporal resolution.
   118	
   119	For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA.
   120	The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism.
   121	Mean statistics would dilute this signal.
   122	
   123	We also adopt an explicit *within-auditor-year no-mixing* identification assumption.
   124	Specifically, within any single fiscal year we treat a given CPA's signing mechanism as uniform: a CPA who reproduces one signature image in that year is assumed to do so for every report, and a CPA who hand-signs in that year is assumed to hand-sign every report in that year.
   125	Domain-knowledge from industry practice at Firm A is consistent with this assumption for that firm during the sample period.
   126	Under the assumption, per-auditor-year summary statistics are well defined and robust to outliers: if even one pair of same-CPA signatures in the year is near-identical, the max/min captures it.
   127	The intra-report consistency analysis in Section IV-H.3 is a related but distinct check: it tests whether the *two co-signing CPAs on the same report* receive the same signature-level label (firm-level signing-practice homogeneity) rather than testing whether a single CPA mixes mechanisms within a fiscal year.
   128	A direct empirical check of the within-auditor-year assumption at the same-CPA level would require labeling multiple reports of the same CPA in the same year and is left to future work; in this paper we maintain the assumption as an identification convention motivated by industry practice and bounded by the worst-case aggregation rule of Section III-L.
   129	
   130	For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean *independent minimum dHash* across all signatures of that CPA.
   131	The *independent minimum dHash* of a signature is defined as the minimum Hamming distance to *any* other signature of the same CPA (over the full same-CPA set).
   132	The independent minimum is unconditional on the cosine-nearest pair and is therefore the conservative structural-similarity statistic; it is the dHash statistic used throughout the operational classifier (Section III-L) and all reported capture-rate analyses.
   133	These accountant-level aggregates are the input to the mixture model described in Section III-J and to the accountant-level threshold analysis in Section III-I.5.
   134	
   135	## H. Calibration Reference: Firm A as a Replication-Dominated Population
   136	
   137	A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference.
   138	Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring *replication-dominated population*: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.
   139	
   140	The background context for this choice is practitioner knowledge about Firm A's signing practice: industry practice at the firm is widely understood among practitioners to involve reproducing a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports.
   141	We use this only as background context for why Firm A is a plausible calibration candidate; the *evidence* for Firm A's replication-dominated status comes entirely from the paper's own analyses, which do not depend on any claim about signing practice beyond what the audit-report images themselves show.
   142	
   143	We establish Firm A's replication-dominated status through four independent quantitative analyses, each of which can be reproduced from the public audit-report corpus alone:
   144	
   145	First, *independent visual inspection* of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
   146	
   147	Second, *whole-sample signature-level rates*: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% form a long left tail consistent with a minority of hand-signers.
   148	
   149	Third, *accountant-level mixture analysis* (Section IV-E): a BIC-selected three-component Gaussian mixture over per-accountant mean cosine and mean dHash places 139 of the 171 Firm A CPAs (with $\geq 10$ signatures) in the high-replication C1 cluster and 32 in the middle-band C2 cluster, directly quantifying the within-firm heterogeneity.
   150	
   151	Fourth, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-H. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
   152	  (a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P95 of the per-signature cosine distribution (Section III-L); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
   153	  (b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
   154	  (c) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.
   155	
   156	We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the visual inspection, the accountant-level mixture, the three complementary analyses above, and the held-out Firm A fold described in Section III-K.
   157	
   158	We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
   159	Its identification rests on visual evidence and accountant-level clustering that is independent of the statistical pipeline.
   160	The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.
   161	
   162	## I. Convergent Threshold Determination with a Density-Smoothness Diagnostic
   163	
   164	Direct assignment of thresholds based on prior intuition (e.g., cosine $\geq 0.95$ for non-hand-signed) is analytically convenient but methodologically vulnerable: reviewers can reasonably ask why these particular cutoffs rather than others.
   165	To place threshold selection on a statistically principled and data-driven footing, we apply *two methodologically distinct* threshold estimators---KDE antimode with a Hartigan dip test, and a finite Beta mixture (with a logit-Gaussian robustness check)---whose underlying assumptions decrease in strength (KDE antimode requires only smoothness; the Beta mixture additionally requires a parametric specification, and the logit-Gaussian cross-check reports sensitivity to that form).
   166	We complement these estimators with a Burgstahler-Dichev / McCrary density-smoothness diagnostic applied to the same distributions.
   167	The BD/McCrary procedure is *not* a third threshold estimator in our application---we show in Appendix A that the signature-level BD transitions are not bin-width-robust and that the accountant-level BD null survives a bin-width sweep---but it is informative about *how* the accountant-level distribution fails to exhibit a sharp density discontinuity even though it is clustered.
   168	The methods are applied to the same sample rather than to independent experiments, so their estimates are not statistically independent; convergence between the two threshold estimators is therefore a diagnostic of distributional structure rather than a formal statistical guarantee.
   169	When the two estimates agree, the decision boundary is robust to the choice of method; when the BD/McCrary diagnostic finds no significant transition at the same level, that pattern is evidence for clustered-but-smoothly-mixed rather than sharply discontinuous distributional structure.
   170	
   171	### 1) Method 1: KDE Antimode / Crossover with Unimodality Test
   172	
   173	We use two closely related KDE-based threshold estimators and apply each where it is appropriate.
   174	When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the *KDE crossover* is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes.
   175	When a single distribution is analyzed (e.g., the per-accountant cosine mean of Section IV-E) the *KDE antimode* is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal.
   176	In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality (rejecting the null of unimodality is consistent with but does not directly establish bimodality specifically), and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify threshold stability.
   177	
   178	### 2) Method 2: Finite Mixture Model via EM
   179	
   180	We fit a two-component Beta mixture to the cosine distribution via the EM algorithm [40] using method-of-moments M-step estimates (which are numerically stable for bounded proportion data).
   181	The first component represents non-hand-signed signatures (high mean, narrow spread) and the second represents hand-signed signatures (lower mean, wider spread).
   182	Under the fitted model the threshold is the crossing point of the two weighted component densities,
   183	
   184	$$\pi_1 \cdot \text{Beta}(x; \alpha_1, \beta_1) = (1 - \pi_1) \cdot \text{Beta}(x; \alpha_2, \beta_2),$$
   185	
   186	solved numerically via bracketed root-finding.
   187	As a robustness check against the Beta parametric form we fit a parallel two-component Gaussian mixture to the *logit-transformed* similarity, following standard practice for bounded proportion data.
   188	White's [41] quasi-MLE consistency result justifies interpreting the logit-Gaussian estimates as asymptotic approximations to the best Gaussian-family fit under misspecification; we use the cross-check between Beta and logit-Gaussian crossings as a diagnostic of parametric-form sensitivity rather than as a guarantee of distributional recovery.
   189	
   190	We fit 2- and 3-component variants of each mixture and report BIC for model selection.
   191	When BIC prefers the 3-component fit, the 2-component assumption itself is a forced fit, and the Bayes-optimal threshold derived from the 2-component crossing should be treated as an upper bound rather than a definitive cut.
   192	
   193	### 3) Density-Smoothness Diagnostic: Burgstahler-Dichev / McCrary
   194	
   195	Complementing the two threshold estimators above, we apply the discontinuity test of Burgstahler and Dichev [38], made asymptotically rigorous by McCrary [39], as a *density-smoothness diagnostic* rather than as a third threshold estimator.
   196	We discretize each distribution (cosine into bins of width 0.005; $\text{dHash}_\text{indep}$ into integer bins) and compute, for each bin $i$ with count $n_i$, the standardized deviation from the smooth-null expectation of the average of its neighbours,
   197	
   198	$$Z_i = \frac{n_i - \tfrac{1}{2}(n_{i-1} + n_{i+1})}{\sqrt{N p_i (1-p_i) + \tfrac{1}{4} N (p_{i-1}+p_{i+1})(1 - p_{i-1} - p_{i+1})}},$$
   199	
   200	which is approximately $N(0,1)$ under the null of distributional smoothness.
   201	A candidate transition is identified at an adjacent bin pair where $Z_{i-1}$ is significantly negative and $Z_i$ is significantly positive (cosine) or the reverse (dHash).
   202	Appendix A reports a bin-width sensitivity sweep covering $\text{bin} \in \{0.003, 0.005, 0.010, 0.015\}$ for cosine and $\text{bin} \in \{1, 2, 3\}$ for dHash; the sweep shows that signature-level BD transitions are not bin-width-stable and that accountant-level BD transitions are largely absent, consistent with clustered-but-smoothly-mixed accountant-level aggregates.
   203	We therefore do not treat the BD/McCrary procedure as a threshold estimator in our application but as diagnostic evidence about distributional smoothness.
   204	
   205	### 4) Convergent Validation and Level-Shift Framing
   206	
   207	The two threshold estimators rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the Beta mixture additionally requires a parametric specification (with logit-Gaussian as a robustness cross-check against that form).
   208	If the two estimated thresholds differ by less than a practically meaningful margin, the classification is robust to the choice of method.
   209	
   210	Equally informative is the *level at which the methods agree or disagree*.
   211	Applied to the per-signature similarity distribution the two estimators yield thresholds spread across a wide range because per-signature similarity is not a cleanly bimodal population (Section IV-D).
   212	Applied to the per-accountant cosine mean, the KDE antimode and the Beta-mixture crossing (together with its logit-Gaussian counterpart) converge within a narrow band, while the BD/McCrary diagnostic finds no significant transition at the same level; this pattern is consistent with a clustered but smoothly mixed accountant-level distribution rather than a sharply discrete discontinuity, and we interpret it accordingly in Section V rather than treating the BD null as a failure of the test.
   213	
   214	### 5) Accountant-Level Application
   215	
   216	In addition to applying the two threshold estimators and the BD/McCrary diagnostic at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures.
   217	The accountant-level estimates from the two threshold estimators (together with their convergence) provide the methodologically defensible threshold reference used in the per-document classification of Section III-L; the BD/McCrary accountant-level null is reported alongside as a smoothness diagnostic.
   218	
   219	## J. Accountant-Level Mixture Model
   220	
   221	In addition to the signature-level analysis, we fit a Gaussian mixture model in two dimensions to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash).
   222	The motivation is the expectation---consistent with industry-practice knowledge at Firm A---that an individual CPA's signing *practice* is clustered (typically consistent adoption of non-hand-signing or consistent hand-signing within a given year) even when the output pixel-level *quality* lies on a continuous spectrum.
   223	
   224	We fit mixtures with $K \in \{1, 2, 3, 4, 5\}$ components under full covariance, selecting $K^*$ by BIC with 15 random initializations per $K$.
   225	For the selected $K^*$ we report component means, weights, per-component firm composition, and the marginal-density crossing points from the two-component fit, which serve as the natural per-accountant thresholds.
   226	
   227	## K. Pixel-Identity, Inter-CPA, and Held-Out Firm A Validation (No Manual Annotation)
   228	
   229	Rather than construct a stratified manual-annotation validation set, we validate the classifier using four naturally occurring reference populations that require no human labeling:
   230	
   231	1. **Pixel-identical anchor (gold positive, conservative subset):** signatures whose nearest same-CPA match is byte-identical after crop and normalization.
   232	Handwriting physics makes byte-identity impossible under independent signing events, so this anchor is absolute ground truth *for the byte-identical subset* of non-hand-signed signatures.
   233	We emphasize that this anchor is a *subset* of the true positive class---only those non-hand-signed signatures whose nearest match happens to be byte-identical---and perfect recall against this anchor therefore does not establish recall against the full non-hand-signed population (Section V-G discusses this further).
   234	
   235	2. **Inter-CPA negative anchor (large gold negative):** $\sim$50,000 pairs of signatures randomly sampled from *different* CPAs.
   236	Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps.
   237	This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold.
   238	
   239	3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail contains a minority of hand-signers, as directly evidenced by the 32/171 middle-band share in the accountant-level mixture (Section III-H).
   240	Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we make the within-Firm-A sampling variance visible by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% *calibration* fold and a 30% *heldout* fold.
   241	Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions are derived from the calibration fold only.
   242	The heldout fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals.
   243	
   244	4. **Low-similarity same-CPA anchor (supplementary negative):** signatures whose maximum same-CPA cosine similarity is below 0.70.
   245	This anchor is retained for continuity with prior work but is small in our dataset ($n = 35$) and is reported only as a supplementary reference; its confidence intervals are too wide for quantitative inference.
   246	
   247	From these anchors we report FAR with Wilson 95% confidence intervals against the inter-CPA negative anchor.
   248	We do not report an Equal Error Rate or FRR column against the byte-identical positive anchor, because byte-identical pairs have cosine $\approx 1$ by construction and any FRR computed against that subset is trivially $0$ at every threshold below $1$; the conservative-subset role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
   249	Precision and $F_1$ are not meaningful in this anchor-based evaluation because the positive and negative anchors are constructed from different sampling units (intra-CPA byte-identical pairs vs random inter-CPA pairs), so their relative prevalence in the combined set is an arbitrary construction rather than a population parameter; we therefore omit precision and $F_1$ from Table X.
   250	The 70/30 held-out Firm A fold of Section IV-G.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.
   251	We additionally draw a small stratified sample (30 signatures across high-confidence replication, borderline, style-only, pixel-identical, and likely-genuine strata) for manual visual sanity inspection; this sample is used only for spot-check and does not contribute to reported metrics.
   252	
   253	## L. Per-Document Classification
   254	
   255	The per-signature classifier operates at the signature level and uses whole-sample Firm A percentile heuristics as its operational thresholds, while the accountant-level threshold analysis of Section IV-E (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing) supplies a *convergent* external reference for the operational cuts.
   256	Because the two analyses are at different units (signature vs accountant) we treat them as complementary rather than substitutable: the accountant-level convergence band cos $\in [0.945, 0.979]$ anchors the signature-level operational cut cos $> 0.95$ used below, and Section IV-G.3 reports a sensitivity analysis in which cos $> 0.95$ is replaced by the accountant-level 2D-GMM marginal crossing cos $> 0.945$.
   257	All dHash references in this section refer to the *independent-minimum* dHash defined in Section III-G---the smallest Hamming distance from a signature to any other same-CPA signature.
   258	We use a single dHash statistic throughout the operational classifier and the supporting capture-rate analyses (Tables IX, XI, XII, XVI), which keeps the classifier definition and its empirical evaluation arithmetically consistent.
   259	
   260	We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:
   261	
   262	1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 5$.
   263	Both descriptors converge on strong replication evidence.
   264	
   265	2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND $5 < \text{dHash}_\text{indep} \leq 15$.
   266	Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff, potentially due to scan variations.
   267	
   268	3. **High style consistency:** Cosine $> 0.95$ AND $\text{dHash}_\text{indep} > 15$.
   269	High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.
   270	
   271	4. **Uncertain:** Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.
   272	
   273	5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.
   274	
   275	We note three conventions about the thresholds.
   276	First, the cosine cutoff $0.95$ is the whole-sample Firm A P95 of the per-signature best-match cosine distribution (chosen for its transparent percentile interpretation in the whole-sample reference distribution), and the cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; both are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions.
   277	Section IV-G.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible.
   278	Second, the dHash cutoffs $\leq 5$ and $> 15$ are chosen from the whole-sample Firm A $\text{dHash}_\text{indep}$ distribution: $\leq 5$ captures the upper tail of the high-similarity mode (whole-sample Firm A median $\text{dHash}_\text{indep} = 2$, P75 $\approx 4$, so $\leq 5$ is the band immediately above median), while $> 15$ marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction.
   279	Third, the three accountant-level 1D estimators (KDE antimode $0.973$, Beta-2 crossing $0.979$, logit-GMM-2 crossing $0.976$) and the accountant-level 2D GMM marginal ($0.945$) are *not* the operational thresholds of this classifier: they are the *convergent external reference* that supports the choice of signature-level operational cut.
   280	Section IV-G.3 reports the classifier's five-way output under the nearby operational cut cos $> 0.945$ as a sensitivity check; the aggregate firm-level capture rates change by at most $\approx 1.2$ percentage points (e.g., the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A versus 91.14% at cos $> 0.945$), and category-level shifts are concentrated at the Uncertain/Moderate-confidence boundary.
   281	
   282	Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed determines the document's classification).
   283	This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.
   284	
   285	## M. Data Source and Firm Anonymization
   286	
   287	**Audit-report corpus.** The 90,282 audit-report PDFs analyzed in this study were obtained from the Market Observation Post System (MOPS) operated by the Taiwan Stock Exchange Corporation.
   288	MOPS is the statutory public-disclosure platform for Taiwan-listed companies; every audit report filed on MOPS is already a publicly accessible regulatory document.
   289	We did not access any non-public auditor work papers, internal firm records, or personally identifying information beyond the certifying CPAs' names and signatures, which are themselves published on the face of the audit report as part of the public regulatory filing.
   290	The CPA registry used to map signatures to CPAs is a publicly available audit-firm tenure registry (Section III-B).
   291	
   292	**Firm-level anonymization.** Although all audit reports and CPA identities in the corpus are public, we report firm-level results under the pseudonyms Firm A / B / C / D throughout this paper to avoid naming specific accounting firms in descriptive rate comparisons.
   293	Readers with domain familiarity may still infer Firm A from contextual descriptors (Big-4 status, replication-dominated behavior); we disclose this residual identifiability explicitly and note that none of the paper's conclusions depend on the specific firm's name.
   294	Authors declare no conflict of interest with Firm A, Firm B, Firm C, or Firm D.

exec
/bin/zsh -lc "nl -ba paper/paper_a_appendix_v3.md | sed -n '1,140p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba paper/paper_a_conclusion_v3.md | sed -n '1,120p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba paper/paper_a_discussion_v3.md | sed -n '1,220p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	# Appendix A. BD/McCrary Bin-Width Sensitivity
     2	
     3	The main text (Sections III-I and IV-E) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as one of the threshold estimators whose convergence anchors the accountant-level threshold band.
     4	This appendix documents the empirical basis for that framing by sweeping the bin width across six (variant, bin-width) panels: Firm A / full-sample / accountant-level, each in the cosine and $\text{dHash}_\text{indep}$ direction.
     5	
     6	<!-- TABLE A.I: BD/McCrary Bin-Width Sensitivity (two-sided alpha = 0.05, |Z| > 1.96)
     7	| Variant | n | Bin width | Best transition | z_below | z_above |
     8	|---------|---|-----------|-----------------|---------|---------|
     9	| Firm A cosine (sig-level)        | 60,448  | 0.003  | 0.9870 | -2.81  | +9.42   |
    10	| Firm A cosine (sig-level)        | 60,448  | 0.005  | 0.9850 | -9.57  | +19.07  |
    11	| Firm A cosine (sig-level)        | 60,448  | 0.010  | 0.9800 | -54.64 | +69.96  |
    12	| Firm A cosine (sig-level)        | 60,448  | 0.015  | 0.9750 | -85.86 | +106.17 |
    13	| Firm A dHash_indep (sig-level)   | 60,448  | 1      | 2.0    | -4.69  | +10.01  |
    14	| Firm A dHash_indep (sig-level)   | 60,448  | 2      | no transition | — | — |
    15	| Firm A dHash_indep (sig-level)   | 60,448  | 3      | no transition | — | — |
    16	| Full-sample cosine (sig-level)   | 168,740 | 0.003  | 0.9870 | -3.21  | +8.17   |
    17	| Full-sample cosine (sig-level)   | 168,740 | 0.005  | 0.9850 | -8.80  | +14.32  |
    18	| Full-sample cosine (sig-level)   | 168,740 | 0.010  | 0.9800 | -29.69 | +44.91  |
    19	| Full-sample cosine (sig-level)   | 168,740 | 0.015  | 0.9450 | -11.35 | +14.85  |
    20	| Full-sample dHash_indep (sig-l.) | 168,740 | 1      | 2.0    | -6.22  | +4.89   |
    21	| Full-sample dHash_indep (sig-l.) | 168,740 | 2      | 10.0   | -7.35  | +3.83   |
    22	| Full-sample dHash_indep (sig-l.) | 168,740 | 3      | 9.0    | -11.05 | +45.39  |
    23	| Accountant-level cosine_mean     | 686     | 0.002  | no transition | — | — |
    24	| Accountant-level cosine_mean     | 686     | 0.005  | 0.9800 | -3.23  | +5.18   |
    25	| Accountant-level cosine_mean     | 686     | 0.010  | no transition | — | — |
    26	| Accountant-level dHash_indep_mean| 686     | 0.2    | no transition | — | — |
    27	| Accountant-level dHash_indep_mean| 686     | 0.5    | no transition | — | — |
    28	| Accountant-level dHash_indep_mean| 686     | 1.0    | 3.0    | -2.00  | +3.24   |
    29	-->
    30	
    31	Two patterns are visible in Table A.I.
    32	First, at the signature level the procedure consistently identifies a "transition" under every bin width, but the *location* of that transition drifts monotonically with bin width (Firm A cosine: 0.987 → 0.985 → 0.980 → 0.975 as bin width grows from 0.003 to 0.015; full-sample dHash: 2 → 10 → 9 as the bin width grows from 1 to 3).
    33	The $Z$ statistics also inflate superlinearly with the bin width (Firm A cosine $|Z|$ rises from $\sim 9$ at bin 0.003 to $\sim 106$ at bin 0.015) because wider bins aggregate more mass per bin and therefore shrink the per-bin standard error on a very large sample.
    34	Both features are characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity.
    35	
    36	Second, at the accountant level---the unit we rely on for primary threshold inference (Sections III-H, III-J, IV-E)---the procedure produces no significant transition at two of three cosine bin widths and two of three dHash bin widths, and the one marginal transition it does produce ($Z_\text{below} = -2.00$ in the dHash sweep at bin width $1.0$) sits exactly at the critical value for $\alpha = 0.05$.
    37	We stress the inferential asymmetry here: *consistency* with smoothly-mixed clustering is what the BD null delivers, not *affirmative proof* of smoothness.
    38	At $N = 686$ accountants the BD/McCrary test has limited statistical power and can typically reject only sharp cliff-type discontinuities; failure to reject the smoothness null therefore constrains the data only to distributions whose between-cluster transitions are gradual *enough* to escape the test's sensitivity at that sample size.
    39	We read this as reinforcing---not establishing---the clustered-but-smoothly-mixed interpretation derived from the GMM fit and the dip-test evidence.
    40	
    41	Taken together, Table A.I shows (i) that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the non-hand-signed mode rather than between modes---and (ii) that the accountant-level BD/McCrary null persists across the bin-width sweep, consistent with but not alone sufficient to establish the clustered-but-smoothly-mixed interpretation discussed in Section V-B and limitation-caveated in Section V-G.
    42	Both observations support the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator.
    43	The accountant-level threshold band reported in Table VIII ($\text{cosine} \approx 0.975$ from the convergence of the KDE antimode, the Beta-2 crossing, and the logit-GMM-2 crossing) is therefore not adjusted to include any BD/McCrary location.
    44	
    45	Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials (`reports/bd_sensitivity/bd_sensitivity.json`) produced by `signature_analysis/25_bd_mccrary_sensitivity.py`.

 succeeded in 0ms:
     1	# VI. Conclusion and Future Work
     2	
     3	## Conclusion
     4	
     5	We have presented an end-to-end AI pipeline for detecting non-hand-signed auditor signatures in financial audit reports at scale.
     6	Applied to 90,282 audit reports from Taiwanese publicly listed companies spanning 2013--2023, our system extracted and analyzed 182,328 CPA signatures using a combination of VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity verification, with threshold selection placed on a statistically principled footing through two methodologically distinct threshold estimators and a density-smoothness diagnostic applied at two analysis levels.
     7	
     8	Our contributions are fourfold.
     9	
    10	First, we argued that non-hand-signing detection is a distinct problem from signature forgery detection, requiring analytical tools focused on the upper tail of intra-signer similarity rather than inter-signer discriminability.
    11	
    12	Second, we showed that combining cosine similarity of deep embeddings with difference hashing is essential for meaningful classification---among 71,656 documents with high feature-level similarity, the dual-descriptor framework revealed that only 41% exhibit converging structural evidence of non-hand-signing while 7% show no structural corroboration despite near-identical feature-level appearance, demonstrating that a single-descriptor approach conflates style consistency with image reproduction.
    13	
    14	Third, we introduced a convergent threshold framework combining two methodologically distinct estimators---KDE antimode (with a Hartigan unimodality test) and an EM-fitted Beta mixture (with a logit-Gaussian robustness check)---together with a Burgstahler-Dichev / McCrary density-smoothness diagnostic.
    15	Applied at both the signature and accountant levels, this framework surfaced an informative structural asymmetry: at the per-signature level the distribution is a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas at the per-accountant level BIC cleanly selects a three-component mixture and the KDE antimode together with the Beta-mixture and logit-Gaussian estimators agree within $\sim 0.006$ at cosine $\approx 0.975$.
    16	The Burgstahler-Dichev / McCrary test, by contrast, finds no significant transition at the accountant level; at $N = 686$ accountants the test has limited power and cannot affirmatively establish smoothness, but its non-transition is consistent with the smoothly-mixed cluster boundaries implied by the accountant-level GMM.
    17	The substantive reading is therefore narrower than "discrete behavior": *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered into three recognizable groups whose inter-cluster boundaries are gradual rather than sharp.
    18	
    19	Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
    20	To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85-95% capture band differ by 1-5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure.
    21	This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
    22	
    23	An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
    24	
    25	## Future Work
    26	
    27	Several directions merit further investigation.
    28	Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
    29	Extending the accountant-level analysis to auditor-year units---using the same convergent threshold framework at finer temporal resolution---could reveal within-accountant transitions between hand-signing and non-hand-signing over the decade.
    30	The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
    31	The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself.
    32	Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.

 succeeded in 0ms:
     1	# V. Discussion
     2	
     3	## A. Non-Hand-Signing Detection as a Distinct Problem
     4	
     5	Our results highlight the importance of distinguishing *non-hand-signing detection* from the well-studied *signature forgery detection* problem.
     6	In forgery detection, the challenge lies in modeling the variability of skilled forgers who produce plausible imitations of a target signature.
     7	In non-hand-signing detection the signer's identity is not in question; the challenge is distinguishing between legitimate intra-signer consistency (a CPA who signs similarly each time) and image-level reproduction of a stored signature (a CPA whose signature on each report is a byte-level or near-byte-level copy of a single source image).
     8	
     9	This distinction has direct methodological consequences.
    10	Forgery detection systems optimize for inter-class discriminability---maximizing the gap between genuine and forged signatures.
    11	Non-hand-signing detection, by contrast, requires sensitivity to the *upper tail* of the intra-class similarity distribution, where the boundary between consistent handwriting and image reproduction becomes ambiguous.
    12	The dual-descriptor framework we propose---combining semantic-level features (cosine similarity) with structural-level features (dHash)---addresses this ambiguity in a way that single-descriptor approaches cannot.
    13	
    14	## B. Continuous-Quality Spectrum vs. Clustered Accountant-Level Heterogeneity
    15	
    16	The most consequential empirical finding of this study is the asymmetry between signature level and accountant level revealed by the convergent threshold framework and the Hartigan dip test (Sections IV-D and IV-E).
    17	
    18	At the per-signature level, the distribution of best-match cosine similarity is *not* cleanly bimodal.
    19	Firm A's signature-level cosine is formally unimodal (dip test $p = 0.17$) with a long left tail.
    20	The all-CPA signature-level cosine rejects unimodality ($p < 0.001$), but its structure is not well approximated by a two-component Beta mixture: BIC clearly prefers a three-component fit, and the 2-component Beta crossing and its logit-GMM counterpart disagree sharply on the candidate threshold (0.977 vs. 0.999 for Firm A).
    21	The BD/McCrary discontinuity test locates its transition at 0.985---*inside* the non-hand-signed mode rather than at a boundary between two mechanisms.
    22	Taken together, these results indicate that non-hand-signed signatures form a continuous quality spectrum rather than a discrete class.
    23	Replication quality varies continuously with scan equipment, PDF compression, stamp pressure, and firm-level e-signing system generation, producing a heavy-tailed distribution that no two-mechanism mixture explains at the signature level.
    24	
    25	At the per-accountant aggregate level the picture partly reverses.
    26	The distribution of per-accountant mean cosine (and mean dHash) rejects unimodality, a BIC-selected three-component Gaussian mixture cleanly separates (C1) a high-replication cluster dominated by Firm A, (C2) a middle band shared by the other Big-4 firms, and (C3) a hand-signed-tendency cluster dominated by smaller domestic firms, and the three 1D threshold methods applied at the accountant level produce mutually consistent estimates (KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, logit-GMM-2 crossing $= 0.976$).
    27	The BD/McCrary test, however, does not produce a significant transition at the accountant level either, in contrast to the signature level.
    28	This pattern is consistent with a clustered *but smoothly mixed* accountant-level distribution rather than with a sharp density discontinuity: accountant-level means cluster into three recognizable groups, yet the test fails to reject the smoothness null at the sample size available ($N = 686$), and the GMM cluster boundaries appear gradual rather than sheer.
    29	We caveat this interpretation appropriately in Section V-G: the BD null alone cannot affirmatively establish smoothness---only fail to falsify it---and our substantive claim of smoothly-mixed clustering rests on the joint weight of the GMM fit, the dip test, and the BD null rather than on the BD null alone.
    30	
    31	The substantive interpretation we take from this evidence is therefore narrower than a "discrete-behaviour" claim: *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behaviour* is clustered (three recognizable groups) but not sharply discrete.
    32	The accountant-level mixture is a useful classifier of firm-and-practitioner-level signing regimes; individual behaviour may still transition or mix over time within a practitioner, and our cross-sectional analysis does not rule this out.
    33	Methodologically, the implication is that the two threshold estimators (KDE antimode, Beta mixture with logit-Gaussian robustness) are meaningfully applied at the accountant level for threshold estimation, while the BD/McCrary non-transition at the same level is a failure-to-reject rather than a failure of the method---informative alongside the other evidence but subject to the power caveat recorded in Section V-G.
    34	
    35	## C. Firm A as a Replication-Dominated, Not Pure, Population
    36	
    37	A recurring theme in prior work that treats Firm A or an analogous reference group as a calibration anchor is the implicit assumption that the anchor is a pure positive class.
    38	Our evidence across multiple analyses rules out that assumption for Firm A while affirming its utility as a calibration reference.
    39	
    40	Three convergent strands of evidence support the replication-dominated framing.
    41	First, the visual-inspection evidence: randomly sampled Firm A reports exhibit pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
    42	Second, the signature-level statistical evidence: Firm A's per-signature cosine distribution is unimodal long-tail rather than a tight single peak; 92.5% of Firm A signatures exceed cosine 0.95, with the remaining 7.5% forming the left tail.
    43	Third, the accountant-level evidence: of the 171 Firm A CPAs with enough signatures ($\geq 10$) to enter the accountant-level GMM, 32 (19%) fall into the middle-band C2 cluster rather than the high-replication C1 cluster---directly quantifying the within-firm minority of hand-signers.
    44	Nine additional Firm A CPAs are excluded from the GMM for having fewer than 10 signatures, so we cannot place them in a cluster from the cross-sectional analysis alone.
    45	The held-out Firm A 70/30 validation (Section IV-G.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules (extreme rules are statistically indistinguishable; operational rules in the 85–95% band differ between folds by 1–5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure).
    46	The accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to fold-level sampling variance.
    47	
    48	The replication-dominated framing is internally coherent with all three pieces of evidence, and it predicts and explains the residuals that a "near-universal" framing would be forced to treat as noise.
    49	We therefore recommend that future work building on this calibration strategy should explicitly distinguish replication-dominated from replication-pure calibration anchors.
    50	
    51	## D. The Style-Replication Gap
    52	
    53	Within the 71,656 documents exceeding cosine $0.95$, the dHash descriptor partitions them into three distinct populations: 29,529 (41.2%) with high-confidence structural evidence of non-hand-signing, 36,994 (51.7%) with moderate structural similarity, and 5,133 (7.2%) with no structural corroboration despite near-identical feature-level appearance.
    54	A cosine-only classifier would treat all 71,656 documents identically; the dual-descriptor framework separates them into populations with fundamentally different interpretations.
    55	
    56	The 7.2% classified as "high style consistency" (cosine $> 0.95$ but dHash $> 15$) are particularly informative.
    57	Several plausible explanations may account for their high feature similarity without structural identity, though we lack direct evidence to confirm their relative contributions.
    58	Many accountants may develop highly consistent signing habits---using similar pen pressure, stroke order, and spatial layout---resulting in signatures that appear nearly identical at the semantic feature level while retaining the microscopic variations inherent to handwriting.
    59	Some may use signing pads or templates that further constrain variability without constituting image-level reproduction.
    60	The dual-descriptor framework correctly identifies these cases as distinct from non-hand-signed signatures by detecting the absence of structural-level convergence.
    61	
    62	## E. Value of a Replication-Dominated Calibration Group
    63	
    64	The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
    65	In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
    66	Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset.
    67	
    68	This calibration strategy has broader applicability beyond signature analysis.
    69	Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives.
    70	The framing we adopt---replication-dominated rather than replication-pure---is an important refinement of this strategy: it prevents overclaim, accommodates the within-firm heterogeneity quantified by the accountant-level mixture, and yields classification rates that are internally consistent with the data.
    71	
    72	## F. Pixel-Identity and Inter-CPA Anchors as Annotation-Free Validation
    73	
    74	A further methodological contribution is the combination of byte-level pixel identity as an annotation-free *conservative* gold positive and a large random-inter-CPA negative anchor.
    75	Handwriting physics makes byte-identity impossible under independent signing events, so any pair of same-CPA signatures that are byte-identical after crop and normalization is an absolute positive for non-hand-signing, requiring no human review.
    76	In our corpus 310 signatures satisfied this condition.
    77	We emphasize that byte-identical pairs are a *subset* of the true non-hand-signed positive class---they capture only those whose nearest same-CPA match happens to be bytewise identical, excluding replications that are pixel-near-identical but not byte-identical (for example, under different scan or compression pathways).
    78	Perfect recall against this subset therefore does not generalize to perfect recall against the full non-hand-signed population; it is a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable recall estimate.
    79	
    80	Paired with the $\sim$50,000-pair inter-CPA negative anchor, the byte-identical positives yield FAR estimates with tight Wilson 95% confidence intervals (Table X), which is a substantive improvement over the low-similarity same-CPA negative ($n = 35$) we originally considered.
    81	The combination is a reusable pattern for other document-forensics settings in which the target mechanism leaves a byte-level physical signature in the artifact itself, provided that its generalization limits are acknowledged: FAR is informative, whereas recall is valid only for the conservative subset.
    82	
    83	## G. Limitations
    84	
    85	Several limitations should be acknowledged.
    86	
    87	First, comprehensive per-document ground truth labels are not available.
    88	The pixel-identity anchor is a strict *subset* of the true non-hand-signing positives (only those whose nearest same-CPA match happens to be byte-identical), so perfect recall against this anchor does not establish the classifier's recall on the broader positive class.
    89	The low-similarity same-CPA anchor ($n = 35$) is small because intra-CPA pairs rarely fall below cosine 0.70; we use the $\sim$50,000-pair inter-CPA negative anchor as the primary negative reference, which yields tight Wilson 95% CIs on FAR (Table X), but it too does not exhaust the set of true negatives (in particular, same-CPA hand-signed pairs with moderate cosine similarity are not sampled).
    90	A manual-adjudication study concentrated at the decision boundary---for example 100--300 auditor-years stratified by cosine band---would further strengthen the recall estimate against the full positive class.
    91	
    92	Second, the ResNet-50 feature extractor was used with pre-trained ImageNet weights without domain-specific fine-tuning.
    93	While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.
    94	
    95	Third, the red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions.
    96	In these overlap regions, blended pixels are replaced with white, potentially creating small gaps in the signature strokes that could reduce dHash similarity.
    97	This effect would bias classification toward false negatives rather than false positives, but the magnitude has not been quantified.
    98	
    99	Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements.
   100	While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded, and we note that our accountant-level aggregates could mask within-accountant temporal transitions.
   101	
   102	Fifth, the classification framework treats all signatures from a CPA as belonging to a single class, not accounting for potential changes in signing practice over time (e.g., a CPA who signed genuinely in early years but adopted non-hand-signing later).
   103	Extending the accountant-level analysis to auditor-year units is a natural next step.
   104	
   105	Sixth, the BD/McCrary transition estimates fall inside rather than between modes for the per-signature cosine distribution, and the test produces no significant transition at all at the accountant level.
   106	In our application, therefore, BD/McCrary contributes diagnostic information about local density-smoothness rather than an independent accountant-level threshold estimate; that role is played by the KDE antimode and the two mixture-based estimators.
   107	We emphasize that the accountant-level BD/McCrary null is *consistent with*---not affirmative proof of---smoothly mixed cluster boundaries: the BD/McCrary test is known to have limited statistical power at modest sample sizes, and with $N = 686$ accountants in our analysis the test cannot reliably detect anything less than a sharp cliff-type density discontinuity.
   108	Failure to reject the smoothness null at this sample size therefore reinforces BD/McCrary's role as a diagnostic rather than a definitive estimator; the substantive claim of smoothly-mixed accountant-level clustering rests on the joint weight of the dip-test and Beta-mixture evidence together with the BD null, not on the BD null alone.
   109	
   110	Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."
   111	Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a legal question that our technical analysis can inform but cannot resolve.

exec
/bin/zsh -lc "nl -ba paper/export_v3.py | sed -n '1,320p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	#!/usr/bin/env python3
     2	"""Export Paper A v3 (IEEE Access target) to Word, reading from v3 md section files."""
     3	
     4	from docx import Document
     5	from docx.shared import Inches, Pt, RGBColor
     6	from docx.enum.text import WD_ALIGN_PARAGRAPH
     7	from pathlib import Path
     8	import re
     9	
    10	PAPER_DIR = Path("/Volumes/NV2/pdf_recognize/paper")
    11	FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures")
    12	EXTRA_FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
    13	OUTPUT = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"
    14	
    15	SECTIONS = [
    16	    "paper_a_abstract_v3.md",
    17	    # paper_a_impact_statement_v3.md removed: not a standard IEEE Access
    18	    # Regular Paper section. Content folded into cover letter / abstract.
    19	    "paper_a_introduction_v3.md",
    20	    "paper_a_related_work_v3.md",
    21	    "paper_a_methodology_v3.md",
    22	    "paper_a_results_v3.md",
    23	    "paper_a_discussion_v3.md",
    24	    "paper_a_conclusion_v3.md",
    25	    # Appendix A: BD/McCrary bin-width sensitivity (see v3.7 notes).
    26	    "paper_a_appendix_v3.md",
    27	    "paper_a_references_v3.md",
    28	]
    29	
    30	# Figure insertion hooks (trigger phrase -> (file, caption, width inches)).
    31	# New figures for v3: dip test, BD/McCrary overlays, accountant GMM 2D + marginals.
    32	FIGURES = {
    33	    "Fig. 1 illustrates": (
    34	        FIG_DIR / "fig1_pipeline.png",
    35	        "Fig. 1. Pipeline architecture for automated non-hand-signed signature detection.",
    36	        6.5,
    37	    ),
    38	    "Fig. 2 presents the cosine similarity distributions for intra-class": (
    39	        FIG_DIR / "fig2_intra_inter_kde.png",
    40	        "Fig. 2. Cosine similarity distributions: intra-class vs. inter-class with KDE crossover at 0.837.",
    41	        3.5,
    42	    ),
    43	    "Fig. 3 presents the per-signature cosine and dHash distributions of Firm A": (
    44	        FIG_DIR / "fig3_firm_a_calibration.png",
    45	        "Fig. 3. Firm A per-signature cosine and dHash distributions against the overall CPA population.",
    46	        3.5,
    47	    ),
    48	    "Fig. 4 visualizes the accountant-level clusters": (
    49	        EXTRA_FIG_DIR / "accountant_mixture" / "accountant_mixture_2d.png",
    50	        "Fig. 4. Accountant-level 3-component Gaussian mixture in the (cosine-mean, dHash-mean) plane.",
    51	        4.5,
    52	    ),
    53	    "conducted an ablation study comparing three": (
    54	        FIG_DIR / "fig4_ablation.png",
    55	        "Fig. 5. Ablation study comparing three feature extraction backbones.",
    56	        6.5,
    57	    ),
    58	}
    59	
    60	
    61	def strip_comments(text):
    62	    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    63	
    64	
    65	def add_md_table(doc, table_lines):
    66	    rows_data = []
    67	    for line in table_lines:
    68	        cells = [c.strip() for c in line.strip("|").split("|")]
    69	        if not re.match(r"^[-: ]+$", cells[0]):
    70	            rows_data.append(cells)
    71	    if len(rows_data) < 2:
    72	        return
    73	    ncols = len(rows_data[0])
    74	    table = doc.add_table(rows=len(rows_data), cols=ncols)
    75	    table.style = "Table Grid"
    76	    for r_idx, row in enumerate(rows_data):
    77	        for c_idx in range(min(len(row), ncols)):
    78	            cell = table.rows[r_idx].cells[c_idx]
    79	            cell.text = row[c_idx]
    80	            for p in cell.paragraphs:
    81	                p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    82	                for run in p.runs:
    83	                    run.font.size = Pt(8)
    84	                    run.font.name = "Times New Roman"
    85	                    if r_idx == 0:
    86	                        run.bold = True
    87	    doc.add_paragraph()
    88	
    89	
    90	def _insert_figures(doc, para_text):
    91	    for trigger, (fig_path, caption, width) in FIGURES.items():
    92	        if trigger in para_text and Path(fig_path).exists():
    93	            fp = doc.add_paragraph()
    94	            fp.alignment = WD_ALIGN_PARAGRAPH.CENTER
    95	            fr = fp.add_run()
    96	            fr.add_picture(str(fig_path), width=Inches(width))
    97	            cp = doc.add_paragraph()
    98	            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
    99	            cr = cp.add_run(caption)
   100	            cr.font.size = Pt(9)
   101	            cr.font.name = "Times New Roman"
   102	            cr.italic = True
   103	
   104	
   105	def process_section(doc, filepath):
   106	    text = filepath.read_text(encoding="utf-8")
   107	    text = strip_comments(text)
   108	    lines = text.split("\n")
   109	    i = 0
   110	    while i < len(lines):
   111	        line = lines[i]
   112	        stripped = line.strip()
   113	        if not stripped:
   114	            i += 1
   115	            continue
   116	        if stripped.startswith("# "):
   117	            h = doc.add_heading(stripped[2:], level=1)
   118	            for run in h.runs:
   119	                run.font.color.rgb = RGBColor(0, 0, 0)
   120	            i += 1
   121	            continue
   122	        if stripped.startswith("## "):
   123	            h = doc.add_heading(stripped[3:], level=2)
   124	            for run in h.runs:
   125	                run.font.color.rgb = RGBColor(0, 0, 0)
   126	            i += 1
   127	            continue
   128	        if stripped.startswith("### "):
   129	            h = doc.add_heading(stripped[4:], level=3)
   130	            for run in h.runs:
   131	                run.font.color.rgb = RGBColor(0, 0, 0)
   132	            i += 1
   133	            continue
   134	        if "|" in stripped and i + 1 < len(lines) and re.match(r"\s*\|[-|: ]+\|", lines[i + 1]):
   135	            table_lines = []
   136	            while i < len(lines) and "|" in lines[i]:
   137	                table_lines.append(lines[i])
   138	                i += 1
   139	            add_md_table(doc, table_lines)
   140	            continue
   141	        if re.match(r"^\d+\.\s", stripped):
   142	            p = doc.add_paragraph(style="List Number")
   143	            content = re.sub(r"^\d+\.\s", "", stripped)
   144	            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
   145	            run = p.add_run(content)
   146	            run.font.size = Pt(10)
   147	            run.font.name = "Times New Roman"
   148	            i += 1
   149	            continue
   150	        if stripped.startswith("- "):
   151	            p = doc.add_paragraph(style="List Bullet")
   152	            content = stripped[2:]
   153	            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
   154	            run = p.add_run(content)
   155	            run.font.size = Pt(10)
   156	            run.font.name = "Times New Roman"
   157	            i += 1
   158	            continue
   159	        # Regular paragraph
   160	        para_lines = [stripped]
   161	        i += 1
   162	        while i < len(lines):
   163	            nxt = lines[i].strip()
   164	            if (
   165	                not nxt
   166	                or nxt.startswith("#")
   167	                or nxt.startswith("|")
   168	                or nxt.startswith("- ")
   169	                or re.match(r"^\d+\.\s", nxt)
   170	            ):
   171	                break
   172	            para_lines.append(nxt)
   173	            i += 1
   174	        para_text = " ".join(para_lines)
   175	        para_text = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", para_text)
   176	        para_text = re.sub(r"\*\*(.+?)\*\*", r"\1", para_text)
   177	        para_text = re.sub(r"\*(.+?)\*", r"\1", para_text)
   178	        para_text = re.sub(r"`(.+?)`", r"\1", para_text)
   179	        para_text = para_text.replace("$$", "")
   180	        para_text = para_text.replace("---", "\u2014")
   181	
   182	        p = doc.add_paragraph()
   183	        p.paragraph_format.space_after = Pt(6)
   184	        run = p.add_run(para_text)
   185	        run.font.size = Pt(10)
   186	        run.font.name = "Times New Roman"
   187	
   188	        _insert_figures(doc, para_text)
   189	
   190	
   191	def main():
   192	    doc = Document()
   193	    style = doc.styles["Normal"]
   194	    style.font.name = "Times New Roman"
   195	    style.font.size = Pt(10)
   196	
   197	    # Title page
   198	    p = doc.add_paragraph()
   199	    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
   200	    p.paragraph_format.space_after = Pt(12)
   201	    run = p.add_run(
   202	        "Automated Identification of Non-Hand-Signed Auditor Signatures\n"
   203	        "in Large-Scale Financial Audit Reports:\n"
   204	        "A Dual-Descriptor Framework with Three-Method Convergent Thresholding"
   205	    )
   206	    run.font.size = Pt(16)
   207	    run.font.name = "Times New Roman"
   208	    run.bold = True
   209	
   210	    # IEEE Access uses single-anonymized review: author / affiliation
   211	    # / corresponding-author block must appear on the title page in the
   212	    # final submission. Fill these placeholders with real metadata
   213	    # before submitting the generated DOCX.
   214	    p = doc.add_paragraph()
   215	    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
   216	    p.paragraph_format.space_after = Pt(6)
   217	    run = p.add_run("[AUTHOR NAMES — fill in before submission]")
   218	    run.font.size = Pt(11)
   219	
   220	    p = doc.add_paragraph()
   221	    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
   222	    p.paragraph_format.space_after = Pt(6)
   223	    run = p.add_run("[Affiliations and corresponding-author email — fill in before submission]")
   224	    run.font.size = Pt(10)
   225	    run.italic = True
   226	
   227	    p = doc.add_paragraph()
   228	    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
   229	    p.paragraph_format.space_after = Pt(20)
   230	    run = p.add_run("Target journal: IEEE Access (Regular Paper, single-anonymized review)")
   231	    run.font.size = Pt(10)
   232	    run.italic = True
   233	
   234	    for section_file in SECTIONS:
   235	        filepath = PAPER_DIR / section_file
   236	        if filepath.exists():
   237	            process_section(doc, filepath)
   238	        else:
   239	            print(f"WARNING: missing section file: {filepath}")
   240	
   241	    doc.save(str(OUTPUT))
   242	    print(f"Saved: {OUTPUT}")
   243	
   244	
   245	if __name__ == "__main__":
   246	    main()

exec
/bin/zsh -lc "nl -ba paper/paper_a_abstract_v3.md | sed -n '1,80p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba paper/paper_a_introduction_v3.md | sed -n '1,180p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	# Abstract
     2	
     3	<!-- IEEE Access target: <= 250 words, single paragraph -->
     4	
     5	Regulations require Certified Public Accountants (CPAs) to attest to each audit report by affixing a signature. Digitization makes reusing a stored signature image across reports trivial---through administrative stamping or firm-level electronic signing---potentially undermining individualized attestation. Unlike forgery, *non-hand-signed* reproduction reuses the legitimate signer's own stored image, making it visually invisible to report users and infeasible to audit at scale manually. We present a pipeline integrating a Vision-Language Model for signature-page identification, YOLOv11 for signature detection, and ResNet-50 for feature extraction, followed by dual-descriptor verification combining cosine similarity and difference hashing. For threshold determination we apply two estimators---kernel-density antimode with a Hartigan unimodality test and an EM-fitted Beta mixture with a logit-Gaussian robustness check---plus a Burgstahler-Dichev/McCrary density-smoothness diagnostic, at the signature and accountant levels. Applied to 90,282 audit reports filed in Taiwan over 2013-2023 (182,328 signatures from 758 CPAs), the methods reveal a level asymmetry: signature-level similarity is a continuous quality spectrum that no two-component mixture separates, while accountant-level aggregates cluster into three groups with the antimode and two mixture estimators converging within $\sim$0.006 at cosine $\approx 0.975$. A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with visual inspection and accountant-level mixture evidence supporting majority non-hand-signing and a minority of hand-signers; capture rates on both 70/30 calibration and held-out folds are reported with Wilson 95% intervals to make fold-level variance visible. Validation against 310 byte-identical positives and a $\sim$50,000-pair inter-CPA negative anchor yields FAR $\leq$ 0.001 at all accountant-level thresholds.
     6	
     7	<!-- Target word count: 240 -->

 succeeded in 0ms:
     1	# I. Introduction
     2	
     3	<!-- Target: ~1.5 pages double-column IEEE format. Double-blind: no author/institution info. -->
     4	
     5	Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection.
     6	In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require that certifying CPAs affix their signature or seal (簽名或蓋章) to each audit report [1].
     7	While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].
     8	
     9	The digitization of financial reporting has introduced a practice that complicates this intent.
    10	As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one.
    11	This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step.
    12	From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise.
    13	We refer to signatures produced by either workflow collectively as *non-hand-signed*.
    14	Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement.
    15	The accounting literature has long examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33].
    16	Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused.
    17	This practice, while potentially widespread, is visually invisible to report users and virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of image reproduction.
    18	
    19	The distinction between *non-hand-signing detection* and *signature forgery detection* is both conceptually and technically important.
    20	The extensive body of research on offline signature verification [3]--[8] has focused almost exclusively on forgery detection---determining whether a questioned signature was produced by its purported author or by an impostor.
    21	This framing presupposes that the central threat is identity fraud.
    22	In our context, identity is not in question; the CPA is indeed the legitimate signer.
    23	The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports.
    24	This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction, requiring an analytical framework focused on detecting abnormally high similarity across documents.
    25	
    26	A secondary methodological concern shapes the research design.
    27	Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification.
    28	Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference.
    29	A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against naturally-occurring anchor populations---byte-level identical pairs as a conservative gold positive subset and large random inter-CPA pairs as a gold negative population---reported with Wilson 95% confidence intervals on per-rule capture / FAR rates, since precision and $F_1$ are not meaningful when the positive and negative anchor populations are sampled from different units.
    30	
    31	Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards.
    32	Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image.
    33	Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures, where legitimate visual similarity between a signer's authentic signatures is expected and must be distinguished from image reproduction.
    34	Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis.
    35	From the statistical side, the methods we adopt for threshold determination---the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39]---have been developed in statistics and accounting-econometrics but have not, to our knowledge, been combined as a convergent threshold framework for document-forensics threshold selection.
    36	
    37	In this paper, we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale.
    38	Our approach processes raw PDF documents through the following stages:
    39	(1) signature page identification using a Vision-Language Model (VLM);
    40	(2) signature region detection using a trained YOLOv11 object detector;
    41	(3) deep feature extraction via a pre-trained ResNet-50 convolutional neural network;
    42	(4) dual-descriptor similarity computation combining cosine similarity on deep embeddings with difference hash (dHash) distance;
    43	(5) threshold determination using two methodologically distinct estimators---KDE antimode with a Hartigan unimodality test and finite Beta mixture via EM with a logit-Gaussian robustness check---complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic, all applied at both the signature level and the accountant level; and
    44	(6) validation against a pixel-identical anchor, a low-similarity anchor, and a replication-dominated Big-4 calibration firm.
    45	
    46	The dual-descriptor verification is central to our contribution.
    47	Cosine similarity of deep feature embeddings captures high-level visual style similarity---it can identify signatures that share similar stroke patterns and spatial layouts---but cannot distinguish between a CPA who signs consistently and one whose signature is reproduced from a stored image.
    48	Perceptual hashing (specifically, difference hashing) encodes structural-level image gradients into compact binary fingerprints that are robust to scan noise but sensitive to substantive content differences.
    49	By requiring convergent evidence from both descriptors, we can differentiate *style consistency* (high cosine but divergent dHash) from *image reproduction* (high cosine with low dHash), resolving an ambiguity that neither descriptor can address alone.
    50	
    51	A second distinctive feature is our framing of the calibration reference.
    52	One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing for the majority of its certifying partners, while not ruling out that a minority may continue to hand-sign some reports.
    53	We therefore treat Firm A as a *replication-dominated* calibration reference rather than a pure positive class.
    54	This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in total) cluster into an accountant-level "middle band" rather than the high-replication mode.
    55	Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence among the visual-inspection evidence, the signature-level statistics, and the accountant-level mixture.
    56	
    57	A third distinctive feature is our unit-of-analysis treatment.
    58	Our threshold-framework analysis reveals an informative asymmetry between the signature level and the accountant level: per-signature similarity forms a continuous quality spectrum for which no two-mechanism mixture provides a good fit, whereas per-accountant aggregates are clustered into three recognizable groups (BIC-best $K = 3$).
    59	The substantive reading is that *pixel-level output quality* is a continuous spectrum shaped by firm-specific reproduction technologies and scan conditions, while *accountant-level aggregate behaviour* is clustered but not sharply discrete---a given CPA tends to cluster into a dominant regime (high-replication, middle-band, or hand-signed-tendency), though the boundaries between regimes are smooth rather than discontinuous.
    60	At the accountant level, the KDE antimode and the two mixture-based estimators (Beta-2 crossing and its logit-Gaussian robustness counterpart) converge within $\sim 0.006$ on a cosine threshold of approximately $0.975$, while the Burgstahler-Dichev / McCrary density-smoothness diagnostic finds no significant transition---an outcome (robust across a bin-width sweep, Appendix A) consistent with smoothly mixed clusters.
    61	The two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are reported as a complementary cross-check rather than as the primary accountant-level threshold.
    62	
    63	We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants.
    64	To our knowledge, this represents the largest-scale forensic analysis of signature authenticity in financial documents reported in the literature.
    65	
    66	The contributions of this paper are summarized as follows:
    67	
    68	1. **Problem formulation.** We formally define non-hand-signing detection as distinct from signature forgery detection and argue that it requires an analytical framework focused on intra-signer similarity distributions rather than genuine-versus-forged classification.
    69	
    70	2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, deep feature extraction, and dual-descriptor similarity computation, with automated inference requiring no manual intervention after initial training and annotation.
    71	
    72	3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures.
    73	
    74	4. **Convergent threshold framework with a smoothness diagnostic.** We introduce a threshold-selection framework that applies two methodologically distinct estimators---KDE antimode with Hartigan unimodality test and EM-fitted Beta mixture with a logit-Gaussian robustness check---at both the signature and accountant levels, and uses a Burgstahler-Dichev / McCrary density-smoothness diagnostic to characterize the local density structure. The convergence of the two estimators, combined with the presence or absence of a BD/McCrary transition, is used as evidence about the mixture structure of the data.
    75	
    76	5. **Continuous-quality / clustered-accountant finding.** We empirically establish that per-signature similarity is a continuous quality spectrum poorly approximated by any two-mechanism mixture, whereas per-accountant aggregates cluster into three recognizable groups with smoothly mixed rather than sharply discrete boundaries---an asymmetry with direct implications for how threshold selection and mixture modelling should be applied in document forensics.
    77	
    78	6. **Replication-dominated calibration methodology.** We introduce a calibration strategy using a known-majority-positive reference group, distinguishing *replication-dominated* from *replication-pure* anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.
    79	
    80	7. **Large-scale empirical analysis.** We report findings from the analysis of over 90,000 audit reports spanning a decade, providing the first large-scale empirical evidence on non-hand-signing practices in financial reporting under a methodology designed for peer-review defensibility.
    81	
    82	The remainder of this paper is organized as follows.
    83	Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods we adopt for threshold determination.
    84	Section III describes the proposed methodology.
    85	Section IV presents experimental results including the convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study.
    86	Section V discusses the implications and limitations of our findings.
    87	Section VI concludes with directions for future work.

exec
/bin/zsh -lc "nl -ba paper/paper_a_related_work_v3.md | sed -n '1,180p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba paper/paper_a_references_v3.md | sed -n '1,180p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba paper/paper_a_impact_statement_v3.md | sed -n '1,120p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	<!--
     2	ARCHIVED. Not part of the IEEE Access submission.
     3	
     4	IEEE Access Regular Papers do not include a separate Impact Statement
     5	section. The text below is retained for possible reuse in a cover
     6	letter, grant report, or non-IEEE venue. It is excluded from the
     7	assembled paper by export_v3.py.
     8	
     9	If reused, note that the wording "distinguishes genuinely hand-signed
    10	signatures from reproduced ones" overstates what a five-way confidence
    11	classifier without a fully labeled test set establishes; soften before
    12	external use.
    13	-->
    14	
    15	# Impact Statement (archived; not in IEEE Access submission)
    16	
    17	Auditor signatures on financial reports are a key safeguard of corporate accountability.
    18	When the signature on an audit report is produced by reproducing a stored image instead of by the partner's own hand---whether through an administrative stamping workflow or a firm-level electronic signing system---this safeguard is weakened, yet detecting the practice through manual inspection is infeasible at the scale of modern financial markets.
    19	We developed a pipeline that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan.
    20	Combining deep-learning visual features with perceptual hashing and two methodologically distinct threshold estimators (plus a density-smoothness diagnostic), the system stratifies signatures into a five-way confidence-graded classification and quantifies how the practice varies across firms and over time.
    21	After further validation, the technology could support financial regulators in screening signature authenticity at national scale.

 succeeded in 0ms:
     1	# References
     2	
     3	<!-- IEEE numbered style, sequential by first appearance in text. v3 adds statistical-method refs (37–41). -->
     4	
     5	[1] Taiwan Certified Public Accountant Act (會計師法), Art. 4; FSC Attestation Regulations (查核簽證核准準則), Art. 6. Available: https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=G0400067
     6	
     7	[2] S.-H. Yen, Y.-S. Chang, and H.-L. Chen, "Does the signature of a CPA matter? Evidence from Taiwan," *Res. Account. Regul.*, vol. 25, no. 2, pp. 230–235, 2013.
     8	
     9	[3] J. Bromley et al., "Signature verification using a Siamese time delay neural network," in *Proc. NeurIPS*, 1993.
    10	
    11	[4] S. Dey et al., "SigNet: Convolutional Siamese network for writer independent offline signature verification," arXiv:1707.02131, 2017.
    12	
    13	[5] I. Hadjadj et al., "An offline signature verification method based on a single known sample and an explainable deep learning approach," *Appl. Sci.*, vol. 10, no. 11, p. 3716, 2020.
    14	
    15	[6] H. Li et al., "TransOSV: Offline signature verification with transformers," *Pattern Recognit.*, vol. 145, p. 109882, 2024.
    16	
    17	[7] S. Tehsin et al., "Enhancing signature verification using triplet Siamese similarity networks in digital documents," *Mathematics*, vol. 12, no. 17, p. 2757, 2024.
    18	
    19	[8] P. Brimoh and C. C. Olisah, "Consensus-threshold criterion for offline signature verification using CNN learned representations," arXiv:2401.03085, 2024.
    20	
    21	[9] N. Woodruff et al., "Fully-automatic pipeline for document signature analysis to detect money laundering activities," arXiv:2107.14091, 2021.
    22	
    23	[10] S. Abramova and R. Böhme, "Detecting copy-move forgeries in scanned text documents," in *Proc. Electronic Imaging*, 2016.
    24	
    25	[11] Y. Li et al., "Copy-move forgery detection in digital image forensics: A survey," *Multimedia Tools Appl.*, 2024.
    26	
    27	[12] Y. Jakhar and M. D. Borah, "Effective near-duplicate image detection using perceptual hashing and deep learning," *Inf. Process. Manage.*, p. 104086, 2025.
    28	
    29	[13] E. Pizzi et al., "A self-supervised descriptor for image copy detection," in *Proc. CVPR*, 2022.
    30	
    31	[14] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," *Pattern Recognit.*, vol. 70, pp. 163–176, 2017.
    32	
    33	[15] E. N. Zois, D. Tsourounis, and D. Kalivas, "Similarity distance learning on SPD manifold for writer independent offline signature verification," *IEEE Trans. Inf. Forensics Security*, vol. 19, pp. 1342–1356, 2024.
    34	
    35	[16] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Meta-learning for fast classifier adaptation to new users of signature verification systems," *IEEE Trans. Inf. Forensics Security*, vol. 15, pp. 1735–1745, 2019.
    36	
    37	[17] H. Farid, "Image forgery detection," *IEEE Signal Process. Mag.*, vol. 26, no. 2, pp. 16–25, 2009.
    38	
    39	[18] F. Z. Mehrjardi, A. M. Latif, M. S. Zarchi, and R. Sheikhpour, "A survey on deep learning-based image forgery detection," *Pattern Recognit.*, vol. 144, art. no. 109778, 2023.
    40	
    41	[19] J. Luo et al., "A survey of perceptual hashing for multimedia," *ACM Trans. Multimedia Comput. Commun. Appl.*, vol. 21, no. 7, 2025.
    42	
    43	[20] D. Engin et al., "Offline signature verification on real-world documents," in *Proc. CVPRW*, 2020.
    44	
    45	[21] D. Tsourounis et al., "From text to signatures: Knowledge transfer for efficient deep feature learning in offline signature verification," *Expert Syst. Appl.*, 2022.
    46	
    47	[22] B. Chamakh and O. Bounouh, "A unified ResNet18-based approach for offline signature classification and verification," *Procedia Comput. Sci.*, vol. 270, 2025.
    48	
    49	[23] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in *Proc. ECCV*, 2014, pp. 584–599.
    50	
    51	[24] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv:2502.13923, 2025. [Online]. Available: https://arxiv.org/abs/2502.13923
    52	
    53	[25] Ultralytics, "YOLOv11 documentation," 2024. [Online]. Available: https://docs.ultralytics.com/
    54	
    55	[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. CVPR*, 2016.
    56	
    57	[27] N. Krawetz, "Kind of like that," The Hacker Factor Blog, 2013. [Online]. Available: https://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
    58	
    59	[28] B. W. Silverman, *Density Estimation for Statistics and Data Analysis*. London: Chapman & Hall, 1986.
    60	
    61	[29] J. Cohen, *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.
    62	
    63	[30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," *IEEE Trans. Image Process.*, vol. 13, no. 4, pp. 600–612, 2004.
    64	
    65	[31] J. V. Carcello and C. Li, "Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom," *The Accounting Review*, vol. 88, no. 5, pp. 1511–1546, 2013.
    66	
    67	[32] A. D. Blay, M. Notbohm, C. Schelleman, and A. Valencia, "Audit quality effects of an individual audit engagement partner signature mandate," *Int. J. Auditing*, vol. 18, no. 3, pp. 172–192, 2014.
    68	
    69	[33] W. Chi, H. Huang, Y. Liao, and H. Xie, "Mandatory audit partner rotation, audit quality, and market perception: Evidence from Taiwan," *Contemp. Account. Res.*, vol. 26, no. 2, pp. 359–391, 2009.
    70	
    71	[34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in *Proc. CVPR*, 2016, pp. 779–788.
    72	
    73	[35] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 46, no. 8, pp. 5625–5644, 2024.
    74	
    75	[36] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," *Ann. Math. Statist.*, vol. 18, no. 1, pp. 50–60, 1947.
    76	
    77	[37] J. A. Hartigan and P. M. Hartigan, "The dip test of unimodality," *Ann. Statist.*, vol. 13, no. 1, pp. 70–84, 1985.
    78	
    79	[38] D. Burgstahler and I. Dichev, "Earnings management to avoid earnings decreases and losses," *J. Account. Econ.*, vol. 24, no. 1, pp. 99–126, 1997.
    80	
    81	[39] J. McCrary, "Manipulation of the running variable in the regression discontinuity design: A density test," *J. Econometrics*, vol. 142, no. 2, pp. 698–714, 2008.
    82	
    83	[40] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," *J. R. Statist. Soc. B*, vol. 39, no. 1, pp. 1–38, 1977.
    84	
    85	[41] H. White, "Maximum likelihood estimation of misspecified models," *Econometrica*, vol. 50, no. 1, pp. 1–25, 1982.
    86	
    87	<!-- Total: 41 references (v2: 36 + 5 new statistical methods refs) -->

 succeeded in 0ms:
     1	# II. Related Work
     2	
     3	## A. Offline Signature Verification
     4	
     5	Offline signature verification---determining whether a static signature image is genuine or forged---has been studied extensively using deep learning.
     6	Bromley et al. [3] introduced the Siamese neural network architecture for signature verification, establishing the pairwise comparison paradigm that remains dominant.
     7	Hafemann et al. [14] demonstrated that deep CNN features learned from signature images provide strong discriminative representations for writer-independent verification, establishing the foundational baseline for subsequent work.
     8	Dey et al. [4] proposed SigNet, a convolutional Siamese network for writer-independent offline verification, extending this paradigm to generalize across signers without per-writer retraining.
     9	Hadjadj et al. [5] addressed the practical constraint of limited reference samples, achieving competitive verification accuracy using only a single known genuine signature per writer.
    10	More recently, Li et al. [6] introduced TransOSV, the first Vision Transformer-based approach, achieving state-of-the-art results.
    11	Tehsin et al. [7] evaluated distance metrics for triplet Siamese networks, finding that Manhattan distance outperformed cosine and Euclidean alternatives.
    12	Zois et al. [15] proposed similarity distance learning on SPD manifolds for writer-independent verification, achieving robust cross-dataset transfer.
    13	Hafemann et al. [16] further addressed the practical challenge of adapting to new users through meta-learning, reducing the enrollment burden for signature verification systems.
    14	
    15	A common thread in this literature is the assumption that the primary threat is *identity fraud*: a forger attempting to produce a convincing imitation of another person's signature.
    16	Our work addresses a fundamentally different problem---detecting whether the *legitimate signer's* stored signature image has been reproduced across many documents---which requires analyzing the upper tail of the intra-signer similarity distribution rather than modeling inter-signer discriminability.
    17	
    18	Brimoh and Olisah [8] proposed a consensus-threshold approach that derives classification boundaries from known genuine reference pairs, the methodology most closely related to our calibration strategy.
    19	However, their method operates on standard verification benchmarks with laboratory-collected signatures, whereas our approach applies threshold calibration using a replication-dominated subpopulation identified through domain expertise in real-world regulatory documents.
    20	
    21	## B. Document Forensics and Copy Detection
    22	
    23	Image forensics encompasses a broad range of techniques for detecting manipulated visual content [17], with recent surveys highlighting the growing role of deep learning in forgery detection [18].
    24	Copy-move forgery detection (CMFD) identifies duplicated regions within or across images, typically targeting manipulated photographs [11].
    25	Abramova and Böhme [10] adapted block-based CMFD to scanned text documents, noting that standard methods perform poorly in this domain because legitimate character repetitions produce high similarity scores that confound duplicate detection.
    26	
    27	Woodruff et al. [9] developed the work most closely related to ours: a fully automated pipeline for extracting and analyzing signatures from corporate filings in the context of anti-money-laundering investigations.
    28	Their system uses connected component analysis for signature detection, GANs for noise removal, and Siamese networks for author clustering.
    29	While their pipeline shares our goal of large-scale automated signature analysis on real regulatory documents, their objective---grouping signatures by authorship---differs fundamentally from ours, which is detecting image-level reproduction within a single author's signatures across documents.
    30	
    31	In the domain of image copy detection, Pizzi et al. [13] proposed SSCD, a self-supervised descriptor using ResNet-50 with contrastive learning for large-scale copy detection on natural images.
    32	Their work demonstrates that pre-trained CNN features with cosine similarity provide a strong baseline for identifying near-duplicate images, a finding that supports our feature-extraction approach.
    33	
    34	## C. Perceptual Hashing
    35	
    36	Perceptual hashing algorithms generate compact fingerprints that are robust to minor image transformations while remaining sensitive to substantive content changes [19].
    37	Unlike cryptographic hashes, which change entirely with any pixel modification, perceptual hashes produce similar outputs for visually similar inputs, making them suitable for near-duplicate detection in scanned documents where minor variations arise from the scanning process.
    38	
    39	Jakhar and Borah [12] demonstrated that combining perceptual hashing with deep learning features significantly outperforms either approach alone for near-duplicate image detection, achieving AUROC of 0.99 on standard benchmarks.
    40	Their two-stage architecture---pHash for fast structural comparison followed by deep features for semantic verification---provides methodological precedent for our dual-descriptor approach, though applied to natural images rather than document signatures.
    41	
    42	Our work differs from prior perceptual-hashing studies in its application context and in the specific challenge it addresses: distinguishing legitimate high visual consistency (a careful signer producing similar-looking signatures) from image-level reproduction in scanned financial documents.
    43	
    44	## D. Deep Feature Extraction for Signature Analysis
    45	
    46	Several studies have explored pre-trained CNN features for signature comparison without metric learning or Siamese architectures.
    47	Engin et al. [20] used ResNet-50 features with cosine similarity for offline signature verification on real-world scanned documents, incorporating CycleGAN-based stamp removal as preprocessing---a pipeline design closely paralleling our approach.
    48	Tsourounis et al. [21] demonstrated successful transfer from handwritten text recognition to signature verification, showing that CNN features trained on related but distinct handwriting tasks generalize effectively to signature comparison.
    49	Chamakh and Bounouh [22] confirmed that a simple ResNet backbone with cosine similarity achieves competitive verification accuracy across multilingual signature datasets without fine-tuning, supporting the viability of our off-the-shelf feature-extraction approach.
    50	
    51	Babenko et al. [23] established that CNN-extracted neural codes with cosine similarity provide an effective framework for image retrieval and matching, a finding that underpins our feature-comparison approach.
    52	These findings collectively suggest that pre-trained CNN features, when L2-normalized and compared via cosine similarity, provide a robust and computationally efficient representation for signature comparison---particularly suitable for large-scale applications where the computational overhead of Siamese training or metric learning is impractical.
    53	
    54	## E. Statistical Methods for Threshold Determination
    55	
    56	Our threshold-determination framework combines three families of methods developed in statistics and accounting-econometrics.
    57	
    58	*Non-parametric density estimation.*
    59	Kernel density estimation [28] provides a smooth estimate of a similarity distribution without parametric assumptions.
    60	Where the distribution is bimodal, the local density minimum (antimode) between the two modes is the Bayes-optimal decision boundary under equal priors.
    61	The statistical validity of the unimodality-vs-multimodality dichotomy can be tested via the Hartigan & Hartigan dip test [37], which tests the null of unimodality; we use rejection of this null as evidence consistent with (though not a direct test for) bimodality.
    62	
    63	*Discontinuity tests on empirical distributions.*
    64	Burgstahler and Dichev [38], working in the accounting-disclosure literature, proposed a test for smoothness violations in empirical frequency distributions.
    65	Under the null that the distribution is generated by a single smooth process, the expected count in any histogram bin equals the average of its two neighbours, and the standardized deviation from this expectation is approximately $N(0,1)$.
    66	The test was placed on rigorous asymptotic footing by McCrary [39], whose density-discontinuity test provides full asymptotic distribution theory, bandwidth-selection rules, and power analysis.
    67	The BD/McCrary pairing is well suited to detecting the boundary between two generative mechanisms (non-hand-signed vs. hand-signed) under minimal distributional assumptions.
    68	
    69	*Finite mixture models.*
    70	When the empirical distribution is viewed as a weighted sum of two (or more) latent component distributions, the Expectation-Maximization algorithm [40] provides consistent maximum-likelihood estimates of the component parameters.
    71	For observations bounded on $[0,1]$---such as cosine similarity and normalized Hamming-based dHash similarity---the Beta distribution is the natural parametric choice, with applications spanning bioinformatics and Bayesian estimation.
    72	Under mild regularity conditions, White's quasi-MLE result [41] supports interpreting maximum-likelihood estimates under a mis-specified parametric family as consistent estimators of the pseudo-true parameter that minimizes the Kullback-Leibler divergence to the data-generating distribution within that family; we use this result to justify the Beta-mixture fit as a principled approximation rather than as a guarantee that the true distribution is Beta.
    73	
    74	The present study combines all three families, using each to produce an independent threshold estimate and treating cross-method convergence---or principled divergence---as evidence of where in the analysis hierarchy the mixture structure is statistically supported.
    75	<!--
    76	REFERENCES for Related Work (see paper_a_references_v3.md for full list):
    77	[3]  Bromley et al. 1993 — Siamese TDNN (NeurIPS)
    78	[4]  Dey et al. 2017 — SigNet
    79	[5]  Hadjadj et al. 2020 — Single sample SV
    80	[6]  Li et al. 2024 — TransOSV
    81	[7]  Tehsin et al. 2024 — Triplet Siamese
    82	[8]  Brimoh & Olisah 2024 — Consensus threshold
    83	[9]  Woodruff et al. 2021 — AML signature pipeline
    84	[10] Abramova & Böhme 2016 — CMFD in scanned docs
    85	[11] Copy-move forgery detection survey — MTAP 2024
    86	[12] Jakhar & Borah 2025 — pHash + DL
    87	[13] Pizzi et al. 2022 — SSCD
    88	[14] Hafemann et al. 2017 — CNN features for SV
    89	[15] Zois et al. 2024 — SPD manifold SV
    90	[16] Hafemann et al. 2019 — Meta-learning for SV
    91	[17] Farid 2009 — Image forgery detection survey
    92	[18] Mehrjardi et al. 2023 — DL-based image forgery detection survey
    93	[19] Luo et al. 2025 — Perceptual hashing survey
    94	[20] Engin et al. 2020 — ResNet + cosine on real docs
    95	[21] Tsourounis et al. 2022 — Transfer from text to signatures
    96	[22] Chamakh & Bounouh 2025 — ResNet18 unified SV
    97	[23] Babenko et al. 2014 — Neural codes for image retrieval
    98	[28] Silverman 1986 — Density estimation
    99	[37] Hartigan & Hartigan 1985 — dip test of unimodality
   100	[38] Burgstahler & Dichev 1997 — earnings management discontinuity
   101	[39] McCrary 2008 — density discontinuity test
   102	[40] Dempster, Laird & Rubin 1977 — EM algorithm
   103	[41] White 1982 — quasi-MLE consistency
   104	-->

exec
/bin/zsh -lc "python - <<'PY'
from fractions import Fraction
vals=[('Table IX dual',54370,60448),('Table IX cos0.95',55922,60448),('Table IX dh8',57527,60448),('Table XI dual calib',40335,45116),('Table XI dual held',14035,15332),('Table XVI FirmA agree',(26435+734+4),30222),('Table XVII hc',29529,84386),('Table XVII mc',36994,84386),('Table XVII style',5133,84386),('Table XVII uncertain',12683,84386),('Table XVII hand',47,84386)]
for name,a,b in vals:
    print(name, a/b*100)
PY" in /Volumes/NV2/pdf_recognize
 exited 127 in 0ms:
zsh:1: command not found: python

exec
/bin/zsh -lc "rg -n \"circular|EER|FRR|double-blind|double blind|breaks calibration|cosine-conditional|conditional dHash|not calibrated|affirmatively establish smoothness|proof of smoothness|worst-case aggregation|168,740|168,755|54,370|54,373|89\\.95%|91\\.14%|0\\.945|0\\.95 AND\" paper/paper_a_*_v3.md paper/export_v3.py" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
paper/paper_a_results_v3.md:23:| CPA-matched signatures | 168,755 (92.6%) |
paper/paper_a_results_v3.md:66:| All-CPA cosine (max-sim) | 168,740 | 0.0035 | <0.001 | Multimodal |
paper/paper_a_results_v3.md:67:| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
paper/paper_a_results_v3.md:131:For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.
paper/paper_a_results_v3.md:146:| Accountant-level, 2D-GMM 2-comp marginal crossing (secondary)             | 0.945             | 8.10              |
paper/paper_a_results_v3.md:153:This is the accountant-level convergence we rely on for the primary threshold interpretation; the two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) differ because they reflect joint (cosine, dHash) covariance structure, and we report them as a secondary cross-check.
paper/paper_a_results_v3.md:166:| cosine > 0.945 (2D GMM marginal crossing)             | 94.02% | 56,836 / 60,448 |
paper/paper_a_results_v3.md:172:| cosine > 0.95 AND dHash_indep ≤ 8 (operational dual)  | 89.95% | 54,370 / 60,448 |
paper/paper_a_results_v3.md:178:The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with both the accountant-level crossings (Section IV-E) and the 139/32 high-replication versus middle-band split within Firm A (Section IV-E).
paper/paper_a_results_v3.md:192:We do not report an Equal Error Rate: EER is meaningful only when the positive and negative error-rate curves cross in a nontrivial interior region, but byte-identical positives all sit at cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$. An EER calculation against this anchor would be arithmetic tautology rather than biometric performance, and we therefore omit it.
paper/paper_a_results_v3.md:199:| 0.945 (2D GMM marginal)          | 0.0008 | [0.0006, 0.0011] |
paper/paper_a_results_v3.md:204:Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
paper/paper_a_results_v3.md:209:A would-be FRR computed against this subset is definitionally $0$ at every threshold below $1$ (since byte-identical pairs have cosine $\approx 1$), so such an FRR is a mathematical boundary check rather than an empirical miss-rate estimate; we discuss the generalization limits of this conservative-subset framing in Section V-F.
paper/paper_a_results_v3.md:210:Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X are post-hoc-fit-free evaluations of thresholds that were not chosen to optimize Table X.
paper/paper_a_results_v3.md:225:| cosine > 0.945 (2D GMM marginal)     | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54  | <0.001     | 42,305/45,116 | 14,531/15,332 |
paper/paper_a_results_v3.md:231:| cosine > 0.95 AND dHash_indep ≤ 8    | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60  | <0.001     | 40,335/45,116 | 14,035/15,332 |
paper/paper_a_results_v3.md:245:### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$
paper/paper_a_results_v3.md:248:The accountant-level convergent threshold analysis (Section IV-E) places the primary accountant-level reference between $0.973$ and $0.979$ (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing), and the accountant-level 2D-GMM marginal at $0.945$.
paper/paper_a_results_v3.md:250:We report a sensitivity check in which the classifier's operational cut cos $> 0.95$ is replaced by the nearest accountant-level reference, cos $> 0.945$.
paper/paper_a_results_v3.md:252:<!-- TABLE XII: Classifier Sensitivity to the Operational Cosine Cut (All-Sample Five-Way Output, N = 168,740 signatures)
paper/paper_a_results_v3.md:253:| Category                                   | cos > 0.95 count (%) | cos > 0.945 count (%) | Δ count |
paper/paper_a_results_v3.md:262:At the aggregate firm-level, the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A under the 0.95 cut and 91.14% under the 0.945 cut---a shift of 1.19 percentage points.
paper/paper_a_results_v3.md:263:At the per-signature categorization level, replacing 0.95 by 0.945 reclassifies 8,508 signatures (5.04% of the corpus) out of the Uncertain band; 6,095 of them migrate to Moderate-confidence non-hand-signed, 2,294 to High-confidence non-hand-signed, and 119 to High style consistency.
paper/paper_a_results_v3.md:268:The paper therefore retains cos $> 0.95$ as the primary operational cut for transparency and reports the 0.945 results as a sensitivity check rather than as a deployed alternative; a future deployment requiring tighter accountant-level alignment could substitute cos $> 0.945$ without altering the substantive firm-level conclusions.
paper/paper_a_results_v3.md:379:We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-L: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
paper/paper_a_results_v3.md:391:Per the worst-case aggregation rule of Section III-L, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
paper/paper_a_methodology_v3.md:76:Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
paper/paper_a_methodology_v3.md:128:A direct empirical check of the within-auditor-year assumption at the same-CPA level would require labeling multiple reports of the same CPA in the same year and is left to future work; in this paper we maintain the assumption as an identification convention motivated by industry practice and bounded by the worst-case aggregation rule of Section III-L.
paper/paper_a_methodology_v3.md:248:We do not report an Equal Error Rate or FRR column against the byte-identical positive anchor, because byte-identical pairs have cosine $\approx 1$ by construction and any FRR computed against that subset is trivially $0$ at every threshold below $1$; the conservative-subset role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
paper/paper_a_methodology_v3.md:256:Because the two analyses are at different units (signature vs accountant) we treat them as complementary rather than substitutable: the accountant-level convergence band cos $\in [0.945, 0.979]$ anchors the signature-level operational cut cos $> 0.95$ used below, and Section IV-G.3 reports a sensitivity analysis in which cos $> 0.95$ is replaced by the accountant-level 2D-GMM marginal crossing cos $> 0.945$.
paper/paper_a_methodology_v3.md:279:Third, the three accountant-level 1D estimators (KDE antimode $0.973$, Beta-2 crossing $0.979$, logit-GMM-2 crossing $0.976$) and the accountant-level 2D GMM marginal ($0.945$) are *not* the operational thresholds of this classifier: they are the *convergent external reference* that supports the choice of signature-level operational cut.
paper/paper_a_methodology_v3.md:280:Section IV-G.3 reports the classifier's five-way output under the nearby operational cut cos $> 0.945$ as a sensitivity check; the aggregate firm-level capture rates change by at most $\approx 1.2$ percentage points (e.g., the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A versus 91.14% at cos $> 0.945$), and category-level shifts are concentrated at the Uncertain/Moderate-confidence boundary.
paper/paper_a_discussion_v3.md:29:We caveat this interpretation appropriately in Section V-G: the BD null alone cannot affirmatively establish smoothness---only fail to falsify it---and our substantive claim of smoothly-mixed clustering rests on the joint weight of the GMM fit, the dip test, and the BD null rather than on the BD null alone.
paper/paper_a_conclusion_v3.md:16:The Burgstahler-Dichev / McCrary test, by contrast, finds no significant transition at the accountant level; at $N = 686$ accountants the test has limited power and cannot affirmatively establish smoothness, but its non-transition is consistent with the smoothly-mixed cluster boundaries implied by the accountant-level GMM.
paper/paper_a_introduction_v3.md:61:The two-dimensional GMM marginal crossings (cosine $= 0.945$, dHash $= 8.10$) are reported as a complementary cross-check rather than as the primary accountant-level threshold.
paper/paper_a_appendix_v3.md:16:| Full-sample cosine (sig-level)   | 168,740 | 0.003  | 0.9870 | -3.21  | +8.17   |
paper/paper_a_appendix_v3.md:17:| Full-sample cosine (sig-level)   | 168,740 | 0.005  | 0.9850 | -8.80  | +14.32  |
paper/paper_a_appendix_v3.md:18:| Full-sample cosine (sig-level)   | 168,740 | 0.010  | 0.9800 | -29.69 | +44.91  |
paper/paper_a_appendix_v3.md:19:| Full-sample cosine (sig-level)   | 168,740 | 0.015  | 0.9450 | -11.35 | +14.85  |
paper/paper_a_appendix_v3.md:20:| Full-sample dHash_indep (sig-l.) | 168,740 | 1      | 2.0    | -6.22  | +4.89   |
paper/paper_a_appendix_v3.md:21:| Full-sample dHash_indep (sig-l.) | 168,740 | 2      | 10.0   | -7.35  | +3.83   |
paper/paper_a_appendix_v3.md:22:| Full-sample dHash_indep (sig-l.) | 168,740 | 3      | 9.0    | -11.05 | +45.39  |

exec
/bin/zsh -lc "python - <<'PY'
# Verify Table XII deltas and percentages
vals_95=[76984,43906,546,46768,536]
vals_945=[79278,50001,665,38260,536]
N=168740
for a,b in zip(vals_95,vals_945):
    print(a, round(a/N*100,2), b, round(b/N*100,2), b-a)
print('sum95',sum(vals_95),'sum945',sum(vals_945))
PY" in /Volumes/NV2/pdf_recognize
 exited 127 in 0ms:
zsh:1: command not found: python

codex
I’ve cleared the obvious rewrite traps: no stray circularity language, no lingering FRR/EER presentation, and the worst-case aggregation safeguard is now wired into the main text. I’m doing one last arithmetic pass now to make sure the revised percentages and count deltas actually line up.
exec
/bin/zsh -lc "python3 - <<'PY'
vals=[('Table IX dual',54370,60448),('Table IX cos0.95',55922,60448),('Table IX dh8',57527,60448),('Table XI dual calib',40335,45116),('Table XI dual held',14035,15332),('Table XVI FirmA agree',(26435+734+4),30222),('Table XVII hc',29529,84386),('Table XVII mc',36994,84386),('Table XVII style',5133,84386),('Table XVII uncertain',12683,84386),('Table XVII hand',47,84386)]
for name,a,b in vals:
    print(f\"{name}: {a}/{b} = {a/b*100:.6f}%\")
PY" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
Table IX dual: 54370/60448 = 89.945077%
Table IX cos0.95: 55922/60448 = 92.512573%
Table IX dh8: 57527/60448 = 95.167747%
Table XI dual calib: 40335/45116 = 89.402873%
Table XI dual held: 14035/15332 = 91.540569%
Table XVI FirmA agree: 27173/30222 = 89.911323%
Table XVII hc: 29529/84386 = 34.992771%
Table XVII mc: 36994/84386 = 43.839025%
Table XVII style: 5133/84386 = 6.082763%
Table XVII uncertain: 12683/84386 = 15.029744%
Table XVII hand: 47/84386 = 0.055696%

exec
/bin/zsh -lc "python3 - <<'PY'
vals_95=[76984,43906,546,46768,536]
vals_945=[79278,50001,665,38260,536]
N=168740
labels=['HC','MC','Style','Uncertain','Hand']
for lab,a,b in zip(labels,vals_95,vals_945):
    print(f\"{lab}: 0.95 {a/N*100:.4f}% | 0.945 {b/N*100:.4f}% | delta {b-a}\")
print('sum95',sum(vals_95),'sum945',sum(vals_945),'delta sum',sum(b-a for a,b in zip(vals_95,vals_945)))
print('dual 0.945 capture', 55109/60448*100)
PY" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "python3 - <<'PY'
# two-proportion z for selected Table XI rows
from math import sqrt, erf
rows=[('cos>.837',45087,45116,15321,15332),('dual',40335,45116,14035,15332),('dh<=5',37430,45116,13467,15332)]
for name,k1,n1,k2,n2 in rows:
    p1=k1/n1;p2=k2/n2;p=(k1+k2)/(n1+n2)
    z=(p1-p2)/sqrt(p*(1-p)*(1/n1+1/n2))
    pval=2*(1-0.5*(1+erf(abs(z)/sqrt(2))))
    print(name, z, pval)
PY" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
HC: 0.95 45.6229% | 0.945 46.9823% | delta 2294
MC: 0.95 26.0199% | 0.945 29.6320% | delta 6095
Style: 0.95 0.3236% | 0.945 0.3941% | delta 119
Uncertain: 0.95 27.7160% | 0.945 22.6739% | delta -8508
Hand: 0.95 0.3176% | 0.945 0.3176% | delta 0
sum95 168740 sum945 168740 delta sum 0
dual 0.945 capture 91.16761514028586

 succeeded in 0ms:
cos>.837 0.31060078803025654 0.7561041280401521
dual -7.6039915942989875 2.864375403532904e-14
dh<=5 -14.288642299806297 0.0

web search: 
web search: 
exec
/bin/zsh -lc "ls -R /Volumes/NV2/PDF-Processing/signature-analysis/reports 2>/dev/null | sed -n '1,240p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
accountant_mixture
accountant_similarity_analysis.json
accountant_similarity_analysis.md
accountant_three_methods
bd_mccrary
bd_sensitivity
beta_mixture
column_definitions_and_methodology.txt
complete_pdf_report.csv
deloitte_distribution
dip_test
expanded_validation
extraction_log.txt
extraction_methodology.md
figures
formal_statistical_data.json
formal_statistical_report.md
high_similarity_pairs.json
high_similarity_report.html
intra_report
name_extraction_report.json
name_extraction_report.md
partner_ranking
pdf_level_complete_report.csv
pdf_level_complete_report.xlsx
pdf_signature_verdict_report.md
pdf_signature_verdicts.csv
pdf_signature_verdicts.json
pixel_validation
signature_cleanup_report.json
signature_cleanup_report.md
similarity_distribution.png
similarity_statistics.json
validation_recalibration

/Volumes/NV2/PDF-Processing/signature-analysis/reports/accountant_mixture:
accountant_clusters.csv
accountant_mixture_2d.png
accountant_mixture_marginals.png
accountant_mixture_report.md
accountant_mixture_results.json

/Volumes/NV2/PDF-Processing/signature-analysis/reports/accountant_three_methods:
accountant_cos_mean_panel.png
accountant_dh_mean_panel.png
accountant_three_methods_report.md
accountant_three_methods_results.json

/Volumes/NV2/PDF-Processing/signature-analysis/reports/bd_mccrary:
bd_mccrary_firm_a_cosine.png
bd_mccrary_firm_a_dhash.png
bd_mccrary_full_cosine.png
bd_mccrary_full_dhash.png
bd_mccrary_report.md
bd_mccrary_results.json

/Volumes/NV2/PDF-Processing/signature-analysis/reports/bd_sensitivity:
bd_sensitivity.json
bd_sensitivity.md

/Volumes/NV2/PDF-Processing/signature-analysis/reports/beta_mixture:
beta_mixture_firm_a_cosine.png
beta_mixture_full_cosine.png
beta_mixture_report.md
beta_mixture_results.json

/Volumes/NV2/PDF-Processing/signature-analysis/reports/deloitte_distribution:
firm_a_cosine_distribution.png
firm_a_dhash_distribution.png
firm_a_per_accountant_boxplot.png

/Volumes/NV2/PDF-Processing/signature-analysis/reports/dip_test:
dip_test_report.md
dip_test_results.json

/Volumes/NV2/PDF-Processing/signature-analysis/reports/expanded_validation:
expanded_validation_report.md
expanded_validation_results.json

/Volumes/NV2/PDF-Processing/signature-analysis/reports/figures:
fig1_intra_vs_inter_class.png
fig2_normality_tests.png
fig3_ssim_analysis.png
fig4_per_accountant.png
fig5_threshold_sensitivity.png

/Volumes/NV2/PDF-Processing/signature-analysis/reports/intra_report:
intra_report_disagreements.csv
intra_report_report.md
intra_report_results.json

/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking:
partner_rank_distribution.png
partner_ranking_report.md
partner_ranking_results.json

/Volumes/NV2/PDF-Processing/signature-analysis/reports/pixel_validation:
pixel_validation_report.md
pixel_validation_results.json
roc_cosine.png
roc_dhash.png
sanity_sample.csv

/Volumes/NV2/PDF-Processing/signature-analysis/reports/validation_recalibration:
validation_recalibration.json
validation_recalibration.md

exec
/bin/zsh -lc "rg -n \"91\\.14|55109|0\\.945 cut|cut=0\\.945|54370|54373|whole_firm_a\" -S /Volumes/NV2/pdf_recognize /Volumes/NV2/PDF-Processing/signature-analysis 2>/dev/null | sed -n '1,240p'" in /Volumes/NV2/pdf_recognize
 succeeded in 228ms:
medium,0.849258303642273,0.1950954373022656,517,0.9009730314083204,0.5604595774311444,0.7307163044197325,suspicious,5,10,7.5,suspicious,0.9999530603965967,0.9999499480787131,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11539:201501_4904_AI1.pdf,201501,4904,AI1,林安惠,郭政弘,勤業眾信聯合,2,0.8833,0.9942933320999146,0.9656011462211609,0.9799472391605377,copy,copy,林安惠,low,0.801108717918396,0.08295296260580735,306,郭政弘,medium,0.849258303642273,0.1950954373022656,517,0.9531569280109164,0.7068423235527891,0.8299996257818527,copy,2,4,3.0,suspicious,0.9996173478745699,0.9994765698692267,0,genuine,copy,above_crossover,definite_copy,very_high,2,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11565:201501_4960_AI1.pdf,201501,4960,AI1,林鴻鵬,黃樹傑,勤業眾信聯合,2,0.8785,0.9933323264122009,0.9890925288200378,0.9912124276161194,copy,copy,林鴻鵬,medium,0.8096306920051575,0.2652901543142795,215,黃樹傑,medium,0.8551093935966492,0.238150932887775,190,0.8936624446486968,0.7201063205277467,0.8068843825882217,suspicious,1,6,3.5,suspicious,0.9999634088003384,0.9995508864051179,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11687:201501_5531_AI1.pdf,201501,5531,AI1,王日春,吳麗容,國富浩華聯合,2,0.8791,0.9754370450973511,0.8813892602920532,0.9284131526947021,uncertain,copy,王日春,low,0.7568390965461731,0.005477308294209703,72,吳麗容,low,0.7712957262992859,0.008912655971479501,34,0.6516301004467174,1.956775803494176e-10,0.3258150503211975,genuine,9,16,12.5,genuine,0.9999120001739771,0.9997950230654927,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11696:201501_5608_AI1.pdf,201501,5608,AI1,林淑婉,郭政弘,勤業眾信聯合,2,0.8732,0.9936072826385498,0.990218997001648,0.9919131398200989,copy,copy,林淑婉,medium,0.882147490978241,0.20859827066511188,302,郭政弘,medium,0.849258303642273,0.1950954373022656,517,0.9529751896734562,0.9143055218368165,0.9336403557551363,copy,4,10,7.0,suspicious,0.9997498771497274,0.9996599471737382,0,genuine,copy,above_crossover,definite_copy,very_high,2,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11703:201501_5836_AI1.pdf,201501,5836,AI1,吳美慧,黃樹傑,勤業眾信聯合,2,0.8772,0.9919860363006592,0.9914437532424927,0.9917148947715759,copy,copy,吳美慧,medium,0.8545287251472473,0.23782586573284248,259,黃樹傑,medium,0.8551093935966492,0.238150932887775,190,0.8444292069475674,0.8376962330993345,0.841062720023451,suspicious,4,6,5.0,suspicious,0.999535475684853,0.9989447049761426,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11717:201501_6016_AI1.pdf,201501,6016,AI1,陳俊宏,郭政弘,勤業眾信聯合,2,0.8948,0.9928615093231201,0.9914718270301819,0.992166668176651,copy,copy,陳俊宏,high,0.870003342628479,0.3440753771840897,410,郭政弘,medium,0.849258303642273,0.1950954373022656,517,0.9296562469102436,0.7263560099241049,0.8280061284171742,suspicious,4,6,5.0,suspicious,0.9995691207842679,0.999191992129174,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11739:201501_6124_AI1.pdf,201501,6124,AI1,吳惠蘭,黃海寧,安侯建業聯合,2,0.8994,0.9792090654373169,0.9769542813301086,0.9780816733837128,copy,copy,吳惠蘭,low,0.8452295660972595,0.04004817825956037,82,黃海寧,low,0.8079185485839844,0.012388466426100725,543,0.8293400478501272,0.7749434741062338,0.8021417609781805,suspicious,6,8,7.0,genuine,0.9995945227195235,0.9994123532766245,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11813:201501_6214_AI1.pdf,201501,6214,AI1,許秀明,郭政弘,勤業眾信聯合,2,0.8831,0.9903192520141602,0.9882276654243469,0.9892734587192535,copy,copy,許秀明,medium,0.8785849809646606,0.2202386468952734,416,郭政弘,medium,0.849258303642273,0.1950954373022656,517,0.7973577591425317,0.637921483386859,0.7176396212646954,genuine,3,9,6.0,suspicious,0.999514695275231,0.9995054292900495,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11924:201501_8064_AI1.pdf,201501,8064,AI1,林鴻鵬,黃樹傑,勤業眾信聯合,2,0.8755,0.996562123298645,0.9889345169067383,0.9927483201026917,copy,copy,林鴻鵬,medium,0.8096306920051575,0.2652901543142795,215,黃樹傑,medium,0.8551093935966492,0.238150932887775,190,0.8925449693789542,0.7739683279689674,0.8332566486739608,suspicious,1,1,1.0,suspicious,0.9999443168818747,0.9996152712626158,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:11952:201501_8107_AI1.pdf,201501,8107,AI1,劉子猛,林姿妤,資誠聯合,2,0.8962,0.9693073034286499,0.9570549130439758,0.9631811082363129,copy,copy,劉子猛,medium,0.8596516847610474,0.07994784568806124,903,林姿妤,low,0.8303267955780029,0.016882845524535096,1116,0.6052804400576726,0.5437012059806625,0.5744908230191675,genuine,12,17,14.5,genuine,0.9997313658257408,0.9980057253761088,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:12033:201501_8933_AI1.pdf,201501,8933,AI1,王日春,黃勝義,國富浩華聯合,2,0.8729,0.9597235918045044,0.9399851560592651,0.9498543739318848,copy,uncertain,王日春,low,0.7568390965461731,0.005477308294209703,72,黃勝義,low,0.7726985812187195,0.010101010101010102,100,0.5251685185229421,1.9813801141676967e-10,0.2625842593605401,genuine,11,15,13.0,genuine,0.9999635499084296,0.9999097455774968,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:12089:201502_000960_AI3.pdf,201502,000960,AI3,方燕玲,鍾丹丹,安侯建業聯合,2,0.8833,0.9742055535316467,0.9631528854370117,0.9686792194843292,copy,copy,方燕玲,medium,0.8688790202140808,0.03956043956043956,105,鍾丹丹,low,0.8297760486602783,0.03237889057842152,483,0.6342658941951145,0.6028726180252525,0.6185692561101834,genuine,5,23,14.0,suspicious,0.9997978000557359,0.9996939074776655,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:12144:201502_1402_AI1.pdf,201502,1402,AI1,施景彬,黃樹傑,勤業眾信聯合,2,0.8962,0.9087784290313721,0.8829125165939331,0.8958454728126526,uncertain,uncertain,施景彬,low,0.8224875926971436,0.09865214614634803,423,黃樹傑,medium,0.8551093935966492,0.238150932887775,190,0.48934225844364754,0.45518545526650245,0.472263856855075,genuine,13,17,15.0,genuine,0.9999668882490387,0.9992896942201721,0,genuine,uncertain,above_crossover,uncertain,low,0,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:12276:201502_1733_AI1.pdf,201502,1733,AI1,黃裕峰,黃樹傑,勤業眾信聯合,2,0.8889,0.9780246019363403,0.9758568406105042,0.9769407212734222,copy,copy,黃裕峰,medium,0.8085153698921204,0.1401117298899312,1070,黃樹傑,medium,0.8551093935966492,0.238150932887775,190,0.7247146865445697,0.655196987383598,0.6899558369640839,genuine,3,13,8.0,suspicious,0.9984090367381535,0.998327414377506,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:12311:201502_1905_AI1.pdf,201502,1905,AI1,許秀明,郭政弘,勤業眾信聯合,2,0.8835,0.9904640913009644,0.9887428879737854,0.9896034896373749,copy,copy,許秀明,medium,0.8785849809646606,0.2202386468952734,416,郭政弘,medium,0.849258303642273,0.1950954373022656,517,0.8861018731799083,0.8173625235254222,0.8517321983526652,suspicious,3,7,5.0,suspicious,0.9996328826172007,0.9994266103759022,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:12313:201502_1907_AI1.pdf,201502,1907,AI1,邵志明,郭政弘,勤業眾信聯合,2,0.8785,0.9520202279090881,0.8505759835243225,0.9012981057167053,uncertain,copy,邵志明,medium,0.8712529540061951,0.2161141276854069,597,郭政弘,medium,0.849258303642273,0.1950954373022656,517,0.6645561150406111,0.6630301189822406,0.6637931170114258,genuine,11,21,16.0,genuine,0.9997170092191254,0.7745251119981659,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:12315:201502_2002_AI1.pdf,201502,2002,AI1,郭麗園,郭政弘,勤業眾信聯合,3,0.8790999999999999,0.9755321741104126,0.8793651461601257,0.9274486601352692,copy,uncertain,郭麗園,low,0.7404786944389343,0.01299816749898583,846,郭政弘,medium,0.849258303642273,0.1950954373022656,517,2.389271180060164e-10,2.363857882433285e-10,2.376564531246725e-10,genuine,7,16,11.5,genuine,0.9984443370957733,0.9980954177367056,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:12355:201502_2104_AI1.pdf,201502,2104,AI1,吳美慧,郭政弘,勤業眾信聯合,2,0.8882,0.9932514429092407,0.9881169199943542,0.9906841814517975,copy,copy,吳美慧,medium,0.8545287251472473,0.23782586573284248,259,郭政弘,medium,0.849258303642273,0.1950954373022656,517,0.8958334896020772,0.7894841302846441,0.8426588099433607,suspicious,3,7,5.0,suspicious,0.9994413098628834,0.9992473645681965,0,genuine,copy,above_crossover,likely_copy,medium,1,4
/Volumes/NV2/PDF-Processing/signature-analysis/reports/pdf_level_complete_report.csv:12395:201502_2331_AI1.pdf,201502,2331,AI1,黃秀椿,郭政弘,勤業眾信聯合,2,0.8926,0.9956549406051636,0.9874311685562134,0.9915430545806885,copy,copy,黃秀椿,medium,0.82326340675354,0.14930937538198263,405,郭政弘,medium,0.849258303642273,0.1950954373022656,517,0.9466163050335329,0.8974434495498314,0.9220298772916822,suspicious,1,2,1.5,suspicious,0.9999537763663968,0.9998319112081911,0,genuine,copy,above_crossover,likely_copy,medium,1,4

codex
I found the underlying generated reports under `/Volumes/NV2/PDF-Processing/signature-analysis/reports`, so I’m checking the manuscript against those rather than relying only on the table arithmetic in the paper itself.
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/PDF-Processing/signature-analysis/reports/validation_recalibration/validation_recalibration.md | sed -n '1,120p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	# Validation Recalibration Report
     2	Generated: 2026-04-21 12:16:36
     3	
     4	Addresses codex gpt-5.4 v3.3 round-3 review Blockers 1 and 2.
     5	
     6	## 1. Calibration vs Held-out Firm A Generalization Test
     7	
     8	* Seed 42; 70/30 CPA-level split.
     9	* Calibration fold: 45,116 signatures (124 CPAs).
    10	* Held-out fold: 15,332 signatures (54 CPAs).
    11	
    12	**Reference comparison.** The correct generalization test compares calib-fold vs held-out-fold rates, not whole-sample vs held-out-fold. The whole-sample rate is a weighted average of the two folds and therefore cannot lie inside the held-out CI when the folds differ in rate.
    13	
    14	| Rule | Calib rate (CI) | Held-out rate (CI) | z | p | Held within calib CI? |
    15	|------|-----------------|---------------------|---|---|------------------------|
    16	| `cos>0.8370` | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.311 | 0.7561 | yes |
    17	| `cos>0.9407` | 94.99% [94.79%, 95.19%] | 95.63% [95.29%, 95.94%] | -3.185 | 0.0014 | no |
    18	| `cos>0.9450` | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.541 | 0.0000 | no |
    19	| `cos>0.9500` | 92.14% [91.89%, 92.39%] | 93.61% [93.21%, 93.98%] | -5.966 | 0.0000 | no |
    20	| `dh_indep<=5` | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.289 | 0.0000 | no |
    21	| `dh_indep<=8` | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.446 | 0.0000 | no |
    22	| `dh_indep<=9` | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.073 | 0.0000 | no |
    23	| `dh_indep<=15` | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.314 | 0.7537 | yes |
    24	| `cos>0.950_AND_dh<=8` | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.604 | 0.0000 | no |
    25	| `cos>0.945_AND_dh<=8` | 90.69% [90.41%, 90.95%] | 92.47% [92.04%, 92.87%] | -6.702 | 0.0000 | no |
    26	
    27	## 2. Classifier Sensitivity: cos > 0.95 vs cos > 0.945
    28	
    29	All-sample five-way classifier output (N = 168,740 signatures).
    30	The 0.945 cutoff is the accountant-level 2D GMM marginal crossing; 
    31	the 0.95 cutoff is the whole-sample Firm A P95 heuristic.
    32	
    33	| Category | cos>0.95 count (%) | cos>0.945 count (%) | Δ |
    34	|----------|---------------------|-----------------------|---|
    35	| high_confidence_non_hand_signed | 76,984 (45.62%) | 79,278 (46.98%) | +2,294 |
    36	| moderate_confidence_non_hand_signed | 43,906 (26.02%) | 50,001 (29.63%) | +6,095 |
    37	| high_style_consistency | 546 (0.32%) | 665 (0.39%) | +119 |
    38	| uncertain | 46,768 (27.72%) | 38,260 (22.67%) | -8,508 |
    39	| likely_hand_signed | 536 (0.32%) | 536 (0.32%) | +0 |
    40	| dh_missing | 0 (0.00%) | 0 (0.00%) | +0 |
    41	
    42	### Category transitions (0.95 -> 0.945)
    43	
    44	* `uncertain->moderate_confidence_non_hand_signed`: 6,095
    45	* `uncertain->high_confidence_non_hand_signed`: 2,294
    46	* `uncertain->high_style_consistency`: 119
    47	
    48	## 3. Dual-Rule Capture on Whole Firm A
    49	
    50	* cos > 0.950 AND dh_indep <= 8: 54,370/60,448 (89.95%)
    51	* cos > 0.945 AND dh_indep <= 8: 55,091/60,448 (91.14%)
    52	
    53	## 4. Interpretation
    54	
    55	* The calib-vs-held-out 2-proportion z-test is the correct generalization check.  If `p >= 0.05` the two folds are not statistically distinguishable at 5% level.
    56	* Where the two folds differ significantly, the paper should say the held-out fold happens to be slightly more replication-dominated than the calibration fold (i.e., a sampling-variance effect, not a generalization failure), and still discloses the rates for both folds.
    57	* The sensitivity analysis shows how many signatures flip categories under the accountant-level convergence threshold (0.945) versus the whole-sample heuristic (0.95). Small shifts support the paper's claim that the operational classifier is robust to the threshold choice; larger shifts would require either changing the classifier or reporting results under both cuts.
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md | sed -n '1,140p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/PDF-Processing/signature-analysis/reports/intra_report/intra_report_report.md | sed -n '1,140p'" in /Volumes/NV2/pdf_recognize
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/PDF-Processing/signature-analysis/reports/accountant_mixture/accountant_mixture_report.md | sed -n '1,140p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	# Partner-Level Similarity Ranking Report
     2	Generated: 2026-04-21 01:55:27
     3	
     4	## Method
     5	
     6	* Unit of observation: auditor-year (CPA name, fiscal year) with at least 5 signatures in that year.
     7	* Similarity statistic: mean of max_similarity_to_same_accountant
     8	  across signatures in the auditor-year.
     9	* Auditor-years ranked globally; per-firm share of top-K positions
    10	  reported for the pooled 2013-2023 sample and per fiscal year.
    11	
    12	Total auditor-years analyzed: **4,629**
    13	
    14	## Auditor-year counts by firm
    15	
    16	| Firm | N auditor-years |
    17	|------|-----------------|
    18	| Other / Non-Big-4 | 1408 |
    19	| Deloitte (Firm A) | 1287 |
    20	| KPMG | 861 |
    21	| PwC | 695 |
    22	| EY | 378 |
    23	
    24	## Top-K concentration (pooled 2013-2023)
    25	
    26	| Top-K | N in bucket | Deloitte | KPMG | PwC | EY | Other | Deloitte share |
    27	|-------|-------------|----------|------|-----|-----|-------|----------------|
    28	| Top 10% | 462 | 443 | 2 | 3 | 0 | 14 | **95.9%** |
    29	| Top 25% | 1157 | 1043 | 32 | 23 | 9 | 50 | **90.1%** |
    30	| Top 50% | 2314 | 1220 | 473 | 273 | 102 | 246 | **52.7%** |
    31	
    32	## Per-year Deloitte share of top-10% similarity
    33	
    34	| Year | N auditor-years | Top-10% k | Deloitte in top-10% | Deloitte share | Deloitte baseline share |
    35	|------|-----------------|-----------|---------------------|----------------|-------------------------|
    36	| 2013 | 324 | 32 | 32 | 100.0% | 32.4% |
    37	| 2014 | 399 | 39 | 39 | 100.0% | 27.8% |
    38	| 2015 | 394 | 39 | 38 | 97.4% | 27.7% |
    39	| 2016 | 413 | 41 | 39 | 95.1% | 26.2% |
    40	| 2017 | 415 | 41 | 41 | 100.0% | 27.2% |
    41	| 2018 | 434 | 43 | 43 | 100.0% | 26.5% |
    42	| 2019 | 429 | 42 | 42 | 100.0% | 27.0% |
    43	| 2020 | 430 | 43 | 38 | 88.4% | 27.7% |
    44	| 2021 | 450 | 45 | 44 | 97.8% | 28.7% |
    45	| 2022 | 467 | 46 | 43 | 93.5% | 28.3% |
    46	| 2023 | 474 | 47 | 46 | 97.9% | 27.4% |
    47	
    48	## Interpretation
    49	
    50	If Deloitte Taiwan applies firm-wide stamping, Deloitte auditor-years
    51	should over-represent in the top of the similarity distribution relative
    52	to their baseline share of all auditor-years. The pooled top-10%
    53	Deloitte share divided by the baseline gives a concentration ratio
    54	that is informative about the firm's signing practice without
    55	requiring per-report ground-truth labels.
    56	
    57	Year-by-year stability of this concentration provides evidence about
    58	whether the stamping practice was maintained throughout 2013-2023 or
    59	changed in response to the industry-wide shift to electronic signing
    60	systems around 2020.
 succeeded in 0ms:
     1	# Intra-Report Consistency Report
     2	Generated: 2026-04-21 01:56:36
     3	
     4	## Method
     5	
     6	* 2-signer reports (primary + secondary engagement partner).
     7	* Each signature classified using the dual-descriptor rules of the
     8	  paper (cos > 0.95 AND dHash_indep ≤ 5 = high-confidence replication;
     9	  dHash 6-15 = moderate; > 15 = style consistency; cos ≤ 0.837 = likely
    10	  hand-signed; otherwise uncertain).
    11	* For each report, both signature-level labels are compared.
    12	  A report is "in agreement" if both fall in the same coarse bucket
    13	  (non-hand-signed = high+moderate combined, style_consistency,
    14	  uncertain, or hand-signed); otherwise "mixed".
    15	
    16	Total 2-signer reports analyzed: **84,354**
    17	
    18	## Per-firm agreement
    19	
    20	| Firm | Total | Both non-hand-signed | Both style | Both uncertain | Both hand-signed | Mixed | Agreement rate |
    21	|------|-------|----------------------|------------|----------------|------------------|-------|----------------|
    22	| Deloitte (Firm A) | 30222 | 26435 | 0 | 734 | 4 | 3049 | **89.91%** |
    23	| PwC | 19112 | 8983 | 3 | 3035 | 5 | 7086 | **62.92%** |
    24	| KPMG | 17121 | 9260 | 5 | 2159 | 6 | 5691 | **66.76%** |
    25	| Other / Non-Big-4 | 9140 | 1671 | 18 | 3945 | 27 | 3479 | **61.94%** |
    26	| EY | 8375 | 3028 | 0 | 2376 | 3 | 2968 | **64.56%** |
    27	| EY+PwC | 196 | 88 | 0 | 24 | 0 | 84 | **57.14%** |
    28	| PwC+EY | 183 | 92 | 0 | 25 | 0 | 66 | **63.93%** |
    29	| Other / Non-Big-4+PwC | 1 | 0 | 0 | 0 | 0 | 1 | **0.00%** |
    30	| Other / Non-Big-4+EY | 1 | 0 | 0 | 0 | 0 | 1 | **0.00%** |
    31	| EY+Other / Non-Big-4 | 1 | 0 | 0 | 0 | 0 | 1 | **0.00%** |
    32	| Deloitte (Firm A)+Other / Non-Big-4 | 1 | 0 | 0 | 0 | 0 | 1 | **0.00%** |
    33	| Other / Non-Big-4+Deloitte (Firm A) | 1 | 0 | 0 | 0 | 0 | 1 | **0.00%** |
    34	
    35	## Interpretation
    36	
    37	Under firmwide stamping practice the two engagement partners on a
    38	given report should both exhibit high-confidence non-hand-signed
    39	classifications. High intra-report agreement at Firm A (Deloitte) is
    40	consistent with uniform firm-level stamping; declining agreement at
    41	the other Big-4 firms reflects the interview evidence that stamping
    42	was applied only to a subset of partners.
    43	
    44	Mixed-classification reports (one signer non-hand-signed, the other
    45	hand-signed or style-consistent) are flagged for sensitivity review.
    46	Absent firmwide homogeneity, one would expect substantial mixed-rate
    47	contamination even at Firm A; the observed Firm A mixed rate is a
    48	direct empirical check on the identification assumption used in the
    49	threshold calibration.
 succeeded in 0ms:
     1	# Accountant-Level Gaussian Mixture Report
     2	Generated: 2026-04-20 21:49:02
     3	
     4	## Data
     5	
     6	* Per-accountant aggregates: mean cosine max-similarity, mean independent min dHash.
     7	* Minimum signatures per accountant: 10.
     8	* Accountants included: **686**.
     9	
    10	## Model selection (BIC)
    11	
    12	| K | BIC | AIC | Converged |
    13	|---|-----|-----|-----------|
    14	| 1 | -315.98 | -338.63 | True |
    15	| 2 | -545.02 | -594.86 | True |
    16	| 3 | -792.31 | -869.34 | True ←best |
    17	| 4 | -778.51 | -882.72 | True |
    18	| 5 | -747.42 | -878.82 | True |
    19	
    20	## 3-component fit
    21	
    22	| Component | cos_mean | dh_mean | weight | n_accountants | top firms |
    23	|-----------|----------|---------|--------|----------------|-----------|
    24	| C1 | 0.983 | 2.41 | 0.204 | 141 | 勤業眾信聯合:139, 資誠聯合:1, 建鴻聯合:1 |
    25	| C2 | 0.954 | 6.99 | 0.512 | 361 | 安侯建業聯合:106, 資誠聯合:82, 安永聯合:47, 勤業眾信聯合:32, 國富浩華聯合:18 |
    26	| C3 | 0.928 | 11.17 | 0.284 | 184 | 資誠聯合:19, 國富浩華聯合:14, 正風聯合:10, 信永中和聯合:8, 眾智聯合:8 |
    27	
    28	## 2-component fit (threshold derivation)
    29	
    30	| Component | cos_mean | dh_mean | weight | n_accountants |
    31	|-----------|----------|---------|--------|----------------|
    32	| C1 | 0.964 | 5.31 | 0.578 | 423 |
    33	| C2 | 0.935 | 9.89 | 0.422 | 263 |
    34	
    35	### Natural thresholds from 2-component crossings
    36	
    37	* Cosine: **0.9450**
    38	* dHash:  **8.1007**
    39	
    40	## Interpretation
    41	
    42	The accountant-level mixture separates signing-behaviour regimes,
    43	while the signature-level distribution is a continuous spectrum
    44	(see Scripts 15 and 17). The BIC-best model chooses how many
    45	discrete regimes the data supports. The 2-component crossings
    46	are the natural per-accountant thresholds for classifying a
    47	CPA's signing behaviour.
    48	
    49	## Artifacts
    50	
    51	* `accountant_mixture_2d.png` - 2D scatter with 3-component fit
    52	* `accountant_mixture_marginals.png` - 1D marginals with 2-component fit
    53	* `accountant_clusters.csv` - per-accountant cluster assignments
    54	* `accountant_mixture_results.json` - full numerical results
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/PDF-Processing/signature-analysis/reports/dip_test/dip_test_report.md | sed -n '1,140p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	# Hartigan Dip Test Report
     2	Generated: 2026-04-20 21:49:18
     3	
     4	## Method
     5	
     6	Hartigan & Hartigan (1985) dip test via `diptest` Python package.
     7	H0: distribution is unimodal. H1: multimodal (two or more modes).
     8	p-value computed by bootstrap against a uniform null (2000 reps for
     9	Firm A/accountant-level, 500 reps for full-sample due to size).
    10	
    11	## Results
    12	
    13	| Test | N | dip | p-value | Verdict (α=0.05) |
    14	|------|---|-----|---------|------------------|
    15	| Firm A cosine max-similarity | 60,448 | 0.0019 | 0.1685 | UNIMODAL (accept H0) |
    16	| Firm A independent min dHash | 60,448 | 0.1051 | 0.0000 | MULTIMODAL (reject H0) |
    17	| Full-sample cosine max-similarity | 168,740 | 0.0035 | 0.0000 | MULTIMODAL (reject H0) |
    18	| Full-sample independent min dHash | 168,740 | 0.0468 | 0.0000 | MULTIMODAL (reject H0) |
    19	| Per-accountant cosine mean | 686 | 0.0339 | 0.0000 | MULTIMODAL (reject H0) |
    20	| Per-accountant dHash mean | 686 | 0.0277 | 0.0000 | MULTIMODAL (reject H0) |
    21	
    22	## Interpretation
    23	
    24	* **Signature level** (Firm A + full sample): the dip test indicates
    25	  whether a single mode explains the max-cosine/min-dHash distribution.
    26	  Prior finding (2026-04-16) suggested unimodal long-tail; this script
    27	  provides the formal test.
    28	
    29	* **Accountant level** (per-accountant mean): if multimodal here but
    30	  unimodal at the signature level, this confirms the interpretation
    31	  that signing-behaviour is discrete across accountants (replication
    32	  vs hand-signing), while replication quality itself is a continuous
    33	  spectrum.
    34	
    35	## Downstream implication
    36	
    37	Methods that assume bimodality (KDE antimode, 2-component Beta mixture)
    38	should be applied at the level where dip test rejects H0. If the
    39	signature-level dip test fails to reject, the paper should report this
    40	and shift the mixture analysis to the accountant level (see Script 18).
codex
I found one substantive inconsistency candidate: the per-year Firm A baseline shares in Table XV do not match the generated partner-ranking report. I’m checking the underlying results now to see whether that is a stale report artifact or a manuscript regression.
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_results.json | sed -n '1,260p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
     1	{
     2	  "generated_at": "2026-04-21T01:55:27.012073",
     3	  "min_signatures_per_auditor_year": 5,
     4	  "n_auditor_years": 4629,
     5	  "firm_counts": {
     6	    "EY": 378,
     7	    "Other / Non-Big-4": 1408,
     8	    "PwC": 695,
     9	    "KPMG": 861,
    10	    "Deloitte (Firm A)": 1287
    11	  },
    12	  "pooled_deciles": {
    13	    "top_10pct": {
    14	      "k": 462,
    15	      "N_total": 4629,
    16	      "by_firm": {
    17	        "Other / Non-Big-4": 14,
    18	        "Deloitte (Firm A)": 443,
    19	        "PwC": 3,
    20	        "KPMG": 2
    21	      },
    22	      "deloitte_share": 0.9588744588744589
    23	    },
    24	    "top_25pct": {
    25	      "k": 1157,
    26	      "N_total": 4629,
    27	      "by_firm": {
    28	        "Other / Non-Big-4": 50,
    29	        "Deloitte (Firm A)": 1043,
    30	        "PwC": 23,
    31	        "KPMG": 32,
    32	        "EY": 9
    33	      },
    34	      "deloitte_share": 0.9014693171996543
    35	    },
    36	    "top_50pct": {
    37	      "k": 2314,
    38	      "N_total": 4629,
    39	      "by_firm": {
    40	        "Other / Non-Big-4": 246,
    41	        "Deloitte (Firm A)": 1220,
    42	        "PwC": 273,
    43	        "KPMG": 473,
    44	        "EY": 102
    45	      },
    46	      "deloitte_share": 0.5272255834053586
    47	    }
    48	  },
    49	  "per_year": {
    50	    "2013": {
    51	      "top_10pct": {
    52	        "k": 32,
    53	        "N_total": 324,
    54	        "by_firm": {
    55	          "Deloitte (Firm A)": 32
    56	        },
    57	        "deloitte_share": 1.0
    58	      },
    59	      "top_25pct": {
    60	        "k": 81,
    61	        "N_total": 324,
    62	        "by_firm": {
    63	          "Deloitte (Firm A)": 78,
    64	          "Other / Non-Big-4": 2,
    65	          "PwC": 1
    66	        },
    67	        "deloitte_share": 0.9629629629629629
    68	      },
    69	      "top_50pct": {
    70	        "k": 162,
    71	        "N_total": 324,
    72	        "by_firm": {
    73	          "Deloitte (Firm A)": 97,
    74	          "Other / Non-Big-4": 9,
    75	          "PwC": 20,
    76	          "KPMG": 33,
    77	          "EY": 3
    78	        },
    79	        "deloitte_share": 0.5987654320987654
    80	      }
    81	    },
    82	    "2014": {
    83	      "top_10pct": {
    84	        "k": 39,
    85	        "N_total": 399,
    86	        "by_firm": {
    87	          "Deloitte (Firm A)": 39
    88	        },
    89	        "deloitte_share": 1.0
    90	      },
    91	      "top_25pct": {
    92	        "k": 99,
    93	        "N_total": 399,
    94	        "by_firm": {
    95	          "Deloitte (Firm A)": 90,
    96	          "Other / Non-Big-4": 6,
    97	          "KPMG": 2,
    98	          "PwC": 1
    99	        },
   100	        "deloitte_share": 0.9090909090909091
   101	      },
   102	      "top_50pct": {
   103	        "k": 199,
   104	        "N_total": 399,
   105	        "by_firm": {
   106	          "Deloitte (Firm A)": 105,
   107	          "Other / Non-Big-4": 27,
   108	          "KPMG": 36,
   109	          "PwC": 29,
   110	          "EY": 2
   111	        },
   112	        "deloitte_share": 0.5276381909547738
   113	      }
   114	    },
   115	    "2015": {
   116	      "top_10pct": {
   117	        "k": 39,
   118	        "N_total": 394,
   119	        "by_firm": {
   120	          "Deloitte (Firm A)": 38,
   121	          "Other / Non-Big-4": 1
   122	        },
   123	        "deloitte_share": 0.9743589743589743
   124	      },
   125	      "top_25pct": {
   126	        "k": 98,
   127	        "N_total": 394,
   128	        "by_firm": {
   129	          "Deloitte (Firm A)": 91,
   130	          "Other / Non-Big-4": 5,
   131	          "KPMG": 2
   132	        },
   133	        "deloitte_share": 0.9285714285714286
   134	      },
   135	      "top_50pct": {
   136	        "k": 197,
   137	        "N_total": 394,
   138	        "by_firm": {
   139	          "Deloitte (Firm A)": 107,
   140	          "Other / Non-Big-4": 23,
   141	          "KPMG": 36,
   142	          "EY": 9,
   143	          "PwC": 22
   144	        },
   145	        "deloitte_share": 0.5431472081218274
   146	      }
   147	    },
   148	    "2016": {
   149	      "top_10pct": {
   150	        "k": 41,
   151	        "N_total": 413,
   152	        "by_firm": {
   153	          "Deloitte (Firm A)": 39,
   154	          "Other / Non-Big-4": 2
   155	        },
   156	        "deloitte_share": 0.9512195121951219
   157	      },
   158	      "top_25pct": {
   159	        "k": 103,
   160	        "N_total": 413,
   161	        "by_firm": {
   162	          "Deloitte (Firm A)": 95,
   163	          "Other / Non-Big-4": 6,
   164	          "KPMG": 2
   165	        },
   166	        "deloitte_share": 0.9223300970873787
   167	      },
   168	      "top_50pct": {
   169	        "k": 206,
   170	        "N_total": 413,
   171	        "by_firm": {
   172	          "Deloitte (Firm A)": 105,
   173	          "Other / Non-Big-4": 26,
   174	          "KPMG": 39,
   175	          "EY": 8,
   176	          "PwC": 28
   177	        },
   178	        "deloitte_share": 0.5097087378640777
   179	      }
   180	    },
   181	    "2017": {
   182	      "top_10pct": {
   183	        "k": 41,
   184	        "N_total": 415,
   185	        "by_firm": {
   186	          "Deloitte (Firm A)": 41
   187	        },
   188	        "deloitte_share": 1.0
   189	      },
   190	      "top_25pct": {
   191	        "k": 103,
   192	        "N_total": 415,
   193	        "by_firm": {
   194	          "Deloitte (Firm A)": 96,
   195	          "Other / Non-Big-4": 6,
   196	          "KPMG": 1
   197	        },
   198	        "deloitte_share": 0.9320388349514563
   199	      },
   200	      "top_50pct": {
   201	        "k": 207,
   202	        "N_total": 415,
   203	        "by_firm": {
   204	          "Deloitte (Firm A)": 110,
   205	          "Other / Non-Big-4": 21,
   206	          "KPMG": 39,
   207	          "EY": 11,
   208	          "PwC": 26
   209	        },
   210	        "deloitte_share": 0.5314009661835749
   211	      }
   212	    },
   213	    "2018": {
   214	      "top_10pct": {
   215	        "k": 43,
   216	        "N_total": 434,
   217	        "by_firm": {
   218	          "Deloitte (Firm A)": 43
   219	        },
   220	        "deloitte_share": 1.0
   221	      },
   222	      "top_25pct": {
   223	        "k": 108,
   224	        "N_total": 434,
   225	        "by_firm": {
   226	          "Deloitte (Firm A)": 100,
   227	          "Other / Non-Big-4": 1,
   228	          "KPMG": 4,
   229	          "EY": 2,
   230	          "PwC": 1
   231	        },
   232	        "deloitte_share": 0.9259259259259259
   233	      },
   234	      "top_50pct": {
   235	        "k": 217,
   236	        "N_total": 434,
   237	        "by_firm": {
   238	          "Deloitte (Firm A)": 109,
   239	          "Other / Non-Big-4": 24,
   240	          "KPMG": 49,
   241	          "EY": 10,
   242	          "PwC": 25
   243	        },
   244	        "deloitte_share": 0.5023041474654378
   245	      }
   246	    },
   247	    "2019": {
   248	      "top_10pct": {
   249	        "k": 42,
   250	        "N_total": 429,
   251	        "by_firm": {
   252	          "Deloitte (Firm A)": 42
   253	        },
   254	        "deloitte_share": 1.0
   255	      },
   256	      "top_25pct": {
   257	        "k": 107,
   258	        "N_total": 429,
   259	        "by_firm": {
   260	          "Deloitte (Firm A)": 93,

exec
/bin/zsh -lc "rg -n \"26\\.2%|27\\.1%|27\\.2%|28\\.5%|32\\.4%|2013 \\| 324|2014 \\| 399\" paper/paper_a_results_v3.md /Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md /Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_results.json" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md:36:| 2013 | 324 | 32 | 32 | 100.0% | 32.4% |
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md:37:| 2014 | 399 | 39 | 39 | 100.0% | 27.8% |
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md:39:| 2016 | 413 | 41 | 39 | 95.1% | 26.2% |
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md:40:| 2017 | 415 | 41 | 41 | 100.0% | 27.2% |
paper/paper_a_results_v3.md:331:| 2013 | 324 | 32 | 32 | 100.0% | 26.2% |
paper/paper_a_results_v3.md:332:| 2014 | 399 | 39 | 39 | 100.0% | 27.1% |
paper/paper_a_results_v3.md:333:| 2015 | 394 | 39 | 38 | 97.4% | 27.2% |
paper/paper_a_results_v3.md:340:| 2022 | 467 | 46 | 43 | 93.5% | 28.5% |
paper/paper_a_results_v3.md:341:| 2023 | 474 | 47 | 46 | 97.9% | 28.5% |

exec
/bin/zsh -lc "python3 - <<'PY'
import json
from pathlib import Path
p=Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_results.json')
obj=json.loads(p.read_text())
print(obj.keys())
print('pooled', obj.get('pooled_top_k',{}) if 'pooled_top_k' in obj else 'no pooled_top_k')
# print year section keys
for k in obj.keys():
    if 'year' in k.lower():
        print('year-like key', k)
PY" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
dict_keys(['generated_at', 'min_signatures_per_auditor_year', 'n_auditor_years', 'firm_counts', 'pooled_deciles', 'per_year'])
pooled no pooled_top_k
year-like key min_signatures_per_auditor_year
year-like key n_auditor_years
year-like key per_year

codex
The candidate issue looks real, not just cosmetic: Table XV’s per-year baseline shares appear to have drifted away from the generated report. I’m tracing it to the source calculation so I can decide whether this is a stale-table typo or a deeper mismatch.
exec
/bin/zsh -lc "nl -ba /Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_results.json | sed -n '246,520p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
   246	    },
   247	    "2019": {
   248	      "top_10pct": {
   249	        "k": 42,
   250	        "N_total": 429,
   251	        "by_firm": {
   252	          "Deloitte (Firm A)": 42
   253	        },
   254	        "deloitte_share": 1.0
   255	      },
   256	      "top_25pct": {
   257	        "k": 107,
   258	        "N_total": 429,
   259	        "by_firm": {
   260	          "Deloitte (Firm A)": 93,
   261	          "EY": 3,
   262	          "Other / Non-Big-4": 6,
   263	          "KPMG": 5
   264	        },
   265	        "deloitte_share": 0.8691588785046729
   266	      },
   267	      "top_50pct": {
   268	        "k": 214,
   269	        "N_total": 429,
   270	        "by_firm": {
   271	          "Deloitte (Firm A)": 109,
   272	          "EY": 12,
   273	          "Other / Non-Big-4": 22,
   274	          "KPMG": 52,
   275	          "PwC": 19
   276	        },
   277	        "deloitte_share": 0.5093457943925234
   278	      }
   279	    },
   280	    "2020": {
   281	      "top_10pct": {
   282	        "k": 43,
   283	        "N_total": 430,
   284	        "by_firm": {
   285	          "Other / Non-Big-4": 4,
   286	          "Deloitte (Firm A)": 38,
   287	          "KPMG": 1
   288	        },
   289	        "deloitte_share": 0.8837209302325582
   290	      },
   291	      "top_25pct": {
   292	        "k": 107,
   293	        "N_total": 430,
   294	        "by_firm": {
   295	          "Other / Non-Big-4": 8,
   296	          "Deloitte (Firm A)": 94,
   297	          "KPMG": 5
   298	        },
   299	        "deloitte_share": 0.8785046728971962
   300	      },
   301	      "top_50pct": {
   302	        "k": 215,
   303	        "N_total": 430,
   304	        "by_firm": {
   305	          "Other / Non-Big-4": 23,
   306	          "Deloitte (Firm A)": 108,
   307	          "KPMG": 54,
   308	          "EY": 11,
   309	          "PwC": 19
   310	        },
   311	        "deloitte_share": 0.5023255813953489
   312	      }
   313	    },
   314	    "2021": {
   315	      "top_10pct": {
   316	        "k": 45,
   317	        "N_total": 450,
   318	        "by_firm": {
   319	          "Deloitte (Firm A)": 44,
   320	          "Other / Non-Big-4": 1
   321	        },
   322	        "deloitte_share": 0.9777777777777777
   323	      },
   324	      "top_25pct": {
   325	        "k": 112,
   326	        "N_total": 450,
   327	        "by_firm": {
   328	          "Deloitte (Firm A)": 101,
   329	          "Other / Non-Big-4": 6,
   330	          "KPMG": 4,
   331	          "EY": 1
   332	        },
   333	        "deloitte_share": 0.9017857142857143
   334	      },
   335	      "top_50pct": {
   336	        "k": 225,
   337	        "N_total": 450,
   338	        "by_firm": {
   339	          "Deloitte (Firm A)": 120,
   340	          "Other / Non-Big-4": 23,
   341	          "KPMG": 57,
   342	          "EY": 10,
   343	          "PwC": 15
   344	        },
   345	        "deloitte_share": 0.5333333333333333
   346	      }
   347	    },
   348	    "2022": {
   349	      "top_10pct": {
   350	        "k": 46,
   351	        "N_total": 467,
   352	        "by_firm": {
   353	          "Deloitte (Firm A)": 43,
   354	          "Other / Non-Big-4": 3
   355	        },
   356	        "deloitte_share": 0.9347826086956522
   357	      },
   358	      "top_25pct": {
   359	        "k": 116,
   360	        "N_total": 467,
   361	        "by_firm": {
   362	          "Deloitte (Firm A)": 105,
   363	          "Other / Non-Big-4": 5,
   364	          "PwC": 4,
   365	          "EY": 1,
   366	          "KPMG": 1
   367	        },
   368	        "deloitte_share": 0.9051724137931034
   369	      },
   370	      "top_50pct": {
   371	        "k": 233,
   372	        "N_total": 467,
   373	        "by_firm": {
   374	          "Deloitte (Firm A)": 130,
   375	          "Other / Non-Big-4": 25,
   376	          "PwC": 36,
   377	          "EY": 11,
   378	          "KPMG": 31
   379	        },
   380	        "deloitte_share": 0.5579399141630901
   381	      }
   382	    },
   383	    "2023": {
   384	      "top_10pct": {
   385	        "k": 47,
   386	        "N_total": 474,
   387	        "by_firm": {
   388	          "Deloitte (Firm A)": 46,
   389	          "PwC": 1
   390	        },
   391	        "deloitte_share": 0.9787234042553191
   392	      },
   393	      "top_25pct": {
   394	        "k": 118,
   395	        "N_total": 474,
   396	        "by_firm": {
   397	          "Deloitte (Firm A)": 104,
   398	          "PwC": 9,
   399	          "KPMG": 3,
   400	          "Other / Non-Big-4": 2
   401	        },
   402	        "deloitte_share": 0.8813559322033898
   403	      },
   404	      "top_50pct": {
   405	        "k": 237,
   406	        "N_total": 474,
   407	        "by_firm": {
   408	          "Deloitte (Firm A)": 129,
   409	          "PwC": 43,
   410	          "KPMG": 49,
   411	          "Other / Non-Big-4": 12,
   412	          "EY": 4
   413	        },
   414	        "deloitte_share": 0.5443037974683544
   415	      }
   416	    }
   417	  }
   418	}
exec
/bin/zsh -lc "python3 - <<'PY'
import json
from pathlib import Path
obj=json.loads(Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_results.json').read_text())
for yr,dat in obj['per_year'].items():
    print(yr, dat.keys())
    break
PY" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
2013 dict_keys(['top_10pct', 'top_25pct', 'top_50pct'])

exec
/bin/zsh -lc "rg -n \"partner_ranking|baseline share|Deloitte baseline|top-10%\" /Volumes/NV2/PDF-Processing/signature-analysis -g '"'!**/*.csv'"' -g '"'!**/*.png'"' | sed -n '1,240p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md:32:## Per-year Deloitte share of top-10% similarity
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md:34:| Year | N auditor-years | Top-10% k | Deloitte in top-10% | Deloitte share | Deloitte baseline share |
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md:52:to their baseline share of all auditor-years. The pooled top-10%

exec
/bin/zsh -lc "find /Volumes/NV2/PDF-Processing/signature-analysis -maxdepth 3 \\( -name '*partner*rank*py' -o -name '*ranking*py' -o -name '*auditor*year*py' \\) | sed -n '1,120p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
exec
/bin/zsh -lc "find /Volumes/NV2/PDF-Processing/signature-analysis -maxdepth 3 -type f | rg 'partner_ranking|auditor' | sed -n '1,160p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_results.json
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md
/Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_rank_distribution.png

codex
I’ve got a couple of genuine regression candidates now: one numerical table baseline mismatch and at least one broken section/table cross-reference caused by late renumbering. I’m sweeping the cross-references systematically before I decide whether that keeps this at Accept or pulls it back to Minor Revision.
exec
/bin/zsh -lc "rg -n \"Section IV-[A-Z0-9.]+|Table [IVX]+/[IVX]+|Tables [IVX]+/[IVX]+|Section III-[A-Z0-9.]+|Table [A-Z]?\\.I|Fig\\. [0-9]\" paper/paper_a_*_v3.md | sed -n '1,260p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
paper/paper_a_results_v3.md:29:Fig. 2 presents the cosine similarity distributions computed over the full set of *pairwise comparisons* under two groupings: intra-class (all signature pairs belonging to the same CPA) and inter-class (signature pairs from different CPAs).
paper/paper_a_results_v3.md:30:This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
paper/paper_a_results_v3.md:46:Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived via the three convergent methods of Section III-I to avoid single-family distributional assumptions.
paper/paper_a_results_v3.md:72:Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to the minority of hand-signing Firm A partners identified in the accountant-level mixture (Section IV-E).
paper/paper_a_results_v3.md:74:At the per-accountant aggregate level both cosine and dHash means are strongly multimodal, foreshadowing the mixture structure analyzed in Section IV-E.
paper/paper_a_results_v3.md:77:It predicts that a two-component mixture fit to per-signature cosine will be a forced fit (Section IV-D.2 below), while the same fit at the accountant level will succeed---a prediction borne out in the subsequent analyses.
paper/paper_a_results_v3.md:81:Applying the BD/McCrary procedure (Section III-I.3) to the per-signature cosine distribution yields a nominally significant $Z^- \rightarrow Z^+$ transition at cosine 0.985 for Firm A and 0.985 for the full sample; the min-dHash distributions exhibit a transition at Hamming distance 2 for both Firm A and the full sample under the bin width ($0.005$ / $1$) used here.
paper/paper_a_results_v3.md:98:This motivates the pivot to the accountant-level analysis in Section IV-E, where aggregation over signatures reveals clustered (though not sharply discrete) patterns in individual-level signing *practice* that the signature-level analysis lacks.
paper/paper_a_results_v3.md:115:Table VII reports the three-component composition, and Fig. 4 visualizes the accountant-level clusters in the (cosine-mean, dHash-mean) plane alongside the marginal-density crossings of the two-component fit.
paper/paper_a_results_v3.md:128:This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
paper/paper_a_results_v3.md:130:Third, applying the threshold framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary density-smoothness diagnostic does not produce a significant transition at the accountant level (robust across the bin-width sweep in Appendix A).
paper/paper_a_results_v3.md:154:The signature-level estimates are reported for completeness and as diagnostic evidence of the continuous-spectrum asymmetry (Section IV-D.2) rather than as primary classification boundaries.
paper/paper_a_results_v3.md:158:Fig. 3 presents the per-signature cosine and dHash distributions of Firm A compared to the overall population.
paper/paper_a_results_v3.md:177:Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H.
paper/paper_a_results_v3.md:178:The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with both the accountant-level crossings (Section IV-E) and the 139/32 high-replication versus middle-band split within Firm A (Section IV-E).
paper/paper_a_results_v3.md:179:Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
paper/paper_a_results_v3.md:183:We report three validation analyses corresponding to the anchors of Section III-K.
paper/paper_a_results_v3.md:216:The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry and which we therefore exclude from both folds; this handling is made explicit here and has no effect on the accountant-level mixture analysis of Section IV-E, which uses the $\geq 10$-signature subset of 171 CPAs.
paper/paper_a_results_v3.md:242:The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity (see the $139 / 32$ accountant-level split of Section IV-E): the random 30% CPA sample happened to contain proportionally more accountants from the high-replication C1 cluster.
paper/paper_a_results_v3.md:243:We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to this fold variance.
paper/paper_a_results_v3.md:247:The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P95 heuristic.
paper/paper_a_results_v3.md:248:The accountant-level convergent threshold analysis (Section IV-E) places the primary accountant-level reference between $0.973$ and $0.979$ (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing), and the accountant-level 2D-GMM marginal at $0.945$.
paper/paper_a_results_v3.md:276:The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.
paper/paper_a_results_v3.md:285:Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign.
paper/paper_a_results_v3.md:353:For each report with exactly two signatures and complete per-signature data (83,970 reports assigned to a single firm, plus 384 reports with one signer per firm in the mixed-firm buckets for 84,354 total), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.
paper/paper_a_results_v3.md:373:We note that this test uses the calibrated classifier of Section III-L rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
paper/paper_a_results_v3.md:379:We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-L: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
paper/paper_a_results_v3.md:380:Document-level rates therefore bound the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-H.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
paper/paper_a_results_v3.md:391:Per the worst-case aggregation rule of Section III-L, a document with two signatures inherits the most-replication-consistent of the two signature-level labels.
paper/paper_a_results_v3.md:403:This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the 32/171 middle-band minority identified by the accountant-level mixture (Section IV-E).
paper/paper_a_results_v3.md:405:We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.
paper/paper_a_results_v3.md:410:This is consistent with the accountant-level convergent thresholds (Section IV-E, Table VIII) and with the cross-firm compositional pattern of the accountant-level GMM (Table VII).
paper/paper_a_results_v3.md:431:the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
paper/paper_a_methodology_v3.md:6:Fig. 1 illustrates the overall architecture.
paper/paper_a_methodology_v3.md:87:This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
paper/paper_a_methodology_v3.md:111:We specifically excluded SSIM (Structural Similarity Index) [30] after empirical testing showed it to be unreliable for scanned documents: the calibration firm (Section III-H) exhibited a mean SSIM of only 0.70 due to scan-induced pixel-level variations, despite near-identical visual content.
paper/paper_a_methodology_v3.md:127:The intra-report consistency analysis in Section IV-H.3 is a related but distinct check: it tests whether the *two co-signing CPAs on the same report* receive the same signature-level label (firm-level signing-practice homogeneity) rather than testing whether a single CPA mixes mechanisms within a fiscal year.
paper/paper_a_methodology_v3.md:128:A direct empirical check of the within-auditor-year assumption at the same-CPA level would require labeling multiple reports of the same CPA in the same year and is left to future work; in this paper we maintain the assumption as an identification convention motivated by industry practice and bounded by the worst-case aggregation rule of Section III-L.
paper/paper_a_methodology_v3.md:132:The independent minimum is unconditional on the cosine-nearest pair and is therefore the conservative structural-similarity statistic; it is the dHash statistic used throughout the operational classifier (Section III-L) and all reported capture-rate analyses.
paper/paper_a_methodology_v3.md:133:These accountant-level aggregates are the input to the mixture model described in Section III-J and to the accountant-level threshold analysis in Section III-I.5.
paper/paper_a_methodology_v3.md:149:Third, *accountant-level mixture analysis* (Section IV-E): a BIC-selected three-component Gaussian mixture over per-accountant mean cosine and mean dHash places 139 of the 171 Firm A CPAs (with $\geq 10$ signatures) in the high-replication C1 cluster and 32 in the middle-band C2 cluster, directly quantifying the within-firm heterogeneity.
paper/paper_a_methodology_v3.md:151:Fourth, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-H. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
paper/paper_a_methodology_v3.md:152:  (a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P95 of the per-signature cosine distribution (Section III-L); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
paper/paper_a_methodology_v3.md:153:  (b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
paper/paper_a_methodology_v3.md:154:  (c) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.
paper/paper_a_methodology_v3.md:156:We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the visual inspection, the accountant-level mixture, the three complementary analyses above, and the held-out Firm A fold described in Section III-K.
paper/paper_a_methodology_v3.md:160:The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.
paper/paper_a_methodology_v3.md:174:When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the *KDE crossover* is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes.
paper/paper_a_methodology_v3.md:175:When a single distribution is analyzed (e.g., the per-accountant cosine mean of Section IV-E) the *KDE antimode* is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal.
paper/paper_a_methodology_v3.md:211:Applied to the per-signature similarity distribution the two estimators yield thresholds spread across a wide range because per-signature similarity is not a cleanly bimodal population (Section IV-D).
paper/paper_a_methodology_v3.md:216:In addition to applying the two threshold estimators and the BD/McCrary diagnostic at the per-signature level (Section IV-D), we apply them to the per-accountant aggregates (mean best-match cosine, mean independent minimum dHash) for the 686 CPAs with $\geq 10$ signatures.
paper/paper_a_methodology_v3.md:217:The accountant-level estimates from the two threshold estimators (together with their convergence) provide the methodologically defensible threshold reference used in the per-document classification of Section III-L; the BD/McCrary accountant-level null is reported alongside as a smoothness diagnostic.
paper/paper_a_methodology_v3.md:239:3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference whose left tail contains a minority of hand-signers, as directly evidenced by the 32/171 middle-band share in the accountant-level mixture (Section III-H).
paper/paper_a_methodology_v3.md:240:Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we make the within-Firm-A sampling variance visible by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% *calibration* fold and a 30% *heldout* fold.
paper/paper_a_methodology_v3.md:250:The 70/30 held-out Firm A fold of Section IV-G.2 additionally reports capture rates with Wilson 95% confidence intervals computed within the held-out fold, which is a valid population for rate inference.
paper/paper_a_methodology_v3.md:255:The per-signature classifier operates at the signature level and uses whole-sample Firm A percentile heuristics as its operational thresholds, while the accountant-level threshold analysis of Section IV-E (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing) supplies a *convergent* external reference for the operational cuts.
paper/paper_a_methodology_v3.md:256:Because the two analyses are at different units (signature vs accountant) we treat them as complementary rather than substitutable: the accountant-level convergence band cos $\in [0.945, 0.979]$ anchors the signature-level operational cut cos $> 0.95$ used below, and Section IV-G.3 reports a sensitivity analysis in which cos $> 0.95$ is replaced by the accountant-level 2D-GMM marginal crossing cos $> 0.945$.
paper/paper_a_methodology_v3.md:257:All dHash references in this section refer to the *independent-minimum* dHash defined in Section III-G---the smallest Hamming distance from a signature to any other same-CPA signature.
paper/paper_a_methodology_v3.md:277:Section IV-G.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible.
paper/paper_a_methodology_v3.md:280:Section IV-G.3 reports the classifier's five-way output under the nearby operational cut cos $> 0.945$ as a sensitivity check; the aggregate firm-level capture rates change by at most $\approx 1.2$ percentage points (e.g., the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A versus 91.14% at cos $> 0.945$), and category-level shifts are concentrated at the Uncertain/Moderate-confidence boundary.
paper/paper_a_methodology_v3.md:282:Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed determines the document's classification).
paper/paper_a_methodology_v3.md:290:The CPA registry used to map signatures to CPAs is a publicly available audit-firm tenure registry (Section III-B).
paper/paper_a_appendix_v3.md:31:Two patterns are visible in Table A.I.
paper/paper_a_appendix_v3.md:41:Taken together, Table A.I shows (i) that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the non-hand-signed mode rather than between modes---and (ii) that the accountant-level BD/McCrary null persists across the bin-width sweep, consistent with but not alone sufficient to establish the clustered-but-smoothly-mixed interpretation discussed in Section V-B and limitation-caveated in Section V-G.
paper/paper_a_discussion_v3.md:45:The held-out Firm A 70/30 validation (Section IV-G.2) gives capture rates on a non-calibration Firm A subset that sit in the same replication-dominated regime as the calibration fold across the full range of operating rules (extreme rules are statistically indistinguishable; operational rules in the 85–95% band differ between folds by 1–5 percentage points, reflecting within-Firm-A heterogeneity in replication intensity rather than a generalization failure).
paper/paper_a_discussion_v3.md:46:The accountant-level GMM (Section IV-E) and the threshold-independent partner-ranking analysis (Section IV-H.2) are the cross-checks that are robust to fold-level sampling variance.

exec
/bin/zsh -lc "rg -n \"ablation study|Section IV-F|Section IV-J|Tables IV/VI|Table XV|baseline share|Table XVIII\" paper/paper_a_methodology_v3.md paper/paper_a_results_v3.md paper/paper_a_discussion_v3.md paper/paper_a_conclusion_v3.md paper/paper_a_introduction_v3.md | sed -n '1,220p'" in /Volumes/NV2/pdf_recognize
 succeeded in 0ms:
paper/paper_a_conclusion_v3.md:23:An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
paper/paper_a_introduction_v3.md:72:3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures.
paper/paper_a_introduction_v3.md:85:Section IV presents experimental results including the convergent threshold analysis, accountant-level mixture, pixel-identity validation, and backbone ablation study.
paper/paper_a_discussion_v3.md:93:While our ablation study and prior literature [20]--[22] support the effectiveness of transferred ImageNet features for signature comparison, a signature-specific feature extractor could improve discriminative performance.
paper/paper_a_results_v3.md:276:The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.
paper/paper_a_results_v3.md:314:Firm A accounts for 1,287 of these (27.8% baseline share).
paper/paper_a_results_v3.md:325:Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%---a concentration ratio of 3.5$\times$ at the top decile and 3.2$\times$ at the top quartile.
paper/paper_a_results_v3.md:326:Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.
paper/paper_a_results_v3.md:354:Table XVI reports per-firm intra-report agreement (firm-assignment defined by the firm identity of both signers; mixed-firm reports are reported separately).
paper/paper_a_results_v3.md:377:Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
paper/paper_a_results_v3.md:380:Document-level rates therefore bound the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-H.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
paper/paper_a_results_v3.md:414:To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
paper/paper_a_results_v3.md:416:Table XVIII presents the comparison.
paper/paper_a_results_v3.md:431:the per-signature best-match values in Tables IV/VI (mean = 0.980), which reflect
paper/paper_a_methodology_v3.md:87:This design choice is validated by an ablation study (Section IV-F) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
paper/paper_a_methodology_v3.md:153:  (b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.

codex
**Verdict**

**Minor Revision.** I do **not** fully concur with Gemini’s `Accept` from a codex-style numerics/consistency audit. On scientific substance, v3.8 is essentially there: it closes the remaining methodological issues from my round-5 review and Gemini’s round-6 review, and I do not see a surviving core inferential flaw. However, I found two late-cycle regressions that matter for submission readiness: one **results-table inconsistency** in Table XV’s per-year Firm A baseline-share column, and one set of **broken internal cross-references** introduced by renumbering. These are surgical fixes, not conceptual rewrites, but because one touches reported numbers, I would do a small `v3.9` before submission rather than submit `v3.8` as-is.

**Round-5 Action-List Audit**

- (1) **Table IX re-sync to script outputs:** `RESOLVED`. The whole-sample dual rule is now `54,370 / 60,448 = 89.95%`, matching the generated validation report.
- (2) **Circularity overclaim removal:** `RESOLVED`. I do not find surviving “breaks circularity” / “addresses circularity” language in the current submission files.
- (3) **`export_v3.py` single-anonymized fix:** `RESOLVED`. [export_v3.py](/Volumes/NV2/pdf_recognize/paper/export_v3.py:210) now uses single-anonymized submission placeholders rather than double-blind text.
- (4) **Visual-inspection protocol documentation:** `DEFERRED / still unresolved`. The claim remains, but the protocol is still underdocumented at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145).
- (5) **BD/McCrary substantiate or demote:** `RESOLVED`. The paper clearly chose the demotion path, and Appendix A plus the power caveat now make that defensible.
- (6) **Reproducibility appendix:** `DEFERRED / still unresolved`. Appendix A is BD-specific, not a general reproducibility appendix.
- (7) **Stale cosine-conditional dHash sentence:** `RESOLVED`. I do not see the old cosine-conditional phrasing surviving in the classifier description.

**Gemini Round-6 Findings**

I independently confirm that Gemini’s three v3.7 findings are genuinely resolved.

The BD/McCrary null is now framed correctly as a low-power non-rejection rather than proof of smoothness, across Results, Discussion, Conclusion, and Appendix A; that is aligned with my earlier “either substantiate or demote” position. Table X no longer presents FRR/EER against the byte-identical anchor and now explicitly explains why that would be tautological. Section IV-I now includes the needed worst-case aggregation safeguard, and the 15-signature delta is explicitly explained in IV-D.

**Regression Check**

Most of the late rewrites are clean. I spot-checked the revised arithmetic against the generated reports under `/Volumes/NV2/PDF-Processing/signature-analysis/reports`, and the key revised tables now line up: Table IX, Table XI, Table XII, Table XVI, Table XVII, and Appendix A are all consistent with either displayed arithmetic or the underlying generated reports.

I do, however, see two remaining issues Gemini missed:

- **Table XV baseline-share column is inconsistent with the generated partner-ranking output.** In [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:328), the per-year “Firm A baseline” values do not match the authoritative generated [partner_ranking_report.md](</Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md:34>). The clearest case is 2013: the manuscript reports `26.2%`, while the generated report gives `32.4%`. This is not just a formatting drift; it is numerically impossible for 2013 to be `26.2%`, because the underlying [partner_ranking_results.json](</Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_results.json:69>) already places **97** Firm A auditor-years in the 2013 top-50% bucket out of **324** total, so the full-year baseline must be at least `97/324 = 29.9%`. This is a real table regression and should be fixed before submission.
- **Late renumbering left broken internal references.** [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:87) says the ablation study is in “Section IV-F,” but the ablation is actually in Section IV-J at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:412). Also, the Table XVIII note at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:429) refers to “Tables IV/VI” for per-signature best-match values, but Table IV is all-pairs summary statistics and Table VI is GMM model selection; that reference is orphaned and should be corrected or rewritten.

I did **not** find new contradictions in the BD/McCrary rewrite, the Table X cleanup, or the IV-I safeguard. Those parts are solid.

**Deferred Items**

The deferral of the visual-inspection protocol and the general reproducibility appendix is, in my view, **defensible for first IEEE Access submission** and no longer blocking. The reason is that the manuscript no longer leans on visual inspection or BD/McCrary as decisive standalone evidence; the main evidentiary weight now sits on the quantitative analyses and code-generated tables. That said, both remain likely reviewer requests in revision, especially the visual-inspection protocol details and a compact reproducibility supplement.

**Final Recommendation**

Gemini’s `Accept` was understandable on scientific substance, but slightly generous on manuscript hygiene. My recommendation is a **very small v3.9** with these ranked items:

1. **Correct Table XV’s per-year Firm A baseline-share column** to match the authoritative partner-ranking output, then re-scan the surrounding narrative for any numbers derived from that column.
2. **Fix the broken cross-references** introduced by renumbering, especially the ablation pointer in Methods and the “Tables IV/VI” note in Table XVIII.
3. **Do the final packaging pass**: fill the title-page author metadata in [export_v3.py](/Volumes/NV2/pdf_recognize/paper/export_v3.py:210), regenerate the DOCX, and manually check figure insertion, table rendering, and reference formatting.

If those are fixed cleanly, my verdict would move to **Accept**.
tokens used
130,069
**Verdict**

**Minor Revision.** I do **not** fully concur with Gemini’s `Accept` from a codex-style numerics/consistency audit. On scientific substance, v3.8 is essentially there: it closes the remaining methodological issues from my round-5 review and Gemini’s round-6 review, and I do not see a surviving core inferential flaw. However, I found two late-cycle regressions that matter for submission readiness: one **results-table inconsistency** in Table XV’s per-year Firm A baseline-share column, and one set of **broken internal cross-references** introduced by renumbering. These are surgical fixes, not conceptual rewrites, but because one touches reported numbers, I would do a small `v3.9` before submission rather than submit `v3.8` as-is.

**Round-5 Action-List Audit**

- (1) **Table IX re-sync to script outputs:** `RESOLVED`. The whole-sample dual rule is now `54,370 / 60,448 = 89.95%`, matching the generated validation report.
- (2) **Circularity overclaim removal:** `RESOLVED`. I do not find surviving “breaks circularity” / “addresses circularity” language in the current submission files.
- (3) **`export_v3.py` single-anonymized fix:** `RESOLVED`. [export_v3.py](/Volumes/NV2/pdf_recognize/paper/export_v3.py:210) now uses single-anonymized submission placeholders rather than double-blind text.
- (4) **Visual-inspection protocol documentation:** `DEFERRED / still unresolved`. The claim remains, but the protocol is still underdocumented at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145).
- (5) **BD/McCrary substantiate or demote:** `RESOLVED`. The paper clearly chose the demotion path, and Appendix A plus the power caveat now make that defensible.
- (6) **Reproducibility appendix:** `DEFERRED / still unresolved`. Appendix A is BD-specific, not a general reproducibility appendix.
- (7) **Stale cosine-conditional dHash sentence:** `RESOLVED`. I do not see the old cosine-conditional phrasing surviving in the classifier description.

**Gemini Round-6 Findings**

I independently confirm that Gemini’s three v3.7 findings are genuinely resolved.

The BD/McCrary null is now framed correctly as a low-power non-rejection rather than proof of smoothness, across Results, Discussion, Conclusion, and Appendix A; that is aligned with my earlier “either substantiate or demote” position. Table X no longer presents FRR/EER against the byte-identical anchor and now explicitly explains why that would be tautological. Section IV-I now includes the needed worst-case aggregation safeguard, and the 15-signature delta is explicitly explained in IV-D.

**Regression Check**

Most of the late rewrites are clean. I spot-checked the revised arithmetic against the generated reports under `/Volumes/NV2/PDF-Processing/signature-analysis/reports`, and the key revised tables now line up: Table IX, Table XI, Table XII, Table XVI, Table XVII, and Appendix A are all consistent with either displayed arithmetic or the underlying generated reports.

I do, however, see two remaining issues Gemini missed:

- **Table XV baseline-share column is inconsistent with the generated partner-ranking output.** In [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:328), the per-year “Firm A baseline” values do not match the authoritative generated [partner_ranking_report.md](</Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_report.md:34>). The clearest case is 2013: the manuscript reports `26.2%`, while the generated report gives `32.4%`. This is not just a formatting drift; it is numerically impossible for 2013 to be `26.2%`, because the underlying [partner_ranking_results.json](</Volumes/NV2/PDF-Processing/signature-analysis/reports/partner_ranking/partner_ranking_results.json:69>) already places **97** Firm A auditor-years in the 2013 top-50% bucket out of **324** total, so the full-year baseline must be at least `97/324 = 29.9%`. This is a real table regression and should be fixed before submission.
- **Late renumbering left broken internal references.** [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:87) says the ablation study is in “Section IV-F,” but the ablation is actually in Section IV-J at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:412). Also, the Table XVIII note at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:429) refers to “Tables IV/VI” for per-signature best-match values, but Table IV is all-pairs summary statistics and Table VI is GMM model selection; that reference is orphaned and should be corrected or rewritten.

I did **not** find new contradictions in the BD/McCrary rewrite, the Table X cleanup, or the IV-I safeguard. Those parts are solid.

**Deferred Items**

The deferral of the visual-inspection protocol and the general reproducibility appendix is, in my view, **defensible for first IEEE Access submission** and no longer blocking. The reason is that the manuscript no longer leans on visual inspection or BD/McCrary as decisive standalone evidence; the main evidentiary weight now sits on the quantitative analyses and code-generated tables. That said, both remain likely reviewer requests in revision, especially the visual-inspection protocol details and a compact reproducibility supplement.

**Final Recommendation**

Gemini’s `Accept` was understandable on scientific substance, but slightly generous on manuscript hygiene. My recommendation is a **very small v3.9** with these ranked items:

1. **Correct Table XV’s per-year Firm A baseline-share column** to match the authoritative partner-ranking output, then re-scan the surrounding narrative for any numbers derived from that column.
2. **Fix the broken cross-references** introduced by renumbering, especially the ablation pointer in Methods and the “Tables IV/VI” note in Table XVIII.
3. **Do the final packaging pass**: fill the title-page author metadata in [export_v3.py](/Volumes/NV2/pdf_recognize/paper/export_v3.py:210), regenerate the DOCX, and manually check figure insertion, table rendering, and reference formatting.

If those are fixed cleanly, my verdict would move to **Accept**.