Paper A v3.5: resolve codex round-4 residual issues
Fully addresses the partially-resolved and unfixed items from the codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):
Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier
table had 1-4-unit transcription errors in k values and a fabricated
cos > 0.9407 calibration row; both fixed by rerunning Script 24
with cos = 0.9407 added to COS_RULES and copying exact values from
the JSON output.
- Section III-L classifier now defined entirely in terms of the
independent-minimum dHash statistic that the deployed code (Scripts
21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
language is removed. Tables IX, XI, XII, XVI are now arithmetically
consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
per-signature cosine distribution, matching III-L and IV-F.
Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
limit. Removed "we break the circularity" overclaim; replaced with
"report capture rates on both folds with Wilson 95% intervals to
make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
Methods/Results don't deliver; replaced with anchor-based capture /
FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
intra-report consistency (IV-H.3) is a different test (two co-signers
on the same report, firm-level homogeneity) and is not a within-CPA
year-level mixing check; the assumption is maintained as a bounded
identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
the partner-level ranking is threshold-free"; the longitudinal-stability
analysis uses the 0.95 cutoff, and the intra-report analysis uses the
operational classifier.
Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
Regular Papers do not have a standalone Impact Statement). The file
itself is retained as an archived non-paper note for cover-letter /
grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
[35] VLM survey, [36] Mann-Whitney) are now cited in-text:
[27] in Methodology III-E (dHash definition)
[31][32][33] in Introduction (audit-quality regulation context)
[34][35] in Methodology III-C/III-D
[36] in Results IV-C (Mann-Whitney result)
Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codex_review_gpt54_v3_4.md (new file):

@@ -0,0 +1,114 @@
# Fourth-Round Review of Paper A v3.4
**Overall Verdict: Major Revision**
v3.4 is materially better than v3.3. The ethics/interview blocker is genuinely fixed, the classifier-versus-accountant-threshold distinction is much clearer in the prose, Table XII now exists, and the held-out-validation story has been conceptually corrected from the false "within Wilson CI" claim to the right calibration-fold-versus-held-out comparison. I still do not recommend submission as-is, however, because two core problems remain. First, the newly added sensitivity and intra-report analyses do not appear to evaluate the classifier that Section III-L now defines: the paper says the operational five-way classifier uses *cosine-conditional* dHash cutoffs, but the new scripts use `min_dhash_independent` instead. Second, the replacement Table XI has z/p columns that do not consistently match its own reported counts under the script's published two-proportion formula. Those are fixable, but they keep the manuscript in major-revision territory.
**1. v3.3 Blocker Resolution Audit**
| Blocker | Status | Audit |
|---|---|---|
| B1. Classifier vs three-method convergence misalignment | `PARTIALLY-RESOLVED` | The prose repair is real. Section III-L now explicitly distinguishes the signature-level operational classifier from the accountant-level convergent reference band at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:251) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:275), and Section IV-G.3 is added as a sensitivity check at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:239) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:262). The remaining problem is that III-L defines the classifier's dHash cutoffs as *cosine-conditional* at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:269) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:271), but the new sensitivity script loads only `s.min_dhash_independent` at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:83) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:90) and then claims to "Replicate Section III-L" at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:204) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:241). So the conceptual alignment is improved, but the new empirical support is still not aligned to the declared classifier. |
| B2. Held-out validation false within-Wilson-CI claim | `PARTIALLY-RESOLVED` | The false claim itself is removed. Section IV-G.2 now correctly says the calibration fold, not the whole sample, is the right comparison target at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:230) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:237), and Discussion mirrors that at [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:44). The new script also implements the two-proportion z-test explicitly at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:66) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:80) and [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:175) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:202). However, several Table XI z/p entries do not match the displayed `k/n` counts under that formula: the `cosine > 0.837` row at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217) implies about `z = +0.41, p = 0.683`, not `+0.31 / 0.756`; the `cosine > 0.9407` row at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:220) implies about `z = -3.19, p = 0.0014`, not `-2.83 / 0.005`; and the `dHash_indep <= 15` row at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:224) implies about `z = -0.43, p = 0.670`, not `-0.31 / 0.754`. The conceptual blocker is fixed; the replacement inferential table still needs numeric cleanup. |
| B3. Interview evidence lacks ethics statement | `RESOLVED` | This blocker is fixed. The manuscript now consistently reframes the contextual claim as practitioner / industry-practice knowledge rather than as research interviews; see [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:50) through [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:139) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:140), and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:280) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:289). I also ran a grep across the nine v3 manuscript files and found no surviving `interview`, `IRB`, or `ethics` strings. The evidentiary burden now sits on paper-internal analyses rather than on undeclared human-subject evidence. |
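As background for the B2 discussion: the Wilson 95% interval that v3.3 misused (and that IV-G.2 now reports correctly) is a standard formula. The sketch below is my own minimal implementation for reference, not the paper's Script 24 code; all names are illustrative.

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n at confidence level z."""
    p_hat = k / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half

# e.g. 90 captured anchors out of 100
lo, hi = wilson_interval(90, 100)
```

Unlike the Wald interval, the Wilson interval stays inside [0, 1] near the boundaries, which matters for capture rates close to 1.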
**2. v3.3 Major-Issues Follow-up**
| Prior major issue | Status | v3.4 audit |
|---|---|---|
| dHash classifier ambiguity | `UNFIXED` | III-L now says the classifier uses *cosine-conditional* dHash thresholds at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:269) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:271), but the Results still report only `dHash_indep` capture rules at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:165) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:168) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:221) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:225), despite the promise at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:271) that both statistics would be reported. The new scripts for Table XII and Table XVI also use `min_dhash_independent`, not cosine-conditional dHash, at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:83) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:90) and [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:90) through [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:92). |
| 70/30 split overstatement | `PARTIALLY-FIXED` | The paper is now more candid that the operational classifier still inherits whole-sample thresholds at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:272) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:273), and IV-G.2 properly frames the fold comparison at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:230) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:237). But the Abstract still says "we break the circularity" at [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12), and the Conclusion repeats that framing at [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:20), which overstates what the 70/30 split accomplishes for the actual deployed classifier. |
| Validation-metric story | `PARTIALLY-FIXED` | Methods and Results are substantially improved: precision and `F1` are now explicitly rejected as meaningless here at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:244) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:246) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:186) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:203). But the Introduction still promises validation with "precision, recall, F1, and equal-error-rate" at [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28), and the Impact Statement still overstates binary discrimination at [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8). |
| Within-auditor-year empirical-check confusion | `UNFIXED` | Section III-G still says the intra-report analysis provides an empirical check on the within-auditor-year no-mixing assumption at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:123) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:127). But Section IV-H.3 still measures agreement between the two different signers on the same report at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:343) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:367). That is a cross-partner same-report test, not a same-CPA within-year mixing test. |
| BD/McCrary rigor | `UNFIXED` | The Methods still mention KDE bandwidth sensitivity at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:173) and define a fixed-bin BD/McCrary procedure at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:177) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:183), but the Results still give only narrative transition statements at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:80) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:83) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:126) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:149), with no alternate-bin analysis, Z-statistics table, p-values, or McCrary-style estimator output. |
| Reproducibility gaps | `PARTIALLY-FIXED` | There is some improvement at the code level: the new recalibration script exposes the seed and test formulae at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:46), [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:128) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:136), and [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:175) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:202). But from the paper alone the work is still not reproducible: the exact VLM prompt and parse rule remain absent at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:44) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:49), HSV thresholds remain absent at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:74), visual-inspection sample size/protocol remain absent at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145), and mixture initialization / stopping / boundary handling remain under-specified at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:187) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:195) and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:221). |
| Section III-H / IV-F reconciliation | `FIXED` | The manuscript now clearly says the 92.5% Firm A figure is a within-sample consistency check, not the independent validation pillar, at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:155) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:176). That specific circularity / role-confusion problem is repaired. |
| "Fixed 0.95 not calibrated to Firm A" inconsistency | `UNFIXED` | III-H still says the fixed `0.95` cutoff "is not calibrated to Firm A" at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:151), but III-L says `0.95` is the whole-sample Firm A P95 heuristic at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:252) and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:272), and IV-F says the same at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:241). This contradiction remains. |
**3. v3.3 Minor-Issues Follow-up**
| Prior minor issue | Status | v3.4 audit |
|---|---|---|
| Table XII numbering | `FIXED` | Table XII now exists at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:246) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:254), and the numbering now runs XI-XVIII without the previous jump. |
| `dHash_indep <= 5 (calib-fold median-adjacent)` label | `UNFIXED` | The unclear label remains at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:165), even though the same table family now explicitly reports the calibration-fold independent-minimum median as `2` at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:227). Calling `5` "median-adjacent" is still opaque. |
| References [27], [31]-[36] cleanup | `UNFIXED` | These references remain present at [paper_a_references_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:57) through [paper_a_references_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_references_v3.md:75), but a citation sweep across the nine manuscript files found no in-text uses of `[27]` or `[31]`-`[36]`. The Mann-Whitney test is still reported at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:50) without citing `[36]`. I also do not see uses of `[34]` or `[35]` in the reviewed manuscript text. |
**4. New Findings in v3.4**
**Blockers**
- The new IV-G.3 sensitivity evidence does not appear to use the classifier that III-L now defines. III-L says the operational categories use cosine-conditional dHash cutoffs at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:269) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:271), and IV-G.3 presents itself as a sensitivity test of that classifier at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:239) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:262). But [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:83) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:90) load only `min_dhash_independent`, and the "Replicate Section III-L" classifier at [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:212) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:241) uses that statistic directly. This is currently the most important unresolved issue because the newly added evidence that is meant to support B1 is not evaluating the paper's stated classifier.
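To make the disputed statistic concrete, here is a minimal reconstruction of a dHash with an independent-minimum aggregation of the kind `min_dhash_independent` suggests. This is my own sketch inferred from the review's description, not the repository's code; `resize_nn` and all other names are illustrative, and a cosine-conditional variant would gate these distances on the cosine score first.

```python
import numpy as np

def resize_nn(img: np.ndarray, h: int, w: int) -> np.ndarray:
    """Nearest-neighbour downsample of a 2-D grayscale array to h x w."""
    rows = (np.arange(h) * img.shape[0]) // h
    cols = (np.arange(w) * img.shape[1]) // w
    return img[np.ix_(rows, cols)]

def dhash_bits(img: np.ndarray) -> np.ndarray:
    """64-bit difference hash: compare horizontally adjacent pixels on an 8x9 grid."""
    g = resize_nn(img, 8, 9)                # 8 rows x 9 cols -> 8x8 comparisons
    return (g[:, 1:] > g[:, :-1]).ravel()   # 64 boolean bits

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

def min_dhash_independent(query: np.ndarray, references: list[np.ndarray]) -> int:
    """Minimum Hamming distance from one signature image to a set of references."""
    q = dhash_bits(query)
    return min(hamming(q, dhash_bits(r)) for r in references)
```

Whichever variant the paper settles on, Section III-L and Scripts 21/23/24 need to compute the same one of these quantities.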
**Major Issues**
- Table XI's z/p columns are not consistently arithmetically compatible with the published counts. The formula in [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:66) through [24_validation_recalibration.py](/Volumes/NV2/pdf_recognize/signature_analysis/24_validation_recalibration.py:80) is straightforward, but several rows in [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:217) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:224) do not match their own `k/n` inputs. The qualitative interpretation survives, but a statistical table that does not reproduce from its displayed counts is not submission-ready.
- Table XVI is affected by the same classifier-definition problem as Table XII. The paper says IV-H.3 uses the "dual-descriptor rules of Section III-L" at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:347), but [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:37) through [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:53) and [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:90) through [23_intra_report_consistency.py](/Volumes/NV2/pdf_recognize/signature_analysis/23_intra_report_consistency.py:92) classify with `min_dhash_independent`. So the new "fourth pillar" consistency check is not actually tied to the classifier as specified in III-L.
- The four-pillar Firm A validation is ethically cleaner, but not stronger in evidentiary reporting than v3.3. It is stronger on internal consistency because practitioner knowledge is now background-only at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:139) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:140), and the paper states that the evidence comes from the manuscript's own analyses at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:142) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:155). But it is not stronger on empirical documentation because the visual-inspection pillar still has no sample size, randomization rule, rater count, or decision protocol at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145). My read is: ethically stronger, scientifically cleaner, but only roughly equal in evidentiary strength unless the visual-inspection protocol is documented.
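The Table XI reproduction check is mechanical once the counts are fixed. Below is a minimal sketch of a pooled two-proportion z-test of the form the review attributes to Script 24; the function name and example counts are mine, and the paper's actual `k/n` inputs should be substituted when re-deriving the table's rows.

```python
import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided normal p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)                       # pooled success rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = math.erfc(abs(z) / math.sqrt(2))                 # = 2 * (1 - Phi(|z|))
    return z, p

# illustrative counts only, not Table XI's
z, p = two_proportion_z(95, 100, 85, 100)
```

Recomputing each Table XI row this way from its displayed counts, and copying the output verbatim, would close the arithmetic gap.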
**Minor Issues**
- III-H says "Two of them are fully threshold-free" at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:150), but item (a) immediately uses a fixed `0.95` cutoff at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:151). The Results intro to Section IV-H is more accurate at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:270) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:274). This should be harmonized.
- The Introduction still contains an obsolete metric promise at [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28), and the Impact Statement still reads too strongly for a five-way classifier with no full labeled test set at [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8). These are not new conceptual flaws, but they are still visible in the current version.
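For context on the `0.95` cutoff discussed above: III-L's description amounts to placing the cutoff at the 95th percentile of the whole-sample per-signature cosine distribution, which is inherently sample-calibrated. A one-line illustrative sketch (synthetic stand-in data, not Firm A's):

```python
import numpy as np

# Illustrative stand-in for the per-signature cosine distribution.
cosines = np.linspace(0.0, 1.0, 101)

# Whole-sample P95: a data-derived cutoff, hence calibrated to the sample,
# not an externally fixed threshold.
cutoff = float(np.quantile(cosines, 0.95))
```

This is why III-H's "not calibrated to Firm A" wording cannot coexist with III-L's P95 description.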
**5. IEEE Access Fit Check**
- **Scope:** Yes. The topic is a plausible IEEE Access Regular Paper fit as a methods paper spanning document forensics, computer vision, and audit/regulatory applications.
- **Abstract length:** Not compliant yet. A local plain-word count of [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:5) through [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14) gives about **367 words**. The IEEE Author Center guidance says the abstract should be a single paragraph of up to 250 words. The current abstract is also dense with abbreviations / symbols (`KDE`, `EM`, `BIC`, `GMM`, `~`, `approx`) that IEEE generally prefers authors to avoid in abstracts.
- **Impact Statement section:** The manuscript still includes a standalone Impact Statement at [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:1) through [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:9). **Inference from official IEEE Access / IEEE Author Center sources:** I do not see a Regular Paper requirement for a standalone `Impact Statement` section. Unless an editor specifically requested it, I would remove it or fold its content into the abstract / conclusion / cover letter.
- **Formatting:** I cannot verify final IEEE template conformance from the markdown section files alone. Official IEEE Access guidance requires the journal template and submission of both source and PDF; that should be checked at the generated DOCX / PDF stage, not from these source snippets.
- **Review model / anonymization:** IEEE Access uses **single-anonymized** review. The current pseudonymization of firms is therefore a confidentiality choice, not a review-blinding requirement. Within the nine reviewed section files I do not see author or institution metadata.
- **Official sources checked:**
- IEEE Access submission guidelines: https://ieeeaccess.ieee.org/authors/submission-guidelines/
- IEEE Author Center article-structure guidance: https://journals.ieeeauthorcenter.ieee.org/create-your-ieee-journal-article/create-the-text-of-your-article/structure-your-article/
- IEEE Access reviewer guidelines / reviewer info: https://ieeeaccess.ieee.org/reviewers/reviewer-guidelines/
**6. Statistical Rigor Audit**
- The high-level statistical story is cleaner than in v3.3. The paper now explicitly separates the primary accountant-level 1D convergence (`0.973 / 0.979 / 0.976`) from the secondary 2D-GMM marginal (`0.945`) at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:126) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:149), and III-L no longer pretends those accountant-level thresholds are themselves the deployed classifier at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:274).
- The B2 statistical interpretation is substantially improved: IV-G.2 now frames fold differences as heterogeneity rather than as failed generalization at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:233) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:237), and Discussion repeats that narrower reading at [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:44) through [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:45).
- The main remaining statistical weakness is now more specific: the paper's new classifier definition and the paper's new sensitivity evidence are not using the same dHash statistic. That is a model-definition problem, not just a wording problem.
- BD/McCrary remains the least rigorous component. The paper's qualitative interpretation is plausible, but the reporting is still too thin for a method presented as a co-equal thresholding component.
- The anchor-based validation is better framed than before. The manuscript now correctly treats the byte-identical positives as a conservative subset and no longer uses precision / `F1` in the main validation table at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:184) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:205).
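On the BD/McCrary thinness noted above: the minimal reportable statistic is cheap to compute. The sketch below is a deliberately simplified fixed-bin discontinuity z-statistic comparing the counts in the two bins adjacent to a candidate cutoff, under a local-uniformity assumption; it is not the McCrary local-linear estimator and not the paper's procedure, and all names are mine.

```python
import math

def adjacent_bin_z(values, cutoff: float, width: float) -> float:
    """z-statistic for a count jump across `cutoff` using one bin on each side.

    Under H0 (locally smooth density), the two adjacent equal-width bins hold
    approximately equal counts, so (n_r - n_l) / sqrt(n_r + n_l) ~ N(0, 1).
    """
    n_l = sum(cutoff - width <= v < cutoff for v in values)
    n_r = sum(cutoff <= v < cutoff + width for v in values)
    return (n_r - n_l) / math.sqrt(n_r + n_l)
```

Reporting a statistic like this per candidate cutoff, repeated at alternate bin widths, would replace the current narrative transition statements with checkable numbers.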
**7. Anonymization Check**
- Within the nine reviewed v3 manuscript files, I do not see any explicit real firm names or auditor names. The paper consistently uses `Firm A/B/C/D`; see [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:287) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:289) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:353) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:357).
- The new III-M residual-identifiability disclosure at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:287) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:288) is appropriate. Knowledgeable local readers may still infer Firm A, but the paper now states that risk explicitly.
**8. Numerical Consistency**
- Most of the large headline counts still reconcile across sections: `90,282` reports, `182,328` signatures, `758` CPAs, and the Firm A `171 + 9` accountant split remain internally consistent across [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11) through [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:13), [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:62) through [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:63), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:121) through [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:127), and [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:19) through [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:21).
- Table XII arithmetic is internally consistent: both columns sum to `168,740`, and the listed percentages match the counts. Table XVI and Table XVII arithmetic also reconcile. The new numbering XI-XVIII is coherent.
- The important remaining numerical inconsistency is Table XI's inferential columns, not its raw counts or percentages.
**9. Reproducibility**
- The paper is still **not reproducible from the manuscript alone**.
- Missing or under-specified items that should be added before submission:
- Exact VLM prompt, parse rule, and failure-handling for page selection at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:44) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:49).
- HSV thresholds for red-stamp removal at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:74).
- Random seeds / sampling protocol for the 500-page annotation set, the 50,000 inter-CPA negatives, the 30-signature sanity sample, and the Firm A 70/30 split at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:59), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:232), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:237) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:239), and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:247).
- Visual-inspection sample size, selection rule, and decision protocol at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:144) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:145).
- EM / mixture initialization, stopping criteria, boundary clipping for the logit transform, and software versions for the mixture fits at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:187) through [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:195) and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:221).
- The new scripts help the audit, but they also expose that the Results tables are currently not perfectly aligned to the Methods classifier definition. So reproducibility is not only incomplete; it is presently inconsistent in one key place.

**Bottom Line**

v3.4 clears the ethics/interview blocker and substantially improves the classifier-threshold narrative. It is much closer to a submittable paper than v3.3. But I would still require one more round before IEEE Access submission: (1) make Section III-L, Table XII, Table XVI, and the supporting scripts use the same dHash statistic, or explicitly redefine the classifier around `dHash_indep`; (2) recompute and correct the Table XI z/p columns from the displayed counts; (3) remove the remaining overstatements about what the 70/30 split and the validation metrics establish; and (4) cut the abstract to <= 250 words while cleaning the non-standard Impact Statement. If those are repaired cleanly, the paper should move into minor-revision territory.
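For item (2), the arithmetic is mechanically checkable. A minimal sketch, assuming the Table XI z statistic is a pooled two-proportion z-test on the displayed counts (the test Script 24 actually implements may differ, and the counts below are placeholders, not Table XI values):

```python
import math

def two_prop_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test on raw counts; returns (z, two-sided p)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal tail
    return z, p

# placeholder counts -- NOT Table XI values
z, p = two_prop_z(90, 100, 70, 100)
```

Any displayed (k, n) pair whose recomputed z/p disagrees with the table beyond rounding indicates a transcription error of the kind flagged above.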
```diff
@@ -14,7 +14,8 @@ OUTPUT = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"
 SECTIONS = [
     "paper_a_abstract_v3.md",
-    "paper_a_impact_statement_v3.md",
+    # paper_a_impact_statement_v3.md removed: not a standard IEEE Access
+    # Regular Paper section. Content folded into cover letter / abstract.
     "paper_a_introduction_v3.md",
     "paper_a_related_work_v3.md",
     "paper_a_methodology_v3.md",
```
```diff
@@ -1,16 +1,7 @@
 # Abstract
 
-<!-- 200-270 words -->
+<!-- IEEE Access target: <= 250 words, single paragraph -->
 
-Regulations in many jurisdictions require Certified Public Accountants (CPAs) to attest to each audit report they certify, typically by affixing a signature or seal.
-However, the digitization of financial reporting makes it straightforward to reuse a stored signature image across multiple reports---whether by administrative stamping or firm-level electronic signing systems---potentially undermining the intent of individualized attestation.
-Unlike signature forgery, where an impostor imitates another person's handwriting, *non-hand-signed* reproduction involves the legitimate signer's own stored signature image being reproduced on each report, a practice that is visually invisible to report users and infeasible to audit at scale through manual inspection.
-We present an end-to-end AI pipeline that automatically detects non-hand-signed auditor signatures in financial audit reports.
-The pipeline integrates a Vision-Language Model for signature page identification, YOLOv11 for signature region detection, and ResNet-50 for deep feature extraction, followed by a dual-descriptor verification combining cosine similarity of deep embeddings with difference hashing (dHash).
-For threshold determination we apply three methodologically distinct methods---Kernel Density antimode with a Hartigan unimodality test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature level and the accountant level.
-Applied to 90,282 audit reports filed in Taiwan over 2013--2023 (182,328 signatures from 758 CPAs) the methods reveal an informative asymmetry: signature-level similarity forms a continuous quality spectrum that no two-component mixture cleanly separates, while accountant-level aggregates are clustered into three recognizable groups (BIC-best $K = 3$) with the KDE antimode and the two mixture-based estimators converging within $\sim$0.006 of each other at cosine $\approx 0.975$; the Burgstahler-Dichev / McCrary test produces no significant discontinuity at the accountant level, consistent with clustered-but-smooth rather than sharply discrete accountant-level heterogeneity.
-A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with visual-inspection and accountant-level mixture evidence supporting majority non-hand-signing and a minority of hand-signers; we break the circularity of using the same firm for calibration and validation by a 70/30 CPA-level held-out fold.
-Validation against 310 byte-identical positive signatures and a $\sim$50,000-pair inter-CPA negative anchor yields FAR $\leq$ 0.001 with Wilson 95% confidence intervals at all accountant-level thresholds.
-To our knowledge, this represents the largest-scale forensic analysis of auditor signature authenticity reported in the literature.
+Regulations require Certified Public Accountants (CPAs) to attest to each audit report by affixing a signature. Digitization makes reusing a stored signature image across reports trivial---whether by administrative stamping or firm-level electronic signing---potentially undermining individualized attestation. Unlike signature forgery, *non-hand-signed* reproduction reuses the legitimate signer's own stored image, making it visually invisible to report users and infeasible to audit at scale manually. We present an end-to-end pipeline integrating a Vision-Language Model for signature-page identification, YOLOv11 for signature detection, and ResNet-50 for feature extraction, followed by a dual-descriptor verification combining cosine similarity and difference hashing. For threshold determination we apply three methodologically distinct estimators---kernel-density antimode with a Hartigan unimodality test, Burgstahler-Dichev/McCrary discontinuity, and EM-fitted Beta mixtures with a logit-Gaussian robustness check---at both the signature and accountant levels. Applied to 90,282 audit reports filed in Taiwan over 2013-2023 (182,328 signatures from 758 CPAs), the methods reveal a level asymmetry: signature-level similarity is a continuous quality spectrum that no two-component mixture separates, while accountant-level aggregates cluster into three groups with the antimode and two mixture estimators converging within $\sim$0.006 at cosine $\approx 0.975$. A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with visual inspection and accountant-level mixture evidence supporting majority non-hand-signing and a minority of hand-signers; we report capture rates on both 70/30 calibration and held-out folds with Wilson 95% intervals to make fold-level variance visible. Validation against 310 byte-identical positives and a $\sim$50,000-pair inter-CPA negative anchor yields FAR $\leq$ 0.001 at all accountant-level thresholds.
 
-<!-- Word count: ~290 -->
+<!-- Target word count: 240 -->
```
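The kernel-density antimode named in the revised abstract is easy to sketch. A numpy-only toy illustration under assumed Gaussian kernels and a fixed bandwidth (the paper's estimator, its bandwidth selection, and the Hartigan dip test are more involved, and every number below is synthetic):

```python
import numpy as np

def kde_antimode(x: np.ndarray, grid: np.ndarray, bw: float) -> float:
    """Return the deepest density minimum between the two highest KDE modes,
    or NaN when fewer than two modes are found (unimodal case)."""
    # Gaussian KDE up to a constant factor (extrema locations are unaffected)
    dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bw) ** 2).mean(axis=1)
    is_max = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
    is_min = (dens[1:-1] < dens[:-2]) & (dens[1:-1] < dens[2:])
    maxima = np.where(is_max)[0] + 1
    if maxima.size < 2:
        return float("nan")
    lo, hi = np.sort(maxima[np.argsort(dens[maxima])[-2:]])  # two highest modes
    minima = np.where(is_min)[0] + 1
    between = minima[(minima > lo) & (minima < hi)]
    if between.size == 0:
        return float("nan")
    return float(grid[between[np.argmin(dens[between])]])

# synthetic bimodal cosine scores: replication-like mass near 0.98,
# hand-signed-like mass near 0.90; the antimode should land between them
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.90, 0.01, 500), rng.normal(0.98, 0.01, 500)])
cut = kde_antimode(x, np.linspace(0.85, 1.00, 301), bw=0.005)
```

Restricting the search to the valley between the two highest modes keeps spurious tail wiggles from being mistaken for the antimode.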
```diff
@@ -17,7 +17,7 @@ The Burgstahler-Dichev / McCrary test, by contrast, finds no significant transit
 The substantive reading is therefore narrower than "discrete behavior": *pixel-level output quality* is continuous and heavy-tailed, and *accountant-level aggregate behavior* is clustered with smooth cluster boundaries.
 
 Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
-To break the circularity of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report post-hoc capture rates on the held-out fold with Wilson 95% confidence intervals.
+To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85-95% capture band differ by 1-5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure.
 This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
 
 An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
```
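The both-folds reporting that replaces the circularity claim reduces to a standard score interval per fold. A minimal sketch with placeholder counts (not the paper's fold data):

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (z=1.96 -> 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# placeholder fold counts -- NOT the paper's capture data
lo, hi = wilson_ci(85, 100)  # e.g., 85 of 100 held-out signatures captured
```

Reporting the interval separately per fold is what makes the 1-5 percentage-point fold gap interpretable against sampling noise, which is exactly the reframed claim.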
```diff
@@ -1,9 +1,21 @@
-# Impact Statement
-
-<!-- 100-150 words. Non-specialist readable. No jargon. Specific, not vague. -->
+<!--
+ARCHIVED. Not part of the IEEE Access submission.
+
+IEEE Access Regular Papers do not include a separate Impact Statement
+section. The text below is retained for possible reuse in a cover
+letter, grant report, or non-IEEE venue. It is excluded from the
+assembled paper by export_v3.py.
+
+If reused, note that the wording "distinguishes genuinely hand-signed
+signatures from reproduced ones" overstates what a five-way confidence
+classifier without a fully labeled test set establishes; soften before
+external use.
+-->
+
+# Impact Statement (archived; not in IEEE Access submission)
 
 Auditor signatures on financial reports are a key safeguard of corporate accountability.
 When the signature on an audit report is produced by reproducing a stored image instead of by the partner's own hand---whether through an administrative stamping workflow or a firm-level electronic signing system---this safeguard is weakened, yet detecting the practice through manual inspection is infeasible at the scale of modern financial markets.
-We developed an artificial intelligence system that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan.
+We developed a pipeline that automatically extracts and analyzes signatures from over 90,000 audit reports spanning a decade of filings by publicly listed companies in Taiwan.
-By combining deep-learning visual features with perceptual hashing and three methodologically distinct threshold-selection methods, the system distinguishes genuinely hand-signed signatures from reproduced ones and quantifies how this practice varies across firms and over time.
+Combining deep-learning visual features with perceptual hashing and three methodologically distinct threshold-selection methods, the system stratifies signatures into a five-way confidence-graded classification and quantifies how the practice varies across firms and over time.
-After further validation, the technology could support financial regulators in automating signature-authenticity screening at national scale.
+After further validation, the technology could support financial regulators in screening signature authenticity at national scale.
```
```diff
@@ -12,6 +12,7 @@ This reproduction can occur either through an administrative stamping workflow--
 From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise.
 We refer to signatures produced by either workflow collectively as *non-hand-signed*.
 Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement.
+The accounting literature has long examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33].
 Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused.
 This practice, while potentially widespread, is visually invisible to report users and virtually undetectable through manual inspection at scale: regulatory agencies overseeing thousands of publicly listed companies cannot feasibly examine each signature for evidence of image reproduction.
 
```
```diff
@@ -25,7 +26,7 @@ This detection problem differs fundamentally from forgery detection: while it do
 A secondary methodological concern shapes the research design.
 Many prior similarity-based classification studies rely on ad-hoc thresholds---declaring two images equivalent above a hand-picked cosine cutoff, for example---without principled statistical justification.
 Such thresholds are fragile and invite reviewer skepticism, particularly in an archival-data setting where the cost of misclassification propagates into downstream inference.
-A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against anchor populations with known ground-truth characteristics using precision, recall, $F_1$, and equal-error-rate metrics that prevail in the biometric-verification literature.
+A defensible approach requires (i) a statistically principled threshold-determination procedure, ideally anchored to an empirical reference population drawn from the target corpus; (ii) convergent validation across multiple threshold-determination methods that rest on different distributional assumptions; and (iii) external validation against naturally-occurring anchor populations---byte-level identical pairs as a conservative gold positive subset and large random inter-CPA pairs as a gold negative population---reported with Wilson 95% confidence intervals on per-rule capture / FAR rates, since precision and $F_1$ are not meaningful when the positive and negative anchor populations are sampled from different units.
 
 Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards.
 Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering (grouping signatures by signer identity) rather than detecting reuse of a stored image.
```
```diff
@@ -41,7 +41,7 @@ Table I summarizes the dataset composition.
 
 ## C. Signature Page Identification
 
-To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24] as an automated pre-screening mechanism.
+To identify which page of each multi-page PDF contains the auditor's signatures, we employed the Qwen2.5-VL vision-language model (32B parameters) [24], one of the multimodal generative models surveyed in [35], as an automated pre-screening mechanism.
 Each PDF page was rendered to JPEG at 180 DPI and submitted to the VLM with a structured prompt requesting a binary determination of whether the page contains a Chinese handwritten signature.
 The model was configured with temperature 0 for deterministic output.
```
```diff
@@ -55,7 +55,7 @@ The 1.2% disagreement reflects the combined rate of (i) VLM false positives (pag
 
 ## D. Signature Detection
 
-We adopted YOLOv11n (nano variant) [25] for signature region localization.
+We adopted YOLOv11n (nano variant) [25], a lightweight descendant of the original YOLO single-stage detector [34], for signature region localization.
 A training set of 500 randomly sampled signature pages was annotated using a custom web-based interface following a two-stage protocol: primary annotation followed by independent review and correction.
 A region was labeled as "signature" if it contained any Chinese handwritten content attributable to a personal signature, regardless of overlap with official stamps.
 
```
```diff
@@ -97,7 +97,7 @@ $$\text{sim}(\mathbf{f}_A, \mathbf{f}_B) = \mathbf{f}_A \cdot \mathbf{f}_B$$
 where $\mathbf{f}_A$ and $\mathbf{f}_B$ are L2-normalized 2048-dim feature vectors.
 Each feature dimension contributes to the angular alignment, so cosine similarity is sensitive to fine-grained execution differences---pen pressure, ink distribution, and subtle stroke-trajectory variations---that distinguish genuine within-writer variation from the reproduction of a stored image [14].
 
-**Perceptual hash distance (dHash)** captures structural-level similarity.
+**Perceptual hash distance (dHash)** [27] captures structural-level similarity.
 Each signature image is resized to 9×8 pixels and converted to grayscale; horizontal gradient differences between adjacent columns produce a 64-bit binary fingerprint.
 The Hamming distance between two fingerprints quantifies perceptual dissimilarity: a distance of 0 indicates structurally identical images, while distances exceeding 15 indicate clearly different images.
 Unlike DCT-based perceptual hashes, dHash is computationally lightweight and particularly effective for detecting near-exact duplicates with minor scan-induced variations [19].
```
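The dual descriptors in this hunk are straightforward to sketch. A toy implementation assuming nearest-neighbor downsampling for the 9×8 resize (the pipeline's actual interpolation is not specified) and a synthetic grayscale array in place of a real signature crop:

```python
import numpy as np

def dhash64(gray: np.ndarray) -> int:
    """64-bit difference hash: downsample to 8 rows x 9 columns, then compare
    horizontally adjacent pixels (8 comparisons per row x 8 rows = 64 bits)."""
    h, w = gray.shape
    rows = (np.arange(8) * h) // 8          # nearest-neighbor row picks
    cols = (np.arange(9) * w) // 9          # 9 columns -> 8 column differences
    small = gray[np.ix_(rows, cols)].astype(float)
    bits = (small[:, 1:] > small[:, :-1]).ravel()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    """Hamming distance between two 64-bit fingerprints."""
    return bin(a ^ b).count("1")

def cosine(fa: np.ndarray, fb: np.ndarray) -> float:
    """Cosine similarity of two feature vectors (L2-normalize, then dot)."""
    return float(np.dot(fa / np.linalg.norm(fa), fb / np.linalg.norm(fb)))

# synthetic grayscale "image" (0-255); a real run would use a signature crop
rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 96))
d_same = hamming(dhash64(img), dhash64(img))   # byte-identical pair -> 0
cos_same = cosine(np.ones(4), np.ones(4))      # identical features -> 1.0
```

Distance 0 for byte-identical pairs holds by construction; the >15 "clearly different" band and the cosine cutoffs are empirical calibrations of the paper, not properties of the descriptors.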
```diff
@@ -124,7 +124,8 @@ We also adopt an explicit *within-auditor-year no-mixing* identification assumpt
 Specifically, within any single fiscal year we treat a given CPA's signing mechanism as uniform: a CPA who reproduces one signature image in that year is assumed to do so for every report, and a CPA who hand-signs in that year is assumed to hand-sign every report in that year.
 Domain-knowledge from industry practice at Firm A is consistent with this assumption for that firm during the sample period.
 Under the assumption, per-auditor-year summary statistics are well defined and robust to outliers: if even one pair of same-CPA signatures in the year is near-identical, the max/min captures it.
-The intra-report consistency analysis in Section IV-H.3 provides an empirical check on the within-auditor-year assumption at the report level.
+The intra-report consistency analysis in Section IV-H.3 is a related but distinct check: it tests whether the *two co-signing CPAs on the same report* receive the same signature-level label (firm-level signing-practice homogeneity) rather than testing whether a single CPA mixes mechanisms within a fiscal year.
+A direct empirical check of the within-auditor-year assumption at the same-CPA level would require labeling multiple reports of the same CPA in the same year and is left to future work; in this paper we maintain the assumption as an identification convention motivated by industry practice and bounded by the worst-case aggregation rule of Section III-L.
 
 For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean *independent minimum dHash* across all signatures of that CPA.
 The *independent minimum dHash* of a signature is defined as the minimum Hamming distance to *any* other signature of the same CPA (over the full same-CPA set), in contrast to the *cosine-conditional dHash* used as a diagnostic elsewhere, which is the dHash to the single signature selected as the cosine-nearest match.
```
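The two per-signature dHash statistics distinguished in this hunk differ only in which same-CPA partner each signature is compared against. A toy sketch over hypothetical pairwise matrices (by construction the independent minimum can never exceed the cosine-conditional value):

```python
import numpy as np

def indep_min_dhash(ham: np.ndarray) -> np.ndarray:
    """Independent minimum dHash: for each signature, the minimum Hamming
    distance to ANY other signature of the same CPA."""
    m = ham.astype(float).copy()
    np.fill_diagonal(m, np.inf)          # exclude self-comparison
    return m.min(axis=1)

def cosine_conditional_dhash(ham: np.ndarray, cos: np.ndarray) -> np.ndarray:
    """Diagnostic variant: dHash to the single cosine-nearest signature."""
    c = cos.astype(float).copy()
    np.fill_diagonal(c, -np.inf)         # exclude self
    nearest = c.argmax(axis=1)
    return ham[np.arange(len(ham)), nearest].astype(float)

# hypothetical 3-signature CPA: pairwise Hamming distances and cosines
ham = np.array([[0, 2, 9], [2, 0, 5], [9, 5, 0]])
cos = np.array([[1.0, 0.99, 0.50], [0.99, 1.0, 0.60], [0.50, 0.60, 1.0]])
indep = indep_min_dhash(ham)
cond = cosine_conditional_dhash(ham, cos)
cpa_mean = indep.mean()                  # accountant-level aggregate
```

Defining the III-L classifier on `indep_min_dhash` (as the deployed Scripts 21/23/24 do) is what makes Tables IX, XI, XII, and XVI arithmetically reconcilable with the classifier definition.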
@@ -147,10 +148,10 @@ Second, *whole-sample signature-level rates*: 92.5% of Firm A's per-signature be
|
|||||||
|
|
||||||
Third, *accountant-level mixture analysis* (Section IV-E): a BIC-selected three-component Gaussian mixture over per-accountant mean cosine and mean dHash places 139 of the 171 Firm A CPAs (with $\geq 10$ signatures) in the high-replication C1 cluster and 32 in the middle-band C2 cluster, directly quantifying the within-firm heterogeneity.
|
Third, *accountant-level mixture analysis* (Section IV-E): a BIC-selected three-component Gaussian mixture over per-accountant mean cosine and mean dHash places 139 of the 171 Firm A CPAs (with $\geq 10$ signatures) in the high-replication C1 cluster and 32 in the middle-band C2 cluster, directly quantifying the within-firm heterogeneity.
|
||||||
|
|
||||||
Fourth, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-H. Two of them are fully threshold-free, and one uses the downstream classifier as an internal consistency check:
|
Fourth, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-H. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
|
||||||
(a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The fixed 0.95 cutoff is not calibrated to Firm A; the stability itself is the finding.
|
(a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P95 of the per-signature cosine distribution (Section III-L); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
|
||||||
(b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
|
(b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
|
||||||
(c) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.

We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the visual inspection, the accountant-level mixture, the three complementary analyses above, and the held-out Firm A fold described in Section III-K.

The per-signature classifier operates at the signature level and uses whole-sample Firm A percentile heuristics as its operational thresholds, while the three-method analysis of Section IV-E operates at the accountant level and supplies a *convergent* external reference for the operational cuts.

Because the two analyses are at different units (signature vs accountant) we treat them as complementary rather than substitutable: the accountant-level convergence band cos $\in [0.945, 0.979]$ anchors the signature-level operational cut cos $> 0.95$ used below, and Section IV-G.3 reports a sensitivity analysis in which cos $> 0.95$ is replaced by the accountant-level 2D-GMM marginal crossing cos $> 0.945$.

All dHash references in this section refer to the *independent-minimum* dHash defined in Section III-G---the smallest Hamming distance from a signature to any other same-CPA signature.

We use a single dHash statistic throughout the operational classifier and the supporting capture-rate analyses (Tables IX, XI, XII, XVI), which keeps the classifier definition and its empirical evaluation arithmetically consistent.

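As a concrete illustration, the independent-minimum statistic can be computed from 64-bit integer dHashes as follows. This is a minimal sketch, not the deployed code of Scripts 21/23/24:

```python
def dhash_indep(hashes):
    """Independent-minimum dHash for one CPA: for each signature,
    the smallest Hamming distance to any OTHER signature of the
    same CPA, where each hash is a 64-bit integer dHash."""
    out = []
    for i, h in enumerate(hashes):
        out.append(min(bin(h ^ other).count("1")   # Hamming distance
                       for j, other in enumerate(hashes) if j != i))
    return out
```

For example, `dhash_indep([0b1111, 0b1110, 0b0000])` returns `[1, 1, 3]`.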
We assign each signature to one of five signature-level categories using convergent evidence from both descriptors:

1. **High-confidence non-hand-signed:** Cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 5$.

Both descriptors converge on strong replication evidence.

2. **Moderate-confidence non-hand-signed:** Cosine $> 0.95$ AND $5 < \text{dHash}_\text{indep} \leq 15$.

Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff, potentially due to scan variations.

3. **High style consistency:** Cosine $> 0.95$ AND $\text{dHash}_\text{indep} > 15$.

High feature-level similarity without structural corroboration---consistent with a CPA who signs very consistently but not via image reproduction.

4. **Uncertain:** Cosine between the all-pairs intra/inter KDE crossover (0.837) and 0.95 without sufficient convergent evidence for classification in either direction.

5. **Likely hand-signed:** Cosine below the all-pairs KDE crossover threshold.

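Under the thresholds above, the five-way rule reduces to a short decision cascade. The sketch below is a simplified rendering (it folds the "convergent evidence" qualifier of the Uncertain band into a plain interval check, and the label strings are ours):

```python
def classify_signature(cos, dh_indep, p95=0.95, kde_crossover=0.837):
    """Five-way signature-level label from the best-match cosine and
    the independent-minimum dHash (Section III-G)."""
    if cos > p95:
        if dh_indep <= 5:
            return "high-confidence non-hand-signed"
        if dh_indep <= 15:
            return "moderate-confidence non-hand-signed"
        return "high style consistency"      # dHash_indep > 15
    if cos > kde_crossover:
        return "uncertain"
    return "likely hand-signed"
```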
We note three conventions about the thresholds.

First, the cosine cutoff $0.95$ is the whole-sample Firm A P95 of the per-signature best-match cosine distribution (chosen for its transparent percentile interpretation in the whole-sample reference distribution), and the cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; both are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions.

Section IV-G.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible.

Second, the dHash cutoffs $\leq 5$ and $> 15$ are chosen from the whole-sample Firm A $\text{dHash}_\text{indep}$ distribution: $\leq 5$ captures the upper tail of the high-similarity mode (whole-sample Firm A median $\text{dHash}_\text{indep} = 2$, P75 $\approx 4$, so $\leq 5$ is the band immediately above the median), while $> 15$ marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction.

Third, the three accountant-level 1D estimators (KDE antimode $0.973$, Beta-2 crossing $0.979$, logit-GMM-2 crossing $0.976$) and the accountant-level 2D GMM marginal ($0.945$) are *not* the operational thresholds of this classifier: they are the *convergent external reference* that supports the choice of signature-level operational cut.

Section IV-G.3 reports the classifier's five-way output under the nearby operational cut cos $> 0.945$ as a sensitivity check; the aggregate firm-level capture rates change by at most $\approx 1.2$ percentage points (e.g., the operational dual rule cos $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.95% of whole Firm A versus 91.14% at cos $> 0.945$), and category-level shifts are concentrated at the Uncertain/Moderate-confidence boundary.

Because each audit report typically carries two certifying-CPA signatures (Section III-D), we aggregate signature-level outcomes to document-level labels using a worst-case rule: the document inherits the *most-replication-consistent* signature label (i.e., among the two signatures, the label rank ordered High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed determines the document's classification).

This rule is consistent with the detection goal of flagging any potentially non-hand-signed report rather than requiring all signatures on the report to converge.

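A minimal sketch of the worst-case document-level aggregation (the label strings are illustrative, not the paper's exact identifiers):

```python
# Lower index = more replication-consistent; the document inherits
# the most replication-consistent label among its signatures.
LABEL_RANK = [
    "high-confidence non-hand-signed",
    "moderate-confidence non-hand-signed",
    "high style consistency",
    "uncertain",
    "likely hand-signed",
]

def document_label(signature_labels):
    """Worst-case rule over a report's (typically two) signatures."""
    return min(signature_labels, key=LABEL_RANK.index)
```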
Distribution fitting identified the lognormal distribution as the best parametric fit.

The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).

Under equal prior probabilities and equal misclassification costs, this crossover approximates the Bayes-optimal boundary between the two classes.

Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).

We emphasize that pairwise observations are not independent---the same signature participates in multiple pairs---which inflates the effective sample size and renders $p$-values unreliable as measures of evidence strength.

We therefore rely primarily on Cohen's $d$ as an effect-size measure that is less sensitive to sample size.

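The crossover and effect-size computations can be sketched with a fixed-bandwidth Gaussian KDE. The bandwidth, grid, and function names below are illustrative choices, not the paper's exact estimator settings:

```python
import numpy as np

def cohens_d(a, b):
    """Pooled-standard-deviation effect size between two samples."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def kde_crossover(intra, inter, bw=0.02, grid=None):
    """Grid point where the intra-CPA KDE density first overtakes the
    inter-CPA density (negative-to-positive sign change)."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 2001)

    def kde(sample, x):
        # Fixed-bandwidth Gaussian kernel density evaluated on the grid.
        z = (x[:, None] - np.asarray(sample)[None, :]) / bw
        return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * bw * np.sqrt(2 * np.pi))

    diff = kde(intra, grid) - kde(inter, grid)
    idx = np.nonzero((diff[:-1] < 0) & (diff[1:] >= 0))[0]
    return grid[idx[0] + 1] if len(idx) else None
```

With synthetic intra-CPA cosines concentrated near 1 and a broader inter-CPA sample, the crossover lands between the two modes, mirroring how the 0.837 boundary is located.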
<!-- TABLE XI: Held-Out vs Calibration Firm A Capture Rates and Generalization Test

| Rule | Calibration 70% fold (CI) | Held-out 30% fold (CI) | 2-prop z | p | k/n calib | k/n held |
|------|---------------------------|-------------------------|----------|---|-----------|----------|
| cosine > 0.837 | 99.94% [99.91%, 99.96%] | 99.93% [99.87%, 99.96%] | +0.31 | 0.756 n.s. | 45,087/45,116 | 15,321/15,332 |
| cosine > 0.9407 (calib-fold P5) | 94.99% [94.79%, 95.19%] | 95.63% [95.29%, 95.94%] | -3.19 | 0.001 | 42,856/45,116 | 14,662/15,332 |
| cosine > 0.945 (2D GMM marginal) | 93.77% [93.54%, 93.99%] | 94.78% [94.41%, 95.12%] | -4.54 | <0.001 | 42,305/45,116 | 14,531/15,332 |
| cosine > 0.950 | 92.14% [91.89%, 92.38%] | 93.61% [93.21%, 93.98%] | -5.97 | <0.001 | 41,570/45,116 | 14,352/15,332 |
| dHash_indep ≤ 5 | 82.96% [82.61%, 83.31%] | 87.84% [87.31%, 88.34%] | -14.29 | <0.001 | 37,430/45,116 | 13,467/15,332 |
| dHash_indep ≤ 8 | 94.84% [94.63%, 95.04%] | 96.13% [95.82%, 96.43%] | -6.45 | <0.001 | 42,788/45,116 | 14,739/15,332 |
| dHash_indep ≤ 9 (calib-fold P95) | 96.65% [96.48%, 96.81%] | 97.48% [97.22%, 97.71%] | -5.07 | <0.001 | 43,604/45,116 | 14,945/15,332 |
| dHash_indep ≤ 15 | 99.83% [99.79%, 99.87%] | 99.84% [99.77%, 99.89%] | -0.31 | 0.754 n.s. | 45,040/45,116 | 15,308/15,332 |
| cosine > 0.95 AND dHash_indep ≤ 8 | 89.40% [89.12%, 89.68%] | 91.54% [91.09%, 91.97%] | -7.60 | <0.001 | 40,335/45,116 | 14,035/15,332 |

Calibration-fold thresholds: Firm A cosine median = 0.9862, P1 = 0.9067, P5 = 0.9407; dHash_indep median = 2, P95 = 9. All counts and z/p values are reproducible from `signature_analysis/24_validation_recalibration.py` (seed = 42).
-->

Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.

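The interval and test arithmetic behind Table XI can be reproduced in a few lines. This is a self-contained sketch with our own function names; the paper's Script 24 computes the same quantities from the fold output:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def two_prop_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic (calibration vs held-out fold)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

For the cosine $> 0.950$ row, `wilson_ci(41570, 45116)` gives approximately (0.9189, 0.9238) and `two_prop_z(41570, 45116, 14352, 15332)` gives approximately -5.97, matching the table.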
The corresponding rule constants in `signature_analysis/24_validation_recalibration.py`:

```python
FIRM_A = '勤業眾信聯合'
SEED = 42

# Rules of interest for held-out vs calib comparison.
# 0.9407 = calibration-fold P5 of the Firm A cosine distribution
# (see Script 21 / Section III-K) and is included so Table XI of the
# paper can report calib- and held-fold rates for the same rule set.
COS_RULES = [0.837, 0.9407, 0.945, 0.95]
DH_RULES = [5, 8, 9, 15]
# Dual rule (the paper's classifier's operational dual).
DUAL_RULES = [(0.95, 8), (0.945, 8)]
```
