Paper A v3.4: resolve codex round-3 major-revision blockers
Three blockers from codex gpt-5.4 round-3 review (codex_review_gpt54_v3_3.md):
B1 Classifier vs three-method threshold mismatch
- Methodology III-L rewritten to make explicit that the per-signature
classifier and the accountant-level three-method convergence operate
at different units (signature vs accountant) and are complementary
rather than substitutable.
- Added Results IV-G.3 + Table XII operational-threshold sensitivity:
cos>0.95 vs cos>0.945 shifts dual-rule capture by 1.19 pp on the
whole Firm A sample; ~5% of signatures flip at the Uncertain/Moderate
boundary.
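The flip-rate arithmetic behind the sensitivity table can be sketched as follows. This is illustrative only: `cos_scores` stands in for the per-signature cosine similarities, and the function names are not part of the actual analysis scripts.

```python
import numpy as np

def flip_rate(cos_scores, t_hi=0.95, t_lo=0.945):
    """Fraction of signatures whose `cos > t` outcome flips between cutoffs.

    A signature flips when t_lo < cos <= t_hi: it passes the relaxed
    rule but not the strict one.
    """
    s = np.asarray(cos_scores)
    return float(np.mean((s > t_lo) & (s <= t_hi)))

def capture_shift_pp(cos_scores, t_hi=0.95, t_lo=0.945):
    """Percentage-point change in capture when the cutoff is relaxed."""
    s = np.asarray(cos_scores)
    return 100.0 * float(np.mean(s > t_lo) - np.mean(s > t_hi))
```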
B2 Held-out validation false "within Wilson CI" claim
- Script 24 recomputes both calibration-fold and held-out-fold rates
with Wilson 95% CIs and a two-proportion z-test on each rule.
- Table XI replaced with the proper fold-vs-fold comparison; prose
in Results IV-G.2 and Discussion V-C corrected: extreme rules agree
across folds (p>0.7); operational rules in the 85-95% band differ
by 1-5 pp due to within-Firm-A heterogeneity (random 30% sample
contained more high-replication C1 accountants), not generalization
failure.
B3 Interview evidence reframed as practitioner knowledge
- The Firm A "interviews" referenced throughout v3.3 are private,
informal professional conversations, not structured research
interviews. Reframed accordingly: all "interview*" references in
abstract / intro / methodology / results / discussion / conclusion
are replaced with "domain knowledge / industry-practice knowledge".
- This avoids overclaiming methodological formality and removes the
human-subjects research framing that triggered the ethics-statement
requirement.
- Section III-H four-pillar Firm A validation now stands on visual
inspection, signature-level statistics, accountant-level GMM, and
the three Section IV-H analyses, with practitioner knowledge as
background context only.
- New Section III-M ("Data Source and Firm Anonymization") covers
MOPS public-data provenance, Firm A/B/C/D pseudonymization, and
conflict-of-interest declaration.
Added signature_analysis/24_validation_recalibration.py for the recomputed
calib-vs-held-out z-tests and the classifier sensitivity analysis;
output in reports/validation_recalibration/.
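A minimal sketch of the statistics Script 24 recomputes, assuming per-rule success counts `k` out of `n` signatures for each fold; the function names are illustrative, not the script's actual API.

```python
import math

def wilson_ci(k, n, z=1.959964):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def two_prop_ztest(k1, n1, k2, n2):
    """Two-sided two-proportion z-test with pooled variance; returns (z, p)."""
    p1, p2 = k1 / n1, k2 / n2
    pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p
```

Applied per rule to the calibration-fold and held-out-fold counts, this yields exactly the fold-vs-fold comparison Table XI now reports.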
Pending (not in this commit): abstract length (368 -> 250 words),
Impact Statement removal, BD/McCrary sensitivity reporting, full
reproducibility appendix, references cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Third-Round Review of Paper A v3.3
**Overall Verdict: Major Revision**
v3.3 is substantially cleaner than v3.2. Most of the round-2 minor issues were genuinely fixed: the anonymization leak is gone, the BD/McCrary wording is now much more careful, the denominator and table-arithmetic errors were corrected, and the manuscript now explicitly distinguishes cosine-conditional from independent-minimum dHash. I do not recommend submission as-is, however, because three non-cosmetic problems remain. First, the central "three-method convergent thresholding" story is still not aligned with the operational classifier: the deployed rules in Section III-L use whole-sample Firm A heuristics (`0.95`, `5`, `15`, `0.837`) rather than the convergent accountant-level thresholds reported in Section IV-E. Second, the held-out Firm A validation section makes an objectively false numerical claim that the held-out rates match the whole-sample rates within the Wilson confidence intervals. Third, the paper relies on interview evidence from Firm A partners as a key calibration pillar but provides no human-subjects/ethics statement, no consent/exemption language, and almost no protocol detail. Those are fixable, but they are still submission-blocking.
**1. v3.2 Findings Follow-up Audit**
| Prior v3.2 finding | Status | v3.3 audit |
|---|---|---|
| Three-method convergence overclaim | `FIXED` | The paper now consistently states that the *KDE antimode plus the two mixture-based estimators* converge, while BD/McCrary does not produce an accountant-level transition; see [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11) and [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:15). |
| KDE method inconsistency | `FIXED` | The KDE crossover vs KDE antimode distinction is now explicit in [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:167), and the Results use the distinction correctly at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:29). |
| Unit-of-analysis clarity | `PARTIALLY-FIXED` | The signature/accountant distinction is much clearer at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:116), but Sections III-L and IV-F/IV-G still mix analysis levels and dHash statistics. The classifier is described with cosine-conditional dHash cutoffs at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:265), while the validation tables report only `dHash_indep` rules at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:165) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:221). |
| Accountant-level interpretation overstated | `FIXED` | The manuscript now consistently frames the accountant-level result as clustered but smoothly mixed, not sharply discrete; see [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:59), [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:28), and [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:17). |
| BD/McCrary rigor | `PARTIALLY-FIXED` | The overclaim is reduced and the limitation sentence is repaired at [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:103), but the paper still reports a fixed-bin implementation (`0.005` cosine bins) at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:175) without any reported bin-width sensitivity results or actual McCrary-style density-estimator output. |
| White 1982 overclaim | `FIXED` | Related Work now uses the narrower pseudo-true-parameter framing at [paper_a_related_work_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:72), consistent with Methods at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:192). |
| Firm A circular validation | `PARTIALLY-FIXED` | The 70/30 CPA-level split is now explicit at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:209), but the actual classifier still uses whole-sample Firm A-derived rules at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:268). The manuscript therefore overstates how fully the held-out fold breaks circularity. |
| `139 + 32` vs `180` discrepancy | `FIXED` | The `171 + 9 = 180` accounting is now internally consistent; see [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:122), [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:42), and [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:21). |
| dHash calibration story internally inconsistent | `PARTIALLY-FIXED` | The distinction between cosine-conditional and independent-minimum dHash is finally stated at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:265), but the Results still do not "report both" as promised at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:267). Tables IX and XI still report only `dHash_indep` rules at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:165) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:221). |
| Section IV-H.3 not threshold-independent | `FIXED` | The paper now correctly labels H.3 as a classifier-based consistency check rather than a threshold-free test; see [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:150), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:243), and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:336). |
| Table XVI numerical error | `FIXED` | The totals now reconcile: `83,970` single-firm reports plus `384` mixed-firm reports for `84,354` total at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:316). |
| Held-out Firm A denominator shift | `FIXED` | The `178`-CPA held-out denominator is now explicitly explained by two excluded disambiguation-tie CPAs at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:210). |
| Table numbering / cross-reference confusion | `PARTIALLY-FIXED` | The duplicate "Table VIII" phrasing is gone, but numbering still jumps from Table XI to Table XIII; see [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:214) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:251). |
| Real firm identities leaked in tables | `FIXED` | The manuscript now consistently uses `Firm A/B/C/D`; see [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:281) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:322). |
| Table X mixed unlike units while still reporting precision / F1 | `FIXED` | The paper now explicitly says precision and `F1` are not meaningful here and omits them; see [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:242) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:186). |
| "three independent statistical methods" wording | `FIXED` | The manuscript now uses "methodologically distinct" at [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:10) and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:161). |
| Abstract / conclusion / discussion still implied BD converged | `FIXED` | The relevant sections now explicitly separate the non-transition result from the convergent estimators; see [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:11), [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:27), and [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:16). |
| Stale "discrete behaviour" wording | `FIXED` | The current wording is appropriately narrowed at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:216) and [paper_a_conclusion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_conclusion_v3.md:17). |
| Related Work still overclaimed White 1982 | `FIXED` | The problematic sentence is gone; see [paper_a_related_work_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_related_work_v3.md:72). |
| Section III-H preview said "two analyses" | `FIXED` | It now correctly says "three analyses" at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:147). |
| Incorrect limitation sentence about BD/McCrary threshold-setting role | `FIXED` | The limitation is now correctly framed at [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:103). |
**2. New Findings in v3.3**
**Blockers**
- The paper still does not document the ethics status of the interview evidence that underwrites the Firm A calibration anchor. The interviews are not incidental; they are used in the Abstract, Introduction, Methods, Discussion, and Conclusion as one of the main justifications for identifying Firm A as replication-dominated; see [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:12), [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:52), and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:140). There is no statement about IRB/review-board approval, exemption, participant consent, number of interviewees, interview dates, or anonymization protocol. For IEEE Access this is not optional if the paper reports human-subject research.
- The operational classifier is still not the classifier implied by the paper's title and main thresholding narrative. Section III-I says the accountant-level estimates are the threshold reference used in classification at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:210), and Section IV-E says the primary accountant-level interpretation comes from the `0.973 / 0.979 / 0.976` convergence band (with `0.945 / 8.10` as a secondary cross-check) at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:148). But the actual five-way classifier in Section III-L uses `0.95`, `0.837`, and dHash cutoffs `5 / 15` from whole-sample Firm A heuristics at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:251) and [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:268). As written, the paper demonstrates convergent threshold *analysis*, but deploys a different heuristic classifier.
- The "held-out fold confirms generalization" claim is numerically false as written. The manuscript states that the held-out rates "match the whole-sample rates of Table IX within each rule's Wilson confidence interval" at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:230), and repeats the same idea in Discussion at [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:44). That is not true for several published rules. Examples: whole-sample `cosine > 0.95 = 92.51%` at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:163) is outside the held-out CI `[93.21%, 93.98%]` at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:219); whole-sample `dHash_indep ≤ 5 = 84.20%` at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:165) is outside `[87.31%, 88.34%]` at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:221); whole-sample dual-rule `89.95%` at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:168) is outside `[91.09%, 91.97%]` at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:225). This needs correction, not softening.
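The containment failures listed above can be checked mechanically from the quoted numbers alone; a minimal sketch (all rates in percent, taken directly from the cited table lines):

```python
def outside_ci(rate, lo, hi):
    """True when a point estimate falls outside a confidence interval."""
    return rate < lo or rate > hi

# (whole-sample rate, held-out CI lower, held-out CI upper), as quoted
checks = [
    (92.51, 93.21, 93.98),  # cosine > 0.95
    (84.20, 87.31, 88.34),  # dHash_indep <= 5
    (89.95, 91.09, 91.97),  # dual-rule
]
assert all(outside_ci(r, lo, hi) for r, lo, hi in checks)
```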
**Major Issues**
- The dHash statistic used by the deployed classifier remains ambiguous. Section III-L says the final classifier retains the *cosine-conditional* dHash cutoffs for continuity at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:267), but Tables IX and XI report only `dHash_indep` rules at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:165) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:221). Section III-L also promises that anchor-level analysis reports both cosine-conditional and independent-minimum rates, but the Results do not. This is still a material reproducibility and interpretation gap.
- The paper still overstates what the 70/30 split accomplishes. Section III-K promises that calibration-fold percentiles are derived from the 70% fold only at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:235), but Section III-L then says the classifier uses thresholds inherited from the *whole-sample* Firm A distribution at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:268). That means the held-out fold is not a fully external evaluation for the actual deployed classifier.
- The validation-metric story still overpromises in the Introduction and Impact Statement. The Introduction says the design includes validation using "precision, recall, `F_1`, and equal-error-rate metrics" at [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:28), but Methods and Results later state that precision and `F_1` are not meaningful here and that FRR/recall is only valid for the conservative byte-identical subset at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:242) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:203). The Impact Statement is even stronger, claiming the system "distinguishes genuinely hand-signed signatures from reproduced ones" at [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:8), which is not what a five-way confidence classifier with no full ground-truth test set has established.
- The claimed empirical check on the within-auditor-year no-mixing assumption is not actually a check on that assumption. Section III-G says the intra-report consistency analysis "provides an empirical check on the within-auditor-year assumption" at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:127). But Section IV-H.3 measures agreement between *two different signers on the same report* at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:312); it does not test whether the *same CPA* mixes signing mechanisms within a fiscal year.
- BD/McCrary is still the weakest statistical component and is not yet reported rigorously enough to sit as an equal methodological peer to the other two methods. The paper specifies a fixed bin width at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:175) and mentions a KDE bandwidth sensitivity check at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:170), but no actual sensitivity results, `Z`-statistics, p-values, or alternate-bin outputs are reported anywhere in Section IV. The narrative conclusions are probably directionally reasonable, but the evidentiary reporting is still thin.
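A bin-width sensitivity check of the kind asked for here could be as simple as the following crude sketch: a histogram-ratio stand-in, not a proper McCrary local-linear density estimator, and the bin widths are illustrative.

```python
import numpy as np

def density_jump(scores, cutoff, bin_width):
    """Log ratio of empirical density just above vs just below `cutoff`.

    Crude stand-in for a McCrary-style discontinuity statistic; a real
    implementation fits local linear densities on each side.
    """
    s = np.asarray(scores)
    below = np.mean((s >= cutoff - bin_width) & (s < cutoff))
    above = np.mean((s >= cutoff) & (s < cutoff + bin_width))
    eps = 1e-12
    return float(np.log((above + eps) / (below + eps)))

def binwidth_sensitivity(scores, cutoff, widths=(0.0025, 0.005, 0.01)):
    """The jump statistic recomputed across several candidate bin widths."""
    return {w: density_jump(scores, cutoff, w) for w in widths}
```

Reporting this table for the `0.005` default alongside halved and doubled widths would be a cheap way to address the robustness gap.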
- Reproducibility from the paper alone is still insufficient. Missing or under-specified items include the exact VLM prompt and parsing rules ([paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:45)), HSV thresholds for red-stamp removal ([paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:74)), sampling/randomization seeds for the 500-image YOLO annotation set, the 500,000 inter-class pairs, the 50,000 inter-CPA negatives, and the 70/30 Firm A split ([paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:36), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:185), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:209)), and the initialization/convergence/clipping details for the Beta and logit-GMM fits ([paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:184), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:191), [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:218)).
- Section III-H still contains one misleading sentence about H.1: it says the fixed `0.95` cutoff "is not calibrated to Firm A" at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:148), but Section IV-F explicitly says `0.95` and the dHash percentile rules are anchored to Firm A at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:174), and Section III-L says the classifier inherits thresholds from the whole-sample Firm A distribution at [paper_a_methodology_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_methodology_v3.md:268). Those statements need to be reconciled.
**Minor Issues**
- The table numbering still skips Table XII; the numbering jumps from Table XI at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:214) to Table XIII at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:251).
- The label `dHash_indep ≤ 5 (calib-fold median-adjacent)` at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:165) is still unclear. If the calibration-fold independent-minimum median is `2`, then `5` is not a transparent "median-adjacent" label.
- The references still need cleanup. At least `[27]` and `[31]`-`[36]` appear unused in the manuscript text, and the Mann-Whitney test is reported at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:50) without actually citing `[36]`.
**3. IEEE Access Fit Check**
- **Scope:** Yes. The topic fits IEEE Access well as a multidisciplinary methods paper spanning document forensics, computer vision, and audit-regulation applications.
- **Single-anonymized review:** IEEE Access uses single-anonymized review according to the current reviewer information page. The manuscript's use of `Firm A/B/C/D` is therefore not required for author anonymity, but it is acceptable as an entity-confidentiality choice.
- **Formatting / desk-return risks:** There are three concrete issues.
- The abstract is too long for current IEEE journal guidance. The IEEE Author Center says abstracts should be a single paragraph of up to 250 words, whereas the current abstract text at [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:5) through [paper_a_abstract_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_abstract_v3.md:14) is roughly 368 words by a plain word count.
- The paper includes a standalone `Impact Statement` section at [paper_a_impact_statement_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_impact_statement_v3.md:1). That is not a standard IEEE Access Regular Paper section and should be removed or relocated unless the target article type explicitly requires it.
- Because the manuscript relies on partner interviews, it also appears to require the human-subject research statement that IEEE journal guidance asks authors to include when applicable.
- **Official sources checked:** [IEEE Access submission guidelines](https://ieeeaccess.ieee.org/authors/submission-guidelines/), [IEEE Author Center article-structure guidance](https://journals.ieeeauthorcenter.ieee.org/create-your-ieee-journal-article/create-the-text-of-your-article/structure-your-article/), and [IEEE Access reviewer information](https://ieeeaccess.ieee.org/wp-content/uploads/2025/09/Reviewer-Information.pdf).
**4. Statistical Rigor Audit**
- The paper's main high-level statistical narrative is now mostly coherent. The "Firm A is replication-dominated but not pure" framing is supported by the combination of the `92.5%` signature-level rate, the `139 / 32` accountant-level split, and the unimodal-long-tail characterization; see [paper_a_introduction_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_introduction_v3.md:54), [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:123), and [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:41).
- The Hartigan dip test is now described correctly as a unimodality test, and the paper no longer treats non-rejection as a formal bimodality finding. That said, the text at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:71) still moves quickly from "`p = 0.17`" to a substantive "single dominant generative mechanism" reading. That interpretation is plausible, but it is still an inference supported by interviews and ancillary evidence, not something the dip test itself establishes.
- The accountant-level 1D thresholds are statistically described more carefully than before. The `0.973 / 0.979 / 0.976` cosine band is internally consistent across Abstract, Introduction, Results, Discussion, and Conclusion, and the text now correctly treats BD/McCrary non-transition as diagnostic rather than as failed thresholding.
- The main remaining statistical weakness is the disconnect between *where the methods converge* and *what thresholds the classifier actually uses*. If the final classifier remains `0.95 / 5 / 15 / 0.837`, then the three-method convergence analysis is supporting context, not operational threshold-setting. The manuscript needs to say that explicitly or change the classifier accordingly.
- The anchor-based validation is improved, especially because precision and `F_1` were removed and Wilson CIs were added. But the EER remains close to vacuous here: with 310 byte-identical positives all sitting near cosine `1.0`, the reported "`EER ≈ 0`" at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:188) is not very informative and should not be treated as a strong biometric-style performance result.
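Why the reported EER is near-vacuous can be seen from a minimal threshold-sweep sketch: when the positives cluster near cosine `1.0` and the negatives sit well below them, any threshold between the two groups yields zero error, so EER ≈ 0 carries almost no information. The implementation below is an illustrative approximation, not the paper's code.

```python
import numpy as np

def eer(pos_scores, neg_scores):
    """Approximate equal-error rate via a threshold sweep.

    At each candidate threshold t, FRR is the share of positives
    rejected (score < t) and FAR the share of negatives accepted
    (score >= t); the EER is approximated by min over t of max(FRR, FAR).
    """
    pos, neg = np.asarray(pos_scores), np.asarray(neg_scores)
    best = 1.0
    for t in np.sort(np.concatenate([pos, neg])):
        frr = np.mean(pos < t)
        far = np.mean(neg >= t)
        best = min(best, max(frr, far))
    return float(best)
```

With separated score distributions the sweep trivially returns 0, which is the situation the byte-identical anchor set creates.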
**5. Anonymization Check**
- Within the reviewed manuscript sections, I do **not** see any explicit real firm names or real auditor names. Firms are consistently pseudonymized as `Firm A/B/C/D`; see [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:281) and [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:322).
- I also do **not** see author/institution metadata in the reviewed section files. From a single-anonymized IEEE Access standpoint, there is no obvious explicit anonymization leak in the manuscript text provided for review.
- The one caveat is inferential rather than explicit: the combination of interview-based knowledge, Big-4 status, and distinctive cross-firm statistics may allow knowledgeable local readers to guess which firm is Firm A. That is not an explicit leak, but if firm confidentiality matters beyond mere pseudonymization, the authors should be aware of the residual identifiability risk.
**6. Numerical Consistency**
- The major cross-section numbers are now mostly consistent:
- `90,282` reports / `182,328` signatures / `758` CPAs are aligned across Abstract, Introduction, Methods, and Conclusion.
- Firm A's `171` analyzable CPAs, `9` excluded CPAs, and `139 / 32` accountant-level split are aligned across Introduction, Results, Discussion, and Conclusion.
- The partner-ranking `95.9%` top-decile share and the intra-report `89.9%` agreement are aligned between Methods and Results.
- Table XVI and Table XVII arithmetic now reconciles.
- The remaining numerical inconsistency is the held-out-validation sentence discussed above. The underlying table counts are internally consistent, but the prose interpretation at [paper_a_results_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_results_v3.md:230) and [paper_a_discussion_v3.md](/Volumes/NV2/pdf_recognize/paper/paper_a_discussion_v3.md:44) is not.
- A second consistency problem is metric-level rather than arithmetic: the classifier is described in Section III-L using cosine-conditional dHash cutoffs, while the validation tables are reported in independent-minimum dHash. That numerical comparison is not apples-to-apples until the paper states clearly which statistic drives Table XVII.
**7. Reproducibility**
- The paper is **not yet replicable from the manuscript alone**.
- Missing items that should be added before submission:
- Exact VLM prompt, output format, and page-selection parse rule.
- YOLO training hyperparameters beyond epoch count and split ratio, plus inference confidence/NMS thresholds.
- HSV stamp-removal thresholds.
- Exact matching/disambiguation rules for CPA assignment ties.
- Random seeds and selection rules for the 500-page annotation sample, the 500,000 inter-class pairs, the 50,000 inter-CPA negatives, and the 70/30 Firm A split.
- EM/Beta/logit-GMM initialization, stopping criteria, handling of boundary values for the logit transform, and software/library versions for the mixture fits.
- Sensitivity-analysis results for KDE bandwidth and any analogous robustness checks for the BD/McCrary binning choice.
- Interview protocol details and the "independent visual inspection" sample size / decision rule.
- I would not describe the current paper as reproducible "from the paper alone" yet. It is closer than v3.2, but it still depends on undocumented implementation choices.
**Bottom Line**
v3.3 is close, and most of the v3.2 cleanup work landed correctly. But before IEEE Access submission, I would require: (1) a clean reconciliation between the three-method threshold story and the actual classifier, (2) correction of the false held-out-validation claim, and (3) an explicit ethics/human-subjects statement plus minimal protocol disclosure for the interview evidence. Once those are fixed, the paper is much closer to minor-revision territory.