Phase 6 round-3 codex-review fixes: blockers + majors + minors
Resolved Codex review (gpt-5.5 xhigh) findings against b6913d2.
BLOCKERS:
- Appendix B reference mismatch: rewrote all main-text "Appendix B" references
to "supplementary materials" since Appendix B is now a redirect stub. Affected
the SSIM design-argument pointer, threshold provenance, byte-level
decomposition, MC band capture-rate, and backbone-ablation table references
across §III-F / §III-H.1 / §III-H.2 / §III-K / §III-L.4 / §III-M / §IV-F /
§IV-J / §IV-K / §IV-L / §V-C / §V-H.
- Table rendering: un-commented Tables I-IV (Dataset Summary, YOLO Detection,
Extraction Results, Cosine Distribution Statistics) which were inside HTML
comment blocks and would not have rendered in the submission.
- Table numbering out of order: Table XIX appeared before Tables XVI-XVIII.
Renumbered XIX -> XVI (document-level worst-case counts), XVI -> XVII (Firm ×
K=3 cross-tab), XVII -> XVIII (K=3 component comparison), XVIII -> XIX
(Spearman correlation). Cross-references updated in §IV-J / §IV-K and §V-C.
- Table V mis-citation: §IV-C said "KDE crossover ... (Table V)" but Table V is
the dip test. Dropped the (Table V) tag; crossover is a textual finding.
- Submission cleanup: wrapped the archived Impact Statement section heading and
body inside the existing HTML comment (was rendering). Funding placeholder
wrapped in HTML comment with a TO-DO note (won't render but is preserved as
reminder).
MAJORS:
- Line 1077 numerical conflation: rewrote the §V-C / §III-L.4 paragraph that
labelled Firm A's per-document HC+MC inter-CPA proxy ICCR of 0.6201 as a rate
"on real same-CPA pools." 0.6201 is a counterfactual proxy under inter-CPA
candidate-pool replacement, not the observed rate. Added explicit disambiguation:
the corresponding observed rate from Table XVI (formerly XIX) is 97.5%
HC+MC for Firm A; the proxy and observed rates measure different quantities.
- Residual "validation" language softened: "Dual-descriptor verification" ->
"Dual-descriptor similarity"; "we validate the backbone choice" -> "we
support the backbone choice"; "pixel-identity validation" -> "pixel-identity
positive-anchor check"; "## M. Validation Strategy and Limitations under
Unsupervised Setting" -> "## M. Unsupervised Diagnostic Strategy and Limits".
- "Specificity behaviour" overclaim: "characterises the cosine threshold's
specificity behaviour" -> "specificity-proxy behaviour" (methodology §III-L.0
and discussion §V-F).
- "Prior published / prior calibration" ambiguity: replaced "prior published
per-comparison rate" with "the corpus-wide rate reported in §IV-I"; replaced
"(prior published operating point)" with "(alternative operating point from
supplementary calibration evidence)" in Table XXI; replaced "prior reporting
and the existing literature" with "the existing literature and the
supplementary calibration evidence."
MINORS:
- Line 116 Bayes-optimal qualifier: "the local density minimum ... is the
Bayes-optimal decision boundary under equal priors" -> "In idealized
two-class mixture settings with equal priors and equal misclassification
costs, the local density minimum ... coincides with the Bayes-optimal
decision boundary."
- Stale section refs: §V-G for the fine-tuning caveat retargeted to §V-H
Engineering-level caveats (where it lives after the §V-H reorganisation);
§III-L for the worst-case rule retargeted to §III-H.1; "Section IV-D.2"
(nonexistent) retargeted to "Section IV-D Table VI."
- Abstract / Introduction "after pool-size adjustment": separated the
document-level D2 proxy ICCR claim from the per-signature logistic regression
claim. Now: "Per-document D2 inter-CPA proxy ICCRs differ by an order of
magnitude across firms ... a per-signature logistic regression confirms the
firm gap persists after pool-size control."
NIT:
- Related Work HTML comment "(see paper_a_references_v3.md for full list)"
-> "(full list in the References section)"; removes the version-coded
filename reference from the source.
Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1312 lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+19
-19
@@ -16,7 +16,8 @@ We note that Table II reports validation-set metrics, as no separate hold-out te
However, the subsequent production deployment provides a practical consistency check: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), with an average of 2.14 signatures per document, consistent with the standard practice of two certifying CPAs per audit report.
The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

-<!-- TABLE III: Extraction Results
+**Table III.** Extraction Results.
+
| Metric | Value |
|--------|-------|
| Documents processed | 86,071 |
@@ -25,7 +26,6 @@ The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliabi
| Avg. signatures per document | 2.14 |
| CPA-matched signatures | 168,755 (92.6%) |
| Processing rate | 43.1 docs/sec |
--->

The Big-4 subset of the detection output yields 150,442 signatures with both descriptors (cosine and independent dHash) successfully computed; this is the per-signature population used in the primary analyses of §IV-D through §IV-J.
@@ -35,7 +35,8 @@ Fig. 2 presents the cosine similarity distributions computed over the full set o
This all-pairs analysis is a different unit from the per-signature best-match statistics used in Sections IV-D onward; we report it first because it supplies the reference point for the KDE crossover used in per-document classification (Section III-L).
Table IV summarizes the distributional statistics.

-<!-- TABLE IV: Cosine Similarity Distribution Statistics
+**Table IV.** Cosine Similarity Distribution Statistics.
+
| Statistic | Intra-class | Inter-class |
|-----------|-------------|-------------|
| N (pairs) | 41,352,824 | 500,000 |
@@ -44,13 +45,12 @@ Table IV summarizes the distributional statistics.
| Median | 0.836 | 0.774 |
| Skewness | −0.711 | −0.851 |
| Kurtosis | 0.550 | 1.027 |
--->

Both distributions are left-skewed and leptokurtic.
Shapiro-Wilk and Kolmogorov-Smirnov tests rejected normality for both ($p < 0.001$), confirming that parametric thresholds based on normality assumptions would be inappropriate.
Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; the subsequent distributional diagnostics in Section IV-D are produced via the methods of Section III-I to avoid single-family distributional assumptions.

-The KDE crossover---where the two density functions intersect---was located at 0.837 (Table V).
+The KDE crossover---where the two density functions intersect---was located at 0.837.
Under equal prior probabilities and equal misclassification costs, this crossover is a candidate decision boundary between the two classes; we adopt it only as the operational LH/UN boundary in §III-H.1, not as a natural distributional threshold.
Statistical tests confirmed significant separation between the two distributions (Cohen's $d = 0.669$, Mann-Whitney [36] $p < 0.001$, K-S 2-sample $p < 0.001$).
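As an aside on the mechanics in the hunk above: a KDE crossover of the kind the paper locates at 0.837 is simply the point where the two estimated density curves intersect. A minimal sketch of how such a point can be found numerically, using synthetic stand-in samples (not the paper's data) and SciPy's Gaussian KDE:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic stand-ins for intra-class and inter-class cosine samples.
intra = rng.normal(0.83, 0.05, 5000)
inter = rng.normal(0.75, 0.06, 5000)

kde_intra = gaussian_kde(intra)
kde_inter = gaussian_kde(inter)

# Scan a grid and take sign changes of the density difference as crossovers.
grid = np.linspace(0.5, 1.0, 2001)
diff = kde_intra(grid) - kde_inter(grid)
crossings = grid[np.where(np.diff(np.sign(diff)) != 0)[0]]
```

With unequal variances there can be more than one intersection; the operationally relevant one lies between the two modes.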
@@ -149,7 +149,7 @@ The three scores agree on placing Firm A as the most replication-dominated and t
| deployed binary high-confidence box rule vs per-signature K=3 hard label | 0.559 |
| Per-CPA K=3 hard label vs per-signature K=3 hard label | 0.870 |

-(Source: Script 39.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) retains its prior calibration and capture-rate evidence (Appendix B; cross-referenced in §IV-J).
+(Source: Script 39.) Per-signature K=3 components ($n = 150{,}442$) sorted by ascending cosine: $(0.928, 9.75, 0.146)$ / $(0.963, 6.04, 0.582)$ / $(0.989, 1.27, 0.272)$, an absolute cosine drift of $0.018$ in C1 and $0.006$ in C3 relative to the per-CPA fit. These convergence checks cover only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); the five-way classifier's moderate-confidence band ($5 < \text{dHash} \leq 15$) retains its prior calibration and capture-rate evidence (supplementary materials; cross-referenced in §IV-J).

## G. Leave-One-Firm-Out Reproducibility

@@ -225,11 +225,11 @@ This section reports the five-way per-signature + document-level worst-case clas
| Firm C | 23.75% | 41.44% | 0.38% | 34.21% | 0.22% | 38,613 |
| Firm D | 24.51% | 29.33% | 0.22% | 45.65% | 0.29% | 17,133 |

-(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVI: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVI). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).
+(Source: Script 42 per-firm cross-tab.) The per-firm pattern qualitatively aligns with the K=3 cluster cross-tab of Table XVII: Firm A's signatures concentrate in the HC band (81.70%) while its CPAs concentrate at the accountant level in the K=3 C3 (high-cos / low-dHash) component (82.46%; Table XVII). These two figures address different units (per-signature classification vs per-CPA hard cluster assignment) and are not directly comparable as a like-for-like consistency check; we report the qualitative alignment but do not infer a numerical equivalence. The three non-Firm-A Big-4 firms have markedly lower HC rates than Firm A and substantially higher Uncertain rates, with Firm D having the highest Uncertain rate (45.65%).

-**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the worst-case rule (HC > MC > HSC > UN > LH; §III-L), applied to the Big-4 sub-corpus.
+**Document-level worst-case aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the worst-case rule (HC > MC > HSC > UN > LH; §III-H.1), applied to the Big-4 sub-corpus.

-**Table XIX.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs.
+**Table XVI.** Document-level worst-case category counts, Big-4 sub-corpus, $n = 75{,}233$ unique PDFs.

| Category | Long name | $n$ documents | % |
|---|---|---|---|
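The worst-case rule referenced in the hunk above (severity order HC > MC > HSC > UN > LH) amounts to taking the maximum-severity label among a document's signatures. A minimal sketch, using the paper's label names but a hypothetical implementation:

```python
# Severity order from the paper's worst-case rule: HC > MC > HSC > UN > LH.
SEVERITY = {"HC": 4, "MC": 3, "HSC": 2, "UN": 1, "LH": 0}

def document_label(signature_labels):
    """Aggregate per-signature labels to one document label: worst case wins."""
    return max(signature_labels, key=SEVERITY.__getitem__)
```

For a typical two-signature report, `document_label(["LH", "MC"])` yields `"MC"`.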
@@ -252,9 +252,9 @@ This section reports the five-way per-signature + document-level worst-case clas

(Source: Script 42; mixed-firm PDFs $n = 379$ excluded from the per-firm rows but included in the overall counts above.)

-The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains its prior calibration (Appendix B); it is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The capture-rate calibration evidence for the moderate band is reported in Appendix B and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).
+The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$) retains its prior calibration (supplementary materials); it is **not separately re-characterised by Scripts 38–40**, which checked only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$). The moderate-band cuts are not re-derived on the Big-4 subset; we report the Table XV per-firm MC proportions (10.76% / 35.88% / 41.44% / 29.33% across Firms A through D) descriptively. The capture-rate calibration evidence for the moderate band is reported in the supplementary materials and not regenerated on the Big-4 subset. We do not claim that the MC-band per-firm ordering above is a separate validation of the §III-K Spearman convergence, since MC occupancy is not a monotone function of the per-CPA less-replication-dominated ranking (e.g., Firm D's MC fraction is lower than Firm B's while Firm D's reverse-anchor score ranks it as less replication-dominated than Firm B).

-**Table XVI.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.
+**Table XVII.** Firm × K=3 cluster cross-tabulation, Big-4 sub-corpus.

| Firm | $n$ | C1 (low-cos / high-dHash) | C2 (central) | C3 (high-cos / low-dHash) | C1 % | C3 % |
|---|---|---|---|---|---|---|
@@ -265,13 +265,13 @@ The five-way **moderate-confidence non-hand-signed** band (cos $> 0.95$ AND $5 <

(Source: Script 35.) The cross-tab is the accountant-level descriptive output of the K=3 mixture (§III-J / §IV-E). It is reported here as a complement to the five-way per-signature classifier (Table XV), not as an operational classifier output. Reading: Firm A's CPAs are concentrated in the C3 (high-cos / low-dHash) component (no Firm A CPAs in C1); Firm C has the highest C1 (low-cos / high-dHash) concentration of the Big-4 (C1 fraction $23.5\%$); Firms B and D sit between A and C on the K=3 hard-label ordering, broadly consistent with the per-firm Spearman ordering of Table X (with the within-Big-4-non-A reverse-anchor disagreement noted there).

-**Document-level worst-case aggregation outputs are reported in Table XIX above.**
+**Document-level worst-case aggregation outputs are reported in Table XVI above.**

## K. Full-Dataset Robustness (light scope)

-This section reports the reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + deployed operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the K=3 + deployed-rule convergence reproduces at the wider scope. The §III-H.1 five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band retains its prior calibration (Appendix B; §IV-J).
+This section reports the reproducibility cross-check at the full accountant scope ($n = 686$ CPAs, Big-4 plus mid/small firms). The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + deployed operational-rule per-CPA less-replication-dominated rate analysis, sufficient to demonstrate that the K=3 + deployed-rule convergence reproduces at the wider scope. The §III-H.1 five-way classifier and the §IV-G LOOO analyses are not re-run at the full scope. The five-way moderate-confidence band retains its prior calibration (supplementary materials; §IV-J).

-**Table XVII.** K=3 component comparison, Big-4 sub-corpus vs full dataset.
+**Table XVIII.** K=3 component comparison, Big-4 sub-corpus vs full dataset.

| K=3 component | Big-4 (n=437) cos / dHash / weight | Full (n=686) cos / dHash / weight | Drift Big-4 → Full |
|---|---|---|---|
@@ -281,7 +281,7 @@ This section reports the reproducibility cross-check at the full accountant scop

(Source: Script 41; full-dataset $\text{BIC}(K{=}3) = -792.31$ vs Big-4 $\text{BIC}(K{=}3) = -1111.93$; BIC values are not directly comparable across different $n$ and are reported only for completeness.)

-**Table XVIII.** Spearman rank correlation between K=3 P(C1) and deployed operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset.
+**Table XIX.** Spearman rank correlation between K=3 P(C1) and deployed operational less-replication-dominated rate, Big-4 sub-corpus vs full dataset.

| Scope | $n$ CPAs | Spearman $\rho$ (P(C1) vs deployed less-replication-dominated rate) | $p$-value |
|---|---|---|---|
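The Spearman correlation reported in the table above is a rank correlation between two per-CPA quantities. A minimal sketch of the computation with toy values standing in for P(C1) and the deployed rate (not the paper's data):

```python
import numpy as np
from scipy.stats import spearmanr

# Toy per-CPA values standing in for P(C1) and the deployed
# less-replication-dominated rate.
p_c1 = np.array([0.05, 0.40, 0.22, 0.31, 0.10, 0.55])
deployed_rate = np.array([0.08, 0.35, 0.25, 0.30, 0.12, 0.50])

# Spearman rho is the Pearson correlation of the two rank vectors.
rho, p_value = spearmanr(p_c1, deployed_rate)
```

Here the two toy vectors induce identical rankings, so rho is 1.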
@@ -297,9 +297,9 @@ This section reports the reproducibility cross-check at the full accountant scop

To support the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim).
All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization.
-The comparison summary is reported in Appendix B (the backbone-ablation table; not the same table as Table XVIII in this section, which reports Big-4 vs full-dataset Spearman drift in §IV-K).
+The comparison summary is reported in the supplementary materials (backbone-ablation table; not the same table as Table XIX in this section, which reports Big-4 vs full-dataset Spearman drift in §IV-K).

-<!-- BACKBONE ABLATION TABLE (rendered in Appendix B):
+<!-- BACKBONE ABLATION TABLE (rendered in supplementary materials):
| Metric | ResNet-50 | VGG-16 | EfficientNet-B0 |
|--------|-----------|--------|-----------------|
| Feature dim | 2048 | 4096 | 1280 |
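On the "identical preprocessing and L2 normalization" mentioned in the hunk above: once embeddings are scaled to unit L2 norm, cosine similarity reduces to a plain dot product. A minimal sketch with random vectors standing in for backbone features (dimensions match ResNet-50's 2048, data is synthetic):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row vector to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 2048))  # stand-ins for two ResNet-50 embeddings
a, b = l2_normalize(feats)

# Both vectors are unit-norm, so the dot product IS the cosine similarity.
cosine = float(a @ b)
```

Normalizing once at extraction time makes downstream all-pairs cosine computations a matrix multiply.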
@@ -352,7 +352,7 @@ This section consolidates the empirical results that support the §III-L anchor-

| Threshold | Per-comparison ICCR | 95% Wilson CI |
|---|---|---|
-| cos $> 0.945$ (prior published operating point) | $0.00081$ | $[0.00073, 0.00089]$ |
+| cos $> 0.945$ (alternative operating point from supplementary calibration evidence) | $0.00081$ | $[0.00073, 0.00089]$ |
| cos $> 0.95$ (deployed operating point) | $0.00060$ | $[0.00053, 0.00067]$ |
| cos $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| cos $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
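The 95% Wilson intervals in the table above follow the standard score-interval formula for a binomial proportion. A minimal sketch (not the paper's code; the example counts are hypothetical, chosen only to land near the table's magnitudes):

```python
import math

def wilson_ci(k, n, z=1.959963984540054):
    """95% Wilson score interval for a binomial proportion k / n."""
    p = k / n
    denom = 1.0 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

lo, hi = wilson_ci(60, 100_000)  # hypothetical counts, p-hat = 0.0006
```

Unlike the Wald interval, the Wilson interval stays inside [0, 1] and behaves well at the very small proportions reported here.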
@@ -393,7 +393,7 @@ Decile trend is broadly monotone in pool size with two minor reversals (decile 5
| D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |

-Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XIX's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs: all $379$ are $1{:}1$ Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XIX's per-firm rows sum to $74{,}854 = 75{,}233 - 379$.
+Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$). The Firm C denominator $n = 19{,}501$ exceeds Table XVI's single-firm Firm C count of $19{,}122$ by exactly the $379$ mixed-firm PDFs: all $379$ are $1{:}1$ Firm C / Firm D mixed-firm documents, and Script 45's mode-of-firms implementation (`np.argmax` over `np.unique`'s alphabetically-sorted firm counts) returns the first-sorted firm on ties, which assigns these tied documents to Firm C rather than to Firm D. The four per-firm denominators here therefore sum to the full $75{,}233$, whereas Table XVI's per-firm rows sum to $74{,}854 = 75{,}233 - 379$.

### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)

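The tie-breaking behaviour attributed to Script 45 above is reproducible in isolation: `np.unique` returns its values sorted ascending, and `np.argmax` returns the first index among tied maxima, so a 1:1 two-firm document resolves to the alphabetically first firm name. A minimal demonstration (firm names are placeholders, not the paper's anonymised labels):

```python
import numpy as np

# Two signatures on one document, one from each firm: a 1:1 tie.
firms = np.array(["Firm D", "Firm C"])

values, counts = np.unique(firms, return_counts=True)
# np.unique sorts ascending, so values = ['Firm C', 'Firm D'], counts = [1, 1].
mode_firm = values[np.argmax(counts)]  # argmax takes the FIRST maximum on ties
```

The tie therefore goes to "Firm C", which is exactly why the 379 mixed-firm PDFs land in the Firm C denominator.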