Paper A v3.3: apply codex v3.2 peer-review fixes

Codex (gpt-5.4) second-round review recommended 'minor revision'. This commit addresses all issues flagged in that review.

## Structural fixes

- dHash calibration inconsistency (codex #1, most important): Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come from the whole-sample Firm A cosine-conditional dHash distribution (median=5, P95=15), not from the calibration-fold independent-minimum dHash distribution (median=2, P95=9) which we report elsewhere as descriptive anchors. Added explicit note about the two dHash conventions and their relationship.
- Section IV-H framing (codex #2): Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence" to "Additional Firm A Benchmark Validation" and clarified in the section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully threshold-free, H.3 uses the calibrated classifier. H.3's concluding sentence now says "the substantive evidence lies in the cross-firm gap" rather than claiming the test is threshold-free.
- Table XVI 93,979 typo fixed (codex #3): Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm).
- Held-out Firm A denominator 124+54=178 vs 180 (codex #4): Added explicit note that 2 CPAs were excluded due to disambiguation ties in the CPA registry.
- Table VIII duplication (codex #5): Removed the duplicate accountant-level-only Table VIII comment; the comprehensive cross-level Table VIII subsumes it. Text now says "accountant-level rows of Table VIII (below)".
- Anonymization broken in Tables XIV-XVI (codex #6): Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/"Firm D" across Tables XIV, XV, XVI. Table and caption language updated accordingly.
- Table X unit mismatch (codex #7): Dropped precision, recall, F1 columns. Table now reports FAR (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR (against the byte-identical positive anchor). III-K and IV-G.1 text updated to justify the change.

## Sentence-level fixes

- "three independent statistical methods" in Methodology III-A -> "three methodologically distinct statistical methods".
- "three independent methods" in Conclusion -> "three methodologically distinct methods".
- Abstract "~0.006 converging" now explicitly acknowledges that BD/McCrary produces no significant accountant-level discontinuity.
- Conclusion ditto.
- Discussion limitation sentence "BD/McCrary should be interpreted at the accountant level for threshold-setting purposes" rewritten to reflect v3.3 result that BD/McCrary is a diagnostic, not a threshold estimator, at the accountant level.
- III-H "two analyses" -> "three analyses" (H.1 longitudinal stability, H.2 partner ranking, H.3 intra-report consistency).
- Related Work White 1982 overclaim rewritten: "consistent estimators of the pseudo-true parameter that minimizes KL divergence" replaces "guarantees asymptotic recovery".
- III-J "behavior is close to discrete" -> "practice is clustered".
- IV-D.2 pivot sentence "discreteness of individual behavior yields bimodality" -> "aggregation over signatures reveals clustered (though not sharply discrete) patterns".

Target journal remains IEEE Access.
Output: Paper_A_IEEE_Access_Draft_v3.docx (395 KB).
Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 02:32:17 +08:00
parent 51d15b32a5
commit 5717d61dd4
8 changed files with 2352 additions and 71 deletions
@@ -91,7 +91,7 @@ Under the full-sample 2-component forced fit no Beta crossing is identified; the

 The joint reading of Sections IV-D.1 and IV-D.2 is unambiguous: *at the per-signature level, no two-mechanism mixture explains the data*.
 Non-hand-signed replication quality is a continuous spectrum, not a discrete class cleanly separated from hand-signing.
-This motivates the pivot to the accountant-level analysis in Section IV-E, where the discreteness of individual *behavior* (as opposed to pixel-level output quality) yields the bimodality that the signature-level analysis lacks.
+This motivates the pivot to the accountant-level analysis in Section IV-E, where aggregation over signatures reveals clustered (though not sharply discrete) patterns in individual-level signing *practice* that the signature-level analysis lacks.

 ## E. Accountant-Level Gaussian Mixture

@@ -123,20 +123,10 @@ First, of the 180 CPAs in the Firm A registry, 171 have $\geq 10$ signatures and
 Component C1 captures 139 of these 171 Firm A CPAs (81%) in a tight high-cosine / low-dHash cluster; the remaining 32 Firm A CPAs fall into C2.
 This split is consistent with the minority-hand-signers framing of Section III-H and with the unimodal-long-tail observation of Section IV-D.
 Second, the three-component partition is *not* a firm-identity partition: three of the four Big-4 firms dominate C2 together, and smaller domestic firms cluster into C3.
-Third, applying the three-method framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in Table VIII: KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary test does not produce a significant transition at the accountant level.
+Third, applying the three-method framework of Section III-I to the accountant-level cosine-mean distribution yields the estimates summarized in the accountant-level rows of Table VIII (below): KDE antimode $= 0.973$, Beta-2 crossing $= 0.979$, and the logit-GMM-2 crossing $= 0.976$ converge within $\sim 0.006$ of each other, while the BD/McCrary test does not produce a significant transition at the accountant level.
 For completeness we also report the two-dimensional two-component GMM's marginal crossings at cosine $= 0.945$ and dHash $= 8.10$; these differ from the 1D crossings because they are derived from the joint (cosine, dHash) covariance structure rather than from each 1D marginal in isolation.

-<!-- TABLE VIII-acct: Accountant-Level Three-Method Threshold Summary
-| Level / method | Cosine threshold | dHash threshold |
-|----------------|-------------------|------------------|
-| Method 1 (KDE antimode)           | 0.973 | 4.07 |
-| Method 2 (BD/McCrary)             | no transition | no transition |
-| Method 3 (Beta-2 EM crossing)     | 0.979 | 3.41 |
-| Method 3' (logit-GMM-2 crossing)  | 0.976 | 3.93 |
-| 2D GMM 2-comp marginal crossing   | 0.945 | 8.10 |
-->
-
-Table VIII then summarizes all threshold estimates produced by the three methods across the two analysis levels for a compact cross-level comparison.
+Table VIII summarizes all threshold estimates produced by the three methods across the two analysis levels for a compact cross-level comparison.

 <!-- TABLE VIII: Threshold Convergence Summary Across Levels
 | Level / method | Cosine threshold | dHash threshold |
@@ -193,29 +183,31 @@ We report three validation analyses corresponding to the anchors of Section III-

 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the gold-positive anchor.
 As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
-Table X reports precision, recall, $F_1$, FAR with Wilson 95% confidence intervals, and FRR at each candidate threshold.
-The Equal-Error-Rate point, interpolated at FAR $=$ FRR, is located at cosine $= 0.990$ with EER $\approx 0$, which is trivially small because pixel-identical positives are all at cosine very close to 1.
+Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs. random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision and $F_1$, which both depend on that prevalence, have no meaningful population interpretation; recall is prevalence-independent but is fully conveyed by its complement, FRR.
+We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor and FRR against the byte-identical positive anchor in Table X; these two error rates are well defined within their respective anchor populations.
+The Equal-Error-Rate point, interpolated at FAR $=$ FRR, is located at cosine $= 0.990$ with EER $\approx 0$, which is trivially small because every byte-identical positive falls at cosine very close to 1.

-<!-- TABLE X: Cosine Threshold Sweep (positives = 310 pixel-identical signatures; negatives = 50,000 inter-CPA pairs)
-| Threshold | Precision | Recall | F1 | FAR | FAR 95% Wilson CI | FRR |
-|-----------|-----------|--------|----|-----|-------------------|-----|
-| 0.837 (all-pairs KDE crossover) | 0.029 | 1.000 | 0.056 | 0.2062 | [0.2027, 0.2098] | 0.000 |
-| 0.900                            | 0.210 | 1.000 | 0.347 | 0.0233 | [0.0221, 0.0247] | 0.000 |
-| 0.945 (2D GMM marginal)         | 0.883 | 1.000 | 0.938 | 0.0008 | [0.0006, 0.0011] | 0.000 |
-| 0.950                            | 0.904 | 1.000 | 0.950 | 0.0007 | [0.0005, 0.0009] | 0.000 |
-| 0.973 (accountant KDE antimode) | 0.960 | 1.000 | 0.980 | 0.0003 | [0.0002, 0.0004] | 0.000 |
-| 0.979 (accountant Beta-2)       | 0.969 | 1.000 | 0.984 | 0.0002 | [0.0001, 0.0004] | 0.000 |
+<!-- TABLE X: Cosine Threshold Sweep (positives = 310 byte-identical signatures; negatives = 50,000 inter-CPA pairs)
+| Threshold | FAR | FAR 95% Wilson CI | FRR (byte-identical) |
+|-----------|-----|-------------------|----------------------|
+| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] | 0.000 |
+| 0.900                            | 0.0233 | [0.0221, 0.0247] | 0.000 |
+| 0.945 (2D GMM marginal)          | 0.0008 | [0.0006, 0.0011] | 0.000 |
+| 0.950                            | 0.0007 | [0.0005, 0.0009] | 0.000 |
+| 0.973 (accountant KDE antimode)  | 0.0003 | [0.0002, 0.0004] | 0.000 |
+| 0.979 (accountant Beta-2)        | 0.0002 | [0.0001, 0.0004] | 0.000 |
 -->
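The Wilson intervals in Table X can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' script: the function name `wilson_interval` is hypothetical, `z = 1.96` is the usual normal approximation for a 95% interval, and the false-accept count of 10,310 is an assumption back-computed from the reported FAR of 0.2062 on 50,000 negative pairs.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    return center - half_width, center + half_width


# Assumed count: 0.2062 * 50,000 = 10,310 false accepts at cosine 0.837.
n_negatives = 50_000
false_accepts = 10_310

far = false_accepts / n_negatives
ci_low, ci_high = wilson_interval(false_accepts, n_negatives)
print(f"FAR = {far:.4f}, 95% Wilson CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

Under these assumptions the interval rounds to the [0.2027, 0.2098] reported in the first row of Table X. The Wilson form is preferred over the plain normal approximation at the high-threshold rows, where the FAR is close to zero.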

 Two caveats apply.
 First, the gold-positive anchor is a *conservative subset* of the true non-hand-signed population: it captures only those non-hand-signed signatures whose nearest match happens to be byte-identical, not those that are near-identical but not bytewise identical.
-Perfect recall against this subset does not establish perfect recall against the broader positive class, and the reported recall should therefore be interpreted as a lower-bound calibration check rather than a generalizable recall estimate.
+Zero FRR against this subset does not establish zero FRR against the broader positive class, and the reported FRR should therefore be interpreted as a lower-bound calibration check on the classifier's ability to catch the clearest positives rather than a generalizable miss rate.
 Second, the 0.945 / 0.95 / 0.973 thresholds are derived from the Firm A calibration fold or the accountant-level methods rather than from this anchor set, so the FAR values in Table X evaluate thresholds that were neither fitted to nor tuned on this anchor set.
-The very low FAR at the accountant-level thresholds is therefore informative.
+The very low FAR at the accountant-level thresholds is therefore informative about specificity against a realistic inter-CPA negative population.

 ### 2) Held-Out Firm A Validation (breaks calibration-validation circularity)

 We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
+The 178 Firm A CPAs in the two folds fall two short of the 180 in the Firm A registry: two CPAs whose signatures could not be matched to a single assigned-accountant record, because of disambiguation ties in the CPA registry, are excluded from both folds.
+This exclusion has no effect on the accountant-level mixture analysis of Section IV-E, which uses the $\geq 10$-signature subset of 171 CPAs.
 Thresholds are re-derived from calibration-fold percentiles only.
 Table XI reports heldout-fold capture rates with Wilson 95% confidence intervals.
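
A CPA-level split of this kind assigns every signature of a given CPA to exactly one fold, so no signer contributes to both calibration and evaluation. The sketch below illustrates the idea only; the helper `split_by_cpa`, the identifiers, the seed, and the round-down convention for the 70% fold are all assumptions, since the text does not state how the split was implemented.

```python
import random


def split_by_cpa(cpa_ids, calibration_fraction=0.7, seed=0):
    """Partition unique CPA identifiers into calibration and held-out sets."""
    unique_ids = sorted(set(cpa_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_ids)
    n_calibration = int(calibration_fraction * len(unique_ids))  # rounds down
    return set(unique_ids[:n_calibration]), set(unique_ids[n_calibration:])


# Hypothetical identifiers standing in for the 178 Firm A CPAs in the two folds.
cpa_ids = [f"cpa_{i:03d}" for i in range(178)]
calibration_cpas, held_out_cpas = split_by_cpa(cpa_ids)

print(len(calibration_cpas), len(held_out_cpas))
```

With 178 CPAs and a round-down rule, the fold sizes come out as 124 and 54, matching the counts above. Signatures would then be routed to a fold by looking up their CPA identifier in one of the two sets.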

@@ -242,10 +234,13 @@ The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 91.54% [91.09%, 91.97%

 A 30-signature stratified visual sanity sample (six signatures each from pixel-identical, high-cos/low-dh, borderline, style-only, and likely-genuine strata) produced inter-rater agreement with the classifier in all 30 cases; this sample served only as a spot check and is not used to compute reported metrics.

-## H. Firm A Benchmark Validation: Threshold-Independent Evidence
+## H. Additional Firm A Benchmark Validation

 The capture rates of Section IV-F are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.
-This section reports three additional analyses that are *threshold-independent* in the sense that their findings do not depend on any cutoff we calibrate to Firm A, and therefore constitute genuine benchmark-validation evidence rather than a circular check.
+This section reports three complementary analyses that go beyond the whole-sample capture rates.
+Subsection H.2 is fully threshold-independent (it uses only ordinal ranking).
+Subsection H.1 uses a fixed 0.95 cutoff but derives information from the longitudinal stability of rates rather than from the absolute rate at any single year.
+Subsection H.3 applies the calibrated classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the informative quantity is the cross-firm *gap* rather than the absolute agreement rate at any one firm.

 ### 1) Year-by-Year Stability of the Firm A Left Tail

@@ -283,19 +278,19 @@ Firm A accounts for 1,287 of these (27.8% baseline share).
 Table XIV reports per-firm occupancy of the top $K\%$ of the ranked distribution.

 <!-- TABLE XIV: Top-K Similarity Rank Occupancy by Firm (pooled 2013-2023)
-| Top-K | k in bucket | Deloitte (Firm A) | KPMG | PwC | EY | Other/Non-Big-4 | Deloitte share |
-|-------|-------------|-------------------|------|-----|----|----|-----------------|
-| 10%   | 462         | 443               | 2    | 3   | 0  | 14 | 95.9% |
-| 25%   | 1,157       | 1,043             | 32   | 23  | 9  | 50 | 90.1% |
-| 50%   | 2,314       | 1,220             | 473  | 273 | 102| 246| 52.7% |
+| Top-K | k in bucket | Firm A | Firm B | Firm C | Firm D | Non-Big-4 | Firm A share |
+|-------|-------------|--------|--------|--------|--------|-----------|--------------|
+| 10%   | 462         | 443    | 2      | 3      | 0      | 14        | 95.9% |
+| 25%   | 1,157       | 1,043  | 32     | 23     | 9      | 50        | 90.1% |
+| 50%   | 2,314       | 1,220  | 473    | 273    | 102    | 246       | 52.7% |
 -->
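
The occupancy shares and concentration ratios follow directly from the counts in Table XIV. The snippet below is an illustrative recomputation rather than the authors' code: the dictionary layout and the `occupancy` helper are hypothetical, and the 27.8% baseline is taken as stated in the text.

```python
# Counts transcribed from Table XIV (pooled 2013-2023).
top_k_counts = {
    "10%": {"bucket_size": 462, "firm_a": 443},
    "25%": {"bucket_size": 1157, "firm_a": 1043},
    "50%": {"bucket_size": 2314, "firm_a": 1220},
}
baseline_share = 0.278  # Firm A share of all auditor-years, as stated in the text


def occupancy(firm_count, bucket_size, baseline):
    """Return (share of bucket held by the firm, share divided by baseline)."""
    share = firm_count / bucket_size
    return share, share / baseline


results = {
    k: occupancy(v["firm_a"], v["bucket_size"], baseline_share)
    for k, v in top_k_counts.items()
}
for k, (share, ratio) in results.items():
    print(f"top {k}: Firm A share = {share:.1%}, concentration ratio = {ratio:.2f}x")
```

A ratio of 1.0 would mean Firm A appears in a bucket in proportion to its baseline share, so values well above 1.0 in the top decile and quartile quantify the over-representation.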

 Firm A occupies 95.9% of the top 10% and 90.1% of the top 25% of auditor-years by similarity, against its baseline share of 27.8%: a concentration ratio of roughly 3.4$\times$ at the top decile and 3.2$\times$ at the top quartile.
-Year-by-year (Table XV), the top-10% Deloitte share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.
+Year-by-year (Table XV), the top-10% Firm A share ranges from 88.4% (2020) to 100% (2013, 2014, 2017, 2018, 2019), showing that the concentration is stable across the sample period.

-<!-- TABLE XV: Deloitte Share of Top-10% Similarity by Year
-| Year | N auditor-years | Top-10% k | Deloitte in top-10% | Deloitte share | Deloitte baseline |
-|------|-----------------|-----------|---------------------|----------------|-------------------|
+<!-- TABLE XV: Firm A Share of Top-10% Similarity by Year
+| Year | N auditor-years | Top-10% k | Firm A in top-10% | Firm A share | Firm A baseline |
+|------|-----------------|-----------|-------------------|--------------|-----------------|
 | 2013 | 324 | 32 | 32 | 100.0% | 26.2% |
 | 2014 | 399 | 39 | 39 | 100.0% | 27.1% |
 | 2015 | 394 | 39 | 38 | 97.4% | 27.2% |
@@ -309,7 +304,7 @@ Year-by-year (Table XV), the top-10% Deloitte share ranges from 88.4% (2020) to
 | 2023 | 474 | 47 | 46 | 97.9% | 28.5% |
 -->

-This over-representation is a direct consequence of firm-wide stamping practice and is not derived from any threshold we subsequently calibrate.
+This over-representation is a direct consequence of firm-wide non-hand-signing practice and is not derived from any threshold we subsequently calibrate.
 It therefore constitutes genuine cross-firm evidence for Firm A's benchmark status.

 ### 3) Intra-Report Consistency
@@ -318,27 +313,27 @@ Taiwanese statutory audit reports are co-signed by two engagement partners (a pr
 Under firm-wide stamping practice at a given firm, both signers on the same report should receive the same signature-level classification.
 Disagreement between the two signers on a report is informative about whether the stamping practice is firm-wide or partner-specific.

-For each report with exactly two signatures and complete per-signature data (93,979 reports), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.
-Table XVI reports per-firm intra-report agreement.
+For each report with exactly two signatures and complete per-signature data (84,354 reports in total: 83,970 whose two signers belong to the same firm, plus 384 mixed-firm reports whose signers belong to different firms), we classify each signature using the dual-descriptor rules of Section III-L and record whether the two classifications agree.
+Table XVI reports per-firm intra-report agreement (firm-assignment defined by the firm identity of both signers; mixed-firm reports are reported separately).

 <!-- TABLE XVI: Intra-Report Classification Agreement by Firm
 | Firm | Total 2-signer reports | Both non-hand-signed | Both uncertain | Both style | Both hand-signed | Mixed | Agreement rate |
 |------|-----------------------|----------------------|----------------|------------|------------------|-------|----------------|
-| Deloitte (Firm A) | 30,222 | 26,435 | 734 | 0 | 4 | 3,049 | **89.91%** |
-| KPMG              | 17,121 | 9,260  | 2,159| 5 | 6 | 5,691 | 66.76% |
-| PwC               | 19,112 | 8,983  | 3,035| 3 | 5 | 7,086 | 62.92% |
-| EY                | 8,375  | 3,028  | 2,376| 0 | 3 | 2,968 | 64.56% |
-| Other / Non-Big-4 | 9,140  | 1,671  | 3,945| 18| 27| 3,479 | 61.94% |
+| Firm A  | 30,222 | 26,435 | 734 | 0 | 4 | 3,049 | **89.91%** |
+| Firm B  | 17,121 | 9,260  | 2,159| 5 | 6 | 5,691 | 66.76% |
+| Firm C  | 19,112 | 8,983  | 3,035| 3 | 5 | 7,086 | 62.92% |
+| Firm D  | 8,375  | 3,028  | 2,376| 0 | 3 | 2,968 | 64.56% |
+| Non-Big-4 | 9,140  | 1,671  | 3,945| 18| 27| 3,479 | 61.94% |

 A report is "in agreement" if both signature labels fall in the same coarse bucket
 (non-hand-signed = high+moderate; uncertain; style consistency; or likely hand-signed).
 -->
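
The agreement rates in Table XVI can be checked from the bucket counts alone. This is a minimal sketch under the table's own definition of agreement (both labels in the same coarse bucket); the tuple layout and the `agreement_rate` helper are hypothetical names introduced for illustration.

```python
# Rows transcribed from Table XVI:
# (total, both non-hand-signed, both uncertain, both style, both hand-signed, mixed)
table_xvi = {
    "Firm A": (30222, 26435, 734, 0, 4, 3049),
    "Firm B": (17121, 9260, 2159, 5, 6, 5691),
    "Firm C": (19112, 8983, 3035, 3, 5, 7086),
    "Firm D": (8375, 3028, 2376, 0, 3, 2968),
    "Non-Big-4": (9140, 1671, 3945, 18, 27, 3479),
}


def agreement_rate(total, both_nhs, both_uncertain, both_style, both_hand, mixed):
    """Fraction of two-signer reports whose two labels share a coarse bucket."""
    agreeing = both_nhs + both_uncertain + both_style + both_hand
    if agreeing + mixed != total:
        raise ValueError("bucket counts do not sum to the row total")
    return agreeing / total


rates = {firm: agreement_rate(*row) for firm, row in table_xvi.items()}
same_firm_total = sum(row[0] for row in table_xvi.values())

for firm, rate in rates.items():
    gap = rates["Firm A"] - rate
    print(f"{firm}: agreement = {rate:.2%}, gap to Firm A = {100 * gap:.1f} pp")
print("same-firm reports:", same_firm_total)
```

The row totals sum to the 83,970 same-firm reports, and the per-firm gaps to Firm A are the cross-firm quantity that this subsection treats as the informative one.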

 Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.
-The other Big-4 firms and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
-This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) stamping practice.
+The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
+This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) non-hand-signing practice.

-Like the partner-level ranking, this test does not depend on any threshold we calibrate to Firm A; the firm-vs-firm comparison is invariant to the absolute cutoff so long as the cutoff is applied uniformly.
+We note that this test uses the calibrated classifier of Section III-L rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.

 ## I. Classification Results